Automatic Extraction of Semantic Networks from Text using Leximancer
Some Methods of Semantic Analysis (Part 1)

In this article, semantic analysis refers to applying various machine learning methods to mine and learn the deeper-level concepts behind text, images, and other data. Wikipedia's definition: "In machine learning, semantic analysis of a corpus is the task of building structures that approximate concepts from a large set of documents (or images)."

Over my years of work I have practiced on a range of projects: search advertising, social advertising, Weibo advertising, brand advertising, content advertising, and so on. To maximize the returns of our advertising platform, we first need to understand the user, the context (the page on which an ad will be shown), and the ad itself, so that the most suitable ad can be shown to each user. None of this is possible without semantic analysis of users, contexts, and ads, which has spawned a number of sub-projects, such as text semantic analysis, image semantic understanding, semantic indexing, short-string semantic matching, and user-ad semantic matching.

In what follows I will describe the semantic analysis methods as I understand them. Our work has been largely results-driven, and our grasp of the underlying theory may not be deep, so please treat this as a personal summary of knowledge points; corrections for any inaccuracies are welcome.

This article consists of four parts: basic text processing, text semantic analysis, image semantic analysis, and a concluding summary. We first describe the basic methods of text processing, which form the foundation of semantic analysis. We then cover semantic analysis methods for text and for images in two separate sections; note that although they are presented separately, the two share many common and related techniques. Finally, we briefly introduce the application of semantic analysis to user-ad matching in Guangdiantong (Tencent's ad platform), and look ahead to future semantic analysis methods.
1 Basic Text Processing

Before turning to text semantic analysis, we first cover basic text processing, since it forms the foundation of semantic analysis. Text processing spans many topics; given the theme of this article, we introduce only Chinese word segmentation and term weighting here.

1.1 Chinese Word Segmentation

Given a piece of text, the first step is usually word segmentation. Common segmentation approaches include the following:
- String-matching (dictionary-based) methods. These scan the text in some order and segment it by looking up candidate substrings in a dictionary, one match at a time (a minimal sketch of this approach follows below).
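The simplest dictionary-based scan is forward maximum matching. The following is a minimal sketch under assumptions: a toy dictionary and a fixed maximum word length. Production segmenters use large lexicons plus statistical disambiguation.

```python
# Forward maximum matching: a minimal sketch of dictionary-based
# segmentation. The toy dictionary and max word length are assumptions.

def fmm_segment(text, dictionary, max_len=4):
    """Scan left to right, greedily matching the longest dictionary word."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a match is found;
        # a single character is always accepted as a fallback.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words

toy_dict = {"语义", "分析", "语义分析", "机器", "学习", "机器学习"}
print(fmm_segment("语义分析机器学习", toy_dict))
# ['语义分析', '机器学习']
```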
Methods for Extracting Keywords in English Writing

Keywords play a crucial role in enhancing the readability, searchability, and relevance of written content. Whether you're crafting an academic paper, a blog post, or a marketing campaign, selecting the right keywords can significantly impact the effectiveness of your message. In this essay, we will explore various methods for extracting keywords in English writing.

1. Manual Extraction: Manual extraction involves identifying relevant terms and phrases by carefully reading and analyzing the text. Writers often rely on their own understanding of the subject matter and the context to select keywords that best represent the content. This method allows for a personalized approach and can be particularly effective for nuanced or specialized topics.

2. Frequency Analysis: Frequency analysis involves identifying keywords based on their frequency of occurrence within the text. Tools such as word frequency counters or software programs can quickly analyze a document and generate a list of the most commonly used terms. Writers can then review this list and select keywords that accurately reflect the main themes or concepts discussed.

3. TF-IDF (Term Frequency-Inverse Document Frequency): TF-IDF is a statistical method used to evaluate the importance of a word in a document relative to a collection of documents. It takes into account both the frequency of a term within a document (TF) and its rarity across the entire document collection (IDF). Keywords with high TF-IDF scores are considered more significant and can help capture the essence of the text. (A small TF-IDF scorer is sketched after this essay.)

4. Keyword Extraction Algorithms: Several algorithms have been developed specifically for keyword extraction, leveraging techniques such as natural language processing (NLP) and machine learning. These algorithms analyze various linguistic features, such as word frequency, semantic relevance, and syntactic patterns, to identify keywords automatically. Examples include TextRank, RAKE (Rapid Automatic Keyword Extraction), and YAKE (Yet Another Keyword Extractor).

5. Topic Modeling: Topic modeling techniques, such as Latent Dirichlet Allocation (LDA) or Non-negative Matrix Factorization (NMF), can also aid in keyword extraction. These methods identify underlying topics within a corpus of text and assign keywords to each topic based on their semantic relevance. Writers can then use these keywords to optimize their content for specific themes or subjects.

6. User Feedback and Engagement: In addition to automated methods, writers can also gather insights from user feedback and engagement metrics. Monitoring how readers interact with the content, such as which keywords attract the most clicks or engagement, can provide valuable feedback for refining keyword selection strategies.

7. Competitor Analysis: Analyzing the keywords used by competitors or similar content creators can offer insights into popular terms and topics within a particular niche. Tools like keyword research platforms or search engine analytics can help identify relevant keywords that align with audience interests and preferences.

In conclusion, effective keyword extraction is essential for optimizing written content for search engines, enhancing readability, and engaging target audiences. By employing a combination of manual analysis, statistical methods, and algorithmic techniques, writers can identify and incorporate relevant keywords that elevate the quality and impact of their writing.
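To make the TF-IDF method concrete, here is a minimal, library-free scorer over a toy in-memory corpus. The corpus and whitespace tokenization are illustrative assumptions; real pipelines would typically use something like scikit-learn's TfidfVectorizer.

```python
# A minimal TF-IDF keyword scorer over a small in-memory corpus.
import math
from collections import Counter

def tfidf_keywords(docs, doc_index, top_k=5):
    """Return the top_k terms of docs[doc_index] ranked by TF-IDF."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in tokenized for term in set(doc))
    tf = Counter(tokenized[doc_index])
    scores = {
        term: (count / len(tokenized[doc_index]))
              * math.log(n_docs / df[term])
        for term, count in tf.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

corpus = [
    "keywords improve search relevance and readability",
    "search engines index documents by keywords",
    "readability matters for blog posts",
]
print(tfidf_keywords(corpus, 0))
```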
The Proposition Bank: An Annotated Corpus of Semantic Roles

Martha Palmer (University of Pennsylvania), Daniel Gildea (University of Rochester), Paul Kingsbury (University of Pennsylvania). Contact: Department of Computer and Information Science, University of Pennsylvania, 3330 Walnut Street, Philadelphia, PA 19104. Email: mpalmer@

The Proposition Bank project takes a practical approach to semantic representation, adding a layer of predicate-argument information, or semantic role labels, to the syntactic structures of the Penn Treebank. The resulting resource can be thought of as shallow, in that it does not represent coreference, quantification, and many other higher-order phenomena, but also broad, in that it covers every instance of every verb in the corpus and allows representative statistics to be calculated. We discuss the criteria used to define the sets of semantic roles used in the annotation process, and analyze the frequency of syntactic/semantic alternations in the corpus. We describe an automatic system for semantic role tagging trained on the corpus, and discuss the effect on its performance of various types of information, including a comparison of full syntactic parsing with a flat representation, and the contribution of the empty "trace" categories of the Treebank.

1. Introduction

Robust syntactic parsers, made possible by new statistical techniques (Ratnaparkhi, 1997; Collins, 1999; Collins, 2000; Bangalore and Joshi, 1999; Charniak, 2000) and by the availability of large, hand-annotated training corpora (Marcus, Santorini, and Marcinkiewicz, 1993; Abeillé, 2003), have had a major impact on the field of natural language processing in recent years. However, the syntactic analyses produced by these parsers are a long way from representing the full meaning of the sentence. As a simple example, in the sentences:

(1) John broke the window.
(2) The window broke.

a syntactic analysis will represent the window as the verb's direct object in the first sentence and its subject in the second, but does not indicate that it plays the same underlying semantic role in both cases. Note that both sentences are in the active voice, and that this alternation between transitive and intransitive uses of the verb does not always occur; for example, in the sentences:

(3) The sergeant played taps.
(4) The sergeant played.

the subject has the same semantic role in both uses. The same verb can also undergo syntactic alternation, as in:

(5) Taps played quietly in the background.

and even in transitive uses, the role of the verb's direct object can differ:

(6) The sergeant played taps.
(7) The sergeant played a beat-up old bugle.

Alternation in the syntactic realization of semantic arguments is widespread, affecting most English verbs in some way, and the patterns exhibited by specific verbs vary widely (Levin, 1993). The syntactic annotation of the Penn Treebank makes it possible to identify the subjects and objects of verbs in sentences such as the above examples.
While the Treebank provides semantic function tags such as temporal and locative for certain constituents (generally syntactic adjuncts), it does not distinguish the different roles played by a verb's grammatical subject or object in the above examples. Because the same verb used with the same syntactic subcategorization can assign different semantic roles, roles cannot be deterministically added to the Treebank by an automatic conversion process with 100% accuracy. Our semantic role annotation process begins with a rule-based automatic tagger, the output of which is then hand-corrected (see Section 4 for details).

The Proposition Bank aims to provide a broad-coverage hand-annotated corpus of such phenomena, enabling the development of better domain-independent language understanding systems, and the quantitative study of how and why these syntactic alternations take place. We define a set of underlying semantic roles for each verb, and annotate each occurrence in the text of the original Penn Treebank. Each verb's roles are numbered, as in the following occurrences of the verb offer from our data (example sentences drawn from the Treebank corpus are identified by the file in which they occur; made-up examples usually feature "John"):

(8) ...[Arg0 the company] to ... offer [Arg1 a 15% to 20% stake] [Arg2 to the public]. (wsj0345)
(9) ...[Arg0 Sotheby's] ... offered [Arg2 the Dorrance heirs] [Arg1 a money-back guarantee] (wsj1928)
(10) ...[Arg1 an amendment] offered [Arg0 by Rep. Peter DeFazio]... (wsj0107)
(11) ...[Arg2 Subcontractors] will be offered [Arg1 a settlement]... (wsj0187)

We believe that providing this level of semantic representation is important for applications including information extraction, question answering, and machine translation. Over the past decade, most work in the field of information extraction has shifted from complex rule-based systems designed to handle a wide variety of semantic phenomena including quantification, anaphora, aspect, and modality (e.g. Alshawi (1992)), to more robust finite-state or statistical systems (Hobbs et al., 1997; Miller et al., 1998). These newer systems rely on a shallower level of semantic representation, similar to the level we adopt for the Proposition Bank, but have also tended to be very domain specific. The systems are trained and evaluated on corpora annotated for semantic relations pertaining to, for example, corporate acquisitions or terrorist events. The Proposition Bank (PropBank) takes a similar approach in that we annotate predicates' semantic roles, while steering clear of the issues involved in quantification and discourse-level structure. By annotating semantic roles for every verb in our corpus, we provide a more domain-independent resource, which we hope will lead to more robust and broad-coverage natural language understanding systems.

The Proposition Bank focuses on the argument structure of verbs, and provides a complete corpus annotated with semantic roles, including roles traditionally viewed as arguments and as adjuncts. The Proposition Bank allows us for the first time to determine the frequency of syntactic variations in practice, the problems they pose for natural language understanding, and the strategies to which they may be susceptible. We begin the paper by giving examples of the variation in the syntactic realization of semantic arguments and drawing connections to previous research into verb alternation behavior. In Section 3 we describe our approach to semantic role annotation, including the types of roles chosen and the guidelines for the annotators. Section 4 compares our PropBank methodology and choice of semantic role labels to those of another semantic annotation project, FrameNet. We conclude the paper with a discussion of several preliminary experiments we have performed using the PropBank annotations, and discuss the implications for natural language research.

2. Semantic Roles and Syntactic Alternation

Our work in examining verb alternation behavior is inspired by previous research into the linking between semantic roles and syntactic realization, in particular the comprehensive study of Levin (1993). Levin argues that the syntactic frames are a direct reflection of the underlying semantics; the sets of syntactic frames associated with a particular Levin class reflect underlying semantic components that constrain allowable arguments. On this principle, Levin defines verb classes based on the ability of the verb to occur or not occur in pairs of syntactic frames that are in some sense meaning-preserving (diathesis alternations). The classes also tend to share some semantic component. For example, the previous break examples are related by a transitive/intransitive alternation called the causative/inchoative alternation. Break, and verbs such as shatter and smash, are also characterized by their ability to appear in the middle construction, as in "Glass breaks/shatters/smashes easily." Cut, a similar change-of-state verb, seems to share in this syntactic behavior, and can also appear in the transitive (causative) as well as the middle construction: "John cut the bread," "This loaf cuts easily." However, it cannot also occur in the simple intransitive: "The window broke" / *"The bread cut." In contrast, cut verbs can occur in the conative: "John valiantly cut/hacked at the frozen loaf, but his knife was too dull to make a dent in it," whereas break verbs cannot: *"John broke at the window." The explanation given is that cut describes a series of actions directed at achieving the goal of separating some object into pieces. These actions consist of grasping an instrument with a sharp edge such as a knife, and applying it in a cutting fashion to the object. It is possible for these actions to be performed without the end result being achieved, but where the cutting manner can still be recognized, i.e., "John cut at the loaf." Where break is concerned, the only thing specified is the resulting change of state where the object becomes separated into pieces.

VerbNet (Kipper, Dang, and Palmer, 2000; Kipper, Palmer, and Rambow, 2002) extends Levin's classes by adding an abstract representation of the syntactic frames for each class with explicit correspondences between syntactic positions and the semantic roles they express, as in Agent REL Patient, or Patient REL into pieces for break. (These can be thought of as a notational variant of Tree Adjoining Grammar elementary trees or Tree Adjoining Grammar partial derivations (Kipper, Dang, and Palmer, 2000).) For other extensions of Levin, see also Dorr and Jones (2000) and Korhonen, Krymolowsky, and Marx (2003). The original Levin classes constitute the first few levels in the hierarchy, with each class subsequently refined to account for further semantic and syntactic differences within a class. The argument list consists of thematic labels from a set of 20 possible such labels (Agent, Patient, Theme, Experiencer, etc.). The syntactic frames represent a mapping of the list of thematic labels to deep-syntactic arguments. Additional semantic information for the verbs is expressed as a set (i.e., conjunction) of semantic predicates, such as motion, contact, transfer_info. Currently, all Levin verb classes have been assigned thematic labels and syntactic frames and over half the classes are completely described, including their semantic predicates. In many cases, the additional information that VerbNet provides for each class has caused it to subdivide, or use intersections of, Levin's original classes, adding an additional level to the hierarchy (Dang et al., 1998). We are also extending the coverage by adding new classes (Korhonen and Briscoe, 2004).

Our objective with the Proposition Bank is not a theoretical account of how and why syntactic alternation takes place, but rather to provide a useful level of representation and a corpus of annotated data to enable empirical study of these issues. We have referred to Levin's classes wherever possible to ensure that verbs in the same classes are given consistent role labels. However, there is only a 50% overlap between verbs in VerbNet and those in the Penn Treebank II, and PropBank itself does not define a set of classes, nor does it attempt to formalize the semantics of the roles it defines.

While lexical resources such as Levin's classes and VerbNet provide information about alternation patterns and their semantics, the frequency of these alternations and their effect on language understanding systems has never been carefully quantified. While learning syntactic subcategorization frames from corpora has been shown to be possible with reasonable accuracy (Manning, 1993; Brent, 1993; Briscoe and Carroll, 1997), this work does not address the semantic roles associated with the syntactic arguments. More recent work has attempted to group verbs into classes based on alternations, usually taking Levin's classes as a gold standard (McCarthy, 2000; Merlo and Stevenson, 2001; Schulte im Walde, 2000; Schulte im Walde and Brew, 2002). But without an annotated corpus of semantic roles, this line of research has not been able to measure the frequency of alternations directly, or, more generally, to ascertain how well the classes defined by Levin correspond to real world data.

We believe that a shallow labeled dependency structure provides a feasible level of annotation which, coupled with minimal co-reference links, could provide the foundation for a major advance in our ability to extract salient relationships from text. This will in turn improve the performance of basic parsing and generation components, as well as facilitate advances in text understanding, machine translation, and fact retrieval.

3. Annotation Scheme: Choosing the Set of Semantic Roles

Because of the difficulty of defining a universal set of semantic or thematic roles covering all types of predicates, PropBank defines semantic roles on a verb by verb basis.
An individual verb's semantic arguments are numbered, beginning with 0. For a particular verb, Arg0 is generally the argument exhibiting features of a prototypical Agent (Dowty, 1991), while Arg1 is a prototypical Patient or Theme. No consistent generalizations can be made across verbs for the higher-numbered arguments, though an effort was made to consistently define roles across members of VerbNet classes. In addition to verb-specific numbered roles, PropBank defines several more general roles that can apply to any verb. The remainder of this section describes in detail the criteria used in assigning both types of roles.

As examples of verb-specific numbered roles, we give entries for the verbs accept and kick below. These examples are taken from the guidelines presented to the annotators, and are also available on the web at /~cotton/cgi-bin/pblex fmt.cgi.

(12) Frameset accept.01 "take willingly"
Arg0: Acceptor
Arg1: Thing accepted
Arg2: Accepted-from
Arg3: Attribute
Ex: [Arg0 He] [ArgM-MOD would] [ArgM-NEG n't] accept [Arg1 anything of value] [Arg2 from those he was writing about]. (wsj0186)

(13) Frameset kick.01 "drive or impel with the foot"
Arg0: Kicker
Arg1: Thing kicked
Arg2: Instrument (defaults to foot)
Ex1: [ArgM-DIS But] [Arg0 two big New York banks_i] seem [Arg0 *trace*_i] to have kicked [Arg1 those chances] [ArgM-DIR away], [ArgM-TMP for the moment], [Arg2 with the embarrassing failure of Citicorp and Chase Manhattan Corp. to deliver $7.2 billion in bank financing for a leveraged buy-out of United Airlines parent UAL Corp]. (wsj1619)
Ex2: [Arg0 John_i] tried [Arg0 *trace*_i] to kick [Arg1 the football], but Mary pulled it away at the last moment.

A set of roles corresponding to a distinct usage of a verb is called a roleset, and can be associated with a set of syntactic frames indicating allowable syntactic variations in the expression of that set of roles. The roleset with its associated frames is called a Frameset. A polysemous verb may have more than one Frameset, when the differences in meaning are distinct enough to require a different set of roles, one for each Frameset. The tagging guidelines include a "descriptor" field for each role, such as "kicker" or "instrument", which is intended for use during annotation and as documentation, but which does not have any theoretical standing. In addition, each Frameset is complemented by a set of examples, which attempt to cover the range of syntactic alternations afforded by that usage. The collection of Frameset entries for a verb is referred to as the verb's Frame File.

The use of numbered arguments and their mnemonic names was instituted for a number of reasons. First and foremost, the numbered arguments plot a middle course among many different theoretical viewpoints. (By following the Treebank, however, we are following a very loose Government-Binding framework.) The numbered arguments can then be mapped easily and consistently onto any theory of argument structure, such as traditional Theta-Roles (Kipper, Palmer, and Rambow, 2002), Lexical-Conceptual Structure (Rambow et al., 2003), or Prague Tectogrammatics (Hajičová and Kučerová, 2002). While most rolesets have two to four numbered roles, as many as six can appear, in particular for certain verbs of motion. (We make no attempt to adhere to any linguistic distinction between arguments and adjuncts. While many linguists would consider any argument higher than Arg2 or Arg3 to be an adjunct, such arguments occur frequently enough with their respective verbs, or classes of verbs, that they are assigned a numbered argument in order to ensure consistent annotation.)

(14) Frameset edge.01 "move slightly"
Arg0: causer of motion
Arg1: thing in motion
Arg2: distance moved
Arg3: start point
Arg4: end point
Arg5: direction
Ex: [Arg1 Revenue] edged [Arg5 up] [Arg2-EXT 3.4%] [Arg4 to $904 million] [Arg3 from $874 million] [ArgM-TMP in last year's third quarter]. (wsj1210)

Because of the use of Arg0 for Agency, there arose a small set of verbs where an external force could cause the Agent to execute the action in question. For example, in the sentence "...Mr. Dinkins would march his staff out of board meetings and into his private office..." (wsj0765), the staff is unmistakably the marcher, the agentive role. Yet Mr. Dinkins also has some degree of Agency, since he is causing the staff to do the marching. To capture this, a special tag of ArgA is used for the agent of an induced action. This ArgA tag is only used for verbs of volitional motion such as march and walk, modern uses of volunteer (e.g., "Mary volunteered John to clean the garage," or more likely the passive of that, "John was volunteered to clean the garage") and, with some hesitation, graduate, based on usages such as "Penn only graduates 35% of its students." (This usage does not occur as such in the Penn Treebank corpus, although it is evoked in the sentence "No student should be permitted to be graduated from elementary school without having mastered the 3 R's at the level that prevailed 20 years ago." (wsj1286))

In addition to the semantic roles described in the rolesets, verbs can take any of a set of general, adjunct-like arguments (ArgMs), distinguished by one of the function tags shown in Table 1.

Table 1: Subtypes of the ArgM modifier tag
LOC: location                 CAU: cause
EXT: extent                   TMP: time
DIS: discourse connectives    PNC: purpose
ADV: general-purpose          MNR: manner
NEG: negation marker          DIR: direction
MOD: modal verb

Although they are not considered adjuncts, NEG for verb-level negation (e.g., "John did n't eat his peas") and MOD for modal verbs (e.g., "John would eat everything else") are also included in this list to allow every constituent surrounding the verb to be annotated. DIS is also not an adjunct, but was included to ease future discourse connective annotation.

3.1 Distinguishing Framesets

The criteria used to distinguish framesets are based on both semantics and syntax. Two verb meanings are distinguished as different framesets if they take different numbers of arguments. For example, the verb decline has two framesets:

(15) Frameset decline.01 "go down incrementally"
Arg1: entity going down
Arg2: amount gone down by, EXT
Arg3: start point
Arg4: end point
Ex: ...[Arg1 its net income] declining [Arg2-EXT 42%] [Arg4 to $121 million] [ArgM-TMP in the first 9 months of 1989]. (wsj0067)

(16) Frameset decline.02 "demure, reject"
Arg0: agent
Arg1: rejected thing
Ex: [Arg0 A spokesman_i] declined [Arg1 *trace*_i to elaborate]. (wsj0038)

However, alternations which preserve verb meanings, such as causative/inchoative or object deletion, are considered to be one frameset only, as shown in the example for open.01. Both the transitive and intransitive uses of the verb open correspond to the same frameset, with some of the arguments left unspecified.

(17) Frameset open.01 "cause to open"
Arg0: agent
Arg1: thing opened
Arg2: instrument
Ex1: [Arg0 John] opened [Arg1 the door]
Ex2: [Arg1 The door] opened
Ex3: [Arg0 John] opened [Arg1 the door] [Arg2 with his foot]

Moreover, differences in the syntactic type of the arguments do not constitute criteria for distinguishing between framesets; for example, see.01 allows for both an NP object and a clause object, as illustrated below.

(18) Frameset see.01 "view"
Arg0: viewer
Arg1: thing viewed
Ex1: [Arg0 John] saw [Arg1 the President]
Ex2: [Arg0 John] saw [Arg1 the President collapse]

Furthermore, verb-particle constructions are treated as separate from the corresponding simplex verb, whether the meanings are approximately the same or not. For example, three of the framesets for cut can be seen below:

(19) Frameset cut.01 "slice"
Arg0: cutter
Arg1: thing cut
Arg2: medium, source
Arg3: instrument
Ex: [Arg0 Longer production runs] [ArgM-MOD would] cut [Arg1 inefficiencies from adjusting machinery between production cycles]. (wsj0317)

(20) Frameset cut.04 "cut off = slice"
Arg0: cutter
Arg1: thing cut (off)
Arg2: medium, source
Arg3: instrument
Ex: [Arg0 The seed companies] cut off [Arg1 the tassels of each plant]. (wsj0209)

(21) Frameset cut.05 "cut back = reduce"
Arg0: cutter
Arg1: thing reduced
Arg2: amount reduced by
Arg3: start point
Arg4: end point
Ex: "Whoa," thought John, "[Arg0 I_i]'ve got [Arg0 *trace*_i] to start [Arg0 *trace*_i] cutting back [Arg1 my intake of chocolate]."

Note that the verb and particle do not need to be contiguous; the second sentence above could just as well be said "The seed companies cut the tassels of each plant off."

Currently, there are frames for over 3,300 verbs, with a total of just over 4,500 framesets described, implying an average polysemy of 1.36. Of these verb frames, only 21.5% (721/3342) have more than one frameset, while fewer than 100 verbs have 4 or more. Each instance of a polysemous verb is marked as to which frameset it belongs to, with inter-annotator agreement of 94%. The framesets can be viewed as extremely coarse-grained sense distinctions, with each frameset corresponding to one or more of the Senseval 2 WordNet 1.7 verb groupings. Each grouping in turn corresponds to several WordNet 1.7 senses (Palmer, Babko-Malaya, and Dang, 2004).

3.2 Secondary Predications

There are two other functional tags which, unlike those listed above, can also be associated with numbered arguments in the Frames Files. The first one, EXT, 'extent,' indicates that a constituent is a numerical argument on its verb, as in 'climbed 15%' or 'walked 3 miles'. The second, PRD for 'secondary predication', marks a more subtle relationship. If one thinks of the arguments of a verb as existing in a dependency tree, all arguments depend directly from the verb. Each argument is basically independent of the others. There are those verbs, however, which predict that there is a predicative relationship between their arguments. A canonical example of this is call in the sense of 'attach a label to,' as in "Mary called John an idiot." In this case there is a relationship between John and an idiot (at least in Mary's mind). The PRD tag is associated with the Arg2 label in the Frames File for this frameset, since it is predictable that the Arg2 predicates on the Arg1 John. This helps to disambiguate the crucial difference between the two readings of "Mary called John a doctor":

Predicative reading (LABEL):
Arg0: Mary
Rel: called
Arg1: John (item being labeled)
Arg2-PRD: a doctor (attribute)

Ditransitive reading (SUMMON):
Arg0: Mary
Rel: called
Arg2: John (benefactive)
Arg1: a doctor (thing summoned)

(The SUMMON sense could also be stated in the dative: "Mary called a doctor for John.")

It is also possible for ArgMs to predicate on another argument. Since this must be decided on a case-by-case basis, the PRD function tag is added to the ArgM by the annotator, as in Example 28 below.

3.3 Subsumed Arguments

Because verbs which share a VerbNet class are rarely synonyms, their shared argument structure occasionally takes on odd characteristics. Of primary interest among these are the cases where an argument predicted by one member of a class cannot be attested by another member of the same class. For a relatively simple example, consider the verb hit, in classes 18.1 and 18.4. This takes three very obvious arguments:

(22) Frameset hit "strike"
Arg0: hitter
Arg1: thing hit, target
Arg2: instrument of hitting
Ex1: Agentive subject: "[Arg0 He_i] digs in the sand instead of [Arg0 *trace*_i] hitting [Arg1 the ball], like a farmer," said Mr. Yoneyama. (wsj1303)
Ex2: Instrumental subject: Dealers said [Arg1 the shares] were hit [Arg2 by fears of a slowdown in the U.S. economy]. (wsj1015)
Ex3: All arguments: [Arg0 John] hit [Arg1 the tree] [Arg2 with a stick]. (The Wall Street Journal corpus contains no examples with both an agent and an instrument.)

Classes 18.1 and 18.4 are filled with verbs of hitting, such as beat, hammer, kick, knock, strike, tap, whack and so forth. For some of these the instrument of hitting is necessarily included in the semantics of the verb itself. For example, kick is essentially 'hit with the foot' and hammer is exactly 'hit with a hammer'. For these verbs, then, the Arg2 might not be available, depending on how strongly the instrument is incorporated into the verb. Kick, for example, shows 28 instances in the Treebank but only one instance of a (somewhat marginal) instrument:

(23) [ArgM-DIS But] [Arg0 two big New York banks] seem to have kicked [Arg1 those chances] [ArgM-DIR away], [ArgM-TMP for the moment], [Arg2 with the embarrassing failure of Citicorp and Chase Manhattan Corp. to deliver $7.2 billion in bank financing for a leveraged buy-out of United Airlines parent UAL Corp]. (wsj1619)

Hammer shows several examples of Arg2's, but these are all metaphorical hammers:

(24) Despite the relatively strong economy, [Arg1 junk bond prices_i] did nothing except go down, [Arg1 *trace*_i] hammered [Arg2 by a seemingly endless trail of bad news]. (wsj2428)

Another, perhaps more interesting case is where two arguments can be merged into one in certain syntactic situations. Consider the case of meet, which canonically takes two arguments:

(25) Frameset meet "come together"
Arg0: one party
Arg1: the other party
Ex: [Arg0 Argentine negotiator Carlos Carballo] [ArgM-MOD will] meet [Arg1 with banks this week]. (wsj0021)

It is perfectly possible, of course, to mention both meeting parties in the same constituent:

(26) [Arg0 The economic and foreign ministers of 12 Asian and Pacific nations] [ArgM-MOD will] meet [ArgM-LOC in Australia] [ArgM-TMP next week] [ArgM-PRP to discuss global trade as well as regional matters such as transportation and telecommunications]. (wsj0043)

In these cases there is an assumed or default Arg1 along the lines of 'each other':

(27) [Arg0 The economic and foreign ministers of 12 Asian and Pacific nations] [ArgM-MOD will] meet [Arg1-REC (with) each other]...

Similarly, verbs of attachment (attach, tape, tie, etc.) can express the 'things being attached' as either one constituent or two:

(28) Frameset connect.01 "attach"
Arg0: agent, entity causing two objects to be attached
Arg1: patient
Arg2: attached-to
Arg3: instrument
Ex1: The subsidiary also increased reserves by $140 million, however, and set aside an additional $25 million for [Arg1 claims] connected [Arg2 with Hurricane Hugo]. (wsj1109)
Ex2: Machines using the 486 are expected to challenge higher-priced workstations and minicomputers in applications such as [Arg0 so-called servers_i], [Arg0 which_i] [Arg0 *trace*_i] connect [Arg1 groups of computers] [ArgM-PRD together], and in computer-aided design. (wsj0781)

3.4 Role Labels and Syntactic Trees

The Proposition Bank assigns semantic roles to nodes in the syntactic trees of the Penn Treebank. Annotators are presented with the roleset descriptions and the syntactic tree, and mark the appropriate nodes in the tree with role labels. The lexical heads of constituents are not explicitly marked either in the Treebank trees or in the semantic labeling layered on top of them. Annotators cannot change the syntactic parse, but they are not otherwise restricted in assigning the labels. In certain cases, more than one node may be assigned the same role. The annotation software does not require that the nodes being assigned labels be in any syntactic relation to the verb. We discuss the ways in which we handle the specifics of the Treebank syntactic annotation style in this section.

Prepositional Phrases. The treatment of prepositional phrases is complicated by several factors. On one hand, if a given argument is defined as a "Destination", then in a sentence such as "John poured the water into the bottle" the destination of the water is clearly the bottle, not "into the bottle". The fact that the water is going into the bottle is inherent in the description "destination"; the preposition merely adds the specific information that the water will end up inside the bottle. Thus arguments should properly be associated with the NP heads of prepositional phrases. On the other hand, however, ArgMs which are prepositional phrases are annotated at the PP level, not the NP level. For the sake of consistency, then, numbered arguments are also tagged at the PP level. This also facilitates the treatment of multi-word prepositions such as "out of", "according to" and "up to but not including". (Note that "out of" is exactly parallel to "into", but one is spelled with a space in the middle and the other isn't.)

(29) [Arg1 Its net income] declining [Arg2-EXT 42%] to [Arg4 $121 million] [ArgM-TMP in the first 9 months of 1989]. (wsj0067)

Traces and Control Verbs. The Penn Treebank contains empty categories known as traces, which are often co-indexed with other constituents in the tree. When a trace is assigned a role label by an annotator, the co-indexed constituent is automatically added to the annotation, as in:

(30) [Arg0 John_i] tried [Arg0 *trace*_i] to kick [Arg1 the football], but Mary pulled it away at the last moment.

Verbs such as cause, force, and persuade, known as object control verbs, pose a problem for the analysis and annotation of semantic structure. Consider a sentence such as "Commonwealth Edison said the ruling could force it to slash its 1989 earnings by $1.55 a share." (wsj0015) The Penn Treebank's analysis assigns a single sentential (S) constituent to the entire string "it to slash ... a share", making it a single syntactic argument to the verb force. In the PropBank annotation, we split the sentential complement into two semantic roles for the verb force, assigning roles to the noun phrase and verb phrase but not to the S node which subsumes them.
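To make the bracketed role notation used throughout these examples concrete, here is a small, hypothetical sketch, not the official PropBank file format or tooling, that represents a frameset as data and parses an annotated string into (role, filler) pairs.

```python
# A hypothetical sketch of PropBank-style propositions as data. It
# parses the bracketed role notation used in the examples above.
import re
from dataclasses import dataclass

@dataclass
class Frameset:
    lemma: str    # e.g. "open.01"
    gloss: str    # e.g. "cause to open"
    roles: dict   # numbered args -> descriptor

def parse_proposition(annotated):
    """Extract (role, filler) pairs from '[Arg0 John] opened [Arg1 the door]'."""
    return re.findall(r"\[(Arg[A0-5](?:-\w+)?|ArgM-\w+)\s+([^\]]+)\]", annotated)

open_01 = Frameset(
    lemma="open.01",
    gloss="cause to open",
    roles={"Arg0": "agent", "Arg1": "thing opened", "Arg2": "instrument"},
)

ex = "[Arg0 John] opened [Arg1 the door] [Arg2 with his foot]"
for role, filler in parse_proposition(ex):
    print(role, "=", filler, "->", open_01.roles.get(role, "modifier"))
```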
Notes on recent research into keyword extraction: organizing earlier material.

Requirements: First, find the papers that propose methods and summarize the classic approaches. Second, if we want to use one of them, which is more practical or easier to implement, and which is more interesting from a research standpoint?

First, a good and fairly comprehensive introduction to the classic features for keyword extraction is "Finding Advertising Keywords on Web Pages". For concept-based keyword extraction, which uses concepts and taxonomies to assist extraction, the classic papers are "Discovering Key Concepts in Verbose Queries" and "A study on automatically extracted keywords in text categorization". For query-log-based keyword extraction there are "Using the wisdom of the crowds for keyword generation" and "Keyword Extraction for Contextual Advertisement". For keyword expansion and keyword generation: "Keyword Generation for Search Engine Advertising using Semantic Similarity", "Using the wisdom of the crowds for keyword generation", and "n-Keyword based Automatic Query Generation".

Second, the commonly used features mentioned by previous researchers, from "Finding Advertising Keywords on Web Pages" (a small sketch computing a few of these appears below):
1. Linguistic features (part-of-speech tags)
2. Capitalization of the first letter
3. Whether the keyword appears in hypertext
4. Whether the keyword appears in metadata
5. Whether the keyword appears in the title
6. Whether the keyword appears in the URL
7. TF and DF
8. Position of the keyword
9. Length of the sentence containing the keyword, and document length
10. Length of the candidate phrase
11. Query logs

Features I thought of:
1. Information content of the surroundings: the average information content of nearby words, or even of the whole sentence.
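As a concrete illustration, the sketch below computes a few of the surface features from the list above for a single candidate phrase. The page fields and feature definitions are illustrative assumptions, not the exact features of the cited paper.

```python
# A minimal sketch of a few candidate-keyword features: capitalization,
# in-title, TF, phrase length, and relative position.
def keyword_features(candidate, page_title, body_tokens):
    """Compute simple surface features for one candidate phrase."""
    tokens = candidate.split()
    lowered = [t.lower() for t in body_tokens]
    first_pos = next(
        (i for i, t in enumerate(lowered) if t == tokens[0].lower()), -1
    )
    return {
        "capitalized": candidate[0].isupper(),
        "in_title": candidate.lower() in page_title.lower(),
        "tf": lowered.count(candidate.lower()) if len(tokens) == 1 else 0,
        "phrase_len": len(tokens),
        # Relative position of the first occurrence in the document.
        "rel_position": first_pos / max(len(body_tokens), 1),
    }

body = "Advertising keywords help match ads to web pages".split()
print(keyword_features("Advertising", "Finding Advertising Keywords", body))
```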
Abstract

With the rapid development of social networks, people share what they see, hear, and think on social platforms anytime and anywhere. Many researchers regard social networks as a kind of sensor network that reflects the real world. The analysis of social media data has a wide range of applications, such as detecting criminal activity and predicting public behavior. Since text accounts for a large share of social media data and carries rich information, semantic analysis of text is essential to social media analytics.

Previous work on text semantic analysis has mainly targeted text written in standard language, such as news articles and Wikipedia. Social media text, however, is limited in length and contains a large amount of non-standard usage: misspellings, slang, grammatical errors, and so on. As a result, directly applying traditional text semantic analysis techniques to social media text yields unsatisfactory results. Given that the amount and accuracy of semantic information in tweets is limited, this thesis builds on existing text semantic analysis techniques to study a tweet feature learning method that combines semantic and sentiment information, and applies the resulting tweet features to Twitter event detection. The main work of this thesis can be summarized in two parts:

1. Constructing word representations that combine semantics and sentiment. Semantic word vectors are the foundation of text semantic analysis. This thesis focuses on word2vec, a state-of-the-art neural network language model. To address the weakness of word2vec vectors in distinguishing near-synonyms from antonyms, we propose a method that builds word vectors from both the semantic and the sentiment information of a word's context, improving this discrimination ability. Specifically, we use distant supervision, treating the emoticons in tweets as weak sentiment labels, and extend the word2vec neural network model to encode both the semantics and the sentiment of the context into the word vectors. We call the resulting vectors senti-word2vec vectors; a sketch of the distant-supervision labeling step follows below.
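The following is a minimal sketch of that labeling step, assuming illustrative emoticon lists; the thesis's exact lexicon and the extended word2vec training objective are not reproduced here.

```python
# Distant supervision via emoticons: emoticons act as weak sentiment
# labels for tweets. The emoticon sets below are assumptions.
POSITIVE = {":)", ":-)", ":D", "(^_^)"}
NEGATIVE = {":(", ":-(", ";(", "T_T"}

def weak_label(tweet):
    """Return (1, text) for positive, (0, text) for negative, or
    (None, text) when unlabeled; emoticons are stripped so a model
    cannot trivially memorize them."""
    tokens = tweet.split()
    has_pos = any(t in POSITIVE for t in tokens)
    has_neg = any(t in NEGATIVE for t in tokens)
    if has_pos == has_neg:  # no signal, or conflicting signals
        return None, tweet
    cleaned = " ".join(t for t in tokens
                       if t not in POSITIVE and t not in NEGATIVE)
    return (1 if has_pos else 0), cleaned

print(weak_label("great game tonight :)"))   # (1, 'great game tonight')
print(weak_label("delayed again :("))        # (0, 'delayed again')
```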
2. Twitter event detection fusing semantic and sentiment information. Traditional Twitter event detection groups semantically similar tweets to represent an event. However, many semantic feature extraction methods have limited ability to distinguish near-synonyms from antonyms, so tweets in the same event cluster may express different sentiment toward the event. Under the constraint of sentiment information, this thesis proposes to further divide each Twitter event cluster into an event-supporting cluster, an event-opposing cluster, and an event-neutral cluster. Specifically, we use senti-word2vec vectors to generate tweet features that combine semantics and sentiment, analyze the effect of these features on judging tweet semantic similarity and on sentiment analysis, and finally apply them to sentiment-refined event detection; a sketch of this sub-clustering step follows below.
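Here is a minimal sketch of the sentiment-refined partition described above. The scorer is a stand-in; the thesis would score tweets with features built from senti-word2vec vectors, and the thresholds are assumptions.

```python
# Partition one event cluster into support / oppose / neutral
# sub-clusters by a sentiment score in [-1, 1].
def split_event_cluster(event_tweets, sentiment_score):
    clusters = {"support": [], "oppose": [], "neutral": []}
    for tweet in event_tweets:
        s = sentiment_score(tweet)
        if s > 0.3:
            clusters["support"].append(tweet)
        elif s < -0.3:
            clusters["oppose"].append(tweet)
        else:
            clusters["neutral"].append(tweet)
    return clusters

# Stand-in scorer: a real system would use a classifier over
# senti-word2vec tweet features.
toy = lambda t: 1.0 if "love" in t else (-1.0 if "hate" in t else 0.0)
event = ["love the new policy", "hate the new policy", "policy announced today"]
print(split_event_cluster(event, toy))
```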
Automatic Extraction of Semantic Networks from Text using Leximancer

Andrew E. Smith. Key Centre for Human Factors and Applied Cognitive Psychology, The University of Queensland, Queensland, Australia, 4072. asmith@.au

Abstract

Leximancer is a software system for performing conceptual analysis of text data in a largely language-independent manner. The system is modelled on Content Analysis and provides unsupervised and supervised analysis using seeded concept classifiers. Unsupervised ontology discovery is a key component.

1 Method

The strategy used for conceptual mapping of text involves abstracting families of words to thesaurus concepts. These concepts are then used to classify text at a resolution of several sentences. The resulting concept tags are indexed to provide a document exploration environment for the user. A smaller number of simple concepts can index many more complex relationships by recording co-occurrences, and complex systems approaches can be applied to these systems of agents.

To achieve this, several novel algorithms were developed: a learning optimiser for automatically selecting, learning, and adapting a concept from the word usage within the text, and an asymmetric scaling process for generating a cluster map of concepts based on co-occurrence in the text.

Extensive evaluation has been performed on real document collections in collaboration with domain experts. The method adopted has been to perform parallel analyses with these experts and compare the results.

An outline of the algorithms (Smith, 2000) follows:

1. Text preparation: Standard techniques are employed, including name and term preservation, tokenisation, and the application of a stop-list.
2. Unsupervised and supervised ontology discovery: Concepts can be seeded by a domain expert to suit user requirements, or they can be chosen automatically using a ranking algorithm for finding seed words which reflect the themes present in the data. This process looks for words near the centre of local maxima in the lexical co-occurrence network.
3. Filling the thesaurus: A machine learning algorithm is used to find the relevant thesaurus words from the text data. This iterative optimiser, derived from a word disambiguation technique (Yarowsky, 1995), finds the nearest local maximum in the lexical co-occurrence network from each concept seed. Early results show that this lexical network can be reduced to a scale-free and small-world network (following Steyvers and Tenenbaum (2003)).
4. Classification: Text is tagged with multiple concepts using the thesaurus, to a sentence resolution.
5. Mapping: The concepts and their relative co-occurrence frequencies now form a semantic network. This is scaled using an asymmetric scaling algorithm, and made into a lattice by ranking concepts by their connectedness, or degree.
6. User interface: A browser is used for exploring the classification system in depth. The semantic lattice browser enables semantic characterisation of the data and discovery of indirect association. Concept co-occurrence spectra and themed text segment browsing are also provided.

2 Analysis of the PNAS Data Set

The data set presented here consisted of text and metadata from Proceedings of the National Academy of Science, 1997 to 2002. These examples are extracted from the abstract data. Firstly, Leximancer was configured to map the document set in unsupervised mode. A screen image of this interactive map is shown in Figure 1, which shows the semantic lattice (left), with the co-occurrence links from the concept 'brain' highlighted (left and right).

Figure 1: Unsupervised map of PNAS abstracts.

Figure 2 shows the top of the thesaurus entry for the concept 'brain'. This concept was seeded with just the word 'brain', and then the learning system found a larger family of words and names which are strongly relevant to 'brain' in these abstracts. In the figure, terms in square brackets are identified proper names, and numerical values are the relevancy weights.

Figure 2: Thesaurus entry for 'brain' (excerpt).

It is also of interest to discover which concepts tend to be unique to each year of the PNAS proceedings, and so identify trends. This usually requires a different form of analysis, since concepts which characterise the whole data set may not be good for discriminating parts. By placing the data for each year in a folder, Leximancer can tag each text sentence with the relevant year, and place each year as a prior concept on the map. The resulting map contains the prior concepts plus other concepts which are relevant to at least one of the priors, and shows trending from early years to later years (Figure 3).

Figure 3: Temporal map of PNAS abstracts.

3 Conclusion

The Leximancer system has demonstrated several major strengths for text data analysis:

- Large amounts of text can be analysed rapidly in a quantitative manner. Text is quickly re-classified using different ontologies when needs change.
- The unsupervised analysis generates concepts which are well-defined: they have signifiers which communicate the meaning of each concept to the user.
- Machine learning removes much of the need to revise thesauri as the domain vocabulary evolves.

References

Andrew E. Smith. 2000. Machine mapping of document collections: the Leximancer system. In Proceedings of the Fifth Australasian Document Computing Symposium, Sunshine Coast, Australia, December. DSTC.

Mark Steyvers and Joshua B. Tenenbaum. 2003. The large-scale structure of semantic networks: Statistical analyses and a model of semantic growth. Submitted to Cognitive Science.

David Yarowsky. 1995. Unsupervised word-sense disambiguation rivaling supervised methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL-95), pages 189-196, Cambridge, MA.
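As a closing illustration, the toy sketch below grows a concept from a seed word by iteratively absorbing its strongest co-occurring words, loosely in the spirit of the thesaurus-filling step in Section 1. It is an assumed simplification, not Leximancer's actual optimiser.

```python
# Toy concept growth from a seed via sentence-level co-occurrence,
# with a small assumed stop-list (cf. the text-preparation step).
from collections import Counter
from itertools import combinations

STOP = {"the", "and", "a", "of", "to"}

def grow_concept(sentences, seed, rounds=2, per_round=2):
    """Grow a concept word set from a seed using co-occurrence counts."""
    cooc = Counter()
    for s in sentences:
        words = {w for w in s.lower().split() if w not in STOP}
        for a, b in combinations(words, 2):
            cooc[(a, b)] += 1
            cooc[(b, a)] += 1
    concept = {seed}
    for _ in range(rounds):
        # Score every outside word by its co-occurrence with the concept.
        scores = Counter()
        for (a, b), n in cooc.items():
            if a in concept and b not in concept:
                scores[b] += n
        concept.update(w for w, _ in scores.most_common(per_round))
    return concept

docs = [
    "the brain region controls memory",
    "memory and learning involve the brain",
    "stock prices fell sharply",
]
print(grow_concept(docs, "brain"))
```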