Automatic Extraction of Semantic Networks from Text using Leximancer
Some Methods of Semantic Analysis (Part 1)

In this article, semantic analysis refers to applying various machine learning methods to mine and learn the deeper-level concepts behind text, images, and other data. Wikipedia's definition: "In machine learning, semantic analysis of a corpus is the task of building structures that approximate concepts from a large set of documents (or images)."

Over my years of work I have practiced on a range of projects: search advertising, social advertising, Weibo advertising, brand advertising, content advertising, and so on. To maximize the returns of our advertising platform, we first need to understand the user, the context (the page on which an ad will be shown), and the ad itself, so that the most suitable ad can be shown to each user. None of this is possible without semantic analysis of users, contexts, and ads, which has spawned a number of sub-projects, such as text semantic analysis, image semantic understanding, semantic indexing, short-string semantic matching, and user-ad semantic matching.

In what follows I will describe the semantic analysis methods as I understand them. Our work has been largely results-driven, and our grasp of the underlying theory may not be deep, so please treat this as a personal summary of knowledge points; corrections for any inaccuracies are welcome.

This article consists of four parts: basic text processing, text semantic analysis, image semantic analysis, and a concluding summary. We first describe the basic methods of text processing, which form the foundation of semantic analysis. We then cover semantic analysis methods for text and for images in two separate sections; note that although they are presented separately, the two share many common and related techniques. Finally, we briefly introduce the application of semantic analysis to user-ad matching in Guangdiantong (Tencent's ad platform), and look ahead to future semantic analysis methods.
1 Basic Text Processing

Before turning to text semantic analysis, we first cover basic text processing, since it forms the foundation of semantic analysis. Text processing spans many topics; given the theme of this article, we introduce only Chinese word segmentation and term weighting here.

1.1 Chinese Word Segmentation

Given a piece of text, the first step is usually word segmentation. Common segmentation approaches include the following:
- String-matching (dictionary-based) methods. These scan the text in some order and segment it by looking up candidate substrings in a dictionary, one match at a time (a minimal sketch of this approach follows below).
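The simplest dictionary-based scan is forward maximum matching. The following is a minimal sketch under assumptions: a toy dictionary and a fixed maximum word length. Production segmenters use large lexicons plus statistical disambiguation.

```python
# Forward maximum matching: a minimal sketch of dictionary-based
# segmentation. The toy dictionary and max word length are assumptions.

def fmm_segment(text, dictionary, max_len=4):
    """Scan left to right, greedily matching the longest dictionary word."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a match is found;
        # a single character is always accepted as a fallback.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words

toy_dict = {"语义", "分析", "语义分析", "机器", "学习", "机器学习"}
print(fmm_segment("语义分析机器学习", toy_dict))
# ['语义分析', '机器学习']
```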
Methods for Extracting Keywords in English Writing

Keywords play a crucial role in enhancing the readability, searchability, and relevance of written content. Whether you're crafting an academic paper, a blog post, or a marketing campaign, selecting the right keywords can significantly impact the effectiveness of your message. In this essay, we will explore various methods for extracting keywords in English writing.

1. Manual Extraction: Manual extraction involves identifying relevant terms and phrases by carefully reading and analyzing the text. Writers often rely on their own understanding of the subject matter and the context to select keywords that best represent the content. This method allows for a personalized approach and can be particularly effective for nuanced or specialized topics.

2. Frequency Analysis: Frequency analysis involves identifying keywords based on their frequency of occurrence within the text. Tools such as word frequency counters or software programs can quickly analyze a document and generate a list of the most commonly used terms. Writers can then review this list and select keywords that accurately reflect the main themes or concepts discussed.

3. TF-IDF (Term Frequency-Inverse Document Frequency): TF-IDF is a statistical method used to evaluate the importance of a word in a document relative to a collection of documents. It takes into account both the frequency of a term within a document (TF) and its rarity across the entire document collection (IDF). Keywords with high TF-IDF scores are considered more significant and can help capture the essence of the text. (A small TF-IDF scorer is sketched after this essay.)

4. Keyword Extraction Algorithms: Several algorithms have been developed specifically for keyword extraction, leveraging techniques such as natural language processing (NLP) and machine learning. These algorithms analyze various linguistic features, such as word frequency, semantic relevance, and syntactic patterns, to identify keywords automatically. Examples include TextRank, RAKE (Rapid Automatic Keyword Extraction), and YAKE (Yet Another Keyword Extractor).

5. Topic Modeling: Topic modeling techniques, such as Latent Dirichlet Allocation (LDA) or Non-negative Matrix Factorization (NMF), can also aid in keyword extraction. These methods identify underlying topics within a corpus of text and assign keywords to each topic based on their semantic relevance. Writers can then use these keywords to optimize their content for specific themes or subjects.

6. User Feedback and Engagement: In addition to automated methods, writers can also gather insights from user feedback and engagement metrics. Monitoring how readers interact with the content, such as which keywords attract the most clicks or engagement, can provide valuable feedback for refining keyword selection strategies.

7. Competitor Analysis: Analyzing the keywords used by competitors or similar content creators can offer insights into popular terms and topics within a particular niche. Tools like keyword research platforms or search engine analytics can help identify relevant keywords that align with audience interests and preferences.

In conclusion, effective keyword extraction is essential for optimizing written content for search engines, enhancing readability, and engaging target audiences. By employing a combination of manual analysis, statistical methods, and algorithmic techniques, writers can identify and incorporate relevant keywords that elevate the quality and impact of their writing.
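To make the TF-IDF method concrete, here is a minimal, library-free scorer over a toy in-memory corpus. The corpus and whitespace tokenization are illustrative assumptions; real pipelines would typically use something like scikit-learn's TfidfVectorizer.

```python
# A minimal TF-IDF keyword scorer over a small in-memory corpus.
import math
from collections import Counter

def tfidf_keywords(docs, doc_index, top_k=5):
    """Return the top_k terms of docs[doc_index] ranked by TF-IDF."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in tokenized for term in set(doc))
    tf = Counter(tokenized[doc_index])
    scores = {
        term: (count / len(tokenized[doc_index]))
              * math.log(n_docs / df[term])
        for term, count in tf.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

corpus = [
    "keywords improve search relevance and readability",
    "search engines index documents by keywords",
    "readability matters for blog posts",
]
print(tfidf_keywords(corpus, 0))
```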
The Proposition Bank: An Annotated Corpus of Semantic Roles

Martha Palmer (University of Pennsylvania), Daniel Gildea (University of Rochester), Paul Kingsbury (University of Pennsylvania). Contact: Department of Computer and Information Science, University of Pennsylvania, 3330 Walnut Street, Philadelphia, PA 19104. Email: mpalmer@

The Proposition Bank project takes a practical approach to semantic representation, adding a layer of predicate-argument information, or semantic role labels, to the syntactic structures of the Penn Treebank. The resulting resource can be thought of as shallow, in that it does not represent coreference, quantification, and many other higher-order phenomena, but also broad, in that it covers every instance of every verb in the corpus and allows representative statistics to be calculated. We discuss the criteria used to define the sets of semantic roles used in the annotation process, and analyze the frequency of syntactic/semantic alternations in the corpus. We describe an automatic system for semantic role tagging trained on the corpus, and discuss the effect on its performance of various types of information, including a comparison of full syntactic parsing with a flat representation, and the contribution of the empty "trace" categories of the Treebank.

1. Introduction

Robust syntactic parsers, made possible by new statistical techniques (Ratnaparkhi, 1997; Collins, 1999; Collins, 2000; Bangalore and Joshi, 1999; Charniak, 2000) and by the availability of large, hand-annotated training corpora (Marcus, Santorini, and Marcinkiewicz, 1993; Abeillé, 2003), have had a major impact on the field of natural language processing in recent years. However, the syntactic analyses produced by these parsers are a long way from representing the full meaning of the sentence. As a simple example, in the sentences:

(1) John broke the window.
(2) The window broke.

a syntactic analysis will represent the window as the verb's direct object in the first sentence and its subject in the second, but does not indicate that it plays the same underlying semantic role in both cases. Note that both sentences are in the active voice, and that this alternation between transitive and intransitive uses of the verb does not always occur; for example, in the sentences:

(3) The sergeant played taps.
(4) The sergeant played.

the subject has the same semantic role in both uses. The same verb can also undergo syntactic alternation, as in:

(5) Taps played quietly in the background.

and even in transitive uses, the role of the verb's direct object can differ:

(6) The sergeant played taps.
(7) The sergeant played a beat-up old bugle.

Alternation in the syntactic realization of semantic arguments is widespread, affecting most English verbs in some way, and the patterns exhibited by specific verbs vary widely (Levin, 1993). The syntactic annotation of the Penn Treebank makes it possible to identify the subjects and objects of verbs in sentences such as the above examples.
While the Treebank provides semantic function tags such as temporal and locative for certain constituents (generally syntactic adjuncts), it does not distinguish the different roles played by a verb's grammatical subject or object in the above examples. Because the same verb used with the same syntactic subcategorization can assign different semantic roles, roles cannot be deterministically added to the Treebank by an automatic conversion process with 100% accuracy. Our semantic role annotation process begins with a rule-based automatic tagger, the output of which is then hand-corrected (see Section 4 for details).

The Proposition Bank aims to provide a broad-coverage hand-annotated corpus of such phenomena, enabling the development of better domain-independent language understanding systems, and the quantitative study of how and why these syntactic alternations take place. We define a set of underlying semantic roles for each verb, and annotate each occurrence in the text of the original Penn Treebank. Each verb's roles are numbered, as in the following occurrences of the verb offer from our data (example sentences drawn from the Treebank corpus are identified by the file in which they occur; made-up examples usually feature "John"):

(8) ...[Arg0 the company] to ... offer [Arg1 a 15% to 20% stake] [Arg2 to the public]. (wsj0345)
(9) ...[Arg0 Sotheby's] ... offered [Arg2 the Dorrance heirs] [Arg1 a money-back guarantee] (wsj1928)
(10) ...[Arg1 an amendment] offered [Arg0 by Rep. Peter DeFazio]... (wsj0107)
(11) ...[Arg2 Subcontractors] will be offered [Arg1 a settlement]... (wsj0187)

We believe that providing this level of semantic representation is important for applications including information extraction, question answering, and machine translation. Over the past decade, most work in the field of information extraction has shifted from complex rule-based systems designed to handle a wide variety of semantic phenomena including quantification, anaphora, aspect, and modality (e.g. Alshawi (1992)), to more robust finite-state or statistical systems (Hobbs et al., 1997; Miller et al., 1998). These newer systems rely on a shallower level of semantic representation, similar to the level we adopt for the Proposition Bank, but have also tended to be very domain specific. The systems are trained and evaluated on corpora annotated for semantic relations pertaining to, for example, corporate acquisitions or terrorist events. The Proposition Bank (PropBank) takes a similar approach in that we annotate predicates' semantic roles, while steering clear of the issues involved in quantification and discourse-level structure. By annotating semantic roles for every verb in our corpus, we provide a more domain-independent resource, which we hope will lead to more robust and broad-coverage natural language understanding systems.

The Proposition Bank focuses on the argument structure of verbs, and provides a complete corpus annotated with semantic roles, including roles traditionally viewed as arguments and as adjuncts. The Proposition Bank allows us for the first time to determine the frequency of syntactic variations in practice, the problems they pose for natural language understanding, and the strategies to which they may be susceptible. We begin the paper by giving examples of the variation in the syntactic realization of semantic arguments and drawing connections to previous research into verb alternation behavior. In Section 3 we describe our approach to semantic role annotation, including the types of roles chosen and the guidelines for the annotators. Section 4 compares our PropBank methodology and choice of semantic role labels to those of another semantic annotation project, FrameNet. We conclude the paper with a discussion of several preliminary experiments we have performed using the PropBank annotations, and discuss the implications for natural language research.

2. Semantic Roles and Syntactic Alternation

Our work in examining verb alternation behavior is inspired by previous research into the linking between semantic roles and syntactic realization, in particular the comprehensive study of Levin (1993). Levin argues that the syntactic frames are a direct reflection of the underlying semantics; the sets of syntactic frames associated with a particular Levin class reflect underlying semantic components that constrain allowable arguments. On this principle, Levin defines verb classes based on the ability of the verb to occur or not occur in pairs of syntactic frames that are in some sense meaning-preserving (diathesis alternations). The classes also tend to share some semantic component. For example, the previous break examples are related by a transitive/intransitive alternation called the causative/inchoative alternation. Break, and verbs such as shatter and smash, are also characterized by their ability to appear in the middle construction, as in "Glass breaks/shatters/smashes easily." Cut, a similar change-of-state verb, seems to share in this syntactic behavior, and can also appear in the transitive (causative) as well as the middle construction: "John cut the bread," "This loaf cuts easily." However, it cannot also occur in the simple intransitive: "The window broke" / *"The bread cut." In contrast, cut verbs can occur in the conative: "John valiantly cut/hacked at the frozen loaf, but his knife was too dull to make a dent in it," whereas break verbs cannot: *"John broke at the window." The explanation given is that cut describes a series of actions directed at achieving the goal of separating some object into pieces. These actions consist of grasping an instrument with a sharp edge such as a knife, and applying it in a cutting fashion to the object. It is possible for these actions to be performed without the end result being achieved, but where the cutting manner can still be recognized, i.e., "John cut at the loaf." Where break is concerned, the only thing specified is the resulting change of state where the object becomes separated into pieces.

VerbNet (Kipper, Dang, and Palmer, 2000; Kipper, Palmer, and Rambow, 2002) extends Levin's classes by adding an abstract representation of the syntactic frames for each class with explicit correspondences between syntactic positions and the semantic roles they express, as in Agent REL Patient, or Patient REL into pieces for break. (These can be thought of as a notational variant of Tree Adjoining Grammar elementary trees or Tree Adjoining Grammar partial derivations (Kipper, Dang, and Palmer, 2000).) For other extensions of Levin, see also Dorr and Jones (2000) and Korhonen, Krymolowsky, and Marx (2003). The original Levin classes constitute the first few levels in the hierarchy, with each class subsequently refined to account for further semantic and syntactic differences within a class. The argument list consists of thematic labels from a set of 20 possible such labels (Agent, Patient, Theme, Experiencer, etc.). The syntactic frames represent a mapping of the list of thematic labels to deep-syntactic arguments. Additional semantic information for the verbs is expressed as a set (i.e., conjunction) of semantic predicates, such as motion, contact, transfer_info. Currently, all Levin verb classes have been assigned thematic labels and syntactic frames and over half the classes are completely described, including their semantic predicates. In many cases, the additional information that VerbNet provides for each class has caused it to subdivide, or use intersections of, Levin's original classes, adding an additional level to the hierarchy (Dang et al., 1998). We are also extending the coverage by adding new classes (Korhonen and Briscoe, 2004).

Our objective with the Proposition Bank is not a theoretical account of how and why syntactic alternation takes place, but rather to provide a useful level of representation and a corpus of annotated data to enable empirical study of these issues. We have referred to Levin's classes wherever possible to ensure that verbs in the same classes are given consistent role labels. However, there is only a 50% overlap between verbs in VerbNet and those in the Penn Treebank II, and PropBank itself does not define a set of classes, nor does it attempt to formalize the semantics of the roles it defines.

While lexical resources such as Levin's classes and VerbNet provide information about alternation patterns and their semantics, the frequency of these alternations and their effect on language understanding systems has never been carefully quantified. While learning syntactic subcategorization frames from corpora has been shown to be possible with reasonable accuracy (Manning, 1993; Brent, 1993; Briscoe and Carroll, 1997), this work does not address the semantic roles associated with the syntactic arguments. More recent work has attempted to group verbs into classes based on alternations, usually taking Levin's classes as a gold standard (McCarthy, 2000; Merlo and Stevenson, 2001; Schulte im Walde, 2000; Schulte im Walde and Brew, 2002). But without an annotated corpus of semantic roles, this line of research has not been able to measure the frequency of alternations directly, or, more generally, to ascertain how well the classes defined by Levin correspond to real world data.

We believe that a shallow labeled dependency structure provides a feasible level of annotation which, coupled with minimal co-reference links, could provide the foundation for a major advance in our ability to extract salient relationships from text. This will in turn improve the performance of basic parsing and generation components, as well as facilitate advances in text understanding, machine translation, and fact retrieval.

3. Annotation Scheme: Choosing the Set of Semantic Roles

Because of the difficulty of defining a universal set of semantic or thematic roles covering all types of predicates, PropBank defines semantic roles on a verb by verb basis.
An individual verb's semantic arguments are numbered, beginning with 0. For a particular verb, Arg0 is generally the argument exhibiting features of a prototypical Agent (Dowty, 1991), while Arg1 is a prototypical Patient or Theme. No consistent generalizations can be made across verbs for the higher-numbered arguments, though an effort was made to consistently define roles across members of VerbNet classes. In addition to verb-specific numbered roles, PropBank defines several more general roles that can apply to any verb. The remainder of this section describes in detail the criteria used in assigning both types of roles.

As examples of verb-specific numbered roles, we give entries for the verbs accept and kick below. These examples are taken from the guidelines presented to the annotators, and are also available on the web at /~cotton/cgi-bin/pblex fmt.cgi.

(12) Frameset accept.01 "take willingly"
Arg0: Acceptor
Arg1: Thing accepted
Arg2: Accepted-from
Arg3: Attribute
Ex: [Arg0 He] [ArgM-MOD would] [ArgM-NEG n't] accept [Arg1 anything of value] [Arg2 from those he was writing about]. (wsj0186)

(13) Frameset kick.01 "drive or impel with the foot"
Arg0: Kicker
Arg1: Thing kicked
Arg2: Instrument (defaults to foot)
Ex1: [ArgM-DIS But] [Arg0 two big New York banks_i] seem [Arg0 *trace*_i] to have kicked [Arg1 those chances] [ArgM-DIR away], [ArgM-TMP for the moment], [Arg2 with the embarrassing failure of Citicorp and Chase Manhattan Corp. to deliver $7.2 billion in bank financing for a leveraged buy-out of United Airlines parent UAL Corp]. (wsj1619)
Ex2: [Arg0 John_i] tried [Arg0 *trace*_i] to kick [Arg1 the football], but Mary pulled it away at the last moment.

A set of roles corresponding to a distinct usage of a verb is called a roleset, and can be associated with a set of syntactic frames indicating allowable syntactic variations in the expression of that set of roles. The roleset with its associated frames is called a Frameset. A polysemous verb may have more than one Frameset, when the differences in meaning are distinct enough to require a different set of roles, one for each Frameset. The tagging guidelines include a "descriptor" field for each role, such as "kicker" or "instrument", which is intended for use during annotation and as documentation, but which does not have any theoretical standing. In addition, each Frameset is complemented by a set of examples, which attempt to cover the range of syntactic alternations afforded by that usage. The collection of Frameset entries for a verb is referred to as the verb's Frame File.

The use of numbered arguments and their mnemonic names was instituted for a number of reasons. First and foremost, the numbered arguments plot a middle course among many different theoretical viewpoints. (By following the Treebank, however, we are following a very loose Government-Binding framework.) The numbered arguments can then be mapped easily and consistently onto any theory of argument structure, such as traditional Theta-Roles (Kipper, Palmer, and Rambow, 2002), Lexical-Conceptual Structure (Rambow et al., 2003), or Prague Tectogrammatics (Hajičová and Kučerová, 2002). While most rolesets have two to four numbered roles, as many as six can appear, in particular for certain verbs of motion. (We make no attempt to adhere to any linguistic distinction between arguments and adjuncts. While many linguists would consider any argument higher than Arg2 or Arg3 to be an adjunct, such arguments occur frequently enough with their respective verbs, or classes of verbs, that they are assigned a numbered argument in order to ensure consistent annotation.)

(14) Frameset edge.01 "move slightly"
Arg0: causer of motion
Arg1: thing in motion
Arg2: distance moved
Arg3: start point
Arg4: end point
Arg5: direction
Ex: [Arg1 Revenue] edged [Arg5 up] [Arg2-EXT 3.4%] [Arg4 to $904 million] [Arg3 from $874 million] [ArgM-TMP in last year's third quarter]. (wsj1210)

Because of the use of Arg0 for Agency, there arose a small set of verbs where an external force could cause the Agent to execute the action in question. For example, in the sentence "...Mr. Dinkins would march his staff out of board meetings and into his private office..." (wsj0765), the staff is unmistakably the marcher, the agentive role. Yet Mr. Dinkins also has some degree of Agency, since he is causing the staff to do the marching. To capture this, a special tag of ArgA is used for the agent of an induced action. This ArgA tag is only used for verbs of volitional motion such as march and walk, modern uses of volunteer (e.g., "Mary volunteered John to clean the garage," or more likely the passive of that, "John was volunteered to clean the garage") and, with some hesitation, graduate, based on usages such as "Penn only graduates 35% of its students." (This usage does not occur as such in the Penn Treebank corpus, although it is evoked in the sentence "No student should be permitted to be graduated from elementary school without having mastered the 3 R's at the level that prevailed 20 years ago." (wsj1286))

In addition to the semantic roles described in the rolesets, verbs can take any of a set of general, adjunct-like arguments (ArgMs), distinguished by one of the function tags shown in Table 1.

Table 1: Subtypes of the ArgM modifier tag
LOC: location                 CAU: cause
EXT: extent                   TMP: time
DIS: discourse connectives    PNC: purpose
ADV: general-purpose          MNR: manner
NEG: negation marker          DIR: direction
MOD: modal verb

Although they are not considered adjuncts, NEG for verb-level negation (e.g., "John did n't eat his peas") and MOD for modal verbs (e.g., "John would eat everything else") are also included in this list to allow every constituent surrounding the verb to be annotated. DIS is also not an adjunct, but was included to ease future discourse connective annotation.

3.1 Distinguishing Framesets

The criteria used to distinguish framesets are based on both semantics and syntax. Two verb meanings are distinguished as different framesets if they take different numbers of arguments. For example, the verb decline has two framesets:

(15) Frameset decline.01 "go down incrementally"
Arg1: entity going down
Arg2: amount gone down by, EXT
Arg3: start point
Arg4: end point
Ex: ...[Arg1 its net income] declining [Arg2-EXT 42%] [Arg4 to $121 million] [ArgM-TMP in the first 9 months of 1989]. (wsj0067)

(16) Frameset decline.02 "demure, reject"
Arg0: agent
Arg1: rejected thing
Ex: [Arg0 A spokesman_i] declined [Arg1 *trace*_i to elaborate]. (wsj0038)

However, alternations which preserve verb meanings, such as causative/inchoative or object deletion, are considered to be one frameset only, as shown in the example for open.01. Both the transitive and intransitive uses of the verb open correspond to the same frameset, with some of the arguments left unspecified.

(17) Frameset open.01 "cause to open"
Arg0: agent
Arg1: thing opened
Arg2: instrument
Ex1: [Arg0 John] opened [Arg1 the door]
Ex2: [Arg1 The door] opened
Ex3: [Arg0 John] opened [Arg1 the door] [Arg2 with his foot]

Moreover, differences in the syntactic type of the arguments do not constitute criteria for distinguishing between framesets; for example, see.01 allows for both an NP object and a clause object, as illustrated below.

(18) Frameset see.01 "view"
Arg0: viewer
Arg1: thing viewed
Ex1: [Arg0 John] saw [Arg1 the President]
Ex2: [Arg0 John] saw [Arg1 the President collapse]

Furthermore, verb-particle constructions are treated as separate from the corresponding simplex verb, whether the meanings are approximately the same or not. For example, three of the framesets for cut can be seen below:

(19) Frameset cut.01 "slice"
Arg0: cutter
Arg1: thing cut
Arg2: medium, source
Arg3: instrument
Ex: [Arg0 Longer production runs] [ArgM-MOD would] cut [Arg1 inefficiencies from adjusting machinery between production cycles]. (wsj0317)

(20) Frameset cut.04 "cut off = slice"
Arg0: cutter
Arg1: thing cut (off)
Arg2: medium, source
Arg3: instrument
Ex: [Arg0 The seed companies] cut off [Arg1 the tassels of each plant]. (wsj0209)

(21) Frameset cut.05 "cut back = reduce"
Arg0: cutter
Arg1: thing reduced
Arg2: amount reduced by
Arg3: start point
Arg4: end point
Ex: "Whoa," thought John, "[Arg0 I_i]'ve got [Arg0 *trace*_i] to start [Arg0 *trace*_i] cutting back [Arg1 my intake of chocolate]."

Note that the verb and particle do not need to be contiguous; the second sentence above could just as well be said "The seed companies cut the tassels of each plant off."

Currently, there are frames for over 3,300 verbs, with a total of just over 4,500 framesets described, implying an average polysemy of 1.36. Of these verb frames, only 21.5% (721/3342) have more than one frameset, while fewer than 100 verbs have 4 or more. Each instance of a polysemous verb is marked as to which frameset it belongs to, with inter-annotator agreement of 94%. The framesets can be viewed as extremely coarse-grained sense distinctions, with each frameset corresponding to one or more of the Senseval 2 WordNet 1.7 verb groupings. Each grouping in turn corresponds to several WordNet 1.7 senses (Palmer, Babko-Malaya, and Dang, 2004).

3.2 Secondary Predications

There are two other functional tags which, unlike those listed above, can also be associated with numbered arguments in the Frames Files. The first one, EXT, 'extent,' indicates that a constituent is a numerical argument on its verb, as in 'climbed 15%' or 'walked 3 miles'. The second, PRD for 'secondary predication', marks a more subtle relationship. If one thinks of the arguments of a verb as existing in a dependency tree, all arguments depend directly from the verb. Each argument is basically independent of the others. There are those verbs, however, which predict that there is a predicative relationship between their arguments. A canonical example of this is call in the sense of 'attach a label to,' as in "Mary called John an idiot." In this case there is a relationship between John and an idiot (at least in Mary's mind). The PRD tag is associated with the Arg2 label in the Frames File for this frameset, since it is predictable that the Arg2 predicates on the Arg1 John. This helps to disambiguate the crucial difference between the two readings of "Mary called John a doctor":

Predicative reading (LABEL):
Arg0: Mary
Rel: called
Arg1: John (item being labeled)
Arg2-PRD: a doctor (attribute)

Ditransitive reading (SUMMON):
Arg0: Mary
Rel: called
Arg2: John (benefactive)
Arg1: a doctor (thing summoned)

(The SUMMON sense could also be stated in the dative: "Mary called a doctor for John.")

It is also possible for ArgMs to predicate on another argument. Since this must be decided on a case-by-case basis, the PRD function tag is added to the ArgM by the annotator, as in Example 28 below.

3.3 Subsumed Arguments

Because verbs which share a VerbNet class are rarely synonyms, their shared argument structure occasionally takes on odd characteristics. Of primary interest among these are the cases where an argument predicted by one member of a class cannot be attested by another member of the same class. For a relatively simple example, consider the verb hit, in classes 18.1 and 18.4. This takes three very obvious arguments:

(22) Frameset hit "strike"
Arg0: hitter
Arg1: thing hit, target
Arg2: instrument of hitting
Ex1: Agentive subject: "[Arg0 He_i] digs in the sand instead of [Arg0 *trace*_i] hitting [Arg1 the ball], like a farmer," said Mr. Yoneyama. (wsj1303)
Ex2: Instrumental subject: Dealers said [Arg1 the shares] were hit [Arg2 by fears of a slowdown in the U.S. economy]. (wsj1015)
Ex3: All arguments: [Arg0 John] hit [Arg1 the tree] [Arg2 with a stick]. (The Wall Street Journal corpus contains no examples with both an agent and an instrument.)

Classes 18.1 and 18.4 are filled with verbs of hitting, such as beat, hammer, kick, knock, strike, tap, whack and so forth. For some of these the instrument of hitting is necessarily included in the semantics of the verb itself. For example, kick is essentially 'hit with the foot' and hammer is exactly 'hit with a hammer'. For these verbs, then, the Arg2 might not be available, depending on how strongly the instrument is incorporated into the verb. Kick, for example, shows 28 instances in the Treebank but only one instance of a (somewhat marginal) instrument:

(23) [ArgM-DIS But] [Arg0 two big New York banks] seem to have kicked [Arg1 those chances] [ArgM-DIR away], [ArgM-TMP for the moment], [Arg2 with the embarrassing failure of Citicorp and Chase Manhattan Corp. to deliver $7.2 billion in bank financing for a leveraged buy-out of United Airlines parent UAL Corp]. (wsj1619)

Hammer shows several examples of Arg2's, but these are all metaphorical hammers:

(24) Despite the relatively strong economy, [Arg1 junk bond prices_i] did nothing except go down, [Arg1 *trace*_i] hammered [Arg2 by a seemingly endless trail of bad news]. (wsj2428)

Another, perhaps more interesting case is where two arguments can be merged into one in certain syntactic situations. Consider the case of meet, which canonically takes two arguments:

(25) Frameset meet "come together"
Arg0: one party
Arg1: the other party
Ex: [Arg0 Argentine negotiator Carlos Carballo] [ArgM-MOD will] meet [Arg1 with banks this week]. (wsj0021)

It is perfectly possible, of course, to mention both meeting parties in the same constituent:

(26) [Arg0 The economic and foreign ministers of 12 Asian and Pacific nations] [ArgM-MOD will] meet [ArgM-LOC in Australia] [ArgM-TMP next week] [ArgM-PRP to discuss global trade as well as regional matters such as transportation and telecommunications]. (wsj0043)

In these cases there is an assumed or default Arg1 along the lines of 'each other':

(27) [Arg0 The economic and foreign ministers of 12 Asian and Pacific nations] [ArgM-MOD will] meet [Arg1-REC (with) each other]...

Similarly, verbs of attachment (attach, tape, tie, etc.) can express the 'things being attached' as either one constituent or two:

(28) Frameset connect.01 "attach"
Arg0: agent, entity causing two objects to be attached
Arg1: patient
Arg2: attached-to
Arg3: instrument
Ex1: The subsidiary also increased reserves by $140 million, however, and set aside an additional $25 million for [Arg1 claims] connected [Arg2 with Hurricane Hugo]. (wsj1109)
Ex2: Machines using the 486 are expected to challenge higher-priced workstations and minicomputers in applications such as [Arg0 so-called servers_i], [Arg0 which_i] [Arg0 *trace*_i] connect [Arg1 groups of computers] [ArgM-PRD together], and in computer-aided design. (wsj0781)

3.4 Role Labels and Syntactic Trees

The Proposition Bank assigns semantic roles to nodes in the syntactic trees of the Penn Treebank. Annotators are presented with the roleset descriptions and the syntactic tree, and mark the appropriate nodes in the tree with role labels. The lexical heads of constituents are not explicitly marked either in the Treebank trees or in the semantic labeling layered on top of them. Annotators cannot change the syntactic parse, but they are not otherwise restricted in assigning the labels. In certain cases, more than one node may be assigned the same role. The annotation software does not require that the nodes being assigned labels be in any syntactic relation to the verb. We discuss the ways in which we handle the specifics of the Treebank syntactic annotation style in this section.

Prepositional Phrases. The treatment of prepositional phrases is complicated by several factors. On one hand, if a given argument is defined as a "Destination", then in a sentence such as "John poured the water into the bottle" the destination of the water is clearly the bottle, not "into the bottle". The fact that the water is going into the bottle is inherent in the description "destination"; the preposition merely adds the specific information that the water will end up inside the bottle. Thus arguments should properly be associated with the NP heads of prepositional phrases. On the other hand, however, ArgMs which are prepositional phrases are annotated at the PP level, not the NP level. For the sake of consistency, then, numbered arguments are also tagged at the PP level. This also facilitates the treatment of multi-word prepositions such as "out of", "according to" and "up to but not including". (Note that "out of" is exactly parallel to "into", but one is spelled with a space in the middle and the other isn't.)

(29) [Arg1 Its net income] declining [Arg2-EXT 42%] to [Arg4 $121 million] [ArgM-TMP in the first 9 months of 1989]. (wsj0067)

Traces and Control Verbs. The Penn Treebank contains empty categories known as traces, which are often co-indexed with other constituents in the tree. When a trace is assigned a role label by an annotator, the co-indexed constituent is automatically added to the annotation, as in:

(30) [Arg0 John_i] tried [Arg0 *trace*_i] to kick [Arg1 the football], but Mary pulled it away at the last moment.

Verbs such as cause, force, and persuade, known as object control verbs, pose a problem for the analysis and annotation of semantic structure. Consider a sentence such as "Commonwealth Edison said the ruling could force it to slash its 1989 earnings by $1.55 a share." (wsj0015) The Penn Treebank's analysis assigns a single sentential (S) constituent to the entire string "it to slash ... a share", making it a single syntactic argument to the verb force. In the PropBank annotation, we split the sentential complement into two semantic roles for the verb force, assigning roles to the noun phrase and verb phrase but not to the S node which subsumes them.
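To make the bracketed role notation used throughout these examples concrete, here is a small, hypothetical sketch, not the official PropBank file format or tooling, that represents a frameset as data and parses an annotated string into (role, filler) pairs.

```python
# A hypothetical sketch of PropBank-style propositions as data. It
# parses the bracketed role notation used in the examples above.
import re
from dataclasses import dataclass

@dataclass
class Frameset:
    lemma: str    # e.g. "open.01"
    gloss: str    # e.g. "cause to open"
    roles: dict   # numbered args -> descriptor

def parse_proposition(annotated):
    """Extract (role, filler) pairs from '[Arg0 John] opened [Arg1 the door]'."""
    return re.findall(r"\[(Arg[A0-5](?:-\w+)?|ArgM-\w+)\s+([^\]]+)\]", annotated)

open_01 = Frameset(
    lemma="open.01",
    gloss="cause to open",
    roles={"Arg0": "agent", "Arg1": "thing opened", "Arg2": "instrument"},
)

ex = "[Arg0 John] opened [Arg1 the door] [Arg2 with his foot]"
for role, filler in parse_proposition(ex):
    print(role, "=", filler, "->", open_01.roles.get(role, "modifier"))
```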
Notes on recent research into keyword extraction: organizing earlier material.

Requirements: First, find the papers that propose methods and summarize the classic approaches. Second, if we want to use one of them, which is more practical or easier to implement, and which is more interesting from a research standpoint?

First, a good and fairly comprehensive introduction to the classic features for keyword extraction is "Finding Advertising Keywords on Web Pages". For concept-based keyword extraction, which uses concepts and taxonomies to assist extraction, the classic papers are "Discovering Key Concepts in Verbose Queries" and "A study on automatically extracted keywords in text categorization". For query-log-based keyword extraction there are "Using the wisdom of the crowds for keyword generation" and "Keyword Extraction for Contextual Advertisement". For keyword expansion and keyword generation: "Keyword Generation for Search Engine Advertising using Semantic Similarity", "Using the wisdom of the crowds for keyword generation", and "n-Keyword based Automatic Query Generation".

Second, the commonly used features mentioned by previous researchers, from "Finding Advertising Keywords on Web Pages" (a small sketch computing a few of these appears below):
1. Linguistic features (part-of-speech tags)
2. Capitalization of the first letter
3. Whether the keyword appears in hypertext
4. Whether the keyword appears in metadata
5. Whether the keyword appears in the title
6. Whether the keyword appears in the URL
7. TF and DF
8. Position of the keyword
9. Length of the sentence containing the keyword, and document length
10. Length of the candidate phrase
11. Query logs

Features I thought of:
1. Information content of the surroundings: the average information content of nearby words, or even of the whole sentence.
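As a concrete illustration, the sketch below computes a few of the surface features from the list above for a single candidate phrase. The page fields and feature definitions are illustrative assumptions, not the exact features of the cited paper.

```python
# A minimal sketch of a few candidate-keyword features: capitalization,
# in-title, TF, phrase length, and relative position.
def keyword_features(candidate, page_title, body_tokens):
    """Compute simple surface features for one candidate phrase."""
    tokens = candidate.split()
    lowered = [t.lower() for t in body_tokens]
    first_pos = next(
        (i for i, t in enumerate(lowered) if t == tokens[0].lower()), -1
    )
    return {
        "capitalized": candidate[0].isupper(),
        "in_title": candidate.lower() in page_title.lower(),
        "tf": lowered.count(candidate.lower()) if len(tokens) == 1 else 0,
        "phrase_len": len(tokens),
        # Relative position of the first occurrence in the document.
        "rel_position": first_pos / max(len(body_tokens), 1),
    }

body = "Advertising keywords help match ads to web pages".split()
print(keyword_features("Advertising", "Finding Advertising Keywords", body))
```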
Abstract

With the rapid development of social networks, people share what they see, hear, and think on social platforms anytime and anywhere. Many researchers regard social networks as a kind of sensor network that reflects the real world. The analysis of social media data has a wide range of applications, such as detecting criminal activity and predicting public behavior. Since text accounts for a large share of social media data and carries rich information, semantic analysis of text is essential to social media analytics.

Previous work on text semantic analysis has mainly targeted text written in standard language, such as news articles and Wikipedia. Social media text, however, is limited in length and contains a large amount of non-standard usage: misspellings, slang, grammatical errors, and so on. As a result, directly applying traditional text semantic analysis techniques to social media text yields unsatisfactory results. Given that the amount and accuracy of semantic information in tweets is limited, this thesis builds on existing text semantic analysis techniques to study a tweet feature learning method that combines semantic and sentiment information, and applies the resulting tweet features to Twitter event detection. The main work of this thesis can be summarized in two parts:

1. Constructing word representations that combine semantics and sentiment. Semantic word vectors are the foundation of text semantic analysis. This thesis focuses on word2vec, a state-of-the-art neural network language model. To address the weakness of word2vec vectors in distinguishing near-synonyms from antonyms, we propose a method that builds word vectors from both the semantic and the sentiment information of a word's context, improving this discrimination ability. Specifically, we use distant supervision, treating the emoticons in tweets as weak sentiment labels, and extend the word2vec neural network model to encode both the semantics and the sentiment of the context into the word vectors. We call the resulting vectors senti-word2vec vectors; a sketch of the distant-supervision labeling step follows below.
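The following is a minimal sketch of that labeling step, assuming illustrative emoticon lists; the thesis's exact lexicon and the extended word2vec training objective are not reproduced here.

```python
# Distant supervision via emoticons: emoticons act as weak sentiment
# labels for tweets. The emoticon sets below are assumptions.
POSITIVE = {":)", ":-)", ":D", "(^_^)"}
NEGATIVE = {":(", ":-(", ";(", "T_T"}

def weak_label(tweet):
    """Return (1, text) for positive, (0, text) for negative, or
    (None, text) when unlabeled; emoticons are stripped so a model
    cannot trivially memorize them."""
    tokens = tweet.split()
    has_pos = any(t in POSITIVE for t in tokens)
    has_neg = any(t in NEGATIVE for t in tokens)
    if has_pos == has_neg:  # no signal, or conflicting signals
        return None, tweet
    cleaned = " ".join(t for t in tokens
                       if t not in POSITIVE and t not in NEGATIVE)
    return (1 if has_pos else 0), cleaned

print(weak_label("great game tonight :)"))   # (1, 'great game tonight')
print(weak_label("delayed again :("))        # (0, 'delayed again')
```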
2. Twitter event detection fusing semantic and sentiment information. Traditional Twitter event detection groups semantically similar tweets to represent an event. However, many semantic feature extraction methods have limited ability to distinguish near-synonyms from antonyms, so tweets in the same event cluster may express different sentiment toward the event. Under the constraint of sentiment information, this thesis proposes to further divide each Twitter event cluster into an event-supporting cluster, an event-opposing cluster, and an event-neutral cluster. Specifically, we use senti-word2vec vectors to generate tweet features that combine semantics and sentiment, analyze the effect of these features on judging tweet semantic similarity and on sentiment analysis, and finally apply them to sentiment-refined event detection; a sketch of this sub-clustering step follows below.
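Here is a minimal sketch of the sentiment-refined partition described above. The scorer is a stand-in; the thesis would score tweets with features built from senti-word2vec vectors, and the thresholds are assumptions.

```python
# Partition one event cluster into support / oppose / neutral
# sub-clusters by a sentiment score in [-1, 1].
def split_event_cluster(event_tweets, sentiment_score):
    clusters = {"support": [], "oppose": [], "neutral": []}
    for tweet in event_tweets:
        s = sentiment_score(tweet)
        if s > 0.3:
            clusters["support"].append(tweet)
        elif s < -0.3:
            clusters["oppose"].append(tweet)
        else:
            clusters["neutral"].append(tweet)
    return clusters

# Stand-in scorer: a real system would use a classifier over
# senti-word2vec tweet features.
toy = lambda t: 1.0 if "love" in t else (-1.0 if "hate" in t else 0.0)
event = ["love the new policy", "hate the new policy", "policy announced today"]
print(split_event_cluster(event, toy))
```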
Automatic Extraction of Semantic Networks from Text using Leximancer

Andrew E. Smith. Key Centre for Human Factors and Applied Cognitive Psychology, The University of Queensland, Queensland, Australia, 4072. asmith@.au

Abstract

Leximancer is a software system for performing conceptual analysis of text data in a largely language-independent manner. The system is modelled on Content Analysis and provides unsupervised and supervised analysis using seeded concept classifiers. Unsupervised ontology discovery is a key component.

1 Method

The strategy used for conceptual mapping of text involves abstracting families of words to thesaurus concepts. These concepts are then used to classify text at a resolution of several sentences. The resulting concept tags are indexed to provide a document exploration environment for the user. A smaller number of simple concepts can index many more complex relationships by recording co-occurrences, and complex systems approaches can be applied to these systems of agents.

To achieve this, several novel algorithms were developed: a learning optimiser for automatically selecting, learning, and adapting a concept from the word usage within the text, and an asymmetric scaling process for generating a cluster map of concepts based on co-occurrence in the text.

Extensive evaluation has been performed on real document collections in collaboration with domain experts. The method adopted has been to perform parallel analyses with these experts and compare the results.

An outline of the algorithms (Smith, 2000) follows:

1. Text preparation: Standard techniques are employed, including name and term preservation, tokenisation, and the application of a stop-list.
2. Unsupervised and supervised ontology discovery: Concepts can be seeded by a domain expert to suit user requirements, or they can be chosen automatically using a ranking algorithm for finding seed words which reflect the themes present in the data. This process looks for words near the centre of local maxima in the lexical co-occurrence network.
3. Filling the thesaurus: A machine learning algorithm is used to find the relevant thesaurus words from the text data. This iterative optimiser, derived from a word disambiguation technique (Yarowsky, 1995), finds the nearest local maximum in the lexical co-occurrence network from each concept seed. Early results show that this lexical network can be reduced to a scale-free and small-world network (following Steyvers and Tenenbaum (2003)).
4. Classification: Text is tagged with multiple concepts using the thesaurus, to a sentence resolution.
5. Mapping: The concepts and their relative co-occurrence frequencies now form a semantic network. This is scaled using an asymmetric scaling algorithm, and made into a lattice by ranking concepts by their connectedness, or degree.
6. User interface: A browser is used for exploring the classification system in depth. The semantic lattice browser enables semantic characterisation of the data and discovery of indirect association. Concept co-occurrence spectra and themed text segment browsing are also provided.

2 Analysis of the PNAS Data Set

The data set presented here consisted of text and metadata from Proceedings of the National Academy of Science, 1997 to 2002. These examples are extracted from the abstract data. Firstly, Leximancer was configured to map the document set in unsupervised mode. A screen image of this interactive map is shown in Figure 1, which shows the semantic lattice (left), with the co-occurrence links from the concept 'brain' highlighted (left and right).

Figure 1: Unsupervised map of PNAS abstracts.

Figure 2 shows the top of the thesaurus entry for the concept 'brain'. This concept was seeded with just the word 'brain', and then the learning system found a larger family of words and names which are strongly relevant to 'brain' in these abstracts. In the figure, terms in square brackets are identified proper names, and numerical values are the relevancy weights.

Figure 2: Thesaurus entry for 'brain' (excerpt).

It is also of interest to discover which concepts tend to be unique to each year of the PNAS proceedings, and so identify trends. This usually requires a different form of analysis, since concepts which characterise the whole data set may not be good for discriminating parts. By placing the data for each year in a folder, Leximancer can tag each text sentence with the relevant year, and place each year as a prior concept on the map. The resulting map contains the prior concepts plus other concepts which are relevant to at least one of the priors, and shows trending from early years to later years (Figure 3).

Figure 3: Temporal map of PNAS abstracts.

3 Conclusion

The Leximancer system has demonstrated several major strengths for text data analysis:

- Large amounts of text can be analysed rapidly in a quantitative manner. Text is quickly re-classified using different ontologies when needs change.
- The unsupervised analysis generates concepts which are well-defined: they have signifiers which communicate the meaning of each concept to the user.
- Machine learning removes much of the need to revise thesauri as the domain vocabulary evolves.

References

Andrew E. Smith. 2000. Machine mapping of document collections: the Leximancer system. In Proceedings of the Fifth Australasian Document Computing Symposium, Sunshine Coast, Australia, December. DSTC.

Mark Steyvers and Joshua B. Tenenbaum. 2003. The large-scale structure of semantic networks: Statistical analyses and a model of semantic growth. Submitted to Cognitive Science.

David Yarowsky. 1995. Unsupervised word-sense disambiguation rivaling supervised methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL-95), pages 189-196, Cambridge, MA.
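As a closing illustration, the toy sketch below grows a concept from a seed word by iteratively absorbing its strongest co-occurring words, loosely in the spirit of the thesaurus-filling step in Section 1. It is an assumed simplification, not Leximancer's actual optimiser.

```python
# Toy concept growth from a seed via sentence-level co-occurrence,
# with a small assumed stop-list (cf. the text-preparation step).
from collections import Counter
from itertools import combinations

STOP = {"the", "and", "a", "of", "to"}

def grow_concept(sentences, seed, rounds=2, per_round=2):
    """Grow a concept word set from a seed using co-occurrence counts."""
    cooc = Counter()
    for s in sentences:
        words = {w for w in s.lower().split() if w not in STOP}
        for a, b in combinations(words, 2):
            cooc[(a, b)] += 1
            cooc[(b, a)] += 1
    concept = {seed}
    for _ in range(rounds):
        # Score every outside word by its co-occurrence with the concept.
        scores = Counter()
        for (a, b), n in cooc.items():
            if a in concept and b not in concept:
                scores[b] += n
        concept.update(w for w, _ in scores.most_common(per_round))
    return concept

docs = [
    "the brain region controls memory",
    "memory and learning involve the brain",
    "stock prices fell sharply",
]
print(grow_concept(docs, "brain"))
```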