Word sense disambiguation using conceptual density
A. Gelbukh (Ed.): CICLing 2007, LNCS 4394, pp. 267 – 274, 2007.© Springer-Verlag Berlin Heidelberg 2007Word Clustering for Collocation-BasedWord Sense Disambiguation*Peng Jin, Xu Sun, Yunfang Wu, and Shiwen YuDepartment of Computer Science and TechnologyInstitute of Computational Linguistics, Peking University, 100871, Beijing, China {jandp, sunxu, wuyf, yusw}@Abstract. The main disadvantage of collocation-based word sense disambigua-tion is that the recall is low, with relatively high precision. How to improve therecall without decrease the precision? In this paper, we investigate a word-classapproach to extend the collocation list which is constructed from the manuallysense-tagged corpus. But the word classes are obtained from a larger scale cor-pus which is not sense tagged. The experiment results have shown that the F-measure is improved to 71% compared to 54% of the baseline system where theword-class is not considered, although the precision decreases slightly. Furtherstudy discovers the relationship between the F-measure and the number ofword-class trained from the various sizes of corpus.1 IntroductionWord sense disambiguation (WSD) aims to identify the intended sense of a polyse-mous word given a context. A typical case is the Chinese word “讲” when occurring in “讲真话” (“tell the truth”) and “讲实效” (“pay attention to the actual effect”). Correctly sense-tagging the word in context can prove to be beneficial for many NLP applications such as Information Retrieval [6], [14], and Machine Translation [3], [7].Collocation is a combination of words that has certain tendency to be used to-gether [5] and it is used widely to attack the WSD task. Many researchers used the collocation as an important feature in the supervised learning algorithms: Naïve Bayes[7], [13], Support Vector Machines [8], and Maximum Entropy [2]. And the other researches [15], [16] directly used the collocation to form decision list to deal with the WSD problem.Word classes are often used to alleviate the data sparseness in NLP. Brown [1] performed automatic word clustering to improve the language model. Li [9] con-ducted syntactic disambiguation by using the acquired word-class. Och [12] provided an efficient method for determining bilingual word classes to improve statistical MT. This paper integrates the contribution of word-class to collocation-based WSD. When the word-based collocation which is obtained from sense tagged corpus fails, * Support by National Grant Fundamental Research 973 Program of China Under Grant No. 2004CB318102.268 P. Jin et al.class-based collocation is used to perform the WSD task. The results of experiment have shown that the average F-measure is improved to 70.81% compared to 54.02% of the baseline system where the word classes are not considered, although the preci-sion decreases slightly. Additionally, the relationship between the F-measure and the number of word-class trained from the various sizes of corpus is also investigated.The paper is structured as follows. Section 2 summarizes the related work. Section 3 describes how to extend the collocation list. Section 4 presents our experiments as well as the results. Section 5 analyzes the results of the experiments. Finally section 6 draws the conclusions and summarizes further work.2 Related WorkThe underlying idea is that one sense per collocation which has been verified by Yarowsky [15] on a coarse-grained WSD task. But the problem of data spars will be more serious on the fine-grained WSD task. 
We attempt to resolve the data sparseness with the help of word-class. Both of them are described as follows.2.1 The Yarowsky AlgorithmYarowsky [15] used the collocation to form a decision list to perform the WSD task. In his experiments, the content words (i.e., nouns, verbs, adjectives and adverbs) holding some relationships to the target word were treated as collocation words. The relationships include direct adjacency to left or right and first to the left or right in a sentence. He also considered certain syntactic relationships such as verb/object, sub-ject/verb. Since similar corpus is not available in Chinese, we just apply the four co-occurrence words described above as collocation words. Different types of evidences are sorted by the equation 1 to form the final decision list.)|Pr()|Pr(((21i i n Collocatio Sense n Collocatio Sense Log Abs (1)To deal with the same collocation indicates more than two senses, we adapt to the equation 1. For example, “上 (shang4)” has fifteen different senses as an verb. If the same collocation corresponds to different senses of 上, we use the frequency counts of the most commonly-used sense as the nominator in equation 1, and the frequency counts of the rest senses as the denominator. The different types of evidence are sorted by the value of equation 1. When a new instance is encountered, one steps through the decision list until the evidence at that point in the list matches the cur-rent context under consideration. The sense with the greatest listed probability is returned.The low recall is the main disadvantage of Yarowsky’s algorithm to the fine-grained sense disambiguation. Because of the data sparseness, the collocation word in the novel context has little chance to match exactly with the items in the decision list. To resolve this problem, the word clustering is introduced.Word Clustering for Collocation-Based Word Sense Disambiguation 2692.2 Word ClusteringIn this paper, we use an efficient method for word clustering which Och [12] intro-duced for machine translation. The task of a statistical language model is used to estimate the probability ()N w P 1 of the word sequence N N w w w ...11=. A simple ap-proximation of ()N w P 1 is to model it as a product of bi-gram probabilities:()()∏=−=N i i i N w w p w P 111|. Using the word class rather than the single word, we avoid the use of the most of the rarely seen bi-grams to estimate the probabilities. Rewriting the probability using word classes, we obtain the probability model as follow:()()()()()()i i N i i i Nw C w P w C w C P C w P ||:|111•=∏=− (2)Where the function C maps words to w their classes ()w C . In this model, we have two types of probabilities: the transition probability ()'|C C P for class C given its predecessor class 'C , and the membership probability ()C w P | for word w given classC . To determine the optimal word classes Cˆ for a given number of classes, we per-form a maximum-likelihood estimation:()C w P C N C |max arg ˆ1= (3)To the implementation, an efficient optimization algorithm is the exchange algo-rithm [13].It is necessary to set the number of word classes before the iteration.Two word classes are selected for illustration. First is “花生 (peanut), 大豆 (bean), 棉花 (cotton), 水稻 (rice), 早稻 (early rice), 芒果 (mango), 红枣 (jujube), 柑桔 (or-ange), 银杏 (ginkgo)”. To the target verb “吃” (which have five senses), these nouns can be its objects and indicate the same sense of “吃”. 
Another word class is “ 灌溉 (irrigate), 育秧 (raise rice seedlings), 施肥 (apply fertilizer), 播种 (sow), 移植 (trans-plant), 栽培 (cultivate), 备耕 (make preparations for plowing and sowing)”. Most of them indicate the sense “plant” of the target noun “小麦 (wheat)” which has two senses categories: “plant” and “seed”. For example, there is a collocation pair “灌溉小麦” in the collocation list which is obtained from the sense tagged corpus, an un-familiar collocation pair “备耕小麦” will be tagged with the intended sense of “小麦” because “灌溉” and “备耕” are clustered in the same word-class.3 Extending the Collocation ListThe algorithm of extending the collocation list which is constructed from the sense tagged corpus is quite straightforward. Given a new collocation pair exists in the novel context consists of the target word, the collocation word and the collocation type. If this specific collocation pair is found in the collocation list, we return the sense at the point in this decision list. While the match fails, we replace this collocation word with one of the words which are clustered in the same word-class to match again. The270 P. Jin et al.process is finished when any match success or all words in the word-class are tried. If all words in this word-class fail to match, we let this target word untagged.For example, “讲政治”(pay attention to the politics), “讲故事”(tell a story) are ordered in the collocation list. But to a new instance “讲笑话”(tell a joke), apparently we can not match the Chinese word “笑话” with any of the collocation word. Search-ing from the top of the collocation list, we check that “笑话” and “故事” are clustered in the same word-class. So the sense “tell” is returned and the process is ended.4 ExperimentWe have designed a set of experiments to compare the Yarowsky algorithm with and without the contribution of word classes. Yarowsky algorithm introduced in section 2.1 is used as our baseline. Both close test and open test are conducted.4.1 Data SetWe have selected 52 polysemous verbs randomly with the four senses on average. Senses of words are defined with the Contemporary Chinese Dictionary, the Gram-matical Knowledge-base of Contemporary Chinese and other hard-copy dictionaries. For each word sense, a lexical entry includes definition in Chinese, POS, Pinyin, semantic feature, subcategory framework, valence, semantic feature of subject, se-mantic feature of object, English equivalent and an example sentence.A corpus containing People’s Daily News (PDN) of the first three months of year 2000 (i.e., January, February and March) is used as our training/test set. The corpus is segmented (3,719,951 words) and POS tagged automatically before hand, and then is sense-tagged manually. To keep the consistency, a text is first tagged by one annota-tor and then checked by other two checkers. Five annotators are all native Chinese speakers. What’s more, a software tool is developed to gather all the occurrences of a target word in the corpus into a checking file with the sense KWIC (Key Word in Context) format in sense tags order. Although the agreement rate between human annotators on verb sense annotation is only 81.3%, the checking process with the help of this tool improves significantly the consistency.We also conduct an open test. The test corpus consists of the news of the first ten days of January 1998. The news corresponding to the first three months of 2000 are used as training set to construct the collocation list. 
The corpus which is used to word cluster amounts to seven months PDN.4.2 Experimental SetupFive-fold cross-validation method is used to evaluate these performances. We divide the sense-tagged three months corpus into five equal parts. In each process, the sense labels in one part are removed in order to be used as test corpus. And then, the collo-cation list is constructed from the other four parts of corpus. We first use this list to tag test corpus according to the Yarwosky algorithm and set its result as the baseline. After that the word-class is considered and the test corpus is tagged again according to the algorithm described in section 3.Word Clustering for Collocation-Based Word Sense Disambiguation 271 To draw the learning curve, we vary the number of word-class and the sizes of corpus which used to cluster the words. In open test, the collocation list is constructed from the news corresponding to the first three months of year 2000.4.3 Experiment ResultsTable 1 shows the results of close test. It is achieved by 5-fold Cross-Validation with 200 word-clusters trained from the seven months corpus. “Tagged tokens” is referred to the occurrences of the polysemous words which are disambiguated automatically. “All tokens” means the occurrences of the all polysemous words in one test corpus.We can see the performance of each process is stable. It demonstrates that the word class is very useful to alleviate the data sparse problem.Table 1. Results with 200 Word Classes Trained from 7 Month CorpusTagged TokensAllTokensPrecision Recall F-measureT1 2,346 4237 0.9301 0.5537 0.6942T2 2,969 4,676 0.9343 0.5766 0.7131T3 2,362 4,133 0.9306 0.5715 0.7081T4 2,773 4,721 0.9318 0.5874 0.7206T5 2,871 4,992 0.9154 0.5751 0.7046Ave. 2,664 4,552 0.9284 0.5729 0.7081 Table 2 shows the power of word-class. B1 and B2 denote individually the baseline in close and open test. S1 and S2 show the performance with the help of word-classes in these tests. Although the precision decreases slightly, the F-measures are improved significantly. Because in open test, the size of corpus used to training is bigger while the size of corpus used to test is less compared with the corpus in open test, the F-measure is even a bit higher than in close test.Table 2. Results of Close and Open TestTagged TokensAllTokensPrecision Recall F-measureB1 1,691 4,552 0.9793 0.3708 0.5401S1 2,664 4,552 0.9284 0.5729 0.7081B2 874 2,325 0.9908 0.3559 0.5450S2 1,380 2,325 0.9268 0.5935 0.7237 5 Discussion of ResultsFig 1 presents the relationship between the F-measure and the number of word-class trained from the various sizes of corpus. The reasons for errors are also explained.272 P. Jin et al.5.1 Relationship Between F-Measure with Word-Class and CorpusWhen we fix the size of the corpus which is used to cluster the word-class, we can see that the F-measure is verse proportional to the number of the word classes. However in our experiments, the precision is proportional to the number of the word classes (this can not be presented in this figure). The reason is straightforward that with the augment of the word classes, there are fewer words in every word-class. So the collo-cation which comes from test corpus has less chance of finding the word in the decision list belonging to the same word-class.Fig. 1. F-measure at different number of word-class trained from the various sizes of corpus When we fix the number of word classes, we can see that the F-measure increases with the size of the training corpus. 
This demonstrates that more data improve the system performance. But the increase rate is less and less. It shows there is a ceiling effect. That is to say, the effect on the performance will be less although more cor-puses are trained for clustering the words.5.2 Error AnalysisUnrelated words are clustered is the main cause of precision decreases. For example, there are two words “牛” (cattle) and “鞭炮” (cracker) are clustered in the same word-class. To the target word “放”, “放牛” means “graze cattle” and “放鞭炮” means “fire crackers”. To resolve this problem, we should pay much attention to improve the clustering results.Word Clustering for Collocation-Based Word Sense Disambiguation 273However, the reasonable word-classes also cause errors. Another example is “包饺子” (wrap dumpling) and “包午餐” (offer free lunch) . The word “饺子” (dumpling) and the word “午餐”(lunch) are clustered reasonable because both of them are nouns and related concepts. However, to the target polysemous word “包” , the sense is completely different: the former means “wrap” and the sense of the later is “offer free”. It also explains why the WSD system benefits little from the ontology such as HowNet [4].Although the collocation list obtained from the sense tagged corpus is extended by word classes, the F-measure is still not satisfied. There are still many unfamiliar collocations can not be matched because of the data sparseness.6 Conclusion and the Future WorkWe have demonstrated the word-class is very useful to improve the performance of the collocation-base method. The result shows that the F-measure is improved to 70.81% compared to 54.02% of the baseline system where the word clusters are not considered, although the precision decreases slightly. To open test, the performance is also improved from 54.50% to 72.37%.This method will be used to help us to accelerate the construction sense tagged corpus. Another utility of word class is used as a feature in the supervised machine learning algorithms in our future research.We can see that some words are highly sensitive to collocation while others are not. To the later, the performance is poor whether the word-class is used or not. We will further study which words and why they are sensitive to collocation from the perspectives of both linguistics and WSD.References1.Brown, P. F., Pietra, V. J., deSouza, P. V., Lai, J. C. and Mercer, R. L. Class-based N-gram Models of Natural Language. Computational Linguistics. 4 (1992) 467-4792.Chao, G., Dyer, G.M. Maximum Entropy Models for Word Sense Disambiguation. Pro-ceedings of the 19th International Conference on Computational Linguistics. Taipei, Taiwan (2002) 155–1613.Dagan, D., Itai, A. Word Sense Disambiguation Using a Second Language MonolingualCorpus. Computational Linguistics. 4 (1994) 563–5964.Dang, H. T., Chia, C., Palmer, M., Chiou, F. D., Rosenzweig J. Simple Features for Chi-nese Word Sense Disambiguation. Proceedings of the 19th International Conference on Computational Linguistics. Taipei, Taiwan (2002) 204–2115.Gelbukh, A., G. Sidorov, S.-Y. Han, E. Hernández-Rubio. Automatic Enrichment of aVery Large Dictionary of Word Combinations on the Basis of Dependency Formalism.Proceedings of Mexican International Conference on Artificial Intelligence. Lecture Notes in Artificial Intelligence, N 2972, Springer-Verlag, (2004) 430-4376.Kim, S.B., Seo, H.C., Rim, H.C. Information Retrieval Using Word Senses: Root SenseTagging Approach, SIGIR’04, Sheffield, South Yorkshire, UK (2004) 258–265274 P. Jin et al.7.Lee, H.A., Kim, G.C. 
Translation Selection through Source Word Sense Disambiguationand Target Word Selection. Proceedings of the 19th International. Conference on Compu-tational Linguistics, Taipei, Taiwan (2002)8.Lee, Y. K., Ng, H. T. and Chia, T. K. Supervised Word Sense Disambiguation with Sup-port Vector Machines and Multiple Knowledge Sources. Proceedings of SENSEVAL-3: Third International Workshop on the Evaluating Systems for the Semantic Analysis of Text, Barcelona, Spain. (2004)9.Li, H. Word Clustering and Disambiguation Based on Co-occurrence Data. Natural Lan-guage Engineering. 8 (2002) 25-4210.Li W.Y., Lu Q., Li W.J. Integrating Collocation Features in Chinese Word Sense Disam-biguation. Proceeding of the Fourth SIGHAN Workshop on Chinese Language Processing (2005) 87–9411.Martin, S., Liermann, J. and Ney, K. Algorithms for Bigram and Trigram Word Clustering.Speech Communication. 1 (1998) 19-3712.Och, F. J. An Efficient Method for Determining Bilingual Word Classes. Proceeding of theNinth Conference of the European Chapter of the Association for Computational Linguis-tics. (1999) 71-7613.Pedersen, T. A Simple Approach to Building Ensembles of Naive Bayesian Classifiers forWord Sense Disambiguation. Proceeding of the first Annul Meeting of the North Ameri-can Chapter for Computational Linguistics (2000) 63–6914.Stokoe, C., Oakes, M.P., Tait, J. Word Sense Disambiguation in Information Retrieval Re-visited. Proceeding of the 26th annul International ACM SIGIR conference On research and development in Information retrieval (2003)15.Yarowsky, D. One Sense Per Collocation, Proceeding of ARPA Human Language Tech-nology workshop.Princeton, New Jersey (1993)16.Yarowsky, D. Hierarchical Decision Lists for Word Sense Disambiguation, Computers andthe Humanities. 1 (2000) 179–186。
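The collocation-list extension described in Section 3 of the paper above reduces to an exact lookup with a word-class fallback. The sketch below is a minimal illustration of that idea; the collocation list, the word-to-class mapping, and the example entries are hypothetical placeholders, not data from the paper.

```python
# Minimal sketch of the class-extended collocation lookup (Section 3 above).
# All entries below are illustrative placeholders, not data from the paper.

# Collocation list learned from the sense-tagged corpus:
# (target word, collocation word, relation) -> sense label
collocation_list = {
    ("讲", "真话", "object"): "tell",
    ("讲", "故事", "object"): "tell",
    ("讲", "实效", "object"): "pay attention to",
}

# Word classes induced from a larger, untagged corpus (word -> class id).
word_class = {"真话": 3, "故事": 7, "笑话": 7, "实效": 12}

def disambiguate(target, colloc_word, relation):
    """Return a sense for `target`, or None if no (class-)collocation matches."""
    key = (target, colloc_word, relation)
    if key in collocation_list:                      # exact word-level match
        return collocation_list[key]
    cls = word_class.get(colloc_word)
    if cls is not None:
        # Fall back to any listed collocation word in the same word class.
        for (t, w, r), sense in collocation_list.items():
            if t == target and r == relation and word_class.get(w) == cls:
                return sense
    return None                                      # leave the target word untagged

# "笑话" is unseen, but shares a class with "故事", so the "tell" sense is returned.
print(disambiguate("讲", "笑话", "object"))
```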
How Semantic Analysis Works
Semantic analysis is an important research direction in natural language processing (NLP). Its main goal is to understand the semantic information carried by natural language and to process and analyze it further.
This article introduces how semantic analysis works and discusses its main methods and application areas.
1. Overview
Semantic analysis is one of the core tasks in NLP. Its main goal is to extract meaning from text and to understand the relationships between language and information.
Unlike traditional syntax-based analysis, semantic analysis focuses on recovering deeper levels of meaning from text.
It is widely applied, including in sentiment analysis, question answering, and machine translation.
2. Methods and Techniques
1. Word sense disambiguation
Word Sense Disambiguation (WSD) is a key step in semantic analysis.
In natural language a word may have several different meanings, and the task of WSD is to determine the correct meaning of a word in a particular context.
Common approaches include knowledge-based methods, statistical methods, and machine learning; a minimal knowledge-based sketch follows.
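As a concrete illustration of the knowledge-based approach, the sketch below implements a simplified Lesk algorithm: each candidate sense is scored by the overlap between its dictionary gloss and the context words. The tiny sense inventory is invented for illustration; a real system would read senses from WordNet or a similar resource.

```python
# Simplified Lesk-style word sense disambiguation (illustrative only).
# The tiny sense inventory below is made up for the example.
SENSES = {
    "bank": [
        ("bank%1", "a financial institution that accepts deposits and lends money"),
        ("bank%2", "sloping land beside a body of water such as a river"),
    ]
}

def lesk(word, context):
    """Pick the sense whose gloss shares the most words with the context."""
    context_words = set(context.lower().split())
    best_sense, best_overlap = None, -1
    for sense_id, gloss in SENSES.get(word, []):
        overlap = len(context_words & set(gloss.split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense_id, overlap
    return best_sense

print(lesk("bank", "He sat on the bank of the river and watched the water"))
# -> bank%2 (the "river bank" sense wins on gloss overlap)
```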
2. Syntactic parsing
Syntactic parsing is another task closely related to semantic analysis.
Its main goal is to determine the syntactic relations among the words of a sentence, providing more accurate input for semantic analysis.
Parsing methods include dependency parsing and phrase-structure (constituency) parsing.
3. Semantic role labeling
Semantic Role Labeling (SRL) is a key task that identifies and labels the semantic relations between a predicate and its arguments in a sentence.
Semantic role labeling lets us better understand the roles and relations of the different constituents of a sentence.
4. Named entity recognition
Named Entity Recognition (NER) identifies and extracts specific entities from text, such as person names, place names, and organization names.
Entity recognition matters for text understanding and information extraction, and it provides important input for semantic analysis.
5. Semantic relation extraction
Semantic relation extraction extracts the semantic relations holding between entities mentioned in text.
Relation extraction yields deeper semantic information and thus enables higher-level semantic analysis; a small pattern-based sketch follows.
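A minimal illustration of pattern-based relation extraction: the regular expression below captures a "born-in" relation between a person and a place. The pattern and sample sentence are invented for the example; production systems typically combine such patterns with NER and supervised classifiers.

```python
import re

# Toy pattern-based relation extraction (illustrative only).
# Captures a (PERSON, born_in, PLACE) triple from an English sentence.
BORN_IN = re.compile(
    r"([A-Z][a-z]+(?: [A-Z][a-z]+)*) was born in ([A-Z][a-z]+(?: [A-Z][a-z]+)*)"
)

def extract_relations(text):
    return [(person, "born_in", place) for person, place in BORN_IN.findall(text)]

print(extract_relations("Barack Obama was born in Hawaii."))
# -> [('Barack Obama', 'born_in', 'Hawaii')]
```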
3. Application Areas
1. Sentiment analysis
Sentiment analysis is a common semantic-analysis application used to identify and analyze the sentiment expressed in text, such as positive, negative, or neutral.
Evaluation Metrics for Word Sense Disambiguation in Natural Language Processing
Natural language processing (NLP) is an important research direction in artificial intelligence, and word sense disambiguation (WSD) is one of its key problems.
WSD means determining the correct meaning of a word in text, since the same word may carry different meanings in different contexts.
Evaluating WSD methods with appropriate metrics is therefore essential; this article discusses several common evaluation metrics.
1. Accuracy
Accuracy is one of the most commonly used metrics for evaluating WSD methods.
It is the proportion of correct decisions among all disambiguation decisions made.
The formula is: accuracy = number of correct decisions / total number of decisions. Accuracy is not the only metric of interest, however, because it cannot reflect how important or how difficult the individual senses are.
2. Precision and Recall
Precision and recall are two further metrics in common use, and they are usually reported together.
Precision is the proportion of instances labeled with a given sense that truly belong to that sense.
Recall is the proportion of instances truly belonging to a given sense that are correctly labeled with that sense.
precision = instances correctly labeled with the sense / all instances labeled with the sense
recall = instances correctly labeled with the sense / all instances that truly have the sense
Because they are computed per sense, precision and recall better reflect how important and how difficult the individual senses are.
3. F1 score
The F1 score combines precision and recall; it is their harmonic mean.
The formula is: F1 = 2 * (precision * recall) / (precision + recall). Because it takes both precision and recall into account, F1 gives a more complete assessment of a WSD method's performance; a small worked example follows.
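The sketch below computes accuracy, per-sense precision, recall, and F1 for a toy set of WSD predictions, following the formulas above; the sense labels and gold answers are invented for illustration.

```python
# Toy evaluation of WSD predictions using the metrics defined above.
gold = ["bank%1", "bank%2", "bank%1", "bank%1", "bank%2"]   # reference senses
pred = ["bank%1", "bank%1", "bank%1", "bank%2", "bank%2"]   # system output

accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)

def precision_recall_f1(sense):
    tp = sum(g == p == sense for g, p in zip(gold, pred))   # correctly assigned
    predicted = sum(p == sense for p in pred)               # assigned by the system
    actual = sum(g == sense for g in gold)                  # true instances of the sense
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(f"accuracy = {accuracy:.2f}")
for sense in sorted(set(gold)):
    p, r, f = precision_recall_f1(sense)
    print(f"{sense}: precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```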
4. Information Gain
Information gain is an information-theoretic measure of how important a feature is for a classification task.
In WSD, each word sense can be treated as a class and a context feature as evidence for a sense, and the information gain of the feature can then be computed.
The formula is: information gain = H(sense) - H(sense | feature), where H(sense) is the entropy of the sense distribution and H(sense | feature) is the conditional entropy of the senses given the feature.
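A minimal sketch of the information-gain computation above, using toy counts of how often each sense of a target word co-occurs with a context feature; the counts are invented for illustration.

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# Toy contingency counts: rows = feature present / absent, columns = senses.
with_feature    = [40, 10]   # counts of sense_1, sense_2 when the feature occurs
without_feature = [10, 40]   # counts of sense_1, sense_2 when it does not

sense_totals = [a + b for a, b in zip(with_feature, without_feature)]
n = sum(sense_totals)

h_sense = entropy(sense_totals)                                   # H(sense)
h_cond = (sum(with_feature) / n) * entropy(with_feature) \
       + (sum(without_feature) / n) * entropy(without_feature)    # H(sense | feature)

print(f"information gain = {h_sense - h_cond:.3f}")   # ~0.278 for these counts
```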
How to Use Stanford NLP: Overview and Explanation
1. Introduction
1.1 Overview
This overview introduces the subject of this article, Stanford NLP, and provides some background.
Stanford NLP is a natural language processing toolkit developed by the Natural Language Processing group at Stanford University.
It offers a rich set of functions and algorithms that help researchers and developers with tasks such as text analysis, language understanding, and information extraction.
Natural language processing is an important branch of artificial intelligence, concerned with understanding and generating human language.
With the arrival of the internet and the digital age, massive amounts of text have become a valuable resource for research and applications.
The complexity and diversity of human language, however, make text processing challenging.
Stanford NLP was created to address these challenges with advanced techniques and algorithms.
In this article we examine the main functions and uses of Stanford NLP.
We first give a brief introduction to Stanford NLP, including its goals and the background of its creation.
We then discuss its applications in various areas, including text classification, named entity recognition, and sentiment analysis.
Finally, we summarize the advantages of Stanford NLP and look at its future potential.
Readers are expected to have some familiarity with the basic concepts of natural language processing; some programming and machine learning background will also help.
The article presents the usage of Stanford NLP at a high level, with concrete examples and application scenarios to help readers understand and use the toolkit.
Let us now explore Stanford NLP, its uses and strengths, and its future in natural language processing.
1.2 Structure of the article
This article is organized into three parts: introduction, body, and conclusion.
The introduction (Section 1) outlines the topic and purpose of the article and briefly presents Stanford NLP and its importance in natural language processing.
It then gives the overall structure of the article.
The body (Section 2) describes the applications of Stanford NLP in detail.
Word Sense Disambiguation using ILPLucia Specia1, Ashwin Srinivasan2,3, Ganesh Ramakrishnan2, Maria das Graças V. Nunes 11 ICMC – University of São Paulo, Trabalhador São-Carlense, 400, São Carlos, 13560-970, Brazil{lspecia, gracan}@p.br2 IBM India Research Laboratory, Block 1, Indian Institute of Technology, New Delhi 110016, India3 Dept. of Computer Science and Engineering, University of New South Wales, Sydney, Australia{ashwin.srinivasan, ganramkr}@Abstract. We investigate the use of ILP for the task of Word Sense Disambiguation(WSD) in two different ways: (a) as a stand-alone constructor of models for WSD; and(b) to build interesting features, which can then used by standard model-builder such asSVM. Experiments examining a multilingual WSD task in the context of English-Portuguese machine translation of 7 highly ambiguous verbs showed promising results:our ILP-based standalone approach outperformed the results of propositional algorithmsand the ILP-generated features yielded improvements on the accuracy of standard pro-positional algorithms when compared to their use with low level features only.1 IntroductionWord Sense Disambiguation (WSD) aims to identify the correct sense of an ambiguous word in a sentence. Sometimes described as an “intermediate task”— that is, not an end in itself — it is necessary in most natural language tasks like machine translation, information retrieval, and so on. That is extremely difficult, possibly impractical, to solve completely is a long-standing view [1] and accuracies with state-of-the art methods are substantially lower than in other areas of text understanding (part-of-speech tagging accuracies, e.g., are now over 95%, while the best WSD results are still well below 80%). It is generally thought that WSD should benefit significantly by adopting a “deep approach” in which access to substantial body of world knowledge could assist in resolving ambiguities. This belief is based on the following observa-tion. While it is true that statistical methods like support vector machines using shallow fea-tures referring to the local context of the ambiguous word have usually yielded the best results to date, the accuracies obtained are low and significant improvements do not appear to be forthcoming. The incorporation of large amounts of domain knowledge has been hampered by the following: (a) access to such information in electronic form suitable for constructing mod-els; and (b) modeling techniques capable of utilizing diverse sources of domain knowledge, even when they are available. The first of these difficulties is now greatly alleviated by the availability in electronic form of very large semantic lexicons like WordNet, dictionaries, parsers, grammars and so on. In addition, there are now very large amounts of “shallow” data in the form of electronic text corpora from which statistical information can be readily ex-tracted. Using these diverse sources of information is, however, beyond the capabilities of existing general-purpose statistical methods that have been used for WSD resulting in the development of various ad hoc techniques for using specific sets of information for particulartasks. Arguably, ILP systems provide the most general-purpose framework for dealing with such data: there are explicit provisions made for the inclusion of background knowledge of any form; the representation language is powerful enough to capture the contextual relationships that arise; and modeling is not restricted to being of a particular form. 
In this paper, using a task in Machine Translation (MT) as a test bed and 7 different sources of background knowledge, we investigate the use of ILP for WSD in 2 ways: (a) the induction of disambiguation models; and (b) the construction of interesting features to be used by propositional algorithms such as SVM, which have presented reasonably good results with low level features in previous work.2 Empirical StudyAim. We investigate whether the use of an ILP system equipped with substantial background knowledge can significantly improve WSD accuracies for a task in English-Portuguese MT. Materials.Data.Data consist of 7 highly frequent and ambiguous verbs and a sample corpus of around 200 English sentences for each verb with the verb translation automatically annotated. The verbs are (numbers of possible translations in our corpus in brackets): come (11), get (17), give (5), go (11), look (7), make (11), and take (13).Background Knowledge.We exploit knowledge from 7 syntactic, semantic and pragmatic sources: (1) Bag-of-words of ±5 words surrounding the verb; (2) Part-of-speech tags of ±5 content words surrounding the verb; (3) Subject and object syntactic relations with respect to the verb; (4) 11 collocations with respect to the verb; (5) Selectional restrictions and semantic features; (6) Phrasal verbs possibly occurring in the sentence; (7) Overlapping words in dic-tionary definitions for the possible verb translations and the surrounding words in the sentence. Refer to [3] for details about representation of these knowledge sources.Algorithm. We used the ILP system Aleph [4] to implement the construction of disambigua-tion models and features for statistical model construction.Method. For reasons of space, we refer the reader to [4] for details of how models and features are constructed by Aleph. We follow standard methodology to construct and test models (i.e., cross-validation on training data to select the best models; and testing on unseen data). Results.Disambiguation Models. We evaluated the results according to the metrics usually employed for WSD, namely accuracy on the positive examples. We used Aleph constraint’s mechanism to create a rule to classify all the cases that have not been classified by other rules according to the majority class (the most frequent translation in the corpus). Table 1 shows the accuracy achieved by the test-bed previously mentioned, according to a 10-fold cross-validation strat-egy, together with the accuracy of two propositional algorithms that usually perform well on the WSD task, C4.5 and SVM, here having the relational features pre-processed in order to allow an attribute-value representation. ILP results are significantly better (t-test, p < 0.05). Feature Construction. The numbers of features constructed by the ILP engine for each verb are shown in Table 2. Together with 23 original low-level features, these new ILP features were then used to test SVM’s performance. The enhanced set of features improved SVM’s accuracy for 2 verbs (“come” and “go”), with other verbs unaffected. It is not evident at this stage whether this due to (a) the small number of examples; (b) inadequate relational features;(c) inadequate background knowledge; or (d) inadequate model construction by the statistical method. We are conducting experiments to shed further light on these questions.Table 1. 
Verbs and their possible translations in the sample corpusAccuracyVerbILP C4.5SVMcome 0.82 0.53 0.62get 0.51 0.36 0.26give 0.96 0.96 0.98go 0.88 0.76 0.74look 0.83 0.57 0.79make 0.81 0.74 0.74take 0.81 0.31 0.44Average 0.800.60.65Table 2. Accuracy achieved by SVM with and without ILP featuresVerb # ILPfeatures Accuracy: originalfeature setAccuracy: enhancedfeature setcome 483 0.58 0.71get 329 0.31 0.31give 19 0.98 0.98go 174 0.69 0.71look 677 0.79 0.79make 4122 0.76 0.76take 411 0.47 0.473 Concluding RemarksThe results reported here suggest that ILP could play a useful role in WSD. As stand-alone constructor of WSD models, ILP yielded particularly good results, significantly outperforming propositional algorithms on the same data. This is mainly due to the real hybrid nature of the approach and the rich set of knowledge sources employed. Regarding the use of ILP to con-struct features, our findings suggest that: addition of background does improve the accuracy of WSD models in some cases. Features can be constructed efficiently: usually, the time taken for feature-construction was comparable, or often less, than to that taken to build models. In both cases, the results strongly support the undertaking a substantially larger study with more data. References1.Bar-Hillel, Y. The Present Status of Automatic Translation of Languages. Advances in Com-puters, 1 (1960) 91-1632.Mooney, R.J. Inductive Logic Programming for Natural Language Processing. 6th Interna-tional Inductive Logic Programming Workshop (1997) 3-243.Specia, L. A Hybrid Relational Approach for WSD – First Results. Coling-ACL (2006) 55-604.Srinivasan, A. The Aleph Manual. See: /oucl/research/areas/machlearn/Aleph/。
A Summary of Common NLP Tasks
This section summarizes common tasks in NLP to give a global view of the field.
NLP task summary 1: Lexical analysis
• Word segmentation (Word Segmentation/Tokenization, ws): text is normally segmented into words before further processing; some commonly used toolkits are listed below, and a brief jieba example appears after this list.

Library            Open source / commercial   Languages            Segmentation   POS tagging   NER   Cost
HanLP              Open source                Java, C++, Python    Yes            Yes           Yes   Free
Jieba              Open source                Java, C++, Python    Yes            No            No    Free
FudanNLP           Open source                Java                 Yes            Yes           Yes   Free
LTP                Open source                Java, C++, Python    Yes            Yes           Yes   Free
THULAC             Open source                Java, C++, Python    Yes            Yes           No    Free
BosonNLP           Commercial                 REST API             Yes            Yes           Yes   Free calls
Baidu NLP          Commercial                 REST API             Yes            Yes           Yes   To be determined
Tencent Wenzhi     Commercial                 REST API             Yes            Yes           Yes   Per call / monthly
Alibaba Cloud NLP  Commercial                 REST API             Yes            Yes           Yes   Per call

• New word discovery (New Words Identification, nwi): new vocabulary keeps appearing on the web, such as the once-popular internet slang '神马', and such words need to be detected.
• Morphological analysis (Morphological Analysis, MA): analyzing the morphological make-up of words, including stems, roots, and affixes (prefixes and suffixes).
• Part-of-speech tagging (Part-of-speech Tagging, POS): determining the part of speech of every word in a text.
Parts of speech include verbs, nouns, pronouns, and so on.
In the openly available People's Daily corpus, every word in each sentence is POS-tagged according to a published annotation guideline.
The tags are best read alongside that guideline.
• Spelling correction (Spelling Correction, SP): as the name suggests, misspelled words need to be found and corrected.
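As a quick illustration of segmentation and POS tagging with one of the open-source libraries in the table above, the sketch below uses jieba. It assumes the package is installed (`pip install jieba`); the exact segmentation depends on the jieba version and dictionary.

```python
# Minimal segmentation and POS-tagging example with jieba
# (assumes `pip install jieba`; output depends on the dictionary in use).
import jieba
import jieba.posseg as pseg

text = "自然语言处理是人工智能领域的一个重要研究方向"

print(jieba.lcut(text))             # word segmentation
for word, flag in pseg.cut(text):   # segmentation plus POS tags
    print(word, flag)
```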
NLP task summary 2: Syntactic analysis
• Language modeling (Language Modeling, LM): language models are very widely used; a detailed introduction to language models is given elsewhere.
Many current models are built on top of language models (a tiny bigram model is sketched after this list).
• Chunking: marking the phrase chunks in a sentence, such as noun phrases (NP) and verb phrases (VP).
• Constituency parsing (Constituency Parsing, CP): analyzing the constituents of a sentence and producing a syntax tree of terminals and non-terminals.
• Dependency parsing (Dependency Parsing, DP): analyzing the dependency relations between the words of a sentence and producing a dependency tree built from those relations.
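A bigram language model of the kind mentioned above can be sketched in a few lines; the toy corpus below is invented and no smoothing is applied.

```python
from collections import Counter

# Toy corpus for a bigram language model (maximum likelihood, no smoothing).
corpus = [["我", "喜欢", "自然", "语言", "处理"],
          ["我", "喜欢", "机器", "翻译"]]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    tokens = ["<s>"] + sent + ["</s>"]
    unigrams.update(tokens[:-1])
    bigrams.update(zip(tokens[:-1], tokens[1:]))

def bigram_prob(prev, word):
    """Maximum-likelihood estimate of P(word | prev)."""
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

print(bigram_prob("我", "喜欢"))      # -> 1.0 in this toy corpus
print(bigram_prob("喜欢", "自然"))    # -> 0.5
```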
UMND1:Unsupervised Word Sense Disambiguation Using ContextualSemantic RelatednessSiddharth Patwardhan School of ComputingUniversity of Utah Salt Lake City,UT84112. sidd@Satanjeev BanerjeeLanguage Technologies Inst.Carnegie Mellon UniversityPittsburgh,PA15217.banerjee@Ted PedersenDept.of Computer ScienceUniversity of MinnesotaDuluth,MN55812.tpederse@AbstractIn this paper we describe an unsuper-vised WordNet-based Word Sense Disam-biguation system,which participated(asUMND1)in the SemEval-2007Coarse-grained English Lexical Sample task.Thesystem disambiguates a target word by usingWordNet-based measures of semantic relat-edness tofind the sense of the word thatis semantically most strongly related to thesenses of the words in the context of the tar-get word.We briefly describe this system,the configuration options used for the task,and present some analysis of the results.1IntroductionWordNet::SenseRelate::TargetWord1(Patwardhan et al.,2005;Patwardhan et al.,2003)is an unsuper-vised Word Sense Disambiguation(WSD)system, which is based on the hypothesis that the intended sense of an ambiguous word is related to the words in its context.For example,if the“financial institution”sense of bank is intended in a context, then it is highly likely the context would contain related words such as money,transaction,interest rate,etc.The algorithm,therefore,determines the intended sense of a word(target word)in a given context by measuring the relatedness of each sense of that word with the words in its context. The sense of the target word that is most related to its context is selected as the intended sense of the target word.The system uses WordNet-based 1 measures of semantic relatedness2(Pedersen et al.,2004)to measure the relatedness between the different senses of the target word and the words in its context.This system is completely unsupervised and re-quires no annotated data for training.The lexical database WordNet(Fellbaum,1998)is the only re-source that the system uses to measure the related-ness between words and concepts.Thus,our system is classified under the closed track of the task.2System DescriptionOur WSD system consists of a modular framework, which allows different algorithms for the different subtasks to be plugged into the system.We divide the disambiguation task into two primary subtasks: context selection and sense selection.The context selection module tries to select words from the con-text that are most likely to be indicative of the sense of the target word.The sense selection module then uses the set of selected context words to choose one of the senses of the target word as the answer. 
Figure1shows a block schematic of the system, which takes SemEval-2007English Lexical Sample instances as input.Each instance is a made up of a few English sentences,and one word from these sentences is marked as the target word to be dis-ambiguated.The system processes each instance through multiple modules arranged in a sequential pipeline.Thefinal output of the pipeline is the sense that is most appropriate for the target word in the given context.2Instance Preprocessing Format FilterTarget Sense Context SelectionPostprocessingSense SelectionRelatedness MeasureFigure 1:System Architecture2.1Data PreparationThe input text is first passed through a format fil-ter ,whose task is to parse the input XML file.This is followed by a preprocessing step.Each instance passed to the preprocessing stage is first segmented into words,and then all compound words are iden-tified.Any sequence of words known to be a com-pound in WordNet is combined into a single entity.2.2Context SelectionAlthough each input instance consists of a large number of words,only a few of these are likely to be useful for disambiguating the target word.We use the context selection algorithm to select a subset of the context words to be used for sense selection.By removing the unimportant words,the computa-tional complexity of the algorithm is reduced.In this work,we use the NearestWords context selection algorithm.This algorithm algorithm se-lects 2n +1content words surrounding the target word (including the target word)as the context.A stop list is used to identify closed-class non-content words.Additionally,any word not found in Word-Net is also discarded.The algorithm then selects n content words before and n content words follow-ing the target word,and passes this unordered set of 2n +1words to the Sense Selection module.2.3Sense Selection AlgorithmThe sense selection module takes the set of words output by the context selection module,one of which is the target word to be disambiguated.For each of the words in this set,it retrieves a list of senses from WordNet,based on which it determines the intended sense of the target word.The package provides two main algorithms for Sense Selection:the local and the global algorithms,as described in previous work (Banerjee and Peder-sen,2002;Patwardhan et al.,2003).In this work,we use the local algorithm,which is faster and was shown to perform as well as the global algorithm.The local sense selection algorithm measures the semantic relatedness of each sense of the target word with the senses of the words in the context,and se-lects that sense of the target word which is most re-lated to the context word-senses.Given the 2n +1context words,the system scores each sense of the target word.Suppose the target word t has T senses,enumerated as t 1,t 2,...,t T .Also,suppose w 1,w 2,...,w 2n are the words in the context of t ,each hav-ing W 1,W 2,...,W 2n senses,respectively.Then for each t i a score is computed as score(t i )=2n j =1max k =1to W j(relatedness(t i ,w jk ))where w jk is the k th sense of word w j .The sense t iof target word t with the highest score is selected as the intended sense of the target word.The relatedness between two word senses is com-puted using a measure of semantic relatedness de-fined in the WordNet::Similarity software package (Pedersen et al.,2004),which is a suite of Perl mod-ules implementing a number WordNet-based mea-sures of semantic relatedness.For this work,we used the Context Vector measure (Patwardhan and Pedersen,2006).The relatedness of concepts 
is computed based on word co-occurrence statistics derived from WordNet glosses.Given two WordNet senses,this module returns a score between 0and 1,indicating the relatedness of the two senses.Our system relies on WordNet as its sense inven-tory.However,this task used OntoNotes (Hovy et al.,2006)as the sense inventory.OntoNotes word senses are groupings of similar WordNet senses.Thus,we used the training data answer key to gen-erate a mapping between the OntoNotes senses of the given lexical elements and their corresponding WordNet senses.We had to manually create the mappings for some of the WordNet senses,which had no corresponding OntoNotes senses.The sense selection algorithm performed all of its computa-tions with respect to the WordNet senses,and finally the OntoNotes sense corresponding to the selected WordNet sense of the target word was output as theanswer for each instance.3Results and AnalysisFor this task,we used the freely available Word-Net::SenseRelate::TargetWord v0.10and the Word-Net::Similarity v1.04packages.WordNet v2.1was used as the underlying knowledge base for these. The context selection module used a window size offive(including the target word).The semantic re-latedness of concepts was measured using the Con-text Vector measure,with configuration options as defined in previous research(Patwardhan and Ped-ersen,2006).Since we always predict exactly one sense for each instance,the precision and recall val-ues of all our experiments were always the same. Therefore,in this section we will use the name“ac-curacy”to mean both precision and recall.3.1Overall Results,and BaselinesThe overall accuracy of our system on the test data is0.538.This represents2,609correctly disam-biguated instances,out of a total of4,851instances. As baseline,we compare against the random al-gorithm where for each instance,we randomly pick one of the WordNet senses for the lexical element in that instance,and report the OntoNotes senseid it maps to as the answer.This algorithm gets an ac-curacy of0.417.Thus,our algorithm gets an im-provement of12%absolute(29%relative)over this random baseline.Additionally,we compare our algorithm against the WordNet SenseOne algorithm.In this algorithm, we pick thefirst sense among the WordNet senses of the lexical element in each instance,and report its corresponding OntoNotes sense as the answer for that instance.This algorithm leverages the fact that (in most cases)the WordNet senses for a particular word are listed in the database in descending order of their frequency of occurrence in the corpora from which the sense inventory was created.If the new test data has a similar distribution of senses,then this algorithm amounts to a“majority baseline”.This algorithm achieves an accuracy of0.681which is 15%absolute(27%relative)better than our algo-rithm.Although this seemingly na¨ıve algorithm out-performs our algorithm,we choose to avoid using this information in our algorithms because it repre-sents a large amount of human supervision in the form of manual sense tagging of text,whereas our goal is to create a purely unsupervised algorithm. 
Additionally,our algorithms can,with little change, work with other sense inventories besides WordNet that may not have this information.3.2Results Disaggregated by Part of SpeechIn our past experience,we have found that av-erage disambiguation accuracy differs significantly between words of different parts of speech.For the given test data,we separately evaluated the noun and verb instances.We obtained an accuracy of0.399 for the noun targets and0.692for the verb targets. Thus,wefind that our algorithm performs much bet-ter on verbs than on nouns,when evaluated using the OntoNotes sense inventory.This is different from our experience with S ENSEVAL data from previous years where performance on nouns was uniformly better than that on verbs.One possible reason for the better performance on verbs is that the OntoNotes sense inventory has,on average,fewer senses per verb word(4.41)than per noun word(5.71).How-ever,additional experimentation is needed to more fully understand the difference in performance.3.3Results Disaggregated by Lexical Element To gauge the accuracy of our algorithm on different words(lexical elements),we disaggregated the re-sults by individual word.Table1lists the accuracy values over instances of individual verb lexical ele-ments,and Table2lists the accuracy values for noun lexical elements.Our algorithm gets all instances correct for13verb lexical elements,and for none of the noun lexical elements.More generally,our al-gorithm gets an accuracy of50%or more on45out of the65verb lexical elements,and on15out of the 35noun lexical elements.For nouns,when the ac-curacy results are viewed in sorted order(as in Table 2),one can observe a sudden degradation of results between the accuracy of the word system.n–0.443–and the word source.n–0.257.It is unclear why there is such a jump;there is no such sudden degra-dation in the results for the verb lexical elements.4ConclusionsThis paper describes our system UMND1,which participated in the SemEval-2007Coarse-grainedWord Accuracy Word Accuracyremove 1.000purchase 1.000negotiate 1.000improve 1.000hope 1.000express 1.000exist 1.000estimate 1.000describe 1.000cause 1.000avoid 1.000attempt 1.000affect 1.000say0.969explain0.944complete0.938disclose0.929remember0.923allow0.914announce0.900kill0.875occur0.864do0.836replace0.800maintain0.800complain0.786believe0.764receive0.750approve0.750buy0.739produce0.727regard0.714propose0.714need0.714care0.714feel0.706recall0.667examine0.667claim0.667report0.657find0.607grant0.600work0.558begin0.521build0.500keep0.463go0.459contribute0.444rush0.429start0.421raise0.382end0.381prove0.364enjoy0.357see0.296set0.262promise0.250hold0.250lead0.231prepare0.222join0.222ask0.207come0.186turn0.048fix0.000Table1:Verb Lexical Element Accuracies English Lexical Sample task.The system is based on WordNet::SenseRelate::TargetWord,which is a freely available unsupervised Word Sense Disam-biguation software package.The system uses WordNet-based measures of semantic relatedness to select the intended sense of an ambiguous word.The system required no training data and using WordNet as its only knowledge source achieved an accuracy of54%on the blind test set.AcknowledgmentsThis research was partially supported by a National Science Foundation Early CAREER Development award(#0092784).ReferencesS.Banerjee and T.Pedersen.2002.An Adapted Lesk Al-gorithm for Word Sense Disambiguation Using Word-Net.In Proceedings of the Third International Con-Word Accuracy Word Accuracy 
policy0.949people0.904future0.870drug0.870space0.857capital0.789effect0.767condition0.765job0.692bill0.686area0.676base0.650management0.600power0.553development0.517chance0.467exchange0.459order0.456part0.451president0.446system0.443source0.257network0.218state0.208share0.192rate0.186hour0.167plant0.109move0.085point0.080value0.068defense0.048position0.044carrier0.000authority0.000Table2:Noun Lexical Element Accuraciesference on Intelligent Text Processing and Computa-tional Linguistics,pages136–145,Mexico City,Mex-ico,February.C.Fellbaum,editor.1998.WordNet:An electronic lexi-cal database.MIT Press.E.Hovy,M.Marcus,M.Palmer,L.Ramshaw,andR.Weischedel.2006.OntoNotes:The90%Solu-tion.In Proceedings of the Human Language Tech-nology Conference of the North American Chapter of the ACL,pages57–60,New York,NY,June.S.Patwardhan and ing WordNet-based Context Vectors to Estimate the Semantic Relat-edness of Concepts.In Proceedings of the EACL2006 Workshop on Making Sense of Sense:Bringing Com-putational Linguistics and Psycholinguistics Together, pages1–8,Trento,Italy,April.S.Patwardhan,S.Banerjee,and -ing Measures of Semantic Relatedness for Word Sense Disambiguation.In Proceedings of the Fourth In-ternational Conference on Intelligent Text Processing and Computational Linguistics,pages241–257,Mex-ico City,Mexico,February.S.Patwardhan,T.Pedersen,and S.Banerjee.2005.SenseRelate::TargetWord-A Generalized Framework for Word Sense Disambiguation.In Proceedings of the Twentieth National Conference on Artificial In-telligence(Intelligent Systems Demonstrations),pages 1692–1693,Pittsburgh,PA,July.T.Pedersen,S.Patwardhan,and J.Michelizzi.2004.WordNet::Similarity-Measuring the Relatedness of Concepts.In Human Language Technology Confer-ence of the North American Chapter of the Association for Computational Linguistics Demonstrations,pages 38–41,Boston,MA,May.。
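The local sense-selection score from Section 2.3 of the paper above can be sketched with NLTK's WordNet interface; here path similarity is used only as a stand-in for the Context Vector relatedness measure the authors actually used. The sketch assumes NLTK is installed (`pip install nltk`) and the WordNet data has been downloaded.

```python
# Sketch of the local sense-selection algorithm (Sec. 2.3 above): each sense of
# the target word is scored against the best-matching sense of every context
# word. Path similarity stands in for the paper's Context Vector measure.
# Assumes: pip install nltk; nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def select_sense(target, context_words):
    best_sense, best_score = None, float("-inf")
    for t_sense in wn.synsets(target):
        score = 0.0
        for w in context_words:
            sims = [t_sense.path_similarity(w_sense) or 0.0
                    for w_sense in wn.synsets(w)]
            score += max(sims, default=0.0)   # best-matching sense of this context word
        if score > best_score:
            best_sense, best_score = t_sense, score
    return best_sense

sense = select_sense("bank", ["money", "deposit", "loan"])
print(sense, sense.definition() if sense else None)
```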
Word Sense Disambiguation for Cross-Language Information RetrievalMary Xiaoyong Liu , Ted Diamond, and Anne R. DiekemaSchool of Information StudiesSyracuse UniversityAbstractWe have developed a word sense disambiguation algorithm, following Cheng and Wilensky (1997), to disambiguate among WordNet synsets. This algorithm is to be used in a cross-language information retrieval system, CINDOR, which indexes queries and documents in a language-neutral concept representation based on WordNet synsets. Our goal is to improve retrieval precision through word sense disambiguation. An evaluation against human disambiguation judgements suggests promise for our approach. 1 Introduction The CINDOR cross-language information retrieval system (Diekema et al., 1998) uses an information structure known as “conceptual interlingua” for query and document representation. This conceptual interlingua is a hierarchically organized multilingual concept lexicon, which is structured following WordNet (Miller, 1990). By representing query and document terms by their WordNet synset numbers we arrive at essentially a language neutral representation consisting of synset numbers representing concepts. This representation facilitates cross-language retrieval by matching term synonyms in English as well as across languages. However, many terms are polysemous and belong to multiplesynsets, resulting in spurious matches in retrieval. The noun figure for example appears in 13 synsets in WordNet 1.6. This research paper describes the early stages 1 of our efforts to develop a word sense disambiguation (WSD)algorithm aimed at improving the precision of our cross-language retrieval system.2 Related Work To determine the sense of a word, a WSD algorithm typically uses the context of the ambiguous word, external resources such as machine-readable dictionaries, or a combination of both. Although dictionaries provide useful word sense information and thesauri provide additional information about relationships between words, they lack pragmatic information as can be found in corpora. Corpora contain examples of words that enable the developmentof statistical models of word senses and their contexts (Ide and Veronis, 1998; Leacock and Chodorow, 1998). There are two general problems with using corpora however; 1) corpora typically do not come pre-tagged with manually disambiguated senses, and 2) corpora are often not large nor diverse enough for all senses of a word to appear often enough for reliable statistical models (data sparseness). Although researchers have tried sense-tagging corpora automatically by using either supervised or unsupervised training methods, we have adopted a WSD algorithm which avoids the necessity for a sense-tagged training corpus.1 Please note that the disambiguation researchdescribed in this paper has not yet been extended to multiple language areas.P(synset|context(w)) =w))P(context(P(synset)synset)|w)P(context( (1)The problem of data sparseness is usually solved by using either smoothing methods, class-based methods, or by relying on similarity-based methods between words and co-occurrence data. Since we are using a WordNet-based resource for retrieval, using class-based methods seems a natural choice. Appropriate word classes can be formed by synsets or groups of synsets. The evidence of a certain sense (synset) is then no longer dependent on one word but on all the members of a particular synset.Yarowsky (1992) used Rogets Thesaurus categories as classes for WSD. 
His approach was based on selecting the most likely Roget category for nouns given their context of 50 words on either side. When any of the category indicator words appeared in the context of an ambiguous word, the indicator weights for each category were summed to determine the most likely category. The category with the largest sum was then selected.A similar approach to that of Yarowsky was followed by Cheng and Willensky (1997) who used a training matrix of associations of words with a certain category. Their algorithm was appealing to us because it requires no human intervention, and more importantly, it avoids the use of sense-tagged data. Our methodology described in the next section is therefore based on Cheng and Wilensky’s approach.Methods to reduce (translation) ambiguity in cross-language information retrieval have included using part-of-speech taggers to restrict the translation options (Davis 1997), applying pseudo-relevance feedback loops to expand the query with better terms aiding translation (Ballesteros and Croft 1997), using corpora for term translation disambiguation (Ballesteros and Croft, 1998), and weighted Boolean models which tend to have a self-disambiguating quality (Hull, 1997; Diekema et al., 1999; Hiemstra and Kraaij, 1999).3 MethodologyTo disambiguate a given word, we would like to know the probability that a sense occurs in a given context, i.e., P(sense|context). In this study, WordNet synsets are used to represent word senses, so P(sense|context) can berewritten as P(synset|context), for each synset of which that word is a member. For nouns, we define the context of word w to be the occurrence of words in a moving window of 100 words (50 words on each side) around w 2.By Bayes Theorem, we can obtain the desired probability by inversion (see equation (1)). Since we are not specifically concerned with getting accurate probabilities but rather relative rank order for sense selection, we ignore P(context(w)) and focus on estimating P(context(w)|synset)P(synset). The event space from which "context(w)" is drawn is the set of sets of words that ever appear with each other in the window around w. In other words, w induces a partition on the set of words. We define "context(w)" to be true whenever any of the words in the set appears in the window around w, and conversely to be false whenever none of the words in the set appears around w. If we assume independence of appearance of any two words in a given context, then we get:∏∈×contextw i synset)))|P(w - (1 - (1 P(synset)i (2)Due to the lack of sense-tagged corpora, we are not able to directly estimate P(synset) and P(w i |synset). Instead, we introduce "noisy estimators" (P e (synset) and P e (w i |synset)) to approximate these probabilities. In doing so, we make two assumptions: 1) The presence of any word w k that belongs to synset s i signals the presence of s i ; 2) Any word w k belongs to all its synsets simultaneously, and with equal probability. 
Although the assumptions underlying the "noisy estimators" are not strictly true, it is our belief that the "noisy estimators" should work reasonably well if:•The words that belong to synset s i tend to appear in similar contexts when s i is their intended sense;•These words do not completely overlap with the words belonging to some synset s j ( i ≠ j ) that partially overlaps with s i ;2For other parts of speech, the window size should be much smaller as suggested by previous research.• The common words between s i and s j appear in different contexts when s i ands j are their intended senses.4 The WSD AlgorithmWe chose as a basis the algorithms described by Yarrowsky (1992) and by Cheng and Wilensky (1997). In our variation, we use the synset numbers in WordNet to represent the senses of a word. Our algorithm learns associations of WordNet synsets with words in a surrounding context to determine a word sense. It consists of two phases.During the training phase, the algorithm reads in all training documents in collection and computes the distance-adjusted weight of co-occurrence of each word with each corresponding synset. This is done by establishing a 100-word window around a target word (50 words on each side), and correlating each synset to which the target word belongs with each word in the surrounding window. The result of the training phase is a matrix of associations of words with synsets.In the sense prediction phase, the algorithm takes as input randomly selected testing documents or sentences that contain the polysemous words we want to disambiguate andexploits the context vectors built in the trainingphase by adding up the weighted "votes". It thenreturns a ranked list of probability values associated with each synset, and chooses thesynset with the highest probability as the senseof the ambiguous word.Figure 1 and Figure 2 show an outline of the algorithm.In this algorithm, "noisy estimators" are employed in the sense prediction phase. Theyare calculated using following formulas:P e(w i|x) =[][][][]∑∈W w i x wMxwM(3)where w i is a stem, x is a given synset,M[w][x] is a cell in the correlation matrix that corresponds to word w and synset x, andP e(x) =[][][][]∑∑∈∈∈YyWwWwywMxwM,(4)where w is any stem in the collection, xis a given synset, y is any synset ever occurredin collection.For each document d in collectionread in a noun stem w from dfor each synset s in which w occursget the column b in the association matrix M that corresponds to s if the column already exists; create a new column for s otherwisefor each word stem j appearing in the 100-word window around wget the row a in M that corresponds to j if the row already exists; create a new row for j otherwiseadd a distance-adjusted weight to M[a][b]Figure 1: WSD Algorithm: the training phaseSet value = 1For each word w to be disambiguatedget synsets of wfor each synset x of wfor each w i in the context of w (within the 100-window around w)calculate P e(w i|x)value *= ( 1 - P e(w i|x))P(context(w)|x) = 1 - valueCalculate p e(x)P(x|context(w))=p e(x)* P(context(w)|x)display a ranked list of the synsets arranged according to their P(x|context(w)) in decreasing orderFigure 2: WSD Algorithm: the sense prediction phase5 EvaluationAs suggested by the WSD literature, evaluation of word sense disambiguation systems is not yet standardized (Resnik and Yarowsky, 1997). 
Some WSD evaluations have been done using the Brown Corpus as training and testing resources and comparing the results against SemCor3, the sense-tagged version of the Brown Corpus (Agirre and Rigau, 1996; Gonzalo et al., 1998). Others have used common test suites such as the 2094-word line data of Leacock et al. (1993). Still others have tended to use their own metrics. We chose an evaluation with a user-based component that allowed a ranked list of sense selection for each target word and enabled a comprehensive comparison between automatic and manual WSD results. In addition we wanted to base the disambiguation matrix on a corpus that we use for retrieval. This approach allows for a much richer evaluation than a simple hit-or-miss test. For validation purpose, we will conduct a fully automatic evaluation against SemCor in our future efforts.We use in vitro evaluation in this study, i.e. the WSD algorithm is tested independent of the retrieval system. The population consists of all the nouns in WordNet, after removal of monosemous nouns, and after removal of a problematic class of polysemous nouns.4 We drew a random sample of 87 polysemous nouns5 from this population.In preparation, for each noun in our sample we identified all the documents containing that noun from the Associated Press (AP) newspaper corpus. The testing document set was then formed by randomly selecting 10 documents from the set of identified documents for each of the 87 nouns. In total, there are 867 documents in the testing set. The training document set 3 SemCor is a semantically sense-tagged corpus comprising approximately 250, 000 words. The reported error rate is around 10% for polysemous words.4This class of nouns refers to nouns that are in synsets in which they are the sole word, or in synsets whose words were subsets of other synsets for that noun. This situation makes disambiguation extremely problematic. This class of noun will be dealt with in a future version of our algorithm but for now it is beyond the scope of this evaluation.5 A polysemous noun is defined as a noun that belongs to two or more synsets. consists of all the documents in the AP corpus excluding the above-mentioned 867 documents.For each noun in our sample, we selected all its corresponding WordNet noun synsets and randomly selected 10 sentence occurrences witheach from one of the 10 random documents.After collecting 87 polysemous nouns with10 noun sentences each, we had 870 sentencesfor disambiguation. Four human judges were randomly assigned to two groups with two judges each, and each judge was asked to disambiguate 275 word occurrences out of which 160 were unique and 115 were sharedwith the other judge in the same group. For eachword occurrence, the judge put the target word’s possible senses in rank order according to their appropriateness given the context (ties are allowed).Our WSD algorithm was also fed with the identical set of 870 word occurrences in the sense prediction phase and produced a rankedlist of senses for each word occurrence.Since our study has a matched-group designin which the subjects (word occurrences) receiveboth the treatments and control, the measurement of variables is on an ordinal scale,and there is no apparently applicable parametric statistical procedure available, two nonparametric procedures -the Friedman two-way analysis of variance and the Spearman rank correlation coefficient -were originally chosenas candidates for the statistical analysis of our results. 
However, the number of ties in our results renders the Spearman coefficient unreliable. We have therefore concentrated on the Friedman analysis of our experimental results. We use the two-alternative test with α = 0.05.

The first tests of interest were aimed at establishing inter-judge reliability across the 115 shared sentences by each pair of judges. The null hypothesis can be generalized as "There is no difference in judgments on the same word occurrences between two judges in the same group". Following general steps of conducting a Friedman test as described by Siegel (1956), we cast raw ranks in a two-way table having 2 conditions/columns (K = 2), with each of the human judges in the pair serving as one condition, and 365 subjects/rows (N = 365), which are all the senses of the 115 word occurrences that were judged by both human judges. We then ranked the scores in each row from 1 to K (in this case K is 2), summed the derived ranks in each column, and calculated X_r^2, which is .003. For α = 0.05 and degrees of freedom df = 1 (df = K − 1), the rejection region starts at 3.84. Since .003 is smaller than 3.84, the null hypothesis is not rejected. Similar steps were used for analyzing reliability between the second pair of judges. In both cases, we did not find significant difference between judges (see Figure 3).

                          N     K    X_r^2     df   Rejection region   Reject H0?
First pair of judges      365   2    .003      1    3.84               No
Second pair of judges     380   2    2.5289    1    3.84               No
Figure 3: Statistics for significance tests of inter-judge reliability (α = .05, 2-alt. test)

Our second area of interest was the comparison of automatic WSD, manual WSD, and "sense pooling". Sense pooling equates to no disambiguation, where each sense of a word is considered equally likely (a tie). The null hypothesis (H0) is "There is no difference among manual WSD, automatic WSD, and sense pooling (all the conditions come from the same population)". The steps for the Friedman analysis were similar to what we did for the inter-judge reliability test, while the conditions and subjects were changed in each test according to what we would like to compare. Test results are summarized in Figure 4. In the three-way comparison shown in the first row of the table, we rejected H0, so there was at least one condition that was from a different population. By further conducting tests which examined each two of the above three conditions at a time, we found that it was sense pooling that came from a different population, while manual and automatic WSD were not significantly different. We can therefore conclude that our WSD algorithm is better than no disambiguation.

                                        N      K    X_r^2      df   Rejection region   Reject H0?
Auto WSD vs man. WSD vs sense pooling   2840   3    73.217     2    5.99               Yes
Auto WSD vs man. WSD                    2840   2    3.7356     1    3.84               No
Auto WSD vs sense pooling               2840   2    5.9507     1    3.84               Yes
Man. WSD vs sense pooling               2840   2    126.338    1    3.84               Yes
Figure 4: Statistics for significance tests among automatic WSD, manual WSD, and sense pooling (α = .05, 2-alt. test)

6 Concluding Remarks

The ambiguity of words may negatively impact the retrieval performance of a concept-based information retrieval system like CINDOR. We have developed a WSD algorithm that uses all the words in a WordNet synset as evidence of a given sense and builds an association matrix to learn the co-occurrence between words and senses. An evaluation of our algorithm against human judgements of a small sample of nouns demonstrated no significant difference between our automatic ranking of senses and the human judgements.
There was, however, a significant difference between human judgement and rankings produced with no disambiguation where all senses were tied.These early results are such as to encourage us to continue our research in this area. In our future work we must tackle issues associated with the fine granularity of some WordNet sense distinctions, synsets which are proper subsets of other synsets and are therefore impossible to distinguish, and also extend our evaluation to multiple languages and to other parts of speech. The next step in our work will be to evaluate our WSD algorithm against the manually sense-tagged SemCor Corpus for validation, and then integrate our WSD algorithm into CINDOR’s processing and evaluate directly the impact on retrieval performance. We hope to verify that word sense disambiguation leads to improved precision in cross-language retrieval. AcknowledgementsThis work was completed under a research practicum at MNIS-TextWise Labs, Syracuse, NY. We thank Paraic Sheridan for many useful discussions and the anonymous reviewers for constructive comments on the manuscript. ReferencesAgirre, E., and Rigau, G. (1996). Word sense disambiguation using conceptual density. In:Proceedings of the 16th International Conference on Computational Linguistics,Copenhagen,1996.Ballesteros, L., and Croft, B. (1997). Phrasal Translation and Query Expansion Techniques forCross-Language Information Retrieval. In:Proceedings of the Association for ComputingMachinery Special Interest Group on Information Retrieval (ACM/SIGIR) 20thInternational Conference on Research andDevelopment in Information Retrieval; 1997 July25-31; Philadelphia, PA. New York, NY: ACM,1997. 84-91.Ballesteros, L., and Croft, B. (1998). Resolving Ambiguity for Cross-language Retrieval. In:Proceedings of the Association for ComputingMachinery Special Interest Group on Information Retrieval (ACM/SIGIR) 21st International Conference on Research andDevelopment in Information Retrieval; 1998August 24-28; Melbourne, Australia. New York,NY: ACM, 1998. 64-71.Cheng, I., and Wilensky, R. (1997). An Experiment in Enhancing Information Access by NaturalLanguage Processing. UC Berkeley ComputerScience Technical Report UCB/CSD UCB//CSD-97-963.Davis, M. (1997). New Experiments in Cross-Language Text Retrieval at NMSU’s ComputingResearch Lab. In: D.K. Harman, Ed. The FifthText Retrieval Conference (TREC-5). 1996,November. National Institute of Standards andTechnology (NIST), Gaithersburg, MD. Diekema, A., Oroumchian, F., Sheridan, P., and Liddy, E. D. (1999). TREC-7 Evaluation ofConceptual Interlingua Document Retrieval(CINDOR) in English and French. In: E.M.Voorhees and D.K. Harman (Eds.) The SeventhText REtrieval Conference (TREC-7). 1998,November 9-11; National Institute of Standardsand Technology (NIST), Gaithersburg, MD. 169-180.Gonzalo, J., Verdejo, F., Chugur, I., and Cigarran, J.(1998). Indexing with WordNet synsets canimprove text retrieval. In: Proceedings of theCOLING/ACL Workshop on Usage of WordNetin Natural Language Processing Systems,Montreal, 1998.Hiemstra, D., and Kraaij, W. (1999). Twenty-One at TREC-7: Ad-hoc and Cross-language Track. In:E.M. Voorhees and D.K. Harman (Eds.) TheSeventh Text REtrieval Conference (TREC-7).1998, November 9-11; National Institute of Standards and Technology (NIST), Gaithersburg, MD. 227-238.Hull, D. A. (1997). Using Structured Queries for Disambiguation in Cross-Language InformationRetrieval. 
In: American Association for ArtificialIntelligence (AAAI) Symposium on Cross-Language Text and Speech Retrieval; 1997March 24-26; Palo Alto, CA 1997. 84-98.Ide, N., and Veronis, J. (1998). Introduction to the Special Issue on Word Sense Disambiguation:The State of the Art. Computational Linguistics,Vol. 24, No. 1, 1-40.Leacock, C., and Chodorow, M. (1998). Combining Local Context and WordNet Similarity for WordSense Identification. In: Christiane Fellbaum(Eds.) WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press. Leacock, C., Towell, G., and Voorhees, E. (1993).Corpus-based Statistical Sense Resolution. In:Proceedings, ARPA Human Language Technology Workshop, Plainsboro, NJ. 260-265. Miller, G. (1990). WordNet: An On-line Lexical Database. International Journal of Lexicography,Vol. 3, No. 4, Special Issue.Resnik, P., and Yarowsky, D. (1997). A Perspective on Word Sense Disambiguation Methods andTheir Evaluation, position paper presented at theACL SIGLEX Workshop on Tagging Text withLexical Semantics: Why, What, and How?, heldApril 4-5, 1997 in Washington, D.C., USA inconjunction with ANLP-97.Siegel, S. (1956). Nonparametric Statistics for the Behavioral Sciences. New York: McGraw-Hill,1956.Yarowsky, D. (1992). Word-Sense Disambiguation Using Statistical Models of Roget's CategoriesTrained on Large Corpora. In: Proceedings of theFourteenth International Conference on Computational Linguistics. Nantes, France. 454-460.。
1. 自然语言处理(NLP)的主要目标是什么?A. 使计算机能够理解和生成人类语言B. 提高计算机的计算速度C. 优化数据库查询D. 增强图形处理能力2. 在NLP中,词性标注(POS tagging)的主要目的是什么?A. 识别文本中的每个单词B. 确定每个单词在句子中的语法功能C. 分析文本的情感倾向D. 提取文本中的关键词3. 以下哪个不是自然语言处理的子领域?A. 机器翻译B. 语音识别C. 数据挖掘D. 文本分类4. 在NLP中,句法分析的主要任务是什么?A. 确定单词的词性B. 分析句子的结构和语法关系C. 识别文本中的实体D. 评估文本的情感5. 命名实体识别(NER)在NLP中的主要作用是什么?A. 识别和分类文本中的特定实体,如人名、地点、组织等B. 分析句子的语法结构C. 确定单词的词性D. 翻译文本6. 以下哪种技术常用于文本分类?A. 词袋模型(Bag of Words)B. 语音合成C. 图像识别D. 数据压缩7. 在NLP中,情感分析的主要目的是什么?A. 确定文本的情感倾向,如正面、负面或中性B. 识别文本中的实体C. 分析句子的语法结构D. 翻译文本8. 以下哪个是深度学习在NLP中的应用?A. 循环神经网络(RNN)B. 决策树C. 支持向量机(SVM)D. 关联规则学习9. 在NLP中,词嵌入(Word Embedding)的主要作用是什么?A. 将单词转换为数值向量,以便计算机处理B. 分析句子的语法结构C. 识别文本中的实体D. 翻译文本10. 以下哪个是NLP中的预处理步骤?A. 分词(Tokenization)B. 语音识别C. 图像处理D. 数据压缩11. 在NLP中,停用词(Stop Words)的主要作用是什么?A. 去除文本中不重要的词汇,如“的”、“是”等B. 识别文本中的实体C. 分析句子的语法结构D. 翻译文本12. 以下哪个是NLP中的序列标注任务?A. 命名实体识别(NER)B. 文本分类C. 情感分析D. 机器翻译13. 在NLP中,依存句法分析(Dependency Parsing)的主要目的是什么?A. 分析句子中单词之间的依赖关系B. 识别文本中的实体C. 确定单词的词性D. 翻译文本14. 以下哪个是NLP中的生成模型?A. 生成对抗网络(GAN)B. 支持向量机(SVM)C. 决策树D. 关联规则学习15. 在NLP中,语言模型(Language Model)的主要作用是什么?A. 预测下一个单词或短语的概率B. 识别文本中的实体C. 分析句子的语法结构D. 翻译文本16. 以下哪个是NLP中的无监督学习任务?A. 聚类分析B. 文本分类C. 情感分析D. 机器翻译17. 在NLP中,主题模型(Topic Model)的主要作用是什么?A. 识别文本中的主题或话题B. 识别文本中的实体C. 分析句子的语法结构D. 翻译文本18. 以下哪个是NLP中的序列到序列(Seq2Seq)模型?A. 机器翻译B. 文本分类C. 情感分析D. 命名实体识别19. 在NLP中,注意力机制(Attention Mechanism)的主要作用是什么?A. 提高模型对重要信息的关注度B. 识别文本中的实体C. 分析句子的语法结构D. 翻译文本20. 以下哪个是NLP中的强化学习任务?A. 对话系统B. 文本分类C. 情感分析D. 机器翻译21. 在NLP中,文本摘要(Text Summarization)的主要作用是什么?A. 生成文本的简洁概述B. 识别文本中的实体C. 分析句子的语法结构D. 翻译文本22. 以下哪个是NLP中的问答系统任务?A. 回答用户提出的问题B. 文本分类C. 情感分析D. 机器翻译23. 在NLP中,语义角色标注(Semantic Role Labeling)的主要作用是什么?A. 识别句子中各个成分的语义角色B. 识别文本中的实体C. 分析句子的语法结构D. 翻译文本24. 以下哪个是NLP中的知识图谱任务?A. 构建实体之间的关系图谱B. 文本分类C. 情感分析D. 机器翻译25. 在NLP中,词义消歧(Word Sense Disambiguation)的主要作用是什么?A. 确定单词在特定上下文中的确切含义B. 识别文本中的实体C. 分析句子的语法结构D. 翻译文本26. 以下哪个是NLP中的预训练模型?A. BERTB. 支持向量机(SVM)C. 决策树D. 关联规则学习27. 在NLP中,跨语言文本处理的主要任务是什么?A. 处理和分析不同语言的文本B. 识别文本中的实体C. 分析句子的语法结构D. 翻译文本28. 以下哪个是NLP中的语音处理任务?A. 语音识别B. 文本分类C. 情感分析D. 机器翻译29. 在NLP中,文本蕴涵(Textual Entailment)的主要作用是什么?A. 判断一个文本是否蕴含另一个文本的信息B. 识别文本中的实体C. 分析句子的语法结构D. 翻译文本30. 以下哪个是NLP中的对话系统任务?A. 与用户进行自然语言对话B. 文本分类C. 情感分析D. 机器翻译31. 在NLP中,文本纠错(Text Correction)的主要作用是什么?A. 自动检测和修正文本中的错误B. 识别文本中的实体C. 分析句子的语法结构D. 翻译文本32. 以下哪个是NLP中的信息抽取任务?A. 从文本中提取有用信息B. 文本分类C. 情感分析D. 机器翻译33. 在NLP中,文本分割(Text Segmentation)的主要作用是什么?A. 将文本分割成有意义的单元,如句子或段落B. 识别文本中的实体C. 分析句子的语法结构D. 翻译文本34. 以下哪个是NLP中的文本生成任务?A. 自动生成文本内容B. 文本分类C. 情感分析D. 机器翻译35. 在NLP中,文本对齐(Text Alignment)的主要作用是什么?A. 将不同语言或版本的文本对齐B. 识别文本中的实体C. 分析句子的语法结构D. 翻译文本36. 以下哪个是NLP中的文本挖掘任务?A. 从大量文本数据中提取有用信息B. 文本分类C. 情感分析D. 机器翻译37. 在NLP中,文本相似度计算的主要作用是什么?A. 计算两个文本之间的相似度B. 识别文本中的实体C. 分析句子的语法结构D. 翻译文本38. 以下哪个是NLP中的文本聚类任务?A. 将相似的文本分组B. 文本分类C. 情感分析D. 机器翻译39. 在NLP中,文本规范化(Text Normalization)的主要作用是什么?A. 将文本转换为标准格式B. 识别文本中的实体C. 分析句子的语法结构D. 翻译文本40. 以下哪个是NLP中的文本去噪任务?A. 去除文本中的噪声或无关信息B. 文本分类C. 情感分析D. 机器翻译41. 在NLP中,文本表示(Text Representation)的主要作用是什么?A. 将文本转换为计算机可处理的格式B. 识别文本中的实体C. 分析句子的语法结构D. 翻译文本42. 以下哪个是NLP中的文本增强任务?A. 通过各种技术增强文本数据B. 文本分类C. 情感分析D. 机器翻译43. 在NLP中,文本过滤(Text Filtering)的主要作用是什么?A. 根据特定标准筛选文本B. 识别文本中的实体C. 分析句子的语法结构D. 翻译文本44. 以下哪个是NLP中的文本排序任务?A. 根据特定标准对文本进行排序B. 文本分类C. 情感分析D. 机器翻译45. 在NLP中,文本转换(Text Transformation)的主要作用是什么?A. 将文本从一种形式转换为另一种形式B. 识别文本中的实体C. 分析句子的语法结构D. 翻译文本46. 以下哪个是NLP中的文本压缩任务?A. 减少文本的数据量B. 文本分类C. 情感分析D. 机器翻译47. 在NLP中,文本可视化(Text Visualization)的主要作用是什么?A. 将文本数据以可视化形式展示B. 识别文本中的实体C. 分析句子的语法结构D. 翻译文本48. 以下哪个是NLP中的文本挖掘工具?A. NLTKB. 支持向量机(SVM)C. 决策树D. 关联规则学习49. 在NLP中,文本分析(Text Analysis)的主要作用是什么?A. 对文本数据进行深入分析B. 识别文本中的实体C. 分析句子的语法结构D. 翻译文本50. 以下哪个是NLP中的文本挖掘框架?A. spaCyB. 支持向量机(SVM)C. 决策树D. 
关联规则学习答案:1. A2. B3. C4. B5. A6. A7. A8. A9. A10. A11. A12. A13. A14. A15. A16. A17. A18. A19. A20. A21. A22. A23. A24. A25. A26. A27. A28. A29. A30. A31. A32. A33. A34. A35. A36. A37. A38. A39. A40. A41. A42. A43. A44. A45. A46. A47. A48. A49. A50. A。
Improving Word Sense Disambiguation Using Topic FeaturesJun Fu Cai,Wee Sun Lee Department of Computer Science National University of Singapore 3Science Drive2,Singapore117543 {caijunfu,leews}@.sgYee Whye TehGatsby Computational Neuroscience Unit University College London17Queen Square,London WC1N3AR,UK ywteh@AbstractThis paper presents a novel approach for ex-ploiting the global context for the task ofword sense disambiguation(WSD).This isdone by using topic features constructed us-ing the latent dirichlet allocation(LDA)al-gorithm on unlabeled data.The features areincorporated into a modified na¨ıve Bayesnetwork alongside other features such aspart-of-speech of neighboring words,singlewords in the surrounding context,local col-locations,and syntactic patterns.In both theEnglish all-words task and the English lex-ical sample task,the method achieved sig-nificant improvement over the simple na¨ıveBayes classifier and higher accuracy than thebest official scores on Senseval-3for bothtask.1IntroductionNatural language tends to be ambiguous.A word often has more than one meanings depending on the context.Word sense disambiguation(WSD)is a nat-ural language processing(NLP)task in which the correct meaning(sense)of a word in a given context is to be determined.Supervised corpus-based approach has been the most successful in WSD to date.In such an ap-proach,a corpus in which ambiguous words have been annotated with correct senses isfirst collected. Knowledge sources,or features,from the context of the annotated word are extracted to form the training data.A learning algorithm,like the support vector machine(SVM)or na¨ıve Bayes,is then applied on the training data to learn the model.Finally,in test-ing,the learnt model is applied on the test data to assign the correct sense to any ambiguous word. 
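As a minimal illustration of the supervised pipeline just described (not the system developed later in this paper), the sketch below trains a classifier for a single target word from a few hypothetical sense-tagged contexts. The toy sentences, the sense labels, and the use of scikit-learn bag-of-words features with a naïve Bayes learner are assumptions for illustration only.

```python
# A toy supervised WSD pipeline: sense-annotated contexts -> features -> classifier.
# Requires scikit-learn; the example sentences and sense labels are made up.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_contexts = [
    "he sat on the bank of the river fishing",
    "the bank raised its interest rate on deposits",
    "erosion of the river bank after the flood",
    "she opened an account at the local bank",
]
train_senses = ["bank%river", "bank%finance", "bank%river", "bank%finance"]

# Bag-of-words features from the surrounding context plus a naive Bayes learner.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_contexts, train_senses)

test_contexts = ["they walked along the river bank watching the water"]
print(model.predict(test_contexts))  # expected output: ['bank%river']
```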
The features used in these systems usually in-clude local features,such as part-of-speech(POS) of neighboring words,local collocations,syntac-tic patterns and global features such as single words in the surrounding context(bag-of-words)(Lee and Ng,2002).However,due to the data scarcity prob-lem,these features are usually very sparse in the training data.There are,on average,11and28 training cases per sense in Senseval2and3lexi-cal sample task respectively,and6.5training cases per sense in the SemCor corpus.This problem is especially prominent for the bag-of-words feature; more than hundreds of bag-of-words are usually ex-tracted for each training instance and each feature could be drawn from any English word.A direct consequence is that the global context information, which the bag-of-words feature is supposed to cap-ture,may be poorly represented.Our approach tries to address this problem by clustering features to relieve the scarcity problem, specifically on the bag-of-words feature.In the pro-cess,we construct topic features,trained using the latent dirichlet allocation(LDA)algorithm.We train the topic model(Blei et al.,2003)on unlabeled data, clustering the words occurring in the corpus to a pre-defined number of topics.We then use the resulting topic model to tag the bag-of-words in the labeled corpus with topic distributions.We incorporate the distributions,called the topic features,using a sim-ple Bayesian network,modified from na¨ıve Bayesmodel,alongside other features and train the model on the labeled corpus.The approach gives good per-formance on both the lexical sample and all-words tasks on Senseval data.The paper makes mainly two contributions.First, we are able to show that a feature that efficiently captures the global context information using LDA algorithm can significantly improve the WSD ac-curacy.Second,we are able to obtain this feature from unlabeled data,which spares us from any man-ual labeling work.We also showcase the potential strength of Bayesian network in the WSD task,ob-taining performance that rivals state-of-arts meth-ods.2Related WorkMany WSD systems try to tackle the data scarcity problem.Unsupervised learning is introduced pri-marily to deal with the problem,but with limited success(Snyder and Palmer,2004).In another ap-proach,the learning algorithm borrows training in-stances from other senses and effectively increases the training data size.In(Kohomban and Lee, 2005),the classifier is trained using grouped senses for verbs and nouns according to WordNet top-level synsets and thus effectively pooling training cases across senses within the same synset.Similarly, (Ando,2006)exploits data from related tasks,using all labeled examples irrespective of target words for learning each sense using the Alternating Structure Optimization(ASO)algorithm(Ando and Zhang, 2005a;Ando and Zhang,2005b).Parallel texts is proposed in(Resnik and Yarowsky,1997)as po-tential training data and(Chan and Ng,2005)has shown that using automatically gathered parallel texts for nouns could significantly increase WSD ac-curacy,when tested on Senseval-2English all-words task.Our approach is somewhat similar to that of us-ing generic language features such as POS tags;the words are tagged with its semantic topic that may be trained from other corpuses.3Feature ConstructionWefirst present the latent dirichlet allocation algo-rithm and its inference procedures,adapted from the original paper(Blei et al.,2003).3.1Latent Dirichlet AllocationLDA is a probabilistic model for collections of 
discrete data and has been used in document modeling and text classification. It can be represented as a three-level hierarchical Bayesian model, shown graphically in Figure 1. Given a corpus consisting of M documents, LDA models each document using a mixture over K topics, which are in turn characterized as distributions over words.

Figure 1: Graphical Model for LDA

In the generative process of LDA, for each document d we first draw the mixing proportion over topics θ_d from a Dirichlet prior with parameters α. Next, for each of the N_d words w_dn in document d, a topic z_dn is first drawn from a multinomial distribution with parameters θ_d. Finally, w_dn is drawn from the topic-specific distribution over words. The probability of a word token w taking on value i given that topic z = j was chosen is parameterized using a matrix β with β_ij = p(w = i | z = j). Integrating out the θ_d's and z_dn's, the probability p(D | α, β) of the corpus is thus:

\prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \Big( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \Big) \, d\theta_d

3.1.1 Inference

Unfortunately, it is intractable to directly solve the posterior distribution of the hidden variables given a document, namely p(θ, z | w, α, β). However, (Blei et al., 2003) has shown that by introducing a set of variational parameters, γ and φ, a tight lower bound on the log likelihood of the probability can be found using the following optimization procedure:

(\gamma^*, \phi^*) = \arg\min_{\gamma, \phi} D(q(\theta, z \mid \gamma, \phi) \,\|\, p(\theta, z \mid w, \alpha, \beta))

where

q(\theta, z \mid \gamma, \phi) = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n),

γ is the Dirichlet parameter for φ and the multinomial parameters (φ_1 ··· φ_N) are the free variational parameters. Note here γ is document specific instead of corpus specific like α. Graphically, it is represented as Figure 2. The optimizing values of γ and φ can be found by minimizing the Kullback-Leibler (KL) divergence between the variational distribution and the true posterior.

Figure 2: Graphical Model for Variational Inference

3.2 Baseline Features

For both the lexical sample and all-words tasks, we use the following standard baseline features for comparison.

POS Tags. For each training or testing word, w, we include POS tags for P words prior to as well as after w within the same sentence boundary. We also include the POS tag of w. If there are fewer than P words prior to or after w in the same sentence, we denote the corresponding feature as NIL.

Local Collocations. Collocation C_{i,j} refers to the ordered sequence of tokens (words or punctuations) surrounding w. The starting and ending positions of the sequence are denoted i and j respectively, where a negative value refers to the token position prior to w. We adopt the same 11 collocation features as (Lee and Ng, 2002), namely C_{-1,-1}, C_{1,1}, C_{-2,-2}, C_{2,2}, C_{-2,-1}, C_{-1,1}, C_{1,2}, C_{-3,-1}, C_{-2,1}, C_{-1,2}, and C_{1,3}.

Bag-of-Words. For each training or testing word, w, we get G words prior to as well as after w, within the same document. These features are position insensitive. The words we extract are converted back to their morphological root forms.

Syntactic Relations. We adopt the same syntactic relations as (Lee and Ng, 2002). For easy reference, we summarize the features in Table 1.

POS of w     Features
Noun         Parent headword h; POS of h; relative position of h to w
Verb         Left nearest child word of w, l; right nearest child word of w, r; POS of l; POS of r; POS of w; voice of w
Adjective    Parent headword h; POS of h
Table 1: Syntactic Relations Features

The exact values of P and G for each task are set according to the cross validation result.

3.3 Topic Features

We first select an unlabeled corpus, such as 20 Newsgroups, and extract individual words from it (excluding stopwords). We choose the number of topics, K, for the unlabeled corpus and we apply the LDA
algorithm to obtain the β parameters, where β represents the probability of a word w_i given a topic z_j, p(w_i | z_j) = β_ij. The model essentially clusters words that occurred in the unlabeled corpus according to K topics. The conditional probability p(w_i | z_j) = β_ij is later used to tag the words in the unseen test example with the probability of each topic.

For some variants of the classifiers that we construct, we also use the γ parameter, which is document specific. For these classifiers, we may need to run the inference algorithm on the labeled corpus and possibly on the test documents. The γ parameter provides an approximation to the probability of selecting topic i in the document:

p(z_i \mid \gamma) = \frac{\gamma_i}{\sum_{k=1}^{K} \gamma_k}    (1)

4 Classifier Construction

4.1 Bayesian Network

We construct a variant of the naïve Bayes network as shown in Figure 3. Here, w refers to the word, and s refers to the sense of the word. In training, s is observed while in testing, it is not. The features f_1 to f_n are baseline features mentioned in Section 3.2 (including bag-of-words), while z refers to the latent topic that we set for clustering the unlabeled corpus. The bag-of-words b are extracted from the neighbours of w and there are L of them. Note that L can be different from G, which is the number of bag-of-words in the baseline features. Both will be determined by the validation result.

Figure 3: Graphical Model with LDA feature

The log-likelihood ℓ of an instance (w, s, F, b), where F denotes the set of baseline features, can be written as

\ell = \log p(w) + \log p(s \mid w) + \sum_{f \in F} \log p(f \mid s) + \sum_{l=1}^{L} \log \sum_{k=1}^{K} p(z_k \mid s)\, p(b_l \mid z_k).

The log p(w) term is constant and thus can be ignored. The first portion is normal naïve Bayes, and the second portion represents the additional LDA plate. We decouple the training process into three separate stages. We first extract baseline features from the task training data, and estimate, using normal naïve Bayes, p(s|w) and p(f|s) for all w, s and f. The parameters associated with p(b|z) are estimated using LDA from unlabeled data. Finally we estimate the parameters associated with p(z|s). We experimented with three different ways of both doing the estimation as well as using the resulting model and chose one which performed best empirically.

4.1.1 Expectation Maximization Approach

For p(z|s), a reasonable estimation method is to use maximum likelihood estimation. This can be done using the expectation maximization (EM) algorithm. In classification, we just choose the s* that maximizes the log-likelihood of the test instance, where:

s^* = \arg\max_{s} \ell(w, s, F, b)

In this approach, γ is never used, which means the LDA inference procedure is not used on any labeled data at all.

4.1.2 Soft Tagging Approach

Classification in this approach is done using the full Bayesian network just as in the EM approach. However, we do the estimation of p(z|s) differently. Essentially, we perform LDA inference on the training corpus in order to obtain γ for each document. We then use the γ and β to obtain p(z|b) for each word using

p(z_i \mid b_l, \gamma) = \frac{p(b_l \mid z_i)\, p(z_i \mid \gamma)}{\sum_{k=1}^{K} p(b_l \mid z_k)\, p(z_k \mid \gamma)},

where equation [1] is used for the estimation of p(z_i | γ). This effectively transforms b to a topical distribution, which we call a soft tag, where each soft tag is a probability distribution t_1, ..., t_K on topics.
We then use this topical distribution for estimating p(z|s). Let s_i be the observed sense of instance i and t_{ij1}, ..., t_{ijK} be the soft tag of the j-th bag-of-word feature of instance i. We estimate p(z|s) as

p(z_j = k \mid s) = \frac{\sum_{i: s_i = s} t_{ijk}}{\sum_{i: s_i = s} \sum_{k'} t_{ijk'}}    (2)

This approach requires us to do LDA inference on the corpus formed by the labeled training data, but not the testing data. This is because we need γ to get the transformed topical distribution in order to learn p(z|s) in training. In testing, we only apply the learnt parameters to the model.

4.1.3 Hard Tagging Approach

The hard tagging approach no longer assumes that z is latent. After p(z|b) is obtained using the same procedure as in Section 4.1.2, the topic z_i with the highest p(z_i|b) among all K topics is picked to represent z. In this way, b is transformed into the single most "prominent" topic. This topic label is used in the same way as the baseline features for both training and testing in a simple naïve Bayes model.

This approach requires us to perform the transformation on the training as well as the testing data, since z becomes an observed variable. LDA inference is done on two corpora, one formed by the training data and the other by the testing data, in order to get the respective values of γ.

4.2 Support Vector Machine Approach

In the SVM (Vapnik, 1995) approach, we first form a training and a testing file using all standard features for each sense following (Lee and Ng, 2002) (one classifier per sense). To incorporate the LDA feature, we use the same approach as Section 4.1.2 to transform b into soft tags, p(z|b). As SVM deals with only observed features, we need to transform b both in the training data and in the testing data. Compared to (Lee and Ng, 2002), the only difference is that for each training and testing case, we have additional L*K LDA features, since there are L bag-of-words and each has a topic distribution represented by K values.

5 Experimental Setup

We describe here the experimental setup on the English lexical sample task and all-words task.

We use the MXPOST tagger (Adwait, 1996) for POS tagging, the Charniak parser (Charniak, 2000) for extracting syntactic relations, SVMlight(1) for the SVM classifier and David Blei's version of LDA(2) for LDA training and inference. All default parameters are used unless mentioned otherwise. For all standard baseline features, we use Laplace smoothing, but for the soft tag (equation [2]), we use a smoothing parameter value of 2.
1 2 /˜blei/lda-c/

5.1 Development Process

5.1.1 Lexical Sample Task

We use the Senseval-2 lexical sample task for preliminary investigation of different algorithms, datasets and other parameters. As the dataset is used extensively for this purpose, only the Senseval-3 lexical sample task is used for evaluation.

Selecting Bayesian Network. The best achievable result, using the three different Bayesian network approaches, when validating on the Senseval-2 test data is shown in Table 2. The parameters that are used are P = 3 and G = 3.

EM              68.0
Hard Tagging    65.6
Soft Tagging    68.9
Table 2: Results on Senseval-2 English lexical sample using different Bayesian network approaches.
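Before comparing the three approaches further, here is a minimal sketch of the soft-tagging computation of Section 4.1.2, under stated assumptions: β is a K×V topic-word matrix and γ is a document-specific Dirichlet parameter vector, both taken as given from LDA training and inference. The toy vocabulary, the random values, and all function names are illustrative only.

```python
import numpy as np
from collections import defaultdict

def soft_tag(beta, gamma, word_id):
    """p(z_i | b, gamma) proportional to p(b | z_i) * p(z_i | gamma), normalised.

    beta:  K x V matrix, beta[k, w] = p(word w | topic k), from LDA training.
    gamma: length-K vector of variational Dirichlet parameters for the document.
    """
    prior = gamma / gamma.sum()          # p(z | gamma), equation (1)
    unnorm = beta[:, word_id] * prior    # p(b | z) * p(z | gamma)
    return unnorm / unnorm.sum()

def estimate_p_z_given_s(instances, beta):
    """Estimate p(z | s) from soft tags, following equation (2).

    `instances` is a list of (sense, gamma, bag_of_word_ids) triples."""
    K = beta.shape[0]
    totals = defaultdict(lambda: np.zeros(K))
    for sense, gamma, bag in instances:
        for w in bag:
            totals[sense] += soft_tag(beta, gamma, w)
    # Each soft tag sums to one, so normalising the per-sense total
    # reproduces the ratio of sums in equation (2).
    return {s: v / v.sum() for s, v in totals.items()}

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    K, V = 4, 10                                   # 4 topics over a 10-word toy vocabulary
    beta = rng.dirichlet(np.ones(V), size=K)       # assumed LDA topic-word matrix
    instances = [
        ("sense_1", rng.gamma(2.0, 1.0, K), [0, 3, 7]),
        ("sense_2", rng.gamma(2.0, 1.0, K), [1, 4, 4, 8]),
    ]
    print(estimate_p_z_given_s(instances, beta))
```

Each bag-of-word token is mapped to p(z | b, γ) as in the formula above, and p(z|s) is then the normalised sum of these soft tags over all training instances carrying sense s.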
From the results,it appears that both the EM and the Hard Tagging approaches did not yield as good results as the Soft Tagging approach did.The EM approach ignores the LDA inference result,γ,which we use to get our topic prior.This information is document specific and can be regarded as global context information.The Hard Tagging approach also uses less information,as the original topic dis-tribution is now represented only by the topic with the highest probability of occurring.Therefore,both methods have information loss and are disadvan-taged against the Soft Tagging approach.We use the Soft Tagging approach for the Senseval-3lexical sample and the all-words tasks.Unlabeled Corpus Selection The unlabeled cor-pus we choose to train LDA include20News-groups,Reuters,SemCor,Senseval-2lexical sam-ple data and Senseval-3lexical sample data.Al-though the last three are labeled corpora,we only need the words from these corpora and thus they can be regarded as unlabeled too.For Senseval-2and Senseval-3data,we define the whole passage for each training and testing instance as one document.The relative effect using different corpus and com-binations of them is shown in Table3,when validat-ing on Senseval-2test data using the Soft Tagging approach.Corpus|w|K L Senseval-2 20Newsgroups 1.7M406067.9 Reuters 1.3M306065.5 SemCor0.3M306066.9 Senseval-20.6M304066.9 Senseval-30.6M506067.6All 4.5M604068.9 Table3:Effect of using different corpus for LDA training,|w|represents the corpus size in terms of the number of words in the corpusThe20Newsgroups corpus yields the best result if used individually.It has a relatively larger corpus size at1.7million words in total and also a well bal-anced topic distribution among its documents,rang-ing across politics,finance,science,computing,etc. 
The Reuters corpus,on the other hand,focuses heav-ily onfinance related articles and has a rather skewed topic distribution.This probably contributed to its inferior result.However,we found that the best re-sult comes from combining all the corpora together with K=60and L=40.Results for Optimized Configuration As base-line for the Bayesian network approaches,we use na¨ıve Bayes with all baseline features.For the base-line SVM approach,we choose P=3and include all the words occurring in the training and testing passage as bag-of-words feature.The F-measure result we achieve on Senseval-2 test data is shown in Table4.Our four systems are listed as the top four entries in the table.Soft Tag refers to the soft tagging Bayesian network ap-proach.Note that we used the Senseval-2test data for optimizing the configuration(as is done in the ASO result).Hence,the result should not be taken as reliable.Nevertheless,it is worth noting that the improvement of Bayesian network approach over its baseline is very significant(+5.5%).On the other hand,SVM with topic features shows limited im-provement over its baseline(+0.8%).Bayes(Soft Tag)68.9SVM-Topic66.0SVM baseline65.2NB baseline63.4ASO(best configuration)(Ando,2006)68.1Classifier Combination(Florian,2002)66.5Polynomial KPCA(Wu et al.,2004)65.8SVM(Lee and Ng,2002)65.4Senseval-2Best System64.2 Table4:Results(best configuration)compared to previous best systems on Senseval-2English lexical sample task.5.1.2All-words TaskIn the all-words task,no official training data is provided with Senseval.We follow the common practice of using the SemCor corpus as our training data.However,we did not use SVM approach in this task as there are too few training instances per sense for SVM to achieve a reasonably good accuracy. As there are more training instances in SemCor, 230,000in total,we obtain the optimal configura-tion using10fold cross validation on the SemCor training data.With the optimal configuration,we test our system on both Senseval-2and Senseval-3 official test data.For baseline features,we set P=3and B=1.We choose a LDA training corpus comprising20News-groups and SemCor data,with number of topics K =40and number of LDA bag-of-words L=14.6ResultsWe now present the results on both English lexical sample task and all-words task.6.1Lexical Sample TaskWith the optimal configurations from Senseval-2, we tested the systems on Senseval-3data.Table5 shows our F-measure result compared to some of the best reported systems.Although SVM with topic features shows limited success with only a0.6% improvement,the Bayesian network approach has again demonstrated a good improvement of3.8% over its baseline and is better than previous reported best systems except ASO(Ando,2006).Bayes(Soft Tag)73.6SVM-topic73.0SVM baseline72.4NB baseline69.8ASO(Ando,2006)74.1SVM-LSA(Strapparava et al.,2004)73.3Senseval-3Best System(Grozea,2004)72.9 Table5:Results compared to previous best systems on Senseval-3English lexical sample task.6.2All-words TaskThe F-measure micro-averaged result for our sys-tems as well as previous best systems for Senseval-2 and Senseval-3all-words task are shown in Table6 and Table7respectively.Bayesian network with soft tagging achieved2.6%improvement over its base-line in Senseval-2and1.7%in Senseval-3.The re-sults also rival some previous best systems,except for SMUaw(Mihalcea,2002)which used additional labeled data.Bayes(Soft Tag)66.3 NB baseline63.7 SMUaw(Mihalcea,2002)69.0 Simil-Prime(Kohomban and Lee,2005)66.4 Senseval-2Best System63.6 
(CNTS-Antwerp(Hoste et al.,2001))Table6:Results compared to previous best systems on Senseval-2English all-words task.Bayes(Soft Tag)66.1 NB baseline64.6 Simil-Prime(Kohomban and Lee,2005)66.1 Senseval-3Best System65.2 (GAMBL-AW-S(Decadt et al.,2004))Senseval-32nd Best System(SenseLearner64.6 (Mihalcea and Faruque,2004))Table7:Results compared to previous best systems on Senseval-3English all-words task.6.3Significance of ResultsWe perform theχ2-test,using the Bayesian network and its na¨ıve Bayes baseline(NB baseline)as pairs,to verify the significance of these results.The result is reported in Table8.The results are significant at 90%confidence level,except for the Senseval-3all-words task.Senseval-2Senseval-3 All-word0.05270.2925Lexical Sample<0.00010.0002Table8:P value forχ2-test significance levels of results.6.4SVM with Topic FeaturesThe results on lexical sample task show that SVM benefits less from the topic feature than the Bayesian approach.One possible reason is that SVM base-line is able to use all bag-of-words from surround-ing context while na¨ıve Bayes baseline can only use very few without decreasing its accuracy,due to the sparse representation.In this sense,SVM baseline already captures some of the topical information, leaving a smaller room for improvement.In fact,if we exclude the bag-of-words feature from the SVM baseline and add in the topic features,we are able to achieve almost the same accuracy as we did with both features included,as shown in Table9.This further shows that the topic feature is a better rep-resentation of global context than the bag-of-words feature.SVM baseline72.4SVM baseline-BAG+topic73.5SVM-topic73.6Table9:Results on Senseval-3English lexical sam-ple task6.5Results on Different Parts-of-SpeechWe analyse the result obtained on Senseval-3En-glish lexical sample task(using Senseval-2optimal configuration)according to the test instance’s part-of-speech,which includes noun,verb and adjec-tive,compared to the na¨ıve Bayes baseline.Ta-ble10shows the relative improvement on each part-of-speech.The second column shows the number of testing instances belonging to the particular part-of-speech.The third and fourth column shows the0.640.6450.650.6550.660.6650.670.6750.680123456789LFigure 4:Accuracy with varing L and K on Senseval-2all-words taskaccuracy achieved by na¨ıve Bayes baseline and the Bayesian network.Adjectives show no improve-ment while verbs show a moderate +2.2%improve-ment.Nouns clearly benefit from topical informa-tion much more than the other two parts-of-speech,obtaining a +5.7%increase over its baseline.POS Total NB baselineBayes (Soft Tag)Noun 180769.575.2Verb 197871.173.5Adj 15957.257.2Total394469.873.6Table 10:Improvement with different POS on Senseval-3lexical sample task 6.6Sensitivity to L and KWe tested on Senseval-2all-words task using differ-ent L and K.Figure 4is the result.6.7Results on SemEval-1We participated in SemEval-1English coarse-grained all-words task (task 7),English fine-grained all-words task (task 17,subtask 3)and English coarse-grained lexical sample task (task 17,subtask 1),using the method described in this paper.For all-words task,we use Senseval-2and Senseval-3all-words task data as our validation set to fine tune the parameters.For lexical sample task,we use the training data provided as the validation set.We achieved 88.7%,81.6%and 57.6%for coarse-grained lexical sample task,coarse-grained all-words task and fine-grained all-words task respec-tively.The results ranked first,second and fourth in the three 
tasks respectively.7Conclusion and Future WorkIn this paper,we showed that by using LDA algo-rithm on bag-of-words feature,one can utilise more topical information and boost the classifiers accu-racy on both English lexical sample and all-words task.Only unlabeled data is needed for this improve-ment.It would be interesting to see how the feature can help on WSD of other languages and other nat-ural language processing tasks such as named-entity recognition.ReferencesY .K.Lee and H.T.Ng.2002.An Empirical Evaluation of Knowledge Sources and Learning Algorithms for Word Sense Disambiguation.In Proc.of EMNLP .B.Snyder and M.Palmer.2004.The English All-Words Task.In Proc.of Senseval-3.U.S.Kohomban and W.S.Lee 2005.Learning Semantic Classes for Word Sense Disambiguation.In Proc.of ACL .R.K.Ando.2006.Applying Alternating Structure Op-timization to Word Sense Disambiguation.In Proc.of CoNLL .Y .S.Chan and H.T.Ng 2005.Scaling Up Word Sense Disambiguation via Parallel Texts.In Proc.of AAAI .R.K.Ando and T.Zhang.2005a.A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data.Journal of Machine Learning Re-search .R.K.Ando and T.Zhang.2005b.A High-Performance Semi-Supervised Learning Method for Text Chunking.In Proc.of ACL .P.Resnik and D.Yarowsky.1997.A Perspective on Word Sense Disambiguation Methods and Their Eval-uation.In Proc.of ACL .D.M.Blei and A.Y .Ng and -tent Dirichlet Allocation.Journal of Machine Learn-ing Research .A.Ratnaparkhi1996.A Maximum Entropy Model forPart-of-Speech Tagging.In Proc.of EMNLP.E.Charniak2000.A Maximum-Entropy-InspiredParser.In Proc.of the1st Meeting of the North Ameri-can Chapter of the Association for Computational Lin-guistics.V.N.Vapnik1995.The Nature of Statistical Learning Theory.Springer-Verlag,New York.R.Florian and D.Yarowsky2002.Modeling consensus: Classifier Combination for Word Sense Disambigua-tion.In Proc.of EMNLP.D.Wu and W.Su and M.Carpuat.2004.A Kernel PCAMethod for Superior Word Sense Disambiguation.In Proc.of ACL.C.Strapparava and A.Gliozzo and C.Giuliano2004.Pattern Abstraction and Term Similarity for Word Sense Disambiguation:IRST at Senseval-3.In Proc.of Senseval-3.C.Grozea2004.Finding Optimal Parameter Settingsfor High Performance Word Sense Disambiguation.In Proc.of Senseval-3.R.Mihalcea2002.Bootstrapping Large Sense Tagged Corpora.In Proc.of the3rd International Conference on Languages Resources and Evaluations.V.Hoste and A.Kool and W.Daelmans2001.Classifier Optimization and Combination in English All Words Task.In Proc.of Senseval-2.B.Decadt and V.Hoste and W.Daelmans2004.GAMBL,Genetic Algorithm Optimization of Memory-Based WSD.In Proc.of Senseval-3.R.Mihalcea and E.Faruque2004.Sense-learner:Mini-mally Supervised Word Sense Disambiguation for All Words in Open Text.In Proc.of Senseval-3.。
人工智能基础复习题与答案1、以下关于词性标注的描述错误的是:A、词性是词汇基本的语法属性,通常称为词类。
B、词性标注是在给定句子中判断每个词的语法范畴,确定其词性并加以标注的过程。
C、通常将词性标注作为序列标注问题来解决。
D、词性标注最主流的方法是从预料库中统计每个词对应的高频词性, 将其作为默认的词性。
答案:D2、以下哪个数据集常备用于信息检索任务A、MNISTB、ImageNetC、TRECD、IMDB-Face答案:C3、下面哪个是NLP用例A、从图像中检测物体B、面部识别C、语音生物识别D、文本摘要答案:D4、自然语言处理能在以下哪些领域发挥作用A、自动文本摘要B、自动问答系统C、信息检索D、以上所有答案:D5、对“The kid runs”使用ngram后得到“The kid”,“kid runs”A、UnigramB、BigramC、TrigramD、Quadrigrams答案:B6、关于CTC最佳路径解码说法错误的是A、通过在每个时间步中选择最可能的字符来计算最佳路径B、它先删除重复的字符,再从路径中删除所有空格C、它先从路径中删除所有空格,再删除重复的字符D、它可以直接用于表示已识别的文本答案:C7、“在KBQA中,设计问题回复模板可以用来生成自然语言的回复”是正确的吗?A、正确B、错误答案:A8、以下关于逻辑表达式的说法错误的是:A、逻辑表达式是区别于语义解析方法与模板匹配方法的根本差异。
B、逻辑表达式不适用于知识库的结构化查询方式。
C、逻辑表达式适合查找知识库中的实体及实体关系等信息。
D、逻辑表达式具备逻辑运算能力以及将原子级别的逻辑表达式组合成更复杂的逻辑表达形式的能力。
答案:B9、以下哪种词向量模型为静态词向量模型,且使用了全局统计信息进行模型训练A、ONE-HOTB、Word2vecC、GloVeD、ELMo答案:C10、Faster RCNN中用于区分前景背景和修正proposals的组件是什么?A、VGGB、RPNC、Roi PoolingD、Classifier答案:B11、MNIST数据集包含内容以及建立时间A、手写数字识别,2013B、手写数字识别,2011C、标准字符识别,2013D、标准字符识别,2011答案:A12、BERT预训练任务中,有关N-gram掩码和原始掩码语言模型(MLM)的难度关系,下列哪个描述是正确的A、难度一样B、N-gram masking比MLM难C、MLM比N-gram masking难D、无法比较答案:B13、以下哪种单词表示方法仅使用了局部共现信息A、ONE-HOTB、Word2vecC、GloVeD、ELMo答案:B14、以下可以对弯曲文本进行检测的方法为?A、TextBoxes++算法B、EAST算法C、CTD算法D、MOST算法答案:C15、以下哪个是BERT中的掩码标记A、CLS]B、SEP]C、MASK]D、TAG]答案:C16、以下哪个NLP工具包处理速度最快A、NLTKB、CoreNLPC、LTPD、HanLP答案:D17、卷积核通常写成()形式。
ISSN 1000-9825, CODEN RUXUEW E-mail: jos@Journal of Software, Vol.20, No.8, August 2009, pp.2138−2152 doi: 10.3724/SP.J.1001.2009.03566 Tel/Fax: +86-10-62562563© by Institute of Software, the Chinese Academy of Sciences. All rights reserved.∗无监督词义消歧研究王瑞琴1,2+, 孔繁胜11(浙江大学人工智能研究所,浙江杭州 310027)2(温州大学物理与电子信息工程学院,浙江温州 325035)Research on Unsupervised Word Sense DisambiguationWANG Rui-Qin1,2+, KONG Fan-Sheng11(Artificial Intelligence Institute, Zhejiang University, Hangzhou 310027, China)2(College of Physics & Electronic Information Engineering, Wenzhou University, Wenzhou 325035, China)+ Corresponding author: E-mail: angelwrq@Wang RQ, Kong FS. Research on unsupervised word sense disambiguation. Journal of Software, 2009,20(8):2138−2152. /1000-9825/3566.htmAbstract: The goal of this paper is to give a brief summary of the current unsupervised word sensedisambiguation techniques in order to facilitate future research. First of all, the significance of unsupervised wordsense disambiguation study is introduced. Then, key techniques of various unsupervised word sense disambiguationstudies at home and abroad are reviewed, including data sources, disambiguation methods, evaluation system andthe achieved performance. Finally, 14 novel unsupervised word sense disambiguation methods are summarized, andthe existing research and possible direction for the development of unsupervised word sense disambiguation studyare pointed out.Key words: word sense disambiguation; unsupervised word sense disambiguation; natural language processing;semantic understanding摘要: 研究的目的是对现有的无监督词义消歧技术进行总结,以期为进一步的研究指明方向.首先,介绍了无监督词义消歧研究的意义.然后,重点总结分析了国内外各类无监督词义消歧研究中的各项关键技术,包括使用的数据源、采用的消歧方法、评价体系以及达到的消歧效果等方面.最后,对14个较有特色的无监督词义消歧方法进行了总结,并指出无监督词义消歧的现有研究成果和可能的发展方向.关键词: 词义消歧;无监督词义消歧;自然语言处理;语义理解中图法分类号: TP391文献标识码: A∗Supported by the Zhejiang Provincial Natural Science Foundation of China under Grant No.Y1080372 (浙江省自然科学基金)Received 2008-07-06; Accepted 2009-01-14王瑞琴 等:无监督词义消歧研究21391 词义消歧(word sense disambiguation )的基础知识及研究意义1.1 词义消歧的定义词汇的歧义性是自然语言的固有特征.词义消歧根据一个多义词在文本中出现的上下文环境来确定其词义,作为各项自然语言处理的基础步骤和必经阶段被提出来.所谓的词义消歧是指根据一个多义词在文本中出现的上下文环境来确定其词义.形式化地,令词语w 具有n 个词义,w 在特定的上下文环境C 里只有S ′是正确的词义,词义消歧的任务就是在这n 个词义中确定词义S ′.每个词义S K 和上下文C 都存在或强或弱的联系,记为R (S K ,C ),其中S ′与上下文C 的关系应当是最强的.词义消歧技术通过分析和计算W 出现的上下文C 和每个词义S K 之间的关系R ,排除干扰词义,最后确定S ′.整个过程可用下面的公式来描述:arg max (,)K S' R S C = (1) 上下文中的某些词语限定了多义词的词义,正是这些词的存在,可以帮助人们迅速地去推理和判断,最终得到答案.自动词义消岐研究的是机器模拟人类思维的过程,在上下文中收集重要的语义信息,提取特征词语来指导对多义词的歧义消解.词义消歧问题曾一度被认为是一个计算机无法攻克的难题[1],致使从那以后的一段时间里,研究人员逐渐放弃了对词义消歧的研究.但随着计算技术的飞速发展,超大容量的存储设备和具有强大计算能力的多核处理器相继出现,包括词义消歧在内的自然语言处理领域的各种问题研究一一复苏,并进入了崭新的发展阶段,词义消歧逐渐成为计算语言学和自然语言处理领域中的一个重要研究课题,也是近些年来该领域的热点研究问题之一.1.2 词义消歧的分类每个分类问题都会根据分类依据的不同而得到不同的分类结果,词义消歧也不例外.根据消歧知识来源的不同,词义消歧方法可分为基于知识的方法和基于统计的方法,基于知识的消歧一般又细分为基于规则的方法和基于词典的方法.基于知识库的消歧方法主要是依赖语言学专家的语言知识构造知识库,通过分析多义词所在的上下文,选择满足一定规则的义项.知识库的类型包括专家规则库、词典、本体、知识库等.基于统计的方法则以大型语料库为知识源,从标注或未标注词义的语料中学习各种不同的消歧特征,进而用于词义消歧.按照消歧过程有无指导,词义消歧分为有导消歧和无导消歧.前者利用已标注了词义的大型语料库来提取特定词义的特征属性,利用机器学习方法生成分类器或分类规则对新实例进行词义判定;后者则从原始的数据文集或机器可读字典中获取词义的相关特征,对新实例进行词义判定.所以,有指导的词义消歧常被看作词义分类问题,无指导词义消歧被看作聚类问题.按照消歧结果的评价体系,词义消歧分为独立型评估和应用型评估.独立型评估是指不依赖于应用领域,使用一组标准的测试集,独立评价词义消歧性能.应用型评估不单独地评价词义消歧的效果,而是考察其对实际自然语言处理系统最终目标的贡献,比如,词义消歧在机器翻译系统中对翻译性能的影响、在信息检索中对搜索性能的改善情况等等.1.3 词义消歧研究的意义词义消歧是对词的处理,属于自然语言理解的底层研究,在许多高层次的研究和应用上,词义消歧都大有用武之地.词义消歧并不是自然语言处理的最终目的,而是自然语言处理中不可缺少的一个环节,歧义问题的解决将会带动至少下列自然语言处理领域的新进展:• 机器翻译:在机器翻译中,要让计算机进行准确的译文选择,一个重要的前提条件就是能够在某个特定上下文中自动排除歧义,确定多义词的词义.所以,词义消歧从50年代初期开始机器翻译研究起就一直备受计算语言学家的关注.• 
信息检索:一个拼写正确的词汇通常包含许多词义,在特定的查询上下文中,很多词义是不相关的.在一个特定的查询中,用户只对其中一个词义感兴趣,因此只需检索和那个词义相关的文档,而当前基于关键字的搜索引擎就面临检索包含相关词义文档而过滤掉无关词义文档的大难题.据统计,在信息检索中引2140 Journal of Software软件学报 V ol.20, No.8, August 2009入部分多义词消歧技术以后,可使其整个系统的正确率由29%提高到34.2%,取得较为明显的改善.• 主题内容分析和文本处理:如文本分类、信息抽取、自动文摘和辅助写作等文本处理任务,只有对文本中的多义词进行消歧,明确单词所表示的概念,才能正确分析文本及句子的概念和主题.• 语音处理和文语转换:这类任务往往同时涉及语音和文字的处理,语音识别中同音字的识别、语音合成中语音的校正以及文字的处理都离不开词义消歧.• 语法分析或句法分析:帮助解决语法的歧义问题,降低语法分析难度,改善语法分析效果.总之,词义消歧是计算语言学和自然语言处理领域的基础研究课题,提高词义消歧的研究水平,提供高质量的词义消歧技术,对包括机器翻译、信息检索、文本分类等在内的众多研究领域都会有重要的推动作用.2 无监督词义消歧方法概述无监督词义消歧按照消歧数据源的不同分为基于知识的方法和基于统计的方法两大类.本节将分门别类地讨论当前国内外各类主流的无监督词义消歧方法,从消歧过程中使用的数据源、采用的消歧技术、评估体系和消歧效果4个方面进行阐述,研究各类消歧方法使用的关键技术及其消歧性能,指出各自的优缺点及改进方案,特别地,对那些具有代表性的消歧算法将进行详细论述.2.1 基于知识的无监督词义消岐基于知识的无导词义消岐进一步被划分为基于规则的方法和基于词典的方法.早期人们所使用的词义消歧知识一般是凭人手工编制的规则,由于手工编写规则费时、费力,存在严重的知识获取的瓶颈问题,20世纪80年代以后,语言学家提供的各类词典成为人们获取词义消歧知识的一个重要知识源.2.1.1 基于机读词典的词义消歧机读词典提供了有关词汇用法及词义描述的丰富知识,是早期词义消歧的主要知识来源.最早利用机器可读字典实现无监督词义消歧的研究始于1986年的Lesk方法[2].Lesk利用词典中词义的解释或定义来指导多义词在上下文中的词义判定.该方法简单易行,只需计算多义词的各个词义在词典中的定义与多义词上下文词语的定义之间的词汇重叠度,选择重叠度最大的词义作为其正确的词义即可.Lesk分别用3个机器可读词典(Webster’s 7th Collegiate,Collins English Dictionary和Oxford Advanced Learner’s Dictionary of Current English)对一组多义词实例进行了词义消歧测试,正确率在50%~70%之间.随着Lesk方法的提出,无监督词义消歧逐渐流行起来.研究者对Lesk方法进行了各种改良,总体思想是进一步扩展词义的定义描述,使得词汇重叠的几率增加.Wilks[3]对Longman字典(Longman Dictionary of Contemporary English,简称LDOCE)中每个词义的定义添加了与其定义词汇同现频率较高的其他词汇(同现频率的高低使用该词典的所有定义条目统计得到),如此将词典中的所有定义进行了扩展之后,大大提高了定义词汇重叠的概率.Pook等人[4]提出一种改进方案,对上下义词语进行同义词扩展,从而扩大了上下文窗口的大小.实验结果表明此方法可以增加词义消歧的覆盖率.Dagan[5]和Gale[6]则利用双语对照词典来帮助多义词消歧.2.1.2 基于义类词典的词义消歧义类词典的编排与传统词典有很多不同之处.它是按照词语含义编纂的辞典,把相类似的词语放在相同的目录下,使得查找同类或同义词更加方便、快捷.义类词典有助于我们提高用词的准确性.Roget’s Thesaurus[7]和WordNet[8]是常用的英语义类词典.Yarowsky(1994)[9]和卢志茂等人[10]利用Roget’s词典进行词义消歧;Voorhees[11]和Resnik[12]从不同角度利用WordNet中的上下位关系、同义关系进行英语的词义消歧探索.《同义词词林》[13]和知网(HowNet)[14]是最常用的汉语义类资源.汉语词义消歧研究从20世纪90年代以后才开始.陈浩等人[15,16]使用HowNet作为知识源,利用聚类技术进行词义消歧.李涓子[17]和中国科学院计算技术研究所的鲁松[18]都采用《同义词词林》进行无指导的词义消歧,李涓子在大规模语料库中自动获取任意同义词集中单义词的同现实词,按照同现实词的词义分辨能力对它们加权,构成词义分类器,实现一种代价最小的无指导学习算法;鲁松则把待消歧的多义词的上下文视为查询,把与该多义词某个义项具有相同、相似或相关语义范畴的词语的上下文视为文档,从而用信息检索中的向量空间模型来解决词义消歧问题.王瑞琴等:无监督词义消歧研究21412.1.2.1 WordNet简介WordNet是从1995年开始,在普林斯顿(Princeton)大学认知科学实验室(Cognitive Science Laboratory)的心理学教授George A. 
Miller的指导下,由Princeton大学的心理学家、语言学家和计算机工程师联合设计的一种基于认知语言学的在线英语词典.由于它包含了语义信息,所以有别于通常意义下的字典.WordNet根据词条的意义将它们分组,那些具有相同意义的词条称为同义词集(synset),每个synset代表一个潜在的概念(concept),一个多义词将出现在与其各词义对应的多个同义词集中.它不仅把单词以字母顺序排列,而且按照单词的意义组成一个“单词的网络”.WordNet的开发有两个目的:其一,它既是一个词典,又是一个辞典,从直觉上讲,它比单纯的词典或辞典更加有用;其二,支持自动的文本分析以及人工智能应用.WordNet是完全免费的资源,其数据库及相应的软件工具的开发都遵从BSD许可协议,可以自由地下载和使用,亦可在线查询和使用.WordNet已经在英语语言处理的研究中得到了广泛应用,几乎成了英语语言知识库的标准.2.1.2.2 基于WordNet的词义消歧WordNet中包含了丰富的语义知识,包括词义的定义描述、使用实例、结构化的语义关系、词频信息等,所有这些信息都可以用于词义消岐.(1) 基于定义描述的词义消岐WordNet为其中的每个同义词集都提供了简短、概要的定义描述和使用实例,如bus#1: autobus, coach, charabanc, double-decker, jitney, motorbus, motorcoach, omnibus, passenger vehicle--(a vehicle carrying many passengers; used for public transport; “he always rode the bus to work”).Satanjeev[19]使用WordNet来代替传统的机读词典,对原始Lesk算法进行改进,提出了Adapted Lesk算法.该算法使用WordNet中的多种语义关系来扩展词义的定义描述.在计算两个同义词集的相关度时,不同于原始Lesk算法的简单计数,Satanjeev对多字短语分配了较高的权重(短语字数的平方),以突出其在相关性判断中的重要性.Satanjeev使用了一种全局消歧策略,即消歧时不是独立确定每个词汇的词义,而是以整个句子作为处理单位,对每种词义的不同组合,计算整体相关性,取相关性最高的组合中的各个词义作为相应词汇的词义.使用WordNet 1.7和Senseval-2 lexical sample数据集的消歧测试结果为noun:32.2%,verb:24.9%,adjective:46.9%,平均消歧精度为32.3%.Chen和Yin提出了AALesk 算法[20],是对Adapted Lesk算法的进一步优化.在消歧的过程中,AALesk算法考虑了WordNet中定义的全部语义关系,并根据各自的重要性分别为每种关系分配权重,从而使消歧过程不仅依赖于统计理论,而且以一种语义指导的方式进行.对于一个词汇的多个词义,Chen等人采用平行执行多个AALesk算法来消除那些无关的词义组合,进而加快消歧速度.使用WordNet 1.7和Senseval-2 lexical sample数据集的测试结果为noun:32.6%, verb:25.1%,adjective:47.2%,平均消歧精度为32.6%;使用WordNet 2.0和Senseval-2 lexical sample数据集的测试结果为noun:33.3%,verb:26.2%,adjective:47.5%,平均消歧精度为33.4%.Ledo-Mezquita[21]提出合并Lesk方法和大型词汇资源库(如同义词词典、WordNet本体)进行词义消歧的方法.对一个多义词汇,首先用Lesk方法计算其每个词义的值;然后通过同义词词典或WordNet本体找到每个词义的同义词、上义词、下义词等相关词,再利用Lesk方法计算这些词义的值;最后将这些词义值与其各自的权重相乘,再与源词义值加权求和,得到最终的词义值,词义值最大者为岐义词的确切词义.使用共包含4 287词义的872个词语的测试集得到63%的消歧精度,而使用相同的数据集,原始Lesk算法的消歧精度仅为50%.(2) 基于概念区域密度的词义消岐基于概念区域密度的词义消岐的基本思想是:将本体层次结构根据待消歧词的各个词义划分成互相独立的子结构,每个子结构以歧义词的每个词义作为根节点,分别计算各子结构的概念密度来判定目标词的词义.Agirre[22]最先提出使用概念密度进行词义判断,充分利用概念间的概念距离生成定义良好的概念密度公式,基于WordNet中覆盖广泛的名词层次结构对名词进行词义判断.该方法是一个完全自动化的无监督词义消歧方法,给定处于子结构根节点处的词义概念c,nhyp和h分别表示子结构中节点包含的下义概念节点的平均数和子结构的高度,当此子结构中包含m个目标词与上下文词汇的词义时,c的概率密度为2142Journal of Software 软件学报 V ol.20, No.8, August 2009 1100(,)=m h ii i CD c m nhyp nhyp −−==∑∑i (2) 公式(2)中的分子表示包含m 个词义标记的子结构的估计区域大小,分母表示子结构区域的真实大小.利用SemCor 数据集进行测试得到了76.04%的消歧精度和23.21%的召回率.Rosso [23]对原始的概念密度方法进行扩展,提出了一种计算概念密度的新方法,对于公式(2)中考虑的每个节点的平均子节点个数,WN1.6与WN1.4中的不一致,所以Rosso 决定只考虑子结构中由目标词和上下文词汇的词义路径决定的相关分支,而忽略那些无关的分支,提出使用子结构中包含的相关词义个数与子结构中的词义总数的比值计算概率密度的基础公式;Rosso 发现如果不考虑词义的词频信息,概率密度法有可能会选择低频词义,在大多数情况下这是错误的,这是由多义词词义使用的严重偏斜决定的,所以他将词频信息添加到基础公式中对消歧进行调节;另外,为了提高包含相关词义较多的子结构的权重,还增加了词汇重叠调节因子;最后,考虑到位于越低层次的子结构中的词义粒度就越细、词义间的相似性越大,将概率密度函数作为子结构深度的增函数,称这种情况为聚类深度关联(CDC).调整后的概率密度公式为log (,,,)(/)(()1)α f βCD M nh f depth M M nh depth cl avgdepth =××−+ (3)其中M 表示子结构中包含的相关词义个数,nh 表示子结构包含的词义总数,f 是子结构对应的词义在WordNet中的词频,cl 为子结构的根节点,depth (cl )为子结构的深度,avgdepth 为平均子结构深度,α和β为调节因子.利用SemCor 数据集,不采用任何相关调节技术,当上下文窗口大小为2时,消歧效果最佳,得到81.48%的精度、60.17%的召回率和73.18%的覆盖率;使用了CDC 技术后,效果未见明显改善.将上下文窗口增大到4时,召回率和覆盖度有所提高,分别为61.27%和77.87%,当上下文窗口增大到6时,召回率仍保持在60%左右,精度有些微降低,但覆盖率得到的明显的增加.Davide [24]在Rosso 的基础上,对概念密度方法作了进一步的改进.Davide 考虑到词汇的领域信息对消歧的影响,对消歧公式作了进一步修改,在概念密度公式中添加了互领域权重(mutual domain weight,简称MDW)调和项.在子结构中,如果某词汇与目标词汇具有相同的领域属性,则它们的互领域权重为二者领域权重(权重与频率成正比)的乘积,考虑到WordNet Domains 中的Factotum 领域过于一般化,当词汇对的领域属性均为Factotum 时,其互领域权重要降低一个数量级(×10−1):(4)log 01(,,,,)=(/)(,)|C|k α f f ij i j CD M nh w f C M M nh MDW w c ==×+∑∑ 10,if ()()(,)11,if ()=()() 10(11),if ()()() f ij f ij f ij f f ij f Dom w Dom cMDW w c /f /j Dom w Dom c Dom w Factotum /f /j Dom w Dom c Dom w Factotum −⎧≠⎪=×∧≠⎨⎪××=∧=⎩ (5) 其中,C 
表示上下文词汇向量,k 为上下文词汇c i 的词义个数,c ij 表示词汇c i 的第j 个词义.利用SemCor 数据集进行测试得到了78.33%的消歧精度和62.60%的召回率,如果在计算领域相关性时不考虑杂类,则得到的精度和召回率分别为80.70%和59.08%,由此可见,尽管Factotum 领域没有对整体的消歧任务提供有用的信息,但是由于采用了与词义频率成比例的权重分配方案,Factotum 有助于对大量使用常用词义的名词进行消歧.鉴于早期Lesk 利用定义描述文本进行词义消歧的思路,Davide 也考虑用词汇在WordNet 中的定义描述来进一步改善消歧精度,在概念密度公式中添加了定义描述权重(gloss weight,简称GM)调和项:log 01(,,,,)(/)(,)|C|kα f f ij i i CD M nh w f C M M nh GW w c ===×+∑∑ (6) (7) 0,if ()(,)03,if ()i f i i f c Gl w GW w c .c Gl w ∉⎧⎪=⎨∈⎪⎩f ““” ” 其中,c i 表示第i 个上下文,w f 表示待消歧词的第f 个词义,Gl (w f )函数用于返回w f 的定义描述中的非停用词.由于WordNet 中关于词义的描述包括两部分:定义描述和实例描述,相应地定义了两个定义描述权重函数Gl d (x )和Gl s (x )分别返回各部分中的非停用词.利用SemCor 数据集进行的测试表明,利用定义实例进行消歧的精度王瑞琴等:无监督词义消歧研究2143(80.12%)比利用定义描述本身(79.85%)或综合定义和实例的消歧精度(79.31%)更高,所以他们决定使用另外的机器可读词典来扩展WordNet的定义实例部分,文献[24]中采用了Cambridge Advanced Learner’s Dictionary (CALD).利用Senseval-3 All-Words Task 数据集进行测试,该方法达到了业界最好的消歧效果(79.78%),然而利用Senseval-3 AWT的测试结果却降低了大约10个百分点(64.72%),原因在于,新加入定义描述中的词汇存在偏题现象,提高外来词典与WordNet定义匹配时的阈值可以解决这个问题.(3) 基于结构化语义关系的图论式词义消岐WordNet中各同义词集之间通过各种语义关系产生互连.这些语义关系包括:上下位关系(hypernym/ hyponym)、组成成分关系(meronym/part-of)、相似关系(synonym)、反义关系(antonym)等.其中,上下位关系是最常用的语义关系,它将WordNet中的同义词集组织成树状的概念层次体系结构.图论(graph theory)是数学的一个分支,以图为研究对象,用点代表事物,用连接两点的线表示相应的两个事物间具有某种关系,当人们的研究对象在结构上具有内在的、结构化的联系时,常常可以采用图论的方法解决问题.Mihalcea[25]提出一种在描述语义依赖性的图中使用随机行走策略进行词汇序列消歧的方法.首先将待消歧的词汇序列通过一个语义连接图表示出来,图中的顶点为词汇的语义标签,边为语义节点之间的语义依赖关系;然后在此图中运行一个随机行走算法,使用PageRank算法迭代计算每个节点的重要性值;最后算法将收敛到一组节点标签的静态概率值,这些值用于为词汇序列中的每个词汇确定其最可能的词义.该方法仅使用词典定义,达到了54.2%的消歧精度,远远超过了在这之前的同类方法[26−30].其主要原因在于,在词汇序列标注的过程中考虑了序列词义之间的整体依赖性.结构化语义互连(structural semantic interconnection,简称SSI)[31]是目前效果最佳的半监督词义消歧技术之一,它属于基于知识的结构化模式识别问题.首先,为消歧词和上下文词汇的每个词义建立一个结构化的图形描述,词义描述的信息来自于多个数据源(WordNet 2.0,Domain labels,WordNet glosses,WordNet usage examples,Dictionaries of collocations),各数据源之间通过人工或自动化的方法进行集成;然后,用一套语法规则来生成各种有意义的关联模式,并为每个关联模式分配权重;最后根据歧义词的每个词义及其上下文信息与这些规则的匹配情况进行消歧.需要指出的是,原SSI方法中的数据源包括了标注文集(SemCor,LDC-DSO),去掉这一数据源之后,SSI才可称得上是一项无监督的词义消歧技术.SSI算法的执行是一个迭代的过程,初始化输入是一系列同现词汇T、未决词汇集P和相关词义I(I中包含单义词的词义或固定词义,在没有单义词或固定词义的情况下,对歧义度最小的词汇的词义作初始假设,然后这一过程被克隆执行m 次);在迭代阶段,将P中的某元素t确定词义并将其移到I中,要求t的至少1个词义与I中的词义存在语义关联,当P为空或某次迭代P中元素的个数未减时算法停止,修改后的I为算法的输出.SSI性能的测试采用了独立型测试和应用型测试相结合的方式.在独立型测试中,采用了测试集Senseval-3 All-Words Task,实验结果得到60.40%的消歧精度和60.40%的召回率,在无监督词义消歧的参与测试者中达到了最高精度;在应用型测试中,将SSI算法应用到多种语义消歧问题中,包括自动本体构建、文本的语句排列消歧和定义描述词汇的消歧.在自动本体学习的任务中,分别对4个领域的本体学习进行测试,得到了56%~88%的消歧精度;在语句排列任务的消歧实验中,SSI达到了86.84%的精度和82.58%的召回率,且随着上下文窗口的增大,精度和召回率都有所提高,因为增大窗口大小意味着包括了更多的语义互连信息;在Senseval-3 Gloss定义描述的消歧实验中,SSI达到了非常高的消歧精度(82.6%),但召回率相对较低(32.3%),在同任务竞争者中名列第二.Agirre[32]提出了两种图论式算法HyperLex和PageRank用于词义消歧.首先,以大型文集为数据源构建目标词汇的同现图,图中节点表示与目标词同现的词汇,两个词汇在同一个文本段落中出现视为同现,如果图中的两个节点对应的词汇出现在同一个文本段落中,则将相应的节点用边连接起来,根据词汇共现的相对频率为边赋权值.然后,采用HyperLex和PageRank算法找出图中的hub节点.对于HyperLex算法,在每一步中,算法寻找图中相对频率最高的节点.如果找到的节点大于给定的阈值,就将其作为hub.当某个节点被作为hub后,其邻居节点就失去了作为hub的资格.当找到的节点的相对频率低于规定阈值时算法终止.也可以使用PageRank算法在共现图中寻找hub节点.PageRank是一个迭代式的算法,它使用随机行走策略标注图形中所有节点的page rank 值.节点的page rank值不仅与推荐的节点个数有关,而且也与推荐节点本身的page rank值相关.一旦代表目标。
基于序列标注的全词消歧方法周云;王挺;易绵竹;张禄彭;王之元【摘要】全词消歧(All-Words Word Sense Disambiguation)可以看作一个序列标注问题,该文提出了两种基于序列标注的全词消歧方法,它们分别基于隐马尔可夫模型(Hidden Markov Model,HMM)和最大熵马尔可夫模型(Maximum Entropy Markov Model,MEMM).首先,我们用HMM对全词消歧进行建模.然后,针对HMM只能利用词形观察值的缺点,我们将上述HMM模型推广为MEMM模型,将大量上下文特征集成到模型中.对于全词消歧这类超大状态问题,在HMM和MEMM模型中均存在数据稀疏和时间复杂度过高的问题,我们通过柱状搜索Viterbi算法和平滑策略来解决.最后,我们在Senseval-2和Senseval-3的数据集上进行了评测,该文提出的MEMM方法的F1值为0.654,超过了该评测上所有的基于序列标注的方法.%All-Words Word Sense Disambiguation (WSD) can be regarded as a sequence labeling problem, and two All-Words WSD methods based on sequence labeling are proposed in this paper, which are based on Hidden Markov Model (HMM) and Maximum Entropy Markov Model (MEMM), respectively. First, we model All-Words WSD using HMM. Since HMM can only exploit lexical observation, we generalize HMM to MEMM by incorporating a large number of non-independent features. For All-Words WSD which is a typical extra-large state problem, the data sparsity and high time complexity seriously hinder the application of HMM and MEMM models. We solve these problems by beam-search Viterbi algorithm and smoothing strategy. Finally, we test our methods on the dataset of All-Words WSD tasks in Senseval-2 and Senseval-3, andachieving a 0. 654 Fl value forthe MEMM method which outperforms other methods based on sequence labeling.【期刊名称】《中文信息学报》【年(卷),期】2012(026)002【总页数】7页(P28-34)【关键词】全词消歧;隐马尔可夫模型;最大熵马尔可夫模型;超大状态问题【作者】周云;王挺;易绵竹;张禄彭;王之元【作者单位】国防科技大学计算机学院,湖南长沙410073;国防科技大学计算机学院,湖南长沙410073;解放军外国语学院国防语言文化研究所,河南洛阳471003;解放军外国语学院欧亚语系,河南洛阳471003;国防科技大学计算机学院,湖南长沙410073;国防科技大学并行与分布处理国家重点实验室,湖南长沙410073【正文语种】中文【中图分类】TP3911 引言词义消歧,即在特定的上下文中确定歧义词的词义。
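The abstract above casts all-words WSD as sequence labeling and keeps decoding tractable with a beam-search Viterbi algorithm over a very large sense inventory. As a rough illustration of that idea (not the authors' implementation), the sketch below decodes a short word sequence over per-word candidate senses while keeping only the top-B partial paths at each step; the toy scoring function merely stands in for the HMM/MEMM transition and emission scores.

```python
import math

def beam_viterbi(words, candidate_senses, score, beam=3):
    """Beam-search decoding over word-sense sequences.

    words:            list of tokens to disambiguate.
    candidate_senses: dict mapping each word to its list of possible senses.
    score(prev, sense, word): log-score of assigning `sense` to `word` after
                              `prev` (stands in for transition + emission).
    Only the `beam` best partial paths are kept at every position.
    """
    beams = [([], 0.0)]                      # (sense path, log score)
    for word in words:
        expanded = []
        for path, logp in beams:
            prev = path[-1] if path else None
            for sense in candidate_senses.get(word, [word]):
                expanded.append((path + [sense], logp + score(prev, sense, word)))
        expanded.sort(key=lambda item: item[1], reverse=True)
        beams = expanded[:beam]              # prune to the beam width
    return beams[0]

def toy_score(prev, sense, word):
    # Illustrative scoring only: favour staying in the same coarse semantic field.
    same_field = prev is not None and prev.split(".")[0] == sense.split(".")[0]
    return math.log(0.7 if same_field else 0.3)

if __name__ == "__main__":
    senses = {
        "bank": ["finance.bank", "river.bank"],
        "deposit": ["finance.deposit", "geology.deposit"],
    }
    best_path, best_logp = beam_viterbi(["bank", "deposit"], senses, toy_score, beam=2)
    print(best_path, best_logp)
```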
Word Sense Disambiguationusing Conceptual DensityEneko Agirre*Lengoaia eta Sistema Informatikoak saila. Euskal Herriko Universitatea.p.k. 649, 200800 Donostia. Spain. jibagbee@si.heu.esGerman Rigau**Departament de Llenguatges i Sistemes Informàtics. Universitat Politècnica de Catalunya.Pau Gargallo 5, 08028 Barcelona. Spain. g.rigau@lsi.upc.esAbstract.This paper presents a method for the resolution of lexical ambiguity of nouns and its automatic evaluation over the Brown Corpus. The method relies on the use of the wide-coverage noun taxonomy of WordNet and the notion of conceptual distance among concepts, captured bya Conceptual Density formula developed for this purpose. This fully automatic method requires no hand coding of lexical entries, hand tagging of text nor any kind of training process. The results of the experiments have been automatically evaluated against SemCor, the sense-tagged version of the Brown Corpus.1 IntroductionMuch of recent work in lexical ambiguity resolution offers the prospect that a disambiguation system might be able to receive as input unrestricted text and tag each word with the most likely sense with fairly reasonable accuracy and efficiency. The most extended approach use the context of the word to be disambiguated together with information about each of its word senses to solve this problem. Interesting experiments have been performed in recent years using preexisting lexical knowledge resources: [Cowie et al. 92], [Wilks et al. 93] with LDOCE, [Yarowsky 92] with Roget's International Thesaurus, and [Sussna 93], [Voorhees 93], [Richardson et al. 94], [Resnik 95]with WordNet. Although each of these techniques looks promising for disambiguation, either they have been only applied to a small number of words, a few sentences or not in a public domain corpus. For this reason we have tried to disambiguate all the nouns from real *Eneko Agirre was supported by a grant from the Basque Goverment. Part of this work is included in projects 141226-TA248/95 of the Basque Country University and PI95-054 of the Basque Government.**German Rigau was supported by a grant from the Ministerio de Educación y Ciencia.texts in the public domain sense tagged version of the Brown corpus [Francis & Kucera 67], [Miller et al. 93], also called Semantic Concordance or SemCor for short1. The words in SemCor are tagged with word senses from WordNet, a broad semantic taxonomy for English [Miller 90]2. Thus, SemCor provides an appropriate environment for testing our procedures and comparing among alternatives in a fully automatic way.The automatic decision procedure for lexical ambiguity resolution presented in this paper is based on an elaboration of the conceptual distance among concepts: Conceptual Density [Agirre & Rigau 95]. The system needs to know how words are clustered in semantic classes, and how semantic classes are hierarchically organised. For this purpose, we have used WordNet. Our system tries to resolve the lexical ambiguity of nouns by finding the combination of senses from a set of contiguous nouns that maximises the Conceptual Density among senses. The performance of the procedure was tested on four SemCor texts chosen at random. For comparison purposes two other approaches, [Sussna 93] and [Yarowsky 92], were also tried. The results show that our algorithm performs better on the test set. Following this short introduction the Conceptual Density formula is presented. 
The main procedure to resolve lexical ambiguity of nouns using Conceptual Density is sketched in section 3. Section 4 describes the experiments and their results in detail. Finally, sections 5 and 6 deal with further work and conclusions.

* Eneko Agirre was supported by a grant from the Basque Government. Part of this work is included in projects 141226-TA248/95 of the Basque Country University and PI95-054 of the Basque Government.
** German Rigau was supported by a grant from the Ministerio de Educación y Ciencia.
[1] SemCor comprises approximately 250,000 words. The tagging was done manually, and the error rate measured by the authors is around 10% for polysemous words.
[2] The senses of a word are represented by synonym sets (or synsets), one for each word sense. The nominal part of WordNet can be viewed as a tangled hierarchy of hypo/hypernymy relations among synsets. Nominal relations also include three kinds of meronymic relations, which can be paraphrased as member-of, made-of and component-part-of. The version used in this work is WordNet 1.4. The coverage in WordNet of senses for open-class words in SemCor reaches 96% according to the authors.

2 Conceptual Density and Word Sense Disambiguation

Conceptual distance tries to provide a basis for measuring closeness in meaning among words, taking as reference a structured hierarchical net. Conceptual distance between two concepts is defined in [Rada et al. 89] as the length of the shortest path that connects the concepts in a hierarchical semantic net. In a similar approach, [Sussna 93] employs the notion of conceptual distance between network nodes in order to improve precision during document indexing. [Resnik 95] captures semantic similarity (closely related to conceptual distance) by means of the information content of the concepts in a hierarchical net. In general these approaches focus on nouns.

The measure of conceptual distance among concepts we are looking for should be sensitive to:
• the length of the shortest path that connects the concepts involved;
• the depth in the hierarchy: concepts in a deeper part of the hierarchy should be ranked closer;
• the density of concepts in the hierarchy: concepts in a dense part of the hierarchy are relatively closer than those in a more sparse region;
• the measure should be independent of the number of concepts we are measuring.

We have experimented with several formulas that follow the four criteria presented above. The experiments reported here were performed using the Conceptual Density formula [Agirre & Rigau 95], which compares areas of subhierarchies.

To illustrate how Conceptual Density can help to disambiguate a word, in figure 1 the word W has four senses and several context words. Each sense of the words belongs to a subhierarchy of WordNet. The dots in the subhierarchies represent the senses of either the word to be disambiguated (W) or the words in the context. Conceptual Density will yield the highest density for the subhierarchy containing the most of those senses, relative to the total number of senses in the subhierarchy. The sense of W contained in the subhierarchy with the highest Conceptual Density will be chosen as the sense disambiguating W in the given context. In figure 1, sense2 would be chosen.

Figure 1: senses of a word in WordNet (word to be disambiguated: W; context words: w1 w2 w3 w4 ...)

Given a concept c at the top of a subhierarchy, and given nhyp (the mean number of hyponyms per node), the Conceptual Density for c when its subhierarchy contains a number m ("marks") of senses of the words to disambiguate is given by the formula below:

    CD(c, m) = \frac{\sum_{i=0}^{m-1} nhyp^{\,i^{0.20}}}{descendants_c}    (1)

Formula 1 shows a parameter that was computed experimentally. The 0.20 tries to smooth the exponential i, as m ranges between 1 and the total number of senses in WordNet. Several values were tried for the parameter, and it was found that the best performance was attained consistently when the parameter was near 0.20.
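Formula 1 is straightforward to evaluate once nhyp, m and the number of descendants of c are known. The fragment below is only a transcription of the formula as written above, not the authors' code; the three arguments are assumed to come from whatever WordNet interface is used.

# Conceptual Density of a candidate concept c (direct transcription of formula 1).
def conceptual_density(nhyp, m, descendants):
    # nhyp:        (estimated) mean number of hyponyms per node under concept c
    # m:           number of "marks", i.e. senses of the window words found under c
    # descendants: total number of senses in the subhierarchy rooted at c
    return sum(nhyp ** (i ** 0.20) for i in range(m)) / descendants

# e.g. conceptual_density(nhyp=2.5, m=3, descendants=40) scores one candidate subhierarchy;
# the sense of W lying under the densest subhierarchy is the one chosen for W.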
3 The Disambiguation Algorithm Using Conceptual Density

Given a window size, the program moves the window one noun at a time from the beginning of the document towards its end, disambiguating in each step the noun in the middle of the window and considering the other nouns in the window as context. Non-noun words are not taken into account.

The algorithm to disambiguate a given noun w in the middle of a window of nouns W (c.f. figure 2) roughly proceeds as follows:

(Step 1) tree := compute_tree(words_in_window)
loop
(Step 2)   tree := compute_conceptual_distance(tree)
(Step 3)   concept := select_concept_with_highest_weight(tree)
           if concept = null then exitloop
(Step 4)   tree := mark_disambiguated_senses(tree, concept)
endloop
(Step 5) output_disambiguation_result(tree)

Figure 2: algorithm for each window

First, the algorithm represents in a lattice the nouns present in the window, their senses and hypernyms (step 1). Then, the program computes the Conceptual Density of each concept in WordNet according to the senses it contains in its subhierarchy (step 2). It selects the concept c with the highest Conceptual Density (step 3) and selects the senses below it as the correct senses for the respective words (step 4). The algorithm then proceeds to compute the density for the remaining senses in the lattice, and continues to disambiguate the nouns left in W (back to steps 2, 3 and 4). When no further disambiguation is possible, the senses left for w are processed and the result is presented (step 5).

Besides completely disambiguating a word or failing to do so, in some cases the disambiguation algorithm returns several possible senses for a word. In the experiments we considered these partial outcomes as failure to disambiguate.
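The loop of figure 2 can be restated compactly. The sketch below is only a paraphrase of steps 1-5, not the authors' program; it assumes hypothetical helpers build_sense_lattice, densest_concept and mark_senses with the behaviour described in the text.

# Paraphrase of figure 2 for one window of nouns (illustrative only).
def disambiguate_window(window_nouns, build_sense_lattice, densest_concept, mark_senses):
    # Step 1: lattice with the window nouns, their senses and their hypernyms.
    tree = build_sense_lattice(window_nouns)
    while True:
        # Steps 2-3: recompute Conceptual Density and pick the densest concept.
        concept = densest_concept(tree)
        if concept is None:
            break                      # no concept can be selected any more
        # Step 4: the senses below that concept are frozen as the correct ones.
        tree = mark_senses(tree, concept)
    # Step 5: whatever senses remain for the middle noun are the answer;
    # several remaining senses count as a failure to disambiguate in the evaluation.
    return tree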
4 The Experiments

4.1 The texts

We selected four texts from SemCor at random: br-a01 (where a stands for the genre "Press: Reportage"), br-b20 (b for "Press: Editorial"), br-j09 (j for "Learned: Science") and br-r05 (r for "Humour"). Table 1 shows some statistics for each text.

Table 1: data for each text
text     words   nouns   nouns in WN   monosemous
br-a01   2079    564     464           149 (32%)
br-b20   2153    453     377           128 (34%)
br-j09   2495    620     586           205 (34%)
br-r05   2407    457     431           120 (27%)
total    9134    2094    1858          602 (32%)

An average of 11% of all nouns in these four texts were not found in WordNet. According to this data, the proportion of monosemous nouns in these texts is bigger (32% on average) than the one calculated for the open-class words of the whole SemCor (27.2% according to [Miller et al. 94]).

For our experiments, these texts play both the role of input files (without semantic tags) and of (tagged) test files. When they are treated as input files, we throw away all non-noun words, leaving only the lemmas of the nouns present in WordNet.

4.2 Results and evaluation

One of the goals of the experiments was to decide among different variants of the Conceptual Density formula. Results are given averaging the results of the four files. Partial disambiguation is treated as failure to disambiguate. Precision (that is, the percentage of actual answers which were correct) and recall (that is, the percentage of possible answers which were correct) are given in terms of polysemous nouns only. Graphs are drawn against the size of the context [3].

[3] Context size is given in terms of nouns.

• Meronymy does not improve performance as expected. A priori, the more relations are taken into account (i.e. meronymic relations in addition to the hypo/hypernymy relation), the better density would capture semantic relatedness, and therefore the better the results that could be expected.

• Global nhyp is as good as local nhyp. While local nhyp is the actual average for a given concept, global nhyp gives only an estimation. The results (c.f. figure 4) show that local nhyp performs only slightly better. Therefore global nhyp is favoured and was used in subsequent experiments.

Figure 4: local nhyp vs. global nhyp (precision against window size)

• Context size: different behaviour for each text. One could assume that the more context there is, the better the disambiguation results would be. Our experiments show that each file from SemCor has a different behaviour (c.f. figure 5): while br-b20 shows clear improvement for bigger window sizes, br-r05 gets a local maximum at a window size of 10, etc.

Figure 6: sense level vs. file level (precision against window size)

• Evaluation of the results. Figure 7 shows that, overall, coverage over polysemous nouns increases significantly with the window size, without losing precision. Coverage tends to stabilise near 80%, with little improvement for window sizes bigger than 20. The figure also shows the guessing baseline, given by selecting senses at random. This baseline was first calculated analytically and later checked experimentally. We also compare the performance of our algorithm with that of the "most frequent" heuristic. The frequency counts for each sense were collected using the rest of SemCor, and then applied to the four texts. While the precision is similar to that of our algorithm, the coverage is 8% worse.

Figure 7: precision and coverage (semantic density vs. most frequent vs. guessing, against window size)

All the data for the best window size can be seen in table 2. The precision and coverage shown in all the preceding graphs were relative to the polysemous nouns only. Including monosemous nouns, precision rises, as shown in table 2, from 43% to 64.5%, and coverage increases from 79.6% to 86.2%.

Table 2: overall data for the best window size (w=30)
                     Cover.   Prec.   Recall
overall      File    86.2     71.2    61.4
             Sense            64.5    55.5
polysemic    File    79.6     53.9    42.8
             Sense            43      34.2

4.3 Comparison with other works

The raw results presented here seem to be poor when compared to those shown in [Hearst 91], [Gale et al. 93] and [Yarowsky 92]. We think that several factors make the comparison difficult. Most of those works focus on a selected set of a few words, generally with a couple of senses of very different meaning (coarse-grained distinctions), and for which their algorithm could gather enough evidence. On the contrary, we tested our method with all the nouns in a subset of an unrestricted public domain corpus (more than 9,000 words), making fine-grained distinctions among all the senses in WordNet.

An approach that uses hierarchical knowledge is that of [Resnik 95], which additionally uses the information content of each concept gathered from corpora. Unfortunately he applies his method to a different task, that of disambiguating sets of related nouns. The evaluation is done on a set of related nouns from Roget's Thesaurus tagged by hand. The fact that some senses were discarded because the human judged them not reliable makes comparison even more difficult.

In order to compare our approach we decided to implement [Yarowsky 92] and [Sussna 93], and test them on our texts. For [Yarowsky 92] we had to adapt it to work with WordNet.
His method relies on cooccurrence data gathered on Roget's Thesaurus semantic categories. Instead, in our experiment we use saliency values [4] based on the lexicographic file tags in SemCor. The results for a window size of 50 nouns are those shown in table 3 [5]. The precision attained by our algorithm is higher. To compare the figures better, consider the results in table 4, where the coverage of our algorithm was easily extended using the version presented below, increasing recall to 70.1%.

[4] We tried both mutual information and association ratio, and the latter performed better.
[5] The results of our algorithm are those for window size 30, file matches and overall.

Table 3: comparison with [Yarowsky 92]
            Cover.   Prec.   Recall
C.Density   86.2     71.2    61.4
Yarowsky    100.0    64.0    64.0

Among the methods based on Conceptual Distance, [Sussna 93] is the most similar to ours. Sussna disambiguates several documents from a public corpus using WordNet. The test set was tagged by hand, allowing more than one correct sense for a single word. The method he uses has to overcome a combinatorial explosion [6] by controlling the size of the window and "freezing" the senses for all the nouns preceding the noun to be disambiguated. In order to freeze the winning sense, Sussna's algorithm is forced to make a unique choice. When Conceptual Distance is not able to choose a single sense, the algorithm chooses one at random.

[6] In our replication of his experiment the mutual constraint for the first 10 nouns (the optimal window size according to his experiments) of file br-r05 had to deal with more than 200,000 synset pairs.

Conceptual Density overcomes the combinatorial explosion by extending the notion of conceptual distance from a pair of words to n words, and can therefore yield more than one correct sense for a word. For comparison, we altered our algorithm to also make random choices when unable to choose a single sense. We applied the algorithm Sussna considers best, discarding the factors that do not affect performance significantly [7], and obtained the results in table 4.

[7] Initial mutual constraint size is 10 and window size is 41. Meronymic links are also considered. All the links have the same weight.

Table 4: comparison with [Sussna 93]
                      Cover.   Prec.
C.Density    File     100.0    70.1
             Sense             60.1
Sussna       File     100.0    64.5
             Sense             52.3

A more thorough comparison with these methods would be desirable, but is not possible in this paper for the sake of conciseness.

5 Further Work

We would like to have included in this paper a study on whether or not there is a correlation between correct and erroneous sense assignments and the degree of Conceptual Density, that is, the actual figure yielded by formula 1. If this were the case, the error rate could be further decreased by setting a certain threshold for the Conceptual Density values of winning senses. We would also like to evaluate the usefulness of partial disambiguation: the decrease of ambiguity, the number of times the correct sense is among the chosen ones, etc.

There are some factors that could raise the performance of our algorithm:

• Work on coherent chunks of text. Unfortunately any information about discourse structure is absent in SemCor, apart from sentence endings. The performance would gain from the fact that sentences from unrelated topics would not be considered in the disambiguation window.

• Extend and improve the semantic data. WordNet provides synonymy, hypernymy and meronymy relations for nouns, but other relations are missing. For instance, WordNet lacks cross-categorial semantic relations, which could be very useful to extend the notion of Conceptual Density of nouns to Conceptual Density of words.
Apart from extending the disambiguation to verbs, adjectives and adverbs, cross-categorial relations would allow the relations among senses to be captured better and would provide firmer grounds for disambiguating. These other relations could be extracted from other knowledge sources, either corpus-based or MRD-based. If those relations could be given on WordNet senses, Conceptual Density could profit from them. It is our belief, following the ideas of [McRoy 92], that full-fledged lexical ambiguity resolution should combine several information sources. Conceptual Density might be only one of a number of complementary evidences of the plausibility of a certain word sense. Furthermore, WordNet 1.4 is not a complete lexical database (the current version is 1.5).

• Tune the sense distinctions to the level best suited for the application. On the one hand, the sense distinctions made by WordNet 1.4 are not always satisfactory. On the other hand, our algorithm is not designed to work on the file level; e.g. if the sense level is unable to distinguish between two senses, the file level also fails, even if both senses were from the same file. If the senses were collapsed at the file level, the coverage and precision of the algorithm at the file level might be even better.

6 Conclusion

The automatic method for the disambiguation of nouns presented in this paper is readily usable in any general domain and on free-running text, given part-of-speech tags. It does not need any training and uses word sense tags from WordNet, an extensively used lexical database.

Conceptual Density has been used for other tasks apart from the disambiguation of free-running text. Its application to automatic spelling correction is outlined in [Agirre et al. 94]. It was also used in Computational Lexicography, enriching dictionary senses with semantic tags extracted from WordNet [Rigau 94], or linking bilingual dictionaries to WordNet [Rigau and Agirre 96].

In the experiments, the algorithm disambiguated four texts (about 10,000 words long) of SemCor, a subset of the Brown corpus. The results were obtained automatically by comparing the tags in SemCor with those computed by the algorithm, which allows comparison with other disambiguation methods. Two other methods, [Sussna 93] and [Yarowsky 92], were also tried on the same texts, showing that our algorithm performs better.

Results are promising, considering the difficulty of the task (free-running text, large number of senses per word in WordNet) and the lack of any discourse structure in the texts. Two types of results can be obtained: the specific sense or a coarser, file-level, tag.

Acknowledgements

This work, partially described in [Agirre & Rigau 96], was started at the Computing Research Laboratory in New Mexico State University. We wish to thank all the staff of the CRL and especially Jim Cowie, Joe Guthrie, Louise Guthrie and David Farwell. We would also like to thank Xabier Arregi, Jose Mari Arriola, Xabier Artola, Arantza Díaz de Ilarraza, Kepa Sarasola and Aitor Soroa from the Computer Science Faculty of EHU and Francesc Ribas, Horacio Rodríguez and Alicia Ageno from the Computer Science Department of UPC.

References

Agirre E., Arregi X., Diaz de Ilarraza A. and Sarasola K. 1994. Conceptual Distance and Automatic Spelling Correction. In Workshop on Speech Recognition and Handwriting, Leeds, England.

Agirre E. and Rigau G. 1995.
A Proposal for Word Sense Disambiguation Using Conceptual Distance. In Proceedings of the International Conference on Recent Advances in Natural Language Processing, Tzigov Chark, Bulgaria.

Agirre E. and Rigau G. 1996. An Experiment in Word Sense Disambiguation of the Brown Corpus Using WordNet. Memoranda in Computer and Cognitive Science, MCCS-96-291, Computing Research Laboratory, New Mexico State University, Las Cruces, New Mexico.

Cowie J., Guthrie J. and Guthrie L. 1992. Lexical Disambiguation Using Simulated Annealing. In Proceedings of the DARPA Workshop on Speech and Natural Language, New York, 238-242.

Francis S. and Kucera H. 1967. Computational Analysis of Present-Day American English. Providence, RI: Brown University Press.

Gale W., Church K. and Yarowsky D. 1993. A Method for Disambiguating Word Senses in a Large Corpus. Computers and the Humanities, 26.

Guthrie L., Guthrie J. and Cowie J. 1993. Resolving Lexical Ambiguity. Memoranda in Computer and Cognitive Science, MCCS-93-260, Computing Research Laboratory, New Mexico State University, Las Cruces, New Mexico.

Hearst M. 1991. Towards Noun Homonym Disambiguation Using Local Context in Large Text Corpora. In Proceedings of the Seventh Annual Conference of the UW Centre for the New OED and Text Research, Waterloo, Ontario.

McRoy S. 1992. Using Multiple Knowledge Sources for Word Sense Discrimination. Computational Linguistics, 18(1).

Miller G. 1990. Five Papers on WordNet. Special Issue of the International Journal of Lexicography, 3(4).

Miller G., Leacock C., Randee T. and Bunker R. 1993. A Semantic Concordance. In Proceedings of the 3rd DARPA Workshop on Human Language Technology, 303-308, Plainsboro, New Jersey.

Miller G., Chodorow M., Landes S., Leacock C. and Thomas R. 1994. Using a Semantic Concordance for Sense Identification. In Proceedings of the ARPA Workshop on Human Language Technology, 232-235.

Rada R., Mili H., Bicknell E. and Blettner M. 1989. Development and Application of a Metric on Semantic Nets. IEEE Transactions on Systems, Man and Cybernetics, 19(1), 17-30.

Resnik P. 1995. Disambiguating Noun Groupings with Respect to WordNet Senses. In Proceedings of the Third Workshop on Very Large Corpora, MIT.

Richardson R., Smeaton A.F. and Murphy J. 1994. Using WordNet as a Knowledge Base for Measuring Semantic Similarity between Words. Working Paper CA-1294, School of Computer Applications, Dublin City University, Dublin, Ireland.

Rigau G. 1994. An Experiment on Automatic Semantic Tagging of Dictionary Senses. In Workshop "The Future of Dictionary", Aix-les-Bains, France. Published as Research Report LSI-95-31-R, Computer Science Department, UPC, Barcelona.

Rigau G. and Agirre E. 1996. Linking Bilingual Dictionaries to WordNet. In Proceedings of the 7th Euralex International Congress on Lexicography (Euralex'96), Gothenburg, Sweden.

Sussna M. 1993. Word Sense Disambiguation for Free-text Indexing Using a Massive Semantic Network. In Proceedings of the Second International Conference on Information and Knowledge Management, Arlington, Virginia.

Voorhees E. 1993. Using WordNet to Disambiguate Word Senses for Text Retrieval. In Proceedings of the Sixteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 171-180, PA.

Wilks Y., Fass D., Guo C., McDonald J., Plate T. and Slator B. 1993. Providing Machine Tractable Dictionary Tools. In Semantics and the Lexicon (Pustejovsky J., ed.), 341-401.

Yarowsky D. 1992.
Word-Sense Disambiguation Using Statistical Models of Roget's Categories Trained on Large Corpora. In Proceedings of the 15th International Conference on Computational Linguistics (Coling'92), Nantes, France.
A Study of Chinese Word Sense Disambiguation in MT Based on Grammatical & Semantic Knowledge-Bases ♣

Wang Hui
Institute of Computational Linguistics, Peking University, Beijing 100871

Abstract: Research on word sense disambiguation has important theoretical and practical significance for many applications of natural language processing, and all the more so for machine translation, where it bears directly on improving translation quality.
However, virtually all existing word sense disambiguation systems face the bottleneck of acquiring disambiguation knowledge.
This paper argues that, to improve the quality of a word-sense knowledge base in a genuinely effective way, one needs to go beyond part-of-speech classification by adding, for each sense, an analysis of its grammatical functions and its semantic collocation restrictions, and to make integrated use of existing grammatical and semantic resources to extract the distributional features of each sense of a polysemous word at different levels.
On this basis, the paper proposes a Chinese word sense disambiguation algorithm for a Chinese-English machine translation system that draws on grammatical and semantic knowledge bases.
Preliminary experimental results show that the method can disambiguate the senses of Chinese nouns, verbs and adjectives with high quality.
Keywords: word sense disambiguation (WSD); Chinese-English machine translation; grammatical dictionary; semantic dictionary

A Study of Chinese Word Sense Disambiguation in MT Based on Grammatical & Semantic Knowledge-Bases
Wang, Hui
(Institute of Computational Linguistics, Peking University, Beijing 100871, China)

Abstract
Word sense disambiguation (WSD) plays an important role in machine translation and many other areas of natural language processing, and research on WSD has great theoretical and practical significance. The main work in this paper is to study what kind of knowledge is useful for WSD and to establish a multi-level WSD model based on syntagmatic features and semantic information, which can be used to disambiguate word senses in Mandarin Chinese effectively.

The model makes full use of the Grammatical Knowledge-base of Contemporary Chinese as one of its main machine-readable dictionaries (MRDs), which provides rich grammatical information for disambiguation such as the Chinese lexicon, parts of speech (POS) and syntactic functions. Another resource of the model is the Semantic Dictionary of Contemporary Chinese, which provides a thesaurus and semantic collocation information for 68,000 Chinese words.

The results of this study indicate that the two MRD resources are effective for word sense disambiguation in MT and are likely to be important for general Chinese NLP.

Key words: Word Sense Disambiguation, Chinese-English Machine Translation, Grammatical Knowledge, Semantic Dictionary

♣ This research was supported by the National 973 Program project "Chinese-English Machine Translation System for the News Domain" (Grant No. G1998030507-4).
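The abstract describes a model that narrows down candidate senses level by level: first by part of speech and syntactic function (from the Grammatical Knowledge-base), then by semantic collocation constraints (from the Semantic Dictionary). The sketch below is not Wang's algorithm; it only illustrates that kind of layered filtering over a hypothetical sense inventory, with made-up field names.

# Illustrative layered sense filtering (hypothetical data model, not the paper's system).
example_senses = [
    {"id": "sense_1", "pos": "verb", "functions": {"predicate"}, "collocate_classes": {"person"}},
    {"id": "sense_2", "pos": "verb", "functions": {"predicate"}, "collocate_classes": {"event"}},
]

def filter_senses(candidates, pos, function, collocate_class):
    # Level 1: part of speech of the target word.
    step1 = [s for s in candidates if s["pos"] == pos]
    # Level 2: grammatical function the word fills in the parse.
    step2 = [s for s in step1 if function in s["functions"]] or step1
    # Level 3: semantic class of the collocate (e.g. the direct object).
    step3 = [s for s in step2 if collocate_class in s["collocate_classes"]] or step2
    return step3

# filter_senses(example_senses, "verb", "predicate", "event") -> keeps only sense_2.

Keeping the previous candidate set whenever a level filters everything out is one simple fallback for contexts the knowledge bases do not cover; the paper itself may handle such cases differently.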
The Homonymy Relation Between Lexical Items and the Method of Conjugate Collocation

(Photo: seen in the Keukenhof gardens, the Netherlands; taken by Feng Zhiwei)

Feng Zhiwei

The semantic information carried by a word itself is very important. According to the principle of compositionality, the meaning of a sentence is made up of the meanings of the words that form it together with the semantic relations among those words.
Therefore, in natural language processing we should attach importance to the study of lexical semantics. The vocabulary of a language has a highly systematic structure, and it is precisely this structure that determines the meanings and uses of words. This structure includes the relations between a word and its meanings as well as the internal structure of individual words. The study of this systematic, meaning-related structure of the vocabulary is called "lexical semantics". From the viewpoint of lexical semantics, the vocabulary is not a finite list of words but a highly systematic structure.

Before going further into lexical semantics, let us first introduce some new terminology, because the terms we have used so far are too vague. For example, the term "word" is currently used in all sorts of ways, which makes it harder to clarify its usage. We will therefore use the term "lexeme" in place of "word": a lexeme is a single entry in the lexicon, the combination of a particular orthographic form and phonological form with some symbolic representation of its meaning. A lexicon is a finite list of lexemes; from the viewpoint of lexical semantics, it is also a generative device for an unbounded range of meanings. The meaning component of a lexeme is called a "sense". Complex relations hold between a lexeme and its senses. These relations can be described in terms of homonymy, polysemy, synonymy and hyponymy.
This post deals first with the homonymy relation between lexical items.

Homonymy.
Homonymy is the relation between lexemes that have the same form but whose meanings are unrelated; lexemes that stand in this relation are called homonyms.
For example, bank has two distinct meanings: (1) a financial institution. In the sentence "A bank can hold the investments in an account in the client's name.", bank has this meaning, and we call it bank1.
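The distinction drawn above between an orthographic form, its lexemes and their senses is easy to picture as a small data structure. The toy sketch below is only an illustration; bank1 is taken from the example in the text, while the second entry is a hypothetical placeholder for the other, unrelated sense the post announces but the excerpt cuts off before describing.

# Toy lexicon: one orthographic form mapping to homonymous lexemes (illustration only).
lexicon = {
    "bank": [
        {"lexeme": "bank1", "sense": "financial institution"},             # from the example sentence
        {"lexeme": "bank2", "sense": "second, unrelated sense (placeholder)"},
    ],
}

def homonyms(form):
    # Lexemes that share a form but have unrelated senses are homonyms.
    return [entry["lexeme"] for entry in lexicon.get(form, [])]

# homonyms("bank") -> ["bank1", "bank2"]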