Word sense disambiguation using conceptual density
A. Gelbukh (Ed.): CICLing 2007, LNCS 4394, pp. 267 – 274, 2007.© Springer-Verlag Berlin Heidelberg 2007Word Clustering for Collocation-BasedWord Sense Disambiguation*Peng Jin, Xu Sun, Yunfang Wu, and Shiwen YuDepartment of Computer Science and TechnologyInstitute of Computational Linguistics, Peking University, 100871, Beijing, China {jandp, sunxu, wuyf, yusw}@Abstract. The main disadvantage of collocation-based word sense disambigua-tion is that the recall is low, with relatively high precision. How to improve therecall without decrease the precision? In this paper, we investigate a word-classapproach to extend the collocation list which is constructed from the manuallysense-tagged corpus. But the word classes are obtained from a larger scale cor-pus which is not sense tagged. The experiment results have shown that the F-measure is improved to 71% compared to 54% of the baseline system where theword-class is not considered, although the precision decreases slightly. Furtherstudy discovers the relationship between the F-measure and the number ofword-class trained from the various sizes of corpus.1 IntroductionWord sense disambiguation (WSD) aims to identify the intended sense of a polyse-mous word given a context. A typical case is the Chinese word “讲” when occurring in “讲真话” (“tell the truth”) and “讲实效” (“pay attention to the actual effect”). Correctly sense-tagging the word in context can prove to be beneficial for many NLP applications such as Information Retrieval [6], [14], and Machine Translation [3], [7].Collocation is a combination of words that has certain tendency to be used to-gether [5] and it is used widely to attack the WSD task. Many researchers used the collocation as an important feature in the supervised learning algorithms: Naïve Bayes[7], [13], Support Vector Machines [8], and Maximum Entropy [2]. And the other researches [15], [16] directly used the collocation to form decision list to deal with the WSD problem.Word classes are often used to alleviate the data sparseness in NLP. Brown [1] performed automatic word clustering to improve the language model. Li [9] con-ducted syntactic disambiguation by using the acquired word-class. Och [12] provided an efficient method for determining bilingual word classes to improve statistical MT. This paper integrates the contribution of word-class to collocation-based WSD. When the word-based collocation which is obtained from sense tagged corpus fails, * Support by National Grant Fundamental Research 973 Program of China Under Grant No. 2004CB318102.268 P. Jin et al.class-based collocation is used to perform the WSD task. The results of experiment have shown that the average F-measure is improved to 70.81% compared to 54.02% of the baseline system where the word classes are not considered, although the preci-sion decreases slightly. Additionally, the relationship between the F-measure and the number of word-class trained from the various sizes of corpus is also investigated.The paper is structured as follows. Section 2 summarizes the related work. Section 3 describes how to extend the collocation list. Section 4 presents our experiments as well as the results. Section 5 analyzes the results of the experiments. Finally section 6 draws the conclusions and summarizes further work.2 Related WorkThe underlying idea is that one sense per collocation which has been verified by Yarowsky [15] on a coarse-grained WSD task. But the problem of data spars will be more serious on the fine-grained WSD task. 
We attempt to resolve the data sparseness with the help of word-class. Both of them are described as follows.2.1 The Yarowsky AlgorithmYarowsky [15] used the collocation to form a decision list to perform the WSD task. In his experiments, the content words (i.e., nouns, verbs, adjectives and adverbs) holding some relationships to the target word were treated as collocation words. The relationships include direct adjacency to left or right and first to the left or right in a sentence. He also considered certain syntactic relationships such as verb/object, sub-ject/verb. Since similar corpus is not available in Chinese, we just apply the four co-occurrence words described above as collocation words. Different types of evidences are sorted by the equation 1 to form the final decision list.)|Pr()|Pr(((21i i n Collocatio Sense n Collocatio Sense Log Abs (1)To deal with the same collocation indicates more than two senses, we adapt to the equation 1. For example, “上 (shang4)” has fifteen different senses as an verb. If the same collocation corresponds to different senses of 上, we use the frequency counts of the most commonly-used sense as the nominator in equation 1, and the frequency counts of the rest senses as the denominator. The different types of evidence are sorted by the value of equation 1. When a new instance is encountered, one steps through the decision list until the evidence at that point in the list matches the cur-rent context under consideration. The sense with the greatest listed probability is returned.The low recall is the main disadvantage of Yarowsky’s algorithm to the fine-grained sense disambiguation. Because of the data sparseness, the collocation word in the novel context has little chance to match exactly with the items in the decision list. To resolve this problem, the word clustering is introduced.Word Clustering for Collocation-Based Word Sense Disambiguation 2692.2 Word ClusteringIn this paper, we use an efficient method for word clustering which Och [12] intro-duced for machine translation. The task of a statistical language model is used to estimate the probability ()N w P 1 of the word sequence N N w w w ...11=. A simple ap-proximation of ()N w P 1 is to model it as a product of bi-gram probabilities:()()∏=−=N i i i N w w p w P 111|. Using the word class rather than the single word, we avoid the use of the most of the rarely seen bi-grams to estimate the probabilities. Rewriting the probability using word classes, we obtain the probability model as follow:()()()()()()i i N i i i Nw C w P w C w C P C w P ||:|111•=∏=− (2)Where the function C maps words to w their classes ()w C . In this model, we have two types of probabilities: the transition probability ()'|C C P for class C given its predecessor class 'C , and the membership probability ()C w P | for word w given classC . To determine the optimal word classes Cˆ for a given number of classes, we per-form a maximum-likelihood estimation:()C w P C N C |max arg ˆ1= (3)To the implementation, an efficient optimization algorithm is the exchange algo-rithm [13].It is necessary to set the number of word classes before the iteration.Two word classes are selected for illustration. First is “花生 (peanut), 大豆 (bean), 棉花 (cotton), 水稻 (rice), 早稻 (early rice), 芒果 (mango), 红枣 (jujube), 柑桔 (or-ange), 银杏 (ginkgo)”. To the target verb “吃” (which have five senses), these nouns can be its objects and indicate the same sense of “吃”. 
Another word class is “ 灌溉 (irrigate), 育秧 (raise rice seedlings), 施肥 (apply fertilizer), 播种 (sow), 移植 (trans-plant), 栽培 (cultivate), 备耕 (make preparations for plowing and sowing)”. Most of them indicate the sense “plant” of the target noun “小麦 (wheat)” which has two senses categories: “plant” and “seed”. For example, there is a collocation pair “灌溉小麦” in the collocation list which is obtained from the sense tagged corpus, an un-familiar collocation pair “备耕小麦” will be tagged with the intended sense of “小麦” because “灌溉” and “备耕” are clustered in the same word-class.3 Extending the Collocation ListThe algorithm of extending the collocation list which is constructed from the sense tagged corpus is quite straightforward. Given a new collocation pair exists in the novel context consists of the target word, the collocation word and the collocation type. If this specific collocation pair is found in the collocation list, we return the sense at the point in this decision list. While the match fails, we replace this collocation word with one of the words which are clustered in the same word-class to match again. The270 P. Jin et al.process is finished when any match success or all words in the word-class are tried. If all words in this word-class fail to match, we let this target word untagged.For example, “讲政治”(pay attention to the politics), “讲故事”(tell a story) are ordered in the collocation list. But to a new instance “讲笑话”(tell a joke), apparently we can not match the Chinese word “笑话” with any of the collocation word. Search-ing from the top of the collocation list, we check that “笑话” and “故事” are clustered in the same word-class. So the sense “tell” is returned and the process is ended.4 ExperimentWe have designed a set of experiments to compare the Yarowsky algorithm with and without the contribution of word classes. Yarowsky algorithm introduced in section 2.1 is used as our baseline. Both close test and open test are conducted.4.1 Data SetWe have selected 52 polysemous verbs randomly with the four senses on average. Senses of words are defined with the Contemporary Chinese Dictionary, the Gram-matical Knowledge-base of Contemporary Chinese and other hard-copy dictionaries. For each word sense, a lexical entry includes definition in Chinese, POS, Pinyin, semantic feature, subcategory framework, valence, semantic feature of subject, se-mantic feature of object, English equivalent and an example sentence.A corpus containing People’s Daily News (PDN) of the first three months of year 2000 (i.e., January, February and March) is used as our training/test set. The corpus is segmented (3,719,951 words) and POS tagged automatically before hand, and then is sense-tagged manually. To keep the consistency, a text is first tagged by one annota-tor and then checked by other two checkers. Five annotators are all native Chinese speakers. What’s more, a software tool is developed to gather all the occurrences of a target word in the corpus into a checking file with the sense KWIC (Key Word in Context) format in sense tags order. Although the agreement rate between human annotators on verb sense annotation is only 81.3%, the checking process with the help of this tool improves significantly the consistency.We also conduct an open test. The test corpus consists of the news of the first ten days of January 1998. The news corresponding to the first three months of 2000 are used as training set to construct the collocation list. 
The corpus which is used to word cluster amounts to seven months PDN.4.2 Experimental SetupFive-fold cross-validation method is used to evaluate these performances. We divide the sense-tagged three months corpus into five equal parts. In each process, the sense labels in one part are removed in order to be used as test corpus. And then, the collo-cation list is constructed from the other four parts of corpus. We first use this list to tag test corpus according to the Yarwosky algorithm and set its result as the baseline. After that the word-class is considered and the test corpus is tagged again according to the algorithm described in section 3.Word Clustering for Collocation-Based Word Sense Disambiguation 271 To draw the learning curve, we vary the number of word-class and the sizes of corpus which used to cluster the words. In open test, the collocation list is constructed from the news corresponding to the first three months of year 2000.4.3 Experiment ResultsTable 1 shows the results of close test. It is achieved by 5-fold Cross-Validation with 200 word-clusters trained from the seven months corpus. “Tagged tokens” is referred to the occurrences of the polysemous words which are disambiguated automatically. “All tokens” means the occurrences of the all polysemous words in one test corpus.We can see the performance of each process is stable. It demonstrates that the word class is very useful to alleviate the data sparse problem.Table 1. Results with 200 Word Classes Trained from 7 Month CorpusTagged TokensAllTokensPrecision Recall F-measureT1 2,346 4237 0.9301 0.5537 0.6942T2 2,969 4,676 0.9343 0.5766 0.7131T3 2,362 4,133 0.9306 0.5715 0.7081T4 2,773 4,721 0.9318 0.5874 0.7206T5 2,871 4,992 0.9154 0.5751 0.7046Ave. 2,664 4,552 0.9284 0.5729 0.7081 Table 2 shows the power of word-class. B1 and B2 denote individually the baseline in close and open test. S1 and S2 show the performance with the help of word-classes in these tests. Although the precision decreases slightly, the F-measures are improved significantly. Because in open test, the size of corpus used to training is bigger while the size of corpus used to test is less compared with the corpus in open test, the F-measure is even a bit higher than in close test.Table 2. Results of Close and Open TestTagged TokensAllTokensPrecision Recall F-measureB1 1,691 4,552 0.9793 0.3708 0.5401S1 2,664 4,552 0.9284 0.5729 0.7081B2 874 2,325 0.9908 0.3559 0.5450S2 1,380 2,325 0.9268 0.5935 0.7237 5 Discussion of ResultsFig 1 presents the relationship between the F-measure and the number of word-class trained from the various sizes of corpus. The reasons for errors are also explained.272 P. Jin et al.5.1 Relationship Between F-Measure with Word-Class and CorpusWhen we fix the size of the corpus which is used to cluster the word-class, we can see that the F-measure is verse proportional to the number of the word classes. However in our experiments, the precision is proportional to the number of the word classes (this can not be presented in this figure). The reason is straightforward that with the augment of the word classes, there are fewer words in every word-class. So the collo-cation which comes from test corpus has less chance of finding the word in the decision list belonging to the same word-class.Fig. 1. F-measure at different number of word-class trained from the various sizes of corpus When we fix the number of word classes, we can see that the F-measure increases with the size of the training corpus. 
This demonstrates that more data improve the system performance. But the increase rate is less and less. It shows there is a ceiling effect. That is to say, the effect on the performance will be less although more cor-puses are trained for clustering the words.5.2 Error AnalysisUnrelated words are clustered is the main cause of precision decreases. For example, there are two words “牛” (cattle) and “鞭炮” (cracker) are clustered in the same word-class. To the target word “放”, “放牛” means “graze cattle” and “放鞭炮” means “fire crackers”. To resolve this problem, we should pay much attention to improve the clustering results.Word Clustering for Collocation-Based Word Sense Disambiguation 273However, the reasonable word-classes also cause errors. Another example is “包饺子” (wrap dumpling) and “包午餐” (offer free lunch) . The word “饺子” (dumpling) and the word “午餐”(lunch) are clustered reasonable because both of them are nouns and related concepts. However, to the target polysemous word “包” , the sense is completely different: the former means “wrap” and the sense of the later is “offer free”. It also explains why the WSD system benefits little from the ontology such as HowNet [4].Although the collocation list obtained from the sense tagged corpus is extended by word classes, the F-measure is still not satisfied. There are still many unfamiliar collocations can not be matched because of the data sparseness.6 Conclusion and the Future WorkWe have demonstrated the word-class is very useful to improve the performance of the collocation-base method. The result shows that the F-measure is improved to 70.81% compared to 54.02% of the baseline system where the word clusters are not considered, although the precision decreases slightly. To open test, the performance is also improved from 54.50% to 72.37%.This method will be used to help us to accelerate the construction sense tagged corpus. Another utility of word class is used as a feature in the supervised machine learning algorithms in our future research.We can see that some words are highly sensitive to collocation while others are not. To the later, the performance is poor whether the word-class is used or not. We will further study which words and why they are sensitive to collocation from the perspectives of both linguistics and WSD.References1.Brown, P. F., Pietra, V. J., deSouza, P. V., Lai, J. C. and Mercer, R. L. Class-based N-gram Models of Natural Language. Computational Linguistics. 4 (1992) 467-4792.Chao, G., Dyer, G.M. Maximum Entropy Models for Word Sense Disambiguation. Pro-ceedings of the 19th International Conference on Computational Linguistics. Taipei, Taiwan (2002) 155–1613.Dagan, D., Itai, A. Word Sense Disambiguation Using a Second Language MonolingualCorpus. Computational Linguistics. 4 (1994) 563–5964.Dang, H. T., Chia, C., Palmer, M., Chiou, F. D., Rosenzweig J. Simple Features for Chi-nese Word Sense Disambiguation. Proceedings of the 19th International Conference on Computational Linguistics. Taipei, Taiwan (2002) 204–2115.Gelbukh, A., G. Sidorov, S.-Y. Han, E. Hernández-Rubio. Automatic Enrichment of aVery Large Dictionary of Word Combinations on the Basis of Dependency Formalism.Proceedings of Mexican International Conference on Artificial Intelligence. Lecture Notes in Artificial Intelligence, N 2972, Springer-Verlag, (2004) 430-4376.Kim, S.B., Seo, H.C., Rim, H.C. Information Retrieval Using Word Senses: Root SenseTagging Approach, SIGIR’04, Sheffield, South Yorkshire, UK (2004) 258–265274 P. Jin et al.7.Lee, H.A., Kim, G.C. 
Translation Selection through Source Word Sense Disambiguationand Target Word Selection. Proceedings of the 19th International. Conference on Compu-tational Linguistics, Taipei, Taiwan (2002)8.Lee, Y. K., Ng, H. T. and Chia, T. K. Supervised Word Sense Disambiguation with Sup-port Vector Machines and Multiple Knowledge Sources. Proceedings of SENSEVAL-3: Third International Workshop on the Evaluating Systems for the Semantic Analysis of Text, Barcelona, Spain. (2004)9.Li, H. Word Clustering and Disambiguation Based on Co-occurrence Data. Natural Lan-guage Engineering. 8 (2002) 25-4210.Li W.Y., Lu Q., Li W.J. Integrating Collocation Features in Chinese Word Sense Disam-biguation. Proceeding of the Fourth SIGHAN Workshop on Chinese Language Processing (2005) 87–9411.Martin, S., Liermann, J. and Ney, K. Algorithms for Bigram and Trigram Word Clustering.Speech Communication. 1 (1998) 19-3712.Och, F. J. An Efficient Method for Determining Bilingual Word Classes. Proceeding of theNinth Conference of the European Chapter of the Association for Computational Linguis-tics. (1999) 71-7613.Pedersen, T. A Simple Approach to Building Ensembles of Naive Bayesian Classifiers forWord Sense Disambiguation. Proceeding of the first Annul Meeting of the North Ameri-can Chapter for Computational Linguistics (2000) 63–6914.Stokoe, C., Oakes, M.P., Tait, J. Word Sense Disambiguation in Information Retrieval Re-visited. Proceeding of the 26th annul International ACM SIGIR conference On research and development in Information retrieval (2003)15.Yarowsky, D. One Sense Per Collocation, Proceeding of ARPA Human Language Tech-nology workshop.Princeton, New Jersey (1993)16.Yarowsky, D. Hierarchical Decision Lists for Word Sense Disambiguation, Computers andthe Humanities. 1 (2000) 179–186。
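To make the collocation-list extension of Section 3 above concrete, here is a minimal Python sketch. The decision list, the word clusters, and the back-off order follow the paper's description, but every entry and data structure below is invented for illustration; the paper itself induces them from a sense-tagged People's Daily corpus and a larger untagged corpus.

```python
# Minimal sketch of collocation-based WSD with word-class back-off
# (Section 3 of Jin et al. above).  All data below is invented for illustration.

# Decision list learned from a sense-tagged corpus: (target, collocate) -> sense
collocation_list = {
    ("讲", "真话"): "tell",
    ("讲", "实效"): "pay attention to",
    ("讲", "故事"): "tell",
}

# Word classes induced from a larger, untagged corpus: word -> cluster id
word_class = {"故事": 7, "笑话": 7, "实效": 12}

def disambiguate(target, collocate):
    # 1. Exact match against the collocation list.
    if (target, collocate) in collocation_list:
        return collocation_list[(target, collocate)]
    # 2. Back off: try entries whose collocate shares a word class with ours.
    cluster = word_class.get(collocate)
    if cluster is not None:
        for (t, c), sense in collocation_list.items():
            if t == target and word_class.get(c) == cluster:
                return sense
    # 3. Leave the target untagged if nothing matches.
    return None

print(disambiguate("讲", "笑话"))  # -> "tell", via the word class shared with 故事
```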
How Semantic Analysis Works

Semantic analysis is an important research direction in natural language processing. Its main goal is to understand the semantic information carried by natural language and to process and analyze it further. This article introduces how semantic analysis works and discusses its main methods and application areas.

1. Overview

Semantic analysis is one of the core tasks in natural language processing. Its main goal is to extract meaning from text and to understand the relationships between language and information. Unlike traditional syntax-based analysis, semantic analysis focuses on recovering deeper levels of meaning from text. Its applications are wide-ranging and include sentiment analysis, question answering systems, and machine translation.

2. Methods and Techniques

1. Word sense disambiguation. Word Sense Disambiguation (WSD) is a key step in semantic analysis. In natural language a word may have several different meanings, and the task of WSD is to determine the correct meaning of the word in a specific context. Common approaches include knowledge-based methods, statistical methods, and machine learning; a minimal knowledge-based sketch is given below.
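As a concrete illustration of the knowledge-based family of methods mentioned above, here is a minimal Python sketch of a simplified Lesk disambiguator built on NLTK's WordNet interface. It assumes NLTK and its WordNet data are installed; the gloss-overlap heuristic is purely illustrative, not a production disambiguator.

```python
# A minimal knowledge-based WSD sketch (simplified Lesk), assuming NLTK and
# its WordNet data are installed.  The overlap heuristic is illustrative only.
from nltk.corpus import wordnet as wn

def simple_lesk(target, context_words):
    """Pick the sense of `target` whose gloss overlaps most with the context."""
    context = set(w.lower() for w in context_words)
    best_sense, best_overlap = None, -1
    for sense in wn.synsets(target):
        gloss = set(sense.definition().lower().split())
        for example in sense.examples():
            gloss |= set(example.lower().split())
        overlap = len(gloss & context)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

sense = simple_lesk("bank", ["deposit", "money", "interest", "account"])
print(sense, "-", sense.definition() if sense else "no sense found")
```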
2. Syntactic parsing. Syntactic parsing is another task closely related to semantic analysis. Its main goal is to determine the syntactic relations between the words of a sentence, providing more accurate input for semantic analysis. Parsing methods include dependency parsing and phrase-structure (constituency) parsing.

3. Semantic role labeling. Semantic Role Labeling (SRL) is a key task that identifies and labels the semantic relations between the predicate of a sentence and its arguments. Through SRL we can better understand the roles of, and relations between, the different constituents of a sentence.

4. Named entity recognition. Named Entity Recognition (NER) aims to identify and extract specific entities in text, such as person names, place names, and organization names. NER plays an important role in text understanding and information extraction, and provides valuable input for semantic analysis.

5. Semantic relation extraction. Semantic relation extraction extracts the semantic relations that hold between different entities mentioned in text. It yields deeper semantic information and thus enables higher-level semantic analysis.

3. Application Areas

1. Sentiment analysis. Sentiment analysis is a common application of semantic analysis, used to identify and analyze the sentiment polarity of a text, such as positive, negative, or neutral.
Evaluation Metrics for Word Sense Disambiguation in NLP

Natural Language Processing (NLP) is an important research direction in artificial intelligence, and Word Sense Disambiguation (WSD) is one of its key problems. WSD means determining the correct meaning of a word in text, since the same word can have different meanings in different contexts. In NLP, the metrics used to evaluate WSD methods matter a great deal; this article discusses several common ones.

1. Accuracy

Accuracy is one of the most commonly used metrics for evaluating WSD methods. It is the proportion of correct decisions among all disambiguation decisions:

accuracy = number of correct decisions / total number of decisions

However, accuracy is not the only evaluation metric, because it cannot reflect the relative importance or difficulty of individual senses.

2. Precision and Recall

Precision and recall are two other commonly used metrics, and they are usually reported together. Precision is the proportion of instances labeled with a given sense that truly have that sense. Recall is the proportion of instances that truly have a given sense that are correctly labeled with it:

precision = instances correctly labeled with the sense / all instances labeled with the sense
recall = instances correctly labeled with the sense / all instances that truly have the sense

Defined this way, precision and recall reflect the importance and difficulty of individual senses better than accuracy alone.

3. F1 Score

The F1 score combines precision and recall; it is their harmonic mean:

F1 = 2 * (precision * recall) / (precision + recall)

Because it takes both precision and recall into account, F1 gives a more complete picture of a WSD method's performance. A small worked example follows.
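The following short Python example works through accuracy, precision, recall, and F1 on an invented toy WSD output, so the formulas above can be checked by hand.

```python
# A small illustrative computation of accuracy, precision, recall and F1 for
# one sense of a toy WSD output.  Labels and data are made up for the example.
gold      = ["bank/river", "bank/finance", "bank/finance", "bank/river", "bank/finance"]
predicted = ["bank/river", "bank/finance", "bank/river",   "bank/river", "bank/finance"]

sense = "bank/finance"
tp = sum(1 for g, p in zip(gold, predicted) if p == sense and g == sense)
fp = sum(1 for g, p in zip(gold, predicted) if p == sense and g != sense)
fn = sum(1 for g, p in zip(gold, predicted) if p != sense and g == sense)

accuracy  = sum(g == p for g, p in zip(gold, predicted)) / len(gold)
precision = tp / (tp + fp) if tp + fp else 0.0
recall    = tp / (tp + fn) if tp + fn else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```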
4. Information Gain

Information gain is an information-theoretic metric that measures how important a feature is for a classification task. In WSD, each sense can be treated as a class and a contextual feature as the attribute, and the information gain of that feature is then computed as

information gain = H(sense) - H(sense | feature)

where H(sense) is the entropy of the sense distribution and H(sense | feature) is the conditional entropy of the senses given the feature.
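Here is a minimal sketch of the information gain computation for a single binary context feature; the sense labels and feature values are invented for illustration.

```python
# Minimal sketch of information gain for one binary context feature:
#   IG = H(sense) - H(sense | feature).  The counts below are invented.
from collections import Counter
from math import log2

def entropy(labels):
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

# Each item: (sense label, whether the feature "money appears nearby" fired)
data = [("finance", True), ("finance", True), ("finance", False),
        ("river", False), ("river", False), ("river", True)]

senses = [s for s, _ in data]
h_sense = entropy(senses)

h_cond = 0.0
for value in (True, False):
    subset = [s for s, f in data if f == value]
    h_cond += len(subset) / len(data) * entropy(subset)

print(f"IG = {h_sense - h_cond:.3f}")
```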
Stanford NLP Usage: Overview and Explanation

1. Introduction

1.1 Overview

The overview introduces the topic of this article, Stanford NLP, and provides some background. Stanford NLP is a natural language processing toolkit developed by the Natural Language Processing (NLP) group at Stanford University. It offers a rich set of functions and algorithms that help researchers and developers with tasks such as text analysis, language understanding, and information extraction.

Natural language processing is an important branch of artificial intelligence concerned with understanding and generating human language. With the arrival of the Internet and the digital era, massive amounts of text have become a valuable resource for research and applications. However, the complexity and diversity of human language make text processing challenging. Stanford NLP was created to address these challenges, using advanced techniques and algorithms to support researchers and developers.

In this article we examine the main features and uses of Stanford NLP. First we give a brief introduction, including its goals and the background of its creation. We then discuss its applications in detail, including text classification, named entity recognition, and sentiment analysis. Finally, we summarize its strengths and look ahead to its development potential.

Before reading this article, readers should be familiar with basic NLP concepts; some programming and machine learning background will also help. The article introduces the usage of Stanford NLP at a high level and provides concrete examples and application scenarios (one short example is given below). Next, let us explore the world of Stanford NLP, understand its uses and strengths, and consider its future in natural language processing.
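One convenient way to drive Stanford's NLP models from Python is the Stanza package released by the Stanford NLP group. The snippet below is a minimal sketch, not taken from this article, and assumes Stanza and its English models have been installed.

```python
# Minimal Stanza pipeline: tokenization, POS tagging, lemmatization and NER.
# Assumes `pip install stanza` and a one-time stanza.download("en") have been run.
import stanza

nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma,ner")
doc = nlp("Stanford NLP was developed at Stanford University in California.")

for sentence in doc.sentences:
    for word in sentence.words:
        print(f"{word.text}\t{word.upos}\t{word.lemma}")

for entity in doc.ents:
    print(entity.text, entity.type)
```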
1.2 Article Structure

This article consists of three parts: the introduction, the main body, and the conclusion. The introduction (Section 1) first outlines the topic and purpose of the article, then briefly introduces Stanford NLP and its importance in natural language processing, and finally presents the overall organization of the article. The main body (Section 2) describes the applications of Stanford NLP in detail.
Word Sense Disambiguation using ILPLucia Specia1, Ashwin Srinivasan2,3, Ganesh Ramakrishnan2, Maria das Graças V. Nunes 11 ICMC – University of São Paulo, Trabalhador São-Carlense, 400, São Carlos, 13560-970, Brazil{lspecia, gracan}@p.br2 IBM India Research Laboratory, Block 1, Indian Institute of Technology, New Delhi 110016, India3 Dept. of Computer Science and Engineering, University of New South Wales, Sydney, Australia{ashwin.srinivasan, ganramkr}@Abstract. We investigate the use of ILP for the task of Word Sense Disambiguation(WSD) in two different ways: (a) as a stand-alone constructor of models for WSD; and(b) to build interesting features, which can then used by standard model-builder such asSVM. Experiments examining a multilingual WSD task in the context of English-Portuguese machine translation of 7 highly ambiguous verbs showed promising results:our ILP-based standalone approach outperformed the results of propositional algorithmsand the ILP-generated features yielded improvements on the accuracy of standard pro-positional algorithms when compared to their use with low level features only.1 IntroductionWord Sense Disambiguation (WSD) aims to identify the correct sense of an ambiguous word in a sentence. Sometimes described as an “intermediate task”— that is, not an end in itself — it is necessary in most natural language tasks like machine translation, information retrieval, and so on. That is extremely difficult, possibly impractical, to solve completely is a long-standing view [1] and accuracies with state-of-the art methods are substantially lower than in other areas of text understanding (part-of-speech tagging accuracies, e.g., are now over 95%, while the best WSD results are still well below 80%). It is generally thought that WSD should benefit significantly by adopting a “deep approach” in which access to substantial body of world knowledge could assist in resolving ambiguities. This belief is based on the following observa-tion. While it is true that statistical methods like support vector machines using shallow fea-tures referring to the local context of the ambiguous word have usually yielded the best results to date, the accuracies obtained are low and significant improvements do not appear to be forthcoming. The incorporation of large amounts of domain knowledge has been hampered by the following: (a) access to such information in electronic form suitable for constructing mod-els; and (b) modeling techniques capable of utilizing diverse sources of domain knowledge, even when they are available. The first of these difficulties is now greatly alleviated by the availability in electronic form of very large semantic lexicons like WordNet, dictionaries, parsers, grammars and so on. In addition, there are now very large amounts of “shallow” data in the form of electronic text corpora from which statistical information can be readily ex-tracted. Using these diverse sources of information is, however, beyond the capabilities of existing general-purpose statistical methods that have been used for WSD resulting in the development of various ad hoc techniques for using specific sets of information for particulartasks. Arguably, ILP systems provide the most general-purpose framework for dealing with such data: there are explicit provisions made for the inclusion of background knowledge of any form; the representation language is powerful enough to capture the contextual relationships that arise; and modeling is not restricted to being of a particular form. 
In this paper, using a task in Machine Translation (MT) as a test bed and 7 different sources of background knowledge, we investigate the use of ILP for WSD in 2 ways: (a) the induction of disambiguation models; and (b) the construction of interesting features to be used by propositional algorithms such as SVM, which have presented reasonably good results with low level features in previous work.2 Empirical StudyAim. We investigate whether the use of an ILP system equipped with substantial background knowledge can significantly improve WSD accuracies for a task in English-Portuguese MT. Materials.Data.Data consist of 7 highly frequent and ambiguous verbs and a sample corpus of around 200 English sentences for each verb with the verb translation automatically annotated. The verbs are (numbers of possible translations in our corpus in brackets): come (11), get (17), give (5), go (11), look (7), make (11), and take (13).Background Knowledge.We exploit knowledge from 7 syntactic, semantic and pragmatic sources: (1) Bag-of-words of ±5 words surrounding the verb; (2) Part-of-speech tags of ±5 content words surrounding the verb; (3) Subject and object syntactic relations with respect to the verb; (4) 11 collocations with respect to the verb; (5) Selectional restrictions and semantic features; (6) Phrasal verbs possibly occurring in the sentence; (7) Overlapping words in dic-tionary definitions for the possible verb translations and the surrounding words in the sentence. Refer to [3] for details about representation of these knowledge sources.Algorithm. We used the ILP system Aleph [4] to implement the construction of disambigua-tion models and features for statistical model construction.Method. For reasons of space, we refer the reader to [4] for details of how models and features are constructed by Aleph. We follow standard methodology to construct and test models (i.e., cross-validation on training data to select the best models; and testing on unseen data). Results.Disambiguation Models. We evaluated the results according to the metrics usually employed for WSD, namely accuracy on the positive examples. We used Aleph constraint’s mechanism to create a rule to classify all the cases that have not been classified by other rules according to the majority class (the most frequent translation in the corpus). Table 1 shows the accuracy achieved by the test-bed previously mentioned, according to a 10-fold cross-validation strat-egy, together with the accuracy of two propositional algorithms that usually perform well on the WSD task, C4.5 and SVM, here having the relational features pre-processed in order to allow an attribute-value representation. ILP results are significantly better (t-test, p < 0.05). Feature Construction. The numbers of features constructed by the ILP engine for each verb are shown in Table 2. Together with 23 original low-level features, these new ILP features were then used to test SVM’s performance. The enhanced set of features improved SVM’s accuracy for 2 verbs (“come” and “go”), with other verbs unaffected. It is not evident at this stage whether this due to (a) the small number of examples; (b) inadequate relational features;(c) inadequate background knowledge; or (d) inadequate model construction by the statistical method. We are conducting experiments to shed further light on these questions.Table 1. 
Accuracy achieved by ILP compared to C4.5 and SVM

Verb    | ILP  | C4.5 | SVM
come    | 0.82 | 0.53 | 0.62
get     | 0.51 | 0.36 | 0.26
give    | 0.96 | 0.96 | 0.98
go      | 0.88 | 0.76 | 0.74
look    | 0.83 | 0.57 | 0.79
make    | 0.81 | 0.74 | 0.74
take    | 0.81 | 0.31 | 0.44
Average | 0.80 | 0.60 | 0.65

Table 2. Accuracy achieved by SVM with and without ILP features

Verb | # ILP features | Accuracy: original feature set | Accuracy: enhanced feature set
come | 483  | 0.58 | 0.71
get  | 329  | 0.31 | 0.31
give | 19   | 0.98 | 0.98
go   | 174  | 0.69 | 0.71
look | 677  | 0.79 | 0.79
make | 4122 | 0.76 | 0.76
take | 411  | 0.47 | 0.47

3 Concluding Remarks

The results reported here suggest that ILP could play a useful role in WSD. As a stand-alone constructor of WSD models, ILP yielded particularly good results, significantly outperforming propositional algorithms on the same data. This is mainly due to the truly hybrid nature of the approach and the rich set of knowledge sources employed. Regarding the use of ILP to construct features, our findings suggest that the addition of background knowledge does improve the accuracy of WSD models in some cases. Features can be constructed efficiently: usually, the time taken for feature construction was comparable to, or less than, that taken to build models. In both cases, the results strongly support undertaking a substantially larger study with more data.

References
1. Bar-Hillel, Y. The Present Status of Automatic Translation of Languages. Advances in Computers, 1 (1960) 91-163
2. Mooney, R.J. Inductive Logic Programming for Natural Language Processing. 6th International Inductive Logic Programming Workshop (1997) 3-24
3. Specia, L. A Hybrid Relational Approach for WSD – First Results. Coling-ACL (2006) 55-60
4. Srinivasan, A. The Aleph Manual. See: /oucl/research/areas/machlearn/Aleph/
A Summary of Common NLP Tasks

This section summarizes the common tasks in NLP, to give a global view of the field.

1. Lexical Analysis

• Word segmentation (Word Segmentation/Tokenization, ws): before any further processing, text is first segmented into words. The table below lists commonly used toolkits; a minimal segmentation example follows this list.

Library | Open source or commercial | Interface / languages | Segmentation | POS tagging | NER | Cost
HanLP | open source | Java, C++, Python | yes | yes | yes | free
Jieba | open source | Java, C++, Python | yes | no | no | free
FudanNLP | open source | Java | yes | yes | yes | free
LTP | open source | Java, C++, Python | yes | yes | yes | free
THULAC | open source | Java, C++, Python | yes | yes | no | free
BosonNLP | commercial | REST | yes | yes | yes | free API calls
Baidu NLP (百度NLP) | commercial | REST | yes | yes | yes | to be determined
Tencent Wenzhi (腾讯文智) | commercial | REST | yes | yes | yes | per call / per month
Alibaba Cloud NLP (阿里云NLP) | commercial | REST | yes | yes | yes | per call

• New word identification (New Words Identification, nwi): easy to understand — new vocabulary keeps appearing on the web, for example older internet slang such as "神马".
• Morphological analysis (Morphological Analysis, MA): analyzes the morphological make-up of words, including stems, roots, and affixes (prefixes and suffixes).
• Part-of-speech tagging (Part-of-speech Tagging, POS): determines the part of speech of each word in the text. Parts of speech include verbs, nouns, pronouns, and so on. In the open People's Daily corpus, every word of every sentence is already POS-tagged according to the annotation guidelines, and the tags can be read alongside those guidelines.
• Spelling correction (Spelling Correction, SP): as the name suggests, misspelled words must be found and corrected.
2. Syntactic Analysis

• Language modeling (Language Modeling, LM): language models are widely used (a detailed introduction to language models is given elsewhere), and many current models are built on top of them.
• Chunking: marks the phrase chunks of a sentence, such as noun phrases (NP) and verb phrases (VP).
• Constituency parsing (Constituency Parsing, CP): analyzes the constituents of a sentence and produces a syntax tree built from terminals and non-terminals.
• Dependency parsing (Dependency Parsing, DP): analyzes the dependency relations between the words of a sentence and produces a dependency tree built from those relations. A minimal dependency parsing example follows this list.
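As a small illustration of dependency parsing, here is a hedged sketch using the Stanza pipeline introduced in the Stanford NLP section above; it assumes Stanza and its English models are installed.

```python
# Minimal dependency parsing sketch with Stanza.  Assumes `pip install stanza`
# and that the English models have been downloaded with stanza.download("en").
import stanza

nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma,depparse")
doc = nlp("The quick brown fox jumps over the lazy dog.")

for sentence in doc.sentences:
    for word in sentence.words:
        # word.head is a 1-based index into the sentence; 0 means the root.
        head = sentence.words[word.head - 1].text if word.head > 0 else "ROOT"
        print(f"{word.text} --{word.deprel}--> {head}")
```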
UMND1:Unsupervised Word Sense Disambiguation Using ContextualSemantic RelatednessSiddharth Patwardhan School of ComputingUniversity of Utah Salt Lake City,UT84112. sidd@Satanjeev BanerjeeLanguage Technologies Inst.Carnegie Mellon UniversityPittsburgh,PA15217.banerjee@Ted PedersenDept.of Computer ScienceUniversity of MinnesotaDuluth,MN55812.tpederse@AbstractIn this paper we describe an unsuper-vised WordNet-based Word Sense Disam-biguation system,which participated(asUMND1)in the SemEval-2007Coarse-grained English Lexical Sample task.Thesystem disambiguates a target word by usingWordNet-based measures of semantic relat-edness tofind the sense of the word thatis semantically most strongly related to thesenses of the words in the context of the tar-get word.We briefly describe this system,the configuration options used for the task,and present some analysis of the results.1IntroductionWordNet::SenseRelate::TargetWord1(Patwardhan et al.,2005;Patwardhan et al.,2003)is an unsuper-vised Word Sense Disambiguation(WSD)system, which is based on the hypothesis that the intended sense of an ambiguous word is related to the words in its context.For example,if the“financial institution”sense of bank is intended in a context, then it is highly likely the context would contain related words such as money,transaction,interest rate,etc.The algorithm,therefore,determines the intended sense of a word(target word)in a given context by measuring the relatedness of each sense of that word with the words in its context. The sense of the target word that is most related to its context is selected as the intended sense of the target word.The system uses WordNet-based 1 measures of semantic relatedness2(Pedersen et al.,2004)to measure the relatedness between the different senses of the target word and the words in its context.This system is completely unsupervised and re-quires no annotated data for training.The lexical database WordNet(Fellbaum,1998)is the only re-source that the system uses to measure the related-ness between words and concepts.Thus,our system is classified under the closed track of the task.2System DescriptionOur WSD system consists of a modular framework, which allows different algorithms for the different subtasks to be plugged into the system.We divide the disambiguation task into two primary subtasks: context selection and sense selection.The context selection module tries to select words from the con-text that are most likely to be indicative of the sense of the target word.The sense selection module then uses the set of selected context words to choose one of the senses of the target word as the answer. 
Figure1shows a block schematic of the system, which takes SemEval-2007English Lexical Sample instances as input.Each instance is a made up of a few English sentences,and one word from these sentences is marked as the target word to be dis-ambiguated.The system processes each instance through multiple modules arranged in a sequential pipeline.Thefinal output of the pipeline is the sense that is most appropriate for the target word in the given context.2Instance Preprocessing Format FilterTarget Sense Context SelectionPostprocessingSense SelectionRelatedness MeasureFigure 1:System Architecture2.1Data PreparationThe input text is first passed through a format fil-ter ,whose task is to parse the input XML file.This is followed by a preprocessing step.Each instance passed to the preprocessing stage is first segmented into words,and then all compound words are iden-tified.Any sequence of words known to be a com-pound in WordNet is combined into a single entity.2.2Context SelectionAlthough each input instance consists of a large number of words,only a few of these are likely to be useful for disambiguating the target word.We use the context selection algorithm to select a subset of the context words to be used for sense selection.By removing the unimportant words,the computa-tional complexity of the algorithm is reduced.In this work,we use the NearestWords context selection algorithm.This algorithm algorithm se-lects 2n +1content words surrounding the target word (including the target word)as the context.A stop list is used to identify closed-class non-content words.Additionally,any word not found in Word-Net is also discarded.The algorithm then selects n content words before and n content words follow-ing the target word,and passes this unordered set of 2n +1words to the Sense Selection module.2.3Sense Selection AlgorithmThe sense selection module takes the set of words output by the context selection module,one of which is the target word to be disambiguated.For each of the words in this set,it retrieves a list of senses from WordNet,based on which it determines the intended sense of the target word.The package provides two main algorithms for Sense Selection:the local and the global algorithms,as described in previous work (Banerjee and Peder-sen,2002;Patwardhan et al.,2003).In this work,we use the local algorithm,which is faster and was shown to perform as well as the global algorithm.The local sense selection algorithm measures the semantic relatedness of each sense of the target word with the senses of the words in the context,and se-lects that sense of the target word which is most re-lated to the context word-senses.Given the 2n +1context words,the system scores each sense of the target word.Suppose the target word t has T senses,enumerated as t 1,t 2,...,t T .Also,suppose w 1,w 2,...,w 2n are the words in the context of t ,each hav-ing W 1,W 2,...,W 2n senses,respectively.Then for each t i a score is computed as score(t i )=2n j =1max k =1to W j(relatedness(t i ,w jk ))where w jk is the k th sense of word w j .The sense t iof target word t with the highest score is selected as the intended sense of the target word.The relatedness between two word senses is com-puted using a measure of semantic relatedness de-fined in the WordNet::Similarity software package (Pedersen et al.,2004),which is a suite of Perl mod-ules implementing a number WordNet-based mea-sures of semantic relatedness.For this work,we used the Context Vector measure (Patwardhan and Pedersen,2006).The relatedness of concepts 
is computed based on word co-occurrence statistics derived from WordNet glosses.Given two WordNet senses,this module returns a score between 0and 1,indicating the relatedness of the two senses.Our system relies on WordNet as its sense inven-tory.However,this task used OntoNotes (Hovy et al.,2006)as the sense inventory.OntoNotes word senses are groupings of similar WordNet senses.Thus,we used the training data answer key to gen-erate a mapping between the OntoNotes senses of the given lexical elements and their corresponding WordNet senses.We had to manually create the mappings for some of the WordNet senses,which had no corresponding OntoNotes senses.The sense selection algorithm performed all of its computa-tions with respect to the WordNet senses,and finally the OntoNotes sense corresponding to the selected WordNet sense of the target word was output as theanswer for each instance.3Results and AnalysisFor this task,we used the freely available Word-Net::SenseRelate::TargetWord v0.10and the Word-Net::Similarity v1.04packages.WordNet v2.1was used as the underlying knowledge base for these. The context selection module used a window size offive(including the target word).The semantic re-latedness of concepts was measured using the Con-text Vector measure,with configuration options as defined in previous research(Patwardhan and Ped-ersen,2006).Since we always predict exactly one sense for each instance,the precision and recall val-ues of all our experiments were always the same. Therefore,in this section we will use the name“ac-curacy”to mean both precision and recall.3.1Overall Results,and BaselinesThe overall accuracy of our system on the test data is0.538.This represents2,609correctly disam-biguated instances,out of a total of4,851instances. As baseline,we compare against the random al-gorithm where for each instance,we randomly pick one of the WordNet senses for the lexical element in that instance,and report the OntoNotes senseid it maps to as the answer.This algorithm gets an ac-curacy of0.417.Thus,our algorithm gets an im-provement of12%absolute(29%relative)over this random baseline.Additionally,we compare our algorithm against the WordNet SenseOne algorithm.In this algorithm, we pick thefirst sense among the WordNet senses of the lexical element in each instance,and report its corresponding OntoNotes sense as the answer for that instance.This algorithm leverages the fact that (in most cases)the WordNet senses for a particular word are listed in the database in descending order of their frequency of occurrence in the corpora from which the sense inventory was created.If the new test data has a similar distribution of senses,then this algorithm amounts to a“majority baseline”.This algorithm achieves an accuracy of0.681which is 15%absolute(27%relative)better than our algo-rithm.Although this seemingly na¨ıve algorithm out-performs our algorithm,we choose to avoid using this information in our algorithms because it repre-sents a large amount of human supervision in the form of manual sense tagging of text,whereas our goal is to create a purely unsupervised algorithm. 
Additionally,our algorithms can,with little change, work with other sense inventories besides WordNet that may not have this information.3.2Results Disaggregated by Part of SpeechIn our past experience,we have found that av-erage disambiguation accuracy differs significantly between words of different parts of speech.For the given test data,we separately evaluated the noun and verb instances.We obtained an accuracy of0.399 for the noun targets and0.692for the verb targets. Thus,wefind that our algorithm performs much bet-ter on verbs than on nouns,when evaluated using the OntoNotes sense inventory.This is different from our experience with S ENSEVAL data from previous years where performance on nouns was uniformly better than that on verbs.One possible reason for the better performance on verbs is that the OntoNotes sense inventory has,on average,fewer senses per verb word(4.41)than per noun word(5.71).How-ever,additional experimentation is needed to more fully understand the difference in performance.3.3Results Disaggregated by Lexical Element To gauge the accuracy of our algorithm on different words(lexical elements),we disaggregated the re-sults by individual word.Table1lists the accuracy values over instances of individual verb lexical ele-ments,and Table2lists the accuracy values for noun lexical elements.Our algorithm gets all instances correct for13verb lexical elements,and for none of the noun lexical elements.More generally,our al-gorithm gets an accuracy of50%or more on45out of the65verb lexical elements,and on15out of the 35noun lexical elements.For nouns,when the ac-curacy results are viewed in sorted order(as in Table 2),one can observe a sudden degradation of results between the accuracy of the word system.n–0.443–and the word source.n–0.257.It is unclear why there is such a jump;there is no such sudden degra-dation in the results for the verb lexical elements.4ConclusionsThis paper describes our system UMND1,which participated in the SemEval-2007Coarse-grainedWord Accuracy Word Accuracyremove 1.000purchase 1.000negotiate 1.000improve 1.000hope 1.000express 1.000exist 1.000estimate 1.000describe 1.000cause 1.000avoid 1.000attempt 1.000affect 1.000say0.969explain0.944complete0.938disclose0.929remember0.923allow0.914announce0.900kill0.875occur0.864do0.836replace0.800maintain0.800complain0.786believe0.764receive0.750approve0.750buy0.739produce0.727regard0.714propose0.714need0.714care0.714feel0.706recall0.667examine0.667claim0.667report0.657find0.607grant0.600work0.558begin0.521build0.500keep0.463go0.459contribute0.444rush0.429start0.421raise0.382end0.381prove0.364enjoy0.357see0.296set0.262promise0.250hold0.250lead0.231prepare0.222join0.222ask0.207come0.186turn0.048fix0.000Table1:Verb Lexical Element Accuracies English Lexical Sample task.The system is based on WordNet::SenseRelate::TargetWord,which is a freely available unsupervised Word Sense Disam-biguation software package.The system uses WordNet-based measures of semantic relatedness to select the intended sense of an ambiguous word.The system required no training data and using WordNet as its only knowledge source achieved an accuracy of54%on the blind test set.AcknowledgmentsThis research was partially supported by a National Science Foundation Early CAREER Development award(#0092784).ReferencesS.Banerjee and T.Pedersen.2002.An Adapted Lesk Al-gorithm for Word Sense Disambiguation Using Word-Net.In Proceedings of the Third International Con-Word Accuracy Word Accuracy 
policy0.949people0.904future0.870drug0.870space0.857capital0.789effect0.767condition0.765job0.692bill0.686area0.676base0.650management0.600power0.553development0.517chance0.467exchange0.459order0.456part0.451president0.446system0.443source0.257network0.218state0.208share0.192rate0.186hour0.167plant0.109move0.085point0.080value0.068defense0.048position0.044carrier0.000authority0.000Table2:Noun Lexical Element Accuraciesference on Intelligent Text Processing and Computa-tional Linguistics,pages136–145,Mexico City,Mex-ico,February.C.Fellbaum,editor.1998.WordNet:An electronic lexi-cal database.MIT Press.E.Hovy,M.Marcus,M.Palmer,L.Ramshaw,andR.Weischedel.2006.OntoNotes:The90%Solu-tion.In Proceedings of the Human Language Tech-nology Conference of the North American Chapter of the ACL,pages57–60,New York,NY,June.S.Patwardhan and ing WordNet-based Context Vectors to Estimate the Semantic Relat-edness of Concepts.In Proceedings of the EACL2006 Workshop on Making Sense of Sense:Bringing Com-putational Linguistics and Psycholinguistics Together, pages1–8,Trento,Italy,April.S.Patwardhan,S.Banerjee,and -ing Measures of Semantic Relatedness for Word Sense Disambiguation.In Proceedings of the Fourth In-ternational Conference on Intelligent Text Processing and Computational Linguistics,pages241–257,Mex-ico City,Mexico,February.S.Patwardhan,T.Pedersen,and S.Banerjee.2005.SenseRelate::TargetWord-A Generalized Framework for Word Sense Disambiguation.In Proceedings of the Twentieth National Conference on Artificial In-telligence(Intelligent Systems Demonstrations),pages 1692–1693,Pittsburgh,PA,July.T.Pedersen,S.Patwardhan,and J.Michelizzi.2004.WordNet::Similarity-Measuring the Relatedness of Concepts.In Human Language Technology Confer-ence of the North American Chapter of the Association for Computational Linguistics Demonstrations,pages 38–41,Boston,MA,May.。
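To make the score(t_i) formula of Section 2.3 above concrete, here is a minimal Python sketch of the local sense-selection algorithm. It substitutes NLTK's WordNet path similarity for the Context Vector measure actually used by the authors, so the numbers it produces are only illustrative.

```python
# Sketch of score(t_i) = sum over context words of the maximum relatedness
# between sense t_i and any sense of that word.  NLTK's path_similarity is a
# stand-in for the Context Vector measure used in the paper.
from nltk.corpus import wordnet as wn

def relatedness(s1, s2):
    sim = s1.path_similarity(s2)
    return sim if sim is not None else 0.0

def select_sense(target, context_words):
    best_sense, best_score = None, -1.0
    for t_i in wn.synsets(target):
        score = 0.0
        for w in context_words:
            sims = [relatedness(t_i, w_k) for w_k in wn.synsets(w)]
            score += max(sims, default=0.0)
        if score > best_score:
            best_sense, best_score = t_i, score
    return best_sense

sense = select_sense("bank", ["money", "deposit", "loan", "teller"])
print(sense, "-", sense.definition() if sense else "none")
```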
Word Sense DisambiguationUsing Conceptual DensityEneko Agirre.*Lengoaia eta Sistema Informatikoak saila.Euskal Herriko Unibertsitatea.p.k. 649, 20080 Donostia. Spain.jibagbee@si.ehu.esGerman Rigau.**Departament de Llenguatges i Sistemes Informàtics.Universitat Politècnica de Catalunya.Pau Gargallo 5, 08028 Barcelona. Spain.g.rigau@lsi.upc.esAbstract.This paper presents a method for the resolution of lexical ambiguity of nouns and its automatic evaluation over the Brown Corpus. The method relies on the use of the wide-coverage noun taxonomy of WordNet and the notion of conceptual distance among concepts, captured by a Conceptual Density formula developed for this purpose. This fully automatic method requires no hand coding of lexical entries, hand tagging of text nor any kind of training process. The results of the experiments have been automatically evaluated against SemCor, the sense-tagged version of the Brown Corpus.Keywords: Word Sense Disambiguation, Conceptual Distance, WordNet, SemCor. TOPIC: Large Text Corpora, Word Sense Disambiguation WORD COUNT: 3980Submitted to Coling’96 and ACL '96* Eneko Agirre was su pported by a grant from the Basqu e Government.** German Rigau was su pported by a grant from the Ministerio de Edu cación y Ciencia.Word Sense DisambiguationUsing Conceptual DensityAbstract.This paper presents a method for the resolution of lexical ambiguity of nouns and its automatic evaluation over the Brown Corpus. The method relies on the use of the wide-coverage noun taxonomy of WordNet and the notion of conceptual distance among concepts, captured by a Conceptual Density formula developed for this purpose. This fully automatic method requires no hand coding of lexical entries, hand tagging of text nor any kind of training process. The results of the experiments have been automatically evaluated against SemCor, the sense-tagged version of the Brown Corpus.Keywords: Word Sense Disambiguation, Conceptual Distance, WordNet, SemCor.1IntroductionWord sense disambiguation is a long-standing problem in Computational Linguistics. Much of recent work in lexical ambiguity resolution offers the prospect that a disambiguation system might be able to receive as input unrestricted text and tag each word with the most likely sense with fairly reasonable accuracy and efficiency. The most extended approach is to attempt to use the context of the word to be disambiguated together with information about each of its word senses to solve this problem.S everal interesting experiments have been performed in recent years using preexisting lexical knowledge resources: [Cowie et al. 92], [Wilks et al. 93] with LDOCE, [Yarowsky 92] with Roget's International Thesaurus, and [S ussna 93], [Voorhees 93], [Richarson et al. 94], [Resnik 95]with WordNet.Although each of these techniques looks promising for disambiguation, either they have been only applied to a small number of words, a few sentences or not in a public domain corpus. For this reason we have tried to disambiguate all the nouns from real texts in the public domain sense tagged version of the Brown corpus [Francis & Kucera 67], [Miller et al. 93], also called Semantic Concordance or S emCor for short1. The words in S emCor are tagged with word senses from WordNet, a broad semantic taxonomy for English [Miller 90]2. 
Thus S emCor provides an appropriate environment for testing our procedures and comparing among alternatives in a fully automatic way.The automatic decision procedure for lexical ambiguity resolution presented in this paper is based on an elaboration of the conceptual distance among concepts: Conceptual Density [Agirre & Rigau 95]. The system needs to know how words are clustered in semantic classes, and how semantic1Semcor comprises approximately 250,000 words. The tagging was done manu ally, and the error rate measu red by the au thors is arou nd 10% for polysemou s words.2The senses of a word are represented by synsets, one for each word sense. The nominal part of WordNet can be viewed as a tangled hierarchy of hypo/hypernymy relations. Nominal relations inclu de also three kinds of meronymic relations, which can be paraphrased as member-of, made-of and component-part-of. The version u sed in this work is WordNet 1.4, The coverage in WordNet of the senses for open-class words in SemCor reaches 96% according to the au thors.classes are hierarchically organised. For this purpose, we have used WordNet. Our system tries to resolve the lexical ambiguity of nouns by finding the combination of senses from a set of contiguous nouns that maximises the total Conceptual Density among senses.The performance of the procedure was tested on four texts from SemCor chosen at random. For comparison purposes two other approaches, [Sussna 93] and [Yarowsky 92], were also tried. The results show that our algorithm performs better on the test set.Following this short introduction the Conceptual Density formula is presented. The main procedure to resolve lexical ambiguity of nouns using Conceptual Density is sketched on section 3. Section 4 describes extensively the experiments and its results. Finally, sections 5 and 6 deal with further work and conclusions.2Conceptual Density and Word Sense DisambiguationA measure of the relatedness among concepts can be a valuable prediction knowledge source for several decisions in Natural Language Processing. For example, the relatedness of a certain word-sense to the context allows us to select that sense over the others, and actually disambiguate the word. As was pointed by [Miller & Teibel, 91], relatedness can be measured by a fine-grained conceptual distance among concepts in a hierarchical semantic net such as WordNet. This measure would allow to discover reliably the lexical cohesion of a given set of words in English.Conceptual distance tries to provide a basis for determining closeness in meaning among pairs of words, taking as reference a structured hierarchical net. Conceptual distance between two concepts is defined in [Rada et al. 89] as the length of the shortest path that connects the concepts in a hierarchical semantic net. 
In a similar approach, [S ussna 93] employs the notion of conceptual distance between network nodes in order to improve precision during document indexing.[Resnik 95] captures semantic similarity (closely related to conceptual distance) by means of the information content of the concepts in a hierarchical net.In general these approaches focus on nouns.The measure of conceptual distance among concepts we are looking for should be sensitive to:•the length of the shortest path that connects the concepts involved.•the depth in the hierarchy: concepts in a deeper part of the hierarchy should be ranked closer.• the density of concepts in the hierarchy: concepts in a dense part of the hierarchy are relatively closer than those in a more sparse region.•the measure should be independent of the number of concepts we are measuring.We have experimented with several formulas that follow the four criteria presented above. The experiments reported here were performed using the Conceptual Density formula [Agirre & Rigau 95], which compares areas of subhierarchies.Word to be disambiguated: WContext words: w1 w2 w3 w4 ...Figure 1: senses of a word in WordNetTo illustrate how Conceptual Density can help to disambiguate a word, in figure 1 the word W has four senses and several context words. Each sense of the words belongs to a subhierachy of WordNet. The dots in the subhierarchies represent the senses of either the word to be disambiguated (W) or the words in the context. Conceptual Density will yield the highest density for the subhierarchy containing more senses of those, relative to the total amount of senses in the subhierarchy. The sense of W contained in the subhierarchy with highest Conceptual Density will be chosen as the sense disambiguating W in the given context. In figure 1, sense2 would be chosen.Given a concept c, at the top of a subhierarchy, and given nhyp and h (mean number of hyponyms per node and height of the subhierarchy, respectively), the Conceptual Density for c when its subhierarchy contains a number m (marks) of senses of the words to disambiguate is given by the formula below:CD(c,m)=nhyp i0.20i=0m−1∑descendants c(1)Formula 1 shows a parameter that was computed experimentally. The 0.20 tries to smooth the exponential i, as m ranges between 1 and the total number of senses in WordNet. Several values were tried for the parameter, and it was found that the best performance was attained consistently when the parameter was near 0.20.3 The Disambiguation Algorithm Using Conceptual DensityGiven a window size, the program moves the window one noun at a time from the beginning of the document towards its end, disambiguating in each step the noun in the middle of the window and considering the other nouns in the window as context. Non-noun words are not taken into account.The algorithm to disambiguate a given noun w in the middle of a window of nouns W (c.f. figure 2) roughly proceeds as follows. First, the algorithm represents in a lattice the nouns present in the window, their senses and hypernyms (step 1). Then, the program computes the Conceptual Density of each concept in WordNet according to the senses it contains in its subhierarchy (step 2). It selects the concept c with highest Conceptual Density (step 3) and selects the senses below it as the correct senses for the respective words (step 4).The algorithm proceeds then to compute the density for the remaining senses in the lattice, and continues to disambiguate the nouns left in W (back to steps 2, 3 and 4). 
When no further disambiguation is possible, the senses left for w are processed and the result is presented (step 5).(Step 1) t ree := compute_tree(words_in_window)loop(Step 2) tree := compute_conceptual_distance(tree)(Step 3) concept := selecct_concept_with_highest_weigth(tree)if concept = null then exitloop(Step 4) tree := mark_disambiguated_senses(tree,concept)endloop(Step 5) o utput_disambiguation_result(tree)Figure 2: algorithm for each windowBesides completely disambiguating a word or failing to do so, in some cases the disambiguation algorithm returns several possible senses for a word. In the experiments we considered these partial outcomes as failure to disambiguate.4The Experiments4.1 The textsWe selected four texts from SemCor at random: br-a01 (where a stands for the gender "Press: Reportage"), br-b20 (b for "Press: Editorial"), br-j09 (j means "Learned: Science") and br-r05 (r for "Humour"). Table 1 shows some statistics for each texttext words nouns nounsmonosemousin WNbr-a012079564464149 (32%)br-b202153453377128 (34%)br-j092495620586205 (34%)br-r052407457431120 (27%)total913420941858602 (32%)Table 1: data for each textAn average of 11% of all the nouns in these four texts were not found in WordNet. According to this data, the amount of monosemous nouns in these texts is bigger (32% average) than the one calculated for the open-class words from the whole SemCor (27.2% according to [Miller et al. 94]).For our experiments, these texts play both the role of input files (without semantic tags) and (tagged) test files. When they are treated as input files, we throw away all non-noun words, only leaving the lemmas of the nouns present in WordNet. Figure 4 shows the SemCor format for the nouns in the example sentence in figure 3:The jury praised the administration and operation of the AtlantaPolice_Department, the Fulton_Tax_Commissioner_'s_Office, theBellwood and Alpharetta prison_farms, Grady_Hospital and theFulton_Health_Department.Figure 3: sample sentence from SemCor<s><wd>jury</wd><sn>[noun.group.0]</sn><tag>NN</tag><wd>administration</wd><sn>[noun.act.0]</sn><tag>NN</tag><wd>operation</wd><sn>[noun.state.0]</sn><tag>NN</tag><wd>Police_Department</wd><sn>[noun.group.0]</sn><tag>NN</tag><wd>prison_farms</wd><mwd>prison_farm</mwd><msn>[noun.artifact.0]</msn><tag>NN</tag> </s>Figure 4: SemCor formatAfter erasing the irrelevant information we get the words3 as they will be input to the algorithm:jury administration operation Police_Department prison_farmFigure 5: input wordsThe output of the algorithm comprises sense tags that can be compared automatically with the original file (c.f. figure 4).3Note that we already have the knowledge that police department and prison farm are compou nd nou ns, and that the lemma of prison farms is prison farm.4.2 Results and evaluationOne of the goals of the experiments was to decide among different variants of the Conceptual Density formula. Results are given averaging the results of the four files. Partial disambiguation is treated as failure to disambiguate. Precision (that is, the percentage of actual answers which were correct) and recall (that is, the percentage of possible answers which were correct) are given in terms of polysemous nouns only. The graphs are drawn against the size of the context 4 that was taken into account when disambiguating.• meronymy does not improve performance as expected.One parameter controls whether meronymic relations, in addition to the hypo/hypernymy relation, are taken into account or not. 
4 The Experiments

4.1 The texts

We selected four texts from SemCor at random: br-a01 (where "a" stands for the genre "Press: Reportage"), br-b20 ("b" for "Press: Editorial"), br-j09 ("j" for "Learned: Science") and br-r05 ("r" for "Humour"). Table 1 shows some statistics for each text.

    text      words   nouns   nouns in WN   monosemous
    br-a01     2079     564       464        149 (32%)
    br-b20     2153     453       377        128 (34%)
    br-j09     2495     620       586        205 (34%)
    br-r05     2407     457       431        120 (27%)
    total      9134    2094      1858        602 (32%)

    Table 1: data for each text

An average of 11% of all the nouns in these four texts were not found in WordNet. According to this data, the proportion of monosemous nouns in these texts (32% on average) is higher than the one calculated for the open-class words of the whole SemCor (27.2% according to [Miller et al. 94]).

For our experiments, these texts play both the role of input files (without semantic tags) and of (tagged) test files. When they are treated as input files, we throw away all non-noun words, leaving only the lemmas of the nouns present in WordNet. Figure 4 shows the SemCor format for the nouns in the example sentence in figure 3:

    The jury praised the administration and operation of the Atlanta
    Police_Department, the Fulton_Tax_Commissioner_'s_Office, the
    Bellwood and Alpharetta prison_farms, Grady_Hospital and the
    Fulton_Health_Department.

    Figure 3: sample sentence from SemCor

    <s>
    <wd>jury</wd><sn>[noun.group.0]</sn><tag>NN</tag>
    <wd>administration</wd><sn>[noun.act.0]</sn><tag>NN</tag>
    <wd>operation</wd><sn>[noun.state.0]</sn><tag>NN</tag>
    <wd>Police_Department</wd><sn>[noun.group.0]</sn><tag>NN</tag>
    <wd>prison_farms</wd><mwd>prison_farm</mwd><msn>[noun.artifact.0]</msn><tag>NN</tag>
    </s>

    Figure 4: SemCor format

After erasing the irrelevant information we get the words³ as they will be input to the algorithm:

    jury administration operation Police_Department prison_farm

    Figure 5: input words

The output of the algorithm comprises sense tags that can be compared automatically with the original file (cf. figure 4).

³ Note that we already have the knowledge that police department and prison farm are compound nouns, and that the lemma of prison farms is prison farm.
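Turning the markup of figure 4 into the input words of figure 5 amounts to keeping, for each <wd> element tagged as a noun, its lemma (the <mwd> form when the entry is a multiword). The following is a minimal sketch under the assumption that the input follows exactly the one-entry-per-word format shown in figure 4; as footnote 3 notes, lemmatisation of inflected forms is taken as already available:

    # Illustrative extraction of the input nouns (figure 5) from SemCor markup (figure 4).
    import re

    def nouns_from_semcor(sentence_markup):
        """Return the noun lemmas of one SemCor sentence in the format of figure 4."""
        nouns = []
        pattern = re.compile(r"<wd>(.*?)</wd>(?:<mwd>(.*?)</mwd>)?.*?<tag>(.*?)</tag>")
        for wordform, lemma, tag in pattern.findall(sentence_markup):
            if tag == "NN":                      # keep nouns (only NN appears in figure 4)
                nouns.append(lemma or wordform)  # prefer the multiword lemma, e.g. prison_farm
        return nouns

    print(nouns_from_semcor(
        "<s>"
        "<wd>jury</wd><sn>[noun.group.0]</sn><tag>NN</tag>"
        "<wd>prison_farms</wd><mwd>prison_farm</mwd><msn>[noun.artifact.0]</msn><tag>NN</tag>"
        "</s>"
    ))   # ['jury', 'prison_farm']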
4.2 Results and evaluation

One of the goals of the experiments was to decide among different variants of the Conceptual Density formula. Results are given averaging the results of the four files. Partial disambiguation is treated as failure to disambiguate. Precision (that is, the percentage of actual answers which were correct) and recall (that is, the percentage of possible answers which were correct) are given in terms of polysemous nouns only. The graphs are drawn against the size of the context⁴ that was taken into account when disambiguating.

⁴ Context size is given in terms of nouns.

• Meronymy does not improve performance as expected.

One parameter controls whether meronymic relations, in addition to the hypo/hypernymy relation, are taken into account or not. A priori, the more relations are taken into account, the better density should capture semantic relatedness, and therefore better results can be expected. The experiments (see figure 6) showed that there is not much difference: adding meronymic information does not improve precision, and raises coverage only by approximately 3%. Nevertheless, in the rest of the results reported below, meronymy and hypernymy were used.

[Figure 6: meronymy and hyperonymy — precision (%) against window size, hypernymy only vs. hypernymy plus meronymy]

• Global nhyp is as good as local nhyp.

There was an aspect of the density formula which we could not decide analytically, and which we wanted to check experimentally. The average number of hyponyms, nhyp (cf. formula 1), can be approximated in two ways. If an independent nhyp is computed for every concept in WordNet we call it local nhyp. If instead a single nhyp is computed over the whole hierarchy, we have global nhyp. While local nhyp is the actual average for a given concept, global nhyp gives only an estimate. The results (cf. figure 7) show that local nhyp performs only slightly better; therefore global nhyp is favoured and was used in the subsequent experiments.

[Figure 7: local nhyp vs. global nhyp — precision (%) against window size]
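The difference between the two estimates is only a matter of where the averaging is done. A small sketch over a toy hierarchy (a hypothetical dictionary from each concept to its direct hyponyms, not WordNet data) makes the distinction concrete:

    # Illustrative computation of local vs. global nhyp over a toy hierarchy.
    def mean_hyponyms(hierarchy, root):
        """Mean number of direct hyponyms per node in the subhierarchy rooted at `root`."""
        to_visit, counts = [root], []
        while to_visit:
            node = to_visit.pop()
            children = hierarchy.get(node, [])
            counts.append(len(children))
            to_visit.extend(children)
        return sum(counts) / len(counts)

    toy = {
        "entity": ["animal", "artifact"],
        "animal": ["fish", "bird"],
        "artifact": ["instrument"],
        "fish": [], "bird": [], "instrument": [],
    }
    # local nhyp: one value per concept, the actual average inside its own subhierarchy
    print(round(mean_hyponyms(toy, "animal"), 2))   # 0.67
    # global nhyp: a single value computed once over the whole hierarchy and reused everywhere
    print(round(mean_hyponyms(toy, "entity"), 2))   # 0.83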
• Context size: different behaviour for each text.

Deciding the optimum context size for disambiguating with Conceptual Density is an important issue. One could assume that the more context there is, the better the disambiguation results would be. Our experiments show that each file from SemCor behaves differently (cf. figure 8): while br-b20 shows clear improvement for bigger window sizes, br-r05 reaches a local maximum at a window size of 10, etc.

[Figure 8: context size and different files — precision (%) against window size for br-a01, br-b20, br-j09, br-r05 and their average]

As each text is structured as a list of sentences, lacking any indication of headings, sections, paragraph endings, text changes, etc., the program gathers the context without knowing whether the nouns actually occur in coherent pieces of text. This could account for the fact that in br-r05, composed mainly of short pieces of dialogue, the best results are obtained for a window size of 10, the average size of these dialogue pieces. Longer windows will include other pieces of unrelated dialogue that could mislead the disambiguation.

Besides, the files can be composed of different pieces of unrelated text without this being indicated explicitly. For instance, two of our test files (br-a01 and br-b20) are collections of short journalistic texts. This could explain why the performance of br-a01 decreases for windows of 30 nouns: for most of the nouns, the context would then include nouns from another article.

The polysemy level could also affect the performance, but in our texts lower polysemy does not correlate with better performance. Nevertheless, the actual nature of each text is surely an important factor, difficult to measure, which could account for the different behaviour on its own. For instance, the poor performance on text br-j09 could be explained by its technical nature. Further analysis of the errors, contexts and relations found among the words would be needed to be more conclusive.

Leaving aside these considerations, and in order to give an overall view of the performance, we base our conclusions on the average behaviour.

• File vs. sense.

WordNet groups senses into 24 lexicographer's files. The algorithm assigns a noun both a specific sense and a file label. Both file matches and sense matches are interesting to count. While the sense level gives a fine-grained measure of the algorithm, the file level gives an indication of the performance if we were interested in a less sharp level of disambiguation. The granularity of the sense distinctions made in [Hearst 91], [Gale et al. 93] and [Yarowsky 92], also called homographs in [Guthrie et al. 93], can be compared to that of the file level in WordNet. For instance, in [Yarowsky 92] two homographs of the noun bass are considered, one characterised as MUSIC and the other as ANIMAL, INSECT. In WordNet, the 6 senses of bass related to music appear in the following files: ARTIFACT, ATTRIBUTE, COMMUNICATION and PERSON; the 3 senses related to animals appear in the files ANIMAL and FOOD. This means that while the homograph level in [Yarowsky 92] distinguishes two sets of senses, the file level in WordNet distinguishes six sets of senses, which is still finer in granularity.

The following figure shows that, as expected, file-level matches attain better performance (71.2% overall and 53.9% for polysemous nouns) than sense-level matches.

[Figure 9: sense level vs. file level — precision (%) against window size]

• Evaluation of the results.

Figure 10 shows that, overall, coverage over polysemous nouns increases significantly with the window size, without losing precision. Coverage tends to stabilise near 80%, with little improvement for window sizes bigger than 20.

[Figure 10: precision and coverage against window size — curves for semantic density, the most frequent heuristic and the guessing baseline]

The figure also shows the guessing baseline, given by selecting senses at random. This baseline was first calculated analytically and later checked experimentally. We also compare the performance of our algorithm with that of the "most frequent" heuristic. The frequency counts for each sense were collected using the rest of SemCor and then applied to the four texts. While its precision is similar to that of our algorithm, its coverage is 8% lower.

All the data for the best window size can be seen in table 2. The precision and coverage shown in all the preceding graphs were relative to the polysemous nouns only. If monosemous nouns are also included, precision rises, as shown in table 2, from 43% to 64.5%, and coverage increases from 79.6% to 86.2%.

    w=30 (%)             Cover.   Prec.   Recall
    overall      File     86.2     71.2    61.4
                 Sense             64.5    55.5
    polysemic    File     79.6     53.9    42.8
                 Sense             43.0    34.2

    Table 2: results for the best window size (w = 30)
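Since precision is computed over the nouns the algorithm actually answered and recall over all nouns, the three columns of table 2 are tied together by recall ≈ coverage × precision. A small helper (with variable names chosen for this example) makes the relationship explicit and can be used to check the table:

    # Coverage, precision and recall as used in section 4.2 (illustrative helper).
    def scores(attempted, correct, total):
        coverage = 100.0 * attempted / total      # nouns for which an answer was given
        precision = 100.0 * correct / attempted   # correct among the attempted ones
        recall = 100.0 * correct / total          # correct among all nouns
        return coverage, precision, recall

    # Check against the polysemic/File row of table 2: 79.6% coverage at 53.9% precision
    # gives a recall of 0.796 * 53.9 ≈ 42.9, matching the reported 42.8 up to rounding.
    print(round(0.796 * 53.9, 1))   # 42.9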
4.3 Comparison with other works

The raw results presented here may seem poor when compared to those reported in [Hearst 91], [Gale et al. 93] and [Yarowsky 92]. We think that several factors make the comparison difficult. Most of those works focus on a selected set of a few words, generally with a couple of senses of very different meaning (coarse-grained distinctions), and for which their algorithms could gather enough evidence. On the contrary, we tested our method with all the nouns in a subset of an unrestricted public-domain corpus (more than 9,000 words), making fine-grained distinctions among all the senses in WordNet.

An approach that uses hierarchical knowledge is that of [Resnik 95], which additionally uses the information content of each concept gathered from corpora. Unfortunately, he applies his method to a different task, that of disambiguating sets of related nouns. The evaluation is done on a set of related nouns from Roget's Thesaurus tagged by hand. The fact that some senses were discarded because the human annotator judged them not reliable makes the comparison even more difficult.

In order to compare our approach we decided to implement [Yarowsky 92] and [Sussna 93] and test them on our texts. For [Yarowsky 92] we had to adapt it to work with WordNet. His method relies on cooccurrence data gathered on Roget's Thesaurus semantic categories; instead, in our experiment we use saliency values⁵ based on the lexicographic file tags in SemCor (cf. figure 4). The results for a window size of 50 are shown in table 3⁶. The precision attained by our algorithm is higher. For a better comparison, consider the results in table 4, where the coverage of our algorithm was easily extended using the version presented below, increasing recall to 70.1%.

    (%)           Cover.   Prec.   Recall
    C. Density     86.2     71.2    61.4
    Yarowsky      100.0     64.0    64.0

    Table 3: comparison with [Yarowsky 92]

From the methods based on Conceptual Distance, [Sussna 93] is the most similar to ours. Sussna disambiguates several documents from a public corpus using WordNet. The test set was tagged by hand, allowing more than one correct sense for a single word. The method he uses has to overcome a combinatorial explosion⁷ by controlling the size of the window and "freezing" the senses of all the nouns preceding the noun to be disambiguated. In order to freeze the winning sense, Sussna's algorithm is forced to make a unique choice; when Conceptual Distance is not able to choose a single sense, it has to choose one at random. Conceptual Density overcomes the combinatorial explosion by extending the notion of conceptual distance from a pair of words to n words, and it can therefore yield more than one correct sense for a word. For comparison, we altered our algorithm to also make random choices when unable to choose a single sense. We applied the algorithm Sussna considers best, discarding the factors that do not affect performance significantly⁸, and obtained the results in table 4.

    (%)                   Cover.   Prec.
    C. Density    File    100.0    70.1
                  Sense            60.1
    Sussna        File    100.0    64.5
                  Sense            52.3

    Table 4: comparison with [Sussna 93]

A more thorough comparison with these methods would be desirable, but is not possible in this paper for the sake of conciseness.

⁵ We tried both mutual information and the association ratio, and the latter performed better.
⁶ The results of our algorithm are those for window size 30, file matches and overall.
⁷ In our replication of his experiment, the mutual constraint for the first 10 nouns (the optimal window size according to his experiments) of file br-r05 had to deal with more than 200,000 synset pairs.
⁸ Initial mutual constraint size is 10 and window size is 41. Meronymic links are also considered. All the links have the same weight.

5 Further Work

We would like to have included in this paper a study of whether there is a correlation between correct and erroneous sense assignments and the degree of Conceptual Density, that is, the actual value yielded by formula 1. If this were the case, the error rate could be further decreased by setting a threshold on the Conceptual Density values of winning senses. We would also like to evaluate the usefulness of partial disambiguation: decrease of ambiguity, number of times the correct sense is among the chosen ones, etc.

There are some factors that could raise the performance of our algorithm:

• Work on coherent chunks of text.

Unfortunately, any information about discourse structure is absent in SemCor, apart from sentence endings. If coherent pieces of discourse were taken as input, both the performance and the efficiency of the algorithm might improve. The performance would gain from the fact that sentences from unrelated topics would not be considered in the disambiguation window. We think that efficiency could also be improved if the algorithm worked on entire coherent chunks instead of one word at a time.

• Extend and improve the semantic data.

WordNet provides synonymy, hypernymy and meronymy relations for nouns, but other relations are missing. For instance, WordNet lacks cross-categorial semantic relations, which could be very useful to extend the notion of Conceptual Density of nouns to Conceptual Density of words. Apart from extending the disambiguation to verbs, adjectives and adverbs, cross-categorial relations would allow us to capture better the relations among senses and provide firmer grounds for disambiguation. These other relations could be extracted from other knowledge sources, either corpus-based or MRD-based, such as topic information (as can be found in Roget's Thesaurus), word frequencies, collocations [Yarowsky 93], selectional restrictions [Ribas 95], etc. If those relations could be given on WordNet senses, Conceptual Density could profit from them. [Richardson et al. 94] tries to combine WordNet and informational measures taken from corpora, defining a conceptual similarity that considers both, but does not give any evaluation of the method. It is our belief, following the ideas of [McRoy 92], that full-fledged lexical ambiguity resolution should combine several information sources; Conceptual Density might be only one of a number of complementary pieces of evidence for the plausibility of a certain word sense.

• Tune the sense distinctions to the level best suited for the application.

On the one hand, the sense distinctions made by WordNet 1.4 are not always satisfactory and, obviously, WordNet 1.4 is not a complete lexical database. On the other hand, our algorithm is not designed to work at the file level: if the sense level is unable to distinguish between two senses, the file level also fails, even if both senses belong to the same file. If the senses were collapsed at the file level, the coverage and precision of the algorithm at the file level might be even better.
6 Conclusion

The automatic method for the disambiguation of nouns presented in this paper is readily usable in any general domain and on free-running text, given part-of-speech tags. It does not need any training and uses word sense tags from WordNet, an extensively used lexical database. The algorithm is theoretically motivated and founded, and offers a general measure of the semantic relatedness of any number of nouns.

Conceptual Density has been used for other tasks apart from the disambiguation of free-running text. Its application to automatic spelling correction is outlined in [Agirre et al. 94]. It has also been used in Computational Lexicography, enriching dictionary senses with semantic tags extracted from WordNet [Rigau 94] and linking bilingual dictionaries to WordNet [Rigau and Agirre 95].

In the experiments, the algorithm disambiguated four texts (more than 10,000 words long) of SemCor, a subset of the Brown corpus. The results were obtained by automatically comparing the tags in SemCor with those computed by the algorithm, which allows comparison with other disambiguation methods. Two other methods, [Sussna 93] and [Yarowsky 92], were also tried on the same texts, and our algorithm performed better than both.

The results are promising, considering the difficulty of the task (free-running text, a large number of senses per word in WordNet) and the lack of any discourse structure in the texts. Two kinds of results can be obtained: the specific sense or a coarser, file-level, tag.