Japanese Named Entity Extraction with Redundant Morphological Analysis
Survey of named entity recognition research based on deep learning

ZHANG Jiyuan, QIAN Yurong, LENG Hongyong, HOU Shuxiang, CHEN Jiaying
(1. School of Software, Xinjiang University, Urumqi 830000, China; 2. Key Laboratory of Signal Detection and Processing in Xinjiang Uygur Autonomous Region, Xinjiang University, Urumqi 830046, China; 3. Key Laboratory of Software Engineering, Xinjiang University, Urumqi 830000, China; 4. School of Information Science and Engineering, Xinjiang University, Urumqi 830000, China; 5. School of Computer Science, Beijing Institute of Technology, Beijing 100081, China)
Modern Electronics Technique, Mar. 2024, Vol. 47, No. 6, pp. 32-42. DOI: 10.16652/j.issn.1004-373x.2024.06.006. Received 2023-08-31, revised 2023-10-08. Supported by the National Natural Science Foundation of China (62266043, 61966035), the Natural Science Foundation of the Xinjiang Uygur Autonomous Region (2021D01C083, 2022D01C692), Xinjiang university basic research programs (XJEDU2023P012, XJEDU2023Z001), the Science Fund for Distinguished Young Scholars of Xinjiang (2023D01E01), and the Tianshan Innovation Team program (2023D14012).

Abstract: Named entity recognition (NER) is a crucial task in the field of natural language processing, which aims to identify entities with specific meanings in natural language texts, such as person names, place names, institution names, and proper nouns. Researchers have proposed various methods for this task, including those based on domain knowledge and on supervised machine learning. In recent years, with the rapid expansion of internet text data and the rapid development of deep learning techniques, deep learning models have become a research hotspot in named entity recognition and have made significant progress in this field. This paper provides a comprehensive review of existing deep learning techniques for named entity recognition, categorizing them into four main classes: models based on convolutional neural networks (CNN), recurrent neural networks (RNN), Transformers, and graph neural networks (GNN). An overview of deep learning architectures for named entity recognition is also presented. Finally, the challenges faced by named entity recognition and potential future research directions are explored, with the aim of promoting further development in the field.

Keywords: named entity recognition; deep learning; natural language processing; convolutional neural network; recurrent neural network; Transformer; graph neural network

0 Introduction

Natural language processing (NLP) is an important research direction in computer science and artificial intelligence; it studies theories and methods for effective communication between humans and computers in natural language.
Named entity recognition algorithms

Named entity recognition (NER) refers to extracting entities with specific meanings from text, such as person names, place names, and organization names. It is an important step in natural language understanding and a foundational technique for language processing. The broader pipeline includes lexical analysis, part-of-speech tagging, named entity recognition, and relation extraction. In computing, named entity recognition is also called entity extraction: it extracts the main entities in a text, such as names, times, places, and organizations. It is an important stage of language processing and an important basis for language understanding.
The basic pipeline of Chinese named entity recognition consists of: (1) word segmentation; (2) part-of-speech tagging; (3) entity role labeling. Lexical analysis covers Chinese word segmentation and POS tagging, which can be automated with appropriate software and algorithms, such as deep learning methods or segmentation systems. POS tagging matters: choosing appropriate tags makes the recognized entities more accurate. Entity role labeling assigns roles to entities according to the relations between the text and the entities, such as passive entity, object, agent of an action, or time, thereby clarifying each entity's context and the system of role relations.
Besides statistics-based algorithms, deep learning techniques such as RNN, LSTM, CNN, and BERT have in recent years been applied to Chinese named entity recognition; they can identify the entities in an article directly from the text, with high accuracy. In short, Chinese named entity recognition is a complex technical problem involving natural language processing, deep learning, machine learning, and big data technology; accurate and efficient recognition can be achieved by combining algorithms at multiple levels.
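The final labeling step of the pipeline described above is usually expressed as per-character BIO tags that must be decoded into entity spans. A minimal sketch (the example sentence and tags are invented for illustration):

```python
# Decode per-character BIO tags (B- begins an entity, I- continues it,
# O is outside) into (entity_text, entity_type) spans.

def bio_to_entities(chars, tags):
    entities = []
    buf, etype = [], None
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            if buf:                      # close any open entity
                entities.append(("".join(buf), etype))
            buf, etype = [ch], tag[2:]
        elif tag.startswith("I-") and buf and tag[2:] == etype:
            buf.append(ch)               # continue the current entity
        else:
            if buf:
                entities.append(("".join(buf), etype))
            buf, etype = [], None
    if buf:
        entities.append(("".join(buf), etype))
    return entities

chars = list("张三在北京")  # "Zhang San is in Beijing" (invented example)
tags  = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC"]
print(bio_to_entities(chars, tags))  # [('张三', 'PER'), ('北京', 'LOC')]
```

The same decoder works for word-level tags; only the joining of the buffer changes.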
A survey of named entity recognition and relation extraction techniques in text

Named entity recognition (NER) and relation extraction are important tasks in natural language processing (NLP), with wide applications in information extraction, knowledge graph construction, question answering systems, and other areas. This article surveys the techniques for both tasks.

1. Survey of named entity recognition techniques

Named entity recognition refers to identifying entities with specific meanings in text, such as person names, place names, and organization names. Commonly used approaches fall into three families: rule-based methods, statistical machine learning methods, and deep learning methods.
Rule-based methods identify entities through predefined rules. Because the rules must be crafted by hand, these methods adapt poorly to new domains and languages. Although simple to implement, they perform poorly in complex contexts.
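A rule-based recognizer of the kind just described can be sketched with a couple of hand-written patterns (the patterns and titles below are invented for illustration; real rule sets are far larger):

```python
# Minimal rule-based NER: each entity type is matched by a hand-written
# regular expression. This illustrates why such systems are easy to build
# but brittle: anything outside the patterns is missed.
import re

RULES = [
    ("DATE",   re.compile(r"\b\d{4}-\d{2}-\d{2}\b")),
    ("PERSON", re.compile(r"\b(?:Dr|Mr|Ms)\. [A-Z][a-z]+\b")),
]

def rule_based_ner(text):
    found = []
    for etype, pattern in RULES:
        for m in pattern.finditer(text):
            found.append((m.group(), etype))
    return found

print(rule_based_ner("Dr. Smith joined on 2024-03-15."))
# [('2024-03-15', 'DATE'), ('Dr. Smith', 'PERSON')]
```

Note that "Smith joined yesterday" would yield nothing, which is exactly the coverage problem the text attributes to this family of methods.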
Statistical machine learning methods use statistical models to identify entities. Commonly used algorithms include the maximum entropy model, hidden Markov models, and conditional random fields. These methods depend on large amounts of annotated data, learning textual features and contextual information to determine entity types. They achieve good accuracy, but require substantial labeled data for training and must be retrained for new domains and languages.
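The hidden Markov model approach mentioned above tags a sentence by Viterbi decoding over entity states. A toy sketch with two states, "O" (outside) and "PER" (person); all probabilities are invented, whereas a real system would estimate them from annotated corpora:

```python
# Viterbi decoding for a tiny hand-made HMM entity tagger.
import math

STATES = ["O", "PER"]
START = {"O": 0.8, "PER": 0.2}
TRANS = {"O": {"O": 0.8, "PER": 0.2}, "PER": {"O": 0.4, "PER": 0.6}}
EMIT  = {"O":   {"the": 0.5, "met": 0.4, "smith": 0.1},
         "PER": {"the": 0.05, "met": 0.05, "smith": 0.9}}

def viterbi(words):
    # best[s] = (log prob of the best path ending in state s, that path)
    best = {s: (math.log(START[s]) + math.log(EMIT[s][words[0]]), [s])
            for s in STATES}
    for w in words[1:]:
        nxt = {}
        for s in STATES:
            p, prev = max(
                (best[r][0] + math.log(TRANS[r][s]) + math.log(EMIT[s][w]), r)
                for r in STATES)
            nxt[s] = (p, best[prev][1] + [s])
        best = nxt
    return max(best.values())[1]

print(viterbi(["the", "smith", "met"]))  # → ['O', 'PER', 'O']
```

The emission probabilities make "smith" strongly indicative of a person, so the decoder tags it PER despite the O-heavy transitions.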
Deep learning methods use deep neural networks for named entity recognition. Common models include recurrent neural networks (RNN) and convolutional neural networks (CNN). These models capture contextual information and semantic features in the text, and generalize better than the traditional methods.
2. Survey of relation extraction techniques

Relation extraction refers to extracting the relations between entities from text. It can be divided into two subtasks: entity alignment and relation classification. Entity alignment maps the entities in the text to entities in a knowledge base or corpus; relation classification assigns a relation type to each entity pair. As with entity recognition, the common approaches are rule-based methods, statistical machine learning methods, and deep learning methods.
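The rule-based family of relation classifiers mentioned above can be sketched by matching the words between an entity pair against cue phrases (the cues and sentence are invented for illustration):

```python
# Minimal pattern-based relation classification: look at the text between
# the head and tail entities and match it against hand-written cue phrases.

PATTERNS = {
    "works_for":  ["works for", "is employed by"],
    "located_in": ["is located in", "is based in"],
}

def classify_relation(sentence, head, tail):
    """Return a relation label for the (head, tail) pair, or None."""
    try:
        between = sentence.split(head, 1)[1].split(tail, 1)[0].strip()
    except IndexError:      # head not found in the sentence
        return None
    for label, cues in PATTERNS.items():
        if between in cues:
            return label
    return None

print(classify_relation("Alice works for Acme.", "Alice", "Acme"))
# works_for
```

Statistical and neural approaches replace the cue list with features or embeddings learned from annotated entity pairs.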
Entity relation extraction (study notes)

1. Terminology: IE (Information Extraction); NER (Named Entity Recognition); RE (Relation Extraction); EE (Event Extraction); Web IE (web information extraction).
2. Study links: (none recorded).
3. Related papers:

"A Frustratingly Easy Approach for Joint Entity and Relation Extraction" (Danqi Chen's group, 2020). This paper challenges the common belief that joint extraction beats a pipeline (two separate stages for entity extraction and relation extraction), and is the first to introduce typed entity markers, i.e. entity tags that carry the entity type (e.g., <S:Md></S:Md>). The two stage encoders do not share parameters, on the grounds that the two tasks have different input formats and need different features to predict entity types and relations. Cross-sentence information can optionally be incorporated by simply extending the sentence to a fixed window of W = 100 tokens (e.g., for a sentence of n = 50 words, take 25 additional words from each side). For relation extraction the paper also proposes an approximate inference method to speed up computation: the markers for entity boundaries and types are appended after the text and share position embeddings with the corresponding entities in the original text (in the referenced figure, identical colors indicate shared position embeddings). Concretely, in the attention layers, text tokens attend only to text tokens and not to marker tokens, while marker tokens can attend to the original text tokens. With this "approximate model", the relations between all entity pairs can be judged with a single encoding of the text.

"Distant Supervision for Relation Extraction with Sentence-level Attention and Entity Descriptions" (2017). This paper proposes a sentence-level attention model (which can select multiple valid instances for feature extraction) and uses entity description information (providing more background knowledge and improving the entity representations used by the attention mechanism) to extract relations from text.
A survey of relation extraction research. Mu Kedong, Wan Qi (College of Computer Science, Sichuan University, Chengdu 610065). Modern Computer, 2015(02), pp. 18-21. Keywords: relation extraction; machine learning; information extraction; open relation extraction.

Abstract: Many applications, such as natural language understanding, information extraction, and information retrieval, require an understanding of the semantic relations between entities; this paper summarizes relation extraction. It divides the research into two paradigms: traditional relation extraction in specific domains, and open-domain relation extraction. It then analyzes relation extraction systems in three parts: extraction algorithms, evaluation metrics, and future trends.

With the continuing growth of big data, massive amounts of information are presented to users as semi-structured or raw text; how to use natural language processing and data mining techniques to help users obtain valuable information from it is a pressing need of contemporary computing research.
NER Systems that Suit User's Preferences: Adjusting the Recall-Precision Trade-off for Entity Extraction

Einat Minkov, Richard C. Wang (Language Technologies Institute, Carnegie Mellon University); Anthony Tomasic (Institute for Software Research International, Carnegie Mellon University); William W. Cohen (Machine Learning Department, Carnegie Mellon University)

Abstract: We describe a method based on "tweaking" an existing learned sequential classifier to change the recall-precision tradeoff, guided by a user-provided performance criterion. This method is evaluated on the task of recognizing personal names in email and newswire text, and proves to be both simple and effective.

1 Introduction

Named entity recognition (NER) is the task of identifying named entities in free text: typically personal names, organizations, gene-protein entities, and so on. Recently, sequential learning methods, such as hidden Markov models (HMMs) and conditional random fields (CRFs), have been used successfully for a number of applications, including NER (Sha and Pereira, 2003; Pinto et al., 2003; McCallum and Li, 2003). In practice, these methods provide imperfect performance: precision and recall, even for well-studied problems on clean well-written text, reach at most the mid-90's. While performance of NER systems is often evaluated in terms of F1 measure (a harmonic mean of precision and recall), this measure may not match user preferences regarding precision and recall. Furthermore, learned NER models may be sub-optimal also in terms of F1, as they are trained to optimize other measures (e.g., log-likelihood of the training data for CRFs).

Obviously, different applications of NER have different requirements for precision and recall. A system might require high precision if it is designed to extract entities as one stage of fact extraction, where facts are stored directly into a database. On the other hand, a system that generates candidate extractions which are passed to a semi-automatic curation system might prefer higher recall. In some domains, such as anonymization of medical records, high recall is essential.

One way to manipulate an extractor's precision-recall tradeoff is to assign a confidence score to each extracted entity and then apply a global threshold to the confidence level. However, confidence thresholding of this sort cannot increase recall. Also, while confidence scores are straightforward to compute in many classification settings, there is no inherent mechanism for computing the confidence of a sequential extractor. Culotta and McCallum (2004) suggest several methods for doing this with CRFs.

In this paper, we suggest an alternative simple method for exploring and optimizing the relationship between precision and recall for NER systems. In particular, we describe and evaluate a technique called "extractor tweaking" that optimizes a learned extractor with respect to a specific evaluation metric. In a nutshell, we directly tweak the threshold term that is part of any linear classifier, including sequential extractors. Though simple, this approach has not been empirically evaluated before, to our knowledge. Further, although sequential extractors such as HMMs and CRFs are state-of-the-art methods for tasks like NER, there has been little prior research about tuning these extractors' performance to suit user preferences. The suggested algorithm optimizes the system performance per a user-provided evaluation criterion, using a linear search procedure. Applying this procedure is not trivial, since the underlying function is not smooth. However, we show that the system's precision-recall rate can indeed be tuned to user preferences given labelled data using this method. Empirical results are presented for a particular NER task, recognizing person names, for three corpora, including email and newswire text.
2 Extractor tweaking

Learning methods such as VP-HMMs and CRFs optimize criteria such as margin separation (implicitly maximized by VP-HMMs) or log-likelihood (explicitly maximized by CRFs), which are at best indirectly related to precision and recall. Can such learning methods be modified to more directly reward a user-provided performance metric?

In a non-sequential classifier, a threshold on confidence can be set to alter the precision-recall tradeoff. This is nontrivial to do for VP-HMMs and CRFs. Both learners use dynamic programming to find the label sequence y = (y_1, ..., y_i, ..., y_N) for a word sequence x = (x_1, ..., x_i, ..., x_N) that maximizes the function W · Σ_i f(x, i, y_{i-1}, y_i), where W is the learned weight vector and f is a vector of features computed from x, i, the label y_i for x_i, and the previous label y_{i-1}. Dynamic programming finds the most likely state sequence, and does not output a probability for a particular sub-sequence. Culotta and McCallum (2004) suggest several ways to generate confidence estimates in this framework. We propose a simpler approach for directly manipulating the learned extractor's precision-recall ratio.

We will assume that the labels y include one label O for "outside any named entity", and let w_0 be the weight for the feature f_0, defined as follows:

    f_0(x, i, y_{i-1}, y_i) = 1 if y_i = O, and 0 otherwise

If no such feature exists, then we will create one. The NER based on W will be sensitive to the value of w_0: large negative values will force the dynamic programming method to label tokens as inside entities, and large positive values will force it to label fewer entities. (We clarify that w_0 refers to feature f_0 only, and not to other features that may incorporate label information.)

We thus propose to "tweak" a learned NER by varying the single parameter w_0 systematically so as to optimize some user-provided performance metric.
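The effect of the w_0 term can be sketched with a toy per-token scorer (a minimal stand-in for the paper's dynamic-programming decoder; the label scores below are invented for illustration):

```python
# Toy illustration of "extractor tweaking": a single bias w0 added to the
# score of the O ("outside any entity") label shifts the precision-recall
# trade-off. A real system applies this inside Viterbi decoding over the
# whole label sequence; here each token is scored independently.

def decode(token_scores, w0=0.0):
    """token_scores: list of dicts mapping label -> score for one token."""
    labels = []
    for scores in token_scores:
        biased = {lab: s + (w0 if lab == "O" else 0.0)
                  for lab, s in scores.items()}
        labels.append(max(biased, key=biased.get))
    return labels

scores = [{"O": 0.6, "PER": 0.5},   # borderline token
          {"O": 0.9, "PER": 0.1},   # clearly outside
          {"O": 0.4, "PER": 0.7}]   # clearly a name

print(decode(scores, w0=0.0))   # → ['O', 'O', 'PER']  (neutral)
print(decode(scores, w0=-0.3))  # → ['PER', 'O', 'PER']  (favors recall)
print(decode(scores, w0=0.5))   # → ['O', 'O', 'O']  (favors precision)
```

Negative w_0 pulls borderline tokens into entities (higher recall, lower precision); positive w_0 suppresses them (the reverse).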
Specifically, we tune w_0 using a Gauss-Newton line search, where the objective function is iteratively approximated by quadratics. We terminate the search when two adjacent evaluation results are within a 0.01% difference; in the experiments, this usually takes around 10 iterations, each of which requires evaluating a "tweaked" extractor on a training set.

A variety of performance metrics might be imagined: for instance, one might wish to optimize recall, after applying some sort of penalty for precision below some fixed threshold. In this paper we experiment with performance metrics based on the (complete) F-measure formula, which combines precision and recall into a single numeric value based on a user-provided parameter β:

    F(β, P, R) = (β² + 1) · P · R / (β² · P + R)

A value of β > 1 assigns higher importance to recall. In particular, F2 weights recall twice as much as precision. Similarly, F0.5 weights precision twice as much as recall.

We consider optimizing both token-level and entity-level Fβ, awarding partial credit for partially extracted entities and no credit for incorrect entity boundaries, respectively. Performance is optimized over the dataset on which W was trained, and tested on a separate set. A key question our evaluation should address is whether the values optimized for the training examples transfer well to unseen test examples, using the suggested approximate procedure.
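The Fβ formula above is straightforward to compute; a direct transcription:

```python
def f_beta(beta, precision, recall):
    """F(beta) = (beta^2 + 1) * P * R / (beta^2 * P + R), as defined above."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1.0) * precision * recall / (b2 * precision + recall)

# With P = 0.8, R = 0.6:
print(round(f_beta(1.0, 0.8, 0.6), 4))  # 0.6857 (plain F1, harmonic mean)
print(round(f_beta(2.0, 0.8, 0.6), 4))  # 0.6316 (F2 leans toward recall)
print(round(f_beta(0.5, 0.8, 0.6), 4))  # 0.75   (F0.5 leans toward precision)
```

Since recall (0.6) is the weaker of the two here, F2 is lower and F0.5 higher than F1, matching the weighting the paper describes.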
3 Experiments

3.1 Experimental Settings

We experiment with three datasets, of both email and newswire text. Table 1 gives summary statistics for all datasets. The widely-used MUC-6 dataset includes news articles drawn from the Wall Street Journal. The Enron dataset is a collection of emails extracted from the Enron corpus (Klimt and Yang, 2004), where we use a subcollection of the messages located in folders named "meetings" or "calendar". The Mgmt-Groups dataset is a second email collection, extracted from the CSpace email corpus, which contains email messages sent by MBA students taking a management course conducted at Carnegie Mellon University in 1997. This data was split such that its test set contains a different mix of entity names compared to the training examples. Further details about these datasets are available elsewhere (Minkov et al., 2005).

Table 1: Summary of the corpora used in the experiments (column grouping reconstructed from a flattened extraction)

                #documents          #tokens   #names per doc.
                Train     Test
  MUC-6          347       30       204,071        6.8
  Enron          833      143       204,423        3.0
  Mgmt-Groups    631      128       104,662        3.7

We used an implementation of Collins' voted-perceptron method for discriminatively training HMMs (henceforth, VP-HMM) (Collins, 2002) as well as a CRF (Lafferty et al., 2001) to learn a NER. Both VP-HMM and CRF were trained for 20 epochs on every dataset, using a simple set of features such as word identity and capitalization patterns for a window of three words around each word being classified. Each word is classified as either inside or outside a person name. (This problem encoding is basic; in this paper we focus on the precision-recall trade-off in the general case, avoiding settings optimization.)

3.2 Extractor Tweaking Results

Figure 1 evaluates the effectiveness of the optimization process used by "extractor tweaking" on the Enron dataset. We optimized models for Fβ with different values of β, and also evaluated each optimized model with different Fβ metrics. The top graph shows the results for token-level Fβ, and the bottom graph shows entity-level Fβ behavior. The graph illustrates that the optimized model does indeed roughly maximize performance for the target β value: for example, the token-level Fβ curve for the model optimized for β = 0.5 indeed peaks at β = 0.5 on the test-set data. The optimization is only roughly accurate (e.g., the token-level F2 curve peaks at β = 5) for several possible reasons: first, there are differences between train and test sets; in addition, the line search assumes that the performance metric is smooth and convex, which need not be true. Note that evaluation-metric optimization is less successful for entity-level performance, which behaves less smoothly than token-level performance.

[Figure 1: Results of token-level (top) and entity-level (bottom) optimization for varying values of β, for the Enron dataset, VP-HMM. The y-axis gives F(β); β on the x-axis is given on a logarithmic scale.]

Table 2: Sample optimized CRF results, for the MUC-6 dataset and entity-level optimization

               Token               Entity
  β            Prec     Recall    Prec     Recall
  Baseline     93.3     76.0      93.6     70.6
  0.2         100.0     53.2      98.2     57.0
  0.5          95.3     71.1      94.4     67.9
  1.0          88.6     79.4      89.2     70.9
  2.0          81.0     83.9      81.8     70.9
  5.0          65.8     91.3      69.4     71.4

Similar results were obtained optimizing baseline CRF classifiers. Sample results (for MUC-6 only,
due to space limitations) are given in Table 2, optimizing a CRF baseline for entity-level Fβ. Note that as β increases, recall monotonically increases and precision monotonically falls.

The graphs in Figure 2 present another set of results with more traditional recall-precision curves. The top three graphs are for token-level Fβ optimization, and the bottom three are for entity-level optimization. The solid lines show the token-level and entity-level precision-recall tradeoff obtained by varying β (over the values 0.2, 0.5, 0.8, 1, 1.2, 1.5, 2, 3 and 5) and optimizing the relevant measure for Fβ; the points labeled "baseline" show the precision and recall at token and entity level of the baseline model, learned by VP-HMM. These graphs demonstrate that extractor "tweaking" gives approximately smooth precision-recall curves, as desired. Again, we note that the resulting recall-precision trade-off for entity-level optimization is generally less smooth.

[Figure 2: Results for the evaluation-metric model optimization on MUC-6, Enron and Mgmt-Groups. Each graph plots precision against recall for the baseline learned VP-HMM and for evaluation-metric optimization at different values of β, in terms of both token-level and entity-level performance.]

4 Conclusion

We described an approach that is based on modifying an existing learned sequential classifier to change the recall-precision tradeoff, guided by a user-provided performance criterion. This approach not only allows one to explore a recall-precision tradeoff, but actually allows the user to specify a performance metric to optimize, and optimizes a learned NER system for that metric. We showed that using a single free parameter and a Gauss-Newton line search (where the objective is iteratively approximated by quadratics) effectively optimizes two plausible performance measures, token-level Fβ and entity-level Fβ. This approach is in fact general, as it is applicable to sequential and/or structured learning applications other than NER.

References

M. Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In EMNLP.
A. Culotta and A. McCallum. 2004. Confidence estimation for information extraction. In HLT-NAACL.
B. Klimt and Y. Yang. 2004. Introducing the Enron corpus. In CEAS.
J. Lafferty, A. McCallum, and F. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML.
A. McCallum and W. Li. 2003. Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In CoNLL.
E. Minkov, R. C. Wang, and W. W. Cohen. 2005. Extracting personal names from emails: Applying named entity recognition to informal text. In HLT-EMNLP.
D. Pinto, A. McCallum, X. Wei, and W. B. Croft. 2003. Table extraction using conditional random fields. In ACM SIGIR.
F. Sha and F. Pereira. 2003. Shallow parsing with conditional random fields. In HLT-NAACL.
Japanese Named Entity Extraction with Redundant Morphological Analysis

Masayuki Asahara and Yuji Matsumoto
Graduate School of Information Science, Nara Institute of Science and Technology, Japan
masayu-a,matsu@is.aist-nara.ac.jp

Abstract

Named Entity (NE) extraction is an important subtask of document processing such as information extraction and question answering. A typical method used for NE extraction of Japanese texts is a cascade of morphological analysis, POS tagging and chunking. However, there are some cases where the segmentation granularity of the morphological analysis results contradicts the building units of NEs, so that extraction of some NEs is inherently impossible in this setting. To cope with the unit problem, we propose a character-based chunking method. Firstly, the input sentence is analyzed redundantly by a statistical morphological analyzer to produce multiple (n-best) answers. Then, each character is annotated with its character type and its possible POS tags from the top n-best answers. Finally, a support vector machine-based chunker picks up some portions of the input sentence as NEs. This method introduces richer information to the chunker than previous methods based on a single morphological analysis result. We apply our method to the IREX NE extraction task. The cross-validation result of an F-measure of 87.2 shows the superiority and effectiveness of the method.

1 Introduction

Named Entity (NE) extraction aims at identifying proper nouns and numerical expressions in a text, such as persons, locations, organizations, dates, and so on. This is an important subtask of document processing like information extraction and question answering.

A common standard data set for Japanese NE extraction is provided by the IREX workshop (IREX Committee, editor, 1999). Generally, Japanese NE extraction is done in the following steps: firstly, a Japanese text is segmented into words and annotated with POS tags by a morphological analyzer; then, a chunker brings together the words into NE chunks based on contextual information. However, such a straightforward method cannot extract NEs whose segmentation boundary contradicts that of the morphological analysis output. For example, in a sentence meaning "Prime Minister Koizumi Jun'ichiro will visit North Korea in September" (the Japanese example characters were lost in extraction), "Koizumi Jun'ichiro" (family and first names) as a person name and "September" as a date can be extracted by combining word units. On the other hand, the abbreviated form of "North Korea" cannot be extracted as a location name, because it is contained within the single word unit meaning "visiting North Korea". Figure 1 illustrates the example with its English translation.

Some previous works try to cope with the word unit problem: Uchimoto (Uchimoto et al., 2000) introduces transformation rules to modify the word units given by a morphological analyzer; Isozaki (Isozaki and Kazawa, 2002) controls the parameters of a statistical morphological analyzer so as to produce more fine-grained output. These methods are used as a preprocessing step before chunking. By contrast, we propose a more straightforward method in which we perform the chunking process based on character units. Each character receives annotations with its character type and multiple POS information of the words found by a morphological analyzer. We make use of redundant outputs of the morphological analysis as the base features for the chunker, to introduce more information-rich features. We use a support vector machine (SVM)-based chunker, yamcha (Kudo and Matsumoto, 2001), for the chunking process. Our method achieves a better score than all the systems reported previously for the IREX NE extraction task.

Section 2 presents the IREX NE extraction task. Section 3 describes our method in detail. In section 4, we show the results of experiments, and finally we give conclusions in section 5.

2 IREX NE extraction task

The task of NE extraction in the IREX workshop is to recognize eight NE types as shown in Table 1 (IREX Committee, editor, 1999). In their definitions, "ARTIFACT" contains book titles, laws, brand names and so on.

[Figure 1: Example of the word unit problem (the Japanese example text was lost in extraction). Gloss: Koizumi Jun'ichiro / Prime-Minister / particle / September / particle / visiting-North-Korea; "Prime Minister Koizumi Jun'ichiro will visit North Korea in September." Named entities in the sentence: "Koizumi Jun'ichiro"/PERSON, "September"/DATE, "North Korea"/LOCATION.]

[Table 1: Examples of NEs in IREX; the example column was lost in extraction.]

[Figure 2: Examples of NE tag sets (IOB1, IOB2, IOE1, IOE2, SE) over an example sentence glossed "Prime Minister Koizumi does ... between Japan and North Korea", with tags such as B-PERSON, I-PERSON, E-PERSON, B-LOCATION, E-LOCATION and O.]

The task can be defined as a chunking problem to identify word sequences which compose NEs. The chunking problem is solved by annotation of chunk tags to tokens. Five chunk tag sets, IOB1, IOB2, IOE1, IOE2 (Ramshaw and Marcus, 1995) and SE (Uchimoto et al., 2000), are commonly used. In the IOB1 and IOB2 models, three tags I, O and B are used, meaning inside, outside and beginning of a chunk. In IOB1, B is used only at the beginning of a chunk that immediately follows another chunk, while in IOB2, B is always used at the beginning of a chunk. IOE1 and IOE2 use an E tag instead of B and are almost the same as IOB1 and IOB2, except that the end points of chunks are tagged with E. In the SE model, S is tagged only to one-symbol chunks, and B, I and E denote exactly the beginning, intermediate and end points of a chunk. Generally, the words given by the single output of a morphological analyzer are used as the units for chunking. By contrast, we take characters as the units and annotate a tag on each character. Figure 2 shows examples of character-based NE annotations according to the five tag sets; one person name and two location names are the NEs in the example sentence. While the detailed explanation of the tags will be given later, note that an NE tag is a pair of an NE type and a chunk tag.

3 Method

In this section, we describe our method for Japanese NE extraction. The method is based on the following three steps:

1. A statistical morphological/POS analyzer is
applied to the input sentence and produces the POS tags of the n-best answers.

2. Each character in the sentence is annotated with its character type and multiple POS tag information according to the n-best answers.

3. Using the annotated features, NEs are extracted by an SVM-based chunker.

Now, we illustrate each of these three steps in more detail.

3.1 Japanese Morphological Analysis

Our Japanese morphological/POS analysis is based on a Markov model. Morphological/POS analysis can be defined as the determination of a POS tag sequence once a segmentation into a word sequence is given. The goal is to find the POS tag sequence T and word sequence W that maximize the probability P(T, W); the equations were lost in extraction, but in the standard Markov-model formulation this is argmax over (T, W) of P(T, W). Bayes' rule allows P(T, W) to be decomposed as the product of tag and word probabilities. We introduce the approximations that the word probability is conditioned only on the tag of the word, and the tag probability is determined only by the immediately preceding tag. The probabilities are estimated from the frequencies in tagged corpora using Maximum Likelihood Estimation. Using these parameters, the most probable tag and word sequences are determined by the Viterbi algorithm.

In practice, we use log likelihood as cost; maximizing probabilities means minimizing costs. In our method, redundant analysis output means the top n-best answers within a certain cost width. The n-best answers are picked up for each character in the order of the accumulated cost from the beginning of the sentence. Note that, if the difference between the costs of the best answer and the n-th best answer exceeds a predefined cost width, we abandon the n-th best answer. The cost width is defined as the lowest probability among all events which occur in the training data.

3.2 Feature Extraction for Chunking

From the output of redundant analysis, each character receives a number of features. POS tag information is subcategorized so as to encode relative positions of characters within a word. For encoding the position we employ the SE tag model. Then, a character is tagged with a pair of a POS tag and the position tag within a word as one feature. For example, the characters at the initial, intermediate and final positions of a common noun (Noun-General) are represented as "Noun-General-B", "Noun-General-I" and "Noun-General-E", respectively. The list of tags for positions in a word is illustrated in Table 2. Note that the O tag is not necessary, since every character is a part of a certain word.

Table 2: Tags for positions in a word (reconstructed; the table was flattened in extraction)

  Tag   Description
  S     one-character word
  B     first character in a multi-character word
  E     last character in a multi-character word
  I     intermediate character in a word (only for words longer than 2 chars)

Character types are also used as features. We define seven character types as listed in Table 3.

Table 3: Character types (reconstructed; the table was flattened in extraction)

  Tag      Description
  ZSPACE   space
  ZDIGIT   digit
  ZLLET    lowercase alphabetical letter
  ZULET    uppercase alphabetical letter
  HIRAG    hiragana
  KATAK    katakana
  OTHER    other characters

Figure 3 shows an example of the features used for the chunking process.

3.3 Support Vector Machine-based Chunking

We use the chunker yamcha (Kudo and Matsumoto, 2001), which is based on support vector machines (Vapnik, 1998). Below we present support vector machine-based chunking briefly. Suppose we have a set of training data for a binary class problem: (x_1, y_1), ..., (x_L, y_L), where x_i is a feature vector of the i-th sample in the training data and y_i ∈ {+1, −1} is the label of the sample (the symbols were lost in extraction and are restored here in the standard notation). The goal is to find a decision function which accurately predicts the label y for an unseen x. A support vector machine classifier gives the decision function f(x) = sgn(g(x)) for an input vector x, where

    g(x) = Σ_i α_i y_i K(x_i, x) + b

f(x) = +1 means that x is a positive member, and f(x) = −1 means that x is a negative member. The vectors x_i with nonzero α_i are called support vectors. Support vectors and other constants are determined by solving a quadratic programming problem. K is a kernel function which maps vectors into a higher dimensional space. We use the polynomial kernel of degree 2, given by K(x, z) = (x · z + 1)².
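The decision function and degree-2 polynomial kernel above can be sketched directly (the support vectors, weights and bias below are invented toy values; a real chunker obtains them from quadratic-programming training):

```python
# Sketch of an SVM decision function with the degree-2 polynomial kernel
# K(x, z) = (x . z + 1)^2 described in Section 3.3.

def poly_kernel(x, z, degree=2):
    return (sum(a * b for a, b in zip(x, z)) + 1.0) ** degree

def svm_decide(x, support_vectors, alphas, labels, bias):
    # g(x) = sum_i alpha_i * y_i * K(x_i, x) + b ; return sgn(g(x))
    g = sum(a * y * poly_kernel(sv, x)
            for sv, a, y in zip(support_vectors, alphas, labels)) + bias
    return 1 if g >= 0 else -1

# Toy model: two support vectors, one per class.
svs    = [[1.0, 0.0], [0.0, 1.0]]
alphas = [0.5, 0.5]
labels = [1, -1]

print(svm_decide([0.9, 0.1], svs, alphas, labels, bias=0.0))  # → 1
print(svm_decide([0.1, 0.9], svs, alphas, labels, bias=0.0))  # → -1
```

The quadratic kernel implicitly includes all pairwise feature conjunctions, which is why degree-2 kernels are a common choice for chunking feature sets.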
To facilitate chunking tasks by SVMs, we have to extend the binary classifiers to n-class classifiers. There are two well-known methods used for the extension, the "One-vs-Rest method" and the "Pairwise method". In the One-vs-Rest method, we prepare binary classifiers, one between each class and the rest of the classes. In the Pairwise method, we prepare binary classifiers between all pairs of classes.

[Figure 3: An example of features for chunking (flattened in extraction). Each character position carries its character type, the POS-position tags of the top three answers, and the NE tag, e.g. "OTHER / Noun-Proper-Name-Surname-B / Prefix-Nominal-S / Noun-General-S / B-PERSON".]

Chunking is done deterministically, either from the beginning or from the end of the sentence. Figure 3 illustrates a snapshot of the chunking procedure. Two-character contexts on both sides are referred to. Information from the two preceding NE tags is also used, since the chunker has already determined them and they are available. In the example, to infer the NE tag ("O") at the current position, the chunker uses the features appearing within the solid box.

3.4 The effect of n-best answers

The model copes with the problem of word segmentation by character-based chunking. Furthermore, we introduce n-best answers as features for chunking, to capture the following behavior of the morphological analysis. Ambiguity of word segmentation occurs in compound words: when both longer and shorter unit words are included in the lexicon, the longer unit words are more likely to be output by the morphological analyzer. The shorter units then tend to be hidden behind the longer unit words.
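The cost-width filter of Section 3.1 that produces these redundant answers can be illustrated with a toy list of candidate analyses (the segmentations and costs below are invented for illustration):

```python
# Toy illustration of the n-best/cost-width idea: candidate analyses are
# ranked by cost (negative log likelihood); we keep at most n answers,
# discarding any whose cost exceeds the best cost plus a fixed width.

def n_best_within_width(candidates, n, cost_width):
    """candidates: list of (analysis, cost); return up to n low-cost ones."""
    ranked = sorted(candidates, key=lambda c: c[1])
    best_cost = ranked[0][1]
    return [(a, c) for a, c in ranked[:n] if c - best_cost <= cost_width]

# Invented candidate segmentations of a compound word.
candidates = [("visiting-North-Korea", 10.0),   # one long lexicon unit
              ("visit / North-Korea", 11.5),    # shorter units, slightly worse
              ("vi / siting / N.Korea", 25.0)]  # implausible split

print(n_best_within_width(candidates, n=3, cost_width=5.0))
# keeps the first two; the third exceeds the cost width
```

Keeping the near-best second analysis is exactly what lets the shorter unit ("North-Korea") surface as a chunking feature even when the single long unit wins the best path.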
However, introducing the shorter unit words is necessary for named entity extraction in order to generalize the model, because the shorter units are shared by many compound words. Figure 4 shows an example in which the shorter units are effective for NE extraction. In this example, the word meaning "Japan" is extracted as a location by the second best answer, namely "Noun-Proper-Place-Country".

The unknown word problem is also alleviated by the n-best answers. The contextual information in a Markov model is lost at the position where an unknown word occurs, so the words preceding or succeeding an unknown word tend to receive wrong POS tags. However, the correct POS tags occurring in the n-best answers may help to extract the named entity. Figure 5 shows such an example. In this example, the beginning of the person name is captured by the best answer at position 1, and the end of the person name is captured by the second best answer at position 5.

4 Evaluation

4.1 Data

We use the CRL NE data (IREX Committee, editor, 1999) to evaluate our method. The CRL NE data includes 1,174 newspaper articles and 19,262 NEs. We perform five-fold cross-validation on several settings to investigate the length of the contextual features, the depth of the redundant morphological analysis, feature selection and the degree of the polynomial kernel functions. For the chunk tag scheme we use the IOB2 model, since it gave the best result in a pilot study. F-measure (β = 1) is used for evaluation.

4.2 The length of contextual features

First, we compare the extraction accuracies of the models while changing the length of the contextual features and the direction of chunking. Table 4 shows the result in accuracy for each of the NE types as well as the total accuracy over all NEs. For example, "L2R2" denotes the model that uses the features of the two preceding and the two succeeding characters. "For" and "Back" denote the chunking direction: "For" specifies chunking from left to right, and "Back" specifies chunking from right to left.
For all NE types except "TIME", the "Back" direction gives better accuracy than the "For" direction. This is because suffixes are crucial features for NE extraction. The "For" direction gives better accuracy for "TIME", since "TIME" expressions often contain prefixes such as the words for "a.m." and "p.m.".

"L2R2" gives the best accuracy for most of the NE types. For "ORGANIZATION", the model needs a longer contextual window. The reason may be that the key prefixes and suffixes of this NE type, such as the words for "company limited" and "research institute", are longer.

Table 4: The length of contextual features and the extraction accuracy

                    For                    Back
              L1R1   L2R2   L3R3     L1R1   L2R2   L3R3
ARTIFACT      29.74  42.17  43.90    45.59  49.58  47.82
DATE          84.98  91.16  92.47    90.22  93.97  93.41
LOCATION      80.16  84.07  85.75    86.62  87.75  87.61
MONEY         43.46  59.88  72.53    93.30  93.85  93.60
ORGANIZATION  66.06  72.63  75.55    74.80  78.33  79.95
PERCENT       67.66  83.77  85.26    95.96  96.06  94.16
PERSON        83.44  85.35  86.31    84.98  87.19  87.65
TIME          88.21  89.82  89.54    87.54  88.33  88.08
ALL           83.72  86.19  86.02    76.65  82.12  84.16

[Figure 5: An example in which the n-best answers recover from an unknown word. The columns give the position, the character, POS(Best), POS(2nd) and the NE tag (PERSON): the beginning of the person name is indicated by "Noun-General" in the best answer at position 1, and its end by "Noun-Suffix-General" in the second best answer at position 5. The character column could not be recovered from the source.]

4.3 The depth of redundant morphological analysis

Table 5 shows the results when we change the depth (the value n of the n-best answers) of the redundant morphological analysis. The redundant outputs of the morphological analysis slightly improve the accuracy of NE extraction, except for numeral expressions. The best answer seems to be enough to extract numeral expressions other than "MONEY", because numeral expressions do not cause many errors in the morphological analysis. To extract "MONEY", the model needs more redundant output of the morphological analysis. A typical error occurs for the expression "Canadian dollars" (MONEY), which is not included in the training data and is analyzed as "Canada" (LOCATION). A similar error occurs for "Hong Kong dollars", and so on.

Table 5: The depth of redundant analysis and the extraction accuracy

Pair Wise Method
              1-best  2-best  3-best  4-best
Direction     Back    Back    Back    Back
ARTIFACT      44.37   43.57   42.17   42.10
DATE          93.81   94.23   94.14   93.71
LOCATION      84.35   84.20   84.07   83.92
MONEY         93.89   94.28   95.82   95.96
ORGANIZATION  73.83   73.71   72.63   72.46
PERCENT       97.20   96.76   96.31   96.81
PERSON        86.23   85.65   85.35   85.22
TIME          88.22   87.72   87.47   87.77
ALL           86.25   86.30   86.19   86.08

One vs Rest Method
              1-best  2-best  3-best  4-best
Direction     Back    Back    Back    Back
ARTIFACT      43.11   41.12   39.84   38.65
DATE          94.18   94.18   93.97   93.83
LOCATION      84.72   84.67   84.31   84.15
MONEY         93.79   93.67   93.85   95.47
ORGANIZATION  74.37   73.70   72.74   72.73
PERCENT       97.09   96.02   96.06   96.28
PERSON        85.92   86.03   85.51   85.41
TIME          89.04   88.07   88.33   88.32
ALL           86.40   86.35   86.11   86.07

4.4 Feature selection

We use POS tags, characters, character types and NE tags as features for chunking. To evaluate how effective they are, we compare models trained with different subsets of the features (Table 6). In the experiments above, we follow the features used in the preceding work (Yamada et al., 2002). Isozaki (Isozaki and Kazawa, 2002) introduced the thesaurus NTT Goi Taikei (Ikehara et al., 1999) to augment the features; the table below shows the extraction accuracy when the thesaurus is added.

[Table: extraction accuracy with the thesaurus (direction: For). The two columns presumably give the accuracy without and with the thesaurus; the column headers could not be recovered from the source.]
ARTIFACT      50.06  49.15
DATE          91.19  91.78
LOCATION      87.61  88.59
MONEY         61.62  64.58
ORGANIZATION  79.27  80.37
PERCENT       86.23  86.64
PERSON        87.40  87.73
TIME          90.54  90.19
ALL           82.58  83.58
Settings: "L2R2" contextual feature, 2-best answers of redundant morphological analysis, One-vs-Rest method, with features: POS, characters, character types and NE tags.

While we must use a fixed feature set for all NE types in the Pairwise method, it is possible to select different feature sets and models when applying the One-vs-Rest method. The best combined model achieves F-measure 87.21 (Table 9); this model uses the One-vs-Rest method with the best model for each type shown in Tables 4-8.

[Table 9: The extraction accuracy of the best combined model. Most cell values could not be recovered from the source; surviving fragments: 50.16, 88.57, 80.44, 87.81.]

Table 10 shows a comparison with related works. Our method attains the best result among the previously reported systems. Previous works report that POS information in a preceding and succeeding two-word window is the most effective contextual feature for Japanese NE extraction. Our current work disproves this widespread belief about the contextual feature.
In our experiments, a preceding and succeeding two- or three-character window is the most effective. Our method employs exactly the same chunker as the work by Yamada et al. (2002). To see the influence of boundary contradictions between the morphological analysis and the NEs, they experimented with an ideal setting in which the morphological analysis provides perfect results for the NE chunker. Their result shows F-measure 85.1 on the same data set as ours. This comparison shows that our method solves more than the word unit problem.

Table 6: The feature set and the extraction accuracy

Pair Wise Method
              All    Char.  POS    subcat.
Direction     Back   Back   Back   Back
ARTIFACT      42.17  23.64  41.36  41.45
DATE          94.14  80.41  94.04  93.33
LOCATION      84.07  77.29  83.87  76.37
MONEY         95.82  87.48  95.81  90.91
ORGANIZATION  72.63  60.81  72.15  66.10
PERCENT       96.31  83.05  95.98  94.58
PERSON        85.35  81.46  84.59  73.55
TIME          87.47  81.56  87.57  86.26
ALL           86.19  75.13  85.78  77.94

One vs Rest Method
              All    Char.  POS    subcat.
Direction     Back   Back   Back   Back
ARTIFACT      39.84  22.97  39.98  39.69
DATE          93.97  80.57  94.09  93.34
LOCATION      84.31  75.87  84.50  76.99
MONEY         93.85  85.19  94.86  89.89
ORGANIZATION  72.74  58.85  72.77  66.60
PERCENT       96.06  79.61  96.09  94.81
PERSON        85.51  80.43  84.87  73.92
TIME          88.33  77.31  88.27  86.59
ALL           86.11  74.92  85.96  81.72
[The first column uses the full feature set; the remaining column headers ("Char.", "POS", "subcat.") survive only partially in the source.]

Table 7: The degree of the polynomial kernel function and the extraction accuracy

                    For                    Back
Degree         1      2      3        1      2      3
ARTIFACT      36.81  42.17  38.86    45.26  49.58  44.25
DATE          90.21  91.16  91.25    93.02  93.97  93.63
LOCATION      83.79  84.07  83.74    85.88  87.75  87.26
MONEY         55.22  59.88  59.63    94.21  93.85  93.91
ORGANIZATION  71.62  72.63  72.60    75.61  78.33  78.13
PERCENT       84.13  83.77  80.14    95.35  96.06  94.18
PERSON        83.25  85.35  85.13    85.05  87.19  86.90
TIME          89.09  89.82  89.99    88.06  88.33  87.25
ALL           84.10  86.19  85.36    80.36  82.12  82.17
Settings: "L2R2" contextual feature, 3-best answers of redundant morphological analysis, with features: POS, characters, character types and NE tags.

Table 10: Comparison with related works

System                      F-measure      Method                         Note
(Uchimoto et al., 2000)     80.17          ME                             Transformation rules
(Yamada et al., 2002)       83.7           SVM                            Examples in training data are segmented
(Takemoto et al., 2001)     83.86          Lexicon and rules              Compound lexicon
(Utsuro et al., 2002)       84.07          Stacking (ME and decision list)
(Isozaki and Kazawa, 2002)  86.77 / 85.77  SVM with sigmoid curve         Parameter control for a statistical morphological analyzer
Our method                  87.21          SVM                            Chunking by character
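The span-level F-measures compared above can be computed as in the following sketch. In the span-level evaluation, an NE is typically counted as correct only if both its boundaries and its type match; the gold and predicted spans below are made up for illustration.

```python
# Sketch of span-level F-measure (beta = 1): precision over the predicted
# spans, recall over the gold spans, and their harmonic mean.

def f_measure(gold_spans, predicted_spans):
    gold, pred = set(gold_spans), set(predicted_spans)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Made-up spans (start, end, type): one exact match, one type error.
gold = [(0, 2, "PERSON"), (5, 9, "DATE")]
pred = [(0, 2, "PERSON"), (5, 9, "TIME")]
print(round(f_measure(gold, pred), 2))  # 0.5
```

Note that a prediction with the right boundaries but the wrong type, as in the second span above, costs both precision and recall, which is why boundary-only heuristics can look deceptively good under token-level scoring.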