Information Extraction and Classification from Free Text Using a Neural Approach
Netinfo Security, 2019, No. 10. doi: 10.3969/j.issn.1671-1122.2019.10.004

Research on Malware Detection Technology Based on Image Analysis

ZHANG Jian (1,2,3), CHEN Bohan (1), GONG Liangyi (1), GU Zhaojun (4)
(1. School of Computer Science and Engineering, Tianjin University of Technology, Tianjin 300384, China; 2. College of Cyber Science, Nankai University, Tianjin 300350, China; 3. Tianjin Key Laboratory of Network and Data Security Technology, Tianjin 300350, China; 4. Information Security Evaluation Center, Civil Aviation University of China, Tianjin 300300, China)

Abstract: With malware growing continuously in both complexity and volume, malware detection is becoming increasingly challenging. At present, the most common malware detection approach applies machine learning techniques. To further improve the efficiency of malware analysis, some researchers have proposed image-analysis-based methods for classifying malware. This paper surveys the different approaches that use image analysis to detect malware, compares them in terms of image generation, feature extraction and classification algorithms, and finally proposes solutions to the shortcomings of the image-analysis methods.

Keywords: cyber security; malware detection; malware image; machine learning
CLC number: TP309; Document code: A; Article ID: 1671-1122(2019)10-0024-08

Citation: ZHANG Jian, CHEN Bohan, GONG Liangyi, et al. Research on Malware Detection Technology Based on Image Analysis [J]. Netinfo Security, 2019, 19(10): 24-31.

Received: 2019-06-15. Funding: National Key R&D Program of China [2016YFB0800805]; Tianjin Science and Technology Service Industry Major Project [16ZXFWGX00140]; Tianjin Natural Science Foundation [1冯CQNJC69900]; Open Fund of the Information Security Evaluation Center, Civil Aviation University of China [CAAC-ISECCA-201501].
Author profiles: ZHANG Jian (1968-), male, Tianjin, senior engineer, Ph.D.; research interests: cyberspace security, cloud security, system security, malicious code prevention. CHEN Bohan (1994-), male, Henan, master's student; research interest: information security. GONG Liangyi (1987-), male, Shandong, Ph.D.; research interests: pervasive computing, network and information security. GU Zhaojun (1966-), male, Shandong, professor, Ph.D.; research interests: network and information security, civil aviation information systems.
Chinese Journal of Scientific Instrument, Vol. 29, No. 8, August 2008. Received: 2007-07. Funding: National High Technology Research and Development Program of China (863 Program, 2007AA04Z236), National Natural Science Foundation of China (60501005), and Tianjin Science and Technology Support Program key project (07ZCKFSF01300).

Feature Extraction and Classification Algorithms for Motor Imagery Potentials in Brain-Computer Interface Research

Cheng Longlong, Ming Dong, Liu Shuangchi, Zhu Yuhuan, Zhou Zhongxing, Wan Baikun
(Department of Biomedical Engineering and Scientific Instruments, Tianjin University, Tianjin 300072, China)

Abstract: When a person imagines a limb or other body movement without actually performing it, the region of the brain's motor cortex associated with that movement produces an electrophysiological response similar to the one produced when the movement is actually executed; this response is called the motor imagery potential. Extracting and classifying motor imagery potentials is the key and the main difficulty of brain-computer interface (BCI) technology. This paper introduces feature extraction methods such as time-frequency analysis, complexity analysis, phase coupling measurement, multi-channel linear descriptors and multi-dimensional statistical analysis, together with classification algorithms such as linear discriminant analysis, artificial neural networks and support vector machines, as references for the design and study of BCI systems.

Keywords: brain-computer interface (BCI); motor imagery potential; feature extraction; classification algorithm
CLC number: R318; Document code: A; National standard subject classification code: 310.6110

1 Introduction
A brain-computer interface (BCI) is a new channel for external information exchange and control that uses engineering techniques to connect the human brain directly with a computer or other electromechanical device, turning "thought into action" without relying on the brain's normal output pathways of peripheral nerves and muscle tissue [1-2].
Information Extraction Techniques in Big Data

With the development and spread of the Internet, the amount of information people can access keeps growing, and huge volumes of data are produced and stored. The development and application of big data processing and mining techniques has become an emerging field, and information extraction is an important step in the big data processing pipeline. This article introduces information extraction techniques in the context of big data.

1. Overview of information extraction
Information extraction (IE) is the process of extracting meaningful information from unstructured or semi-structured text on the basis of predefined rules or linguistic knowledge. Information extraction usually involves the following steps (a minimal code sketch of such a pipeline follows this list):
(1) Text preprocessing: tokenization, part-of-speech tagging, named entity recognition and similar steps.
(2) Rule or model fitting: defining linguistic rules or statistical models that match the text and extract information.
(3) Feature extraction: extracting the specified information, attributes or entities from the text.
(4) Information extraction output: writing the predicted results as structured data, for example in XML or table form.
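The following is a minimal, illustrative sketch of these four steps; it is not taken from the article. It uses only the Python standard library, and the example sentence, regular-expression rules and field names (person, org, date) are hypothetical stand-ins for a real preprocessing pipeline and rule set.

```python
# Illustrative sketch of the four steps above (not from the original article).
# The regular expressions and field names are hypothetical examples only.
import re
import json

TEXT = "John Smith joined Acme Corp. as chief engineer on 2016-03-01."

# (1) Preprocessing: a very rough tokenizer standing in for real segmentation/POS tagging.
tokens = re.findall(r"\w+|[^\w\s]", TEXT)

# (2) Rules: hand-written patterns standing in for linguistic rules or a statistical model.
RULES = {
    "person": re.compile(r"\b([A-Z][a-z]+ (?!(?:Corp|Inc|Ltd)\b)[A-Z][a-z]+)\b"),
    "org":    re.compile(r"\b([A-Z][a-z]+ (?:Corp|Inc|Ltd)\.?)"),
    "date":   re.compile(r"\b(\d{4}-\d{2}-\d{2})\b"),
}

# (3) Feature/entity extraction: apply each rule to the raw text.
extracted = {name: pattern.findall(TEXT) for name, pattern in RULES.items()}

# (4) Output as structured data (JSON here; XML or a table would work equally well).
record = {"source_text": TEXT, "tokens": tokens, "entities": extracted}
print(json.dumps(record, indent=2))
```

In practice, step (1) would use a real tokenizer and tagger and step (2) a trained statistical model rather than hand-written patterns, but the overall flow is the same.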
2. Application scenarios of information extraction
Information extraction can be applied in many scenarios, for example:
(1) News event monitoring: monitoring keywords, locations, person names and other information in news stories.
(2) Online advertising targeting: pushing relevant advertisements to users based on their browsing and search history.
(3) Social media analysis: capturing users' attitudes and sentiment on social media to improve the effectiveness of marketing strategies.
(4) Automatic knowledge extraction: collecting information such as diseases, symptoms and treatments from the medical literature to support clinical diagnosis and treatment.

3. History and progress of information extraction
The history of information extraction can be traced back to the late 1960s. With advances in computing and natural language processing, information extraction gradually matured and came to be widely applied in fields such as finance, healthcare and law.

Its development also faces problems. For example, traditional rule-based extraction requires a great deal of manual rule writing and tuning, and the rules are error-prone and quickly become outdated. In addition, large volumes of text take a long time to process, and data quality is often far from ideal.

In recent years, advances in machine learning and deep learning have brought new opportunities to information extraction. For example, deep-learning-based named entity recognition models can markedly improve the accuracy and efficiency of extraction (a hedged usage sketch follows). At the same time, combining natural language processing with machine learning makes it possible to discover new information or rules automatically and to update extraction models dynamically, broadening the application scenarios and scope of information extraction.
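As a hedged illustration of the point above, the sketch below shows how a pretrained deep-learning NER model might be invoked. It assumes the Hugging Face transformers library and the public dslim/bert-base-NER checkpoint, neither of which is mentioned in the original article; argument names such as aggregation_strategy can differ between library versions.

```python
# Hedged sketch only: assumes the Hugging Face `transformers` package is installed
# and that the public checkpoint "dslim/bert-base-NER" is available for download.
# The original article names neither; they are used here purely as an example.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="dslim/bert-base-NER",
    aggregation_strategy="simple",  # merge word pieces into whole entity spans
)

text = "Apple is looking at buying a U.K. startup for 1 billion dollars."
for entity in ner(text):
    # Each result typically carries the entity group, the matched text and a confidence score.
    print(entity["entity_group"], entity["word"], f"{entity['score']:.3f}")
```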
Academic New Vision of Sichuan Technology and Business University, Vol. 2, No. 1, March 2017. Applied Education Teaching Reform and Research.

Examination Paper Analysis of "Information Retrieval and Utilization" and Research on Applied Teaching Reform

Wei Li (Library of Sichuan Technology and Business University, Meishan, Sichuan 620000)

Abstract: Examination scores measure, to some extent, the quality of teachers' teaching and the effect of students' learning. In order to evaluate teaching quality objectively and to find the weak links in the teaching process, this paper analyses 60 examination papers from the 2015 accounting undergraduate cohort of Sichuan Technology and Business University. The results show that the question types of this paper set are reasonable, the quantity and difficulty of the questions are moderate, and the paper basically covers all of the important and difficult points of the information retrieval course; student scores follow an approximately normal distribution. However, the examined knowledge points should also include a hands-on computer component so as to strengthen the practical applicability of the course. Based on these results, the paper discusses directions for applied teaching reform of the information retrieval course.

Keywords: information retrieval and utilization; examination paper analysis; applied teaching reform
CLC number: G258.6; G252. Document code: A

In the 21st-century society computer technology is applied everywhere, and the amount of information produced every day grows explosively. In this ocean of information, how to use the network effectively to obtain the right information is precisely the focus of the course "Information Retrieval and Utilization". The aim of the course is to cultivate students' information literacy, teaching them to discover, obtain and use information.
Evaluation Metrics for Information Extraction: A Step-by-Step Reply

Information extraction (IE) is a key natural language processing task that aims to identify and extract important information from unstructured text. That information may be entity names, relations, events, or other entities and concepts relevant to a particular domain. Evaluation metrics for information extraction are essential for improving and developing the algorithms and systems in this field. This article answers questions about IE evaluation metrics step by step.

1. What are the common evaluation metrics for information extraction tasks?
The common metrics are precision, recall, the F1 score, and accuracy.
- Precision is the ratio of correctly predicted positive samples to all samples predicted as positive. The higher the precision, the more trustworthy the system's positive predictions are.
- Recall is the ratio of correctly predicted positive samples to all true positive samples. The higher the recall, the better the model is at finding the real positives.
- The F1 score balances precision and recall in a single number, computed as 2 * (precision * recall) / (precision + recall).
- Accuracy is the proportion of all samples, positive and negative, that the model classifies correctly. The higher the accuracy, the closer the predictions are to the ground truth overall.

2. How are these evaluation metrics computed?
To compute precision, recall, F1 and accuracy, a gold-standard data set is needed against which the model's predictions can be compared. Such a data set is usually annotated with the correct entities, relations or events. Taking entity extraction as an example, the metrics are computed as follows.
- First, the model's predictions are compared with the gold annotations. Each entity can be treated as a binary decision: it is either a correctly extracted entity or an error (a missed or spuriously extracted entity).
- Next, precision and recall are computed:
  precision = correctly predicted entities / all predicted entities
  recall = correctly predicted entities / all gold entities
- Finally, the F1 score can be used to measure the model's overall performance. A code sketch of this computation follows.
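The following sketch, not part of the original reply, turns the formulas above into code: predicted and gold entities are compared as (start, end, label) spans under exact matching, and the example spans are invented for illustration.

```python
# Illustrative sketch of entity-level precision/recall/F1 (example data invented here).
# An entity is represented as a (start, end, label) tuple; exact-match scoring is used.
def precision_recall_f1(predicted, gold):
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1

gold = [(0, 10, "PER"), (25, 34, "ORG"), (50, 60, "DATE")]
predicted = [(0, 10, "PER"), (25, 34, "ORG"), (40, 45, "LOC")]

p, r, f1 = precision_recall_f1(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")  # precision=0.67 recall=0.67 F1=0.67
```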
Evaluation Metrics for Information Extraction: A Reply

Introduction: With the arrival of the Internet and the big data era, enormous amounts of information are generated and stored, and obtaining useful information from massive text collections has become a key task. Information extraction is a natural language processing technique that aims to extract useful information, such as entities, relations and events, from unstructured text. Information extraction involves several stages and processes, including entity recognition, relation extraction and event extraction, so evaluation metrics are needed to measure its performance and effectiveness. This article introduces some commonly used evaluation metrics for information extraction and explains their definitions and use one by one.

1. Precision
Precision is one of the metrics most commonly used for information extraction. It measures how many of the system's recognized results are correct. It is computed as:
  precision = correctly recognized entities / all recognized entities
Precision ranges from 0 to 1; the closer to 1, the better the system's recognition. However, precision alone cannot fully reflect performance, because it ignores the entities the system failed to capture.

2. Recall
Recall is another commonly used metric. It measures how many of the entities in the text the system manages to recognize. It is computed as:
  recall = correctly recognized entities / true entities
Recall also ranges from 0 to 1; the closer to 1, the better the system's extraction. In contrast to precision, recall ignores the spurious entities among the system's results.

3. F1 score
The F1 score combines precision and recall and is commonly used for the overall evaluation of information extraction systems. It is computed as:
  F1 = 2 * (precision * recall) / (precision + recall)
The F1 score also lies between 0 and 1. It gives a single, combined measure of system performance that is more comparable and more stable than precision or recall alone. When both precision and recall are high, the F1 score is correspondingly high.

4. Accuracy
Accuracy is another common metric. It measures the proportion of entities in the whole text that the system handles correctly. It is computed as:
  accuracy = (correctly recognized entities + correctly rejected non-entities) / total entities
Accuracy also lies between 0 and 1; the closer to 1, the better the system's extraction.
Research on Information Extraction Techniques for Text Processing

With the arrival of the information age, the data we face every day grows ever larger, much of it massive amounts of text. Processing this information by hand is clearly an enormous undertaking, but processing text with computers has become an important part of modern technology, and information extraction is one of its core techniques.

1. The concept of information extraction
Information extraction usually refers to extracting the needed information from text while discarding redundant content. It can analyse text semantically and automatically, pulling out entities, relations and other content that fit particular semantic patterns. Common information extraction techniques include entity extraction, relation extraction, event extraction and sentiment analysis; an illustrative sketch of the kind of structured output these produce follows.
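As a purely illustrative complement to the paragraph above, the sketch below shows the kind of structured records that entity, relation and event extraction typically produce; the sentence, schema and label names are invented, not taken from the article.

```python
# Illustrative only: a made-up sentence and a minimal schema showing the kind of
# structured records entity, relation and event extraction typically produce.
from dataclasses import dataclass

@dataclass(frozen=True)
class Entity:
    text: str
    label: str          # e.g. PERSON, ORG, DATE

@dataclass(frozen=True)
class Relation:
    head: Entity
    relation: str       # e.g. works_for
    tail: Entity

sentence = "Alice Chen joined Foobar Labs in 2021."

alice = Entity("Alice Chen", "PERSON")
foobar = Entity("Foobar Labs", "ORG")
when = Entity("2021", "DATE")

entities = [alice, foobar, when]
relations = [Relation(alice, "works_for", foobar)]
event = {"type": "employment_start", "person": alice.text,
         "employer": foobar.text, "time": when.text}

print(entities)
print(relations)
print(event)
```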
2. Characteristics of information extraction
Compared with traditional text processing, information extraction has the following characteristics:
(1) Automation: it processes text far faster than people can and can pull out large amounts of information quickly; in most cases the work can be done automatically.
(2) Efficiency: it can extract the information we need from very large bodies of text, greatly shortening processing time.
(3) High precision: with natural language processing it can extract the genuinely meaningful information from a sentence, so its error rate is low and its results are credible.
(4) Extensibility: for unknown cases such as new vocabulary or new domains, information extraction can be extended and adapted effectively.

3. Application areas of information extraction
Information extraction is widely used in many fields:
(1) Search engines: search engines use automated information extraction to crawl text on the Internet and extract the information relevant to a user's query.
(2) Financial risk control: information extraction can draw on large amounts of text data to assess changes in financial markets and provide data support for risk control.
(3) Intelligent customer service: it can analyse a user's question together with other contextual information in the dialogue and produce an appropriate answer, improving user satisfaction.
(4) Social media analysis: it can analyse the large volumes of text on social media to obtain useful information such as user needs and market trends.

4. Research on information extraction
In recent years, information extraction has developed rapidly in the big data era and its industrial applications have broadened. As the technology matures, researchers continue to explore how to optimize the algorithms and improve the efficiency and accuracy of information extraction.
Information Extraction from Single and Multiple SentencesMark StevensonDepartment of Computer ScienceRegent Court,211Portobello Street,University of SheffieldSheffieldS14DP,UKmarks@AbstractSome Information Extraction(IE)systemsare limited to extracting events expressedin a single sentence.It is not clear what ef-fect this has on the difficulty of the extrac-tion task.This paper addresses the prob-lem by comparing a corpus which has beenannotated using two separate schemes:onewhich lists all events described in the textand another listing only those expressedwithin a single sentence.It was found thatonly40.6%of the events in thefirst anno-tation scheme were fully contained in thesecond.1IntroductionInformation Extraction(IE)is the process of identifying specific pieces of information in text, for example,the movements of company execu-tives or the victims of terrorist attacks.IE is a complex task and a the description of an event may be spread across several sentences or para-graphs of a text.For example,Figure1shows two sentences from a text describing manage-ment succession events(i.e.changes in corpo-rate executive management personnel).It can be seen that the fact that the executives are leaving and the name of the organisation are listed in thefirst sentence.However,the names of the executives and their posts are listed in the second sentence although it does not mention the fact that the executives are leaving these posts.The succession events can only be fully understood from a combination of the informa-tion contained in both sentences.Combining the required information across sentences is not a simple task since it is neces-sary to identify phrases which refer to the same entities,“two top executives”and“the execu-tives”in the above example.Additional diffi-culties occur because the same entity may be referred to by a different linguistic unit.For ex-ample,“International Business Machines Ltd.”may be referred to by an abbreviation(“IBM”),Pace American Group Inc.said it notified two top executives it intends to dismiss them because an internal investigation found ev-idence of“self-dealing”and“undisclosedfi-nancial relationships.”The executives are Don H.Pace,cofounder,president and chief executive officer;and Greg S.Kaplan,senior vice president and chieffinancial officer. Figure1:Event descriptions spread across two sentencesnickname(“Big Blue”)or anaphoric expression such as“it”or“the company”.These complica-tions make it difficult to identify the correspon-dences between different portions of the text de-scribing an event.Traditionally IE systems have consisted of several components with some being responsi-ble for carrying out the analysis of individual sentences and other modules which combine the events they discover.These systems were of-ten designed for a specific extraction task and could only be modified by experts.In an ef-fort to overcome this brittleness machine learn-ing methods have been applied to port sys-tems to new domains and extraction tasks with minimal manual intervention.However,some IE systems using machine learning techniques only extract events which are described within a single sentence,examples include(Soderland, 1999;Chieu and Ng,2002;Zelenko et al.,2003). 
Presumably an assumption behind these ap-proaches is that many of the events described in the text are expressed within a single sen-tence and there is little to be gained from the extra processing required to combine event de-scriptions.Systems which only attempt to extract events described within a single sentence only report results across those events.But the proportion of events described within a single sentence is not known and this has made it difficult to com-pare the performance of those systems against ones which extract all events from text.This question is addressed here by comparing two versions of the same IE data set,the evaluation corpus used in the Sixth Message Understand-ing Conference(MUC-6)(MUC,1995).The corpus produced for this exercise was annotated with all events in the corpus,including those described across multiple sentences.An inde-pendent annotation of the same texts was car-ried out by Soderland(1999),although he only identified events which were expressed within a single sentence.Directly comparing these data sets allows us to determine what proportion of all the events in the corpus are described within a single sentence.The remainder of this paper is organised as follows.Section2describes the formats for rep-resenting events used in the MUC and Soder-land data sets.Section3introduces a common representation scheme which allows events to be compared,a method for classifying types of event matches and a procedure for comparing the two data sets.The results and implications of this experiment are presented in Section4. Some related work is discussed in Section5.2Event Scope and Representation The topic of the sixth MUC(MUC-6)was management succession events(Grishman and Sundheim,1996).The MUC-6data has been commonly used to evaluate IE systems.The test corpus consists of100Wall Street Jour-nal documents from the period January1993 to June1994,54of which contained manage-ment succession events(Sundheim,1995).The format used to represent events in the MUC-6 corpus is now described.2.1MUC RepresentationEvents in the MUC-6evaluation data are recorded in a nested template structure.This format is useful for representing complex events which have more than one participant,for ex-ample,when one executive leaves a post to be replaced by another.Figure2is a simplified event from the the MUC-6evaluation similar to one described by Grishman and Sundheim (1996).This template describes an event in which “John J.Dooner Jr.”becomes chairman of the company“McCann-Erickson”.The MUC tem-plates are too complex to be described fully here but some relevant features can be discussed. 
Each SUCCESSION EVENT contains the name of <SUCCESSION_EVENT-9402240133-2>:=SUCCESSION_ORG:<ORGANIZATION-9402240133-1>POST:"chairman"IN_AND_OUT:<IN_AND_OUT-9402240133-4>VACANCY_REASON:DEPART_WORKFORCE<IN_AND_OUT-9402240133-4>:=IO_PERSON:<PERSON-9402240133-1>NEW_STATUS:INON_THE_JOB:NOOTHER_ORG:<ORGANIZATION-9402240133-1>REL_OTHER_ORG:SAME_ORG<ORGANIZATION-9402240133-1>:=ORG_NAME:"McCann-Erickson"ORG_ALIAS:"McCann"ORG_TYPE:COMPANY<PERSON-9402240133-1>:=PER_NAME:"John J.Dooner Jr."PER_ALIAS:"John Dooner""Dooner"Figure2:Example Succession event in MUC formatthe POST,organisation(SUCCESSION ORG)and references to at least one IN AND OUT sub-template,each of which records an event in which a person starts or leaves a job.The IN AND OUT sub-template contains details of the PERSON and the NEW STATUSfield which records whether the person is starting a new job or leav-ing an old one.Several of thefields,including POST,PERSON and ORGANIZATION,may contain aliases which are alternative descriptions of thefieldfiller and are listed when the relevant entity was de-scribed in different was in the text.For ex-ample,the organisation in the above template has two descriptions:“McCann-Erickson”and “McCann”.It should be noted that the MUC template structure does not link thefieldfillers onto particular instances in the texts.Conse-quently if the same entity description is used more than once then there is no simple way of identifying which instance corresponds to the event description.The MUC templates were manuallyfilled by annotators who read the texts and identified the management succession events they contained. The MUC organisers provided strict guidelines about what constituted a succession event and how the templates should befilled which the an-notators sometimes found difficult to interpret (Sundheim,1995).Interannotator agreementwas measured on30texts which were examined by two annotators.It was found to be83%when one annotator’s templates were assumed to be correct and compared with the other.2.2Soderland’s Representation Soderland(1999)describes a supervised learn-ing system called WHISK which learned IE rules from text with associated templates. WHISK was evaluated on the same texts from the MUC-6data but the nested template struc-ture proved too complex for the system to learn. 
Consequently Soderland produced his own sim-pler structure to represent events which he de-scribed as“case frames”.This representation could only be used to annotate events described within a single sentence and this reduced the complexity of the IE rules which had to be learned.The succession event from the sentence “Daniel Glass was named president and chief executive officer of EMI Records Group,a unit of London’s Thorn EMI PLC.”would be represented as follows:1@@TAGS Succession{PersonIn DANIEL GLASS}{Post PRESIDENT AND CHIEF EXECUTIVE OFFICER} {Org EMI RECORDS GROUP}Events in this format consist of up to four components:PersonIn,PersonOut,Post and Org.An event may contain all four components although none are compulsory.The minimum possible set of components which can form an event are(1)PersonIn,(2)PersonOut or(3) both Post and Org.Therefore a sentence must contain a certain amount of information to be listed as an event in this data set:the name of an organisation and post participating in a management succession event or the name of a person changing position and the direction of that change.Soderland created this data from the MUC-6evaluation texts without using any of the existing annotations.The texts werefirst pre-processing using the University of Mas-sachusetts BADGER syntactic analyser(Fisher et al.,1995)to identify syntactic clauses and the named entities relevant to the management suc-cession task:people,posts and organisations. Each sentence containing relevant entities was examined and succession events manually iden-tified.1The representation has been simplified slightly for clarity.This format is more practical for machine learning research since the entities which par-ticipate in the event are marked directly in the text.The learning task is simplified by the fact that the information which describes the event is contained within a single sentence and so the feature space used by a learning algorithm can be safely limited to items within that context. 
3Event Comparison3.1Common Representation andTransformationThere are advantages and disadvantages to the event representation schemes used by MUC and Soderland.The MUC templates encode more information about the events than Soderland’s representation but the nested template struc-ture can make them difficult to interpret man-ually.In order to allow comparison between events each data set was transformed into a com-mon format which contains the information stored in both representations.In this format each event is represented as a single database record with fourfields:type,person,post and organisation.The typefield can take the values person in,person out or,when the di-rection of the succession event is not known, person move.The remainingfields take the person,position and organisation names from the text.Thesefields may contain alternative values which are separated by a vertical bar (“|”).MUC events can be translated into this format in a straightforward way since each IN AND OUT sub-template corresponds to a sin-gle event in the common representation.The MUC representation is more detailed than the one used by Soderland and so some in-formation is discarded from the MUC tem-plates.For example,the VACANCY REASON filed which lists the reason for the manage-ment succession event is not transfered to the common format.The event listed in Figure2would be represented as follows: type(person in)person(‘John J.Dooner Jr.’|‘John Dooner’|‘Dooner’)org(‘McCann-Erickson’|‘McCann’)post(chairman)Alternativefillers for the person and org fields are listed here and these correspond to the PER NAME,PER ALIAS,ORG NAME and ORG ALIASfields in the MUC template.The Soderland succession event shown in Section 2.2would be represented as follows in the common format.type(person in)person(‘Daniel Glass’)post(‘president’)org(‘EMI Records Group’)type(person in)person(‘Daniel Glass’)post(‘chief executive officer’)org(‘EMI Records Group’)In order to carry out this transformation an event has to be generated for each PersonIn and PersonOut mentioned in the Soderland event. Soderland’s format also lists conjunctions of post names as a single slotfiller(“president and chief executive officer”in this example).These are treated as separate events in the MUC for-mat.Consequently they are split into the sepa-rate post names and an event generated for each in the common representation.It is possible for a Soderland event to consist of only a Post and Org slot(i.e.there is nei-ther a PersonIn or PersonOut slot).In these cases an underspecified type,person move,is used and no personfield listed.Unlike MUC templates Soderland’s format does not contain alternative names forfieldfillers and so these never occur when an event in Soderland’s for-mat is translated into the common format.3.2MatchingThe MUC and Soderland data sets can be com-pared to determine how many of the events in the former are also contained in the latter. 
This provides an indication of the proportion of events in the MUC-6domain which are express-ible within a single sentence.Matches between Soderland and MUC events can be classified as full,partial or nomatch.Each of these possi-bilities may be described as follows:Full A pair of events can only be fully match-ing if they contain the same set offields.In addition there must be a commonfiller for eachfield.The following pair of events are an example of two which fully match.type(person in)person(‘R.Wayne Diesel’|‘Diesel’)org(‘Mechanical Technology Inc.’|‘Mechanical Technology’)post(‘chief executive officer’)type(person in)person(‘R.Wayne Diesel’)org(‘Mechanical Technology’)post(‘chief executive officer’)Partial A partial match occurs when one event contains a proper subset of thefields of an-other event.Eachfield shared by the two events must also share at least onefiller.The following event would partially match either of the above events;the orgfield is absent therefore the matches would not be full.type(person in)person(‘R.Wayne Diesel’)post(‘chief executive officer’)Nomatch A pair of events do not match if the conditions for a full or partial match are not met.This can occur if correspondingfields do not share afiller or if the set offields in the two events are not equivalent or one the subset of the other.Matching between the two sets of events is carried out by going through each MUC event and comparing it with each Soderland event for the same document.The MUC event isfirst compared with each of the Soderland events to check whether there are any equal matches.If one is found a note is made and the matching process moves onto the next event in the MUC set.If an equal match is not found the MUC event is again compared with the same set of Soderland events to see whether there are any partial matches.We allow more than one Soder-land event to partially match a MUC event so when one is found the matching process con-tinues through the remainder of the Soderland events to check for further partial matches.4Results4.1Event level analysisAfter transforming each data set into the com-mon format it was found that there were276 events listed in the MUC data and248in the Soderland set.Table1shows the number of matches for each data set following the match-ing process described in Section3.2.The countsunder the“MUC data”and“Soderland data”headings list the number of events which fall into each category for the MUC and Soderland data sets respectively along with corresponding percentages of that data set.It can be seen that 112(40.6%)of the MUC events are fully cov-ered by the second data set,and108(39.1%) partially covered.Match MUC data Soderland data Type Count%Count%Full11240.6%11245.2% Partial10839.1%11847.6% Nomatch5620.3%187.3% Total276248Table1:Counts of matches between MUC and Soderland data.Table1shows that there are108events in the MUC data set which partially match with the Soderland data but that118events in the Soderland data set record partial matches with the MUC data.This occurs because the match-ing process allows more than one Soderland event to be partially matched onto a single MUC event.Further analysis showed that the difference was caused by MUC events which were partially matched by two events in the Soderland data set.In each case one event contained details of the move type,person in-volved and post title and another contained the same information without the post title.This is caused by the style in which the newswire sto-ries which make up the MUC corpus are 
writ-ten where the same event may be mentioned in more than one sentence but without the same level of detail.For example,one text contains the sentence“Mr.Diller,50years old,succeeds Joseph M.Segel,who has been named to the post of chairman emeritus.”which is later fol-lowed by“At that time,it was announced that Diller was in talks with the company on becom-ing its chairman and chief executive upon Mr. Segel’s scheduled retirement this month.”Table1also shows that there are56events in the MUC data which fall into the nomatch cat-egory.Each of these corresponds to an event in one data set with no corresponding event in the other.The majority of the unmatched MUC events were expressed in such a way that there was no corresponding event listed in the Soder-land data.The events shown in Figure1are examples of this.As mentioned in Section2.2, a sentence must contain a minimum amount of information to be marked as an event in Soder-land’s data set,either name of an organisation and post or the name of a person changing po-sition and whether they are entering or leaving. In Figure1thefirst sentence lists the organisa-tion and the fact that executives were leaving. The second sentence lists the names of the exec-utives and their positions.Neither of these sen-tences contains enough information to be listed as an event under Soderland’s representation, consequently the MUC events generated from these sentences fall into the nomatch category. It was found that there were eighteen events in the Soderland data set which were not in-cluded in the MUC version.This is unexpected since the events in the Soderland corpus should be a subset of those in the MUC corpus.Anal-ysis showed that half of these corresponded to spurious events in the Soderland set which could not be matched onto events in the text.Many of these were caused by problems with the BAD-GER syntactic analyser(Fisher et al.,1995) used to pre-process the texts before manual analysis stage in which the events were identi-fied.Mistakes in this pre-processing sometimes caused the texts to read as though the sentence contained an event when it did not.We exam-ined the MUC texts themselves to determine whether there was an event rather than relying on the pre-processed output.Of the remaining nine events it was found that the majority(eight)of these corresponded to events in the text which were not listed in the MUC data set.These were not identi-fied as events in the MUC data because of the the strict guidelines,for example that historical events and non-permanent management moves should not be annotated.Examples of these event types include“...Jan Carlzon,who left last year after his plan for a merger with three other European airlines failed.”and“Charles T.Young,chieffinancial officer,stepped down voluntarily on a‘temporary basis pending con-clusion’of the investigation.”The analysis also identified one event in the Soderland data which appeared to correspond to an event in the text but was not listed in the MUC scenario tem-plate for that document.It could be argued that there nine events should be added to the set of MUC events and treated as fully matches. However,the MUC corpus is commonly used as a gold standard in IE evaluation and it was de-cided not to alter it.Analysis indicated that one of these nine events would have been a fullmatch and eight partial matches.It is worth commenting that the analysis car-ried out here found errors in both data sets. 
There appeared to be more of these in the Soderland data but this may be because the event structures are much easier to interpret and so errors can be more readily identified.It is also difficult to interpret the MUC guidelines in some cases and it sometimes necessary to make a judgement over how they apply to a particular event.4.2Event Field AnalysisA more detailed analysis can be carried out examining the matches between each of the fourfields in the event representation individu-ally.There are1,094fields in the MUC data. Although there are276events in that data set seven of them do not mention a post and three omit the organisation name.(Organisa-tion names are omitted from the template when the text mentions an organisation description rather than its name.)Table4.2lists the number of matches for each of the four eventfields across the two data sets. Each of the pairs of numbers in the main body of the table refers to the number of matching in-stances of the relevantfield and the total num-ber of instances in the MUC data.The column headed“Full match”lists the MUC events which were fully matched against the Soderland data and,as would be expected, allfields are matched.The column marked “Partial match”lists the MUC events which are matched onto Soderlandfields via partially matching events.The column headed“No-match”lists the eventfields for the56MUC events which are not represented at all in the Soderland data.Of the total1,094eventfields in the MUC data727,66.5%,can be found in the Soderland data.The rightmost column lists the percent-ages of eachfield for which there was a match. The counts for the type and personfields are the same since the type and personfields are com-bined in Soderland’s event representation and hence can only occur together.Thesefigures also show that there is a wide variation between the proportion of matches for the differentfields with76.8%of the person and typefields be-ing matched but only43.2%of the organisation field.This difference betweenfields can be ex-plained by looking at the style in which the texts forming the MUC evaluation corpus are writ-ten.It is very common for a text to introduce a management succession event near the start of the newswire story and this event almost in-variably contains all four eventfields.For ex-ample,one story starts with the following sen-tence:“Washington Post Co.said Katharine Graham stepped down after20years as chair-man,and will be succeeded by her son,Don-ald E.Graham,the company’s chief executive officer.”Later in the story further succession events may be mentioned but many of these use an anaphoric expression(e.g.“the company”) rather than explicitly mention the name of the organisation in the event.For example,this sen-tence appears later in the same story:“Alan G. 
Spoon,42,will succeed Mr.Graham as presi-dent of the company.”Other stories again may only mention the name of the person in the suc-cession event.For example,“Mr.Jones is suc-ceeded by Mr.Green”and this explains why some of the organisationfields are also absent from the partially matched events.4.3DiscussionFrom some perspectives it is difficult to see why there is such a difference between the amount of events which are listed when the entire text is viewed compared with considering single sen-tences.After all a text comprises of an ordered list of sentences and all of the information the text contains must be in these.Although,as we have seen,it is possible for individual sentences to contain information which is difficult to con-nect with the rest of the event description when a sentence is considered in isolation.The results presented here are,to some ex-tent,dependent on the choices made when rep-resenting events in the two data sets.The events listed in Soderland’s data require a min-imal amount of information to be contained within a sentence for it to be marked as con-taining information about a management suc-cession event.Although it is difficult to see how any less information could be viewed as repre-senting even part of a management succession event.5Related WorkHuttunen et al.(2002)found that there is varia-tion between the complexity of IE tasks depend-ing upon how the event descriptions are spread through the text and the ways in which they are encoded linguistically.The analysis presented here is consistent with theirfinding as it hasFull match Partial match Nomatch TOTAL% Type112/112100/1080/56212/27676.8% Person112/112100/1080/56212/27676.8% Org112/1126/1080/53118/27343.2% Post111/11174/1080/50185/26968.8% Total447/447280/4320/215727/109466.5% Table2:Matches between MUC and Soderland data atfield levelbeen observed that the MUC texts are often written in such as way that the name of the organisation in the event is in a different part of the text to the rest of the organisation de-scription and the entire event can only be con-structed by resolving anaphoric expressions in the text.The choice over which information about events should be extracted could have an effect on the difficulty of the IE task.6ConclusionsIt seems that the majority of events are not fully described within a single sentence,at least for one of the most commonly used IE evaluation sets.Only around40%of events in the original MUC data set were fully expressed within the Soderland data set.It was also found that there is a wide variation between different eventfields and some information may be more difficult to extract from text when the possibility of events being described across multiple sentences is not considered.This observation should be borne in mind when deciding which approach to use for a particular IE task and should be used to put the results reported for IE systems which extract from a single sentence into context. 
Acknowledgements
I am grateful to Stephen Soderland for allowing access to his version of the MUC-6 corpus and advice on its construction. Robert Gaizauskas and Beth Sundheim also provided advice on the data used in the MUC evaluation. Mark Hepple provided valuable comments on early drafts of this paper. I am also grateful to an anonymous reviewer who provided several useful suggestions.

References
H. Chieu and H. Ng. 2002. A Maximum Entropy Approach to Information Extraction from Semi-structured and Free Text. In Proceedings of the Eighteenth National Conference on Artificial Intelligence (AAAI-02), pages 768-791, Edmonton, Canada.
D. Fisher, S. Soderland, J. McCarthy, F. Feng, and W. Lehnert. 1995. Description of the UMass system as used for MUC-6. In Proceedings of the Sixth Message Understanding Conference (MUC-6), pages 221-236, San Francisco, CA.
R. Grishman and B. Sundheim. 1996. Message Understanding Conference - 6: A brief history. In Proceedings of the 16th International Conference on Computational Linguistics (COLING-96), pages 466-470, Copenhagen, Denmark.
S. Huttunen, R. Yangarber, and R. Grishman. 2002. Complexity of Event Structures in IE Scenarios. In Proceedings of the 19th International Conference on Computational Linguistics (COLING-2002), pages 376-382, Taipei, Taiwan.
MUC. 1995. Proceedings of the Sixth Message Understanding Conference (MUC-6), San Mateo, CA. Morgan Kaufmann.
S. Soderland. 1999. Learning Information Extraction Rules for Semi-structured and Free Text. Machine Learning, 31(1-3):233-272.
B. Sundheim. 1995. Overview of results of the MUC-6 evaluation. In Proceedings of the Sixth Message Understanding Conference (MUC-6), pages 13-31, Columbia, MA.
D. Zelenko, C. Aone, and A. Richardella. 2003. Kernel methods for relation extraction. Journal of Machine Learning Research, 3:1083-1106.
Introduction to Python Programming
Li Zhenghua, October 25, 2016

1. About the instructor
Ph.D. graduate of 工大; a family with parents, wife and daughter; in my third year as a teacher, giving equal weight to teaching and research; I feel the weight of the responsibility and will give it my all; my knowledge is still limited, so please bear with me.
1983.9-2002.9: hometown; 2002.9-2013.5: university; 2013.8-2016.7: lecturer; 2016.7-: associate professor.
Research group: Human Language Technology Institute. Research interests: natural language processing, artificial intelligence.
Homepage: /~zhli   Email: zhli13@

2. The Human Language Technology Institute
English name: Human Language Technology (HLT). The group currently has 3 professors (including one Distinguished Young Scholars awardee and one Excellent Young Scholars awardee) and 2 associate professors. We focus on natural language processing and artificial intelligence; sub-directions include:
- Language Analysis and Understanding
- Information Extraction and Knowledge Mining
- Machine Translation (Google Translate)
- Question Answering (Apple Siri, IBM Watson)
- Search Engines (Google, Baidu)
- User Understanding and Product Recommendation (Taobao, JD.com)

3. Course administration
Course homepage: /~zhli/teach/python-2016-fall (course information, lecture slides, assignments, grades, and so on).

4. Class requirements
Attend every class. If you cannot attend for some reason, you must have a classmate hand in a formal leave slip (signed or stamped by your counselor) during class. Complete the lab assignments carefully. If you leave a lab session early, explain the reason to the teaching assistant.

5. Getting along
Mutual understanding and respect; learning from each other and improving together. If you have a question, interrupt me at any time. If you find a mistake of mine, point it out right away (it will be recorded in the participation grade, and I can do a few push-ups).

6. How is the course grade computed? (double degree)
The plan is to compute the grade as follows (to be confirmed with the academic secretary): participation (attendance, answering questions, etc.) 10%; programming assignments 30% (a teaching assistant checks the assignments in each lab session); final exam 60% (possibly in the form of an on-machine exam).

7. How is the course grade computed? (first-year computer science)
Continuous, process-based assessment.

8. Course goal
Learn to write programs! Practical hands-on ability is the most basic requirement of the computer science discipline.
Information Extraction and Classification from FreeText Using a Neural ApproachIgnazio Gallo and Elisabetta BinaghiDepartment of Computer Science and CommunicationUniversit´a degli Studi dell’Insubriavia Mazzini5,Varese,Italyignazio.gallo@uninsubria.ithttp://www.uninsubria.eu/Abstract.Many approaches to Information Extraction(IE)have been proposedin literature capable offinding and extract specific facts in relatively unstructureddocuments.Their application in a large information space makes data ready forpost-processing which is crucial to many context such as Web mining and search-ing tools.This paper proposes a new IE strategy,based on symbolic and neuraltechniques,and tests it experimentally within the price comparison service do-main.In particular the strategy seeks to locate a set of atomic elements in freetext which is preliminarily extracted from web documents and subsequently clas-sify them assigning a class label representing a specific product.Keywords:Information Extraction,Neural Network,Text Classification1IntroductionWith the Internet becoming increasingly popular,more and more information is avail-able in a relatively free text format.This situation creates the premise for efficient on line services and Web mining application in several domains.The on line availability of ever larger amounts of commercial information,for example,creates the premise for profitable price comparison services allowing individual to see lists of prices for specific products.However critical aspects such as information overload,heterogeneity and ambiguity due to vocabulary differences limit the diffusion and usefulness of these advanced tools requiring expensive maintenance and frustrating users instead of empowering them.To address these problems,efficient Information Extraction(IE)techniques must be provided capable offinding and extract specific facts in relatively unstructured doc-uments.Their application in a large information space makes data ready for post-processing which is crucial to many context such as Web mining and searching tools.Information extraction programs analyze a small subset of any given text,e.g.,those parts that contain certain trigger words,and then attempt tofill out a fairly simple form that represents the objects or events of interest.An IE task is defined by its input and its extraction target.The input can be unstructured documents like free text that are written in natural language(e.g.,Fig.1)or the semi-structured documents that abound on the Web such as tables or itemized and enumerated lists(e.g.,Fig.2).The extraction targetof an IE task can be a relation of k-tuple(where k is the number of attributes in a record) or it can be a complex object with hierarchically organized data.Many approaches to IE have been proposed in literature and classified from differ-ent points of view such as the degree of automation[1],type of input document and structure/constraint of the extraction pattern[2].This paper proposes a new IE strategy and tests it experimentally within the price comparison service domain.Most price comparison services do not sell products them-selves,but show prices of the retailers from whom users can buy.Since the stores are heterogeneous and each one describes products in different ways(see example in Fig.3),a generic procedure must be devised to extract the content of a particular informa-tion source.In particular our IE strategy seeks to locate a set of atomic elements in free text preliminarily extracted from web documents and subsequently classify 
them ac-cordingly.In this context,our principal interest is to extract,from a textual description, information that identify a commercial product with an unambiguous label in order to be able to compare prices.In our experiments product price was associated to the description,therefore its extraction is not necessary.The Motorola RAZR V3i is fully loaded*-delivering the ultimate com-bination of design and technology.Beneath this sculpted metal exterioris a lean mean,globe-hopping machine.Modelled after the MotorolaRAZR V3,the RAZR V3i has an updated and streamlined design,of-fering consumers a large internal color screen,...Fig.1.An unstructured document written in natural language that describes the product’Motorola RAZR V3i’.Following the terminology used by Chang et.al[3]the salient aspects are–an hybrid solution for building thesaurus based on manual and supervised neural technics–a tree structured matcher for identifying meaningful sequences of atomic elements (tokens);–a set of logical rules which interpret and evaluate distance measures in order to assign the correct class to documents.2System OverviewThe IE system developed is composed of two main parts,matcher and classifier,and acts on free text documents obtained from original web documents.It specifically ad-dresses the following problems typical of the price comparison service information space:an attribute may have zero(missing)or multiple instantiations in a document; various permutations of attributes or typographical errors may occur in the input docu-ments(see an example in Fig.3).Fig.2.Semi-structured documents written in natural language that describes a set of products.Prior to both matching and classification phases the tokenizer divides the text into simple tokens having the following nature:–word:a word is defined as any set of contiguous upper or lowercase letters;–number:a number is defined as any combination of consecutive digits.2.1MatcherIn our context the matcher has to operate on specific annotations that can be matched to brands(B)and models(M)of products enlisted in the price comparison service.The matcher is then constructed starting from a set KB={(b,m)|b∈B,m∈M b}that contains the couples formed by a brand b and a model m.b belongs to the set of brands B of a given category of products,while m belongs to the set of models M b that have b as brand.A thesaurus that collects all the synonyms commonly used to describe a particular category of products is used to extend the KB.In particular if there are one or more synonyms in the thesaurus for a couple(b,m)then we add a new couple(¯b,¯m)to the KB for each synonym.Fig.3.List of potential descriptions in (b)that may be used to describe product (a).The tokenizer associates with every couple in the KB a sequence of tokens T b,m =(t b 1,t b 2,...,t m 1,t m 2,...)where t b i and t m j are all the tokens obtained from the brand band the model m respectively.Every sequence T b,m is used to construct a path within the matcher tree structure:it starts from the first node associated with token t b 1and arrives at a leaf node associated with the label derived from (b,m )(see example in Fig.4).Based on these solutions,the IE task is accomplished splitting an input document into a tokens sequence (d 1,d 2,...).Starting from the root node the search for the sub-sequent node (better match in the subsequent layer)is performed using the concept of edit distance (ed )between input tokens and every token of a tree’s layer and the posi-tion distance (pd )between two matched input 
tokens.The edit distance between two tokens measures the minimum number of unit editing operations of insertion,deletion,replacement of a symbol ,and transposition of adjacent symbols [4]necessary to convert one token into another.The position distance measures the number of tokens d i found between two matched d j d k tokens that would have to be consecutive The system first searches a sequence of perfect match ( ed =0)that leads to recognition of all the tokens t b i associated with one brand.In this way we obtain a list of possible brands associated with the input document.Starting from the last node (with token t b i )of a matched brand,the algorithm begins the search for the most probable model.The system searches all the paths that lead to a leaf node with the sum of ed equal to zero.If this is not possible the system returns the path with minimum ed .In case of a token t i with multiple match (d j ,d k ,...),with identical minimum ed ,the input token with minimum pd (compared to the parent token in the tree)will be selected (see example in Fig.5).2.2ClassifierA rule based classifier was designed with the aim of assigning a class label representing a specific product to each document using the information extracted from the matching phase.The classifier receives in input the output of the matcher i.e.the set of matched sequence T b,m of tokens weighted as a function of the ed .These input values are used to construct a sub-tree of the Matcher starting from which the classifier computes the class of the given document.The set of predefined classes is constituted by all the products (b,m )inserted in the KB .The classifier selects a class from a subset obtained by the input set of matched sequence T b,m (there is one class for each matched sequence).Fig.4.An example of KB set and thesaurus used to build the matcher.Fig.5.Role of edit distance and position distance during the matching process.In case of tokens having equal ed measure(ed(t i+1,d j+1)=ed(t i+1,d s)=ed(t i+1,d r))the token d j+1with minimum pd is selected.Position and edit distances are evaluated to decide class assignment:the classifier start from the leaves of the sub-tree and at each step compares each node t ij with its parent t i−1using the following rules:–select the node t ij having minimum pd(t ij,t i−1)for each j or with minimum av-erage pd computed through all the seen tree nodes;–in case of a node t ij with multiple match associated with its parent t i−1,with identical minimum ed,the input token t i−1with minimum pd will be selected(see example in Fig.6);–between two nodes t ij with identical weight(ed+pd)in the same layer i,select that with a greater path starting from the leaf node;–if all the previous rulesfind no differences between two nodes t ij and its parent t i−1then selects that with minimum pd computed by the matcher.Fig.6.Role of edit distance and position distance during the classification process.In case of tokens having equal ed measure(ed(t i−1,d k)=ed(t i−1,d r)=ed(t i−1,d s))the token d s with minimum pd is selected.3Automatic Thesaurus BuildingThe accuracy strongly depends on the completeness of the thesaurus.Unfortunately, thesaurus maintenance is an expensive process.The present work proposed a neural adaptive tool able to support thesaurus updating.The main idea of our solution is to automatically induce from examples general rules able to identify the presence of synonyms within sequences of tokens produced by the matcher.These rules,difficult to hand-craft and define explicitly,are 
obtained adaptively using neural learning. In particular, a Multilayer Perceptron (MLP) [5] is trained to receive in input the following types of features from the sequence of matched tokens:
- edit distance between the tokens (t_b1, t_b2, ..., t_m1, t_m2, ...) of the KB and the tokens found in the document;
- number of characters between two consecutive tokens found in the document;
- typology of the found tokens (word or number);
- bits that identify the presence of each token.

As illustrated in Fig. 7, each pattern is divided into four groups of P features, where P represents the maximum number of tokens obtainable from a product (b, m). The output pattern identifies a synonym S̄ of S, where S is a particular sequence of tokens extracted from T_{b,m}, and S̄ is a different sequence of tokens extracted from T_{b,m} or from the sequence of matched tokens of the document. Each output pattern has a dimension equal to 3P. The output of the trained neural network is used to add a new item S = S̄ to the thesaurus: when an output neuron has a value greater than a threshold, the corresponding token is considered part of a synonym.

4 Experiments
The aim of these experiments was to measure the classification effectiveness in terms of precision and recall, and to measure the contribution of neural thesaurus updating within the overall strategy.

Fig. 7. An example of input pattern/output construction for the MLP model.

4.1 Text Collection
The document collection is taken from the price comparison service Shoppydoo (http:// and http://www.shoppydoo.it), which allows users to compare prices and models of commercial products. Two experiments were conceived and conducted in the field of price comparison services. Two product categories were identified, cell-phones and digital-cameras. Three specific brands were considered in our KB set of couples (b, m) for each category (Nokia, Motorola, Sony for cell phones and Canon, Nikon, Samsung for digital cameras). The total number of documents collected for the cell-phone category was 1315, of which 866 were associated with one of the three identified brands. The number of documents belonging to the digital-camera category was 2712, of which 1054 were associated with one of the three identified brands. The remaining documents, belonging to brands different from those considered in the experiment, must be classified as not relevant.

4.2 Evaluation Metrics
Performance is measured by recall, precision and F-measure. Let us assume a collection of N documents. Suppose that in this collection there are n < N documents relevant to the specific information we want to extract (brand and model of a product). The IE system recognizes m documents, a of which are actually relevant. Then the recall, R, of the IE system on that information is given by

R = a / n    (1)

and the precision, P, is given by

P = a / m    (2)

One way of looking at recall and precision is in terms of a 2x2 contingency table (see Table 1).

Table 1. A contingency table analysis of precision and recall.
              Relevant    Not-relevant   Total
Matched       a           b              a + b = m
Not-matched   c           d              c + d = N - m
Total         a + c = n   b + d          a + b + c + d = N
Overall accuracy (OA): (a + d) / N

Another measure used to evaluate information extraction, which combines recall and precision into a single measure, is the F-measure F_α, defined as follows:

F_α = 1 / (α·(1/P) + (1−α)·(1/R))    (3)

where α is a weight for calibrating the relative importance of recall versus precision [6].
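The following is a minimal sketch, not part of the original paper, that evaluates Equations (1)-(3) and the overall accuracy directly from the contingency counts of Table 1. The counts plugged in are those reported in Table 2(b) for the cell-phone task without a thesaurus, so the printed values can be checked against that table.

```python
# Sketch of the metrics defined above (Eqs. 1-3 and the Table 1 contingency counts).
# The sample counts are those reported for the cell-phone task without a thesaurus
# (Table 2(b)): a=755 matched relevant, b=0, c=111 missed relevant, d=449.
def evaluation_metrics(a, b, c, d, alpha=0.5):
    N = a + b + c + d          # all documents
    n = a + c                  # relevant documents
    m = a + b                  # matched (recognized) documents
    recall = a / n
    precision = a / m
    overall_accuracy = (a + d) / N
    f_alpha = 1.0 / (alpha * (1.0 / precision) + (1.0 - alpha) * (1.0 / recall))
    return recall, precision, f_alpha, overall_accuracy

r, p, f, oa = evaluation_metrics(a=755, b=0, c=111, d=449)
print(f"Recall={r:.2%}  Precision={p:.2%}  F(0.5)={f:.2%}  OA={oa:.2%}")
# Expected output: Recall=87.18%  Precision=100.00%  F(0.5)=93.15%  OA=91.56%
```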
4.3 Results
The results of the two experiments are summarized in Tables 2 and 3. We started with an empty thesaurus that was then populated with the support of the neural network. For both datasets we obtained precision and recall equal to 100% after adding new synonyms to the thesaurus to resolve all the cases not perfectly matched.

Table 2. Evaluation metrics for the 'cell-phone' problem with thesaurus (a) and without thesaurus (b).
(a) With thesaurus
              Relevant   Not-relevant   Total
Matched       866        0              866
Not-matched   0          449            449
Total         866        449            1315
Recall 100%; Precision 100%; F(α=0.5) 100%; OA 100%

(b) Without thesaurus
              Relevant   Not-relevant   Total
Matched       755        0              755
Not-matched   111        449            560
Total         866        449            1315
Recall 87.18%; Precision 100%; F(α=0.5) 93.15%; OA 91.56%

Table 3. Evaluation metrics for the 'digital-camera' problem with thesaurus (a) and without thesaurus (b).
(a) With thesaurus
              Relevant   Not-relevant   Total
Matched       1054       0              1054
Not-matched   0          1658           1658
Total         1054       1658           2712
Recall 100%; Precision 100%; F(α=0.5) 100%; OA 100%

(b) Without thesaurus
              Relevant   Not-relevant   Total
Matched       963        0              963
Not-matched   91         1658           1749
Total         1054       1658           2712
Recall 91.37%; Precision 100%; F(α=0.5) 95.49%; OA 96.64%

5 Conclusions and Future Work
The present work tested a system that makes use of IE and classification in the context of a price comparison service. The approach proved highly accurate, but it requires the assistance of an expert during the construction of the KB and the corresponding thesaurus. Future work will extend the present solution with a tool for building the KB by automatically extracting unknown models of a product from a document.

References
1. Chang, C.H., Hsu, C.N., Lui, S.C.: Automatic information extraction from semi-structured web pages by pattern discovery. Decis. Support Syst. 35(1) (April 2003) 129-147
2. Muslea, I.: Extraction patterns for information extraction tasks: A survey. In Califf, M.E., ed.: Papers from the Sixteenth National Conference on Artificial Intelligence (AAAI-99) Workshop on Machine Learning for Information Extraction, Orlando, FL, AAAI Press (July 1999)
3. Chang, C.H., Kayed, M., Girgis, M.R., Shaalan, K.F.: A survey of web information extraction systems. IEEE Transactions on Knowledge and Data Engineering 18(10) (2006) 1411-1428
4. Damerau, F.J.: A technique for computer detection and correction of spelling errors. Communications of the Association for Computing Machinery 7(3) (1964) 171-176
5. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning internal representations by error propagation. (1986) 318-362
6. Jackson, P., Moulinier, I.: Natural Language Processing for Online Applications: Text Retrieval, Extraction, and Categorization (Natural Language Processing, 5). John Benjamins Publishing Co (June 2002)