


Information Extraction and Classification from Free Text Using a Neural Approach

Ignazio Gallo and Elisabetta Binaghi

Department of Computer Science and Communication

Università degli Studi dell'Insubria

Via Mazzini 5, Varese, Italy

ignazio.gallo@uninsubria.it

http://www.uninsubria.eu/

Abstract. Many approaches to Information Extraction (IE) have been proposed in the literature, capable of finding and extracting specific facts in relatively unstructured documents. Their application in a large information space makes data ready for post-processing, which is crucial in many contexts such as Web mining and searching tools. This paper proposes a new IE strategy, based on symbolic and neural techniques, and tests it experimentally within the price comparison service domain. In particular, the strategy seeks to locate a set of atomic elements in free text, preliminarily extracted from web documents, and subsequently to classify them by assigning a class label representing a specific product.

Keywords: Information Extraction, Neural Network, Text Classification

1 Introduction

With the Internet becoming increasingly popular, more and more information is available in a relatively free text format. This situation creates the premise for efficient online services and Web mining applications in several domains. The online availability of ever larger amounts of commercial information, for example, creates the premise for profitable price comparison services allowing individuals to see lists of prices for specific products.

However, critical aspects such as information overload, heterogeneity, and ambiguity due to vocabulary differences limit the diffusion and usefulness of these advanced tools, requiring expensive maintenance and frustrating users instead of empowering them.

To address these problems, efficient Information Extraction (IE) techniques must be provided, capable of finding and extracting specific facts in relatively unstructured documents. Their application in a large information space makes data ready for post-processing, which is crucial in many contexts such as Web mining and searching tools.

Information extraction programs analyze a small subset of any given text, e.g., those parts that contain certain trigger words, and then attempt to fill out a fairly simple form that represents the objects or events of interest. An IE task is defined by its input and its extraction target. The input can be unstructured documents like free text written in natural language (e.g., Fig. 1) or the semi-structured documents that abound on the Web, such as tables or itemized and enumerated lists (e.g., Fig. 2). The extraction target of an IE task can be a relation of k-tuples (where k is the number of attributes in a record) or a complex object with hierarchically organized data.

Many approaches to IE have been proposed in the literature and classified from different points of view, such as the degree of automation [1], the type of input document, and the structure/constraint of the extraction pattern [2].

This paper proposes a new IE strategy and tests it experimentally within the price comparison service domain. Most price comparison services do not sell products themselves, but show prices of the retailers from whom users can buy. Since the stores are heterogeneous and each one describes products in different ways (see example in Fig. 3), a generic procedure must be devised to extract the content of a particular information source. In particular, our IE strategy seeks to locate a set of atomic elements in free text preliminarily extracted from web documents and subsequently classify them accordingly. In this context, our principal interest is to extract, from a textual description, information that identifies a commercial product with an unambiguous label, in order to be able to compare prices. In our experiments the product price was associated with the description, therefore its extraction is not necessary.

The Motorola RAZR V3i is fully loaded* - delivering the ultimate combination of design and technology. Beneath this sculpted metal exterior is a lean mean, globe-hopping machine. Modelled after the Motorola RAZR V3, the RAZR V3i has an updated and streamlined design, offering consumers a large internal color screen, ...

Fig. 1. An unstructured document written in natural language that describes the product 'Motorola RAZR V3i'.

Following the terminology used by Chang et al. [3], the salient aspects are:

– a hybrid solution for building a thesaurus based on manual and supervised neural techniques;
– a tree-structured matcher for identifying meaningful sequences of atomic elements (tokens);
– a set of logical rules which interpret and evaluate distance measures in order to assign the correct class to documents.

2 System Overview

The IE system developed is composed of two main parts, a matcher and a classifier, and acts on free text documents obtained from original web documents. It specifically addresses the following problems, typical of the price comparison service information space: an attribute may have zero (missing) or multiple instantiations in a document; various permutations of attributes or typographical errors may occur in the input documents (see an example in Fig. 3).

Fig. 2. Semi-structured documents written in natural language that describe a set of products.

Prior to both the matching and classification phases, the tokenizer divides the text into simple tokens of the following nature:

– word: a word is defined as any set of contiguous upper- or lowercase letters;
– number: a number is defined as any combination of consecutive digits.
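A minimal sketch of such a tokenizer, assuming a simple regex split (the function name and regular expression are ours, not taken from the paper):

```python
import re

def tokenize(text):
    """Split free text into tokens as defined above: a 'word' is a run of
    contiguous letters, a 'number' is a run of consecutive digits."""
    return [(m.group(), "number" if m.group().isdigit() else "word")
            for m in re.finditer(r"[A-Za-z]+|[0-9]+", text)]

# tokenize("V3i") -> [('V', 'word'), ('3', 'number'), ('i', 'word')]
```

Note that under these definitions a model name such as "V3i" is split into three tokens, which is why the matcher described next operates on token sequences rather than whole strings.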

2.1 Matcher

In our context the matcher has to operate on specific annotations that can be matched to brands (B) and models (M) of products enlisted in the price comparison service. The matcher is then constructed starting from a set KB = {(b, m) | b ∈ B, m ∈ M_b} that contains the couples formed by a brand b and a model m; b belongs to the set of brands B of a given category of products, while m belongs to the set of models M_b that have b as brand.

A thesaurus that collects all the synonyms commonly used to describe a particular category of products is used to extend the KB. In particular, if there are one or more synonyms in the thesaurus for a couple (b, m), then we add a new couple (b̄, m̄) to the KB for each synonym.
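The extension step above can be sketched in a few lines, assuming the KB is a set of (brand, model) couples and the thesaurus maps a couple to its synonym couples (all names and example data are illustrative, not from the paper):

```python
# Illustrative KB: couples (brand, model) for one product category.
KB = {("Motorola", "RAZR V3i"), ("Nokia", "N70")}

# Illustrative thesaurus: maps a couple to its known synonym couples.
thesaurus = {("Motorola", "RAZR V3i"): [("Moto", "V3i")]}

def extend_kb(kb, thesaurus):
    """Add a new couple to the KB for each synonym of a couple already in it."""
    extended = set(kb)
    for couple, synonyms in thesaurus.items():
        if couple in kb:
            extended.update(synonyms)
    return extended

KB_ext = extend_kb(KB, thesaurus)  # now also contains ("Moto", "V3i")
```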

Fig. 3. List of potential descriptions in (b) that may be used to describe product (a).

The tokenizer associates with every couple in the KB a sequence of tokens T_{b,m} = (t^b_1, t^b_2, ..., t^m_1, t^m_2, ...), where the t^b_i and t^m_j are all the tokens obtained from the brand b and the model m respectively. Every sequence T_{b,m} is used to construct a path within the matcher tree structure: it starts from the first node, associated with token t^b_1, and arrives at a leaf node associated with the label derived from (b, m) (see example in Fig. 4).

Based on these solutions, the IE task is accomplished by splitting an input document into a token sequence (d_1, d_2, ...). Starting from the root node, the search for the subsequent node (best match in the subsequent layer) is performed using the concept of edit distance (ed) between the input tokens and every token of a tree layer, and the position distance (pd) between two matched input tokens. The edit distance between two tokens measures the minimum number of unit editing operations (insertion, deletion, replacement of a symbol, and transposition of adjacent symbols [4]) necessary to convert one token into another. The position distance measures the number of tokens d_i found between two matched tokens d_j, d_k that would have to be consecutive. The system first searches for a sequence of perfect matches (ed = 0) that leads to recognition of all the tokens t^b_i associated with one brand. In this way we obtain a list of possible brands associated with the input document. Starting from the last node (with token t^b_i) of a matched brand, the algorithm begins the search for the most probable model. The system searches all the paths that lead to a leaf node with the sum of ed equal to zero. If this is not possible, the system returns the path with minimum ed.
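The two distances can be sketched as follows. This is a minimal illustration, not the authors' implementation: the edit distance is the Damerau variant with adjacent transpositions [4], here in its restricted "optimal string alignment" form, and the position distance simply counts intervening tokens between two matched positions.

```python
def edit_distance(a, b):
    """Damerau-style edit distance: insertions, deletions, substitutions,
    and transpositions of adjacent symbols (optimal string alignment form)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

def position_distance(j, k):
    """Number of tokens found between matched tokens at positions j and k
    that would have to be consecutive."""
    return abs(k - j) - 1

# edit_distance("RAZR", "RAZER") == 1 (one insertion)
```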

In case of a token t_i with multiple matches (d_j, d_k, ...) with identical minimum ed, the input token with minimum pd (compared to the parent token in the tree) will be selected (see example in Fig. 5).

2.2 Classifier

A rule-based classifier was designed with the aim of assigning a class label representing a specific product to each document, using the information extracted during the matching phase.

The classifier receives as input the output of the matcher, i.e. the set of matched token sequences T_{b,m}, weighted as a function of the ed. These input values are used to construct a sub-tree of the matcher, starting from which the classifier computes the class of the given document.

The set of predefined classes is constituted by all the products (b, m) inserted in the KB. The classifier selects a class from a subset obtained from the input set of matched sequences T_{b,m} (there is one class for each matched sequence).

Fig. 4. An example of KB set and thesaurus used to build the matcher.

Fig. 5. Role of edit distance and position distance during the matching process. In case of tokens having equal ed measure (ed(t_{i+1}, d_{j+1}) = ed(t_{i+1}, d_s) = ed(t_{i+1}, d_r)), the token d_{j+1} with minimum pd is selected.

Position and edit distances are evaluated to decide class assignment: the classifier starts from the leaves of the sub-tree and at each step compares each node t_{ij} with its parent t_{i−1} using the following rules:

– select the node t_{ij} having minimum pd(t_{ij}, t_{i−1}) for each j, or with minimum average pd computed through all the seen tree nodes;
– in case of a node t_{ij} with multiple matches associated with its parent t_{i−1}, with identical minimum ed, the input token t_{i−1} with minimum pd will be selected (see example in Fig. 6);
– between two nodes t_{ij} with identical weight (ed + pd) in the same layer i, select the one with the longer path starting from the leaf node;
– if all the previous rules find no differences between two nodes t_{ij} and their parent t_{i−1}, then select the one with minimum pd computed by the matcher.

Fig. 6. Role of edit distance and position distance during the classification process. In case of tokens having equal ed measure (ed(t_{i−1}, d_k) = ed(t_{i−1}, d_r) = ed(t_{i−1}, d_s)), the token d_s with minimum pd is selected.

3 Automatic Thesaurus Building

The accuracy strongly depends on the completeness of the thesaurus. Unfortunately, thesaurus maintenance is an expensive process. The present work proposes a neural adaptive tool able to support thesaurus updating.

The main idea of our solution is to automatically induce from examples general rules able to identify the presence of synonyms within the sequences of tokens produced by the matcher. These rules, difficult to hand-craft and define explicitly, are obtained adaptively using neural learning. In particular, a Multilayer Perceptron (MLP) [5] is trained to receive as input the following types of features from the sequence of matched tokens:

– edit distance between the tokens (t^b_1, t^b_2, ..., t^m_1, t^m_2, ...) of the KB and the tokens found in the document;
– number of characters between two consecutive tokens found in the document;
– typology of the found tokens (word or number);
– bits that identify the presence of each token.

As illustrated in Fig. 7, each pattern is divided into four groups of P features, where P represents the maximum number of tokens obtainable from a product (b, m).

The output pattern identifies a synonym S̄ of S, where S is a particular sequence of tokens extracted from T_{b,m}, and S̄ is a different sequence of tokens extracted from T_{b,m} or from the sequence of matched tokens of the document. Each output pattern has a dimension equal to 3P. The output of the trained neural network is used to add a new item S = S̄ to the thesaurus: when an output neuron has a value greater than a threshold, the corresponding token is considered part of a synonym.
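A minimal sketch of how the four P-sized feature groups might be assembled into one MLP input pattern. The group layout follows the description above, but the function name, the zero padding, and the example data are our assumptions, not details from the paper.

```python
P = 4  # illustrative: max number of tokens obtainable from a product (b, m)

def build_input_pattern(eds, gaps, types, present):
    """Concatenate the four P-sized feature groups: edit distances,
    character gaps between consecutive tokens, token typologies
    (0 = word, 1 = number), and presence bits."""
    def pad(xs):
        # Pad each group to length P with zeros (an assumption of this sketch).
        return (list(xs) + [0] * P)[:P]
    return pad(eds) + pad(gaps) + pad(types) + pad(present)

x = build_input_pattern(eds=[0, 1], gaps=[1], types=[0, 1], present=[1, 1])
# len(x) == 4 * P == 16
```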

4 Experiments

The aim of these experiments was to measure the classification effectiveness in terms of precision and recall, and to measure the contribution of neural thesaurus updating within the overall strategy.

Fig. 7. An example of input/output pattern construction for the MLP model.

4.1 Text Collection

The document collection is taken from Shoppydoo (http://www.shoppydoo.it), a price comparison service that allows users to compare prices and models of commercial products.

Two experiments were conceived and conducted in the field of price comparison services. Two product categories were identified: cell phones and digital cameras. Three specific brands were considered in our set of couples (b, m) in the KB for each category (Nokia, Motorola, Sony for cell phones and Canon, Nikon, Samsung for digital cameras). The total number of documents collected for the cell-phone category was 1315, of which 866 were associated with one of the three identified brands. The number of documents belonging to the digital-camera category was 2712, of which 1054 were associated with one of the three identified brands. The remaining documents, belonging to brands different from those considered in the experiment, must be classified as not relevant.

4.2 Evaluation Metrics

Performance is measured by recall, precision and F-measure. Let us assume a collection of N documents. Suppose that in this collection there are n documents relevant to the extraction target, and that the system matches m documents, a of which are relevant. The recall, R, is then given by

R = a/n    (1)

and the precision, P, is given by

P = a/m    (2)

One way of looking at recall and precision is in terms of a 2×2 contingency table (see Table 1).

            Relevant   Not-relevant   Total
Matched        a            b         a+b = m
Not-matched    c            d         c+d = N−m
Total        a+c = n       b+d        a+b+c+d = N

Overall accuracy (OA): (a+d)/N

Table 1. A contingency table analysis of precision and recall.

Another measure used to evaluate information extraction, combining recall and precision into a single value, is the F-measure Fα, defined as follows:

Fα = 1 / (α(1/P) + (1−α)(1/R))    (3)

where α is a weight for calibrating the relative importance of recall versus precision [6].
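Equations (1)-(3) can be computed directly from the contingency-table counts; the sketch below uses a, b, c as defined in Table 1 (the function name is ours).

```python
def precision_recall_f(a, b, c, alpha=0.5):
    """Compute recall, precision and F-measure from the contingency table:
    a = relevant matched, b = not-relevant matched, c = relevant not matched."""
    n = a + c                    # relevant documents
    m = a + b                    # matched documents
    recall = a / n               # Eq. (1)
    precision = a / m            # Eq. (2)
    f = 1.0 / (alpha / precision + (1 - alpha) / recall)  # Eq. (3)
    return recall, precision, f

# Cell-phone experiment without thesaurus (Table 2b): a=755, b=0, c=111
# -> recall ~ 0.8718, precision = 1.0, F(alpha=0.5) ~ 0.9315
```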

4.3 Results

The results of the two experiments are summarized in Tables 2 and 3. We started with an empty thesaurus that was then populated with the support of the neural network. For both datasets we obtained precision and recall equal to 100% after adding new synonyms to the thesaurus to resolve all the cases not perfectly matched.

(a) With thesaurus:

            Relevant   Not-relevant   Total
Matched       866           0          866
Not-matched     0         449          449
Total         866         449         1315

Recall      100%
Precision   100%
Fα=0.5      100%
OA          100%

(b) Without thesaurus:

            Relevant   Not-relevant   Total
Matched       755           0          755
Not-matched   111         449          560
Total         866         449         1315

Recall      87.18%
Precision   100%
Fα=0.5      93.15%
OA          91.56%

Table 2. Evaluation metrics for the problem 'cell-phone' with thesaurus (a) and without thesaurus (b).

(a) With thesaurus:

            Relevant   Not-relevant   Total
Matched      1054           0         1054
Not-matched     0        1658         1658
Total        1054        1658         2712

Recall      100%
Precision   100%
Fα=0.5      100%
OA          100%

(b) Without thesaurus:

            Relevant   Not-relevant   Total
Matched       963           0          963
Not-matched    91        1658         1749
Total        1054        1658         2712

Recall      91.37%
Precision   100%
Fα=0.5      95.49%
OA          96.64%

Table 3. Evaluation metrics for the problem 'digital-camera' with thesaurus (a) and without thesaurus (b).

5 Conclusions and Future Works

The present work tested a system that makes use of IE and classification in the context of a price comparison service. The approach proved highly accurate, but it requires the assistance of an expert during the construction of the KB and the corresponding thesaurus.

Future work will extend the present solution, including a tool for building the KB by automatically extracting unknown models of a product from a document.

References

1. Chang, C.H., Hsu, C.N., Lui, S.C.: Automatic information extraction from semi-structured web pages by pattern discovery. Decis. Support Syst. 35(1) (April 2003) 129–147
2. Muslea, I.: Extraction patterns for information extraction tasks: A survey. In Califf, M.E., ed.: Papers from the Sixteenth National Conference on Artificial Intelligence (AAAI-99) Workshop on Machine Learning for Information Extraction, Orlando, FL, AAAI Press (July 1999)
3. Chang, C.H., Kayed, M., Girgis, M.R., Shaalan, K.F.: A survey of web information extraction systems. IEEE Transactions on Knowledge and Data Engineering 18(10) (2006) 1411–1428
4. Damerau, F.J.: A technique for computer detection and correction of spelling errors. Communications of the Association for Computing Machinery 7(3) (1964) 171–176
5. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning internal representations by error propagation. (1986) 318–362
6. Jackson, P., Moulinier, I.: Natural Language Processing for Online Applications: Text Retrieval, Extraction, and Categorization (Natural Language Processing, 5). John Benjamins Publishing Co (June 2002)
