Ontologies and Information Extraction
CoMoTo: An Ontology-Based Context Modeling Tool
Xu Zhaohui, Wu Gang
School of Software, Shanghai Jiao Tong University, Shanghai 200240
Abstract: Context awareness has been a research focus of pervasive computing in recent years, and suitable context modeling methods and tools are the foundation for realizing it. This paper adopts an ontology-based approach to context modeling and, aiming at both generality and ease of use, presents an ontology-based context modeling tool (CoMoTo). The paper discusses a method for modeling context at multiple levels, describes the analysis and design of the tool, and illustrates its modeling capability with a case study.
Keywords: context modeling; ontology; modeling tool
1. Introduction
Pervasive computing is human-centered computing. One of its key characteristics is context awareness, that is, the ability to react dynamically and appropriately as the context in the surrounding environment changes.
To describe and manage context well, a unified understanding of context is needed. Many publications define context, and the definitions differ with the authors' perspectives. The definition proposed by Dey et al. [1] is among the most representative and general: context is any information that can be used to characterize the situation of an entity, where an entity may be a person, a place, or any object relevant to the interaction between the user and the application, including the user and the application themselves. The notion of context used in this paper follows that description.
In context-aware applications, context information can be obtained through many channels (for example sensors, storage, and manual input). This heterogeneity of sources makes describing, managing, and exploiting context information a complex process. A context model provides context-aware applications with an abstract description of context, making these complex operations transparent to them and thereby greatly simplifying the construction of context-aware applications. A well-structured context model is therefore key to building a context-aware system [2].
As a means of description, an ontology specifies a shared conceptual model explicitly, formally, and in a standardized way [3]. Its expressive power allows it to describe complex contexts, and the rich semantics it provides make reasoning over context possible. Owing to these advantages, researchers widely use ontologies to model context information. Several tools exist for building ontologies; Protégé, developed at Stanford University, is an excellent example. However, because Protégé targets ontology construction for every domain, its very generality makes it complex to operate, and modelers need considerable specialized knowledge.
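The kind of ontology-based context model discussed above can be made concrete with a small RDF fragment. The sketch below is illustrative only and does not reproduce CoMoTo's actual vocabulary; the class and property names (ContextEntity, Person, Room, locatedIn) are assumptions made for the example.

```python
from rdflib import Graph, Namespace, RDF, RDFS, Literal

# Hypothetical namespace; CoMoTo's real vocabulary is not given in the paper excerpt.
CTX = Namespace("http://example.org/context#")

g = Graph()
g.bind("ctx", CTX)

# A tiny upper level: context entities specialized into persons and rooms.
g.add((CTX.ContextEntity, RDF.type, RDFS.Class))
for cls in (CTX.Person, CTX.Room):
    g.add((cls, RDF.type, RDFS.Class))
    g.add((cls, RDFS.subClassOf, CTX.ContextEntity))

# A property linking a person to the room they are currently in.
g.add((CTX.locatedIn, RDF.type, RDF.Property))
g.add((CTX.locatedIn, RDFS.domain, CTX.Person))
g.add((CTX.locatedIn, RDFS.range, CTX.Room))

# One concrete context fact, e.g. coming from a sensor or manual input.
g.add((CTX.alice, RDF.type, CTX.Person))
g.add((CTX.room200, RDF.type, CTX.Room))
g.add((CTX.alice, CTX.locatedIn, CTX.room200))
g.add((CTX.room200, RDFS.label, Literal("Room 200, School of Software")))

print(g.serialize(format="turtle"))
```

Separating the generic upper level (ContextEntity and its properties) from the concrete facts is what allows a tool to offer hierarchical, reusable context models while still reasoning over instance data.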
Teaching Design for Artificial Intelligence (English Version, Fifth Edition)

Introduction
Artificial Intelligence (AI) is an interdisciplinary field that has attracted increasing attention over the past few decades. As AI technologies and applications continue to evolve, it is essential to develop effective teaching strategies and materials to prepare our students for the future. This document presents a proposed teaching design for the English version of the fifth edition of Artificial Intelligence, with the aim of creating a stimulating and engaging learning environment for students to master the fundamentals of AI and apply these concepts to real-world problems.

Audience
The proposed teaching design is aimed at undergraduate students who have a basic understanding of programming and data structures. An introductory course in computer science or engineering is recommended, but not mandatory. Students with a background in mathematics, statistics, or other related fields may also benefit from this course.

Prerequisites
Before starting this course, students should be familiar with the following:
• Programming languages such as Python, Java, or C++
• Data structures such as arrays, lists, and trees
• Basic algorithms such as sorting and searching
• Linear algebra and calculus

Course Content
Week 1: Introduction to AI
• Definition of AI
• Brief history of AI
• Applications of AI
• Key concepts in AI
• Overview of the course

Week 2: Problem Solving and Search
• Problem-solving strategies
• State-space search algorithms
• Heuristic search algorithms
• Uninformed and informed search algorithms
• Adversarial search

Week 3: Knowledge Representation and Reasoning
• Propositional logic
• First-order logic
• Rule-based systems
• Frames and semantic networks
• Ontologies and knowledge graphs

Week 4: Planning and Decision Making
• Planning methods
• Decision trees
• Decision networks
• Utility theory
• Game theory

Week 5: Machine Learning
• Supervised learning
• Unsupervised learning
• Reinforcement learning
• Neural networks
• Deep learning

Week 6: Natural Language Processing
• Text processing
• Language modelling
• Information extraction
• Sentiment analysis
• Machine translation

Week 7: Robotics and Perception
• Robot architectures
• Sensing and perception
• Robotics applications
• Simultaneous localization and mapping (SLAM)
• Path planning and control

Week 8: Ethics and Social Implications of AI
• Bias in AI
• Fairness in AI
• Privacy and security concerns
• Misuse and abuse of AI
• Future of AI

Teaching Methods
The proposed teaching design combines traditional lectures with interactive classroom activities to promote active learning and engagement. Lectures will cover the key concepts and theories of AI, while classroom activities will involve problem-solving exercises, group discussions, and case studies. In addition, students will be expected to complete several coding assignments and a final project. The coding assignments will provide hands-on experience with AI algorithms and techniques, while the final project will give students the opportunity to apply these concepts to a real-world problem and present their findings to the class.

Assessment
Assessment will be based on the following:
• Coding assignments (40%)
• Final project (30%)
• Midterm exam (20%)
• Classroom participation (10%)
Students will receive feedback on their coding assignments and final project throughout the course, and there will be several opportunities for peer review and group feedback.

Conclusion
The proposed teaching design for the English version of the fifth edition of Artificial Intelligence aims to provide students with a solid foundation in the key concepts and theories of AI, as well as hands-on experience with coding and real-world applications.
By combining traditional lectures with interactive classroom activities, students will be engaged in their learning and better prepared for the future of AI.
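As a concrete illustration of the kind of coding assignment the Week 2 material (state-space and uninformed search) could support, here is a minimal breadth-first search sketch. The graph, function, and state names are hypothetical and not taken from the textbook.

```python
from collections import deque

def breadth_first_search(graph, start, goal):
    """Return a path from start to goal in an explicit state graph, or None.

    graph: dict mapping each state to a list of successor states.
    """
    frontier = deque([[start]])          # queue of partial paths
    visited = {start}
    while frontier:
        path = frontier.popleft()
        state = path[-1]
        if state == goal:
            return path
        for successor in graph.get(state, []):
            if successor not in visited:
                visited.add(successor)
                frontier.append(path + [successor])
    return None

# Hypothetical example graph
example_graph = {
    "S": ["A", "B"],
    "A": ["C"],
    "B": ["C", "G"],
    "C": ["G"],
}
print(breadth_first_search(example_graph, "S", "G"))  # ['S', 'B', 'G']
```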
Knowledge Management: An Explanation of the Term (in English)

Knowledge Management (KM) is a discipline that encompasses a range of strategies and practices aimed at identifying, capturing, organizing, storing, retrieving, and utilizing an organization's knowledge assets to foster innovation, increase efficiency, and improve decision-making capabilities. It involves a systematic and structured approach to managing knowledge within an organization to ensure that valuable knowledge is shared and leveraged effectively.

At its core, KM recognizes that knowledge is a critical asset that holds the potential to drive competitive advantage and business success. It is not limited to explicit knowledge that is codified and easily transferable, but also encompasses tacit knowledge, which resides in an individual's experiences, insights, and intuition. By harnessing both forms of knowledge, organizations can unlock hidden potential and facilitate effective collaboration across teams and departments.

One of the key components of KM is knowledge creation. This involves the generation of new knowledge through various means such as research and development, experimentation, or simply by leveraging the collective intelligence of the organization. Innovation, creativity, and continuous learning play significant roles in this process. Organizations that prioritize knowledge creation foster an environment that encourages curiosity, experimentation, and risk-taking, allowing for the discovery of new insights and opportunities.

Another important aspect of KM is knowledge capture. This refers to the process of identifying and capturing knowledge from various sources within the organization, including individuals, documents, databases, and even external networks. Captured knowledge is then organized and stored in a format that is easily accessible and searchable. This allows for efficient retrieval of information when needed, enabling employees to quickly find and apply relevant knowledge to their work.

Knowledge organization and categorization are critical to effective KM. Information and knowledge need to be classified and structured in a way that reflects the organization's goals, processes, and workflows. This often involves the use of taxonomies, ontologies, and metadata to categorize and tag knowledge assets, making them easily discoverable and retrievable. Additionally, knowledge management systems and technologies, such as intranets, databases, and content management systems, are used to facilitate the organization and storage of knowledge.

Knowledge sharing and dissemination play a fundamental role in KM. Organizations must establish channels and platforms that enable employees to share their knowledge and experiences with others. This can take the form of formal training programs, communities of practice, knowledge sharing sessions, or even social collaboration tools. By sharing knowledge, organizations benefit from collective insights, avoid reinventing the wheel, and foster a culture of learning and collaboration.

Beyond sharing, KM also emphasizes the importance of knowledge utilization. It is not enough to simply hoard knowledge; organizations must actively encourage and facilitate the application of knowledge in decision-making processes and problem-solving activities. This requires the integration of knowledge into various business processes and systems, ensuring that it is available and utilized at the point of need.

Moreover, KM also encompasses knowledge evaluation and improvement.
Organizations need to continuously assess the relevance, accuracy, and effectiveness of their knowledge assets. This involves monitoring usage patterns, soliciting feedback from users, and regularly updating and improving knowledge resources to ensure their continued usefulness and relevance.

In conclusion, Knowledge Management is a multifaceted discipline aimed at maximizing the value and impact of an organization's knowledge assets. It involves a range of strategies and practices, from knowledge creation and capture to organization, sharing, utilization, and evaluation. By adopting effective KM practices, organizations can foster a culture of learning, innovation, and collaboration, leading to improved performance and sustained success.
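To make the capture–organize–retrieve cycle described above concrete, here is a minimal sketch of a metadata-tagged knowledge repository with tag-based retrieval. It is illustrative only; the class and field names are hypothetical and not tied to any particular KM system.

```python
from dataclasses import dataclass, field

@dataclass
class KnowledgeAsset:
    """A captured piece of knowledge with metadata for later retrieval."""
    title: str
    content: str
    tags: set[str] = field(default_factory=set)   # taxonomy / metadata labels
    source: str = "unknown"                        # e.g. expert interview, document

class KnowledgeRepository:
    def __init__(self):
        self._assets: list[KnowledgeAsset] = []

    def capture(self, asset: KnowledgeAsset) -> None:
        """Store an asset so it can be found again (knowledge capture)."""
        self._assets.append(asset)

    def find_by_tag(self, tag: str) -> list[KnowledgeAsset]:
        """Retrieve assets categorized under a given taxonomy tag."""
        return [a for a in self._assets if tag in a.tags]

repo = KnowledgeRepository()
repo.capture(KnowledgeAsset(
    title="Onboarding checklist",
    content="Steps a new team member follows in the first week.",
    tags={"hr", "process"},
    source="team wiki",
))
repo.capture(KnowledgeAsset(
    title="Incident postmortem template",
    content="Structure for documenting lessons learned after outages.",
    tags={"process", "operations"},
))
for asset in repo.find_by_tag("process"):
    print(asset.title)
```

In a real KM system the flat tag set would typically be replaced by a controlled taxonomy or ontology, but the capture/organize/retrieve pattern is the same.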
Ingénierie des Connaissances — Research Report No 03.05 — May 2003
Ontology enrichment and indexing process
E. Desmontils, C. Jacquin, L. Simon
Institut de Recherche en Informatique de Nantes, 2, rue de la Houssinière, B.P. 92208, 44322 Nantes Cedex 3
desmontils, jacquin, simon@irin.univ-nantes.fr

Abstract
Within the framework of Web information retrieval, this paper presents some methods to improve an indexing process which uses terminology-oriented ontologies specific to a field of knowledge. Techniques to enrich ontologies using specialization processes are proposed in order to manage pages which have to be indexed but which are currently rejected by the indexing process. This ontology specialization process is supervised, so as to offer the expert of the domain a decision-making aid concerning his or her field of application. The proposed enrichment is based on some heuristics to manage the specialization of the ontology, and it can be controlled using a graphic tool for validation.
Categories and Subject Descriptors: H.3.1 [Content Analysis and Indexing]
General Terms: Abstracting methods, Dictionaries, Indexing methods, Linguistic processing, Thesauruses
Additional Key Words and Phrases: Ontology, Enrichment, Supervised Learning, Thesaurus, Indexing Process, Information Retrieval in the Web

1 Introduction
Search engines like Google or Altavista help us to find information on the Internet. These systems use a centralized database to index information and a simple keyword-based requester to reach information. With such systems, the recall is often rather convenient; conversely, the precision is weak. Indeed, these systems rarely take into account the content of documents in order to index them. Two major approaches exist for taking the semantics of documents into account. The first approach concerns annotation techniques based on the use of ontologies: documents are manually annotated using ontologies, and the annotations are then used to retrieve information from the documents. They are rather dedicated to request/answer systems (KAON, ...). The second approach for taking Web document content into account is information retrieval techniques based on the use of domain ontologies [8]. These are usually dedicated to retrieving documents which concern a specific request. For this type of system, the index structure of the Web pages is given by the ontology structure. Thus, the document indexes belong to the concept set of the ontology. A problem encountered is that many concepts extracted from documents which belong to the domain are not present in the domain ontology. Indeed, the domain coverage of the ontology may be too small.

In this paper, we first present the general indexing process based on the use of a domain ontology (Section 2). Then, we present an analysis of experimental results which leads us to propose improvements of the indexing process based on ontology enrichment; they make it possible to increase the rate of indexed concepts (Section 3). Finally, we present a visualisation tool which enables an expert to control the indexing process and the ontology enrichment (Section 4).
2 Overview of the indexing process
The main goal is to build a structured index of Web pages according to an ontology. This ontology provides the index structure. Our indexing process can be divided into four steps (figure 1) [8]:
1. For each page, a flat index of terms is built. Each term of this index is associated with its weighted frequency. This coefficient depends on each HTML marker that describes each term occurrence.
2. A thesaurus makes it possible to generate all candidate concepts which can be labeled by a term of the previous index. In our implementation, we use the Wordnet thesaurus ([14]).
3. Each candidate concept of a page is studied to determine its representativeness of this page content. This evaluation is based on its weighted frequency and on the relations with the other concepts. It makes it possible to choose the best sense (concept) of a term in relation to the context. Therefore, the more a concept has strong relationships with other concepts of its page, the more this concept is significant in its page. This contextual relation minimizes the role of the weighted frequency by growing the weight of the strongly linked concepts and by weakening the isolated concepts (even those with a strong weighted frequency).
4. Among these candidate concepts, a filter is produced via the ontology and the representativeness of the concepts: a selected concept is a candidate concept that belongs to the ontology and has a high representativeness of the page content (the representativeness exceeds a threshold of sensitivity). Next, the pages which contain such a selected concept are assigned to this concept in the ontology.

Some measures are evaluated to characterize the indexing process. They determine the adequacy between the Web site and the ontology. These measures take into account the number of pages selected by the ontology (the Ontology Cover Degree or OCD), the number of concepts included in the pages (the Direct Indexing Degree or DID and the Indirect Indexing Degree or IID), etc. The global evaluation of the indexing process (OSAD: Ontology-Site Adequacy Degree) is a linear combination of the previous measures (weighted means) among different thresholds from 0 to 1. This measure enables us to quantify the "quality" of our indexing process (see [8] for more details).

Table 1: Numbers of processed candidate concepts, concepts found (or not) in Wordnet, and concepts valid and indexed with a representativeness degree greater than 0.3.

Table 2: Results of the indexing process concerning 1000 pages of the site of the CSE department of the University of Washington (with a threshold of 0.3): initial indexing process versus indexing with the pruning process.

This phenomenon is due to the enrichment algorithm, which authorizes the systematic addition of any representative concept (i.e., above the threshold of representativeness) to the ontology of the domain. The second enrichment method, which operates with pruning rules (see sub-section 3.3), adds only 136 concepts to the ontology. Also let us notice that this method keeps the rate of coverage (98.86%) of the enrichment method without pruning. Indeed, during this pruning phase, some concepts which do not index enough pages (according to the threshold) are removed from the ontology. Their pages are then linked to concepts that subsume them. Next, the number of concepts that index pages grows. This is not surprising, because we only add concepts indexing a minimal number of pages. Finally, the rate of accepted concepts goes from 0.62% to 11.5%! So, our process uses more of the available concepts that the pages contain.
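A minimal sketch of the selection step of the indexing process and of the pruning rule described above, under assumed data structures (representativeness scores, the ontology concept set, and the subsumption relation are plain Python objects; names such as `min_pages` are hypothetical and not taken from the report):

```python
# Step 4 of the indexing process: keep candidate concepts that belong to the
# ontology and whose representativeness exceeds a sensitivity threshold.
def select_concepts(candidates, ontology_concepts, threshold=0.3):
    """candidates: dict mapping concept -> representativeness degree (0..1)."""
    return {c: r for c, r in candidates.items()
            if c in ontology_concepts and r >= threshold}

# Enrichment with pruning: a newly added concept is kept only if it indexes
# at least `min_pages` pages; otherwise its pages are re-attached to the
# concept that subsumes it (its parent in the ontology).
def prune_enrichment(added_concepts, pages_by_concept, parent_of, min_pages=2):
    kept, reassigned = {}, {}
    for concept in added_concepts:
        pages = pages_by_concept.get(concept, [])
        if len(pages) >= min_pages:
            kept[concept] = pages
        else:
            parent = parent_of[concept]
            reassigned.setdefault(parent, []).extend(pages)
    return kept, reassigned

# Hypothetical usage
candidates = {"dog": 0.45, "canine": 0.20, "animal": 0.80}
print(select_concepts(candidates, {"dog", "animal"}))   # {'dog': 0.45, 'animal': 0.8}
```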
4 OntologyManager: a user interface for ontology validation
A tool which makes it possible to control the ontology enrichment has been developed (see Figure 7). This tool, implemented in the Java language, proposes a tree-like view of the ontology. On the one hand, it proposes a general view of the ontology which enables the expert to navigate easily through the ontology; on the other hand, it proposes a more detailed view which informs the expert about the coefficients associated with concepts and pages. Notice that, in this last case, concepts are represented with different colours according to their associated coefficient, so a human expert can easily compare them. Moreover, some parts of the ontology graph can also be masked in order to focus the expert's attention on a specific part of the ontology. We are now developing a new functionality for the visualisation tool. It enables the user to have a hyperbolic view of the ontology graph (like the OntoRama tool [9] or H3Viewer [16]). In this context, the user can work with bigger ontologies.

The user interface also makes it possible to visualise the indexed pages (see Figure 8) and the ontology enrichment (by a colour system which can be customized). It will be easy for the human expert to validate or invalidate the added concepts, to obtain the indexing rate of a particular concept, and to dynamically reorganize the ontology (by a drag-and-drop system). The concept validation process is divided into 4 steps defining 4 classes of concepts:
• bronze concepts: concepts proposed by our learning process and accepted by an expert just "to see";
• silver concepts: concepts accepted by the expert for all indexing processes he/she does;
• gold concepts: concepts proposed by an expert to his/her community for testing;
Ontologies and the Configuration of Problem-Solving Methods
Rudi Studer1, Henrik Eriksson2, John Gennari3, Samson Tu3, Dieter Fensel1,4, and Mark Musen3
1 Institute AIFB, University of Karlsruhe, D-76128 Karlsruhe
e-mail: studer@aifb.uni-karlsruhe.de
2 Department of Computer and Information Science, Linköping University, S-58183 Linköping
e-mail: her@ida.liu.se
3 Section on Medical Informatics, Knowledge Systems Laboratory, Stanford University School of Medicine, Stanford, CA 94305-5479, USA
e-mail: {gennari,tu,musen}@
4 Department SWI, University of Amsterdam, NL-1018 WB Amsterdam
e-mail: dieter@swi.psy.uva.nl

Abstract
Problem-solving methods model the problem-solving behavior of knowledge-based systems. The PROTÉGÉ-II framework includes a library of problem-solving methods that can be viewed as reusable components. For developers to use these components as building blocks in the construction of methods for new tasks, they must configure the components to fit with each other and with the needs of the new task. As part of this configuration process, developers must relate the ontologies of the generic methods to the ontologies associated with other methods and submethods. We present a model of method configuration that incorporates the use of several ontologies in multiple levels of methods and submethods, and we illustrate the approach by providing examples of the configuration of the board-game method.

1. Introduction
Problem-solving methods for knowledge-based systems capture the problem-solving behavior required to perform the system's task (McDermott, 1988). Because certain tasks are common (e.g., planning and configuration), and are approachable by the same problem-solving behavior, developers can reuse problem-solving methods in several applications ((Chandrasekaran and Johnson, 1993), (Breuker and Van de Velde, 1994)). Thus, a library of reusable methods would allow the developer to create new systems by selecting, adapting and configuring such methods. Moreover, development tools, such as PROTÉGÉ-II (Puerta et al., 1992), can support the developer in the reuse of methods.

Problem-solving methods are abstract descriptions of problem-solving behavior. The development of problem solvers from reusable components is analogous to the general approach of software reuse. In knowledge engineering as well as software engineering, developers often duplicate work on similar software components, which are used in different applications. The reuse of software components across several applications is a potentially useful technique that promises to improve the software-development process (Krueger, 1992). Similarly, the reuse of problem-solving methods can improve the quality, reliability, and maintainability of the software (e.g., by the reuse of quality-proven components). Of course, software reuse is only financially beneficial in the end if the indexing and configuration overhead is less than the effort that is needed to create the required component several times from scratch.

Although software reuse is an appealing approach theoretically, there are serious practical problems associated with reuse. Two of the most important impediments to software reuse are (1) the problem of finding reusable components (e.g., locating appropriate components in a library), and (2) the problem of adapting reusable components to their task and to their environment. The first problem is sometimes called the indexing problem, and the second problem is sometimes called the configuration problem.
These problems are also present in the context of reusable problem-solving methods. In the remainder of this paper we shall focus on the configuration problem.

Method configuration is a difficult task, because the output of one method may not correspond to the input of the next method, and because the method may have subtasks, which are solved by submethods offering a functionality that is different from the assumptions of the subtask. Domain-independent methods use a method ontology, which, for example, might include concepts such as states, transitions, locations, moves, and constraints, whereas the user input and the (domain-specific) knowledge-base use a domain-oriented ontology, which might include concepts such as office workers, office rooms, and room-assignment constraints. Thus, a related issue to the configuration problem is the problem of mappings between ontologies (Gennari et al., 1994).

In this paper, we shall address the configuration problem. The problems of how to organize a library of problem-solving methods and of how to select an appropriate method from such a library are beyond the scope of the paper. We shall introduce an approach for handling method ontologies when configuring a method from more elementary submethods. In our framework, configuration of a method means selecting appropriate submethods for solving the subtasks of which a method is composed. We introduce the notion of a subtask ontology in order to be able (i) to make the ontology of a method independent of the submethods that are chosen to solve its subtasks, and (ii) to specify how a selected submethod is adapted to its subtask environment. Our approach supports such a configuration process on multiple levels of subtasks and submethods. Furthermore, the access of domain knowledge is organized in such a way that no mapping of domain knowledge between the different subtask/submethod levels is required.

The approach which is described in this paper provides a framework for handling some important aspects of the method configuration problem. However, our framework does not provide a complete solution to this problem. Furthermore, the proposed framework needs in the future a thorough practical evaluation by solving various application tasks.

The rest of this paper is organized as follows. Section 2 provides a background to PROTÉGÉ-II, the board-game method, and the MIKE approach. Section 3 introduces the notions of method and subtask ontologies and discusses their relationships with the interface specification of methods and subtasks, respectively. Section 4 analyses the role of ontologies for the configuration of problem-solving methods and presents a model for configuring a problem-solving method from more elementary submethods that perform the subtasks of the given problem-solving method. In Sections 5 and 6, we discuss the results, and draw conclusions, respectively.

2. Background: PROTÉGÉ-II, the Board-Game Method, and MIKE
In this section, we shall give a brief introduction into PROTÉGÉ-II and MIKE (Angele et al., 1996b) and will describe the board-game method (Eriksson et al., 1995) since this method will be used to illustrate our approach.

2.1 Method Reuse for PROTÉGÉ-II
PROTÉGÉ-II (Puerta et al., 1992, Gennari et al., 1994, Eriksson et al., 1995) is a methodology and suite of tools for constructing knowledge-based systems.
The PROTÉGÉ-II methodology emphasizes the reuse of components, including problem-solving methods, ontologies, and knowledge-bases.

PROTÉGÉ-II allows developers to reuse library methods and to generate custom-tailored knowledge-acquisition tools from ontologies. Domain experts can then use these knowledge-acquisition tools to create knowledge bases for the problem solvers. In addition to developing tool support for knowledge-based systems, PROTÉGÉ-II is also a research project aimed at understanding the reuse of problem-solving methods, and at alternative approaches to reuse. Naturally, the configuration of problem-solving methods for new tasks is a critical step in the reuse process, and an important research issue for environments such as PROTÉGÉ-II.

The model of reuse for PROTÉGÉ-II includes the notion of a library of reusable problem-solving methods (PSMs) that perform tasks. PROTÉGÉ-II uses the term task to indicate the computations and inferences a method should perform in terms of its input and output. (Note that the term task is used sometimes in other contexts to indicate the overall role of an application system, or the application task.) In PROTÉGÉ-II problem-solving methods are decomposable into subtasks. Other methods, sometimes called submethods, can perform these subtasks. Primitive methods that cannot be decomposed further are called mechanisms. This decomposition of tasks into methods and mechanisms is shown graphically in Figure 2-1.

Figure 2-1: Method-subtask decomposition in PROTÉGÉ-II

Submethods and mechanisms should be reusable by developers as they build a solution to a particular problem. Thus, the developer should be able to select a generic method that performs a task, and then configure this method by selecting and substituting appropriate submethods and mechanisms to perform the method's subtasks. Note that because the input-output requirements of tasks and subtasks often differ from the input-output assumptions of preexisting methods and mechanisms, we must introduce mappings among the task (or subtask) and method (or submethod) ontologies.

PROTÉGÉ-II uses three major types of ontologies for defining various aspects of the knowledge-based system: domain, method, and application ontologies. Domain ontologies model concepts and relationships for a particular domain of interest. Ideally, these ontologies should be partitioned so as to separate those parts that may be more dependent on the problem-solving method. Method ontologies model concepts related to problem-solving methods, including input and output assumptions. To enable reuse, method ontologies should be domain-independent. In most situations, reusable domain and method ontologies by themselves are insufficient for a complete application system. Thus, PROTÉGÉ-II uses an application ontology that combines domain and method ontologies for a particular application. Application ontologies are used to generate domain-specific, method-specific knowledge-acquisition tools.

The focus of this paper is on the configuration of problem-solving methods and submethods. Thus, we will describe method ontologies (see Section 3), rather than domain or application ontologies, and the mappings among tasks and methods that are necessary for method configuration (see Section 4).

2.2 The Board-Game Method (BGM)
We shall use the board-game method ((Eriksson et al., 1995), (Fensel et al., 1996a)) as a sample method to illustrate method configuration in the PROTÉGÉ-II framework.
The basic idea behind the board-game method is that the method should provide a conceptual model of a board game where game pieces move between locations on a board (see Figure 2-2). The state of such a board game is defined by a set of assignments specifying which pieces are assigned to which locations. Developers can use this method to perform tasks that they can model as board-game problems.

Figure 2-2: The board-game method provides a conceptual model where pieces move between locations.

To configure the board-game method for a new game, the developer must define, among other things, the pieces, locations, moves, and the initial state and goal states of the game. The method operates by searching the space of legal moves, and by determining and applying the most promising moves until the game reaches a goal state. The major advantage of the board-game method is that the notion of a board game as the basis for method configuration makes the method convenient for the developer to reuse.

We have used the board-game method in different configurations to perform several tasks. Examples of such tasks are the towers-of-Hanoi, the cannibals-and-missionaries, and the Sisyphus room-assignment (Linster, 1994) problem. By modeling other types of tasks as board games, the board-game method can perform tasks beyond simple games. The board-game method can perform the Sisyphus room-assignment task, for instance, if we (1) model the office workers as pieces, (2) start from a state where all the workers are at a location outside the building, and (3) move the workers one by one to appropriate rooms under the room-assignment constraints.

2.3 The MIKE Approach
The MIKE approach (Model-based and Incremental Knowledge Engineering) (Angele et al., 1996b) aims at providing a development method for knowledge-based systems covering all steps from knowledge acquisition to design and implementation. As part of the MIKE approach the Knowledge Acquisition and Representation Language KARL (Fensel et al., 1996c), (Fensel, 1995) has been developed. KARL is a formal and operational knowledge modeling language which can be used to formally specify a KADS-like model of expertise (Schreiber et al., 1993). Such a model of expertise is split up into three layers: The domain layer contains the domain model with knowledge about concepts, their features, and their relationships. The inference layer contains a specification of the single inference steps as well as a specification of the knowledge roles which indicate in which way domain knowledge is used within the problem-solving steps. In MIKE three types of knowledge roles are distinguished: Stores are used as containers which provide input data to inference actions or collect output data generated by inference actions. Views and terminators are used to connect the (generic) inference layer with the domain layer: Views provide means for delivering domain knowledge to inference actions and for transforming the domain-specific terminology into the generic PSM-specific terminology. In an analogous way, terminators may be used to write the results of the problem-solving process back to the domain layer and thus to reformulate the results in domain-specific terms. The task layer contains a specification of the control flow for the inference steps as defined on the inference layer.

For the remainder of the paper, it is important to know that in KARL a problem-solving method is specified in a generic way on the inference and task layer of a model of expertise.
A main characteristic of KARL is the integration of object orientation into a logical framework. KARL provides classes and predicates for specifying concepts and relationships, respectively. Furthermore, classes are characterized by single- and multi-valued attributes and are embedded in an is-a hierarchy. For all these modeling primitives, KARL offers corresponding graphical representations. Finally, sufficient and necessary constraints, which have to be met by class and predicate definitions, may be specified using first-order formulae.

Currently, a new version of KARL is under development which, among other things, will provide the notion of a method ontology and primitives for specifying pre- and postconditions for a PSM (Angele et al., 1996a). Thus, this new version of KARL includes all the modeling primitives which are needed to formally describe the knowledge-level framework which shall be introduced in Sections 3 and 4. However, this formal specification is beyond the scope of this paper.

3. Problem-Solving Method Ontologies
When describing a PSM, various characteristic features may be identified, such as the input/output behavior or the knowledge assumptions on which the PSM is based (Fensel, 1995a), (Fensel et al., 1996b). In the context of this paper, we will consider a further characteristic aspect of a PSM: its ontological assumptions. These assumptions specify what kind of generic concepts and relationships are inherent for the given PSM. In the framework of PROTÉGÉ-II, these assumptions are captured in the method ontology (Gennari et al., 1994).

Subsequently, we define the notions of a method ontology and of a subtask ontology, and discuss the relationship between the method ontology and the subtask ontologies associated with the subtasks of which the PSM is composed. For that discussion we assume that a PSM comes with an interface specification that describes which generic external knowledge roles (Fensel, 1995b) are used as input and output. Each role includes the definition of concepts and relationships for specifying the terminology used within the role.

Fig. 3-1 shows the interface specification of the board-game method. We see, for instance, that knowledge about moves, preferences among moves, and applicability conditions for moves is provided by the input roles "Moves", "Preferred_Moves", and "Applic_Moves" (applicable moves), respectively; within the role "Moves" the concept "moves" is defined, whereas, for example, within the role "Preferred_Moves" the relationship "prefer_m" is defined, which specifies a preference relation between two moves for a given state (see Fig. 3-3).

Figure 3-1: The interface of the board-game method

Since the context in which the method will be used is not known in advance, one cannot specify which input knowledge is delivered from other tasks as output and which input knowledge has to be taken from the domain. Therefore, besides the output role "Solution", all other input roles are handled homogeneously (that is, as external input knowledge roles). The interface description determines in which way a method can be adapted to its calling environment: a subset of the external input roles will be used later on as domain views (that is, for defining the mapping between the domain-specific knowledge and the generic PSM knowledge).

3.1 Method Ontologies
We first consider the situation that a complete PSM is given as a building block in the library of PSMs.
In this case, a PSM comes with a top-level ontology, its method ontology, specifying all the generic concepts and relationships that are used by the PSM for providing its functionality. This method ontology is divided into two parts:
(i) Global definitions, which include all generic concept and relationship definitions that are part of the interface specification of the PSM (that is, the external input and output knowledge roles of the PSM, respectively). For each concept or relationship definition, it is possible to indicate whether it is used as input or as output (however, that does not hold for subordinate concept definitions, i.e. concepts that are just used as range restrictions of concept attributes, or for high-level concepts that are just used for introducing attributes which are inherited by subconcepts). Thus, the ontology specifies clearly which type of generic knowledge is expected as input, and which type of generic knowledge is provided as output.
(ii) Internal definitions, which specify all concepts and relationships that are used for defining the dataflow within the PSM (that is, they are defined within stores).
Within both parts, constraints can be specified for further restricting the defined terminology. It should be clear that the global definitions are exactly those definitions that specify the ontological assumptions that must be met for applying the PSM.

We assume that a PSM that is stored in the library comes with an ontology description at two levels of refinement. First, a compact representation is given that just lists the names of the concepts and relationships of which the ontology is composed. This compact representation also includes the distinction of global and internal definitions. It is used for providing an initial, not too detailed overview of the method ontology. Fig. 3-2 shows this compact representation of the board-game method ontology. We see that, for instance, "moves" is an input concept, "prefer_m" (preference of moves) is an input relationship, and "goal-states" is an output concept; "assignments" is an example of a subordinate concept which is used within state definitions, whereas "movable_objects" is an example of a high-level concept. Properties of "movable_objects" are for instance inherited by the concept "pieces" (see below). As we will see later on, the concept "current-states" is part of the internal definitions, since it is used within the board-game method for specifying the data flow between subtasks (see Section 3.2).

Figure 3-2: The compact representation of the ontology of the board-game method

Second, a complete specification of the method ontology is given. We use KARL for formally specifying such an ontology, which provides all concept and relationship definitions as well as all constraints. Fig. 3-3 gives a graphic KARL representation of the board-game method ontology (not including constraints). We can see that, for instance, a move defines a new location for a given piece (single-valued attribute "new_assign" with domain "moves" and range "assignments") or that the preference between moves is state dependent (relationship "prefer_m"). The attribute "assign" is an example of a multi-valued attribute since states consist of a set of assignments.

Figure 3-3: The graphic KARL representation of the board-game method ontology

When comparing the interface specification (Fig. 3-1) and the method ontology (Fig. 3-3) we can see that the union of the terminology of the external knowledge roles is equal to the set of global definitions found in the method ontology.

3.2 Subtask Ontologies
In general, within the PROTÉGÉ-II framework, a PSM is decomposed into several subtasks. Each subtask may be decomposed in turn again by choosing more elementary methods for solving it. We generalize this approach in the sense that, for trivial subtasks, we do not distinguish between the subtasks and the also trivial mechanisms for solving them. Instead, we use the notion of an elementary inference action (Fensel et al., 1996c). In the given context, such an elementary inference action may be interpreted as a "hardwired" mechanism for solving a subtask. Thus, for trivial subtasks, we can avoid the overhead that is needed for associating a subtask with its corresponding method (see below). That is, in general we assume that a method can be decomposed into subtasks and elementary inference actions.

When specifying a PSM, a crucial design decision is the decomposition of the PSM into its top-level subtasks. Since subtasks provide the slots where more elementary methods can be plugged in, the type and number of subtasks determine in which way a PSM can be configured from other methods. As a consequence, the adaptability of a PSM is characterized by the knowledge roles of its interface description, and by its top-level task decomposition.

For a complete description of the decomposition structure of a PSM, one also has to specify the interfaces of the subtasks and inference actions, as well as the data and control flow among these constituents. The interface of a subtask consists of knowledge roles, which are either external knowledge roles or (internal) stores, which handle the input/output from/to other subtasks and inference actions. Some of these aspects are described for the BGM in Figures 3-4 and 3-5, respectively.

Figure 3-4: The top-level decomposition of the board-game method into elementary inference actions and subtasks

Fig. 3-4 shows the decomposition of the board-game method into top-level subtasks and elementary inference actions. We can see two subtasks ("Apply_Moves" and "Select_Best_State") and three elementary inference actions ("Init_State", "Check_Goal_State", "Transfer_Solution"). This decomposition structure specifies clearly that the board-game method may be configured by selecting appropriate methods for solving the subtasks "Apply_Moves" and "Select_Best_State".

Figure 3-5: The interface of the subtask "Apply_Moves"

In Fig. 3-5 the interface of the subtask "Apply_Moves" is shown. The interface specifies that "Apply_Moves" receives (board-game method) internal input from the store "Current_State" and delivers output to the store "Potential_Successor_States". Furthermore, three external knowledge roles provide the task- and/or domain-specific knowledge that is required for performing the subtask "Apply_Moves".

Having introduced the notion of a method ontology, the problem arises of how that method ontology can be made independent of the selection of (more elementary) methods for solving the subtasks of the PSM. The basic idea for getting rid of this problem is that each subtask is associated with a subtask ontology. The method ontology is then essentially derived from these subtask ontologies by combining the different subtask ontologies, and by introducing additional superconcepts, like for example the concept "movable_objects" (compare Fig. 3-2).
Of course, the terminology associated with the elementary inference actions has to be considered in addition.

Figure 3-6: The method ontology and the related subtask ontologies

The notion of subtask ontologies has two advantages:
1) By building up the method ontology from the various subtask ontologies, the method ontology is independent of the decision which submethod will be used for solving which subtask. Thus, a configuration-independent ontology definition can be given for each PSM in the library.
2) Subtask ontologies provide a context for mapping the ontologies of the submethods, which are selected to solve the subtasks, to the global method ontology (see Section 4).

The type of mapping that is required between the subtask ontology and the ontology of the method used to solve the subtask is dependent on the distinction of input and output definitions within the subtask ontology (see Section 4). Therefore, we again separate the subtask ontology appropriately. In addition, a distinction is made between internal and external input/output. The feature "internal" is used for indicating that this part of the ontology is used within the method of which the subtask is a part (that is, for defining the terminology of the stores used for the data flow among the subtasks, which corresponds to the internal-definitions part of the method ontology). The feature "external" indicates that input is received from the calling environment of the method of which the subtask is a part, or that output is delivered to that calling environment (that corresponds to the global-definitions part of the method ontology). This distinction can be made on the basis of the data dependencies defined among the various subtasks (see Fig. 3-6).

In Fig. 3-7, we introduce the ontology for the subtask "Apply_Moves". According to the interface description which is given in Fig. 3-5, we assume that the current state is an internal input for "Apply_Moves", whereas the potential successor states are treated as internal output. Moves, preferences among moves, and applicability conditions for moves are handled as external input.

Figure 3-7: The compact representation of the "Apply_Moves" subtask ontology

When investigating the interface and ontology specification of a subtask, one can easily recognize that even after having introduced the subtask decomposition, it is still open as to what kind of external knowledge is taken from the domain and what kind of knowledge is received as output from another task. That is, in a complex application task environment, in which the board-game method is used to solve, for example, a subtask st1, it depends on the calling environment of st1 whether, e.g., the preference of moves ("prefer_m") has to be defined in a mapping from the domain, or is just delivered as output from another subtask st2, which is called before st1 is called.

4. Configuring Problem Solving Methods from More Elementary Methods
The basic idea when building up a library of PSMs is that one does not simply store completely defined PSMs. In order to achieve more flexibility in adapting a PSM from the library to its task environment, concepts are required for configuring a PSM from more elementary building blocks (i.e., from more elementary methods (Puerta et al., 1992)).
Besides being more flexible, such a configuration approach also provides means for reusing these building blocks in different contexts. Based on the structure introduced in Section 3, the configuration of a PSM from building blocks requires the selection of methods that are appropriate for solving the subtasks of the PSM. Since we do not consider the indexing and adaptation problem in this paper, we assume subsequently that we have found a suitable (sub-)method for solving a given subtask, e.g. by exploiting appropriate semantic descriptions of the stored methods. Such semantic descriptions could, for instance, be pre-/postconditions which specify the functional behavior of a method (Fensel et al., 1996b). By using the new version of KARL (Angele et al., 1996a) such pre-/postconditions can be specified in a completely formal way.
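The configuration model described in Sections 3 and 4 can be summarized in a small sketch: a method exposes subtask slots, submethods are plugged into those slots, and a role mapping adapts each submethod's ontology to the subtask ontology. The sketch below is a simplified illustration, not PROTÉGÉ-II or KARL code; the class names and the sample submethod are assumptions made for the example, while the role names echo the board-game method's interface.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class Submethod:
    """A reusable building block with its own generic input/output roles."""
    name: str
    input_roles: set[str]
    output_roles: set[str]
    run: Callable[[dict], dict]  # maps role name -> value

@dataclass
class SubtaskSlot:
    """A subtask of a method, described by its subtask ontology roles."""
    name: str
    expected_inputs: set[str]
    expected_outputs: set[str]
    submethod: Optional[Submethod] = None
    role_mapping: dict[str, str] = field(default_factory=dict)  # subtask role -> submethod role

    def plug_in(self, method: Submethod, mapping: dict[str, str]) -> None:
        # Configuration step: check that the mapped submethod covers the subtask's input roles.
        missing = {mapping.get(r, r) for r in self.expected_inputs} - method.input_roles
        if missing:
            raise ValueError(f"{method.name} does not accept roles {missing}")
        self.submethod, self.role_mapping = method, mapping

# Board-game-method-like decomposition: two subtask slots to be configured.
apply_moves = SubtaskSlot("Apply_Moves",
                          expected_inputs={"Current_State", "Moves", "Applic_Moves"},
                          expected_outputs={"Potential_Successor_States"})
select_best = SubtaskSlot("Select_Best_State",
                          expected_inputs={"Potential_Successor_States", "Preferred_Moves"},
                          expected_outputs={"Current_State"})

# A hypothetical submethod that enumerates applicable moves.
enumerate_moves = Submethod(
    name="Enumerate_Applicable_Moves",
    input_roles={"Current_State", "Moves", "Applic_Moves"},
    output_roles={"Potential_Successor_States"},
    run=lambda roles: {"Potential_Successor_States": []},  # stub behavior
)
apply_moves.plug_in(enumerate_moves, mapping={})
```

Because the slot is described by its subtask ontology rather than by a fixed submethod, the same decomposition can be reconfigured by plugging a different submethod into the slot with a different role mapping.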
Knowledge Engineering: Principles and Methods
Rudi Studer1, V. Richard Benjamins2, and Dieter Fensel1
1 Institute AIFB, University of Karlsruhe, 76128 Karlsruhe, Germany
{studer, fensel}@aifb.uni-karlsruhe.de, http://www.aifb.uni-karlsruhe.de
2 Artificial Intelligence Research Institute (IIIA), Spanish Council for Scientific Research (CSIC), Campus UAB, 08193 Bellaterra, Barcelona, Spain
richard@iiia.csic.es, http://www.iiia.csic.es/~richard
2 Dept. of Social Science Informatics (SWI),
richard@swi.psy.uva.nl, http://www.swi.psy.uva.nl/usr/richard/home.html

Abstract
This paper gives an overview of the development of the field of Knowledge Engineering over the last 15 years. We discuss the paradigm shift from a transfer view to a modeling view and describe two approaches which considerably shaped research in Knowledge Engineering: Role-limiting Methods and Generic Tasks. To illustrate various concepts and methods which evolved in the last years we describe three modeling frameworks: CommonKADS, MIKE, and PROTÉGÉ-II. This description is supplemented by discussing some important methodological developments in more detail: specification languages for knowledge-based systems, problem-solving methods, and ontologies. We conclude with outlining the relationship of Knowledge Engineering to Software Engineering, Information Integration and Knowledge Management.

Key Words
Knowledge Engineering, Knowledge Acquisition, Problem-Solving Method, Ontology, Information Integration

1 Introduction
In earlier days research in Artificial Intelligence (AI) was focused on the development of formalisms, inference mechanisms and tools to operationalize Knowledge-based Systems (KBS). Typically, the development efforts were restricted to the realization of small KBSs in order to study the feasibility of the different approaches. Though these studies offered rather promising results, the transfer of this technology into commercial use in order to build large KBSs failed in many cases. The situation was directly comparable to a similar situation in the construction of traditional software systems, called the "software crisis" in the late sixties: the means to develop small academic prototypes did not scale up to the design and maintenance of large, long-living commercial systems. In the same way as the software crisis resulted in the establishment of the discipline Software Engineering, the unsatisfactory situation in constructing KBSs made clear the need for more methodological approaches.

So the goal of the new discipline Knowledge Engineering (KE) is similar to that of Software Engineering: turning the process of constructing KBSs from an art into an engineering discipline. This requires the analysis of the building and maintenance process itself and the development of appropriate methods, languages, and tools specialized for developing KBSs. Subsequently, we will first give an overview of some important historical developments in KE: special emphasis will be put on the paradigm shift from the so-called transfer approach to the so-called modeling approach. This paradigm shift is sometimes also considered as the transfer from first-generation expert systems to second-generation expert systems [43]. Based on this discussion Section 2 will be concluded by describing two prominent developments in the late eighties: Role-limiting Methods [99] and Generic Tasks [36]. In Section 3 we will present some modeling frameworks which have been developed in recent years: CommonKADS [129], MIKE [6], and PROTÉGÉ-II [123].
Section 4 gives a short overview of specification languages for KBSs. Problem-solving methods have been a major research topic in KE for the last decade. Basic characteristics of (libraries of) problem-solving methods are described in Section 5. Ontologies, which gained a lot of importance during the last years, are discussed in Section 6. The paper concludes with a discussion of current developments in KE and their relationships to other disciplines.

In KE much effort has also been put into developing methods and supporting tools for knowledge elicitation (compare [48]). E.g. in the VITAL approach [130] a collection of elicitation tools, like e.g. repertory grids (see [65], [83]), are offered for supporting the elicitation of domain knowledge (compare also [49]). However, a discussion of the various elicitation methods is beyond the scope of this paper.

2 Historical Roots
2.1 Basic Notions
In this section we will first discuss some main principles which characterize the development of KE from the very beginning.

Knowledge Engineering as a Transfer Process
"This transfer and transformation of problem-solving expertise from a knowledge source to a program is the heart of the expert-system development process." [81]
In the early eighties the development of a KBS was seen as a transfer process of human knowledge into an implemented knowledge base. This transfer was based on the assumption that the knowledge which is required by the KBS already exists and just has to be collected and implemented. Most often, the required knowledge was obtained by interviewing experts on how they solve specific tasks [108]. Typically, this knowledge was implemented in some kind of production rules which were executed by an associated rule interpreter. However, a careful analysis of the various rule knowledge bases showed that the rather simple representation formalism of production rules did not support an adequate representation of different types of knowledge [38]: e.g. in the MYCIN knowledge base [44] strategic knowledge about the order in which goals should be achieved (e.g. "consider common causes of a disease first") is mixed up with domain-specific knowledge about, for example, causes for a specific disease. This mixture of knowledge types, together with the lack of adequate justifications of the different rules, makes the maintenance of such knowledge bases very difficult and time consuming. Therefore, this transfer approach was only feasible for the development of small prototypical systems, but it failed to produce large, reliable and maintainable knowledge bases. Furthermore, it was recognized that the assumption of the transfer approach, that is, that knowledge acquisition is the collection of already existing knowledge elements, was wrong due to the important role of tacit knowledge for an expert's problem-solving capabilities. These deficiencies resulted in a paradigm shift from the transfer approach to the modeling approach.

Knowledge Engineering as a Modeling Process
Nowadays there exists an overall consensus that the process of building a KBS may be seen as a modeling activity. Building a KBS means building a computer model with the aim of realizing problem-solving capabilities comparable to a domain expert. It is not intended to create a cognitively adequate model, i.e. to simulate the cognitive processes of an expert in general, but to create a model which offers similar results in problem-solving for problems in the area of concern.
While the expert may consciously articulate some parts of his or her knowledge, he or she will not be aware of a significant part of this knowledge since it is hidden in his or her skills. This knowledge is not directly accessible, but has to be built up and structured during the knowledge acquisition phase. Therefore this knowledge acquisition process is no longer seen as a transfer of knowledge into an appropriate computer representation, but as a model construction process ([41], [106]).

This modeling view of the building process of a KBS has the following consequences:
• Like every model, such a model is only an approximation of the reality. In principle, the modeling process is infinite, because it is an incessant activity with the aim of approximating the intended behaviour.
• The modeling process is a cyclic process. New observations may lead to a refinement, modification, or completion of the already built-up model. On the other hand, the model may guide the further acquisition of knowledge.
• The modeling process is dependent on the subjective interpretations of the knowledge engineer. Therefore this process is typically faulty and an evaluation of the model with respect to reality is indispensable for the creation of an adequate model. According to this feedback loop, the model must therefore be revisable in every stage of the modeling process.

Problem Solving Methods
In [39] Clancey reported on the analysis of a set of first-generation expert systems developed to solve different tasks. Though they were realized using different representation formalisms (e.g. production rules, frames, LISP), he discovered a common problem-solving behaviour. Clancey was able to abstract this common behaviour to a generic inference pattern called Heuristic Classification, which describes the problem-solving behaviour of these systems on an abstract level, the so-called Knowledge Level [113]. This knowledge level allows one to describe reasoning in terms of goals to be achieved, actions necessary to achieve these goals, and knowledge needed to perform these actions. A knowledge-level description of a problem-solving process abstracts from details concerned with the implementation of the reasoning process and results in the notion of a Problem-Solving Method (PSM).

A PSM may be characterized as follows (compare [20]):
• A PSM specifies which inference actions have to be carried out for solving a given task.
• A PSM determines the sequence in which these actions have to be activated.
• In addition, so-called knowledge roles determine which role the domain knowledge plays in each inference action. These knowledge roles define a domain-independent generic terminology.

When considering the PSM Heuristic Classification in some more detail (Figure 1) we can identify the three basic inference actions abstract, heuristic match, and refine. Furthermore, four knowledge roles are defined: observables, abstract observables, solution abstractions, and solutions. It is important to see that such a description of a PSM is given in a generic way. Thus the reuse of such a PSM in different domains is made possible. When considering a medical domain, an observable like "41° C" may be abstracted to "high temperature" by the inference action abstract. This abstracted observable may be matched to a solution abstraction, e.g. "infection", and finally the solution abstraction may be hierarchically refined to a solution, e.g.
the disease …influenca“.In the meantime various PSMs have been identified, like e.g.Cover-and-Differentiate for solving diagnostic tasks [99] or Propose-and-Revise [100] for parametric design tasks.PSMs may be exploited in the knowledge engineering process in different ways:Fig. 1 The Problem-Solving Method Heuristic Classificationroleinference action•PSMs contain inference actions which need specific knowledge in order to perform their task. For instance,Heuristic Classification needs a hierarchically structured model of observables and solutions for the inference actions abstract and refine, respectively.So a PSM may be used as a guideline to acquire static domain knowledge.• A PSM allows to describe the main rationale of the reasoning process of a KBS which supports the validation of the KBS, because the expert is able to understand the problem solving process. In addition, this abstract description may be used during the problem-solving process itself for explanation facilities.•Since PSMs may be reused for developing different KBSs, a library of PSMs can be exploited for constructing KBSs from reusable components.The concept of PSMs has strongly stimulated research in KE and thus has influenced many approaches in this area. A more detailed discussion of PSMs is given in Section 5.2.2Specific ApproachesDuring the eighties two main approaches evolved which had significant influence on the development of modeling approaches in KE: Role-Limiting Methods and Generic Tasks. Role-Limiting MethodsRole-Limiting Methods (RLM) ([99], [102]) have been one of the first attempts to support the development of KBSs by exploiting the notion of a reusable problem-solving method. The RLM approach may be characterized as a shell approach. Such a shell comes with an implementation of a specific PSM and thus can only be used to solve a type of tasks for which the PSM is appropriate. The given PSM also defines the generic roles that knowledge can play during the problem-solving process and it completely fixes the knowledge representation for the roles such that the expert only has to instantiate the generic concepts and relationships, which are defined by these roles.Let us consider as an example the PSM Heuristic Classification (see Figure 1). A RLM based on Heuristic Classification offers a role observables to the expert. Using that role the expert (i) has to specify which domain specific concept corresponds to that role, e.g. …patient data”(see Figure 4), and (ii) has to provide domain instances for that concept, e.g. concrete facts about patients. It is important to see that the kind of knowledge, which is used by the RLM, is predefined. Therefore, the acquisition of the required domain specific instances may be supported by (graphical) interfaces which are custom-tailored for the given PSM.In the following we will discuss one RLM in some more detail: SALT ([100], [102]) which is used for solving constructive tasks.Then we will outline a generalization of RLMs to so-called Configurable RLMs.SALT is a RLM for building KBSs which use the PSM Propose-and-Revise. Thus KBSs may be constructed for solving specific types of design tasks, e.g. parametric design tasks. 
The basic inference actions that Propose-and-Revise is composed of, may be characterized as follows:•extend a partial design by proposing a value for a design parameter not yet computed,•determine whether all computed parameters fulfil the relevant constraints, and•apply fixes to remove constraint violations.In essence three generic roles may be identified for Propose-and-Revise ([100]):•…design-extensions” refer to knowledge for proposing a new value for a design parameter,•…constraints” provide knowledge restricting the admissible values for parameters, and •…fixes” make potential remedies available for specific constraint violations.From this characterization of the PSM Propose-and-Revise, one can easily see that the PSM is described in generic, domain-independent terms. Thus the PSM may be used for solving design tasks in different domains by specifying the required domain knowledge for the different predefined generic knowledge roles.E.g. when SALT was used for building the VT-system [101], a KBS for configuring elevators, the domain expert used the form-oriented user interface of SALT for entering domain specific design extensions (see Figure 2). That is, the generic terminology of the knowledge roles, which is defined by object and relation types, is instantiated with VT specific instances.1Name:CAR-JAMB-RETURN2Precondition:DOOR-OPENING = CENTER3Procedure:CALCULATION4Formula:[PLATFORM-WIDTH -OPENING-WIDTH] / 25Justification:CENTER-OPENING DOORS LOOKBEST WHEN CENTERED ONPLATFORM.(the value of the design parameter CAR-JUMB-RETURN iscalculated according to the formula - in case the preconditionis fulfilled; the justification gives a description why thisparameter value is preferred over other values (example takenfrom [100]))Fig. 2 Design Extension Knowledge for VTOn the one hand, the predefined knowledge roles and thus the predefined structure of the knowledge base may be used as a guideline for the knowledge acquisition process: it is clearly specified what kind of knowledge has to be provided by the domain expert. On the other hand, in most real-life situations the problem arises of how to determine whether a specific task may be solved by a given RLM. Such task analysis is still a crucial problem, since up to now there does not exist a well-defined collection of features for characterizing a domain task in a way which would allow a straightforward mapping to appropriate RLMs. Moreover, RLMs have a fixed structure and do not provide a good basis when a particular task can only be solved by a combination of several PSMs.In order to overcome this inflexibility of RLMs, the concept of configurable RLMs has been proposed.Configurable Role-Limiting Methods (CRLMs) as discussed in [121] exploit the idea that a complex PSM may be decomposed into several subtasks where each of these subtasks may be solved by different methods (see Section 5). In [121], various PSMs for solving classification tasks, like Heuristic Classification or Set-covering Classification, have been analysed with respect to common subtasks. This analysis resulted in the identification ofshared subtasks like …data abstraction” or …hypothesis generation and test”. Within the CRLM framework a predefined set of different methods are offered for solving each of these subtasks. Thus a PSM may be configured by selecting a method for each of the identified subtasks. In that way the CRLM approach provides means for configuring the shell for different types of tasks. 
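As an illustration of this configuration idea, the fragment below is a hedged sketch (not code from SALT, MOLE, or any other cited system): a classification problem-solving method is assembled from interchangeable subtask methods, and the domain knowledge is supplied through named knowledge roles. The subtask names, roles, and data mirror the medical example used earlier in the text and are otherwise invented.

```python
# A classification PSM configured from interchangeable subtask methods;
# subtask names, knowledge roles, and data are illustrative assumptions.

def abstract_by_thresholds(observables, knowledge):
    """Data abstraction: map raw observables to qualitative values."""
    abstracted = {}
    for name, value in observables.items():
        low, high = knowledge["abstraction_thresholds"][name]
        abstracted[name] = "low" if value < low else "high" if value > high else "normal"
    return abstracted

def generate_by_match(abstracted, knowledge):
    """Hypothesis generation: propose solutions whose patterns match the abstractions."""
    return [sol for sol, pattern in knowledge["solution_patterns"].items()
            if all(abstracted.get(k) == v for k, v in pattern.items())]

# The "shell" is configured by selecting one method per subtask.
CONFIGURATION = {
    "data_abstraction": abstract_by_thresholds,
    "hypothesis_generation": generate_by_match,
}

def classify(observables, knowledge, config=CONFIGURATION):
    abstracted = config["data_abstraction"](observables, knowledge)
    return config["hypothesis_generation"](abstracted, knowledge)

knowledge = {
    "abstraction_thresholds": {"temperature": (36.0, 37.5)},
    "solution_patterns": {"infection": {"temperature": "high"}},
}
print(classify({"temperature": 41.0}, knowledge))   # ['infection']
```

Swapping a different abstraction or generation function into CONFIGURATION changes the behaviour of the shell without touching the control part, which is the essence of the configurable approach.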
It should be noted that each method offered for solving a specific subtask, has to meet the knowledge role specifications that are predetermined for the CRLM shell, i.e. the CRLM shell comes with a fixed scheme of knowledge types. As a consequence, the introduction of a new method into the shell typically involves the modification and/or extension of the current scheme of knowledge types [121]. Having a fixed scheme of knowledge types and predefined communication paths between the various components is an important restriction distinguishing the CRLM framework from more flexible configuration approaches such as CommonKADS (see Section 3).It should be clear that the introduction of such flexibility into the RLM approach removes one of its disadvantages while still exploiting the advantage of having a fixed scheme of knowledge types, which build the basis for generating effective knowledge-acquisition tools. On the other hand, configuring a CRLM shell increases the burden for the system developer since he has to have the knowledge and the ability to configure the system in the right way. Generic Task and Task StructuresIn the early eighties the analysis and construction of various KBSs for diagnostic and design tasks evolved gradually into the notion of a Generic Task (GT) [36]. GTs like Hierarchical Classification or State Abstraction are building blocks which can be reused for the construction of different KBSs.The basic idea of GTs may be characterized as follows (see [36]):• A GT is associated with a generic description of its input and output.• A GT comes with a fixed scheme of knowledge types specifying the structure of domain knowledge needed to solve a task.• A GT includes a fixed problem-solving strategy specifying the inference steps the strategy is composed of and the sequence in which these steps have to be carried out. The GT approach is based on the strong interaction problem hypothesis which states that the structure and representation of domain knowledge is completely determined by its use [33]. Therefore, a GT comes with both, a fixed problem-solving strategy and a fixed collection of knowledge structures.Since a GT fixes the type of knowledge which is needed to solve the associated task, a GT provides a task specific vocabulary which can be exploited to guide the knowledge acquisition process. Furthermore, by offering an executable shell for a GT, called a task specific architecture, the implementation of a specific KBS could be considered as the instantiation of the predefined knowledge types by domain specific terms (compare [34]). On a rather pragmatic basis several GTs have been identified including Hierarchical Classification,Abductive Assembly and Hypothesis Matching. This initial collection of GTs was considered as a starting point for building up an extended collection covering a wide range of relevant tasks.However, when analyzed in more detail two main disadvantages of the GT approach have been identified (see [37]):•The notion of task is conflated with the notion of the PSM used to solve the task, sinceeach GT included a predetermined problem-solving strategy.•The complexity of the proposed GTs was very different, i.e. it remained open what the appropriate level of granularity for the building blocks should be.Based on this insight into the disadvantages of the notion of a GT, the so-called Task Structure approach was proposed [37]. 
The Task Structure approach makes a clear distinction between a task, which is used to refer to a type of problem, and a method, which is a way to accomplish a task. In that way a task structure may be defined as follows (see Figure 3): a task is associated with a set of alternative methods suitable for solving the task. Each method may be decomposed into several subtasks. The decomposition structure is refined to a level where elementary subtasks are introduced which can directly be solved by using available knowledge.As we will see in the following sections, the basic notion of task and (problem-solving)method, and their embedding into a task-method-decomposition structure are concepts which are nowadays shared among most of the knowledge engineering methodologies.3Modeling FrameworksIn this section we will describe three modeling frameworks which address various aspects of model-based KE approaches: CommonKADS [129] is prominent for having defined the structure of the Expertise Model, MIKE [6] puts emphasis on a formal and executable specification of the Expertise Model as the result of the knowledge acquisition phase, and PROTÉGÉ-II [51] exploits the notion of ontologies.It should be clear that there exist further approaches which are well known in the KE community, like e.g VITAL [130], Commet [136], and EXPECT [72]. However, a discussion of all these approaches is beyond the scope of this paper.Fig. 3 Sample Task Structure for DiagnosisTaskProblem-Solving MethodSubtasksProblem-Solving MethodTask / Subtasks3.1The CommonKADS ApproachA prominent knowledge engineering approach is KADS[128] and its further development to CommonKADS [129]. A basic characteristic of KADS is the construction of a collection of models, where each model captures specific aspects of the KBS to be developed as well as of its environment. In CommonKADS the Organization Model, the Task Model, the Agent Model, the Communication Model, the Expertise Model and the Design Model are distinguished. Whereas the first four models aim at modeling the organizational environment the KBS will operate in, as well as the tasks that are performed in the organization, the expertise and design model describe (non-)functional aspects of the KBS under development. Subsequently, we will briefly discuss each of these models and then provide a detailed description of the Expertise Model:•Within the Organization Model the organizational structure is described together with a specification of the functions which are performed by each organizational unit.Furthermore, the deficiencies of the current business processes, as well as opportunities to improve these processes by introducing KBSs, are identified.•The Task Model provides a hierarchical description of the tasks which are performed in the organizational unit in which the KBS will be installed. This includes a specification of which agents are assigned to the different tasks.•The Agent Model specifies the capabilities of each agent involved in the execution of the tasks at hand. In general, an agent can be a human or some kind of software system, e.g.a KBS.•Within the Communication Model the various interactions between the different agents are specified. Among others, it specifies which type of information is exchanged between the agents and which agent is initiating the interaction.A major contribution of the KADS approach is its proposal for structuring the Expertise Model, which distinguishes three different types of knowledge required to solve a particular task. 
Basically, the three different types correspond to a static view, a functional view and a dynamic view of the KBS to be built (see in Figure 4 respectively “domain layer“, “inference layer“ and “task layer“):•Domain layer : At the domain layer all the domain specific knowledge is modeled which is needed to solve the task at hand. This includes a conceptualization of the domain in a domain ontology (see Section 6), and a declarative theory of the required domain knowledge. One objective for structuring the domain layer is to model it as reusable as possible for solving different tasks.•Inference layer : At the inference layer the reasoning process of the KBS is specified by exploiting the notion of a PSM. The inference layer describes the inference actions the generic PSM is composed of as well as the roles , which are played by the domain knowledge within the PSM. The dependencies between inference actions and roles are specified in what is called an inference structure. Furthermore, the notion of roles provides a domain independent view on the domain knowledge. In Figure 4 (middle part) we see the inference structure for the PSM Heuristic Classification . Among others we can see that …patient data” plays the role of …observables” within the inference structure of Heuristic Classification .•Task layer : The task layer provides a decomposition of tasks into subtasks and inference actions including a goal specification for each task, and a specification of how theseFig. 4 Expertise Model for medical diagnosis (simplified CML notation)goals are achieved. The task layer also provides means for specifying the control over the subtasks and inference actions, which are defined at the inference layer.Two types of languages are offered to describe an Expertise Model: CML (Conceptual Modeling Language) [127], which is a semi-formal language with a graphical notation, and (ML)2 [79], which is a formal specification language based on first order predicate logic, meta-logic and dynamic logic (see Section 4). Whereas CML is oriented towards providing a communication basis between the knowledge engineer and the domain expert, (ML)2 is oriented towards formalizing the Expertise Model.The clear separation of the domain specific knowledge from the generic description of the PSM at the inference and task layer enables in principle two kinds of reuse: on the one hand, a domain layer description may be reused for solving different tasks by different PSMs, on the other hand, a given PSM may be reused in a different domain by defining a new view to another domain layer. This reuse approach is a weakening of the strong interaction problem hypothesis [33] which was addressed in the GT approach (see Section 2). In [129] the notion of a relative interaction hypothesis is defined to indicate that some kind of dependency exists between the structure of the domain knowledge and the type of task which should be solved. To achieve a flexible adaptation of the domain layer to a new task environment, the notion of layered ontologies is proposed:Task and PSM ontologies may be defined as viewpoints on an underlying domain ontology.Within CommonKADS a library of reusable and configurable components, which can be used to build up an Expertise Model, has been defined [29]. A more detailed discussion of PSM libraries is given in Section 5.In essence, the Expertise Model and the Communication Model capture the functional requirements for the target system. 
Based on these requirements the Design Model is developed, which specifies among others the system architecture and the computational mechanisms for realizing the inference actions. KADS aims at achieving a structure-preserving design, i.e. the structure of the Design Model should reflect the structure of the Expertise Model as much as possible [129].All the development activities, which result in a stepwise construction of the different models, are embedded in a cyclic and risk-driven life cycle model similar to Boehm’s spiral model [21].The basic structure of the expertise model has some similarities with the data, functional, and control view of a system as known from software engineering. However, a major difference may be seen between an inference layer and a typical data-flow diagram (compare [155]): Whereas an inference layer is specified in generic terms and provides - via roles and domain views - a flexible connection to the data described at the domain layer, a data-flow diagram is completely specified in domain specific terms. Moreover, the data dictionary does not correspond to the domain layer, since the domain layer may provide a complete model of the domain at hand which is only partially used by the inference layer, whereas the data dictionary is describing exactly those data which are used to specify the data flow within the data flow diagram (see also [54]).3.2The MIKE ApproachThe MIKE approach (Model-based and Incremental Knowledge Engineering) (cf. [6], [7])。
Information Extraction and Knowledge Graph Construction in Natural Language Processing
Natural language processing (NLP) is an important branch of artificial intelligence (AI) that aims to enable computers to understand and process human language in both text and speech. Within NLP, information extraction (IE) is a key task: its goal is to extract structured, meaningful information from large volumes of unstructured text. Knowledge graph construction, in turn, integrates the extracted information into a structured knowledge base that supports higher-level reasoning and analysis.
Information extraction is challenging because unstructured natural language text lacks explicit syntactic and semantic rules and is often ambiguous and complex in its expression. Nevertheless, by applying a range of techniques and algorithms, important information can be extracted from text effectively. The main information extraction tasks include named entity recognition (NER), relation extraction, event extraction, and sentiment analysis.
Named entity recognition is one of the fundamental tasks in information extraction. It aims to identify entities of specific categories in text, such as persons, locations, organizations, and times. Efficient named entity recognition systems can be built with machine learning algorithms and models, as the sketch below illustrates.
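As a minimal illustration (not part of the original article), the following Python sketch runs an off-the-shelf named entity recognizer. It assumes the spaCy library and its small English model en_core_web_sm are installed; the example sentence is invented.

```python
# Minimal NER sketch using spaCy
# (assumes: pip install spacy && python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")

text = "Apple opened a new office in Shanghai in January 2023."
doc = nlp(text)

# Each recognized entity carries its surface text and a category label
# such as ORG (organization), GPE (geo-political entity), or DATE.
for ent in doc.ents:
    print(ent.text, ent.label_)
```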
Relation extraction identifies the relationships between entities by analyzing the semantic relations expressed in the text. These relations may be predefined, or they may be learned automatically from the text itself.
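The following sketch is an illustrative assumption rather than the article's method: it extracts a small set of predefined relations with simple surface patterns over entity mentions. Real systems typically replace the patterns with supervised or distantly supervised classifiers.

```python
# Toy pattern-based relation extraction; relation names and patterns are invented.
import re

PATTERNS = [
    ("works_for", re.compile(r"(?P<head>[A-Z]\w+) works for (?P<tail>[A-Z]\w+)")),
    ("located_in", re.compile(r"(?P<head>[A-Z]\w+) is located in (?P<tail>[A-Z]\w+)")),
]

def extract_relations(sentence):
    """Return (head, relation, tail) triples found by the patterns."""
    triples = []
    for relation, pattern in PATTERNS:
        for m in pattern.finditer(sentence):
            triples.append((m.group("head"), relation, m.group("tail")))
    return triples

print(extract_relations("Alice works for Google. Google is located in California."))
# [('Alice', 'works_for', 'Google'), ('Google', 'located_in', 'California')]
```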
Event extraction is a more complex information extraction task: it requires pulling the information related to a specific event out of the text. In news reports, for example, event extraction can identify key information such as an event's topic, time, location, and participants. To perform event extraction, the text is first segmented into sentences and then analyzed syntactically and semantically to obtain more precise information.
Sentiment (attitude) extraction is another important information extraction task; it aims to extract the author's emotions and opinions from text. It has broad applications in areas such as social media analytics, public opinion monitoring, and market research. By applying sentiment analysis and machine learning algorithms, text can be classified by sentiment polarity (positive, negative, or neutral) and by subjectivity (subjective or objective).
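As a small, hedged example (not from the original text), the snippet below classifies sentiment polarity with a pretrained model. It assumes the Hugging Face transformers library is installed and uses its default sentiment-analysis pipeline; the review texts are invented.

```python
# Sentiment polarity with a pretrained model (assumes: pip install transformers).
from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # downloads a default English model

reviews = [
    "The new phone is fantastic, the battery lasts all day.",
    "Delivery was late and the packaging was damaged.",
]

for review in reviews:
    result = classifier(review)[0]          # {'label': 'POSITIVE'/'NEGATIVE', 'score': ...}
    print(f"{result['label']:8s} ({result['score']:.2f})  {review}")
```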
Evaluation Metrics for Information Extraction
Introduction: With the arrival of the Internet and the big data era, enormous amounts of information are generated and stored, and obtaining useful information from massive text collections has become a critical task. Information extraction is a natural language processing technique that aims to extract useful information, such as entities, relations, and events, from unstructured text. Because information extraction involves several stages and processes, including entity recognition, relation extraction, and event extraction, evaluation metrics are needed to measure its performance and effectiveness. This article introduces some commonly used evaluation metrics for information extraction and explains their definitions and use step by step.
1. Precision
Precision is one of the most commonly used evaluation metrics in information extraction. It measures how many of the results identified by the system are correct, and is computed as:
Precision = number of correctly identified entities / total number of entities identified by the system
Precision ranges from 0 to 1; the closer it is to 1, the better the system's recognition quality. However, precision alone cannot fully reflect system performance, because it ignores the entities the system fails to capture.
2. Recall
Recall is another commonly used evaluation metric for information extraction. It measures how many of the entities actually present in the text the system manages to identify, and is computed as:
Recall = number of correctly identified entities / number of entities actually present in the text
Recall also ranges from 0 to 1; the closer it is to 1, the better the system's extraction coverage. In contrast to precision, recall ignores the false positives among the system's outputs.
3. F1 Score
The F1 score combines precision and recall into a single metric and is commonly used for the overall evaluation of information extraction systems. It is computed as:
F1 = 2 * (Precision * Recall) / (Precision + Recall)
The F1 score also lies between 0 and 1. It provides a combined measure of system performance that is more comparable and more stable than precision or recall taken alone; when both precision and recall are high, the F1 score is correspondingly high.
4. Accuracy
Accuracy is another common evaluation metric for information extraction. It measures the proportion of decisions over the whole text that the system gets right, and is computed as:
Accuracy = (number of correctly identified entities + number of correctly rejected non-entities) / total number of candidate decisions
Accuracy also lies between 0 and 1, and the closer it is to 1, the better the system performs. A short sketch that computes all four metrics on a toy example follows.
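To make the four definitions concrete, here is a hedged Python sketch (the entity representation, data, and candidate count are invented for the example) that scores a set of predicted entity spans against gold-standard annotations.

```python
# Scoring predicted entity mentions against gold annotations.
# Entities are represented as (start_offset, end_offset, label) tuples;
# this representation and the sample data are illustrative assumptions.

def score(predicted, gold, num_candidates):
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)              # correctly identified entities
    fp = len(predicted - gold)              # spurious predictions
    fn = len(gold - predicted)              # missed entities
    tn = num_candidates - tp - fp - fn      # correctly rejected candidates

    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    accuracy = (tp + tn) / num_candidates if num_candidates else 0.0
    return precision, recall, f1, accuracy

gold = [(0, 5, "PER"), (10, 16, "ORG"), (20, 28, "LOC")]
pred = [(0, 5, "PER"), (10, 16, "LOC")]     # one correct, one mislabeled
p, r, f1, acc = score(pred, gold, num_candidates=10)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f} accuracy={acc:.2f}")
# precision=0.50 recall=0.33 F1=0.40 accuracy=0.70
```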
Knowledge Graphs and Their Applications in Natural Language Processing
1. Introduction
With the continued development of the Internet and artificial intelligence, the explosive growth of data and the demand for ever more efficient applications pose enormous challenges to our ability to process and exploit data. Thanks to their clear structure, rich connectivity, and efficient querying, knowledge graphs have become an important tool for data management and intelligent applications. This article introduces the basic concepts of knowledge graphs and the methods used to construct them, with an emphasis on their applications in natural language processing.
2. Overview of Knowledge Graphs
A knowledge graph is regarded as an important way of turning natural language text into a computable knowledge representation. It is a graph-structured knowledge base made up of entities, attributes, and relations, and it is widely used for knowledge representation, knowledge retrieval, and data mining. Entities, attributes, and relations form the core of a knowledge graph: they represent, respectively, the things in the real world, the properties of those things, and the associations between them. Entities usually denote meaningful concepts such as people, places, organizations, and events; attributes describe the characteristics or states of entities; and relations express the links or connections between entities.
A knowledge graph can be constructed in several ways. The most common approach is ontology-based: entities, attributes, and relations are classified and described, organized into a hierarchy, and linked across its levels. Another approach is automatic construction based on information extraction, in which natural language processing techniques extract entity, attribute, and relation information from large-scale text to build a very large knowledge graph. A minimal triple-store sketch is given below.
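To illustrate how extracted entities and relations can be stored and queried as a graph, here is a hedged sketch using the rdflib library; the namespace and example facts are invented for the example and do not come from the article.

```python
# Building and querying a tiny knowledge graph of extracted triples
# (assumes: pip install rdflib; the EX namespace and facts are invented).
from rdflib import Graph, Namespace, Literal

EX = Namespace("http://example.org/")
g = Graph()

# Facts of the form (subject entity, relation, object entity or attribute value).
g.add((EX.Alice, EX.works_for, EX.Google))
g.add((EX.Google, EX.located_in, EX.California))
g.add((EX.Google, EX.founded_in, Literal(1998)))

# SPARQL query: where does each person's employer reside?
results = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?person ?place WHERE {
        ?person ex:works_for ?org .
        ?org ex:located_in ?place .
    }
""")
for person, place in results:
    print(person, "->", place)
```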
3. Applications of Knowledge Graphs in Natural Language Processing
3.1 Entity recognition
Entity recognition identifies entities with specific semantics in natural language text. An entity in a knowledge graph typically consists of a unique identifier and a set of descriptive attributes, so entity recognition can be seen as the bridge between natural language text and the knowledge graph. The results of entity recognition can be used directly for indexing, retrieval, and recommendation, and can also be combined with other natural language processing techniques such as relation extraction and event recognition. Several existing knowledge graphs already contain large amounts of entity and relation information, for example Wikipedia, Freebase, and YAGO; a simple entity-linking sketch is given below.
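To illustrate the "bridge" role described above, the following hedged sketch links entity mentions found in text to knowledge graph identifiers by simple name lookup. The identifiers and the alias table are invented for the example; production systems use much richer candidate generation and disambiguation.

```python
# Toy entity linking: map surface mentions to knowledge-graph identifiers.
# The identifiers and alias table below are invented for the example.
KG_ALIASES = {
    "google": "Q95",            # a Wikidata-style identifier
    "alphabet inc.": "Q95",
    "california": "Q99",
    "alice smith": "Q1234567",
}

def link_mentions(mentions):
    """Return (mention, kg_id) pairs; unknown mentions map to None."""
    links = []
    for mention in mentions:
        kg_id = KG_ALIASES.get(mention.lower())
        links.append((mention, kg_id))
    return links

mentions = ["Google", "California", "Bob Jones"]
print(link_mentions(mentions))
# [('Google', 'Q95'), ('California', 'Q99'), ('Bob Jones', None)]
```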
3.2 Relation extraction
Relation extraction automatically identifies the semantic relations between entities in natural language text.
Statistics Gathering for Learning from Distributed,Heterogeneous andAutonomous Data SourcesDoina Caragea,Jaime Reinoso,Adrian Silvescu,and Vasant HonavarArtificial Intelligence Research Laboratory,Computer Science Department,Iowa State University,226Atanasoff Hall,Ames,IA50011-1040,USA,{dcaragea,jreinoso,silvescu,and honavar}@AbstractWith the growing use of distributed informationnetworks,there is an increasing need for algorith-mic and system solutions for data-driven knowl-edge acquisition using distributed,heterogeneousand autonomous data repositories.In many appli-cations,practical constraints require such systemsto provide support for data analysis where the dataand the computational resources are available.Thispresents us with distributed learning problems.Weprecisely formulate a class of distributed learningproblems;present a general strategy for transform-ing traditional machine learning algorithms intodistributed learning algorithms based on the de-composition of the learning task into hypothesisgeneration and information extraction components;formally defined the information required for gen-erating the hypothesis(sufficient statistics);andshow how to gather the sufficient statistics from dis-tributed,heterogeneous,autonomous data sources,using a query decomposition(planning)approach.The resulting algorithms are provably exact in thatthe hypothesis constructed from distributed data isidentical to that obtained by the corresponding al-gorithm when in the batch setting.1IntroductionDevelopment of high throughput data acquisition technolo-gies in a number of domains(e.g.,biological sciences,envi-ronmental sciences,space sciences etc.)together with ad-vances in digital storage,computing,and communications technologies have resulted in unprecedented opportunities for scientists,to utilize,at least in principle,the wealth of infor-mation available on the Internet in learning,scientific dis-covery,and decision making.In practice,effective use of the growing body of data,information,and knowledge to achieve fundamental advances in scientific understanding and decision making presents several challenges[Honavar et al., 1998;2001]:•In such domains,data repositories are large in size, dynamic and physically distributed.Consequently,it is neither desirable nor feasible to gather all the datain a centralized location for analysis.Hence,efficient distributed learning algorithms that can operate across multiple autonomous data sources without the need to transmit large amounts of data are needed[Caragea et al.,2001b;Davies and Edwards,1999;Kargupta et al., 1999;Prodromidis et al.,2000;Provost and Kolluri, 1999].•Data sources of interest are autonomously owned and operated.Consequently,the range of operations that can be performed on the data source(e.g.,the types of queries allowed),and the precise mode of allowed inter-actions can be quite diverse(e.g.,PROSITE repository of protein data limits queries to those that can be entered using the forms provided on the web).Hence,strategies for obtaining the required information within the opera-tional constraints imposed by the data source are needed [Levy,2000].•Data sources are heterogeneous in structure(e.g.,rela-tional databases,flatfiles)and content(names and types of attributes and relations among attributes used to rep-resent the data).For example,data about proteins in-clude the amino acid sequences of proteins,multiple sources of3-dimensional structures of proteins,mul-tiple sources of structural features of proteins,multi-ple sources of 
protein-protein interaction data,multiple sources of functional annotations for proteins(according to different notions of protein function),among others.•The ontologies implicit in the design of autonomous data sources(i.e.,assumptions concerning objects that ex-ist in the world,which determine the choice of terms and relationships among terms)often do not match the ontologies of the users of those data sources.In sci-entific discovery applications,because users often need to examine data in different contexts from different per-spectives,methods for context-dependent dynamic in-formation extraction from distributed data based on user-specified ontologies are needed to support information extraction and knowledge acquisition from heteroge-neous distributed data([Honavar et al.,2001;Levy, 2000]).Against this background,our main goal is to develop ef-ficient strategies for extracting the information needed for learning(e.g.,sufficient statistics)from heterogeneous,au-tonomous,and distributed data sources,under a given set of ontological commitments in a given context.The rest of the paper is organized as follows:In Section 2,we precisely formulate a class of distributed learning prob-lems and present a general strategy for transforming tradi-tional machine learning algorithms into distributed learning algorithms based on the decomposition of the learning task into hypothesis generation and information extraction com-ponents.The resulting algorithms are provably exact in that the hypothesis constructed from distributed data is identical to that obtained by the corresponding algorithm when in the batch setting.In Section3,we formally define the sufficient statistics of a data set D with respect to a learning algorithm L and show how we can obtain these statistics from distributed data sets using a query decomposition(query planning)ap-proach,assuming that data is presented to the algorithm as a table whose rows correspond to instances and whose columns correspond to attributes.Section4shows how heterogeneous data sources can be integrated and made to look as tables. 
Section5concludes with a summary and a brief outline of future research directions.2Distributed LearningTo keep things simple,in what follows,wefirst assume that regardless of the structure of the individual data repositories (relational databases,flatfiles,etc.)the effective data set for learning algorithm can be thought of as a table whose rows correspond to instances and whose columns correspond to at-tributes.We will later discuss how heterogeneous data can be integrated and put into this form.In this setting,the problem of learning from distributed data sets can be summarized as follows:data is distributed across multiple sites and the learner’s task is to discover use-ful knowledge from all the available data.For example,such knowledge might be expressed in the form of a decision tree or a set of rules for pattern classification.We assume that it is not feasible to transmit raw data between sites.Conse-quently,the learner has to rely on information(e.g.,statistical summaries such as counts of data tuples that satisfy particular criteria)extracted from the sites.Definition:A distributed learning algorithm L D is said to be exact with respect to the hypothesis inferred by a learning algorithm L,if the hypothesis produced by L D,using dis-tributed data sets D1through D n is the same as that obtained by L when it is given access to the complete data set D,which can be constructed(in principle)by combining the individual data sets D1through D n.Definition:A distributed learning algorithm L D is said to be approximate with respect to the hypothesis inferred by a learning algorithm L,if the hypothesis produced by L D,us-ing distributed data sets D1through D n is a good approxi-mation of that obtained by L when it is given access to the complete data set D.Our approach to the exact/approximate distributed learning is based on a decomposition of the learning task into a control part which drives the execution of the algorithm toward the generation of a hypothesis and an information extraction part which is triggered by the control part whenever the algorithm requires statistics about the available data in order to generate the hypothesis(Figure1).Inf(D2)InformationExtractionHypothesisHypothesisGenerationInformationDataGenerationHypothesisD1D2HypothesisInf.Extr.(D1)Inf.Extr.(D2)Inf(D1)Figure1:Task Decomposition into Hypothesis Generation and Information Extraction Components.In this approach to distributed learning,only the informa-tion extraction component has to effectively cope with the distributed and heterogeneous nature of the data in order to guarantee provably exact or approximate learning algorithms. The control component doesn’t access data directly,but just the statistics extracted from data.These statistics are obtained from queries that the control part asks in the process of the hypothesis generation.A query answering engine,which acts like a planner,de-composes the queries in terms of operators available to the data sources and returns the answers to the control part.The results of the queries can be seen as a statistics oracle which responds according to the needs of the control part(i.e.,ac-cording to the requirements of a particular algorithm at each step).The classical example oracle,which here is distributed into a network,is used to provide sufficient statistics(e.g., counts for decision tree)to the centralized statistics oracle. 
The statistics oracle acts as a buffer between the data and the hypothesis corresponding to that data(Figure2).Our strategy for the distributed learning can be used to transform any batch learning algorithm into an efficient dis-tributed learning algorithm,once the sufficient statistics with respect to that particular learning algorithm have been identi-fied and a plan for gathering them has been found.3Sufficient StatisticsIn order to be able to define sufficient statistics[Casella and Berger,1990]in the context of learning,we look at a learn-ing algorithm as search into a space of possible hypotheses. The hypotheses in the search space can be thought as defin-ing a parametric function.A particular choice for the set of parameters will give us a particular hypothesis.In particular, those parameters can be estimated through learning based on the given data.Here we assume that every data set we are working with can be represented as a table.Statistics about these data sets are functions of the corresponding tables.Definition:Let F be a class of functions that a learning al-gorithm is called upon to learn.A statistic s(D)is called a sufficient statistic for learning a function f∈F given a data set D={(x1,y1)···,(x m,y m)},if there exists a function g such that g(s(D))=f(D).(Control Part)OutputQueryOntology MappingQuery Result E.g., Counts...GoalLearning Agent OntologyEngineAnswering Query Data Source 1ology 1Ont 2ology 2Ont Data Source ology N Ont DataSource N)Planner ( E.g., Decision Tree Hyperplane ...OracleOperators&Decomposition Query Query ResultResources DatabasesStatistic Engine Core Learning AgentFigure 2:Distributed Learning System based on Task Decomposition into Hypothesis Generation and Information Extraction Components.The hypothesis generation component acts as a control part which triggers the execution of the information extraction component whenever the learning algorithm needs data.The information extraction component can be viewed as a query answering engine or planner.The particular learning algorithm considered determines a particular class of functions F (e.g.,decision trees in the case of the decision tree algorithm,hyperplanes in the case of the SVM algorithm).So we say that a sufficient statistic is de-fined with respect to a learning algorithm and a training data set.Note that a data set D is a trivial sufficient statistic for D .However,whenever possible,we are interested in finding minimal sufficient statistics.Definition:The sufficient statistic s ∗(D )is called minimal sufficient statistic if for any other sufficient statistic s (D ),s ∗(D )is a function of s (D ),i.e.there exists h such that s ∗(D )=h (s (D ))for any data set D .For some learning algorithms (e.g.,Naive Bayes),the suf-ficient statistics have to be computed once in order for the learning algorithm to generate the hypothesis.However,for other algorithms (e.g.,Decision Trees)the statistics gather-ing process and the hypothesis generation are interleaved.In this case,the target function cannot be computed in one step;the learner has to go back and forth between the data and the partially learned hypotheses several times.Instead of deal-ing with sufficient statistics,we use partial sufficient statistics here.Examples of sufficient statistics for several classes of learn-ing algorithms are shown below:•Naive Bayes algorithm (NB)-counts of the instances that match certain criteria (e.g.,attribute-value counts)represent minimal sufficient statistics.•Decision Tree algorithm (DT)-counts of the instances 
that match certain criteria (e.g.,attribute-value-class)are minimal partial sufficient statistics for computing one level of the tree.Subsequent counts that depend on the current built hypothesis (tree)are needed in order to find the final decision tree [Bhatnagar and Srinivasan,1997].•Support Vector Machines algorithm (SVM)-the weight vector which determines the separating hyperplane,can be considered sufficient statistics for the separating hy-perplane.3.1Gathering Sufficient Statistics from Homogenous DataIn designing distributed learning algoritms using our decom-position strategy,we assume that the identification of the suf-ficient statistics is done by a human expert,who also designs the control part of the algorithm (function g in the definition of the sufficient statistics).However,the gathering of the suf-ficient statistics necessary for learning (construction of the statistics oracle)is done automatically every time when the control part needs some statistics about data,by the query an-swering engine.The query answering engine receives as in-put a query,and builds a plan for this query according to the resources and the operators available to each data set (Figure 2).The operators associated with a data source can be primi-tive operators (such as selection,projection,union,addition etc.),or they can be aggregate operators (e.g,counts or other database built-in functions).If the data source allows,the user can define specific operators (functions of the primitive oper-ators).The set of primitive operators should be complete with respect to the set of learning tasks that needs to be executed (i.e.,the application of these operators or functions of these operators is enough for gathering the information necessary for a particular learning task considered).Definition:We call learning plan for computing the statistics s required by a learning algorithm L ,from the distributed data sets D 1,···,D n a procedure P ,which transforms a given query into an execution plan.An execution plan can be seen as an expression tree,where each node corresponds to an op-erator and each leaf corresponds to basic statistics that can be extracted directly from the data sources.All the statistics that cannot be extracted directly from the data sources,should be replaced by their definitions recursively,until we obtain a plan in which only basic statistics appear as leaves.Each of the operators available at the distributed data sources has a cost associated with it.Based on these opera-tors and their costs,the query answering engine,which plays the role of the planner,finds the best execution plan for the current query and sends it to the distributed data sources forexecution.Each data source returns the statistics(answers to queries)extracted from its data to the query answering en-gine,which sends thefinal result to the learning agent.If the algorithm needs more information about data in order tofin-ish its job,a new query is sent to the query answering engine, and the process repeats.Definition:We say that two learning plans P1and P2are equivalent if they compute the same set of statistics.If we consider the costs associated with the operators included in a plan,we say that a learning plan P1is more efficient than an equivalent learning plan P2if the cost of thefirst plan is smaller than the cost of the second plan.The job of the query engine is tofind the best learning plan for a query,given a set of primitive operators,aggregate op-erators and user defined functions that can be applied to these 
operators,and their associated costs.4Gathering Sufficient Statistics from Heterogeneous Data SourcesIn the previous section,we assumed that data is presented to the distributed algorithms as tables whose rows correspond to instances and whose columns correspond to attributes.How-ever,in a heterogeneous environment,it is not trivial to get the data into this format.The differences in ontological commit-ments assumed by autonomous data sources present a signif-icant hurdle in theflexible use of data from multiple sources and from different perspectives in scientific discovery.To address this problem,we developed the INDUS soft-ware environment[Reinoso-Castillo,2002]for rapid and flexible assembly of data sets derived from multiple data sources.INDUS is designed to provide a unified query in-terface over a set of distributed data sources which enables us to view each data source as if it were a table.Thus,a scien-tist can integrate data from different sources from his or her own perspective using INDUS.INDUS builds on a large body of work on information integration,including in particular, approaches to querying heterogeneous information sources using source descriptions[Levy et al.,1996;Levy,2000; Ullman,1997],as well as distributed computing[Honavar et al.,1998;Wong et al.,].The input from a typical user(scientist)includes:an on-tology that links the various data sources from the users point of view,executable code that performs specific computations, needed if they are not directly supported by the data sources, and a query expressed in terms of the user-specified ontol-ogy.In this case,the query answering engine,receives this query as input,finds the best execution plan for it,translates the plan according to the ontologies specific to the distributed data sources,and sends it to the distributed data sources.Each data source returns answers to the queries it receives and the planner translates them back to the user(learning agent)on-tology.Thus,the user can extract and combine data from mul-tiple data sources and store the results in a relational database which is structured according to his or her own ontology.The results of the queries thus executed are stored in a relational database and can be manipulated using application programs or relational database(SQL)operations and used to derive other data sets(as those necessary for learning algorithms).More precisely,INDUS integration system is based on a federated query centric database approach[Mena et al., 2000].It consists of three principal layers which together provide a solution to the data integration problem in a scien-tific discovery environment(Figure3):•The physical layer allows the system to communicate with data sources.This layer is based on a federated database architecture(data is retrieved only in response to a query).It implements a set of instantiators which allow the interaction with each data source according to the constraints imposed by their autonomy and limited query capabilities.As a consequence,the central repos-itory can view disparate data sources as if they were a set of tables(relations).New iterators may be added to the system when needed(e.g.a new kind of data source has become available,the functionality offered by a data source has changed).•The ontological layer permits users to define one or more global(user/learning agent)ontologies and also to automatically transform queries into execution plans us-ing a query-centric approach.More specifically,it con-sists of the following:–A meta-model,developed 
under a relationaldatabase,allowing users to define one or moreglobal ontologies of any size.New statistics canbe defined in terms of existing statistics using a setof compositional operators.The set of operators isextensible allowing users to add new operators asneeded.–An interface for defining queries based on statisticsin a global ontology.–An algorithm,based on a query-centric approach,for transforming those queries into an executableplan.The query-centric approach allows users todefine each compound statistics in terms of basicstatistics,(i.e.statistics whose instances are di-rectly retrieved from data sources),using a prede-fined set of compositional operations.Therefore,the system has a description of how to obtain the setof instances of a global statistic based on extractingand processing instances from data sources.Theplans describe what information to extract fromeach data source and how to combine the results.–An algorithm that executes a plan by invoking theappropriate set of instantiators and combining theresults.–A repository for materialized(instantiated)planswhich allow users to inspect them if necessary.•Finally,the user interface layer enable users to interact with the system,define ontologies,post queries and re-ceive answers.Also,the materialization of an executed plan can be inspected.In conclusion,the most important features that INDUS of-fers in terms of data integration are summarized below:•From a user perspective,data accessible through IN-DUS can be represented as tables in which the rowsLayerData Source 1Accesible WorldLayerPhysical Instantiator 2Instantiator M................Data Source NData Source 2Instantiator 1Relational DarabaseRelational DatabaseQuery Centric AlgorithmExecutorOntological LayerUser Interface Figure 3:INDUS Information Integration Component.correspond to instances and columns correspond to at-tributes,regardless of the structure of the underlying data sources.•The system can include several ontologies at any time.Individual users can introduce new ontologies as needed.•New data sources can be incorporated into INDUS by specifying the data source descriptions including the corresponding data-source specific ontology and the set of instantiators.The main difference between INDUS and other integra-tion systems [Garcia-Molina et al.,1997;Arens et al.,1993;Knoblock et al.,2001;Subrahmanian et al.,June 2000;Draper et al.,2001;Paton et al.,1999]is that it can include several ontologies at any time.The users can introduce new ontologies or add new data sources and their associated on-tologies,as needed.INDUS has been successfully used by computational biol-ogists (including graduate and undergraduate students with varying degrees of expertise in computing)in our lab for quickly extracting and assembling the necessary data sets from multiple data repositories for exploring and visualiz-ing protein sequence-structure-function relationships [Wang et al.,2002;Andorf et al.,2002].For the purpose of distributed learning,INDUS is used to execute queries whose results are tables containing data of interest (e.g.,counts).Depending on the operations that are allowed at a particular data source,these tables may contain raw but integrated data (according to the global ontology that different data sources share)or counts or other statistics ex-tracted from the data sources.Thus,if a particular data source can answer specific queries,but it does not allow the execu-tion of any program or the storage of any data at that site,then the answer to the query will 
produce a table that needs to be stored locally,but closely to the original data source,in order to avoid the transfer of large amount of data.This table can be further used to obtain the statistics needed for the genera-tion of the hypothesis.On the other hand,if the data sources support aggregate operations (e.g.,those that provide statis-tics needed by the learning algorithm)or allow user-supplied programs to be executed at the data source,we can avoid ship-ping large amounts of data from distributed repositories.5Summary and Future WorkThe approach to the distributed learning taken in this paper is a generalization of a federated query centric database ap-proach [Mena et al.,2000].In the case of distributed learning the set of operators is usually a superset of the operators used in classical databases.Besides,here the whole query answer-ing process is just one step in the execution of the learning al-gorithm during which the statistics required by the algorithm are provided to the statistic oracle.Assuming that the points of interaction between a learning algorithm and the available training data can be identified,the distributed learning strategy described here can be easily used to transform any batch learning algorithm into an exact (or at least approximate)distributed learning algorithm.Future work is aimed at:•Experiment with the decomposition strategy for various classes of learning algorithms and prove theoretical re-sults with respect to the exact or the approximate quality of the distributed algorithms obtained.•A big variety of data mining algorithms,such as deci-sion trees,instance-based learners,Bayesian classifiers,Bayesian networks,multi-layer neural networks and support vector machines,among others,will be incorpo-rated in our distributed learning system.Some of these algorithms can be easily decomposed into hypothesis generation and information extraction components ac-cording to our task decomposition strategy (e.g.,Naive Bayes,decision trees),while others require substan-tial changes to the traditional learning algorithm (e.g.,Support Vector Machines),resulting sometimes in new learning algorithms for distributed learning [Caragea et al.,2001a;2000].•Formally define the set of operators for the algorithms that will be included in our distributed learning system and prove its completeness with respect to these algo-rithms.•Extend the formal definitions for plans,formulate prop-erties of these plans and prove these properties under various assumptions made in a distributed heteroge-neous environment.AcknowledgmentsThis work has been supported in part by grants from the Na-tional Science Foundation (#9982341,#9972653,#0219699),Pioneer Hi-Bred,Inc.,Iowa State University Graduate Col-lege,and IBM.References[Andorf et al.,2002]C.Andorf,D.Dobbs,and V .Honavar.Protein function classifiers based on reduced alphabet rep-resentations of protein sequences.2002.[Arens et al.,1993]Y.Arens, C.Chin, C.Hsu,andC.Knoblock.Retrieving and integrating data from multi-ple information sources.International Journal on Intelli-gent and Cooperative Information Systems,2(2):127–158, 1993.[Bhatnagar and Srinivasan,1997]R.Bhatnagar and S.Srini-vasan.Pattern discovery in distributed databases.In Pro-ceedings of the AAAI-97Conference,1997.[Caragea et al.,2000]D.Caragea, A.Silvescu,and V.Honavar.Agents that learn from distributed dynamic data sources.In Proceedings of the Workshop on Learn-ing Agents,Agents2000/ECML2000,pages53–61, Barcelona,Spain,2000.[Caragea et al.,2001a]D.Caragea, A.Silvescu,and 
V.Honavar.Decision tree learning from distributed data.Technical Report TR,Iowa State University,Ames,IA, 2001.[Caragea et al.,2001b]D.Caragea, A.Silvescu,and V.Honavar.Invited chapter.toward a theoretical frame-work for analysis and synthesis of agents that learn from distributed dynamic data sources.In Emerging Neural Architectures Based on Neuroscience.Berlin: Springer-Verlag,2001.[Casella and Berger,1990]G.Casella and R.L.Berger.Sta-tistical Inference.Duxbury Press,Belmont,CA,1990. [Davies and Edwards,1999]W.Davies and P.Edwards.Dagger:A new approach to combining multiple models learned from disjoint subsets.In ML99,1999.[Draper et al.,2001]Denise Draper,Alon Y.Halevy,and Daniel S.Weld.The nimble XML data integration sys-tem.In ICDE,pages155–160,2001.[Garcia-Molina et al.,1997]H.Garcia-Molina,Y.Papakon-stantinou,D.Quass,A.Rajaraman,Y.Sagiv,J.Ullman, V.Vassalos,and J.Widom.The tsimmis approach to me-diation:Data models and languages.Journal of Intelligent Information Systems,Special Issue on Next Generation In-formation Technologies and Systems,8(2),1997. [Honavar et al.,1998]V.Honavar,ler,and J.S.Wong.Distributed knowledge networks.In Proceedings of the IEEE Conference on Information Technology,Syracuse, NY,1998.[Honavar et al.,2001]H.Honavar,A.Silvescu,D.Caragea,C.Andorf,J.Reinoso-Castillo,andD.Dobbs.Ontology-driven information extraction and knowledge acquisition from heterogeneous,distributed,autonomous biological data sources.In Proceedings of the IJCAI-2001Work-shop on Knowledge Discovery from Heterogeneous,Dis-tributed,Autonomous,Dynamic Data and Knowledge Sources,Seattle,W A,2001.[Kargupta et al.,1999]H.Kargupta,B.H.Park,D.Hersh-berger,and E.Johnson.Collective data mining:A new perspective toward distributed data mining.In H.Kargupta and P.Chan,editors,Advances in Distributed and Parallel Knowledge Discovery.MIT/AAAI Press,1999.[Knoblock et al.,2001]Craig A.Knoblock,Steven Minton, Jose Luis Ambite,Naveen Ashish,Ion Muslea,Andrew Philpot,and Sheila Tejada.The ariadne approach to web-based information integration.International Journal of Cooperative Information Systems,10(1-2):145–169,2001. [Levy et al.,1996]A.Levy,A.Rajaraman,and J.Ordille.Querying heterogeneous information sources using source descriptions.1996.[Levy,2000]A.Levy.Logic-based techniques in data inte-gration.Kluwer Publishers,2000.[Mena et al.,2000]E.Mena,A.Illarramendi,V.Kashyap, and A.Sheth.Observer:An approach for query pro-cessing in global information systems based on interoper-ation across pre-existing ontologies.International journal on Distributed And Parallel Databases(DAPD),8(2):223-272,2000.[Paton et al.,1999]N.W.Paton,R.Stevens,P.G.Baker,C.A.Goble,and S.Bechhofer.Query processing in the tam-bis bioinformatics source integration system.In Proc.11th Int.Conf.on Scientific and Statistical Databases(SS-DBM),pages138–147.IEEE Press,1999. 
[Prodromidis et al.,2000]A.L.Prodromidis,P.Chan,and S.J.Stolfo.Meta-learning in distributed data mining systems:Issues and approaches.In H.Kargupta and P.Chan,editors,Advances of Distributed Data Mining.AAAI Press,2000.[Provost and Kolluri,1999]Foster J.Provost and Venkateswarlu Kolluri.A survey of methods for scaling up inductive algorithms.Data Mining and Knowledge Discovery,3(2):131–169,1999.[Reinoso-Castillo,2002]J.Reinoso-Castillo.Ontology-driven query-centric federated solution for information ex-traction and integration from autonomous,heterogeneous, distributed data sources,2002.[Subrahmanian et al.,June2000]V.S.Subrahmanian, P.Bonatti,J.Dix,T.Eiter,S.Kraus, F.Ozcan,and R.Ross.Heterogeneous Agent Systems:Theory and Implementation.MIT Press,June2000.[Ullman,1997]rmation integration using logical views.volume1186,pages19–40,Berlin: Springer-Verlag,1997.[Wang et al.,2002]X.Wang,D.Schroeder,D.Dobbs,,and V.Honavar.Data-driven discovery of rules for protein function classification based on sequence motifs:Rules discovered for peptidase families based on meme motifs outperform those based on prosite patterns and profiles.2002.[Wong et al.,]J.Wong,G.Helmer,V.Honavar,V.Na-ganathan,S.Polavarapu,,and ler.Smart mobile agent facility.Journal of Systems and Software,56:9–22.。
博士生发一篇information fusion Information Fusion: Enhancing Decision-Making through the Integration of Data and KnowledgeIntroduction:Information fusion, also known as data fusion or knowledge fusion, is a rapidly evolving field in the realm of decision-making. It involves the integration and analysis of data and knowledge from various sources to generate meaningful and accurate information. In this article, we will delve into the concept of information fusion, explore its key components, discuss its application in different domains, and highlight its significance in enhancingdecision-making processes.1. What is Information Fusion?Information fusion is the process of combining data and knowledge from multiple sources to provide a comprehensive and accurate representation of reality. The goal is to overcome the limitations inherent in individual sources and derive improved insights and predictions. By assimilating diverse information,information fusion enhances situational awareness, reduces uncertainty, and enables intelligent decision-making.2. Key Components of Information Fusion:a. Data Sources: Information fusion relies on various data sources, which can include sensors, databases, social media feeds, and expert opinions. These sources provide different types of data, such as text, images, audio, and numerical measurements.b. Data Processing: Once data is collected, it needs to be processed to extract relevant features and patterns. This step involves data cleaning, transformation, normalization, and aggregation to ensure compatibility and consistency.c. Information Extraction: Extracting relevant information is a crucial step in information fusion. This includes identifying and capturing the crucial aspects of the data, filtering out noise, and transforming data into knowledge.d. Knowledge Representation: The extracted information needs to be represented in a meaningful way for integration and analysis.Common methods include ontologies, semantic networks, and knowledge graphs.e. Fusion Algorithms: To integrate the information from various sources, fusion algorithms are employed. These algorithms can be rule-based, model-based, or data-driven, and they combine multiple pieces of information to generate a unified and coherent representation.f. Decision-Making Processes: The ultimate goal of information fusion is to enhance decision-making. This requires the fusion of information with domain knowledge and decision models to generate insights, predictions, and recommendations.3. Applications of Information Fusion:a. Defense and Security: Information fusion plays a critical role in defense and security applications, where it improves intelligence analysis, surveillance, threat detection, and situational awareness. By integrating information from multiple sources, such as radars, satellites, drones, and human intelligence, it enables effective decision-making in complex and dynamic situations.b. Health Monitoring: In healthcare, information fusion is used to monitor patient health, combine data from different medical devices, and provide real-time decision support to medical professionals. By fusing data from wearables, electronic medical records, and physiological sensors, it enables early detection of health anomalies and improves patient care.c. Smart Cities: Information fusion offers enormous potential for the development of smart cities. 
By integrating data from multiple urban systems, such as transportation, energy, and public safety, it enables efficient resource allocation, traffic management, and emergency response. This improves the overall quality of life for citizens.d. Financial Markets: In the financial sector, information fusion helps in the analysis of large-scale and diverse datasets. By integrating data from various sources, such as stock exchanges, news feeds, and social media mentions, it enables better prediction of market trends, risk assessment, and investmentdecision-making.4. Significance of Information Fusion:a. Enhanced Decision-Making: Information fusion enables decision-makers to obtain comprehensive and accurate information, reducing uncertainty and improving the quality of decisions.b. Improved Situational Awareness: By integrating data from multiple sources, information fusion enhances situational awareness, enabling timely and informed responses to dynamic and complex situations.c. Risk Reduction: By combining information from diverse sources, information fusion improves risk assessment capabilities, enabling proactive and preventive measures.d. Resource Optimization: Information fusion facilitates the efficient utilization of resources by providing a holistic view of the environment and enabling optimization of resource allocation.Conclusion:In conclusion, information fusion is a powerful approach to enhance decision-making by integrating data and knowledge from multiple sources. Its key components, including data sources, processing, extraction, knowledge representation, fusion algorithms, and decision-making processes, together create a comprehensive framework for generating meaningful insights. By applying information fusion in various domains, such as defense, healthcare, smart cities, and financial markets, we can maximize the potential of diverse information sources to achieve improved outcomes.。
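As a closing illustration of the fusion algorithms described earlier in this article, the sketch below is a hedged example (the sensors, readings, and noise levels are invented): it combines noisy estimates from several independent sources by inverse-variance weighting, one of the simplest model-based fusion rules.

```python
# Inverse-variance weighted fusion of estimates from independent sources.
# Source readings and variances are invented for the example.

def fuse(estimates):
    """estimates: list of (value, variance) pairs from independent sources.
    Returns the fused value and its (smaller) fused variance."""
    weights = [1.0 / var for _, var in estimates]
    fused_value = sum(w * v for w, (v, _) in zip(weights, estimates)) / sum(weights)
    fused_variance = 1.0 / sum(weights)
    return fused_value, fused_variance

# Three sensors reporting the same temperature with different noise levels.
readings = [(21.8, 0.40), (22.3, 0.10), (20.9, 0.90)]
value, variance = fuse(readings)
print(f"fused estimate: {value:.2f} +/- {variance:.3f}")
```

The more reliable (lower-variance) source dominates the fused estimate, and the fused variance is smaller than that of any single source, which is the basic payoff of combining information from multiple sources.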
Text-mining Needs and Solutions for the Biomolecular Interaction Network Database (BIND)

Ian Donaldson
Blueprint Initiative, Mount Sinai Hospital
Toronto, Ontario, Canada
ian.donaldson@mshri.on.ca

Proteomics represents a collection of experimental approaches that may be used to investigate biological systems. Such approaches commonly produce vast amounts of data that relate to physical interactions between biomolecules. One challenge in extracting useful information from these data is determining how they relate to current knowledge.

The goal of the BIND database (Biomolecular Interaction Network Database) is to curate and archive these interaction data from the literature using a standard data representation so that they may be effectively used for knowledge discovery (http://bind.ca) (Bader et al., 2001; Bader and Hogue, 2000). This database facilitates placing experimental data into context. For instance, a biologist may be presented with evidence suggesting that several different proteins may interact with their protein of interest. One of the first obvious questions is: "Is there any evidence to support any of these potential interactions?" These questions may be answered using a number of approaches. For each of the potential interactions:
1) Are there any matching interaction records in BIND or some other interaction database?
2) Are there any interaction records that involve proteins with similar sequences?
3) Are there any interaction records that involve proteins with similar conserved domain profiles?
4) Is the potential interaction likely given the Gene Ontology annotation associated with the two interacting proteins?
5) What are the synonyms for the two potential interactors and do these synonyms ever co-occur in the literature?

Answering each of these questions requires addressing a number of technical issues, which, in principle, are trivial. However, in practice, it is non-trivial to solve all of these problems to completion and to solve them consistently. Failing to do so means that knowledge that may support a potential interaction will be lost. This is unacceptable since much of proteomics is about filtering meaningful data away from noise.

Interestingly, solving these questions is also of interest to text-miners. Mentions of any two proteins in text may be viewed as a potential interaction. A set of potential interactions may be sorted according to the answers to the above questions.

I will describe here the ongoing efforts to incorporate the functionality of a text-mining tool called PreBIND (Donaldson et al., 2003) into a larger bioinformatics application programming platform called SeqHound. This platform already incorporates the NCBI's GenBank sequence database, Molecular Modelling Database, LocusLink and Conserved Domain database as well as functional annotation from the Gene Ontology consortium and (in the near future) interaction data from BIND (the Biomolecular Interaction Network Database). SeqHound is freely available via a web interface and an application programming interface in C, C++, PERL and Java (Michalickova et al., 2002). I envision that this system will be used by biologists to examine interaction data from high-throughput proteomics studies. In addition, it may also be used by text-miners to help generate and submit preliminary BIND records to the BIND database.

BioLINK 2004: Linking Biological Literature, Ontologies, and Databases, pp. 50-51. Association for Computational Linguistics.

References
Bader, G. D., Donaldson, I., Wolting, C., Ouellette, B. F., Pawson, T., and Hogue, C. W. (2001). BIND--The Biomolecular Interaction Network Database. Nucleic Acids Res 29, 242-245.
Bader, G. D., and Hogue, C. W. (2000). BIND--a data specification for storing and describing biomolecular interactions, molecular complexes and pathways. Bioinformatics 16, 465-477.
Donaldson, I., Martin, J., De Bruijn, B., Wolting, C., Lay, V., Tuekam, B., Zhang, S., Baskin, B., Bader, G. D., Michalickova, K., et al. (2003). PreBIND and Textomy - mining the biomedical literature for protein-protein interactions using a support vector machine. BMC Bioinformatics 4, 11.
Michalickova, K., Bader, G. D., Dumontier, M., Lieu, H. C., Betel, D., Isserlin, R., and Hogue, C. W. (2002). SeqHound: biological sequence and structure database as a platform for bioinformatics research. BMC Bioinformatics 3, 32.
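The five screening questions listed in the abstract above amount to a simple evidence checklist against which candidate interactions can be ranked. The sketch below is only an illustration of that idea; the evidence flags are assumed to be supplied by external lookups (interaction databases, sequence and domain searches, GO annotation, literature co-occurrence), and the field and protein names are hypothetical, not the PreBIND or SeqHound API.

def evidence_score(candidate):
    """Count how many of the five evidence questions are answered positively
    for one candidate interaction (a dict of boolean flags)."""
    checks = [
        "matching_interaction_record",    # question 1
        "similar_sequence_interaction",   # question 2
        "similar_domain_interaction",     # question 3
        "go_annotation_compatible",       # question 4
        "synonyms_cooccur_in_literature", # question 5
    ]
    return sum(1 for c in checks if candidate.get(c, False))

# Hypothetical candidate pairs produced by a high-throughput experiment.
candidates = [
    {"pair": ("ProteinA", "ProteinB"),
     "matching_interaction_record": True,
     "go_annotation_compatible": True,
     "synonyms_cooccur_in_literature": True},
    {"pair": ("ProteinC", "ProteinD"),
     "similar_sequence_interaction": True},
]
for c in sorted(candidates, key=evidence_score, reverse=True):
    print(c["pair"], evidence_score(c))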
Exploration on the Multi-level Information Extraction Method of Chinese Electronic Medical Records

WU Cheng¹, XU Lei², QIN Ying-yi¹, HE Qian¹, WANG Zhi-yong*
¹ Department of Military Health Statistics, Naval Medical University, 800 Xiangyin Road, Yangpu District, Shanghai 200433
² Finance and Economics Center, No. 905 Hospital of the Navy, 1328 Huashan Road, Changning District, Shanghai 200052
* Corresponding author: Information Department, the First Affiliated Hospital of Naval Medical University, 168 Changhai Road, Yangpu District, Shanghai 200433
Fund projects: Shanghai Natural Science Foundation (No. 19ZR1469800); subproject of a Major Project of Scientific Research of the Logistics of the Whole Army (No. AWS15J005-4); "234 Discipline Peak-climbing Plan" of the First Affiliated Hospital of Naval Medical University (No. 2019YBZ002)

Abstract
Objective: To explore a new multi-level information extraction model, so as to improve the current electronic medical record information extraction technology based mainly on "medical dictionaries" and "regular expressions". Methods: A document category prediction module and a classification model are used to distinguish different medical record documents and their sections; on this basis, "rules + deep learning models" are used to build the corresponding information extraction models according to the characteristics of different text information, and different entities and their semantic relations are identified and established. Results: Through inductive identification and hierarchical modeling of document categories, sections and entity attributes, the multi-dimensional analysis and classified storage of various information in medical text is realized. Conclusion: The multi-level information extraction method lays a solid foundation for the intelligent application of electronic medical records, and has practical significance for optimizing the diagnosis and treatment mode, assisting clinical decision-making and promoting knowledge sharing.

Keywords: electronic medical records, natural language processing, named entity recognition, multi-level information extraction
Doi: 10.3969/j.issn.1673-7571.2020.06.009
[Chinese Library Classification] R319  [Document code] A
Citation: WU Cheng, XU Lei, QIN Ying-yi, et al. Exploration on the Multi-level Information Extraction Method of Chinese Electronic Medical Records. China Digital Medicine, 2020, 15(06): 29-31.

1 Introduction
The electronic medical record (EMR) covers the whole process of disease onset, progression, treatment and outcome from a patient's admission to discharge, and is an important data source for medical staff and researchers to gain an in-depth understanding of disease characteristics, medication use, treatment methods and prognosis [1].
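To make the "rules + deep learning" idea in the abstract above concrete, here is a minimal illustrative sketch of a two-stage pipeline: a crude section classifier followed by rule-based entity extraction, with a hook where a trained NER model would plug in. The section names, patterns and example sentence are invented for illustration and do not reproduce the system described in the paper.

import re

SECTION_KEYWORDS = {                 # stage 1: document/section classification
    "chief_complaint": ["主诉"],
    "present_illness": ["现病史"],
    "diagnosis": ["诊断"],
}

ENTITY_RULES = [                     # stage 2: rule-based extraction
    ("symptom", re.compile(r"(头痛|发热|咳嗽)")),
    ("duration", re.compile(r"\d+[天周月年]")),
]

def classify_section(header):
    for label, keywords in SECTION_KEYWORDS.items():
        if any(k in header for k in keywords):
            return label
    return "other"

def extract_entities(text, model=None):
    entities = [(label, m.group()) for label, pat in ENTITY_RULES
                for m in pat.finditer(text)]
    if model is not None:            # a trained NER model would be called here
        entities.extend(model(text))
    return entities

section = classify_section("主诉")
print(section, extract_entities("反复头痛3天,伴发热。"))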
Infrastructure for Web Explanations

Deborah L. McGuinness and Paulo Pinheiro da Silva
Knowledge Systems Laboratory, Stanford University, Stanford CA 94305
{dlm,pp}@

Abstract
The Semantic Web lacks support for explaining knowledge provenance. When web applications return answers, many users do not know what information sources were used, when they were updated, how reliable the source was, or what information was looked up versus derived. The Semantic Web also lacks support for explaining reasoning paths used to derive answers. The Inference Web (IW) aims to take opaque query answers and make the answers more transparent by providing explanations. The explanations include information concerning where answers came from and how they were derived (or retrieved). In this paper we describe an infrastructure for IW explanations. The infrastructure includes: an extensible web-based registry containing details on information sources, reasoners, languages, and rewrite rules; a portable proof specification; and a proof and explanation browser. Source information in the IW registry is used to convey knowledge provenance. Representation and reasoning language axioms and rewrite rules in the IW registry are used to support proofs, proof combination, and semantic web agent interoperability. The IW browser is used to support navigation and presentations of proofs and their explanations. The Inference Web is in use by two Semantic Web agents using an embedded reasoning engine fully registered in the IW. Additional reasoning engine registration is underway in order to help provide input for evaluation of the adequacy, breadth, and scalability of our approach.

1 Introduction
Inference Web (IW) aims to enable applications to generate portable and distributed explanations for any of their answers. IW addresses needs that arise with systems performing reasoning and retrieval tasks in heterogeneous environments such as the web. Users (humans and computer agents) need to decide when to trust answers from varied sources. We believe that the key to trust is understanding. Explanations of knowledge provenance and derivation history can be used to provide that understanding [16]. In the simplest case, users would retrieve information from individual or multiple sources and they may need knowledge provenance (e.g., source identification, source recency, authoritativeness, etc.) before they decide to trust an answer. Users may also obtain information from systems that manipulate data and derive information that was implicit rather than explicit. Users may need to inspect the deductive proof trace that was used to derive implicit information before they trust the system answer. Many times proof traces are long and complex so users may need the proof transformed (or abstracted) into something more understandable that we call an explanation. Some users may agree to trust the deductions if they know what reasoner was used to deduce answers and what data sources were used in the derivation. Users may also obtain information from hybrid and distributed systems and they may need help integrating answers and solutions. As web usage grows, a broader and more distributed array of information services becomes available for use and the needs for explanations that are portable, sharable, and reusable grows. Inference Web addresses the issues of knowledge provenance with its registry infrastructure. It also addresses the issues of proof tracing with its browser. It addresses the issues of explanations (proofs transformed by rewrite rules for understandability) with its language axioms and rewrite rules. IW addresses the needs for combination and sharing with its portable proof specification.

In this paper, we include a list of explanation requirements gathered from past work, literature searches, and from surveying users. We present the Inference Web architecture and provide a description of the major IW components including the portable proof specification, the registry [17] (containing information about inference engines, proof methods, ontologies, and languages and their axioms), the explanation mechanism, and the justification browser. We also provide some simple usage examples. We conclude with a discussion of our work in the context of explanation work and state our contributions with respect to trust and reuse.

2 Background and Related Work
Recognition of the importance of explanation components for reasoning systems has existed in a number of fields for many years. For example, from the early days in expert systems (e.g., MYCIN [18]), expert systems researchers identified the need for systems that understood their reasoning processes and could generate explanations in a language understandable to its users. Inference Web attempts to stand on the shoulders of past work in expert systems, such as MYCIN and the Explainable Expert System [20] on generating explanations. IW also builds on the learnings of explanation in description logics (e.g., [1, 2, 13, 14]) which attempt to provide a logical infrastructure for separating pieces of logical proofs and automatically generating follow-up questions based on the logical format. IW goes beyond this work in providing an infrastructure for explaining answers in a distributed, web-based environment possibly integrating many question answering agents using multiple reasoners. IW also attempts to integrate learnings from the theorem proving community on proof presentation (e.g., [4, 9]) and explanation (e.g., [12]), moving from proof tracing presentation to abstractions and understandable explanations. IW attempts to learn from this and push the explanation component started in Huang's work and also add the emphasis on provenance and distributed environments.

The work in this paper also builds on experience designing query components for frame-like systems [3, 10, 13] to generate requirements. The foundational work in those areas typically focuses on answers and only secondarily on information supporting the understanding of the answers. In our requirements gathering effort, we obtained requirements input from contractors in DARPA-sponsored programs concerning knowledge-based applications (the High Performance Knowledge Base program, the Rapid Knowledge Formation Program, and the DARPA Agent Markup Language Program) and more recently, the ARDA AQUAINT and NIMD programs and DARPA's IPTO Office programs. We also gathered requirements from work on the usability of knowledge representation systems (e.g., [15]) and ontology environments (e.g., [8]). We have also gathered needs from the World Wide Web Consortium efforts on CWM and the related reasoner effort on Euler. Finally, we gathered knowledge provenance requirements from the programs above and from previous work on data provenance from the database community (e.g., [5]).

3 Requirements
If humans and agents need to make informed decisions about when and how to use answers from applications, there are many things to consider. Decisions will be based on the quality of the source information, the suitability and quality of the reasoning/retrieval engine, and the context of the situation. Particularly for use on the web, information needs to be available in a distributed environment and be interoperable across applications.

3.1 Support for Knowledge Provenance
Even when search engines or databases simply retrieve asserted or "told" information, users (and agents) may need to understand where the source information came from with varying degrees of detail. Similarly, even if users are willing to trust the background reasoner in a question answering environment, they may need to understand where the background reasoner obtained its ground facts. Information about the origins of asserted facts, sometimes called provenance, may be viewed as meta information about told information. Knowledge provenance requirements may include:
• Source name (e.g., CIA World Fact Book). If facts are encountered in multiple sources, any integrated solution needs to have a way of identifying from which source information was taken.
• Date and author(s) of original information and any updates
• Authoritativeness of the source (is this knowledge store considered or certified as reliable by a third party?)
• Degree of belief (is the author certain about the information?)
• Degree of completeness (Within a particular scope, is the source considered complete? For example, does this source have information about all of the employees of a particular organization up until some date? If so, not finding information about a particular employee would mean that this person is not employed, counting employees would be an accurate response to number of employees, etc.)
The information above could be handled with meta information about content sources and about individual assertions. Additional types of information may be required if users need to understand the meaning of terms or implications of query answers.
• Term or phrase meaning (in natural language or a formal language)
• Term inter-relationships (ontological relations including subclass, superclass, part-of, etc.)

3.2 Support for Reasoning Information
Once systems do more than simple retrieval, additional requirements result. If information is manipulated as a result of integration, synthesis, abstraction, deduction, etc., then users may need access to a trace of the manipulations performed along with information about the manipulations as well as information about the provenance. We refer to this as reasoning traces or proof traces. Requirements as a result of reasoning may include the following:
• The reasoner used
• Reasoning method (e.g., tableaux, model elimination, etc.)
• Inference rules supported by the reasoner
• Reasoner soundness and completeness properties
• Reasoner assumptions (e.g., closed world vs. open world, unique names assumption, etc.)
• Reasoner authors, version, etc.
• Detailed trace of inference rules applied (with appropriate variable bindings) to provide conclusion
• Term coherence (is a particular definition incoherent?)
• Were assumptions used in a derivation? If so, have the assumptions changed?
• Source consistency (is there support in a system for both A and ¬A?)
• Support for alternative reasoning paths to a single conclusion

3.3 Support for Explanation Generation
While knowledge provenance and proof traces may be enough for expert logicians when they attempt to understand why an answer was returned, usually they are inadequate for a typical user. For our purposes, we view an explanation as a transformation of a proof trace into an understandable justification for an answer. With this view in mind, we consider techniques for taking proofs and proof fragments and rewriting them into abstractions that produce the foundation for what is presented to users. In order to handle rewriting, details of the representation and reasoning language must be captured along with their intended semantics. Requirements for explanations may include:
• Representation language descriptions (e.g., DAML+OIL, OWL, RDF, ...)
• Axioms capturing the semantics of the representation languages
• Description of rewriting rules based on language axioms

3.4 Support for Distributed Proofs
Much of the past work on explanation, whether from expert systems, theorem proving, or description logics, has focused on single systems or integrated systems that either use a single reasoner or use one integrated reasoning system. Systems being deployed on the web are moving to distributed environments where source information is quite varied and sometimes question answering systems include hybrid reasoning techniques. Additionally, multi-agent systems may provide inference by many applications. Thus many additional requirements for proofs and their explanations may arise from a distributed architecture. Some requirements we are addressing are listed below:
• Reasoner result combinations (if a statement is proved by one system and another system uses that statement as a part of another proof, then the second system needs to have access to the proof trace from the first system).
• Portable proof interlingua (if two or more systems need to share proof fragments, they need a language for sharing proofs).

3.5 Support for Proof Presentation
If humans are expected to view proofs and their explanations, presentation support needs to be provided. Human users will need some help in asking questions, obtaining manageable size answers, asking follow-up questions, etc. Additionally, even agents need some control over proof requests. If agents request very large proofs, they may need assistance in breaking them into appropriate size portions and also in asking appropriate follow-up questions. Requirements for proof presentation may include:
• A method for asking for explanations (or proofs)
• A way of breaking up proofs into manageable pieces
• A method for pruning proofs and explanations to help the user find relevant information
• A method for allowing proof and explanation navigation (including the ability to ask follow-up questions)
• A presentation solution compatible with web browsers
• A way of seeing alternative justifications for answers

4 Use Cases
Every combination of a query language with a query-answering environment is a potential new context for the Inference Web. We provide two motivating scenarios. Consider the situation where someone has analyzed a situation previously and wants to retrieve this analysis. In order to present the findings, the analyst may need to defend the conclusions by exposing the reasoning path used along with the source of the information. In order for the analyst to reuse the previous work, s/he will also need to decide if the source information used previously is still valid (and possibly if the reasoning path is still valid).

Another simple motivating example arises when a user asks for information from a web application and then needs to decide whether to act on the information. For example, a user might use a search engine interface or a query language such as DQL for retrieving information such as "zinfandels from Napa Valley" or "wine recommended for serving with a spicy red meat meal" (as exemplified in the wine agent example in the OWL guide document [19]). A user might ask for an explanation of why the particular wines were recommended as well as why any particular property of the wine was recommended (like flavor, body, color, etc.). The user may also want information concerning whose recommendations these were (a wine store trying to move its inventory, a wine writer, etc.). In order for this scenario to be operationalized, we need to have the following:
• A way for applications (reasoners, retrieval engines, etc.) to dump justifications for their answers in a format that others can understand. This supports the distributed proofs requirements above. To solve this problem we introduce a portable and sharable proof specification.
• A place for receiving, storing, manipulating, annotating, comparing, and returning meta information used to enrich proofs and proof fragments. To address this requirement, we introduce the Inference Web registry for storing the meta information and the Inference Web registrar web application for handling the registry. This addresses the issues related to knowledge provenance.
• A way to present justifications to the user. Our solution to this has multiple components. First the IW browser is capable of navigating through proof dumps provided in the portable proof format. It can display multiple formats including KIF and English. Additionally, it is capable of using rewrite rules (or tactics) to abstract proofs in order to provide more understandable explanations. This addresses the issues related to reasoning, explanations, and presentation.

5 Inference Web
Inference Web contains both data used for proof manipulation and tools for building, maintaining, presenting, and manipulating proofs. Figure 1 presents an abstract and partial view of the Inference Web framework. There, Inference Web data includes proofs and explanations published anywhere on the web. Inference and search engines can generate proofs using the IW format. The explainer, an IW tool, can abstract proofs into explanations. Inference Web data also has a distributed repository of meta-data including sources, inference engines, inference rules and ontologies. In addition to the explainer, Inference Web tools include a registrar for interacting with the registry, a browser for displaying proofs, and planned future tools such as proof web-search engines, proof verifiers, proof combinators, and truth maintenance systems. In this paper, we limit our discussion to the portable proofs (and an associated parser), registry (and the associated registrar tools), explanations, and the browser.

Figure 1: Inference Web framework overview.

5.1 Portable Proof
The Inference Web provides a proof specification written in the web markup language DAML+OIL [7]. Proofs dumped in the portable proof format become a portion of the Inference Web data used for combining and presenting proofs and for generating explanations. Our portable proof specification includes two major components of IW proof trees: inference steps and node sets. Proof metadata as described in Section 5.2 are the other components of our proof specification. Figure 2 presents a typical dump of an IW node set. Each node set is labeled by a well formed formula (WFF) written in KIF. (In this example, the node set is labeled with a WFF stating that the color of W1 is ?x, or the value of the color property of Wine1 is the item of interest.) The node set represents a statement and the last step in a deductive path that led a system to derive the statement. It is a set because there could be multiple deductive paths leading to the statement. Figure 2 shows an instance of a node set, an inference step, and a reference to an inference rule. There is no source associated with this node set since it is derived (although it could be derived and associated with a source). If it had been asserted, it would require an association to a source, which is typically an ontology that contains it. In general, each node set can be associated with multiple, one, or no inference steps as described by the iw:isConsequentOf property of the node set in Figure 2. A proof can then be defined as a tree of inference steps explaining the process of deducing the consequent sentence. Thus, a proof can physically vary from a single file containing all its node sets to many files, each one containing a single node set. Also, files containing node sets can be distributed in the web. Considering the IW requirement that proofs need to be combinable, it is important to emphasize that an IW proof is a forest of trees since the nodes of a proof tree are sets of inference steps. In contrast with typical proof trees that are composed of nodes rather than node sets, every theorem in an IW proof can have multiple justifications.

An inference step is a single application of an inference rule, whether the rule is atomic or derived as discussed in Section 5.2. Inference rules (such as modus ponens) can be used to deduce a consequent (a well formed formula) from any number of antecedents (also well formed formulae). Inference steps contain pointers to proof nodes of its antecedents, the inference rule used, and any variable bindings used in the step. The antecedent sentence in an inference step may come from inference steps in other node sets, existing ontologies, extraction from documents, or they may be assumptions.

<?xml version='1.0'?>
<rdf:RDF (...)>
  <iw:NodeSet rdf:about='../sample/IW1.daml#IW1'>
    <iw:NodeSetContent>
      <iw:KIF>
        <iw:Statement>(wines:COLOR W1 ?x)</iw:Statement>
      </iw:KIF>
    </iw:NodeSetContent>
    <iw:isConsequentOf rdf:parseType='daml:collection'>
      (a NodeSet can be associated to a set of Inference steps)
      <iw:InferenceStep>
        <iw:hasInferenceRule rdf:parseType='daml:collection'>
          <iw:InferenceRule rdf:about='../registry/IR/GMP.daml'/>
        </iw:hasInferenceRule>
        <iw:hasInferenceEngine rdf:parseType='daml:collection'>
          <iw:InferenceEngine rdf:about='../registry/IE/JTP.daml'/>
        </iw:hasInferenceEngine>
        (...)
        <iw:hasAntecedent rdf:parseType='daml:collection'>
          (inference step antecedents are IW files with their own URIs)
          <iw:NodeSet rdf:about='../sample/IW3.daml#IW3'/>
          <iw:NodeSet rdf:about='../sample/IW4.daml#IW3'/>
        </iw:hasAntecedent>
        <iw:hasVariableMapping rdf:type='/2001/03/daml+oil#List'/>
        (...)
      </iw:InferenceStep>
    </iw:isConsequentOf>
  </iw:NodeSet>
</rdf:RDF>

Figure 2: An Inference Web Proof.

With respect to a query, a logical starting point for a proof in Inference Web is a proof fragment containing the last inference step used to derive a node set that contains the answer sentence for the query. Any inference step can be presented as a stand-alone, meaningful proof fragment as it contains the inference rule used with links to its antecedents and variable bindings. The generation of proof fragments is a straightforward task once inference engine data structures storing proof elements are identified as IW components. To facilitate the generation of proofs, the Inference Web provides a web service that dumps proofs from IW components and uploads IW components from proofs. This service is a language-independent facility used to dump proofs. Also, it is a valuable mechanism for recording the usage of registry entries.

The IW infrastructure can automatically generate follow-up questions for any proof fragment by asking how each antecedent sentence was derived. The individual proof fragments may be combined together to generate a complete proof, i.e., a set of inference steps culminating in inference steps containing only asserted (rather than derived) antecedents. When an antecedent sentence is asserted, there are no additional follow-up questions required and that ends the complete proof generation. The specification of IW concepts used in Figure 2 is available at /software/IW/spec/iw.daml.

5.2 Registry
The IW registry is a hierarchical interconnection of distributed repositories of information relevant to proofs and explanations. Entries in the registry contain the information linked to in the proofs. Every entry in the registry is a file written in DAML+OIL. Also, every entry is an instance of a registry concept. InferenceEngine, Language and Source are the core concepts in the registry. Other concepts in the registry are related to one of these core concepts. In order to interact with the IW registry, the IW provides a web agent registrar that supports users in updating or browsing the registry. The registrar may grant update or access privileges on a concept basis and it may define and implement policies for accessing the registry. The current demonstration registrar is available at :8080/iwregistrar/.

The InferenceEngine is a core concept since every inference step should have a link to at least one entry of InferenceEngine that was responsible for instantiating the inference step itself. For instance, Figure 2 shows that the iw:hasInferenceEngine property of iw:InferenceStep has a pointer to JTP.daml, which is the registry meta information about Stanford's JTP model-elimination theorem prover. Inference engines may have the following properties associated with them: name, URL, author(s), date, version number, organization, etc. InferenceRule is one of the most important concepts associated with InferenceEngine. With respect to an inference engine, registered rules can be either atomic or derived from other registered rules.

A screen shot from an IW registrar interface browsing the entry for the generalized modus ponens (GMP) rule is presented in Figure 3. GMP is an atomic inference rule for JTP. Each of the inference rules includes a name, description, optional example, and optional formal specification. An inference rule is formally specified by a set of sentence patterns for its premises, a sentence pattern for its conclusion, and optional side conditions. Patterns and conditions are specified using KIF and a set of name conventions for KIF arguments. For example, an argument @Si such as the @S1 and @S2 in Figure 3 says that it can be bound to a sentence while @SSi says that it can be bound to a set of sentences. Many reasoners also use a set of derived rules that may be useful for optimization or other efficiency concerns. One individual reasoner may not be able to provide a proof of any particular derived rule but it may point to another reasoner's proof of a rule. Thus, reasoner-specific rules can be explained in the registry before the reasoner is actually used to generate IW proofs. Inference Web thus provides a way to use one reasoner to explain another reasoner's inference rules. (This was the strategy used in [2] for example, where the performance tableaux reasoner was explained by a set of natural-deduction style inference rules in the explanation system.) This strategy may be useful for explaining heavily optimized inference engines. Inference Web's registry, when fully populated, will contain inference rule sets for many common reasoning systems. In this case, users may view inference rule sets to help them decide whether to use a particular inference engine.

Figure 3: Sample registry entry for an inference rule.

Inference engines may use specialized language axioms to support a language such as DAML or … . Language is a core IW concept. Axiom sets such as the one specified in [11] may be associated with a Language. The axiom set may be used as a source and specialized rewrites of those axioms may be used by a particular theorem prover to reason efficiently. Thus proofs may depend upon these language-specific axiom sets, called LanguageAxiomSet in the IW. It is worth noting that an entry of Language may be associated with a number of entries of LanguageAxiomSet as different reasoners may find different sets of axioms to be more useful. For example, JTP uses a horn-style set of DAML axioms while another reasoner may use a slightly different set. Also, an entry of an Axiom can be included in multiple entries of LanguageAxiomSet. The content attribute of Axiom entries contains the axiom stated in KIF.

Source is the other core concept of the registry. Source is specialized into five basic classes: Person, Team, Publication, Ontology, and Organization. At the moment, we are expanding the specification of (authoritative) sources as required. Thus, we are keeping a minimal description of these sources in the initial specification used in the IW. Entries of Ontology, for example, describe stores of assertions that may be used in proofs. It can be important to be able to present information such as ontology source, date, version, URL (for browsing), etc. Figure 4 contains a sample ontology registry entry for the ontology used in our wine examples.

Figure 4: Sample registry entry for an ontology.

5.3 Explanations
Although essential for automated reasoning, inference rules such as those used by theorem provers and registered in the registry as InferenceRule entries are often inappropriate for "explaining" reasoning tasks. Moreover, syntactic manipulations of proofs based on atomic inference rules may also be insufficient for abstracting machine-generated proofs into some more understandable proofs [12]. Proofs, however, can be abstracted when they are rewritten using rules derived from axioms and other rules. Axioms in rewriting rules are the elements responsible for aggregating some semantics in order to make the rules more understandable. Entries of DerivedRule are the natural candidates for storing specialized sets of rewriting rules. In the IW, tactics are rules associated with axioms, and are used independent of whether a rule is atomic or derived.

Many intermediate results are "dropped" along with their supporting axioms, thereby abstracting the structure of proofs. The user may always ask follow-up questions and still obtain the detail, however the default explanation provides abstracted explanations. The general result is to hide the core reasoner rules and expose abstractions of the higher-level derived rules. An example of an IW explanation is described in the Inference Web web page at: /software/iw/Ex1/. The implementation of the IW explainer is work in progress. The explainer algorithm generates explanations in a systematic way using the derived rules related to a selected language axiom set.
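As a rough illustration of the proof-combination behaviour described in Section 5.1 above, the following sketch expands a node set into a complete proof by recursively asking how each antecedent was derived, stopping at asserted antecedents. It assumes the node sets have already been parsed into plain Python dictionaries; the field names only mirror the figure informally and are not the actual IW schema or API, and the small proof used as input is invented.

def expand_proof(node_set_id, node_sets, depth=0):
    """Print a complete proof tree rooted at node_set_id.

    node_sets maps an identifier to a dict with the node set's
    'statement' and a list of 'inference_steps', each step holding
    its 'rule' and the identifiers of its 'antecedents'.
    Asserted statements are modelled as node sets with no steps.
    """
    node = node_sets[node_set_id]
    indent = "  " * depth
    print(f"{indent}{node['statement']}")
    for step in node.get("inference_steps", []):
        print(f"{indent}  by {step['rule']} from:")
        for antecedent in step["antecedents"]:
            expand_proof(antecedent, node_sets, depth + 2)
    if not node.get("inference_steps"):
        print(f"{indent}  (asserted; no follow-up question needed)")

# A tiny, invented proof: the conclusion follows from two antecedents,
# both of which happen to be asserted rather than derived.
node_sets = {
    "IW1": {"statement": "(wines:COLOR W1 ?x)",
            "inference_steps": [{"rule": "GMP", "antecedents": ["IW3", "IW4"]}]},
    "IW3": {"statement": "(rule: the color of a wine follows from its grape)",
            "inference_steps": []},
    "IW4": {"statement": "(W1 is made from Zinfandel grapes)",
            "inference_steps": []},
}
expand_proof("IW1", node_sets)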
Ontologies and Information Extraction*C. Nédellec1 and A. Nazarenko21 Laboratoire Mathématique, Informatique et Génome (MIG), INRA, Domaine de Vilvert, 78352 F-Jouy-en-Josas cedex2 Laboratoire d’Informatique de Paris-Nord (LIPN), Université Paris-Nord & CNRS, av. J.B. Clément, F-93430 Villetaneuse1 IntroductionAn ontology is a description of conceptual knowledge organized in a computer-based representation while information extraction (IE) is a method for analyzing texts expressing facts in natural language and extracting relevant pieces of infor-mation from these texts.IE and ontologies are involved in two main and related tasks,•Ontology is used for Information Extraction: IE needs ontologies as part of the understanding process for extracting the relevant information; •Information Extraction is used for populating and enhancing the ontology: texts are useful sources of knowledge to design and enrich ontologies.These two tasks are combined in a cyclic process: ontologies are used for inter-preting the text at the right level for IE to be efficient and IE extracts new knowl-edge from the text, to be integrated in the ontology.We will argue that even in the simplest cases, IE is an ontology-driven process. It is not a mere text filtering method based on simple pattern matching and key-words, because the extracted pieces of texts are interpreted with respect to a prede-fined partial domain model. We will show that depending on the nature and the depth of the interpretation to be done for extracting the information, more or less knowledge must be involved.Extracting information from texts calls for lexical knowledge, grammars de-scribing the specific syntax of the texts to be analyzed, as well as semantic and on-tological knowledge. In this chapter, we will not take part in the debate about the limit between lexicon and ontology as a conceptual model. We will rather focus* LIPN Internal Report, 2005. This paper has been originally written in march 2003. A shorter version has been published under the title “Ontology and Information Extraction: A Necessary Symbiosis”, in Ontology Learning from Text: Methods, Evaluation and Ap-plications, edited by P. Buitelaar, P. Cimiano and B. Magnini, IOS Press Publication, July 2005,on the role that ontologies viewed as semantic knowledge bases could play in IE. The ontologies that can be used for and enriched by IE relate conceptual knowl-edge to its linguistic realizations (e.g. a concept must be associated with the terms that express it, eventually in various languages).Interpreting text factual information also calls for knowledge on the domain referential entities that we consider as part of the ontology (Sect. 2.2.1).This chapter will be mainly illustrated in biology, a domain in which there are critical needs for content-based exploration of the scientific literature and which becomes a major application domain for IE.2 SettingsBefore exploring the close relationship that links ontology and IE in Sect. 3 and Sect. 4, we will define Information Extraction and ontology.2.1 What is IE?The considerable development of multimedia communication goes along with an exponentially increasing volume of textual information. Today, mere Information Retrieval (IR) technologies are unable to meet the needs of specific information because they provide information at a document collection level. 
Developing intelligent tools and methods, which give access to document content and extract relevant information, is more than ever a key issue for knowledge and information management. IE is one of the main research fields that attempt to fulfill this need.

2.1.1 Definition

The IE field was initiated by DARPA's MUC program (Message Understanding Conferences), starting in 1987 (MUC Proceedings; Grishman and Sundheim 1996). MUC originally defined IE as the task of (1) extracting specific, well-defined types of information from the text of homogeneous sets of documents in restricted domains and (2) filling pre-defined form slots or templates with the extracted information. MUC also brought about a new evaluation paradigm: comparing automatically extracted information with human-produced results. MUC has inspired a large amount of work in IE and has become a major reference in the text-mining field. Even so, it is still a challenging task to build an efficient IE system with good recall (coverage) and precision (correctness) rates.

A typical IE task is illustrated by Fig. 1, taken from a CMU corpus of seminar announcements (Freitag 1998). The IE process recognizes a name (John Skvoretz) and classifies it as a person name. It also recognizes a seminar event and creates a seminar event form (John Skvoretz is the seminar speaker, whose presentation is entitled "Embedded commitment").

Even in such a simple example, IE should not be considered as a mere keyword filtering method. Filling a form with extracted words and textual fragments involves a part of interpretation. Any fragment must be interpreted with respect to its "context" (i.e. domain knowledge or other pieces of information extracted from the same document) and according to its "type" (i.e. the information is the value of an attribute/feature/role represented by a slot of the form). In the document of Fig. 1, "4-5:30" is understood as a time interval, and background knowledge about seminars is necessary to interpret "4" as "4 pm" and as the seminar starting time.

Form to fill (partial)
  place: ?
  starting time: ?
  title: ?
  speaker: ?

Document:
  Professor John Skvoretz, U. of South Carolina, Columbia, will present a seminar entitled "Embedded commitment", on Thursday, May 4th from 4-5:30 in PH 223D.

Filled form (partial)
  place: PH 223D
  starting time: 4 pm
  title: Embedded commitment
  speaker: Professor John Skvoretz […]

Fig. 1. A seminar announcement event example

2.1.2 IE overall process

Operationally, IE relies on document preprocessing and on extraction rules (or extraction patterns) to identify and interpret the information to be extracted. The extraction rules specify the conditions that the preprocessed text must verify and how the relevant textual fragments can be interpreted to fill the forms. In the simplest case, the textual fragment and the coded information are the same and there is neither text preprocessing nor interpretation.

More precisely, in a typical IE system, three processing steps can be identified (Hobbs et al. 1997; Cowie and Wilks 2000), illustrated by the sketch that follows the list:
1. text preprocessing, whose level ranges from mere segmentation of the text into sentences and of sentences into tokens to a full linguistic analysis;
2. rule selection: the extraction rules are associated with triggers (e.g. keywords); the text is scanned to identify the triggering items and the corresponding rules are selected;
3. rule application, which checks the conditions of the selected rules and fills the forms according to the conclusions of the matching rules.
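As an illustration of these three steps, the following Python sketch applies a tiny trigger-indexed rule base to a simplified version of the announcement of Fig. 1 in order to fill some slots of the form. The rule format, the triggers and the regular expressions are purely illustrative simplifications introduced here; they are not those of any system cited in this chapter.

import re

# Illustrative rule base indexed by triggering keywords. Each rule has a
# condition (a regular expression over the raw sentence) and a conclusion
# (which slot of the seminar form the captured fragment fills).
RULES = {
    "seminar": [
        {"condition": re.compile(r'entitled "([^"]+)"'), "slot": "title"},
        {"condition": re.compile(r"in ([A-Z]+ ?\d+\w*)"), "slot": "place"},
    ],
}

def preprocess(text):
    # Step 1: preprocessing, here reduced to naive sentence segmentation.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def select_rules(sentence):
    # Step 2: rule selection, driven by the triggers found in the sentence.
    selected = []
    for trigger, rules in RULES.items():
        if trigger in sentence.lower():
            selected.extend(rules)
    return selected

def apply_rules(sentence, rules, form):
    # Step 3: rule application, checking each condition and filling the form.
    for rule in rules:
        match = rule["condition"].search(sentence)
        if match and form.get(rule["slot"]) is None:
            form[rule["slot"]] = match.group(1)
    return form

announcement = ('Professor John Skvoretz will present a seminar entitled '
                '"Embedded commitment", on Thursday, May 4th from 4-5:30 in PH 223D.')
form = {"place": None, "starting time": None, "title": None, "speaker": None}
for sentence in preprocess(announcement):
    form = apply_rules(sentence, select_rules(sentence), form)
# form now contains 'Embedded commitment' for title and 'PH 223D' for place;
# the remaining slots would require richer rules and background knowledge.

In a realistic system the preprocessing step would of course involve the linguistic analyses discussed below, and the rule conditions would be far richer than regular expressions.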
Extraction rules. The rules are usually declarative. The conditions are expressed in a logic-based formalism (Fig. 3), in the form of regular expressions, patterns or transducers. The conclusion explains how to identify in the text the value that should fill a slot of the form. The result may be a filled form, as in Figs. 1 and 2, or, equivalently, a labeled text as in Fig. 3.

Sentence: "GerE stimulates the expression of cotA."
Rule
  Conditions: X = "expression of"
  Conclusions: Interaction_Target <- next-token(X).
Filled form: Interaction_Target: cotA

Fig. 2. IE partial example from functional genomics

Experiments have been made with various kinds of rules, ranging from the simplest ones (Riloff 1993) (e.g. the subject of the passive form of the verb "murder" is interpreted as a victim) to sophisticated ones as in (Soderland et al. 1995). The more explicit (i.e. the more semantic and conceptual) the IE rule, the more powerful, concise and understandable it is. However, it requires the input text to be parsed and semantically tagged.

A single-slot rule extracts a single value, as in Fig. 2, while a multi-slot rule correctly extracts at the same time all the values for a given form, as in Fig. 3, even if there is more than one event reported in the text fragment.
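To make the mechanics of such a single-slot rule concrete, the sketch below applies the rule of Fig. 2 to a tokenized sentence. The token-based representation and the helper functions are simplifications introduced here for illustration, not part of an existing system.

def tokenize(sentence):
    # Minimal preprocessing: strip the final period and split on whitespace.
    return sentence.rstrip(".").split()

def apply_interaction_rule(tokens):
    # Condition: the text contains the trigger "expression of" (X).
    # Conclusion: Interaction_Target <- next-token(X).
    form = {}
    for i in range(len(tokens) - 2):
        if tokens[i] == "expression" and tokens[i + 1] == "of":
            form["Interaction_Target"] = tokens[i + 2]
    return form

tokens = tokenize("GerE stimulates the expression of cotA.")
print(apply_interaction_rule(tokens))   # {'Interaction_Target': 'cotA'}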
IE forms. Extraction usually proceeds by filling forms of increasing complexity (Wilks 1997):
• Filling entity forms aims at identifying the items representing the domain referential entities. These items are called "named entities" (e.g. Analysis & Technology Inc.) and are assimilated to proper names (company, person, gene names), but they can be any kind of word or expression that refers to a domain entity: dates, numbers, titles for the management succession MUC-6 application, bedrooms in a real-estate IE application (Soderland 1999).
• Filling domain event forms: the information about the events extracted by the rules is then encoded into forms in which a specific event of a given type and its role fillers are described. An entity form may fill an event role.
• Merging forms that are issued from different parts of the text but provide information about the same entity or event.
• Assembling scenario forms: ideally, various event and entity forms can be further organized into a larger scenario form describing a temporal or logical sequence of actions/events.

Text processing. As shown in Fig. 3, the condition part of the extraction rules may check the presence of a given lexical item (e.g. the verb named), the syntactic category of words and their syntactic dependencies (e.g. object and subject relations). Different clues such as typographical characteristics, relative position of words, semantic tags¹ or even coreference relations can also be exploited.

¹ E.g., if the verbs "named", "appointed" and "elected" of Fig. 3 were all known as 'nomination' verbs, the fourth condition of the rule could have been generalized to their semantic category 'nomination'.

Most IE systems therefore involve linguistic text processing and semantic knowledge: segmentation into words, morpho-syntactic tagging (the part-of-speech categories of words are identified), syntactic analysis (sentence constituents such as noun or verb phrases are identified and the structure of complex sentences is analyzed) and sometimes additional processing: lexical disambiguation, semantic tagging or anaphora resolution.

Sentence: "NORTH STONINGTON, Connecticut (Business Wire) - 12/2/94 - Joseph M. Marino and Richard P. Mitchell have been named senior vice president of Analysis & Technology Inc. (NASDAQ NMS: AATI), Gary P. Bennett, president and CEO, has announced."

Rule
Conditions:
  noun-phrase (PNP, head(isa(person-name))),
  noun-phrase (TNP, head(isa(title))),
  noun-phrase (CNP, head(isa(company-name))),
  verb-phrase (VP, type(passive), head(named or elected or appointed)),
  preposition (PREP, head(of or at or by)),
  subject (PNP, VP),
  object (VP, TNP),
  post_nominal_prep (TNP, PREP),
  prep_object (PREP, CNP)
Conclusion:
  management_appointment (M, person(PNP), title(TNP), company(CNP)).
Comment:
  if there is a noun phrase (NP) whose head is a person name (PNP), an NP whose head is a title name (TNP), an NP whose head is a company name (CNP), a verb phrase whose head is a passive verb (named, elected or appointed), and a preposition of, at or by,
  if PNP and TNP are respectively subject and object of the verb,
  and if CNP modifies TNP,
  then it can be stated that the person "PNP" is named "TNP" of the company "CNP".

Labeled document
NORTH STONINGTON, Connecticut (Business Wire) - 12/2/94 - <Person>Joseph M. Marino and Richard P. Mitchell</Person> have been named <Title>senior vice president</Title> of <Company>Analysis & Technology Inc</Company>. (NASDAQ NMS: AATI), Gary P. Bennett, president and CEO, has announced.

Fig. 3. Example from MUC-6, a newswire about management succession
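The following sketch suggests how conditions of this kind can be checked once the sentence has been analyzed. It evaluates a deliberately simplified version of the rule of Fig. 3 against a hand-built shallow parse; the phrase and dependency representation is invented here for illustration and does not correspond to the output format of any particular parser.

# A hand-built shallow parse of the MUC-6 sentence: phrases with a head and a
# semantic type, plus syntactic dependencies between phrase identifiers.
PHRASES = {
    "np1": {"head": "Joseph M. Marino", "type": "person-name"},
    "np2": {"head": "senior vice president", "type": "title"},
    "np3": {"head": "Analysis & Technology Inc", "type": "company-name"},
    "vp1": {"head": "named", "voice": "passive"},
}
DEPENDENCIES = {
    ("subject", "np1", "vp1"),
    ("object", "vp1", "np2"),
    ("prep_object_of", "np2", "np3"),
}

NOMINATION_VERBS = {"named", "elected", "appointed"}

def management_appointment(phrases, deps):
    # Check the (simplified) conditions of the rule and, if they all hold,
    # return a multi-slot form filled with the three role players at once.
    for verb_id, vp in phrases.items():
        if vp.get("voice") == "passive" and vp["head"] in NOMINATION_VERBS:
            person = next((p for rel, p, v in deps
                           if rel == "subject" and v == verb_id
                           and phrases[p]["type"] == "person-name"), None)
            title = next((t for rel, v, t in deps
                          if rel == "object" and v == verb_id
                          and phrases[t]["type"] == "title"), None)
            if person and title:
                company = next((c for rel, t, c in deps
                                if rel == "prep_object_of" and t == title
                                and phrases[c]["type"] == "company-name"), None)
                return {"person": phrases[person]["head"],
                        "title": phrases[title]["head"],
                        "company": phrases[company]["head"] if company else None}
    return None

print(management_appointment(PHRASES, DEPENDENCIES))
# {'person': 'Joseph M. Marino', 'title': 'senior vice president',
#  'company': 'Analysis & Technology Inc'}

The point is that the rule operates on syntactic and semantic properties of phrases rather than on raw character strings, which is what makes multi-slot extraction possible.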
However, the role and the scope of this analysis differ from one IE system to another. Text analysis can be performed either as preprocessing or during extraction rule application. In the former case, the whole text is first analyzed. The analysis is global in the sense that items spread all over the document can contribute to building the normalized and enriched representation of the text. Then, the application of extraction rules comes down to a simple filtering of the enriched representation. In the latter case, the text analysis is driven by the verification of the rule conditions. The analysis is local, focuses on the context of the triggering items of the rules, and fully depends on the conditions to be checked in the selected rules.

In the first IE systems (Hobbs et al. 1997), local and goal-driven analysis was preferred to full text preanalysis in order to increase efficiency, and the text preprocessing step was kept to a minimum. Although costly, data-driven, full-text analysis and normalization can improve the IE process in various ways. (1) It improves further NL processing steps; e.g. syntactic parsing improves attachment disambiguation (Basili et al. 1993) or coreference resolution. (2) Full-text analysis and normalization also facilitate the discovery of lexical and linguistic regularities in specific documents. This idea, initially promoted by work on sublanguages (Harris 1968; Sager et al. 1987) for tuning NL processing to a given type of texts, is now popularized by Machine Learning (ML) work on learning extraction rules in the IE field. There are two main reasons for that. First, annotating training data is costly, and the quantity of data to be annotated decreases with normalization (the less variation in the data, the less data annotation is needed). Second, ML systems tend to learn non-understandable rules by picking up details in training examples that do not look related. Normalizing the text by representing it in a more abstract way increases the understandability of the learned rules. However, normalization also raises problems, such as the biased choice of the right representation before learning, which is not dealt with in the IE literature.

We will see in the following that these two approaches, in which text analysis is respectively used for interpretation (goal-driven) and normalization (data-driven), are closely intertwined, as any normalization process involves a part of interpretation. One of the difficulties in designing IE systems is to set the limit between local and global analysis. Syntactic analysis or entity recognition can be performed on a local basis but is improved by knowledge inferred at the global level. Thus, ambiguous cases of syntactic attachment or entity classification can be solved by comparison with non-ambiguous similar cases in the same document.

2.1.3 IE, an ambitious approach to text exploration

As mentioned above, there is a need for tools that give real access to document content. IE and Question Answering (Q/A) tasks both try to identify in documents the pieces of information that are relevant to a given query. They differ, however, in the type of information that is looked for. A Q/A system has to answer a wide range of unpredictable user questions. In IE, the information that is looked for is richer, but its type is known in advance. The relevant pieces of text have to be identified and then interpreted with respect to the knowledge partially represented in the forms to fill.

IE and Q/A systems both differ in their empiricism from their common ancestors, the text-understanding systems. They both rely on targeted and local techniques of text exploration rather than on a large-coverage and in-depth semantic analysis of the text. The MUC competition framework has gathered a large and stable IE community. It has also drawn research towards easily implementable and efficient methods rather than strong and well-founded NLP theories.

The role of semantics in IE is often reduced to very shallow semantic labeling. Semantic analysis is considered as a way to disambiguate syntactic steps rather than as a way to build a conceptual interpretation. Today, most of the IE systems that involve semantic analysis exploit the simplest part of the whole spectrum of domain and task knowledge, that is to say, named entities. However, the growing need for IE in domains such as functional genomics, which require more text understanding, pushes towards more sophisticated semantic knowledge resources and thus towards ontologies viewed as conceptual models, as will be shown in this chapter.

2.2 What is an Ontology in the IE framework?

Even though ontologies usually do not appear as an autonomous component or resource in IE systems, we argue that IE relies on ontological knowledge.

2.2.1 Ontologies populated with referential entities

The ontology identifies the entities that have a form of existence in a given domain and specifies their essential properties. It does not describe the accidental properties of these entities. On the contrary, the goal of IE is to extract factual knowledge to instantiate one or several predefined forms. The structure of the form (e.g. Fig. 4) is a matter of ontology, whereas the values of the filled template usually reflect factual knowledge (as shown in Fig. 2 above) that is not part of the ontology. In these examples, the form to fill represents a part of the biological model of gene regulation networks: proteins interact positively or negatively with genes. In Sect. 3.4, we will show that IE is ontology-driven in that respect.

Interaction
  Type: {negative, positive}
  Agent: any protein
  Target: any gene

Fig. 4. An example of IE form in the genomics domain
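A minimal sketch of this distinction, assuming two hypothetical excerpts of protein and gene lexicons: the slot structure and its typing belong to the ontological level, whereas the values used to fill the form are factual.

# Illustrative only: the slots of the interaction form are constrained by
# ontological classes (protein, gene), while the filled values are factual.
KNOWN_PROTEINS = {"GerE", "SigK"}      # hypothetical excerpt of a protein lexicon
KNOWN_GENES = {"cotA", "cotB"}         # hypothetical excerpt of a gene lexicon

FORM_SCHEMA = {
    "Type": {"negative", "positive"},          # closed set of admissible values
    "Agent": lambda v: v in KNOWN_PROTEINS,    # any protein
    "Target": lambda v: v in KNOWN_GENES,      # any gene
}

def validate(form):
    # Reject a filled form whose values do not respect the ontological typing.
    for slot, constraint in FORM_SCHEMA.items():
        value = form.get(slot)
        ok = value in constraint if isinstance(constraint, set) else constraint(value)
        if not ok:
            return False
    return True

print(validate({"Type": "positive", "Agent": "GerE", "Target": "cotA"}))  # True
print(validate({"Type": "positive", "Agent": "cotA", "Target": "GerE"}))  # False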
The status of the named entities is a pending question. Do they belong to the ontology or are they factual knowledge? From a theoretical point of view, according to Brachman's terminological logics view (1979), they are instances of concepts and, as such, they are described and typed at the assertional level and not at the terminological or ontological level. In this chapter, we will nevertheless consider that entities, being referential entities, are part of the domain ontology, because this is the way IE considers them.

2.2.2 Ontology with a natural language anchorage

Whether one wants to use ontological knowledge to interpret natural language or to exploit written documents to create or update ontologies, in any case the ontology has to be connected to linguistic phenomena: it must be linguistically anchored. In traditional IE systems based on local analysis, a large effort has been devoted to the definition of extraction rules that achieve this anchoring. In the very simple example about gene interaction (Fig. 2 above), the ontological knowledge is encoded as a keyword rule, which can be considered as a kind of compiled knowledge. In more powerful IE systems, the ontological knowledge is stated more explicitly in the rules that bridge the gap between the word level and text interpretation. For instance, the rule of Fig. 3 above states that a management appointment event can be expressed through three verbs (named, elected or appointed). As such, an ontology is not a purely conceptual model; it is a model associated with a domain-specific vocabulary and grammar. In the IE framework, we consider that this vocabulary and grammar are part of the ontology, even when they are embodied in extraction rules.

The complexity of the linguistic anchoring of ontological knowledge is well known and should not be underestimated. A concept can be expressed by different terms, and many words are ambiguous. Rhetoric, such as lexicalized metonymies or elisions, introduces conceptual shortcuts at the linguistic level that must be elicited to be interpreted into domain knowledge. A noun phrase (e.g. "the citizen") may refer to an instance (a specific citizen who has been previously mentioned in the text) or to the class (the set of all citizens), leading to very different interpretations. These phenomena, which illustrate the gap between the linguistic and the ontological levels, strongly affect IE performance. This explains why IE rules are so difficult to design.

2.2.3 Partial ontologies

IE is a targeted textual analysis process. The target information is described in the structure of the forms to fill. As mentioned above (Sect. 2.1.2), MUC has identified various types of forms describing elements or entities, events and scenarios.

IE does not require a whole formal ontological system but only parts of it. We consider that the ontological knowledge involved in IE can be viewed as a set of interconnected and concept-centered descriptions, or "conceptual nodes"². In conceptual nodes, the concept properties and the relations between concepts are explicit. These conceptual nodes should be understood as chunks of a global knowledge model of the domain. We consider here various types of concepts: an object node lists the various properties of the object; an event node describes the various objects involved in the event and their roles; a scenario node describes one or several events involved in the scenario and their interrelations. The use of this type of knowledge in NLP systems is traditional (Schank and Abelson 1977) and is illustrated by MUC tasks.

² We define a conceptual node as a piece of ontological model to which linguistic information can be attached. It differs from the "conceptual nodes" of (Soderland et al. 1995), which are extraction patterns describing a concept. We will see below that several extraction rules may be associated with a unique conceptual node.
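As an illustration, the following sketch shows what a linguistically anchored event node might look like for the gene-interaction domain of Fig. 4. The role typing and the verb lists are invented examples, not an extract from an existing resource.

# Illustrative only: a conceptual (event) node for the gene-interaction domain,
# together with its linguistic anchoring: the verbs that may express the event
# and the polarity each of them conveys.
INTERACTION_NODE = {
    "concept": "Interaction",
    "roles": {"Agent": "Protein", "Target": "Gene"},
    "linguistic_anchors": {
        "stimulate": "positive",
        "activate": "positive",
        "inhibit": "negative",
        "repress": "negative",
    },
}

def interpret_verb(node, verb):
    # Map a verb occurring in the text onto the conceptual node it anchors,
    # returning the event concept and the interaction polarity it conveys.
    polarity = node["linguistic_anchors"].get(verb)
    if polarity is None:
        return None
    return {"concept": node["concept"],
            "Type": polarity,
            "roles": node["roles"]}

print(interpret_verb(INTERACTION_NODE, "inhibit"))
# {'concept': 'Interaction', 'Type': 'negative',
#  'roles': {'Agent': 'Protein', 'Target': 'Gene'}}

Such a node makes explicit both the conceptual side (the event and its typed roles) and the vocabulary through which the event is realized in texts, which is what the notion of linguistic anchoring refers to.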
2.3 Specificity of the ontology-IE relationship

Ontology and IE are closely connected by a mutual contribution. The ontology is required for the IE interpretation process, and IE provides methods for ontological knowledge acquisition. Even if using IE for extracting ontological knowledge is still rather marginal, it is gaining in importance. We distinguish both aspects in the following Sects. 3 and 4, although the whole process is a cyclic one: a first level of ontological knowledge (e.g. entities) helps to extract new pieces of knowledge, from which more elaborate abstract ontological knowledge can be designed, which in turn helps to extract new pieces of information in an iterative process.

3 Ontology for Information Extraction

The template or form to be filled by IE is a partial model of world knowledge. IE forms are also classically viewed as a model of a database to be filled with the extracted instances. This view is consistent with the first one. In this respect, any IE system is ontology-driven: in IE processes, the ontological knowledge is primarily used for text interpretation. However poor the semantics underlying the form to fill may be (see Fig. 2, for instance), and whether it is explicit (Gaizauskas and Wilks 1997; Embley et al. 1998) or not (Freitag 1998) (see Fig. 5 below), IE is always based on a knowledge model. In this Sect. 3, for exposition purposes, we distinguish different levels of ontological knowledge:
• The referential domain entities and their variations are listed in "flat ontologies". This is mainly used for entity identification and semantic tagging of character strings in documents.
• At a second level, the conceptual hierarchy improves normalization by enabling more general levels of representation (see the sketch after this list).
• More sophisticated IE systems also make use of chunks of a domain model (i.e. conceptual nodes), in which the properties and interrelations of entities are described. The projection of these relations on the text both improves the NL processes and guides the instantiation of conceptual frames, scenarios or database tuples. The corresponding rules are based either on lexico-syntactic patterns or on more semantic ones.
• The domain model itself is used for inference. It enables different structures to be merged and implicit information to be brought to light.
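A minimal sketch of the second level, assuming a toy hierarchy built around the 'nomination' verbs of footnote 1: stating a rule condition at the class level lets it generalize over all the words placed below that class.

# Illustrative only: a toy concept hierarchy. Tagging words with the most
# specific class and generalizing along the hierarchy lets one extraction rule
# cover "named", "elected" and "appointed" instead of three separate rules.
HIERARCHY = {
    "named": "nomination-verb",
    "elected": "nomination-verb",
    "appointed": "nomination-verb",
    "nomination-verb": "event-verb",
}

def generalizations(item):
    # Walk up the hierarchy from a word to its successive semantic classes.
    chain = []
    while item in HIERARCHY:
        item = HIERARCHY[item]
        chain.append(item)
    return chain

def satisfies(word, required_class):
    # A rule condition stated at the class level is satisfied by any word
    # whose generalizations include that class.
    return required_class in generalizations(word)

print(generalizations("elected"))                  # ['nomination-verb', 'event-verb']
print(satisfies("appointed", "nomination-verb"))   # True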
3.1 Sets of entities

Recognizing and classifying named entities in texts requires knowledge of the domain entities. Specialized lexicons or keyword lists are commonly used to identify the referential entities in documents. For instance, in the context of cancer treatment, (Rindflesh et al. 2000) makes use of the concepts of the Metathesaurus of UMLS to identify and classify biological entities in papers reporting interactions between proteins, genes and drugs. In other experiments, lists of gene and protein names are exploited: (Humphreys et al. 2000) makes use of the SWISS-PROT resource, whereas (Ono et al. 2001) combines pattern matching with a manually constructed dictionary. In the financial news of MUC-5, lists of company names have been used.

In a similar way, AutoSlog (Riloff 1993), CRYSTAL (Soderland et al. 1995), PALKA (Kim and Moldovan 1995), WHISK (Soderland 1999) and Pinocchio (Ciravegna 2000) make use of lists of entities to identify the referential entities in documents. The use of lexicons and dictionaries is however controversial: some authors, like (Mikheev et al. 1999), argue that named entity recognition can be done without them.

Three main objectives of these specialized lexicons can be distinguished: semantic tagging, naming normalization and linguistic normalization, although these operations are usually processed all at once.

Semantic tagging. Lists of entities are used to tag the text entities with the relevant semantic information. In the ontology or lexicon, an entity (e.g. Tony Bridge) is described by its type (the semantic class to which it belongs, here PERSON) and by the list of the various textual forms (typographical variants, abbreviations, synonyms) that may refer to it³ (Mr. Bridge, Tony Bridge, T. Bridge).

³ These various forms may be listed extensionally or intensionally by variation rules.

However, exact character strings are often not reliable enough for precise entity identification and semantic tagging. Polysemous words, which exist even in sublanguages, belong to different semantic classes. In the above example, the string "Bridge" could also refer to a bridge named "Tony". (Soderland 1999) reports experiments on a similar problem in a software job ad domain: WHISK is able to learn some contextual IE rules, but some rules are difficult to learn because they rely on subtle semantic variations; e.g., the word "Java" can be interpreted as competency in the programming language except in "Java Beans". Providing the system with lists of entities does not help that much, "because too many of the relevant terms in the domain undergo shifts of meaning depending on context for simple lists of words to be useful". The connection between the ontological and the textual levels must therefore be stronger: identification and disambiguation contextual rules can be attached to named entities. This disambiguation problem is addressed as an autonomous process in IE work by systems that learn contextual rules for entity identification (Sect. 4.1).

Naming normalization. As a side effect, these resources are also used for normalization purposes. For instance, the various forms of Mr. Bridge will be tagged as MAN and associated with the canonical name form Tony Bridge (<PERSON id=Tony Bridge>). In (Soderland 1999), the extraction rules may refer to a class of typographical variations (such as Bdrm=(brs, br, bdrm, bedrooms, bedroom, bed) in the Rental Ad domain). This avoids rule over-fitting, since specific rules can then be abstracted.

Specialized genomics systems are particularly concerned with the variation problem, as the nomenclatures, when they exist, are often not respected in the genomics literature. Thus, the well-known problem of identifying protein and gene names has attracted a large part of the research effort in IE for genomics (Proux et al. 1998; Fukuda et al. 1998; Collier et al. 2000). In many cases, rules rely on shallow constraints rather than on morpho-syntactic dependencies.
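The following sketch combines semantic tagging and naming normalization in the simplest possible way, assuming a toy lexicon that maps each surface variant to a canonical form and a semantic class; real systems rely on much richer variant generation and on contextual disambiguation rules, as discussed above.

import re

# Illustrative only: a toy entity lexicon mapping surface variants to a
# canonical form and a semantic class, used both for semantic tagging and
# for naming normalization.
ENTITY_LEXICON = {
    "Tony Bridge": ("Tony Bridge", "PERSON"),
    "T. Bridge":   ("Tony Bridge", "PERSON"),
    "Mr. Bridge":  ("Tony Bridge", "PERSON"),
    "cotA":        ("cotA", "GENE"),
    "GerE":        ("GerE", "PROTEIN"),
}

# One alternation over all variants, longest first, so that the longest
# matching variant wins at any position in the text.
VARIANT_PATTERN = re.compile(
    "|".join(re.escape(v) for v in sorted(ENTITY_LEXICON, key=len, reverse=True)))

def tag_entities(text):
    def replace(match):
        canonical, sem_class = ENTITY_LEXICON[match.group(0)]
        return '<%s id="%s">%s</%s>' % (sem_class, canonical, match.group(0), sem_class)
    return VARIANT_PATTERN.sub(replace, text)

print(tag_entities("Mr. Bridge reported that GerE stimulates the expression of cotA."))
# <PERSON id="Tony Bridge">Mr. Bridge</PERSON> reported that
# <PROTEIN id="GerE">GerE</PROTEIN> stimulates the expression of
# <GENE id="cotA">cotA</GENE>.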
Linguistic normalization. Beyond typographical normalization, the semantic tagging of entities contributes to sentence normalization at the linguistic level. It solves some syntactic ambiguities: e.g., if cotA is tagged as a gene in the sentence "the stimulation of the expression of cotA", knowing that a gene can be "expressed"