Ontologies and Information Extraction
CoMoTo: An Ontology-Based Context Modeling Tool
Xu Zhaohui, Wu Gang
School of Software, Shanghai Jiao Tong University, Shanghai 200240
Abstract: Context awareness has become a research focus of pervasive computing in recent years, and suitable context modeling methods and tools are the foundation for realizing it. This paper adopts an ontology-based approach to context modeling and, with both generality and ease of use in mind, presents an ontology-based context modeling tool (CoMoTo). The paper discusses a method for modeling context at several levels, describes the analysis and design of the tool, and demonstrates its modeling capability through a case study.
Keywords: context modeling; ontology; modeling tool

1. Introduction
Pervasive computing is human-centered computing. One of its key characteristics is context awareness: the ability to react dynamically and appropriately as the context of the surrounding environment changes.
To describe and manage context effectively, a unified understanding of context is needed. Many definitions of context have been proposed in the literature, and they differ with the authors' perspectives. The definition given by Dey et al. [1] is among the most representative and general: context is any information that can be used to characterize the situation of an entity, where an entity may be a person, a place, or any object relevant to the interaction between a user and an application, including the user and the application themselves. The term context in this paper follows this definition.

In context-aware applications, context information can be obtained through many channels (e.g., sensors, storage, and manual input). This heterogeneity of sources makes describing, managing, and exploiting context information a complex process. By providing context-aware applications with an abstract description of context, a context model makes these complex operations transparent to the applications and thus greatly simplifies the construction of context-aware applications. A well-structured context model is therefore the key to building a context-aware system [2].

As a means of description, an ontology is an explicit, formal, and shared specification of a conceptualization [3]. Its strong expressive power makes it possible to describe complex contexts, and the rich semantics it provides also make reasoning over context feasible. Thanks to these advantages, ontologies are widely used by researchers to model context information. Several tools already exist for building ontologies; Protégé, developed at Stanford University, is an excellent example. However, because Protégé targets ontology construction for all domains, its very generality makes it relatively complex to operate, and modelers need considerable expertise.
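To make the idea of an ontology-based context model concrete, the following is a minimal sketch of a small context ontology expressed with the rdflib library. It is not CoMoTo's actual model or format (which is not specified in this excerpt); the namespace, class, and property names are illustrative assumptions.

```python
from rdflib import Graph, Namespace, Literal, RDF, RDFS

# Hypothetical namespace for an illustrative context ontology.
CTX = Namespace("http://example.org/context#")

g = Graph()
g.bind("ctx", CTX)

# A small hierarchy of context concepts, echoing the idea of modeling
# context at several levels (generic upper concepts vs. domain concepts).
for cls in ("Entity", "Person", "Location", "Activity"):
    g.add((CTX[cls], RDF.type, RDFS.Class))
g.add((CTX.Person, RDFS.subClassOf, CTX.Entity))
g.add((CTX.Location, RDFS.subClassOf, CTX.Entity))
g.add((CTX.Activity, RDFS.subClassOf, CTX.Entity))

# Properties linking entities to their context.
g.add((CTX.locatedIn, RDF.type, RDF.Property))
g.add((CTX.engagedIn, RDF.type, RDF.Property))

# A few facts that a sensor or manual input might contribute.
g.add((CTX.alice, RDF.type, CTX.Person))
g.add((CTX.room404, RDF.type, CTX.Location))
g.add((CTX.alice, CTX.locatedIn, CTX.room404))
g.add((CTX.alice, CTX.engagedIn, Literal("meeting")))

print(g.serialize(format="turtle"))
```

A context-aware application can then query such a graph (e.g., "where is alice?") instead of dealing with the heterogeneous raw sources directly, which is exactly the abstraction role of the context model described above.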
Teaching Design for Artificial Intelligence (English Edition, Fifth Edition)

Introduction
Artificial Intelligence (AI) is an interdisciplinary field that has attracted increasing attention over the past few decades. As AI technologies and applications continue to evolve, it is essential to develop effective teaching strategies and materials to prepare our students for the future. This document presents a proposed teaching design for the English version of the fifth edition of Artificial Intelligence, with the aim of creating a stimulating and engaging learning environment for students to master the fundamentals of AI and apply these concepts to real-world problems.

Audience
The proposed teaching design is aimed at undergraduate students who have a basic understanding of programming and data structures. An introductory course in computer science or engineering is recommended, but not mandatory. Students with a background in mathematics, statistics, or other related fields may also benefit from this course.

Prerequisites
Before starting this course, students should be familiar with the following:
• Programming languages such as Python, Java, or C++
• Data structures such as arrays, lists, and trees
• Basic algorithms such as sorting and searching
• Linear algebra and calculus

Course Content

Week 1: Introduction to AI
• Definition of AI
• Brief history of AI
• Applications of AI
• Key concepts in AI
• Overview of the course

Week 2: Problem Solving and Search
• Problem-solving strategies
• State-space search algorithms
• Heuristic search algorithms
• Uninformed and informed search algorithms
• Adversarial search

Week 3: Knowledge Representation and Reasoning
• Propositional logic
• First-order logic
• Rule-based systems
• Frames and semantic networks
• Ontologies and knowledge graphs

Week 4: Planning and Decision Making
• Planning methods
• Decision trees
• Decision networks
• Utility theory
• Game theory

Week 5: Machine Learning
• Supervised learning
• Unsupervised learning
• Reinforcement learning
• Neural networks
• Deep learning

Week 6: Natural Language Processing
• Text processing
• Language modelling
• Information extraction
• Sentiment analysis
• Machine translation

Week 7: Robotics and Perception
• Robot architectures
• Sensing and perception
• Robotics applications
• Simultaneous localization and mapping (SLAM)
• Path planning and control

Week 8: Ethics and Social Implications of AI
• Bias in AI
• Fairness in AI
• Privacy and security concerns
• Misuse and abuse of AI
• Future of AI

Teaching Methods
The proposed teaching design combines traditional lectures with interactive classroom activities to promote active learning and engagement. Lectures will cover the key concepts and theories of AI, while classroom activities will involve problem-solving exercises, group discussions, and case studies.
In addition, students will be expected to complete several coding assignments and a final project. The coding assignments will provide hands-on experience with AI algorithms and techniques, while the final project will give students the opportunity to apply these concepts to a real-world problem and present their findings to the class.

Assessment
Assessment will be based on the following:
• Coding assignments (40%)
• Final project (30%)
• Midterm exam (20%)
• Classroom participation (10%)
Students will receive feedback on their coding assignments and final project throughout the course, and there will be several opportunities for peer review and group feedback.

Conclusion
The proposed teaching design for the English version of the fifth edition of Artificial Intelligence aims to provide students with a solid foundation in the key concepts and theories of AI, as well as hands-on experience with coding and real-world applications.
By combining traditional lectures with interactive classroom activities, students will be engaged in their learning and better prepared for the future of AI.
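As an illustration of the hands-on component, the following is a minimal sketch of the kind of Week 2 coding assignment the design envisages (uninformed state-space search). The toy road map and function names are illustrative assumptions, not material from the textbook.

```python
from collections import deque

def breadth_first_search(start, goal, neighbors):
    """Uninformed (breadth-first) state-space search.

    `neighbors` maps a state to the states reachable in one step.
    Returns a list of states from start to goal, or None if no path exists.
    """
    frontier = deque([start])
    parent = {start: None}
    while frontier:
        state = frontier.popleft()
        if state == goal:
            path = []
            while state is not None:
                path.append(state)
                state = parent[state]
            return path[::-1]
        for nxt in neighbors(state):
            if nxt not in parent:          # not yet visited
                parent[nxt] = state
                frontier.append(nxt)
    return None

# Toy example: a small road map given as an adjacency dictionary.
roads = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
print(breadth_first_search("A", "E", lambda s: roads.get(s, [])))
```

A follow-up exercise could replace the FIFO frontier with a priority queue ordered by a heuristic, turning the same skeleton into the informed search covered in the same week.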
Knowledge Management: An Explanation of the Term (in English)

Knowledge Management (KM) is a discipline that encompasses a range of strategies and practices aimed at identifying, capturing, organizing, storing, retrieving, and utilizing an organization's knowledge assets to foster innovation, increase efficiency, and improve decision-making capabilities. It involves a systematic and structured approach to managing knowledge within an organization to ensure that valuable knowledge is shared and leveraged effectively.

At its core, KM recognizes that knowledge is a critical asset that holds the potential to drive competitive advantage and business success. It is not limited to explicit knowledge that is codified and easily transferable, but also encompasses tacit knowledge, which resides in an individual's experiences, insights, and intuition. By harnessing both forms of knowledge, organizations can unlock hidden potential and facilitate effective collaboration across teams and departments.

One of the key components of KM is knowledge creation. This involves the generation of new knowledge through various means such as research and development, experimentation, or simply by leveraging the collective intelligence of the organization. Innovation, creativity, and continuous learning play significant roles in this process. Organizations that prioritize knowledge creation foster an environment that encourages curiosity, experimentation, and risk-taking, allowing for the discovery of new insights and opportunities.

Another important aspect of KM is knowledge capture. This refers to the process of identifying and capturing knowledge from various sources within the organization, including individuals, documents, databases, and even external networks. Captured knowledge is then organized and stored in a format that is easily accessible and searchable. This allows for efficient retrieval of information when needed, enabling employees to quickly find and apply relevant knowledge to their work.

Knowledge organization and categorization are critical to effective KM. Information and knowledge need to be classified and structured in a way that reflects the organization's goals, processes, and workflows. This often involves the use of taxonomies, ontologies, and metadata to categorize and tag knowledge assets, making them easily discoverable and retrievable. Additionally, knowledge management systems and technologies, such as intranets, databases, and content management systems, are used to facilitate the organization and storage of knowledge.

Knowledge sharing and dissemination play a fundamental role in KM. Organizations must establish channels and platforms that enable employees to share their knowledge and experiences with others. This can take the form of formal training programs, communities of practice, knowledge sharing sessions, or even social collaboration tools. By sharing knowledge, organizations benefit from collective insights, avoid reinventing the wheel, and foster a culture of learning and collaboration.

Beyond sharing, KM also emphasizes the importance of knowledge utilization. It is not enough to simply hoard knowledge; organizations must actively encourage and facilitate the application of knowledge in decision-making processes and problem-solving activities. This requires the integration of knowledge into various business processes and systems, ensuring that it is available and utilized at the point of need.

Moreover, KM also encompasses knowledge evaluation and improvement.
Organizations need to continuously assess the relevance, accuracy, and effectiveness of their knowledge assets. This involves monitoring usage patterns, soliciting feedback from users, and regularly updating and improving knowledge resources to ensure their continued usefulness and relevance.

In conclusion, Knowledge Management is a multifaceted discipline aimed at maximizing the value and impact of an organization's knowledge assets. It involves a range of strategies and practices, from knowledge creation and capture to organization, sharing, utilization, and evaluation. By adopting effective KM practices, organizations can foster a culture of learning, innovation, and collaboration, leading to improved performance and sustained success.
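The paragraph above on organization and categorization can be made concrete with a small sketch of tagging and retrieving knowledge assets by metadata. The asset fields and tag vocabulary are illustrative assumptions, not any particular KM product's schema.

```python
from dataclasses import dataclass, field

@dataclass
class KnowledgeAsset:
    """A single captured knowledge item with descriptive metadata."""
    title: str
    author: str
    tags: set[str] = field(default_factory=set)   # taxonomy terms

# A tiny repository of captured assets.
repository = [
    KnowledgeAsset("Onboarding checklist", "HR team", {"process", "onboarding"}),
    KnowledgeAsset("Post-mortem: May outage", "SRE team", {"lessons-learned", "operations"}),
    KnowledgeAsset("Customer interview notes", "Product", {"research", "customers"}),
]

def find_by_tag(assets, tag):
    """Retrieve assets whose metadata carries the given taxonomy term."""
    return [a for a in assets if tag in a.tags]

for asset in find_by_tag(repository, "lessons-learned"):
    print(asset.title, "-", asset.author)
```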
Ingénierie des Connaissances — Research Report No. 03.05, May 2003
Ontology enrichment and indexing process
E. Desmontils, C. Jacquin, L. Simon
Institut de Recherche en Informatique de Nantes, 2, rue de la Houssinière, B.P. 92208, 44322 Nantes CEDEX 3
desmontils, jacquin, simon@irin.univ-nantes.fr

Abstract
Within the framework of Web information retrieval, this paper presents some methods to improve an indexing process which uses terminology-oriented ontologies specific to a field of knowledge. Thus, techniques to enrich ontologies using specialization processes are proposed in order to manage pages which have to be indexed but which are currently rejected by the indexing process. This ontology specialization process is made supervised to offer to the expert of the domain a decision-making aid concerning its field of application. The proposed enrichment is based on some heuristics to manage the specialization of the ontology, and it can be controlled using a graphic tool for validation.
Categories and Subject Descriptors: H.3.1 [Content Analysis and Indexing]
General Terms: Abstracting methods, Dictionaries, Indexing methods, Linguistic processing, Thesauruses
Additional Key Words and Phrases: Ontology, Enrichment, Supervised Learning, Thesaurus, Indexing Process, Information Retrieval in the Web

1 Introduction
Search engines, like Google or Altavista, help us to find information on the Internet. These systems use a centralized database to index information and a simple keyword-based requester to reach information. With such systems, the recall is often rather convenient. Conversely, the precision is weak. Indeed, these systems rarely take into account the content of documents in order to index them. Two major approaches exist for taking the semantics of a document into account. The first approach concerns annotation techniques based on the use of ontologies. They consist in manually annotating documents using ontologies. The annotations are then used to retrieve information from the documents. They are rather dedicated to request/answer systems (e.g., KAON). The second approach for taking Web document content into account is information retrieval techniques based on the use of domain ontologies [8]. They are usually dedicated to retrieving documents which concern a specific request. For this type of system, the index structure of the Web pages is given by the ontology structure. Thus, the document indexes belong to the concept set of the ontology. An encountered problem is that many concepts extracted from documents which belong to the domain are not present in the domain ontology. Indeed, the domain coverage of the ontology may be too small.
In this paper, we first present the general indexing process based on the use of a domain ontology (Section 2). Then, we present an analysis of experiment results which leads us to propose improvements of the indexing process based on ontology enrichment; they make it possible to increase the rate of indexed concepts (Section 3). Finally, we present a visualisation tool which enables an expert to control the indexing process and the ontology enrichment.

2 Overview of the indexing process
The main goal is to build a structured index of Web pages according to an ontology. This ontology provides the index structure. Our indexing process can be divided into four steps (Figure 1) [8]:
1. For each page, a flat index of terms is built. Each term of this index is associated with its weighted frequency. This coefficient depends on each HTML marker that describes each term occurrence.
2. A thesaurus makes it possible to generate all candidate concepts which can be labeled by a term of the previous index. In our implementation, we use the WordNet thesaurus ([14]).
3. Each candidate concept of a page is studied to determine its representativeness of this page content. This evaluation is based on its weighted frequency and on the relations with the other concepts. It makes it possible to choose the best sense (concept) of a term in relation to the context. Therefore, the more a concept has strong relationships with other concepts of its page, the more this concept is significant in its page. This contextual relation minimizes the role of the weighted frequency by growing the weight of the strongly linked concepts and by weakening the isolated concepts (even those with a strong weighted frequency).
4. Among these candidate concepts, a filter is produced via the ontology and the representativeness of the concepts. Namely, a selected concept is a candidate concept that belongs to the ontology and has a high representativeness of the page content (the representativeness exceeds a threshold of sensitivity). Next, the pages which contain such a selected concept are assigned to this concept in the ontology.
Some measures are evaluated to characterize the indexing process. They determine the adequacy between the Web site and the ontology. These measures take into account the number of pages selected by the ontology (the Ontology Cover Degree or OCD), the number of concepts included in the pages (the Direct Indexing Degree or DID and the Indirect Indexing Degree or IID), etc. The global evaluation of the indexing process (OSAD: Ontology-Site Adequacy Degree) is a linear combination of the previous measures (weighted means) over different thresholds from 0 to 1. This measure enables us to quantify the "quality" of our indexing process (see [8] for more details).
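Steps 1 to 3 can be illustrated with a small sketch using NLTK's WordNet interface. The HTML-marker weights and the term list are illustrative assumptions, and the representativeness score below is a simplified stand-in for the contextual measure the authors describe, not their actual formula.

```python
from collections import Counter
from nltk.corpus import wordnet as wn   # requires: nltk.download("wordnet")

# Step 1 (simplified): weighted term frequencies for one page.
# The weights per HTML marker are assumptions, not the paper's values.
marker_weight = {"title": 3.0, "h1": 2.0, "p": 1.0}
page_terms = [("ontology", "title"), ("index", "h1"),
              ("ontology", "p"), ("page", "p"), ("ontology", "p")]
weighted_freq = Counter()
for term, marker in page_terms:
    weighted_freq[term] += marker_weight.get(marker, 1.0)

# Step 2: candidate concepts are the WordNet senses of each indexed term.
candidates = {term: wn.synsets(term, pos=wn.NOUN) for term in weighted_freq}

# Step 3 (rough stand-in): favour senses that are related to senses of the
# other terms on the page, so that isolated senses are weakened.
def relatedness(synset, others):
    return sum(synset.path_similarity(o) or 0.0 for o in others)

for term, synsets in candidates.items():
    others = [s for t, ss in candidates.items() if t != term for s in ss]
    scored = [(relatedness(s, others) * weighted_freq[term], s) for s in synsets]
    if scored:
        score, best = max(scored, key=lambda x: x[0])
        print(f"{term}: best sense {best.name()} (score {score:.2f})")
```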
[Table 1: numbers of processed candidate concepts for the test site (valid and indexed with a representativeness degree greater than 0.3, in or not in WordNet). Table 2: results of the indexing process concerning 1000 pages of the site of the CSE department of the University of Washington (with a threshold of 0.3), comparing the initial indexing process with the pruning process.]
This phenomenon is due to the enrichment algorithm, which authorizes the systematic addition of any representative concept (i.e., above the threshold of representativeness) to the ontology of the domain. The second enrichment method, which operates with pruning rules (see sub-section 3.3), adds only 136 concepts to the ontology. Let us also notice that this method keeps the rate of coverage (98.86%) of the enrichment method without pruning. Indeed, during this pruning phase, some concepts which do not index enough pages (according to the threshold) are removed from the ontology, and their pages are then linked to concepts that subsume them.
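The pruning rule just described might be sketched as follows; the tree representation and the minimum-page threshold are illustrative assumptions, not the paper's exact rules.

```python
# A concept node in the ontology: its indexed pages and its child concepts.
class Concept:
    def __init__(self, name, pages=None, children=None):
        self.name = name
        self.pages = set(pages or [])
        self.children = list(children or [])

def prune(concept, min_pages=3):
    """Remove child concepts that index too few pages; reattach their pages
    to the subsuming concept (this node)."""
    kept = []
    for child in concept.children:
        prune(child, min_pages)
        if len(child.pages) < min_pages:
            concept.pages |= child.pages          # pages move up to the subsumer
            kept.extend(child.children)           # grandchildren are re-parented
        else:
            kept.append(child)
    concept.children = kept
    return concept

# Toy ontology: "vehicle" subsumes "car" and the rarely indexed "unicycle".
root = Concept("vehicle", pages={"p1"}, children=[
    Concept("car", pages={"p2", "p3", "p4"}),
    Concept("unicycle", pages={"p5"}),
])
prune(root)
print(root.pages)                        # {'p1', 'p5'}: unicycle's page moved up
print([c.name for c in root.children])   # ['car']
```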
As a result, the number of concepts that index pages grows. This is not surprising, because we add only concepts that index a minimal number of pages. Finally, the rate of accepted concepts goes from 0.62% to 11.5%! So, our process uses more of the available concepts that the pages contain.

4 OntologyManager: a user interface for ontology validation
A tool which makes it possible to control the ontology enrichment has been developed (see Figure 7). This tool, implemented in the Java language, proposes a tree-like view of the ontology. On the one hand, it proposes a general view of the ontology which enables the expert to navigate easily through the ontology; on the other hand, it proposes a more detailed view which informs the expert about the coefficients associated with concepts and pages. Notice that, in this last case, concepts are represented with different colours according to their associated coefficient, so that a human expert can easily compare them. Moreover, some parts of the ontology graph can also be masked in order to focus the expert's attention on a specific part of the ontology. We are now developing a new functionality for the visualisation tool. It enables the user to have a hyperbolic view of the ontology graph (like the OntoRama tool [9] or H3Viewer [16]). In this context, the user can work with bigger ontologies.
The user interface also makes it possible to visualise the indexed pages (see Figure 8) and the ontology enrichment (by a colour system which can be customized). It will be easy for the human expert to validate or invalidate the added concepts, to obtain the indexing rate of a particular concept, and to dynamically reorganize the ontology (by a drag-and-drop system). The concept validation process is divided into four steps defining four classes of concepts:
• bronze concepts: concepts proposed by our learning process and accepted by an expert just "to see";
• silver concepts: concepts accepted by the expert for all indexing processes he/she does;
• gold concepts: concepts proposed by an expert to his/her community for testing;
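A minimal sketch of how such validation classes might be tracked in a tool follows. The class names are taken from the paper; the promotion logic is an assumption, and the fourth class is not recoverable from this excerpt, so only the named ones are modelled.

```python
from enum import Enum

class Status(Enum):
    PROPOSED = 0   # suggested by the enrichment (learning) process
    BRONZE = 1     # accepted by an expert just "to see"
    SILVER = 2     # accepted by the expert for all of his/her indexing processes
    GOLD = 3       # proposed by an expert to the community for testing

validation = {"unicycle": Status.PROPOSED, "car": Status.SILVER}

def promote(concept):
    """Move a concept one step up the validation ladder."""
    current = validation[concept]
    if current is not Status.GOLD:
        validation[concept] = Status(current.value + 1)

promote("unicycle")
print(validation["unicycle"])   # Status.BRONZE
```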
Ontologies and the Configuration of Problem-Solving MethodsRudi Studer1,Henrik Eriksson2,John Gennari3,Samson Tu3, Dieter Fensel1,4,and Mark Musen31Institute AIFB, University of Karlsruhe, D-76128 Karlsruhee-mail: studer@aifb.uni-karlsruhe.de2Department of Computer and Information Science, Linköping University, S-58183 Linköpinge-mail: her@ida.liu.se3Section on Medical Informatics, Knowledge Systems Laboratory, Stanford University School of Medicine, Stanford, CA 94305-5479, USAe-mail: {gennari,tu,musen}@4Department SWI, University of Amsterdam, NL-1018 WB Amsterdame-mail: dieter@swi.psy.uva.nlAbstractProblem-solving methods model the problem-solving behavior of knowledge-based systems. The PROTÉGÉ-II framework includes a library of problem-solving methods that can be viewed as reusable components. For developers to use these components as building blocks in the construction of methods for new tasks, they must configure the components to fit with each other and with the needs of the new task. As part of this configuration process, developers must relate the ontologies of the generic methods to the ontologies associated with other methods and submethods. We present a model of method configuration that incorporates the use of several ontologies in multiple levels of methods and submethods, and we illustrate the approach by providing examples of the configuration of the board-game method.1. IntroductionProblem-solving methods for knowledge-based systems capture the problem-solving behavior required to performing the system's task (McDermott, 1988). Because certain tasks are common (e.g., planning and configuration), and are approachable by the same problem-solving behavior, developers can reuse problem-solving methods in several applications ((Chandrasekaran and Johnson, 1993), (Breuker and Van de Velde, 1994)). Thus, a library of reusable methods would allow the developer to create new systems by selecting, adapting and configuring such methods. Moreover, development tools, such as PROTÉGÉ-II (Puerta et al., 1992), can support the developer in the reuse of methods.Problem-solving methods are abstract descriptions of problem-solving behavior. The development of problem solvers from reusable components is analogous to the general approach of software reuse. In knowledge engineering as well as software engineering, developers often duplicate work on similar software components, which are used in different applications. The reuse of software components across several applications is a potentially useful technique that promises to improve the software-development process (Krueger, 1992). Similarly, the reuse of problem-solving methods can improve the quality, reliability, and maintainability of the software (e.g., by the reuse of quality-proven components). Of course, software reuse is only financially beneficial in the end if the indexing and configuration overhead is less than the effort that is needed to create the required component several times from scratch.Although software reuse is an appealing approach theoretically, there are serious practical problems associated with reuse. Two of the most important impediments to software reuse are (1) the problem of finding reusable components (e.g., locating appropriate components in a library), and (2) the problem of adapting reusable components to their task and to their environment. The firstproblem is sometimes called the indexing problem, and the second problem is sometimes called the configuration problem. 
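To make the mapping issue concrete, here is a minimal sketch of translating a domain-oriented ontology (office workers, rooms) into a generic method ontology (pieces, locations), in the spirit of the room-assignment example used later in the paper. The data structures and names are illustrative assumptions, not PROTÉGÉ-II's actual representation.

```python
from dataclasses import dataclass

# Generic method ontology: a board-game-style method talks about pieces
# and locations, independently of any application domain.
@dataclass(frozen=True)
class Piece:
    name: str

@dataclass(frozen=True)
class Location:
    name: str

# Domain-oriented ontology: the room-assignment domain talks about
# office workers and office rooms.
workers = ["worker_1", "worker_2"]
rooms = ["room_1", "room_2"]

def map_domain_to_method(workers, rooms):
    """Mapping from domain terminology to the method's generic terminology
    (the role the paper assigns to domain views)."""
    pieces = [Piece(w) for w in workers]
    locations = [Location("outside")] + [Location(r) for r in rooms]
    return pieces, locations

pieces, locations = map_domain_to_method(workers, rooms)
print(pieces[0], locations[0])   # Piece(name='worker_1') Location(name='outside')
```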
These problems are also present in the context of reusable problem-solving methods. In the remainder of this paper we shall focus on the configuration problem.Method configuration is a difficult task, because the output of one method may not correspond to the input of the next method, and because the method may have subtasks, which are solved by submethods offering a functionality that is different from the assumptions of the subtask. Domain-independent methods use a method ontology, which, for example, might include concepts such as states, transitions, locations, moves, and constraints, whereas the user input and the (domain-specific) knowledge-base use a domain-oriented ontology, which might include concepts such as office workers, office rooms, and room-assignment constraints. Thus, a related issue to the configuration problem is the problem of mappings between ontologies (Gennari et al., 1994).In this paper, we shall address the configuration problem. The problems of how to organize a library of problem-solving methods and of how to select an appropriate method from such a library are beyond the scope of the paper. We shall introduce an approach for handling method ontologies when configuring a method from more elementary submethods. In our framework, configuration of a method means selecting appropriate submethods for solving the subtasks of which a method is composed. We introduce the notion of a subtask ontology in order to be able (i) to make the ontology of a method independent of the submethods that are chosen to solve its subtasks, and (ii) to specify how a selected submethod is adapted to its subtask environment. Our approach supports such a configuration process on multiple levels of subtasks and submethods. Furthermore, the access of domain knowledge is organized in such a way that no mapping of domain knowledge between the different subtask/submethod levels is required.The approach which is described in this paper provides a framework for handling some important aspects of the method configuration problem. However, our framework does not provide a complete solution to this problem. Furthermore, the proposed framework needs in the future a thorough practical evaluation by solving various application tasks.The rest of this paper is organized as follows. Section 2 provides a background to PROTÉGÉ-II, the board-game method, and the MIKE approach. Section 3 introduces the notions of method and subtask ontologies and discusses their relationships with the interface specification of methods and subtasks, respectively. Section 4 analyses the role of ontologies for the configuration of problem-solving methods and presents a model for configuring a problem-solving method from more elementary submethods that perform the subtasks of the given problem-solving method. In Sections 5 and 6, we discuss the results, and draw conclusions, respectively.2. Background: PROTÉGÉ-II, the Board-Game Method, and MIKEIn this section, we shall give a brief introduction into PROTÉGÉ-II and MIKE (Angele et al., 1996b) and will describe the board-game method (Eriksson et al., 1995) since this method will be used to illustrate our approach.2.1 Method Reuse for PROTÉGÉ-IIPROTÉGÉ-II (Puerta et al., 1992, Gennari et al., 1994, Eriksson et al., 1995) is a methodology and suite of tools for constructing knowledge-based systems. 
The PROTÉGÉ-II methodology emphasizes the reuse of components, including problem-solving methods, ontologies, and knowledge-bases.PROTÉGÉ-II allows developers to reuse library methods and to generate custom-tailored knowledge-acquisition tools from ontologies. Domain experts can then use these knowledge-Inputs ->Figure 2-1: Method-subtask decomposition in PROTÉGÉ-IIacquisition tools to create knowledge bases for the problem solvers. In addition to developing tool support for knowledge-based systems, PROTÉGÉ-II is also a research project aimed at understanding the reuse of problem-solving methods, and at alternative approaches to reuse.Naturally, the configuration of problem-solving methods for new tasks is a critical step in the reuse process, and an important research issue for environments such as PROTÉGÉ-II.The model of reuse for PROTÉGÉ-II includes the notion of a library of reusable problem-solving methods (PSMs) that perform tasks . PROTÉGÉ-II uses the term task to indicate the computations and inferences a method should perform in terms of its input and output. (Note that the term task is used sometimes in other contexts to indicate the overall role of an application system, or the application task .) In PROTÉGÉ-II problem-solving methods are decomposable into subtasks .Other methods, sometimes called submethods, can perform these subtasks. Primitive methods that cannot be decomposed further are called mechanisms . This decomposition of tasks into methods and mechanisms is shown graphically in Figure 2-1.Submethods and mechanisms should be reusable by developers as they build a solution to a particular problem. Thus, the developer should be able to select a generic method that performs a task, and then configure this method by selecting and substituting appropriate submethods and mechanisms to perform the method’s subtasks. Note that because the input-output requirements of tasks and subtasks often differ from the input-output assumptions of preexisting methods and mechanisms, we must introduce mappings among the task (or subtask) and method (or submethod)ontologies.PROTÉGÉ-II uses three major types of ontologies for defining various aspects of the knowledge-based system: domain , method , and application ontologies . Domain ontologies model concepts and relationships for a particular domain of interest. Ideally, these ontologies should be partititioned so as to separate those parts that may be more dependent on the problem-solving method. Method ontologies model concepts related to problem-solving methods, including input and output assumptions. To enable reuse, method ontologies should be domain-independent. In most situations, reusable domain and method ontologies by themselves are insufficient for a completeapplication system. Thus, PROTÉGÉ-II uses an application ontology that combines domain and method ontologies for a particular application. Application ontologies are used to generate domain-specific, method-specific knowledge-acquisition tools.The focus of this paper is on the configuration of problem-solving methods and submethods. Thus, we will describe method ontologies (see Section 3), rather than domain or application ontologies, and the mappings among tasks and methods that are necessary for method configuration (see Section 4).2.2 The Board-Game Method (BGM)We shall use the board-game method ((Eriksson et al., 1995), (Fensel et al., 1996a)) as a sample method to illustrate method configuration in the PROTÉGÉ-II framework. 
The basic idea behind the board-game method is that the method should provide a conceptual model of a board game where game pieces move between locations on a board (see Figure 2-2). The state of such a board game is defined by a set of assignments specifying which pieces are assigned to which locations. Developers can use this method to perform tasks that they can model as board-game problems.Figure 2-2: The board-game method provides a conceptual model where pieces movebetween locations.To configure the board-game method for a new game, the developer must define among other the pieces, locations, moves, the initial state and goal states of the game. The method operates by searching the space of legal moves, and by determining and applying the most promising moves until the game reaches a goal state. The major advantage of the board-game method is that the notion of a board game as the basis for the method configuration, makes the method convenient to reuse for the developer.We have used the board-game method in different configurations to perform several tasks. Examples of such tasks are the towers-of-Hanoi, the cannibals-and-missionaries, and the Sisyphus room-assignment (Linster, 1994) problem. By modeling other types of tasks as board games, the board-game method can perform tasks beyond simple games. The board-game method can perform the Sisyphus room-assignment task, for instance, if we (1) model the office workers as pieces, (2) start from a state where all the workers are at a location outside the building, and (3) move the workers one by one to appropriate rooms under the room-assignment constraints.2.3 The MIKE ApproachThe MIKE approach (Model-based and Incremental Knowledge Engineering) (Angele et al., 1996b) aims at providing a development method for knowledge-based systems covering all steps from knowledge acquisition to design and implementation. As part of the MIKE approach theKnowledge Acquisition and Representation Language KARL (Fensel et al., 1996c), (Fensel, 1995) has been developed. KARL is a formal and operational knowledge modeling language which can be used to formally specify a KADS like model of expertise (Schreiber et al., 1993). Such a model of expertise is split up into three layers:The domain layer contains the domain model with knowledge about concepts, their features, and their relationships. The inference layer contains a specifiation of the single inference steps as well as a specification of the knowledge roles which indicate in which way domain knowledge is used within the problem solving steps. In MIKE three types of knowledge roles are distinguished: Stores are used as containers which provide input data to inference actions or collect output data generated by inference actions. Views and terminators are used to connect the (generic) inference layer with the domain layer: Views provide means for delivering domain knowledge to inference actions and to transform the domain specific terminology into the generic PSM specific terminology. In an analogous way, terminators may be used to write the results of the problem solving process back to the domain layer and thus to reformulate the results in domain specific terms. The task layer contains a specification of the control flow for the inference steps as defined on the inference layer.For the remainder of the paper, it is important to know that in KARL a problem solving method is specified in a generic way on the inference and task layer of a model of expertise. 
A main characteristic of KARL is the integration of object orientation into a logical framework. KARL provides classes and predicates for specifying concepts and relationships, respectively. Furthermore, classes are characterized by single- and multi-valued attributes and are embedded in an is-a hierarchy. For all these modeling primitives, KARL offers corresponding graphical representations. Finally, sufficient and necessary constraints, which have to be met by class and predicate definitions, may be specified using first-order formulae.Currently, a new version of KARL is under development which among others will provide the notion of a method ontology and will provide primitives for specifying pre- and postconditions for a PSM (Angele et al., 1996a). Thus, this new version of KARL includes all the modeling primitives which are needed to formally describe the knowledge-level framework which shall be introduced in Sections 3 and 4. However, this formal specification is beyond the scope of this paper.3. Problem-Solving Method OntologiesWhen describing a PSM, various characteristic features may be identified, such as the input/output behavior or the knowledge assumptions on which the PSM is based (Fensel, 1995a), (Fensel et al., 1996b). In the context of this paper, we will consider a further characteristic aspect of a PSM: its ontological assumptions. These assumptions specify what kind of generic concepts and relationships are inherent for the given PSM. In the framework of PROTÉGÉ-II, these assumptions are captured in the method ontology (Gennari et al.,1994).Subsequently, we define the notions of a method ontology and of a subtask ontology, and discuss the relationship between the method ontology and the subtask ontologies associated with the subtasks of which the PSM is composed. For that discussion we assume that a PSM comes with an interface specification that describes which generic external knowledge roles (Fensel, 1995b) are used as input and output. Each role includes the definition of concepts and relationships for specifying the terminology used within the role.Fig. 3-1 shows the interface specification of the board-game method. We see, for instance, that knowledge about moves, preferences among moves, and applicability conditions for moves is provided by the input roles "Moves", "Preferred_Moves", and "Applic_Moves" (applicable moves), respectively; within the role "Moves" the concept "moves" is defined, whereas forexample within the role "Preferred_Moves" the relationship "prefer_m" is defined which specifies a preference relation between two moves for a given state (see Fig. 3-3).external knowledge role data flowFigure 3-1: The interface of the board-game methodSince the context in which the method will be used is not known in advance, one cannot specify which input knowledge is delivered from other tasks as output and which input knowledge has to be taken from the domain. Therefore, besides the output role "Solution", all other input roles are handled homogenously (that is, as external input knowledge roles).The interface description determines in which way a method can be adapted to its calling environment: a subset of the external input roles will be used later on as domain views (that is for defining the mapping between the domain-specific knowledge and the generic PSM knowledge).3.1 Method OntologiesWe first consider the situation that a complete PSM is given as a building block in the library of PSMs. 
In this case, a PSM comes with a top-level ontology, its method ontology, specifying all the generic concepts and relationships that are used by the PSM for providing its functionality. This method ontology is divided into two parts:(i) Global definitions, which include all generic concept and relationship definitions that are partof the interface specification of the PSM (that is, the external input and output knowledge roles of the PSM, respectively). For each concept or relationship definition, it is possible to indicate whether it is used as input or as output (however, that does not hold for subordinate concept definitions, i.e. concepts that are just used as range restrictions of concept attributes, or for high-level concepts that are just used for introducing attributes which are inherited bysubconcepts). Thus, the ontology specifies clearly which type of generic knowledge is expected as input, and which type of generic knowledge is provided as output.(ii) Internal definitions, which specify all concepts and relationships that are used for defining the dataflow within the PSM (that is, they are defined within stores).Within both parts, constraints can be specified for further restricting the defined terminology. It should be clear that the global definitions are exactly those definitions that specify the ontological assumptions that must be met for applying the PSM.We assume that a PSM that is stored in the library comes with an ontology description at two levels of refinement. First, a compact representation is given that just lists the names of the concepts and relationships of which the ontology is composed. This compact representation also includes the distinctions of global and internal definitions. It is used for providing an initial, not too detailed overview about the method ontology.Fig. 3-2 shows this compact representation of the board-game method ontology. We see that for instance "moves" is an input concept, "prefer_m" (preference of moves) is an input relationship, and "goal-states" is an output concept; "assignments" is an example of a subordinate concept which is used within state definitions, whereas "movable_objects" is an example of a high level concept. Properties of "movable_objects" are for instance inherited by the concept "pieces" (see below). As we will see later on, the concept "current-states" is part of the internal definitions, since it is used within the board game method for specifying the data flow between subtasks (see Section 3.2)Figure 3-2: The compact representation of the ontology of the board-game method Second, a complete specification of the method ontology is given. We use KARL for formally specifying such an ontology which provides all concept and relationship definitions as well as all constraints. Fig. 3-3 gives a graphic KARL representation of the board-game method ontology(not including constraints). We can see that for instance a move defines a new location for a given piece (single-valued attribute "new_assign" with domain "moves" and range "assignments") or that the preference between moves is state dependent (relationship "prefer_m"). The attribute "assign" is an example of a multi-valued attribute since states consist of a set of assignments.: is-aFigure: 3-3: The graphic KARL representation of the board-game method ontology When comparing the interface specification (Fig. 3-1) and the method ontology (Fig. 
3-3) we can see that the union of the terminology of the external knowledge roles is equal to the set of global definitions being found in the method ontology.3.2 Subtask OntologiesIn general, within the PROTÉGÉ-II framework, a PSM is decomposed into several subtasks. Each subtask may be decomposed in turn again by choosing more elementary methods for solving it. We generalize this approach in the sense that, for trivial subtasks, we do not distinguish between the subtasks and the also trivial mechanisms for solving them. Instead, we use the notion of an elementary inference action (Fensel et al., 1996c). In the given context, such an elementary inference action may be interpreted as a "hardwired" mechanism for solving a subtask. Thus, for trivial subtasks, we can avoid the overhead that is needed for associating a subtask with its corresponding method (see below). That is, in general we assume that a method can be decomposed into subtasks and elementary inference actions.When specifying a PSM, a crucial design decision is the decomposition of the PSM into its top-level subtasks. Since subtasks provide the slots where more elementary methods can be plugged in, the type and number of subtasks determine in which way a PSM can be configured from otherdata flow subtaskinternal storeexternal knowledge rolemethods. As a consequence, the adaptability of a PSM is characterized by the knowledge roles of its interface description, and by its top-level task decomposition.For a complete description of the decomposition structure of a PSM, one also has to specify the interfaces of the subtasks and inference actions, as well as the data and control flow among these constituents. The interface of a subtask consists of knowledge roles, which are either external knowledge roles or (internal) stores , which handle the input/output from/to other subtasks and inference actions. Some of these aspects are described for the BGM in Figures 3-4 and 3-5,respectively.Figure 3-4: The top-level decomposition of the board-game methodinto elementary inference actions and subtasksFig. 3-4 shows the decomposition of the board-game method into top-level subtasks andFigure 3-5: The interface of the subtask "Apply_Moves"elementary inference actions. We can see two subtasks ("Apply_Moves" and "Select_Best_State") and three elementary inference actions ("Init_State", "Check_Goal_State", "Transfer_Solution"). This decomposition structure specifies clearly that the board-game method may be configured by selecting appropriate methods for solving the subtasks "Apply_Moves" and "Select_Best_State". In Fig. 3-5 the interface of the subtask "Apply_Moves" is shown. The interface specifies that "Apply_Moves" receives (board-game method) internal input from the store "Current_State" and delivers output to the store "Potential_Successor_States". Furthermore, three external knowledge roles provide the task- and/or domain-specific knowledge that is required for performing the subtask "Apply_Moves".Having introduced the notion of a method ontology, the problem arises how that method ontology can be made independent of the selection of (more elementary) methods for solving the subtasks of the PSM. The basic idea for getting rid of this problem is that each subtask is associated with a subtask ontology. The method ontology is then essentially derived from these subtask ontologies by combining the different subtask ontologies, and by introducing additional superconcepts, like for example the concept "movable_objects" (compare Fig. 3-2). 
Of course, the terminology associated with the elementary inference actions has to be considered in addition.Figure: 3-6: The method ontology and the related subtask ontologiesThe notion of subtask ontologies has two advantages:1) By building up the method ontology from the various subtask ontologies, the method ontologyis independent of the decision which submethod will be used for solving which subtask. Thus,a configuration independent ontology definition can be given for each PSM in the library.2) Subtask ontologies provide a context for mapping the ontologies of the submethods, which areselected to solve the subtasks, to the global method ontology (see Section 4).The type of mapping that is required between the subtask ontology and the ontology of the method used to solve the subtask is dependent on the distinction of input and output definitions within the subtask ontology (see Section 4). Therefore, we again separate the subtask ontology appropriately. In addition, a distinction is made between internal and external input/output. The feature "internal" is used for indicating that this part of the ontology is used within the method of which the subtask is a part (that is, for defining the terminology of the stores used for the data flow among the subtasks which corresponds to the internal-definitions part of the method ontology). The feature "external" indicates that input is received from the calling environment of the method of which the subtask is a part, or that output is delivered to that calling environment (that corresponds to the global-definitions part of the method ontology). This distinction can be made on the basis of the data dependencies defined among the various subtasks (see Fig. 3-6).In Fig. 3-7, we introduce the ontology for the subtask "Apply_Moves". According to the interface description which is given in Fig. 3-5, we assume that the current state is an internal input for "Apply_Moves", whereas the potential successor states are treated as internal output. Moves, preferences among moves, and applicability conditions for moves are handled as external input.Figure 3-7: The compact representation of the "Apply_Moves" subtask ontologyWhen investigating the interface and ontology specification of a subtask, one can easily recognize that even after having introduced the subtask decomposition, it is still open as to what kind of external knowledge is taken from the domain and what kind of knowledge is received as output from another task. That is, in a complex application task environment, in which the board-game method is used to solve, for example, a subtask st1, it depends on the calling environment of st1 whether, e.g., the preference of moves ("prefer_m") has to be defined in a mapping from the domain, or is just delivered as output from another subtask st2, which is called before st1 is called.4. Configuring Problem Solving Methods from More Elementary MethodsThe basic idea when building up a library of PSMs is that one does not simply store completely defined PSMs. In order to achieve more flexibility in adapting a PSM from the library to its task environment, concepts are required for configuring a PSM from more elementary building blocks (i.e., from more elementary methods (Puerta et al., 1992)). 
Besides being more flexible, such a configuration approach also provides means for reusing these building blocks in different contexts.Based on the structure introduced in Section 3, the configuration of a PSM from building blocks requires the selection of methods that are appropriate for solving the subtasks of the PSM. Since we do not consider the indexing and adaptation problem in this paper, we assume subsequently, that we have found a suitable (sub-)method for solving a given subtask, e.g. by exploiting appropriate semantic descriptions of the stored methods. Such semantic descriptions could, for instance, be pre-/postconditions which specify the functional behavior of a method (Fensel et al., 1996b). By using the new version of KARL (Angele et al., 1996a) such pre-/postconditions can be specified in a completely formal way.。
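As a rough sketch of what configuring a method from submethods could look like programmatically, consider representing subtasks as slots to be filled with submethods whose declared input/output roles match the subtask. This is an illustration of the idea only, not PROTÉGÉ-II's or KARL's actual machinery; the role names for "Apply_Moves" follow the paper, while the library of submethods is assumed.

```python
from dataclasses import dataclass

@dataclass
class Subtask:
    name: str
    required_inputs: set     # knowledge roles available to the subtask
    provided_outputs: set    # knowledge roles the subtask must deliver

@dataclass
class Submethod:
    name: str
    inputs: set
    outputs: set

def matches(subtask: Subtask, method: Submethod) -> bool:
    """A submethod fits a subtask if it consumes no more than the subtask
    offers and delivers at least what the subtask must produce."""
    return method.inputs <= subtask.required_inputs and \
           subtask.provided_outputs <= method.outputs

# One subtask of the board-game method, with the roles named in the paper.
apply_moves = Subtask("Apply_Moves",
                      {"Current_State", "Moves", "Preferred_Moves", "Applic_Moves"},
                      {"Potential_Successor_States"})

# A library of candidate submethods (names and roles are illustrative).
library = [
    Submethod("exhaustive_move_expansion",
              {"Current_State", "Moves"}, {"Potential_Successor_States"}),
    Submethod("numeric_optimizer", {"Objective"}, {"Optimum"}),
]

configuration = {apply_moves.name: next(m for m in library if matches(apply_moves, m))}
print(configuration["Apply_Moves"].name)   # exhaustive_move_expansion
```

In a fuller treatment the match test would also check the kind of pre- and postconditions mentioned above, rather than only the names of the knowledge roles.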
Knowledge Engineering:Principles and MethodsRudi Studer1, V. Richard Benjamins2, and Dieter Fensel11Institute AIFB, University of Karlsruhe, 76128 Karlsruhe, Germany{studer, fensel}@aifb.uni-karlsruhe.dehttp://www.aifb.uni-karlsruhe.de2Artificial Intelligence Research Institute (IIIA),Spanish Council for Scientific Research (CSIC), Campus UAB,08193 Bellaterra, Barcelona, Spainrichard@iiia.csic.es, http://www.iiia.csic.es/~richard2Dept. of Social Science Informatics (SWI),richard@swi.psy.uva.nl, http://www.swi.psy.uva.nl/usr/richard/home.htmlAbstractThis paper gives an overview about the development of the field of Knowledge Engineering over the last 15 years. We discuss the paradigm shift from a transfer view to a modeling view and describe two approaches which considerably shaped research in Knowledge Engineering: Role-limiting Methods and Generic Tasks. To illustrate various concepts and methods which evolved in the last years we describe three modeling frameworks: CommonKADS, MIKE, and PROTÉGÉ-II. This description is supplemented by discussing some important methodological developments in more detail: specification languages for knowledge-based systems, problem-solving methods, and ontologies. We conclude with outlining the relationship of Knowledge Engineering to Software Engineering, Information Integration and Knowledge Management.Key WordsKnowledge Engineering, Knowledge Acquisition, Problem-Solving Method, Ontology, Information Integration1IntroductionIn earlier days research in Artificial Intelligence (AI) was focused on the development offormalisms, inference mechanisms and tools to operationalize Knowledge-based Systems (KBS). Typically, the development efforts were restricted to the realization of small KBSs in order to study the feasibility of the different approaches.Though these studies offered rather promising results, the transfer of this technology into commercial use in order to build large KBSs failed in many cases. The situation was directly comparable to a similar situation in the construction of traditional software systems, called …software crisis“ in the late sixties: the means to develop small academic prototypes did not scale up to the design and maintenance of large, long living commercial systems. In the same way as the software crisis resulted in the establishment of the discipline Software Engineering the unsatisfactory situation in constructing KBSs made clear the need for more methodological approaches.So the goal of the new discipline Knowledge Engineering (KE) is similar to that of Software Engineering: turning the process of constructing KBSs from an art into an engineering discipline. This requires the analysis of the building and maintenance process itself and the development of appropriate methods, languages, and tools specialized for developing KBSs. Subsequently, we will first give an overview of some important historical developments in KE: special emphasis will be put on the paradigm shift from the so-called transfer approach to the so-called modeling approach. This paradigm shift is sometimes also considered as the transfer from first generation expert systems to second generation expert systems [43]. Based on this discussion Section 2 will be concluded by describing two prominent developments in the late eighties:Role-limiting Methods [99] and Generic Tasks [36]. In Section 3 we will present some modeling frameworks which have been developed in recent years: CommonKADS [129], MIKE [6], and PROTÈGÈ-II [123]. 
Section 4 gives a short overview of specification languages for KBSs. Problem-solving methods have been a major research topic in KE for the last decade. Basic characteristics of (libraries of) problem-solving methods are described in Section 5. Ontologies, which gained a lot of importance during the last years are discussed in Section 6. The paper concludes with a discussion of current developments in KE and their relationships to other disciplines.In KE much effort has also been put in developing methods and supporting tools for knowledge elicitation (compare [48]). E.g. in the VITAL approach [130] a collection of elicitation tools, like e.g. repertory grids (see [65], [83]), are offered for supporting the elicitation of domain knowledge (compare also [49]). However, a discussion of the various elicitation methods is beyond the scope of this paper.2Historical Roots2.1Basic NotionsIn this section we will first discuss some main principles which characterize the development of KE from the very beginning.Knowledge Engineering as a Transfer Process…This transfer and transformation of problem-solving expertise from a knowledge source to a program is the heart of the expert-system development process.” [81]In the early eighties the development of a KBS has been seen as a transfer process of humanknowledge into an implemented knowledge base. This transfer was based on the assumption that the knowledge which is required by the KBS already exists and just has to be collected and implemented. Most often, the required knowledge was obtained by interviewing experts on how they solve specific tasks [108]. Typically, this knowledge was implemented in some kind of production rules which were executed by an associated rule interpreter. However, a careful analysis of the various rule knowledge bases showed that the rather simple representation formalism of production rules did not support an adequate representation of different types of knowledge [38]: e.g. in the MYCIN knowledge base [44] strategic knowledge about the order in which goals should be achieved (e.g. “consider common causes of a disease first“) is mixed up with domain specific knowledge about for example causes for a specific disease. This mixture of knowledge types, together with the lack of adequate justifications of the different rules, makes the maintenance of such knowledge bases very difficult and time consuming. Therefore, this transfer approach was only feasible for the development of small prototypical systems, but it failed to produce large, reliable and maintainable knowledge bases.Furthermore, it was recognized that the assumption of the transfer approach, that is that knowledge acquisition is the collection of already existing knowledge elements, was wrong due to the important role of tacit knowledge for an expert’s problem-solving capabilities. These deficiencies resulted in a paradigm shift from the transfer approach to the modeling approach.Knowledge Engineering as a Modeling ProcessNowadays there exists an overall consensus that the process of building a KBS may be seen as a modeling activity. Building a KBS means building a computer model with the aim of realizing problem-solving capabilities comparable to a domain expert. It is not intended to create a cognitive adequate model, i.e. to simulate the cognitive processes of an expert in general, but to create a model which offers similar results in problem-solving for problems in the area of concern. 
While the expert may consciously articulate some parts of his or her knowledge, he or she will not be aware of a significant part of this knowledge since it is hidden in his or her skills. This knowledge is not directly accessible, but has to be built up and structured during the knowledge acquisition phase. Therefore this knowledge acquisition process is no longer seen as a transfer of knowledge into an appropriate computer representation, but as a model construction process ([41], [106]).This modeling view of the building process of a KBS has the following consequences:•Like every model, such a model is only an approximation of the reality. In principle, the modeling process is infinite, because it is an incessant activity with the aim of approximating the intended behaviour.•The modeling process is a cyclic process. New observations may lead to a refinement, modification, or completion of the already built-up model. On the other side, the model may guide the further acquisition of knowledge.•The modeling process is dependent on the subjective interpretations of the knowledge engineer. Therefore this process is typically faulty and an evaluation of the model with respect to reality is indispensable for the creation of an adequate model. According to this feedback loop, the model must therefore be revisable in every stage of the modeling process.Problem Solving MethodsIn [39] Clancey reported on the analysis of a set of first generation expert systems developed to solve different tasks. Though they were realized using different representation formalisms (e.g. production rules, frames, LISP), he discovered a common problem solving behaviour.Clancey was able to abstract this common behaviour to a generic inference pattern called Heuristic Classification , which describes the problem-solving behaviour of these systems on an abstract level, the so called Knowledge Level [113]. This knowledge level allows to describe reasoning in terms of goals to be achieved, actions necessary to achieve these goals and knowledge needed to perform these actions. A knowledge-level description of a problem-solving process abstracts from details concerned with the implementation of the reasoning process and results in the notion of a Problem-Solving Method (PSM).A PSM may be characterized as follows (compare [20]):• A PSM specifies which inference actions have to be carried out for solving a given task.• A PSM determines the sequence in which these actions have to be activated.•In addition, so-called knowledge roles determine which role the domain knowledge plays in each inference action. These knowledge roles define a domain independent generic terminology.When considering the PSM Heuristic Classification in some more detail (Figure 1) we can identify the three basic inference actions abstract ,heuristic match , and refine . Furthermore,four knowledge roles are defined:observables ,abstract observables ,solution abstractions ,and solutions . It is important to see that such a description of a PSM is given in a generic way.Thus the reuse of such a PSM in different domains is made possible. When considering a medical domain, an observable like …410 C“ may be abstracted to …high temperature“ by the inference action abstract . This abstracted observable may be matched to a solution abstraction, e.g. …infection“, and finally the solution abstraction may be hierarchically refined to a solution, e.g. 
the disease "influenza". In the meantime various PSMs have been identified, like e.g. Cover-and-Differentiate for solving diagnostic tasks [99] or Propose-and-Revise [100] for parametric design tasks.

[Fig. 1 The Problem-Solving Method Heuristic Classification (knowledge roles and inference actions)]

PSMs may be exploited in the knowledge engineering process in different ways:
• PSMs contain inference actions which need specific knowledge in order to perform their task. For instance, Heuristic Classification needs a hierarchically structured model of observables and solutions for the inference actions abstract and refine, respectively. So a PSM may be used as a guideline to acquire static domain knowledge.
• A PSM allows the main rationale of the reasoning process of a KBS to be described, which supports the validation of the KBS, because the expert is able to understand the problem-solving process. In addition, this abstract description may be used during the problem-solving process itself for explanation facilities.
• Since PSMs may be reused for developing different KBSs, a library of PSMs can be exploited for constructing KBSs from reusable components.
The concept of PSMs has strongly stimulated research in KE and thus has influenced many approaches in this area. A more detailed discussion of PSMs is given in Section 5.

2.2 Specific Approaches

During the eighties two main approaches evolved which had significant influence on the development of modeling approaches in KE: Role-Limiting Methods and Generic Tasks.

Role-Limiting Methods
Role-Limiting Methods (RLM) ([99], [102]) have been one of the first attempts to support the development of KBSs by exploiting the notion of a reusable problem-solving method. The RLM approach may be characterized as a shell approach. Such a shell comes with an implementation of a specific PSM and thus can only be used to solve the type of tasks for which the PSM is appropriate. The given PSM also defines the generic roles that knowledge can play during the problem-solving process, and it completely fixes the knowledge representation for the roles, such that the expert only has to instantiate the generic concepts and relationships which are defined by these roles.
Let us consider as an example the PSM Heuristic Classification (see Figure 1). An RLM based on Heuristic Classification offers a role observables to the expert. Using that role the expert (i) has to specify which domain specific concept corresponds to that role, e.g. "patient data" (see Figure 4), and (ii) has to provide domain instances for that concept, e.g. concrete facts about patients. It is important to see that the kind of knowledge which is used by the RLM is predefined. Therefore, the acquisition of the required domain specific instances may be supported by (graphical) interfaces which are custom-tailored for the given PSM.
In the following we will discuss one RLM in some more detail: SALT ([100], [102]), which is used for solving constructive tasks. Then we will outline a generalization of RLMs to so-called Configurable RLMs.
SALT is an RLM for building KBSs which use the PSM Propose-and-Revise. Thus KBSs may be constructed for solving specific types of design tasks, e.g. parametric design tasks.
The basic inference actions that Propose-and-Revise is composed of may be characterized as follows:
• extend a partial design by proposing a value for a design parameter not yet computed,
• determine whether all computed parameters fulfil the relevant constraints, and
• apply fixes to remove constraint violations.
In essence three generic roles may be identified for Propose-and-Revise ([100]):
• "design-extensions" refer to knowledge for proposing a new value for a design parameter,
• "constraints" provide knowledge restricting the admissible values for parameters, and
• "fixes" make potential remedies available for specific constraint violations.
From this characterization of the PSM Propose-and-Revise, one can easily see that the PSM is described in generic, domain-independent terms. Thus the PSM may be used for solving design tasks in different domains by specifying the required domain knowledge for the different predefined generic knowledge roles.
E.g. when SALT was used for building the VT-system [101], a KBS for configuring elevators, the domain expert used the form-oriented user interface of SALT for entering domain specific design extensions (see Figure 2). That is, the generic terminology of the knowledge roles, which is defined by object and relation types, is instantiated with VT specific instances.

1 Name:          CAR-JAMB-RETURN
2 Precondition:  DOOR-OPENING = CENTER
3 Procedure:     CALCULATION
4 Formula:       [PLATFORM-WIDTH - OPENING-WIDTH] / 2
5 Justification: CENTER-OPENING DOORS LOOK BEST WHEN CENTERED ON PLATFORM.
(The value of the design parameter CAR-JAMB-RETURN is calculated according to the formula in case the precondition is fulfilled; the justification gives a description why this parameter value is preferred over other values. Example taken from [100].)
Fig. 2 Design Extension Knowledge for VT

On the one hand, the predefined knowledge roles, and thus the predefined structure of the knowledge base, may be used as a guideline for the knowledge acquisition process: it is clearly specified what kind of knowledge has to be provided by the domain expert. On the other hand, in most real-life situations the problem arises of how to determine whether a specific task may be solved by a given RLM. Such task analysis is still a crucial problem, since up to now there does not exist a well-defined collection of features for characterizing a domain task in a way which would allow a straightforward mapping to appropriate RLMs. Moreover, RLMs have a fixed structure and do not provide a good basis when a particular task can only be solved by a combination of several PSMs.
In order to overcome this inflexibility of RLMs, the concept of configurable RLMs has been proposed. Configurable Role-Limiting Methods (CRLMs) as discussed in [121] exploit the idea that a complex PSM may be decomposed into several subtasks where each of these subtasks may be solved by different methods (see Section 5). In [121], various PSMs for solving classification tasks, like Heuristic Classification or Set-covering Classification, have been analysed with respect to common subtasks. This analysis resulted in the identification of shared subtasks like "data abstraction" or "hypothesis generation and test". Within the CRLM framework a predefined set of different methods is offered for solving each of these subtasks. Thus a PSM may be configured by selecting a method for each of the identified subtasks. In that way the CRLM approach provides means for configuring the shell for different types of tasks.
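The design-extension knowledge of Fig. 2 lends itself to a declarative encoding. The following sketch is a minimal, hypothetical illustration of how such knowledge could drive a propose step and a constraint check; the data structures and function names are invented for the example and do not reproduce SALT's actual implementation.

```python
# Illustrative sketch of Propose-and-Revise design-extension knowledge (cf. Fig. 2).
# The representation is hypothetical and far simpler than SALT's; it only shows how
# a generic "propose" step can be driven by declarative domain knowledge.

design_extensions = [
    {
        "name": "CAR-JAMB-RETURN",
        "precondition": lambda d: d.get("DOOR-OPENING") == "CENTER",
        "formula": lambda d: (d["PLATFORM-WIDTH"] - d["OPENING-WIDTH"]) / 2,
        "justification": "Center-opening doors look best when centered on platform.",
    },
]

constraints = [
    # A constraint restricts admissible values; a fix names a remedy for its violation.
    {"check": lambda d: d.get("CAR-JAMB-RETURN", 0) >= 0,
     "fix": "decrease OPENING-WIDTH"},
]

def propose(design):
    """Extend a partial design by computing parameters whose preconditions hold."""
    for ext in design_extensions:
        if ext["name"] not in design and ext["precondition"](design):
            design[ext["name"]] = ext["formula"](design)
    return design

def revise(design):
    """Report fixes for violated constraints (the real method would apply them)."""
    return [c["fix"] for c in constraints if not c["check"](design)]

design = {"DOOR-OPENING": "CENTER", "PLATFORM-WIDTH": 90, "OPENING-WIDTH": 40}
design = propose(design)
print(design["CAR-JAMB-RETURN"], revise(design))  # 25.0 []
```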
It should be noted that each method offered for solving a specific subtask, has to meet the knowledge role specifications that are predetermined for the CRLM shell, i.e. the CRLM shell comes with a fixed scheme of knowledge types. As a consequence, the introduction of a new method into the shell typically involves the modification and/or extension of the current scheme of knowledge types [121]. Having a fixed scheme of knowledge types and predefined communication paths between the various components is an important restriction distinguishing the CRLM framework from more flexible configuration approaches such as CommonKADS (see Section 3).It should be clear that the introduction of such flexibility into the RLM approach removes one of its disadvantages while still exploiting the advantage of having a fixed scheme of knowledge types, which build the basis for generating effective knowledge-acquisition tools. On the other hand, configuring a CRLM shell increases the burden for the system developer since he has to have the knowledge and the ability to configure the system in the right way. Generic Task and Task StructuresIn the early eighties the analysis and construction of various KBSs for diagnostic and design tasks evolved gradually into the notion of a Generic Task (GT) [36]. GTs like Hierarchical Classification or State Abstraction are building blocks which can be reused for the construction of different KBSs.The basic idea of GTs may be characterized as follows (see [36]):• A GT is associated with a generic description of its input and output.• A GT comes with a fixed scheme of knowledge types specifying the structure of domain knowledge needed to solve a task.• A GT includes a fixed problem-solving strategy specifying the inference steps the strategy is composed of and the sequence in which these steps have to be carried out. The GT approach is based on the strong interaction problem hypothesis which states that the structure and representation of domain knowledge is completely determined by its use [33]. Therefore, a GT comes with both, a fixed problem-solving strategy and a fixed collection of knowledge structures.Since a GT fixes the type of knowledge which is needed to solve the associated task, a GT provides a task specific vocabulary which can be exploited to guide the knowledge acquisition process. Furthermore, by offering an executable shell for a GT, called a task specific architecture, the implementation of a specific KBS could be considered as the instantiation of the predefined knowledge types by domain specific terms (compare [34]). On a rather pragmatic basis several GTs have been identified including Hierarchical Classification,Abductive Assembly and Hypothesis Matching. This initial collection of GTs was considered as a starting point for building up an extended collection covering a wide range of relevant tasks.However, when analyzed in more detail two main disadvantages of the GT approach have been identified (see [37]):•The notion of task is conflated with the notion of the PSM used to solve the task, sinceeach GT included a predetermined problem-solving strategy.•The complexity of the proposed GTs was very different, i.e. it remained open what the appropriate level of granularity for the building blocks should be.Based on this insight into the disadvantages of the notion of a GT, the so-called Task Structure approach was proposed [37]. 
The Task Structure approach makes a clear distinction between a task, which is used to refer to a type of problem, and a method, which is a way to accomplish a task. In that way a task structure may be defined as follows (see Figure 3): a task is associated with a set of alternative methods suitable for solving the task. Each method may be decomposed into several subtasks. The decomposition structure is refined to a level where elementary subtasks are introduced which can directly be solved by using available knowledge.
As we will see in the following sections, the basic notions of task and (problem-solving) method, and their embedding into a task-method-decomposition structure, are concepts which are nowadays shared among most of the knowledge engineering methodologies.

[Fig. 3 Sample Task Structure for Diagnosis (tasks, problem-solving methods and subtasks)]

3 Modeling Frameworks

In this section we will describe three modeling frameworks which address various aspects of model-based KE approaches: CommonKADS [129] is prominent for having defined the structure of the Expertise Model, MIKE [6] puts emphasis on a formal and executable specification of the Expertise Model as the result of the knowledge acquisition phase, and PROTÉGÉ-II [51] exploits the notion of ontologies.
It should be clear that there exist further approaches which are well known in the KE community, like e.g. VITAL [130], Commet [136], and EXPECT [72]. However, a discussion of all these approaches is beyond the scope of this paper.

3.1 The CommonKADS Approach

A prominent knowledge engineering approach is KADS [128] and its further development to CommonKADS [129]. A basic characteristic of KADS is the construction of a collection of models, where each model captures specific aspects of the KBS to be developed as well as of its environment. In CommonKADS the Organization Model, the Task Model, the Agent Model, the Communication Model, the Expertise Model and the Design Model are distinguished. Whereas the first four models aim at modeling the organizational environment the KBS will operate in, as well as the tasks that are performed in the organization, the Expertise Model and the Design Model describe (non-)functional aspects of the KBS under development. Subsequently, we will briefly discuss each of these models and then provide a detailed description of the Expertise Model:
• Within the Organization Model the organizational structure is described together with a specification of the functions which are performed by each organizational unit. Furthermore, the deficiencies of the current business processes, as well as opportunities to improve these processes by introducing KBSs, are identified.
• The Task Model provides a hierarchical description of the tasks which are performed in the organizational unit in which the KBS will be installed. This includes a specification of which agents are assigned to the different tasks.
• The Agent Model specifies the capabilities of each agent involved in the execution of the tasks at hand. In general, an agent can be a human or some kind of software system, e.g. a KBS.
• Within the Communication Model the various interactions between the different agents are specified. Among others, it specifies which type of information is exchanged between the agents and which agent is initiating the interaction.
A major contribution of the KADS approach is its proposal for structuring the Expertise Model, which distinguishes three different types of knowledge required to solve a particular task.
Basically, the three different types correspond to a static view, a functional view and a dynamic view of the KBS to be built (see in Figure 4 respectively the "domain layer", "inference layer" and "task layer"):
• Domain layer: At the domain layer all the domain specific knowledge is modeled which is needed to solve the task at hand. This includes a conceptualization of the domain in a domain ontology (see Section 6), and a declarative theory of the required domain knowledge. One objective for structuring the domain layer is to model it as reusable as possible for solving different tasks.
• Inference layer: At the inference layer the reasoning process of the KBS is specified by exploiting the notion of a PSM. The inference layer describes the inference actions the generic PSM is composed of, as well as the roles which are played by the domain knowledge within the PSM. The dependencies between inference actions and roles are specified in what is called an inference structure. Furthermore, the notion of roles provides a domain independent view on the domain knowledge. In Figure 4 (middle part) we see the inference structure for the PSM Heuristic Classification. Among others we can see that "patient data" plays the role of "observables" within the inference structure of Heuristic Classification.
• Task layer: The task layer provides a decomposition of tasks into subtasks and inference actions, including a goal specification for each task and a specification of how these goals are achieved. The task layer also provides means for specifying the control over the subtasks and inference actions, which are defined at the inference layer.

[Fig. 4 Expertise Model for medical diagnosis (simplified CML notation)]

Two types of languages are offered to describe an Expertise Model: CML (Conceptual Modeling Language) [127], which is a semi-formal language with a graphical notation, and (ML)2 [79], which is a formal specification language based on first order predicate logic, meta-logic and dynamic logic (see Section 4). Whereas CML is oriented towards providing a communication basis between the knowledge engineer and the domain expert, (ML)2 is oriented towards formalizing the Expertise Model.
The clear separation of the domain specific knowledge from the generic description of the PSM at the inference and task layer enables in principle two kinds of reuse: on the one hand, a domain layer description may be reused for solving different tasks by different PSMs; on the other hand, a given PSM may be reused in a different domain by defining a new view to another domain layer. This reuse approach is a weakening of the strong interaction problem hypothesis [33] which was addressed in the GT approach (see Section 2). In [129] the notion of a relative interaction hypothesis is defined to indicate that some kind of dependency exists between the structure of the domain knowledge and the type of task which should be solved. To achieve a flexible adaptation of the domain layer to a new task environment, the notion of layered ontologies is proposed: task and PSM ontologies may be defined as viewpoints on an underlying domain ontology.
Within CommonKADS a library of reusable and configurable components, which can be used to build up an Expertise Model, has been defined [29]. A more detailed discussion of PSM libraries is given in Section 5.
In essence, the Expertise Model and the Communication Model capture the functional requirements for the target system.
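To illustrate the separation between a generic inference layer and the domain layer, the following sketch implements the three inference actions of Heuristic Classification as generic functions whose knowledge roles (observables, abstractions, solution abstractions, solutions) are filled by domain-specific structures. All values are toy examples invented for the illustration and are not taken from an actual CommonKADS model.

```python
# Generic inference actions of Heuristic Classification (inference layer); the
# domain layer supplies the knowledge that fills the roles. Toy values only.
from typing import Dict, List

def abstract(observables: List[str], abstraction_knowledge: Dict[str, str]) -> List[str]:
    """Map observables (e.g. 'temperature 41 C') to abstract observables ('high temperature')."""
    return [abstraction_knowledge.get(o, o) for o in observables]

def heuristic_match(abstract_observables: List[str], match_knowledge: Dict[str, str]) -> List[str]:
    """Map abstract observables to solution abstractions (e.g. 'infection')."""
    return [match_knowledge[a] for a in abstract_observables if a in match_knowledge]

def refine(solution_abstractions: List[str], refinement_hierarchy: Dict[str, List[str]]) -> List[str]:
    """Refine solution abstractions to concrete solutions (e.g. 'influenza')."""
    return [s for a in solution_abstractions for s in refinement_hierarchy.get(a, [a])]

# Domain layer: "patient data" fills the role of observables; the mappings below
# play the roles of abstraction, match and refinement knowledge.
patient_data = ["temperature 41 C"]
abstraction_knowledge = {"temperature 41 C": "high temperature"}
match_knowledge = {"high temperature": "infection"}
refinement_hierarchy = {"infection": ["influenza"]}

solutions = refine(heuristic_match(abstract(patient_data, abstraction_knowledge),
                                   match_knowledge),
                   refinement_hierarchy)
print(solutions)  # ['influenza']
```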
Based on the requirements captured by the Expertise Model and the Communication Model, the Design Model is developed, which specifies among others the system architecture and the computational mechanisms for realizing the inference actions. KADS aims at achieving a structure-preserving design, i.e. the structure of the Design Model should reflect the structure of the Expertise Model as much as possible [129].
All the development activities, which result in a stepwise construction of the different models, are embedded in a cyclic and risk-driven life cycle model similar to Boehm's spiral model [21].
The basic structure of the Expertise Model has some similarities with the data, functional, and control views of a system as known from software engineering. However, a major difference may be seen between an inference layer and a typical data-flow diagram (compare [155]): whereas an inference layer is specified in generic terms and provides - via roles and domain views - a flexible connection to the data described at the domain layer, a data-flow diagram is completely specified in domain specific terms. Moreover, the data dictionary does not correspond to the domain layer, since the domain layer may provide a complete model of the domain at hand which is only partially used by the inference layer, whereas the data dictionary describes exactly those data which are used to specify the data flow within the data-flow diagram (see also [54]).

3.2 The MIKE Approach

The MIKE approach (Model-based and Incremental Knowledge Engineering) (cf. [6], [7])
Information Extraction and Knowledge Graph Construction in Natural Language Processing

Natural Language Processing (NLP) is an important technology in the field of artificial intelligence (AI); it aims to enable computers to understand and process text and speech expressed in human natural language. Information Extraction (IE) is a key NLP task whose goal is to extract structured, meaningful information from large volumes of unstructured text. Knowledge graph construction, in turn, integrates the extracted information into a structured knowledge base so that more advanced reasoning and analysis can be carried out.
Information extraction is a challenging task: unstructured natural language text lacks explicit syntactic and semantic rules and often contains ambiguity and complex forms of expression. Nevertheless, by applying a variety of techniques and algorithms, a range of important information can be extracted from text effectively. The main information extraction tasks include Named Entity Recognition (NER), Relation Extraction, Event Extraction and sentiment (attitude) analysis.
Named entity recognition is one of the basic information extraction tasks. It aims to identify entities of specific categories in text, such as persons, locations, organizations and temporal expressions. Efficient named entity recognition systems can be built with machine learning algorithms and models.
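As a concrete illustration, the following sketch runs an off-the-shelf statistical model from the spaCy library over a sentence; it assumes that spaCy and its small English model have been installed, and the example sentence is simply made up for the demonstration.

```python
# Minimal named entity recognition sketch with spaCy's pretrained English model.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Tim Berners-Lee founded the World Wide Web Consortium in 1994 in Geneva.")

# Each recognized entity exposes its surface text and a category label
# such as PERSON, ORG, GPE or DATE.
for ent in doc.ents:
    print(ent.text, ent.label_)
```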
Relation extraction, in turn, identifies the relations that hold between entities by analyzing the semantic relationships expressed in the text. These relations may be predefined or learned automatically from the text.
Event extraction is a more complex information extraction task: it requires extracting from the text the information related to a specific event. In news reports, for example, event extraction can identify key information such as the topic, time, location and participants of an event. To perform event extraction, the text has to be segmented into sentences and analyzed syntactically and semantically in order to obtain more precise information.
Sentiment (attitude) extraction is another important information extraction task; it aims to extract the author's emotions and opinions from the text. It has wide applications in areas such as social media analysis, opinion monitoring and market research. By applying sentiment analysis and machine learning algorithms, texts can be classified by sentiment polarity (positive, negative or neutral) and by subjectivity (subjective or objective).
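A very small lexicon-based polarity classifier is sketched below to make the idea concrete; the word lists are invented for the example, and practical systems rely on trained classifiers or pretrained language models rather than such hand-written lexicons.

```python
# Toy lexicon-based polarity classifier; purely illustrative. The word lists
# are hypothetical and far from exhaustive.
POSITIVE = {"good", "great", "excellent", "love", "reliable"}
NEGATIVE = {"bad", "poor", "terrible", "hate", "broken"}

def polarity(text: str) -> str:
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(polarity("The battery life is great but the screen is terrible."))  # neutral
```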
Evaluation Metrics for Information Extraction

Introduction: With the arrival of the Internet and the big data era, enormous amounts of information are generated and stored, and obtaining useful information from massive text data has become a key task. Information Extraction is a natural language processing technique that aims to extract useful information, such as entities, relations and events, from unstructured text. Information extraction involves several stages and processes, including entity recognition, relation extraction and event extraction, so evaluation metrics are needed to measure its performance and effectiveness. The following introduces some commonly used evaluation metrics for information extraction and explains their definitions and use step by step.
1. Precision
Precision is one of the evaluation metrics most commonly used in information extraction. It measures how many of the items recognized by the system are correct. It is computed as:
Precision = number of correctly recognized entities / total number of entities recognized by the system
Precision ranges from 0 to 1; the closer it is to 1, the better the system's recognition ability. However, precision alone cannot fully reflect the performance of a system, because it ignores the entities that the system fails to capture.
2. Recall
Recall is another commonly used evaluation metric for information extraction. It measures how many of the entities present in the text the system is able to recognize. It is computed as:
Recall = number of correctly recognized entities / number of true entities
Recall also ranges from 0 to 1; the closer it is to 1, the better the system's extraction ability. In contrast to precision, recall ignores the false positives among the system's recognition results.
3. F1 Score
The F1 score is an evaluation metric that combines precision and recall, and it is commonly used for the overall assessment of information extraction systems. It is computed as:
F1 = 2 * (Precision * Recall) / (Precision + Recall)
The F1 score also lies between 0 and 1; it provides a single combined measure of system performance that is more comparable and more stable than precision or recall taken separately. When both precision and recall are high, the F1 score is correspondingly high.
4. Accuracy
Accuracy is another common evaluation metric for information extraction. It measures the proportion of items in the entire text that the system handles correctly. It is computed as:
Accuracy = (number of correctly recognized entities + number of items correctly left unrecognized) / total number of items
Accuracy also lies between 0 and 1; the closer it is to 1, the better the system performs.
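The following sketch computes precision, recall and F1 from a predicted set and a gold-standard set of entity mentions. The exact-match scoring over (start, end, type) spans is an assumption made for the example; real benchmarks may also use partial or token-level matching.

```python
# Precision / recall / F1 for entity extraction, using exact matching of
# (start, end, type) spans. The example spans are invented.
def evaluate(predicted: set, gold: set):
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1

gold = {(0, 13, "PERSON"), (27, 38, "ORG"), (52, 56, "DATE")}
predicted = {(0, 13, "PERSON"), (27, 38, "LOC")}

p, r, f1 = evaluate(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
# precision=0.50 recall=0.33 F1=0.40
```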
Knowledge Graphs and Their Applications in Natural Language Processing

1. Preface
With the continuous development of the Internet and artificial intelligence technology, the dramatic growth of data volumes and the demand for ever more efficient applications, our ability to process and exploit data faces enormous challenges. Knowledge graphs, with their clear structure, rich connectivity and efficient query capabilities, have become an important tool in the fields of data management and intelligent applications. The following introduces the basic concepts and construction methods of knowledge graphs and focuses on their applications in natural language processing.
2. Overview of Knowledge Graphs
Knowledge graphs are regarded as an important way of turning natural language text into a computable form of knowledge representation. A knowledge graph is a graph-structured knowledge base composed of a set of entities, attributes and relations, with wide applications in knowledge representation, knowledge retrieval and data mining. Its core elements are entities, attributes and relations, which respectively represent things in the real world, the properties of those things and the associations between them. Entities usually refer to concepts with real-world significance such as persons, places, organizations and events; attributes describe the features or states of entities; relations express the links or connections between entities.
A knowledge graph can be constructed in several ways. The most common is the ontology-based approach, in which entities, attributes and relations are classified and described, organized into a hierarchical structure, and linked across the different levels. Another approach is automatic construction based on information extraction, in which natural language processing techniques are used to automatically extract entity, attribute and relation information from large-scale text and build a very large knowledge graph.
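To make the notion of a graph of entities and relations concrete, here is a minimal in-memory triple store. It is purely illustrative (production systems would use a graph database or an RDF framework), and all entity and relation names are invented.

```python
# Minimal in-memory triple store for subject-predicate-object facts.
class KnowledgeGraph:
    def __init__(self):
        self.triples = set()

    def add(self, subject, predicate, obj):
        self.triples.add((subject, predicate, obj))

    def query(self, subject=None, predicate=None, obj=None):
        """Return all triples matching the given pattern (None acts as a wildcard)."""
        return [(s, p, o) for (s, p, o) in self.triples
                if (subject is None or s == subject)
                and (predicate is None or p == predicate)
                and (obj is None or o == obj)]

kg = KnowledgeGraph()
kg.add("Marie Curie", "born_in", "Warsaw")
kg.add("Marie Curie", "field", "Physics")
kg.add("Warsaw", "located_in", "Poland")

print(kg.query(subject="Marie Curie"))
print(kg.query(predicate="located_in"))
```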
3. Applications of Knowledge Graphs in Natural Language Processing
3.1 Entity Recognition
Entity recognition refers to identifying entities with a specific semantics in natural language text. An entity in a knowledge graph is usually described by a unique identifier and a set of attributes, so entity recognition can be seen as a bridge between natural language text and the knowledge graph. The results of entity recognition can be used directly for tasks such as indexing, retrieval and recommendation, or combined with other natural language processing techniques such as relation extraction and event recognition. Some existing knowledge graphs already contain large amounts of entity and relation information, for example Wikipedia, Freebase and YAGO.
3.2 Relation Extraction
Relation extraction refers to automatically identifying the semantic relations between entities in natural language text.
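A toy pattern-based relation extractor is sketched below; the regular expression, the relation name and the sample sentences are invented for the illustration, whereas practical systems combine syntactic analysis with learned models.

```python
# Toy pattern-based relation extraction: "<ORG> was founded by <PERSON>"
# yields (person, founder_of, org). Pattern and names are hypothetical.
import re

PATTERN = re.compile(
    r"(?P<org>[A-Z][\w&]*(?: [A-Z][\w&]*)*) was founded by "
    r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)*)"
)

def extract_relations(text: str):
    triples = []
    for m in PATTERN.finditer(text):
        triples.append((m.group("person"), "founder_of", m.group("org")))
    return triples

text = "Acme Corp was founded by Jane Doe in 1999. BetaSoft was founded by John Roe."
print(extract_relations(text))
# [('Jane Doe', 'founder_of', 'Acme Corp'), ('John Roe', 'founder_of', 'BetaSoft')]
```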
Ontologies and Information Extraction*C. Nédellec1 and A. Nazarenko21 Laboratoire Mathématique, Informatique et Génome (MIG), INRA, Domaine de Vilvert, 78352 F-Jouy-en-Josas cedex2 Laboratoire d’Informatique de Paris-Nord (LIPN), Université Paris-Nord & CNRS, av. J.B. Clément, F-93430 Villetaneuse1 IntroductionAn ontology is a description of conceptual knowledge organized in a computer-based representation while information extraction (IE) is a method for analyzing texts expressing facts in natural language and extracting relevant pieces of infor-mation from these texts.IE and ontologies are involved in two main and related tasks,•Ontology is used for Information Extraction: IE needs ontologies as part of the understanding process for extracting the relevant information; •Information Extraction is used for populating and enhancing the ontology: texts are useful sources of knowledge to design and enrich ontologies.These two tasks are combined in a cyclic process: ontologies are used for inter-preting the text at the right level for IE to be efficient and IE extracts new knowl-edge from the text, to be integrated in the ontology.We will argue that even in the simplest cases, IE is an ontology-driven process. It is not a mere text filtering method based on simple pattern matching and key-words, because the extracted pieces of texts are interpreted with respect to a prede-fined partial domain model. We will show that depending on the nature and the depth of the interpretation to be done for extracting the information, more or less knowledge must be involved.Extracting information from texts calls for lexical knowledge, grammars de-scribing the specific syntax of the texts to be analyzed, as well as semantic and on-tological knowledge. In this chapter, we will not take part in the debate about the limit between lexicon and ontology as a conceptual model. We will rather focus* LIPN Internal Report, 2005. This paper has been originally written in march 2003. A shorter version has been published under the title “Ontology and Information Extraction: A Necessary Symbiosis”, in Ontology Learning from Text: Methods, Evaluation and Ap-plications, edited by P. Buitelaar, P. Cimiano and B. Magnini, IOS Press Publication, July 2005,on the role that ontologies viewed as semantic knowledge bases could play in IE. The ontologies that can be used for and enriched by IE relate conceptual knowl-edge to its linguistic realizations (e.g. a concept must be associated with the terms that express it, eventually in various languages).Interpreting text factual information also calls for knowledge on the domain referential entities that we consider as part of the ontology (Sect. 2.2.1).This chapter will be mainly illustrated in biology, a domain in which there are critical needs for content-based exploration of the scientific literature and which becomes a major application domain for IE.2 SettingsBefore exploring the close relationship that links ontology and IE in Sect. 3 and Sect. 4, we will define Information Extraction and ontology.2.1 What is IE?The considerable development of multimedia communication goes along with an exponentially increasing volume of textual information. Today, mere Information Retrieval (IR) technologies are unable to meet the needs of specific information because they provide information at a document collection level. 
Developing in-telligent tools and methods, which give access to document content and extract relevant information, is more than ever a key issue for knowledge and information management. IE is one of the main research fields that attempt to fulfill this need.2.1.1 DefinitionThe IE field has been initiated by the DARPA's MUC program (Message Under-standing Conference in 1987 (MUC Proceedings; Grishman and Sundheim 1996). MUC has originally defined IE as the task of (1) extracting specific, well-defined types of information from the text of homogeneous sets of documents in restricted domains and (2) filling pre-defined form slots or templates with the extracted in-formation. MUC has also brought about a new evaluation paradigm: comparing the information extracted by automatic ways to human-produced results. MUC has inspired a large amount of work in IE and has become a major reference in the text-mining field. Even as such, it is still a challenging task to build an efficient IE system with good recall (coverage) and precision (correctness) rates.A typical IE task is illustrated by Fig. 1 from a CMU corpus of seminar an-nouncements (Freitag 1998). IE process recognizes a name (John Skvoretz) and classifies it as a person name. It also recognizes a seminar event and creates a seminar event form (John Skvoretz is the seminar speaker whose presentation is entitled “Embedded commitment”).Even in such a simple example, IE should not be considered as a mere keyword filtering method. Filling a form with some extracted words and textual fragments involves a part of interpretation. Any fragment must be interpreted with respect to its “context” (i.e. domain knowledge or other pieces of information extracted from the same document) and according to its “type” (i.e. the information is the value of an attribute / feature / role represented by a slot of the form). In the document of Fig. 1, “4-5:30” is understood as a time interval and background knowledge about seminars is necessary to interpret “4” as “4 pm” and as the seminar starting time.Form to fill (partial)place:?starting time: ?title:??speaker:Document: Professor John Skvoretz, U. of South Carolina, Columbia, will present a seminar entitled "Embedded commitment", on Thursday, May 4th from 4-5:30 in PH 223D.Filled form (partial)place: PH 223Dstarting time: 4 pmtitle: Embedded commitmentspeaker: Professor John Skvoretz […]Fig 1. A seminar announcement event example2.1.2 IE overall processOperationally, IE relies on document preprocessing and extraction rules (or ex-traction patterns) to identify and interpret the information to be extracted. The ex-traction rules specify the conditions that the preprocessed text must verify and how the relevant textual fragments can be interpreted to fill the forms. In the sim-plest case, the textual fragment and the coded information are the same and there are neither text preprocessing nor interpretation.More precisely, in a typical IE system, three processing steps can be identified (Hobbs et al. 1997; Cowie and Wilks 2000):1.text preprocessing, whose level ranges from mere text segmentation intosentences and sentences into tokens to a full linguistic analysis;2.rule selection: the extraction rules are associated with triggers (e.g. key-words), the text is scanned to identify the triggering items and the corre-sponding rules are selected;3.rule application, which checks the conditions of the selected rules and fillsthe forms according to the conclusions of the matching rules.Extraction rules. 
The rules are usually declarative. The conditions are expressed in a Logics-based formalism (Fig. 3), in the form of regular expressions, patterns or transducers. The conclusion explains how to identify in the text the value that should fill a slot of the form. The result may be a filled form, as in Fig. 1 and 2, or equivalently, a labeled text as in Fig. 3.

Sentence: "GerE stimulates the expression of cotA."
Rule
  Conditions: X = "expression of"
  Conclusions: Interaction_Target <- next-token(X).
Filled form: Interaction_Target: cotA
Fig. 2. IE partial example from functional genomics

Experiments have been made with various kinds of rules, ranging from the simplest ones (Riloff 1993) (e.g. the subject of the passive form of the verb "murder" is interpreted as a victim) to sophisticated ones as in (Soderland et al. 1995). The more explicit (i.e. the more semantic and conceptual) the IE rule, the more powerful, concise and understandable it is. However, it requires the input text to be parsed and semantically tagged.
A single-slot rule extracts a single value, as in Fig. 2, while a multi-slot rule correctly extracts at the same time all the values for a given form, as in Fig. 3, even if there is more than one event reported in the text fragment.
IE forms. Extraction usually proceeds by filling forms of increasing complexity (Wilks 1997):
• Filling entity forms aims at identifying the items representing the domain referential entities. These items are called "named entities" (e.g. Analysis & Technology Inc.) and assimilated to proper names (company, person, gene names), but they can be any kind of word or expression that refers to a domain entity: dates, numbers, titles for the management succession MUC-6 application, bedrooms in a real-estate IE application (Soderland 1999).
• Filling domain event forms: The information about the events extracted by the rules is then encoded into forms in which a specific event of a given type and its role fillers are described. An entity form may fill an event role.
• Merging forms that are issued from different parts of the text but provide information about a same entity or event.
• Assembling scenario forms: Ideally, various event and entity forms can be further organized into a larger scenario form describing a temporal or logical sequence of actions/events.
Text processing. As shown in Fig. 3, the condition part of the extraction rules may check the presence of a given lexical item (e.g. the verb named), the syntactic category of words and their syntactic dependencies (e.g. object and subject relations). Different clues such as typographical characteristics, relative position of words, semantic tags¹ or even coreference relations can also be exploited.
Most IE systems therefore involve linguistic text processing and semantic knowledge: segmentation into words, morpho-syntactic tagging (the part-of-speech categories of words are identified), syntactic analysis (sentence constituents such as noun or verb phrases are identified and the structure of complex sentences is analyzed) and sometimes additional processing: lexical disambiguation, semantic tagging or anaphora resolution.

¹ E.g., if the verbs "named", "appointed" and "elected" of Fig. 3 were all known as 'nomination' verbs, the fourth condition of the rule could have been generalized to their semantic category 'nomination'.
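To make the single-slot rule of Fig. 2 concrete, the following sketch applies it with a simple regular expression; this is a simplified illustration written for this text, not the machinery of any of the systems cited here.

```python
# Simplified application of the single-slot rule of Fig. 2: if the trigger
# "expression of" occurs, the next token fills the Interaction_Target slot.
import re

def apply_rule(sentence: str) -> dict:
    form = {}
    match = re.search(r"expression of\s+(\w+)", sentence)
    if match:
        form["Interaction_Target"] = match.group(1)
    return form

print(apply_rule("GerE stimulates the expression of cotA."))
# {'Interaction_Target': 'cotA'}
```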
Sentence: "NORTH STONINGTON, Connecticut (Business Wire) - 12/2/94 - Joseph M. Marino and Richard P. Mitchell have been named senior vice president of Analysis & Technology Inc. (NASDAQ NMS: AATI), Gary P. Bennett, president and CEO, has announced."

Rule
  Conditions:
    noun-phrase (PNP, head(isa(person-name))),
    noun-phrase (TNP, head(isa(title))),
    noun-phrase (CNP, head(isa(company-name))),
    verb-phrase (VP, type(passive), head(named or elected or appointed)),
    preposition (PREP, head(of or at or by)),
    subject (PNP, VP),
    object (VP, TNP),
    post_nominal_prep (TNP, PREP),
    prep_object (PREP, CNP)
  Conclusion:
    management_appointment (M, person(PNP), title(TNP), company(CNP)).

Comment: if there is a noun phrase (NP) whose head is a person name (PNP), an NP whose head is a title name (TNP), an NP whose head is a company name (CNP), a verb phrase whose head is a passive verb (named, elected or appointed), and a preposition of, at or by, if PNP and TNP are respectively subject and object of the verb, and if CNP modifies TNP, then it can be stated that the person "PNP" is named "TNP" of the company "CNP".

Labeled document:
NORTH STONINGTON, Connecticut (Business Wire) - 12/2/94 - <Person>Joseph M. Marino and Richard P. Mitchell</Person> have been named <Title>senior vice president</Title> of <Company>Analysis & Technology Inc</Company>. (NASDAQ NMS: AATI), Gary P. Bennett, president and CEO, has announced.

Fig. 3. Example from MUC-6, a newswire about management succession

However, the role and the scope of this analysis differ from one IE system to another. Text analysis can be performed either as preprocessing or during extraction rule application. In the former case, the whole text is first analyzed. The analysis is global in the sense that items spread all over the document can contribute to build the normalized and enriched representation of the text. Then, the application of extraction rules comes down to a simple filtering process over the enriched representation. In the latter case, the text analysis is driven by the verification of the rule conditions. The analysis is local, focuses on the context of the triggering items of the rules, and fully depends on the conditions to be checked in the selected rules.
In the first IE systems (Hobbs et al. 1997), local and goal-driven analysis was preferred to full text preanalysis to increase efficiency, and the text preprocessing step was kept to a minimum. Although costly, data-driven, full-text analysis and normalization can improve the IE process in various ways. (1) It improves further NL processing steps, e.g. syntactic parsing improves attachment disambiguation (Basili et al. 1993) or coreference resolution. (2) Full text analysis and normalization also facilitate the discovery of lexical and linguistic regularities in specific documents. This idea, initially promoted by works on sublanguages (Harris 1968, Sager et al. 1987) for tuning NL processing to a given type of texts, is now popularized by Machine Learning (ML) papers in the IE field for learning extraction rules. There are two main reasons for that. First, annotating training data is costly and the quantity of data to be annotated decreases with the normalization (the fewer variations in the data, the less data annotation is needed). Next, ML systems tend to learn non-understandable rules by picking details in training examples that do not look related. Normalizing the text by representing it in a more abstract way increases the understandability of the learned rules.
However, nor-malization also raises problems such as the biased choice of the right representa-tion before learning, that is not dealt with in the IE literature.We will see in the following that these two approaches, in which text analysis is respectively used for interpretation (goal-driven) and normalization (data-driven), are very much tangled, as any normalization process involves a part of interpreta-tion. One of the difficulties in designing IE systems is to set the limit between lo-cal and global analysis. Syntactic analysis or entity recognition can be performed on a local basis but are improved by knowledge inferred at a global level. Thus, ambiguous cases of syntactic attachments or entity classification can be solved by comparison with non-ambiguous similar cases of the same document.2.1.3 IE, an ambitious approach to text explorationAs mentioned above, there is a need for tools that give a real access to the docu-ment content. IE and Question Answering (Q/A) tasks both try to identify in documents the pieces of information that are relevant to a given query. They dif-fer, however, in the type of information that is looked for. A Q/A system has to answer to a wide range of unpredictable user questions. In IE, the information that is looked for is richer but the type of information is known in advance. The rele-vant pieces of text have to be identified and then interpreted with respect to the knowledge partially represented in the forms to fill.IE and Q/A systems both differ in their empirism from their common ancestors, the text-understanding systems. They both rely on targeted and local techniques of text exploration rather than on a large coverage and in-depth semantic analysis of the text. The MUC competition framework has gathered a large and stable IE community. It has also drawn the research towards easily implementable and effi-cient methods rather than strong and well-founded NLP theories.The role of semantics in IE is often reduced to very shallow semantic labeling. Semantic analysis is rather considered as a way to disambiguate syntactic steps than as a way to build a conceptual interpretation. Today, most of the IE systems that involve semantic analysis exploit the most simple part of the whole spectrum of domain and task knowledge, that is to say, named entities. However, the grow-ing need for IE application to domains such as functional genomics that require more text understanding pushes towards more sophisticated semantic knowledge resources and thus towards ontologies viewed as conceptual models, as it will be shown in this chapter.2.2 What is an Ontology in the IE framework?Even though ontologies usually do not appear as an autonomous component or re-source in IE systems, we argue that IE relies on ontological knowledge.2.2.1 Ontologies populated with referential entitiesThe ontology identifies the entities that have a form of existence in a given do-main and specifies their essential properties. It does not describe the spurious properties of these entities. On the contrary, the goal of IE is to extract factual knowledge to instantiate one or several predefined forms. The structure of the form (e.g. Fig. 4) is a matter of ontology whereas the values of the filled template usually reflect factual knowledge (as shown in Fig. 2 above) that is not part of the ontology. In these examples, the form to fill represents a part of the biological model of gene regulation network: proteins interact positively or negatively with genes. In Sect. 
3.4, we will show that IE is ontology-driven in that respect.Type: {negative, positive}InteractionAgent: any proteinTarget: any geneFig. 4. An example of IE form in the genomics domainThe status of the named entities is a pending question. Do they belong to the ontology or are they factual knowledge? From a theoretical point of view, accord-ing to Brachman’s terminological logics view (1979), they are instances of con-cepts and as such, they are described and typed at the assertional level and not at the terminological or ontological level. In this chapter, we will nevertheless con-sider that entities, being referential entities, are part of the domain ontology be-cause it is the way IE considers them.2.2.2 Ontology with a natural language anchorageWhether one wants to use ontological knowledge to interpret natural language or to exploit written documents to create or update ontologies, in any case, the ontol-ogy has to be connected to linguistic phenomena. Ontology must be linguistically anchored. A large effort has been devoted in traditional IE systems based on local analysis to the definitions of extraction rules that achieve this anchoring. In the very simple example about gene interaction (Fig. 2 above), the ontological knowl-edge is encoded as a keyword rule, which can be considered as a kind of compiled knowledge. In more powerful IE systems, the ontological knowledge is more ex-plicitly stated in the rules that bridge the gap between the word level and text in-terpretation. For instance, the rule of Fig. 3 above, states that a management ap-pointment event can be expressed through three verbs (named, elected or appointed). As such, an ontology is not a purely conceptual model, it is a model associated to a domain-specific vocabulary and grammar. In the IE framework, weconsider that this vocabulary and grammar are part of the ontology, even when they are embodied in extraction rules.The complexity of the linguistic anchoring of ontological knowledge is well known and should not be underestimated. A concept can be expressed by different terms and many words are ambiguous. Rhetoric, such as lexicalized metonymies or elisions, introduces conceptual shortcuts at the linguistic level and must be elic-ited to be interpreted into domain knowledge. A noun phrase (e.g. “the citizen”) may refer to an instance (a specific citizen which has been previously mentioned in the text) or to the class (the set of all the citizens) leading then to a very differ-ent interpretation. These phenomena, which illustrate the gab between the linguis-tic and the ontological levels, strongly affect IE performance. This explains why IE rules are so difficult to design.2.2.3 Partial ontologiesIE is a targeted textual analysis process. The target information is described in the structure of the forms to fill. As mentioned above (Sect. 2.1.2) MUC has identified various types of forms describing elements or entities, events and scenarios.IE does not require a whole formal ontological system but parts of it only. We consider that the ontological knowledge involved in IE can be viewed as a set of interconnected and concept-centered descriptions, or “conceptual nodes2”. In con-ceptual nodes the concept properties and the relations between concepts are ex-plicit. These conceptual nodes should be understood as chunks of a global knowl-edge model of the domain. 
We consider here various types of concepts: an object node lists the various properties of the object; an event node describes the various objects involved in the event and their roles; a scenario node describes one or sev-eral events involved in the scenario and their interrelations. The use of this type of knowledge in NLP systems is traditional (Schank and Abelson 1977) and is illus-trated by MUC tasks.2.3 Specificity of the ontology-IE relationshipOntology and IE are closely connected by a mutual contribution. The ontology is required for the IE interpreting process and IE provides methods for ontological knowledge acquisition. Even if using IE for extracting ontological knowledge is still rather marginal, it is gaining in importance. We distinguish both aspects in the following Sects. 3 and 4, although the whole process is a cyclic one. A first level of ontological knowledge (e.g. entities) helps to extract new pieces of knowledge from which more elaborated abstract ontological knowledge can be designed, which help to extract new pieces of information in an iterative process.2 We define a conceptual node as a piece of ontological model to which linguistic informa-tion can be attached. It differs from the “conceptual nodes” of (Soderland et al. 1995), which are extraction patterns describing a concept. We will see below that several extrac-tion rules may be associated to a unique conceptual node.3. Ontology for Information extractionThe template or form to be fulfilled by IE is a partial model of world knowledge. IE forms are also classically viewed as a model of a database to be filled by the in-stances extracted. This view is consistent with the first one. In this respect, any IE system is ontology-driven: in IE processes, the ontological knowledge is primarily used for text interpretation. How poor the semantics underlying the form to fill may be (see Fig. 2, for instance), whether it is explicit (Gaizauskas and Wilks, 1997; Embley et al., 1998) or not (Freitag 1998) (see Fig. 5 below), IE is always based on a knowledge model. In this Sect. 3, for exposition purposes, we distin-guish different levels of ontological knowledge:•The referential domain entities and their variations are listed in “flat ontolo-gies”. This is mainly used for entity identification and semantic tagging of character strings in documents.•At a second level, the conceptual hierarchy improves normalization by ena-bling more general levels of representation.•More sophisticated IE systems also make use of chunks of a domain model(i.e. conceptual nodes), in which the properties and interrelations of entitiesare described. The projection of these relations on the text both improves the NL processes and guides the instantiation of conceptual frames, scenar-ios or database tuples. The corresponding rules are based either on lexico-syntactic patterns or on more semantic ones.•The domain model itself is used for inference. It enables different structures to be merged and the implicit information to be brought to light.3.1 Sets of entitiesRecognizing and classifying named entities in texts require knowledge on the do-main entities. Specialized lexical or key-word lists are commonly used to identify the referential entities in documents. For instance, in the context of cancer treat-ment, (Rindflesh et al. 2000) makes use of the concepts of the Metathesaurus of UMLS to identify and classify biological entities in papers reporting interactions between proteins, genes and drugs. 
In different experiments, some lists of gene and protein names are exploited. For instance, (Humphreys et al. 2000) makes use of the SWISS PROT resource whereas (Ono et al. 2001) combines pattern match-ing with a manually constructed dictionary. In the financial news of MUC-5, lists of company names have been used.In a similar way, Auto-Slog (Riloff 1993), CRYSTAL (Soderland et al. 1995), PALKA (Kim and Moldovan 1995), WHISK (Soderland 1999) and Pinocchio (Ciravegna 2000) make use of list of entities to identify the referential entities in documents. The use of lexicon and dictionaries is however controversial. Some authors like (Mikheev et al. 1999) argue that entity named recognition can be done without it.Three main objectives of these specialized lexicons can be distinguished, se-mantic tagging, naming normalization and linguistic normalization, although these operations are usually processed all at once.Semantic taggingSemantic tagging. List of entities are used to tag the text entities with the relevant semantic information. In the ontology or lexicon, an entity (e.g. Tony Bridge) is described by its type (the semantic class to which it belongs, here PERSON) and by the list of the various textual forms (typographical variants, abbreviations, syno-nyms) that may refer to it3 (Mr. Bridge, Tony Bridge, T. Bridge).However, exact character strings are often not reliable enough for a precise en-tity identification and semantic tagging. Polysemic words that do exist even in sublanguages belong to different semantic classes. In the above example, the string “Bridge” could also refer to a bridge named “Tony”. (Soderland 1999) re-ports experiments on a similar problem on a software job ad domain: WHISK is able to learn some contextual IE rules but some rules are difficult to learn because they rely on subtle semantic variations, e.g., the word “Java” can be interpreted as competency in the programming language except in “Java Beans”. Providing the system with lists of entities does not help that much, “because too many of the relevant terms in the domain undergo shifts of meaning depending on context for simple lists of words to be useful”. The connection between the ontological and the textual levels must therefore be stronger. Identification and disambiguation contextual rules can be attached to named entities.This disambiguation problem is addressed as an autonomous process in IE works by systems that learn contextual rules for entity identification (Sect. 4.1). Naming normalization. As a by-effect, these resources are also used for normali-zation purposes. For instance, the various forms of Mr. Bridge will be tagged as MAN and associated with its canonical name form: Tony Bridge (<PERSON id=Tony Bridge>). In (Soderland 1999), the extraction rules may refer to some class of typographical variations (such as Bdrm=(brs, br, bdrm, bed-room s, bedroom, bed) in the Rental Ad domain). This avoids rule over-fitting by enabling then specific rules to be abstracted.Specialized genomics systems are particularly concerned with the variation problem, as the nomenclatures are often not respected in the genomics literature, when they exist. Thus, the well-known problem of identifying protein and gene names has attracted a large part of the research effort in IE to genomics (Proux et al. 1998; Fukuda et al. 1998; Collier et al. 2000). In many cases, rules rely on shallow constraints rather than morpho-syntactic dependencies.Linguistic normalization. 
Beyond typographical normalization, the semantic tagging of entities contributes to sentence normalization at a linguistic level. It solves some syntactic ambiguities, e.g. if cotA is tagged as a gene, in the sentence "the stimulation of the expression of cotA", knowing that a gene can be "expressed"

3 These various forms may be listed extensionally or intensionally by variation rules.