
Ontologies and Information Extraction*

C. Nédellec1 and A. Nazarenko2

1 Laboratoire Mathématique, Informatique et Génome (MIG), INRA, Domaine de Vilvert, F-78352 Jouy-en-Josas cedex

2 Laboratoire d’Informatique de Paris-Nord (LIPN), Université Paris-Nord & CNRS, av. J.B. Clément, F-93430 Villetaneuse

1 Introduction

An ontology is a description of conceptual knowledge organized in a computer-based representation, while information extraction (IE) is a method for analyzing texts expressing facts in natural language and extracting relevant pieces of information from these texts.

IE and ontologies are involved in two main and related tasks,

- Ontology is used for Information Extraction: IE needs ontologies as part of the understanding process for extracting the relevant information;
- Information Extraction is used for populating and enhancing the ontology: texts are useful sources of knowledge to design and enrich ontologies.

These two tasks are combined in a cyclic process: ontologies are used for interpreting the text at the right level for IE to be efficient, and IE extracts new knowledge from the text, to be integrated in the ontology.

We will argue that even in the simplest cases, IE is an ontology-driven process. It is not a mere text filtering method based on simple pattern matching and keywords, because the extracted pieces of texts are interpreted with respect to a predefined partial domain model. We will show that depending on the nature and the depth of the interpretation to be done for extracting the information, more or less knowledge must be involved.

Extracting information from texts calls for lexical knowledge, grammars describing the specific syntax of the texts to be analyzed, as well as semantic and ontological knowledge. In this chapter, we will not take part in the debate about the limit between lexicon and ontology as a conceptual model. We will rather focus

* LIPN Internal Report, 2005. This paper was originally written in March 2003. A shorter version has been published under the title "Ontology and Information Extraction: A Necessary Symbiosis", in Ontology Learning from Text: Methods, Evaluation and Applications, edited by P. Buitelaar, P. Cimiano and B. Magnini, IOS Press, July 2005.

on the role that ontologies viewed as semantic knowledge bases could play in IE. The ontologies that can be used for and enriched by IE relate conceptual knowledge to its linguistic realizations (e.g. a concept must be associated with the terms that express it, possibly in various languages).

Interpreting text factual information also calls for knowledge on the domain referential entities that we consider as part of the ontology (Sect. 2.2.1).

This chapter will be mainly illustrated in biology, a domain in which there are critical needs for content-based exploration of the scientific literature and which is becoming a major application domain for IE.

2 Settings

Before exploring the close relationship that links ontology and IE in Sect. 3 and Sect. 4, we will define Information Extraction and ontology.

2.1 What is IE?

The considerable development of multimedia communication goes along with an exponentially increasing volume of textual information. Today, mere Information Retrieval (IR) technologies are unable to meet the need for specific information because they provide information at the document collection level. Developing intelligent tools and methods, which give access to document content and extract relevant information, is more than ever a key issue for knowledge and information management. IE is one of the main research fields that attempt to fulfill this need.

2.1.1 Definition

The IE field was initiated by DARPA's MUC program (Message Understanding Conferences) in 1987 (MUC Proceedings; Grishman and Sundheim 1996). MUC originally defined IE as the task of (1) extracting specific, well-defined types of information from the text of homogeneous sets of documents in restricted domains and (2) filling pre-defined form slots or templates with the extracted information. MUC has also brought about a new evaluation paradigm: comparing the information extracted by automatic means to human-produced results. MUC has inspired a large amount of work in IE and has become a major reference in the text-mining field. Even so, it is still a challenging task to build an efficient IE system with good recall (coverage) and precision (correctness) rates.

A typical IE task is illustrated by Fig. 1 from a CMU corpus of seminar announcements (Freitag 1998). The IE process recognizes a name (John Skvoretz) and classifies it as a person name. It also recognizes a seminar event and creates a seminar event form (John Skvoretz is the seminar speaker whose presentation is entitled "Embedded commitment").

Even in such a simple example, IE should not be considered as a mere keyword filtering method. Filling a form with some extracted words and textual fragments involves some interpretation. Any fragment must be interpreted with respect to its "context" (i.e. domain knowledge or other pieces of information extracted from the same document) and according to its "type" (i.e. the information is the value of an attribute / feature / role represented by a slot of the form). In the document of Fig. 1, "4-5:30" is understood as a time interval, and background knowledge about seminars is necessary to interpret "4" as "4 pm" and as the seminar starting time.

Form to fill (partial)
  speaker: ?
  place: ?
  starting time: ?
  title: ?

Document: Professor John Skvoretz, U. of South Carolina, Columbia, will present a seminar entitled "Embedded commitment", on Thursday, May 4th from 4-5:30 in PH 223D.

Filled form (partial)
  speaker: Professor John Skvoretz […]
  place: PH 223D
  starting time: 4 pm
  title: Embedded commitment

Fig. 1. A seminar announcement event example

2.1.2 IE overall process

Operationally, IE relies on document preprocessing and extraction rules (or extraction patterns) to identify and interpret the information to be extracted. The extraction rules specify the conditions that the preprocessed text must verify and how the relevant textual fragments can be interpreted to fill the forms. In the simplest case, the textual fragment and the coded information are the same and there is neither text preprocessing nor interpretation.

More precisely, in a typical IE system, three processing steps can be identified (Hobbs et al. 1997; Cowie and Wilks 2000):

1. text preprocessing, whose level ranges from mere text segmentation into sentences and sentences into tokens to a full linguistic analysis;
2. rule selection: the extraction rules are associated with triggers (e.g. keywords), the text is scanned to identify the triggering items and the corresponding rules are selected;
3. rule application, which checks the conditions of the selected rules and fills the forms according to the conclusions of the matching rules.

Extraction rules. The rules are usually declarative. The conditions are expressed in a logic-based formalism (Fig. 3), in the form of regular expressions, patterns or transducers. The conclusion explains how to identify in the text the value that should fill a slot of the form. The result may be a filled form, as in Fig. 1 and 2, or equivalently, a labeled text as in Fig. 3.

Sentence: "GerE stimulates the expression of cotA."

Rule

Conditions: X="expression of"

Conclusions: Interaction_Target <-next-token(X).

Filled form: Interaction_Target: cotA

Fig. 2. IE partial example from functional genomics
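To make the overall process concrete, here is a minimal sketch, in Python, of how the three processing steps could be applied with the rule of Fig. 2. All names are illustrative assumptions; real systems use linguistic preprocessing and richer rule formalisms than a plain regular expression.

import re

# A minimal, purely illustrative sketch of the rule of Fig. 2 run through the
# three steps above (preprocessing, rule selection, rule application).
RULES = [
    # (trigger keyword, condition as a regex, slot to fill)
    ("expression of", re.compile(r"expression of (\w+)"), "Interaction_Target"),
]

def preprocess(text):
    """Step 1: trivial segmentation into sentences."""
    return [s.strip() for s in text.split(".") if s.strip()]

def extract(text):
    form = {}
    for sentence in preprocess(text):
        for trigger, pattern, slot in RULES:
            if trigger in sentence:                # step 2: rule selection
                match = pattern.search(sentence)   # step 3: rule application
                if match:
                    form[slot] = match.group(1)    # the next token fills the slot
    return form

print(extract("GerE stimulates the expression of cotA."))
# -> {'Interaction_Target': 'cotA'}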

Experiments have been made with various kinds of rules, ranging from the simplest ones (Riloff 1993) (e.g. the subject of the passive form of the verb "murder" is interpreted as a victim) to sophisticated ones as in (Soderland et al. 1995). The more explicit (i.e. the more semantic and conceptual) the IE rule, the more powerful, concise and understandable it is. However, it requires the input text to be parsed and semantically tagged.

A single slot rule extracts a single value, as in Fig. 2, while a multi-slot rule correctly extracts at the same time all the values for a given form, as in Fig. 3, even if there is more than one event reported in the text fragment.

IE forms. Extraction usually proceeds by filling forms of increasing complexity (Wilks 1997):

- Filling entity forms aims at identifying the items representing the domain referential entities. These items are called "named entities" (e.g. Analysis & Technology Inc.) and assimilated to proper names (company, person, gene names), but they can be any kind of word or expression that refers to a domain entity: dates, numbers, titles for the MUC-6 management succession application, bedrooms in a real-estate IE application (Soderland 1999).
- Filling domain event forms: the information about the events extracted by the rules is then encoded into forms in which a specific event of a given type and its role fillers are described. An entity form may fill an event role.
- Merging forms that are issued from different parts of the text but provide information about a same entity or event.
- Assembling scenario forms: ideally, various event and entity forms can be further organized into a larger scenario form describing a temporal or logical sequence of actions/events.

Text processing. As shown in Fig. 3, the condition part of the extraction rules may check the presence of a given lexical item (e.g. the verb named), the syntactic category of words and their syntactic dependencies (e.g. object and subject relations). Different clues such as typographical characteristics, relative position of words, semantic tags1 or even coreference relations can also be exploited.

Most IE systems therefore involve linguistic text processing and semantic knowledge: segmentation into words, morpho-syntactic tagging (the part-of-speech categories of words are identified), syntactic analysis (sentence constituents such as noun or verb phrases are identified and the structure of complex sentences is analyzed) and sometimes additional processing: lexical disambiguation, semantic tagging or anaphora resolution.

1 E.g., if the verbs "named", "appointed" and "elected" of Fig. 3 were all known as 'nomination' verbs, the fourth condition of the rule could have been generalized to their semantic category 'nomination'.

Sentence: "NORTH STONINGTON, Connecticut (Business Wire) - 12/2/94 - Joseph M. Marino and Richard P. Mitchell have been named senior vice president of Analysis & Technology Inc. (NASDAQ NMS: AATI), Gary P. Bennett, president and CEO, has announced. "

Rule

Conditions:

noun-phrase (PNP, head(isa(person-name))),

noun-phrase (TNP, head(isa(title))),

noun-phrase (CNP, head(isa(company-name))),

verb-phrase (VP, type(passive),head(named or elected or appointed)),

preposition (PREP, head(of or at or by)),

subject (PNP, VP),

object (VP, TNP),

post_nominal_prep (TNP, PREP),

prep_object (PREP, CNP)

Conclusion:

management_appointment (M, person(PNP), title (TNP), company (CNP)).

Comment:

if there is a noun phrase (NP) whose head is a person name (PNP), an NP whose head is a title name (TNP), an NP whose head is a company name (CNP), a verb phrase whose head is a passive verb (named or elected or appointed), a preposition of, at or by,

if PNP and TNP are respectively subject and object of the verb,

and if CNP modifies TNP,

then it can be stated that the person “PNP” is named "TNP" of the company “CNP”.

Labeled document

NORTH STONINGTON, Connecticut (Business Wire) - 12/2/94 - Joseph M. Marino and Richard P. Mitchell have been named senior vice president of Analysis & Technology Inc. (NASDAQ NMS: AATI), Gary P. Bennett, president and CEO, has announced.

Fig. 3. Example from MUC-6, a newswire about management succession

However, the role and the scope of this analysis differ from one IE system to another. Text analysis can be performed either as preprocessing or during extraction rule application. In the former case, the whole text is first analyzed. The analysis is global in the sense that items spread all over the document can contribute to build the normalized and enriched representation of the text. Then, the application of extraction rules comes down to a simple filtering of the enriched representation. In the latter case, the text analysis is driven by the rule condition verification. The analysis is local, focuses on the context of the triggering items of the rules, and fully depends on the conditions to be checked in the selected rules.

In the first IE systems (Hobbs et al. 1997), local and goal-driven analysis was preferred to full text preanalysis to increase efficiency, and the text preprocessing step was kept to a minimum. Although costly, data-driven, full text analysis and normalization can improve the IE process in various manners. (1) It improves further NL processing steps, e.g. syntactic parsing improves attachment disambiguation (Basili et al. 1993) or coreference resolution. (2) Full text analysis and normalization also facilitate the discovery of lexical and linguistic regularities in specific documents. This idea, initially promoted by work on sublanguages (Harris 1968; Sager et al. 1987) for tuning NL processing to a given type of texts, is now popularized by Machine Learning (ML) papers in the IE field for learning extraction rules. There are two main reasons for that. First, annotating training data is costly and the quantity of data to be annotated decreases with normalization (the fewer variations in the data, the less annotation is needed). Next, ML systems tend to learn non-understandable rules by picking details in training examples that do not appear related. Normalizing the text by representing it in a more abstract way increases the understandability of the learned rules. However, normalization also raises problems such as the biased choice of the right representation before learning, which is not dealt with in the IE literature.

We will see in the following that these two approaches, in which text analysis is respectively used for interpretation (goal-driven) and normalization (data-driven), are very much tangled, as any normalization process involves a part of interpretation. One of the difficulties in designing IE systems is to set the limit between local and global analysis. Syntactic analysis or entity recognition can be performed on a local basis but is improved by knowledge inferred at a global level. Thus, ambiguous cases of syntactic attachment or entity classification can be solved by comparison with non-ambiguous similar cases in the same document.

2.1.3 IE, an ambitious approach to text exploration

As mentioned above, there is a need for tools that give real access to document content. IE and Question Answering (Q/A) tasks both try to identify in documents the pieces of information that are relevant to a given query. They differ, however, in the type of information that is looked for. A Q/A system has to answer a wide range of unpredictable user questions. In IE, the information that is looked for is richer but the type of information is known in advance. The relevant pieces of text have to be identified and then interpreted with respect to the knowledge partially represented in the forms to fill.

IE and Q/A systems both differ in their empiricism from their common ancestors, the text-understanding systems. They both rely on targeted and local techniques of text exploration rather than on a large-coverage and in-depth semantic analysis of the text. The MUC competition framework has gathered a large and stable IE community. It has also drawn research towards easily implementable and efficient methods rather than strong and well-founded NLP theories.

The role of semantics in IE is often reduced to very shallow semantic labeling. Semantic analysis is considered more as a way to disambiguate syntactic steps than as a way to build a conceptual interpretation. Today, most of the IE systems that involve semantic analysis exploit the simplest part of the whole spectrum of domain and task knowledge, that is to say, named entities. However, the growing need for IE in application domains such as functional genomics, which require more text understanding, pushes towards more sophisticated semantic knowledge resources and thus towards ontologies viewed as conceptual models, as will be shown in this chapter.

2.2 What is an Ontology in the IE framework?

Even though ontologies usually do not appear as an autonomous component or resource in IE systems, we argue that IE relies on ontological knowledge.

2.2.1 Ontologies populated with referential entities

The ontology identifies the entities that have a form of existence in a given domain and specifies their essential properties. It does not describe the spurious properties of these entities. On the contrary, the goal of IE is to extract factual knowledge to instantiate one or several predefined forms. The structure of the form (e.g. Fig. 4) is a matter of ontology whereas the values of the filled template usually reflect factual knowledge (as shown in Fig. 2 above) that is not part of the ontology. In these examples, the form to fill represents a part of the biological model of gene regulation networks: proteins interact positively or negatively with genes. In Sect. 3.4, we will show that IE is ontology-driven in that respect.

Interaction
  Type: {negative, positive}
  Agent: any protein
  Target: any gene

Fig. 4. An example of IE form in the genomics domain

The status of the named entities is a pending question. Do they belong to the ontology or are they factual knowledge? From a theoretical point of view, according to Brachman's terminological logics view (1979), they are instances of concepts and, as such, they are described and typed at the assertional level and not at the terminological or ontological level. In this chapter, we will nevertheless consider that entities, being referential entities, are part of the domain ontology because this is the way IE considers them.

2.2.2 Ontology with a natural language anchorage

Whether one wants to use ontological knowledge to interpret natural language or to exploit written documents to create or update ontologies, in any case the ontology has to be connected to linguistic phenomena. Ontology must be linguistically anchored. A large effort has been devoted, in traditional IE systems based on local analysis, to the definition of extraction rules that achieve this anchoring. In the very simple example about gene interaction (Fig. 2 above), the ontological knowledge is encoded as a keyword rule, which can be considered as a kind of compiled knowledge. In more powerful IE systems, the ontological knowledge is more explicitly stated in the rules that bridge the gap between the word level and text interpretation. For instance, the rule of Fig. 3 above states that a management appointment event can be expressed through three verbs (named, elected or appointed). As such, an ontology is not a purely conceptual model; it is a model associated with a domain-specific vocabulary and grammar. In the IE framework, we consider that this vocabulary and grammar are part of the ontology, even when they are embodied in extraction rules.

The complexity of the linguistic anchoring of ontological knowledge is well known and should not be underestimated. A concept can be expressed by different terms and many words are ambiguous. Rhetoric, such as lexicalized metonymies or elisions, introduces conceptual shortcuts at the linguistic level and must be elicited to be interpreted into domain knowledge. A noun phrase (e.g. "the citizen") may refer to an instance (a specific citizen who has been previously mentioned in the text) or to the class (the set of all citizens), leading then to a very different interpretation. These phenomena, which illustrate the gap between the linguistic and the ontological levels, strongly affect IE performance. This explains why IE rules are so difficult to design.

2.2.3 Partial ontologies

IE is a targeted textual analysis process. The target information is described in the structure of the forms to fill. As mentioned above (Sect. 2.1.2) MUC has identified various types of forms describing elements or entities, events and scenarios.

IE does not require a whole formal ontological system but parts of it only. We consider that the ontological knowledge involved in IE can be viewed as a set of interconnected and concept-centered descriptions, or "conceptual nodes2". In conceptual nodes, the concept properties and the relations between concepts are explicit. These conceptual nodes should be understood as chunks of a global knowledge model of the domain. We consider here various types of concepts: an object node lists the various properties of the object; an event node describes the various objects involved in the event and their roles; a scenario node describes one or several events involved in the scenario and their interrelations. The use of this type of knowledge in NLP systems is traditional (Schank and Abelson 1977) and is illustrated by MUC tasks.

2.3 Specificity of the ontology-IE relationship

Ontology and IE are closely connected by a mutual contribution. The ontology is required for the IE interpreting process and IE provides methods for ontological knowledge acquisition. Even though the use of IE for extracting ontological knowledge is still rather marginal, it is gaining in importance. We distinguish both aspects in the following Sects. 3 and 4, although the whole process is a cyclic one. A first level of ontological knowledge (e.g. entities) helps to extract new pieces of knowledge from which more elaborate, abstract ontological knowledge can be designed, which in turn helps to extract new pieces of information in an iterative process.

2 We define a conceptual node as a piece of ontological model to which linguistic information can be attached. It differs from the "conceptual nodes" of (Soderland et al. 1995), which are extraction patterns describing a concept. We will see below that several extraction rules may be associated with a unique conceptual node.

3 Ontology for information extraction

The template or form to be filled by IE is a partial model of world knowledge. IE forms are also classically viewed as a model of a database to be filled with the extracted instances. This view is consistent with the first one. In this respect, any IE system is ontology-driven: in IE processes, the ontological knowledge is primarily used for text interpretation. However poor the semantics underlying the form to fill may be (see Fig. 2, for instance), whether it is explicit (Gaizauskas and Wilks, 1997; Embley et al., 1998) or not (Freitag 1998) (see Fig. 5 below), IE is always based on a knowledge model. In this Sect. 3, for exposition purposes, we distinguish different levels of ontological knowledge:

- The referential domain entities and their variations are listed in "flat ontologies". This is mainly used for entity identification and semantic tagging of character strings in documents.
- At a second level, the conceptual hierarchy improves normalization by enabling more general levels of representation.
- More sophisticated IE systems also make use of chunks of a domain model (i.e. conceptual nodes), in which the properties and interrelations of entities are described. The projection of these relations on the text both improves the NL processes and guides the instantiation of conceptual frames, scenarios or database tuples. The corresponding rules are based either on lexico-syntactic patterns or on more semantic ones.
- The domain model itself is used for inference. It enables different structures to be merged and the implicit information to be brought to light.

3.1 Sets of entities

Recognizing and classifying named entities in texts require knowledge on the domain entities. Specialized lexical or keyword lists are commonly used to identify the referential entities in documents. For instance, in the context of cancer treatment, (Rindflesh et al. 2000) makes use of the concepts of the Metathesaurus of UMLS to identify and classify biological entities in papers reporting interactions between proteins, genes and drugs. In different experiments, some lists of gene and protein names are exploited. For instance, (Humphreys et al. 2000) makes use of the SWISS-PROT resource whereas (Ono et al. 2001) combines pattern matching with a manually constructed dictionary. In the financial news of MUC-5, lists of company names have been used. In a similar way, AutoSlog (Riloff 1993), CRYSTAL (Soderland et al. 1995), PALKA (Kim and Moldovan 1995), WHISK (Soderland 1999) and Pinocchio (Ciravegna 2000) make use of lists of entities to identify the referential entities in documents. The use of lexicons and dictionaries is however controversial. Some authors like (Mikheev et al. 1999) argue that named entity recognition can be done without them.

Three main objectives of these specialized lexicons can be distinguished: semantic tagging, naming normalization and linguistic normalization, although these operations are usually processed all at once.


Semantic tagging. Lists of entities are used to tag the text entities with the relevant semantic information. In the ontology or lexicon, an entity (e.g. Tony Bridge) is described by its type (the semantic class to which it belongs, here PERSON) and by the list of the various textual forms (typographical variants, abbreviations, synonyms) that may refer to it3 (Mr. Bridge, Tony Bridge, T. Bridge).

However, exact character strings are often not reliable enough for precise entity identification and semantic tagging. Polysemous words, which exist even in sublanguages, belong to different semantic classes. In the above example, the string "Bridge" could also refer to a bridge named "Tony". (Soderland 1999) reports experiments on a similar problem in a software job ad domain: WHISK is able to learn some contextual IE rules, but some rules are difficult to learn because they rely on subtle semantic variations, e.g., the word "Java" can be interpreted as competency in the programming language except in "Java Beans". Providing the system with lists of entities does not help that much, "because too many of the relevant terms in the domain undergo shifts of meaning depending on context for simple lists of words to be useful". The connection between the ontological and the textual levels must therefore be stronger. Identification and disambiguation contextual rules can be attached to named entities.

This disambiguation problem is addressed as an autonomous process in IE works by systems that learn contextual rules for entity identification (Sect. 4.1).

Naming normalization. As a by-effect, these resources are also used for normalization purposes. For instance, the various forms of Mr. Bridge will be tagged as MAN and associated with its canonical name form: Tony Bridge. In (Soderland 1999), the extraction rules may refer to some class of typographical variations (such as Bdrm = (brs, br, bdrm, bedrooms, bedroom, bed) in the Rental Ad domain). This avoids rule over-fitting by enabling specific rules to be abstracted.
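As a rough illustration, this kind of resource can be pictured as a dictionary mapping surface forms to a canonical name and a semantic type; the entries below are hypothetical and the lookup is deliberately naive.

# A minimal sketch (hypothetical entries) of an entity dictionary used for
# semantic tagging and naming normalization at once.
ENTITY_DICT = {
    "tony bridge": ("Tony Bridge", "PERSON"),
    "t. bridge":   ("Tony Bridge", "PERSON"),
    "mr. bridge":  ("Tony Bridge", "PERSON"),
}

def tag_entity(mention):
    """Return (canonical form, semantic type), or None for an unknown mention."""
    return ENTITY_DICT.get(mention.lower().strip())

print(tag_entity("Mr. Bridge"))   # -> ('Tony Bridge', 'PERSON')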

Specialized genomics systems are particularly concerned with the variation problem, as the nomenclatures are often not respected in the genomics literature, when they exist. Thus, the well-known problem of identifying protein and gene names has attracted a large part of the research effort in IE to genomics (Proux et al. 1998; Fukuda et al. 1998; Collier et al. 2000). In many cases, rules rely on shallow constraints rather than morpho-syntactic dependencies.

Linguistic normalization. Beyond typographical normalization, the semantic tagging of entities contributes to sentence normalization at a linguistic level. It solves some syntactic ambiguities: e.g. if cotA is tagged as a gene in the sentence "the stimulation of the expression of cotA", knowing that a gene can be "expressed" helps to understand that "cotA" is the patient of the expression rather than its agent or the agent of the stimulating action. Semantic tagging is also traditionally used for anaphora resolution: (Pustejovsky et al. 2002) makes use of UMLS types to identify and order the potential antecedents of an anaphoric pronoun (it) or noun phrase (these enzymes, both genes).

3 These various forms may be listed extensionally or intensionally by variation rules.

Semantic types and naming variations are both used for text normalization, without a clear distinction between them.

3.2 Hierarchies

Beyond lists of entities, ontologies are often described as hierarchies of semantic or word classes. Traditionally, IE focuses on the use of word classes rather than on the use of the hierarchical organization. For instance, in WordNet (Miller 1990), the word classes (synsets) are used for the semantic tagging and disambiguation of words, but the hyponymy relation that structures the synsets into a hierarchy of semantic or conceptual classes is seldom exploited for ontological generalization inference. Some ML-based experiments have been done to exploit the hierarchies of WordNet and of more specific lexicons, such as UMLS (Soderland et al. 1995; Chai et al. 1999; Freitag 1998). The ML systems learn extraction rules by generalizing from annotated training examples. They relax constraints along two axes: climbing the hyperonym path and dropping conditions. This way, the difficult choice of the correct level in the hierarchy is left to the systems.

(Chai et al. 1999) reports experiments that show how difficult this choice is in WordNet. Their IE patterns are in the form of pairs of noun phrases (NP) representing two target values, related by a verb or a preposition representing the relation, such as NP(X->Enzyme) Verb(interact) NP(Y->Gene), where Enzyme and Gene represent two slots of a form. In this example, the categories of X and Y are not constrained and the pattern is over-general. As one may have expected, the experiment shows that generalizing the two NP types with WordNet increases the recall but decreases the precision. Chai et al.'s system automatically learns, for each relevant NP in the pattern, the optimal level of semantic generalization on the WordNet hyperonym path by climbing the WordNet hierarchies. For ambiguous words, which have several hyperonyms, the choice of the right hierarchy to climb is based on the user selection of the headword senses in a training corpus. The rules learned using WordNet enhance the overall F-measure (combination of precision and recall) by about 10%. The performance increases by about 30% for certain facts such as LOCATION in a job advertisement IE task. The conclusion is moderate: generalization along the WordNet hierarchy brings a significant benefit to IE, but the incompleteness of WordNet in specific domains and word sense ambiguity remain questionable.
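Climbing the hyperonym path can be sketched with the NLTK interface to WordNet. This is only an illustration of the generalization step, not the system of (Chai et al. 1999); picking the first noun sense is an arbitrary simplification of their sense-selection step.

from nltk.corpus import wordnet as wn   # requires the NLTK WordNet data

def generalize(word, levels=1):
    """Climb `levels` steps up the hypernym path of the first noun sense of `word`."""
    synsets = wn.synsets(word, pos=wn.NOUN)
    if not synsets:
        return None
    node = synsets[0]
    for _ in range(levels):
        hypernyms = node.hypernyms()
        if not hypernyms:
            break
        node = hypernyms[0]
    return node

# e.g. generalize("enzyme", 1) and generalize("enzyme", 3) return increasingly
# general synsets; an IE learner would test each level against training examples.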

acqabr :-
  some(Var, null, capitalized, true), length(>2),
  some(Var [next_token], all_lower_case, true),
  some(Var [prev_token], all_lower_case, true),
  some(Var [right_AN], wordnet_word, "possession"),
  some(Var [right_AN prev_token], wordnet_word, "stock"),
  some(Var [prev_tok prev_token], doubleton, false).

applies to the sentences

"to purchase 4.5 mln Trilogy common shares at"
"acquire another 2.4 mln Roach treasury shares."

where Possession (instantiated by shares) and stock (instantiated by common and treasury) are WordNet classes. Var represents the company name (Trilogy and Roach) and Right_AN represents the head of the noun phrase at the right of the company name.

Fig. 5. Application of SRV to MUC-5

The IE learning system SRV also uses semantic class information such as synsets and hyperonym links from the WordNet lexicon to constrain the application of the IE rules (Fig. 5), but D. Freitag (1998) concludes that the improvement is not clear.

A lot of work has been devoted to the manual or automatic acquisition of domain-dependent hierarchies for a wide range of NL processing tasks in order to overcome the limitations of general ontologies. For instance, for IE purposes, (Riloff and Shepherd 1997) proposes to build semantic classes, starting with a few seed words and growing by adjunction of words, as a first step before extraction rule learning. At this stage, no clear and reusable conclusions for IE can be drawn from these attempts.

3.3 Conceptual nodes

The ontological knowledge is not always explicitly stated as it is in (Gaizauskas and Wilks 1997), which represents an ontology as a hierarchy of concepts, each concept being associated with an attribute-value structure, or in (Embley et al. 1998), which describes an ontology as a database relational schema. However, ontological knowledge is reflected by the target form that IE must fill and which represents the conceptual nodes to be instantiated. Extraction rules ensure the mapping between a conceptual node and the potentially various linguistic phrasings expressing the relevant elements of information.

Most of the works aiming at extracting gene/protein interactions are based on such event conceptual nodes. In (Yakushiji et al. 2001), predicate-argument structures (P-A structures), also referred to as subcategorization frames, describe the number, type and syntactic construction of the predicate arguments. The P-A structures are used for extracting gene and protein interactions (see Fig. 6). The mapping between P-A structures and event frames (event conceptual nodes) is explicit and different P-A structures can be associated with a same event frame. For instance, the extraction of gene/protein interactions is viewed as the search for the subject and the object of an interaction verb, which are interpreted as the agent and the target of the interaction.

In these works, parsing is made by shallow, robust or full parsers, which may or may not handle coordination, anaphora, passive mood and nominalization (Sekimizu et al. 1998; Thomas et al. 2000; Roux et al. 2000; Park et al. 2001; Leroy and Chen 2002). Additional semantic constraints may be added as selectional restrictions4 for disambiguation purposes.

activate is an interaction verb

P-A structure of activate:
  Pred: activate              Frame: activate
  args: subject (1)           slot: agent (1)
        object (2)            slot: target (2)

Fig. 6. Example of a conceptual-node driven rule in functional genomics

These approaches rely on the assumption that semantic relations (e.g. agent, target) are fully determined by the verb/noun predicate, its syntactic dependencies and, optionally, the semantic categories of its arguments (Pustejovsky et al. 1993; Gildea and Jurafsky 2002).

The same assumption is made in the very interesting work of (Sasaki and Matsuo 2000), which goes one step further. Their system, RHB+, learns this mapping with the help of case frames in Fillmore's sense (1968). RHB+ is able to combine multiple case frames to map a unique conceptual node, as opposed to the direct binary mapping described above. RHB+ makes use of Japanese linguistic resources that include a 12-level hierarchical concept thesaurus of 3,000 categories (300,000 words) linked by is-a and has-a relations and 15,000 case frames for 6,000 verbs. The case-roles, the semantic relations between a predicate and its argument, within a text are determined by the semantic categories of the predicate arguments together with the prepositions or the syntactic dependencies.

As for RHB+, considerable effort has been made towards designing automatic methods for learning such extraction rules. The main difficulty arises from the complexity of the text representation once enriched by the multiple linguistic and conceptual levels. The more expressive the representation, the larger the search space for the IE rules and the more difficult the learning. The extreme alternative consists in either selecting the potentially relevant features before learning, with the risk of excluding the solution from the search space, or leaving the system the entire choice, provided that there is enough representative and annotated data to find the relevant regularities. For instance, the former consists in normalizing by replacing names by category labels, while the latter consists in tagging without removing the names. The learning complexity can even be increased when the conceptual or semantic classes are learned together with the conceptual node information (Riloff and Jones 1999; Yangarber et al. 2000).

4 A selectional restriction is a semantic type constraint that a given predicate enforces on its arguments.

An interesting alternative to the purely corpus-based approach for learning IE rules has been proposed in the context of ECRAN (Basili et al. 1998). This approach, based on Wilks's notion of lexical tuning, consists in adapting an existing dictionary, the LDOCE, which aims at describing the subcategorization frames of any word sense, to a given corpus. Confronting the syntactic analysis of a specialized corpus with the LDOCE allows the selection of the most relevant subcategorization frames and possibly the discovery of new word senses.

3.4 Domain conceptual model

The link between the syntactic level and the event description is not always so straightforward. The text interpretation may require inference reasoning with domain knowledge. For instance, to be able to extract

Interaction
  Type: negative
  Agent: sigma K
  Target: spoIIID

from, "[…], such that production of sigma K leads to a decrease in the level of spoIIID.", more biological knowledge is necessary to interpret the level changes in term of interaction. P-A structures as those above will be useful at the lower level for interpreting the text and build a semantic structure but a causal model stating that correlation in quantity variations can be interpreted as an interaction is needed to connect and interpret the instantiated syntactic structures at a conceptual level.

4 Information extraction for ontology design

Acquisition of ontological knowledge is a well-known bottleneck for many AI applications and a large amount of work has been devoted to knowledge acquisition from text. The underlying idea, inherited from Harris' work on the immunology sublanguage (Harris et al. 1989), is that, in specific domains, the language reflects the domain conceptual organization. Although it has been observed, as we mentioned above, that the linguistic representation of the conceptual domain is biased, this remains one of the most promising approaches to knowledge acquisition. Following (Meyer et al. 1992), a large amount of work has been devoted to term extraction (Bourigault 1996; Jacquemin 1996) as a means to identify the concepts of a given domain and thus to bootstrap ontology design (Grishman and Sterling 1992; Nazarenko et al. 1997; Aussenac-Gilles et al. 2000). Identifying how these terms relate to each other in texts helps to understand the properties and relationships of the underlying concepts.

Various methods are applied to corpora to achieve this acquisition process: endogenous distributional or cooccurrence analysis and rule-based extraction are complementary in this respect. We focus here on the latter approach, which pertains to IE. We show that it can indeed contribute to the ontology acquisition and enrichment process. Rule-based extraction produces elementary results that are interpreted in terms of chunks of ontological knowledge: the referential entities and their interrelationships. Once extracted, these chunks have to be integrated into the ontology. We do not deal with that point here, as it goes beyond IE.

4.1 Entity name extraction

As explained in Sect. 2.2, we consider here that the referential entities (e.g. persons, dates or genes), which are usually represented as instances of concepts, are part of the ontology. In this perspective, there is a need for "populating" the ontology with the referential entities of the domain of interest by automatic means.

IE has been widely applied to the recognition and categorization of the entities mentioned in documents, either specialized texts or web pages, by means of patterns. The extraction methods differ in their pattern design technique, which is either automatic or manual.

4.1.1 Automatic pattern learning

Hidden Markov Models (HMM) based on sequences of bigrams (pairs of tokens) have become a popular method for learning named entity recognition patterns from annotated corpora since Nymble (Bikel et al. 1997, 1999), because simple bigrams appear sufficient for learning efficient rules. According to (Collier et al. 2000), the applied HMMs differ in their ability to learn the model structure or not, in the way they estimate the transition probabilities (from training data or models built by hand) and in their reusability in different domains. For instance, the method of (Collier et al. 2000) aims at recognizing biological entity names and is based on an HMM trained on 100 MedLine abstracts using only character features and lexical information. The results (F-score 73%) are much better than those obtained by previous hand-coded patterns (Fukuda et al. 1998).
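As a toy illustration of the principle (not Nymble or the system of (Collier et al. 2000)), the sketch below tags each token as GENE or OTHER by Viterbi decoding over bigram transitions and crude character-feature emissions; all probabilities and features are invented for the example.

import math

STATES = ["GENE", "OTHER"]
START = {"GENE": 0.1, "OTHER": 0.9}
TRANS = {"GENE": {"GENE": 0.4, "OTHER": 0.6},
         "OTHER": {"GENE": 0.2, "OTHER": 0.8}}

def emission(state, token):
    # Crude character features: digits or a non-title mix of cases suggest a gene name.
    gene_like = any(c.isdigit() for c in token) or (
        any(c.isupper() for c in token) and any(c.islower() for c in token)
        and not token.istitle())
    if state == "GENE":
        return 0.9 if gene_like else 0.1
    return 0.05 if gene_like else 0.95

def viterbi(tokens):
    # Forward pass: best log-probability of each state at each position.
    scores = [{s: math.log(START[s]) + math.log(emission(s, tokens[0])) for s in STATES}]
    back = []
    for tok in tokens[1:]:
        col, ptr = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: scores[-1][p] + math.log(TRANS[p][s]))
            col[s] = scores[-1][prev] + math.log(TRANS[prev][s]) + math.log(emission(s, tok))
            ptr[s] = prev
        scores.append(col)
        back.append(ptr)
    # Backward pass: follow the back-pointers from the best final state.
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return list(reversed(path))

print(viterbi(["GerE", "stimulates", "cotA"]))   # expected: ['GENE', 'OTHER', 'GENE']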

4.1.2 Hand-coded patterns and dictionaries

While the pattern learning approach tends to use very basic information from the text, the hand-coded pattern approach on the contrary relies more on linguistics (Proux et al. 1998), external ontologies (Rindflesh et al. 2000) and context (Humphreys et al. 2000; Fukuda et al. 1998; Hishiki et al. 1998).

The EDGAR system (Rindflesh et al. 2000) identifies unknown gene names and cell lines in two ways: the concepts of UMLS and hand-coded contextual patterns, such as appositives, filtered through UMLS and an English dictionary and occurring after some signal words (e.g. cell, clone and line for cells). A second phase identifies cell features (e.g. organ type, cancer type and organism) by a similar mechanism.

(Hishiki et al. 1998) gives examples of contextual regular expressions applied to entity recognition and categorization. They rely, for instance, on the following patterns (a sketch of the first one is given after this list):

- Indefinite appositions: the pattern NP(X), a NP(Y) gives X as an instance of Y, if Y is a type. From the sentence "csbB, a putative membrane-bound glucosyl transferase", csbB is interpreted as an instance of transferase if transferase is defined as a type.
- Exemplification or copula constructions: NP(X) be one of NP(Y) or NP(X) e.g. NP(Y). The fact that abrB is an instance of gene is extracted from "to repress certain genes, e.g. abrB".
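A rough sketch of the indefinite-apposition pattern, under the assumption of a small hand-made type inventory; the names below are hypothetical and the NP grammar is crudely approximated by word sequences.

import re

KNOWN_TYPES = {"transferase", "gene", "kinase"}          # assumed type inventory

# "NP(X), a NP(Y)": capture the candidate instance and the head of the apposed NP.
APPOSITION = re.compile(r"([\w\-]+)\s*,\s*an?\s+((?:[\w\-]+\s+)*([\w\-]+))")

def apposition_instances(sentence):
    """Return (instance, type) pairs when the apposed NP head is a known type."""
    pairs = []
    for name, _np, head in APPOSITION.findall(sentence):
        if head.lower() in KNOWN_TYPES:
            pairs.append((name, head))
    return pairs

print(apposition_instances("csbB, a putative membrane-bound glucosyl transferase"))
# -> [('csbB', 'transferase')]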

The recognition of gene names (Proux et al. 1998) and biological entity names (Humphreys et al. 2000) can also rely on various linguistic-based methods, i.e. grammatical tagging, contextual hand-coded patterns, specific lexicons (e.g. the SWISS-PROT keywlist) and combinations of morphology-based, biochemical suffix and prefix recognition rules. The results obtained by (Proux et al. 1998) on a FlyBase corpus are of high quality (94.4% recall and 91.4% precision). With comparable performance, (Humphreys et al. 2000) identifies 25,000 component terms of 52 categories as MUC named entity results. Populating an ontology with the help of entity name recognition from textual data can therefore be considered as operational for specific domains.

4.2 Relation extraction

In a structured ontology, the concepts are related to each other according to a variety of relations. Three main approaches acquire ontological relations from texts:

- The cooccurrence-based method identifies couples of cooccurring terms. When applied to large corpora, this method is robust, but further interpretation is required to type the relation underlying the collocation.
- The knowledge-based method makes use of a bootstrapping dictionary, a thesaurus or an ontology and tunes it to adapt it to the specific domain at hand according to a representative "tuning" corpus.
- The IE pattern-based method.

The IE approach has the advantage over the first one that the type of extracted relation is known, since patterns are designed to characterize a given relation. It is complementary to the second one: preexisting knowledge can help to design an extraction rule in an iterative acquisition process. For instance, if the preexisting knowledge base states that 'X is-part-of Y', identifying this relation in text helps to design a first is-part-of extraction rule, which is used in turn to extract new instances of that relation (Hearst 1992; Morin and Jacquemin 1999).

Two kinds of relations can roughly be distinguished: the generic ones, which can be found in almost any ontology, and the model-specific ones.

4.2.1 Generic relations

The links that form the main structure of the ontology are the most popular relations: the intra-concept relations (synonymy) and the hierarchical is-a and part-of relations. They can be considered either at the linguistic level (hyperonymy and meronymy are traditional lexicographic relations) or at the ontological level (is-a and part-of). The acquisition goal is to exploit the linguistic organization to infer an ontological structure.

Hyponymy or is-a relation. In her pioneering work, M. Hearst (1992) proposes six patterns for the acquisition of hyponyms, among which are the following:

such NP as {NP,}* {or|and} NP
NP {,} including {NP,}* {or|and} NP

The first NP is interpreted as the hyperonym, and the latter one(s) as the hyponym(s). The first rule matches the sentence "… such exotic fruits as kiwis, mangoes, pineapple or coconuts…", from which the relations kiwi is-a exotic fruit and mango is-a exotic fruit can be extracted. These patterns are in the form of regular expressions that combine lexical units, punctuation marks and morpho-syntactic tags. They require a tagged corpus.
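The first pattern can be sketched as a plain regular expression over raw text. This is a simplification: Hearst's patterns operate on tagged NPs, whereas this toy version crudely approximates NPs by word sequences.

import re

# Sketch of the "such NP as NP, NP, ... or/and NP" pattern over plain text.
SUCH_AS = re.compile(
    r"such\s+(?P<hyper>[\w\- ]+?)\s+as\s+"
    r"(?P<hypos>[\w\- ]+(?:\s*,\s*[\w\- ]+)*(?:\s*(?:,\s*)?(?:or|and)\s+[\w\- ]+)?)",
    re.IGNORECASE,
)

def hearst_such_as(sentence):
    """Return (hyperonym, [hyponyms]) pairs found by the 'such ... as' pattern."""
    results = []
    for m in SUCH_AS.finditer(sentence):
        hyper = m.group("hyper").strip()
        hypos = re.split(r"\s*,\s*|\s+(?:or|and)\s+", m.group("hypos"))
        results.append((hyper, [h.strip() for h in hypos if h.strip()]))
    return results

print(hearst_such_as("... such exotic fruits as kiwis, mangoes, pineapple or coconuts ..."))
# -> [('exotic fruits', ['kiwis', 'mangoes', 'pineapple', 'coconuts'])]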

Many works have followed this track. Variant forms of exemplification and enumeration patterns have been designed for specific corpora (see the above patterns proposed by (Hishiki et al. 1998)). The results obviously have to be validated by a human expert, and a taxonomy can then be constructed by inference on the single types derived from the corpus by pattern matching.

As shown by the previous examples, the hyponym relation can be interpreted either as an instance-class relation or as a generalization relation between two classes. The language does not distinguish the one from the other, since a prototypical instance may refer to the class as a whole. For example, from "PP2C family of eukaryotic Ser/Thr protein phosphatases", one can derive two relations (PP2C is a phosphatase and phosphatase is a protein), where classes and instances cannot be distinguished.

As one may expect, this approach gives high precision scores but no reliable measure of recall, which would call for a corpus where hyponyms are tagged. The number of extracted relations seems to be low regarding the diversity of corpus information as well as the number of ontological categories. It is generally agreed that ontology design requires favoring reliability over coverage. The obvious NP is a NP pattern is usually disregarded as too imprecise: in the sentence "the key regulator is an example of…", it would lead to interpreting the key regulator as an instance of example.


Meronymy or part-of relation. Meronymy has not been studied as much as hyponymy. (Berland and Charniak 1997) however adapt the above approach to find parts of physical objects in very large corpora. Their acquisition method is designed for ontology enhancement. They propose five patterns such as

{N|Nplural}'s POSSESSIVE {N|Nplural}

where the first N is interpreted as the whole and the second one as the part. Morphological constraints rule out quality words (words ending with -ness, -ity…), which are not supposed to refer to physical objects. Phrases like "…basements in/of buildings…", "basement in|of a building", "basement's building" are covered by the five patterns proposed. Due to these weakly constrained patterns, many potential meronyms are extracted. They are statistically ordered and proposed to an expert for validation. Results show 55% precision among the first 50 part-whole pairs, which is quite low.

(Hishiki et al. 1998) proposes patterns relying on partitive verbs for the biological literature: NP consist of NP and NP be ... part of NP, as in "sigE is part of an operon" or in "the gerE locus consists of one gene". This work raises the same evaluation problem as the previous one.

Synonymy. The same approach has been experimented with to detect synonymy relations in corpora. Reformulation and abbreviation patterns have been proposed (Pearson 1998): i.e., e.g., known as, called. (Hishiki et al. 1998) suggests that "termed", "designated as" and parenthesizing denote synonymy: in the sequence NP (NP), the NPs are considered as synonymous, as in "spoIIIG (sigma G)".
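The parenthesis pattern can be sketched as follows; this is a deliberately naive regular expression, subject to the ambiguities discussed in the next paragraph, and the extracted pairs remain candidates to be validated.

import re

# Sketch of the "NP (NP)" synonymy pattern from (Hishiki et al. 1998).
PAREN = re.compile(r"(\b[\w\-]+)\s*\(\s*([\w\- ]+?)\s*\)")

def paren_candidates(sentence):
    """Return candidate (term, synonym) pairs; they still need expert validation."""
    return PAREN.findall(sentence)

print(paren_candidates("spoIIIG (sigma G) is transcribed ..."))
# -> [('spoIIIG', 'sigma G')]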

However, the productivity of these patterns is highly dependent on the corpus (Hamon and Nazarenko 2001). For instance, in biology, parentheses do not only denote synonymy or typographic variations as in "sigma-D (sigma D)" or in "chloramphenicol acetyltransferase (CAT)". They may also introduce an instance as in "a small DNA segment (157 bp)", a reference, as in "in an earlier study Piggot (J. Bacteriol)", or simply the species of interest as in "spoIIG operon from either B. subtilis (spoIIG (Bs)) or C. acetobutylicum (spoIIG (Ca))". Overall, synonymy extraction patterns are not as reliable as those for hyponymy, because extraction patterns capture syntagmatic information whereas synonymy is a paradigmatic relation5.

5 Along the paradigmatic axis, terms can substitute for each other; along the syntagmatic axis, terms rather tend to combine.

4.2.2 Model-specific relations

A wide range of domain-specific relations are examined in IE works. Elementary relations can be interpreted as attributes of a given object class. The attributes age, name, phone number, parent, birth place can be associated with a person (Embley et al. 1998). Various relations can hold between objects or events: from semantic roles, such as agent or patient roles, to more complex ones such as the symptom relation in the medical domain or the interaction between biological entities in genomics.

Extracting relations between entities helps to populate a database. However, extracting a relation in isolation is usually not sufficient for ontology design. The elementary relations must be structured in more complex schemata (Embley et al. 1998; Aone and Ramos-Santacruz 2000). For instance, in functional genomics, one of the most popular IE tasks aims at building enzyme and metabolic pathways, or regulation networks, which can be considered as specific ontologies. Such networks are described by complex graphs of interactions between genes, proteins and environmental factors such as drugs or stress. The ontological result of the extraction should represent at least the entities, their reactions, their properties and, at a higher level, feedback cycles. Single elementary and binary relations between entities are independently extracted by IE methods. The integration of these elementary relations into the ontology highly depends on the biological model represented in the ontology and on the other extracted facts. Few works address this integration question. As shown in Sect. 3, in most cases the improvement of the ontology by IE comes down to adding new instances of the interaction relation. For instance, with the semantic roles associated with repress (Agent(Repress, Protein) and Target(Repress, Gene)), the repress relation can be enriched with new instances: "SpoIIID represses spoVD transcription" yields Agent(Repress, SpoIIID) and Target(Repress, spoVD) (Roux et al. 2000). Other works such as (Ng and Wong 1999) aim at providing a user-friendly interface to facilitate the interpretation of the elementary results by the biologist.

On the whole, although useful, pattern-based relation acquisition cannot be the main knowledge source for ontology design. The best results in precision are obtained for hyponymy and specific relation extraction. Several reasons can be invoked. The variation in phrasing is difficult to capture and this affects the recall. General patterns must rely on grammatical words or constructs (like prepositions) which are semantically vague. This affects the precision. More fundamentally, the linguistically-based model cannot be directly mapped onto an ontology (Bouaud et al. 1996; Gangemi et al. 2001). Hyponymy between polysemic terms cannot be considered as a transitive relation; metonymy phenomena are conceptual shortcuts; language confuses roles and the entities that hold these roles. The use of IE relation extraction techniques must therefore be restricted to the complementation and tuning of an existing ontology, and any extracted information must be further interpreted in ontological terms.

5 Conclusion

As illustrated in this chapter, the IE research related to ontology is abundant, multiple and mainly applied. Many systems, approaches, algorithms and evaluations on quite basic applications are reported. At this stage, the main goal is more to develop systems that achieve better precision and recall than to make explicit and defend a given general approach against others. The influence of statistics on NLP, the influence of MUC on IE and the cost of ontological processing partially explain this. We rather attribute the quasi-absence of clear directions and modeling attempts to the novelty of the IE field. The simplest tasks are solved first (e.g. named entity recognition). IE methods for interpreting the lowest text levels are now well established. This maturity and the growing needs of real applications will draw the field towards a stronger involvement of ontological knowledge.

Difficult and unexplored questions dealing with the discrepancy between what the text is about and the exogenous lexicon and ontology should be investigated. This gap may be due not only to representation languages, divergent generality levels and incompleteness of the knowledge sources, which have been tackled by the revision field, but also to divergent text genres, points of view and underlying problem-solving tasks. IE driven by the ontology and integration of the extracted knowledge in the ontology will not be properly done without appropriate answers to these questions.

References

Aone C, Ramos-Santacruz M (2000) REES: A Large-Scale Relation and Event Extraction System. Proceedings of ANLP'2000, Seattle.

Aussenac-Gilles N, Biébow B, Szulman S (2000) Revisiting Ontology Design: a methodology based on corpus analysis. In: Dieng R, Corby O (eds) Knowledge Engineering and Knowledge Management: Methods, Models, and Tools. Proceedings of EKAW'2000, LNAI 1937, Springer-Verlag, pp 172-188.

Basili R, Pazienza MT, Velardi P (1993) Semi-automatic extraction of linguistic information for syntactic disambiguation. Applied Artificial Intelligence 7:339-364.
