
Semantic Caching in Ontology-based Mediator Systems


Marcel Karnstedt
marcel@…
University of Halle-Wittenberg
06099 Halle/Saale, Germany

Kai-Uwe Sattler, Ingolf Geist, Hagen Höpfner*
{kus|geist|hoepfner}@iti.cs.uni-magdeburg.de
University of Magdeburg
P.O. Box 4120, 39016 Magdeburg, Germany

Abstract: The integration of heterogeneous web sources is still a big challenge. One approach to deal with integration problems is the use of domain knowledge in the form of vocabularies or ontologies during the integration (mapping of source data) as well as during query processing. However, such an ontology-based mediator system still has to overcome performance issues caused by high communication costs to the local sources. Here, a global cache can reduce response times significantly. In this work we describe the semantic cache of the ontology-based mediator system YACOB. In this approach the cache entries are organized in semantic regions, and the cache itself is tightly coupled with the ontology on the concept level. Furthermore, we show the cache-based query processing as well as the advantages of the global concept schema for the creation of complementary queries.

1 Introduction

Many autonomous, heterogeneous data sources on various topics exist in the Web. Providing an integrated view of the data of such sources is still a big challenge. In this scenario several problems arise because of the autonomy and heterogeneity of the sources as well as scalability and adaptability with regard to a great number of (possibly changing) data sources. Approaches to overcome these issues are, for instance, metasearch engines, materialized approaches, and mediator systems, which answer queries on a global schema by decomposing them, forwarding the sub-queries to the source systems, and combining the results to a global answer. First-generation mediators achieve integration mainly on a structural level, i.e., data from different sources are combined based on structural correspondences, e.g., the existence of common classes and attributes. Newer mediator approaches use semantic information, such as vocabularies, concept hierarchies, or ontologies, to integrate different sources. In this paper, we use the YACOB mediator system, which uses domain knowledge modeled as concepts together with their properties and relationships. The system supports the mapping of the data from the local sources to a global concept schema. The semantic information is not only used to overcome the problems resulting from heterogeneity and autonomy of the different sources but also during query processing and optimization.

However, response times and scalability are still problems because of high communication costs to the local sources. One approach to reduce response times and improve the scalability of the system is the introduction of a cache which holds the results of previous queries. Thus, queries can be answered (partially) from the cache, saving communication costs. As page-based or tuple-based cache organizations are not useful in distributed, heterogeneous environments, the YACOB mediator supports a semantic cache, i.e., the cache entries are identified by the queries that generated them. This approach promises to be particularly useful because of the typical behavior of users during a search: starting with a first, relatively inexact query, users want to get an overview of the contained objects. Subsequently, the user iteratively refines the query by adding conjuncts or disjuncts to the original query. Therefore, it is very likely that the cache contains a (partial) data set to answer the refined query.

* Research supported by the DFG under grant SA 782/3-2

The contribution of this paper is the description of the caching component of the YACOB mediator. We discuss different possibilities of organizing the cache according to the ontology model as well as the retrieval of matching cache entries based on a modified query containment determination. Furthermore, the paper shows the generation of complementary queries using the global concept model as well as the efficient inclusion of the cache into the query processing.

The remainder of the paper is structured as follows: Section 2 gives a brief overview of the YACOB mediator system, its data model, and its query processing. In Section 3 the structure of the cache as well as replacement strategies and cache management with the help of semantic regions are described. The query processing based on the cache is discussed in Section 4. After a comparison with related work in Section 5, we conclude the paper with some preliminary performance results and give an outlook on our future work in Section 6.

2 The YACOB Mediator System

The YACOB mediator is a system that uses explicitly modeled domain knowledge for integrating heterogeneous data from the Web. Domain knowledge is represented in terms of concepts, properties, and relationships. Here, concepts act as terminological anchors for an integration beyond structural aspects. One of the scenarios where YACOB is applicable, and for which it was originally developed, is the (virtual) integrated access to Web databases on cultural assets that were lost or stolen during World War II. Examples of such databases (which are in fact integrated by our system) are www.lostart.de, www.herkomstgezocht.nl and www.restitution-art.cz.

The overall architecture of the system is shown in Fig. 1. The sources are connected via wrappers which process simple XPath queries (e.g., by translating them according to the proprietary query interface of the source) and return the result as an XML document of an arbitrary DTD.

The mediator accesses the wrappers using services from the access component, which forwards XPath queries via SOAP to the wrappers. The wrappers work as Web services and therefore can be placed at the mediator's site, at the source's site, or at a third place.

Figure 1: Architecture of the YACOB mediator

Another part of the access component is the semantic cache, which stores the results of queries in a local database and in this way allows using them for answering subsequent queries. This part of the system is the subject of this paper and is described in the following sections. Further components are the concept management component, providing services for storing and retrieving metadata (concepts as well as their mappings) in terms of an RDF graph, the query planning and execution component, which processes global queries, as well as the transformation component, responsible for transforming result data retrieved from the sources according to the global schema. Architecture and implementation of this system are described in [SGHS03]. Thus, we omit further details.

Exploiting domain knowledge in a data integration system requires ways of modeling this knowledge and of relating concepts to source data. For this purpose, we use a two-level model in our approach: the instance level comprises the data managed by the sources and is represented using XML; the metadata or concept level describes the semantics of the data and is based on RDF Schema (RDFS). Here, we provide

• concepts, which are classes in the sense of RDFS and for which extensions (sets of instances) are available in the sources,

• properties (attributes) of concepts,

• relationships, which are modeled as properties, too,

• as well as categories, which represent abstract property values used for the semantic grouping of objects.

These primitives are used for annotating local schemas, i.e., mappings between the global concept level and the local schemas are specified in a Local-as-View manner [SGHS03].

In this way, a source supports a certain concept if it provides a subset of the extension (either with all properties or only with a subset). For each supported concept, a source mapping specifies the local element and an optional filter restricting the instance set. Such a mapping is used both for rewriting queries and for transforming source data in the transformation component.

In the YACOB mediator, queries are formulated in CQuery, an extension of XQuery. CQuery provides additional operators applicable to the concept level, such as selecting concepts, traversing relationships, computing the transitive closure, etc., as well as operators for obtaining the extension of a concept. Concept-level operators are always processed at the global mediator level, whereas instance-level operators (filter, join, set operations) can be performed both in the mediator and by the sources. For a detailed description of CQuery we refer again to [SGHS03, SGS03]. For the remainder of this paper, it is only important to know that a global CQuery is rewritten and decomposed into several source queries in the form of XPath expressions, which can be delegated to the sources via the wrappers.

Because we are aware that for the average user of our application domain (historians, lawyers, etc.) CQuery is much too complex, we hide the query language behind a graphical Web user interface combining browsing and structured querying. The browsing approach implements navigation along the relationships (e.g., subClass) and properties defined at the concept level. The user can pick concepts and categories in order to refine the search. In each step the defined properties are presented as input fields allowing the user to specify search values for exact and fuzzy matching.

From the discussion of the architecture as well as the user interface, the necessity of a cache should be obvious:

• First, accessing sources over the Web and encapsulating sources using wrappers (i.e., translating queries and extracting data from HTML pages) result in poor performance compared to processing queries in a local DBMS.

• Second, a user interface paradigm involving browsing allows refining queries. That means the user can restrict a query by removing queried concepts or by conjunctively adding predicates, and he/she can expand a query by adding concepts or by disjunctively adding predicates. In the first case, the restricted query can be completely answered from the cache (assuming the result of the initial query was already added to the cache). In the latter case, at least portions of the query can be answered from the cache and quickly presented to the user, but additional complementary queries have to be executed to retrieve the remaining data from the sources.

Based on these observations, we will present in the following our caching approach, which uses concepts of the domain model as anchor points for cache items and exploits, to a certain degree, domain knowledge for determining complementary queries.

3 Cache Management

The cache is designed to store result data, which is received as XML documents, together with the corresponding queries, which are the semantic descriptions of those results. If a new query arrives, it has to be matched against the cached queries, and possibly a (partial) result has to be extracted from the cache's physical storage (see Section 4).

In order to realize the assumed behavior, a simple and effective way of storing XML data persistently and fail-safe is needed. One option is the use of a native XML database solution, which in this work is Apache's Xindice. The open source database Xindice stores XML documents in containers called collections. These collections are organized hierarchically, comparable to the organization of folders in a file system. The cache is placed below the ontology level, which means the cache entries are grouped according to the queried concepts. All entries corresponding to a concept are stored in a collection named after that concept. The actual data is stored as it is received from the sources in a sub-collection "results"; the describing data, namely the calculated "ranges" (see Section 4), the query string decomposed into its single sub-goals, and a reference to the corresponding result document, are stored in another sub-collection called "entries" (Fig. 2(a)). During query matching the XML-encoded entries are read from the database and the match type for the currently handled query is determined. If an entry matches, a part of or the whole result document is added to the global result data. If only a part of the cached result is needed, i.e., if the two queries overlap in some way, the part corresponding to the processed query has to be extracted. Here another advantage of using a native XML database becomes apparent: Xindice supports XPath as its query language. In order to retrieve the required data we simply have to execute the current query against the data of the entry.
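The hierarchical layout described above can be sketched in a few lines of Python. The class name SemanticCache, its methods, and the document naming scheme are our own illustration, not the actual Xindice API; in the real system the two sub-collections live in Xindice and result extraction is delegated to its XPath service.

```python
# Hypothetical in-memory model of the cache layout: one collection per
# queried concept, with a "results" sub-collection holding the raw XML
# documents and an "entries" sub-collection holding the describing data
# (the query's sub-goals and a reference to the result document).

class SemanticCache:
    def __init__(self):
        self.collections = {}   # concept name -> {"results": ..., "entries": ...}

    def store(self, concept, subgoals, result_xml):
        coll = self.collections.setdefault(concept,
                                           {"results": {}, "entries": []})
        doc_id = f"data{len(coll['results'])}.xml"   # reference to the result
        coll["results"][doc_id] = result_xml
        coll["entries"].append({"subgoals": subgoals, "ref": doc_id})
        return doc_id

    def entries(self, concept):
        """All cache entries stored below the given concept's collection."""
        return self.collections.get(concept, {"entries": []})["entries"]

cache = SemanticCache()
ref = cache.store("Graphics",
                  [("Artist", "=", "van Gogh"), ("Motif", "=", "People")],
                  "<result>...</result>")
```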

An important decision is the level of caching: we can store query results either at the concept level or at the source level. The difference lies in the form of the queries and the corresponding result documents. Caching at the concept level means caching queries formulated against the global schema. The queries are transformed according to the local source schemas after processing them in the cache. The retrieved result documents stored in the cache are already transformed back to the global schema, too.

Caching at the source level is placed below the transformation component. There are separate entries for each source, because the stored queries and results are formulated in the specific schema of a source. An evident advantage of the source-level cache is the finer granularity of the cached items, e.g., enabling the detection of temporarily offline sources by the cache manager. The main disadvantages (which are the benefits of using concept-level caching) are the increased management overhead because of the rising number of entries and a loss of performance on a cache hit. Caching at the concept level does not require any transformations on a cache hit: the result is returned immediately. Additionally, query matching making use of the global ontology, including a smart way of building a complementary query, is supported.

Fig. 2(b) shows a part of the global ontology together with an associated cache entry. This entry is created by executing the following query and storing the result in the cache:

//Graphics[Artist='van Gogh' and Motif='People']

(a) Cache entry  (b) Cache structure

Figure 2: Structure of the semantic cache

Using a storage strategy as described above, the cached data is grouped into semantically related sets called semantic regions. Every cache entry represents one semantic region, where the sub-goals of the predicate form a conjunctive expression. Disjunctive parts of a query get their own entries. The containment of a query is decided between the cached entries and every single conjunction of the disjunctive normal form of the query predicate. The decision algorithm is explained in detail in Section 4. The regions have to be disjoint, so each cached item is associated with exactly one semantic region. This is useful for getting a maximum of cached data when processing a query; in contrast, other works let the regions overlap and avoid data redundancy using reference counters ([LC98, LC99, LC01, KB96]). Certain problems and open questions arise if the regions have to be disjoint and are forbidden to overlap. Different strategies are possible if a processed query overlaps with a cached query, more exactly, if their result sets overlap. In this case, the part of the result data already stored in the cache is extracted and a corresponding complementary query is sent to the sources. The data received as the result of this query and the data found in the cache form one large semantic region. Now it has to be decided whether to keep this region, to split it, or to coalesce the separate parts in some way. Because putting all data in one region would result in bad granularity and lead to storage usage problems, the regions are stored separately. There are still several remaining ways of how to split/coalesce the single parts, all affecting the query answering mechanism and possible replacement strategies.

In our approach the data for the complementary query forms a new semantic region and is inserted into the cache (including the query representing its semantic description). Here, the semantic region holding the cached part of the result data remains unchanged. Another possible way is to collect the cached data and send the original query instead of the complementary query to the sources. The latter approach is useful in the case that matching all cached entries to a processed query results in a complementary query which causes multiple source queries or which is simply not answerable at all, e.g., due to unsupported operations such as "<>". Details of building a complementary query and related issues are described in Section 4. In order to keep the semantic regions disjoint, it is important to store only that part in the cache which corresponds to the complementary query created before.

Both ways guarantee that the semantic regions do not overlap, which is one of the formulated constraints on the cache. Collecting the data in such disjoint regions allows a simple replacement strategy: replacement at the region level, i.e., if a region has to be replaced, its complete represented data is deleted from the physical storage. This replacement strategy requires the following cache characteristics: First, the cached regions must not be too large. On the one hand, replacing large regions means deleting a big part of the cached data and results in inefficient storage usage. On the other hand, a large number of relatively small semantic regions leads to bad performance of the query processing. Small regions may enable replacement at a much finer granularity, but the cost of query processing will rise, because many regions have to be considered. Additionally, complementary queries will become very complex, not least because many small regions mean long query strings that have to be combined when creating the complementary query.
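Replacement at region granularity, combined with a simple timestamp policy, can be sketched as follows. The class RegionStore and its interface are illustrative only; the assumption is that every region carries a creation timestamp (for holding-time expiry) and a last-reference timestamp (for eviction order), and that a region is always evicted as a whole.

```python
import time

# Hypothetical sketch of region-level replacement: regions expire after a
# maximum holding time, and when the store is over capacity the least
# recently referenced region is deleted completely.

class RegionStore:
    def __init__(self, capacity, max_age):
        self.capacity, self.max_age = capacity, max_age
        self.regions = {}                 # region id -> (created, last_ref)

    def touch(self, rid, now=None):
        """Record a reference to a region (creating it if necessary)."""
        now = time.time() if now is None else now
        created = self.regions.get(rid, (now, now))[0]
        self.regions[rid] = (created, now)
        self._evict(now)

    def _evict(self, now):
        # drop expired regions first, then least recently referenced ones
        for rid in [r for r, (c, _) in self.regions.items()
                    if now - c > self.max_age]:
            del self.regions[rid]
        while len(self.regions) > self.capacity:
            oldest = min(self.regions, key=lambda r: self.regions[r][1])
            del self.regions[oldest]

store = RegionStore(capacity=2, max_age=3600)
store.touch('r1', now=0)
store.touch('r2', now=1)
store.touch('r3', now=2)   # over capacity: 'r1' is evicted as a whole
```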

Currently, our implemented replacement strategy is very simple. Timestamps referring to the date of collection and the last reference are kept, enabling a replacement strategy based on the age and referencing frequency of a cached entry. Entries are removed from the cache together with the corresponding result data if either their cache holding time expires or if an entry has to be replaced in order to make room for a new entry. Other conceivable strategies could make use of some kind of semantic distance (as in [DFJ+96]) or other locality aspects. The timestamp strategy is sufficient for the YACOB mediator system because the main concern is to support efficient interactive query refinement by the cache. (Dis-)advantages of other strategies have to be the subject of future work.

4 Cache-based Query Answering

Cache lookup is an integral part of the query processing approach used in the YACOB mediator. Thus, in the following we will first sketch the overall process before describing the cache lookup procedure.

In general, a query in CQuery consists of two kinds of expressions: a concept-level expression CExpr for computing a set of concepts, e.g., by specifying certain concepts or applying filter, traversal, or set operations, and an instance-level expression IExpr(c) consisting of operators such as selection which are applied to the extension of each concept c computed with CExpr. The results of the evaluation of IExpr(c) for each c are combined by a union operator, which we denote extensional union. Thus, we can formulate a query as follows: ⋃_{c ∈ CExpr} IExpr(c). For example, in the following query:

FOR $c IN concept[name='Paintings']
LET $e := extension($c)
WHERE $e/artist = 'van Gogh'
RETURN $e/title
       $e/artist

CExpr corresponds to the FOR clause and IExpr corresponds to the WHERE clause.

If a query contains additional instance-level operators involving more than one extension (e.g., a join), these are applied afterwards, but they are not considered here because they are not affected by the caching approach.

Based on this, a query is processed as follows. The first step is the evaluation of the concept-level expression. For each of the selected concepts we try to answer the instance-level expression by first translating it into an XPath query and applying this to the extension. Basically, this means sending the XPath query to the source system. However, using the cache we can try to answer the query from the cache. For this purpose, the function cache-lookup returns a (possibly partial) result set satisfying the query condition, i.e., if necessary, additional filter operations are applied to the stored cache entries, as well as a (possibly empty) complementary XPath query. In case of a non-empty complementary query, or if no cache entry was found, the XPath query is processed further by translating it according to the concept mapping CM(c) and sending the translated query to the corresponding source s. Finally, the results of calling cache-lookup and/or process-source-query are combined.

Input: query expression in the form of ⋃_{c ∈ CExpr} IExpr(c)
result set R := {}

 1  compute concept set C := CExpr
 2  forall c ∈ C do
 3      /* translate query into XPath */
 4      q := toXPath(IExpr(c))
 5      /* look for the query q in the cache → the result is denoted by R_c,
 6         q' is returned as the complementary query for q */
 7      R_c := cache-lookup(q, q')
 8      if R_c ≠ {} then
 9          /* found cache entry */
10          R := R ∪ R_c
11          q := q'
12      fi
13      if q ≠ empty then
14          q_s := translate-for-source(q, CM(c))
15          R_s := process-source-query(q_s, s)
16          R := R ∪ R_s
17      fi
18  od

Figure 3: Steps of query processing
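The loop of Figure 3 can also be rendered in Python roughly as follows; all four callables passed in are stand-ins for the components described above (cache lookup, source-schema translation, and source access), not the actual YACOB interfaces.

```python
# Sketch of the per-concept processing loop: answer as much as possible
# from the cache, then send the (complementary) query to the source.

def process_query(concepts, iexpr,
                  cache_lookup, translate_for_source, process_source_query):
    """Answer the extensional union over all concepts, using the cache."""
    result = set()
    for c in concepts:
        q = iexpr(c)                          # XPath query for this concept
        cached, complement = cache_lookup(q)  # (partial result, rest query)
        result |= cached
        if complement is not None:            # cache miss or partial hit
            qs = translate_for_source(complement, c)
            result |= process_source_query(qs)
    return result

# toy run: concept c1 is fully cached, c2 has to go to the source
lookup = {'q(c1)': ({'a'}, None), 'q(c2)': (set(), 'q(c2)')}
answer = process_query(['c1', 'c2'],
                       lambda c: f'q({c})',
                       lambda q: lookup[q],
                       lambda q, c: q,
                       lambda q: {'b'})
```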

Note that in this context complementary queries are derived at two points:

• First, because a global query is decomposed into a set of single-concept queries, complementary queries are derived implicitly. This means, if one wants to retrieve in query q the extension ext(c) of a concept c with two sub-concepts c1 and c2, where ext(c) = ext(c1) ∪ ext(c2) holds and ext(c1) is already stored in the cache, two queries q1 (for ext(c1)) and q2 (for ext(c2)) have to be processed. However, because we can answer q1 from the cache, only the complementary query q2 needs to be executed (i.e., q2 = q'), which is achieved by iterating over all concepts of the set C (line 2).

• Secondly, if the cache holds only a subset of ext(c1), restricted by a certain predicate p, we have to determine q1 with ¬p during the cache lookup.

match type (Q,C)   situation                          cached part of result data   complementary query
exact              data of Q and C identical          C's data                     none
containing         C containing Q                     Q on C's data                none
contained          C contained in Q                   C's data                     Q ∧ ¬C
overlapping        data of C and Q overlap            Q on C's data                Q ∧ ¬C
disjoint           C's data is no part of the result  none                         Q

Table 1: Possible match types between processed query Q and cached entry C

Because the first issue is handled as part of the query decomposition, we focus in the following only on the cache lookup.

We will use the example started in Section 3 to illustrate this step. Let us assume we are in a very early (almost initial) state of the cache, where the query

//Graphics[Artist='van Gogh' and Motif='People']

is the only stored query, referencing the result file data.xml. Now the new query

//Graphics[(Artist='van Gogh' or Artist='Monet') and Date='1600']

has to be processed. During the cache lookup every conjunction found in the disjunctive normal form of the processed query is matched against each cache entry in no special order. As mentioned in Section 3, the semantic regions do not overlap. Thus, independent of the order of processing the cache entries, all available parts of the result can be found in the cache. In other words, if an entry contains a part of the queried data, no other one will contain this data; e.g., if an exact match to the query exists, there will be no other containing match, and the exact match will be detected independent of any entries observed before. There are five possible match types that may occur, which are summarized in Tbl. 1.

The match types are listed top-down in the order of their quality. Obviously, the exact match is the best one, because the query can be answered simply by returning all the data of the entry. If no exact match can be found, a containing match would be the next best. In this case, a superset of the needed data is cached, and the required part can be extracted by applying the processed query Q to the cached documents. The contained, overlapping, and disjoint match types require processing a complementary query: in the contained and overlapping cases only a portion of the required data is stored, and the complementary query retrieves the remaining part from the sources. Considering our example, the disjunctive normal form of the new query is:

//Graphics[(Artist='van Gogh' and Date='1600')
  or (Artist='Monet' and Date='1600')]

The two parts in brackets are the conjunctions we have to handle separately during the cache lookup.
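Computing the disjunctive normal form of such a predicate can be sketched as follows; the nested-tuple representation of predicates ('and'/'or' nodes over atomic sub-goals) is our own illustration, not the internal YACOB format.

```python
from itertools import product

# Sketch: normalize a predicate tree into a list of conjunctions, each
# conjunction being a list of atomic sub-goals (attribute, operator, value).

def dnf(pred):
    """Return the predicate as a list of conjunctions (disjunctive NF)."""
    if isinstance(pred, tuple) and pred and pred[0] == 'or':
        return [conj for part in pred[1:] for conj in dnf(part)]
    if isinstance(pred, tuple) and pred and pred[0] == 'and':
        # distribute AND over the OR-alternatives of every operand
        alternatives = [dnf(part) for part in pred[1:]]
        return [sum(combo, []) for combo in product(*alternatives)]
    return [[pred]]                        # atomic sub-goal

# the example predicate (Artist='van Gogh' or Artist='Monet') and Date='1600'
query = ('and',
         ('or', ('Artist', '=', 'van Gogh'), ('Artist', '=', 'Monet')),
         ('Date', '=', '1600'))
```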

The algorithm in Fig. 4 shows the procedure of the cache lookup in pseudo code.

Input:
    query q
Output:
    result set R := {}
    complementary query q' := empty

 1  q := disjunctive-normal-form(q);
 2  E := get-cache-entries(get-concept(q));
 5  forall conjunctions Conj of q do
 6      CC := {Conj};        /* current conjunctions (still to check) */
 7      forall cache entries C ∈ E do
 8          NC := ∅;         /* new conjunctions (to check) */
 9          forall conjunctions Conj' ∈ CC do
10              M := match-type(Conj', C);
11              switch (M) do
12                  case 'disjoint':    break;
13                  case 'exact':       R := R ∪ C→Data;
14                                      CC := CC \ {Conj'};
15                                      break;
16                  case 'containing':  R := R ∪ Conj'(C→Data);
17                                      CC := CC \ {Conj'};
18                                      break;
19                  case 'contained':   R := R ∪ C→Data;
20                                      CC := CC \ {Conj'};
21                                      NC := NC ∪ {(Conj' ∧ ¬C)};
22                                      break;
23                  case 'overlapping': R := R ∪ Conj'(C→Data);
24                                      CC := CC \ {Conj'};
25                                      NC := NC ∪ {(Conj' ∧ ¬C)};
26                                      break;
27              od
28          od
29          CC := CC ∪ NC;
30          if CC = {} then break;
31      od
32      if CC ≠ {} then q' := q' + CC;
33  od
34  return R, q';

Figure 4: Procedure cache-lookup

The match type between the processed and the cached query is determined by calling the procedure match-type (line 10). This procedure implements a solution to the query containment problem and is discussed later in this section.

Running over all cached entries, we only have to match the query remaining from the step before instead of matching the original query again each time. This reflects the part of the result data that has already been found. This data cannot occur in the semantic regions described by other entries and does not need to be retrieved from the sources. In the cases of exact and containing matches no query will remain, because all of the data can be found in the cache, and therefore the cache lookup is finished (lines 13 to 18). In all other cases, we still have to match against the remaining entries. If we encounter a disjoint match, the currently checked conjunction has to be matched against the next entry (line 12). In contrast, if a part of the result data is found, only the complementary query has to be processed further. In order to avoid checks of conjunctions created during the generation of the complementary query, we first collect them separately (lines 21 and 25) and add them to the set of conjunctions checked later (line 29). The query remaining after checking all (possibly newly created) conjunctions against all entries is built in q' (line 32). Here, the operation '+' denotes a concatenation of the existing query q' and all conjunctions in CC by a logical OR. If all entries belonging to the queried concept have been checked, a possibly remaining query has to be used to fetch the data which could not be found in the cache. The procedure returns the set of all collected references to parts of the result stored in the cache as well as the complementary query (line 34). This query is sent to the sources, and, in parallel, the cached data is extracted from the physical storage.

In the simple example introduced above, only one cache entry has been created, which we have to check for a match. Here, we get an overlapping match between the cached query and the first of the two conjunctions. Thus, the complementary query looks as follows:

//Graphics[Artist='van Gogh' and Date='1600' and Motif!='People']

This expression is added to the global complementary query, because no further entry is left that we could check, and therefore we cannot find any further parts of the result data in the cache. Comparing the second query conjunction, we receive a disjoint match, because the query predicates together are unsatisfiable in the attribute Artist. After checking against all existing entries, this conjunction becomes part of the complementary query unchanged. The final complementary query is:

//Graphics[(Artist='van Gogh' and Date='1600' and Motif!='People')
  or (Artist='Monet' and Date='1600')]

The cached part of the result data is extracted from the cache database by applying

//Graphics[Artist='van Gogh' and Date='1600']

to the result document data.xml, which is done using the XPath query service provided by Xindice.

Match Type Determination. In order to determine the match type between the processed and the cached query, the problem of query containment has to be solved, symbolized in the pseudo code by the call of the method match-type. We can restrict the general problem to a containment test on query predicates. In the YACOB mediator all predicates are in a special form: they are sets of sub-goals combined by logical OR and/or AND. The sub-goals are only simple attribute/constant expressions, limited to X θ c, where X is an attribute, c is a constant, and θ ∈ {=, !=, ~=}. We do not have to forbid numerical constants and the corresponding operations (it is easy to adapt the implemented containment algorithm to numerical domains), but in fact there are currently only attributes defined on string domains in YACOB. The algorithm is based on solutions to the problems of satisfiability and implication. The NP-hardness of the general containment problem does not apply here because of the limitation that only constants may appear on the right side of an expression. In the solutions to the problem found in the literature, the NP-hardness results from allowing the != operator together with comparisons between attributes defined on integer domains. See [GSW96] for a good comparison.

The basic idea is to parse the query and to derive a range for every identified sub-goal. For each attribute these ranges contain the values the attribute may take. The containment between two queries is then determined using the ranges created before.
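A much simplified sketch of this determination, restricted to equality sub-goals only (the real algorithm also handles != and ~= via the ranges), could look like this. The per-attribute relation labels are our own naming:

```python
# Sketch: classify the relation between two conjunctions of equality
# sub-goals (attribute, '=', value). An attribute missing from a
# conjunction is unconstrained there.

def match_type(query_conj, cached_conj):
    """Return 'exact', 'containing', 'contained', 'overlapping' or 'disjoint'."""
    q = {a: v for a, op, v in query_conj}
    c = {a: v for a, op, v in cached_conj}
    rel = set()
    for attr in q.keys() | c.keys():
        if attr in q and attr in c:
            if q[attr] != c[attr]:
                return 'disjoint'      # contradictory equality constraints
            rel.add('equal')
        elif attr in c:
            rel.add('c_subset')        # cache is more restrictive here
        else:
            rel.add('q_subset')        # query is more restrictive here
    if rel <= {'equal'}:
        return 'exact'
    if rel <= {'equal', 'q_subset'}:
        return 'containing'            # cached region is a superset of Q
    if rel <= {'equal', 'c_subset'}:
        return 'contained'             # cached region is a subset of Q
    return 'overlapping'
```

On the running example, the first conjunction of the refined query against the cached entry yields an overlapping match, the second a disjoint one, just as in the text.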

A special treatment is required for CQuery's text similarity operator '~=', which is mapped to appropriate similarity operators of the source query interfaces such as like or contains. In order to decide whether a cached query matches the current query, we have to detect such similarities between attribute values without knowing the actual semantics of the similarity operation in the source system. For solving this problem, we have chosen a pragmatic approach: if a query uses the '~=' operator, the result will include all similar objects; e.g., querying Artist~='Gogh' will return objects with Artist='v. Gogh', Artist='van Gogh, V.', Artist='v. Gogh, V', etc. If later a new query filters an attribute value similar to the element stored in the cache (in the given example, for instance, Artist~='van Gogh'), the cached results can be used.

We have implemented the handling of the similarity operator based on substrings. Assume two queries each having only one condition in their predicates, one on an attribute A with A~=x and the other on the same attribute with A~=y. Now, if the string x is contained in y as a substring, the cache entry for A~=x contains all data belonging to A~=y. Otherwise, if the result for A~=y is cached and the query is A~=x, we encounter a contained match.
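The substring rule can be captured directly; the function below is a sketch of this check for two '~=' sub-goals on the same attribute (the function name is ours):

```python
# Sketch: relation between a cached sub-goal A ~= cached_val and a new
# sub-goal A ~= query_val, decided purely on substring containment.

def similarity_match(cached_val, query_val):
    if cached_val == query_val:
        return 'exact'
    if cached_val in query_val:
        return 'containing'   # cached A ~= x covers the data of A ~= y, x in y
    if query_val in cached_val:
        return 'contained'    # only part of the needed data is cached
    return 'disjoint'
```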

Complementary Query Construction. The construction of the complementary query in the cases of the contained and overlapping match types has to be examined in more detail. The objective is to create a new query which retrieves only the data not already found in the cache. If the processed query is q and the query of the cached entry is C, the new query is q ∧ ¬C. This query is obtained by negating each sub-goal c_i of C and combining it with the query q by a logical AND. This results in n OR-combined parts, where n is the number of sub-goals in C. Thus, the generated query will extract the data belonging to the result of q not found in C, and it is already in disjunctive normal form. In general, some of the n constructed conjunctions will be unsatisfiable. The following example illustrates this:

• cached query: //Graphics[Artist='van Gogh' and Motif='People']

• new query: //Graphics[Artist='van Gogh' and Date='1600']

• complementary query:
  //Graphics[(Artist='van Gogh' and Date='1600' and Artist!='van Gogh')
    or (Artist='van Gogh' and Date='1600' and Motif!='People')]

• pruned to satisfiable parts:
  //Graphics[Artist='van Gogh' and Date='1600' and Motif!='People']

The implemented satisfiability algorithm is used to detect and delete these parts of the complementary query.
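The negation-and-pruning steps of the example can be sketched as follows. This is a simplified illustration assuming predicates are flat lists of (attribute, operator, value) sub-goals with only equality and inequality operators; the identifiers are invented for the sketch and do not come from the YACOB code.

```python
def negate(subgoal):
    """Negate a single equality/inequality sub-goal."""
    attr, op, val = subgoal
    return (attr, "!=" if op == "=" else "=", val)

def satisfiable(conjunct):
    """A conjunct is unsatisfiable if it both requires and forbids a value."""
    return not any((attr, "!=", val) in conjunct
                   for attr, op, val in conjunct if op == "=")

def complement(query, cached):
    """Build q AND NOT C in DNF: one conjunct per negated cached sub-goal,
    dropping the unsatisfiable conjuncts."""
    parts = [list(query) + [negate(sg)] for sg in cached]
    return [p for p in parts if satisfiable(p)]

q = [("Artist", "=", "van Gogh"), ("Date", "=", "1600")]     # new query
c = [("Artist", "=", "van Gogh"), ("Motif", "=", "People")]  # cached query
for part in complement(q, c):
    print(part)
# Only the satisfiable part survives, corresponding to
# Artist='van Gogh' and Date='1600' and Motif!='People'.
```

Note that the result is in disjunctive normal form by construction, exactly as stated above, so no further normalization step is needed before decomposing it for the sources.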

The more entries we find overlapping with the processed query, the more complex the complementary query becomes. Each of the cached entries expands the query by some sub-goals. In the end, the resulting query could be so complex that efficient processing by the sources is not possible, or that it cannot be answered at all. Thus, we need heuristics to decide whether the constructed complementary query is too complex to process. If so, the original query should be sent to the sources instead, accepting a higher network load and the need for duplicate elimination. However, the cache still supports the fast creation and delivery of an answer set to the user. Such heuristics could be based on the query capability descriptions of the sources, but this is currently not supported by our system.
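Since the concrete heuristic is left open here, the following is a purely illustrative size-budget sketch (the thresholds and names are invented, and this is not what our system implements): fall back to the original query once the DNF complement exceeds a budget on the number of OR-parts or total sub-goals.

```python
def too_complex(dnf_parts, max_parts=4, max_subgoals=12):
    """Decide whether a complementary query in DNF (a list of conjuncts,
    each a list of sub-goals) should be replaced by the original query."""
    total = sum(len(part) for part in dnf_parts)
    return len(dnf_parts) > max_parts or total > max_subgoals

# A complement with 5 OR-parts exceeds the part budget:
print(too_complex([[("A", "=", "x")]] * 5))  # True
```

A more refined heuristic would take the sources' query capability descriptions into account, e.g. rejecting complements whose disjunctions a source cannot evaluate at all.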

5Related Work

The entries in a semantic cache are organized by semantic regions. Therefore, the selection of relevant cache entries for answering a query is based on the problems of query containment and equivalence. There are several publications which focus on different aspects of query containment, such as completeness, satisfiability, and complexity. Surveys of these approaches can be found, for instance, in [GSW96, Hal01].

Caching data in general and semantic caching in particular are common approaches for reducing response times and transmission costs in (heterogeneous) distributed information systems. These works comprise classical client-server databases [DFJ+96, KB96], heterogeneous multi-database environments [GG97, GG99], Web databases [LC98, LC01] as well as mobile information systems [LLS99, RD00].

The idea of semantic regions was introduced by Dar et al. [DFJ+96]. Semantic regions are defined dynamically based on the processed queries, which are restricted to selection conditions on single relations. The constraint formulas describing the regions are conjunctive representations of the used selection predicates. Thereby, the regions are disjoint.

The semantic query cache (SQC) approach for mediator systems over structured sources is presented in [GG97, GG99]. The authors discuss most aspects of semantic caching in a theoretical manner: determining when answers are in the cache, finding answers in the cache, semantic overlapping, semantic independence, and semantic remainders. Our approach also reflects most of the addressed aspects, but deals with semistructured data.

Keller and Basu describe in [KB96] an approach for semantic caching that examines and maintains the contents of the cache. To this end, the predicate descriptions of executed queries are stored on the client as well as on the server. Queries can include selections, projections, and joins over one or more relations, but results have to comprise all keys that have been referenced in the query.

Semantic caching in Web mediator systems is proposed in [LC98, LC01]. Most ideas in this approach are based on [DFJ+96]. However, the authors also introduce a technique that allows the generation of cache hits by using additional semantic background information. In contrast to our approach, the cache is not located in the mediator access component but in the wrappers. As discussed in the previous sections, a tight coupling to the global ontology structure was chosen in the YACOB system.

In mobile information systems, semantic caches are typically used for bridging the gap between the portability of mobile devices and the availability of information. The LDD cache [RD00] is optimized to cache location-dependent information but, in fact, also uses techniques which are common for semantic caches. Results are cached on the mobile devices and are indexed with meta information that is generated from the queries. In this approach, however, the index is only a table which references a logical page, which is similar to semantic regions.

6Discussion and Conclusions

Semantic caching is a viable approach for improving response times and reducing communication costs in a Web integration system. In this paper, we have presented a caching approach which we developed as part of our YACOB mediator system. A special feature of this approach is the tight connection to the ontology level: the cache is organized along the concepts. Furthermore, the modeled domain knowledge is exploited for obtaining the complementary queries required for processing queries which can only partially be answered from the cache.

For evaluation purposes, we ran some preliminary experiments in our real-world setup[1], showing that response times for queries which can be answered from the cache are reduced by a factor of 4 to 6. However, the results depend strongly on the query mix, i.e. the user behavior, as well as on the source characteristics (e.g. response time and query capabilities), so we omit details here. Currently, we are evaluating the caching approach using different strategies for replacement and for determining complementary queries in a simulated environment.

In future research,we plan to exploit more information from the concept level in order to reduce the effort for complementary queries.

[1] http://arod.cs.uni-magdeburg.de:8080/Yacob/index.html

References

[DFJ+96] S. Dar, M. J. Franklin, B. Tór Jónsson, D. Srivastava, and M. Tan. Semantic Data Caching and Replacement. In VLDB'96, Proc. of the 22nd Int. Conf. on Very Large Data Bases, pages 330–341, Mumbai (Bombay), India, September 3–6, 1996. Morgan Kaufmann.

[GG97] P. Godfrey and J. Gryz. Semantic Query Caching for Heterogeneous Databases. In Intelligent Access to Heterogeneous Information, Proc. of the 4th Workshop KRDB-97, Athens, Greece, volume 8 of CEUR Workshop Proceedings, pages 6.1–6.6, August 30, 1997.

[GG99] P. Godfrey and J. Gryz. Answering Queries by Semantic Caches. In Database and Expert Systems Applications, 10th Int. Conf., DEXA'99, Florence, Italy, Proc., volume 1677 of LNCS, pages 485–498. Springer, August 30 – September 3, 1999.

[GSW96] S. Guo, W. Sun, and M. A. Weiss. On Satisfiability, Equivalence, and Implication Problems Involving Conjunctive Queries in Database Systems. IEEE TKDE, 8(4):604–616, August 1996.

[Hal01] A. Y. Halevy. Answering Queries using Views: A Survey. VLDB Journal: Very Large Data Bases, 10(4):270–294, December 2001.

[KB96] A. M. Keller and J. Basu. A Predicate-based Caching Scheme for Client-Server Database Architectures. VLDB Journal: Very Large Data Bases, 5(1):35–47, January 1996.

[LC98] D. Lee and W. W. Chu. Conjunctive Point Predicate-based Semantic Caching for Wrappers in Web Databases. In CIKM'98 Workshop on Web Information and Data Management (WIDM'98), Washington, DC, USA, November 6, 1998.

[LC99] D. Lee and W. W. Chu. Semantic Caching via Query Matching for Web Sources. In Proc. of the 1999 ACM CIKM Int. Conf. on Information and Knowledge Management, Kansas City, Missouri, USA, pages 77–85. ACM, November 2–6, 1999.

[LC01] D. Lee and W. W. Chu. Towards Intelligent Semantic Caching for Web Sources. Journal of Intelligent Information Systems, 17(1):23–45, November 2001.

[LLS99] K. C. K. Lee, H. V. Leong, and A. Si. Semantic Query Caching in a Mobile Environment. ACM SIGMOBILE Mobile Computing and Communications Review, 3(2):28–36, April 1999.

[RD00] Q. Ren and M. H. Dunham. Using Semantic Caching to Manage Location Dependent Data in Mobile Computing. In Proc. of the 6th Annual Int. Conf. on Mobile Computing and Networking (MOBICOM-00), pages 210–242, New York, August 6–11, 2000. ACM.

[SGHS03] K. Sattler, I. Geist, R. Habrecht, and E. Schallehn. Konzeptbasierte Anfrageverarbeitung in Mediatorsystemen. In Proc. BTW'03 – Datenbanksysteme für Business, Technologie und Web, Leipzig, 2003, GI-Edition, Lecture Notes in Informatics, pages 78–97, 2003.

[SGS03] K. Sattler, I. Geist, and E. Schallehn. Concept-based Querying in Mediator Systems. Technical Report 2, Dept. of Computer Science, University of Magdeburg, 2003.
