NB-Tree An Indexing Structure for Content-Based Retrieval in Large Databases
Foreign-Language Source Text and Translation

Source Text

DATABASE

A database may be defined as a collection of interrelated data stored together with as little redundancy as possible to serve one or more applications in an optimal fashion. The data are stored so that they are independent of the programs that use them. A common and controlled approach is used in adding new data and in modifying and retrieving existing data within the database. One system is said to contain a collection of databases if they are entirely separate in structure. A database may be designed for batch processing, real-time processing, or in-line processing. A database system involves application programs, a DBMS, and a database.

THE INTRODUCTION TO DATABASE MANAGEMENT SYSTEMS

The term database is often used to describe a collection of related files that is organized into an integrated structure that gives different people varied access to the same data. In many cases this resource is located in different files in different departments throughout the organization, often known only to the individuals who work with their specific portion of the total information. In these cases, the potential value of the information goes unrealized because people in other departments who may need it do not know it exists or cannot access it efficiently. In an attempt to organize their information resources and provide timely and efficient access, many companies have implemented databases.

A database is a collection of related data. By data, we mean known facts that can be recorded and that have implicit meaning. For example, consider the names, telephone numbers, and addresses of all the people you know. You may have recorded this data in an indexed address book, or you may have stored it on a diskette using a personal computer and software such as dBASE III or Lotus 1-2-3. This is a collection of related data with an implicit meaning and hence is a database.

The above definition of database is quite general.
For example, we may consider the collection of words that make up this page of text to be related data and hence to constitute a database. However, the common use of the term database is usually more restricted. A database has the following implicit properties:

● A database is a logically coherent collection of data with some inherent meaning. A random assortment of data cannot be referred to as a database.
● A database is designed, built, and populated with data for a specific purpose. It has an intended group of users and some preconceived applications in which these users are interested.
● A database represents some aspect of the real world, sometimes called the miniworld. Changes to the miniworld are reflected in the database.

In other words, a database has some source from which data are derived, some degree of interaction with events in the real world, and an audience that is actively interested in the contents of the database.

A database management system (DBMS) is composed of three major parts: (1) a storage subsystem that stores and retrieves data in files; (2) a modeling and manipulation subsystem that provides the means to organize the data and to add, delete, maintain, and update the data; and (3) an interface between the DBMS and its users. Several major trends are emerging that enhance the value and usefulness of database management systems:

● Managers who require more up-to-date information to make effective decisions.
● Customers who demand increasingly sophisticated information services and more current information about the status of their orders, invoices, and accounts.
● Users who find that they can develop custom applications with database systems in a fraction of the time it takes to use traditional programming languages.
● Organizations that discover information has a strategic value; they utilize their database systems to gain an edge over their competitors.

A DBMS can organize, process, and present selected data elements from the database.
This capability enables decision makers to search, probe, and query database contents in order to extract answers to nonrecurring and unplanned questions that aren't available in regular reports. These questions might initially be vague or poorly defined, but people can "browse" through the database until they have the needed information. In short, the DBMS will "manage" the stored data items and assemble the needed items from the common database in response to the queries of those who aren't programmers. In a file-oriented system, users needing special information may communicate their needs to a programmer, who, when time permits, will write one or more programs to extract the data and prepare the information. The availability of a DBMS, however, offers users a much faster alternative communications path.

DATABASE QUERY

If the DBMS provides a way to interactively enter and update the database, as well as interrogate it, this capability allows for managing personal databases. However, it does not automatically leave an audit trail of actions and does not provide the kinds of controls necessary in a multi-user organization. These controls are only available when a set of application programs is customized for each data entry and updating function.

Software for personal computers that performs some of the DBMS functions has been very popular. Personal computers were intended for use by individuals for personal information storage and processing. Small enterprises and professionals such as doctors, architects, engineers, and lawyers have also used these machines extensively. By the nature of their intended usage, database systems on these machines are exempt from several of the requirements of full-fledged database systems.
Since data sharing is not intended, and concurrent operation even less so, the software can be less complex. Security and integrity maintenance are de-emphasized or absent. As data volumes will be small, performance efficiency is also less important. In fact, the only aspect of a database system that is important is data independence. Data independence, as stated earlier, means that application programs and user queries need not recognize the physical organization of data on secondary storage. The importance of this aspect, particularly for the personal computer user, is that it greatly simplifies database usage. The user can store, access, and manipulate data at a high level (close to the application) and be totally shielded from the low-level (close to the machine) details of data organization.

DBMS STRUCTURING TECHNIQUES

Spatial data management has been an active area of research in the database field for two decades, with much of the research focused on developing data structures for storing and indexing spatial data. However, no commercial database system provides facilities for directly defining and storing spatial data and formulating queries based on search conditions on spatial data.

There are two components to temporal data management: history data management and version management. Both have been the subjects of research for over a decade. The troublesome aspect of temporal data management is that the boundary between applications and database systems has not been clearly drawn. Specifically, it is not clear how much of the typical semantics and facilities of temporal data management can and should be directly incorporated in a database system, and how much should be left to applications and users.
In this section, we will provide a list of short-term research issues that should be examined to shed light on this fundamental question.

The focus of research into history data management has been on defining the semantics of time and time intervals, and on understanding the semantics of queries and updates against history data stored in an attribute of a record. Typically, in the context of relational databases, a temporal attribute is defined to hold a sequence of history data for the attribute. An item of history data consists of a data item and a time interval for which the data item is valid. A query may then be issued to retrieve history data for a specified time interval for the temporal attribute. The mechanism for supporting temporal attributes is similar to that for supporting set-valued attributes in a database system such as UniSQL.

In the absence of support for temporal attributes, application developers who need to model history data have simply simulated temporal attributes by creating an attribute for the time interval alongside the "temporal" attribute. This of course may result in duplication of records in a table and more complicated search predicates in queries. The one necessary topic of research in history data management is to quantitatively establish the performance (and even productivity) differences between using a database system that directly supports temporal attributes and using a conventional database system that supports neither set-valued attributes nor temporal attributes.

Data security, integrity, and independence

Data security prevents unauthorized users from viewing or updating the database. Using passwords, users are allowed access to the entire database or to subsets of the database, called subschemas.
For example, an employee database can contain all the data about an individual employee, but one group of users may be authorized to view only payroll data, while others are allowed access to only work history and medical data.

Data integrity refers to the accuracy, correctness, or validity of the data in the database. In a database system, data integrity means safeguarding the data against invalid alteration or destruction. In large on-line database systems, data integrity becomes a more severe problem and two additional complications arise. The first has to do with many users accessing the database concurrently. For example, if several travel agents try to book the same seat on the same flight at the same time, the first agent's booking can be overwritten and lost. In such cases the technique of locking the record or field provides the means for preventing one user from accessing a record while another user is updating the same record.

The second complication relates to hardware, software, or human error during the course of processing and involves the database transaction, which is a group of database modifications treated as a single unit. For example, an agent booking an airline reservation performs several database updates (e.g., adding the passenger's name and address and updating the seats-available field), which together comprise a single transaction. The database transaction is not considered complete until all updates have been completed; otherwise, none of the updates is allowed to take place.

An important point about database systems is that the database should exist independently of any specific application. Traditional data processing applications are data dependent.

When a DBMS is used, detailed knowledge of the physical organization of the data does not have to be built into every application program. The application program asks the DBMS for data by field name; for example, a coded representation of "give me customer name and balance due" would be sent to the DBMS.
Without a DBMS the programmer must reserve space for the full structure of the record in the program. Any change in data structure requires changes in all the application programs.

Data Base Management System (DBMS)

The system software package that handles the difficult tasks associated with creating, accessing, and maintaining database records is called a database management system (DBMS). A DBMS will usually be handling multiple data calls concurrently. It must organize its system buffers so that different data operations can be in process together. It provides a data definition language to specify the conceptual schema and, most likely, some of the details regarding the implementation of the conceptual schema by the physical schema. The data definition language is a high-level language, enabling one to describe the conceptual schema in terms of a "data model". At the present time, there are four underlying structures for database management systems:

● List structures.
● Relational structures.
● Hierarchical (tree) structures.
● Network structures.

Management Information System (MIS)

An MIS can be defined as a network of computer-based data processing procedures developed in an organization and integrated as necessary with manual and other procedures for the purpose of providing timely and effective information to support decision making and other necessary management functions.

One of the most difficult tasks of the MIS designer is to develop the information flow needed to support decision making. Generally speaking, much of the information needed by managers who occupy different levels and have different responsibilities is obtained from a collection of existing information systems (or subsystems).

Structured Query Language (SQL)

SQL is a database processing language endorsed by the American National Standards Institute.
It is rapidly becoming the standard query language for accessing data in relational databases. With its simple, powerful syntax, SQL represents great progress in database access for all levels of management and computing professionals.

SQL comes in two forms: interactive SQL and embedded SQL. Embedded SQL usage is close to traditional programming in third-generation languages. It is the interactive use of SQL that makes it most applicable for the rapid answering of ad hoc queries. With an interactive SQL query you just type in a few lines of SQL and get the database response immediately on the screen.
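The ad hoc querying and the all-or-nothing booking transaction described in this text can be tried out with a small sketch. The following uses Python's standard sqlite3 module with an in-memory database; the table names, column names, flight code, and the helper book_seat are all hypothetical and chosen only for illustration. It is one possible way to express the idea, not how any particular airline system or DBMS does it.

```python
import sqlite3

def book_seat(conn, flight, seat, passenger):
    """Book one seat as a single transaction; return True on success."""
    try:
        with conn:  # commits if the block succeeds, rolls back on an exception
            cur = conn.execute(
                "UPDATE seats SET passenger = ? "
                "WHERE flight = ? AND seat = ? AND passenger IS NULL",
                (passenger, flight, seat),
            )
            if cur.rowcount != 1:
                # Seat is taken (or does not exist): abort the whole unit.
                raise ValueError("seat not available")
            conn.execute(
                "UPDATE flights SET seats_available = seats_available - 1 "
                "WHERE flight = ?",
                (flight,),
            )
        return True
    except ValueError:
        return False

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE flights (flight TEXT PRIMARY KEY, seats_available INTEGER);
    CREATE TABLE seats (flight TEXT, seat TEXT, passenger TEXT,
                        PRIMARY KEY (flight, seat));
    INSERT INTO flights VALUES ('XY100', 2);
    INSERT INTO seats VALUES ('XY100', '1A', NULL), ('XY100', '1B', NULL);
""")

first = book_seat(conn, "XY100", "1A", "Ada")   # seat is free
second = book_seat(conn, "XY100", "1A", "Bob")  # same seat again

# An ad hoc query of the kind one would type into an interactive SQL prompt.
available = conn.execute(
    "SELECT seats_available FROM flights WHERE flight = 'XY100'"
).fetchone()[0]
print(first, second, available)
```

The second booking fails without touching the seats-available counter, because both updates belong to one transaction that is rolled back as a unit. The conditional UPDATE (passenger IS NULL) stands in for the record locking mentioned in the text; a multi-user DBMS would typically combine such a check with its own locking.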
Reusing Mechanical Engineering Design

Zoé Lacroix
Department of Mechanical and Aerospace Engineering
Arizona State University
PO Box 876106
Tempe, AZ 85287-6106, USA

Abstract: Design in mechanical engineering is often redesign and involves retrieving similar designs from various integrated design repositories or paper drawings. To support designers, a system must provide access to previous valid designs that satisfy or nearly satisfy requirements, functions, or characteristics, and allow the integration and transformation of the retrieved designs into a new design that meets the new design specifications. Many approaches have been developed in the past to support designers in accessing designs for their new design needs. Most existing approaches typically store design documents as files and provide an additional structure to access the documents through the indexing of the entire file. They do not offer designers the ability to express more complex retrieval queries, or simply queries different from the ones pre-computed in the index. No existing approach appears to offer a real query language to retrieve cases for engineering design in various contexts represented by the geometry and topology of a finished design. In these systems, retrieval queries usually are Boolean expressions of terms consisting of valued attributes as represented through the indices (or pre-computed annotations). In addition, these approaches usually do not provide a framework for design integration (when components extracted from several designs must be integrated to meet the new requirements) or transformation (when a component is similar but needs to be transformed to meet the new specification), as needed to support mechanical engineering design reuse. In this paper we present an approach to support mechanical design data reuse that exploits constraint databases. This approach provides a real query language that expresses complex queries that retrieve, integrate, and transform design data.
It relies on a data model through constraints compatible with the Standard for the Exchange of Product Model Data (STEP), a standard for mechanical design data representation.

I. INTRODUCTION

Mechanical engineering design data are very complex geometric data (e.g. an aircraft engine) composed of large numbers of parts (e.g. the 777 engine has 30,000 parts). Design data are often stored on paper (drawings) or in electronic files, typically in the proprietary format of the computer-aided design (CAD) system that produced them.

Most existing retrieval approaches support case-based retrieval tasks, assisting designers in accessing previous valid designs that satisfy or nearly satisfy requirements, functions, or characteristics [31]. Such systems rely on indexing techniques such as generated design signatures [12], object-oriented [23], or feature-based [18] approaches. These approaches typically store design documents as files and provide an additional structure to access the documents through the indexing of the entire file, and do not offer additional granularity. Indeed, the geometry of the design itself is generally not indexed, and no direct access is provided to the portion of the geometry that satisfies the property validated by the index. The selection of the properties and characteristics used to index design data has significant consequences: the selected attributes constrain the level of expression in the retrieval phase.

In general, retrieval approaches for design data are designed for a specific domain usage, such as feature-driven or material-driven retrieval, and usually lack the efficiency to handle large data repositories. These approaches do not provide a real query language but rather limited query interfaces that rely heavily upon indexed systems, such as the expert-driven questionnaires in Conversational CBD [1], tabular [2], frame-based [9], or function structures [21].
Retrieval queries usually are Boolean expressions of terms consisting of valued attributes as represented through the indices (or pre-computed annotations). The limitations of existing approaches include:

1. queries limited to pre-computed annotations of the drawing, as opposed to the design data themselves;
2. retrieval of a whole file instead of the extraction of geometric components of a given design that satisfy specified properties;
3. domain-specific retrieval approaches, as opposed to a generic approach;
4. no query language for data integration and transformation;
5. no query language that allows the expression of complex queries as combinations of basic operators.

Recent efforts towards a standard for mechanical design data representation led to the Standard for the Exchange of Product Model Data (STEP) [22]. STEP expresses geometry and a variety of properties through constraints. STEP is not only the most accepted common representation for data exchange; its representation through constraints can also be exploited to model and query data within a mechanical engineering reuse system. However, most existing approaches developed to store, access, manipulate, transform, and analyze design data only use STEP as an exchange format. In contrast, we propose a different approach where the geometry and the constraints expressed in STEP are the core of the data representation.

We show that constraint database technology [24] can be quite effective for modeling, retrieving, integrating, and transforming mechanical engineering data, and so for overcoming the current limitations of design data reuse approaches. Indeed, the proposed approach combines constraint-based representation (in STEP) with constraint-based querying techniques. The characteristics of the approach benefit design reuse for many reasons, including:

Uniform representation. Constraints homogeneously support the representation of both geometric and non-geometric design data properties.
Expressive power. Constraint query languages express geometrical, topological, and descriptive design data properties and allow the expression of retrieval queries as well as integration and transformation queries.
Retrieval granularity. Constraint query languages support the retrieval and transformation of the components of design data satisfying a given property, thus providing selective retrieval and transformation capabilities.

Constraint database technology can also be useful in improving the efficiency of archival systems for design data. Indeed, several optimization and indexing techniques for constraint databases have already been developed. Moreover, constraint databases have already proven successful in addressing problems inherent to the lack of interoperability between systems in geographical information systems (GIS). Similar solutions can be applied to design archival and retrieval systems, where design data are often stored in different repositories, in various formats, and the only way to access all parts of an artifact in a single view is by integrating the underlying resources.

In this paper, after a short introduction to constraint databases, we present our approach to design reuse. Then, we discuss how techniques developed in the constraint database context can be successfully used to implement archival systems based on constraint technology efficiently.

II. CONSTRAINT DATABASES

Constraint databases generalize relational databases by using constraints both to model and query data [19]. More precisely, in a constraint database, data are represented as quantifier-free conjunctions of constraints (called generalized tuples) on a given decidable logical theory. Different logical theories can be used to model different types of information. Because of these characteristics, constraint databases are well suited to model multidimensional and structured data, like spatial data.
Indeed, the set of points corresponding to the extension of a spatial object can be interpreted as the extension of a generalized tuple. For example, the generalized tuple 1 < x < 5 ∧ 3 < y ≤ 8 represents a rectangle.

Constraint query languages have been obtained by extending traditional relational languages to cope with constraint data, leading to the definition of the constraint calculus and the constraint algebra [19]. Constraint query languages must be bottom-up evaluable and closed, i.e., the output of any query must be a set of generalized tuples on the chosen theory. As data in a constraint database are represented as conjunctions of constraints, queries on constraint databases are no more than constraint problems that can be evaluated with constraint solving algorithms suitable for the chosen logical theory.

For example, consider a relation R(n, x, y), containing generalized tuples representing spatial objects in the 2-dimensional space (x, y), each identified by an identifier n, using the linear polynomial constraint theory. A possible query is "Determine the intersection of all geometric objects with a certain object O". The intersection operation can be easily represented as a conjunction of constraints: those representing the set of objects in R and that representing the query object O. The result, i.e. the objects obtained through the intersection, can then be easily computed by applying typical constraint solving techniques and, as the language is closed, the result is still represented as a set of generalized tuples on the chosen logical theory.

In order to make constraint databases a practical technology, efficient optimization and indexing techniques have been developed, extending those proposed for the relational model. Techniques were developed to address two main issues. First, to rewrite a given query expressed in the constraint algebra into an equivalent but more efficient one [30], [14], [15].
Second, after rewriting, to identify the most efficient access plan for the obtained query expression by using statistics on data distribution [10]. In both cases, the proposed techniques extend the ones proposed for the relational model, since traditional relational assumptions, such as the availability of selection conditions and assumptions concerning data distributions, do not necessarily hold for constraint queries [15]. It is important to point out that, because constraint query languages directly support the specification of spatial operators [14], instead of using a separate operator for each type of transformation as is typically done in spatial databases, query optimization for spatial applications is highly improved. This characteristic motivates the use of constraint databases to support mechanical engineering design, as spatial operations are critical in this context.

Indexing techniques have also been proposed for linear constraint databases (i.e., constraint databases using the linear polynomial constraint theory) [6], [5]. Such techniques have mainly been obtained by adapting techniques developed for spatial databases [6] or shape management [17]. By interpreting generalized tuples as spatial objects, such techniques efficiently support the execution of queries requiring the detection of all the objects intersecting or containing a given query object, as well as similarity-based retrieval for specific types of constraints.

The previous characteristics make constraint databases suitable to model spatio-temporal applications, including multidimensional design, resource allocation, data fusion, sensor control, and shape manipulation in multimedia databases. Issues in applying constraint database technology to this problem have been presented in [4]. Prototypes have experimentally proved the advantages of using constraint technology with respect to other technologies in modeling some of the applications cited above. Among them, DEDALE [19] is a system for spatial applications.

III. DESIGN REUSE

When designing a new artifact, engineers typically try to find in previous valid designs design components that match or nearly match their new design specifications. Past designs have been thoroughly analyzed and have proven to be successful, so their reuse is likely to save significant resources and manpower. A system that assists engineers in their task must provide design storage, access to stored designs with a query language expressing various complex design specifications, access to design components (as opposed to the whole document), integration of design components, and transformation to meet the design requirements. We present our approach by first addressing how constraints may be exploited for design data representation in Section A, and design data querying in Section B.

A. Design Data

Mechanical engineering design data are typically divided between geometric data and annotations. Geometric data include geometry, shape, and topology of the artifact, whereas annotations relate to all non-geometric attributes such as material, tolerance, etc. The first effort to normalize design data produced the Initial Graphic Exchange Specification, which characterizes physical objects, in particular electrical and mechanical artifacts [3]. Geometric data are represented by entities, and their annotations through attributes or relationships. This high level of representation favors annotation-based systems with very limited query expression through the defined attributes.

Nowadays, design data are generally represented with the Standard for the Exchange of Product data model (STEP - ISO 10303). STEP was designed as a standard for the exchange of product data, enabling the interoperability of systems that generate and manipulate such data. STEP specifies product information specific to domain applications with application protocols (AP). For example, AP224 defines mechanical products. STEP provides a generic design representation [22] through a logical representation with EXPRESS [26], and a physical representation with constraints, as illustrated in Figure 1. The design data represented, in a simplified format, in Figure 2 define a volume, three planes and a line, and their relationships: the volume is characterized by the three planes Boundary(V1,S1,S2,S3), and the line characterizes an edge of the volume defined by the three planes Boundary(C1,S1,S2) and Boundary(C2,S3,S2).

Fig. 1. Example of design data in STEP format

Volume V1;
Plane S1, S2, S3;
Boundary (V1, {S1, S2, S3});
Line C1, C2;
Boundary (C1, {S1, S2});
Boundary (C2, {S3, S2});

Fig. 2. Simplified design data

B. Querying Design Data

The design of a system for reuse of mechanical engineering design data should address the following requirements:

Data representation - Geometric and non-geometric data should be represented in a homogeneous format; moreover, the user should be able to access a whole artifact as well as any geometric component of interest within an artifact; finally, the data format should be compatible with existing standards to favor operability between systems.
Query language - Queries should be asked against both geometric and non-geometric attributes, and express geometric, topological, or semantic properties; moreover, queries should be expressed as combinations of basic operators.
Similarity search - The system should allow the retrieval of similar designs, or designs validating similar requirements, for design reuse.
Adaptation - The system should allow transformations and merges of similar retrieved designs to meet the design requirements.
Query execution - Query execution should use techniques such as indexing, query planning, and optimization to process queries efficiently; the system should allow efficient storage for large datasets.

In this paper we present how constraint databases can be used to address some of the above requirements. Unlike other approaches to mechanical design reuse, constraint databases provide a data representation and query language that offer the needed flexibility for design reuse. In addition, using constraint databases offers a variety of techniques for efficient query processing. As was proven in the past, transformation of designs may also be expressed with constraints [8], [29], [28]. Therefore, a constraint database may also be an appropriate framework for the adaptation phase of design reuse.

The constraint algebra expresses spatial transformations of geometric objects expressed with linear constraints [16]. Its ability to handle spatial operations makes the constraint algebra adequate to express queries of interest to engineers. The constraint algebra is defined with the same basic operators as the relational algebra: selection (σ), projection (π), Cartesian product (×), join (⋈), union (∪), and intersection (∩), but with rather different semantics. The semantics of the constraint algebraic operators is expressed by two successive steps: a symbolic manipulation of the constraint set, followed by constraint solving.

B.1 Symbolic evaluation

The semantics of the operators is defined in a symbolic way as a manipulation of the constraints representing the mechanical artifact. The algebraic operations defined above can be expressed with four symbolic operations on sets of constraints. Let R1 and R2 be two relations (mechanical artifacts) respectively defined by the sets of constraints e1 and e2. The four symbolic operators are defined as follows.

1. R1 × R2 = {t1 ∧ t2 | t1 ∈ e1, t2 ∈ e2}.
2. The projection of R1 is computed using the quantifier elimination algorithm, based on the Fourier-Motzkin elimination method.
3. R1 ∪ R2 = e1 ∪ e2.
4. R1 − R2 = {t1 ∧ t2 | t1 ∈ e1, t2 ∈ e2^c}, where e2^c is the set of tuples or disjuncts of a DNF formula corresponding to ¬e2.
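To make the symbolic operators more concrete, here is a minimal sketch of generalized tuples over linear constraints. It is an illustration under simplifying assumptions, not the paper's implementation: every constraint has the non-strict form a·x + b·y ≤ c (the paper's rectangle example also uses strict inequalities), the names le, box, conjoin, union, eliminate, and satisfiable are invented here, and the difference operator (item 4) is left out. The conjunction in conjoin serves as the symbolic step for intersection over shared variables, and eliminate is a plain Fourier-Motzkin step.

```python
from fractions import Fraction
from itertools import product

# A constraint is (coeffs, bound), meaning sum(coeffs[v] * v) <= bound.
# A generalized tuple is a list of constraints (a conjunction).
# A relation is a list of generalized tuples (a disjunction, i.e. DNF).

def le(coeffs, bound):
    """Build one constraint, dropping zero coefficients."""
    return ({v: Fraction(c) for v, c in coeffs.items() if c != 0},
            Fraction(bound))

def box(x_lo, x_hi, y_lo, y_hi):
    """Generalized tuple for the closed rectangle [x_lo, x_hi] x [y_lo, y_hi]."""
    return [le({"x": -1}, -x_lo), le({"x": 1}, x_hi),
            le({"y": -1}, -y_lo), le({"y": 1}, y_hi)]

def conjoin(r1, r2):
    """Symbolic operator 1: {t1 AND t2 | t1 in e1, t2 in e2}."""
    return [t1 + t2 for t1, t2 in product(r1, r2)]

def union(r1, r2):
    """Symbolic operator 3: e1 UNION e2."""
    return r1 + r2

def eliminate(tup, var):
    """Symbolic operator 2 (one step): Fourier-Motzkin elimination of var."""
    lower, upper, rest = [], [], []
    for coeffs, bound in tup:
        c = coeffs.get(var, 0)
        (upper if c > 0 else lower if c < 0 else rest).append((coeffs, bound))
    for (cu, bu), (cl, bl) in product(upper, lower):
        a, b = cu[var], -cl[var]  # both strictly positive
        # b * (upper constraint) + a * (lower constraint) cancels var.
        combined = {}
        for v in (set(cu) | set(cl)) - {var}:
            c = b * cu.get(v, 0) + a * cl.get(v, 0)
            if c != 0:
                combined[v] = c
        rest.append((combined, b * bu + a * bl))
    return rest

def satisfiable(tup):
    """Constraint solving: eliminate all variables, then check 0 <= bound."""
    for v in {v for coeffs, _ in tup for v in coeffs}:
        tup = eliminate(tup, v)
    return all(bound >= 0 for _, bound in tup)

# Two stored objects and a query object O (hypothetical coordinates).
A = [box(1, 5, 3, 8)]
B = [box(6, 9, 5, 7)]
O = [box(4, 7, 1, 4)]

candidates = conjoin(union(A, B), O)                # symbolic step
hits = [t for t in candidates if satisfiable(t)]    # solving step
shadow = eliminate(A[0], "y")                       # projection of A onto x
print(len(candidates), len(hits))
```

The example mirrors the two-step semantics described above: the intersection query first builds conjunctions symbolically, and only the solving step discovers that B does not meet O. Exact rational arithmetic (Fraction) is used so that the eliminated constraints are not affected by rounding.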
Journal on Communications, Vol. 42, No. 12, December 2021

QML: a hybrid spatial index structure

CUI Dong, WEN Qiaoyan, ZHANG Hua, WANG Huawei
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China

Abstract: To enrich the functionality of existing learned multidimensional indexes and improve their efficiency, a dynamic data segmentation algorithm, DDSA, is proposed that preserves the characteristics of the data distribution. A hybrid spatial index (QML) is constructed by combining it with the quadtree and the Z-order curve, and a range query algorithm and a KNN query algorithm are designed on this basis. Because the index preserves the characteristics of the data distribution, it supports fast queries and updates flexibly. Experimental results show that QML optimizes query efficiency while providing this richer functionality, and that the time complexity of a data update is O(1). Compared with the R*-tree, QML reduces storage consumption by about 33% and improves update efficiency by 40% to 80%. Its query efficiency is close to that of the best tree index.

Keywords: database, spatial index, learned index
CLC number: TP392
Document code: A
DOI: 10.11959/j.issn.1000-436x.2021229

1 Introduction

Internet of Things devices generate large volumes of geospatial data. To access and process such data efficiently, database administrators usually adopt tree-based index structures to improve the performance of data analysis and transactional workloads [1].
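The abstract names the Z-order curve as one building block of QML. As a rough illustration of that building block only, the sketch below interleaves the bits of two integer coordinates into a single Z-order (Morton) key. It is not the DDSA algorithm or the QML index, whose details are not given in this excerpt; the function names and the 16-bit coordinate width are assumptions.

```python
def interleave(x, y, bits=16):
    """Z-order (Morton) key: bits of x go to even positions, bits of y to odd."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code

def deinterleave(code, bits=16):
    """Inverse of interleave: recover (x, y) from a Z-order key."""
    x = y = 0
    for i in range(bits):
        x |= ((code >> (2 * i)) & 1) << i
        y |= ((code >> (2 * i + 1)) & 1) << i
    return x, y

# Sorting 2-D points by their key lays them out along the Z-shaped curve.
points = [(1, 1), (0, 1), (1, 0), (0, 0)]
ordered = sorted(points, key=lambda p: interleave(*p))
print(ordered)
```

With this bit layout, the two highest bits of a key identify the quadrant of the space a point falls in, the next two bits the sub-quadrant, and so on, which is why Z-order keys pair naturally with quadtree-style partitioning.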
Introduction to PubMed
Source: /sjs/pubmedjj.htm

A brief introduction to the PubMed retrieval system
PubMed retrieval system /PubMed /

PubMed is an online retrieval system developed by the National Center for Biotechnology Information (NCBI) for searching the MEDLINE and PreMEDLINE databases. MEDLINE is the most important bibliographic and abstract database of the U.S. National Library of Medicine. It covers medicine, nursing, dentistry, veterinary medicine, health care, and the basic medical sciences, and indexes more than 4,000 biomedical journals from over 70 countries and regions, with more than 10 million bibliographic records dating back to 1966.

PreMEDLINE is a temporary database of medical literature. It receives new records every day and gives users basic bibliographic entries and abstracts; once the records have been indexed and processed, they are moved to MEDLINE on a weekly basis.

PubMed also accepts citation data supplied electronically by publishers. Such records carry the notes [MEDLINE record in process] and [Record as supplied by publisher]. New records are sent to PreMEDLINE every day, but some of them fall outside the scope of the MEDLARS databases and will never be replaced by a PreMEDLINE or MEDLINE record, for example geography articles published in general science journals such as Science or Nature.

Main features of the PubMed system
Searching PubMed
Using the Details button
Using the URL button to save a search strategy
Introduction to the Feature Bar
Displaying, saving, and printing search results

--------------------------------------------------------------------------------

Main features of the PubMed system
1.
Automatic Term Mapping
When a search term is typed into the query box on the PubMed home page, the system looks the term up in the following four tables or indexes, in order, and then runs the search.

(1) MeSH Translation Table. This table contains MeSH terms, see-references, subheadings, and so on. If the system finds an entry that matches the search term, it automatically converts the term into the corresponding MeSH term and Text Word (words from titles and abstracts). For example, if "Vitamin H" is typed, the system converts it to "Biotin [MeSH Terms] OR Vitamin h [Text Word]" and then searches.

(2) Journal Translation Table. This table contains the full journal title, the MEDLINE abbreviation, and the ISSN. It converts a typed journal title into the form "MEDLINE abbreviation [Journal Name]" before searching. For example, if "New England Journal of Medicine" is typed in the query box, PubMed converts it to "N Engl J Med [Journal Name]".

(3) Phrase List. The phrases in this list come from MeSH, from the Unified Medical Language System (UMLS), which contains synonyms and variant English spellings, and from Supplementary Concept (Substance) Names. If PubMed finds no match for the search term in the MeSH and journal translation tables, it consults the phrase list.

(4) Author Index. If the term is not found in the tables above, or if it is a phrase followed by one or two letters, PubMed checks the author index. If there is still no match, PubMed breaks the phrase apart and repeats the automatic term mapping process until a match is found.
If there is still no match, the individual words are combined with AND and searched in all fields. For example, for "single cell" the system splits the phrase into the two words "single" and "cell", and the resulting search expression is "single AND cell". To see how a search term was converted, click the "Details" button.

2. Truncation search
PubMed allows the asterisk (*) to be used as a wildcard for truncation. If "bacter*" is typed, the system finds all words that begin with this root, such as bacteria, bacterium, and bacteriophage, and searches for each of them. If there are fewer than 150 variants, PubMed searches for every one of them; if there are more than 150 (as with Staph*), PubMed displays the following warning: "Wildcard search for 'term*' used only the first 150 variations. Lengthen the root word to search for all endings." Truncation applies only to single words and has no effect on phrases. For example, "infection*" retrieves "infections" but not "infection control".
When truncation is used, PubMed automatically turns off automatic term mapping.

3. Forced phrase search
As described above, when a phrase is typed into the query box on the PubMed home page and "Go" is clicked, the system uses automatic term mapping to find matching terms and searches on them. When the phrase has no match, PubMed breaks it apart and repeats the mapping process; if there is still no match, it splits the phrase into single words, combines them with AND, and searches all fields. The results of such a search clearly may not be what the user wants. PubMed therefore allows double quotation marks (" ") to be used to force a phrase search.
For example, if "single cell" is typed in double quotation marks into the query box on the PubMed page and "Go" is clicked, the system searches all fields of the database for the phrase as a single unit.
Using double quotation marks automatically turns off automatic term mapping.

4. Link functions
(1) Links to related articles. Every record in PubMed has a link to related articles. In the search results display, a "Related Articles" hyperlink appears to the right of each record. Clicking it displays the related articles, ranked from most to least relevant. The History button can be used to narrow the related articles further: click "History", enter the relevant search number into the query box, add further search terms, and click "Go". For example, typing "#7 AND english[la]" in the query box limits the search to English-language literature.

(2) Links to NCBI (National Center for Biotechnology Information) databases. The PubMed home page links to five NCBI databases: Protein (amino acid sequences), Nucleotide (DNA sequences), PopSet (population, phylogenetic, or mutation sequence sets), Structure (molecular structure models), and Genome (genome sequences).

(3) Links to external resources. PubMed provides links from search results to full-text journals, biological data, sequencing centers, and more. This is done by linking to the sites that host those resources. In the search results display, click "LinkOut" to go to the relevant site.

(4) Links to related books.
Click "Books" to open the abstract page linked to related books. Some of the phrases on this abstract page are hyperlinks; clicking one leads to a list of pages in the relevant books where the phrase appears.

In addition, PubMed lets users browse the list of indexed journals. Click "Journal Browser" on the PubMed home page to look up a journal's abbreviated title and ISSN.

Searching PubMed
Search scope of the PubMed system: the MEDLINE and PreMEDLINE databases.

1. Word (subject) search
Type English words or phrases (in upper or lower case) into the query box on the PubMed home page, then press Enter or click "Go". PubMed applies automatic term mapping and displays the results directly below the query box. For example, type "vitamin C common cold" and press Enter or click "Go"; PubMed runs the search and displays the results.

If the results are not satisfactory, add or delete terms in the query box, modify the search in the Details view (see "Using the Details button"), or use "Limits" to set restrictions and search again. Users can also use the wildcard (*) for a truncation search or double quotation marks for a forced phrase search as needed.

2. Author search
Type the author's surname and initials into the query box in the format surname, space, initials, for example Smith JA, then press Enter or click "Go". The system automatically searches the author field and displays the results.

If only the author's surname is entered, the system first looks it up in the MeSH Translation Table.
If a match is found, the system searches the subject-heading field and the title and abstract fields; otherwise it searches all fields.

If the author's name is enclosed in double quotation marks and qualified with the author field tag [au], for example "Smith JA" [au], the system searches only the author field.

3. Journal search
Type the full journal title, the MEDLINE abbreviation, or the ISSN into the query box, for example Molecular Biology of the Cell, Mol Biol Cell, or 1059-1524, then press Enter or click "Go". The system searches the journal field and displays the results. If the journal title is also a MeSH term, for example Gene Therapy, Science, or Cell, PubMed searches for it as a MeSH term. In that case, qualify the title with the journal field tag [ta], for example gene therapy [ta]. Single-word journal titles also need to be qualified with [ta], for example Scanning [ta]; otherwise the system searches all fields. Searching by full title or MEDLINE abbreviation retrieves all records for that journal in the database. Searching by ISSN does not guarantee that earlier records in the database are retrieved.

If the journal title contains parentheses or brackets, omit them when typing. For example, J Hand Surg [Am] should be typed as J Hand Surg Am.

4. Boolean search
PubMed supports Boolean searching; the Boolean operators (AND, OR, NOT) are typed directly into the query box, for example: vitamin C OR zinc.

Boolean operators are processed from left to right, but parentheses can be used to change the order of evaluation. For example, in common cold AND (vitamin C OR zinc), the expression in parentheses is evaluated first.

In a Boolean search, a field tag can be added after a search term to restrict the field that is searched (field tags are enclosed in square brackets and placed after the term). The format of the search expression is: search term [field tag] Boolean operator search term [field tag].
For example: dna [mh] AND Crick [au] AND 1993 [dp]; or: asthma/therapy [mh] AND review [pt] AND child, preschool [mh]. The field tags are listed in Table 1.
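The expression format described above (search term, field tag, Boolean operator) can be assembled programmatically. The sketch below only builds query strings; it does not contact PubMed, and the helper names `tagged` and `combine` are hypothetical. The field tags shown ([mh], [au], [dp]) are the ones used in the examples in this guide.

```python
# Minimal sketch: compose PubMed-style Boolean query strings.
# Assumption: helper names are illustrative; no request is sent anywhere.

BOOLEAN_OPERATORS = {"AND", "OR", "NOT"}


def tagged(term, field=None):
    """Append a field tag in square brackets, e.g. 'Crick [au]'."""
    return f"{term} [{field}]" if field else term


def combine(operator, *parts):
    """Join parts with one Boolean operator and wrap them in parentheses,
    so that nested groups are evaluated before the enclosing expression."""
    if operator not in BOOLEAN_OPERATORS:
        raise ValueError(f"unsupported operator: {operator}")
    return "(" + f" {operator} ".join(parts) + ")"


fielded = combine("AND", tagged("dna", "mh"), tagged("Crick", "au"), tagged("1993", "dp"))
nested = combine("AND", "common cold", combine("OR", "vitamin C", "zinc"))
```

Wrapping every group in parentheses is a deliberate choice here: it makes the order of evaluation explicit instead of relying on left-to-right processing.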
Unit 3 Transition to Modern Information Science
Chapter One
Part 1 Notes to Text
Transition to Modern Information Science
1) With the 1950's came increasing awareness of the potential of automatic devices for literature searching and information storage and retrieval.
Meaning: With the arrival of the 1950s, people grew increasingly aware of the potential of automatic devices for literature searching and for information storage and retrieval.
Note: This is a fully inverted sentence. The subject is "awareness"; the prepositional phrase "With the 1950's" is an adverbial that modifies the predicate verb "came".
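2) As these concepts grew in magnitude and potential, so did the variety of information science interests.
Meaning: As these concepts grew in scale and potential, so too did the variety of interests in information science research.
Note: The prepositional phrase "in magnitude and potential" is an adverbial of manner, roughly "in large measure and imperceptibly"; the main clause that follows is inverted because "so" is placed at its beginning. "So" refers back to "grew in magnitude and potential" in the preceding clause.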
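3) Grateful Med at the National Library of Medicine
Meaning: a database of the U.S. National Library of Medicine.
Note: Grateful Med is a link to another web-based search system of the NLM (National Library of Medicine).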
NB-Tree: An Indexing Structure for Content-Based Retrieval in Large Databases

Manuel J. Fonseca∗, Joaquim A. Jorge
Department of Information Systems and Computer Science
INESC-ID/IST/Technical University of Lisbon
R. Alves Redol, 9, 1000-029 Lisboa, Portugal
mjf@inesc-id.pt, jorgej@acm.org

Abstract
Many indexing approaches for high-dimensional data points have evolved into very complex and hard-to-code algorithms. Sometimes this complexity is not matched by an increase in performance. Motivated by these ideas, we take a step back and look at simpler approaches to indexing multimedia data. In this paper we propose a simple (not simplistic) yet efficient indexing structure for high-dimensional data points of variable dimension, using dimension reduction. Our approach maps multidimensional points to a 1D line by computing their Euclidean Norm. In a second step we sort these using a B+-Tree on which we perform all subsequent operations. We exploit B+-Tree efficient indexed sequential search to develop simple, yet performant methods to implement point, range and nearest-neighbor queries. To evaluate our technique we conducted a set of experiments, using both synthetic and real data. We analyze creation, insertion and query times as a function of data set size and dimension. Results so far show that our simple scheme outperforms current approaches, such as the Pyramid Technique, the A-Tree and the SR-Tree, for many data distributions. Moreover, our approach seems to scale better both with growing dimensionality and data set size, while exhibiting low insertion and search times.
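The mapping described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a sorted Python list with `bisect` stands in for the B+-Tree, missing coordinates of shorter vectors are treated as zeros when distances are computed, and the search loop is one plausible way to use the sorted norms. It relies on the inequality | ||p|| - ||q|| | <= ||p - q||, so a point whose norm is far from the query's norm cannot be a near neighbor. The names `NormIndex`, `insert`, and `knn` are hypothetical.

```python
import bisect
import math


def norm(p):
    """Euclidean norm of a point of any dimension."""
    return math.sqrt(sum(x * x for x in p))


def dist(p, q):
    """Euclidean distance; the shorter vector is padded with zeros (assumption)."""
    n = max(len(p), len(q))
    p = tuple(p) + (0.0,) * (n - len(p))
    q = tuple(q) + (0.0,) * (n - len(q))
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


class NormIndex:
    """Points kept sorted by norm; a sorted list stands in for the B+-Tree."""

    def __init__(self):
        self._entries = []  # sorted list of (norm, point)

    def insert(self, point):
        point = tuple(float(x) for x in point)
        bisect.insort(self._entries, (norm(point), point))

    def knn(self, query, k):
        """Exact k nearest neighbors, scanning outward from the query norm."""
        entries = self._entries
        qn = norm(query)
        right = bisect.bisect_left(entries, (qn,))
        left = right - 1
        best = []  # sorted list of (distance, point), at most k long
        while left >= 0 or right < len(entries):
            gap_left = qn - entries[left][0] if left >= 0 else math.inf
            gap_right = entries[right][0] - qn if right < len(entries) else math.inf
            if gap_left <= gap_right:
                gap, candidate = gap_left, entries[left][1]
                left -= 1
            else:
                gap, candidate = gap_right, entries[right][1]
                right += 1
            # The norm gap is a lower bound on the true distance, so once it
            # exceeds the current k-th best distance the scan can stop.
            if len(best) == k and gap > best[-1][0]:
                break
            bisect.insort(best, (dist(query, candidate), candidate))
            del best[k:]
        return [point for _, point in best]


index = NormIndex()
for p in [(1, 0), (0, 2), (3, 4), (1, 1, 1)]:
    index.insert(p)
neighbors = index.knn((0.9, 0.1), 2)
```

Because only norms are stored as keys, points of different dimension can share one index without rebuilding it, which is the property the abstract emphasizes.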
1 Introduction

In recent years, increasing numbers of computer applications in CAD, geography, biology, medical imaging, etc., access data stored on large databases. A common feature to such databases is that objects are described by vectors of numeric values, known as feature vectors, which map individual instances to points in a high-dimensional vector space. An important functionality that should be present in such applications is similarity search, i.e. finding a set of objects similar to a given query. The similarity between complex objects is not measured on their contents directly, since this tends to be expensive. Rather we use their feature vectors, assuming that features are "well-behaved", that is, similar objects have feature vectors that are near in hyperspace and vice-versa. This way, searching objects by similarity in a database becomes a nearest neighbor search in a high-dimensional vector space, followed by similarity tests applied to the ten resulting points.

∗ Corresponding author.

To support processing large amounts of high-dimensional data, a variety of indexing approaches have been proposed in the past few years. Some of them are structures for low-dimensional data that were adapted to high-dimensional data spaces. However, such methods, while providing good results on low-dimensional data, do not scale up well to high-dimensional spaces. Recent studies [17] show that many indexing techniques become less efficient than sequential search for dimensions higher than ten. Other indexing mechanisms are incremental evolutions from existing approaches, where sometimes the increased complexity does not yield comparable enhancements in performance. Other indexing techniques based on dimension reduction return only approximate results. Finally, structures combining several of the previous approaches have emerged. Often, the corresponding algorithms are very complex and unwieldy.

Our interest in this problem derives from the development of effective and efficient systems for content-based retrieval of technical drawings. In the approach we presented in [8], technical drawings are described using topology graphs that include shape and spatial information. In a second step, graphs are converted into feature vectors by computing and combining eigenvalues from the adjacency matrix of the graph. The dimensionality of the resulting feature vectors depends on the complexity of the technical drawing. More complex drawings will produce descriptors of higher dimension, while simple drawings will produce lower dimension feature vectors.

All indexing structures studied so far only support data sets of fixed dimension. However, in some application domains, such as the one described before, the dimension of feature vectors can vary from object to object and the maximum dimension cannot be predicted in advance. In such scenarios, current indexing structures will define a (maximum) fixed dimension for the data space, and feature vectors of smaller dimensions will be padded with zeros. However, if we are to insert new feature vectors of larger than maximum dimension, the indexing structure must be rebuilt to accommodate the new data.

The increasingly complex data structures and specialized approaches to high-dimensional indexing make it difficult to ascertain whether there might be a reasonably fast and general approach to address variable dimension data. We believe there might be some merit in taking a step back and looking at simpler approaches to indexing such data.

Unlike other existing approaches, we use a very simple dimension reduction function to map high-dimensional points into a one-dimensional value. Our approach is motivated by three observations. First, real data sets tend to have a lot of clusters distributed along the data space. Thus, the Euclidean norms of points tend to be "evenly" distributed (see section 4). Second, the set of resulting nearest neighbors will lie inside a narrow range of norm values, thus reducing the number of data points to examine during search. Third, some application domains need to manipulate large amounts of variable dimension data points, which calls for an approach that can handle variable data in a natural manner.

Based on these requirements, we developed the NB-Tree¹, an indexing technique based on a simple, yet efficient algorithm to search points in high-dimensional spaces with variable dimension, using dimension reduction. Multidimensional points are mapped to a 1D line by computing their Euclidean Norm. In a second step we sort these mapped points using a B+-Tree on which we perform all subsequent operations. Thus, the NB-Tree can be implemented on existing DBMSs without additional complexity. Our approach supports all typical kinds of queries, such as point, range and nearest neighbor queries (KNN). Also, our method provides fast and accurate results, in contrast to other indexing techniques, which return approximate results. Moreover, for some methods [20], when accuracy increases, their performance decreases.

¹ Norm + B+-Tree = NB-Tree

We implemented the NB-Tree and evaluated its performance against more complex approaches, such as the Pyramid Technique, the A-Tree and the SR-Tree. We conducted a set of experiments, using synthetic and real data, to analyze creation, insertion and query times as a function of data set size and dimension. Results so far show better performance for our method, for many data distributions. Moreover, our approach seems to scale better both with growing dimensionality and data set size, while exhibiting low insertion and search times.

The rest of the paper is organized as follows. In the next section we give an overview of the related work in high-dimensional indexing structures. Section 3 explains the basic idea of the NB-