XQuery Implementation in a Relational Database System
Shankar Pal, Istvan Cseri, Oliver Seeliger, Michael Rys, Gideon Schaller, Wei Yu, Dragan Tomic, Adrian Baras, Brandon Berg, Denis Churin, Eugene Kogan
Microsoft Corporation
One Microsoft Way, Redmond, Washington, USA
{shankarp, istvanc, oliverse, mrys, gideons, weiyu, dragant, adrianb, branber, denistc, ekogan}@microsoft.com

Abstract

Many enterprise applications prefer to store XML data as a rich data type, i.e. a sequence of bytes, in a relational database system to avoid the complexity of decomposing the data into a large number of tables and the cost of reassembling the XML data. The upcoming release of Microsoft's SQL Server supports XQuery as the query language over such XML data using its relational infrastructure.

XQuery is an emerging W3C recommendation for querying XML data. It provides a set of language constructs (FLWOR), the ability to dynamically shape the query result, and a large set of functions and operators. It includes the emerging W3C recommendation XPath 2.0 for path-based navigational access. XQuery's type system is compatible with that of XML Schema and allows static type checking.

This paper describes the experiences and the challenges in implementing XQuery in Microsoft's SQL Server 2005. XQuery language constructs are compiled into an enhanced set of relational operators while preserving the semantics of XQuery. The query tree is optimized using relational optimization techniques, such as cost-based decisions, and rewrite rules based on XML schemas. Novel techniques are used for efficiently managing document order and XML hierarchy.

1. Introduction

Enterprise applications use XML [3] for modelling semi-structured and markup data in scenarios such as document management and object property management [13]. Powerful applications can be developed to retrieve documents based on document content, to query for partial contents such as sections whose title contains the word "background", to aggregate fragments from different documents, and to find all the phone numbers of a person.

Storing XML data as a sequence of bytes representing a rich data type has several advantages. XML schemas for real-life applications are complex, so decomposing XML data conforming to those schemas into the relational data model results in a large number of tables. This makes the decomposition logic complex, the reassembly cost high, and the queries very complicated. Furthermore, changes to the XML schema require a significant amount of maintenance of the database schema and the application. XML as a rich data type also permits structural characteristics of the XML data, such as document order and recursive structures, to be preserved more faithfully.

The upcoming release of Microsoft's SQL Server 2005 [10] allows storage of XML data in a new, rich data type called XML [1][8][13]. This data type stores both rooted XML trees and XML fragments in a binary representation ("binary XML"). The query language on the XML data type is a subset of XQuery [15][16][22], an emerging W3C recommendation (currently in Last Call) that includes the navigational language XPath 2.0 [20]. It is supported using the relational query processing framework with some enhancements.
SQL Server 2005 also supports a data modification language on the XML data type for incremental updates, which is not discussed further in this paper [1][13].

This paper discusses the XQuery processing architecture in SQL Server 2005 and how XQuery expressions are compiled into query trees containing relational operators and a small number of new operators introduced for the purpose of XQuery processing. An XQuery expression is parsed and compiled into an internal structure called the XML algebra tree, on which rule-based optimizations are applied. This is followed by a transformation of the XML algebra tree into the relational operator tree. This paper describes some of the interesting aspects of the implementation instead of being a comprehensive manual on the subject.

XML as a richly structured data type introduces new challenges for query processing, data modification and indexing. Query processing must retain document order, perform structural navigation, provide sequence operations, and support dynamically constructed XML nodes. These requirements are not supported by a relational query processor, and appropriate extensions to it are necessary.

At runtime, the XML data ("XML blob") must be available in a parsed state (the so-called XQuery Data Model [23]) to evaluate an XQuery expression. The data may be parsed multiple times to evaluate several XQuery expressions on the same data, or to evaluate complex XQuery expressions using a streaming parser, such as the XmlReader in the .NET framework [9], to avoid the overhead of keeping the data in memory (e.g. DOM). Runtime parsing is costly and often fails to meet the performance requirements of enterprise applications. For better query performance, SQL Server 2005 provides a mechanism for indexing the XML data [12] based on its Data Model content [2]. An XML index retains structural fidelity of the data, such as document order and hierarchical relationships among the XML nodes, and speeds up different classes of queries on the XML data.

XQuery compilation produces a query tree that uses relational operators, such as SELECT and JOIN, on the primary XML index [12], if one exists, on an XML column. For non-indexed XML columns, the query plan contains operators to parse each XML blob, locate nodes matching simple path expressions, and generate rows resembling XML index entries that represent the subtrees rooted at those nodes. From this point onward, the processing for both the XML indexed and the XML blob cases is largely the same – multiple rowsets are manipulated using relational operators to yield the query result. The queries that return XML results aggregate the rows representing the resulting XML sequence into the binary XML form as the final processing step.

The XQuery compiler performs static type inference by annotating operator nodes in the query tree with type information. Type incompatibility between the inferred type and the expected type raises static errors. This fits well with the static type guarantees in the SQL language and the relational query processor's ability to optimize query plans using statically known constraints. As a result, many runtime checks are avoided.
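As a hypothetical illustration of this static checking (it uses the value() method introduced in Section 2.3 and the DOCS table defined there; the behavior described is an assumption based on the singleton requirement of value(), not an example from this paper): on an untyped XML column the compiler infers that a path such as /BOOK/@id may return more than one item, so the first statement below would be rejected with a static error, while the ordinal in the second makes the singleton cardinality statically provable.

-- Assumed behavior: rejected at compile time because the inferred cardinality is not a singleton
-- SELECT PK, XDOC.value('/BOOK/@id', 'INT') FROM DOCS

-- Accepted: the [1] ordinal guarantees at most one item
SELECT PK, XDOC.value('(/BOOK/@id)[1]', 'INT') FROM DOCS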
The compiled query plan is optimized using well-known relational optimization techniques such as costing functions and histograms of data distributions. Query compilation produces a single query plan for both relational and XML data accesses, and the overall query tree is optimized as a whole. SQL Server 2005 also introduces optimizations for document order (by eliminating sort operations on ordered sets) and document hierarchy, and query tree rewrites using XML schema information.

Relational query optimization, however, impacts XQuery semantics and introduces new challenges. The query optimizer shuffles operators around in the query tree to produce a faster execution plan, which may evaluate different parts of the query plan in any order considered to be correct from the relational viewpoint. Consequently, path-expression-based navigational accesses are not guaranteed to be executed top-down and may be evaluated bottom-up. This may yield dynamic errors, such as type cast errors, when none would occur with top-down evaluation. For this reason, SQL Server 2005 currently converts dynamic errors to empty sequences. In most contexts this yields correct results, but not always (e.g., in the presence of negation).

A significant number of XQuery functions and operators are supported in the system. Wherever possible, these functions and operators are compiled into the analogous SQL functions and operators for efficient execution. In all other cases, additional code in the server executes the XQuery function or operator while preserving XQuery semantics.

The rest of the paper is organized as follows. Section 2 provides background material on the native XML support in SQL Server 2005. Section 3 introduces the query processing architecture and provides an overview of the XML algebra operators used in the server. Section 4 discusses the transformation of XML algebra trees for XPath and XQuery expressions into relational operator trees. Section 5 deals with the type inference mechanism employed by the XQuery compiler, and Section 6 discusses optimizations on the query trees yielding the execution plan for the queries. Related work is discussed in Section 7, while concluding remarks appear in Section 8.

2. XML Support in SQL Server 2005

This section provides a look into some of the XML features of SQL Server 2005 necessary for the discussions in this paper. Detailed information can be found in the product's documentation [10] as well as MSDN whitepapers [13][14].

2.1 XML Data Type

Microsoft's SQL Server 2005 [10] introduces native storage for XML data as a new, rich data type called XML. A table may contain one or more columns of type XML, wherein both rooted XML trees and XML fragments can be stored. Variables and parameters of type XML are also allowed.
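For illustration, a minimal sketch of an XML variable (the sample document is made up for this example):

DECLARE @xdoc XML
SET @xdoc = '<BOOK id="123"><SECTION num="1"><TITLE>Background</TITLE></SECTION></BOOK>'
-- The string literal is parsed and stored in the internal binary XML form on assignment.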
XML parsing occurs either implicitly or explicitly during assignments of either string or binary SQL values to XML columns, variables and parameters. XML values are stored in an internal format as large binary objects ("XML blob") in order to support the XML data model characteristics more faithfully, such as document order and recursive structures.

The following statement creates a table DOCS with an integer, primary key column PK and an XML column XDOC:

CREATE TABLE DOCS (PK INT PRIMARY KEY, XDOC XML)

2.2 XML Schema Support

SQL Server 2005 provides XML schema collections as a mechanism for managing W3C XML schema documents [21] as metadata. An XML data type can be associated with an XML schema collection to have XML schema constraints enforced on XML instances. Such XML data types are called "typed XML"; an XML data type that is not bound to an XML schema is referred to as "untyped XML".

Both typed and untyped XML are supported within a single framework, the XML data model is preserved, and query processing enforces XQuery semantics. The underlying relational infrastructure is used extensively for this purpose.

2.3 Querying XML Data

XML instances can be retrieved using the SQL SELECT statement. Four built-in methods on the XML data type, namely query(), value(), exist() and nodes(), are available for fine-grained querying. A fifth built-in method, modify(), allows fine-grained modification of XML instances but is not discussed further in this paper.

The query methods on the XML data type accept the XQuery language [15][16][22], which is an emerging W3C recommendation (currently in Last Call) and includes the navigational language XPath 2.0 [20]. Together with a large set of functions, XQuery provides rich support for manipulating XML data. The supported features of the XQuery language are shown below:

• XQuery clauses "for", "where", "return" and "order by".
• XPath axes child, descendant, parent, attribute, self and descendant-or-self.
• Functions – numeric, string, Boolean, nodes, context, sequences, aggregate, constructor, data accessor, and SQL Server extension functions to access SQL variable and column data within XQuery.
• Numeric operators (+, -, *, div, mod).
• Value comparison operators (eq, ne, lt, gt, le, ge).
• General comparison operators (=, !=, <, >, <=, >=).

The following is an example of a query in which section titles are retrieved from books and wrapped in new <topic> elements:

SELECT PK, XDOC.query('for $s in /BOOK/SECTION
                       return <topic>{data($s/TITLE)}</topic>')
FROM DOCS

The query execution is tuple-oriented – the SELECT list is evaluated on each row of the DOCS table, the query() method is processed on the XDOC column in each row, and the result is a two-column rowset where the column types are integer (for PK) and untyped XML (for the XML result). The query methods are evaluated on single XML instances, so XQuery evaluation over multiple XML documents is currently not supported by the syntax but is allowed by the architecture. Scalar value-based joins over XML instances are possible.
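As a hedged sketch of the exist() and value() methods and of a scalar value-based join (the BOOKINFO table and its BookId and Price columns are hypothetical; the other names follow the examples above):

SELECT PK
FROM DOCS
WHERE XDOC.exist('/BOOK/SECTION[TITLE = "Background"]') = 1

SELECT D.PK, B.Price
FROM DOCS D, BOOKINFO B
WHERE D.XDOC.value('(/BOOK/@id)[1]', 'VARCHAR(10)') = B.BookId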
2.4 Indexing XML Data

Query execution processes each XML instance at runtime; this becomes expensive whenever the XML blob is large in size, the query is evaluated on a large number of rows in a table, or a single SQL query executes multiple XQuery expressions requiring the XML blob to be parsed multiple times. Consequently, a mechanism for indexing XML columns is supported in SQL Server 2005 to speed up queries.

A primary XML index [12] on an XML column creates a B+tree index on the data model content of the XML nodes, and adds a column Path_ID for the reversed, encoded path from each XML node to the root of the XML tree. The structural properties of the XML instance, such as relative order of nodes and document hierarchy, are captured in the OrdPath column for each node [11]. The primary XML index is clustered on the OrdPath value of each XML instance in the XML column. The other noteworthy columns are the name, type and the value of a node.

XML indexes provide efficient evaluation of queries on XML data, and reassembly of the XML result from the B+tree. These use the relational infrastructure while preserving document order and document structure. OrdPath encodes the parent-child relationship of XML nodes by extending the parent's OrdPath with a labelling component for the child. This allows efficient determination of parent-child and ancestor-descendant relationships. Furthermore, the subtree of any XML node N can be retrieved from the primary XML index using a range scan over the OrdPath values of N and the descendant limit of N. The latter value can be determined from N's OrdPath alone, which makes OrdPath a very simple yet efficient node labelling scheme.

Secondary XML indexes can be created on an XML column to speed up different classes of commonly occurring queries: PATH index for path-based queries, PROPERTY index for property bag scenarios, and VALUE index for value-based queries are currently provided.

Statistics are created on the key columns of the primary and secondary XML indexes. These are used for cost-based selection of the secondary XML indexes. Choice of the primary XML index is currently a static decision.
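A minimal sketch of how these indexes are declared on the DOCS table from Section 2.1 (the index names are arbitrary):

CREATE PRIMARY XML INDEX PXI_XDOC ON DOCS (XDOC)

CREATE XML INDEX SXI_XDOC_PATH ON DOCS (XDOC)
USING XML INDEX PXI_XDOC FOR PATH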
The next section describes the architecture for query processing on XML data.

3. XML Query Processing Architecture

As outlined in the previous section, the XML data is persisted in the relational store to leverage the existing relational infrastructure. An XQuery expression is compiled into a query tree that can be optimized and executed by the relational query processor. The hierarchical nature of the XML data is modelled as a parent-child relationship using the OrdPath node labelling scheme [11] instead of developing a new, hierarchical store. Query processing for the ordered, hierarchical data model requires more work than for the flat relational model. For this reason, the set of relational operators is extended with additional operators for XML processing. This enhancement yields "relational+" operators.

XQuery compilation is performed in multiple stages, starting with the parsing of XQuery expressions and resulting in the generation of the query plan containing the enhanced set of relational operators. The overall architecture is shown in Figure 1. The main steps consist of an XQuery Compiler, which includes XQuery parsing, and an XML Operator Mapper.

The XML algebra tree is an intermediate representation on which rule-based (as opposed to cost-based) optimizations are applied. One such optimization is path collapsing, described in Section 6. Rewrites using XML schema information are also applied to the XML algebra tree. The output of the XQuery Compiler step is an XML algebra tree that is highly optimized for XML processing.

Using the appropriate XML and relational type information, the XML Operator Mapper converts the XML operators in the XML algebra tree into a relational operator tree that includes the enhanced set of relational operators. This mapping is discussed in more detail in Section 4.

The XML Operator Mapper recursively traverses the XML algebra tree. For each XML operator in the XML algebra tree, a relational operator sub-tree is generated, which includes enhanced relational operators. The relational operator sub-trees are then inserted into the overall relational operator tree for the XQuery expression.

The mapping of each XML operator to a relational operator subtree depends upon the existence of a primary XML index on the XML column being queried. If it exists, then the query plan is generated to access columns in the primary XML index. If it does not exist, then the query plan is produced to evaluate path expressions without branching on the XML blob and to generate a set of rows representing the subtree of the matching nodes in document order. These rows contain most of the columns of the primary XML index except, notably, the primary key columns from the base table (used in a back join from the primary XML index to the base table) and the Path_ID column that contains the reversed, encoded path from an XML node to the root of the XML tree. The rest of the query plan is the same if the primary key and Path_ID columns are not needed. Otherwise, it continues to differ.

The relational operator tree for the XQuery expression is grafted into the main query tree for the whole SQL query. Thus, a single query tree is produced, and the query optimizer can optimize the full query plan containing both relational and XML accesses. This also supports interoperability between relational and XML data at the server, making way for richer application development.

The next subsection describes some of the XML operators used in the XML algebra tree.

3.1 XML Operators

The XQuery Compiler parses an XQuery expression and produces an XML algebra tree that includes XML operators. This section describes a handful of the XML operators introduced in SQL Server 2005, some of which are used further in this paper. This list is representative but not exhaustive; detailed descriptions are beyond the scope of this paper.

Each XML operator may accept input such as an ordered XML node list, an unordered XML node set, a Boolean condition, an ordinal condition, a node list condition, and other scalar input.

3.1.1 XmlOp_Select

The XmlOp_Select operator takes a list of items, including ordered XML nodes, as a left child and a condition as a right child. It returns the input items, in their input order, which satisfy the given condition.

3.1.2 XmlOp_Path

The XmlOp_Path operator is used for simple paths without predicates and produces the eligible XML nodes. This operator also uses a path context to collapse paths (see Section 6 for more information).

3.1.3 XmlOp_Apply

The XmlOp_Apply operator takes two item lists as input, and returns one item list. It has an "apply name" property whose value is the variable name bound by the corresponding "for" clause in XQuery. The variable is bound to each of the items in the first item list. The second item list typically contains references to this variable, and is evaluated using the variable binding with the items in the first list.

The XmlOp_Apply operator also takes a "where" and an "order-by" child.
It is a complex operator that the XML Operator Mapper translates to a relational operator tree for evaluating the "for", "where" and "order-by" clauses with the appropriate XQuery semantics.

3.1.4 XmlOp_Compare

This is a comparison operator with a field indicating the type of the comparison.

3.1.5 XmlOp_Constant

This operator represents a constant, which can be a literal or the result of constant folding. Constant folding is the static optimization that evaluates constant expressions during query compilation to avoid runtime execution costs and to allow more query optimizations.

3.1.6 XmlOp_Construct

The XmlOp_Construct operator creates all the XML node types: elements, attributes, processing instructions, comments, and text nodes. For element construction, the operator takes as input the sub-nodes (attributes and/or children); otherwise it takes the value of the constructed node.

3.1.7 Scalar Operators

The XmlOp_Function operator represents a built-in function that returns a scalar or XML nodes. The inputs are the parameters of the function and the output is the result of the function.

The next section describes the mapping of the XML operators for XPath and XQuery expressions to relational operators.

4. XML Operator Mapping

The XML Operator Mapper transforms an XML algebra tree into a relational operator tree. Conventional relational algebraic operators are inadequate to process the hierarchical XML data model in an efficient way. Consequently, the set of relational operators is enhanced with new operators for the purpose of XQuery processing, yielding the relational+ algebra. The relational operator tree is submitted to the query processor for optimization and execution.

We describe the mapping of the XML algebra tree to the relational operator tree in the following subsections. For convenience, we subdivide the discussion into the following categories:

• Mapping of XPath expressions
• Mapping of XQuery expressions
• Mapping of XQuery built-in functions

4.1 XPath Expressions

The XmlOp_Path operator representing a path is mapped to a relational operator in a different way for an XML blob than for a primary XML index on an XML column. Each of these scenarios is further subdivided into two cases:

• Simple path expressions without branching in which the full paths from the root of the XML trees are known after path collapsing ("exact paths").
• Path expressions without branching in which the full paths are not known ("inexact paths").

As described later in Section 6, segments of simple paths may be concatenated together to produce a longer simple path using the path collapsing technique. Inexact paths occur in the XML algebra tree when segments of the path cannot be collapsed or a path is split into multiple segments. This occurs most commonly for paths containing wildcard steps, the //-operator, and the self and parent axes.

The resulting four mappings are discussed below using the path expression /BOOK/SECTION as an example. Predicate and ordinal evaluations are discussed later in this section.

4.1.1 Non-indexed XML, Exact Path

The XmlOp_Path operator is mapped to an XML_Reader operator for parsing the XML blob. XML_Reader is a streaming, pull-model XML parser, similar to the XmlReader in the .NET framework [9]. It is chosen for its efficiency in parsing XML data and its relatively low memory requirements, compared to a non-streaming XML parser such as one for DOM, for handling large XML instances.

The path /BOOK/SECTION is an argument to the XML_Reader operator and is applied during runtime parsing of the XML blob.
The result is a set of rows representing the subtrees of the qualifying <SECTION> nodes and retaining the structural properties of those subtrees through their OrdPath values.

The XmlOp_Path operator can occur at the top level of the XML algebra tree when the path expression occurs within the query() method, i.e. XDOC.query(‘/BOOK/SECTION’). In this case, the rows representing the subtree of each <SECTION> node are reassembled into an XML data type result using an XML_Serialize operator. This step is referred to as XML Serialization in the rest of the paper. The overall mapping is shown in Figure 2.

4.1.2 Non-indexed XML, Inexact Path

The path, such as /BOOK/SECTION//TITLE, is used by XML_Reader during XML blob parsing to filter the eligible nodes. Thus, the relational operator tree is similar to the one in Figure 2, with the appropriate path as input to the XML_Reader operator.

Figure 4. Relational operator tree for the inexact path query XDOC.query(‘/BOOK/SECTION//TITLE’) for the indexed case.

As should be apparent from the discussions above, the indexed and the non-indexed cases differ mainly in the way paths are evaluated, either on XML blobs or on the Path_ID column of the primary XML index. The rest of the processing is done in much the same way on the columns common to both the primary XML index and the output rows of XML_Reader. For this reason, only the indexed case is illustrated in the remainder of this paper, for brevity.

4.1.5 Predicate Evaluation

Predicate evaluation is performed by comparing the search value with the value column of the primary XML index. The relational operator tree for the path expression /BOOK[@id = “123”] is shown in Figure 5. The evaluation of the simple paths /BOOK and /BOOK/@id proceeds as described above using the Path_ID column of the primary XML index. The specified value “123” is compared with the VALUE column in the same row of the primary XML index as the @id attribute. Since the two paths are evaluated separately, a check for the parent-child relationship is also needed. This is depicted in Figure 5 as the Parent_Check() function. The check uses the OrdPath property that a parent’s OrdPath is a prefix of its child’s OrdPath, excluding the rightmost component.

The value of a simple-valued, typed element is stored in the same row as the element, so predicates on such an element are evaluated in the same way as on an attribute. Predicates on untyped XML are more complicated to evaluate, since values may need to be aggregated from multiple rows, which makes the relational operator tree more complex.

Figure 5. Relational operator tree for the query XDOC.query(‘/BOOK[@id=”123”]’).

The relational operator tree may also contain CONVERT operators if the operands need to be converted to the appropriate types to perform an operation.

4.1.6 Ordinal Predicate

Ordinal predicate evaluation, such as /BOOK[n], adds a ranking column to the rows for <BOOK> elements and then retrieves the n-th <BOOK> node. A special optimization exists for the cases n = 1 and n = last(): the ordinal predicate is mapped to TOP 1 ascending and TOP 1 descending, respectively. TOP n is a relational operator that chooses the topmost n values from a rowset. When the input set is sorted, such as the rows in the primary XML index, this rewrite avoids ranking all the nodes before the ordinal predicate is evaluated.

4.2 XQuery Expressions

SQL Server 2005 supports the FLWOR clauses “for”, “where”, “order-by” and “return”. XML operator mapping is described in some detail below for these constructs.
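As an illustration of these clauses, a query combining all four of them might be written as follows; the table DOCS and the element and attribute names are the hypothetical ones used earlier, and the element constructor in the “return” clause is the kind of expression represented by the XmlOp_Construct operator of Section 3.1.6:

    SELECT XDOC.query('
        for $s in /BOOK/SECTION
        where $s/@num >= 3
        order by $s/@num
        return <toc>{ $s/TITLE }</toc>')
    FROM DOCS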
A formal algorithm for the mapping is not presented in this paper for lack of space; however, fragments of the algorithm are illustrated below using examples. The XQuery processing framework described in this paper is powerful enough to support “let”, but this is not discussed further in the paper.

4.2.1 “for” Iterator

The XML algebra operator for the “for” iterator in XQuery is XmlOp_Apply. It maps to the relational APPLY operator, as shown in the example in Figure 6 for the query

    for $s in /BOOK//SECTION
    where $s/@num >= 3
    return $s/TITLE
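For reference, issued against the hypothetical DOCS table introduced earlier, this query would be embedded in the query() method as follows (the surrounding SQL is illustrative only):

    SELECT XDOC.query('
        for $s in /BOOK//SECTION
        where $s/@num >= 3
        return $s/TITLE')
    FROM DOCS

Conceptually, the relational APPLY operator then evaluates the plan for $s/TITLE once per qualifying <SECTION> row, mirroring the per-item variable binding performed by XmlOp_Apply.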