Integration of Data Warehouse
A Brief Discussion of Data Warehousing and Data Mining Technology

1. Data Warehouses

The continuing spread of information technology has carried enterprises into an era of information explosion. At every moment a flood of information confronts managers, waiting to be processed and used. The processing of this management information falls into two broad categories: transactional (operational) processing and informational (analytical) processing. Transactional processing is what is usually called business operation processing. It performs the routine operations on management information, such as querying and updating, and its purpose is to meet an organization's specific day-to-day management needs. In this kind of processing, managers care about whether information can be processed quickly, whether its security can be guaranteed, and whether its integrity remains intact. Informational processing, by contrast, analyzes information further to support management decision making.

1.1 Definition of the Data Warehouse

W. H. Inmon, generally acknowledged as the originator of the data warehouse concept, defines it in Building the Data Warehouse as follows: a data warehouse is a subject-oriented, integrated, nonvolatile (stable), time-variant collection of data used to support decision making in business management. A data warehouse transforms raw operational data into consolidated information and supplies powerful analysis tools for examining that information from many angles, helping executives make decisions that better fit how the business actually develops. For this reason, "decision support system" has in many settings become a synonym for the data warehouse. The purpose of building a data warehouse is to integrate an enterprise's internal and external data effectively, for use by decision makers and analysts at every level.

1.2 Characteristics of the Data Warehouse

From Inmon's definition, the following important characteristics of a data warehouse can be identified.

1.2.1 Subject orientation

Subject orientation is the basic principle by which data in a warehouse is organized: all data in the warehouse is organized around particular subjects. Because most data warehouse users are an enterprise's management decision makers, the objects they analyze tend to be relatively abstract, high-level management concerns.

1.2.2 Integration

Integration means that before data enters the warehouse it must be processed and consolidated; this is a key step in building a data warehouse.

1.2.3 Time variance

Time variance means that the information in a data warehouse is not limited to the enterprise's state at present or at any single point in time; rather, it systematically records the enterprise's data from some point in the past up to now, mainly for time-trend analysis.
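The time-variant property is what makes trend analysis possible: because the warehouse keeps history across years rather than only current values, a query can aggregate by time period. A minimal sketch in Python follows; the fact table, field names, and values are invented for illustration, not taken from the text.

```python
from collections import defaultdict

# Illustrative warehouse fact rows: (date, region, sales_amount).
# An operational system keeps only current orders; the warehouse
# retains history, so multi-year trends can be computed.
facts = [
    ("2021-03-15", "north", 120.0),
    ("2021-11-02", "south", 80.0),
    ("2022-06-20", "north", 150.0),
    ("2022-09-09", "south", 95.0),
]

def yearly_trend(rows):
    """Aggregate sales by year, the kind of time-trend query a warehouse serves."""
    totals = defaultdict(float)
    for date, _region, amount in rows:
        totals[date[:4]] += amount  # the year is the leading component of the date key
    return dict(sorted(totals.items()))

print(yearly_trend(facts))  # {'2021': 200.0, '2022': 245.0}
```

The same query against an operational store would be impossible once old orders are purged; retaining the time element in every key is what the definition above demands.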
A Rough History of the Data Warehouse and Related Concepts

1.1 Overview

The data warehouse concept is probably older than most people imagine, and its development has been rather tortuous. Its original goal was enterprise-wide integration (Enterprise Integration), but along the way it settled for something more modest: building tactical data marts (Data Marts). To this day there remain many disagreements and debates, and many concepts are ambiguous or downright confusing. This article tries to trace some threads of development through the history of data warehousing, to understand what a data warehouse ought to be, and to look ahead at where data warehousing is going. At the same time, as new applications keep appearing, many new concepts and uses have emerged, and how these new applications fit into a unified, complete enterprise BI solution is still much debated. This article attempts to explain these concepts briefly, to give readers an initial understanding of them.

1.2 A Rough Timeline

1.2.1 Beginnings (1978-1988)

The earliest data warehouse concept can be traced to an MIT study in the 1970s that sought to develop an optimized technical architecture and to offer guidance on such architectures. For the first time, the MIT researchers separated operational systems from analytical systems, splitting transaction processing and analytical processing into different layers with separate data stores and entirely different design principles. The MIT results also matched the Information Center concept put forward in the 1980s: peel the newly emerging, unpredictable, but voluminous analytical workloads away from the operational processing systems. Limited by the information processing and storage capabilities of the time, however, the study established only one thesis: the two styles of information processing differ so greatly that they can only be served by completely different architectures and design methods.

Later, in the mid-to-late 1980s, DEC, then one of the most technologically advanced companies, began using a distributed network architecture to support its business applications, and DEC was the first to migrate its operational systems onto its own RDBMS product, Rdb. DEC also drew staff from engineering, sales, finance, and information technology to form a new team, which not only studied a new analytical system architecture but was also required to apply it to the company's global financial systems. Building on the MIT conclusions, the team produced the TA2 (Technical Architecture 2) specification, which defined four components of an analytical system:
♦ data acquisition
♦ data access
♦ directory
♦ user services
Data acquisition and data access are well understood today. The directory service helps users find the information they want on the network, much like business metadata management. User services support direct interaction with the data and contain all of the human-computer interfaces of the other services; this was a major shift in system architecture, the first time the interactive interface was proposed as a separate component.
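The layering of the four TA2 components can be pictured in code. The sketch below is a loose, hypothetical rendering in Python of that separation of concerns; every class name, method, and data value is invented for illustration and is not part of the DEC specification itself.

```python
# Hypothetical sketch of the four TA2 components as separate layers.
class DataAcquisition:
    """Extracts data from source systems (stand-in rows here)."""
    def load(self):
        return [{"account": "sales", "amount": 100}]

class DataAccess:
    """Answers queries over the acquired data."""
    def __init__(self, rows):
        self.rows = rows
    def query(self, account):
        return [r for r in self.rows if r["account"] == account]

class Directory:
    """Helps users find information: a nod to business metadata management."""
    catalog = {"sales": "revenue postings from the financial system"}
    def describe(self, name):
        return self.catalog.get(name, "unknown")

class UserServices:
    """All human-computer interaction funnels through this single component."""
    def __init__(self, access, directory):
        self.access, self.directory = access, directory
    def show(self, account):
        return f"{account} ({self.directory.describe(account)}): {self.access.query(account)}"

ui = UserServices(DataAccess(DataAcquisition().load()), Directory())
print(ui.show("sales"))
```

The point of the exercise is the boundary, not the bodies: the interactive interface lives in one component, which is exactly the architectural shift the text credits to TA2.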
Data Warehouses and Data Mining

Abstract

Data mining is an emerging technology, and research on it has been flourishing in recent years. This paper explains the concepts of data warehousing and data mining and analyzes them, explores the relationship through which the two develop together, and looks ahead to the combined application of data warehousing and mining technology. Using Data Miner as the data mining tool, it presents an example of a data warehouse applied in a hospital, and points out that data mining is broadly applicable to managing medical costs, medical diagnosis, and hospital resources, making a positive contribution to the analysis and decision making of hospital managers.

Abstract (English): Data mining is a burgeoning technology, and research on it is flourishing. This paper explains and analyzes the concepts of the data warehouse and data mining together, discusses how the two technologies develop in connection with each other, and considers the prospects of combining them. The data warehouse supports further processing and reuse of mass data. The paper points out the use of data mining in patient charge control, medical quality control, and hospital resource allocation management, helping the hospital make decisions proactively.

Keywords: data warehouse; data mining; hospital information system

Contents
1. Overview of Data Warehouses
1.1 Characteristics of the Data Warehouse
1.2 Data Warehouse Systems
1.3 On-line Analytical Processing
2. Data Mining
2.1 Definition and Implementation Process of Data Mining
2.2 Classification of Data Mining
2.3 Data Mining Tasks
3. The Relationship between Data Mining and Data Warehouses
4. Applications of Data Mining Technology in Hospital Management
4.1 Analysis of Patient Cost Composition
4.2 Period-over-Period Cost Comparison
4.3 Patient Structure Analysis
4.4 Patient Flow Analysis
4.5 Patient Visit-Time Analysis
4.6 Cost-Benefit Analysis
5. Conclusion

As the information age advances, society finds itself in a period of rapid development of data technology.
IEEM 300G - Special Topics: Information Technology in Supply Chain Management
Course Syllabus - Spring 2000
t.hk/dfaculty/yen/COURSE/IT-SCM00/

Course Objective
§ Review information technology, production systems, and supply chain management.
§ Study modern Internet-based planning/scheduling/control systems, product design and process simulation tools, and distribution/re-distribution tracking and monitoring systems.
§ Carry out projects and research on information access, information coordination, and information processing for supply chain management in manufacturing and service industries.

Course Description
• The course is designed to prepare attendants to apply information technology in supply chain management. Traditionally, industries focus on operation evaluation and performance improvement of the manufacturing process; however, deficient supply chain coordination results in a severe downgrade of business competitiveness. With the advent of information technology, computers improve not only manufacturing operation and management but strategic decision making as well.

Course Topics
• Production process and distribution process
• Material flow and information flow
• Data acquisition, tracking and monitoring
• MRP, CRP, MRP II, and ERP systems
• Electronic catalogues and directories for Web sourcing
• Electronic Warehouse and Data Warehouse
• Web-based information access, information coordination, and information processing
• Supply chain modeling, analysis, and simulation
• Supply chain coordination and synchronization
• IT project management and deployment

Instructor: Dr. Benjamin Yen
Office: Room #5521; Telephone: 2358-7107; Office Hour: by appointment
Email: benyen@ust.hk; Web address: t.hk/dfaculty/yen/yen.html

TA: TBA (Office, Telephone, Office Hour, Email: TBA)

Lecture
L1: Tue/Thu 15:00-16:20, Room 3412
L2: Tue/Thu 16:30-17:50, Room 3412
Lab/Tutorial: TBA

Textbooks (Required)
1. Robert B. Handfield, Ernest L. Nichols Jr., Introduction to Supply Chain Management, Prentice Hall; ISBN: 0136216161 (July 1998)

References (Strongly recommended)
1. Dave Chaffey, Groupware, Workflow and Intranets: Reengineering the Enterprise with Collaborative Software, Digital Press; ISBN: 1555581846 (July 1998)
2. Chris Marshall, Enterprise Modeling with UML: Designing Successful Software through Business Analysis (The Addison-Wesley Object Technology Series), Addison-Wesley Pub Co; ISBN: 020******* (October 29, 1999)

Other References
1. David Simchi-Levi, Philip Kaminsky, Edith Simchi-Levi, Designing and Managing the Supply Chain: Concepts, Strategies and Case Studies, Irwin/McGraw-Hill; ISBN: 0072357568 (August 27, 1999)
2. James E. Hill, Larry Fredendall, Fred Hill, Basics of Supply Chain Management, Saint Lucie Pr; ISBN: 1574441205 (December 1999)
3. Martin Christopher, Logistics and Supply Chain Management: Strategies for Reducing Cost and Improving Service (Financial Times Management), 2nd edition, Financial Times Management (April 1999)
4. Grady Booch, Ivar Jacobson, James Rumbaugh, The Unified Modeling Language User Guide (The Addison-Wesley Object Technology Series), Addison-Wesley Pub Co; ISBN: 020157168 (October 30, 1998)
5. Carol A. Ptak, Eli Schragenheim (Editor), ERP Tools, Techniques and Applications for Integrating the Supply Chain, CRC Pr; ISBN: 1574442708 (September 28, 1999)
6. William C. Copacino, Supply Chain Management: The Basics and Beyond (The St. Lucie Press/APICS Series on Resource Management), Saint Lucie Pr (May 1997)
7. Donald J. Bowersox, David J. Closs, Logistical Management: The Integrated Supply Chain Process, McGraw Hill College Div; ISBN: 0070068836 (January 30, 1996)
8. Sridhar Tayur, Ram Ganeshan, Michael J. Magazine (Editors), Quantitative Models for Supply Chain Management (International Series in Operations Research & Management Science, 17), Kluwer Academic Publishers; ISBN: 0792383443 (December 1998)
9. Miguel Fernandez-Ranada, F. Xavier Gurrola-Gal, Enrique Lopez-Tello (Editors), 3C: A Proven Alternative to MRPII for Optimizing Supply Chain Performance, Saint Lucie Pr; ISBN: 1574442716 (August 1999)
10. Ronald H. Ballou, Business Logistics Management: Planning, Organizing, and Controlling the Supply Chain, Fourth edition, Prentice Hall (August 1998)
11. Avraham Shtub, Enterprise Resource Planning (ERP): The Dynamics of Operations Management, Kluwer Academic Publishers; ISBN: 0792384385 (March 1999)
12. Charles C. Poirier, Advanced Supply Chain Management: How to Build a Sustained Competition, Publishers' Group West; ISBN: 1576750523 (February 1999)

Class Participation (15%)
§ Class discussion on reading/Web applications
§ Reading/Research Assignment (group) (denoted as homework #5)
o summary/presentation (2-3 pages)
o presentation to be scheduled in the lecture after midterm (15 min presentation + 5 min discussion)

Homework (20% - 4 Assignments)
§ (4x5%) Writing Assignment (individual)
o question/answer, Web practice

Midterm (25%)
§ Closed-book

Project (40%)
§ Team project (2-3 people).
§ Design/develop information systems or research on modern information technology for supply chain management.
§ Project topics can be either assigned by the lecturer or proposed by project members (with the lecturer's approval).
§ Project grading/schedule
o proposal: 10%, week 3 (proposal)
o plan: 20%, week 5 (revised proposal) presentation
o presentation: 30%, week 7 (project plan)
o report: 40%, week 17 (presentation/report)
o Total: 100%

Grade summary: Class Participation 15%; Homework 20%; Midterm 25%; Project 40%; Total 100%

Timetable
Week 1 (02/01, 02/03): Introduction to Supply Chain Management
§ Information Systems and Supply Chain Management
§ Inventory Management across the Supply Chain
Week 2 (02/08, 02/10): Introduction to Supply Chain Management (cont.)
§ Supply Chain Relationships
§ Challenges Facing Supply Chain Managers
HW: #1
Week 3 (02/15, 02/17): The Role of Information Systems and Technology in Supply Chain Management
§ The Importance of Information in an Integrated Supply Chain Management Environment
§ Inter-organizational Information Systems
HW: #1
Week 4 (02/22, 02/24): The Role of Information Systems and Technology in Supply Chain Management (cont.)
§ Information Requirements Determination for a Supply Chain IOIS
§ Information and Technology Applications for Supply Chain Management
HW: #2
Week 5 (02/29, 03/02): Managing the Flow of Materials across the Supply Chain
§ Understanding Supply Chains
§ Reengineering Supply Chain Logistics
Lab-T: (#5)[1]; HW: #2
Week 6 (03/07, 03/09): Managing the Flow of Materials across the Supply Chain (cont.)
§ The Importance of Time
§ Performance Measurement
HW: #3
Week 7 (03/14, 03/16): Developing and Maintaining Supply Chain Relationships
§ A Conceptual Model of Alliance Development
§ Developing a Trusting Relationship with Partners in the Supply Chain
Lab-T: (#5)[2]; HW: #3
Week 8 (03/21, 03/23): Developing and Maintaining Supply Chain Relationships (cont.)
§ Resolving Conflicts in a Supply Chain Relationship
HW: #4
Week 9 (03/28, 03/30): Review session; Midterm
HW: #4
Week 10 (04/06): Cases in Supply Chain Management
§ Case One: Consumable Computer Supplies
§ Case Two: Computer Hardware and Software
Week 11 (04/11, 04/13): Cases in Supply Chain Management (cont.)
§ Case Three: Upscale Men's Shoes
§ Case Four: Biochemicals
§ Case Five: Solectron
Lab-T: (#5)[3]
Week 12 (04/18, 04/20): Web technology: Information access/coordination
§ Data tracking/monitoring
§ Workflow management
§ Sourcing/procurement on the Web
Lab-T: (#5)[3]
Week 13 (04/25, 04/27): Web technology: Information processing
§ Design/simulation
§ Planning/scheduling
§ Mass customization
Lab-T: (#5)[3]
Week 14 (05/02, 05/04): Future Challenges in Supply Chain Management
§ Sharing Risks in Inter-organizational Relationships
§ Managing the Global Supply Chain
§ The "Greening" of the Supply Chain
Lab-T: (#5)[3]
Week 15 (05/09): Future Challenges in Supply Chain Management (cont.)
§ Design for Supply Chain Management
§ Intelligent Information Systems
§ When Things Go Wrong
Lab-T: (#5)[3]
Week 16 (05/16, 05/18): Study break
Week 17 (05/23, 05/25): Project Presentation

Notes:
[1]: reading assignment distribution
[2]: reading presentation schedule
[3]: presentation session (2-3 presentations per week)

Information Links
/pfingar/bookecommerce.htm
/~citm/cec/
/conf&sem/ElectricCommerce/index1.htm
.hk/rthk/index.htm
DATA WAREHOUSE

Data warehousing provides architectures and tools for business executives to systematically organize, understand, and use their data to make strategic decisions. A large number of organizations have found that data warehouse systems are valuable tools in today's competitive, fast-evolving world. In the last several years, many firms have spent millions of dollars in building enterprise-wide data warehouses. Many people feel that with competition mounting in every industry, data warehousing is the latest must-have marketing weapon: a way to keep customers by learning more about their needs.

"So", you may ask, full of intrigue, "what exactly is a data warehouse?"

Data warehouses have been defined in many ways, making it difficult to formulate a rigorous definition. Loosely speaking, a data warehouse refers to a database that is maintained separately from an organization's operational databases. Data warehouse systems allow for the integration of a variety of application systems. They support information processing by providing a solid platform of consolidated, historical data for analysis.

According to W. H. Inmon, a leading architect in the construction of data warehouse systems, "a data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision making process." This short but comprehensive definition presents the major features of a data warehouse. The four keywords, subject-oriented, integrated, time-variant, and nonvolatile, distinguish data warehouses from other data repository systems, such as relational database systems, transaction processing systems, and file systems. Let's take a closer look at each of these key features.

(1) Subject-oriented: A data warehouse is organized around major subjects, such as customer, vendor, product, and sales.
Rather than concentrating on the day-to-day operations and transaction processing of an organization, a data warehouse focuses on the modeling and analysis of data for decision makers. Hence, data warehouses typically provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.

(2) Integrated: A data warehouse is usually constructed by integrating multiple heterogeneous sources, such as relational databases, flat files, and on-line transaction records. Data cleaning and data integration techniques are applied to ensure consistency in naming conventions, encoding structures, attribute measures, and so on.

(3) Time-variant: Data are stored to provide information from a historical perspective (e.g., the past 5-10 years). Every key structure in the data warehouse contains, either implicitly or explicitly, an element of time.

(4) Nonvolatile: A data warehouse is always a physically separate store of data transformed from the application data found in the operational environment. Due to this separation, a data warehouse does not require transaction processing, recovery, and concurrency control mechanisms. It usually requires only two operations in data accessing: initial loading of data and access of data.

In sum, a data warehouse is a semantically consistent data store that serves as a physical implementation of a decision support data model and stores the information on which an enterprise needs to make strategic decisions. A data warehouse is also often viewed as an architecture, constructed by integrating data from multiple heterogeneous sources to support structured and/or ad hoc queries, analytical reporting, and decision making.

"OK", you now ask, "what, then, is data warehousing?"

Based on the above, we view data warehousing as the process of constructing and using data warehouses. The construction of a data warehouse requires data integration, data cleaning, and data consolidation.
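The integration step described under (2) is concrete in practice: source systems disagree on names, codes, and units, and the load process maps them all onto one convention. A minimal, hypothetical cleaning sketch follows; the source names, field names, and code table are invented for illustration.

```python
# Two hypothetical sources encode gender and customer id differently;
# the warehouse load normalizes both onto a single convention.
GENDER_CODES = {"m": "M", "male": "M", "f": "F", "female": "F", "1": "M", "2": "F"}

def integrate(record, source):
    """Map a source record onto the warehouse's naming and encoding conventions."""
    if source == "crm":        # crm uses cust_id / sex ("male"/"female")
        return {"customer_id": record["cust_id"],
                "gender": GENDER_CODES[record["sex"].lower()]}
    if source == "billing":    # billing uses CUSTNO / GCODE ("1"/"2")
        return {"customer_id": record["CUSTNO"],
                "gender": GENDER_CODES[record["GCODE"]]}
    raise ValueError(f"unknown source: {source}")

rows = [integrate({"cust_id": 7, "sex": "Female"}, "crm"),
        integrate({"CUSTNO": 7, "GCODE": "1"}, "billing")]
print(rows)  # both records now share one schema and one encoding
```

Real warehouse loads add matching and deduplication on top of this, but the essence of the integration feature is exactly this mapping onto shared names and codes.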
The utilization of a data warehouse often necessitates a collection of decision support technologies. This allows “knowledge workers" (e.g., managers, analysts, and executives) to use the warehouse to quickly and conveniently obtain an overview of the data, and to make sound decisions based on information in the warehouse. Some authors use the term “data warehousing" to refer only to the process of data warehouse construction, while the term warehouse DBMS is used to refer to the management and utilization of data warehouses. We will not make this distinction here.“How are organizations using the information from data warehouses?" Many organizations are using this information to support business decision making activities, including:(1) increasing customer focus, which includes the analysis of customer buying patterns (such as buying preference, buying time, budget cycles, and appetites for spending),(2) repositioning products and managing product portfolios by comparing the performance of sales by quarter, by year, and by geographic regions, in order to fine-tune production strategies,(3) analyzing operations and looking for sources of profit,(4) managing the customer relationships, making environmental corrections, and managing the cost of corporate assets.Data warehousing is also very useful from the point of view of heterogeneous database integration. Many organizations typically collect diverse kinds of data and maintain large databases from multiple, heterogeneous, autonomous, and distributed information sources. To integrate such data, and provide easy and efficient access to it is highly desirable, yet challenging. Much effort has been spent in the database industry and research community towards achieving this goal.The traditional database approach to heterogeneous database integration is to build wrappers and integrators (or mediators) on top of multiple, heterogeneous databases. A variety of data joiner and data blade products belong to this category. 
When a query is posed to a client site, a metadata dictionary is used to translate the query into queries appropriate for the individual heterogeneous sites involved. These queries are then mapped and sent to local query processors. The results returned from the different sites are integrated into a global answer set. This query-driven approach requires complex information filtering and integration processes, and competes for resources with processing at local sources. It is inefficient and potentially expensive for frequent queries, especially for queries requiring aggregations.

Data warehousing provides an interesting alternative to the traditional approach of heterogeneous database integration described above. Rather than using a query-driven approach, data warehousing employs an update-driven approach in which information from multiple, heterogeneous sources is integrated in advance and stored in a warehouse for direct querying and analysis. Unlike on-line transaction processing databases, data warehouses do not contain the most current information. However, a data warehouse brings high performance to the integrated heterogeneous database system since data are copied, preprocessed, integrated, annotated, summarized, and restructured into one semantic data store. Furthermore, query processing in data warehouses does not interfere with the processing at local sources. Moreover, data warehouses can store and integrate historical information and support complex multidimensional queries. As a result, data warehousing has become very popular in industry.

1. Differences between operational database systems and data warehouses

Since most people are familiar with commercial relational database systems, it is easy to understand what a data warehouse is by comparing these two kinds of systems.

The major task of on-line operational database systems is to perform on-line transaction and query processing. These systems are called on-line transaction processing (OLTP) systems.
They cover most of the day-to-day operations of an organization, such as purchasing, inventory, manufacturing, banking, payroll, registration, and accounting. Data warehouse systems, on the other hand, serve users or "knowledge workers" in the role of data analysis and decision making. Such systems can organize and present data in various formats in order to accommodate the diverse needs of the different users. These systems are known as on-line analytical processing (OLAP) systems.

The major distinguishing features between OLTP and OLAP are summarized as follows.

(1) Users and system orientation: An OLTP system is customer-oriented and is used for transaction and query processing by clerks, clients, and information technology professionals. An OLAP system is market-oriented and is used for data analysis by knowledge workers, including managers, executives, and analysts.

(2) Data contents: An OLTP system manages current data that, typically, are too detailed to be easily used for decision making. An OLAP system manages large amounts of historical data, provides facilities for summarization and aggregation, and stores and manages information at different levels of granularity. These features make the data easier to use in informed decision making.

(3) Database design: An OLTP system usually adopts an entity-relationship (ER) data model and an application-oriented database design. An OLAP system typically adopts either a star or snowflake model, and a subject-oriented database design.

(4) View: An OLTP system focuses mainly on the current data within an enterprise or department, without referring to historical data or data in different organizations. In contrast, an OLAP system often spans multiple versions of a database schema, due to the evolutionary process of an organization. OLAP systems also deal with information that originates from different organizations, integrating information from many data stores.
Because of their huge volume, OLAP data are stored on multiple storage media.

(5) Access patterns: The access patterns of an OLTP system consist mainly of short, atomic transactions. Such a system requires concurrency control and recovery mechanisms. However, accesses to OLAP systems are mostly read-only operations (since most data warehouses store historical rather than up-to-date information), although many could be complex queries.

Other features which distinguish between OLTP and OLAP systems include database size, frequency of operations, and performance metrics.

2. But why have a separate data warehouse?

"Since operational databases store huge amounts of data", you observe, "why not perform on-line analytical processing directly on such databases instead of spending additional time and resources to construct a separate data warehouse?"

A major reason for such a separation is to help promote the high performance of both systems. An operational database is designed and tuned for known tasks and workloads, such as indexing and hashing using primary keys, searching for particular records, and optimizing "canned" queries. On the other hand, data warehouse queries are often complex. They involve the computation of large groups of data at summarized levels, and may require the use of special data organization, access, and implementation methods based on multidimensional views. Processing OLAP queries in operational databases would substantially degrade the performance of operational tasks.

Moreover, an operational database supports the concurrent processing of several transactions. Concurrency control and recovery mechanisms, such as locking and logging, are required to ensure the consistency and robustness of transactions. An OLAP query often needs read-only access to data records for summarization and aggregation.
Concurrency control and recovery mechanisms, if applied for such OLAP operations, may jeopardize the execution of concurrent transactions and thus substantially reduce the throughput of an OLTP system.

Finally, the separation of operational databases from data warehouses is based on the different structures, contents, and uses of the data in these two systems. Decision support requires historical data, whereas operational databases do not typically maintain historical data. In this context, the data in operational databases, though abundant, is usually far from complete for decision making. Decision support requires consolidation (such as aggregation and summarization) of data from heterogeneous sources, resulting in high-quality, cleansed, and integrated data. In contrast, operational databases contain only detailed raw data, such as transactions, which need to be consolidated before analysis. Since the two systems provide quite different functionalities and require different kinds of data, it is necessary to maintain separate databases.
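The contrast drawn above between detailed raw transactions and consolidated decision-support data can be shown in miniature: the warehouse side computes summaries at coarser granularity, so analytical reads never have to touch the transactional store. A small illustrative sketch follows; the schema and numbers are invented for the example.

```python
from collections import defaultdict

# Hypothetical detailed OLTP-style transactions: (product, quarter, amount).
transactions = [
    ("widget", "Q1", 10.0), ("widget", "Q1", 15.0),
    ("widget", "Q2", 20.0), ("gadget", "Q1", 5.0),
]

def summarize(rows, level):
    """Consolidate raw transactions at a chosen granularity: the aggregation
    step that decision support needs and operational stores lack."""
    totals = defaultdict(float)
    for product, quarter, amount in rows:
        key = (product, quarter) if level == "product_quarter" else (product,)
        totals[key] += amount
    return dict(totals)

print(summarize(transactions, "product_quarter"))  # per-product, per-quarter rollup
print(summarize(transactions, "product"))          # coarser, per-product rollup
```

Materializing such rollups in the warehouse is what lets complex OLAP queries run without the locking and logging overhead that the transactional system must impose.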
International Journal of Cooperative Information Systems, Vol. 10, No. 3 (2001) 237–271
© World Scientific Publishing Company

DATA INTEGRATION IN DATA WAREHOUSING

DIEGO CALVANESE, GIUSEPPE DE GIACOMO, MAURIZIO LENZERINI, DANIELE NARDI and RICCARDO ROSATI
Dipartimento di Informatica e Sistemistica, Università di Roma "La Sapienza"
Via Salaria 113, 00198 Roma, Italy

Information integration is one of the most important aspects of a Data Warehouse. When data passes from the sources of the application-oriented operational environment to the Data Warehouse, possible inconsistencies and redundancies should be resolved, so that the warehouse is able to provide an integrated and reconciled view of data of the organization. We describe a novel approach to data integration in Data Warehousing. Our approach is based on a conceptual representation of the Data Warehouse application domain, and follows the so-called local-as-view paradigm: both source and Data Warehouse relations are defined as views over the conceptual model. We propose a technique for declaratively specifying suitable reconciliation correspondences to be used in order to solve conflicts among data in different sources. The main goal of the method is to support the design of mediators that materialize the data in the Data Warehouse relations. Starting from the specification of one such relation as a query over the conceptual model, a rewriting algorithm reformulates the query in terms of both the source relations and the reconciliation correspondences, thus obtaining a correct specification of how to load the data in the materialized view.

Keywords: Data Warehousing, data integration, data reconciliation, local-as-view approach, query rewriting, automated reasoning.

1. Introduction

A Data Warehouse is a set of materialized views over the operational information sources of an organization, designed to provide support for data analysis and management's decisions. Information integration is at the heart of Data Warehousing.1 When data passes from the
application-oriented operational environment to the Data Warehouse, possible inconsistencies and redundancies should be resolved, so that the Warehouse is able to provide an integrated and reconciled view of data of the organization.

Generally speaking, information integration is the problem of acquiring data from a set of sources that are available for the application of interest.2 This problem has recently become a central issue in several contexts, including Data Warehousing, Interoperable and Cooperative Systems, Multi-database systems, and Web Information Systems. The typical architecture of an integration system is described in terms of two types of modules: wrappers and mediators.3,4 The goal of a wrapper is to access a source, extract the relevant data, and present such data in a specified format. The role of a mediator is to collect, clean, and combine data produced by different wrappers (or mediators), so as to meet a specific information need of the integration system. The specification and the realization of mediators is the core problem in the design of an integration system.

(Author e-mails: Calvanese@dis.uniroma1.it, Giacomo@dis.uniroma1.it, Lenzerini@dis.uniroma1.it, Nardi@dis.uniroma1.it, Rosati@dis.uniroma1.it)

The constraints that are typical of Data Warehouse applications restrict the large spectrum of approaches that have been proposed for integration.1,5,6 First, while the sources are often external to the organization managing the integration system, in a Data Warehouse they are mostly internal to the organization. Second, a Data Warehouse should reflect the informational needs of the organization, and should therefore be defined in terms of a global, corporate view of data. Without such a global view, there is the risk of concentrating too much on what is in the sources at the operational level, rather than on what is really needed in order to perform the required analysis on data.7 Third, such a corporate view should be provided in
terms of representation mechanisms that are able to abstract from the physical and logical structure of data in the sources. It follows that the need and requirements for maintaining an integrated, conceptual view of the corporate data in the organization are stronger than in other contexts. A direct consequence of this fact is that the data in the sources and in the Data Warehouse should be defined in terms of the corporate view of data, and not the other way around. In other words, data integration in Data Warehousing should follow the local-as-view approach, where each table in a source and in the Data Warehouse is defined as a view of a global model of the corporate data. On the contrary, the global-as-view approach requires, for each information need, to specify the corresponding query in terms of the data at the sources, and is therefore suited when no global view of the data of the organization is available.

The above considerations motivate the local-as-view approach to information integration proposed in the context of the DWQ (Data Warehouse Quality) project.8,9 The distinguishing features of the approach are: a rich modeling language, which extends an Entity-Relationship data model, in order to represent a Conceptual Data Warehouse Model; reasoning tools associated to the modeling language which support the Data Warehouse construction, maintenance and evolution.

Most of the work on integration has been concerned with the intensional/schema level.10 Schema integration is nowadays a well-established discipline, which has also been addressed within DWQ.11 It will not be discussed further in this paper.
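The local-as-view idea in the paragraph above can be made concrete with a toy example: a global (conceptual) relation exists only virtually, each source is declared as a slice of it, and the system must combine the source views to answer a query posed over the global model. The sketch below is a deliberately simplified Python illustration under invented data, far from the paper's actual rewriting machinery.

```python
# Toy global model: person(name, city, age). Each source is described
# (local-as-view) by the slice of the global relation it holds:
# source s1 holds (name, city); source s2 holds (name, age).
s1 = [("ann", "rome"), ("bob", "milan")]
s2 = [("ann", 34), ("bob", 41)]

def answer_global_query(city):
    """Answer 'ages of people in <city>', a query over the global model,
    by combining the two source views on their shared key (name)."""
    ages = dict(s2)
    return sorted(ages[name] for name, c in s1 if c == city and name in ages)

print(answer_global_query("rome"))  # [34]
```

The hard part that the paper addresses, and that this toy skips, is doing this inversion automatically for expressive view definitions, while respecting the constraints of the conceptual model.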
On the other hand, less attention has generally been devoted to the problem of data integration at the extensional/instance level. Data integration at the instance level is, nonetheless, crucial in Data Warehousing, where the process of acquiring data from the sources and making them available within the Data Warehouse is of paramount importance.

As a matter of fact, in the life time of a Data Warehouse the explicit representation of relationships between the sources and the materialized data in the Data Warehouse is useful in several tasks: from the initial loading, where the identification of the relevant data within the sources is critical, to the refreshment process, which may require a dynamic adaptation depending on the availability of the sources, as well as on their reliability and quality that may change over time. Moreover, the extraction of data from a primary Data Warehouse for Data Mart applications, where the primary Warehouse is now regarded as a data source, can be treated in a similar way. In addition, even though the sources within an Enterprise are not as dynamic as in other information integration frameworks, they are nonetheless subject to changes; in particular, creation of new sources and deletion of existing ones must be taken into account. Consequently, the maintenance of the Data Warehouse requires several upgrades of the data flows towards the Data Warehouse. In other words, a Data Warehouse, especially in large organizations, should be regarded as an incremental system, which critically depends upon the relationships between the sources and the Data Warehouse.

Given a request for new data to be materialized in the Data Warehouse, which is formulated in terms of the conceptual, global view of the corporate data (i.e. not the language of the sources, but of the enterprise), there are several steps that are required for the acquisition of data from the sources:

(1) Identification of the sources where the relevant information
resides. Note that this task is typical of the local-as-view approach, and requires algorithms that are generally both sophisticated and costly.4,12,13

(2) Decomposition of the user request into queries to individual sources that are supposed to return the data of interest.

(3) Interpretation and merging of the data provided by a source. Interpreting data can be regarded as the task of casting them into a common representation. Moreover, the data returned by various sources need to be combined to provide the Data Warehouse with the requested information. The complexity of this reconciliation step is due to several problems, such as possible mismatches between data referring to the same real world object, possible errors in the data stored in the sources, possible inconsistencies between values representing the properties of the real world objects in different sources.14

In commercial environments for Data Warehouse design and management, the above tasks are taken care of through ad hoc components.6 In general, such components provide the user with the capability of specifying the mapping between the sources and the Data Warehouse by browsing through a meta-level description of the relations of the sources. In addition, they generally provide both for automatic code generators and for the possibility of attaching procedures to accomplish ad hoc transformations and filtering of the data. Even though there are powerful and effective environments with the above features, their nature is inherently procedural, and close to the notion of global-as-view, where the task of relating the sources with the Data Warehouse is done on a query-by-query basis.

Several recent research contributions address the same problem from a more formal perspective.5,15–21 Generally speaking, these works differ from our proposal since they follow the global-as-view approach. We refer to Sec. 6 for a comparison with our work. Research projects concerning the local-as-view approach have concentrated on
the problem of reformulating a query in terms of the source relations.12,13,22–34 However, none of them addresses both the problem of data reconciliation at the instance level, and the problem of query rewriting with respect to a conceptual model.

In this paper we present a novel approach to data integration in Data Warehousing, which builds upon and extends recent work within DWQ.8,35,36 Compared with the existing proposals mentioned above, the novelty of our approach stems from the following features:

• It relies on the Conceptual Model of the corporate data, which is expressed in an Entity-Relationship formalism.
• It follows the local-as-view paradigm, in the sense that both the sources and the Data Warehouse relations are defined as views over the Conceptual Model.
• It allows the designer to declaratively specify several types of reconciliation correspondences between data in different schemas (either source schemas or the Data Warehouse schema). Three types of Reconciliation Correspondences are taken into account, namely Conversion, Matching, and Merging Correspondences.
• It uses such correspondences in order to support the task of specifying the correct mediators for the loading of the materialized views of the Data Warehouse. For this purpose, our methodology relies on a query rewriting algorithm, whose role is to reformulate the query that defines the view to materialize in terms of both the source relations and the Reconciliation Correspondences. The characteristic feature of the algorithm is that it takes into account the constraints imposed by the Conceptual Model, and uses the Reconciliation Correspondences for cleaning, integrating, and reconciling data coming from different sources.

The paper is organized as follows. In Sec. 2, we summarize the general architecture we use in our approach to data integration in Data Warehousing. Section 3 illustrates the method we propose to describe the content of the sources and the Data Warehouse at the logical level. Section 4 is devoted to a
discussion of the meaning and the role of Reconciliation Correspondences. Section 5 describes the query rewriting algorithm at the basis of our approach to the design of mediators. Section 6 compares our proposal with related work. Finally, Sec. 7 concludes the paper.

2. General Framework

We briefly summarize the general framework at the basis of our approach to schema and data integration. Such a framework has been developed within the ESPRIT project DWQ, "Foundations of Data Warehouse Quality".6,9

Data Integration in Data Warehousing 241

A Data Warehouse can be seen as a database which maintains an integrated, reconciled and materialized view of information residing in several data sources. In our approach we explicitly model the data in the sources and in the Data Warehouse at different levels of abstraction37,8,38:

• The conceptual level, which contains a conceptual representation of the corporate data.
• The logical level, which contains a representation in terms of a logical data model of the sources and of the data materialized in the Data Warehouse.
• The physical level, which contains a specification of the stored data, the wrappers for the sources, and the mediators for loading the data store.

The relationship between the conceptual and the logical level, and between the logical and the physical level, is represented explicitly by specifying mappings between corresponding objects of the different levels. In the rest of this section, we focus on the conceptual and logical levels, referring to the abstract architecture depicted in Fig. 1; the physical level is treated elsewhere.6 In Sec. 5, we will explain in detail the construction of the specification of the mediators.

2.1. Conceptual level

In the overall Data Warehouse architecture, we explicitly conceive a conceptual level, which provides a conceptual representation of the data managed by the enterprise, including a conceptual representation of the data residing in sources, and of the global concepts and relationships that are of interest to the Data
Warehouse application. Such a description, for which we use the term Conceptual Model, is independent from any system consideration, and is oriented towards the goal of expressing the semantics of the application. The Conceptual Model corresponds roughly to the notion of integrated conceptual schema in the traditional approaches to schema integration, thus providing a consolidated view of the concepts and the relationships that are important to the enterprise, and have been currently analyzed. Such a view includes a conceptual representation of the portion of data, residing in the sources, currently taken into account. Hence, our approach is not committed to the existence of a fully specified Conceptual Model, but rather supports an incremental definition of such a model. Indeed, the Conceptual Model is subject to changes and additions as the analysis of the information sources proceeds.

Fig. 1. Architecture for data integration.

An important aspect of the conceptual representation is the explicit specification of the set of interdependencies between objects in the sources and objects in the Data Warehouse. In this respect, data integration can be regarded as the process of understanding and representing the relationships between data residing in the information sources and the information contained in the Data Warehouse. Data reconciliation is also performed at this stage, instead of simply producing a unified data schema; moreover, such an integration activity is driven by automated reasoning tools, which are able to derive and verify several kinds of properties concerning the conceptual specification of information.

The formalization of information in the Conceptual Model is based on the distinction between conceptual objects and values. Reflecting such distinction, the Conceptual Model consists of two components:

(1) an enriched Entity-Relationship model, which formalizes the properties of conceptual objects; and
(2) a set of domain assertions, which model the
properties of values.

We discuss the two components in the following.

2.1.1. Enriched entity-relationship model

The enriched Entity-Relationship model is formalized in terms of a logic-based formalism, called DLR.37 Such a formalism allows us to capture the Entity-Relationship (ER) model augmented with several forms of constraints that cannot be expressed in the standard ER model. Moreover, it provides sophisticated automated reasoning capabilities, which can be exploited in verifying different properties of the Conceptual Model.

DLR belongs to the family of Description Logics, introduced and studied in the field of Knowledge Representation.39,40 Generally speaking, Description Logics are class-based representation formalisms that allow one to express several kinds of relationships and constraints (e.g. subclass constraints) holding among classes. Specifically, DLR includes:

• concepts, which are used to represent entity types (or simply entities), i.e. sets of conceptual objects having common properties;
• n-ary relationships, which are used to represent relationship types (or simply relationships), i.e. sets of tuples, each of which represents an association between conceptual objects belonging to different entities. The participation of conceptual objects in relationships models properties corresponding to relations to other conceptual objects; and
• attributes, which are used to associate to conceptual objects (or tuples of conceptual objects) properties expressed by values belonging to one of several domains.

The Conceptual Model for a given application is specified by means of a set of logical assertions that express interrelationships between concepts, relationships, and attributes. In particular, the richness of DLR allows for expressing:

• disjointness and covering constraints, and more generally Boolean combinations between entities and relationships;
• universal and existential qualification of concepts and relationship components;
• participation and functionality
constraints, and more complex forms of cardinality constraints;
• ISA relationships between entities and relationships; and
• definitions (expressing necessary and sufficient conditions) of entities and relationships in terms of other entities and relationships.

These features make DLR powerful enough to express not only the ER model, but also other conceptual, semantic, and object-oriented data models. Moreover, DLR assertions provide a simple and effective declarative mechanism to express the dependencies that hold between entities and relationships in different sources.2,41 The use of inter-model assertions allows for an incremental approach to the integration of the conceptual models of the sources and of the enterprise.37,42

One distinguishing feature of DLR is that sound and complete reasoning algorithms are available.43 Exploiting such algorithms, we gain the possibility of reasoning about the Conceptual Model. In particular, we can automatically verify the following properties:

• Consistency of the Conceptual Model, i.e. whether there exists a database satisfying all constraints expressed in the Conceptual Model.
• Concept (relationship) satisfiability, i.e. whether there exists a database that satisfies the Conceptual Model in which a certain entity (relationship) has a nonempty extension.
• Entity (relationship) subsumption, i.e. whether the extension of an entity (relationship) is a subset of the extension of another entity (relationship) in every database satisfying the Conceptual Model.
• Constraint inference, i.e. whether a certain constraint holds for all databases satisfying the Conceptual Model.

Such reasoning services support the designer in the construction process of the Conceptual Model: they can be used, for instance, for inferring inclusion between entities and relationships, and for detecting inconsistencies and redundancies.

We show how to formalize in DLR a simple ER schema, which we will use as our running example. A full-fledged example of our methodology can be
found in a case study from the telecommunication domain.44,45

Example 1. The schema shown in Fig. 2 represents persons, partitioned into males and females, and the parent-child relationship. The following set of assertions exactly captures the ER schema in the figure:

Person ⊑ (= 1 name) ⊓ ∀name.String
Person ⊑ (= 1 ssn) ⊓ ∀ssn.SSNString
Person ⊑ (= 1 dob) ⊓ ∀dob.Date
Person ⊑ (= 1 income) ⊓ ∀income.Money
Person ≡ Female ⊔ Male
Female ⊑ ¬Male
CHILD ⊑ ($1: Person) ⊓ ($2: Person)

The first four assertions specify the existence and the domain of the attributes of Person. The next two assertions specify that persons are partitioned into females and males. The last assertion specifies the typing of the CHILD relationship.

We could also add constraints not expressible in the ER model, such as introducing a further entity MotherWith3Sons denoting mothers having at least three sons:

MotherWith3Sons ≡ Female ⊓ (≥ 3 [$1](CHILD ⊓ ($2: Male)))

2.1.2. Abstract domains

Rather than considering concrete domains, such as strings, integers, and reals, our approach is based on the use of abstract domains. Abstract domains may have an underlying concrete domain, but their use allows the designer to distinguish between the different meanings that values of the concrete domain may have. The properties and mutual inter-relationships between domains can be specified by means of domain assertions, each of which is expressed as an inclusion between Boolean combinations of domains. In particular, domain assertions allow one to express an ISA hierarchy between domains. We say that a domain assertion is satisfied if the inclusion between the corresponding Boolean combinations of domain extensions holds.

Fig. 2. Entity-relationship schema for parent-child relationship.
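As a concrete (if simplistic) illustration, satisfaction of domain assertions can be checked directly when each domain is assigned a finite set extension: an assertion D ⊑ D′ holds exactly when the extension of D is a subset of the extension of D′. The sketch below is a toy under stated assumptions: the domain names anticipate Example 2, and the sample values are invented for illustration.

```python
# Toy check of domain-assertion satisfaction over finite set extensions.
# An assertion (lhs, rhs) encodes lhs ⊑ rhs; it is satisfied when the
# extension of lhs is a subset of the extension of rhs.

def satisfies(assertions, ext):
    """Check that every assertion holds as set inclusion over `ext`."""
    return all(ext[lhs] <= ext[rhs] for lhs, rhs in assertions)

# An ISA hierarchy between money domains (illustrative names and values).
assertions = [("MoneyInLire", "Money"), ("MoneyInEuro", "Money")]

ext = {
    "Money":       {1000, 2000, 5.0, 10.0},
    "MoneyInLire": {1000, 2000},
    "MoneyInEuro": {5.0, 10.0},
}
print(satisfies(assertions, ext))  # True

# A value placed in MoneyInEuro but missing from Money violates the hierarchy.
ext["MoneyInEuro"].add(99.9)
print(satisfies(assertions, ext))  # False
```

In practice one reasons over the assertions themselves rather than enumerating concrete extensions; the propositional encoding discussed later in this section serves exactly that purpose.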
Example 2. Consider two attributes, A1 in a source and A2 in the Data Warehouse, both representing amounts of money. Rather than specifying that both attributes have values of type Real, the designer may specify that the domain of attribute A1 is MoneyInLire while the domain of attribute A2 is MoneyInEuro, which are both subsets of the domain Money (which possibly has Real as the underlying concrete domain). The relationship between the three domains can be expressed by means of the domain assertions:

MoneyInLire ⊑ Money
MoneyInEuro ⊑ Money ⊓ ¬MoneyInLire

In this way, it becomes possible to specify declaratively the difference between values of the two attributes, and take such knowledge into account for loading data from the source to the Data Warehouse.

Given a set of domain assertions, the designer may be interested in verifying several properties, such as:

• Satisfiability of the whole set of domain assertions, i.e. whether it is actually possible to assign to each domain an extension in such a way that all assertions are satisfied.
• Domain satisfiability, i.e. whether it is possible to assign to a domain a nonempty extension, under the condition that all assertions are satisfied.
• Domain subsumption, i.e. whether the extension of one domain is a subset of the extension of another domain, whenever all domain assertions are satisfied.

The presence of unsatisfiable domains reflects some error in the modeling process, and requires removing the unsatisfiable domain or revising the set of domain assertions. Similarly, the presence of equivalent domains may be an indication of redundancy. Typically, the designer is interested in automatically checking the above properties, and more generally, in automatically constructing a domain hierarchy reflecting the ISA relationships between domains that logically follow from the set of assertions.

Using DLR we can express and reason over the enriched Entity-Relationship model together with the domain hierarchy. However, if we are interested in reasoning about domains
only, we can use a more direct approach. Indeed, by conceiving each domain as a unary predicate, we can correctly represent a domain assertion D ⊑ D′ by the first-order logic formula ∀x · D(x) ⊃ D′(x). Since such a formula does not contain any quantifier besides the outermost ∀, it can be captured correctly by the propositional formula A ⊃ B, where we consider each domain as a propositional symbol. Therefore we can exploit techniques developed for propositional reasoning46–48 to perform inference on domains, and thus automatically check the desired properties resulting from the set of assertions.

2.2. Logical level

The logical level provides a description of the logical content of each source, called the Source Schema, and the logical content of the materialized views constituting the Data Warehouse, called the Data Warehouse Schema (see Fig. 1). Such schemas are intended to provide a structural description of the content of both the sources and the materialized views in the Data Warehouse.

A Source Schema is provided in terms of a set of relations using the relational model. The link between the logical representation and the conceptual representation of the source is formally defined by associating each relation with a query that describes its content in terms of the Conceptual Model. In other words, the logical content of a source relation is described as a view over the virtual database represented by the Conceptual Model, adopting the local-as-view approach. To map physical structures to logical structures we make use of suitable wrappers, which encapsulate the sources. The wrapper hides how the source actually stores its data, the data model it adopts, etc., and presents the source as a set of relations. In particular, we assume that all attributes in the relations are of interest to the Data Warehouse application (attributes that are not of interest are hidden by the wrapper). Relation attributes are thus modeled as either entity attributes or relationship
attributes in the Conceptual Model.

The Data Warehouse Schema, which expresses the logical content of the materialized views constituting the Data Warehouse, is provided in terms of a set of relations. Similarly to the case of the sources, each relation of the Data Warehouse Schema is described in terms of a query over the Conceptual Model.

From a technical point of view such queries are unions of conjunctive queries. More precisely, a query q over the Conceptual Model has the form:

T(x) ← q(x, y)

where the head T(x) defines the schema of the relation in terms of a name T and its arity, i.e. the number of columns (the number of components of x), and the body q(x, y) describes the content of the relation in terms of the Conceptual Model. The body has the form

conj1(x, y1) OR ··· OR conjm(x, ym)

where each conji(x, yi) is a conjunction of atoms, and x, yi are all the variables appearing in the conjunct (we use x to denote a tuple of variables x1, ..., xn, for some n). Each atom is of the form E(t), R(t), or A(t, t′), where the arguments are variables in x, yi or constants, and E, R, and A are respectively entities, relationships, and attributes appearing in the Conceptual Model. In the following, we will also consider queries whose body may contain special predicates that do not appear in the Conceptual Model.

The semantics of queries is as follows. Given a database that satisfies the Conceptual Model, a query

T(x) ← conj1(x, y1) OR ··· OR conjm(x, ym)

of arity n is interpreted as the set of n-tuples (d1, ..., dn), with each di an object of the database, such that, when substituting each di for xi, the formula

∃y1 · conj1(x, y1) OR ··· OR ∃ym · conjm(x, ym)

evaluates to true.

Suitable inference techniques allow for carrying out the following reasoning services on queries by taking into account the Conceptual Model43:

• Query containment. Given two relational queries q1 and q2 (of the same arity n) over the Conceptual Model, we say that q1 is contained in q2 if the set of tuples denoted by q1 is contained in the set
of tuples denoted by q2 in every database satisfying the Conceptual Model.
• Query consistency. A relational query q over the Conceptual Model is consistent if there exists a database satisfying the Conceptual Model in which the set of tuples denoted by q is not empty.
• Query disjointness. Two relational queries q1 and q2 (of the same arity) over the Conceptual Model are disjoint if the intersection of the set of tuples denoted by q1 and the set of tuples denoted by q2 is empty in every database satisfying the Conceptual Model.

3. Source and Data Warehouse Logical Schema Descriptions

The notion of query over the Conceptual Model is a powerful tool for modeling the logical level of the Sources and the Data Warehouse. As mentioned above, we express the relational tables constituting both the Data Warehouse Schema and the Source Schemas in terms of queries over the Conceptual Model, with the following characteristics:

• Relational tables are composed of tuples of values, which are the only kind of objects at the logical level. Therefore, each variable in the head of the query represents a value (not a conceptual object).
• Each variable appearing in the body of the query either denotes a conceptual object or a value, depending on the atoms in which it appears. Since, in each database that satisfies the Conceptual Model, conceptual objects and values are disjoint sets, no query can contain a variable which can be instantiated by both a conceptual object and a value.
• Each conceptual object is represented by a tuple of values at the logical level. Thus, a mechanism is needed to express this kind of correspondence between a tuple of values and the conceptual object it represents. This is taken into account by the notion of adornment introduced below.
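To make the query semantics of Sec. 2.2 concrete, the sketch below evaluates a union of conjunctive queries by naive enumeration over a toy set of facts. It is an illustrative assumption-laden toy, not the paper's machinery: it ignores the constraints of the Conceptual Model, all individuals and facts are invented, and uppercase argument names are treated as variables by convention.

```python
from itertools import product

# Invented facts over the running example's vocabulary (entities Person,
# Female, Male and the binary CHILD relationship).
database = {
    ("Person", ("ann",)), ("Person", ("bob",)), ("Person", ("carl",)),
    ("Female", ("ann",)), ("Male", ("bob",)), ("Male", ("carl",)),
    ("CHILD", ("ann", "bob")), ("CHILD", ("ann", "carl")),
}
individuals = {"ann", "bob", "carl"}

def holds(atoms, binding):
    """True if every atom is a fact of the database under the binding."""
    return all(
        (pred, tuple(binding.get(a, a) for a in args)) in database
        for pred, args in atoms
    )

def answer(head_vars, disjuncts):
    """Answer T(head_vars) <- conj_1 OR ... OR conj_m by naive enumeration:
    a head tuple is in the result if some disjunct has a satisfying
    assignment of its (existential) variables."""
    result = set()
    for conj in disjuncts:
        # Uppercase names are variables; everything else is a constant.
        variables = sorted({a for _, args in conj for a in args if a.isupper()})
        for values in product(individuals, repeat=len(variables)):
            binding = dict(zip(variables, values))
            if holds(conj, binding):
                result.add(tuple(binding[v] for v in head_vars))
    return result

# Mother(X) <- Female(X) AND CHILD(X, Y), with Y existentially quantified.
mother = [[("Female", ("X",)), ("CHILD", ("X", "Y"))]]
print(answer(["X"], mother))  # {('ann',)}
```

Enumerating all bindings is exponential in the number of variables; it serves here only to spell out the "evaluates to true under some assignment of the existential variables" semantics, not as a realistic evaluation strategy.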