What is Data Mining?

Simply stated, data mining refers to extracting or “mining” knowledge from large amounts of data. The term is actually a misnomer. Remember that the mining of gold from rocks or sand is referred to as gold mining rather than rock or sand mining. Thus, “data mining” should have been more appropriately named “knowledge mining from data”, which is unfortunately somewhat long. “Knowledge mining”, a shorter term, may not reflect the emphasis on mining from large amounts of data. Nevertheless, mining is a vivid term characterizing the process that finds a small set of precious nuggets from a great deal of raw material. Thus, this misnomer, which carries both “data” and “mining”, became a popular choice. There are many other terms carrying a similar or slightly different meaning to data mining, such as knowledge mining from databases, knowledge extraction, data/pattern analysis, data archaeology, and data dredging.

Many people treat data mining as a synonym for another popularly used term, “Knowledge Discovery in Databases”, or KDD. Alternatively, others view data mining as simply an essential step in the process of knowledge discovery in databases. Knowledge discovery consists of an iterative sequence of the following steps:

· data cleaning: to remove noise or irrelevant data,

· data integration: where multiple data sources may be combined,

· data selection: where data relevant to the analysis task are retrieved from the database,

· data transformation: where data are transformed or consolidated into forms appropriate for mining, for instance by performing summary or aggregation operations,

· data mining: an essential process where intelligent methods are applied in order to extract data patterns,

· pattern evaluation: to identify the truly interesting patterns representing knowledge based on some interestingness measures, and

· knowledge presentation: where visualization and knowledge representation techniques are used to present the mined knowledge to the user.

The data mining step may interact with the user or a knowledge base. The interesting patterns are presented to the user, and may be stored as new knowledge in the knowledge base. Note that according to this view, data mining is only one step in the entire process, albeit an essential one since it uncovers hidden patterns for evaluation.
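The sequence of steps listed above can be sketched as a chain of small functions. This is only an illustrative sketch, not a method prescribed by the text: the two toy data sources, the function names, the pair-counting miner, and the 0.5 support threshold are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw records from two sources; None marks a noisy entry.
SOURCE_A = [
    {"id": 1, "items": ["bread", "butter", "milk"]},
    {"id": 2, "items": ["bread", None, "milk"]},
]
SOURCE_B = [
    {"id": 3, "items": ["Bread", "butter"]},
    {"id": 4, "items": ["beer"]},
    {"id": 5, "items": ["bread", "butter", "milk"]},
]


def clean(records):
    """Data cleaning: drop missing item entries."""
    return [{**r, "items": [i for i in r["items"] if i is not None]}
            for r in records]


def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for src in sources for r in src]


def select(records, min_items=2):
    """Data selection: keep records usable for a co-occurrence analysis."""
    return [r for r in records if len(r["items"]) >= min_items]


def transform(records):
    """Data transformation: turn each record into a set of lowercase items."""
    return [frozenset(i.lower() for i in r["items"]) for r in records]


def mine(transactions):
    """Data mining: count how often each pair of items occurs together."""
    counts = Counter()
    for t in transactions:
        counts.update(combinations(sorted(t), 2))
    return counts


def evaluate(pair_counts, n_transactions, min_support=0.5):
    """Pattern evaluation: keep pairs whose support meets the threshold."""
    return {pair: count / n_transactions
            for pair, count in pair_counts.items()
            if count / n_transactions >= min_support}


def present(patterns):
    """Knowledge presentation: render the patterns as readable lines."""
    return [f"{a} & {b}: support {s:.2f}"
            for (a, b), s in sorted(patterns.items())]


def run_pipeline():
    records = integrate(clean(SOURCE_A), clean(SOURCE_B))
    transactions = transform(select(records))
    return present(evaluate(mine(transactions), len(transactions)))
```

Calling `run_pipeline()` on the toy data returns three lines, one per item pair that meets the threshold. In a real system the process is iterative: the output of pattern evaluation would typically send the analyst back to an earlier step, such as selection or transformation.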

We agree that data mining is a knowledge discovery process. However, in industry, in media, and in the database research milieu, the term “data mining” is becoming more popular than the longer term “knowledge discovery in databases”. Therefore, in this book, we choose to use the term “data mining”. We adopt a broad view of data mining functionality: data mining is the process of discovering interesting knowledge from large amounts of data stored either in databases, data warehouses, or other information repositories.

Based on this view, the architecture of a typical data mining system may have the following major components:

1. Database, data warehouse, or other information repository. This is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories. Data cleaning and data integration techniques may be performed on the data.

2. Database or data warehouse server. The database or data warehouse server is responsible for fetching the relevant data, based on the user’s data mining request.

3. Knowledge base. This is the domain knowledge that is used to guide the search, or evaluate the interestingness of resulting patterns. Such knowledge can include concept hierarchies, used to organize attributes or attribute values into different levels of abstraction. Knowledge such as user beliefs, which can be used to assess a pattern’s interestingness based on its unexpectedness, may also be included. Other examples of domain knowledge are additional interestingness constraints or thresholds, and metadata (e.g., describing data from multiple heterogeneous sources).

4. Data mining engine. This is essential to the data mining system and ideally consists of a set of functional modules for tasks such as characterization, association analysis, classification, evolution and deviation analysis.

5. Pattern evaluation module. This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search towards interesting patterns. It may access interestingness thresholds stored in the knowledge base. Alternatively, the pattern evaluation module may be integrated with the mining module, depending on the implementation of the data mining method used. For efficient data mining, it is highly recommended to push the evaluation of pattern interestingness as deep as possible into the mining process so as to confine the search to only the interesting patterns.

6. Graphical user interface. This module communicates between users and the data mining system, allowing the user to interact with the system by specifying a data mining query or task, providing information to help focus the search, and performing exploratory data mining based on the intermediate data mining results. In addition, this component allows the user to browse database and data warehouse schemas or data structures, evaluate mined patterns, and visualize the patterns in different forms.
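The recommendation in component 5, to push interestingness evaluation deep into the mining process, can be illustrated with a small sketch. The technique chosen here is a simplified level-wise frequent-itemset search in the spirit of Apriori, which the text above does not name; it omits the usual subset-check pruning step. The `KNOWLEDGE_BASE` dictionary, the transactions, and the function names are hypothetical.

```python
from itertools import combinations

# Hypothetical knowledge-base entry: an interestingness threshold.
KNOWLEDGE_BASE = {"min_support": 0.5}

TRANSACTIONS = [
    frozenset({"bread", "butter", "milk"}),
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"beer", "bread"}),
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def frequent_itemsets(transactions, min_support):
    """Level-wise search that drops infrequent itemsets before extending them."""
    items = sorted({i for t in transactions for i in t})
    level = [frozenset({i}) for i in items]
    result = {}
    while level:
        # Interestingness is evaluated inside the search loop,
        # not in a separate pass over all possible itemsets.
        kept = {}
        for itemset in level:
            sup = support(itemset, transactions)
            if sup >= min_support:
                kept[itemset] = sup
        result.update(kept)
        # Larger candidates are built only from itemsets that survived,
        # so uninteresting branches are never expanded.
        survivors = list(kept)
        size = len(survivors[0]) + 1 if survivors else 0
        level = list({a | b for a, b in combinations(survivors, 2)
                      if len(a | b) == size})
    return result
```

With the threshold read from the knowledge base, `frequent_itemsets(TRANSACTIONS, KNOWLEDGE_BASE["min_support"])` never builds a candidate containing `beer`, because that item is discarded at the first level. The alternative design, generating every itemset first and filtering afterwards, gives the same answer but examines far more candidates as the number of items grows.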

From a data warehouse perspective, data mining can be viewed as an advanced stage of on-line analytical processing (OLAP). However, data mining goes far beyond the narrow scope of summarization-style analytical processing of data warehouse systems by incorporating more advanced techniques for data understanding.

While there may be many “data mining systems” on the market, not all of them can perform true data mining. A data analysis system that does not handle large amounts of data can at most be categorized as a machine learning system, a statistical data analysis tool, or an experimental system prototype. A system that can only perform data or information retrieval, including finding aggregate values, or that performs deductive query answering in large databases should be more appropriately categorized as either a database system, an information retrieval system, or a deductive database system.

Data mining involves an integration of techniques from multiple disciplines such as database technology, statistics, machine learning, high performance computing, pattern recognition, neural networks, data visualization, information retrieval, image and signal processing, and spatial data analysis. We adopt a database perspective in our presentation of data mining in this book. That is, emphasis is placed on efficient and scalable data mining techniques for large databases. By performing data mining, interesting knowledge, regularities, or high-level information can be extracted from databases and viewed or browsed from different angles. The discovered knowledge can be applied to decision making, process control, information management, query processing, and so on. Therefore, data mining is considered one of the most important frontiers in database systems and one of the most promising new database applications in the information industry.

A classification of data mining systems

Data mining is an interdisciplinary field, the confluence of a set of disciplines, including database systems, statistics, machine learning, visualization, and information science. Moreover, depending on the data mining approach used, techniques from other disciplines may be applied, such as neural networks, fuzzy and/or rough set theory, knowledge representation, inductive logic programming, or high performance computing. Depending on the kinds of data to be mined or on the given data mining application, the data mining system may also integrate techniques from spatial data analysis, information retrieval, pattern recognition, image analysis, signal processing, computer graphics, Web technology, economics, or psychology.

Because of the diversity of disciplines contributing to data mining, data mining research is expected to generate a large variety of data mining systems. Therefore, it is necessary to provide a clear classification of data mining systems. Such a classification may help potential users distinguish data mining systems and identify those that best match their needs. Data mining systems can be categorized according to various criteria, as follows.

1) Classification according to the kinds of databases mined.

A data mining system can be classified according to the kinds of databases mined. Database systems themselves can be classified according to different criteria (such as data models, or the types of data or applications involved), each of which may require its own data mining technique. Data mining systems can therefore be classified accordingly.

For instance, if classifying according to data models, we may have a relational, transactional, object-oriented, object-relational, or data warehouse mining system. If classifying according to the special types of data handled, we may have a spatial, time-series, text, or multimedia data mining system, or a World-Wide Web mining system. Other system types include heterogeneous data mining systems and legacy data mining systems.

2) Classification according to the kinds of knowledge mined.

Data mining systems can be categorized according to the kinds of knowledge they mine, i.e., based on data mining functionalities, such as characterization, discrimination, association, classification, clustering, trend and evolution analysis, deviation analysis, similarity analysis, etc. A comprehensive data mining system usually provides multiple and/or integrated data mining functionalities.

Moreover, data mining systems can also be distinguished based on the granularity or levels of abstraction of the knowledge mined, including generalized knowledge (at a high level of abstraction), primitive-level knowledge (at a raw data level), or knowledge at multiple levels (considering several levels of abstraction). An advanced data mining system should facilitate the discovery of knowledge at multiple levels of abstraction.
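To make the idea of levels of abstraction concrete, the following minimal sketch summarizes the same raw values at the primitive level and at two generalized levels by climbing a concept hierarchy. The hierarchy, the purchase list, and the function names are invented for illustration and do not come from the text.

```python
from collections import Counter

# Hypothetical concept hierarchy: each value maps to its parent concept.
HIERARCHY = {
    "whole milk": "milk",
    "skim milk": "milk",
    "rye bread": "bread",
    "white bread": "bread",
    "milk": "food",
    "bread": "food",
}

# Hypothetical primitive-level (raw) data.
PURCHASES = ["whole milk", "skim milk", "rye bread", "whole milk", "white bread"]


def generalize(value, levels):
    """Climb the hierarchy by at most `levels` steps, stopping at the top."""
    for _ in range(levels):
        if value not in HIERARCHY:
            break
        value = HIERARCHY[value]
    return value


def summarize(values, levels):
    """Count the values after generalizing each one by `levels` steps."""
    return Counter(generalize(v, levels) for v in values)
```

Here `summarize(PURCHASES, 0)` counts the raw values, `summarize(PURCHASES, 1)` counts purchases per product category, and `summarize(PURCHASES, 2)` collapses everything into the single top-level concept. A system that mines at multiple levels would look for patterns in each of these views rather than in only one.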

3) Classification according to the kinds of techniques utilized.

Data mining systems can also be categorized according to the underlying data mining techniques employed. These techniques can be described according to the degree of user interaction involved (e.g., autonomous systems, interactive exploratory systems, query-driven systems), or the methods of data analysis employed (e.g., database-oriented or data warehouse-oriented techniques, machine learning, statistics, visualization, pattern recognition, neural networks, and so on). A sophisticated data mining system will often adopt multiple data mining techniques or work out an effective, integrated technique which combines the merits of a few individual approaches.











