Knowledge Discovery in Databases: An Overview
From Data Mining to Knowledge Discovery in Databases

Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth

Copyright © 1996, American Association for Artificial Intelligence. All rights reserved.

Data mining and knowledge discovery in databases have been attracting a significant amount of research, industry, and media attention of late. What is all the excitement about? This article provides an overview of this emerging field, clarifying how data mining and knowledge discovery in databases are related both to each other and to related fields, such as machine learning, statistics, and databases. The article mentions particular real-world applications, specific data-mining techniques, challenges involved in real-world applications of knowledge discovery, and current and future research directions in the field.

Across a wide variety of fields, data are being collected and accumulated at a dramatic pace. There is an urgent need for a new generation of computational theories and tools to assist humans in extracting useful information (knowledge) from the rapidly growing volumes of digital data. These theories and tools are the subject of the emerging field of knowledge discovery in databases (KDD). At an abstract level, the KDD field is concerned with the development of methods and techniques for making sense of data. The basic problem addressed by the KDD process is one of mapping low-level data (which are typically too voluminous to understand and digest easily) into other forms that might be more compact (for example, a short report), more abstract (for example, a descriptive approximation or model of the process that generated the data), or more useful (for example, a predictive model for estimating the value of future cases). At the core of the process is the application of specific data-mining methods for pattern discovery and extraction.

This article begins by discussing the historical context of KDD and data mining and their intersection with other related fields. A brief summary of recent KDD real-world applications is provided. Definitions of KDD and data mining are provided, and the general multistep KDD process is outlined. This multistep process has the application of data-mining algorithms as one particular step in the process. The data-mining step is discussed in more detail in the context of specific data-mining algorithms and their application. Real-world practical application issues are also outlined. Finally, the article enumerates challenges for future research and development and in particular discusses potential opportunities for AI technology in KDD systems.

Why Do We Need KDD?

The traditional method of turning data into knowledge relies on manual analysis and interpretation. For example, in the health-care industry, it is common for specialists to periodically analyze current trends and changes in health-care data, say, on a quarterly basis. The specialists then provide a report detailing the analysis to the sponsoring health-care organization; this report becomes the basis for future decision making and planning for health-care management. In a totally different type of application, planetary geologists sift through remotely sensed images of planets and asteroids, carefully locating and cataloging such geologic objects of interest as impact craters. Be it science, marketing, finance, health care, retail, or any other field, the classical approach to data analysis relies fundamentally on one or more analysts becoming
0738-4602-1996 / $2.00areas is astronomy. Here, a notable success was achieved by SKICAT ,a system used by as-tronomers to perform image analysis,classification, and cataloging of sky objects from sky-survey images (Fayyad, Djorgovski,and Weir 1996). In its first application, the system was used to process the 3 terabytes (1012bytes) of image data resulting from the Second Palomar Observatory Sky Survey,where it is estimated that on the order of 109sky objects are detectable. SKICAT can outper-form humans and traditional computational techniques in classifying faint sky objects. See Fayyad, Haussler, and Stolorz (1996) for a sur-vey of scientific applications.In business, main KDD application areas includes marketing, finance (especially in-vestment), fraud detection, manufacturing,telecommunications, and Internet agents.Marketing:In marketing, the primary ap-plication is database marketing systems,which analyze customer databases to identify different customer groups and forecast their behavior. Business Week (Berry 1994) estimat-ed that over half of all retailers are using or planning to use database marketing, and those who do use it have good results; for ex-ample, American Express reports a 10- to 15-percent increase in credit-card use. Another notable marketing application is market-bas-ket analysis (Agrawal et al. 1996) systems,which find patterns such as, “If customer bought X, he/she is also likely to buy Y and Z.” Such patterns are valuable to retailers.Investment: Numerous companies use da-ta mining for investment, but most do not describe their systems. One exception is LBS Capital Management. Its system uses expert systems, neural nets, and genetic algorithms to manage portfolios totaling $600 million;since its start in 1993, the system has outper-formed the broad stock market (Hall, Mani,and Barr 1996).Fraud detection: HNC Falcon and Nestor PRISM systems are used for monitoring credit-card fraud, watching over millions of ac-counts. The FAIS system (Senator et al. 1995),from the U.S. Treasury Financial Crimes En-forcement Network, is used to identify finan-cial transactions that might indicate money-laundering activity.Manufacturing: The CASSIOPEE trou-bleshooting system, developed as part of a joint venture between General Electric and SNECMA, was applied by three major Euro-pean airlines to diagnose and predict prob-lems for the Boeing 737. To derive families of faults, clustering methods are used. CASSIOPEE received the European first prize for innova-intimately familiar with the data and serving as an interface between the data and the users and products.For these (and many other) applications,this form of manual probing of a data set is slow, expensive, and highly subjective. In fact, as data volumes grow dramatically, this type of manual data analysis is becoming completely impractical in many domains.Databases are increasing in size in two ways:(1) the number N of records or objects in the database and (2) the number d of fields or at-tributes to an object. Databases containing on the order of N = 109objects are becoming in-creasingly common, for example, in the as-tronomical sciences. Similarly, the number of fields d can easily be on the order of 102or even 103, for example, in medical diagnostic applications. Who could be expected to di-gest millions of records, each having tens or hundreds of fields? 
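The market-basket patterns described earlier in this section ("if a customer bought X, he or she is also likely to buy Y and Z") can be made concrete with a small brute-force sketch that scores candidate rules by their support (how often the items co-occur) and confidence (an estimate of the conditional probability). The toy baskets and thresholds below are invented for illustration; production systems of the kind cited above (Agrawal et al. 1996) use far more efficient search algorithms over millions of transactions.

```python
from itertools import combinations

# Toy baskets; in a real retailer database these would be millions of transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset):
    """Fraction of baskets containing every item in `itemset`."""
    return sum(set(itemset) <= basket for basket in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Estimate of P(consequent in basket | antecedent in basket)."""
    return support(antecedent | consequent) / support(antecedent)

# Brute-force search over single-item rules x -> y above minimum thresholds.
items = sorted(set().union(*transactions))
for x, y in combinations(items, 2):
    for a, c in ((x, y), (y, x)):
        s, conf = support({a, c}), confidence({a}, {c})
        if s >= 0.4 and conf >= 0.7:
            print(f"if a customer bought {a}, he or she also bought {c} "
                  f"(support {s:.2f}, confidence {conf:.2f})")
```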
We believe that this job is certainly not one for humans; hence, analysis work needs to be automated, at least partially.The need to scale up human analysis capa-bilities to handling the large number of bytes that we can collect is both economic and sci-entific. Businesses use data to gain competi-tive advantage, increase efficiency, and pro-vide more valuable services to customers.Data we capture about our environment are the basic evidence we use to build theories and models of the universe we live in. Be-cause computers have enabled humans to gather more data than we can digest, it is on-ly natural to turn to computational tech-niques to help us unearth meaningful pat-terns and structures from the massive volumes of data. Hence, KDD is an attempt to address a problem that the digital informa-tion era made a fact of life for all of us: data overload.Data Mining and Knowledge Discovery in the Real WorldA large degree of the current interest in KDD is the result of the media interest surrounding successful KDD applications, for example, the focus articles within the last two years in Business Week , Newsweek , Byte , PC Week , and other large-circulation periodicals. Unfortu-nately, it is not always easy to separate fact from media hype. Nonetheless, several well-documented examples of successful systems can rightly be referred to as KDD applications and have been deployed in operational use on large-scale real-world problems in science and in business.In science, one of the primary applicationThere is an urgent need for a new generation of computation-al theories and tools toassist humans in extractinguseful information (knowledge)from the rapidly growing volumes ofdigital data.Articles38AI MAGAZINEtive applications (Manago and Auriol 1996).Telecommunications: The telecommuni-cations alarm-sequence analyzer (TASA) wasbuilt in cooperation with a manufacturer oftelecommunications equipment and threetelephone networks (Mannila, Toivonen, andVerkamo 1995). The system uses a novelframework for locating frequently occurringalarm episodes from the alarm stream andpresenting them as rules. Large sets of discov-ered rules can be explored with flexible infor-mation-retrieval tools supporting interactivityand iteration. In this way, TASA offers pruning,grouping, and ordering tools to refine the re-sults of a basic brute-force search for rules.Data cleaning: The MERGE-PURGE systemwas applied to the identification of duplicatewelfare claims (Hernandez and Stolfo 1995).It was used successfully on data from the Wel-fare Department of the State of Washington.In other areas, a well-publicized system isIBM’s ADVANCED SCOUT,a specialized data-min-ing system that helps National Basketball As-sociation (NBA) coaches organize and inter-pret data from NBA games (U.S. News 1995). ADVANCED SCOUT was used by several of the NBA teams in 1996, including the Seattle Su-personics, which reached the NBA finals.Finally, a novel and increasingly importanttype of discovery is one based on the use of in-telligent agents to navigate through an infor-mation-rich environment. Although the ideaof active triggers has long been analyzed in thedatabase field, really successful applications ofthis idea appeared only with the advent of theInternet. These systems ask the user to specifya profile of interest and search for related in-formation among a wide variety of public-do-main and proprietary sources. 
For example, FIREFLY is a personal music-recommendation agent: It asks a user his/her opinion of several music pieces and then suggests other music that the user might like (<http://www.ffl/>). CRAYON (/>) allows users to create their own free newspaper (supported by ads); NEWSHOUND (<http://www./hound/>) from the San Jose Mercury News and FARCAST (</>) automatically search information from a wide variety of sources, including newspapers and wire services, and e-mail relevant documents directly to the user.

These are just a few of the numerous such systems that use KDD techniques to automatically produce useful information from large masses of raw data. See Piatetsky-Shapiro et al. (1996) for an overview of issues in developing industrial KDD applications.

Data Mining and KDD

Historically, the notion of finding useful patterns in data has been given a variety of names, including data mining, knowledge extraction, information discovery, information harvesting, data archaeology, and data pattern processing. The term data mining has mostly been used by statisticians, data analysts, and the management information systems (MIS) communities. It has also gained popularity in the database field. The phrase knowledge discovery in databases was coined at the first KDD workshop in 1989 (Piatetsky-Shapiro 1991) to emphasize that knowledge is the end product of a data-driven discovery. It has been popularized in the AI and machine-learning fields.

In our view, KDD refers to the overall process of discovering useful knowledge from data, and data mining refers to a particular step in this process. Data mining is the application of specific algorithms for extracting patterns from data. The distinction between the KDD process and the data-mining step (within the process) is a central point of this article. The additional steps in the KDD process, such as data preparation, data selection, data cleaning, incorporation of appropriate prior knowledge, and proper interpretation of the results of mining, are essential to ensure that useful knowledge is derived from the data. Blind application of data-mining methods (rightly criticized as data dredging in the statistical literature) can be a dangerous activity, easily leading to the discovery of meaningless and invalid patterns.

The Interdisciplinary Nature of KDD

KDD has evolved, and continues to evolve, from the intersection of research fields such as machine learning, pattern recognition, databases, statistics, AI, knowledge acquisition for expert systems, data visualization, and high-performance computing. The unifying goal is extracting high-level knowledge from low-level data in the context of large data sets.

The data-mining component of KDD currently relies heavily on known techniques from machine learning, pattern recognition, and statistics to find patterns from data in the data-mining step of the KDD process. A natural question is, How is KDD different from pattern recognition or machine learning (and related fields)? The answer is that these fields provide some of the data-mining methods that are used in the data-mining step of the KDD process. KDD focuses on the overall process of knowledge discovery from data, including how the data are stored and accessed, how algorithms can be scaled to massive data sets

A driving force behind KDD is the database field (the second D in KDD).
Indeed, the problem of effective data manipulation when data cannot fit in the main memory is of fun-damental importance to KDD. Database tech-niques for gaining efficient data access,grouping and ordering operations when ac-cessing data, and optimizing queries consti-tute the basics for scaling algorithms to larger data sets. Most data-mining algorithms from statistics, pattern recognition, and machine learning assume data are in the main memo-ry and pay no attention to how the algorithm breaks down if only limited views of the data are possible.A related field evolving from databases is data warehousing,which refers to the popular business trend of collecting and cleaning transactional data to make them available for online analysis and decision support. Data warehousing helps set the stage for KDD in two important ways: (1) data cleaning and (2)data access.Data cleaning: As organizations are forced to think about a unified logical view of the wide variety of data and databases they pos-sess, they have to address the issues of map-ping data to a single naming convention,uniformly representing and handling missing data, and handling noise and errors when possible.Data access: Uniform and well-defined methods must be created for accessing the da-ta and providing access paths to data that were historically difficult to get to (for exam-ple, stored offline).Once organizations and individuals have solved the problem of how to store and ac-cess their data, the natural next step is the question, What else do we do with all the da-ta? This is where opportunities for KDD natu-rally arise.A popular approach for analysis of data warehouses is called online analytical processing (OLAP), named for a set of principles pro-posed by Codd (1993). OLAP tools focus on providing multidimensional data analysis,which is superior to SQL in computing sum-maries and breakdowns along many dimen-sions. OLAP tools are targeted toward simpli-fying and supporting interactive data analysis,but the goal of KDD tools is to automate as much of the process as possible. Thus, KDD is a step beyond what is currently supported by most standard database systems.Basic DefinitionsKDD is the nontrivial process of identifying valid, novel, potentially useful, and ultimate-and still run efficiently, how results can be in-terpreted and visualized, and how the overall man-machine interaction can usefully be modeled and supported. The KDD process can be viewed as a multidisciplinary activity that encompasses techniques beyond the scope of any one particular discipline such as machine learning. In this context, there are clear opportunities for other fields of AI (be-sides machine learning) to contribute to KDD. KDD places a special emphasis on find-ing understandable patterns that can be inter-preted as useful or interesting knowledge.Thus, for example, neural networks, although a powerful modeling tool, are relatively difficult to understand compared to decision trees. KDD also emphasizes scaling and ro-bustness properties of modeling algorithms for large noisy data sets.Related AI research fields include machine discovery, which targets the discovery of em-pirical laws from observation and experimen-tation (Shrager and Langley 1990) (see Kloes-gen and Zytkow [1996] for a glossary of terms common to KDD and machine discovery),and causal modeling for the inference of causal models from data (Spirtes, Glymour,and Scheines 1993). 
Statistics in particular has much in common with KDD (see Elder and Pregibon [1996] and Glymour et al.[1996] for a more detailed discussion of this synergy). Knowledge discovery from data is fundamentally a statistical endeavor. Statistics provides a language and framework for quan-tifying the uncertainty that results when one tries to infer general patterns from a particu-lar sample of an overall population. As men-tioned earlier, the term data mining has had negative connotations in statistics since the 1960s when computer-based data analysis techniques were first introduced. The concern arose because if one searches long enough in any data set (even randomly generated data),one can find patterns that appear to be statis-tically significant but, in fact, are not. Clearly,this issue is of fundamental importance to KDD. Substantial progress has been made in recent years in understanding such issues in statistics. Much of this work is of direct rele-vance to KDD. Thus, data mining is a legiti-mate activity as long as one understands how to do it correctly; data mining carried out poorly (without regard to the statistical as-pects of the problem) is to be avoided. KDD can also be viewed as encompassing a broader view of modeling than statistics. KDD aims to provide tools to automate (to the degree pos-sible) the entire process of data analysis and the statistician’s “art” of hypothesis selection.Data mining is a step in the KDD process that consists of ap-plying data analysis and discovery al-gorithms that produce a par-ticular enu-meration ofpatterns (or models)over the data.Articles40AI MAGAZINEly understandable patterns in data (Fayyad, Piatetsky-Shapiro, and Smyth 1996).Here, data are a set of facts (for example, cases in a database), and pattern is an expres-sion in some language describing a subset of the data or a model applicable to the subset. Hence, in our usage here, extracting a pattern also designates fitting a model to data; find-ing structure from data; or, in general, mak-ing any high-level description of a set of data. The term process implies that KDD comprises many steps, which involve data preparation, search for patterns, knowledge evaluation, and refinement, all repeated in multiple itera-tions. By nontrivial, we mean that some search or inference is involved; that is, it is not a straightforward computation of predefined quantities like computing the av-erage value of a set of numbers.The discovered patterns should be valid on new data with some degree of certainty. We also want patterns to be novel (at least to the system and preferably to the user) and poten-tially useful, that is, lead to some benefit to the user or task. Finally, the patterns should be understandable, if not immediately then after some postprocessing.The previous discussion implies that we can define quantitative measures for evaluating extracted patterns. In many cases, it is possi-ble to define measures of certainty (for exam-ple, estimated prediction accuracy on new data) or utility (for example, gain, perhaps indollars saved because of better predictions orspeedup in response time of a system). No-tions such as novelty and understandabilityare much more subjective. In certain contexts,understandability can be estimated by sim-plicity (for example, the number of bits to de-scribe a pattern). 
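As a hedged illustration of the quantitative measures just described, the sketch below attaches a certainty, a utility, and a simplicity value to a few candidate patterns and combines them into a single ranking score. The example rules, numbers, and weights are assumptions of this sketch rather than anything prescribed by the article, which leaves such weighting to the user and the application.

```python
candidate_patterns = [
    # certainty: estimated accuracy on held-out data; utility: e.g., dollars saved;
    # simplicity: here just 1 / (number of conditions in the rule)
    {"rule": "income > 50k -> repaid",                "certainty": 0.81, "utility": 12000, "simplicity": 1.0},
    {"rule": "income > 50k and debt < 10k -> repaid", "certainty": 0.86, "utility": 14000, "simplicity": 0.5},
    {"rule": "zip code ends in 7 -> repaid",          "certainty": 0.52, "utility": 300,   "simplicity": 1.0},
]

def score(p, w_certainty=0.5, w_utility=0.3, w_simplicity=0.2, utility_scale=20000):
    # Weighted combination of the three measures; the weights are illustrative.
    return (w_certainty * p["certainty"]
            + w_utility * min(p["utility"] / utility_scale, 1.0)
            + w_simplicity * p["simplicity"])

for p in sorted(candidate_patterns, key=score, reverse=True):
    print(f"{score(p):.2f}  {p['rule']}")
```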
An important notion, calledinterestingness(for example, see Silberschatzand Tuzhilin [1995] and Piatetsky-Shapiro andMatheus [1994]), is usually taken as an overallmeasure of pattern value, combining validity,novelty, usefulness, and simplicity. Interest-ingness functions can be defined explicitly orcan be manifested implicitly through an or-dering placed by the KDD system on the dis-covered patterns or models.Given these notions, we can consider apattern to be knowledge if it exceeds some in-terestingness threshold, which is by nomeans an attempt to define knowledge in thephilosophical or even the popular view. As amatter of fact, knowledge in this definition ispurely user oriented and domain specific andis determined by whatever functions andthresholds the user chooses.Data mining is a step in the KDD processthat consists of applying data analysis anddiscovery algorithms that, under acceptablecomputational efficiency limitations, pro-duce a particular enumeration of patterns (ormodels) over the data. Note that the space ofArticlesFALL 1996 41Figure 1. An Overview of the Steps That Compose the KDD Process.methods, the effective number of variables under consideration can be reduced, or in-variant representations for the data can be found.Fifth is matching the goals of the KDD pro-cess (step 1) to a particular data-mining method. For example, summarization, clas-sification, regression, clustering, and so on,are described later as well as in Fayyad, Piatet-sky-Shapiro, and Smyth (1996).Sixth is exploratory analysis and model and hypothesis selection: choosing the data-mining algorithm(s) and selecting method(s)to be used for searching for data patterns.This process includes deciding which models and parameters might be appropriate (for ex-ample, models of categorical data are differ-ent than models of vectors over the reals) and matching a particular data-mining method with the overall criteria of the KDD process (for example, the end user might be more in-terested in understanding the model than its predictive capabilities).Seventh is data mining: searching for pat-terns of interest in a particular representa-tional form or a set of such representations,including classification rules or trees, regres-sion, and clustering. The user can significant-ly aid the data-mining method by correctly performing the preceding steps.Eighth is interpreting mined patterns, pos-sibly returning to any of steps 1 through 7 for further iteration. This step can also involve visualization of the extracted patterns and models or visualization of the data given the extracted models.Ninth is acting on the discovered knowl-edge: using the knowledge directly, incorpo-rating the knowledge into another system for further action, or simply documenting it and reporting it to interested parties. This process also includes checking for and resolving po-tential conflicts with previously believed (or extracted) knowledge.The KDD process can involve significant iteration and can contain loops between any two steps. The basic flow of steps (al-though not the potential multitude of itera-tions and loops) is illustrated in figure 1.Most previous work on KDD has focused on step 7, the data mining. However, the other steps are as important (and probably more so) for the successful application of KDD in practice. 
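A minimal, runnable skeleton of the multistep process outlined above is sketched below. Every helper is a deliberately simple stand-in chosen only to show where selection, cleaning, mining, and evaluation sit relative to one another; the grouping-and-summarization "mining" step, the field names, and the evaluation threshold are assumptions of this sketch, and the loops back to earlier steps that the article emphasizes are omitted.

```python
def select_target_data(db, fields):              # step 2: focus on a subset of fields
    return [{k: r[k] for k in fields} for r in db]

def clean_and_preprocess(records):               # step 3: drop records with missing values
    return [r for r in records if None not in r.values()]

def mine_group_summaries(records, by, measure):  # step 7: a simple summarization "pattern"
    groups = {}
    for r in records:
        groups.setdefault(r[by], []).append(r[measure])
    return {g: sum(v) / len(v) for g, v in groups.items()}

def evaluate(patterns, records, measure):        # step 8: keep groups far from the overall mean
    overall = sum(r[measure] for r in records) / len(records)
    return {g: m for g, m in patterns.items() if abs(m - overall) > 0.5 * overall}

def kdd_process(db, by, measure):
    data = clean_and_preprocess(select_target_data(db, [by, measure]))
    patterns = mine_group_summaries(data, by, measure)
    return evaluate(patterns, data, measure)     # step 9 (acting on the result) is left to the caller

claims = [{"region": "north", "cost": 120, "id": 1},
          {"region": "north", "cost": 100, "id": 2},
          {"region": "south", "cost": 400, "id": 3},
          {"region": "south", "cost": None, "id": 4}]
print(kdd_process(claims, by="region", measure="cost"))   # -> {'south': 400.0}
```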
Having defined the basic notions and introduced the KDD process, we now focus on the data-mining component,which has, by far, received the most atten-tion in the literature.patterns is often infinite, and the enumera-tion of patterns involves some form of search in this space. Practical computational constraints place severe limits on the sub-space that can be explored by a data-mining algorithm.The KDD process involves using the database along with any required selection,preprocessing, subsampling, and transforma-tions of it; applying data-mining methods (algorithms) to enumerate patterns from it;and evaluating the products of data mining to identify the subset of the enumerated pat-terns deemed knowledge. The data-mining component of the KDD process is concerned with the algorithmic means by which pat-terns are extracted and enumerated from da-ta. The overall KDD process (figure 1) in-cludes the evaluation and possible interpretation of the mined patterns to de-termine which patterns can be considered new knowledge. The KDD process also in-cludes all the additional steps described in the next section.The notion of an overall user-driven pro-cess is not unique to KDD: analogous propos-als have been put forward both in statistics (Hand 1994) and in machine learning (Brod-ley and Smyth 1996).The KDD ProcessThe KDD process is interactive and iterative,involving numerous steps with many deci-sions made by the user. Brachman and Anand (1996) give a practical view of the KDD pro-cess, emphasizing the interactive nature of the process. Here, we broadly outline some of its basic steps:First is developing an understanding of the application domain and the relevant prior knowledge and identifying the goal of the KDD process from the customer’s viewpoint.Second is creating a target data set: select-ing a data set, or focusing on a subset of vari-ables or data samples, on which discovery is to be performed.Third is data cleaning and preprocessing.Basic operations include removing noise if appropriate, collecting the necessary informa-tion to model or account for noise, deciding on strategies for handling missing data fields,and accounting for time-sequence informa-tion and known changes.Fourth is data reduction and projection:finding useful features to represent the data depending on the goal of the task. With di-mensionality reduction or transformationArticles42AI MAGAZINEThe Data-Mining Stepof the KDD ProcessThe data-mining component of the KDD pro-cess often involves repeated iterative applica-tion of particular data-mining methods. This section presents an overview of the primary goals of data mining, a description of the methods used to address these goals, and a brief description of the data-mining algo-rithms that incorporate these methods.The knowledge discovery goals are defined by the intended use of the system. We can distinguish two types of goals: (1) verification and (2) discovery. With verification,the sys-tem is limited to verifying the user’s hypothe-sis. With discovery,the system autonomously finds new patterns. We further subdivide the discovery goal into prediction,where the sys-tem finds patterns for predicting the future behavior of some entities, and description, where the system finds patterns for presenta-tion to a user in a human-understandableform. In this article, we are primarily con-cerned with discovery-oriented data mining.Data mining involves fitting models to, or determining patterns from, observed data. 
The fitted models play the role of inferred knowledge: Whether the models reflect useful or interesting knowledge is part of the overall, interactive KDD process where subjective human judgment is typically required. Two primary mathematical formalisms are used in model fitting: (1) statistical and (2) logical. The statistical approach allows for nondeterministic effects in the model, whereas a logical model is purely deterministic. We focus primarily on the statistical approach to data mining, which tends to be the most widely used basis for practical data-mining applications given the typical presence of uncertainty in real-world data-generating processes.

Most data-mining methods are based on tried and tested techniques from machine learning, pattern recognition, and statistics: classification, clustering, regression, and so on. The array of different algorithms under each of these headings can often be bewildering to both the novice and the experienced data analyst. It should be emphasized that of the many data-mining methods advertised in the literature, there are really only a few fundamental techniques. The actual underlying model representation being used by a particular method typically comes from a composition of a small number of well-known options: polynomials, splines, kernel and basis functions, threshold-Boolean functions, and so on. Thus, algorithms tend to differ primarily in the goodness-of-fit criterion used to evaluate model fit or in the search method used to find a good fit.

In our brief overview of data-mining methods, we try in particular to convey the notion that most (if not all) methods can be viewed as extensions or hybrids of a few basic techniques and principles. We first discuss the primary methods of data mining and then show that the data-mining methods can be viewed as consisting of three primary algorithmic components: (1) model representation, (2) model evaluation, and (3) search. In the discussion of KDD and data-mining methods, we use a simple example to make some of the notions more concrete. Figure 2 shows a simple two-dimensional artificial data set consisting of 23 cases. Each point on the graph represents a person who has been given a loan by a particular bank at some time in the past. The horizontal axis represents the income of the person; the vertical axis represents the total personal debt of the person (mortgage, car payments, and so on). The data have been classified into two classes: (1) the x's represent persons who have defaulted on their loans and (2) the o's represent persons whose loans are in good status with the bank. Thus, this simple artificial data set could represent a historical data set that can contain useful knowledge from the point of view of the bank making the loans. Note that in actual KDD applications, there are typically many more dimensions (as many as several hundreds) and many more data points (many thousands or even millions).

Figure 2. A Simple Data Set with Two Classes Used for Illustrative Purposes.
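To make the loan example concrete, the sketch below builds a small synthetic data set in the spirit of figure 2 (income and debt per borrower, with a default flag) and fits the simplest possible classifier to it: a single axis-parallel threshold, sometimes called a decision stump. The numbers and the choice of model are assumptions of this illustration, not taken from the article.

```python
cases = [  # (income in $1000s, debt in $1000s, defaulted?) -- invented data
    (18, 9, True), (22, 11, True), (25, 14, True), (30, 13, True), (35, 16, True),
    (28, 6, False), (40, 7, False), (45, 12, False), (55, 9, False), (62, 15, False),
    (70, 8, False), (33, 18, True), (50, 20, True), (75, 5, False), (20, 4, True),
]

def stump_accuracy(feature_index, threshold, predict_default_below):
    """Training accuracy of the rule: default iff feature <= threshold (or > threshold)."""
    correct = 0
    for case in cases:
        predicted = (case[feature_index] <= threshold) == predict_default_below
        correct += predicted == case[2]
    return correct / len(cases)

# Exhaustive search over both features, all observed thresholds, and both rule directions.
best = max(
    ((f, t, side) for f in (0, 1) for t in sorted({c[f] for c in cases}) for side in (True, False)),
    key=lambda s: stump_accuracy(*s),
)
names = {0: "income", 1: "debt"}
print(f"best single split: {names[best[0]]} <= {best[1]} predicts default={best[2]} "
      f"(training accuracy {stump_accuracy(*best):.2f})")
```

Even this tiny example already has the three algorithmic components discussed later: a model representation (one threshold on one feature), an evaluation criterion (training accuracy), and a search method (exhaustive enumeration).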
KDD: Knowledge Discovery in Databases

Knowledge discovery in databases (KDD) is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.

Knowledge discovery turns information into knowledge, finding the nuggets of knowledge hidden in the mountain of data, and it will contribute to knowledge innovation and the development of the knowledge economy.

The term appeared in 1989, and Fayyad defined KDD as "the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data."

Contents: 1. The basic KDD process; 2. Common KDD process models.

Detailed explanation

Research on knowledge discovery in databases (KDD) is very active.
The definition above involves several concepts that need further explanation: "data set," "pattern," "process," "validity," "novelty," "potential usefulness," and "ultimate understandability."

A data set is a set of facts F (for example, records in a relational database).

A pattern is an expression E in a language L that describes some subset F_E of the data set F; to qualify as a pattern, E must be simpler than an enumeration of the data subset F_E (that is, it must take less descriptive information).

In KDD, the process usually refers to a multistage procedure involving data preparation, pattern search, knowledge evaluation, and repeated refinement. The process is required to be nontrivial, meaning it must involve some degree of intelligence and autonomy (merely computing the sum of all the data does not count as a discovery process).

Validity means that a discovered pattern holds with some degree of certainty on new data.

Novelty requires that discovered patterns be new.

Potential usefulness means that the discovered knowledge has practical value, for example, improving economic returns when used in a decision-support system.

Ultimate understandability requires that discovered patterns be comprehensible to users; at present this is mainly reflected in simplicity.

Validity, novelty, potential usefulness, and ultimate understandability taken together are referred to as interestingness.

Because knowledge discovery is an interdisciplinary subject that attracts researchers from many different fields, it has gone by many different names.
Knowledge Discovery in Databases: An Overview

William J. Frawley, Gregory Piatetsky-Shapiro, and Christopher J. Matheus

Computers have promised us a fountain of wisdom but delivered a flood of data.
—A frustrated MIS executive

It has been estimated that the amount of information in the world doubles every 20 months. The size and number of databases probably increases even faster. In 1989, the total number of databases in the world was estimated at five million, although most of them are small DBASE III databases. The automation of business activities produces an ever-increasing stream of data because even simple transactions, such as a telephone call, the use of a credit card, or a medical test, are typically recorded in a computer.

Scientific and government databases are also rapidly growing. The National Aeronautics and Space Administration already has much more data than it can analyze. Earth observation satellites, planned for the 1990s, are expected to generate one terabyte (10^12 bytes) of data every day, more than all previous missions combined. At a rate of one picture each second, it would take a person several years (working nights and weekends) just to look at the pictures generated in one day. In biology, the federally funded Human Genome project will store thousands of bytes for each of the several billion genetic bases. Closer to everyday lives, the 1990 U.S. census data of a million million bytes encode patterns that in hidden ways describe the lifestyles and subcultures of today's United States.

What are we supposed to do with this flood of raw data? Clearly, little of it will ever be seen by human eyes. If it will be understood at all, it will have to be analyzed by computers. Although simple statistical techniques for data analysis were developed long ago, advanced techniques for intelligent data analysis are not yet mature. As a result, there is a growing gap between data generation and data understanding. At the same time, there is a growing realization and expectation that data, intelligently analyzed and presented, will be a valuable resource to be used for a competitive advantage.

The computer science community is responding to both the scientific and practical challenges presented by the need to find the knowledge adrift in the flood of data. In assessing the potential of AI technologies, Michie (1990), a leading European expert on machine learning, predicted that "the next area that is going to explode is the use of machine learning tools as a component of large-scale data analysis." A recent National Science Foundation workshop on the future of database research ranked data mining among the most promising research topics

After a decade of fundamental interdisciplinary research in machine learning, the spadework in this field has been done; the 1990s should see the widespread exploitation of knowledge discovery as an aid to assembling knowledge bases. The contributors to the AAAI Press book Knowledge Discovery in Databases were excited at the potential benefits of this research.
The editors hope that some of this excitement will com-municate itself to AI Magazine readers of this article.0738-4602/92/$4.00 ©1992 AAAIAI Magazine Volume 13 Number 3 (1992) (© AAAI)Articles58AI MAGAZINE for the 1990s (Silber-schatz, Stonebraker,and Ullman 1990).Some research meth-ods are already wellenough developed tohave been made partof commerciallyavailable software.Several expert systemshells use variationsof ID3 for inducingrules from examples.Other systems useinductive, neuralnet, or genetic learn-ing approaches todiscover patterns inpersonal computerdatabases. Many for-ward-looking com-panies are usingthese and other toolsto analyze theirdatabases for inter-esting and useful patterns. American Airlinessearches its frequent flyer database to find itsbetter customers, targeting them for specificmarketing promotions. Farm Journal analyzesits subscriber database and uses advancedprinting technology to custom-build hun-dreds of editions tailored to particulargroups. Several banks, using patterns discov-ered in loan and credit histories, have derivedbetter loan approval and bankruptcy predic-tion methods. General Motors is using adatabase of automobile trouble reports toderive diagnostic expert systems for variousmodels. Packaged-goods manufacturers aresearching the supermarket scanner data tomeasure the effects of their promotions andto look for shopping patterns.A combination of business and researchinterests has produced increasing demandsfor, as well as increased activity to provide,tools and techniques for discovery indatabases. This book is the first to bringtogether leading-edge research from aroundthe world on this topic. It spans many differ-ent approaches to discovery, including induc-tive learning, Bayesian statistics, semanticquery optimization, knowledge acquisitionfor expert systems, information theory, andfuzzy sets. The book is aimed at those inter-ested or involved in computer science andthe management of data, to both inform andinspire further research and applications. Itwill be of particular interest to professionalsworking with databases and managementinformation systemsand to those apply-ing machine learn-ing to real-worldproblems.What IsKnowledgeDiscovery?There is an immensediversity of currentresearch on knowl-edge discovery indatabases. To pro-vide a point of refer-ence for thisresearch, we beginhere by definingand explaining rele-vant terms.Definition ofKnowledgeDiscoveryKnowledge discovery is the nontrivial extrac-tion of implicit, previously unknown, andpotentially useful information from data.Given a set of facts (data) F, a language L,and some measure of certainty C, we define apattern as a statement S in L that describesrelationships among a subset F S of F with acertainty c, such that S is simpler (in somesense) than the enumeration of all facts inF S. A pattern that is interesting (according toa user-imposed interest measure) and certainenough (again according to the user’s criteria)is called knowledge. The output of a programthat monitors the set of facts in a databaseand produces patterns in this sense is discov-ered knowledge.These definitions about the language, thecertainty, and the simplicity and interesting-ness measures are intentionally vague tocover a wide variety of approaches. Collec-tively, these terms capture our view of thefundamental characteristics of discovery indatabases. 
In the following paragraphs, wesummarize the connotations of these termsand suggest their relevance to the problem ofknowledge discovery in databases.Patterns and Languages:Although manydifferent types of information can be discov-ered in data, this book focuses on patternsthat are expressed in a high-level language,such asIf Age < 25 and Driver-Education-Course =NoThen At-fault-accident = YesWith Likelihood = 0.2 to 0.3.Such patterns can be understood and used directly by people, or they can be the input to another program, for example, an expert system or a semantic query optimizer. We do not consider low-level patterns, such as those generated by neural networks.Certainty:Seldom is a piece of discovered knowledge true across all the data. Represent-ing and conveying the degree of certainty is essential to determining how much faith the system or user should put into a discovery. As we examine later, certainty involves several factors, including the integrity of the data; the size of the sample on which the discovery was performed; and, possibly, the degree of support from available domain knowledge. Without sufficient certainty, patterns become unjustified and, thus, fail to be knowledge.Interesting: Although numerous patterns can be extracted from any database, only those deemed to be interesting in some way are considered knowledge. Patterns are inter-esting when they are novel, useful, and non-trivial to compute. Whether a pattern is novel depends on the assumed frame of reference, which can be either the scope of the system’s knowledge or the scope of the user’s knowl-edge. For example, a system might discover the following: If At-fault-accident = yes Then Age > 16. To the system, this piece of knowl-edge might be previously unknown and potentially useful; to a user trying to analyze insurance claims records, this pattern would be tautological and uninteresting and would not represent discovered knowledge. This example also suggests the notion of utility. Knowledge is useful when it can help achieve a goal of the system or the user. Patterns com-pletely unrelated to current goals are of little use and do not constitute knowledge within the given situation. For example, a pattern relating at-fault-accident to a driver’s age, dis-covered while the user’s intent was to analyze sales figures, would not be useful to the user.Novelty and utility alone, however, are not enough to qualify a pattern as discovered knowledge. Most databases contain numerous novel and useful patterns, such as the totalsales for 1990, the average cost of an insur-ance claim, and the maximum intensity of aspectral line. These types of patterns wouldnot typically be considered knowledgebecause they are trivial to compute. To benontrivial, a system must do more thanblindly compute statistics; the results of thedirected calculation of straightforward statis-tics are, to our way of thinking, readily avail-able to the database user. A discovery systemmust be capable of deciding which calcula-tions to perform and whether the results areinteresting enough to constitute knowledgein the current context. Another way of view-ing this notion of nontriviality is that a dis-covery system must possess some degree ofautonomy in processing the data and evaluat-ing its results.Efficiency:Finally, we are interested in dis-covery processes that can be efficiently imple-mented on a computer. An algorithm isconsidered efficient1if the run time and spaceused are a polynomial function of low degreeof the input length. 
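The likelihood range attached to the example rule shown earlier in this section ("If Age < 25 and Driver-Education-Course = No Then At-fault-accident = Yes With Likelihood = 0.2 to 0.3") can be read as an estimated conditional probability with an uncertainty band. The sketch below estimates such a range from synthetic claim records using a simple normal-approximation interval; the records and the interval construction are assumptions of this illustration.

```python
import math
import random

random.seed(1)
records = [
    {"age": random.randint(17, 70),
     "driver_ed": random.random() < 0.6,
     "at_fault": random.random() < 0.1}
    for _ in range(5000)
]
# Make young, untrained drivers genuinely riskier in the synthetic data.
for r in records:
    if r["age"] < 25 and not r["driver_ed"]:
        r["at_fault"] = random.random() < 0.25

matching = [r for r in records if r["age"] < 25 and not r["driver_ed"]]
p = sum(r["at_fault"] for r in matching) / len(matching)
half_width = 1.96 * math.sqrt(p * (1 - p) / len(matching))   # 95% normal-approximation interval
print("If Age < 25 and Driver-Education-Course = No then At-fault-accident = Yes")
print(f"with likelihood {p:.2f} (roughly {p - half_width:.2f} to {p + half_width:.2f}, "
      f"n = {len(matching)})")
```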
The problem of discoveryof interesting sentences (concepts) that satisfygiven facts is inherently hard. Recentadvances in computational learning theory(Valiant 1984; Haussler 1988) have shownthat it is not possible to efficiently learn anarbitrary Boolean concept; the problem is NP-hard. However, these results are generallyrelated to the worst-case performance of algo-rithms and do not eliminate the possibilitythat, on the average, we can find complexconcepts fast. Efficient algorithms do exist forrestricted concept classes, such as purely con-junctive concepts (for example, AŸ BŸ C), orthe conjunction of clauses made up of dis-junctions of no more than k literals (forexample, (A⁄ B) Ÿ (C⁄ D) Ÿ (E ⁄ F), for k = 2).Another possibility for efficiently finding con-cepts is to abandon the demand that the algo-rithm learn a desired concept with someguarantee and instead accept heuristic orapproximate algorithms.To summarize, knowledge discovery indatabases exhibits four main characteristics:•High-Level Language—Discovered knowl-ArticlesFALL 1992 59Articles60AI MAGAZINE edge is represented in a high-level language.It need not be directly used by humans, butits expression should be understandable byhuman users.•Accuracy—Discoveries accurately portraythe contents of the database. The extent towhich this portrayal is imperfect is expressedby measures of certainty.•Interesting Results—Discovered knowledgeis interesting according to user-defined biases.In particular, being interesting implies thatpatterns are novel and potentially useful, andthe discovery process is nontrivial.•Efficiency—The discovery process is effi-cient. Running times for large-sized databasesare predictable and acceptable.The systems and techniques described inKnowledge Discovery in Databases generallystrive to satisfy these characteristics. Theapproaches taken, however, are quite diverse.Most are based on machine learning methodsthat have been enhanced to better deal withissues particular to discovery in databases.Next, we briefly define database concepts andterminology and relate them to termscommon to machine learning.Databases and Machine LearningIn database management, a database is a logi-cally integrated collection of data maintainedin one or more files and organized to facili-tate the efficient storage, modification, andretrieval of related information. In a relation-al database, for example, data are organizedinto files or tables of fixed-length records.Each record is an ordered list of values, onevalue for each field. Information about eachfield’s name and potential values is main-tained in a separate file called a data dictio-nary. A d atabase management system is acollection of procedures for retrieving, stor-ing, and manipulating data within databases.In machine learning, the term databasetypically refers to a collection of instances orexamples maintained in a single file.2Instances are usually fixed-length feature vec-tors. Information is sometimes also providedabout the feature names and value ranges, asin a data dictionary. A learning algorithmtakes the data set and its accompanyinginformation as input and returns a statement(for example, a concept) representing theresults of the learning as output.Table 1 informally compares the terminologyin database management with that ofmachine learning. With the appropriatetranslation of terms, it seems that machinelearning could readily be applied to databas-es: Rather than learning on a set of instances,learning is done on a file of records from adatabase. 
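The remark above that purely conjunctive concepts are efficiently learnable can be illustrated, for the monotone case (positive literals only), with the classic elimination procedure: start from the conjunction of all candidate literals and let each positive example delete the literals it violates, which takes time linear in the number of features per example. The feature names and examples below are assumptions of this sketch.

```python
features = ["A", "B", "C", "D"]

def learn_monotone_conjunction(positive_examples):
    """Keep only the features that are true in every positive example."""
    hypothesis = set(features)
    for example in positive_examples:
        hypothesis -= {f for f in hypothesis if not example[f]}
    return hypothesis

positives = [
    {"A": True, "B": True, "C": True, "D": False},
    {"A": True, "B": True, "C": False, "D": True},
    {"A": True, "B": True, "C": True, "D": True},
]
concept = learn_monotone_conjunction(positives)
print(" AND ".join(sorted(concept)))   # -> A AND B
```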
Knowledge discovery in databases, however, raises additional concerns that extend beyond those typically encountered in machine learning. In the real world, databases are often dynamic, incomplete, noisy, and much larger than typical machine learning data sets (table 2). These factors render most learning algorithms ineffective in the general case. Not surprisingly, much of the work on discovery in databases focuses on overcoming these complications.

Table 1. Translations between Database Management and Machine Learning Terms (database management term, followed by the corresponding machine learning term):
- database (a logically integrated collection of dynamic files): a fixed set of examples
- file: database, data set, set of instances
- tuple, record: instance, example, feature vector
- field, attribute: feature, attribute
- field domain: possible field values
- data dictionary: field type and domain information
- relational data: a set of instances
- object-oriented, structured data: relational data
- logical condition: concept description

Table 2. Conflicting Viewpoints between Database Management and Machine Learning (database management view versus machine learning view):
- Database is an active, evolving entity, versus database is just a static collection of data
- Records may contain missing or erroneous information, versus instances are usually complete and noise-free
- A typical field is numeric, versus a typical feature is binary
- A database typically contains millions of records, versus data sets typically contain several hundred instances
- "AI should get down to reality," versus "databases are a solved problem and are therefore uninteresting"

Related Approaches

Although machine learning is the foundation for much of the work in this area, knowledge discovery in databases deals with issues relevant to several other fields, including database management, expert systems, statistical analysis, and scientific discovery.

Database Management: A database management system provides procedures for storing, accessing, and modifying the data. Typical operations include retrieval, update, or deletion of all tuples satisfying a specific condition, and maintaining user-specified integrity constraints. The ability to extract tuples satisfying a common condition is like discovery in its ability to produce interesting and useful statements (for example, "Bob and Dave sold fewer widgets this year than last"). These techniques, however, cannot by themselves determine what computations are worth trying, nor do they evaluate the quality of the derived patterns. Interesting discoveries uncovered by these data-manipulation tools result from the guidance of the user. However, the new generation of deductive and object-oriented database systems (Kim, Nicolas, and Nishio 1990) will provide improved capabilities for intelligent data analysis and discovery.

Expert Systems: Expert systems attempt to capture knowledge pertinent to a specific problem. Techniques exist for helping to extract knowledge from experts. One such method is the induction of rules from expert-generated examples of problem solutions. This method differs from discovery in databases in that the expert examples are usually of much higher quality than the data in databases, and they usually cover only the important cases (see also Gaines for a comparison between knowledge acquisition from an expert and induction from data). Furthermore, experts are available to confirm the validity and usefulness of the discovered patterns.
As with database management tools, the autonomy of discovery is lacking in these methods.Statistics: Although statistics provide a solid theoretical foundation for the problem of data analysis, a purely statistical approach is not enough. First, standard statistical meth-ods are ill suited for the nominal and struc-tured data types found in many databases (figure 2). Second, statistics are totally data driven, precluding the use of available domain knowledge, an important issue that we discuss later. Third, the results of statistical analysis can be overwhelming and difficult to interpret. Finally, statistical methods require the guidance of the user to specify where and how to analyze the data. However, some recent statistics-based techniques such as pro-jection pursuit (Huber 1985) and discovery ofcausal structure from data (Glymour et al.1987; Geiger, Paz, and Pearl 1990) addresssome of these problems and are much closerto intelligent data analysis. We expect thatmethods using domain knowledge will bedeveloped by the statistical community. Wealso believe that statistics should have a vitalrole in all discovery systems dealing withlarge amounts of data.Scientific Discovery: Discovery in databas-es is significantly different from scientific dis-covery (see also Zytkow and Baker3,) in thatthe former is less purposeful and less con-trolled. Scientific data come from experi-ments designed to eliminate the effects of allbut a few parameters and to emphasize thevariation of one or a few target parameters tobe explained. However, typical businessdatabases record a plethora of informationabout their subjects to meet a number oforganizational goals. This richness (or confu-sion) both captures and hides from viewunderlying relationships in the data. More-over, scientists can reformulate and reruntheir experiments should they find that theinitial design was inadequate. Database man-agers rarely have the luxury of redesigningtheir data fields and recollecting the data.A Framework for Knowledge DiscoveryWe have referred to discovery systems severaltimes without specifying what a discoverysystem is. Although discovery systems varyconsiderably in their design, it is possible toArticlesFALL 1992 61 Figure 1. A Framework for Knowledge Discovery in Databases.Articles62AI MAGAZINE describe a prototypical discovery system.Figure 1 depicts the basic components of ourprototypical system for knowledge discoveryin databases. At the core of the system is thediscovery method, which computes and eval-uates patterns on their way to becomingknowledge. The input to the discoverymethod include raw data from the database,information from the data dictionary, addi-tional domain knowledge, and a set of user-defined biases that provide high-level focus.The output is discovered knowledge that canbe directed to the user or back into thesystem as new domain knowledge. Becausethis model represents a prototypical system, adiscovery system need not include all theseaspects. The feedback of discovered knowl-edge into the store of domain knowledge, forexample, is present in few existing systems(see Future Directions).In the remainder of this article, we exploreeach aspect of this framework in detail.Specifically, we consider (1) the peculiaritiesof databases, (2) the use of domain knowl-edge, (3) the role of the user in the discoveryprocess, (4) methods of knowledge discovery,and (5) the form and uses of discoveredknowledge.Database IssuesThe fundamental input to a discovery systemis the raw data present in a database. 
We pre-viously mentioned that databases poseunique problems to discovery not typicallyconfronted in machine learning. In particu-lar, these problems arise from the fact thatreal-world databases are dynamic, incom-plete, noisy, and large. Other concernsinclude whether the database contains ade-quate information for interesting discoveryand how to deal with the overabundance ofirrelevant information.Dynamic Data: A fundamental characteris-tic of most databases is that their contents areever changing. Data can be time sensitive,and discovery is affected by the timeliness ofdata observations. Some data values, such asa patient’s social security number, are con-stant over time; some vary more or less grad-ually over time (for example, weight andheight); and some are so situation dependentthat only a recently observed value will suf-fice (for example, pulse rate).Irrelevant Fields: A nother key characteris-tic is the relevance of data, that is, whetheran item of data is relevant to the currentfocus of discovery. When a patient database isbeing explored for interesting patterns ofsymptoms and diagnoses, nonmedical data,such as patient name or zip code, are irrele-vant, and errors there are unimportant. Onthe other hand, pulse rate, a simple and typi-cally recorded medical observation, is rele-vant, and errors here can affect what isdiscovered. If, however, we are looking for ageographic concentration of a particular dis-ease, then a correct zip code becomes crucial.If the zip code is thought to be faulty, it cansometimes be inferred from related informa-tion, such as the patient’s address.An aspect somewhat related to relevance isthe applicability of an attribute to a subset ofthe database; for example, a patient’s preg-nant field is not applicable to men (the classof patients with sex equal to male), but it isessential to know for female patients of child-bearing age.Missing Values: The presence or absence ofvalues for relevant data attributes can affectdiscovery. Whether a patient was comatose atthe time of diagnosis can be so importantthat it does not allow the substitution of adefault value; less important missing data canbe defaulted. In an interactive system, theabsence of an important datum can spawn arequest for its value or a test to determine it.Alternatively, the absence of data can be dealtwith as if it were a condition in itself, and themissing attribute can be assigned a neutralvalue, such as unknown. For example, theabsence of some patient measurements mightbe found to imply that the patient was for-merly in excellent health.Noise and Uncertainty: For relevantattributes, the severity of error can depend onthe data type of the allowed values. Values ofdifferent attributes can be real numbers(charge), integer numbers (age), or strings(name) or can belong to a set of nominalvalues (patient type). Nominal values can beordered partially or completely and can alsofields, or more complex conditions such as a department must always have exactly one manager. In a sense, the data dictionary defines the syntax of database use.The database contains the raw data to be processed by the discovery system. In prac-tice, the discovery system must also use the additional information about the form of data and constraints on it. Some of this infor-mation can be stored in the data dictionary,but other information might exist in manuals or experts’ heads. 
For example, a discovery system for a hospital database needs to know which diagnosis codes (DX) are grouped into which diagnostic-related groups (DRG),which, in turn, are grouped into major diag-nostic categories (MDC) (figure 2). Then, if a patient’s DX does not match his or her DRG,an unusual side effect or an error in record keeping is indicated.Because discovery is computationally expensive, additional knowledge regarding the form and content of data, the domain(s)described by the database, the context of a particular discovery episode, and the purposes being served by discovery are often used to guide and constrain the search for interesting knowledge. We refer to this form of informa-tion as domain knowledge or background knowl-edge . Domain knowledge assists discovery by focusing search. However, its use is controver-sial because by telling a system what to look for and where to look for it, domain knowl-edge restricts search and can deliberately rule out valuable discovery. An example discussed at the IJCAI-89 Knowledge Discovery Work-shop illustrates the trade-off between quickly finding conventional solutions and discard-ing unusual ones. In logistics planning, the search space is so large that it is impossible to find solutions without using constraints such as “trucks don’t drive over water.” This con-straint, however, eliminates potentially inter-esting solutions, such as those in whichArticlesFALL 1992 63have a semantic structure (figure 2).Another aspect of uncertainty is the inher-ent or expected exactitude of data, that is, the degree of noise in the data. Repeated mea-surements cluster around an average; based on a priori analysis or computations on the measurements themselves, a statistical model describing randomness is formulated and used to define expected and tolerable devia-tions in the data. Subjective input, such as whether a patient’s condition is severe or moderate, can vary considerably about one or more cluster points. Often, statistical models are applied in an ad hoc manner to subjec-tively determined attributes to obtain gross statistics and judge the acceptability of (com-binations of) attribute values.Especially with regard to numeric data types, data precision can be a factor in discov-ery. For example, it is regarded as adequate to record body temperature to a precision of 0.1degree. A sensitive trend analysis of body temperature would require even greater preci-sion in the data. To a discovery system able to relate such trends to diagnosis, this impreci-sion would appear to be noise in the input.Missing Fields: An inadequate view of the database can make valid data appear to be in error. The database view is the totality of usable attributes or accessors that the discov-ery system can apply to a problem. It is assumed that the attributes differentiate cases of interest. When they don’t, there appears to be some error. Suppose, for example, that a system is tasked to learn to diagnose malaria from a patient database that does not include the red blood cell count. Patients whose records are correct and who are medically identical with respect to this given view might have different diagnoses, which, in turn, might incorrectly be blamed on data error.Databases and KnowledgeIgnorance is the curse of God, knowledge the wing wherewith we fly to heaven.—William ShakespeareA database is a logically integrated collection of files. 
Associated with a database is a data dictionary, which defines field names, the allowable data types for field values, various constraints on field values, and other related information. For example, the field age is a positive integer and the field date-of-birth has the form MMDDYY . Types of constraints include ranges of possible values for numeric fields, lists of possible values for nominal。