Knowledge Discovery in Databases: An Overview
From Data Mining to Knowledge Discovery in Databases

Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth

Copyright © 1996, American Association for Artificial Intelligence. All rights reserved.

Data mining and knowledge discovery in databases have been attracting a significant amount of research, industry, and media attention of late. What is all the excitement about? This article provides an overview of this emerging field, clarifying how data mining and knowledge discovery in databases are related both to each other and to related fields, such as machine learning, statistics, and databases. The article mentions particular real-world applications, specific data-mining techniques, challenges involved in real-world applications of knowledge discovery, and current and future research directions in the field.

Across a wide variety of fields, data are being collected and accumulated at a dramatic pace. There is an urgent need for a new generation of computational theories and tools to assist humans in extracting useful information (knowledge) from the rapidly growing volumes of digital data. These theories and tools are the subject of the emerging field of knowledge discovery in databases (KDD). At an abstract level, the KDD field is concerned with the development of methods and techniques for making sense of data. The basic problem addressed by the KDD process is one of mapping low-level data (which are typically too voluminous to understand and digest easily) into other forms that might be more compact (for example, a short report), more abstract (for example, a descriptive approximation or model of the process that generated the data), or more useful (for example, a predictive model for estimating the value of future cases). At the core of the process is the application of specific data-mining methods for pattern discovery and extraction.

This article begins by discussing the historical context of KDD and data mining and their intersection with other related fields. A brief summary of recent KDD real-world applications is provided. Definitions of KDD and data mining are provided, and the general multistep KDD process is outlined. This multistep process has the application of data-mining algorithms as one particular step in the process. The data-mining step is discussed in more detail in the context of specific data-mining algorithms and their application. Real-world practical application issues are also outlined. Finally, the article enumerates challenges for future research and development and in particular discusses potential opportunities for AI technology in KDD systems.

Why Do We Need KDD?

The traditional method of turning data into knowledge relies on manual analysis and interpretation. For example, in the health-care industry, it is common for specialists to periodically analyze current trends and changes in health-care data, say, on a quarterly basis. The specialists then provide a report detailing the analysis to the sponsoring health-care organization; this report becomes the basis for future decision making and planning for health-care management. In a totally different type of application, planetary geologists sift through remotely sensed images of planets and asteroids, carefully locating and cataloging such geologic objects of interest as impact craters. Be it science, marketing, finance, health care, retail, or any other field, the classical approach to data analysis relies fundamentally on one or more analysts becoming
0738-4602-1996 / $2.00areas is astronomy. Here, a notable success was achieved by SKICAT ,a system used by as-tronomers to perform image analysis,classification, and cataloging of sky objects from sky-survey images (Fayyad, Djorgovski,and Weir 1996). In its first application, the system was used to process the 3 terabytes (1012bytes) of image data resulting from the Second Palomar Observatory Sky Survey,where it is estimated that on the order of 109sky objects are detectable. SKICAT can outper-form humans and traditional computational techniques in classifying faint sky objects. See Fayyad, Haussler, and Stolorz (1996) for a sur-vey of scientific applications.In business, main KDD application areas includes marketing, finance (especially in-vestment), fraud detection, manufacturing,telecommunications, and Internet agents.Marketing:In marketing, the primary ap-plication is database marketing systems,which analyze customer databases to identify different customer groups and forecast their behavior. Business Week (Berry 1994) estimat-ed that over half of all retailers are using or planning to use database marketing, and those who do use it have good results; for ex-ample, American Express reports a 10- to 15-percent increase in credit-card use. Another notable marketing application is market-bas-ket analysis (Agrawal et al. 1996) systems,which find patterns such as, “If customer bought X, he/she is also likely to buy Y and Z.” Such patterns are valuable to retailers.Investment: Numerous companies use da-ta mining for investment, but most do not describe their systems. One exception is LBS Capital Management. Its system uses expert systems, neural nets, and genetic algorithms to manage portfolios totaling $600 million;since its start in 1993, the system has outper-formed the broad stock market (Hall, Mani,and Barr 1996).Fraud detection: HNC Falcon and Nestor PRISM systems are used for monitoring credit-card fraud, watching over millions of ac-counts. The FAIS system (Senator et al. 1995),from the U.S. Treasury Financial Crimes En-forcement Network, is used to identify finan-cial transactions that might indicate money-laundering activity.Manufacturing: The CASSIOPEE trou-bleshooting system, developed as part of a joint venture between General Electric and SNECMA, was applied by three major Euro-pean airlines to diagnose and predict prob-lems for the Boeing 737. To derive families of faults, clustering methods are used. CASSIOPEE received the European first prize for innova-intimately familiar with the data and serving as an interface between the data and the users and products.For these (and many other) applications,this form of manual probing of a data set is slow, expensive, and highly subjective. In fact, as data volumes grow dramatically, this type of manual data analysis is becoming completely impractical in many domains.Databases are increasing in size in two ways:(1) the number N of records or objects in the database and (2) the number d of fields or at-tributes to an object. Databases containing on the order of N = 109objects are becoming in-creasingly common, for example, in the as-tronomical sciences. Similarly, the number of fields d can easily be on the order of 102or even 103, for example, in medical diagnostic applications. Who could be expected to di-gest millions of records, each having tens or hundreds of fields? 
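The market-basket patterns described earlier in this section ("if a customer bought X, he or she is also likely to buy Y and Z") can be made concrete with a small brute-force sketch that scores candidate rules by their support (how often the items co-occur) and confidence (an estimate of the conditional probability). The toy baskets and thresholds below are invented for illustration; production systems of the kind cited above (Agrawal et al. 1996) use far more efficient search algorithms over millions of transactions.

```python
from itertools import combinations

# Toy baskets; in a real retailer database these would be millions of transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset):
    """Fraction of baskets containing every item in `itemset`."""
    return sum(set(itemset) <= basket for basket in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Estimate of P(consequent in basket | antecedent in basket)."""
    return support(antecedent | consequent) / support(antecedent)

# Brute-force search over single-item rules x -> y above minimum thresholds.
items = sorted(set().union(*transactions))
for x, y in combinations(items, 2):
    for a, c in ((x, y), (y, x)):
        s, conf = support({a, c}), confidence({a}, {c})
        if s >= 0.4 and conf >= 0.7:
            print(f"if a customer bought {a}, he or she also bought {c} "
                  f"(support {s:.2f}, confidence {conf:.2f})")
```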
We believe that this job is certainly not one for humans; hence, analysis work needs to be automated, at least partially.The need to scale up human analysis capa-bilities to handling the large number of bytes that we can collect is both economic and sci-entific. Businesses use data to gain competi-tive advantage, increase efficiency, and pro-vide more valuable services to customers.Data we capture about our environment are the basic evidence we use to build theories and models of the universe we live in. Be-cause computers have enabled humans to gather more data than we can digest, it is on-ly natural to turn to computational tech-niques to help us unearth meaningful pat-terns and structures from the massive volumes of data. Hence, KDD is an attempt to address a problem that the digital informa-tion era made a fact of life for all of us: data overload.Data Mining and Knowledge Discovery in the Real WorldA large degree of the current interest in KDD is the result of the media interest surrounding successful KDD applications, for example, the focus articles within the last two years in Business Week , Newsweek , Byte , PC Week , and other large-circulation periodicals. Unfortu-nately, it is not always easy to separate fact from media hype. Nonetheless, several well-documented examples of successful systems can rightly be referred to as KDD applications and have been deployed in operational use on large-scale real-world problems in science and in business.In science, one of the primary applicationThere is an urgent need for a new generation of computation-al theories and tools toassist humans in extractinguseful information (knowledge)from the rapidly growing volumes ofdigital data.Articles38AI MAGAZINEtive applications (Manago and Auriol 1996).Telecommunications: The telecommuni-cations alarm-sequence analyzer (TASA) wasbuilt in cooperation with a manufacturer oftelecommunications equipment and threetelephone networks (Mannila, Toivonen, andVerkamo 1995). The system uses a novelframework for locating frequently occurringalarm episodes from the alarm stream andpresenting them as rules. Large sets of discov-ered rules can be explored with flexible infor-mation-retrieval tools supporting interactivityand iteration. In this way, TASA offers pruning,grouping, and ordering tools to refine the re-sults of a basic brute-force search for rules.Data cleaning: The MERGE-PURGE systemwas applied to the identification of duplicatewelfare claims (Hernandez and Stolfo 1995).It was used successfully on data from the Wel-fare Department of the State of Washington.In other areas, a well-publicized system isIBM’s ADVANCED SCOUT,a specialized data-min-ing system that helps National Basketball As-sociation (NBA) coaches organize and inter-pret data from NBA games (U.S. News 1995). ADVANCED SCOUT was used by several of the NBA teams in 1996, including the Seattle Su-personics, which reached the NBA finals.Finally, a novel and increasingly importanttype of discovery is one based on the use of in-telligent agents to navigate through an infor-mation-rich environment. Although the ideaof active triggers has long been analyzed in thedatabase field, really successful applications ofthis idea appeared only with the advent of theInternet. These systems ask the user to specifya profile of interest and search for related in-formation among a wide variety of public-do-main and proprietary sources. 
For example, FIREFLY is a personal music-recommendation agent: It asks a user his/her opinion of several music pieces and then suggests other music that the user might like (<http://www.ffl/>). CRAYON (/>) allows users to create their own free newspaper (supported by ads); NEWSHOUND (<http://www./hound/>) from the San Jose Mercury News and FARCAST (</>) automatically search information from a wide variety of sources, including newspapers and wire services, and e-mail relevant documents directly to the user.

These are just a few of the numerous such systems that use KDD techniques to automatically produce useful information from large masses of raw data. See Piatetsky-Shapiro et al. (1996) for an overview of issues in developing industrial KDD applications.

Data Mining and KDD

Historically, the notion of finding useful patterns in data has been given a variety of names, including data mining, knowledge extraction, information discovery, information harvesting, data archaeology, and data pattern processing. The term data mining has mostly been used by statisticians, data analysts, and the management information systems (MIS) communities. It has also gained popularity in the database field. The phrase knowledge discovery in databases was coined at the first KDD workshop in 1989 (Piatetsky-Shapiro 1991) to emphasize that knowledge is the end product of a data-driven discovery. It has been popularized in the AI and machine-learning fields.

In our view, KDD refers to the overall process of discovering useful knowledge from data, and data mining refers to a particular step in this process. Data mining is the application of specific algorithms for extracting patterns from data. The distinction between the KDD process and the data-mining step (within the process) is a central point of this article. The additional steps in the KDD process, such as data preparation, data selection, data cleaning, incorporation of appropriate prior knowledge, and proper interpretation of the results of mining, are essential to ensure that useful knowledge is derived from the data. Blind application of data-mining methods (rightly criticized as data dredging in the statistical literature) can be a dangerous activity, easily leading to the discovery of meaningless and invalid patterns.

The Interdisciplinary Nature of KDD

KDD has evolved, and continues to evolve, from the intersection of research fields such as machine learning, pattern recognition, databases, statistics, AI, knowledge acquisition for expert systems, data visualization, and high-performance computing. The unifying goal is extracting high-level knowledge from low-level data in the context of large data sets.

The data-mining component of KDD currently relies heavily on known techniques from machine learning, pattern recognition, and statistics to find patterns from data in the data-mining step of the KDD process. A natural question is, How is KDD different from pattern recognition or machine learning (and related fields)? The answer is that these fields provide some of the data-mining methods that are used in the data-mining step of the KDD process. KDD focuses on the overall process of knowledge discovery from data, including how the data are stored and accessed, how algorithms can be scaled to massive data sets

A driving force behind KDD is the database field (the second D in KDD).
Indeed, the problem of effective data manipulation when data cannot fit in the main memory is of fun-damental importance to KDD. Database tech-niques for gaining efficient data access,grouping and ordering operations when ac-cessing data, and optimizing queries consti-tute the basics for scaling algorithms to larger data sets. Most data-mining algorithms from statistics, pattern recognition, and machine learning assume data are in the main memo-ry and pay no attention to how the algorithm breaks down if only limited views of the data are possible.A related field evolving from databases is data warehousing,which refers to the popular business trend of collecting and cleaning transactional data to make them available for online analysis and decision support. Data warehousing helps set the stage for KDD in two important ways: (1) data cleaning and (2)data access.Data cleaning: As organizations are forced to think about a unified logical view of the wide variety of data and databases they pos-sess, they have to address the issues of map-ping data to a single naming convention,uniformly representing and handling missing data, and handling noise and errors when possible.Data access: Uniform and well-defined methods must be created for accessing the da-ta and providing access paths to data that were historically difficult to get to (for exam-ple, stored offline).Once organizations and individuals have solved the problem of how to store and ac-cess their data, the natural next step is the question, What else do we do with all the da-ta? This is where opportunities for KDD natu-rally arise.A popular approach for analysis of data warehouses is called online analytical processing (OLAP), named for a set of principles pro-posed by Codd (1993). OLAP tools focus on providing multidimensional data analysis,which is superior to SQL in computing sum-maries and breakdowns along many dimen-sions. OLAP tools are targeted toward simpli-fying and supporting interactive data analysis,but the goal of KDD tools is to automate as much of the process as possible. Thus, KDD is a step beyond what is currently supported by most standard database systems.Basic DefinitionsKDD is the nontrivial process of identifying valid, novel, potentially useful, and ultimate-and still run efficiently, how results can be in-terpreted and visualized, and how the overall man-machine interaction can usefully be modeled and supported. The KDD process can be viewed as a multidisciplinary activity that encompasses techniques beyond the scope of any one particular discipline such as machine learning. In this context, there are clear opportunities for other fields of AI (be-sides machine learning) to contribute to KDD. KDD places a special emphasis on find-ing understandable patterns that can be inter-preted as useful or interesting knowledge.Thus, for example, neural networks, although a powerful modeling tool, are relatively difficult to understand compared to decision trees. KDD also emphasizes scaling and ro-bustness properties of modeling algorithms for large noisy data sets.Related AI research fields include machine discovery, which targets the discovery of em-pirical laws from observation and experimen-tation (Shrager and Langley 1990) (see Kloes-gen and Zytkow [1996] for a glossary of terms common to KDD and machine discovery),and causal modeling for the inference of causal models from data (Spirtes, Glymour,and Scheines 1993). 
Statistics in particular has much in common with KDD (see Elder and Pregibon [1996] and Glymour et al.[1996] for a more detailed discussion of this synergy). Knowledge discovery from data is fundamentally a statistical endeavor. Statistics provides a language and framework for quan-tifying the uncertainty that results when one tries to infer general patterns from a particu-lar sample of an overall population. As men-tioned earlier, the term data mining has had negative connotations in statistics since the 1960s when computer-based data analysis techniques were first introduced. The concern arose because if one searches long enough in any data set (even randomly generated data),one can find patterns that appear to be statis-tically significant but, in fact, are not. Clearly,this issue is of fundamental importance to KDD. Substantial progress has been made in recent years in understanding such issues in statistics. Much of this work is of direct rele-vance to KDD. Thus, data mining is a legiti-mate activity as long as one understands how to do it correctly; data mining carried out poorly (without regard to the statistical as-pects of the problem) is to be avoided. KDD can also be viewed as encompassing a broader view of modeling than statistics. KDD aims to provide tools to automate (to the degree pos-sible) the entire process of data analysis and the statistician’s “art” of hypothesis selection.Data mining is a step in the KDD process that consists of ap-plying data analysis and discovery al-gorithms that produce a par-ticular enu-meration ofpatterns (or models)over the data.Articles40AI MAGAZINEly understandable patterns in data (Fayyad, Piatetsky-Shapiro, and Smyth 1996).Here, data are a set of facts (for example, cases in a database), and pattern is an expres-sion in some language describing a subset of the data or a model applicable to the subset. Hence, in our usage here, extracting a pattern also designates fitting a model to data; find-ing structure from data; or, in general, mak-ing any high-level description of a set of data. The term process implies that KDD comprises many steps, which involve data preparation, search for patterns, knowledge evaluation, and refinement, all repeated in multiple itera-tions. By nontrivial, we mean that some search or inference is involved; that is, it is not a straightforward computation of predefined quantities like computing the av-erage value of a set of numbers.The discovered patterns should be valid on new data with some degree of certainty. We also want patterns to be novel (at least to the system and preferably to the user) and poten-tially useful, that is, lead to some benefit to the user or task. Finally, the patterns should be understandable, if not immediately then after some postprocessing.The previous discussion implies that we can define quantitative measures for evaluating extracted patterns. In many cases, it is possi-ble to define measures of certainty (for exam-ple, estimated prediction accuracy on new data) or utility (for example, gain, perhaps indollars saved because of better predictions orspeedup in response time of a system). No-tions such as novelty and understandabilityare much more subjective. In certain contexts,understandability can be estimated by sim-plicity (for example, the number of bits to de-scribe a pattern). 
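As a hedged illustration of the quantitative measures just described, the sketch below attaches a certainty, a utility, and a simplicity value to a few candidate patterns and combines them into a single ranking score. The example rules, numbers, and weights are assumptions of this sketch rather than anything prescribed by the article, which leaves such weighting to the user and the application.

```python
candidate_patterns = [
    # certainty: estimated accuracy on held-out data; utility: e.g., dollars saved;
    # simplicity: here just 1 / (number of conditions in the rule)
    {"rule": "income > 50k -> repaid",                "certainty": 0.81, "utility": 12000, "simplicity": 1.0},
    {"rule": "income > 50k and debt < 10k -> repaid", "certainty": 0.86, "utility": 14000, "simplicity": 0.5},
    {"rule": "zip code ends in 7 -> repaid",          "certainty": 0.52, "utility": 300,   "simplicity": 1.0},
]

def score(p, w_certainty=0.5, w_utility=0.3, w_simplicity=0.2, utility_scale=20000):
    # Weighted combination of the three measures; the weights are illustrative.
    return (w_certainty * p["certainty"]
            + w_utility * min(p["utility"] / utility_scale, 1.0)
            + w_simplicity * p["simplicity"])

for p in sorted(candidate_patterns, key=score, reverse=True):
    print(f"{score(p):.2f}  {p['rule']}")
```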
An important notion, calledinterestingness(for example, see Silberschatzand Tuzhilin [1995] and Piatetsky-Shapiro andMatheus [1994]), is usually taken as an overallmeasure of pattern value, combining validity,novelty, usefulness, and simplicity. Interest-ingness functions can be defined explicitly orcan be manifested implicitly through an or-dering placed by the KDD system on the dis-covered patterns or models.Given these notions, we can consider apattern to be knowledge if it exceeds some in-terestingness threshold, which is by nomeans an attempt to define knowledge in thephilosophical or even the popular view. As amatter of fact, knowledge in this definition ispurely user oriented and domain specific andis determined by whatever functions andthresholds the user chooses.Data mining is a step in the KDD processthat consists of applying data analysis anddiscovery algorithms that, under acceptablecomputational efficiency limitations, pro-duce a particular enumeration of patterns (ormodels) over the data. Note that the space ofArticlesFALL 1996 41Figure 1. An Overview of the Steps That Compose the KDD Process.methods, the effective number of variables under consideration can be reduced, or in-variant representations for the data can be found.Fifth is matching the goals of the KDD pro-cess (step 1) to a particular data-mining method. For example, summarization, clas-sification, regression, clustering, and so on,are described later as well as in Fayyad, Piatet-sky-Shapiro, and Smyth (1996).Sixth is exploratory analysis and model and hypothesis selection: choosing the data-mining algorithm(s) and selecting method(s)to be used for searching for data patterns.This process includes deciding which models and parameters might be appropriate (for ex-ample, models of categorical data are differ-ent than models of vectors over the reals) and matching a particular data-mining method with the overall criteria of the KDD process (for example, the end user might be more in-terested in understanding the model than its predictive capabilities).Seventh is data mining: searching for pat-terns of interest in a particular representa-tional form or a set of such representations,including classification rules or trees, regres-sion, and clustering. The user can significant-ly aid the data-mining method by correctly performing the preceding steps.Eighth is interpreting mined patterns, pos-sibly returning to any of steps 1 through 7 for further iteration. This step can also involve visualization of the extracted patterns and models or visualization of the data given the extracted models.Ninth is acting on the discovered knowl-edge: using the knowledge directly, incorpo-rating the knowledge into another system for further action, or simply documenting it and reporting it to interested parties. This process also includes checking for and resolving po-tential conflicts with previously believed (or extracted) knowledge.The KDD process can involve significant iteration and can contain loops between any two steps. The basic flow of steps (al-though not the potential multitude of itera-tions and loops) is illustrated in figure 1.Most previous work on KDD has focused on step 7, the data mining. However, the other steps are as important (and probably more so) for the successful application of KDD in practice. 
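A minimal, runnable skeleton of the multistep process outlined above is sketched below. Every helper is a deliberately simple stand-in chosen only to show where selection, cleaning, mining, and evaluation sit relative to one another; the grouping-and-summarization "mining" step, the field names, and the evaluation threshold are assumptions of this sketch, and the loops back to earlier steps that the article emphasizes are omitted.

```python
def select_target_data(db, fields):              # step 2: focus on a subset of fields
    return [{k: r[k] for k in fields} for r in db]

def clean_and_preprocess(records):               # step 3: drop records with missing values
    return [r for r in records if None not in r.values()]

def mine_group_summaries(records, by, measure):  # step 7: a simple summarization "pattern"
    groups = {}
    for r in records:
        groups.setdefault(r[by], []).append(r[measure])
    return {g: sum(v) / len(v) for g, v in groups.items()}

def evaluate(patterns, records, measure):        # step 8: keep groups far from the overall mean
    overall = sum(r[measure] for r in records) / len(records)
    return {g: m for g, m in patterns.items() if abs(m - overall) > 0.5 * overall}

def kdd_process(db, by, measure):
    data = clean_and_preprocess(select_target_data(db, [by, measure]))
    patterns = mine_group_summaries(data, by, measure)
    return evaluate(patterns, data, measure)     # step 9 (acting on the result) is left to the caller

claims = [{"region": "north", "cost": 120, "id": 1},
          {"region": "north", "cost": 100, "id": 2},
          {"region": "south", "cost": 400, "id": 3},
          {"region": "south", "cost": None, "id": 4}]
print(kdd_process(claims, by="region", measure="cost"))   # -> {'south': 400.0}
```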
Having defined the basic notions and introduced the KDD process, we now focus on the data-mining component,which has, by far, received the most atten-tion in the literature.patterns is often infinite, and the enumera-tion of patterns involves some form of search in this space. Practical computational constraints place severe limits on the sub-space that can be explored by a data-mining algorithm.The KDD process involves using the database along with any required selection,preprocessing, subsampling, and transforma-tions of it; applying data-mining methods (algorithms) to enumerate patterns from it;and evaluating the products of data mining to identify the subset of the enumerated pat-terns deemed knowledge. The data-mining component of the KDD process is concerned with the algorithmic means by which pat-terns are extracted and enumerated from da-ta. The overall KDD process (figure 1) in-cludes the evaluation and possible interpretation of the mined patterns to de-termine which patterns can be considered new knowledge. The KDD process also in-cludes all the additional steps described in the next section.The notion of an overall user-driven pro-cess is not unique to KDD: analogous propos-als have been put forward both in statistics (Hand 1994) and in machine learning (Brod-ley and Smyth 1996).The KDD ProcessThe KDD process is interactive and iterative,involving numerous steps with many deci-sions made by the user. Brachman and Anand (1996) give a practical view of the KDD pro-cess, emphasizing the interactive nature of the process. Here, we broadly outline some of its basic steps:First is developing an understanding of the application domain and the relevant prior knowledge and identifying the goal of the KDD process from the customer’s viewpoint.Second is creating a target data set: select-ing a data set, or focusing on a subset of vari-ables or data samples, on which discovery is to be performed.Third is data cleaning and preprocessing.Basic operations include removing noise if appropriate, collecting the necessary informa-tion to model or account for noise, deciding on strategies for handling missing data fields,and accounting for time-sequence informa-tion and known changes.Fourth is data reduction and projection:finding useful features to represent the data depending on the goal of the task. With di-mensionality reduction or transformationArticles42AI MAGAZINEThe Data-Mining Stepof the KDD ProcessThe data-mining component of the KDD pro-cess often involves repeated iterative applica-tion of particular data-mining methods. This section presents an overview of the primary goals of data mining, a description of the methods used to address these goals, and a brief description of the data-mining algo-rithms that incorporate these methods.The knowledge discovery goals are defined by the intended use of the system. We can distinguish two types of goals: (1) verification and (2) discovery. With verification,the sys-tem is limited to verifying the user’s hypothe-sis. With discovery,the system autonomously finds new patterns. We further subdivide the discovery goal into prediction,where the sys-tem finds patterns for predicting the future behavior of some entities, and description, where the system finds patterns for presenta-tion to a user in a human-understandableform. In this article, we are primarily con-cerned with discovery-oriented data mining.Data mining involves fitting models to, or determining patterns from, observed data. 
The fitted models play the role of inferred knowledge: Whether the models reflect useful or interesting knowledge is part of the overall, interactive KDD process where subjective human judgment is typically required. Two primary mathematical formalisms are used in model fitting: (1) statistical and (2) logical. The statistical approach allows for nondeterministic effects in the model, whereas a logical model is purely deterministic. We focus primarily on the statistical approach to data mining, which tends to be the most widely used basis for practical data-mining applications given the typical presence of uncertainty in real-world data-generating processes.

Most data-mining methods are based on tried and tested techniques from machine learning, pattern recognition, and statistics: classification, clustering, regression, and so on. The array of different algorithms under each of these headings can often be bewildering to both the novice and the experienced data analyst. It should be emphasized that of the many data-mining methods advertised in the literature, there are really only a few fundamental techniques. The actual underlying model representation being used by a particular method typically comes from a composition of a small number of well-known options: polynomials, splines, kernel and basis functions, threshold-Boolean functions, and so on. Thus, algorithms tend to differ primarily in the goodness-of-fit criterion used to evaluate model fit or in the search method used to find a good fit.

In our brief overview of data-mining methods, we try in particular to convey the notion that most (if not all) methods can be viewed as extensions or hybrids of a few basic techniques and principles. We first discuss the primary methods of data mining and then show that the data-mining methods can be viewed as consisting of three primary algorithmic components: (1) model representation, (2) model evaluation, and (3) search. In the discussion of KDD and data-mining methods, we use a simple example to make some of the notions more concrete. Figure 2 shows a simple two-dimensional artificial data set consisting of 23 cases. Each point on the graph represents a person who has been given a loan by a particular bank at some time in the past. The horizontal axis represents the income of the person; the vertical axis represents the total personal debt of the person (mortgage, car payments, and so on). The data have been classified into two classes: (1) the x's represent persons who have defaulted on their loans and (2) the o's represent persons whose loans are in good status with the bank. Thus, this simple artificial data set could represent a historical data set that can contain useful knowledge from the point of view of the bank making the loans. Note that in actual KDD applications, there are typically many more dimensions (as many as several hundreds) and many more data points (many thousands or even millions).

Figure 2. A Simple Data Set with Two Classes Used for Illustrative Purposes.
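To make the loan example concrete, the sketch below builds a small synthetic data set in the spirit of figure 2 (income and debt per borrower, with a default flag) and fits the simplest possible classifier to it: a single axis-parallel threshold, sometimes called a decision stump. The numbers and the choice of model are assumptions of this illustration, not taken from the article.

```python
cases = [  # (income in $1000s, debt in $1000s, defaulted?) -- invented data
    (18, 9, True), (22, 11, True), (25, 14, True), (30, 13, True), (35, 16, True),
    (28, 6, False), (40, 7, False), (45, 12, False), (55, 9, False), (62, 15, False),
    (70, 8, False), (33, 18, True), (50, 20, True), (75, 5, False), (20, 4, True),
]

def stump_accuracy(feature_index, threshold, predict_default_below):
    """Training accuracy of the rule: default iff feature <= threshold (or > threshold)."""
    correct = 0
    for case in cases:
        predicted = (case[feature_index] <= threshold) == predict_default_below
        correct += predicted == case[2]
    return correct / len(cases)

# Exhaustive search over both features, all observed thresholds, and both rule directions.
best = max(
    ((f, t, side) for f in (0, 1) for t in sorted({c[f] for c in cases}) for side in (True, False)),
    key=lambda s: stump_accuracy(*s),
)
names = {0: "income", 1: "debt"}
print(f"best single split: {names[best[0]]} <= {best[1]} predicts default={best[2]} "
      f"(training accuracy {stump_accuracy(*best):.2f})")
```

Even this tiny example already has the three algorithmic components discussed later: a model representation (one threshold on one feature), an evaluation criterion (training accuracy), and a search method (exhaustive enumeration).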
KDD: Knowledge Discovery in Databases

Knowledge discovery in databases (KDD) is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.

Knowledge discovery turns information into knowledge, finding the nuggets of knowledge hidden in the mountain of data, and it will contribute to knowledge innovation and the development of the knowledge economy.

The term appeared in 1989, and Fayyad defined KDD as "the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data."

Contents: 1. The basic KDD process; 2. Common KDD process models.

Detailed explanation

Research on knowledge discovery in databases (KDD) is very active.
The definition above involves several concepts that need further explanation: "data set," "pattern," "process," "validity," "novelty," "potential usefulness," and "ultimate understandability."

A data set is a set of facts F (for example, records in a relational database).

A pattern is an expression E in a language L that describes some subset F_E of the data set F; to qualify as a pattern, E must be simpler than an enumeration of the data subset F_E (that is, it must take less descriptive information).

In KDD, the process usually refers to a multistage procedure involving data preparation, pattern search, knowledge evaluation, and repeated refinement. The process is required to be nontrivial, meaning it must involve some degree of intelligence and autonomy (merely computing the sum of all the data does not count as a discovery process).

Validity means that a discovered pattern holds with some degree of certainty on new data.

Novelty requires that discovered patterns be new.

Potential usefulness means that the discovered knowledge has practical value, for example, improving economic returns when used in a decision-support system.

Ultimate understandability requires that discovered patterns be comprehensible to users; at present this is mainly reflected in simplicity.

Validity, novelty, potential usefulness, and ultimate understandability taken together are referred to as interestingness.

Because knowledge discovery is an interdisciplinary subject that attracts researchers from many different fields, it has gone by many different names.
Knowledge Discovery in Databases: An Overview

William J. Frawley, Gregory Piatetsky-Shapiro, and Christopher J. Matheus

Computers have promised us a fountain of wisdom but delivered a flood of data.
—A frustrated MIS executive

It has been estimated that the amount of information in the world doubles every 20 months. The size and number of databases probably increases even faster. In 1989, the total number of databases in the world was estimated at five million, although most of them are small DBASE III databases. The automation of business activities produces an ever-increasing stream of data because even simple transactions, such as a telephone call, the use of a credit card, or a medical test, are typically recorded in a computer.

Scientific and government databases are also rapidly growing. The National Aeronautics and Space Administration already has much more data than it can analyze. Earth observation satellites, planned for the 1990s, are expected to generate one terabyte (10^12 bytes) of data every day, more than all previous missions combined. At a rate of one picture each second, it would take a person several years (working nights and weekends) just to look at the pictures generated in one day. In biology, the federally funded Human Genome project will store thousands of bytes for each of the several billion genetic bases. Closer to everyday lives, the 1990 U.S. census data of a million million bytes encode patterns that in hidden ways describe the lifestyles and subcultures of today's United States.

What are we supposed to do with this flood of raw data? Clearly, little of it will ever be seen by human eyes. If it will be understood at all, it will have to be analyzed by computers. Although simple statistical techniques for data analysis were developed long ago, advanced techniques for intelligent data analysis are not yet mature. As a result, there is a growing gap between data generation and data understanding. At the same time, there is a growing realization and expectation that data, intelligently analyzed and presented, will be a valuable resource to be used for a competitive advantage.

The computer science community is responding to both the scientific and practical challenges presented by the need to find the knowledge adrift in the flood of data. In assessing the potential of AI technologies, Michie (1990), a leading European expert on machine learning, predicted that "the next area that is going to explode is the use of machine learning tools as a component of large-scale data analysis." A recent National Science Foundation workshop on the future of database research ranked data mining among the most promising research topics

After a decade of fundamental interdisciplinary research in machine learning, the spadework in this field has been done; the 1990s should see the widespread exploitation of knowledge discovery as an aid to assembling knowledge bases. The contributors to the AAAI Press book Knowledge Discovery in Databases were excited at the potential benefits of this research.
The editors hope that some of this excitement will com-municate itself to AI Magazine readers of this article.0738-4602/92/$4.00 ©1992 AAAIAI Magazine Volume 13 Number 3 (1992) (© AAAI)Articles58AI MAGAZINE for the 1990s (Silber-schatz, Stonebraker,and Ullman 1990).Some research meth-ods are already wellenough developed tohave been made partof commerciallyavailable software.Several expert systemshells use variationsof ID3 for inducingrules from examples.Other systems useinductive, neuralnet, or genetic learn-ing approaches todiscover patterns inpersonal computerdatabases. Many for-ward-looking com-panies are usingthese and other toolsto analyze theirdatabases for inter-esting and useful patterns. American Airlinessearches its frequent flyer database to find itsbetter customers, targeting them for specificmarketing promotions. Farm Journal analyzesits subscriber database and uses advancedprinting technology to custom-build hun-dreds of editions tailored to particulargroups. Several banks, using patterns discov-ered in loan and credit histories, have derivedbetter loan approval and bankruptcy predic-tion methods. General Motors is using adatabase of automobile trouble reports toderive diagnostic expert systems for variousmodels. Packaged-goods manufacturers aresearching the supermarket scanner data tomeasure the effects of their promotions andto look for shopping patterns.A combination of business and researchinterests has produced increasing demandsfor, as well as increased activity to provide,tools and techniques for discovery indatabases. This book is the first to bringtogether leading-edge research from aroundthe world on this topic. It spans many differ-ent approaches to discovery, including induc-tive learning, Bayesian statistics, semanticquery optimization, knowledge acquisitionfor expert systems, information theory, andfuzzy sets. The book is aimed at those inter-ested or involved in computer science andthe management of data, to both inform andinspire further research and applications. Itwill be of particular interest to professionalsworking with databases and managementinformation systemsand to those apply-ing machine learn-ing to real-worldproblems.What IsKnowledgeDiscovery?There is an immensediversity of currentresearch on knowl-edge discovery indatabases. To pro-vide a point of refer-ence for thisresearch, we beginhere by definingand explaining rele-vant terms.Definition ofKnowledgeDiscoveryKnowledge discovery is the nontrivial extrac-tion of implicit, previously unknown, andpotentially useful information from data.Given a set of facts (data) F, a language L,and some measure of certainty C, we define apattern as a statement S in L that describesrelationships among a subset F S of F with acertainty c, such that S is simpler (in somesense) than the enumeration of all facts inF S. A pattern that is interesting (according toa user-imposed interest measure) and certainenough (again according to the user’s criteria)is called knowledge. The output of a programthat monitors the set of facts in a databaseand produces patterns in this sense is discov-ered knowledge.These definitions about the language, thecertainty, and the simplicity and interesting-ness measures are intentionally vague tocover a wide variety of approaches. Collec-tively, these terms capture our view of thefundamental characteristics of discovery indatabases. 
In the following paragraphs, wesummarize the connotations of these termsand suggest their relevance to the problem ofknowledge discovery in databases.Patterns and Languages:Although manydifferent types of information can be discov-ered in data, this book focuses on patternsthat are expressed in a high-level language,such asIf Age < 25 and Driver-Education-Course =NoThen At-fault-accident = YesWith Likelihood = 0.2 to 0.3.Such patterns can be understood and used directly by people, or they can be the input to another program, for example, an expert system or a semantic query optimizer. We do not consider low-level patterns, such as those generated by neural networks.Certainty:Seldom is a piece of discovered knowledge true across all the data. Represent-ing and conveying the degree of certainty is essential to determining how much faith the system or user should put into a discovery. As we examine later, certainty involves several factors, including the integrity of the data; the size of the sample on which the discovery was performed; and, possibly, the degree of support from available domain knowledge. Without sufficient certainty, patterns become unjustified and, thus, fail to be knowledge.Interesting: Although numerous patterns can be extracted from any database, only those deemed to be interesting in some way are considered knowledge. Patterns are inter-esting when they are novel, useful, and non-trivial to compute. Whether a pattern is novel depends on the assumed frame of reference, which can be either the scope of the system’s knowledge or the scope of the user’s knowl-edge. For example, a system might discover the following: If At-fault-accident = yes Then Age > 16. To the system, this piece of knowl-edge might be previously unknown and potentially useful; to a user trying to analyze insurance claims records, this pattern would be tautological and uninteresting and would not represent discovered knowledge. This example also suggests the notion of utility. Knowledge is useful when it can help achieve a goal of the system or the user. Patterns com-pletely unrelated to current goals are of little use and do not constitute knowledge within the given situation. For example, a pattern relating at-fault-accident to a driver’s age, dis-covered while the user’s intent was to analyze sales figures, would not be useful to the user.Novelty and utility alone, however, are not enough to qualify a pattern as discovered knowledge. Most databases contain numerous novel and useful patterns, such as the totalsales for 1990, the average cost of an insur-ance claim, and the maximum intensity of aspectral line. These types of patterns wouldnot typically be considered knowledgebecause they are trivial to compute. To benontrivial, a system must do more thanblindly compute statistics; the results of thedirected calculation of straightforward statis-tics are, to our way of thinking, readily avail-able to the database user. A discovery systemmust be capable of deciding which calcula-tions to perform and whether the results areinteresting enough to constitute knowledgein the current context. Another way of view-ing this notion of nontriviality is that a dis-covery system must possess some degree ofautonomy in processing the data and evaluat-ing its results.Efficiency:Finally, we are interested in dis-covery processes that can be efficiently imple-mented on a computer. An algorithm isconsidered efficient1if the run time and spaceused are a polynomial function of low degreeof the input length. 
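The likelihood range attached to the example rule shown earlier in this section ("If Age < 25 and Driver-Education-Course = No Then At-fault-accident = Yes With Likelihood = 0.2 to 0.3") can be read as an estimated conditional probability with an uncertainty band. The sketch below estimates such a range from synthetic claim records using a simple normal-approximation interval; the records and the interval construction are assumptions of this illustration.

```python
import math
import random

random.seed(1)
records = [
    {"age": random.randint(17, 70),
     "driver_ed": random.random() < 0.6,
     "at_fault": random.random() < 0.1}
    for _ in range(5000)
]
# Make young, untrained drivers genuinely riskier in the synthetic data.
for r in records:
    if r["age"] < 25 and not r["driver_ed"]:
        r["at_fault"] = random.random() < 0.25

matching = [r for r in records if r["age"] < 25 and not r["driver_ed"]]
p = sum(r["at_fault"] for r in matching) / len(matching)
half_width = 1.96 * math.sqrt(p * (1 - p) / len(matching))   # 95% normal-approximation interval
print("If Age < 25 and Driver-Education-Course = No then At-fault-accident = Yes")
print(f"with likelihood {p:.2f} (roughly {p - half_width:.2f} to {p + half_width:.2f}, "
      f"n = {len(matching)})")
```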
The problem of discoveryof interesting sentences (concepts) that satisfygiven facts is inherently hard. Recentadvances in computational learning theory(Valiant 1984; Haussler 1988) have shownthat it is not possible to efficiently learn anarbitrary Boolean concept; the problem is NP-hard. However, these results are generallyrelated to the worst-case performance of algo-rithms and do not eliminate the possibilitythat, on the average, we can find complexconcepts fast. Efficient algorithms do exist forrestricted concept classes, such as purely con-junctive concepts (for example, AŸ BŸ C), orthe conjunction of clauses made up of dis-junctions of no more than k literals (forexample, (A⁄ B) Ÿ (C⁄ D) Ÿ (E ⁄ F), for k = 2).Another possibility for efficiently finding con-cepts is to abandon the demand that the algo-rithm learn a desired concept with someguarantee and instead accept heuristic orapproximate algorithms.To summarize, knowledge discovery indatabases exhibits four main characteristics:•High-Level Language—Discovered knowl-ArticlesFALL 1992 59Articles60AI MAGAZINE edge is represented in a high-level language.It need not be directly used by humans, butits expression should be understandable byhuman users.•Accuracy—Discoveries accurately portraythe contents of the database. The extent towhich this portrayal is imperfect is expressedby measures of certainty.•Interesting Results—Discovered knowledgeis interesting according to user-defined biases.In particular, being interesting implies thatpatterns are novel and potentially useful, andthe discovery process is nontrivial.•Efficiency—The discovery process is effi-cient. Running times for large-sized databasesare predictable and acceptable.The systems and techniques described inKnowledge Discovery in Databases generallystrive to satisfy these characteristics. Theapproaches taken, however, are quite diverse.Most are based on machine learning methodsthat have been enhanced to better deal withissues particular to discovery in databases.Next, we briefly define database concepts andterminology and relate them to termscommon to machine learning.Databases and Machine LearningIn database management, a database is a logi-cally integrated collection of data maintainedin one or more files and organized to facili-tate the efficient storage, modification, andretrieval of related information. In a relation-al database, for example, data are organizedinto files or tables of fixed-length records.Each record is an ordered list of values, onevalue for each field. Information about eachfield’s name and potential values is main-tained in a separate file called a data dictio-nary. A d atabase management system is acollection of procedures for retrieving, stor-ing, and manipulating data within databases.In machine learning, the term databasetypically refers to a collection of instances orexamples maintained in a single file.2Instances are usually fixed-length feature vec-tors. Information is sometimes also providedabout the feature names and value ranges, asin a data dictionary. A learning algorithmtakes the data set and its accompanyinginformation as input and returns a statement(for example, a concept) representing theresults of the learning as output.Table 1 informally compares the terminologyin database management with that ofmachine learning. With the appropriatetranslation of terms, it seems that machinelearning could readily be applied to databas-es: Rather than learning on a set of instances,learning is done on a file of records from adatabase. 
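The remark above that purely conjunctive concepts are efficiently learnable can be illustrated, for the monotone case (positive literals only), with the classic elimination procedure: start from the conjunction of all candidate literals and let each positive example delete the literals it violates, which takes time linear in the number of features per example. The feature names and examples below are assumptions of this sketch.

```python
features = ["A", "B", "C", "D"]

def learn_monotone_conjunction(positive_examples):
    """Keep only the features that are true in every positive example."""
    hypothesis = set(features)
    for example in positive_examples:
        hypothesis -= {f for f in hypothesis if not example[f]}
    return hypothesis

positives = [
    {"A": True, "B": True, "C": True, "D": False},
    {"A": True, "B": True, "C": False, "D": True},
    {"A": True, "B": True, "C": True, "D": True},
]
concept = learn_monotone_conjunction(positives)
print(" AND ".join(sorted(concept)))   # -> A AND B
```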
Knowledge discovery in databases, however, raises additional concerns that extend beyond those typically encountered in machine learning. In the real world, databases are often dynamic, incomplete, noisy, and much larger than typical machine learning data sets (table 2). These factors render most learning algorithms ineffective in the general case. Not surprisingly, much of the work on discovery in databases focuses on overcoming these complications.

Table 1. Translations between Database Management and Machine Learning Terms (database management term, followed by the corresponding machine learning term):
- database (a logically integrated collection of dynamic files): a fixed set of examples
- file: database, data set, set of instances
- tuple, record: instance, example, feature vector
- field, attribute: feature, attribute
- field domain: possible field values
- data dictionary: field type and domain information
- relational data: a set of instances
- object-oriented, structured data: relational data
- logical condition: concept description

Table 2. Conflicting Viewpoints between Database Management and Machine Learning (database management view versus machine learning view):
- Database is an active, evolving entity, versus database is just a static collection of data
- Records may contain missing or erroneous information, versus instances are usually complete and noise-free
- A typical field is numeric, versus a typical feature is binary
- A database typically contains millions of records, versus data sets typically contain several hundred instances
- "AI should get down to reality," versus "databases are a solved problem and are therefore uninteresting"

Related Approaches

Although machine learning is the foundation for much of the work in this area, knowledge discovery in databases deals with issues relevant to several other fields, including database management, expert systems, statistical analysis, and scientific discovery.

Database Management: A database management system provides procedures for storing, accessing, and modifying the data. Typical operations include retrieval, update, or deletion of all tuples satisfying a specific condition, and maintaining user-specified integrity constraints. The ability to extract tuples satisfying a common condition is like discovery in its ability to produce interesting and useful statements (for example, "Bob and Dave sold fewer widgets this year than last"). These techniques, however, cannot by themselves determine what computations are worth trying, nor do they evaluate the quality of the derived patterns. Interesting discoveries uncovered by these data-manipulation tools result from the guidance of the user. However, the new generation of deductive and object-oriented database systems (Kim, Nicolas, and Nishio 1990) will provide improved capabilities for intelligent data analysis and discovery.

Expert Systems: Expert systems attempt to capture knowledge pertinent to a specific problem. Techniques exist for helping to extract knowledge from experts. One such method is the induction of rules from expert-generated examples of problem solutions. This method differs from discovery in databases in that the expert examples are usually of much higher quality than the data in databases, and they usually cover only the important cases (see also Gaines for a comparison between knowledge acquisition from an expert and induction from data). Furthermore, experts are available to confirm the validity and usefulness of the discovered patterns.
As with database management tools, the autonomy of discovery is lacking in these methods.Statistics: Although statistics provide a solid theoretical foundation for the problem of data analysis, a purely statistical approach is not enough. First, standard statistical meth-ods are ill suited for the nominal and struc-tured data types found in many databases (figure 2). Second, statistics are totally data driven, precluding the use of available domain knowledge, an important issue that we discuss later. Third, the results of statistical analysis can be overwhelming and difficult to interpret. Finally, statistical methods require the guidance of the user to specify where and how to analyze the data. However, some recent statistics-based techniques such as pro-jection pursuit (Huber 1985) and discovery ofcausal structure from data (Glymour et al.1987; Geiger, Paz, and Pearl 1990) addresssome of these problems and are much closerto intelligent data analysis. We expect thatmethods using domain knowledge will bedeveloped by the statistical community. Wealso believe that statistics should have a vitalrole in all discovery systems dealing withlarge amounts of data.Scientific Discovery: Discovery in databas-es is significantly different from scientific dis-covery (see also Zytkow and Baker3,) in thatthe former is less purposeful and less con-trolled. Scientific data come from experi-ments designed to eliminate the effects of allbut a few parameters and to emphasize thevariation of one or a few target parameters tobe explained. However, typical businessdatabases record a plethora of informationabout their subjects to meet a number oforganizational goals. This richness (or confu-sion) both captures and hides from viewunderlying relationships in the data. More-over, scientists can reformulate and reruntheir experiments should they find that theinitial design was inadequate. Database man-agers rarely have the luxury of redesigningtheir data fields and recollecting the data.A Framework for Knowledge DiscoveryWe have referred to discovery systems severaltimes without specifying what a discoverysystem is. Although discovery systems varyconsiderably in their design, it is possible toArticlesFALL 1992 61 Figure 1. A Framework for Knowledge Discovery in Databases.Articles62AI MAGAZINE describe a prototypical discovery system.Figure 1 depicts the basic components of ourprototypical system for knowledge discoveryin databases. At the core of the system is thediscovery method, which computes and eval-uates patterns on their way to becomingknowledge. The input to the discoverymethod include raw data from the database,information from the data dictionary, addi-tional domain knowledge, and a set of user-defined biases that provide high-level focus.The output is discovered knowledge that canbe directed to the user or back into thesystem as new domain knowledge. Becausethis model represents a prototypical system, adiscovery system need not include all theseaspects. The feedback of discovered knowl-edge into the store of domain knowledge, forexample, is present in few existing systems(see Future Directions).In the remainder of this article, we exploreeach aspect of this framework in detail.Specifically, we consider (1) the peculiaritiesof databases, (2) the use of domain knowl-edge, (3) the role of the user in the discoveryprocess, (4) methods of knowledge discovery,and (5) the form and uses of discoveredknowledge.Database IssuesThe fundamental input to a discovery systemis the raw data present in a database. 
We pre-viously mentioned that databases poseunique problems to discovery not typicallyconfronted in machine learning. In particu-lar, these problems arise from the fact thatreal-world databases are dynamic, incom-plete, noisy, and large. Other concernsinclude whether the database contains ade-quate information for interesting discoveryand how to deal with the overabundance ofirrelevant information.Dynamic Data: A fundamental characteris-tic of most databases is that their contents areever changing. Data can be time sensitive,and discovery is affected by the timeliness ofdata observations. Some data values, such asa patient’s social security number, are con-stant over time; some vary more or less grad-ually over time (for example, weight andheight); and some are so situation dependentthat only a recently observed value will suf-fice (for example, pulse rate).Irrelevant Fields: A nother key characteris-tic is the relevance of data, that is, whetheran item of data is relevant to the currentfocus of discovery. When a patient database isbeing explored for interesting patterns ofsymptoms and diagnoses, nonmedical data,such as patient name or zip code, are irrele-vant, and errors there are unimportant. Onthe other hand, pulse rate, a simple and typi-cally recorded medical observation, is rele-vant, and errors here can affect what isdiscovered. If, however, we are looking for ageographic concentration of a particular dis-ease, then a correct zip code becomes crucial.If the zip code is thought to be faulty, it cansometimes be inferred from related informa-tion, such as the patient’s address.An aspect somewhat related to relevance isthe applicability of an attribute to a subset ofthe database; for example, a patient’s preg-nant field is not applicable to men (the classof patients with sex equal to male), but it isessential to know for female patients of child-bearing age.Missing Values: The presence or absence ofvalues for relevant data attributes can affectdiscovery. Whether a patient was comatose atthe time of diagnosis can be so importantthat it does not allow the substitution of adefault value; less important missing data canbe defaulted. In an interactive system, theabsence of an important datum can spawn arequest for its value or a test to determine it.Alternatively, the absence of data can be dealtwith as if it were a condition in itself, and themissing attribute can be assigned a neutralvalue, such as unknown. For example, theabsence of some patient measurements mightbe found to imply that the patient was for-merly in excellent health.Noise and Uncertainty: For relevantattributes, the severity of error can depend onthe data type of the allowed values. Values ofdifferent attributes can be real numbers(charge), integer numbers (age), or strings(name) or can belong to a set of nominalvalues (patient type). Nominal values can beordered partially or completely and can alsofields, or more complex conditions such as a department must always have exactly one manager. In a sense, the data dictionary defines the syntax of database use.The database contains the raw data to be processed by the discovery system. In prac-tice, the discovery system must also use the additional information about the form of data and constraints on it. Some of this infor-mation can be stored in the data dictionary,but other information might exist in manuals or experts’ heads. 
For example, a discovery system for a hospital database needs to know which diagnosis codes (DX) are grouped into which diagnostic-related groups (DRG),which, in turn, are grouped into major diag-nostic categories (MDC) (figure 2). Then, if a patient’s DX does not match his or her DRG,an unusual side effect or an error in record keeping is indicated.Because discovery is computationally expensive, additional knowledge regarding the form and content of data, the domain(s)described by the database, the context of a particular discovery episode, and the purposes being served by discovery are often used to guide and constrain the search for interesting knowledge. We refer to this form of informa-tion as domain knowledge or background knowl-edge . Domain knowledge assists discovery by focusing search. However, its use is controver-sial because by telling a system what to look for and where to look for it, domain knowl-edge restricts search and can deliberately rule out valuable discovery. An example discussed at the IJCAI-89 Knowledge Discovery Work-shop illustrates the trade-off between quickly finding conventional solutions and discard-ing unusual ones. In logistics planning, the search space is so large that it is impossible to find solutions without using constraints such as “trucks don’t drive over water.” This con-straint, however, eliminates potentially inter-esting solutions, such as those in whichArticlesFALL 1992 63have a semantic structure (figure 2).Another aspect of uncertainty is the inher-ent or expected exactitude of data, that is, the degree of noise in the data. Repeated mea-surements cluster around an average; based on a priori analysis or computations on the measurements themselves, a statistical model describing randomness is formulated and used to define expected and tolerable devia-tions in the data. Subjective input, such as whether a patient’s condition is severe or moderate, can vary considerably about one or more cluster points. Often, statistical models are applied in an ad hoc manner to subjec-tively determined attributes to obtain gross statistics and judge the acceptability of (com-binations of) attribute values.Especially with regard to numeric data types, data precision can be a factor in discov-ery. For example, it is regarded as adequate to record body temperature to a precision of 0.1degree. A sensitive trend analysis of body temperature would require even greater preci-sion in the data. To a discovery system able to relate such trends to diagnosis, this impreci-sion would appear to be noise in the input.Missing Fields: An inadequate view of the database can make valid data appear to be in error. The database view is the totality of usable attributes or accessors that the discov-ery system can apply to a problem. It is assumed that the attributes differentiate cases of interest. When they don’t, there appears to be some error. Suppose, for example, that a system is tasked to learn to diagnose malaria from a patient database that does not include the red blood cell count. Patients whose records are correct and who are medically identical with respect to this given view might have different diagnoses, which, in turn, might incorrectly be blamed on data error.Databases and KnowledgeIgnorance is the curse of God, knowledge the wing wherewith we fly to heaven.—William ShakespeareA database is a logically integrated collection of files. 
Associated with a database is a data dictionary, which defines field names, the allowable data types for field values, various constraints on field values, and other related information. For example, the field age is a positive integer and the field date-of-birth has the form MMDDYY . Types of constraints include ranges of possible values for numeric fields, lists of possible values for nominal。