Mining Top-k Approximate Frequent Patterns
采矿工程英语面试及笔试资料Introduction:Mining engineering is a field that involves the extraction of valuable minerals or other geological materials from the earth. It requires a combination of technical skills, knowledge of geology, and an understanding of mining operations. This document provides a comprehensive guide for conducting interviews and written tests for candidates applying for mining engineering positions. The content is purely fictional and should be used for illustrative purposes only.1. Interview Questions:1.1 Technical Skills:1.1.1 Can you explain the process of mineral exploration and how it differs from mining?Answer: Mineral exploration involves searching for deposits of minerals in the ground. It includes geological mapping, geophysical surveys, and drilling to determine the presence and extent of mineral resources. Mining, on the other hand, involves extracting the minerals from the ground and processing them for various uses.1.1.2 What are the different methods of mining? Can you explain each briefly?Answer: There are several methods of mining, including underground mining, open-pit mining, placer mining, and mountaintop removal mining.- Underground mining involves extracting minerals from beneath the earth's surface, using tunnels and shafts.- Open-pit mining is the process of extracting minerals from a large, open excavation.- Placer mining involves extracting minerals from alluvial deposits, such as rivers or beach sands.- Mountaintop removal mining is a method used to extract coal by removing the summit of a mountain.1.1.3 How do you ensure the safety of mining operations?Answer: Safety is a top priority in mining operations. Measures such as regular inspections, proper training of workers, implementation of safety protocols, and use of protective equipment are crucial. Additionally, conducting risk assessments, maintaining proper ventilation, and monitoring gas levels are essential to ensure a safe working environment.1.2 Knowledge of Mining Regulations and Environmental Impact:1.2.1 Can you explain the role of mining regulations in the industry?Answer: Mining regulations are put in place to ensure responsible and sustainable mining practices. These regulations cover areas such as safety, environmental protection, labor rights, and community engagement. Compliance with these regulations is essential for maintaining the industry's social license to operate.1.2.2 How can mining operations minimize their environmental impact?Answer: Mining operations can minimize their environmental impact through various measures, including:- Implementing proper waste management and reclamation plans to restore the land after mining activities.- Using advanced technology and best practices to reduce air and water pollution.- Engaging in biodiversity conservation efforts and minimizing disturbance to ecosystems.- Promoting sustainable water management practices and reducing water usage.1.3 Problem-Solving and Analytical Skills:1.3.1 Can you provide an example of a challenging problem you encountered in a mining project and how you resolved it?Answer: During a mining project, we encountered a sudden increase in water inflow into the underground mine, which posed a significant risk to the safety of workers and the stability of the mine. To resolve this issue, we implemented emergency dewatering measures, including the installation of additional pumps and the reinforcement of underground support structures. 
We also conducted a thorough investigation to identify the root cause and implemented preventive measures to avoid such incidents in the future.1.3.2 How do you approach risk assessment in mining projects?Answer: Risk assessment is crucial in mining projects to identify potential hazards and develop strategies to mitigate them. I approach risk assessment by conducting comprehensive site inspections, analyzing historical data, and engaging with experts in geotechnical engineering and safety. By identifying potential risks early on, we can implement appropriate control measures to minimize their impact on the project.2. Written Test:2.1 Multiple Choice Questions:2.1.1 Which mining method involves extracting minerals from beneath the earth's surface?a) Open-pit miningb) Placer miningc) Underground miningd) Mountaintop removal mining2.1.2 What is the purpose of mining regulations?a) Ensuring maximum profitability for mining companiesb) Protecting the rights of mining workersc) Minimizing the environmental impact of mining operationsd) Promoting community engagement in mining projects2.1.3 What is the primary objective of risk assessment in mining projects?a) Maximizing the productivity of the mineb) Identifying potential hazards and developing strategies to mitigate themc) Reducing the cost of mining operationsd) Ensuring compliance with mining regulations2.2 Short Answer Questions:2.2.1 Define mineral exploration and explain its significance in the mining industry.Answer: Mineral exploration is the process of searching for deposits of minerals in the ground. It is significant in the mining industry as it helps identify the presence and extent of mineral resources, allowing mining companies to make informed decisions regarding the feasibility and profitability of mining projects.2.2.2 List three measures that mining operations can take to minimize their environmental impact.Answer: Mining operations can minimize their environmental impact by:- Implementing proper waste management and reclamation plans.- Using advanced technology and best practices to reduce pollution.- Engaging in biodiversity conservation efforts and minimizing disturbance to ecosystems.Conclusion:This comprehensive guide provides a framework for conducting interviews and written tests for candidates applying for mining engineering positions. It covers variousaspects, including technical skills, knowledge of mining regulations and environmental impact, and problem-solving abilities. Employers can use this guide as a reference to ensure a thorough evaluation of candidates' suitability for mining engineering roles.。
Mining multiple-level spatial association rules for objects witha broad boundaryEliseo Clementini a,*,Paolino Di Felice a ,Krzysztof Koperski baDepartment of Electrical Engineering,University of L'Aquila,67040Poggio di Roio,L'Aquila,ItalybMathsoft,Inc.,1700Westlake Ave.N.,Suite 500,Seattle,WA 98109-3044,USAReceived 30July 1999;accepted 21December 1999AbstractSpatial data mining,i.e.,mining knowledge from large amounts of spatial data,is a demanding ®eld since huge amounts of spatial data have been collected in various applications,ranging from remote sensing to geographical in-formation systems (GIS),computer cartography,environmental assessment and planning.The collected data far ex-ceeds people's ability to analyze it.Thus,new and e cient methods are needed to discover knowledge from large spatial databases.Most of the spatial data mining methods do not take into account the uncertainty of spatial information.In our work we use objects with broad boundaries,the concept that absorbs all the uncertainty by which spatial data is commonly a ected and allows computations in the presence of uncertainty without rough simpli®cations of the reality.The topological relations between objects with a broad boundary can be organized into a three-level concept hierarchy.We developed and implemented a method for an e cient determination of such topological relations.Based on the hierarchy of topological relations we present a method for mining spatial association rules for objects with uncertainty.The progressive re®nement approach is used for the optimization of the mining process.Ó2000Elsevier Science B.V.All rights reserved.Keywords:Association rule;Data mining;Spatial database;Topological relation;Uncertainty1.IntroductionLarge amounts of spatial data collected through remote sensing,e-commerce,and other data collection tools,make it crucial to develop tools for the discovery of interesting knowledge from large spatial databases.This situation creates the necessity of an automated knowledge/infor-mation discovery from data,which leads to a promising emerging ®eld,called data mining or knowledge discovery in databases (KDD).Knowledge discovery in databases can be de®ned as the nontrivial extraction of implicit,previously unknown,and potentially useful information from data [17].Data mining represents the integration of several ®elds,including machine learning,database systems,data visualization,statistics,and informationtheory.Data &Knowledge Engineering 34(2000)251±270/locate/datak*Corresponding author.Tel.:+39-862-434431;fax:+39-862-434403.E-mail addresses:eliseo@ing.univaq.it (E.Clementini),difelice@ing.univaq.it (P.Di Felice),krisk@ (K.Koperski).0169-023X/00/$-see front matter Ó2000Elsevier Science B.V.All rights reserved.PII:S 0169-023X (00)00017-3252 E.Clementini et al./Data&Knowledge Engineering34(2000)251±270The majority of the data mining algorithms was developed for the analysis of relational and transactional databases,but recently non-spatial data mining techniques have been expanded toward mining from spatial data.Generalization-based spatial data mining methods[26,27]dis-cover spatial and non-spatial relations at a general concept level,where spatial objects are ex-pressed as merged spatial regions[26]or clustered spatial points[27,16,18].However,these methods do not discover rules re¯ecting the structure of spatial objects and spatial/spatial or spatial/non-spatial relations that contain spatial predicates,such as adjacent,inside,close,and intersects.Spatial association 
rules[24],i.e.,the rules of the form``P A R'',where P and R are sets of predicates,use spatial and non-spatial predicates in order to describe spatial objects using relations with other objects.For example,a rule``is a X Y university A inside X Y city ''is a spatial association rule.The re®ned spatial predicates used in[24]were not analyzed at di erent ab-straction levels,as we did in this paper.In a large database many association relations may exist,but some may occur rarely or may not hold in most cases.To focus our study on patterns that are relatively``strong'',i.e.,patterns that occur frequently and hold in most cases,the concepts of minimum support and minimum con-®dence are used[1,2].Informally,the support of a pattern P in a set of spatial objects S is the probability that a member of S satis®es pattern P;and the con®dence of the rule P A R is the probability that the pattern R occurs,if the pattern P occurs.A user or an expert may specify thresholds to con®ne the discovered rules to be the strong ones,that is,using the patterns that occur frequently and the rules that demonstrate relatively strong implication relations.Most spatial data mining methods,including previous study on spatial association rules[24], use spatial objects with exactly known location.However,in real situations the extensions of spatial objects can be known only with a®nite accuracy.There are di erent sources that cause spatial information to be uncertain:incompleteness,inconsistency,vagueness,imprecision,and error[31].Incompleteness is related to totally or partly missing data:the prototypical situation of this kind is when a dataset is obtained from digitizing paper maps and pieces of lines are missing. Inconsistency arises when several versions of the same object exist,due either to di erent time snapshots,or datasets of di erent sources,or di erent abstraction levels.Vagueness is an intrinsic property of many natural geographic features that do not have crisp or well-de®ned boundaries. 
Imprecision is due to a®nite representation of spatial entities:the basic example of this kind is the regular tessellation used in raster data,where the element of the tessellation is the smallest unit that represents space.Error is everything that is introduced by limited means of taking measurements.In this paper,we extend the technique for mining spatial association rules[24]toward mining in the situation when spatial information is inaccurate.Spatial predicates used in previous work are based on the assumption that the boundary of spatial objects is exactly determined,i.e.,the objects are crisp.Many research papers propose an approach to deal with uncertainty in spatial data in which objects are represented by the lower and upper approximations of their extent,i.e., objects have broad boundaries[9,13,15,28].The advantage of this approach is that it can be im-plemented on existing database systems at a reasonable cost:the new model can be seen as an extension of existing geometric models.Topological relations,which are the spatial predicates taken into account in this paper,have been studied for simple regions with a broad boundary in[10].A set of topological relations for complex objects with a broad boundary has been proposed in[12],where the topological predi-E.Clementini et al./Data&Knowledge Engineering34(2000)251±270253 cates have been hierarchically organized into three levels.The bottom level of the hierarchy o ers detailed topological relations using an extension of the9-intersection model[14].The intermediate and top levels o er more abstract operators that allow users to query uncertain spatial data in-dependently of the underlying geometric data model.In the present work we study e cient methods for mining spatial association rules that use a progressive search technique.This technique®rst searches for frequent patterns at a high concept level of the topological relations.Then,only for such frequent patterns,it deepens the search to lower concept levels(i.e.,their lower level descendants).Such a deepening search process con-tinues until no frequent patterns can be found.A decision tree is constructed to determine the type of a topological relation between two objects.The distribution of topological relations is taken into account during the construction of the decision tree in order to minimize the number of computations needed to determine the type of a topological relation.The remainder of this paper is organized as follows.In Section2we recall the basic de®nitions from the extended model for complex objects with a broad boundary,which o ers a uniform way of treating uncertainty in spatial data[12].Section3introduces the notion of concept hierarchies including the three-level hierarchy of topological relations.Section4presents a method for the determination of the topological relations between objects with broad boundaries.Section5in-troduces the algorithm for mining the strong spatial association rules that use topological rela-tions between objects with broad boundaries.Section6draws short conclusions.2.The model for complex objects with a broad boundaryIn this section,we recall the basic de®nitions of the concepts that are used to describe composite objects with a broad boundary.De®nition1.A omposite region with ro d ound ry e is made up of two composite regions A1 and A2with A1 A2,where o A1is the inner ound ry of e and o A2is the outer ound ry of e.De®nition2.The ro d ound ry D A of a composite region with a broad boundary e is the closed subset comprised between the 
inner boundary and the outer boundary of A, i.e., ΔA = A2 − A1, or equivalently ΔA = A2 − A1°.

Definition 3. Interior, closure, and exterior of a composite region with a broad boundary A are defined as A° = A2 − ΔA, Ā = A° ∪ ΔA, and A⁻ = R² − Ā, respectively.

Fig. 1 illustrates some configurations of composite regions: case (a) is a region with two components; case (b) is a region where A1 has two components and A2 has one component; case (c) is a region where A1 has one component and A2 has two components.

Fig. 1. Composite regions with a broad boundary.

3. The concept hierarchies

We believe that the advancement of OLAP tools is partially related to the ability of the OLAP systems to provide multi-level, multidimensional presentation of data stored in large data warehouses [5]. The existing OLAP systems provide mainly tools for summarization and visualization of generalized data. Therefore, a new approach to data mining, which integrates OLAP and data mining, was proposed. This new approach, called On-Line Analytical Mining (OLAM), presents a promising direction for mining large databases and data warehouses [22]. Analysts can interactively adjust the level of generalization. For example, an analyst may start with descriptions of all schools using spatial predicates such as touch(school, park). If he/she wants more detailed information about different types of schools, he/she can drill down and use predicates such as touch(high_school, park). Such progressive "zoom in" and "zoom out" operations can individualize the mining process for different purposes. For example, a higher-level user may concentrate on general information, while other users can look for details.

Concept hierarchies [20] are used in previous work on spatial association [24] to facilitate presentation of knowledge at different levels. As we ascend the concept hierarchy, information becomes more and more general, but it still remains consistent with the lower concept levels. For example, a concept hierarchy for rock can be specified as follows:

(rock
  (igneous (intrusive igneous (granite, diorite, ...)),
           (extrusive igneous (basalt, rhyolite, ...))),
  (sedimentary (clastic sedimentary (sandstone, shale, ...)),
               (chemical sedimentary (limestone, dolomite, ...)),
               (organic sedimentary (chalk, coal, ...))),
  (metamorphic (foliated metamorphic (slate, gneiss, ...)),
               (non-foliated metamorphic (marble, quartzite, ...)))).

According to it, both lowest-level concepts limestone and dolomite can be generalized¹ to the concept chemical sedimentary, which in turn can be generalized to the concept sedimentary that also includes organic sedimentary. Concept hierarchies can be explicitly given by the experts, or in some cases they can be generated automatically by data analysis [21]. In some cases concept hierarchies can be encoded in a database schema. For example, there may exist separate attributes for the name of the rock (e.g., granite), its group (e.g., intrusive igneous), and its type (e.g., igneous). For the purpose of this paper we assume that non-spatial predicates are generalized based on an underlying schema.

¹ One can notice that such a generalization process differs from the cartographic map generalization process, which involves the generalization of the symbolic representations of the objects [4].

The novelty of this paper is to explore the benefits of forming a hierarchical structure for spatial relations. Different kinds of spatial relations can be used for this purpose. For example, a hierarchical structure of metric relations could be based on the methods proposed
in [11], but this is outside the scope of the present paper. Hereafter, we concentrate on a concept hierarchy concerning topological relations. Binary topological relations between two objects, A and B, in R² can be classified according to the intersection of A's interior, boundary, and exterior with B's interior, boundary, and exterior. The nine intersections between the six object parts describe a topological relation and can be concisely represented by the following 3×3 matrix M, called the 9-intersection [14]:

M = ( A° ∩ B°   A° ∩ ΔB   A° ∩ B⁻
      ΔA ∩ B°   ΔA ∩ ΔB   ΔA ∩ B⁻
      A⁻ ∩ B°   A⁻ ∩ ΔB   A⁻ ∩ B⁻ ).

By considering the values empty (0) and nonempty (1), we can distinguish between 2⁹ = 512 binary topological relations. For two simple regions with a 1-dimensional boundary, only eight of them can be realized. For two composite regions with a 1-dimensional boundary, there are eight additional matrices that can be realized, totaling 16 relations. When objects are approximated using a broad boundary this number grows to 56. Generally speaking, the 9-intersection method has the advantage that users can test for a large number of spatial relations and fine-tune the particular relation being tested. It has the disadvantage that it is too detailed for most practical applications and it does not have a corresponding natural-language equivalent.

The topological relations between pairs of composite regions with a broad boundary at three hierarchical levels were studied in [12]. Thereafter, we summarize the concepts of that paper that are necessary for the self-contained reading of the present paper.

The bottom level, which deals directly with geometry, consists of the 56 relations defined in terms of the 9-intersection matrices (Fig. 2). Such relations can be organized in a graph where each node is labeled with a relation (for the numbering the reader may refer to [12]) and the arcs express geometric proximity between relations (Fig. 3). At the top level, a definitely smaller number of relations (namely the four relations disjoint, touch, overlap, and in of the so-called Calculus-Based Method (CBM)² [6,8]) is sufficient to describe the topological relations between pairs of composite regions with a broad boundary. Each CBM relation corresponds to a cluster of 9-intersection matrices, as depicted in Fig. 3. The mapping between the two levels can also be concisely expressed by patterns of the 9-intersection matrices (see Table 1, where d stands for any value, 0 or 1). Opposite to the bottom level, the top level is much more abstract, as it does not provide the user with the geometric details related to the presence of broad boundaries and multiple components.

² The CBM definitions of the top-level relations are as follows:
⟨A, touch, B⟩ ⇔ A° ∩ B° = ∅ ∧ A ∩ B ≠ ∅;
⟨A, in, B⟩ ⇔ A ∩ B = A ∧ A° ∩ B° ≠ ∅;
⟨A, overlap, B⟩ ⇔ dim(A°) = dim(B°) = dim(A° ∩ B°) ∧ A ∩ B ≠ A ∧ A ∩ B ≠ B;
⟨A, disjoint, B⟩ ⇔ A ∩ B = ∅.

Fig. 2. The 56 topological relations between composite regions with a broad boundary.

More details are offered by the intermediate level, structured in terms of 14 relations, which results in the clustering illustrated in Fig. 4. For brevity, Table 2 shows only the 9-intersection patterns for the intermediate-level relations descending from the top-level relations touch and in. Table 3 summarizes the three-level concept hierarchy of the topological relations.

4. Determination of the topological relations

We use decision trees to determine which predicate is satisfied at the level l of the topological concept
hierarchy.The nodes of the decision tree store tests for the intersections from the 9-in-tersection matrix.Based on the values for these intersections the search space is partitioned so,®nally a leaf node of the tree contains only a single relation at the level l .In general the average cost to determine a relation at the level l of the topological concept hierarchy can be calculated as the weighted sum of the operations needed to determine a generalization of a bottom-level re-lation i at the level l (i.e.,the length of the path)times the probability that the iexistsFig.3.The clusters of the topological relations at the top level.E.Clementini et al./Data &Knowledge Engineering 34(2000)251±270257Cost l 56i 1op num gen l Y R i ÃPr R i XThe goal is to build a decision tree that minimizes the cost of computations.Therefore,the re-lations should be found with the smallest possible number of tests.Unfortunately the task of building decision trees with the smallest number of tests is known to be NP-complete [23].The problem of constructing decision trees has been widely analyzed in the areas of machine learning and data mining,where it is used in the classi®cation of data.The decision tree algorithms try to minimize the size of the resulting trees,because usually smaller trees enable better accuracy of classi®cation [29].In the process of the determination of predicates smaller trees enable faster computations.In [7]the authors use decision trees to determine which topological relation is satis®ed between two objects.The algorithm uses patterns consisting of a single matrix that uniquely de®nes the relation between objects with crisp boundaries.Therefore,this approach would be di cult to apply in the case of topological relations between objects with broad boundaries,because for some relations the template consists of a disjunction of 9-intersection matrices (see Table 1).For the purpose of building a decision tree we use the ID3algorithm [30]that tries to build a tree with the minimal number of nodes in order to distinguish between di erent groups of objects.ID3is a greedy algorithm that uses values of the attributes and data distribution to decide which attribute should be used to partition the dataset.This way we can optimize the cost of determining the topological relations based on the distribution of bottom-level relations.Such distribution can be estimated using results of the previous queries or established using sampling techniques.Based on the information gain measure the algorithm chooses an intersection from the 9-intersection matrix,which allows for the maximum separation of classes.If there are p i pairs of objects related by the topological relation i and there are p pairs of objects related by any topological relationTable 1Mapping of top-level topological relations to 9-intersectionmatrices disjoint0d d d 0d d ddHd Ie touch0d d d 1d d d d Hd Ie overlap1d dd d 1d 1d Hd Ie 1d d1d 1d d d Hd Ie 11ddd d d 1dHd Ie 11d 1d d d ddHd Ie ind 0dd d 0dd dHd Ie d d d 0d d d 0dH d Ie 258 E.Clementini et al./Data &Knowledge Engineering 34(2000)251±270then an arbitrary relation between pair of objects should be i with the probability p i /p .When a decision tree is used to determine the relation between two objects it can be regarded as a source of messages for i Õs with the expected information needed to generate this message given byI p 1Y p 2Y F F F Y p m Àm i 1p i p log 2p ip X If a value of the intersection g is used to partition the dataset of relations to two subsets g 0andg 1,then the 
expected information gain for g is obtained as the following weighted average:E C p 10 ÁÁÁ p m 0p I p 10YF F F Y p m 0 p 11 ÁÁÁ p m 1p I p 11Y F F F Y p m 1 Ywhere p i 0and p i 1are the numbers of object pairs that satisfy the relation i and that have thevalues of the g intersection 0and 1,respectively.The information gained by branching on g isgain C I p 1Y p 2Y F F F Y p m ÀE CXFig.4.The clusters of the topological relations at the intermediate level.E.Clementini et al./Data &Knowledge Engineering 34(2000)251±270259ID3examines all intersections and splits the decision tree based on the intersection that maximizes the information gain.Fig.5presents a decision tree constructed in this way in the case when all 56bottom-level relations are equally possible and the top-level relations are determined.When all bottom-level relations are equally probable the cost is 3.0(the leaf nodes of the tree contain di erent numbers of the bottom-level relations).We have implemented the decision tree algorithm and tested it using di erent distributions of the topological relations.The tree presented in Fig.5can be used for any distribution of the topological relations,but in the case when the distribution is not even its cost can be larger than for other trees.For example,we can have a distribution of topological relations such that the most of the relations are in (in particular 79.9%are the relations 40of Fig.2),10%are disjoint,10%are overlap (relation 18),and the rest of relations is equally distributed between other 53relations.In this case the tree from Fig.5has the cost 3.7(i.e.,0X 799Â4 0X 1Â2 0X 1Â3 ÁÁÁ),while the decision tree built by the presented algorithm has the cost 2.2.Such distribution can happen whenTable 2The examples of the mapping of intermediate-level topological relations to9-intersection matrices nearlyInsided 0d 1d 0d 1dH d I e d 0d d d011dHd Ie nearlyEquald 1d0d 0d 0d Hd Ie d 000d d d 0d H d I e d 0dd d 0d 0dHd Ie d 0d 0d 00ddHd Ie nearlyContainsd 1d0d 1d 0d Hd Ie d d 10d 1d 0dH d Ie nearlyMeet0d 1d 1d 1d d Hd Ie coveredByBoundary0d 0d d d 1d d Hd Ie coversWithBoundary0d 1d d d 0d d Hd Ie boundaryOverlap0d 0d d d 0d dHd Ie 260 E.Clementini et al./Data &Knowledge Engineering 34(2000)251±270one type of objects is very large in size,and a coarse resolution ®lter (see the next section)is used to eliminate many disjoint relations.A decision tree can be built for any level of the topological concept hierarchy.In the case when topological relations are determined in a sequence,starting from the top level and proceeding to lower levels,like in the case of the mining algorithm presented in the next section,an optimization can take place.In such case a number of separate decision trees can be constructed for the re-lations contained in di erent leaf nodes of the decision tree for a higher level.For example,there are three leaf nodes in the tree from Fig.5that contain children of the relation in .The node a contains intermediate-level relations ne rlysnside and ne rlyiqu l ;the node b contains relations ne rlygont ins and ne rlyiqu l ;and the node c contains a single relation ne rlyiqu l .Therefore,when the information obtained from the decision tree for top-level relations is known,and a user wants to determine intermediate-level relations the decision trees have to be built only for the nodes a and b .Also each of these decision trees would have to distinguish just between twoTable 3The three-level hierarchy of topological relationsLevelTotal number of operators Top level disjoint touch 
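The split criterion can be summarized as follows: the expected information of the class distribution is I(p₁, …, p_m) = −Σᵢ (pᵢ/p) log₂(pᵢ/p); testing an intersection C partitions the object pairs into those with C empty and those with C nonempty, and E(C) is the corresponding weighted average of I over the two parts; the information gained by branching on C is gain(C) = I(p₁, …, p_m) − E(C). The sketch below computes these quantities; the relation counts in it are invented for illustration and are not the distributions used in the paper's experiments.

```python
import math

# Sketch: the ID3 information-gain criterion used to pick which of the nine
# intersections to test next. The counts below are invented for illustration.

def info(counts):
    """I(p1,...,pm): expected information of a class distribution."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def gain(counts_if_0, counts_if_1):
    """Information gained by branching on one intersection of the 9-intersection matrix."""
    n0, n1 = sum(counts_if_0), sum(counts_if_1)
    total = n0 + n1
    before = info([a + b for a, b in zip(counts_if_0, counts_if_1)])
    after = n0 / total * info(counts_if_0) + n1 / total * info(counts_if_1)
    return before - after

# Object pairs per relation (say disjoint, touch, overlap, in), split by the
# value a candidate intersection C takes on them:
counts_if_0 = [100, 10, 0, 0]   # pairs on which C is empty
counts_if_1 = [0, 5, 20, 65]    # pairs on which C is nonempty
print(round(gain(counts_if_0, counts_if_1), 3))
```

ID3 branches on the intersection with the largest gain at each node, which is what lets the resulting tree adapt to a skewed distribution of bottom-level relations such as the in-heavy distribution discussed in the text.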
overlap in (and reverse)4Intermediate level disjoint nearlyMeets coveredByBoundary coversWithBoundary boundaryOverlapnearlyOverlaps interiorCoveredBy-Interior interiorCov-ersInterior partlyIn-side partlyContains crossCointainment nearlyInside nearlyContains nearlyEqual 14Bottom level (9-intersection)1relation matrix 16relation matrices 26relation matrices 13relation matrices56Fig.5.Example of a decision tree for top-level relations.intermediate-level relations instead of three relations.The optimized average cost of the deter-mination of the children of the relation in is1.54,while if one tries to determine the intermediate-level relations based only on the information that the top-level relation is in the average cost is 2.69(assuming even distribution of the bottom-level relations in both cases).5.Spatial association rulesSpatial association rules represent spatial and non-spatial relations between objects.For ex-ample,the following is a spatial association rule:is a X Y resort overlap X Y national park A is expensive X 30%Y80%This rule states that80%of resorts that overlap national parks are expensive and that30%of resorts satisfy all predicates in the rule,i.e.,they are expensive and they overlap national parks. Various kinds of spatial predicates can be involved in spatial association rules.They may represent topological relations described in the previous section.They may also represent spatial orientation or ordering,such as left,right,north,east,or contain some distance information,such as close_to,far_from,etc.These single predicates can be used in conjunction providing more detailed descriptions of spatial objects.For the systematic study of the mining of spatial associ-ation rules,we®rst present some concepts.De®nition4.A conjunction of k single predicates is called a kEpredi te.The support of a k-predicate,R R1 ÁÁÁ R k,in a set of objects ,denoted as r R a S ,is the number of objects in that satisfy ,versus the cardinality(i.e.,the total number of objects)of .A sp ti l sso i tion rule is a rule in the form ofP A Q s%Y c%where s%is the support of the rule, %is the onfiden e of the rule D is a k-predicate, is an m-predicate,and at least one of the predicates forming or is a spatial predicate.The support of the rule is the support of the conjunction of the predicates that form the rule(i.e.,P Q).The onfiden e3of rule P A Q in ,is the ratio of r P Q a S ,over r P a S .Most people are interested in the patterns that occur relatively frequently(i.e.,with wide support)and the rules that have strong implications(i.e.,with high con®dence).The rules with large values of support and high con®dence are strong rules.To mine strong rules two kinds of thresholds±minimum support and minimum con®dence±are used.3The concept of the con®dence of the rule provides the estimator of conditional probability Pr Q j P ,i.e.,the probability that k-predicate is satis®ed by a randomly chosen object that satis®es k-predicate .This concept should not be confused with the concept of con®dence interval that is used in statistics.The con®dence interval is the interval,that with a certain probability,a value of a statistical variable falls into.De®nition5.A k-predicate is frequent in set ,at level l,if the support of is no less than its minimum support threshold for level l,and all ancestors of from the concept hierarchy,are frequent at their corresponding levels.The con®dence of a rule P A Q is high at level l if its con®dence is no less than its corresponding minimum con®dence threshold.A rule P A Q is strong 
if the predicate P ∧ Q is frequent in set S and the confidence of P → Q is high.

Although it is possible to find strong spatial association rules that use only the topological relations from the lowest level of the concept hierarchy proposed in [12], such an approach would be quite inefficient, because most intersection values from the 9-intersection model would have to be determined. Moreover, because many of these relations may characterize infrequently occurring configurations, they would not pass the support threshold, and therefore the results of these computations would not be presented to the user despite the computation time spent.

In our approach the mining starts with a filtration process, which is based on a predicate describing the relation between the MBRs of the objects. Later the algorithm uses the topological relations from the top level of the hierarchy presented in Table 3. Predicates such as disjoint, touch, overlap, or in are found. Then the algorithm proceeds to the lower levels of the topological hierarchy. In this case only the children of the topological relations that are frequent are examined in detail. For example, if the touch relation characterizes only a small number of objects, then its children, such as nearlyMeet or boundaryOverlap, would not be computed, because they cannot satisfy the conditions for frequent predicates according to Definition 5. The mining process can be summarized in the following algorithm.

Algorithm 5.1. Mining multi-level spatial association rules for objects with a broad boundary.

Input:
1. Spatial database SDB containing a set of spatial objects with broad boundaries.
2. Set of spatial and non-spatial concept hierarchies.
3. Two thresholds for each level l of description: minimum support (min_support[l]) and minimum confidence (min_confidence[l]).
4. Mining query, which consists of:
   4.1. a reference set S of objects that are described,
   4.2. a set of task-relevant sets for spatial objects, and
   4.3. a set of relevant predicates.

Output: A set of strong multi-level spatial association rules using topological relations between objects with broad boundaries.

Method:
1. Relevant_DB := extract_task_relevant_objects(SDB);
2. MBR_Predicate_DB := find_MBR_predicates(Relevant_DB);
3. Predicate_DB := filter_with_minimum_support(min_support[1], MBR_Predicate_DB);
4. for (Level := 1; Level <= l_max && Predicate_DB != ∅; Level++)
5. { Predicate_DB := find_predicates(Level, Predicate_DB, Relevant_DB);
6.   Predicate_DB := filter_with_minimum_support(min_support[Level], Predicate_DB);
7.   mine_association_rules(Predicate_DB); }
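Algorithm 5.1 is mostly bookkeeping over sets of supporting objects, which the sketch below tries to make concrete. Only the overall control flow mirrors the pseudocode above; everything else is our own simplification rather than the authors' implementation: objects are represented as precomputed predicate sets (a stand-in for real geometric tests), the parent-to-children table is a small fragment of the hierarchy, the rule-generation stub also computes rule confidence, and the MBR filtration step is only hinted at.

```python
# Sketch of Algorithm 5.1's progressive refinement (simplified, illustrative).
# An "object" is a dict entry mapping its id to the set of predicate names it
# satisfies, standing in for actual topological computations.

CHILDREN = {   # fragment of the topological concept hierarchy
    "touch": ("nearlyMeet", "boundaryOverlap", "coveredByBoundary", "coversWithBoundary"),
    "in": ("nearlyInside", "nearlyContains", "nearlyEqual"),
}

def supporters(predicate, relevant_db):
    return {obj for obj, preds in relevant_db.items() if predicate in preds}

def filter_with_minimum_support(min_support, predicate_db, n_objects):
    return {p: s for p, s in predicate_db.items() if len(s) / n_objects >= min_support}

def mine_association_rules(predicate_db, n_objects, min_confidence):
    # Emit P -> Q with its support and confidence, for predicates frequent at this level.
    for p, sp in predicate_db.items():
        for q, sq in predicate_db.items():
            if p != q and sp:
                support = len(sp & sq) / n_objects
                confidence = len(sp & sq) / len(sp)
                if confidence >= min_confidence:
                    print(f"{p} -> {q} ({support:.0%}, {confidence:.0%})")

def mine_multilevel(relevant_db, min_support, min_confidence, max_level):
    n = len(relevant_db)
    # Level 1: coarse, cheap predicates (after MBR filtration, not shown here).
    predicate_db = {p: supporters(p, relevant_db)
                    for p in ("disjoint", "touch", "overlap", "in")}
    for level in range(1, max_level + 1):
        predicate_db = filter_with_minimum_support(min_support[level], predicate_db, n)
        if not predicate_db:
            break
        mine_association_rules(predicate_db, n, min_confidence[level])
        # Deepen only below predicates that survived the support filter.
        predicate_db = {c: supporters(c, relevant_db)
                        for p in predicate_db for c in CHILDREN.get(p, ())}
```

Since a child relation can never be supported by more object pairs than its parent, pruning at a coarse level is safe: anything filtered out there could not have met Definition 5 at a finer level anyway.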
Kibble hoist 吊桶提升, kibble winder 吊桶提升机,kicker port 放气口,knapper 碎石工,knapping machine 碎石机,jack drill 凿岩机,jack hammer 手持式凿岩机,opencast working 露天开采,inbreak 崩落,incline inset 斜井井底车场,inclined drift 倾斜巷道,induced block caving 阶段人工崩落开采法,infiltration ditch 排水沟,intake 进风向道,level 主平巷,interchamber pillar 矿房间的矿柱,bucket excavator 多斗挖掘机,bullfrog 平衡锤,butt heading 回采平巷,artificial caving 人工崩落,adjoining rock 围岩,benching 阶梯式开采,benching bank 阶段,cage winding 罐笼提升cage winding machine 罐笼提升机,carryall scraper 轮式铲运机,crushing machine 破碎机working face 工作面,cut and fill mining 充填开采,cut off grade 品位下限winding shaft, hoisting shaft 提升井,open pit bench 露天矿阶段,open up by adits 平硐开拓ventilation shaft 通风井,pithead, mine entrance 矿井口,井口,gallery 平峒,平巷timbering, shoring 撑材,支护,opening up by veryical shafts 竖井开拓,prop, shore 支柱,open stope method 空场采矿法,lining, planking 加衬,ore dilution 矿石贫化,ore drawing 方矿,ore deposit 矿床air vent 气孔,排气口truck 斗车,ore removal 出矿,dump skip 倾卸箕斗dump truck 自卸式载重汽车slag 矿渣slag heap 矿渣场tip 矿渣堆collier, coal miner 矿工,煤矿工人jumper 钻孔机explosive 炸药charge 凿岩机blast hole 钻孔undercutter 擦伤miner's lamp, safety lamp 矿灯,安全灯fire damp explosion 瓦斯爆炸cave-in 陷落landslide 坍落flooding 漫灌asphyxia, suffocation, gassing 窒息eld 矿床,outcrop 露头,fault 断层,vein, sean, lode 矿脉,reservoir 储藏water table 潜水面,地下水面,mine 矿, stratum, layer 矿层,quarry 露天采石clay pit 粘土矿坑,peat bog 泥炭沼,gold nugget 块金,gangue 脉石,矿石,尾矿prospector 探矿者prospecting 探矿boring, drilling 钻探auger, drill 钻excavation 发掘quarrying, extraction 采石,miner 矿工, borer, drill, drilling machine 钻机stonemason 石工锤stonecutter 切石机mining engineer 采矿工程师ditch 沟道ditch and trench excavator 挖沟机ditch blasting 开沟爆破ditch excavator 挖沟机ditcher 挖沟机divergence 发散divergency 发散divide 分水岭divider 罐粱;罐梁;分隔器division 分割;区域division surface 分界面divisional plane 节理do jiargillaceous rock 泥质岩do jiarsenolite 砷华do jiblastproof 防爆的do jibolter 筛do jibrittle 脆的dobie blasting 糊炮dock 栈桥dog 把手dolerite 粗玄岩dolly way 栈桥dolomite 白云石dolomitization 白云石化dolomization 白云石化dome 穹domeyki t e 砷铜矿donarite 道纳瑞特炸药door 门door regulator 第风门door stoop 井筒安全柱door tender 看门工door trapper 看门工dope 吸收剂dopplerite 灰色沥青dormant fire 潜伏火灾dorr classifier 道尔型分级机dorr thickener 道尔型浓缩机dosimeter 剂量计dosing 配量dosing tank 计量箱double 二重的double bank cage 双层式罐笼double barrel 复式岩心管double chain conveyor 双链刮板输送机double deck cage 双层式罐笼double deck screen 双层筛double drum air driven hoist 风动双滚筒绞车double drum hoist 双滚筒绞车double drum scraper hoist 双滚筒扒矿绞车double drum separator 双浓筒磁选机double drum winch 双滚筒绞车double entry 双平巷double intakes 双进风道double parting 错车道double reduction gearbox 两级减速装置double roll crusher 双辊破碎机double stage compressor 双级压气机double track haulage roadway 双轨运输巷道double track heading 双线平巷double tracked incline 双轨斜井double tracked plane 双轨上山double tube core barrel 复式岩心管double union 双键double uni t双工炸区double up post 补充柱dovetail jo int 鸠尾接合dowel 合缝销down grade 下坡down to earth salt production 地下岩盐开采downcast air 进风downcast shaft 进风井downcut 下部掏槽downdraft 下向通风downhole 向下炮眼downpour 注下downward 下向的downward current 下降流下向流downward mining 下行开采downward ventilation 下向通风downward working 下行开采dozer 推土机draft 通风draft tube 吸入管drag bar conveyor 刮板运输机drag classifier 刮板分级机drag conveyor 刮板运输机dragline 朔挖掘机dragline excavator 朔挖掘机dragline tower excavator 塔式朔挖掘机dragscraper 刮土铲运机dragshovel 刮土铲运机drain 排水管drain adit 排水平峒drain cock 放水旋塞drain line 排水管道drain opening 排水囗drain outlet 排水囗drain pipe 排水管drain pump 排水泵drain sump 集水仓drain tap 放水旋塞drain tube 排水管drain valve 放泄阀drainage 排水drainage adit 排水平峒drainage area 排水面积drainage channel 排水沟drainage conveyor 脱水输送机drainage elevator 脱水提升机drainage facilities 排水设备drainage gallery 排水平硐drainage 
hole 放泄孔drainage level 排水平巷drainage network 排水网drainage property 透水性drainage pump 排水泵drainage screen 脱水筛drainage shaft 排水井drainage sieve 脱水筛drainage works 排水工作draining 排水draught 通风draught tube 吸管drawbar 牵引杆drawing 回收drawing back of pillars 后退式回采矿柱drawing height 提升高度drawing hoist 回柱绞车drawing machine 提升绞车drawing program 放矿计划drawing rate 放矿速度drawing shaft 提升井drawpoint brow 放矿点口dredge 挖掘船dredge pump 吸泥泵dredger 挖掘船dredging 挖出dredging engine 挖泥机dresser 选矿工dressing 刃磨dressing expenses 选矿费dressing machine 锻钎机dressing method 连矿法dressing plant 选矿厂dressing works 选矿厂drier 干燥机drift 平硐drift angle 偏差角drift bed 冲积层drift conveyer 水平坑道运输机drift drill 架式凿岩机drift miner 巷道掘进工drift mining 平硐开采drift pillar 平巷矿柱drift way 水平巷道driftage 巷道掘进drifter 架式凿岩机drifting 巷道掘进drifting machine 架式凿岩机drifting method 掘进法drill 钎杆drill adapter 钻杆卡头drill autofeeder 自动推进装置drill bar 钻杆drill bit 钎头drill bit gage loss 钻头直径磨损量drill blower 钻机吹粉器drill bortz 钻用金刚石drill carriage 凿岩机drill chuck 钻头夹盘drill column 钻杆柱drill core 钻机岩心drill cuttings 钻粉drill hammer 凿岩机drill hole 排放钻孔drill hole depth 钻孔深度drill hole wall 钻孔壁drill jumbo 凿岩机drill maker 锻钎机drill man 凿岩工drill mounting 钻机架drill pipe 钻杆drill pipe cutter 套管内切刀drill piston 风钻活塞drill point angle 钻尖角drill pump 钻机泵drill rig 钻车drill rod 钻杆drill rope 钻井钢丝绳drill round 炮眼组drill steel 钻钢drill stem 钻杆drill team 钻探队drill tower 钻塔drill truck 钻车drill unit 钻孔设备drill water hose 凿岩机供水软管drill water pipe 钻机冲洗水管drillability 可钻性drillability index 可钻性指数driller 凿岩工drillhole burden 炮眼的负裁drilling 穿孔drilling and blasting operation 打眼放炮工作drilling cable 钻井钢丝绳drilling cost 打钻费drilling device 钻眼装置drilling dust 钻粉drilling equipment 钻孔设备drilling exploration 钻孔勘探drilling fluid 钻孔液体drilling head 钻头drilling hole 钻孔drilling jumbo 钻车drilling line 钻井钢丝绳drilling machine 钻孔机drilling meal 钻粉drilling method 钻进方法drilling mud 钻泥drilling outfit 钻孔设备drilling pattern 炮孔排列法drilling pipe 钻杆drilling platform 钻井平台drilling rate 钻孔速度drilling rope 钻井钢丝绳drilling shift 钻眼班drilling speed 钻孔速度drilling staging 凿岩台drilling steel 钻钢drilling time 钻孔时间drilling tool 钻具drillings 钻粉drillmobile 钻车drip 滴drip proof protection 防滴保护drivage 巷道掘进drivage efficiency 掘进效率drivage method 掘进法drive chain 传动链drive head 传动机头drive rod 传动钻杆drive shaft 传动轴drive sprocket 传动链轮drive sprocket wheel 传动链轮driven pulley 从动driven shaft 从动轴driver 掘进工driving 冲击driving belt 传动带driving chain 传动链driving force 传动力driving mechanism 传动机构driving openings 巷道掘进driving place 掘进现场driving pulley 织皮带轮driving shaft 传动轴driving speed 掘进速度driving terminal 传动站driving up the pitch 倾斜掘进drop 滴drop bottom cage 落底式罐笼drop bottom car 底卸式车drop cage 翻转罐笼drop crusher 冲唤破碎机drop crushing 落锤破碎drop end car 端卸车drop hammar 打桩落锤drop hammer test 落锤试验drop pit 溜道drop shaft 沉井drop side car 侧卸车drop test 落锤试验dropper 支脉drossy coal 劣质煤drowned mine 淹没的矿drowned pump 浸没泵drum 筒drum feeder 转筒给料机drum filter 鼓式过滤器drum screen 滚筒筛drum separator 圆筒式分选机drum switch 鼓形开关drum to rope ratio 筒径绳径比drum type feeder 转筒给料机drum winder 滚筒式提升机drumlin 鼓丘druse 晶洞drusy 晶洞dry 干燥dry assay 干法试金dry cleaning 干选dry coal preparation 干法选煤dry cobbing 干法磁选dry compressor 气冷压气机dry concentration 干选dry concentrator 干式选矿机dry digging 干料挖掘dry drilling 干式钻眼dry feeder 干给矿机dry grinding 干磨dry magnetic dressing 干法磁选dry magnetic separation 干法磁选dry method 干式法dry mill 干磨机dry milling 干磨dry packing 干式充填dry separation 干选dry sieving 干法筛分dry stowing 干式充填dry treatment 干处理dryer 干燥机dryer drum 干燥机滚筒drying 干燥drying chamber 干燥室drying cylinder 干燥筒drying room 干燥室ds blasting 秒延迟爆破duct 导管ductility 延性ductwork 通风管道duff dust 末煤dufrenite 绿磷铁矿dull coal 暗煤dumm drift 盲道dummy drift 独头巷道dummy roadway 
石垛平巷dummy shaft 暗井dumortierite 蓝线石dump 堆dump body truck 自卸式载重汽车dump car 翻斗车dump house 翻车房dump leaching 堆积沥滤dump pocket 倾卸仓dump skip 倾卸箕斗dump truck 自卸式载重汽车dumper 翻车机dumping 翻卸dumping place 倾卸场dumping station 倾卸站dumping track 倾卸线dumping wagon 翻斗车dunite 纯橄榄岩dunn bass 泥质页岩duplex 二倍的duplex compressor 双动压气机duplex jig 双室跳汰机duplex table 双摇床durability 耐久性durable 耐久的durain 暗煤型duration 经久duration of cycle 循环时间durite 暗煤型durometer 硬度计dust allayer 集尘器dust barrier 岩粉棚dust catcher 集尘器dust catching efficiency 集尘效率dust catching plant 集尘装置dust chamber 集尘室dust coal 粉煤dust collection 集尘dust concentration 尘末浓度dust consolidation 尘末结合dust content 含尘率dust control 防尘dust counter 尘度计dust distributor 撒岩粉器dust explosion 尘末爆炸dust extraction 除尘dust extractor 吸尘器dust filter 滤尘器dust flotation 矿尘浮选dust laden air 含尘空气dust lung 尘肺病dust mask 防尘面具dust monitor 吸尘器dust ore 粉状矿石dust phthisis 尘肺病dust precipitation 煤尘沉降dust prevention 防尘dust proof 防尘的dust protective mask 防尘面具dust recovery 收尘dust removal 除尘dust separator 离尘器dust settling 尘末沉淀dust tight 防尘的dust yield 生尘量dustfree drilling 无尘钻眼dustiness 含尘量dustiness index 含尘指数dusting 生尘dustless drilling 无尘钻眼dusty air 含尘空气dusty mine 多尘矿井duty 能率dyke 岩脉dynamic 动力的dynamic balance 动态平衡dynamic characteristic 动态特性dynamic effect 动态效应dynamic equillibrium 动态平衡dynamic load 动负载dynamic pressure 动压dynamic stress 动力应力dynamics 动力学dynamite 疵麦特dynamite magazine 炸药房dynamometamorphism 动力变质dynamometer 测力计dynamometry 测力法face 工祖face advance 工祖推进face alignment 工祖当face conveyor 工祖运输机face crew 工祖工组face eqiupment 工祖设备face fall 工祖塌落face labour 工祖工人face length 工祖长度face lighting 工祖照明face man 工祖工人face mechanization 工祖机械化face of shaft 井底face of well 油井底face preparation 工祖准备face run 采煤机在工祖移动的时间face support 工祖支架face timbering 工祖支架face timbering plan 工祖支护计划face toe 坡面底部face track 工祖轨道face work 工祖工作facilities 设备factor 系数factor analysis 因子分析factor of safety 安全系数factory 工厂fahlerz 铜矿fahlore 铜矿failed hole 拒爆残留孔眼failing 缺点failure 缺点failure prediction 故障预测failure rate 故障率fall 崩落fall rate 下沉速度fallers 罐笼座falling weight test 落锤试验false bottom 假底层false cap 临时顶梁false dynamite 低硝甘炸药false roof 伪顶false roof rock 伪顶岩石false set 临时支架false stull 临时支柱false timbering 临时支架falsework 拱架fan 扇风机fan blade 扇风机叶片fan cut 扇形掏槽fan delivery 扇风机送风量fan pattern 扇形排列方式fan pattern holes 扇形炮眼组fan pipe 风管fan shaft 通风井fan shaped round 扇形炮眼组fancy coal 精煤fandrift 扇风机引风道fang 通风井fanner 扇风机fast 快速的;坚固的fast extraction 快速回采fast pulley 固定轮fast roof 坚固顶板fast setting concrete 速凝混凝土fast top 坚固顶板fat 脂肪fat coal 肥煤fatigue breakdoun 疲劳破坏fatigue failure 疲劳破坏fatigue limit 疲劳极限fatigue resistance 疲劳强度fatigue strength 疲劳强度fatigue test 疲劳试验fault 断层fault basin 断层盆地fault coal 劣质煤fault diagnosis 故障诊断fault line 断层线fault outcrop 断层露头fault plane 断层面fault surface 断层面fault zone 断层带fayalite 铁橄榄石feature 特征feed 给矿feed apron 板式给矿机feed chute 给矿槽feed control 给料控制feed launder 给矿槽feed leg 风动钻架feed mechanism 推进机构feed plate 给矿板feed port 给矿口feed pump 给水泵feed rope 回绳feed screw 给料螺旋运输机feed tray 给矿槽feed water 给水feed water pump 给水泵feed worm 给料螺旋运输机feedback 反馈feedback ratio 反馈系数feeder 给矿机feeder trough 给矿槽feeding 给矿feeding canal 给矿沟feeding conveyor 给矿运输机feeding hopper 装料漏斗feeler gage 测隙规feighs 选矿尾矿feldspar 长石feldspathization 长石化fell 筛下产品felsite 矽长岩felsophyre 霏细斑岩fen peat 沼泽泥炭fender 矿柱fenite 长霓岩ferberite 钨铁石ferganite 铀钇钽矿ferghani t e 铀钇钽矿;铀锂钒矿fergusonite 褐钇钽矿ferrite 铁素质ferroconcrete 钢筋混凝土ferroconcrete prop 钢筋混凝土支柱ferromagnetic 铁磁的ferromagnetic mineral 铁磁性矿物ferromagnetic substance 铁磁体fertilizer mineral 肥料矿物fibre structure 纤维结构fibroid phthisis 硅肺fibrous coal 丝煤fibrous rock 纤维状岩石field 煤田field reconnaissance 
野外普查field test 现场试验fiery coal 瓦斯煤fiery colliery 瓦斯煤矿fiery gas 爆炸性气fiery mine 瓦斯煤矿fiery seam 瓦斯煤层figure 图fill 装载fill pass 充填天井fill raise 充填天井fill slope 填坡fill toe 堆积底fill up ground 充填地filled pigsty 填实木垛filled pigstye 填实木垛filled stope 充填回采工祖filler 充填工filling 充填filling machine 充填机filling material 充填料filling method 充填法filling operations 充填工作filling stope 充填回采工祖filling stoping 充填法filling system 充填开采法fillings 充填料fillmaterial 充填料film 膜film flotation 表层浮迭filter 过滤机filter bed 过滤层filter cake 滤饼filter cloth 滤布filter paper 滤纸filter press 压滤机filter restrictor 过滤限涟置filter tank 过滤池filter thickener 过滤浓缩机filterability 过滤性filtering 过滤filtering surface 过滤面filthy 矿内气体filtrate 滤液filtration 过滤final 最终的final concentrate 最后精矿final product 最后产品final recleaner flotation 最后精选final support 永久支架final tailings 最终尾矿final velocity 末速fine coal 粉煤fine concentrate 细粒精矿fine crusher 细碎破碎机fine grinding 细磨fine ore 粉矿fine regulation 细蝶fine sand 细砂fine sieve 细筛fine sizes 粉末fineness 细度fineness of grinding 磨碎细度fines 细粒;粉煤finish 完结finished ore 精矿finished product 最后产品finishing table 最后精选摇床finite 有限的fire 火;火灾fire alarm 火警信号器fire barrier 防火煤柱fire brake 防火墙fire breeding 自燃fire brick 耐火砖fire clay 耐火粘土fire cock 消防栓fire damp 爆炸性气体fire district 火区fire enclosure 火区密闭fire engine 消防车fire extinguisher 灭火器消火器fire face 煤层燃烧面fire fighting powder 灭火粉fire hydrant 消防栓fire line 消防水管fire main 消防水管fire pillar 防火煤柱fire proof timbering 防火支架fire protection 防火fire pump 消防泵fire resistan 耐火的fire resistance 耐火性fire resistance test 耐火试验fire seal 封火墙fire zone 火区fireboss 瓦斯检定员firedamp content 沼气含量firedamp detector 沼气检定器firedamp emission 沼气泄出firedamp explosion 矿内瓦斯爆炸firedamp outburst 沼气突出firedamp probe 沼气检定器firedamp testing 沼气测定firefighting 消防firefighting crew 消防队firefighting equipment 消防设备fireman 爆破工fireproof 耐火的fireproof lining 耐火支架fireproofness 耐火性firing 点火firing circuit 爆破电路firing cost 爆破费firing current 发火电流点火电流firing interval 起爆间隔firing machine 放炮器firing order 放炮次序firing transformer 爆破变压器firm walls 坚实圃岩first advance 超前工祖first aid repair 紧急修理first bit 开眼钎子first caving 直接顶初次垮落first mining 采区准备first roof caving 直接顶初次垮落fish eye stone 鱼眼石fishing jobs 钻具打捞工作fishing tap 打捞母锥钻fishing tool 打捞工具fishtail bit 鱼尾式钻头fissile 易裂的fissility 易裂性fissionable 可分裂的fissure 裂缝fitter 装配工fitting 装配fix 固定fixation 固定fixed carbon 固定碳fixed grizzly 固定格筛fixed jaw 固定颚板fixed peg 固定标桩fixed point 固定点fixed pulley 固定轮fixed roll 固定辊fixed screen 固定筛fixed sieve jig 固定筛跳汰机fixed spindle gyratory crusher 固定轴圆锥破碎机flame coal 长焰煤flame indicator 爆炸火焰指示器flame of shot 爆炸焰flame resistant 耐火的;防爆的flame safety lamp 火焰安全灯flame throwing drill 热力钻机flameproof 耐火的flameproof drill panel 防爆电钻配电箱flameproof motor 防爆式电动机flammability 可燃性flammable 可燃的flammable gas 可燃气体flange 凸缘flank hole 侧向钻孔flap door 风门flap valve 瓣阀flash compound 起火剂flash over capability 殉爆敏感度flashing composition 起火剂flashing point 闪燃点flashing system 充填采矿法flat back cut and fill method 水平分层充填开采法flat back method 上向梯段开采法flat bed 平层flat dip 平倾斜flat dipping bed 缓倾斜层flat grade 平缓坡度flat grade mine 缓倾斜层矿山flat gradient 平缓坡度flat hole 水平炮眼flat muck pile 平面废石堆flat pitch 缓倾斜flat rammer 平夯flat rope 扁平钢丝绳flat wall 下盘flat wire rope 扁平钢丝绳flaw detector 探伤器flexadux 软风管flexibility 可缩性flexible 挠性的flexible cable 软电缆flexible hose 软管flexible idler 挠性托辊flexible lining 可缩性支架flexible pipe 软管flexible shaft 挠性轴flexible support 可缩性支架flexible transport 无轨运输flexible wire rope 柔性钢丝绳flexion test 弯曲试验flexure 挠曲flight 阶梯flight belt 皮带抛掷充填机flight elevator 刮板式升运机flint 燧石flirting post 临时顶柱float 浮游矿物float and sink analysis 浮沉试验float and sink process 重悬浮液分离过程float and sink sampling 浮沉试验取样float and 
sink separation 重介选float fill 水力充填float fraction 浮物级别float valve 浮阀floatability 浮游性floatation 浮选floating dust 浮游尘末floating platform 浮台floc 絮凝物flocculant 架凝剂flocculation 架凝flocks 絮凝物flood 淹没flooded mine 淹没的矿flooded shaft 淹没矿井flooding 淹没floor 底板floor bar 横梁floor bolting 底板锚杆支护floor brushing 卧底floor heave 底板隆起floor hole 底部炮眼floor pillar 阶段间矿柱floor ripping blastign 卧底爆破floor rock 底板岩石flotability rank 可浮性等级flotation 浮选flotation agent 浮选剂flotation cell 浮选室flotation chemicals 浮选剂flotation concentrate 浮选精矿flotation equipment 浮选设备flotation froth 浮选泡沫flotation machine 浮选机flotation mill 浮选厂flotation oil 浮选用油flotation oil feeder 浮选给油器flotation process 浮选法flotation pulp 浮选矿浆flotation reagent 浮选剂flotation tailings 浮选尾矿flotator 浮选机flotol 弗洛脱尔flow 流flow coefficient 量系数flow line 吝flow production 连续生产flow rate 临flow regulator 量第器flow string 采油管flowability 怜性flowmeter 量计flowsheet 撂图flowsheet of mill 选矿撂图fluctuation 波动flue 风管flue coal dust 悬浮煤尘fluid 铃fluid pressure 铃压力fluidity 怜性fluor 萤石fluor spat 萤石fluorite 萤石flushing 冲洗;湿式充填flushing method 水力充填开采法flushing pipe 充填管flushing port 进水孔flushing system 水砂充填法fly ash 烟道尘foam 泡沫foam extinguisher 泡沫灭火器foam flotation 泡沫浮选foam ganerating unit 泡沫发生设备foam plug 泡沫基foamed plastics 多空塑料foaming 起泡foaming agent 起泡剂focal point of subsidence 沉陷中心点fog 雳folding boards 罐座foliated coal 层状煤foot 基础foot plate 支柱垫板;脚踏板foot wall 下盘footwall 下盘footwalling 卧底footway 人行道force fan 压风机force feed 压力推进force pipe 压送管force pulling 牵引力force pump 压力泵forced block caving 强制分段崩落开采forced draught 强制通风forced draught fan 压风机forced feed 压力推进forced lubrication pump 压力润滑泵forced ventilation 人工通风fore breast 巷道掘进工祖fore end 超前工祖forehead 超前工祖foreign matter 杂质forepoling 超前支架forepoling bar 前探梁forepoling board 超前板桩foreshaft 井颈forgeability 可锻性forging 锻造form 形form factor 波形因数formation 形成formula 式forsterite 镁橄榄石forward stroke 前进冲程foshagite 变针硅钙石fossil 化石fossil wax 地蜡fossilization 石化酌foul air 污浊空气foul mine 瓦斯矿foundation 基础foundation bed 基底foundation work 基础工作four compartment mill 四室式磨机four groove drill 十字形钎子four point bit 十字钻头four wing rotary bi t十字钻头fraction 破片fractional precipitation 分级沉淀fractionating column 分镏塔fracture 断面fracture test 断裂试验fractured zone 破裂带fracturing 龟裂fragile 脆的fragility 脆性fragment 破片fragmental rock 碎屑岩fragmentation 破碎fragmented rock 破碎的岩石frame 架frame set 框式支架framed timber 棚子framing sheet pile 木板桩frangibility 脆性franklinite 锌铁尖晶石free 自由的free acid 游离酸free carbon 游离碳free energy 自由能free face 自由面free fall boring 自由降落冲唤钻进free falling 自由下落free falling classifier 自由下沉式分级机free flowing dynamite 粉状狄那米特硝甘炸药free gas 游离气free oscillation 自由振动gabbro 辉长岩gabbronorite 辉长苏长岩gad 钢楔gadder 钻机架;创煤镐gadding 穿孔gadolinite 硅铍钇矿gadolinium 钆gage 轨距gahnite 锌尖晶石gain 掏槽galena 方铅矿galenite 方铅矿gallery 平硐gallery driving 平巷掘进gallery entrance 纸巷进口gallery level 运输平巷水平gallery sheeting 平巷背板gallery test 巷道试验gallet 碎石gallium 钾gallon 加仑gallows frame 井架galmei 异极矿galmey 异极矿galvanometer 检疗gamma prospecting 射线勘探gamma rays 射线gang 矿车列车gange 脉石ganger 运矿工;班长gangue 脉石gangue froth 脉石泡沫gangway 炙输平巷gangway conveyor 纸巷运输机gantry 起重机架gap 间隙gap sensitivity 殉爆敏感度gap test 殉爆试验garland 集水圈garnierite 硅镁镍矿gas 瓦斯gas accumulation 瓦斯聚集gas analysis 气体分析gas analyzer 气体分析器gas anchor 气锚gas and dust explosion 瓦斯煤尘爆炸gas bearing 含瓦斯的gas bearing capaci t y 瓦斯含量gas blower 瓦斯喷出口gas brust 瓦斯突出gas bubble 气泡gas burner 煤气燃烧器gas burst 瓦斯突出gas coal 气煤gas concrete 加气混凝土gas constant 气体常数gas control 瓦斯泄出控制gas cutting 气割gas detector 瓦斯检查器gas drainage 排瓦斯设备gas drainage roadway 排瓦斯巷道gas emission 瓦斯泄出gas emission rate 瓦斯泄出速度gas explosion 瓦斯爆炸gas explosion prevention 防止瓦斯爆炸gas factor 油气比gas 
field 气田gas generation 煤气化gas helmet 防毒面具gas issue 瓦斯突出gas making 煤气制造gas offtake borehole 煤气泄出钻孔gas oil contact 油气界面gas oil interface 油气界面gas oil level 油气界面gas oil ratio 油气比gas oil surface 油气界面gas permeability of rocks 岩石透气性gas pipe 瓦斯管gas pocket 瓦斯包gas pool 气田gas pressure 气体压力gas production 瓦斯开采gas purifier 气体净化器gas release 瓦斯泄出gas reservoir 气田gas rock 含瓦斯的岩石gas rush 瓦斯突出gas separator 气体分离器gas testing lamp 沼气检验灯gas welding 气焊gaseous 瓦斯的gaseous and dusty mine 多尘瓦斯矿gaseous diffusion 气体扩散gaseous fuel 气体燃料gaseous mine 瓦斯矿gaseous phase 气相gaseous seam 瓦斯煤层gasification 煤气化gasifying 煤气化gasless detonator 无烟雷管gasoline 汽油gasproof apparatus 防爆装置gasproof shelter 瓦斯躲避峒gassing 放气gassy 含瓦斯的gassy mine 瓦斯矿gassy seam 瓦斯煤层gate 采区顺槽gate air cooling uni t平巷空气冷却装置gate conveyor 联络巷道的运输gate end 巷道的内端gate end conveyor 平巷转载运输机gate end panel 工祖配电设备gate end plate 联络平巷内端的转车盘gate road 采区顺槽gate stull 保护台gate top 上平巷gate way 平巷gategateway 平巷gatehead 装载点gateroad bunker 采区运输顺槽煤仓gathering arm loader 集爪式装载机;集瓜式装载机gathering conveyor 集矿运输机gathering parting 档岔道gathering raise 集合放矿天井gauge 轨距;规gauge door 第风门gauging station 气菱量站gauze 金属丝网gaylussite 单斜纳钙石gear 齿轮gear box 变速器gear ratio 齿轮比gear reduction ratio 齿轮减速比gearing 齿轮装置gehlenite 钙黄长石gelamite 胶质硝甘炸药gelatin 煤gelatin dynamite 胶质硝甘炸药gelatination 凝胶化gelatine dynamite 胶质硝甘炸药gelatine explosive 煤炸药gelatine powder 胶质硝甘炸药gelatinous explosive 煤炸药gelignite 葛里炸药gem 宝石gemstone 宝石general arrangement plan 总平面布置图general trend 总走向generator 发电机genesis 成因gentle dip 平倾斜gentle incline 缓坡度gentle slope 平缓坡度geochemical exploration 地球化学勘探geochemical prospecting 地球化学勘探geocronite 斜方硫锑铅矿geode 晶洞geodesy 大地测量学geography 地理geologic column 地质柱状图geological compass 地质罗盘geological conditions 地质条件geological cross 地质剖面geological exploration 地质勘探geological map 地质图geological prospecting 地质勘探geological section 地质剖面geological theory 地质理论geology 地质geomagnetic field 地磁场geomechanics 岩石力学geometric mean diameter 几何平均径geometrical factor 几何因子geometry 几何geophone 地震检波器geophysical exploration 地球物理勘探geophysical prospecting 地球物理勘探geophysics 地球物理学geosyncline 地槽geothermal gradient 地热增温率geothermal prospecting 地热勘探geothermic gradient 地热增温率germanite 锗石germanium 锗get rock 采石getter 采煤工getter loader 截装机getting 采煤geylussite 单斜纳钙石geyser 喷泉giant 水枪giant excavator 巨型挖掘机giant nozzle 水枪gib 采掘面临时支柱gig 绞车gioberite 菱镁矿girder 横梁girdle 薄砾石层glacial period 冰河时期glaciation 结冰酌glacier 冰河glaciology 冰川学glance coal 辉煤glass 玻璃glauberite 钙芒硝glaucodote 铁硫砷钻矿glauconite 海绿石glaucophane 蓝闪石glesum 琥珀glide plate 滑行板glist 云母glory hole method 放矿漏斗式开采glossy coal 辉煤gmelinite 钠菱沸石gneiss 片麻岩goaf 采空区goaf degasification 采空区脱气goaf road 采空区中的巷道goaf shield 掩护支架goaf stowing 采空区充填gob 采空区gob area 采空区gob caving 采空区落顶gob flushing 采空区水砂充填gob pack 废石充填带gob pile 废石堆gob roadway 采空区中的巷道gob stower 投掷式充填机gob stowing 采空区充填gob stowing machine 投掷式充填机goethite 针铁矿goffan 沟going headway 运输平巷gold 金gold dredging 挖掘船采金gold dust 金末gold field 金矿区gold mine 金矿gold mining 采金gold saving device 捕金装置gold vein 金矿脉gold washer 洗金机gondola 侧卸漏斗车goniometer 测角计good air 新鲜空气good ground 稳定地层gopherhole charge 洞室装药gophering 滥采gothite 针铁矿goths 煤的突出govern 第governor 蒂器grab 抓岩机grab bucket 抓岩机的抓斗grab bucket excavator 抓斗式挖掘机grab crane 抓岩机吊车grab dredge 抓斗挖掘船grab loader 抓岩机grab picking 粗选grab type dredge 抓斗挖掘船grabbing crane 抓岩机吊车grabbing excavator 抓斗式挖掘机gradation 分级gradation composition 粒度组成gradation test 粒度分级试验grade 坡度grade of coal 煤品级grade of ore 矿石品位grade peg 坡度标桩grade up 上坡graded coal 过筛煤graded crushing 分段破碎graded product 分级产品gradienter 水准仪grading 水准测量;分级grading curve 粒度曲线gradiometr 测斜仪gradual sagging roof 
缓慢下沉顶板graduate 刻度graduation 分度grahamite 脆沥青grail 砂grain 颗粒grain composition 粒度组成grain diameter 颗粒直径grain powder 粒状炸药grain size 粒度grain size accumulation curve 粒径累积曲线grain size category 粒度等级grain size characteristic curve 粒度特性曲线grain size curve 粒度曲线grain size grade 粒度等级granby car 侧卸式矿车granite 花岗岩granitization 花岗岩化granular 粒状的granular material 粒状材料granular rock 粒状岩granulating 粒化granulating machine 制粒机granulating plant 粒化装置granulation 成粒granulator 制粒机granule 颗粒granulometric composition 粒度组成granulometric distribution 颗粒分布granulometry 颗粒测定术graph 图表graphite 石黑graphitization 石墨化graphitizing 石墨化graphometer 测角器grapple 抓斗grappling 锚固grass 矿井地面grass crop 露头grass root's 地表水准grate 格子grate ball mill 格子排料式球磨机grate bar 筛条grate mill 格子排料式球磨机grating 格子gravel 砾石gravel face 砂矿工祖gravel filter 砾石过滤器gravel pit 采砾场gravimeter 重差计gravimetric analysis 重量分析gravimetric density 重量密度gravimetry 测定重量gravitation 重力gravitational exploration 重力勘探gravitational field 重力场gravitational method 重力法gravitational prospecting 重力勘探gravitational separation 重力选gravity 重力gravity acceleration 重力加速度gravity concentration 重力选gravity concentrator 重力选矿机gravity dumper 重力翻车器gravity feed 自两供给gravity field 重力场gravity flow 重力怜gravity flushing 自重水力运输gravity hammer 打桩落锤gravity haulage 自运输gravity hydraulic transport 自重水力运输gravity incline 轮子坡gravity method 重力法gravity mill 重力选矿厂gravity ore pass 重力放矿溜道gravity plane 轮子坡gravity preparation 重力选gravity runway 轮子坡gravity separation 重力选gravity separator 重力分选机gravity solution 重液gravity stowing 重力充填gravity take up 重力拉紧装置gravity water 重力水grease 润滑脂green iron ore 绿磷铁矿green lead ore 磷氯铅矿green ore 原矿green prop 新伐的坑木green timber 新木材green vitriol 水绿矾greenalite 土状硅铁矿greenockite 硫镉矿greenrock 徨绿岩greisen 云英岩grey cobalt ore 砷钴矿grid 格子grid plate 跳汰机筛板griddle 筛子grill 格栅grind 粉碎;研磨grindability 可磨碎性grindability index 可磨碎性指数grindability limit 可磨性限度grindability margin 可磨性限度grindability rating 可磨碎性指数grindability test 可磨性试验grinder 磨石grinding 磨碎grinding balls 磨球grinding flowsheet 磨矿撂图grinding mill 磨碎机grinding rate 研磨速度grinding stone 磨石grinding test 研磨试验grip 夹子gripper 钳子grit 粗砂grit blast 喷砂装置grit mill tube 磨碎机grit stone 尖角粗砂岩gritting material 粗砂;砂砾grizzle 高硫低级煤grizzly 格筛grizzly bar 筛条grizzly blasting 溜井爆破grizzly feeder 棒条给料机。
Mining Approximate Frequent Itemsets from Noisy Data
1Jinze Liu, 1Susan Paulsen, 1Wei Wang, 1,2Andrew Nobel, 1Jan Prins
1Department of Computer Science
2Department of Statistics and Operations Research
University of North Carolina, Chapel Hill, NC 27599
{liuj,paulsen,weiwang,nobel,prins}@
Abstract
Frequent itemset mining is a popular and important first step in analyzing data sets across a broad range of applications. The traditional, "exact" approach for finding frequent itemsets requires that every item in the itemset occurs in each supporting transaction. However, real data is typically subject to noise, and in the presence of such noise, traditional itemset mining may fail to detect relevant itemsets, particularly those large itemsets that are more vulnerable to noise.
In this paper we propose approximate frequent itemsets (AFI) as a noise-tolerant itemset model. In addition to the usual requirement for sufficiently many supporting transactions, the AFI model places constraints on the fraction of errors permitted in each item column and the fraction of errors permitted in a supporting transaction. Taken together, these constraints winnow out the approximate itemsets that exhibit systematic errors. In the context of a simple noise model, we demonstrate that AFI is better at recovering underlying data patterns, while identifying fewer spurious patterns, than either the exact frequent itemset approach or the existing error-tolerant itemset approach of Yang et al. [10].
1 Introduction
Relational databases are ubiquitous, cataloging everything from market-basket data [1] to gene-expression data [4]. Frequent itemset mining [1] is a key technique in the analysis of such data, providing the basis for deriving association rules, for clustering data, and for building classifiers.
The frequent itemset problem is generally characterized in the following form: the available data take the form of an n×m binary matrix D. Each row of D corresponds to a transaction t and each column of D corresponds to an item i. The (t, i)-th element of D, denoted d_{t,i}, is one if transaction t contains item i, and zero otherwise. Let T0 = {t1, t2, ..., tn} and I0 = {i1, i2, ..., im} be the set of transactions and items associated with D, respectively. Under exact frequent itemset mining, a transaction supports an itemset if it contains a '1' under each item in the itemset. An itemset is deemed frequent if the number of its supporting transactions exceeds the "support threshold," a user-determined percentage of the total number of transactions.
While the classic exact frequent itemset definition and the algorithms designed to generate such itemsets have been well studied, the problems created by imperfect data have not. Error can be introduced when an item fails to be recorded, or is not purchased at all because it was out of stock. In the presence of such "noise" (i.e.
actual errors as well as incorrect imputation of measurements), classical frequent itemset algorithms will find a large number of small fragments of the true itemset, and may miss a pattern altogether if the frequency criterion is not satisfied. This failure to detect the full pattern compromises the usefulness of classic frequent itemset mining for detecting associations, clustering items, or building classifiers when such errors are present. As a solution, we present here a noise-tolerant approach to frequent itemset mining.
One natural approach for handling errors is to relax the requirement that a supporting transaction contain only 1's under the items in the itemset. Instead, a small fraction of 0's is tolerated, e.g. the "presence" signal of [10]. However, stipulating a small fraction of 0's row-wise alone may not be sufficient: we would also like to ensure that the distribution of 0's is globally reasonable, e.g. that they are also not concentrated in a small number of columns.
For example, the fraction of 1's is 80% in each of the submatrices presented in panels (A)-(D) of Figure 1. However, not all of the transactions in each panel sensibly support the itemset I = {a, b, c, d, e}. In (A), the row-wise constraint employed by Yang et al. [10] correctly excludes transaction 5 from the support; however, in (B) enforcing the row-wise constraint alone allows each transaction to support the addition of {e} to the itemset. Panel (C) illustrates the problem with a purely column-wise constraint, while (D) exhibits an error distribution where each transaction sensibly lends support to the full itemset. In this latter case, each row and column permits no more than 20% error.
Figure 1. Itemsets with global density of 80% but different distributions of noise in individual transactions and items.
Thus, to attain noise-tolerant itemsets free from systematic errors, we propose the joint use of two criteria. We define an approximate itemset to be one where the fraction of 0's in each row and each column is restricted to εr and εc, respectively. If the approximate itemset has sufficiently many rows, it is an approximate frequent itemset (AFI).
Definition 1.1. Let D be as above, and let εr, εc ∈ [0, 1]. An itemset I ⊆ I0 is an AFI if there exists a set of transactions T ⊆ T0 with |T| ≥ minsup · |T0| such that the following two conditions hold: (1) for each t ∈ T, the fraction of items in I that appear in t is at least (1 − εr), and (2) for each i ∈ I, the fraction of transactions in T that contain i is at least (1 − εc).
Example 1.1. Consider the transaction database D in Table 1 with AFI parameters minsup = 0.5, εr = 1/3 and εc = 1/3. Then the maximal AFI contained in D is I = {a, b, c}, which is supported by at least four transactions (T = {t1, t2, t3, t4, t5}). For each item i ∈ I, at least 80% > 100(1 − εc)% of the transactions in T contain it; each transaction t ∈ T is missing at most one of the items in I, so the fraction of zeros in each row is at most 1/3 = εr.
Table 1. An example dataset
  a b c d
1 1 1 1 0
2 1 1 0 0
3 1 0 1 0
4 0 1 1 0
5 1 1 1 1
6 0 0 0 1
7 0 1 0 1
8 1 0 0 0
The rest of the paper is organized as follows. Section 2 presents a formal definition of our problem and outlines related work in the area of noise-tolerant itemset mining. Section 3 presents a brute-force algorithm. Evaluation of the AFI algorithm using both synthetic and real datasets is presented in Section 4. Section 5 concludes the paper.
2 Background and Related Work
Noise-tolerant itemsets were first discussed by Yang et al. [10], who proposed two error-tolerant models, termed weak error-tolerant itemsets (ETIs) and strong ETIs. An itemset is a weak
ETI if the fraction of noise in the entire set of supporting transactions is below a certain threshold,with no constraint on where the noise may occur.An itemset is a strong ETI if it satisfies the row,but not necessarily the column,constraint of the AFI definition above.As noted in the discussion of Figure 1,neither of the ETI models precludes columns of zeros.Yang et.al [10]describe algorithms for finding weak and strong ETIs based on a variety of heuristics and sampling techniques.In [7]Seppanen et.al seeks weak ETIs by adding the constraint that all of their subsets must also be weak ETIs.The resulting itemsets belong to the category of weak ETIs but their overall characteristics are hard to derive.In some cases,this additional constraint elim-inates irrelevant transactions as in Figure 1(B),but in others it permits Figure 1(C).Another alternative,the support envelope[8]identi-fies regions of the data matrix where each transaction contains at least m items and each item appears in at least n transactions,where n and m are fixed integers.Support envelope mining can only recover one big sub-matrix at a time,prohibiting the discovery of multiple embedded dense regions.Furthermore,if the matrix is large,the one approximate itemset found by the support envelope approach tends to be very sparse.Fault-tolerant frequent itemsets[?]allow afixed num-ber of errorsδwithin an itemset.This criterion is not consistent with our expectation that the number of er-rors should be permitted to scale with the size of the result.3A Brute Force Algorithm To Discover AFIsAs implied by its definition,an AFI( r, c)is also an ETI( r),where only the row-wise constraint is enforced. Thus,a natural brute-force method tofind the set of AFI( r, c)can be obtained two steps:1.Generate the set of all ETIs( r)2.For each ETI( r),check its validity as anAFI( r, c).Thefirst step of the algorithm was studied by Cheng et.al[10].The exhaustive algorithm proposed in their paper starts with single items and develops them into longer itemsets by adding one of the remaining items at each time.The lattice of itemsets is traversed in a breadth-first manner.As may be obvious,the Apriori property of classical frequent itemset mining will not hold for either ETI or AFI.Thus an itemset cannot be pruned if one of its(k−1)is not a valid itemset. 
Instead,an length-k itemset cannot be eliminated as a valid ETI until it is established that none of its(k−1) subsets is a weak ETI.The second step in the algorithm is a postprocessing step.For each of the submatrices of itemsets and transactions discovered in thefirst step, determine which transactions meet the AFI column con-straint and if the number of qualifying transactions is still large enough to meet the support constraint.4ExperimentsWe performed two experiments to evaluate the per-formance of AFI.A synthetic data matrix corrupted with noise was used to compare the results of AFI min-ing to both exact frequent itemset mining and the ETI approach.In addition,we applied AFI to a data set drawn from a real biogeographic problem,where the AFI algorithm identified interesting patterns more suc-cinctly than the competing algorithms.4.1Quality Testing with Synthetic DataIn order to test the quality of the AFI model,we created data with both embedded patterns and overlaid random errors.By knowing the true patterns,we were able to assess the quality of AFI’s results.To each synthetic dataset created,an exact method,ETI and AFI were each applied.To evaluate the performance of an algorithm on a given dataset,we employed two measures that jointly describe quality:“recoverability”and“spuriousness”(in the spirit,but not exact detail of[6]).Recoverabil-ity is the fraction of the embedded patterns recovered by an algorithm,while spuriousness is the fraction of the mined results that fail to correspond to any planted cluster.A truly useful data mining algorithm should achieve high recoverability with little spuriousness to dilute the results.Multiple datasets were created and analyzed to ex-plore the relationship between increasing noise levels and the quality of the result.Noise was introduced by bit-flipping each entry of the full matrix with a prob-ability equal to p.The probability p was varied from 0.01to0.2.The number of pattern blocks embedded also varied,but the results were consistent across this parameter.Here we present results when1or3blocks were embedded in the data matrix(Figure2(A)and(B), respectively).In both cases,the exact method performed poorly as noise increased.Beyond p=0.05the original pattern cannot be recovered,and all of the discovered patterns are spurious.In contrast,the error-tolerant algorithms, ETI and AFI,were much better at recovering the em-bedded matrices at the higher error rates.However, the ETI algorithm reported many more spurious results than AFI.Though it may discover the embedded pat-terns,ETI generates many more patterns that are not of interest,which may overshadow the real patterns of interest.The AFI algorithm consistently demonstrates higher recoverability of embedded pattern while main-taining a low level ofspuriousness.(A)Single Cluster(B)Multiple ClustersFigure2.Algorithm quality versus noise level4.2An Application in BiogeographyOne novel,but natural,application of frequent item-set mining is in the field of biogeography,the study of the geographical distributions of organisms.The pat-terns discovered in species distributions are used to in-fer either connections or barriers between regions,which in turn lead to hypotheses concerning the biogeographic tracks of organisms in historical time.Here we apply AFI to data from a study of freshwater fish across Aus-tralia (from Unmack [9]).The presence or absence of 167species was recorded for each of 31regions covering the continent.This type of data is subject to error in its collection,and 
"soft" (i.e. approximate) patterns are of interest. Application of exact frequent itemset mining using minsup = 5 produced a total of 31 itemsets and a reasonable result: the broadest cluster in terms of regions covered corresponds to one of the data author's results. Its 11 regions form a contiguous coastal band across Northern and Eastern Australia (shown as the dark-colored provinces in Figure 3). However, application of AFI not only recovered the exact result (with fewer spurious blocks), but at εc = εr = 0.2 it adds two more regions to the item-wise largest block. These regions have been acknowledged by Unmack as sensible additions: they are contiguous with the previously identified cluster in the northern portion of Australia, and appear to be the next most closely related regions in Unmack's analysis. These regions appear in Figure 3 as the light gray regions.
Figure 3. Map of Australia with shading representing provinces in the cluster
5 Conclusion
In this paper we have defined criteria for mining approximate frequent itemsets from noisy data. The AFI model places constraints on the fraction of noise in each row and column, and so ensures a relatively reasonable distribution of error in any patterns found. According to our investigation, AFI generates more reasonable and useful itemsets than classical frequent itemset mining and existing noise-tolerant frequent itemset mining.
Several computational challenges remain unsolved, however, and are currently under investigation. Noise tolerance creates substantial algorithmic challenges not present in exact frequent itemset mining. First, the AFI criteria do not have the anti-monotone (Apriori) property enjoyed by exact frequent itemsets. Second, one cannot derive the support set of an AFI from the common support sets of its sub-patterns, as is done in exact frequent itemset mining. Both of these considerations make the traditional breadth-first and the projection-based depth-first algorithms hard to apply to the generation of approximate frequent itemsets. Development of an efficient algorithm and pruning method will be the main focus of our future work.
This research was partially supported through NIH Integrative Research Resource grant 1-P20-RR020751-01, NSF grant DMS-0406361 and NSF grant IIS-0448392.
References
[1] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In SIGMOD 1993.
[2] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. Verkamo. Fast discovery of association rules. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, chapter 12, pages 307-328. AAAI Press, 1996.
[3] C. Becquet, S. Blachon, B. Jeudy, J. F. Boulicaut, O. Gandrillon. Strong-association-rule mining for large-scale gene-expression data analysis: a case study on human SAGE data. Genome Biol. 2002.
[4] C. Creighton, S. Hanash. Mining gene expression databases for association rules. Bioinformatics. 2003 Jan;19(1):79-86.
[5] J. Liu, S. Paulsen, W. Wang, A. Nobel, J. Prins. "Mining Approximate Frequent Itemsets from Noisy Data". Technical Report (TR05-015), Department of Computer Science, UNC-Chapel Hill, June 2005.
[6] H. C. Kum, S. Paulsen, W. Wang. Comparative Study of Sequential Pattern Mining Models. Studies in Computational Intelligence, Volume 6, Aug 2005, pages 43-70.
[7] J. K. Seppanen, H. Mannila. Dense Itemsets. In SIGKDD 2004.
[8] M. Steinbach, P. N. Tan, V. Kumar. Support envelopes: a technique for exploring the structure of association patterns. In SIGKDD 2004.
[9] P. J. Unmack. Biogeography of Australian freshwater
fishes. Journal of Biogeography, Vol. 28, pages 1053-1089. Blackwell Science Ltd, 2001.
[10] C. Yang, U. Fayyad, P. S. Bradley. Efficient discovery of error-tolerant frequent itemsets in high dimensions. In SIGKDD 2001.
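To make Definition 1.1 above concrete, the short Python sketch below checks the two AFI criteria on a 0/1 matrix. It is only an illustration of the definition, not the mining algorithm studied in Section 3; the function name is_afi and the matrix layout are our own choices, and the data reproduces Table 1 with the parameters of Example 1.1.

```python
def is_afi(D, items, transactions, eps_r, eps_c, minsup):
    """Check the two AFI criteria of Definition 1.1 on a 0/1 matrix D
    (a list of rows). `items` and `transactions` are column/row indices."""
    # Support requirement: |T| >= minsup * |T0|.
    if len(transactions) < minsup * len(D):
        return False
    # Row-wise constraint: each supporting transaction misses at most
    # a fraction eps_r of the items in the itemset.
    for t in transactions:
        zeros = sum(1 for i in items if D[t][i] == 0)
        if zeros / len(items) > eps_r:
            return False
    # Column-wise constraint: each item is absent from at most a
    # fraction eps_c of the supporting transactions.
    for i in items:
        zeros = sum(1 for t in transactions if D[t][i] == 0)
        if zeros / len(transactions) > eps_c:
            return False
    return True

# Table 1 as an 8x4 matrix (columns a, b, c, d) and the setting of Example 1.1:
# items {a, b, c}, supporting transactions t1..t5, minsup = 0.5, eps_r = eps_c = 1/3.
D = [[1, 1, 1, 0], [1, 1, 0, 0], [1, 0, 1, 0], [0, 1, 1, 0],
     [1, 1, 1, 1], [0, 0, 0, 1], [0, 1, 0, 1], [1, 0, 0, 0]]
print(is_afi(D, items=[0, 1, 2], transactions=[0, 1, 2, 3, 4],
             eps_r=1/3, eps_c=1/3, minsup=0.5))   # True
```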
Mining Frequent Patterns from Very High Dimensional Data:A Top-Down Row Enumeration Approach*Hongyan Liu1 Jiawei Han2 Dong Xin2 Zheng Shao2 1Department of Management Science and Engineering, Tsinghua Universityhyliu@2Department of Computer Science, University of Illinois at Urbana-Champaign {hanj, dongxin, zshao1}@AbstractData sets of very high dimensionality, such as microarray data, pose great challenges on efficient processing to most existing data mining algorithms. Recently, there comes a row-enumeration method that performs a bottom-up search of row combination space to find corresponding frequent patterns. Due to a limited number of rows in microarray data, this method is more efficient than column enumeration-based algorithms. However, the bottom-up search strategy cannot take an advantage of user-specified minimum support threshold to effectively prune search space, and therefore leads to long runtime and much memory overhead.In this paper we propose a new search strategy, top-down mining, integrated with a novel row-enumeration tree, which makes full use of the pruning power of the minimum support threshold to cut down search space dramatically. Using this kind of searching strategy, we design an algorithm, TD-Close, to find a complete set of frequent closed patterns from very high dimensional data. Furthermore, an effective closeness-checking method is also developed that avoids scanning the dataset multiple times. Our performance study shows that the TD-Close algorithm outperforms substantially both Carpenter, a bottom-up searching algorithm, and FPclose, a column enumeration-based frequent closed pattern mining algorithm.1 IntroductionWith the development of bioinformatics, microarray technology produces many gene expression data sets, i.e., microarray data. Different from transactional data set, microarray data usually does not have so many rows (samples) but have a large number of columns (genes). This kind of very high dimensional data needs data mining techniques to discover interesting knowledge from it. For example, frequent pattern mining algorithm can be used to find co-regulated genes or gene groups [2, 14]. Association rules based on frequent patterns can be used to build gene networks [9]. Classification and clustering algorithms are also applied on microarray data [3, 4, 6]. Although there are many algorithms dealing with transactional data sets that usually have a small number of dimensions and a large number of tuples, there are few algorithms oriented to very high dimensional data sets with a small number of tuples. Taking frequent-pattern mining as an example, most of the existing algorithms [1, 10, 11, 12, 13] are column enumeration-based, which take column (item) combination space as search space. Due to the exponential number of column combinations, this method is usually not suitable for very high dimensional data.Recently, a row enumeration-based method [5] is proposed to handle this kind of very high dimensional data. Based on this work, several algorithms have been developed to find frequent closed patterns or classification rules [5, 6, 7, 8]. As they search through the row enumeration space instead of column enumeration space, these algorithms are much faster than their counterparts in very high dimensional data. However, as this method exploits a bottom-up search strategy to check row combinations from the smallest to the largest, it cannot make full use of the minimum support threshold to prune search space. 
As a result, experiments show that it often cannot run to completion in a reasonable time for large microarray data, and it sometimes runs out of memory before completion. To solve these problems, we propose a new top-down search strategy for row enumeration-*This work was supported in part by the National Natural Science Foundation of China under Grant No. 70471006 and 70321001, and by the U.S. National Science Foundation NSF IIS-02-09199 and IIS-03-08215.280based mining algorithm. To show its effectiveness, we design an algorithm, called TD-Close, to mine a complete set of frequent closed patterns and compare it with two bottom-up search algorithms, Carpenter [5], and FPclose [15]. Here are the main contributions of this paper:(1) A top-down search method and a novel row-enumeration tree are proposed to take advantageof the pruning power of minimum support threshold. Our experiments and analysis showthat this cuts down the search space dramatically.This is critical for mining high dimensional data,because the dataset is usually big, and withoutpruning the huge search space, one has to generate a very large set of candidate itemsets forchecking.(2) A new method, called closeness-checking, isdeveloped to check efficiently and effectively whether a pattern is closed. Unlike other existingcloseness-checking methods, it does not need toscan the mining data set, nor the result set, and iseasy to integrate with the top-down search process. The correctness of this method is provedby both theoretic proof and experimental results. (3)An algorithm using the above two methods isdesigned and implemented to discover a completeset of frequent closed patterns. Experimental results show that this algorithm is more efficientand uses less memory than bottom-up search styled algorithms, Carpenter and FPclose.The remaining of the paper is organized as follows. In section 2, we present some preliminaries and the mining task. In section 3, we describe our top-down search strategy and compare it with the bottom-up search strategy. We present the new algorithm in section 4 and conduct experimental study in section 5. Finally, we give the related work in section 6 and conclude the study in section 7.2 PreliminariesLet T be a discretized data table (or data set), composed of a set of rows, S = {r1, r2, …, r n}, where r i (i = 1, …, n) is called a row ID, or rid in short. Each row corresponds to a sample consisting of k discrete values or intervals, and I is the complete set of these values or intervals, I = {i1, i2, …, i m}. For simplicity, we call each i j an item. We call a set of rids S ⊆S a rowset, and a rowset having k rids a k-rowset. Likewise, we call a set of items I ⊆ I an itemset. Hence, a table T is a triple (S, I, R), where R ⊆S × I is a relation. For a r i ∈S, and a i j∈I, (r i, i j) ∈ R denotes that r i contains i j, or i j is contained by r i.Let TT be the transposed table of T, in which each row corresponds to an item i j and consists of a set of rids which contain i j in T. For clarity, we call each row of TT a tuple. Table TT is a triple (I, S, R), where R ⊆S × I is a relation. For a rid r i∈S, and an item i j∈I, (r i, i j) ∈ R denotes that i j contains r i, or r i is contained by i j.Example 2.1 (Table and transposed table) Table 2.1 shows an example table T with 4 attributes (columns): A, B, C and D. The corresponding transposed table TT is shown in Table 2.2. For simplicity, we use number i (i = 1, 2, …, n) instead of r i to represent each rid. 
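As a quick illustration of these preliminaries, the sketch below (our own, not the authors' code) builds the transposed table TT from a small table T and drops tuples with fewer than minsup rids — the same pruning that is applied to Table 2.2 in the text. The dictionary-of-sets representation and the function name transpose are assumptions made for the example.

```python
from collections import defaultdict

def transpose(T, minsup):
    """Build the transposed table TT from table T and drop tuples with
    fewer than minsup rids, mirroring the pruning applied to Table 2.2."""
    TT = defaultdict(set)
    for rid, items in T.items():
        for item in items:
            TT[item].add(rid)
    return {item: sorted(rids) for item, rids in TT.items() if len(rids) >= minsup}

# Table 2.1: each rid maps to the set of items it contains.
T = {1: {'a1', 'b1', 'c1', 'd1'},
     2: {'a1', 'b1', 'c2', 'd2'},
     3: {'a1', 'b1', 'c1', 'd2'},
     4: {'a2', 'b1', 'c2', 'd2'},
     5: {'a2', 'b2', 'c2', 'd3'}}
TT = transpose(T, minsup=2)
for item in sorted(TT):
    print(item, TT[item])
# a1 [1, 2, 3]  a2 [4, 5]  b1 [1, 2, 3, 4]  c1 [1, 3]  c2 [2, 4, 5]  d2 [2, 3, 4]
# (b2, d1 and d3 are dropped because they occur in fewer than minsup rows.)
```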
In order to describe our search strategy and mining algorithm clearly, we need to define an order of these rows. In this paper, we define the numerical order of rids as the order, i.e., a row j is greater than k if j > k.Let minimum support (denoted minsup) be set to 2. All the tuples with the number of rids less than minsup is deleted from TT. Table TT shown in Table 2.2 is already pruned by minsup. This kind of pruning will be further explained in the following sections.In this paper we aim to discover the set of the frequent closed patterns. Some concepts related to it are defined as follows.Table 2.1 An example table Tr i A B C D1a1b1c1d12a1b1c2d23a1b1c1d24a2b1c2d25a2b2c2d3Table 2.2 Transposed table TT of Titemset rowseta1 1,2,3a2 4,5b11, 2, 3, 4c1 1,3c2 2,4,5d2 2,3,4Definition 2.1 (Closure) Given an itemset I ⊆ I and a rowset S ⊆S,we definer(I) = { r i ∈S |∀i j∈ I , (r i, i j) ∈ R }281i(S) = { i j∈I |∀r i ∈ S , (r i, i j) ∈ R }Based on these definitions, we define C(I) as the closure of an itemset I, and C(S) as the closure of a rowset S as follows:C(I) = i(r(I))C(S) = r(i(S))Note that definition 2.1 is applicable to both table T and TT.Definition 2.2 (Closed itemset and closed rowset) An itemset I is called a closed itemset iff I = C(I). Likewise, a rowset S is called a closed rowset iff S = C(S).Definition 2.3 (Frequent itemset and large rowset) Given an absolute value of user-specified threshold minsup, an itemset I is called frequent if |r(I)| ≥minsup, and a rowset S is called large if |S| ≥minsup, where |r(I)| is called the support of itemset I and minsup is called the minimum support threshold. |S| is called the size of rowset S, and minsup is called minimum size threshold accordingly. Further, an itemset I is called frequent closed itemset if it is both closed and frequent. Likewise, a rowset S is called large closed rowset if it is both closed and large.Example 2.2(Closed itemset and closed rowset) In table 2.1, for an itemset {b1, c2}, r({b1, c2}) = {2, 4}, and i({2, 4}) = {b1, c2, d2}, so C({b1, c2}) = {b1, c2, d2}. Therefore, {b1, c2} is not a closed itemset. If minsup = 2, it is a frequent itemset. In table 2.2, for a rowset {1, 2}, i({1, 2}) = {a1, b1} and r({a1, b1}) = {1, 2, 3}, then C(S) = {1, 2, 3}. So rowset {1, 2} is not a closed rowset, but apparently {1, 2, 3} is.Mining task: Originally, we want to find all of the frequent closed itemsets which satisfy the minimum support threshold minsup from table T. After transposing T to transposed table TT, the mining task becomes finding all of the large closed rowsets which satisfy minimum size threshold minsup from table TT.3 Top-down Search StrategyBefore giving our top-down search strategy, we will first look at what is bottom-up search strategy used by the previous mining algorithms [5, 6, 7, 8]. For simplicity, we will use Carpenter as a representative for this group of algorithms since they use the same kind of search strategy.3.1 Bottom-up Search StrategyFigure 3.1 shows a row enumeration tree that uses the bottom-up search strategy. By bottom-up we mean that along every search path, we search the row enumeration space from small rowsets to large ones. For example, first single rows, then 2-rowsets, …, and finally n-rowsets. Both depth-first and breadth-first search of this tree belong to this search strategy.In Figure 3.1, each node represents a rowset. Our mining task is to discover all of the large closed rowsets. So the main constraint for mining is the size of rowset. 
Since it is monotonic in terms of bottom up search order, it is hard to prune the row enumeration search space early. For example, suppose minsup is set to 3, although obviously all of the nodes in the first two levels from the root cannot satisfy this constraint, these nodes still need to be checked [5, 6, 7, 8]. As a result, as the minsup increases, the time needed to complete the mining process cannot decrease rapidly. This limits the application of this kind of algorithms to real situations.In addition, the memory cost for this kind of bottom-up search is also big. Take Carpenter as an example. Similar to several other algorithms [6, 7, 8], Carpenter uses a pointer list to point to each tuple belonging to an x-conditional transposed table. For a table with n rows, the maximum number of different levels of pointer lists needed to remain in memory is n, although among which the first (minsup– 1) levels of pointer lists will not contribute to the final result.These observations motivate the proposal of our method.3.2 Top-down Search StrategyAlmost all of the frequent pattern mining algorithms dealing with the data set without transposition use the anti-monotonicity property of minsup to speed up the mining process. For transposed data set, the minsup {}12345121314152324251231241251341351452342352453453435451234123513452345123451245Figure 3.1 Bottom-up row enumeration tree282constraint maps to the size of rowset. In order to stop further search when minsup is not satisfied, we propose to exploit a top-down search strategy instead of the bottom-up one. To do so, we design a row enumeration tree following this strategy, which is shown in Figure 3.2. Contrary to the bottom-up search strategy, the top-down searching strategy means that along each search path the rowsets are checked from large to small ones.In Figure 3.2, each node represents a rowset. We define the level of root node as 0, and then the highest level for a data set with n rows is (n – 1).It is easy to see from the figure 3.2 that for a table with n rows, if the user specified minimum support threshold is minsup , we do not need to search all of rowsets which are in levels greater than (n – minsup ) in the row enumeration tree. For example, suppose minsup = 3, for data set shown in Table 2.1, we can stop further search at level 2, because rowsets represented by nodes at level 3 and 4 will not contribute to the set of frequent patterns at all.With this row enumeration tree, we can also mine the transposed table TT by divide-and-conquer method. Each node of the tree in Figure 3.2 corresponds to a sub-table. For example, the root represents the whole table TT, and then it can be divided into 5 sub-tables: table without rid 5, table with 5 but without 4, table with 45 but without 3, table with 345 but without 2, and table with 2345 but without 1. Here 2345 represents the set of rows {2, 3, 4, 5}, and same holds for 45 and 345. Each of these tables can be further divided by the same rule. We call each of these sub-tables x-excluded transposed table , where x is a rowset which is excluded in the table. Tables corresponding to a parent node and a child node are called parent tableand child table respectively. 
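To make the pruning opportunity offered by the top-down order concrete, here is a minimal sketch (ours, not the paper's implementation) that lists the rowsets of the row-enumeration tree of Figure 3.2 level by level and stops at level n − minsup. For n = 5 and minsup = 3 it never generates the 2-rowsets or single rows, matching the example discussed above; the function name top_down_levels is illustrative.

```python
from itertools import combinations

def top_down_levels(n, minsup):
    """List the rowsets of the top-down row-enumeration tree level by level:
    level 0 is the full rowset {1,...,n}, and level L holds the rowsets of
    size n - L.  Rowsets smaller than minsup can never be large, so the
    walk stops after level n - minsup."""
    rids = range(1, n + 1)
    for level in range(0, n - minsup + 1):
        yield level, [set(c) for c in combinations(rids, n - level)]

for level, rowsets in top_down_levels(n=5, minsup=3):
    print(level, rowsets)
# Level 0: {1,2,3,4,5}; level 1: the 4-rowsets; level 2: the 3-rowsets.
# Levels 3 and 4 (2-rowsets and single rows) are never generated.
```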
Following is the definition of x-excluded transposed table .Definition 3.1 (x-excluded transposed table ) Given a rowset x = {r i1, r i2, …, r ik } with an order such that r i1 > r i2 > … > r ik , a minimum support threshold minsup and its parent table TT|p , an x-excluded transposed table TT|x is a table in which each tuple contains rids less than any of rids in x , and at the same time contains all of the rids greater than any of rids in x . Rowset x is called an excluded rowset .Example 3.1 (x-excluded transposed table ) For transposed table TT shown in Table 2.2, two of its x-excluded tranposed tables are shown in Tables 3.1 and 3.2 respectively, assuming minsup = 2.Table 3.1 shows an x-excluded tranposed table TT|54, where x = {5, 4}. In this table, each tuple only contains rids which are less then 4, and contains at least two such rids as minsup is 2. Since the largest rid in the original data set is 5, it is not necessary for each tuple to contain some other rids . Procedures to get this table are shown in Example 3.2.Table 3.2 is an x-excluded transposed table TT|4, where x = {4}. Its parent table is the table shown in Table 2.2. Each tuple in TT|4 must contain rid 5 as it is greater than 4, and in the meantime must contain at least one rid less than 4 as minsup is set to 2. As a result, in Table 2.2 only those tuples containing rid 5 can be a candidate tuple of TT|4. Therefore, only tuples a 2 and c 2 satisfy this condition. But tuple a 2 does not satisfy minsup after excluding rid 4, so only one tuple left in TT|4. Note, although the current size of tuple c 2 in TT|4 is 1, its actual size is 2 since it contains rid 5 which is not listed explicitly in the table.Table 3.1 TT|54 itemset rowset a 1 1, 2, 3 b 11, 2, 3c 1 1, 3d 2 2, 3 Table 3.2 TT|4 itemsetrowsetc 2 2The x-excluded transposed table can be obtained bythe following steps.(1) Extract from TT or its direct parent table TT|peach tuple containing all rids greater than r i1.1234512 13 14 1523 24 25123 124125 134 135 145234235 245 345 343545 1234 1235 1345234512345 1245Figure 3.2 Top-down row enumeration tree 283(2)For each tuple obtained in the first step, keeponly rids less than r ik.(3)Get rid of tuples containing less than (minsup– j) number of rids, where j is the number ofrids greater than r i1in S.The reason of the operation in step 3 will be given in section 4. Note that the original transposed table corresponds to TT|φ, where φ is an empty rowset.Figure 3.3 shows the corresponding excluded row enumeration tree for the row enumeration tree shown in Figure 3.2. This tree shows the parent-child relationship between the excluded rowsets.Example 3.2 (Procedure to get x-excluded transposed table) Take TT|54 as an example, here is the step to get it. Table TT|5 shown in Table 3.3 is its parent table.Table 3.3 TT|5itemset rowseta11, 2, 3b11, 2, 3, 4c1 1,3c2 2,4d22, 3, 4(1)Each tuple in table TT|5 is a candidate tupleof TT|54 as there is no rid greater than 5 forthe original data set.(2)After excluding rids not less than 4, the tableis shown in Table 3.4.(3)Since tuple c2 only contains one rid, it doesnot satisfy minsup and is thus pruned fromTT|54. 
Then we get the final TT|54 shown inTable 3.1.Table 3.4 TT|54 without pruningitemset rowseta11, 2, 3b11, 2, 3c1 1,3c2 2d2 2,3From definition 3.1 and the above procedure to get x-excluded transposed table we can see that the size of the excluded table will become smaller and smaller due to the minsup threshold, so the search space will shrink rapidly.As for the memory cost, in order to compare with Carpenter, we also use pointer list to simulate the x-excluded transposed table. What is different is that this pointer list keeps track of rowsets from the end of each tuple of TT, and we also split it according to the current rid. We will not discuss the detail of implementation due to space limitation. However, what is clear is that when we stop search at level (n –minsup), we do not need to spend more memory for all of the excluded transposed tables corresponding to nodes at levels greater than (n – minsup), and we can release the memory used for nodes along the current search path. Therefore, comparing to Carpenter, it is more memory saving. This is also demonstrated in our experimental study, as Carpenter often runs out of memory before completion.4 AlgorithmTo mining frequent closed itemsets from high dimensional data using the top-down search strategy, we design an algorithm, TD-Close, and compare it with the corresponding bottom-up based algorithm Carpenter. In this section, we first present our new closeness-checking method and then describe the new algorithm.4.1 Closeness-Checking MethodTo avoid generating all the frequent itemsets during the mining process, it is important to perform the closeness checking as early as possible during mining. Thus an efficient closeness-checking method has been developed, based on the following lemmas.{} 54321545352514342415435425415325315214324314213213231215432543153214321543215421Figure 3.3 Excluded row enumeration tree284Lemma 4.1 Given an itemset I ⊆ I and a rowset S ⊆ S , the following two equations hold:r(I) = r(i(r(I))) (1) i(S) = i(r(i(S))) (2)Proof . Since r(I) is a set of rows that share a given set of items, and i(S) is a set of items that is common to a set of rows, according to this semantics and definition 2.1, these two equations are obviously correct. Lemma 4.2 In transposed table TT, a rowset S ⊆ S is closed iff it can be represented by an intersection of a set of tuples, that is:∃I ⊆ I, s.t. S = r(I) = ∩j r({i j })where i j ∈ I, and I = {i 1, i 2, …, i l }Proof . First, we prove that if S is closed then s = r(I) holds. If S is closed, according to definition 2.2, S = C(S) = r(i(S)), we can always set I = i(S), so S = r(I) holds. Now we need to prove that if s = r(I) holds then S is closed. If s = r(I) holds for some I ⊆ I , according to definition 2.1, C(S) = r(i(S)) = r(i(r(I))) holds. Then based on Lemma 4.1 we have r(i(r(I))) = r(I) = S, so C(S) = S holds. Therefore S is closed.Lemma 4.3 Given a rowset S ⊆ S , in transposed table TT, for every tuple i j containing S, which means i j ∈ i(S), if S ≠ ∩j r({i j }), where i j ∈ i(S), then S is not closed.Proof . For tuple i j ∈ i(S), ∩j r({i j }) ⊇ S apparently holds. If S ≠ ∩j r({i j }) holds, then ∩j r({i j }) ⊂ S holds, which means that there exists at least another one item, say y, such that S ∪y = ∩j r({i j }). So S ≠ r(I), that is S ≠ r(i(S)). Therefore S is not closed.Lemmas 4.2 and 4.3 are the basis of our closeness-checking method. In order to speed up this kind of checking, we add some additional information for each x-excluded transposed table . 
The third column of the table shown in Tables 4.1 or 4.3 is just it. The so-called skip-rowset is a set of rids which keeps track of of the rids that are excluded from the same tuple of all of its parent tables. When two tuples in an x-excluded transposed table have the same rowset, they will be merged to one tuple, and the intersection of corresponding two skip-rowsets will become the current skip-rowset .Example 4.1 (skip-rowset and merge of x-excluded transposed table ) In example 3.2, when we got TT|54 from its parent TT|5, we excluded rid 4 from tuple b 1 and d 2 respectively. The skip-rowset of these two tuples in TT|5 should be empty as they do not contain rid 5 in TT|φ. Therefore, the skip-rowset of these twotuples in TT|54 is 4. Table 4.1 shows TT|54 with this kind of skip-rowsets .In Table 4.1, the first 2 tuples have the same rowset {1, 2, 3}. After merging these two tuples, it becomes Table 4.2. The skip-rowset of this rowset becomes empty because the intersection of an empty set and any other set is still empty. If the intersection result is empty, it means that currently this rowset is the result of intersection of two tuples. When it is time to output a rowset, this skip-rowset will be checked. If it is empty, then it must be a closed rowset.Table 4.1 TT|54 with skip-rowset itemset rowset skip-rowseta 1 1, 2, 3b 11, 2, 34 c 1 1, 3 d 2 2, 34Table 4.2 TT|54 after merge itemset rowset skip-rowseta 1b 11, 2, 3c 1 1, 3d 2 2, 34Table 4.3 TT|4 with skip-rowset itemsetrowsetskip-rowsetc 2 244.2 The TD-Close AlgorithmBased on the top-down search strategy and thecloseness-checking method, we design an algorithm, called TD-Close, to mine all of the frequent closed patterns from table T.Figure 4.1 shows the main steps of the algorithm. It begins with the transposition operation that transforms table T to the transposed table TT. Then, after the initialization of the set of frequent closed patterns FCP to empty set and excludedSize to 0, the subroutine TopDownMine is called to deal with each x-excluded transposed table and find all of the frequent closed itemsets. The General processing order of rowsets is equivalent to the depth-first search of the row enumeration tree shown in Figure 3.2.285Subroutine TopDownMine takes each x-excluded transposed table and another two variables, cMinsup excludedSize, as parameters and checks each candidate rowset of the x-excluded transposed table to see if it is closed. Candidate rowsets are those large rowsets that occur at least once in table TT. Parameter cMinsup is a dynamically changing minimum support threshold as indicated in step 5, and excludedSize is the size of rowset x. There are five main steps in this subroutine, which will be explained one by one as follows. 
Algorithm TD-Close
Input: Table T, and minimum support threshold, minsup
Output: A complete set of frequent closed patterns, FCP
Method:
1. Transform T into transposed table TT
2. Initialize FCP = Φ and excludedSize = 0
3. Call TopDownMine(TT|Φ, minsup, excludedSize)

Subroutine TopDownMine(TT|x, cMinsup, excludedSize)
Method:
1. Pruning 1: if excludedSize >= (n - minsup) return;
2. Pruning 2: If the size of TT|x is 1, output the corresponding itemset if the rowset is closed, and then return.
3. Pruning 3: Derive TT|x∪y and TT|x', where
   y is the largest rid among rids in tuples of TT|x,
   TT|x' = {tuple ti | ti ∈ TT|x and ti contains y},
   TT|x∪y = {tuple ti | ti ∈ TT|x and, if ti contains y, the size of ti must be greater than cMinsup}.
   Note: we delete y from both TT|x∪y and TT|x'.
4. Output: Add to FCP the itemset corresponding to each rowset in TT|x∪y with the largest size k and ending with rid k.
5. Recursive call:
   TopDownMine(TT|x∪y, cMinsup, excludedSize+1)
   TopDownMine(TT|x', cMinsup-1, excludedSize)

Figure 4.1 Algorithm TD-Close
In step 1, we apply pruning strategy 1 to stop processing the current excluded transposed table. Pruning strategy 1: If excludedSize is equal to or greater than (n - minsup), then there is no need to make any further recursive call of TopDownMine. ExcludedSize is the number of rids excluded from the current table. If it is not less than (n - minsup), the size of each rowset in the current transposed table must be less than minsup, so these rowsets cannot become large.
In step 2, we apply pruning strategy 2 to stop further recursive calls. Pruning strategy 2: If an x-excluded transposed table contains only one tuple, it is not necessary to make further recursive calls to deal with its child transposed tables. The reason for this pruning strategy is apparent. Suppose the rowset corresponding to this tuple is S. From the itemset point of view, any child transposed table of the current table will not produce any itemsets different from the one corresponding to rowset S. From the rowset point of view, each rowset Si corresponding to a child transposed table of S is a subset of S, and Si cannot be closed because r(i(Si)) ⊇ S holds, and therefore Si ≠ r(i(Si)) holds.
Of course, before returning according to pruning strategy 2, the current rowset S might be a closed rowset, so if the skip-rowset is empty, we need to output it first.
Example 4.2 (pruning strategy 2) For table TT|4 shown in Table 4.3, there is only one tuple in this table. After we check this tuple to see whether it is closed (it is apparently not closed, as its skip-rowset is not empty), we do not need to produce any child excluded transposed table from it. That is, according to the excluded row enumeration tree shown in Figure 3.3, all of the child nodes of node {4} are pruned. This is because all of the subsets of the current rowset cannot be closed anymore, since it is already contained by a larger rowset.
Step 3 is performed to derive from TT|x two child excluded transposed tables: TT|x∪y and TT|x', where y is the largest rid among all rids in tuples of TT|x. These two tables correspond to a partition of the current table TT|x. TT|x∪y is the sub-table without y, and TT|x' is the sub-table with every tuple containing y. Since every rowset that will be derived from TT|x' must contain y, we delete y from TT|x' and at the same time decrease cMinsup by 1.
Pruning strategy 3 is applied to shrink table TT|x∪y. Pruning strategy 3: Each tuple t containing rid y in TT|x will be deleted from TT|x∪y if the size of t (that is, the number of rids t contains) equals cMinsup.
Example 4.3 (pruning strategy 3) Suppose we have just finished dealing with table TT|54, which is shown in Table 4.2, and we need to create TT|543 with cMinsup being 2. Then, according to pruning strategy 3, tuples c1 and d2 will be pruned from TT|543, because after excluding rid 3 from these two tuples their size will become less than cMinsup, although currently they satisfy the minsup threshold. As a result, there is only one tuple {a1b1} left in TT|543, as shown in Table 4.4.
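To tie the preceding pieces together, the sketch below is a deliberately brute-force rendering of the mining task on the transposed table: it visits rowsets from large to small (the top-down order), stops below minsup, and keeps a rowset only if it passes the intersection test of Lemma 4.2, reporting the corresponding closed itemset i(S). It omits the x-excluded tables, skip-rowsets and pruning strategies 2 and 3 of the real TD-Close algorithm, so it is useful only for checking results on tiny inputs such as Table 2.2; all names are ours.

```python
from itertools import combinations

def frequent_closed_rowsets(TT, n, minsup):
    """Brute-force version of the mining task on the transposed table TT.

    TT maps each item to the set of rids that contain it.  A rowset S is
    reported when |S| >= minsup and S equals the intersection of the rid sets
    of all tuples containing S (the closedness test of Lemma 4.2).  Rowsets
    are visited from size n down to minsup (the top-down order), so levels
    deeper than n - minsup are never touched."""
    results = {}
    for size in range(n, minsup - 1, -1):            # large rowsets first
        for S in combinations(range(1, n + 1), size):
            S = frozenset(S)
            items = [i for i, rids in TT.items() if S <= rids]   # i(S)
            if not items:
                continue
            closure = frozenset.intersection(*(frozenset(TT[i]) for i in items))
            if closure == S:                          # S is closed (Lemma 4.2)
                results[S] = sorted(items)            # report the itemset i(S)
    return results

# Transposed table TT of Table 2.2, n = 5 rows, minsup = 2.
TT = {'a1': {1, 2, 3}, 'a2': {4, 5}, 'b1': {1, 2, 3, 4},
      'c1': {1, 3}, 'c2': {2, 4, 5}, 'd2': {2, 3, 4}}
for rowset, itemset in frequent_closed_rowsets(TT, n=5, minsup=2).items():
    print(sorted(rowset), itemset)
```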
区块链常用词汇中英俄语翻译对照 (Blockchain vocabulary: Chinese-English-Russian reference)

挖矿篇 (Mining)
挖矿 / Mining / Майнинг
矿工 / Miner / Майнер
矿池 / Mining pools / Майнинг пул
矿机 / Mining rigs / Майнинг риг
算力(哈希率) / Hash rate / Мощности (Хешрейт)
千哈希/秒 / Kilo-hashes per Second (KH/s) / Кило-хэш в секунду (КХ/с)

基础知识篇 (Basics)
区块链 / Blockchain / Блокчейн
区块 / Block / Блок
比特币 / Bitcoin / Биткоин
加密货币 / Cryptocurrency / Криптовалюта
钱包 / Wallet / Кошелёк
账户 / Accounts / Счета
地址 / Address / Адрес
确认 / Confirmation / Подтверждение
去中心化应用 / Dapp / Dapp
去中心化自治组织 / DAO / ДАО (Децентрализованная Автономная Организация)
共识 / Consensus / Единодушие
分布式账本 / Distributed Ledger / Распределённая книга
中央帐簿 / Central Ledger / Центральная Книга
分布式网络 / Distributed Network / Распределённая сеть
智能合约 / Smart Contracts / Смарт-Контракты
交易区块 / Transaction Block / Блок Транзакций
手续费 / Transaction Fee / Операционный сбор/комиссия
积分奖励 / Block Reward / Награда За Блок
51%攻击 / 51% Attack / Атака

专业词汇 (Specialized terms)
公有链 / Public blockchain / Публичный блокчейн
私有链 / Private blockchain / Приватный блокчейн
联盟链 / Consortium blockchain / Блокчейн-консорциум
工作证明 / PoW (Proof of Work) / Доказательство работы
股权证明 / PoS (Proof of Stake) / Доказательство доли владения
混合PoS/PoW / Hybrid PoS/PoW / Гибридный PoS/PoW
数字加密 / Digital Signature / Цифровая подпись / Электронная цифровая подпись (ЭЦП)
双重支付 / Double Spending / Двойное расходование
分叉 / Fork / Филиал
多重签名 / Multi-Signature / Мульти-Подпись
节点 / Node / Узел
预言机 / Oracles / Предсказания
点对点 / Peer to Peer / Одноранговая сеть
公用地址 / Public Address / Публичный адрес
创世区块 / Genesis Block / Генезис Блок
一个联盟区块链 / Consortium blockchain / Блокчейн-консорциум
区块高度 / Block Height / Высота Блока
区块资源管理器 / Block Explorer / Блок-эксплорер / блок-проводник
专用集成电路 / ASIC / Специализированная интегральная схема
公钥 / Public Key / Открытый ключ
私钥 / Private Key / Приватный ключ
加密哈希函数 / Cryptographic Hash Function / Криптографическая Хэш-Функция
容易程度 / Difficulty / Трудность
软件学报ISSN 1000-9825, CODEN RUXUEW E-mail: jos@Journal of Software,2012,23(6):1542−1560 [doi: 10.3724/SP.J.1001.2012.04200] +86-10-62562563 ©中国科学院软件研究所版权所有. Tel/Fax:∗不确定性Top-K查询处理李文凤1, 彭智勇2+, 李德毅31(武汉大学软件工程国家重点实验室,湖北武汉 430072)2(武汉大学计算机学院,湖北武汉 430072)3(中国电子系统工程研究所,北京 100840)Top-K Query Processing Techniques on Uncertain DataLI Wen-Feng1, PENG Zhi-Yong2+, LI De-Yi31(State Key Laboratory of Software Engineering, Wuhan University, Wuhan 430072, China)2(Computer School, Wuhan University, Wuhan 430072, China)3(Institute of Electronic System Engineering of China, Beijing 100840, China)+ Corresponding author: E-mail: peng@Li WF, Peng ZY, Li DY. Top-K query processing techniques on uncertain data. 2012,23(6):1542−1560./1000-9825/4200.htmAbstract: Efficient processing of Top-K queries has always been a significant technique in the interactiveenvironment involving massive amounts of data. With the emerging of imprecise data,the management of them hasgradually raised people’s attention. In contrast with traditional Top-K query, Top-K query on uncertain datapresents different features both in semantics and computation. On the basis of prevailing uncertain data model andpossible world semantic model, researchers have already studied multiple sound semantics and efficient approaches.This survey describes and classifies Top-K processing techniques on uncertain data including semantics, rankcriteria, algorithms and implementation levels, and so on. Finally, the challenges and future research trends inprocessing of Top-k queries on uncertain data are predicated.Key words: semantic of Top-K queries; processing of Top-K queries; rank criterion; uncertain data; possibleworld摘要: 高效Top-K查询处理在涉及大量数据交互的应用中是一项重要技术,随着应用中不确定性数据的大量涌现,不确定性数据的管理逐渐引起人们的重视.不确定性数据上Top-K查询从语义和处理上都呈现出与传统Top-K查询不同的特点.在主流不确定性数据模型和可能世界语义模型下,学者们已经提出了多种不确定性Top-K查询的语义和处理方法.介绍了当前不确定性Top-K查询的研究工作,并对其进行分类,讨论包括语义、排序标准、算法以及应用等方面的技术.最后提出不确定性Top-K查询面临的挑战和下一步的发展方向.关键词: Top-K查询语义;Top-K查询处理;排序标准;不确定性数据;可能世界中图法分类号: TP311文献标识码: A∗基金项目: 国家自然科学基金(61070011); 湖北省自然科学基金国际合作重点项目; 武汉市学科带头人计划(201150530139)收稿时间:2011-08-08; 修改时间: 2011-11-02; 定稿时间: 2012-02-15; jos在线出版时间: 2012-03-26CNKI网络优先出版: 2012-03-26 13:47, /kcms/detail/11.2560.TP.20120326.1347.001.html李文凤等:不确定性Top-K查询处理15431 引言随着数据采集技术的进步和网络的快速发展,人们可获取的数据量越来越大.如何从大量数据中选择最符合查询条件的信息,一直是数据管理和信息检索的重要课题.而高效Top-K查询处理在涉及大量数据交互的应用中逐渐成为一项重要技术,在数据库、网络、分布式系统等领域被广泛研究[1−3].同时,随着自动生成数据、推断数据以及大众数据的大量产生,数据往往存在大量的噪声、丢失值、错误以及不一致,不确定性数据的管理正随着现实应用逐渐被人们重视[4−7].这些应用主要包括:大规模传感器网络系统、信息抽取和数据整合系统、科学数据管理系统以及社会网络.下面是一个不确定性数据库实例.例1:多点温度监测可用于如火灾前期温度报警、空调环境监测、工业温度探测、粮仓、土壤、温室、养殖场、农场、冰窟热窟、矿业等多种应用,为提高监测精确程度,常常设置多个温度监测器,由于物理误差和多源监测,监测到的数据常常含有不确定性.表1是某时刻一个粮仓的多点温度监测数据库简表,每个温度监测器在特定的时间返回一个温度值,同一个位置有可能放置多个温度监测器,但同一时间只可能使用其中一个监测器的数据.每个监测器有一个可信度属性,反映该监测器的可靠程度.Table 1 A uncertain database表1不确定性数据库实例时间检测器位置标号温度可信度t1 11:45 M1 W-101 50°C0.4t2 11:45 M2 W-218 46°C 1t3 11:45 M3 E-012 45°C0.4t4 11:45 M4 E-012 44°C0.5t5 11:45 M5 S-411 15°C0.7t6 11:45 M6 S-411 10°C0.3Rules: (t3⊕t4), (t5⊕t6)当考虑数据的不确定性时,Top-K查询从查询语义到处理技术都面临着巨大挑战.考虑例1数据库上一个Top-1查询:• 在11:45温度最高的位置根据表1,记录t1温度最高,但可靠程度只有0.4;记录t2虽然温度略低,但可靠程度为1.这种情况究竟应该返回哪条记录?记录t3和t4位置区域同为E-012但可靠程度不同,位置E-012温度值该取何值?在不确定世界里,仅仅依靠温度这一个属性决定返回哪条记录,显得不再合理.由于不确定性的存在,Top-K查询变得不再清晰和不易操作,不能再像传统Top-K查询那样,仅仅基于某分值函数返回具有最大分值的对象.幸运的是,随着不确定性数据库研究的进展和对不确定性Top-K查询关注程度加强,针对不确定性Top-K查询处理的研究工作在近年来取得了较大进展[8−16].本文将基于典型的不确定性数据模型以及可能世界语义[17−22]模型,讨论近期出现的主流不确定性Top-K查询处理技术.主要涉及的研究方向(如图1所示)包括:(1) 语义合理性和数学性质研究.由于不确定性的引入,“不确定性Top-K查询究竟要返回哪些记录?”这个问题的答案不再清晰,对这个问题的不同回答形成了不确定性Top-K查询的不同语义形式.本文首先介绍学术界广泛研究的几种主要语义形式及其满足的数学性质,通过对比,对它们的合理性和满足的数学性质进行分析;(2) 
排序标准的研究.确定性Top-K对记录进行排序的标准是记录在某分值函数作用下的得分.因此,分值函数成为排序的唯一标准.在不确定性Top-K查询中,所依赖的排序标准如何变化?本文针对分值与概率的平衡问题、分值的连续与离散问题,对目前不确定性Top-K排序标准进行分析.另外,鉴于排序标准的变化导致排序结果的差异巨大,本文还将介绍统一化排序方法的新进展;(3) 查询处理算法的研究.基于可能世界语义进行不确定性数据处理,最大的问题就是可能世界实例爆炸问题,在不确定性Top-K查询处理中,该问题仍然是我们面临的主要问题.具体而言,确定性Top-K1544 Journal of Software软件学报 V ol.23, No.6, June 2012的高效查询处理就是研究如何在最少的记录读取量和最小的实例空间内完成查询.本文将从确定性方法和近似方法两个方面展开讨论;(4) 应用层面研究.随着不确定性Top-K查询研究的展开,学者们展开了诸如高维数据库、分布式系统、数据流等应用中不确定性Top-K的研究.Fig.1 Classification of Top-k query processing techniques on uncertain data图1 不确定性Top-K查询处理技术分类1.1 标号本文描述的技术中将使用统一的符号标记,表2中将列出本文频繁使用的标号及含义.Table 2 Frequently used notations表2正文标号表符号含义D 不确定性数据库w∈W w:一个可能世界实例,W:所有可能世界形成空间|w| 可能世界实例的规模(可能世界实例中记录数)|W| 可能世界空间W的规模(可能世界实例数)p(w) 可能世界w的概率t i不确定性数据记录p(t i) 不确定性记录的存在概率τj生成规则,存在约束,依赖规则p(τj) 生成规则的概率(所有生成规则内记录概率和)|τj| 生成规则的规模(生成规则中记录数)f(t i)=f i分值函数f作用下,t i的得分为f iR k不确定性Top-K查询结果集1.2 大纲本文第2节介绍主流的不确定性数据模型和可能世界语义.第3节将介绍目前广泛使用的不确定性Top-K 查询语义,并分析其合理性和满足的数学性质.第4节将从不确定性数据排序标准的角度阐述目前不确定性Top-K的研究进展.第5节从确定性方法和近似方法讨论不确定性Top-K查询处理的算法.第6节介绍各应用层面的不确定性Top-K研究技术.第7节总结各技术并给出未来的研究方向.2 基础知识从20世纪80年代初就已经开始了针对不确定性数据的研究[17,19,20],为了将不确定性引入数据模型,出现了很多用来描述不确定性数据的模型[19,20,22−24].文献[25]中,从模型特点和表达能力比较了各数据模型,提出了下层完备、上层不完备的两层数据模型.而文献[26]则针对关系型、半结构化、数据流、高维数据中主要的不确定性数据模型进行了比较分析.在不确定性Top-K查询处理的研究中,大部分的研究工作都不是基于完备的不李文凤 等:不确定性Top-K 查询处理 1545确定数据模型展开的,主要原因是完备数据模型上的查询难于推理和展开处理.作为基础背景,本文将首先介绍几种主流不确定性数据模型[17,20,25,27]以及被广泛应用的可能世界语义模型[17,18,21].2.1 不确定性数据模型不确定性数据模型中,第1个提出的完备模型是c -table [17],c -table 由c -tuples 组成,c -tuples 具备以下特点:(1) 某些属性值自由变量代替;(2) 每条记录都有一个属性condition,定义了该记录中自由变量满足的关系范式;(3) 整个c -table 可能还会有一个全局约束条件.对全部的变量进行赋值,每次能满足所有约束条件的赋值会形成一个可能表实例.表3和表4分别是一个c -table 及其一个可能表实例的例子.Table 3 An instance of c -table model Table 4 A possible state of the c -table instance 表3 c -table 实例 表4 c -table 的可能表实例c -table 之所以完备,是允许变量的约束任意.但在实际的应用中,定义变量的任意约束,通常意味着读取和推理的代价很高.事实上,很多不确定性数据处理,包括不确定性Top-k 查询处理,关注的不确定性主要包括以下两个方面:(1) 属性级不确定性.在一个不确定性数据库D 中,如果有一个或多个属性是不确定性的:a. 属性值是一个离散值的集合,每个离散值关联一个出现概率;b. 属性值是一个可能值的连续分布,关联一个概率密度函数,则此数据库含属性级不确定性.实例化该数据库时,每条记录在其不确定性属性值或分布中抽取一个可能值,形成一个实例表.很多实际应用如传感器、电子标签、GPS 值形成的数据记录含属性级不确定性,文献[27]中描述了针对关系模型进行属性级不确定性扩展的probability ?-table 模型;(2) 记录级不确定性.在一个不确定性数据库D 中,假如记录不含不确定性属性,但整个数据库中的每条记录都以一定概率出现,则此数据库含记录级不确定性.更复杂的记录级不确定性还含有一组生成规则,每个生成规则含有一组记录,规定该组记录满足的约束条件.通常,生成规则有两种:a. 互斥规则,规定该组记录只能有一个出现,不能同时出现;b. 
共存规则,规定该组记录必须同时存在.probability or-set table [20]是针对记录级不确定性对关系模型的一种重要扩展.同时含有属性级和记录级不确定性的probability or-?-set table [22,25]又被称为x -relation [20].但在目前已有的不确定性Top-K 查询技术中,同时涉及两种不确定性的并不常见,经常是仅仅关注其中一方面或只处理其简单情况.例如,属性级不确定性只处理离散值不确定性,又或是记录级不确定性只处理不含生成规则的情况.2.2 可能世界模型目前研究的主流不确定性数据库为概率数据库,它建立在可能世界模型的基础上,可能世界空间由一系列 可能世界实例组成,即W ={w 1,w 2,…,w n },P :W →[0,1]是其上一个概率分布,且1,...,()1,()0j j j n p w p w ==>∑.每一个 可能世界实例对应一个确定性数据库,其中,那些非确定性属性是满足约束条件的确定值.可能世界语义是不确定性查询处理技术的出发点和基础.针对第2.1节中涉及的不确定性,可以用图2示意其结构.一个不确定性数据库可以分别或同时含有属性级和记录级不确定性;而对于不确定性属性,其值可以离散或连续;对于以一定概率存在的记录之间,可以没有生成规则也可以有生成规则,含有生成规则时,生成规则可以是互斥、共存或其他规则[28].ID Attr.1 Attr.2 con 001 5 z 002 x 4 x ≠z ∧x ≠1003 7 y x =y ∨z =y ID Attr.1 Attr.2 001 5 3 002 6 4 003 7 61546Journal of Software 软件学报 V ol.23, No.6, June 2012Fig.2 The uncertainty levels in a uncertain database图2 不确定性数据库的组成通过表5~表7,我们会看到几个典型的不确定性数据库及其可能世界空间的实例,分别是:(a) 仅含属性级不确定性,且不确定性属性值离散;(b) 仅含属性级不确定性,且不确定性属性值连续;(c) 仅含记录级不确定性,且有互斥生成规则.表5是属性级不确定性数据库的一个实例,不确定性属性值为离散值.表6是不确定属性值为连续值的一个实例,其中,N (15,5)是期望为15、方差为5的正态分布;B (1000,0.5)是二项分布.值得注意的是,属性值连续时,可能世界有可能不可数.表7是对引言中例1的抽象,它是一个仅含记录级不确定性的数据库,每条记录有一个附属属性,用来表示该记录存在的概率.在规则表中,每条规则τi 中的记录相互排斥,即不出现在同一个可能世界中.Table 5 An instance of attribute-level uncertainty Table 6 An instance of attribute-level uncertainty(discrete value distribution) (continuous value distribution)表5 属性值离散的不确定性数据库 表6 属性值连续的不确定性数据库Table 7 An instance of tuple-level uncertainty with exclusion rules表7 含互斥规则的记录级不确定性数据库记录 值 概率 互斥规则 可能世界t 1 50 0.4 τ1 {t 3,t 4} w 1={t 1,t 2,t 5} w 7={t 2,t 5}t 2 46 1 τ2 {t 5,t 6} w 2={t 1,t 2,t 3,t 6}w 8={t 2,t 3,t 6}t 3 45 0.4 w 3={t 1,t 2,t 4,t 5}w 9={t 2,t 4,t 5}t 4 44 0.5 w 4={t 1,t 2,t 6} w 10={t 2,t 6}t 5 15 0.7 w 5={t 1,t 2,t 3,t 5}w 11={t 2,t 3,t 5}t 6 10 0.3 w 6={t 1,t 2,t 4,t 6}w 12={t 2,t 4,t 6}实际应用中的不确定性数据库,是以上3种典型数据库中的一种或者组合情况.目前,Top-K 查询处理中涉及最复杂的情况就是既含有属性级不确定性,又含有记录级不确定性和生成规则,且不确性属性值连续.2.3 可能世界语义下的计算问题可能世界语义下的基本计算问题包括:(1) 可能世界空间规模和可能世界实例规模的计算;(2) 可能世界概率(分布)的计算.可能世界空间规模即一个可能世界空间中实例的个数.对于第2.2节中涉及的属性为离散值的属性级不确 定性数据库,假设每条不确定性属性t i 有s i 个可选属性,则其可能世界空间规模1||n i i W s ==∏.例如第2.2节中的 例(a),3条记录的不确定属性可选值分别为2,2,1,则可能世界空间规模为2×2×1=4.属性为连续值时,可能世界规记录 属性值 可能世界11123t 2 {(40,0.6),(20,0.4)} w 2={t 1=50,t 2=20,t 3=35}t 3 {(35,1)} w 3={t 1=15,t 2=40,t 3=35}w 4={t 1=15,t 2=20,t 3=35}记录 属性值区间 分布函数 t 1 [4,10] f =1/6 t 2 [0,200] f =N (15,5) t 3 [0,1000] f =B (1000,0.5)李文凤 等:不确定性Top-K 查询处理1547模为无穷.但无论属性值连续还是离散,每个可能世界实例的规模|w |=n . 
对于含生成规则的记录级不确定性数据库,可能世界规模与生成规则有关.由于组成一个可能世界实例需要在概率为1的生成规则中取一个记录,而在概率小于1的生成规则中取0个或1个记录,根据组合数学的计 算方法可知,可能世界空间规模()1()1||||(||1)j j j j p p W ττττ=<=+∏∏;如果概率为1的生成规则有m 条,则可能世界 实例的规模|w |∈[m ,n ].如第 2.2节中的例(c),生成规则τ1概率小于1,τ2概率等于1,则可能世界空间规模为2×1×(2+1)×2=12.每个可能世界实例都有一定的存在概率,而可能世界空间中所有可能世界实例的概率形成了可能世界概 率分布.对于属性级不确定性数据库,每个可能世界实例的概率为所选属性值概率积,即()()i k k i t w p w p t ∈=∏.例 如第2.2节中的例(a),p (w 1)=0.3×0.6×1=0.18.对于记录级不确定性数据库,每个可能世界实例的概率也与生成规 则有关,计算公式为()()(1())i k j k k i j t w w p w p t p τΦτ∈∩==−∏∏,它是所有出现记录的概率积与未出现任何记录的生 成规则的不出现概率的乘积.例如第2.2节中的例(c), p (w 1)=0.4×1×(1−0.9)×0.7=0.028.不论是属性级不确定性数 据库还是记录级不确定性数据库,及有无生成规则,可能世界空间的概率和为1,即()1w W p w ∈=∑.3 不确定性Top -K 查询的语义研究目前,不确定性数据库的研究虽然涉及关系型数据、半结构化数据、本体数据[20,26,29]等,由于关系数据模型的应用最为广泛,不确定性数据库的研究焦点仍然在关系型数据上.因此,大部分的不确定性Top-k 研究也都建立在不确定性关系型数据库上.确定性关系数据库的Top-K 查询的语义非常清晰,就是基于某个分值函数计算每个记录的分值,返回数据库中具有最大分值的前k 个记录.分值函数的计算是基于记录的属性值的,不确定数据模型中,属性值可能有多重选择,此时分值该如何计算?如何比较?不确定数据模型中,记录可能有一定的存在概率,此时分值该如何计算?如何比较?这些问题使究竟该返回哪些记录这个问题变得不再清晰.针对不确定性Top-K 查询,学者们从不同侧面和不同应用的需要给出了不同的查询语义.最优先考虑的是查询语义的合理性,即该语义和返回结果能合理解释并满足实际的查询需求.因此,学者们根据实际的查询需求定义了各式各样的不确定性Top-K 查询语义.这些不确定性Top-K 查询语义看似都满足了一定的应用需求,但实际上无论从形式化定义满足的数学性质还是从返回结果上,都存在巨大差异,因此出现了对不确定性Top-K 查询语义满足的数学性质的研究.第3.1节首先介绍目前出现的比较有影响的不确定性Top-K 查询语义[8−13,15,30,31],并对其进行合理性分析.第3.2节将对不确定性Top-K 查询语义满足的数学性质进行介绍和分析[10,15,30].3.1 不确定性Top -K 查询语义的合理性目前,不确定性Top-K 查询语义的研究有多种定义,比较有影响的包括U-Top K [9,32],U-k Ranks [9,16],PT-k [11−13], Global-Top K [15],Expected Rank [10,30],E-Score Rank [10,30],c -typical-Top K [31]等.它们分别适应不同的应用场景,下面我们将给出各语义的形式化定义及简要解释.定义1(U -Top K ). 设D 是不确定性数据库,其可能世界空间为W ={w 1,w 2,…,w n },T ={T 1,T 2,…,T m },T i 是长度为k 的记录向量,如果T i 中k 个记录是根据某分值函数f 计算的分值排序序列且对应某些可能世界的前k 个记录, 则基于f 的U-Top K 查询返回T *∈T 满足()*()arg max ()i i T Tw w T T p w ∈∈=∑.即将T i 对应的那些可能世界概率求和, 具有最大概率和的T i 是U-Top K 查询的返回结果.定义2(U -k Ranks ). 设D 是不确定性数据库,其可能世界空间为W ={w 1,w 2,…,w n }.对于排序位置i =1,…,k , 分别对应一组记录12,,...,m i i i t t t ,它们是在某可能世界中,根据分值函数f 排序后出现在位置i 的记录.基于f 的U-k Ranks,返回k 个记录{*|1,...,i t i k =},满足()*()arg max ()j j i i i w w t t t p w ∈=∑.也就是将出现在位置i 的可能世界概率求和,取概率最大者作为返回结果的第i 个记录.1548Journal of Software 软件学报 V ol.23, No.6, June 2012定义3(PT -K ). 设D 是不确定性数据库,其可能世界空间为W ={w 1,w 2,…,w n },Q ={q 1,q 2,…,q m },q i 是记录集合,分别对应每个可能世界按某分值函数f 排序的前k 个记录.在Q 的基础上计算每个记录位于Top-k 的概率 ()()ii t q p t p w ∈=∑,设定概率阈值p (0<p ≤1),PT-K 查询返回T ={t i |p (t i )≥p }.即对每个记录出现在Top-K 位置的可 能世界概率求和,取大于等于p 的记录作为PT-K 的结果返回(不限于k 个).定义4(global -Top K ). 设D 是不确定性数据库,其可能世界空间为W ={w 1,w 2,…,w n },Q ={q 1,q 2,…,q m },q i 是记录集合,分别对应每个可能世界按某分值函数f 排序的前k 个记录.在Q 的基础上计算每个记录t 位于Top-k 的 概率()()ii t q p t p w ∈=∑,称为该记录的global-Top K 概率,global-Top K 查询返回具有最大global-Top K 概率的k 个记录.定义5(expected rank ). 设D 是不确定性数据库,其可能世界空间为W ={w 1,w 2,…,w n }.假设根据分值函数 f ,可能世界w i 中在记录t 前的记录数记为()i w rank t ,则t 的Expected Rank 分值为,()()()i i iw w W t w r t p w rank t ∈∈=⋅∑. Expected Rank 按Expected Rank 分值排序记录(t 在未出现它的可能世界中()||i w i rank t w =).定义6(E -score rank ). 设D 是不确定性数据库,其可能世界空间为W ={w 1,w 2,…,w n }.假设根据分值函数f , 在可能世界w i 中计算t 的分值为()i w score t ,则t 的E-score 分值定义为,()()()i i iw w W t w e t p w score t ∈∈=⋅∑,E-Score Rank 按E-Score 分值排序记录.定义7(c -typical -Top K ). 
设D 是不确定性数据库,其可能世界空间为W ={w 1,w 2,…,w n },T ={T 1,T 2,…,T m },T i 是长度为k 的记录向量.如果T i 中k 个记录是根据某分值函数f 计算的分值排序序列且对应某些可能世界的前 k 个记录,对这些可能世界概率求和()()()i i w w T p T p w ∈=∑,并对T i 中所有记录求总f 分值()()ii t T s T f t ∈=∑形成分布S ,在S 中寻找最典型的c 个分值111{,...,}{,..,}{,...,}arg min [min ||]c c c s s s s i s s E S s =−,c -typical-Top K 返回典型分值中具有最大概率的记录向量()arg max (),1i i i s T s i T p T i c ==≤≤(分值典型性).从以上定义中可以看出,不确定性Top-K 查询中涉及的一个很重要的计算问题是可能世界概率,假设例1中属性值一列为根据某分值函数f 计算的得分(此处分值直接为温度值),我们在表8中给出每个可能世界的概率,并根据以上定义求各查询Top-2结果,见表9.Table 8 The possible worlds for the uncertain database in example 1表8 例1可能世界概率表 可能世界概率 可能世界 概率 w 1={t 1,t 2,t 5}p 1=0.028 w 7={t 2,t 5} p 7=0.042 w 2={t 1,t 2,t 3,t 6}p 2=0.048 w 8={t 2,t 3,t 6} p 8=0.072 w 3={t 1,t 2,t 4,t 5}p 3=0.14 w 9={t 2,t 4,t 5} p 9=0.21 w 4={t 1,t 2,t 6}p 4=0.012 w 10={t 2,t 6} p 10=0.018 w 5={t 1,t 2,t 3,t 5}p 5=0.112w 11={t 2,t 3,t 5}p 11=0.168w 6={t 1,t 2,t 4,t 6} p 6=0.06 w 12={t 2,t 4,t 6} p 12=0.09Table 9 The result sets of Top-2 by different semantics on the uncertain database in example 1表9 例1各查询Top-2结果表查询类型Top-2 Prob. (score) 可能世界空间 U-Top2t 1t 2 0.4 {w 1~6} U-2Rankst 2,t 2 0.6,0.4 {w 7~12}, {w 1~6} PT-2 (p =0.3)t 1,t 2,t 4 0.4,1,0.3 {w 1~6},{w 1~12},{w 9,w 12} Global-Top2t 2,t 1 1,0.4 {w 1~12},{w 1~6} Expected rank-2t 2,t 1 0.4,0.63 - E-score-2t 3,t 2 46,22 -1-typical-Top2 t 2t 4 0.3 -从表9的查询结果看,同样是Top-2查询,语义不同则查询结果迥然,这是因为每个查询语义的定义都是情景相关的.李文凤等:不确定性Top-K查询处理1549下面从各查询语义在可能世界中的兼容性、对f分值的依赖性及有序性这3个方面进行对比,见表10.我们发现:(1) 大部分不确定性Top-K查询都考虑了查询结果在可能世界中的兼容性问题.这个结论很显然也很合理,用户总是希望得到的结果集或结果序列是能够在同一可能世界共存的.而U-k Ranks更多地关注排在某位置的高概率记录,更多地用于单一记录查询而不是一组记录查询;(2) 大部分不确定性Top-K查询都不依赖排序函数f的分值.这说明不确定性Top-K查询从根本上关注的仍然是记录间的相对位置,除非分值的大小程度与记录重要性直接相关;(3) 大部分不确定性Top-K查询结果都是有序的.PT-K虽然无序且结果集也未必是k 个元素,但是Global-Top K在它的基础上以概率排序,仍然可以得到有序结果.U-k Ranks的无序性更进一步说明它更关注单一记录查询,实际上在第5节中我们可以看到,经过排序整合U-k Ranks也可以有序.Table 10Rank criterion comparison of various Top-k semantics on uncertain database表10不确定性Top-k查询语义对比表查询方式兼容性分值依赖性有序性U-Top k Y N YU-k Ranks N N YPT-k (p=0.3) Y N NGlobal-Top k Y N YExpected Rank Y N YE-score Y Y Y 1-typical-Top k Y Y Y 总之我们可以看到,不确定性Top-K查询总体上仍然关注记录的相对位置,并以有序的方式呈现查询结果.不同语义的差别关键点在概率和排序分值的平衡方式上.3.2 不确定性Top-K查询语义的数学性质从第3.1节可以看到,尽管大部分不确定性Top-K查询语义遵循了传统确定性Top-K的合理解释,但它们在形式化定义满足的数学性质上还存在很大差异,因此,Zhang和Cormode等人提出了一系列不确定性Top-K查询应该满足的数学性质[10,15,30,33],并分析证明了部分不确定性Top-K查询语义分别能满足的性质[33].性质1(exact K). 设R k是某不确定性Top-K查询的返回结果集,|D|≥k时,则|R k|=k.性质2(faithfulness). t1,t2∈D,如果t1的分值和概率都大于t2且t2∈R k,则t1∈R k.性质3(containment). 对于任何正整数k,R k⊂R k+1.性质4(unique ranking). 设r(i)是在结果集中位于i位置的记录ID,对于任何排序位置i,j,i≠j,则r(i)≠r(j).性质5(value invariance). 排序分值v1≤v2≤…≤v k,对应不确定性Top-K结果序列,在不改变分值相对位置的前提下,排序分值的大小不影响查询结果.性质6(stability). t i∈R k时,增大t i的排序分值或概率不会使t i∉R k;t i∉R k时,减小t i的排序分值或概率不会使t i∈R k.Table 11 Property comparison of various Top-K semantics on uncertain database表11不确定性Top-K查询语义满足的性质对比表k Faithfulness Containment Uni.Ranking Val.invariance Stability Semantics ExactU-Top k N Weak N Y Y YU-k Ranks Y N Y N Y NPT-k (p=0.3) N Weak Weak Y Y YGlobal-Top k Y Y N Y Y YExp. Rank Y Weak Y Y Y YE-score Y Y Y Y N Y 1-typical-Top k Y N N Y N N不确定性Top-K查询究竟应该满足哪些基本性质,目前来说是一个相对开放的问题,而不确定性Top-K查询语义的应用相关性更增加了该问题的难度,例如某些应用更关心排序分值的典型性[31],而某些应用更关心记录的位置概率等[9].总体上看,研究不确定性Top-K查询语义满足的数学性质作用主要有:1. 进一步规范不确定性Top-K语义的定义.在此项研究开展前,各学者基本上都是从各自面临的应用场1550 Journal of Software软件学报 V ol.23, No.6, June 2012景出发定义需要的语义,很少考虑该语义满足那些性质、需要满足那些性质.语义性质的研究,可以指导更为科学、规范、合理的不确定性Top-K语义定义;2. 
由于不确定性Top-K查询语义满足的性质反映了结果集的数学性质,对指导查询处理意义重大.例如在文献[34]中,由于Expected Rank具有包含性和稳定性,在处理滑动窗口的Top-K解集时,可以不保存整个窗口的记录,而仅仅通过维护含Top-K解集的最小子集——紧致集,实现连续的Top-K回答.因此,结合语义满足的数学性质,更有可能设计出高效的查询算法.4 不确定性Top-K查询的排序标准研究从第3节中可以看到,研究者在不确定性数据库上定义了许多新的Top-K查询语义,例如U-Top K,返回具有最大概率的Top-K记录向量;U-k Ranks,返回每个位置最大概率出现的那个记录;PT-K返回以概率p以上出现在Top-K的记录集,它们基本上都有特定的应用场景,语义的提出具有合理性.但它们返回的结果集又有什么特点呢?Li采用规范Kendall距离[14]对比同一数据集上5种不确定性Top-K返回的结果,发现各语义返回结果序列差异显著,有些结果甚至完全相反.最根本的原因是采用的排序标准不同,而在不确定性Top-K查询中,影响排序的因素主要有两个:排序函数分值和记录的概率.不确定性Top-K查询处理中,并没有研究者专门对排序标准进行研究,多是在讨论某查询处理方法时进行一些扩展和思考.本节从排序分值与概率的平衡技术、连续型分值排序技术以及统一化排序方法3个方面来探讨不确定性Top-K的排序标准.4.1 分值与概率的平衡技术按照对分值排序和概率的处理先后顺序不同,可以将分值与概率平衡方式划分为3类:第1类是先排序再求取概率,第2类是先求取概率再排序,第3类是同时综合考虑排序和概率.U-Top K在平衡分值与概率时采取的就是第1类方式.它首先将每个可能世界空间记录按排序分值排序,截取每个可能世界空间的前k个记录形成一个k长度排序序列,这些k长度排序序列与其所在的可能世界一样,是有一定存在概率的,找到拥有最大存在概率的k长度排序序列就找到了U-Top K的解.事实上,这种平衡的方式有很多弊端.例如文献[31]提出,U-Top K求得的Top-K序列往往概率很小.这点显而易见,所有可能世界概率和为1,k越大,Top-K序列可能情况就越多,每个可能的Top-K序列概率就越小.只按这种微小的概率差别来区分优劣,很多时候并不客观.因此,文献[31]中提出,在求得所有Top-K序列之后,概率并不应该成为唯一的衡量标准,分值应该重新纳入考虑范围,比如考虑Top-K向量总分值的典型性程度.第2类方式是先考虑记录在某位置的位置概率,再按概率在各位置最优分配的方式形成全排序(Top-K排序).文献[32]在文献[9]中位置概率U-i Ranks的基础上,提出依据Top-K位置上记录的位置概率总和最大的优化目标,用二分图匹配的方式找出最优全序.这种平衡方式也并不是没有缺点,比如:得到的序列有可能在任何可能世界都不存在;概率总和最大化的目标也许并不适用所有场景等.第3类方式是同时考虑排序和概率,比较典型的是Expected Rank和global Top K的处理方式.Expected Rank 是按记录在所有可能世界的期望排位来排序;而global-Top K则将分值序列和记录在Top-K的概率排序序列看成是两个待合并的排序列表,用传统合并排序的方式形成全序.不管是在形式化语义中还是在查询处理中,分值与概率的平衡都是不确定性Top-K查询的焦点问题.因此,合理适用的分值概率平衡方式在不确定性Top-K查询中至关重要.4.2 连续分值排序技术属性级不确定性数据库中可以存在连续型属性,当此属性是分值函数的参考依据时,就出现了排序分值是连续分布的情况.连续分值的存在直接挑战传统排序标准,因为作为传统排序标准的分值函数总能根据记录的唯一得分形成记录的全序,而连续分值形成的却是记录偏序[32].目前,解决偏序问题基本上都是采用Soliman在文献[32]中提到的概率偏序模型.如图3所示.。
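As a concrete illustration of the third balancing strategy above (and of the Expected Rank semantics from Section 3.1), the following brute-force sketch enumerates the possible worlds of the tuple-level example (c) with exclusion rules and computes each tuple's expected rank. It is only meant to make the semantics tangible; practical algorithms avoid enumerating the exponential world space.

```python
# Expected Rank by brute-force possible-world enumeration (example (c), Table 7).
from itertools import product

scores = {"t1": 50, "t2": 46, "t3": 45, "t4": 44, "t5": 15, "t6": 10}
probs  = {"t1": 0.4, "t2": 1.0, "t3": 0.4, "t4": 0.5, "t5": 0.7, "t6": 0.3}
rules  = [["t3", "t4"], ["t5", "t6"]]   # tuples in the same rule never co-occur

in_rule = {t for r in rules for t in r}
blocks = [[t] for t in scores if t not in in_rule] + rules

def options(block):
    """One block contributes at most one tuple to a world; if the block's
    probabilities sum to less than 1, contributing nothing is also possible."""
    opts = [(t, probs[t]) for t in block]
    slack = 1.0 - sum(p for _, p in opts)
    if slack > 1e-9:
        opts.append((None, slack))
    return opts

expected_rank = {t: 0.0 for t in scores}
for combo in product(*(options(b) for b in blocks)):
    present = [t for t, _ in combo if t is not None]
    p_w = 1.0
    for _, p in combo:
        p_w *= p
    ranked = sorted(present, key=lambda t: scores[t], reverse=True)
    for t in scores:
        # rank_w(t): number of tuples ahead of t if t is in w, |w| if t is missing
        r = ranked.index(t) if t in ranked else len(present)
        expected_rank[t] += p_w * r

for t, r in sorted(expected_rank.items(), key=lambda kv: kv[1]):
    print(t, round(r, 3))
```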
1R 算法名称Activation function 激励函数Adaptive classifier combination (ACC) 自适应分类器组合Adaptive 自适应Additive 可累加的Affinity analysis 亲和力分析Affinity 亲和力Agglomerative clustering 凝聚聚类Agglomerative 凝聚的Aggregate proximity relationship 整体接近关系Aggregate proximity 整体接近Aggregation hierarchy 聚合层次AGNES 算法名称AIRMA 集成的自回归移动平均Algorithm 算法Alleles 等位基因Alternative hypothesis 备择假设Approximation 近似Apriori 算法名称AprioriAll 算法名称Apriori-Gen 算法名称ARGen 算法名称ARMA 自回归移动平均Artificial intelligence (AI) 人工智能Artificial neural networks (ANN) 人工神经网络Association rule problem 关联规则问题Association rule/ Association rules 关联规则Association 关联Attribute-oriented induction 面向属性的归纳Authoritative 权威的Authority权威Autocorrelation coefficient 自相关系数Autocorrelation 自相关Autoregression 自回归Auto-regressive integrated moving average 集成的自回归移动平均Average link 平均连接Average 平均Averagelink 平均连接Backlink 后向链接back-percolation 回滤Backpropagation 反向传播backward crawling 后向爬行backward traversal 后向访问BANG 算法名称Batch gradient descent 批量梯度下降Batch 批量的Bayes Rule 贝叶斯规则Bayes Theorem 贝叶斯定理Bayes 贝叶斯Bayesian classification 贝叶斯分类BEA 算法名称Bias 偏差Binary search tree 二叉搜索树Bipolar activation function 双极激励函数Bipolar 双极BIRCH 算法名称Bitmap index 位图索引Bivariate regression 二元回归Bond Energy algorithm 能量约束算法Boosting 提升Border point 边界点Box plot 箱线图Boyer-Moore (BM) 算法名称Broker 代理b-tree b树B-tree B-树C4.5 算法名称C5 算法名称C5.0 算法名称CACTUS 算法名称Calendric association rule 日历关联规则Candidate 候选CARMA 算法名称CART 算法名称Categorical 类别的CCPD 算法名称CDA 算法名称Cell 单元格Center 中心Centroid 质心CF tree 聚类特征树CHAID 算法名称CHAMELEON 算法名称Characterization 特征化Chi squared automatic interaction detection 卡方自动交互检测Chi squared statistic 卡方统计量Chi squared 卡方Children 孩子Chromosome 染色体CLARA 算法名称CLARANS 算法名称Class 类别Classification and regression trees 分类和回归树Classification rule 分类规则Classification tree 分类树Classification 分类Click 点击Clickstream 点击流Clique 团Cluster mean 聚类均值Clustering feature 聚类特征Clustering problem 聚类问题,备选定义Clustering 聚类Clusters 簇Collaborative filtering 协同过滤Combination of multiple classifiers (CMC) 多分类器组合Competitive layer 竞争层Complete link 全连接Compression 压缩Concept hierarchy 概念层次Concept 概念Confidence interval 质心区间Confidence 置信度Confusion matrix 混淆矩阵Connected component 连通分量Contains 包含Context Focused Crawler (CFC) 上下文专用爬虫Context graph 上下文图Context layer 上下文层Contiguous subsequeace 邻接子序列Contingency table 列联表Continuous data 连续型数据Convex hull 凸包Conviction 信任度Core points 核心点Correlation coefficient r 相关系数rCorrelation pattern 相关模式Correlation rule 相关规则Correlation 相关Correlogram 相关图Cosine 余弦Count distribution 计数分配Covariance 协方差Covering 覆盖Crawler 爬虫CRH 算法名称Cross 交叉Crossover 杂交CURE 算法名称Customer-sequence 客户序列Cyclic association rules 循环关联规则Data bubbles 数据气泡Data distribution 数据分配Data mart 数据集市Data Mining Query Language (DMQL) 数据挖掘查询语言Data mining 数据挖掘Data model 数据模型Data parallelism 数据并行Data scrubbing 数据清洗Data staging 数据升级Data warehouse 数据仓库Database (DB) 数据库Database Management System 数据库管理系统Database segmentation 数据库分割DBCLASD 算法名称DBMS 算法名称DBSCAN 算法名称DDA算法名称,数据分配算法Decision support systems (DSS) 决策支持系统Decision tree build 决策树构造Decision tree induction 决策树归纳Decision tree model (DT model) 决策树模型Decision tree processing 决策树处理Decision tree 决策树Decision trees 决策树Delta rule delta规则DENCLUE 算法名称Dendrogram 谱系图Density-reachable 密度可达的Descriptive model 描述型模型Diameter 直径DIANA 算法名称dice 切块Dice 一种相似度量标准Dimension modeling 维数据建模Dimension table 维表Dimension 维Dimensionality curse 维数灾难Dimensionality reduction 维数约简Dimensions 维Directed acyclic graph (DAG) 有向无环图Direction relationship 方向关系Directly density-reachable 直接密度可达的Discordancy test 不一致性测试Dissimilarity measure 差别度量Dissimilarity 差别Distance based simple 基于距离简单的Distance measure 
距离度量Distance scan 距离扫描Distance 距离Distiller 提取器Distributed 分布式的Division 分割Divisive clustering 分裂聚类Divisive 分裂的DMA 算法名称Domain 值域Downward closed 向下封闭的Drill down 下钻Dynamic classifier selection (DCS) 动态分类器选择EM 期望最大值Encompassing circle 包含圆Entity 实体Entity-relationship data model 实体-关系数据模型Entropy 熵Episode rule 情节规则Eps-neighborhood ϵ-邻域Equivalence classes 等价类Equivalent 等价的ER data model ER数据模型ER diagram ER图Euclidean distance 欧几里得距离Euclidean 欧几里得Evaluation 评价Event sequence 事件序列Evolutionary computing 进化计算Executive information systems (EIS) 主管信息系统Executive support systems (ESS) 主管支持系统Exhaustive CHAID 穷尽CHAIDExpanded dimension table 扩张的维表Expectation-maximization 期望最大化Exploratory data analysis 探索性数据分析Extensible Markup Language 可扩展置标语言Extrinsic 外部的Fact table 事实表Fact 事实Fallout 错检率False negative (FN) 假反例False positive (FP) 假正例Farthest neighbor 最远邻居FEATUREMINE 算法名称Feedback 反馈Feedforward 前馈Finite state machine (FSM) 有限状态机Finite state recognizer 有限状态机识别器Firefly 算法名称Fires 点火Firing rule 点火规则Fitness function 适应度函数Fitness 适应度Flattened dimension table 扁平的维表Flattened 扁平的Focused crawler 专用爬虫Forecasting 预报Forward references 前向访问Frequency distribution 频率分布Frequent itemset 频率项目集Frequent 频率的Fuzzy association rule 模糊关联规则Fuzzy logic 模糊逻辑Fuzzy set 模糊集GA clustering 遗传算法聚类Gain 增益GainRatio 增益比率Gatherer 收集器Gaussian activation function 高斯激励函数Gaussian 高斯GDBSCAN 算法名称Gene 基因Generalization 泛化,一般化Generalized association rules 泛化关联规则Generalized suffix tree (GST) 一般化的后缀树Generate rules 生成规则Generating rules from DT 从决策树生成规则Generating rules from NN 从神经网络生成规则Generating rules 生成规则Generic algorithms 遗传算法Genetic algorithm 遗传算法Genetic algorithms (GA) 遗传算法Geographic Information Systems (GIS) 地理信息系统Gini 吉尼Gradient descent 梯度下降g-sequence g-序列GSP 一般化的序列模式Hard focus 硬聚焦Harvest rate 收获率Harvest 一个Web内容挖掘系统Hash tree 哈希树Heapify 建堆Hebb rule hebb规则Hebbian learning hebb学习Hidden layer 隐层Hidden Markov Model 隐马尔可夫模型Hidden node 隐节点Hierarchical classifier 层次分类器Hierarchical clustering 层次聚类Hierarchical 层次的High dimensionality 高维度Histogram 直方图HITS 算法名称Hmm 隐马尔可夫模型HNC Risk Suite 算法名称HPA 算法名称Hub 中心Hybrid Distribution (HD) 混合分布Hybrid OLAP (HOLAP) 混合型联机分析处理Hyper Text Markup Language 超文本置标语言Hyperbolic tangent activation function 双曲正切激励函数Hyperbolic tangent 双曲正切Hypothesis testing 假设检验ID3 算法名称IDD 算法名称Inverse document frequency(IDF) 文档频率倒数Image databases 图像数据库Incremental crawler 增量爬虫Incremental gradient descent 增量梯度下降Incremental rules 增量规则Incremental updating 增量更新Incremental 增量的Individual 个体Induction 归纳Information gain 信息增益Information retrieval (IR) 信息检索Information 信息Informational data 情报数据Input layer 输入层Input node 输入节点Integration 集成Interconnections 相互连接Interest 兴趣度Interpretation 解释Inter-transaction association rules 事务间关联规则Inter-transaction 事务之间Intra-transaction association rules 事务内关联规则Intra-transaction 事务之内Intrinsic 内部的Introduction 引言IR 算法名称Isothetic rectangle isothetic矩形Issues 问题Itemset 项目集Iterative 迭代的JaccardJaccard’s coefficient Jaccard系数Jackknife estimate 折叠刀估计Java Data Mining (JDM) Java数据挖掘Join index 连接索引K nearest neighbors (KNN) K最近邻K-D tree K-D树KDD object KDD对象KDD process KDD过程Key 键K-means K-均值K-Medoids K-中心点K-Modes K-模KMP 算法名称Knowledge and data discovery management system (KDDMS) 知识与数据发现管理系统Knowledge discovery in databases (KDD) 数据库知识发现Knowledge discovery in spatial databases 空间数据库知识发现Knuth-Morris-Pratt algorithm 算法名称Kohonen self organizing map Kohonen自组织映射k-sequence K-序列Lag 时滞Large itemset property 大项集性质Large itemset 大项集Large reference sequence 强访问序列Large sequence property 大序列性质Large 大Learning parameter 学习参数Learning rule 学习准则Learning 学习Learning-rate 学习率Least 
squares estimates 最小二乘估计Levelized dimension table 层次化维表Lift 作用度Likelihood 似然Linear activation function 线性激励函数Linear discriminant analysis (LDA) 线性判别分析Linear filter 线性滤波器Linear regression 线性回归Linear 线性Linear 线性的Link analysis 连接分析Location 位置LogisticLogistic regression logistic回归Longest common subseries 最长公共子序列Machine learning 机器学习Major table 主表Manhattan distance 曼哈顿距离Manhattan 曼哈顿Map overlay 地图覆盖Market basket analysis 购物篮分析Market basket 购物篮Markov Model (MM) 马尔可夫模型Markov Property 马尔可夫性质Maximal flequent forward sequences 最长前向访问序列Maximal forward reference 最长前向访问Maximal reference sequences 最长访问序列Maximum likelihood estimate (MLE) 极大似然估计MBR 最小边界矩形Mean squared error (MSE) 均方误差Mean squared 均方Mean 均值Median 中值Medoid 中心点Merged context graph 合并上下文图Method of least squares 最小二乘法Metric 度量Minimum bounded rectangle 最小边界矩阵Minimum item supports 最小项目支持度Minimum Spanning Tree algorithm 最小生成树算法Minimum Spanning Tree (MST) 最小生成树Minor table 副表MinPts 输入参数名称MINT 一种网络查询语言MISapriori 算法名称Mismatch 失配Missing data 缺失数据Mode 模Momentum 动量Monothetic 单一的Moving average 移动平均Multidimensional Database (MDD) 多维数据库Multidimensional OLAP (MOLAP) 多维OLAP Multilayer perceptron (MLP) 多层感知器Multimedia data 多媒体数据Multiple Layered DataBase (MLDB) 多层数据库Multiple linear regression 多元线性回归Multiple-level association rules 多层关联规则Mutation 变异Naïve Bayes 朴素贝叶斯Nearest hit 同类最近Nearest miss 异类最近Nearest Neighbor algorithm 最近邻算法Nearest neighbor query 最近邻查询Nearest neighbor 最近邻Nearest Neighbors 最近邻Negative border 负边界Neighborhood graph 近邻图Neighborhood 邻居Neural network (NN) 神经网络Neural network model (NN model) 神经网络模型Neural networks 神经网络Noise 噪声Noisy data 噪声数据Noncompetitive learning 非竞争性学习Nonhierarchical 非层次的Nonlinear regression 非线性回归Nonlinear 非线性的Nonparametric model 非参数模型Nonspatial data dominant generalization 以非空间数据为主的一般化Nonspatial hierarchy 非空间层次Nonstationary 非平稳的Normalized dimension table 归一化维表NSD CLARANS 算法名称Null hypothesis 空假设OAT 算法名称Observation probability 观测概率OC curve OC曲线Ockham’s razor 奥卡姆剃刀Offline gradient descent 离线梯度下降Offline 离线Offspring 子孙OLAP 联机分析处理Online Analytic Processing 在线梯度下降Online gradient descent 在线梯度下降Online transaction processing (OLTP) 联机事务处理Online 在线Operational characteristic curve 操作特征曲线Operational data 操作型数据OPTICS 算法名称OPUS 算法名称Outlier detection 异常点检测Outlier 异常点Output layer 输出层Output node 输出结点Overfitting 过拟合Overlap 重叠Page 页面PageRank 算法名称PAM 算法名称Parallel algorithms 并行算法Parallel 并行的Parallelization 并行化Parametric model 参数模型Parents 双亲Partial-completeness 部分完备性Partition 分区Partitional clustering 基于划分的聚类Partitional MST 划分MST算法Partitional 划分的Partitioning Around Medoids 围绕中心点的划分Partitioning 划分Path completion 路径补全Pattern detection 模式检测Pattern discovery 模式发现Pattern matching 模式匹配Pattern Query Language (PQL) 模式查询语言Pattern recognition 模式识别Pattern 模式PDM 算法名称Pearson’s r 皮尔逊系数rPerceptron 感知器Performance measures 性能度量Performance 性能Periodic crawler 周期性爬虫Personalization 个性化Point estimation 点估计PolyAnalyst 附录APolythetic 多的Population 种群Posterior probability 后验概率Potentially large 潜在大的Precision 查准率Predicate set 谓词集合Prediction 预测Predictive model 预测型模型Predictive Modeling Mark-Up Language (PMML) 预测模型置标语言Predictor 预测变量Prefix 前缀Preprocessing 预处理Prior probability 先验概率PRISM 算法名称Privacy 隐私Processing element function 处理单元函数Processing elements 处理单元Profile association rule (PAR) 简档关联规则Profiling 描绘Progressive refinement 渐进求精Propagation 传播Pruning 剪枝Quad tree 4叉树Quantitative association rule 数量关联规则Quartiles 4分位树Query language 查询语言Querying 查询QUEST 算法名称R correlation coefficient r相关系数Radial basis function (RBF) 径向基函数Radial function 径向函数Radius 半径RainForest 算法名称Range query 范围查询Range 
全距Rank sink 排序沉没Rare item problem 稀疏项目问题Raster 光栅Ratio rule 比率规则RBF network 径向基函数网络Recall 召回率Receiver operating characteristic curve 接受者操作特征曲线Recurrent neural network 递归神经网络Referral analysis 推荐分析Region query 区域查询Regression coefficients 回归稀疏Regression 回归Regressor 回归变量Related concepts 相关概念Relation 关系Relational algebra 关系代数Relational calculus 关系计算Relational model 关系模型Relational OLAP 关系OLAPRelationship 关系Relative operating characteristic curve 相对操作特征曲线Relevance 相关性Relevant 相关的Reproduction 复制Response 响应Return on investment 投资回报率RMSE 均方根误差RNN 递归神经网络Robot 机器人ROC curve ROC曲线ROCK algorithm ROCK算法ROI 投资回报率ROLAP 关系型联机分析处理Roll up 上卷Root mean square error 均方根误差Root mean square (Rms) 均方根Roulette wheel selection 轮盘赌选择R-tree R树Rule extraction 规则抽取Rules 规则Sampling 抽样SAND 算法名称Satisfy 满足Scalability 可伸缩性Scalable parallelizable induction of decision trees 决策树的可伸缩并行归纳Scatter diagram 散点图Schema 模式SD CLARANS 算法名称,空间主导的Search engine 搜索引擎Search 搜索Seed URL 种子URLSegmentation 分割Segments 片段Selection 选择Self organizing feature map (SOFM) 自组织特征映射Self organizing map (SOM) 自组织映射Self organizing neural networks 自组织神经网络Self organizing 自组织Semantic index 语义索引Sequence association rule problem 序列关联规则问题Sequence association rule 序列关联规则Sequence association rules 序列关联规则Sequence classifier 序列分类器Sequence discovery 序列发现Sequence 时间序列Sequential analysis 序列分析Sequential pattern 序列模式Sequential patterns 序列模式Serial 单行的Session 会话Set 集合SGML 一种置标语言Shock 冲击Sigmoid activation function S型激励函数Sigmoid S型的Silhouette coefficient 轮廓系数Similarity measure 相似性度量Similarity measures 相似性度量Similarity 相似性Simple distance based 简单基于距离的Simultaneous 同时的Single link 单连接Slice 切片Sliding window 滑动窗口SLIQ 算法名称Smoothing 平滑Snapshot 快照Snowflake schema 雪花模式Soft focus 软聚焦SPADE 算法名称Spatial Association Rule 空间关联规则Spatial association rules 空间关联规则Spatial characteristic rules 空间特征曲线Spatial clustering 空间聚类Spatial data dominant generalization 以空间数据为主的一般化Spatial data mining 空间数据挖掘Spatial data 空间数据Spatial database 空间数据库Spatial Decision Tree 空间决策树Spatial discriminant rule 空间数据判别规则Spatial hierarchy 空间数据层次Spatial join 空间连接Spatial mining 空间数据挖掘Spatial operator 空间运算符Spatial selection 空间选择Spatial-data-dominant 空间数据主导Spider 蜘蛛Splitting attributes 分裂属性Splitting predicates 分裂谓语Splitting 分裂SPRINT 算法名称SQL 结构化查询语言Squared Error algorithm 平方误差算法Squared error 平方误差Squashing function 压缩函数Standard deviation 标准差Star schema 星型模式Stationary 平稳的Statistical inference 统计推断Statistical significance 统计显著性Statistics 统计学Step activation function 阶跃激励函数Step 阶跃Sting build 算法名称STING 算法名称Strength 强度String to String Conversion 串到串转换Subepisode 子情节Subsequence 子序列Subseries 子序列Subtree raising 子树上升Subtree replacement 子树替代Suffix tree 后缀树Summarization 汇总Supervised learning 有指导的学习Support 支持度SurfAid Analytics 附录ASurprise 惊奇度Targeting 瞄准Task parallelism 任务并行Temporal association rules 时序关联规则Temporal database 时序数据库Temporal mining 时序数据挖掘Temporal 时序Term frequency (TF) 词频Thematic map 主题地图Threshold activation function 阈值激励函数Threshold 阈值Time constraint 时间约束Time line 大事记Time series analysis 时间序列分析Time series 时间序列Topological relationship 拓扑关系Training data 训练数据Transaction time 事务时间Transaction 事务Transformation 变换Transition probability 转移概率Traversal patterns 浏览模式Trend dependency 趋势依赖Trend detection 趋势检测Trie 一种数据结构True negative (TN) 真反例True positive (TP) 真正例Unbiased 无偏的Unipolar activation function 单极激励函数Unipolar 单级的Unsupervised learning 无指导学习Valid time 有效时间Variance 方差Vector 向量Vertical fragment 纵向片段Virtual warehouse 虚拟数据仓库Virtual Web View (VWV) 虚拟Web视图Visualization 可视化V oronoi diagram V oronoi图V oronoi polyhedron V oronoi多面体WAP-tree WAP树WaveCluster 
算法名称Wavelet transform 小波变换Web access patterns Web访问模式Web content mining Web内容挖掘Web log Web日志Web mining Web挖掘Web usage mining Web使用挖掘Web Watcher 一种方法WebML 一种Web挖掘查询语言White noise 白噪声WordNet Semantic NetworkWordNet 一个英语词汇数据库。
A Nodeset-Based Top-k Frequent Pattern Mining Algorithm (基于节点集Top-k频繁模式挖掘算法)
孙俊, 张曦煌 (School of Internet of Things Engineering, Jiangnan University, Wuxi, Jiangsu 214122, China)

Abstract: The number of mined frequent patterns is usually very large, yet only a small fraction of them is used in real applications. Top-rank-k frequent pattern mining therefore limits the number of mined patterns by ranking them by frequency, which improves mining efficiency. This paper proposes TPN (Top-k Patterns based on Nodesets), an algorithm for mining top-k frequent patterns. TPN uses a new data structure, Nodesets, to represent patterns; it compresses the data into a POC-tree and recomputes the minimum support through a top-k-rank table to limit the number of candidate patterns generated. Experiments comparing TPN with the ATFP and Top-k-FP-growth algorithms in terms of mining time on two datasets show that TPN is faster and more efficient.

Journal: Computer Engineering and Applications, 2017, 53(6): 101-105. Keywords: data mining; top-k; frequent pattern; Nodeset. Language: Chinese. CLC: TP301.6

Association rule mining is an important branch of data mining research. Its main goal is to discover relationships among different items in a transactional database; the discovered rules can reflect customers' behaviour patterns and thus serve as a basis for business decisions.
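The abstract above says the data are first compressed into a POC-tree. The sketch below builds a generic frequency-ordered prefix tree of the same family (FP-tree-like); it is our own simplification with invented transactions, not the paper's POC-tree or Nodeset implementation.

```python
# A generic frequency-ordered prefix tree: transactions that share a prefix
# of frequent items share tree nodes, which is the compression step that
# POC-/FP-style structures rely on.
from collections import Counter

class Node:
    def __init__(self, item):
        self.item = item
        self.count = 0
        self.children = {}

def build_prefix_tree(transactions, min_support=1):
    freq = Counter(i for t in transactions for i in set(t))
    root = Node(None)
    for t in transactions:
        # keep sufficiently frequent items, ordered by descending global frequency
        items = sorted((i for i in set(t) if freq[i] >= min_support),
                       key=lambda i: (-freq[i], i))
        node = root
        for i in items:
            if i not in node.children:
                node.children[i] = Node(i)
            node = node.children[i]
            node.count += 1
    return root

def dump(node, depth=0):
    if node.item is not None:
        print("  " * depth + f"{node.item}:{node.count}")
    for child in node.children.values():
        dump(child, depth + 1)

tree = build_prefix_tree([["a", "b", "c"], ["a", "b"], ["a", "c", "d"], ["b", "c"]],
                         min_support=2)
dump(tree)
```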
Advanced substitute words commonly used in writing

important [im'pɔːt(ə)nt]
- vital ['vait(ə)l]: She had found out some information of vital importance.
- crucial ['kruːʃ(ə)l]: It is crucial that the problem is tackled immediately.
- prominent ['prɒminənt]: This considerable increase in investment played a prominent role in fueling economic growth.
- cardinal ['kɑːd(i)n(ə)l]: Respect for life is a cardinal principle of English law.

good [gʊd]
- excellent ['eks(ə)l(ə)nt]: She has always had a high reputation for her excellent short stories.
- outstanding [aʊt'stændiŋ]: The girl who won the scholarship was quite outstanding.
- extraordinary [ik'strɔːdnri]: Her strength of will was extraordinary.
- remarkable [ri'mɑːkəb(ə)l]: The economic diplomacy of China is characterized by distinctive features and remarkable achievements.

interesting ['int(ə)ristiŋ]
- amusing [ə'mjuːziŋ]: Do not hesitate to laugh at anything you find amusing.
Thesis Proposal: Optimization and Database Implementation of the Top-k Probabilistic Frequent Co-location Pattern Mining Algorithm

1. Background

Today, with the arrival of the big-data era, the demand for mining and analysing data keeps growing.
Among these tasks, frequent pattern mining is an important branch of data mining.
Frequent pattern mining refers to finding sets of items that frequently occur together in a transactional database.
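As a toy illustration of this definition (with made-up transactions): an itemset's support is simply the number of transactions that contain all of its items.

```python
# Counting the support of an itemset in a small, invented transactional database.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "beer"},
    {"milk", "bread", "beer"},
]

def support(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t)

print(support({"milk", "bread"}, transactions))  # 3 -> {milk, bread} co-occurs frequently
```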
However, the efficiency of traditional frequent pattern mining algorithms has become a bottleneck.
Therefore, improving the efficiency of frequent pattern mining algorithms has become an important research direction.
This thesis is based on the Top-k probabilistic frequent co-location pattern mining algorithm.
The Top-k algorithm can quickly mine the k highest-ranked probabilistic frequent patterns and is currently one of the stronger frequent pattern mining algorithms.
Its core idea is to exploit the characteristics of interacting and non-interacting items for effective pruning, thereby reducing the size of the candidate set.
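The following hedged sketch shows the generic form of this pruning idea for plain frequent itemsets rather than probabilistic co-locations: keep the k best patterns found so far in a min-heap, treat the k-th support as a rising minimum-support threshold, and refuse to grow candidates that fall below it (sound because support is anti-monotone). It illustrates the pruning principle only, not the thesis algorithm; ties at the k-th place are broken arbitrarily and pattern length is capped for brevity.

```python
# Top-k frequent itemset mining with a rising support threshold (illustrative only).
import heapq
from itertools import combinations

def top_k_frequent_itemsets(transactions, k, max_size=3):
    transactions = [frozenset(t) for t in transactions]
    heap = []                                   # min-heap of (support, itemset), size <= k

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    def threshold():
        # the k-th best support seen so far; 1 until the heap is full
        return heap[0][0] if len(heap) == k else 1

    frontier = [frozenset([i]) for i in sorted({i for t in transactions for i in t})]
    size = 1
    while frontier and size <= max_size:
        surviving = []
        for cand in frontier:
            s = support(cand)
            if s < threshold():
                continue                        # pruned: no superset can do better
            entry = (s, tuple(sorted(cand)))
            if len(heap) < k:
                heapq.heappush(heap, entry)
            elif s > heap[0][0]:
                heapq.heapreplace(heap, entry)
            surviving.append(cand)
        # candidates one item larger, grown only from unpruned patterns
        frontier = list({a | b for a, b in combinations(surviving, 2)
                         if len(a | b) == size + 1})
        size += 1
    return sorted(heap, reverse=True)

transactions = [["a", "b", "c"], ["a", "b"], ["a", "c", "d"], ["b", "c"], ["a", "b", "d"]]
print(top_k_frequent_itemsets(transactions, k=5))
```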
This thesis will focus on optimizing this algorithm and on its database implementation.

2. Research Objectives

(1) Optimize the Top-k probabilistic frequent co-location pattern mining algorithm to improve its efficiency and accuracy.
(2) Implement the Top-k algorithm in a database so that it can be applied in real production environments.
(3) Investigate the application of the Top-k algorithm in data mining and its practical effect.

3. Research Content and Methods

Research content:
(1) Optimize the Top-k algorithm, including improving the pruning strategy and the way candidate sets are generated.
(2) Design and implement the database, covering the database schema and the implementation of the Top-k algorithm inside the database (a minimal storage sketch follows this list).
(3) Test the Top-k algorithm on real data sets and evaluate its performance.
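For research-content item (2), one possible storage layout, purely an assumption rather than the thesis design, is a single relational table holding the mined top-k patterns, shown here with Python's built-in sqlite3 module. Table and column names are invented for illustration.

```python
# Persisting mined top-k patterns in a relational table (illustrative schema).
import sqlite3

conn = sqlite3.connect("topk_patterns.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS topk_pattern (
        rank     INTEGER PRIMARY KEY,   -- 1..k, ordered by support/probability
        pattern  TEXT NOT NULL,         -- e.g. comma-separated item or feature ids
        support  REAL NOT NULL          -- frequency or expected support
    )
""")

mined = [(1, "A,B", 0.92), (2, "A,C", 0.87), (3, "B,C,D", 0.81)]  # toy results
conn.executemany(
    "INSERT OR REPLACE INTO topk_pattern (rank, pattern, support) VALUES (?, ?, ?)",
    mined,
)
conn.commit()

for row in conn.execute("SELECT * FROM topk_pattern ORDER BY rank"):
    print(row)
conn.close()
```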
Research methods:
(1) Study the problems of the current Top-k algorithm in depth and identify its bottlenecks.
(2) Optimize the algorithm based on this problem analysis.
(3) Design and implement the database, including the database schema and the implementation of the Top-k algorithm.
(4) Test the algorithm on different data sets, then evaluate and improve it.

4. Expected Outcomes

(1) An optimized Top-k probabilistic frequent co-location pattern mining algorithm.
(2) A database implementation, including the database schema and the algorithm itself.
(3) Test results and an evaluation of the algorithm's performance on different data sets.
(4) Possibly, a further exploration of how the Top-k algorithm can be applied in real production environments.

5. Possible Innovations and Difficulties

Innovations:
(1) An optimization scheme targeted at the Top-k algorithm.