Cluster Analysis:Basic Concepts and Methods
- 格式:ppt
- 大小:4.16 MB
- 文档页数:96
1聚类分析内涵1.1聚类分析定义聚类分析(Cluste.Analysis)是一组将研究对象分为相对同质的群组(clusters)的统计分析技术.也叫分类分析(classificatio.analysis)或数值分类(numerica.taxonomy), 它是研究(样品或指标)分类问题的一种多元统计方法, 所谓类, 通俗地说, 就是指相似元素的集合。
聚类分析有关变量类型:定类变量,定量(离散和连续)变量聚类分析的原则是同一类中的个体有较大的相似性, 不同类中的个体差异很大。
1.2聚类分析分类聚类分析的功能是建立一种分类方法, 它将一批样品或变量, 按照它们在性质上的亲疏、相似程度进行分类.聚类分析的内容十分丰富, 按其聚类的方法可分为以下几种:(1)系统聚类法: 开始每个对象自成一类, 然后每次将最相似的两类合并, 合并后重新计算新类与其他类的距离或相近性测度. 这一过程一直继续直到所有对象归为一类为止. 并类的过程可用一张谱系聚类图描述.(2)调优法(动态聚类法): 首先对n个对象初步分类, 然后根据分类的损失函数尽可能小的原则对其进行调整, 直到分类合理为止.(3)最优分割法(有序样品聚类法): 开始将所有样品看成一类, 然后根据某种最优准则将它们分割为二类、三类, 一直分割到所需的K类为止. 这种方法适用于有序样品的分类问题, 也称为有序样品的聚类法.(4)模糊聚类法: 利用模糊集理论来处理分类问题, 它对经济领域中具有模糊特征的两态数据或多态数据具有明显的分类效果.(5)图论聚类法: 利用图论中最小支撑树的概念来处理分类问题, 创造了独具风格的方法.(6)聚类预报法:利用聚类方法处理预报问题, 在多元统计分析中, 可用来作预报的方法很多, 如回归分析和判别分析. 但对一些异常数据, 如气象中的灾害性天气的预报, 使用回归分析或判别分析处理的效果都不好, 而聚类预报弥补了这一不足, 这是一个值得重视的方法。
聚类分析(cluster analysis)medical aircraftClustering analysis refers to the grouping of physical or abstract objects into a class consisting of similar objects. It is an important human behavior. The goal of cluster analysis is to classify data on a similar basis. Clustering comes from many fields, including mathematics, computer science, statistics, biology and economics. In different applications, many clustering techniques have been developed. These techniques are used to describe data, measure the similarity between different data sources, and classify data sources into different clusters.CatalogconceptMainly used in businessOn BiologyGeographicallyIn the insurance businessOn Internet applicationsIn E-commerceMain stepsCluster analysis algorithm conceptMainly used in businessOn BiologyGeographicallyIn the insurance businessOn Internet applicationsIn E-commerceMain stepsClustering analysis algorithmExpand the concept of editing this paragraphThe difference between clustering and classification is that the classes required by clustering are unknown. Clustering is a process of classifying data into different classes or clusters, so objects in the same cluster have great similarity, while objects between different clusters have great dissimilarity. From a statistical point of view, clustering analysis is a way to simplify data through data modeling. Traditional statistical clustering analysis methods include system clustering method, decomposition method, adding method, dynamic clustering method, ordered sample clustering,overlapping clustering and fuzzy clustering, etc.. Cluster analysis tools, such as k- mean and k- center point, have been added to many famous statistical analysis packages, such as SPSS, SAS and so on. From the point of view of machine learning, clusters are equivalent to hidden patterns. Clustering is an unsupervised learning process for searching clusters. Unlike classification, unsupervised learning does not rely on predefined classes or class labeled training instances. Automatic marking is required by clustering learning algorithms, while instances of classification learning or data objects have class tags. Clustering is observational learning, not sample learning. From the point of view of practical application, clustering analysis is one of the main tasks of data mining. Moreover, clustering can be used as an independent tool to obtain the distribution of data, to observe the characteristics of each cluster of data, and to concentrate on the analysis of specific cluster sets. Clustering analysis can also be used as a preprocessing step for other algorithms (such as classification and qualitative inductive algorithms).Edit the main application of this paragraphCommerciallyCluster analysis is used to identify different customer groups and to characterize different customer groups through the purchase model. Cluster analysis is an effective tool for market segmentation. It can also be used to study consumer behavior, to find new potential markets, to select experimental markets, and to be used as a preprocessing of multivariate analysis.On BiologyCluster analysis is used to classify plants and plants and classify genes so as to get an understanding of the inherent structure of the populationGeographicallyClustering can help the similarity of the databases that are observed in the earthIn the insurance businessCluster analysis uses a high average consumption to identify groups of car insurance holders, and identifies a city's property groups based on type of residence, value, locationOn Internet applicationsCluster analysis is used to categorize documents online to fix informationIn E-commerceA clustering analysis is a very important aspect in the construction of Web Data Mining in electronic commerce, through clustering with similar browsing behavior of customers, and analyze the common characteristics of customers, help the users of e-commerce can better understand their customers, provide more suitable services to customers.Edit the main steps of this paragraph1. data preprocessing,2. defines a distance function for measuring similarity between data points,3. clustering or grouping, and4. evaluating output. Data preprocessing includes the selection of number, types and characteristics of the scale, it relies on the feature selection and feature extraction, feature selection important feature, feature extraction feature transformation input for a new character, they are often used to obtain an appropriate feature set to avoid the "cluster dimension disaster" data preprocessing, including outlier removal data, outlier is not dependent on the general data or model data, so the outlier clustering results often leads to a deviation, so in order to get the correct clustering, we must eliminate them. Now that is similar to the definition of a class based, so different data in the same measure of similarity feature space for clustering step is very important, because the diversity of types and characteristics of the scale, the distance measure must be cautious, it often depends on the application, for example,Usually by definition in the feature space distance metric to evaluate the differences of the different objects, many distance are applied in different fields, a simple distance measure, Euclidean distance, are often used to reflect the differences between different data, some of the similarity measure, such as PMC and SMC, to the concept of is used to characterize different data similarity in image clustering, sub image error correction can be used to measure the similarity of two patterns. The data objects are divided into differentclasses is a very important step, data based on different methods are divided into different classes, classification method and hierarchical method are two main methods of clustering analysis, classification methods start from the initial partition and optimization of a clustering criterion. Crisp Clustering, each data it belonged to a separate class; Fuzzy Clustering, each data it could be in any one class, Crisp Clustering and Fuzzy Clusterin are the two main technical classification method, classification method of clustering is divided to produce a series of nested a standard based on the similarity measure, it can or a class separability for merging and splitting is similar between the other clustering methods include density based clustering model, clustering based on Grid Based clustering. To evaluate the quality of clustering results is another important stage, clustering is a management program, there is no objective criteria to evaluate the clustering results, it is a kind of effective evaluation, the index of general geometric properties, including internal separation between class and class coupling, the quality is generally to evaluate the clustering results, effective index in the determination of the number of the class is often played an important role, the best value of effective index is expected to get from the real number, a common class number is decided to select the optimum values for a particular class of effective index, is the the validity of the standard index the real number of this index can, many existing standards for separate data set can be obtained very good results, but for the complex number According to a collection, it usually does not work, for example, for overlapping classes of collections.Edit this section clustering analysis algorithmClustering analysis is an active research field in data mining, and many clustering algorithms are proposed. Traditional clustering algorithms can be divided into five categories: partitioning method, hierarchical method, density based method, grid based method and model-based method. The 1 division method (PAM:PArtitioning method) first create the K partition, K is the number of partition to create; and then use a circular positioning technology through the object from a division to another division to help improve the quality of classification. Including the classification of typical: K-means, k-medoids, CLARA (Clustering LARge Application), CLARANS (Clustering Large Application based upon RANdomized Search). FCM 2 level (hierarchical method) method to create a hierarchical decomposition of the given data set. The method can be divided into two operations: top-down (decomposition) and bottom-up (merging). In order to make up for the shortcomings of decomposition and merging, hierarchical merging is often combined with other clustering methods, such as cyclic localization. This includes the typical methods of BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) method, it firstly set the tree structure to divide the object; then use other methods to optimize the clustering. CURE (Clustering, Using, REprisentatives) method, which uses fixed numbers to represent objects to represent the corresponding clustering, and then shrinks the clusters according to the specified amount (to the clustering center). ROCK method, it uses the connection between clusters to cluster and merge. CHEMALOEN method, it constructs dynamic model in hierarchical clustering. 3 density based method, according to the density to complete the object clustering. It grows continuouslyaccording to the density around the object (such as DBSCAN). The typical density based methods include: DBSCAN(Densit-based Spatial Clustering of Application with Noise): the algorithm by growing enough high density region to clustering; clustering can find arbitrary shape from spatial databases with noise in. This method defines a cluster as a set of point sets of density connectivity. OPTICS (Ordering, Points, To, Identify, the, Clustering, Structure): it does not explicitly generate a cluster, but calculates an enhanced clustering order for automatic interactive clustering analysis.. 4 grid based approach,Firstly, the object space is divided into finite elements to form a grid structure, and then the mesh structure is used to complete the clustering. STING (STatistical, INformation, Grid) is a grid based clustering method that uses the statistical information stored in the grid cell. CLIQUE (Clustering, In, QUEst) and Wave-Cluster are a combination of grid based and density based methods. 5, a model-based approach, which assumes the model of each cluster, and finds data appropriate for the corresponding model. Typical model-based methods include: statistical methods, COBWEB: is a commonly used and simple incremental concept clustering method. Its input object is represented by a symbolic quantity (property - value) pair. A hierarchical cluster is created in the form of a classification tree. CLASSIT is another version of COBWEB. It can incrementally attribute continuous attributes. For each node of each property holds the corresponding continuous normal distribution (mean and variance); and the use of an improved classification ability description method is not like COBWEB (value) and the calculation of discrete attributes but theintegral of the continuous attributes. However, CLASSIT methods also have problems similar to those of COBWEB. Therefore, they are not suitable for clustering large databases. Traditional clustering algorithms have successfully solved the clustering problem of low dimensional data. However, due to the complexity of data in practical applications, the existing algorithms often fail when dealing with many problems, especially for high-dimensional data and large data. Because traditional clustering methods cluster in high-dimensional data sets, there are two main problems. The high dimension data set the existence of a large number of irrelevant attributes makes the possibility of the existence of clusters in all the dimensions of almost zero; to sparse data distribution data of low dimensional space in high dimensional space, which is almost the same distance between the data is a common phenomenon, but the traditional clustering method is based on the distance from the cluster, so high dimensional space based on the distance not to build clusters. High dimensional clustering analysis has become an important research direction of cluster analysis. At the same time, clustering of high-dimensional data is also the difficulty of clustering. With the development of technology makes the data collection becomes more and more easily, cause the database to larger scale and more complex, such as trade transaction data, various types of Web documents, gene expression data, their dimensions (attributes) usually can reach hundreds of thousands or even higher dimensional. However, due to the "dimension effect", many clustering methods that perform well in low dimensional data space can not obtain good clustering results in high-dimensional space. Clustering analysis of high-dimensional data is a very active field in clustering analysis, and it is also a challenging task. Atpresent, cluster analysis of high-dimensional data is widely used in market analysis, information security, finance, entertainment, anti-terrorism and so on.。