Cluster Analysis—Basic Concepts and Algorithms

Cluster analysis divides data into groups (clusters) that are meaningful, useful, or both. If meaningful groups are the goal, then the clusters should capture the natural structure of the data. In some cases, however, cluster analysis is only a useful starting point for other purposes, such as data summarization. Whether for understanding or utility, cluster analysis has long played an important role in a wide variety of fields: psychology and other social sciences, biology, statistics, pattern recognition, information retrieval, machine learning, and data mining.

There have been many applications of cluster analysis to practical problems. We provide some specific examples, organized by whether the purpose of the clustering is understanding or utility.

Clustering for Understanding. Classes, or conceptually meaningful groups of objects that share common characteristics, play an important role in how people analyze and describe the world. Indeed, human beings are skilled at dividing objects into groups (clustering) and assigning particular objects to these groups (classification). For example, even relatively young children can quickly label the objects in a photograph as buildings, vehicles, people, animals, plants, etc. In the context of understanding data, clusters are potential classes and cluster analysis is the study of techniques for automatically finding classes. The following are some examples:

• Biology. Biologists have spent many years creating a taxonomy (hierarchical classification) of all living things: kingdom, phylum, class, order, family, genus, and species. Thus, it is perhaps not surprising that much of the early work in cluster analysis sought to create a discipline of mathematical taxonomy that could automatically find such classification structures. More recently, biologists have applied clustering to analyze the large amounts of genetic information that are now available. For example, clustering has been used to find groups of genes that have similar functions.

• Information Retrieval. The World Wide Web consists of billions of Web pages, and the results of a query to a search engine can return thousands of pages. Clustering can be used to group these search results into a small number of clusters, each of which captures a particular aspect of the query. For instance, a query of "movie" might return Web pages grouped into categories such as reviews, trailers, stars, and theaters. Each category (cluster) can be broken into subcategories (sub-clusters), producing a hierarchical structure that further assists a user's exploration of the query results.

• Climate. Understanding the Earth's climate requires finding patterns in the atmosphere and ocean. To that end, cluster analysis has been applied to find patterns in the atmospheric pressure of polar regions and areas of the ocean that have a significant impact on land climate.

• Psychology and Medicine. An illness or condition frequently has a number of variations, and cluster analysis can be used to identify these different subcategories. For example, clustering has been used to identify different types of depression. Cluster analysis can also be used to detect patterns in the spatial or temporal distribution of a disease.

• Business. Businesses collect large amounts of information on current and potential customers.
Clustering can be used to segment customers into a small number of groups for additional analysis and marketing activities.

Clustering for Utility. Cluster analysis provides an abstraction from individual data objects to the clusters in which those data objects reside. Additionally, some clustering techniques characterize each cluster in terms of a cluster prototype; i.e., a data object that is representative of the other objects in the cluster. These cluster prototypes can be used as the basis for a number of data analysis or data processing techniques. Therefore, in the context of utility, cluster analysis is the study of techniques for finding the most representative cluster prototypes.

• Summarization. Many data analysis techniques, such as regression or PCA, have a time or space complexity of O(m^2) or higher (where m is the number of objects), and thus are not practical for large data sets. However, instead of applying the algorithm to the entire data set, it can be applied to a reduced data set consisting only of cluster prototypes. Depending on the type of analysis, the number of prototypes, and the accuracy with which the prototypes represent the data, the results can be comparable to those that would have been obtained if all the data could have been used.

• Compression. Cluster prototypes can also be used for data compression. In particular, a table is created that consists of the prototypes for each cluster; i.e., each prototype is assigned an integer value that is its position (index) in the table. Each object is represented by the index of the prototype associated with its cluster. This type of compression is known as vector quantization and is often applied to image, sound, and video data, where (1) many of the data objects are highly similar to one another, (2) some loss of information is acceptable, and (3) a substantial reduction in the data size is desired. (A minimal code sketch of this idea appears at the end of this introduction.)

• Efficiently Finding Nearest Neighbors. Finding nearest neighbors can require computing the pairwise distance between all points. Often clusters and their cluster prototypes can be found much more efficiently. If objects are relatively close to the prototype of their cluster, then we can use the prototypes to reduce the number of distance computations that are necessary to find the nearest neighbors of an object. Intuitively, if two cluster prototypes are far apart, then the objects in the corresponding clusters cannot be nearest neighbors of each other. Consequently, to find an object's nearest neighbors it is only necessary to compute the distance to objects in nearby clusters, where the nearness of two clusters is measured by the distance between their prototypes.

This chapter provides an introduction to cluster analysis. We begin with a high-level overview of clustering, including a discussion of the various approaches to dividing objects into sets of clusters and the different types of clusters. We then describe three specific clustering techniques that represent broad categories of algorithms and illustrate a variety of concepts: K-means, agglomerative hierarchical clustering, and DBSCAN. The final section of this chapter is devoted to cluster validity: methods for evaluating the goodness of the clusters produced by a clustering algorithm. More advanced clustering concepts and algorithms will be discussed in Chapter 9. Whenever possible, we discuss the strengths and weaknesses of different schemes. In addition, the bibliographic notes provide references to relevant books and papers that explore cluster analysis in greater depth.
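The compression bullet above describes vector quantization only in words. The following Python sketch is our own illustration (the function names and the random toy data are not from the chapter): given a table of cluster prototypes obtained from any clustering step, each object is stored as the index of its nearest prototype and reconstructed by a table lookup.

```python
import numpy as np

def vq_encode(data, codebook):
    """Represent every object by the index of its nearest prototype (codebook row)."""
    d2 = ((data[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def vq_decode(indices, codebook):
    """Lossy reconstruction: replace each stored index by its prototype."""
    return codebook[indices]

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 3))        # 1000 objects with 3 attributes
codebook = rng.normal(size=(16, 3))      # 16 prototypes, e.g. centroids from a clustering step
indices = vq_encode(data, codebook)      # compressed representation: one small integer per object
reconstruction = vq_decode(indices, codebook)
```

Storing a 16-entry table plus one 4-bit index per object instead of three floating-point attributes is exactly the trade-off described above: a substantial reduction in data size at the cost of some loss of information.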
1.1 Overview

Before discussing specific clustering techniques, we provide some necessary background. First, we further define cluster analysis, illustrating why it is difficult and explaining its relationship to other techniques that group data. Then we explore two important topics: (1) different ways to group a set of objects into a set of clusters, and (2) types of clusters.

1.1.1 What Is Cluster Analysis?

Cluster analysis groups data objects based only on information found in the data that describes the objects and their relationships. The goal is that the objects within a group be similar (or related) to one another and different from (or unrelated to) the objects in other groups. The greater the similarity (or homogeneity) within a group and the greater the difference between groups, the better or more distinct the clustering.

Cluster analysis is related to other techniques that are used to divide data objects into groups. For instance, clustering can be regarded as a form of classification in that it creates a labeling of objects with class (cluster) labels. However, it derives these labels only from the data. In contrast, classification in the sense of Chapter 4 is supervised classification; i.e., new, unlabeled objects are assigned a class label using a model developed from objects with known class labels. For this reason, cluster analysis is sometimes referred to as unsupervised classification. When the term classification is used without any qualification within data mining, it typically refers to supervised classification.

Also, while the terms segmentation and partitioning are sometimes used as synonyms for clustering, these terms are frequently used for approaches outside the traditional bounds of cluster analysis. For example, the term partitioning is often used in connection with techniques that divide graphs into subgraphs and that are not strongly connected to clustering. Segmentation often refers to the division of data into groups using simple techniques; e.g., an image can be split into segments based only on pixel intensity and color, or people can be divided into groups based on their income. Nonetheless, some work in graph partitioning and in image and market segmentation is related to cluster analysis.

1.1.2 Different Types of Clusterings

An entire collection of clusters is commonly referred to as a clustering, and in this section, we distinguish various types of clusterings: hierarchical (nested) versus partitional (unnested), exclusive versus overlapping versus fuzzy, and complete versus partial.

Hierarchical versus Partitional. The most commonly discussed distinction among different types of clusterings is whether the set of clusters is nested or unnested, or in more traditional terminology, hierarchical or partitional. A partitional clustering is simply a division of the set of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset. If we permit clusters to have subclusters, then we obtain a hierarchical clustering, which is a set of nested clusters that are organized as a tree. Each node (cluster) in the tree (except for the leaf nodes) is the union of its children (subclusters), and the root of the tree is the cluster containing all the objects. Often, but not always, the leaves of the tree are singleton clusters of individual data objects.
If we allow clusters to be nested, then one interpretation of Figure 8.1(a) is that it has two subclusters (Figure 8.1(b)), each of which, in turn, has three subclusters (Figure 8.1(d)). The clusters shown in Figures 8.1(a-d), when taken in that order, also form a hierarchical (nested) clustering with, respectively, 1, 2, 4, and 6 clusters on each level. Finally, note that a hierarchical clustering can be viewed as a sequence of partitional clusterings, and a partitional clustering can be obtained by taking any member of that sequence, i.e., by cutting the hierarchical tree at a particular level.

Exclusive versus Overlapping versus Fuzzy. The clusterings shown in Figure 8.1 are all exclusive, as they assign each object to a single cluster. There are many situations in which a point could reasonably be placed in more than one cluster, and these situations are better addressed by non-exclusive clustering. In the most general sense, an overlapping or non-exclusive clustering is used to reflect the fact that an object can simultaneously belong to more than one group (class). For instance, a person at a university can be both an enrolled student and an employee of the university. A non-exclusive clustering is also often used when, for example, an object is "between" two or more clusters and could reasonably be assigned to any of these clusters. Imagine a point halfway between two of the clusters of Figure 8.1. Rather than make a somewhat arbitrary assignment of the object to a single cluster, it is placed in all of the "equally good" clusters.

In a fuzzy clustering, every object belongs to every cluster with a membership weight that is between 0 (absolutely doesn't belong) and 1 (absolutely belongs). In other words, clusters are treated as fuzzy sets. (Mathematically, a fuzzy set is one in which an object belongs to any set with a weight that is between 0 and 1. In fuzzy clustering, we often impose the additional constraint that the sum of the weights for each object must equal 1.) Similarly, probabilistic clustering techniques compute the probability with which each point belongs to each cluster, and these probabilities must also sum to 1. Because the membership weights or probabilities for any object sum to 1, a fuzzy or probabilistic clustering does not address true multiclass situations, such as the case of a student employee, where an object belongs to multiple classes. Instead, these approaches are most appropriate for avoiding the arbitrariness of assigning an object to only one cluster when it may be close to several. In practice, a fuzzy or probabilistic clustering is often converted to an exclusive clustering by assigning each object to the cluster in which its membership weight or probability is highest.

Complete versus Partial. A complete clustering assigns every object to a cluster, whereas a partial clustering does not. The motivation for a partial clustering is that some objects in a data set may not belong to well-defined groups. Many times objects in the data set may represent noise, outliers, or "uninteresting background." For example, some newspaper stories may share a common theme, such as global warming, while other stories are more generic or one-of-a-kind. Thus, to find the important topics in last month's stories, we may want to search only for clusters of documents that are tightly related by a common theme.
In other cases, a complete clustering of the objects is desired. For example, an application that uses clustering to organize documents for browsing needs to guarantee that all documents can be browsed.

1.1.3 Different Types of Clusters

Clustering aims to find useful groups of objects (clusters), where usefulness is defined by the goals of the data analysis. Not surprisingly, there are several different notions of a cluster that prove useful in practice. In order to visually illustrate the differences among these types of clusters, we use two-dimensional points, as shown in Figure 8.2, as our data objects. We stress, however, that the types of clusters described here are equally valid for other kinds of data.

Well-Separated. A cluster is a set of objects in which each object is closer (or more similar) to every other object in the cluster than to any object not in the cluster. Sometimes a threshold is used to specify that all the objects in a cluster must be sufficiently close (or similar) to one another. This idealistic definition of a cluster is satisfied only when the data contains natural clusters that are quite far from each other. Figure 8.2(a) gives an example of well-separated clusters that consists of two groups of points in a two-dimensional space. The distance between any two points in different groups is larger than the distance between any two points within a group. Well-separated clusters do not need to be globular, but can have any shape.

Prototype-Based. A cluster is a set of objects in which each object is closer (more similar) to the prototype that defines the cluster than to the prototype of any other cluster. For data with continuous attributes, the prototype of a cluster is often a centroid, i.e., the average (mean) of all the points in the cluster. When a centroid is not meaningful, such as when the data has categorical attributes, the prototype is often a medoid, i.e., the most representative point of a cluster. For many types of data, the prototype can be regarded as the most central point, and in such instances, we commonly refer to prototype-based clusters as center-based clusters. Not surprisingly, such clusters tend to be globular. Figure 8.2(b) shows an example of center-based clusters.

Graph-Based. If the data is represented as a graph, where the nodes are objects and the links represent connections among objects (see Section ), then a cluster can be defined as a connected component; i.e., a group of objects that are connected to one another, but that have no connection to objects outside the group. An important example of graph-based clusters is contiguity-based clusters, where two objects are connected only if they are within a specified distance of each other. This implies that each object in a contiguity-based cluster is closer to some other object in the cluster than to any point in a different cluster. Figure 8.2(c) shows an example of such clusters for two-dimensional points. This definition of a cluster is useful when clusters are irregular or intertwined, but it can have trouble when noise is present since, as illustrated by the two spherical clusters of Figure 8.2(c), a small bridge of points can merge two distinct clusters.

Other types of graph-based clusters are also possible. One such approach (Section ) defines a cluster as a clique, i.e., a set of nodes in a graph that are completely connected to each other. Specifically, if we add connections between objects in the order of their distance from one another, a cluster is formed when a set of objects forms a clique.
Like prototype-based clusters, such clusters tend to be globular.

Density-Based. A cluster is a dense region of objects that is surrounded by a region of low density. Figure 8.2(d) shows some density-based clusters for data created by adding noise to the data of Figure 8.2(c). The two circular clusters are not merged, as in Figure 8.2(c), because the bridge between them fades into the noise. Likewise, the curve that is present in Figure 8.2(c) also fades into the noise and does not form a cluster in Figure 8.2(d). A density-based definition of a cluster is often employed when the clusters are irregular or intertwined, and when noise and outliers are present. By contrast, a contiguity-based definition of a cluster would not work well for the data of Figure 8.2(d) since the noise would tend to form bridges between clusters.

Shared-Property (Conceptual Clusters). More generally, we can define a cluster as a set of objects that share some property. This definition encompasses all the previous definitions of a cluster; e.g., objects in a center-based cluster share the property that they are all closest to the same centroid or medoid. However, the shared-property approach also includes new types of clusters. Consider the clusters shown in Figure 8.2(e). A triangular area (cluster) is adjacent to a rectangular one, and there are two intertwined circles (clusters). In both cases, a clustering algorithm would need a very specific concept of a cluster to successfully detect these clusters. The process of finding such clusters is called conceptual clustering. However, too sophisticated a notion of a cluster would take us into the area of pattern recognition, and thus, we only consider simpler types of clusters in this book.

Road Map

In this chapter, we use the following three simple, but important techniques to introduce many of the concepts involved in cluster analysis.

• K-means. This is a prototype-based, partitional clustering technique that attempts to find a user-specified number of clusters (K), which are represented by their centroids. (A minimal sketch of the algorithm follows this road map.)

• Agglomerative Hierarchical Clustering. This clustering approach refers to a collection of closely related clustering techniques that produce a hierarchical clustering by starting with each point as a singleton cluster and then repeatedly merging the two closest clusters until a single, all-encompassing cluster remains. Some of these techniques have a natural interpretation in terms of graph-based clustering, while others have an interpretation in terms of a prototype-based approach.

• DBSCAN. This is a density-based clustering algorithm that produces a partitional clustering, in which the number of clusters is automatically determined by the algorithm. Points in low-density regions are classified as noise and omitted; thus, DBSCAN does not produce a complete clustering.
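The road map describes K-means only in words. The sketch below is our own illustration of the standard prototype-based iteration (Lloyd's algorithm with random initialization); it is not the book's pseudocode, and the helper name, stopping rule and toy data are ours.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Prototype-based partitional clustering: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # assignment step: each point joins the cluster of its closest centroid
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # update step: each centroid moves to the mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# toy usage on three well-separated blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.2, size=(100, 2)) for c in (0.0, 2.0, 4.0)])
labels, centroids = kmeans(X, k=3)
```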
Effective wavelet-based compression method with adaptive quantization threshold and zerotree coding

Artur Przelaskowski, Marian Kazubek, Tomasz Jamrógiewicz
Institute of Radioelectronics, Warsaw University of Technology, Nowowiejska 15/19, 00-665 Warszawa, Poland

ABSTRACT
An efficient image compression technique, intended especially for medical applications, is presented. Dyadic wavelet decomposition using the Antonini and Villasenor filter banks is followed by adaptive space-frequency quantization and zerotree-based entropy coding of the wavelet coefficients. Threshold selection and uniform quantization are based on a spatial variance estimate built from the lowest-frequency subband data. The threshold value for each coefficient is evaluated as a linear function of a 9-order binary context. After quantization, zerotree construction, pruning and arithmetic coding are applied for efficient lossless coding of the data. The presented compression method is less complex than the most effective EZW-based techniques but achieves comparable compression efficiency. Specifically, our method is similar to SPIHT in efficiency for MR image compression, slightly better for the CT image and significantly better for US image compression. Thus the compression efficiency of the presented method is competitive with the best algorithms published in the literature across diverse classes of medical images.
Keywords: wavelet transform, image compression, medical image archiving, adaptive quantization

1. INTRODUCTION
Lossy image compression techniques significantly shorten the original image representation at the cost of certain changes to the original data. At lower bit rates these changes are mostly observed as distortion, although sometimes an improvement in image quality is visible. What is required is to compress a given image while preserving all of its important features and removing the noise and the redundancy of the original representation. The choice of a proper compression method depends on many factors, especially on the statistical characteristics of the image (global and local) and on the application. Medical applications are particularly challenging because of the strict demands on preserving image quality (in the sense of diagnostic accuracy). Perfect reconstruction of very small structures, which are often very important for diagnosis, is possible even at low bit rates by increasing the adaptability of the algorithm. Fitting the data processing method to the changing behaviour of the data within an image, and taking a priori data knowledge into account, makes it possible to achieve sufficient compression efficiency. Recent achievements clearly show that wavelet-based techniques can currently realise these ideas in the best way.

Wavelet transform features are useful for better representation of real nonstationary signals and allow a priori and a posteriori data knowledge to be exploited to preserve diagnostically important image elements. Wavelets, as an entire set of transformation basis functions, are very efficient for image compression. The transform decorrelates the data to a level similar to the very popular discrete cosine transform and has additional very important features: it often provides a more natural basis set than the sinusoids of Fourier analysis, it enables a wide set of solutions for constructing effective adaptive scalar or vector quantization in the time-frequency domain and correlated entropy coding techniques, it does not create blocking artefacts, and it is well suited to hardware implementation.
Wavelet-based compression is naturally multiresolution and scalable across different applications, so that a single decomposition provides reconstruction at a variety of sizes and resolutions (limited by the compressed representation) as well as progressive coding and transmission in multiuser environments.

Wavelet decomposition can be implemented in terms of filters and realised as a subband coding approach. The fundamental issue in the construction of efficient subband coding techniques is to select, design or modify the analysis and synthesis filters.1 Wavelets are a good tool for creating a wide class of new filters which turn out to be very effective in compression schemes. The choice of a suitable wavelet family, with criteria such as regularity, linearity, symmetry, orthogonality, or the impulse and step response of the corresponding filter bank, can significantly improve compression efficiency. For compactly supported wavelets the corresponding filter length is proportional to the degree of smoothness and regularity of the wavelet. But when the wavelets are orthogonal (giving the greatest data decorrelation) they also have non-linear phase in the associated FIR filters. Symmetry, compact support and linear phase of the filters may be achieved by applying biorthogonal wavelet bases; quadrature mirror and perfect reconstruction subband filters are then used to compute the wavelet transform. Biorthogonal wavelet-based filters have proven very efficient in compression algorithms. Constructing a wavelet transformation by fitting locally defined basis functions (or finite-length filters) to the image data characteristics is possible but very difficult. Because of the nonstationarity of image data, the variety of image features that may be important for good reconstruction, and the significantly varying image quality (signal-to-noise level, spatial resolution, etc.) of different imaging systems, it is very difficult to devise a method for constructing filters that are optimal for compression. Many issues relating to the choice of the most efficient filter bank for image compression remain unresolved.2

The demands of preserving the diagnostic accuracy in reconstructed medical images are exacting. Important high-frequency coefficients which appear at the edges of small structures in CT and MR images should be preserved. Accurate reconstruction of global organ shapes in US images and strong noise reduction in MN images are also required. It is rather difficult to imagine that one filter bank can meet all of these demands in the best way; rather, choosing the best wavelet family for each modality is to be expected.

Our aim is to increase image compression efficiency, especially for medical applications, by applying a suitable wavelet transformation, an adaptive quantization scheme and corresponding entropy coding of the processed decomposition tree. We want to achieve higher acceptable compression ratios for medical images by better preserving the diagnostic accuracy of the images. Many bit allocation techniques applied in quantization schemes are based on assumptions about the data distribution, the quantiser distortion function, etc. Statistical assumptions built on global data characteristics do not cover exactly the local data behaviour, and important details of the original image, e.g., small areas of different texture, may be lost. Thus we decided to build the quantization scheme on local data characteristics such as the direct two-dimensional data context mentioned earlier. We estimate the data variance from the real data set as a spatial estimate for the corresponding coefficient positions in successive subbands. The details of the quantization process and the correlated coding technique, which together form a simple and effective wavelet-based compression method that achieves high reconstructed image quality at low bit rates, are presented below.
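As background for Section 2, the following NumPy sketch illustrates the dyadic decomposition the method relies on. It is our own illustration and, to stay self-contained, uses the Haar filter pair on even-sized inputs instead of the Antonini 7/9 or Villasenor filters evaluated in the paper.

```python
import numpy as np

def haar_step(img):
    """One dyadic level: split an image into LL, LH, HL, HH subbands with Haar filters."""
    lo = (img[:, 0::2] + img[:, 1::2]) / np.sqrt(2.0)   # row low-pass
    hi = (img[:, 0::2] - img[:, 1::2]) / np.sqrt(2.0)   # row high-pass
    LL = (lo[0::2, :] + lo[1::2, :]) / np.sqrt(2.0)
    LH = (lo[0::2, :] - lo[1::2, :]) / np.sqrt(2.0)
    HL = (hi[0::2, :] + hi[1::2, :]) / np.sqrt(2.0)
    HH = (hi[0::2, :] - hi[1::2, :]) / np.sqrt(2.0)
    return LL, (LH, HL, HH)

def dyadic_decompose(img, levels=3):
    """Three-level dyadic decomposition as in Figure 1: only LL is split further."""
    details = []
    LL = img.astype(np.float64)
    for _ in range(levels):
        LL, bands = haar_step(LL)
        details.append(bands)          # details[0] = finest scale, details[-1] = coarsest
    return LL, details

image = np.random.default_rng(0).integers(0, 256, size=(256, 256)).astype(np.float64)
LL, details = dyadic_decompose(image, levels=3)   # 256x256 input, as in the paper's tests
```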
2. THE COMPRESSION TECHNIQUE
The scheme of our algorithm is very simple: a dyadic, 3-level decomposition of the original image (256×256 images were used) performed with the selected filters. For symmetrical filters a symmetric boundary extension at the image borders was used, and for asymmetrical filters a periodic (or circular) boundary extension.

Figure 1. Dyadic wavelet image decomposition scheme (horizontal relations and parent-children relations are marked; LL denotes the lowest frequency subband).

Our approach to the filters is a utilitarian one: we use the literature to select proper filters rather than design them. We conducted an experiment using different kinds of wavelet transformations in the presented algorithm. A long list of wavelet families and corresponding filters was tested: Daubechies, Adelson, Brislawn, Odegard, Villasenor, Spline, Antonini, Coiflet, Symmlet, Beylkin, Vaid, etc.3 Generally the Antonini4 filters turned out to be the most efficient; the Villasenor, Odegard and Brislawn filters allow similar compression efficiency to be achieved. Finally, Antonini 7/9-tap filters are used for MR and US image compression and Villasenor 18/10-tap filters for CT image compression.

2.1 Adaptive space-frequency quantization
The presented space-frequency quantization technique is realised as an entire-data pre-selection, threshold selection and scalar uniform quantization with a step size conditioned by the chosen compression ratio. For adaptive estimation of the threshold and quantization step values two extra data structures are built. The entire-data pre-selection makes it possible to evaluate the zero-quantized data set and predict the spatial context of each coefficient. Next, a simple quantization of the lowest frequency subband (LL) allows the quantized coefficient variance to be estimated as a spatial function across the sequential subbands. The value of the quantization step is then slightly modified by a model built on this variance estimate. Additionally, the set of coefficients is reduced by threshold selection. The threshold value is increased in areas dominated by zero-valued coefficients, and the level of growth depends on the coefficient's spatial position through the variance estimation function.

First, zero-quantized data prediction is performed. The step size w is assumed to be constant for all coefficients at each decomposition level. For such a quantization model the threshold value is equal to w/2. Each coefficient whose magnitude is less than the threshold is predicted to be zero-valued after quantization (insignificant); otherwise the coefficient is predicted to be non-zero (significant). This allows a map P of predicted zero-quantized coefficients to be created for threshold evaluation in the next step. The P map is created as follows:

    if |c_i| < w/2 then p_i = 0, else p_i = 1,    i = 1, 2, ..., m*n,    (1)

where m and n are the horizontal and vertical image sizes and c_i is the value of the i-th wavelet coefficient.

The coefficient variance estimate for the coefficients of the subsequent subbands is built from the LL data at the corresponding spatial positions. Quantization with the step size w is performed in LL and the most frequently occurring coefficient value is estimated. This value is named MHC (mode of histogram coefficient).
The areas where the MHC appears are strongly correlated with the zero-valued data areas in the successive subbands. The absolute difference between the quantized LL data and the MHC is used as the variance estimate for the coefficients of the next subbands at the corresponding spatial positions. We tested many different schemes, but this model gave the best results in terms of final compression efficiency. The variance estimate is rather coarse, but this simple adaptive model built on real data needs no additional side information for the reconstruction process and increases the compression efficiency. Let lc_i, i = 1, 2, ..., lm, be the set of quantized LL coefficient values, with lm the size of this set. The mode of histogram coefficient MHC is then defined by

    f(MHC) = max_{lc_i in Al} f(lc_i),    MHC in Al,    (2)

where Al is the alphabet of the data source describing the coefficient values, f(lc_i) = n_{lc_i} / lm, and n_{lc_i} is the number of coefficients with value lc_i. The normalised variance estimate ve_si for the coefficients of the next subbands at the spatial positions corresponding to i (parent-children relations from the top to the bottom of the zerotree, see Fig. 1) is simply

    ve_si = |lc_i - MHC| / ve_max,    (3)

where ve_max is the largest of these absolute differences. This set of ve_si values is treated as the top-parent estimate and is applied to all corresponding child nodes in the hierarchical wavelet decomposition tree.

A 9-order context model is applied for coarser data reduction in 'unimportant' areas (usually of low diagnostic importance). Unimportance means that in these areas the majority of the data are equal to zero and significant values are isolated. If a single significant value appears in such an area, it most often indicates that this high-frequency coefficient is caused by noise; coarser data reduction by a higher threshold therefore increases the signal-to-noise ratio by removing the noise. At the edges of diagnostically important structures significant values are grouped together, and the threshold value is lower in these regions. The P map is used to estimate the context of each coefficient. A noncausal prediction of the coefficient importance is made as a linear function of the binary surrounding data, excluding the significance of the considered coefficient itself. Other polynomial, exponential and hyperbolic functions were tested, but the linear function turned out to be the most efficient. The data context shown in Fig. 2 is formed for each coefficient; at the previously processed points of the data stream the context is modified by the results of selection with the actual threshold values at those points instead of w/2 (causal modification). The coefficient importance cim_i is evaluated for each coefficient c_i as

    cim_i = coeff_1 * sum_{j=1}^{9} p_i(j),    i = 1, 2, ..., m*n,    (4)

where p_i(j), j = 1, ..., 9, are the P-map values of the context surrounding c_i. Next, the threshold value is evaluated for each coefficient c_i:

    th_i = (w/2) * (1 + (1 - cim_i) * ve_si),    i = 1, 2, ..., m*n,    (5)

where si denotes the LL (parent) spatial location corresponding to i at the lower decomposition levels.

The modified quantization step model uses the LL-based variance estimate to slightly increase the step size for coefficients with a lower variance estimate. Threshold data selection and uniform quantization are performed as follows: each coefficient value is first compared with its threshold value and then quantized using the step w for LL and the modified step value mw_si for the next subbands. Threshold selection and quantization for each coefficient c_i can be described by

    if c_i in LL then c_i := c_i / w,
    else if |c_i| < th_i then c_i := 0, else c_i := c_i / mw_si,    (6)

where

    mw_si = w * (1 + coeff_2 * (1 - ve_si)).    (7)

The coeff_1 and coeff_2 values are fitted to the actual data characteristics by using a priori image knowledge and by performing extensive tests on groups of images with similar characteristics.
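A compact NumPy sketch of the quantization rules (1)-(7) as reconstructed above. The array layout, the 8-neighbour 3x3 stand-in for the paper's 9-order context, the explicit rounding step, and the example coeff_1/coeff_2 values are our own assumptions, not the authors' implementation.

```python
import numpy as np

def quantize_subband(c, w, ve, coeff1=0.1, coeff2=0.5):
    """Adaptive threshold selection and uniform quantization for one detail subband.

    c  : 2-D array of wavelet coefficients
    w  : base quantization step
    ve : normalised variance estimate propagated from the LL parent (same shape as c)
    """
    # P map, Eq. (1): significance predicted for plain uniform quantization with step w
    p = (np.abs(c) >= w / 2.0).astype(np.float64)

    # coefficient importance, Eq. (4): linear function of the binary context
    # (here the 8 neighbours of a 3x3 window stand in for the paper's 9-order context)
    padded = np.pad(p, 1, mode="edge")
    h, v = c.shape
    context = sum(padded[1 + di:1 + di + h, 1 + dj:1 + dj + v]
                  for di in (-1, 0, 1) for dj in (-1, 0, 1) if (di, dj) != (0, 0))
    cim = coeff1 * context

    # adaptive threshold, Eq. (5), and modified step size, Eq. (7)
    th = (w / 2.0) * (1.0 + (1.0 - cim) * ve)
    mw = w * (1.0 + coeff2 * (1.0 - ve))

    # threshold selection and uniform quantization, Eq. (6), with explicit rounding
    return np.where(np.abs(c) < th, 0.0, np.round(c / mw))

# toy usage: a random 128x128 subband with a flat variance estimate
rng = np.random.default_rng(0)
subband = rng.normal(scale=4.0, size=(128, 128))
q = quantize_subband(subband, w=8.0, ve=np.full((128, 128), 0.5))
```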
Figure 2. (a) The 9-order coefficient context used to evaluate the coefficient importance in the adaptive threshold procedure; (b) the P-map context of a single edge coefficient.

2.2 Zerotree construction and coding
Sophisticated entropy coding methods that can significantly improve compression efficiency should retain a progressive mode of data reconstruction. Progressive reconstruction is simple and natural after a wavelet-based decomposition. The wavelet coefficient values are therefore coded subband-sequentially and spectral selection is performed in the way typical of wavelet methods. Subbands of the same scale are coded as follows: first the lowest frequency subband, then the right-side coefficient block, and the down-left and down-right blocks at the end. After that, the data blocks of the next larger scale are coded in the same order. To reduce the redundancy of this data representation a zerotree structure is built. The zerotree describes well the correlation between data values in the horizontal and vertical directions, especially between large areas of zero-valued data. These correlated fragments of the zerotree are removed, and the final data streams for entropy coding are significantly shortened. The zerotree structure also allows data streams with different characteristics to be created, which increases the coding efficiency. We used simple arithmetic coders for coding these data streams instead of the bit-plane (MSB to LSB) coding applied in many techniques, which requires the construction of an efficient context model. By giving up successive approximation we lose full progressiveness, but the algorithm is simpler and sometimes even higher coding efficiency is achieved. Two slightly different arithmetic coders were used to produce the final data stream.

2.2.1 Construction and pruning of the zerotree
The dyadic hierarchical image data decomposition is presented in Fig. 1. The decomposition tree structure reflects this hierarchical data processing and corresponds strictly to the data streams created in the transformation process. The four lowest frequency subbands, which belong to the coarsest scale level, are located at the top of the tree. These data have no parent values, but they are the parents of the coefficients at the lower tree level of the next larger scale in the corresponding spatial positions. This correspondence is shown in Fig. 1 as parent-children relations. Each parent coefficient has four direct children and each child has one direct parent. Additionally, horizontal relations at the top tree level are introduced to describe the data correlation better.

The decomposition tree becomes a zerotree when the node values of the quantized coefficients are labelled with symbols of a binary alphabet. Each tree node is checked for being significant (not equal to zero) or insignificant (equal to zero), and a binary tree is built. For the LL nodes the significance test is slightly different. The MHC value is used again, because the areas of LL where the MHC appears are strongly correlated with the zero-valued data areas in the next subbands: a node is labelled significant if its value is not equal to the MHC value and insignificant if its value is equal to the MHC.
The MHC value must be sent to the decoder for correct tree reconstruction.

The next step of the algorithm is the pruning of this tree. Only branches leading to insignificant nodes can be pruned, and the procedure differs slightly between the levels of the zerotree. Zerotree pruning starts at the bottom of the wavelet zerotree. The values of four children and their parent from the higher level are tested in sequence. If the parent and the children are all insignificant, the tree branch with the child nodes is removed and the parent is labelled as a pruned branch node (PBN); because of this the tree alphabet is widened to three symbols. At the middle levels the tree is pruned if the parent value is insignificant and all children are recognised as PBN. From the conducted research we found that adding further symbols to the tree alphabet is not effective in decreasing the code bit rate. Zerotree pruning at the top level is different: the node values are checked in the horizontal tree directions, exploiting the spatial correlation of the quantized coefficients in the subbands of the coarsest scale (see Fig. 1). The four coefficients from the same spatial position in the different subbands are compared with one another in sequence. The tree is pruned if the LL node is insignificant and the three corresponding coefficients are PBN; the three branches with their nodes are then removed and the LL node is labelled as PBN, which means that all of its children across the zerotree are insignificant. The spatial horizontal correlation between the data at the other tree levels is not strong enough for its use to increase the coding efficiency.
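The pruning rules above can be written as a short bottom-up pass over a quadtree. The sketch below is our own simplification in Python (a generic three-symbol quadtree without the special horizontal rule for the four coarsest subbands), not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import List

INSIG, SIG, PBN = 0, 1, 2      # three-symbol zerotree alphabet

@dataclass
class Node:
    symbol: int                                              # INSIG or SIG before pruning
    children: List["Node"] = field(default_factory=list)     # four children, or none at the bottom

def prune(node: Node) -> None:
    """Bottom-up zerotree pruning.

    A branch is removed when the parent is insignificant and all four children are
    leaves that are insignificant (bottom level) or already pruned to PBN (middle
    levels); the parent is then relabelled PBN.
    """
    for child in node.children:
        prune(child)
    prunable = (node.symbol == INSIG and node.children and
                all(c.symbol in (INSIG, PBN) and not c.children for c in node.children))
    if prunable:
        node.children = []                  # drop the whole branch
        node.symbol = PBN                   # pruned branch node

# toy usage: an insignificant parent with four insignificant leaf children collapses to PBN
parent = Node(INSIG, [Node(INSIG), Node(INSIG), Node(INSIG), Node(INSIG)])
prune(parent)
assert parent.symbol == PBN and parent.children == []
```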
2.2.2 Creating three data streams and coding
The pruned zerotree structure is convenient for creating the data streams for the final, efficient entropy coding. Instead of the zero or MHC value (for LL nodes) of a pruned branch node, an additional code value is inserted into the set of coded values. Bit maps of the PBN spatial distribution at the different tree levels can also be used; we optionally used only the PBN bit map of the LL data, which slightly increases the coding efficiency. The zerotree is coded sequentially from the top to the bottom to support progressive reconstruction. Because the quantized data characteristics vary and the alphabet of the data source model is wider after zerotree pruning, three separate data streams with different characteristics (and optionally a fourth bit-map stream) are produced for efficient coding. It is well known from information theory that when a data set shows significant variability of statistics, and data with different statistics (alphabet and estimated conditional probabilities) can be grouped, it is better to separate these data and encode each group independently to increase the coding efficiency; this is especially true when a context-based arithmetic coder is used.

The data separation is based on the zerotree, and the following data are coded independently:
- the LL data set, which usually has a smaller number of insignificant (MHC-valued) coefficients, fewer PBNs and less spatial data correlation than the following subbands (a word- or character-wise arithmetic coder is less efficient here than a bitwise coder); optionally this stream is divided into a PBN-distribution bit map and a word or character data set without PBNs,
- the rest of the top level (the three next subbands) and the middle-level subband data, with a considerable number of zero-valued (insignificant) coefficients and PBN code values; the level of data correlation is higher, so a word- or character-wise arithmetic coder is efficient enough,
- the lowest-level data set, with a usually large number of insignificant coefficients and no PBN code values; the data correlation is very high.

The Urban Koistinen arithmetic coder (DDJ Compression Contest public domain code, accessible via the Internet) with a simple bitwise algorithm is used to code the first data stream. For the second and third data streams a first-order arithmetic coder built on the code presented in Nelson's book5 is applied. The Urban coder turned out to be up to 10% more efficient than the Nelson coder for the first data stream. Combining the rest of the top-level data with the middle-level data of similar statistics increases the coding efficiency by up to approximately 3%.

The procedure of zerotree construction, pruning and coding is presented in Fig. 3.

Figure 3. Coding scheme for the quantized wavelet coefficients using the zerotree structure (construction of the binary zerotree, pruning, arithmetic coding, final compressed data representation). PBN - pruned branch node.

3. TESTS, RESULTS AND DISCUSSION
Many images of different medical modalities were used in our tests. For the presentation of selected results we applied three 256×256×8-bit images from various medical imaging systems: CT (computed tomography), MR (magnetic resonance) and US (ultrasound) images; these images are shown in Fig. 4. Mean square error (MSE) and peak signal-to-noise ratio (PSNR) were adopted as the criteria for evaluating reconstructed image quality. Subjective quality assessment was conducted in a very simple way, only as the psychovisual impression of a non-professional observer.

Applying the adaptive quantization scheme based on the modified threshold value and quantization step size is up to 10% more efficient, in terms of the overall compression, than simple uniform scalar quantization. Applying the zerotree structure and its processing improved the coding efficiency by up to 10% in comparison with direct arithmetic coding of the quantized data set.

The compression efficiency of three methods - a DCT-based algorithm,6,7 SPIHT8 and the presented compression technique, called MBWT (modified basic wavelet-based technique) - was compared to evaluate MBWT. The results of the MSE- and PSNR-based evaluation are presented in Table 1. The two wavelet-based techniques are clearly more efficient than the DCT-based compression in terms of MSE/PSNR, and also in our subjective evaluation, for all cases. MBWT outperforms SPIHT for US images and slightly for the CT test image in the lower bit-rate range.

The concept of an adaptive threshold and a modified quantization step size is effective for strong noise reduction, but at the lower bit-rate range it sometimes turns out to be too coarse and very small details of the image structures are deformed.
US images contain a significant noise level, and diagnostically important small structures do not appear (the image resolution is poor). These images can therefore be efficiently compressed by MBWT with the image quality preserved, as clearly shown in Fig. 5. The improvement in compression efficiency relative to SPIHT is almost constant over a wide range of bit rates (0.3 - 0.6 dB of PSNR).

Figure 4. Examples of the images used in the compression efficiency tests; the results presented in Table 1 and in Fig. 5 were obtained for these images: (a) echocardiography image, (b) CT head image, (c) MR head image.

Table 1. Comparison of the compression efficiency of the three techniques: DCT-based, SPIHT and MBWT. The bit rates are chosen in the diagnostically interesting range (near the borders of acceptance).

Modality - bit rate   DCT-based (MSE / PSNR [dB])   SPIHT (MSE / PSNR [dB])   MBWT (MSE / PSNR [dB])
MRI - 0.70 bpp        8.93 / 38.62                  4.65 / 41.45              4.75 / 41.36
MRI - 0.50 bpp        13.8 / 36.72                  8.00 / 39.10              7.96 / 39.12
CT  - 0.50 bpp        6.41 / 40.06                  3.17 / 43.12              3.18 / 43.11
CT  - 0.30 bpp        18.5 / 35.46                  8.30 / 38.94              8.06 / 39.07
US  - 0.40 bpp        54.5 / 30.08                  31.3 / 33.18              28.3 / 33.61
US  - 0.25 bpp        91.5 / 28.61                  51.5 / 31.01              46.8 / 31.43

The level of noise in CT and MR images is lower, and small structures are often important in image analysis; that is why the benefits of MBWT are smaller in this case. The compression efficiency of MBWT is generally comparable to SPIHT for these images. The presented method loses its effectiveness at higher bit rates (see the PSNR of the 0.7 bpp MR representation), but at lower bit rates both MR and CT images are compressed significantly better. A possible reason is that at the lower bit-rate range the coefficients are reduced relatively more strongly because their importance is reduced by the MBWT threshold selection.

Figure 5. Comparison of the compression efficiency of SPIHT and the technique presented in this paper (MBWT), plotted as PSNR in dB versus rate in bits/pixel over the low bit-rate range; the US test image was compressed.

4. CONCLUSIONS
The adaptive space-frequency quantization scheme and the zerotree-based entropy coding are not time-consuming and achieve significant compression efficiency. Our algorithm is generally simpler than EZW-based algorithms9 and other algorithms with extended subband classification or space-frequency quantization models,10 but the compression efficiency of the presented method is competitive with the best algorithms published in the literature across diverse classes of medical images. MBWT-based compression gives slightly better results than SPIHT for high-quality images (CT and MR) and significantly better efficiency for US images. The presented compression technique has proven very useful and promising for medical applications. An appropriate evaluation of reconstructed image quality is desirable to delimit the acceptable lossy compression ratios for each medical modality. We intend to improve the efficiency of the method by designing a construction method for adaptive filter banks and a correlated, more adequate quantization scheme; this seems possible by applying a proper a priori model of the image features that determine diagnostic accuracy. More efficient context-based arithmetic coders should also be applied and more sophisticated zerotree structures tested.
REFERENCES
1. Hui, C. W. Kok, T. Q. Nguyen, "Image Compression Using Shift-Invariant Dyadic Wavelet Transform", submitted to IEEE Trans. Image Proc., April 3, 1996.
2. J. D. Villasenor, B. Belzer and J. Liao, "Wavelet Filter Evaluation for Image Compression", IEEE Trans. Image Proc., August 1995.
3. A. Przelaskowski, M. Kazubek, T. Jamrógiewicz, "Optimalization of the Wavelet-Based Algorithm for Increasing the Medical Image Compression Efficiency", submitted and accepted to TFTS'97, 2nd IEEE UK Symposium on Applications of Time-Frequency and Time-Scale Methods, Coventry, UK, 27-29 August 1997.
4. M. Antonini, M. Barlaud, P. Mathieu and I. Daubechies, "Image coding using wavelet transform", IEEE Trans. Image Proc., vol. IP-1, pp. 205-220, April 1992.
5. M. Nelson, The Data Compression Book, chapter 6, M&T Books, 1991.
6. M. Kazubek, A. Przelaskowski and T. Jamrógiewicz, "Using A Priori Information for Improving the Compression of Medical Images", Analysis of Biomedical Signals and Images, vol. 13, pp. 32-34, 1996.
7. A. Przelaskowski, M. Kazubek and T. Jamrógiewicz, "Application of Medical Image Data Characteristics for Constructing DCT-based Compression Algorithm", Medical & Biological Engineering & Computing, vol. 34, Supplement I, part I, pp. 243-244, 1996.
8. A. Said and W. A. Pearlman, "A New Fast and Efficient Image Codec Based on Set Partitioning in Hierarchical Trees", submitted to IEEE Trans. Circ. & Syst. Video Tech., 1996.
9. J. M. Shapiro, "Embedded Image Coding Using Zerotrees of Wavelet Coefficients", IEEE Trans. Signal Proces., vol. 41, no. 12, pp. 3445-3462, December 1993.
10. Z. Xiong, K. Ramchandran and M. T. Orchard, "Space-Frequency Quantization for Wavelet Image Coding", IEEE Trans. Image Proc., to appear in 1997.
On Information-Maximization Clustering: Tuning Parameter Selection and Analytic Solution

Masashi Sugiyama (sugi@cs.titech.ac.jp), Makoto Yamada (yamada@sg.cs.titech.ac.jp), Manabu Kimura (kimura@sg.cs.titech.ac.jp), Hirotaka Hachiya (hachiya@sg.cs.titech.ac.jp)
Department of Computer Science, Tokyo Institute of Technology, Tokyo, Japan.

Abstract
Information-maximization clustering learns a probabilistic classifier in an unsupervised manner so that mutual information between feature vectors and cluster assignments is maximized. A notable advantage of this approach is that it only involves continuous optimization of model parameters, which is substantially easier to solve than discrete optimization of cluster assignments. However, existing methods still involve non-convex optimization problems, and therefore finding a good local optimal solution is not straightforward in practice. In this paper, we propose an alternative information-maximization clustering method based on a squared-loss variant of mutual information. This novel approach gives a clustering solution analytically in a computationally efficient way via kernel eigenvalue decomposition. Furthermore, we provide a practical model selection procedure that allows us to objectively optimize tuning parameters included in the kernel function. Through experiments, we demonstrate the usefulness of the proposed approach.

1. Introduction
The goal of clustering is to classify data samples into disjoint groups in an unsupervised manner. K-means is a classic but still popular clustering algorithm. However, since k-means only produces linearly separated clusters, its usefulness is rather limited in practice.

To cope with this problem, various non-linear clustering methods have been developed. Kernel k-means (Girolami, 2002) performs k-means in a feature space induced by a reproducing kernel function. Spectral clustering (Shi & Malik, 2000) first unfolds non-linear data manifolds by a spectral embedding method, and then performs k-means in the embedded space. Blurring mean-shift (Fukunaga & Hostetler, 1975) uses a non-parametric kernel density estimator for modeling the data-generating probability density and finds clusters based on the modes of the estimated density. Discriminative clustering (Xu et al., 2005; Bach & Harchaoui, 2008) learns a discriminative classifier for separating clusters, where class labels are also treated as parameters to be optimized. Dependence-maximization clustering (Song et al., 2007; Faivishevsky & Goldberger, 2010) determines cluster assignments so that their dependence on input data is maximized.

These non-linear clustering techniques would be capable of handling highly complex real-world data. However, they suffer from a lack of objective model selection strategies ('model selection' here refers to the choice of tuning parameters in kernel functions or similarity measures, not the choice of the number of clusters). More specifically, the above non-linear clustering methods contain tuning parameters such as the width of Gaussian functions and the number of nearest neighbors in kernel functions or similarity measures, and these tuning parameter values need to be heuristically determined in an unsupervised manner.
The problem of learning similarities/kernels was addressed in earlier works, but they considered supervised setups, i.e., labeled samples are assumed to be given. Zelnik-Manor & Perona (2005) provided a useful unsupervised heuristic to determine the similarity in a data-dependent way. However, it still requires the number of nearest neighbors to be determined manually (although the magic number '7' was shown to work well in their experiments).

Another line of clustering framework, called information-maximization clustering (Agakov & Barber, 2006; Gomes et al., 2010), exhibited state-of-the-art performance. In this information-maximization approach, probabilistic classifiers such as a kernelized Gaussian classifier (Agakov & Barber, 2006) and a kernel logistic regression classifier (Gomes et al., 2010) are learned so that mutual information (MI) between feature vectors and cluster assignments is maximized in an unsupervised manner. A notable advantage of this approach is that classifier training is formulated as a continuous optimization problem, which is substantially simpler than discrete optimization of cluster assignments. Indeed, classifier training can be carried out in computationally efficient manners by a gradient method (Agakov & Barber, 2006) or a quasi-Newton method (Gomes et al., 2010). Furthermore, Agakov & Barber (2006) provided a model selection strategy based on the common information-maximization principle. Thus, kernel parameters can be systematically optimized in an unsupervised way.

However, in the above MI-based clustering approach, the optimization problems are non-convex, and finding a good local optimal solution is not straightforward in practice. The goal of this paper is to overcome this problem by providing a novel information-maximization clustering method. More specifically, we propose to employ a variant of MI called squared-loss MI (SMI), and develop a new clustering algorithm whose solution can be computed analytically in a computationally efficient way via eigenvalue decomposition. Furthermore, for kernel parameter optimization, we propose to use a non-parametric SMI estimator called least-squares MI (LSMI) (Suzuki et al., 2009), which was proved to achieve the optimal convergence rate with analytic-form solutions. Through experiments on various real-world datasets such as images, natural languages, accelerometric sensors, and speech, we demonstrate the usefulness of the proposed clustering method.

2. Information-Maximization Clustering with Squared-Loss Mutual Information
In this section, we describe our novel clustering algorithm.

2.1. Formulation of Information-Maximization Clustering
Suppose we are given d-dimensional i.i.d. feature vectors of size n, {x_i | x_i in R^d}_{i=1}^n, which are assumed to be drawn independently from a distribution with density p*(x). The goal of clustering is to give cluster assignments, {y_i | y_i in {1, ..., c}}_{i=1}^n, to the feature vectors {x_i}_{i=1}^n, where c denotes the number of classes. Throughout this paper, we assume that c is known.

In order to solve the clustering problem, we take the information-maximization approach (Agakov & Barber, 2006; Gomes et al., 2010). That is, we regard clustering as an unsupervised classification problem, and learn the class-posterior probability p*(y|x) so that 'information' between feature vector x and class label y is maximized.
The dependence-maximization approach (Song et al., 2007; Faivishevsky & Goldberger, 2010) is related to, but substantially different from, the above information-maximization approach. In the dependence-maximization approach, cluster assignments {y_i}_{i=1}^n are directly determined so that their dependence on the feature vectors {x_i}_{i=1}^n is maximized. Thus, the dependence-maximization approach intrinsically involves combinatorial optimization with respect to {y_i}_{i=1}^n. On the other hand, the information-maximization approach involves continuous optimization with respect to the parameter alpha included in a class-posterior model p(y|x; alpha). This continuous optimization of alpha is substantially easier to solve than discrete optimization of {y_i}_{i=1}^n.

Another advantage of the information-maximization approach is that it naturally allows out-of-sample clustering based on the discriminative model p(y|x; alpha), i.e., a cluster assignment for a new feature vector can be obtained from the learned discriminative model.

2.2. Squared-Loss Mutual Information
As an information measure, we adopt squared-loss mutual information (SMI). SMI between feature vector x and class label y is defined by

    SMI := (1/2) sum_{y=1}^{c} ∫ p*(x) p*(y) ( p*(x,y) / (p*(x) p*(y)) - 1 )^2 dx,    (1)

where p*(x,y) denotes the joint density of x and y, and p*(y) is the marginal probability of y. SMI is the Pearson divergence (Pearson, 1900) from p*(x,y) to p*(x)p*(y), while the ordinary MI (Cover & Thomas, 2006) is the Kullback-Leibler divergence (Kullback & Leibler, 1951) from p*(x,y) to p*(x)p*(y):

    MI := sum_{y=1}^{c} ∫ p*(x,y) log( p*(x,y) / (p*(x) p*(y)) ) dx.    (2)

The Pearson divergence and the Kullback-Leibler divergence both belong to the class of Ali-Silvey-Csiszár divergences (also known as f-divergences, see Ali & Silvey, 1966; Csiszár, 1967), and thus they share similar properties. For example, SMI is non-negative and takes zero if and only if x and y are statistically independent, as does the ordinary MI.

In the existing information-maximization clustering methods (Agakov & Barber, 2006; Gomes et al., 2010), MI is used as the information measure. In this paper, on the other hand, we adopt SMI because it allows us to develop a clustering algorithm whose solution can be computed analytically in a computationally efficient way via eigenvalue decomposition, as described below.

2.3. Clustering by SMI Maximization
Here, we give a computationally efficient clustering algorithm based on SMI (1). We can express SMI as

    SMI = (1/2) sum_{y=1}^{c} ∫ p*(x,y) ( p*(x,y) / (p*(x) p*(y)) ) dx - 1/2    (3)
        = (1/2) sum_{y=1}^{c} ∫ p*(y|x) p*(x) ( p*(y|x) / p*(y) ) dx - 1/2.    (4)

Suppose that the class-prior probability p*(y) is set to be uniform: p*(y) = 1/c. Then Eq. (4) is expressed as

    (c/2) sum_{y=1}^{c} ∫ p*(y|x) p*(x) p*(y|x) dx - 1/2.    (5)

Let us approximate the class-posterior probability p*(y|x) by the following kernel model:

    p(y|x; alpha) := sum_{i=1}^{n} alpha_{y,i} K(x, x_i),    (6)

where K(x, x') denotes a kernel function with a kernel parameter t. In the experiments, we will use a sparse variant of the local-scaling kernel (Zelnik-Manor & Perona, 2005):

    K(x_i, x_j) = exp( -||x_i - x_j||^2 / (2 sigma_i sigma_j) )  if x_i in N_t(x_j) or x_j in N_t(x_i),
                  0  otherwise,    (7)

where N_t(x) denotes the set of t nearest neighbors of x (t is the kernel parameter), sigma_i = ||x_i - x_i^(t)|| is a local scaling factor, and x_i^(t) is the t-th nearest neighbor of x_i.

Further approximating the expectation with respect to p*(x) included in Eq. (5) by the empirical average over the samples {x_i}_{i=1}^n, we arrive at the following SMI approximator:

    SMI^ := (c/(2n)) sum_{y=1}^{c} alpha_y' K^2 alpha_y - 1/2,    (8)

where ' denotes the transpose, alpha_y := (alpha_{y,1}, ..., alpha_{y,n})', and K_{i,j} := K(x_i, x_j).

For each cluster y, we maximize alpha_y' K^2 alpha_y under ||alpha_y|| = 1 (this unit-norm constraint is not essential, since the obtained solution is renormalized later). Since this is the Rayleigh quotient, the maximizer is given by the normalized principal eigenvector of K (Horn & Johnson, 1985). To avoid all the solutions {alpha_y}_{y=1}^c being reduced to the same principal eigenvector, we impose mutual orthogonality: alpha_y' alpha_{y'} = 0 for y != y'. Then the solutions are given by the normalized eigenvectors phi_1, ..., phi_c associated with the eigenvalues lambda_1 >= ... >= lambda_n >= 0 of K. Since the sign of phi_y is arbitrary, we set the sign as

    phi~_y = phi_y * sign(phi_y' 1_n),

where sign(.) denotes the sign of a scalar and 1_n denotes the n-dimensional vector with all ones.

On the other hand, since

    p*(y) = ∫ p*(y|x) p*(x) dx ≈ (1/n) sum_{i=1}^{n} p(y|x_i; alpha) = (1/n) alpha_y' K 1_n,

and the class-prior probability p*(y) was set to be uniform, we have the following normalization condition:

    (1/n) alpha_y' K 1_n = 1/c.

Furthermore, probability estimates should be non-negative, which can be achieved by rounding up negative outputs to zero. Taking these issues into account, cluster assignments {y_i}_{i=1}^n for {x_i}_{i=1}^n are determined as

    y_i = argmax_y [ max(0_n, phi~_y) ]_i / ( max(0_n, phi~_y)' 1_n ),

where the max operation for vectors is applied element-wise and [.]_i denotes the i-th element of a vector. Note that we used K phi~_y = lambda_y phi~_y in the above derivation. We call the above method SMI-based clustering (SMIC).
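A minimal NumPy sketch of the analytic SMIC solution described above (sparse local-scaling kernel, the c principal eigenvectors of K, sign fix, element-wise truncation at zero, argmax assignment). The function names, the small numerical safeguards and the toy data are our own; the authors' reference implementation is the MATLAB code linked at the end of Section 2.

```python
import numpy as np

def local_scaling_kernel(X, t=7):
    """Sparse local-scaling kernel of Eq. (7); t is the nearest-neighbour kernel parameter."""
    n = len(X)
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))    # pairwise distances
    order = np.argsort(D, axis=1)                                  # column 0 is the point itself
    sigma = D[np.arange(n), order[:, t]]                           # distance to the t-th neighbour
    nn = np.zeros((n, n), dtype=bool)
    nn[np.arange(n)[:, None], order[:, 1:t + 1]] = True            # t nearest neighbours of each point
    mask = nn | nn.T
    K = np.exp(-D ** 2 / (2.0 * np.outer(sigma, sigma)))
    return np.where(mask, K, 0.0)

def smic(X, c, t=7):
    """Analytic SMIC clustering via the top-c eigenvectors of the kernel matrix."""
    K = local_scaling_kernel(X, t)
    eigvals, eigvecs = np.linalg.eigh(K)                 # ascending eigenvalues
    phi = eigvecs[:, ::-1][:, :c].copy()                 # c principal eigenvectors
    phi *= np.sign(phi.sum(axis=0, keepdims=True) + 1e-12)   # fix the sign ambiguity
    scores = np.maximum(0.0, phi)                        # round negative entries up to zero
    scores /= scores.sum(axis=0, keepdims=True) + 1e-12  # normalise each cluster column
    return scores.argmax(axis=1)                         # cluster assignments y_i

# toy usage: two Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
labels = smic(X, c=2)
```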
Further approximating the expectation with respect to p^*(x) included in Eq. (5) by the empirical average of the samples {x_i}_{i=1}^n, we arrive at the following SMI approximator:

    \widehat{SMI} := \frac{c}{2n} \sum_{y=1}^{c} \alpha_y^\top K^2 \alpha_y - \frac{1}{2},   (8)

where \top denotes the transpose, \alpha_y := (\alpha_{y,1}, ..., \alpha_{y,n})^\top, and K_{i,j} := K(x_i, x_j).

For each cluster y, we maximize \alpha_y^\top K^2 \alpha_y under the unit-norm constraint \alpha_y^\top \alpha_y = 1 (this constraint is not essential since the obtained solution is renormalized later). Since this is the Rayleigh quotient, the maximizer is given by the normalized principal eigenvector of K (Horn & Johnson, 1985). To avoid all the solutions {\alpha_y}_{y=1}^c being reduced to the same principal eigenvector, we impose their mutual orthogonality: \alpha_y^\top \alpha_{y'} = 0 for y \neq y'. Then the solutions are given by the normalized eigenvectors \phi_1, ..., \phi_c associated with the eigenvalues \lambda_1 \geq \cdots \geq \lambda_n \geq 0 of K. Since the sign of \phi_y is arbitrary, we set the sign as

    \widetilde{\phi}_y = \phi_y \times sign(\phi_y^\top 1_n),

where sign(\cdot) denotes the sign of a scalar and 1_n denotes the n-dimensional vector with all ones.

On the other hand, since

    p^*(y) = \int p^*(y|x) p^*(x) dx \approx \frac{1}{n} \sum_{i=1}^{n} p(y|x_i; \alpha) = \frac{1}{n} \alpha_y^\top K 1_n,

and the class-prior probability p^*(y) was set to be uniform, we have the following normalization condition:

    \frac{1}{n} \alpha_y^\top K 1_n = \frac{1}{c}.

Furthermore, probability estimates should be non-negative, which can be achieved by rounding up negative outputs to zero. Taking these issues into account, cluster assignments {y_i}_{i=1}^n for {x_i}_{i=1}^n are determined as

    y_i = argmax_y \frac{[max(0_n, \widetilde{\phi}_y)]_i}{max(0_n, \widetilde{\phi}_y)^\top 1_n},

where the max operation for vectors is applied in the element-wise manner and [\cdot]_i denotes the i-th element of a vector. Note that we used K \widetilde{\phi}_y = \lambda_y \widetilde{\phi}_y in the above derivation.

We call the above method SMI-based clustering (SMIC).
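The SMIC steps above (eigendecomposition of K, sign adjustment, and the rounded, normalized assignment rule) can be sketched as follows. This is our reading of Section 2.3 under the stated uniform-prior assumption, not the authors' reference implementation.

import numpy as np

def smic_cluster(K, c):
    """SMI-based clustering: assign each sample from the top-c eigenvectors
    of the (symmetric) kernel matrix K."""
    n = K.shape[0]
    # Eigenvectors of K, sorted by decreasing eigenvalue.
    eigvals, eigvecs = np.linalg.eigh(K)
    idx = np.argsort(eigvals)[::-1][:c]
    Phi = eigvecs[:, idx]                          # phi_1, ..., phi_c
    # Resolve the sign ambiguity: phi_y <- phi_y * sign(phi_y^T 1_n).
    signs = np.sign(Phi.T @ np.ones(n))
    signs[signs == 0] = 1.0
    Phi = Phi * signs
    # Round negative entries up to zero, normalize each column to sum to one,
    # and assign each sample to the cluster with the largest value.
    P = np.maximum(Phi, 0.0)
    P = P / np.maximum(P.sum(axis=0, keepdims=True), 1e-12)
    return np.argmax(P, axis=1)                    # 0-based cluster labels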
2.4. Kernel Parameter Choice by SMI Maximization

Since the above clustering approach was developed in the framework of SMI maximization, it would be natural to determine the kernel parameters so that SMI is maximized. A direct approach is to use the above SMI estimator \widehat{SMI} also for kernel parameter choice. However, this direct approach is not favorable because \widehat{SMI} is an unsupervised SMI estimator (i.e., SMI is estimated only from the unlabeled samples {x_i}_{i=1}^n). In the model selection stage, however, we have already obtained labeled samples {(x_i, y_i)}_{i=1}^n, and thus supervised estimation of SMI is possible. For supervised SMI estimation, a non-parametric SMI estimator called least-squares mutual information (LSMI) (Suzuki et al., 2009) was shown to achieve the optimal convergence rate. For this reason, we propose to use LSMI for model selection, instead of \widehat{SMI} (8).

LSMI is an estimator of SMI based on paired samples {(x_i, y_i)}_{i=1}^n. The key idea of LSMI is to learn the following density-ratio function,

    r^*(x, y) := \frac{p^*(x, y)}{p^*(x) p^*(y)},   (9)

without going through density estimation of p^*(x, y), p^*(x), and p^*(y). More specifically, let us employ the following density-ratio model:

    r(x, y; \theta) := \sum_{\ell: y_\ell = y} \theta_\ell L(x, x_\ell),   (10)

where L(x, x') is a kernel function with kernel parameter \gamma. In the experiments, we will use the Gaussian kernel:

    L(x, x') = exp\left( -\frac{\|x - x'\|^2}{2 \gamma^2} \right).   (11)

The parameter \theta in the above density-ratio model is learned so that the following squared error is minimized:

    \frac{1}{2} \sum_{y=1}^{c} \int \left( r(x, y; \theta) - r^*(x, y) \right)^2 p^*(x) p^*(y) dx.   (12)

Among the n cluster assignments {y_i}_{i=1}^n, let n_y be the number of samples in cluster y. Let \theta_y be the parameter vector corresponding to the kernel bases {L(x, x_\ell)}_{\ell: y_\ell = y}, i.e., \theta_y is the n_y-dimensional sub-vector of \theta = (\theta_1, ..., \theta_n)^\top consisting of the indices {\ell | y_\ell = y}. Then an empirical and regularized version of the optimization problem (12) is given for each y as follows:

    min_{\theta_y}  \frac{1}{2} \theta_y^\top \widehat{H}^{(y)} \theta_y - \theta_y^\top \widehat{h}^{(y)} + \delta \theta_y^\top \theta_y,   (13)

where \delta (\geq 0) is the regularization parameter, \widehat{H}^{(y)} is the n_y \times n_y matrix, and \widehat{h}^{(y)} is the n_y-dimensional vector defined as

    \widehat{H}^{(y)}_{\ell,\ell'} := \frac{n_y}{n^2} \sum_{i=1}^{n} L(x_i, x_\ell^{(y)}) L(x_i, x_{\ell'}^{(y)}),
    \widehat{h}^{(y)}_\ell := \frac{1}{n} \sum_{i: y_i = y} L(x_i, x_\ell^{(y)}),

where x_\ell^{(y)} is the \ell-th sample in class y (which corresponds to \theta_\ell^{(y)}).

A notable advantage of LSMI is that the solution \widehat{\theta}^{(y)} can be computed analytically as

    \widehat{\theta}^{(y)} = ( \widehat{H}^{(y)} + \delta I )^{-1} \widehat{h}^{(y)}.

Then a density-ratio estimator is obtained analytically as follows:

    \widehat{r}(x, y) = \sum_{\ell=1}^{n_y} \widehat{\theta}_\ell^{(y)} L(x, x_\ell^{(y)}).

The accuracy of the above least-squares density-ratio estimator depends on the choice of the kernel parameter \gamma and the regularization parameter \delta. They can be systematically optimized based on cross-validation as follows (Suzuki et al., 2009). The samples Z = {(x_i, y_i)}_{i=1}^n are divided into M disjoint subsets {Z_m}_{m=1}^M of approximately the same size. Then a density-ratio estimator \widehat{r}_m(x, y) is obtained using Z \ Z_m (i.e., all samples without Z_m), and its out-of-sample error (which corresponds to Eq. (12) without the irrelevant constant) for the hold-out samples Z_m is computed as

    CV_m := \frac{1}{2 |Z_m|^2} \sum_{x, y \in Z_m} \widehat{r}_m(x, y)^2 - \frac{1}{|Z_m|} \sum_{(x, y) \in Z_m} \widehat{r}_m(x, y).

This procedure is repeated for m = 1, ..., M, and the average of the above hold-out error over all m is computed. Finally, the kernel parameter \gamma and the regularization parameter \delta that minimize the average hold-out error are chosen as the most suitable ones.

Based on the expression of SMI given by Eq. (3), an SMI estimator called LSMI is given as follows:

    LSMI := \frac{1}{2n} \sum_{i=1}^{n} \widehat{r}(x_i, y_i) - \frac{1}{2},   (14)

where \widehat{r}(x, y) is the density-ratio estimator obtained above. Since \widehat{r}(x, y) can be computed analytically, LSMI can also be computed analytically.

We use LSMI for model selection of SMIC. More specifically, we compute LSMI as a function of the kernel parameter t of K(x, x') included in the cluster-posterior model (6), and choose the one that maximizes LSMI. A MATLAB implementation of the proposed clustering method is available from 'http://sugiyama-www.cs.titech.ac.jp/~sugi/software/SMIC'.
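For illustration, a sketch of the class-wise LSMI computation of Eqs. (10)-(14) for a fixed pair (\gamma, \delta) is given below; the cross-validation loop over (\gamma, \delta) and the outer loop over candidate kernel parameters t are omitted, and all names are ours.

import numpy as np

def gaussian_kernel(A, B, gamma):
    """L(x, x') = exp(-||x - x'||^2 / (2 gamma^2)) for all pairs of rows."""
    d2 = (np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * gamma**2))

def lsmi(X, y, gamma=1.0, delta=1e-3):
    """LSMI estimate of Eq. (14); theta is solved class-wise in closed form."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    n = X.shape[0]
    r_hat = np.zeros(n)
    for label in np.unique(y):
        in_y = (y == label)
        centers = X[in_y]                          # kernel centers x_l^(y)
        n_y = centers.shape[0]
        L = gaussian_kernel(X, centers, gamma)     # shape (n, n_y)
        H = (n_y / n**2) * (L.T @ L)               # H^(y)
        h = L[in_y].sum(axis=0) / n                # h^(y)
        theta = np.linalg.solve(H + delta * np.eye(n_y), h)
        r_hat[in_y] = L[in_y] @ theta              # r_hat(x_i, y_i) for class y
    return r_hat.sum() / (2.0 * n) - 0.5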
3. Existing Methods

In this section, we qualitatively compare the proposed approach with existing methods.

3.1. Spectral Clustering

The basic idea of spectral clustering (Shi & Malik, 2000) is to first unfold non-linear data manifolds by a spectral embedding method, and then perform k-means in the embedded space. More specifically, given a sample-sample similarity W_{i,j} \geq 0, the minimizer of the following criterion with respect to {\xi_i}_{i=1}^n is obtained under some normalization constraint:

    \sum_{i,j=1}^{n} W_{i,j} \left\| \frac{\xi_i}{\sqrt{D_{i,i}}} - \frac{\xi_j}{\sqrt{D_{j,j}}} \right\|^2,

where D is the diagonal matrix with i-th diagonal element given by D_{i,i} := \sum_{j=1}^{n} W_{i,j}. Consequently, the embedded samples are given by the principal eigenvectors of D^{-1/2} W D^{-1/2}, followed by normalization. Note that spectral clustering was shown to be equivalent to a weighted variant of kernel k-means with some specific kernel (Dhillon et al., 2004).

The performance of spectral clustering depends heavily on the choice of the sample-sample similarity W_{i,j}. Zelnik-Manor & Perona (2005) proposed a useful unsupervised heuristic to determine the similarity in a data-dependent manner, called local scaling:

    W_{i,j} = exp\left( -\frac{\|x_i - x_j\|^2}{2 \sigma_i \sigma_j} \right),

where \sigma_i is a local scaling factor defined as \sigma_i = \|x_i - x_i^{(t)}\|, and x_i^{(t)} is the t-th nearest neighbor of x_i. t is the tuning parameter in the local-scaling similarity, and t = 7 was shown to be useful (Zelnik-Manor & Perona, 2005; Sugiyama, 2007). However, this magic number '7' does not always seem to work well in general.

If D^{-1/2} W D^{-1/2} is regarded as a kernel matrix, spectral clustering will be similar to the proposed SMIC method described in Section 2.3. However, SMIC does not require the post k-means processing since the principal components have a clear interpretation as parameter estimates of the class-posterior model (6). Furthermore, our proposed approach provides a systematic model selection strategy, which is a notable advantage over spectral clustering.

3.2. Blurring Mean-Shift Clustering

Blurring mean-shift (Fukunaga & Hostetler, 1975) is a non-parametric clustering method based on the modes of the data-generating probability density. In the blurring mean-shift algorithm, a kernel density estimator (Silverman, 1986) is used for modeling the data-generating probability density:

    \widehat{p}(x) = \frac{1}{n} \sum_{i=1}^{n} K\left( \|x - x_i\|^2 / \sigma^2 \right),

where K(\xi) is a kernel function such as the Gaussian kernel K(\xi) = e^{-\xi/2}. Taking the derivative of \widehat{p}(x) with respect to x and equating the derivative at x = x_i to zero, we obtain the following update formula for sample x_i (i = 1, ..., n):

    x_i \leftarrow \frac{ \sum_{j=1}^{n} W_{i,j} x_j }{ \sum_{j=1}^{n} W_{i,j} },

where W_{i,j} := K'( \|x_i - x_j\|^2 / \sigma^2 ) and K'(\xi) is the derivative of K(\xi). Each mode of the density is regarded as a representative of a cluster, and each data point is assigned to the cluster to which it converges.

Carreira-Perpiñán (2007) showed that the blurring mean-shift algorithm can be interpreted as an EM algorithm (Dempster et al., 1977), where W_{i,j} / (\sum_{j'=1}^{n} W_{i,j'}) is regarded as the posterior probability of the i-th sample belonging to the j-th cluster. Furthermore, the above update rule can be expressed in matrix form as X \leftarrow XP, where X = (x_1, ..., x_n) is the sample matrix and P := W D^{-1} is the stochastic matrix of the random walk in a graph with adjacency W (Chung, 1997). D is defined as D_{i,i} := \sum_{j=1}^{n} W_{i,j} and D_{i,j} = 0 for i \neq j. If P were independent of X, the above iterative algorithm would correspond to the power method (Golub & Loan, 1996) for finding the leading left eigenvector of P. Then this algorithm is highly related to spectral clustering, which computes the principal eigenvectors of D^{-1/2} W D^{-1/2} (see Section 3.1). Although P depends on X in reality, Carreira-Perpiñán (2006) insisted that this analysis is still valid since P and X quickly reach a quasi-stable state.

An attractive property of blurring mean-shift is that the number of clusters is automatically determined as the number of modes in the probability density estimate. However, this choice depends on the kernel parameter \sigma and there is no systematic way to determine \sigma, which is restrictive compared with the proposed method. Another critical drawback of the blurring mean-shift algorithm is that it eventually converges to a single point (Cheng, 1995), and therefore a sensible stopping criterion is necessary in practice. Although Carreira-Perpiñán (2006) gave a useful heuristic for stopping the iteration, it is not clear whether this heuristic always works well in practice.
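For reference, one blurring mean-shift update x_i \leftarrow \sum_j W_{i,j} x_j / \sum_j W_{i,j} with a Gaussian kernel can be written compactly as follows; this is a plain NumPy illustration, and the stopping criterion, which the paragraph above notes is the delicate part, is deliberately left out.

import numpy as np

def blurring_mean_shift_step(X, sigma):
    """One blurring mean-shift update with Gaussian weights
    W_ij proportional to exp(-||x_i - x_j||^2 / (2 sigma^2)); the constant
    from the kernel derivative cancels in the normalization. X has one
    sample per row."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    return (W @ X) / W.sum(axis=1, keepdims=True)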
4. Experiments

In this section, we experimentally evaluate the performance of the proposed and existing clustering methods.

4.1. Illustration

First, we illustrate the behavior of the proposed method using the artificial datasets described in the top row of Figure 1. The dimensionality is d = 2 and the sample size is n = 200. As a kernel function, we used the sparse local-scaling kernel (7) for SMIC, where the kernel parameter t was chosen from {1, ..., 10} based on LSMI with the Gaussian kernel (11).

Figure 1. Illustrative examples. Cluster assignments obtained by SMIC (top) and model selection curves (SMI estimate versus kernel parameter t) obtained by LSMI (bottom).

The top graphs in Figure 1 depict the cluster assignments obtained by SMIC, and the bottom graphs in Figure 1 depict the model selection curves obtained by LSMI. The results show that SMIC combined with LSMI works well for these toy datasets.

4.2. Performance Comparison

Next, we systematically compare the performance of the proposed and existing clustering methods using various real-world datasets such as images, natural languages, accelerometric sensors, and speech.

We compared the performance of the following methods, which all do not contain open tuning parameters and therefore the experimental results are fair and objective: k-means (KM), spectral clustering with the self-tuning local-scaling similarity (SC) (Zelnik-Manor & Perona, 2005), mean nearest-neighbor clustering (MNN) (Faivishevsky & Goldberger, 2010), MI-based clustering for kernel logistic models (MIC) (Gomes et al., 2010) with model selection by maximum-likelihood MI (Suzuki et al., 2008), and the proposed SMIC.

The clustering performance was evaluated by the adjusted Rand index (ARI) (Hubert & Arabie, 1985) between inferred cluster assignments and the ground-truth categories. Larger ARI values mean better performance, and ARI takes its maximum value 1 when two sets of cluster assignments are identical. In addition, we also evaluated the computational efficiency of each method by the CPU computation time.

We used various real-world datasets including images, natural languages, accelerometric sensors, and speech: the USPS hand-written digit dataset ('digit'), the Olivetti Face dataset ('face'), the 20-Newsgroups dataset ('document'), the SENSEVAL-2 dataset ('word'), the ALKAN dataset ('accelerometry'), and the in-house speech dataset ('speech'). Detailed explanation of the datasets is omitted due to lack of space. For each dataset, the experiment was repeated 100 times with random choice of samples from a pool. Samples were centralized and their variance was normalized in the dimension-wise manner before feeding them to the clustering algorithms.

The experimental results are described in Table 1.

Table 1. Experimental results on real-world datasets (with equal cluster size). The average clustering accuracy (with its standard deviation in brackets) in terms of ARI and the average CPU computation time in seconds over 100 runs are described. The best method in terms of the average ARI and methods judged to be comparable to the best one by the t-test at the significance level 1% are described in boldface. Computation time of MIC and SMIC corresponds to the time for computing a clustering solution after model selection has been carried out; for reference, the computation time for the entire procedure including model selection is given in square brackets.

Digit (d=256, n=5000, c=10)
        KM           SC           MNN          MIC              SMIC
ARI     0.42 (0.01)  0.24 (0.02)  0.44 (0.03)  0.63 (0.08)      0.63 (0.05)
Time    835.9        973.3        318.5        84.4 [3631.7]    14.4 [359.5]

Face (d=4096, n=100, c=10)
        KM           SC           MNN          MIC              SMIC
ARI     0.60 (0.11)  0.62 (0.11)  0.47 (0.10)  0.64 (0.12)      0.65 (0.11)
Time    93.3         2.1          1.0          1.4 [30.8]       0.0 [19.3]

Document (d=50, n=700, c=7)
        KM           SC           MNN          MIC              SMIC
ARI     0.00 (0.00)  0.09 (0.02)  0.09 (0.02)  0.01 (0.02)      0.19 (0.03)
Time    77.8         9.7          6.4          3.4 [530.5]      0.3 [115.3]

Word (d=50, n=300, c=3)
        KM           SC           MNN          MIC              SMIC
ARI     0.04 (0.05)  0.02 (0.01)  0.02 (0.02)  0.04 (0.04)      0.08 (0.05)
Time    6.5          5.9          2.2          1.0 [369.6]      0.2 [203.9]

Accelerometry (d=5, n=300, c=3)
        KM           SC           MNN          MIC              SMIC
ARI     0.49 (0.04)  0.58 (0.14)  0.71 (0.05)  0.57 (0.23)      0.68 (0.12)
Time    0.4          3.3          1.9          0.8 [410.6]      0.2 [92.6]

Speech (d=50, n=400, c=2)
        KM           SC           MNN          MIC              SMIC
ARI     0.00 (0.00)  0.00 (0.00)  0.04 (0.15)  0.18 (0.16)      0.21 (0.25)
Time    0.9          4.2          1.8          0.7 [413.4]      0.3 [179.7]

For the digit dataset, MIC and SMIC outperform KM, SC, and MNN in terms of ARI. The entire computation time of SMIC including model selection is faster than KM, SC, and MIC, and is comparable to MNN, which does not include a model selection procedure.
For the face dataset, SC, MIC, and SMIC are comparable to each other and are better than KM and MNN in terms of ARI. For the document and word datasets, SMIC tends to outperform the other methods. For the accelerometry dataset, MNN and SMIC work better than the other methods. Finally, for the speech dataset, MIC and SMIC work comparably well, and are significantly better than KM, SC, and MNN.

Overall, MIC was shown to work reasonably well, implying that model selection by maximum-likelihood MI is practically useful. SMIC was shown to work even better than MIC, with much less computation time. The accuracy improvement of SMIC over MIC was gained by computing the SMIC solution in closed form without any heuristic initialization. The computational efficiency of SMIC was brought by the analytic computation of the optimal solution and the class-wise optimization of LSMI (see Section 2.4).

The performance of MNN and SC was rather unstable because of the heuristic averaging of the number of nearest neighbors and the heuristic choice of local scaling. In terms of computation time, they are relatively efficient for small- to medium-sized datasets, but they are expensive for the largest dataset, digit. KM was not reliable for the document and speech datasets because of the restriction that the cluster boundaries are linear. For the digit, face, and document datasets, KM was computationally very expensive since a large number of iterations were needed until convergence to a local optimum solution.

Finally, we performed similar experiments under an imbalanced setup, where the sample size of the first class was set to be m times larger than that of the other classes. The results are summarized in Table 2, showing that the performance of all methods tends to be degraded as the degree of imbalance increases. Thus, clustering becomes more challenging if the cluster sizes are imbalanced. Among the compared methods, the proposed SMIC still worked better than the other methods.

Table 2. Experimental results on real-world datasets under the imbalanced setup. ARI values are described in the table. Class imbalance was realized by setting the sample size of the first class m times larger than that of the other classes. The results for m=1 are the same as the ones reported in Table 1.

Digit (d=256, n=5000, c=10)
       KM           SC            MNN          MIC            SMIC
m=1    0.42 (0.01)  0.24 (0.02)   0.44 (0.03)  0.63 (0.08)    0.63 (0.05)
m=2    0.52 (0.01)  0.21 (0.02)   0.43 (0.04)  0.60 (0.05)    0.63 (0.05)

Document (d=50, n=700, c=7)
       KM           SC            MNN          MIC            SMIC
m=1    0.00 (0.00)  0.09 (0.02)   0.09 (0.02)  0.01 (0.02)    0.19 (0.03)
m=2    0.01 (0.01)  0.10 (0.03)   0.10 (0.02)  0.01 (0.02)    0.19 (0.04)
m=3    0.01 (0.01)  0.10 (0.03)   0.09 (0.02)  -0.01 (0.03)   0.16 (0.05)
m=4    0.02 (0.01)  0.09 (0.03)   0.08 (0.02)  -0.00 (0.04)   0.14 (0.05)

Word (d=50, n=300, c=3)
       KM           SC            MNN          MIC            SMIC
m=1    0.04 (0.05)  0.02 (0.01)   0.02 (0.02)  0.04 (0.04)    0.08 (0.05)
m=2    0.00 (0.07)  -0.01 (0.01)  0.01 (0.02)  -0.02 (0.05)   0.03 (0.05)

Accelerometry (d=5, n=300, c=3)
       KM           SC            MNN          MIC            SMIC
m=1    0.49 (0.04)  0.58 (0.14)   0.71 (0.05)  0.57 (0.23)    0.68 (0.12)
m=2    0.48 (0.05)  0.54 (0.14)   0.58 (0.11)  0.49 (0.19)    0.69 (0.16)
m=3    0.49 (0.05)  0.47 (0.10)   0.42 (0.12)  0.42 (0.14)    0.66 (0.20)
m=4    0.49 (0.06)  0.38 (0.11)   0.31 (0.09)  0.40 (0.18)    0.56 (0.22)
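For readers who wish to reproduce this kind of evaluation on their own runs, ARI between inferred assignments and ground-truth labels can be computed, for example, with scikit-learn; the paper does not state which implementation it used, and the label arrays below are only placeholders.

from sklearn.metrics import adjusted_rand_score

# Ground-truth labels vs. inferred cluster assignments for one run.
true_labels = [0, 0, 1, 1, 2, 2]
pred_labels = [1, 1, 0, 0, 2, 2]     # same partition, permuted label names
print(adjusted_rand_score(true_labels, pred_labels))   # prints 1.0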
Overall, the proposed SMIC combined with LSMI was shown to be a useful alternative to existing clustering approaches.

5. Conclusions

In this paper, we proposed a novel information-maximization clustering method, which learns class-posterior probabilities in an unsupervised manner so that the squared-loss mutual information (SMI) between feature vectors and cluster assignments is maximized. The proposed algorithm, called SMI-based clustering (SMIC), allows us to obtain clustering solutions analytically by solving a kernel eigenvalue problem. Thus, unlike the previous information-maximization
MapReduce Design of K-Means Clustering Algorithm

Abstract— A cluster is a collection of data members having similar characteristics. The process of establishing a relation or deriving information from raw data by performing some operations on the data set, such as clustering, is known as data mining. Data collected in practical scenarios is more often than not completely random and unstructured. Hence, there is always a need for analysis of unstructured data sets to derive meaningful information. This is where unsupervised algorithms come into the picture to process unstructured or even semi-structured data sets. K-Means clustering is one such technique used to provide a structure to unstructured data so that valuable information can be extracted. This paper discusses the implementation of the K-Means clustering algorithm over a distributed environment using Apache™ Hadoop. The key to the implementation of the K-Means algorithm is the design of the Mapper and Reducer routines, which is discussed in the later part of the paper. The steps involved in the execution of the K-Means algorithm are also described, based on a small-scale implementation of the K-Means clustering algorithm on an experimental setup, to serve as a guide for practical implementations.

I. INTRODUCTION

Any inference that delineates an argument is an outcome of careful analysis of a huge amount of data related to the subject. So, to facilitate a comprehensive and definitive correlation of data, we apply methods of data mining to group data and derive meaningful conclusions. Data mining thus can be defined as a subject that discovers data relations by applying principles of artificial intelligence, statistics, database systems, and the like. In addition to analysis, this also covers data management, data modeling, visualization, and complexity considerations.

Distributed computing is a technique aimed at solving computational problems mainly by sharing the computation over a network of interconnected systems. Each individual system connected to the network is called a node, and the collection of many nodes that form a network is called a cluster.

Apache™ Hadoop [1] is one such open-source framework that supports distributed computing. It came into existence from Google's MapReduce and Google File System projects. It is a platform that can be used for data-intensive applications which are processed in a distributed environment. It follows a Map and Reduce programming paradigm where the fragmentation of data is the elementary step, and this fragmented data is fed into the distributed network for processing. The processed data is then integrated as a whole. Hadoop [1][2][3] also provides a defined file system for the organization of processed data, namely the Hadoop Distributed File System (HDFS). The Hadoop framework takes node failures into account and handles them automatically, which makes Hadoop a flexible and versatile platform for data-intensive applications. The answer to growing volumes of data that demand fast and effective retrieval of information lies in applying the principles of data mining over a distributed environment such as Hadoop. This not only reduces the time required for completion of the operation but also reduces the individual system requirements for computation of large volumes of data.
Starting from the Google File System [4] and the MapReduce concept, Hadoop has taken the world of distributed computing to a new level, with various versions of Hadoop now in existence and also under research and development, a few of which include Hive [5], ZooKeeper [6], and Pig [7]. The data intensity today in any field is growing at a brisk pace, giving rise to the implementation of complex principles of data mining to derive meaningful information from the data.

The MapReduce structure gives great flexibility and speed to execute a process over a distributed framework. Unstructured data analysis is one of the most challenging aspects of data mining and involves the implementation of complex algorithms.

The Hadoop framework is designed to compute thousands of petabytes of data. This is primarily done by downscaling and consequent integration of data and by reducing the configuration demands of the systems participating in processing such huge volumes of data. The workload is shared by all the computers connected on the network, which increases the efficiency and overall performance of the network while facilitating brisk processing of voluminous data.

This paper discusses the K-Means algorithm design over a span of sections, starting with the Introduction. Section II deals with cluster analysis, Section III talks about K-Means clustering, Section IV discusses the MapReduce paradigm, Section V discusses K-Means clustering using the MapReduce paradigm and the algorithm to design the MapReduce routines, Section VI talks about the experimental setup, Section VII talks about system deployment, Section VIII about the implementation of K-Means clustering on a distributed environment, and Section IX concludes the paper.

II. CLUSTER ANALYSIS

Clustering basically deals with grouping of objects such that each group consists of similar or related objects. The main idea behind clustering is to maximize the intra-cluster similarities and minimize the inter-cluster similarities.

The data set may have objects with more than one attribute. The classification is done by selecting the appropriate attribute and relating it to a carefully selected reference, and this is solely dependent on the field that concerns the user. Classification therefore plays a definitive role in establishing a relation among the various items in a semi-structured or unstructured data set.

Cluster analysis is a broad subject and hence there are abundant clustering algorithms available to group data sets. Very common methods of clustering involve computing distance, density, and intervals or a particular statistical distribution. Depending on the requirements and data sets, we apply the appropriate clustering algorithm to extract information from them.

Clustering has a broad spectrum, and clustering methods, on the basis of their implementation, can be grouped into:
• Connectivity technique. Example: hierarchical clustering
• Centroid technique. Example: K-Means clustering
• Distribution technique. Example: expectation maximization
• Density technique. Example: DBSCAN
• Subspace technique. Example: co-clustering

Advantages of data clustering:
• Provides a quick and meaningful overview of data.
• Improves the efficiency of data mining by combining data with similar characteristics, so that a generalization can be derived for each cluster and processing is done batch-wise rather than individually.
• Gives a good understanding of the unusual similarities that may emerge once the clustering is complete.
• Provides a good base for nearest-neighbor search and the exploration of deeper relations.

III. K-MEANS CLUSTERING

K-Means clustering is a method used to classify semi-structured or unstructured data sets. It is one of the most common and effective methods to classify data because of its simplicity and its ability to handle voluminous data sets.

It accepts the number of clusters and the initial set of centroids as parameters. The distance of each item in the data set is calculated from each of the cluster centroids. The item is then assigned to the cluster whose centroid is nearest to it, and the centroid of that cluster is recalculated. The most commonly used way of grouping the items of a data set in K-Means clustering is to calculate the distance of each point from the chosen mean. This distance is usually the Euclidean distance; although other distance measures exist, it is the most common metric for comparison of points.

Suppose two points are defined as P = (x_1(P), x_2(P), x_3(P), ...) and Q = (x_1(Q), x_2(Q), x_3(Q), ...). The distance is calculated by the formula

    d(P, Q) = \sqrt{ \sum_i ( x_i(P) - x_i(Q) )^2 }.

The next important parameter is the cluster centroid: the point whose coordinates correspond to the mean of the coordinates of all the points in the cluster.

The data set may (or, better said, will) have certain items that are not related to any cluster and hence cannot be classified under them. Such points are referred to as outliers, and more often than not they correspond to the extremes of the data set, depending on whether their values are extremely high or low.

The main objective of the algorithm is to obtain a minimal squared difference between the cluster centroids and the items in the data set,

    \sum_{j=1}^{K} \sum_{x_i \in S_j} \| x_i - c_j \|^2,

where x_i is the value of an item and c_j is the value of the centroid of cluster S_j.

The algorithm proceeds as follows (a code sketch is given after this list):
• The required number of clusters must be chosen. We refer to the number of clusters as 'K'.
• The next step is to choose distant and distinct centroids for each of the chosen set of K clusters.
• The third step is to consider each element of the given set and compare its distance to all the centroids of the K clusters. Based on the calculated distance, the element is added to the cluster whose centroid is nearest to it.
• The cluster centroids are recalculated after each assignment or a set of assignments.
• This is an iterative method, and the centroids are continuously updated until they stabilize.
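A minimal from-scratch sketch of these steps in Python/NumPy is shown below; the random initialization, the empty-cluster handling, and the convergence test are our simplifying assumptions rather than part of the paper.

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-Means: assign each point to its nearest centroid, then
    recompute each centroid as the mean of its assigned points."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(X.shape[0], size=k, replace=False)]
    for _ in range(n_iter):
        # Euclidean distance of every point to every centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = np.argmin(dists, axis=1)
        # Recompute centroids; keep the old one if a cluster becomes empty.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):   # stop when centroids stabilize
            break
        centroids = new_centroids
    return labels, centroids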
IV. MAPREDUCE PARADIGM

MapReduce is a programming paradigm used for computation over large data sets. A standard MapReduce process computes terabytes or even petabytes of data on interconnected systems forming a cluster of nodes. A MapReduce implementation splits the huge data set into chunks that are independently fed to the nodes, so the number and size of the chunks depend on the number of nodes connected to the network. The programmer designs a Map function that uses a (key, value) pair for computation. The Map function results in the creation of another set of data in the form of (key, value) pairs, known as the intermediate data set. The programmer also designs a Reduce function that combines the value elements of the intermediate (key, value) pairs sharing the same intermediate key [10].

The Map and Reduce steps are separate and distinct, and complete freedom is given to the programmer to design them. Each of the Map and Reduce steps is performed in parallel on pairs of (key, value) data members. Thereby the program is segmented into two distinct and well-defined stages, namely Map and Reduce. The Map stage involves execution of a function on a given data set in the form of (key, value) pairs and generates the intermediate data set. The generated intermediate data set is then organized for the implementation of the Reduce operation. Data transfer takes place between the Map and Reduce functions. The Reduce function compiles all the data sets bearing a particular key, and this process is repeated for all the various key values. The final output produced by the Reduce call is also a data set of (key, value) pairs. An important thing to note is that the execution of the Reduce function is possible only after the Mapping process is complete.

Each MapReduce framework has a single Job Tracker and multiple Task Trackers. Each node connected to the network can act as a slave Task Tracker. Issues such as division of data among the various nodes, task scheduling, node failures, task-failure management, communication between nodes, and monitoring of task progress are all taken care of by the master node. The input and output data are stored in the file system.

V. K-MEANS CLUSTERING USING MAPREDUCE

The first step in designing the MapReduce routines for K-Means is to define and handle the input and output of the implementation. The input is given as a <key, value> pair, where 'key' is the cluster center and 'value' is the serializable implementation of a vector in the data set. The prerequisite for implementing the Map and Reduce routines is to have two files: one that houses the clusters with their centroids, and another that houses the vectors to be clustered.

Once the initial set of clusters and chosen centroids is defined, and the data vectors to be clustered are properly organized in the two files, the clustering of data using the K-Means technique can be accomplished by following the algorithm used to design the Map and Reduce routines.

The initial set of centers is stored in the input directory of HDFS prior to the Map routine call, and these centers form the 'key' field in the <key, value> pair. The instructions required to compute the distance between a given data vector and a cluster center, fed as a <key, value> pair, are coded in the Mapper routine. The Mapper is structured in such a way that it computes the distance between the vector value and each of the cluster centers in the cluster set, while simultaneously keeping track of the cluster to which the given vector is closest. Once the computation of distances is complete, the vector is assigned to the nearest cluster.

Once the Mapper has run, the given vector is assigned to the cluster to which it is most closely related. After the assignment is done, the centroid of that particular cluster is recalculated. The recalculation is done by the Reduce routine, which also restructures the clusters to prevent the creation of clusters with extreme sizes, i.e., clusters having too few data vectors or clusters having too many data vectors. Finally, once the centroid of the given cluster is updated, the new set of vectors and clusters is rewritten to disk and is ready for the next iteration.

Having established the input, output, and functionality of the Map and Reduce routines, the Map and Reduce classes are designed by following the algorithm discussed below.
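The paper's own Map and Reduce classes are written in Java, and their listing is not reproduced in this excerpt; the following Hadoop-Streaming-style Python sketch illustrates the same design under assumed conventions (comma-separated input vectors, tab-separated intermediate records, and a centroid side file named centroids.txt). None of these names come from the paper.

#!/usr/bin/env python3
"""Hadoop-Streaming-style sketch of the K-Means Mapper and Reducer."""
import sys


def load_centroids(path="centroids.txt"):
    """Current centroids, one comma-separated vector per line (assumed layout)."""
    with open(path) as f:
        return [[float(v) for v in line.split(",")] for line in f if line.strip()]


def mapper():
    """Emit <nearest-cluster-id, vector> for every input vector on stdin."""
    centroids = load_centroids()
    for line in sys.stdin:
        x = [float(v) for v in line.strip().split(",")]
        dists = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centroids]
        cluster_id = dists.index(min(dists))
        print(f"{cluster_id}\t{','.join(map(str, x))}")


def reducer():
    """Average all vectors sharing a key to obtain the new centroid.
    Hadoop Streaming delivers reducer input grouped and sorted by key."""
    current_key, sums, count = None, None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        x = [float(v) for v in value.split(",")]
        if key != current_key:
            if current_key is not None:
                print(f"{current_key}\t{','.join(str(s / count) for s in sums)}")
            current_key, sums, count = key, [0.0] * len(x), 0
        sums = [s + v for s, v in zip(sums, x)]
        count += 1
    if current_key is not None:
        print(f"{current_key}\t{','.join(str(s / count) for s in sums)}")


if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()

The driver would rerun this pair of steps, replacing centroids.txt with the reducer output after each iteration, until the centroids stop changing.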
VI. EXPERIMENTAL SETUP

The experimental setup consists of three nodes sharing a private LAN via a managed switch. One of the nodes is used as the master, which supervises the data and the flow of control over all the other nodes in the Hadoop cluster. The nodes used in the implementation are of similar configuration; all of them run on an Intel Core 2 Duo processor.

The cluster used for the implementation has 7 nodes connected over a private LAN. All the nodes use the Ubuntu operating system with Java JDK 7 and SSH installed. The Apache™ Hadoop bundle available on the official website was used to install Hadoop. The single-node and the subsequent multi-node setup were accomplished by following the installation guides found at [8][9][10].

Figure 1: Experimental setup of 7 nodes.

VII. SYSTEM DEPLOYMENT

The visual deployment of the system is depicted in Figure 2. The data set is fed to the master node. The data set, along with the job, is distributed among the nodes in the network. The Map function is called for the input in the form of <key, value> pairs. Once the assignment is complete, the Reduce function recalculates the centroids and makes the data set ready for the subsequent iterations.

Figure 2: System deployment.

VIII. IMPLEMENTATION OF THE K-MEANS ALGORITHM ON A DISTRIBUTED NETWORK

The implementation of the K-Means clustering algorithm on a distributed network involves the following steps:
• The input data vectors and the initial set of chosen cluster centers are stored in the input directory of HDFS, and an output directory is created to house the result of the clustered data.
• SSH into the master node; with the input directory containing the vectors and clusters, run the program that is structured on the algorithm used to define the Map and Reduce routines discussed in Section V of this paper.
• Based on the iterations and the value of 'K', the resultant clusters of data can be found in the output directory, which can be echoed to the output directory in HDFS.

IX. CONCLUSION AND ONGOING RESEARCH WORK

The volume of information exchange in today's world involves a huge quantity of data processing. Through this paper we have discussed the implementation of the K-Means clustering algorithm over a distributed network. Not only does this approach provide a robust and efficient system for grouping data with similar characteristics, it also reduces the cost of processing such huge volumes of data.

Data mining is one of the most important tools in information retrieval. The rate of information exchange today is growing spectacularly fast, and so there arises a need to process huge volumes of data. To respond to this need, we try to implement the vital algorithms used for data mining on a distributed or parallel environment, to reduce the operational resources and increase the speed of operation.

No matter how good anything is, there is always scope for improvement. The primary area of improvement in any algorithm is its accuracy. Research work has gone on since the first implementation of the K-Means algorithm in the 1970s.
Another area of improvement particular to the K-Means algorithm is the selection of the initial set of centroids. These two areas are of primary concern, besides a few secondary issues. One is reducing the number of iterations required to complete a task, which is mainly achieved by choosing the right set of initial centroids [12]. Another issue is improving the algorithm's scalability to support processing of high-volume data sets; parallel implementation of the K-Means algorithm is an area of research aimed at addressing this issue. One more important issue is handling the outliers: these points are difficult to cluster and usually make up the extreme cases, yet they cannot simply be ignored, so finding different ways of clustering such outliers is another research hotspot. The number of iterations required to cluster data can be reduced by choosing the right initial set of centroids.

ACKNOWLEDGEMENT

We would like to thank all the members of the Hadoop team, present and past, at RV College of Engineering for their hard work and valuable contribution towards the experimental setup. We also would like to thank the technical staff in the research lab at the Department of Computer Science and Engineering, RV College of Engineering, for their ever-available helping hand.

REFERENCES

[1] Apache Hadoop. /
[2] J. Venner, Pro Hadoop. Apress, June 22, 2009.
[3] T. White, Hadoop: The Definitive Guide. O'Reilly Media, Yahoo! Press, June 5, 2009.
[4] S. Ghemawat, H. Gobioff, S. Leung, "The Google file system," in Proc. of ACM Symposium on Operating Systems Principles, Lake George, NY, Oct 2003, pp. 29-43.
[5] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, R. Murthy, "Hive - A Warehousing Solution Over a Map-Reduce Framework," in Proc. of Very Large Data Bases, vol. 2, no. 2, August 2009, pp. 1626-1629.
[6] F. P. Junqueira, B. C. Reed, "The life and times of a zookeeper," in Proc. of the 28th ACM Symposium on Principles of Distributed Computing, Calgary, AB, Canada, August 10-12, 2009.
[7] A. Gates, O. Natkovich, S. Chopra, P. Kamath, S. Narayanam, C. Olston, B. Reed, S. Srinivasan, U. Srivastava, "Building a High-Level Dataflow System on top of MapReduce: The Pig Experience," in Proc. of Very Large Data Bases, vol. 2, no. 2, 2009, pp. 1414-1425.
[8] Description of single-node cluster setup at: /tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/, visited on 21st January, 2012.
[9] Description of multi-node cluster setup at: /tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/, visited on 21st January, 2012.
[10] Anjan K Koundinya, Srinath N K, A K Sharma, Kiran Kumar, Madhu M N and Kiran U Shanbagh, "Map/Reduce Design and Implementation of Apriori Algorithm for Handling Voluminous Data-Sets," Advanced Computing: An International Journal (ACIJ), Vol. 3, No. 6, November 2012.
[11] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," Communications of the ACM, 51(1):107-113, 2008.
A Plain-Language Explanation of Dirichlet Clustering

Dirichlet clustering, also known as Dirichlet process mixture modeling, is a probabilistic model used for clustering data. It is named after the mathematician Peter Gustav Lejeune Dirichlet, who made significant contributions to the field of mathematics.

In Dirichlet clustering, the data is assumed to be generated from a mixture of distributions, where each cluster is associated with a probability distribution. The number of clusters is not fixed in advance, but is determined by the data itself. This is a key advantage of Dirichlet clustering, as it allows for automatic determination of the number of clusters.

The Dirichlet distribution is used as a prior distribution over the mixture components. This distribution is characterized by a vector of concentration parameters, which control the shape of the distribution. The concentration parameters determine the strength of the prior belief about the number of clusters: lower concentration parameters favor a smaller number of effective clusters, while higher concentration parameters allow a larger number of clusters to be used.

To perform Dirichlet clustering, an algorithm based on the Dirichlet process is used. This algorithm iteratively assigns data points to clusters based on the posterior probabilities of the data points belonging to each cluster, and it updates the parameters of the mixture components based on the assigned data points.

Dirichlet clustering has various applications in machine learning and data analysis. It can be used for document clustering, where each document is represented as a mixture of topics. It can also be used for image segmentation, where each pixel is assigned to a cluster based on its color or texture. Additionally, Dirichlet clustering can be used for customer segmentation in marketing, where each customer is assigned to a cluster based on their purchasing behavior.
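In practice, a truncated Dirichlet process mixture can be fitted, for example, with scikit-learn's BayesianGaussianMixture using a Dirichlet-process weight prior; the data below is a synthetic placeholder and the concentration value is only illustrative.

import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Placeholder data: two well-separated Gaussian blobs.
X = np.vstack([rng.normal(0, 1, size=(100, 2)),
               rng.normal(6, 1, size=(100, 2))])

# Truncated Dirichlet-process mixture: up to 10 components, but the
# concentration prior lets unused components shrink to near-zero weight.
dpgmm = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=0.1,   # smaller -> fewer effective clusters
    random_state=0,
).fit(X)

labels = dpgmm.predict(X)
print(np.round(dpgmm.weights_, 3))    # most component weights collapse toward 0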
EM Clustering Algorithm Workflow

Clustering algorithms are widely used in data analysis to group similar data points together based on certain criteria. These algorithms play a crucial role in various fields such as machine learning, data mining, and pattern recognition. One of the most commonly used clustering algorithms is the K-means algorithm, which is known for its simplicity and efficiency. The K-means algorithm works by partitioning a dataset into K clusters, where each data point belongs to the cluster with the nearest mean.
The workflow of the K-means algorithm begins with randomly selecting K initial cluster centroids. Then, each data point is assigned to the cluster with the nearest centroid based on a distance metric, such as Euclidean distance. After the assignment step, the centroids of the clusters are updated by taking the average of all data points assigned to each cluster. This process iterates until the centroids converge, and the clustering is then considered stable.
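The same workflow is also available off the shelf, for example in scikit-learn's KMeans, which handles initialization, assignment, update, and the convergence check internally; the data used here is a random placeholder.

import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))   # placeholder data
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_[:10])          # cluster assignment of the first 10 points
print(km.cluster_centers_)      # final centroids after convergence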