
A Dynamic Data Stream Clustering Algorithm Based on Probability and Exemplar

Journal of Computer Research and Development (计算机研究与发展), 53(5): 1029-1042, 2016

DOI: 10.7544/issn1000-1239.2016.20148428

Received: 2014-12-23; Revised: 2015-06-02

Funding: This work was supported by the National Natural Science Foundation of China (6127220), the Project of Shandong Province Higher Educational Science and Technology Program (J14LN05), and Jiangsu Graduate Student Innovation Projects (KYLX_1124).

Bi Anqi (毕安琪) 1, Dong Aimei (董爱美) 1,2, and Wang Shitong (王士同) 1

1 (School of Digital Media, Jiangnan University, Wuxi, Jiangsu 214122)

2 (School of Information, Qilu University of Technology, Jinan 250353)

(angela.sue.bi@gmail.com)

Abstract  To address dynamic clustering of data streams, we propose an efficient probability drifting dynamic α-expansion (PDDE) clustering algorithm. We first derive, within a probabilistic framework, a unified target function for the affinity propagation (AP) and enhanced α-expansion move (EEM) clustering algorithms, which yields a probabilistic exemplar-based clustering algorithm. Building on the probability between each data point and its exemplar, we then propose PDDE for data streams in which the current data points are similar to the previous data points. The algorithm keeps the exemplars of the current data as close as possible to those of the previous data, so that the clustering result on the current data is at least comparable to that on the previous data. It handles two kinds of similarity between current and previous data: 1) the current data share some points with the previous data, and the remaining points are similar to the previous data; 2) the two share no points, but the current data are similar to the previous data. Experiments on synthetic datasets (D31, Birch3) and real-world datasets (Forest Covertype, KDD CUP 99) indicate that PDDE can cluster data streams with stable performance, and show its advantage over both AP and EEM.

Key words  data stream; energy function; probability; optimization algorithm; dynamic clustering
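The abstract does not spell out the target function that AP and EEM share. As a rough illustration only, the sketch below assumes the commonly used exemplar-based energy: each point pays a cost for the exemplar it chooses, each exemplar pays a fixed cost for being one, and an assignment is invalid if a chosen exemplar does not choose itself. The function name `exemplar_energy`, the toy points, and the value of `exemplar_cost` are hypothetical. This is not the paper's probabilistic formulation and not the PDDE algorithm.

```python
import math

def exemplar_energy(cost, labels):
    """Energy of an exemplar assignment (lower is better).

    cost[i][j] : cost of point i choosing point j as its exemplar;
                 cost[k][k] is the cost of making k an exemplar
                 (an assumed stand-in for AP's negated "preference").
    labels[i]  : index of the exemplar chosen by point i.
    """
    # Validity constraint: every chosen exemplar must choose itself.
    for k in set(labels):
        if labels[k] != k:
            return math.inf
    return sum(cost[i][labels[i]] for i in range(len(labels)))

# Hypothetical toy data: four points on a line, squared distance as cost.
points = [0.0, 1.0, 10.0, 11.0]
exemplar_cost = 2.0  # assumed cost of declaring any point an exemplar
cost = [[exemplar_cost if i == j else (p - q) ** 2
         for j, q in enumerate(points)]
        for i, p in enumerate(points)]

two_clusters = exemplar_energy(cost, [0, 0, 2, 2])  # exemplars: points 0 and 2
one_cluster = exemplar_energy(cost, [0, 0, 0, 0])   # single exemplar: point 0
invalid = exemplar_energy(cost, [1, 0, 2, 2])       # point 0 is chosen but picks 1
```

Under these assumptions the two-exemplar assignment has lower energy than the single-exemplar one, and the inconsistent assignment is rejected with infinite energy. An optimizer such as message passing (AP) or expansion moves (EEM) searches over `labels` to minimize this quantity.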