Sparse Subspace Clustering

格式：pdf
大小：901.30 KB
文档页数：8

下载文档原格式

求解矩阵补全问题的三分解方法

求解矩阵补全问题的三分解方法常彩霞;王永丽【摘要】在机器学习、图像处理等研究领域,矩阵补全主要用于恢复一个完整的低秩矩阵.考虑到计算迭代过程中,每一步均需要进行奇异值分解,若矩阵维数过大.则计算复杂度非常高.为降低计算复杂度,本文将矩阵三分解方法应用到鲁棒矩阵补全问题中,并应用交替方向乘子法对其进行求解.最后利用人脸识别的实际数据,通过数值实验验证了方法的有效性.【期刊名称】《山东科技大学学报（自然科学版）》【年(卷),期】2018(037)004【总页数】6页(P77-82)【关键词】矩阵补全;三分解方法;交替方向乘子法;人脸识别【作者】常彩霞;王永丽【作者单位】山东科技大学数学与系统科学学院,山东青岛266590;山东科技大学数学与系统科学学院,山东青岛266590【正文语种】中文【中图分类】TN957.52近年来，矩阵补全问题广泛应用于图像处理、计算机视觉、数据挖掘、模式识别和机器学习等领域。

作为信号与图像处理技术的一个强大的新兴分支，矩阵补全已成为继压缩感知之后的又一种重要的信号获取工具[1]。

一般来说，要根据信号的部分采样元素来精确地恢复出所有元素是非常困难甚至是不可能的。

但是当信号在一组基下是稀疏的且满足一定条件时，压缩感知理论证实了可以通过求解l1最小化问题来精确地恢复所有元素[2]。

类似的，当信号用矩阵形式表示时，要根据其部分元素来恢复所有丢失元素也是很困难的。

针对这一问题，Candès等[3]证明了当矩阵的奇异值具有稀疏性且采样数目满足一定条件时，大多数矩阵可以通过求解核范数最小化问题来精确地恢复所有元素。

由矩阵的部分元素恢复所有元素这一问题称为矩阵补全问题，著名的Netflix问题便是一个典型的矩阵补全问题[4]。

给定不完整的低秩缺失矩阵W∈Rm×n，矩阵补全问题可以描述为如下优化问题：(1)其中A∈Rm×n为待补全的矩阵，Ω是A的p个已知元素的指标集。

人工智能领域中英文专有名词汇总

名词解释中英文对比<using_information_sources> social networks 社会网络abductive reasoning 溯因推理action recognition(行为识别)active learning(主动学习)adaptive systems 自适应系统adverse drugs reactions(药物不良反应)algorithm design and analysis(算法设计与分析) algorithm(算法)artificial intelligence 人工智能association rule(关联规则)attribute value taxonomy 属性分类规范automomous agent 自动代理automomous systems 自动系统background knowledge 背景知识bayes methods(贝叶斯方法)bayesian inference(贝叶斯推断)bayesian methods(bayes 方法)belief propagation(置信传播)better understanding 内涵理解big data 大数据big data(大数据)biological network(生物网络)biological sciences(生物科学)biomedical domain 生物医学领域biomedical research(生物医学研究)biomedical text(生物医学文本)boltzmann machine(玻尔兹曼机)bootstrapping method 拔靴法case based reasoning 实例推理causual models 因果模型citation matching (引文匹配)classification (分类)classification algorithms(分类算法)clistering algorithms 聚类算法cloud computing(云计算)cluster-based retrieval (聚类检索)clustering (聚类)clustering algorithms(聚类算法)clustering 聚类cognitive science 认知科学collaborative filtering (协同过滤)collaborative filtering(协同过滤)collabrative ontology development 联合本体开发collabrative ontology engineering 联合本体工程commonsense knowledge 常识communication networks(通讯网络)community detection(社区发现)complex data(复杂数据)complex dynamical networks(复杂动态网络)complex network(复杂网络)complex network(复杂网络)computational biology 计算生物学computational biology(计算生物学)computational complexity(计算复杂性) computational intelligence 智能计算computational modeling(计算模型)computer animation(计算机动画)computer networks(计算机网络)computer science 计算机科学concept clustering 概念聚类concept formation 概念形成concept learning 概念学习concept map 概念图concept model 概念模型concept modelling 概念模型conceptual model 概念模型conditional random field(条件随机场模型) conjunctive quries 合取查询constrained least squares (约束最小二乘) convex programming(凸规划)convolutional neural networks(卷积神经网络) customer relationship management(客户关系管理) data analysis(数据分析)data analysis(数据分析)data center(数据中心)data clustering (数据聚类)data compression(数据压缩)data envelopment analysis (数据包络分析)data fusion 数据融合data generation(数据生成)data handling(数据处理)data hierarchy (数据层次)data integration(数据整合)data integrity 数据完整性data intensive computing(数据密集型计算)data management 数据管理data management(数据管理)data management(数据管理)data miningdata mining 数据挖掘data model 数据模型data models(数据模型)data partitioning 数据划分data point(数据点)data privacy(数据隐私)data security(数据安全)data stream(数据流)data streams(数据流)data structure( 数据结构)data structure(数据结构)data visualisation(数据可视化)data visualization 数据可视化data visualization(数据可视化)data warehouse(数据仓库)data warehouses(数据仓库)data warehousing(数据仓库)database management systems(数据库管理系统)database management(数据库管理)date interlinking 日期互联date linking 日期链接Decision analysis(决策分析)decision maker 决策者decision making (决策)decision models 决策模型decision models 决策模型decision rule 决策规则decision support system 决策支持系统decision support systems (决策支持系统) decision tree(决策树)decission tree 决策树deep belief network(深度信念网络)deep learning(深度学习)defult reasoning 默认推理density estimation(密度估计)design methodology 设计方法论dimension reduction(降维) dimensionality reduction(降维)directed graph(有向图)disaster management 灾害管理disastrous event(灾难性事件)discovery(知识发现)dissimilarity (相异性)distributed databases 分布式数据库distributed databases(分布式数据库) distributed query 分布式查询document clustering (文档聚类)domain experts 领域专家domain knowledge 领域知识domain specific language 领域专用语言dynamic databases(动态数据库)dynamic logic 动态逻辑dynamic network(动态网络)dynamic system(动态系统)earth mover's distance(EMD 距离) education 教育efficient algorithm(有效算法)electric commerce 电子商务electronic health records(电子健康档案) entity disambiguation 实体消歧entity recognition 实体识别entity recognition(实体识别)entity resolution 实体解析event detection 事件检测event detection(事件检测)event extraction 事件抽取event identificaton 事件识别exhaustive indexing 完整索引expert system 专家系统expert systems(专家系统)explanation based learning 解释学习factor graph(因子图)feature extraction 特征提取feature extraction(特征提取)feature extraction(特征提取)feature selection (特征选择)feature selection 特征选择feature selection(特征选择)feature space 特征空间first order logic 一阶逻辑formal logic 形式逻辑formal meaning prepresentation 形式意义表示formal semantics 形式语义formal specification 形式描述frame based system 框为本的系统frequent itemsets(频繁项目集)frequent pattern(频繁模式)fuzzy clustering (模糊聚类)fuzzy clustering (模糊聚类)fuzzy clustering (模糊聚类)fuzzy data mining(模糊数据挖掘)fuzzy logic 模糊逻辑fuzzy set theory(模糊集合论)fuzzy set(模糊集)fuzzy sets 模糊集合fuzzy systems 模糊系统gaussian processes(高斯过程)gene expression data 基因表达数据gene expression(基因表达)generative model(生成模型)generative model(生成模型)genetic algorithm 遗传算法genome wide association study(全基因组关联分析) graph classification(图分类)graph classification(图分类)graph clustering(图聚类)graph data(图数据)graph data(图形数据)graph database 图数据库graph database(图数据库)graph mining(图挖掘)graph mining(图挖掘)graph partitioning 图划分graph query 图查询graph structure(图结构)graph theory(图论)graph theory(图论)graph theory(图论)graph theroy 图论graph visualization(图形可视化)graphical user interface 图形用户界面graphical user interfaces(图形用户界面)health care 卫生保健health care(卫生保健)heterogeneous data source 异构数据源heterogeneous data(异构数据)heterogeneous database 异构数据库heterogeneous information network(异构信息网络) heterogeneous network(异构网络)heterogenous ontology 异构本体heuristic rule 启发式规则hidden markov model(隐马尔可夫模型)hidden markov model(隐马尔可夫模型)hidden markov models(隐马尔可夫模型) hierarchical clustering (层次聚类) homogeneous network(同构网络)human centered computing 人机交互技术human computer interaction 人机交互human interaction 人机交互human robot interaction 人机交互image classification(图像分类)image clustering (图像聚类)image mining( 图像挖掘)image reconstruction(图像重建)image retrieval (图像检索)image segmentation(图像分割)inconsistent ontology 本体不一致incremental learning(增量学习)inductive learning (归纳学习)inference mechanisms 推理机制inference mechanisms(推理机制)inference rule 推理规则information cascades(信息追随)information diffusion(信息扩散)information extraction 信息提取information filtering(信息过滤)information filtering(信息过滤)information integration(信息集成)information network analysis(信息网络分析) information network mining(信息网络挖掘) information network(信息网络)information processing 信息处理information processing 信息处理information resource management (信息资源管理) information retrieval models(信息检索模型) information retrieval 信息检索information retrieval(信息检索)information retrieval(信息检索)information science 情报科学information sources 信息源information system( 信息系统)information system(信息系统)information technology(信息技术)information visualization(信息可视化)instance matching 实例匹配intelligent assistant 智能辅助intelligent systems 智能系统interaction network(交互网络)interactive visualization(交互式可视化)kernel function(核函数)kernel operator (核算子)keyword search(关键字检索)knowledege reuse 知识再利用knowledgeknowledgeknowledge acquisitionknowledge base 知识库knowledge based system 知识系统knowledge building 知识建构knowledge capture 知识获取knowledge construction 知识建构knowledge discovery(知识发现)knowledge extraction 知识提取knowledge fusion 知识融合knowledge integrationknowledge management systems 知识管理系统knowledge management 知识管理knowledge management(知识管理)knowledge model 知识模型knowledge reasoningknowledge representationknowledge representation(知识表达) knowledge sharing 知识共享knowledge storageknowledge technology 知识技术knowledge verification 知识验证language model(语言模型)language modeling approach(语言模型方法) large graph(大图)large graph(大图)learning(无监督学习)life science 生命科学linear programming(线性规划)link analysis (链接分析)link prediction(链接预测)link prediction(链接预测)link prediction(链接预测)linked data(关联数据)location based service(基于位置的服务) loclation based services(基于位置的服务) logic programming 逻辑编程logical implication 逻辑蕴涵logistic regression(logistic 回归)machine learning 机器学习machine translation(机器翻译)management system(管理系统)management( 知识管理)manifold learning(流形学习)markov chains 马尔可夫链markov processes(马尔可夫过程)matching function 匹配函数matrix decomposition(矩阵分解)matrix decomposition(矩阵分解)maximum likelihood estimation(最大似然估计)medical research(医学研究)mixture of gaussians(混合高斯模型)mobile computing(移动计算)multi agnet systems 多智能体系统multiagent systems 多智能体系统multimedia 多媒体natural language processing 自然语言处理natural language processing(自然语言处理) nearest neighbor (近邻)network analysis( 网络分析)network analysis(网络分析)network analysis(网络分析)network formation(组网)network structure(网络结构)network theory(网络理论)network topology(网络拓扑)network visualization(网络可视化)neural network(神经网络)neural networks (神经网络)neural networks(神经网络)nonlinear dynamics(非线性动力学)nonmonotonic reasoning 非单调推理nonnegative matrix factorization (非负矩阵分解) nonnegative matrix factorization(非负矩阵分解) object detection(目标检测)object oriented 面向对象object recognition(目标识别)object recognition(目标识别)online community(网络社区)online social network(在线社交网络)online social networks(在线社交网络)ontology alignment 本体映射ontology development 本体开发ontology engineering 本体工程ontology evolution 本体演化ontology extraction 本体抽取ontology interoperablity 互用性本体ontology language 本体语言ontology mapping 本体映射ontology matching 本体匹配ontology versioning 本体版本ontology 本体论open government data 政府公开数据opinion analysis(舆情分析)opinion mining(意见挖掘)opinion mining(意见挖掘)outlier detection(孤立点检测)parallel processing(并行处理)patient care(病人医疗护理)pattern classification(模式分类)pattern matching(模式匹配)pattern mining(模式挖掘)pattern recognition 模式识别pattern recognition(模式识别)pattern recognition(模式识别)personal data(个人数据)prediction algorithms(预测算法)predictive model 预测模型predictive models(预测模型)privacy preservation(隐私保护)probabilistic logic(概率逻辑)probabilistic logic(概率逻辑)probabilistic model(概率模型)probabilistic model(概率模型)probability distribution(概率分布)probability distribution(概率分布)project management(项目管理)pruning technique(修剪技术)quality management 质量管理query expansion(查询扩展)query language 查询语言query language(查询语言)query processing(查询处理)query rewrite 查询重写question answering system 问答系统random forest(随机森林)random graph(随机图)random processes(随机过程)random walk(随机游走)range query(范围查询)RDF database 资源描述框架数据库RDF query 资源描述框架查询RDF repository 资源描述框架存储库RDF storge 资源描述框架存储real time(实时)recommender system(推荐系统)recommender system(推荐系统)recommender systems 推荐系统recommender systems(推荐系统)record linkage 记录链接recurrent neural network(递归神经网络) regression(回归)reinforcement learning 强化学习reinforcement learning(强化学习)relation extraction 关系抽取relational database 关系数据库relational learning 关系学习relevance feedback (相关反馈)resource description framework 资源描述框架restricted boltzmann machines(受限玻尔兹曼机) retrieval models(检索模型)rough set theroy 粗糙集理论rough set 粗糙集rule based system 基于规则系统rule based 基于规则rule induction (规则归纳)rule learning (规则学习)rule learning 规则学习schema mapping 模式映射schema matching 模式匹配scientific domain 科学域search problems(搜索问题)semantic (web) technology 语义技术semantic analysis 语义分析semantic annotation 语义标注semantic computing 语义计算semantic integration 语义集成semantic interpretation 语义解释semantic model 语义模型semantic network 语义网络semantic relatedness 语义相关性semantic relation learning 语义关系学习semantic search 语义检索semantic similarity 语义相似度semantic similarity(语义相似度)semantic web rule language 语义网规则语言semantic web 语义网semantic web(语义网)semantic workflow 语义工作流semi supervised learning(半监督学习)sensor data(传感器数据)sensor networks(传感器网络)sentiment analysis(情感分析)sentiment analysis(情感分析)sequential pattern(序列模式)service oriented architecture 面向服务的体系结构shortest path(最短路径)similar kernel function(相似核函数)similarity measure(相似性度量)similarity relationship (相似关系)similarity search(相似搜索)similarity(相似性)situation aware 情境感知social behavior(社交行为)social influence(社会影响)social interaction(社交互动)social interaction(社交互动)social learning(社会学习)social life networks(社交生活网络)social machine 社交机器social media(社交媒体)social media(社交媒体)social media(社交媒体)social network analysis 社会网络分析social network analysis(社交网络分析)social network(社交网络)social network(社交网络)social science(社会科学)social tagging system(社交标签系统)social tagging(社交标签)social web(社交网页)sparse coding(稀疏编码)sparse matrices(稀疏矩阵)sparse representation(稀疏表示)spatial database(空间数据库)spatial reasoning 空间推理statistical analysis(统计分析)statistical model 统计模型string matching(串匹配)structural risk minimization (结构风险最小化) structured data 结构化数据subgraph matching 子图匹配subspace clustering(子空间聚类)supervised learning( 有support vector machine 支持向量机support vector machines(支持向量机)system dynamics(系统动力学)tag recommendation(标签推荐)taxonmy induction 感应规范temporal logic 时态逻辑temporal reasoning 时序推理text analysis(文本分析)text anaylsis 文本分析text classification (文本分类)text data(文本数据)text mining technique(文本挖掘技术)text mining 文本挖掘text mining(文本挖掘)text summarization(文本摘要)thesaurus alignment 同义对齐time frequency analysis(时频分析)time series analysis( 时time series data(时间序列数据)time series data(时间序列数据)time series(时间序列)topic model(主题模型)topic modeling(主题模型)transfer learning 迁移学习triple store 三元组存储uncertainty reasoning 不精确推理undirected graph(无向图)unified modeling language 统一建模语言unsupervisedupper bound(上界)user behavior(用户行为)user generated content(用户生成内容)utility mining(效用挖掘)visual analytics(可视化分析)visual content(视觉内容)visual representation(视觉表征)visualisation(可视化)visualization technique(可视化技术) visualization tool(可视化工具)web 2.0(网络2.0)web forum(web 论坛)web mining(网络挖掘)web of data 数据网web ontology lanuage 网络本体语言web pages(web 页面)web resource 网络资源web science 万维科学web search (网络检索)web usage mining(web 使用挖掘)wireless networks 无线网络world knowledge 世界知识world wide web 万维网world wide web(万维网)xml database 可扩展标志语言数据库附录 2 Data Mining 知识图谱（共包含二级节点15 个，三级节点93 个）间序列分析)监督学习)领域二级分类三级分类。

Subspace CLustering

Subspace Clustering for High Dimensional Data

Contents
Algorithms for clustering
CLIQUE ,PROCLUS and S3C
experiments
Problem Description
Company Logo
S3C
Dimensions are not overlapped depending on inconsistent constraints
1.subspace forming phase
KNN: find the same set of near neighbors per-dimension Constraints combining Dimension selecting
Company Logo
PROCLUS
Differents between CLIQUE and PROCLUS ----accuracy
1.prematurely pruning a dimension based on irrelevant, threshold 1. as a greater percentage of dimensions area coveragethe relevant ones 2.not appropriate density threshold are more difficult to uncover. The use of sampling can exacerbate this 3.when as some clusters may clusters. problem,N=4million, fracturing ofbe only weakly represented.

lasso-var用法 -回复

lasso-var用法-回复LASSOVAR stands for Location Aware Sparse Subspace Clustering with Outlier Rejection. It is a state-of-the-art algorithm used for subspace clustering, a technique that aims to identify and group similar data points within high-dimensional space. In this article, we will explore the key concepts and steps involved in utilizing LASSOVAR for location-aware sparse subspace clustering with outlier rejection.1. Introduction to Subspace Clustering:Subspace clustering is an unsupervised machine learning technique that aims to partition data points into groups or clusters based on their underlying subspaces. In high-dimensional data, different subspaces can represent different patterns or behaviors. Subspace clustering helps in identifying and separating such subspaces, even when the data points are corrupted by noise or outliers.2. Sparse Subspace Clustering:Sparse subspace clustering is a popular approach in subspace clustering that utilizes the sparsity assumption of high-dimensional data. It assumes that the data points within a subspace can berepresented by a sparse linear combination of a few basis vectors. By enforcing sparsity, the algorithm can identify the relevant basis vectors and assign data points to their corresponding subspaces.3. Location-Aware Subspace Clustering:In some applications, the location or position of data points is crucial for clustering. For example, in object tracking, subspace clustering can help separate different objects in a video sequence. However, objects closer to each other in the physical space may have a higher chance of belonging to the same subspace. Therefore, incorporating location information into the clustering process can enhance its accuracy and efficacy.4. LASSOVAR Algorithm:LASSOVAR combines the principles of sparse subspace clustering with location-awareness to achieve robust and accurate clustering results. It consists of the following steps:Step 1: Input Data Representation:The data points are first represented in a matrix, where each column represents a data point and the rows represent the different features or dimensions. The matrix is denoted as X.Step 2: Location-Aware Representation Matrix:LASSOVAR constructs a location-aware representation matrix, denoted as Y. It combines the spatial coordinates of the data points with their feature representations. The location information can be obtained from a GPS device or calculated using other methods, depending on the application domain.Step 3: Outlier Detection:LASSOVAR includes an outlier rejection step to identify and remove outliers from the data. It employs a robust principal component analysis algorithm or other methods to detect and eliminate outliers, as they can negatively impact the clustering results.Step 4: Sparse Subspace Clustering:Using the location-aware representation matrix Y, LASSOVAR applies sparse subspace clustering techniques to identify the underlying subspaces within the data. It enforces sparsity by solving an optimization problem that promotes sparse linear combinations of basis vectors.Step 5: Cluster Assignment:After obtaining the subspaces, LASSOVAR assigns each data point to its most representative subspace. This step involves calculating the reconstruction errors for each data point and assigning it to the subspace with the minimum error.Step 6: Post-Processing:In the final step, LASSOVAR conducts post-processing to refine the clustering results. This step can involve further outlier removal, merging or splitting of clusters based on specific criteria, and other techniques to enhance the quality of the clustering output.5. Conclusion:LASSOVAR is an advanced algorithm that combines sparse subspace clustering with location-awareness for accurate and robust clustering in high-dimensional data. By incorporating location information, it leverages the spatial relationships between data points to improve the clustering results. The algorithm follows a step-by-step process, including data representation,location-awareness, outlier rejection, sparse subspace clustering, cluster assignment, and post-processing. Through these steps,LASSOVAR provides a comprehensive approach for location-aware sparse subspace clustering with outlier rejection, making it highly valuable in various applications such as object tracking, image analysis, and recommender systems.。

python稀疏矩阵奇异值分解

一、概述在计算机科学领域中，稀疏矩阵是一种特殊的矩阵结构，它在现实世界的许多问题中都有着重要的应用。

而奇异值分解（Singular Value Dposition, SVD）则是一种常用的矩阵分解方法，可以用于降维、推荐系统、数据压缩等领域。

Python作为一种流行的编程语言，在处理稀疏矩阵和进行奇异值分解时发挥着重要的作用。

本文将重点介绍Python在稀疏矩阵奇异值分解方面的应用。

二、稀疏矩阵的概念与特点1. 稀疏矩阵是指绝大部分元素为零的矩阵。

在实际问题中，很多数据都具有稀疏性，比如用户-物品评分矩阵、文本数据的词频矩阵等。

2. 稀疏矩阵的特点包括存储空间大大节省、计算效率较高等。

然而，由于稀疏矩阵中零元素过多，传统的矩阵运算方法并不适用，因此需要特殊的处理方式。

三、Python中的稀疏矩阵表示与操作1. Python中有多种表示稀疏矩阵的方式，比如传统的嵌套列表、稀疏矩阵库（如scipy.sparse）等。

这些方法各有优缺点，可以根据具体情况选择合适的表示方式。

2. 在进行稀疏矩阵运算时，Python提供了丰富的库和工具，比如NumPy、SciPy等。

这些库中提供了很多高效的稀疏矩阵运算方法，可以大大提高计算效率。

四、稀疏矩阵奇异值分解的原理与方法1. 奇异值分解是一种重要的矩阵分解方法，可以将一个矩阵分解为三个矩阵的乘积。

在稀疏矩阵中，奇异值分解可以用于降维、去噪、推荐系统等领域。

2. 在Python中，可以使用SciPy库中的sparse.linalg.svds方法来对稀疏矩阵进行奇异值分解。

这个方法可以高效地处理大规模稀疏矩阵，并且支持指定分解后保留的奇异值数量。

五、实例分析：基于Python的稀疏矩阵奇异值分解应用1. 在实际应用中，我们可以利用Python进行稀疏矩阵奇异值分解，并结合其他方法进行推荐系统的构建。

比如可以使用MovieLens数据集构建用户-电影评分矩阵进行分解，从而实现电影推荐功能。

信号与数据处理中的低秩模型——理论、算法与应用

min rank( A), s.t.
A
( D) ( A)
2 F
,
(2)
以处理测量数据有噪声的情况。如果考虑数据有强噪声时如何恢复低秩结构的问题，看似这个问题可以用传统的 PCA 解决，但实际上传统 PCA 只在噪声是高斯噪声时可以准确恢复潜在的低秩结构。对于非高斯噪声，如果噪声很强，即使是极少数的噪声，也会使传统的主元分析失败。由于主元分析在应用上的极端重要性，大量学者付出了很多努力在提高主元分析的鲁棒性上，提出了许多号称“鲁棒”的主元分析方法，但是没有一个方法被理论上严格证明是能够在一定条件下一定能够精确恢复出低秩结构的。 2009 年， Chandrasekaran 等人[CSPW2009]和 Wright 等人[WGRM2009]同时提出了鲁棒主元分析（Robust PCA, RPCA）。他们考虑的是数据中有稀疏大噪声时如何恢复数据的低秩结构：
b) 多子空间模型
RPCA 只能从数据中提取一个子空间，它对数据在此子空间中的精细结构无法刻画。精细结构的最简单情形是多子空间模型，即数据分布在若干子空间附近，我们需要找到这些子空间。这个问题马毅等人称为 Generalized PCA (GPCA)问题[VMS2015]，之前已有很多算法，如代数法、RANSAC 等，但都没有理论保障。稀疏表示的出现为这个问题提供了新的思路。E. Elhamifar 和 R. Vidal 2009 年利用样本间相互表达，在表达系数矩阵稀疏的目标下提出了 Sparse Subspace Clustering (SSC)模型 [EV2009]（(6)中 rank( Z ) 换成 Z
* 本文得到国家自然科学基金(61272341, 61231002)资助。

CLUSTERING DISJOINT SUBSPACES VIA SPARSE REPRESENTATION

D Let {Si }n i=1 be an arrangement of n linear subspaces of R n of dimensions {di }i=1 . We will distinguish between the following two types of arrangements.
978-1-4244-4296-6/10/$25.00 ©2010 IEEE
1926
ICASSP 2010
Y = Y 1 , . . . , Y n Γ, where Γ ∈ RN ×N is an unknown permutation matrix. We assume that we do not know a priori the bases for each one of the subspaces nor do we know which data points belong to which subspace. The subspace clustering problem refers to the problem of ﬁnding the number of subspaces, their dimensions, a basis for each subspace, and the segmentation of the data from Y . The sparse subspace clustering (SSC) algorithm (see [7]) addresses the subspace clustering problem using techniques from sparse representation theory. This algorithm is based on the observation that each data point y ∈ Si can always be written as a linear combination of all the other data points in {Si }n i=1 . However, generically, the sparsest representation is obtained when the point y is written as a linear combination of points in its own subspace. In this case, the number of nonzero coefﬁcients corresponds to the dimension of the subspace. It is shown in [7] that when the subspaces are independent and low-dimensional, i.e., di D, this sparse representation can be obtained by using 1 minimization. The segmentation of the data is found by applying spectral clustering to a similarity graph formed using the sparse coefﬁcients. More speciﬁcally, the SSC algorithm proceeds as follows. Algorithm 1 : Sparse Subspace Clustering (SSC) n Input: A set of points {y i }N i=1 lying in n subspaces {Si }i=1 . 1: For every data point y i , solve the following optimization problem: min ci

基于分式函数约束的稀疏子空间聚类方法

摘要：针对现有稀疏子空间聚类算法获取的系数矩阵不能准确反应高维空间中数据分布的稀疏性的不足，提出一种分式函数约束的稀疏子空间聚类模型，并利用交替方向迭代方法给出该模型的解。在无噪声情形下，证明了该方法获取的系数矩阵具有块对角结构，这为其准确获取数据结构提供了理论保证；在含噪声情形下，对异常点噪声同样采用分式函数约束作为正则项，提高了模型的鲁棒性。在人工数据集、Extended Yale B 库和 Hopkins155 数据集上的实验结果表明，基于分式函数约束的稀疏子空间聚类方法不仅提高了聚类结果的准确率，而且对异常点噪声具有更好的鲁棒性。关键词：分式函数；稀疏表示；块对角结构；子空间聚类；谱聚类文献标志码：A 中图分类号：TP391 doi：10.3778/j.issn.1002-8331.1909-0147
Abstract：This paper proposes a novel sparse subspace clustering model which is based on the constraints of fractional function in order to overcome the shortcoming of sparse subspace clustering algorithm that the coefficient matrix obtained by this algorithm cannot reflect the sparsity of data distribution in high-dimensional space accurately and solves this model by applying the alternating direction iteration method. It is proved that the coefficient matrix obtained by this method has block diagonal structure without any noise, which provides a theoretical guarantee to acquire its data structure accurately. Under the condition of noise, the fractional function constraint is also used as the regular term for outlier noise to improve the robustness of the model. Experimental results on artificial data sets, Extended Yale B database and Hopkins155 data set show that the sparse subspace clustering method based on fractional function constraint not only improves the accuracy of clustering results and also improves the robustness to outlier noise. Key words：fractional function; sparse representation; block diagonal structure; subspace clustering; spectral clustering

矩阵低秩分解理论

• 基于低秩张量应用研究
低秩矩阵表示新近的发展研究
• 基于低秩张量应用研究
稀疏表示和矩阵低秩分解类比
低秩矩阵表示的应用
• 显著性检测
Lang et al. Saliency Detection by Multitask Sparsity Pursuit. IEEE TIP 2012.
低秩矩阵表示新近的发展研究
• Latent LRR
Liu and Yan. Latent Low-Rank Representation for Subspace Segmentation and Feature Extraction, ICCV 2011.
为便于优化，凸松弛后转化为：
若选取数据集Ｄ本身作为字典，则有
那么其解为
，这里
是D的SVD分解。
当D是从多个独立子空间的采样组合，那么为对角块矩阵，每个块对应着一个子空间。此即为子空间聚类（Sparse Subspace Clustering）。
低秩矩阵表示（LRR）
为了对噪声和野点更加鲁棒，一个更合理的模型为：
当Ａ＝Ａk+1，Ｙ＝Ｙk时，
当Ａ＝Ａk+1 ，Ｅ＝Ｅk+1时，其中：步长δk满足０＜ δk ＜1 • ＩＴ算法的迭代式形式简单且收敛，但它的收敛速度比较慢，且难以选取合适的步长
加速近端梯度算法（accelerated proximal gradient，
APG）
• 将优化问题式的等式约束松弛到目标函数中，得到如下的拉格朗日函数：
min Ａ,ＥＬ(Ａ,Ｅ,Ｙk,μ k )。
• 使用精确拉格朗日乘子法交替迭代矩阵Ａ和Ｅ，直到满足终止条件为
止。若
则
交替方向方法（alternating direction

基于密度峰值的聚类集成

第42卷第9期自动化学报Vol.42,No.9 2016年9月ACTA AUTOMATICA SINICA September,2016基于密度峰值的聚类集成褚睿鸿1王红军1杨燕1李天瑞1摘要聚类集成的目的是为了提高聚类结果的准确性、稳定性和鲁棒性.通过集成多个基聚类结果可以产生一个较优的结果.本文提出了一个基于密度峰值的聚类集成模型,主要完成三个方面的工作:1)在研究已有的各聚类集成算法和模型后发现各基聚类结果可以用密度表示;2)使用改进的最大信息系数(Rapid computation of the maximal information coeﬃcient, RapidMic)表示各基聚类结果之间的相关性,使用这种相关性来衡量原始数据在经过基聚类器聚类后相互之间的密度关系;3)改进密度峰值(Density peaks,DP)算法进行聚类集成.最后,使用一些标准数据集对所设计的模型进行评估.实验结果表明,相比经典的聚类集成模型,本文提出的模型聚类集成效果更佳.关键词聚类集成,近邻传播,密度峰值,相似性矩阵引用格式褚睿鸿,王红军,杨燕,李天瑞.基于密度峰值的聚类集成.自动化学报,2016,42(9):1401−1412DOI10.16383/j.aas.2016.c150864Clustering Ensemble Based on Density PeaksCHU Rui-Hong1WANG Hong-Jun1YANG Yan1LI Tian-Rui1Abstract Clustering ensemble aims to improve the accuracy,stability and robustness of clustering results.A good ensemble result is achieved by integrating multiple base clustering results.This paper proposes a clustering ensemble model based on density peaks.First,this paper discovers that the base clustering results can be expressed with density after studying and analyzing the existing clustering algorithms and models.Second,rapid computation of the maximal information coeﬃcient(RapidMic)is introduced to represent the correlation of the base clustering results,which is then used to measure the density of these original datasets after base clustering.Third,the density peak(DP)algorithm is improved for clustering ensemble.Furthermore,some standard datasets are used to evaluate the proposed model. Experimental results show that our model is eﬀective and greatly outperforms some classical clustering ensemble models. Key words Clustering ensemble,aﬃnity propagation,density peaks,similarity matrixCitation Chu Rui-Hong,Wang Hong-Jun,Yang Yan,Li Tian-Rui.Clustering ensemble based on density peaks.Acta Automatica Sinica,2016,42(9):1401−14121绪论1.1研究背景和研究意义类簇可以描述为一个包含密度相对较高的点集的多维空间中的连通区域,一个类簇内的实体是相收稿日期2015-12-25录用日期2016-04-18Manuscript received December25,2015;accepted April18, 2016国家科技支撑计划课题(2015BAH19F02),国家自然科学基金(61262058,61572407),教育部在线教育研究中心在线教育研究基金(全通教育)(2016YB158),西南交通大学中央高校基本科研业务费专项基金(A0920502051515-12)资助Supported by National Science and Technology Support Pro-gram(2015BAH19F02),National Natural Science Foundation of China(61262058,61572407),Online Education Research Cen-ter of the Ministry of Education Online Education Research Fund(Full Education)(2016YB158)and Fundamental Research Funds for the Central Universities of Southwest Jiaotong Uni-versity(A0920502051515-12)本文责任编委王立威Recommended by Associate Editor WANG Li-Wei1.西南交通大学信息科学与技术学院成都6117561.School of Information Science and Technology,Southwest Jiaotong University,Chengdu611756似的,不同类簇的实体是不相似的,同一类簇任意两个点间的距离小于不同类簇任意两个点间的距离[1].近几年,有不少新的聚类算法被提出.周晨曦等[2]设计出一种基于动态更新约束的半监督凝聚层次聚类方法,其更新过程可以保证最终结果的有效性.为了处理混合属性的数据,陈晋音等[3]提出了基于自动确定密度聚类中心的聚类算法.王卫卫等[4]提出了SSC(Sparse subspace clustering),能够揭示高维数据真实子空间结构的表示模型.Ta¸s demir等[5]通过对高空间分辨率遥感影像进行无监督聚类来识别土地覆盖.Parvin等[6]提出了一种模糊加权聚类算法(Fuzzy weighted locally adaptive clustering, FWLAC),能够处理不平衡的聚类.单一的聚类方法普遍存在局限性,例如聚类结果很大程度上取决于参数及其初始化,无法准确判断数据集的真实类簇个数等.另外,真实的数据集往往具有不同的结构和大小,因此,任何一种聚类方法都无法在全部的数据集上获得好的聚类效果.1402自动化学报42卷集成学习是指通过集成多个不同学习器来解决同一问题.Strehl等[7]最早明确提出聚类集成:通过计算和比较多个基聚类结果的相关关系以及信息熵,集成多个基聚类结果可以获得一个综合性的结果.聚类集成有很多优点,不仅能够提高聚类结果的准确性、稳定性和鲁棒性,还能并行处理数据集[7−9].近年来,聚类集成在原有的基础上获得了极大的发展,技术也日趋成熟,越来越多的方法被应用到数据挖掘、生物信息、医学等领域.与此同时,一些问题也显露出来,例如从不同结构的数据源中获得的基聚类结果往往具有不同的结构,如何确定最具代表性的聚类结构或是构建一个新的聚类集成结构显得尤为重要[10−11].本文在对各类已有的聚类集成算法和模型研究的基础上,分析基聚类算法对原始数据进行聚类后其数据结构上的改变.探索了基聚类结果是否可以用原始数据的密度关系进行表示.尝试用改进的密度峰值(Density peaks,DP)算法[12]进行聚类集成.1.2主要研究内容本文主要有三个方面的研究内容:1)本文把基聚类结果看成是原始数据的新增属性,研究结果发现只使用这些新增属性就可以发现数据的密度关系;2)采用改进的最大信息系数(Rapid com-putation of the maximal information coeﬃcient, RapidMic)[13−14]来表示这些数据的密度关系,通过计算数据之间的相关关系矩阵,可以把基聚类结果转换成一个最大相关系数矩阵;3)改进DP算法,确定聚类数量这一参数,自动选取峰值最大的K个点作为聚类中心,然后使用这个改进的DP算法对前面得到的最大相关系数矩阵进行聚类集成.最后,使用标准的数据集对所设计的聚类集成算法和经典的聚类集成算法、以及K均值(K-means)算法进行比较,采用准确率和纯度值这两个评价指标对聚类集成结果进行评价,并对评价结果进行统计检验等工作.在上述研究内容中,本文有两个方面的创新:1)对聚类结果进行分析,使用RapidMic进行运算,发现基聚类结果可以表示成为原始数据的密度关系,通过这个密度关系可以得到一个最大相关系数矩阵;2)改进了DP算法,增加了DP算法的一个参数K,然后使用这个改进的DP算法进行聚类集成.1.3本文内容安排本文的其余部分安排如下.第2节介绍聚类集成的相关工作.第3节首先例证基聚类结果可以表示成为原始数据的密度关系,随后介绍了基于改进的DP算法的聚类集成方法.第4节展示实验步骤及实验结果.第5节得出结论,并介绍了未来可能的工作.2相关工作近年来,集成学习获得越来越多的关注.大多数集成学习方法可分为三类:监督学习、半监督学习以及无监督学习.无监督学习方法的集成,也就是聚类集成,其基本思想是用多个独立的基聚类器分别对原始数据集进行聚类,然后使用某种集成方法进行处理,获得一个最终的集成结果[15].这类算法在第一阶段应尽可能地使用多种方式来获取基聚类结果,第二阶段应选择一个最合适的集成解决方案来处理这些结果.依据算法中解决问题的重点不同,现有的聚类集成方法可大致分为三类.第一类侧重于设计新的聚类集成方法.为了改善最终单聚类算法的结果,Strehl等[7]提出基于三种集成技术的聚类集成框架.Fred等[16]提出了EAC(Evidence accumulation clustering)算法. Topchy等[17]提出了EM(Expectation maximiza-tion)算法并通过对比其他算法分析了其在聚类集成中的性能表现.Ayad等[18]提出了基于累积投票方法的聚类集成框架.Zheng等[19]设计出一种在分层聚类过程中考虑超度量距离分层的聚类集成框架.Wang等[20]在聚类集成框架中引入贝叶斯理论,并设计出基于贝叶斯网络的聚类集成方法.周林等[21]提出了基于谱聚类的聚类集成算法,既能利用谱聚类算法的优越性能,又能避免精确选择尺度参数的问题.Banerjee等[22]提出了一种能够为EM算法产生更好聚类参数近似值的聚类集成算法. Lingras等[23]提出了一种基于粗糙集的聚类集成方法用以维护固有的聚类顺序.Wahid等[24]提出了一种基于SPEA(Strength pareto evolutionary algorithm)的新的多目标聚类集成方法:SPEA-II.结合不同的聚类集成方法,Goswami等[25]提出了一种基于遗传算法的聚类集成算法.Wei等[26]设计出一个能够同时考虑成对约束和度量学习的半监督聚类集成框架.Liu等[27]提出一种选择性的聚类集成算法以及自适应加权策略.Hao等[28]设计的算法能够用以改善基于链接相似性度量的数据聚类关联矩阵.Huang等[29]通过在聚类集成中引入超对象的概念,提出一种新的方法:ECFG(Ensemble clustering using factor graph).第二类主要探索聚类集成方法的属性.9期褚睿鸿等:基于密度峰值的聚类集成1403Kuncheva等[30]研究了聚类的多样性与准确性之间的关系.Kuncheva等[31]探索出如何利用合适的多样性提高聚类准确性.Topchy等[32]重点研究了聚类集成方法的收敛性.Amasyali等[33]就不同因素对聚类集成性能的影响进行了研究.Zhang等[34]提出了一个广义调整的兰德指数来衡量数据集中两个基分区之间的一致性.Wang[35]设计出基于CA 树的分层数据聚类结构,可加速聚类形成,提升聚类集成效率.为了提高聚类集成算法的鲁棒性,Zhou 等[36]提出在捕获到稀疏和对称的错误后,将其整合到强大和一致的框架下用以学习低秩矩阵.Zhong 等[37]认为证据积累是一种有效的框架能够将基分区转换为关联矩阵,从而充分利用每个基分区的集群结构信息.Wahid等[38]研究出的聚类集成方法能够解决两个不同但相互关联的问题:从数据集中产生多个聚类集成结果,同时产生一个最终的聚类集成结果.第三类聚类集成方法主要探索其应用领域.通过检测基因表达数据集的基础聚类结构,Yu等[39]提出的聚类集成框架可用于发现癌症基因.Zhang 等[40]提出的聚类集成方法可应用于SAR图像分割.Hu等[41]研究了如何使用聚类集成从基因表达数据集中确定基因簇的问题.徐森等[42]在聚类集成中引入谱聚类思想,以解决文本聚类问题.Ye等[43]融合了聚类集成框架与领域知识,用以实现恶意软件的自动分类.Zhang等[44]探索出基于聚类集成对流数据进行数据挖掘的方法.Yu等[45]借助新的聚类集成方法BAE(Bagging-Adaboost ensemble)实现了对真核细胞蛋白质磷酸化位点的预测.在从基因表达数据集发现癌症的过程中,为了降低噪声基因的影响,Yu等[46]提出两种新的共识聚类框架:三谱聚类为基础的共识聚类(SC3)和双谱聚类为基础的共识聚类(SC2).Ammour等[47]提出的聚类集成方法可应用于图像分割领域,方法中包含了模糊C均值聚类(Fuzzy C-means,FCM)算法和具有不同邻居效应值的本地信息FCM算法FCM S1, FCM S2.为了解决大规模社会媒体网络中的隐身术检测问题,Li等[48]提出了高阶共同特征和聚类集成的方法.受Chameleon理念的启发,Xiao等[49]设计出一种半监督的聚类集成模型用于高速列车行进过程中传动装置的故障诊断.Teng等[50]提出用基于数据处理分组方法的聚类集成框架(Cluster ensemble framework based on the group method of data handling,CE-GMDH)提升数据处理技术.本文提出一种基于改进的DP算法的聚类集成模型,获得基聚类结果后,使用RapidMic衡量各基聚类结果之间的相关性,通过计算得到最大相关系数矩阵后,使用改进的DP算法进行聚类集成,获得最终的聚类集成结果.3基于改进的DP算法的聚类集成3.1聚类集成问题聚类集成可以分为两个步骤进行.第一步是使用基聚类器对原始数据集进行多次聚类,得到多个基聚类结果.这一步可选择两种方式达成:1)使用某一种算法重复运算多次获得基聚类结果;2)选用多种不同的算法进行运算获得基聚类结果.第二步是基聚类结果集成,选取一种适当的聚类集成方法或者框架,使之能够最大限度地分析这些结果,得到一个对原始数据集最好的集成结果.3.2基聚类结果的产生近邻传播(Aﬃnity propagation,AP)[51]算法是2007年在Science上被提出的.本文选用AP 算法作为基聚类算法,与其他算法不同,AP算法不需要在一开始指定聚类个数,所有的数据点均作为潜在的聚类中心.通过计算原始数据集的相似度矩阵,使用AP算法进行聚类,产生基聚类结果.假设原始数据集有n个数据点,选用欧式距离作为相似度的测度指标,则任意两点之间的相似度为两点距离平方的负数,例如对于点x i和点x k,有G(i,k)=− x i−x k 2.通过计算所有数据点的相似度,得到n×n维的相似度矩阵G.AP算法初始设定所有G(k,k)为相同值p.通过参考度p的值来判断某个点是否能成为聚类中心,参考度p直接影响了最终的聚类数量.AP算法传递两种类型的消息:吸引度值(Re-sponsibility)和归属度值(Availability).吸引度值r(i,k)表示从点i发送到候选聚类中心k的数值消息,反映了k点是否适合作为i点的聚类中心.而归属度值a(i,k)表示从候选聚类中心k发送到i的数值消息,反映了i点是否选择k作为其聚类中心. r(i,k)与a(i,k)越强,k点作为聚类中心的可能性就越大,并且i点隶属于以k点为聚类中心的聚类的可能性也越大.算法运行过程中,通过迭代过程不断更新每一个点的吸引度值和归属度值,直到产生K个高质量的聚类中心,随后将其余的数据点分配到相应的聚类中.通过选取不同的p值,重复使用AP算法计算次,最终可获得m个不同的基聚类结果P=[P1,P2,P3,···,P m].P为一个n×m维的矩阵,矩阵的每一行代表每一个数据点在m种不同1404自动化学报42卷聚类算法中被分配到的类别标签,而矩阵的每一列则代表一次基聚类运算的结果(对应某个参考度p).3.3基聚类结果的相似性矩阵互信息(Mutual information)是信息论里一种有用的信息度量,可以看成是一个随机变量中包含的关于另一个随机变量的信息量.两个变量之间的互信息越大,说明两者的相关性越大.反之,则越小.衡量简单的离散变量之间的关联度大小可以通过计算互信息实现,然而,互信息无法衡量混合类型数据之间的相关性.2011年,最大信息系数(Maximalinformation coeﬃcient,MIC)被提出[13],MIC能够对不同类型的庞大数据集进行关联关系的评估.其算法原理是:通过计算一个数据集中两两变量之间的互信息,找出每两个变量之间互信息最大的值,通过归一化后构成一个特征矩阵.与互信息相比,MIC有下面两大优势:1)除了能够对本身是离散型的数据进行处理以外,还能够通过对连续型数据进行离散处理,实现对混合类型数据的处理.2)通过构建互信息特征矩阵来寻找变量之间的最大信息系数,可以更精确地表示出数据属性间关联性的大小.本文使用改进的MIC,也就是RapidMic[14]算法计算基聚类结果的相关性.基聚类过程中计算得到的n×m维的矩阵P包含所有的基聚类结果.P中有n个数据点,把基聚类结果看成是原始数据的新增属性,每个数据点有m种属性.计算新的数据对象之间的互信息.可以用I(αi,αj)表示αi和αj之间的互信息,其中,H(αi)表示变量αi的熵,I(αi,αj)的值没有上界限.为了更好地比较变量之间的相关性,可以采用标准化的I(αi,αj)值,范围在0到1之间,0表示两个变量互相独立,而1表示两个变量有无噪的关系.已有的几种标准化方法以I(αi,αj)≤min(H(αi),H(αj))为依据,需要计算H(αi)和H(αj)的算术和几何平均数.标准化后,当i=j时,H(αi)=I(αi,αj),且是在Hilbert空间中对数据进行标准化的,所以一般更倾向于采用几何平均值的方法来对数据进行标准化.归一化互信息(Normalized mutual information,NMI)公式如式(1)所示:NMI(αi,αj)=I(αi,αj)H(αi)H(αj)(1)根据式(1)可以计算出新的数据对象之间的互信息大小.在得到所有的互信息值后,构建n×n维特征矩阵S.该相似性矩阵为一个对称阵,主对角线上的值是属性自身的互信息,值为1,其余的为两两属性间的互信息值,即NMI(αi,αj).3.4基聚类结果的密度关系Isomap算法[52]是一种非线性降维方法,首先创建能够正确表达数据邻域结构的邻域图,接着用最短路径法计算各数据点间的最短路径,逼近相应的测地距离,最后使用经典的多维标度分析(Multi-dimensional scaling,MDS)算法在低维可视空间重建数据.本文使用Iris数据集进行例证,Iris有150个数据点,即n取值150,每个数据点4种维度.实验中以AP算法作为基聚类器,重复运算30次,即m 取值30,得到包含所有基聚类结果的n×m维的P 矩阵.接着使用RapidMic算法获得基聚类结果的相似性矩阵S.最后通过Isomap算法获得原始数据的基聚类结果的二维关系映射.具体过程如图1所示,最终结果如图2所示.将这些基聚类结果从150维降到2维之后,得到一个二维关系映射图,观察图片可以发现,基聚类结果的二维映射以某种规律聚成了几类.由此推论,如果把基聚类结果看成是原始数据的新增属性,那么只使用这些新增属性就可以发现数据的密度关系.3.5改进基于密度峰值的DP算法用于聚类集成DP算法是2014年在Science上被提出的聚类图1基于基聚类结果获取原始数据二维关系映射图的过程Fig.1The process of obtaining the two-dimensional relational mapping of original dataset based onclustering results9期褚睿鸿等:基于密度峰值的聚类集成1405图2原始数据的基聚类结果的二维关系映射Fig.2The two-dimensional relational mapping of the base clustering results of original dataset算法,算法有很好的鲁棒性并且对于各种数据集都能达到很好的聚类效果[12].DP算法基于的假设是:类簇中心被具有较低局部密度的邻居点包围,且与具有更高密度的任何点有相对较大的距离.对于每一个数据点i,要计算两个量:数据点的局部密度ρi和该点到具有更高局部密度的点的距离δi,而这两个值都取决于数据点间的距离d ij,d c是一个截断距离.数据点i的局部密度ρi如式(2)所示:ρi=jχ(d ij−d c)(2)其中,令x=d ij−d c,如果x<0,那么χ(x)=1;否则χ(x)=0.基本上,ρi等于与点i的距离小于d c的点的个数.算法只对不同点的ρi的相对大小敏感,这意味着对于大数据集,分析结果对于d c的选择有很好的鲁棒性.δi是数据点i到任何比其密度大的点的距离的最小值,计算公式如式(3)所示:δi=minj:ρj>ρi(d ij)(3)对于密度最大的点,我们可以得到δi=min j(d ij).只有具有高δ和相对较高ρ的点才是类簇中心.而有高δ值和低ρ值的,往往就是异常点.类簇中心找到后,剩余的每个点被归属到有更高密度的最近邻所属类簇.类簇分配只需一步即可完成,不像其他算法要对目标函数进行迭代优化.本文改进了经典的DP算法,在原有的算法中增加了一个参数K,具体的修改方法见算法1.经过改进之后,算法能够自动选取拥有最大密度峰值的前K个点作为聚类中心.接着使用这个改进的DP 算法进行聚类集成,将相似性矩阵S n×n作为目标数据输入算法,计算出局部密度ρi和该点到具有更高局部密度的点的距离δi,自动选取具有高δ和相对较高ρ的点作为类簇中心,将其余的数据点归到各个类簇,要求分配到的类簇与比其密度高并且与其距离近的节点所属的类一样.最终的聚类集成结果可以看成是针对原始数据集所获得的最好聚类结果.整个聚类集成过程如图3所示.图3基于改进的DP算法的聚类集成过程Fig.3The process of cluster ensembling based onimproved DP algorithm3.6算法描述根据图3展示的聚类集成过程,算法1描述如下:算法1.DP ensemble输入.实验数据集αn×q,基聚类器运算次数m,实验数据集的总类簇数K输出.α的聚类集成标签L步骤1.重复运算m次AP算法获得数据集α的基聚类结果P n×m.步骤2.根据式(1)使用RapidMic计算P n×m的相似性矩阵S n×n.步骤3.使用改进的DP算法对S n×n进行聚类集成:1)根据式(2)和(3)计算S中每个数据点的局部密度ρi和该点到具有更高局部密度的点的距离δi(本文使用的是RapidMic算法,取值范围是0到1.相似度越大,表示距离越近,具体转换关系是:距离=1–相似度);2)自动选取具有高δi和相对较高ρi的前K个点作为类簇中心;3)对除类簇中心外的所有数据点进行划分,获得α的聚类集成标签L.1406自动化学报42卷4实验4.1数据集和评价标准本文使用UCI(University of CaliforniaIrvine)机器学习库中的20个数据集作为实验数据集.表1列出了数据集的样本、属性和类别数量.有很多标准可以用来衡量聚类集成算法,本文以这些数据的真实类别标签为标准,选用Micro-precision(MP)[53−54]标准和Purity[55]标准来衡量聚类结果的好坏.表1实验数据集的样本、属性和类别数量Table1The number of instances,features and classes ofdatasetsID Datasets Number of Number of Number ofinstances features classes1Aerosol90589232Amber88089233Ambulances93089234Aquarium92289235Balloon83089236Banner86089237Baobab90089238Basket89289239Bateau900892310Bathroom924892311Bed888892312Beret876892313Beverage873892314Bicycle844892315Birthdaycake932892316Blog943892317Blood866892318Boat857892319Bonbon874892320Bonsai8678923MP标准的计算如式(4)所示:MP=1nKh=1a h(4)其中,a h表示对数据某一类分类正确的数量,n表示数据集中数据对象的数量,K表示此数据集中类别的数量.MP值越大,代表聚类的准确率越高.为了获得更好的准确率需要进行重复的实验,采用平均MP值来衡量结果将更精确.具体计算公式如式(5)所示:AMP=1m×nmt=1Kh=1a h(5)其中,m为重复实验的次数,本文在基聚类中m为10,在聚类集成中m为3.Purity标准的计算如式(6)所示:Purity=1nrk=1max1<l<qn lk(6)其中,n l k是原始类簇的样本数.一个较大的纯度值代表较好的聚类性能.4.2实验步骤和实验结果本文选取AP作为基聚类器算法,通过选取10个不同的参考度p值,使用AP算法重复运算10次,获得基聚类结果P=[P1,P2,P3,···,P10].随后使用CSPA(Cluster-based similarity parti-tioning algorithm)[7]、HGPA(Hypergraph parti-tioning algorithm)[7]、MCLA(Meta-clustering al-gorithm)[7]、DP、EM和QMI(Quadratic mutual information)[9]六种算法对基聚类结果的相似性矩阵进行聚类集成,获得最终的聚类集成标签.最后将这些标签与数据集的真实标签进行对比,采用准确率和纯度值标准进行评价.此外,同时使用经典的K-means算法对数据集进行聚类,获取标签,对比真实标签后,采用准确率和纯度值标准进行评价.使用K-means与各聚类集成算法进行比照实验,是为了证明本文所提出的聚类集成算法是有效的,实际上并没有太大的意义,因为聚类集成更关注在已有的基聚类基础上的效果的提升,而非进行单一聚类算法的比较.表2展示了实验结果的准确率及其标准差.表中第一列数据给出了实验中用到的数据集ID.第二列和第三列给出了使用基聚类器AP算法对20个数据集进行10次聚类之后得到的平均准确率AP-average、最大准确率AP-max,之后的六列则依次给出使用CSPA、HGPA、MCLA、DP、EM和QMI六种聚类算法对基聚类结果的相似性矩阵进行集成运算,以及使用K-means进行聚类获得的准确率.从表2可以观测得到以下结论:1)从AP-average和AP-max两列数据可以看出,使用AP算法对20个数据集进行聚类运算,准确率极低,平均准确率仅为0.023.正确率极低的原因在于,AP算法产生的聚类标签与真实的标签很不一样,某种程度上来说,这两列结果是无意义的.9期褚睿鸿等:基于密度峰值的聚类集成1407表2平均准确率和标准差(每个数据集的最大准确率加粗显示.)Table2Average MPs and standard deviations(The highest MP among diﬀerent algorithms oneach dataset is bolded.)ID AP-average AP-max CSPA HGPA MCLA DP EM QMI K-means 10.022±0.0150.113±0.0720.379±0.0200.384±0.0030.395±0.0200.484±0.0190.354±0.0040.466±0.0530.369±0.008 20.023±0.0160.111±0.0540.472±0.0450.502±0.0200.493±0.0040.571±0.0010.387±0.0250.526±0.0500.595±0.008 30.026±0.0160.119±0.0560.392±0.0260.394±0.0210.384±0.0080.596±0.0410.370±0.0220.497±0.0190.442±0.029 40.021±0.0130.095±0.0400.401±0.0130.394±0.0270.381±0.0260.653±0.0420.353±0.0110.580±0.0570.361±0.006 50.026±0.0200.161±0.1180.410±0.0370.468±0.0170.393±0.0090.554±0.0100.358±0.0320.541±0.0350.445±0.004 60.027±0.0170.124±0.0580.346±0.0020.365±0.0100.346±0.0040.805±0.1580.358±0.0040.791±0.1120.462±0.001 70.018±0.0150.108±0.0980.432±0.0260.474±0.0080.427±0.0170.534±0.0680.389±0.0500.482±0.0240.503±0.004 80.022±0.0170.133±0.0990.362±0.0180.394±0.0200.394±0.0080.538±0.0250.357±0.0230.483±0.0490.409±0.000 90.028±0.0180.135±0.0660.401±0.0200.445±0.0390.423±0.0230.511±0.0500.367±0.0170.510±0.0200.441±0.003 100.019±0.0120.088±0.0450.351±0.0140.369±0.0010.361±0.0140.756±0.0590.355±0.0110.662±0.0310.394±0.001 110.035±0.0210.173±0.0450.371±0.0020.401±0.0250.382±0.0240.617±0.0460.347±0.0060.542±0.0290.459±0.004 120.020±0.0120.101±0.0480.368±0.0050.372±0.0180.373±0.0170.629±0.0150.354±0.0070.575±0.0940.417±0.009 130.031±0.0230.173±0.1010.419±0.0140.425±0.0190.400±0.0130.531±0.0270.353±0.0040.511±0.0150.396±0.000 140.020±0.0150.113±0.0890.410±0.0200.412±0.0250.407±0.0060.522±0.0170.357±0.0080.478±0.0280.452±0.008 150.017±0.0130.092±0.0630.392±0.0440.446±0.0320.450±0.0140.576±0.0080.372±0.0050.563±0.0420.491±0.001 160.023±0.0150.119±0.0560.347±0.0050.383±0.0100.369±0.0060.685±0.0690.357±0.0030.627±0.0770.411±0.001 170.018±0.0110.088±0.0250.349±0.0110.352±0.0130.375±0.0030.776±0.0590.354±0.0170.687±0.0950.473±0.004 180.022±0.0140.106±0.0520.388±0.0070.368±0.0200.377±0.0120.587±0.0380.354±0.0080.537±0.0250.404±0.001 190.016±0.0110.090±0.0490.397±0.0220.411±0.0190.402±0.0120.508±0.0120.357±0.0130.481±0.0200.465±0.002 200.021±0.0120.095±0.0320.383±0.0050.385±0.0240.355±0.0110.642±0.0450.380±0.0390.587±0.0510.443±0.003 AVG0.023±0.0150.117±0.0630.389±0.0180.407±0.0190.394±0.0120.604±0.0400.362±0.0160.556±0.0460.442±0.0052)从CSPA、HGPA、MCLA、DP、EM和QMI 六列数据可以观测到,这六种算法的聚类集成准确率都超过0.340.说明聚类集成算法可以获得比单纯聚类更好的结果.3)最高的准确率已在表中加粗显示,在所有的聚类集成算法中,DP获得的准确率最高,平均准确率超过0.600.准确率越高,说明聚类效果越好,实验中,DP获得了最好的聚类效果.4)表2中各准确率的标准差相对都比较小,绝大部分不超过0.100.标准差越小,说明结果越稳定,实验中,各算法均具有良好的稳定性.5)对比K-means和DP两列数据可以发现,比起单纯使用K-means进行聚类运算,使用DP进行聚类集成能够获得更高的准确率.表3展示了实验结果的纯度值及其标准差.表中第一列数据给出了实验中用到的数据集ID.第二列和第三列给出了使用基聚类器AP算法对20个数据集进行10次聚类之后得到的平均纯度值AP-average、最大纯度值AP-max,之后的六列则依次给出使用CSPA、HGPA、MCLA、DP、EM和QMI 六种聚类算法对基聚类结果的相似性矩阵进行集成运算,以及使用K-means进行聚类获得的纯度值.从表3可以观测得到以下结论:1)从AP-average和AP-max两列数据可以看到,使用AP算法对20个数据集进行聚类获得的平均纯度值较低,仅为0.277,而最大纯度值却很高,达到0.672.纯度值不高,且波动很大的原因在于AP 算法产生的聚类标签与真实的标签很不一样,某种程度上来说,这两列结果是无意义的.2)从CSPA、HGPA、MCLA、DP、EM和QMI 六列数据可以观测到,这六种算法的聚类集成纯度值远远高于基聚类结果的平均纯度值,说明聚类集成算法可以获得比单纯聚类算法更好的结果.3)最高的纯度值已在表中加粗显示,在所有的聚类集成算法中,DP和EM获得的纯度值最高,两者在20个数据集上的平均纯度值均超过0.740.纯度值越高,说明聚类效果越好.实验中,DP和EM 能够获得较好的聚类效果.4)表中除了AP-average和AP-max的标准差比较大,六种聚类集成算法的纯度值的标准差都比较小,特别是DP和EM两种算法,平均标准差不到0.010,标准差越小,说明结果越稳定.实验中,聚。

1、下载文档前请自行甄别文档内容的完整性，平台不提供额外的编辑、内容补充、找答案等附加服务。
2、"仅部分预览"的文档,不可在线预览部分如存在完整性等问题,可反馈申请退款(可完整预览的文档不适用该条件!)。
3、如文档侵犯您的权益，请联系客服反馈,我们会尽快为您处理(人工客服工作时间：9:00-18:30)。

(RANSAC) [11], ﬁt a subspace of dimension d to randomly chosen subsets of d points until the number of inliers is large enough. The inliers are then removed, and the process is repeated to ﬁnd a second subspace, and so on. RANSAC can deal with noise and outliers, and does need to know the number of subspaces. However, the dimensions of the subspaces must be known and equal, and the number of trials needed to ﬁnd d points in the same subspace grows exponentially with the number and dimension of the subspaces. Factorization-based methods [6, 12, 16] ﬁnd an initial segmentation by thresholding the entries of a similarity matrix built from the factorization of the matrix of data points. Such methods are provably correct when the subspaces are independent, but fail when this assumption is violated. Also, these methods are sensitive to noise. Spectralclustering methods [30, 10, 28] deal with these issues by using local information around each point to build a similarity between pairs of points. The segmentation of the data is then obtained by applying spectral clustering to this similarity matrix. These methods have difﬁculties dealing with points near the intersection of two subspaces, because the neighborhood of a point can contain points from different subspaces. This issue can be resolved by looking at multiway similarities that capture the curvature of a collection of points within an afﬁne subspace [5]. However, the complexity of building a multi-way similarity grows exponentially with the number of subspaces and their dimensions. Algebraic methods, such as Generalized Principal Component Analysis (GPCA) [25, 18], ﬁt the data with a polynomial whose gradient at a point gives a vector normal to the subspace containing that point. Subspace clustering is then equivalent to ﬁtting and differentiating polynomials. GPCA can deal with subspaces of different dimensions, and does not impose any restriction on the relative orientation of the subspaces. However, GPCA is sensitive to noise and outliers, and its complexity increases exponentially with the number of subspaces and their dimensions. Informationtheoretic approaches, such as Agglomerative Lossy Compression (ALC) [17], model each subspace with a degenerate Gaussian, and look for the segmentation of the data that minimizes the coding length needed to ﬁt these points with a mixture of Gaussians. As this minimization problem 1

1. Introduction
Subspace clustering is an important problem with numerous applications in image processing, e.g. image representation and compression [15, 29], and computer vision, e.g. image/motion/video segmentation [6, 16, 30, 28, 26]. Given a set of points drawn from a union of subspaces, the task is to ﬁnd the number of subspaces, their dimensions, a basis for each subspace, and the segmentation of the data. Prior work on subspace clustering. Existing works on subspace clustering can be divided into six main categories: iterative, statistical, factorization-based, spectral clustering, algebraic and information-theoretic approaches. Iterative approaches, such as K-subspaces [14], alternate between assigning points to subspaces, and ﬁtting a subspace to each cluster. Statistical approaches, such as Mixtures of Probabilistic PCA (MPPCA) [24], Multi-Stage Learning (MSL) [22], or [13], assume that the distribution of the data inside each subspace is Gaussian and alternate between data clustering and subspace estimation by applying Expectation Maximization (EM) to a mixture of probabilistic PCAs. The main drawbacks of both approaches are that they generally require the number and dimensions of the subspaces to be known, and that they are sensitive to correct initialization. Robust methods, such as Random Sample Consensus
is NP hard, a suboptimal solution is found by ﬁrst assuming that each point forms its own group, and then iterative merging pairs of groups to reduce the coding length. ALC can handle noise and outliers in the data, and can estimate the number of subspaces and their dimensions. However, there is no theoretical proof for the optimality of the algorithm. Paper contributions. In this paper, we propose a completely different approach to subspace clustering based on sparse representation. Sparse representation of signals has attracted a lot of attention during the last decade, especially in the signal and image processing communities (see §2 for a brief review). However, its application to computer vision problems is fairly recent. [21] uses 1 optimization to deal with missing or corrupted data in motion segmentation. [20] uses sparse representation for restoration of color images. [27] uses 1 minimization for recognizing human faces from frontal views with varying expression and illumination as well as occlusion. [19] uses a sparse representation to learn a dictionary for object recognition. Our work is the ﬁrst one to directly use the sparse representation of vectors lying on a union of subspaces to cluster the data into separate subspaces. We exploit the fact that each data point in a union of subspaces can always be written as a linear or afﬁne combination of all other points. By searching for the sparsest combination, we automatically obtain other points lying in the same subspace. This allows us to build a similarity matrix, from which the segmentation of the data can be easily obtained using spectral clustering. Our work has numerous advantages over the state of the art. – Our sparse representation approach resolves the exponential complexity issue of methods such as RANSAC, spectral clustering, and GPCA. While in principle ﬁnding the sparsest representation is also an NP hard problem, we show that under mild assumptions on the distribution of data on the subspaces, the sparsest representation can be found efﬁciently by solving a (convex) 1 optimization problem. – Our work extends sparse representation work from one to multiple subspaces. As we will see in §2, most of the sparse representation literature assumes that the data lies in a single linear subspace [1, 4, 7]. The work of [9] is the ﬁrst one to address the case of multiple subspaces, under the assumption that a sparsifying basis for each subspace is known. Our case is more challenging, because we do not have any basis for any of the subspaces nor do we know which data belong to which subspace. We only have the sparsifying basis for the union of subspaces given by the data matrix. – Our work requires no initialization, can deal with both linear and afﬁne subspaces, can handle data points near the intersections, noise, outliers, and missing data. – Last, but not least, our method signiﬁcantly outperforms existing motion segmentation algorithms on 167 sequences.

2019年托福高频词汇表：sparse什么意思(附翻译及例句).doc

页数:3
Sparse Feature Learning for Deep Belief Networks

页数:8
Sparse Representations in Unions of Bases

页数:6
稀疏编码 Optimization with sparse inducingnorms

页数:22
稀疏表示在信号处理中的应用

页数:15
第十二讲：Matlab稀疏矩阵介绍

页数:14
sparse recovery 稀疏求解

页数:26
SPARSE(稀松矩阵求解器)

页数:4
稠密匹配和稀疏匹配

页数:3
《Kernel Sparse Representation》中文翻译

页数:8