Feature Selection
Feature Extraction and Selection

Feature extraction and selection is a very important part of machine learning: it reduces the amount of computation and improves model performance. Well-chosen features keep the model simpler and let it train to its best parameters more easily and quickly, which makes the model more accurate and more effective. In general, feature extraction and selection proceeds in the following steps:
1. Feature extraction. Feature extraction is concerned with generating new features from the existing ones. It may be feature fusion (e.g., combining several features into a more powerful one) or feature transformation (e.g., turning a discrete feature into a continuous one).
2. Removing irrelevant features. Dropping features that do not help reduces the model's running time and improves its results.
3. Selecting relevant features. Features that are strongly associated with the target value are picked out; these are called relevant features, and they help the model train to better results.
4. Dimensionality reduction. To reduce correlation among features and cut computation, features that are only weakly related to the relevant features can be removed.
5. Feature weighting. Feature weights are adjusted so that the relevant features carry more influence, helping the model make better predictions.
Through this process of extraction and selection, the useful, highly relevant features in the training set are kept and the useless ones are discarded, which effectively improves model performance.
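To make steps 2–4 concrete, here is a minimal scikit-learn sketch (the data set, the number of features kept, and the number of PCA components are illustrative assumptions, not recommendations): a low-variance filter drops uninformative features, a univariate test keeps the features most related to the target, and PCA reduces correlation among those that remain.

```python
# Minimal sketch of steps 2-4: drop useless features, keep target-related ones,
# then reduce correlation/dimension with PCA before fitting a model.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([
    ("drop_constant", VarianceThreshold(threshold=0.0)),  # step 2: remove uninformative features
    ("keep_relevant", SelectKBest(f_classif, k=15)),      # step 3: keep features tied to the target
    ("scale", StandardScaler()),                          # put features on a common scale before PCA
    ("decorrelate", PCA(n_components=10)),                # step 4: reduce correlation among features
    ("model", LogisticRegression(max_iter=1000)),
])

print("5-fold CV accuracy:", cross_val_score(pipe, X, y, cv=5).mean())
```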
Feature Selection in Data Mining

Feature selection is a very important step in data mining. Its goal is to select from the raw data the features that best capture the essence of the problem, reducing the feature dimensionality and improving the accuracy and efficiency of the model. This article covers the significance of feature selection, common feature selection methods, and common feature selection algorithms.
I. The significance of feature selection. Feature selection matters in data mining for several reasons: (1) it improves model accuracy, because selecting the features that best capture the problem reduces the influence of noise and redundant information; (2) it improves model efficiency, because fewer features mean lower dimensionality and model complexity and faster training and prediction; (3) it simplifies interpretation, because keeping only the most important features makes the model's results easier to analyze and explain.
II. Feature selection methods. Feature selection methods fall into three classes: filter, wrapper, and embedded methods. (1) Filter methods are independent of any particular learning algorithm and screen features by their association or correlation; common criteria include the correlation coefficient, the chi-square test, and information gain. (2) Wrapper methods treat feature selection as a subproblem optimized directly within the learning procedure, typically relying on model evaluation and cross-validation. (3) Embedded methods build feature selection into the learning algorithm itself, so features are selected automatically during training; common examples are L1 regularization and decision tree pruning.
III. Feature selection algorithms.
1. Correlation coefficient: measures the association between two variables and is used in filter methods; the larger its absolute value, the stronger the association.
2. Chi-square test: measures the dependence between a feature and the target variable, for filter methods; the larger the chi-square statistic, the stronger the dependence.
3. Information gain: measures how much a feature contributes to predicting the target variable, for filter methods; the larger the information gain, the larger the contribution.
4. L1 regularization: an embedded method that selects features automatically during model training; adding the L1 norm as a regularization term drives some feature weights to exactly zero, which performs the selection.
5. Decision tree pruning: an embedded method that selects features by pruning the leaf nodes of a decision tree.
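To make the filter criteria above (items 1–3) concrete, the sketch below scores the features of a stand-in data set with the absolute Pearson correlation, the chi-square statistic, and mutual information, which is used here as a practical proxy for information gain.

```python
# Scoring features with three common filter criteria and listing the top five
# features under each. The data set is only a placeholder.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import chi2, mutual_info_classif
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True)

# 1. Absolute Pearson correlation between each feature and the 0/1 target.
corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])

# 2. Chi-square statistic (requires non-negative inputs, hence the scaling).
chi2_scores, _ = chi2(MinMaxScaler().fit_transform(X), y)

# 3. Mutual information between each feature and the target (information-gain proxy).
mi_scores = mutual_info_classif(X, y, random_state=0)

for name, scores in [("correlation", corr), ("chi-square", chi2_scores), ("mutual info", mi_scores)]:
    top5 = np.argsort(scores)[::-1][:5]
    print(f"top 5 features by {name}: {top5.tolist()}")
```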
The Difference and Relationship between Feature Extraction and Feature Selection

In machine learning and data mining, feature extraction and feature selection are two important concepts that play a crucial role in data preprocessing and model building. This article discusses how the two differ and how they relate, from both a theoretical and a practical perspective.
1. Definition and purpose of feature extraction. Feature extraction means deriving representative features from raw data for subsequent analysis and modeling. In practice the raw data usually contains a great deal of redundant information and noise; feature extraction applies some algorithm or transformation to the raw data to obtain a more useful and effective feature representation. This improves the model's accuracy and generalization while reducing computation and storage. There are many feature extraction methods, such as principal component analysis (PCA), independent component analysis (ICA), and linear discriminant analysis (LDA). All of them transform the raw data into a new feature representation in order to reduce dimensionality, remove noise, or strengthen the features.
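As a small illustration of the extraction methods just mentioned, the sketch below projects the same data with PCA (which ignores the labels) and LDA (which uses them); the data set and the choice of two components are only for demonstration.

```python
# Feature extraction by transformation: PCA finds directions of maximum variance,
# LDA finds directions that best separate the classes.
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X = StandardScaler().fit_transform(X)

X_pca = PCA(n_components=2).fit_transform(X)
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

print("original shape:", X.shape)
print("PCA output:    ", X_pca.shape)
print("LDA output:    ", X_lda.shape)
```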
2. Definition and purpose of feature selection. Feature selection means choosing, from the original features, the subset that is most representative and most important, to be used in subsequent modeling and prediction. In practice the original features often contain a lot of redundant and irrelevant information; feature selection aims to find the features with the greatest influence on the target variable, which simplifies the model and improves predictive performance and interpretability. There are many feature selection methods, commonly grouped into filter, wrapper, and embedded approaches. Filter methods score and rank the features directly and keep the highest-scoring subset; wrapper methods treat feature selection as a search problem and try different subsets to find the best combination; embedded methods select features during model training, through regularization or added penalty terms.
3. The difference between feature extraction and feature selection. Although both operate on the raw data or features, they differ clearly in purpose and method. Feature extraction produces a new feature representation through some transformation or mapping, in order to reduce dimensionality, remove noise, or strengthen the features; feature selection picks the most representative and important subset of the original features, in order to simplify the model and improve predictive performance and interpretability.
Feature extraction and feature selection are two concepts used throughout machine learning and data mining. Both aim to obtain useful features from the raw data for further analysis and modeling, but they differ and relate to each other in clear ways. Consider feature extraction first: it derives, from the raw data, features that represent the data well. These may be attributes already present in the raw data, or new features obtained by transforming it. The goal is to turn the raw data into a form that learning algorithms can process more easily while preserving its most important characteristics. Methods such as PCA, LDA, and wavelet transforms reduce high-dimensional data to a lower dimension, lowering its complexity and making learning more efficient. The process can be seen as abstracting and summarizing the data so as to extract the information most useful for the target task.
Feature selection, by contrast, is a step carried out on top of feature extraction: from the available features it chooses those most useful for the target task. Feature extraction may produce a large number of features, some of which contribute little to the task or even hurt the learning algorithm's performance, so feature selection is needed to keep the useful features and remove the redundant or irrelevant ones. There are again several families of methods: filter, wrapper, and embedded feature selection. Filter selection evaluates each feature, for example with the correlation coefficient or information gain, and keeps the most useful ones. Wrapper selection trains a model on candidate feature subsets and judges feature importance by the model's performance. Embedded selection picks the most useful features automatically during model training, for example through regularization.
In practice, feature extraction and feature selection are often performed together, and they are closely linked. Feature extraction produces many features, and during feature selection the features sometimes also need to be transformed or combined. In wrapper selection, for instance, training on a feature subset may require combining features, a step that itself resembles feature extraction.
Feature Selection Methods

Feature selection is an important part of machine learning and data mining. Its purpose is to select, from the original features, those that have an important influence on the target variable, in order to improve predictive performance and reduce computational cost. In practice, the choice of feature selection method has a major impact on the performance of the final model. This article introduces several common feature selection methods to help the reader understand and apply them.
1. Filter feature selection. Filter methods select features before the model is trained, chiefly according to the correlation between each feature and the target variable. Common criteria include the correlation coefficient, information gain, and the chi-square test. By scoring the features and keeping those most correlated with the target, filter methods reduce the feature dimensionality and improve model performance.
2. Wrapper feature selection. Wrapper methods perform feature selection during model training, embedding the selection process in the training procedure. Common wrapper methods include recursive feature elimination and model-based feature selection. They repeatedly train the model and adjust the feature set, eventually selecting the combination of features with the greatest effect on model performance.
3. Embedded feature selection. Embedded methods select features automatically during model training, folding the selection into the learning of the model's parameters. Common embedded methods include L1 regularization and decision tree pruning, which penalize or prune features during training and thereby achieve feature selection.
4. Hybrid feature selection. Hybrid methods combine several feature selection techniques to take advantage of their respective strengths. Common approaches include aggregating feature importance assessments and searching over feature combinations. By weighing the results of different selection methods against each other, they pick the feature set with the greatest effect on model performance.
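One simple way to realize the hybrid idea in item 4 is to score the features with two different methods and combine the rankings. The sketch below averages the ranks produced by mutual information (a filter score) and random-forest importance (a model-based score); the averaging rule and the number of features kept are assumptions made for illustration, not a standard recipe.

```python
# A hedged sketch of hybrid selection: combine a filter ranking with a
# model-based ranking and keep the features that rank best on average.
import numpy as np
from scipy.stats import rankdata
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)

mi = mutual_info_classif(X, y, random_state=0)
importances = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y).feature_importances_

# Rank features under each criterion (rank 1 = best) and average the two ranks.
avg_rank = (rankdata(-mi) + rankdata(-importances)) / 2.0
selected = np.argsort(avg_rank)[:10]  # keep the 10 best-ranked features
print("selected feature indices:", sorted(selected.tolist()))
```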
In practice, the feature selection method should be chosen according to the characteristics of the problem and of the data. Note that feature selection is an iterative process: different methods and parameters have to be tried repeatedly to find the best feature set. The choice of method should also take the model type and the performance metric into account, so that the selected features deliver the greatest benefit in the actual application. In short, feature selection is a critical part of machine learning and data mining, and how sound and effective the chosen method is directly affects the performance of the final model.
Feature Selection Algorithms

1. Overview
(1) What is feature selection? Feature selection, also called feature subset selection (FSS) or attribute selection, means choosing a subset of all the features so that the model built from it is better.
(2) Why do feature selection? In practical machine learning applications the number of features is often large. Some of them may be irrelevant, and features may also depend on one another, which easily leads to the following problems: the more features there are, the longer it takes to analyze them and train the model; and the more features there are, the more likely the "curse of dimensionality" becomes, making the model more complex and weakening its ability to generalize. Feature selection removes irrelevant and redundant features, which reduces the number of features, improves model accuracy, and shortens running time. It also means that only the truly relevant features are kept, which simplifies the model and makes it easier for researchers to understand how the data was generated.
2. The feature selection process
2.1 The general process. The general process of feature selection is as follows: first a feature subset is generated from the full feature set; an evaluation function then scores this subset; the score is compared against a stopping criterion, and if it is good enough the search stops, otherwise the next feature subset is generated and the selection continues. The selected feature subset usually still has to be validated. In short, feature selection generally consists of four parts: a generation procedure, an evaluation function, a stopping criterion, and a validation procedure.
(1) Generation procedure: the search over feature subsets, which supplies candidate subsets to the evaluation function. There are several search strategies, described in Section 2.2.
(2) Evaluation function: a criterion for how good a feature subset is, described in Section 2.3.
(3) Stopping criterion: tied to the evaluation function, usually a threshold; once the evaluation value reaches it, the search can stop.
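All four parts show up even in the simplest wrapper-style search. The sketch below runs a greedy forward selection in which the generation procedure adds one feature at a time, the evaluation function is cross-validated accuracy, the stopping criterion is "no candidate improves the score, or the subset has reached a maximum size", and a final check on held-out data plays the role of validation. It is one possible instantiation of the four parts, not the only one; the data set and the size limit are arbitrary.

```python
# Greedy forward selection illustrating the four parts of the general process:
# generation (add one feature), evaluation (cross-validated accuracy),
# stopping criterion (no improvement or max size), validation (held-out test set).
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def make_model():
    return make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

def evaluate(features):
    return cross_val_score(make_model(), X_train[:, features], y_train, cv=5).mean()

selected, best_score = [], 0.0
remaining = list(range(X.shape[1]))
while remaining and len(selected) < 10:                  # part of the stopping criterion
    # Generation + evaluation: score every subset obtained by adding one feature.
    score, feature = max((evaluate(selected + [f]), f) for f in remaining)
    if score <= best_score:                              # stopping criterion: no improvement
        break
    best_score = score
    selected.append(feature)
    remaining.remove(feature)

# Validation: check the chosen subset on data that played no part in the search.
model = make_model().fit(X_train[:, selected], y_train)
print("selected:", selected, "CV score:", round(best_score, 3),
      "test accuracy:", round(model.score(X_test[:, selected], y_test), 3))
```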
Feature selection for logistic regression models is usually done in one of the following ways:
1. Univariate feature selection: evaluates the correlation between each predictor and the outcome variable; it is suited to situations with many predictor variables and a target variable.
2. L1 regularization: penalizes the logistic regression coefficients with the L1 norm and can set unimportant coefficients to zero, which effectively reduces the dimensionality and keeps the model simple.
3. Embedded feature selection: builds feature selection into the model and optimizes it during training, so prediction and feature selection are optimized at the same time.
4. Filtering: screens features using indicators such as the missing-value rate, single-value rate, variance, Pearson correlation, VIF, information value (IV), population stability index (PSI), and p-values. VIF is a collinearity indicator: each feature in turn is treated as the label and fitted by a linear regression on the remaining features, and that feature's VIF is computed from the R^2 of the fit (a sketch follows this list).
Each of these methods has its own characteristics; in practice, choose whichever fits the situation.
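Following the VIF description in item 4, the sketch below computes each feature's VIF directly from the R² of regressing that feature on all of the others, using the standard relation VIF = 1 / (1 − R²); the data set and the rule-of-thumb threshold of 10 are illustrative.

```python
# VIF per feature: regress the feature on all other features and convert the
# resulting R^2 into a variance inflation factor. VIF > 10 is a common (but
# merely conventional) flag for strong collinearity.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

X, _ = load_diabetes(return_X_y=True)

for j in range(X.shape[1]):
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    vif = float("inf") if r2 >= 1.0 else 1.0 / (1.0 - r2)
    flag = "  <- possibly collinear" if vif > 10 else ""
    print(f"feature {j}: VIF = {vif:.2f}{flag}")
```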
Feature Selection Methods

Feature selection is a very important step in machine learning and data mining. It helps us pick, out of a large number of features, those that are useful for solving the problem, improving both the performance and the efficiency of the model. In practice there are many feature selection methods, including filter, wrapper, and embedded approaches. This article introduces several common ones to help readers understand and apply feature selection.
1. Filter feature selection. Filter selection happens before the learner is trained: the features are scored and ranked, and the top-ranked ones form the final feature set. Common scoring criteria include information gain, analysis of variance, and the correlation coefficient. Filter selection is computationally simple and fast, but because it ignores the learner's performance it may select features that are useless for the learning task.
2. Wrapper feature selection. Wrapper selection embeds feature selection in the learner's training process and uses the learner's performance directly as the selection criterion, so it can identify the features useful for the learning task more accurately. Common methods include recursive feature elimination and model-based feature selection (a sketch follows this list). Wrapper selection fully accounts for the learner's performance, but it is computationally expensive and slow.
3. Embedded feature selection. Embedded selection also ties feature selection to the learner's training, selecting useful features through regularization or the learning of the model's parameters. Common methods include L1 regularization and decision tree pruning. Embedded selection accounts for the learner's performance and is relatively fast to compute, but it may be constrained by the type of learner.
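Item 2 above mentions recursive feature elimination; as a compact sketch of that wrapper idea, scikit-learn's RFE below repeatedly refits the learner and discards the weakest features until the requested number remains. The estimator, the number of features kept, and the data set are arbitrary choices for illustration.

```python
# Wrapper-style selection with recursive feature elimination: the learner is
# retrained while the least important features are removed one at a time.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10, step=1)
pipe = make_pipeline(StandardScaler(), rfe, LogisticRegression(max_iter=1000))
print("CV accuracy with 10 RFE-selected features:",
      cross_val_score(pipe, X, y, cv=5).mean())

rfe.fit(StandardScaler().fit_transform(X), y)
print("kept feature indices:", [i for i, keep in enumerate(rfe.support_) if keep])
```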
In practice it is very important to choose a feature selection method that suits the specific problem and data set, and sometimes several methods can be combined to obtain better results. Feature selection is also not a once-and-for-all task: as the data changes and the problem evolves, the selection needs to be adjusted and refined. To summarize, feature selection is an essential step in machine learning and data mining that helps improve model performance and efficiency. The common approaches are filter, wrapper, and embedded selection; each has its strengths and limitations and should be chosen and tuned for the situation at hand. We hope this overview helps readers understand and apply feature selection and strengthens their data analysis and modeling skills.
A Comparison of Feature Selection Methods

Feature selection is an important part of machine learning: it helps us identify the features most useful for the prediction task, reduces model complexity and training time, and improves model accuracy. Different feature selection methods, however, differ in their effectiveness and in the situations where they apply. This article compares the main feature selection approaches and their advantages and drawbacks.
1. Filter feature selection. Filter selection screens the features before the model is trained, removing those with little relationship to the label. It mainly ranks features by criteria such as inter-feature correlation, variance, or information gain. Filter algorithms are simple and easy to implement and are usually applied during data preprocessing. Their limitation is that they cannot handle relationships among features: they work one feature dimension at a time and do not account for the combined effect of feature groups.
2. Wrapper feature selection. Wrapper selection makes feature selection part of the model, using the model itself to assess feature quality and filter the features. Common wrapper algorithms include recursive feature elimination and genetic-algorithm-based selection. Wrapper algorithms can usually identify the features most useful to the model more accurately, but at a higher computational cost and with longer training times.
3. Embedded feature selection. Embedded selection builds feature selection into the modeling process itself, as in Lasso regression and Elastic Net (see the sketch after this list). Embedded algorithms select features and train the model at the same time, which makes them efficient and lets them establish useful relationships among features and exploit the feature information better. On the other hand, they have to assess the weight and influence of every feature, so they involve more computation than filter or wrapper algorithms.
4. Feature selection based on deep learning. As deep learning has advanced, it has been applied more and more widely to feature extraction and feature selection. Deep-learning-based approaches use the layered structure of neural networks to extract and screen features automatically; representative techniques include autoencoders, deep belief networks, and convolutional neural networks. These algorithms perform well on large and high-dimensional data sets and can uncover richer features, but they require more computational resources and longer training.
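Returning to the embedded methods in item 3, the sketch below lets an L1-penalized model (Lasso, or Elastic Net with an L1 component) shrink some coefficients exactly to zero, then keeps only the surviving features with SelectFromModel. The regularization strengths are illustrative, not tuned.

```python
# Embedded selection: the L1 penalty drives some coefficients to zero and
# SelectFromModel keeps the features whose coefficients survive.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import ElasticNet, Lasso
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)

for name, model in [("Lasso", Lasso(alpha=1.0)),
                    ("ElasticNet", ElasticNet(alpha=1.0, l1_ratio=0.5))]:
    selector = SelectFromModel(model, threshold=1e-5).fit(X, y)
    kept = np.flatnonzero(selector.get_support())
    print(f"{name}: {len(kept)} features kept -> {kept.tolist()}")
```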
Overall, each feature selection algorithm has its own strengths, weaknesses, and constraints on use, and the choice should be made according to the actual data and task requirements.
Feature Selection Methods

Feature selection plays a key role in machine learning and data mining tasks. It helps us select, from the raw data, the features with the most predictive power, improving the model's performance and effectiveness. The common approaches to the feature selection problem are:
1. Filter methods: these apply statistical analysis to the features, such as correlation coefficients or the chi-square test, and select the features most related to the target variable. Common filter techniques include correlation-based selection, information gain, and variance-based selection.
2. Wrapper methods: these treat feature selection as a search problem, repeatedly building models and evaluating their performance to determine the best feature subset. Common wrapper techniques include recursive feature elimination (RFE) and genetic algorithms.
3. Embedded methods: during the learning algorithm's training, these assess and select features through regularization (such as L1 regularization) or a dedicated feature selection model. Common embedded techniques include Lasso regression and ridge regression.
4. Tree-based methods: these use decision trees and other tree models to select features according to feature importance. Common importance measures include information gain and the Gini index.
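For the tree-based approach in item 4, a fitted tree ensemble already exposes impurity-based (Gini) importances that can be used to rank and filter features; the sketch below is a minimal illustration with a random forest, and the "top 5" cut-off is arbitrary.

```python
# Tree-based selection: rank features by the forest's Gini importance.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

X, y = load_wine(return_X_y=True)
forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)

for idx in np.argsort(forest.feature_importances_)[::-1][:5]:
    print(f"feature {idx}: importance = {forest.feature_importances_[idx]:.3f}")
```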
Beyond these, there are other feature selection techniques, such as stability-based methods and feature selection within deep learning; the appropriate method can be chosen according to the task and the characteristics of the data set. The purpose of feature selection is to remove irrelevant features, reduce the data dimensionality, and improve model performance. Choosing the right feature selection method helps us understand the data better and improves the model's predictive power.
A Novel Unsupervised Feature Selection Method for Bioinformatics Data Sets through Feature Clustering
Guangrong Li (1,2), Xiaohua Hu (3,4), Xiajiong Shen (4), Xin Chen (3), Zhoujun Li (5)
(1) School of Computer, Wuhan University, Wuhan, China; (2) College of Accounting, Hunan University, Changsha, China; (3) College of Information Science and Technology, Drexel University, Philadelphia, PA 19104; (4) College of Computer and Information Engineering, Henan University, Henan, China; (5) Dept. of Computer Science, Beihang University, Beijing, China

Abstract
Many feature selection methods have been proposed, and most of them belong to the supervised learning paradigm. Recently, unsupervised feature selection has attracted a lot of attention, especially in bioinformatics and text mining. So far, supervised and unsupervised feature selection methods have been studied and developed separately, and a subset selected by a supervised method may not be a good one for unsupervised learning, and vice versa. In bioinformatics research, however, it is very common to perform clustering and classification iteratively on the same data sets, especially in gene expression analysis, so a feature selection method that works well for both unsupervised and supervised learning is very desirable. In this paper we propose a novel feature selection algorithm based on feature clustering. The algorithm does not need the class label information in the data set and is suitable for both supervised and unsupervised learning. It groups the features into clusters based on feature similarity, so that the features in the same cluster are similar to each other, and then selects a representative feature from each cluster, which reduces feature redundancy. Because it uses feature similarity for redundancy reduction and requires no feature search, it works very well for high-dimensional data sets. We test the algorithm on several biological data sets for both clustering and classification analysis, and the results indicate that our FSFC algorithm can significantly reduce the original data sets without sacrificing the quality of clustering and classification.

1. Introduction
Feature selection (also known as variable selection, subspace selection, or dimensionality reduction) is a procedure that selects a subset of the original feature set by eliminating redundant and less informative features, so that the subset contains only the most discriminative features [15]. Feature selection (FS) has three major benefits: (1) it improves the prediction performance of the predictors; (2) it allows faster and more cost-effective prediction; and (3) it provides a better understanding of the underlying process that generated the data [9]. FS has therefore become an essential tool for many applications, especially bioinformatics, text mining, and combinatorial chemistry [9], where high-dimensional data is very common. Feature selection has been studied extensively in the supervised learning paradigm for a long time, in disciplines such as machine learning, data mining, and pattern recognition. Recently, unsupervised feature selection (UFS) has received attention in data mining and machine learning, as clustering high-dimensional data sets has become an essential and routine task in bioinformatics and text mining.
UFS is becoming an essential preprocessing step: it not only greatly reduces computational time, because of the reduced feature subset, but also improves clustering quality, because redundant features that could act as noise are kept out of the unsupervised learning. Feature selection viewed as heuristic search consists of four modules: the choice of a starting point in the feature search, the organization of the search (the search strategy), the evaluation of feature subsets, and a search halting criterion [2]. With this guideline it might seem that UFS could be designed in the same way as supervised feature selection, with some modifications. The traditional feature selection approach, however, cannot be applied directly to UFS. Because the class label is absent from the data set, two of the four required modules are difficult to realize: evaluating the feature subsets generated by the search, and the halting criterion, which mostly depends on that evaluation. It is also impossible to measure the correlation between the class and each feature using distance, information, or dependence measures, which is an essential step in FS. In addition, because the number of clusters is unknown, it is very hard to evaluate feature subsets in UFS. That is why supervised and unsupervised feature selection have been studied and developed separately. A subset selected by supervised feature selection may not be a good one for a clustering task, and vice versa [3]. In some real applications several kinds of data mining analysis must be applied to the same data. This is very common in bioinformatics, especially in gene expression analysis, where it is routine to cluster genes into groups and then run classification to identify the genes that distinguish the gene types in different groups. It is therefore necessary to design a feature selection method that works for both supervised and unsupervised learning. In this paper we present a novel feature selection approach based on feature clustering. Since the method does not rely on class label information, it works for both supervised and unsupervised learning. We use the maximal information compression index (MICI) [14] to evaluate the similarity of features and a novel distance measure to group the features into clusters. In each cluster a representative feature is selected, so the number of features is reduced significantly. The rest of this paper is organized as follows. Section 2 first discusses the filter and wrapper methods in feature selection, since most UFS work is based on them, and then discusses several feature selection methods based on feature clustering in text mining and a UFS approach based on feature clustering. Section 3 explains our method (FSFC) in detail. Section 4 presents extensive experimental results on various biological data sets for both clustering and classification with FSFC. Section 5 concludes the paper.

2. Related Work
Traditionally, filter and wrapper approaches have been widely used in supervised feature selection [12]. The wrapper approach uses a learning (or induction) algorithm both for FS and for the evaluation of each feature set, while the filter approach uses intrinsic properties of the data (e.g., the correlation with the class). Both approaches need to search the feature space to obtain a feature subset.
For the feature search, greedy algorithms such as forward or backward sequential selection are widely used instead of exhaustive search: the exhaustive search space is O(2^D), where D is the number of features, while forward or backward sequential search has a search space of O(D^2). In the wrapper approach every feature set must be evaluated by the learning algorithm rather than by intrinsic properties of the data, which requires a huge amount of computation. Even with scalable, efficient learning algorithms and a greedy feature search, the wrapper approach is very expensive: the large number of iterative runs of the algorithm (at least D^2) makes it infeasible for high-dimensional data sets. On the other hand, in bioinformatics, where data sets tend to have very many features, feature clustering has frequently been used for initial exploratory analysis. It has been found to outperform traditional feature selection methods [1][16] and to remove more redundant features while keeping classification accuracy high [1]. Distributional clustering and the information bottleneck method have been used for feature clustering [1][16], but both require high computational time compared with [6], which uses an information-theoretic framework similar to the information bottleneck but with a new divisive algorithm (similar to k-means) that uses Kullback-Leibler divergences as distances, making it faster. Recently an unsupervised feature selection method based on clustering was introduced in [14]. It requires a user-supplied parameter k that controls the neighborhood size of the feature clusters; picking a proper k is difficult, a desirable value is normally found only after many trials, and for each new k the whole procedure of computing distances and clustering the data has to start over.

3. Our Algorithm (FSFC)
One reason UFS is so challenging is that, in the absence of class labels, it is very hard to distinguish informative features from less informative ones. Unlike traditional UFS methods [13][4][18][5][17][7], we propose a clustering-based feature selection algorithm that uses feature similarity for redundancy reduction but requires no feature search. The method consists of two major steps: first the entire feature set is partitioned into homogeneous subsets (clusters) based on feature similarity, and then a representative feature is selected from each cluster. The features are partitioned with hierarchical clustering based on MICI, and the representative feature of each cluster is selected with our cost distance function. These representatives constitute the selected feature subset.
The algorithm is summarized below.

Algorithm: Feature Selection through Feature Clustering (FSFC)
Input: data set D = {F_n, n = 1, ..., D} (D is the number of features) and the number of features k
Output: the reduced feature subset FS_k
Step 1: FS_k = ∅; calculate the MICI for each pair of features.
Step 2: Repeat: select the pair of clusters S_i, S_j with the minimum c(S_i, S_j) and merge S_i and S_j, until all the objects are merged into one cluster.
Step 3: Select the top k clusters from the hierarchical cluster tree. For each cluster S_k, the center CF_k (the feature with the smallest sum of MICI to all other features in the cluster) is selected as the representative feature of the cluster, and FS_k = FS_k ∪ {CF_k}.

The complexity of the FSFC algorithm is $O(D^2)$, where $D$ is the number of features. Sequential forward selection and backward elimination also have complexity $O(D^2)$, but each evaluation in those search-based approaches is more time consuming than ours.

The maximal information compression index of two features $x$ and $y$ is the smallest eigenvalue $\lambda_2(x, y)$ of their covariance matrix, given by $2\lambda_2(x, y) = \mathrm{var}(x) + \mathrm{var}(y) - \sqrt{(\mathrm{var}(x) + \mathrm{var}(y))^2 - 4\,\mathrm{var}(x)\,\mathrm{var}(y)\,(1 - \rho(x, y)^2)}$, where $\rho(x, y) = \mathrm{cov}(x, y)/\sqrt{\mathrm{var}(x)\,\mathrm{var}(y)}$, "var" means a sample variance, and "cov" means a sample covariance. As a similarity measure between features, MICI has several benefits over the correlation coefficient and the least-squares regression error: first, it is symmetric, regardless of the feature order in the distance calculation (i.e., Distance(F_1, F_2) = Distance(F_2, F_1), where F_n is the n-th feature); second, it is robust to data translation, since the formula involves the mean, variance, and covariance; finally, it is insensitive to rotation of the variables.

The distance function used in our clustering is adapted from [20] and was successfully used in a cluster ensemble approach [10]. It is defined as $c(S_i, S_j) = \min\big(D(S_i \to S_j),\, D(S_j \to S_i)\big)$, where $D(S_i \to S_j) = \frac{1}{m_i} \sum_{x_i \in S_i} \min_{x_j \in S_j} \mathrm{MICI}(x_i, x_j)$, $S_i$ and $S_j$ are the i-th and j-th clusters in the clustering hierarchy, and $m_i$ is the number of features in $S_i$. In hierarchical clustering, S_i and S_j contain only one object (a feature, in our case) at the beginning. S_i and S_j are merged based on this distance; as the clustering progresses, the number of unclustered objects decreases one by one, and the process stops when all the clusters have been merged into one. For each cluster a representative feature is selected from the features in the cluster: the representative is the center of the features, i.e., the feature with the shortest sum of distances to all other features in the cluster. [Figure: a cluster in 1D space and a cluster in 2D space; in both, the star-marked feature in the middle is the representative, because its summed distance to every other object in the cluster is the smallest.]

A hierarchical clustering tree is formed after running the algorithm once on the data set, and different feature subsets can easily be formed from this tree: to generate a feature subset of size k, the k representative features are taken from the top k clusters of the tree. Because the approach is based on hierarchical clustering, various feature subsets can be generated with very little computational overhead; the clustering only has to be done once. This is a significant advantage over other approaches; in [14], for example, the whole clustering procedure must be repeated for each different k. Since the clustering-based FSFC method requires no feature search, it is expected to be considerably faster than traditional UFS.
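The following is not the authors' implementation, only a rough sketch that follows the description above: compute a pairwise MICI-style dissimilarity between features, cluster the features hierarchically, cut the tree into k clusters, and keep from each cluster the feature with the smallest total dissimilarity to the rest. SciPy's average-linkage clustering is used here as a stand-in for the paper's own cluster-distance rule, and the synthetic data is purely illustrative.

```python
# Rough FSFC-style sketch: pairwise MICI between features, hierarchical
# clustering of the features, one representative feature per cluster.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def mici(x, y):
    # 2*lambda_2 from the formula above; the constant factor does not affect ranking.
    vx, vy = np.var(x), np.var(y)
    rho = np.corrcoef(x, y)[0, 1]
    s = (vx + vy) ** 2 - 4 * vx * vy * (1 - rho ** 2)
    return vx + vy - np.sqrt(max(s, 0.0))

def fsfc(X, k):
    # Select k representative feature indices (columns of X) by feature clustering.
    d = X.shape[1]
    dist = np.zeros((d, d))
    for i in range(d):
        for j in range(i + 1, d):
            dist[i, j] = dist[j, i] = mici(X[:, i], X[:, j])
    # Average linkage here approximates the paper's cluster-to-cluster distance.
    labels = fcluster(linkage(squareform(dist, checks=False), method="average"),
                      t=k, criterion="maxclust")
    selected = []
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        totals = dist[np.ix_(members, members)].sum(axis=1)  # summed MICI inside the cluster
        selected.append(int(members[np.argmin(totals)]))     # the cluster's "center" feature
    return sorted(selected)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
X[:, 10:20] = X[:, :10] + 0.05 * rng.normal(size=(200, 10))  # make features 10-19 redundant
print("selected features:", fsfc(X, k=10))
```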
The biggest difference between our approach and [14] is that we only need to do the feature clustering once and can then choose different numbers of features for exploratory analysis with very little additional computation for clustering, which is crucial in real applications. Another difference is that our k is exactly the number of clusters, so the exact number of features in the reduced result is known in advance.

4. Experiment Results
We conduct extensive experiments to demonstrate that our approach works well for both clustering and classification tasks, and in particular that FSFC can significantly reduce the original data sets without sacrificing the quality of clustering and classification. We first apply FSFC to the original data set to reduce the number of features and then apply a clustering or classification algorithm; the aim is to show that the method removes a large number of redundant features from high-dimensional data while retaining the highly informative features that clustering and classification depend on. The experiments use various biological data sets (protein sequence and microarray gene expression data) for clustering and classification.
Before applying FSFC for clustering analysis, we remove the class labels from the original data sets and use them only to generate the true clusters for the MS measurement (the MS definition is given in the experimental subsection below). For clustering analysis, K-means, SOM, and Fuzzy C-means are used, with MS as the evaluation measure of the clustering result; for classification analysis, SVMlight is used [11] (the software is available at /). For each data set the number of features is decreased by 20 percent at a time, so each data set yields four additional data sets; for example, a data set with 100 features produces data sets with 20, 40, 60, and 80 features generated by the FSFC algorithm. The results below report MS or SVM accuracy against the number of features used; note that smaller is better for MS, while larger is better for SVM accuracy.

4.1 Clustering analysis with FSFC as a preprocessing step for data reduction
These tests demonstrate that FSFC can significantly reduce the original data sets without sacrificing clustering quality. To evaluate the clustering results we adopt the Minkowski score. A clustering solution for a set of n elements can be represented by an n x n matrix C in which C_ij = 1 iff x_i and x_j are in the same cluster according to the solution, and C_ij = 0 otherwise. The Minkowski score (MS) between the clustering result C(h) of a particular clustering algorithm CA_h and a reference clustering T (or the true clusters, if the cluster information is known in advance) is defined as MS(T, C(h)) = ||T − C(h)|| / ||T||, where ||T|| = sqrt(Σ_i Σ_j T_ij). The Minkowski score is the normalized distance between the two matrices, so a perfect solution obtains a score of zero, and the smaller the score, the better the solution.
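The Minkowski score defined above is easy to compute directly from the two 0/1 co-membership matrices; the short sketch below does this for a toy labelling (our own illustration, not the evaluation code used in the paper).

```python
# Minkowski score MS(T, C) = ||T - C|| / ||T||, with T and C the co-membership
# matrices of the reference clustering and the obtained clustering.
import numpy as np

def comembership(labels):
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)

def minkowski_score(true_labels, pred_labels):
    T, C = comembership(true_labels), comembership(pred_labels)
    return np.sqrt(((T - C) ** 2).sum()) / np.sqrt(T.sum())

true = [0, 0, 0, 1, 1, 2, 2, 2]
pred = [0, 0, 1, 1, 1, 2, 2, 0]
print("perfect match:", minkowski_score(true, true))   # 0.0 - lower is better
print("noisy labels: ", round(minkowski_score(true, pred), 3))
```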
We use five public gene data sets for the clustering analysis.
(1) The Yeast gene data set contains 7129 tuples and 80 attributes; 2465 of the genes are classified into 108 function families. All Yeast data sets used here have 80 attributes, and in our experiments each family is treated as a cluster. Yeast1, Yeast2, and Yeast3 are extracted from those classified genes: Yeast1, with 3 clusters, contains 101 tuples; Yeast2, with 2 clusters, contains 80 tuples; Yeast3, with 4 classes, contains 669 tuples. The original data set is available here (/EisenData.htm).

Yeast1 (MS index):
| Features | K-means | SOM | Fuzzy C-means |
|---|---|---|---|
| 16 | 1.086 | 1.066 | 1.102 |
| 32 | 1.027 | 0.945 | 1.064 |
| 48 | 1.014 | 0.912 | 1.043 |
| 64 | 1.028 | 0.929 | 1.054 |
| 80 | 1.040 | 0.956 | 1.054 |

Yeast2 (MS index):
| Features | K-means | SOM | Fuzzy C-means |
|---|---|---|---|
| 16 | 0.926 | 0.879 | 0.879 |
| 32 | 0.759 | 0.817 | 0.799 |
| 48 | 0.780 | 0.713 | 0.737 |
| 64 | 0.737 | 0.688 | 0.713 |
| 80 | 0.759 | 0.737 | 0.713 |

Yeast3 (MS index):
| Features | K-means | SOM | Fuzzy C-means |
|---|---|---|---|
| 16 | 1.198 | 1.185 | 1.373 |
| 32 | 1.136 | 1.167 | 1.540 |
| 48 | 1.123 | 1.114 | 1.265 |
| 64 | 1.110 | 1.119 | 1.303 |
| 80 | 1.124 | 1.118 | 1.305 |

(2) The Leukemia data set has 7129 genes, of which 52 are classified into 2 clusters. Our experimental data set has 52 genes and 38 features. The data set is available from /colondata.

Leukemia (MS index):
| Features | K-means | SOM | Fuzzy C-means |
|---|---|---|---|
| 8 | 0.917 | 0.768 | 0.933 |
| 15 | 0.933 | 0.768 | 0.933 |
| 23 | 0.933 | 0.768 | 0.898 |
| 30 | 0.933 | 0.768 | 0.877 |
| 38 | 0.877 | 0.768 | 0.877 |

(3) The B-cell Lymphoma data set has 4026 genes and 96 features; 43 genes are classified into 4 clusters. The data set is available from /lymphoma.

B-cell Lymphoma (MS index):
| Features | K-means | SOM | Fuzzy C-means |
|---|---|---|---|
| 19 | 0.831 | 0.965 | 1.087 |
| 38 | 0.512 | 0.841 | 1.106 |
| 58 | 0.569 | 0.816 | 0.569 |
| 77 | 0.577 | 0.768 | 0.821 |
| 96 | 0.779 | 0.882 | 0.569 |

(4) The RTK (Receptor Tyrosine Kinase) data set has 6312 genes and 16 attributes, of which 137 genes are classified into 7 classes. We use these classified genes, so our experimental data set with 7 classes has 137 genes and 16 features. The data set is available from /RTK/.

RTK (MS index):
| Features | K-means | SOM | Fuzzy C-means |
|---|---|---|---|
| 3 | 1.354 | 1.350 | 1.354 |
| 6 | 1.347 | 1.285 | 1.375 |
| 10 | 1.354 | 1.256 | 1.354 |
| 13 | 1.435 | 1.275 | 1.351 |
| 16 | 1.291 | 1.277 | 1.359 |

(5) The BRCA1 (breast-cancer-related) data set has 373 genes and 12 attributes; 337 genes are classified into 51 classes (each class has 1-43 genes). Our experimental data set with 6 clusters has 164 tuples and 12 features. For more information about this gene data set, refer to [21].

BRCA1 (MS index):
| Features | K-means | SOM | Fuzzy C-means |
|---|---|---|---|
| 2 | 1.308 | 1.277 | 1.363 |
| 5 | 1.354 | 1.317 | 1.348 |
| 7 | 1.353 | 1.292 | 1.358 |
| 10 | 1.351 | 1.304 | 1.360 |
| 12 | 1.355 | 1.302 | 1.347 |

4.2 Classification analysis with FSFC as a preprocessing step for data reduction
4.2.1 UCI data sets used for classification testing

| Data Set Name | # of Classes | # of Tuples | # of Features |
|---|---|---|---|
| Spambase | 2 | 4601 | 57 |
| Ionosphere | 2 | 351 | 34 |
| Multiple Features | 10 | 2000 | 649 |

Spambase (SVM accuracy):
| Features | SVM |
|---|---|
| 11 | 71.67% |
| 23 | 81.30% |
| 34 | 86.01% |
| 46 | 91.30% |
| 57 | 73.12% |

Ionosphere (SVM accuracy):
| Features | SVM |
|---|---|
| 7 | 61.90% |
| 14 | 62.86% |
| 20 | 61.90% |
| 27 | 63.81% |
| 34 | 61.90% |

Multiple Features (SVM accuracy):
| Features | SVM |
|---|---|
| 130 | 85.33% |
| 260 | 85.17% |
| 389 | 85.83% |
| 519 | 87.00% |
| 649 | 81.83% |

4.2.2 Protein data
The first data set has 315 features and 38009 tuples [3]. We use 10-fold cross-validation for the accuracy result.

Protein interaction (SVM accuracy):
| Features | SVM |
|---|---|
| 63 | 65.86% |
| 126 | 66.92% |
| 189 | 67.52% |
| 252 | 67.88% |
| 315 | 70.53% |

The second data set, with 2 classes, has 300 features and 94466 tuples [3]. We use 10-fold cross-validation for the accuracy result.

Solvent (SVM accuracy):
| Features | SVM |
|---|---|
| 60 | 68.72% |
| 120 | 75.64% |
| 180 | 77.47% |
| 240 | 77.61% |
| 300 | 78.05% |
5. Discussions and Conclusion
Our algorithm differs from traditional unsupervised feature selection in two ways: it is clustering-based, so it requires no feature search, and in our algorithm k is simply the number of features in the selection output. We only need to do the feature clustering once and can obtain feature subsets of various sizes without extra computational cost. This characteristic is useful in data mining, where a multiscale representation of the data is often necessary; k acts as a scale parameter that controls the degree of detail in a direct manner. These properties make the method suitable for a wide variety of data mining tasks involving high-dimensional data. The novelty of our method is the absence of a search process, which is what drives the high computational cost of other feature selection algorithms. Our algorithm is based on pairwise feature similarity measures, which are fast to compute, unlike other approaches that explicitly optimize classification accuracy or clustering performance. We determine a set of maximally independent features by discarding the redundant ones, which makes the resulting features applicable to other data mining tasks such as data reduction, summarization, and association mining, in addition to clustering and classification. The experimental results on various biological data sets indicate that the algorithm works well for both supervised and unsupervised learning. As future work, we plan to test the feature selection approach on biomedical literature mining and hope to report our findings in the near future.

Acknowledgement: This work is supported partially by NSF CCF 0514679, the NSF Career grant (IIS-0448023), and PA Dept of Health Grants (#239667).

6. References
[1] L. D. Baker and A. McCallum, "Distributional clustering of words for text classification", in SIGIR '98: Proceedings of the 21st Annual International ACM SIGIR, Melbourne, Australia, 1998, pp. 96-103.
[2] A. Blum and P. Langley, "Selection of relevant features and examples in machine learning", Artificial Intelligence, vol. 97, pp. 245-271, 1997.
[3] H. Chen, H. Zhou, X. Hu, and I. Yoo, "Classification Comparison of Prediction of Solvent Accessibility from Protein Sequences", in the 2nd Asia-Pacific Bioinformatics Conference, New Zealand, Jan 18-22, 2004.
[4] M. Dash, H. Liu, and J. Yao, "Dimensionality Reduction of Unsupervised Data", in 9th IEEE Int'l Conf. Tools with Artificial Intelligence, 1997, pp. 532-539.
[5] M. Devaney and A. Ram, "Efficient Feature Selection in Conceptual Clustering", in 14th Int'l Conf. Machine Learning, 1997.
[6] I. S. Dhillon, S. Mallela, and R. Kumar, "A Divisive Information-Theoretic Feature Clustering Algorithm for Text Classification", JMLR, vol. 3, pp. 1265-1287, 2003.
[7] J. G. Dy and C. E. Brodley, "Feature Subset Selection and Order Identification for Unsupervised Learning", ICML, 2000, pp. 247-254.
[8] D. H. Fisher, "Knowledge acquisition via incremental conceptual clustering", Machine Learning, vol. 2, pp. 139-172, 1987.
[9] I. Guyon and A. Elisseeff, "An Introduction to Variable and Feature Selection", JMLR Special Issue on Variable and Feature Selection, Kernel Machines Section, Mar, pp. 1157-1182, 2003.
[10] X. Hu and Y. Yoo, "Cluster Ensemble and Its Applications in Gene Expression Analysis", in BIBE, 2006.
[11] T. Joachims, "Making large-Scale SVM Learning Practical", Advances in Kernel Methods - Support Vector Learning, MIT Press, 1999.
[12] G. H. John, R. Kohavi, and K. Pfleger, "Irrelevant features and the subset selection problem", in 11th International Conference on Machine Learning, New Brunswick, NJ, 1994, pp. 121-129.
[13] Y. S. Kim, N. Street, and F. Menczer, "Evolutionary model selection in unsupervised learning", Intelligent Data Analysis, vol. 6, pp. 531-556, 2002.
[14] P. Mitra, C. A. Murthy, and S. K. Pal, "Unsupervised Feature Selection Using Feature Similarity", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, pp. 301-312, 2002.
[15] M. E. Morita, R. Sabourin, F. Bortolozzi, and C. Y. Suen, "Unsupervised Feature Selection Using Multi-Objective Genetic Algorithm for Handwritten Word Recognition", in the 7th International Conference on Document Analysis and Recognition, Edinburgh, Scotland, 2003, pp. 666-670.
[16] N. Slonim and N. Tishby, "The power of word clusters for text classification", in 23rd European Colloquium on Information Retrieval Research (ECIR), 2001.
[17] L. Talavera, "Feature Selection as a Preprocessing Step for Hierarchical Clustering", in 16th Int'l Conf. on Machine Learning, 1999, pp. 389-397.
[18] L. Talavera, "Dependency-Based Feature Selection for Clustering Symbolic Data", Intelligent Data Analysis, vol. 4, pp. 19-28, 2000.
[19] P. L. Welcsh, M. K. Lee, R. M. Gonzalez-Hernandez, D. J. Black, M. Mahadevappa, E. M. Swisher, J. A. Warrington, and M. C. King, "BRCA1 transcriptionally regulates genes involved in breast tumorigenesis", Proc. Natl. Acad. Sci. USA 99, 2002, pp. 7560-7565.
[20] Y. Zeng, J. Tang, J. Garcia-Frias, and G. R. Gao, "An Adaptive Meta-Clustering Approach: Combining The Information From Different Clustering Results", in IEEE Computer Society Bioinformatics Conference, Stanford University, pp. 276-287.
[21] "Machine Learning Repository", /~mlearn/MLRepository.html