Feature Selection for the Classification of Crosstalk in Multi-Channel Audio
Stuart N. Wrigley, Guy J. Brown, Vincent Wan and Steve Renals
Speech and Hearing Research Group, Department of Computer Science, University of Sheffield, UK
{s.wrigley,g.brown,v.wan,s.renals}@
Abstract
An extension to the conventional speech / nonspeech classification framework is presented for a scenario in which a number of microphones record the activity of speakers present at a meeting (one microphone per speaker). Since each microphone can receive speech from both the participant wearing the microphone (local speech) and other participants (crosstalk), the recorded audio can be broadly classified in four ways: local speech, crosstalk plus local speech, crosstalk alone and silence. We describe a classifier in which a Gaussian mixture model (GMM) is used to model each class. A large set of potential acoustic features are considered, some of which have been employed in previous speech / nonspeech classifiers. A combination of two feature selection algorithms is used to identify the optimal feature set for each class. Results from the GMM classifier using the selected features are superior to those of a previously published approach.

1. Introduction
The objective of the M4 (multimodal meeting manager) project [1] is to produce a demonstration system to enable structuring, browsing and querying of an archive of automatically analysed meetings. The meetings take place in a room equipped with multimodal sensors. Audio information is acquired from lapel mounted microphones, microphone arrays and a KEMAR binaural manikin. Video information is captured from multiple cameras. Unfortunately, crosstalk (non-local speech being received by the local microphone) causes problems for tasks such as turn detection and automatic speech recognition (ASR). For example, when performing ASR, it is important to know which portions of the signal are uttered by the local speaker and hence not insert words from an intruding speaker. The ability to detect overlapping speech also allows corrupted regions of speech to be identified - regions that may not be recognised by an ASR system without pre-processing. Furthermore, patterns of speaker activity and overlap can provide valuable information regarding the structure of the meeting.

In this paper, we concentrate on the task of producing a classifier which can label each lapel microphone signal using four high-level activity categories:
• local channel speaker alone (speaker alone)
• local channel speaker concurrent with one or more other speakers (speaker+crosstalk)
• one or more non-local speakers (crosstalk alone)
• no speakers (silence)

Similar work on meetings-based signal classification [2] has concentrated on a speech / nonspeech identification task. The additional classes above increase the flexibility of the system and more closely guide future analysis (such as enhancement of crosstalk contaminated speech).

A primary objective of our approach is to obtain reliable classification regardless of the room in which the meeting takes place, the identities of the individual speakers and the overall number of participants. We note that previous approaches (e.g., [7]) have tended to be channel, speaker or environment dependent.

A secondary objective was to investigate a range of possible features which can be extracted from the audio signal and determine which combination provides the optimum classification performance for each category. This also allows us to compare our approach to other systems, which compute a subset of our features, using the same data set.

2. GMM-based classifier
Each class is modelled by a Gaussian mixture model (GMM) in order to capture the variability in the distributions caused by multiple speakers and differing channel characteristics.
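To make this scheme concrete, the sketch below trains one GMM per activity class and labels each feature frame with the class whose model gives the highest log-likelihood. It is an illustrative sketch only: the number of mixture components, the diagonal covariance type and the helper names (train_class_gmms, classify_frames, features_by_class) are assumptions, not details taken from the paper.

```python
# Minimal sketch: one GMM per activity class, frames labelled by maximum log-likelihood.
# The number of mixture components (8) and the diagonal covariance type are assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture

CLASSES = ["speaker_alone", "speaker_plus_crosstalk", "crosstalk_alone", "silence"]

def train_class_gmms(features_by_class, n_components=8, seed=0):
    """features_by_class: dict mapping class name -> (n_frames, n_features) array."""
    gmms = {}
    for name in CLASSES:
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag",
                              random_state=seed)
        gmm.fit(features_by_class[name])
        gmms[name] = gmm
    return gmms

def classify_frames(gmms, frames):
    """Return the most likely class label for each feature frame."""
    log_likes = np.column_stack([gmms[name].score_samples(frames) for name in CLASSES])
    return [CLASSES[i] for i in np.argmax(log_likes, axis=1)]
```

Diagonal covariances are used here simply as a common low-cost choice for frame-level audio features; the paper does not state which covariance structure was used.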
2.1. Candidate features
Previous speech / nonspeech classifiers (e.g. [2]) have used features such as critical band loudness values, energy and zero crossing rate. In addition to these, we identified a number of features which are particularly suited to analysing the differences between isolated speech and overlapping speech. Each feature was calculated over a 16 ms Hamming window with a frame shift of 10 ms, unless otherwise stated.

2.1.1. MFCC, energy and zero crossing rate
In order to compare our system with other speech / nonspeech classifiers, we included the conventional feature set of mel-frequency cepstral coefficients (MFCCs), log energy and zero crossing rate (for example, see [2]).

2.1.2. Kurtosis
Kurtosis is defined as the fourth central moment divided by the fourth power of the standard deviation. Thus, kurtosis is based on the size of a distribution's tails - i.e. a measure of Gaussianity. Kurtosis is zero for a Gaussian random variable and positive for super-Gaussian signals such as speech. It has been shown that the kurtosis of overlapping speech is generally less than the kurtosis of the individual speech utterances [3], since - in accordance with the central limit theorem - mixtures of speech signals will tend towards a Gaussian distribution. Here, a 160 ms window, centred on the same points as the 16 ms window, was used to allow a more accurate estimate of the short-time signal kurtosis. The frequency-domain kurtosis (i.e. the kurtosis of the magnitude spectrum) was also computed.
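The text defines kurtosis as the fourth central moment over the fourth power of the standard deviation yet states that it is zero for a Gaussian; this matches the excess kurtosis convention (subtracting 3), which the hedged sketch below follows. The 160 ms time-domain window, 16 ms spectral window and 10 ms hop follow the text, while the use of scipy.stats.kurtosis and the framing helpers are assumptions.

```python
# Sketch of the kurtosis features. Excess kurtosis (Fisher definition) is used so that a
# Gaussian signal scores zero, consistent with the description above; this is an assumption.
import numpy as np
from scipy.stats import kurtosis

def frame_kurtosis(signal, fs, win_s=0.160, hop_s=0.010):
    """Time-domain excess kurtosis per frame (160 ms window, 10 ms hop)."""
    win, hop = int(win_s * fs), int(hop_s * fs)
    feats = []
    for start in range(0, len(signal) - win + 1, hop):
        frame = signal[start:start + win]
        feats.append(kurtosis(frame, fisher=True))   # Fisher: Gaussian -> 0
    return np.array(feats)

def frame_spectral_kurtosis(signal, fs, win_s=0.016, hop_s=0.010):
    """Frequency-domain kurtosis: kurtosis of the magnitude spectrum of each 16 ms frame."""
    win, hop = int(win_s * fs), int(hop_s * fs)
    window = np.hamming(win)
    feats = []
    for start in range(0, len(signal) - win + 1, hop):
        spectrum = np.abs(np.fft.rfft(signal[start:start + win] * window))
        feats.append(kurtosis(spectrum, fisher=True))
    return np.array(feats)
```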
2.1.3. Fundamentalness
Kawahara et al. [4] describe an approach to estimating the ‘fundamentalness’ of a harmonic. Their technique is based on amplitude modulation (AM) and frequency modulation (FM) extracted from the output of wavelet analysis on the signal log frequency spectrum. At different positions in frequency, the analysing wavelet will encompass a different number of harmonic components; fundamentalness is defined as having maximum value when the FM and AM modulation magnitudes are minimum, which corresponds to the situation when the minimum number of components are present in the wavelet window (usually just the fundamental component). Although this technique was developed to analyse isolated speech (see [4] page 196, eqns 13-19), the concept that a single fundamental produces high fundamentalness is useful here: if more than one fundamental is present, interference of the two components will cause AM and FM modulation, thus decreasing the fundamentalness measure. Such an effect will arise when two speakers are active simultaneously, giving rise to overlapping harmonic series.

2.1.4. SAPVR
SAPVR (spectral autocorrelation peak valley ratio; [5]) is computed from the autocorrelation of the signal spectrum. The measure is the ratio of peaks to valleys within the spectral autocorrelation. Specifically, the SAPVR-5 measure [6] is the sum of the autocorrelation peaks, including lag zero, divided by the sum of the first two autocorrelation valleys. For a peak to be used, it must be an integer multiple of the fundamental (i.e. true harmonic). For single speaker speech, a strongly periodic autocorrelation function is produced due to the harmonic structure of the spectrum. However, when more than one speaker is active simultaneously, the autocorrelation function becomes flatter due to the overlapping harmonic series.

2.1.5. PPF
The PPF (pitch prediction feature) was developed for the specific task of discriminating between single speaker speech and two speaker speech [7]. The first stage computes 12-th order linear prediction filter coefficients (LPCs) which are then used to calculate the LP residual - the error signal. The residual is smoothed using a Gaussian shaped filter after which a form of autocorrelation analysis then identifies periodicities (between 50 Hz and 500 Hz). Potential pitch peaks are extracted by applying a threshold to this function. The final PPF measure is defined as the standard deviation of the differences between such extracted peaks. If a frame contains a single speaker, strong peaks will occur at multiples of the pitch period. Therefore, the standard deviation of the differences of the peaks will be small. Conversely, if the frame contains two speakers of different fundamental frequency, there will be a considerably larger number of strong peaks due to the overlapping harmonic series. Therefore, the standard deviation of the differences of the peaks will be much higher. In order to allow direct comparison between our approach and that of [7], a 30 ms window was used.

2.1.6. Features derived from genetic programming
A genetic programming (GP) approach (see [8] for a review) was also used to identify frame-based features that could be useful for signal classification. The GP engine was implemented in MATLAB and included a type checking system, thus allowing vector and scalar types to be mixed in the same expression tree. Intermediate results were cached to reduce computation time. The function set included standard MATLAB functions such as fft, min, max, kurtosis, and user defined functions such as autocorr (time-domain autocorrelation) and normalize (which scaled a vector to have zero mean and unity standard deviation). The terminal set contained integer and floating point constants, the signal vector x, and array indices. A population of 1000 individuals was used, with a mutation rate of 0.5% and crossover rate of 90%. Individuals were evaluated by training and testing a Gaussian classifier on the features derived from each expression tree, using a subset of the data described in section 3. Successive generations were obtained using fitness-proportional selection. The GP engine identified several successful features, such as max(autocorr(normalize(x))), which were included in the feature selection process. Interestingly, GP also discovered several features based on spectral autocorrelation, but these were never ranked highly.

2.1.7. Cross-channel correlation
A number of other features were extracted using cross-channel correlation. For each channel i, the cross-channel correlation was computed between channel i and all other channels. From these, the unnormalised and normalised minimum, maximum and mean values were extracted and used as individual features. Normalisation consisted of dividing the feature set for channel i by the frame energy of channel i.
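The cross-channel correlation features turn out to be the strongest cues in the experiments reported below, so a minimal sketch may be helpful. It assumes time-aligned frames for all channels and takes, for each frame of channel i, the peak of the cross-correlation with the corresponding frame of every other channel before extracting the minimum, maximum and mean, together with versions normalised by the frame energy of channel i. The peak-picking interpretation and the helper name cross_channel_features are assumptions, not details given in the paper.

```python
# Minimal sketch of the cross-channel correlation features for channel i. For every frame,
# the peak of the cross-correlation with each other channel is taken; min, max and mean
# over channels (and energy-normalised versions) form the per-frame features.
import numpy as np

def cross_channel_features(frames):
    """frames: (n_channels, n_frames, frame_len) array of time-aligned frames.
    Returns a dict mapping channel index to its per-frame feature arrays."""
    n_ch, n_fr, _ = frames.shape
    feats = {}
    for i in range(n_ch):
        energy = np.sum(frames[i] ** 2, axis=1) + 1e-12           # frame energy of channel i
        pair_corr = []
        for j in range(n_ch):
            if j == i:
                continue
            peaks = np.array([np.max(np.correlate(frames[i, t], frames[j, t], mode="full"))
                              for t in range(n_fr)])
            pair_corr.append(peaks)
        pair_corr = np.stack(pair_corr)                           # (n_channels - 1, n_frames)
        feats[i] = {
            "min": pair_corr.min(axis=0),
            "max": pair_corr.max(axis=0),
            "mean": pair_corr.mean(axis=0),
            "norm_min": pair_corr.min(axis=0) / energy,
            "norm_max": pair_corr.max(axis=0) / energy,
            "norm_mean": pair_corr.mean(axis=0) / energy,
        }
    return feats
```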
2.2. Feature selection

The parcel algorithm (see [9]) was used to assess the classification performance of the different feature combinations. For each feature combination, the classifier's GMMs are trained and then evaluated to create a receiver operating characteristic (ROC) curve for each crosstalk category. Each point on such a ROC curve represents the performance of a classifier with a different decision threshold between two crosstalk categories (i.e. the category of interest versus all others). Given a number of ROCs (one per feature combination), a maximum realisable ROC (MRROC) can be calculated by fitting a convex hull over the existing ROCs. Thus, each point on the MRROC represents the best feature combination for that class.

However, due to the size of the feature set, evaluating every possible feature combination (2^N - 1 combinations) is impractical. As an alternative to exhaustive search, the sequential forward selection (SFS) algorithm was employed to produce a sub-optimal feature set for each crosstalk category. SFS proceeds by calculating a measure of the GMM classification performance (in this case, the area under the ROC curve) for each individual feature. The winning feature is the one with the highest classification performance. Each remaining feature is then combined with the winning feature and the performance is recomputed. If the combined performance of the new feature and the previous winner exceeds a threshold (in this case, a 1% increase in the area under the ROC curve), the new feature is added to the feature set. This process is repeated until no more features can be added. Despite specifying a maximum feature set size of six, the SFS algorithm terminated before reaching this size for all crosstalk categories.

3. Experiments and results

Since collection and annotation of M4 data has only recently begun, the following experiments were conducted using data from the International Computer Science Institute (ICSI) meeting corpus [10].

3.1. Feature selection using the full feature set

The training data consisted of one million frames per category (16 ms window with 10 ms shift) of conversational speech extracted at random from four ICSI meeting recordings (bro012, bmr006, bed008, bed010). For each channel, a label file specifying the four different crosstalk categories (local speaker alone, local speaker plus crosstalk, crosstalk alone, background) was automatically created from the existing transcriptions provided with the ICSI corpus. The test data consisted of 15,000 frames per category extracted at random from one ICSI meeting recording (bmr001). As for the training data, a label file was automatically created from the existing transcriptions to assess the performance of the classifiers.

The feature sets derived by the SFS algorithm were:

• local channel speaker alone: kurtosis and maximum normalised cross-channel correlation.
• local channel speaker concurrent with one or more speakers: energy, kurtosis, maximum normalised cross-channel correlation and mean normalised cross-channel correlation.
• one or more non-local speakers: energy, kurtosis, mean cross-channel correlation, mean normalised cross-channel correlation and maximum normalised cross-channel correlation.
• no speakers: energy and mean cross-channel correlation.

It is interesting to note that MFCC features do not appear in any of the optimal feature sets. The GMM performance of each feature set is shown in figure 1. For equal false positive and false negative misclassification rates, the performance of each classifier is approximately 80%.
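For reference, the forward-selection loop of section 2.2 that produced the feature sets listed above can be sketched as follows (Python; not the authors' code). The routine roc_auc_for stands in for training and evaluating the GMM classifier on a candidate feature subset and returning the area under its ROC curve, and the 1% improvement threshold is interpreted here as an absolute increase of 0.01 in AUC.

    def sequential_forward_selection(candidates, roc_auc_for,
                                     min_gain=0.01, max_size=6):
        # candidates  : list of candidate feature names
        # roc_auc_for : callable mapping a feature subset to the area under
        #               the ROC curve of a GMM trained on that subset
        #               (assumed to be provided by the surrounding system)
        selected = []
        best_auc = 0.0
        while len(selected) < max_size:
            best_feature = None
            best_candidate_auc = best_auc
            for f in candidates:
                if f in selected:
                    continue
                auc = roc_auc_for(selected + [f])
                if auc > best_candidate_auc:
                    best_feature, best_candidate_auc = f, auc
            # Stop when no remaining feature improves the AUC by at least 1%.
            if best_feature is None or best_candidate_auc - best_auc < min_gain:
                break
            selected.append(best_feature)
            best_auc = best_candidate_auc
        return selected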
3.2. Feature selection excluding the energy feature

The results presented in the previous section were obtained using data from the ICSI corpus. However, it should be noted that there are differences between the data acquisition equipment used by ICSI and M4. Most notable is the type of microphone used for each participant: M4 uses lapel microphones as opposed to the head-mounted microphones used by ICSI. Hence, there may be significant differences between the resultant data sets. For example, channel energy may be an unreliable cue for M4 data, since the lapel microphones allow a variable coupling (separation) between the mouth and the microphone. A drop in channel energy may simply be due to the local speaker turning their head rather than a change in speaker (in which case, the drop in energy would be interpreted as a change from local speaker speech to crosstalk).

In order to give an indication of the classification performance when energy is removed from the set of potential features, the feature selection process described in section 2.2 was repeated with log energy excluded. The feature sets derived by the SFS algorithm were:

• local channel speaker alone: kurtosis and maximum normalised cross-channel correlation.
• local channel speaker concurrent with one or more speakers: kurtosis, fundamentalness, maximum normalised cross-channel correlation and mean normalised cross-channel correlation.
• one or more non-local speakers: mean cross-channel correlation and mean normalised cross-channel correlation.
• no speakers: kurtosis, mean cross-channel correlation and mean normalised cross-channel correlation.

[Figure 1: GMM classification performance for each crosstalk category's optimum feature set. Diagonal lines indicate equal error rates.]

[Figure 2: GMM classification performance for each crosstalk category's optimum feature set, with log energy excluded from the set of potential features. Diagonal lines indicate equal error rates.]

Figure 2 shows the GMM performance of each feature set. It can be seen that the removal of log energy has little effect on the feature set and on the overall classification performance of the system; indeed, performance remains at approximately 80%. This is due to the high performance of the cross-channel correlation features, which dominate the ROC curves.

3.3. Comparison with a previous approach

As described in section 2.1.5, the pitch prediction feature (PPF) [7] was developed for the specific task of identifying portions of speech containing either a single speaker or two speakers. Since this is clearly related to the crosstalk analysis task, a brief comparison between the performance of the system described here and the PPF is made. The same training and test sets are used as described above; the ROC curves for each crosstalk category are shown in figure 3. The performance reported in [7] ranges between 55% and 64% for unknown speakers, depending on the classifier used. Similar performance was obtained when using the PPF with our recogniser: approximately 64% for single speaker speech at the equal error rate, and approximately 59% for crosstalk speech. It ought to be noted that [7] makes no distinction between local speaker plus crosstalk and crosstalk alone, arguably making their task easier. However, the PPF performance here may be reduced by the higher degree of background noise present in our data compared to the data used in [7]. Despite this, the performance reported in this paper, and shown in figures 1 and 2, remains superior.
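The equal error rates referred to above and marked by the diagonal lines in figures 1-3 are the points at which the false positive and false negative rates coincide. A minimal sketch of reading this value off a sampled ROC curve (Python with NumPy; an illustration, not the evaluation code used here):

    import numpy as np

    def equal_error_rate(false_positive_rates, false_negative_rates):
        # Both inputs are 1-D arrays indexed by decision threshold.
        # The EER is approximated at the threshold where |FPR - FNR| is
        # smallest; interpolating between neighbouring thresholds would be
        # slightly more precise.
        fpr = np.asarray(false_positive_rates, dtype=float)
        fnr = np.asarray(false_negative_rates, dtype=float)
        idx = np.argmin(np.abs(fpr - fnr))
        return 0.5 * (fpr[idx] + fnr[idx])

An equal error rate of about 0.2 corresponds to the "approximately 80%" classification figures quoted for figures 1 and 2.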
4. Conclusions and future work

A GMM-based approach to crosstalk analysis has been presented in which a wide range of potential features has been investigated. The parcel [9] and SFS algorithms were used to identify, for each crosstalk class, the feature set with the highest performance. GMM classification performance for each of the four crosstalk classes is approximately 80%, significantly higher than other similar approaches (e.g. [7]).

Currently, the approach described here operates on individual frames of audio data. Temporal constraints could be applied effectively using a hidden Markov model (HMM). Work is being conducted on the development of an ergodic HMM consisting of four states, one per class, in which each state is modelled by a GMM as described above. In addition, we are investigating the potential gain of using individual HMMs for each crosstalk category. Preliminary evaluation of the ergodic HMM classifier (with transition probabilities calculated from the training data and tested using ICSI meeting bmr001) suggests that it will yield a further improvement in performance.

Since M4 data transcription has recently begun, it will be necessary to assess the classifiers' performance on this data set and potentially adjust the feature sets. For example, as described above, it may become clear that an energy-based feature is not a reliable cue for crosstalk analysis. Furthermore, work will also be conducted on the use of cross-channel features and cross-channel classification; for example, one channel's classification can be used to inform the classification process of another channel within the same meeting.

5. Acknowledgements

This work was conducted as part of the M4: multimodal meeting manager project [1], which is funded by the EU IST Programme (project IST-2001-34485).

6. References

[1] /spandh/projects/m4/
[2] Pfau, T., Ellis, D.P.W. and Stolcke, A., "Multispeaker speech activity detection for the ICSI meeting recorder", Proc. Automatic Speech Recognition and Understanding Workshop, Italy, December 2001.
[3] LeBlanc, J. and de Leon, P., "Speech Separation by Kurtosis Maximization", IEEE ICASSP 1998, 1029-1032.
[4] Kawahara, H., Masuda-Katsuse, I. and de Cheveigné, A., "Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency based F0 extraction: Possible role of a repetitive structure in sounds", Speech Communication 27:187-207, 1999.
[5] Krishnamachari, K., Yantorno, R., Benincasa, D. and Wenndt, S., "Spectral Autocorrelation Ratio as a Usability Measure of Speech Segments Under Co-channel Conditions", IEEE International Symposium on Intelligent Signal Processing and Communication Systems, 2000.
[6] Yantorno, R.E., "A study of the spectral autocorrelation peak valley ratio (SAPVR) as a method for identification of usable speech and detection of co-channel speech", Speech Processing Lab Technical Report, Temple University, Philadelphia, 2000.
[7] Lewis, M.A. and Ramachandran, R.P., "Cochannel speaker count labelling based on the use of cepstral and pitch prediction derived features", Pattern Recognition 34: 499-507, 2001.
[8] Banzhaf, W., Nordin, P., Keller, R. and Francone, F., Genetic Programming: An Introduction, Morgan Kaufmann, 1998.
[9] Scott, M., Niranjan, M. and Prager, R., "Parcel: feature subset selection in variable cost domains", Technical Report CUED/F-INFENG/TR.323, Cambridge University Engineering Department, UK, 1998.
[10] Janin, A., Baron, D., Edwards, J., Ellis, D., Gelbart, D., Morgan, N., Peskin, B., Pfau, T., Shriberg, E., Stolcke, A. and Wooters, C., "The ICSI Meeting Corpus", IEEE ICASSP 2003.

[Figure 3: ROC curves for each crosstalk category using the PPF. Diagonal line indicates equal error rates.]