A New Feature Selection Method for Neuro-Fuzzy Classifier
Feature Selection Based on Mutual Information: Criteria of Max-Dependency, Max-Relevance, and Min-Redundancy

Hanchuan Peng, Member, IEEE, Fuhui Long, and Chris Ding

Abstract—Feature selection is an important problem for pattern classification systems. We study how to select good features according to the maximal statistical dependency criterion based on mutual information. Because of the difficulty in directly implementing the maximal dependency condition, we first derive an equivalent form, called the minimal-redundancy-maximal-relevance criterion (mRMR), for first-order incremental feature selection. Then, we present a two-stage feature selection algorithm by combining mRMR and other more sophisticated feature selectors (e.g., wrappers). This allows us to select a compact set of superior features at very low cost. We perform extensive experimental comparison of our algorithm and other methods using three different classifiers (naive Bayes, support vector machine, and linear discriminant analysis) and four different data sets (handwritten digits, arrhythmia, NCI cancer cell lines, and lymphoma tissues). The results confirm that mRMR leads to promising improvement on feature selection and classification accuracy.

Index Terms—Feature selection, mutual information, minimal redundancy, maximal relevance, maximal dependency, classification.

1 INTRODUCTION

In many pattern recognition applications, identifying the most characterizing features (or attributes) of the observed data, i.e., feature selection (or variable selection, among many other names) [30], [14], [17], [18], [15], [12], [11], [19], [31], [32], [5], is critical to minimize the classification error. Given the input data D tabled as N samples and M features X = {x_i, i = 1, ..., M}, and the target classification variable c, the feature selection problem is to find from the M-dimensional observation space, R^M, a subspace of m features, R^m, that "optimally" characterizes c.

Given a condition defining the "optimal characterization," a search algorithm is needed to find the best subspace. Because the total number of subspaces is 2^M, and the number of subspaces with dimensions no larger than m is $\sum_{i=1}^{m}\binom{M}{i}$, it is hard to search the feature subspace exhaustively. Alternatively, many sequential-search-based approximation schemes have been proposed, including best individual features, sequential forward search, sequential forward floating search, etc. (see [30], [14], [13] for a detailed comparison).

The optimal characterization condition often means the minimal classification error. In an unsupervised situation where the classifiers are not specified, minimal error usually requires the maximal statistical dependency of the target class c on the data distribution in the subspace R^m (and vice versa). This scheme is maximal dependency (Max-Dependency).

One of the most popular approaches to realize Max-Dependency is maximal relevance (Max-Relevance) feature selection: selecting the features with the highest relevance to the target class c. Relevance is usually characterized in terms of correlation or mutual information, of which the latter is one of the widely used measures to define dependency of variables. In this paper, we focus on the discussion of mutual-information-based feature selection. Given two random variables x and y, their mutual information is defined in terms of their density functions p(x), p(y), and p(x, y):

$$I(x;y) = \iint p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}\,dx\,dy. \tag{1}$$

In Max-Relevance, the selected features x_i are required, individually, to have the largest mutual information I(x_i; c) with the target class c, reflecting the largest dependency on the target class.
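For discrete (categorical) variables, the double integral in (1) reduces to a sum over the joint probability table, so I(x; c) can be estimated directly from sample counts. The following sketch is illustrative only and not code from the paper; the function names and the plug-in estimator are our own choices. It estimates I(x; y) from two equal-length sequences of discrete values and ranks features by relevance:

```python
import numpy as np
from collections import Counter

def mutual_information(x, y):
    """Plug-in estimate of I(x; y) for two discrete sequences of equal length (in nats)."""
    n = len(x)
    px = Counter(x)
    py = Counter(y)
    pxy = Counter(zip(x, y))
    mi = 0.0
    for (xv, yv), count in pxy.items():
        p_joint = count / n
        p_prod = (px[xv] / n) * (py[yv] / n)
        mi += p_joint * np.log(p_joint / p_prod)
    return mi

def max_relevance_ranking(X, c):
    """Rank features (columns of X) by I(x_i; c), highest first (Max-Relevance)."""
    scores = np.array([mutual_information(X[:, j], c) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1], scores
```

Taking the top m indices of this ranking is exactly the sequential Max-Relevance scheme discussed next; it says nothing yet about the redundancy among the chosen features.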
In terms of sequential search, the m best individual features, i.e., the top m features in descending order of I(x_i; c), are often selected as the m features.

In feature selection, it has been recognized that combinations of individually good features do not necessarily lead to good classification performance. In other words, "the m best features are not the best m features" [4], [3], [14], [30]. Some researchers have studied indirect or direct means to reduce the redundancy among features¹ (e.g., [4], [14], [19], [15], [22], [12], [5]) and select features with the minimal redundancy (Min-Redundancy). For example, in the sequential forward floating search [25], the joint dependency of features on the target class is maximized; as a by-product, the redundancy among features might be reduced. In [12], Jaeger et al. presented a prefiltering method to group variables, so that redundant variables within each group can be removed. In [5], a minimal-redundancy-maximal-relevance (mRMR) framework was proposed to minimize redundancy, using a series of intuitive measures of relevance and redundancy to select promising features for both continuous and discrete data sets.

¹ Minimal redundancy has also been studied in feature extraction, which aims to find good features in a transformed domain. For instance, it has been well addressed in various techniques such as principal component analysis and independent component analysis [10], neural network feature extractors (e.g., [22]), etc.

Our work in this paper focuses on three issues that have not been touched in earlier work. First, although both Max-Relevance and Min-Redundancy have been used intuitively for feature selection, no theoretical analysis has been given on why they can benefit the selection of optimal features for classification. Thus, the first goal of this paper is to present a theoretical analysis showing that mRMR is equivalent to Max-Dependency for first-order feature selection, but is more efficient. Second, we investigate how to combine mRMR with other feature selection methods (such as [18], [15]) into a two-stage selection algorithm. By doing this, we show that the space of candidate features selected by mRMR is more characterizing. This property of mRMR facilitates the integration of other feature selection schemes to find a compact subset of superior features at very low cost. Third, through comprehensive experiments we compare mRMR, Max-Relevance, Max-Dependency, and the two-stage feature selection algorithm, using three different classifiers and four data sets. The results show that mRMR and our two-stage algorithm are very effective in a wide range of feature selection applications.

This paper is organized as follows: Section 2 presents the theoretical analysis of the relationships of Max-Dependency, Max-Relevance, and Min-Redundancy. Section 3 presents the two-stage feature selection algorithm, including schemes to integrate wrappers to select a squeezed subset of features.
Section 4 discusses implementation issues of density estimation for mutual information and several different classifiers. Section 5 gives experimental results on four data sets, including handwritten characters, arrhythmia, NCI cancer cell lines, and lymphoma tissues. Sections 6 and 7 are discussions and conclusions, respectively.

2 RELATIONSHIPS OF MAX-DEPENDENCY, MAX-RELEVANCE, AND MIN-REDUNDANCY

2.1 Max-Dependency

In terms of mutual information, the purpose of feature selection is to find a feature set S with m features {x_i} which jointly have the largest dependency on the target class c. This scheme, called Max-Dependency, has the following form:

$$\max D(S,c), \quad D = I(\{x_i, i=1,\dots,m\};\, c). \tag{2}$$

Obviously, when m equals 1, the solution is the feature that maximizes I(x_j; c) (1 ≤ j ≤ M). When m > 1, a simple incremental search scheme is to add one feature at a time: given the set with m−1 features, S_{m−1}, the mth feature is determined as the one that contributes to the largest increase of I(S; c), which takes the form of (3):

$$\begin{aligned}
I(S_m;c) &= \iint p(S_m,c)\,\log\frac{p(S_m,c)}{p(S_m)\,p(c)}\,dS_m\,dc \\
&= \iint p(S_{m-1},x_m,c)\,\log\frac{p(S_{m-1},x_m,c)}{p(S_{m-1},x_m)\,p(c)}\,dS_{m-1}\,dx_m\,dc \\
&= \int\!\cdots\!\int p(x_1,\dots,x_m,c)\,\log\frac{p(x_1,\dots,x_m,c)}{p(x_1,\dots,x_m)\,p(c)}\,dx_1\cdots dx_m\,dc.
\end{aligned} \tag{3}$$

In practice, however, it is often hard to get an accurate estimate of the multivariate densities p(x_1,...,x_m) and p(x_1,...,x_m,c), because of two difficulties in the high-dimensional space: 1) the number of samples is often insufficient and 2) the multivariate density estimation often involves computing the inverse of the high-dimensional covariance matrix, which is usually an ill-posed problem. Another drawback of Max-Dependency is the slow computational speed. These problems are most pronounced for continuous feature variables.

Even for discrete (categorical) features, the practical problems in implementing Max-Dependency cannot be completely avoided. For example, suppose each feature has three categorical states and there are N samples. K features could have a maximum of min(3^K, N) joint states. When the number of joint states increases very quickly and becomes comparable to the number of samples, N, the joint probability of these features, as well as the mutual information, cannot be estimated correctly. Hence, although Max-Dependency feature selection might be useful for selecting a very small number of features when N is large, it is not appropriate for applications where the aim is to achieve high classification accuracy with a reasonably compact set of features.

2.2 Max-Relevance and Min-Redundancy

As the Max-Dependency criterion is hard to implement, an alternative is to select features based on the maximal relevance criterion (Max-Relevance). Max-Relevance is to search features satisfying (4), which approximates D(S, c) in (2) with the mean value of all mutual information values between the individual features x_i and the class c:

$$\max D(S,c), \quad D = \frac{1}{|S|}\sum_{x_i \in S} I(x_i;c). \tag{4}$$

It is likely that features selected according to Max-Relevance could have rich redundancy, i.e., the dependency among these features could be large. When two features highly depend on each other, the respective class-discriminative power would not change much if one of them were removed. Therefore, the following minimal redundancy (Min-Redundancy) condition can be added to select mutually exclusive features [5]:

$$\min R(S), \quad R = \frac{1}{|S|^2}\sum_{x_i, x_j \in S} I(x_i;x_j). \tag{5}$$

The criterion combining the above two constraints is called "minimal-redundancy-maximal-relevance" (mRMR) [5]. We define the operator Φ(D, R) to combine D and R and consider the following simplest form to optimize D and R simultaneously:

$$\max \Phi(D,R), \quad \Phi = D - R. \tag{6}$$

In practice, incremental search methods can be used to find the near-optimal features defined by Φ(·). Suppose we already have S_{m−1}, the feature set with m−1 features. The task is to select the mth feature from the set {X − S_{m−1}}. This is done by selecting the feature that maximizes Φ(·). The respective incremental algorithm optimizes the following condition:

$$\max_{x_j \in X - S_{m-1}} \left[ I(x_j;c) - \frac{1}{m-1}\sum_{x_i \in S_{m-1}} I(x_j;x_i) \right]. \tag{7}$$

The computational complexity of this incremental search method is O(|S|·M).
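The incremental rule (7) needs only the pairwise quantities I(x_j; c) and I(x_j; x_i), so it can be built on top of any pairwise mutual-information estimator such as the one sketched earlier. The following greedy loop is an illustrative sketch rather than the authors' code; the function name is ours, and `mutual_information` is assumed to be supplied by the caller.

```python
import numpy as np

def mrmr_incremental(X, c, n_select, mutual_information):
    """Greedy first-order mRMR selection following the incremental rule (7).

    X: (N, M) array of discrete feature values; c: length-N class labels.
    Returns the indices of the selected features in selection order.
    """
    N, M = X.shape
    relevance = np.array([mutual_information(X[:, j], c) for j in range(M)])

    selected = [int(np.argmax(relevance))]   # first feature: maximal I(x_j; c)
    redundancy_sum = np.zeros(M)             # running sum of I(x_j; x_i) over selected x_i

    while len(selected) < n_select:
        last = selected[-1]
        for j in range(M):
            if j not in selected:
                redundancy_sum[j] += mutual_information(X[:, j], X[:, last])
        # score_j = I(x_j; c) - (1/(m-1)) * sum_{x_i in S_{m-1}} I(x_j; x_i)
        score = relevance - redundancy_sum / len(selected)
        score[selected] = -np.inf            # exclude already selected features
        selected.append(int(np.argmax(score)))
    return selected
```

Because each new selection adds only one row of pairwise mutual-information evaluations over the remaining features, the cost of this loop is consistent with the O(|S|·M) complexity quoted above.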
2.3 Optimal First-Order Incremental Selection

We prove in the following that the combination of the Max-Relevance and Min-Redundancy criteria, i.e., the mRMR criterion, is equivalent to the Max-Dependency criterion if one feature is selected (added) at a time. We call this type of selection the "first-order" incremental search. We have the following theorem:

Theorem. For the first-order incremental search, mRMR is equivalent to Max-Dependency (2).

Proof. By definition of the first-order search, we assume that S_{m−1}, i.e., the set of m−1 features, has already been obtained. The task is to select the optimal mth feature x_m from the set {X − S_{m−1}}. The dependency D in (2) and (3) is represented by mutual information, i.e., D = I(S_m; c), where S_m = {S_{m−1}, x_m} can be treated as a multivariate variable. Thus, by the definition of mutual information, we have

$$I(S_m;c) = H(c) + H(S_m) - H(S_m,c) = H(c) + H(S_{m-1},x_m) - H(S_{m-1},x_m,c), \tag{8}$$

where H(·) is the entropy of the respective multivariate (or univariate) variables.

Now, we define the following quantity J(S_m) = J(x_1, ..., x_m) for the scalar variables x_1, ..., x_m,

$$J(x_1,x_2,\dots,x_m) = \int\!\cdots\!\int p(x_1,\dots,x_m)\,\log\frac{p(x_1,x_2,\dots,x_m)}{p(x_1)\cdots p(x_m)}\,dx_1\cdots dx_m. \tag{9}$$

Similarly, we define J(S_m, c) = J(x_1, ..., x_m, c) as

$$J(x_1,x_2,\dots,x_m,c) = \int\!\cdots\!\int p(x_1,\dots,x_m,c)\,\log\frac{p(x_1,x_2,\dots,x_m,c)}{p(x_1)\cdots p(x_m)\,p(c)}\,dx_1\cdots dx_m\,dc. \tag{10}$$

We can easily derive (11) and (12) from (9) and (10),

$$H(S_{m-1},x_m) = H(S_m) = \sum_{i=1}^{m} H(x_i) - J(S_m), \tag{11}$$

$$H(S_{m-1},x_m,c) = H(S_m,c) = H(c) + \sum_{i=1}^{m} H(x_i) - J(S_m,c). \tag{12}$$

By substituting them into the corresponding terms in (8), we have

$$I(S_m;c) = J(S_m,c) - J(S_m) = J(S_{m-1},x_m,c) - J(S_{m-1},x_m). \tag{13}$$

Obviously, Max-Dependency is equivalent to simultaneously maximizing the first term and minimizing the second term.

We can use Jensen's inequality [16] to show that the second term, J(S_{m−1}, x_m), is lower-bounded by 0. A related and slightly simpler proof is to consider the inequality log(z) ≤ z − 1, with equality if and only if z = 1. We see that

$$\begin{aligned}
-J(x_1,x_2,\dots,x_m) &= \int\!\cdots\!\int p(x_1,\dots,x_m)\,\log\frac{p(x_1)\cdots p(x_m)}{p(x_1,\dots,x_m)}\,dx_1\cdots dx_m \\
&\le \int\!\cdots\!\int p(x_1,\dots,x_m)\left[\frac{p(x_1)\cdots p(x_m)}{p(x_1,\dots,x_m)} - 1\right]dx_1\cdots dx_m \\
&= \int\!\cdots\!\int p(x_1)\cdots p(x_m)\,dx_1\cdots dx_m - \int\!\cdots\!\int p(x_1,\dots,x_m)\,dx_1\cdots dx_m \\
&= 1 - 1 = 0.
\end{aligned} \tag{14}$$

It is easy to verify that the minimum is attained when p(x_1,...,x_m) = p(x_1)···p(x_m), i.e., when all the variables are independent of each other. As all the m−1 features have already been selected, this pairwise independence condition means that the mutual information between x_m and any selected feature x_i (i = 1, ..., m−1) is minimized. This is the Min-Redundancy criterion.
We can also derive the upper bound of the first term in (13), J(S_{m−1}, c, x_m). For simplicity, let us first show the upper bound of the general form J(y_1, ..., y_n), assuming there are n variables y_1, ..., y_n. This can be seen as follows:

$$\begin{aligned}
J(y_1,y_2,\dots,y_n) &= \int\!\cdots\!\int p(y_1,\dots,y_n)\,\log\frac{p(y_1,\dots,y_n)}{p(y_1)\cdots p(y_n)}\,dy_1\cdots dy_n \\
&= \int\!\cdots\!\int p(y_1,\dots,y_n)\,\log\frac{p(y_1|y_2,\dots,y_n)\,p(y_2|y_3,\dots,y_n)\cdots p(y_{n-1}|y_n)\,p(y_n)}{p(y_1)\cdots p(y_{n-1})\,p(y_n)}\,dy_1\cdots dy_n \\
&= \sum_{i=1}^{n-1} H(y_i) - H(y_1|y_2,\dots,y_n) - H(y_2|y_3,\dots,y_n) - \cdots - H(y_{n-1}|y_n) \\
&\le \sum_{i=1}^{n-1} H(y_i).
\end{aligned} \tag{15}$$

Equation (15) can easily be extended as

$$J(y_1,y_2,\dots,y_n) \le \min\left\{\sum_{i=2}^{n} H(y_i),\ \sum_{i=1,\,i\neq 2}^{n} H(y_i),\ \cdots,\ \sum_{i=1,\,i\neq n-1}^{n} H(y_i),\ \sum_{i=1}^{n-1} H(y_i)\right\}. \tag{16}$$

It is easy to verify that the maximum of J(y_1, ..., y_n) or, similarly, of the first term in (13), J(S_{m−1}, c, x_m), is attained when all variables are maximally dependent. When S_{m−1} has been fixed, this indicates that x_m and c should have the maximal dependency. This is the Max-Relevance criterion.

Therefore, according to (13), as a combination of Max-Relevance and Min-Redundancy, mRMR is equivalent to Max-Dependency for first-order selection. □

Note that the quantity J(·) in (9) and (10) has also been called "mutual information" for multiple scalar variables [10]. We have the following observations:

1. Minimizing J(S_m) only is equivalent to searching for mutually exclusive (independent) features.² This is insufficient for selecting highly discriminative features.
2. Maximizing J(S_m, c) only leads to Max-Relevance. Clearly, the difference between mRMR and Max-Relevance is rooted in the different definitions of dependency (in terms of mutual information). Equation (10) does not consider the joint effect of features on the target class. On the contrary, Max-Dependency ((2) and (3)) considers the dependency between the data distribution in the subspace R^m and the target class c. This difference is critical in many circumstances.
3. The equivalence between Max-Dependency and mRMR indicates that mRMR is an optimal first-order implementation scheme of Max-Dependency.
4. Compared to Max-Dependency, mRMR avoids the estimation of the multivariate densities p(x_1,...,x_m) and p(x_1,...,x_m,c). Instead, calculating the bivariate densities p(x_i, x_j) and p(x_i, c) could be much easier and more accurate. This also leads to a more efficient feature selection algorithm.

3 FEATURE SELECTION ALGORITHMS

Our goal is to design efficient algorithms to select a compact set of features. In Section 2, we propose a fast mRMR feature selection scheme (7). A remaining issue is how to determine the optimal number of features m. Since a mechanism to remove potentially redundant features from the already selected features has not been considered in the incremental selection, according to the idea of mRMR, we need to refine the results of incremental selection.

We present a two-stage feature selection algorithm. In the first stage, we find a candidate feature set using the mRMR incremental selection method. In the second stage, we use other more sophisticated schemes to search for a compact feature subset within the candidate feature set.

3.1 Selecting the Candidate Feature Set

To select the candidate feature set, we compute the cross-validation classification error for a large number of features and find a relatively stable range of small error. This range is called Ω. The optimal number of features (denoted as n*) of the candidate set is determined within Ω. The whole process includes three steps:

1. Use mRMR incremental selection (7) to select n (a preset large number) sequential features from the input X. This leads to n sequential feature sets S_1 ⊂ S_2 ⊂ ... ⊂ S_{n−1} ⊂ S_n.
2. Compare all the n sequential feature sets S_1, ..., S_k, ..., S_n (1 ≤ k ≤ n) to find the range of k, called Ω, within which the respective (cross-validation classification) error e_k is consistently small (i.e., has both small mean and small variance).
3. Within Ω, find the smallest classification error e* = min e_k. The optimal size of the
candidate featureset,nÃ,ischosenasthesmallest k thatcorrespondsto eÃ.3.2Selecting Compact Feature SubsetsMany sophisticated schemes can be used to search the compact feature subsets from the candidate set S nÃ.To illustrate that mRMR can produce better candidate features, which favors better combination with other methods,we use wrappers to search the compact feature subsets.A wrapper[15],[18]is a feature selector that convolves with a classifier(e.g.,naive Bayes classifier),with the direct goal to minimize the classification error of the particular ually,wrappers can yield high classification accuracy for a particular classifier at the cost of high computational complexity and less generalization of the selected features on other classifiers.This is different from the mRMR method introduced above,which does not optimize the classification error directly.The latter type of approach(e.g.,mRMR and Max-Relevance),sometimes called“filter”[18],[15],often selects features by testing whether some preset conditions about the features and the target class are satisfied.In practice,the filter approach has much lower complexity than wrappers;the features thus selected often yield comparable classification errors for different classifiers, because such features often form intrinsic clusters in the respective subspace.By using mRMR feature selection in the first-stage,we intend to find a small set of candidate features,in which the wrappers can be applied at a much lower cost in the second-stage.We will continue our discussion on this point in Section3.3.In this paper,we consider two selection schemes of wrapper,i.e.,the backward and forward selections:1.The backward selection tries to exclude one redun-dant feature at a time from the current feature set S k(initially,k is set to nÃobtained in Section3.1),with theconstraint that the resultant feature set S kÀ1leads to aclassification error e kÀ1no worse than e k.Becauseevery feature in S k can be considered in removal,thereare k different configurations of S kÀ1.For eachpossible configuration,the respective classificationerror e kÀ1is calculated.If,for every configuration,thecorresponding e kÀ1is larger than e k,there is no gain ineither classification accuracy or feature dimensionreduction(i.e.,every existing feature in S k appears tobe useful),thus,the backward selection terminates(accordingly,the size of the compact feature subset,m,is set to k).Otherwise,among the k configurations ofS kÀ1,the one that leads to the largest error reduction ischosen as the new feature set.If there are multipleconfigurations leading to the same error reduction,one of them is chosen randomly.This decrementalselection procedure is repeated until the terminationcondition is satisfied.2.The forward selection tries to select a subset ofm features from S nÃin an incremental manner.Initially,the classification error is set to the numberof samples,i.e.,N.The wrapper first searches for thefeature subset with one feature,denoted as Z1,byselecting the feature xÃ1that leads to the largest errorreduction.Then,from the set f S nÀZ1g,the wrapperselects the feature xÃ2so that the feature set Z2¼f Z1;xÃ2g leads to the largest error reduction.Thisincremental selection repeats until the classificationerror begins to increase,i.e.,e kþ1>e k.Note that we2.In the field of feature extraction,minimizing JðS mÞhas led to an algorithm of independent component analysis[10].allow the incremental search to continue when e kþ1equals e k,because we want to search a space as large aspossible.Once the 
termination condition is satisfied,the selected number of features,m,is chosen as thedimension for which the lowest error is first reached.For example,suppose the sequence of classificationerrors of the first six features is[10,8,4,4,4,7].Theforward selection will terminate at five features,butonly return the first three features as the result;in thisway we obtain a more compact set of features thatminimizes the error.3.3Characteristic Feature SpaceGiven two feature sets S1n and S2n both containing n features, and a classifierÀ,we say the feature space of S1n is more characteristic if the classification error(using classifierÀ)on S1n issmallerthanon S2n.Thisdefinitionofcharacteristic spacecan be extended recursively to the subsets(subspaces)of S1n andS2 n .Suppose we have a feature selection method F to generatea series of feature subsets in S1n:S11&S12&...&S1k&...&S1nÀ1&S1n,and,similarly,a series of subsets inS2 n :S21&...&S2k&...&S2n.We say S1n is recursively morecharacteristic(RM-characteristic)than S2n on the range ¼½k lower;k upper ð1k lower<k upper nÞ,if for every k2 , the classification error on S1k is consistently smaller than on S2k.To determine which one of the feature sets S1n and S2n is superior,it is often insufficient to compare the classification errors for a specific size of the feature sets.A better way is to observe which set is RM-characteristic for a reasonably large range .In the extreme case,we use ¼½1;n .Given two feature selection methods F1and F2,if,the feature sets generated by F1are RM-characteristic than those generated by F2,we believe the method F1is better than F2.Let us consider the following example to compare mRMR and Max-Relevance based on the concept of RM-characteristic feature space.As a comprehensive study,we consider both the sequential and nonsequential feature sets as follows(more details will be given in experiments):1.A direct comparison is to examine whether themRMR sequential feature sets are RM-characteristicthan Max-Relevance sequential feature sets.We useboth methods to select n sequential feature sets S1&...&S k&...&S n and compute the respectiveclassification errors.If,for most k2½1;n ,we obtainsmaller errors on mRMR feature sets,we canconclude that mRMR is better than Max-Relevancefor the sequential(or incremental)feature selection.2.We also use other feature selection methods(e.g.,wrappers)in the second stage of our feature-selectionalgorithm to probe whether mRMR is better than Max-Relevance for nonsequential feature sets.For example,for the mRMR and Max-Relevance candidate featuresets with nÃfeatures,we use the backward-selection-wrapper to produce two series of feature sets withk¼nÃÀ1;nÃÀ2;...;m features by removing somenonsequential features that are potentially redundant.Then,the respective classification errors of thesefeature sets are computed.If,for most k,we find themRMR nonsequential feature subset leads to lowererror,we conclude the mRMR candidate feature set is(approximately)RM-characteristic than the Max-Relevance candidate feature set.3.Both the forward and backward selections ofwrapper are used.Different classifiers(as discussedlater in Section4.2)are also considered in wrappers.We use both mRMR and Max-Relevance methods toselect the same number of candidate features andcompare the classification errors of the featuresubsets thereafter selected by wrappers.If all theobservations agree that the mRMR candidate featureset is RM-characteristic,we have high confidencethat mRMR is a superior feature selection method.4.Given two 
feature sets,if S1n is RM-characteristicthan S2n,then it is faithful to compare the lowesterrors obtained for the subsets of S1n and S2n.Clearly,for feature spaces containing the same number of features,wrappers can be applied more effectively on the space that is RM-characteristic.This also indicates that wrappers can be applied at a lower cost,by improving the characterizing strength of features and reducing the number of pre-selected features.In real situations,it might not be possible to obtain e1k<e2k for every k in .Hence,we can define a confidence score0 1to indicate the percentage of different k values for which the e1k<e2k condition is satisfied.For example,when ¼0:90 (90percent k-values correspond to the e1k<e2k condition),it is safe to claim that S1n is approximately RM-characteristic than S2n on .As can be seen in the experiments,usually this approximation is sufficient to compare two series of feature subsets.4I MPLEMENTATION I SSUESBefore presenting the experimental results in Section5,we discuss two implementation issues regarding the experi-ments:1)calculation of mutual information for both discrete and continuous data and2)multiple types of classifiers used in our experiments.4.1Mutual Information EstimationWe consider mutual-information-based feature selection for both discrete and continuous data.For discrete(categorical) feature variables,the integral operation in(1)reduces to summation.In this case,computing mutual information is straightforward,because both joint and marginal probabil-ity tables can be estimated by tallying the samples of categorical variables in the data.However,when at least one of variables x and y is continuous,their mutual information Iðx;yÞis hard to compute,because it is often difficult to compute the integral in the continuous space based on a limited number of samples. One solution is to incorporate data discretization as a preprocessing step.For some applications where it is unclear how to properly discretize the continuous data,an alternative solution is to use density estimation method(e.g.,Parzen windows)toapproximate Iðx;yÞ,assuggestedbyearlierwork in medical image registration[7]and feature selection[17].Given N samples of a variable x,the approximate density function^pðxÞhas the following form:^pðxÞ¼1X Ni¼1ðxÀxðiÞ;hÞ;ð17Þwhere ð:Þis the Parzen window function as explained below, xðiÞis the i th sample,and h is the window width.Parzen has。
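The passage on Parzen windows is cut off above, but the idea in (17) — approximate each density as an average of window functions centred on the samples and then plug the estimates into (1) — can be sketched as follows. This is an illustrative sketch under our own assumptions: the Gaussian kernel standing in for the window function δ(·), the evaluation of the estimates at the sample points, and all function names are ours, not the paper's.

```python
import numpy as np

def parzen_density(samples, points, h):
    """Gaussian-kernel Parzen estimate of a 1-D density at the given points."""
    samples = np.asarray(samples, dtype=float)
    points = np.asarray(points, dtype=float)
    diffs = points[:, None] - samples[None, :]
    kernel = np.exp(-0.5 * (diffs / h) ** 2) / (np.sqrt(2 * np.pi) * h)
    return kernel.mean(axis=1)

def parzen_mutual_information(x, y, h=0.25):
    """Rough I(x; y) estimate: average of log p(x,y)/(p(x)p(y)) over the samples."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    px = parzen_density(x, x, h)
    py = parzen_density(y, y, h)
    # product-kernel estimate of the joint density, evaluated at the sample points
    dx = x[:, None] - x[None, :]
    dy = y[:, None] - y[None, :]
    joint_kernel = np.exp(-0.5 * ((dx / h) ** 2 + (dy / h) ** 2)) / (2 * np.pi * h * h)
    pxy = joint_kernel.mean(axis=1)
    return float(np.mean(np.log(pxy / (px * py))))
```

The window width h plays the role noted in the text: too small a width makes the estimate noisy, too large a width smooths away the dependency between the variables.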
Neural Networks and Evolutionary Computation. Part II:Hybrid Approaches in the NeurosciencesGerhard WeißAbstract—This paper series focusses on the intersection of neural networks and evolutionary computation.It is ad-dressed to researchers from artificial intelligence as well as the neurosciences.Part II provides an overview of hybrid work done in the neurosciences,and surveys neuroscientific theories that are bridging the gap between neural and evolutionary computa-tion.According to these theories evolutionary mechanisms like mutation and selection act in real brains in somatic time and are fundamental to learning and developmental processes in biological neural networks.Keywords—Theory of evolutionary learning circuits,theory of selective stabilization of synapses,theory of selective sta-bilization of pre–representations,theory of neuronal group selection.I.IntroductionIn the neurosciences biological neural networks are in-vestigated at different organizational levels,including the molecular level,the level of individual synapses and neu-rons,and the level of whole groups of neurons(e.g.,[11, 36,44]).Several neuroscientific theories have been pro-posed which combine thefields of neural and evolution-ary computation at these different levels.These are the theory of evolutionary learning circuits,the theories of se-lective stabilization of synapses and pre–representations, and the theory of neuronal group selection.According to these theories,neural processes of learning and develop-ment strongly base on evolutionary mechanisms like mu-tation and selection.In other words,according to these theories evolutionary mechanisms play in real brains and nervous systems the same role in somatic time as they do in ecosystems in phylogenetic time.(Other neuroscientific work which is closely related to these theories is described in[28,46,47].)This paper overviews the hybrid work done in the neu-rosciences.Sections II to V survey the four evolutionary theories mentioned above.This includes a description of the major characteristics of these theories as well as a guide to relevant and related literature.Section VI concludes the paper with some general remarks on these theories and their relation to the hybrid approaches proposed in artifi-cial intelligence.The author is with the Institut f¨u r Informatik(H2),Technische Uni-versit¨a t M¨u nchen,D-80290M¨u nchen,Germany.II.The Theory of EvolutionaryLearning CircuitsAccording to the theory of evolutionary learning circuits(TELC for short)neural learning is viewed as the grad-ual modification of the information–processing capabilities of enzymatic neurons through a process of variation andselection in somatic time[12,13].In order to put this more precisely,first a closer look is taken at enzymatic neurons,and then the fundamental claims of the TELCare described.The TELC starts from the point of view that the brainis organized into various types of local networks which contain enzymatic neurons,that is,neurons whosefiringbehavior is controlled by enzymes called excitases.(For details of this control and its underlying biochemical pro-cesses see e.g.[14,15].)These neurons incorporate the principle of double dynamics[15]by operating at two levels of dynamics:at the level of readin or tactilization dynam-ics,the neural input patterns are transduced into chemical–concentration patterns inside the neuron;and at the level of readout dynamics,these chemical patterns are recognized by the excitases.Consequently,the enzymatic neurons themselves are endowed with 
powerful pattern–recognition capabilities where the excitases are the recognition primi-tives.Both levels of dynamics are gradually deformable as a consequence of the structure–function gradualism(“slight changes in the structure cause slight changes in the func-tion”)in the excitases.As Conrad pointed out,this struc-ture–function gradualism is the key to evolution and evo-lutionary learning in general,and is a important condition for evolutionary adaptability in particular.(Evolutionary adaptability is defined as the extent to which mechanisms of variation and selection can be utilized in order to survive in uncertain and unknown environments[16].)There are three fundamental claims made by the TESC: redundancy of brain tissue,specifity of neurons,and ex-istence of brain–internal selection circuits.According to the claim for redundany,there are many replicas of each type of local network.This means that the brain consists of local networks which are interchangeable in the sense that they are highly similar with respect to their connec-tivity and the properties of their neurons.The claim for specifity says that the excitases are capable of recognizing specific chemical patterns and,with that,cause the enzy-matic neurons tofire in response to specific input patterns. According to the third claim,the brain contains selectioncircuits which direct thefitness–oriented,gradual modifi-cation of the local networks’excitase configurations.These selection circuits include three systems:a testing system which allows to check the consequences(e.g.,pleasure or pain)of the outputs of one or several local networks for the organism;an evaluation system which assignsfitness values to the local networks on the basis of these consequences; and a growth–control system which stimulates or inhibits the production of the nucleic acids which code for the lo-cal networks’excitases on the basis of theirfitness values. The nucleic acids,whose variability is ensured by random somatic recombination and mutation processes,diffuse to neighbouring networks of the same type(where they per-form the same function because of the interchangeability property mentioned above).These claims imply that neu-ral learning proceeds by means of the gradual modification of the excitase configurations in the brain’s local networks through the repeated execution of the following evolution-ary learning cycle:1.Test and evaluation of the enzymatic neuron–basedlocal networks.As a result,afitness value is assigned to each network.2.Selection of local networks.This involves thefitness–oriented regulation of the production of the excitase–coding nucleic acids,as well as their spreading to ad-jacent interchangeable networks.3.Application of somatic recombination and selection tothese nucleic acids.This maintains the range of the excitase configurations.The execution stops when a local network is found which has a sufficiently highfitness.Conrad emphasized that this evolutionary learning cycle is much more efficient than nat-ural evolution because the selection circuits enable an in-tensive selection even if there is hardly a difference between thefitness values of the interchangeable networks. Finally,some references to related work.The TESC is part of extensive work focussing on the differences be-tween the information processing capabilities of biological (molecular)systems and conventional computers;see e.g. 
[15,16,17].A computational specification of the ESCM which concentrates on the pattern–processing capabilities of the enzymatic neurons,together with its sucessful appli-cation to a robot–control task,is contained in[29,30,31]. Another computational specification which concentrates on the intraneuronal dynamics of enzymatic neurons is de-scribed in[32].A combination of these two specifications is described in[33].Further related work being of particu-lar interest from a computational point of view is presented in[1,18].III.The Theory of Selective Stabilizationof SynapsesThe theory of selective stabilization of synapses(TSSS for short)is presented in[7,8].This theory accounts for neural processes of learning and development by postulat-ing that a somatic,evolutionary selection mechanism acts at the level of synapses and contributes to the wiring pat-tern in the adult brain.Subsequently the neurobiological basis and the major claims of the TSSS are depicted. The neurobiological basis of the TSSS comprises aspects of both neurogenesis and neurogenetics.In vertebrates one can distinguish several processes of brain development. These are the cellular processes of cell division,movement, adhesion,differentiation,and death,and the synaptic pro-cesses of connection formation and elimination.(For de-tails see e.g.[19,20,38].)The TSSS focusses on the“synap-tic aspect”of neurogenesis;it deals with the outgrowth and the stabilization of synapses,and takes the developmental stage where maximal synaptic wiring exists as its initial state.The neurogenetic attidue of the TSSS constitutes a compromise between the preformist(“specified–by–genes”) and the empirist(“specified–by–activity”)view of brain development.It is assumed that the genes involved in brain development,the so–called genetic envelope,only specify the invariant characters of the brain.This includes,in particular,the connections between the main categories of neurons(i.e.,between groups of neurons which are of the same morphological and biochemical type)and the rules of synaptic growth and stabilization.These rules allow for an activity–dependent,epigenetic synapse formation within the neuronal categories.(As Changeux formulated:“The genetic envelope offers a hazily outlined network,the ac-tivity defines its angles.”[3,p.193])The TSSS makes three major claims.First,at the crit-ical stage of maximal connectivity there is a significant but limited redundany within the neuronal categories as regards the specifity of the synapses.Second,at this time of so–called“structural redundany”any synapse may exist under(at least)three states of plasticity:labile,stable,and degenerate.Only the labile and stable synapses transmit nerve impulses,and the acceptable state transitions are those from labile to either stable or degenerate and from stable to labile.Especially,the state transition of a synapse is epigenetically regulated by all signals received by the postsynaptic soma during a given time interval.(The max-imal synaptic connectivity,the mechanisms of its develop-ment,and the regulative and integrative properties of the soma are determinate expressions of the genetic envelope.) 
Third,the total activity of the developing network leads to the selective stabilization of some synapses,and to the re-gression of their functional equivalents.As a consequence, structural redundancy decreases and neuronal singularity (i.e.,individual connectivity)increases.This provides a plausible explanation of the naturally occuring connection elimination occuring during neural development.For further readings in the TSSS see e.g.[4,5,6,10].IV.The Theory of Selective Stabilizationof Pre–RepresentationsThe theory of selective stabilization of pre–representa-tions(TSSP for short)can be viewed as an extension of the TSSS.This theory provides a selectionist view of neurallearning and development in the adult brain by postulating that somatic selection takes place at the level of neural networks[5,10,27].Similar to the theory of neuronal group selection(section V),the TSSP may be viewed as an attempt to show how neurobiology and psychology are related to each other.There are two major claims made by the TSSP.Thefirst claim is that there exist mental objects or“neural repre-sentations”in the brain.A mental object is defined as a physical state achieved by the correlated and transitory (both electrical and chemical)activity of a cell assembly consisting of a large number of neurons having different singularities.According to the TSSP,three classes of men-tal objects are distinguished.First,primary percepts;these are labile mental objects whose activation depends on the direct interaction with the outside world and is caused by sensory stimulations.Second,stored representations;these are memory objects whose evocation does not demand en-vironmental interaction and whose all–or–none activity be-havior results from a stable,cooperative coupling between the neurons.And third,pre–representations;these are mental objects which are generated before and concomitant with any environmental interaction.Pre–representations are labile and of great variety and variability;they result from the spontaneous but correlatedfiring of neurons or groups of neurons.The second claim made by the TSSP is that learning in the adult brain corresponds to the se-lective stabilization of pre–representations,that means,the transition from selected pre–representations to stored rep-resentations.This requires,in the simplest case,the inter-action with the environment,the criterion of selection is the resonance(i.e.,spatial overlapping orfiring in phase) between a primary percept and a pre–representation.Further literature on the TSSP.In[9]the two selective–stabilization theories,TSSS and TSSP,are embedded in more general considerations on the neural basis of cogni-tion.A formal model of neural learning and development on the basis of the TSSP is described in[22,43].V.The Theory of Neuronal Group SelectionThe theory of neuronal group selection(TNGS for short) or“neural Darwinism”[23,25]is the most rigorous and elaborate hybrid approach in the neurosciences.This the-ory,which has attracted much attention especially in the last few years,bridges the gap between biology and psy-chology by postulating that somatic selection is the key mechanism which establishes the connection between the structure and the function of the brain.As done in the preceding sections,below the major ideas of the TNGS are described.There are three basic claims.First,during prenatal and early postnatal development,primary repertoires of degen-erate neuronal groups were formed epigenetically by selec-tion.According to the TNGS a neuronal group is 
consid-ered as a local anatomical entity which consists of hundreds to thousands of strongly connected neurons,and degener-ate neuronal groups are groups that have different struc-tures but carry out the same function more or less well (they are nonisomorphic but isofunctional).The concept of degeneracy is fundamental to the TNGS;it implies both structural diversity and functional redundancy and,hence, ensures both a wide range of recognition and the reliabil-ity against the loss of neural tissue.Degeneracy naturally origins from the processes of brain development which are assumed to occur in an epigenetic manner and to elabo-rate several selective events at the cellular level.According to the regulator hypothesis,these complex developmental processes,as well as the selective events accompaning these processes,are guided by a relatively small number of cell adhesion molecules.Second,in the(postnatal)phase of be-havioral experience,a secondary repertoire of functioning neuronal groups is formed by selection among the preexist-ing groups of each primary repertoire.This group selection is accomplished by epigenetic modifications of the synap-tic strenghts without change of the connectivity pattern. According to the dual rules model,these modifications are realized by two synaptic rules that operate upon popula-tions of synapses in a parallel and independent fashion: a presynaptic rule which applies to long–term changes in the whole target neuron and which affects a large num-ber of synapses;and a postsynaptic rule which applies to short–term changes at individual synapses.The function-ing groups are more likely to respond to identical or simi-lar stimuli than the non–selected groups and,hence,con-tribute to the future behavior of the organism.A funda-mental operation of the functional groups is to compete for neurons that belong to other groups;this competition affects the groups’functional properties and is assumed to play a central role in the formation and organization of cerebral cortical maps.Third,reentry–phasic sig-naling over re–entrant(reciprocal and cyclic)connections between different repertoires,in particular between topo-graphic maps–allows for the spatiotemporal correlation of the responses of the repertoires at all levels in the brain. 
This kind of phasic signaling is viewed as an important mechanism supporting group selection and as being essen-tial both to categorization and the development of con-sciousness.Reentry implies two fundamental neural struc-tures:first,classification couples,that is,re–entrant reper-toires that can perform classifications more complex than a single involved repertoire could do;and second,global map-pings,that is,re-entrant repertoires that correlate sensory input and motor activity.Some brief notes on how the TNGS accounts for psy-chological functions.Following Edelman’s argumentation, categories do not exist apriori in the world(the world is “unlabeled”),and categorization is the fundamental prob-lem facing the nervous system.This problem is solved by means of group selection and reentry.Consequently,cat-egorization largely depends on the organism’s interaction with its environment and turns out to be the central neural operation required for all other operations.Based on this view of categorization,Edelman suggests that memory is “the enhanced ability to categorize or generalize associa-tively,not the storage of features or attributes of objects as a list”[25,p.241]and that learning,in the minimal case,is the“categorization of complexes of adaptive value under conditions of expectancy”[25,p.293].There is a large body of literature on the TNGS.The most detailed depiction of the theory is provided in Edel-man’s book[25].In order to be able to test the TNGS, several computer models have been constructed which em-body the theory’s major ideas.These models are Darwin I[24],Darwin II[26,39,25],and Darwin III[40,41].Re-views of the TNGS can be found in e.g.[21,34,35,42,37].VI.Concluding RemarksThis paper overviewed neuroscientific theories which view real brains as evolutionary systems or“Darwin machines”[2].This point of view is radically opposed to traditional in-structive theories which postulate that brain development is directed epigenetically during an organism’s interaction with its environment by rules for a more or less precise brain wiring.Nowadays most researchers agree that the instructive theories are very likely to be wrong und unre-alistic,and that the evolutionary theories offer interesting and plausible alternatives.In particular,there is an in-creasing number of neurobiological facts and observations described in the literature which indicate that evolutionary mechanisms(and in particular the mechanism of selection) as postulated by the evolutionary theories are indeed fun-damental to the neural processes in our brains.Somefinal notes on the relation between the hybrid work done in the neurosciences and the hybrid work done in artificial intelligence(see part I of this paper series[45]). 
Whereas the neuroscientific approaches aim at a better un-derstanding of the developmental and learning processes in real brains,the artificial intelligence approaches typically aim at the design of artificial neural networks that are ap-propriate for solving specific real-world tasks.Despite this fundamental difference and its implications,however,there are several aspects and questions which are elementary and significant to both the neuroscientific and the artificial in-telligence approaches:•Symbolic–subsymbolic intersection(e.g.,“What are the neural foundations of high-level,cognitive abili-ties like concept formation?”and“How are symbolic entities encoded in the neural tissue?”),•Brain wiring(e.g.,“What are the principles of neural development?”and“How are the structure and the function of neural networks related to each other?”),•Genetic encoding(e.g.,“How and to what extend are neural networks genetically encoded?”),and •Evolutionary modification(e.g.,“At what network level and at what time scale do evolutionary mechanisms operate?”and“In how far do the evolutionary mech-anisms influence the network structure?”).Because of this correspondence of interests and research topics it would be useful and stimulating for the neurosci-entific and the artificial intelligence community to be aware of each others hybrid work.This requires an increased in-terdisciplinary transparency.To offer such a transparency is a major intention of this paper series.References[1]Akingbehin,K.&Conrad,M.(1989).A hybrid architecture forprogrammable computing and evolutionary learning.Parallel Distributed Computing,6,245–263.[2]Calvin,W.H.(1987).The brain as a Darwin machine.Nature,330,pp.33–43.[3]Changeux,J.–P.(1980).Genetic determinism and epigenesisof the neuronal network:Is there a biological compromise be-tween Chomsky and Piaget?In M.Piatelli–Palmarini(Ed.), Language and learning–The debate between Jean Piaget and Noam Chomsky(pp.184–202).Routledge&Kegan Paul. [4]Changeux,J.–P.(1983).Concluding remarks:On the“singu-larity”of nerve cells and its ontogenesis.In J.–P.Changeux, J.Glowinski,M.Imbert,&F.E.Bloom(Eds.),Molecular and cellular interactions underlying higher brain function(pp.465–478).Elsevier Science Publ.[5]Changeux,J.–P.(1983).L’Homme neuronal.Fayard.[6]Changeux,J.–P.(1985).Remarks on the complexity of thenervous system and its ontogenesis.In J.Mehler&R.Fox (Eds.),Neonate cognition.Beyond the blooming buzzing con-fusion(pp.263–284).Lawrence Erlbaum.[7]Changeux,J.–P.,Courrege,P.,&Danchin,A.(1973).A theoryof the epigenesis of neuronal networks by selective stabilization of synapses.In Proceedings of the National Academy of Sciences USA,70(10),2974–2978.[8]Changeux,J.–P.,&Danchin,A.(1976).Selective stabilizationof developing synapses as a mechanism for the specification of neuronal networks.Nature,264,705–712.[9]Changeux,J.–P.,&Dehaene,S.(1989).Neuronal models ofcognitive functions.Cognition,33,63–109.[10]Changeux,J.–P.,Heidmann,T.,&Patte,P.(1984).Learningby selection.In P.Marler&H.S.Terrace(Eds.),The biology of learning(pp.115–133).Springer.[11]Changeux,J.–P.,&Konoshi,M.(Eds.).(1986).The neural andmolecular basis of learning.Springer.[12]Conrad,M.(1974).Evolutionary learning cicuits.Journal ofTheoretical Biology,46,167–188.[13]Conrad M.(1976).Complementary models of learning andmemory.BioSystems,8,119–138.[14]Conrad,M.(1984).Microscopic–macroscopic interface in bio-logical information processing.BioSystems,16,345–363. 
[15]Conrad,M.(1985).On design principles for a molecular com-munications of the ACM,28(5),464–480.[16]Conrad,M.(1988).The price of programmability.In R.Herken(Ed.),The universal Turing machine–A half–century survey (pp.285–307).Kammerer&Unverzagt.[17]Conrad,M.(1989).The brain–machine disanalogy.BioSystems,22,197–213.[18]Conrad,M.,Kampfner,R.R.,Kirby,K.G.,Rizki, E.N.,Schleis,G.,Smalz,R.,&Trenary,R.(1989).Towards an ar-tificial brain.BioSystems,23,175–218.[19]Cowan,W.M.(1978).Aspects of neural development.Interna-tional Reviews of Physiology,17,150–91.[20]Cowan,W.M.,Fawcett,J.W.,O’Leary,D.D.M.,&Stanfield,B.B.(1984).Regressive events in neurogenesis.Science,225,1258–1265.[21]Crick,F.(1989).Neural Edelmanism.Trends in Neurosciences,12,240–248.(Reply from G.N.Reeke,R.Michod and F.Crick: Trends in Neurosciences,13,11–14.)[22]Dehaene,S.,Changeux,J.–P.,&Nadal,J.–P.(1987).Neuralnetworks that learn temporal sequences by selection.In Proceed-ings of the National Academy of Sciences USA,84,pp.2727–2731.[23]Edelman,G.M.(1978).Group selection and phasic reentrantsignaling:A theory of higher brain function.In G.M.Edelman &V.B.Mountcastle(Eds.),The mindful brain.Cortical organi-zation and the group–selective theory of higher brain functions (pp.51–100).The MIT Press.[24]Edelman,G.M.(1981).Group selection as the basis for higherbrain function.In F.O.Schmitt,F.G.Worden,G.Adelman&S.G.Dennis(Eds.),The organization of the cerebral cortex (pp.535–563).The MIT Press.[25]Edelman,G.M.(1987).Neural Darwinism.The theory of neu-ronal group selection.Basic Books.[26]Edelman,G.M.,&Reeke,G.N.(1982).Selective networkscapable of representative transformations,limited generaliza-tions,and associative memory.In Proceedings of the National Academy of Sciences USA,79,2091–2095.[27]Heidmann, A.,Heidmann,T.M.,&Changeux,J.–P.(1984).Stabilisation selective de representations neuronales par reso-nance entre“presepresentations”spontanes du reseau cerebral et“percepts”.In C.R.Acad.Sci.Ser.III,299,839–844. 
[28]Jerne,N.K.(1967).Antibodies and learning:selection vs.in-struction.In G.C.Quarton,T.Melnechuk&F.O.Schmitt (Eds.),The neurosciences:a study program(pp.200–205).Rockefeller University Press.[29]Kampfner,R.R.(1988).Generalization in evolutionary learn-ing with enzymatic neuron–based systems.In M.Kochen&H.M.Hastings(Eds.),Advances in cognitive science.Steps to-ward convergence(pp.190–209).Westview Press,Inc.[30]Kampfner,R.R.,&Conrad,M.(1983).Computational model-ing of evolutionary learning processes in the brain.Bulletin of Mathematical Biology,45(6),931–968.[31]Kampfner,R.R.,&Conrad,M.(1983).Sequential behaviorand stability properties of enzymatic neuron networks.Bulletin of Mathematical Biology,45(6),969–980.[32]Kirby,K.G.,&Conrad,M.(1984).The enzymatic neuron asa reaction–diffusion network of cyclic nucleotides.Bulletin ofMathematical Biology,46,765–782.[33]Kirby,K.G.,&Conrad,M.(1986).Intraneuronal dynamics asa substrate for evolutionary learning.In Physica22D,205–215.[34]Michod,R.E.(1989).Darwinian selection in the brain.Evolu-tion,(3),694–696.[35]Nelson,R.J.(1989).Philosophical issues in Edelman’s neuralDarwinism.Journal of Experimental and Theoretical Artificial Intelligence,1,195–208.[36]Neuroscience Research(1986).Special issue3:Synaptic plastic-ity,memory and learning.[37]Patton,P.,&Parisi,T.(1989).Brains,computation,and selec-tion:An essay review of Gerald Edelman’s Neural Darwinism.Psychobiology,17(3),326–333.[38]Purves,D.,&Lichtman,J.W.(1985).Principles of neural de-velopment.Sinauer Associates Inc.[39]Reeke,G.N.jr.,&Edelman,G.M.(1984).Selective networksand recognition automata.In Annals of the New York Academy of Sciences,426(Special issue on computer culture),181–201.[40]Reeke,G.N.jr.,&Edelman,G.M.(1988).Real brains andartificial intelligence.Deadalus,117(1),143–173.[41]Reeke,G.N.jr.,Sporns,O.,&Edelman,G.M.(1988).Syn-thetic neural modeling:a Darwinian approach to brain theory.In R.Pfeifer,Z.Schreter,F.Fogelman–Soulie&L.Steels(Eds.), Connectionism in perspective.Elsevier.[42]Smoliar,S.(1989).Book review of[25].Artificial Intelligence,39,121–139.[43]Toulouse,G.,Dehaene,S.,&Changeux,J.–P.(1986).Spin glassmodels of learning by selection.In Proceedings of the National Academy of Sciences USA,83,1695–1698.[44]Trends in Neurosciences(1988).Vol.11(4),Special issue:Learn-ing,memory.[45]Weiß,G.(1993,submitted to IEEE World Congress on Compu-tational Intelligence).Neural networks and evolutionary com-putation.Part I:Hybrid approaches in artificial intelligence. [46]Young,J.Z.(1973).Memory as a selective process.In Aus-tralian Academy of Science Report:Symposium on Biological Memory;Australian Academy of Science(pp.25–45).[47]Young,J.Z.(1975).Sources of discovery in neuroscience.InF.G.Worden,J.P.Swazey&G.Edelman(Eds.),The neu-rosciences:Paths of discovery(pp.15–46).Oxford University Press.。
A Threshold Selection Method from Gray-Level Histograms

Abstract—A nonparametric and unsupervised method of automatic threshold selection for picture segmentation is presented. An optimal threshold is selected by a discriminant criterion, namely, so as to maximize the separability of the resultant classes in gray levels. The procedure is very simple, utilizing only the zeroth- and first-order cumulative moments of the gray-level histogram. It is straightforward to extend the method to multithreshold problems. Several experimental results are also presented to support the validity of the method.

I. INTRODUCTION

It is important in image processing to select an adequate threshold of gray level for extracting objects from their background. A variety of techniques have been proposed in this regard. In an ideal case, the histogram has a deep and sharp valley between two peaks representing objects and background, respectively, so that the threshold can be chosen at the bottom of this valley [1]. However, for most real pictures, it is often difficult to detect the valley bottom precisely, especially when the valley is flat and broad and corrupted by noise, or when the two peaks are extremely unequal in height, often producing no traceable valley.

Some techniques have been proposed in order to overcome these difficulties. They are, for example, the valley-sharpening technique [2], which restricts the histogram to the pixels with large absolute values of derivative (Laplacian or gradient), and the difference histogram method [3], which selects the threshold at the gray level with the maximal amount of difference. These utilize information concerning neighboring pixels (or edges) in the original picture to modify the histogram so as to make it useful for thresholding. Another class of methods deals directly with the gray-level histogram by parametric techniques. For example, the histogram is approximated in the least squares sense by a sum of Gaussian distributions, and statistical decision procedures are applied [4]. However, such a method requires considerably tedious and sometimes unstable calculations. Moreover, in many cases, the Gaussian distributions turn out to be a meager approximation of the real modes.

In any event, no "goodness" of threshold has been evaluated in most of the methods proposed so far. This implies that the right way to derive an optimal thresholding method would be to establish an appropriate criterion for evaluating the "goodness" of thresholds from a more general standpoint. In this correspondence, our discussion is confined to the elementary case of threshold selection in which only the gray-level histogram suffices, without other a priori knowledge. This case is important not only as a standard technique in image processing, but also as an essential element of unsupervised decision problems in pattern recognition. A new method is proposed from the viewpoint of discriminant analysis; it directly evaluates the "goodness" of thresholds and automatically selects an optimal threshold.
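For a gray-level histogram, the discriminant criterion described above amounts to choosing the threshold that maximizes the between-class variance computed from the zeroth- and first-order cumulative moments. The sketch below is an illustrative implementation of that idea rather than code from the correspondence; the function name and the 256-level assumption are ours.

```python
import numpy as np

def otsu_threshold(image, n_bins=256):
    """Select the gray level that maximizes the between-class variance of the histogram."""
    hist, _ = np.histogram(image, bins=n_bins, range=(0, n_bins))
    p = hist.astype(float) / hist.sum()          # normalized histogram

    omega = np.cumsum(p)                         # zeroth-order cumulative moments w(k)
    mu = np.cumsum(p * np.arange(n_bins))        # first-order cumulative moments mu(k)
    mu_total = mu[-1]                            # total mean gray level

    # between-class variance: sigma_B^2(k) = (mu_T * w(k) - mu(k))^2 / (w(k) * (1 - w(k)))
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b2 = (mu_total * omega - mu) ** 2 / (omega * (1.0 - omega))
    sigma_b2 = np.nan_to_num(sigma_b2)           # empty classes contribute a score of 0
    return int(np.argmax(sigma_b2))
```

Only cumulative sums over the histogram are needed, which is what makes the procedure "very simple" in the sense claimed by the abstract; extending it to several thresholds means maximizing the same criterion over tuples of candidate levels.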
A review of feature selection techniques in bioinformaticsAbstractFeature selection techniques have become an apparent need in many bioinformatics applications. In addition to the large pool of techniques that have already been developed in the machine learning and data mining fields, specific applications in bioinformatics have led to a wealth of newly proposed techniques.In this article, we make the interested reader aware of the possibilities of feature selection, providing a basic taxonomy of feature selection techniques, and discussing their use, variety and potential in a number of both common as well as upcoming bioinformatics applications.1 INTRODUCTIONDuring the last decade, the motivation for applying feature selection (FS) techniques in bioinformatics has shifted from being an illustrative example to becoming a real prerequisite for model building. In particular, the high dimensional nature of many modelling tasks in bioinformatics, going from sequence analysis over microarray analysis to spectral analyses and literature mining has given rise to a wealth of feature selection techniques being presented in the field.In this review, we focus on the application of feature selection techniques. In contrast to other dimensionality reduction techniques like those based on projection (e.g. principal component analysis) or compression (e.g. using information theory), feature selection techniques do not alter the original representation of the variables, but merely select a subset of them. Thus, they preserve the original semantics of the variables, hence, offering the advantage of interpretability by a domain expert.While feature selection can be applied to both supervised and unsupervised learning, we focus here on the problem of supervised learning (classification), where the class labels are known beforehand. The interesting topic of feature selection for unsupervised learning (clustering) is a more complex issue, and research into this field is recently getting more attention in several communities (Liu and Yu, 2005; Varshavsky et al., 2006).The main aim of this review is to make practitioners aware of the benefits, and in some cases even the necessity of applying feature selection techniques. Therefore, we provide an overview of the different feature selection techniques for classification: we illustrate them by reviewing the most important application fields in the bioinformatics domain, highlighting the efforts done by the bioinformatics community in developing novel and adapted procedures. Finally, we also point the interested reader to some useful data mining and bioinformatics software packages that can be used for feature selection.Previous SectionNext Section2 FEATURE SELECTION TECHNIQUESAs many pattern recognition techniques were originally not designed to cope with large amounts of irrelevant features, combining them with FS techniques has become a necessity in many applications (Guyon and Elisseeff, 2003; Liu and Motoda, 1998; Liu and Yu, 2005). The objectives of feature selection are manifold, the most important ones being: (a) to avoid overfitting andimprove model performance, i.e. prediction performance in the case of supervised classification and better cluster detection in the case of clustering, (b) to provide faster and more cost-effective models and (c) to gain a deeper insight into the underlying processes that generated the data. 
However, the advantages of feature selection techniques come at a certain price, as the search for a subset of relevant features introduces an additional layer of complexity in the modelling task. Instead of just optimizing the parameters of the model for the full feature subset, we now need to find the optimal model parameters for the optimal feature subset, as there is no guarantee that the optimal parameters for the full feature set are equally optimal for the optimal feature subset (Daelemans et al., 2003). As a result, the search in the model hypothesis space is augmented by another dimension: the one of finding the optimal subset of relevant features. Feature selection techniques differ from each other in the way they incorporate this search in the added space of feature subsets in the model selection.

In the context of classification, feature selection techniques can be organized into three categories, depending on how they combine the feature selection search with the construction of the classification model: filter methods, wrapper methods and embedded methods. Table 1 provides a common taxonomy of feature selection methods, showing for each technique the most prominent advantages and disadvantages, as well as some examples of the most influential techniques.

Table 1. A taxonomy of feature selection techniques. For each feature selection type, we highlight a set of characteristics which can guide the choice for a technique suited to the goals and resources of practitioners in the field.

Filter techniques assess the relevance of features by looking only at the intrinsic properties of the data. In most cases a feature relevance score is calculated, and low-scoring features are removed. Afterwards, this subset of features is presented as input to the classification algorithm. Advantages of filter techniques are that they easily scale to very high-dimensional datasets, they are computationally simple and fast, and they are independent of the classification algorithm. As a result, feature selection needs to be performed only once, and then different classifiers can be evaluated.

A common disadvantage of filter methods is that they ignore the interaction with the classifier (the search in the feature subset space is separated from the search in the hypothesis space), and that most proposed techniques are univariate. This means that each feature is considered separately, thereby ignoring feature dependencies, which may lead to worse classification performance when compared to other types of feature selection techniques. In order to overcome the problem of ignoring feature dependencies, a number of multivariate filter techniques were introduced, aiming at the incorporation of feature dependencies to some degree.

Whereas filter techniques treat the problem of finding a good feature subset independently of the model selection step, wrapper methods embed the model hypothesis search within the feature subset search. In this setup, a search procedure in the space of possible feature subsets is defined, and various subsets of features are generated and evaluated. The evaluation of a specific subset of features is obtained by training and testing a specific classification model, rendering this approach tailored to a specific classification algorithm. To search the space of all feature subsets, a search algorithm is then 'wrapped' around the classification model.
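A compact sketch of the two setups just described, under the assumption that scikit-learn and NumPy are available (the ANOVA score, the k-NN classifier and the greedy forward search are illustrative choices of mine, not prescriptions from the review): the filter scores each feature once from the data alone, while the wrapper scores whole subsets by training and testing a chosen classifier.

import numpy as np
from sklearn.feature_selection import f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 30))
y = (X[:, 3] + X[:, 7] > 0).astype(int)        # two informative features

# Filter: score features individually, keep the top k
scores, _ = f_classif(X, y)                    # ANOVA F score per feature
top_k = np.argsort(scores)[::-1][:5]
print("filter picks:", sorted(top_k))

# Wrapper: evaluate candidate subsets with a specific classifier
def subset_score(features):
    """Cross-validated accuracy of a classifier trained on the given subset."""
    clf = KNeighborsClassifier(n_neighbors=3)
    return cross_val_score(clf, X[:, features], y, cv=5).mean()

# Greedy (sequential forward) search 'wrapped' around the classifier
selected, remaining = [], list(range(X.shape[1]))
for _ in range(5):
    best = max(remaining, key=lambda f: subset_score(selected + [f]))
    selected.append(best)
    remaining.remove(best)
print("wrapper picks:", sorted(selected), "accuracy:", round(subset_score(selected), 3))

The loop makes the computational contrast obvious: the filter calls its scoring function once per feature, while the wrapper retrains and re-evaluates the classifier for every candidate subset.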
However, as the space of feature subsets grows exponentially with the number of features, heuristic search methods are used to guide the search for an optimal subset. These search methods can be divided into two classes: deterministic and randomized search algorithms. Advantages of wrapper approaches include the interaction between feature subset search and model selection, and the ability to take into account feature dependencies. A common drawback of these techniques is that they have a higher risk of overfitting than filter techniques and are very computationally intensive, especially if building the classifier has a high computational cost.

In a third class of feature selection techniques, termed embedded techniques, the search for an optimal subset of features is built into the classifier construction, and can be seen as a search in the combined space of feature subsets and hypotheses. Just like wrapper approaches, embedded approaches are thus specific to a given learning algorithm. Embedded methods have the advantage that they include the interaction with the classification model, while at the same time being far less computationally intensive than wrapper methods.

3 APPLICATIONS IN BIOINFORMATICS

3.1 Feature selection for sequence analysis

Sequence analysis has a long-standing tradition in bioinformatics. In the context of feature selection, two types of problems can be distinguished: content and signal analysis. Content analysis focuses on the broad characteristics of a sequence, such as tendency to code for proteins or fulfillment of a certain biological function. Signal analysis, on the other hand, focuses on the identification of important motifs in the sequence, such as gene structural elements or regulatory elements.

Apart from the basic features that just represent the nucleotide or amino acid at each position in a sequence, many other features, such as higher-order combinations of these building blocks (e.g. k-mer patterns), can be derived, their number growing exponentially with the pattern length k. As many of them will be irrelevant or redundant, feature selection techniques are then applied to focus on the subset of relevant variables.

3.1.1 Content analysis

The prediction of subsequences that code for proteins (coding potential prediction) has been a focus of interest since the early days of bioinformatics. Because many features can be extracted from a sequence, and most dependencies occur between adjacent positions, many variations of Markov models were developed. To deal with the high number of possible features, and the often limited amount of samples, Salzberg et al. (1998) introduced the interpolated Markov model (IMM), which used interpolation between different orders of the Markov model to deal with small sample sizes, and a filter method (χ2) to select only relevant features. In further work, Delcher et al. (1999) extended the IMM framework to also deal with non-adjacent feature dependencies, resulting in the interpolated context model (ICM), which crosses a Bayesian decision tree with a filter method (χ2) to assess feature relevance. Recently, the avenue of FS techniques for coding potential prediction was further pursued by Saeys et al. (2007), who combined different measures of coding potential prediction, and then used the Markov blanket multivariate filter approach (MBF) to retain only the relevant ones.

A second class of techniques focuses on the prediction of protein function from sequence. The early work of Chuzhanova et al.
(1998), who used a genetic algorithm in combination with the Gamma test to score feature subsets for classification of large subunits of rRNA, inspired researchers to use FS techniques to focus on important subsets of amino acids that relate to the protein's functional class (Al-Shahib et al., 2005). An interesting technique is described in Zavaljevsky et al. (2002), using selective kernel scaling for support vector machines (SVM) as a way to assess feature weights, and subsequently remove features with low weights.

The use of FS techniques in the domain of sequence analysis is also emerging in a number of more recent applications, such as the recognition of promoter regions (Conilione and Wang, 2005), and the prediction of microRNA targets (Kim et al., 2006).

3.1.2 Signal analysis

Many sequence analysis methodologies involve the recognition of short, more or less conserved signals in the sequence, representing mainly binding sites for various proteins or protein complexes. A common approach to find regulatory motifs is to relate motifs to gene expression levels using a regression approach. Feature selection can then be used to search for the motifs that maximize the fit to the regression model (Keles et al., 2002; Tadesse et al., 2004). In Sinha (2003), a classification approach is chosen to find discriminative motifs. The method is inspired by Ben-Dor et al. (2000), who use the threshold number of misclassification (TNoM, see further in the section on microarray analysis) to score genes for relevance to tissue classification. From the TNoM score, a P-value is calculated that represents the significance of each motif. Motifs are then sorted according to their P-value.

Another line of research is performed in the context of the gene prediction setting, where structural elements such as the translation initiation site (TIS) and splice sites are modelled as specific classification problems. The problem of feature selection for structural element recognition was pioneered in Degroeve et al. (2002) for the problem of splice site prediction, combining a sequential backward method together with an embedded SVM evaluation criterion to assess feature relevance. In Saeys et al. (2004), an estimation of distribution algorithm (EDA, a generalization of genetic algorithms) was used to gain more insight into the relevant features for splice site prediction. Similarly, the prediction of TIS is a suitable problem to apply feature selection techniques. In Liu et al. (2004), the authors demonstrate the advantages of using feature selection for this problem, using the feature-class entropy as a filter measure to remove irrelevant features.

In future research, FS techniques can be expected to be useful for a number of challenging prediction tasks, such as identifying relevant features related to alternative splice sites and alternative TIS.

3.2 Feature selection for microarray analysis

During the last decade, the advent of microarray datasets stimulated a new line of research in bioinformatics. Microarray data pose a great challenge for computational techniques, because of their large dimensionality (up to several tens of thousands of genes) and their small sample sizes (Somorjai et al., 2003).
Furthermore, additional experimental complications like noise and variability render the analysis of microarray data an exciting domain.

In order to deal with these particular characteristics of microarray data, the obvious need for dimension reduction techniques was realized (Alon et al., 1999; Ben-Dor et al., 2000; Golub et al., 1999; Ross et al., 2000), and soon their application became a de facto standard in the field. Whereas in 2001 the field of microarray analysis was still claimed to be in its infancy (Efron et al., 2001), a considerable and valuable effort has since been done to contribute new and adapt known FS methodologies (Jafari and Azuaje, 2006). A general overview of the most influential techniques, organized according to the general FS taxonomy of Section 2, is shown in Table 2.

Table 2. Key references for each type of feature selection technique in the microarray domain.

3.2.1 The univariate filter paradigm: simple yet efficient

Because of the high dimensionality of most microarray analyses, fast and efficient FS techniques such as univariate filter methods have attracted most attention. The prevalence of these univariate techniques has dominated the field, and up to now comparative evaluations of different classification and FS techniques over DNA microarray datasets have only focused on the univariate case (Dudoit et al., 2002; Lee et al., 2005; Li et al., 2004; Statnikov et al., 2005). This domination of the univariate approach can be explained by a number of reasons:

- the output provided by univariate feature rankings is intuitive and easy to understand;
- the gene ranking output could fulfill the objectives and expectations that bio-domain experts have when wanting to subsequently validate the result by laboratory techniques or in order to explore literature searches; the experts may then not feel the need for selection techniques that take into account gene interactions;
- the possible unawareness of subgroups of gene expression domain experts about the existence of data analysis techniques to select genes in a multivariate way;
- the extra computation time needed by multivariate gene selection techniques.

Some of the simplest heuristics for the identification of differentially expressed genes include setting a threshold on the observed fold-change differences in gene expression between the states under study, and the detection of the threshold point in each gene that minimizes the number of training sample misclassifications (threshold number of misclassification, TNoM (Ben-Dor et al., 2000)). However, a wide range of new or adapted univariate feature ranking techniques has since then been developed. These techniques can be divided into two classes: parametric and model-free methods (see Table 2).

Parametric methods assume a given distribution from which the samples (observations) have been generated. The two-sample t-test and ANOVA are among the most widely used techniques in microarray studies, although the usage of their basic form, possibly without justification of their main assumptions, is not advisable (Jafari and Azuaje, 2006). Modifications of the standard t-test to better deal with the small sample size and inherent noise of gene expression datasets include a number of t- or t-test-like statistics (differing primarily in the way the variance is estimated) and a number of Bayesian frameworks (Baldi and Long, 2001; Fox and Dimmic, 2006).
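A minimal sketch of this univariate parametric paradigm, assuming SciPy and NumPy are available and an expression matrix with genes in columns (all names and data here are invented): each gene is scored independently with a two-sample t-test and genes are ranked by the resulting P-value.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_genes = 1000
expr = rng.normal(size=(40, n_genes))            # 40 samples x 1000 genes
labels = np.array([0] * 20 + [1] * 20)           # two phenotypic groups
expr[labels == 1, :25] += 1.5                    # 25 truly differential genes

# Welch t-test per gene (unequal variances), computed column-wise in one call
t_stat, p_val = stats.ttest_ind(expr[labels == 0], expr[labels == 1],
                                axis=0, equal_var=False)

ranking = np.argsort(p_val)                      # most significant genes first
print("top 10 ranked genes:", ranking[:10])

In practice the plain t-statistic would be replaced by one of the moderated or Bayesian variants cited above, but the gene-by-gene ranking logic stays the same.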
Although Gaussian assumptions have dominated the field, other types of parametric approaches can also be found in the literature, such as regression modelling approaches (Thomas et al., 2001) and Gamma distribution models (Newton et al., 2001).

Due to the uncertainty about the true underlying distribution of many gene expression scenarios, and the difficulties in validating distributional assumptions because of small sample sizes, non-parametric or model-free methods have been widely proposed as an attractive alternative to make less stringent distributional assumptions (Troyanskaya et al., 2002). Many model-free metrics, frequently borrowed from the statistics field, have demonstrated their usefulness in many gene expression studies, including the Wilcoxon rank-sum test (Thomas et al., 2001), the between-within classes sum of squares (BSS/WSS) (Dudoit et al., 2002) and the rank products method (Breitling et al., 2004).

A specific class of model-free methods estimates the reference distribution of the statistic using random permutations of the data, allowing the computation of a model-free version of the associated parametric tests. These techniques have emerged as a solid alternative to deal with the specificities of DNA microarray data, and do not depend on strong parametric assumptions (Efron et al., 2001; Pan, 2003; Park et al., 2001; Tusher et al., 2001). Their permutation principle partly alleviates the problem of small sample sizes in microarray studies, enhancing the robustness against outliers.

We also mention promising types of non-parametric metrics which, instead of trying to identify differentially expressed genes at the whole population level (e.g. comparison of sample means), are able to capture genes which are significantly dysregulated in only a subset of samples (Lyons-Weiler et al., 2004; Pavlidis and Poirazi, 2006). These types of methods offer a more patient-specific approach for the identification of markers, and can select genes exhibiting complex patterns that are missed by metrics that work under the classical comparison of two prelabelled phenotypic groups. In addition, we also point out the importance of procedures for controlling the different types of errors that arise in this complex multiple testing scenario of thousands of genes (Dudoit et al., 2003; Ploner et al., 2006; Pounds and Cheng, 2004; Storey, 2002), with a special focus on contributions for controlling the false discovery rate (FDR).

3.2.2 Towards more advanced models: the multivariate paradigm for filter, wrapper and embedded techniques

Univariate selection methods have certain restrictions and may lead to less accurate classifiers by, e.g. not taking into account gene-gene interactions. Thus, researchers have proposed techniques that try to capture these correlations between genes.

The application of multivariate filter methods ranges from simple bivariate interactions (Bø and Jonassen, 2002) towards more advanced solutions exploring higher order interactions, such as correlation-based feature selection (CFS) (Wang et al., 2005; Yeoh et al., 2002) and several variants of the Markov blanket filter method (Gevaert et al., 2006; Mamitsuka, 2006; Xing et al., 2001).
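Before moving on to the multivariate techniques, here is a schematic sketch of the permutation-based, model-free scoring and FDR control discussed above (NumPy only; the mean-difference statistic, the permutation count and the data are illustrative, and this is a generic stand-in for the cited methods rather than a reimplementation of any of them).

import numpy as np

rng = np.random.default_rng(0)
expr = rng.normal(size=(30, 200))                 # 30 samples x 200 genes
labels = np.array([0] * 15 + [1] * 15)
expr[labels == 1, :10] += 2.0                     # 10 truly differential genes

def mean_diff(e, lab):
    return np.abs(e[lab == 1].mean(axis=0) - e[lab == 0].mean(axis=0))

observed = mean_diff(expr, labels)

# Null distribution by permuting the class labels
n_perm = 1000
exceed = np.zeros(expr.shape[1])
for _ in range(n_perm):
    perm = rng.permutation(labels)
    exceed += mean_diff(expr, perm) >= observed
p_val = (exceed + 1) / (n_perm + 1)               # add-one to avoid zero P-values

# Benjamini-Hochberg style cut-off: largest k with p_(k) <= (k/m) * q
q = 0.05
order = np.argsort(p_val)
m = len(p_val)
passed = p_val[order] <= (np.arange(1, m + 1) / m) * q
n_sig = passed.nonzero()[0].max() + 1 if passed.any() else 0
selected = order[:n_sig]
print("genes called significant at FDR", q, ":", np.sort(selected))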
The Minimum Redundancy-Maximum Relevance (MRMR) (Ding and Peng, 2003) and Uncorrelated Shrunken Centroid (USC) (Yeung and Bumgarner, 2003) algorithms are two other solid multivariate filter procedures, highlighting the advantage of using multivariate methods over univariate procedures in the gene expression domain.

Feature selection using wrapper or embedded methods offers an alternative way to perform a multivariate gene subset selection, incorporating the classifier's bias into the search and thus offering an opportunity to construct more accurate classifiers. In the context of microarray analysis, most wrapper methods use population-based, randomized search heuristics (Blanco et al., 2004; Jirapech-Umpai and Aitken, 2005; Li et al., 2001; Ooi and Tan, 2003), although a few examples also use sequential search techniques (Inza et al., 2004; Xiong et al., 2001). An interesting hybrid filter-wrapper approach is introduced in Ruiz et al. (2006), crossing a univariately pre-ordered gene ranking with an incrementally augmenting wrapper method.

Another characteristic of any wrapper procedure concerns the scoring function used to evaluate each gene subset found. As the 0-1 accuracy measure allows for comparison with previous works, the vast majority of papers uses this measure. However, recent proposals advocate the use of methods for the approximation of the area under the ROC curve (Ma and Huang, 2005), or the optimization of the LASSO (Least Absolute Shrinkage and Selection Operator) model (Ghosh and Chinnaiyan, 2005). ROC curves certainly provide an interesting evaluation measure, especially suited to the demand for screening different types of errors in many biomedical scenarios.

The embedded capacity of several classifiers to discard input features and thus propose a subset of discriminative genes has been exploited by several authors. Examples include the use of random forests (a classifier that combines many single decision trees) in an embedded way to calculate the importance of each gene (Díaz-Uriarte and Alvarez de Andrés, 2006; Jiang et al., 2004). Another line of embedded FS techniques uses the weights of each feature in linear classifiers, such as SVMs (Guyon et al., 2002) and logistic regression (Ma and Huang, 2005). These weights are used to reflect the relevance of each gene in a multivariate way, and thus allow for the removal of genes with very small weights.

Partially due to the higher computational complexity of wrapper and, to a lesser degree, embedded approaches, these techniques have not received as much interest as filter proposals. However, an advisable practice is to pre-reduce the search space using a univariate filter method, and only then apply wrapper or embedded methods, hence fitting the computation time to the available resources.

3.3 Mass spectra analysis

Mass spectrometry technology (MS) is emerging as a new and attractive framework for disease diagnosis and protein-based biomarker profiling (Petricoin and Liotta, 2003). A mass spectrum sample is characterized by thousands of different mass/charge (m/z) ratios on the x-axis, each with their corresponding signal intensity value on the y-axis. A typical MALDI-TOF low-resolution proteomic profile can contain up to 15,500 data points in the spectrum between 500 and 20,000 m/z, and the number of points grows even further with higher resolution instruments.

For data mining and bioinformatics purposes, it can initially be assumed that each m/z ratio represents a distinct variable whose value is the intensity.
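A toy sketch of this representation (NumPy only; the m/z grid, the bin width and the synthetic spectra are my own assumptions): each point on the m/z axis is first treated as its own feature, and a crude fixed-width binning step then stands in for the far more elaborate peak detection and alignment procedures discussed next, shrinking the feature pool before any supervised selection is applied.

import numpy as np

rng = np.random.default_rng(0)
mz = np.linspace(500, 20000, 15000)                # m/z grid of a low-resolution profile
spectra = rng.gamma(2.0, 1.0, size=(60, mz.size))  # 60 samples, intensity per m/z point

# Representation 1: every m/z point is a feature, 15 000 variables per sample
X_raw = spectra
print("raw feature matrix:", X_raw.shape)

# Representation 2: aggregate into fixed-width m/z bins (a stand-in for peak
# detection), keeping the maximum intensity per bin as the feature value
bin_width = 50.0
bin_idx = ((mz - mz[0]) // bin_width).astype(int)
n_bins = bin_idx.max() + 1
X_binned = np.zeros((spectra.shape[0], n_bins))
for b in range(n_bins):
    X_binned[:, b] = spectra[:, bin_idx == b].max(axis=1)
print("binned feature matrix:", X_binned.shape)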
As Somorjai et al. (2003) explain, the data analysis step is severely constrained by both high-dimensional input spaces and their inherent sparseness, just as is the case with gene expression datasets. Although the amount of publications on mass spectrometry based data mining is not comparable to the level of maturity reached in the microarray analysis domain, an interesting collection of methods has been presented in the last 4-5 years (see Hilario et al., 2006; Shin and Markey, 2006 for recent reviews) since the pioneering work of Petricoin et al. (2002).

Starting from the raw data, and after an initial step to reduce noise and normalize the spectra from different samples (Coombes et al., 2007), the following crucial step is to extract the variables that will constitute the initial pool of candidate discriminative features. Some studies employ the simplest approach of considering every measured value as a predictive feature, thus applying FS techniques over initial huge pools of about 15,000 variables (Li et al., 2004; Petricoin et al., 2002), up to around 100,000 variables (Ball et al., 2002). On the other hand, a great deal of the current studies perform aggressive feature extraction procedures using elaborate peak detection and alignment techniques (see Coombes et al., 2007; Hilario et al., 2006; Shin and Markey, 2006 for a detailed description of these techniques). These procedures tend to reduce the dimensionality from which supervised FS techniques will start their work to fewer than 500 variables (Bhanot et al., 2006; Ressom et al., 2007; Tibshirani et al., 2004). A feature extraction step is thus advisable to bring the computational cost of many FS techniques down to a feasible size in these MS scenarios.

Table 3 presents an overview of FS techniques used in the domain of mass spectrometry. Similar to the domain of microarray analysis, univariate filter techniques seem to be the most common techniques used, although the use of embedded techniques is certainly emerging as an alternative. Although the t-test maintains a high level of popularity (Liu et al., 2002; Wu et al., 2003), other parametric measures such as the F-test (Bhanot et al., 2006), and a notable variety of non-parametric scores (Tibshirani et al., 2004; Yu et al., 2005), have also been used in several MS studies. Multivariate filter techniques, on the other hand, are still somewhat underrepresented (Liu et al., 2002; Prados et al., 2004).

Table 3. Key references for each type of feature selection technique in the domain of mass spectrometry.

Wrapper approaches have demonstrated their usefulness in MS studies by a group of influential works. Different types of population-based randomized heuristics are used as search engines in the major part of these papers: genetic algorithms (Li et al., 2004; Petricoin et al., 2002), particle swarm optimization (Ressom et al., 2005) and ant colony procedures (Ressom et al., 2007). It is worth noting that while the first two references start the search procedure in about 15,000 dimensions by considering each m/z ratio as an initial predictive feature, aggressive peak detection and alignment processes reduce the initial dimension to about 300 variables in the last two references (Ressom et al., 2005; Ressom et al., 2007).

An increasing number of papers uses the embedded capacity of several classifiers to discard input features. Variations of the popular method originally proposed for gene expression domains by Guyon et al.
(2002), using the weights of the variables in the SVM formulation to discard features with small weights, have been broadly and successfully applied in the MS domain (Jong et al., 2004; Prados et al., 2004; Zhang et al., 2006). Based on a similar framework, the weights of the input masses in a neural network classifier have been used to rank the features' importance in Ball et al. (2002). The embedded capacity of random forests (Wu et al., 2003) and other types of decision tree-based algorithms (Geurts et al., 2005) constitutes an alternative embedded FS strategy.

4 DEALING WITH SMALL SAMPLE DOMAINS

Small sample sizes, and their inherent risk of imprecision and overfitting, pose a great challenge for many modelling problems in bioinformatics (Braga-Neto and Dougherty, 2004; Molinaro et al., 2005; Sima and Dougherty, 2006). In the context of feature selection, two initiatives have emerged in response to this novel experimental situation: the use of adequate evaluation criteria, and the use of stable and robust feature selection models.

4.1 Adequate evaluation criteria
NeuroQLab – A Software Assistant for Neurosurgical Planning and Quantitative Image Analysis

Florian Weiler1, Jan Rexilius, Jan Klein1, Horst K. Hahn1
1 Fraunhofer MEVIS, Universitätsallee 29, 28359 Bremen, Germany
florian.weiler@mevis.fraunhofer.de, jan.rexilius@, jan.klein@mevis.fraunhofer.de, horst.hahn@mevis.fraunhofer.de

Abstract. Neuroimaging techniques produce large amounts of data capable of displaying a wide variety of structural and functional properties of the brain. A large number of specialized image analysis and visualization tools exist that aim at helping the physician in analyzing and dealing with the data. We present a flexible and extendible software assistant covering a number of typically required tools for evaluating neuroimaging studies. It comprises tools for preprocessing tasks such as registration, skull-stripping or non-uniformity normalization, as well as dedicated packages for quantitative analysis of anatomical images, a toolkit for DTI analysis and a tool for analyzing fMRI studies. The software assistant is built upon an established platform for rapid prototyping, which facilitates fast integration of new features by user request as well as the adaptation of existing features to concrete clinical questions. In this paper, a brief overview of the basic underlying software architecture is given, accompanied by a presentation of selected tools offered by the software.

1. Overview

Neuroimaging is a continuously evolving medical discipline that covers the large variety of imaging techniques required for all medical fields associated with the brain and its pathologies, which includes at least neurology, neuroradiology, neurosurgery and neuropsychology. Despite the different nature of these medical fields, the required tools for image analysis and data visualization techniques for the underlying data are often similar. Common processing tasks include registration of different modalities or imaging sequences to a reference image, as well as segmentation of prominent structures such as the brain, the ventricular system or the vascular system from different datasets. Additionally, since there usually exists a number of different datasets for a given patient which show different properties of the same brain, a commonly desired feature is the possibility to combine multiple available datasets in such a manner that the relations between individual properties of the brain become clear. For example, the fusion of activation areas determined by functional magnetic resonance imaging (fMRI) with white matter fiber tracts derived from diffusion tensor imaging (DTI) may enhance the neurosurgical planning process.

These circumstances strongly motivate the desire for software assistants capable of dealing with the available data in a flexible manner by providing both basic tools for general processing tasks as well as dedicated tools custom-tailored to specific clinical problems. In this paper, we give an overview of NeuroQLab, a software assistant aiming to offer all the basic tools typically required for processing and working with MR images of the brain, as well as a number of tools specifically targeted at certain clinical problems. Its primary purpose is to facilitate the application of state-of-the-art technologies and research results from the field of medical image processing in clinical trials and studies.
2. Underlying architecture

The software assistant is built upon a dedicated application framework designed specifically for rapid prototyping of clinically applicable software assistants with a focus on research and study evaluation. The framework, which is built on top of MeVisLab [MeV09], consists of multiple hierarchically structured layers providing a number of fundamental functionalities for dealing with medical image datasets, handling patient information and storing processing results. Additionally, a modular plug-in concept allows for integration of packages dedicated to concrete clinical questions. An additional communication layer allows for exchanging events and data between different applications, enabling reuse and combination of results stemming from different processing steps within the pipeline. Key features of the application framework include:

- a dynamic and extendible multi-layered architecture allowing for short development cycles, for both the addition of new features to existing tools and the inclusion of completely new tools
- a mature case handling concept managing all relevant data of a case within one data structure based on the Extensible Markup Language (XML)
- integration into clinical environments by means of PACS communication and handling of standard medical image data formats (DICOM)

A detailed description of the application framework can be found in [Rex08]. The multi-layered approach is an adaptation of the widely used Model-View-Controller (MVC) pattern known from software engineering. It provides the key functionality for integrating rapid-prototyping techniques into the development process of medical software assistants, which allows for seamless integration of the user, i.e. the clinical expert, into the process of software development.

3. Components

The following section gives a brief overview of some of the most important tools included within the software assistant. They can be categorized into three groups: Preprocessing, Visualization and Advanced Processing.

3.1 Preprocessing

The Preprocessing section includes tools typically required before performing any advanced analysis or data processing. One essential element of this section is the registration tool, which allows for manual or automatic linear registration of all images within a case. Here, any image can be registered to any reference image from the same case, which means that the user is free to choose which image to use as a reference for further processing steps.

A commonly required prerequisite for analyzing MR images of the head is the segmentation of the brain from the head, a problem generally referred to as skull-stripping. NeuroQLab comes with an interactive skull-stripping tool using a marker-based watershed transformation, which allows for efficient segmentation of the brain even in the presence of serious pathologies such as malignant brain tumors [Hahn00]. The algorithm consists of an initialization phase followed by interactive placement of markers to include or exclude parts of the image in the segmented mask. The initialization requires about 10 seconds of calculation time on a current standard PC; the placement of markers is fully interactive, which makes it a highly efficient as well as flexible tool for segmentation of the brain. Typically, a good segmentation can be achieved by placing 5 to 10 markers in the dataset.
Fig. 1. Two screenshots showing the main screen with application selection and the sub-menu for selection of preprocessing tools.

For dealing with signal inhomogeneities inherently present in MR images, NeuroQLab provides an implementation of the non-parametric non-uniformity normalization (N3) algorithm [SZE98]. Further tools in the preprocessing section provide functionality for preprocessing DTI data, or performing motion correction and filtering on fMRI datasets.

3.2 Visualization

Data visualization is an important aspect when dealing with medical image data. A conventional slice-wise gray level presentation of raw datasets is often inferior to dedicated visualization tools built to enhance individual aspects of the input data. This includes, e.g., using color overlays of multiple images possibly stemming from different modalities or MR sequences in both two and three dimensions, as well as 3D volume renderings of datasets in combination with clip planes and opaque or semi-transparent objects rendered through iso-surfaces.

NeuroQLab features several advanced viewers, including synchronized 2D and 3D viewers capable of displaying overlays such as fiber tracts or fMRI activation data [KWK07]. Also, dedicated volume rendering modes are included within selected analysis tools, custom-tailored to the visualization requirements of the chosen tool.

Fig. 2. An example of synchronized 2D and 3D viewers, in this case showing DTI color-coded data and some reconstructed fiber tracts.

3.3 Advanced Processing Packages

As an example of an advanced processing package included within NeuroQLab, we will now present DTILab, our software package for DTI-based white matter fiber tracking and quantification. For performing fiber tracking, two modes are available. The so-called seed-ROI based approach requires the user to first choose a ROI which is used by the algorithm as a starting point for tracking. This way, all tracts can be reconstructed that emerge from the user-chosen ROI. Alternatively, a whole-brain tracking mode is included. Here, tracking for the whole DTI dataset is performed as a preprocessing step. Once this has finished, the user can interactively choose a ROI and immediately visualize all fibers passing through this ROI. Usually, whole-brain tracking yields qualitatively better results, since it can reconstruct all tracts running through an input ROI, as opposed to the seed-ROI based approach.

Fig. 3. Color enhanced rendering of the results of quantitative brain volumetry [HJK04].

The tracking itself can be performed using two different algorithms: first, a deterministic deflection-based tracking algorithm, and second, a non-deterministic algorithm based on probabilistic methods for white matter fiber tract reconstruction [FFW06]. Probabilistic fiber tracking is a relatively new technique that aims towards handling the uncertainty inherent to DTI-based fiber tracking, since it not only reconstructs the tract itself, but at the same time generates a probability map indicating how likely a tract belongs to the reconstructed anatomical structure.

A unique feature of DTILab is its capability of performing quantitative analysis of DTI parameters along reconstructed fiber tracts [KSR08]. This allows for reproducible quantification of fiber bundles based on average values for fractional anisotropy, diffusion strength, axial diffusivity or radial diffusivity. The technique has been used, e.g.,
in the context of multiple sclerosis (MS) for the correlation of white matter fiber integrity with outcomes of psychiatric tests based on the expanded disability status scale (EDSS) [FKL09].

Fig. 4. DTI-based fiber tract quantification.

4. Conclusion

In this work, we have presented a software assistant for use in neuroimaging studies. It comprises a number of tools for both basic processing steps and advanced analysis techniques required in a variety of neuro-related medical disciplines such as neurology, neuroradiology, neurosurgery and neuropsychology. It is currently used by about 20 clinical research partners of our institute in the context of a variety of clinical questions. Further, it has been used as the technical basis for numerous studies resulting in a variety of clinically driven scientific publications.

The number of tools and features of the software assistant is constantly being enhanced, with new techniques being included continuously. Our focus for current and future developments is on the inclusion of new methods for fiber tracking based on High Angular Resolution Diffusion Imaging (HARDI) techniques, non-linear registration methods, and new quantitative measuring tools based on cortical thickness mapping for monitoring and diagnosing dementia and related neurodegenerative diseases. Further, a software link to operation microscopes and neuro-navigation systems is planned for better integration of our software in the operating room for intra-operative software support.

References

[FKL09] F. Fink, J. Klein, M. Lanz, T. Mitrovics, M. Lentschig, H. K. Hahn, H. Hildebrandt. Comparison of Diffusion Tensor-Based Tractography and Quantified Brain Atrophy for Analysing Demyelination and Axonal Loss in MS. To appear: Journal of Neuroimaging, 2009.
[FFW06] O. Friman, G. Farneback and C. Westin. A Bayesian approach for stochastic white matter tractography. IEEE Trans. Med. Imag. 25(8), pp. 965–978, 2006.
[Hahn00] H. K. Hahn and H.-O. Peitgen. The Skull Stripping Problem in MRI Solved by a Single 3D Watershed Transform. In MICCAI – Medical Image Computing and Computer-Assisted Intervention (pp. 134–143), 2000.
[HJK04] H. K. Hahn, B. Jolly, M. Lee, D. Krastel, et al. How Accurate is Brain Volumetry? A Methodological Evaluation. In C. Barillot, D. R. Haynor, & P. Hellier (Eds.), MICCAI – Medical Image Computing and Computer-Assisted Intervention (pp. 335–342). LNCS, 3216, 2004.
[KSR08] J. Klein, H. Stuke, J. Rexilius, et al. Towards User-Independent DTI Quantification. In SPIE Medical Imaging (Vol. 6914, pp. 69142E-1–69142E-8), 2008.
[KWK07] A. Köhn, F. Weiler, J. Klein, et al. State-of-the-Art Computer Graphics in Neurosurgical Planning and Risk Assessment. In P. Cignoni & J. Sochor (Eds.), Eurographics Short Papers and Medical Prize Awards (pp. 117–120), 2007.
[MeV09] MeVisLab homepage, 2009. http://www.mevislab.de.
[Rex08] J. Rexilius and H.-O. Peitgen. Rapid Prototyping of Clinical Software Assistants. In Proc. SPIE Medical Imaging (Vol. 6919, pp. 69190S-1–69190S-11), 2008.
[SZE98] J. G. Sled, A. P. Zijdenbos and A. C. Evans. A nonparametric method for automatic correction of intensity nonuniformity in MRI data. IEEE Trans. Med. Imag. 17, pp. 87–97, 1998.
NLC: A Measure Based on Projections

Roberto Ruiz, José C. Riquelme, and Jesús S. Aguilar-Ruiz
Departamento de Lenguajes y Sistemas, Universidad de Sevilla
Avda. Reina Mercedes S/N, 41012 Sevilla, España
{rruiz,riquelme,aguilar}@.es

Abstract. In this paper, we propose a new feature selection criterion. It is based on the projections of data set elements onto each attribute. The main advantages are its speed and simplicity in the evaluation of the attributes. The measure allows features to be sorted in ascending order of importance in the definition of the class. In order to test the relevance of the new feature selection measure, we compare the results induced by several classifiers before and after applying the feature selection algorithms.

1 Introduction

The selection of relevant features is a central problem in machine learning. If a relevant feature is removed, the quality of the remaining features will deteriorate. In order to identify relevant attributes, we need to address what a good feature is for classification. Without defining the goodness of a feature or features, it does not make sense to talk about best or optimal features. The algorithms evaluate the attributes based on general characteristics of the data. Feature selection can be viewed as a search problem, where each state in the search space specifies a subset of the possible features. The need for evaluation is common to all search strategies.

In this paper, we propose a new feature selection criterion not based on calculated measures between attributes, or complex and costly distance calculations. This criterion is based on a unique value called NLC. It relates each attribute with the label used for classification. This value is calculated by projecting data set elements onto the respective axis of the attribute (ordering the examples by this attribute), then crossing the axis from the beginning to the greatest attribute value, and counting the Number of Label Changes (NLC) produced.

This work has been supported by the Spanish Research Agency CICYT under grant TIC2001-1143-C03-02, and Junta de Andalucia under coordinated action ACC-1021-TIC-2002.

2 Related Work

Feature selection algorithms use different evaluation functions. These functions are based on criteria to measure the relevance of the attributes. There are several taxonomies of these evaluation measures in previous work, depending on different criteria. Langley [9] groups evaluation functions into two categories: filter and wrapper. Blum and Langley [3] provide a classification of evaluation functions into four groups, depending on the relation between the selection and the induction process: embedded, filter, wrapper, weight. A further classification is provided by Doak [5] and Dash [4], who classify evaluation measures based on their general characteristics rather than on the relation with the induction process.
The classification realized by Dash separates five different types of measures: distance, information, dependence, consistency and accuracy.

Feature selection can be viewed as a search problem, where each state in the search space specifies a subset of the possible features. The need for evaluation is common to all search strategies. In general, attribute selection algorithms perform a search through the space of feature subsets, and must address four basic issues affecting the nature of the search: 1) Starting point: forward or backward, according to whether the search begins with no features or with all features. 2) Search organization: exhaustive or heuristic search. 3) Evaluation strategy: wrapper or filter. 4) Stopping criterion: a feature selector must decide when to stop searching through the space of feature subsets. The search may stop when a predefined number of features has been selected, when a predefined number of iterations has been reached, when the addition or deletion of any feature no longer produces a better subset, or when an optimal subset according to some evaluation function is obtained.

3 Feature Evaluation

3.1 Observations

To present the main idea of the algorithm we rely on the data sets IRIS and WINE, because of the easy interpretation of their two-dimensional projections.

In Figure 1(a) it is possible to observe that if the projection of the examples is made on the ordinate axis we cannot obtain intervals where any class is a majority. Nevertheless, for the Petalwidth attribute it is possible to appreciate some intervals where the class is unique: [0, 0.6] for Setosa, [1.0, 1.3] for Versicolor and [1.8, 2.5] for Virginica. This is because when projecting the examples on this attribute the number of label changes is minimal. For example, it is possible to verify that for Petalwidth the first label change takes place at value 1 (Setosa to Versicolor), and the second at 1.3 (Versicolor to Virginica). There are other changes later and the last one is at 1.8. In Figure 1(b) the same conclusion is reached with the data set WINE. We analyze the projection of data set elements onto the C8 and C7 attributes. We identify intervals where one class is a majority when crossing the abscissa axis from the beginning to the greatest attribute value: [0, 1] for class 3, [1.5, 2.3] for class 2 and [3, 4] for class 1. Nevertheless, for the C8 attribute on the ordinate axis it is not possible to observe any intervals where the class is unique.

Fig. 1. (a) IRIS. Representation of attributes Sepalwidth-Petalwidth. (b) WINE. Representation of attributes C8-C7.

We conclude that it will be easier to classify by attributes with the smallest number of label changes. If the attributes are in ascending order according to the NLC, we obtain a ranking list with the best attributes from the point of view of classification; in Iris this would be: Petalwidth 16, Petallength 19, Sepallength 87 and Sepalwidth 120. This result agrees with what is common knowledge in data mining, which states that the width and length of petals are more important than those related to sepals. Classifying IRIS with C4.5 by Sepalwidth only, we obtain 59% accuracy, and by Petalwidth 95%. The attributes used in Figure 1(b) are the first and the last on the ranked list, with NLC values of 43 and 139 respectively. Applying the classifier C4.5, we obtain 80.34% accuracy by C7 and 47.75% by C8.

3.2 Definitions

Definition 1: An example e ∈ E is a tuple formed by the Cartesian product of the value sets of each attribute and the set C of labels. We define the operations att and lab
to access the attribute and its label (or class): att: E x N → A and lab: E → C, where N is the set of natural numbers.

Definition 2: Let the universe U be a sequence of examples from E. We will say that a database with n examples, each of them with m attributes and one class, forms a particular universe. Then U = <u[1], ..., u[n]> and, as the database is a sequence, the access to an example is achieved by means of its position. Likewise, the access to the j-th attribute of the i-th example is made by att(u[i], j), and its label is identified by lab(u[i]).

Definition 3: An ordered projected sequence is a sequence formed by the projection of the universe onto the i-th attribute. This sequence is sorted in ascending order.

Definition 4: A partition in constant subsequences is the set of subsequences formed from the ordered projected sequence of an attribute in such a way as to maintain the projection order. All the examples belonging to a subsequence have the same class and every two consecutive subsequences are disjoint with respect to the class.

Fig. 2. Data set with (a) ten and (b) twelve elements and two classes.

Definition 5: A subsequence of the same value is the sequence composed of the examples with identical value of the i-th attribute within the ordered projected sequence. This situation can arise with continuous variables, and it will be the way to deal with discrete variables.

Definition 6: Two examples are inconsistent if they match except for the class label.

3.3 Description

The algorithm is based on this basic principle: to count the label changes of examples projected onto each feature. If the attributes are in ascending order according to the NLC, we will have a list that defines the priority of selection, from greater to smaller importance. Before formally presenting the algorithm, we will explain the main idea in more detail.

Let us consider the situation depicted in Figure 2(a), with ten elements numbered and two labels (O for odd numbers and E for even numbers): the projection of the examples on the abscissa axis produces three constant subsequences {O, E, O} corresponding to the examples {[1,3,5][8,4,10,2,6][7,9]}. Identically, with the projection on the ordinate axis we obtain six constant subsequences {O, E, O, E, O, E} formed by the examples {[1][2,4][3,9][6,10][5,7][8]}. We check that the first attribute has two label changes and the second one has five. Applying our hypothesis, the first attribute is more relevant than the second one, because it has a smaller NLC.

4 Algorithm

The algorithm is very simple and fast (Table 1). It has the capacity to operate with continuous and discrete variables as well as with databases which have two classes or multiple classes. For each attribute, the training set is ordered (QuickSort [7]; this algorithm is O(n log n) on average) and we count the NLC throughout the ordered projected sequence.

Table 1. Main Algorithm
Output: E reduced (N examples, K attributes)
for each attribute A_i, i in 1..M
    QuickSort(E, i)
    NLC_i <- NumberChanges(E, i)
NLC attribute ranking
Select the k first

Table 2. NumberChanges function
Input: E training (N examples, M attributes), i
Output: number of label changes
for each example e_j in E with j in 1..N
    if att(u[j], i) belongs to a subsequence of the same value
        changes = changes + ChangesSameValue()
    else
        if lab(u[j]) <> lastLabel
            changes = changes + 1
return(changes)

Applying the algorithm to the example of Figure 2(b), we obtain the ordered projected
sequences: {1,3,4,10,2,11,8,9,6,12,5,7} and {1,11,4,8,3,9,10,6,2,12,5,7}, and the partitions: {[1,3][4,10,2][11,8,9,6,12,5,7]} and {[1,11][4,8][3,9][10,6,2,12][5,7]}.

The elements' projections onto the first attribute produce two constant subsequences and one subsequence of the same value with different labels. The elements' projections onto the second attribute produce five constant subsequences.

NumberChanges considers whether we deal with different values of an attribute, or with a subsequence of the same value (this situation can arise with both continuous and discrete variables). In the first case, it compares the present label with the last one. In the second case, where the subsequence is of the same value, it counts the maximum possible changes by means of the function ChangesSameValue.

In the attribute represented on the ordinate axis (b) in Figure 2(b), we see several subsequences of the same value with the same label; then we deal with constant subsequences, and the result is four label changes (NLC = 4). In the attribute on the abscissa axis (a), the first two partitions are constant subsequences, and the third is a subsequence of the same value with two labels. Therefore, we consider the maximum possible NLC.

In the previous case, we have the subsequence [11,8,9,6,12,5,7] where four elements are of class O (odd) and three of class E (even). There are two reasons for counting the maximum NLC: first, we want to penalize the attribute in these inconsistency situations; and second, we want to avoid ambiguities that could be produced depending on the order of the elements after the QuickSort algorithm is applied. For example, in different independent executions, we could obtain these situations: [E,E,E,O,O,O,O], [E,E,O,O,O,O,E], [O,O,E,E,O,O,E], ... with NLC equal to 1, 2 and 3 respectively. ChangesSameValue returns 5, the maximum.
The situation is [O,E,O,E,O,E,O]. This can be obtained at low cost: it can be deduced by counting the class elements in the subsequence without re-sorting the elements. We conclude that attribute b, with four NLC, is more relevant than attribute a, with seven NLC.

5 Experiments

In this section we compare the quality of the attributes selected by the NLC measure with the attributes selected by two other methods: Information Gain (IG) [10] and the ReliefF method [8]. IG has been chosen because it is the most popular concept and the one most often used to evaluate the relevance of an attribute, and the ReliefF method has been chosen because it is widely referenced in other papers. The ReliefF method is a version of the Relief method by Kononenko, which permits attributes with missing values and multiclass problems. The quality of each selected attribute was tested by means of three classifiers: naive Bayes [6], C4.5 [10] and 1-NN [1]. The implementation of the induction algorithms and the other selectors was done using the Weka library1, and the comparison was performed with eighteen databases of the University of California, Irvine [2]. The data sets were chosen with few missing values.

1 /ml

The process followed to test the quality of the attributes selected with the NLC measure was the following. For all original data sets, we obtained the accuracy using the three classifiers and the size of the decision trees induced by C4.5. We obtained the same measures after applying each selector algorithm, recording the number of attributes selected. To assess the obtained results, two paired t statistical tests with a confidence level of 95% were realized.

In order to establish the number of attributes in each case, we obtain a ranked list of features with the three methods and we use the learning curve to observe the effect of added features. Starting with one feature (the most relevant one first) and gradually adding the next most relevant feature one by one, we calculate the accuracy rate. We select the set of attributes with the best accuracy. Applying a different classifier, we obtain a different set.

Fig. 3. Cross-validation process.

Fig. 4. Reduction method: feature ranking and learning curve.

For each database (DB), the measures were estimated taking the mean of a ten-fold cross validation. A ten-fold cross-validation is performed by dividing the data into ten blocks of cases that have an approximately similar size and, for each block in turn, testing the model constructed from the remaining nine blocks on the unseen cases in the hold-out block (Figure 3(a)). The same folds were used for the training sets of each algorithm. Each reducing method was given a training set (DB N.data) consisting of 90% of the available data, from which it returned a subset DB METHOD N (Figure 3(b)), where METHOD is one of NLC, RLF, IG, N is a value in 0, 1, ..., 9, and a classifier is included to obtain the learning curve (Figure 4). We use the same classifier with which we are going to classify the test set. For example, from DB 1.data we would obtain DB IG 1.data by applying the IG method. The remaining 10% of unseen data (DB N.test) was also reduced (DB METHOD N.test) (Figure 3(b)) and tested on the instances of DB METHOD N.data using a classifier (Figure 3(c)). For example, we obtain iris ig 1.data by applying the IG method to the iris 1.data file generated by the cross validation. Afterwards, we use iris ig 1.data to classify iris ig 1.test by means of the nearest neighbor technique. When we deal with
the learning curve, we also apply 1NN (Figure 4). As a further comparison, other widely-used learners, C4.5 and NB, were run on these data sets (Figure 3(c)). For example, after reducing iris 3.data with NLC, iris nlc 3.data was generated; it was given as input to C4.5, and the decision tree generated was used to classify the iris nlc 3.test file (test files are reduced too). If we consider all the possible results that we get using the original data, the three selection methods (NLC, RLF and IG) and the three classifiers (C4.5, 1NN and NB) with eighteen data sets, taking the mean of a 10-fold cross validation, we get two hundred and sixteen results ((1+3) x 3 x 18 = 216). Now we are going to analyze these results to obtain some conclusions about the performance of the different methods.

Table 3 shows a summary of the results of the classification using C4.5, 1NN and NB. The table shows how often each method performs significantly better (denoted by ◦) or worse (denoted by •) than the data without reduction (column named 1), and better or worse than ReliefF (RLF) and Information Gain (IG) (columns 2 and 3 respectively). Then, we obtain fifty-four results comparing NLC with the data without reduction (18 data sets x 3 classifiers = 54) and one hundred and eight comparing NLC with RLF and IG (18 data sets x 3 classifiers x 2 = 108). The NLC measure is better than the data without reduction in twelve of the fifty-four cases, in forty-one they are equal, and only in one is it worse than the data without reduction. Furthermore, in four of the one hundred and eight cases, the set of attributes selected by the NLC measure yields better accuracy than the two other methods. In ninety-three they are equal, and in eleven they are worse than the others.

Table 3. Accuracy obtained with C4.5, 1NN and naive Bayes, selecting the subset with the best accuracy

Data | C4.5 NLC(1 2 3) RLF IG | 1NN NLC(1 2 3) RLF IG | NB NLC(1 2 3) RLF IG
anne | 98.6 98.4 98.4 98.0 | 99.3 99.0 98.9 98.9 | 86.3 89.2◦• 90.0 92.4
bala | 78.4 78.4 78.4 78.4 | 86.9 86.9 86.9 86.9 | 88.8 88.8 88.8 88.8
germ | 71.1 73.8◦◦ 70.5 74.5 | 72.4 69.7 70.8 71.0 | 74.8 75.5 73.4 74.7
diab | 76.7 75.1 75.9 75.5 | 70.9 68.2 66.7 68.5 | 76.2 75.6 75.8 76.5
glas | 69.2 70.1 71.0 67.3 | 70.5 71.9• 75.1 74.2 | 45.8 56.6◦◦ 47.7 51.0
gla2 | 77.8 79.6 79.2 79.0 | 79.2 78.3•• 89.0 89.0 | 61.9 69.9◦◦ 64.4 69.9
h-st | 78.5 83.7◦ 80.4 85.2 | 74.4 79.3◦• 79.6 83.3 | 84.1 81.9• 85.2 84.4
iono | 88.6 89.7• 92.6 91.2 | 86.6 87.2• 91.4 89.2 | 83.2 84.6 89.7 87.2
iris | 94.0 92.7 94.0 92.7 | 95.3 93.3 93.3 93.3 | 95.3 92.7• 94.0 92.7
kr-v | 99.5 99.3 99.5 99.5 | 96.5 97.4◦◦ 98.3 96.9 | 88.0 90.4◦• 94.0 90.4
lymp | 78.3 74.9 77.7 76.3 | 79.7 77.0 83.0 77.8 | 83.8 83.8 81.8 83.8
segm | 97.0 96.9 96.8 96.9 | 97.0 97.1 97.1 97.1 | 80.0 87.2◦ 87.2 87.2
sona | 72.6 73.5 75.0 72.2 | 85.5 84.6 81.3 87.5 | 67.8 73.5 74.9 73.5
spli-2 | 94.4 94.0 94.0 94.2 | 73.9 90.0◦◦ 90.0 90.0 | 95.3 96.1 96.3 95.8
vehi | 73.5 72.6 72.7 73.7 | 70.1 70.7 71.2 69.3 | 44.3 44.3• 48.5 44.3
vowe | 82.9 81.7 82.7 81.0 | 99.4 99.2 99.0 99.2 | 66.1 68.7◦ 69.2 69.1
wave | 76.6 77.6 78.1 77.4 | 73.8 79.1◦ 78.2 79.1 | 80.0 80.7• 81.4 80.7
zoo | 93.1 94.1 93.1 92.1 | 96.0 97.0 95.0 95.0 | 95.0 95.0 91.0 92.1

We select the set of attributes with the best accuracy. Applying each classifier, we obtain a different set. Therefore, we get three sets of attributes for each reduction method for each data set. We obtain the percentage of the original features retained and we calculate the average over the eighteen data sets (nine results). All the methods retain between 50% and 60% of the original features.

The experiments show that by applying NLC, the knowledge attained in the original training file is conserved in the reduced training file, and the
It is very interesting to compare the speed of the attribute selection techniques. We measured the time taken, in milliseconds, to select the ranking of attributes. NLC has a very short computation time: it takes 792 milliseconds to reduce the 18 data sets, whereas ReliefF takes 566 seconds and IG 2189 milliseconds. We obtain the percentage of reduction time for each data set and calculate the average: NLC needs about 1% of the time required by ReliefF and about half of the time required by IG.

In order to compare the first attributes in the ranking list of each method, the data sets are reduced to the first three attributes of each ranking list and we observe the accuracy for each reduction method with the three classifiers (Table 4). We obtain similar results with the three methods. The table marks how often the attributes selected by NLC perform significantly better (◦) or worse (•) than those selected by ReliefF (RLF) and Information Gain (IG) (first and second mark, respectively). In ten of the one hundred and eight cases, the set of attributes selected by the NLC measure yields better accuracy than the two other methods; in eighty they are equal, and in fourteen they are worse.

Table 4. Accuracy obtained with C4.5, 1NN and naive Bayes, selecting the three first attributes of the ranked list

Data     | C4.5: Orig, NLC, RLF, IG      | 1NN: Orig, NLC, RLF, IG       | NB: Orig, NLC, RLF, IG
anneal   | 98.6, 89.9•, 92.5, 90.6       | 99.3, 91.1, 91.6, 91.6        | 86.3, 84.0••, 90.0, 88.9
balance  | 78.4, 69.4, 69.4, 69.4        | 86.9, 68.0, 69.6, 67.7        | 88.8, 74.2, 73.6, 73.3
g credit | 71.1, 70.9, 71.5, 72.2        | 72.4, 61.2••, 70.6, 70.6      | 74.8, 71.6, 71.3, 73.9
diabetes | 76.7, 74.6, 74.7, 75.1        | 70.9, 69.8, 66.7, 69.3        | 76.2, 77.1, 76.4, 76.8
glass    | 69.2, 71.9, 67.7, 65.4        | 70.5, 66.3, 66.3, 64.5        | 45.8, 53.8◦, 45.8, 49.1
glass2   | 77.8, 81.4, 77.9, 81.4        | 79.2, 77.9, 83.4, 77.9        | 61.9, 68.7, 63.8, 68.7
heart-s  | 78.5, 72.6•, 73.3, 85.2       | 74.4, 67.8••, 73.3, 84.8      | 84.1, 74.8•, 74.8, 79.6
ionosphe | 88.6, 80.4••, 88.3, 90.0      | 86.6, 84.3, 85.7, 88.6        | 83.2, 78.3•, 83.5, 86.6
iris     | 94.0, 94.0, 94.0, 94.0        | 95.3, 95.3, 95.3, 95.3        | 95.3, 95.3, 95.3, 95.3
kr-vs    | 99.5, 90.4, 90.4, 90.4        | 96.5, 90.4, 90.4, 90.4        | 88.0, 90.4, 90.4, 90.4
lymph    | 78.3, 77.7, 79.7, 77.7        | 79.7, 75.7, 83.7, 75.0        | 83.8, 75.0, 80.3, 72.3
segment  | 97.0, 90.5◦◦, 85.6, 85.5      | 97.0, 91.8◦◦, 85.3, 88.5      | 80.0, 77.0◦◦, 72.5, 64.3
sonar    | 72.6, 68.8, 70.2, 70.7        | 85.5, 73.1, 66.8, 70.2        | 67.8, 73.1, 70.6, 70.6
splice-2 | 94.4, 80.5, 81.4, 80.8        | 73.9, 80.2, 81.2, 80.6        | 95.3, 79.6, 81.1, 80.7
vehicle  | 73.5, 53.5••, 62.4, 61.4      | 70.1, 53.7, 56.0, 57.7        | 44.3, 40.4, 42.8, 40.8
vowel    | 82.9, 69.1, 70.6, 72.1        | 99.4, 79.6, 80.1, 82.8        | 66.1, 57.9, 56.0, 58.7
waveform | 76.6, 66.2, 64.5, 65.3        | 73.8, 57.0, 56.3, 56.4        | 80.0, 65.1, 66.1, 64.9
zoo      | 93.1, 84.2◦, 72.3, 85.2       | 96.0, 83.2◦, 71.3, 87.2       | 95.0, 84.2◦, 71.3, 84.2

6 Conclusions

In this paper we present a deterministic attribute selection criterion. Its main advantages are its speed and its simplicity in the evaluation of the attributes. The measure allows features to be sorted in ascending order of relevance, and a considerable reduction of the number of attributes is produced. It is not based on measures calculated between attributes or on complex and costly distance calculations, and its computational cost, O(m x n log n), is lower than that of other methods. We conclude that by applying NLC the knowledge attained in the original training file is conserved in the reduced training file, and the dimensionality of the data is reduced significantly. We obtain similar results with the other methods, but they need much more time.

References

1. Aha, D., Kibler, D., & Albert, M. Instance-based learning algorithms. Machine Learning, 6, 37-66 (1991).
2. Blake, C., & Merz, E. K. UCI repository of machine learning databases (1998).
3. Blum, A., & Langley, P. Selection of relevant features and examples in machine learning. Artificial Intelligence, pp. 245-271 (1997).
4. Dash, M., & Liu, H. Feature selection for classification. Intelligent Data Analysis, 1 (1997).
5. Doak, J. An evaluation of search algorithms for feature selection. Technical Report, Los Alamos National Laboratory (1994).
6. Duda, R., & Hart, P. Pattern Classification and Scene Analysis. John Wiley and Sons (1973).
7. Hoare, C. A. R. Quicksort. Computer Journal, 5, 10-15 (1962).
8. Kononenko, I. Estimating attributes: Analysis and extensions of Relief. European Conference on Machine Learning, pp. 171-182 (1994).
9. Langley, P. Selection of relevant features in machine learning. Procs. of the AAAI Fall Symposium on Relevance, pp. 140-144 (1994).
10. Quinlan, J. Induction of decision trees. Machine Learning, 1, 81-106 (1986).
Artif Intell Rev (2006) 26:159-190
DOI 10.1007/s10462-007-9052-3

Machine learning: a review of classification and combining techniques

S. B. Kotsiantis · I. D. Zaharakis · P. E. Pintelas

Published online: 10 November 2007
© Springer Science+Business Media B.V. 2007

Abstract  Supervised classification is one of the tasks most frequently carried out by so-called Intelligent Systems. Thus, a large number of techniques have been developed based on Artificial Intelligence (logic-based techniques, perceptron-based techniques) and Statistics (Bayesian networks, instance-based techniques). The goal of supervised learning is to build a concise model of the distribution of class labels in terms of predictor features. The resulting classifier is then used to assign class labels to the testing instances, where the values of the predictor features are known but the value of the class label is unknown. This paper describes various classification algorithms and the recent attempt to improve classification accuracy: ensembles of classifiers.

Keywords  Classifiers · Data mining techniques · Intelligent data analysis · Learning algorithms

S. B. Kotsiantis (B) · P. E. Pintelas
Department of Computer Science and Technology, University of Peloponnese, Peloponnese, Greece
e-mail: sotos@math.upatras.gr

S. B. Kotsiantis · P. E. Pintelas
Educational Software Development Laboratory, Department of Mathematics, University of Patras, P.O. Box 1399, Patras, Greece
P. E. Pintelas
e-mail: pintelas@math.upatras.gr

I. D. Zaharakis
Computer Technology Institute, Patras, Greece
e-mail: jzaharak@cti.gr

1 Introduction

There are several applications for machine learning (ML), the most significant of which is predictive data mining. Every instance in any dataset used by machine learning algorithms is represented using the same set of features. The features may be continuous, categorical or binary. If instances are given with known labels (the corresponding correct outputs) then the learning is called supervised, in contrast to unsupervised learning, where instances are unlabeled (Jain et al. 1999).

Numerous ML applications involve tasks that can be set up as supervised. In the present paper, we have concentrated on the techniques necessary to do this. In particular, this work is concerned with classification problems in which the output of instances admits only discrete, unordered values. We have limited our references to recent refereed journals, published books and conferences. A brief review of what ML includes can be found in (Dutton and Conroy 1996). De Mantaras and Armengol (1998) also presented a historical survey of logic and instance-based learning classifiers.

After a better understanding of the strengths and limitations of each method, the possibility of integrating two or more algorithms together to solve a problem should be investigated. The objective is to utilize the strengths of one method to complement the weaknesses of another. If we are only interested in the best possible classification accuracy, it might be difficult or impossible to find a single classifier that performs as well as a good ensemble of classifiers.

Our next section covers wide-ranging issues of supervised machine learning such as data pre-processing and feature selection.
Logic-based learning techniques are described in Sect. 3, whereas perceptron-based techniques are analyzed in Sect. 4. Statistical techniques for ML are covered in Sect. 5. Section 6 deals with the newest supervised ML technique, Support Vector Machines (SVMs). In Sect. 7, a representative algorithm for each learning technique is compared on a number of datasets in order for researchers to have a baseline accuracy for new algorithms on these well-known datasets. Section 8 presents the recent attempt to improve classification accuracy: ensembles of classifiers. Finally, the last section concludes this work.

2 General issues of supervised learning algorithms

The first step is collecting the dataset. If a requisite expert is available, then s/he could suggest which fields (attributes, features) are the most informative. If not, then the simplest method is that of "brute force", which means measuring everything available in the hope that the right (informative, relevant) features can be isolated. However, a dataset collected by the "brute-force" method is not directly suitable for induction. It contains in most cases noise and missing feature values, and therefore requires significant pre-processing (Zhang et al. 2002).

2.1 Data preparation and data pre-processing

What can be wrong with data? There is a hierarchy of problems that are often encountered in data preparation and pre-processing:
• Impossible values have been inputted.
• Unlikely values have been inputted.
• No values have been inputted (missing values).
• Irrelevant input features are present in the data at hand.

Impossible values should be checked for by the data-handling software, ideally at the point of input so that they can be re-entered. These errors are generally straightforward, such as coming across negative prices when positive ones are expected. If correct values cannot be entered, the problem is converted into the missing-value category by simply removing the data.

Variable-by-variable data cleansing is a filter approach for unlikely values (those values that are suspicious due to their relationship to a specific probability distribution, say a normal distribution with a mean of 5 and a standard deviation of 3, but a suspicious value of 10). Table 1 shows examples of how such metadata can help in detecting a number of possible data quality problems.

Table 1 Examples for the use of variable-by-variable data cleansing

Problems        | Metadata             | Examples/Heuristics
Illegal values  | Cardinality          | e.g., cardinality(gender) > 2 indicates a problem
                | Max, min             | Max and min should not be outside the permissible range
                | Variance, deviation  | Variance and deviation of statistical values should not be higher than a threshold
Misspellings    | Feature values       | Sorting on values often brings misspelled values next to correct values

Hodge and Austin (2004) have recently introduced a survey of contemporary techniques for outlier (noise) detection. These researchers have identified the techniques' advantages and disadvantages.
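The checks listed in Table 1 are straightforward to express programmatically. A minimal pandas sketch, assuming the data sit in a DataFrame and that the permissible ranges are supplied by the analyst; the function and argument names are illustrative.

```python
import pandas as pd


def variable_by_variable_checks(df: pd.DataFrame, max_allowed: dict, min_allowed: dict):
    """Flag some of the problems of Table 1: suspicious cardinality, values
    outside the permissible range, and candidate misspellings exposed by
    sorting the values of each categorical column."""
    report = {}

    # Cardinality: e.g. cardinality(gender) > 2 indicates a problem.
    report["cardinality"] = {c: df[c].nunique() for c in df.select_dtypes("object")}

    # Max/min should not fall outside the permissible range.
    out_of_range = {}
    for col, hi in max_allowed.items():
        lo = min_allowed.get(col, float("-inf"))
        bad_rows = df.index[(df[col] > hi) | (df[col] < lo)].tolist()
        if bad_rows:
            out_of_range[col] = bad_rows
    report["out_of_range"] = out_of_range

    # Sorting often brings misspelled values next to the correct ones.
    report["sorted_values"] = {c: sorted(df[c].dropna().unique())
                               for c in df.select_dtypes("object")}
    return report
```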
While the focus above is on analytical methods, the use of visualization can also often be a powerful tool. Visualization is particularly good at picking out bad values that occur in a regular pattern. However, care is needed in distinguishing between natural variability and the presence of bad values, since data is often more dispersed than we think.

Instance selection is not only used to handle noise but also to cope with the infeasibility of learning from very large datasets. Instance selection in these datasets is an optimization problem that attempts to maintain the mining quality while minimizing the sample size (Liu and Motoda 2001). It reduces the data and enables a data mining algorithm to function and work effectively with very large datasets. There are a variety of procedures for sampling instances from a large dataset. The most well known are (Reinartz 2002):
• Random sampling, which selects a subset of instances randomly.
• Stratified sampling, which is applicable when the class values are not uniformly distributed in the training sets. Instances of the minority class(es) are selected with greater frequency in order to even out the distribution. Other techniques for handling imbalanced datasets can be found in (Japkowicz and Stephen 2002).

Incomplete data is an unavoidable problem in dealing with most real-world data sources. Generally, there are some important factors to be taken into account when processing unknown feature values. One of the most important ones is the source of "unknown-ness": (i) a value is missing because it was forgotten or lost; (ii) a certain feature is not applicable for a given instance (e.g., it does not exist for a given instance); (iii) for a given observation, the designer of a training set does not care about the value of a certain feature (so-called "don't-care" values).

Depending on the circumstances, researchers have a number of methods to choose from to handle missing data (Batista and Monard 2003):
• Method of ignoring instances with unknown feature values: This method is the simplest: ignore any instances which have at least one unknown feature value.
• Most common feature value: The value of the feature that occurs most often is selected to be the value for all the unknown values of the feature.
• Most common feature value in class: This time, the value of the feature which occurs most commonly within the same class is selected to be the value for all the unknown values of the feature.
• Mean substitution: Researchers substitute a feature's mean value (computed from available cases) to fill in missing data values on the remaining cases. A more sophisticated solution than using the "general" feature mean is to use the feature mean for all samples belonging to the same class to fill in the missing value.
• Regression or classification methods: A regression or classification model based on the complete case data for a given feature is developed. This model treats the feature as the outcome and uses the other features as predictors.
• Hot deck imputing: The most similar case to the case with a missing value is identified, and then the similar case's Y value is substituted for the missing case's Y value.
• Method of treating missing feature values as special values: "Unknown" itself is treated as a new value for the features that contain missing values.
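Two of the strategies above, most common feature value and class-conditional mean substitution, are short enough to show concretely. A sketch using pandas on a hypothetical DataFrame with a class column; the names are illustrative.

```python
import pandas as pd


def impute_most_common(df: pd.DataFrame, feature: str) -> pd.Series:
    """Most common feature value: replace every missing entry with the mode."""
    return df[feature].fillna(df[feature].mode().iloc[0])


def impute_class_mean(df: pd.DataFrame, feature: str, class_col: str) -> pd.Series:
    """Mean substitution within the class: replace missing entries with the
    mean of the feature computed over samples of the same class."""
    return df.groupby(class_col)[feature].transform(lambda s: s.fillna(s.mean()))
```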
Feature subset selection is the process of identifying and removing as many irrelevant and redundant features as possible (Yu and Liu 2004). This reduces the dimensionality of the data and enables data mining algorithms to operate faster and more effectively. Generally, features are characterized as:
• Relevant: These are features that have an influence on the output. Their role cannot be assumed by the rest.
• Irrelevant: Irrelevant features are defined as those features not having any influence on the output. Their values could be generated at random without influencing the output.
• Redundant: A redundancy exists whenever a feature can take the role of another (perhaps the simplest way to incur model redundancy).

Feature selection algorithms in general have two components: a selection algorithm that generates proposed subsets of features and attempts to find an optimal subset, and an evaluation algorithm that determines how "good" a proposed feature subset is. However, without a suitable stopping criterion, the feature selection process may run repeatedly through the space of subsets, taking up valuable computational time. Stopping criteria might be: (i) whether the addition (or deletion) of any feature fails to produce a better subset; and (ii) whether an optimal subset according to some evaluation function has been obtained.

The fact that many features depend on one another often unduly influences the accuracy of supervised ML classification models. This problem can be addressed by constructing new features from the basic feature set (Markovitch and Rosenstein 2002). This technique is called feature construction/transformation. These newly generated features may lead to the creation of more concise and accurate classifiers. In addition, the discovery of meaningful features contributes to better comprehensibility of the produced classifier and a better understanding of the learned concept.

2.2 Algorithm selection

The choice of which specific learning algorithm to use is a critical step. The classifier's evaluation is most often based on prediction accuracy (the percentage of correct predictions divided by the total number of predictions). There are at least three techniques used to calculate a classifier's accuracy. One technique is to split the training set by using two-thirds for training and the other third for estimating performance. In another technique, known as cross-validation, the training set is divided into mutually exclusive and equal-sized subsets, and for each subset the classifier is trained on the union of all the other subsets. The average of the error rates on each subset is therefore an estimate of the error rate of the classifier. Leave-one-out validation is a special case of cross-validation in which every test subset consists of a single instance. This type of validation is, of course, more expensive computationally, but useful when the most accurate estimate of a classifier's error rate is required.
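A minimal sketch of the cross-validation estimate just described; leave-one-out is the special case where k equals the number of instances. `train` and `predict` are placeholders for any learner.

```python
import random


def cross_validated_accuracy(X, y, train, predict, k=10, seed=0):
    """Split the data into k mutually exclusive folds of roughly equal size;
    for each fold, train on the union of the other folds and test on the
    held-out fold. The mean over folds estimates the classifier's accuracy."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]

    fold_accuracies = []
    for held_out in folds:
        held_out_set = set(held_out)
        train_idx = [i for i in indices if i not in held_out_set]
        model = train([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(predict(model, X[i]) == y[i] for i in held_out)
        fold_accuracies.append(correct / len(held_out))
    return sum(fold_accuracies) / len(fold_accuracies)
```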
If the error-rate evaluation is unsatisfactory, a variety of factors must be examined: perhaps relevant features for the problem are not being used, a larger training set is needed, the dimensionality of the problem is too high, the selected algorithm is inappropriate, or parameter tuning is needed. A common method for comparing supervised ML algorithms is to perform statistical comparisons of the accuracies of trained classifiers on specific datasets (Bouckaert 2003). Several heuristic versions of the t-test have been developed to handle this issue (Dietterich 1998; Nadeau and Bengio 2003). In the next sections, we will focus on the most important supervised machine learning techniques, starting with logic-based techniques.

3 Logic-based algorithms

In this section we will concentrate on two groups of logic (symbolic) learning methods: decision trees and rule-based classifiers.

3.1 Decision trees

Murthy (1998) provided an overview of work in decision trees and a sample of their usefulness to newcomers as well as practitioners in the field of machine learning. Decision trees are trees that classify instances by sorting them based on feature values. Each node in a decision tree represents a feature in an instance to be classified, and each branch represents a value that the node can assume. Instances are classified starting at the root node and sorted based on their feature values. The problem of constructing optimal binary decision trees is NP-complete, and thus theoreticians have searched for efficient heuristics for constructing near-optimal decision trees.

The feature that best divides the training data would be the root node of the tree. There are numerous methods for finding the feature that best divides the training data, but a majority of studies have concluded that there is no single best method (Murthy 1998). Comparison of individual methods may still be important when deciding which metric should be used on a particular dataset. The same procedure is then repeated on each partition of the divided data, creating sub-trees, until the training data is divided into subsets of the same class. Figure 1 presents a general pseudo-code for building decision trees.

A decision tree, or any learned hypothesis h, is said to overfit the training data if another hypothesis h' exists that has a larger error than h when tested on the training data, but a smaller error than h when tested on the entire dataset. If two trees employ the same kind of tests and have the same prediction accuracy, the one with fewer leaves is usually preferred. Breslow and Aha (1997) survey methods of tree simplification intended to improve comprehensibility. The most straightforward way of tackling overfitting is to pre-prune the decision tree by not allowing it to grow to its full size. A comparative study of well-known pruning methods is presented in (Elomaa 1999); however, Elomaa (1999) concluded that there is no single best pruning method.

Even though the divide-and-conquer algorithm is quick, efficiency can become important in tasks with hundreds of thousands of instances. The most time-consuming aspect is sorting the instances on a numeric feature to find the best threshold t.

Decision tree accuracy can also be improved by constructing new features. One example is Zheng (1998), who improved the classification accuracy of decision trees by constructing new binary features with logical operators such as conjunction, negation and disjunction. In addition, Zheng (2000) created at-least M-of-N features: for a given instance, the value of an at-least M-of-N representation is true if at least M of its N conditions are true of the instance; otherwise it is false. Gama and Brazdil (1999) combined a decision tree with a linear discriminant for constructing multivariate decision trees; in this model, new features are computed as linear combinations of the previous ones.

The most well-known algorithm in the literature for building decision trees is C4.5 (Quinlan 1993). One of the latest studies comparing decision trees and other learning algorithms was done by (Tjen-Sien et al. 2000); it shows that C4.5 has a very good combination of error rate and speed. C4.5 assumes that the training data fits in memory; thus, Gehrke et al. (2000) proposed Rainforest, a framework for developing fast and scalable algorithms to construct decision trees that gracefully adapt to the amount of main memory available. Baik and Bala (2004) presented preliminary work on an agent-based approach for the distributed learning of decision trees.

To sum up, one of the most useful characteristics of decision trees is their comprehensibility. People can easily understand why a decision tree classifies an instance as belonging to a specific class. When an unknown feature value is encountered during classification, the instance can be passed down every branch of the node where the unknown feature value was detected, and each branch outputs a class distribution. The output is then a combination of the different class distributions that sum to 1. The assumption made in decision trees is that instances belonging to different classes have different values in at least one of their features.
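One widely used way of scoring "the feature that best divides the training data" is information gain, the entropy-based criterion behind ID3/C4.5-style trees. A minimal sketch for categorical features; real C4.5 additionally handles numeric thresholds and uses the gain ratio.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy of a multiset of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Reduction in entropy obtained by splitting on one categorical feature."""
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[feature], []).append(label)
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in partitions.values())
    return entropy(labels) - remainder


def best_split_feature(rows, labels):
    """The feature with the highest information gain becomes the node's test."""
    return max(range(len(rows[0])), key=lambda f: information_gain(rows, labels, f))
```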
3.2 Learning sets of rules

Decision trees can be translated into a set of rules by creating a separate rule for each path from the root to a leaf in the tree (Quinlan 1993). However, rules can also be directly induced from training data using a variety of rule-based algorithms. Furnkranz (1999) provided an excellent overview of existing work in rule-based methods. The goal is to construct the smallest rule set that is consistent with the training data. A large number of learned rules is usually a sign that the learning algorithm is attempting to "remember" the training set instead of discovering the assumptions that govern it.

A separate-and-conquer algorithm searches for a rule that explains a part of its training instances, separates these instances, and recursively conquers the remaining instances by learning more rules until no instances remain. In Fig. 2, a general pseudo-code for rule learners is presented.

The difference between heuristics for rule learning and heuristics for decision trees is that the latter evaluate the average quality of a number of disjoint sets (one for each value of the feature that is tested), while rule learners only evaluate the quality of the set of instances that is covered by the candidate rule. More advanced rule learners differ from this simple pseudo-code mostly by adding mechanisms to prevent over-fitting of the training data, for instance by stopping the specialization process with the use of a quality measure or by generalizing overly specialized rules in a separate pruning phase (Furnkranz 1997).

It is therefore important for a rule induction system to generate decision rules that have high predictability or reliability. These properties are commonly measured by a function called rule quality. An and Cercone (2000) surveyed a number of statistical and empirical rule quality measures. When using unordered rule sets, conflicts can arise between the rules, i.e., two or more rules cover the same example but predict different classes. Lindgren (2004) has recently given a survey of methods used to solve this type of conflict.
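The separate-and-conquer loop summarised in Fig. 2 can be written down compactly. A simplified sketch for a binary problem with categorical features, omitting the pruning and rule-quality refinements discussed above; a rule is represented as a dictionary of feature tests.

```python
def covers(rule, row):
    """A rule {feature_index: required_value} covers a row when every test matches."""
    return all(row[f] == v for f, v in rule.items())


def grow_rule(rows, labels, positive):
    """Inner loop: greedily add the most precise test until the rule covers
    only positive examples or cannot be specialised any further."""
    rule = {}
    while True:
        covered = [(r, l) for r, l in zip(rows, labels) if covers(rule, r)]
        if not covered or all(l == positive for _, l in covered):
            return rule
        candidates = {(f, r[f]) for r, l in covered if l == positive
                      for f in range(len(r)) if f not in rule}
        if not candidates:
            return rule

        def precision(test):
            f, v = test
            matched = [l for r, l in covered if r[f] == v]
            return sum(l == positive for l in matched) / len(matched)

        f, v = max(candidates, key=precision)
        rule[f] = v


def separate_and_conquer(rows, labels, positive):
    """Outer loop: learn a rule, remove the examples it covers, and repeat
    until no positive examples remain."""
    rows, labels, rules = list(rows), list(labels), []
    while any(l == positive for l in labels):
        rule = grow_rule(rows, labels, positive)
        keep = [i for i, r in enumerate(rows) if not covers(rule, r)]
        if len(keep) == len(rows):      # nothing newly covered; stop to avoid looping
            break
        rules.append(rule)
        rows = [rows[i] for i in keep]
        labels = [labels[i] for i in keep]
    return rules
```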
RIPPER is a well-known rule-based algorithm (Cohen 1995). It forms rules through a process of repeated growing and pruning. During the growing phase the rules are made more restrictive in order to fit the training data as closely as possible. During the pruning phase, the rules are made less restrictive in order to avoid overfitting, which can cause poor performance on unseen instances. RIPPER handles multiple classes by ordering them from least to most prevalent and then treating each in order as a distinct two-class problem. Bonarini (2000) gave an overview of fuzzy rule-based classifiers; fuzzy logic tries to improve classification and decision support systems by allowing the use of overlapping class definitions.

Furnkranz (2001) investigated the use of round-robin binarization (or pairwise classification) as a technique for handling multi-class problems with separate-and-conquer rule learning algorithms. His experimental results show that, in comparison to conventional, ordered or unordered binarization, the round-robin approach may yield significant gains in accuracy without risking poor performance.

There are numerous other rule-based learning algorithms; Furnkranz (1999) referred to most of them. The PART algorithm infers rules by repeatedly generating partial decision trees, thus combining the two major paradigms for rule generation: creating rules from decision trees and the separate-and-conquer rule-learning technique. Once a partial tree has been built, a single rule is extracted from it, and for this reason the PART algorithm avoids post-processing (Frank and Witten 1998).

Genetic algorithms (GAs) have also been used for learning sets of rules. Fidelis et al. (2000) used a genetic algorithm to learn binary concepts represented by a disjunctive set of propositional rules, and it was found to be comparable in generalization accuracy to other learning algorithms. Assuming two binary features X1, X2 and the binary target value c, a rule set such as

    IF X1 = True AND X2 = False THEN c = True
    IF X1 = False AND X2 = True THEN c = False

is mapped to a chromosome such as 1001101100. Note that there is a fixed-length bit-string representation for each rule. The task of the genetic algorithm is to find good chromosomes. The goodness of a chromosome is represented in the GA by a function called the fitness function (Reeves and Rowe 2003). For the classification task, the fitness function typically scores the classification accuracy of the rule set over the provided training instances. At the heart of the algorithm are operations which take the population of the present generation and produce the population of the next generation in such a way that the overall fitness of the population is increased. These operations are repeated until some stopping criterion is met, such as a given number of chromosomes being processed or a chromosome of given quality being produced. Three operations take the population of generation t and produce the new population of generation t + 1: selection, crossover and mutation (Reeves and Rowe 2003).
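The fixed-length bit-string idea can be made concrete. The encoding below is illustrative, not the one used by Fidelis et al.: each rule takes five bits for the two binary features X1 and X2 (a "use" bit and a value bit per feature, plus a class bit), a chromosome is a multiple of five bits, and fitness is plain training accuracy. Selection, crossover and mutation then operate on such bit strings exactly as in any other GA.

```python
def decode_rule(bits):
    """Five bits per rule: [use_X1, X1_value, use_X2, X2_value, class].
    A 'use' bit of 0 switches the corresponding condition off."""
    return {
        "tests": {j: bool(bits[2 * j + 1]) for j in range(2) if bits[2 * j]},
        "class": bool(bits[4]),
    }


def rule_matches(rule, x):
    return all(x[j] == v for j, v in rule["tests"].items())


def fitness(chromosome, X, y):
    """Training accuracy of the decoded rule set: the first matching rule
    decides, and instances matched by no rule count as errors."""
    rules = [decode_rule(chromosome[i:i + 5]) for i in range(0, len(chromosome), 5)]
    correct = 0
    for x, target in zip(X, y):
        for rule in rules:
            if rule_matches(rule, x):
                correct += rule["class"] == target
                break
    return correct / len(y)
```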
For the task of learning binary problems, rules are more comprehensible than decision trees because typical rule-based approaches learn a set of rules only for the positive class. On the other hand, if definitions for multiple classes are to be learned, the rule-based learner must be run separately for each class. For each individual class a separate rule set is obtained, and these sets may be inconsistent (a particular instance might be assigned multiple classes) or incomplete (no class might be assigned to a particular instance). These problems can be solved with decision lists (the rules in a rule set are ordered, and a rule is only applicable when none of the preceding rules are applicable), but with the decision tree approach they simply do not occur. Moreover, the divide-and-conquer approach (used by decision trees) is usually more efficient than the separate-and-conquer approach (used by rule-based algorithms). Separate-and-conquer algorithms look at one class at a time and try to produce rules that uniquely identify the class, independently of all the other classes in the training set. For this reason, for small datasets it may be better to use a divide-and-conquer algorithm that considers the entire set at once.

To sum up, the most useful characteristic of rule-based classifiers is their comprehensibility. In addition, even though some rule-based classifiers can deal with numerical features, some experts propose that these features should be discretized before induction, so as to reduce training time and increase classification accuracy (An and Cercone 1999). Classification accuracy of rule learning algorithms can be improved by combining features (as in decision trees) using the background knowledge of the user (Flach and Lavrac 2000) or automatic feature construction algorithms (Markovitch and Rosenstein 2002).

3.2.1 Inductive logic programming

The main limitation of propositional learners is their limited capacity to take into account available background knowledge. Learners that induce hypotheses in the form of logic programs (Horn clauses) are called inductive logic programming systems. De Mantaras and Armengol (1998) presented a survey of inductive logic programming.

The previously discussed logic-based methods learn hypotheses which can be expressed in terms of propositional logic. Inductive logic programming (ILP) classifiers use the framework of first-order predicate logic. As with any learning system, such a system can be quite complex and intractably difficult unless it is constrained by biases of some sort. One might restrict the program to Horn clauses, not allow recursion, and so on (De Raedt 1996).
First, most algorithms favor those clauses which cover as many instances as possible; then they turn to those clauses having a smaller number of features, and finally they accept those clauses which have the least total number of values in the internal disjunctions. ILP systems typically adopt the covering approach of rule induction systems. In a main loop, they construct a clause explaining some of the positive examples, add this clause to the hypothesis, remove the positive examples explained, and repeat this until all positive examples are explained (the hypothesis is complete). In an inner loop, individual clauses are constructed by (heuristically) searching the space of possible clauses, which is structured by a specialization or generalization operator. Typically, a search starts with a very general rule (a clause with no conditions in the body), then proceeds to add literals (conditions) to this clause until it covers (explains) only positive examples (the clause is consistent). This search can be bounded from below by using so-called "bottom clauses", constructed by least general generalization or inverse resolution/entailment.

There are three major patterns in which a logic program might be generalized: (i) replace some terms in a program clause by variables, (ii) remove literals from the body of a clause, (iii) add a clause to the program. Correspondingly, there are three patterns in which a logic program might be specialized: (i) replace some variables in a program clause by terms (a substitution), (ii) add literals to the body of a clause, and (iii) remove a clause from the program.

FOIL (Quinlan 1995) is an algorithm designed to learn a set of first-order rules to predict whether a target predicate is true. It differs from classifiers such as C4.5 in that it learns relations among features that are described with variables. PROGOL (Muggleton 1995) is another well-known first-order rule learner.

A significant part of ILP research now goes under the heading of relational data mining and is concerned with finding decision trees from multi-table relational databases (Dzeroski and Lavrac 2001). These ILP algorithms can be understood as a kind of decision tree induction where each node of the decision tree is itself a sub-decision tree, and each sub-decision tree consists of nodes that make binary splits on several features using the background relations available. However, since ILP methods search a much larger space of possible rules in a more expressive language, they are computationally more demanding. Blockeel and De Raedt (1998) studied scaling up first-order logical decision trees with the TILDE algorithm. A dataset is presented to TILDE in the form of a set of interpretations. Each interpretation consists of a number of Prolog facts, surrounded by a begin and an end line. The background knowledge is simply a Prolog program.

Fig. 3 First-order rule versus zero-order rule

ILP differs from most other forms of machine learning both in its use of an expressive representation language and in its ability to make use of logically encoded background knowledge. This has allowed successful applications of ILP in areas such as molecular biology and natural language processing, which both have rich sources of background knowledge and both benefit from the use of conceptually expressive languages (Muggleton 1999).

When comparing ILP (first-order predicate logic rules) and feature-value learning techniques (propositional logic rules), there is a trade-off between expressive power and efficiency (Dantsin et al. 2001). ILP techniques are typically more expressive (see Fig. 3) but also less efficient.
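Figure 3 itself is not reproduced here; as a standard illustration of the contrast it draws, a first-order rule can express a relation over individuals using variables, something a zero-order (propositional) rule, which can only test fixed attribute values of a single instance, cannot do. The predicates and attribute names below are illustrative.

```latex
% First-order (ILP) rule: one clause, with variables X and Y ranging over individuals
\mathit{daughter}(X, Y) \leftarrow \mathit{female}(X) \land \mathit{parent}(Y, X)

% Zero-order (propositional) rule: tests on attribute values of one instance
\textsf{IF } (\mathit{sex} = \mathrm{female}) \land (\mathit{parent\_known} = \mathrm{true})
\textsf{ THEN } \mathit{class} = \mathrm{daughter}
```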
If the available data has a standard tabular form, with rows being individual records (training instances) and columns being properties (features) used to describe the data, a data analyst has no reason to become interested in ILP if there is no expert knowledge (background knowledge). ILP can be used for data mining in multi-relational data mining tasks with data stored in relational databases and in tasks with abundant expert knowledge of a relational or structural nature. Typically, each predicate will correspond to one relation in the relational database. Each fact in an interpretation is a tuple (instance) in the database, and an interpretation corresponds to a part of the database (a set of tuples). Background knowledge can be expressed by means of views as well as extensional tables.

4 Perceptron-based techniques

Other well-known algorithms are based on the notion of the perceptron. The perceptron can be briefly described as follows: if x1 through xn are input feature values and w1 through wn are connection weights (the prediction vector, typically real numbers in the interval [-1, 1]), then the perceptron computes the sum of weighted inputs, sum_i x_i w_i, and the output goes through an adjustable threshold: if the sum is above the threshold, the output is 1; otherwise it is 0. The most common way the perceptron algorithm is used for learning from a batch of training instances is to run the algorithm repeatedly through the training set until it finds a prediction vector which is correct on all of the training set. This prediction rule is then used for predicting the labels on the test set.

WINNOW (Littlestone and Warmuth 1994) is based on the perceptron idea. It was experimentally shown (Blum 1997) that WINNOW can adapt rapidly to changes in the target function (concept drift). A target function (such as user preferences) is not static in time. In order to enable, for example, a decision tree algorithm to respond to changes, it is necessary to decide which old training instances could be deleted. A number of algorithms similar to WINNOW have been developed, such as those by Auer and Warmuth (1998).
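A minimal sketch of the batch perceptron procedure just described: weights start in [-1, 1], the threshold is kept explicit, and the loop stops after a full error-free pass over the training set. The epoch cap is a safeguard added here, since the loop only terminates on its own when the data are linearly separable.

```python
import random


def train_perceptron(X, y, learning_rate=0.1, max_epochs=1000, seed=0):
    """Run repeatedly through the training set, nudging the prediction vector
    on every mistake, until one pass makes no errors (labels are 0/1)."""
    rng = random.Random(seed)
    weights = [rng.uniform(-1.0, 1.0) for _ in range(len(X[0]))]
    threshold = rng.uniform(-1.0, 1.0)

    for _ in range(max_epochs):
        errors = 0
        for x, target in zip(X, y):
            weighted_sum = sum(w * xi for w, xi in zip(weights, x))
            output = 1 if weighted_sum > threshold else 0
            if output != target:
                errors += 1
                for i, xi in enumerate(x):
                    weights[i] += learning_rate * (target - output) * xi
                threshold -= learning_rate * (target - output)
        if errors == 0:
            break
    return weights, threshold
```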
A New Feature Selection Method for Neuro-Fuzzy Classifier

t07026@.tw    cmchen@.tw    69370034@.tw

1. Introduction

Intelligent data analysis (IDA) and knowledge discovery in databases turn raw data into useful knowledge and have been applied in many domains [12][6][1][11]. Fuzzy theory, introduced by Zadeh in 1965 [10], uses linguistic variables to express imprecise concepts, and combining it with neural networks exploits the strengths of both approaches [14]. Unlike black-box models, the rules of a neuro-fuzzy classifier can be interpreted [4].

2. K-means

K-means groups the data according to the distance between each sample and the cluster centres. Its steps are:

Step 1: choose the number of clusters k.
Step 2: initialise the centres c_j, j = 1, ..., k.
Step 3: assign every sample x_i^{(j)} to its nearest centre, minimising

    J = \sum_{j=1}^{k} \sum_{i=1}^{n_j} \| x_i^{(j)} - c_j \|^2            (1)

Step 4: collect the m samples assigned to each cluster.
Step 5: recompute each centre as the mean of its assigned samples,

    c_j^{(new)} = \frac{1}{m} \sum_{i=1}^{m} x_i^{(j)}                      (2)

Step 6: if the new centres c_j^{(new)} differ from c_j, repeat from Step 3; otherwise stop.

The resulting centres are used as the parameters of triangular membership functions, which map each attribute value into [0, 1]:

    \mu : R \to [0, 1], \quad
    \mu(x) = \begin{cases} (x - a)/(b - a) & x \in [a, b) \\ (c - x)/(c - b) & x \in [b, c] \\ 0 & \text{otherwise} \end{cases}      (3)

3. The neuro-fuzzy classifier NEFCLASS

Nauck and Kruse proposed the Neuro-Fuzzy Classification model (NEFCLASS) in 1999 [4]; it learns fuzzy classification rules from data and the rules are easy to read.

3.1 The neuro-fuzzy network (NFN)

The classifier is a three-layer neuro-fuzzy network (Fig. 1) with an input layer L1, a rule layer L2 and a class layer L3. A connection from input X to rule R carries a fuzzy set (linguistic term) w_{XR}; a connection from rule R to class C carries a weight w_{RC} in {0, 1}. For an input unit, net_X = x. A rule unit computes

    O_R = net_R = \min_{X \in L_1} w_{XR}(O_X)                              (4)

and a class unit computes

    O_C = net_C = \max_{R \in L_2} w_{RC} \, O_R                            (5)

with activations equal to outputs, a_X = O_X and a_R = O_R. Because the connections w_{XR} are linguistic terms, the knowledge stored in the network can be read directly as fuzzy rules.

3.2 Learning in NEFCLASS

Rules are created from the training data as follows:

Step 1: for a training pattern, compute the activation of a candidate rule,
    a_R = \min_{X \in L_1} w_{XR}(x)                                         (6)
Step 2: assign the rule to the class with the largest accumulated activation among the m classes,
    l = \arg\max_{i \in \{1, \dots, m\}} C_i                                 (7)
Step 3: score each rule by its performance,
    Perf_l = C_l(c) - \sum_{i \in \{1, \dots, m\},\, i \neq c} C_l(i)        (8)
Step 4: keep only the best-performing rules, up to the allowed rule-base size.
Step 5: refine the rule base.

NEFCLASS is easy to apply to classification problems, but with many attributes the rule base grows and the rules become hard to interpret.

4. Feature selection for the neuro-fuzzy classifier

When many attributes are used, irrelevant attributes make the rules harder to interpret and reduce precision [13]; feature selection can be carried out before training [8][7] or simultaneously with the learning of the network [3]. The method proposed here evaluates attributes, and combinations of attributes, by the degree to which their fuzzy membership values separate the classes, so that the number of attributes can be reduced before the classifier is refined.

Each attribute value has membership degrees in the linguistic terms small (S), medium (M) and large (L) (Fig. 2). Equation (9) assigns to each sample a signed degree λ determined by which linguistic term dominates, and Eq. (10) measures how well an attribute separates class l by subtracting from the degree obtained for class l the degrees obtained for the other classes. For p attributes there are

    g = \sum_{j=1}^{p} \frac{p!}{j!\,(p-j)!}                                  (11)

attribute combinations, and with r rules the number of evaluations is

    E = r \cdot g                                                             (12)

Each combination q is scored for rule l by averaging the degrees of its attributes over the t training patterns (Eqs. 13-14), and the combinations with the highest scores are kept.

5. Experiments

The method was implemented with PHP 4.3.4 and MySQL. The Wisconsin Breast Cancer (WBC) data set from the UCI repository was used: 699 samples in two classes (458 benign, 241 malignant). After removing samples with missing values, 683 samples with nine attributes x1, ..., x9 remain, split into subsets of 342 and 341 samples. K-means determines the parameters of the membership functions. With all nine attributes, NEFCLASS with K-means obtains 324 correct and 18 incorrect classifications (94.74%) on one subset, 327 correct and 14 incorrect (95.89%) on the other, and 651 correct and 32 incorrect (95.3%) overall. The four rules obtained are shown in Table 1.

Table 1. NEFCLASS rules
     If   x1 x2 x3 x4 x5 x6 x7 x8 x9   Then
 R1       S  S  S  S  S  S  S  S  S    benign
 R2       L  L  L  L  L  L  L  L  L    malign
 R3       S  L  L  L  L  L  L  L  S    malign
 R4       L  L  L  L  L  L  L  L  S    malign

Table 2. Enumeration of attribute combinations
 1-9    x1, ..., x9 (single attributes)
 10     x1 x2
 11     x1 x3
 12     x1 x4
 ...
 509    x1 x3 x4 x5 x6 x7 x8 x9
 510    x2 x3 x4 x5 x6 x7 x8 x9
 511    x1 x2 x3 x4 x5 x6 x7 x8 x9

The 511 combinations are scored for the two rules R1 and R2; Table 3 lists the best-scoring combinations for each rule. The attributes x1, x6 and x8 are selected, giving the two rules of Table 4. With this reduced attribute set the classifier obtains 327 correct out of 342 samples (95.61%) and 326 out of 341 (95.60%), i.e. 653 out of 683 (95.61%) overall.

Table 3. Best combinations for rules R1 and R2
 R1:   6    x6          181.30        R2:   1    x1          14.54
       41   x6 x8       178.00              15   x1 x7       -4.34
       8    x8          174.70              7    x7          -23.23
       115  x4 x6 x8    174.16              65   x1 x5 x7    -30.43
       32   x4 x6       173.89              13   x1 x5       -34.04
       34   x4 x8       170.59              72   x1 x7 x9    -34.91

Table 4. Selected rules
     If   x1  x6  x8    Then
 R1       S   S         benign
 R2       L             malign

Table 5. Accuracy with and without feature selection
                        NEFCLASS + K-means    NEFCLASS + K-means + feature selection
 Samples                683                   683
 Fuzzy sets             18                    3
 Rules                  4                     2
 Correctly classified   651                   653
 Misclassified          25                    24
 Unclassified           7                     6
 Accuracy               95.31%                95.61%

Table 6. Comparison with other methods
 Method                 Fuzzy sets   Rules   Accuracy
 Nauck et al. [4]       10           2       95.06%
 Wang et al. [9]        18           2       96.3%
 Ishibuchi et al. [5]   >4           4       97.4%
 Chang et al. [2]       4            3       96.5%
 This work              3            2       95.61%

6. Conclusions

Using K-means to set the parameters of the membership functions and selecting attributes by the proposed membership-degree measure preserves the knowledge of NEFCLASS while reducing the number of attributes, fuzzy sets and rules; on the WBC data the reduced classifier is slightly more accurate (95.61% versus 95.31%) with only three attributes and two rules. This work was supported by grants NSC-93-2218-E-003-001 and NSC-93-2218-E-003-002.
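Section 2 above derives triangular fuzzy sets from K-means centres. The exact construction is not fully spelled out in the recovered text, so the sketch below only illustrates the general idea: cluster each attribute separately, sort the centres, and use each centre as the peak b of the triangular function of Eq. (3), with its neighbours (or the attribute's range limits) as the feet a and c.

```python
import random


def kmeans_1d(values, k, iterations=100, seed=0):
    """Plain one-dimensional K-means: assign every value to the nearest
    centre, then move each centre to the mean of its assigned values."""
    centres = random.Random(seed).sample(list(values), k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: (v - centres[j]) ** 2)
            clusters[nearest].append(v)
        centres = [sum(c) / len(c) if c else centres[j]
                   for j, c in enumerate(clusters)]
    return sorted(centres)


def triangular(a, b, c):
    """Membership function of Eq. (3), assuming a < b < c."""
    def mu(x):
        if a <= x < b:
            return (x - a) / (b - a)
        if b <= x <= c:
            return (c - x) / (c - b)
        return 0.0
    return mu


def fuzzy_sets_from_centres(centres, low, high):
    """One triangular set per centre; neighbouring centres (or the attribute's
    minimum and maximum) act as the left and right feet."""
    points = [low] + list(centres) + [high]
    return [triangular(points[j], points[j + 1], points[j + 2])
            for j in range(len(centres))]
```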
References

[1] C. Fredembach, M. Schroder, and S. Susstrunk, "Eigenregions for image classification," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 26, no. 12, pp. 1645-1649, 2004.
[2] C. Xiaoguang and H. John, "Evolutionary design of a fuzzy classifier from data," IEEE Trans. Syst. Man Cybern., vol. 34, no. 4, pp. 1894-1906, 2004.
[3] D. Chakraborty and N. R. Pal, "A neuro-fuzzy scheme for simultaneous feature selection and fuzzy rule-based classification," IEEE Trans. Neural Networks, vol. 15, pp. 110-123, 2004.
[4] D. Nauck and R. Kruse, "Obtaining interpretable fuzzy classification rules from medical data," Artif. Intell. Med., vol. 16, pp. 149-169, 1999.
[5] H. Ishibuchi and T. Yamamoto, "Effects of three-objective genetic rule selection on the generalization ability of fuzzy rule-based systems," Proc. 2nd Int. Conf. Evolutionary Multi-Criterion Optimization, Faro, Portugal, Apr. 8-11, 2003, pp. 608-622.
[6] H. Li and D. Enke, "Forecasting series-based stock price data using direct reinforcement learning," Proc. 2004 Int. Conf. Neural Networks, vol. 2, pp. 1103-1108, 2004.
[7] H. M. Lee, C. M. Chen, T. M. Chen and Y. L. Jou, "An efficient fuzzy classifier with feature selection based on fuzzy entropy," IEEE Trans. Syst. Man Cybern., vol. 31, no. 3, pp. 426-432, 2001.
[8] J. A. Roubos, M. Setnes, and A. Janos, "Learning fuzzy classification rules from labeled data," Inform. Sci., vol. 150, pp. 77-93, 2003.
[9] J. S. Wang and G. C. S. Lee, "Self-adaptive neuro-fuzzy inference system for classification applications," IEEE Trans. Fuzzy Syst., vol. 10, pp. 790-802, Dec. 2002.
[10] L. A. Zadeh, "Fuzzy sets," Information and Control, vol. 8, pp. 338-353, 1965.
[11] P. R. Innocent and R. I. John, "Computer aided fuzzy medical diagnosis," Artif. Intell. Med., vol. 162, pp. 81-104, 2004.
[12] R. Lee and J. Liu, "iJADE WeatherMAN: a weather forecasting system using intelligent multiagent-based fuzzy neuro network," IEEE Trans. Syst. Man Cybern., vol. 34, no. 3, pp. 369-377, 2004.
[13] R. Mikut, J. Jakel and L. Groll, "Interpretability issues in data-based learning of fuzzy systems," Fuzzy Sets and Systems, pp. 179-197, 2005.
[14] R. P. Paiva and A. Dourado, "Interpretability and learning in neuro-fuzzy systems," Fuzzy Sets and Systems, vol. 147, pp. 17-38, 2004.