Weka Summary

Introduction
Weka is a free, open-source data mining and machine learning software package, first released in 1997. It is developed by the Machine Learning Group at the University of Waikato, New Zealand, and provides a range of functionality for data preprocessing, classification, regression, clustering, and association rule mining. This article summarizes Weka and discusses its main features and advantages.
Main Features

1. Data Preprocessing
Weka provides a variety of data preprocessing techniques for cleaning, transforming, and integrating data. The most commonly used include missing-value handling, discretization, attribute selection, and feature scaling. With these techniques, users can reduce noise and redundancy in the data and improve the performance of machine learning models.
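Below is a minimal sketch of applying two such filters through Weka's Java API (the file name data.arff is a placeholder; the filter classes are part of Weka's standard distribution):

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Discretize;
    import weka.filters.unsupervised.attribute.ReplaceMissingValues;

    public class PreprocessDemo {
        public static void main(String[] args) throws Exception {
            // Load a dataset in ARFF format ("data.arff" is a placeholder path).
            Instances data = DataSource.read("data.arff");

            // Replace missing values with attribute means (numeric) or modes (nominal).
            ReplaceMissingValues fillIn = new ReplaceMissingValues();
            fillIn.setInputFormat(data);
            Instances noMissing = Filter.useFilter(data, fillIn);

            // Discretize numeric attributes into nominal bins.
            Discretize discretize = new Discretize();
            discretize.setInputFormat(noMissing);
            Instances result = Filter.useFilter(noMissing, discretize);

            System.out.println("Attributes after filtering: " + result.numAttributes());
        }
    }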
2. Classification
Weka supports many classification algorithms, including decision trees, Bayesian classifiers, neural networks, and support vector machines. Users can choose an algorithm appropriate to their classification task. Weka also provides cross-validation and automatic parameter tuning to help users evaluate and optimize classifier performance.
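A minimal sketch of training and cross-validating a decision tree (J48, Weka's C4.5 implementation) via the Java API; iris.arff is a placeholder dataset whose class is its last attribute:

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class ClassifyDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("iris.arff"); // placeholder dataset
            data.setClassIndex(data.numAttributes() - 1);  // class is the last attribute

            J48 tree = new J48(); // C4.5 decision tree

            // 10-fold cross-validation; the fixed Random seed makes the folds reproducible.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(tree, data, 10, new Random(1));
            System.out.println(eval.toSummaryString());
        }
    }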
3. Regression
Beyond classification, Weka also supports regression problems. Users can apply algorithms such as linear regression, polynomial regression, and local regression to a given dataset. Weka provides model evaluation and visualization tools that help users understand regression models and assess their predictive performance.
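The same pattern carries over to regression; a hedged sketch using Weka's LinearRegression on a dataset with a numeric class (housing.arff is a placeholder):

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.functions.LinearRegression;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class RegressionDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("housing.arff"); // placeholder, numeric class
            data.setClassIndex(data.numAttributes() - 1);

            LinearRegression model = new LinearRegression();
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(model, data, 10, new Random(1));

            // For numeric targets, correlation and error measures replace accuracy.
            System.out.println("Correlation: " + eval.correlationCoefficient());
            System.out.println("RMSE: " + eval.rootMeanSquaredError());
        }
    }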
4. Clustering
Weka's clustering algorithms group similar instances in a dataset together. Weka supports common clustering algorithms such as k-means, EM-based mixture models, and hierarchical clustering, with further algorithms such as DBSCAN available through packages. Users can choose an appropriate algorithm based on the characteristics of their data and interpret the resulting clusters.
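A minimal k-means sketch via the Java API (data.arff and the cluster count of 3 are placeholders):

    import weka.clusterers.SimpleKMeans;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class ClusterDemo {
        public static void main(String[] args) throws Exception {
            // Clustering is unsupervised: the data should carry no class attribute.
            Instances data = DataSource.read("data.arff"); // placeholder dataset
            SimpleKMeans kmeans = new SimpleKMeans();
            kmeans.setNumClusters(3); // assumed number of clusters
            kmeans.buildClusterer(data);
            System.out.println(kmeans); // prints centroids and cluster sizes
            System.out.println("First instance is in cluster "
                    + kmeans.clusterInstance(data.instance(0)));
        }
    }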
5. Association Rule Mining
Association rule mining is a common data mining task for discovering frequent itemsets and association rules in a dataset. With Weka, users can mine association rules using algorithms such as Apriori and FP-Growth. Weka also provides tools supporting several evaluation metrics for assessing the quality and reliability of the discovered rules.
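A hedged Apriori sketch (transactions.arff is a placeholder; Apriori expects nominal attributes, e.g. market-basket style data):

    import weka.associations.Apriori;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class AssociationDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("transactions.arff"); // placeholder
            Apriori apriori = new Apriori();
            apriori.setNumRules(10); // report the ten strongest rules
            apriori.buildAssociations(data);
            System.out.println(apriori); // prints the discovered rules with their metrics
        }
    }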
Advantages

1. Ease of use
Weka's user interface is friendly and easy to use. Its intuitive graphical interface lets users get started quickly with a variety of data mining tasks. In addition, Weka supports command-line operation, which makes it convenient to use and integrate Weka's functionality in scripts (a brief command-line example appears after this list).
2. Powerful functionality
Weka offers a rich set of data mining and machine learning features, covering data preprocessing, classification, regression, clustering, and association rule mining.
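To illustrate the command-line support noted above, a minimal example (weka.jar and train.arff are placeholders for the installed jar and a local ARFF dataset; -t names the training file and -x the number of cross-validation folds):

    java -cp weka.jar weka.classifiers.trees.J48 -t train.arff -x 10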
What Mathematical Foundations Does Machine Learning Require?

Countless students charge into machine learning full of enthusiasm, determined to make their mark, only to feel utterly defeated the moment they see the formulas. Indeed, the fundamental reason machine learning has a higher barrier to entry than other development work is the mathematics. For every algorithm, fitting the training set as well as possible while preserving generalization requires continually analyzing results and data and tuning parameters, and this demands some understanding of the data distribution and of the mathematical principles underlying the model. Fortunately, if you only want to apply machine learning sensibly, rather than do cutting-edge research, the required mathematics is mostly digestible with some effort. As for the more advanced parts, well, this blogger is happy to admit to being a 'math noob'.
The mathematical foundations needed by essentially all common machine learning algorithms are concentrated in calculus, linear algebra, and probability and statistics. Below we first go over the key points; the latter part of the article introduces some materials for learning and consolidating this knowledge.
Calculus
Computing derivatives, together with their geometric and physical meaning, is at the core of the solution procedure of most machine learning algorithms, for example gradient descent and Newton's method. With a solid grasp of the geometric intuition, you can understand statements such as "gradient descent approximates the local landscape with a plane, while Newton's method approximates it with a curved surface," and apply these methods more effectively.
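The contrast can be written compactly; a standard statement of the two update rules, with step size $\eta$, gradient $\nabla f$, and Hessian $H_f$:

$$\theta_{t+1} = \theta_t - \eta\,\nabla f(\theta_t) \qquad \text{(gradient descent: first-order, planar local approximation)}$$
$$\theta_{t+1} = \theta_t - H_f(\theta_t)^{-1}\,\nabla f(\theta_t) \qquad \text{(Newton's method: second-order, curved local approximation)}$$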
Knowledge of convex optimization and constrained optimization appears throughout these algorithms; studying it systematically will take your understanding of the algorithms to a new level.
Linear Algebra
Applying most machine learning algorithms in practice depends on efficient computation. In this setting, the nested for-loops programmers are used to usually will not do, but most loop-based operations can be rewritten as matrix multiplications, which is where linear algebra comes in; inner products of vectors likewise appear everywhere. Matrix multiplication and matrix decompositions show up again and again in methods such as principal component analysis (PCA) and singular value decomposition (SVD).
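As a compact illustration in standard notation: a loop over training examples computing $\hat{y}_i = \sum_j X_{ij} w_j$ collapses into a single matrix-vector product, and PCA can be read off the SVD:

$$\hat{y} = Xw, \qquad X = U\Sigma V^{\top}, \qquad X_k = U_k \Sigma_k V_k^{\top},$$

where keeping only the top $k$ singular values and vectors gives the rank-$k$ approximation that PCA uses for dimensionality reduction.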
Probability and Statistics
Broadly speaking, much of what machine learning does is very similar to statistical data analysis and the discovery of hidden patterns. Maximum likelihood estimation and Bayesian models are the theoretical foundations; naive Bayes, language models (n-grams), hidden Markov models (HMMs), and latent-variable mixture models are their more advanced forms. Common distributions such as the Gaussian are the basis of models like the Gaussian mixture model (GMM).
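In formulas (standard definitions): Bayes' rule underlies the models above, naive Bayes adds a conditional-independence assumption over features, and a GMM models the data density as a weighted sum of Gaussians:

$$P(y \mid x) \;\propto\; P(x \mid y)\,P(y), \qquad P(x \mid y) \;=\; \prod_{j} P(x_j \mid y),$$
$$p(x) \;=\; \sum_{k=1}^{K} \pi_k\,\mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \sum_{k} \pi_k = 1.$$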
Literature information:
Title: A Study of Data Mining with Big Data
Authors: V. H. Shastri, V. Sreeprada
Source: International Journal of Emerging Trends and Technology in Computer Science, 2016, 38(2): 99-103

A Study of Data Mining with Big Data

Abstract
Data has become an important part of every economy, industry, organization, business, function and individual. Big Data is a term used to identify data sets whose size is larger than that of a typical database. Big Data introduces unique computational and statistical challenges and is at present expanding in most domains of engineering and science. Because of the volume, variability and velocity of such data, data mining helps to extract useful information from these huge data sets. This article presents the HACE theorem, which characterizes the features of the Big Data revolution, and proposes a Big Data processing model from the data mining perspective.
Keywords: Big Data, Data Mining, HACE theorem, structured and unstructured.

I. Introduction
Big Data refers to the enormous amounts of structured and unstructured data that flood an organization. If this data is properly used, it can lead to meaningful information. Big Data comprises large amounts of data that require a great deal of processing in real time. It provides room to discover new values, to gain in-depth knowledge from hidden values, and to manage the data effectively. A database is an organized collection of logically related data that can be easily managed, updated and accessed. Data mining is the process of discovering interesting knowledge, such as associations, patterns, changes, anomalies and significant structures, from large amounts of data stored in databases or other repositories.
Big Data is characterized by the 3 Vs: volume, velocity and variety. Volume is the amount of data generated every second; it describes data at rest and is also known as the scale characteristic. Velocity is the speed at which data is generated; data generated from social media is an example of high-velocity data. Variety means that different types of data can be involved, such as audio, video or documents, in forms including numbers, images, time series and arrays.
Data mining analyses data from different perspectives and summarizes it into useful information that can be used for business solutions and for predicting future trends. Data mining (DM), also called Knowledge Discovery in Databases (KDD) or Knowledge Discovery and Data Mining, is the process of automatically searching large volumes of data for patterns such as association rules. It applies many computational techniques from statistics, information retrieval, machine learning and pattern recognition. Data mining extracts only the required patterns from the database in a short time span. Based on the type of patterns to be mined, data mining tasks can be classified into summarization, classification, clustering, association and trend analysis.
Big Data is expanding in all domains of science and engineering, including the physical, biological and biomedical sciences.

II. Big Data with Data Mining
Generally, Big Data refers to a collection of large volumes of data generated from various sources such as the internet, social media, business organizations and sensors. We can extract useful information from it with the help of data mining.
Data mining is a technique for discovering patterns, as well as descriptive, understandable models, from large-scale data. Volume refers to data whose size exceeds terabytes and petabytes; this scale and growth make the data difficult to store and analyse using traditional tools. Big Data techniques should be able to mine large amounts of data within a predefined period of time. Traditional database systems were designed to handle small amounts of structured, consistent data, whereas Big Data includes a wide variety of data such as geospatial data, audio, video and unstructured text.
Big Data mining refers to the activity of going through big data sets to look for relevant information. To process large volumes of data from different sources quickly, Hadoop is used. Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. Its distributed file system supports fast data transfer rates among nodes and allows the system to continue operating uninterrupted when a node fails. It runs MapReduce for distributed data processing and works with both structured and unstructured data.

III. Big Data Characteristics: The HACE Theorem
We have a large volume of heterogeneous data with complex relationships among the data items, and we need to discover useful information from it. Imagine a scenario in which blind people are asked to describe an elephant: each blind person may take the trunk for a wall, a leg for a tree, the body for a wall and the tail for a rope, and the blind men can exchange information with each other.
Figure 1: Blind men and the giant elephant.
The characteristics include:
i. Vast data with heterogeneous and diverse sources: one of the fundamental characteristics of Big Data is the large volume of data represented by heterogeneous and diverse dimensions. For example, in the biomedical world a single human being is represented by name, age, gender, family history and so on, while X-ray and CT scan images and videos are used as well. Heterogeneity refers to the different types of representation of the same individual, and diversity refers to the variety of features used to represent a single piece of information.
ii. Autonomous with distributed and decentralized control: the sources are autonomous, i.e., automatically generated; they produce information without any centralized control. This is comparable to the World Wide Web (WWW), where each server provides a certain amount of information without depending on other servers.
iii. Complex and evolving relationships: as the size of the data becomes extremely large, so do the relationships within it. In the early stages, when data is small, there is no complexity in the relationships among the data; data generated from social media and other sources has complex relationships.

IV. Tools: The Open Source Revolution
Large companies such as Facebook, Yahoo, Twitter and LinkedIn benefit from and contribute to open source projects. In Big Data mining there are many open source initiatives; the most popular include:
Apache Mahout: scalable machine learning and data mining open source software based mainly on Hadoop. It implements a wide range of machine learning and data mining algorithms: clustering, classification, collaborative filtering and frequent pattern mining.
R: an open source programming language and software environment designed for statistical computing and visualization.
R was designed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, beginning in 1993, and is used for statistical analysis of very large data sets.
MOA: open source software for stream data mining in real time. It implements classification, regression, clustering, frequent itemset mining and frequent graph mining. It started as a project of the Machine Learning group of the University of Waikato, New Zealand, famous for the WEKA software. The accompanying streams framework provides an environment for defining and running stream processes using simple XML-based definitions and is able to use MOA, Android and Storm.
SAMOA: a new, upcoming software project for distributed stream mining that will combine S4 and Storm with MOA.
Vowpal Wabbit: an open source project started at Yahoo! Research and continued at Microsoft Research to design a fast, scalable, useful learning algorithm. VW is able to learn from terafeature datasets and can exceed the throughput of any single machine's network interface when doing linear learning, via parallel learning.

V. Data Mining for Big Data
Data mining is the process by which data coming from different sources is analysed to discover useful information. Data mining algorithms fall into four categories:
1. Association rules
2. Clustering
3. Classification
4. Regression
Association is used to search for relationships between variables, for example frequently visited items; in short, it establishes relationships among objects. Clustering discovers groups and structures in the data. Classification deals with associating an unknown structure with a known structure. Regression finds a function to model the data.
Table 1 (a classification of data mining algorithms) is not reproduced here. Data mining algorithms can be converted into Big Data MapReduce algorithms on a parallel computing basis. Table 2 (differences between data mining and Big Data) is likewise not reproduced.

VI. Challenges in Big Data
Meeting the challenges of Big Data is difficult: the volume is increasing every day, the velocity is increasing through internet-connected devices, the variety is expanding, and organizations' capability to capture and process the data is limited. The following are the challenges in handling Big Data:
1. Data capture and storage
2. Data transmission
3. Data curation
4. Data analysis
5. Data visualization
The challenges of Big Data mining can be divided into three tiers. The first tier is the setup of data mining algorithms. The second tier includes:
1. Information sharing and data privacy.
2. Domain and application knowledge.
The third tier includes local learning and model fusion for multiple information sources:
3. Mining from sparse, uncertain and incomplete data.
4. Mining complex and dynamic data.
Figure 2: Phases of Big Data challenges.
Generally, mining data from different data sources is tedious because of the size of the data. Big Data is stored in different places, collecting it is a tedious task, and applying basic data mining algorithms to it is an obstacle. Next, we need to consider the privacy of the data. The third issue is the mining algorithms themselves.
When we apply data mining algorithms to these subsets of the data, the results may not be very accurate.

VII. Forecast of the Future
There are several challenges that researchers and practitioners will have to deal with during the next years:
Analytics architecture: it is not yet clear what an optimal architecture for analytics systems should look like to deal with historic data and real-time data at the same time. An interesting proposal is the Lambda architecture of Nathan Marz, which solves the problem of computing arbitrary functions on arbitrary data in real time by decomposing the problem into three layers: the batch layer, the serving layer and the speed layer. It combines in the same system Hadoop for the batch layer and Storm for the speed layer. The properties of the system are: robust and fault-tolerant, scalable, general, extensible, allows ad hoc queries, requires minimal maintenance, and is debuggable.
Statistical significance: it is important to achieve significant statistical results and not be fooled by randomness. As Efron explains in his book on large-scale inference, it is easy to go wrong with huge data sets and thousands of questions to answer at once.
Distributed mining: many data mining techniques are not trivial to parallelize. Obtaining distributed versions of some methods requires substantial practical and theoretical research to provide new methods.
Time-evolving data: data may evolve over time, so Big Data mining techniques should be able to adapt and, in some cases, to detect change first. The data stream mining field, for example, has very powerful techniques for this task.
Compression: when dealing with Big Data, the quantity of space needed to store it is highly relevant. There are two main approaches: compression, where we do not lose anything, and sampling, where we choose the data that is most representative. Using compression, we may spend more time and less space, so we can see it as a transformation from time to space. Using sampling we lose information, but the gains in space may be orders of magnitude. For example, Feldman et al. use coresets to reduce the complexity of Big Data problems; coresets are small sets that provably approximate the original data for a given problem, and using merge-reduce these small sets can then be used for solving hard machine learning problems in parallel.
Visualization: a main task of Big Data analysis is visualizing the results. Because the data is so big, it is very difficult to find user-friendly visualizations. New techniques and frameworks for telling and showing stories will be needed, such as the photographs, infographics and essays in the beautiful book "The Human Face of Big Data".
Hidden Big Data: large quantities of useful data are getting lost because new data is largely untagged and unstructured. The 2012 IDC study on Big Data explains that in 2012, 23% (643 exabytes) of the digital universe would have been useful for Big Data if tagged and analyzed; however, currently only 3% of the potentially useful data is tagged, and even less is analyzed.

VIII. Conclusion
The amount of data is growing exponentially due to social networking sites, search and retrieval engines, media sharing sites, stock trading sites, news sources and so on. Big Data is becoming the new frontier for scientific data research and for business applications. Data mining techniques can be applied to Big Data to acquire useful information from large datasets.
Data mining and Big Data can be used together to acquire a useful picture from the data. Big Data analysis tools such as MapReduce over Hadoop and HDFS help organizations do so.
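To make the MapReduce model mentioned above concrete, here is a hedged sketch of the classic word-count job against the standard Hadoop API. This illustrates the programming model only; it is not code from the paper:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map phase: emit (word, 1) for every token in a line of input text.
    class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                context.write(word, ONE); // pairs are shuffled and grouped by key
            }
        }
    }

    // Reduce phase: sum the counts that arrive for each distinct word.
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }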
Journal of Machine Learning Research 17 (2016) 1-5. Submitted 5/16; Revised 11/16; Published 11/16.

Auto-WEKA 2.0: Automatic model selection and hyperparameter optimization in WEKA
Lars Kotthoff, Chris Thornton, Holger H. Hoos, Frank Hutter, Kevin Leyton-Brown
Department of Computer Science, University of British Columbia, 2366 Main Mall, Vancouver, B.C., V6T 1Z4, Canada
Editor: Geoff Holmes

Abstract
WEKA is a widely used, open-source machine learning platform. Due to its intuitive interface, it is particularly popular with novice users. However, such users often find it hard to identify the best approach for their particular dataset among the many available. We describe the new version of Auto-WEKA, a system designed to help such users by automatically searching through the joint space of WEKA's learning algorithms and their respective hyperparameter settings to maximize performance, using a state-of-the-art Bayesian optimization method. Our new package is tightly integrated with WEKA, making it just as accessible to end users as any other learning algorithm.
Keywords: Hyperparameter Optimization, Model Selection, Feature Selection

1. The Principles Behind Auto-WEKA
The WEKA machine learning software (Hall et al., 2009) puts state-of-the-art machine learning techniques at the disposal of even novice users. However, such users do not typically know how to choose among the dozens of machine learning procedures implemented in WEKA, and each procedure's hyperparameter settings, to achieve good performance.
Auto-WEKA (first introduced by Thornton et al. (2013), who empirically demonstrated state-of-the-art performance; described here is an improved and more broadly accessible implementation focusing on usability and software design) addresses this problem by treating all of WEKA as a single, highly parametric machine learning framework, and using Bayesian optimization to find a strong instantiation for a given dataset. Specifically, it considers the combined space of WEKA's learning algorithms $\mathcal{A} = \{A^{(1)}, \ldots, A^{(k)}\}$ and their associated hyperparameter spaces $\Lambda^{(1)}, \ldots, \Lambda^{(k)}$ and aims to identify the combination of algorithm $A^{(j)} \in \mathcal{A}$ and hyperparameters $\lambda \in \Lambda^{(j)}$ that minimizes cross-validation loss,

$$A^*_{\lambda^*} \in \operatorname*{argmin}_{A^{(j)} \in \mathcal{A},\ \lambda \in \Lambda^{(j)}} \frac{1}{k} \sum_{i=1}^{k} \mathcal{L}\bigl(A^{(j)}_{\lambda}, \mathcal{D}^{(i)}_{\mathrm{train}}, \mathcal{D}^{(i)}_{\mathrm{test}}\bigr),$$

where $\mathcal{L}(A_{\lambda}, \mathcal{D}^{(i)}_{\mathrm{train}}, \mathcal{D}^{(i)}_{\mathrm{test}})$ denotes the loss achieved by algorithm $A$ with hyperparameters $\lambda$ when trained on $\mathcal{D}^{(i)}_{\mathrm{train}}$ and evaluated on $\mathcal{D}^{(i)}_{\mathrm{test}}$. We call this the combined algorithm selection and hyperparameter optimization (CASH) problem. CASH can be seen as a blackbox function optimization problem: determining $\operatorname{argmin}_{\theta \in \Theta} f(\theta)$, where each configuration $\theta \in \Theta$ comprises the choice of algorithm $A^{(j)} \in \mathcal{A}$ and its hyperparameter settings $\lambda \in \Lambda^{(j)}$. In this formulation, the hyperparameters of algorithm $A^{(j)}$ are conditional on $A^{(j)}$ being selected. For a given $\theta$ representing algorithm $A^{(j)} \in \mathcal{A}$ and hyperparameter settings $\lambda \in \Lambda^{(j)}$, $f(\theta)$ is then defined as the cross-validation loss $\frac{1}{k}\sum_{i=1}^{k} \mathcal{L}(A^{(j)}_{\lambda}, \mathcal{D}^{(i)}_{\mathrm{train}}, \mathcal{D}^{(i)}_{\mathrm{test}})$. (In fact, on top of machine learning algorithms and their respective hyperparameters, Auto-WEKA also includes attribute selection methods and their respective hyperparameters in the configurations $\theta$, thereby jointly optimizing over their choice and the choice of algorithms.)
Bayesian optimization (see, e.g., Brochu et al., 2010), also known as sequential model-based optimization, is an iterative method for solving such blackbox optimization problems. In its $n$-th iteration, it fits a probabilistic model based on the first $n-1$ function evaluations $\langle \theta_i, f(\theta_i) \rangle_{i=1}^{n-1}$, uses this model to select the next $\theta_n$ to evaluate (trading off exploration of new parts of the space against exploitation of regions known to be good), and evaluates $f(\theta_n)$.
While Bayesian optimization based on Gaussian process models is known to perform well for low-dimensional problems with numerical hyperparameters (see, e.g., Snoek et al., 2012), tree-based models have been shown to be more effective for high-dimensional, structured, and partly discrete problems (Eggensperger et al., 2013), such as the highly conditional space of WEKA's learning algorithms and their corresponding hyperparameters faced here. (Conditional dependencies can also be accommodated in the Gaussian process framework (Hutter and Osborne, 2013; Swersky et al., 2013), but currently tree-based methods achieve better performance.) Thornton et al. (2013) showed that tree-based Bayesian optimization methods yielded the best performance in Auto-WEKA, with the random-forest-based SMAC (Hutter et al., 2011) performing better than the tree-structured Parzen estimator, TPE (Bergstra et al., 2011). Auto-WEKA uses SMAC to determine the classifier with the best performance on the given data.

2. Auto-WEKA 2.0
Since the initial release of a usable research prototype in 2013, we have made substantial improvements to the Auto-WEKA package described by Thornton et al. (2013). At a prosaic level, we have fixed bugs, improved tests and documentation, and updated the software to work with the latest versions of WEKA and Java. We have also added four major features.
First, we now support regression algorithms, expanding Auto-WEKA beyond its previous focus on classification (starred entries in Figure 1). Second, we now support the optimization of all performance metrics WEKA supports. Third, we now natively support parallel runs (on a single machine) to find good configurations faster, and save the N best configurations of each run instead of just the single best. Fourth, Auto-WEKA 2.0 is now fully integrated with WEKA. This is important because the crux of Auto-WEKA lies in its simplicity: providing a push-button interface that requires no knowledge about the available learning algorithms or their hyperparameters, asking the user to provide, in addition to the dataset to be processed, only a memory bound (1 GB by default) and the overall time budget available for the entire learning process. (Internally, to avoid using all its budget for executing a single slow learner, Auto-WEKA limits individual runs of any learner to 1/12 of the overall budget; it further limits feature search to 1/60 of the budget.) The overall budget is set to 15 minutes by default to accommodate impatient users; longer runs allow the Bayesian optimizer to search the space more thoroughly, and we recommend at least several hours for production runs.

Figure 1: Learners and methods supported by Auto-WEKA 2.0, along with the number of hyperparameters |Λ|. Every learner supports classification; starred learners also support regression.
Learners: BayesNet (2), DecisionStump* (0), DecisionTable* (4), GaussianProcesses* (10), IBk* (5), J48 (9), JRip (4), KStar* (3), LinearRegression* (3), LMT (9), Logistic (1), M5P (4), M5Rules (4), MultilayerPerceptron* (8), NaiveBayes (2), NaiveBayesMultinomial (0), OneR (1), PART (4), RandomForest (7), RandomTree* (11), REPTree* (6), SGD* (5), SimpleLinearRegression* (0), SimpleLogistic (5), SMO (11), SMOreg* (13), VotedPerceptron (3), ZeroR* (0).
Ensemble methods: Stacking (2), Vote (2).
Meta-methods: LWL (5), AdaBoostM1 (6), AdditiveRegression (4), AttributeSelectedClassifier (2), Bagging (4), RandomCommittee (2), RandomSubSpace (3).
Attribute selection methods: BestFirst (2), GreedyStepwise (4).

The usability of the earlier research prototype was hampered by the fact that users had to download Auto-WEKA manually and run it separately from WEKA.
In contrast, Auto-WEKA 2.0 is now available through WEKA's package manager; users do not need to install software separately, as everything is included in the package and installed automatically upon request. After installation, Auto-WEKA 2.0 can be used in two different ways:
1. As a meta-classifier: Auto-WEKA can be run like any other machine learning algorithm in WEKA: via the GUI, the command-line interface, or the public API. Figure 2 shows how to run it from the command line.
2. Through the Auto-WEKA tab: this provides a customized interface that hides some of the complexity. Figure 3 shows the output of an example run.

    java -cp autoweka.jar weka.classifiers.meta.AutoWEKAClassifier -timeLimit 5 -t iris.arff -no-cv

Figure 2: Command-line call for running Auto-WEKA with a time limit of 5 minutes on the training dataset iris.arff. Auto-WEKA performs cross-validation internally, so we disable WEKA's cross-validation (-no-cv). Running with -h lists the available options.

Figure 3: Example Auto-WEKA run on the iris dataset. The resulting best classifier, along with its parameter settings, is printed first, followed by its performance. While Auto-WEKA runs, it logs to the status bar how many configurations it has evaluated so far.

Source code for Auto-WEKA is hosted on GitHub (https://github.com/automl/autoweka) and is available under the GPL license (version 3). Releases are published to the WEKA package repository and are available both through the WEKA package manager and from the Auto-WEKA project website. A manual describes how to use the WEKA package and gives a high-level overview for developers; we also provide lower-level Javadoc documentation. An issue tracker on GitHub, JUnit tests, and the continuous integration system Travis facilitate bug tracking and correctness of the code. Since its release on March 1, 2016, Auto-WEKA 2.0 has been downloaded more than 15,000 times, with an average of about 400 downloads per week.

3. Related Implementations
Auto-WEKA was the first method to use Bayesian optimization to automatically instantiate a highly parametric machine learning framework at the push of a button. This automated machine learning (AutoML) approach has recently also been applied to Python and scikit-learn (Pedregosa et al., 2011) in Auto-WEKA's sister package, Auto-sklearn (Feurer et al., 2015). Auto-sklearn uses the same Bayesian optimizer as Auto-WEKA but covers a smaller space of models and hyperparameters, since scikit-learn does not implement as many different machine learning techniques as WEKA; however, Auto-sklearn includes additional meta-learning techniques.
It is also possible to optimize hyperparameters using WEKA's own grid search and MultiSearch packages. However, these packages only permit tuning one learner and one filtering method at a time; grid search handles only one hyperparameter; and hyperparameter names and possible values have to be specified by the user.
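As a companion to the command-line call in Figure 2, a hedged sketch of the same run through the public API. AutoWEKAClassifier is the class shown in Figure 2, but the setter name setTimeLimit is an assumption mirroring the -timeLimit flag:

    import weka.classifiers.meta.AutoWEKAClassifier;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class AutoWekaDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("iris.arff");  // training data, as in Figure 2
            data.setClassIndex(data.numAttributes() - 1);   // class is the last attribute
            AutoWEKAClassifier autoweka = new AutoWEKAClassifier();
            autoweka.setTimeLimit(5);       // minutes; assumed setter mirroring -timeLimit
            autoweka.buildClassifier(data); // runs the SMAC-based search internally
            System.out.println(autoweka);   // prints the chosen classifier and its settings
        }
    }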
References
J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems 24 (NIPS'11), pages 2546-2554, 2011.
E. Brochu, V. Cora, and N. de Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. Computing Research Repository (arXiv), abs/1012.2599, 2010.
K. Eggensperger, M. Feurer, F. Hutter, J. Bergstra, J. Snoek, H. Hoos, and K. Leyton-Brown. Towards an empirical foundation for assessing Bayesian optimization of hyperparameters. In NIPS Workshop on Bayesian Optimization (BayesOpt'13), 2013.
M. Feurer, A. Klein, K. Eggensperger, J. Springenberg, M. Blum, and F. Hutter. Efficient and robust automated machine learning. In Advances in Neural Information Processing Systems 28 (NIPS'15), pages 2944-2952, 2015.
M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA data mining software: an update. SIGKDD Explorations Newsletter, 11(1):10-18, Nov. 2009. ISSN 1931-0145.
F. Hutter and M. Osborne. A kernel for hierarchical parameter spaces. Computing Research Repository (arXiv), abs/1310.5738, Oct. 2013.
F. Hutter, H. H. Hoos, and K. Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In Learning and Intelligent OptimizatioN Conference (LION 5), pages 507-523, 2011.
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: machine learning in Python. Journal of Machine Learning Research, 12:2825-2830, 2011.
J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems 25 (NIPS'12), pages 2951-2959, 2012.
K. Swersky, D. Duvenaud, J. Snoek, F. Hutter, and M. Osborne. Raiders of the lost architecture: kernels for Bayesian optimization in conditional parameter spaces. In NIPS Workshop on Bayesian Optimization (BayesOpt'13), 2013.
C. Thornton, F. Hutter, H. H. Hoos, and K. Leyton-Brown. Auto-WEKA: combined selection and hyperparameter optimization of classification algorithms. In 19th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD'13), 2013.
BIOINFORMATICS APPLICATIONS NOTE. Vol. 20 no. 15 2004, pages 2479-2481. doi:10.1093/bioinformatics/bth261

Data mining in bioinformatics using Weka
Eibe Frank 1,*, Mark Hall 1, Len Trigg 2, Geoffrey Holmes 1 and Ian H. Witten 1
1 Department of Computer Science, University of Waikato, Private Bag 3105, Hamilton, New Zealand and 2 Reel Two, PO Box 1538, Hamilton, New Zealand
*To whom correspondence should be addressed.
Received on December 3, 2003; revised on February 3, 2004; accepted on February 26, 2004. Advance Access publication April 8, 2004.

ABSTRACT
Summary: The Weka machine learning workbench provides a general-purpose environment for automatic classification, regression, clustering and feature selection, which are common data mining problems in bioinformatics research. It contains an extensive collection of machine learning algorithms and data pre-processing methods complemented by graphical user interfaces for data exploration and the experimental comparison of different machine learning techniques on the same problem. Weka can process data given in the form of a single relational table. Its main objectives are to (a) assist users in extracting useful information from data and (b) enable them to easily identify a suitable algorithm for generating an accurate predictive model from it.
Availability: /ml/weka
Contact: eibe@

INTRODUCTION
Bioinformatics research entails many problems that can be cast as machine learning tasks. In classification or regression, the task is to predict the outcome associated with a particular individual given a feature vector describing that individual; in clustering, individuals are grouped together because they share certain properties; and in feature selection, the task is to select those features that are important in predicting the outcome for an individual. The Weka data mining suite provides algorithms for all three problem types. In the bioinformatics arena, it has been used for automated protein annotation (Kretschmann et al., 2001; Bazzan et al., 2002), probe selection for gene-expression arrays (Tobler et al., 2002), experiments with automatic cancer diagnosis (Li et al., 2003a), developing a computational model for frame-shifting sites (Bekaert et al., 2003), plant genotype discrimination (Taylor et al., 2002), classifying gene expression profiles (Li and Wong, 2002) and extracting rules from them (Li et al., 2003b). Many of the algorithms in Weka are described in Witten and Frank (2000).
Real datasets vary: no single algorithm is superior on all data mining problems. The algorithm needs to match the structure of the problem to obtain useful information or an accurate model. The aim in developing Weka was to permit a maximum of flexibility when trying machine learning methods on new datasets. This includes algorithms for learning different types of models (e.g. decision trees, rule sets, linear discriminants), feature selection schemes (fast filtering as well as wrapper approaches) and pre-processing methods (e.g. discretization, arbitrary mathematical transformations and combinations of attributes). By providing a diverse set of methods that are available through a common interface, Weka makes it easy to compare different solution strategies based on the same evaluation method and identify the one that is most appropriate for the problem at hand. It is implemented in Java and runs on almost any computing platform.

THE WEKA EXPLORER
The main interface in Weka is the Explorer, shown in Figure 1. It has a set of panels, each of which can be used to perform a certain task. The Preprocess panel, selected in Figure 1, retrieves data from a file, an SQL database or a URL.
(A limitation is that all the data are kept in main memory, so subsampling may be needed for very large datasets.) Then the data can be pre-processed using one of Weka's filtering tools. For example, one can delete all instances (i.e. rows) in the data for which a certain attribute (i.e. column) has a particular value. An undo facility is provided to revert to an earlier state of the data if needed. The Preprocess panel also shows a histogram of the attribute that is currently selected and some statistics about it; histograms for all attributes can be shown simultaneously in a separate window.

Fig. 1. The Weka Explorer.

Once a dataset has been loaded (and perhaps processed by one or more filters), one of the other panels in the Explorer can be used to perform further analysis. If the data entail a classification or regression problem, it can be processed in the Classify panel. This provides an interface to learning algorithms for classification and regression models (both are called 'classifiers' in Weka), and evaluation tools for analyzing the outcome of the learning process. Weka has implementations of all major learning techniques for classification and regression: decision trees, rule sets, Bayesian classifiers, support vector machines, logistic and linear regression, multi-layer perceptrons and nearest-neighbor methods. It also contains 'meta-learners' such as bagging, boosting, stacking and schemes that perform automatic parameter tuning using cross-validation, cost-sensitive classification, and so on. Learning algorithms can be evaluated using cross-validation or a hold-out set, and Weka provides standard numeric performance measures (e.g. accuracy, root mean squared error), as well as graphical means for visualizing classifier performance (e.g. receiver operating characteristic curves and precision-recall curves). It is possible to visualize the predictions of a classification or regression model, enabling the identification of outliers, and to load and save models that have been generated.
The third panel in the Explorer, Cluster, gives access to Weka's clustering algorithms. These include k-means, mixtures of normal distributions with diagonal covariance matrices estimated using EM, and a heuristic incremental hierarchical clustering scheme. Cluster assignments can be visualized and compared with actual clusters defined by one of the attributes in the data.
Weka also contains algorithms for generating association rules that can be used to identify relationships between groups of attributes in the data. These are available from the Explorer's Associate panel. However, more interesting in the context of bioinformatics is the fifth panel, which offers methods for identifying those subsets of attributes that are predictive of another (target) attribute in the data. Weka contains several methods for searching through the space of attribute subsets, as well as evaluation measures for attributes and attribute subsets. Search methods include best-first search, forward selection, genetic algorithms and a simple ranking of attributes. Evaluation measures include correlation- and entropy-based criteria as well as the performance of a selected learning scheme (e.g. a decision tree learner) for a particular subset of attributes. Different search and evaluation methods can be combined, making the system very flexible.
The last panel in the Explorer, Visualization, shows a matrix of scatter plots for all pairs of attributes in the data. Any matrix element can be selected and enlarged in a separate window, where one can zoom in on subsets of the data and retrieve information about individual data points.
A 'jitter' option for exposing obscured data points is also provided.

OTHER INTERFACES TO WEKA
All the learning techniques in Weka can be accessed from the command line, as part of shell scripts, or from within other Java programs using the Weka API. Weka also contains an alternative graphical user interface, called 'Knowledge Flow', which can be used instead of the Explorer. It caters for a more process-oriented view of data mining, where individual learning components (represented by Java beans) can be connected graphically to create a 'flow' of information. Finally, there is a third graphical user interface, the 'Experimenter', which is designed for experiments that compare the performance of (multiple) learning schemes on (multiple) datasets. Experiments can be distributed across multiple computers running remote experiment servers.

ACKNOWLEDGEMENTS
Many people have contributed to the Weka project, in particular Richard Kirkby, Ashraf Kibriya and Bernhard Pfahringer, and we thank them all for their invaluable efforts. We would also like to thank Yu Wang for suggesting that we write this note, and the New Zealand Foundation for Research, Science & Technology for funding the project.

REFERENCES
Bazzan, A.L., Engel, P.M., Schroeder, L.F. and Da Silva, S.C. (2002) Automated annotation of keywords for proteins related to mycoplasmataceae using machine learning techniques. Bioinformatics, 18 (Suppl. 2), 35S-43S.
Bekaert, M., Bidou, L., Denise, A., Duchateau-Nguyen, G., Forest, J.P., Froidevaux, C., Hatin, I., Rousset, J.P. and Termier, M. (2003) Towards a computational model for -1 eukaryotic frameshifting sites. Bioinformatics, 19, 327-335.
Kretschmann, E., Fleischmann, W. and Apweiler, R. (2001) Automatic rule generation for protein annotation with the C4.5 data mining algorithm applied on SWISS-PROT. Bioinformatics, 17, 920-926.
Li, J. and Wong, L. (2002) Identifying good diagnostic gene groups from gene expression profiles using the concept of emerging patterns. Bioinformatics, 18, 725-734.
Li, J., Liu, H., Ng, S.K. and Wong, L. (2003a) Discovery of significant rules for classifying cancer diagnosis data. Bioinformatics, 19 (Suppl. 2), II93-II102.
Li, J., Liu, H., Downing, J.R., Yeoh, A.E. and Wong, L. (2003b) Simple rules underlying gene expression profiles of more than six subtypes of acute lymphoblastic leukemia (ALL) patients. Bioinformatics, 19, 71-78.
Taylor, J., King, R.D., Altmann, T. and Fiehn, O. (2002) Application of metabolomics to plant genotype discrimination using statistics and machine learning. Bioinformatics, 18 (Suppl. 2), 241S-248S.
Tobler, J.B., Molla, M.N., Nuwaysir, E.F., Green, R.D. and Shavlik, J.W. (2002) Evaluating machine learning approaches for aiding probe selection for gene-expression arrays. Bioinformatics, 18 (Suppl. 1), 164S-171S.
Witten, I.H. and Frank, E. (2000) Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco, CA.