Weka [31] crossValidation Source Code Analysis
Weka [24] Apriori Source Code Analysis. Author: Koala++ / Qu Wei. I once sold an Apriori program that was about 50% "correct" (the results were correct, of course; it was just that the implementation differed a lot). For a data mining course I wrote another Apriori, partly too lazy to follow the algorithm in the book, and got it about 80% "right" (again, the results were still correct). I remember Qiu Qiang once wanted to use Apriori and said Weka's was too slow; luckily the one I had implemented for the data mining course was reasonably fast. One thing to note is that association rules do not belong to machine learning, but I do not want to split out a separate data mining group here.
Start from the buildAssociations function:

double[] confidences, supports;
int[] indices;
FastVector[] sortedRuleSet;
int necSupport = 0;

instances = new Instances(instances);
if (m_removeMissingCols) {
    instances = removeMissingColumns(instances);
}

Take a look at removeMissingColumns, unimportant as it is:

protected Instances removeMissingColumns(Instances instances)
        throws Exception {
    int numInstances = instances.numInstances();
    StringBuffer deleteString = new StringBuffer();
    int removeCount = 0;
    boolean first = true;
    int maxCount = 0;

    for (int i = 0; i < instances.numAttributes(); i++) {
        AttributeStats as = instances.attributeStats(i);
        if (m_upperBoundMinSupport == 1.0 && maxCount != numInstances) {
            // see if we can decrease this by looking for the most frequent value
            int[] counts = as.nominalCounts;
            if (counts[Utils.maxIndex(counts)] > maxCount) {
                maxCount = counts[Utils.maxIndex(counts)];
            }
        }
        if (as.missingCount == numInstances) {
            if (first) {
                deleteString.append((i + 1));
                first = false;
            } else {
                deleteString.append("," + (i + 1));
            }
            removeCount++;
        }
    }
    if (m_verbose) {
        System.err.println("Removed : " + removeCount
            + " columns with all missing " + "values.");
    }
    if (m_upperBoundMinSupport == 1.0 && maxCount != numInstances) {
        m_upperBoundMinSupport = (double) maxCount / (double) numInstances;
        if (m_verbose) {
            System.err.println("Setting upper bound min support to : "
                + m_upperBoundMinSupport);
        }
    }
    if (deleteString.toString().length() > 0) {
        Remove af = new Remove();
        af.setAttributeIndices(deleteString.toString());
        af.setInvertSelection(false);
        af.setInputFormat(instances);
        Instances newInst = Filter.useFilter(instances, af);
        return newInst;
    }
    return instances;
}

The first if inside the for loop is not important; ignore it.
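For orientation, here is a minimal usage sketch showing where buildAssociations sits from a caller's point of view; the file name and the parameter values are placeholders I chose for illustration, not anything prescribed by the article.

```java
import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AprioriDemo {
    public static void main(String[] args) throws Exception {
        // Load a nominal dataset; the file name here is only a placeholder.
        Instances data = DataSource.read("vote.arff");

        Apriori apriori = new Apriori();
        apriori.setLowerBoundMinSupport(0.5); // minimum support
        apriori.setMinMetric(0.9);            // minimum confidence
        apriori.setNumRules(10);              // number of rules to report

        // buildAssociations is the entry point analysed above.
        apriori.buildAssociations(data);
        System.out.println(apriori);
    }
}
```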
Cross validation for optimal model selection. In practice, people often build different statistical models with different methods, such as an ordinary Cox regression, a Cox regression built with the Lasso, or a robust Cox regression; or they build different models from different variable sets, say model one with three factors and model two with four. When finally choosing among (evaluating) these models, or tuning their parameters, traditional statistics typically relies on criteria such as AIC, BIC, the goodness of fit -2logL, or minimum prediction error. Recent papers, however, usually mention a method called cross validation, or the idea of splitting the original data by sample size into two parts, two thirds for model building and one third for validation (clinicians call this internal validation), or of using multi-center data, building the model on one center's data and validating it on another center's data (external validation). What are all of these? Below is a brief summary of what I have read recently in papers and books, for reference only.
1. The concept of cross validation. Cross validation, sometimes called rotation estimation, is a practical statistical technique for splitting a data sample into smaller subsets.
Modelling and analysis are first done on one subset, while the remaining subsets are used afterwards to evaluate and validate that analysis.
The initial subset is called the training set (Train set).
The other subsets are called the validation set (Validation set) or the test set (Test set).
Cross validation assesses how well a statistical analysis or machine learning algorithm generalizes to data independent of the training data. For example, in the figure from the paper referred to below, the original data set contains 449 observations, split into a training set (Primary Cohort) of 367 and a validation set (Validation Cohort) of 82.
2. The principle and types of cross validation. Suppose the original data can be used to build n statistical models, and the set of these n models is M = {M1, M2, ..., Mn}; for example, if we want to do regression, then simple linear regression, logistic regression, random forests, neural networks, and so on are all contained in M.
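Since the rest of this collection is about Weka, here is a hedged sketch of choosing among the candidate set M = {M1, ..., Mn} by cross-validated error; the Weka classifiers stand in for the candidate models, and the file name, seed, and classifier choices are illustrative assumptions, not the clinical models discussed above.

```java
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.Logistic;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ModelSelectionByCV {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("vote.arff"); // placeholder file name
        data.setClassIndex(data.numAttributes() - 1);

        // The candidate set M = {M1, M2, ..., Mn}
        Classifier[] candidates = { new J48(), new NaiveBayes(), new Logistic() };

        Classifier best = null;
        double bestError = Double.MAX_VALUE;
        for (Classifier c : candidates) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(c, data, 10, new Random(1)); // 10-fold CV
            double err = eval.errorRate();
            System.out.println(c.getClass().getSimpleName() + " CV error = " + err);
            if (err < bestError) {
                bestError = err;
                best = c;
            }
        }
        // Retrain the chosen model on all of the data before using it.
        best.buildClassifier(data);
    }
}
```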
Weka Development [21] - IBk (KNN) Source Code Analysis. If you have not read the previous article on IB1, please read it first, because I will not repeat the material covered there.
Go straight to buildClassifier; only the code that did not already appear in IB1 is listed here:

try {
    m_NumClasses = instances.numClasses();
    m_ClassType = instances.classAttribute().type();
} catch (Exception ex) {
    throw new Error("This should never be reached");
}

// Throw away initial instances until within the specified window size
if ((m_WindowSize > 0) && (instances.numInstances() > m_WindowSize)) {
    m_Train = new Instances(m_Train,
        m_Train.numInstances() - m_WindowSize, m_WindowSize);
}

// Compute the number of attributes that contribute
// to each prediction
m_NumAttributesUsed = 0.0;
for (int i = 0; i < m_Train.numAttributes(); i++) {
    if ((i != m_Train.classIndex())
        && (m_Train.attribute(i).isNominal()
            || m_Train.attribute(i).isNumeric())) {
        m_NumAttributesUsed += 1.0;
    }
}

// Invalidate any currently cross-validation selected k
m_kNNValid = false;

IB1 does not bother with m_NumClasses because it looks for only one neighbour, so there is only ever a single value.
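As a usage-level complement to the source walk-through, here is a minimal sketch of building and evaluating IBk. setKNN gives the upper bound on k and setCrossValidate asks IBk to pick k by hold-one-out cross validation, which relates to the m_kNNValid flag invalidated above. The data file, fold count, and seed are placeholder assumptions.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class IBkDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff"); // placeholder file name
        data.setClassIndex(data.numAttributes() - 1);

        IBk knn = new IBk();
        knn.setKNN(5);              // upper bound for k
        knn.setCrossValidate(true); // let IBk pick k <= 5 by hold-one-out CV

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(knn, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```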
Weka Development [47] - Stacking Source Code Analysis. I copied an explanation from the web; it is not an authoritative paper, and I copied it only because it is clear and simple.
Stacked generalization (or stacking) (Wolpert 1992) is a different way of combining multiple models that introduces the concept of a meta learner. Although an attractive idea, it is less widely used than bagging and boosting. Unlike bagging and boosting, stacking may be (and normally is) used to combine models of different types. The procedure is as follows:
1. Split the training set into two disjoint sets.
2. Train several base learners on the first part.
3. Test the base learners on the second part.
4. Using the predictions from 3) as the inputs, and the correct responses as the outputs, train a higher level learner.
Note that steps 1) to 3) are the same as cross-validation, but instead of using a winner-takes-all approach, the base learners are combined, possibly non-linearly.

The first part of buildClassifier, covered many times before, is skipped; the important part is just the following lines:

// Create meta level
generateMetaLevel(newData, random);

// Rebuild all the base classifiers on the full training data
for (int i = 0; i < m_Classifiers.length; i++) {
    getClassifier(i).buildClassifier(newData);
}

The for loop below the meta-level step simply retrains all base classifiers on the full training set, so the essential work is done in the generateMetaLevel function above.
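To connect the quoted procedure to Weka's own implementation, here is a hedged usage sketch of weka.classifiers.meta.Stacking; the base learners, meta learner, data file, fold count, and seed are illustrative assumptions rather than anything fixed by the article.

```java
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.Logistic;
import weka.classifiers.lazy.IBk;
import weka.classifiers.meta.Stacking;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class StackingDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("vote.arff"); // placeholder file name
        data.setClassIndex(data.numAttributes() - 1);

        Stacking stacking = new Stacking();
        // Base learners of different types, as the quoted description suggests.
        stacking.setClassifiers(new Classifier[] { new J48(), new NaiveBayes(), new IBk() });
        // The higher-level (meta) learner trained on the base learners' predictions.
        stacking.setMetaClassifier(new Logistic());
        stacking.setNumFolds(10); // folds used internally when building the meta level

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(stacking, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```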
A Weka-Based Data Classification Experiment Report
1. Basic content of the experiment
The basic content of this experiment is to use three common classification and clustering methods in Weka (the J48 decision tree, KNN, and k-means) to train models on the training data, test and evaluate each model on the validation data, find the optimal parameter values for each model, compare the three models comprehensively, and obtain the best classification model together with the optimal values of all its settings.
Finally, these parameters, together with the training and validation data, are used to build an optimal classifier, which is then used to predict the test data.
2. Data preparation and preprocessing
2.1 Format conversion
(1) Open "data02.xls" and save it as CSV, obtaining "data02.csv".
(2) WEKA provides an "Arff Viewer" module; open "data02.csv" with it for browsing, then save it as an ARFF file, obtaining "data02.arff".
3. Experiment procedure and result screenshots
3.1 Decision tree classification
(1) Open the data file "data02.arff" with the "Explorer" and switch to the "Classify" tab.
Click "Choose", select the algorithm "trees > J48", then under "Test options" select "Cross-validation (Folds = 10)" and click "Start" to run.
With the system default minNumObj = 2 for the trees > J48 decision tree algorithm, the following result is obtained:
=== Summary ===
Correctly Classified Instances    23    88.4615 %
Incorrectly Classified Instances   3    11.5385 %
Kappa statistic                    0.7636
Mean absolute error                0.141
Root mean squared error            0.3255
Relative absolute error           30.7368 %
Root relative squared error       68.0307 %
Total Number of Instances         26
=== Detailed Accuracy By Class ===
               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.824    0        1          0.824   0.903      0.892     N
               1        0.176    0.75       1       0.857      0.892     Y
Weighted Avg.  0.885    0.061    0.913      0.885   0.887      0.892
=== Confusion Matrix ===
  a  b   <-- classified as
 14  3 |  a = N
  0  9 |  b = Y
Accuracy with different parameter values:
minNumObj                          2            3            4            5
Correctly Classified Instances     23           22           23           23
                                  (88.4615 %)  (84.6154 %)  (88.4615 %)  (88.4615 %)
From the table above, the accuracy is highest when minNumObj is 2.
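To reproduce this parameter sweep outside the Explorer, here is a minimal, hedged sketch that varies minNumObj for J48 and reports 10-fold cross-validation accuracy; the file name data02.arff comes from the report's setup, while the class index and random seed are my assumptions.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48MinNumObjSweep {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data02.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Try the same minNumObj values as the table above.
        for (int minNumObj : new int[] { 2, 3, 4, 5 }) {
            J48 tree = new J48();
            tree.setMinNumObj(minNumObj);

            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(tree, data, 10, new Random(1)); // Folds = 10
            System.out.printf("minNumObj=%d: %.4f%% correct%n",
                minNumObj, eval.pctCorrect());
        }
    }
}
```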
Cross validation
1. The purpose of cross validation
In machine learning research, a supervised algorithm requires the original data set to be split into two sets, a training set and a test set.
The data in the training set carry labels; they are used to train a model that tells the machine which kinds of data belong to which class, and the model is then used to predict the labels of the data in the test set.
Comparing the predicted labels with the true labels gives the model's prediction accuracy; what is really being examined is the model's generalization ability, that is, whether the rules learned from the training set carry over to other data.
So how should the training set and test set be split? Two issues must be considered:
1. The training set must contain enough data, generally more than half of the original data set; otherwise the learned rules are too narrow.
2. Both sets must be uniform samples of the original data. Otherwise, if for example the training set contains only class-1 data and the test set only class-2 data, the model learns the characteristics of class 1 and is then asked to recognize class 2, which is very hard.
Hence, cross validation.
2. Types of cross validation
Simple cross validation, K-fold cross validation, and leave-one-out cross validation.
2.1 Simple cross validation
Randomly split the original data into two groups, one as the training set and one as the validation set. Train the classifier on the training set, then validate the model on the validation set, and record the final classification accuracy as the classifier's performance measure.
Advantage: it is simple; the original data only have to be split randomly into two groups. Disadvantage: it does not really embody the idea of "crossing"; because the split is random, the final validation accuracy depends heavily on how the data happen to be divided, so the result is not very convincing.
2.2 K-fold cross validation
Split the original data into K groups (usually of equal size). Each subset in turn serves once as the validation set while the remaining K-1 subsets serve as the training set, yielding K models; the average of these K models' validation accuracies is used as the performance measure of the classifier under K-fold CV.
K is at least 2; in practice one usually starts from 3, and 2 is tried only when the original data set is very small.
This is the most widely used scheme; K-fold CV effectively avoids both overfitting and underfitting, and the result is fairly convincing.
E.g., ten-fold cross validation: split the data set into ten parts and, in turn, use nine of them as training data and one as test data for the experiment.
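As an illustration of the ten-fold procedure just described, here is a minimal sketch that performs the fold splitting by hand with Weka's trainCV/testCV; the data file, the J48 learner, and the random seed are illustrative assumptions.

```java
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TenFoldByHand {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff"); // placeholder file name
        data.setClassIndex(data.numAttributes() - 1);

        int folds = 10;
        Instances randomized = new Instances(data);
        randomized.randomize(new Random(1));
        randomized.stratify(folds); // keep class proportions in each fold

        Evaluation eval = new Evaluation(randomized);
        for (int i = 0; i < folds; i++) {
            Instances train = randomized.trainCV(folds, i); // the other 9 folds
            Instances test = randomized.testCV(folds, i);   // the held-out fold

            Classifier tree = new J48();
            tree.buildClassifier(train);
            eval.evaluateModel(tree, test);
        }
        // Performance accumulated over the 10 held-out folds.
        System.out.println(eval.toSummaryString());
    }
}
```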
WEKA Data Analysis Experiment
1. Experiment overview
Using Weka 3.6, test the data sample with three classification methods (Naive Bayes, decision tree, and random tree) and two clustering methods (DBScan and k-means).
2. Data sample
To become familiar with the common classification algorithms and with how Weka is used, this experiment uses the "Vote" sample that ships with Weka, as shown in the figure.
3. Association rule analysis
1) Steps:
a) Click the "Explorer" button to open the "Weka Explorer" window.
b) Select the "Associate" tab.
c) Click "Choose" and select the "Apriori" rule learner.
d) Click the parameter text box and set the parameters in the options dialog as shown.
e) Click "Start" on the left.
2) Output:
=== Run information ===
Scheme: weka.associations.Apriori -I -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.5 -S -1.0 -c -1
Relation: vote
Instances: 435
Attributes: 17
  handicapped-infants
  water-project-cost-sharing
  adoption-of-the-budget-resolution
  physician-fee-freeze
  el-salvador-aid
  religious-groups-in-schools
  anti-satellite-test-ban
  aid-to-nicaraguan-contras
  mx-missile
  immigration
  synfuels-corporation-cutback
  education-spending
  superfund-right-to-sue
  crime
  duty-free-exports
  export-administration-act-south-africa
  Class
=== Associator model (full training set) ===
Apriori
=======
Minimum support: 0.5 (218 instances)
Minimum metric <confidence>: 0.9
Number of cycles performed: 10
Generated sets of large itemsets:
Size of set of large itemsets L(1): 12
Large Itemsets L(1):
  handicapped-infants=n 236
  adoption-of-the-budget-resolution=y 253
  physician-fee-freeze=n 247
  religious-groups-in-schools=y 272
  anti-satellite-test-ban=y 239
  aid-to-nicaraguan-contras=y 242
  synfuels-corporation-cutback=n 264
  education-spending=n 233
  crime=y 248
  duty-free-exports=n 233
  export-administration-act-south-africa=y 269
  Class=democrat 267
Size of set of large itemsets L(2): 4
Large Itemsets L(2):
  adoption-of-the-budget-resolution=y physician-fee-freeze=n 219
  adoption-of-the-budget-resolution=y Class=democrat 231
  physician-fee-freeze=n Class=democrat 245
  aid-to-nicaraguan-contras=y Class=democrat 218
Size of set of large itemsets L(3): 1
Large Itemsets L(3):
  adoption-of-the-budget-resolution=y physician-fee-freeze=n Class=democrat 219
Best rules found:
1. adoption-of-the-budget-resolution=y physician-fee-freeze=n 219 ==> Class=democrat 219 conf:(1)
2. physician-fee-freeze=n 247 ==> Class=democrat 245 conf:(0.99)
3. adoption-of-the-budget-resolution=y Class=democrat 231 ==> physician-fee-freeze=n 219 conf:(0.95)
4. Class=democrat 267 ==> physician-fee-freeze=n 245 conf:(0.92)
5. adoption-of-the-budget-resolution=y 253 ==> Class=democrat 231 conf:(0.91)
6.
aid-to-nicaraguan-contras=y 242 ==> Class=democrat 218 conf:(0.9)3)结果分析:a)该样本数据,数据记录数435个,17个属性,进行了10轮测试b)最小支持度为0.5,即至少需要218个实例;c)最小置信度为0.9;d)进行了10轮搜索,频繁1项集12个,频繁2项集4个,频繁3项集1个;4.分类算法-随机树分析1)操作步骤:a)点击“Explorer”按钮,弹出“Weka Explorer”控制界面b)选择“Classify ”选项卡;c)点击“Choose”按钮,选择“trees” “RandomTree”规则d)设置Cross-validation 为10次e)点击左侧“Start”按钮2)执行结果:=== Run information ===Scheme:weka.classifiers.trees.RandomTree -K 0 -M 1.0 -S 1Relation: voteInstances:435Attributes:17handicapped-infantswater-project-cost-sharingadoption-of-the-budget-resolutionphysician-fee-freezeel-salvador-aidreligious-groups-in-schoolsanti-satellite-test-banaid-to-nicaraguan-contrasmx-missileimmigrationsynfuels-corporation-cutbackeducation-spendingsuperfund-right-to-suecrimeduty-free-exportsexport-administration-act-south-africaClassTest mode:10-fold cross-validation=== Classifier model (full training set) ===RandomTree==========el-salvador-aid = n| physician-fee-freeze = n| | duty-free-exports = n| | | anti-satellite-test-ban = n| | | | synfuels-corporation-cutback = n| | | | | crime = n : republican (0.96/0)| | | | | crime = y| | | | | | handicapped-infants = n : democrat (2.02/0.01) | | | | | | handicapped-infants = y : democrat (0.05/0)| | | | synfuels-corporation-cutback = y| | | | | handicapped-infants = n : democrat (0.79/0.01)| | | | | handicapped-infants = y : democrat (2.12/0)| | | anti-satellite-test-ban = y| | | | adoption-of-the-budget-resolution = n| | | | | handicapped-infants = n : democrat (1.26/0.01)| | | | | handicapped-infants = y : republican (1.25/0.25)| | | | adoption-of-the-budget-resolution = y| | | | | handicapped-infants = n| | | | | | crime = n : democrat (5.94/0.01)| | | | | | crime = y : democrat (5.15/0.12)| | | | | handicapped-infants = y : democrat (36.99/0.09)| | duty-free-exports = y| | | crime = n : democrat (124.23/0.29)| | | crime = y| | | | handicapped-infants = n : democrat (16.9/0.38)| | | | handicapped-infants = y : democrat (8.99/0.02)| physician-fee-freeze = y| | immigration = n| | | education-spending = n| | | | crime = n : democrat (1.09/0)| | | | crime = y : democrat (1.01/0.01)| | | education-spending = y : republican (1.06/0.02)| | immigration = y| | | synfuels-corporation-cutback = n| | | | religious-groups-in-schools = n : republican (3.02/0.01)| | | | religious-groups-in-schools = y : republican (1.54/0.04)| | | synfuels-corporation-cutback = y : republican (1.06/0.05)el-salvador-aid = y| synfuels-corporation-cutback = n| | physician-fee-freeze = n| | | handicapped-infants = n| | | | superfund-right-to-sue = n| | | | | crime = n : democrat (1.36/0)| | | | | crime = y| | | | | | mx-missile = n : republican (1.01/0)| | | | | | mx-missile = y : democrat (1.01/0.01)| | | | superfund-right-to-sue = y : democrat (4.83/0.03)| | | handicapped-infants = y : democrat (8.42/0.02)| | physician-fee-freeze = y| | | adoption-of-the-budget-resolution = n| | | | export-administration-act-south-africa = n| | | | | mx-missile = n : republican (49.03/0)| | | | | mx-missile = y : democrat (0.11/0)| | | | export-administration-act-south-africa = y| | | | | duty-free-exports = n| | | | | | mx-missile = n : republican (60.67/0)| | | | | | mx-missile = y : republican (6.21/0.15)| | | | | duty-free-exports = y| | | | | | aid-to-nicaraguan-contras = n| | | | | | | water-project-cost-sharing = n| | | | | | | | mx-missile = n : republican (3.12/0)| | | | | | | | mx-missile = y : democrat (0.01/0)| | | | | | | water-project-cost-sharing = y : democrat (1.15/0.14) | | | | | | aid-to-nicaraguan-contras = y : 
republican (0.16/0)| | | adoption-of-the-budget-resolution = y| | | | anti-satellite-test-ban = n| | | | | immigration = n : democrat (2.01/0.01)| | | | | immigration = y| | | | | | water-project-cost-sharing = n| | | | | | | mx-missile = n : republican (1.63/0)| | | | | | | mx-missile = y : republican (1.01/0.01)| | | | | | water-project-cost-sharing = y| | | | | | | superfund-right-to-sue = n : republican (0.45/0)| | | | | | | superfund-right-to-sue = y : republican (1.71/0.64) | | | | anti-satellite-test-ban = y| | | | | mx-missile = n : republican (7.74/0)| | | | | mx-missile = y : republican (4.05/0.03)| synfuels-corporation-cutback = y| | adoption-of-the-budget-resolution = n| | | superfund-right-to-sue = n| | | | anti-satellite-test-ban = n| | | | | physician-fee-freeze = n : democrat (1.39/0.01)| | | | | physician-fee-freeze = y| | | | | | water-project-cost-sharing = n : republican (1.01/0)| | | | | | water-project-cost-sharing = y : democrat (1.05/0.05)| | | | anti-satellite-test-ban = y : democrat (1.13/0.01)| | | superfund-right-to-sue = y| | | | education-spending = n| | | | | physician-fee-freeze = n| | | | | | crime = n : democrat (0.09/0)| | | | | | crime = y| | | | | | | handicapped-infants = n : democrat (1.01/0.01)| | | | | | | handicapped-infants = y : democrat (1/0)| | | | | physician-fee-freeze = y| | | | | | immigration = n| | | | | | | export-administration-act-south-africa = n : democrat(0.34/0.11)| | | | | | | export-administration-act-south-africa = y| | | | | | | | crime = n : democrat (0.16/0)| | | | | | | | crime = y| | | | | | | | | mx-missile = n| | | | | | | | | | handicapped-infants = n : republican (0.29/0) | | | | | | | | | | handicapped-infants = y : republican (1.88/0.87) | | | | | | | | | mx-missile = y : democrat (0.01/0)| | | | | | immigration = y : republican (1.01/0)| | | | education-spending = y| | | | | physician-fee-freeze = n| | | | | | handicapped-infants = n : democrat (1.51/0.01)| | | | | | handicapped-infants = y : democrat (2.01/0)| | | | | physician-fee-freeze = y| | | | | | crime = n : republican (1.02/0)| | | | | | crime = y| | | | | | | export-administration-act-south-africa = n| | | | | | | | handicapped-infants = n| | | | | | | | | immigration = n| | | | | | | | | | mx-missile = n| | | | | | | | | | | water-project-cost-sharing = n : democrat (1.01/0.01)| | | | | | | | | | | water-project-cost-sharing = y : republican (1.81/0)| | | | | | | | | | mx-missile = y : democrat (0.01/0)| | | | | | | | | immigration = y| | | | | | | | | | mx-missile = n : republican (2.78/0)| | | | | | | | | | mx-missile = y : democrat (0.01/0)| | | | | | | | handicapped-infants = y| | | | | | | | | mx-missile = n : republican (2/0)| | | | | | | | | mx-missile = y : democrat (0.4/0)| | | | | | | export-administration-act-south-africa = y| | | | | | | | mx-missile = n : republican (8.77/0)| | | | | | | | mx-missile = y : democrat (0.02/0)| | adoption-of-the-budget-resolution = y| | | anti-satellite-test-ban = n| | | | handicapped-infants = n| | | | | crime = n : democrat (2.52/0.01)| | | | | crime = y : democrat (7.65/0.07)| | | | handicapped-infants = y : democrat (10.83/0.02)| | | anti-satellite-test-ban = y| | | | physician-fee-freeze = n| | | | | handicapped-infants = n| | | | | | crime = n : democrat (2.42/0.01)| | | | | | crime = y : democrat (2.28/0.03)| | | | | handicapped-infants = y : democrat (4.17/0.01)| | | | physician-fee-freeze = y| | | | | mx-missile = n : republican (2.3/0)| | | | | mx-missile = y : democrat (0.01/0)Size of the tree : 
143
Time taken to build model: 0.01 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances    407    93.5632 %
Incorrectly Classified Instances   28     6.4368 %
Kappa statistic                     0.8636
Mean absolute error                 0.0699
Root mean squared error             0.2379
Relative absolute error            14.7341 %
Root relative squared error        48.8605 %
Total Number of Instances         435
=== Detailed Accuracy By Class ===
               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.955    0.095    0.941      0.955   0.948      0.966     democrat
               0.905    0.045    0.927      0.905   0.916      0.967     republican
Weighted Avg.  0.936    0.076    0.936      0.936   0.935      0.966
=== Confusion Matrix ===
   a   b   <-- classified as
 255  12 |   a = democrat
  16 152 |   b = republican
3) Result analysis:
a) The sample has 435 records and 17 attributes; 10-fold cross-validation was performed.
b) The random tree has 143 nodes.
c) 407 instances were classified correctly, an accuracy of 93.5632 %.
d) 28 instances were classified incorrectly, an error rate of 6.4368 %.
e) The accuracy on the test data is fairly good.
5. Classification algorithm - J48 decision tree analysis
1) Steps:
a) Click the "Explorer" button to open the "Weka Explorer" window.
b) Select the "Classify" tab.
c) Click "Choose" and select "trees" > "J48".
d) Set Cross-validation to 10 folds.
e) Click "Start" on the left.
2) Output:
=== Run information ===
Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2
Relation: vote
Instances: 435
Attributes: 17
  handicapped-infants
  water-project-cost-sharing
  adoption-of-the-budget-resolution
  physician-fee-freeze
  el-salvador-aid
  religious-groups-in-schools
  anti-satellite-test-ban
  aid-to-nicaraguan-contras
  mx-missile
  immigration
  synfuels-corporation-cutback
  education-spending
  superfund-right-to-sue
  crime
  duty-free-exports
  export-administration-act-south-africa
  Class
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
J48 pruned tree
------------------
physician-fee-freeze = n: democrat (253.41/3.75)
physician-fee-freeze = y
|   synfuels-corporation-cutback = n: republican (145.71/4.0)
|   synfuels-corporation-cutback = y
|   |   mx-missile = n
|   |   |   adoption-of-the-budget-resolution = n: republican (22.61/3.32)
|   |   |   adoption-of-the-budget-resolution = y
|   |   |   |   anti-satellite-test-ban = n: democrat (5.04/0.02)
|   |   |   |   anti-satellite-test-ban = y: republican (2.21)
|   |   mx-missile = y: democrat (6.03/1.03)
Number of Leaves : 6
Size of the tree : 11
Time taken to build model: 0.06 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances    419    96.3218 %
Incorrectly Classified Instances   16     3.6782 %
Kappa statistic                     0.9224
Mean absolute error                 0.0611
Root mean squared error             0.1748
Relative absolute error            12.887  %
Root relative squared error        35.9085 %
Total Number of Instances         435
=== Detailed Accuracy By Class ===
               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.97     0.048    0.97       0.97    0.97       0.971     democrat
               0.952    0.03     0.952      0.952   0.952      0.971     republican
Weighted Avg.  0.963    0.041    0.963      0.963   0.963      0.971
=== Confusion Matrix ===
   a   b   <-- classified as
 259   8 |   a = democrat
   8 160 |   b = republican
3) Result analysis:
a) The sample has 435 records and 17 attributes; 10-fold cross-validation was performed.
b) The decision tree has 6 leaves and 11 nodes.
c) 419 instances were classified correctly, an accuracy of 96.3218 %.
d) 16 instances were classified incorrectly, an error rate of 3.6782 %.
e) The result is close to that of the random tree, and the accuracy is relatively high.
6. Classification algorithm - Naive Bayes analysis
1) Steps:
a) Click the "Explorer" button to open the "Weka Explorer" window.
b) Select the "Classify" tab.
c) Click "Choose" and select "bayes" > "NaiveBayes".
d) Set Cross-validation to 10 folds.
e) Click "Start" on the left.
2) Output:
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances    392    90.1149 %
Incorrectly Classified Instances   43     9.8851 %
Kappa statistic                     0.7949
Mean absolute error                 0.0995
Root mean squared error             0.2977
Relative absolute error            20.9815 %
Root relative squared error        61.1406 %
Total Number of Instances         435
=== Detailed Accuracy By Class ===
               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.891    0.083    0.944      0.891   0.917      0.973     democrat
               0.917    0.109    0.842      0.917   0.877      0.973     republican
Weighted Avg.  0.901    0.093    0.905      0.901   0.902      0.973
=== Confusion Matrix ===
   a   b   <-- classified as
 238  29 |   a = democrat
  14 154 |   b = republican
3) Result analysis:
a) The sample has 435 records and 17 attributes; 10-fold cross-validation was performed.
b) 392 instances were classified correctly, an accuracy of 90.1149 %.
c) 43 instances were classified incorrectly, an error rate of 9.8851 %.
d) The test accuracy is fairly high.
7. Comparison of RandomTree, J48 decision tree, and Naive Bayes results:

                               RandomTree             J48 decision tree      Naive Bayes
Accuracy                       93.5632 %              96.3218 %              90.1149 %
Confusion matrix               255  12 | democrat     259   8 | democrat     238  29 | democrat
                                16 152 | republican     8 160 | republican    14 154 | republican
Root relative squared error    48.8605 %              35.9085 %              61.1406 %

Based on the comparison above, the three classification algorithms achieve similar accuracy on the Vote sample data.
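For completeness, the comparison in the table above could be reproduced programmatically; the following is a hedged sketch using the same vote data, the same three learners, and 10-fold cross-validation, with the random seed chosen arbitrarily.

```java
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.trees.J48;
import weka.classifiers.trees.RandomTree;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class VoteClassifierComparison {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("vote.arff");
        data.setClassIndex(data.numAttributes() - 1);

        Classifier[] models = { new RandomTree(), new J48(), new NaiveBayes() };
        for (Classifier model : models) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(model, data, 10, new Random(1));
            System.out.println(model.getClass().getSimpleName());
            System.out.printf("  accuracy: %.4f%%%n", eval.pctCorrect());
            System.out.printf("  root relative squared error: %.4f%%%n",
                eval.rootRelativeSquaredError());
            System.out.println(eval.toMatrixString("  confusion matrix:"));
        }
    }
}
```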
Weka [31] crossValidation Source Code Analysis. Author: Koala++ / Qu Wei. People on the Weka study forum have downloaded papers for me, and the group owner hopes I will help build up the group forum. The first thing I saw there was the piece on cross validation, which I felt was a bit too brief, so here I simply go a little deeper.
I will mainly follow Ng's presentation.
When studying learning algorithms, the goal is always to minimize the empirical error, using methods such as gradient descent, Newton's method, Lagrange multipliers, or coordinate ascent, all of which have appeared in my blog.
Suppose we obtained a model by the following steps:
1. Train several models on the data set.
2. Choose the model with the smallest training error.
Below is a figure illustrating the problem (its original purpose was actually unrelated to this); see chapter 1 of Pattern Recognition and Machine Learning.
The rough meaning of the figure is that once the model is complex enough (complex enough to represent the concept to be learned), making the model still more complex keeps lowering the training error while the test error goes up.
The figure also carries a deeper message: you can use it to choose a suitable model, rather than simply adding samples whenever the test error is high.
An alternative method is hold-out cross validation, also called simple cross validation.
1. Split the data set into two parts, a training set (say 70% of the data) and a test set (say 30%).
The test set is also called the hold-out cross validation set.
2. Train several models on the training set.
3. Evaluate each model on the test set to obtain a classification error, and choose the model with the smallest classification error.
The test set is usually between 1/4 and 1/3 of the data set.
30% is a typical choice.
The reason for doing this: if the only goal were to minimize the empirical error, the model finally chosen would be an overfitted one.
A drawback of this method is that it wastes 30% of the data; even if we retrain the model we finally consider reasonable on all of the data, we still cannot guarantee that this model is the best one.
If training samples are plentiful this does not matter much, but if they are scarce, different models also require different numbers of training samples (I am not sure how best to phrase this; see learning theory, which derives the minimum number of samples a model needs).
To put it more plainly: when samples are insufficient, or we do not know whether they are sufficient, the fact that model A beats model B on 70% of the data does not mean that model A would beat model B on 100% of the data.
The next method is k-fold cross validation. I will not write pseudocode for it; we will look at the real code shortly. A typical choice is k = 10. Compared with the method above it holds out only 1/k of the data, but we also have to train k times, k times more than before (actually not exactly, since even with a single train/test split you rarely test only once).
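Before the source itself, here is a minimal sketch of how Weka's k-fold cross validation is usually invoked through Evaluation.crossValidateModel, the public entry point whose implementation this article goes on to analyse; the data file and the J48 classifier are placeholder choices.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidateModelDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("vote.arff"); // placeholder file name
        data.setClassIndex(data.numAttributes() - 1);

        // k = 10: the typical choice mentioned above.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```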
There is also the point made in Pattern Recognition and Machine Learning: "A further problem with techniques such as cross-validation that use separate data to assess performance is that we might have multiple complexity parameters for a single model (for instance, there might be several regularization parameters). Exploring combinations of settings for such parameters could, in the worst case, require a number of training runs that is exponential in the number of parameters." When the data are really scarce, a special form of cross validation is used: leave-one-out cross validation, where each training run leaves out just one sample, which is then used for testing.
When listening to Ng's course, a student asked a very interesting question: in learning theory (which is fairly important; Ng regards it as one of the criteria for whether you have really entered the field), one can derive, under certain conditions, the number of samples a model needs in order to be learned, so why do we still need model-selection methods like these? Below is Ng's answer, followed by my rough translation:
"It turns out that when you're proving learning theory bounds, very often the bounds will be extremely loose because you're sort of proving the worse case upper bound that holds true even for very bad - what is it - so the bounds that I proved just now; right? That holds true for absolutely any probability distribution over training examples; right? So just assume the training examples we've drawn, iid from some distribution script d, and the bounds I proved hold true for absolutely any probability distribution over script d. And chances are whatever real life distribution you get over, you know, houses and their prices or whatever, is probably not as bad as the very worse one you could've gotten; okay? And so it turns out that if you actually plug in the constants of learning theory bounds, you often get extremely large numbers."
When you prove learning theory bounds, the bounds are usually extremely loose, because you are proving a worst-case upper bound, one that holds even in very bad situations, that is, no matter what probability distribution the training samples follow.
Real-world data are probably not as bad as the worst case.
When you plug the constants into a learning theory bound (just as in time/space analysis of algorithms, where all constants are ignored), you usually get a very large number.
"Take logistic regression - logistic regression you have ten parameters and 0.01 error, and with 95 percent probability. How many training examples do I need? If you actually plug in actual constants into the text for learning theory bounds, you often get extremely pessimistic estimates with the number of examples you need. You end up with some ridiculously large numbers. You would need 10,000 training examples to fit ten parameters. So a good way to think of these learning theory bounds is - and this is why, also, when I write papers on learning theory bounds, I quite often use big-O notation to just absolutely just ignore the constant factors because the bounds seem to be very loose."
Take logistic regression as an example: you have ten parameters and want an error below 0.01 with 95 percent probability.
How many samples do I need? If you plug the constants into the bound, you get an extremely pessimistic estimate, a ridiculously large number, for example 10,000 training examples just to fit ten parameters.
So a good way to read these learning bounds is to ignore the constants, because the bounds are very loose.
"There are some attempts to use these bounds to give guidelines as to what model to choose, and so on. But I personally tend to use the bounds - again, intuition about - for example, what are the number of training examples you need grows linearly in the number of parameters or what are your grows x dimension in number of parameters; whether it goes quadratic - parameters? So it's quite often the shape of the bounds. The fact that the number of training examples - the fact that some complexity is linear in the VC dimension, that's sort of a useful intuition you can get from these theories. But the actual magnitude of the bound will tend to be much looser than will hold true for a particular problem you are working on."
There are attempts to use these bounds as guidance on which model to choose, but I personally tend to use them for intuition, again, for example whether the number of samples you need grows linearly or quadratically with the number of parameters.