Weka[36] InfoGainAttributeEval Source Code Analysis
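Author: Koala++ / 屈伟

I need feature selection for some recent work, but what I need differs a little from the stock implementation, so I decided to read the source. The later part of this article covers how Weka computes entropy, which is not quite the same as the textbook formula. I explained this once before to people at the Weka Chinese site; this time I spent a weekend typing out the formulas so they are easier to follow.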
Start with buildEvaluator:

    int classIndex = data.classIndex();
    int numInstances = data.numInstances();

    if (!m_Binarize) {
        Discretize disTransform = new Discretize();
        disTransform.setUseBetterEncoding(true);
        disTransform.setInputFormat(data);
        data = Filter.useFilter(data, disTransform);
    } else {
        NumericToBinary binTransform = new NumericToBinary();
        binTransform.setInputFormat(data);
        data = Filter.useFilter(data, binTransform);
    }

This decides whether numeric attributes are discretized into multiple values (Discretize) or into two values (NumericToBinary).
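    int numClasses = data.attribute(classIndex).numValues();

    // Reserve space and initialize counters
    double[][][] counts = new double[data.numAttributes()][][];
    for (int k = 0; k < data.numAttributes(); k++) {
        if (k != classIndex) {
            int numValues = data.attribute(k).numValues();
            counts[k] = new double[numValues + 1][numClasses + 1];
        }
    }

    // Initialize counters
    double[] temp = new double[numClasses + 1];
    for (int k = 0; k < numInstances; k++) {
        Instance inst = data.instance(k);
        if (inst.classIsMissing()) {
            temp[numClasses] += inst.weight();
        } else {
            temp[(int) inst.classValue()] += inst.weight();
        }
    }
    for (int k = 0; k < counts.length; k++) {
        if (k != classIndex) {
            for (int i = 0; i < temp.length; i++) {
                counts[k][0][i] = temp[i];
            }
        }
    }

The first dimension of counts is the number of attributes, the second is the number of attribute values + 1, and the third is the number of classes + 1. The extra row and the extra column hold the weight of instances whose attribute value or class is missing.

For every attribute, row 0 (attribute value 0) is initialized with the summed instance weights per class, and its last column (index numClasses) holds the summed weight of all instances whose class is missing. In other words, every instance is first assumed to have value 0 for every attribute; the counting loop that follows moves weight out of row 0 whenever an instance stores an explicit value, which is what makes the scheme work for sparse instances.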
    // Get counts
    for (int k = 0; k < numInstances; k++) {
        Instance inst = data.instance(k);

        for (int i = 0; i < inst.numValues(); i++) {
            if (inst.index(i) != classIndex) {
                if (inst.isMissingSparse(i) || inst.classIsMissing()) {
                    if (!inst.isMissingSparse(i)) {
                        // Attribute value present, class missing
                        counts[inst.index(i)][(int) inst.valueSparse(i)][numClasses] += inst.weight();
                        counts[inst.index(i)][0][numClasses] -= inst.weight();
                    } else if (!inst.classIsMissing()) {
                        // Attribute value missing, class present
                        counts[inst.index(i)][data.attribute(inst.index(i)).numValues()][(int) inst.classValue()] += inst.weight();
                        counts[inst.index(i)][0][(int) inst.classValue()] -= inst.weight();
                    } else {
                        // Both attribute value and class missing
                        counts[inst.index(i)][data.attribute(inst.index(i)).numValues()][numClasses] += inst.weight();
                        counts[inst.index(i)][0][numClasses] -= inst.weight();
                    }
                } else {
                    // Both present: the normal case
                    counts[inst.index(i)][(int) inst.valueSparse(i)][(int) inst.classValue()] += inst.weight();
                    counts[inst.index(i)][0][(int) inst.classValue()] -= inst.weight();
                }
            }
        }
    }

The core is the first statement of the final else branch: it adds the instance's weight to the cell indexed by this attribute, this attribute value, and this class value.
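To make the counting step concrete, here is a minimal sketch in plain Java that builds the same kind of value-by-class table for a single nominal attribute. It is not Weka code: the class name CountsSketch, the method contingency, and the toy data are all hypothetical, and it deliberately ignores missing values and sparse instances, so there is no extra row or column and no subtraction from row 0.

```java
import java.util.Arrays;

public class CountsSketch {

    /**
     * Builds a [numValues][numClasses] table of summed instance weights
     * for one nominal attribute. Hypothetical helper, not part of Weka;
     * assumes no missing values.
     */
    public static double[][] contingency(int[] attrValues, int[] classValues,
                                         double[] weights,
                                         int numValues, int numClasses) {
        double[][] table = new double[numValues][numClasses];
        for (int k = 0; k < attrValues.length; k++) {
            // Same idea as the final else branch: add the instance weight
            // to the cell for (attribute value, class value).
            table[attrValues[k]][classValues[k]] += weights[k];
        }
        return table;
    }

    public static void main(String[] args) {
        int[] attr = {0, 0, 1, 2, 2};   // attribute value index per instance
        int[] cls  = {1, 1, 0, 0, 1};   // class value index per instance
        double[] w = {1, 1, 1, 1, 1};   // instance weights
        System.out.println(Arrays.deepToString(contingency(attr, cls, w, 3, 2)));
    }
}
```

Rows index attribute values and columns index class values, which is the orientation the entropy functions further down rely on.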
The if (m_missing_merge) block redistributes the counts for missing values over the corresponding cells; I did not read it in detail. Next:

    // Compute info gains
    m_InfoGains = new double[data.numAttributes()];
    for (int i = 0; i < data.numAttributes(); i++) {
        if (i != classIndex) {
            m_InfoGains[i] = (ContingencyTables.entropyOverColumns(counts[i])
                - ContingencyTables.entropyConditionedOnRows(counts[i]));
        }
    }

Two functions matter here, entropyOverColumns and entropyConditionedOnRows:

    public static double entropyOverColumns(double[][] matrix) {
        double returnValue = 0, sumForColumn, total = 0;

        for (int j = 0; j < matrix[0].length; j++) {
            sumForColumn = 0;
            for (int i = 0; i < matrix.length; i++) {
                sumForColumn += matrix[i][j];
            }
            returnValue = returnValue - lnFunc(sumForColumn);
            total += sumForColumn;
        }
        if (Utils.eq(total, 0)) {
            return 0;
        }
        return (returnValue + lnFunc(total)) / (total * log2);
    }

One thing to note, and the name already says it: entropyOverColumns computes the entropy of the column totals. Because the columns of counts[i] index class values, this is the entropy of the class distribution, i.e. the first term of the InfoGain formula.
    public static double entropyConditionedOnRows(double[][] matrix) {
        double returnValue = 0, sumForRow, total = 0;

        for (int i = 0; i < matrix.length; i++) {
            sumForRow = 0;
            for (int j = 0; j < matrix[0].length; j++) {
                returnValue = returnValue + lnFunc(matrix[i][j]);
                sumForRow += matrix[i][j];
            }
            returnValue = returnValue - lnFunc(sumForRow);
            total += sumForRow;
        }
        if (Utils.eq(total, 0)) {
            return 0;
        }
        return -returnValue / (total * log2);
    }

This is the second term of the InfoGain formula: the entropy of the class conditioned on the attribute value (the rows).
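The two functions can be tried out on a small table without Weka on the classpath. The sketch below re-implements them in plain Java under two assumptions that are not shown in the excerpts above: that lnFunc(x) is x * ln(x) with 0 * ln(0) taken as 0, and that log2 is the constant ln 2. The class name InfoGainSketch and the example table are made up for illustration.

```java
public class InfoGainSketch {

    private static final double LOG2 = Math.log(2);

    // Assumed behaviour of lnFunc: x * ln(x), with 0 * ln(0) defined as 0.
    static double lnFunc(double x) {
        return x <= 0 ? 0 : x * Math.log(x);
    }

    /** Entropy (in bits) of the column totals, i.e. of the class distribution. */
    public static double entropyOverColumns(double[][] m) {
        double acc = 0, total = 0;
        for (int j = 0; j < m[0].length; j++) {
            double colSum = 0;
            for (int i = 0; i < m.length; i++) {
                colSum += m[i][j];
            }
            acc -= lnFunc(colSum);
            total += colSum;
        }
        return total == 0 ? 0 : (acc + lnFunc(total)) / (total * LOG2);
    }

    /** Entropy (in bits) of the columns given the row, weighted by row totals. */
    public static double entropyConditionedOnRows(double[][] m) {
        double acc = 0, total = 0;
        for (int i = 0; i < m.length; i++) {
            double rowSum = 0;
            for (int j = 0; j < m[0].length; j++) {
                acc += lnFunc(m[i][j]);
                rowSum += m[i][j];
            }
            acc -= lnFunc(rowSum);
            total += rowSum;
        }
        return total == 0 ? 0 : -acc / (total * LOG2);
    }

    /** Information gain = class entropy minus conditional entropy. */
    public static double infoGain(double[][] m) {
        return entropyOverColumns(m) - entropyConditionedOnRows(m);
    }

    public static void main(String[] args) {
        // Rows: three attribute values; columns: two classes.
        double[][] table = {{2, 3}, {4, 0}, {3, 2}};
        System.out.printf("H(C)   = %.4f%n", entropyOverColumns(table));
        System.out.printf("H(C|A) = %.4f%n", entropyConditionedOnRows(table));
        System.out.printf("gain   = %.4f%n", infoGain(table));
    }
}
```

For this table the class totals are 9 and 5, so the first term is about 0.940 bits, the conditional term about 0.694 bits, and the gain about 0.247 bits. Working with x * ln(x) on raw counts and dividing by the total once at the end avoids computing a probability for every cell.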
To explain how the entropy is computed, take ID3 as an example first; the computation Weka uses is slightly different from the textbook one:

    private double computeEntropy(Instances data) throws Exception {
        double[] classCounts = new double[data.numClasses()];
        Enumeration instEnum = data.enumerateInstances();
        while (instEnum.hasMoreElements()) {
            Instance inst = (Instance) instEnum.nextElement();
            classCounts[(int) inst.classValue()]++;
        }
        double entropy = 0;
        for (int j = 0; j < data.numClasses(); j++) {
            if (classCounts[j] > 0) {
                entropy -= classCounts[j] * Utils.log2(classCounts[j]);
            }
        }
        entropy /= (double) data.numInstances();
        return entropy + Utils.log2(data.numInstances());
    }

The classCounts array needs no explanation: it holds the number of instances in each class.
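The counts-based form used by computeEntropy can be checked numerically against the textbook definition. The following plain-Java sketch does that; the class EntropyCheck and both method names are hypothetical, and the counts are arbitrary example values.

```java
public class EntropyCheck {

    static double log2(double x) {
        return Math.log(x) / Math.log(2);
    }

    /** Textbook form: -sum p_i * log2(p_i), with p_i = c_i / N. */
    public static double textbook(double[] counts) {
        double n = 0;
        for (double c : counts) {
            n += c;
        }
        double h = 0;
        for (double c : counts) {
            if (c > 0) {
                double p = c / n;
                h -= p * log2(p);
            }
        }
        return h;
    }

    /** Counts form, following computeEntropy: -(1/N) * sum c_i * log2(c_i) + log2(N). */
    public static double countsForm(double[] counts) {
        double n = 0;
        for (double c : counts) {
            n += c;
        }
        double e = 0;
        for (double c : counts) {
            if (c > 0) {
                e -= c * log2(c);
            }
        }
        return e / n + log2(n);
    }

    public static void main(String[] args) {
        double[] counts = {9, 5};
        System.out.println(textbook(counts));
        System.out.println(countsForm(counts));
    }
}
```

Both methods return the same value up to floating-point rounding, which is the point of the derivation that follows.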
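Let N be the number of instances, M the number of classes, and C_i the number of instances in class i, for i = 1, …, M (C_i is the classCounts entry for class i). Written as a formula, the for loop in the middle computes

$$\text{entropy} = -\sum_{i=1}^{M} C_i \log_2 C_i, \qquad N = C_1 + \dots + C_M .$$

The $\log_2 N$ term added in the return statement can then be viewed as

$$\log_2 N = \frac{N}{N}\log_2 N = \frac{C_1}{N}\log_2 N + \dots + \frac{C_M}{N}\log_2 N = P(C_1)\log_2 N + \dots + P(C_M)\log_2 N .$$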
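The text breaks off at this point. Presumably the derivation goes on to combine the two pieces; the following is a sketch of that step, using the same symbols and writing $P(C_i) = C_i / N$. It is a reconstruction, not the author's own wording.

```latex
\begin{aligned}
H &= -\frac{1}{N}\sum_{i=1}^{M} C_i \log_2 C_i + \log_2 N \\
  &= -\sum_{i=1}^{M} P(C_i)\log_2 C_i + \sum_{i=1}^{M} P(C_i)\log_2 N \\
  &= -\sum_{i=1}^{M} P(C_i)\log_2 \frac{C_i}{N} \\
  &= -\sum_{i=1}^{M} P(C_i)\log_2 P(C_i)
\end{aligned}
```

So the value returned by computeEntropy equals the usual Shannon entropy of the class distribution; the code simply defers the division by N until after the sum over raw counts.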