AdaBoost (complete edition)
MATLAB implementation of the AdaBoost algorithm:

```matlab
clear all
clc

% Training and test sets (ten 2-D samples each) and their labels (1 or 2)
tr_set = [1,5; 2,3; 3,2; 4,6; 4,7; 5,9; 6,5; 6,7; 8,5; 8,8];
te_set = [1,5; 2,3; 3,2; 4,6; 4,7; 5,9; 6,5; 6,7; 8,5; 8,8];
tr_labels = [2,2,1,1,2,2,1,2,1,1];
te_labels = [2,2,1,1,2,2,1,2,1,1];

tr_n = size(tr_set, 1);      % the population of the training set
te_n = size(te_set, 1);      % the population of the test set
weak_learner_n = 20;         % the population of the weak learners

% Plot the two data sets
figure;
subplot(2,2,1); hold on; axis square;
indices = tr_labels == 1;
plot(tr_set(indices,1), tr_set(indices,2), 'b*');
indices = ~indices;
plot(tr_set(indices,1), tr_set(indices,2), 'r*');
title('Training set');

subplot(2,2,2); hold on; axis square;
indices = te_labels == 1;
plot(te_set(indices,1), te_set(indices,2), 'b*');
indices = ~indices;
plot(te_set(indices,1), te_set(indices,2), 'r*');
title('Test set');

% Training and testing error rates as weak learners are added
tr_error = zeros(1, weak_learner_n);
te_error = zeros(1, weak_learner_n);
for i = 1:weak_learner_n
    adaboost_model = adaboost_tr(@threshold_tr, @threshold_te, tr_set, tr_labels, i);
    [L_tr, hits_tr] = adaboost_te(adaboost_model, @threshold_te, tr_set, tr_labels);
    tr_error(i) = (tr_n - hits_tr) / tr_n;
    [L_te, hits_te] = adaboost_te(adaboost_model, @threshold_te, te_set, te_labels);
    te_error(i) = (te_n - hits_te) / te_n;
end

subplot(2,2,3);
plot(1:weak_learner_n, tr_error);
axis([1, weak_learner_n, 0, 1]);
title('Training Error');
xlabel('weak classifier number'); ylabel('error rate');
grid on;

subplot(2,2,4); axis square;
plot(1:weak_learner_n, te_error);
axis([1, weak_learner_n, 0, 1]);
title('Testing Error');
xlabel('weak classifier number'); ylabel('error rate');
grid on;
```

Two additional functions have to be written separately: one is the training function that builds the AdaBoost model, and the other is the test function that evaluates the test samples.
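The original does not include those two functions, nor the weak-learner functions `threshold_tr` and `threshold_te` that the script passes in as handles. Below is a minimal sketch of what they might look like, assuming labels in {1, 2} as in the script and a single-feature threshold (decision stump) as the weak learner; the function bodies and the weak-learner interface are assumptions for illustration, not the original code.

```matlab
function model = adaboost_tr(tr_func, te_func, tr_set, tr_labels, n_rounds)
% Train an AdaBoost ensemble of n_rounds weak learners.
% tr_func trains one weak learner on weighted data; te_func applies it.
m = size(tr_set, 1);
y = (tr_labels(:) == 1) * 2 - 1;               % map labels {1,2} to {+1,-1}
w = ones(m, 1) / m;                            % uniform initial weights
model = struct('learner', {}, 'alpha', {});
for t = 1:n_rounds
    learner = tr_func(tr_set, y, w);           % weak learner on weighted data
    pred = te_func(learner, tr_set);           % its {+1,-1} predictions
    err = max(sum(w .* (pred ~= y)), eps);     % weighted training error
    if err >= 0.5, break; end                  % no better than chance: stop
    alpha = 0.5 * log((1 - err) / err);        % vote weight of this learner
    model(t).learner = learner;
    model(t).alpha = alpha;
    w = w .* exp(-alpha * y .* pred);          % raise the weights of mistakes
    w = w / sum(w);                            % renormalize to a distribution
end
end

function [L, hits] = adaboost_te(model, te_func, te_set, te_labels)
% Apply the ensemble; return predicted labels and the number of hits.
score = zeros(size(te_set, 1), 1);
for t = 1:numel(model)
    score = score + model(t).alpha * te_func(model(t).learner, te_set);
end
L = 2 - (sign(score) == 1);                    % weighted vote, back to {1,2}
hits = sum(L(:) == te_labels(:));
end

function learner = threshold_tr(tr_set, y, w)
% Weighted decision stump: best single-feature threshold and polarity.
best_err = inf;
for d = 1:size(tr_set, 2)
    for thr = unique(tr_set(:, d))'
        for p = [1, -1]
            pred = p * sign(tr_set(:, d) - thr + eps);
            err = sum(w .* (pred ~= y));
            if err < best_err
                best_err = err;
                learner = struct('dim', d, 'thr', thr, 'polarity', p);
            end
        end
    end
end
end

function pred = threshold_te(learner, data)
% Apply a decision stump; output predictions in {+1,-1}.
pred = learner.polarity * sign(data(:, learner.dim) - learner.thr + eps);
end
```

In MATLAB each of these functions would go in its own .m file named after the function.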
Reference material on AdaBoost and bagging. Introduction: before introducing the AdaBoost algorithm, it helps to understand a similar algorithm, bagging. Bagging is an algorithm that improves classification accuracy by combining the votes of a set of classifiers to obtain the best solution.
For example, suppose you are ill and visit n doctors at n hospitals, and each doctor writes you a prescription. In the end, the prescription that appears most often is the one most likely to be the best solution. This is easy to understand.
The bagging algorithm is exactly this idea.
Algorithm idea: the core idea of AdaBoost is still based on bagging, but with a small improvement. In the example above every doctor's vote counts the same, so all doctors have equal standing. If we instead add weights, giving doctors in large cities a higher weight and doctors in small towns a lower weight, and then compute a weighted sum of the votes, the result is more reasonable. That is the AdaBoost algorithm.
AdaBoost is an iterative algorithm; it stops only when the final classification error rate falls below a threshold. It trains different classifiers, called weak classifiers, on the same training set, and finally combines them as a weighted sum into a combined classifier, which is a strong classifier.
The main steps of the algorithm are: 1. Train a classifier Ci on the training set D. 2. Classify the data with Ci and compute the error rate at this step. 3. Increase the weights of the samples misclassified in the previous step and decrease the weights of the correctly classified samples, so that the misclassified data stand out.
Why do this? The reason is explained later.
The complete AdaBoost algorithm is given below. The sign at the end is the signum function: if the final value is positive, the sample is classified as +1; otherwise it is classified as -1.
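The figure with the complete algorithm from the original slides is not reproduced here; the update rules it refers to, in the standard binary AdaBoost formulation (consistent with the chapter excerpt later in this document), are:

$$\epsilon_t = \sum_{i:\,h_t(x_i)\neq y_i} D_t(i), \qquad \alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t},$$

$$D_{t+1}(i) = \frac{D_t(i)\exp\left(-\alpha_t y_i h_t(x_i)\right)}{Z_t}, \qquad H(x) = \mathrm{sign}\!\left(\sum_{t=1}^{T}\alpha_t h_t(x)\right),$$

where $D_t(i)$ is the weight of sample $i$ in round $t$ and $Z_t$ is a normalization factor that keeps $D_{t+1}$ a probability distribution.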
Let us work through an example of the above process to understand it better.
The AdaBoost training process: in the figure, "+" and "-" denote the two classes, and during this process we use horizontal or vertical lines as classifiers to perform the classification.
Step 1: based on the classification accuracy, we obtain a new sample distribution D2 and a first weak classifier h1. The circled samples are the ones that were misclassified.
In the figure on the right, the larger "+" marks indicate samples whose weights have been increased.
At the very beginning the algorithm is given a uniform distribution D.
So when h1 is trained, each point has a weight of 0.1.
After this split, three points are misclassified. According to the error expression of the algorithm, the error is the sum of the weights of the three misclassified points, so ε1 = 0.1 + 0.1 + 0.1 = 0.3, and α1, computed from its expression, comes out to about 0.42. The algorithm then increases the weights of the misclassified points.
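A few lines of MATLAB reproduce these numbers and the resulting re-weighting (a sketch assuming ten samples with uniform initial weight 0.1, of which three are misclassified; which three is arbitrary here):

```matlab
D1 = ones(1, 10) / 10;                        % uniform initial weights, 0.1 each
miss = [false(1, 7), true(1, 3)];             % three misclassified points (illustrative choice)
eps1 = sum(D1(miss))                          % 0.3
alpha1 = 0.5 * log((1 - eps1) / eps1)         % ~0.4236, i.e., about 0.42
D2 = D1 .* exp(alpha1 * (2*miss - 1));        % raise misclassified, lower correct
D2 = D2 / sum(D2)                             % misclassified ~0.167 each, correct ~0.071 each
```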
Top Ten Classic Algorithms in Data Mining: the AdaBoost Algorithm (with detailed code). Introduction: AdaBoost is a boosting method that combines multiple weak classifiers into a strong classifier.
AdaBoost is short for "Adaptive Boosting"; it was proposed by Yoav Freund and Robert Schapire in 1995.
Its adaptivity lies in the fact that samples misclassified by the previous weak classifier have their weights (the weights attached to the samples) increased, and the re-weighted samples are then used to train the next weak classifier.
In each round, a new weak classifier is trained on the whole sample set, producing new sample weights and the vote weight of that weak classifier; the iteration continues until the predefined error rate is achieved or the specified maximum number of iterations is reached.
The relationship among population, sample, and individual needs to be kept clear: the population is N; the samples are {n_i} for i from 1 to M; an individual is a component of a sample, for example n_1 = (1, 2) is a sample containing two individuals.
Algorithm principle: (1) Initialize the weight distribution of the training data (of each sample): if there are N samples, each training sample is initially assigned the same weight, 1/N.
(2) Train weak classifiers.
Specifically, if a sample has already been classified correctly, its weight is decreased when constructing the next training set; conversely, if a sample has been misclassified, its weight is increased.
At the same time, the vote weight of the corresponding weak classifier is obtained.
Then the re-weighted sample set is used to train the next classifier, and the whole training process proceeds iteratively in this way.
(3) Combine the trained weak classifiers into a strong classifier, as sketched below.
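A sketch of step (3) in MATLAB (illustrative names; it assumes each weak classifier outputs +1/-1 predictions and its vote weight has already been computed):

```matlab
function pred = strong_classify(weak_preds, alpha)
% Combine T weak classifiers into the strong classifier H(x) = sign(sum_t alpha_t h_t(x)).
% weak_preds: m-by-T matrix, column t holds h_t's +1/-1 predictions on m samples
% alpha:      1-by-T vector of vote weights
pred = sign(weak_preds * alpha(:));   % weighted vote, then take the sign
end
```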
Chapter 7: AdaBoost
Zhi-Hua Zhou and Yang Yu

Contents
7.1 Introduction
7.2 The Algorithm
  7.2.1 Notations
  7.2.2 A General Boosting Procedure
  7.2.3 The AdaBoost Algorithm
7.3 Illustrative Examples
  7.3.1 Solving XOR Problem
  7.3.2 Performance on Real Data
7.4 Real Application
7.5 Advanced Topics
  7.5.1 Theoretical Issues
  7.5.2 Multiclass AdaBoost
  7.5.3 Other Advanced Topics
7.6 Software Implementations
7.7 Exercises
References

7.1 Introduction

Generalization ability, which characterizes how well the result learned from a given training data set can be applied to unseen new data, is the most central concept in machine learning. Researchers have devoted tremendous efforts to the pursuit of techniques that could lead to a learning system with strong generalization ability. One of the most successful paradigms is ensemble learning [32]. In contrast to ordinary machine learning approaches which try to generate one learner from training data, ensemble methods try to construct a set of base learners and combine them. Base learners are usually generated from training data by a base learning algorithm which can be a decision tree, a neural network, or other kinds of machine learning algorithms. Just like "many hands make light work," the generalization ability of an ensemble is usually significantly better than that of a single learner. Actually, ensemble methods are appealing mainly because they are able to boost weak learners, which are slightly better than random guess, to strong learners, which can make very accurate predictions. So, "base learners" are also referred to as "weak learners."

AdaBoost [9, 10] is one of the most influential ensemble methods. It took birth from the answer to an interesting question posed by Kearns and Valiant in 1988, that is, whether two complexity classes, weakly learnable and strongly learnable problems, are equal. If the answer to the question is positive, a weak learner that performs just slightly better than random guess can be "boosted" into an arbitrarily accurate strong learner. Obviously, such a question is of great importance to machine learning. Schapire [21] found that the answer to the question is "yes," and gave a proof by construction, which is the first boosting algorithm. An important practical deficiency of this algorithm is the requirement that the error bound of the base learners be known ahead of time, which is usually unknown in practice. Freund and Schapire [9] then proposed an adaptive boosting algorithm, named AdaBoost, which does not require that unavailable information. It is evident that AdaBoost was born with theoretical significance, which has given rise to abundant research on theoretical aspects of ensemble methods in the communities of machine learning and statistics. It is worth mentioning that for their AdaBoost paper [9], Schapire and Freund won the Gödel Prize, one of the most prestigious awards in theoretical computer science, in the year 2003.
AdaBoost and its variants have been applied to diverse domains with great success, owing to their solid theoretical foundation, accurate prediction, and great simplicity (Schapire said it needs only "just 10 lines of code"). For example, Viola and Jones [27] combined AdaBoost with a cascade process for face detection. They regarded rectangular features as weak learners, and by using AdaBoost to weight the weak learners, they got very intuitive features for face detection. In order to get high accuracy as well as high efficiency, they used a cascade process (which is beyond the scope of this chapter). As a result, they reported a very strong face detector: on a 466 MHz machine, face detection on a 384x288 image costs only 0.067 seconds, which is 15 times faster than state-of-the-art face detectors at that time but with comparable accuracy. This face detector has been recognized as one of the most exciting breakthroughs in computer vision (in particular, face detection) during the past decade. It is not strange that "boosting" has become a buzzword in computer vision and many other application areas.

In the rest of this chapter, we will introduce the algorithm and implementations, and give some illustrations on how the algorithm works. For readers who are eager to know more, we will introduce some theoretical results and extensions as advanced topics.

7.2 The Algorithm

7.2.1 Notations

We first introduce some notations that will be used in the rest of the chapter. Let $\mathcal{X}$ denote the instance space, or in other words, feature space. Let $\mathcal{Y}$ denote the set of labels that express the underlying concepts which are to be learned. For example, we let $\mathcal{Y}=\{-1,+1\}$ for binary classification. A training set $D$ consists of $m$ instances whose associated labels are observed, i.e., $D=\{(x_i, y_i)\}$ $(i\in\{1,\ldots,m\})$, while the label of a test instance is unknown and thus to be predicted. We assume both training and test instances are drawn independently and identically from an underlying distribution $\mathcal{D}$.

After training on a training data set $D$, a learning algorithm $L$ will output a hypothesis $h$, which is a mapping from $\mathcal{X}$ to $\mathcal{Y}$, also called a classifier. The learning process can be regarded as picking the best hypothesis from a hypothesis space, where the word "best" refers to a loss function. For classification, the loss function can naturally be 0/1-loss,

$$\mathrm{loss}_{0/1}(h \mid x) = \mathbb{I}[h(x) \neq y]$$

where $\mathbb{I}[\cdot]$ is the indication function which outputs 1 if the inner expression is true and 0 otherwise, which means that one error is counted if an instance is wrongly classified. In this chapter 0/1-loss is used by default, but it is noteworthy that other kinds of loss functions can also be used in boosting.

7.2.2 A General Boosting Procedure

Boosting is actually a family of algorithms, among which the AdaBoost algorithm is the most influential one. So, it may be easier to start from a general boosting procedure.

Suppose we are dealing with a binary classification problem, that is, we are trying to classify instances as positive and negative. Usually we assume that there exists an unknown target concept, which correctly assigns "positive" labels to instances belonging to the concept and "negative" labels to others. This unknown target concept is actually what we want to learn. We call this target concept ground-truth. For a binary classification problem, a classifier working by random guess will have 50% 0/1-loss.
Suppose we are unlucky and only have a weak classifier at hand, which is only slightly better than random guess on the underlying instance distribution $\mathcal{D}$, say, it has 49% 0/1-loss. Let's denote this weak classifier as $h_1$. It is obvious that $h_1$ is not what we want, and we will try to improve it. A natural idea is to correct the mistakes made by $h_1$. We can try to derive a new distribution $\mathcal{D}'$ from $\mathcal{D}$, which makes the mistakes of $h_1$ more evident, for example, it focuses more on the instances wrongly classified by $h_1$ (we will explain how to generate $\mathcal{D}'$ in the next section). We can train a classifier $h_2$ from $\mathcal{D}'$. Again, suppose we are unlucky and $h_2$ is also a weak classifier. Since $\mathcal{D}'$ was derived from $\mathcal{D}$, if $\mathcal{D}'$ satisfies some condition, $h_2$ will be able to achieve a better performance than $h_1$ on some places in $\mathcal{D}$ where $h_1$ does not work well, without sacrificing the places where $h_1$ performs well. Thus, by combining $h_1$ and $h_2$ in an appropriate way (we will explain how to combine them in the next section), the combined classifier will be able to achieve less loss than that achieved by $h_1$. By repeating the above process, we can expect to get a combined classifier which has very small (ideally, zero) 0/1-loss on $\mathcal{D}$.

Input: Instance distribution D;
       Base learning algorithm L;
       Number of learning rounds T.
Process:
1. D_1 = D.   % Initialize distribution
2. for t = 1, ..., T:
3.    h_t = L(D_t);   % Train a weak learner from distribution D_t
4.    ε_t = Pr_{x~D_t, y} I[h_t(x) ≠ y];   % Measure the error of h_t
5.    D_{t+1} = AdjustDistribution(D_t, ε_t)
6. end
Output: H(x) = CombineOutputs({h_t(x)})

Figure 7.1: A general boosting procedure.

Briefly, boosting works by training a set of classifiers sequentially and combining them for prediction, where the later classifiers focus more on the mistakes of the earlier classifiers. Figure 7.1 summarizes the general boosting procedure.

7.2.3 The AdaBoost Algorithm

Figure 7.1 is not a real algorithm since there are some undecided parts such as AdjustDistribution and CombineOutputs. The AdaBoost algorithm can be viewed as an instantiation of the general boosting procedure, which is summarized in Figure 7.2.

Input: Data set D = {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)};
       Base learning algorithm L;
       Number of learning rounds T.
Process:
1. D_1(i) = 1/m.   % Initialize the weight distribution
2. for t = 1, ..., T:
3.    h_t = L(D, D_t);   % Train a learner h_t from D using distribution D_t
4.    ε_t = Pr_{x~D_t, y} I[h_t(x) ≠ y];   % Measure the error of h_t
5.    if ε_t > 0.5 then break
6.    α_t = (1/2) ln((1 − ε_t)/ε_t);   % Determine the weight of h_t
7.    D_{t+1}(i) = (D_t(i)/Z_t) × { exp(−α_t) if h_t(x_i) = y_i; exp(α_t) if h_t(x_i) ≠ y_i }
                = D_t(i) exp(−α_t y_i h_t(x_i)) / Z_t
      % Update the distribution, where Z_t is a normalization factor
      % which enables D_{t+1} to be a distribution
8. end
Output: H(x) = sign(Σ_{t=1}^{T} α_t h_t(x))

Figure 7.2: The AdaBoost algorithm.

Now we explain the details. (Here we explain the AdaBoost algorithm from the view of [11] since it is easier to understand than the original explanation in [9].) AdaBoost generates a sequence of hypotheses and combines them with weights, which can be regarded as an additive weighted combination in the form of

$$H(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$$

From this view, AdaBoost actually solves two problems, that is, how to generate the hypotheses $h_t$'s and how to determine the proper weights $\alpha_t$'s. In order to have a highly efficient error reduction process, we try to minimize an exponential loss

$$\mathrm{loss}_{\exp}(h) = \mathbb{E}_{x\sim\mathcal{D},y}\left[e^{-y h(x)}\right]$$

where $y h(x)$ is called the classification margin of the hypothesis.

Let's consider one round in the boosting process. Suppose a set of hypotheses as well as their weights have already been obtained, and let $H$ denote the combined hypothesis. Now, one more hypothesis $h$ will be generated and is to be combined with $H$ to form $H+\alpha h$. The loss after the combination will be

$$\mathrm{loss}_{\exp}(H+\alpha h) = \mathbb{E}_{x\sim\mathcal{D},y}\left[e^{-y(H(x)+\alpha h(x))}\right]$$

The loss can be decomposed to each instance, which is called pointwise loss, as

$$\mathrm{loss}_{\exp}(H+\alpha h \mid x) = \mathbb{E}_{y}\left[e^{-y(H(x)+\alpha h(x))} \mid x\right]$$

Since $y$ and $h(x)$ must be $+1$ or $-1$, we can expand the expectation as

$$\mathrm{loss}_{\exp}(H+\alpha h \mid x) = e^{-y H(x)}\left(e^{-\alpha} P(y=h(x)\mid x) + e^{\alpha} P(y\neq h(x)\mid x)\right)$$

Suppose we have already generated $h$, and thus the weight $\alpha$ that minimizes the loss can be found when the derivative of the loss equals zero, that is,

$$\frac{\partial\, \mathrm{loss}_{\exp}(H+\alpha h \mid x)}{\partial \alpha} = e^{-y H(x)}\left(-e^{-\alpha} P(y=h(x)\mid x) + e^{\alpha} P(y\neq h(x)\mid x)\right) = 0$$

and the solution is

$$\alpha = \frac{1}{2}\ln\frac{P(y=h(x)\mid x)}{P(y\neq h(x)\mid x)} = \frac{1}{2}\ln\frac{1-P(y\neq h(x)\mid x)}{P(y\neq h(x)\mid x)}$$

By taking an expectation over $x$, that is, solving $\partial\, \mathrm{loss}_{\exp}(H+\alpha h)/\partial\alpha = 0$, and denoting $\epsilon = \mathbb{E}_{x\sim\mathcal{D}}\left[y\neq h(x)\right]$, we get

$$\alpha = \frac{1}{2}\ln\frac{1-\epsilon}{\epsilon}$$

which is the way of determining $\alpha_t$ in AdaBoost.

Now let's consider how to generate $h$. Given a base learning algorithm, AdaBoost invokes it to produce a hypothesis from a particular instance distribution. So, we only need to consider what hypothesis is desired for the next round, and then generate an instance distribution to achieve this hypothesis. We can expand the pointwise loss to second order about $h(x)=0$, when fixing $\alpha=1$,

$$\mathrm{loss}_{\exp}(H+h\mid x) \approx \mathbb{E}_{y}\left[e^{-y H(x)}\left(1 - y h(x) + y^2 h(x)^2/2\right)\mid x\right] = \mathbb{E}_{y}\left[e^{-y H(x)}\left(1 - y h(x) + 1/2\right)\mid x\right]$$

since $y^2=1$ and $h(x)^2=1$. Then a perfect hypothesis is

$$h^*(x) = \arg\min_{h}\ \mathrm{loss}_{\exp}(H+h\mid x) = \arg\max_{h}\ \mathbb{E}_{y}\left[e^{-y H(x)}\, y h(x) \mid x\right]$$

$$= \arg\max_{h}\ \left(e^{-H(x)} P(y=1\mid x)\cdot 1\cdot h(x) + e^{H(x)} P(y=-1\mid x)\cdot(-1)\cdot h(x)\right)$$

Note that $e^{-y H(x)}$ is a constant in terms of $h(x)$. By normalizing the expectation as

$$h^*(x) = \arg\max_{h}\ \frac{e^{-H(x)} P(y=1\mid x)\cdot 1\cdot h(x) + e^{H(x)} P(y=-1\mid x)\cdot(-1)\cdot h(x)}{e^{-H(x)} P(y=1\mid x) + e^{H(x)} P(y=-1\mid x)}$$

we can rewrite the expectation using a new term $w(x,y)$, which is drawn from $e^{-y H(x)} P(y\mid x)$, as

$$h^*(x) = \arg\max_{h}\ \mathbb{E}_{w(x,y)\sim e^{-y H(x)} P(y\mid x)}\left[y h(x) \mid x\right]$$

Since $h^*(x)$ must be $+1$ or $-1$, the solution to the optimization is that $h^*(x)$ holds the same sign with $y\mid x$, that is,

$$h^*(x) = \mathbb{E}_{w(x,y)\sim e^{-y H(x)} P(y\mid x)}\left[y \mid x\right] = P_{w(x,y)\sim e^{-y H(x)} P(y\mid x)}(y=1\mid x) - P_{w(x,y)\sim e^{-y H(x)} P(y\mid x)}(y=-1\mid x)$$

As can be seen, $h^*$ simply performs the optimal classification of $x$ under the distribution $e^{-y H(x)} P(y\mid x)$. Therefore, $e^{-y H(x)} P(y\mid x)$ is the desired distribution for a hypothesis minimizing 0/1-loss.

So, when the hypothesis $h(x)$ has been learned and $\alpha = \frac{1}{2}\ln\frac{1-\epsilon}{\epsilon}$ has been determined in the current round, the distribution
for the next round should be

$$D_{t+1}(x) = e^{-y(H(x)+\alpha h(x))} P(y\mid x) = e^{-y H(x)} P(y\mid x)\cdot e^{-\alpha y h(x)} = D_t(x)\cdot e^{-\alpha y h(x)}$$

which is the way of updating the instance distribution in AdaBoost.

But why does optimizing the exponential loss work for minimizing the 0/1-loss? Actually, we can see that

$$h^*(x) = \arg\min_{h}\ \mathbb{E}_{x\sim\mathcal{D},y}\left[e^{-y h(x)} \mid x\right] = \frac{1}{2}\ln\frac{P(y=1\mid x)}{P(y=-1\mid x)}$$

and therefore we have

$$\mathrm{sign}(h^*(x)) = \arg\max_{y}\ P(y\mid x)$$

which implies that the optimal solution to the exponential loss achieves the minimum Bayesian error for the classification problem. Moreover, we can see that the function $h^*$ which minimizes the exponential loss is the logistic regression model up to a factor 2. So, by ignoring the factor 1/2, AdaBoost can also be viewed as fitting an additive logistic regression model.

It is noteworthy that the data distribution is not known in practice, and the AdaBoost algorithm works on a given training set with finite training examples. Therefore, all the expectations in the above derivations are taken on the training examples, and the weights are also imposed on training examples. For base learning algorithms that cannot handle weighted training examples, a resampling mechanism, which samples training examples according to the desired weights, can be used instead.

7.3 Illustrative Examples

In this section, we demonstrate how the AdaBoost algorithm works, from an illustration on a toy problem to real data sets.

7.3.1 Solving XOR Problem

We consider an artificial data set in a two-dimensional space, plotted in Figure 7.3(a). There are only four instances, that is,

(x1 = (0, +1), y1 = +1)
(x2 = (0, −1), y2 = +1)
(x3 = (+1, 0), y3 = −1)
(x4 = (−1, 0), y4 = −1)

This is the XOR problem. The two classes cannot be separated by a linear classifier, which corresponds to a line on the figure.

Figure 7.3: AdaBoost on the XOR problem. (a) The XOR data; (b) 1st round; (c) 2nd round; (d) 3rd round.

Suppose we have a base learning algorithm which tries to select the best of the following eight functions. Note that none of them is perfect. For equally good functions, the base learning algorithm will pick one of them randomly.

h1(x) = +1 if (x1 > −0.5), −1 otherwise        h2(x) = −1 if (x1 > −0.5), +1 otherwise
h3(x) = +1 if (x1 > +0.5), −1 otherwise        h4(x) = −1 if (x1 > +0.5), +1 otherwise
h5(x) = +1 if (x2 > −0.5), −1 otherwise        h6(x) = −1 if (x2 > −0.5), +1 otherwise
h7(x) = +1 if (x2 > +0.5), −1 otherwise        h8(x) = −1 if (x2 > +0.5), +1 otherwise

where x1 and x2 are the values of x at the first and second dimension, respectively.
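The hand trace that follows can be reproduced with a few lines of MATLAB (a sketch, not the chapter's code). Ties among equally good candidate classifiers are broken by index here, so the particular classifiers picked in each round may differ from the ones named in the text, but the classifier weights (about 0.55, 0.80, and 1.10) and the final zero training error come out the same.

```matlab
% Three AdaBoost rounds on the XOR data with the eight threshold classifiers above.
X = [0 1; 0 -1; 1 0; -1 0];          % instances x1..x4
y = [1; 1; -1; -1];                  % their labels
H = { @(X)  2*(X(:,1) > -0.5) - 1, @(X) -(2*(X(:,1) > -0.5) - 1), ...
      @(X)  2*(X(:,1) >  0.5) - 1, @(X) -(2*(X(:,1) >  0.5) - 1), ...
      @(X)  2*(X(:,2) > -0.5) - 1, @(X) -(2*(X(:,2) > -0.5) - 1), ...
      @(X)  2*(X(:,2) >  0.5) - 1, @(X) -(2*(X(:,2) >  0.5) - 1) };

D = ones(4,1) / 4;                   % uniform initial weights
F = zeros(4,1);                      % combined (weighted) score
for t = 1:3
    errs = cellfun(@(h) sum(D .* (h(X) ~= y)), H);  % weighted error of each candidate
    [e, best] = min(errs);                          % base learner picks a best one
    alpha = 0.5 * log((1 - e) / e);                 % its vote weight
    pred = H{best}(X);
    F = F + alpha * pred;                           % add the weighted vote
    D = D .* exp(-alpha * y .* pred);               % re-weight the instances
    D = D / sum(D);
    fprintf('round %d: picked h%d, weighted error %.2f, alpha %.2f\n', t, best, e, alpha);
end
disp(sign(F) == y)   % all ones: the combined classifier fits the XOR data exactly
```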
Now we track how AdaBoost works:

1. The first step is to invoke the base learning algorithm on the original data. h2, h3, h5, and h8 all have 0.25 classification errors. Suppose h2 is picked as the first base learner. One instance, x1, is wrongly classified, so the error is 1/4 = 0.25. The weight of h2 is 0.5 ln 3 ≈ 0.55. Figure 7.3(b) visualizes the classification, where the shadowed area is classified as negative (−1) and the weights of the classification, 0.55 and −0.55, are displayed.

2. The weight of x1 is increased and the base learning algorithm is invoked again. This time h3, h5, and h8 have equal errors. Suppose h3 is picked, of which the weight is 0.80. Figure 7.3(c) shows the combined classification of h2 and h3 with their weights, where different gray levels are used for distinguishing negative areas according to classification weights.

3. The weight of x3 is increased, and this time only h5 and h8 equally have the lowest errors. Suppose h5 is picked, of which the weight is 1.10. Figure 7.3(d) shows the combined classification of h2, h3, and h5.

If we look at the sign of the classification weights in each area in Figure 7.3(d), all the instances are correctly classified. Thus, by combining the imperfect linear classifiers, AdaBoost has produced a nonlinear classifier which has zero error.

7.3.2 Performance on Real Data

We evaluate the AdaBoost algorithm on 56 data sets from the UCI Machine Learning Repository (/~mlearn/MLRepository.html), which covers a broad range of real-world tasks. We use the Weka (to be introduced in Section 7.6) implementation of AdaBoost.M1 using reweighting with 50 base learners. Almost all kinds of learning algorithms can be taken as base learning algorithms, such as decision trees, neural networks, and so on. Here, we have tried three base learning algorithms, including decision stump, pruned, and unpruned J4.8 decision trees (the Weka implementation of C4.5).

Figure 7.4: Comparison of predictive errors of AdaBoost against decision stump, pruned, and unpruned single decision trees on 56 UCI data sets.

We plot the comparison results in Figure 7.4, where each circle represents a data set and locates according to the predictive errors of the two compared algorithms. In each plot of Figure 7.4, the diagonal line indicates where the two compared algorithms have identical errors. It can be observed that AdaBoost often outperforms its base learning algorithm, with a few exceptions on which it degenerates the performance.

The famous bias-variance decomposition [12] has been employed to empirically study why AdaBoost achieves excellent performance [2, 3, 34]. This powerful tool breaks the expected error of a learning approach into the sum of three nonnegative quantities, that is, the intrinsic noise, the bias, and the variance. The bias measures how closely the average estimate of the learning approach is able to approximate the target, and the variance measures how much the estimate of the learning approach fluctuates for the different training sets of the same size. It has been observed [2, 3, 34] that AdaBoost primarily reduces the bias but it is also able to reduce the variance.

Figure 7.5: Four feature masks to be applied to each rectangle.

7.4 Real Application

Viola and Jones [27] combined AdaBoost with
a cascade process for face detection. As a result, they reported that on a 466 MHz machine, face detection on a 384x288 image costs only 0.067 seconds, which is almost 15 times faster than state-of-the-art face detectors at that time but with comparable accuracy. This face detector has been recognized as one of the most exciting breakthroughs in computer vision (in particular, face detection) during the past decade. In this section, we briefly introduce how AdaBoost works in the Viola-Jones face detector.

Here the task is to locate all possible human faces in a given image. An image is first divided into subimages, say 24x24 squares. Each subimage is then represented by a feature vector. To make the computational process efficient, very simple features are used. All possible rectangles in a subimage are examined. On every rectangle, four features are extracted using the masks shown in Figure 7.5. With each mask, the sum of the pixels' gray levels in the white areas is subtracted by the sum of those in the dark areas, which is regarded as a feature. Thus, by a 24x24 splitting, there are more than 1 million features, but each of the features can be calculated very fast. Each feature is regarded as a weak learner, that is,

$$h_{i,p,\theta}(x) = \mathbb{I}[\,p x_i \le p\theta\,] \qquad (p\in\{+1,-1\})$$

where $x_i$ is the value of $x$ at the $i$-th feature. The base learning algorithm tries to find the best weak classifier $h_{i^*,p^*,\theta^*}$ that minimizes the classification error, that is,

$$(i^*, p^*, \theta^*) = \arg\min_{i,p,\theta}\ \mathbb{E}_{(x,y)}\, \mathbb{I}[h_{i,p,\theta}(x)\neq y]$$

Figure 7.6: Positive training examples [27].

Face rectangles are regarded as positive examples, as shown in Figure 7.6, while rectangles that do not contain any face are regarded as negative examples. Then, the AdaBoost process is applied and it will return a few weak learners, each of which corresponds to one of the over 1 million features. Actually, the AdaBoost process can be regarded as a feature selection tool here.

Figure 7.7 shows the first two selected features and their position relative to a human face. It is evident that these two features are intuitive, where the first feature measures how the intensity of the eye areas differs from that of the lower areas, while the second feature measures how the intensity of the two eye areas differs from the area between the two eyes.

Using the selected features in order, an extremely imbalanced decision tree is built, which is called a cascade of classifiers, as illustrated in Figure 7.8. The parameter θ is adjusted in the cascade such that, at each tree node, branching into "not a face" means that the image is really not a face. In other words, the false negative rate is minimized. This design owes to the fact that a nonface image is easier to recognize, and it is possible to use a few features to filter out a lot of candidate image rectangles, which endows the high efficiency. It was reported [27] that 10 features per subimage are examined on average. Some test results of the Viola-Jones face detector are shown in Figure 7.9.

Figure 7.9: Outputs of the Viola-Jones face detector on a number of test images [27].

7.5 Advanced Topics

7.5.1 Theoretical Issues

Computational learning theory studies some fundamental theoretical issues of machine learning. First introduced by Valiant in 1984 [25], the Probably Approximately Correct (PAC) framework models learning algorithms in a distribution free manner. Roughly speaking, for binary classification, a problem is learnable or strongly learnable if there exists an algorithm that outputs a hypothesis $h$ in polynomial time
such that for all $0 < \delta, \epsilon \le 0.5$,

$$P\left(\mathbb{E}_{x\sim\mathcal{D},y}\left[\mathbb{I}[h(x)\neq y]\right] < \epsilon\right) \ge 1-\delta$$

and a problem is weakly learnable if the above holds for all $0<\delta\le 0.5$ but only when $\epsilon$ is slightly smaller than 0.5 (or in other words, $h$ is only slightly better than random guess).

In 1988, Kearns and Valiant [15] posed an interesting question, that is, whether the strongly learnable problem class equals the weakly learnable problem class. This question is of fundamental importance, since if the answer is "yes," any weak learner is potentially able to be boosted to a strong learner. In 1989, Schapire [21] proved that the answer is really "yes," and the proof he gave is a construction, which is the first boosting algorithm. One year later, Freund [7] developed a more efficient algorithm. Both algorithms, however, suffered from the practical deficiency that the error bound of the base learners needs to be known ahead of time, which is usually unknown in practice. Later, in 1995, Freund and Schapire [9] developed the AdaBoost algorithm, which is effective and efficient in practice.

Freund and Schapire [9] proved that, if the base learners of AdaBoost have errors $\epsilon_1, \epsilon_2, \cdots, \epsilon_T$, the error of the final combined learner, $\epsilon$, is upper bounded as

$$\epsilon = \mathbb{E}_{x\sim\mathcal{D},y}\,\mathbb{I}[H(x)\neq y] \le 2^T \prod_{t=1}^{T}\sqrt{\epsilon_t(1-\epsilon_t)} \le e^{-2\sum_{t=1}^{T}\gamma_t^2}$$

where $\gamma_t = 0.5-\epsilon_t$. It can be seen that AdaBoost reduces the error exponentially fast. Also, it can be derived that, to achieve an error less than $\epsilon$, the number of rounds $T$ is upper bounded as

$$T \le \left\lceil \frac{1}{2\gamma^2}\ln\frac{1}{\epsilon}\right\rceil$$

where it is assumed that $\gamma = \gamma_1 = \gamma_2 = \cdots = \gamma_T$.

In practice, however, all the operations of AdaBoost can only be carried out on the training data $D$, that is,

$$\epsilon_D = \mathbb{E}_{x\sim D,y}\,\mathbb{I}[H(x)\neq y]$$

and thus the errors are training errors, while the generalization error, that is, the error over the instance distribution $\mathcal{D}$,

$$\epsilon_{\mathcal{D}} = \mathbb{E}_{x\sim\mathcal{D},y}\,\mathbb{I}[H(x)\neq y]$$

is of more interest. The initial analysis [9] showed that the generalization error of AdaBoost is upper bounded as

$$\epsilon_{\mathcal{D}} \le \epsilon_D + \tilde{O}\left(\sqrt{\frac{dT}{m}}\right)$$

with probability at least $1-\delta$, where $d$ is the VC-dimension of the base learners, $m$ is the number of training instances, and $\tilde{O}(\cdot)$ is used instead of $O(\cdot)$ to hide logarithmic terms and constant factors.

The above bound suggests that in order to achieve a good generalization ability, it is necessary to constrain the complexity of base learners as well as the number of learning rounds; otherwise AdaBoost will overfit. However, empirical studies show that AdaBoost often does not overfit, that is, its test error often tends to decrease even after the training error reaches zero, even after a large number of rounds, such as 1,000. For example, Schapire et al. [22] plotted the performance of AdaBoost on the letter data set from the UCI Machine Learning Repository, as shown in Figure 7.10 (left), where the higher curve is the test error while the lower one is the training error. It can be observed that AdaBoost achieves zero training error in less than 10 rounds but the generalization error keeps on reducing. This phenomenon seems to counter Occam's Razor, that is, nothing more than necessary should be done, which is one of the basic principles in machine learning.

Figure 7.10: Training and test error (left) and margin distribution (right) of AdaBoost on the letter data set [22].

Many researchers have studied this phenomenon, and several theoretical explanations have been given, for example, [11]. Schapire et al. [22] introduced the margin-based explanation. They argued that AdaBoost is able to increase the margin even after the training error reaches zero, and thus it does not overfit even after a large number of rounds. The classification margin of $h$ on $x$ is defined as $yh(x)$, and that of
$H(x)=\sum_{t=1}^{T}\alpha_t h_t(x)$ is defined as

$$yH(x) = \frac{\sum_{t=1}^{T}\alpha_t\, y h_t(x)}{\sum_{t=1}^{T}\alpha_t}$$

Figure 7.10 (right) plots the distribution of $yH(x)\le\theta$ for different values of $\theta$. It was proved in [22] that the generalization error is upper bounded as

$$\epsilon_{\mathcal{D}} \le P_{x\sim D,y}\left(yH(x)\le\theta\right) + \tilde{O}\left(\sqrt{\frac{d}{m\theta^2}+\ln\frac{1}{\delta}}\right) \le 2^T\prod_{t=1}^{T}\sqrt{\epsilon_t^{1-\theta}(1-\epsilon_t)^{1+\theta}} + \tilde{O}\left(\sqrt{\frac{d}{m\theta^2}+\ln\frac{1}{\delta}}\right)$$

with probability at least $1-\delta$. This bound qualitatively explains that when the other variables in the bound are fixed, the larger the margin, the smaller the generalization error. However, this margin-based explanation was challenged by Breiman [4]. Using the minimum margin,

$$\rho = \min_{x\in D} yH(x)$$

Breiman proved a generalization error bound that is tighter than the above one using the minimum margin. Motivated by the tighter bound, the arc-gv algorithm, which is a variant of AdaBoost, was proposed to maximize the minimum margin directly, by updating $\alpha_t$ according to

$$\alpha_t = \frac{1}{2}\ln\frac{1+\gamma_t}{1-\gamma_t} - \frac{1}{2}\ln\frac{1+\rho_t}{1-\rho_t}$$

Interestingly, the minimum margin of arc-gv is uniformly better than that of AdaBoost, but the test error of arc-gv increases drastically on all tested data sets [4]. Thus, the margin theory for AdaBoost was almost sentenced to death.

In 2006, Reyzin and Schapire [20] reported an interesting finding. It is well known that the bound of the generalization error is associated with the margin, the number of rounds, and the complexity of the base learners. When comparing arc-gv with AdaBoost, Breiman [4] tried to control the complexity of base learners by using decision trees with the same number of leaves, but Reyzin and Schapire found that these are trees with very different shapes. The trees generated by arc-gv tend to have larger depth, while those generated by AdaBoost tend to have larger width. Figure 7.11 (top) shows the difference in depth of the trees generated by the two algorithms on the breast cancer data set from the UCI Machine Learning Repository. Although the trees have the same number of leaves, it seems that a deeper tree makes more attribute tests than a wider tree, and therefore they are unlikely to have equal complexity. So, Reyzin and Schapire repeated Breiman's experiments by using decision stump, which has only one leaf and therefore is of a fixed complexity, and found that the margin distribution of AdaBoost is actually better than that of arc-gv, as illustrated in Figure 7.11 (bottom).

Recently, Wang et al. [28] introduced the equilibrium margin and proved a new bound tighter than that obtained by using the minimum margin, which suggests that the minimum margin may not be crucial for the generalization error of AdaBoost. It will be interesting to develop an algorithm that maximizes the equilibrium margin directly, and to see whether the test error of such an algorithm is smaller than that of AdaBoost, which remains an open problem.

7.5.2 Multiclass AdaBoost

In the previous sections we focused on AdaBoost for binary classification, that is, $\mathcal{Y}=\{+1,-1\}$. In many classification tasks, however, an instance belongs to one of many instead of two classes. For example, a handwritten number belongs to 1 of 10 classes, that is, $\mathcal{Y}=\{0,\ldots,9\}$. There is more than one way to deal with a multiclass classification problem.

AdaBoost.M1 [9] is a very direct extension, which is the same as the algorithm shown in Figure 7.2, except that now the base learners are multiclass learners instead of binary classifiers. This algorithm could not use binary base classifiers, and requires every base learner to have less than 1/2 multiclass 0/1-loss, which is an overly strong constraint. SAMME [35] is an improvement over AdaBoost.M1, which replaces the computation of $\alpha_t$ (Line 6 of AdaBoost.M1 in Figure 7.2) by

$$\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t} + \ln(|\mathcal{Y}|-1).$$
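A small numeric sketch (my illustration, not from the chapter) of what this correction does, following the formula quoted above: with the extra $\ln(|\mathcal{Y}|-1)$ term, a multiclass weak learner only needs to beat random guessing (error below $1-1/|\mathcal{Y}|$) to receive a positive vote weight, rather than the much stricter 1/2 required by AdaBoost.M1.

```matlab
K = 10;          % number of classes, e.g., digits 0..9; random guessing errs 90% of the time
eps_t = 0.7;     % a weak learner with 70% multiclass error: better than random guessing
alpha_m1    = 0.5 * log((1 - eps_t) / eps_t);               % AdaBoost.M1 weight: negative, learner rejected
alpha_samme = 0.5 * log((1 - eps_t) / eps_t) + log(K - 1);  % SAMME weight: positive whenever eps_t < 1 - 1/K
```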
The AdaBoost Algorithm with Examples. 1. The boosting method. Boosting is a commonly used statistical learning method; it is widely applied and effective.
In classification problems, it changes the weights of the training samples, learns multiple classifiers, and combines these classifiers linearly to improve classification performance.
Boosting is based on the following idea: for a complex task, a judgment obtained by appropriately combining the judgments of several experts is better than the judgment of any single expert acting alone.
In fact, this is just the idea behind the saying "three cobblers with their wits combined surpass Zhuge Liang" (two heads are better than one).
Historically, Kearns and Valiant first proposed the concepts of "strongly learnable" and "weakly learnable".
They pointed out that, in the probably approximately correct (PAC) learning framework, a concept (a classification) is said to be strongly learnable if there exists a polynomial-time learning algorithm that can learn it with high accuracy; a concept (a classification) is said to be weakly learnable if there exists a polynomial-time learning algorithm that can learn it, but only with accuracy slightly better than random guessing.
What is very interesting is that Schapire later proved that strong learnability and weak learnability are equivalent; that is, in the PAC learning framework, a concept is strongly learnable if and only if it is weakly learnable.
The problem then becomes: if a "weak learning algorithm" has already been found, can it be boosted into a "strong learning algorithm"?
As is well known, finding a weak learning algorithm is usually much easier than finding a strong learning algorithm.
How to carry out this boosting concretely is the problem that has to be solved when developing a boosting method.
There has been a great deal of research on boosting methods and many algorithms have been proposed, the most representative of which is the AdaBoost algorithm.
For classification problems, given a training sample set, it is much easier to find rough classification rules (weak classifiers) than to find an accurate classification rule (a strong classifier).
A boosting method starts from a weak learning algorithm, learns repeatedly to obtain a series of classifiers, and then combines these classifiers to form a strong classifier.
Thus, a boosting algorithm has to answer two questions: first, how to change the weight distribution of the training data in each round; second, how to combine the weak classifiers into a strong classifier.
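As a sketch of AdaBoost's answers to these two questions (assuming labels and weak-classifier predictions in {+1, -1} and an error rate already computed; the variable names are illustrative):

```matlab
% Question 1: re-weight the training data each round.
% Misclassified samples are multiplied by exp(+alpha), correct ones by exp(-alpha).
alpha = 0.5 * log((1 - err) / err);      % vote weight of the current weak classifier
w = w .* exp(-alpha * y .* h_pred);      % y and h_pred are vectors in {+1,-1}
w = w / sum(w);                          % renormalize so the weights form a distribution

% Question 2: combine the weak classifiers by a weighted vote.
% H_score is the running weighted sum of weak predictions; the strong classifier is its sign.
H_score = H_score + alpha * h_pred;
strong_pred = sign(H_score);
```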