Pattern Recognition and Machine Learning
Probabilistic Graphical Models

For quite a while I was busy with exams, busy with lab work, and even busier over the Spring Festival, so it has been a long time since I sat down in a good frame of mind to review and summarize what I have learned. These past few days I kept wondering what I should do. Eventually I realized I should think about what I can do right now, and so I started writing this post. It will be a very basic summary of probabilistic graphical models, drawing mainly on *Pattern Recognition and Machine Learning*. I read this material mainly because related knowledge came up in what I was working on. Probabilistic graphical models deserve deep study in their own right, but my understanding is limited, so please treat this post as an introduction, and do point out any errors or inaccuracies.
0. What is this?

Many things are uncertain, and people want to extract as much certain knowledge and information as possible from uncertain things. To that end, probability theory was created to describe the uncertainty of events. Building on it, people hope to infer the unknown from what is already known, whether in the present, the past, or the future. In this process a model is usually indispensable, and deciding what kind of model is relatively correct is itself a problem we need to solve. These problems arise in many fields, including pattern recognition and error-control coding. Probabilistic graphical models are one tool for attacking them. As the name suggests, they are a class of models built using two mathematical tools at once: probability and graphs.
That naturally raises the next question.

1. Why introduce probabilistic graphical models?

Ordinary probabilistic models already handle general statistical-inference problems well, so what does a graphical model buy us? The belief-propagation algorithm for decoding LDPC codes predates factor graphs, which suggests that probabilistic graphical models are not a breakthrough from "unsolvable" to "solvable"; rather, adopting them lets us solve the same problems better.
At the start of its chapter on graphical models, *Pattern Recognition and Machine Learning* lays out the useful properties that the graph brings to a probabilistic model:

1. They provide a simple way to visualize the structure of a probabilistic model and can be used to design and motivate new models.
2. Insights into the properties of the model, including conditional independence properties, can be obtained by inspection of the graph.
3. Complex computations, required to perform inference and learning in sophisticated models, can be expressed in terms of graphical manipulations, in which underlying mathematical expressions are carried along implicitly.

In short, the graph makes the probabilistic model visual, so the relationships between variables can be read off the picture easily; at the same time, some complicated probabilistic computations can be understood as messages passed along the graph, freeing us from tracking the underlying expressions in detail.
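As a minimal illustration of the second point (my own sketch, not from the book): for the three-variable chain A → B → C, the graph asserts the factorization p(a, b, c) = p(a) p(b|a) p(c|b), which implies A ⊥ C | B. The check below builds a joint from hypothetical conditional probability tables and verifies the conditional independence numerically.

```python
import itertools

# Hypothetical CPTs for binary variables in the chain A -> B -> C.
p_a = {0: 0.6, 1: 0.4}
p_b_given_a = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}  # p_b_given_a[a][b]
p_c_given_b = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}  # p_c_given_b[b][c]

# Joint implied by the graph: p(a,b,c) = p(a) p(b|a) p(c|b).
joint = {(a, b, c): p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]
         for a, b, c in itertools.product((0, 1), repeat=3)}

def marginal(fixed):
    """Sum the joint over all entries matching the fixed coordinates (None = summed out)."""
    return sum(p for (a, b, c), p in joint.items()
               if all(f is None or f == v for f, v in zip(fixed, (a, b, c))))

# Check A ⊥ C | B: p(a,c|b) must equal p(a|b) p(c|b) for every a, b, c.
for a, b, c in itertools.product((0, 1), repeat=3):
    p_b = marginal((None, b, None))
    lhs = marginal((a, b, c)) / p_b
    rhs = (marginal((a, b, None)) / p_b) * (marginal((None, b, c)) / p_b)
    assert abs(lhs - rhs) < 1e-12
print("A is conditionally independent of C given B")
```

Reading this independence off the drawn graph takes one glance at the chain; verifying it algebraically from the joint, as above, is exactly the bookkeeping the graph spares us.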
Middle-school exam (zhongkao) English essay topic: artificial intelligence (30 common phrases + 20 sentence patterns + 8 model essays)

I. Common phrases
1. artificial intelligence (AI)
2. machine learning
3. deep learning
4. intelligent robots
5. facial recognition
6. natural language processing
7. voice recognition
8. data analysis
9. algorithm design
10. smart devices
11. human-computer interaction
12. autonomous vehicles
13. virtual reality
14. augmented reality
15. intelligent systems
16. pattern recognition
17. big data
18. neural network
19. technological innovation
20. future development
21. smart home
22. industrial automation
23. intelligent manufacturing
24. healthcare application
25. education innovation
26. digital transformation
27. ethical issue
28. privacy protection
29. security threat
30. job displacement

II. Common sentence patterns
1. With the rapid development of artificial intelligence, our lives have been greatly changed.
Artificial Intelligence Undergraduate Program of Study (Class of 2020)

Contents
I. Training Objectives
II. Graduation Requirements
III. Core Disciplines
IV. Core Courses
V. Main Practical Components
VI. Main Specialized Experiments
VII. Duration of Study
VIII. Degree Conferred
IX. Curriculum (science and engineering programs)

I. Training Objectives
Taking moral education as its foundation and serving the needs of regional economic and social development, this program cultivates well-rounded, application-oriented, interdisciplinary professionals with sound ethics, a solid theoretical grounding, strong professional and engineering-practice skills, a commitment to lifelong learning and innovation, and strong communication, teamwork, and organizational-management skills. Graduates can work in the AI field on intelligent-system development, system operations and maintenance, and product design and operations; the program is characterized by cross-disciplinary integration and innovation, in line with the industry's latest hiring standards.
Graduates are expected to reach the following goals roughly five years after graduation:
Goal 1, ethics: possess good scientific literacy, humanistic grounding, social responsibility, and professional ethics, and serve the country and society while weighing social, health, safety, legal, cultural, and environmental factors.
Goal 2, applied knowledge: track frontier developments in AI, and apply fundamental theory and specialized knowledge of computing and AI to propose systematic solutions to complex engineering problems in AI systems.
Goal 3, engineering practice: be able to work independently on technology development, system operations and maintenance, and product design for AI systems.
Goal 4, communication and collaboration: be able to communicate technically and work in teams across cultural backgrounds.
Goal 5, learning and innovation: maintain up-to-date knowledge of AI, an awareness of lifelong learning, and the capacity for engineering innovation in AI systems.
II. Graduation Requirements
1. Engineering knowledge: apply mathematics, natural science, engineering fundamentals, and AI specialty knowledge to solve complex engineering problems in the AI field.
2. Problem analysis: apply basic principles of mathematics, natural science, and engineering science to identify, formulate, and, through literature research, analyze complex engineering problems in AI, so as to reach valid conclusions.
3. Design/development of solutions: design solutions to complex engineering problems in AI, design AI systems, modules, or algorithmic workflows that meet specific needs, demonstrate innovation in the design process, and take social, health, safety, legal, cultural, and environmental factors into account.
Machine Learning and Pattern Recognition
Density Estimation: Gaussians
Course lecturer: Amos J. Storkey, Institute for Adaptive and Neural Computation, School of Informatics, University of Edinburgh, 10 Crichton Street, Edinburgh, UK. Notes are edited versions of notes originally created by David Barber.

Figure 1: The measured lengths of haggis. This suggests a distribution p(x), where x is haggis length, as shown.

1 Why Density Estimation?

Consider our example of handwritten digits, where each digit is a 28 × 28 = 784-dimensional vector x. It is intuitively clear that, once we have seen only a few examples of someone's handwriting style, we have a pretty good idea of how they write the number seven. They may not always write it in exactly the same way, but we would have little difficulty recognising another seven, even though every single seven they write is different. In some sense, we have captured the way a person writes a seven: there may be minor variations, but we have a pretty good idea of their effect. Since every seven that someone writes differs from the others, it makes more sense to use probabilities to describe how probable certain types of seven are. That is, we can describe how Mavis writes sevens by using a distribution p(x): perhaps very curly sevens have a high probability density value, whilst very straight sevens have a low one. The usefulness of probability is that it enables us to define a likelihood value for every one of the infinitely many different sevens Mavis could write.

Haggis example: Haggis are a commonly occurring wild animal in the highlands of Scotland. Being relatively easy to catch (due to their unfortunate evolutionary quirk of having one leg longer than the other, to make contouring around hills easier), many measurements have been made over the years of the size of adult haggis. These are plotted in Figure 1 (actually, most of the larger measurements are from female haggis, whilst the smaller measurements are typically from the males). A good summary of this data is given by the probability density function (p.d.f.) p(x) as drawn, which gives a probability density value p(x) for any length of haggis. The distribution must satisfy the normal rules of probability: $\int p(x)\,dx = 1$ and $p(x) \geq 0$.

1.1 How do we find p(x)?

Density estimation is the "learning" of p(x) given examples of x. Usually, in order to compress the information, we parameterise the probability density function in some way. A very important class of probability density functions is the Gaussians, which in one dimension are:

$$p(x\mid\mu,\sigma^2)=\frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{1}{2\sigma^2}(x-\mu)^2} \qquad (1.1)$$

This distribution is the classic "bell-shaped" distribution, as plotted in Figure 2. It has two parameters:

$$\mu=\int x\,p(x)\,dx,\ \text{the mean},\quad\text{and}\quad \sigma^2=\int (x-\mu)^2\,p(x)\,dx,\ \text{the variance}. \qquad (1.2)$$

If we had a set of datapoints X = {x₁, ..., x_P}, how could we find the settings of µ and σ² that give the best fit of p(x | µ, σ²) to the data?

Figure 2: The Gaussian (or normal) distribution p(x | µ, σ²).

1.2 Maximum likelihood fitting

If we make the standard assumption that each observed datapoint has been drawn independently from the same (identical) distribution p(x | θ), where θ are parameters describing the distribution (such as the mean and variance), the likelihood of generating the data X given our p.d.f. is

$$p(X\mid\theta)=\prod_{i=1}^{P}p(x_i\mid\theta) \qquad (1.3)$$

We can thus calculate the log-likelihood of the data as

$$L=\sum_{i=1}^{P}\log p(x_i\mid\theta) \qquad (1.4)$$

In the case of a Gaussian this is

$$L(\mu,\sigma^2)=-\frac{1}{2\sigma^2}\sum_{i=1}^{P}(x_i-\mu)^2-\frac{P}{2}\log(2\pi\sigma^2) \qquad (1.5)$$

We wish to find the parameters θ = (µ, σ²) that maximise the likelihood (since the logarithm is monotonically increasing, this is the same as maximising the log-likelihood). Differentiating L with respect to µ and equating to zero gives

$$\mu=\frac{1}{P}\sum_{i=1}^{P}x_i \qquad (1.6)$$

Similarly, differentiating with respect to σ² and equating to zero gives

$$\sigma^2=\frac{1}{P}\sum_{i=1}^{P}(x_i-\mu)^2 \qquad (1.7)$$

These are known as the maximum likelihood (ML) sample estimates of the parameters. There are other ways to fit distributions to data, and these can give different parameter settings from those obtained by ML. The ML estimator of σ² is biased, in the sense that if we were to repeat the experiment many times by drawing samples from a distribution with known mean and variance, the ML estimated variance, averaged over many trials, would differ slightly from the correct variance. The unbiased estimator for the variance is $\sigma^2=\frac{1}{P-1}\sum_{i=1}^{P}(x_i-\mu)^2$. The difference is typically negligible in practice.

Figure 3: Fitting a Gaussian to 2-dimensional data. We need to find an estimate for the mean µ and covariance matrix Σ.

1.3 Fitting Gaussians: the multi-dimensional case

Given a dataset X = {x₁, ..., x_P}, we wish to fit a Gaussian to this data. For example, how can we fit the data in Figure 3? The multi-dimensional Gaussian is given by

$$p(\mathbf{x}\mid\boldsymbol{\mu},\Sigma)=\frac{1}{\sqrt{\det(2\pi\Sigma)}}\exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\mathsf T}\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right) \qquad (1.8)$$

It is straightforward to show that the maximum likelihood estimates are given by

$$\boldsymbol{\mu}=\frac{1}{P}\sum_{i=1}^{P}\mathbf{x}_i \qquad (1.9)$$

and

$$\Sigma=\frac{1}{P}\sum_{i=1}^{P}(\mathbf{x}_i-\boldsymbol{\mu})(\mathbf{x}_i-\boldsymbol{\mu})^{\mathsf T} \qquad (1.10)$$

Note that, for an n-dimensional vector x, the (symmetric) covariance matrix has n(n+1)/2 distinct elements. If we are to estimate these parameters accurately, we will need a lot of data. It is therefore often a good idea to reduce the dimensionality of the data first (using principal component analysis, say), so that the number of parameters in the covariance matrix is reduced significantly. We will learn about dimensionality-reduction techniques later.

1.4 Fitting a Gaussian to the handwritten digits

In some experiments later, we shall be looking at classifying the handwritten digits one and seven. In order to reduce the dimensionality of the data, I first use PCA on 600 training examples of our usual 28 × 28-dimensional data. This training data contains 300 ones and 300 sevens. Using PCA, I reduced the dimensionality down to 20. However, fitting a single Gaussian to the PCA representation of both ones and sevens does not make much sense, since we
believe that there will be two high-probability regions in the 20-dimensional space, separated by a region of low probability (that is, there are two clusters, one corresponding to the ones, and one corresponding to the sevens). If we nevertheless fit a single Gaussian and then sample 100 PCA vectors from this distribution, the reconstructions are as given in Figure 4. Instead, we can fit a class-conditional distribution to the PCA data: we fit one Gaussian to the PCA representation of the sevens, and another to the PCA representation of the ones.

Figure 4: Reconstructions using 100 samples from a 20-dimensional Gaussian fitted to 600 training examples of ones and sevens. Note that the reconstructions are not particularly good examples of either ones or sevens, with a lot of mixed types.

Figure 5: Reconstructions using 100 samples from a 20-dimensional Gaussian fitted to 300 training examples of ones (left) and sevens (right).

We again sample 100 PCA representations from each of the two Gaussians, and these are plotted in Figure 5. Note how the sample reconstructions are much more like ones and sevens than those of the single global model.

1.5 Making classifications using class-conditional densities

If we have data with two (or more) classes, we can fit a separate Gaussian to each class, p(x | class 1), ..., p(x | class K), where each Gaussian is characterised by its mean µ⁽ᵏ⁾ and covariance matrix Σ⁽ᵏ⁾, k = 1, ..., K. (Note that here the upper index k does not refer to taking powers; it just indexes the class.)

What we are really interested in for classification is the following: given a new datapoint x*, what is p(class = k | x*)? To evaluate this probability, we use Bayes' rule

$$p(A\mid B)=\frac{p(B\mid A)\,p(A)}{p(B)} \qquad (1.11)$$

In this case this gives

$$p(\text{class}=k\mid x^*)=\frac{p(x^*\mid\text{class}=k)\,p(\text{class}=k)}{p(x^*)} \qquad (1.12)$$

Decision boundary: since the denominator on the right-hand side of the above equation is independent of the class, we will classify x* as class k provided

$$p(x^*\mid\text{class}=k)\,p(\text{class}=k) > p(x^*\mid\text{class}=l)\,p(\text{class}=l)\quad\text{for all } l\neq k \qquad (1.13)$$

It is straightforward to show that the maximum likelihood estimate of p(class = k) is simply given by the relative frequency of occurrence of each class in the training set. For example, if there are 100 class-1 training points and 50 class-2 training points, then p(class = 1) = 2/3 and p(class = 2) = 1/3. Normally we take logarithms of the above (this is allowed since the logarithm is monotonic and will not affect the decision process); the reason is that this is numerically more stable. That is, we will classify x* as class k provided

$$\log p(x^*\mid\text{class}=k)+\log p(\text{class}=k) > \log p(x^*\mid\text{class}=l)+\log p(\text{class}=l)\quad\text{for all } l\neq k \qquad (1.14)$$

In our case, using the fact that each density is Gaussian, we classify x* as class 1 if

$$-\tfrac{1}{2}(x^*-\mu^{(1)})^{\mathsf T}(\Sigma^{(1)})^{-1}(x^*-\mu^{(1)})-\tfrac{1}{2}\log\det\Sigma^{(1)}+\log p(\text{class}=1) > -\tfrac{1}{2}(x^*-\mu^{(2)})^{\mathsf T}(\Sigma^{(2)})^{-1}(x^*-\mu^{(2)})-\tfrac{1}{2}\log\det\Sigma^{(2)}+\log p(\text{class}=2) \qquad (1.15)$$

An example is given in Figure 6, where two-dimensional data of two different classes is plotted. The classification was produced using the code below.

```matlab
% Demo for fitting Gaussians and using Bayes to classify: 2 classes
% generate some fake training data for class 1:
X1 = randn(2,10);
% generate some fake training data for class 2:
X2 = randn(2,15) + repmat(3*ones(2,1),1,15);

% fit a Gaussian to data X1:
m1 = mean(X1')'; S1 = cov(X1'); invS1 = inv(S1);
logdetS1 = trace(logm(S1));
p1 = size(X1,2)/(size(X1,2)+size(X2,2)); % prior

% fit a Gaussian to data X2:
m2 = mean(X2')'; S2 = cov(X2'); invS2 = inv(S2);
logdetS2 = trace(logm(S2));
p2 = 1 - p1; % prior

Xnew = 2*randn(2,50) + 2*repmat(ones(2,1),1,50); % some test points

% calculate the decisions:
d1 = Xnew - repmat(m1,1,size(Xnew,2));
d2 = Xnew - repmat(m2,1,size(Xnew,2));
for i = 1:size(Xnew,2)
    if d2(:,i)'*invS2*d2(:,i) + logdetS2 - 2*log(p2) > ...
       d1(:,i)'*invS1*d1(:,i) + logdetS1 - 2*log(p1)
        class(1,i) = 1;
    else
        class(1,i) = 2;
    end
end

% plot a few things:
plot(X1(1,:),X1(2,:),'bx','markersize',10,'linewidth',2); hold on; % class 1
plot(X2(1,:),X2(2,:),'ro','markersize',10,'linewidth',2);          % class 2
plot(Xnew(1,find(class==1)),Xnew(2,find(class==1)),'bx');          % class 1
plot(Xnew(1,find(class==2)),Xnew(2,find(class==2)),'ro'); hold off % class 2
```

The decision boundary is therefore quadratic; in a later
chapter we shall encounter simpler, linear decision boundaries. The decision boundary becomes linear (a straight line or hyperplane) in the case that the two class covariances are equal.

Figure 6: Classification using two classes. The large symbols are the training data. The small symbols are the positions of the test data, and they are given the symbol assigned by the Bayes decision rule.

An important point to stress, though, is that we no longer need to store the dataset to make our decisions: we can just use the rule above, resulting in a much faster decision rule than that used, for example, in nearest-neighbour classification. It is usually a good idea to calculate and store the inverse matrices offline, as in the code above, so that during classification we do not need to keep inverting the matrices. (However, if the data is very high-dimensional, there may be better ways to calculate the decision boundary without computing the inverse of the covariance matrices.)

High-dimensional data: if the data is very high-dimensional (say above 1000), then calculating log det Σ is numerically very difficult (since det is the product of the eigenvalues, under/overflow problems can occur). A numerically more stable way to do this is to use the identity trace log M ≡ log det M, where log M is the matrix logarithm (not the element-wise logarithm). Thus, in MATLAB, one can replace log(det(Sigma)) with trace(logm(Sigma)).

1.6 Making classifications using class-conditional densities

In the above sections, we fitted two probability distributions to data, p(y | digit = 1) and p(y | digit = 7), where y are the PCA representations of the data. What we are really interested in for classification is the following: given a new y* (the PCA representation of a new test point x*), what is p(digit = 1 | y*)? To evaluate this probability, we use Bayes' rule

$$p(\text{digit}=1\mid y^*)=\frac{p(y^*\mid\text{digit}=1)\,p(\text{digit}=1)}{p(y^*)} \qquad (1.16)$$

Similarly,

$$p(\text{digit}=7\mid y^*)=\frac{p(y^*\mid\text{digit}=7)\,p(\text{digit}=7)}{p(y^*)} \qquad (1.17)$$

Since the denominators on the right-hand sides of the above two equations are equal, we will classify y* as a one provided

$$p(y^*\mid\text{digit}=1)\,p(\text{digit}=1) > p(y^*\mid\text{digit}=7)\,p(\text{digit}=7) \qquad (1.18)$$

In our case, we have Gaussians modelling the distributions p(y* | digit = 1) and p(y* | digit = 7). The values p(digit = 1) and p(digit = 7) are both 0.5, since there are equal numbers of training examples of ones and sevens. Thus, in our case, the decision is that y* is a one provided

$$\log p(y^*\mid\text{digit}=1) > \log p(y^*\mid\text{digit}=7) \qquad (1.19)$$

That is,

$$-\tfrac{1}{2}(y^*-\mu^{(1)})^{\mathsf T}(\Sigma^{(1)})^{-1}(y^*-\mu^{(1)})-\tfrac{1}{2}\log\det\Sigma^{(1)} > -\tfrac{1}{2}(y^*-\mu^{(7)})^{\mathsf T}(\Sigma^{(7)})^{-1}(y^*-\mu^{(7)})-\tfrac{1}{2}\log\det\Sigma^{(7)} \qquad (1.20)$$

Using the above decision method on the 600 test points gives errors on only 10 examples. This is therefore about the same as the nearest-neighbour method (there we used 50 components and had 18 errors), if not a little better. As opposed to the nearest-neighbour method, we no longer need to store the dataset to make our decisions: we only need to store the various means, covariances, and prior parameters, resulting in a much more efficient and faster decision rule.
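The whole pipeline of Secs. 1.2–1.5 (ML fitting of a Gaussian per class, relative-frequency priors, and the log-posterior decision rule) can be sketched in a few lines of Python. This is my own sketch, using NumPy rather than the notes' MATLAB; `np.linalg.slogdet` plays the same numerical-stability role as the `trace(logm(·))` trick, and the fake two-cluster data mirrors the demo above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake training data: two 2-D Gaussian clusters (rows are datapoints).
X1 = rng.normal(size=(10, 2))            # class 1, centred at the origin
X2 = rng.normal(size=(15, 2)) + 3.0      # class 2, centred at (3, 3)

def fit_gaussian(X):
    """ML estimates (1.9)-(1.10): sample mean and (biased) sample covariance."""
    mu = X.mean(axis=0)
    D = X - mu
    return mu, D.T @ D / len(X)

def log_class_score(x, mu, Sigma, prior):
    """log p(x|class) + log p(class), dropping class-independent constants (cf. 1.14/1.15)."""
    d = x - mu
    # slogdet returns (sign, log|det|); stable even when det would under/overflow.
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * d @ np.linalg.solve(Sigma, d) - 0.5 * logdet + np.log(prior)

(m1, S1), (m2, S2) = fit_gaussian(X1), fit_gaussian(X2)
p1 = len(X1) / (len(X1) + len(X2))       # ML priors: relative class frequencies
p2 = 1.0 - p1

def classify(x):
    return 1 if log_class_score(x, m1, S1, p1) > log_class_score(x, m2, S2, p2) else 2

# Points near each cluster centre should be assigned to that cluster.
print(classify(np.zeros(2)), classify(np.full(2, 3.0)))
```

As in the notes, everything needed at decision time is the two means, covariances, and priors; the training set itself can be discarded.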
Draft: Deep Learning in Neural Networks: An Overview
Technical Report IDSIA-03-14 / arXiv:1404.7828 (v1.5) [cs.NE]
Jürgen Schmidhuber
The Swiss AI Lab IDSIA
Istituto Dalle Molle di Studi sull'Intelligenza Artificiale
University of Lugano & SUPSI
Galleria 2, 6928 Manno-Lugano, Switzerland
15 May 2014

Abstract

In recent years, deep artificial neural networks (including recurrent ones) have won numerous contests in pattern recognition and machine learning. This historical survey compactly summarises relevant work, much of it from the previous millennium. Shallow and deep learners are distinguished by the depth of their credit assignment paths, which are chains of possibly learnable, causal links between actions and effects. I review deep supervised learning (also recapitulating the history of backpropagation), unsupervised learning, reinforcement learning & evolutionary computation, and indirect search for short programs encoding deep and large networks.

PDF of earlier draft (v1): http://www.idsia.ch/~juergen/DeepLearning30April2014.pdf
LaTeX source: http://www.idsia.ch/~juergen/DeepLearning30April2014.tex
Complete BibTeX file: http://www.idsia.ch/~juergen/bib.bib

Preface

This is the draft of an invited Deep Learning (DL) overview. One of its goals is to assign credit to those who contributed to the present state of the art. I acknowledge the limitations of attempting to achieve this goal. The DL research community itself may be viewed as a continually evolving, deep network of scientists who have influenced each other in complex ways. Starting from recent DL results, I tried to trace back the origins of relevant ideas through the past half century and beyond, sometimes using "local search" to follow citations of citations backwards in time. Since not all DL publications properly acknowledge earlier relevant work, additional global search strategies were employed, aided by consulting numerous neural network experts. As a result, the present draft mostly consists of references (about 800 entries so far). Nevertheless, through an expert selection bias I may have missed important work. A related bias was surely introduced by my special familiarity with the work of my own DL research group in the past quarter-century. For these reasons, the present draft should be viewed as merely a snapshot of an ongoing credit assignment process. To help improve it, please do not hesitate to send corrections and suggestions to juergen@idsia.ch.

Contents

1 Introduction to Deep Learning (DL) in Neural Networks (NNs)
2 Event-Oriented Notation for Activation Spreading in FNNs/RNNs
3 Depth of Credit Assignment Paths (CAPs) and of Problems
4 Recurring Themes of Deep Learning
  4.1 Dynamic Programming (DP) for DL
  4.2 Unsupervised Learning (UL) Facilitating Supervised Learning (SL) and RL
  4.3 Occam's Razor: Compression and Minimum Description Length (MDL)
  4.4 Learning Hierarchical Representations Through Deep SL, UL, RL
  4.5 Fast Graphics Processing Units (GPUs) for DL in NNs
5 Supervised NNs, Some Helped by Unsupervised NNs
  5.1 1940s and Earlier
  5.2 Around 1960: More Neurobiological Inspiration for DL
  5.3 1965: Deep Networks Based on the Group Method of Data Handling (GMDH)
  5.4 1979: Convolution + Weight Replication + Winner-Take-All (WTA)
  5.5 1960-1981 and Beyond: Development of Backpropagation (BP) for NNs
    5.5.1 BP for Weight-Sharing Feedforward NNs (FNNs) and Recurrent NNs (RNNs)
  5.6 Late 1980s-2000: Numerous Improvements of NNs
    5.6.1 Ideas for Dealing with Long Time Lags and Deep CAPs
    5.6.2 Better BP Through Advanced Gradient Descent
    5.6.3 Discovering Low-Complexity, Problem-Solving NNs
    5.6.4 Potential Benefits of UL for SL
  5.7 1987: UL Through Autoencoder (AE) Hierarchies
  5.8 1989: BP for Convolutional NNs (CNNs)
  5.9 1991: Fundamental Deep Learning Problem of Gradient Descent
  5.10 1991: UL-Based History Compression Through a Deep Hierarchy of RNNs
  5.11 1992: Max-Pooling (MP): Towards MPCNNs
  5.12 1994: Contest-Winning Not So Deep NNs
  5.13 1995: Supervised Recurrent Very Deep Learner (LSTM RNN)
  5.14 2003: More Contest-Winning/Record-Setting, Often Not So Deep NNs
  5.15 2006/7: Deep Belief Networks (DBNs) & AE Stacks Fine-Tuned by BP
  5.16 2006/7: Improved CNNs/GPU-CNNs/BP-Trained MPCNNs
  5.17 2009: First Official Competitions Won by RNNs, and with MPCNNs
  5.18 2010: Plain Backprop (+ Distortions) on GPU Yields Excellent Results
  5.19 2011: MPCNNs on GPU Achieve Superhuman Vision Performance
  5.20 2011: Hessian-Free Optimization for RNNs
  5.21 2012: First Contests Won on ImageNet & Object Detection & Segmentation
  5.22 2013-: More Contests and Benchmark Records
    5.22.1 Currently Successful Supervised Techniques: LSTM RNNs/GPU-MPCNNs
  5.23 Recent Tricks for Improving SL Deep NNs (Compare Sec. 5.6.2, 5.6.3)
  5.24 Consequences for Neuroscience
  5.25 DL with Spiking Neurons?
6 DL in FNNs and RNNs for Reinforcement Learning (RL)
  6.1 RL Through NN World Models Yields RNNs With Deep CAPs
  6.2 Deep FNNs for Traditional RL and Markov Decision Processes (MDPs)
  6.3 Deep RL RNNs for Partially Observable MDPs (POMDPs)
  6.4 RL Facilitated by Deep UL in FNNs and RNNs
  6.5 Deep Hierarchical RL (HRL) and Subgoal Learning with FNNs and RNNs
  6.6 Deep RL by Direct NN Search / Policy Gradients / Evolution
  6.7 Deep RL by Indirect Policy Search / Compressed NN Search
  6.8 Universal RL
7 Conclusion

1 Introduction to Deep Learning (DL) in Neural Networks (NNs)

Which modifiable components of a learning system are responsible for its success or failure? What changes to them improve performance? This has been called the fundamental credit assignment problem (Minsky, 1963). There are general credit assignment methods for universal problem solvers that are time-optimal in various theoretical senses (Sec. 6.8). The present survey, however, will focus on the narrower, but now commercially important, subfield of Deep Learning (DL) in Artificial Neural Networks (NNs). We are interested in accurate credit assignment across possibly many, often nonlinear, computational stages of NNs. Shallow NN-like models have been
around for many decades if not centuries (Sec. 5.1). Models with several successive nonlinear layers of neurons date back at least to the 1960s (Sec. 5.3) and 1970s (Sec. 5.5). An efficient gradient descent method for teacher-based Supervised Learning (SL) in discrete, differentiable networks of arbitrary depth, called backpropagation (BP), was developed in the 1960s and 1970s, and applied to NNs in 1981 (Sec. 5.5). BP-based training of deep NNs with many layers, however, had been found to be difficult in practice by the late 1980s (Sec. 5.6), and had become an explicit research subject by the early 1990s (Sec. 5.9). DL became practically feasible to some extent through the help of Unsupervised Learning (UL) (e.g., Sec. 5.10, 5.15). The 1990s and 2000s also saw many improvements of purely supervised DL (Sec. 5). In the new millennium, deep NNs have finally attracted wide-spread attention, mainly by outperforming alternative machine learning methods such as kernel machines (Vapnik, 1995; Schölkopf et al., 1998) in numerous important applications. In fact, supervised deep NNs have won numerous official international pattern recognition competitions (e.g., Sec. 5.17, 5.19, 5.21, 5.22), achieving the first superhuman visual pattern recognition results in limited domains (Sec. 5.19). Deep NNs also have become relevant for the more general field of Reinforcement Learning (RL), where there is no supervising teacher (Sec. 6).

Both feedforward (acyclic) NNs (FNNs) and recurrent (cyclic) NNs (RNNs) have won contests (Sec. 5.12, 5.14, 5.17, 5.19, 5.21, 5.22). In a sense, RNNs are the deepest of all NNs (Sec. 3): they are general computers more powerful than FNNs, and can in principle create and process memories of arbitrary sequences of input patterns (e.g., Siegelmann and Sontag, 1991; Schmidhuber, 1990a). Unlike traditional methods for automatic sequential program synthesis (e.g., Waldinger and Lee, 1969; Balzer, 1985; Soloway, 1986; Deville and Lau, 1994), RNNs can learn programs that mix sequential and parallel information processing in a natural and efficient way, exploiting the massive parallelism viewed as crucial for sustaining the rapid decline of computation cost observed over the past 75 years.

The rest of this paper is structured as follows. Sec. 2 introduces a compact, event-oriented notation that is simple yet general enough to accommodate both FNNs and RNNs. Sec. 3 introduces the concept of Credit Assignment Paths (CAPs) to measure whether learning in a given NN application is of the deep or shallow type. Sec. 4 lists recurring themes of DL in SL, UL, and RL. Sec. 5 focuses on SL and UL, and on how UL can facilitate SL, although pure SL has become dominant in recent competitions (Sec. 5.17-5.22). Sec. 5 is arranged in a historical timeline format with subsections on important inspirations and technical contributions. Sec. 6 on deep RL discusses traditional Dynamic Programming (DP)-based RL combined with gradient-based search techniques for SL or UL in deep NNs, as well as general methods for direct and indirect search in the weight space of deep FNNs and RNNs, including successful policy gradient and evolutionary methods.

2 Event-Oriented Notation for Activation Spreading in FNNs/RNNs

Throughout this paper, let i, j, k, t, p, q, r denote positive integer variables assuming ranges implicit in the given contexts. Let n, m, T denote positive integer constants.

An NN's topology may change over time (e.g., Fahlman, 1991; Ring, 1991; Weng et al., 1992; Fritzke, 1994). At any given moment, it can be described as a finite subset of units (or nodes or neurons) N = {u₁, u₂, ...} and a finite set H ⊆ N × N of directed edges or connections between nodes. FNNs are acyclic graphs, RNNs cyclic. The first (input) layer is the set of input units, a subset of N. In FNNs, the k-th layer (k > 1) is the set of all nodes u ∈ N such that there is an edge path of length k − 1 (but no longer path) between some input unit and u. There may be shortcut connections between distant layers. The NN's behavior or program is determined by a set of real-valued, possibly modifiable, parameters or weights wᵢ (i = 1, ..., n). We now focus
on a single finite episode or epoch of information processing and activation spreading, without learning through weight changes. The following slightly unconventional notation is designed to compactly describe what is happening during the runtime of the system.

During an episode, there is a partially causal sequence x_t (t = 1, ..., T) of real values that I call events. Each x_t is either an input set by the environment, or the activation of a unit that may directly depend on other x_k (k < t) through a current NN topology-dependent set in_t of indices k representing incoming causal connections or links. Let the function v encode topology information and map such event index pairs (k, t) to weight indices. For example, in the non-input case we may have x_t = f_t(net_t) with real-valued

$$\mathrm{net}_t=\sum_{k\in \mathrm{in}_t} x_k\, w_{v(k,t)}\ \text{(additive case)}\quad\text{or}\quad \mathrm{net}_t=\prod_{k\in \mathrm{in}_t} x_k\, w_{v(k,t)}\ \text{(multiplicative case)},$$

where f_t is a typically nonlinear real-valued activation function such as tanh. In many recent competition-winning NNs (Sec. 5.19, 5.21, 5.22) there also are events of the type x_t = max_{k∈in_t}(x_k); some network types may also use complex polynomial activation functions (Sec. 5.3). x_t may directly affect certain x_k (k > t) through outgoing connections or links represented through a current set out_t of indices k with t ∈ in_k. Some non-input events are called output events.

Note that many of the x_t may refer to different, time-varying activations of the same unit in sequence-processing RNNs (e.g., Williams, 1989, "unfolding in time"), or also in FNNs sequentially exposed to time-varying input patterns of a large training set encoded as input events. During an episode, the same weight may get reused over and over again in topology-dependent ways, e.g., in RNNs, or in convolutional NNs (Sec. 5.4, 5.8). I call this weight sharing across space and/or time. Weight sharing may greatly reduce the NN's descriptive complexity, which is the number of bits of information required to describe the NN (Sec. 4.3).

In Supervised Learning (SL), certain NN output events x_t may be associated with teacher-given, real-valued labels or targets d_t yielding errors e_t, e.g., e_t = ½(x_t − d_t)². A typical goal of supervised NN training is to find weights that yield episodes with small total error E, the sum of all such e_t. The hope is that the NN will generalize well in later episodes, causing only small errors on previously unseen sequences of input events. Many alternative error functions for SL and UL are possible.

SL assumes that input events are independent of earlier output events (which may affect the environment through actions causing subsequent perceptions). This assumption does not hold in the broader fields of Sequential Decision Making and Reinforcement Learning (RL) (Kaelbling et al., 1996; Sutton and Barto, 1998; Hutter, 2005) (Sec. 6). In RL, some of the input events may encode real-valued reward signals given by the environment, and a typical goal is to find weights that yield episodes with a high sum of reward signals, through sequences of appropriate output actions.

Sec. 5.5 will use the notation above to compactly describe a central algorithm of DL, namely, backpropagation (BP) for supervised weight-sharing FNNs and RNNs. (FNNs may be viewed as RNNs with certain fixed zero weights.) Sec. 6 will address the more general RL case.

3 Depth of Credit Assignment Paths (CAPs) and of Problems

To measure whether credit assignment in a given NN application is of the deep or shallow type, I introduce the concept of Credit Assignment Paths or CAPs, which are chains of possibly causal links between events.

Let us first focus on SL. Consider two events x_p and x_q (1 ≤ p < q ≤ T). Depending on the application, they may have a Potential Direct Causal Connection (PDCC) expressed by the Boolean predicate pdcc(p, q), which is true if and only if p ∈ in_q. Then the 2-element list (p, q) is defined to be a CAP from p to q (a minimal one). A learning algorithm may be allowed to change w_{v(p,q)} to improve performance in future episodes.

More general, possibly indirect, Potential Causal Connections (PCC) are
expressed by the recursively defined Boolean predicate pcc(p,q),which in the SL case is true only if pdcc(p,q),or if pcc(p,k)for some k and pdcc(k,q).In the latter case,appending q to any CAP from p to k yields a CAP from p to q(this is a recursive definition,too).The set of such CAPs may be large but isfinite.Note that the same weight may affect many different PDCCs between successive events listed by a given CAP,e.g.,in the case of RNNs, or weight-sharing FNNs.Suppose a CAP has the form(...,k,t,...,q),where k and t(possibly t=q)are thefirst successive elements with modifiable w v(k,t).Then the length of the suffix list(t,...,q)is called the CAP’s depth (which is0if there are no modifiable links at all).This depth limits how far backwards credit assignment can move down the causal chain tofind a modifiable weight.1Suppose an episode and its event sequence x1,...,x T satisfy a computable criterion used to decide whether a given problem has been solved(e.g.,total error E below some threshold).Then the set of used weights is called a solution to the problem,and the depth of the deepest CAP within the sequence is called the solution’s depth.There may be other solutions(yielding different event sequences)with different depths.Given somefixed NN topology,the smallest depth of any solution is called the problem’s depth.Sometimes we also speak of the depth of an architecture:SL FNNs withfixed topology imply a problem-independent maximal problem depth bounded by the number of non-input layers.Certain SL RNNs withfixed weights for all connections except those to output units(Jaeger,2001;Maass et al.,2002; Jaeger,2004;Schrauwen et al.,2007)have a maximal problem depth of1,because only thefinal links in the corresponding CAPs are modifiable.In general,however,RNNs may learn to solve problems of potentially unlimited depth.Note that the definitions above are solely based on the depths of causal chains,and agnostic of the temporal distance between events.For example,shallow 
FNNs perceiving large "time windows" of input events may correctly classify long input sequences through appropriate output events, and thus solve shallow problems involving long time lags between relevant events.

At which problem depth does Shallow Learning end, and Deep Learning begin? Discussions with DL experts have not yet yielded a conclusive response to this question. Instead of committing myself to a precise answer, let me just define for the purposes of this overview: problems of depth > 10 require Very Deep Learning.

The difficulty of a problem may have little to do with its depth. Some NNs can quickly learn to solve certain deep problems, e.g., through random weight guessing (Sec. 5.9) or other types of direct search (Sec. 6.6) or indirect search (Sec. 6.7) in weight space, or through training an NN first on shallow problems whose solutions may then generalize to deep problems, or through collapsing sequences of (non)linear operations into a single (non)linear operation; but see (Baldi and Hornik, 1994, Section B) for an analysis of non-trivial aspects of deep linear networks. In general, however, finding an NN that precisely models a given training set is an NP-complete problem (Judd, 1990; Blum and Rivest, 1992), also in the case of deep NNs (Šíma, 1994; de Souto et al., 1999; Windisch, 2005); compare a survey of negative results (Šíma, 2002, Section 1).

Above we have focused on SL. In the more general case of RL in unknown environments, pcc(p, q) is also true if x_p is an output event and x_q any later input event: any action may affect the environment and thus any later perception. (In the real world, the environment may even influence non-input events computed on a physical hardware entangled with the entire universe, but this is ignored here.) It is possible to model and replace such unmodifiable environmental PCCs through a part of the NN that has already learned to predict (through some of its units) input events (including reward signals) from former input events and actions (Sec. 6.1). Its weights are
frozen, but can help to assign credit to other, still modifiable weights used to compute actions (Sec. 6.1). This approach may lead to very deep CAPs though.

Some DL research is about automatically rephrasing problems such that their depth is reduced (Sec. 4). In particular, sometimes UL is used to make SL problems less deep, e.g., Sec. 5.10. Often Dynamic Programming (Sec. 4.1) is used to facilitate certain traditional RL problems, e.g., Sec. 6.2. Sec. 5 focuses on CAPs for SL, Sec. 6 on the more complex case of RL.

4 Recurring Themes of Deep Learning

4.1 Dynamic Programming (DP) for DL

One recurring theme of DL is Dynamic Programming (DP) (Bellman, 1957), which can help to facilitate credit assignment under certain assumptions. For example, in SL NNs, backpropagation itself can be viewed as a DP-derived method (Sec. 5.5). In traditional RL based on strong Markovian assumptions, DP-derived methods can help to greatly reduce problem depth (Sec. 6.2). DP algorithms are also essential for systems that combine concepts of NNs and graphical models, such as Hidden Markov Models (HMMs) (Stratonovich, 1960; Baum and Petrie, 1966) and Expectation Maximization (EM) (Dempster et al., 1977), e.g., (Bottou, 1991; Bengio, 1991; Bourlard and Morgan, 1994; Baldi and Chauvin, 1996; Jordan and Sejnowski, 2001; Bishop, 2006; Poon and Domingos, 2011; Dahl et al., 2012; Hinton et al., 2012a).

[1] An alternative would be to count only modifiable links when measuring depth. In many typical NN applications this would not make a difference, but in some it would, e.g., Sec. 6.1.

4.2 Unsupervised Learning (UL) Facilitating Supervised Learning (SL) and RL

Another recurring theme is how UL can facilitate both SL (Sec. 5) and RL (Sec. 6). UL (Sec. 5.6.4) is normally used to encode raw incoming data such as video or speech streams in a form that is more convenient for subsequent goal-directed learning. In particular, codes that describe the original data in a less redundant or more compact way can be fed into SL (Sec. 5.10, 5.15) or RL machines (Sec. 6.4), whose search spaces may thus become
smaller (and whose CAPs shallower) than those necessary for dealing with the raw data. UL is closely connected to the topics of regularization and compression (Sec. 4.3, 5.6.3).

4.3 Occam's Razor: Compression and Minimum Description Length (MDL)

Occam's razor favors simple solutions over complex ones. Given some programming language, the principle of Minimum Description Length (MDL) can be used to measure the complexity of a solution candidate by the length of the shortest program that computes it (e.g., Solomonoff, 1964; Kolmogorov, 1965b; Chaitin, 1966; Wallace and Boulton, 1968; Levin, 1973a; Rissanen, 1986; Blumer et al., 1987; Li and Vitányi, 1997; Grünwald et al., 2005). Some methods explicitly take into account program runtime (Allender, 1992; Watanabe, 1992; Schmidhuber, 2002, 1995); many consider only programs with constant runtime, written in non-universal programming languages (e.g., Rissanen, 1986; Hinton and van Camp, 1993). In the NN case, the MDL principle suggests that low NN weight complexity corresponds to high NN probability in the Bayesian view (e.g., MacKay, 1992; Buntine and Weigend, 1991; De Freitas, 2003), and to high generalization performance (e.g., Baum and Haussler, 1989), without overfitting the training data. Many methods have been proposed for regularizing NNs, that is, searching for solution-computing, low-complexity SL NNs (Sec. 5.6.3) and RL NNs (Sec. 6.7). This is closely related to certain UL methods (Sec. 4.2, 5.6.4).

4.4 Learning Hierarchical Representations Through Deep SL, UL, RL

Many methods of Good Old-Fashioned Artificial Intelligence (GOFAI) (Nilsson, 1980) as well as more recent approaches to AI (Russell et al., 1995) and Machine Learning (Mitchell, 1997) learn hierarchies of more and more abstract data representations. For example, certain methods of syntactic pattern recognition (Fu, 1977) such as grammar induction discover hierarchies of formal rules to model observations.
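The two-part MDL idea of Sec. 4.3 can be made concrete with a toy model-selection sketch. This is my own illustration, not a method from the survey: the total cost is parameter bits (at an assumed fixed 16-bit precision per parameter) plus a Gaussian coding cost for the residuals, and the data set is invented.

```python
import math
import random

# Toy two-part MDL: description length = bits to encode the model's
# parameters (at an assumed fixed precision) + bits to encode the
# residuals under a fitted Gaussian noise model. All constants and the
# data below are illustrative assumptions.

def residual_bits(residuals):
    # Gaussian coding cost: n/2 * log2(2*pi*e*var); can be negative for
    # very small variance (differential coding cost, not literal bits).
    n = len(residuals)
    var = sum(r * r for r in residuals) / n + 1e-12
    return 0.5 * n * math.log2(2 * math.pi * math.e * var)

def description_length(residuals, n_params, bits_per_param=16):
    return n_params * bits_per_param + residual_bits(residuals)

random.seed(0)
xs = [i / 10 for i in range(50)]
ys = [2.0 * x + random.gauss(0, 0.1) for x in xs]

# Model A: a constant (1 parameter, the mean).
mean_y = sum(ys) / len(ys)
dl_const = description_length([y - mean_y for y in ys], n_params=1)

# Model B: a straight line (slope + intercept, 2 parameters),
# fitted by closed-form least squares.
n = len(xs)
sx, sy = sum(xs), sum(ys)
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))
a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
b = (sy - a * sx) / n
dl_line = description_length([y - (a * x + b) for x, y in zip(xs, ys)],
                             n_params=2)
```

The line pays 16 extra parameter bits but shrinks the residuals so much that its total description length is far smaller, so MDL prefers it; this is the sense in which low complexity and good fit are traded off.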
The partially (un)supervised Automated Mathematician / EURISKO (Lenat, 1983; Lenat and Brown, 1984) continually learns concepts by combining previously learnt concepts. Such hierarchical representation learning (Ring, 1994; Bengio et al., 2013; Deng and Yu, 2014) is also a recurring theme of DL NNs for SL (Sec. 5), UL-aided SL (Sec. 5.7, 5.10, 5.15), and hierarchical RL (Sec. 6.5). Often, abstract hierarchical representations are natural by-products of data compression (Sec. 4.3), e.g., Sec. 5.10.

4.5 Fast Graphics Processing Units (GPUs) for DL in NNs

While the previous millennium saw several attempts at creating fast NN-specific hardware (e.g., Jackel et al., 1990; Faggin, 1992; Ramacher et al., 1993; Widrow et al., 1994; Heemskerk, 1995; Korkin et al., 1997; Urlbe, 1999), and at exploiting standard hardware (e.g., Anguita et al., 1994; Muller et al., 1995; Anguita and Gomes, 1996), the new millennium brought a DL breakthrough in form of cheap, multi-processor graphics cards or GPUs. GPUs are widely used for video games, a huge and competitive market that has driven down hardware prices. GPUs excel at fast matrix and vector multiplications required not only for convincing virtual realities but also for NN training, where they can speed up learning by a factor of 50 and more. Some of the GPU-based FNN implementations (Sec. 5.16-5.19) have greatly contributed to recent successes in contests for pattern recognition (Sec. 5.19-5.22), image segmentation (Sec. 5.21), and object detection (Sec. 5.21-5.22).

5 Supervised NNs, Some Helped by Unsupervised NNs

The main focus of current practical applications is on Supervised Learning (SL), which has dominated recent pattern recognition contests (Sec. 5.17-5.22). Several methods, however, use additional Unsupervised Learning (UL) to facilitate SL (Sec. 5.7, 5.10, 5.15). It does make sense to treat SL and UL in the same section: often gradient-based methods, such as BP (Sec. 5.5.1), are used to optimize objective functions of both UL and SL, and the boundary between SL and UL may blur, for example, when it comes to time series
prediction and sequence classification, e.g., Sec. 5.10, 5.12.

A historical timeline format will help to arrange subsections on important inspirations and technical contributions (although such a subsection may span a time interval of many years). Sec. 5.1 briefly mentions early, shallow NN models since the 1940s, Sec. 5.2 additional early neurobiological inspiration relevant for modern Deep Learning (DL). Sec. 5.3 is about GMDH networks (since 1965), perhaps the first (feedforward) DL systems. Sec. 5.4 is about the relatively deep Neocognitron NN (1979) which is similar to certain modern deep FNN architectures, as it combines convolutional NNs (CNNs), weight pattern replication, and winner-take-all (WTA) mechanisms. Sec. 5.5 uses the notation of Sec. 2 to compactly describe a central algorithm of DL, namely, backpropagation (BP) for supervised weight-sharing FNNs and RNNs. It also summarizes the history of BP 1960-1981 and beyond. Sec. 5.6 describes problems encountered in the late 1980s with BP for deep NNs, and mentions several ideas from the previous millennium to overcome them. Sec. 5.7 discusses a first hierarchical stack of coupled UL-based Autoencoders (AEs); this concept resurfaced in the new millennium (Sec. 5.15). Sec. 5.8 is about applying BP to CNNs, which is important for today's DL applications. Sec. 5.9 explains BP's Fundamental DL Problem (of vanishing/exploding gradients) discovered in 1991. Sec. 5.10 explains how a deep RNN stack of 1991 (the History Compressor) pre-trained by UL helped to solve previously unlearnable DL benchmarks requiring Credit Assignment Paths (CAPs, Sec. 3) of depth 1000 and more. Sec. 5.11 discusses a particular WTA method called Max-Pooling (MP) important in today's DL FNNs. Sec. 5.12 mentions a first important contest won by SL NNs in 1994. Sec. 5.13 describes a purely supervised DL RNN (Long Short-Term Memory, LSTM) for problems of depth 1000 and more. Sec. 5.14 mentions an early contest of 2003 won by an ensemble of shallow NNs, as well as good pattern recognition results with CNNs and LSTM RNNs (2003). Sec. 5.15 is
mostly about Deep Belief Networks (DBNs, 2006) and related stacks of Autoencoders (AEs, Sec. 5.7) pre-trained by UL to facilitate BP-based SL. Sec. 5.16 mentions the first BP-trained MPCNNs (2007) and GPU-CNNs (2006). Sec. 5.17-5.22 focus on official competitions with secret test sets won by (mostly purely supervised) DL NNs since 2009, in sequence recognition, image classification, image segmentation, and object detection. Many RNN results depended on LSTM (Sec. 5.13); many FNN results depended on GPU-based FNN code developed since 2004 (Sec. 5.16, 5.17, 5.18, 5.19), in particular, GPU-MPCNNs (Sec. 5.19).

5.1 1940s and Earlier

NN research started in the 1940s (e.g., McCulloch and Pitts, 1943; Hebb, 1949); compare also later work on learning NNs (Rosenblatt, 1958, 1962; Widrow and Hoff, 1962; Grossberg, 1969; Kohonen, 1972; von der Malsburg, 1973; Narendra and Thathatchar, 1974; Willshaw and von der Malsburg, 1976; Palm, 1980; Hopfield, 1982). In a sense NNs have been around even longer, since early supervised NNs were essentially variants of linear regression methods going back at least to the early 1800s (e.g., Legendre, 1805; Gauss, 1809, 1821). Early NNs had a maximal CAP depth of 1 (Sec. 3).

5.2 Around 1960: More Neurobiological Inspiration for DL

Simple cells and complex cells were found in the cat's visual cortex (e.g., Hubel and Wiesel, 1962; Wiesel and Hubel, 1959). These cells fire in response to certain properties of visual sensory inputs, such as the orientation of edges. Complex cells exhibit more spatial invariance than simple cells. This inspired later deep NN architectures (Sec. 5.4) used in certain modern award-winning Deep Learners (Sec. 5.19-5.22).

5.3 1965: Deep Networks Based on the Group Method of Data Handling (GMDH)

Networks trained by the Group Method of Data Handling (GMDH) (Ivakhnenko and Lapa, 1965; Ivakhnenko et al., 1967; Ivakhnenko, 1968, 1971) were perhaps the first DL systems of the Feedforward Multilayer Perceptron type. The units of GMDH nets may have polynomial activation functions implementing Kolmogorov-Gabor polynomials (more general than
traditional NN activation functions). Given a training set, layers are incrementally grown and trained by regression analysis, then pruned with the help of a separate validation set (using today's terminology), where Decision Regularisation is used to weed out superfluous units. The numbers of layers and units per layer can be learned in problem-dependent fashion. This is a good example of hierarchical representation learning (Sec. 4.4). There have been numerous applications of GMDH-style networks, e.g. (Ikeda et al., 1976; Farlow, 1984; Madala and Ivakhnenko, 1994; Ivakhnenko, 1995; Kondo, 1998; Kordík et al., 2003; Witczak et al., 2006; Kondo and Ueno, 2008).

5.4 1979: Convolution + Weight Replication + Winner-Take-All (WTA)

Apart from deep GMDH networks (Sec. 5.3), the Neocognitron (Fukushima, 1979, 1980, 2013a) was perhaps the first artificial NN that deserved the attribute deep, and the first to incorporate the neurophysiological insights of Sec. 5.2. It introduced convolutional NNs (today often called CNNs or convnets), where the (typically rectangular) receptive field of a convolutional unit with given weight vector is shifted step by step across a 2-dimensional array of input values, such as the pixels of an image. The resulting 2D array of subsequent activation events of this unit can then provide inputs to higher-level units, and so on. Due to massive weight replication (Sec. 2), relatively few parameters may be necessary to describe the behavior of such a convolutional layer.

Competition layers have WTA subsets whose maximally active units are the only ones to adopt non-zero activation values. They essentially "down-sample" the competition layer's input. This helps to create units whose responses are insensitive to small image shifts (compare Sec. 5.2).

The Neocognitron is very similar to the architecture of modern, contest-winning, purely supervised, feedforward, gradient-based Deep Learners with alternating convolutional and competition layers (e.g., Sec. 5.19-5.22). Fukushima, however, did not set the weights by supervised
backpropagation (Sec. 5.5, 5.8), but by local unsupervised learning rules (e.g., Fukushima, 2013b), or by pre-wiring. In that sense he did not care for the DL problem (Sec. 5.9), although his architecture was comparatively deep indeed. He also used Spatial Averaging (Fukushima, 1980, 2011) instead of Max-Pooling (MP, Sec. 5.11), currently a particularly convenient and popular WTA mechanism. Today's CNN-based DL machines profit a lot from later CNN work (e.g., LeCun et al., 1989; Ranzato et al., 2007) (Sec. 5.8, 5.16, 5.19).

5.5 1960-1981 and Beyond: Development of Backpropagation (BP) for NNs

The minimisation of errors through gradient descent (Hadamard, 1908) in the parameter space of complex, nonlinear, differentiable, multi-stage, NN-related systems has been discussed at least since the early 1960s (e.g., Kelley, 1960; Bryson, 1961; Bryson and Denham, 1961; Pontryagin et al., 1961; Dreyfus, 1962; Wilkinson, 1965; Amari, 1967; Bryson and Ho, 1969; Director and Rohrer, 1969; Griewank, 2012), initially within the framework of Euler-Lagrange equations in the Calculus of Variations (e.g., Euler, 1744). Steepest descent in such systems can be performed (Bryson, 1961; Kelley, 1960; Bryson and Ho, 1969) by iterating the ancient chain rule (Leibniz, 1676; L'Hôpital, 1696) in Dynamic Programming (DP) style (Bellman, 1957). A simplified derivation of the method uses the chain rule only (Dreyfus, 1962).

The methods of the 1960s were already efficient in the DP sense. However, they backpropagated derivative information through standard Jacobian matrix calculations from one "layer" to the previous one, explicitly addressing neither direct links across several layers nor potential additional efficiency gains due to network sparsity (but perhaps such enhancements seemed obvious to the authors).
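As a toy illustration of "backpropagation as iterated chain rule" (my own minimal sketch, not the survey's notation): take a two-unit network y = w2 · tanh(w1 · x) with error e = 1/2 (y − d)^2, and propagate the error derivative back through each stage. All numeric values are made up for the example.

```python
import math

# Minimal backpropagation sketch: reverse-mode chain rule through a
# tiny network y = w2 * tanh(w1 * x), error e = 1/2 * (y - d)^2.
# Weights, input and target are illustrative assumptions.

def forward(w1, w2, x):
    h = math.tanh(w1 * x)   # hidden activation
    y = w2 * h              # output
    return h, y

def backward(w1, w2, x, d):
    h, y = forward(w1, w2, x)
    de_dy = y - d                      # d e / d y
    de_dw2 = de_dy * h                 # chain rule, output weight
    de_dh = de_dy * w2                 # propagate back through w2
    de_dw1 = de_dh * (1 - h * h) * x   # through tanh' and the input
    return de_dw1, de_dw2

w1, w2, x, d = 0.5, -0.3, 1.2, 1.0
g1, g2 = backward(w1, w2, x, d)
```

A central-finite-difference check on the error function is the usual way to validate such hand-derived gradients before trusting them in training.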
Reference information:
Title: A Study of Data Mining with Big Data (大数据挖掘研究)
Authors: V. H. Shastri, V. Sreeprada
Source: International Journal of Emerging Trends and Technology in Computer Science, 2016, 38(2): 99-103
Word count: 2,291 English words (12,196 characters); Chinese translation 3,868 characters

Original text:

A Study of Data Mining with Big Data

Abstract: Data has become an important part of every economy, industry, organization, business, function and individual. Big Data is a term used to identify large data sets, typically those whose size is larger than a typical database. Big data introduces unique computational and statistical challenges, and is at present expanding in most domains of engineering and science. Data mining helps to extract useful information from huge data sets despite their volume, variability and velocity. This article presents a HACE theorem that characterizes the features of the Big Data revolution, and proposes a Big Data processing model from the data mining perspective.

Keywords: Big Data, Data Mining, HACE theorem, structured and unstructured.

I. Introduction

Big Data refers to enormous amounts of structured and unstructured data that overflow the organization. If this data is properly used, it can lead to meaningful information. Big data includes large amounts of data which require a lot of processing in real time. It provides room to discover new values, to gain in-depth knowledge from hidden values, and to manage the data effectively. A database is an organized collection of logically related data which can be easily managed, updated and accessed. Data mining is a process of discovering interesting knowledge, such as associations, patterns, changes, anomalies and significant structures, from large amounts of data stored in databases or other repositories.

Big Data has 3 V's as its characteristics: volume, velocity and variety. Volume means the amount of data generated every second; the data is in a state of rest, and volume is also known as the scale characteristic.
Velocity is the speed with which the data is generated; data generated from social media is an example of high-speed data. Variety means that different types of data can be taken, such as audio, video or documents; it can be numerals, images, time series, arrays, etc.

Data mining analyses data from different perspectives and summarizes it into useful information that can be used for business solutions and for predicting future trends. Data mining (DM), also called Knowledge Discovery in Databases (KDD) or Knowledge Discovery and Data Mining, is the process of searching large volumes of data automatically for patterns such as association rules. It applies many computational techniques from statistics, information retrieval, machine learning and pattern recognition. Data mining extracts only the required patterns from the database in a short time span. Based on the type of patterns to be mined, data mining tasks can be classified into summarization, classification, clustering, association and trend analysis.

Big Data is expanding in all domains of science and engineering, including the physical, biological and biomedical sciences.

II. BIG DATA with DATA MINING

Generally, big data refers to a collection of large volumes of data generated from various sources such as the internet, social media, business organizations and sensors. We can extract useful information from it with the help of data mining, a technique for discovering patterns, as well as descriptive, understandable models, from data at large scale.

Volume is the size of the data, which may be larger than petabytes and terabytes. The scale and growth of size make it difficult to store and analyse the data using traditional tools. Big Data should be used to mine large amounts of data within a predefined period of time.
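As a toy instance of the "association" task listed above (my own illustration, not code from the paper): count the co-occurrence support of item pairs over made-up shopping baskets and keep the pairs above a minimum support threshold.

```python
from collections import Counter
from itertools import combinations

# Toy frequent-pair mining: support of an item pair = fraction of
# transactions containing both items. The baskets are illustrative data.

def frequent_pairs(transactions, min_support):
    counts = Counter()
    for items in transactions:
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    n = len(transactions)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["milk", "eggs"],
    ["bread", "milk", "eggs"],
]
support = frequent_pairs(baskets, min_support=0.5)
```

Here ("bread", "milk") appears in 3 of 4 baskets, so it survives the 0.5 threshold; real association-rule miners (e.g., Apriori) add candidate pruning so this scales beyond pairs.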
Traditional database systems were designed to address small amounts of data which were structured and consistent, whereas Big Data includes a wide variety of data such as geospatial data, audio, video, unstructured text and so on.

Big Data mining refers to the activity of going through big data sets to look for relevant information. To process large volumes of data from different sources quickly, Hadoop is used. Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. Its distributed file system supports fast data transfer rates among nodes and allows the system to continue operating uninterrupted in times of node failure. It runs MapReduce for distributed data processing and works with structured and unstructured data.

III. BIG DATA Characteristics - the HACE Theorem

We have large volumes of heterogeneous data among which complex relationships exist, and we need to discover useful information from this voluminous data.

Let us imagine a scenario in which blind people are asked to describe an elephant. Each blind person may take the trunk for a wall, a leg for a tree, the body for a wall and the tail for a rope, and the blind men can exchange information with each other.

Figure 1: Blind men and the giant elephant

Some of the characteristics of Big Data include:

i. Vast data with heterogeneous and diverse sources: One of the fundamental characteristics of big data is the large volume of data represented by heterogeneous and diverse dimensions. For example, in the biomedical world, a single human being is represented by name, age, gender, family history, etc., while for X-ray and CT scans, images and videos are used.
Heterogeneity refers to the different types of representation of the same individual, and diversity refers to the variety of features used to represent a single piece of information.

ii. Autonomous with distributed and de-centralized control: the sources are autonomous, i.e., automatically generated; they generate information without any centralized control. We can compare this with the World Wide Web (WWW), where each server provides a certain amount of information without depending on other servers.

iii. Complex and evolving relationships: As the size of the data becomes infinitely large, the set of relationships within it also grows. In the early stages, when data is small, there is no complexity in the relationships among the data. Data generated from social media and other sources has complex relationships.

IV. TOOLS: OPEN SOURCE REVOLUTION

Large companies such as Facebook, Yahoo, Twitter and LinkedIn benefit from and contribute to open source projects. In Big Data mining, there are many open source initiatives. The most popular of them are:

Apache Mahout: Scalable machine learning and data mining open source software based mainly on Hadoop. It has implementations of a wide range of machine learning and data mining algorithms: clustering, classification, collaborative filtering and frequent pattern mining.

R: Open source programming language and software environment designed for statistical computing and visualization. R was designed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, beginning in 1993, and is used for statistical analysis of very large data sets.

MOA: Stream data mining open source software to perform data mining in real time. It has implementations of classification, regression, clustering, frequent item set mining and frequent graph mining. It started as a project of the Machine Learning group of the University of Waikato, New Zealand, famous for the WEKA software.
The streams framework provides an environment for defining and running stream processes using simple XML-based definitions and is able to use MOA, Android and Storm.

SAMOA: A new upcoming software project for distributed stream mining that will combine S4 and Storm with MOA.

Vowpal Wabbit: Open source project started at Yahoo! Research and continuing at Microsoft Research to design a fast, scalable, useful learning algorithm. VW is able to learn from terafeature datasets. It can exceed the throughput of any single machine network interface when doing linear learning, via parallel learning.

V. DATA MINING for BIG DATA

Data mining is the process by which data coming from different sources is analysed to discover useful information. Data mining algorithms fall into four categories:

1. Association rules
2. Clustering
3. Classification
4. Regression

Association is used to search for relationships between variables, e.g., frequently co-occurring items; in short, it establishes relationships among objects. Clustering discovers groups and structures in the data. Classification deals with associating an unknown structure to a known structure. Regression finds a function to model the data.

Table 1. Classification of Algorithms (table not reproduced in this extract)

Data mining algorithms can be converted into big MapReduce algorithms on a parallel computing basis.

Table 2. Differences between Data Mining and Big Data (table not reproduced in this extract)

VI. Challenges in BIG DATA

Meeting the challenges of Big Data is difficult. The volume is increasing every day. The velocity is increasing through internet-connected devices.
The variety is also expanding, and the organizations' capability to capture and process the data is limited.

The following are the challenges in the area of Big Data when it is handled:

1. Data capture and storage
2. Data transmission
3. Data curation
4. Data analysis
5. Data visualization

The challenges of big data mining can be divided into three tiers. The first tier is the setup of data mining algorithms. The second tier includes:

1. Information sharing and data privacy.
2. Domain and application knowledge.

The third tier includes:

1. Local learning and model fusion for multiple information sources.
2. Mining from sparse, uncertain and incomplete data.
3. Mining complex and dynamic data.

Figure 2: Phases of Big Data Challenges

Generally, mining data from different data sources is tedious, as the size of the data is large. Big data is stored at different places, collecting that data is a tedious task, and applying basic data mining algorithms to it is an obstacle. Next we need to consider the privacy of the data. The third issue is the mining algorithms themselves: when we apply data mining algorithms to subsets of the data, the results may not be very accurate.

VII. Forecast of the future

There are some challenges that researchers and practitioners will have to deal with during the next years:

Analytics Architecture: It is not yet clear how an optimal architecture of analytics systems should look in order to deal with historic data and with real-time data at the same time. An interesting proposal is the Lambda architecture of Nathan Marz. The Lambda Architecture solves the problem of computing arbitrary functions on arbitrary data in real time by decomposing the problem into three layers: the batch layer, the serving layer, and the speed layer. It combines in the same system Hadoop for the batch layer, and Storm for the speed layer.
The properties of the system are: robust and fault tolerant, scalable, general, extensible, allows ad hoc queries, minimal maintenance, and debuggable.

Statistical significance: It is important to achieve significant statistical results and not be fooled by randomness. As Efron explains in his book on large-scale inference, it is easy to go wrong with huge data sets and thousands of questions to answer at once.

Distributed mining: Many data mining techniques are not trivial to parallelize. To have distributed versions of some methods, a lot of research is needed, with practical and theoretical analysis, to provide new methods.

Time evolving data: Data may be evolving over time, so it is important that Big Data mining techniques are able to adapt and, in some cases, to detect change first. For example, the data stream mining field has very powerful techniques for this task.

Compression: When dealing with Big Data, the quantity of space needed to store it is very relevant. There are two main approaches: compression, where we don't lose anything, and sampling, where we choose the data that is most representative. Using compression, we may take more time and less space, so we can consider it a transformation from time to space. Using sampling, we lose information, but the gains in space may be of orders of magnitude. For example, Feldman et al. use core sets to reduce the complexity of Big Data problems. Core sets are small sets that provably approximate the original data for a given problem. Using merge-reduce, the small sets can then be used for solving hard machine learning problems in parallel.

Visualization: A main task of Big Data analysis is how to visualize the results. As the data is so big, it is very difficult to find user-friendly visualizations.
New techniques and frameworks to tell and show stories will be needed, as for example the photographs, infographics and essays in the beautiful book "The Human Face of Big Data".

Hidden Big Data: Large quantities of useful data are getting lost, since new data is largely untagged and unstructured. The 2012 IDC study on Big Data explains that in 2012, 23% (643 exabytes) of the digital universe would have been useful for Big Data if tagged and analyzed. However, currently only 3% of the potentially useful data is tagged, and even less is analyzed.

VIII. CONCLUSION

The amount of data is growing exponentially due to social networking sites, search and retrieval engines, media sharing sites, stock trading sites, news sources and so on. Big Data is becoming the new area for scientific data research and for business applications. Data mining techniques can be applied to big data to acquire useful information from large datasets; together they can be used to build a useful picture of the data. Big Data analysis tools like MapReduce over Hadoop and HDFS help organizations.
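Since the conclusion above points to MapReduce over Hadoop as the workhorse tool, the map → shuffle → reduce flow can be imitated in a few lines of single-process code. This is a sketch of the programming model only, not of Hadoop's distributed runtime, and the input lines are made up.

```python
from collections import defaultdict

# Single-process imitation of MapReduce: map emits (key, value) pairs,
# shuffle groups them by key, and reduce folds each group. Word count
# is the canonical example; the input lines are illustrative.

def map_phase(lines):
    for line in lines:
        for word in line.lower().split():
            yield word, 1

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data needs big tools",
         "data mining finds patterns in data"]
word_counts = reduce_phase(shuffle(map_phase(lines)))
```

In a real Hadoop job the same map and reduce functions run on many machines, and the shuffle is performed by the framework across the network.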
Christopher M. Bishop
Pattern Recognition and Machine Learning
Springer

Contents

Preface
Mathematical notation

1 Introduction
  1.1 Example: Polynomial Curve Fitting
  1.2 Probability Theory
    1.2.1 Probability densities
    1.2.2 Expectations and covariances
    1.2.3 Bayesian probabilities
    1.2.4 The Gaussian distribution
    1.2.5 Curve fitting re-visited
    1.2.6 Bayesian curve fitting
  1.3 Model Selection
  1.4 The Curse of Dimensionality
  1.5 Decision Theory
    1.5.1 Minimizing the misclassification rate
    1.5.2 Minimizing the expected loss
    1.5.3 The reject option
    1.5.4 Inference and decision
    1.5.5 Loss functions for regression
  1.6 Information Theory
    1.6.1 Relative entropy and mutual information
  Exercises

2 Probability Distributions
  2.1 Binary Variables
    2.1.1 The beta distribution
  2.2 Multinomial Variables
    2.2.1 The Dirichlet distribution
  2.3 The Gaussian Distribution
    2.3.1 Conditional Gaussian distributions
    2.3.2 Marginal Gaussian distributions
    2.3.3 Bayes' theorem for Gaussian variables
    2.3.4 Maximum likelihood for the Gaussian
    2.3.5 Sequential estimation
    2.3.6 Bayesian inference for the Gaussian
    2.3.7 Student's t-distribution
    2.3.8 Periodic variables
    2.3.9 Mixtures of Gaussians
  2.4 The Exponential Family
    2.4.1 Maximum likelihood and sufficient statistics
    2.4.2 Conjugate priors
    2.4.3 Noninformative priors
  2.5 Nonparametric Methods
    2.5.1 Kernel density estimators
    2.5.2 Nearest-neighbour methods
  Exercises

3 Linear Models for Regression
  3.1 Linear Basis Function Models
    3.1.1 Maximum likelihood and least squares
    3.1.2 Geometry of least squares
    3.1.3 Sequential learning
    3.1.4 Regularized least squares
    3.1.5 Multiple outputs
  3.2 The Bias-Variance Decomposition
  3.3 Bayesian Linear Regression
    3.3.1 Parameter distribution
    3.3.2 Predictive distribution
    3.3.3 Equivalent kernel
  3.4 Bayesian Model Comparison
  3.5 The Evidence Approximation
    3.5.1 Evaluation of the evidence function
    3.5.2 Maximizing the evidence function
    3.5.3 Effective number of parameters
  3.6 Limitations of Fixed Basis Functions
  Exercises

4 Linear Models for Classification
  4.1 Discriminant Functions
    4.1.1 Two classes
    4.1.2 Multiple classes
    4.1.3 Least squares for classification
    4.1.4 Fisher's linear discriminant
    4.1.5 Relation to least squares
    4.1.6 Fisher's discriminant for multiple classes
    4.1.7 The perceptron algorithm
  4.2 Probabilistic Generative Models
    4.2.1 Continuous inputs
    4.2.2 Maximum likelihood solution
    4.2.3 Discrete features
    4.2.4 Exponential family
  4.3 Probabilistic Discriminative Models
    4.3.1 Fixed basis functions
    4.3.2 Logistic regression
    4.3.3 Iterative reweighted least squares
    4.3.4 Multiclass logistic regression
    4.3.5 Probit regression
    4.3.6 Canonical link functions
  4.4 The Laplace Approximation
    4.4.1 Model comparison and BIC
  4.5 Bayesian Logistic Regression
    4.5.1 Laplace approximation
    4.5.2 Predictive distribution
  Exercises

5 Neural Networks
  5.1 Feed-forward Network Functions
    5.1.1 Weight-space symmetries
  5.2 Network Training
    5.2.1 Parameter optimization
    5.2.2 Local quadratic approximation
    5.2.3 Use of gradient information
    5.2.4 Gradient descent optimization
  5.3 Error Backpropagation
    5.3.1 Evaluation of error-function derivatives
    5.3.2 A simple example
    5.3.3 Efficiency of backpropagation
    5.3.4 The Jacobian matrix
  5.4 The Hessian Matrix
    5.4.1 Diagonal approximation
    5.4.2 Outer product approximation
    5.4.3 Inverse Hessian
    5.4.4 Finite differences
    5.4.5 Exact evaluation of the Hessian
    5.4.6 Fast multiplication by the Hessian
  5.5 Regularization in Neural Networks
    5.5.1 Consistent Gaussian priors
    5.5.2 Early stopping
    5.5.3 Invariances
    5.5.4 Tangent propagation
    5.5.5 Training with transformed data
    5.5.6 Convolutional networks
    5.5.7 Soft weight sharing
  5.6 Mixture Density Networks
  5.7 Bayesian Neural Networks
    5.7.1 Posterior parameter distribution
    5.7.2 Hyperparameter optimization
    5.7.3 Bayesian neural networks for classification
  Exercises

6 Kernel Methods
  6.1 Dual Representations
  6.2 Constructing Kernels
  6.3 Radial Basis Function Networks
    6.3.1 Nadaraya-Watson model
  6.4 Gaussian Processes
    6.4.1 Linear regression revisited
    6.4.2 Gaussian processes for regression
    6.4.3 Learning the hyperparameters
    6.4.4 Automatic relevance determination
    6.4.5 Gaussian processes for classification
    6.4.6 Laplace approximation
    6.4.7 Connection to neural networks
  Exercises

7 Sparse Kernel Machines
  7.1 Maximum Margin Classifiers
    7.1.1 Overlapping class distributions
    7.1.2 Relation to logistic regression
    7.1.3 Multiclass SVMs
    7.1.4 SVMs for regression
    7.1.5 Computational learning theory
  7.2 Relevance Vector Machines
    7.2.1 RVM for regression
    7.2.2 Analysis of sparsity
    7.2.3 RVM for classification
  Exercises

8 Graphical Models
  8.1 Bayesian Networks
    8.1.1 Example: Polynomial regression
    8.1.2 Generative models
    8.1.3 Discrete variables
    8.1.4 Linear-Gaussian models
  8.2 Conditional Independence
    8.2.1 Three example graphs
    8.2.2 D-separation
  8.3 Markov Random Fields
    8.3.1 Conditional independence properties
    8.3.2 Factorization properties
    8.3.3 Illustration: Image de-noising
    8.3.4 Relation to directed graphs
  8.4 Inference in Graphical Models
    8.4.1 Inference on a chain
    8.4.2 Trees
    8.4.3 Factor graphs
    8.4.4 The sum-product algorithm
    8.4.5 The max-sum algorithm
    8.4.6 Exact inference in general graphs
    8.4.7 Loopy belief propagation
    8.4.8 Learning the graph structure
  Exercises

9 Mixture Models and EM
  9.1 K-means Clustering
    9.1.1 Image segmentation and compression
  9.2 Mixtures of Gaussians
    9.2.1 Maximum likelihood
    9.2.2 EM for Gaussian mixtures
  9.3 An Alternative View of EM
    9.3.1 Gaussian mixtures revisited
    9.3.2 Relation to K-means
    9.3.3 Mixtures of Bernoulli distributions
    9.3.4 EM for Bayesian linear regression
  9.4 The EM Algorithm in General
  Exercises

10 Approximate Inference
  10.1 Variational Inference
    10.1.1 Factorized distributions
    10.1.2 Properties of factorized approximations
    10.1.3 Example: The univariate Gaussian
    10.1.4 Model comparison
  10.2 Illustration: Variational Mixture of Gaussians
    10.2.1 Variational distribution
    10.2.2 Variational lower bound
    10.2.3 Predictive density
    10.2.4 Determining the number of components
    10.2.5 Induced factorizations
  10.3 Variational Linear Regression
    10.3.1 Variational distribution
    10.3.2 Predictive distribution
    10.3.3 Lower bound
  10.4 Exponential Family Distributions
    10.4.1 Variational message passing
  10.5 Local Variational Methods
  10.6 Variational Logistic Regression
    10.6.1 Variational posterior distribution
    10.6.2 Optimizing the variational parameters
    10.6.3 Inference of hyperparameters
  10.7 Expectation Propagation
    10.7.1 Example: The clutter problem
    10.7.2 Expectation propagation on graphs
  Exercises

11 Sampling Methods
  11.1 Basic Sampling Algorithms
    11.1.1 Standard distributions
    11.1.2 Rejection sampling
    11.1.3 Adaptive rejection sampling
    11.1.4 Importance sampling
    11.1.5 Sampling-importance-resampling
    11.1.6 Sampling and the EM algorithm
  11.2 Markov Chain Monte Carlo
    11.2.1 Markov chains
    11.2.2 The Metropolis-Hastings algorithm
  11.3 Gibbs Sampling
  11.4 Slice Sampling
  11.5 The Hybrid Monte Carlo Algorithm
    11.5.1 Dynamical systems
    11.5.2 Hybrid Monte Carlo
  11.6 Estimating the Partition Function
  Exercises

12
Continuous Latent Variables 12.1 Principal Component Analysis 12.1.1 Maximum variance formulation 12.1.2 Minimum-error formulation 12.1.3 Applications of PCA 12.1.4 PCA for high-dimensional data 475 481 482 483 485 486 486 488 489 490 491 493 498 498 500 502 505 511 513 517 523 526 526 528 530 532 534 536 537 539 541 542 546 548 548 552 554 556 559 561 561 563 565 569CONTENTS 12.2 Probabilistic PCA 12.2.1 Maximum likelihood PCA 12.2.2 EM algorithm for PCA 12.2.3 Bayesian PCA 12.2.4 Factor analysis 12.3 Kernel PCA 12.4 Nonlinear Latent Variable Models 12.4.1 Independent component analysis 12.4.2 Autoassociative neural networks 12.4.3 Modelling nonlinear manifolds Exercises 13 Sequential Data 13.1 Markov Models 13.2 Hidden Markov Models 13.2.1 Maximum likelihood for the HMM 13.2.2 The forward-backward algorithm 13.2.3 The sum-product algorithm for the HMM 13.2.4 Scaling factors 13.2.5 The Viterbi algorithm 13.2.6 Extensions of the hidden Markov model 13.3 Linear Dynamical Systems 13.3.1 Inference in LDS 13.3.2 Learning in LDS 13.3.3 Extensions of LDS 13.3.4 Particle filters Exercises 14 Combining Models 14.1 Bayesian Model Averaging 14.2 Committees 14.3 Boosting 14.3.1 Minimizing exponential error 14.3.2 Error functions for boosting 14.4 Tree-based Models 14.5 Conditional Mixture Models 14.5.1 Mixtures of linear regression models 14.5.2 Mixtures of logistic models 14.5.3 Mixtures of experts Exercises Appendix A Data Sets Appendix B Probability Distributionsxix 570 574 577 580 583 586 591 591 592 595 599 605 607 610 615 618 625 627 629 631 635 638 642 644 645 646 653 654 655 657 659 661 663 666 667 670 672 674 677 685 695Appendix C Properties of MatricesxxCONTENTS Appendix D Appendix E References Index Calculus of Variations Lagrange Multipliers 703 707 711 729。