LMS AND SMI ALGORITHMS FOR SPATIAL ADAPTIVE INTERFERENCE REJECTION
A Model-Agnostic Local Explanation Method Based on Active Learning and the Rational Quadratic Kernel
ZHOU Shenghao; YUAN Weiwei; GUAN Donghai
【Journal】Computer Science (《计算机科学》)
【Year (Volume), Issue】2024, 51(2)
【Abstract】The widespread use of deep learning models has made it increasingly clear that explaining model decisions is an urgent problem: complex, hard-to-interpret black-box models hinder the deployment of these algorithms in real-world settings. LIME, the most popular local explanation method, generates perturbed data that are unstable, which biases the resulting explanations. To address this, a model-agnostic local explanation method based on active learning and the rational quadratic kernel, ActiveLIME, is proposed, which makes the local explanation model more faithful to the original classifier. After generating perturbations, ActiveLIME samples them with an active-learning query strategy, keeps the high-uncertainty perturbations for training, and uses the most accurate local model found during the iterations to explain the instance of interest. In addition, for high-dimensional sparse samples that are prone to local overfitting, the rational quadratic kernel is introduced into the model's loss function to reduce overfitting. Experiments show that ActiveLIME achieves higher local fidelity and better explanation quality than traditional local explanation methods.
【Pages】7 (pp. 245-251)
【Authors】ZHOU Shenghao; YUAN Weiwei; GUAN Donghai
【Affiliation】College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics
【Language】Chinese
【CLC Classification】TP391
【Related literature】
1. A local causal structure active learning method based on causal strength
2. A weighted improvement of the local gray-level model in the active shape model algorithm
3. Research on locating rivet holes in aircraft panels based on a local active contour model
4. An anomaly detection method based on fuzzy kernel clustering and active learning
5. A fast active learning method based on the kernel extreme learning machine and its soft-sensor application
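The abstract above combines two ingredients: an active-learning selection of perturbations and a rational quadratic kernel in the loss of the local surrogate. The sketch below is a minimal, hypothetical illustration of the second ingredient only, reading the kernel as the proximity weighting in a LIME-style weighted loss; the function names, the use of ridge regression, and all parameter values are assumptions for illustration, not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import Ridge

def rational_quadratic(d, alpha=1.0, length=1.0):
    # Rational quadratic kernel k(d) = (1 + d^2 / (2*alpha*length^2))^(-alpha);
    # heavier-tailed than the Gaussian proximity kernel used by default in LIME.
    return (1.0 + d**2 / (2.0 * alpha * length**2)) ** (-alpha)

def fit_local_surrogate(x, perturbations, black_box, alpha=1.0, length=1.0):
    """Fit a locally weighted linear surrogate around instance x.

    `black_box` is assumed to return the original classifier's probability for
    the class of interest as a 1-D array; the kernel weights emphasise
    perturbations close to x. The active-learning step of ActiveLIME (keeping
    only high-uncertainty perturbations) is omitted here for brevity.
    """
    dists = np.linalg.norm(perturbations - x, axis=1)
    weights = rational_quadratic(dists, alpha, length)
    targets = black_box(perturbations)
    surrogate = Ridge(alpha=1.0)
    surrogate.fit(perturbations, targets, sample_weight=weights)
    return surrogate  # the coefficients serve as the local explanation
```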
GSPBOX_-Atoolboxforsignalprocessingongraphs_GSPBOX:A toolbox for signal processing on graphsNathanael Perraudin,Johan Paratte,David Shuman,Lionel Martin Vassilis Kalofolias,Pierre Vandergheynst and David K.HammondMarch 16,2016AbstractThis document introduces the Graph Signal Processing Toolbox (GSPBox)a framework that can be used to tackle graph related problems with a signal processing approach.It explains the structure and the organization of this software.It also contains a general description of the important modules.1Toolbox organizationIn this document,we brie?y describe the different modules available in the toolbox.For each of them,the main functions are brie?y described.This chapter should help making the connection between the theoretical concepts introduced in [7,9,6]and the technical documentation provided with the toolbox.We highly recommend to read this document and the tutorial before using the toolbox.The documentation,the tutorials and other resources are available on-line 1.The toolbox has ?rst been implemented in MATLAB but a port to Python,called the PyGSP,has been made recently.As of the time of writing of this document,not all the functionalities have been ported to Python,but the main modules are already available.In the following,functions pre?xed by [M]:refer to the MATLAB implementation and the ones pre?xed with [P]:refer to the Python implementation. 1.1General structure of the toolbox (MATLAB)The general design of the GSPBox focuses around the graph object [7],a MATLAB structure containing the necessary infor-mations to use most of the algorithms.By default,only a few attributes are available (see section 2),allowing only the use of a subset of functions.In order to enable the use of more algorithms,additional ?elds can be added to the graph structure.For example,the following line will compute the graph Fourier basis enabling exact ?ltering operations.1G =gsp_compute_fourier_basis(G);Ideally,this operation should be done on the ?y when exact ?ltering is required.Unfortunately,the lack of well de?ned class paradigm in MATLAB makes it too complicated to be implemented.Luckily,the above formulation prevents any unnecessary data copy of the data contained in the structure G .In order to avoid name con?icts,all functions in the GSPBox start with [M]:gsp_.A second important convention is that all functions applying a graph algorithm on a graph signal takes the graph as ?rst argument.For example,the graph Fourier transform of the vector f is computed by1fhat =gsp_gft(G,f);1Seehttps://lts2.epfl.ch/gsp/doc/for MATLAB and https://lts2.epfl.ch/pygsp for Python.The full documentation is also avail-able in a single document:https://lts2.epfl.ch/gsp/gspbox.pdf1a r X i v :1408.5781v 2 [c s .I T ] 15 M a r 2016The graph operators are described in section4.Filtering a signal on a graph is also a linear operation.However,since the design of special?lters(kernels)is important,they are regrouped in a dedicated module(see section5).The toolbox contains two additional important modules.The optimization module contains proximal operators,projections and solvers compatible with the UNLocBoX[5](see section6).These functions facilitate the de?nition of convex optimization problems using graphs.Finally,section??is composed of well known graph machine learning algorithms.1.2General structure of the toolbox(Python)The structure of the Python toolbox follows closely the MATLAB one.The major difference comes from the fact that the Python implementation is object-oriented and thus allows for 
a natural use of instances of the graph object.For example the equivalent of the MATLAB call:1G=gsp_estimate_lmax(G);can be achieved using a simple method call on the graph object:1G.estimate_lmax()Moreover,the use of class for the"graph object"allows to compute additional graph attributes on the?y,making the code clearer as its MATLAB equivalent.Note though that functionalities are grouped into different modules(one per section below) and that several functions that work on graphs have to be called directly from the modules.For example,one should write:1layers=pygsp.operators.kron_pyramid(G,levels)This is the case as soon as the graph is the structure on which the action has to be performed and not our principal focus.In a similar way to the MATLAB implementation using the UNLocBoX for the convex optimization routines,the Python implementation uses the PyUNLocBoX,which is the Python port of the UNLocBoX. 2GraphsThe GSPBox is constructed around one main object:the graph.It is implemented as a structure in Matlab and as a class in Python.It stores the nodes,the edges and other attributes related to the graph.In the implementation,a graph is fully de?ned by the weight matrix W,which is the main and only required attribute.Since most graph structures are far from fully connected, W is implemented as a sparse matrix.From the weight matrix a Laplacian matrix is computed and stored as an attribute of the graph object.Different other attributes are available such as plotting attributes,vertex coordinates,the degree matrix,the number of vertices and edges.The list of all attributes is given in table1.2Attribute Format Data type DescriptionMandatory?eldsW N x N sparse matrix double Weight matrix WL N x N sparse matrix double Laplacian matrixd N x1vector double The diagonal of the degree matrixN scalar integer Number of verticesNe scalar integer Number of edgesplotting[M]:structure[P]:dict none Plotting parameterstype text string Name,type or short descriptiondirected scalar[M]:logical[P]:boolean State if the graph is directed or notlap_type text string Laplacian typeOptional?eldsA N x N sparse matrix[M]:logical[P]:boolean Adjacency matrixcoords N x2or N x3matrix double Vectors of coordinates in2D or3D.lmax scalar double Exact or estimated maximum eigenvalue U N x N matrix double Matrix of eigenvectorse N x1vector double Vector of eigenvaluesmu scalar double Graph coherenceTable1:Attributes of the graph objectThe easiest way to create a graph is the[M]:gsp_graph[P]:pygsp.graphs.Graph function which takes the weight matrix as input.This function initializes a graph structure by creating the graph Laplacian and other useful attributes.Note that by default the toolbox uses the combinatorial de?nition of the Laplacian operator.Other Laplacians can be computed using the[M]:gsp_create_laplacian[P]:pygsp.gutils.create_laplacian function.Please note that almost all functions are dependent of the Laplacian de?nition.As a result,it is important to select the correct de?nition at? rst.Many particular graphs are also available using helper functions such as:ring,path,comet,swiss roll,airfoil or two moons. 
In addition,functions are provided for usual non-deterministic graphs suchas:Erdos-Renyi,community,Stochastic Block Model or sensor networks graphs.Nearest Neighbors(NN)graphs form a class which is used in many applications and can be constructed from a set of points (or point cloud)using the[M]:gsp_nn_graph[P]:pygsp.graphs.NNGraph function.The function is highly tunable and can handle very large sets of points using FLANN[3].Two particular cases of NN graphs have their dedicated helper functions:3D point clouds and image patch-graphs.An example of the former can be seen in thefunction[M]:gsp_bunny[P]:pygsp.graphs.Bunny.As for the second,a graph can be created from an image by connecting similar patches of pixels together.The function[M]:gsp_patch_graph creates this graph.Parameters allow the resulting graph to vary between local and non-local and to use different distance functions [12,4].A few examples of the graphs are displayed in Figure1.3PlottingAs in many other domains,visualization is very important in graph signal processing.The most basic operation is to visualize graphs.This can be achieved using a call to thefunction[M]:gsp_plot_graph[P]:pygsp.plotting.plot_graph. In order to be displayable,a graph needs to have2D(or3D)coordinates(which is a?eld of the graph object).Some graphs do not possess default coordinates(e.g.Erdos-Renyi).The toolbox also contains routines to plot signals living on graphs.The function dedicated to this task is[M]:gsp_plot_ signal[P]:pygsp.plotting.plot_signal.For now,only1D signals are supported.By default,the value of the signal is displayed using a color coding,but bars can be displayed by passing parameters.3Figure 1:Examples of classical graphs :two moons (top left),community (top right),airfoil (bottom left)and sensor network (bottom right).The third visualization helper is a function to plot ?lters (in the spectral domain)which is called [M]:gsp_plot_filter [P]:pygsp.plotting.plot_filter .It also supports ?lter-banks and allows to automatically inspect the related frames.The results obtained using these three plotting functions are visible in Fig.2.4OperatorsThe module operators contains basics spectral graph functions such as Fourier transform,localization,gradient,divergence or pyramid decomposition.Since all operator are based on the Laplacian de? nition,the necessary underlying objects (attributes)are all stored into a single object:the graph.As a ?rst example,the graph Fourier transform [M]:gsp_gft [P]:pygsp.operators.gft requires the Fourier basis.This attribute can be computed with the function [M]:gsp_compute_fourier_basis[P]:/doc/c09ff3e90342a8956bec0975f46527d3240ca692.html pute_fourier_basis [9]that adds the ?elds U ,e and lmax to the graph structure.As a second example,since the gradient and divergence operate on the edges of the graph,a search on the edge matrix is needed to enable the use of these operators.It can be done with the routines [M]:gsp_adj2vec[P]:pygsp.operators.adj2vec .These operations take time and should4Figure 2:Visualization of graph and signals using plotting functions.NameEdge derivativefe (i,j )Laplacian matrix (operator)Available Undirected graph Combinatorial LaplacianW (i,j )(f (j )?f (i ))D ?WV Normalized Laplacian W (i,j ) f (j )√d (j )f (i )√d (i )D ?12(D ?W )D ?12V Directed graph Combinatorial LaplacianW (i,j )(f (j )?f (i ))12(D ++D ??W ?W ?)V Degree normalized Laplacian W (i,j ) f (j )√d ?(j )?f (i )√d +(i )I ?12D ?12+[W +W ?]D ?12V Distribution normalized Laplacianπ(i ) p (i,j )π(j )f (j )? 
p (i,j )π(i )f (i )12 Π12PΠ12+Π?12P ?Π12 VTable 2:Different de?nitions for graph Laplacian operator and their associated edge derivative.(For directed graph,d +,D +and d ?,D ?de?ne the out degree and in-degree of a node.π,Πis the stationary distribution of the graph and P is a normalized weight matrix W .For sake of clarity,exact de?nition of those quantities are not given here,but can be found in [14].)be performed only once.In MATLAB,these functions are called explicitly by the user beforehand.However,in Python they are automatically called when needed and the result stored as an attribute. The module operator also includes a Multi-scale Pyramid Transform for graph signals [6].Again,it works in two steps.Firstthe pyramid is precomputed with [M]:gsp_graph_multiresolution [P]:pygsp.operators.graph_multiresolution .Second the decomposition of a signal is performed with [M]:gsp_pyramid_analysis [P]:pygsp.operators.pyramid_analysis .The reconstruction uses [M]:gsp_pyramid_synthesis [P]:pygsp.operators.pyramid_synthesis .The Laplacian is a special operator stored as a sparse matrix in the ?eld L of the graph.Table 2summarizes the available de?nitions.We are planning to implement additional ones.5FiltersFilters are a special kind of linear operators that are so prominent in the toolbox that they deserve their own module [9,7,2,8,2].A ?lter is simply an anonymous function (in MATLAB)or a lambda function (in Python)acting element-by-element on the input.In MATLAB,a ?lter-bank is created simply by gathering these functions together into a cell array.For example,you would write:51%g(x)=x^2+sin(x)2g=@(x)x.^2+sin(x);3%h(x)=exp(-x)4h=@(x)exp(-x);5%Filterbank composed of g and h6fb={g,h};The toolbox contains many prede?ned design of?lter.They all start with[M]:gsp_design_in MATLAB and are in the module[P]:pygsp.filters in Python.Once a?lter(or a?lter-bank)is created,it can be applied to a signal with[M]: gsp_filter_analysis in MATLAB and a call to the method[P]:analysis of the?lter object in Python.Note that the toolbox uses accelerated algorithms to scale almost linearly with the number of sample[11].The available type of?lter design of the GSPBox can be classi?ed as:Wavelets(Filters are scaled version of a mother window)Gabor(Filters are shifted version of a mother window)Low passlter(Filters to de-noise a signal)High pass/Low pass separationlterbank(tight frame of2lters to separate the high frequencies from the low ones.No energy is lost in the process)Additionally,to adapt the?lter to the graph eigen-distribution,the warping function[M]:gsp_design_warped_translates [P]:pygsp.filters.WarpedTranslates can be used[10].6UNLocBoX BindingThis module contains special wrappers for the UNLocBoX[5].It allows to solve convex problems containing graph terms very easily[13,15,14,1].For example,the proximal operator of the graph TV norm is given by[M]:gsp_prox_tv.The optimization module contains also some prede?ned problems such as graph basis pursuit in[M]:gsp_solve_l1or wavelet de-noising in[M]:gsp_wavelet_dn.There is still active work on this module so it is expected to grow rapidly in the future releases of the toolbox.7Toolbox conventions7.1General conventionsAs much as possible,all small letters are used for vectors(or vector stacked into a matrix)and capital are reserved for matrices.A notable exception is the creation of nearest neighbors graphs.A variable should never have the same name as an already existing function in MATLAB or Python respectively.This makes the code easier to read and less prone to 
errors.This is a best coding practice in general,but since both languages allow the override of built-in functions,a special care is needed.All function names should be lowercase.This avoids a lot of confusion because some computer architectures respect upper/lower casing and others do not.As much as possible,functions are named after the action they perform,rather than the algorithm they use,or the person who invented it.No global variables.Global variables makes it harder to debug and the code is harder to parallelize.67.2MATLABAll function start by gsp_.The graph structure is always therst argument in the function call.Filters are always second.Finally,optional parameter are last.In the toolbox,we do use any argument helper functions.As a result,optional argument are generally stacked into a graph structure named param.If a transform works on a matrix,it will per default work along the columns.This is a standard in Matlab(fft does this, among many other functions).Function names are traditionally written in uppercase in MATLAB documentation.7.3PythonAll functions should be part of a module,there should be no call directly from pygsp([P]:pygsp.my_function).Inside a given module,functionalities can be further split in differentles regrouping those that are used in the same context.MATLAB’s matrix operations are sometimes ported in a different way that preserves the efciency of the code.When matrix operations are necessary,they are all performed through the numpy and scipy libraries.Since Python does not come with a plotting library,we support both matplotlib and pyqtgraph.One should install the required libraries on his own.If both are correctly installed,then pyqtgraph is favoured unless speci?cally speci?ed. AcknowledgementsWe would like to thanks all coding authors of the GSPBOX.The toolbox was ported in Python by Basile Chatillon,Alexandre Lafaye and Nicolas Rod.The toolbox was also improved by Nauman Shahid and Yann Sch?nenberger.References[1]M.Belkin,P.Niyogi,and V.Sindhwani.Manifold regularization:A geometric framework for learning from labeled and unlabeledexamples.The Journal of Machine Learning Research,7:2399–2434,2006.[2] D.K.Hammond,P.Vandergheynst,and R.Gribonval.Wavelets on graphs via spectral graph theory.Applied and ComputationalHarmonic Analysis,30(2):129–150,2011.[3]M.Muja and D.G.Lowe.Scalable nearest neighbor algorithms for high dimensional data.Pattern Analysis and Machine Intelligence,IEEE Transactions on,36,2014.[4]S.K.Narang,Y.H.Chao,and A.Ortega.Graph-wavelet?lterbanks for edge-aware image processing.In Statistical Signal ProcessingWorkshop(SSP),2012IEEE,pages141–144.IEEE,2012.[5]N.Perraudin,D.Shuman,G.Puy,and P.Vandergheynst.UNLocBoX A matlab convex optimization toolbox using proximal splittingmethods.ArXiv e-prints,Feb.2014.[6] D.I.Shuman,M.J.Faraji,and P.Vandergheynst.A multiscale pyramid transform for graph signals.arXiv preprint arXiv:1308.4942,2013.[7] D.I.Shuman,S.K.Narang,P.Frossard,A.Ortega,and P.Vandergheynst.The emerging?eld of signal processing on graphs:Extendinghigh-dimensional data analysis to networks and other irregular domains.Signal Processing Magazine,IEEE,30(3):83–98,2013.7[8] D.I.Shuman,B.Ricaud,and P.Vandergheynst.A windowed graph Fourier transform.Statistical Signal Processing Workshop(SSP),2012IEEE,pages133–136,2012.[9] D.I.Shuman,B.Ricaud,and P.Vandergheynst.Vertex-frequency analysis on graphs.arXiv preprint arXiv:1307.5708,2013.[10] D.I.Shuman,C.Wiesmeyr,N.Holighaus,and P.Vandergheynst.Spectrum-adapted tight graph wavelet and 
vertex-frequency frames. arXiv preprint arXiv:1311.0897, 2013.
[11] A. Susnjara, N. Perraudin, D. Kressner, and P. Vandergheynst. Accelerated filtering on graphs using Lanczos method. arXiv preprint arXiv:1509.04537, 2015.
[12] F. Zhang and E. R. Hancock. Graph spectral image smoothing using the heat kernel. Pattern Recognition, 41(11):3328–3342, 2008.
[13] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. Advances in Neural Information Processing Systems, 16(16):321–328, 2004.
[14] D. Zhou, J. Huang, and B. Schölkopf. Learning from labeled and unlabeled data on a directed graph. In Proceedings of the 22nd International Conference on Machine Learning, pages 1036–1043, New York, NY, USA, 2005. ACM Press.
[15] D. Zhou and B. Schölkopf. A regularization framework for learning from graph data. 2004.
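The GSPBox text above describes a pipeline of building a graph from a weight matrix W, computing the graph Fourier basis, and filtering signals in the spectral domain. The toy NumPy sketch below re-implements that pipeline directly from the definitions in the text rather than through the toolbox API (whose exact signatures are not reproduced here); the example graph, signal, and heat-kernel filter are arbitrary choices for illustration.

```python
import numpy as np

# Small undirected graph given by a symmetric weight matrix W with zero diagonal.
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

D = np.diag(W.sum(axis=1))
L = D - W                               # combinatorial Laplacian, the toolbox default

# "Fourier basis": eigendecomposition of L (what gsp_compute_fourier_basis provides).
eigvals, U = np.linalg.eigh(L)

f = np.array([0.0, 1.0, 0.5, 2.0])      # a signal living on the vertices
fhat = U.T @ f                          # graph Fourier transform (cf. gsp_gft)

# Spectral filtering with a heat kernel g(lambda) = exp(-tau * lambda),
# analogous to applying a low-pass filter object to the signal.
tau = 1.0
f_smooth = U @ (np.exp(-tau * eigvals) * fhat)
print(f_smooth)
```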
Variable Step-Size LMS Filtering Algorithm Based on the Multi-Scale Wavelet Transform
GOU Xinke; WANG Nana
【Journal】Industrial Instrumentation & Automation (《工业仪表与自动化装置》)
【Year (Volume), Issue】2012(000)003
【Abstract】Building on studies of the LMS adaptive algorithm, a variable step-size LMS algorithm based on the sampling function, and adaptive filtering based on the multi-scale wavelet transform, a new algorithm is proposed that combines the variable step-size LMS algorithm with the multi-scale wavelet transform. On one hand, the new algorithm overcomes the conflict, inherent in the fixed step-size LMS algorithm, between convergence speed, convergence accuracy and the step-size factor; on the other hand, the wavelet transform reduces the condition number of the autocorrelation matrix of the input vector, improving convergence speed, tracking performance and stability. Computer simulations comparing the algorithms show that the variable step-size LMS filtering algorithm based on the multi-scale wavelet transform converges faster and suppresses noise more strongly.
【Pages】6 (pp. 5-9, 13)
【Keywords】adaptive filtering; LMS algorithm; variable step size; wavelet transform; Matlab simulation
【Authors】GOU Xinke; WANG Nana
【Affiliation】College of Electrical and Information Engineering, Lanzhou University of Technology, Lanzhou 730050
【Language】Chinese
【CLC Classification】O231
0 Introduction
In 1967, Widrow et al. proposed the adaptive LMS filtering algorithm, in which the parameters of the adaptive filtering system adjust automatically toward the optimum; at design time, little or no prior statistical knowledge of the signal and noise is required.
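The abstract above turns on the idea of letting the LMS step size vary with the error, so that convergence speed and steady-state accuracy no longer trade off through a single fixed step. The sketch below shows one common error-driven step-size rule; it is a generic illustration under assumed constants, not the specific update or the wavelet-domain preprocessing proposed in the paper.

```python
import numpy as np

def vss_lms(d, x, n_taps=8, alpha=0.97, gamma=0.5, mu_max=0.05):
    """Variable step-size LMS: the step size grows with the error magnitude
    for fast initial convergence and shrinks as the error settles.

    d : desired signal, x : filter input, 1-D arrays of equal length.
    The error-driven step-size rule and its constants are illustrative.
    """
    w = np.zeros(n_taps)
    y = np.zeros_like(d)
    e = np.zeros_like(d)
    mu = mu_max / 2
    for n in range(n_taps, len(d)):
        u = x[n - n_taps:n][::-1]            # current tap-input vector
        y[n] = w @ u
        e[n] = d[n] - y[n]
        mu = alpha * mu + gamma * e[n] ** 2  # error-driven step-size update
        mu = min(mu, mu_max)                 # keep the step size in a stable range
        w = w + 2 * mu * e[n] * u            # LMS weight update
    return w, y, e
```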
A Variable Step-Size LMS Algorithm for Fetal ECG Extraction
YUAN Li; WU Shuicai; YUAN Yanchao; GAO Xiaofeng
【Journal】China Medical Devices (《中国医疗设备》)
【Year (Volume), Issue】2018, 33(3)
【Abstract】Objective: To resolve the conflict between convergence speed and steady-state error in the LMS algorithm, a variable step-size LMS algorithm for extracting the fetal ECG is proposed. Methods: The maternal abdominal signal is used as the primary input and the maternal thoracic signal as the reference input; both channels are preprocessed to remove baseline wander, and the proposed algorithm is then used to extract the fetal ECG. Results: Experiments show that the proposed algorithm outperforms the conventional LMS algorithm in both convergence speed and tracking ability, and extracts a clear fetal ECG. Conclusion: The improved LMS algorithm can extract a relatively clean fetal ECG.
【Pages】4 (pp. 15-17, 21)
【Authors】YUAN Li; WU Shuicai; YUAN Yanchao; GAO Xiaofeng
【Affiliations】College of Life Science and Bioengineering, Beijing University of Technology, Beijing 100124 (YUAN Li, WU Shuicai, YUAN Yanchao); 北京麦迪克斯科技有限公司, Beijing 100095 (GAO Xiaofeng)
【Language】Chinese
【CLC Classification】R540.4+1
【Related literature】
1. Fetal ECG extraction using a wavelet-domain variable step-size ICA algorithm [J], ZHAN Hailong; ZHAO Zhidong
2. A variable step-size LMS algorithm based on variable tap length [J], LEI Yilong; YU Tao
3. Application of an LMS algorithm with iteration-count-based variable step size to ECG signal extraction [J], SUN Junxiang; ZHAO Jiyin
4. An improved variable step-size LMS algorithm for vehicle interior noise control [J], QIAN Mengnan; LU Jianwei; YAN Guixi; GUO Jiahao
5. Heart-rate signal extraction based on a piecewise variable step-size truncated-error LMS algorithm [J], ZHAO Jiyin; TIAN Baofeng; LI Jianpo; SUI Junling
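The fetal ECG setup described in the abstract above is a standard adaptive noise cancellation arrangement: the abdominal lead is the primary input and the thoracic lead is the reference input. A minimal sketch with a fixed step size is given below; the paper's variable step-size rule could replace the constant mu (as in the previous sketch), and the tap count and constants here are assumptions.

```python
import numpy as np

def cancel_maternal_ecg(abdominal, thoracic, n_taps=32, mu=0.01):
    """Adaptive noise cancellation for fetal ECG extraction.

    abdominal : primary input d[n], containing maternal + fetal ECG.
    thoracic  : reference input x[n], essentially the maternal ECG.
    The LMS filter learns the maternal component present in the abdominal
    lead; the residual error e[n] is the fetal ECG estimate.
    """
    w = np.zeros(n_taps)
    fetal = np.zeros_like(abdominal)
    for n in range(n_taps, len(abdominal)):
        u = thoracic[n - n_taps:n][::-1]
        maternal_est = w @ u
        fetal[n] = abdominal[n] - maternal_est   # error = fetal ECG estimate
        w = w + 2 * mu * fetal[n] * u
    return fetal
```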
MI Registration Algorithm Thresholds

1. What is the MI registration algorithm
The MI registration algorithm is an image registration method based on mutual information. It aligns two images so that they coincide exactly or approximately in space. Image registration is widely used in medical image processing, computer vision and remote sensing.

2. Principle of the MI registration algorithm
The MI registration algorithm measures the similarity between images with mutual information. Mutual information is a statistic that quantifies the dependence between two random variables; in image registration it is used to measure how similar two images are.
The main steps of the MI registration algorithm are as follows:
Step 1: Image preprocessing. The two images to be registered are first preprocessed (gray-scale conversion, filtering, denoising, etc.) to improve the accuracy and stability of the registration.
Step 2: Choose the reference image and the floating image. One image from the set is chosen as the reference image and the others are treated as floating images to be registered; the reference image serves as the benchmark for the registration.
Step 3: Compute the mutual information. For each pair of reference and floating images, compute their mutual information

I(X; Y) = Σ_{x∈X} Σ_{y∈Y} p(x, y) log[ p(x, y) / (p(x) p(y)) ]

where X and Y are the sets of pixel values of the reference and floating images, p(x, y) is the probability of the pair (x, y), and p(x) and p(y) are the marginal probabilities of x and y. (A small numerical sketch of this step is given after this section.)
Step 4: Find the maximum mutual information. Search over the candidate registration transform parameters and keep the parameters that maximise the mutual information.
Step 5: Apply the registration transform. Apply the best transform parameters to the floating image to register it; the transform may be a translation, rotation, scaling, and so on.
Step 6: Evaluate the registration result. Assess the accuracy and stability of the result with metrics such as mean squared error or mutual information.

3. Role of the MI registration threshold
The MI registration threshold controls the sensitivity of the similarity measure during registration. With a high threshold, only image regions of high similarity are taken into account, which improves the accuracy and stability of the registration; with a low threshold, regions of lower similarity are also included, which increases flexibility and robustness. The threshold should be chosen according to the specific application scenario and requirements.
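For concreteness, here is a small sketch of Step 3: mutual information estimated from a joint gray-level histogram. The bin count and the use of a histogram estimator are implementation assumptions rather than anything prescribed by the text; in a registration loop this value would be recomputed and maximised over the transform parameters.

```python
import numpy as np

def mutual_information(ref, flt, bins=64):
    """Mutual information between a reference image and a floating image,
    estimated from their joint gray-level histogram."""
    joint, _, _ = np.histogram2d(ref.ravel(), flt.ravel(), bins=bins)
    pxy = joint / joint.sum()                 # joint probability p(x, y)
    px = pxy.sum(axis=1, keepdims=True)       # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)       # marginal p(y)
    nz = pxy > 0                              # avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))
```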
Kernels and Regularization on GraphsAlexander J.Smola1and Risi Kondor21Machine Learning Group,RSISEAustralian National UniversityCanberra,ACT0200,AustraliaAlex.Smola@.au2Department of Computer ScienceColumbia University1214Amsterdam Avenue,M.C.0401New York,NY10027,USArisi@Abstract.We introduce a family of kernels on graphs based on thenotion of regularization operators.This generalizes in a natural way thenotion of regularization and Greens functions,as commonly used forreal valued functions,to graphs.It turns out that diffusion kernels canbe found as a special case of our reasoning.We show that the class ofpositive,monotonically decreasing functions on the unit interval leads tokernels and corresponding regularization operators.1IntroductionThere has recently been a surge of interest in learning algorithms that operate on input spaces X other than R n,specifically,discrete input spaces,such as strings, graphs,trees,automata etc..Since kernel-based algorithms,such as Support Vector Machines,Gaussian Processes,Kernel PCA,etc.capture the structure of X via the kernel K:X×X→R,as long as we can define an appropriate kernel on our discrete input space,these algorithms can be imported wholesale, together with their error analysis,theoretical guarantees and empirical success.One of the most general representations of discrete metric spaces are graphs. Even if all we know about our input space are local pairwise similarities between points x i,x j∈X,distances(e.g shortest path length)on the graph induced by these similarities can give a useful,more global,sense of similarity between objects.In their work on Diffusion Kernels,Kondor and Lafferty[2002]gave a specific construction for a kernel capturing this structure.Belkin and Niyogi [2002]proposed an essentially equivalent construction in the context of approx-imating data lying on surfaces in a high dimensional embedding space,and in the context of leveraging information from unlabeled data.In this paper we put these earlier results into the more principled framework of Regularization Theory.We propose a family of regularization operators(equiv-alently,kernels)on graphs that include Diffusion Kernels as a special case,and show that this family encompasses all possible regularization operators invariant under permutations of the vertices in a particular sense.2Alexander Smola and Risi KondorOutline of the Paper:Section2introduces the concept of the graph Laplacian and relates it to the Laplace operator on real valued functions.Next we define an extended class of regularization operators and show why they have to be es-sentially a function of the Laplacian.An analogy to real valued Greens functions is established in Section3.3,and efficient methods for computing such functions are presented in Section4.We conclude with a discussion.2Laplace OperatorsAn undirected unweighted graph G consists of a set of vertices V numbered1to n,and a set of edges E(i.e.,pairs(i,j)where i,j∈V and(i,j)∈E⇔(j,i)∈E). We will sometimes write i∼j to denote that i and j are neighbors,i.e.(i,j)∈E. The adjacency matrix of G is an n×n real matrix W,with W ij=1if i∼j,and 0otherwise(by construction,W is symmetric and its diagonal entries are zero). 
These definitions and most of the following theory can trivially be extended toweighted graphs by allowing W ij∈[0,∞).Let D be an n×n diagonal matrix with D ii=jW ij.The Laplacian of Gis defined as L:=D−W and the Normalized Laplacian is˜L:=D−12LD−12= I−D−12W D−12.The following two theorems are well known results from spectral graph theory[Chung-Graham,1997]:Theorem1(Spectrum of˜L).˜L is a symmetric,positive semidefinite matrix, and its eigenvaluesλ1,λ2,...,λn satisfy0≤λi≤2.Furthermore,the number of eigenvalues equal to zero equals to the number of disjoint components in G.The bound on the spectrum follows directly from Gerschgorin’s Theorem.Theorem2(L and˜L for Regular Graphs).Now let G be a regular graph of degree d,that is,a graph in which every vertex has exactly d neighbors.ThenL=d I−W and˜L=I−1d W=1dL.Finally,W,L,˜L share the same eigenvectors{v i},where v i=λ−1iW v i=(d−λi)−1L v i=(1−d−1λi)−1˜L v i for all i.L and˜L can be regarded as linear operators on functions f:V→R,or,equiv-alently,on vectors f=(f1,f2,...,f n) .We could equally well have defined Lbyf,L f =f L f=−12i∼j(f i−f j)2for all f∈R n,(1)which readily generalizes to graphs with a countably infinite number of vertices.The Laplacian derives its name from its analogy with the familiar Laplacianoperator∆=∂2∂x21+∂2∂x22+...+∂2∂x2mon continuous spaces.Regarding(1)asinducing a semi-norm f L= f,L f on R n,the analogous expression for∆defined on a compact spaceΩisf ∆= f,∆f =Ωf(∆f)dω=Ω(∇f)·(∇f)dω.(2)Both(1)and(2)quantify how much f and f vary locally,or how“smooth”they are over their respective domains.Kernels and Regularization on Graphs3 More explicitly,whenΩ=R m,up to a constant,−L is exactly thefinite difference discretization of∆on a regular lattice:∆f(x)=mi=1∂2∂x2if≈mi=1∂∂x if(x+12e i)−∂∂x if(x−12e i)δ≈mi=1f(x+e i)+f(x−e i)−2f(x)δ2=1δ2mi=1(f x1,...,x i+1,...,x m+f x1,...,x i−1,...,x m−2f x1,...,x m)=−1δ2[L f]x1,...,x m,where e1,e2,...,e m is an orthogonal basis for R m normalized to e i =δ, the vertices of the lattice are at x=x1e1+...+x m e m with integer valuedcoordinates x i∈N,and f x1,x2,...,x m=f(x).Moreover,both the continuous and the dis-crete Laplacians are canonical operators on their respective domains,in the sense that they are invariant under certain natural transformations of the underlying space,and in this they are essentially unique.Regular grid in two dimensionsThe Laplace operator∆is the unique self-adjoint linear second order differ-ential operator invariant under transformations of the coordinate system under the action of the special orthogonal group SO m,i.e.invariant under rotations. This well known result can be seen by using Schur’s lemma and the fact that SO m is irreducible on R m.We now show a similar result for L.Here the permutation group plays a similar role to SO m.We need some additional definitions:denote by S n the group of permutations on{1,2,...,n}withπ∈S n being a specific permutation taking i∈{1,2,...n}toπ(i).The so-called defining representation of S n consists of n×n matricesΠπ,such that[Ππ]i,π(i)=1and all other entries ofΠπare zero. 
Theorem3(Permutation Invariant Linear Functions on Graphs).Let L be an n×n symmetric real matrix,linearly related to the n×n adjacency matrix W,i.e.L=T[W]for some linear operator L in a way invariant to permutations of vertices in the sense thatΠ πT[W]Ππ=TΠ πWΠπ(3)for anyπ∈S n.Then L is related to W by a linear combination of the follow-ing three operations:identity;row/column sums;overall sum;row/column sum restricted to the diagonal of L;overall sum restricted to the diagonal of W. Proof LetL i1i2=T[W]i1i2:=ni3=1ni4=1T i1i2i3i4W i3i4(4)with T∈R n4.Eq.(3)then implies Tπ(i1)π(i2)π(i3)π(i4)=T i1i2i3i4for anyπ∈S n.4Alexander Smola and Risi KondorThe indices of T can be partitioned by the equality relation on their values,e.g.(2,5,2,7)is of the partition type [13|2|4],since i 1=i 3,but i 2=i 1,i 4=i 1and i 2=i 4.The key observation is that under the action of the permutation group,elements of T with a given index partition structure are taken to elements with the same index partition structure,e.g.if i 1=i 3then π(i 1)=π(i 3)and if i 1=i 3,then π(i 1)=π(i 3).Furthermore,an element with a given index index partition structure can be mapped to any other element of T with the same index partition structure by a suitable choice of π.Hence,a necessary and sufficient condition for (4)is that all elements of T of a given index partition structure be equal.Therefore,T must be a linear combination of the following tensors (i.e.multilinear forms):A i 1i 2i 3i 4=1B [1,2]i 1i 2i 3i 4=δi 1i 2B [1,3]i 1i 2i 3i 4=δi 1i 3B [1,4]i 1i 2i 3i 4=δi 1i 4B [2,3]i 1i 2i 3i 4=δi 2i 3B [2,4]i 1i 2i 3i 4=δi 2i 4B [3,4]i 1i 2i 3i 4=δi 3i 4C [1,2,3]i 1i 2i 3i 4=δi 1i 2δi 2i 3C [2,3,4]i 1i 2i 3i 4=δi 2i 3δi 3i 4C [3,4,1]i 1i 2i 3i 4=δi 3i 4δi 4i 1C [4,1,2]i 1i 2i 3i 4=δi 4i 1δi 1i 2D [1,2][3,4]i 1i 2i 3i 4=δi 1i 2δi 3i 4D [1,3][2,4]i 1i 2i 3i 4=δi 1i 3δi 2i 4D [1,4][2,3]i 1i 2i 3i 4=δi 1i 4δi 2i 3E [1,2,3,4]i 1i 2i 3i 4=δi 1i 2δi 1i 3δi 1i 4.The tensor A puts the overall sum in each element of L ,while B [1,2]returns the the same restricted to the diagonal of L .Since W has vanishing diagonal,B [3,4],C [2,3,4],C [3,4,1],D [1,2][3,4]and E [1,2,3,4]produce zero.Without loss of generality we can therefore ignore them.By symmetry of W ,the pairs (B [1,3],B [1,4]),(B [2,3],B [2,4]),(C [1,2,3],C [4,1,2])have the same effect on W ,hence we can set the coefficient of the second member of each to zero.Furthermore,to enforce symmetry on L ,the coefficient of B [1,3]and B [2,3]must be the same (without loss of generality 1)and this will give the row/column sum matrix ( k W ik )+( k W kl ).Similarly,C [1,2,3]and C [4,1,2]must have the same coefficient and this will give the row/column sum restricted to the diagonal:δij [( k W ik )+( k W kl )].Finally,by symmetry of W ,D [1,3][2,4]and D [1,4][2,3]are both equivalent to the identity map.The various row/column sum and overall sum operations are uninteresting from a graph theory point of view,since they do not heed to the topology of the graph.Imposing the conditions that each row and column in L must sum to zero,we recover the graph Laplacian.Hence,up to a constant factor and trivial additive components,the graph Laplacian (or the normalized graph Laplacian if we wish to rescale by the number of edges per vertex)is the only “invariant”differential operator for given W (or its normalized counterpart ˜W ).Unless stated otherwise,all results below hold for both L and ˜L (albeit with a different spectrum)and we will,in the following,focus on ˜Ldue to the fact that its spectrum is contained 
in [0,2].Kernels and Regularization on Graphs5 3RegularizationThe fact that L induces a semi-norm on f which penalizes the changes between adjacent vertices,as described in(1),indicates that it may serve as a tool to design regularization operators.3.1Regularization via the Laplace OperatorWe begin with a brief overview of translation invariant regularization operators on continuous spaces and show how they can be interpreted as powers of∆.This will allow us to repeat the development almost verbatim with˜L(or L)instead.Some of the most successful regularization functionals on R n,leading to kernels such as the Gaussian RBF,can be written as[Smola et al.,1998]f,P f :=|˜f(ω)|2r( ω 2)dω= f,r(∆)f .(5)Here f∈L2(R n),˜f(ω)denotes the Fourier transform of f,r( ω 2)is a function penalizing frequency components|˜f(ω)|of f,typically increasing in ω 2,and finally,r(∆)is the extension of r to operators simply by applying r to the spectrum of∆[Dunford and Schwartz,1958]f,r(∆)f =if,ψi r(λi) ψi,fwhere{(ψi,λi)}is the eigensystem of∆.The last equality in(5)holds because applications of∆become multiplications by ω 2in Fourier space.Kernels are obtained by solving the self-consistency condition[Smola et al.,1998]k(x,·),P k(x ,·) =k(x,x ).(6) One can show that k(x,x )=κ(x−x ),whereκis equal to the inverse Fourier transform of r−1( ω 2).Several r functions have been known to yield good results.The two most popular are given below:r( ω 2)k(x,x )r(∆)Gaussian RBF expσ22ω 2exp−12σ2x−x 2∞i=0σ2ii!∆iLaplacian RBF1+σ2 ω 2exp−1σx−x1+σ2∆In summary,regularization according to(5)is carried out by penalizing˜f(ω) by a function of the Laplace operator.For many results in regularization theory one requires r( ω 2)→∞for ω 2→∞.3.2Regularization via the Graph LaplacianIn complete analogy to(5),we define a class of regularization functionals on graphs asf,P f := f,r(˜L)f .(7)6Alexander Smola and Risi KondorFig.1.Regularization function r (λ).From left to right:regularized Laplacian (σ2=1),diffusion process (σ2=1),one-step random walk (a =2),4-step random walk (a =2),inverse cosine.Here r (˜L )is understood as applying the scalar valued function r (λ)to the eigen-values of ˜L ,that is,r (˜L ):=m i =1r (λi )v i v i ,(8)where {(λi ,v i )}constitute the eigensystem of ˜L .The normalized graph Lapla-cian ˜Lis preferable to L ,since ˜L ’s spectrum is contained in [0,2].The obvious goal is to gain insight into what functions are appropriate choices for r .–From (1)we infer that v i with large λi correspond to rather uneven functions on the graph G .Consequently,they should be penalized more strongly than v i with small λi .Hence r (λ)should be monotonically increasing in λ.–Requiring that r (˜L) 0imposes the constraint r (λ)≥0for all λ∈[0,2].–Finally,we can limit ourselves to r (λ)expressible as power series,since the latter are dense in the space of C 0functions on bounded domains.In Section 3.5we will present additional motivation for the choice of r (λ)in the context of spectral graph theory and segmentation.As we shall see,the following functions are of particular interest:r (λ)=1+σ2λ(Regularized Laplacian)(9)r (λ)=exp σ2/2λ(Diffusion Process)(10)r (λ)=(aI −λ)−1with a ≥2(One-Step Random Walk)(11)r (λ)=(aI −λ)−p with a ≥2(p -Step Random Walk)(12)r (λ)=(cos λπ/4)−1(Inverse Cosine)(13)Figure 1shows the regularization behavior for the functions (9)-(13).3.3KernelsThe introduction of a regularization matrix P =r (˜L)allows us to define a Hilbert space H on R m via f,f H := f ,P f .We now show that H is a reproducing kernel Hilbert 
space.Kernels and Regularization on Graphs 7Theorem 4.Denote by P ∈R m ×m a (positive semidefinite)regularization ma-trix and denote by H the image of R m under P .Then H with dot product f,f H := f ,P f is a Reproducing Kernel Hilbert Space and its kernel is k (i,j )= P −1ij ,where P −1denotes the pseudo-inverse if P is not invertible.Proof Since P is a positive semidefinite matrix,we clearly have a Hilbert space on P R m .To show the reproducing property we need to prove thatf (i )= f,k (i,·) H .(14)Note that k (i,j )can take on at most m 2different values (since i,j ∈[1:m ]).In matrix notation (14)means that for all f ∈Hf (i )=f P K i,:for all i ⇐⇒f =f P K.(15)The latter holds if K =P −1and f ∈P R m ,which proves the claim.In other words,K is the Greens function of P ,just as in the continuous case.The notion of Greens functions on graphs was only recently introduced by Chung-Graham and Yau [2000]for L .The above theorem extended this idea to arbitrary regularization operators ˆr (˜L).Corollary 1.Denote by P =r (˜L )a regularization matrix,then the correspond-ing kernel is given by K =r −1(˜L ),where we take the pseudo-inverse wherever necessary.More specifically,if {(v i ,λi )}constitute the eigensystem of ˜L,we have K =mi =1r −1(λi )v i v i where we define 0−1≡0.(16)3.4Examples of KernelsBy virtue of Corollary 1we only need to take (9)-(13)and plug the definition of r (λ)into (16)to obtain formulae for computing K .This yields the following kernel matrices:K =(I +σ2˜L)−1(Regularized Laplacian)(17)K =exp(−σ2/2˜L)(Diffusion Process)(18)K =(aI −˜L)p with a ≥2(p -Step Random Walk)(19)K =cos ˜Lπ/4(Inverse Cosine)(20)Equation (18)corresponds to the diffusion kernel proposed by Kondor and Laf-ferty [2002],for which K (x,x )can be visualized as the quantity of some sub-stance that would accumulate at vertex x after a given amount of time if we injected the substance at vertex x and let it diffuse through the graph along the edges.Note that this involves matrix exponentiation defined via the limit K =exp(B )=lim n →∞(I +B/n )n as opposed to component-wise exponentiation K i,j =exp(B i,j ).8Alexander Smola and Risi KondorFig.2.Thefirst8eigenvectors of the normalized graph Laplacian corresponding to the graph drawn above.Each line attached to a vertex is proportional to the value of the corresponding eigenvector at the vertex.Positive values(red)point up and negative values(blue)point down.Note that the assignment of values becomes less and less uniform with increasing eigenvalue(i.e.from left to right).For(17)it is typically more efficient to deal with the inverse of K,as it avoids the costly inversion of the sparse matrix˜L.Such situations arise,e.g.,in Gaussian Process estimation,where K is the covariance matrix of a stochastic process[Williams,1999].Regarding(19),recall that(aI−˜L)p=((a−1)I+˜W)p is up to scaling terms equiv-alent to a p-step random walk on the graphwith random restarts(see Section A for de-tails).In this sense it is similar to the dif-fusion kernel.However,the fact that K in-volves only afinite number of products ofmatrices makes it much more attractive forpractical purposes.In particular,entries inK ij can be computed cheaply using the factthat˜L is a sparse matrix.A nearest neighbor graph.Finally,the inverse cosine kernel treats lower complexity functions almost equally,with a significant reduction in the upper end of the spectrum.Figure2 shows the leading eigenvectors of the graph drawn above and Figure3provide examples of some of the kernels discussed above.3.5Clustering 
and Spectral Graph TheoryWe could also have derived r(˜L)directly from spectral graph theory:the eigen-vectors of the graph Laplacian correspond to functions partitioning the graph into clusters,see e.g.,[Chung-Graham,1997,Shi and Malik,1997]and the ref-erences therein.In general,small eigenvalues have associated eigenvectors which vary little between adjacent vertices.Finding the smallest eigenvectors of˜L can be seen as a real-valued relaxation of the min-cut problem.3For instance,the smallest eigenvalue of˜L is0,its corresponding eigenvector is D121n with1n:=(1,...,1)∈R n.The second smallest eigenvalue/eigenvector pair,also often referred to as the Fiedler-vector,can be used to split the graph 3Only recently,algorithms based on the celebrated semidefinite relaxation of the min-cut problem by Goemans and Williamson[1995]have seen wider use[Torr,2003]in segmentation and clustering by use of spectral bundle methods.Kernels and Regularization on Graphs9Fig.3.Top:regularized graph Laplacian;Middle:diffusion kernel with σ=5,Bottom:4-step random walk kernel.Each figure displays K ij for fixed i .The value K ij at vertex i is denoted by a bold line.Note that only adjacent vertices to i bear significant value.into two distinct parts [Weiss,1999,Shi and Malik,1997],and further eigenvec-tors with larger eigenvalues have been used for more finely-grained partitions of the graph.See Figure 2for an example.Such a decomposition into functions of increasing complexity has very de-sirable properties:if we want to perform estimation on the graph,we will wish to bias the estimate towards functions which vary little over large homogeneous portions 4.Consequently,we have the following interpretation of f,f H .As-sume that f = i βi v i ,where {(v i ,λi )}is the eigensystem of ˜L.Then we can rewrite f,f H to yield f ,r (˜L )f = i βi v i , j r (λj )v j v j l βl v l = iβ2i r (λi ).(21)This means that the components of f which vary a lot over coherent clusters in the graph are penalized more strongly,whereas the portions of f ,which are essentially constant over clusters,are preferred.This is exactly what we want.3.6Approximate ComputationOften it is not necessary to know all values of the kernel (e.g.,if we only observe instances from a subset of all positions on the graph).There it would be wasteful to compute the full matrix r (L )−1explicitly,since such operations typically scale with O (n 3).Furthermore,for large n it is not desirable to compute K via (16),that is,by computing the eigensystem of ˜Land assembling K directly.4If we cannot assume a connection between the structure of the graph and the values of the function to be estimated on it,the entire concept of designing kernels on graphs obviously becomes meaningless.10Alexander Smola and Risi KondorInstead,we would like to take advantage of the fact that ˜L is sparse,and con-sequently any operation ˜Lαhas cost at most linear in the number of nonzero ele-ments of ˜L ,hence the cost is bounded by O (|E |+n ).Moreover,if d is the largest degree of the graph,then computing L p e i costs at most |E | p −1i =1(min(d +1,n ))ioperations:at each step the number of non-zeros in the rhs decreases by at most a factor of d +1.This means that as long as we can approximate K =r −1(˜L )by a low order polynomial,say ρ(˜L ):= N i =0βi ˜L i ,significant savings are possible.Note that we need not necessarily require a uniformly good approximation and put the main emphasis on the approximation for small λ.However,we need to ensure that ρ(˜L)is positive 
semidefinite.Diffusion Kernel:The fact that the series r −1(x )=exp(−βx )= ∞m =0(−β)m x m m !has alternating signs shows that the approximation error at r −1(x )is boundedby (2β)N +1(N +1)!,if we use N terms in the expansion (from Theorem 1we know that ˜L≤2).For instance,for β=1,10terms are sufficient to obtain an error of the order of 10−4.Variational Approximation:In general,if we want to approximate r −1(λ)on[0,2],we need to solve the L ∞([0,2])approximation problemminimize β, subject to N i =0βi λi −r −1(λ) ≤ ∀λ∈[0,2](22)Clearly,(22)is equivalent to minimizing sup ˜L ρ(˜L )−r−1(˜L ) ,since the matrix norm is determined by the largest eigenvalues,and we can find ˜Lsuch that the discrepancy between ρ(λ)and r −1(λ)is attained.Variational problems of this form have been studied in the literature,and their solution may provide much better approximations to r −1(λ)than a truncated power series expansion.4Products of GraphsAs we have already pointed out,it is very expensive to compute K for arbitrary ˆr and ˜L.For special types of graphs and regularization,however,significant computational savings can be made.4.1Factor GraphsThe work of this section is a direct extension of results by Ellis [2002]and Chung-Graham and Yau [2000],who study factor graphs to compute inverses of the graph Laplacian.Definition 1(Factor Graphs).Denote by (V,E )and (V ,E )the vertices V and edges E of two graphs,then the factor graph (V f ,E f ):=(V,E )⊗(V ,E )is defined as the graph where (i,i )∈V f if i ∈V and i ∈V ;and ((i,i ),(j,j ))∈E f if and only if either (i,j )∈E and i =j or (i ,j )∈E and i =j .Kernels and Regularization on Graphs 11For instance,the factor graph of two rings is a torus.The nice property of factor graphs is that we can compute the eigenvalues of the Laplacian on products very easily (see e.g.,Chung-Graham and Yau [2000]):Theorem 5(Eigenvalues of Factor Graphs).The eigenvalues and eigen-vectors of the normalized Laplacian for the factor graph between a regular graph of degree d with eigenvalues {λj }and a regular graph of degree d with eigenvalues {λ l }are of the form:λfact j,l =d d +d λj +d d +d λ l(23)and the eigenvectors satisfy e j,l(i,i )=e j i e l i ,where e j is an eigenvector of ˜L and e l is an eigenvector of ˜L.This allows us to apply Corollary 1to obtain an expansion of K asK =(r (L ))−1=j,l r −1(λjl )e j,l e j,l .(24)While providing an explicit recipe for the computation of K ij without the need to compute the full matrix K ,this still requires O (n 2)operations per entry,which may be more costly than what we want (here n is the number of vertices of the factor graph).Two methods for computing (24)become evident at this point:if r has a special structure,we may exploit this to decompose K into the products and sums of terms depending on one of the two graphs alone and pre-compute these expressions beforehand.Secondly,if one of the two terms in the expansion can be computed for a rather general class of values of r (x ),we can pre-compute this expansion and only carry out the remainder corresponding to (24)explicitly.4.2Product Decomposition of r (x )Central to our reasoning is the observation that for certain r (x ),the term 1r (a +b )can be expressed in terms of a product and sum of terms depending on a and b only.We assume that 1r (a +b )=M m =1ρn (a )˜ρn (b ).(25)In the following we will show that in such situations the kernels on factor graphs can be computed as an analogous combination of products and sums of kernel functions on the terms constituting the ingredients of the 
factor graph.Before we do so,we briefly check that many r (x )indeed satisfy this property.exp(−β(a +b ))=exp(−βa )exp(−βb )(26)(A −(a +b ))= A 2−a + A 2−b (27)(A −(a +b ))p =p n =0p n A 2−a n A 2−b p −n (28)cos (a +b )π4=cos aπ4cos bπ4−sin aπ4sin bπ4(29)12Alexander Smola and Risi KondorIn a nutshell,we will exploit the fact that for products of graphs the eigenvalues of the joint graph Laplacian can be written as the sum of the eigenvalues of the Laplacians of the constituent graphs.This way we can perform computations on ρn and˜ρn separately without the need to take the other part of the the product of graphs into account.Definek m(i,j):=l ρldλld+de l i e l j and˜k m(i ,j ):=l˜ρldλld+d˜e l i ˜e l j .(30)Then we have the following composition theorem:Theorem6.Denote by(V,E)and(V ,E )connected regular graphs of degrees d with m vertices(and d ,m respectively)and normalized graph Laplacians ˜L,˜L .Furthermore denote by r(x)a rational function with matrix-valued exten-sionˆr(X).In this case the kernel K corresponding to the regularization operator ˆr(L)on the product graph of(V,E)and(V ,E )is given byk((i,i ),(j,j ))=Mm=1k m(i,j)˜k m(i ,j )(31)Proof Plug the expansion of1r(a+b)as given by(25)into(24)and collect terms.From(26)we immediately obtain the corollary(see Kondor and Lafferty[2002]) that for diffusion processes on factor graphs the kernel on the factor graph is given by the product of kernels on the constituents,that is k((i,i ),(j,j ))= k(i,j)k (i ,j ).The kernels k m and˜k m can be computed either by using an analytic solution of the underlying factors of the graph or alternatively they can be computed numerically.If the total number of kernels k n is small in comparison to the number of possible coordinates this is still computationally beneficial.4.3Composition TheoremsIf no expansion as in(31)can be found,we may still be able to compute ker-nels by extending a reasoning from[Ellis,2002].More specifically,the following composition theorem allows us to accelerate the computation in many cases, whenever we can parameterize(ˆr(L+αI))−1in an efficient way.For this pur-pose we introduce two auxiliary functionsKα(i,j):=ˆrdd+dL+αdd+dI−1=lrdλl+αdd+d−1e l(i)e l(j)G α(i,j):=(L +αI)−1=l1λl+αe l(i)e l(j).(32)In some cases Kα(i,j)may be computed in closed form,thus obviating the need to perform expensive matrix inversion,e.g.,in the case where the underlying graph is a chain[Ellis,2002]and Kα=Gα.Kernels and Regularization on Graphs 13Theorem 7.Under the assumptions of Theorem 6we haveK ((j,j ),(l,l ))=12πi C K α(j,l )G −α(j ,l )dα= v K λv (j,l )e v j e v l (33)where C ⊂C is a contour of the C containing the poles of (V ,E )including 0.For practical purposes,the third term of (33)is more amenable to computation.Proof From (24)we haveK ((j,j ),(l,l ))= u,v r dλu +d λv d +d −1e u j e u l e v j e v l (34)=12πi C u r dλu +d αd +d −1e u j e u l v 1λv −αe v j e v l dαHere the second equalityfollows from the fact that the contour integral over a pole p yields C f (α)p −αdα=2πif (p ),and the claim is verified by checking thedefinitions of K αand G α.The last equality can be seen from (34)by splitting up the summation over u and v .5ConclusionsWe have shown that the canonical family of kernels on graphs are of the form of power series in the graph Laplacian.Equivalently,such kernels can be char-acterized by a real valued function of the eigenvalues of the Laplacian.Special cases include diffusion kernels,the regularized Laplacian kernel and p -step ran-dom walk kernels.We have developed the 
regularization theory of learning on graphs using such kernels and explored methods for efficiently computing and approximating the kernel matrix.Acknowledgments This work was supported by a grant of the ARC.The authors thank Eleazar Eskin,Patrick Haffner,Andrew Ng,Bob Williamson and S.V.N.Vishwanathan for helpful comments and suggestions.A Link AnalysisRather surprisingly,our approach to regularizing functions on graphs bears re-semblance to algorithms for scoring web pages such as PageRank [Page et al.,1998],HITS [Kleinberg,1999],and randomized HITS [Zheng et al.,2001].More specifically,the random walks on graphs used in all three algorithms and the stationary distributions arising from them are closely connected with the eigen-system of L and ˜Lrespectively.We begin with an analysis of PageRank.Given a set of web pages and links between them we construct a directed graph in such a way that pages correspond。
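The kernel constructions summarized in the excerpt above all reduce to scalar functions of the normalized graph Laplacian (equations (17)-(19)). The dense NumPy/SciPy sketch below computes three of them for a small graph as an illustration; for large sparse graphs the text recommends low-order polynomial approximations rather than the dense inverse and matrix exponential used here.

```python
import numpy as np
from scipy.linalg import expm

def normalized_laplacian(W):
    """Normalized graph Laplacian L~ = I - D^{-1/2} W D^{-1/2} for a symmetric
    weight matrix W with zero diagonal, as defined in the excerpt above."""
    d = W.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(d), 0.0)
    return np.eye(len(W)) - (d_inv_sqrt[:, None] * W) * d_inv_sqrt[None, :]

def graph_kernels(W, sigma=1.0, a=2.0, p=3):
    """Regularized Laplacian, diffusion, and p-step random walk kernels,
    computed densely for a small graph; parameter values are arbitrary."""
    L = normalized_laplacian(W)
    I = np.eye(len(W))
    k_reg = np.linalg.inv(I + sigma**2 * L)        # (I + sigma^2 L~)^-1
    k_diff = expm(-(sigma**2 / 2.0) * L)           # exp(-sigma^2/2 L~)
    k_walk = np.linalg.matrix_power(a * I - L, p)  # (a I - L~)^p with a >= 2
    return k_reg, k_diff, k_walk
```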
Artificial Intelligence Quantitative Phase Imaging Methods for the Life Sciences

Artificial intelligence (AI) has revolutionized quantitative phase imaging (QPI) in the life sciences by enabling high-throughput, accurate analysis of complex biological samples. AI-based QPI methods integrate advanced machine learning algorithms with computational imaging techniques to extract quantitative information from phase images with unprecedented precision and speed. These methods leverage deep learning models to perform tasks such as cell segmentation, classification, and tracking, enabling researchers to study dynamic cellular processes and disease progression with minimal human intervention. One example of AI-enabled QPI is label-free cell classification, where AI algorithms accurately differentiate cell types from their phase images alone, providing valuable insights into cellular morphology and function. Furthermore, AI-based QPI approaches can improve the sensitivity and specificity of quantitative phase measurements, allowing more accurate quantification of cellular and subcellular features. Overall, the integration of AI with QPI holds great potential for advancing our understanding of complex biological systems and accelerating the development of novel diagnostic and therapeutic strategies in the life sciences.
Draft:Deep Learning in Neural Networks:An OverviewTechnical Report IDSIA-03-14/arXiv:1404.7828(v1.5)[cs.NE]J¨u rgen SchmidhuberThe Swiss AI Lab IDSIAIstituto Dalle Molle di Studi sull’Intelligenza ArtificialeUniversity of Lugano&SUPSIGalleria2,6928Manno-LuganoSwitzerland15May2014AbstractIn recent years,deep artificial neural networks(including recurrent ones)have won numerous con-tests in pattern recognition and machine learning.This historical survey compactly summarises relevantwork,much of it from the previous millennium.Shallow and deep learners are distinguished by thedepth of their credit assignment paths,which are chains of possibly learnable,causal links between ac-tions and effects.I review deep supervised learning(also recapitulating the history of backpropagation),unsupervised learning,reinforcement learning&evolutionary computation,and indirect search for shortprograms encoding deep and large networks.PDF of earlier draft(v1):http://www.idsia.ch/∼juergen/DeepLearning30April2014.pdfLATEX source:http://www.idsia.ch/∼juergen/DeepLearning30April2014.texComplete BIBTEXfile:http://www.idsia.ch/∼juergen/bib.bibPrefaceThis is the draft of an invited Deep Learning(DL)overview.One of its goals is to assign credit to those who contributed to the present state of the art.I acknowledge the limitations of attempting to achieve this goal.The DL research community itself may be viewed as a continually evolving,deep network of scientists who have influenced each other in complex ways.Starting from recent DL results,I tried to trace back the origins of relevant ideas through the past half century and beyond,sometimes using“local search”to follow citations of citations backwards in time.Since not all DL publications properly acknowledge earlier relevant work,additional global search strategies were employed,aided by consulting numerous neural network experts.As a result,the present draft mostly consists of references(about800entries so far).Nevertheless,through an expert selection bias I may have missed important work.A related bias was surely introduced by my special familiarity with the work of my own DL research group in the past quarter-century.For these reasons,the present draft should be viewed as merely a snapshot of an ongoing credit assignment process.To help improve it,please do not hesitate to send corrections and suggestions to juergen@idsia.ch.Contents1Introduction to Deep Learning(DL)in Neural Networks(NNs)3 2Event-Oriented Notation for Activation Spreading in FNNs/RNNs3 3Depth of Credit Assignment Paths(CAPs)and of Problems4 4Recurring Themes of Deep Learning54.1Dynamic Programming(DP)for DL (5)4.2Unsupervised Learning(UL)Facilitating Supervised Learning(SL)and RL (6)4.3Occam’s Razor:Compression and Minimum Description Length(MDL) (6)4.4Learning Hierarchical Representations Through Deep SL,UL,RL (6)4.5Fast Graphics Processing Units(GPUs)for DL in NNs (6)5Supervised NNs,Some Helped by Unsupervised NNs75.11940s and Earlier (7)5.2Around1960:More Neurobiological Inspiration for DL (7)5.31965:Deep Networks Based on the Group Method of Data Handling(GMDH) (8)5.41979:Convolution+Weight Replication+Winner-Take-All(WTA) (8)5.51960-1981and Beyond:Development of Backpropagation(BP)for NNs (8)5.5.1BP for Weight-Sharing Feedforward NNs(FNNs)and Recurrent NNs(RNNs)..95.6Late1980s-2000:Numerous Improvements of NNs (9)5.6.1Ideas for Dealing with Long Time Lags and Deep CAPs (10)5.6.2Better BP Through Advanced Gradient Descent (10)5.6.3Discovering Low-Complexity,Problem-Solving NNs 
(11)5.6.4Potential Benefits of UL for SL (11)5.71987:UL Through Autoencoder(AE)Hierarchies (12)5.81989:BP for Convolutional NNs(CNNs) (13)5.91991:Fundamental Deep Learning Problem of Gradient Descent (13)5.101991:UL-Based History Compression Through a Deep Hierarchy of RNNs (14)5.111992:Max-Pooling(MP):Towards MPCNNs (14)5.121994:Contest-Winning Not So Deep NNs (15)5.131995:Supervised Recurrent Very Deep Learner(LSTM RNN) (15)5.142003:More Contest-Winning/Record-Setting,Often Not So Deep NNs (16)5.152006/7:Deep Belief Networks(DBNs)&AE Stacks Fine-Tuned by BP (17)5.162006/7:Improved CNNs/GPU-CNNs/BP-Trained MPCNNs (17)5.172009:First Official Competitions Won by RNNs,and with MPCNNs (18)5.182010:Plain Backprop(+Distortions)on GPU Yields Excellent Results (18)5.192011:MPCNNs on GPU Achieve Superhuman Vision Performance (18)5.202011:Hessian-Free Optimization for RNNs (19)5.212012:First Contests Won on ImageNet&Object Detection&Segmentation (19)5.222013-:More Contests and Benchmark Records (20)5.22.1Currently Successful Supervised Techniques:LSTM RNNs/GPU-MPCNNs (21)5.23Recent Tricks for Improving SL Deep NNs(Compare Sec.5.6.2,5.6.3) (21)5.24Consequences for Neuroscience (22)5.25DL with Spiking Neurons? (22)6DL in FNNs and RNNs for Reinforcement Learning(RL)236.1RL Through NN World Models Yields RNNs With Deep CAPs (23)6.2Deep FNNs for Traditional RL and Markov Decision Processes(MDPs) (24)6.3Deep RL RNNs for Partially Observable MDPs(POMDPs) (24)6.4RL Facilitated by Deep UL in FNNs and RNNs (25)6.5Deep Hierarchical RL(HRL)and Subgoal Learning with FNNs and RNNs (25)6.6Deep RL by Direct NN Search/Policy Gradients/Evolution (25)6.7Deep RL by Indirect Policy Search/Compressed NN Search (26)6.8Universal RL (27)7Conclusion271Introduction to Deep Learning(DL)in Neural Networks(NNs) Which modifiable components of a learning system are responsible for its success or failure?What changes to them improve performance?This has been called the fundamental credit assignment problem(Minsky, 1963).There are general credit assignment methods for universal problem solvers that are time-optimal in various theoretical senses(Sec.6.8).The present survey,however,will focus on the narrower,but now commercially important,subfield of Deep Learning(DL)in Artificial Neural Networks(NNs).We are interested in accurate credit assignment across possibly many,often nonlinear,computational stages of NNs.Shallow NN-like models have been around for many decades if not centuries(Sec.5.1).Models with several successive nonlinear layers of neurons date back at least to the1960s(Sec.5.3)and1970s(Sec.5.5). 
An efficient gradient descent method for teacher-based Supervised Learning(SL)in discrete,differentiable networks of arbitrary depth called backpropagation(BP)was developed in the1960s and1970s,and ap-plied to NNs in1981(Sec.5.5).BP-based training of deep NNs with many layers,however,had been found to be difficult in practice by the late1980s(Sec.5.6),and had become an explicit research subject by the early1990s(Sec.5.9).DL became practically feasible to some extent through the help of Unsupervised Learning(UL)(e.g.,Sec.5.10,5.15).The1990s and2000s also saw many improvements of purely super-vised DL(Sec.5).In the new millennium,deep NNs havefinally attracted wide-spread attention,mainly by outperforming alternative machine learning methods such as kernel machines(Vapnik,1995;Sch¨o lkopf et al.,1998)in numerous important applications.In fact,supervised deep NNs have won numerous of-ficial international pattern recognition competitions(e.g.,Sec.5.17,5.19,5.21,5.22),achieving thefirst superhuman visual pattern recognition results in limited domains(Sec.5.19).Deep NNs also have become relevant for the more generalfield of Reinforcement Learning(RL)where there is no supervising teacher (Sec.6).Both feedforward(acyclic)NNs(FNNs)and recurrent(cyclic)NNs(RNNs)have won contests(Sec.5.12,5.14,5.17,5.19,5.21,5.22).In a sense,RNNs are the deepest of all NNs(Sec.3)—they are general computers more powerful than FNNs,and can in principle create and process memories of ar-bitrary sequences of input patterns(e.g.,Siegelmann and Sontag,1991;Schmidhuber,1990a).Unlike traditional methods for automatic sequential program synthesis(e.g.,Waldinger and Lee,1969;Balzer, 1985;Soloway,1986;Deville and Lau,1994),RNNs can learn programs that mix sequential and parallel information processing in a natural and efficient way,exploiting the massive parallelism viewed as crucial for sustaining the rapid decline of computation cost observed over the past75years.The rest of this paper is structured as follows.Sec.2introduces a compact,event-oriented notation that is simple yet general enough to accommodate both FNNs and RNNs.Sec.3introduces the concept of Credit Assignment Paths(CAPs)to measure whether learning in a given NN application is of the deep or shallow type.Sec.4lists recurring themes of DL in SL,UL,and RL.Sec.5focuses on SL and UL,and on how UL can facilitate SL,although pure SL has become dominant in recent competitions(Sec.5.17-5.22). 
Sec.5is arranged in a historical timeline format with subsections on important inspirations and technical contributions.Sec.6on deep RL discusses traditional Dynamic Programming(DP)-based RL combined with gradient-based search techniques for SL or UL in deep NNs,as well as general methods for direct and indirect search in the weight space of deep FNNs and RNNs,including successful policy gradient and evolutionary methods.2Event-Oriented Notation for Activation Spreading in FNNs/RNNs Throughout this paper,let i,j,k,t,p,q,r denote positive integer variables assuming ranges implicit in the given contexts.Let n,m,T denote positive integer constants.An NN’s topology may change over time(e.g.,Fahlman,1991;Ring,1991;Weng et al.,1992;Fritzke, 1994).At any given moment,it can be described as afinite subset of units(or nodes or neurons)N= {u1,u2,...,}and afinite set H⊆N×N of directed edges or connections between nodes.FNNs are acyclic graphs,RNNs cyclic.Thefirst(input)layer is the set of input units,a subset of N.In FNNs,the k-th layer(k>1)is the set of all nodes u∈N such that there is an edge path of length k−1(but no longer path)between some input unit and u.There may be shortcut connections between distant layers.The NN’s behavior or program is determined by a set of real-valued,possibly modifiable,parameters or weights w i(i=1,...,n).We now focus on a singlefinite episode or epoch of information processing and activation spreading,without learning through weight changes.The following slightly unconventional notation is designed to compactly describe what is happening during the runtime of the system.During an episode,there is a partially causal sequence x t(t=1,...,T)of real values that I call events.Each x t is either an input set by the environment,or the activation of a unit that may directly depend on other x k(k<t)through a current NN topology-dependent set in t of indices k representing incoming causal connections or links.Let the function v encode topology information and map such event index pairs(k,t)to weight indices.For example,in the non-input case we may have x t=f t(net t)with real-valued net t= k∈in t x k w v(k,t)(additive case)or net t= k∈in t x k w v(k,t)(multiplicative case), where f t is a typically nonlinear real-valued activation function such as tanh.In many recent competition-winning NNs(Sec.5.19,5.21,5.22)there also are events of the type x t=max k∈int (x k);some networktypes may also use complex polynomial activation functions(Sec.5.3).x t may directly affect certain x k(k>t)through outgoing connections or links represented through a current set out t of indices k with t∈in k.Some non-input events are called output events.Note that many of the x t may refer to different,time-varying activations of the same unit in sequence-processing RNNs(e.g.,Williams,1989,“unfolding in time”),or also in FNNs sequentially exposed to time-varying input patterns of a large training set encoded as input events.During an episode,the same weight may get reused over and over again in topology-dependent ways,e.g.,in RNNs,or in convolutional NNs(Sec.5.4,5.8).I call this weight sharing across space and/or time.Weight sharing may greatly reduce the NN’s descriptive complexity,which is the number of bits of information required to describe the NN (Sec.4.3).In Supervised Learning(SL),certain NN output events x t may be associated with teacher-given,real-valued labels or targets d t yielding errors e t,e.g.,e t=1/2(x t−d t)2.A typical goal of supervised NN training is tofind weights that yield 
episodes with small total error E,the sum of all such e t.The hope is that the NN will generalize well in later episodes,causing only small errors on previously unseen sequences of input events.Many alternative error functions for SL and UL are possible.SL assumes that input events are independent of earlier output events(which may affect the environ-ment through actions causing subsequent perceptions).This assumption does not hold in the broaderfields of Sequential Decision Making and Reinforcement Learning(RL)(Kaelbling et al.,1996;Sutton and Barto, 1998;Hutter,2005)(Sec.6).In RL,some of the input events may encode real-valued reward signals given by the environment,and a typical goal is tofind weights that yield episodes with a high sum of reward signals,through sequences of appropriate output actions.Sec.5.5will use the notation above to compactly describe a central algorithm of DL,namely,back-propagation(BP)for supervised weight-sharing FNNs and RNNs.(FNNs may be viewed as RNNs with certainfixed zero weights.)Sec.6will address the more general RL case.3Depth of Credit Assignment Paths(CAPs)and of ProblemsTo measure whether credit assignment in a given NN application is of the deep or shallow type,I introduce the concept of Credit Assignment Paths or CAPs,which are chains of possibly causal links between events.Let usfirst focus on SL.Consider two events x p and x q(1≤p<q≤T).Depending on the appli-cation,they may have a Potential Direct Causal Connection(PDCC)expressed by the Boolean predicate pdcc(p,q),which is true if and only if p∈in q.Then the2-element list(p,q)is defined to be a CAP from p to q(a minimal one).A learning algorithm may be allowed to change w v(p,q)to improve performance in future episodes.More general,possibly indirect,Potential Causal Connections(PCC)are expressed by the recursively defined Boolean predicate pcc(p,q),which in the SL case is true only if pdcc(p,q),or if pcc(p,k)for some k and pdcc(k,q).In the latter case,appending q to any CAP from p to k yields a CAP from p to q(this is a recursive definition,too).The set of such CAPs may be large but isfinite.Note that the same weight may affect many different PDCCs between successive events listed by a given CAP,e.g.,in the case of RNNs, or weight-sharing FNNs.Suppose a CAP has the form(...,k,t,...,q),where k and t(possibly t=q)are thefirst successive elements with modifiable w v(k,t).Then the length of the suffix list(t,...,q)is called the CAP’s depth (which is0if there are no modifiable links at all).This depth limits how far backwards credit assignment can move down the causal chain tofind a modifiable weight.1Suppose an episode and its event sequence x1,...,x T satisfy a computable criterion used to decide whether a given problem has been solved(e.g.,total error E below some threshold).Then the set of used weights is called a solution to the problem,and the depth of the deepest CAP within the sequence is called the solution’s depth.There may be other solutions(yielding different event sequences)with different depths.Given somefixed NN topology,the smallest depth of any solution is called the problem’s depth.Sometimes we also speak of the depth of an architecture:SL FNNs withfixed topology imply a problem-independent maximal problem depth bounded by the number of non-input layers.Certain SL RNNs withfixed weights for all connections except those to output units(Jaeger,2001;Maass et al.,2002; Jaeger,2004;Schrauwen et al.,2007)have a maximal problem depth of1,because only thefinal links in the corresponding CAPs 
are modifiable.In general,however,RNNs may learn to solve problems of potentially unlimited depth.Note that the definitions above are solely based on the depths of causal chains,and agnostic of the temporal distance between events.For example,shallow FNNs perceiving large“time windows”of in-put events may correctly classify long input sequences through appropriate output events,and thus solve shallow problems involving long time lags between relevant events.At which problem depth does Shallow Learning end,and Deep Learning begin?Discussions with DL experts have not yet yielded a conclusive response to this question.Instead of committing myself to a precise answer,let me just define for the purposes of this overview:problems of depth>10require Very Deep Learning.The difficulty of a problem may have little to do with its depth.Some NNs can quickly learn to solve certain deep problems,e.g.,through random weight guessing(Sec.5.9)or other types of direct search (Sec.6.6)or indirect search(Sec.6.7)in weight space,or through training an NNfirst on shallow problems whose solutions may then generalize to deep problems,or through collapsing sequences of(non)linear operations into a single(non)linear operation—but see an analysis of non-trivial aspects of deep linear networks(Baldi and Hornik,1994,Section B).In general,however,finding an NN that precisely models a given training set is an NP-complete problem(Judd,1990;Blum and Rivest,1992),also in the case of deep NNs(S´ıma,1994;de Souto et al.,1999;Windisch,2005);compare a survey of negative results(S´ıma, 2002,Section1).Above we have focused on SL.In the more general case of RL in unknown environments,pcc(p,q) is also true if x p is an output event and x q any later input event—any action may affect the environment and thus any later perception.(In the real world,the environment may even influence non-input events computed on a physical hardware entangled with the entire universe,but this is ignored here.)It is possible to model and replace such unmodifiable environmental PCCs through a part of the NN that has already learned to predict(through some of its units)input events(including reward signals)from former input events and actions(Sec.6.1).Its weights are frozen,but can help to assign credit to other,still modifiable weights used to compute actions(Sec.6.1).This approach may lead to very deep CAPs though.Some DL research is about automatically rephrasing problems such that their depth is reduced(Sec.4). 
In particular, sometimes UL is used to make SL problems less deep, e.g., Sec. 5.10. Often Dynamic Programming (Sec. 4.1) is used to facilitate certain traditional RL problems, e.g., Sec. 6.2. Sec. 5 focuses on CAPs for SL, Sec. 6 on the more complex case of RL.

4 Recurring Themes of Deep Learning

4.1 Dynamic Programming (DP) for DL

One recurring theme of DL is Dynamic Programming (DP) (Bellman, 1957), which can help to facilitate credit assignment under certain assumptions. For example, in SL NNs, backpropagation itself can be viewed as a DP-derived method (Sec. 5.5). In traditional RL based on strong Markovian assumptions, DP-derived methods can help to greatly reduce problem depth (Sec. 6.2). DP algorithms are also essential for systems that combine concepts of NNs and graphical models, such as Hidden Markov Models (HMMs) (Stratonovich, 1960; Baum and Petrie, 1966) and Expectation Maximization (EM) (Dempster et al., 1977), e.g., (Bottou, 1991; Bengio, 1991; Bourlard and Morgan, 1994; Baldi and Chauvin, 1996; Jordan and Sejnowski, 2001; Bishop, 2006; Poon and Domingos, 2011; Dahl et al., 2012; Hinton et al., 2012a).

[Footnote 1: An alternative would be to count only modifiable links when measuring depth. In many typical NN applications this would not make a difference, but in some it would, e.g., Sec. 6.1.]

4.2 Unsupervised Learning (UL) Facilitating Supervised Learning (SL) and RL

Another recurring theme is how UL can facilitate both SL (Sec. 5) and RL (Sec. 6). UL (Sec. 5.6.4) is normally used to encode raw incoming data such as video or speech streams in a form that is more convenient for subsequent goal-directed learning. In particular, codes that describe the original data in a less redundant or more compact way can be fed into SL (Sec. 5.10, 5.15) or RL machines (Sec. 6.4), whose search spaces may thus become smaller (and whose CAPs shallower) than those necessary for dealing with the raw data. UL is closely connected to the topics of regularization and compression (Sec. 4.3, 5.6.3).
4.3Occam’s Razor:Compression and Minimum Description Length(MDL) Occam’s razor favors simple solutions over complex ones.Given some programming language,the prin-ciple of Minimum Description Length(MDL)can be used to measure the complexity of a solution candi-date by the length of the shortest program that computes it(e.g.,Solomonoff,1964;Kolmogorov,1965b; Chaitin,1966;Wallace and Boulton,1968;Levin,1973a;Rissanen,1986;Blumer et al.,1987;Li and Vit´a nyi,1997;Gr¨u nwald et al.,2005).Some methods explicitly take into account program runtime(Al-lender,1992;Watanabe,1992;Schmidhuber,2002,1995);many consider only programs with constant runtime,written in non-universal programming languages(e.g.,Rissanen,1986;Hinton and van Camp, 1993).In the NN case,the MDL principle suggests that low NN weight complexity corresponds to high NN probability in the Bayesian view(e.g.,MacKay,1992;Buntine and Weigend,1991;De Freitas,2003), and to high generalization performance(e.g.,Baum and Haussler,1989),without overfitting the training data.Many methods have been proposed for regularizing NNs,that is,searching for solution-computing, low-complexity SL NNs(Sec.5.6.3)and RL NNs(Sec.6.7).This is closely related to certain UL methods (Sec.4.2,5.6.4).4.4Learning Hierarchical Representations Through Deep SL,UL,RLMany methods of Good Old-Fashioned Artificial Intelligence(GOFAI)(Nilsson,1980)as well as more recent approaches to AI(Russell et al.,1995)and Machine Learning(Mitchell,1997)learn hierarchies of more and more abstract data representations.For example,certain methods of syntactic pattern recog-nition(Fu,1977)such as grammar induction discover hierarchies of formal rules to model observations. The partially(un)supervised Automated Mathematician/EURISKO(Lenat,1983;Lenat and Brown,1984) continually learns concepts by combining previously learnt concepts.Such hierarchical representation learning(Ring,1994;Bengio et al.,2013;Deng and Yu,2014)is also a recurring theme of DL NNs for SL (Sec.5),UL-aided SL(Sec.5.7,5.10,5.15),and hierarchical RL(Sec.6.5).Often,abstract hierarchical representations are natural by-products of data compression(Sec.4.3),e.g.,Sec.5.10.4.5Fast Graphics Processing Units(GPUs)for DL in NNsWhile the previous millennium saw several attempts at creating fast NN-specific hardware(e.g.,Jackel et al.,1990;Faggin,1992;Ramacher et al.,1993;Widrow et al.,1994;Heemskerk,1995;Korkin et al., 1997;Urlbe,1999),and at exploiting standard hardware(e.g.,Anguita et al.,1994;Muller et al.,1995; Anguita and Gomes,1996),the new millennium brought a DL breakthrough in form of cheap,multi-processor graphics cards or GPUs.GPUs are widely used for video games,a huge and competitive market that has driven down hardware prices.GPUs excel at fast matrix and vector multiplications required not only for convincing virtual realities but also for NN training,where they can speed up learning by a factorof50and more.Some of the GPU-based FNN implementations(Sec.5.16-5.19)have greatly contributed to recent successes in contests for pattern recognition(Sec.5.19-5.22),image segmentation(Sec.5.21), and object detection(Sec.5.21-5.22).5Supervised NNs,Some Helped by Unsupervised NNsThe main focus of current practical applications is on Supervised Learning(SL),which has dominated re-cent pattern recognition contests(Sec.5.17-5.22).Several methods,however,use additional Unsupervised Learning(UL)to facilitate SL(Sec.5.7,5.10,5.15).It does make sense to treat SL and UL in the same section:often gradient-based methods,such as 
BP(Sec.5.5.1),are used to optimize objective functions of both UL and SL,and the boundary between SL and UL may blur,for example,when it comes to time series prediction and sequence classification,e.g.,Sec.5.10,5.12.A historical timeline format will help to arrange subsections on important inspirations and techni-cal contributions(although such a subsection may span a time interval of many years).Sec.5.1briefly mentions early,shallow NN models since the1940s,Sec.5.2additional early neurobiological inspiration relevant for modern Deep Learning(DL).Sec.5.3is about GMDH networks(since1965),perhaps thefirst (feedforward)DL systems.Sec.5.4is about the relatively deep Neocognitron NN(1979)which is similar to certain modern deep FNN architectures,as it combines convolutional NNs(CNNs),weight pattern repli-cation,and winner-take-all(WTA)mechanisms.Sec.5.5uses the notation of Sec.2to compactly describe a central algorithm of DL,namely,backpropagation(BP)for supervised weight-sharing FNNs and RNNs. It also summarizes the history of BP1960-1981and beyond.Sec.5.6describes problems encountered in the late1980s with BP for deep NNs,and mentions several ideas from the previous millennium to overcome them.Sec.5.7discusses afirst hierarchical stack of coupled UL-based Autoencoders(AEs)—this concept resurfaced in the new millennium(Sec.5.15).Sec.5.8is about applying BP to CNNs,which is important for today’s DL applications.Sec.5.9explains BP’s Fundamental DL Problem(of vanishing/exploding gradients)discovered in1991.Sec.5.10explains how a deep RNN stack of1991(the History Compressor) pre-trained by UL helped to solve previously unlearnable DL benchmarks requiring Credit Assignment Paths(CAPs,Sec.3)of depth1000and more.Sec.5.11discusses a particular WTA method called Max-Pooling(MP)important in today’s DL FNNs.Sec.5.12mentions afirst important contest won by SL NNs in1994.Sec.5.13describes a purely supervised DL RNN(Long Short-Term Memory,LSTM)for problems of depth1000and more.Sec.5.14mentions an early contest of2003won by an ensemble of shallow NNs, as well as good pattern recognition results with CNNs and LSTM RNNs(2003).Sec.5.15is mostly about Deep Belief Networks(DBNs,2006)and related stacks of Autoencoders(AEs,Sec.5.7)pre-trained by UL to facilitate BP-based SL.Sec.5.16mentions thefirst BP-trained MPCNNs(2007)and GPU-CNNs(2006). Sec.5.17-5.22focus on official competitions with secret test sets won by(mostly purely supervised)DL NNs since2009,in sequence recognition,image classification,image segmentation,and object detection. 
Many RNN results depended on LSTM (Sec. 5.13); many FNN results depended on GPU-based FNN code developed since 2004 (Sec. 5.16, 5.17, 5.18, 5.19), in particular, GPU-MPCNNs (Sec. 5.19).

5.1 1940s and Earlier

NN research started in the 1940s (e.g., McCulloch and Pitts, 1943; Hebb, 1949); compare also later work on learning NNs (Rosenblatt, 1958, 1962; Widrow and Hoff, 1962; Grossberg, 1969; Kohonen, 1972; von der Malsburg, 1973; Narendra and Thathatchar, 1974; Willshaw and von der Malsburg, 1976; Palm, 1980; Hopfield, 1982). In a sense NNs have been around even longer, since early supervised NNs were essentially variants of linear regression methods going back at least to the early 1800s (e.g., Legendre, 1805; Gauss, 1809, 1821). Early NNs had a maximal CAP depth of 1 (Sec. 3).

5.2 Around 1960: More Neurobiological Inspiration for DL

Simple cells and complex cells were found in the cat's visual cortex (e.g., Hubel and Wiesel, 1962; Wiesel and Hubel, 1959). These cells fire in response to certain properties of visual sensory inputs, such as the orientation of edges. Complex cells exhibit more spatial invariance than simple cells. This inspired later deep NN architectures (Sec. 5.4) used in certain modern award-winning Deep Learners (Sec. 5.19-5.22).

5.3 1965: Deep Networks Based on the Group Method of Data Handling (GMDH)

Networks trained by the Group Method of Data Handling (GMDH) (Ivakhnenko and Lapa, 1965; Ivakhnenko et al., 1967; Ivakhnenko, 1968, 1971) were perhaps the first DL systems of the Feedforward Multilayer Perceptron type. The units of GMDH nets may have polynomial activation functions implementing Kolmogorov-Gabor polynomials (more general than traditional NN activation functions). Given a training set, layers are incrementally grown and trained by regression analysis, then pruned with the help of a separate validation set (using today's terminology), where Decision Regularisation is used to weed out superfluous units. The numbers of layers and units per layer can be learned in problem-dependent fashion.
This is a good example of hierarchical representation learning(Sec.4.4).There have been numerous ap-plications of GMDH-style networks,e.g.(Ikeda et al.,1976;Farlow,1984;Madala and Ivakhnenko,1994; Ivakhnenko,1995;Kondo,1998;Kord´ık et al.,2003;Witczak et al.,2006;Kondo and Ueno,2008).5.41979:Convolution+Weight Replication+Winner-Take-All(WTA)Apart from deep GMDH networks(Sec.5.3),the Neocognitron(Fukushima,1979,1980,2013a)was per-haps thefirst artificial NN that deserved the attribute deep,and thefirst to incorporate the neurophysiolog-ical insights of Sec.5.2.It introduced convolutional NNs(today often called CNNs or convnets),where the(typically rectangular)receptivefield of a convolutional unit with given weight vector is shifted step by step across a2-dimensional array of input values,such as the pixels of an image.The resulting2D array of subsequent activation events of this unit can then provide inputs to higher-level units,and so on.Due to massive weight replication(Sec.2),relatively few parameters may be necessary to describe the behavior of such a convolutional layer.Competition layers have WTA subsets whose maximally active units are the only ones to adopt non-zero activation values.They essentially“down-sample”the competition layer’s input.This helps to create units whose responses are insensitive to small image shifts(compare Sec.5.2).The Neocognitron is very similar to the architecture of modern,contest-winning,purely super-vised,feedforward,gradient-based Deep Learners with alternating convolutional and competition lay-ers(e.g.,Sec.5.19-5.22).Fukushima,however,did not set the weights by supervised backpropagation (Sec.5.5,5.8),but by local un supervised learning rules(e.g.,Fukushima,2013b),or by pre-wiring.In that sense he did not care for the DL problem(Sec.5.9),although his architecture was comparatively deep indeed.He also used Spatial Averaging(Fukushima,1980,2011)instead of Max-Pooling(MP,Sec.5.11), currently a particularly convenient and popular WTA mechanism.Today’s CNN-based DL machines profita lot from later CNN work(e.g.,LeCun et al.,1989;Ranzato et al.,2007)(Sec.5.8,5.16,5.19).5.51960-1981and Beyond:Development of Backpropagation(BP)for NNsThe minimisation of errors through gradient descent(Hadamard,1908)in the parameter space of com-plex,nonlinear,differentiable,multi-stage,NN-related systems has been discussed at least since the early 1960s(e.g.,Kelley,1960;Bryson,1961;Bryson and Denham,1961;Pontryagin et al.,1961;Dreyfus,1962; Wilkinson,1965;Amari,1967;Bryson and Ho,1969;Director and Rohrer,1969;Griewank,2012),ini-tially within the framework of Euler-LaGrange equations in the Calculus of Variations(e.g.,Euler,1744). Steepest descent in such systems can be performed(Bryson,1961;Kelley,1960;Bryson and Ho,1969)by iterating the ancient chain rule(Leibniz,1676;L’Hˆo pital,1696)in Dynamic Programming(DP)style(Bell-man,1957).A simplified derivation of the method uses the chain rule only(Dreyfus,1962).The methods of the1960s were already efficient in the DP sense.However,they backpropagated derivative information through standard Jacobian matrix calculations from one“layer”to the previous one, explicitly addressing neither direct links across several layers nor potential additional efficiency gains due to network sparsity(but perhaps such enhancements seemed obvious to the authors).。
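As a minimal illustration of the chain-rule bookkeeping discussed above, the following sketch runs reverse-mode gradient descent on a tiny two-layer network. It is a generic toy example written for this overview, not a reconstruction of any particular historical method; the data, network size, and learning rate are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D regression problem.
x = rng.normal(size=(100, 1))
d = np.sin(2 * x)

# Two-layer net: x -> tanh(x W1) -> linear(W2), trained by plain gradient descent.
W1 = 0.5 * rng.normal(size=(1, 16))
W2 = 0.5 * rng.normal(size=(16, 1))
lr = 0.05

for step in range(2000):
    # Forward pass.
    h_pre = x @ W1
    h = np.tanh(h_pre)
    y = h @ W2
    err = y - d
    # Backward pass: chain rule applied layer by layer,
    # with E = mean of 1/2 * err^2 (matching the e_t = 1/2 (x_t - d_t)^2 convention above).
    grad_y = err / len(x)
    grad_W2 = h.T @ grad_y
    grad_h = grad_y @ W2.T
    grad_pre = grad_h * (1.0 - np.tanh(h_pre) ** 2)
    grad_W1 = x.T @ grad_pre
    W1 -= lr * grad_W1
    W2 -= lr * grad_W2

final = np.tanh(x @ W1) @ W2
print("final mean squared error:", float(np.mean((final - d) ** 2)))
```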
Artificial Intelligence Decodes Cancer with Images and Omics
Source: 《世界科学》 (World Science), 2019, No. 6. Compiled by Lu Mo (陆默). Machine learning can analyze cancer photographs, tumor pathology slides, and genomes. Scientists are now preparing to integrate this information into cancer "super models."

A question on every cancer patient's mind is: how long do I have left? Genomicist Michael Snyder hopes he can find the answer. At present, all doctors can do is group patients with similar cancers and then compare how they and patients in other groups respond to the same drugs or what their prognoses look like, but current grouping methods are crude and imperfect, and they often rest only on manually collected data. "Pathologists diagnosing from their reading of images are typically only about 60% accurate," notes Snyder, director of the Stanford Center for Genomics and Personalized Medicine. In 2013, he and his then graduate student Kun-Hsing Yu (余坤兴) began to wonder whether artificial intelligence could give doctors more accurate predictions.

Yu fed histology images, together with the pathologist-confirmed diagnoses, into a machine learning algorithm, training it to distinguish lung cancer from normal tissue and to tell two different types of lung cancer apart. He then fed in the survival data of the corresponding patients so the system could learn how this information relates to the images. Finally, he supplied the model with new pathology slides and asked the AI a crucial question: how long will the patient survive? The computer could predict whether a patient's survival would be above or below the average for certain specific cancers, something pathologists find very hard to do. The computer's predictions "worked surprisingly well," says Yu, now a lecturer at Harvard Medical School.

But Snyder and Yu thought they could do more. Snyder's laboratory also studies omics, so they decided to give the computer not only histopathology slides to learn from but also tumor transcriptome data. Combining these data, the model predicted patient survival even better than with images or transcriptomes alone, with accuracy exceeding 80%. Pathologists today usually predict survival from a visual assessment of tissue micrographs, grading the tumor according to its size and severity and the likelihood that it will grow and spread further. But such tumor grading does not always predict survival accurately. Snyder and Yu are not the only researchers who have recognized the power of artificial intelligence for analyzing cancer-related datasets, including images, omics, and combinations of the two.
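To make the described workflow concrete, here is a purely illustrative sketch of the kind of model discussed above: image-derived features and transcriptome features are concatenated and a simple classifier predicts whether survival falls above or below the cohort median. All data, feature counts, and the classifier choice below are synthetic placeholders; the actual work used real histology-derived features and richer models.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic stand-ins: 500 patients, 64 image-derived features,
# 200 transcriptome features, and a binary label
# (1 = survival above the cohort median, 0 = below).
n_patients = 500
img_feats = rng.normal(size=(n_patients, 64))    # e.g., nuclei shape/texture statistics
rna_feats = rng.normal(size=(n_patients, 200))   # e.g., normalized gene-expression values
labels = (img_feats[:, 0] + rna_feats[:, 0] + rng.normal(size=n_patients) > 0).astype(int)

# Combine the two modalities by simple concatenation.
X = np.hstack([img_feats, rna_feats])
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(f"Combined image + transcriptome AUC: {auc:.2f}")
```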
Patent title: Exploiting activation sparsity in deep neural networks
Patent type: Invention patent
Inventors: R·希尔, A·兰博, M·戈德法布, A·安萨里, C·洛特
Application number: CN201980062020.2
Filing date: 2019-09-27
Publication number: CN112740236A
Publication date: 2021-04-30
Abstract: A method of exploiting activation sparsity in deep neural networks is described. The method includes retrieving an activation tensor and a weight tensor, where the activation tensor is a sparse activation tensor. The method also includes generating a compressed activation tensor that contains the non-zero activations of the activation tensor, where the compressed activation tensor has fewer columns than the activation tensor. The method further includes processing the compressed activation tensor and the weight tensor to generate an output tensor.
Applicant: Qualcomm Incorporated (高通股份有限公司)
Address: California, USA
Nationality: US
Agent: 上海专利商标事务所有限公司
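The abstract describes dropping the zero columns of a sparse activation tensor before it is multiplied with the weights. As a rough illustration of that idea (not the patented implementation, whose details are not given here), the following numpy sketch keeps only the non-zero activation columns, multiplies them against the matching rows of the weight matrix, and recovers the same output as the dense product:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy activation matrix (rows = output positions, columns = input channels).
# ReLU-style sparsity: most entries are zero.
acts = np.maximum(rng.normal(size=(8, 16)) - 1.0, 0.0)
weights = rng.normal(size=(16, 4))            # maps 16 input channels to 4 outputs

# Dense reference result.
dense_out = acts @ weights

# "Compression": keep only columns of the activation matrix that contain
# at least one non-zero value, and remember which channels they were.
nonzero_cols = np.flatnonzero(np.any(acts != 0.0, axis=0))
compressed_acts = acts[:, nonzero_cols]       # fewer columns than `acts`

# Process the compressed activations against the matching weight rows.
sparse_out = compressed_acts @ weights[nonzero_cols, :]

assert np.allclose(dense_out, sparse_out)
print(f"kept {len(nonzero_cols)} of {acts.shape[1]} activation columns")
```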
The MFISTA algorithm
The MFISTA algorithm (Multifrontal Iterative Shrinkage-Thresholding Algorithm) is an iterative optimization algorithm for solving sparse representation problems. Within the framework of L1-norm regularization, it can efficiently solve signal reconstruction problems that have sparse structure. A sparse representation problem asks, given a set of basis functions, for a representation of a signal as a linear combination of as few of those basis functions as possible. This problem has wide applications in signal processing, image processing, machine learning, and other fields. The goal of MFISTA is to find an optimal sparse representation by iteration. Its core idea is to recast the sparse representation problem as an optimization problem and to solve it by iterative shrinkage-thresholding.
The concrete steps of the algorithm are as follows (a short code sketch follows this list):
1. Initialization: given the signal and the basis functions, initialize the sparse representation.
2. Iterative update: repeatedly update the sparse representation until convergence. In each iteration, first compute the gradient, then update according to the gradient information.
3. Shrinkage thresholding: while updating the sparse representation, a threshold must be chosen. MFISTA uses an adaptive threshold-selection scheme, called the FISTA step, which selects the threshold from the difference between two successive iterates and thereby achieves better convergence behavior.
4. Stopping condition: when the sparse representation no longer changes significantly, the algorithm has converged and the final sparse representation is obtained.
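The sketch below shows a standard FISTA-style iteration for the L1-regularized least-squares problem min_x 1/2‖Ax − b‖² + λ‖x‖₁, with soft-thresholding as the shrinkage step. It is an illustration of the general iteration family described above rather than the exact MFISTA procedure; the step size, iteration count, and problem sizes are assumptions chosen for the example.

```python
import numpy as np

def soft_threshold(v, thresh):
    """Elementwise shrinkage operator used in ISTA/FISTA-type methods."""
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)

def fista_lasso(A, b, lam, n_iter=200):
    """FISTA-style iteration for 0.5*||Ax - b||^2 + lam*||x||_1."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    y, t = x.copy(), 1.0
    for _ in range(n_iter):
        grad = A.T @ (A @ y - b)           # gradient of the smooth part at y
        x_new = soft_threshold(y - grad / L, lam / L)
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        y = x_new + ((t - 1.0) / t_new) * (x_new - x)   # momentum/extrapolation step
        x, t = x_new, t_new
    return x

# Tiny synthetic example: recover a sparse vector from noisy measurements.
rng = np.random.default_rng(0)
A = rng.normal(size=(40, 100))
x_true = np.zeros(100)
x_true[[3, 17, 42]] = [1.5, -2.0, 1.0]
b = A @ x_true + 0.01 * rng.normal(size=40)
x_hat = fista_lasso(A, b, lam=0.1)
print("non-zeros recovered:", np.flatnonzero(np.abs(x_hat) > 0.1))
```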
Compared with other sparse representation algorithms, MFISTA has the following advantages:
1. Fast convergence: by introducing the FISTA step, MFISTA converges faster and runs more efficiently.
2. Better sparsity: MFISTA yields sparser representations, i.e., it represents the signal with fewer basis functions.
3. Broad applicability: MFISTA performs well on sparse representation problems across many domains, including image processing, compressed sensing, and signal recovery.
4. Strong robustness: MFISTA is fairly robust to noise and to incomplete data, and can handle some complex practical problems.
Although MFISTA has achieved good results on sparse representation problems, it still has limitations. First, its convergence behavior depends on the initialization: different initial values may lead to different convergence results.
《神经网络导论》实验一Adaline的LMS算法专业班级硕2081学号**********姓名李海玥完成时间2012年12月《神经网络导论》实验一Adaline的LMS算法李海玥2012年12月一、实验目的1.通过实验了解Adaline的工作原理;2.对比LMS三种算法,并通过上机实验掌握具体的实现方法;3.与采用硬限幅函数的单个神经元模型进行对比,比较其异同。
II. Principle of the experiment

2.1 The Adaline principle
A single neuron with a hard-limiting activation function can, with a simple learning algorithm, successfully classify two linearly separable classes. For most classes that are not linearly separable, however, it cannot perform the classification, so the Adaline approach, a neuron with a linear activation function, is used instead.

Let the input vector be X = [x_1, x_2, ..., x_N] and the weight vector be W = [w_1, w_2, ..., w_N]. The neuron's output is

    I = W X^T = X W^T,    y = f(I) = W X^T = X W^T.

Following the least-squares principle, the goal is to minimize, over all samples, the mean-square value of the error between the desired response d and the actual output y. Define the error ε = d − y. Considering the mean-square error over all possible samples,

    E[ε²] = E[(d − y)²] = E[d²] + W R W^T − 2 P W^T,

where R ≡ E[X^T X] is the input correlation matrix and P = E[d X] is the cross-correlation vector between the input and the desired output. Setting the gradient to zero gives the optimal weight vector

    W* = P R^{-1}.

2.2 Exact recursive learning algorithm for the LMS problem
1. Set the initial weight vector W(0) arbitrarily;
2. For each time step k, adjust the weight vector according to

    W(k+1) = W(k) + 2α E[ε(k) X(k)].

2.3 Stochastic-approximation algorithm for the LMS problem
The exact recursion is modified to the form

    W(k+1) = W(k) + 2α(k) ε(k) X(k),

where
1) α(k) is a non-increasing function of the time index k;
2) Σ_{k=0}^∞ α(k) = ∞ and Σ_{k=0}^∞ α²(k) < ∞.

2.4 Statistics-based LMS learning algorithm
This is a learning algorithm with a certain amount of statistical averaging; its recursion is

    W(k+1) = W(k) + (2α/P) Σ_{p=1}^{P} ε_p(k) X_p(k).

If P is very large, this is equivalent to the exact recursive method; if P = 1, it reduces to the stochastic-approximation method.
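As a concrete illustration of the stochastic-approximation update W(k+1) = W(k) + 2α ε(k) X(k) described above, here is a small numpy sketch that trains an Adaline on a toy linearly separable problem. The data, step size, and epoch count are arbitrary choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-class, linearly separable data; targets d are +1 / -1.
n = 200
X = rng.normal(size=(n, 2))
d = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1.0, -1.0)
X = np.hstack([X, np.ones((n, 1))])    # append a bias input

w = np.zeros(3)                        # W(0): arbitrary initial weights
alpha = 0.01                           # fixed step size

for epoch in range(20):
    for x_k, d_k in zip(X, d):
        y_k = w @ x_k                  # linear output y = W X^T
        eps = d_k - y_k                # error epsilon = d - y
        w = w + 2 * alpha * eps * x_k  # LMS stochastic-approximation update

pred = np.sign(X @ w)                  # hard decision used only for evaluation
print("training accuracy:", np.mean(pred == d))
```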
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 55, NO. 10, NOVEMBER 20083077Adaptive Beamforming and Recursive DOA Estimation Using Frequency-Invariant Uniform Concentric Spherical ArraysH. H. Chen, Member, IEEE, S. C. Chan, Member, IEEE, Z. G. Zhang, Member, IEEE, and K. L. Ho, Senior Member, IEEEAbstract—This paper proposes recursive adaptive beamforming and broadband 2-D direction-of-arrival (DOA) estimation algorithms for uniform concentric spherical arrays (UCSAs) having nearly frequency-invariant (FI) characteristics. The basic principle of the FI-UCSA is to transform the received signals to the phase mode and remove the frequency dependency of individual phase modes through a digital beamforming network. Hence, the far-field pattern of the array is determined by a set of weights. Thanks to the FI characteristic, traditional narrowband adaptive beamforming algorithms such as minimum variance beamforming and the generalized sidelobe canceller method can be applied to the FI-UCSA. Simulation results show that the proposed adaptive FI-UCSA beamformer achieves a lower steady-state error and converges faster than the conventional tapped-delay line approach while requiring fewer adaptive coefficients. A new broadband 2-D DOA estimation algorithm using ESPRIT techniques for FI-UCSA is proposed to recursively estimate the DOAs of the moving targets. Simulation results show that the proposed DOA estimation algorithm achieves a satisfactory performance for slowly varying sources at low arithmetic complexity. Index Terms—Array processing, broadband 2-D direction-of-arrival (DOA) estimation, broadband adaptive beamforming, frequency-invariant (FI), subspace tracking, uniform concentric spherical array (UCSA).I. INTRODUCTION EAMFORMING using sensor arrays is an effective method for suppressing interferences whose angles of arrival are different from the desired looking direction. They find important applications in sonar, radar, acoustics, and radio communications [1]–[3]. Traditional adaptive broadband beamformers usually employ tapped-delay lines or linear transversal filters with adaptive coefficients to generate appropriate beampatterns for suppressing undesirable interference. This usually requires a considerable number of adaptive coefficients resulting in a rather long convergence time and high implementation complexity. These problems can beManuscript received September 11, 2007; revised February 15, 2008. First published April 30, 2008; current version published November 21, 2008. This paper was recommended by Associate Editor P. Regalia. H. H. Chen is with the Communication Systems Group, Darmstadt University of Technology, Darmstadt 64283, Germany (e-mail: haihua.chen@ nt.tu-darmstadt.de). S. C. Chan and K. L. Ho are with the Department of Electrical and Electronic Engineering, The University of Hong Kong, Hong Kong SAR (e-mail: scchan@eee.hku.hk; klho@eee.hku.hk). Z. G. Zhang is with the Department of Orthopaedics and Traumatology, The University of Hong Kong, Hong Kong SAR (e-mail: zgzhang@eee.hku.hk). Digital Object Identifier 10.1109/TCSI.2008.924126Bremedied by using subband decomposition technique, partial adaptation, or using frequency-invariant beamformers (FIBs) [4]–[8]. In FIBs, a beamforming network is used to generate a beampattern with approximately frequency-invariant (FI) characteristics over the frequency band of interest. They can attenuate broadband directional interference using an adaptive beamformer with very few adaptive filter coefficients [4]. 
One of the widely studied FIBs is the uniform linear array (ULA) FIB [4]–[6], [9]. Due to the geometry of ULA, it allows many efficient direction-of-arrival (DOA) detection algorithms to be developed. For example, the MUSIC algorithm [10] provides a high-resolution method for detecting the angle of arrival (AoA) of the signal sources based on the subspace approach. The MUSIC algorithm is also applicable to DOA estimation of wideband coherent sources by performing the algorithm in beamspace using ULA-FIB [7]. Besides AoA estimation of wideband sources, adaptive interference suppression using beamspace adaptive beamforming [4] is also very attractive because of the small number of adaptive coefficients required and the possibility of employing partial adaptation, yielding faster convergence and fewer high-speed variable multipliers. Recently, electronic steerable uniform-circular arrays (UCAs) [11] and uniform concentric circular arrays (UCCA) with FI characteristics were studied in [12]–[16]. The beampatterns of UCA and UCCA FIBs are almost FI and are governed by a set of weights or coefficients. An important advantage of UCA and UCCA FIBs is that the beampattern is electronically steerable and has a uniform resolution over 180 in the azimuth angle. In addition, the beampattern can be made adaptive using significantly fewer numbers of adaptive parameters than conventional broadband tapped delay line-based adaptive beamformers. Consequently, this leads to a lower arithmetic complexity, better numerical property and hence a higher output signal-to-inference-plus-noise ratio (SINR) than the conventional tapped-delay line approach [15]. However, due to the geometry of the UCCA, the beampattern is not arbitrarily steerable with respect to the elevation angle. To overcome these disadvantages, the authors proposed a uniform concentric spherical array (UCSA) with FI characteristics in [18] and [19]. The UCSA-FIB has all of the advantages of the UCCA mentioned above while possessing electronically steerable characteristic and uniform beampattern in both the azimuth and elevation angles. These characteristics make UCSA more suitable than the UCCA in two-dimensional (2-D) DOA estimation and spatial-time beamforming.1549-8328/$25.00 © 2008 IEEEAuthorized licensed use limited to: NANJING UNIVERSITY OF AERONAUTICS AND ASTRONAUTICS. Downloaded on September 5, 2009 at 00:59 from IEEE Xplore. Restrictions apply.3078IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 55, NO. 10, NOVEMBER 2008The basic idea of a UCSA-FIB is to transform each snapshot sampled by the array to phase mode signals via 2-D inverse discrete Fourier transform (IDFT). Each phase mode signal is then filtered by the compensation filter to compensate for the frequency dependency. Finally, these frequency invariant phase-modes are linear combined using a matrix of beamforming weights or coefficients to obtain the desired frequency invariant beampatterns. Based on the theory of the UCSA-FIB proposed in [19], the design of the beamformer is separated into two steps. The first step is to design the compensation filters phase mode by phase mode, which reduces largely the number of the variables and numerical difficulties, hence making the design process realizable. In addition, the compensation filters are designed using second order cone programming (SOCP) [17], which guarantees the optimal solution if it exists. The second step is to design the spatial weighting coefficient matrix for fixed frequency invariant beampatterns. 
Again, it can be designed using SOCP according to the direction of beam target and beamwidth. Broadband adaptive beamforming and DOA estimation algorithms using UCSA-FIBs are developed in [19]. The minimum variance beamforming (MVB) algorithm [20] and sample matrix inversion (SMI) method are employed in the adaptive beamforming algorithm. Thanks to the frequency invariant characteristic of the UCSA-FIB, high output SINRs can be obtained with very few adaptive coefficients. However, due to the matrix inversion operation required in the SMI method, the computational complexity is very high. Therefore, recursive adaptive beamforming algorithms using UCSA-FIB and generalized sidelobe canceller (GSC) structure are studied in this paper. Another purpose of this paper is to develop a robust adaptive beamforming algorithm for the proposed UCSA-FIB. Due to DOA estimation errors or array implementation imperfection such as calibration errors, the actual steering vector of the array may deviate from the assumed signal model. As a result, conventional beamforming methods such as MVB may suppress the desired signal as well as the interfering signals, causing severe performance degradation. A conventional approach is to limit the norm of the adaptive weight vector [21], the so-called norm constraint, to prevent the desired signal from being canceled. For recursive least-squares (RLS) implementation of the adaptive beamformer, this norm constraints can also be realized by adding a small identity matrix to the autocorrelation matrix of the sensor output. Because of this operation, this method is also called diagonal loading [21]. In this paper, we shall develop robust adaptive beamforming algorithms and diagonal loading approaches for UCSA-FIBs because of their good performance and low implementation complexity. The performance of the proposed robust UCSA-FIB is evaluated in the presence of DOA estimation errors. Simulation results show that the robust UCSA-FIB has a better performance than its original counterpart when there are DOA estimation errors. The DOA estimation algorithm for the UCSA-FIB proposed in [19] focuses mainly on static sources. However, in many applications such as communications, the sources are moving. To reduce such errors, recursive broadband 2-D DOA estimation algorithms using UCSA-FIB and subspace tracking method areFig. 1. UCCA with P rings and K sensors at each ring.also developed in this paper. The proposed DOA estimation algorithm is based on the unitary ESPRIT algorithm [22]–[24], which was originally proposed in [25] for narrow band circular array. In the ESPRIT algorithm, DOA estimations are obtained from the signal subspace, which can be estimated adaptively using subspace tracking techniques. Traditional subspace-based algorithms usually compute the eigenvalue decomposition (ED) or the singular value decomposition (SVD) of the data matrix in order to estimate the signal or noise subspace [26]. Instead of updating the whole eigen-structure, subspace tracking only works with the signal or noise subspace to lower the computational complexity and storage requirements. A very efficient subspace tracking algorithm is the projection approximation subspace tracking (PAST) algorithm [27]. The PAST algorithm considers the signal subspace as the solution of an unconstrained minimization problem, which can be solved by an exponentially weighted RLS problem by an appropriate project approximation. 
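Since the recursive DOA estimator builds on PAST, a minimal sketch of the standard exponentially weighted PAST recursion may help fix ideas. The data model, array size, and forgetting factor below are illustrative assumptions, and the unitary-ESPRIT and angle-pairing steps needed for actual broadband 2-D DOA estimation are omitted.

```python
import numpy as np

def past_track(snapshots, r, beta=0.97):
    """Track an r-dimensional signal subspace with the PAST recursion.
    snapshots: complex array of shape (M, N) -- one column per time step."""
    M, N = snapshots.shape
    W = np.eye(M, r, dtype=complex)          # subspace estimate
    P = np.eye(r, dtype=complex)             # inverse correlation of the projected data
    for n in range(N):
        x = snapshots[:, n]
        y = W.conj().T @ x                   # projection (approximation) of x
        h = P @ y
        g = h / (beta + np.real(y.conj() @ h))
        P = (P - np.outer(g, h.conj())) / beta
        e = x - W @ y                        # projection error
        W = W + np.outer(e, g.conj())        # rank-one subspace update
    return W

# Toy example: two complex exponentials in noise on an 8-element array.
rng = np.random.default_rng(0)
M, N = 8, 1000
a1 = np.exp(1j * np.pi * 0.2 * np.arange(M))
a2 = np.exp(1j * np.pi * -0.4 * np.arange(M))
S = rng.normal(size=(2, N)) + 1j * rng.normal(size=(2, N))
X = np.vstack([a1, a2]).T @ S + 0.1 * (rng.normal(size=(M, N)) + 1j * rng.normal(size=(M, N)))
W = past_track(X, r=2)

# Check how well the first steering vector lies in the tracked subspace.
proj = W @ np.linalg.lstsq(W, a1, rcond=None)[0]
print("relative residual of a1 outside the tracked subspace:",
      np.linalg.norm(a1 - proj) / np.linalg.norm(a1))
```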
To address the problem of moving sources, a variable forgetting factor (VFF) RLS algorithm originally proposed in [28] is incorporated into the PAST algorithm for broadband 2-D DOA estimation using the proposed UCSA-FIBs. Simulation results show that VFF-RLS algorithm obtains lower DOA estimation errors than traditional RLS algorithm when the sources move fast. The paper is organized as follows. In Section II, the theory and design of the UCSA-FIB are briefly reviewed. Recursive adaptive beamforming algorithms using UCSA-FIB and their robust counterparts are proposed in Section III. Section IV presents the subspace-based recursive 2-D DOA estimation algorithm. Design example and simulation results are given in corresponding sections. Finally, conclusions are drawn in Section V. II. FI UCSAS The UCSA proposed in this paper is constructed from a series of vertical UCCAs, which are uniformly distributed along the azimuth angle as shown in Fig. 1. Each UCCA is comrings and each ring has omnidirectional senposed of (represented as Cartesors located at sian coordinate with the center as the origin) where is the , , and radius of the th ring,Authorized licensed use limited to: NANJING UNIVERSITY OF AERONAUTICS AND ASTRONAUTICS. Downloaded on September 5, 2009 at 00:59 from IEEE Xplore. Restrictions apply.CHEN et al.: ADAPTIVE BEAMFORMING AND RECURSIVE DOA ESTIMATION USING FI UCSAS3079the elevation angle. Based on the UCCA described above, the steering vector of this ring can be written as(2) th ring remains The signal received by the sensors of the the same if the ring and source are rotated together to the deth ring is sired location. Therefore, the steering vector of the given by (2). However, the source is rotated instead of the ring plane, which makes the problem simpler. At the same time, this rotation will not alter the array beampattern. Equation (2) can also be written in a more concise form asFig. 2. Geometry of the UCSA.. In UCCAs, the inter-sensor spacing in each ring is fixed at , where is the smallest wavelength of the array to be operated. According to the propagation theory of electromagnetic waves, the steering vector can be written as follows [2], [15]:(3) Figs. 3 and 4 show the two parts of the structure of the broadband FIB for a -sphere UCSA. The spatial-temporal response of the UCSA as shown in [15] and [19] is given by (4), shown at the bottom of the page, where and , and are the numbers of phase modes in the dimension of elevation angle and azimuth angle, is the weight coefficient of respectively. phase mode, and is the frequency response of the th phase mode in the th compensation filter of the ring. In addition, we assume for simplicity that the weighting , as shown in Fig. 4. matrices be identical After some algebraic manipulations and approximations, (4) can be approximated as (5), shown at the bottom of the page, , where is an integer. It can be seen that, if with the term inside the bracket is independent of the frequency variable , then the beampattern will be approximately frequency invariant. It can also be seen that, for a single sphere with radius , the bandwidth of the array without compensation is deter. mined by the term Rings and hence spheres with small radii usually have better high frequency responses and vice versa. 
Therefore, to obtain a(1) where is the angular frequency variable, is the normalized , denotes the ratio radius and has the form of of the sampling frequency to the maximum frequency ( ), and and are the azimuth angle and the elevation angle respecttively, as shown in Fig. 2. , Around the diameter that has azimuth angle of each ring of the UCCA is rotated to angles of , to obtain rings that have different elevation angles. The -sphere UCSA is composed of all these rings. For notational simplicity, the sphere obtained this way is indexed by . The geometry of the UCSA obtained in this way is th ring, shown in Fig. 2. To derive the steering vector of the let us consider a ring located at an elevation angle of 90 with in the source clockwise situated at an angle of(4)(5)Authorized licensed use limited to: NANJING UNIVERSITY OF AERONAUTICS AND ASTRONAUTICS. Downloaded on September 5, 2009 at 00:59 from IEEE Xplore. Restrictions apply.3080IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 55, NO. 10, NOVEMBER 2008Fig. 3. First part of the UCSA-FIB block diagram for P spheres.Fig. 5. Frequency and spatial response against azimuth angles. Fig. 4. Second part of the UCSA-FIB block diagram for P spheres.FI-UCSA with large bandwidth, small responses of the Bessel function at certain frequencies have to be compensated by . This is undesirable in general because it leads to considerable noise amplification. Fortunately, by employing more spheres in a UCSA, spheres with different radii (each with different values of are able to contribute to the overall responses. Hence, a wider bandwidth can then be obtained. in each From (5), we can see that, if the filter phase mode is designed such that (6), shown at the bottom of the and are respectively the lower page, is satisfied, where and upper frequencies of interest, then the beamformer in (4) and will be approximately FI within(7) Furthermore, its far-field pattern is now governed by the spatial alone. They can either be designed offline weighting to achieve different but fixed beampatterns or be varied online for adaptive beamforming. Since the left-hand side of (6) is a ’s, the delinear function of the filter coefficients in sign problem in (6) can be treated as a filter design problem with all the filter outputs adding up to a desired magnitude response of value 1 with linear phase response. If the minimax can error criterion is used, the filter coefficients of be determined by SOCP [17]. The detailed formulation of thedesign problem using SOCP is omitted here due to page limitation. Interested readers are referred to [15] and [19] for details. To avoid amplifying the noise too much, we also impose constraints on the magnitude responses of the compensation filters, , where is certain constant. This constraint is convex and can be readily enforced to the FIB design problem. It can also be seen from (7) that the far-field spatial response is similar to that of a 2-D digital FIR filter with impulse response . Therefore, can be designed by conventional 2-D filter design algorithms such as window method or SOCP if convex quadratic constraints are to be imposed. We now consider a design example on UCSA-FIB. 1) Example 1: UCSA-FIB With Two Spheres: A two-sphere UCSA is considered in this example. It is obtained by rotating UCCAs with two rings. The inner ring and the outer ring have and omnidirectional sensors, respectively. The inner ring and outer ring are then rotated with , , respectively. The required bandwidth . 
The numbers of phase of the UCSA-FIB is and , modes are set as We choose the central 9 9 spatial filter coefficients (phase mode) out of the 17 17 to shape the spatial response of the UCSA-FIB. The desired beam is targeted at and the beamwidth is 10 . The magnitudes of the compensation filters in the first sphere and second sphere are constrained to be no larger than 1 and 20, respectively. The spatial weights are designed using SOCP. The perspective views of the frequency response against azimuth angle is shown in Fig. 5.(6)Authorized licensed use limited to: NANJING UNIVERSITY OF AERONAUTICS AND ASTRONAUTICS. Downloaded on September 5, 2009 at 00:59 from IEEE Xplore. Restrictions apply.CHEN et al.: ADAPTIVE BEAMFORMING AND RECURSIVE DOA ESTIMATION USING FI UCSAS3081some discussions on the frequency band of the UCSA-FIB. As can be seen from (5), the beamformer will be approximately frequency invariant if (6) is satisfied. Theoretically, we can design a UCSA that is approximately FI over the whole band. with However, the value of the Bessel function closes to zero when goes to 0, which requires the magnitudes of the compensation filters to be very large to satisfy (6). Such compensation filters will amplify the noise to a quite high level and make it difficult for the following signal processing. As we will see later in Sections III and IV, the bandwidth of is realizable and such a band is sufficiently wide for many broadband applications. III. ADAPTIVE BEAMFORMING USING UCSA-FIBFig. 6. Spatial responses of the UCSA-FIB in different frequency samples against azimuth angle.The frequency invariant performance is very satisfactory. To illustrate the frequency invariant performance more clearly, the frequency responses of the UCSA-FIB at 128 different are plotted together in frequency values within Fig. 6. The spatial responses at different frequencies almost overlap, which implies the array response is approximately FI. The FI performance against the elevation angle is also satisfying. It is not shown here, and interested readers are referred to [14] and [17] for more details. Like their FI-UCCA counterparts [13], [14], the above FIB can be modulated to form a bank of FIBs for broadband beamspace DOA estimation. Such a nice property can be seen from (5) that the weight coefficients and the compensation filters are separable. Since the weight coefficients determine the spatial response of the beamformer while the compensation filters guarantee the approximately frequency invariancy, different FI beampatterns can be obtained by changing the weight matrix only. For example, we have designed a beampattern with a weight that targets on (0 , 0 ). To obtain a beampattern matrix , we can modulate the weight matrix by that targets on , where , denotes the Hadamard (element-wise) matrix multiplication, and the superscript denotes the transpose. Moreover, the weights can be made adaptive to form adaptive UCSA-FIB with very few variable multipliers and fast convergence speed as we shall see later in Section III. Before closing this section, we makeThe UCSA-FIBs designed in the previous section are electronically steerable in both azimuth and elevation angles, which makes them more appropriate than UCCA-FIB for 2-D adaptive beamforming and 2-D DOA estimation. broadband signals Assuming that there are impinging a -sphere UCSA at angles of , and . 
According to (7), the output of the UCSA-FIB can be written as(8) where trum of the impinging signals is the spec,along with the equations shown at the bottom of the is the Fourier transform of the additive page, and at the th white Gaussian sensor noise element of the th sphere. From the theory of UCSA-FIB described in Section II, we know that the vector is designed to be FI and hence , . Therefore, the output of the UCSA-FIB in (8) can be simplified to(9)Authorized licensed use limited to: NANJING UNIVERSITY OF AERONAUTICS AND ASTRONAUTICS. Downloaded on September 5, 2009 at 00:59 from IEEE Xplore. Restrictions apply.3082IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 55, NO. 10, NOVEMBER 2008For notation convenience, we shall replace the approximation sign by the equality sign and assume that the errors are absorbed . Taking the IDFT, one into the sensor noise terms gets the time-domain expression of the output as follows:(10)Fig. 7. Structure of the MVDR beamformer.where , with Fourier transform , is the output noise of the phase mode of the beamformer, is the th compensated phase mode and signal as shown in Fig. 4. Before proceeding further, we in, which maps a matrix to a troduce an operator 1 vector by stacking the elements of the matrix column by column. With this operator, (10) can be written in a more compact form as (11) where the superscript denotes the Hermitian transpose, , , , . The components and of matrices , , and are , , , and , respectively. By rearranging the 2-D beamforming problem in the form of (11), traditional adaptive beamforming algorithms such as minimum variance distortionless response (MVDR) [20] can be apis the arriving angle of the desired plied. Assume signal. To recover the desired signal from the array output, the MVDR algorithm minimizes the output energy while requiring the response of the array at the desired looking direction to be 1, hence the name MVDR. The structure of the MVDR algorithm is shown in Fig. 7. The input signals of the beamformer in (11), which are the compensated phase mode vector is delayed and weighted to get the signal at the desired direction. Each delay line is a linear transversal filter and is called a tapped-delay line. Let the length of the tapped-delay line be , the MVDR problem can be written as (12) where correlation matrix of the data vector is the autoFig. 8. Structure for GSC structure.Given a series of snapshots , say , the autocorrelation matrix can be approximated by . Thus, can be oband substituting it into tained by inverting the matrix the right-hand side of (14). This is called the sample matrix inversion (SMI) method, and it is computationally expensive because of the matrix inversion involved. Therefore, we shall solve the weight vector recursively using adaptive filtering algorithms such as the RLS algorithm and the least mean-squares (LMS) algorithm using the Generalized Sidelobe Canceller , all of the 1-D (GSC) structure [30]. With the operator recursive adaptive beamforming algorithms can be employed in the 2-D adaptive beamforming using UCSA-FIB. The structure of the GSC beamformer using UCSA-FIB is shown in Fig. 8. The beamformer input is the phase mode signal vector obtained . from the phase mode signal matrix with the operator In the GSC, the weighting vector is decomposed into two and the adaptive part . The fixed parts: the fixed part weight vector , as shown in the upper part in Fig. 
8, forms the main beam that is steered towards some assumed propagahas the following tion direction. Therefore, the weight vector form: (15)andandtake the following form: .. (13).This constrained optimization can be solved analytically and the optimal solution is (14)In the proposed UCSA-FIB, the direction of the beam can be readily steered to the desired looking direction by modulating the beam weight matrix with an appropriate sinusoidal matrix. is continuously updated in order to reThe adaptive part move any undesired signals other than the looking direction from appearing at the array output. With the blocking matrix , the input to the adaptive part mainly consists of the undesirable signals. The adaptively weighted interference signal isAuthorized licensed use limited to: NANJING UNIVERSITY OF AERONAUTICS AND ASTRONAUTICS. Downloaded on September 5, 2009 at 00:59 from IEEE Xplore. Restrictions apply.CHEN et al.: ADAPTIVE BEAMFORMING AND RECURSIVE DOA ESTIMATION USING FI UCSAS3083then subtracted from the main beam in order to cancel the interference that is present in the main beam. This is achieved by minimizing the output energy of the beamformer using either the RLS or LMS adaptive filtering algorithms. Leakage of the desired signal through the adaptive part will lead to the annihilation or attenuation of the desired signal, i.e., signal cancellation. Robust beamforming techniques are usually employed to avoid this problem, which will be discussed later in this section. is upIn the LMS algorithm, the adaptive weight vector dated in the negative direction of the gradient of the MSE funcas tion (16) where and are defined in (13). is the blocking mais the stepsize parameter. It is called normaltrix and is determined by ized LMS (NLMS) algorithm when , where is the normalized stepsize parameter and is a small positive constant to ensure that will not be very large when the norm of the input is very small. In the RLS algorithm, the inverse of the autocorrelation matrix is updated recursively and the adaptive weight vector is updated as [31] (17) wherewhere is the Lagrange multiplier. Due to the difficulty of choosing an appropriate value of in practical implementation, some alternative methods were proposed in [32]–[34]. The scaled projection (SP) method in [32] is applied to the proposed UCSA-FIB in this paper. In the SP method, the weight vector is updated directly from the norm constraint. does For the LMS-GSC algorithm, the fixed weight vector not depend on the covariance matrix of the beamformer input, is and it has the same solution as in (15). The adaptive part adapted with the following scaled projection algorithm [32]: for for (20)is where is the norm constraint of the weight vector and in (16). For robust RLS algoequal to the weight vector rithm, the forgetting factor is updated adaptively and the adaptive weight vector is updated as [35] for for where , , , is equal to the weight vector , , means in (21)is the Kalman gain, is the inverse of the autocorrelation matrix, and it can be updated as with , is is nonsingular initially, and a small number to ensure that is the forgetting factor that controls the tracking ability and steady-state error of the RLS algorithm. In practical applications, the actual array response may deviate from the assumed signal model due to DOA estimation and other implementation errors. As a result, conventional beamforming methods such as GSC may suppress the desired signal as well as the interfering signals. 
Robust beamforming algorithms are usually employed to remedy this problem. In the method of diagonal loading [32], the norm of the weight vector is constrained in order to avoid severe signal cancellation. Adopting this idea in our UCSA-FIB, the robust adaptive beamforming problem is written as(18) where is a constant norm bound for the weight vector. The analytical solution can be obtained by employing the Lagrange multiplier technique(19)real part of and (17). Before presenting the simulation results, the arithmetic complexities of the traditional UCSA beamformer (the beamformer using UCSA without compensation filters) and the UCSA-FIB are compared roughly. A digital beamformer usually consists of the complexities for the fixed filtering and the adaptive filtering parts [14]. The order of the arithmetic complexity per sample for the fixed filtering part is usually proportional to the filter length, while the order of arithmetic complexity per unit time for the adaptive filtering part depends on the algorithm used. Let denote the number of adaptive coefficients, if the blocking matrix has dimension of , the arithmetic complexities per unit time for LMS-GSC are of order . Let denote the length of the adaptive tapped-delay line and denote the length of the compensation filters in a UCSA-FIB, the complexity of the fixed and adaptive parts are, respectively, and , where , , , is the number of the usable phase modes of the th sphere and the size of the blocking matrix is . Let denote the length of the broadband fractional delay filters used in the traditional UCSA, the complexity of the fixed and adaptive filtering parts are, respectively, and , given by , , , where is the number of the sensors in the th sphere and the size of the blocking matrix is . In this example, the numbers of sensors in the two spheres of the UCSA are, respectively, 4 4 and 6 6. The length ( ) of the compensation filters required is 31 and the number of usable phase modes is 3 3 in both of the two spheres. On the other hand, the length ( ) of the broadband fractional delay filters usedAuthorized licensed use limited to: NANJING UNIVERSITY OF AERONAUTICS AND ASTRONAUTICS. Downloaded on September 5, 2009 at 00:59 from IEEE Xplore. Restrictions apply.。
Analysis of the LMS, APA, NLMS, and FRLS Algorithms
The LMS algorithm is one of the most commonly used adaptive filtering algorithms; it is built on the minimum mean-square error (MSE) criterion. Its basic idea is to take the mean-square error between the desired response and the observed output as the performance index and to drive that error down, thereby arriving at an optimal filter design. LMS can adjust the filter coefficients quickly and efficiently so as to maximize the filter's ability to suppress noise, improving the estimate step by step as new samples arrive.
The APA (affine projection algorithm) is another commonly used adaptive filtering algorithm. Instead of updating the filter from the current input vector alone, it constrains the update with a small block of recent input vectors, so it accounts not only for the strength of the input and noise but also for their correlation over time, which lets it suppress correlated noise more effectively. In most cases APA is more effective and more stable than LMS, and its coefficient updates are smoother.
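For reference, here is a minimal sketch of the standard affine-projection update of order K, in textbook form rather than any specific variant described above; the step size, regularization, and signal sizes are illustrative assumptions.

```python
import numpy as np

def apa_update(w, X_block, d_block, mu=0.5, delta=1e-6):
    """One affine projection update.
    X_block: (K, N) matrix whose rows are the K most recent input vectors.
    d_block: (K,) desired responses for those inputs."""
    e = d_block - X_block @ w                       # a-priori errors for the block
    # Solve the small K x K system instead of inverting an N x N matrix.
    gain = np.linalg.solve(X_block @ X_block.T + delta * np.eye(len(d_block)), e)
    return w + mu * X_block.T @ gain

# Toy identification of an unknown 8-tap filter from noisy observations.
rng = np.random.default_rng(0)
h_true = rng.normal(size=8)
x = rng.normal(size=2000)
d = np.convolve(x, h_true, mode="full")[:2000] + 0.01 * rng.normal(size=2000)

K, w = 4, np.zeros(8)
for n in range(8 + K, 2000):
    # Row k holds the input vector [x(n-k), x(n-k-1), ..., x(n-k-7)].
    X_block = np.array([x[n - k - 7 : n - k + 1][::-1] for k in range(K)])
    d_block = np.array([d[n - k] for k in range(K)])
    w = apa_update(w, X_block, d_block)

print("max tap error:", np.max(np.abs(w - h_true)))
```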
The NLMS (normalized LMS) algorithm is a variant of LMS in which the adaptation step is normalized by the energy of the current input vector, so that the effective step size adapts to the signal level. Its main advantages are good accuracy, fast convergence, and solid performance even in heavily noisy conditions. It is also easier to implement than many alternatives, because the only extra intermediate quantity it needs is this simple normalization of the update term, with no matrix inversion at all.
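A terse sketch of the NLMS update in a system-identification setting, showing the input-power normalization that replaces any matrix inversion (all sizes and constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
h_true = rng.normal(size=8)                     # unknown system to identify
x = rng.normal(size=5000)
d = np.convolve(x, h_true, mode="full")[:5000] + 0.01 * rng.normal(size=5000)

w, mu, eps = np.zeros(8), 0.5, 1e-8
for n in range(8, 5000):
    u = x[n - 7 : n + 1][::-1]                  # current input vector [x(n), ..., x(n-7)]
    e = d[n] - w @ u                            # a-priori error
    w += mu * e * u / (u @ u + eps)             # step normalized by input energy

print("max tap error:", np.max(np.abs(w - h_true)))
```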
The FRLS (fast recursive least-squares) algorithm is an adaptive filtering algorithm that approximates the least-squares solution; it is built around propagating the required inverse quantities recursively rather than recomputing them, which lets it handle efficiently cases where direct matrix inversion would be costly.