LMS AND SMI ALGORITHMS FOR SPATIAL ADAPTIVE INTERFERENCE REJECTION
A Model-Agnostic Local Explanation Method Based on Active Learning and the Rational Quadratic Kernel
ZHOU Shenghao; YUAN Weiwei; GUAN Donghai
【Journal】Computer Science (《计算机科学》)
【Year (Volume), Issue】2024, 51(2)
【Abstract】The widespread use of deep learning models has made it increasingly clear that explaining model decisions is an urgent problem: complex, hard-to-interpret black-box models hinder the deployment of these algorithms in real-world settings. LIME, the most popular local explanation method, generates perturbed data that are unstable, which biases the resulting explanations. To address this, a model-agnostic local explanation method based on active learning and the rational quadratic kernel, ActiveLIME, is proposed, which makes the local explanation model more faithful to the original classifier. After generating perturbations, ActiveLIME samples them with an active-learning query strategy, keeps the high-uncertainty perturbations for training, and uses the most accurate local model found during the iterations to explain the instance of interest. In addition, for high-dimensional sparse samples that are prone to local overfitting, the rational quadratic kernel is introduced into the model's loss function to reduce overfitting. Experiments show that ActiveLIME achieves higher local fidelity and better explanation quality than traditional local explanation methods.
【Pages】7 (pp. 245-251)
【Authors】ZHOU Shenghao; YUAN Weiwei; GUAN Donghai
【Affiliation】College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics
【Language】Chinese
【CLC Classification】TP391
【Related literature】
1. A local causal structure active learning method based on causal strength
2. A weighted improvement of the local gray-level model in the active shape model algorithm
3. Research on locating rivet holes in aircraft panels based on a local active contour model
4. An anomaly detection method based on fuzzy kernel clustering and active learning
5. A fast active learning method based on the kernel extreme learning machine and its soft-sensor application
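The abstract above combines two ingredients: an active-learning selection of perturbations and a rational quadratic kernel in the loss of the local surrogate. The sketch below is a minimal, hypothetical illustration of the second ingredient only, reading the kernel as the proximity weighting in a LIME-style weighted loss; the function names, the use of ridge regression, and all parameter values are assumptions for illustration, not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import Ridge

def rational_quadratic(d, alpha=1.0, length=1.0):
    # Rational quadratic kernel k(d) = (1 + d^2 / (2*alpha*length^2))^(-alpha);
    # heavier-tailed than the Gaussian proximity kernel used by default in LIME.
    return (1.0 + d**2 / (2.0 * alpha * length**2)) ** (-alpha)

def fit_local_surrogate(x, perturbations, black_box, alpha=1.0, length=1.0):
    """Fit a locally weighted linear surrogate around instance x.

    `black_box` is assumed to return the original classifier's probability for
    the class of interest as a 1-D array; the kernel weights emphasise
    perturbations close to x. The active-learning step of ActiveLIME (keeping
    only high-uncertainty perturbations) is omitted here for brevity.
    """
    dists = np.linalg.norm(perturbations - x, axis=1)
    weights = rational_quadratic(dists, alpha, length)
    targets = black_box(perturbations)
    surrogate = Ridge(alpha=1.0)
    surrogate.fit(perturbations, targets, sample_weight=weights)
    return surrogate  # the coefficients serve as the local explanation
```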
GSPBOX_-Atoolboxforsignalprocessingongraphs_GSPBOX:A toolbox for signal processing on graphsNathanael Perraudin,Johan Paratte,David Shuman,Lionel Martin Vassilis Kalofolias,Pierre Vandergheynst and David K.HammondMarch 16,2016AbstractThis document introduces the Graph Signal Processing Toolbox (GSPBox)a framework that can be used to tackle graph related problems with a signal processing approach.It explains the structure and the organization of this software.It also contains a general description of the important modules.1Toolbox organizationIn this document,we brie?y describe the different modules available in the toolbox.For each of them,the main functions are brie?y described.This chapter should help making the connection between the theoretical concepts introduced in [7,9,6]and the technical documentation provided with the toolbox.We highly recommend to read this document and the tutorial before using the toolbox.The documentation,the tutorials and other resources are available on-line 1.The toolbox has ?rst been implemented in MATLAB but a port to Python,called the PyGSP,has been made recently.As of the time of writing of this document,not all the functionalities have been ported to Python,but the main modules are already available.In the following,functions pre?xed by [M]:refer to the MATLAB implementation and the ones pre?xed with [P]:refer to the Python implementation. 1.1General structure of the toolbox (MATLAB)The general design of the GSPBox focuses around the graph object [7],a MATLAB structure containing the necessary infor-mations to use most of the algorithms.By default,only a few attributes are available (see section 2),allowing only the use of a subset of functions.In order to enable the use of more algorithms,additional ?elds can be added to the graph structure.For example,the following line will compute the graph Fourier basis enabling exact ?ltering operations.1G =gsp_compute_fourier_basis(G);Ideally,this operation should be done on the ?y when exact ?ltering is required.Unfortunately,the lack of well de?ned class paradigm in MATLAB makes it too complicated to be implemented.Luckily,the above formulation prevents any unnecessary data copy of the data contained in the structure G .In order to avoid name con?icts,all functions in the GSPBox start with [M]:gsp_.A second important convention is that all functions applying a graph algorithm on a graph signal takes the graph as ?rst argument.For example,the graph Fourier transform of the vector f is computed by1fhat =gsp_gft(G,f);1Seehttps://lts2.epfl.ch/gsp/doc/for MATLAB and https://lts2.epfl.ch/pygsp for Python.The full documentation is also avail-able in a single document:https://lts2.epfl.ch/gsp/gspbox.pdf1a r X i v :1408.5781v 2 [c s .I T ] 15 M a r 2016The graph operators are described in section4.Filtering a signal on a graph is also a linear operation.However,since the design of special?lters(kernels)is important,they are regrouped in a dedicated module(see section5).The toolbox contains two additional important modules.The optimization module contains proximal operators,projections and solvers compatible with the UNLocBoX[5](see section6).These functions facilitate the de?nition of convex optimization problems using graphs.Finally,section??is composed of well known graph machine learning algorithms.1.2General structure of the toolbox(Python)The structure of the Python toolbox follows closely the MATLAB one.The major difference comes from the fact that the Python implementation is object-oriented and thus allows for 
a natural use of instances of the graph object.For example the equivalent of the MATLAB call:1G=gsp_estimate_lmax(G);can be achieved using a simple method call on the graph object:1G.estimate_lmax()Moreover,the use of class for the"graph object"allows to compute additional graph attributes on the?y,making the code clearer as its MATLAB equivalent.Note though that functionalities are grouped into different modules(one per section below) and that several functions that work on graphs have to be called directly from the modules.For example,one should write:1layers=pygsp.operators.kron_pyramid(G,levels)This is the case as soon as the graph is the structure on which the action has to be performed and not our principal focus.In a similar way to the MATLAB implementation using the UNLocBoX for the convex optimization routines,the Python implementation uses the PyUNLocBoX,which is the Python port of the UNLocBoX. 2GraphsThe GSPBox is constructed around one main object:the graph.It is implemented as a structure in Matlab and as a class in Python.It stores the nodes,the edges and other attributes related to the graph.In the implementation,a graph is fully de?ned by the weight matrix W,which is the main and only required attribute.Since most graph structures are far from fully connected, W is implemented as a sparse matrix.From the weight matrix a Laplacian matrix is computed and stored as an attribute of the graph object.Different other attributes are available such as plotting attributes,vertex coordinates,the degree matrix,the number of vertices and edges.The list of all attributes is given in table1.2Attribute Format Data type DescriptionMandatory?eldsW N x N sparse matrix double Weight matrix WL N x N sparse matrix double Laplacian matrixd N x1vector double The diagonal of the degree matrixN scalar integer Number of verticesNe scalar integer Number of edgesplotting[M]:structure[P]:dict none Plotting parameterstype text string Name,type or short descriptiondirected scalar[M]:logical[P]:boolean State if the graph is directed or notlap_type text string Laplacian typeOptional?eldsA N x N sparse matrix[M]:logical[P]:boolean Adjacency matrixcoords N x2or N x3matrix double Vectors of coordinates in2D or3D.lmax scalar double Exact or estimated maximum eigenvalue U N x N matrix double Matrix of eigenvectorse N x1vector double Vector of eigenvaluesmu scalar double Graph coherenceTable1:Attributes of the graph objectThe easiest way to create a graph is the[M]:gsp_graph[P]:pygsp.graphs.Graph function which takes the weight matrix as input.This function initializes a graph structure by creating the graph Laplacian and other useful attributes.Note that by default the toolbox uses the combinatorial de?nition of the Laplacian operator.Other Laplacians can be computed using the[M]:gsp_create_laplacian[P]:pygsp.gutils.create_laplacian function.Please note that almost all functions are dependent of the Laplacian de?nition.As a result,it is important to select the correct de?nition at? rst.Many particular graphs are also available using helper functions such as:ring,path,comet,swiss roll,airfoil or two moons. 
In addition,functions are provided for usual non-deterministic graphs suchas:Erdos-Renyi,community,Stochastic Block Model or sensor networks graphs.Nearest Neighbors(NN)graphs form a class which is used in many applications and can be constructed from a set of points (or point cloud)using the[M]:gsp_nn_graph[P]:pygsp.graphs.NNGraph function.The function is highly tunable and can handle very large sets of points using FLANN[3].Two particular cases of NN graphs have their dedicated helper functions:3D point clouds and image patch-graphs.An example of the former can be seen in thefunction[M]:gsp_bunny[P]:pygsp.graphs.Bunny.As for the second,a graph can be created from an image by connecting similar patches of pixels together.The function[M]:gsp_patch_graph creates this graph.Parameters allow the resulting graph to vary between local and non-local and to use different distance functions [12,4].A few examples of the graphs are displayed in Figure1.3PlottingAs in many other domains,visualization is very important in graph signal processing.The most basic operation is to visualize graphs.This can be achieved using a call to thefunction[M]:gsp_plot_graph[P]:pygsp.plotting.plot_graph. In order to be displayable,a graph needs to have2D(or3D)coordinates(which is a?eld of the graph object).Some graphs do not possess default coordinates(e.g.Erdos-Renyi).The toolbox also contains routines to plot signals living on graphs.The function dedicated to this task is[M]:gsp_plot_ signal[P]:pygsp.plotting.plot_signal.For now,only1D signals are supported.By default,the value of the signal is displayed using a color coding,but bars can be displayed by passing parameters.3Figure 1:Examples of classical graphs :two moons (top left),community (top right),airfoil (bottom left)and sensor network (bottom right).The third visualization helper is a function to plot ?lters (in the spectral domain)which is called [M]:gsp_plot_filter [P]:pygsp.plotting.plot_filter .It also supports ?lter-banks and allows to automatically inspect the related frames.The results obtained using these three plotting functions are visible in Fig.2.4OperatorsThe module operators contains basics spectral graph functions such as Fourier transform,localization,gradient,divergence or pyramid decomposition.Since all operator are based on the Laplacian de? nition,the necessary underlying objects (attributes)are all stored into a single object:the graph.As a ?rst example,the graph Fourier transform [M]:gsp_gft [P]:pygsp.operators.gft requires the Fourier basis.This attribute can be computed with the function [M]:gsp_compute_fourier_basis[P]:/doc/c09ff3e90342a8956bec0975f46527d3240ca692.html pute_fourier_basis [9]that adds the ?elds U ,e and lmax to the graph structure.As a second example,since the gradient and divergence operate on the edges of the graph,a search on the edge matrix is needed to enable the use of these operators.It can be done with the routines [M]:gsp_adj2vec[P]:pygsp.operators.adj2vec .These operations take time and should4Figure 2:Visualization of graph and signals using plotting functions.NameEdge derivativefe (i,j )Laplacian matrix (operator)Available Undirected graph Combinatorial LaplacianW (i,j )(f (j )?f (i ))D ?WV Normalized Laplacian W (i,j ) f (j )√d (j )f (i )√d (i )D ?12(D ?W )D ?12V Directed graph Combinatorial LaplacianW (i,j )(f (j )?f (i ))12(D ++D ??W ?W ?)V Degree normalized Laplacian W (i,j ) f (j )√d ?(j )?f (i )√d +(i )I ?12D ?12+[W +W ?]D ?12V Distribution normalized Laplacianπ(i ) p (i,j )π(j )f (j )? 
p (i,j )π(i )f (i )12 Π12PΠ12+Π?12P ?Π12 VTable 2:Different de?nitions for graph Laplacian operator and their associated edge derivative.(For directed graph,d +,D +and d ?,D ?de?ne the out degree and in-degree of a node.π,Πis the stationary distribution of the graph and P is a normalized weight matrix W .For sake of clarity,exact de?nition of those quantities are not given here,but can be found in [14].)be performed only once.In MATLAB,these functions are called explicitly by the user beforehand.However,in Python they are automatically called when needed and the result stored as an attribute. The module operator also includes a Multi-scale Pyramid Transform for graph signals [6].Again,it works in two steps.Firstthe pyramid is precomputed with [M]:gsp_graph_multiresolution [P]:pygsp.operators.graph_multiresolution .Second the decomposition of a signal is performed with [M]:gsp_pyramid_analysis [P]:pygsp.operators.pyramid_analysis .The reconstruction uses [M]:gsp_pyramid_synthesis [P]:pygsp.operators.pyramid_synthesis .The Laplacian is a special operator stored as a sparse matrix in the ?eld L of the graph.Table 2summarizes the available de?nitions.We are planning to implement additional ones.5FiltersFilters are a special kind of linear operators that are so prominent in the toolbox that they deserve their own module [9,7,2,8,2].A ?lter is simply an anonymous function (in MATLAB)or a lambda function (in Python)acting element-by-element on the input.In MATLAB,a ?lter-bank is created simply by gathering these functions together into a cell array.For example,you would write:51%g(x)=x^2+sin(x)2g=@(x)x.^2+sin(x);3%h(x)=exp(-x)4h=@(x)exp(-x);5%Filterbank composed of g and h6fb={g,h};The toolbox contains many prede?ned design of?lter.They all start with[M]:gsp_design_in MATLAB and are in the module[P]:pygsp.filters in Python.Once a?lter(or a?lter-bank)is created,it can be applied to a signal with[M]: gsp_filter_analysis in MATLAB and a call to the method[P]:analysis of the?lter object in Python.Note that the toolbox uses accelerated algorithms to scale almost linearly with the number of sample[11].The available type of?lter design of the GSPBox can be classi?ed as:Wavelets(Filters are scaled version of a mother window)Gabor(Filters are shifted version of a mother window)Low passlter(Filters to de-noise a signal)High pass/Low pass separationlterbank(tight frame of2lters to separate the high frequencies from the low ones.No energy is lost in the process)Additionally,to adapt the?lter to the graph eigen-distribution,the warping function[M]:gsp_design_warped_translates [P]:pygsp.filters.WarpedTranslates can be used[10].6UNLocBoX BindingThis module contains special wrappers for the UNLocBoX[5].It allows to solve convex problems containing graph terms very easily[13,15,14,1].For example,the proximal operator of the graph TV norm is given by[M]:gsp_prox_tv.The optimization module contains also some prede?ned problems such as graph basis pursuit in[M]:gsp_solve_l1or wavelet de-noising in[M]:gsp_wavelet_dn.There is still active work on this module so it is expected to grow rapidly in the future releases of the toolbox.7Toolbox conventions7.1General conventionsAs much as possible,all small letters are used for vectors(or vector stacked into a matrix)and capital are reserved for matrices.A notable exception is the creation of nearest neighbors graphs.A variable should never have the same name as an already existing function in MATLAB or Python respectively.This makes the code easier to read and less prone to 
errors.This is a best coding practice in general,but since both languages allow the override of built-in functions,a special care is needed.All function names should be lowercase.This avoids a lot of confusion because some computer architectures respect upper/lower casing and others do not.As much as possible,functions are named after the action they perform,rather than the algorithm they use,or the person who invented it.No global variables.Global variables makes it harder to debug and the code is harder to parallelize.67.2MATLABAll function start by gsp_.The graph structure is always therst argument in the function call.Filters are always second.Finally,optional parameter are last.In the toolbox,we do use any argument helper functions.As a result,optional argument are generally stacked into a graph structure named param.If a transform works on a matrix,it will per default work along the columns.This is a standard in Matlab(fft does this, among many other functions).Function names are traditionally written in uppercase in MATLAB documentation.7.3PythonAll functions should be part of a module,there should be no call directly from pygsp([P]:pygsp.my_function).Inside a given module,functionalities can be further split in differentles regrouping those that are used in the same context.MATLAB’s matrix operations are sometimes ported in a different way that preserves the efciency of the code.When matrix operations are necessary,they are all performed through the numpy and scipy libraries.Since Python does not come with a plotting library,we support both matplotlib and pyqtgraph.One should install the required libraries on his own.If both are correctly installed,then pyqtgraph is favoured unless speci?cally speci?ed. AcknowledgementsWe would like to thanks all coding authors of the GSPBOX.The toolbox was ported in Python by Basile Chatillon,Alexandre Lafaye and Nicolas Rod.The toolbox was also improved by Nauman Shahid and Yann Sch?nenberger.References[1]M.Belkin,P.Niyogi,and V.Sindhwani.Manifold regularization:A geometric framework for learning from labeled and unlabeledexamples.The Journal of Machine Learning Research,7:2399–2434,2006.[2] D.K.Hammond,P.Vandergheynst,and R.Gribonval.Wavelets on graphs via spectral graph theory.Applied and ComputationalHarmonic Analysis,30(2):129–150,2011.[3]M.Muja and D.G.Lowe.Scalable nearest neighbor algorithms for high dimensional data.Pattern Analysis and Machine Intelligence,IEEE Transactions on,36,2014.[4]S.K.Narang,Y.H.Chao,and A.Ortega.Graph-wavelet?lterbanks for edge-aware image processing.In Statistical Signal ProcessingWorkshop(SSP),2012IEEE,pages141–144.IEEE,2012.[5]N.Perraudin,D.Shuman,G.Puy,and P.Vandergheynst.UNLocBoX A matlab convex optimization toolbox using proximal splittingmethods.ArXiv e-prints,Feb.2014.[6] D.I.Shuman,M.J.Faraji,and P.Vandergheynst.A multiscale pyramid transform for graph signals.arXiv preprint arXiv:1308.4942,2013.[7] D.I.Shuman,S.K.Narang,P.Frossard,A.Ortega,and P.Vandergheynst.The emerging?eld of signal processing on graphs:Extendinghigh-dimensional data analysis to networks and other irregular domains.Signal Processing Magazine,IEEE,30(3):83–98,2013.7[8] D.I.Shuman,B.Ricaud,and P.Vandergheynst.A windowed graph Fourier transform.Statistical Signal Processing Workshop(SSP),2012IEEE,pages133–136,2012.[9] D.I.Shuman,B.Ricaud,and P.Vandergheynst.Vertex-frequency analysis on graphs.arXiv preprint arXiv:1307.5708,2013.[10] D.I.Shuman,C.Wiesmeyr,N.Holighaus,and P.Vandergheynst.Spectrum-adapted tight graph wavelet and 
vertex-frequency frames. arXiv preprint arXiv:1311.0897, 2013.
[11] A. Susnjara, N. Perraudin, D. Kressner, and P. Vandergheynst. Accelerated filtering on graphs using Lanczos method. arXiv preprint arXiv:1509.04537, 2015.
[12] F. Zhang and E. R. Hancock. Graph spectral image smoothing using the heat kernel. Pattern Recognition, 41(11):3328–3342, 2008.
[13] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. Advances in Neural Information Processing Systems, 16(16):321–328, 2004.
[14] D. Zhou, J. Huang, and B. Schölkopf. Learning from labeled and unlabeled data on a directed graph. In Proceedings of the 22nd International Conference on Machine Learning, pages 1036–1043, New York, NY, USA, 2005. ACM Press.
[15] D. Zhou and B. Schölkopf. A regularization framework for learning from graph data. 2004.
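The GSPBox text above describes a pipeline of building a graph from a weight matrix W, computing the graph Fourier basis, and filtering signals in the spectral domain. The toy NumPy sketch below re-implements that pipeline directly from the definitions in the text rather than through the toolbox API (whose exact signatures are not reproduced here); the example graph, signal, and heat-kernel filter are arbitrary choices for illustration.

```python
import numpy as np

# Small undirected graph given by a symmetric weight matrix W with zero diagonal.
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

D = np.diag(W.sum(axis=1))
L = D - W                               # combinatorial Laplacian, the toolbox default

# "Fourier basis": eigendecomposition of L (what gsp_compute_fourier_basis provides).
eigvals, U = np.linalg.eigh(L)

f = np.array([0.0, 1.0, 0.5, 2.0])      # a signal living on the vertices
fhat = U.T @ f                          # graph Fourier transform (cf. gsp_gft)

# Spectral filtering with a heat kernel g(lambda) = exp(-tau * lambda),
# analogous to applying a low-pass filter object to the signal.
tau = 1.0
f_smooth = U @ (np.exp(-tau * eigvals) * fhat)
print(f_smooth)
```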
Variable Step-Size LMS Filtering Algorithm Based on the Multi-Scale Wavelet Transform
GOU Xinke; WANG Nana
【Journal】Industrial Instrumentation & Automation (《工业仪表与自动化装置》)
【Year (Volume), Issue】2012(000)003
【Abstract】Building on studies of the LMS adaptive algorithm, a variable step-size LMS algorithm based on the sampling function, and adaptive filtering based on the multi-scale wavelet transform, a new algorithm is proposed that combines the variable step-size LMS algorithm with the multi-scale wavelet transform. On one hand, the new algorithm overcomes the conflict, inherent in the fixed step-size LMS algorithm, between convergence speed, convergence accuracy and the step-size factor; on the other hand, the wavelet transform reduces the condition number of the autocorrelation matrix of the input vector, improving convergence speed, tracking performance and stability. Computer simulations comparing the algorithms show that the variable step-size LMS filtering algorithm based on the multi-scale wavelet transform converges faster and suppresses noise more strongly.
【Pages】6 (pp. 5-9, 13)
【Keywords】adaptive filtering; LMS algorithm; variable step size; wavelet transform; Matlab simulation
【Authors】GOU Xinke; WANG Nana
【Affiliation】College of Electrical and Information Engineering, Lanzhou University of Technology, Lanzhou 730050
【Language】Chinese
【CLC Classification】O231
0 Introduction
In 1967, Widrow et al. proposed the adaptive LMS filtering algorithm, in which the parameters of the adaptive filtering system adjust automatically toward the optimum; at design time, little or no prior statistical knowledge of the signal and noise is required.
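The abstract above turns on the idea of letting the LMS step size vary with the error, so that convergence speed and steady-state accuracy no longer trade off through a single fixed step. The sketch below shows one common error-driven step-size rule; it is a generic illustration under assumed constants, not the specific update or the wavelet-domain preprocessing proposed in the paper.

```python
import numpy as np

def vss_lms(d, x, n_taps=8, alpha=0.97, gamma=0.5, mu_max=0.05):
    """Variable step-size LMS: the step size grows with the error magnitude
    for fast initial convergence and shrinks as the error settles.

    d : desired signal, x : filter input, 1-D arrays of equal length.
    The error-driven step-size rule and its constants are illustrative.
    """
    w = np.zeros(n_taps)
    y = np.zeros_like(d)
    e = np.zeros_like(d)
    mu = mu_max / 2
    for n in range(n_taps, len(d)):
        u = x[n - n_taps:n][::-1]            # current tap-input vector
        y[n] = w @ u
        e[n] = d[n] - y[n]
        mu = alpha * mu + gamma * e[n] ** 2  # error-driven step-size update
        mu = min(mu, mu_max)                 # keep the step size in a stable range
        w = w + 2 * mu * e[n] * u            # LMS weight update
    return w, y, e
```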
A Variable Step-Size LMS Algorithm for Fetal ECG Extraction
YUAN Li; WU Shuicai; YUAN Yanchao; GAO Xiaofeng
【Journal】China Medical Devices (《中国医疗设备》)
【Year (Volume), Issue】2018, 33(3)
【Abstract】Objective: To resolve the conflict between convergence speed and steady-state error in the LMS algorithm, a variable step-size LMS algorithm for extracting the fetal ECG is proposed. Methods: The maternal abdominal signal is used as the primary input and the maternal thoracic signal as the reference input; both channels are preprocessed to remove baseline wander, and the proposed algorithm is then used to extract the fetal ECG. Results: Experiments show that the proposed algorithm outperforms the conventional LMS algorithm in both convergence speed and tracking ability, and extracts a clear fetal ECG. Conclusion: The improved LMS algorithm can extract a relatively clean fetal ECG.
【Pages】4 (pp. 15-17, 21)
【Authors】YUAN Li; WU Shuicai; YUAN Yanchao; GAO Xiaofeng
【Affiliations】College of Life Science and Bioengineering, Beijing University of Technology, Beijing 100124 (YUAN Li, WU Shuicai, YUAN Yanchao); 北京麦迪克斯科技有限公司, Beijing 100095 (GAO Xiaofeng)
【Language】Chinese
【CLC Classification】R540.4+1
【Related literature】
1. Fetal ECG extraction using a wavelet-domain variable step-size ICA algorithm [J], ZHAN Hailong; ZHAO Zhidong
2. A variable step-size LMS algorithm based on variable tap length [J], LEI Yilong; YU Tao
3. Application of an LMS algorithm with iteration-count-based variable step size to ECG signal extraction [J], SUN Junxiang; ZHAO Jiyin
4. An improved variable step-size LMS algorithm for vehicle interior noise control [J], QIAN Mengnan; LU Jianwei; YAN Guixi; GUO Jiahao
5. Heart-rate signal extraction based on a piecewise variable step-size truncated-error LMS algorithm [J], ZHAO Jiyin; TIAN Baofeng; LI Jianpo; SUI Junling
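The fetal ECG setup described in the abstract above is a standard adaptive noise cancellation arrangement: the abdominal lead is the primary input and the thoracic lead is the reference input. A minimal sketch with a fixed step size is given below; the paper's variable step-size rule could replace the constant mu (as in the previous sketch), and the tap count and constants here are assumptions.

```python
import numpy as np

def cancel_maternal_ecg(abdominal, thoracic, n_taps=32, mu=0.01):
    """Adaptive noise cancellation for fetal ECG extraction.

    abdominal : primary input d[n], containing maternal + fetal ECG.
    thoracic  : reference input x[n], essentially the maternal ECG.
    The LMS filter learns the maternal component present in the abdominal
    lead; the residual error e[n] is the fetal ECG estimate.
    """
    w = np.zeros(n_taps)
    fetal = np.zeros_like(abdominal)
    for n in range(n_taps, len(abdominal)):
        u = thoracic[n - n_taps:n][::-1]
        maternal_est = w @ u
        fetal[n] = abdominal[n] - maternal_est   # error = fetal ECG estimate
        w = w + 2 * mu * fetal[n] * u
    return fetal
```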
MI Registration Algorithm Thresholds

1. What is the MI registration algorithm
The MI registration algorithm is an image registration method based on mutual information. It aligns two images so that they coincide exactly or approximately in space. Image registration is widely used in medical image processing, computer vision and remote sensing.

2. Principle of the MI registration algorithm
The MI registration algorithm measures the similarity between images with mutual information. Mutual information is a statistic that quantifies the dependence between two random variables; in image registration it is used to measure how similar two images are.
The main steps of the MI registration algorithm are as follows:
Step 1: Image preprocessing. The two images to be registered are first preprocessed (gray-scale conversion, filtering, denoising, etc.) to improve the accuracy and stability of the registration.
Step 2: Choose the reference image and the floating image. One image from the set is chosen as the reference image and the others are treated as floating images to be registered; the reference image serves as the benchmark for the registration.
Step 3: Compute the mutual information. For each pair of reference and floating images, compute their mutual information

I(X; Y) = Σ_{x∈X} Σ_{y∈Y} p(x, y) log[ p(x, y) / (p(x) p(y)) ]

where X and Y are the sets of pixel values of the reference and floating images, p(x, y) is the probability of the pair (x, y), and p(x) and p(y) are the marginal probabilities of x and y. (A small numerical sketch of this step is given after this section.)
Step 4: Find the maximum mutual information. Search over the candidate registration transform parameters and keep the parameters that maximise the mutual information.
Step 5: Apply the registration transform. Apply the best transform parameters to the floating image to register it; the transform may be a translation, rotation, scaling, and so on.
Step 6: Evaluate the registration result. Assess the accuracy and stability of the result with metrics such as mean squared error or mutual information.

3. Role of the MI registration threshold
The MI registration threshold controls the sensitivity of the similarity measure during registration. With a high threshold, only image regions of high similarity are taken into account, which improves the accuracy and stability of the registration; with a low threshold, regions of lower similarity are also included, which increases flexibility and robustness. The threshold should be chosen according to the specific application scenario and requirements.
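For concreteness, here is a small sketch of Step 3: mutual information estimated from a joint gray-level histogram. The bin count and the use of a histogram estimator are implementation assumptions rather than anything prescribed by the text; in a registration loop this value would be recomputed and maximised over the transform parameters.

```python
import numpy as np

def mutual_information(ref, flt, bins=64):
    """Mutual information between a reference image and a floating image,
    estimated from their joint gray-level histogram."""
    joint, _, _ = np.histogram2d(ref.ravel(), flt.ravel(), bins=bins)
    pxy = joint / joint.sum()                 # joint probability p(x, y)
    px = pxy.sum(axis=1, keepdims=True)       # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)       # marginal p(y)
    nz = pxy > 0                              # avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))
```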
Kernels and Regularization on GraphsAlexander J.Smola1and Risi Kondor21Machine Learning Group,RSISEAustralian National UniversityCanberra,ACT0200,AustraliaAlex.Smola@.au2Department of Computer ScienceColumbia University1214Amsterdam Avenue,M.C.0401New York,NY10027,USArisi@Abstract.We introduce a family of kernels on graphs based on thenotion of regularization operators.This generalizes in a natural way thenotion of regularization and Greens functions,as commonly used forreal valued functions,to graphs.It turns out that diffusion kernels canbe found as a special case of our reasoning.We show that the class ofpositive,monotonically decreasing functions on the unit interval leads tokernels and corresponding regularization operators.1IntroductionThere has recently been a surge of interest in learning algorithms that operate on input spaces X other than R n,specifically,discrete input spaces,such as strings, graphs,trees,automata etc..Since kernel-based algorithms,such as Support Vector Machines,Gaussian Processes,Kernel PCA,etc.capture the structure of X via the kernel K:X×X→R,as long as we can define an appropriate kernel on our discrete input space,these algorithms can be imported wholesale, together with their error analysis,theoretical guarantees and empirical success.One of the most general representations of discrete metric spaces are graphs. Even if all we know about our input space are local pairwise similarities between points x i,x j∈X,distances(e.g shortest path length)on the graph induced by these similarities can give a useful,more global,sense of similarity between objects.In their work on Diffusion Kernels,Kondor and Lafferty[2002]gave a specific construction for a kernel capturing this structure.Belkin and Niyogi [2002]proposed an essentially equivalent construction in the context of approx-imating data lying on surfaces in a high dimensional embedding space,and in the context of leveraging information from unlabeled data.In this paper we put these earlier results into the more principled framework of Regularization Theory.We propose a family of regularization operators(equiv-alently,kernels)on graphs that include Diffusion Kernels as a special case,and show that this family encompasses all possible regularization operators invariant under permutations of the vertices in a particular sense.2Alexander Smola and Risi KondorOutline of the Paper:Section2introduces the concept of the graph Laplacian and relates it to the Laplace operator on real valued functions.Next we define an extended class of regularization operators and show why they have to be es-sentially a function of the Laplacian.An analogy to real valued Greens functions is established in Section3.3,and efficient methods for computing such functions are presented in Section4.We conclude with a discussion.2Laplace OperatorsAn undirected unweighted graph G consists of a set of vertices V numbered1to n,and a set of edges E(i.e.,pairs(i,j)where i,j∈V and(i,j)∈E⇔(j,i)∈E). We will sometimes write i∼j to denote that i and j are neighbors,i.e.(i,j)∈E. The adjacency matrix of G is an n×n real matrix W,with W ij=1if i∼j,and 0otherwise(by construction,W is symmetric and its diagonal entries are zero). 
These definitions and most of the following theory can trivially be extended toweighted graphs by allowing W ij∈[0,∞).Let D be an n×n diagonal matrix with D ii=jW ij.The Laplacian of Gis defined as L:=D−W and the Normalized Laplacian is˜L:=D−12LD−12= I−D−12W D−12.The following two theorems are well known results from spectral graph theory[Chung-Graham,1997]:Theorem1(Spectrum of˜L).˜L is a symmetric,positive semidefinite matrix, and its eigenvaluesλ1,λ2,...,λn satisfy0≤λi≤2.Furthermore,the number of eigenvalues equal to zero equals to the number of disjoint components in G.The bound on the spectrum follows directly from Gerschgorin’s Theorem.Theorem2(L and˜L for Regular Graphs).Now let G be a regular graph of degree d,that is,a graph in which every vertex has exactly d neighbors.ThenL=d I−W and˜L=I−1d W=1dL.Finally,W,L,˜L share the same eigenvectors{v i},where v i=λ−1iW v i=(d−λi)−1L v i=(1−d−1λi)−1˜L v i for all i.L and˜L can be regarded as linear operators on functions f:V→R,or,equiv-alently,on vectors f=(f1,f2,...,f n) .We could equally well have defined Lbyf,L f =f L f=−12i∼j(f i−f j)2for all f∈R n,(1)which readily generalizes to graphs with a countably infinite number of vertices.The Laplacian derives its name from its analogy with the familiar Laplacianoperator∆=∂2∂x21+∂2∂x22+...+∂2∂x2mon continuous spaces.Regarding(1)asinducing a semi-norm f L= f,L f on R n,the analogous expression for∆defined on a compact spaceΩisf ∆= f,∆f =Ωf(∆f)dω=Ω(∇f)·(∇f)dω.(2)Both(1)and(2)quantify how much f and f vary locally,or how“smooth”they are over their respective domains.Kernels and Regularization on Graphs3 More explicitly,whenΩ=R m,up to a constant,−L is exactly thefinite difference discretization of∆on a regular lattice:∆f(x)=mi=1∂2∂x2if≈mi=1∂∂x if(x+12e i)−∂∂x if(x−12e i)δ≈mi=1f(x+e i)+f(x−e i)−2f(x)δ2=1δ2mi=1(f x1,...,x i+1,...,x m+f x1,...,x i−1,...,x m−2f x1,...,x m)=−1δ2[L f]x1,...,x m,where e1,e2,...,e m is an orthogonal basis for R m normalized to e i =δ, the vertices of the lattice are at x=x1e1+...+x m e m with integer valuedcoordinates x i∈N,and f x1,x2,...,x m=f(x).Moreover,both the continuous and the dis-crete Laplacians are canonical operators on their respective domains,in the sense that they are invariant under certain natural transformations of the underlying space,and in this they are essentially unique.Regular grid in two dimensionsThe Laplace operator∆is the unique self-adjoint linear second order differ-ential operator invariant under transformations of the coordinate system under the action of the special orthogonal group SO m,i.e.invariant under rotations. This well known result can be seen by using Schur’s lemma and the fact that SO m is irreducible on R m.We now show a similar result for L.Here the permutation group plays a similar role to SO m.We need some additional definitions:denote by S n the group of permutations on{1,2,...,n}withπ∈S n being a specific permutation taking i∈{1,2,...n}toπ(i).The so-called defining representation of S n consists of n×n matricesΠπ,such that[Ππ]i,π(i)=1and all other entries ofΠπare zero. 
Theorem3(Permutation Invariant Linear Functions on Graphs).Let L be an n×n symmetric real matrix,linearly related to the n×n adjacency matrix W,i.e.L=T[W]for some linear operator L in a way invariant to permutations of vertices in the sense thatΠ πT[W]Ππ=TΠ πWΠπ(3)for anyπ∈S n.Then L is related to W by a linear combination of the follow-ing three operations:identity;row/column sums;overall sum;row/column sum restricted to the diagonal of L;overall sum restricted to the diagonal of W. Proof LetL i1i2=T[W]i1i2:=ni3=1ni4=1T i1i2i3i4W i3i4(4)with T∈R n4.Eq.(3)then implies Tπ(i1)π(i2)π(i3)π(i4)=T i1i2i3i4for anyπ∈S n.4Alexander Smola and Risi KondorThe indices of T can be partitioned by the equality relation on their values,e.g.(2,5,2,7)is of the partition type [13|2|4],since i 1=i 3,but i 2=i 1,i 4=i 1and i 2=i 4.The key observation is that under the action of the permutation group,elements of T with a given index partition structure are taken to elements with the same index partition structure,e.g.if i 1=i 3then π(i 1)=π(i 3)and if i 1=i 3,then π(i 1)=π(i 3).Furthermore,an element with a given index index partition structure can be mapped to any other element of T with the same index partition structure by a suitable choice of π.Hence,a necessary and sufficient condition for (4)is that all elements of T of a given index partition structure be equal.Therefore,T must be a linear combination of the following tensors (i.e.multilinear forms):A i 1i 2i 3i 4=1B [1,2]i 1i 2i 3i 4=δi 1i 2B [1,3]i 1i 2i 3i 4=δi 1i 3B [1,4]i 1i 2i 3i 4=δi 1i 4B [2,3]i 1i 2i 3i 4=δi 2i 3B [2,4]i 1i 2i 3i 4=δi 2i 4B [3,4]i 1i 2i 3i 4=δi 3i 4C [1,2,3]i 1i 2i 3i 4=δi 1i 2δi 2i 3C [2,3,4]i 1i 2i 3i 4=δi 2i 3δi 3i 4C [3,4,1]i 1i 2i 3i 4=δi 3i 4δi 4i 1C [4,1,2]i 1i 2i 3i 4=δi 4i 1δi 1i 2D [1,2][3,4]i 1i 2i 3i 4=δi 1i 2δi 3i 4D [1,3][2,4]i 1i 2i 3i 4=δi 1i 3δi 2i 4D [1,4][2,3]i 1i 2i 3i 4=δi 1i 4δi 2i 3E [1,2,3,4]i 1i 2i 3i 4=δi 1i 2δi 1i 3δi 1i 4.The tensor A puts the overall sum in each element of L ,while B [1,2]returns the the same restricted to the diagonal of L .Since W has vanishing diagonal,B [3,4],C [2,3,4],C [3,4,1],D [1,2][3,4]and E [1,2,3,4]produce zero.Without loss of generality we can therefore ignore them.By symmetry of W ,the pairs (B [1,3],B [1,4]),(B [2,3],B [2,4]),(C [1,2,3],C [4,1,2])have the same effect on W ,hence we can set the coefficient of the second member of each to zero.Furthermore,to enforce symmetry on L ,the coefficient of B [1,3]and B [2,3]must be the same (without loss of generality 1)and this will give the row/column sum matrix ( k W ik )+( k W kl ).Similarly,C [1,2,3]and C [4,1,2]must have the same coefficient and this will give the row/column sum restricted to the diagonal:δij [( k W ik )+( k W kl )].Finally,by symmetry of W ,D [1,3][2,4]and D [1,4][2,3]are both equivalent to the identity map.The various row/column sum and overall sum operations are uninteresting from a graph theory point of view,since they do not heed to the topology of the graph.Imposing the conditions that each row and column in L must sum to zero,we recover the graph Laplacian.Hence,up to a constant factor and trivial additive components,the graph Laplacian (or the normalized graph Laplacian if we wish to rescale by the number of edges per vertex)is the only “invariant”differential operator for given W (or its normalized counterpart ˜W ).Unless stated otherwise,all results below hold for both L and ˜L (albeit with a different spectrum)and we will,in the following,focus on ˜Ldue to the fact that its spectrum is contained 
in [0,2].Kernels and Regularization on Graphs5 3RegularizationThe fact that L induces a semi-norm on f which penalizes the changes between adjacent vertices,as described in(1),indicates that it may serve as a tool to design regularization operators.3.1Regularization via the Laplace OperatorWe begin with a brief overview of translation invariant regularization operators on continuous spaces and show how they can be interpreted as powers of∆.This will allow us to repeat the development almost verbatim with˜L(or L)instead.Some of the most successful regularization functionals on R n,leading to kernels such as the Gaussian RBF,can be written as[Smola et al.,1998]f,P f :=|˜f(ω)|2r( ω 2)dω= f,r(∆)f .(5)Here f∈L2(R n),˜f(ω)denotes the Fourier transform of f,r( ω 2)is a function penalizing frequency components|˜f(ω)|of f,typically increasing in ω 2,and finally,r(∆)is the extension of r to operators simply by applying r to the spectrum of∆[Dunford and Schwartz,1958]f,r(∆)f =if,ψi r(λi) ψi,fwhere{(ψi,λi)}is the eigensystem of∆.The last equality in(5)holds because applications of∆become multiplications by ω 2in Fourier space.Kernels are obtained by solving the self-consistency condition[Smola et al.,1998]k(x,·),P k(x ,·) =k(x,x ).(6) One can show that k(x,x )=κ(x−x ),whereκis equal to the inverse Fourier transform of r−1( ω 2).Several r functions have been known to yield good results.The two most popular are given below:r( ω 2)k(x,x )r(∆)Gaussian RBF expσ22ω 2exp−12σ2x−x 2∞i=0σ2ii!∆iLaplacian RBF1+σ2 ω 2exp−1σx−x1+σ2∆In summary,regularization according to(5)is carried out by penalizing˜f(ω) by a function of the Laplace operator.For many results in regularization theory one requires r( ω 2)→∞for ω 2→∞.3.2Regularization via the Graph LaplacianIn complete analogy to(5),we define a class of regularization functionals on graphs asf,P f := f,r(˜L)f .(7)6Alexander Smola and Risi KondorFig.1.Regularization function r (λ).From left to right:regularized Laplacian (σ2=1),diffusion process (σ2=1),one-step random walk (a =2),4-step random walk (a =2),inverse cosine.Here r (˜L )is understood as applying the scalar valued function r (λ)to the eigen-values of ˜L ,that is,r (˜L ):=m i =1r (λi )v i v i ,(8)where {(λi ,v i )}constitute the eigensystem of ˜L .The normalized graph Lapla-cian ˜Lis preferable to L ,since ˜L ’s spectrum is contained in [0,2].The obvious goal is to gain insight into what functions are appropriate choices for r .–From (1)we infer that v i with large λi correspond to rather uneven functions on the graph G .Consequently,they should be penalized more strongly than v i with small λi .Hence r (λ)should be monotonically increasing in λ.–Requiring that r (˜L) 0imposes the constraint r (λ)≥0for all λ∈[0,2].–Finally,we can limit ourselves to r (λ)expressible as power series,since the latter are dense in the space of C 0functions on bounded domains.In Section 3.5we will present additional motivation for the choice of r (λ)in the context of spectral graph theory and segmentation.As we shall see,the following functions are of particular interest:r (λ)=1+σ2λ(Regularized Laplacian)(9)r (λ)=exp σ2/2λ(Diffusion Process)(10)r (λ)=(aI −λ)−1with a ≥2(One-Step Random Walk)(11)r (λ)=(aI −λ)−p with a ≥2(p -Step Random Walk)(12)r (λ)=(cos λπ/4)−1(Inverse Cosine)(13)Figure 1shows the regularization behavior for the functions (9)-(13).3.3KernelsThe introduction of a regularization matrix P =r (˜L)allows us to define a Hilbert space H on R m via f,f H := f ,P f .We now show that H is a reproducing kernel Hilbert 
space.Kernels and Regularization on Graphs 7Theorem 4.Denote by P ∈R m ×m a (positive semidefinite)regularization ma-trix and denote by H the image of R m under P .Then H with dot product f,f H := f ,P f is a Reproducing Kernel Hilbert Space and its kernel is k (i,j )= P −1ij ,where P −1denotes the pseudo-inverse if P is not invertible.Proof Since P is a positive semidefinite matrix,we clearly have a Hilbert space on P R m .To show the reproducing property we need to prove thatf (i )= f,k (i,·) H .(14)Note that k (i,j )can take on at most m 2different values (since i,j ∈[1:m ]).In matrix notation (14)means that for all f ∈Hf (i )=f P K i,:for all i ⇐⇒f =f P K.(15)The latter holds if K =P −1and f ∈P R m ,which proves the claim.In other words,K is the Greens function of P ,just as in the continuous case.The notion of Greens functions on graphs was only recently introduced by Chung-Graham and Yau [2000]for L .The above theorem extended this idea to arbitrary regularization operators ˆr (˜L).Corollary 1.Denote by P =r (˜L )a regularization matrix,then the correspond-ing kernel is given by K =r −1(˜L ),where we take the pseudo-inverse wherever necessary.More specifically,if {(v i ,λi )}constitute the eigensystem of ˜L,we have K =mi =1r −1(λi )v i v i where we define 0−1≡0.(16)3.4Examples of KernelsBy virtue of Corollary 1we only need to take (9)-(13)and plug the definition of r (λ)into (16)to obtain formulae for computing K .This yields the following kernel matrices:K =(I +σ2˜L)−1(Regularized Laplacian)(17)K =exp(−σ2/2˜L)(Diffusion Process)(18)K =(aI −˜L)p with a ≥2(p -Step Random Walk)(19)K =cos ˜Lπ/4(Inverse Cosine)(20)Equation (18)corresponds to the diffusion kernel proposed by Kondor and Laf-ferty [2002],for which K (x,x )can be visualized as the quantity of some sub-stance that would accumulate at vertex x after a given amount of time if we injected the substance at vertex x and let it diffuse through the graph along the edges.Note that this involves matrix exponentiation defined via the limit K =exp(B )=lim n →∞(I +B/n )n as opposed to component-wise exponentiation K i,j =exp(B i,j ).8Alexander Smola and Risi KondorFig.2.Thefirst8eigenvectors of the normalized graph Laplacian corresponding to the graph drawn above.Each line attached to a vertex is proportional to the value of the corresponding eigenvector at the vertex.Positive values(red)point up and negative values(blue)point down.Note that the assignment of values becomes less and less uniform with increasing eigenvalue(i.e.from left to right).For(17)it is typically more efficient to deal with the inverse of K,as it avoids the costly inversion of the sparse matrix˜L.Such situations arise,e.g.,in Gaussian Process estimation,where K is the covariance matrix of a stochastic process[Williams,1999].Regarding(19),recall that(aI−˜L)p=((a−1)I+˜W)p is up to scaling terms equiv-alent to a p-step random walk on the graphwith random restarts(see Section A for de-tails).In this sense it is similar to the dif-fusion kernel.However,the fact that K in-volves only afinite number of products ofmatrices makes it much more attractive forpractical purposes.In particular,entries inK ij can be computed cheaply using the factthat˜L is a sparse matrix.A nearest neighbor graph.Finally,the inverse cosine kernel treats lower complexity functions almost equally,with a significant reduction in the upper end of the spectrum.Figure2 shows the leading eigenvectors of the graph drawn above and Figure3provide examples of some of the kernels discussed above.3.5Clustering 
and Spectral Graph TheoryWe could also have derived r(˜L)directly from spectral graph theory:the eigen-vectors of the graph Laplacian correspond to functions partitioning the graph into clusters,see e.g.,[Chung-Graham,1997,Shi and Malik,1997]and the ref-erences therein.In general,small eigenvalues have associated eigenvectors which vary little between adjacent vertices.Finding the smallest eigenvectors of˜L can be seen as a real-valued relaxation of the min-cut problem.3For instance,the smallest eigenvalue of˜L is0,its corresponding eigenvector is D121n with1n:=(1,...,1)∈R n.The second smallest eigenvalue/eigenvector pair,also often referred to as the Fiedler-vector,can be used to split the graph 3Only recently,algorithms based on the celebrated semidefinite relaxation of the min-cut problem by Goemans and Williamson[1995]have seen wider use[Torr,2003]in segmentation and clustering by use of spectral bundle methods.Kernels and Regularization on Graphs9Fig.3.Top:regularized graph Laplacian;Middle:diffusion kernel with σ=5,Bottom:4-step random walk kernel.Each figure displays K ij for fixed i .The value K ij at vertex i is denoted by a bold line.Note that only adjacent vertices to i bear significant value.into two distinct parts [Weiss,1999,Shi and Malik,1997],and further eigenvec-tors with larger eigenvalues have been used for more finely-grained partitions of the graph.See Figure 2for an example.Such a decomposition into functions of increasing complexity has very de-sirable properties:if we want to perform estimation on the graph,we will wish to bias the estimate towards functions which vary little over large homogeneous portions 4.Consequently,we have the following interpretation of f,f H .As-sume that f = i βi v i ,where {(v i ,λi )}is the eigensystem of ˜L.Then we can rewrite f,f H to yield f ,r (˜L )f = i βi v i , j r (λj )v j v j l βl v l = iβ2i r (λi ).(21)This means that the components of f which vary a lot over coherent clusters in the graph are penalized more strongly,whereas the portions of f ,which are essentially constant over clusters,are preferred.This is exactly what we want.3.6Approximate ComputationOften it is not necessary to know all values of the kernel (e.g.,if we only observe instances from a subset of all positions on the graph).There it would be wasteful to compute the full matrix r (L )−1explicitly,since such operations typically scale with O (n 3).Furthermore,for large n it is not desirable to compute K via (16),that is,by computing the eigensystem of ˜Land assembling K directly.4If we cannot assume a connection between the structure of the graph and the values of the function to be estimated on it,the entire concept of designing kernels on graphs obviously becomes meaningless.10Alexander Smola and Risi KondorInstead,we would like to take advantage of the fact that ˜L is sparse,and con-sequently any operation ˜Lαhas cost at most linear in the number of nonzero ele-ments of ˜L ,hence the cost is bounded by O (|E |+n ).Moreover,if d is the largest degree of the graph,then computing L p e i costs at most |E | p −1i =1(min(d +1,n ))ioperations:at each step the number of non-zeros in the rhs decreases by at most a factor of d +1.This means that as long as we can approximate K =r −1(˜L )by a low order polynomial,say ρ(˜L ):= N i =0βi ˜L i ,significant savings are possible.Note that we need not necessarily require a uniformly good approximation and put the main emphasis on the approximation for small λ.However,we need to ensure that ρ(˜L)is positive 
semidefinite.Diffusion Kernel:The fact that the series r −1(x )=exp(−βx )= ∞m =0(−β)m x m m !has alternating signs shows that the approximation error at r −1(x )is boundedby (2β)N +1(N +1)!,if we use N terms in the expansion (from Theorem 1we know that ˜L≤2).For instance,for β=1,10terms are sufficient to obtain an error of the order of 10−4.Variational Approximation:In general,if we want to approximate r −1(λ)on[0,2],we need to solve the L ∞([0,2])approximation problemminimize β, subject to N i =0βi λi −r −1(λ) ≤ ∀λ∈[0,2](22)Clearly,(22)is equivalent to minimizing sup ˜L ρ(˜L )−r−1(˜L ) ,since the matrix norm is determined by the largest eigenvalues,and we can find ˜Lsuch that the discrepancy between ρ(λ)and r −1(λ)is attained.Variational problems of this form have been studied in the literature,and their solution may provide much better approximations to r −1(λ)than a truncated power series expansion.4Products of GraphsAs we have already pointed out,it is very expensive to compute K for arbitrary ˆr and ˜L.For special types of graphs and regularization,however,significant computational savings can be made.4.1Factor GraphsThe work of this section is a direct extension of results by Ellis [2002]and Chung-Graham and Yau [2000],who study factor graphs to compute inverses of the graph Laplacian.Definition 1(Factor Graphs).Denote by (V,E )and (V ,E )the vertices V and edges E of two graphs,then the factor graph (V f ,E f ):=(V,E )⊗(V ,E )is defined as the graph where (i,i )∈V f if i ∈V and i ∈V ;and ((i,i ),(j,j ))∈E f if and only if either (i,j )∈E and i =j or (i ,j )∈E and i =j .Kernels and Regularization on Graphs 11For instance,the factor graph of two rings is a torus.The nice property of factor graphs is that we can compute the eigenvalues of the Laplacian on products very easily (see e.g.,Chung-Graham and Yau [2000]):Theorem 5(Eigenvalues of Factor Graphs).The eigenvalues and eigen-vectors of the normalized Laplacian for the factor graph between a regular graph of degree d with eigenvalues {λj }and a regular graph of degree d with eigenvalues {λ l }are of the form:λfact j,l =d d +d λj +d d +d λ l(23)and the eigenvectors satisfy e j,l(i,i )=e j i e l i ,where e j is an eigenvector of ˜L and e l is an eigenvector of ˜L.This allows us to apply Corollary 1to obtain an expansion of K asK =(r (L ))−1=j,l r −1(λjl )e j,l e j,l .(24)While providing an explicit recipe for the computation of K ij without the need to compute the full matrix K ,this still requires O (n 2)operations per entry,which may be more costly than what we want (here n is the number of vertices of the factor graph).Two methods for computing (24)become evident at this point:if r has a special structure,we may exploit this to decompose K into the products and sums of terms depending on one of the two graphs alone and pre-compute these expressions beforehand.Secondly,if one of the two terms in the expansion can be computed for a rather general class of values of r (x ),we can pre-compute this expansion and only carry out the remainder corresponding to (24)explicitly.4.2Product Decomposition of r (x )Central to our reasoning is the observation that for certain r (x ),the term 1r (a +b )can be expressed in terms of a product and sum of terms depending on a and b only.We assume that 1r (a +b )=M m =1ρn (a )˜ρn (b ).(25)In the following we will show that in such situations the kernels on factor graphs can be computed as an analogous combination of products and sums of kernel functions on the terms constituting the ingredients of the 
factor graph.Before we do so,we briefly check that many r (x )indeed satisfy this property.exp(−β(a +b ))=exp(−βa )exp(−βb )(26)(A −(a +b ))= A 2−a + A 2−b (27)(A −(a +b ))p =p n =0p n A 2−a n A 2−b p −n (28)cos (a +b )π4=cos aπ4cos bπ4−sin aπ4sin bπ4(29)12Alexander Smola and Risi KondorIn a nutshell,we will exploit the fact that for products of graphs the eigenvalues of the joint graph Laplacian can be written as the sum of the eigenvalues of the Laplacians of the constituent graphs.This way we can perform computations on ρn and˜ρn separately without the need to take the other part of the the product of graphs into account.Definek m(i,j):=l ρldλld+de l i e l j and˜k m(i ,j ):=l˜ρldλld+d˜e l i ˜e l j .(30)Then we have the following composition theorem:Theorem6.Denote by(V,E)and(V ,E )connected regular graphs of degrees d with m vertices(and d ,m respectively)and normalized graph Laplacians ˜L,˜L .Furthermore denote by r(x)a rational function with matrix-valued exten-sionˆr(X).In this case the kernel K corresponding to the regularization operator ˆr(L)on the product graph of(V,E)and(V ,E )is given byk((i,i ),(j,j ))=Mm=1k m(i,j)˜k m(i ,j )(31)Proof Plug the expansion of1r(a+b)as given by(25)into(24)and collect terms.From(26)we immediately obtain the corollary(see Kondor and Lafferty[2002]) that for diffusion processes on factor graphs the kernel on the factor graph is given by the product of kernels on the constituents,that is k((i,i ),(j,j ))= k(i,j)k (i ,j ).The kernels k m and˜k m can be computed either by using an analytic solution of the underlying factors of the graph or alternatively they can be computed numerically.If the total number of kernels k n is small in comparison to the number of possible coordinates this is still computationally beneficial.4.3Composition TheoremsIf no expansion as in(31)can be found,we may still be able to compute ker-nels by extending a reasoning from[Ellis,2002].More specifically,the following composition theorem allows us to accelerate the computation in many cases, whenever we can parameterize(ˆr(L+αI))−1in an efficient way.For this pur-pose we introduce two auxiliary functionsKα(i,j):=ˆrdd+dL+αdd+dI−1=lrdλl+αdd+d−1e l(i)e l(j)G α(i,j):=(L +αI)−1=l1λl+αe l(i)e l(j).(32)In some cases Kα(i,j)may be computed in closed form,thus obviating the need to perform expensive matrix inversion,e.g.,in the case where the underlying graph is a chain[Ellis,2002]and Kα=Gα.Kernels and Regularization on Graphs 13Theorem 7.Under the assumptions of Theorem 6we haveK ((j,j ),(l,l ))=12πi C K α(j,l )G −α(j ,l )dα= v K λv (j,l )e v j e v l (33)where C ⊂C is a contour of the C containing the poles of (V ,E )including 0.For practical purposes,the third term of (33)is more amenable to computation.Proof From (24)we haveK ((j,j ),(l,l ))= u,v r dλu +d λv d +d −1e u j e u l e v j e v l (34)=12πi C u r dλu +d αd +d −1e u j e u l v 1λv −αe v j e v l dαHere the second equalityfollows from the fact that the contour integral over a pole p yields C f (α)p −αdα=2πif (p ),and the claim is verified by checking thedefinitions of K αand G α.The last equality can be seen from (34)by splitting up the summation over u and v .5ConclusionsWe have shown that the canonical family of kernels on graphs are of the form of power series in the graph Laplacian.Equivalently,such kernels can be char-acterized by a real valued function of the eigenvalues of the Laplacian.Special cases include diffusion kernels,the regularized Laplacian kernel and p -step ran-dom walk kernels.We have developed the 
regularization theory of learning on graphs using such kernels and explored methods for efficiently computing and approximating the kernel matrix.Acknowledgments This work was supported by a grant of the ARC.The authors thank Eleazar Eskin,Patrick Haffner,Andrew Ng,Bob Williamson and S.V.N.Vishwanathan for helpful comments and suggestions.A Link AnalysisRather surprisingly,our approach to regularizing functions on graphs bears re-semblance to algorithms for scoring web pages such as PageRank [Page et al.,1998],HITS [Kleinberg,1999],and randomized HITS [Zheng et al.,2001].More specifically,the random walks on graphs used in all three algorithms and the stationary distributions arising from them are closely connected with the eigen-system of L and ˜Lrespectively.We begin with an analysis of PageRank.Given a set of web pages and links between them we construct a directed graph in such a way that pages correspond。
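The kernel constructions summarized in the excerpt above all reduce to scalar functions of the normalized graph Laplacian (equations (17)-(19)). The dense NumPy/SciPy sketch below computes three of them for a small graph as an illustration; for large sparse graphs the text recommends low-order polynomial approximations rather than the dense inverse and matrix exponential used here.

```python
import numpy as np
from scipy.linalg import expm

def normalized_laplacian(W):
    """Normalized graph Laplacian L~ = I - D^{-1/2} W D^{-1/2} for a symmetric
    weight matrix W with zero diagonal, as defined in the excerpt above."""
    d = W.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(d), 0.0)
    return np.eye(len(W)) - (d_inv_sqrt[:, None] * W) * d_inv_sqrt[None, :]

def graph_kernels(W, sigma=1.0, a=2.0, p=3):
    """Regularized Laplacian, diffusion, and p-step random walk kernels,
    computed densely for a small graph; parameter values are arbitrary."""
    L = normalized_laplacian(W)
    I = np.eye(len(W))
    k_reg = np.linalg.inv(I + sigma**2 * L)        # (I + sigma^2 L~)^-1
    k_diff = expm(-(sigma**2 / 2.0) * L)           # exp(-sigma^2/2 L~)
    k_walk = np.linalg.matrix_power(a * I - L, p)  # (a I - L~)^p with a >= 2
    return k_reg, k_diff, k_walk
```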
Artificial Intelligence Quantitative Phase Imaging Methods for the Life Sciences

Artificial intelligence (AI) has revolutionized quantitative phase imaging (QPI) in the life sciences by enabling high-throughput, accurate analysis of complex biological samples. AI-based QPI methods integrate advanced machine learning algorithms with computational imaging techniques to extract quantitative information from phase images with unprecedented precision and speed. These methods leverage deep learning models to perform tasks such as cell segmentation, classification, and tracking, enabling researchers to study dynamic cellular processes and disease progression with minimal human intervention. One example of AI-enabled QPI is label-free cell classification, where AI algorithms accurately differentiate cell types from their phase images alone, providing valuable insights into cellular morphology and function. Furthermore, AI-based QPI approaches can improve the sensitivity and specificity of quantitative phase measurements, allowing more accurate quantification of cellular and subcellular features. Overall, the integration of AI with QPI holds great potential for advancing our understanding of complex biological systems and accelerating the development of novel diagnostic and therapeutic strategies in the life sciences.
Draft:Deep Learning in Neural Networks:An OverviewTechnical Report IDSIA-03-14/arXiv:1404.7828(v1.5)[cs.NE]J¨u rgen SchmidhuberThe Swiss AI Lab IDSIAIstituto Dalle Molle di Studi sull’Intelligenza ArtificialeUniversity of Lugano&SUPSIGalleria2,6928Manno-LuganoSwitzerland15May2014AbstractIn recent years,deep artificial neural networks(including recurrent ones)have won numerous con-tests in pattern recognition and machine learning.This historical survey compactly summarises relevantwork,much of it from the previous millennium.Shallow and deep learners are distinguished by thedepth of their credit assignment paths,which are chains of possibly learnable,causal links between ac-tions and effects.I review deep supervised learning(also recapitulating the history of backpropagation),unsupervised learning,reinforcement learning&evolutionary computation,and indirect search for shortprograms encoding deep and large networks.PDF of earlier draft(v1):http://www.idsia.ch/∼juergen/DeepLearning30April2014.pdfLATEX source:http://www.idsia.ch/∼juergen/DeepLearning30April2014.texComplete BIBTEXfile:http://www.idsia.ch/∼juergen/bib.bibPrefaceThis is the draft of an invited Deep Learning(DL)overview.One of its goals is to assign credit to those who contributed to the present state of the art.I acknowledge the limitations of attempting to achieve this goal.The DL research community itself may be viewed as a continually evolving,deep network of scientists who have influenced each other in complex ways.Starting from recent DL results,I tried to trace back the origins of relevant ideas through the past half century and beyond,sometimes using“local search”to follow citations of citations backwards in time.Since not all DL publications properly acknowledge earlier relevant work,additional global search strategies were employed,aided by consulting numerous neural network experts.As a result,the present draft mostly consists of references(about800entries so far).Nevertheless,through an expert selection bias I may have missed important work.A related bias was surely introduced by my special familiarity with the work of my own DL research group in the past quarter-century.For these reasons,the present draft should be viewed as merely a snapshot of an ongoing credit assignment process.To help improve it,please do not hesitate to send corrections and suggestions to juergen@idsia.ch.Contents1Introduction to Deep Learning(DL)in Neural Networks(NNs)3 2Event-Oriented Notation for Activation Spreading in FNNs/RNNs3 3Depth of Credit Assignment Paths(CAPs)and of Problems4 4Recurring Themes of Deep Learning54.1Dynamic Programming(DP)for DL (5)4.2Unsupervised Learning(UL)Facilitating Supervised Learning(SL)and RL (6)4.3Occam’s Razor:Compression and Minimum Description Length(MDL) (6)4.4Learning Hierarchical Representations Through Deep SL,UL,RL (6)4.5Fast Graphics Processing Units(GPUs)for DL in NNs (6)5Supervised NNs,Some Helped by Unsupervised NNs75.11940s and Earlier (7)5.2Around1960:More Neurobiological Inspiration for DL (7)5.31965:Deep Networks Based on the Group Method of Data Handling(GMDH) (8)5.41979:Convolution+Weight Replication+Winner-Take-All(WTA) (8)5.51960-1981and Beyond:Development of Backpropagation(BP)for NNs (8)5.5.1BP for Weight-Sharing Feedforward NNs(FNNs)and Recurrent NNs(RNNs)..95.6Late1980s-2000:Numerous Improvements of NNs (9)5.6.1Ideas for Dealing with Long Time Lags and Deep CAPs (10)5.6.2Better BP Through Advanced Gradient Descent (10)5.6.3Discovering Low-Complexity,Problem-Solving NNs 
(11)5.6.4Potential Benefits of UL for SL (11)5.71987:UL Through Autoencoder(AE)Hierarchies (12)5.81989:BP for Convolutional NNs(CNNs) (13)5.91991:Fundamental Deep Learning Problem of Gradient Descent (13)5.101991:UL-Based History Compression Through a Deep Hierarchy of RNNs (14)5.111992:Max-Pooling(MP):Towards MPCNNs (14)5.121994:Contest-Winning Not So Deep NNs (15)5.131995:Supervised Recurrent Very Deep Learner(LSTM RNN) (15)5.142003:More Contest-Winning/Record-Setting,Often Not So Deep NNs (16)5.152006/7:Deep Belief Networks(DBNs)&AE Stacks Fine-Tuned by BP (17)5.162006/7:Improved CNNs/GPU-CNNs/BP-Trained MPCNNs (17)5.172009:First Official Competitions Won by RNNs,and with MPCNNs (18)5.182010:Plain Backprop(+Distortions)on GPU Yields Excellent Results (18)5.192011:MPCNNs on GPU Achieve Superhuman Vision Performance (18)5.202011:Hessian-Free Optimization for RNNs (19)5.212012:First Contests Won on ImageNet&Object Detection&Segmentation (19)5.222013-:More Contests and Benchmark Records (20)5.22.1Currently Successful Supervised Techniques:LSTM RNNs/GPU-MPCNNs (21)5.23Recent Tricks for Improving SL Deep NNs(Compare Sec.5.6.2,5.6.3) (21)5.24Consequences for Neuroscience (22)5.25DL with Spiking Neurons? (22)6DL in FNNs and RNNs for Reinforcement Learning(RL)236.1RL Through NN World Models Yields RNNs With Deep CAPs (23)6.2Deep FNNs for Traditional RL and Markov Decision Processes(MDPs) (24)6.3Deep RL RNNs for Partially Observable MDPs(POMDPs) (24)6.4RL Facilitated by Deep UL in FNNs and RNNs (25)6.5Deep Hierarchical RL(HRL)and Subgoal Learning with FNNs and RNNs (25)6.6Deep RL by Direct NN Search/Policy Gradients/Evolution (25)6.7Deep RL by Indirect Policy Search/Compressed NN Search (26)6.8Universal RL (27)7Conclusion271Introduction to Deep Learning(DL)in Neural Networks(NNs) Which modifiable components of a learning system are responsible for its success or failure?What changes to them improve performance?This has been called the fundamental credit assignment problem(Minsky, 1963).There are general credit assignment methods for universal problem solvers that are time-optimal in various theoretical senses(Sec.6.8).The present survey,however,will focus on the narrower,but now commercially important,subfield of Deep Learning(DL)in Artificial Neural Networks(NNs).We are interested in accurate credit assignment across possibly many,often nonlinear,computational stages of NNs.Shallow NN-like models have been around for many decades if not centuries(Sec.5.1).Models with several successive nonlinear layers of neurons date back at least to the1960s(Sec.5.3)and1970s(Sec.5.5). 
An efficient gradient descent method for teacher-based Supervised Learning(SL)in discrete,differentiable networks of arbitrary depth called backpropagation(BP)was developed in the1960s and1970s,and ap-plied to NNs in1981(Sec.5.5).BP-based training of deep NNs with many layers,however,had been found to be difficult in practice by the late1980s(Sec.5.6),and had become an explicit research subject by the early1990s(Sec.5.9).DL became practically feasible to some extent through the help of Unsupervised Learning(UL)(e.g.,Sec.5.10,5.15).The1990s and2000s also saw many improvements of purely super-vised DL(Sec.5).In the new millennium,deep NNs havefinally attracted wide-spread attention,mainly by outperforming alternative machine learning methods such as kernel machines(Vapnik,1995;Sch¨o lkopf et al.,1998)in numerous important applications.In fact,supervised deep NNs have won numerous of-ficial international pattern recognition competitions(e.g.,Sec.5.17,5.19,5.21,5.22),achieving thefirst superhuman visual pattern recognition results in limited domains(Sec.5.19).Deep NNs also have become relevant for the more generalfield of Reinforcement Learning(RL)where there is no supervising teacher (Sec.6).Both feedforward(acyclic)NNs(FNNs)and recurrent(cyclic)NNs(RNNs)have won contests(Sec.5.12,5.14,5.17,5.19,5.21,5.22).In a sense,RNNs are the deepest of all NNs(Sec.3)—they are general computers more powerful than FNNs,and can in principle create and process memories of ar-bitrary sequences of input patterns(e.g.,Siegelmann and Sontag,1991;Schmidhuber,1990a).Unlike traditional methods for automatic sequential program synthesis(e.g.,Waldinger and Lee,1969;Balzer, 1985;Soloway,1986;Deville and Lau,1994),RNNs can learn programs that mix sequential and parallel information processing in a natural and efficient way,exploiting the massive parallelism viewed as crucial for sustaining the rapid decline of computation cost observed over the past75years.The rest of this paper is structured as follows.Sec.2introduces a compact,event-oriented notation that is simple yet general enough to accommodate both FNNs and RNNs.Sec.3introduces the concept of Credit Assignment Paths(CAPs)to measure whether learning in a given NN application is of the deep or shallow type.Sec.4lists recurring themes of DL in SL,UL,and RL.Sec.5focuses on SL and UL,and on how UL can facilitate SL,although pure SL has become dominant in recent competitions(Sec.5.17-5.22). 
Sec.5is arranged in a historical timeline format with subsections on important inspirations and technical contributions.Sec.6on deep RL discusses traditional Dynamic Programming(DP)-based RL combined with gradient-based search techniques for SL or UL in deep NNs,as well as general methods for direct and indirect search in the weight space of deep FNNs and RNNs,including successful policy gradient and evolutionary methods.2Event-Oriented Notation for Activation Spreading in FNNs/RNNs Throughout this paper,let i,j,k,t,p,q,r denote positive integer variables assuming ranges implicit in the given contexts.Let n,m,T denote positive integer constants.An NN’s topology may change over time(e.g.,Fahlman,1991;Ring,1991;Weng et al.,1992;Fritzke, 1994).At any given moment,it can be described as afinite subset of units(or nodes or neurons)N= {u1,u2,...,}and afinite set H⊆N×N of directed edges or connections between nodes.FNNs are acyclic graphs,RNNs cyclic.Thefirst(input)layer is the set of input units,a subset of N.In FNNs,the k-th layer(k>1)is the set of all nodes u∈N such that there is an edge path of length k−1(but no longer path)between some input unit and u.There may be shortcut connections between distant layers.The NN’s behavior or program is determined by a set of real-valued,possibly modifiable,parameters or weights w i(i=1,...,n).We now focus on a singlefinite episode or epoch of information processing and activation spreading,without learning through weight changes.The following slightly unconventional notation is designed to compactly describe what is happening during the runtime of the system.During an episode,there is a partially causal sequence x t(t=1,...,T)of real values that I call events.Each x t is either an input set by the environment,or the activation of a unit that may directly depend on other x k(k<t)through a current NN topology-dependent set in t of indices k representing incoming causal connections or links.Let the function v encode topology information and map such event index pairs(k,t)to weight indices.For example,in the non-input case we may have x t=f t(net t)with real-valued net t= k∈in t x k w v(k,t)(additive case)or net t= k∈in t x k w v(k,t)(multiplicative case), where f t is a typically nonlinear real-valued activation function such as tanh.In many recent competition-winning NNs(Sec.5.19,5.21,5.22)there also are events of the type x t=max k∈int (x k);some networktypes may also use complex polynomial activation functions(Sec.5.3).x t may directly affect certain x k(k>t)through outgoing connections or links represented through a current set out t of indices k with t∈in k.Some non-input events are called output events.Note that many of the x t may refer to different,time-varying activations of the same unit in sequence-processing RNNs(e.g.,Williams,1989,“unfolding in time”),or also in FNNs sequentially exposed to time-varying input patterns of a large training set encoded as input events.During an episode,the same weight may get reused over and over again in topology-dependent ways,e.g.,in RNNs,or in convolutional NNs(Sec.5.4,5.8).I call this weight sharing across space and/or time.Weight sharing may greatly reduce the NN’s descriptive complexity,which is the number of bits of information required to describe the NN (Sec.4.3).In Supervised Learning(SL),certain NN output events x t may be associated with teacher-given,real-valued labels or targets d t yielding errors e t,e.g.,e t=1/2(x t−d t)2.A typical goal of supervised NN training is tofind weights that yield 
episodes with small total error E,the sum of all such e t.The hope is that the NN will generalize well in later episodes,causing only small errors on previously unseen sequences of input events.Many alternative error functions for SL and UL are possible.SL assumes that input events are independent of earlier output events(which may affect the environ-ment through actions causing subsequent perceptions).This assumption does not hold in the broaderfields of Sequential Decision Making and Reinforcement Learning(RL)(Kaelbling et al.,1996;Sutton and Barto, 1998;Hutter,2005)(Sec.6).In RL,some of the input events may encode real-valued reward signals given by the environment,and a typical goal is tofind weights that yield episodes with a high sum of reward signals,through sequences of appropriate output actions.Sec.5.5will use the notation above to compactly describe a central algorithm of DL,namely,back-propagation(BP)for supervised weight-sharing FNNs and RNNs.(FNNs may be viewed as RNNs with certainfixed zero weights.)Sec.6will address the more general RL case.3Depth of Credit Assignment Paths(CAPs)and of ProblemsTo measure whether credit assignment in a given NN application is of the deep or shallow type,I introduce the concept of Credit Assignment Paths or CAPs,which are chains of possibly causal links between events.Let usfirst focus on SL.Consider two events x p and x q(1≤p<q≤T).Depending on the appli-cation,they may have a Potential Direct Causal Connection(PDCC)expressed by the Boolean predicate pdcc(p,q),which is true if and only if p∈in q.Then the2-element list(p,q)is defined to be a CAP from p to q(a minimal one).A learning algorithm may be allowed to change w v(p,q)to improve performance in future episodes.More general,possibly indirect,Potential Causal Connections(PCC)are expressed by the recursively defined Boolean predicate pcc(p,q),which in the SL case is true only if pdcc(p,q),or if pcc(p,k)for some k and pdcc(k,q).In the latter case,appending q to any CAP from p to k yields a CAP from p to q(this is a recursive definition,too).The set of such CAPs may be large but isfinite.Note that the same weight may affect many different PDCCs between successive events listed by a given CAP,e.g.,in the case of RNNs, or weight-sharing FNNs.Suppose a CAP has the form(...,k,t,...,q),where k and t(possibly t=q)are thefirst successive elements with modifiable w v(k,t).Then the length of the suffix list(t,...,q)is called the CAP’s depth (which is0if there are no modifiable links at all).This depth limits how far backwards credit assignment can move down the causal chain tofind a modifiable weight.1Suppose an episode and its event sequence x1,...,x T satisfy a computable criterion used to decide whether a given problem has been solved(e.g.,total error E below some threshold).Then the set of used weights is called a solution to the problem,and the depth of the deepest CAP within the sequence is called the solution’s depth.There may be other solutions(yielding different event sequences)with different depths.Given somefixed NN topology,the smallest depth of any solution is called the problem’s depth.Sometimes we also speak of the depth of an architecture:SL FNNs withfixed topology imply a problem-independent maximal problem depth bounded by the number of non-input layers.Certain SL RNNs withfixed weights for all connections except those to output units(Jaeger,2001;Maass et al.,2002; Jaeger,2004;Schrauwen et al.,2007)have a maximal problem depth of1,because only thefinal links in the corresponding CAPs 
are modifiable.In general,however,RNNs may learn to solve problems of potentially unlimited depth.Note that the definitions above are solely based on the depths of causal chains,and agnostic of the temporal distance between events.For example,shallow FNNs perceiving large“time windows”of in-put events may correctly classify long input sequences through appropriate output events,and thus solve shallow problems involving long time lags between relevant events.At which problem depth does Shallow Learning end,and Deep Learning begin?Discussions with DL experts have not yet yielded a conclusive response to this question.Instead of committing myself to a precise answer,let me just define for the purposes of this overview:problems of depth>10require Very Deep Learning.The difficulty of a problem may have little to do with its depth.Some NNs can quickly learn to solve certain deep problems,e.g.,through random weight guessing(Sec.5.9)or other types of direct search (Sec.6.6)or indirect search(Sec.6.7)in weight space,or through training an NNfirst on shallow problems whose solutions may then generalize to deep problems,or through collapsing sequences of(non)linear operations into a single(non)linear operation—but see an analysis of non-trivial aspects of deep linear networks(Baldi and Hornik,1994,Section B).In general,however,finding an NN that precisely models a given training set is an NP-complete problem(Judd,1990;Blum and Rivest,1992),also in the case of deep NNs(S´ıma,1994;de Souto et al.,1999;Windisch,2005);compare a survey of negative results(S´ıma, 2002,Section1).Above we have focused on SL.In the more general case of RL in unknown environments,pcc(p,q) is also true if x p is an output event and x q any later input event—any action may affect the environment and thus any later perception.(In the real world,the environment may even influence non-input events computed on a physical hardware entangled with the entire universe,but this is ignored here.)It is possible to model and replace such unmodifiable environmental PCCs through a part of the NN that has already learned to predict(through some of its units)input events(including reward signals)from former input events and actions(Sec.6.1).Its weights are frozen,but can help to assign credit to other,still modifiable weights used to compute actions(Sec.6.1).This approach may lead to very deep CAPs though.Some DL research is about automatically rephrasing problems such that their depth is reduced(Sec.4). 
In particular, sometimes UL is used to make SL problems less deep, e.g., Sec. 5.10. Often Dynamic Programming (Sec. 4.1) is used to facilitate certain traditional RL problems, e.g., Sec. 6.2. Sec. 5 focuses on CAPs for SL, Sec. 6 on the more complex case of RL.

4 Recurring Themes of Deep Learning

4.1 Dynamic Programming (DP) for DL

One recurring theme of DL is Dynamic Programming (DP) (Bellman, 1957), which can help to facilitate credit assignment under certain assumptions. For example, in SL NNs, backpropagation itself can be viewed as a DP-derived method (Sec. 5.5). In traditional RL based on strong Markovian assumptions, DP-derived methods can help to greatly reduce problem depth (Sec. 6.2). DP algorithms are also essential for systems that combine concepts of NNs and graphical models, such as Hidden Markov Models (HMMs) (Stratonovich, 1960; Baum and Petrie, 1966) and Expectation Maximization (EM) (Dempster et al., 1977), e.g., (Bottou, 1991; Bengio, 1991; Bourlard and Morgan, 1994; Baldi and Chauvin, 1996; Jordan and Sejnowski, 2001; Bishop, 2006; Poon and Domingos, 2011; Dahl et al., 2012; Hinton et al., 2012a).

[Footnote 1: An alternative would be to count only modifiable links when measuring depth. In many typical NN applications this would not make a difference, but in some it would, e.g., Sec. 6.1.]

4.2 Unsupervised Learning (UL) Facilitating Supervised Learning (SL) and RL

Another recurring theme is how UL can facilitate both SL (Sec. 5) and RL (Sec. 6). UL (Sec. 5.6.4) is normally used to encode raw incoming data such as video or speech streams in a form that is more convenient for subsequent goal-directed learning. In particular, codes that describe the original data in a less redundant or more compact way can be fed into SL (Sec. 5.10, 5.15) or RL machines (Sec. 6.4), whose search spaces may thus become smaller (and whose CAPs shallower) than those necessary for dealing with the raw data. UL is closely connected to the topics of regularization and compression (Sec. 4.3, 5.6.3).
4.3Occam’s Razor:Compression and Minimum Description Length(MDL) Occam’s razor favors simple solutions over complex ones.Given some programming language,the prin-ciple of Minimum Description Length(MDL)can be used to measure the complexity of a solution candi-date by the length of the shortest program that computes it(e.g.,Solomonoff,1964;Kolmogorov,1965b; Chaitin,1966;Wallace and Boulton,1968;Levin,1973a;Rissanen,1986;Blumer et al.,1987;Li and Vit´a nyi,1997;Gr¨u nwald et al.,2005).Some methods explicitly take into account program runtime(Al-lender,1992;Watanabe,1992;Schmidhuber,2002,1995);many consider only programs with constant runtime,written in non-universal programming languages(e.g.,Rissanen,1986;Hinton and van Camp, 1993).In the NN case,the MDL principle suggests that low NN weight complexity corresponds to high NN probability in the Bayesian view(e.g.,MacKay,1992;Buntine and Weigend,1991;De Freitas,2003), and to high generalization performance(e.g.,Baum and Haussler,1989),without overfitting the training data.Many methods have been proposed for regularizing NNs,that is,searching for solution-computing, low-complexity SL NNs(Sec.5.6.3)and RL NNs(Sec.6.7).This is closely related to certain UL methods (Sec.4.2,5.6.4).4.4Learning Hierarchical Representations Through Deep SL,UL,RLMany methods of Good Old-Fashioned Artificial Intelligence(GOFAI)(Nilsson,1980)as well as more recent approaches to AI(Russell et al.,1995)and Machine Learning(Mitchell,1997)learn hierarchies of more and more abstract data representations.For example,certain methods of syntactic pattern recog-nition(Fu,1977)such as grammar induction discover hierarchies of formal rules to model observations. The partially(un)supervised Automated Mathematician/EURISKO(Lenat,1983;Lenat and Brown,1984) continually learns concepts by combining previously learnt concepts.Such hierarchical representation learning(Ring,1994;Bengio et al.,2013;Deng and Yu,2014)is also a recurring theme of DL NNs for SL (Sec.5),UL-aided SL(Sec.5.7,5.10,5.15),and hierarchical RL(Sec.6.5).Often,abstract hierarchical representations are natural by-products of data compression(Sec.4.3),e.g.,Sec.5.10.4.5Fast Graphics Processing Units(GPUs)for DL in NNsWhile the previous millennium saw several attempts at creating fast NN-specific hardware(e.g.,Jackel et al.,1990;Faggin,1992;Ramacher et al.,1993;Widrow et al.,1994;Heemskerk,1995;Korkin et al., 1997;Urlbe,1999),and at exploiting standard hardware(e.g.,Anguita et al.,1994;Muller et al.,1995; Anguita and Gomes,1996),the new millennium brought a DL breakthrough in form of cheap,multi-processor graphics cards or GPUs.GPUs are widely used for video games,a huge and competitive market that has driven down hardware prices.GPUs excel at fast matrix and vector multiplications required not only for convincing virtual realities but also for NN training,where they can speed up learning by a factorof50and more.Some of the GPU-based FNN implementations(Sec.5.16-5.19)have greatly contributed to recent successes in contests for pattern recognition(Sec.5.19-5.22),image segmentation(Sec.5.21), and object detection(Sec.5.21-5.22).5Supervised NNs,Some Helped by Unsupervised NNsThe main focus of current practical applications is on Supervised Learning(SL),which has dominated re-cent pattern recognition contests(Sec.5.17-5.22).Several methods,however,use additional Unsupervised Learning(UL)to facilitate SL(Sec.5.7,5.10,5.15).It does make sense to treat SL and UL in the same section:often gradient-based methods,such as 
BP(Sec.5.5.1),are used to optimize objective functions of both UL and SL,and the boundary between SL and UL may blur,for example,when it comes to time series prediction and sequence classification,e.g.,Sec.5.10,5.12.A historical timeline format will help to arrange subsections on important inspirations and techni-cal contributions(although such a subsection may span a time interval of many years).Sec.5.1briefly mentions early,shallow NN models since the1940s,Sec.5.2additional early neurobiological inspiration relevant for modern Deep Learning(DL).Sec.5.3is about GMDH networks(since1965),perhaps thefirst (feedforward)DL systems.Sec.5.4is about the relatively deep Neocognitron NN(1979)which is similar to certain modern deep FNN architectures,as it combines convolutional NNs(CNNs),weight pattern repli-cation,and winner-take-all(WTA)mechanisms.Sec.5.5uses the notation of Sec.2to compactly describe a central algorithm of DL,namely,backpropagation(BP)for supervised weight-sharing FNNs and RNNs. It also summarizes the history of BP1960-1981and beyond.Sec.5.6describes problems encountered in the late1980s with BP for deep NNs,and mentions several ideas from the previous millennium to overcome them.Sec.5.7discusses afirst hierarchical stack of coupled UL-based Autoencoders(AEs)—this concept resurfaced in the new millennium(Sec.5.15).Sec.5.8is about applying BP to CNNs,which is important for today’s DL applications.Sec.5.9explains BP’s Fundamental DL Problem(of vanishing/exploding gradients)discovered in1991.Sec.5.10explains how a deep RNN stack of1991(the History Compressor) pre-trained by UL helped to solve previously unlearnable DL benchmarks requiring Credit Assignment Paths(CAPs,Sec.3)of depth1000and more.Sec.5.11discusses a particular WTA method called Max-Pooling(MP)important in today’s DL FNNs.Sec.5.12mentions afirst important contest won by SL NNs in1994.Sec.5.13describes a purely supervised DL RNN(Long Short-Term Memory,LSTM)for problems of depth1000and more.Sec.5.14mentions an early contest of2003won by an ensemble of shallow NNs, as well as good pattern recognition results with CNNs and LSTM RNNs(2003).Sec.5.15is mostly about Deep Belief Networks(DBNs,2006)and related stacks of Autoencoders(AEs,Sec.5.7)pre-trained by UL to facilitate BP-based SL.Sec.5.16mentions thefirst BP-trained MPCNNs(2007)and GPU-CNNs(2006). Sec.5.17-5.22focus on official competitions with secret test sets won by(mostly purely supervised)DL NNs since2009,in sequence recognition,image classification,image segmentation,and object detection. 
Many RNN results depended on LSTM (Sec. 5.13); many FNN results depended on GPU-based FNN code developed since 2004 (Sec. 5.16, 5.17, 5.18, 5.19), in particular, GPU-MPCNNs (Sec. 5.19).

5.1 1940s and Earlier

NN research started in the 1940s (e.g., McCulloch and Pitts, 1943; Hebb, 1949); compare also later work on learning NNs (Rosenblatt, 1958, 1962; Widrow and Hoff, 1962; Grossberg, 1969; Kohonen, 1972; von der Malsburg, 1973; Narendra and Thathatchar, 1974; Willshaw and von der Malsburg, 1976; Palm, 1980; Hopfield, 1982). In a sense NNs have been around even longer, since early supervised NNs were essentially variants of linear regression methods going back at least to the early 1800s (e.g., Legendre, 1805; Gauss, 1809, 1821). Early NNs had a maximal CAP depth of 1 (Sec. 3).

5.2 Around 1960: More Neurobiological Inspiration for DL

Simple cells and complex cells were found in the cat's visual cortex (e.g., Hubel and Wiesel, 1962; Wiesel and Hubel, 1959). These cells fire in response to certain properties of visual sensory inputs, such as the orientation of edges. Complex cells exhibit more spatial invariance than simple cells. This inspired later deep NN architectures (Sec. 5.4) used in certain modern award-winning Deep Learners (Sec. 5.19-5.22).

5.3 1965: Deep Networks Based on the Group Method of Data Handling (GMDH)

Networks trained by the Group Method of Data Handling (GMDH) (Ivakhnenko and Lapa, 1965; Ivakhnenko et al., 1967; Ivakhnenko, 1968, 1971) were perhaps the first DL systems of the Feedforward Multilayer Perceptron type. The units of GMDH nets may have polynomial activation functions implementing Kolmogorov-Gabor polynomials (more general than traditional NN activation functions). Given a training set, layers are incrementally grown and trained by regression analysis, then pruned with the help of a separate validation set (using today's terminology), where Decision Regularisation is used to weed out superfluous units. The numbers of layers and units per layer can be learned in problem-dependent fashion.
This is a good example of hierarchical representation learning(Sec.4.4).There have been numerous ap-plications of GMDH-style networks,e.g.(Ikeda et al.,1976;Farlow,1984;Madala and Ivakhnenko,1994; Ivakhnenko,1995;Kondo,1998;Kord´ık et al.,2003;Witczak et al.,2006;Kondo and Ueno,2008).5.41979:Convolution+Weight Replication+Winner-Take-All(WTA)Apart from deep GMDH networks(Sec.5.3),the Neocognitron(Fukushima,1979,1980,2013a)was per-haps thefirst artificial NN that deserved the attribute deep,and thefirst to incorporate the neurophysiolog-ical insights of Sec.5.2.It introduced convolutional NNs(today often called CNNs or convnets),where the(typically rectangular)receptivefield of a convolutional unit with given weight vector is shifted step by step across a2-dimensional array of input values,such as the pixels of an image.The resulting2D array of subsequent activation events of this unit can then provide inputs to higher-level units,and so on.Due to massive weight replication(Sec.2),relatively few parameters may be necessary to describe the behavior of such a convolutional layer.Competition layers have WTA subsets whose maximally active units are the only ones to adopt non-zero activation values.They essentially“down-sample”the competition layer’s input.This helps to create units whose responses are insensitive to small image shifts(compare Sec.5.2).The Neocognitron is very similar to the architecture of modern,contest-winning,purely super-vised,feedforward,gradient-based Deep Learners with alternating convolutional and competition lay-ers(e.g.,Sec.5.19-5.22).Fukushima,however,did not set the weights by supervised backpropagation (Sec.5.5,5.8),but by local un supervised learning rules(e.g.,Fukushima,2013b),or by pre-wiring.In that sense he did not care for the DL problem(Sec.5.9),although his architecture was comparatively deep indeed.He also used Spatial Averaging(Fukushima,1980,2011)instead of Max-Pooling(MP,Sec.5.11), currently a particularly convenient and popular WTA mechanism.Today’s CNN-based DL machines profita lot from later CNN work(e.g.,LeCun et al.,1989;Ranzato et al.,2007)(Sec.5.8,5.16,5.19).5.51960-1981and Beyond:Development of Backpropagation(BP)for NNsThe minimisation of errors through gradient descent(Hadamard,1908)in the parameter space of com-plex,nonlinear,differentiable,multi-stage,NN-related systems has been discussed at least since the early 1960s(e.g.,Kelley,1960;Bryson,1961;Bryson and Denham,1961;Pontryagin et al.,1961;Dreyfus,1962; Wilkinson,1965;Amari,1967;Bryson and Ho,1969;Director and Rohrer,1969;Griewank,2012),ini-tially within the framework of Euler-LaGrange equations in the Calculus of Variations(e.g.,Euler,1744). Steepest descent in such systems can be performed(Bryson,1961;Kelley,1960;Bryson and Ho,1969)by iterating the ancient chain rule(Leibniz,1676;L’Hˆo pital,1696)in Dynamic Programming(DP)style(Bell-man,1957).A simplified derivation of the method uses the chain rule only(Dreyfus,1962).The methods of the1960s were already efficient in the DP sense.However,they backpropagated derivative information through standard Jacobian matrix calculations from one“layer”to the previous one, explicitly addressing neither direct links across several layers nor potential additional efficiency gains due to network sparsity(but perhaps such enhancements seemed obvious to the authors).。
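As a minimal illustration of the chain-rule bookkeeping discussed above, the following sketch runs reverse-mode gradient descent on a tiny two-layer network. It is a generic toy example written for this overview, not a reconstruction of any particular historical method; the data, network size, and learning rate are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D regression problem.
x = rng.normal(size=(100, 1))
d = np.sin(2 * x)

# Two-layer net: x -> tanh(x W1) -> linear(W2), trained by plain gradient descent.
W1 = 0.5 * rng.normal(size=(1, 16))
W2 = 0.5 * rng.normal(size=(16, 1))
lr = 0.05

for step in range(2000):
    # Forward pass.
    h_pre = x @ W1
    h = np.tanh(h_pre)
    y = h @ W2
    err = y - d
    # Backward pass: chain rule applied layer by layer,
    # with E = mean of 1/2 * err^2 (matching the e_t = 1/2 (x_t - d_t)^2 convention above).
    grad_y = err / len(x)
    grad_W2 = h.T @ grad_y
    grad_h = grad_y @ W2.T
    grad_pre = grad_h * (1.0 - np.tanh(h_pre) ** 2)
    grad_W1 = x.T @ grad_pre
    W1 -= lr * grad_W1
    W2 -= lr * grad_W2

final = np.tanh(x @ W1) @ W2
print("final mean squared error:", float(np.mean((final - d) ** 2)))
```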
Artificial Intelligence Decodes Cancer with Images and Omics
Source: 《世界科学》 (World Science), 2019, No. 6. Compiled by Lu Mo (陆默). Machine learning can analyze cancer photographs, tumor pathology slides, and genomes. Scientists are now preparing to integrate this information into cancer "super models."

A question on every cancer patient's mind is: how long do I have left? Genomicist Michael Snyder hopes he can find the answer. At present, all doctors can do is group patients with similar cancers and then compare how they and patients in other groups respond to the same drugs or what their prognoses look like, but current grouping methods are crude and imperfect, and they often rest only on manually collected data. "Pathologists diagnosing from their reading of images are typically only about 60% accurate," notes Snyder, director of the Stanford Center for Genomics and Personalized Medicine. In 2013, he and his then graduate student Kun-Hsing Yu (余坤兴) began to wonder whether artificial intelligence could give doctors more accurate predictions.

Yu fed histology images, together with the pathologist-confirmed diagnoses, into a machine learning algorithm, training it to distinguish lung cancer from normal tissue and to tell two different types of lung cancer apart. He then fed in the survival data of the corresponding patients so the system could learn how this information relates to the images. Finally, he supplied the model with new pathology slides and asked the AI a crucial question: how long will the patient survive? The computer could predict whether a patient's survival would be above or below the average for certain specific cancers, something pathologists find very hard to do. The computer's predictions "worked surprisingly well," says Yu, now a lecturer at Harvard Medical School.

But Snyder and Yu thought they could do more. Snyder's laboratory also studies omics, so they decided to give the computer not only histopathology slides to learn from but also tumor transcriptome data. Combining these data, the model predicted patient survival even better than with images or transcriptomes alone, with accuracy exceeding 80%. Pathologists today usually predict survival from a visual assessment of tissue micrographs, grading the tumor according to its size and severity and the likelihood that it will grow and spread further. But such tumor grading does not always predict survival accurately. Snyder and Yu are not the only researchers who have recognized the power of artificial intelligence for analyzing cancer-related datasets, including images, omics, and combinations of the two.
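To make the described workflow concrete, here is a purely illustrative sketch of the kind of model discussed above: image-derived features and transcriptome features are concatenated and a simple classifier predicts whether survival falls above or below the cohort median. All data, feature counts, and the classifier choice below are synthetic placeholders; the actual work used real histology-derived features and richer models.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic stand-ins: 500 patients, 64 image-derived features,
# 200 transcriptome features, and a binary label
# (1 = survival above the cohort median, 0 = below).
n_patients = 500
img_feats = rng.normal(size=(n_patients, 64))    # e.g., nuclei shape/texture statistics
rna_feats = rng.normal(size=(n_patients, 200))   # e.g., normalized gene-expression values
labels = (img_feats[:, 0] + rna_feats[:, 0] + rng.normal(size=n_patients) > 0).astype(int)

# Combine the two modalities by simple concatenation.
X = np.hstack([img_feats, rna_feats])
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(f"Combined image + transcriptome AUC: {auc:.2f}")
```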
Patent title: Exploiting activation sparsity in deep neural networks
Patent type: Invention patent
Inventors: R·希尔, A·兰博, M·戈德法布, A·安萨里, C·洛特
Application number: CN201980062020.2
Filing date: 2019-09-27
Publication number: CN112740236A
Publication date: 2021-04-30
Abstract: A method of exploiting activation sparsity in deep neural networks is described. The method includes retrieving an activation tensor and a weight tensor, where the activation tensor is a sparse activation tensor. The method also includes generating a compressed activation tensor that contains the non-zero activations of the activation tensor, where the compressed activation tensor has fewer columns than the activation tensor. The method further includes processing the compressed activation tensor and the weight tensor to generate an output tensor.
Applicant: Qualcomm Incorporated (高通股份有限公司)
Address: California, USA
Nationality: US
Agent: 上海专利商标事务所有限公司
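The abstract describes dropping the zero columns of a sparse activation tensor before it is multiplied with the weights. As a rough illustration of that idea (not the patented implementation, whose details are not given here), the following numpy sketch keeps only the non-zero activation columns, multiplies them against the matching rows of the weight matrix, and recovers the same output as the dense product:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy activation matrix (rows = output positions, columns = input channels).
# ReLU-style sparsity: most entries are zero.
acts = np.maximum(rng.normal(size=(8, 16)) - 1.0, 0.0)
weights = rng.normal(size=(16, 4))            # maps 16 input channels to 4 outputs

# Dense reference result.
dense_out = acts @ weights

# "Compression": keep only columns of the activation matrix that contain
# at least one non-zero value, and remember which channels they were.
nonzero_cols = np.flatnonzero(np.any(acts != 0.0, axis=0))
compressed_acts = acts[:, nonzero_cols]       # fewer columns than `acts`

# Process the compressed activations against the matching weight rows.
sparse_out = compressed_acts @ weights[nonzero_cols, :]

assert np.allclose(dense_out, sparse_out)
print(f"kept {len(nonzero_cols)} of {acts.shape[1]} activation columns")
```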
The MFISTA algorithm
The MFISTA algorithm (Multifrontal Iterative Shrinkage-Thresholding Algorithm) is an iterative optimization algorithm for solving sparse representation problems. Within the framework of L1-norm regularization, it can efficiently solve signal reconstruction problems that have sparse structure. A sparse representation problem asks, given a set of basis functions, for a representation of a signal as a linear combination of as few of those basis functions as possible. This problem has wide applications in signal processing, image processing, machine learning, and other fields. The goal of MFISTA is to find an optimal sparse representation by iteration. Its core idea is to recast the sparse representation problem as an optimization problem and to solve it by iterative shrinkage-thresholding.
The concrete steps of the algorithm are as follows (a short code sketch follows this list):
1. Initialization: given the signal and the basis functions, initialize the sparse representation.
2. Iterative update: repeatedly update the sparse representation until convergence. In each iteration, first compute the gradient, then update according to the gradient information.
3. Shrinkage thresholding: while updating the sparse representation, a threshold must be chosen. MFISTA uses an adaptive threshold-selection scheme, called the FISTA step, which selects the threshold from the difference between two successive iterates and thereby achieves better convergence behavior.
4. Stopping condition: when the sparse representation no longer changes significantly, the algorithm has converged and the final sparse representation is obtained.
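The sketch below shows a standard FISTA-style iteration for the L1-regularized least-squares problem min_x 1/2‖Ax − b‖² + λ‖x‖₁, with soft-thresholding as the shrinkage step. It is an illustration of the general iteration family described above rather than the exact MFISTA procedure; the step size, iteration count, and problem sizes are assumptions chosen for the example.

```python
import numpy as np

def soft_threshold(v, thresh):
    """Elementwise shrinkage operator used in ISTA/FISTA-type methods."""
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)

def fista_lasso(A, b, lam, n_iter=200):
    """FISTA-style iteration for 0.5*||Ax - b||^2 + lam*||x||_1."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    y, t = x.copy(), 1.0
    for _ in range(n_iter):
        grad = A.T @ (A @ y - b)           # gradient of the smooth part at y
        x_new = soft_threshold(y - grad / L, lam / L)
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        y = x_new + ((t - 1.0) / t_new) * (x_new - x)   # momentum/extrapolation step
        x, t = x_new, t_new
    return x

# Tiny synthetic example: recover a sparse vector from noisy measurements.
rng = np.random.default_rng(0)
A = rng.normal(size=(40, 100))
x_true = np.zeros(100)
x_true[[3, 17, 42]] = [1.5, -2.0, 1.0]
b = A @ x_true + 0.01 * rng.normal(size=40)
x_hat = fista_lasso(A, b, lam=0.1)
print("non-zeros recovered:", np.flatnonzero(np.abs(x_hat) > 0.1))
```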
Compared with other sparse representation algorithms, MFISTA has the following advantages:
1. Fast convergence: by introducing the FISTA step, MFISTA converges faster and runs more efficiently.
2. Better sparsity: MFISTA yields sparser representations, i.e., it represents the signal with fewer basis functions.
3. Broad applicability: MFISTA performs well on sparse representation problems across many domains, including image processing, compressed sensing, and signal recovery.
4. Strong robustness: MFISTA is fairly robust to noise and to incomplete data, and can handle some complex practical problems.
Although MFISTA has achieved good results on sparse representation problems, it still has limitations. First, its convergence behavior depends on the initialization: different initial values may lead to different convergence results.
《神经网络导论》实验一Adaline的LMS算法专业班级硕2081学号**********姓名李海玥完成时间2012年12月《神经网络导论》实验一Adaline的LMS算法李海玥2012年12月一、实验目的1.通过实验了解Adaline的工作原理;2.对比LMS三种算法,并通过上机实验掌握具体的实现方法;3.与采用硬限幅函数的单个神经元模型进行对比,比较其异同。
II. Principle of the experiment

2.1 The Adaline principle
A single neuron with a hard-limiting activation function can, with a simple learning algorithm, successfully classify two linearly separable classes. For most classes that are not linearly separable, however, it cannot perform the classification, so the Adaline approach, a neuron with a linear activation function, is used instead.

Let the input vector be X = [x_1, x_2, ..., x_N] and the weight vector be W = [w_1, w_2, ..., w_N]. The neuron's output is

    I = W X^T = X W^T,    y = f(I) = W X^T = X W^T.

Following the least-squares principle, the goal is to minimize, over all samples, the mean-square value of the error between the desired response d and the actual output y. Define the error ε = d − y. Considering the mean-square error over all possible samples,

    E[ε²] = E[(d − y)²] = E[d²] + W R W^T − 2 P W^T,

where R ≡ E[X^T X] is the input correlation matrix and P = E[d X] is the cross-correlation vector between the input and the desired output. Setting the gradient to zero gives the optimal weight vector

    W* = P R^{-1}.

2.2 Exact recursive learning algorithm for the LMS problem
1. Set the initial weight vector W(0) arbitrarily;
2. For each time step k, adjust the weight vector according to

    W(k+1) = W(k) + 2α E[ε(k) X(k)].

2.3 Stochastic-approximation algorithm for the LMS problem
The exact recursion is modified to the form

    W(k+1) = W(k) + 2α(k) ε(k) X(k),

where
1) α(k) is a non-increasing function of the time index k;
2) Σ_{k=0}^∞ α(k) = ∞ and Σ_{k=0}^∞ α²(k) < ∞.

2.4 Statistics-based LMS learning algorithm
This is a learning algorithm with a certain amount of statistical averaging; its recursion is

    W(k+1) = W(k) + (2α/P) Σ_{p=1}^{P} ε_p(k) X_p(k).

If P is very large, this is equivalent to the exact recursive method; if P = 1, it reduces to the stochastic-approximation method.
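As a concrete illustration of the stochastic-approximation update W(k+1) = W(k) + 2α ε(k) X(k) described above, here is a small numpy sketch that trains an Adaline on a toy linearly separable problem. The data, step size, and epoch count are arbitrary choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-class, linearly separable data; targets d are +1 / -1.
n = 200
X = rng.normal(size=(n, 2))
d = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1.0, -1.0)
X = np.hstack([X, np.ones((n, 1))])    # append a bias input

w = np.zeros(3)                        # W(0): arbitrary initial weights
alpha = 0.01                           # fixed step size

for epoch in range(20):
    for x_k, d_k in zip(X, d):
        y_k = w @ x_k                  # linear output y = W X^T
        eps = d_k - y_k                # error epsilon = d - y
        w = w + 2 * alpha * eps * x_k  # LMS stochastic-approximation update

pred = np.sign(X @ w)                  # hard decision used only for evaluation
print("training accuracy:", np.mean(pred == d))
```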
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 55, NO. 10, NOVEMBER 20083077Adaptive Beamforming and Recursive DOA Estimation Using Frequency-Invariant Uniform Concentric Spherical ArraysH. H. Chen, Member, IEEE, S. C. Chan, Member, IEEE, Z. G. Zhang, Member, IEEE, and K. L. Ho, Senior Member, IEEEAbstract—This paper proposes recursive adaptive beamforming and broadband 2-D direction-of-arrival (DOA) estimation algorithms for uniform concentric spherical arrays (UCSAs) having nearly frequency-invariant (FI) characteristics. The basic principle of the FI-UCSA is to transform the received signals to the phase mode and remove the frequency dependency of individual phase modes through a digital beamforming network. Hence, the far-field pattern of the array is determined by a set of weights. Thanks to the FI characteristic, traditional narrowband adaptive beamforming algorithms such as minimum variance beamforming and the generalized sidelobe canceller method can be applied to the FI-UCSA. Simulation results show that the proposed adaptive FI-UCSA beamformer achieves a lower steady-state error and converges faster than the conventional tapped-delay line approach while requiring fewer adaptive coefficients. A new broadband 2-D DOA estimation algorithm using ESPRIT techniques for FI-UCSA is proposed to recursively estimate the DOAs of the moving targets. Simulation results show that the proposed DOA estimation algorithm achieves a satisfactory performance for slowly varying sources at low arithmetic complexity. Index Terms—Array processing, broadband 2-D direction-of-arrival (DOA) estimation, broadband adaptive beamforming, frequency-invariant (FI), subspace tracking, uniform concentric spherical array (UCSA).I. INTRODUCTION EAMFORMING using sensor arrays is an effective method for suppressing interferences whose angles of arrival are different from the desired looking direction. They find important applications in sonar, radar, acoustics, and radio communications [1]–[3]. Traditional adaptive broadband beamformers usually employ tapped-delay lines or linear transversal filters with adaptive coefficients to generate appropriate beampatterns for suppressing undesirable interference. This usually requires a considerable number of adaptive coefficients resulting in a rather long convergence time and high implementation complexity. These problems can beManuscript received September 11, 2007; revised February 15, 2008. First published April 30, 2008; current version published November 21, 2008. This paper was recommended by Associate Editor P. Regalia. H. H. Chen is with the Communication Systems Group, Darmstadt University of Technology, Darmstadt 64283, Germany (e-mail: haihua.chen@ nt.tu-darmstadt.de). S. C. Chan and K. L. Ho are with the Department of Electrical and Electronic Engineering, The University of Hong Kong, Hong Kong SAR (e-mail: scchan@eee.hku.hk; klho@eee.hku.hk). Z. G. Zhang is with the Department of Orthopaedics and Traumatology, The University of Hong Kong, Hong Kong SAR (e-mail: zgzhang@eee.hku.hk). Digital Object Identifier 10.1109/TCSI.2008.924126Bremedied by using subband decomposition technique, partial adaptation, or using frequency-invariant beamformers (FIBs) [4]–[8]. In FIBs, a beamforming network is used to generate a beampattern with approximately frequency-invariant (FI) characteristics over the frequency band of interest. They can attenuate broadband directional interference using an adaptive beamformer with very few adaptive filter coefficients [4]. 
One of the widely studied FIBs is the uniform linear array (ULA) FIB [4]–[6], [9]. Due to the geometry of ULA, it allows many efficient direction-of-arrival (DOA) detection algorithms to be developed. For example, the MUSIC algorithm [10] provides a high-resolution method for detecting the angle of arrival (AoA) of the signal sources based on the subspace approach. The MUSIC algorithm is also applicable to DOA estimation of wideband coherent sources by performing the algorithm in beamspace using ULA-FIB [7]. Besides AoA estimation of wideband sources, adaptive interference suppression using beamspace adaptive beamforming [4] is also very attractive because of the small number of adaptive coefficients required and the possibility of employing partial adaptation, yielding faster convergence and fewer high-speed variable multipliers. Recently, electronic steerable uniform-circular arrays (UCAs) [11] and uniform concentric circular arrays (UCCA) with FI characteristics were studied in [12]–[16]. The beampatterns of UCA and UCCA FIBs are almost FI and are governed by a set of weights or coefficients. An important advantage of UCA and UCCA FIBs is that the beampattern is electronically steerable and has a uniform resolution over 180 in the azimuth angle. In addition, the beampattern can be made adaptive using significantly fewer numbers of adaptive parameters than conventional broadband tapped delay line-based adaptive beamformers. Consequently, this leads to a lower arithmetic complexity, better numerical property and hence a higher output signal-to-inference-plus-noise ratio (SINR) than the conventional tapped-delay line approach [15]. However, due to the geometry of the UCCA, the beampattern is not arbitrarily steerable with respect to the elevation angle. To overcome these disadvantages, the authors proposed a uniform concentric spherical array (UCSA) with FI characteristics in [18] and [19]. The UCSA-FIB has all of the advantages of the UCCA mentioned above while possessing electronically steerable characteristic and uniform beampattern in both the azimuth and elevation angles. These characteristics make UCSA more suitable than the UCCA in two-dimensional (2-D) DOA estimation and spatial-time beamforming.1549-8328/$25.00 © 2008 IEEEAuthorized licensed use limited to: NANJING UNIVERSITY OF AERONAUTICS AND ASTRONAUTICS. Downloaded on September 5, 2009 at 00:59 from IEEE Xplore. Restrictions apply.3078IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 55, NO. 10, NOVEMBER 2008The basic idea of a UCSA-FIB is to transform each snapshot sampled by the array to phase mode signals via 2-D inverse discrete Fourier transform (IDFT). Each phase mode signal is then filtered by the compensation filter to compensate for the frequency dependency. Finally, these frequency invariant phase-modes are linear combined using a matrix of beamforming weights or coefficients to obtain the desired frequency invariant beampatterns. Based on the theory of the UCSA-FIB proposed in [19], the design of the beamformer is separated into two steps. The first step is to design the compensation filters phase mode by phase mode, which reduces largely the number of the variables and numerical difficulties, hence making the design process realizable. In addition, the compensation filters are designed using second order cone programming (SOCP) [17], which guarantees the optimal solution if it exists. The second step is to design the spatial weighting coefficient matrix for fixed frequency invariant beampatterns. 
Again, it can be designed using SOCP according to the direction of beam target and beamwidth. Broadband adaptive beamforming and DOA estimation algorithms using UCSA-FIBs are developed in [19]. The minimum variance beamforming (MVB) algorithm [20] and sample matrix inversion (SMI) method are employed in the adaptive beamforming algorithm. Thanks to the frequency invariant characteristic of the UCSA-FIB, high output SINRs can be obtained with very few adaptive coefficients. However, due to the matrix inversion operation required in the SMI method, the computational complexity is very high. Therefore, recursive adaptive beamforming algorithms using UCSA-FIB and generalized sidelobe canceller (GSC) structure are studied in this paper. Another purpose of this paper is to develop a robust adaptive beamforming algorithm for the proposed UCSA-FIB. Due to DOA estimation errors or array implementation imperfection such as calibration errors, the actual steering vector of the array may deviate from the assumed signal model. As a result, conventional beamforming methods such as MVB may suppress the desired signal as well as the interfering signals, causing severe performance degradation. A conventional approach is to limit the norm of the adaptive weight vector [21], the so-called norm constraint, to prevent the desired signal from being canceled. For recursive least-squares (RLS) implementation of the adaptive beamformer, this norm constraints can also be realized by adding a small identity matrix to the autocorrelation matrix of the sensor output. Because of this operation, this method is also called diagonal loading [21]. In this paper, we shall develop robust adaptive beamforming algorithms and diagonal loading approaches for UCSA-FIBs because of their good performance and low implementation complexity. The performance of the proposed robust UCSA-FIB is evaluated in the presence of DOA estimation errors. Simulation results show that the robust UCSA-FIB has a better performance than its original counterpart when there are DOA estimation errors. The DOA estimation algorithm for the UCSA-FIB proposed in [19] focuses mainly on static sources. However, in many applications such as communications, the sources are moving. To reduce such errors, recursive broadband 2-D DOA estimation algorithms using UCSA-FIB and subspace tracking method areFig. 1. UCCA with P rings and K sensors at each ring.also developed in this paper. The proposed DOA estimation algorithm is based on the unitary ESPRIT algorithm [22]–[24], which was originally proposed in [25] for narrow band circular array. In the ESPRIT algorithm, DOA estimations are obtained from the signal subspace, which can be estimated adaptively using subspace tracking techniques. Traditional subspace-based algorithms usually compute the eigenvalue decomposition (ED) or the singular value decomposition (SVD) of the data matrix in order to estimate the signal or noise subspace [26]. Instead of updating the whole eigen-structure, subspace tracking only works with the signal or noise subspace to lower the computational complexity and storage requirements. A very efficient subspace tracking algorithm is the projection approximation subspace tracking (PAST) algorithm [27]. The PAST algorithm considers the signal subspace as the solution of an unconstrained minimization problem, which can be solved by an exponentially weighted RLS problem by an appropriate project approximation. 
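Since the recursive DOA estimator builds on PAST, a minimal sketch of the standard exponentially weighted PAST recursion may help fix ideas. The data model, array size, and forgetting factor below are illustrative assumptions, and the unitary-ESPRIT and angle-pairing steps needed for actual broadband 2-D DOA estimation are omitted.

```python
import numpy as np

def past_track(snapshots, r, beta=0.97):
    """Track an r-dimensional signal subspace with the PAST recursion.
    snapshots: complex array of shape (M, N) -- one column per time step."""
    M, N = snapshots.shape
    W = np.eye(M, r, dtype=complex)          # subspace estimate
    P = np.eye(r, dtype=complex)             # inverse correlation of the projected data
    for n in range(N):
        x = snapshots[:, n]
        y = W.conj().T @ x                   # projection (approximation) of x
        h = P @ y
        g = h / (beta + np.real(y.conj() @ h))
        P = (P - np.outer(g, h.conj())) / beta
        e = x - W @ y                        # projection error
        W = W + np.outer(e, g.conj())        # rank-one subspace update
    return W

# Toy example: two complex exponentials in noise on an 8-element array.
rng = np.random.default_rng(0)
M, N = 8, 1000
a1 = np.exp(1j * np.pi * 0.2 * np.arange(M))
a2 = np.exp(1j * np.pi * -0.4 * np.arange(M))
S = rng.normal(size=(2, N)) + 1j * rng.normal(size=(2, N))
X = np.vstack([a1, a2]).T @ S + 0.1 * (rng.normal(size=(M, N)) + 1j * rng.normal(size=(M, N)))
W = past_track(X, r=2)

# Check how well the first steering vector lies in the tracked subspace.
proj = W @ np.linalg.lstsq(W, a1, rcond=None)[0]
print("relative residual of a1 outside the tracked subspace:",
      np.linalg.norm(a1 - proj) / np.linalg.norm(a1))
```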
To address the problem of moving sources, a variable forgetting factor (VFF) RLS algorithm originally proposed in [28] is incorporated into the PAST algorithm for broadband 2-D DOA estimation using the proposed UCSA-FIBs. Simulation results show that VFF-RLS algorithm obtains lower DOA estimation errors than traditional RLS algorithm when the sources move fast. The paper is organized as follows. In Section II, the theory and design of the UCSA-FIB are briefly reviewed. Recursive adaptive beamforming algorithms using UCSA-FIB and their robust counterparts are proposed in Section III. Section IV presents the subspace-based recursive 2-D DOA estimation algorithm. Design example and simulation results are given in corresponding sections. Finally, conclusions are drawn in Section V. II. FI UCSAS The UCSA proposed in this paper is constructed from a series of vertical UCCAs, which are uniformly distributed along the azimuth angle as shown in Fig. 1. Each UCCA is comrings and each ring has omnidirectional senposed of (represented as Cartesors located at sian coordinate with the center as the origin) where is the , , and radius of the th ring,Authorized licensed use limited to: NANJING UNIVERSITY OF AERONAUTICS AND ASTRONAUTICS. Downloaded on September 5, 2009 at 00:59 from IEEE Xplore. Restrictions apply.CHEN et al.: ADAPTIVE BEAMFORMING AND RECURSIVE DOA ESTIMATION USING FI UCSAS3079the elevation angle. Based on the UCCA described above, the steering vector of this ring can be written as(2) th ring remains The signal received by the sensors of the the same if the ring and source are rotated together to the deth ring is sired location. Therefore, the steering vector of the given by (2). However, the source is rotated instead of the ring plane, which makes the problem simpler. At the same time, this rotation will not alter the array beampattern. Equation (2) can also be written in a more concise form asFig. 2. Geometry of the UCSA.. In UCCAs, the inter-sensor spacing in each ring is fixed at , where is the smallest wavelength of the array to be operated. According to the propagation theory of electromagnetic waves, the steering vector can be written as follows [2], [15]:(3) Figs. 3 and 4 show the two parts of the structure of the broadband FIB for a -sphere UCSA. The spatial-temporal response of the UCSA as shown in [15] and [19] is given by (4), shown at the bottom of the page, where and , and are the numbers of phase modes in the dimension of elevation angle and azimuth angle, is the weight coefficient of respectively. phase mode, and is the frequency response of the th phase mode in the th compensation filter of the ring. In addition, we assume for simplicity that the weighting , as shown in Fig. 4. matrices be identical After some algebraic manipulations and approximations, (4) can be approximated as (5), shown at the bottom of the page, , where is an integer. It can be seen that, if with the term inside the bracket is independent of the frequency variable , then the beampattern will be approximately frequency invariant. It can also be seen that, for a single sphere with radius , the bandwidth of the array without compensation is deter. mined by the term Rings and hence spheres with small radii usually have better high frequency responses and vice versa. 
Therefore, to obtain a(1) where is the angular frequency variable, is the normalized , denotes the ratio radius and has the form of of the sampling frequency to the maximum frequency ( ), and and are the azimuth angle and the elevation angle respecttively, as shown in Fig. 2. , Around the diameter that has azimuth angle of each ring of the UCCA is rotated to angles of , to obtain rings that have different elevation angles. The -sphere UCSA is composed of all these rings. For notational simplicity, the sphere obtained this way is indexed by . The geometry of the UCSA obtained in this way is th ring, shown in Fig. 2. To derive the steering vector of the let us consider a ring located at an elevation angle of 90 with in the source clockwise situated at an angle of(4)(5)Authorized licensed use limited to: NANJING UNIVERSITY OF AERONAUTICS AND ASTRONAUTICS. Downloaded on September 5, 2009 at 00:59 from IEEE Xplore. Restrictions apply.3080IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 55, NO. 10, NOVEMBER 2008Fig. 3. First part of the UCSA-FIB block diagram for P spheres.Fig. 5. Frequency and spatial response against azimuth angles. Fig. 4. Second part of the UCSA-FIB block diagram for P spheres.FI-UCSA with large bandwidth, small responses of the Bessel function at certain frequencies have to be compensated by . This is undesirable in general because it leads to considerable noise amplification. Fortunately, by employing more spheres in a UCSA, spheres with different radii (each with different values of are able to contribute to the overall responses. Hence, a wider bandwidth can then be obtained. in each From (5), we can see that, if the filter phase mode is designed such that (6), shown at the bottom of the and are respectively the lower page, is satisfied, where and upper frequencies of interest, then the beamformer in (4) and will be approximately FI within(7) Furthermore, its far-field pattern is now governed by the spatial alone. They can either be designed offline weighting to achieve different but fixed beampatterns or be varied online for adaptive beamforming. Since the left-hand side of (6) is a ’s, the delinear function of the filter coefficients in sign problem in (6) can be treated as a filter design problem with all the filter outputs adding up to a desired magnitude response of value 1 with linear phase response. If the minimax can error criterion is used, the filter coefficients of be determined by SOCP [17]. The detailed formulation of thedesign problem using SOCP is omitted here due to page limitation. Interested readers are referred to [15] and [19] for details. To avoid amplifying the noise too much, we also impose constraints on the magnitude responses of the compensation filters, , where is certain constant. This constraint is convex and can be readily enforced to the FIB design problem. It can also be seen from (7) that the far-field spatial response is similar to that of a 2-D digital FIR filter with impulse response . Therefore, can be designed by conventional 2-D filter design algorithms such as window method or SOCP if convex quadratic constraints are to be imposed. We now consider a design example on UCSA-FIB. 1) Example 1: UCSA-FIB With Two Spheres: A two-sphere UCSA is considered in this example. It is obtained by rotating UCCAs with two rings. The inner ring and the outer ring have and omnidirectional sensors, respectively. The inner ring and outer ring are then rotated with , , respectively. The required bandwidth . 
The numbers of phase of the UCSA-FIB is and , modes are set as We choose the central 9 9 spatial filter coefficients (phase mode) out of the 17 17 to shape the spatial response of the UCSA-FIB. The desired beam is targeted at and the beamwidth is 10 . The magnitudes of the compensation filters in the first sphere and second sphere are constrained to be no larger than 1 and 20, respectively. The spatial weights are designed using SOCP. The perspective views of the frequency response against azimuth angle is shown in Fig. 5.(6)Authorized licensed use limited to: NANJING UNIVERSITY OF AERONAUTICS AND ASTRONAUTICS. Downloaded on September 5, 2009 at 00:59 from IEEE Xplore. Restrictions apply.CHEN et al.: ADAPTIVE BEAMFORMING AND RECURSIVE DOA ESTIMATION USING FI UCSAS3081some discussions on the frequency band of the UCSA-FIB. As can be seen from (5), the beamformer will be approximately frequency invariant if (6) is satisfied. Theoretically, we can design a UCSA that is approximately FI over the whole band. with However, the value of the Bessel function closes to zero when goes to 0, which requires the magnitudes of the compensation filters to be very large to satisfy (6). Such compensation filters will amplify the noise to a quite high level and make it difficult for the following signal processing. As we will see later in Sections III and IV, the bandwidth of is realizable and such a band is sufficiently wide for many broadband applications. III. ADAPTIVE BEAMFORMING USING UCSA-FIBFig. 6. Spatial responses of the UCSA-FIB in different frequency samples against azimuth angle.The frequency invariant performance is very satisfactory. To illustrate the frequency invariant performance more clearly, the frequency responses of the UCSA-FIB at 128 different are plotted together in frequency values within Fig. 6. The spatial responses at different frequencies almost overlap, which implies the array response is approximately FI. The FI performance against the elevation angle is also satisfying. It is not shown here, and interested readers are referred to [14] and [17] for more details. Like their FI-UCCA counterparts [13], [14], the above FIB can be modulated to form a bank of FIBs for broadband beamspace DOA estimation. Such a nice property can be seen from (5) that the weight coefficients and the compensation filters are separable. Since the weight coefficients determine the spatial response of the beamformer while the compensation filters guarantee the approximately frequency invariancy, different FI beampatterns can be obtained by changing the weight matrix only. For example, we have designed a beampattern with a weight that targets on (0 , 0 ). To obtain a beampattern matrix , we can modulate the weight matrix by that targets on , where , denotes the Hadamard (element-wise) matrix multiplication, and the superscript denotes the transpose. Moreover, the weights can be made adaptive to form adaptive UCSA-FIB with very few variable multipliers and fast convergence speed as we shall see later in Section III. Before closing this section, we makeThe UCSA-FIBs designed in the previous section are electronically steerable in both azimuth and elevation angles, which makes them more appropriate than UCCA-FIB for 2-D adaptive beamforming and 2-D DOA estimation. broadband signals Assuming that there are impinging a -sphere UCSA at angles of , and . 
According to (7), the output of the UCSA-FIB can be written as(8) where trum of the impinging signals is the spec,along with the equations shown at the bottom of the is the Fourier transform of the additive page, and at the th white Gaussian sensor noise element of the th sphere. From the theory of UCSA-FIB described in Section II, we know that the vector is designed to be FI and hence , . Therefore, the output of the UCSA-FIB in (8) can be simplified to(9)Authorized licensed use limited to: NANJING UNIVERSITY OF AERONAUTICS AND ASTRONAUTICS. Downloaded on September 5, 2009 at 00:59 from IEEE Xplore. Restrictions apply.3082IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 55, NO. 10, NOVEMBER 2008For notation convenience, we shall replace the approximation sign by the equality sign and assume that the errors are absorbed . Taking the IDFT, one into the sensor noise terms gets the time-domain expression of the output as follows:(10)Fig. 7. Structure of the MVDR beamformer.where , with Fourier transform , is the output noise of the phase mode of the beamformer, is the th compensated phase mode and signal as shown in Fig. 4. Before proceeding further, we in, which maps a matrix to a troduce an operator 1 vector by stacking the elements of the matrix column by column. With this operator, (10) can be written in a more compact form as (11) where the superscript denotes the Hermitian transpose, , , , . The components and of matrices , , and are , , , and , respectively. By rearranging the 2-D beamforming problem in the form of (11), traditional adaptive beamforming algorithms such as minimum variance distortionless response (MVDR) [20] can be apis the arriving angle of the desired plied. Assume signal. To recover the desired signal from the array output, the MVDR algorithm minimizes the output energy while requiring the response of the array at the desired looking direction to be 1, hence the name MVDR. The structure of the MVDR algorithm is shown in Fig. 7. The input signals of the beamformer in (11), which are the compensated phase mode vector is delayed and weighted to get the signal at the desired direction. Each delay line is a linear transversal filter and is called a tapped-delay line. Let the length of the tapped-delay line be , the MVDR problem can be written as (12) where correlation matrix of the data vector is the autoFig. 8. Structure for GSC structure.Given a series of snapshots , say , the autocorrelation matrix can be approximated by . Thus, can be oband substituting it into tained by inverting the matrix the right-hand side of (14). This is called the sample matrix inversion (SMI) method, and it is computationally expensive because of the matrix inversion involved. Therefore, we shall solve the weight vector recursively using adaptive filtering algorithms such as the RLS algorithm and the least mean-squares (LMS) algorithm using the Generalized Sidelobe Canceller , all of the 1-D (GSC) structure [30]. With the operator recursive adaptive beamforming algorithms can be employed in the 2-D adaptive beamforming using UCSA-FIB. The structure of the GSC beamformer using UCSA-FIB is shown in Fig. 8. The beamformer input is the phase mode signal vector obtained . from the phase mode signal matrix with the operator In the GSC, the weighting vector is decomposed into two and the adaptive part . The fixed parts: the fixed part weight vector , as shown in the upper part in Fig. 
8, forms the main beam that is steered towards some assumed propagahas the following tion direction. Therefore, the weight vector form: (15)andandtake the following form: .. (13).This constrained optimization can be solved analytically and the optimal solution is (14)In the proposed UCSA-FIB, the direction of the beam can be readily steered to the desired looking direction by modulating the beam weight matrix with an appropriate sinusoidal matrix. is continuously updated in order to reThe adaptive part move any undesired signals other than the looking direction from appearing at the array output. With the blocking matrix , the input to the adaptive part mainly consists of the undesirable signals. The adaptively weighted interference signal isAuthorized licensed use limited to: NANJING UNIVERSITY OF AERONAUTICS AND ASTRONAUTICS. Downloaded on September 5, 2009 at 00:59 from IEEE Xplore. Restrictions apply.CHEN et al.: ADAPTIVE BEAMFORMING AND RECURSIVE DOA ESTIMATION USING FI UCSAS3083then subtracted from the main beam in order to cancel the interference that is present in the main beam. This is achieved by minimizing the output energy of the beamformer using either the RLS or LMS adaptive filtering algorithms. Leakage of the desired signal through the adaptive part will lead to the annihilation or attenuation of the desired signal, i.e., signal cancellation. Robust beamforming techniques are usually employed to avoid this problem, which will be discussed later in this section. is upIn the LMS algorithm, the adaptive weight vector dated in the negative direction of the gradient of the MSE funcas tion (16) where and are defined in (13). is the blocking mais the stepsize parameter. It is called normaltrix and is determined by ized LMS (NLMS) algorithm when , where is the normalized stepsize parameter and is a small positive constant to ensure that will not be very large when the norm of the input is very small. In the RLS algorithm, the inverse of the autocorrelation matrix is updated recursively and the adaptive weight vector is updated as [31] (17) wherewhere is the Lagrange multiplier. Due to the difficulty of choosing an appropriate value of in practical implementation, some alternative methods were proposed in [32]–[34]. The scaled projection (SP) method in [32] is applied to the proposed UCSA-FIB in this paper. In the SP method, the weight vector is updated directly from the norm constraint. does For the LMS-GSC algorithm, the fixed weight vector not depend on the covariance matrix of the beamformer input, is and it has the same solution as in (15). The adaptive part adapted with the following scaled projection algorithm [32]: for for (20)is where is the norm constraint of the weight vector and in (16). For robust RLS algoequal to the weight vector rithm, the forgetting factor is updated adaptively and the adaptive weight vector is updated as [35] for for where , , , is equal to the weight vector , , means in (21)is the Kalman gain, is the inverse of the autocorrelation matrix, and it can be updated as with , is is nonsingular initially, and a small number to ensure that is the forgetting factor that controls the tracking ability and steady-state error of the RLS algorithm. In practical applications, the actual array response may deviate from the assumed signal model due to DOA estimation and other implementation errors. As a result, conventional beamforming methods such as GSC may suppress the desired signal as well as the interfering signals. 
Robust beamforming algorithms are usually employed to remedy this problem. In the method of diagonal loading [32], the norm of the weight vector is constrained in order to avoid severe signal cancellation. Adopting this idea in our UCSA-FIB, the robust adaptive beamforming problem is written as(18) where is a constant norm bound for the weight vector. The analytical solution can be obtained by employing the Lagrange multiplier technique(19)real part of and (17). Before presenting the simulation results, the arithmetic complexities of the traditional UCSA beamformer (the beamformer using UCSA without compensation filters) and the UCSA-FIB are compared roughly. A digital beamformer usually consists of the complexities for the fixed filtering and the adaptive filtering parts [14]. The order of the arithmetic complexity per sample for the fixed filtering part is usually proportional to the filter length, while the order of arithmetic complexity per unit time for the adaptive filtering part depends on the algorithm used. Let denote the number of adaptive coefficients, if the blocking matrix has dimension of , the arithmetic complexities per unit time for LMS-GSC are of order . Let denote the length of the adaptive tapped-delay line and denote the length of the compensation filters in a UCSA-FIB, the complexity of the fixed and adaptive parts are, respectively, and , where , , , is the number of the usable phase modes of the th sphere and the size of the blocking matrix is . Let denote the length of the broadband fractional delay filters used in the traditional UCSA, the complexity of the fixed and adaptive filtering parts are, respectively, and , given by , , , where is the number of the sensors in the th sphere and the size of the blocking matrix is . In this example, the numbers of sensors in the two spheres of the UCSA are, respectively, 4 4 and 6 6. The length ( ) of the compensation filters required is 31 and the number of usable phase modes is 3 3 in both of the two spheres. On the other hand, the length ( ) of the broadband fractional delay filters usedAuthorized licensed use limited to: NANJING UNIVERSITY OF AERONAUTICS AND ASTRONAUTICS. Downloaded on September 5, 2009 at 00:59 from IEEE Xplore. Restrictions apply.。
Analysis of the LMS, APA, NLMS, and FRLS Algorithms
The LMS algorithm is one of the most commonly used adaptive filtering algorithms; it is built on the minimum mean-square error (MSE) criterion. Its basic idea is to take the mean-square error between the desired response and the observed output as the performance index and to drive that error down, thereby arriving at an optimal filter design. LMS can adjust the filter coefficients quickly and efficiently so as to maximize the filter's ability to suppress noise, improving the estimate step by step as new samples arrive.
The APA (affine projection algorithm) is another commonly used adaptive filtering algorithm. Instead of updating the filter from the current input vector alone, it constrains the update with a small block of recent input vectors, so it accounts not only for the strength of the input and noise but also for their correlation over time, which lets it suppress correlated noise more effectively. In most cases APA is more effective and more stable than LMS, and its coefficient updates are smoother.
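For reference, here is a minimal sketch of the standard affine-projection update of order K, in textbook form rather than any specific variant described above; the step size, regularization, and signal sizes are illustrative assumptions.

```python
import numpy as np

def apa_update(w, X_block, d_block, mu=0.5, delta=1e-6):
    """One affine projection update.
    X_block: (K, N) matrix whose rows are the K most recent input vectors.
    d_block: (K,) desired responses for those inputs."""
    e = d_block - X_block @ w                       # a-priori errors for the block
    # Solve the small K x K system instead of inverting an N x N matrix.
    gain = np.linalg.solve(X_block @ X_block.T + delta * np.eye(len(d_block)), e)
    return w + mu * X_block.T @ gain

# Toy identification of an unknown 8-tap filter from noisy observations.
rng = np.random.default_rng(0)
h_true = rng.normal(size=8)
x = rng.normal(size=2000)
d = np.convolve(x, h_true, mode="full")[:2000] + 0.01 * rng.normal(size=2000)

K, w = 4, np.zeros(8)
for n in range(8 + K, 2000):
    # Row k holds the input vector [x(n-k), x(n-k-1), ..., x(n-k-7)].
    X_block = np.array([x[n - k - 7 : n - k + 1][::-1] for k in range(K)])
    d_block = np.array([d[n - k] for k in range(K)])
    w = apa_update(w, X_block, d_block)

print("max tap error:", np.max(np.abs(w - h_true)))
```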
The NLMS (normalized LMS) algorithm is a variant of LMS in which the adaptation step is normalized by the energy of the current input vector, so that the effective step size adapts to the signal level. Its main advantages are good accuracy, fast convergence, and solid performance even in heavily noisy conditions. It is also easier to implement than many alternatives, because the only extra intermediate quantity it needs is this simple normalization of the update term, with no matrix inversion at all.
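A terse sketch of the NLMS update in a system-identification setting, showing the input-power normalization that replaces any matrix inversion (all sizes and constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
h_true = rng.normal(size=8)                     # unknown system to identify
x = rng.normal(size=5000)
d = np.convolve(x, h_true, mode="full")[:5000] + 0.01 * rng.normal(size=5000)

w, mu, eps = np.zeros(8), 0.5, 1e-8
for n in range(8, 5000):
    u = x[n - 7 : n + 1][::-1]                  # current input vector [x(n), ..., x(n-7)]
    e = d[n] - w @ u                            # a-priori error
    w += mu * e * u / (u @ u + eps)             # step normalized by input energy

print("max tap error:", np.max(np.abs(w - h_true)))
```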
The FRLS (fast recursive least-squares) algorithm is an adaptive filtering algorithm that approximates the least-squares solution; it is built around propagating the required inverse quantities recursively rather than recomputing them, which lets it handle efficiently cases where direct matrix inversion would be costly.