A Simple Algorithm for Nuclear Norm Regularized Problems
function x_out = MSB(Aop, y_vec, s_x, lambda1, rnk, iter)
% function [x_out, obj_func] = MSB(Aop, y_vec, [m n], lambda1, rnk, iter)
% Recovers a low-rank matrix X from its lower-dimensional projections:
%     minimize ||X||_*  (nuclear norm of X)   subject to  A(X) = y,
% formulated as an unconstrained nuclear norm minimization problem and solved
% with the Split Bregman algorithm:
%     minimize  lambda1*||W||_* + 1/2*||A(X) - y||_2^2 + eta/2*||W - X - B1||_2^2,
% where W is the auxiliary variable and B1 is the Bregman variable.
% (External helpers opDiag, opStack and nuc_norm must be on the path.)
%
% INPUTS
%   Aop     : linear operator
%   y_vec   : vector of observed values
%   s_x     : size of the data matrix to be recovered, in the form [m n]
%   lambda1 : regularization parameter
%   rnk     : rank estimate for X
%   iter    : maximum number of iterations to be carried out
% OUTPUT
%   x_out   : recovered matrix

if nargin < 6
    error('Insufficient number of arguments');
end

% Variable and parameter initialization
eta1 = 0.001;                    % regularization parameter
s_new = s_x(1)*s_x(2);
W = zeros(s_new, 1);             % proxy (auxiliary) variable
B1 = ones(s_new, 1);             % Bregman variable

% Create the stacked operator for the least-squares minimization
I = opDiag(s_new, 1);
Aop_concat = opStack([1 eta1], Aop, I);

% Run the iterative algorithm
for iteration = 1:iter
    % Measurement vector for the L2 minimization
    b_vec = [y_vec; eta1*(W - B1)];
    % L2 minimization step (subproblem 1)
    X = lsqr(@lsqrAop, b_vec);
    % Nuclear norm minimization step (subproblem 2): singular value thresholding
    W = X + B1;
    [W, s_thld] = nuc_norm(W, s_x, s_new, lambda1, eta1, rnk);
    % Bregman variable update
    B1 = B1 + X - W;
    % Objective function value (squared data-fit term plus nuclear norm term)
    obj_func(iteration) = 0.5*norm(y_vec - Aop(X,1))^2 + lambda1*sum(s_thld);
    if iteration > 10 && abs(obj_func(iteration) - obj_func(iteration-10)) < 1e-7
        break
    end
end

% Reshape the recovered vector into matrix form
x_out = reshape(X, s_x(1), s_x(2));

% Plot the objective function
plot(obj_func);
title('Convergence of Objective Function');
xlabel('Number of iterations');
ylabel('Objective function value');

    function y = lsqrAop(x, transpose)
        switch transpose
            case 'transp'
                y = Aop_concat(x, 2);
            case 'notransp'
                y = Aop_concat(x, 1);
        end
    end
end
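A minimal usage sketch for a random matrix completion instance follows. It is not part of the original listing: the restriction operator opRestriction (returning the observed entries of a vectorized matrix, in the same op(x, mode) convention the listing above uses for Aop) is assumed to be available alongside the opDiag/opStack/nuc_norm helpers, and all sizes and parameter values are illustrative only.

    % Hypothetical example: complete a random rank-5 matrix from ~30% of its entries.
    m = 100; n = 100; r = 5;
    M_true = randn(m, r) * randn(r, n);            % ground-truth low-rank matrix
    idx = find(rand(m*n, 1) < 0.3);                % linear indices of the observed entries
    Aop = opRestriction(m*n, idx);                 % assumed sampling operator: Aop(x,1) = x(idx)
    y_vec = Aop(M_true(:), 1);                     % vector of observed values
    X_rec = MSB(Aop, y_vec, [m n], 1e-3, r, 200);  % illustrative lambda1, rank and iteration budget
    rel_err = norm(X_rec - M_true, 'fro') / norm(M_true, 'fro');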
A Simple Algorithm for Nuclear Norm Regularized Problems

Martin Jaggi (jaggi@inf.ethz.ch), Marek Sulovský (smarek@inf.ethz.ch)
Institute of Theoretical Computer Science, ETH Zurich, CH-8092 Zurich, Switzerland

Appearing in Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010. Copyright 2010 by the author(s)/owner(s).

Abstract

Optimization problems with a nuclear norm regularization, such as e.g. low norm matrix factorizations, have seen many applications recently. We propose a new approximation algorithm building upon the recent sparse approximate SDP solver of (Hazan, 2008). The experimental efficiency of our method is demonstrated on large matrix completion problems such as the Netflix dataset. The algorithm comes with strong convergence guarantees, and can be interpreted as a first theoretically justified variant of Simon-Funk-type SVD heuristics. The method is free of tuning parameters, and very easy to parallelize.

1. Introduction

This paper considers large scale convex optimization problems with a nuclear norm regularization, as for instance low norm matrix factorizations. Such formulations occur in many machine learning and compressed sensing applications such as dimensionality reduction, matrix classification, multi-task learning and matrix completion (Srebro et al., 2004; Candes & Tao, 2009). Matrix completion by using matrix factorizations of either low rank or low norm has gained a lot of attention in the area of recommender systems (Koren et al., 2009) with the recently ended Netflix Prize competition.

Our new method builds upon the recent first-order optimization scheme for semi-definite programs (SDP) of (Hazan, 2008) and has strong convergence guarantees. We consider the following convex optimization problems over matrices:

    min_{X ∈ R^{n×m}}  f(X) + μ ||X||_*        (1)

and the corresponding constrained variant

    min_{X ∈ R^{n×m}, ||X||_* ≤ t/2}  f(X)        (2)

where f(X) is any differentiable convex function (usually called the loss function), ||.||_* is the nuclear norm of a matrix, also known as the trace norm (the sum of the singular values, or the l_1-norm of the spectrum). Here μ > 0 and t > 0 respectively are given parameters, usually called the regularization parameter.

When choosing f(X) := ||A(X) − b||_2^2 for some linear map A : R^{n×m} → R^p, the above formulation (1) is the matrix generalization of the problem min_{x ∈ R^n} ||Ax − b||_2^2 + μ ||x||_1, which is the important l_1-regularized least squares problem, also known as the basis pursuit de-noising problem in the compressed sensing literature. The analogous vector variant of (2) is the Lasso problem (Tibshirani, 1996), which is min_{x ∈ R^n} ||Ax − b||_2^2 subject to ||x||_1 ≤ t.

Recently, (Toh & Yun, 2009; Liu et al., 2009) and (Ji & Ye, 2009) independently proposed algorithms that obtain an ε-accurate solution to (1) in O(1/√ε) steps, by improving the algorithm of (Cai et al., 2008). More recently, also (Mazumder et al., 2009) and (Ma et al., 2009) proposed algorithms in this line of so-called singular value thresholding methods, but cannot guarantee a convergence speed. Each step of all those algorithms requires the computation of the singular value decomposition (SVD) of a matrix of the same size as the solution matrix, which is expensive even with the currently available fast methods such as PROPACK.
Both (Toh & Yun, 2009) and (Ji & Ye, 2009) show that the primal error of their algorithm is smaller than ε after O(1/√ε) steps, using an analysis in the spirit of (Nesterov, 1983).

We present a much simpler algorithm to solve problems of the form (2), which does not need any SVD computations. We achieve this by transforming the problem to a convex optimization problem on positive semi-definite matrices, and then using the approximate SDP solver of Hazan (2008). Hazan's algorithm can be interpreted as the generalization of the coreset approach to problems on symmetric matrices. The algorithm has a strong approximation guarantee, in the sense of obtaining ε-small primal-dual error (not only small primal error). With the resulting approximate solution X, our algorithm also gives a matrix factorization X = UV^T of rank O(1/ε) (with the desired bounded nuclear norm). Compared to (Nesterov, 1983), a moderately increased number of steps is needed in total, O(1/ε), which represents the price for the very severe simplification in each individual step of our method on the one hand, and the improved (low) rank on the other hand.

We demonstrate that our new algorithm improves the state of the art nuclear norm methods on standard datasets, and scales to large problems such as matrix factorizations on the Netflix dataset. Furthermore, the algorithm is easy to implement and parallelize, as it only uses the power method (or Lanczos steps) to approximate the largest eigenvalue of a matrix.

Our method can also be interpreted as a modified, theoretically justified variant of Simon Funk's popular SVD heuristic (Webb, 2006), making it suitable for low norm matrix factorization. To our knowledge this is the first guaranteed convergence result for this class of SVD-like gradient descent algorithms. Unlike most other comparable algorithms, our general method is free of tuning parameters (apart from the regularization parameter).

Notation. For arbitrary real matrices, the standard inner product is defined as ⟨A, B⟩ := Tr(A^T B), and the (squared) Frobenius matrix norm ||A||_Fro^2 := ⟨A, A⟩ is the sum of all squared entries of the matrix. By S^{d×d} we denote the set of symmetric d×d matrices. A ∈ R^{d×d} is called positive semi-definite (PSD), written as A ⪰ 0, iff v^T A v ≥ 0 for all v ∈ R^d.

2. Hazan's Algorithm

Our main ingredient is the following simple gradient-descent type algorithm of (Hazan, 2008), to obtain sparse solutions to any convex optimization problem of the form

    min_{Z ∈ S}  f(Z),        (3)

where S := {Z ∈ S^{d×d} | Z ⪰ 0, Tr(Z) = 1} is the set of PSD matrices of unit trace. The set S is sometimes called the Spectrahedron and is a generalization of the unit simplex to the space of symmetric matrices. The algorithm guarantees ε-small primal-dual error after at most O(1/ε) iterations, where each iteration only involves the calculation of a single approximate eigenvector of a matrix M ∈ S^{d×d}. In practice, for example Lanczos or the power method can be used.

Algorithm 1: Hazan's Algorithm
    Input: Convex f with curvature constant C_f, target accuracy ε.
    Initialize Z^(1) := v_0 v_0^T for an arbitrary unit vector v_0.
    for k = 1 to 4C_f/ε do
        Compute v_k := ApproxEV(−∇f(Z^(k)), C_f/k^2).
        Let α_k := 1/k.
        Set Z^(k+1) := Z^(k) + α_k (v_k v_k^T − Z^(k)).
    end for

Here ApproxEV(M, ε') is a subroutine that delivers an approximate largest eigenvector of a matrix M with the desired accuracy ε', meaning a unit length vector v such that v^T M v ≥ λ_max(M) − ε'. Note that as our convex function f takes a symmetric matrix Z as an argument, its gradient ∇f(Z) is a symmetric matrix.
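For illustration only (this sketch is ours, not part of the original paper), Algorithm 1 can be written in a few lines of Python/NumPy for a small dense problem; a shifted power iteration plays the role of ApproxEV, and the helper names (approx_ev, hazan_algorithm), the iteration counts and the quadratic toy objective are our own assumptions.

    import numpy as np

    def approx_ev(M, num_iters):
        # Approximate eigenvector for the largest (algebraic) eigenvalue of the
        # symmetric matrix M: shift the spectrum so the target eigenvalue becomes
        # the dominant one, then run plain power iteration.
        d = M.shape[0]
        shift = np.abs(M).sum(axis=1).max()   # upper bound on the spectral radius
        A = M + shift * np.eye(d)
        v = np.random.randn(d)
        v /= np.linalg.norm(v)
        for _ in range(num_iters):
            v = A @ v
            v /= np.linalg.norm(v)
        return v

    def hazan_algorithm(grad_f, d, num_steps):
        # Algorithm 1: optimization over the spectrahedron {Z >= 0, Tr(Z) = 1}.
        # Each step adds one rank-1 matrix, so Z^(k) has rank at most k.
        v0 = np.zeros(d)
        v0[0] = 1.0
        Z = np.outer(v0, v0)
        for k in range(1, num_steps + 1):
            v = approx_ev(-grad_f(Z), num_iters=10 + k)
            alpha = 1.0 / k
            Z = Z + alpha * (np.outer(v, v) - Z)   # convex combination: stays feasible
        return Z

    # Toy usage: f(Z) = 0.5 * ||Z - B||_Fro^2, whose gradient is Z - B.
    d = 20
    B = np.random.randn(d, d)
    B = (B + B.T) / 2
    Z = hazan_algorithm(lambda Z: Z - B, d, num_steps=100)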
The actual running time for a given convex function f : S^{d×d} → R depends on its curvature constant C_f (also called the modulus of convexity), defined as

    C_f := sup_{Z,V ∈ S, α ∈ R, Z' = Z + α(V − Z)}  (1/α^2) ( f(Z') − f(Z) + ⟨Z − Z', ∇f(Z)⟩ ),

which turns out to be small for many applications [1].

[1] An overview of values of C_f for several classes of functions f can be found in (Clarkson, 2008).

The algorithm can be seen as a matrix generalization of the sparse greedy approximation algorithm of (Clarkson, 2008) for vectors in the unit simplex, called the coreset method, which has seen many successful applications in a variety of areas ranging from clustering to support vector machine training, smallest enclosing ball/ellipsoid, boosting and others. Here sparsity just gets replaced by low rank. The same Algorithm 1 with a well-crafted function f can also be used to solve arbitrary SDPs in feasibility form.

3. Transformation to Convex Problems

3.1. Motivation: Formulating Matrix Factorizations as Convex Optimization Problems

Approximate matrix factorization refers to the setting of approximating a given matrix Y ∈ R^{n×m} (typically given only partially) by a product X = UV^T, under an additional low rank or low norm constraint, such that some error function f(X) is minimized. Most of the currently known gradient-descent-type algorithms for matrix factorization suffer from the following problem: even if the loss function f(X) is convex in X, the same function expressed as a function f(UV^T) of both the factor variables U and V usually becomes a non-convex problem (consider for example U, V ∈ R^{1×1} together with the identity function f(x) = x). Therefore many of the popular methods, such as for example (Rennie & Srebro, 2005; Lin, 2007), can get stuck in local minima and so are neither theoretically nor practically well justified; see also (DeCoste, 2006).

These shortcomings can be overcome as follows: one can equivalently transform any low-norm matrix factorization problem (which is usually not convex in its two factor variables) into an optimization problem over symmetric matrices. For any function f on R^{n×m}, the optimization problem

    min_{U ∈ R^{n×r}, V ∈ R^{m×r}}  f(UV^T)    s.t.  ||U||_Fro^2 + ||V||_Fro^2 = t        (4)

is equivalent to

    min_{Z ∈ S^{(n+m)×(n+m)}, rank(Z) ≤ r}  f̂(Z)    s.t.  Z ⪰ 0,  Tr(Z) = t,        (5)

where "equivalent" means that for any feasible solution of each problem, there is a feasible solution of the other problem with the same objective value. Here f̂ is the same function as f, just acting on the corresponding off-diagonal rectangle of the larger, symmetric matrices Z ∈ S^{(n+m)×(n+m)}. Formally,

    f̂(Z) = f̂( [ Z_1  Z_2 ; Z_2^T  Z_3 ] ) := f(Z_2).

The equivalence holds simply because every PSD matrix Z can be written as some product

    Z = (U; V)(U^T  V^T) = [ UU^T  UV^T ; VU^T  VV^T ]

for U ∈ R^{n×r} and V ∈ R^{m×r}, when r ≥ rank(Z). On the other hand, of course any arbitrary valued matrices U, V give rise to a PSD matrix Z of this form. Furthermore, Tr(Z) = Tr(UU^T) + Tr(VV^T) = ||U||_Fro^2 + ||V||_Fro^2 holds by definition.

The main advantage of this reformulation is that if the rank r := n + m is not restricted, the new problem (5) is now a convex problem over a nice, well-studied convex domain (the cone of PSD matrices of fixed trace), whereas the original formulation (4) is usually not convex in both arguments U and V.
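The equivalence between (4) and (5) is easy to check numerically; the following NumPy snippet (ours, purely illustrative) builds Z from a random factorization (U, V) and verifies that Z is PSD, that its off-diagonal block equals UV^T, and that Tr(Z) = ||U||_Fro^2 + ||V||_Fro^2.

    import numpy as np

    n, m, r = 5, 4, 2
    U = np.random.randn(n, r)
    V = np.random.randn(m, r)

    # Symmetric embedding Z = [U; V][U; V]^T, which is PSD by construction.
    W = np.vstack([U, V])
    Z = W @ W.T

    X = Z[:n, n:]   # the off-diagonal rectangle Z_2
    assert np.allclose(X, U @ V.T)        # so f_hat(Z) = f(Z_2) = f(U V^T)
    assert np.allclose(np.trace(Z),
                       np.linalg.norm(U, 'fro')**2 + np.linalg.norm(V, 'fro')**2)
    assert np.all(np.linalg.eigvalsh(Z) >= -1e-9)   # Z is positive semi-definite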
3.2. Nuclear Norm Regularized Problems

In the same spirit, we obtain that any nuclear norm regularized problem of the form (2) is equivalent to the convex problem given by the following Corollary 2.

Lemma 1. For any non-zero matrix X ∈ R^{n×m} and t ∈ R: ||X||_* ≤ t/2 if and only if there exist symmetric matrices A ∈ S^{n×n} and B ∈ S^{m×m} such that

    [ A  X ; X^T  B ] ⪰ 0   and   Tr(A) + Tr(B) = t.

Proof. This is a slight variation of the argument of (Fazel et al., 2001; Srebro et al., 2004).
(⇒) From the characterization ||X||_* = min_{UV^T = X} (1/2)(||U||_Fro^2 + ||V||_Fro^2) we get that there exist U, V with UV^T = X such that ||U||_Fro^2 + ||V||_Fro^2 = Tr(UU^T) + Tr(VV^T) ≤ t, or in other words we have found a matrix [ UU^T  X ; X^T  VV^T ] of trace, say, s ≤ t. If s < t, we add (t − s) to the top-left entry of A, i.e. we add to A the PSD matrix (t − s) e_1 e_1^T (which again gives a PSD matrix).
(⇐) As the matrix is symmetric and PSD, it can be (Cholesky) factorized to (U; V)(U; V)^T such that UV^T = X and t = Tr(UU^T) + Tr(VV^T) = ||U||_Fro^2 + ||V||_Fro^2, therefore ||X||_* ≤ t/2.

Corollary 2. Any nuclear norm regularized problem of the form (2) is equivalent to

    min_{Z ∈ S^{(n+m)×(n+m)}, Z ⪰ 0, Tr(Z) = t}  f̂(Z).        (6)

Note that both transformations in this section are equivalent formulations and not just relaxations. As already mentioned above, an explicit factorization of any feasible solution to (5) or (6), if needed, can always be directly obtained since Z ⪰ 0. Alternatively, algorithms for solving the transformed problem (5) or (6) can directly maintain the approximate solution Z in a factorized representation, as achieved for example by Hazan's algorithm.

3.3. Two Variants of Regularization

The two original problem formulations (1) and (2) are very closely related, and are used interchangeably in many applications: if X* is an optimal solution to the trade-off variant (1), then the same solution is also optimal for (2) when using the value ||X*||_* as the norm constraint. On the other hand, (1) is just the Lagrangian version of (2), with μ being the Lagrange multiplier belonging to the single constraint. This is the same change in formulation as when going from the regularized least squares formulation (the vector analogue of (1)) to the Lasso problem corresponding to (2), and vice versa.

4. Solving Nuclear Norm Regularized Problems

By the equivalent reformulation of the previous section, as in Corollary 2, we can now solve both general nuclear norm regularized problems and low norm matrix factorizations by using Hazan's algorithm.

Algorithm 2: Nuclear Norm Regularized Solver
    1. Consider the transformed problem for f̂ given by Corollary 2.
    2. Adjust the function f̂ by re-scaling all matrix entries by 1/t.
    3. Run Algorithm 1 for f̂(Z).

The following theorem shows that Algorithm 2 runs in time linear in the number N_f of non-zero entries of the gradient ∇f. This makes it very attractive in particular for recommender systems applications and matrix completion, where ∇f is a sparse matrix (with the same sparsity pattern as the observed entries).

Theorem 3. Algorithm 2 obtains an approximate solution of primal-dual error ≤ ε for problems of the form (2) after at most 4C_f/ε many steps (or, in other words, approximate eigenvector computations). In the k-th call of ApproxEV(), it is sufficient to perform O(k) iterations of the Lanczos method. Then the overall running time is O(N_f/ε^2), or equivalently O(1/ε^2) many sparse matrix-vector multiplications.

Proof. We use Corollary 2 and then rescale all matrix entries by 1/t. The running time then follows from Theorem 2 of (Hazan, 2008).
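To make step 2 of Algorithm 2 concrete, the helper below (our own sketch, not from the paper) wraps a gradient oracle for f on R^{n×m} into a gradient oracle for the rescaled f̂ on symmetric (n+m)×(n+m) matrices of unit trace; the factor t comes from the chain rule for the rescaling, and the resulting oracle could be handed to any solver for problem (3), e.g. the Algorithm 1 sketch above.

    import numpy as np

    def make_rescaled_gradient(grad_f, n, m, t):
        # Gradient oracle for g(Z') := f_hat(t * Z') = f(t * Z'_2), defined on
        # symmetric (n+m) x (n+m) matrices Z' with Tr(Z') = 1; the candidate
        # solution is recovered from the off-diagonal block as X = t * Z'_2.
        def grad(Zp):
            X = t * Zp[:n, n:]          # current candidate X
            G = t * grad_f(X)           # chain rule contributes the factor t
            out = np.zeros((n + m, n + m))
            out[:n, n:] = G             # block structure [[0, G], [G^T, 0]]
            out[n:, :n] = G.T
            return out
        return grad

    # Example: squared loss against a fully observed target Y (hypothetical).
    n, m, t = 8, 6, 5.0
    Y = np.random.randn(n, m)
    grad_fhat = make_rescaled_gradient(lambda X: X - Y, n, m, t)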
The fact that each iteration of our algorithm is computationally very cheap (consisting only of the computation of an approximate eigenvector) strongly contrasts with the existing "singular value thresholding" methods, which in each step need to compute an entire SVD. Such a single incomplete SVD computation (first k singular vectors) amounts to the same computational cost as an entire run of our algorithm (for k steps). Furthermore, those existing methods which come with a theoretical guarantee, i.e. (Toh & Yun, 2009; Liu et al., 2009; Ji & Ye, 2009; Ma et al., 2009), assume in their analysis that all SVDs used during the algorithm are exact, which is not feasible in practice. By contrast, our analysis is rigorous even if the used eigenvectors are only ε'-approximate.

Another nice property of Hazan's method is that the returned solution is guaranteed to be simultaneously of low rank (k after k steps), and that by incrementally adding the rank-1 matrices v_k v_k^T, the algorithm automatically maintains a matrix factorization of the approximate solution.

Also, Hazan's algorithm is designed to automatically stay within the feasible region S, whereas most of the existing approximate SDP-like methods need a projection step to get back to the feasible region (as e.g. (Lin, 2007; Liu et al., 2009)), which makes both their theoretical analysis and implementation much more complicated.

4.1. The Structure of the Eigenvalue Problem

For the actual computation of the approximate largest eigenvector in ApproxEV(−∇f̂(Z^(k)), C_{f̂}/k^2), either the Lanczos method or the power method (as in PageRank, see e.g. (Berkhin, 2005)) can be used. Both methods are known to scale well to very large problems and can be parallelized easily, as each iteration consists of just one matrix-vector multiplication. However, we have to be careful that we obtain the eigenvector for the largest eigenvalue, which is not necessarily the principal one (largest in absolute value). In that case the spectrum can be shifted by adding an appropriate constant to the diagonal of the matrix. (Hazan, 2008) made use of the fact that the Lanczos method, which is theoretically better understood, provably obtains the required approximation quality in a bounded number of steps if the matrix is PSD (Arora et al., 2005).

For an arbitrary loss function f, the gradient −∇f̂(Z), which is the matrix whose largest eigenvector we have to compute in the algorithm, is always a symmetric matrix of the block form

    ∇f̂(Z) = [ 0  G ; G^T  0 ]   for G = ∇f(Z_2),   when Z = [ Z_1  Z_2 ; Z_2^T  Z_3 ].

In other words, ∇f̂(Z) is the adjacency matrix of a weighted bipartite graph. One vertex class corresponds to the n rows of the original matrix Z_2 (users in recommender systems), the other class corresponds to the m columns (items or movies). The spectrum of ∇f̂ is always symmetric: whenever (v; w) is an eigenvector for some eigenvalue λ, then (v; −w) is an eigenvector for −λ. Hence, we have exactly the same setting as in the established Hubs and Authorities (HITS) model (Kleinberg, 1999): the first part of any eigenvector is always an eigenvector of GG^T, and the second part is an eigenvector of G^T G.

Repeated squaring. In the special case that the matrix X is very rectangular (n ≫ m or n ≪ m), one of the two matrices G^T G or GG^T is very small. Then it is known that one can obtain an exponential speed-up in the power method by repeatedly squaring the smaller one of the matrices. In other words, we can perform O(log(1/ε)) many matrix-matrix multiplications instead of O(1/ε) matrix-vector multiplications.
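One way to exploit this block structure (a sketch under our own naming, not the paper's implementation) is to run the power method on GG^T and recover the second half of the eigenvector as G^T u; the block matrix is never formed, each iteration costs two sparse matrix-vector products with G, and the eigenvalue obtained this way is +σ_1 ≥ 0, i.e. the largest algebraic one, so no diagonal shift is needed here.

    import numpy as np
    from scipy.sparse import random as sparse_random

    def top_eigvec_bipartite(G, num_iters=100):
        # Approximate eigenvector for the largest eigenvalue of [[0, G], [G^T, 0]],
        # computed without forming the block matrix; each iteration costs two
        # sparse matrix-vector products with G, i.e. O(nnz(G)) operations.
        n, m = G.shape
        u = np.random.randn(n)
        u /= np.linalg.norm(u)
        for _ in range(num_iters):
            u = G @ (G.T @ u)        # power iteration on G G^T
            u /= np.linalg.norm(u)
        w = G.T @ u                  # second half of the eigenvector
        w /= np.linalg.norm(w)
        return np.concatenate([u, w]) / np.sqrt(2)   # unit-length eigenvector

    # Toy usage with a sparse "gradient" G of the same shape as the data matrix.
    G = sparse_random(1000, 800, density=0.01, format='csr')
    v = top_eigvec_bipartite(G)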
4.2. Application to Matrix Completion and Low Norm Matrix Factorizations

For matrix factorization problems as they arise for example in recommender systems (Koren et al., 2009), our algorithm is particularly suitable, as it retains the sparsity of the observations and constructs the solution in a factorized way. In the setting of a partially observed matrix such as in the Netflix case, the loss function f(X) only depends on the observed positions, which are very sparse, so ∇f(X), which is all we need for our algorithm, is also sparse.

We again suppose that we want to approximate a partially given matrix Y (let P be the set of known entries of the matrix) by a product X = UV^T such that some convex loss function f(X) is minimized. By T we denote the unknown test entries of the matrix we want to predict. Our algorithm applies to any convex loss function on a low norm matrix factorization problem, and we will only mention two cases in particular.

Our algorithm directly applies to Maximum Margin Matrix Factorization (MMMF) (Srebro et al., 2004), whose original (soft margin) formulation is the trade-off formulation (1) with f(X) := Σ_{ij ∈ P} |X_ij − y_ij| being the hinge or l_1-loss. Because this is not differentiable, the authors recommend using the differentiable smoothed hinge loss instead.

When using the standard squared loss function f(X) := Σ_{ij ∈ P} (X_ij − y_ij)^2, the problem is known as Regularized Matrix Factorization (Wu, 2007), and our algorithm directly applies. This loss function is widely used in practice, has a nice gradient structure, and is just the natural matrix generalization of the l_2-loss (notice the analogous Lasso and regularized least squares formulations). The same function is known as the rooted mean squared error, which was the quality measure used in the Netflix competition. We write RMSE_train and RMSE_test for the rooted error on the training ratings P and test ratings T, respectively.

Running time and memory. From Theorem 3 we obtain that the running time of our Algorithm 2 is linear in the size of the input: each matrix-vector multiplication in Lanczos or the power method costs exactly |P| (the number of observed positions of the matrix) operations, and we know that in total we need at most O(1/ε^2) many such matrix-vector multiplications. The same holds for the memory requirement: there is no need to store the entire factorization of X^(k) (meaning all the vectors v_k); instead we only update and store the prediction values X^(k)_ij for ij ∈ P ∪ T in each step. This, together with the known ratings y_ij, determines the sparse gradient matrix ∇f(X^(k)) during the algorithm. Therefore, the total memory requirement is only |P ∪ T| (the size of the output) plus the size n + m of a single feature vector v_k.

The constant C_f in the running time of Hazan's algorithm.

Lemma 4. For the squared error f(X) = (1/2) Σ_{ij ∈ P} (X_ij − y_ij)^2, it holds that C_{f̂} ≤ 1.

Proof. It is known that the constant C_{f̂} is upper bounded by the largest eigenvalue of the Hessian ∇^2 f̂(Z) (here we consider f̂ as a function on vectors). One can directly compute that the diagonal entries of ∇^2 f̂(Z) are 1 at the entries corresponding to P, and zero everywhere else, hence C_{f̂} ≤ 1.
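In code, the two ingredients just described, the sparse gradient and the O(|P|) update of the stored predictions after each new feature vector, might look as follows (a sketch with our own variable names; the trace rescaling by t is folded into the stored predictions for brevity).

    import numpy as np
    from scipy.sparse import csr_matrix

    def sparse_gradient(rows, cols, pred, y, n, m):
        # Gradient of f(X) = 1/2 * sum_{ij in P} (X_ij - y_ij)^2.  Only the
        # predictions at the observed positions P are needed, so the gradient
        # has exactly the same sparsity pattern as the ratings.
        return csr_matrix((pred - y, (rows, cols)), shape=(n, m))

    def rank_one_update(pred, rows, cols, v_rows, v_cols, alpha):
        # Z^(k+1) = (1 - alpha) Z^(k) + alpha * v v^T translates into an O(|P|)
        # update of the stored predictions at the observed positions; the dense
        # matrix X (or Z) is never formed.
        return (1.0 - alpha) * pred + alpha * v_rows[rows] * v_cols[cols]

    # Hypothetical observed entries (rows, cols, y) and a new feature (v_rows; v_cols).
    n, m, p = 1000, 800, 5000
    rows = np.random.randint(0, n, size=p)
    cols = np.random.randint(0, m, size=p)
    y = np.random.uniform(1, 5, size=p)
    pred = np.zeros(p)
    grad = sparse_gradient(rows, cols, pred, y, n, m)
    v_rows, v_cols = np.random.randn(n), np.random.randn(m)
    pred = rank_one_update(pred, rows, cols, v_rows, v_cols, alpha=0.5)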
4.3. Two Improved Variants of Algorithm 1

The optimum on the line segment. Instead of fixing the step width to α_k := 1/k in Algorithm 1, the α_k ∈ [0, 1] of best improvement in the objective function f can be found by line search. (Hazan, 2008) has already proposed binary search to find better values for α_k. In many cases, however, we can even compute it analytically in a straightforward manner: consider f_α := f(Z^(k+1)(α)) = f(Z^(k) + α(v_k v_k^T − Z^(k))) and compute

    0 = ∂/∂α f_α = ⟨ ∇f(Z^(k+1)(α)), v_k v_k^T − Z^(k) ⟩.        (7)

If this equation can be solved for α, then the optimal such α_k can directly be used as the step size, and the convergence guarantee of Theorem 3 still holds. For the squared error f(X) = (1/2) Σ_{ij ∈ P} (X_ij − y_ij)^2, when we write v̄ for the approximate eigenvector v_k in step k, the optimality condition (7) is equivalent to

    α_k = Σ_{ij ∈ P} (X_ij − y_ij)(X_ij − v̄_i v̄_j)  /  Σ_{ij ∈ P} (X_ij − v̄_i v̄_j)^2.        (8)

Immediate feedback in the power method. As a second small improvement, we propose a heuristic to speed up the eigenvector computation in ApproxEV(−∇f(Z^(k)), ε'): instead of multiplying the current candidate vector v_k with the matrix ∇f(Z^(k)) in each power iteration, we multiply with (1/2)(∇f(Z^(k)) + ∇f(Z^(k) + (1/k) v_k v_k^T)), i.e. the average of the old and the new gradient. This means we immediately take into account the effect of the new feature vector v_k. This heuristic (which unfortunately does not fall into our current theoretical guarantee) is inspired by stochastic gradient descent as in Simon Funk's method, which we describe in the following.

4.4. Relation to Simon Funk's SVD Method

Interestingly, our proposed framework can also be seen as a theoretically justified variant of Simon Funk's (Webb, 2006) and related approximate SVD methods, which were used as a building block by most of the teams participating in the Netflix competition (including the winner team). Those methods have been further investigated by (Paterek, 2007; Takács et al., 2009) and also (Kurucz et al., 2007), which already proposed a heuristic using the HITS formulation. These approaches are algorithmically extremely similar to our method, although they are aimed at a slightly different optimization problem, and do not directly guarantee bounded nuclear norm. Very recently, (Salakhutdinov & Srebro, 2010) observed that Funk's algorithm can be seen as stochastic gradient descent to optimize (1) when the regularization term is replaced by a weighted variant of the nuclear norm.

Simon Funk's method considers the standard squared loss function f(X) = (1/2) Σ_{ij ∈ S} (X_ij − y_ij)^2, and finds the new rank-1 estimate (or feature) v by iterating

    v := v + λ(−∇f̂(Z) v − K v),   or equivalently   v := λ( −∇f̂(Z) + (1/λ − K) I ) v,        (9)

a fixed number of times. Here λ is a small fixed constant called the learning rate. Additionally, a decay rate K > 0 is used for regularization, i.e. to penalize the magnitude of the resulting feature v. Clearly this matrix multiplication formulation (9) is equivalent to a step of the power method applied within our framework [2], and for small enough learning rates λ the resulting feature vector will converge to the largest eigenvector of −∇f̂(Z).

[2] Another difference of our method to Simon Funk's lies in the stochastic gradient descent type of the latter, i.e. "immediate feedback": during each matrix multiplication, it already takes the modified current feature v into account when calculating the loss f̂(Z), whereas our Algorithm 1 alters Z only after the eigenvector computation is finished.

However, in Funk's method the magnitude of each new feature strongly depends on the starting vector v_0, the number of iterations, the learning rate λ, as well as the decay K, making the convergence very sensitive to these parameters. This might be one of the reasons that so far no results on the convergence speed could be obtained. Our method is free of these parameters: the k-th new feature vector is always a unit vector scaled by 1/√k. Also, we keep the Frobenius norm ||U||_Fro^2 + ||V||_Fro^2 of the obtained factorization exactly fixed during the algorithm, whereas in Funk's method (which has a different optimization objective) this norm strictly increases with every newly added feature.

Our described framework therefore gives a solid theoretical foundation for a modified variant of the experimentally successful method (Webb, 2006) and its related variants such as (Kurucz et al., 2007; Paterek, 2007; Takács et al., 2009), with proved approximation quality and running time.
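For completeness, the closed-form step size (8) is a one-line computation on the stored predictions; the helper below (our naming, with the value clipped to [0, 1] since α_k is a convex-combination weight) could replace the fixed 1/k step in the sketches above.

    import numpy as np

    def line_search_alpha(pred, y, vivj):
        # Optimal step size of equation (8) for the squared loss: `pred` holds the
        # current predictions X_ij at the observed positions ij in P, and `vivj`
        # holds the corresponding entries v_i * v_j of the new rank-one candidate.
        num = np.sum((pred - y) * (pred - vivj))
        den = np.sum((pred - vivj) ** 2)
        alpha = num / den if den > 0 else 0.0
        return float(np.clip(alpha, 0.0, 1.0))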
5. Experimental Results

We run our algorithm on the following standard datasets [3] for matrix completion problems, using the squared error function.

    dataset          #ratings   n        m
    MovieLens 100k   10^5       943      1682
    MovieLens 1M     10^6       6040     3706
    MovieLens 10M    10^7       69878    10677
    Netflix          10^8       480189   17770

Any eigenvector method can be used as a black-box in our algorithm. To keep the experiments simple, we used the power method [4], and performed 0.2·k power iterations in step k. If not stated otherwise, the only optimization we used is the improvement by averaging the old and new gradient as explained in Section 4.3. All results were obtained by our (single-thread) implementation in Java 6 on a 2.4 GHz Intel C2D laptop.

[3] See ... and .../ml.
[4] We used the power method starting with the uniform unit vector. 1/2 of the approximate eigenvalue corresponding to the previously obtained feature v_{k−1} was added to the matrix diagonal to ensure good convergence.

Sensitivity. The generalization performance of our method is relatively stable under different choices of the regularization parameter t in (2); see Figure 1.

[Figure 1. Sensitivity to the choice of the regularization parameter t in (2), on MovieLens 1M.]

Table 1. Running times t_our of our algorithm compared to the reported timings t_TY of (Toh & Yun, 2009). The ratings {1, ..., 5} were used as-is and not normalized to any user and/or movie means. In accordance with (Toh & Yun, 2009), 50% of the ratings were used for training, the others were used as the test set. Here NMAE is the mean absolute error, times 1/(5 − 1), over the total set of ratings. k is the number of iterations of our algorithm, #mm is the total number of sparse matrix-vector multiplications performed, and tr is the used trace parameter t in (2). They used Matlab/PROPACK on an Intel Xeon 3.20 GHz processor.

            NMAE    t_TY    t_our    k     #mm    tr
    100k    0.205   7.39    0.156    ********
    1M      0.176   24.5    1.376    35    147    36060
    10M     0.164   202     36.10    65    468    281942

In all the following experiments we have pre-normalized all training ratings to the simple average (μ_i + μ_j)/2 of the user and movie mean values, for the sake of being consistent with comparable literature. For MovieLens 10M, we used partition r_b provided with the dataset (10 test ratings per user). The regularization parameter t was set to 48333. We obtained a RMSE_test of 0.8617 after k = 400 steps, in a total running time of 52 minutes (16291 matrix multiplications). Our best RMSE_test value was 0.8573, compared to 0.8543 obtained by (Lawrence & Urtasun, 2009) using their non-linear improvement of MMMF.

Algorithm variants. Comparing the proposed algorithm variants from Section 4.3, Figure 2 demonstrates moderate improvements compared to our original Algorithm 2.

[Figure 2. Improvements for the two algorithm variants described in Section 4.3, when running on MovieLens 10M (partition r_b): training and test error over k (0 to 400) for the step sizes 1/k, best on the line segment, and gradient interpolation.]

Netflix. Table 2 shows an about 13-fold speed increase of our method over the "Hard Impute" singular value thresholding algorithm of (Mazumder et al., 2009) on the Netflix dataset, where they used Matlab/PROPACK on an Intel Xeon 3 GHz processor.

Table 2. Running times t_our (in hours) of our algorithm on the Netflix dataset compared to the reported timings t_M of (Mazumder et al., 2009).

    RMSE_test   t_M    t_our   k     #mm    tr
    0.986       3.3    0.144   20    50     99592
    0.977       5.8    0.306   30    109    "
    0.965       6.6    0.504   40    185    "
    0.9478      n.a.   13.6    200   4165   "

Note that the primary goal of this experimental section is not to compete with the prediction quality of the best engineered recommender systems (which are usually ensemble methods). We just demonstrate that our method solves nuclear norm regularized problems of the form (2) on large sample datasets, obtaining strong performance improvements.

6. Conclusion
We have introduced a new method to solve arbitrary convex problems with a nuclear norm regularization, which is simple to implement and parallelize and scales very well. The method is parameter-free and comes with a convergence guarantee. This is, to our knowledge, the first guaranteed convergence speed result for the class of Simon-Funk-type algorithms.

Further interesting questions include whether a similar algorithm could be used if a strict low-rank constraint as in (4), (5) is simultaneously applied. This corresponds to fixing the sparsity of a solution in the coreset setting. Also, it remains to investigate if our algorithm can be applied to other matrix factorization problems such as (potentially only partially observed) kernel matrices as e.g. in PSVM (Chang et al., 2007), PCA or [p]LSA, because our method could exploit the even simpler form of ∇f for symmetric matrices.