An Improved Non-negative Matrix Factorization Method for Masquerade Detection
A Literature Review of Non-negative Matrix Factorization (NMF) Methods for Face Recognition

Abstract: Human perception of a whole is built from the perception of its parts, and this observation is the origin of the idea behind NMF (Non-negative Matrix Factorization).
By imposing non-negativity constraints on the factor matrices, the factorization of a high-dimensional non-negative data matrix contains no negative values and exhibits a degree of sparsity, yielding a relatively low-dimensional, purely additive, and somewhat sparse representation.
Compared with traditional face recognition methods such as PCA (Principal Component Analysis), the basis images learned by NMF correspond to local facial features, and, through a series of refinements of the classical algorithm, improved NMF algorithms achieve markedly better recognition rates and robustness than traditional methods.
In addition, NMF has important applications in machine learning, semantic understanding, and other fields.
Keywords: non-negative matrix factorization (NMF); sparsity; improved NMF; semantic understanding

1. Introduction

Much real-world data is non-negative, while practical data processing calls for low-rank representations. Classical data processing methods generally cannot guarantee non-negativity, so finding a non-negative low-rank matrix that approximates the original data matrix becomes a key problem.
Against this background, the NMF method emerged.
The idea of NMF can be traced back to the Positive Matrix Factorization (PMF) proposed by Paatero and Tapper in 1994 [1]. In 1999, Lee and Seung proposed a basic NMF model that takes the generalized KL divergence as its optimization objective and applied it to face image representation [2]. In 2001, after further study of the basic algorithm, Lee and Seung proposed two classical NMF algorithms, a multiplicative iterative algorithm based on the Euclidean distance and one based on the generalized KL divergence, together with convergence proofs [3]. These two algorithms serve as the baseline NMF algorithms and are widely applied in many fields.
In practice, however, the classical baseline NMF algorithms converge slowly, do not exploit statistical features, are sensitive to illumination and occlusion, and cannot perform incremental learning, so various improved NMF algorithms have been proposed.
These include Lin's NMF method based on the Projected Gradient (PG) [4], which achieves high factorization accuracy; Berry's NMF method based on Projected Non-negative Least Squares (PNLS) [5], whose basis matrices are sparser and closer to orthogonal than those of the baseline NMF; and also Newton-type methods [6] and active-set-based NMF methods [7].
NMF Algorithm Cost Functions

NMF (Non-negative Matrix Factorization) is a commonly used matrix factorization method with wide applications in many fields.
The cost function is the key measure of how good the approximation produced by an NMF algorithm is.
This article describes the NMF algorithm and its cost functions in detail, discussing the principles behind each and the scenarios in which they apply.
First, let us look at the NMF algorithm itself.
The basic idea of NMF is to factorize a non-negative matrix into the product of two non-negative matrices; that is, the original matrix X is decomposed into the product of non-negative matrices W and H:

X ≈ WH

where every element of W and H is non-negative.
W is a basis representation of the original matrix X, and H is the coefficient matrix, giving the weight of each basis vector in each sample.
NMF can be used for feature extraction, dimensionality reduction, text mining, and other tasks.
For example, in text mining, X can be a document-term frequency matrix; each document is then represented through columns of W, and the weight of each term within the documents through rows of H.
Through NMF we obtain these basis and coefficient representations, which can be used for text classification, topic extraction, and similar applications; a small example follows.
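A minimal sketch of the topic-extraction use just described, using scikit-learn's NMF implementation. The toy document-term matrix, the choice of two components, and the initialization settings are all illustrative assumptions; note that with rows as documents, the per-document topic weights appear in W and the topics over terms in the rows of H.

```python
# Illustrative only: factorize a tiny document-term count matrix X into
# non-negative topic bases and per-document weights.
import numpy as np
from sklearn.decomposition import NMF

# Rows = documents, columns = term counts (toy data).
X = np.array([
    [2, 1, 0, 0],
    [3, 2, 0, 1],
    [0, 0, 4, 2],
    [0, 1, 3, 3],
], dtype=float)

model = NMF(n_components=2, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(X)   # (4, 2): weight of each topic in each document
H = model.components_        # (2, 4): each row is a non-negative "topic" over terms

print("reconstruction error:", np.linalg.norm(X - W @ H, "fro"))
```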
Next, we turn to the cost functions used in NMF.
A cost function measures how close the approximation obtained by NMF is to the original matrix.
The two most common cost functions are the Euclidean distance and the KL divergence.
The Euclidean cost function is defined as

cost = ||X − WH||_F^2

where ||·||_F denotes the Frobenius norm of a matrix, X is the original matrix, and W and H are the factors obtained by NMF.
The KL-divergence cost function is defined as

cost = D(X‖WH) = Σ_ij [ X_ij log( X_ij / (WH)_ij ) − X_ij + (WH)_ij ]

where D(X‖WH) denotes the (generalized) KL divergence between X and WH.
The KL divergence measures the discrepancy between two probability distributions; in NMF it measures how close WH is to X.
The Euclidean and KL-divergence cost functions suit different application scenarios in NMF.
The Euclidean cost function is appropriate for tasks that require an accurate reconstruction of the original matrix, such as image reconstruction.
The KL-divergence cost function is appropriate for tasks framed in terms of differences between probability distributions, such as topic extraction; both costs are computed in the sketch below.
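A minimal NumPy sketch of the two cost functions above; the function names and the small epsilon guard against zero entries are assumptions, not part of the standard definitions.

```python
import numpy as np

def frobenius_cost(X, W, H):
    """Squared Frobenius reconstruction error ||X - WH||_F^2."""
    R = X - W @ H
    return float(np.sum(R * R))

def kl_cost(X, W, H, eps=1e-12):
    """Generalized KL divergence D(X || WH) = sum( X*log(X/WH) - X + WH )."""
    V = W @ H
    # eps guards the log and the division where entries of X or WH are zero.
    return float(np.sum(X * np.log((X + eps) / (V + eps)) - X + V))
```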
In practice, the NMF optimization is carried out iteratively.
Specifically, NMF can be solved by Alternating Least Squares (ALS) or by gradient descent; a projected-ALS sketch follows.
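A minimal sketch of the simplest projected-ALS variant: solve each least-squares subproblem exactly, then clip negative entries to zero. The function name, random initialization, and fixed iteration count are assumptions; as discussed later in this document, this "solve then project" scheme is cheap but carries no general convergence guarantee.

```python
import numpy as np

def nmf_projected_als(X, r, n_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, r))
    H = rng.random((r, n))
    for _ in range(n_iter):
        # Fix W, solve min_H ||X - W H||_F, then project onto H >= 0.
        H = np.maximum(np.linalg.lstsq(W, X, rcond=None)[0], 0.0)
        # Fix H, solve min_W ||X - W H||_F via the transposed system, then project onto W >= 0.
        W = np.maximum(np.linalg.lstsq(H.T, X.T, rcond=None)[0].T, 0.0)
    return W, H
```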
Nonnegative Matrix Factorization and Its Applications
Author: Gao Yanyan. Source: Silicon Valley (《硅谷》), 2011, No. 23.

Abstract: This article introduces the basic algorithmic idea of non-negative matrix factorization (NMF) and its implementation, summarizes its current applications in several important fields, and finally discusses remaining problems with NMF in image processing and the directions in which it is being improved.
Keywords: non-negative matrix factorization; feature extraction; matrix decomposition

As modern computers process information on an ever larger scale, the matrix, as the most common form of data representation, has found wide application.
In practical problems, however, data matrices are often very large, so processing them directly is inefficient and of limited value; in practice the original matrix must first be decomposed.
Matrix decomposition factorizes the original matrix appropriately so that further processing becomes simpler.
NMF is the algorithm first proposed by D. D. Lee and H. S. Seung in Nature in 1999 [1]; it factorizes a matrix into non-negative factors under the condition that all elements of the matrix are non-negative.
Because NMF is simple to implement, its factorization form and results are interpretable, and it is economical in storage, it has been widely adopted in practical applications.
1 The Non-negative Matrix Factorization Algorithm

1) Problem description. The traditional NMF problem can be stated as follows: given a non-negative matrix V whose m columns are n-dimensional data vectors (each column is one sample, and m is the number of samples in the set), find non-negative matrices W and H such that

V ≈ WH,  V ∈ R_+^{n×m},  W ∈ R_+^{n×r},  H ∈ R_+^{r×m}.    (1-1)

W is the basis matrix and H is the encoding (coefficient) matrix.
The value of r is generally chosen so that (n + m)r < nm, which keeps the rank of W and H far below that of V.
In this way a dimensionality reduction of the original matrix V is achieved.
The NMF algorithm uses "multiplicative" iteration rules that keep the matrix elements non-negative after every iteration, which guarantees the feasibility of the non-negative factorization [2].
Because this algorithm is easy to implement, it has been very widely applied.
2) Implementation of the algorithm. NMF can be viewed as a constrained nonlinear programming problem; it is cast as an optimization problem and W and H are obtained by iteration.
To derive the factorization, Lee and Seung introduced two classes of objective functions: ① the Euclidean distance between matrices A and B, ||A − B||^2 = Σ_ij (A_ij − B_ij)^2; and ② the Kullback-Leibler divergence, D(A‖B) = Σ_ij [ A_ij log( A_ij / B_ij ) − A_ij + B_ij ]. Setting A = V and B = WH gives the two objective functions used in NMF, which lead to the constrained optimization problems

min_{W,H ≥ 0} ||V − WH||^2   and   min_{W,H ≥ 0} D(V‖WH).

Applying suitable iteration rules to these two constrained problems yields, at convergence, stable matrices W and H; the corresponding multiplicative updates are sketched below.
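A minimal NumPy sketch of the two classical multiplicative update rules associated with these objectives (Euclidean distance and generalized KL divergence). The per-step function form and the small epsilon added to the denominators are assumptions made for numerical safety; they are not part of the original rules.

```python
import numpy as np

EPS = 1e-12  # added safeguard against division by zero; not in the original rules

def mur_euclidean_step(V, W, H):
    """One multiplicative update decreasing ||V - WH||_F^2 (V, W, H are float arrays)."""
    H *= (W.T @ V) / (W.T @ W @ H + EPS)
    W *= (V @ H.T) / (W @ H @ H.T + EPS)
    return W, H

def mur_kl_step(V, W, H):
    """One multiplicative update decreasing the generalized KL divergence D(V || WH)."""
    WH = W @ H + EPS
    H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + EPS)
    WH = W @ H + EPS
    W *= ((V / WH) @ H.T) / (H.sum(axis=1)[None, :] + EPS)
    return W, H
```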
NeNMF:An Optimal Gradient Method for Nonnegative Matrix Factorization Naiyang Guan,Dacheng Tao,Senior Member,IEEE,Zhigang Luo,and Bo YuanAbstract—Nonnegative matrix factorization(NMF)is a pow-erful matrix decomposition technique that approximates a nonnegative matrix by the product of two low-rank nonnegative matrix factors.It has been widely applied to signal processing, computer vision,and data mining.Traditional NMF solvers include the multiplicative update rule(MUR),the projected gradient method(PG),the projected nonnegative least squares (PNLS),and the active set method(AS).However,they suffer from one or some of the following three problems:slow convergence rate,numerical instability and nonconvergence.In this paper,we present a new efficient NeNMF solver to simultaneously overcome the aforementioned problems.It applies Nesterov’s optimal gra-dient method to alternatively optimize one factor with another fixed.In particular,at each iteration round,the matrix factor is updated by using the PG method performed on a smartly chosen search point,where the step size is determined by the Lipschitz constant.Since NeNMF does not use the time consuming line search and converges optimally at rate in optimizing each matrix factor,it is superior to MUR and PG in terms of efficiency as well as approximation pared to PNLS and AS that suffer from numerical instability problem in the worst case, NeNMF overcomes this deficiency.In addition,NeNMF can be used to solve-norm,-norm and manifold regularized NMF with the optimal convergence rate.Numerical experiments on both synthetic and real-world datasets show the efficiency of NeNMF for NMF and its variants comparing to representative NMF solvers.Extensive experiments on document clustering suggest the effectiveness of NeNMF.Index Terms—-norm,-norm,manifold regularization, nonnegative matrix factorization(NMF),optimal gradient method.I.I NTRODUCTIONN ONNEGATIVE MATRIX FACTORIZATION(NMF) learns an approximate decomposition of a nonnegative data matrix into two low-rank nonnegative matrix factors.Manuscript received June07,2011;revised September25,2011and De-cember25,2011;accepted February22,2012.Date of publication March 08,2012;date of current version May11,2012.The associate editor coor-dinating the review of this manuscript and approving it for publication was Dr.Konstantinos Slavakis.This work was supported in part by the Australian ARC discovery project(ARC DP-120103730),the National Natural Science Foundation of China(by Grant No.91024030/G03),and by the Program for Changjiang Scholars and Innovative Research Team in University(No. IRT1012).N.Guan and Z.Luo are with School of Computer Science,National Univer-sity of Defense Technology,Changsha,410073China(e-mail:ny_guan@nudt. ;zgluo@).D.Tao is with Center for Quantum Computation and Intelligent Systems and the Faculty of Engineering and Information Technology,University of Tech-nology,Sydney,NSW2007,Australia(e-mail:dacheng.tao@.au).B.Yuan is with Department of Computer Science and Engineering,Shanghai Jiao Tong University,Shanghai,200240China(e-mail:boyuan@). Color versions of one or more of thefigures in this paper are available online at .Digital Object Identifier10.1109/TSP.2012.2190406Since the nonnegativity constraints on the two matrix factors usually yield sparse representation of the data,NMF has been widely applied to signal processing[6],computer vision[13] and data mining[31],[26].Recently,several solvers have been proposed for NMF[2],[4],[12],[15],[18],[20],[33]. 
However,they suffer from some or at least one of the following three problems:slow convergence,numerical instability and theoretical nonconvergence problems.To solve the nonconvex NMF problem,Berry et al.[2]proposed an alternating non-negative least squares(ANLS)framework that treats NMF as two nonnegative least squares(NLS)subproblems and alternatively updates matrix factors based on the solution of the corresponding NLS subproblem until convergence.Based on the matrix updating strategy,representative NMF solvers can be classified into the following two groups:1)Inexact block coordinate descent(IBCD):IBCD updatesmatrix factors with approximate solution of the corre-sponding NLS subproblem.According to the method for approximately solving each NLS subproblem,we categorized the IBCD methods into the following three subgroups:Multiplicative update rule(MUR):MUR[18]searches along a rescaled gradient direction with a carefully se-lected learning rate to guarantee the nonnegativity of the resulted matrix factors.Although MUR is simple and easy to implement,it converges slowly.That is because the used rescaled gradient descent is afirst-order method.In addition,MUR does not guarantee the convergence to any local minimum because its solution is unnecessarily a stationary point[21].To overcome this problem,Lin[21] proposed a modified MUR that converges to a stationary point.However,the modified solver cannot improve the convergence rate.Projected NLS method(PNLS):Since MUR updates ma-trix factors by multiplying each entry with a positive factor in each iteration round,all entries in the obtained matrix factors will not shrink to zero.To overcome this problem, Berry et al.[2]presented a projected NLS method(PNLS) which solves each NLS subproblem by directly projecting the least squares solution to the nonnegative quadratic.Note that the PNLS method wasfirst proposed to solve the positive matrix factorization problem[25].Although PNLS performs well in most applications,it does not guar-antee the convergence.Cyclic block coordinate gradient projection(CBGP):To overcome the theoretical nonconvergence deficiency of PNLS,Bonettini[4]proposed a CBGP method to solve NMF based on the inexact block coordinate descent1053-587X/$31.00©2012IEEEframework.Although CBGP unnecessarily obtains the optimum for both NLS subproblems in each iteration round,it obtains the global convergence because of the Armijo-type line search.We appreciate the global conver-gence property of CBGP,but it is inefficient,because the Armijo-type line search is time-consuming.2)Exact block coordinate descent(EBCD):EBCD updatesmatrix factors with the optimum for both NLS subprob-lems in each iteration round.According to the methods for optimizing each NLS subproblem,we categorized the EBCD methods into the following two subgroups:Projected gradient method(PG):Lin[20]deemed each NLS subproblem as a bound constrained optimization problem and applied projected gradient(PG)method to it.In each iteration round,PG uses the Armijo rule to estimate the optimal step size along the projection arc for solving each NLS subproblem.However,PG is inefficient for solving NMF,because the line search procedure is time-consuming.Zdunek and Cichocki[33]applied the projected quasi-Newton method(QN)to each NLS sub-problem.QN estimates the step size by using the inverse of Hessian to estimate the optimal step size and thus it converges much faster than PG.However it is time-con-suming for calculating the inverse of Hessian and QN will fail when the Hessian is 
rank deficient.To overcome this problem,Kim et al.[17]applied BFGS to the NLS subproblems.BFGS approximates the Hessian based on rank-one updates specified by gradient evaluations.Although BFGS converges,it requires a time-consuming line search and a complex Hessian updating procedure,so it is inefficient.Recently,Han et al.[12]presented pro-jected Barzilai-Borwein method(PBB).PBB uses Barzilai and Borwein’s strategy to determine the step size and an Armijo-type line search to ensure the global conver-gence.PBB is also inefficient due to the time-consuming Armijo-type line search.Active set method(AS):In contrast to the aforementioned gradient-based solver,Kim and Park[16]solved the NLS subproblem by using the active set(AS)method.In par-ticular,AS divides variables into free set and active set, wherein free set contains variables that are not optimal.In each iteration round,AS exchanges one variable between this two sets and solves an unconstrained equation to up-date the variables in free set.Kim and Park[15]further de-veloped the AS method by exchanging multiple variables between free set and active set,and proposed block prin-cipal pivoting(BPP)method that greatly accelerates the convergence of AS.However,both AS and BPP assume each NLS subproblem is strictly convex to make the un-constrained equation solvable.Such assumption may make solvers face the numerical instability problem especially for large scale problems.In this paper,we propose a new efficient solver NeNMF that simultaneously overcomes the aforementioned problems for NMF.By proving that the objective function of each NLS subproblem is a convex function whose gradient is Lipchitz continuous,we naturally apply Nesterov’s optimal gradient method[23]to optimize NLS.In convex optimization,Nes-terov’s optimal gradient method has been recently applied to several optimization problems appearing in many problems, e.g.,compressed sensing[1],trace norm minimization[14] and clustering[34].In this paper,we utilize this method to optimally minimize one matrix factor with anotherfixed.In particular,we update two sequences recursively for optimizing each factor,one sequence stores the approximate solutions and another sequence stores the search points.In each iteration round,the approximate solution is obtained by using the PG method on the search point and the step size is determined by the Lipchitz constant,and the search point is constructed by linearly combining the latest two approximate solutions. This scheme greatly accelerates the optimization based on the problem structure.Therefore,this method achieves optimal convergence rate1of.By alternatively performing the optimal gradient the two matrix factors,NeNMF converges much faster than representative NMF solvers.It has neither time-consuming line search procedure nor numerical instability problem comparing against existing solvers. Preliminary experiments on both synthetic and real-world datasets suggest the efficiency of NeNMF.Document clus-tering experiments on popular real-world datasets confirm the effectiveness of NeNMF.In addition,we show that NeNMF can be naturally extended to handle several versions of NMF that incorporates-norm,-norm and manifold regulariza-tions.Extensive experiments on synthetic datasets confirm the efficiencies of these extensions.II.N ONNEGATIVE M ATRIX F ACTORIZATION V IA O PTIMALG RADIENT M ETHODGiven samples arranged in a nonnegative matrix,NMF approximates by two low-rank nonnegative matrices and ,i.e.,(1) where is the matrix Frobenius norm and. 
Since(1)is a nonconvex minimization problem,it is imprac-tical to obtain the optimal solution.Fortunately the block co-ordinate descent method[3]can obtain a local solution of(1). Given an initial,the block coordinate descent method alternatively solves(2) and(3) until convergence,where is the iteration counter.Most existing NMF solvers are special implementations under this scheme. Different solvers use different optimization methods to mini-mize(2)and(3).Since problems(2)and(3)are symmetric,we only need to show how to efficiently solve(2)and(3)can be 1In this paper,convergence rate implies the speed of optimizing one matrix factor with anotherfixed.solved accordingly.According to Lemma 1and Lemma 2,we will show that (2)can be ef ficiently solved by Nesterov’s op-timal gradient method (OGM).Lemma 1:The objective function is convex.Lemma 2:The gradient of the objective functionis Lipschitz continuous and the Lipschitz constant is.We leave the detailed proofs of Lemma 1and Lemma 2in Appendices A and B,respectively.A.Optimal Gradient MethodRecent results [22]–[24]in operation research show that the gradient-based methods for smooth optimization can be accel-erated and achieve the optimal convergence rate .Sinceis convex and the gradient is Lips-chitz continuous,Nesterov’s method can be used to ef ficientlyoptimize (2).In particular,we construct two sequences,i.e.,and ,and alternatively update them in each iteration round.At the iteration ,the two sequences are(4)and(5)whereis the proximal function of on ,is the Lipschitz constant given in Lemma 2,denotes the matrix inner product,contains the approximate solution obtained by minimizing the proximal function over ,and stores the search point that is constructed by lin-early combining the latest two approximate solutions,i.e.,and .According to [22],the combination coef ficient is carefully updated in each iteration round(6)By using the Lagrange multiplier method,the Karush-Kuhn-Tucker (K.K.T.)conditions of (4)are(7)(8)(9)whereis the gradient of with respect to at ,and is theHadamard product.By solving (7)to (9),we have(10)where the operatorprojects all the negative entries of to zero.All variables will not exceed the upper bound by setting it suf ficiently large.By alternatively updating ,and with (10),(6)and (5)until convergence,the optimal solution can be obtained.Similar to PG [20],PBB [12]and AS [16],we use the K.K.T.conditions of (2)as the stopping criterionSince the stopping criterion (14)will keep Algorithm 1un-necessarily running,similar to [20],we instead usewhere is the tolerance for (2).If Algorithm 1solves (2)within a prede fined minimum step number,i.e.,,we adap-tively update the tolerance by .The same strategy is adopted to solve (3)by using Algorithm 1.Algorithm 1:Optimal gradient method (OGM)Input :Output :1:Initialize ,,,repeat2:Update,andwith(15)(16)(17)3:until Stopping criterion (14)is satis fied 4:We summarize the optimal gradient method for optimizing (2)in Algorithm 1that accepts input and and outputs.Both and are obtained from the previous iteration in the block coordinate descent method.By alternatively updating(15),(16),and(17)until(14)is satisfied,we obtain the optimal solution of(2),namely, wherein is the iteration number of Algorithm1.The gra-dient,whereinis the iteration counter.In the following Proposition1,we show that the Algorithm1 achieves the optimal convergence rate.Proposition1:Given sequences and gen-erated by(15)and(17),respectively,we havewhere is an optimal solution of(2).We leave the detailed proof in 
Appendix C.B.NeNMFBy using Algorithm1(OGM)to solve(2)and(3),we arrive at NeNMF summarized in Algorithm2.We give the stopping criterion of this algorithm after the algorithm table.Algorithm2:NeNMFInput:,Output:,1:Initialize,,repeat2:Update and with3:until Stopping criterion(21)is satisfied4:,Similar to Algorithm1,we use the K.K.T.conditions of(1) as a stopping criterion for Algorithm2(18)(19)(20) We reorganize conditions(18)to(20)as(21) To check whether is close to a stationary point,ac-cording to[20],we transform(21)to(22) where is a tolerance.TABLE IT IME C OMPLEXITY:N E NMF V ERSUS THE E XISTING NMF S OLVERS Since the problem(1)is nonconvex,it is impossible to get its global optimum.Existing NMF solvers consider a stationary point as a local optimal solution[4],[12],[15],[20].In nonlinear optimization,the stationarity of the limit point generated by the block coordinate descent method is guaranteed by the assump-tion that the optimal solution of each subproblem is unique[3]. However the optimal solution obtained by Algorithm1is not unique because the problem(2)is not strictly convex.Fortu-nately,Grippo and Sciandrone[9]have shown that the unique-ness is unnecessary if we have only two blocks,i.e.,any limit point of the sequence generated by Algorithm2is a stationary point of(1).Since the feasible sets of(2)and(3)are bounded given sufficiently large bounds,there exists at least one limit point in the sequence generated by Algorithm2according to [27].Therefore,NeNMF converges to a stationary point.The main time cost of NeNMF is spent on the calculation of the gradient in Algorithm1.Note that the second term in is a constant and it can be calculated before iterations of Algorithm1.Since is a constant low-rank matrix,the term can also be calculated previously in time.The complexity of one iteration in NeNMF is.As Algorithm1converges at rate,it stops within a small number of iterations,typically.We summarize the time complexity of one iteration round of NeNMF and compare it to those of the representative NMF solvers in Table I,where the variable in PG[20]and BFGS[17]is the inner iteration number of the line search procedure.Table I shows that the com-plexity of NeNMF is comparable to the existing NMF solvers. However,it converges much faster at each iteration round and therefore it is faster than others.C.Related WorksNMF(1)can be solved by alternatively optimizing(2)and (3).However,it is unnecessary to minimize(2)or(3)in each iteration round.For example,the MUR[18]updates the matrix factor by one step rescaled gradient descent according to(23) Guan et al.[10],[11]proposed to accelerate MUR by searching each matrix factor along its rescaled gradient direction with the optimal step length.Since elements in the denominator in(23)can be zero,MUR may fail.To overcome this deficiency,Lin[21]modified MUR by adding a tiny number to each element in the denominator. PNLS[2]overcomes such deficiency by regarding(2)as an NLS and updates according to(24) where is the pseudo-inverse of and is the pro-jection operator defined in(10).PNLS works well in many problems.It is however difficult to analyze the convergence of PNLS,because may increase the objective function[17]. PG[20]treats(2)as a bound-constrained optimization problem and uses the projected gradient descent to optimize it. 
In the iteration round,PG repeats the following update until convergence:(25)where is the gradient of at and is the optimal step size estimated by the Armijo rule.QN[33]regards(2)as a bound-constrained optimization problem.In contrast with PG,QN however uses quasi-Newton method to optimize(2).The update rule of is(26) where is a relaxation parameter,andis the QR factorization of the regularized Hessian matrix of at,i.e.,,and is the tradeoff parameter,is the solution in the least squares sense to the system equation.QN converges fast as it uses the second-order gradient information embedded in Hessian matrix.However QN suffers from the numerical instability problem because the Hessian matrix can be singular.Although the regularization can reduce this problem,it is difficult to choose a proper.Kim et al.[17]utilized BFGS to update the Hessian matrix based on the rank-one updates specified by gradient evaluations. In the iteration round,BFGS iterates the following update until convergence:(27)where is the inner iteration counter,and is ini-tialized as.The is updated by the solution of(27), i.e.,,wherein is the iteration number of(27). In(27),is an approximation of that is up-dated by BFGS,and the step size vector is calculated by line search.Although BFGS solver converges,both Hessian update and line search are complex and time-consuming.Therefore,they proposed an inexact version of BFGS solver by updating according to(28)where is utilized as the approximation of,and is a manually prefixed step size.The inner iteration of(28)is repeated forfixed times,e.g.,. Then is updated as.Though the update(28) converges fast,it is difficult to select a suitable step size for different datasets.Han et al.[12]proposed the projected Barzilai-Borwein method(PBB).PBB approximates the secant equation by using two neighborhood points to determine the step size in(25).The update rule is(29) where is the projected gradient with the step size calculated by using Barzilai and Borwein’s strategy,whereinand.Since the objective of(2)is not strictly convex,they introduced an Armijo-type nonmonotone line search for choosing,and this strategy ensures the global convergence.Recently,Bonettini[4]proposed a cyclic block coordinate gradient projection(CBGP)method for NMF based on the in-exact block coordinate descent.In the iteration round,CBGP iterates the following updatefixed times:(30) where is pro-jected gradient,and is the step size determined by the Armijo rule.CBGP unnecessarily obtains the optimum for both(2)and (3)in each iteration round,but it guarantees the global conver-gence.This is theoretically significant.However,the Armijo-type line search in PG,PBB,and CBGP are time-consuming. In contrast to the aforementioned gradient-based solvers,Kim and Park[16]generalized AS[16]to solve(2)and(3),which exchanges multiple variables between free set and active set, and proposed block principal pivoting method(BPP,[15]).The variable exchanging rule is(31) where and are free set and active set,respectively.In(31), and are calculated by and,respectively,wherein is the La-grange multiplier for that is a single column of the matrix. 
Both AS and BPP assume(2)and(3)are strictly convex to make the unconstrained equation with respect to variables in free set solvable.Such assumption may end up with numerical insta-bility problem especially for large scale problems.The proposed NeNMF solver converges at the optimal rate in each iteration round without time-consuming lineprocedure and predefined tuning pared to AS and BPP,NeNMF overcomes the numerical instability problem.Therefore,NeNMF is more efficient and effective than the aforementioned NMF solvers.III.N E NMF FOR R EGULARIZED NMFThis section shows that NeNMF can be conveniently adopted for optimizing the-norm,-norm and manifold regularized NMF.A.NeNMF for-Norm Regularized NMFAlthough thefinal matrix factors obtained by NMF are sparse, their sparsities are not explicitly guaranteed.Hoyer[13]pro-posed nonnegative sparse coding(NSC)that incorporates the -norm regularization on the encoding matrix to control its sparsity.The objective function is(32) where is the-norm,and is the tradeoff parameter that balance the sparsity of and the approximation error.Since all entries in are nonnegative,we have.In[13], MUR is used for optimizing(32).Here we show that NeNMF can be naturally adopted to effi-ciently solve(32).The factor can be obtained by Algorithm 1directly and we show the optimization for as following(33) By considering Lemma1and the convexity of,we can show that in(33)is convex.Furthermore,for any two matrices,we havewhere is the gradient of,i.e.,,wherein is the matrix whose entries are all one,and is the matrix spectral norm, i.e.,the largest singular value of.Thus is Lipschitz continuous and the Lipschitz constant. According to[23],(33)can be solved by OGM.We slightly modify NeNMF by replacing the gradient and the Lipschitz constant in Algorithm2withand,respectively,to learn the-norm reg-ularized NMF.We term the modified NeNMF for the-norm regularized NMF as NeNMF-.Note that NeNMF can be used to handle other-norm-based regularizations, e.g., sparse transfer learning(STL)[29].B.NeNMF for-Norm Regularized NMFThe-norm regularization,i.e.,Tikhonov regularization,is usually utilized to control the solution smoothness in NMF[26],i.e.(34) For optimizing(34),a gradient descent with constrained least squares(GD-CLS)solver is developed in[26].GD-CLS updates by using MUR and updates by solving a NLS.It has been shown that BFGS[17]can be extended to solve(34).We denote this solver as BFGS-.We show that(34)can be efficiently solved by slightly mod-ifying NeNMF.Since the optimization of and that of are symmetric,we only show the optimization of.Given,the objective function for optimizing is(35) By considering Lemma1and the convexity of,is convex.Furthermore,for any,we have where is the gradient of and is the identity matrix.Thus is Lipschitz continuous,and the Lipschitz constant is the spectral norm of,i.e.,.Since the problem(35)is convex,it can be solved by OGM.Thus,it is easy to extend NeNMF for optimizing(34)by slightly modi-fying the gradient and the step size,i.e.,.In this paper,we name the modified NeNMF for-norm regularized NMF as NeNMF-.C.NeNMF for Manifold Regularized NMFManifold regularization is utilized to encode the intrinsic geo-metric structure of data.Cai et al.[5]proposed the graph regu-larized NMF(GNMF)(36)where is the graph Laplacian of the data matrix.It is de-fined by,wherein is the similarity matrix in the adjacent graph and an entry in the diagonal matrix .To optimize,MUR is used in GNMF based on the following update rule:(37)In this paper,we use NeNMF to 
solve GNMF.It is clear that can be optimized by using Algorithm1.Here we focus on the optimization of,and the corresponding objective function is(38) The gradient of is.In order to use NeNMF to solve(38),we need the Lips-chitz constant of to determine the step size.Sinceis a linear combination of and,the Lipschitz of can be calculated as a linear combination of the Lipschitz constants of the and,wherein is the gradient of.According to the following Proposition2and Lemma2,wefind that is Lipschitz continuous and the Lipschitz constant is.By replacing and with the step size and gradient in Algorithm1, respectively,we can solve(38)by using NeNMF.In this paper, we denote the extended NeNMF for manifold regularized NMF as NeNMF-.Fig.1.Average objective values versus iteration numbers and CPU seconds of NeNMF,MUR,and PNLS on the Synthetic1(a)and(b)and Synthetic2(c)and(d)datasets.The iteration numbers and CPU seconds are shown in log scale.All solvers start from the same initial point and stop when the same stopping criterion(39)is met,wherein the precision.Proposition2:The gradient of is Lipschitz continuous and the corresponding Lipstchitz constant is.Proof:For any two matrices,we have where is the SVD decomposition of and the singular values are arranged in a descending order.Since ,wherein is the identity matrix,we havewhere is the spectral norm of.Therefore, is Lipschitz continuous and the Lipstchitz constant is.Since NeNMF-does not divide into two parts as in(37), it can be adopted to wider range of criterions such as geometric mean[8]and patch alignment framework[28],[30],[32].Note that NeNMF-decreases the objective function,but it cannot guarantee the optimality of the obtained solution in each iteration round.That is because the objective function is nonconvex.Therefore,similar to MUR for GNMF,NeNMF-cannot guarantee the convergence to any stationary point.How-ever,experimental results in the following section show that NeNMF-is efficient and effective for optimizing(36).IV.E XPERIMENTSIn this section,we evaluate NeNMF by comparing to fol-lowing nine NMF solvers in term of efficiency and approxima-tion accuracy:1)Multiplicative Update Rule(MUR,[18])2)Projected Nonnegative Least Squares(PNLS,2[2])3)Projected Gradient(PG,3[20])4)Projected Quasi-Newton(QN,4[33])5)Broyden Fletcher Goldfarb Shanno(BFGS,5[17])6)Projected Barzilai Borwein(PBB,6[12])7)Cyclic Block Coordinate Gradient Projection(CBGP,[4])8)Active Set(AS,7[16])9)Block Principal Pivoting(BPP,8[15])The source codes for PNLS,PG,BFGS,AS,BPP,and PBB are available online.2345678The source code of QN is given in NMFLAB toolbox.4It is implemented to optimize the Frobenius norm-based nonnegative matrix factorization(1)by using the Quasi-Newton update(26),whereas the parameter starts from and increases until the regularized Hessian matrix is ill-conditioned.We implemented other algorithms in MATLAB.Since the implementation of QN cannot handle large size matrices,we only presented its results on small size matrices.In thefirst part,we compared NeNMF to MUR and 2The code was found at /users/dmkim/Source/soft-ware/nnma/index.html3The code is available at .tw/cjlin/nmf/index.html 4The NMFLAB toolbox is available at http://www.bsp.brain.riken.go.jp/ ICALAB/nmflab.html5The code is available at /users/dmkim/Source/soft-ware/nnma/index.html6The code is available at /neumann/nmf/7The code is available at /hpark/nmfsoftware.php 8The code is available at /hpark/nmfsoftware.php。
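A minimal sketch of the Nesterov-type inner solver (OGM, Algorithm 1 above) for one NLS subproblem, min_{H ≥ 0} 0.5·||V − WH||_F²: a projected gradient step is taken from a search point, with the step size fixed by the Lipschitz constant L = ||WᵀW||₂. The fixed iteration count below replaces the paper's KKT-based stopping rule and, like the function name, is an assumption.

```python
import numpy as np

def ogm_nls(V, W, H0, n_iter=50):
    """Accelerated projected gradient for  min_{H >= 0} 0.5*||V - W H||_F^2."""
    WtW = W.T @ W
    WtV = W.T @ V
    L = np.linalg.norm(WtW, 2)        # Lipschitz constant of the gradient
    H = H0.copy()
    Y = H.copy()                      # search point
    alpha = 1.0
    for _ in range(n_iter):
        grad = WtW @ Y - WtV                        # gradient at the search point
        H_new = np.maximum(Y - grad / L, 0.0)       # projected gradient step
        alpha_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * alpha * alpha))
        Y = H_new + ((alpha - 1.0) / alpha_new) * (H_new - H)  # combine last two iterates
        H, alpha = H_new, alpha_new
    return H
```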
Daniel D.LeeBell Laboratories Lucent Technologies Murray Hill,NJ07974H.Sebastian SeungDept.of Brain and Cog.Sci.Massachusetts Institute of TechnologyCambridge,MA02138 AbstractNon-negative matrix factorization(NMF)has previously been shown tobe a useful decomposition for multivariate data.Two different multi-plicative algorithms for NMF are analyzed.They differ only slightly inthe multiplicative factor used in the update rules.One algorithm can beshown to minimize the conventional least squares error while the otherminimizes the generalized Kullback-Leibler divergence.The monotonicconvergence of both algorithms can be proven using an auxiliary func-tion analogous to that used for proving convergence of the Expectation-Maximization algorithm.The algorithms can also be interpreted as diag-onally rescaled gradient descent,where the rescaling factor is optimallychosen to ensure convergence.1IntroductionUnsupervised learning algorithms such as principal components analysis and vector quan-tization can be understood as factorizing a data matrix subject to different constraints.De-pending upon the constraints utilized,the resulting factors can be shown to have very dif-ferent representational properties.Principal components analysis enforces only a weak or-thogonality constraint,resulting in a very distributed representation that uses cancellations to generate variability[1,2].On the other hand,vector quantization uses a hard winner-take-all constraint that results in clustering the data into mutually exclusive prototypes[3]. We have previously shown that nonnegativity is a useful constraint for matrix factorization that can learn a parts representation of the data[4,5].The nonnegative basis vectors that are learned are used in distributed,yet still sparse combinations to generate expressiveness in the reconstructions[6,7].In this submission,we analyze in detail two numerical algorithms for learning the optimal nonnegative factors from data.2Non-negative matrix factorizationWe formally consider algorithms for solving the following problem:Non-negative matrix factorization(NMF)Given a non-negative matrix,find non-negative matrix factors and such that:(1)NMF can be applied to the statistical analysis of multivariate data in the following manner. Given a set of of multivariate-dimensional data vectors,the vectors are placed in the columns of an matrix where is the number of examples in the data set.This matrix is then approximately factorized into an matrix and an matrix. 
Usually is chosen to be smaller than or,so that and are smaller than the original matrix.This results in a compressed version of the original data matrix.What is the significance of the approximation in Eq.(1)?It can be rewritten column by column as,where and are the corresponding columns of and.In other words,each data vector is approximated by a linear combination of the columns of, weighted by the components of.Therefore can be regarded as containing a basis that is optimized for the linear approximation of the data in.Since relatively few basis vectors are used to represent many data vectors,good approximation can only be achieved if the basis vectors discover structure that is latent in the data.The present submission is not about applications of NMF,but focuses instead on the tech-nical aspects offinding non-negative matrix factorizations.Of course,other types of ma-trix factorizations have been extensively studied in numerical linear algebra,but the non-negativity constraint makes much of this previous work inapplicable to the present case [8].Here we discuss two algorithms for NMF based on iterative updates of and.Because these algorithms are easy to implement and their convergence properties are guaranteed, we have found them very useful in practical applications.Other algorithms may possibly be more efficient in overall computation time,but are more difficult to implement and may not generalize to different cost functions.Algorithms similar to ours where only one of the factors is adapted have previously been used for the deconvolution of emission tomography and astronomical images[9,10,11,12].At each iteration of our algorithms,the new value of or is found by multiplying the current value by some factor that depends on the quality of the approximation in Eq.(1).We prove that the quality of the approximation improves monotonically with the application of these multiplicative update rules.In practice,this means that repeated iteration of the update rules is guaranteed to converge to a locally optimal matrix factorization.3Cost functionsTofind an approximate factorization,wefirst need to define cost functions that quantify the quality of the approximation.Such a cost function can be constructed using some measure of distance between two non-negative matrices and.One useful measure is simply the square of the Euclidean distance between and[13],(2)This is lower bounded by zero,and clearly vanishes if and only if.Another useful measure isWe now consider two alternative formulations of NMF as optimization problems: Problem1Minimize with respect to and,subject to the constraints .Problem2Minimize with respect to and,subject to the constraints .Although the functions and are convex in only or only,they are not convex in both variables together.Therefore it is unrealistic to expect an algorithm to solve Problems1and2in the sense offinding global minima.However,there are many techniques from numerical optimization that can be applied tofind local minima. 
Gradient descent is perhaps the simplest technique to implement,but convergence can be slow.Other methods such as conjugate gradient have faster convergence,at least in the vicinity of local minima,but are more complicated to implement than gradient descent [8].The convergence of gradient based methods also have the disadvantage of being very sensitive to the choice of step size,which can be very inconvenient for large applications.4Multiplicative update rulesWe have found that the following“multiplicative update rules”are a good compromise between speed and ease of implementation for solving Problems1and2.Theorem1The Euclidean distance is nonincreasing under the update rules(4)The Euclidean distance is invariant under these updates if and only if and are at a stationary point of the distance.Theorem2The divergence is nonincreasing under the update rules(5)The divergence is invariant under these updates if and only if and are at a stationary point of the divergence.Proofs of these theorems are given in a later section.For now,we note that each update consists of multiplication by a factor.In particular,it is straightforward to see that this multiplicative factor is unity when,so that perfect reconstruction is necessarily afixed point of the update rules.5Multiplicative versus additive update rulesIt is useful to contrast these multiplicative updates with those arising from gradient descent [14].In particular,a simple additive update for that reduces the squared distance can be written as(6) If are all set equal to some small positive number,this is equivalent to conventional gradient descent.As long as this number is sufficiently small,the update should reduce .Now if we diagonally rescale the variables and set(8) Again,if the are small and positive,this update should reduce.If we now setminFigure1:Minimizing the auxiliary function guarantees that for.Lemma2If is the diagonal matrix(13) then(15) Proof:Since is obvious,we need only show that.To do this,we compare(22)(23)is a positive eigenvector of with unity eigenvalue,and application of the Frobenius-Perron theorem shows that Eq.17holds.We can now demonstrate the convergence of Theorem1:Proof of Theorem1Replacing in Eq.(11)by Eq.(14)results in the update rule:(24) Since Eq.(14)is an auxiliary function,is nonincreasing under this update rule,accordingto Lemma1.Writing the components of this equation explicitly,we obtain(28)Proof:It is straightforward to verify that.To show that, we use convexity of the log function to derive the inequality(30) we obtain(31) From this inequality it follows that.Theorem2then follows from the application of Lemma1:Proof of Theorem2:The minimum of with respect to is determined by setting the gradient to zero:7DiscussionWe have shown that application of the update rules in Eqs.(4)and(5)are guaranteed to find at least locally optimal solutions of Problems1and2,respectively.The convergence proofs rely upon defining an appropriate auxiliary function.We are currently working to generalize these theorems to more complex constraints.The update rules themselves are extremely easy to implement computationally,and will hopefully be utilized by others for a wide variety of applications.We acknowledge the support of Bell Laboratories.We would also like to thank Carlos Brody,Ken Clarkson,Corinna Cortes,Roland Freund,Linda Kaufman,Yann Le Cun,Sam Roweis,Larry Saul,and Margaret Wright for helpful discussions.References[1]Jolliffe,IT(1986).Principal Component Analysis.New 
York:Springer-Verlag.[2]Turk,M&Pentland,A(1991).Eigenfaces for recognition.J.Cogn.Neurosci.3,71–86.[3]Gersho,A&Gray,RM(1992).Vector Quantization and Signal Compression.Kluwer Acad.Press.[4]Lee,DD&Seung,HS.Unsupervised learning by convex and conic coding(1997).Proceedingsof the Conference on Neural Information Processing Systems9,515–521.[5]Lee,DD&Seung,HS(1999).Learning the parts of objects by non-negative matrix factoriza-tion.Nature401,788–791.[6]Field,DJ(1994).What is the goal of sensory coding?Neural Comput.6,559–601.[7]Foldiak,P&Young,M(1995).Sparse coding in the primate cortex.The Handbook of BrainTheory and Neural Networks,895–898.(MIT Press,Cambridge,MA).[8]Press,WH,Teukolsky,SA,Vetterling,WT&Flannery,BP(1993).Numerical recipes:the artof scientific computing.(Cambridge University Press,Cambridge,England).[9]Shepp,LA&Vardi,Y(1982).Maximum likelihood reconstruction for emission tomography.IEEE Trans.MI-2,113–122.[10]Richardson,WH(1972).Bayesian-based iterative method of image restoration.J.Opt.Soc.Am.62,55–59.[11]Lucy,LB(1974).An iterative technique for the rectification of observed distributions.Astron.J.74,745–754.[12]Bouman,CA&Sauer,K(1996).A unified approach to statistical tomography using coordinatedescent optimization.IEEE Trans.Image Proc.5,480–492.[13]Paatero,P&Tapper,U(1997).Least squares formulation of robust non-negative factor analy-b.37,23–35.[14]Kivinen,J&Warmuth,M(1997).Additive versus exponentiated gradient updates for linearprediction.Journal of Information and Computation132,1–64.[15]Dempster,AP,Laird,NM&Rubin,DB(1977).Maximum likelihood from incomplete data viathe EM algorithm.J.Royal Stat.Soc.39,1–38.[16]Saul,L&Pereira,F(1997).Aggregate and mixed-order Markov models for statistical languageprocessing.In C.Cardie and R.Weischedel(eds).Proceedings of the Second Conference on Empirical Methods in Natural Language Processing,81–89.ACL Press.。
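A minimal sketch of the contrast drawn in Section 5 of the paper above: a conventional additive gradient step on ||V − WH||_F² versus the diagonally rescaled step size that turns it into the multiplicative rule. The function names and the epsilon guard are assumptions.

```python
import numpy as np

def additive_step(V, W, H, eta=1e-3):
    """Conventional gradient descent with a small scalar step size; H may go negative."""
    grad = W.T @ W @ H - W.T @ V
    return H - eta * grad

def multiplicative_step(V, W, H, eps=1e-12):
    """Same descent direction with step size H / (W^T W H): entries stay non-negative."""
    return H * (W.T @ V) / (W.T @ W @ H + eps)
```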
Projective Nonnegative Matrix Factorization for Image Compression and Feature ExtractionZhijian Yuan and Erkki OjaNeural Networks Research Centre,Helsinki University of Technology,P.O.Box5400,02015HUT,Finland{zhijian.yuan,erkki.oja}@hut.fiAbstract.In image compression and feature extraction,linear expan-sions are standardly used.It was recently pointed out by Lee and Seungthat the positivity or non-negativity of a linear expansion is a very power-ful constraint,that seems to lead to sparse representations for the images.Their technique,called Non-negative Matrix Factorization(NMF),wasshown to be a useful technique in approximating high dimensional datawhere the data are comprised of non-negative components.We proposehere a new variant of the NMF method for learning spatially localized,sparse,part-based subspace representations of visual patterns.The algo-rithm is based on positively constrained projections and is related bothto NMF and to the conventional SVD or PCA decomposition.Two it-erative positive projection algorithms are suggested,one based on mini-mizing Euclidean distance and the other on minimizing the divergence ofthe original data matrix and its non-negative approximation.Experimen-tal results show that P-NMF derives bases which are somewhat bettersuitable for a localized representation than NMF.1IntroductionFor compressing,denoising and feature extraction of digital image windows, one of the classical approaches is Principal Component Analysis(PCA)and its extensions and approximations such as the Discrete Cosine Transform.In PCA or the related Singular Value Decomposition(SVD),the image is projected on the eigenvectors of the image covariance matrix,each of which provides one linear feature.The representation of an image in this basis is distributed in the sense that typically all the features are used at least to some extent in the reconstruction.Another possibility is a sparse representation,in which any given image win-dow is spanned by just a small subset of the available features[1,2,6,10].This kind of representations have some biological significance,as the sparse features seem to correspond to the receptivefields of simple cells in the area V1of the mammalian visual cortex.This approach is related to the technique of Indepen-dent Component Analysis[3]which can be seen as a nongaussian extension of PCA and Factor Analysis.H.Kalviainen et al.(Eds.):SCIA2005,LNCS3540,pp.333–342,2005.c Springer-Verlag Berlin Heidelberg2005334Z.Yuan and E.OjaRecently,it was shown by Lee and Seung[4]that positivity or non-negativity of a linear expansion is a very powerful constraint that also seems to yield sparse representations.Their technique,called Non-negative Matrix Factoriza-tion(NMF),was shown to be a useful technique in approximating high di-mensional data where the data are comprised of non-negative components.The authors proposed the idea of using NMF techniques tofind a set of basis func-tions to represent image data where the basis functions enable the identification and classification of intrinsic“parts”that make up the object being imaged by multiple observations.NMF has been typically applied to image and text data [4,9],but has also been used to deconstruct music tones[8].NMF imposes the non-negativity constraints in learning the basis images. 
Both the values of the basis images and the coefficients for reconstruction are all non-negative.The additive property ensures that the components are combined to form a whole in the non-negative way,which has been shown to be the part-based representation of the original data.However,the additive parts learned by NMF are not necessarily localized.In this paper,we start from the ideas of SVD and NMF and propose a novel method which we call Projective Non-negative Matrix Factorization(P-NMF), for learning spatially localized,parts-based representations of visual patterns. First,in Section2,we take a look at a simple way to produce a positive SVD by truncating away negative parts.Section3briefly reviews Lee’s and Seung’s ing this as a baseline,we present our P-NMF method in Section4. Section5gives some experiments and comparisons,and Section6concludes the paper.2Truncated Singular Value DecompositionSuppose that our data1is given in the form of an m×n matrix V.Its n columns are the data items,for example,a set of images that have been vectorized by row-by-row scanning.Then m is the number of pixels in any given image.Typically, n>m.The Singular Value Decomposition(SVD)for matrix V isV=UDˆU T,(1) where U(m×m)andˆU(n×m)are orthogonal matrices consisting of the eigenvectors of VV T and V T V,respectively,and D is a diagonal m×m matrix where the diagonal elements are the ordered singular values of V.Choosing the r largest singular values of matrix V to form a new diagonal r×r matrixˆD,with r<m,we get the compressive SVD matrix X with given rank r,X=UˆDˆU T.(2)1For clarity,we use here the same notation as in the original NMF theory by Lee and SeungProjective Nonnegative Matrix Factorization335 Now both matrices U andˆU have only r columns corresponding to the r largest eigenvalues.The compressive SVD gives the best approximation X of the matrix V with the given compressive rank r.In many real-world cases,for example,for images,spectra etc.,the original data matrix V is non-negative.Then the above compressive SVD matrix X fails to keep the nonnegative property.In order to further approximate it by a non-negative matrix,the following truncated SVD(tSVD)is suggested.We simply truncate away the negative elements byˆX=12(X+abs(X)).(3)However,it turns out that typically the matrixˆX in(3)has higher rank than X. Truncation destroys the linear dependences that are the reason for the low rank. In order to get an equal rank,we have to start from a compressive SVD matrix X with lower rank than the given r.Therefore,tofind the truncated matrix ˆX with the compressive rank r,we search all the compressive SVD matrices X with the rank from1to r and form the corresponding truncated matrices.The one with the largest rank that is less than or equal to the given rank r is the truncated matrixˆX what we choose as thefinal non-negative approximation. 
This matrix can be used as a baseline in comparisons,and also as a starting point in iterative improvements.We call this method truncated SVD(t-SVD).Note that the tSVD only produces the non-negative low-rank approximation ˆX to the data matrix V,but does not give a separable expansion for basis vectors and weights as the usual SVD expansion.3Non-negative Matrix FactorizationGiven the nonnegative m×n matrix V and the constant r,the Nonnegative Matrix Factorization algorithm(NMF)[4]finds a nonnegative m×r matrix W and another nonnegative r×n matrix H such that they minimize the following optimality problem:minW,H≥0||V−WH||.(4) This can be interpreted as follows:each column of matrix W contains a basis vector while each column of H contains the weights needed to approximate the corresponding column in V using the basis from W.So the product WH can be regarded as a compressed form of the data in V.The rank r is usually chosen so that(n+m)r<nm.In order to estimate the factorization matrices,an objective function defined by the authors as Kullback-Leibler divergence isF=mi=1nµ=1[V iµlog(WH)iµ−(WH)iµ].(5)This objective function can be related to the likelihood of generating the images in V from the basis W and encodings H.An iterative approach to336Z.Yuan and E.Ojareach a local maximum of this objective function is given by the following rules[4,5]:W ia←W iaµV iµ(WH)iµH aµ,W ia←W iajW ja(6)H aµ←H aµi W iaV iµ(WH)iµ.(7)The convergence of the process is ensured2.The initialization is performed using positive random initial conditions for matrices W and H.4The Projective NMF Method4.1Definition of the ProblemThe compressive SVD is a projection method.It projects the data matrix V onto the subspace of the eigenvectors of the data covariance matrix.Although the truncated method t-SVD outlined above works and keeps nonnegativity, it is not accurate enough for most cases.To improve it,for the given m×n nonnegative matrix V,m<n,let us try tofind a subspace B of R m,and an m×m projection matrix P with given rank r such that P projects the nonnegative matrix V onto the subspace B and keeps the nonnegative property, that is,PV is a nonnegative matrix.Finally,it should minimize the difference ||V−PV||.This is the basic idea of the Projective NMF method.We can write any symmetrical projection matrix of rank r in the formP=WW T(8) with W an orthogonal(m×r)matrix3.Thus,we can solve the problem by searching for a nonnegative(m×r)matrix W.Based on this,we now introduce a novel method which we call Projective Non-negative Matrix Factorization(P-NMF)as the solution to the following optimality problemminW≥0||V−WW T V||,(9)where||·||is a matrix norm.The most useful norms are the Euclidean dis-tance and the divergence of matrix A from B,defined as follows:The Euclidean distance between two matrices A and B is2The matlab program for the above update rules is available at under the”Computational Neuroscience”discussion category.3This is just notation for a generic basis matrix;the solution will not be the same as the W matrix in NMF.Projective Nonnegative Matrix Factorization337||A−B||2=i,j(A ij−B ij)2,(10) and the divergence of A from BD(A||B)=i,j (A ij logA ijB ij−A ij+B ij).(11)Both are lower bounded by zero,and vanish if and only if A=B.4.2AlgorithmsWefirst consider the Euclidean distance(10).Define the functionF=12||V−WW T V||2.(12)Then the unconstrained gradient of F for W,∂F∂w ij,is given by∂F∂w ij=−2(VV T W)ij+(WWT VV T W)ij+(VVT WW T W)ij.(13)Using the gradient we can construct the additive update rule for 
minimization,W ij←W ij−ηij∂F∂w ij(14)whereηij is the positive step size.However,there is nothing to guarantee that the elements W ij would stay non-negative.In order to ensure this,we choose the step size as follows,ηij=W ij(WW T VV T W)ij+(VV T WW T W)ij.(15)Then the additive update rule(14)can be formulated as a multiplicative update rule,W ij←W ij(VV T W)ij(WW T VV T W)ij+(VV T WW T W)ij.(16)Now it is guaranteed that the W ij will stay nonnegative,as everything on the right-hand side is nonnegative.For the divergence measure(11),we follow the same process.First we calcu-late the gradient∂D(V||WW T V)∂w ij=k(W T V)jk+lW lj V ik(17)−kV ik(W T V)jk/(WW T V)ik(18)−k V iklW lj V lk/(WW T V)lk.(19)338Z.Yuan and E.OjaUsing the gradient,the additive update rule becomesW ij ←W ij +ζij ∂D (V ||WW T V )∂w ij(20)where ζij is the step size.Choosing this step size as following,ζij =W ij k V ik [(W T V )jk /(WW T V )ik + l W lj V lk /(WW T V )lk ].(21)we obtain the multiplicative update ruleW ij ←W ij k (W T V )jk + l W lj V ik k V ik ((W T V )jk /(WW T V )ik + l W lj V lk /(WW T V )lk ).(22)It is easy to see that both multiplicative update rules (16)and (22)can ensure that the matrix W is non-negative.4.3The Relationship Between NMF and P-NMFThere is a very obvious relationship between our P-NMF algorithms and the original paring the two optimality problems,P-NMF (9)and the original NMF (4),we see that the weight matrix H in NMF is simply replaced by W T V in our algorithms.Both multiplicative update rules (16)and (22)are obtained similar to Lee and Seung’s algorithms [5].Therefore,the convergence of these two algorithms can also be proved following Lee and Seung [5]by noticing that the coefficient matrix H is replaced by WV .4.4The Relationship Between SVD and P-NMFThere is also a relationship between the P-NMF algorithm and the SVD.For the Euclidean norm,note the similarity of the problem (9)with the conventional PCA for the columns of V .Removing the positivity constraint,this would be-come the usual finite-sample PCA problem,whose solution is known to be an orthogonal matrix consisting of the eigenvectors of VV T .But this is the matrix U in the SVD of eq.(1).However,now with the positivity constraint in place,the solution will be something quite different.5Simulations 5.1Data PreparationAs experimental data,we used face images from the MIT-CBCL database and derived the NMF and P-NMF expansions for them.The training data set con-tains 2429faces.Each face has 19×19=361pixels and has been histogram-equalized and normalized so that all pixel values are between 0and 1.ThusProjective Nonnegative Matrix Factorization339 the data matrix V which now has the faces as columns is361×2429.This matrix was compressed to rank r=49using either t-SVD,NMF,or P-NMF expansions.5.2Learning Basis ComponentsThe basis images of tSVD,NMF,and P-NMF with dimension49are shown in Figure1.For NMF and P-NMF,these are the49columns of the corresponding matrices W.For t-SVD,we show the49basis vectors of the range space of the rank-49nonnegative matrixˆX,obtained by ordinary SVD of this matrix.Thus the basis images for NMF and P-NMF are truly non-negative,while the t-SVD only produces a non-negative overall approximation to the data but does not give a separable expansion for basis vectors and weights.All the images are displayed with the matlab command”imagesc”without any extra scale.Both NMF and P-NMF bases are holistic for the training set. 
For this problem,the P-NMF algorithm converges about5times faster than NMF.Fig.1.NMF(top,left),t-SVD(bottom,left)and the two versions of the new P-NMF method(right)bases of dimension49.Each basis component consists of19×19pixels340Z.Yuan and E.OjaFig.2.The original face image(left)and its reconstructions by NMF(top row),the two versions of the new P-NMF method under100iterative steps(second and third rows),and t-SVD(bottom row).The dimensions in columns2,3,and4are25,49and 81,respectively5.3Reconstruction AccuracyWe repeated the above computations for ranks r=25,49and81.Figure2 shows the reconstructions for one of the face images in the t-SVD,NMF,and P-NMF subspaces of corresponding dimensions.For comparison,also the original face image is shown.As the dimension increases,more details are recovered. Visually,the P-NMF method is comparable to NMF.The recognition accuracy,defined as the Euclidean distance between the orig-inal data matrix and the recognition matrix,can be used to measure the perfor-mance quantitatively.Figure3shows the recognition accuracy curves of P-NMF and NMF under different iterative steps.NMF converges faster,but when the number of steps increases,P-NMF works very similarly to NMF.One thing to be noticed is that the accuracy of P-NMF depends on the initial values.Al-though the number of iteration steps is larger in P-NMF for comparable error with NMF,this is compensated by the fact that the computational complexity for one iteration step is considerably lower for P-NMF,as only one matrix has to be updated instead of two.Projective Nonnegative Matrix Factorization341Fig.3.Recognition accuracies(unit:108)versus iterative steps using t-SVD,NMF and P-NMF with compressive dimension496ConclusionWe proposed a new variant of the well-known Non-negative Matrix Factorization (NMF)method for learning spatially localized,sparse,part-based subspace rep-resentations of visual patterns.The algorithm is based on positively constrained projections and is related both to NMF and to the conventional SVD decompo-sition.Two iterative positive projection algorithms were suggested,one based on minimizing Euclidean distance and the other on minimizing the divergence of the original data matrix and its pared to the NMF method, the iterations are somewhat simpler as only one matrix is updated instead of two as in NMF.The tradeoffis that the convergence,counted in iteration steps, is slower than in NMF.One purpose of these approaches is to learn localized features which would be suitable not only for image compression,but also for object recognition. Experimental results show that P-NMF derives bases which are better suitable for a localized representation than NMF.It remains to be seen whether they would be better in pattern recognition,too.342Z.Yuan and E.OjaReferences1. A.Bell and T.Sejnowski.The”independent components”of images are edgefilters.Vision Research,37:3327–3338,1997.2. A.Hyv¨a rinen and P.Hoyer.Emergence of phase and shift invariant features bydecomposition of natural images into independent feature subspaces.Neural Com-putation,13:1527–1558,2001.3. A.Hyv¨a rinen,J.Karhunen,and E.Oja.Independent Component Analysis.Wiley,New York,2001.4. D.D.Lee and H.S.Seung.Learning the parts of objects by non-negative matrixfactorization.Nature,401:788–791,1999.5. D.D.Lee and H.S.Seung.Algorithms for non-negative matrix factorization.InNIPS,pages556–562,2000.6. 
6. B. A. Olshausen and D. J. Field. Natural image statistics and efficient coding. Network, 7:333–339, 1996.
7. P. Paatero and U. Tapper. Positive Matrix Factorization: A non-negative factor model with optimal utilization of error estimations of data values. Environmetrics, 5:111–126, 1997.
8. T. Kawamoto, K. Hotta, T. Mishima, J. Fujiki, M. Tanaka and T. Kurita. Estimation of single tones from chord sounds using non-negative matrix factorization. Neural Network World, 3:429–436, July 2000.
9. L. K. Saul and D. D. Lee. Multiplicative updates for classification by mixture models. In Advances in Neural Information Processing Systems 14, 2002.
10. J. H. van Hateren and A. van der Schaaf. Independent component filters of natural images compared with simple cells in primary visual cortex. Proc. Royal Soc. London B, 265:2315–2320, 1998.
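As a complement to the algorithm description above, the Euclidean multiplicative rule (16) of the P-NMF method can be sketched in a few lines of Python/NumPy. This is not the authors' reference implementation; the random initialization, iteration count, and the small epsilon added to the denominator for numerical safety are assumptions made for illustration.

```python
import numpy as np

def pnmf_euclidean(V, r, n_iter=100, eps=1e-9, seed=0):
    """Sketch of projective NMF: V ~ W W^T V with W >= 0, using update rule (16)."""
    rng = np.random.default_rng(seed)
    n = V.shape[0]
    W = rng.random((n, r))                       # non-negative random init (assumption)
    for _ in range(n_iter):
        VVtW = V @ (V.T @ W)                     # V V^T W
        WWtVVtW = W @ (W.T @ VVtW)               # W W^T V V^T W
        VVtWWtW = VVtW @ (W.T @ W)               # V V^T W W^T W
        W *= VVtW / (WWtVVtW + VVtWWtW + eps)    # multiplicative update (16)
    return W

# Toy usage: factor a random non-negative 361 x 200 matrix to rank 49.
V = np.random.rand(361, 200)
W = pnmf_euclidean(V, r=49, n_iter=50)
reconstruction_error = np.linalg.norm(V - W @ (W.T @ V))
```

Note that only the single matrix W is updated, which is the source of the lower per-iteration cost mentioned in the conclusion above.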
Nonnegative Matrix Factorization:A Comprehensive ReviewYu-Xiong Wang,Student Member,IEEE,and Yu-Jin Zhang,Senior Member,IEEE Abstract—Nonnegative Matrix Factorization(NMF),a relatively novel paradigm for dimensionality reduction,has been in the ascendant since its inception.It incorporates the nonnegativity constraint and thus obtains the parts-based representation as well as enhancing the interpretability of the issue correspondingly.This survey paper mainly focuses on the theoretical research into NMF over the last5years,where the principles,basic models,properties,and algorithms of NMF along with its various modifications,extensions, and generalizations are summarized systematically.The existing NMF algorithms are divided into four categories:Basic NMF(BNMF), Constrained NMF(CNMF),Structured NMF(SNMF),and Generalized NMF(GNMF),upon which the design principles,characteristics,problems,relationships,and evolution of these algorithms are presented and analyzed comprehensively.Some related work not on NMF that NMF should learn from or has connections with is involved too.Moreover,some open issues remained to be solved are discussed.Several relevant application areas of NMF are also briefly described.This survey aims to construct an integrated,state-of-the-art framework for NMF concept,from which the follow-up research may benefit.Index Terms—Data mining,dimensionality reduction,multivariate data analysis,nonnegative matrix factorization(NMF)Ç1I NTRODUCTIONO NE of the basic concepts deeply rooted in science and engineering is that there must be something simple, compact,and elegant playing the fundamental roles under the apparent chaos and complexity.This is also the case in signal processing,data analysis,data mining,pattern recognition,and machine learning.With the increasing quantities of available raw data due to the development in sensor and computer technology,how to obtain such an effective way of representation by appropriate dimension-ality reduction technique has become important,necessary, and challenging in multivariate data analysis.Generally speaking,two basic properties are supposed to be satisfied: first,the dimension of the original data should be reduced; second,the principal components,hidden concepts,promi-nent features,or latent variables of the data,depending on the application context,should be identified efficaciously.In many cases,the primitive data sets or observations are organized as data matrices(or tensors),and described by linear(or multilinear)combination models;whereupon the formulation of dimensionality reduction can be regarded as, from the algebraic perspective,decomposing the original data matrix into two factor matrices.The canonical methods, such as Principal Component Analysis(PCA),Linear Discriminant Analysis(LDA),Independent Component Analysis(ICA),Vector Quantization(VQ),etc.,are the exemplars of such low-rank approximations.They differ from one another in the statistical properties attributable to the different constraints imposed on the component matrices and their underlying structures;however,they have something in common that there is no constraint in the sign of the elements in the factorized matrices.In other words,the negative component or the subtractive combina-tion is allowed in the representation.By contrast,a new paradigm of factorization—Nonnegative Matrix Factoriza-tion(NMF),which incorporates the nonnegativity constraint and thus obtains the parts-based representation as well as enhancing the interpretability of the issue 
correspondingly, was initiated by Paatero and Tapper[1],[2]together with Lee and Seung[3],[4].As a matter of fact,the notion of NMF has a long history under the name“self modeling curve resolution”in chemometrics,where the vectors are continuous curves rather than discrete vectors[5].NMF was first introduced by Paatero and Tapper as the concept of Positive Matrix Factorization,which concentrated on a specific application with Byzantine algorithms.These shortcomings limit both the theoretical analysis,such as the convergence of the algorithms or the properties of the solutions,and the generalization of the algorithms in other applications. Fortunately,NMF was popularized by Lee and Seung due to their contributing work of a simple yet effective algorithmic procedure,and more importantly the emphasis on its potential value of parts-based representation.Far beyond a mathematical exploration,the philosophy underlying NMF,which tries to formulate a feasible model for learning object parts,is closely relevant to perception mechanism.While the parts-based representation seems intuitive,it is indeed on the basis of physiological and psychological evidence:perception of the whole is based on perception of its parts[6],one of the core concepts in certain computational theories of recognition problems.In fact there are two complementary connotations in nonnegativity—nonnegative component and purely additive combination. On the one hand,the negative values of both observations.The authors are with the Tsinghua National Laboratory for InformationScience and Technology and the Department of Electronic Engineering,Tsinghua University,Rohm Building,Beijing100084,China.E-mail:albertwyx@,zhang-yj@.Manuscript received21July2011;revised2Nov.2011;accepted4Feb.2012;published online2Mar.2012.Recommended for acceptance by L.Chen.For information on obtaining reprints of this article,please send e-mail to:tkde@,and reference IEEECS Log Number TKDE-2011-07-0429.Digital Object Identifier no.10.1109/TKDE.2012.51.1041-4347/13/$31.00ß2013IEEE Published by the IEEE Computer Societyand latent components are physically meaningless in many kinds of real-world data,such as image,spectra,and gene data,analysis tasks.Meanwhile,the discovered prototypes commonly correspond with certain semantic interpretation. 
For instance,in face recognition,the learned basis images are localized rather than holistic,resembling parts of faces,such as eyes,nose,mouth,and cheeks[3].On the other hand, objects of interest are most naturally characterized by the inventory of its parts,and the exclusively additive combina-tion means that they can be reassembled by adding required parts together similar to identikits.NMF thereupon has achieved great success in real-word scenarios and tasks.In document clustering,NMF surpasses the classic methods, such as spectral clustering,not only in accuracy improve-ment but also in latent semantic topic identification[7].To boot,the nonnegativity constraint will lead to sort of sparseness naturally[3],which is proved to be a highly effective representation distinguished from both the com-pletely distributed and the solely active component de-scription[8].When NMF is interpreted as a neural network learning algorithm depicting how the visible variables are generated from the hidden ones,the parts-based represen-tation is obtained from the additive model.A positive number indicates the presence and a zero value represents the absence of some event or component.This conforms nicely to the dualistic properties of neural activity and synaptic strengths in neurophysiology:either excitatory or inhibitory without changing sign[3].Because of the enhanced semantic interpretability under the nonnegativity and the ensuing sparsity,NMF has become an imperative tool in multivariate data analysis, and been widely used in the fields of mathematics, optimization,neural computing,pattern recognition and machine learning[9],data mining[10],signal processing [11],image engineering and computer vision[11],spectral data analysis[12],bioinformatics[13],chemometrics[1], geophysics[14],finance and economics[15].More specifi-cally,such applications include text data mining[16],digital watermark,image denoising[17],image restoration,image segmentation[18],image fusion,image classification[19], image retrieval,face hallucination,face recognition[20], facial expression recognition[21],audio pattern separation [22],music genre classification[23],speech recognition, microarray analysis,blind source separation[24],spectro-scopy[25],gene expression classification[26],cell analysis, EEG signal processing[17],pathologic diagnosis,email surveillance[10],online discussion participation prediction, network security,automatic personalized summarization, identification of compounds in atmosphere analysis[14], earthquake prediction,stock market pricing[15],and so on.There have been numerous results devoted to NMF research since its inception.Researchers from various fields, mathematicians,statisticians,computer scientists,biolo-gists,and neuroscientists,have explored the NMF concept from diverse perspectives.So a systematic survey is of necessity and consequence.Although there have been such survey papers as[27],[28],[12],[13],[10],[11],[29]and one book[9],they fail to reflect either the updated or the comprehensive results.This review paper will summarize the principles,basic models,properties,and algorithms of NMF systematically over the last5years,including its various modifications,extensions,and generalizations.A taxonomy is accordingly proposed to logically group them, which have not been presented before.Besides these,some related work not on NMF that NMF should learn from or has connections with will also be involved.Furthermore, this survey mainly focuses on the theoretical research rather than the specific 
applications,the practical usage will also be concerned though.It aims to construct an integrated, state-of-the-art framework for NMF concept,from which the follow-up research may benefit.In conclusion,the theory of NMF has advanced sig-nificantly by now yet is still a work in progress.To be specific:1)the properties of NMF itself have been explored more deeply;whereas a firm statistical underpinning like those of the traditional factorization methods—PCA or LDA—is not developed fully(partly due to its knottiness).2)Some problems like the ones mentioned in[29]have been solved,especially those with additional constraints;never-theless a lot of other questions are still left open.The existing NMF algorithms are divided into four categories here given in Fig.1,following some unified criteria:1.Basic NMF(BNMF),which only imposes the non-negativity constraint.2.Constrained NMF(CNMF),which imposes someadditional constraints as regularization.3.Structured NMF(SNMF),which modifies the stan-dard factorization formulations.4.Generalized NMF(GNMF),which breaks throughthe conventional data types or factorization modesin a broad sense.The model level from Basic to Generalized NMF becomes broader.Therein Basic NMF formulates the fundamental analytical framework upon which all other NMF models are built.We will present the optimization tools and computational methods to efficiently and robustly solve Basic NMF.Moreover,the pragmatic issue of NMF with respect to large-scale data sets and online processing will also be discussed.Constrained NMF is categorized into four subclasses:1.Sparse NMF(SPNMF),which imposes the sparse-ness constraint.2.Orthogonal NMF(ONMF),which imposes theorthogonality constraint.3.Discriminant NMF(DNMF),which involves theinformation for classification and discrimination.4.NMF on manifold(MNMF),which preserves thelocal topological properties.We will demonstrate why these morphological constraints are essentially necessary and how to incorporate them into the existing solution framework of Basic NMF.Correspondingly,Structured NMF is categorized into three subclasses:1.Weighed NMF(WNMF),which attaches differentweights to different elements regarding their relativeimportance.2.Convolutive NMF (CVNMF),which considers the time-frequency domain factorization.3.Nonnegative Matrix Trifactorization (NMTF),whichdecomposes the data matrix into three factor matrices.Besides,Generalized NMF is categorized into four subclasses:1.Semi-NMF,which relaxes the nonnegativity con-straint only on the specific factor matrix.2.Nonnegative Tensor Factorization (NTF),whichgeneralizes the matrix-form data to higher dimen-sional tensors.3.Nonnegative Matrix-Set Factorization (NMSF),which extends the data sets from matrices to matrix-sets.4.Kernel NMF (KNMF),which is the nonlinear modelof NMF.The remainder of this paper is organized as follows:first,the mathematic formulation of NMF model is presented,and the unearthed properties of NMF are summarized.Then the algorithmic details of foregoing categories of NMF are elaborated.Finally,conclusions are drawn,and some open issues remained to be solved are discussed.2C ONCEPT AND P ROPERTIES OF NMFDefinition.Given an M dimensional random vector x with nonnegative elements,whose N observations are denoted asx j;j ¼1;2;...;N ,let data matrix be X ¼½x 1;x 2;...;x N 2I R M ÂN!0,NMF seeks to decompose X into nonnegative M ÂL basismatrix U ¼½u 1;u 2;...;u L 2I R M ÂL!0and nonnegative L ÂN coefficient matrix V ¼½v 1;v 2;...;v N 2I R L ÂN!0,such thatX %U V ,where I R M 
ÂN!0stands for the set of M ÂN element-wise nonnegative matrices.This can also be written as the equivalent vector formula x j %P Li ¼1u i V ij .It is obvious that v j is the weight coefficient of the observation x j on the columns of U ,the basis vectors or the latent feature vectors of X .Hence,NMF decomposes each data into the linear combination of the basis vectors.Because of the initial condition L (min ðM;N Þ,the obtained basis vectors are incomplete over the original vector space.In other words,this approach tries to represent the high-dimensional stochastic pattern with far fewer bases,so the perfect approximation can be achieved successfully only if the intrinsic features are identified in U .Here,we discuss the relationship between L and M ,N a little more.In most cases,NMF is viewed as a dimension-ality reduction and feature extraction technique with L (M;L (N ;that is,the basis set learned from NMF model is incomplete,and the energy is compacted.However,in general,L can be smaller,equal or larger than M .But there are fundamental differences in the decomposition for L <M and L >M .It is a sort of sparse coding and compressed sensing with overcomplete basis when L >M .Hence,L need not be limited by the dimensionality of the data,which is useful for some applications,like classification.In this situation,it may benefit from the sparseness due to both nonnegativity and redundant representation.One approach to obtain this NMF model is to perform the decomposition on the residue matrix E ¼X ÀU V repeat-edly and sequentially [30].As a kind of matrix factorization model,three essential questions need answering:1)existence,whether the nontrivial NMF solutions exist;2)uniqueness,under what assumptions NMF is,at least in some sense,unique;3)effectiveness,under what assumptions NMF is able to recover the “right answer.”The existence was showed via the theory of Completely Positive (CP)Factorization for the first time in [31].The last two concerns were first mentioned and discussed from a geometric viewpoint in [32].Complete NMF X ¼U V is considered first for the analysis of existence,convexity,and computational com-plexity.The trivial solution always exists as U ¼X and V ¼I N .By relating NMF to CP Factorization,Vasiloglou et al.showed that every nonnegative matrix has a nontrivial complete NMF [31].As such,CP Factorizationis a special case,where a nonnegative matrix X 2I R M ÂM!0is CP if it can be factored in the form X ¼U U T ;U 2I R M ÂL!0.The minimum L is called the CP-rank of X .WhenFig.1.The categorization of NMF models and algorithms.combining that the set of CP matrices forms a convex cone with that the solution to NMF belongs to a CP cone, solving NMF is a convex optimization problem[31]. 
Nevertheless,finding a practical description of the CP cone is still open,and it remains hard to formulate NMF as a convex optimization problem,despite a convex relaxation to rank reduction with theoretical merit pro-posed in[31].Using the bilinear model,complete NMF can be rewritten as linear combination of rank-one nonnegative matrices expressed byX¼X Li¼1U i V i ¼X Li¼1U i V iðÞT;ð1Þwhere U i is the i th column vector of U while V i is the i th row vector of V,and denotes the outer product of two vectors.The smallest L making the decomposition possible is called the nonnegative rank of the nonnegative matrix X, denoted as rankþðXÞ.And it satisfies the following trivial bounds[33]rankðXÞrankþðXÞminðM;NÞ:ð2ÞWhile PCA can be solved in polynomial time,the optimization problem of NMF,with respect to determining the nonnegative rank and computing the associated factor-ization,is more difficult than its unconstrained counterpart. It is in fact NP-hard when requiring both the dimension and the factorization rank of X to increase,which was proved via relating it to NP-hard intermediate simplex problem by Vavasis[34].This is also the corollary of CP programming, since the CP cone cannot be described in polynomial time despite its convexity.In the special case when rankðXÞ¼1, complete NMF can be solved in polynomial time.However, the complexity of complete NMF for fixed factorization rank generally is still unknown[35].Another related work is so-called Nonnegative Rank Factorization(NRF)focusing on the situation of rankðXÞ¼rankþðXÞ,i.e.,selecting rankðXÞas the minimum L[33]. This is not always possible,and only nonnegative matrix with a corresponding simplicial cone(A polyhedral cone is simplicial if its vertex rays are linearly independent.) existed has an NRF[36].In most cases,the approximation version of NMF X% U V instead of the complete factorization is widely utilized. An alternative generative model isX¼U VþE;ð3Þwhere E2I R MÂN is the residue or noise matrix represent-ing the approximation error.These two modes of NMF are essentially coupled with each other,though much more attention is devoted to the latter.The theoretical results on complete NMF will be helpful to design more efficient NMF algorithms[31],[34]. 
The selection of the factorization rank L of NMF may be more creditable if tighter bound for the nonnegative rank is obtained[37].In essence,NMF is an ill-posed problem with nonunique solutions[32],[38].From the geometric perspective,NMF can be viewed as finding a simplicial cone involving all the data points in the positive orthant.Given a simplicial cone satisfying all these conditions,it is not difficult to construct another cone containing the former one to meet the same conditions,so the nesting can work on infinitively thus leading to an ill-defined factorization notion.From the algebraic perspective,if there exists a solution X%U0V0, let U¼U0D,V¼DÀ1V0,then X%U V.If a nonsingular matrix and its inverse are both nonnegative,then the matrix is a generalized permutation with the form of P S, where P and S are permutation and scaling matrices, respectively.So the permutation and scaling ambiguities for NMF are inevitable.For that matter,NMF is called unique factorization up to a permutation and a scaling transformation when D¼P S.Unfortunately,there are many ways to select a rotational matrix D which is not necessarily a generalized permutation or even nonnegative matrix,so that the transformed factor matrices U and V are still nonnegative.In other words,the sole nonnegativity constraint in itself will not suffice to guarantee the uniqueness,let alone the effectiveness.Nevertheless,the uniqueness will be achieved if the original data satisfy certain generative model.Intuitively,if U0and V0are sufficiently sparse,only generalized permutation matrices are possible rotation matrices satisfying the nonnegativity constraint.Strictly speaking,this is called boundary close condition for sufficiency and necessity of the uniqueness of NMF solution[39].The deep discussions about this issue can be found in[32],[38],[39],[40],[41],and[42].In practice,incorporating additional constraints such as sparseness in the factor matrices or normalizing the columns of U(respectively rows of V)to unit length is helpful in alleviating the rotational indeterminacy[9].It was hoped that NMF would produce an intrinsically parts-based and sparse representation in unsupervised mode[3],which is the most inspiring benefit of NMF. 
Intuitively,this can be explained by that the stationary points of NMF solutions will typically be located at the boundary of the feasible domain due to the first order optimality conditions,leading to zero elements[37].Further experiments by Li et al.have shown,however,that the pure additivity does not necessarily mean sparsity and that NMF will not necessarily learn the localized features[43].Further more,NMF is equivalent to k-means clustering when using Square of Euclidian Distance(SED)[44],[45], while tantamount to Probabilistic Latent Semantic Analy-sis(PLSA)when using Generalized Kullback-Leibler Divergence(GKLD)as the objective function[46],[47].So far we may conclude that the merits of NMF,parts-based representation and sparseness included,come at the price of more complexity.Besides,SVD or PCA has always a more compact spectrum than NMF[31].You just cannot have the best of both worlds.3B ASIC NMF A LGORITHMSThe cynosure in Basic NMF is trying to find more efficient and effective solutions to NMF problem under the sole nonnegativity constraint,which lays the foundation for the practicability of NMF.Due to its NP-hardness and lack of appropriate convex formulations,the nonconvex formula-tions with relatively easy solvability are generally adopted,and only local minima are achievable in a reasonable computational time.Hence,the classic and also more practical approach is to perform alternating minimization of a suitable cost function as the similarity measures between X and the product U V .The different optimization models vary from one another mainly in the object functions and the optimization procedures.These optimization models,even serving to give sight of some possible directions for the solutions to Constrained,Structured,and Generalized NMF,are the kernel discus-sions of this section.We will first summarize the objective functions.Then the details about the classic Basic NMF framework and the paragon algorithms are presented.Moreover,some new vision of NMF,such as the geometric formulation of NMF,and the pragmatic issue of NMF,such as large-scale data sets,online processing,parallel comput-ing,and incremental NMF,will be discussed.In the last part of this section,some other relevant issues are also involved.3.1Similarity Measures or Objective FunctionsIn order to quantify the difference between the original data X and the approximation U V ,a similarity measure D ðX U V k Þneeds to be defined first.This is also the objective function of the optimization model.These similarity mea-sures can be either distances or divergences,and correspond-ing objective functions can be either a sole cost function or optionally a set of cost functions with the same global minima to be minimized sequentially or simultaneously.The most commonly used objective functions are SED (i.e.,Frobenius norm)(4)and GKLD (i.e.,I-divergence)(5)[4]D F X U V k ðÞ¼12X ÀU V k k 2F ¼12X ijX ij ÀU V ½ ij 2;ð4ÞD KL X U V k ðÞ¼XijX ij lnX ijU V ½ ijÀX ij þU V ½ ij !:ð5ÞThere are some drawbacks of GKLD,especially the gradients needed in optimization heavily depend on the scales of factorizing matrices leading to many iterations.Thus,the original KLD is renewed for NMF by normalizing the input data in [48].Other cost functions consist of Minkowski family of metrics known as ‘p -norm,Earth Mover’s distance metric [18], -divergence [17], -divergence [49], -divergence [50],Csisza´r’s ’-divergence [51],Bregman divergence [52],and - -divergence [53].Most of them are element-wise mea-sures.Some similarity measures are 
more robust with respect to noise and outliers,such as hypersurface cost function [54], -divergence [50],and - -divergence [53].Statistically,different similarity measures can be deter-mined based on a prior knowledge about the probability distribution of the noise,which actually reflects the statistical structure of the signals and the disclosed compo-nents.For example,the SED minimization can be seen as a maximum likelihood estimator where the difference is due to additive Gaussian noise,whereas GKLD can be shown to be equivalent to the Expectation Maximization (EM)algo-rithm and maximum likelihood for Poisson processes [9].Given that while the optimization problem is not jointly convex in both U and V ,it is separately convex in either Uor V ,the alternating minimizations are seemly the feasible direction.A phenomenon worthy of notice is that although the generative model of NMF is linear,the inference computation is nonlinear.3.2Classic Basic NMF Optimization Framework The prototypical multiplicative update rules originated by Lee and Seung—the SED-MU and GKLD-MU [4]have still been widely used as the baseline.The SED-MU and GKLD-MU algorithms use SED and GKLD as objective functions,respectively,and both apply iterative multiplicative updates as the optimization approach similar to EM algorithms.In essence,they can be viewed as adaptive rescaled gradient descent algorithms.Considering the efficiency,they are relatively simple and parameter free with low cost per iteration,but they converge slowly due to a first-order convergence rate [28],[55].Regarding the quality of the solutions,Lee and Seung claimed that the multiplicative update rules converge to a local minimum [4].Gonzales and Zhang indicated that the gradient and properties of continual nonincreasing by no means,however,ensure the convergence to a limit point that is also a stationary point,which can be understood under the Karush-Kuhn-Tucker (KKT)optimality conditions [55],[56].So the accurate conclusion is that the algorithms converge to a stationary point which is not necessarily a local minimum when the limit point is in the interior of the feasible region;its stationarity cannot be even determined when the limit point lies on the boundary of the feasible region [10].However,a minor modification in their step size of the gradient descent formula achieves a first-order stationary point [57].Another drawback is the strong correlation enforced by the multi-plication.Once an element in the factor matrices becomes zero,it must remain zero.This means the gradual shrinkage of the feasible region,which is harmful for getting more superior solution.In practice,to reduce the numerical difficulties,like numerical instabilities or ill-conditioning,the normalization of the ‘1or ‘2norm of the columns in U is often needed as an extra procedure,yet this simple trick has changed the original optimization problem,thereby making searching for the global minimum more complicated.Besides,to preclude the computational difficulty due to division by zero,an extra positive additive value in the denominator is helpful [56].To accelerate the convergence rate,one popular method is to apply gradient descent algorithms with additive update rules.Other techniques such as conjugate gradient,projected gradient,and more sophisticated second-order scheme like Newton and Quasi-Newton methods et al.are also in consideration.They choose appropriate descent direction,such as the gradient direction,and update the element additively in the factor matrices at a 
certain learning rate.They differ from one another as for either the descent direction or the learning rate strategy.To satisfy the nonnegativity constraint,the updated matrices are brought back to the feasible region,namely the nonnegative orthant,by additional projection,like simply setting all negative elements to ually under certain mild additional conditions,they can guarantee the first-order stationarity.These are the widely developed algorithms in Basic NMF recent years.。
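As a concrete illustration of the projected-gradient idea just described, here is a hedged Python/NumPy fragment that updates H for fixed U by a gradient step on the SED objective (4) followed by projection onto the nonnegative orthant. The fixed step size, iteration count, and toy data are assumptions for demonstration only, not the tuned procedures of the surveyed algorithms.

```python
import numpy as np

def projected_gradient_H(X, U, H, step=1e-3, n_iter=200):
    """One inner loop of alternating minimization: update H with U fixed.
    Objective: 0.5 * ||X - U H||_F^2, i.e., the SED cost in (4)."""
    for _ in range(n_iter):
        grad = U.T @ (U @ H - X)      # gradient of the SED cost with respect to H
        H = H - step * grad           # additive (gradient-descent) update
        H = np.maximum(H, 0.0)        # projection: set negative entries to zero
    return H

# Toy usage with random non-negative data (illustrative assumption).
rng = np.random.default_rng(0)
X = rng.random((50, 40))
U = rng.random((50, 5))
H = rng.random((5, 40))
H = projected_gradient_H(X, U, H)
```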
Nonnegative Matrix Factorization

Nonnegative Matrix Factorization (NMF) is a machine learning technique for re-expressing data in terms of basic factors in a lower-dimensional space. Its core idea is to represent the original data as the product of two nonnegative matrices.

The main use of NMF is text mining, in particular analyzing large collections of documents to identify document topics or conceptual relationships. It is also used for the analysis and representation of images and audio.

NMF guarantees that all entries in the data representation are nonnegative, which moves the analysis from a complex-valued space to a real-valued space and thereby noticeably reduces complexity. Moreover, since it is an unsupervised learning algorithm, it requires no user-specified direction, so it can discover unknown patterns and examine the associations of any particular feature.

NMF is an iterative procedure that decomposes the original data into two matrices: the first describes the composition of the individual elements in the data, and the second expresses the importance of those elements. Multiplying the two matrices back together recombines them into an approximation of the original data and provides useful information.

In short, NMF is a powerful tool for analyzing and extracting useful information from data while making otherwise complex computations easier to carry out. It helps users better understand large bodies of data, extract their patterns and features, and make discoveries in subsequent analysis.
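As a concrete illustration of the two-factor decomposition described above, here is a minimal, hedged example using scikit-learn's NMF class. The data, rank, and parameters are arbitrary choices for demonstration, not a recommended configuration.

```python
import numpy as np
from sklearn.decomposition import NMF

# A small nonnegative "document-term" style matrix (6 samples x 8 features).
X = np.random.rand(6, 8)

model = NMF(n_components=3, init="random", max_iter=500, random_state=0)
W = model.fit_transform(X)   # composition of each sample in terms of 3 basic factors
H = model.components_        # weight of each original feature within every factor

X_approx = W @ H             # multiplying the two factors approximately rebuilds X
print(np.linalg.norm(X - X_approx))
```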
An Improved Non-negative Matrix Factorization Method for Masquerade Detection

C. Mex-Perera, R. Posadas, J. A. Nolazco
Center for Electronics and Telecommunications, Tecnológico de Monterrey, Campus Monterrey, Av. Eugenio Garza Sada 2501 Sur, Col. Tecnológico, Monterrey, N.L., CP 64849, Mexico
{carlosmex, A00790428, jnolazco}@itesm.mx

R. Monroy, A. Soberanes, L. Trejo
Computer Science Department, Tecnológico de Monterrey, Campus Estado de México, Carretera al lago de Guadalupe, Km 3.5, Atizapán, Edo. Mex., 52926, Mexico
{raulm, A00966807, ltrejo}@itesm.mx

Abstract

A local-knowledge method for masquerade detection that uses a Non-negative Matrix Factorization (NMF) algorithm is proposed here. This method does not consider training data from other users to build a specific user's profile, but only the user's own data. A normalization phase is used that helps improve a previous NMF-based method by Wang et al. Comparisons with other local-knowledge methods such as Wang's, Hidden Markov Models (HMM) and the Eigen Co-occurrence Matrix (ECM) are presented. From the results shown by the Receiver Operating Characteristic (ROC) curves we conclude that our proposal is better than Wang et al.'s and that it performs better than ECM and HMM in desirable regions of operation.

1. Introduction

A masquerade attack is considered a serious and challenging problem in computer security. Basically, it may be described as the event where someone uses another user's account to enter a session and, by means of the account privileges, access programs and data; this includes read and write access, installation of malicious software and acquisition of system privileges. Masquerade attacks typically occur when an intruder obtains a legitimate user's password or when a user leaves their computer unattended without any sort of locking mechanism in place. This type of attack is hard to detect because, at first sight, everything seems to be done by the legitimate user. However, by analyzing user behavior we can distinguish between normal and abnormal profiles. The way to construct profiles is by creating system logs, for example time of login, location, files accessed, CPU usage, programs executed and so forth.

Current Host-based Intrusion Detection Systems (IDSs) can be divided into two categories: Misuse Intrusion Detection Systems (MIDS) and Anomaly Intrusion Detection Systems (AIDS). A MIDS is a type of IDS that searches for known patterns of malicious activity; such systems are signature-based and compare attacks against actual events on an IT system. For such systems to scale up, the signatures must be constantly updated with new attacks, which makes them vulnerable to bypass simply by varying the attack slightly.
An AIDS uses a profile of normal activity to distinguish it from abnormal activity. This type of IDS is more robust against new attacks, so anomaly-based intrusion detection is normally used to detect masqueraders.

According to the scope of the information used, detection mechanisms can be classified as local-knowledge or global-knowledge methods. In a local-knowledge method, the normal behavior of a user is determined based solely on his own data. On the other side, a global-knowledge method also considers data from all other users of the system. As examples of global-knowledge masquerade detectors we have [2, 4], while some examples of local-knowledge methods are [5, 7]. All these methods try to solve the masquerade detection problem by using the same Schonlau et al. (SEA) dataset [3], which consists of clean and contaminated data of 50 users. This collection was obtained using the acct auditing mechanism and it consists of 15000 UNIX commands per user, without arguments.

Global-knowledge methods tend to be better than local-knowledge ones; the reason is that they have more information that can be used to distinguish legitimate user behavior from masquerade activity. However, in certain information systems where the number of users is high, it is difficult to collect data from all users and deliver it to the detector. Besides, there are privacy concerns that would prevent the delivery of sensitive audit data to masquerade detection mechanisms.

Due to the above reasons, this paper proposes a local-knowledge method based on the Non-negative Matrix Factorization algorithm that outperforms others based on local-knowledge information. We compare against ECM [5], the NMF method by Wang et al. [7] and another detector based on Hidden Markov Models. In order to make a fair comparison, we test all those methods with the same SEA dataset and plot ROC curves to study the behavior of each method thoroughly. The ROC curves are plotted by varying a threshold from 0% to 100% and computing the false alarm and missing alarm rates.

In the next section we give an overview of how the methods discussed here work. In Section 3 a detailed explanation of the proposed method is given, finishing with the explanation of the results and conclusions.

2. Overview of Methods

In this section we give an overview of some known methods for masquerade detection. We will explain the fundamentals of NMF and how it was employed by Wang et al. Later, we will describe ECM, which is the best local-knowledge method reported in the literature so far. Finally, we give a brief introduction to the HMM-based detector, which has been extensively employed as a classifier in a wide range of applications such as intrusion detection systems, speech recognition, biometrics, etc.

2.1. Non-negative Matrix Factorization

NMF is a method developed by Lee and Seung [6] which, in contrast to other methods such as principal component analysis (PCA) and vector quantization (VQ), imposes non-negativity constraints. NMF is applied by Wang et al. to build normal behavior models in anomaly intrusion detection systems [7], where user behaviors are profiled by their frequency properties. In this section, the theory related to NMF is briefly exposed.

Given a database represented by an n×m matrix V, where each column is an n-dimensional non-negative local vector belonging to the original database (m local vectors), we can obtain an approximation of the whole database (V) by:
ter,we will describe ECM,which is the best local-knowledge method reported in the literature so far.Finally we give a brief introduction of HMM-based detector,which has been extensively employed as a classifier in a wide range of applications such as intrusion detection systems, speech recognition,biometrics,etc.2.1.Non-negative Matrix FactorizationNMF is a method developed by Lee and Seung[6]where in contrast to other methods such as principal component analysis(PCA)and vector quantization(VQ)it has non-negativity constraints.NMF is applied by Wang et.al.to build normal behavior models in anomaly intrusion detec-tion systems[7]where user behaviors are profiled by the frequency property.In this section,the theory related to NMF will be briefly exposed.Given a database represented by a n×m matrix V, where each column is an n-dimensional non-negative local vector belonging to the original database(m local vectors), we can obtain an approximation of the whole database(V) by:V iµ≈(W H)iµ= ra=1W ia H aµ(1)Where the dimensions of the matrix W and H are n x r and r x m,ually,r is chosen so that (n+m)r<nm.This results in a compressed version of the original data matrix.Each column of matrix W contains a basis vector while each column of H contains encoding co-efficients needed to approximate the corresponding column in V.The following iterative learning rules are used tofind the linear decomposition:H aµ←−H aµi(V iµ/(V H)iµ)W ia(2) W ia←−W iaµ(V iµ/(W H)iµ)H aµ(3)W ia←−W ia/jW ja(4) The above unsupervised multiplicative learning rules are used iteratively to update W and H.The initial values of W and H arefixed randomly.With these iterative updates, the quality of the approximation of Equation1improves monotonically with a guaranteed convergence to a locally optimal matrix factorization[6].In order to use the NMF algorithm,in[7]the SEA dataset is partitioned into sessions and the frequency of in-dividual elements(commands)in each session is counted. The NMF method uses the values of frequencies of com-mands seen in a session for a given user as entries for ma-trix V,where V iµis the number of times the i-th command appears in theµ-th session.By NMF,matrix V can be factorized into W and H. Columns of W represent the basis profiles and the columns of H are the encoding coefficients.If the features contained in any session deviate significantly from those of the normal behaviors learned in the training datasets,the factorization using the learned basis W will generate encoding coeffi-cients t that suffice|Ct −1|> (5) Where C is a normalization vector defined in[7]and is the detection threshold.The above expression indicates that the tested session belongs to a masquerader.2.2.Eigen Co-occurrence Matrix Method for Mas-querade DetectionEigen Co-occurrence Matrix(ECM)is a method that ex-tracts causal relationship features embedded in sequences of events(commands).According to[5]the sequences of events used in masquerade detection do not have well-defined syntactic rules,however there is a causal relation-ship between two events in sequences.One of the benefits of ECM is the adaptation to concep-tual drift,which is a change of behavior of a user over time. An effective intrusion detection system needs to adapt to this change while not adapting to intrusion behavior.ECM considers the strength of the causal relationship between events in terms of a distance between two events and their frequency of appearance in the event sequence. 
ECM represents this causal relationship with an event co-occurrence matrix, where each element represents the occurrence of an event pair in the event sequence. ECM uses principal component analysis (PCA) to extract essential features from each event co-occurrence matrix. It obtains a set of orthogonal axes (eigenvectors) called Eigen Co-occurrence Matrices. A profile that represents a user's behavior is created by using feature vectors extracted from the legitimate user sessions of the SEA dataset. Test data is also converted to a feature vector and classified as normal or intrusive by using a classification technique; a detailed account of that classification technique is not included here.

2.3. Hidden Markov Models

In this paper we include an AIDS based on Hidden Markov Models [8], which is an approach for pattern classification based on a network model that is created by learning the probability of each event emerging from each node and the probability of each transition between nodes. For the implementation of this detector we used the Hidden Markov Model Toolkit (HTK) [9].

Possible transitions are made successively from a starting node to a finishing node, and the transition and event probabilities can be multiplied at each transition to calculate the overall likelihood of all the output events produced along the transition path up to that point. When all transitions are finished, the HMM generates an event sequence according to the likelihood of a sequence being formed along each path. In other words, when a sequence is given, there are one or more transition paths that could have formed the sequence, and each path has a specific likelihood that the sequence was formed by it. The sum of the likelihoods obtained for all such transition paths is regarded as the likelihood that the sequence was generated by the HMM.

3. Proposed Method

As explained by Wang et al. [7], their method uses the frequencies of the commands seen in the sessions to form matrix V. The main difference between their proposal and ours is that we have included a normalization phase for matrix V.

3.1. Architecture

The architecture shown in Figure 1 has been proposed to improve the performance of the NMF method.

Figure 1. Architecture of the proposed method.

Normalization

The normalization phase takes into account the fact that commands commonly typed by the legitimate user in the training phase are less relevant for identifying masqueraders. Thus, a masquerader might use different commands to reach his objectives than those normally used by the legitimate user. The normalization phase modifies the commands' statistics by applying different weights to the entries of V. Commands that are very common in the training sessions of a given user will have lower weights than rare commands. Besides, we consider the number of times that commands not seen in the user's training data appear in a testing session. The weights are calculated as follows:

α_{iµ} = −log(p_i) / (L − N_µ)

where p_i is the probability of occurrence of the i-th command computed over all the training sessions of a given user, L is the number of commands in session µ, and N_µ is the number of times that commands not seen in the user's training sessions appear in the µ-th session. Note that for the training stage N_µ is equal to zero. Thus, the entries of V are multiplied by α_{iµ}:

A_{iµ} = α_{iµ} · V_{iµ}

Training and Testing

Matrix A is used as input for the NMF algorithm instead of matrix V as in the original method. Normalization is applied as a preprocessing stage in both the training and testing phases of our proposed method.
In the training phase, the NMF algorithm constructs the r columns of W, which can be seen as the basis profiles of a given user. Later, for testing purposes, W is used to compute an encoding profile h (a column vector) that approximates the feature vector a (a column vector) of a given test session:

â = W h

Decision

In our proposal, the decision about whether the considered test session belongs to the legitimate user or not is made in a different way from the work in [7]. First, a score is calculated as follows:

d = ||a − â||_2 (N_µ + 1)    (6)

It can easily be seen that the above expression is the Euclidean distance between vectors a and â scaled by the factor N_µ + 1. The Euclidean distance is a similarity measure that indicates how well the basis profiles approximate the features of a session. Small values of d are expected when sessions come from a legitimate user rather than a masquerader. The Euclidean distance is reinforced by the number of times that commands not seen in the training sessions appear in the testing session. These rare commands in the testing session would indicate an anomaly, and the session might have been created by a masquerader. Note that if all commands in the testing session have been previously seen in any of the training sessions, then Equation 6 reduces to the Euclidean distance d = ||a − â||_2. Finally, the score d is compared to a given threshold ε: if d > ε then we consider that the testing session belongs to a masquerader; otherwise the session is considered as created by the legitimate user.

4. Experiments and Results

The performance of the methods can be compared by means of the usual Receiver Operating Characteristic (ROC) curves, which are parametric curves generated by varying a threshold from 0% to 100% and computing the false alarm (false positive) and missing alarm (false negative) rates at each operating point. A false alarm occurs when the system falsely regards a legitimate user as a masquerader, while a missing alarm occurs when the system falsely regards a masquerader as a legitimate user. In general, there is a trade-off between the false alarm rate and the missing alarm rate. The lower and further left a curve is, the better it is.

We tried the proposed method using different values for the parameters; after some trials we selected the following values: r = 2, and the numbers of iterations for the NMF algorithm in the training, testing and updating phases are 50, 10 and 2, respectively. L is fixed to 100, which is the number of commands in each session.
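Putting the pieces of Section 3 together, a compact sketch of the per-session scoring described above might look as follows in Python/NumPy. The helper names and the way the encoding profile h is computed (simple multiplicative updates against the learned basis W) are assumptions made for illustration; they are not the exact iteration used in the experiments.

```python
import numpy as np

def score_session(session_cmds, W, train_probs, L=100, eps=1e-9, n_iter=10):
    """Score one test session following Eq. (6); larger d suggests a masquerader.
    train_probs maps each command seen in training to its probability p_i."""
    vocab = list(train_probs.keys())
    v = np.array([session_cmds.count(c) for c in vocab], dtype=float)
    N_mu = sum(1 for c in session_cmds if c not in train_probs)   # unseen commands
    # Normalization weights alpha_i = -log(p_i) / (L - N_mu), applied to the counts.
    p = np.array([train_probs[c] for c in vocab])
    a = (-np.log(p + eps) / (L - N_mu)) * v
    # Encoding profile h for the learned basis W (multiplicative updates: an assumption).
    h = np.full(W.shape[1], 0.1)
    for _ in range(n_iter):
        h *= (W.T @ a) / (W.T @ W @ h + eps)
    a_hat = W @ h
    d = np.linalg.norm(a - a_hat) * (N_mu + 1)    # Eq. (6)
    return d
```

The session is then flagged as a masquerade when d exceeds the chosen threshold ε, exactly as described in the Decision paragraph above.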
Since the ROC curve is a trade-off between missing alarm and false alarm rates the point of operation is chosen acord-ing to specific needs,however,so to speak a good point of operation would be the dot with about a missing alarm rate of10%and a false alarm rate of14%.5.ConclusionsWe may conclude that our NMF method is as good as ECM near to6.5%of false alarm rate.When false alarm rate is greater than10%,our method has the best perfor-mance of the methods shown here.From the ROC curves, we can conclude that we have significantly improved the method proposed by Wang et.al.by a missing alarm rate difference of10%for the same values of false alarm rates. This shows that modifications to the algorithm represent an improvement in performance for masquerade detectors.For improving masquerade detection,we have proposed a normalization phase,which is a simple but very effective method to reduce missing alarm and false alarm rates.De-spite the use of the normalization phase our NMF imple-mentation is still simple and the complexity remains lower than HMM or ECM.6.AcknowledgmentsThe authors would like to acknowledge the C´a tedra de Seguridad,ITESM,Campus Monterrey and Regional Fund for Digital Innovation in Latin America and the Caribbean (Grant VI),who supported this work.References[1]M.Schonlau,W.DuMouchel,W.-H.Ju,A.F.Karr,M.Theus,Y.Vardi,“Computer Intrusion:Detecting Mas-querades”,Statistical Science,V ol.16,No.1,2001, pp.58-74.[2]M.Schonlau and M.Theus,“Detecting Masquerades inIntrusion Detection Based on Unpopular Commands”, Information Processing Letters,76,2000,pp.33-38. [3]M.Schonlau,Masquerading user data.(MatthiasSchonlau’s home page). [4]tendresse,“Masquerade Detection via Cus-tomized Grammars”,DIMV A2005,LNCS3548, Springer-Verlag Berlin Heidelberg,2005,pp.141-159.[5]M.Oka,Y.Oyama,and K.Kato,“Anomaly de-tection using layered networks based on eigen co-occurrence matrix”,Proceedings of7th International Symposium on Recent Advances in Intrusion Detec-tion,(RAID),Sophia Antipolis,France,LNCS3224, Springer,September2004,pp.223-237[6]D. D.Lee and H.S.Seung,“Learning the partsof objects with nonnegative matrix factorization”Na-ture,1999,401:788-791.[7]W.Wang,X.Guan and X.Zhang,“Profiling Programand User Behaviors for Anomaly Intrusion Detection Based on Non-negative Matrix Factorization”43rd.IEEE Conference on Decision and Control,Atlantis, Paradise Island,Bahamas,December14-17,2004. [8]L.R.Rabiner,B.H.Juang,“An introduction to HiddenMarkov Models”.IEEE ASSP Magazine,1986.[9]Hidden Markov Model Toolkit(HTK)home page/。