Bayesian model selection and isocurvature perturbations
- Format: PDF
- Size: 160.65 KB
- Pages: 6
1254IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 54, NO. 6, JUNE 2009Cubature Kalman FiltersIenkaran Arasaratnam and Simon Haykin, Life Fellow, IEEEAbstract—In this paper, we present a new nonlinear filter for high-dimensional state estimation, which we have named the cubature Kalman filter (CKF). The heart of the CKF is a spherical-radial cubature rule, which makes it possible to numerically compute multivariate moment integrals encountered in the nonlinear Bayesian filter. Specifically, we derive a third-degree spherical-radial cubature rule that provides a set of cubature points scaling linearly with the state-vector dimension. The CKF may therefore provide a systematic solution for high-dimensional nonlinear filtering problems. The paper also includes the derivation of a square-root version of the CKF for improved numerical stability. The CKF is tested experimentally in two nonlinear state estimation problems. In the first problem, the proposed cubature rule is used to compute the second-order statistics of a nonlinearly transformed Gaussian random variable. The second problem addresses the use of the CKF for tracking a maneuvering aircraft. The results of both experiments demonstrate the improved performance of the CKF over conventional nonlinear filters. Index Terms—Bayesian filters, cubature rules, Gaussian quadrature rules, invariant theory, Kalman filter, nonlinear filtering.• Time update, which involves computing the predictive density(3)where denotes the history of input; is the measurement pairs up to time and the state transition old posterior density at time is obtained from (1). density • Measurement update, which involves computing the posterior density of the current stateI. INTRODUCTIONUsing the state-space model (1), (2) and Bayes’ rule we have (4) where the normalizing constant is given byIN this paper, we consider the filtering problem of a nonlinear dynamic system with additive noise, whose statespace model is defined by the pair of difference equations in discrete-time [1] (1) (2)is the state of the dynamic system at discrete where and are time ; is the known control input, some known functions; which may be derived from a compensator as in Fig. 1; is the measurement; and are independent process and measurement Gaussian noise sequences with zero and , respectively. means and covariances In the Bayesian filtering paradigm, the posterior density of the state provides a complete statistical description of the state at that time. On the receipt of a new measurement at time , we in update the old posterior density of the state at time two basic steps:Manuscript received July 02, 2008; revised July 02, 2008, August 29, 2008, and September 16, 2008. First published May 27, 2009; current version published June 10, 2009. This work was supported by the Natural Sciences & Engineering Research Council (NSERC) of Canada. Recommended by Associate Editor S. Celikovsky. The authors are with the Cognitive Systems Laboratory, Department of Electrical and Computer Engineering, McMaster University, Hamilton, ON L8S 4K1, Canada (e-mail: aienkaran@grads.ece.mcmaster.ca; haykin@mcmaster. ca). Color versions of one or more of the figures in this paper are available online at . Digital Object Identifier 10.1109/TAC.2009.2019800To develop a recursive relationship between the predictive density and the posterior density in (4), the inputs have to satisfy the relationshipwhich is also called the natural condition of control [2]. 
has sufficient This condition therefore suggests that information to generate the input . To be specific, the can be generated using . Under this condiinput tion, we may equivalently write (5) Hence, substituting (5) into (4) yields (6) as desired, where (7) and the measurement likelihood function obtained from (2). is0018-9286/$25.00 © 2009 IEEEARASARATNAM AND HAYKIN: CUBATURE KALMAN FILTERS1255Fig. 1. Signal-flow diagram of a dynamic state-space model driven by the feedback control input. The observer may employ a Bayesian filter. The label denotes the unit delay.The Bayesian filter solution given by (3), (6), and (7) provides a unified recursive approach for nonlinear filtering problems, at least conceptually. From a practical perspective, however, we find that the multi-dimensional integrals involved in (3) and (7) are typically intractable. Notable exceptions arise in the following restricted cases: 1) A linear-Gaussian dynamic system, the optimal solution for which is given by the celebrated Kalman filter [3]. 2) A discrete-valued state-space with a fixed number of states, the optimal solution for which is given by the grid filter (Hidden-Markov model filter) [4]. 3) A “Benes type” of nonlinearity, the optimal solution for which is also tractable [5]. In general, when we are confronted with a nonlinear filtering problem, we have to abandon the idea of seeking an optimal or analytical solution and be content with a suboptimal solution to the Bayesian filter [6]. In computational terms, suboptimal solutions to the posterior density can be obtained using one of two approximate approaches: 1) Local approach. Here, we derive nonlinear filters by fixing the posterior density to take a priori form. For example, we may assume it to be Gaussian; the nonlinear filters, namely, the extended Kalman filter (EKF) [7], the central-difference Kalman filter (CDKF) [8], [9], the unscented Kalman filter (UKF) [10], and the quadrature Kalman filter (QKF) [11], [12], fall under this first category. The emphasis on locality makes the design of the filter simple and fast to execute. 2) Global approach. Here, we do not make any explicit assumption about the posterior density form. For example, the point-mass filter using adaptive grids [13], the Gaussian mixture filter [14], and particle filters using Monte Carlo integrations with the importance sampling [15], [16] fall under this second category. Typically, the global methods suffer from enormous computational demands. Unfortunately, the presently known nonlinear filters mentioned above suffer from the curse of dimensionality [17] or divergence or both. The effect of curse of dimensionality may often become detrimental in high-dimensional state-space models with state-vectors of size 20 or more. The divergence may occur for several reasons including i) inaccurate or incomplete model of the underlying physical system, ii) informationloss in capturing the true evolving posterior density completely, e.g., a nonlinear filter designed under the Gaussian assumption may fail to capture the key features of a multi-modal posterior density, iii) high degree of nonlinearities in the equations that describe the state-space model, and iv) numerical errors. Indeed, each of the above-mentioned filters has its own domain of applicability and it is doubtful that a single filter exists that would be considered effective for a complete range of applications. 
For example, the EKF, which has been the method of choice for nonlinear filtering problems in many practical applications for the last four decades, works well only in a ‘mild’ nonlinear environment owing to the first-order Taylor series approximation for nonlinear functions. The motivation for this paper has been to derive a more accurate nonlinear filter that could be applied to solve a wide range (from low to high dimensions) of nonlinear filtering problems. Here, we take the local approach to build a new filter, which we have named the cubature Kalman filter (CKF). It is known that the Bayesian filter is rendered tractable when all conditional densities are assumed to be Gaussian. In this case, the Bayesian filter solution reduces to computing multi-dimensional integrals, whose integrands are all of the form nonlinear function Gaussian. The CKF exploits the properties of highly efficient numerical integration methods known as cubature rules for those multi-dimensional integrals [18]. With the cubature rules at our disposal, we may describe the underlying philosophy behind the derivation of the new filter as nonlinear filtering through linear estimation theory, hence the name “cubature Kalman filter.” The CKF is numerically accurate and easily extendable to high-dimensional problems. The rest of the paper is organized as follows: Section II derives the Bayesian filter theory in the Gaussian domain. Section III describes numerical methods available for moment integrals encountered in the Bayesian filter. The cubature Kalman filter, using a third-degree spherical-radial cubature rule, is derived in Section IV. Our argument for choosing a third-degree rule is articulated in Section V. We go on to derive a square-root version of the CKF for improved numerical stability in Section VI. The existing sigma-point approach is compared with the cubature method in Section VII. We apply the CKF in two nonlinear state estimation problems in Section VIII. Section IX concludes the paper with a possible extension of the CKF algorithm for a more general setting.1256IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 54, NO. 6, JUNE 2009II. BAYESIAN FILTER THEORY IN THE GAUSSIAN DOMAIN The key approximation taken to develop the Bayesian filter theory under the Gaussian domain is that the predictive density and the filter likelihood density are both Gaussian, which eventually leads to a Gaussian posterior den. The Gaussian is the most convenient and widely sity used density function for the following reasons: • It has many distinctive mathematical properties. — The Gaussian family is closed under linear transformation and conditioning. — Uncorrelated jointly Gaussian random variables are independent. • It approximates many physical random phenomena by virtue of the central limit theorem of probability theory (see Sections 5.7 and 6.7 in [19] for more details). Under the Gaussian approximation, the functional recursion of the Bayesian filter reduces to an algebraic recursion operating only on means and covariances of various conditional densities encountered in the time and the measurement updates. A. Time Update In the time update, the Bayesian filter computes the mean and the associated covariance of the Gaussian predictive density as follows: (8) where is the statistical expectation operator. Substituting (1) into (8) yieldsTABLE I KALMAN FILTERING FRAMEWORKB. Measurement Update It is well known that the errors in the predicted measurements are zero-mean white sequences [2], [20]. 
Under the assumption that these errors can be well approximated by the Gaussian, we write the filter likelihood density (12) where the predicted measurement (13) and the associated covariance(14) Hence, we write the conditional Gaussian density of the joint state and the measurement(15) (9) where the cross-covariance is assumed to be zero-mean and uncorrelated Because with the past measurements, we get (16) On the receipt of a new measurement , the Bayesian filter from (15) yielding computes the posterior density (17) (10) where is the conventional symbol for a Gaussian density. Similarly, we obtain the error covariance where (18) (19) (20) If and are linear functions of the state, the Bayesian filter under the Gaussian assumption reduces to the Kalman filter. Table I shows how quantities derived above are called in the Kalman filtering framework. The signal-flow diagram in Fig. 2 summarizes the steps involved in the recursion cycle of the Bayesian filter. The heart of the Bayesian filter is therefore how to compute Gaussian(11)ARASARATNAM AND HAYKIN: CUBATURE KALMAN FILTERS1257Fig. 2. Signal-flow diagram of the recursive Bayesian filter under the Gaussian assumption, where “G-” stands for “Gaussian-.”weighted integrals whose integrands are all of the form nonGaussian density that are present in (10), linear function (11), (13), (14) and (16). The next section describes numerical integration methods to compute multi-dimensional weighted integrals. III. REVIEW ON NUMERICAL METHODS FOR MOMENT INTEGRALS Consider a multi-dimensional weighted integral of the form (21) is some arbitrary function, is the region of where for all integration, and the known weighting function . In a Gaussian-weighted integral, for example, is a Gaussian density and satisfies the nonnegativity condition in the entire region. If the solution to the above integral (21) is difficult to obtain, we may seek numerical integration methods to compute it. The basic task of numerically computing the integral (21) is to find a set of points and weights that approximates by a weighted sum of function evaluations the integral (22) The methods used to find can be divided into product rules and non-product rules, as described next. A. Product Rules ), we For the simplest one-dimensional case (that is, may apply the quadrature rule to compute the integral (21) numerically [21], [22]. In the context of the Bayesian filter, we mention the Gauss-Hermite quadrature rule; when the is in the form of a Gaussian density weighting functionis well approximated by a polynomial and the integrand in , the Gauss-Hermite quadrature rule is used to compute the Gaussian-weighted integral efficiently [12]. The quadrature rule may be extended to compute multidimensional integrals by successively applying it in a tensorproduct of one-dimensional integrals. Consider an -point per dimension quadrature rule that is exact for polynomials of points for functional degree up to . We set up a grid of evaluations and numerically compute an -dimensional integral while retaining the accuracy for polynomials of degree up to only. Hence, the computational complexity of the product quadrature rule increases exponentially with , and therefore , suffers from the curse of dimensionality. Typically for the product Gauss-Hermite quadrature rule is not a reasonable choice to approximate a recursive optimal Bayesian filter. B. 
Non-Product Rules To mitigate the curse of dimensionality issue in the product rules, we may seek non-product rules for integrals of arbitrary dimensions by choosing points directly from the domain of integration [18], [23]. Some of the well-known non-product rules include randomized Monte Carlo methods [4], quasi-Monte Carlo methods [24], [25], lattice rules [26] and sparse grids [27]–[29]. The randomized Monte Carlo methods evaluate the integration using a set of equally-weighted sample points drawn randomly, whereas in quasi-Monte Carlo methods and lattice rules the points are generated from a unit hyper-cube region using deterministically defined mechanisms. On the other hand, the sparse grids based on Smolyak formula in principle, combine a quadrature (univariate) routine for high-dimensional integrals more sophisticatedly; they detect important dimensions automatically and place more grid points there. Although the non-product methods mentioned here are powerful numerical integration tools to compute a given integral with a prescribed accuracy, they do suffer from the curse of dimensionality to certain extent [30].1258IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 54, NO. 6, JUNE 2009C. Proposed Method In the recursive Bayesian estimation paradigm, we are interested in non-product rules that i) yield reasonable accuracy, ii) require small number of function evaluations, and iii) are easily extendable to arbitrarily high dimensions. In this paper we derive an efficient non-product cubature rule for Gaussianweighted integrals. Specifically, we obtain a third-degree fullysymmetric cubature rule, whose complexity in terms of function evaluations increases linearly with the dimension . Typically, a set of cubature points and weights are chosen so that the cubature rule is exact for a set of monomials of degree or less, as shown by (23)Gaussian density. Specifically, we consider an integral of the form (24)defined in the Cartesian coordinate system. To compute the above integral numerically we take the following two steps: i) We transform it into a more familiar spherical-radial integration form ii) subsequently, we propose a third-degree spherical-radial rule. A. Transformation In the spherical-radial transformation, the key step is a change of variable from the Cartesian vector to a radius and with , so direction vector as follows: Let for . Then the integral (24) can be that rewritten in a spherical-radial coordinate system as (25) is the surface of the sphere defined by and is the spherical surface measure or the area element on . We may thus write the radial integral (26) is defined by the spherical integral with the unit where weighting function (27) The spherical and the radial integrals are numerically computed by the spherical cubature rule (Section IV-B below) and the Gaussian quadrature rule (Section IV-C below), respectively. Before proceeding further, we introduce a number of notations and definitions when constructing such rules as follows: • A cubature rule is said to be fully symmetric if the following two conditions hold: implies , where is any point obtainable 1) from by permutations and/or sign changes of the coordinates of . on the region . That is, all points in 2) the fully symmetric set yield the same weight value. For example, in the one-dimensional space, a point in the fully symmetric set implies that and . • In a fully symmetric region, we call a point as a generator , where if , . The new should not be confused with the control input . 
zero coordinates and use • For brevity, we suppress to represent a complete fully the notation symmetric set of points that can be obtained by permutating and changing the sign of the generator in all possible ways. Of course, the complete set entails where; are non-negative integers and . Here, an important quality criterion of a cubature rule is its degree; the higher the degree of the cubature rule is, the more accurate solution it yields. To find the unknowns of the cubature rule of degree , we solve a set of moment equations. However, solving the system of moment equations may be more tedious with increasing polynomial degree and/or dimension of the integration domain. For example, an -point cubature rule entails unknown parameters from its points and weights. In general, we may form a system of equations with respect to unknowns from distinct monomials of degree up to . For the nonlinear system to have at least one solution (in this case, the system is said to be consistent), we use at least as many unknowns as equations [31]. That is, we choose to be . Suppose we obtain a cu. In this case, we solve bature rule of degree three for nonlinear moment equations; the re) sulting rule may consist of more than 85 ( weighted cubature points. To reduce the size of the system of algebraically independent equations or equivalently the number of cubature points markedly, Sobolev proposed the invariant theory in 1962 [32] (see also [31] and the references therein for a recent account of the invariant theory). The invariant theory, in principle, discusses how to restrict the structure of a cubature rule by exploiting symmetries of the region of integration and the weighting function. For example, integration regions such as the unit hypercube, the unit hypersphere, and the unit simplex exhibit symmetry. Hence, it is reasonable to look for cubature rules sharing the same symmetry. For the case considered above and ), using the invariant theory, we may con( cubature points struct a cubature rule consisting of by solving only a pair of moment equations (see Section IV). Note that the points and weights of the cubature rule are in. Hence, they can be computed dependent of the integrand off-line and stored in advance to speed up the filter execution. where IV. CUBATURE KALMAN FILTER As described in Section II, nonlinear filtering in the Gaussian domain reduces to a problem of how to compute integrals, whose integrands are all of the form nonlinear functionARASARATNAM AND HAYKIN: CUBATURE KALMAN FILTERS1259points when are all distinct. For example, represents the following set of points:Here, the generator is • We use . set B. Spherical Cubature Rule. to denote the -th point from theWe first postulate a third-degree spherical cubature rule that takes the simplest structure due to the invariant theory (28) The point set due to is invariant under permutations and sign changes. For the above choice of the rule (28), the monomials with being an odd integer, are integrated exactly. In order that this rule is exact for all monomials of degree up to three, it remains to require that the rule is exact , 2. Equivalently, to for all monomials for which find the unknown parameters and , it suffices to consider , and due to the fully symmonomials metric cubature rule (29) (30) where the surface area of the unit sphere with . Solving (29) and (30) , and . Hence, the cubature points are yields located at the intersection of the unit sphere and its axes. C. 
Radial Rule We next propose a Gaussian quadrature for the radial integration. The Gaussian quadrature is known to be the most efficient numerical method to compute a one-dimensional integration [21], [22]. An -point Gaussian quadrature is exact and constructed as up to polynomials of degree follows: (31) where is a known weighting function and non-negative on ; the points and the associated weights the interval are unknowns to be determined uniquely. In our case, a comparison of (26) and (31) yields the weighting function and and , respecthe interval to be tively. To transform this integral into an integral for which the solution is familiar, we make another change of variable via yielding. The integral on the right-hand side of where (32) is now in the form of the well-known generalized GaussLaguerre formula. The points and weights for the generalized Gauss-Laguerre quadrature are readily obtained as discussed elsewhere [21]. A first-degree Gauss-Laguerre rule is exact for . Equivalently, the rule is exact for ; it . is not exact for odd degree polynomials such as Fortunately, when the radial-rule is combined with the spherical rule to compute the integral (24), the (combined) spherical-radial rule vanishes for all odd-degree polynomials; the reason is that the spherical rule vanishes by symmetry for any odd-degree polynomial (see (25)). Hence, the spherical-radial rule for (24) is exact for all odd degrees. Following this argument, for a spherical-radial rule to be exact for all third-degree polyno, it suffices to consider the first-degree genermials in alized Gauss-Laguerre rule entailing a single point and weight. We may thus write (33) where the point is chosen to be the square-root of the root of the first-order generalized Laguerre polynomial, which is orthogonal with respect to the modified weighting function ; subsequently, we find by solving the zeroth-order moment equation appropriately. In this case, we , and . A detailed account have of computing the points and weights of a Gaussian quadrature with the classical and nonclassical weighting function is presented in [33]. D. Spherical-Radial Rule In this subsection, we describe two useful results that are used to i) combine the spherical and radial rule obtained separately, and ii) extend the spherical-radial rule for a Gaussian weighted integral. The respective results are presented as two propositions: Proposition 4.1: Let the radial integral be computed numer-point Gaussian quadrature rule ically by theLet the spherical integral be computed numerically by the -point spherical ruleThen, an by-point spherical-radial cubature rule is given(34) Proof: Because cubature rules are devised to be exact for a subspace of monomials of some degree, we consider an integrand of the form(32)1260IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 54, NO. 6, JUNE 2009where are some positive integers. Hence, we write the integral of interestwhereFor the moment, we assume the above integrand to be a mono. Making the mial of degree exactly; that is, change of variable as described in Section IV-A, we getWe use the cubature-point set to numerically compute integrals (10), (11), and (13)–(16) and obtain the CKF algorithm, details of which are presented in Appendix A. Note that the above cubature-point set is now defined in the Cartesian coordinate system. V. IS THERE A NEED FOR HIGHER-DEGREE CUBATURE RULES? 
In this section, we emphasize the importance of third-degree cubature rules over higher-degree rules (degree more than three), when they are embedded into the cubature Kalman filtering framework for the following reasons: • Sufficient approximation. The CKF recursively propagates the first two-order moments, namely, the mean and covariance of the state variable. A third-degree cubature rule is also constructed using up to the second-order moment. Moreover, a natural assumption for a nonlinearly transformed variable to be closed in the Gaussian domain is that the nonlinear function involved is reasonably smooth. In this case, it may be reasonable to assume that the given nonlinear function can be well-approximated by a quadratic function near the prior mean. Because the third-degree rule is exact up to third-degree polynomials, it computes the posterior mean accurately in this case. However, it computes the error covariance approximately; for the covariance estimate to be more accurate, a cubature rule is required to be exact at least up to a fourth degree polynomial. Nevertheless, a higher-degree rule will translate to higher accuracy only if the integrand is well-behaved in the sense of being approximated by a higher-degree polynomial, and the weighting function is known to be a Gaussian density exactly. In practice, these two requirements are hardly met. However, considering in the cubature Kalman filtering framework, our experience with higher-degree rules has indicated that they yield no improvement or make the performance worse. • Efficient and robust computation. The theoretical lower bound for the number of cubature points of a third-degree centrally symmetric cubature rule is given by twice the dimension of an integration region [34]. Hence, the proposed spherical-radial cubature rule is considered to be the most efficient third-degree cubature rule. Because the number of points or function evaluations in the proposed cubature rules scales linearly with the dimension, it may be considered as a practical step for easing the curse of dimensionality. According to [35] and Section 1.5 in [18], a ‘good’ cubature rule has the following two properties: (i) all the cubature points lie inside the region of integration, and (ii) all the cubature weights are positive. The proposed rule equal, positive weights for an -dimensional entails unbounded region and hence belongs to a good cubature family. Of course, we hardly find higher-degree cubature rules belonging to a good cubature family especially for high-dimensional integrations.Decomposing the above integration into the radial and spherical integrals yieldsApplying the numerical rules appropriately, we haveas desired. As we may extend the above results for monomials of degree less than , the proposition holds for any arbitrary integrand that can be written as a linear combination of monomials of degree up to (see also [18, Section 2.8]). Proposition 4.2: Let the weighting functions and be and . such that , we Then for every square matrix have (35) Proof: Consider the left-hand side of (35). Because a positive definite matrix, we factorize to be , we get Making a change of variable via is .which proves the proposition. For the third-degree spherical-radial rule, and . Hence, it entails a total of cubature points. 
Using the above propositions, we extend this third-degree spherical-radial rule to compute a standard Gaussian weighted integral as follows:ARASARATNAM AND HAYKIN: CUBATURE KALMAN FILTERS1261In the final analysis, the use of higher-degree cubature rules in the design of the CKF may marginally improve its performance at the expense of a reduced numerical stability and an increased computational cost. VI. SQUARE-ROOT CUBATURE KALMAN FILTER This section addresses i) the rationale for why we need a square-root extension of the standard CKF and ii) how the square-root solution can be developed systematically. The two basic properties of an error covariance matrix are i) symmetry and ii) positive definiteness. It is important that we preserve these two properties in each update cycle. The reason is that the use of a forced symmetry on the solution of the matrix Ricatti equation improves the numerical stability of the Kalman filter [36], whereas the underlying meaning of the covariance is embedded in the positive definiteness. In practice, due to errors introduced by arithmetic operations performed on finite word-length digital computers, these two properties are often lost. Specifically, the loss of the positive definiteness may probably be more hazardous as it stops the CKF to run continuously. In each update cycle of the CKF, we mention the following numerically sensitive operations that may catalyze to destroy the properties of the covariance: • Matrix square-rooting [see (38) and (43)]. • Matrix inversion [see (49)]. • Matrix squared-form amplifying roundoff errors [see (42), (47) and (48)]. • Substraction of the two positive definite matrices present in the covariant update [see (51)]. Moreover, some nonlinear filtering problems may be numerically ill-conditioned. For example, the covariance is likely to turn out to be non-positive definite when i) very accurate measurements are processed, or ii) a linear combination of state vector components is known with greater accuracy while other combinations are essentially unobservable [37]. As a systematic solution to mitigate ill effects that may eventually lead to an unstable or even divergent behavior, the logical procedure is to go for a square-root version of the CKF, hereafter called square-root cubature Kalman filter (SCKF). The SCKF essentially propagates square-root factors of the predictive and posterior error covariances. Hence, we avoid matrix square-rooting operations. In addition, the SCKF offers the following benefits [38]: • Preservation of symmetry and positive (semi)definiteness of the covariance. Improved numerical accuracy owing to the fact that , where the symbol denotes the condition number. • Doubled-order precision. To develop the SCKF, we use (i) the least-squares method for the Kalman gain and (ii) matrix triangular factorizations or triangularizations (e.g., the QR decomposition) for covariance updates. The least-squares method avoids to compute a matrix inversion explicitly, whereas the triangularization essentially computes a triangular square-root factor of the covariance without square-rooting a squared-matrix form of the covariance. 
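The triangularization step just described is easy to see in isolation. Below is a minimal NumPy sketch (not taken from the paper; the matrix names and toy sizes are illustrative) of how a QR decomposition yields a triangular square-root factor of a covariance without ever forming the squared matrix.

```python
import numpy as np

def sqrt_factor_via_qr(A):
    """Return a lower-triangular S with S @ S.T == A @ A.T.

    A is an n x m matrix whose columns are weighted, centered points,
    so that the covariance of interest is A @ A.T. The QR decomposition
    A.T = Q R gives A A.T = R.T R, so S = R.T is a square-root factor,
    obtained without squaring A (better conditioning, as noted above).
    """
    _, r = np.linalg.qr(A.T, mode="reduced")
    return r.T

# toy check with 2n = 6 centered, equally weighted points in 3 dimensions
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 6)) / np.sqrt(6)
S = sqrt_factor_via_qr(A)
assert np.allclose(S @ S.T, A @ A.T)
```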
Appendix B presents the SCKF algorithm, where all of the steps can be deduced directly from the CKF except for the update of the posterior error covariance; hence we derive it in a squared-equivalent form of the covariance in the appendix.The computational complexity of the SCKF in terms of flops, grows as the cube of the state dimension, hence it is comparable to that of the CKF or the EKF. We may reduce the complexity significantly by (i) manipulating sparsity of the square-root covariance carefully and (ii) coding triangularization algorithms for distributed processor-memory architectures. VII. A COMPARISON OF UKF WITH CKF Similarly to the CKF, the unscented Kalman filter (UKF) is another approximate Bayesian filter built in the Gaussian domain, but uses a completely different set of deterministic weighted points [10], [39]. To elaborate the approach taken in the UKF, consider an -dimensional random variable having with mean and covariance a symmetric prior density , within which the Gaussian is a special case. Then a set of sample points and weights, are chosen to satisfy the following moment-matching conditions:Among many candidate sets, one symmetrically distributed sample point set, hereafter called the sigma-point set, is picked up as follows:where and the -th column of a matrix is denoted ; the parameter is used to scale the spread of sigma points by from the prior mean , hence the name “scaling parameter”. Due to its symmetry, the sigma-point set matches the skewness. Moreover, to capture the kurtosis of the prior density closely, it is sug(Appendix I of [10], gested that we choose to be [39]). This choice preserves moments up to the fifth order exactly in the simple one-dimensional Gaussian case. In summary, the sigma-point set is chosen to capture a number as correctly as of low-order moments of the prior density possible. Then the unscented transformation is introduced as a method that are related to of computing posterior statistics of by a nonlinear transformation . It approximates the mean and the covariance of by a weighted sum of projected space, as shown by sigma points in the(36)(37)。
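To make the contrast between cubature points and sigma points concrete, here is a minimal sketch of the third-degree spherical-radial rule described above: 2n points placed at plus or minus the square root of n along the columns of a square-root factor of the covariance, each with weight 1/(2n), used to approximate the first two moments of a nonlinearly transformed Gaussian variable. The nonlinear function and the numbers are placeholders, not the paper's experiments.

```python
import numpy as np

def cubature_points(mean, cov):
    """2n third-degree cubature points for N(mean, cov), equal weights 1/(2n)."""
    n = mean.size
    S = np.linalg.cholesky(cov)                            # any square-root factor works
    xi = np.sqrt(n) * np.hstack([np.eye(n), -np.eye(n)])   # points for the standard Gaussian
    return mean[:, None] + S @ xi                          # shape (n, 2n)

def cubature_transform(f, mean, cov):
    """Approximate mean and covariance of y = f(x) for x ~ N(mean, cov)."""
    pts = cubature_points(mean, cov)
    ys = np.array([f(pts[:, i]) for i in range(pts.shape[1])]).T   # (m, 2n)
    y_mean = ys.mean(axis=1)                                       # equal weights 1/(2n)
    dy = ys - y_mean[:, None]
    y_cov = dy @ dy.T / ys.shape[1]
    return y_mean, y_cov

# illustrative example: a mildly nonlinear map of a 2-D Gaussian
f = lambda x: np.array([x[0] * np.cos(x[1]), x[0] * np.sin(x[1])])
m, P = np.array([1.0, 0.5]), np.diag([0.1, 0.05])
print(cubature_transform(f, m, P))
```

Replacing the equally weighted 2n-point set with the 2n+1 sigma points and their weights turns the same routine into the unscented transformation, which is exactly the comparison drawn in Section VII.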
These notes are purely meant as a primer.
I use the simplest possible linear DSGE model, with only two equations.
First I find the saddle-path policy function by hand, and then work out the impulse response function, also by hand.
I then redo all of these computations in Dynare; the code and the results are written up in the note.
The screenshot above is taken from my note. This DSGE model is really just a linear rational expectations model (LREM); indeed, once linearized, any DSGE model is essentially an LRE system.
Although the model in the note is very simple, the point is to show how Dynare can be used to study this dynamic system in depth.
There are a few things you should try for yourself (a rough sketch of the baseline computation follows after this list): 1. Change the calibration of beta and rho many times and see how the IRFs respond. 2. When beta and rho are both greater than 1, how should the model be modified to keep it stable? 3. See what changing the standard deviation of the shock does to the model. 4. What happens if you add one more equation, and what kind of equation would you add? I have experimented with all of the above.
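Here is a rough Python sketch, under assumptions of my own, of the kind of two-equation system the note works with: a purely forward-looking variable driven by an AR(1) shock, x_t = beta·E_t[x_{t+1}] + z_t and z_t = rho·z_{t-1} + eps_t. For |beta·rho| < 1 the bounded (saddle-path) solution is x_t = z_t/(1 - beta·rho), and the impulse response follows immediately. This is not necessarily the exact model in the note; it is only meant to show what Dynare computes behind the scenes.

```python
import numpy as np

# hypothetical two-equation linear model (not necessarily the one in the note):
#   x_t = beta * E_t[x_{t+1}] + z_t
#   z_t = rho * z_{t-1} + eps_t,   eps_t ~ N(0, sigma^2)
beta, rho, sigma = 0.99, 0.9, 0.01

# saddle-path (bounded) solution: guess x_t = a * z_t, so
#   a = beta * a * rho + 1  =>  a = 1 / (1 - beta * rho)
a = 1.0 / (1.0 - beta * rho)

# impulse response to a one-standard-deviation shock at t = 0
horizon = 20
z_irf = sigma * rho ** np.arange(horizon)
x_irf = a * z_irf

for t in range(horizon):
    print(f"t={t:2d}  z={z_irf[t]: .5f}  x={x_irf[t]: .5f}")
```

Re-running this with different values of beta and rho (exercise 1) only changes the impact coefficient a and the decay rate rho, which is exactly the pattern the Dynare IRFs display.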
I cannot run these experiments for you; you have to try them yourself.
Doing so will give you a lot of intuition about dynamic systems.
After that, I estimate the parameters by maximum likelihood, and then deliberately construct an under-identification problem so that you can see what the results look like.
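To see where under-identification can come from in such a small model, note that in the hypothetical system sketched above the observed series satisfies x_t = rho·x_{t-1} + eps_t/(1 - beta·rho), so beta and the shock standard deviation sigma enter the likelihood only through the combination sigma/(1 - beta·rho); they cannot be identified separately. A minimal check (again for the hypothetical model, not the note's exact setup):

```python
import numpy as np

def loglik(x, beta, rho, sigma):
    """Exact Gaussian log-likelihood of x_t = rho*x_{t-1} + u_t,
    u_t ~ N(0, (sigma/(1-beta*rho))**2), conditioning on x_0."""
    s = sigma / (1.0 - beta * rho)
    resid = x[1:] - rho * x[:-1]
    return -0.5 * np.sum(np.log(2 * np.pi * s**2) + (resid / s) ** 2)

rng = np.random.default_rng(1)
rho, T = 0.9, 500
eps = 0.01 * rng.standard_normal(T) / (1 - 0.99 * rho)
x = np.zeros(T)
for t in range(1, T):
    x[t] = rho * x[t - 1] + eps[t]

# two parameter points with the same sigma/(1 - beta*rho) give the same likelihood
print(loglik(x, beta=0.99, rho=rho, sigma=0.01))
print(loglik(x, beta=0.50, rho=rho, sigma=0.01 * (1 - 0.50 * rho) / (1 - 0.99 * rho)))
```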
Finally comes the Bayesian estimation. I estimate only one parameter, using two parallel Markov chains and an extremely short simulation (only 500 draws, where 100,000 would be normal) in order to save time (it takes only about 50 seconds on my machine), so I did not let the computer run long Markov chains or many parallel chains.
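For readers who have never seen the sampler itself, this is roughly what one of those Markov chains does for a single parameter: a random-walk Metropolis step that accepts or rejects proposals based on the log posterior (in Dynare the log posterior is the Kalman-filter log-likelihood plus the log prior). The function name, step size, and starting values below are illustrative assumptions.

```python
import numpy as np

def rw_metropolis(log_posterior, theta0, n_draws=500, step=0.02, seed=0):
    """Random-walk Metropolis for a scalar parameter; returns the chain."""
    rng = np.random.default_rng(seed)
    chain = np.empty(n_draws)
    theta, lp = theta0, log_posterior(theta0)
    for i in range(n_draws):
        prop = theta + step * rng.standard_normal()   # propose a nearby value
        lp_prop = log_posterior(prop)
        if np.log(rng.uniform()) < lp_prop - lp:      # accept/reject
            theta, lp = prop, lp_prop
        chain[i] = theta
    return chain

# two "parallel" short chains from different starting points, as in the note
# (log_posterior is assumed to be defined elsewhere, e.g., via the Kalman filter):
# chains = [rw_metropolis(log_posterior, s, n_draws=500) for s in (0.85, 0.95)]
```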
The results are therefore very poor, but that is not the point.
The point is to give students who have never seen the full estimation procedure a view of the whole process.
That is also why I do not cover the theory, or only touch on it in passing.
For the Bayesian estimation, one thing to keep in mind is that the number of shocks must be greater than or equal to the number of observables; otherwise the model is stochastically singular and the Kalman filter cannot be used to evaluate the likelihood function.
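For completeness, here is a scalar sketch of the Kalman-filter log-likelihood for a state-space model of the form z_t = rho·z_{t-1} + eps_t, x_t = a·z_t (one shock, one observable, no measurement error), which is what the likelihood-based steps above rely on. With as many shocks as observables the innovation variance F stays positive; with more observables than shocks F would be singular and the likelihood below would be undefined, which is the stochastic-singularity problem just mentioned. The parameter names are my own.

```python
import numpy as np

def kalman_loglik(x, rho, a, sigma2):
    """Log-likelihood of observations x for z_t = rho*z_{t-1} + eps_t (variance sigma2),
    x_t = a * z_t; one shock, one observable, no measurement error."""
    z, P = 0.0, sigma2 / (1.0 - rho**2)   # start from the stationary distribution
    ll = 0.0
    for xt in x:
        z_pred, P_pred = rho * z, rho**2 * P + sigma2   # predict
        v = xt - a * z_pred                             # innovation
        F = a**2 * P_pred                               # innovation variance
        ll += -0.5 * (np.log(2 * np.pi * F) + v**2 / F)
        K = P_pred * a / F                              # Kalman gain
        z = z_pred + K * v                              # update
        P = P_pred - K * a * P_pred
    return ll
```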
IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 18, NO. 6, JUNE 20091215Sparse Image Reconstruction for Molecular ImagingMichael Ting, Member, IEEE, Raviv Raich, Member, IEEE, and Alfred O. Hero, III, Fellow, IEEEAbstract—The application that motivates this paper is molecular imaging at the atomic level. When discretized at subatomic distances, the volume is inherently sparse. Noiseless measurements from an imaging technology can be modeled by convolution of the image with the system point spread function (psf). Such is the case with magnetic resonance force microscopy (MRFM), an emerging technology where imaging of an individual tobacco mosaic virus was recently demonstrated with nanometer resolution. We also consider additive white Gaussian noise (AWGN) in the measurements. Many prior works of sparse estimators have focused on the has low coherence; however, the system matrix case when in our application is the convolution matrix for the system psf. A typical convolution matrix has high coherence. This paper, therefore, does not assume a low coherence . A discrete-continuous form of the Laplacian and atom at zero (LAZE) p.d.f. used by Johnstone and Silverman is formulated, and two sparse estimators derived by maximizing the joint p.d.f. of the observation and image conditioned on the hyperparameters. A thresholding rule that generalizes the hard and soft thresholding rule appears in the course of the derivation. This so-called hybrid thresholding rule, when used in the iterative thresholding framework, gives rise to the hybrid estimator, a generalization of the lasso. Estimates of the hyperparameters for the lasso and hybrid estimator are obtained via Stein’s unbiased risk estimate (SURE). A numerical study with a Gaussian psf and two sparse images shows that the hybrid estimator outperforms the lasso.HHHIndex Terms—Biomedical image processing, image restoration, magnetic force microscopy, sparse image prior, Stein’s unbiased risk estimate.I. INTRODUCTION HE structures of biological molecules like proteins and viri are of interest to the medical community [1]. Existing methods for imaging at the nanometer or even sub-nanometer scale include atomic force microscopy (AFM), electron microscopy (EM), and X-ray crystallography [2], [3]. At the sub-atomic scale, a molecule is naturally a sparse image. That is, the volume imaged consists of mostly space with a few locations occupied by atoms. The application in particular that motivates this paper is MRFM [4], a technology that potentially offers advantages not existent in currently usedManuscript received September 08, 2008; revised January 28, 2009. First published April 17, 2009; current version published May 13, 2009. This work was supported in part by the Army Research Office under Contract W911NF-05-1-0403. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Erik H. W. Meijering. M. Ting is with Seagate Technology, Bloomington, MN, 55435 USA (e-mail: m_ting@). R. Raich is with Oregon State University, Corvallis, OR 97331 USA (e-mail: raich@). A. O. Hero, III is with the University of Michigan, Ann Arbor, MI 48109 USA (e-mail: hero@). Color versions of one or more of the figures in this paper are available online at . Digital Object Identifier 10.1109/TIP.2009.2017156Tmethods. In particular, MRFM is nondestructive and capable of 3-D imaging. Recently, imaging of a biological sample with nanometer resolution was demonstrated [5]. 
Given that MRFM and indeed even AFM [6] measures the convolution of the image with a point spread function (psf), a deconvolution must be performed in order to obtain the molecular image. This paper considers the following problem: suppose one observes a linear transformation of a sparse image corrupted by AWGN. With only knowledge of the linear transformation and noise variance, the goal is to reconstruct the unknown sparse image. is the linear transformation that, in The system matrix the case of MRFM, represents convolution with the MRFM psf. Several prior works are only applicable when the system matrix has small pairwise correlation, i.e., low coherence or low collinearity [7]–[10]. Others assume that the columns of come from a specific random distribution, e.g., the uniform spherical ensemble (USE), or the uniform random projection ensemble (URPE) [11]. These assumptions are inapplicable when represents convolution with the MRFM psf. In general, a convolution matrix for a continuous psf would not have low coherence. Such is the case with MRFM. The coherence of the simulated MRFM psf used in the simulation study section is at least 0.557. The lasso, the estimator formed by maximizing the penalized likelihood criterion with a penalty on the image values [12], is known to promote sparsity in the estimate. The Bayesian interpretation of the lasso is the maximum a posteriori (MAP) estimate with an i.i.d. Laplacian p.d.f. on the image values [13]. i.i.d. samples of a Laplacian Consider the following: given distribution, the expected number of samples equal to 0 is zero. The Laplacian p.d.f. is more convincingly described as a heavytailed distribution rather than a sparse distribution. Indeed, when used in a suitable hierarchical model such as in sparse Bayesian learning [14], the Gaussian r.v., not commonly considered as a sparse distribution, results in a sparse estimator. While using a sparse prior is clearly not a necessary condition for formulating a sparse estimator, one wonders if a better sparse estimator can be formed if a sparse prior is used instead. In [15], the mixture of a Dirac delta and a symmetric, unimodal density with heavy tails is considered; a sparse denoising estimator is then obtained via marginal maximum likelihood (MML). The LAZE distribution is a specific member of the mixture family. Going through the same thought experiment previously mentioned with the LAZE distribution, one obtains an insamples equal 0, where is the tuitive result: weight placed on the Dirac delta. Unlike the Laplacian p.d.f., the LAZE p.d.f. is both heavy-tailed and sparse. Under certain conditions, the estimator achieves the asymptotic minimax risk to within a constant factor ([15], Theorem 1). The lasso estimator can be implemented in an iterative thresholding framework using the soft thresholding rule [16], [17]. Use of a thresh-1057-7149/$25.00 © 2009 IEEE1216IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 18, NO. 6, JUNE 2009olding rule based on the LAZE prior in the iterative thresholding framework can potentially result in better performance. This paper develops several methods to enable Bayes-optimal nanoscale molecular imaging. In particular, advances are made in these three areas. 1) First, we introduce a mixed discrete-continuous LAZE prior for use in the MAP/maximum likelihood (ML) framework. Knowing only that the image is sparse, but lacking any precise information on the sparsity level, selection of the hyperparameters or regularization parameters has to be empirical or data-driven. 
The sparse image and hyperparameters are jointly estimated by maximizing the joint p.d.f. of the observation and unknown sparse image conditioned on the hyperparameters. Two sparse Bernoulli–Laplacian MAP/ML estimators based on the discrete-continuous LAZE p.d.f. are introduced: MAP1 and MAP2. 2) The second contribution of the paper is the introduction of the hybrid estimator, which is formed by using the hybrid thresholding rule in the iterative thresholding framework. The hybrid thresholding rule is a generalization of the soft and hard thresholding rules. The disadvantage of the former is that it introduces bias in the estimate, while the disadvantage of the latter is that it is sensitive to small perturbations in the observation [18]. Other thresholding rules have been previously proposed, e.g., firm shrinkage [18], non-negative garrote [19], etc. It would be informative to compare the hybrid thresholding rules with the others mentioned; however, this comparison is outside the scope of this article. In order to apply the hybrid thresholding rule to the molecular imaging problem, it is necessary to estimate the hyperparameters in a data-driven fashion. 3) Third, SURE is applied to estimate the hyperparameter of lasso and of the hybrid estimator proposed above. The SURE-equipped versions of lasso and hybrid estimator are referred to as lasso-SURE and H-SURE. Our lasso-SURE result is a generalization of the results in [20], [21]. Alternative lasso hyperparameter selection methods exist, e.g., [22]. In [22], however, a prior is placed on the support of the image values that discourages the selection of high correlated columns of . Since the we consider has columns that are highly correlated, this predisposes a certain amount of separation between the support of the estimated image values, i.e., the sparse image estimate will be resolution limited. A number of other general-purpose techniques exist as well, e.g., cross validation (CV), generalized CV (GCV), MML [23]. Some are, however, more tractable than others. For example, a closed form expression of the marginal likelihood cannot be obtained for the Laplacian prior: approximations have to be made [13]. A simulation study is performed. In the first part, LS, oracular LS, SBL, stagewise orthogonal matching pursuit (StOMP), and the four proposed sparse estimators, are compared. Two image types (one binary-valued and another based on the LAZE p.d.f.) are studied under two signal-to-noise ratio (SNR) conditions (low and high). MAP2 has the best performance in the two low SNR cases. In one of the high SNR cases, H-SURE has the best performance, while in the other, SBL is arguably the bestperforming method. When the hyperparameters are estimated via SURE, H-SURE is sparser than lasso-SURE and achieves as well as lower detection error lower error for . In the second part of the numerical study, the performance of the proposed sparse estimators is studied across the range of SNRs between the low and high values considered in the first part. A 3-D reconstruction example is given in the third part, where the LS and lasso-SURE estimator are compared. This serves to demonstrate the applicability of lasso-SURE on a relatively large problem. A subset of results herein, e.g., Theorem 1, have been previously reported in [24] by the same authors. The paper is organized into the following sections. First, the sparse image deconvolution problem is formulated in Section II. The algorithms are discussed in Section III: there are three parts to this section. 
The two MAP/ML estimators based on the discrete-continuous LAZE prior are derived in Section III-A. This is followed by the introduction of the hybrid estimator in Section III-B. Stein’s unbiased risk estimate is applied in Section III-C to derive lasso-SURE and H-SURE. Section IV contains a numerical study comparing the proposed algorithms with several existing sparse reconstruction methods. A summary of the work and future directions in Section V concludes the paper.II. PROBLEM FORMULATION Consider a 2-D or 3-D image, and denote its vector version by . In this paper, is assumed to be sparse, viz., the percentage of nonzero is small. Suppose that the measurement is given by where (1)where is termed the system matrix, and is , AWGN. The problem considered can be stated as: given , estimate knowing that it is sparse. Without loss of and generality, one can assume that the columns of have unit norm. In the problem formulation, note that knowledge of the , is not known a priori. sparseness of , viz., It should be noted that, while the sparsity considered in (1) is in the natural basis of , a wavelet basis has been considered in other works, e.g., [21]. It may be possible to re-formulate (1) using some other basis so that the corresponding system matrix has low coherence. This question is beyond the scope of the paper. The emphasis here is on (1) and on sparsity in the nathad full column rank, an equivalent problem ural basis. If is invertible, (1) can be formulation is available. Since re-written as where (2)is the pseudoinverse of where ; and is colored Gaussian noise. Deconvolution of from in AWGN is, therefore, equivalent to denoising of in colored Gaussian noise. In the special case that is orthonormal, is also AWGN.TING et al.: SPARSE IMAGE RECONSTRUCTION FOR MOLECULAR IMAGING1217III. ALGORITHMS A. Bernoulli–Laplacian MAP/ML Sparse Estimators This section considers the case when the discrete-continuous , with and simultaneously i.i.d. LAZE prior is used for estimated via MAP/ML. In this subsection, denotes the hyper. The parameter of the LAZE prior given in (4), i.e., are used instead of respectively in contexts variables where their meaning is more intuitive. For the continuous disare obtained as the maximizers of the conditional tribution, , viz., density (3) If were constant, obtained from (3) would be the MAP estimate. If were constant, the resulting would be the ML estimate. Since these two principles are at work, it cannot be said that the estimates obtained via (3) are strictly MAP or ML. , where denotes the Suppose that LAZE p.d.f. The latter is given by (4) where is the Laplacian p.d.f. The Dirac delta function is difficult to work with in the context of maximizing the conditional p.d.f. in (3). Consider then a mixed discrete-continuous version of (4). Define the random variables and such that . The r.v.s have the following density: with probability with probability (5) (6) where is some p.d.f. that may or may not depend on : more are i.i.d. will be specified later on. It is assumed that assumes the role of the Dirac delta: its introduction necessitates in (6). Instead of (3), consider use of the auxiliary density the optimality criterion (7) Let and . The maximization of (7) is equivalent to the maximization of Block Coordinate Maximization of MAP Criterion Require: 1: 2: repeat 3: 4: 5: 6: until The p.d.f. arises as an extra degree of freedom due to the introduction of the indicator variables . Consider two cases: in (8). This will give rise to the algofirst, let rithm MAP1. 
Second, let be an arbitrary p.d.f. such that: for all ; 2) is attained for some 1) ; and 3) is independent of . By selecting that satisfies these three properties, the algorithm MAP2 is, thus, obtained. denote the function obtained 1) MAP1: Let . Step 4) of Algorithm 1 is determined by setting . This is solved as by the solution to and (9)is negative defIt can be verified that the Hessian inite for all and . Given samples drawn from a Laplacian p.d.f. , the ML estimate of is . The estimate in (9) is, therefore, the ML estimate of where all of the s are used. The maximization in step 5) of Algorithm 1 can be obtained by applying the EM algorithm [16]. Recall that EM can be apas the complete data, where plied using and . Denote by the estimates in the th EM iteration. The E-step is the Landweber iteration (10) Define the hybrid thresholding rule as (11) where and are restricted to . See Fig. 1. This is a generalization of the soft and hard thresholding rules. , and the hard The soft thresholding rule thresholding rule . The M-step of the EM algorithm is given by(8) (12) We propose to maximize (8) in a block coordinate-wise fashion . A superscript “ ” [25] via Algorithm 1. Note that attached to a variable indicates its value in the th iteration. where . Recall that . If , the soft-thresholding rule is applied in the -step of the EM iterations of MAP1. These iterations produce1218IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 18, NO. 6, JUNE 2009As with MAP1, the maximization in step 5) of Algorithm 1 can be obtained by applying the EM algorithm with the com. The E-step is given by (10), which is plete data the same as MAP1’s E-step. Defineand The resulting olding rule(17)in the M-step is given by the following thresh-(18)Fig. 1. Hybrid thresholding rule.the lasso estimate with hyperparameter . However, if , a larger thresholding value is used that increases the smaller becomes. w.p. 2) MAP2: From (6) and the assumptions on 1. Consequently, the set (13) w.p. 1. Apply (13) to the criterion This implies . to maximize, viz., (8), and denote the result by One gets, which is similar to the M-step of where MAP1. Indeed, the M-step of MAP1 can be obtained by setting . Just like in MAP1, the EM iterations of MAP2 produce a larger threshold the sparser the hyperparameter is. As well, if is smaller, increases. Since the variance of the Laplais , a smaller implies a larger variance of the cian Laplacian. Use of a larger threshold is, therefore, appropriate. can be regarded as an extra degree The tuning parameter . The of freedom that arises due to being independent of MAP2 M-step is a function of , and a suitable value has to be selected. In contrast, MAP1 has no free tuning parameter(s). B. Hybrid Thresholding Rule in the Iterative Framework Define the hybrid estimator to be the estimator formed by using the hybrid thresholding rule (11) in the iterative framework [16, (24)], viz. (19) and are the standard unit vectors. Due to the hybrid thresholding rule being a generalization of the soft thresholding rule, the hybrid estimator potentially offers better performance than lasso. Indeed, lasso performance can be achieved by fixing . Clearly, the performance of the hybrid estimator is dependent on the selection of the regularization parameter . This topic will be discussed in the next subsection. The cost function of the hybrid estimator is given in Proposition 1. Proposition 1: Consider the iterations (19) when and . 
The iterations minimize the cost function where(14) Recall that and obtained by solving for . The maximization in step (i) is , which produces and (15)As before, one can verify that the Hessian is negative definite for all and . It is instructive to compare the hyperparameter estimates of MAP1 versus MAP2, i.e., (9) versus (15). The main difference lies in the estimation of . Assuming that the estimates and obey (13), one can . This is the re-write the MAP2 estimate , i.e., the nonzero voxels. ML estimate using only the On the other hand, the MAP1 estimate of can be written as (16)where(20) Proof: This is an application of Theorem 3 in Appendix 1. , which gives rise to the When lasso estimator, as expected. The penalty term satisfies the conditions of ([26], Theorem 1). Therefore, in the sparse denoising problem, the hybridTING et al.: SPARSE IMAGE RECONSTRUCTION FOR MOLECULAR IMAGING1219thresholding rule has thresholding rules.risk comparable to the soft and hardC. Using SURE to Empirically Estimate the Hyperparameters In this section, SURE is applied to estimate the regularization parameter of lasso and the hybrid estimator. As in the previous subsection, denote the regularization parameter as . For lasso, , where is the thresholding parameter used in the soft . For the hybrid estimator, , thresholding rule where are the parameters used in the hybrid thresholding . rule Consider the risk measure (21) for lasso. Since is not known, this risk cannot be computed; however, one can compute an unbiased estimate of the risk [27]. : can then be estimated Denote the unbiased estimate by , where is the set of valid values. as When , an expression for is derived in ([20], (11)). , however, Stein’s unbiased estimate [27] cannot When be applied to evaluate (21). In [21], the alternative riskis said to order The permutation matrix the zero and nonzero components of if . Note that in the above definition is not unique. As is a permutation matrix, it is orthogonal. , let be a Definition 2: For a matrix nonzero sequence of length at most s.t. . Similarly, be nonzero sequence of length at most s.t. . let is such that . The submatrix Define and(25) and 0 otherwise. Recall that by assumption, so . Let denote the Gram matrix of . For a given , set (26) where(27) (22) is proposed instead. Equation (22) was evaluated for a diagonal in [21]. The first theorem in this section generalizes the result of for arbitrary full column rank . The [21] by developing second theorem in this section derives (22) when is the hybrid is also an arbitrary full column estimator. For this result, matrix. If the convolution matrix can be approximated by 2-D or 3-D circular convolution, the full column rank assumption is equivalent to the 2-D or 3-D DFT of the psf having no spectral nulls. The proofs of the two theorems are given in Appendix II. , let us drop the vector 1) SURE for Lasso: Since notation and write as . Theorem 1: Assume that the columns of are linearly independent, and is the lasso estimator. The unbiased risk estimate (22) is (23) is the reconstruction error. where Since the hyperparameter , it can be estimated via (24) where is given in (23). Least angle regression (LARS) can be used to efficiently compute (24), [28]. Note that LARS requires the linear independence of the columns of . The eswith obtained via (24) will be referred to as timator lasso-SURE. 2) SURE for the Hybrid Estimator: Several definitions are in order first. has . Definition 1: Suppose that Denote the nonzero components of by . 
where is a matrix that orders the zero and nonzero components of . for Since is a function of , denote in Theorem 2 below. Theorem 2: Suppose that the columns of are linearly indoes not have an eigenvalue of . dependent and that With denoting the hybrid estimator, the unbiased risk estimate (22) is(28) where is the matrix trace, and . To evaluate (28) for a particular , one would have to construct the matrix ; then, invert the matrix . If is sparse, is small, and the inversion would not be computationally demanding. The optimum is the that minimizes . The would be the output. This method will be corresponding referred to as Hybrid-SURE, or for short, H-SURE. IV. SIMULATION STUDY In Section IV-B, the following classes of methods are compared: (i) least-squares (LS) and oracular LS; (ii) the proposed sparse reconstruction methods; and (iii) other existent sparse methods, viz., SBL and StOMP. The LS solution is implemented via the Landweber algorithm . [29]. It provides a “worst-case” bound for the error, i.e., Since the LS estimate does not take into account the sparsity of , one would expect it to have worse performance than estimates that do. In the oracular LS method, on the other hand, one knows the support of , and regresses the measurement1220IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 18, NO. 6, JUNE 2009Fig. 2. Illustration of the two types of used in the simulations; as well, the Gaussian blur psf is shown. (a) Binary . (b) LAZE . (c) Gaussian blur psf.on the corresponding columns of [30]. The oracular LS estimate consequently provides a “best-case” bound for the error; however, the oracular LS estimate is unimplementable in reality, as it requires prior knowledge of the support of . The second class of methods includes the two MAP/ML variants, MAP1 and MAP2; in addition, lasso-SURE and H-SURE are also tested. Finally, in order to benchmark the proposed methods to other sparse methods, SBL and StOMP are included in the simulation study. The Sparselab toolbox is used to obtain the StOMP estimate. The CFAR and CFDR approaches to threshold selection are applied [11]. For CFAR selection, the per-iteration false alarm rate of 1/50 is used. For CFDR selection, the discovery rate is set to 0.5. Although a multitude of other sparse reconstruction methods exist, they are not included in the simulation study due to a lack of space. Two sparse images are investigated in Section IV-B: a binary-valued image, and an image based on the LAZE prior (4). The binary-valued image has 12 pixels set to one, and the rest are zero. The LAZE image, i.e., the image based on the LAZE prior, can be regarded as a realization of the LAZE prior with and . They are depicted in Fig. 2(a) and (b), respectively. The two images are of size 32 32, as is : so, . The matrix , of size 1024 1024, is the convolution matrix for the Gaussian blur point spread function (psf). In order to satisfy the requirements of Theorems 1 and 2, does not the columns of are linearly independent and have an eigenvalue of 1/2. The Gaussian blur is illustrated in Fig. 2(c). The Gaussian blur convolution matrix has columns that are . Let highly correlated: the coherence . The stability and support results of lasso all require that (29) where or 1/4 in order that some statement of recoverability holds [8]–[10], [30]. For a given , (29) places an upper for which recoverability of is assured in some bound onfor fashion. With the Gaussian blur and 1/2. Since , the simulation study is both outside of the coverage of existing recoverability theorems. 
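The coherence numbers above can be checked directly. The following is a small numerical sketch, not code from the paper: it builds a Gaussian-blur convolution matrix H with unit-norm columns for an assumed 16 x 16 image and an assumed blur width, computes the mutual coherence max over i ≠ j of |h_i' h_j|, and evaluates the standard coherence bound on sparsity, used here as a stand-in for (29). With a highly correlated H the bound allows only a trivial number of nonzero pixels, which is why the simulation falls outside existing recoverability guarantees.

    import numpy as np

    def gaussian_blur_matrix(n=16, sigma=1.5):
        """System matrix H of a 2-D circular Gaussian blur on an n-by-n image.
        Each column is the vectorized blurred image of a single unit pixel."""
        yy, xx = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
        cols = []
        for i in range(n):
            for j in range(n):
                # circular (wrap-around) distances to pixel (i, j)
                di = np.minimum(np.abs(yy - i), n - np.abs(yy - i))
                dj = np.minimum(np.abs(xx - j), n - np.abs(xx - j))
                psf = np.exp(-(di**2 + dj**2) / (2.0 * sigma**2))
                cols.append(psf.ravel())
        H = np.array(cols).T                      # shape (n*n, n*n)
        return H / np.linalg.norm(H, axis=0)      # unit-norm columns, as assumed in (1)

    H = gaussian_blur_matrix()
    G = H.T @ H                                   # Gram matrix
    mu = np.max(np.abs(G - np.diag(np.diag(G))))  # mutual coherence
    # One standard coherence-based recoverability bound (a stand-in for (29)):
    max_nonzeros = 0.5 * (1.0 + 1.0 / mu)
    print(f"coherence = {mu:.3f}, bound allows at most {max_nonzeros:.1f} nonzero pixels")

For a Gaussian blur of this width the coherence comes out close to 1, so the bound permits essentially a single nonzero pixel, far fewer than the 12 nonzero pixels of the binary image used in the simulations.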
In Section IV-B, the performance of the proposed sparse methods over a range of SNRs is investigated. The binary-valued image and Gaussian blur psf are considered in this section. In addition to the proposed sparse methods, the LS estimate is included as a point of reference. Last, a 3-D MRFM example of dimension 128 128 32 is given in Section IV.D comparing the LS estimate and lasso-SURE. This serves to illustrate the computational feasibility of lasso-SURE for a relatively large problem. The proposed algorithms are implemented as previously outof MAP2 is set to in lined. The tuning parameter Section IV-B and IV-C. LARS is used to compute the lassoSURE estimator. H-SURE is suboptimally implemented: the is obtained via two line searches. The minimizing direction in the plane, is first, along the done using lasso-SURE. A subsequent line search in the (1,0) direction is performed, i.e., is kept constant and is increased. , and the SNR in Define the SNR as . decibels as A. Error Criteria . Several error Recall that the reconstruction error criteria are considered in the performance assessment of a sparse estimator. for . • • The detection error criterion defined by (30) Values of such that are considered equivalent to 0. This is used to handle the effect of finite-precision computing. More importantly, it addresses the fact that, to the human observer, small nonzero values are not discernible is selected. from zero values. In the study, This error criterion is effectively a 0–1 penalty on the support of . Accurately determining the support of a sparse is more critical than its actual values [7], [31]. . One would • The number of nonzero values of , i.e., like , which is small if is indeed sparse. B. Performance Under Low and High SNR The performance of the estimators is given in Table I for the binary-valued with the SNR equal to 1.76 dB (low SNR) and 20 dB (high SNR). The number reported in Table I is the mean over the simulation runs. For each performance criterion, the best mean number is underlined. The oracular LS estimate is excluded from this assessment, as it cannot be implemented , the best number is without prior knowledge. In terms of . Recall that for the binary-valued image the value closest to . The best number for the other performance criterion is the value closest to 0. In the low SNR case, MAP2 has the best performance. MAP1 consistently produces the trivial estimate of all zeros, as evibeing equal to 0. The trivial denced by the mean value of for . For all-zero estimate results inTING et al.: SPARSE IMAGE RECONSTRUCTION FOR MOLECULAR IMAGING1221TABLE I PERFORMANCE OF THE RECONSTRUCTION METHODS FOR THE BINARY-VALUED Fig. 3. Reconstructed images for the binary-valued under an SNR of 1.76 dB for SBL, StOMP (CFAR), MAP2 (g = 1= 2), and lasso-SURE. (a) SBL. (b) STOMP (CFAR). (c) MAP2 (g = 1= 2). (d) lasso-SURE.ppTABLE II PERFORMANCE OF THE RECONSTRUCTION METHODS FOR THE LAZE a sparse , a small , therefore, is not necessarily an indicator of good performance. A second comment regarding is that it does not always give an accurate assessment of the perceived sparsity of the reconstruction. In Table I, SBL never produces a strictly sparse estimate, as the mean equals the maximal value of 1024. However, consider Fig. 3(a), where the SBL estimate for one noise realization at an SNR of 1.76 dB is depicted. The looks sparser than would be suggested by . This is because many of the nonzero pixel values have a small magnitude, and are visually indistinguishable from zero. 
The SBL estimate has many spurious nonzero pixels, in addition to blurring around several nonzero pixel locations. Negative values are present in the reconstruction, although the binary is non-negative. The StOMP (CFAR), MAP2, and lasso-SURE estimate are illustrated in Fig. 3(b)–(d), respectively. The StOMP (CFAR) has large positive and negative values. It does not seem like a sufficient number of stages have been taken. While blurring around several nonzero voxels are evident in the MAP2 estimate, closely resembles , cf. Fig. 2(a). None of the estimators considered here take into account positivity. From Fig. 3(b), however, one sees that the MAP2 estimate has no negative values. Qualitatively, the lasso-SURE estimate looks better than SBL, but worse than MAP2. This is reflected in the quantitative performance criteria in Table I. In the high SNR case, H-SURE has the best performance. The mean values of all the performance criteria decrease as com, pared to lasso-SURE. The greatest decreases are in and . They indicate that the H-SURE estimator is properly zeroing out spurious nonzero values and producing a sparser estimate than lasso-SURE. However, this comes at a price of higher computational complexity.Examine next the performance of the reconstruction methods with the LAZE image. One expects MAP1 and MAP2 to have better performance than the other methods, as the image is generated using the LAZE prior. The numbers for the performance criteria are given in Table II. Again, the reconstruction method with the best number for each criterion is underlined. . For the LAZE In the low SNR case, MAP2 has the advantage. MAP1 produces the trivial estimate of all zeros, just as in the case of the binary-valued . The high SNR case has mixed results. While。
These notes are purely introductory, intended for readers starting from zero.
I use the simplest possible linear DSGE model, with only two equations.
First I derive the saddle-path policy function by hand, and then compute the impulse response function by hand as well.
I have also carried out all of these computations in Dynare; the code and the results are included in the note.
The screenshot above is from my note. This DSGE model is really just a linear rational expectations model (LREM), and indeed a linearized DSGE model is in essence an LRE model anyway.
Although the model provided in this note is very simple, the point is to show how Dynare can be used to study this dynamic system in depth.
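For readers who want to see the hand calculation spelled out, here is a worked example. Since the note's own two equations are not reproduced here, it uses a generic illustrative forward-looking equation plus an AR(1) shock process, the standard textbook case rather than necessarily the exact system in the note:

\[
x_t = \beta\, \mathbb{E}_t x_{t+1} + z_t, \qquad z_t = \rho\, z_{t-1} + \varepsilon_t, \qquad |\beta\rho| < 1 .
\]
Guessing the minimum-state-variable solution \(x_t = c\, z_t\) (the method of undetermined coefficients) gives
\[
c\, z_t = \beta c\, \mathbb{E}_t z_{t+1} + z_t = (\beta c \rho + 1)\, z_t
\quad\Longrightarrow\quad
c = \frac{1}{1-\beta\rho},
\]
so the saddle-path policy function is \(x_t = z_t/(1-\beta\rho)\), and the impulse response to a unit shock \(\varepsilon_0 = 1\) is
\[
x_h = \frac{\rho^{h}}{1-\beta\rho}, \qquad h = 0, 1, 2, \dots
\]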
There are a few things you should do yourself (a sketch of how such an experiment might be set up follows below): 1. Try many different calibrations of beta and rho and see what effect they have on the IRFs. 2. When beta and rho are both greater than 1, how should you modify the model in order to keep it stable? 3. See what effect changing the standard deviation of the shock has on the model. 4. What if you add one more equation? What kind of equation? I have tried all of the above myself.
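The following is a minimal Python stand-in, not the note's Dynare code, using the same illustrative two-equation system as in the worked example above, so you can see how changing beta, rho, and the shock standard deviation moves the IRF; the calibrations tried below are arbitrary choices.

    import numpy as np

    def irf(beta, rho, sigma_e, horizon=20):
        """Impulse response of x to a one-standard-deviation shock in the
        illustrative model  x_t = beta*E_t[x_{t+1}] + z_t,  z_t = rho*z_{t-1} + e_t.
        The stable (saddle-path) solution is x_t = z_t / (1 - beta*rho),
        which requires |beta*rho| < 1."""
        if abs(beta * rho) >= 1:
            raise ValueError("|beta*rho| must be < 1 for a stable solution")
        h = np.arange(horizon)
        z = sigma_e * rho**h                 # response of the AR(1) state
        x = z / (1.0 - beta * rho)           # response of the forward-looking variable
        return x

    # Experiments 1 and 3 from the list above: vary the calibration and the shock size.
    for beta, rho, sigma_e in [(0.99, 0.90, 0.01), (0.50, 0.90, 0.01),
                               (0.99, 0.50, 0.01), (0.99, 0.90, 0.02)]:
        x = irf(beta, rho, sigma_e)
        print(f"beta={beta}, rho={rho}, sigma={sigma_e}: "
              f"impact={x[0]:.4f}, after 10 periods={x[10]:.4f}")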
I cannot run these experiments for you, so you must try them yourself.
That way you will build up a lot of intuition about dynamic systems.
After that, I estimate the parameters by maximum likelihood, and then I deliberately create an under-identification problem so that you can see what the results look like.
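As a toy illustration of what under-identification looks like (this is not the note's model): when two parameters enter the law of motion only through their product, the likelihood is flat along a ridge and maximum likelihood cannot separate them.

    import numpy as np

    rng = np.random.default_rng(0)

    # Simulate z_t = (a*b) * z_{t-1} + e_t with true a*b = 0.81; only the product is identified.
    T, a_true, b_true, sigma = 500, 0.9, 0.9, 0.01
    z = np.zeros(T)
    for t in range(1, T):
        z[t] = a_true * b_true * z[t - 1] + sigma * rng.standard_normal()

    def loglik(a, b):
        """Gaussian log-likelihood of the AR(1) law of motion, conditioning on z_0."""
        resid = z[1:] - a * b * z[:-1]
        return -0.5 * np.sum(resid**2) / sigma**2 - (T - 1) * np.log(sigma)

    # Very different (a, b) pairs with the same product give essentially the same likelihood:
    for a, b in [(0.9, 0.9), (0.81, 1.0), (0.5, 1.62)]:
        print(f"a={a:.2f}, b={b:.2f}, a*b={a*b:.3f}, loglik={loglik(a, b):.2f}")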
Finally there is Bayesian estimation. I estimate only one parameter, using two parallel Markov chains, and I run an extremely short simulation (only 500 draws, whereas 100,000 is typical) in order to save time (it takes my computer only about 50 seconds); I deliberately did not let the computer run very long Markov chains or many parallel chains.
The results are therefore very poor, but that is not the point.
The point is to give students who have never seen the whole estimation procedure an overall picture of it.
For that reason I do not discuss the theory, or only touch on it in passing.
For Bayesian estimation, one thing to note is that the number of shocks must be greater than or equal to the number of observables; this condition is needed for the Kalman filter that evaluates the likelihood function.
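A rough Python sketch of the same workflow, not the note's Dynare setup: one observable and one shock, so the condition above is satisfied; a Kalman filter evaluates the likelihood and a random-walk Metropolis sampler draws two short parallel chains of 500 draws each, mirroring the note. The prior, proposal scale, and chain length are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(1)

    # Data: one observable y_t = z_t, one shock; z_t = rho*z_{t-1} + e_t, e ~ N(0, sigma^2).
    T, rho_true, sigma = 200, 0.8, 0.01
    z = np.zeros(T)
    for t in range(1, T):
        z[t] = rho_true * z[t - 1] + sigma * rng.standard_normal()
    y = z.copy()

    def kalman_loglik(rho):
        """Kalman-filter log-likelihood of y for the state z_t = rho*z_{t-1} + e_t, observed exactly."""
        loglik, z_pred = 0.0, 0.0
        P_pred = sigma**2 / max(1e-8, 1 - rho**2)   # stationary initial state variance
        for t in range(T):
            S = P_pred                              # forecast-error variance (no measurement noise)
            v = y[t] - z_pred                       # forecast error
            loglik += -0.5 * (np.log(2 * np.pi * S) + v**2 / S)
            K = P_pred / S                          # Kalman gain (equals 1 here)
            z_filt = z_pred + K * v
            P_filt = (1 - K) * P_pred
            z_pred, P_pred = rho * z_filt, rho**2 * P_filt + sigma**2
        return loglik

    def log_post(rho):
        return kalman_loglik(rho) if 0 < rho < 1 else -np.inf   # flat prior on (0, 1)

    # Two parallel random-walk Metropolis chains of 500 draws each, as in the note.
    for chain in range(2):
        rho, draws = 0.5, []
        for _ in range(500):
            prop = rho + 0.05 * rng.standard_normal()
            if np.log(rng.random()) < log_post(prop) - log_post(rho):
                rho = prop
            draws.append(rho)
        print(f"chain {chain}: posterior mean of rho ~ {np.mean(draws[100:]):.3f}")

With only 500 draws the two chains will not agree closely, which is exactly the point the note makes about its own deliberately short simulation.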
A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection

Ron Kohavi
Computer Science Department
Stanford University
Stanford, CA 94305
ronnyk@CS.Stanford.EDU
http://robotics.stanford.edu/~ronnyk

Abstract

We review accuracy estimation methods and compare the two most common methods: cross-validation and bootstrap. Recent experimental results on artificial data and theoretical results in restricted settings have shown that for selecting a good classifier from a set of classifiers (model selection), ten-fold cross-validation may be better than the more expensive leave-one-out cross-validation. We report on a large-scale experiment (over half a million runs of C4.5 and a Naive-Bayes algorithm) to estimate the effects of different parameters on these algorithms on real-world datasets. For cross-validation we vary the number of folds and whether the folds are stratified or not; for bootstrap, we vary the number of bootstrap samples. Our results indicate that for real-world datasets similar to ours, the best method to use for model selection is ten-fold stratified cross-validation, even if computation power allows using more folds.

1 Introduction

It can not be emphasized enough that no claim whatsoever is being made in this paper that all algorithms are equivalent in practice in the real world. In particular, no claim is being made that one should not use cross-validation in the real world. - Wolpert (1994a)

Estimating the accuracy of a classifier induced by supervised learning algorithms is important not only to predict its future prediction accuracy, but also for choosing a classifier from a given set (model selection), or combining classifiers (Wolpert 1992). For estimating the final accuracy of a classifier, we would like an estimation method with low bias and low variance. To choose a classifier or to combine classifiers, the absolute accuracies are less important and we are willing to trade off bias for low variance, assuming the bias affects all classifiers similarly (e.g. estimates are pessimistic). (A longer version of the paper can be retrieved by anonymous ftp to starry.stanford.edu, pub/ronnyk/accEst-long.ps.)

In this paper we explain some of the assumptions made by the different estimation methods and present concrete examples where each method fails. While it is known that no accuracy estimation can be correct all the time (Wolpert 1994b, Schaffer 1994), we are interested in identifying a method that is well suited for the biases and trends in typical real-world datasets.

Recent results, both theoretical and experimental, have shown that it is not
alwa>s the case that increas-ing the computational cost is beneficial especiallhy if the relative accuracies are more important than the exact values For example leave-one-out is almost unbiased,but it has high variance leading to unreliable estimates (Efron 1981) l o r linear models using leave-one-out cross-validation for model selection is asymptotically in consistent in the sense that the probability of selectingthe model with the best predictive power does not con-verge to one as the lolal number of observations ap-proaches infinity (Zhang 1992, Shao 1993)This paper \s organized AS follows Section 2 describesthe common accuracy estimation methods and ways of computing confidence bounds that hold under some as-sumptions Section 3 discusses related work comparing cross-validation variants and bootstrap variants Sec lion 4 discusses methodology underlying our experimentThe results of the experiments are given Section 5 with a discussion of important observations We conelude witha summary in Section 62 Methods for Accuracy E s t i m a t i o nA classifier is a function that maps an unlabelled in-stance to a label using internal data structures An i n-ducer or an induction algorithm builds a classifier froma given dataset CART and C 4 5 (Brennan, Friedman Olshen &. Stone 1984, Quinlan 1993) are decision tree in-ducers that build decision tree classifiers In this paperwe are not interested in the specific method for inducing classifiers, but assume access to a dataset and an inducerof interestLet V be the space of unlabelled instances and y theKOHAVI 1137set of possible labels be the space of labelled instances and ,i n ) be a dataset (possibly a multiset) consisting of n labelled instances, where A classifier C maps an unla-beled instance ' 10 a l a b e l a n d an inducer maps a given dataset D into a classifier CThe notationwill denote the label assigned to an unlabelled in-stance v by the classifier built, by inducer X on dataset D tWe assume that there exists adistribution on the set of labelled instances and that our dataset consists of 1 1 d (independently and identically distributed) instances We consider equal misclassifica-lion costs using a 0/1 loss function, but the accuracy estimation methods can easily be extended to other loss functionsThe accuracy of a classifier C is the probability ofcorrectly clasaifying a randoml\ selected instance, i efor a randomly selected instancewhere the probability distribution over theinstance space 15 the same as the distribution that was used to select instances for the inducers training set Given a finite dataset we would like to custimate the fu-ture performance of a classifier induced by the given in-ducer and dataset A single accuracy estimate is usually meaningless without a confidence interval, thus we will consider how to approximate such an interval when pos-sible In order to identify weaknesses, we also attempt o identify cases where the estimates fail2 1 Holdout The holdout method sometimes called test sample esti-mation partitions the data into two mutually exclusivesubsets called a training set and a test set or holdout setIt is Lommon to designate 2/ 3 of the data as the trainingset and the remaining 1/3 as the test set The trainingset is given to the inducer, and the induced classifier istested on the test set Formally, let , the holdout set,be a subset of D of size h, and let Theholdout estimated accuracy is defined aswhere otherwise Assummg that the inducer s accuracy increases as more instances are seen, the holdout method is a 
pessimistic estimator because only a portion of the data is given to the inducer for training The more instances we leave for the test set, the higher the bias of our estimate however, fewer test set instances means that the confidence interval for the accuracy will be wider as shown belowEach test instance can be viewed as a Bernoulli trialcorrect or incorrect prediction Let S be the numberof correct classifications on the test set, then s is dis-tributed bmomially (sum of Bernoulli trials) For rea-sonably large holdout sets, the distribution of S/h is ap-proximately normal with mean ace (the true accuracy of the classifier) and a variance of ace * (1 — acc)hi Thus, by De Moivre-Laplace limit theorem, we havewhere z is the quanl lie point of the standard normal distribution To get a IOO7 percent confidence interval, one determines 2 and inverts the inequalities Inversion of the inequalities leads to a quadratic equation in ace, the roots of which are the low and high confidence pointsThe above equation is not conditioned on the dataset D , if more information is available about the probability of the given dataset it must be taken into accountThe holdout estimate is a random number that de-pends on the division into a training set and a test set In r a n d o m sub s a m p l i n g the holdout method is re-peated k times and the eslimated accuracy is derived by averaging the runs Th( slandard deviation can be estimated as the standard dewation of the accuracy es-timations from each holdout runThe mam assumption that is violated in random sub-sampling is the independence of instances m the test set from those in the training set If the training and testset are formed by a split of an original dalaset, thenan over-represented class in one subset will be a under represented in the other To demonstrate the issue we simulated a 2/3, 1 /3 split of Fisher's famous ins dataset and used a majority inducer that builds a classifier pre dieting the prevalent class in the training set The iris dataset describes ins plants using four continuous fea-tures, and the task is to classify each instance (an ins) as Ins Setosa Ins Versicolour or Ins Virginica For each class label there are exactly one third of the instances with that label (50 instances of each class from a to-tal of 150 instances) thus we expect 33 3% prediction accuracy However, because the test set will always con-tain less than 1/3 of the instances of the class that wasprevalent in the training set, the accuracy predicted by the holdout method is 21 68% with a standard deviation of 0 13% (estimated by averaging 500 holdouts) In practice, the dataset size is always finite, and usu-ally smaller than we would like it to be The holdout method makes inefficient use of the data a third of dataset is not used for training the inducer 2 2 Cross-Validation, Leave-one-out, and Stratification In fc-fold cross-validation, sometimes called rotation esti-mation, the dataset V is randomly split into k mutuallyexclusive subsets (the folds) , of approx-imately equal size The inducer is trained and tested1138 LEARNINGThe cross-validation estimate is a random number that depends on the division into folds C o m p l e t ec r o s s -v a l id a t i o n is the average of all possibil ities for choosing m/k instances out of m, but it is usually too expensive Exrept for leave-one-one (rc-fold cross-validation), which is always complete, fc-foM cross-validation is estimating complete K-foId cross-validationusing a single split of the data into the folds Repeat-ing 
cross-validation multiple limes using different spillsinto folds provides a better M onte C arlo estimate to 1 hecomplele cross-validation at an added cost In s t r a t i -fied c r o s s -v a l i d a t i o n the folds are stratified so thaitlicy contain approximately the same proportions of la-bels as the original dataset An inducer is stable for a given dataset and a set of perturbal ions if it induces classifiers thai make the same predictions when it is given the perturbed datasets P r o p o s i t i o n 1 (V a r i a n c e in A>fold C V )Given a dataset and an inducer If the inductr isstable under the pei tur bations caused by deleting theinstances f o r thr folds in k fold cross-validatwn thecross validation < stnnate will be unbiastd and the t a i lance of the estimated accuracy will be approximatelyaccrv (1—)/n when n is the number of instancesin the datasi t Proof If we assume that the k classifiers produced makethe same predictions, then the estimated accuracy has a binomial distribution with n trials and probabihly of success equal to (he accuracy of the classifier | For large enough n a confidence interval may be com-puted using Equation 3 with h equal to n, the number of instancesIn reality a complex inducer is unlikely to be stable for large perturbations unless it has reached its maximal learning capacity We expect the perturbations induced by leave-one-out to be small and therefore the classifier should be very stable As we increase the size of the perturbations, stability is less likely to hold we expect stability to hold more in 20-fold cross-validation than in 10-fold cross-validation and both should be more stable than holdout of 1/3 The proposition does not apply to the resubstitution estimate because it requires the in-ducer to be stable when no instances are given in the datasetThe above proposition helps, understand one possible assumption that is made when using cross-validation if an inducer is unstable for a particular dataset under a set of perturbations introduced by cross-validation, the ac-curacy estimate is likely to be unreliable If the inducer is almost stable on a given dataset, we should expect a reliable estimate The next corollary takes the idea slightly further and shows a result that we have observed empirically there is almost no change in the variance of the cross validation estimate when the number of folds is variedC o r o l l a r y 2 (Variance m cross-validation)Given a dataset and an inductr If the inducer is sta-ble undfi the }>tituibuhoris (aused by deleting the test instances foi the folds in k-fold cross-validation for var-ious valuts of k then tht vartanct of the estimates will be the sameProof The variance of A-fold cross-validation in Propo-sition 1 does not depend on k |While some inducers are liktly to be inherently more stable the following example shows that one must also take into account the dalaset and the actual perturba (ions E x a m p l e 1 (Failure of leave-one-out)lusher s ins dataset contains 50 instances of each class leading one to expect that a majority indu<er should have acruraov about j \% However the eombmation ofthis dataset with a majority inducer is unstable for thesmall perturbations performed by leave-one-out Whenan instance is deleted from the dalaset, its label is a mi-nority in the training set, thus the majority inducer pre-dicts one of the other two classes and always errs in clas-sifying the test instance The leave-one-out estimatedaccuracy for a majont> inducer on the ins dataset istherefore 0% M oreover all 
folds have this estimated ac-curacy, thus the standard deviation of the folds is again0 %giving the unjustified assurance that 'he estimate is stable | The example shows an inherent problem with cross-validation th-t applies to more than just a majority in-ducer In a no-infornirition dataset where the label val-ues are completely random, the best an induction algo-rithm can do is predict majority Leave-one-out on such a dataset with 50% of the labels for each class and a majontv ind'-cer (the best, possible inducer) would still predict 0% accuracy 2 3 B o o t s t r a pThe bootstrap family was introduced by Efron and is fully described in Efron &. Tibshirani (1993) Given a dataset of size n a b o o t s t r a p s a m p l e is created by sampling n instances uniformly from the data (with re-placement) Since the dataset is sampled with replace-ment, the probability of any given instance not beingchosen after n samples is theKOHAVI 1139expected number of distinct instances from the original dataset appearing in the teat set is thus 0 632n The eO accuracy estimate is derived by using the bootstrap sam-ple for training and the rest of the instances for testing Given a number b, the number of bootstrap samples, let e0, be the accuracy estimate for bootstrap sample i The632 bootstrap estimate is defined as(5)where ace, is the resubstitution accuracy estimate on the full dataset (i e , the accuracy on the training set) The variance of the estimate can be determined by com puting the variance of the estimates for the samples The assumptions made by bootstrap are basically the same as that of cross-validation, i e , stability of the al-gorithm on the dataset the 'bootstrap world" should closely approximate the real world The b32 bootstrap fails (o give the expected result when the classifier is a perfect memonzer (e g an unpruned decision tree or a one nearest neighbor classifier) and the dataset is com-pletely random, say with two classes The resubstitution accuracy is 100%, and the eO accuracy is about 50% Plugging these into the bootstrap formula, one gets an estimated accuracy of about 68 4%, far from the real ac-curacy of 50% Bootstrap can be shown to fail if we add a memonzer module to any given inducer and adjust its predictions If the memonzer remembers the training set and makes the predictions when the test instance was a training instances, adjusting its predictions can make the resubstitution accuracy change from 0% to 100% and can thus bias the overall estimated accuracy in any direction we want3 Related W o r kSome experimental studies comparing different accuracy estimation methods have been previously done but most of them were on artificial or small datasets We now describe some of these effortsEfron (1983) conducted five sampling experiments and compared leave-one-out cross-validation, several variants of bootstrap, and several other methods The purpose of the experiments was to 'investigate some related es-timators, which seem to offer considerably improved es-timation in small samples ' The results indicate that leave-one-out cross-validation gives nearly unbiased esti-mates of the accuracy, but often with unacceptably high variability, particularly for small samples, and that the 632 bootstrap performed bestBreiman et al (1984) conducted experiments using cross-validation for decision tree pruning They chose ten-fold cross-validation for the CART program and claimed it was satisfactory for choosing the correct tree They claimed that "the difference in the cross-validation 
estimates of the risks of two rules tends to be much more accurate than the two estimates themselves "Jain, Dubes fa Chen (1987) compared the performance of the t0 bootstrap and leave-one-out cross-validation on nearest neighbor classifiers Using artificial data and claimed that the confidence interval of the bootstrap estimator is smaller than that of leave-one-out Weiss (1991) followed similar lines and compared stratified cross-validation and two bootstrap methods with near-est neighbor classifiers His results were that stratified two-fold cross validation is relatively low variance and superior to leave-one-outBreiman fa Spector (1992) conducted a feature sub-set selection experiments for regression, and compared leave-one-out cross-validation, A:-fold cross-validation for various k, stratified K-fold cross-validation, bias-corrected bootstrap, and partial cross-validation (not discussed here) Tests were done on artificial datasets with 60 and 160 instances The behavior observed was (1) the leave-one-out has low bias and RMS (root mean square) error whereas two-fold and five-fold cross-validation have larger bias and RMS error only at models with many features, (2) the pessimistic bias of ten-fold cross-validation at small samples was significantly re-duced for the samples of size 160 (3) for model selection, ten-fold cross-validation is better than leave-one-out Bailey fa E lkan (1993) compared leave-one-out cross-ahdation to 632 bootstrap using the FOIL inducer and four synthetic datasets involving Boolean concepts They observed high variability and little bias in the leave-one-out estimates, and low variability but large bias in the 632 estimatesWeiss and Indurkyha (Weiss fa Indurkhya 1994) con-ducted experiments on real world data Lo determine the applicability of cross-validation to decision tree pruning Their results were that for samples at least of size 200 using stratified ten-fold cross-validation to choose the amount of pruning yields unbiased trees (with respect to their optimal size) 4 M e t h o d o l o g yIn order to conduct a large-scale experiment we decided to use 04 5 and a Naive Bayesian classifier The C4 5 algorithm (Quinlan 1993) is a descendent of ID3 that builds decision trees top-down The Naive-Bayesian clas-sifier (Langley, Iba fa Thompson 1992) used was the one implemented in (Kohavi, John, Long, Manley fa Pfleger 1994) that uses the observed ratios for nominal features and assumes a Gaussian distribution for contin-uous features The exact details are not crucial for this paper because we are interested in the behavior of the accuracy estimation methods more than the internals of the induction algorithms The underlying hypothe-sis spaces—decision trees for C4 5 and summary statis-tics for Naive-Bayes—are different enough that we hope conclusions based on these two induction algorithms will apply to other induction algorithmsBecause the target concept is unknown for real-world1140 LEARNINGconcepts, we used the holdout method to estimate the quality of the cross-validation and bootstrap estimates To choose & set of datasets, we looked at the learning curves for C4 5 and Najve-Bayes for most of the super-vised classification dataaets at the UC Irvine repository (Murphy & Aha 1994) that contained more than 500 instances (about 25 such datasets) We felt that a min-imum of 500 instances were required for testing While the true accuracies of a real dataset cannot be computed because we do not know the target concept, we can esti mate the true accuracies using the 
holdout method The "true' accuracy estimates in Table 1 were computed by taking a random sample of the given size computing the accuracy using the rest of the dataset as a test set, and repeating 500 timesWe chose six datasets from a wide variety of domains, such that the learning curve for both algorithms did not flatten out too early that is, before one hundred instances We also added a no inform a tion d l stt, rand, with 20 Boolean features and a Boolean random label On one dataset vehicle, the generalization accu-racy of the Naive-Bayes algorithm deteriorated hy morethan 4% as more instances were g;iven A similar phenomenon was observed on the shuttle dataset Such a phenomenon was predicted by Srhaffer and Wolpert (Schaffer 1994, Wolpert 1994), but we were surprised that it was observed on two real world datasetsTo see how well an Accuracy estimation method per forms we sampled instances from the dataset (uniformly without replacement) and created a training set of the desired size We then ran the induction algorihm on the training set and tested the classifier on the rest of the instances L E I the dataset This was repeated 50 times at points where the lea rning curve wa s sloping up The same folds in cross-validation and the same samples m bootstrap were used for both algorithms compared5 Results and DiscussionWe now show the experimental results and discuss their significance We begin with a discussion of the bias in the estimation methods and follow with a discussion of the variance Due to lack of space, we omit some graphs for the Naive-Bayes algorithm when the behavior is ap-proximately the same as that of C 4 5 5 1 T h e B i a sThe bias of a method to estimate a parameter 0 is de-fined as the expected value minus the estimated value An unbiased estimation method is a method that has zero bias Figure 1 shows the bias and variance of k-fold cross-validation on several datasets (the breast cancer dataset is not shown)The diagrams clearly show that k-fold cross-validation is pessimistically biased, especially for two and five folds For the learning curves that have a large derivative at the measurement point the pessimism in k-fold cross-Figure ] C'4 5 The bias of cross-validation with varying folds A negative K folds stands for leave k-out E rror bars are 95% confidence intervals for (he mean The gray regions indicate 95 % confidence intervals for the true ac curaries Note the different ranges for the accuracy axis validation for small k s is apparent Most of the esti-mates are reasonably good at 10 folds and at 20 folds they art almost unbiasedStratified cross validation (not shown) had similar be-havior, except for lower pessimism The estimated accu-racy for soybe an at 2 fold was 7% higher and at five-fold, 1 1% higher for vehicle at 2-fold, the accuracy was 2 8% higher and at five-fold 1 9% higher Thus stratification seems to be a less biased estimation methodFigure 2 shows the bias and variance for the b32 boot-strap accuracy estimation method Although the 632 bootstrap is almost unbiased for chess hypothyroid, and mushroom for both inducers it is highly biased for soy-bean with C'A 5, vehicle with both inducers and rand with both inducers The bias with C4 5 and vehicle is 9 8%5 2 The VarianceWhile a given method may have low bias, its perfor-mance (accuracy estimation in our case) may be poor due to high variance In the experiments above, we have formed confidence intervals by using the standard de-viation of the mea n a ccura cy We now switch to the standard deviation of 
the population i e , the expected standard deviation of a single accuracy estimation run In practice, if one dots a single cross-validation run the expected accuracy will be the mean reported above, but the standard deviation will be higher by a factor of V50, the number of runs we averaged in the experimentsKOHAVI 1141Table 1 True accuracy estimates for the datasets using C4 5 and Naive-Bayes classifiers at the chosen sample sizesFigure 2 C4 5 The bias of bootstrap with varying sam-ples Estimates are good for mushroom hypothyroid, and chess, but are extremely biased (optimistically) for vehicle and rand, and somewhat biased for soybeanIn what follows, all figures for standard deviation will be drawn with the same range for the standard devi-ation 0 to 7 5% Figure 3 shows the standard devia-tions for C4 5 and Naive Bayes using varying number of folds for cross-validation The results for stratified cross-validation were similar with slightly lower variance Figure 4 shows the same information for 632 bootstrap Cross-validation has high variance at 2-folds on both C4 5 and Naive-Bayes On C4 5, there is high variance at the high-ends too—at leave-one-out and leave-two-out—for three files out of the seven datasets Stratifica-tion reduces the variance slightly, and thus seems to be uniformly better than cross-validation, both for bias and vananceFigure 3 Cross-validation standard deviation of accu-racy (population) Different, line styles are used to help differentiate between curves6 S u m m a r yWe reviewed common accuracy estimation methods in-cluding holdout, cross-validation, and bootstrap, and showed examples where each one fails to produce a good estimate We have compared the latter two approaches on a variety of real-world datasets with differing charac-teristicsProposition 1 shows that if the induction algorithm is stable for a given dataset, the variance of the cross-validation estimates should be approximately the same, independent of the number of folds Although the induc-tion algorithms are not stable, they are approximately stable it fold cross-validation with moderate k values (10-20) reduces the variance while increasing the bias As k decreases (2-5) and the sample sizes get smaller, there is variance due to the instability of the training1142 LEARNING1 igure 4 632 Bootstrap standard deviation in acc-rat y (population)sets themselves leading to an increase in variance this is most apparent for datasets with many categories, such as soybean In these situations) stratification seems to help, but -epeated runs may be a better approach Our results indicate that stratification is generally a better scheme both in terms of bias and variance whencompared to regular cross-validation Bootstrap has low,variance but extremely large bias on some problems We recommend using stratified Len fold cross-validation for model selection A c k n o w l e d g m e n t s We thank David Wolpert for a thorough reading of this paper and many interesting dis-cussions We thank Tom Bylander Brad E fron Jerry Friedman, Rob Holte George John Pat Langley Hob Tibshiram and Sholom Weiss for their helpful com nients and suggestions Dan Sommcrfield implemented Lhe bootstrap method in WLC++ All experiments were conducted using M L C ++ partly partly funded by ONR grant N00014-94-1-0448 and NSF grants IRI 9116399 and IRI-941306ReferencesBailey, T L & E lkan C (1993) stimating the atcuracy of learned concepts, in Proceedings of In ternational Joint Conference on Artificial Intelli-gence , Morgan Kaufmann Publishers, pp 
895 900 Breiman, L & Spector, P (1992) Submodel selectionand evaluation in regression the x random case Inttrnational St atistic al Review 60(3), 291-319 Breiman, L , Friedman, J H , Olshen, R A & StoneC J (1984), Cl a ssific ation a nd Regression Trets Wadsworth International GroupEfron, B (1983), 'E stimating the error rate of a pre-diction rule improvement on cross-validation",Journal of the Americ an St atistic al Associ ation 78(382), 316-330 Efron, B & Tibshiram, R (1993) An introduction tothe bootstra p, Chapman & HallJam, A K Dubes R C & Chen, C (1987), "Boot-strap techniques lor error estimation", IEEE tra ns-actions on p a ttern a n a lysis a nd m a chine intelli-gence P A M I -9(5), 628-633 Kohavi, R , John, G , Long, R , Manley, D &Pfleger K (1994), M L C ++ A machine learn-ing library in C ++ in 'Tools with Artifi-cial Intelligence I E E EComputer Society Press, pp 740-743 Available by anonymous ftp from s t a r r y Stanford E DU pub/ronnyk/mlc/ toolsmlc psLangley, P Tba, W & Thompson, K (1992), An anal-ysis of bayesian classifiers in Proceedings of the tenth national conference on artificial intelligence",A A A I Press and M I T Press, pp 223-228Murph' P M & Aha D W (1994), V( I repository of machine learning databases, For information con-tact ml-repository (Ui(,s uci edu Quinlan I R (1993) C4 5 Progra ms for Ma chine Learning Morgan Kaufmann Los Altos CaliforniaSchaffcr C (19941 A conservation law for generalization performance, in Maehinc Learning Proceedings of Lhe E leventh International conference Morgan Kaufmann, pp 259-265Shao, J (1993), Linear model seletion via cross-validation Journ a l of the America n sta tistica l As-sociation 88(422) 486-494 Weiss S M (1991), Small sample error rate estimationfor k nearest neighbor classifiers' I E EE Tr a ns ac tions on Pa ttern An alysis a nd Ma chine Intelligence 13(3), 285-289 Weiss, S M & lndurkhya N (1994) Decision Lreepruning Biased or optimal, in Proceedings of the twelfth national conference on artificial intel-ligence A A A I Press and M I T Press pp 626-632 Wolpert D H (1992), Stacked generalization , Neura lNetworks 5 241-259 Wolpert D H (1994a) Off training set error and a pri-ori distinctions between learning algorithms, tech-mcal Report SFI TR 94-12-121, The Sante Fe ln-stituteWolpert D II {1994b), The relationship between PAC, the statistical physics framework the Bayesian framework, and the VC framework Technical re-port, The Santa Fe Institute Santa Fe, NMZhang, P (1992), 'On the distributional properties of model selection criteria' Journ al of the America nStatistical Associa tion 87(419), 732-737 KOHAVI 1143。
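To make Example 1 above (the failure of leave-one-out for a majority inducer) concrete, here is a small sketch on a synthetic, iris-like label vector of 150 instances with 50 per class; it is an illustration of the argument, not the paper's original experiment.

    import numpy as np
    from collections import Counter

    labels = np.repeat([0, 1, 2], 50)          # 150 instances, 50 per class, like the iris labels

    def majority_inducer(train_labels):
        """A 'majority inducer': always predicts the most common label in the training set."""
        return Counter(train_labels).most_common(1)[0][0]

    # Leave-one-out: deleting one instance makes its class a minority in the training set,
    # so the majority inducer never predicts the held-out label.
    correct = 0
    for i in range(len(labels)):
        train = np.delete(labels, i)
        correct += (majority_inducer(train) == labels[i])
    print("leave-one-out accuracy:", correct / len(labels))   # 0.0, despite a true accuracy of 1/3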
arXiv:astro-ph/0501477v1  24 Jan 2005
astro-ph/0501477

Bayesian model selection and isocurvature perturbations

María Beltrán,1,2 Juan García-Bellido,1 Julien Lesgourgues,3 Andrew R. Liddle,2 and Anže Slosar4
1Departamento de Física Teórica C-XI, Universidad Autónoma de Madrid, Cantoblanco, 28049 Madrid, Spain
2Astronomy Centre, University of Sussex, Brighton BN1 9QH, United Kingdom
3LAPTH, F-74941 Annecy-le-Vieux Cedex, France and INFN Padova, Via Marzolo 8, I-35131 Padova, Italy
4Faculty of Mathematics and Physics, University of Ljubljana, Slovenia
(Dated: February 2, 2008)
Present cosmological data are well explained assuming purely adiabatic perturbations, but an admixture of isocurvature perturbations is also permitted. We use a Bayesian framework to compare the performance of cosmological models including isocurvature modes with the purely adiabatic case; this framework automatically and consistently penalizes models which use more parameters to fit the data. We compute the Bayesian evidence for fits to a dataset comprised of WMAP and other microwave anisotropy data, the galaxy power spectrum from 2dFGRS and SDSS, and Type Ia supernovae luminosity distances. We find that Bayesian model selection favours the purely adiabatic models, but so far only at low significance.
PACS numbers: 98.80.Cq
IFT-UAM/CSIC-05-02, LAPTH-1085/05, DFPD/05/A05, astro-ph/0501477
I. INTRODUCTION

Following recent developments in observational cosmology, particularly observations by the Wilkinson Microwave Anisotropy Probe (WMAP) [1], there exist compelling reasons to talk about a Standard Cosmological Model based on the ΛCDM paradigm seeded with purely adiabatic perturbations. In addition, there have been many attempts to analyze more general models featuring additional physics, either to constrain such processes or in the hope of discovering some trace effects in the data. A case of particular interest is the possible addition of an admixture of isocurvature perturbations to the adiabatic ones [2, 3], which has been studied in the post-WMAP era by many authors [4, 5, 6, 7, 8, 9, 10, 11].

The bulk of investigations so far have as a starting point chosen a particular set of parameters to define the cosmological model under discussion, and then attempted to constrain those parameters using observations, a process known as parameter fitting. Based on such analyses, many parameters are determined to a high degree of accuracy. Much less attention has been directed at the higher-level inference problem of allowing the data to decide the set of parameters to be used, known as model comparison or model selection [12, 13], although such techniques have been widely deployed outside of astrophysics. Recently, one of us applied two model selection statistics, known as the Akaike and Bayesian Information Criteria, to some simple cosmological models [14], and showed that the simplest model considered was the one favoured by the data. These criteria have recently been applied to models with isocurvature perturbations by Parkinson et al. [9], who concluded that the purely adiabatic model was favoured.

Those statistics are not however full implementations of Bayesian inference, which appears to be the most appropriate framework for interpreting cosmological data. The correct model selection tool to use in that context is the Bayesian evidence [12, 13], which is the probability
of the model in light of the data (i.e. the average likelihood over the prior distribution). It has been deployed in cosmological contexts by several authors [15], and the ratio of evidences between two models is also known as the Bayes factor [16].1 The Bayesian evidence can be combined with prior probabilities for different models if desired, but even if the prior probabilities are assumed equal, the evidence still automatically encodes a preference for simpler models, implementing Occam's razor in a quantitative manner.
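As a toy numerical illustration of the Occam penalty built into the evidence (this is not the analysis carried out in this paper): compare a model with a parameter fixed at zero against one where the parameter is free under a broad prior, computing each evidence by averaging the likelihood over the prior. The data, likelihood, and prior range below are assumptions made only for the example.

    import numpy as np

    # Toy data: n measurements of a quantity whose true value is p = 0 (the simpler model is true).
    rng = np.random.default_rng(2)
    n, sigma = 10, 1.0
    data = rng.normal(0.0, sigma, n)

    def loglike(p):
        return -0.5 * np.sum((data - p) ** 2) / sigma**2 - 0.5 * n * np.log(2 * np.pi * sigma**2)

    # Model 0: p fixed at 0 (no free parameter); its evidence is just the likelihood at p = 0.
    logE0 = loglike(0.0)

    # Model 1: p free, broad uniform prior on [-5, 5]; evidence = average likelihood over the prior.
    grid = np.linspace(-5, 5, 2001)
    like = np.exp([loglike(p) for p in grid])
    logE1 = np.log(np.sum(like) * (grid[1] - grid[0]) / 10.0)   # prior density is 1/10 on the interval

    print(f"ln(Bayes factor B01) = {logE0 - logE1:.2f}")        # > 0 means the simpler p = 0 model is favoured

Even though the free-parameter model always fits at least as well at its best-fit point, its evidence is diluted by the large prior volume, so the Bayes factor typically favours the simpler model when the data are consistent with p = 0.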
Whenever one aims to decide whether or not a particular parameter p should be fixed (for example at p = 0), one should use model selection techniques. If one carries out only a parameter-fitting exercise and then examines the likelihood level at which p = 0 is excluded, such a comparison fails to account for the model dimensionality being reduced by one at the point p = 0, and hence draws conclusions inconsistent with Bayesian inference. This typically overestimates the significance at which the parameter p is needed. An example is spectral index running, which parameter fitting favours at a modest (albeit unconvincing) confidence level [1], but which is disfavoured by model selection statistics [14].

In this paper we use the Bayesian evidence to compare isocurvature and adiabatic models in light of current data. We will closely follow the notation of Beltrán et al. [10], who recently carried out a parameter-fitting analysis of isocurvature models, and we use the same datasets. We follow the notation of that paper and provide only a brief summary in this article.