ON ALGORITHMS FOR ORDINARY LEAST SQUARES REGRESSION SPLINE FITTING: A COMPARATIVE STUDY
J. Statist. Comput. Simul., 2002, Vol. 72(8), pp. 647–663

THOMAS C. M. LEE

Department of Statistics, Colorado State University, Fort Collins, CO 80523-1877, USA

(Received 25 October 2001; in final form 4 March 2002)

Regression spline smoothing is a popular approach for conducting nonparametric regression. An important issue associated with it is the choice of a "theoretically best" set of knots. Different statistical model selection methods, such as Akaike's information criterion and generalized cross-validation, have been applied to derive different "theoretically best" sets of knots. Typically these best knot sets are defined implicitly as the optimizers of some objective functions. Hence another equally important issue concerning regression spline smoothing is how to optimize such objective functions. In this article, different numerical algorithms that are designed for carrying out such optimization problems are compared by means of a simulation study. Both the univariate and bivariate smoothing settings will be considered. Based on the simulation results, recommendations for choosing a suitable optimization algorithm under various settings will be provided.

Keywords: Bivariate smoothing; Generalized cross-validation; Genetic algorithms; Regression spline; Stepwise selection

1 INTRODUCTION

An increasingly popular approach for performing nonparametric curve estimation is regression spline smoothing. For this approach it is customary to assume that the "true" curve f(x) to be recovered admits the expression

f(x) = \beta_0 + \beta_1 x + \cdots + \beta_r x^r + \sum_{j=1}^{m} b_j (x - k_j)_+^r,   (1)

where r is the order of the regression spline (usually chosen a priori), k_j is the jth knot, \beta = (\beta_0, ..., \beta_r, b_1, ..., b_m)^T is the vector of coefficients, and (a)_+ = max(0, a). In order to estimate f(x) using (1), one needs to choose the number and the placement of the knots, as well as to estimate \beta. As mentioned in Wand (2000), there are two general strategies for carrying out this task. The first strategy is to select a relatively small number of knots and estimate \beta using ordinary least squares (further details will be given below). With this strategy, the choice of the knots is extremely important. Regression spline smoothing procedures following this strategy include Friedman and Silverman (1989), Koo (1997), Kooperberg, Bose and Stone (1997), Lee (2000), Pittman (1999) and Stone, Hansen, Kooperberg and Truong (1997). The second strategy is to use a relatively large number of knots, but not to estimate \beta by ordinary least squares. In contrast to the first strategy, here the importance of the choice of knots is relatively minor: the crucial aspect is how \beta is estimated (e.g., by penalized least squares). Recent related references include Denison, Mallick and Smith (1998a), DiMatteo, Genovese and Kass (2001), Eilers and Marx (1996), Lindstrom (1999), Ruppert and Carroll (2000) and Smith and Kohn (1996).

This article focuses on the first strategy. Many regression spline smoothing procedures adopting this strategy are composed of two major components. The first component is the use of some statistical model selection principle for defining a "theoretically best" set of knots. Quite often, such a "theoretically best" set of knots is defined implicitly as the optimizer of some objective function. The second component is hence a practical algorithm for performing the corresponding optimization. Typically this optimization is a very hard problem, as (i) the search space is usually huge and (ii) different candidate solutions may have different dimensions. The main purpose of this paper is to provide a study of the performance of some common algorithms that are designed for carrying out such optimization problems. Both the univariate and bivariate settings will be considered (but the question of which is the most appropriate principle for defining a "theoretically best" set of knots will not be considered here). Our hope is that, by separating the two issues of "defining the best" and "locating the defined best", and by performing a focused study on the latter, useful computational hints can be obtained for reference by future researchers.

Due to the complicated nature of the regression spline optimization algorithms, theoretical comparison seems to be very difficult. Therefore, the study to be presented is entirely based on empirical experiments. Of course it is impossible to exhaust all possible experimental settings, but we shall follow Wand (2000) and use a family approach to alleviate this problem. The idea is to change one experimental factor (e.g., the signal-to-noise ratio) at a time so that patterns can be more easily detected.

In (1) it is clear that f(x) is a linear combination of {x^j}, j = 0, ..., r, and {(x - k_j)_+^r}, j = 1, ..., m. This set of functions is known as the truncated power basis of degree r. Other basis functions for regression spline fitting also exist, such as those for B-splines, natural splines and radial basis functions (e.g., see Eilers and Marx, 1996; Green and Silverman, 1994). However, as pointed out by Wand (2000), numerical experimental results should not be very sensitive to the choice of basis functions. Therefore, for simplicity, this article shall concentrate on the truncated power basis.
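To make the preceding description concrete, the following sketch shows how the truncated power basis of (1) can be assembled into a design matrix and how, under the first strategy, the coefficients can be estimated by ordinary least squares once a knot set is fixed. It is an illustration only, not code taken from the paper: the function names, the cubic default r = 3, the simulated data and the example knot locations are all choices made here.

```python
import numpy as np

def truncated_power_basis(x, knots, r=3):
    """Design matrix with columns 1, x, ..., x^r, (x - k_1)_+^r, ..., (x - k_m)_+^r."""
    x = np.asarray(x, dtype=float)
    knots = np.asarray(knots, dtype=float)
    poly = np.vander(x, N=r + 1, increasing=True)              # 1, x, ..., x^r
    trunc = np.maximum(x[:, None] - knots[None, :], 0.0) ** r  # truncated power terms
    return np.hstack([poly, trunc])

def fit_spline_ols(x, y, knots, r=3):
    """Ordinary least squares estimates of the coefficients in (1) for a fixed knot set."""
    X = truncated_power_basis(x, knots, r)
    coef, *_ = np.linalg.lstsq(X, np.asarray(y, dtype=float), rcond=None)
    return coef, X @ coef                                      # coefficient estimates, fitted values

# Hypothetical usage: noisy observations of a smooth curve, three preset knots.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 100))
y = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.2, size=x.size)
coef, fitted = fit_spline_ols(x, y, knots=[0.25, 0.50, 0.75])
```

Once the knots are fixed, the least squares step above is routine; the difficulty that motivates the comparison in this paper lies in searching over knot configurations.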
Typically this optimization is a very hard problem, as (i) the search space is usually huge and (ii) different candidate solutions may have different dimensions. The main purpose of this paper is to study the performance of some common algorithms that are designed for carrying out such optimization problems. Both the univariate and bivariate settings will be considered (but the question of which principle is the most appropriate for defining a "theoretically best" set of knots is not considered here). Our hope is that, by separating the two issues of "defining the best" and "locating the defined best", and by performing a focused study on the latter, useful computational hints can be obtained for reference by future researchers. Due to the complicated nature of the regression spline optimization algorithms, theoretical comparison seems to be very difficult. Therefore the study presented here is entirely based on empirical experiments. It is of course impossible to exhaust all possible experimental settings, but we follow Wand (2000) and use a family approach to alleviate this problem: one experimental factor (e.g., the signal-to-noise ratio) is changed at a time so that patterns can be detected more easily.

From (1) it is clear that f(x) is a linear combination of {x^j}, j = 0, ..., r, and {(x − k_j)_+^r}, j = 1, ..., m. This set of functions is known as the truncated power basis of degree r. Other basis functions for regression spline fitting also exist, such as B-splines, natural splines and radial basis functions (e.g., see Eilers and Marx, 1996; Green and Silverman, 1994). However, as pointed out by Wand (2000), numerical experimental results should not be very sensitive to the choice of basis functions, and for simplicity this article therefore concentrates on the truncated power basis.

The rest of this article is organized as follows. Section 2 provides further background on univariate regression spline smoothing and on the use of generalized cross-validation (GCV) for defining a "best" estimate. Sections 3 and 4 then describe six different optimization algorithms. These six algorithms are compared in Section 5 through a simulation study. Finally, Section 6 considers the bivariate setting. Conclusions and recommendations for the univariate and the bivariate cases are reported in Sections 5.5 and 6.3 respectively.

2 REGRESSION SPLINE SMOOTHING USING GCV

Suppose that n pairs of measurements {x_i, y_i}, i = 1, ..., n, satisfying

y_i = f(x_i) + e_i,    e_i \sim \text{iid } N(0, \sigma^2),

are observed. The goal is to estimate f, which is assumed to satisfy (1). In this article it is assumed that r = 3 and min(x_i) < k_1 < ... < k_m < max(x_i). Furthermore, following the work of previous authors (e.g., see Friedman and Silverman, 1989; Koo, 1997; Smith and Kohn, 1996), {k_1, ..., k_m} is restricted to be a subset of {x_1, ..., x_n}. Such a restriction should not have any serious adverse effect on the quality of the final curve estimate. Since f can be completely specified by θ = {k, β}, where k = (k_1, ..., k_m)^T and β = (β_0, ..., β_r, b_1, ..., b_m)^T, the estimation of f can be achieved via estimating θ. Notice that different estimates θ̂ of θ may have different dimensions (i.e., different numbers of parameters). Various methods have been proposed for choosing the dimension of a "best" θ̂. One of the earliest proposals is the use of generalized cross-validation (GCV) (e.g., see Friedman and Silverman, 1989; Friedman, 1991; Pittman, 1999), in which the "best" estimate of θ, or equivalently of f, is defined as the one that minimizes

\mathrm{GCV}(\hat{\theta}) = \frac{(1/n) \sum_{i=1}^{n} \{ y_i - \hat{f}(x_i) \}^2}{\{ 1 - d(m)/n \}^2}.    (2)

Here d(m) is an increasing function of the number of knots m. In order to penalize the additional flexibility conferred by the free choice of knot locations, Friedman and Silverman (1989) suggested using d(m) = 3m + 1 instead of the conventional GCV choice d(m) = m + 1. In the sequel this GCV choice of θ̂, with d(m) = 3m + 1, will be taken as the target at which the optimization algorithms should aim.

Let x = (x_1, ..., x_n)^T, y = (y_1, ..., y_n)^T and k̂ = (k̂_1, ..., k̂_m̂)^T be an estimate of k. Denote the "design matrix" by X = (1, x, ..., x^r, (x − k̂_1 1)_+^r, ..., (x − k̂_m̂ 1)_+^r), where 1 is an n × 1 vector of ones. If r and k̂ are specified beforehand, then the unique maximum likelihood estimate β̂ of β (conditional on r and k̂) can be obtained by ordinary least squares regression and admits the closed form expression β̂ = (X^T X)^{-1} X^T y. This means that θ̂ = {k̂, β̂} is completely determined by k̂. Therefore the only effective argument in the minimization of GCV(θ̂) (or of any similar criterion) is k, i.e., the knots. In the next two sections two classes of knot-based optimization algorithms are described. Their effectiveness, in terms of minimizing GCV(θ̂), is studied in Section 5 via simulations.
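To make the preceding formulas concrete, the short Python sketch below builds the truncated power basis design matrix X for a given knot vector, computes the ordinary least squares fit conditional on the knots, and evaluates the GCV criterion (2) with the Friedman–Silverman penalty d(m) = 3m + 1. It is an illustrative reading of the definitions above rather than the author's code; the function names, and the use of a least-squares solver in place of the explicit (X^T X)^{-1} X^T y formula, are our own choices.

import numpy as np

def design_matrix(x, knots, r=3):
    """Truncated power basis of degree r: columns 1, x, ..., x^r and (x - k_j)_+^r."""
    cols = [x ** p for p in range(r + 1)]
    cols += [np.maximum(x - k, 0.0) ** r for k in knots]
    return np.column_stack(cols)

def fit_ols(x, y, knots, r=3):
    """Ordinary least squares estimate of the coefficients, conditional on the knots."""
    X = design_matrix(x, knots, r)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # numerically safer than (X'X)^(-1) X'y
    return beta, X @ beta

def gcv(x, y, knots, r=3):
    """GCV criterion (2) with the Friedman-Silverman penalty d(m) = 3m + 1."""
    n, m = len(y), len(knots)
    _, fitted = fit_ols(x, y, knots, r)
    rss_over_n = np.mean((y - fitted) ** 2)
    return rss_over_n / (1.0 - (3 * m + 1) / n) ** 2

# Example: evaluate GCV for three knots placed at selected design points
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(size=200))
y = np.sin(8 * np.pi * x) + 0.25 * rng.standard_normal(200)
print(gcv(x, y, knots=x[[50, 100, 150]]))

Given helpers of this kind, any knot-search strategy reduces to repeatedly evaluating the GCV criterion over candidate knot subsets, which is the task addressed by the algorithms in the next two sections.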
Besides GCV, other model selection principles that have been applied to derive various "best" knot sets include Akaike's information criterion (AIC) (e.g., Koo, 1997; Kooperberg et al., 1997) and the minimum description length (MDL) principle (e.g., Lee, 2000). Some preliminary numerical experiments were also conducted for both AIC and MDL. The empirical conclusions obtained from these experiments are similar to those for GCV and, owing to space limitations, are not reported here.

3 STEPWISE KNOT SELECTION

The most popular method for searching for the minimizer of GCV(θ̂) (or of any similar criterion) appears to be stepwise knot selection (e.g., see Friedman, 1991; Friedman and Silverman, 1989; Hansen, Kooperberg and Sardy, 1998; Koo, 1997; Kooperberg et al., 1997; Lee, 2000 and references given therein). The idea is very similar to stepwise regression for the classical subset selection problem. It begins with an initial model as the current model. Then, at each time step, a new model is obtained by either adding a new knot to or removing an existing knot from the current model. This process continues until a certain stopping criterion is met. Amongst all the candidate models visited by the algorithm, the one that gives the smallest value of GCV(θ̂) is chosen as the final model. In this article the following four versions of this stepwise method are considered; a comparison of their relative computational speeds is given in Section 5.5.

1. Forward Addition (ForAdd): Set the initial model as the model with no knots and compute its GCV(θ̂) value. Add a new knot to the current model at each time step. The knot is chosen in such a way that, when it is added, it produces the largest reduction (or smallest increase) in the current value of GCV(θ̂). Continue this knot addition process until the number of knots added hits a pre-selected limit; throughout the simulation study reported below, this limit was set to n/3. When the knot-adding process finishes, a nested sequence of candidate models has been produced, and the one that minimizes GCV(θ̂) is selected as the final model.
2. Backward Elimination (BackElim): This version begins by placing a relatively large number of knots in the initial model. One typical strategy is to place a knot at every s (usually 3 ≤ s ≤ 5) sorted values of the design points x_i; in our simulation study we used s = 3. The algorithm then removes one knot at a time until there are no more knots to be removed. At each time step the knot to be removed is chosen in such a way that, when it is removed, it provides the largest reduction (or smallest increase) in the current value of GCV(θ̂). As with ForAdd, at the end of the process a nested sequence of candidate models is obtained, and the one that gives the smallest GCV(θ̂) is selected as the final model.

3. Addition followed by Elimination (AddElim): This version uses the final model selected by a first application of ForAdd as the initial model for an execution of BackElim. The resulting model obtained from that execution of BackElim is taken as the final model.

4. Elimination followed by Addition (ElimAdd): This version uses the final model selected by a first application of BackElim as the initial model for an execution of ForAdd. The resulting model obtained from that execution of ForAdd is taken as the final model.
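As an illustration of how such a stepwise search can be organized, the sketch below implements a greedy forward-addition pass of the kind described under ForAdd, reusing the hypothetical gcv helper from the earlier sketch. The n/3 limit follows the paper, while the remaining details (for example, restricting candidate knots to the interior design points) are our own simplifications rather than a description of the author's implementation.

def forward_addition(x, y, r=3, max_knots=None):
    """ForAdd sketch: greedily add knots, then return the visited model with smallest GCV.

    Assumes x is sorted and uses the gcv helper sketched after Section 2."""
    if max_knots is None:
        max_knots = len(x) // 3                      # pre-selected limit used in the paper
    knots = []
    best_value, best_knots = gcv(x, y, [], r), []    # start from the model with no knots
    candidates = list(x[1:-1])                       # keep knots strictly inside (min x, max x)
    for _ in range(max_knots):
        # pick the candidate whose addition gives the largest reduction in GCV
        value, k = min((gcv(x, y, sorted(knots + [k]), r), k) for k in candidates)
        knots.append(k)
        candidates.remove(k)
        if value < best_value:
            best_value, best_knots = value, sorted(knots)
    return best_value, best_knots

Backward elimination works analogously, starting from a dense knot set and removing the knot whose deletion most reduces GCV; AddElim and ElimAdd simply chain the two passes.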
4 GENETIC ALGORITHMS

Another class of optimization algorithms that has been applied to regression spline smoothing is genetic algorithms; e.g., see Lee (2002), Pittman (1999), Pittman and Murthy (2000) and references given therein. Below we begin with a brief description of genetic algorithms. As we shall see, an important issue in applying genetic algorithms is how to represent a candidate model as a chromosome. Two different chromosome representation methods will be described and compared. For general introductions to genetic algorithms see, for example, Davis (1991), Fogel (2000) and Michalewicz (1996).

4.1 General Description

The use of genetic algorithms for solving optimization problems can be described briefly as follows. An initial set, or population, of possible solutions to an optimization problem is obtained and represented in vector form. Typically these vectors are of the same length and are often called chromosomes. They are free to "evolve" in the following way. First, parent chromosomes are randomly chosen from the initial population: chromosomes having lower (or higher) values of the objective criterion to be minimized (or maximized) have a higher chance of being chosen. Offspring are then reproduced by applying either a crossover or a mutation operation to these chosen parents. Once a sufficient number of such second-generation offspring have been produced, third-generation offspring are produced from them in a similar manner, and this reproduction process continues for a number of generations. If one believes in Darwinian natural selection, the expectation is that the objective criterion values of the offspring will gradually improve over generations, i.e., approach the optimal value. For the current problem the objective function is GCV(θ̂) and one chromosome represents one θ̂.

In a crossover operation one child chromosome is reproduced by "mixing" two parent chromosomes. The aim is to allow the possibility that the child receives different best parts from its parents. A typical "mixing" strategy is that every child gene location has an equal chance of receiving either the corresponding father gene or the corresponding mother gene. This crossover operation is the distinctive feature that separates genetic algorithms from other optimization methods.

In a mutation operation one child chromosome is reproduced from one parent chromosome. The child is essentially the same as its parent except for a small number of genes, where randomness is introduced to alter the types of these genes. Such a mutation operation prevents the algorithm from being trapped in local optima.

To preserve the best chromosome of the current generation, an additional step that one may perform is the elitist step: replace the worst chromosome of the next generation with the best chromosome of the current generation. Inclusion of this elitist step guarantees the monotonicity of the algorithm.

4.2 Chromosome Representations

This section describes two methods for representing a θ̂ in chromosome form: the first method is described in Lee (2002), while the second can be found in Pittman (1999). Recall that for the current curve fitting problem a possible solution θ̂ can be uniquely specified by k̂, and that k̂ is assumed to be a subset of the design points {x_1, ..., x_n}. Thus, for the present problem, a chromosome only needs to carry information about k̂.

1. Fixed-Length Representation (GeneFix): In this representation all chromosomes are binary vectors of the same length n, the number of data points. A simple example illustrates the idea. Suppose n = 15 and k̂ = {x_3, x_8, x_14}; i.e., there are three knots in this candidate model, located at x_3, x_8 and x_14. If "0" denotes a normal gene and "1" denotes a knot gene, then the chromosome for this example is 001000010000010. One advantage of this method is that it can easily be extended to handle cases in which outliers and/or discontinuities are allowed: simply introduce two additional types of genes, outlier genes and discontinuity genes.

2. Variable-Length Representation (GeneVar): For this method the chromosomes are neither binary nor of the same length. Here the location of a knot is represented by one integer-valued gene, and hence the length of a chromosome equals the number of knots in the candidate model. For the previous example the corresponding chromosome is 3-8-14 (the hyphens merely separate the genes). Since the chromosomes are generally not of the same length, the optimization is done in the following way. First pre-specify the minimum and maximum numbers of knots that any candidate model can have; denote them by K_MIN and K_MAX respectively. Then, for each K = K_MIN, ..., K_MAX, apply the algorithm to find the best candidate model amongst only those candidate models that have K knots; that is, for this particular run of the algorithm, all chromosomes are restricted to have the same length K. After the algorithm has been executed for every value of K, a sequence of K_MAX − K_MIN + 1 models is obtained, and the one that minimizes GCV(θ̂) is chosen as the final model. Please consult Lee (2002) and Pittman (1999) for further details on the implementations of these algorithms.
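A schematic version of the fixed-length (GeneFix) representation is sketched below: each chromosome is a length-n binary vector marking knot positions, parents are selected with probability decreasing in GCV, offspring are formed by uniform crossover and bit-flip mutation, and an elitist step reinstates the best chromosome of the previous generation. The population size, mutation rate, number of generations, initialization density and the inverse-GCV selection rule are our own illustrative defaults, not values taken from the paper, and the gcv helper is again the hypothetical one sketched earlier.

import numpy as np

def genefix(x, y, r=3, pop_size=40, n_gen=50, p_mut=0.02, seed=0):
    """GeneFix sketch: fixed-length binary chromosomes marking knot positions."""
    rng = np.random.default_rng(seed)
    n = len(x)

    def score(chrom):
        knots = x[chrom]
        # guard against models so large that the GCV denominator becomes non-positive
        return gcv(x, y, knots, r) if 3 * len(knots) + 1 < n else np.inf

    pop = rng.random((pop_size, n)) < 0.02            # sparse random initial knot sets
    fit = np.array([score(c) for c in pop])
    for _ in range(n_gen):
        elite = pop[fit.argmin()].copy()              # remember the current best chromosome
        weights = 1.0 / fit
        probs = weights / weights.sum()               # smaller GCV -> higher selection chance
        parents = rng.choice(pop_size, size=(pop_size, 2), p=probs)
        children = []
        for a, b in parents:
            mask = rng.random(n) < 0.5                # uniform crossover
            child = np.where(mask, pop[a], pop[b])
            child ^= rng.random(n) < p_mut            # bit-flip mutation
            children.append(child)
        pop = np.array(children)
        fit = np.array([score(c) for c in pop])
        worst = fit.argmax()                          # elitist step: worst child replaced
        pop[worst], fit[worst] = elite, score(elite)
    best = pop[fit.argmin()]
    return fit.min(), x[best]

The variable-length (GeneVar) representation would instead fix the number of ones per chromosome at K and repeat the search for each K between K_MIN and K_MAX, as described above.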
5 SIMULATION STUDY

This section presents the results of a series of numerical experiments conducted to evaluate the performances of the various optimization algorithms (four stepwise and two genetic) described above. These experiments were designed to study the effects of varying (i) the noise level, (ii) the design density, (iii) the noise variance function, and (iv) the degree of spatial variation. The simulation was conducted in the following way. For each of the experimental setups described below, 100 artificial noisy data sets, each with 200 design points x_i, were generated. Then, for each of these 100 data sets, the six algorithms above were applied to minimize GCV(θ̂) and the corresponding f̂ was obtained. For each f̂, both the GCV value and the mean-squared error (MSE) value, defined as (1/n) Σ_i {f(x_i) − f̂(x_i)}², were computed and recorded. Paired Wilcoxon tests were then applied to test whether the difference between the median GCV (and MSE) values of any two algorithms is significant. The significance level used was 5/6% = 0.83%. If the median GCV (or MSE) value of an algorithm is significantly less than those of the remaining five, it is assigned GCV (or MSE) rank 1. If the median GCV (or MSE) value of an algorithm is significantly larger than that of one algorithm but less than those of four algorithms, it is assigned GCV (or MSE) rank 2, and similarly for ranks 3 to 6. Algorithms having non-significantly different median values share the same averaged rank. Ranking the algorithms in this manner provides an indicator of the relative merits of the methods (see Wand, 2000). Since the main interest of this article is to compare the algorithms in terms of their ability to conduct numerical minimization, GCV seems to be a more appropriate performance measure here than MSE.

5.1 Varying Noise Level

For this first experiment four test functions were used. They are given in Table I and plotted in Figure 1. These test functions possess different characteristics and have been used by many previous authors. The data (x_i, y_i) were generated using y_i = f(x_i) + e_i, where f is a test function, x_i is drawn from Unif[0, 1], and e_i is a zero-mean Gaussian error with variance σ². Three signal-to-noise ratios (SNRs) were used, SNR = 2, 4 and 6, where SNR is defined as ||f||/σ. Boxplots of the GCV and log(MSE) values for the six algorithms, together with their Wilcoxon test rankings, are given in Figures 2 and 3.

TABLE I  Formulae for Test Functions.
Test Function 1: f(x) = (4x − 2) + 2 exp{−16(4x − 2)²}, x ∈ [0, 1]
Test Function 2: f(x) = sin³(2πx³), x ∈ [0, 1]
Test Function 3: f(x) = 4x²(3 − 4x) for x ∈ [0, 0.5); (4/3)x(4x² − 10x + 7) − 3/2 for x ∈ [0.5, 0.75); (16/3)x(x − 1)² for x ∈ [0.75, 1]
Test Function 4: f(x) = sin(8πx), x ∈ [0, 1]

FIGURE 1  Four test functions.
FIGURE 2  Boxplots of the GCV values for the "Varying Noise Level" experiment. In each panel the boxplots correspond respectively to, from left to right, ForAdd, BackElim, AddElim, ElimAdd, GeneFix and GeneVar. The number below each boxplot is the Wilcoxon test ranking.
FIGURE 3  Similar to Figure 2 but for log(MSE).

5.2 Varying Design Density

The same four test functions as in the Varying Noise Level experiment were used. The data (x_i, y_i) were again generated from y_i = f(x_i) + e_i with σ = ||f||/4 (i.e., SNR = 4), but now the x_i were drawn from three different densities: Beta(1.2, 1.8), Beta(1.5, 1.5) and Beta(1.8, 1.2), whose probability density functions are plotted in Figure 4. Boxplots of the GCV and log(MSE) values, and Wilcoxon test rankings of the algorithms, are displayed in Figures 5 and 6 in a similar manner as before.

FIGURE 4  Plots of the three different design densities.

5.3 Varying Noise Variance Function

Again the same four test functions as in the Varying Noise Level experiment were used, but the variance of the Gaussian noise was not the same for all values of x_i. Rather, the variance was specified by a variance function v(x), and the data (x_i, y_i) were generated from y_i = f(x_i) + v(x_i)e_i, where σ = ||f||/4 and the x_i were drawn from Unif[0, 1]. The following three variance functions v(x) were used:

Variance Function 1: v(x) = 0.5 + x, x ∈ [0, 1];
Variance Function 2: v(x) = 1.3 − 1.2x for x ∈ [0, 0.5), and 0.2 + 1.2x for x ∈ [0.5, 1];
Variance Function 3: v(x) = 1.5 − x, x ∈ [0, 1].

These variance functions are plotted in Figure 7. Boxplots of the GCV and log(MSE) values, together with the Wilcoxon test rankings, are given in Figures 8 and 9 in a similar manner as before.

FIGURE 7  Plots of the noise variance functions.
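For readers wishing to generate data of the kind used in these experiments, the sketch below draws one artificial data set: uniform design points, a test function from Table I, noise scaled so that ||f||/σ equals a chosen SNR, and an optional variance function v(x). Reading ||f|| as the root mean square of f over the design points is our own interpretation of the norm, and the helper names are illustrative rather than taken from the paper.

import numpy as np

def test_fn_1(x):
    """Test Function 1 from Table I."""
    return (4 * x - 2) + 2 * np.exp(-16 * (4 * x - 2) ** 2)

def variance_fn_1(x):
    """Variance Function 1 from Section 5.3: v(x) = 0.5 + x."""
    return 0.5 + x

def simulate(f, n=200, snr=4.0, v=None, seed=None):
    """One artificial data set: y_i = f(x_i) + v(x_i) e_i with sigma = ||f|| / SNR."""
    rng = np.random.default_rng(seed)
    x = np.sort(rng.uniform(size=n))
    signal = f(x)
    sigma = np.sqrt(np.mean(signal ** 2)) / snr   # ||f|| read as the RMS of f (an assumption)
    scale = v(x) if v is not None else 1.0
    y = signal + scale * sigma * rng.standard_normal(n)
    return x, y

# Homoscedastic and heteroscedastic examples at SNR = 4
x, y = simulate(test_fn_1, snr=4.0)
x2, y2 = simulate(test_fn_1, snr=4.0, v=variance_fn_1)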
5.4 Varying Spatial Variation

In this experiment the six test functions were taken from Wand (2000); they have different degrees of spatial variation. They are indexed by a single integer parameter j and have the form

f_j(x) = \sqrt{x(1 - x)} \, \sin\!\left[ \frac{2\pi \{ 1 + 2^{(9 - 4j)/5} \}}{x + 2^{(9 - 4j)/5}} \right],    j = 1, \ldots, 6.

Note that this is of the same structural form as the Doppler function introduced by Donoho and Johnstone (1994). The data (x_i, y_i) were generated from y_i = f_j(x_i) + e_i with σ = ||f_j||/4 and the x_i drawn from Unif[0, 1]. The test functions, together with typical simulated data sets, are plotted in Figure 10. In a similar fashion as before, boxplots of GCV and log(MSE) values, and Wilcoxon test rankings of the algorithms, are displayed in Figures 11 and 12.

FIGURE 10  Plots of the test functions, together with typical simulated data sets, for the Varying Spatial Variation experiment.
FIGURE 11  Similar to Figure 2 but for the Varying Spatial Variation experiment.

5.5 Empirical Conclusions and Recommendations

The six algorithms gave fairly consistent performances across the different experimental settings above. The averaged Wilcoxon GCV and MSE test rankings are given in Tables II and III respectively. Judging from the overall averaged rankings, the two genetic algorithms appear to be superior to the stepwise procedures, especially in terms of minimizing GCV(θ̂). The computation times taken by the stepwise procedures to finish one run under the various settings were reasonably constant: typical running times on a Sun Ultra-60 workstation for ForAdd, BackElim, AddElim and ElimAdd were respectively 15 s, 38 s, 15 s and 53 s. These timings are best treated as upper bounds, as we did not optimize the code in our implementation. For the genetic algorithms the execution times were quite variable: for GeneFix they ranged from 50 s to 70 s, while for GeneVar they ranged from 50 s to 100 s. Therefore, if computation time is not an issue, one should perhaps use the genetic algorithms; if time is an important issue, AddElim seems to be a good compromise.

One last interesting observation is that for some settings the GCV and MSE values are not monotonically related; i.e., a θ̂ that has a small GCV value is not guaranteed to have a small MSE value, and vice versa. This may suggest that GCV(θ̂) is not the best criterion if the goal is to obtain the θ̂ that minimizes MSE.

TABLE II  Average Wilcoxon GCV Test Rankings for the Four Univariate Experimental Settings: Noise Level (NoiseL), Design Density (DesignD), Variance Function (VarFn) and Spatial Variation (SpaVar).
Algorithm   NoiseL  DesignD  VarFn  SpaVar  Overall
ForAdd      6.00    5.96     6.00   4.67    5.66
BackElim    4.50    4.54     4.50   5.17    4.68
AddElim     3.84    3.62     3.71   2.83    3.50
ElimAdd     3.33    3.21     3.33   3.92    3.45
GeneFix     1.71    1.88     1.88   1.42    1.72
GeneVar     1.63    1.79     1.59   3.00    2.00

TABLE III  Similar to Table II but for the Wilcoxon MSE Test Rankings.
Algorithm   NoiseL  DesignD  VarFn  SpaVar  Overall
ForAdd      3.33    3.21     3.08   2.50    3.03
BackElim    4.96    5.13     5.21   5.50    5.20
AddElim     4.08    3.50     3.88   2.92    3.60
ElimAdd     4.25    4.03     4.54   4.33    4.29
GeneFix     2.21    2.29     2.29   2.25    2.26
GeneVar     2.17    2.83     2.00   3.50    2.63
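As a concrete rendering of the spatially varying family used in Section 5.4, the following sketch evaluates f_j(x) for j = 1, ..., 6 directly from the formula given there; such a function can be passed to a data-generating helper like the simulate sketch shown earlier. The implementation is our own illustration, not code from the article.

import numpy as np

def spatial_fn(x, j):
    """Spatially varying test function f_j(x), j = 1, ..., 6 (family of Wand, 2000)."""
    a = 2.0 ** ((9 - 4 * j) / 5)
    return np.sqrt(x * (1 - x)) * np.sin(2 * np.pi * (1 + a) / (x + a))

# e.g. the most spatially variable member, j = 6, on a fine grid
xs = np.linspace(0.0, 1.0, 401)
f6 = spatial_fn(xs, 6)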
6 BIVARIATE SMOOTHING

This section considers bivariate smoothing. Due to space limitations, the description will be kept minimal. Important references include Friedman (1991) and Kooperberg et al. (1997).

6.1 Background

The data {y_i, x_{1i}, x_{2i}}, i = 1, ..., n, are now assumed to satisfy

y_i = f(x_{1i}, x_{2i}) + e_i,    e_i \sim \text{iid } N(0, \sigma^2),    (3)

where f is the bivariate "true" surface to be recovered. It is assumed that f can be modeled by

f(x_1, x_2) = \sum_j b_j B_j(x_1, x_2),

where the basis functions B_j(x_1, x_2) are of the form 1, x_1, x_2, (x_1 − k_{1u})_+, (x_2 − k_{2v})_+, x_1 x_2, x_2(x_1 − k_{1u})_+, x_1(x_2 − k_{2v})_+ and (x_1 − k_{1u})_+(x_2 − k_{2v})_+. Also, the knots are restricted to be a subset of the marginal values of the design points (x_{1i}, x_{2i}). In our simulation a bivariate version of GCV(θ) is used to define a "best" f̂.

Driven by the univariate simulation results, two algorithms for minimizing the GCV function are investigated: stepwise knot addition followed by elimination, and genetic algorithms with fixed-length chromosomes. These two algorithms are straightforward extensions of their univariate counterparts. In fact the stepwise procedure compared in our simulation below was taken from the POLYMARS package mainly developed by Charles Kooperberg (see the unpublished manuscript Kooperberg and O'Connor, n.d.). Note that in POLYMARS a "no interactions if no main effects" constraint is placed on all the possible candidate models. The same constraint was also imposed for the genetic algorithm.

6.2 Simulation

A simulation study was conducted. Although it was done on a smaller scale than in the univariate setting, we believe that useful empirical conclusions can still be drawn. Five bivariate test functions were used; they are listed in Table IV and plotted in Figure 13. These functions have been used by, for example, Denison, Mallick and Smith (1998b) and Hwang, Lay, Maechler, Martin and Schimert (1994), and were constructed to have a unit standard deviation and a non-negative range. The data were generated according to (3), with the design points drawn from Unif[0, 1]². The number of design points was 100, and three SNRs were used: 2, 4 and 6. The number of replicates was 100.

TABLE IV  Five Test Functions for the Bivariate Setting.
Test Function 1: f(x_1, x_2) = 10.391{(x_1 − 0.4)(x_2 − 0.6) + 0.36}
Test Function 2: f(x_1, x_2) = 24.234{r²(0.75 − r²)}, r² = (x_1 − 0.5)² + (x_2 − 0.5)²
Test Function 3: f(x_1, x_2) = 42.659{0.1 + x̃_1(0.05 + x̃_1⁴ − 10x̃_1²x̃_2² + 5x̃_2⁴)}, x̃_1 = x_1 − 0.5, x̃_2 = x_2 − 0.5
Test Function 4: f(x_1, x_2) = 1.3356[1.5(1 − x_1) + e^{2x_1 − 1} sin{3π(x_1 − 0.6)²} + e^{3(x_2 − 0.5)} sin{4π(x_2 − 0.9)²}]
Test Function 5: f(x_1, x_2) = 1.9[1.35 + e^{x_1} sin{13(x_1 − 0.6)²} e^{−x_2} sin(7x_2)]
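To illustrate the candidate basis described in Section 6.1, the sketch below assembles columns B_j(x_1, x_2) for given marginal knot sets: the constant, the linear main effects, the truncated lines (x_1 − k_{1u})_+ and (x_2 − k_{2v})_+, the linear interaction, and the products of linear and truncated terms. It is our own illustrative construction of the candidate columns; the model search itself, including the POLYMARS-style "no interactions if no main effects" constraint, is not shown.

import numpy as np

def bivariate_basis(x1, x2, knots1, knots2):
    """Candidate columns B_j(x1, x2): 1, x1, x2, x1*x2, truncated lines and their products."""
    def pos(t):
        return np.maximum(t, 0.0)
    cols = [np.ones_like(x1), x1, x2, x1 * x2]
    cols += [pos(x1 - k) for k in knots1]               # (x1 - k_{1u})_+
    cols += [pos(x2 - k) for k in knots2]               # (x2 - k_{2v})_+
    cols += [x2 * pos(x1 - k) for k in knots1]          # x2 (x1 - k_{1u})_+
    cols += [x1 * pos(x2 - k) for k in knots2]          # x1 (x2 - k_{2v})_+
    cols += [pos(x1 - k1) * pos(x2 - k2)                # (x1 - k_{1u})_+ (x2 - k_{2v})_+
             for k1 in knots1 for k2 in knots2]
    return np.column_stack(cols)

# Example: 100 uniform design points with two knots on x1 and one knot on x2
rng = np.random.default_rng(0)
x1, x2 = rng.uniform(size=100), rng.uniform(size=100)
X = bivariate_basis(x1, x2, knots1=[0.3, 0.7], knots2=[0.5])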