广义线性模型共42页
- 格式:ppt
- 大小:4.48 MB
- 文档页数:42
回归适⽤于⼆值响应变量,连接函数为logit函数,概率分布为⼆项分布:glm(Y ~ X1 + X2 + X3, family=binomial(link="logit"), data=data)Logistics回归模型的表达形式为:模型解读:为暴露于某种状态下的结局概率,是⼀种变量变换⽅式,表⽰对进⾏变换,为偏回归系数,表⽰在其他⾃变量不变的条件下,每变化⼀个单位的估计值。
Logistics回归是通过最⼤似然估计求解常数项和偏回归系数,基本思想时当从总体中随机抽取n个样本后,最合理的参数估计量应该使得这n个样本观测值的概率最⼤。
最⼤似然法的基本思想是先建⽴似然函数与对数似然函数,再通过使对数似然函数最⼤求解相应的参数值,所得到的估计值称为参数的最⼤似然估计值。
1. 数据准备(类别型变量进⾏0/1量化)⾸先,我们选⽤AER包中的Affairs数据集来构建Logistics回归模型,这个数据集记录了⼀组婚外情数据,其中包括参与者性别、年龄、婚龄、是否有⼩孩、宗教信仰程度(5分制,1表⽰反对,5表⽰⾮常信仰)、学历、职业和婚姻的⾃我评分(5分制,1表⽰⾮常不幸福,5表⽰⾮常幸福)。
在使⽤数据集之前,载⼊AER包> library(AER)> data(Affairs,package="AER")对于这个数据集,我们关注是否出轨,即这是⼀个⼆值型结果(出轨过/从未出轨)。
因此,我们接下来将'affaris'特征转化为⼆值型因⼦'ynaffair',该⼆值型因⼦即可以作为Logistic回归的结果变量。
> Affairs$ynaffair[Affairs$affairs > 0] <- 1> Affairs$ynaffair[Affairs$affairs== 0] <- 0> Affairs$ynaffair <-factor(Affairs$ynaffair,levels=c(0,1),labels=c("No","Yes"))2. Logistics模型构建:> myfit <- glm(ynaffair ~ gender + age + yearsmarried + children + religiousness + education + occupation + rating, data=Affairs, family=binomial())> summary(myfit)Coefficients:Estimate Std. Error z value Pr(>|z|)(Intercept) 1.37726 0.88776 1.551 0.120807gendermale 0.28029 0.23909 1.172 0.241083age -0.04426 0.01825 -2.425 0.015301 *yearsmarried 0.09477 0.03221 2.942 0.003262 **childrenyes 0.39767 0.29151 1.364 0.172508religiousness -0.32472 0.08975 -3.618 0.000297 ***education 0.02105 0.05051 0.417 0.676851occupation 0.03092 0.07178 0.431 0.666630rating -0.46845 0.09091 -5.153 2.56e-07 ***-----------------------------------------------Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1从回归系数的p值(最后⼀栏)可以看到,性别、是否有孩⼦、学历和职业对⽅程的贡献都不显著(p值较⼤)。
⼴义线性模型⼴义线性模型GLM是⼀般线性模型的扩展,它处顺序和分类因变量。
所有的组件都是共有的三个组件:随机分量系统分量链接函数===============================================随机分量随机分量跟随响应Y的概率分布例1. (Y1,Y2,。
....YN)可能是正态的。
在这种情况下,我们会说随机分量是正态分布。
该成分导致了普通回归和⽅差分析。
例2. y是Bernoulli随机变量(其值为0或1),即随机分量为⼆项分布时,我们通常关注的是Logistic回归模型或Proit模型。
例2. y是计数变量1,2,3,4,5,6等,即y具有泊松分布,此时的连接函数时ln(E(y)),这个对泊松分布取对数的操作就是泊松回归模型。
============================================系统分量系统组件将解释变量x1、x2、···、xk作为线性预测器:============================================连接函数GLM的第三分量是随机和系统分量之间的链路。
它表⽰平均值µ=e(y)如何通过指定函数关系g(µ)到线性预测器中的解释性变量称G(µ)为链接函数..==============================================⼴义线性模型Y被允许从指数型分布族中得到⼀个分布。
链路函数G(µI)是任何单调函数,并且定义了µI和Xβ之间的关系。
=================================================逻辑回归因变量是⼆进制的评估多个解释变量(可以是数值型变量和/或类别型变量)对因变量的影响。
=============================================模型含义:鸟类的巢址使⽤响应变量是有巢的站点的概率,其中概率计算为p/(1-p),p是有巢的站点的⽐例。
广义线性模型的分析及应用一、引言广义线性模型(Generalized Linear Model, GLM)提供了一种在保持简单性的前提下,对非正态响应变量建立连续性预测模型的方法,适用于许多实际应用问题中。
本文旨在介绍广义线性模型的基本概念、模型构建方法、推断等内容,并通过实际案例的分析加深对GLM的理解与应用。
二、基本概念GLM是统计学中一种具有广泛适用性的模型框架,它的基本思想是将未知的响应变量与已知的协变量之间的关系描述为一个线性预测器和一个非线性函数的组合,即:g(E(Y)) = β_0 + β_1X_1 + ⋯+ β_pX_p其中,g(·)称为联接函数(Link Function),它定义了响应变量的均值与预测变量之间的关系,E(Y)为响应变量的期望,X_1,X_2,…,X_p为解释变量(predictor)或协变量(covariate),β_0, β_1, …, β_p是模型的系数或参数。
GLM假定响应变量Y服从指数分布族中的某一个分布,如正态分布、二项分布、泊松分布等。
三、模型构建方法1. 选择联接函数和分布族:不同的响应变量应选用不同的分布族。
例如,连续性响应变量可选用正态分布,二元响应变量可选用二项分布,而计数型响应变量可选用泊松分布等。
2. 选择解释变量:可使用变量选择算法,如前向选择法、向后选择法、逐步回归等,在给定样本内拟合出最佳模型。
3. 选择估计方法:由于某些非正态分布族无法使用最小二乘法拟合,可以使用极大似然估计法或广义估计方程法。
对于大样本,一般使用广义线性混合模型等。
4. 模型比较与选择:模型拟合后,需要进行模型检验和模型诊断,主要包括残差分析、Q-Q图检验、$R^2$值、F检验、AIC/BIC值等指标的分析。
四、模型应用GLM的应用非常广泛,特别是在医学、生态、社会科学、金融等领域。
下面以某市2019年全年医疗保险数据为例,运用GLM模型进行分析。
1. 数据描述健康保险数据包含了每个缴费人的性别、年龄、缴费金额、报销金额等信息。
第四届R会议北京2011广义线性模型-李欣海广义线性模型Generalized linear model李欣海中科院动物所Generalized Linear Modelg(µ) = β0+ β1x1+ β2x2+ ···+ βk x k GLM is an extension of general linear model that deals with ordinal and categorical response variables. There are three components that are common to all GLMs(McCullagh& Nelder1989) :–Random component–Systematic Component–Link FunctionMcCullagh, P., and J. A. Nelder1989. Generalized linear models. Chapman and Hall.Random Component:The random component: refers to the probability distribution of theresponse Y.Case 1. (Y 1, Y 2, . . ., Y N ) might be normal. In this case, we would say the random component is the normal distribution. This component leads to ordinary regression and analysis of variance models.Case 2. If the observations are Bernoulli random variables (which havevalues 0 or 1), then we would say the link function is the binomialdistribution. When the random component is the binomial distribution, we are commonly concerned with logistic regression models or probit models.Case 3. Quite often the random variables Y 1, Y 2, . . ., Y N have aPoisson distribution. Then we will be involved with Poisson regressionmodels or loglinear models.Systematic ComponentThe systematic component involves theexplanatory variables x 1, x 2, ···, x k .as linear predictors:β0+ β1x 1+ β2x 2+ ···+ βk x kLink FunctionThe third component of a GLM is the link between the random and systematic components.It says how the meanµ= E(Y) relates to the explanatory variables in the linear predictor through specifying a function g(µ):g(µ) = β0+ β1x1+ β2x2+ ···+ βk x kg(µ) is called the link function.Generalized Linear Models•The y i ’s are allowed to have a distribution fromthe exponential family of distributions.•The link function g(μi ) is any monotonic functionand defines the relationship between μi and x i β.kik i 22i 110i X ...X X )(g ββββμ++++=Logistic regression)(11)1(i x i i i e p x y P −+===Dependent variable is binary)(11)0(i x i i i e p x y P +===Linear function Logistic function P x 00.20.40.60.81-10-50510P x0.20.40.60.81-10-50510dt t p x y P ix i i i )21exp(21)1(2−===∫+∞−βαπProbit regression functionP x 00.20.40.60.81-10-50510)(11)1(ix i i i e p x y P −+===ii x x e e+=1ii x xi e ep +−=−111ix e +=11ix ii ep p Odds =−=1ii i x p p =⎟⎟⎠⎞⎜⎜⎝⎛−1ln Logit transformationModel meanings –nest site use of birdsThe response variable was the odds of a site having a nest, where odds are calculated as p/(1-p) and p is the proportion of sites have a nest. The statistical model was:Odds = exp(β0+ β1X 1+ β2X 2+ …βn X n )where n is the number of explanatory variables. The log of the odds is known as the logit transform of p .i x ii e p p Odds =−=1Advantages of Logit•Properties of a linear regression model•Logit between -∞and + ∞•Probability (P) constrained between 0 and 1•Directly related to odds of eventβx αP -1P ln +=⎟⎠⎞⎜⎝⎛ e P -1P βxα+=Assumptions•Dependent variable is binary or dichotomous, vs.continuous dependent variables in linear regression.•The cases are independent.•The independent variables are not linear combinations of each other•No linearity, the population means of the dependent variables at each level of the independent variable are not on a straight line.•No homogeneity of variance, the variance of the errors are not constant.•No normality, the errors are not normally distributed.Example•Risk of developing coronary heart disease (CD) by age (< 60 and > 60 years old)CD> 60 (1)< 60 (0)Present (1)2823Absent (0)1172Odds of disease among the old = 28/11Odds of disease among the young = 23/72 Odds ratio = 7.97R code# Logistic regression# Risk of developing coronary heart disease by age (<60 and >60 years old)coronary1 <-data.frame (present = rep (1, 28), age = 'old')coronary2 <-data.frame (present = rep (0, 11), age = 'old')coronary3 <-data.frame (present = rep (1, 23), age = 'young')coronary4 <-data.frame (present = rep (0, 72), age = 'young')coronary <-rbind (coronary1, coronary2, coronary3, coronary4)coronary <-rbind (coronary3, coronary4, coronary1, coronary2)fit <-glm (present~age, data = coronary, family = binomial ())summary (fit)Coefficients:Estimate Std. Error z value Pr(>|z|)(Intercept) 0.9343 0.3558 2.626 0.00865 ** ageyoung -2.0755 0.4289 -4.839 1.31e-06 *** Age 2.0755 1.1412- Age βαP 1-P ln 1×+=×+=⎟⎠⎞⎜⎝⎛Coefficients:Estimate Std. Error z value Pr(>|z|)(Intercept) -1.1412 0.2395 -4.765 1.89e-06 ***ageold 2.0755 0.4289 4.839 1.31e-06 ***Logistic Regression ModelCoefficientSE Coeff/SEAge 2.0755 0.4289 4.839 Constant -1.1412 0.2395 -4.76518.53.4, e CI 95%0.05) (p 1df with4.839 Test Wald 7.97e ratio Odds )0.4289 x 1.96(2.0755 22.0755==<===±¾β= increase in logarithm of odds ratio for a one unit increase in x •Test of the hypothesis that β = 0(Wald test)df)(1 ( Variance 22β)β=χInterpretation of the coefficients in terms of the oddsratio –An Example•Whether owning a car as afunction of the income. •17 individuals, 14 own a car and 3 do not.Variables in the EquationB S.E.Wald df Exp(B)INCOME 0.69310.80720.73721 2.0Constant-6.23838.97940.482610.00195car1 <-data.frame (income = c (10:12), carowner = rep (0, 3))car2 <-data.frame (income = rep (c (10:12), c (2, 4, 8)), carowner = rep (1, 14))car <-rbind (car1, car2)fit <-glm (carowner ~ income, data = car, family = binomial ())summary (fit)Income Car owner100101101110111111111111120121121121121121121121121Interpretation of the coefficients in terms of the oddsratio –An Example•e β= 2•So: increasing the income by one unit increases the odds of owning a car by a factor of 2 (increase in 100%) so that:(odds after increasing income)/ (odds before increasing income) = 2•If we look at the data we can see that this model predicts perfectly:income 0.69 α income βαP 1-P ln ×+=×+=⎟⎠⎞⎜⎝⎛ 2P1-Pincome income e e e ×=×=×αα69.0income 10P(own)P(not own)Odds of Owning a car10212/3=0.661/3=0.330.66/0.33=211414/5=0.81/5=0.20.8/0.2=412818/9=0.8881/9=0.1110.888/0.111=8car ownerMarginal effect of a change in Xln[p/(1-p)] = α+ βX + eThe slope coefficient (β) is interpreted as the rate of change in the "log odds" as X changes …not very useful.•We are also interested in seeing the effect of an explanatory variable on the probability of the event occurring•p = 1/[1 + exp(-α-βX)]The marginal effect of a change in X on the probability is:əp/əX = βp(1-p))()(1111X X eeβαβαβ++−+×+×=Basically, the size of the ‘marginal effect’will depend on two things:–βcoefficient–The initial value of XMarginal Effects: βxP(1-P)•Passing or failing an exam as a function of the number of hours of study•Previous study indicated the estimates of αandβwere:α= -5, β= 0.3•So what’s the effect of studying one more hour in the probability of the event occurring:Initial hoursof study P1-P P(1-P)Marginal effect50.029 0.971 0.028 0.009100.119 0.881 0.105 0.031150.378 0.622 0.235 0.071200.731 0.269 0.197 0.059250.924 0.076 0.070 0.021300.982 0.018 0.0180.005The importance of the initial value of X in themarginal effectLogistic Curves0.10.20.30.40.50.60.70.80.91-19-16-13-1-7-4-1258111417Logistic Curve bo=0.5, b1=0.5Big EffectSmall EffectSmall EffectStarting the change from the central values of X will have a higher impact on the probability of the event occurring than starting from very low or very high values of X.Some useful R codes# Logistic regressionfit <-glm(carowner~ income, data = car, family = binomial())summary (fit) # display resultsconfint(fit) # 95% CI for the coefficientsexp(coef(fit)) # exponentiated coefficientsexp(confint(fit)) # 95% CI for exponentiated coefficientspred= predict (fit, type= "response") # predicted values (logit) res= residuals (fit, type= "deviance") # residualsHow to estimate model coefficientsMaximum likelihood estimation (MLE)iiy i y i i )p (p )P(y −−=11For one observationLikelihood function=−−=n i y i y i ii)p (p L 111)(θGoodness of fit for the full model-likelihood ratio test (LR)•We compare the value of the likelihood function in a model with the variables with the value of the likelihood function in a model without the variables. The test:where is the log likelihood value of the null model (only intercept included); is the log likelihood value of the full model (taking into account of all variable parameters).–The statistic is distributed as χ2 with as many degrees of freedomas coefficients we are restrictingkS )L L (L L LR 20ˆ2ˆ2χ⇒−−−=0ˆL L SL L ˆ# likelihood ratio testfit.full <-glm (present ~ ., data = coronary, family = binomial ())fit.null <-glm (present ~ NULL, data = coronary, family = binomial ())lrtest (fit.full, fit.null)Goodness of fit -AnalogousR2)ˆ2(ˆ20SL L L L −−−Refer to total sum of squareRefer to regression sum of square Likelihood ratio index (LRI):200)ˆ2)ˆ2(ˆ2(LRI RL L L L L L S=−−−−=0ˆ2L L −/n adj)L (RR R R202max 222ˆ1−==# R codelibrary (Design) # required for lrm()fit2 <-lrm (y ~ x1 + x2, data = data1)fit2[[3]][10] # R squareStepwise Regression base on Akaike’s Information Criterion (AIC)AIC = -2 ln (likelihood) + 2KK = number of parameters in the model, including 1for the constant and 1 for the error term443322110X X X X Y βββββ++++=K = 6For small samples (n /K < 40), use AIC c for small sample size1)1(2AIC AIC c −−++=K n K K # R codestep (fit) # Stepwise Regression25Sample plots 35Control plots 35Habitat factors 11Elevation (m)Area of rice fields nearby (ha)Human disturbanceNumber of trees within 100 m 2Mean tree height within 100 m 2 (m)Nest position on the slopeSlope aspect (°)Slope gradient (°)Nest tree height (m)Nest aspect (°)Coverage above the nest (%)Nest site selection of the crested ibisControl plots Nest sites26 0 20 40 kmSource data 10500100015002000250005101520253035SitesE l e v a t i o n (m )Elevation (Nest sites)Elevation (Control plots)51015202505101520253035SitesM e t e rHeight of nest tree (Nest sites)Height of nest tree (Control plots)024681005101520253035SitesDisturbance (Nest sites)Disturbance (Control plots)5010015020025030035005101520253035SitesArea of rice field nearby (Nest sites)Area of rice field nearby (Control plots)Source data 20.00.30.60.91.25101520253035SitesNest aspect (Nest sites)Nest aspect (Control plots)0.020.040.060.080.0100.005101520253035SitesCoverage above the nest (Nest sites)Coverage above the nest (Control plots)0.00.40.81.21.62.005101520253035SitesNest position on the slope (Nest sites)Nest position on the slope (Control plots)0.00.30.60.91.205101520253035SitesSlope aspect (Nest sites)Slope aspect (Control plots)Source data 30.030.060.090.05101520253035SitesSlope gradient (Nest sites)Slope gradient (Control plots)0.05.010.015.020.005101520253035SitesMean tree height (Nest sites)Mean tree height (Control plots)0.05.010.015.020.005101520253035SitesNumber of trees within the site (Nest sites)Number of trees within the site (Control plots)CorrelationHabitat variablesCorrelation coefficientsMean S.D. 12345678910111. Elevation (m)1-0.72*-0.48*-0.70*0.21-0.020.39*-0.38*0.1620.34*0.21894.00176.532. Area (ha) of ricefields within 1km210.53*0.49*-0.23-0.08-0.230.230.05-0.21-0.1211.62 5.403. Humandisturbance10.220.06-0.1540.150.38*0.10-0.020.08 1.40 1.52 4. Number of treeswithin 100 m21-0.37*-0.00-0.52*0.34*-0.330.012-0.258.11 3.53 5. Mean tree heightwithin 100 m2 (m)1-0.240.23-0.34*0.32*0.11-0.0611.23 3.06 6. Nest position onthe slope10.030.22-0.21-0.00-0.07 2.030.45 7. Slope aspect(South = 1,North = 0)1-0.150.180.55*0.060.450.298. Slope gradient (°)1-0.050.100.0125.697.019. Nest tree height (m)1-0.08-0.2314.80 2.3610. Nest aspect(South = 1,North = 0)10.320.430.3211. Coverage abovethe nest (%)149.00%16.53%The Pearson correlations between the 11 habitat variables measured at 35 nest sites of crested ibis in Yang county, Shaanxi province, China. Mean values and standard deviations (S.D.) are also shown.Step Habitat features Selection coefficientsStandard ErrorP value for model selectionAIC 1Nest tree height (m)0.940.38<.000163.3562Human disturbance -0.990.400.000150.4753Slope aspect-5.82 3.250.001341.7274Area of rice fields nearby (ha)0.350.190.010936.2525Nest position on the slope 3.73 2.300.047834.3366Mean tree height within 100 m 2(m)0.280.270.0320 31.9247Nest aspect 54.928531.53780.011226.0488Slope gradient (°)-0.40800.36020.286623.2269Coverage above the nest 0.52010.55860.084124.32210Number of trees within 100 m 2-0.0068300.006160.116025.76411Elevation (m)0.076700.13280.145027.275Stepwise logistic regression for modeling nest site selection of crested ibis in Yang County, Shaanxi Province, China.Model equationlogit(p) = –20.99 + 0.94×nest tree height–0.99×human disturbance+ 3.63×nest position+ 0.35×rice paddy area + …Probability of nest selection:P = e logit(p)/(1 + e logit(p))•R-Square 0.7380•Max-rescaled R-Square 0.9840李欣海, 马志军, 李典谟, 丁长青, 翟天庆, 路宝忠。
广义线性模型广义线性模型*(Nelder和Wedderburn,1972)除了正态分布,也允许反应分布,以及模型结构中的一定程度的非线性。
GLM具有基本结构g(μi)=X iβ,其中μi≡E(Yi),g是光滑单调'链接函数',Xi是模型矩阵的第i行,X和β是未知参数的向量。
此外,GLM通常会做出Yi是独立的和Yi服从一些指数族分布的假设。
指数族分布包括许多对实际建模有用的分布,如泊松分布,二项分布,伽马分布和正态分布。
GLM的综合参考文献是McCullagh和Nelder(1989),而Dobson(2001)提供了一个全面的介绍。
因为广义线性模型是以“线性预测器”Xβ的形式详细说明的,所以线性模型的许多一般想法和概念通过一些修改而继续存在到广义线性模型中。
除了必须选择的链接函数和分布之外,基本模型公式与线性模型公式基本相同。
当然,如果恒等函数被选择作为链接以及正态分布,那么普通线性模型将作为特例被恢复。
然而,泛化是以某种成本为代价的:现在的模型拟合必须要迭代完成,而且用于推理的分布结果是近似的,并且由大样本限制结果证明是正确的而不是精确的。
但在深入探讨这些问题之前,请考虑几个简单的例子。
μi=cexp(bt i),例1:在疾病流行的早期阶段,新病例的发生率通常会随着时间以指数方式增加。
因此,如果μi是第ti天的新病例的预期数量,则该形式的模型为请注意,“广义”和“一般”线性模型之间存在区别-后一个术语有时用于指除简单直线以外的所有线性模型。
可能是合适的,其中c和b是未知参数。
通过使用对数链路,这样的模型可以变成GLM形式log(μi)=log(c)+bt i=β0+t iβ1(根据β0=logc和β1=b的定义)。
请注意,模型的右侧现在在参数中是线性的。
反应变量是每天新病例的数量,因为这是一个计数,所以泊松分布可能是一个合理的可以尝试的分布。
因此,针对这种情况的GLM使用泊松反应分布,对数链路和线性预测器β0+tiβ1。