广义线性模型共42页

格式：ppt
大小：4.48 MB
文档页数：42

下载文档原格式

广义线性模型课件

（三）条件Logistic回归分析的基本原理
1.概述条件Logistic回归是经典Logistic回归的重要拓展方法之一，它主要用于分层数据（strata data）的影响因素分析，通过分层来控制可能的混杂因素对结局变量的影响。分层变量可以包括一个变量或者几个变量。
2.条件 Logistic模型令yk为第k层的因变量，yk=1或0；xk1，xk2…xki… xkm为第k层的m个自变量。第k层的模型为：
推荐书籍：
Hosmer, David W . (2000). Applied logistic regression . John Wiley, New York.
（一）Logistic回归分析的任务
影响因素分析 logistic回归常用于疾病的危险因素分析，logistic回归分析可以提供一个重要的指标：OR。
（2）令病例的生存时间比对照短（3）在设置生存状态变量（status）时，令病例组为完全数据，对照组为删失数据
以下实例摘自Hosme and Lemeshow（2000）. Applied Logistic Regression: Second Edition.
John Wiley & Sons Inc.
Logistic回归
因变量
协变量(自变量)
注：此处将X1、X3看作为连续变量。
OR的95%置信区间
对模型的检验
模型拟合良好
经统计学检验，模型2=13.951，P=0.003，Logistic回归模型有显著性。
拟合分类表
符合率为 70.0%
回归系数标准误 Wald值
P值
OR
OR置信区间
g(x)是对P的变换，称为logit变换：

广义线性模型

回归适⽤于⼆值响应变量，连接函数为logit函数，概率分布为⼆项分布：glm(Y ~ X1 + X2 + X3, family=binomial(link="logit"), data=data)Logistics回归模型的表达形式为：模型解读：为暴露于某种状态下的结局概率，是⼀种变量变换⽅式，表⽰对进⾏变换，为偏回归系数，表⽰在其他⾃变量不变的条件下，每变化⼀个单位的估计值。

Logistics回归是通过最⼤似然估计求解常数项和偏回归系数，基本思想时当从总体中随机抽取n个样本后，最合理的参数估计量应该使得这n个样本观测值的概率最⼤。

最⼤似然法的基本思想是先建⽴似然函数与对数似然函数，再通过使对数似然函数最⼤求解相应的参数值，所得到的估计值称为参数的最⼤似然估计值。

1. 数据准备（类别型变量进⾏0/1量化）⾸先，我们选⽤AER包中的Affairs数据集来构建Logistics回归模型，这个数据集记录了⼀组婚外情数据，其中包括参与者性别、年龄、婚龄、是否有⼩孩、宗教信仰程度（5分制，1表⽰反对，5表⽰⾮常信仰）、学历、职业和婚姻的⾃我评分（5分制，1表⽰⾮常不幸福，5表⽰⾮常幸福）。

在使⽤数据集之前，载⼊AER包> library(AER)> data(Affairs,package="AER")对于这个数据集，我们关注是否出轨，即这是⼀个⼆值型结果（出轨过/从未出轨）。

因此，我们接下来将'affaris'特征转化为⼆值型因⼦'ynaffair'，该⼆值型因⼦即可以作为Logistic回归的结果变量。

> Affairs$ynaffair[Affairs$affairs > 0] <- 1> Affairs$ynaffair[Affairs$affairs== 0] <- 0> Affairs$ynaffair <-factor(Affairs$ynaffair,levels=c(0,1),labels=c("No","Yes"))2. Logistics模型构建：> myfit <- glm(ynaffair ~ gender + age + yearsmarried + children + religiousness + education + occupation + rating, data=Affairs, family=binomial())> summary(myfit)Coefficients:Estimate Std. Error z value Pr(>|z|)(Intercept) 1.37726 0.88776 1.551 0.120807gendermale 0.28029 0.23909 1.172 0.241083age -0.04426 0.01825 -2.425 0.015301 *yearsmarried 0.09477 0.03221 2.942 0.003262 **childrenyes 0.39767 0.29151 1.364 0.172508religiousness -0.32472 0.08975 -3.618 0.000297 ***education 0.02105 0.05051 0.417 0.676851occupation 0.03092 0.07178 0.431 0.666630rating -0.46845 0.09091 -5.153 2.56e-07 ***-----------------------------------------------Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1从回归系数的p值（最后⼀栏）可以看到，性别、是否有孩⼦、学历和职业对⽅程的贡献都不显著（p值较⼤）。

《广义线模型》课件

资等。
生物统计学
用于分析生物数据和遗传数据，如基因表达、
疾病风险等。
市场营销
用于预测消费者行为和市场趋势，如消费者购买决策、市场细分等。
社会科学
用于研究社会现象和人类行为，如人口统计、
犯罪率等。
广义线模型的优缺点
灵活性强
能够适应各种类型的数据和问题。
数学基础扎实
具有坚实的统计学和线性代数基础。
VS
详细描述
非线性广义线模型通过引入非线性项，如平方、立方等，来描述因变量和自变量之间的复杂关系。这种模型在许多领域都有应用，例如经济学、生物学和医学等。
广义岭回归模型
总结词
广义岭回归模型是广义线模型的另一种扩展形式，它通过引入岭回归方法来处理共线性问题。
详细描述
在统计学中，共线性是指自变量之间存在高度相关性的现象。广义岭回归模型通过引入岭回归方法，即对系数施加约束，来减少共线性的影响，提高模型的稳定性和预测精度
所应用。
THANKS
感谢观看
模型选择
模型选择是指在多个可能的模型中选择一个最优模型的过程。模型选择通常基于模型的复杂度、预测精度、解释性等因素进行评估。
03
广义线模型的基本形式
线性回归模型
线性回归模型是最基础的广义线模型，用于预测一个因变量与一个或多个自变量之间的关系。
线性回归模型假设因变量和自变量之间存在线性关系，即因变量的变化可以用自变量的线性组合来描述。
医学数据分析
总结词
广义线模型在医学数据分析中具有重要价值，能够帮助研究人员更好地理解和解释医学数据。
详细描述
广义线模型可以用于分析医学影像数据、疾病发病率数据等，从而揭示疾病的发生和发展规律。此外，该模型还可以用于药物疗效分析，为新药研发和临床试验提供支持。

广义线性模型

⼴义线性模型⼴义线性模型GLM是⼀般线性模型的扩展，它处顺序和分类因变量。

所有的组件都是共有的三个组件：随机分量系统分量链接函数===============================================随机分量随机分量跟随响应Y的概率分布例1. (Y1，Y2，。

....YN)可能是正态的。

在这种情况下，我们会说随机分量是正态分布。

该成分导致了普通回归和⽅差分析。

例2. y是Bernoulli随机变量(其值为0或1)，即随机分量为⼆项分布时，我们通常关注的是Logistic回归模型或Proit模型。

例2. y是计数变量1,2,3,4,5,6等，即y具有泊松分布，此时的连接函数时ln(E(y))，这个对泊松分布取对数的操作就是泊松回归模型。

============================================系统分量系统组件将解释变量x1、x2、···、xk作为线性预测器：============================================连接函数GLM的第三分量是随机和系统分量之间的链路。

它表⽰平均值µ=e(y)如何通过指定函数关系g(µ)到线性预测器中的解释性变量称G（µ）为链接函数..==============================================⼴义线性模型Y被允许从指数型分布族中得到⼀个分布。

链路函数G(µI)是任何单调函数，并且定义了µI和Xβ之间的关系。

=================================================逻辑回归因变量是⼆进制的评估多个解释变量（可以是数值型变量和/或类别型变量）对因变量的影响。

=============================================模型含义：鸟类的巢址使⽤响应变量是有巢的站点的概率，其中概率计算为p/(1-p)，p是有巢的站点的⽐例。

第3章-广义线性模型

年收入（万元）
是否有车
年收入（万元）
是否有车
年收入（万元）
是否有车
15
1
25
1
12
0
20
1
12
0
15
1
10
0
10
0
9
0
12
1
15
1
8
0
8
0
7
0
10
0
30
1
22
1
22
1
6
0
7
0
24
1
16
1
16
1
9
0
22
1
18
1
10
0
36
1
211181707
0
30
1
24
1
9
0
6
0
6
0
6
0
13
0
11
0
20
1
23
1
18
.
8
2. 正态线性回归模型
• 只要取联结函数为 m (i) i x iT (i 1 , ,n ),则正
态线性回归模型满足广义线性模型的定义.
• 类似的,容易验证,二项分布和泊松分布都属于指数分布族.
• 下面介绍实际中应用广泛的两种广义线性
模型:Logistic模型和对数线性模型.
2020/8/5
1
16
1
10
0
2020/8/5
.
11
2. 模型的参数估计和检验
• 采用R软件中的广义线性模型过程glm( )可以完成回归系数的估计,以及模型回归系数的显著性检验. 程序如下：

6 孟生旺：广义线性模型—发展与应用

参数估计：
b(m + 1) = b(m ) + (X ⅱ + 2l *A SA)- 1[X ⅱ (y - m + l *A S 2I ] WX WG )
18
GLM的推广与应用的推广
• 分布假设的推广
– 过离散：
• 混合泊松分布：泊松-逆高斯，泊松-对数正态
– 零膨胀：
• 零膨胀模型
– 长尾：
• 对数正态，帕累托
11
• 模型比较模型比较：信息准则
A IC = − 2 l + 2 p B IC = − 2 l + p ln( n )
– AIC或BIC的值越小越好。 – 误差平方和的比较？
12
GLM的优缺点的优缺点 • 优点：
– 统计检验 – 处理相关性和交互作用（见下页） – 现成软件
• 缺点：
– 无法处理加法和乘法的混合模型 – 参数模型，函数形式有限 – 寻找交互项：耗时
yij p q ∑ nij α i f ( ) αi ˆ β j = f −1 i ∑ nij pα i q i
26
应用案例
• 来源： Ismail et al.(2007) 和Cheong et al.(2008) • 马来西亚车险汇总数据
分类变量保障类型水平综合险非综合险国内国外男性个人女性个人商务 0至1年 2至3年 4至5年 6年以上中部北部东部南部东马
28
广义线性模型的拟合结果比较
29
回归树的结果
30
模型的误差平方和比较
模型线性回归回归树泊松-逆高斯回归负二项回归泊松回归神经网络（1个神经元）神经网络（2个神经元）神经网络（3个神经元）误差平方和参数个数（SSE） 11 11 12 12 11 13 25 37 19.08 16.76 15.08 14.73 13.04 12.30 5.85 5.11 类R2 0.7274 0.7606 0.7846 0.7896 0.8138 0.8242 0.9165 0.9270

广义线性模型_二_陈希孺

联系函数为g则相应的i此处再提醒注意i与i的区别前者是在第i次观察得yixi时总体的值或更确切的说因受x取值xi的影响这影响到而通过影响到使总体取i至于i是指的第i分量由此可以理解ij的意义为hg1i
DO I : 10 . 13860 / j. cnki . slt j . 2002 . 06 . 013
′ 1( y +…+y ) y ( 1) ( q) Π j) P( y =( y( ) )=( 1 -π ) π 1), …y ( q) ( 1 ) - … -π ( q) ( j) ( j =1
q
′
( 1. 53)
令θ = ( log θ ) ′ = ( θ ) ′ , 其中 ( 1 ), … , log θ ( q) ( 1 ), …θ ( q) π ( j) θ ( j) = , 1 -( π ) ( 1) + … + π ( q) π ( j) = 可将( 1. 53) 写为
k
58
中文核心期刊数理统计与管理 21 卷 6 期 2002 年 11 月
别) 与 1 维广义线性模型相似 , 多维广义线性模型的一个要素是 : Y 有指数型分布 : c( y) exp( θ ′ y -b( θ ) ) dμ ( y) , θ∈ Θ 因 c( y) ex p( θ ′ y -b( θ ) ) dμ ( y )= 1 ∫
1 ＊ )
g( μ μ 1)≠ g ( 2)
( 1. 45) ( 1. 46) ( 1. 47)
g( μ )= η= z β
z( x i) 及 η i =z iβ , 以及( μ i = Ey i ) ( 1. 48)
1 θ μ h( z iβ) ) i =b ( i)= b (

广义线性模型的分析及应用

广义线性模型的分析及应用一、引言广义线性模型（Generalized Linear Model, GLM）提供了一种在保持简单性的前提下，对非正态响应变量建立连续性预测模型的方法，适用于许多实际应用问题中。

本文旨在介绍广义线性模型的基本概念、模型构建方法、推断等内容，并通过实际案例的分析加深对GLM的理解与应用。

二、基本概念GLM是统计学中一种具有广泛适用性的模型框架，它的基本思想是将未知的响应变量与已知的协变量之间的关系描述为一个线性预测器和一个非线性函数的组合，即：g(E(Y)) = β_0 + β_1X_1 + ⋯+ β_pX_p其中，g(·)称为联接函数（Link Function），它定义了响应变量的均值与预测变量之间的关系，E(Y)为响应变量的期望，X_1,X_2,…,X_p为解释变量（predictor）或协变量（covariate），β_0, β_1, …, β_p是模型的系数或参数。

GLM假定响应变量Y服从指数分布族中的某一个分布，如正态分布、二项分布、泊松分布等。

三、模型构建方法1. 选择联接函数和分布族：不同的响应变量应选用不同的分布族。

例如，连续性响应变量可选用正态分布，二元响应变量可选用二项分布，而计数型响应变量可选用泊松分布等。

2. 选择解释变量：可使用变量选择算法，如前向选择法、向后选择法、逐步回归等，在给定样本内拟合出最佳模型。

3. 选择估计方法：由于某些非正态分布族无法使用最小二乘法拟合，可以使用极大似然估计法或广义估计方程法。

对于大样本，一般使用广义线性混合模型等。

4. 模型比较与选择：模型拟合后，需要进行模型检验和模型诊断，主要包括残差分析、Q-Q图检验、$R^2$值、F检验、AIC/BIC值等指标的分析。

四、模型应用GLM的应用非常广泛，特别是在医学、生态、社会科学、金融等领域。

下面以某市2019年全年医疗保险数据为例，运用GLM模型进行分析。

1. 数据描述健康保险数据包含了每个缴费人的性别、年龄、缴费金额、报销金额等信息。

2.-李欣海-广义线性模型

第四届R会议北京2011广义线性模型-李欣海广义线性模型Generalized linear model李欣海中科院动物所Generalized Linear Modelg(µ) = β0+ β1x1+ β2x2+ ···+ βk x k GLM is an extension of general linear model that deals with ordinal and categorical response variables. There are three components that are common to all GLMs(McCullagh& Nelder1989) :–Random component–Systematic Component–Link FunctionMcCullagh, P., and J. A. Nelder1989. Generalized linear models. Chapman and Hall.Random Component:The random component: refers to the probability distribution of theresponse Y.Case 1. (Y 1, Y 2, . . ., Y N ) might be normal. In this case, we would say the random component is the normal distribution. This component leads to ordinary regression and analysis of variance models.Case 2. If the observations are Bernoulli random variables (which havevalues 0 or 1), then we would say the link function is the binomialdistribution. When the random component is the binomial distribution, we are commonly concerned with logistic regression models or probit models.Case 3. Quite often the random variables Y 1, Y 2, . . ., Y N have aPoisson distribution. Then we will be involved with Poisson regressionmodels or loglinear models.Systematic ComponentThe systematic component involves theexplanatory variables x 1, x 2, ···, x k .as linear predictors:β0+ β1x 1+ β2x 2+ ···+ βk x kLink FunctionThe third component of a GLM is the link between the random and systematic components.It says how the meanµ= E(Y) relates to the explanatory variables in the linear predictor through specifying a function g(µ):g(µ) = β0+ β1x1+ β2x2+ ···+ βk x kg(µ) is called the link function.Generalized Linear Models•The y i ’s are allowed to have a distribution fromthe exponential family of distributions.•The link function g(μi ) is any monotonic functionand defines the relationship between μi and x i β.kik i 22i 110i X ...X X )(g ββββμ++++=Logistic regression)(11)1(i x i i i e p x y P −+===Dependent variable is binary)(11)0(i x i i i e p x y P +===Linear function Logistic function P x 00.20.40.60.81-10-50510P x0.20.40.60.81-10-50510dt t p x y P ix i i i )21exp(21)1(2−===∫+∞−βαπProbit regression functionP x 00.20.40.60.81-10-50510)(11)1(ix i i i e p x y P −+===ii x x e e+=1ii x xi e ep +−=−111ix e +=11ix ii ep p Odds =−=1ii i x p p =⎟⎟⎠⎞⎜⎜⎝⎛−1ln Logit transformationModel meanings –nest site use of birdsThe response variable was the odds of a site having a nest, where odds are calculated as p/(1-p) and p is the proportion of sites have a nest. The statistical model was:Odds = exp(β0+ β1X 1+ β2X 2+ …βn X n )where n is the number of explanatory variables. The log of the odds is known as the logit transform of p .i x ii e p p Odds =−=1Advantages of Logit•Properties of a linear regression model•Logit between -∞and + ∞•Probability (P) constrained between 0 and 1•Directly related to odds of eventβx αP -1P ln +=⎟⎠⎞⎜⎝⎛ e P -1P βxα+=Assumptions•Dependent variable is binary or dichotomous, vs.continuous dependent variables in linear regression.•The cases are independent.•The independent variables are not linear combinations of each other•No linearity, the population means of the dependent variables at each level of the independent variable are not on a straight line.•No homogeneity of variance, the variance of the errors are not constant.•No normality, the errors are not normally distributed.Example•Risk of developing coronary heart disease (CD) by age (< 60 and > 60 years old)CD> 60 (1)< 60 (0)Present (1)2823Absent (0)1172Odds of disease among the old = 28/11Odds of disease among the young = 23/72 Odds ratio = 7.97R code# Logistic regression# Risk of developing coronary heart disease by age (<60 and >60 years old)coronary1 <-data.frame (present = rep (1, 28), age = 'old')coronary2 <-data.frame (present = rep (0, 11), age = 'old')coronary3 <-data.frame (present = rep (1, 23), age = 'young')coronary4 <-data.frame (present = rep (0, 72), age = 'young')coronary <-rbind (coronary1, coronary2, coronary3, coronary4)coronary <-rbind (coronary3, coronary4, coronary1, coronary2)fit <-glm (present~age, data = coronary, family = binomial ())summary (fit)Coefficients:Estimate Std. Error z value Pr(>|z|)(Intercept) 0.9343 0.3558 2.626 0.00865 ** ageyoung -2.0755 0.4289 -4.839 1.31e-06 *** Age 2.0755 1.1412- Age βαP 1-P ln 1×+=×+=⎟⎠⎞⎜⎝⎛Coefficients:Estimate Std. Error z value Pr(>|z|)(Intercept) -1.1412 0.2395 -4.765 1.89e-06 ***ageold 2.0755 0.4289 4.839 1.31e-06 ***Logistic Regression ModelCoefficientSE Coeff/SEAge 2.0755 0.4289 4.839 Constant -1.1412 0.2395 -4.76518.53.4, e CI 95%0.05) (p 1df with4.839 Test Wald 7.97e ratio Odds )0.4289 x 1.96(2.0755 22.0755==<===±¾β= increase in logarithm of odds ratio for a one unit increase in x •Test of the hypothesis that β = 0(Wald test)df)(1 ( Variance 22β)β=χInterpretation of the coefficients in terms of the oddsratio –An Example•Whether owning a car as afunction of the income. •17 individuals, 14 own a car and 3 do not.Variables in the EquationB S.E.Wald df Exp(B)INCOME 0.69310.80720.73721 2.0Constant-6.23838.97940.482610.00195car1 <-data.frame (income = c (10:12), carowner = rep (0, 3))car2 <-data.frame (income = rep (c (10:12), c (2, 4, 8)), carowner = rep (1, 14))car <-rbind (car1, car2)fit <-glm (carowner ~ income, data = car, family = binomial ())summary (fit)Income Car owner100101101110111111111111120121121121121121121121121Interpretation of the coefficients in terms of the oddsratio –An Example•e β= 2•So: increasing the income by one unit increases the odds of owning a car by a factor of 2 (increase in 100%) so that:(odds after increasing income)/ (odds before increasing income) = 2•If we look at the data we can see that this model predicts perfectly:income 0.69 α income βαP 1-P ln ×+=×+=⎟⎠⎞⎜⎝⎛ 2P1-Pincome income e e e ×=×=×αα69.0income 10P(own)P(not own)Odds of Owning a car10212/3=0.661/3=0.330.66/0.33=211414/5=0.81/5=0.20.8/0.2=412818/9=0.8881/9=0.1110.888/0.111=8car ownerMarginal effect of a change in Xln[p/(1-p)] = α+ βX + eThe slope coefficient (β) is interpreted as the rate of change in the "log odds" as X changes …not very useful.•We are also interested in seeing the effect of an explanatory variable on the probability of the event occurring•p = 1/[1 + exp(-α-βX)]The marginal effect of a change in X on the probability is:əp/əX = βp(1-p))()(1111X X eeβαβαβ++−+×+×=Basically, the size of the ‘marginal effect’will depend on two things:–βcoefficient–The initial value of XMarginal Effects: βxP(1-P)•Passing or failing an exam as a function of the number of hours of study•Previous study indicated the estimates of αandβwere:α= -5, β= 0.3•So what’s the effect of studying one more hour in the probability of the event occurring:Initial hoursof study P1-P P(1-P)Marginal effect50.029 0.971 0.028 0.009100.119 0.881 0.105 0.031150.378 0.622 0.235 0.071200.731 0.269 0.197 0.059250.924 0.076 0.070 0.021300.982 0.018 0.0180.005The importance of the initial value of X in themarginal effectLogistic Curves0.10.20.30.40.50.60.70.80.91-19-16-13-1-7-4-1258111417Logistic Curve bo=0.5, b1=0.5Big EffectSmall EffectSmall EffectStarting the change from the central values of X will have a higher impact on the probability of the event occurring than starting from very low or very high values of X.Some useful R codes# Logistic regressionfit <-glm(carowner~ income, data = car, family = binomial())summary (fit) # display resultsconfint(fit) # 95% CI for the coefficientsexp(coef(fit)) # exponentiated coefficientsexp(confint(fit)) # 95% CI for exponentiated coefficientspred= predict (fit, type= "response") # predicted values (logit) res= residuals (fit, type= "deviance") # residualsHow to estimate model coefficientsMaximum likelihood estimation (MLE)iiy i y i i )p (p )P(y −−=11For one observationLikelihood function=−−=n i y i y i ii)p (p L 111)(θGoodness of fit for the full model-likelihood ratio test (LR)•We compare the value of the likelihood function in a model with the variables with the value of the likelihood function in a model without the variables. The test:where is the log likelihood value of the null model (only intercept included); is the log likelihood value of the full model (taking into account of all variable parameters).–The statistic is distributed as χ2 with as many degrees of freedomas coefficients we are restrictingkS )L L (L L LR 20ˆ2ˆ2χ⇒−−−=0ˆL L SL L ˆ# likelihood ratio testfit.full <-glm (present ~ ., data = coronary, family = binomial ())fit.null <-glm (present ~ NULL, data = coronary, family = binomial ())lrtest (fit.full, fit.null)Goodness of fit -AnalogousR2)ˆ2(ˆ20SL L L L −−−Refer to total sum of squareRefer to regression sum of square Likelihood ratio index (LRI):200)ˆ2)ˆ2(ˆ2(LRI RL L L L L L S=−−−−=0ˆ2L L −/n adj)L (RR R R202max 222ˆ1−==# R codelibrary (Design) # required for lrm()fit2 <-lrm (y ~ x1 + x2, data = data1)fit2[[3]][10] # R squareStepwise Regression base on Akaike’s Information Criterion (AIC)AIC = -2 ln (likelihood) + 2KK = number of parameters in the model, including 1for the constant and 1 for the error term443322110X X X X Y βββββ++++=K = 6For small samples (n /K < 40), use AIC c for small sample size1)1(2AIC AIC c −−++=K n K K # R codestep (fit) # Stepwise Regression25Sample plots 35Control plots 35Habitat factors 11Elevation (m)Area of rice fields nearby (ha)Human disturbanceNumber of trees within 100 m 2Mean tree height within 100 m 2 (m)Nest position on the slopeSlope aspect (°)Slope gradient (°)Nest tree height (m)Nest aspect (°)Coverage above the nest (%)Nest site selection of the crested ibisControl plots Nest sites26 0 20 40 kmSource data 10500100015002000250005101520253035SitesE l e v a t i o n (m )Elevation (Nest sites)Elevation (Control plots)51015202505101520253035SitesM e t e rHeight of nest tree (Nest sites)Height of nest tree (Control plots)024681005101520253035SitesDisturbance (Nest sites)Disturbance (Control plots)5010015020025030035005101520253035SitesArea of rice field nearby (Nest sites)Area of rice field nearby (Control plots)Source data 20.00.30.60.91.25101520253035SitesNest aspect (Nest sites)Nest aspect (Control plots)0.020.040.060.080.0100.005101520253035SitesCoverage above the nest (Nest sites)Coverage above the nest (Control plots)0.00.40.81.21.62.005101520253035SitesNest position on the slope (Nest sites)Nest position on the slope (Control plots)0.00.30.60.91.205101520253035SitesSlope aspect (Nest sites)Slope aspect (Control plots)Source data 30.030.060.090.05101520253035SitesSlope gradient (Nest sites)Slope gradient (Control plots)0.05.010.015.020.005101520253035SitesMean tree height (Nest sites)Mean tree height (Control plots)0.05.010.015.020.005101520253035SitesNumber of trees within the site (Nest sites)Number of trees within the site (Control plots)CorrelationHabitat variablesCorrelation coefficientsMean S.D. 12345678910111. Elevation (m)1-0.72*-0.48*-0.70*0.21-0.020.39*-0.38*0.1620.34*0.21894.00176.532. Area (ha) of ricefields within 1km210.53*0.49*-0.23-0.08-0.230.230.05-0.21-0.1211.62 5.403. Humandisturbance10.220.06-0.1540.150.38*0.10-0.020.08 1.40 1.52 4. Number of treeswithin 100 m21-0.37*-0.00-0.52*0.34*-0.330.012-0.258.11 3.53 5. Mean tree heightwithin 100 m2 (m)1-0.240.23-0.34*0.32*0.11-0.0611.23 3.06 6. Nest position onthe slope10.030.22-0.21-0.00-0.07 2.030.45 7. Slope aspect(South = 1,North = 0)1-0.150.180.55*0.060.450.298. Slope gradient (°)1-0.050.100.0125.697.019. Nest tree height (m)1-0.08-0.2314.80 2.3610. Nest aspect(South = 1,North = 0)10.320.430.3211. Coverage abovethe nest (%)149.00%16.53%The Pearson correlations between the 11 habitat variables measured at 35 nest sites of crested ibis in Yang county, Shaanxi province, China. Mean values and standard deviations (S.D.) are also shown.Step Habitat features Selection coefficientsStandard ErrorP value for model selectionAIC 1Nest tree height (m)0.940.38<.000163.3562Human disturbance -0.990.400.000150.4753Slope aspect-5.82 3.250.001341.7274Area of rice fields nearby (ha)0.350.190.010936.2525Nest position on the slope 3.73 2.300.047834.3366Mean tree height within 100 m 2(m)0.280.270.0320 31.9247Nest aspect 54.928531.53780.011226.0488Slope gradient (°)-0.40800.36020.286623.2269Coverage above the nest 0.52010.55860.084124.32210Number of trees within 100 m 2-0.0068300.006160.116025.76411Elevation (m)0.076700.13280.145027.275Stepwise logistic regression for modeling nest site selection of crested ibis in Yang County, Shaanxi Province, China.Model equationlogit(p) = –20.99 + 0.94×nest tree height–0.99×human disturbance+ 3.63×nest position+ 0.35×rice paddy area + …Probability of nest selection：P = e logit(p)/(1 + e logit(p))•R-Square 0.7380•Max-rescaled R-Square 0.9840李欣海, 马志军, 李典谟, 丁长青, 翟天庆, 路宝忠。

广义线性模型

广义线性模型广义线性模型*（Nelder和Wedderburn，1972）除了正态分布，也允许反应分布，以及模型结构中的一定程度的非线性。

GLM具有基本结构g(μi)=X iβ,其中μi≡E（Yi），g是光滑单调'链接函数'，Xi是模型矩阵的第i行，X和β是未知参数的向量。

此外，GLM通常会做出Yi是独立的和Yi服从一些指数族分布的假设。

指数族分布包括许多对实际建模有用的分布，如泊松分布，二项分布，伽马分布和正态分布。

GLM的综合参考文献是McCullagh和Nelder（1989），而Dobson（2001）提供了一个全面的介绍。

因为广义线性模型是以“线性预测器”Xβ的形式详细说明的，所以线性模型的许多一般想法和概念通过一些修改而继续存在到广义线性模型中。

除了必须选择的链接函数和分布之外，基本模型公式与线性模型公式基本相同。

当然，如果恒等函数被选择作为链接以及正态分布，那么普通线性模型将作为特例被恢复。

然而，泛化是以某种成本为代价的：现在的模型拟合必须要迭代完成，而且用于推理的分布结果是近似的，并且由大样本限制结果证明是正确的而不是精确的。

但在深入探讨这些问题之前，请考虑几个简单的例子。

μi=cexp(bt i),例1：在疾病流行的早期阶段，新病例的发生率通常会随着时间以指数方式增加。

因此，如果μi是第ti天的新病例的预期数量，则该形式的模型为请注意，“广义”和“一般”线性模型之间存在区别-后一个术语有时用于指除简单直线以外的所有线性模型。

可能是合适的，其中c和b是未知参数。

通过使用对数链路，这样的模型可以变成GLM形式log(μi)=log(c)+bt i=β0+t iβ1（根据β0=logc和β1=b的定义）。

请注意，模型的右侧现在在参数中是线性的。

反应变量是每天新病例的数量，因为这是一个计数，所以泊松分布可能是一个合理的可以尝试的分布。

因此，针对这种情况的GLM使用泊松反应分布，对数链路和线性预测器β0+tiβ1。

1、下载文档前请自行甄别文档内容的完整性，平台不提供额外的编辑、内容补充、找答案等附加服务。
2、"仅部分预览"的文档,不可在线预览部分如存在完整性等问题,可反馈申请退款(可完整预览的文档不适用该条件!)。
3、如文档侵犯您的权益，请联系客服反馈,我们会尽快为您处理(人工客服工作时间：9:00-18:30)。

广义线性模型共42页

合集下载

广义线性模型课件

广义线性模型

《广义线模型》课件

广义线性模型

第3章-广义线性模型

6 孟生旺：广义线性模型—发展与应用

广义线性模型_二_陈希孺

广义线性模型的分析及应用

2.-李欣海-广义线性模型

广义线性模型

文档推荐

最新文档