[Original] Implementing Logistic Regression Data Analysis in R: Report with Code and Data
R code for logistic regression
Logistic regression is a classification method: it classifies observations by fitting a logistic model to the data.
In R, a logistic regression can be fitted with the glm() function.
First prepare a data set, for example by reading it in with read.csv().
Suppose the data set is named data and contains the categorical response variable y together with predictors x1, x2, x3, and so on.
We can then fit the logistic regression model with the following code:
```
# Fit the logistic regression model
model <- glm(y ~ x1 + x2 + x3, data = data, family = "binomial")
# Print the model summary
summary(model)
```
When building the model, we pass the response y and the predictors x1, x2, x3 to glm() and set the family argument to "binomial", indicating a binomial (binary) logistic regression.
Once the model has been fitted, summary() prints its summary, including the coefficients, standard errors, z values, p-values and other statistics.
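To illustrate how such a model is used once fitted, here is a minimal self-contained sketch; the simulated data frame below only mimics the structure described above (a 0/1 response y and predictors x1, x2, x3) and is not part of the original report.

```
# Simulate a small data set with the structure described above (illustration only)
set.seed(1)
data <- data.frame(x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200))
data$y <- rbinom(200, 1, plogis(-0.5 + 1.2 * data$x1 - 0.8 * data$x2 + 0.3 * data$x3))

# Fit the binomial logistic regression model
model <- glm(y ~ x1 + x2 + x3, data = data, family = "binomial")
summary(model)

# Predicted probabilities for new observations
new_obs <- data.frame(x1 = c(0, 1), x2 = c(0, -1), x3 = c(0, 0.5))
predict(model, newdata = new_obs, type = "response")
```

With type = "response", predict() returns probabilities on the 0-1 scale rather than log odds.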
Data analysis with R: Logistic regression
Logistic regression is a very useful tool when you want to predict a binary outcome from a set of continuous and/or categorical predictor variables.
We illustrate the logistic regression workflow with the Affairs data frame from the AER package, exploring survey data on extramarital affairs.
The Affairs data set records 9 variables for 601 participants, including the frequency of extramarital encounters during the past year, as well as gender, age, years married, whether there are children, religiousness (5-point scale, 1 = anti, 5 = very religious), education, occupation, and a self-rating of the marriage (5-point scale, 1 = very unhappy, 5 = very happy).
Let us first look at some descriptive statistics:

```
> data(Affairs, package = "AER")
> summary(Affairs)
    affairs          gender         age         yearsmarried    children  religiousness    education
 Min.   : 0.000   female:315   Min.   :17.50   Min.   : 0.125   no :171   Min.   :1.000   Min.   : 9.00
 1st Qu.: 0.000   male  :286   1st Qu.:27.00   1st Qu.: 4.000   yes:430   1st Qu.:2.000   1st Qu.:14.00
 Median : 0.000                Median :32.00   Median : 7.000             Median :3.000   Median :16.00
 Mean   : 1.456                Mean   :32.49   Mean   : 8.178             Mean   :3.116   Mean   :16.17
 3rd Qu.: 0.000                3rd Qu.:37.00   3rd Qu.:15.000             3rd Qu.:4.000   3rd Qu.:18.00
 Max.   :12.000                Max.   :57.00   Max.   :15.000             Max.   :5.000   Max.   :20.00
   occupation        rating
 Min.   :1.000   Min.   :1.000
 1st Qu.:3.000   1st Qu.:3.000
 Median :5.000   Median :4.000
 Mean   :4.195   Mean   :3.932
 3rd Qu.:6.000   3rd Qu.:5.000
 Max.   :7.000   Max.   :5.000

> table(Affairs$affairs)
  0   1   2   3   7  12
451  34  17  19  42  38
```

From these statistics we can see that 52% of the respondents are women, 72% have children, and the median age of the sample is 32 years.
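The analysis then models the binary outcome. A minimal sketch of one common way to do this is shown below; recoding affairs into a 0/1 indicator called ynaffair and the particular choice of predictors are our assumptions for illustration, not necessarily the exact model used in the original report.

```
data(Affairs, package = "AER")

# Recode the affair count into a binary response (Yes = at least one affair)
Affairs$ynaffair <- ifelse(Affairs$affairs > 0, 1, 0)
Affairs$ynaffair <- factor(Affairs$ynaffair, levels = c(0, 1), labels = c("No", "Yes"))

# Binomial logistic regression on the demographic predictors
fit.full <- glm(ynaffair ~ gender + age + yearsmarried + children +
                  religiousness + education + occupation + rating,
                data = Affairs, family = binomial())
summary(fit.full)

# Odds ratios for easier interpretation of the coefficients
exp(coef(fit.full))
```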
R regression case study: are stereotypes about immigration-policy preferences accurate? After renaming, recoding and restructuring the data, the distribution of the Group variable is:

```
Group  Count  Percent
6        476    56.00
5        179    21.06
2         60     7.06
3         54     6.35
4         46     5.41
1         27     3.18
0          8     0.94
```

A re-analysis of Kirkegaard & Bjerrekær (2016) establishes the aggregate accuracy for the subset of 32 countries used in this study.
The analysis scripts (excerpted from the study's R code) and the fitted aggregate model:

```
# Accuracy for the reduced sample
GG_scatter(dk_fiscal, "mean_estimate", "dk_benefits_use")
GG_scatter(dk_fiscal_sub, "mean_estimate", "dk_benefits_use", case_names = "Names")
GG_scatter(dk_fiscal, "mean_estimate", "dk_fiscal", case_names = "Names")

# Compare Muslim bias measures
# Can we make a bias measure that works without a ratio scale?

# Score stereotype accuracy: add the metric to the main data
d$stereotype_accuracy = indi_accuracy$pearson_r

GG_save("figures/aggr_retest_stereotypes.png")
GG_save("figures/aggregate_accuracy.png")
GG_save("figures/aggregate_accuracy_no_SYR.png")

# Muslim bias in aggregate data
GG_save("figures/aggregate_muslim_bias.png")

# Immigrant preferences and stereotypes
GG_save("figures/aggregate_muslim_bias_old_data.png")
GG_save("figures/aggr_fiscal_net_opposition_no_SYR.png")
GG_save("figures/aggr_stereotype_net_opposition.png")
GG_save("figures/aggr_stereotype_net_opposition_no_SYR.png")
```

Aggregate model estimates:

```
lhs             op  rhs                        est       se        z       pvalue
net_opposition  ~   mean_estimate_fiscal   -4.4e-01   0.02303   -19.17   0.0e+00
net_opposition  ~   Muslim_frac             4.3e-02   0.05473     0.79   4.3e-01
net_opposition  ~~  net_opposition          6.9e-03   0.00175     3.94   8.3e-05
dk_fiscal       ~~  dk_fiscal               6.2e+03   0.00000       NA        NA
Muslim_frac     ~~  Muslim_frac             1.7e-01   0.00000       NA        NA
```

Individual-level models:

```
GG_scatter(example_muslim_bias, "Muslim", "resid", case_names = "name") +
# exclude Syria

# Distribution of the bias measure
describe(d$Muslim_bias_r) %>% print()
GG_save("figures/muslim_bias_dist.png")
## `stat_bin()` using `bins = 30`. Pick better value with `

GG_scatter(mediation_example, "Muslim", "resid", case_names = "name", repel_names = T) +
  scale_x_continuous("Muslim % in home country", labels = scal

# Stereotypes and preferences
mediation_model = plyr::ldply(seq_along_rows(d), function(r
GG_denhist(mediation_model, "Muslim_resid_OLS", vline = median)
## `stat_bin()` using `bins = 30`. Pick better value with `

# Add to main data
d$Muslim_preference = mediation_model$Muslim_resid_OLS

# Predictors of individual primary outcomes: party models
rms::ols(stereotype_accuracy ~ party_vote, data = d)
GG_group_means(d, "Muslim_bias_r", "party_vote") +
  theme(axis.text.x = element_text(angle = -30, hjust = 0))
GG_group_means(d, "Muslim_preference", "party_vote")

# Party agreement correlations
wtd.cors(d_parties)
```
R linear regression case study with visualization
In this lab we will look at data from all 30 Major League Baseball teams and examine the linear relationship between runs scored in a season and a number of other player statistics.
Our goal is to summarize these relationships both graphically and numerically, in order to find out which variable, if any, best helps us predict a team's runs scored in a season.
Plot runs against at_bats, using at_bats as the predictor.
Does the relationship look linear? If you knew a team's at_bats, would you be comfortable using a linear model to predict its number of runs? If the relationship looks linear from the scatterplot, we can quantify its strength with the correlation coefficient.
Sum of squared residuals: recall the way we described the distribution of a single variable.
Recall that we discussed characteristics such as center, spread and shape.
It is also useful to be able to describe the relationship between two numerical variables, such as runs and at_bats above.
Looking at your plot from the previous exercise, describe the relationship between these two variables.
Make sure to discuss the form, direction and strength of the relationship, as well as any unusual observations.
Just as we used the mean and standard deviation to summarize a single variable, we can summarize the relationship between these two variables by finding the line that best follows their association.
Use the interactive function below to select the line that you think does the best job of going through the cloud of points.
# Click two points to make a line.

After running this command, you'll be prompted to click two points on the plot to define a line. Once you've done that, the line you specified will be shown in black and the residuals in blue. Note that there are 30 residuals, one for each of the 30 observations. Recall that the residuals are the differences between the observed values and the values predicted by the line: e_i = y_i - ŷ_i.

The most common way to do linear regression is to select the line that minimizes the sum of squared residuals. To visualize the squared residuals, you can rerun the plot command and add the argument showSquares = TRUE. Note that the output from the plot_ss function provides you with the slope and intercept of your line as well as the sum of squares.

Run the function several times. What was the smallest sum of squares that you got? How does it compare to your neighbors'?
Answer: The smallest sum of squares is 123721.9. It measures the dispersion of the observations around the fitted line.

The linear model
It is rather cumbersome to try to get the correct least squares line, i.e. the line that minimizes the sum of squared residuals, through trial and error. Instead we can use the lm function in R to fit the linear model (a.k.a. regression line). The first argument of lm is a formula that takes the form y ~ x. Here it can be read as: we want to make a linear model of runs as a function of at_bats. The second argument specifies that R should look in the mlb11 data frame to find the runs and at_bats variables.

The output of lm is an object that contains all of the information we need about the linear model that was just fit. We can access this information using the summary function. Let's consider this output piece by piece. First, the formula used to describe the model is shown at the top. After the formula you find the five-number summary of the residuals. The "Coefficients" table shown next is key; its first column displays the linear model's y-intercept and the coefficient of at_bats. With this table, we can write down the least squares regression line for the linear model: ŷ = -2789.2429 + 0.6305 * at_bats.

One last piece of information we will discuss from the summary output is the Multiple R-squared, or more simply, R². The R² value represents the proportion of variability in the response variable that is explained by the explanatory variable. For this model, 37.3% of the variability in runs is explained by at_bats.

Using the summary output for the home-runs model, write the equation of the regression line. What does the slope tell us in the context of the relationship between a team's success and its home runs?
Answer: Home runs have a positive relationship with runs; each additional home run is associated with about 1.835 additional runs.

Prediction and prediction errors
Let's create a scatterplot with the least squares line laid on top. The function abline plots a line based on its slope and intercept. Here, we used a shortcut by providing the model m1, which contains both parameter estimates. This line can be used to predict y at any value of x. When predictions are made for values of x that are beyond the range of the observed data, this is referred to as extrapolation and is not usually recommended. However, predictions made within the range of the data are more reliable. They're also used to compute the residuals.

How many runs would the model predict for a team with 5,578 at-bats? Is this an overestimate or an underestimate, and by how much?
In other words, what is the residual for this prediction?

Model diagnostics
To assess whether the linear model is reliable, we need to check for (1) linearity, (2) nearly normal residuals, and (3) constant variability.

Linearity: You already checked whether the relationship between runs and at-bats is linear using a scatterplot. We should also verify this condition with a plot of the residuals vs. at-bats. Recall that any code following a # is a comment that helps explain the code but is ignored by R.

6. Is there any apparent pattern in the residuals plot? What does this indicate about the linearity of the relationship between runs and at-bats?
Answer: The residuals show no apparent pattern and are centered around 0, which supports the linearity of the relationship between runs and at-bats.

Nearly normal residuals: To check this condition, we can look at a histogram or a normal probability plot of the residuals.
7. Based on the histogram and the normal probability plot, does the nearly normal residuals condition appear to be met?
Answer: Yes, the residuals are nearly normal.

Constant variability:
1. Choose another traditional variable from mlb11 that you think might be a good predictor of runs. Produce a scatterplot of the two variables and fit a linear model. At a glance, does there seem to be a linear relationship?
Answer: Yes, the scatterplot shows they have a linear relationship.

2. How does this relationship compare to the relationship between runs and at_bats? Use the R² values from the two model summaries to compare. Does your variable seem to predict runs better than at_bats? How can you tell?

3. Now that you can summarize the linear relationship between two variables, investigate the relationships between runs and each of the other five traditional variables. Which variable best predicts runs? Support your conclusion using the graphical and numerical methods we've discussed (for the sake of conciseness, only include output for the best variable, not all five).
Answer: new_obs best predicts runs, since it has the smallest residual standard error and its points lie on or very close to the line.

4. Now examine the three newer variables. These are the statistics used by the author of Moneyball to predict a team's success. In general, are they more or less effective at predicting runs than the old variables? Explain using appropriate graphical and numerical evidence. Of all ten variables we've analyzed, which seems to be the best predictor of runs? Using the limited (or not so limited) information you know about these baseball statistics, does your result make sense?
Answer: new_slug (87.85%), new_onbase (77.85%) and new_obs (68.84%) all predict runs better than the old variables.

5. Check the model diagnostics for the regression model with the variable you decided was the best predictor for runs.

This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license. This lab was adapted for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel from a lab written by the faculty and TAs of UCLA Statistics.
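For reference, a compact sketch of the base R commands this lab builds on is given below; it assumes the mlb11 data frame that ships with the OpenIntro lab materials has been loaded into the workspace, and only runs and at_bats are used.

```
# Scatterplot of the relationship described above
plot(mlb11$at_bats, mlb11$runs, xlab = "At-bats", ylab = "Runs")

# Strength of the linear association
cor(mlb11$runs, mlb11$at_bats)

# Least squares fit of runs on at_bats and its summary
m1 <- lm(runs ~ at_bats, data = mlb11)
summary(m1)

# Add the fitted line to the scatterplot
abline(m1)

# Predicted runs for a team with 5,578 at-bats
predict(m1, newdata = data.frame(at_bats = 5578))
```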
Logistic regression (with R)
Christopher Manning
4 November 2007

1 Theory

We can transform the output of a linear regression to be suitable for probabilities by using a logit link function on the left-hand side:

logit p = log o = log( p / (1 - p) ) = β0 + β1·x1 + β2·x2 + ··· + βk·xk

Equivalently, the odds satisfy o = e^β0 · e^(β1 x1) · e^(β2 x2) ··· e^(βk xk).   (3)

The inverse of the logit function is the logistic function. If logit(π) = z, then π = e^z / (1 + e^z).

Note that we can convert freely between a probability p and odds o for an event versus its complement: o = p / (1 - p) and p = o / (o + 1).

[Figure 1: The logistic function]

2 Basic R logistic regression models

We will illustrate with the Cedegren dataset on the website.

```
cedegren <- read.table("cedegren.txt", header = T)
```

You need to create a two-column matrix of success/failure counts for your response variable. You cannot just use percentages. (You can give percentages but then weight them by a count of successes + failures.)

```
attach(cedegren)
ced.del <- cbind(sDel, sNoDel)
```

Make the logistic regression model. The shorter second form is equivalent to the first, but don't omit specifying the family.

```
ced.logr <- glm(ced.del ~ cat + follows + factor(class), family = binomial("logit"))
ced.logr <- glm(ced.del ~ cat + follows + factor(class), family = binomial)
```

The output in more and less detail:

```
> ced.logr
Call:  glm(formula = ced.del ~ cat + follows + factor(class), family = binomial("logit"))

Coefficients:
(Intercept)     catd     catm     catn     catv  followsP  followsV  factor(class)2  factor(class)3  factor(class)4
    -1.3183  -0.1693   0.1786   0.6667  -0.7675    0.9525    0.5341          1.2704          1.0480          1.3742

Degrees of Freedom: 51 Total (i.e. Null);  42 Residual
Null Deviance:     958.7
Residual Deviance: 198.6    AIC: 446.1

> summary(ced.logr)
Call:
glm(formula = ced.del ~ cat + follows + factor(class), family = binomial("logit"))

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-3.24384  -1.34325   0.04954   1.01488   6.40094

Coefficients:
                Estimate Std. Error z value Pr(>|z|)
(Intercept)     -1.31827    0.12221 -10.787  < 2e-16
catd            -0.16931    0.10032  -1.688 0.091459
catm             0.17858    0.08952   1.995 0.046053
catn             0.66672    0.09651   6.908 4.91e-12
catv            -0.76754    0.21844  -3.514 0.000442
followsP         0.95255    0.07400  12.872  < 2e-16
followsV         0.53408    0.05660   9.436  < 2e-16
factor(class)2   1.27045    0.10320  12.310  < 2e-16
factor(class)3   1.04805    0.10355  10.122  < 2e-16
factor(class)4   1.37425    0.10155  13.532  < 2e-16

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 958.66  on 51  degrees of freedom
Residual deviance: 198.63  on 42  degrees of freedom
AIC: 446.10

Number of Fisher Scoring iterations: 4
```

Residual deviance is the difference in G² = -2 log L between a maximal model that has a separate parameter for each cell in the model and the built model. Changes in the deviance (the difference in the quantity -2 log L) for two models which can be nested in a reduction will be approximately χ²-distributed with dof equal to the change in the number of estimated parameters. Thus the difference in deviances can be tested against the χ² distribution for significance. The same concerns about this approximation being valid only for reasonably sized expected counts (as with contingency tables and multinomials in Suppes (1970)) still apply here, but we (and most people) ignore this caution and use the statistic as a rough indicator when exploring to find good models.

We're usually mainly interested in the relative goodness of models, but nevertheless, the high residual deviance shows that the model cannot be accepted to have been likely to generate the data (pchisq(198.63, 42) ≈ 1). However, it certainly fits the data better than the null model (which means that a fixed mean probability of deletion is used for all cells): pchisq(958.66 - 198.63, 9) ≈ 1.

What can we see from the parameters of this model? catd and catm have different effects, but both are not very clearly significantly different from the effect of cata (the default value). All following environments seem distinctive. For class, all of class 2-4 seem to have somewhat similar effects, and we might model class as a two-way distinction. It seems like we cannot profitably drop a whole factor, but we can test that with the anova function to give an analysis of deviance table, or the drop1 function to try dropping each factor:

```
> anova(ced.logr, test = "Chisq")
Analysis of Deviance Table

Model: binomial, link: logit
Response: ced.del
Terms added sequentially (first to last)

              Df Deviance Resid. Df Resid. Dev  P(>|Chi|)
NULL                             51     958.66
cat            4   314.88        47     643.79  6.690e-67
follows        2   228.86        45     414.93  2.011e-50
factor(class)  3   216.30        42     198.63  1.266e-46

> drop1(ced.logr, test = "Chisq")
Single term deletions

Model: ced.del ~ cat + follows + factor(class)
              Df Deviance    AIC    LRT   Pr(Chi)
<none>             198.63 446.10
cat            4   368.76 608.23 170.13 < 2.2e-16
follows        2   424.53 668.00 225.91 < 2.2e-16
factor(class)  3   414.93 656.39 216.30 < 2.2e-16
```

The ANOVA test tries adding the factors only in the order given in the model formula (left to right). If things are close, you should try rearranging the model formula ordering, or using drop1, but given the huge drops in deviance here, it seems clearly unnecessary.

Let's now try a couple of models by collapsing particular levels, based on our observations above.

```
> glm(ced.del ~ cat + follows + I(class == 1), family = binomial("logit"))
Call:  glm(formula = ced.del ~ cat + follows + I(class == 1), family = binomial("logit"))

Coefficients:
(Intercept)     catd     catm     catn     catv  followsP  followsV  I(class == 1)TRUE
    -0.0757  -0.1614   0.1876   0.6710  -0.7508    0.9509    0.5195            -1.2452

Degrees of Freedom: 51 Total (i.e. Null);  44 Residual
Null Deviance:     958.7
Residual Deviance: 232.7    AIC: 476.2

> pchisq(232.72 - 198.63, 2)
[1] 1
```

The above model isn't as good. We could just collapse classes 2 and 4. Formally, that model can't be rejected as fitting the data as well as the higher-parameter model at a 95% confidence threshold:

```
> glm(ced.del ~ cat + follows + I(class == 1) + I(class == 3), family = binomial("logit"))
Call:  glm(formula = ced.del ~ cat + follows + I(class == 1) + I(class == 3), family = binomial("logit"))

Coefficients:
(Intercept)      catd      catm      catn      catv  followsP  followsV  I(class == 1)TRUE  I(class == 3)TRUE
   0.009838 -0.170717  0.181871  0.666142 -0.789446  0.957487  0.529646          -1.328227          -0.280901

Degrees of Freedom: 51 Total (i.e. Null);  43 Residual
Null Deviance:     958.7
Residual Deviance: 202.1    AIC: 447.5

> pchisq(202.08 - 198.63, 1)
[1] 0.9367482
```

However, in terms of our model, where class is a scale, collapsing together classes 2 and 4 seems rather dubious, and should be put aside.

But it does seem reasonable to collapse together some of the word classes. a and d are certainly a natural class of noun modifiers, and it's perhaps not unreasonable to group those with m. Let's try those models. This one is worse:

```
> glm(ced.del ~ I(cat == "n") + I(cat == "v") + follows + factor(class), family = binomial("logit"))
Call:  glm(formula = ced.del ~ I(cat == "n") + I(cat == "v") + follows + factor(class), family = binomial("logit"))

Coefficients:
(Intercept)  I(cat == "n")TRUE  I(cat == "v")TRUE  followsP  followsV  factor(class)2  factor(class)3  factor(class)4
    -1.2699             0.5750            -0.8559    1.0133    0.5771          1.2865          1.0733          1.4029

Degrees of Freedom: 51 Total (i.e. Null);  44 Residual
Null Deviance:     958.7
Residual Deviance: 229.1    AIC: 472.6

> pchisq(229.11 - 198.63, 2)
[1] 0.9999998
```

But this one cannot be rejected at a 95% confidence level:

```
> summary(glm(ced.del ~ I(cat == "n") + I(cat == "v") + I(cat == "m") + follows + factor(class), family = binomial))
Call:
glm(formula = ced.del ~ I(cat == "n") + I(cat == "v") + I(cat == "m") + follows + factor(class), family = binomial)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-3.26753  -1.22240   0.09571   1.05274   6.41257

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)
(Intercept)        -1.43311    0.10177 -14.081  < 2e-16
I(cat == "n")TRUE   0.78333    0.06741  11.621  < 2e-16
I(cat == "v")TRUE  -0.64879    0.20680  -3.137  0.00171
I(cat == "m")TRUE   0.29603    0.05634   5.254 1.49e-07
followsP            0.96188    0.07382  13.030  < 2e-16
followsV            0.53450    0.05659   9.445  < 2e-16
factor(class)2      1.26564    0.10314  12.271  < 2e-16
factor(class)3      1.04507    0.10352  10.096  < 2e-16
factor(class)4      1.37014    0.10150  13.498  < 2e-16

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 958.66  on 51  degrees of freedom
Residual deviance: 201.47  on 43  degrees of freedom
AIC: 446.94

Number of Fisher Scoring iterations: 4

> pchisq(201.5 - 198.63, 1)
[1] 0.9097551
```

So, that seems like what we can do to sensibly reduce the model. But what about the fact that it doesn't actually fit the data excellently? We'll address that below. But let's first again look at the model coefficients, and how they express odds.

```
> subset(cedegren, class == 2 & cat == "m")
   sDel sNoDel cat follows class
14  377    349   m       C     2
19  233    120   m       V     2
24   98     27   m       P     2
> pDel.followsC = 377 / (377 + 349)
> pDel.followsV = 233 / (233 + 120)
> pDel.followsP = 98 / (98 + 27)
> oDel.followsC = pDel.followsC / (1 - pDel.followsC)
> oDel.followsV = pDel.followsV / (1 - pDel.followsV)
> oDel.followsP = pDel.followsP / (1 - pDel.followsP)
> oDel.followsP / oDel.followsC
[1] 3.360055
> exp(0.96188)
[1] 2.616611
> oDel.followsV / oDel.followsC
[1] 1.797458
> exp(0.53450)
[1] 1.706595
```

For these cells (chosen as cells with lots of data points...), the model coefficients fairly accurately capture the change in odds of s-deletion by moving from the baseline of a following consonant to a following pause or vowel. You can more generally see this effect at work by comparing the model predictions for these cells in terms of logits:

```
> subset(cbind(cedegren, pred.logit = predict(ced.logr)), class == 2 & cat == "m")
   sDel sNoDel cat follows class pred.logit
14  377    349   m       C     2  0.1307542
19  233    120   m       V     2  0.6648370
24   98     27   m       P     2  1.0833024
> 0.6648370 - 0.1307542
[1] 0.5340828
> 1.0833024 - 0.1307542
[1] 0.9525482
```

You can also get model predictions in terms of probabilities by saying predict(ced.logr, type="response"). This can be used to calculate accuracy of predictions. For example:

```
> correct <- sum(cedegren$sDel * (predict(ced.logr, type = "response") >= 0.5)) +
+   sum(cedegren$sNoDel * (predict(ced.logr, type = "response") < 0.5))
> tot <- sum(cedegren$sDel) + sum(cedegren$sNoDel)
> correct / tot
[1] 0.6166629
```

The low accuracy mainly reflects the high variation in the data: cat, class, and follows are only weak predictors. A saturated model only has accuracy of 0.6275. The null model baseline is 1 - (3755 / (3755 + 5091)) = 0.5755. Note the huge difference between predictive accuracy and model fit! Is this a problem? It depends on what we want to do with the model. (It is certainly the case that logistic regression cannot account for any hidden correlations that are not coded in the model.)

Let us look for problems by comparing observed and fitted values, placed onto a scale of counts (not proportions). You can get the full table by the first command below, or more readably by putting it into a data frame with what you had before, but I omit the results (a whole page):

```
data.frame(fit = fitted(ced.logr) * (sDel + sNoDel), sDel, tot = (sDel + sNoDel))
ced.fit <- cbind(cedegren,
                 data.frame(fit = fitted(ced.logr) * (sDel + sNoDel), sDel,
                            tot = (sDel + sNoDel),
                            diff = (fitted(ced.logr) * (sDel + sNoDel) - sDel)))
```

The results can be made into a simple plot (figure 2):

```
> plot(ced.fit$sDel, ced.fit$fit)
> abline(a = 0, b = 1)
```

[Figure 2: Model fit]

This looks good, but it might instead be more revealing to look at a log scale. I use +1 since there are empty cells, and log(0) is undefined, and base 2 just for human interpretation ease. The result is in figure 3:

```
plot(log(ced.fit$sDel + 1, 2), log(ced.fit$fit + 1, 2))
abline(a = 0, b = 1)
```

[Figure 3: Model fit (log(x + 1, 2) scaled axes)]

Of course, we would expect some of the low count cells to be badly estimated. But some are quite badly estimated. E.g., for cat=n, f=V, class=1, 10 deletions are predicted, but actually 20/21 tokens have deletion. In general, a number of cells for class 1 are poorly predicted, and we might worry that the model does okay on average only because very little class 1 data was collected. Looking at the differences, 3 of the worst fit cells concern nouns of class 1 speakers:

```
> subset(ced.fit, cat == "n" & class == 1)
   sDel sNoDel cat follows class       fit sDel.1 tot       diff
5    34     40   n       C     1 25.355376     34  74  -8.644624
10   20      1   n       V     1  9.884003     20  21 -10.115997
13    4     15   n       P     1 10.919041      4  19   6.919041
```

The assumptions of the logistic regression model are that each level of each factor (or each continuous explanatory variable) has an independent effect on the response variable. Explanatory variables do not have any kind of special joint effect ("local conjunctions") unless we explicitly put interaction terms into the model. We can put in an interaction term between just this category and class explicitly, and this model is significantly better:

```
> glm(ced.del ~ cat + follows + factor(class) + I(cat == "n" & class == 1), family = binomial)
Call:  glm(formula = ced.del ~ cat + follows + factor(class) + I(cat == "n" & class == 1), family = binomial)

Coefficients:
(Intercept)     catd     catm     catn     catv  followsP  followsV  factor(class)2  factor(class)3  factor(class)4  I(cat == "n" & class == 1)TRUE
    -1.4573  -0.1732   0.1715   0.6243  -0.7749    0.9561    0.5388          1.4222          1.1999          1.5256                          0.6147

Degrees of Freedom: 51 Total (i.e. Null);  41 Residual
Null Deviance:     958.7
Residual Deviance: 191.3    AIC: 440.8

> pchisq(198.63 - 191.34, 1)
> pchisq(deviance(ced.logr) - deviance(ced.logr2), df.residual(ced.logr) - df.residual(ced.logr2))
[1] 0.993066
```

The second form of the call to pchisq shows a cleverer way to get the deviance out of the models, rather than having to type it in again by hand. But it actually involves typing many more characters....

You can more generally put an interaction term between all levels of two factors by using the : operator:

```
> summary(glm(ced.del ~ cat + follows + factor(class) + cat:factor(class), family = binomial))
Call:
glm(formula = ced.del ~ cat + follows + factor(class) + cat:factor(class), family = binomial)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-4.17648  -0.66890  -0.01394   0.88999   5.65947

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)
(Intercept)         -1.852e+00  3.200e-01  -5.786 7.20e-09
catd                 2.471e-03  4.018e-01   0.006 0.995093
catm                 7.623e-01  3.487e-01   2.186 0.028815
catn                 1.633e+00  3.719e-01   4.391 1.13e-05
catv                -1.603e+01  1.214e+03  -0.013 0.989464
followsP             9.573e-01  7.423e-02  12.896  < 2e-16
followsV             5.397e-01  5.693e-02   9.480  < 2e-16
factor(class)2       1.952e+00  3.568e-01   5.471 4.48e-08
factor(class)3       1.373e+00  3.565e-01   3.852 0.000117
factor(class)4       2.081e+00  3.506e-01   5.937 2.91e-09
catd:factor(class)2 -4.803e-01  4.427e-01  -1.085 0.278049
catm:factor(class)2 -7.520e-01  3.877e-01  -1.940 0.052417
catn:factor(class)2 -9.498e-01  4.143e-01  -2.292 0.021890
catv:factor(class)2  1.473e+01  1.214e+03   0.012 0.990323
catd:factor(class)3  9.870e-02  4.451e-01   0.222 0.824513
catm:factor(class)3 -2.376e-01  3.874e-01  -0.613 0.539550
catn:factor(class)3 -1.085e+00  4.129e-01  -2.627 0.008624
catv:factor(class)3  1.679e+01  1.214e+03   0.014 0.988968
catd:factor(class)4 -1.898e-01  4.366e-01  -0.435 0.663692
catm:factor(class)4 -8.533e-01  3.805e-01  -2.242 0.024942
catn:factor(class)4 -1.071e+00  4.074e-01  -2.627 0.008604
catv:factor(class)4  1.521e+01  1.214e+03   0.013 0.990006

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 958.66  on 51  degrees of freedom
Residual deviance: 135.68  on 30  degrees of freedom
AIC: 407.15

Number of Fisher Scoring iterations: 15

> pchisq(198.63 - 135.68, 42 - 30)
[1] 1
```

This model does fit significantly better than the model without the interaction term. However, looking at the coefficients, not only have a lot of parameters been added, but many of them look to be otiose. (Note similar estimates, little confidence that values aren't zero.) It looks like we could build a better model, and we will in a moment after another note on interaction terms. Rather than writing cat + factor(class) + cat:factor(class), you can more simply write cat*factor(class). This works if you want the written interaction term, and interaction terms and main effects for every subset of these variables. For instance, we can make the saturated model with:

```
summary(glm(ced.del ~ cat * follows * factor(class), family = binomial))
```

By construction, this model has 0 degrees of freedom, and a residual deviance of 0 (approximately; minor errors in numeric optimization occur). However, it isn't very interesting as a model. It'd be nice to understand the linguistic situation better, but looking just at the model we built with the cat:class interaction term, it now looks like this is what is going on: cat n, and more marginally, cat m have special behavior (the other categories don't). The following environment and class are significant (even though classes 2 and 4 seem to behave similarly). The crucial interaction is then between class and cat n (where deletion of /s/ occurs highly significantly less often than would otherwise be expected for class 3 and cat n, and, although the counts are small, rather more often than expected with class 1 and cat n). So, here's a couple of last attempts at a model:

```
> summary(glm(ced.del ~ I(cat == "n") + I(cat == "m") + follows + factor(class) + I(cat == "n" & class == 3) + I(cat == "n" & class == 1), family = binomial))
Call:
glm(formula = ced.del ~ I(cat == "n") + I(cat == "m") + follows + factor(class) + I(cat == "n" & class == 3) + I(cat == "n" & class == 1), family = binomial)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-4.18107  -1.34799   0.07195   1.09391   5.98469

Coefficients:
                                Estimate Std. Error z value Pr(>|z|)
(Intercept)                     -1.59891    0.11696 -13.671  < 2e-16 ***
I(cat == "n")TRUE                0.96556    0.07861  12.283  < 2e-16 ***
I(cat == "m")TRUE                0.32632    0.05533   5.898 3.68e-09 ***
followsP                         0.95515    0.07399  12.910  < 2e-16 ***
followsV                         0.51822    0.05640   9.188  < 2e-16 ***
factor(class)2                   1.36269    0.12019  11.338  < 2e-16 ***
factor(class)3                   1.29196    0.12188  10.600  < 2e-16 ***
factor(class)4                   1.48044    0.11872  12.470  < 2e-16 ***
I(cat == "n" & class == 3)TRUE  -0.58351    0.11844  -4.927 8.37e-07 ***
I(cat == "n" & class == 1)TRUE   0.41904    0.23116   1.813   0.0699 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 958.66  on 51  degrees of freedom
Residual deviance: 180.56  on 42  degrees of freedom
AIC: 428.02

Number of Fisher Scoring iterations: 4
```

This model has the same number of parameters as the initial model we built, but improves the log likelihood by about 18. That is, it makes the observed data more than 65 million times more likely. But if you look back at the raw stats for nouns of class 1 speakers, there's still just this weird fact that they almost always delete /s/ before a vowel, and almost never delete it before a pause (why ever that may be). And our model still cannot capture that! But if we make an interaction between class 1, cat n and the levels of follows, then we could. We add two more parameters, but, behold, the residual deviance goes down a lot! The deviance is close to the cat:class interaction model, in a rather more interesting way.

```
> summary(ced.logr <- glm(ced.del ~ I(cat == "n") + I(cat == "m") + follows + factor(class) + I(cat == "n" & class == 3) + I(cat == "n" & class == 1) + I(cat == "n" & class == 1):follows, family = binomial))
Call:
glm(formula = ced.del ~ I(cat == "n") + I(cat == "m") + follows + factor(class) + I(cat == "n" & class == 3) + I(cat == "n" & class == 1) + I(cat == "n" & class == 1):follows, family = binomial)

Deviance Residuals:
       Min          1Q      Median          3Q         Max
-3.634e+00  -1.280e+00  -1.054e-08   9.274e-01   6.079e+00

Coefficients:
                                          Estimate Std. Error z value Pr(>|z|)
(Intercept)                               -1.59612    0.11698 -13.644  < 2e-16 ***
I(cat == "n")TRUE                          0.95929    0.07864  12.198  < 2e-16 ***
I(cat == "m")TRUE                          0.32468    0.05534   5.867 4.45e-09 ***
followsP                                   0.99832    0.07532  13.255  < 2e-16 ***
followsV                                   0.50026    0.05665   8.831  < 2e-16 ***
factor(class)2                             1.36256    0.12021  11.334  < 2e-16 ***
factor(class)3                             1.29016    0.12191  10.583  < 2e-16 ***
factor(class)4                             1.47780    0.11874  12.445  < 2e-16 ***
I(cat == "n" & class == 3)TRUE            -0.58311    0.11854  -4.919 8.70e-07 ***
I(cat == "n" & class == 1)TRUE             0.47431    0.26866   1.765  0.07749 .
followsP:I(cat == "n" & class == 1)TRUE   -2.15756    0.61380  -3.515  0.00044 ***
followsV:I(cat == "n" & class == 1)TRUE    2.65799    1.05244   2.526  0.01155 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 958.66  on 51  degrees of freedom
Residual deviance: 146.72  on 40  degrees of freedom
AIC: 398.19

Number of Fisher Scoring iterations: 4
```

3 Long format R logistic regression models (the Design package)

You need to have loaded the Design package for this part to work. Look at Baayen chapter 1 if you don't know how to do this! Until now, we have used binary outcome data in a summary format (counts of sDel and sNoDel for each combination of levels of factors). An alternative is long format, where each observation is a line. Collected data often starts out in this form. You can then examine the data by constructing cross-tabulations. You can do them with any number of variables, but they get harder to read with more than two.

```
> ced.long <- read.table("cedegren-long.txt", header = T)
> ced.long[1:5,]
  sDel cat follows class
1    1   m       C     1
2    1   m       C     1
3    1   m       C     1
4    1   m       C     1
5    1   m       C     1
> ced.long$class <- factor(ced.long$class)
> attach(ced.long)
> xtabs(~ sDel + class)
    class
sDel    1    2    3    4
   0  414 1033 1065 1243
   1  165 1514 1320 2092
```

Do we need to build logistic regression models, or could we get everything that we need from just looking at crosstabs? For instance, it seems like we can already see here that /s/-deletion is strongly disfavored by class 1 speakers, but moderately preferred by other classes. Sometimes looking at crosstabs works, but the fundamental observation is this: the predictive effect of variables can be spurious, hidden or reversed when just looking at the marginal totals in crosstabs, because they do not take into account correlations of explanatory variables. The magical good thing that logistic regression does is work out the best way to attribute causal effect to explanatory variables. (Of course, it can only do this for variables coded in the data....) Contrary to what Baayen suggests, you can load this into the basic glm function. Here's what you get:

```
> summary(glm(sDel ~ cat + follows + class, family = binomial))
Call:
glm(formula = sDel ~ cat + follows + class, family = binomial)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.9219 -1.2013  0.7298  1.0796  1.8392

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.31827    0.12221 -10.787  < 2e-16
catd        -0.16931    0.10032  -1.688 0.091459
catm         0.17858    0.08952   1.995 0.046053
catn         0.66672    0.09651   6.908 4.91e-12
catv        -0.76754    0.21844  -3.514 0.000442
followsP     0.95255    0.07400  12.872  < 2e-16
followsV     0.53408    0.05660   9.436  < 2e-16
class2       1.27045    0.10320  12.310  < 2e-16
class3       1.04805    0.10355  10.122  < 2e-16
class4       1.37425    0.10155  13.532  < 2e-16

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 12061  on 8845  degrees of freedom
Residual deviance: 11301  on 8836  degrees of freedom
AIC: 11321

Number of Fisher Scoring iterations: 4
```

The deviances are huge when each observation is a cell. But note that the difference between the null and residual deviance is the same as we started off with originally (760). And the estimated coefficients are just as we saw previously. However, nevertheless, the Design package lets you do various spiffy stuff. So, let's try that one. It turns out that some things don't work with lrm unless you use the magical incantation to create a data distribution object. I'm still trying to figure out what this does myself, but if you print it, part of it seems to summarize the factors....

```
> ced.ddist <- datadist(ced.long)
> options(datadist = "ced.ddist")
```

But at any rate: (i) you need to have done it before you can call various other methods like summary or anova on an lrm object, and (ii) it's best to call it right at the beginning and set it with the options() function, because then various data statistics are stored there rather than being computed each time.

```
> ced.lrm <- lrm(sDel ~ cat + follows + factor(class))
> ced.lrm

Logistic Regression Model

lrm(formula = sDel ~ cat + follows + factor(class))

Frequencies of Responses
   0    1
3755 5091

 Obs  Max Deriv  Model L.R.  d.f.  P     C     Dxy   Gamma  Tau-a    R2  Brier
8846      5e-09      760.04     9  0  0.66  0.32   0.336  0.156  0.111  0.224

             Coef    S.E.     Wald Z  P
Intercept   -1.3183  0.12221  -10.79  0.0000
cat=d       -0.1693  0.10032   -1.69  0.0915
cat=m        0.1786  0.08952    1.99  0.0461
cat=n        0.6667  0.09651    6.91  0.0000
cat=v       -0.7675  0.21844   -3.51  0.0004
follows=P    0.9525  0.07400   12.87  0.0000
follows=V    0.5341  0.05660    9.44  0.0000
class=2      1.2704  0.10320   12.31  0.0000
class=3      1.0480  0.10355   10.12  0.0000
class=4      1.3742  0.10155   13.53  0.0000
```

Here "Model L.R." is again the difference between Null and Residual deviance. This is again associated with a difference in degrees of freedom and a p-value. Hooray! Our model is better than one with no factors! Now I can get an anova. Note: for lrm, ANOVA does not present sequential tables adding factors, but considers each factor separately!

```
> anova(ced.lrm)
Wald Statistics          Response: sDel

Factor   Chi-Square d.f. P
cat      164.93     4    <.0001
follows  214.96     2    <.0001
class    197.62     3    <.0001
TOTAL    656.61     9    <.0001
```

Among the nifty features of lrm is that you can do penalized (or regularized) estimation of coefficients to avoid overfitting (searched for with pentrace, or specified with the penalty parameter in lrm); see Baayen pp. 224-227. This was the regularization that appeared in the Hayes and Wilson paper. lrm has a really nice option to plot the logit coefficients (turned back into probabilities) for each level of each factor. AFAIK, you can't do this with glm.

```
> par(mfrow = c(2, 2))
> plot(ced.lrm, fun = plogis, ylab = "Pr(sDel)", adj.subtitle = F, ylim = c(0, 1))
```

[Figure: Pr(sDel) plotted for each level of cat, follows and class]

It'd be even more fun with a real-valued explanatory variable. One thing that we haven't addressed is that the class variable should by rights be an ordinal scale. There is an extension of logistic regression to ordinal explanatory variables, and it's discussed in Baayen, pp. 227ff. However, looking at this figure shows that the probability of /s/-deletion does not increase monotonically as you head along this putative ordinal scale. Something else seems to be at work. And so an ordinal logistic regression analysis is counterindicated here.

4 Working out data likelihood: practice with vectors

Most R calculations work directly on vectors, and that's often handy. Here we use that to directly work out the likelihood of the data according to a model. We will use our very initial model again for illustration. We start with predicted probabilities of sDeletion for each cell, and turn them into log probs:

```
> probsdel <- fitted(ced.logr)
```
Regression model project analysis report (with code and data)
Abstract: the project evaluates the relationship between a set of variables and miles per gallon (MPG). Motor Trend is essentially interested in answering two specific questions: * Is an automatic or manual transmission better for MPG? * Quantify the MPG difference between automatic and manual transmissions.
We confirm that transmission alone is not sufficient to explain the variation in MPG.
For this project we accept acceleration, transmission and weight as the variables that explain 84% of the variation in fuel consumption.
The analysis shows that, using our best-fitting model to identify which variables explain most of the variation in MPG, a manual transmission allows us to drive about 2.97 more miles per gallon.
(A.1) 1. Exploratory data analysis. A first simple analysis with a box plot shows that manual transmissions clearly have higher MPG values, i.e. better performance.
The mean fuel mileage by transmission type is given in the table below; manual transmissions perform better than automatic ones.
According to Appendix A.4, comparing the means of the two transmission groups lets us reject the null hypothesis at the 0.05 significance level.
A second conclusion embedded in the plot above is that other variables may also play an important role in fuel mileage and should therefore be considered.
Since the simplistic model shows that transmission alone explains only 35% of the variation in MPG (Appendix A.2), we will test different models in which the influence of this single variable is reduced, so that we can answer whether transmission is the only variable to hold responsible, or whether other variables in fact have a stronger relationship with fuel mileage (i.e. MPG) than transmission itself.
### 2. Model testing (linear regression and multivariable regression)
From the ANOVA analysis we can see that accepting transmission as the only variable related to fuel consumption would be a misinterpretation.
A more complete model, in which variables such as weight, acceleration and transmission are taken into account, shows a stronger association with fuel mileage (i.e. MPG).
An F statistic of 62.11 tells us that, if the null hypothesis were true, an F ratio this large would occur with probability below the 0.1% significance level, so we can conclude that model 2 is clearly a better predictor of fuel consumption than a model that considers transmission alone.
To assess the overall fit of our model, we ran another analysis to retrieve the adjusted R-squared, which lets us infer that model 2, with transmission, acceleration and weight selected, explains roughly 84% of the variation when predicting fuel mileage.
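As a rough sketch of the comparisons described above, the following uses the built-in mtcars data; treating qsec (quarter-mile time) as the acceleration variable is our assumption, and the exact model specification used in the report may differ.

```
data(mtcars)

# Compare mean mpg between automatic (am = 0) and manual (am = 1) transmissions
t.test(mpg ~ am, data = mtcars)

# Transmission-only model vs. a model that also includes weight and acceleration
fit_am   <- lm(mpg ~ am, data = mtcars)
fit_full <- lm(mpg ~ am + wt + qsec, data = mtcars)

# Nested-model comparison (ANOVA) and fit statistics of the larger model
anova(fit_am, fit_full)
summary(fit_full)$adj.r.squared
summary(fit_full)$coefficients
```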
Using logistic regression in R to study murder rates
Problem and data source: the following case comes from a US social survey. To study which factors are related to the murder rate, data on the murder rate, income level, illiteracy rate, life expectancy and number of frost days were collected for 50 US states, with the following structure:

```
$ Population: num  3615 365 2212 2110 21198 ...
$ Income    : num  3624 6315 4530 3378 5114 ...
$ Illiteracy: num  2.1 1.5 1.8 1.9 1.1 0.7 1.1 0.9 1.3 2 ...
$ Life Exp  : num  69 69.3 70.5 70.7 71.7 ...
$ Murder    : num  15.1 11.3 7.8 10.1 10.3 6.8 3.1 6.2 10.7 13.9 ...
$ HS Grad   : num  41.3 66.7 58.1 39.9 62.6 63.9 56 54.6 52.6 40.6 ...
$ Frost     : num  20 152 15 65 20 166 139 103 11 60 ...
$ Area      : num  50708 566432 113417 51945 156361 ...
$ 谋杀率     : num  1 1 1 1 1 0 0 0 1 1 ...
$ 是否高寿   : num  0 0 0 0 1 1 1 0 0 0 ...
```

Descriptive statistics
1. Analyse the distribution of the murder rate across states.
From the kernel density plot below, the murder rate is approximately normally distributed.
2. Examine how well the variables fit one another. Here we mainly use the car package: the scatterplotMatrix() function shows the pairwise relationships among the variables and how they fit the response. From the matrix of fitted plots we can see that the murder rate is negatively related to income level, life expectancy and the number of frost days, and positively related to the illiteracy rate, which is easy to understand.
3. Further test the correlations between the variables. Here we mainly use the psych package: the corr.test() function computes the correlations between the variables and their significance. In R, enter:

```
corr.test(sj[, 2:6])
Call: corr.test(x = sj[, 2:6])
Correlation matrix
           Income Illiteracy Life Exp Murder HS Grad
Income       1.00      -0.44     0.34  -0.23    0.62
Illiteracy  -0.44       1.00    -0.59   0.70   -0.66
Life Exp     0.34      -0.59     1.00  -0.78    0.58
Murder      -0.23       0.70    -0.78   1.00   -0.49
HS Grad      0.62      -0.66     0.58  -0.49    1.00
Sample Size
[1] 50
Probability values (Entries above the diagonal are adjusted for multiple tests.)
           Income Illiteracy Life Exp Murder HS Grad
Income       0.00          0     0.03   0.11       0
Illiteracy   0.00          0     0.00   0.00       0
Life Exp     0.02          0     0.00   0.00       0
Murder       0.11          0     0.00   0.00       0
HS Grad      0.00          0     0.00   0.00       0
To see confidence intervals of the correlations, print with the short=FALSE option
```

The first part of the output shows the correlations between the variables, and the second part shows the p-values of the significance tests for those correlations.
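A minimal sketch of how the logistic regression itself might then be fitted is shown below; the study derives a binary murder-rate indicator from the raw murder rate, and the median cut-off and the variable name HighMurder used here are assumptions for illustration only.

```
# Build a data frame from the built-in state.x77 matrix as a stand-in for the sj data above
sj <- as.data.frame(state.x77)
names(sj) <- make.names(names(sj))   # e.g. "Life Exp" -> "Life.Exp"

# Binary outcome: murder rate above the national median (assumed coding)
sj$HighMurder <- as.numeric(sj$Murder > median(sj$Murder))

# Logistic regression of the binary murder-rate indicator on the candidate predictors
fit <- glm(HighMurder ~ Income + Illiteracy + Life.Exp + Frost,
           data = sj, family = binomial())
summary(fit)
exp(coef(fit))   # odds ratios
```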
[Original] R analysis of economic indicators: regression and time-series report (with code and data)
This article uses R to run regression and time-series analyses on economic indicators, with the aim of exploring how the indicators affect GDP and how GDP is likely to evolve.
First, we use OLS regression to analyse the relationship between GDP and the individual economic indicators, and draw conclusions from the results.
Next, we introduce the ARIMA time-series model to forecast GDP and interpret the forecasts, providing a reference for decision makers.
In addition, the code and data are attached so that readers can reproduce the whole analysis.
The main contents of this article are:
1. Data acquisition and processing
2. OLS regression analysis
3. Time-series analysis
4. Conclusions and reflections
Through this analysis we have not only deepened our understanding of R and of methods for economic analysis, but also reached some conclusions of practical relevance for economic policy.
At the same time, the data and code can serve as a reference for readers in their own applications and research.
It should be noted that the data used here come from publicly released official statistics, whose accuracy and authenticity have been verified.
To avoid copyright issues, no other material is quoted in this article.
We hope this article is helpful for readers interested in economic analysis and R, and we welcome comments and suggestions so that we can further improve the quality and depth of the analysis.
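Since the code itself is not reproduced in this summary, the following is only a schematic sketch of the two analysis steps described above, with simulated series standing in for the real indicators; auto.arima() and forecast() come from the forecast package, and the variable names are placeholders.

```
library(forecast)

# Simulated annual GDP series and two indicator series (placeholders for the real data)
set.seed(42)
n <- 30
investment  <- cumsum(rnorm(n, 5, 2))
consumption <- cumsum(rnorm(n, 4, 2))
gdp <- 100 + 1.5 * investment + 2.0 * consumption + rnorm(n, 0, 5)

# Step 1: OLS regression of GDP on the indicators
ols_fit <- lm(gdp ~ investment + consumption)
summary(ols_fit)

# Step 2: ARIMA model for GDP and a 5-period forecast
gdp_ts <- ts(gdp, start = 1990, frequency = 1)
arima_fit <- auto.arima(gdp_ts)
forecast(arima_fit, h = 5)
```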
Classification evaluation for logistic regression models and its implementation in R
Introduction: in machine learning, logistic regression is a commonly used classification algorithm.
It predicts the probability of a binary outcome, estimating the probability of the target class from a linear combination of the predictor variables.
This article introduces the evaluation metrics for logistic regression classifiers and implements them in R.
1. Classification evaluation metrics
1. Accuracy. Accuracy is one of the most common evaluation metrics for classification models.
It is the proportion of samples that the classifier labels correctly out of all samples.
It is computed as: Accuracy = (TP + TN) / (TP + TN + FP + FN), where TP (True Positives) is the number of positive samples classified correctly, TN (True Negatives) the number of negative samples classified correctly, FP (False Positives) the number of negative samples incorrectly classified as positive, and FN (False Negatives) the number of positive samples incorrectly classified as negative.
2. Precision. Precision measures the classifier's ability to label the positive class correctly among its positive predictions.
It is computed as: Precision = TP / (TP + FP). The higher the precision, the lower the proportion of negative samples mistakenly labelled as positive.
3. Recall. Recall measures the classifier's ability to identify the positive samples, i.e. how rarely positive samples are missed and misclassified as negative.
It is computed as: Recall = TP / (TP + FN). The higher the recall, the stronger the classifier's ability to identify positive samples.
4. F1 score. The F1 score is the harmonic mean of precision and recall and summarizes both in a single number.
It is computed as: F1 = 2 * (Precision * Recall) / (Precision + Recall). The higher the F1 score, the better the overall performance of the classifier.
5. ROC curve and AUC. The ROC curve (Receiver Operating Characteristic curve) plots the true positive rate on the vertical axis against the false positive rate on the horizontal axis; the AUC is the area under this curve.
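A minimal base R sketch of how these metrics might be computed for a fitted logistic regression is shown below; the simulated data and the 0.5 classification threshold are assumptions, and the commented lines show how the pROC package could be used for the ROC curve and AUC.

```
# Simulated binary classification data (illustration only)
set.seed(123)
x <- rnorm(300)
y <- rbinom(300, 1, plogis(1.5 * x))
fit <- glm(y ~ x, family = binomial())

# Classify with a 0.5 probability threshold and build the confusion matrix
prob <- predict(fit, type = "response")
pred <- as.numeric(prob >= 0.5)
tab  <- table(Predicted = pred, Actual = y)
TP <- tab["1", "1"]; TN <- tab["0", "0"]
FP <- tab["1", "0"]; FN <- tab["0", "1"]

accuracy  <- (TP + TN) / (TP + TN + FP + FN)
precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)
f1        <- 2 * precision * recall / (precision + recall)
c(accuracy = accuracy, precision = precision, recall = recall, F1 = f1)

# ROC curve and AUC (requires the pROC package)
# library(pROC)
# roc_obj <- roc(y, prob)
# auc(roc_obj); plot(roc_obj)
```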
A study of ST stocks based on a logistic regression model
Research question: using data on a listed-company sample, examine how net cash flow from operating activities, return on equity, earnings per share and net assets per share affect whether a stock is flagged as ST.
Data description: the stocks were sampled at random.
The response variable is whether the stock is ST (0 = not ST, 1 = ST).
To predict ST status, we collected the indicators below from the corresponding year.
The data are stored in the csv file 上市公司数据 (1).csv.
We carry out a complete logistic regression analysis, including parameter estimation, hypothesis tests, prediction and model evaluation. The response variable (whether the stock is ST) takes the values

```
STindex
[1] 1 0
```

Data description: scatterplots of the variables against one another, and their correlations:

```
                            经营活动产生的现金流量净额  净资产收益率...    每股收益   每股净资产          ST
经营活动产生的现金流量净额                1.00000000      -0.06822659   0.1434707   0.3954300  -0.1177785
净资产收益率...                          -0.06822659       1.00000000   0.4684903  -0.1083383   0.1127746
每股收益                                  0.14347066       0.46849026   1.0000000   0.3101421  -0.1607072
每股净资产                                0.39543001      -0.10833833   0.3101421   1.0000000  -0.4064833
ST                                       -0.11777849       0.11277458  -0.1607072  -0.4064833   1.0000000
```

From the plot we can see the correlations among the variables; in particular, earnings per share and net assets per share are positively correlated.
The box plots show that the four variables differ markedly between ST and non-ST stocks.
The indicator values of non-ST stocks are higher than those of ST stocks.
We therefore fit a logistic regression model.
We randomly select two thirds of the observations as the training set and fit the model:

```
summary(fit)

Call:
glm(formula = ST ~ 经营活动产生的现金流量净额 + 净资产收益率... + 每股收益 + 每股净资产,
    family = binomial, data = data_train)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.5105  -0.9038  -0.3875   0.9781   1.9334

Coefficients:
                              Estimate Std. Error z value Pr(>|z|)
(Intercept)                  7.272e-01  4.283e-01   1.698  0.08950 .
经营活动产生的现金流量净额    3.803e-10  4.233e-10   0.899  0.36888
净资产收益率...               2.198e-01  2.808e-01   0.783  0.43365
每股收益                     -2.121e+00  8.805e-01  -2.409  0.01600 *
每股净资产                   -4.901e-01  1.641e-01  -2.986  0.00282 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 96.716  on 70  degrees of freedom
Residual deviance: 74.795  on 66  degrees of freedom
AIC: 84.795

Number of Fisher Scoring iterations: 6
```

From the output, the fitted equation on the logit scale is ST = 0.727 + 3.803e-10 × 经营活动产生的现金流量净额 + 0.220 × 净资产收益率 - 2.121 × 每股收益 - 0.490 × 每股净资产. The p-values of 每股收益 (earnings per share) and 每股净资产 (net assets per share) are both below the 0.05 significance level, so these two partial regression coefficients are significantly different from zero at the 0.05 level.
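The report goes on to evaluate predictions on the held-out third of the data. The sketch below shows one common way such an evaluation might be done; the formula ST ~ . and the random seed are illustrative assumptions rather than the report's exact code.

```
# Read the data file named in the report (adjust the path if needed)
dat <- read.csv("上市公司数据 (1).csv")

# Two-thirds training / one-third test split (seed is arbitrary)
set.seed(2024)
idx <- sample(nrow(dat), size = floor(2 / 3 * nrow(dat)))
data_train <- dat[idx, ]
data_test  <- dat[-idx, ]

# Logistic regression of the ST indicator on the remaining columns
fit <- glm(ST ~ ., data = data_train, family = binomial())
summary(fit)

# Test-set predictions and confusion matrix at a 0.5 threshold
p_hat <- predict(fit, newdata = data_test, type = "response")
table(Predicted = as.numeric(p_hat > 0.5), Actual = data_test$ST)
```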
Logistic regression with R
by jmzeng (******************)
The essence of regression is to build a model for prediction; what makes logistic regression special is that the predicted outcome can take only two values, true or false.
Doing logistic regression in R is also simple: prepare the data set, build the model with the glm() function (generalized linear model), and predict with the predict() function.
Here I walk through a simple example taken from a UCLA course.
First load the required packages, then load the test data. The data set concerns admission to graduate school: based on a student's GRE score, GPA and the rank of their institution, we predict whether the student is admitted.
• GRE score is a continuous variable; a student can obtain any normal score.
• GPA is also continuous; any normal GPA is possible.
• Rank is in principle continuous as well, but usually only the top few ranks are eligible to apply, so here we treat it as a factor and only consider the top four ranks. The purpose of this logistic regression is simple: predict the probability of admission from the student's rank, GPA and TOEFL or GRE score. Next we build the model. From the summary of this model we can read:
• For each additional GRE point, the log odds of admission increase by 0.002.
• For each additional unit of GPA, the log odds of admission increase by 0.804 (though GPA differences are usually only a few tenths).
• Other things being equal, a student from a rank-2 institution has log odds of admission 0.675 lower than a student from a rank-1 institution, so rank matters a great deal.
A word on what the log odds means: if the predicted probability of admission is p, the log odds is log(p / (1 - p)), where the logarithm is the natural logarithm.
Finally, we can use the model to make predictions. The higher the rank, the higher the probability of admission:

```
log(0.5166016 / (1 - 0.5166016))    ## log odds for rank 1
0.06643082
log((0.3522846 / (1 - 0.3522846)))  ## log odds for rank 2
-0.609012
```

The difference between the two is exactly 0.675, just as the model predicts. Finally we can produce some simple visualizations.
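The post does not reproduce the code itself, so here is a minimal sketch of the UCLA admissions example it describes; the data URL is the one commonly used in the UCLA tutorial and is given here as an assumption.

```
# Graduate admissions data used in the UCLA tutorial (URL assumed; download manually if it changes)
mydata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")

# Treat rank as a factor: only the top four ranks are considered
mydata$rank <- factor(mydata$rank)

# Logistic regression of admission on GRE, GPA and institution rank
mylogit <- glm(admit ~ gre + gpa + rank, data = mydata, family = "binomial")
summary(mylogit)

# Predicted admission probability for rank-1 vs rank-2 applicants with average GRE and GPA
newdata <- data.frame(gre  = mean(mydata$gre),
                      gpa  = mean(mydata$gpa),
                      rank = factor(c(1, 2), levels = 1:4))
predict(mylogit, newdata = newdata, type = "response")
```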
The logistic regression model
Regression is an easily understood kind of model: it is simply y = f(x), describing the relationship between the predictors x and the response y.
A familiar analogy is a doctor's examination: looking, listening, asking and feeling the pulse provide the predictors x (the feature data), and deciding whether the patient is ill, and with what, corresponds to the response y (the predicted class).
The simplest regression is linear regression. Borrowing a figure from Andrew Ng's lecture notes (figure 1.a), X is the data point (tumour size) and Y is the observed outcome (whether the tumour is malignant).
After building a linear regression model h_θ(x), we can predict malignancy from tumour size: h_θ(x) ≥ 0.5 is classified as malignant and h_θ(x) < 0.5 as benign.
In logistic regression the model is Z_i = ln( P_i / (1 - P_i) ) = β0 + β1·x1 + ... + βn·xn.
Data description: we use R to run a logistic regression, build the model and produce an analysis report with conclusions. The data had some small problems which have now been corrected: the response variable is the intention to purchase a car (1 = buy, 0 = not buy), and the other items serve as predictors, in order to study which variables have a significant influence on a user's car-purchase behaviour.
Problem description: we attempt to predict whether an individual intends to purchase a car (1 = buy, 0 = not buy) using logistic regression on the demographic variables available in the data.
In the process we will: 1. import the data; 2. check for class bias; 3. create training and test samples; 4. build the logit model and predict on the test data; 5. run model diagnostics.
在这个过程中,我们将:1.导入数据2.检查类别偏差3.创建训练和测试样本4.建立logit模型并预测测试数据5.模型诊断数据描述分析查看部分数据head(inputData)是否有汽车购买意愿.1买0不买. 区域城市人均地区生产总值.元.1 NA NA2 NA NA3 0 中部长沙 1078904 0 中部长沙 1078905 0 中部长沙 1078906 0 中部长沙 107890职工平均工资.元. 全市总人口.万人. 全市面积.平方公里.1 NA NA NA2 NA NA NA3 56383.16 662.8 118164 56383.16 662.8 118165 56383.16 662.8 118166 56383.16 662.8 11816全市人口密度.人.平方公里. 市区总人口.万人. 市区面积.平方公里.1 NA NA NA2 NA NA NA3 560.94 299.3 19104 560.94 299.3 19105 560.94 299.3 19106 560.94 299.3 1910市区人口密度.人.平方公里. 城市道路面积.万平方米. 公共汽.电.车车辆数.辆.1 NA NA NA2 NA NA NA3 1566.75 29964 1574 1566.75 2996 4 1575 1566.75 2996 4 1576 1566.75 2996 4 157公交客运总量.万人次. 出租汽车数.辆. 每万人拥有公共汽车.辆.1 NA NA NA2 NA NA NA3 73943 6915 13.894 73943 6915 13.895 73943 6915 13.896 73943 6915 13.89人均城市道路面积.平方米. 私人汽车保有量.辆. 地铁条数地铁长度1 NA NA NA NA2 NA NA NA NA3 10.01 1200000 0 04 10.01 1200000 0 05 10.01 1200000 0 06 10.01 1200000 0 0日平均温度.F.的平均值日最高温度.F.的最大值日最高温度.F.的平均值1 NA NA NA2 NA NA NA3 64.42 104 71.54 64.42 104 71.55 64.42 104 71.56 64.42 104 71.5日最低温度.F.的平均值日最低温度.F.的最小值日最高温低于0度天数1 NA NA NA2 NA NA NA3 57.3 26 04 57.3 26 05 57.3 26 06 57.3 26 0日最低温低于0度天数日最高温高于30度天数下雨天数住房数性别.1男2女.1 NA NA NA NA N A2 NA NA NA NA N A3 22 95 173 2 14 22 95 173 2 25 22 95 173 3 16 22 95 173 1 1年龄职业类型学生.1代表是.后同. 蓝领白领.粉领其他职业或无职业1 NA NA NA NA NA NA2 NA NA NA NA NA NA3 404 0 0 1 04 30 4 0 0 1 05 26 4 0 0 1 06 30 2 0 0 1 0电动自行车数量汽车数量摩托车数量有驾照司机数成人数儿童数在家1 NA NA NA2 NA NA NA3 1 1 0 1 2 1 54 2 1 1 2 2 1 55 1 1 1 1 2 1 56 3 0 1 0 3 0 5上学工作家庭收入行程出行时间 X 购买时间购买时间.1 购买时间.21 NA NA NA NA NA2 NA NA NA NA NA3 4 5 10.0 0.63 NA 2009 20114 2 11 20.0 0.25 NA 2009 2009 2008.0005 5 11 2.0 0.12 NA 2011 NA6 2 3 2.7 0.17 NA 2009 2011购买时间.3 购买时间.4 购买时间.5 购买时间.61 NA NA NA2 NA NA NA3 NA NA NA4 NA NA NA5 NA NA NA6 NA NA NA查看数据维度[1] 948 56对数据进行描述统计分析:是否有汽车购买意愿.1买0不买. 区域城市Min. :0.0000 东部 :414 安庆 : 371st Qu.:0.0000 南部 :122 青岛 : 27Median :0.0000 北部 :121 镇江 : 27Mean :0.2144 中部 : 81 柳州 : 263rd Qu.:0.0000 西北 : 74 唐山 : 26Max. :1.0000 西南 : 68 赤峰 : 24NA's :20 (Other): 68 (Other):781人均地区生产总值.元. 职工平均工资.元. 全市总人口.万人. 全市面积.平方公里. Min. : 17096 Min. :32183 Min. : 53.6 Min. : 761 1st Qu.: 36340 1st Qu.:41305 1st Qu.: 345.9 1st Qu.: 7615 Median : 54034 Median :48270 Median : 613.3 Median :12065 Mean : 63605 Mean :49529 Mean : 635.4 Mean :15970 3rd Qu.: 84699 3rd Qu.:54211 3rd Qu.: 759.7 3rd Qu.:16757 Max. :155690 Max. :93997 Max. :3358.4 Max. :90021。
R logistic regression case study
While linear regression is used to predict a continuous Y variable, logistic regression is used for binary classification.
If we used linear regression to model a dichotomous variable (as Y), the resulting model might not restrict the predicted Y values to 0 and 1.
In addition, other assumptions of linear regression, such as normality of the errors, may be violated.
We therefore model the log odds of the event, ln(P / (1 - P)), where P is the probability of the event.
This equation can be fitted with glm() by setting the family argument accordingly.
However, we are usually more interested in the probability of the event than in its log odds.
The predictions of the model above, which are on the log-odds scale, can therefore be converted into event probabilities. With family = "binomial", this conversion is done with the plogis() function, as shown below when we build the logit model and make predictions.
Example problem: let us try to predict whether an individual earns more than $50,000, using logistic regression on the demographic variables available in the adult data.
In the process we will: 1. import the data; 2. check for class bias; 3. create training and test samples; 4. compute information values to find the important variables; 5. build the logit model and predict on the test data; 6. run model diagnostics.

Import the data:

```
inputData <- read.csv("/wp-content/uploads/2015/09/adult.csv")
head(inputData)
#=>   AGE        WORKCLASS FNLWGT EDUCATION EDUCATIONNUM      MARITALSTATUS
#=> 1  39        State-gov  77516 Bachelors           13      Never-married
#=> 2  50 Self-emp-not-inc  83311 Bachelors           13 Married-civ-spouse
#=> 3  38          Private 215646   HS-grad            9           Divorced
#=> 4  53          Private 234721      11th            7 Married-civ-spouse
#=> 5  28          Private 338409 Bachelors           13 Married-civ-spouse
#=> 6  37          Private 284582   Masters           14 Married-civ-spouse
#=>          OCCUPATION  RELATIONSHIP  RACE    SEX CAPITALGAIN CAPITALLOSS
#=> 1      Adm-clerical Not-in-family White   Male        2174           0
#=> 2   Exec-managerial       Husband White   Male           0           0
#=> 3 Handlers-cleaners Not-in-family White   Male           0           0
#=> 4 Handlers-cleaners       Husband Black   Male           0           0
#=> 5    Prof-specialty          Wife Black Female           0           0
#=> 6   Exec-managerial          Wife White Female           0           0
#=>   HOURSPERWEEK NATIVECOUNTRY ABOVE50K
#=> 1           40 United-States        0
#=> 2           13 United-States        0
#=> 3           40 United-States        0
#=> 4           40 United-States        0
#=> 5           40          Cuba        0
#=> 6           40 United-States        0
```

Check class bias: ideally, the proportions of events and non-events in the Y variable should be roughly the same.
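A sketch of the class-bias check and the training/test split described above is given below; it assumes inputData has been read in as shown and contains the 0/1 column ABOVE50K, and the equal-sampling scheme is one common choice rather than necessarily the post's exact code.

```
# Class bias: proportion of events (income above 50K) vs. non-events
table(inputData$ABOVE50K)
prop.table(table(inputData$ABOVE50K))

# Draw training samples so that both classes are equally represented (illustrative approach)
input_ones  <- inputData[inputData$ABOVE50K == 1, ]
input_zeros <- inputData[inputData$ABOVE50K == 0, ]
set.seed(100)
ones_rows  <- sample(1:nrow(input_ones),  floor(0.7 * nrow(input_ones)))
zeros_rows <- sample(1:nrow(input_zeros), floor(0.7 * nrow(input_ones)))  # same number of 0s as 1s
trainingData <- rbind(input_ones[ones_rows, ], input_zeros[zeros_rows, ])
testData     <- rbind(input_ones[-ones_rows, ], input_zeros[-zeros_rows, ])

# Logistic regression on a few of the demographic variables
logitMod <- glm(ABOVE50K ~ RELATIONSHIP + AGE + CAPITALGAIN + OCCUPATION + EDUCATIONNUM,
                data = trainingData, family = binomial(link = "logit"))
predicted <- predict(logitMod, testData, type = "response")
```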
Logistic regression with R
Logistic regression is a commonly used classification algorithm that predicts the outcome of a binary variable.
In R we can run a logistic regression analysis with the glm() function.
First we need to prepare the data.
Suppose we have a data set containing each person's gender, age and smoking status, and we want to predict whether the person has lung cancer.
We can generate an example data set with the following code:

```
set.seed(123)
gender  <- sample(c("male", "female"), 100, replace = TRUE)
age     <- rnorm(100, 50, 10)
smoking <- sample(c("yes", "no"), 100, replace = TRUE)
# Make the response a factor so that glm() with family = binomial accepts it
cancer  <- factor(ifelse(age > 60 & smoking == "yes", "yes", "no"))
data    <- data.frame(gender, age, smoking, cancer)
```

Next, we can use the glm() function to fit the logistic regression model.
In this example we use gender, age and smoking status as predictors and lung-cancer status as the response. The code is:

```
model <- glm(cancer ~ gender + age + smoking, data = data, family = binomial)
summary(model)
```

In this model we use the binomial family, because the response is a binary variable.
The summary() function prints a summary of the model, including the coefficient, standard error, z value and p-value of each predictor.
We can use this information to judge whether each predictor has a significant influence on the response.
If we want to predict whether a new individual has lung cancer, we can use the predict() function.
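Continuing the simulated example above, a minimal sketch of predicting for a new individual is shown below; the 0.5 cutoff is an arbitrary choice.

```
# A new individual: a 65-year-old male smoker
new_person <- data.frame(gender = "male", age = 65, smoking = "yes")

# Predicted probability of the "yes" (cancer) class
prob <- predict(model, newdata = new_person, type = "response")
prob

# Convert the probability into a class label with a 0.5 cutoff
ifelse(prob > 0.5, "yes", "no")
```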
Calibration curves for logistic regression in R (four examples for reference; example 1 follows)
With the development of the modern logistics industry, regression analysis of logistics data has become a common tool in logistics management.
As data accumulate and analysis techniques develop, this kind of regression analysis is becoming increasingly important.
R is a popular statistical analysis tool that can help us carry out all kinds of data analysis, including this kind of regression analysis.
Such a regression model is used to study the relationships between the different variables involved in freight transport.
By running a regression on logistics data we can understand how the variables influence one another and predict future transport outcomes.
When doing this kind of regression analysis we often use a calibration curve to assess how well the model fits.
A calibration curve is a way to assess the predictive accuracy of a model.
We compare the model's observed outcomes with its predicted values to evaluate the model's accuracy.
A calibration curve is usually drawn as the curve relating the observed outcomes to the predicted values.
Using R for this regression analysis and for producing calibration curves is very convenient.
R provides a rich set of statistical functions and visualization tools, which let us analyse the data quickly and accurately and produce intuitive calibration curves.
Next, we describe how to run the regression analysis and draw the calibration curve in R.
First we prepare a set of logistics data, for example transport time, transport distance and transport cost for each shipment.
Then we can use the lm() function in R to run the regression analysis.
lm() builds a linear model that describes the relationships between the variables.
Next, summary() shows the details of the model, including each variable's coefficient, standard error, t value and p-value.
This information helps us judge the model's goodness of fit and reliability.
We can then use predict() to obtain the predicted value for each observation.
Plotting the observed values against the predicted values as a scatterplot shows how well the model predicts.
To draw the calibration curve, we compute the deviation of each prediction and plot the curve relating the observed values to the predicted values.
By inspecting the calibration curve we get an intuitive impression of the model's predictive accuracy.
If the calibration curve lies close to the 45-degree line, the model predicts well; if it deviates substantially from the 45-degree line, the model predicts poorly.
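Since the discussion above stays at the conceptual level, here is one base R way a calibration curve for a logistic regression model (as in the article's title) might be drawn; the decile binning and the simulated data are assumptions for illustration.

```
# Simulated binary outcome and a fitted logistic model (illustration only)
set.seed(7)
x <- rnorm(1000)
y <- rbinom(1000, 1, plogis(-0.3 + 1.1 * x))
fit <- glm(y ~ x, family = binomial())
p_hat <- predict(fit, type = "response")

# Group predictions into deciles and compare mean predicted vs. observed event rate
bins <- cut(p_hat, breaks = quantile(p_hat, probs = seq(0, 1, 0.1)), include.lowest = TRUE)
pred_mean <- tapply(p_hat, bins, mean)
obs_rate  <- tapply(y, bins, mean)

# Calibration plot: points near the 45-degree line indicate good calibration
plot(pred_mean, obs_rate, xlim = c(0, 1), ylim = c(0, 1),
     xlab = "Mean predicted probability", ylab = "Observed event rate")
abline(0, 1, lty = 2)
```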
1. Probit regression models
In R, a probit model can be fitted with the glm() function (generalized linear models) by setting the link option to probit, and summary() gives the details of the glm fit. Unlike lm, however, summary for a generalized linear model does not report a coefficient of determination; the pR2() function in the pscl package gives pseudo R-squared values, after which summary can again be used for the details.

```
> library(RSADBE)
> data(sat)
> pass_probit <- glm(Pass ~ Sat, data = sat, binomial(probit))
> summary(pass_probit)
> library(pscl)
> pR2(pass_probit)
> predict(pass_probit, newdata = list(Sat = 400), type = "response")
> predict(pass_probit, newdata = list(Sat = 700), type = "response")
```

2. Logistic regression models
A logistic regression model can be fitted with glm() and the option family = binomial.
```
> library(RSADBE)
> data(sat)
> pass_logistic <- glm(Pass ~ Sat, data = sat, family = 'binomial')
> summary.glm(pass_logistic)
> pR2(pass_logistic)
> with(pass_logistic, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))
> confint(pass_logistic)
> predict.glm(pass_logistic, newdata = list(Sat = 400), type = "response")
> predict.glm(pass_logistic, newdata = list(Sat = 700), type = "response")
> sat_x <- seq(400, 700, 10)
> pred_l <- predict(pass_logistic, newdata = list(Sat = sat_x), type = "response")
> plot(sat_x, pred_l, type = "l", ylab = "Probability", xlab = "Sat_M")
```

Explanation of the code above: glm() fits the logistic model, and summary.glm() gives the details of the fit. The null deviance and residual deviance play a role similar to the residual sum of squares in a linear regression model and can be used to assess goodness of fit: the null deviance is the deviance of a model with no predictor information, and if the predictors influence the response, the residual deviance should be clearly smaller than the null deviance.
Implementing a logistic regression data analysis in R
Source: 大数据部落
Logistic regression is a method for fitting a regression curve, y = f(x), when y is a categorical variable.
The typical use of this model is to predict y given a set of predictors x.
The predictors can be continuous, categorical or a mix of both.
Logistic regression in R
R makes it very easy to fit a logistic regression model.
The function to call is glm(), and the fitting process is not so different from the one used in linear regression.
In this post I will fit a binary logistic regression model and explain each step.
The data set
We will work on the Titanic data set.
There are different versions of this data set freely available online, but I suggest using the one provided by Kaggle, since it is almost ready to use (to download it you need to register on Kaggle).
The (training) data set contains data about a number of passengers (889, to be precise), and the goal of the competition is to predict survival (1 if the passenger survived, 0 otherwise) based on features such as passenger class, sex, age and so on.
As you can see, we will use both categorical and continuous variables.
The data cleaning process
When working with a real data set, we need to take into account the fact that some data may be missing or corrupted, so we have to prepare the data set for the analysis.
As a first step we load the csv data with the read.csv() function.
Make sure that the argument na.strings is set to c("") so that every missing value is encoded as NA.
This will help us in the next steps.
```
training.data.raw <- read.csv('train.csv', header = TRUE, na.strings = c(""))
```
Now we need to check for missing values and look at how many unique values each variable takes, using the sapply() function, which applies a function to each column of the data frame.
```
sapply(training.data.raw, function(x) sum(is.na(x)))
PassengerId    Survived      Pclass        Name         Sex         Age
          0           0           0           0           0         177
      SibSp       Parch      Ticket        Fare       Cabin    Embarked
          0           0           0           0         687           2

sapply(training.data.raw, function(x) length(unique(x)))
PassengerId    Survived      Pclass        Name         Sex         Age
        891           2           3         891           2          89
      SibSp       Parch      Ticket        Fare       Cabin    Embarked
          7           7         681         248         148           4
```
A visual take on the missing values can be helpful: the Amelia package has a special plotting function, missmap(), that plots the data set and highlights the missing values.
The variable Cabin has too many missing values, so we will not use it.
We will also drop PassengerId, since it is only an index, and Ticket.
Using the subset() function we subset the original data set, selecting only the relevant columns.
```
data <- subset(training.data.raw, select = c(2, 3, 5, 6, 7, 8, 10, 12))
```
Now we need to account for the other missing values.
R can easily deal with them when fitting a generalized linear model by setting a parameter inside the fitting function.
However, personally I prefer to replace the NAs "by hand" whenever possible.
There are different ways to do this; a typical approach is to replace the missing values with the mean, the median or the mode of the existing values.
I will use the mean.
```
data$Age[is.na(data$Age)] <- mean(data$Age, na.rm = TRUE)
```
As far as categorical variables are concerned, using read.table() or read.csv() will by default encode them as factors.
A factor is how R deals with categorical variables.
To get a better understanding of how R handles categorical variables, we can use the contrasts() function.
This function shows us how R has dummy-coded the variables and how to interpret them in the model.
Before proceeding to the fitting process, let me remind you how important it is to clean and format the data.
This preprocessing step is often crucial for obtaining a good fit of the model and better predictive ability.
Model fitting
We split the data into two parts: a training set and a test set.
The training set will be used to fit the model, which we will then test on the test set.
```
model <- glm(Survived ~ ., family = binomial(link = 'logit'), data = train)
```
By using the summary() function we obtain the results of our model:
```
summary(model)

Call:
glm(formula = Survived ~ ., family = binomial(link = "logit"), data = train)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.6064  -0.5954  -0.4254   0.6220   2.4165

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  5.137627   0.594998   8.635  < 2e-16 ***
Pclass      -1.087156   0.151168  -7.192 6.40e-13 ***
Sexmale     -2.756819   0.212026 -13.002  < 2e-16 ***
Age         -0.037267   0.008195  -4.547 5.43e-06 ***
SibSp       -0.292920   0.114642  -2.555   0.0106 *
Parch       -0.116576   0.128127  -0.910   0.3629
Fare         0.001528   0.002353   0.649   0.5160
EmbarkedQ   -0.002656   0.400882  -0.007   0.9947
EmbarkedS   -0.318786   0.252960  -1.260   0.2076
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1065.39  on 799  degrees of freedom
Residual deviance:  709.39  on 791  degrees of freedom
AIC: 727.39

Number of Fisher Scoring iterations: 5
```
Now we can interpret the results of our logistic regression model.
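The original analysis goes on to assess the model on the held-out data. A sketch of that step is shown below; the 800/89 split of the cleaned data and the use of the ROCR package for the ROC curve are assumptions based on how such evaluations are commonly done, not necessarily the post's exact code.

```
# Hold-out split (assumed: first 800 rows for training, the remaining 89 for testing)
train <- data[1:800, ]
test  <- data[801:889, ]

# Predicted survival probabilities on the test set, converted to 0/1 with a 0.5 cutoff
fitted.results <- predict(model, newdata = test, type = "response")
fitted.results <- ifelse(fitted.results > 0.5, 1, 0)

# Classification accuracy
misClasificError <- mean(fitted.results != test$Survived)
print(paste("Accuracy", 1 - misClasificError))

# ROC curve and AUC (requires the ROCR package)
library(ROCR)
p   <- predict(model, newdata = test, type = "response")
pr  <- prediction(p, test$Survived)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)
auc <- performance(pr, measure = "auc")
auc@y.values[[1]]
```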