UA MATH571A Regression Analysis: Concepts and R Code Summary

      • Simple Linear Regression
      • Multivariate Linear Regression

Part 0 Basic R code
Tip 1: Read in data

  • Based on path of the data file
read.csv("D:/Stat PhD/taking course/summer1/ref/regression/Salary1.csv", header = TRUE, sep = ",", quote = "\"", dec = ".", fill = TRUE, comment.char = "")
  • Choosing the data file interactively through a file dialog (suggested)
    read.csv( file.choose() )
Tip 2: R code plot()

  • plot(x,y,…) or plot(y~x,…) to select variables

  • type = : "p" (points), "l" (lines), "b" (both points and lines), "c" (the lines of "b" alone), "h" (histogram-like vertical lines)

  • pch = (a number 0-25, selecting one of 26 plotting symbols)

  • lty = (input number 1-6, indicating 6 line types)

  • lwd = (input number, line width)

  • col = "" (a color name such as "red", or a color number)

  • main = "" (plot title), xlab = "", ylab = ""

  • xlim = c(,), ylim = c(,)

  • can use par( new = TRUE ) to overlay two plots in one window (use the same xlim and ylim); see the sketch after this list
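
A minimal sketch pulling these options together (toy data invented for illustration):

x <- 1:20
y <- x + rnorm(20)
plot( x, y, type='p', pch=19, col='blue',
      main='Toy scatterplot', xlab='x', ylab='y',
      xlim=c(0,20), ylim=c(-2,22) )
par( new=TRUE )   # overlay a second plot in the same window
plot( x, x, type='l', lty=2, lwd=2, col='red',
      xaxt='n', yaxt='n', xlab='', ylab='',
      xlim=c(0,20), ylim=c(-2,22) )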

Tip 3: R code lm()

  • lm(formula, weights) returns a model object (a list). Use summary() to take a glance.
  • formula: basic form y ~ model, where model combines variables with +, : and *. + adds terms, i.e. y ~ a + b regresses y on both a and b. : denotes an interaction. * denotes the full crossing, i.e. a*b means a + b + a:b. -1 (equivalently +0) removes the intercept, giving regression through the origin. See the sketch after this list.
  • If weights is omitted, OLS is used; weights = a numeric vector gives WLS.
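
A quick sketch of the formula variants (the names y, a, b, w are illustrative placeholders):

fit.main <- lm( y ~ a + b )         # main effects of a and b
fit.int  <- lm( y ~ a:b )           # interaction term only
fit.full <- lm( y ~ a*b )           # full model: a + b + a:b
fit.orig <- lm( y ~ a - 1 )         # no intercept: regression through the origin
fit.wls  <- lm( y ~ a, weights=w )  # WLS with a numeric weight vector w
summary( fit.full )                 # glance at the fitted model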
Tip 4: R code predict()

  • model object (a list); can be an lm object or a loess object.
  • newdata = data.frame(…), new values for the predictor variables in y ~ model
  • interval = "confidence" or "prediction"
  • level = (the confidence level 1 − α, e.g. 0.95)
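
A short sketch, assuming vectors X and Y in the workspace:

fit <- lm( Y ~ X )
new <- data.frame( X=c(10,20) )
predict( fit, newdata=new, interval='confidence', level=.95 )   # CI for the mean response
predict( fit, newdata=new, interval='prediction', level=.95 )   # PI for a new observation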
Simple Linear Regression

    Part 1 Inference and Prediction
Tip 1: Model Assumption $Y_i \sim_{iid} N(\beta_0+\beta_1X_i,\ \sigma^2)$
Tip 2: Define $k_i = \frac{X_i-\bar{X}}{\sum_{i=1}^n (X_i-\bar{X})^2}$. The estimators of the coefficients are
$$\hat{\beta}_1 = \sum_{i=1}^{n} k_i Y_i = \beta_1 + \sum_{i=1}^{n} k_i \epsilon_i \sim N\left(\beta_1,\ \frac{\sigma^2}{\sum_{i=1}^n(X_i-\bar{X})^2}\right)$$
$$\hat{\beta}_0 = \sum_{i=1}^{n} \left(\frac{1}{n} - k_i \bar{X}\right) Y_i = \beta_0 + \sum_{i=1}^{n} \left(\frac{1}{n} - k_i \bar{X}\right) \epsilon_i \sim N\left(\beta_0,\ \sigma^2\left(\frac{1}{n}+\frac{\bar{X}^2}{\sum_{i=1}^n (X_i-\bar{X})^2}\right)\right)$$
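
A numerical check of the $k_i$ representation against lm() (toy data):

set.seed(1)
X <- runif(30)
Y <- 2 + 3*X + rnorm(30)
k <- ( X - mean(X) ) / sum( (X - mean(X))^2 )   # the k_i weights
c( manual=sum(k*Y), lm=coef(lm(Y~X))[2] )       # both equal beta1-hat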

Tip 3: Fitted values and residuals are
$$\hat{Y}_h = \hat{\beta}_0 + \hat{\beta}_1 X_h \sim N\left(\beta_0 + \beta_1 X_h,\ \sigma^2\left(\frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\right)\right)$$
$$e_j = Y_j - \hat{Y}_j \sim N\left(0,\ \sigma^2\left(1 - \frac{1}{n} - \frac{(X_j - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\right)\right)$$

Tip 4: Prediction intervals are based on
$$Var(\hat{Y}_h - Y_h) = \sigma^2\left(1 + \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\right)$$
$$\hat{Y}_h - t\left(1-\tfrac{\alpha}{2},\ n-2\right) se\{\hat{Y}_h - Y_h\} < Y_h < \hat{Y}_h + t\left(1-\tfrac{\alpha}{2},\ n-2\right) se\{\hat{Y}_h - Y_h\}$$
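
A sketch computing the prediction interval by hand from the variance formula (X, Y as in the toy data above; Xh is an arbitrary new point):

fit <- lm( Y ~ X )
n <- length(Y)
Xh <- .5                                    # hypothetical new X value
MSE <- sum( resid(fit)^2 ) / (n-2)
se.pred <- sqrt( MSE * ( 1 + 1/n + (Xh-mean(X))^2 / sum((X-mean(X))^2) ) )
Yh.hat <- predict( fit, data.frame(X=Xh) )
Yh.hat + c(-1,1) * qt(1-.05/2, n-2) * se.pred   # matches predict(..., interval='prediction')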

Tip 5: ANOVA

| Source | SS | df | MS | F |
| --- | --- | --- | --- | --- |
| Regression | $SSR=\sum_{i=1}^n (\hat{Y}_i - \bar{Y})^2$ | 1 | $MSR = SSR/df_R$ | $F = MSR/MSE$ |
| Residuals | $SSE=\sum_{i=1}^n (Y_i - \hat{Y}_i)^2$ | n-2 | $MSE = SSE/df_E$ | |
| Total | $SST=\sum_{i=1}^n (Y_i - \bar{Y})^2$ | n-1 | $MST = SST/df_T$ | |

$$SSE/\sigma^2 \sim \chi^2_{n-2}, \qquad F \sim F(1,\ n-2)$$
$$H_0: \beta_1 = 0 \qquad H_a: \beta_1 \ne 0$$
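
In R the whole table comes from anova() on the lm fit:

fit <- lm( Y ~ X )
anova( fit )                # SSR, SSE, df, MS, F, and the p-value
summary( fit )$fstatistic   # the same overall F statistic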

Part 2 Residual Plots and Diagnostics
Tip 1: Residuals against the observation index: checks for serial correlation

Tip 2: Residuals against the independent variable: checks for a missing higher-order term

Tip 3: Residuals against fitted values: checks for heteroscedasticity

Tip 4: Residuals against a potential independent variable: checks for an important omitted variable

Tip 5: Normal Probability Plot: checks whether normality holds
R code qqnorm() and qqline() create the QQ plot.
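
A minimal sketch, assuming a fitted lm object fit:

qqnorm( resid(fit) )   # normal probability plot of the residuals
qqline( resid(fit) )   # reference line through the quartiles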

Tip 6: Shapiro-Wilk test
R code shapiro.test() implements the Shapiro-Wilk test, e.g. shapiro.test(resid(fit)). Do not reject normality when the p-value is large.

Tip 7: Brown-Forsythe test
R code leveneTest() implements the Brown-Forsythe test for heteroscedasticity across groups. Do not reject homogeneity of variance when the p-value is large. Requires R package car. Example:

ei = resid( Ex3.lm )                  # residuals from the fitted model
G <- (X < 80)[order(X)]               # split cases into two groups by X
group <- as.factor(G)
BF.htest = leveneTest( ei[order(X)], group )

Tip 8: Lack-of-fit F test
Fit the reduced model and the full model and use R code anova(Reduced, Full) to perform the lack-of-fit F test; retain the reduced model when the p-value is large, as sketched below.
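
A sketch of the classical construction, where the full model fits one mean per distinct X level (this requires replicate observations at some X values):

Reduced <- lm( Y ~ X )           # straight-line model
Full <- lm( Y ~ factor(X) )      # saturated model: one mean per X level
anova( Reduced, Full )           # lack-of-fit F test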

    Part 3 Remedial Methods: Box-Cox, Family-wise Adjustment and Nonparametrics
Tip 1: Box-Cox transformation
If linearity or normality fails, try a Box-Cox transformation of the response. Example:

require( MASS )
Ex3.bc = boxcox( Ex3.lm, lambda=seq(-1, 1, 0.1), interp=F )   # profile log-likelihood over lambda
cbind( Ex3.bc$x, Ex3.bc$y )   # lambda grid and the corresponding log-likelihoods
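
The λ maximizing the profile log-likelihood can be read off directly:

lambda.hat = Ex3.bc$x[ which.max(Ex3.bc$y) ]   # lambda with the largest log-likelihood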

Tip 2: Bonferroni Adjustment
Familywise alpha ≤ m × pointwise alpha for m simultaneous estimators, so each interval is built at pointwise level α/m. For statistic $T_i$, the familywise CI is
$$B = t\left(1-\frac{\alpha}{2m},\ n-2\right), \qquad \left[T_i - B\,se(T_i),\ T_i + B\,se(T_i)\right]$$

Tip 3: Working-Hotelling-Scheffé Bound
For statistic $T_i$, the familywise CI is
$$W = \sqrt{m\,F(1-\alpha;\ m,\ n-2)}, \qquad \left[T_i - W\,se(T_i),\ T_i + W\,se(T_i)\right]$$

Here $\alpha$ is the familywise alpha. If m is small, use Bonferroni; otherwise use WHS. In practice one can take whichever multiplier is smaller; see the comparison below.
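
A quick numerical comparison of the two multipliers (hypothetical m = 4 intervals, n = 25 observations):

m <- 4; n <- 25; alpha <- .05
B <- qt( 1 - alpha/(2*m), n-2 )        # Bonferroni multiplier
W <- sqrt( m * qf(1-alpha, m, n-2) )   # WHS multiplier
c( Bonferroni=B, WHS=W )               # use whichever is smaller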

Tip 4: Heteroscedasticity and WLS. If homogeneity of variance is rejected, use WLS, as sketched below.
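
A minimal two-step WLS sketch, assuming vectors X and Y; the weights here come from modeling the absolute residuals, one common recipe:

ols.lm <- lm( Y ~ X )                    # first-stage OLS fit
sd.lm <- lm( abs(resid(ols.lm)) ~ X )    # model |e_i| to estimate the error sd
w <- 1 / fitted(sd.lm)^2                 # weights proportional to 1/variance
wls.lm <- lm( Y ~ X, weights=w )         # second-stage WLS fit
summary( wls.lm )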

Tip 5: Nonparametric Regression. If normality is rejected, use nonparametric regression. Use R code loess() to fit and predict() to get the nonparametric curve. Example:

baseballSLR.lm <- lm( Y ~ X )
absresid = abs( resid(baseballSLR.lm) )   # absolute residuals
Yhat = fitted( baseballSLR.lm )
baseball.lo = loess( absresid ~ Yhat, span=0.5, degree=2,
                     family='symmetric' )   # robust local quadratic fit
Ysmooth = predict( baseball.lo,
                   data.frame(Yhat = seq(min(Yhat), max(Yhat), .001)) )
plot( absresid ~ Yhat, xlim=c(.25,.29), ylim=c(0,.11) )
par( new=TRUE )   # overlay the smoothed curve on the same axes
plot( Ysmooth ~ seq(min(Yhat), max(Yhat), .001), type='l', lwd=2,
      xaxt='n', yaxt='n', xlab='', ylab='', xlim=c(.25,.29), ylim=c(0,.11) )

    Part 4 Correlation Analysis
Tip 1: PPMCC (Pearson product-moment correlation coefficient), which measures linear association.
Tip 2: Spearman's rank correlation test, a rank-based alternative that is robust to nonnormality.
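
Both are available through base R's cor.test(), assuming vectors X and Y:

cor.test( X, Y, method='pearson' )    # PPMCC with the t test of rho = 0
cor.test( X, Y, method='spearman' )   # Spearman's rank correlation test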

    Part 5 Inverse Prediction and Bootstraps
Tip 1: Inverse prediction
$$\hat{X}_h = \frac{Y_h - \hat{\beta}_0}{\hat{\beta}_1}, \qquad \frac{\hat{X}_h - X_h}{se\{predX\}} \sim t(n-2)$$
$$s^2\{predX\} = \frac{MSE}{\hat{\beta}_1^2}\left[1+\frac{1}{n}+\frac{(\hat{X}_h - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}\right]$$

Tip 2: Bootstrapped confidence interval: above, $\hat\beta_1$ is treated as a constant, which is not really true. A better way to build the CI is the bootstrap. Example:

theta_hat <- (50-finalpr2a.lm$coefficients[1])/finalpr2a.lm$coefficients[2]   # point estimate of X_h at Y_h = 50
ei <- resid(finalpr2a.lm)
Yhat <- fitted(finalpr2a.lm)
b1origin <- finalpr2a.lm$coefficients[2]
b0origin <- finalpr2a.lm$coefficients[1]
n <- length(Y)
B <- 5000   # number of bootstrap replicates
b0 <- numeric(B)
b1 <- numeric(B)
set.seed(2019)
for(b in 1:B){
  esti <- sample(ei, n, replace=TRUE)   # resample residuals
  Yest <- Yhat + esti                   # rebuild pseudo-responses
  b0[b] <- coef(lm(Yest~X1))[1]
  b1[b] <- coef(lm(Yest~X1))[2]
}
theta <- (50-b0)/b1   # bootstrap replicates of X_h
summary(theta)
theta <- sort(theta)
thetaL <- theta[126]    # ~2.5% percentile of 5000 replicates
thetaU <- theta[4875]   # ~97.5% percentile
c(thetaL, thetaU)       # 95% bootstrap percentile CI
hist(theta, probability=TRUE)
abline(v=thetaL, lty=2, lwd=2)
abline(v=thetaU, lty=2, lwd=2)

    Multivariate Linear Regression

    Part 1 Inference and Prediction
Tip 1: Model Assumption $Y \sim N(X\beta,\ \sigma^2 I_n)$
Tip 2: Estimator of the coefficients is $\hat{\beta} = (X^TX)^{-1}X^TY$
Tip 3: Fitted values and residuals are

  • Hat matrix $H = X(X^TX)^{-1}X^T$
  • Fitted value $\hat{Y} = HY$
  • Residual $e = (I_n - H)Y$
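
A quick sketch verifying the matrix formulas against lm(), assuming predictor vectors X1, X2 and response Y:

X <- cbind( 1, X1, X2 )                       # design matrix with intercept column
beta.hat <- solve( t(X) %*% X, t(X) %*% Y )   # (X'X)^{-1} X'Y
H <- X %*% solve( t(X) %*% X ) %*% t(X)       # hat matrix
Yhat <- H %*% Y                               # fitted values
e <- Y - Yhat                                 # residuals
cbind( beta.hat, coef(lm(Y ~ X1 + X2)) )      # the two columns should match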
Tip 4: ANOVA

| Source | SS | df | MS |
| --- | --- | --- | --- |
| Regression | $SSR=\hat{\beta}^TX^TY - \frac{1}{n}Y^TJY$ | p-1 | $MSR = SSR/df_R$ |
| Residuals | $SSE=Y^TY - \hat{\beta}^TX^TY$ | n-p | $MSE = SSE/df_E$ |
| Total | $SST=Y^T(I_n-\frac{J}{n})Y$ | n-1 | $MST = SST/df_T$ |

Tip 5: Sequential ANOVA. When given a sequence of model objects, anova() tests the models against one another in the order specified, as sketched below.
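
A sketch with nested fits on predictors X1, X2, X3:

fit1 <- lm( Y ~ X1 )
fit2 <- lm( Y ~ X1 + X2 )
fit3 <- lm( Y ~ X1 + X2 + X3 )
anova( fit1, fit2, fit3 )   # tests each model against the next in the order given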

    Part 2 Variable Selection
Tip 1: Adjusted coefficient of determination $R^2_{adj}$. Use R code leaps(x, y, method = "adjr2")

library(leaps)
CH09PR11.r2 <- leaps( x=cbind(X1,X2,X3,X4), y=Y, method='adjr2' )
p = seq( min(CH09PR11.r2$size), max(CH09PR11.r2$size) )   # candidate model sizes
plot( CH09PR11.r2$adjr2 ~ CH09PR11.r2$size, ylab=expression(R[adj]^2), xlab='p' )
Rp2 = by( data=CH09PR11.r2$adjr2, INDICES=factor(CH09PR11.r2$size), FUN=max )   # best adjusted R^2 per size
lines( Rp2 ~ p )
CH09PR11.r2[["adjr2"]]
CH09PR11.r2[["which"]]   # which predictors enter each candidate model

Tip 2: Mallows' $C_p$. Use R code leaps(x, y, method = "Cp")

library(leaps)
CH09PR11.cp <- leaps( x=cbind(X1,X2,X3,X4), y=Y, method='Cp' )
p = seq( min(CH09PR11.cp$size), max(CH09PR11.cp$size) )
plot( CH09PR11.cp$Cp ~ CH09PR11.cp$size, ylab=expression(C[p]), xlab='p' )
Rp2 = by( data=CH09PR11.cp$Cp, INDICES=factor(CH09PR11.cp$size), FUN=min )   # best (smallest) Cp per size
lines( Rp2 ~ p )
CH09PR11.cp[["Cp"]]
CH09PR11.cp[["which"]]

Tip 3: AIC and BIC. Use R code step() with an lm object; by default step() selects by AIC (penalty k = 2).

    CH09PR10.lm <- lm( Y ~ X1+X2+X3+X4 )
    CH09PR10.step <- step(CH09PR10.lm)

Tip 4: $PRESS_p$. Use PRESS() from R package MPV to get PRESS.

library(MPV)     # provides PRESS()
library(leaps)
CH09PR23.r2 <- leaps( x=cbind(X1,X2,X3), y=Y, method='adjr2' )
p = seq( min(CH09PR23.r2$size), max(CH09PR23.r2$size) )
CH09PR23.r2[["which"]]   # predictor subsets considered by leaps
PRESSp = numeric( length(CH09PR23.r2$size) )
SSEp = numeric( length(CH09PR23.r2$size) )
# X1sq, X2sq, X1X2, X2X3 below are precomputed squared and interaction columns
PRESSp[1] = PRESS( lm( Y ~ X1 ) )
PRESSp[2] = PRESS( lm( Y ~ X2 ) )
PRESSp[3] = PRESS( lm( Y ~ X3 ) )
PRESSp[4] = PRESS( lm( Y ~ X1+X2 ) )
PRESSp[5] = PRESS( lm( Y ~ X1+X3 ) )
PRESSp[6] = PRESS( lm( Y ~ X2+X3 ) )
PRESSp[7] = PRESS( lm( Y ~ X1+X2+X3 ) )
PRESSp[8] = PRESS( lm( Y ~ X1+X2+X1sq+X2sq ) )
PRESSp[9] = PRESS( lm( Y ~ X1+X2+X1X2 ) )
PRESSp[10] = PRESS( lm( Y ~ X1+X2+X1sq+X2X3 ) )
# SSEp: residual sums of squares read off each anova table
SSEp[1] = anova(lm(Y~X1))[2,2]
SSEp[2] = anova(lm(Y~X2))[2,2]
SSEp[3] = anova(lm(Y~X3))[2,2]
SSEp[4] = anova(lm(Y~X1+X2))[3,2]
SSEp[5] = anova(lm(Y~X1+X3))[3,2]
SSEp[6] = anova(lm(Y~X2+X3))[3,2]
SSEp[7] = anova(lm(Y~X1+X2+X3))[4,2]
SSEp[8] = anova(lm(Y~X1+X2+X1sq+X2sq))[5,2]
SSEp[9] = anova(lm(Y~X1+X2+X1X2))[4,2]
SSEp[10] = anova(lm(Y~X1+X2+X1sq+X2X3))[5,2]
plot( PRESSp[1:7] ~ CH09PR23.r2$size, ylab=expression(PRESS[p]), xlab='p' )
minPRESSp = by( data=PRESSp[1:7], INDICES=factor(CH09PR23.r2$size), FUN=min )
lines( minPRESSp ~ p )

    Tip 5: Forward Stepwise Regression

step( lm(Y~1), scope=Y~X1+X2+X3+X4+X51999+X52000+X52001, direction='forward' )   # start from the intercept-only model and add terms by AIC

Tip 6: Backward Elimination. Use step() with direction='backward', or fastbw() from R package rms. k is the multiple of the number of degrees of freedom used for the penalty: k = 2 for AIC and k = log(n) for BIC (n = length(Y)).

library(rms)
CH09PR20.lm <- lm( Y~X1+X2+X3+X4+X51999+X52000+X52001 )
step( CH09PR20.lm, direction='backward', k=2 )   # AIC-based backward elimination
CH09PR20.ols <- ols( Y~X1+X2+X3+X4+X51999+X52000+X52001 )   # rms::ols fit required by fastbw()
fastbw( fit=CH09PR20.ols, rule='p', type='individual', sls=.10 )   # drop terms by p-value; stay level .10

Part 3 Outlier Detection
Tip 1: Studentized Deleted Residuals: internally studentized residuals come from the full fit, while externally studentized (deleted) residuals come from leave-one-out refits. R code rstudent() gives the externally studentized residuals.

tcrit <- qt( 1-.5*(.05/52), 52-4-1 )   # Bonferroni-adjusted critical value (n = 52 cases, 4 parameters)
CH10PR10.lm <- lm( Y ~ X )
plot( rstudent(CH10PR10.lm) ~ fitted(CH10PR10.lm), ylim=c(-4,4) )
abline( h=tcrit, lty=2 )    # cases beyond these bounds are flagged as outliers
abline( h=-tcrit, lty=2 )
abline( h=0 )

Tip 2: Hat Matrix Leverage Values. Use hatvalues() to obtain leverage values. The 2p/n rule of thumb below can fail when the number of predictors is large.

n <- length(Y)   # sample size
p <- 3           # number of model parameters
hii = hatvalues( CH10PR10.lm )
hii > 2*p/n      # empirical rule: leverage above 2p/n flags a high-leverage case

Part 4 Influence Analysis
Tip 1: DFFITS. The influence of the i-th case on the i-th fitted value. Use R code dffits().
Tip 2: Cook's Distance. The influence of the i-th case on all fitted values. Use R code cooks.distance(). Input is a model object.
Tip 3: DFBETAS. The influence of the i-th case on the k-th coefficient. Use R code dfbetas() or influence.measures().

DFF <- dffits( CH10PR10.lm )
abs(DFF) > 2*sqrt(p/n)   # empirical cutoff for influential cases
DFB <- influence.measures( CH10PR10.lm )   # includes DFBETAS, DFFITS, and Cook's distance
DFB[["infmat"]][c(16,22,43,48,10,32,38,40),]
plot( cooks.distance(CH10PR10.lm), type='o', pch=19 )
ei = resid( CH10PR10.lm )
yhat = fitted( CH10PR10.lm )
radius = sqrt( cooks.distance(CH10PR10.lm)/pi )   # bubble area proportional to Cook's distance
plot( ei ~ yhat, pch='' )
abline( h=0 )
symbols( yhat, ei, circles=radius, inches=.15, bg='black', fg='white', add=T )

Part 5 Multicollinearity and Ridge Regression
    Tip 1: Variance Inflation Factor

  • Use pairs() and cor() to get pairwise scatter plots and the correlation matrix.
  • require( car )
    vif( lm(Y ~ X1+X2+X3) )

  • If max(VIF) > 10 and mean(VIF) > 6, multicollinearity is suggested.
Tip 2: Ridge Regression. Use R package genridge.

library(genridge)
lambda.grid = seq( from=.01, to=10, by=.01 )   # grid of ridge penalties
traceplot( ridge(U~Z1+Z2+Z3, lambda=lambda.grid) )   # coefficient paths across the penalty grid
