UA MATH571A Regression Analysis: Concepts and R Code Summary

      • Simple Linear Regression
      • Multivariate Linear Regression

Part 0 Basic R code
Tip 1: Read in data

  • Based on path of the data file
read.csv("D:/Stat PhD/taking course/summer1/ref/regression/Salary1.csv", header = TRUE, sep = ",", quote = "\"", dec = ".", fill = TRUE, comment.char = "")
  • Choosing the data file interactively through a file dialog (suggested)
    read.csv( file.choose() )
Tip 2: R code plot()

  • plot(x,y,…) or plot(y~x,…) to select variables

  • type = : "p" (points), "l" (lines), "b" (both points and lines), "c" (the lines of "b" alone), "h" (histogram-like vertical lines)

  • pch = (a number 0-25, selecting one of 26 plotting symbols)

  • lty = (input number 1-6, indicating 6 line types)

  • lwd = (input number, line width)

  • col = "" (a color name such as "red", or a color number)

  • main = "" (plot title), xlab = "", ylab = ""

  • xlim = c(,), ylim = c(,)

  • can use par( new = TRUE ) to overlay two plots in one window (use the same xlim and ylim); see the sketch after this list
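
A minimal sketch pulling these options together (toy data invented for illustration):

x <- 1:20
y <- x + rnorm(20)
plot( x, y, type='p', pch=19, col='blue',
      main='Toy scatterplot', xlab='x', ylab='y',
      xlim=c(0,20), ylim=c(-2,22) )
par( new=TRUE )   # overlay a second plot in the same window
plot( x, x, type='l', lty=2, lwd=2, col='red',
      xaxt='n', yaxt='n', xlab='', ylab='',
      xlim=c(0,20), ylim=c(-2,22) )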

Tip 3: R code lm()

  • lm(formula, weights) returns a model object (a list). Use summary() to take a glance.
  • formula: basic form y ~ model, where model combines variables with +, : and *. + adds terms, i.e. y ~ a + b regresses y on both a and b. : denotes an interaction. * denotes the full crossing, i.e. a*b means a + b + a:b. -1 (equivalently +0) removes the intercept, giving regression through the origin. See the sketch after this list.
  • If weights is omitted, OLS is used; weights = a numeric vector gives WLS.
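
A quick sketch of the formula variants (the names y, a, b, w are illustrative placeholders):

fit.main <- lm( y ~ a + b )         # main effects of a and b
fit.int  <- lm( y ~ a:b )           # interaction term only
fit.full <- lm( y ~ a*b )           # full model: a + b + a:b
fit.orig <- lm( y ~ a - 1 )         # no intercept: regression through the origin
fit.wls  <- lm( y ~ a, weights=w )  # WLS with a numeric weight vector w
summary( fit.full )                 # glance at the fitted model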
Tip 4: R code predict()

  • model object (a list); can be an lm object or a loess object.
  • newdata = data.frame(…), new values for the predictor variables in y ~ model
  • interval = "confidence" or "prediction"
  • level = (the confidence level 1 − α, e.g. 0.95)
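
A short sketch, assuming vectors X and Y in the workspace:

fit <- lm( Y ~ X )
new <- data.frame( X=c(10,20) )
predict( fit, newdata=new, interval='confidence', level=.95 )   # CI for the mean response
predict( fit, newdata=new, interval='prediction', level=.95 )   # PI for a new observation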
Simple Linear Regression

    Part 1 Inference and Prediction
Tip 1: Model Assumption $Y_i \sim_{iid} N(\beta_0+\beta_1X_i,\ \sigma^2)$
Tip 2: Define $k_i = \frac{X_i-\bar{X}}{\sum_{i=1}^n (X_i-\bar{X})^2}$. The estimators of the coefficients are
$$\hat{\beta}_1 = \sum_{i=1}^{n} k_i Y_i = \beta_1 + \sum_{i=1}^{n} k_i \epsilon_i \sim N\left(\beta_1,\ \frac{\sigma^2}{\sum_{i=1}^n(X_i-\bar{X})^2}\right)$$
$$\hat{\beta}_0 = \sum_{i=1}^{n} \left(\frac{1}{n} - k_i \bar{X}\right) Y_i = \beta_0 + \sum_{i=1}^{n} \left(\frac{1}{n} - k_i \bar{X}\right) \epsilon_i \sim N\left(\beta_0,\ \sigma^2\left(\frac{1}{n}+\frac{\bar{X}^2}{\sum_{i=1}^n (X_i-\bar{X})^2}\right)\right)$$
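
A numerical check of the $k_i$ representation against lm() (toy data):

set.seed(1)
X <- runif(30)
Y <- 2 + 3*X + rnorm(30)
k <- ( X - mean(X) ) / sum( (X - mean(X))^2 )   # the k_i weights
c( manual=sum(k*Y), lm=coef(lm(Y~X))[2] )       # both equal beta1-hat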

Tip 3: Fitted values and residuals are
$$\hat{Y}_h = \hat{\beta}_0 + \hat{\beta}_1 X_h \sim N\left(\beta_0 + \beta_1 X_h,\ \sigma^2\left(\frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\right)\right)$$
$$e_j = Y_j - \hat{Y}_j \sim N\left(0,\ \sigma^2\left(1 - \frac{1}{n} - \frac{(X_j - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\right)\right)$$

Tip 4: Prediction intervals are based on
$$Var(\hat{Y}_h - Y_h) = \sigma^2\left(1 + \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\right)$$
$$\hat{Y}_h - t\left(1-\tfrac{\alpha}{2},\ n-2\right) se\{\hat{Y}_h - Y_h\} < Y_h < \hat{Y}_h + t\left(1-\tfrac{\alpha}{2},\ n-2\right) se\{\hat{Y}_h - Y_h\}$$
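
A sketch computing the prediction interval by hand from the variance formula (X, Y as in the toy data above; Xh is an arbitrary new point):

fit <- lm( Y ~ X )
n <- length(Y)
Xh <- .5                                    # hypothetical new X value
MSE <- sum( resid(fit)^2 ) / (n-2)
se.pred <- sqrt( MSE * ( 1 + 1/n + (Xh-mean(X))^2 / sum((X-mean(X))^2) ) )
Yh.hat <- predict( fit, data.frame(X=Xh) )
Yh.hat + c(-1,1) * qt(1-.05/2, n-2) * se.pred   # matches predict(..., interval='prediction')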

Tip 5: ANOVA

| Source | SS | df | MS | F |
| --- | --- | --- | --- | --- |
| Regression | $SSR=\sum_{i=1}^n (\hat{Y}_i - \bar{Y})^2$ | 1 | $MSR = SSR/df_R$ | $F = MSR/MSE$ |
| Residuals | $SSE=\sum_{i=1}^n (Y_i - \hat{Y}_i)^2$ | n-2 | $MSE = SSE/df_E$ | |
| Total | $SST=\sum_{i=1}^n (Y_i - \bar{Y})^2$ | n-1 | $MST = SST/df_T$ | |

$$SSE/\sigma^2 \sim \chi^2_{n-2}, \qquad F \sim F(1,\ n-2)$$
$$H_0: \beta_1 = 0 \qquad H_a: \beta_1 \ne 0$$
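
In R the whole table comes from anova() on the lm fit:

fit <- lm( Y ~ X )
anova( fit )                # SSR, SSE, df, MS, F, and the p-value
summary( fit )$fstatistic   # the same overall F statistic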

Part 2 Residual Plots and Diagnostics
Tip 1: Residuals against the observation index: checks for serial correlation

Tip 2: Residuals against the independent variable: checks for a missing higher-order term

Tip 3: Residuals against fitted values: checks for heteroscedasticity

Tip 4: Residuals against a potential independent variable: checks for an important omitted variable

Tip 5: Normal Probability Plot: checks whether normality holds
R code qqnorm() and qqline() create the QQ plot.
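
A minimal sketch, assuming a fitted lm object fit:

qqnorm( resid(fit) )   # normal probability plot of the residuals
qqline( resid(fit) )   # reference line through the quartiles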

Tip 6: Shapiro-Wilk test
R code shapiro.test() implements the Shapiro-Wilk test, e.g. shapiro.test(resid(fit)). Do not reject normality when the p-value is large.

Tip 7: Brown-Forsythe test
R code leveneTest() implements the Brown-Forsythe test for heteroscedasticity across groups. Do not reject homogeneity of variance when the p-value is large. Requires R package car. Example:

ei = resid( Ex3.lm )                  # residuals from the fitted model
G <- (X < 80)[order(X)]               # split cases into two groups by X
group <- as.factor(G)
BF.htest = leveneTest( ei[order(X)], group )

Tip 8: Lack-of-fit F test
Fit the reduced model and the full model and use R code anova(Reduced, Full) to perform the lack-of-fit F test; retain the reduced model when the p-value is large, as sketched below.
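
A sketch of the classical construction, where the full model fits one mean per distinct X level (this requires replicate observations at some X values):

Reduced <- lm( Y ~ X )           # straight-line model
Full <- lm( Y ~ factor(X) )      # saturated model: one mean per X level
anova( Reduced, Full )           # lack-of-fit F test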

    Part 3 Remedial Methods: Box-Cox, Family-wise Adjustment and Nonparametrics
Tip 1: Box-Cox transformation
If linearity or normality fails, try a Box-Cox transformation of the response. Example:

require( MASS )
Ex3.bc = boxcox( Ex3.lm, lambda=seq(-1, 1, 0.1), interp=F )   # profile log-likelihood over lambda
cbind( Ex3.bc$x, Ex3.bc$y )   # lambda grid and the corresponding log-likelihoods
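
The λ maximizing the profile log-likelihood can be read off directly:

lambda.hat = Ex3.bc$x[ which.max(Ex3.bc$y) ]   # lambda with the largest log-likelihood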

Tip 2: Bonferroni Adjustment
Familywise alpha ≤ m × pointwise alpha for m simultaneous estimators, so each interval is built at pointwise level α/m. For statistic $T_i$, the familywise CI is
$$B = t\left(1-\frac{\alpha}{2m},\ n-2\right), \qquad \left[T_i - B\,se(T_i),\ T_i + B\,se(T_i)\right]$$

Tip 3: Working-Hotelling-Scheffé Bound
For statistic $T_i$, the familywise CI is
$$W = \sqrt{m\,F(1-\alpha;\ m,\ n-2)}, \qquad \left[T_i - W\,se(T_i),\ T_i + W\,se(T_i)\right]$$

Here $\alpha$ is the familywise alpha. If m is small, use Bonferroni; otherwise use WHS. In practice one can take whichever multiplier is smaller; see the comparison below.
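
A quick numerical comparison of the two multipliers (hypothetical m = 4 intervals, n = 25 observations):

m <- 4; n <- 25; alpha <- .05
B <- qt( 1 - alpha/(2*m), n-2 )        # Bonferroni multiplier
W <- sqrt( m * qf(1-alpha, m, n-2) )   # WHS multiplier
c( Bonferroni=B, WHS=W )               # use whichever is smaller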

Tip 4: Heteroscedasticity and WLS. If homogeneity of variance is rejected, use WLS, as sketched below.
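
A minimal two-step WLS sketch, assuming vectors X and Y; the weights here come from modeling the absolute residuals, one common recipe:

ols.lm <- lm( Y ~ X )                    # first-stage OLS fit
sd.lm <- lm( abs(resid(ols.lm)) ~ X )    # model |e_i| to estimate the error sd
w <- 1 / fitted(sd.lm)^2                 # weights proportional to 1/variance
wls.lm <- lm( Y ~ X, weights=w )         # second-stage WLS fit
summary( wls.lm )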

Tip 5: Nonparametric Regression. If normality is rejected, use nonparametric regression. Use R code loess() to fit and predict() to get the nonparametric curve. Example:

baseballSLR.lm <- lm( Y ~ X )
absresid = abs( resid(baseballSLR.lm) )   # absolute residuals
Yhat = fitted( baseballSLR.lm )
baseball.lo = loess( absresid ~ Yhat, span=0.5, degree=2,
                     family='symmetric' )   # robust local quadratic fit
Ysmooth = predict( baseball.lo,
                   data.frame(Yhat = seq(min(Yhat), max(Yhat), .001)) )
plot( absresid ~ Yhat, xlim=c(.25,.29), ylim=c(0,.11) )
par( new=TRUE )   # overlay the smoothed curve on the same axes
plot( Ysmooth ~ seq(min(Yhat), max(Yhat), .001), type='l', lwd=2,
      xaxt='n', yaxt='n', xlab='', ylab='', xlim=c(.25,.29), ylim=c(0,.11) )

    Part 4 Correlation Analysis
Tip 1: PPMCC (Pearson product-moment correlation coefficient), which measures linear association.
Tip 2: Spearman's rank correlation test, a rank-based alternative that is robust to nonnormality.
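
Both are available through base R's cor.test(), assuming vectors X and Y:

cor.test( X, Y, method='pearson' )    # PPMCC with the t test of rho = 0
cor.test( X, Y, method='spearman' )   # Spearman's rank correlation test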

    Part 5 Inverse Prediction and Bootstraps
Tip 1: Inverse prediction
$$\hat{X}_h = \frac{Y_h - \hat{\beta}_0}{\hat{\beta}_1}, \qquad \frac{\hat{X}_h - X_h}{se\{predX\}} \sim t(n-2)$$
$$s^2\{predX\} = \frac{MSE}{\hat{\beta}_1^2}\left[1+\frac{1}{n}+\frac{(\hat{X}_h - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}\right]$$

Tip 2: Bootstrapped confidence interval: above, $\hat\beta_1$ is treated as a constant, which is not really true. A better way to build the CI is the bootstrap. Example:

theta_hat <- (50-finalpr2a.lm$coefficients[1])/finalpr2a.lm$coefficients[2]   # point estimate of X_h at Y_h = 50
ei <- resid(finalpr2a.lm)
Yhat <- fitted(finalpr2a.lm)
b1origin <- finalpr2a.lm$coefficients[2]
b0origin <- finalpr2a.lm$coefficients[1]
n <- length(Y)
B <- 5000   # number of bootstrap replicates
b0 <- numeric(B)
b1 <- numeric(B)
set.seed(2019)
for(b in 1:B){
  esti <- sample(ei, n, replace=TRUE)   # resample residuals
  Yest <- Yhat + esti                   # rebuild pseudo-responses
  b0[b] <- coef(lm(Yest~X1))[1]
  b1[b] <- coef(lm(Yest~X1))[2]
}
theta <- (50-b0)/b1   # bootstrap replicates of X_h
summary(theta)
theta <- sort(theta)
thetaL <- theta[126]    # ~2.5% percentile of 5000 replicates
thetaU <- theta[4875]   # ~97.5% percentile
c(thetaL, thetaU)       # 95% bootstrap percentile CI
hist(theta, probability=TRUE)
abline(v=thetaL, lty=2, lwd=2)
abline(v=thetaU, lty=2, lwd=2)

    Multivariate Linear Regression

    Part 1 Inference and Prediction
Tip 1: Model Assumption $Y \sim N(X\beta,\ \sigma^2 I_n)$
Tip 2: Estimator of the coefficients is $\hat{\beta} = (X^TX)^{-1}X^TY$
Tip 3: Fitted values and residuals are

  • Hat matrix $H = X(X^TX)^{-1}X^T$
  • Fitted value $\hat{Y} = HY$
  • Residual $e = (I_n - H)Y$
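
A quick sketch verifying the matrix formulas against lm(), assuming predictor vectors X1, X2 and response Y:

X <- cbind( 1, X1, X2 )                       # design matrix with intercept column
beta.hat <- solve( t(X) %*% X, t(X) %*% Y )   # (X'X)^{-1} X'Y
H <- X %*% solve( t(X) %*% X ) %*% t(X)       # hat matrix
Yhat <- H %*% Y                               # fitted values
e <- Y - Yhat                                 # residuals
cbind( beta.hat, coef(lm(Y ~ X1 + X2)) )      # the two columns should match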
Tip 4: ANOVA

| Source | SS | df | MS |
| --- | --- | --- | --- |
| Regression | $SSR=\hat{\beta}^TX^TY - \frac{1}{n}Y^TJY$ | p-1 | $MSR = SSR/df_R$ |
| Residuals | $SSE=Y^TY - \hat{\beta}^TX^TY$ | n-p | $MSE = SSE/df_E$ |
| Total | $SST=Y^T(I_n-\frac{J}{n})Y$ | n-1 | $MST = SST/df_T$ |

Tip 5: Sequential ANOVA. When given a sequence of model objects, anova() tests the models against one another in the order specified, as sketched below.
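
A sketch with nested fits on predictors X1, X2, X3:

fit1 <- lm( Y ~ X1 )
fit2 <- lm( Y ~ X1 + X2 )
fit3 <- lm( Y ~ X1 + X2 + X3 )
anova( fit1, fit2, fit3 )   # tests each model against the next in the order given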

    Part 2 Variable Selection
Tip 1: Adjusted coefficient of determination $R^2_{adj}$. Use R code leaps(x, y, method = "adjr2")

library(leaps)
CH09PR11.r2 <- leaps( x=cbind(X1,X2,X3,X4), y=Y, method='adjr2' )
p = seq( min(CH09PR11.r2$size), max(CH09PR11.r2$size) )   # candidate model sizes
plot( CH09PR11.r2$adjr2 ~ CH09PR11.r2$size, ylab=expression(R[adj]^2), xlab='p' )
Rp2 = by( data=CH09PR11.r2$adjr2, INDICES=factor(CH09PR11.r2$size), FUN=max )   # best adjusted R^2 per size
lines( Rp2 ~ p )
CH09PR11.r2[["adjr2"]]
CH09PR11.r2[["which"]]   # which predictors enter each candidate model

Tip 2: Mallows' $C_p$. Use R code leaps(x, y, method = "Cp")

library(leaps)
CH09PR11.cp <- leaps( x=cbind(X1,X2,X3,X4), y=Y, method='Cp' )
p = seq( min(CH09PR11.cp$size), max(CH09PR11.cp$size) )
plot( CH09PR11.cp$Cp ~ CH09PR11.cp$size, ylab=expression(C[p]), xlab='p' )
Rp2 = by( data=CH09PR11.cp$Cp, INDICES=factor(CH09PR11.cp$size), FUN=min )   # best (smallest) Cp per size
lines( Rp2 ~ p )
CH09PR11.cp[["Cp"]]
CH09PR11.cp[["which"]]

Tip 3: AIC and BIC. Use R code step() with an lm object; by default step() selects by AIC (penalty k = 2).

    CH09PR10.lm <- lm( Y ~ X1+X2+X3+X4 )
    CH09PR10.step <- step(CH09PR10.lm)

Tip 4: $PRESS_p$. Use PRESS() from R package MPV to get PRESS.

library(MPV)     # provides PRESS()
library(leaps)
CH09PR23.r2 <- leaps( x=cbind(X1,X2,X3), y=Y, method='adjr2' )
p = seq( min(CH09PR23.r2$size), max(CH09PR23.r2$size) )
CH09PR23.r2[["which"]]   # predictor subsets considered by leaps
PRESSp = numeric( length(CH09PR23.r2$size) )
SSEp = numeric( length(CH09PR23.r2$size) )
# X1sq, X2sq, X1X2, X2X3 below are precomputed squared and interaction columns
PRESSp[1] = PRESS( lm( Y ~ X1 ) )
PRESSp[2] = PRESS( lm( Y ~ X2 ) )
PRESSp[3] = PRESS( lm( Y ~ X3 ) )
PRESSp[4] = PRESS( lm( Y ~ X1+X2 ) )
PRESSp[5] = PRESS( lm( Y ~ X1+X3 ) )
PRESSp[6] = PRESS( lm( Y ~ X2+X3 ) )
PRESSp[7] = PRESS( lm( Y ~ X1+X2+X3 ) )
PRESSp[8] = PRESS( lm( Y ~ X1+X2+X1sq+X2sq ) )
PRESSp[9] = PRESS( lm( Y ~ X1+X2+X1X2 ) )
PRESSp[10] = PRESS( lm( Y ~ X1+X2+X1sq+X2X3 ) )
# SSEp: residual sums of squares read off each anova table
SSEp[1] = anova(lm(Y~X1))[2,2]
SSEp[2] = anova(lm(Y~X2))[2,2]
SSEp[3] = anova(lm(Y~X3))[2,2]
SSEp[4] = anova(lm(Y~X1+X2))[3,2]
SSEp[5] = anova(lm(Y~X1+X3))[3,2]
SSEp[6] = anova(lm(Y~X2+X3))[3,2]
SSEp[7] = anova(lm(Y~X1+X2+X3))[4,2]
SSEp[8] = anova(lm(Y~X1+X2+X1sq+X2sq))[5,2]
SSEp[9] = anova(lm(Y~X1+X2+X1X2))[4,2]
SSEp[10] = anova(lm(Y~X1+X2+X1sq+X2X3))[5,2]
plot( PRESSp[1:7] ~ CH09PR23.r2$size, ylab=expression(PRESS[p]), xlab='p' )
minPRESSp = by( data=PRESSp[1:7], INDICES=factor(CH09PR23.r2$size), FUN=min )
lines( minPRESSp ~ p )

    Tip 5: Forward Stepwise Regression

step( lm(Y~1), scope=Y~X1+X2+X3+X4+X51999+X52000+X52001, direction='forward' )   # start from the intercept-only model and add terms by AIC

Tip 6: Backward Elimination. Use step() with direction='backward', or fastbw() from R package rms. k is the multiple of the number of degrees of freedom used for the penalty: k = 2 for AIC and k = log(n) for BIC (n = length(Y)).

library(rms)
CH09PR20.lm <- lm( Y~X1+X2+X3+X4+X51999+X52000+X52001 )
step( CH09PR20.lm, direction='backward', k=2 )   # AIC-based backward elimination
CH09PR20.ols <- ols( Y~X1+X2+X3+X4+X51999+X52000+X52001 )   # rms::ols fit required by fastbw()
fastbw( fit=CH09PR20.ols, rule='p', type='individual', sls=.10 )   # drop terms by p-value; stay level .10

Part 3 Outlier Detection
Tip 1: Studentized Deleted Residuals: internally studentized residuals come from the full fit, while externally studentized (deleted) residuals come from leave-one-out refits. R code rstudent() gives the externally studentized residuals.

tcrit <- qt( 1-.5*(.05/52), 52-4-1 )   # Bonferroni-adjusted critical value (n = 52 cases, 4 parameters)
CH10PR10.lm <- lm( Y ~ X )
plot( rstudent(CH10PR10.lm) ~ fitted(CH10PR10.lm), ylim=c(-4,4) )
abline( h=tcrit, lty=2 )    # cases beyond these bounds are flagged as outliers
abline( h=-tcrit, lty=2 )
abline( h=0 )

Tip 2: Hat Matrix Leverage Values. Use hatvalues() to obtain leverage values. The 2p/n rule of thumb below can fail when the number of predictors is large.

n <- length(Y)   # sample size
p <- 3           # number of model parameters
hii = hatvalues( CH10PR10.lm )
hii > 2*p/n      # empirical rule: leverage above 2p/n flags a high-leverage case

Part 4 Influence Analysis
Tip 1: DFFITS. The influence of the i-th case on the i-th fitted value. Use R code dffits().
Tip 2: Cook's Distance. The influence of the i-th case on all fitted values. Use R code cooks.distance(). Input is a model object.
Tip 3: DFBETAS. The influence of the i-th case on the k-th coefficient. Use R code dfbetas() or influence.measures().

DFF <- dffits( CH10PR10.lm )
abs(DFF) > 2*sqrt(p/n)   # empirical cutoff for influential cases
DFB <- influence.measures( CH10PR10.lm )   # includes DFBETAS, DFFITS, and Cook's distance
DFB[["infmat"]][c(16,22,43,48,10,32,38,40),]
plot( cooks.distance(CH10PR10.lm), type='o', pch=19 )
ei = resid( CH10PR10.lm )
yhat = fitted( CH10PR10.lm )
radius = sqrt( cooks.distance(CH10PR10.lm)/pi )   # bubble area proportional to Cook's distance
plot( ei ~ yhat, pch='' )
abline( h=0 )
symbols( yhat, ei, circles=radius, inches=.15, bg='black', fg='white', add=T )

Part 5 Multicollinearity and Ridge Regression
    Tip 1: Variance Inflation Factor

  • Use pairs() and cor() to get pairwise scatter plots and the correlation matrix.
  • require( car )
    vif( lm(Y ~ X1+X2+X3) )

  • If max(VIF) > 10 and mean(VIF) > 6, multicollinearity is suggested.
Tip 2: Ridge Regression. Use R package genridge.

library(genridge)
lambda.grid = seq( from=.01, to=10, by=.01 )   # grid of ridge penalties
traceplot( ridge(U~Z1+Z2+Z3, lambda=lambda.grid) )   # coefficient paths across the penalty grid
