Linear Regression
Dataset
In this dataset (taken from: An Introduction to Generalized Linear Models, A. J. Dobson & A. G. Barnett, 3rd edition, p. 96), the response variable \(y\) is the percentage of total calories obtained from complex carbohydrates for 20 male insulin-dependent diabetics who have been on a high-carbohydrate diet for six months. Additional information is collected about the individuals taking part in the study: age (in years), weight (relative to ideal weight) and the percentage of calories obtained from protein.
carbohydrate=c(33,40,37,27,30,43,34,48,30,38,50,51,30,36,41,42,46,24,35,37) # response: % calories from complex carbohydrates
age=c(33,47,49,35,46,52,62,23,32,42,31,61,63,40,50,64,56,61,48,28)          # age in years
weight=c(100,92,135,144,140,101,95,101,98,105,108,85,130,127,109,107,117,100,118,102) # weight relative to ideal weight
protein=c(14,15,18,12,15,15,14,17,15,14,17,19,19,20,15,16,18,13,18,14)      # % calories from protein
Fitting a linear model with R functions
The R functions lm() and glm() can be used to fit a linear model of the form \[ y_{\text{carbohydrate}}=\underbrace{\beta_0+\beta_1 \ x_{\text{age}}+\beta_2 \ x_{\text{weight}} + \beta_3 \ x_{\text{protein}}}_{\hat{y}}+\epsilon \] Once estimates of the parameters \(\beta\) are available, a prediction \(\hat{y}\) can be computed for any inputs \((x_{\text{age}},x_{\text{weight}},x_{\text{protein}})\).
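The summaries below can be reproduced with calls along the following lines (a minimal sketch; the object names fit.lm and fit.glm are assumptions, the original code chunk is not shown). Note that glm() with family = gaussian fits the same model, so the two coefficient tables are identical.
# assumed object names; the vectors defined above are used directly
fit.lm  = lm(carbohydrate ~ age + weight + protein)
fit.glm = glm(carbohydrate ~ age + weight + protein, family = gaussian)
summary(fit.lm)
summary(fit.glm)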
Call:
lm(formula = carbohydrate ~ age + weight + protein)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.3424  -4.8203   0.9897   3.8553   7.9087 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept) 36.96006   13.07128   2.828  0.01213 *
age         -0.11368    0.10933  -1.040  0.31389  
weight      -0.22802    0.08329  -2.738  0.01460 *
protein      1.95771    0.63489   3.084  0.00712 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.956 on 16 degrees of freedom
Multiple R-squared:  0.4805,    Adjusted R-squared:  0.3831 
F-statistic: 4.934 on 3 and 16 DF,  p-value: 0.01297
Call:
glm(formula = carbohydrate ~ age + weight + protein, family = gaussian)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-10.3424   -4.8203    0.9897    3.8553    7.9087  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept) 36.96006   13.07128   2.828  0.01213 *
age         -0.11368    0.10933  -1.040  0.31389  
weight      -0.22802    0.08329  -2.738  0.01460 *
protein      1.95771    0.63489   3.084  0.00712 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 35.47893)

    Null deviance: 1092.80  on 19  degrees of freedom
Residual deviance:  567.66  on 16  degrees of freedom
AIC: 133.67

Number of Fisher Scoring iterations: 2
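As mentioned above, the fitted model can be used to compute a prediction \(\hat{y}\) for any new inputs. A quick sketch using the assumed fit.lm object and made-up values for a hypothetical new individual:
# hypothetical individual (illustrative values, not from the dataset)
newdata = data.frame(age = 40, weight = 100, protein = 15)
predict(fit.lm, newdata = newdata)   # predicted % calories from complex carbohydrates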
Fitting a linear model using linear algebra
This code uses linear algebra to recover some of the results found by the R functions lm() and glm().
We define the design matrix \(\mathrm{X}\) containing the explanatory variables (with a leading column of ones for the intercept), and the vector \(\mathbf{y}\) collecting the responses.
X=matrix(1,20,4)  # design matrix: 20 observations, 4 columns
X[,1]=rep(1,20)   # column of ones for the intercept
X[,2]=age
X[,3]=weight
X[,4]=protein
Y=carbohydrate    # response vector
Least squares solution
The least squares solution is computed as: \[ \hat{\beta}=(\mathrm{X}^{T}\mathrm{X})^{-1} \ \mathrm{X}^{T} \mathbf{y} \]
Using R, we recover the same estimates as those computed with the R functions lm() and glm().
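The chunk producing the output below is not shown in the source; a direct transcription of the formula, reusing the BetaHat name that appears in the later code, would look like:
BetaHat = solve(t(X) %*% X) %*% t(X) %*% Y   # (X'X)^{-1} X'y
BetaHat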
[,1]
[1,] 36.9600559
[2,] -0.1136764
[3,] -0.2280174
[4,] 1.9577126
Residual standard error
The Residual Standard Error (RSE) reported by the function lm() can also be computed using linear algebra as follows: \[ RSE=\sqrt{\frac{SSE}{n-p}} \] where \(n\) is the number of observations and \(p\) is the number of parameters estimated in the model (here the four coefficients \(\lbrace \beta_0,\beta_1,\beta_2,\beta_3\rbrace\), so \(n-p=20-4=16\) is the residual degrees of freedom). The Sum of Squared Errors (SSE) is computed as: \[ SSE=\sum_{i=1}^n (y_{i}-\hat{y}_i)^2 \]
fitted = X %*% BetaHat                                   # fitted values y-hat
residuals = carbohydrate - fitted                        # residuals y - y-hat
SSE = t(residuals) %*% residuals                         # Sum of Squared Errors
sigmahatsq = SSE/(length(carbohydrate)-length(BetaHat))  # SSE/(n-p)
# Residual standard error as reported by lm()
sqrt(sigmahatsq)
[,1]
[1,] 5.956419
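As a check (assuming the fit.lm object from the sketch above), the same value is stored in the lm() summary:
summary(fit.lm)$sigma   # residual standard error reported by lm()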
Standard Errors of estimated parameters
The uncertainty associated with the parameters \(\beta\) (the Std. Error column reported by lm() and glm()) can also be computed from the estimated covariance matrix of \(\hat{\beta}\), \[ \widehat{\mathrm{Cov}}(\hat{\beta})=\hat{\sigma}^2 \ (\mathrm{X}^{T}\mathrm{X})^{-1}, \] whose diagonal contains the squared standard errors:
# uncertainty of BetaHat (standard error)
CovBetaHat=solve(t(X)%*%X)*as.numeric(sigmahatsq)
Std.Error.BetaHat=sqrt(diag(CovBetaHat))
Std.Error.BetaHat
[1] 13.07128293 0.10932548 0.08328895 0.63489286
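These match the Std. Error column of the lm() summary (again assuming fit.lm from the sketch above):
summary(fit.lm)$coefficients[, "Std. Error"]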
Multiple R-squared
Similarly, the R-squared (as reported by lm()) can be computed: \[ R^2=\frac{\sum_{i=1}^n (\hat{y}_{i}-\bar{y})^2}{\sum_{i=1}^n (y_{i}-\bar{y})^2}=1-\frac{SSE}{\sum_{i=1}^n (y_{i}-\bar{y})^2} \] using the sample mean \(\bar{y}=\frac{1}{n}\sum_{i=1}^n y_{i}\); the second form, with the total sum of squares in the denominator, is the one used in the code below.
# Multiple R-squared (coefficient of determination)
S0=sum((carbohydrate-mean(carbohydrate))^2)  # total sum of squares
Rsq=(S0-SSE)/S0                              # equivalently 1 - SSE/S0
Rsq
[,1]
[1,] 0.4805428
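Since SSE was computed as a 1x1 matrix, Rsq also prints as a matrix; the scalar value below can be extracted with, for example (a sketch, the original statement is not shown in the source):
as.numeric(Rsq)   # drop the 1x1 matrix structure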
[1] 0.4805428