Dataset

In this dataset (taken from: An Introduction to Generalized Linear Models, A. J. Dobson & A. G. Barnett, 3rd edition p.96), the response variable \(y\) corresponds to the percentage of total calories obtained from complex carbohydrates for 20 male insulin-dependent diabetics who have been on a high carbohydrate diet for six months. Additional information is collected about the individuals taking part in the study including age (in years), weight (relative to ideal weight) and other calories intake from protein (as percentage).

carbohydrate=c(33,40,37,27,30,43,34,48,30,38,50,51,30,36,41,42,46,24,35,37) # response vector
age=c(33,47,49,35,46,52,62,23,32,42,31,61,63,40,50,64,56,61,48,28)
weight=c(100,92,135,144,140,101,95,101,98,105,108,85,130,127,109,107,117,100,118,102)
protein=c(14,15,18,12,15,15,14,17,15,14,17,19,19,20,15,16,18,13,18,14)

Fitting a linear model with R functions

The R functions lm() and glm() can be used to fit a linear model of the form \[ y_{\text{carbohydrate}}=\underbrace{\beta_0+\beta_1 \ x_{\text{age}}+\beta_2 \ x_{\text{weight}} + \beta_3 \ x_{\text{protein}}}_{\hat{y}}+\epsilon \] Once estimates for parameters \(\beta\)s are available, then a predictor \(\hat{y}\) can be computed for any inputs \((x_{\text{age}},x_{\text{weight}},x_{\text{protein}})\).

#using lm

res.lm=lm(carbohydrate~age+weight+protein)
summary(res.lm)

Call:
lm(formula = carbohydrate ~ age + weight + protein)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.3424  -4.8203   0.9897   3.8553   7.9087 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept) 36.96006   13.07128   2.828  0.01213 * 
age         -0.11368    0.10933  -1.040  0.31389   
weight      -0.22802    0.08329  -2.738  0.01460 * 
protein      1.95771    0.63489   3.084  0.00712 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.956 on 16 degrees of freedom
Multiple R-squared:  0.4805,    Adjusted R-squared:  0.3831 
F-statistic: 4.934 on 3 and 16 DF,  p-value: 0.01297
#using glm
res.glm=glm(carbohydrate~age+weight+protein,family=gaussian)
summary(res.glm)

Call:
glm(formula = carbohydrate ~ age + weight + protein, family = gaussian)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-10.3424   -4.8203    0.9897    3.8553    7.9087  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept) 36.96006   13.07128   2.828  0.01213 * 
age         -0.11368    0.10933  -1.040  0.31389   
weight      -0.22802    0.08329  -2.738  0.01460 * 
protein      1.95771    0.63489   3.084  0.00712 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 35.47893)

    Null deviance: 1092.80  on 19  degrees of freedom
Residual deviance:  567.66  on 16  degrees of freedom
AIC: 133.67

Number of Fisher Scoring iterations: 2

Fitting a linear model using linear algebra

This code used linear algebra to recover some of the results found by the R functions lm() and glm().

We define matrix \(\mathrm{X}\) with the explanatory variables, and vector \(\mathbf{y}\) collecting the responses.

X=matrix(1,20,4)
X[,1]=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1)
X[,2]=age
X[,3]=weight
X[,4]=protein
Y=carbohydrate

Least squares solution

The least square solution is computed with: \[ \hat{\beta}=(\mathrm{X}^{T}\mathrm{X})^{-1} \ \mathrm{X}^{T} \mathbf{y} \]

Using R, notice how we recover the same estimates computed with R function functions lm() and glm():

BetaHat=solve(t(X)%*%X)%*%t(X)%*%carbohydrate
BetaHat
           [,1]
[1,] 36.9600559
[2,] -0.1136764
[3,] -0.2280174
[4,]  1.9577126

Residual standard error

The Residual Standard Error (RSE) reported by the funtion lm() can be also computed using linear algebra as follow: \[ RSE=\sqrt{\frac{SSE}{n-p}} \] with \(n\) is the number of observations and \(p\) is the degree of freedom of the model (i.e. number of parameters \(\lbrace \beta_0,\beta_1,\beta_2,\beta_3\rbrace\) to estimate in the model). The Sum of Square Error (SSE) is computed as : \[ SSE=\sum_{i=1}^n (y_{i}-\hat{y}_i)^2 \]

fitted =X %*% BetaHat
residuals=carbohydrate-fitted
SSE=t(residuals)%*%residuals # Sum of Square Error
sigmahatsq=SSE/(length(carbohydrate)-length(BetaHat))
#Residual standard error as reported by lm()
sqrt(sigmahatsq)
         [,1]
[1,] 5.956419

Standard Errors of estimated parameters

The uncertainty associated with the parameters \(\beta\)s (Std. Error as reported with lm() and glm()) can also be computed:

# uncertainty of BetaHat (standard error)
CovBetaHat=solve(t(X)%*%X)*as.numeric(sigmahatsq)
Std.Error.BetaHat=sqrt(diag(CovBetaHat))
Std.Error.BetaHat
[1] 13.07128293  0.10932548  0.08328895  0.63489286

Multiple R-squared

Similarly R-squared (as reported with lm()) can be computed: \[ R^2=\frac{\sum_{i=1}^n (\hat{y}_{i}-\bar{y})^2}{\sum_{i=1}^n (y_{i}-\bar{y})^2} \] using the sample mean \(\bar{y}=\frac{1}{n}\sum_{i=1}^n y_{i}\).

#Multiple R-squared: (coefficient of determination) 
S0=sum((carbohydrate-mean(carbohydrate))*(carbohydrate-mean(carbohydrate)))
Rsq=(S0-SSE)/S0
Rsq
          [,1]
[1,] 0.4805428
S1=sum((fitted-mean(carbohydrate))*(fitted-mean(carbohydrate)))
Rsq2<-S1/S0
Rsq2
[1] 0.4805428