Chapter 10 Multiple Regression

10.1 Reminder: Least Squares algorithm

Here, we look only at the case of

  • a Normal distribution

  • with the identity link function

In this case, having collected \(N\) independent responses \(\lbrace y_i \rbrace_{i=1,\cdots,N}\), the likelihood function is:
\[f(y_1, y_2,\cdots, y_N;\theta_1,\theta_2,\cdots,\theta_N)=\prod_{i=1}^N f(y_i;\theta_i)=\prod_{i=1}^N \frac{1}{\sqrt{2\pi}\sigma} \exp\left( \frac{-(y_i-\theta_i)^2}{2\sigma^2}\right)\quad \text{(independence)}\]
Now considering the link \(\theta_i=\mathbf{x}_i^{T}\pmb{\beta}=\beta_1+\beta_2\ x_{1,i}+\cdots+\beta_p\ x_{p-1,i}\), the likelihood becomes:
\[f(y_1, y_2,\cdots, y_N;\pmb{\beta})=\prod_{i=1}^N \frac{1}{\sqrt{2\pi}\sigma} \exp\left( \frac{-(y_i-\mathbf{x}_i^{T}\pmb{\beta})^2}{2\sigma^2}\right)\]
The log-likelihood function is:
\[\log f(y_1, y_2,\cdots, y_N;\pmb{\beta} )=- N\log\left (\sqrt{2\pi}\sigma \right)-\sum_{i=1}^N\frac{(y_i-\mathbf{x}_i^{T}\pmb{\beta})^2}{2\sigma^2}\]
Finding the estimate \(\pmb{\hat{\beta}}\) that maximises the likelihood is equivalent to finding the estimate \(\pmb{\hat{\beta}}\) that minimises the negative of the log-likelihood. So \(\pmb{\hat{\beta}}\) is found by setting the derivative of \(-\log f(y_1, y_2,\cdots, y_N;\pmb{\beta} )\) to 0, which is equivalent to:
\[\frac{\partial}{\partial \pmb{\beta}} \sum_{i=1}^N (y_i-\mathbf{x}_i^{T}\pmb{\beta})^2=0\]
So maximising the likelihood is equivalent to minimising the sum of squared errors (SSE) when defining the error to be \(\epsilon_i=y_i-\mathbf{x}_i^{T}\pmb{\beta}\). Let's define the vector \(\pmb{\epsilon}\) concatenating \(\epsilon_1, \cdots,\epsilon_N\), the vector \(\mathbf{y}=[y_1,\cdots,y_N]^T\) and the matrix \(\mathrm{X}\):
\[\mathrm{X}= \left\lbrack \begin{array}{cccc} 1 & x_{1,1} &\cdots & x_{p-1,1}\\ 1 & x_{1,2} &\cdots & x_{p-1,2}\\ \vdots &\vdots & &\vdots\\ 1 & x_{1,N} &\cdots & x_{p-1,N}\\ \end{array}\right\rbrack\]
Then \(\pmb{\epsilon}=\mathbf{y}-\mathrm{X} \pmb{\beta}\). Minimising the SSE corresponds to finding \(\pmb{\beta}\):
\[\begin{array}{ll} \pmb{\hat{\beta}}&=\arg\min_{\pmb{\beta}} \left\lbrace SSE=\sum_{i=1}^{N}\epsilon_i^2=\|\pmb{\epsilon}\|^2\right\rbrace\\ &=\arg\min_{\pmb{\beta}} \left\lbrace SSE=\|\mathbf{y}-\mathrm{X}\pmb{\beta}\|^2\right\rbrace\\ &=\arg\min_{\pmb{\beta}} \left\lbrace SSE=(\mathbf{y}-\mathrm{X}\pmb{\beta})^{T}(\mathbf{y}-\mathrm{X}\pmb{\beta})\right\rbrace\\ &=\arg\min_{\pmb{\beta}} \left\lbrace SSE=\mathbf{y}^{T}\mathbf{y}-\mathbf{y}^{T}\mathrm{X}\pmb{\beta}-\pmb{\beta}^{T}\mathrm{X}^{T}\mathbf{y}+\pmb{\beta}^{T}\mathrm{X}^{T}\mathrm{X}\pmb{\beta}\right\rbrace\\ \end{array}\]
The derivative of the SSE w.r.t. \(\pmb{\beta}\) is:
\[\begin{array}{ll} \frac{\partial}{\partial \pmb{\beta}} SSE &=\frac{\partial}{\partial \pmb{\beta}} \left\lbrace \mathbf{y}^{T}\mathbf{y}-\mathbf{y}^{T}\mathrm{X}\pmb{\beta}-\pmb{\beta}^{T}\mathrm{X}^{T}\mathbf{y}+\pmb{\beta}^{T}\mathrm{X}^{T}\mathrm{X}\pmb{\beta}\right\rbrace\\ &=0-(\mathbf{y}^{T}\mathrm{X})^T -\mathrm{X}^{T}\mathbf{y}+\mathrm{X}^{T}\mathrm{X}\pmb{\beta}+(\mathrm{X}^{T}\mathrm{X})^T\pmb{\beta}\\ &= -2\mathrm{X}^T\mathbf{y}+2\mathrm{X}^{T}\mathrm{X}\pmb{\beta}\\ \end{array}\]
So the solution \(\pmb{\hat{\beta}}\) for which this derivative is 0 is
\[\pmb{\hat{\beta}}=(\mathrm{X}^{T}\mathrm{X})^{-1}\mathrm{X}^{T}\mathbf{y} \quad \text{(Least Squares estimate)} \label{eq:LS}\]
This is the maximum likelihood estimate when considering a Normal distribution with the identity link function.
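
As a quick sanity check of this formula, we can simulate data from a known linear model and compare the closed-form estimate with the coefficients returned by lm(). This is only a minimal sketch: the simulated variables, the "true" coefficients and the noise level below are arbitrary illustrative choices, not part of the case study that follows.

# minimal check of the least squares formula on simulated data
set.seed(1)
N=50
x1=runif(N); x2=runif(N)
X=cbind(1,x1,x2)                      # design matrix with an intercept column
beta=c(2,-1,0.5)                      # "true" coefficients (arbitrary)
y=drop(X%*%beta+rnorm(N,sd=0.3))      # responses with Normal noise

betahat=solve(t(X)%*%X)%*%t(X)%*%y    # (X'X)^{-1} X'y
cbind(betahat,coef(lm(y~x1+x2)))      # the two columns should agree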

10.2 Covariance of \(\hat{\beta}\)

Expectation of \(\hat{\beta}\) \[\mathbb{E}[\hat{\beta}]=\mathbb{E}\left\lbrack(\mathrm{X}^{T}\mathrm{X})^{-1}\mathrm{X}^{T}\mathbf{y} \right\rbrack=(\mathrm{X}^{T}\mathrm{X})^{-1}\mathrm{X}^{T}\mathbb{E}\left\lbrack\mathbf{y} \right\rbrack =(\mathrm{X}^{T}\mathrm{X})^{-1}\mathrm{X}^{T}\mathrm{X}\beta=\beta\]

Covariance of \(\hat{\beta}\) \[\mathbb{C}[\hat{\beta}]=\mathbb{E} \left\lbrack(\hat{\beta}-\beta)(\hat{\beta}-\beta)^T\right\rbrack=\mathbb{E} \left\lbrack((\mathrm{X}^{T}\mathrm{X})^{-1}\mathrm{X}^{T}\mathbf{y}-\beta)((\mathrm{X}^{T}\mathrm{X})^{-1}\mathrm{X}^{T}\mathbf{y}-\beta)^T\right\rbrack\] With \(\mathbf{y}=\mathrm{X}\beta+\epsilon\), we get \[\mathbb{C}[\hat{\beta}]=\mathbb{E} \left\lbrack ((\mathrm{X}^{T}\mathrm{X})^{-1}\mathrm{X}^{T} \epsilon)((\mathrm{X}^{T}\mathrm{X})^{-1}\mathrm{X}^{T} \epsilon)^T \right\rbrack =\mathbb{E} \left\lbrack (\mathrm{X}^{T}\mathrm{X})^{-1}\mathrm{X}^{T} \epsilon \epsilon^T ((\mathrm{X}^{T}\mathrm{X})^{-1}\mathrm{X}^{T})^T \right\rbrack\] Since the errors are independent with variance \(\sigma^2\), \(\mathbb{E}[\epsilon\epsilon^T]=\sigma^2\mathrm{I}\), so \[\mathbb{C}[\hat{\beta}] =\sigma^2 (\mathrm{X}^{T}\mathrm{X})^{-1}(\mathrm{X}^{T}\mathrm{X}) ((\mathrm{X}^{T}\mathrm{X})^{-1})^{T} =\sigma^2 (\mathrm{X}^{T}\mathrm{X})^{-1}\]
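
This result can be checked empirically by simulation: keeping the design matrix fixed, repeatedly generate new responses and compare the empirical mean and covariance of the resulting estimates with \(\beta\) and \(\sigma^2(\mathrm{X}^{T}\mathrm{X})^{-1}\). The sketch below uses arbitrary simulation settings, chosen only for illustration.

# Monte Carlo check of E[BetaHat]=beta and C[BetaHat]=sigma^2 (X'X)^{-1}
set.seed(2)
N=30; sigma=0.5
X=cbind(1,runif(N))                 # fixed design: intercept + one covariate
beta=c(1,2)
XtXinv=solve(t(X)%*%X)

betahats=replicate(5000,{
  y=X%*%beta+rnorm(N,sd=sigma)
  drop(XtXinv%*%t(X)%*%y)           # least squares estimate for this replicate
})
rowMeans(betahats)                  # approximately beta (unbiasedness)
cov(t(betahats))                    # approximately sigma^2 * XtXinv
sigma^2*XtXinv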

Remark: since the gradient of the log-likelihood vanishes at \(\hat{\beta}\) (\(\nabla\mathcal{L}(\hat{\beta})=0\)), a second-order Taylor expansion around \(\hat{\beta}\) gives \[\mathcal{L}(\beta) \approx \mathcal{L}(\hat{\beta}) +\frac{1}{2}\ (\beta-\hat{\beta})^T \ \mathrm{H}_{\hat{\beta}}\ (\beta-\hat{\beta})\] In comparison, a Gaussian approximation centred at \(\hat{\beta}\) has log-density \[\mathcal{L}(\beta) =\text{constant}-\frac{1}{2}\ (\beta-\hat{\beta})^T \ \Sigma^{-1}\ (\beta-\hat{\beta})\] \(-\mathrm{H}_{\hat{\beta}}\) is called the observed information matrix and approximates \(\Sigma^{-1}\). Hence the diagonal values of \(-\mathrm{H}_{\hat{\beta}}^{-1}\) approximate the squared standard errors (variances) of \(\hat{\beta}\).
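
For the Normal model with identity link this remark is exact rather than approximate: differentiating the log-likelihood of section 10.1 twice with respect to \(\beta\) gives
\[\mathrm{H}_{\hat{\beta}}=\left.\frac{\partial^2 \mathcal{L}(\beta)}{\partial \beta\,\partial \beta^{T}}\right|_{\beta=\hat{\beta}}=-\frac{\mathrm{X}^{T}\mathrm{X}}{\sigma^2} \quad\Longrightarrow\quad -\mathrm{H}_{\hat{\beta}}^{-1}=\sigma^2(\mathrm{X}^{T}\mathrm{X})^{-1}=\mathbb{C}[\hat{\beta}]\]
so the observed information recovers exactly the covariance derived above, with the unknown \(\sigma^2\) replaced in practice by its estimate.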

10.3 Case study: Carbohydrate Diet

In this dataset (Dobson and Barnett 2008, p. 96), the response variable \(y\) is the percentage of total calories obtained from complex carbohydrates, for 20 male insulin-dependent diabetics who had been on a high-carbohydrate diet for six months. Additional information is collected about the individuals taking part in the study, including age (in years), weight (relative to ideal weight) and protein intake (as a percentage of total calories).

10.3.1 Fitting with lm() and glm()

carbohydrate=c(33,40,37,27,30,43,34,48,30,38,50,51,30,36,41,42,46,24,35,37)
age=c(33,47,49,35,46,52,62,23,32,42,31,61,63,40,50,64,56,61,48,28)
weight=c(100,92,135,144,140,101,95,101,98,105,108,85,130,127,109,107,117,100,118,102)
protein=c(14,15,18,12,15,15,14,17,15,14,17,19,19,20,15,16,18,13,18,14)

data=cbind(carbohydrate,age,weight,protein)
knitr::kable(t(data), caption = "Carbohydrate diet data")
Table 10.1: Carbohydrate diet data
carbohydrate 33 40 37 27 30 43 34 48 30 38 50 51 30 36 41 42 46 24 35 37
age 33 47 49 35 46 52 62 23 32 42 31 61 63 40 50 64 56 61 48 28
weight 100 92 135 144 140 101 95 101 98 105 108 85 130 127 109 107 117 100 118 102
protein 14 15 18 12 15 15 14 17 15 14 17 19 19 20 15 16 18 13 18 14
# using lm
res.lm=lm(carbohydrate~age+weight+protein)
summary(res.lm)
## 
## Call:
## lm(formula = carbohydrate ~ age + weight + protein)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.3424  -4.8203   0.9897   3.8553   7.9087 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 36.96006   13.07128   2.828  0.01213 * 
## age         -0.11368    0.10933  -1.040  0.31389   
## weight      -0.22802    0.08329  -2.738  0.01460 * 
## protein      1.95771    0.63489   3.084  0.00712 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.956 on 16 degrees of freedom
## Multiple R-squared:  0.4805, Adjusted R-squared:  0.3831 
## F-statistic: 4.934 on 3 and 16 DF,  p-value: 0.01297
# using glm with the gaussian family (identity link is the default)
res.glm=glm(carbohydrate~age+weight+protein,family=gaussian)
summary(res.glm)
## 
## Call:
## glm(formula = carbohydrate ~ age + weight + protein, family = gaussian)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -10.3424   -4.8203    0.9897    3.8553    7.9087  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 36.96006   13.07128   2.828  0.01213 * 
## age         -0.11368    0.10933  -1.040  0.31389   
## weight      -0.22802    0.08329  -2.738  0.01460 * 
## protein      1.95771    0.63489   3.084  0.00712 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 35.47893)
## 
##     Null deviance: 1092.80  on 19  degrees of freedom
## Residual deviance:  567.66  on 16  degrees of freedom
## AIC: 133.67
## 
## Number of Fisher Scoring iterations: 2
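
The glm() output can be reconciled with the lm() output: for the Gaussian family the dispersion parameter is the residual deviance divided by its degrees of freedom, and its square root is the residual standard error reported by lm(). A quick check, using standard components of the fitted glm object:

res.glm$deviance/res.glm$df.residual        # dispersion, matches 35.47893 above
sqrt(res.glm$deviance/res.glm$df.residual)  # residual standard error, matches 5.956 from lm()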

10.3.2 Using Math

carbohydrate=c(33,40,37,27,30,43,34,48,30,38,50,51,30,36,41,42,46,24,35,37) # response vector
age=c(33,47,49,35,46,52,62,23,32,42,31,61,63,40,50,64,56,61,48,28)
weight=c(100,92,135,144,140,101,95,101,98,105,108,85,130,127,109,107,117,100,118,102)
protein=c(14,15,18,12,15,15,14,17,15,14,17,19,19,20,15,16,18,13,18,14)

# design matrix: a column of ones (intercept) followed by the explanatory variables
X=matrix(1,20,4)
X[,1]=rep(1,20)
X[,2]=age
X[,3]=weight
X[,4]=protein

# least squares estimate: BetaHat = (X'X)^{-1} X' y
BetaHat=solve(t(X)%*%X)%*%t(X)%*%carbohydrate

Compare the estimated parameters \(\hat{\beta}\)

BetaHat
##            [,1]
## [1,] 36.9600559
## [2,] -0.1136764
## [3,] -0.2280174
## [4,]  1.9577126

with those computed with glm and lm earlier.

res.lm$coefficients
## (Intercept)         age      weight     protein 
##  36.9600559  -0.1136764  -0.2280174   1.9577126
res.glm$coefficients
## (Intercept)         age      weight     protein 
##  36.9600559  -0.1136764  -0.2280174   1.9577126

So we have the maths right for computing the estimate \(\hat{\beta}\)!
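
As an aside, forming \((\mathrm{X}^{T}\mathrm{X})^{-1}\) explicitly is convenient for checking the maths, but in practice it is more numerically stable to solve the normal equations directly or to use a QR decomposition (which is what lm() does internally). A minimal sketch of two equivalent alternatives using base R:

# equivalent, more numerically stable ways to obtain the same estimate
solve(t(X)%*%X, t(X)%*%carbohydrate)   # solve the normal equations directly
qr.solve(X, carbohydrate)              # least squares solution via QR decomposition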

fitted=X%*%BetaHat                        # fitted values
residuals=carbohydrate-fitted             # residuals
SSE=t(residuals)%*%residuals              # sum of squared errors
sigmahatsq=SSE/(length(carbohydrate)-length(BetaHat))   # SSE/(N-p) estimates sigma^2
# residual standard error
sqrt(sigmahatsq)
##          [,1]
## [1,] 5.956419

Compare this estimate with the residual standard error computed with lm().
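
For reference, the same value is also available directly from the fitted object via summary(res.lm)$sigma:

# residual standard error extracted from the fitted lm object
summary(res.lm)$sigma                  # should match sqrt(sigmahatsq) above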

# uncertainty of BetaHat (standard error)
CovBetaHat=solve(t(X)%*%X)*as.numeric(sigmahatsq)
Std.Error.BetaHat=sqrt(diag(CovBetaHat))
Std.Error.BetaHat
## [1] 13.07128293  0.10932548  0.08328895  0.63489286

Compare these standard errors with those reported in summary(res.lm) and summary(res.glm).
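
They can also be obtained directly from the fitted models: vcov() returns the estimated covariance matrix \(\hat{\sigma}^2(\mathrm{X}^{T}\mathrm{X})^{-1}\), so the square roots of its diagonal are the standard errors.

# standard errors straight from the fitted models
sqrt(diag(vcov(res.lm)))
sqrt(diag(vcov(res.glm)))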

# multiple R-squared (coefficient of determination)
S0=sum((carbohydrate-mean(carbohydrate))^2)   # total sum of squares about the mean
Rsq=(S0-SSE)/S0
Rsq
##           [,1]
## [1,] 0.4805428
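
This matches the multiple R-squared reported by summary(res.lm). The adjusted R-squared additionally penalises for the number of fitted parameters; a quick check against the values reported earlier:

# compare with the values reported by summary(res.lm)
summary(res.lm)$r.squared
summary(res.lm)$adj.r.squared
# adjusted R-squared by hand: 1 - (SSE/(N-p)) / (S0/(N-1)) with N=20, p=4
1-(SSE/(20-4))/(S0/(20-1))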

References

Dobson, A. J., and A. G. Barnett. 2008. An Introduction to Generalized Linear Models. 3rd ed. CRC Press.