Chapter 10 Multiple Regression
10.1 Reminder: Least Squares algorithm
Here, we look only at the case of a Normal distribution with the identity link function.
In this case, having collected \(N\) independent responses \(\lbrace y_i \rbrace_{i=1,\cdots,N}\), the likelihood function is: \[f(y_1, y_2,\cdots, y_N;\theta_1,\theta_2,\cdots,\theta_N)=\prod_{i=1}^N f(y_i;\theta_i)=\prod_{i=1}^N \frac{1}{\sqrt{2\pi}\sigma} \exp\left( \frac{-(y_i-\theta_i)^2}{2\sigma^2}\right)\quad \text{(independence)}\]

Now considering the link \(\theta_i=\mathbf{x}_i^{T}\pmb{\beta}=\beta_1+\beta_2\ x_{1,i}+\cdots+\beta_p\ x_{p-1,i}\), the likelihood becomes: \[f(y_1, y_2,\cdots, y_N;\pmb{\beta})=\prod_{i=1}^N \frac{1}{\sqrt{2\pi}\sigma} \exp\left( \frac{-(y_i-\mathbf{x}_i^{T}\pmb{\beta})^2}{2\sigma^2}\right)\]

The log-likelihood function is: \[\log f(y_1, y_2,\cdots, y_N;\pmb{\beta} )=- N\log\left (\sqrt{2\pi}\sigma \right)-\sum_{i=1}^N\frac{(y_i-\mathbf{x}_i^{T}\pmb{\beta})^2}{2\sigma^2}\]

Finding the estimate \(\pmb{\hat{\beta}}\) that maximises the likelihood is equivalent to finding the estimate \(\pmb{\hat{\beta}}\) that minimises the negative of the log-likelihood function. So \(\pmb{\hat{\beta}}\) is found by setting the derivative of \(-\log f(y_1, y_2,\cdots, y_N;\pmb{\beta} )\) to 0, which is equivalent to: \[\frac{\partial}{\partial \pmb{\beta}} \sum_{i=1}^N (y_i-\mathbf{x}_i^{T}\pmb{\beta})^2=0\]

So maximising the likelihood is equivalent to minimising the sum of squared errors (SSE) when defining the error to be \(\epsilon_i=y_i-\mathbf{x}_i^{T}\pmb{\beta}\). Let us define the vector \(\pmb{\epsilon}\) concatenating \(\epsilon_1, \cdots,\epsilon_N\), the vector \(\mathbf{y}=[y_1,\cdots,y_N]^T\) and the matrix \(\mathrm{X}\): \[\mathrm{X}= \left\lbrack \begin{array}{cccc} 1 & x_{1,1} &\cdots & x_{p-1,1}\\ 1 & x_{1,2} &\cdots & x_{p-1,2}\\ \vdots &\vdots & &\vdots\\ 1 & x_{1,N} &\cdots & x_{p-1,N}\\ \end{array}\right\rbrack\]

Then \(\pmb{\epsilon}=\mathbf{y}-\mathrm{X} \pmb{\beta}\). Minimising the SSE corresponds to finding \(\pmb{\beta}\): \[\begin{array}{ll} \pmb{\hat{\beta}}&=\arg\min_{\pmb{\beta}} \left\lbrace SSE=\sum_{i=1}^{N}\epsilon_i^2=\|\pmb{\epsilon}\|^2\right\rbrace\\ &=\arg\min_{\pmb{\beta}} \left\lbrace SSE=\|\mathbf{y}-\mathrm{X}\pmb{\beta}\|^2\right\rbrace\\ &=\arg\min_{\pmb{\beta}} \left\lbrace SSE=(\mathbf{y}-\mathrm{X}\pmb{\beta})^{T}(\mathbf{y}-\mathrm{X}\pmb{\beta})\right\rbrace\\ &=\arg\min_{\pmb{\beta}} \left\lbrace SSE=\mathbf{y}^{T}\mathbf{y}-\mathbf{y}^{T}\mathrm{X}\pmb{\beta}-\pmb{\beta}^{T}\mathrm{X}^{T}\mathbf{y}+\pmb{\beta}^{T}\mathrm{X}^{T}\mathrm{X}\pmb{\beta}\right\rbrace\\ \end{array}\]

The derivative of the SSE w.r.t. \(\pmb{\beta}\) is: \[\begin{array}{ll} \frac{\partial}{\partial \pmb{\beta}} SSE &=\frac{\partial}{\partial \pmb{\beta}} \left\lbrace \mathbf{y}^{T}\mathbf{y}-\mathbf{y}^{T}\mathrm{X}\pmb{\beta}-\pmb{\beta}^{T}\mathrm{X}^{T}\mathbf{y}+\pmb{\beta}^{T}\mathrm{X}^{T}\mathrm{X}\pmb{\beta}\right\rbrace\\ &=0-(\mathbf{y}^{T}\mathrm{X})^T -\mathrm{X}^{T}\mathbf{y}+\mathrm{X}^{T}\mathrm{X}\pmb{\beta}+(\mathrm{X}^{T}\mathrm{X})^T\pmb{\beta}\\ &= -2\mathrm{X}^T\mathbf{y}+2\mathrm{X}^{T}\mathrm{X}\pmb{\beta}\\ \end{array}\]

So the solution \(\pmb{\hat{\beta}}\) for which this derivative is 0 is \[\pmb{\hat{\beta}}=(\mathrm{X}^{T}\mathrm{X})^{-1}\mathrm{X}^{T}\mathbf{y} \quad \text{(Least Squares estimate)} \label{eq:LS}\]

This is the maximum likelihood estimate when considering a Normal distribution with the identity link function.
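To make this formula concrete, here is a minimal sketch in R (simulated data with arbitrarily chosen \(\pmb{\beta}\) and \(\sigma\), not part of the case study below) comparing the closed-form estimate with the coefficients returned by lm():

# sketch: closed-form Least Squares estimate on simulated data
set.seed(1)
N=50
x1=rnorm(N); x2=rnorm(N)
X=cbind(1,x1,x2)                       # design matrix with intercept column
beta=c(2,-1,0.5)                       # "true" parameters, chosen for illustration
y=drop(X%*%beta+rnorm(N,sd=2))         # simulated responses
betahat=solve(t(X)%*%X)%*%t(X)%*%y     # Least Squares estimate
cbind(betahat,coef(lm(y~x1+x2)))       # the two columns should agree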
10.2 Covariance of \(\hat{\beta}\)
Expectation of \(\hat{\beta}\) \[\mathbb{E}[\hat{\beta}]=\mathbb{E}\left\lbrack(\mathrm{X}^{T}\mathrm{X})^{-1}\mathrm{X}^{T}\mathbf{y} \right\rbrack=(\mathrm{X}^{T}\mathrm{X})^{-1}\mathrm{X}^{T}\mathbb{E}\left\lbrack\mathbf{y} \right\rbrack =(\mathrm{X}^{T}\mathrm{X})^{-1}\mathrm{X}^{T}\mathrm{X}\beta=\beta\]
Covariance of \(\hat{\beta}\) \[\mathbb{C}[\hat{\beta}]=\mathbb{E} \left\lbrack(\hat{\beta}-\beta)(\hat{\beta}-\beta)^T\right\rbrack=\mathbb{E} \left\lbrack((\mathrm{X}^{T}\mathrm{X})^{-1}\mathrm{X}^{T}\mathbf{y}-\beta)((\mathrm{X}^{T}\mathrm{X})^{-1}\mathrm{X}^{T}\mathbf{y}-\beta)^T\right\rbrack\] With \(\mathbf{y}=\mathrm{X}\beta+\epsilon\), we get \[\mathbb{C}[\hat{\beta}]=\mathbb{E} \left\lbrack ((\mathrm{X}^{T}\mathrm{X})^{-1}\mathrm{X}^{T} \epsilon)((\mathrm{X}^{T}\mathrm{X})^{-1}\mathrm{X}^{T} \epsilon)^T \right\rbrack =\mathbb{E} \left\lbrack (\mathrm{X}^{T}\mathrm{X})^{-1}\mathrm{X}^{T} \epsilon \epsilon^T ((\mathrm{X}^{T}\mathrm{X})^{-1}\mathrm{X}^{T})^T \right\rbrack\] and, since \(\mathbb{E}[\epsilon \epsilon^T]=\sigma^2 \mathrm{I}\), \[\mathbb{C}[\hat{\beta}] =\sigma^2 (\mathrm{X}^{T}\mathrm{X})^{-1}(\mathrm{X}^{T}\mathrm{X}) ((\mathrm{X}^{T}\mathrm{X})^{-1})^{T} =\sigma^2 (\mathrm{X}^{T}\mathrm{X})^{-1}\]
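Both results can be illustrated empirically. The sketch below (a simulation with a design matrix, \(\beta\) and \(\sigma\) chosen arbitrarily) repeatedly draws new noise and checks that the sample mean and sample covariance of \(\hat{\beta}\) are close to \(\beta\) and \(\sigma^2(\mathrm{X}^{T}\mathrm{X})^{-1}\):

# sketch: empirical check of the mean and covariance of the Least Squares estimate
set.seed(2)
N=30; sigma=1.5
X=cbind(1,runif(N,0,10))               # intercept and one covariate
beta=c(1,0.3)
betahats=replicate(5000,{
  y=X%*%beta+rnorm(N,sd=sigma)         # new noise at each replication
  drop(solve(t(X)%*%X)%*%t(X)%*%y)     # Least Squares estimate as a vector
})
rowMeans(betahats)                     # close to beta = (1, 0.3)
cov(t(betahats))                       # close to the theoretical covariance below
sigma^2*solve(t(X)%*%X)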
Remark: since the gradient of the log-likelihood is zero at \(\hat{\beta}\) (\(\nabla_{\hat{\beta}}\mathcal{L}=0\)), a second-order Taylor expansion around \(\hat{\beta}\) gives: \[\mathcal{L}(\beta) \approx \mathcal{L}(\hat{\beta}) +\frac{1}{2}\ (\beta-\hat{\beta})^T \ \mathrm{H}_{\hat{\beta}}\ (\beta-\hat{\beta})\] in comparison to the Gaussian approximation at \(\hat{\beta}\) \[\mathcal{L}(\beta) =\text{constant}-\frac{1}{2}\ (\beta-\hat{\beta})^T \ \Sigma^{-1}\ (\beta-\hat{\beta})\] \(-\mathrm{H}_{\hat{\beta}}\) is called the observed information matrix and approximates \(\Sigma^{-1}\). Hence the diagonal values of \((-\mathrm{H}_{\hat{\beta}})^{-1}\) approximate the squared standard errors of \(\hat{\beta}\).
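As an illustration of this remark, the sketch below (simulated data, with \(\sigma\) treated as known) minimises the negative log-likelihood numerically with optim(..., hessian = TRUE); the returned Hessian is the observed information, and the square roots of the diagonal of its inverse approximate the standard errors of \(\hat{\beta}\):

# sketch: standard errors from the observed information (sigma assumed known)
set.seed(3)
N=40; sigma=2
X=cbind(1,rnorm(N)); beta=c(1,2)
y=drop(X%*%beta+rnorm(N,sd=sigma))
negloglik=function(b) sum((y-X%*%b)^2)/(2*sigma^2)   # negative log-likelihood up to a constant
fit=optim(c(0,0),negloglik,method="BFGS",hessian=TRUE)
sqrt(diag(solve(fit$hessian)))         # Hessian-based standard errors
sqrt(diag(sigma^2*solve(t(X)%*%X)))    # analytic standard errors, essentially the same values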
10.3 Case study : Carbohydrate Diet
In this dataset (Dobson and Barnett 2008) (p.96), the response variable \(y\) corresponds to the percentage of total calories obtained from complex carbohydrates for 20 male insulin-dependent diabetics who have been on a high carbohydrate diet for six months. Additional information is collected about the individuals taking part in the study, including age (in years), weight (relative to ideal weight) and the percentage of calories obtained from protein.
10.3.1 Fitting with lm() and glm()
carbohydrate=c(33,40,37,27,30,43,34,48,30,38,50,51,30,36,41,42,46,24,35,37)
age=c(33,47,49,35,46,52,62,23,32,42,31,61,63,40,50,64,56,61,48,28)
weight=c(100,92,135,144,140,101,95,101,98,105,108,85,130,127,109,107,117,100,118,102)
protein=c(14,15,18,12,15,15,14,17,15,14,17,19,19,20,15,16,18,13,18,14)
data=cbind(carbohydrate,age,weight,protein)
knitr::kable(t(data), caption = "Carbohydrate diet data")
carbohydrate | 33 | 40 | 37 | 27 | 30 | 43 | 34 | 48 | 30 | 38 | 50 | 51 | 30 | 36 | 41 | 42 | 46 | 24 | 35 | 37 |
age | 33 | 47 | 49 | 35 | 46 | 52 | 62 | 23 | 32 | 42 | 31 | 61 | 63 | 40 | 50 | 64 | 56 | 61 | 48 | 28 |
weight | 100 | 92 | 135 | 144 | 140 | 101 | 95 | 101 | 98 | 105 | 108 | 85 | 130 | 127 | 109 | 107 | 117 | 100 | 118 | 102 |
protein | 14 | 15 | 18 | 12 | 15 | 15 | 14 | 17 | 15 | 14 | 17 | 19 | 19 | 20 | 15 | 16 | 18 | 13 | 18 | 14 |
#using lm
res.lm=lm(carbohydrate~age+weight+protein)
summary(res.lm)
##
## Call:
## lm(formula = carbohydrate ~ age + weight + protein)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.3424 -4.8203 0.9897 3.8553 7.9087
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.96006 13.07128 2.828 0.01213 *
## age -0.11368 0.10933 -1.040 0.31389
## weight -0.22802 0.08329 -2.738 0.01460 *
## protein 1.95771 0.63489 3.084 0.00712 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.956 on 16 degrees of freedom
## Multiple R-squared: 0.4805, Adjusted R-squared: 0.3831
## F-statistic: 4.934 on 3 and 16 DF, p-value: 0.01297
#using glm
res.glm=glm(carbohydrate~age+weight+protein,family=gaussian)
summary(res.glm)
##
## Call:
## glm(formula = carbohydrate ~ age + weight + protein, family = gaussian)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -10.3424 -4.8203 0.9897 3.8553 7.9087
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.96006 13.07128 2.828 0.01213 *
## age -0.11368 0.10933 -1.040 0.31389
## weight -0.22802 0.08329 -2.738 0.01460 *
## protein 1.95771 0.63489 3.084 0.00712 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 35.47893)
##
## Null deviance: 1092.80 on 19 degrees of freedom
## Residual deviance: 567.66 on 16 degrees of freedom
## AIC: 133.67
##
## Number of Fisher Scoring iterations: 2
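Note that, for the Gaussian family, the dispersion parameter reported above is simply the residual deviance divided by the residual degrees of freedom, i.e. the estimate \(\hat{\sigma}^2\); this can be checked directly from res.glm:

res.glm$deviance/res.glm$df.residual   # 567.66/16 = 35.47893
summary(res.glm)$dispersion            # same value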
10.3.2 Using Math
carbohydrate=c(33,40,37,27,30,43,34,48,30,38,50,51,30,36,41,42,46,24,35,37) # response vector
age=c(33,47,49,35,46,52,62,23,32,42,31,61,63,40,50,64,56,61,48,28)
weight=c(100,92,135,144,140,101,95,101,98,105,108,85,130,127,109,107,117,100,118,102)
protein=c(14,15,18,12,15,15,14,17,15,14,17,19,19,20,15,16,18,13,18,14)
#
X=matrix(1,20,4)   # design matrix: 20 observations, 4 columns
X[,1]=rep(1,20)    # intercept column of 1's
X[,2]=age
X[,3]=weight
X[,4]=protein
# least squares estimate: (X'X)^{-1} X'y
BetaHat=solve(t(X)%*%X)%*%t(X)%*%carbohydrate
Compare the estimated parameters \(\hat{\beta}\)
BetaHat
## [,1]
## [1,] 36.9600559
## [2,] -0.1136764
## [3,] -0.2280174
## [4,] 1.9577126
with those computed with glm and lm earlier.
res.lm$coefficients
## (Intercept) age weight protein
## 36.9600559 -0.1136764 -0.2280174 1.9577126
res.glm$coefficients
## (Intercept) age weight protein
## 36.9600559 -0.1136764 -0.2280174 1.9577126
So we have the maths right for computing the estimate \(\hat{\beta}\)!
fitted=X%*%BetaHat                        # fitted values
residuals=carbohydrate-fitted             # residuals
SSE=t(residuals)%*%residuals              # sum of squared errors
sigmahatsq=SSE/(length(carbohydrate)-length(BetaHat))  # estimate of sigma^2 = SSE/(N-p)
#Residual standard error
sqrt(sigmahatsq)
## [,1]
## [1,] 5.956419
Compare this estimate with the Residual standard error computed with lm.
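For instance, the value reported by lm can be extracted directly (reusing res.lm fitted earlier):

summary(res.lm)$sigma   # Residual standard error reported by lm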
# uncertainty of BetaHat (standard error)
CovBetaHat=solve(t(X)%*%X)*as.numeric(sigmahatsq)
Std.Error.BetaHat=sqrt(diag(CovBetaHat))
Std.Error.BetaHat
## [1] 13.07128293 0.10932548 0.08328895 0.63489286
Compare these standard errors with those reported in summary(res.lm) and summary(res.glm).
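For instance, the standard-error column of the coefficient tables can be extracted directly (reusing the fits from earlier):

summary(res.lm)$coefficients[,"Std. Error"]
summary(res.glm)$coefficients[,"Std. Error"]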
#Multiple R-squared (coefficient of determination)
S0=sum((carbohydrate-mean(carbohydrate))^2)  # total sum of squares
Rsq=(S0-SSE)/S0
Rsq
## [,1]
## [1,] 0.4805428
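This matches the Multiple R-squared reported by summary(res.lm); the same quantity (and its adjusted version) can also be extracted directly:

summary(res.lm)$r.squared       # Multiple R-squared
summary(res.lm)$adj.r.squared   # Adjusted R-squared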
References
Dobson, A. J., and A. G. Barnett. 2008. An Introduction to Generalized Linear Models. 3rd ed. CRC Press.