Chapter 7 Deviance

7.1 Deviance

The deviance is a log-likelihood ratio statistic that compares the saturated model with the proposed GLM. We have already introduced the following terms:

  • the maximum likelihood estimates for the saturated model are computed by: \[(\hat{\theta}_1,\cdots,\hat{\theta}_{N})=\arg\max \left \lbrace \log \mathcal{L}(\theta_1,\cdots,\theta_{N}) \right \rbrace\] and the value \(\log \mathcal{L}(\hat{\theta}_1,\cdots,\hat{\theta}_{N})\) is therefore the maximum value of the log-likelihood function of the saturated model (a Poisson illustration is given just after this list).

  • when considering a generalised linear model, a link function \(g\) is used to constrain the parameters such that \(\theta_i\propto g^{-1}(x_i^{T}\beta),\ \forall i=1,\cdots,N\). In this case the likelihood is written \(\mathcal{L}(\beta)\) and the log-likelihood is \(\log \mathcal{L}(\beta)\). The parameter \(\beta\) often has a lower dimension than \(\theta=(\theta_1,\cdots,\theta_N)^T\) (i.e. \(\dim(\beta)\leq N\)), and the maximum likelihood estimate is computed as: \[\hat{\beta}=\arg\max\ \left\lbrace \log \mathcal{L}(\beta) \right\rbrace\] The maximum log-likelihood value for the GLM is then \(\log \mathcal{L}(\hat{\beta})\).
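
As a concrete illustration, take a Poisson response (so that \(\theta_i\) is the mean of \(y_i\)); maximising the saturated log-likelihood then gives one fitted parameter per observation: \[\log \mathcal{L}(\theta_1,\cdots,\theta_N)=\sum_{i=1}^{N}\left( y_i\log\theta_i-\theta_i-\log y_i!\right), \qquad \frac{\partial \log\mathcal{L}}{\partial \theta_i}=\frac{y_i}{\theta_i}-1=0 \quad\Rightarrow\quad \hat{\theta}_i=y_i.\]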

Definition 7.1 (Deviance) The deviance, also called the log-likelihood (ratio) statistic, is defined by: \[D=2\ \left\lbrace\log \mathcal{L}(\hat{\theta}_1,\cdots,\hat{\theta}_{N}) -\log \mathcal{L}(\hat{\beta}) \right \rbrace= 2 \ \log\left( \frac{ \mathcal{L}(\hat{\theta}_1,\cdots,\hat{\theta}_{N})}{\mathcal{L}(\hat{\beta})} \right)\] where

  • \(\log \mathcal{L}(\hat{\theta}_1,\cdots,\hat{\theta}_{N})\) is the maximum value of the log-likelihood function for the saturated model, and

  • \(\log \mathcal{L}(\hat{\beta})\) is the value of the log-likelihood function when fitting the model \(g(\mathbb{E}[y])=\mathbf{x}^T\pmb{\hat{\beta}}\).

The deviance is given as an output in R when fitting GLMs.
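
To make Definition 7.1 concrete, here is a minimal sketch in R (the simulated Poisson data and object names are purely illustrative): the deviance reported by glm() is twice the gap between the saturated and fitted maximum log-likelihoods.

```r
set.seed(1)
x   <- rnorm(30)
y   <- rpois(30, lambda = exp(0.5 + 0.2 * x))   # simulated Poisson counts
fit <- glm(y ~ x, family = poisson)             # GLM of interest (p = 2 parameters)

loglik.fitted    <- sum(dpois(y, lambda = fitted(fit), log = TRUE))   # log L(beta-hat)
loglik.saturated <- sum(dpois(y, lambda = y, log = TRUE))             # log L(theta-hat): theta_i-hat = y_i
2 * (loglik.saturated - loglik.fitted)   # the deviance D of Definition 7.1 ...
deviance(fit)                            # ... matches the "Residual deviance" reported by R
```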

7.2 Approximation of the log likelihood function near its maximum

Using a Taylor expansion, the log-likelihood can be approximated near the maximum likelihood estimate:

  • for the saturated model with notation \((\theta_1,\cdots,\theta_N)^T=\theta\), when \(\theta\) is close to \(\hat{\theta}\): \[\log \mathcal{L}(\theta)\simeq\log \mathcal{L}(\hat{\theta})+ (\theta-\hat{\theta})^T \ \nabla_{\hat{\theta}} +\frac{1}{2} (\theta-\hat{\theta})^T \mathrm{H}_{\hat{\theta}} \ (\theta-\hat{\theta})\] with \(\nabla_{\hat{\theta}}\) the gradient of the log-likelihood function at \(\hat{\theta}\) and \(\mathrm{H}_{\hat{\theta}}\) the Hessian matrix of the log-likelihood function at \(\hat{\theta}\) for the saturated model.

  • similarly for the GLM model, when \(\beta\) is close to \(\hat{\beta}\): \[\log \mathcal{L}(\beta)\simeq\log \mathcal{L}(\hat{\beta})+ (\beta-\hat{\beta})^T \ \nabla_{\hat{\beta}} + \frac{1}{2} (\beta-\hat{\beta})^T \mathrm{H}_{\hat{\beta}} \ (\beta-\hat{\beta})\]

In both cases, the gradients \(\nabla_{\hat{\beta}}\) and \(\nabla_{\hat{\theta}}\) are zero vectors (since \(\hat{\beta}\) and \(\hat{\theta}\) maximise their respective log-likelihoods!). Rearranging the expansions above to express \(\log \mathcal{L}(\hat{\theta})\) and \(\log \mathcal{L}(\hat{\beta})\), the deviance can be approximated by: \[\begin{array}{ll} D&\simeq 2\ \left\lbrace\log \mathcal{L}(\hat{\theta}) -\log \mathcal{L}(\hat{\beta}) \right \rbrace\\ &\\ &\simeq 2\ \left\lbrace \log \mathcal{L}(\theta) - \frac{1}{2} (\theta-\hat{\theta})^T \mathrm{H}_{\hat{\theta}} \ (\theta-\hat{\theta}) - \log \mathcal{L}(\beta) +\frac{1}{2} (\beta-\hat{\beta})^T \mathrm{H}_{\hat{\beta}} \ (\beta-\hat{\beta}) \right\rbrace\\ &\\ &\simeq \underbrace{2\ \log\left ( \frac{\mathcal{L}(\theta)}{\mathcal{L}(\beta)} \right)}_{\nu} -(\theta-\hat{\theta})^T \mathrm{H}_{\hat{\theta}} \ (\theta-\hat{\theta}) + (\beta-\hat{\beta})^T \mathrm{H}_{\hat{\beta}} \ (\beta-\hat{\beta})\\ \end{array}\] The term \(\nu\) is positive and will be near zero if the GLM fits the data almost as well as the saturated model does. Note that the Hessian matrices computed at the maxima, \(\mathrm{H}_{\hat{\beta}}\) and \(\mathrm{H}_{\hat{\theta}}\), are negative definite.
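
The quadratic behaviour near the maximum can be checked numerically; here is a brief sketch (assuming, for illustration, an i.i.d. Poisson sample with a single mean parameter, so the expansion is one-dimensional):

```r
set.seed(2)
y <- rpois(50, lambda = 3)
loglik <- function(mu) sum(dpois(y, lambda = mu, log = TRUE))   # exact log-likelihood
mu.hat <- mean(y)                # maximum likelihood estimate of the Poisson mean
H      <- -sum(y) / mu.hat^2     # Hessian (second derivative) of the log-likelihood at mu.hat

mu <- mu.hat + 0.3               # a parameter value close to the MLE
loglik(mu)                                   # exact value
loglik(mu.hat) + 0.5 * (mu - mu.hat)^2 * H   # quadratic (Taylor) approximation: the gradient term is zero
```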

7.3 Sampling distribution for the deviance

The likelihood for \(\beta\) can be approximated by a Normal distribution near the estimate \(\hat{\beta}\) such that \(\mathcal{L}(\beta)\propto p_{\beta}(\beta)\) with: \[\begin{array}{c} p_{\beta}(\beta) =\frac{1}{\sqrt{(2\pi)^{\dim(\beta)} |\Sigma|} } \exp\left\lbrack -\frac{1}{2} (\beta-\hat{\beta})^T \Sigma^{-1} (\beta-\hat{\beta})\right\rbrack\\ \text{or}\\ \log p_{\beta}(\beta) = -\log\left( \sqrt{(2\pi)^{\dim(\beta)} |\Sigma|}\right) -\frac{1}{2} (\beta-\hat{\beta})^T \Sigma^{-1} (\beta-\hat{\beta}) \end{array}\] Comparing with \[\log \mathcal{L}(\beta)\simeq\log \mathcal{L}(\hat{\beta})+ \frac{1}{2} (\beta-\hat{\beta})^T \mathrm{H}_{\hat{\beta}} \ (\beta-\hat{\beta})\] we can identify the inverse covariance matrix with minus the Hessian matrix of the log-likelihood computed at \(\hat{\beta}\): \[\Sigma^{-1}=-\mathrm{H}_{\hat{\beta}}\] The same Normal approximation applies to the saturated model, with parameters \(\theta\) and \(\Sigma^{-1}=-\mathrm{H}_{\hat{\theta}}\).

Remember that a covariance matrix is a real, symmetric, positive definite matrix that can be rewritten by eigendecomposition as: \[\Sigma=\mathrm{U}^T{\Lambda}\mathrm{U}\] with \(\mathrm{U}\) an orthogonal matrix and \(\Lambda=\Upsilon^2\) the diagonal matrix collecting the positive eigenvalues of \(\Sigma\). Writing the quadratic form for the saturated model, we have \[\begin{array}{ll} (\theta-\hat{\theta})^T \Sigma^{-1} (\theta-\hat{\theta}) &= (\mathrm{U}\ (\theta-\hat{\theta}))^T \Lambda^{-1} (\mathrm{U}\ (\theta-\hat{\theta}))\\ & = (\Upsilon^{-1}\ \mathrm{U}\ (\theta-\hat{\theta}))^T \ \mathrm{I} \ (\Upsilon^{-1}\ \mathrm{U}\ (\theta-\hat{\theta})) \\ \end{array}\] where \(\mathrm{I}\) is the identity matrix. So with the change of variable \(\vec{z}=\Upsilon^{-1}\ \mathrm{U}\ (\theta-\hat{\theta})\), we have \(\vec{z}\sim\mathcal{N}(0,\mathrm{I})\) (and the same holds for \(\beta\)).
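
As a quick numerical check of this decomposition, here is a sketch reusing the hypothetical fit object from the first example; for a fitted GLM, vcov() in R returns the estimated covariance matrix \(\Sigma\) of \(\hat{\beta}\) in this Normal approximation:

```r
Sigma  <- vcov(fit)           # estimated covariance of the Normal approximation around beta-hat
e      <- eigen(Sigma)        # positive eigenvalues and orthogonal eigenvectors
U      <- t(e$vectors)        # chosen so that Sigma = U^T Lambda U, as in the text
Lambda <- diag(e$values)
max(abs(Sigma - t(U) %*% Lambda %*% U))   # numerically zero
```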

Definition 7.2 (Central \(\chi^2\) distribution) The central \(\chi^2\) distribution with \(k\) degrees of freedom is defined as the distribution of the sum of squares of \(k\) independent random variables \(z_1\), ..., \(z_k\) such that \(z_i\sim \mathcal{N}(0,1), \forall i=1,\cdots,k\): \[x=\sum_{i=1}^k z_i^2 \quad \text{then } x\sim \chi^2(k)\] with the distribution \(\chi^2(k)\) defined as: \[p_{x}(x) = \left\lbrace\begin{array}{l} \frac{1}{2^{k/2} \Gamma(k/2)} x^{k/2-1} \exp\left(\frac{-x}{2}\right) \quad \text{when } x\geq 0\\ 0 \quad \text{otherwise}\\ \end{array}\right.\] with the function \(\Gamma\) defined as \[\Gamma(k)=\int_{0}^{+\infty} t^{k-1} \ \exp(-t) \ dt\] \(\Gamma(1/2)=\sqrt{\pi}\), \(\Gamma(1)=1\) and \(\Gamma(y+1)=y\ \Gamma(y)\ \forall y>0\). The expectation of \(x\) is \(\mathbb{E}[x]=k\).
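
A short simulation sketch (with an arbitrary choice of \(k\)) illustrates the definition and the stated expectation:

```r
k <- 5
x <- replicate(10000, sum(rnorm(k)^2))      # sums of squares of k independent N(0,1) draws
mean(x)                                     # close to k, the expectation of a chi-squared(k) variable
c(quantile(x, 0.95), qchisq(0.95, df = k))  # empirical and theoretical 95% quantiles agree closely
```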

For the deviance,

  • after the change of variable above, the term \(- (\theta-\hat{\theta})^T \mathrm{H}_{\hat{\theta}} \ (\theta-\hat{\theta})\) is a sum of squares of \(N\) independent variables with distribution \(\mathcal{N}(0,1)\), so this term follows a \(\chi^2(\dim(\theta))\) distribution with \(\dim(\theta)=N\) for the saturated model

  • similarly, the term \(- (\beta-\hat{\beta})^T \mathrm{H}_{\hat{\beta}} \ (\beta-\hat{\beta})\) will follow a \(\chi^2(\dim(\beta))\) distribution with \(\dim(\beta)=p\).

Definition 7.3 (Sampling distribution of the deviance) The sampling distribution of the deviance is approximated by a \(\chi^2\) distribution with \(\dim(\theta)-\dim(\beta)\) degrees of freedom and non-centrality parameter \(\nu\): \[D\sim \chi^2(\dim(\theta)-\dim(\beta),\nu)\] with \(\dim(\theta)=N\) the number of parameters in the saturated model, \(\dim(\beta)=p\) the number of parameters in the GLM of interest, and \(\nu\) a positive constant near 0 when the model of interest fits almost as well as the saturated model, in which case: \[D\sim \chi^2(N-p,0)\]
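
In practice this approximation is used as a rough goodness-of-fit check. A brief sketch, reusing the hypothetical Poisson fit from the first example, compares the residual deviance to its \(\chi^2(N-p)\) reference distribution:

```r
D  <- deviance(fit)        # deviance of the fitted GLM (first sketch above)
df <- df.residual(fit)     # degrees of freedom N - p
c(D = D, df = df)
pchisq(D, df = df, lower.tail = FALSE)   # approximate p-value: a small value suggests lack of fit
```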