Chapter 2 Introduction

Problems in Statistics (or Data Science) often start with a dataset of \(N\) observations \(\lbrace y^{(i)},x^{(i)}\rbrace_{i=1,\cdots,N}\), and Mathematics provides several abstract objects (e.g. variables, vectors, functions) into which one can plug this data. In the models seen in this course, with each observation \((y^{(i)},x^{(i)})\) we associate \((y_{i},x_{i})\), where \(y_i\) is a variable (called the response variable) and \(x_i\) is a vector made up of several variables (called the explanatory variables). This one-to-one mapping between a data point \((y^{(i)},x^{(i)})\) and the variables \((y_{i},x_{i})\) makes it pointless to distinguish them in the notation - we only use the notation \((y_{i},x_{i})\), conflating the two meanings. It is worth pointing out, however, that the likelihood function \(p(y_1,y_2,\cdots,y_N|x_1,x_2,\cdots,x_N,\beta)\) is a positive function defined on an \(N\)-dimensional space such that \[\int\int \cdots\int p(y_1,y_2,\cdots,y_N|x_1,x_2,\cdots,x_N,\beta)\ dy_1 \ dy_2\cdots dy_N=1\] and such a statement only makes sense when considering variables (e.g. \(y_i\)) and not data (e.g. \(y^{(i)}\)). This chapter gives a quick overview of Linear Regression and of how its premises can be reformulated. We then give a quick introduction to this course, pointing out how the premises of Linear Regression will be extended.

2.1 Linear Regression - Reminder

In Linear Regression, we have a set of observations \(\lbrace (y_i,x_i)\rbrace_{i=1,\cdots,N}\) such that the following linear relationship holds \(\forall i\): \[\begin{array}{ll} y_i&=\beta_0+\beta_1 \ x_{1i}+\cdots + \beta_{k}\ x_{ki}+\epsilon_i\\ &= \beta^{T}x_i +\epsilon_i \\ \end{array} \label{eq:LR}\] where

  • \(y_i\in \mathbb{R}\) is the outcome or the observed value for the response variable,

  • \(x_i=(1, x_{1i},\cdots,x_{ki})^T \in \mathbb{R}^{k+1}\) is a vector collating the values of the explanatory variables associated with the outcome \(y_i\),

  • \(\epsilon_i\) is the noise, residual or error associated with the outcome \(y_i\). The errors are assumed to be independent, each with a Normal distribution with mean \(0\) and variance \(\sigma^2\): \[p_{\epsilon}(\epsilon)=\frac{\exp\left(\frac{-\epsilon^2}{2\sigma^2}\right)}{\sqrt{2\pi} \sigma}\] A short simulation sketch of this model is given after this list.
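To make the model concrete, here is a minimal simulation sketch in Python; the sample size, the coefficients \(\beta\) and the noise level \(\sigma\) are illustrative choices, not values taken from the course.

```python
import numpy as np

rng = np.random.default_rng(0)

N, k = 200, 2                       # number of observations and explanatory variables
beta = np.array([1.0, 2.0, -0.5])   # (beta_0, beta_1, beta_2), chosen for illustration
sigma = 0.3                         # standard deviation of the Normal noise

# rows of X are the vectors x_i = (1, x_{1i}, ..., x_{ki})
X = np.column_stack([np.ones(N), rng.normal(size=(N, k))])
eps = rng.normal(loc=0.0, scale=sigma, size=N)   # epsilon_i ~ N(0, sigma^2)
y = X @ beta + eps                               # y_i = beta^T x_i + epsilon_i
```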

The best parameters \(\beta\) are then estimated as the ones that maximise the joint probability density function of all the residuals: \[\hat{\beta} =\arg\max_{\beta} p(\epsilon_1,\cdots,\epsilon_N) \label{eq:LR:ML}\] Because the residuals are independent and all follow the same distribution \(p_{\epsilon}\), their joint density function is: \[p(\epsilon_1,\cdots,\epsilon_N)=\prod_{i=1}^N p_{\epsilon}(\epsilon_i) =\prod_{i=1}^N \left( \frac{\exp\left(\frac{-\epsilon_i^2}{2\sigma^2}\right)}{\sqrt{2\pi} \sigma} \right) =\prod_{i=1}^N \left(\frac{\exp\left(\frac{-(y_i- \beta^{T} x_i)^2}{2\sigma^2}\right)}{\sqrt{2\pi} \sigma}\right)\] Taking the logarithm of this joint density function shows that the maximum likelihood estimate of the parameters \(\beta\) in \eqref{eq:LR:ML} is in fact computed by minimising the sum of squared errors: \[\hat{\beta} =\arg\min_{\beta} \left\lbrace \sum_{i=1}^N \epsilon_i^2 =\sum_{i=1}^N (y_i- \beta^{T} x_i)^2 \right\rbrace\] A forecast \(\hat{y}=\hat{\beta}^{T}x\) for the response can then be computed for any chosen input \(x\), and confidence intervals can also be computed to quantify the uncertainty associated with \(\hat{y}\).
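Continuing the simulated data from the sketch above, the estimation step can be illustrated as follows: under Gaussian noise the maximum likelihood estimate coincides with the least-squares solution, which `numpy` computes directly. The new input `x_new` is an arbitrary illustrative choice.

```python
import numpy as np

# least-squares estimate: minimises sum_i (y_i - beta^T x_i)^2
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

x_new = np.array([1.0, 0.5, -1.0])   # a chosen input x (with leading 1)
y_hat = beta_hat @ x_new             # point forecast y_hat = beta_hat^T x
```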

2.2 Linear Regression - Reformulation

Consider the equation \(y=\beta^Tx+\epsilon\) (without the index \(i\)) for a moment. If \(x\) and \(\beta\) are given, then the only uncertainty about the response \(y\) comes from \(\epsilon\), and it has the same properties as the uncertainty defined for \(\epsilon\). In other words, we have: \[p_{y|\beta,x}(y|\beta,x)=p_{\epsilon}(y-\beta^Tx)\] where \(p_{y|\beta,x}\) is the probability density function of \(y\) given \(x\) and \(\beta\) (this can be understood as a change of variable \(\epsilon=y-\beta^Tx\)). So saying that the error has a Normal distribution with mean \(0\) and variance \(\sigma^2\) is the same as assuming that the conditional probability density function of the response \(y\) given the parameters \(\beta\) and the explanatory variables \(x\) is: \[p_{y|\beta,x}(y|\beta,x)=\frac{\exp\left(\frac{-(y-\beta^Tx)^2}{2\sigma^2}\right)}{\sqrt{2\pi} \sigma}\] which is a Normal distribution with mean \(\beta^T x\) and variance \(\sigma^2\). So we can in fact define the Linear Regression problem without explicitly introducing a variable \(\epsilon\), as follows:
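A small numerical check of this change of variable, reusing the illustrative \(\beta\), \(\sigma\) and simulated data `X`, `y` from the sketches in Section 2.1: evaluating the residual density at \(y-\beta^Tx\) gives the same value as the Normal\((\beta^Tx,\sigma^2)\) density evaluated at \(y\).

```python
import numpy as np
from scipy.stats import norm

i = 0                     # any observation index
mu_i = X[i] @ beta        # beta^T x_i
lhs = norm.pdf(y[i] - mu_i, loc=0.0, scale=sigma)   # p_eps(y_i - beta^T x_i)
rhs = norm.pdf(y[i], loc=mu_i, scale=sigma)         # p_{y|beta,x}(y_i | beta, x_i)
assert np.isclose(lhs, rhs)
```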

Definition 2.1 (Premises of Linear Regression). Consider a set of observations \(\lbrace (y_i,x_i)\rbrace_{i=1,\cdots,N}\) collected independently, such that \(\forall i\)

  • the response \(y_i\) is normally distributed

  • with mean \(\mathbb{E}[y_i]=\beta^{T}x_i\) (and variance \(\mathbb{E}[(y_i-\beta^{T}x_i)^2]=\sigma^2\)).

Then \(y_i \sim p_{y|x,\beta}(y_i|x_i,\beta)\) is fully defined, and the parameters can be estimated using the likelihood function: \[\hat{\beta}=\arg\max_{\beta} \left \lbrace \mathcal{L}(\beta)=p(y_1,\cdots,y_N|x_1,\cdots,x_N,\beta) = \prod_{i=1}^N p_{y|x,\beta}(y_i|x_i,\beta) \right\rbrace\] providing a parametric model \(p_{y|x,\beta}(y|x,\hat{\beta})\).
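As a sketch of Definition 2.1 in code, the Gaussian likelihood can also be maximised numerically, here with \(\sigma\) held fixed at its illustrative value and `X`, `y` taken from the simulation sketch above; the result is essentially the least-squares estimate \(\hat{\beta}\).

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def neg_log_likelihood(beta):
    mu = X @ beta                                     # E[y_i] = beta^T x_i
    return -np.sum(norm.logpdf(y, loc=mu, scale=sigma))

res = minimize(neg_log_likelihood, x0=np.zeros(X.shape[1]))
beta_ml = res.x   # numerically close to the least-squares estimate beta_hat
```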

2.3 Generalised Linear Models

In a nutshell, this course will generalise the premises used for Linear Regression as follows:

  • The probability density function \(p_{y|x,\beta}\) is a member of the exponential family of distributions. The Normal distribution is a member of that family, but other distributions are also available to deal with various situations, such as when the outcome \(y\) is not an element of \(\mathbb{R}\) (as assumed by the Normal distribution) but is, for instance, binary (e.g. the outcome \(y\) indicates a failure \(0\) or a success \(1\)).

  • The expectation of \(y\) given \(x\) and \(\beta\), defined by: \[\mathbb{E}[y]=\int_{\mathbb{Y}} y \ p_{y|x,\beta}(y|x,\beta) \ dy, \quad \text{with } \mathbb{Y} \text{ the domain of definition of the outcome $y$,}\] is now related to the explanatory variables through a link function \(g\) such that \[g\left(\mathbb{E}[y]\right)=\beta^{T}x\] For instance, in the case of Linear Regression the link function \(g\) that we used is the identity function, defined by \(g(z)=z\) for \(z\in\mathbb{R}\). The link function \(g\) will be chosen to be bijective, so that its inverse \(g^{-1}\) exists. In general, this function maps the space of the expectation \(\mathbb{E}[y]\) to the space \(\mathbb{R}\) where \(\beta^T x\) takes its values. A minimal sketch for a binary outcome with its corresponding link function is given after this list.
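As announced in the list above, here is a minimal GLM sketch for a binary outcome: a Bernoulli response with the logit link \(g(\mu)=\log\bigl(\mu/(1-\mu)\bigr)\), so that \(g(\mathbb{E}[y])=\beta^Tx\), fitted by maximum likelihood. The simulated data, the true coefficients and the generic optimiser are illustrative choices only.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit   # inverse logit: g^{-1}(z) = 1 / (1 + exp(-z))

rng = np.random.default_rng(1)
N = 500
X_bin = np.column_stack([np.ones(N), rng.normal(size=N)])
beta_true = np.array([-0.5, 1.5])                    # illustrative coefficients
y_bin = rng.binomial(1, expit(X_bin @ beta_true))    # binary outcomes: 0 (failure) or 1 (success)

def neg_log_likelihood(beta):
    p = expit(X_bin @ beta)   # E[y | x, beta] = g^{-1}(beta^T x)
    return -np.sum(y_bin * np.log(p) + (1 - y_bin) * np.log(1 - p))

beta_hat_glm = minimize(neg_log_likelihood, x0=np.zeros(2)).x
```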