Chapter 2 Introduction

Problems in Statistics (or Data Science) often start with a dataset of \(N\) observations \(\lbrace y^{(i)},x^{(i)}\rbrace_{i=1,\cdots,N}\), and Mathematics provides several abstract objects (e.g. variables, vectors, functions) into which one can plug this data. In the models seen in this course, with each observation \((y^{(i)},x^{(i)})\) we associate \((y_{i},x_{i})\), where \(y_i\) is a variable (called the response variable) and \(x_i\) is a vector made up of several variables (called the explanatory variables). This one-to-one mapping between a data point \((y^{(i)},x^{(i)})\) and the variables \((y_{i},x_{i})\) makes it pointless to distinguish them in the notation - we only use the notation \((y_{i},x_{i})\), conflating the two meanings. It is worth pointing out, however, that the likelihood function \(p(y_1,y_2,\cdots,y_N|x_1,x_2,\cdots,x_N,\beta)\) is a positive function defined on an \(N\)-dimensional space such that \[\int\int \cdots\int p(y_1,y_2,\cdots,y_N|x_1,x_2,\cdots,x_N,\beta)\ dy_1 \ dy_2\cdots dy_N=1\] and such a statement only makes sense when considering variables (e.g. \(y_i\)) and not data (e.g. \(y^{(i)}\)). This chapter gives a quick overview of Linear Regression and of how its premises can be reformulated. We then give a quick introduction to this course, pointing out how the premises of Linear Regression will be extended.

2.1 Linear Regression - Reminder

In Linear Regression, we have a set of observations \(\lbrace (y_i,x_i)\rbrace_{i=1,\cdots,N}\) such that the following linear relationship holds \(\forall i\): \[\begin{array}{ll} y_i&=\beta_0+\beta_1 \ x_{1i}+\cdots + \beta_{k}\ x_{ki}+\epsilon_i\\ &= \beta^{T}x_i +\epsilon_i \\ \end{array} \label{eq:LR}\] where

  • \(y_i\in \mathbb{R}\) is the outcome or the observed value for the response variable,

  • \(x_i=(1, x_{1i},\cdots,x_{ki})^T \in \mathbb{R}^{k+1}\) is a vector collating the values of the explanatory variables associated with the outcome \(y_i\),

  • \(\epsilon_i\) is the noise, residual or error associated with the outcome \(y_i\). The errors are assumed to be independent, each with a Normal distribution with mean \(0\) and variance \(\sigma^2\): \[p_{\epsilon}(\epsilon)=\frac{\exp\left(\frac{-\epsilon^2}{2\sigma^2}\right)}{\sqrt{2\pi} \sigma}\] A short simulation sketch of this model is given after this list.
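To make the model concrete, here is a minimal simulation sketch in Python; the sample size, the coefficients \(\beta\) and the noise level \(\sigma\) are illustrative choices, not values taken from the course.

```python
import numpy as np

rng = np.random.default_rng(0)

N, k = 200, 2                       # number of observations and explanatory variables
beta = np.array([1.0, 2.0, -0.5])   # (beta_0, beta_1, beta_2), chosen for illustration
sigma = 0.3                         # standard deviation of the Normal noise

# rows of X are the vectors x_i = (1, x_{1i}, ..., x_{ki})
X = np.column_stack([np.ones(N), rng.normal(size=(N, k))])
eps = rng.normal(loc=0.0, scale=sigma, size=N)   # epsilon_i ~ N(0, sigma^2)
y = X @ beta + eps                               # y_i = beta^T x_i + epsilon_i
```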

The best parameters \(\beta\) are then estimated as the ones that maximise the joint probability density function of all the residuals: \[\hat{\beta} =\arg\max_{\beta} p(\epsilon_1,\cdots,\epsilon_N) \label{eq:LR:ML}\] Because the residuals are independent and all follow the same distribution \(p_{\epsilon}\), their joint density function is: \[p(\epsilon_1,\cdots,\epsilon_N)=\prod_{i=1}^N p_{\epsilon}(\epsilon_i) =\prod_{i=1}^N \left( \frac{\exp\left(\frac{-\epsilon_i^2}{2\sigma^2}\right)}{\sqrt{2\pi} \sigma} \right) =\prod_{i=1}^N \left(\frac{\exp\left(\frac{-(y_i- \beta^{T} x_i)^2}{2\sigma^2}\right)}{\sqrt{2\pi} \sigma}\right)\] Taking the logarithm of this joint density function shows that the maximum likelihood estimate of the parameters \(\beta\) in \eqref{eq:LR:ML} is in fact computed by minimising the sum of squared errors: \[\hat{\beta} =\arg\min_{\beta} \left\lbrace \sum_{i=1}^N \epsilon_i^2 =\sum_{i=1}^N (y_i- \beta^{T} x_i)^2 \right\rbrace\] A forecast \(\hat{y}=\hat{\beta}^{T}x\) for the response can then be computed for any chosen input \(x\), and confidence intervals can also be computed to quantify the uncertainty associated with \(\hat{y}\).
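Continuing the simulated data from the sketch above, the estimation step can be illustrated as follows: under Gaussian noise the maximum likelihood estimate coincides with the least-squares solution, which `numpy` computes directly. The new input `x_new` is an arbitrary illustrative choice.

```python
import numpy as np

# least-squares estimate: minimises sum_i (y_i - beta^T x_i)^2
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

x_new = np.array([1.0, 0.5, -1.0])   # a chosen input x (with leading 1)
y_hat = beta_hat @ x_new             # point forecast y_hat = beta_hat^T x
```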

2.2 Linear Regression - Reformulation

Consider the equation \(y=\beta^Tx+\epsilon\) (without the index \(i\)) for a moment. If \(x\) and \(\beta\) are given, then the only uncertainty about the response \(y\) comes from \(\epsilon\), and it has the same properties as the uncertainty defined for \(\epsilon\). In other words, we have: \[p_{y|\beta,x}(y|\beta,x)=p_{\epsilon}(y-\beta^Tx)\] where \(p_{y|\beta,x}\) is the probability density function of \(y\) given \(x\) and \(\beta\) (this can be understood as a change of variable \(\epsilon=y-\beta^Tx\)). So saying that the error has a Normal distribution with mean \(0\) and variance \(\sigma^2\) is the same as assuming that the conditional probability density function of the response \(y\) given the parameters \(\beta\) and the explanatory variables \(x\) is: \[p_{y|\beta,x}(y|\beta,x)=\frac{\exp\left(\frac{-(y-\beta^Tx)^2}{2\sigma^2}\right)}{\sqrt{2\pi} \sigma}\] which is a Normal distribution with mean \(\beta^T x\) and variance \(\sigma^2\). So we can in fact define the Linear Regression problem without explicitly introducing a variable \(\epsilon\), as follows:
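A small numerical check of this change of variable, reusing the illustrative \(\beta\), \(\sigma\) and simulated data `X`, `y` from the sketches in Section 2.1: evaluating the residual density at \(y-\beta^Tx\) gives the same value as the Normal\((\beta^Tx,\sigma^2)\) density evaluated at \(y\).

```python
import numpy as np
from scipy.stats import norm

i = 0                     # any observation index
mu_i = X[i] @ beta        # beta^T x_i
lhs = norm.pdf(y[i] - mu_i, loc=0.0, scale=sigma)   # p_eps(y_i - beta^T x_i)
rhs = norm.pdf(y[i], loc=mu_i, scale=sigma)         # p_{y|beta,x}(y_i | beta, x_i)
assert np.isclose(lhs, rhs)
```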

Definition 2.1 (Premises of Linear Regression). Consider a set of observations \(\lbrace (y_i,x_i)\rbrace_{i=1,\cdots,N}\) collected independently, such that \(\forall i\)

  • the response \(y_i\) is normally distributed

  • with mean \(\mathbb{E}[y_i]=\beta^{T}x_i\) (and variance \(\mathbb{E}[(y_i-\beta^{T}x_i)^2]=\sigma^2\)).

Then \(y_i \sim p_{y|x,\beta}(y_i|x_i,\beta)\) is fully defined, and the parameters can be estimated using the likelihood function: \[\hat{\beta}=\arg\max_{\beta} \left \lbrace \mathcal{L}(\beta)=p(y_1,\cdots,y_N|x_1,\cdots,x_N,\beta) = \prod_{i=1}^N p_{y|x,\beta}(y_i|x_i,\beta) \right\rbrace\] providing a parametric model \(p_{y|x,\beta}(y|x,\hat{\beta})\).
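As a sketch of Definition 2.1 in code, the Gaussian likelihood can also be maximised numerically, here with \(\sigma\) held fixed at its illustrative value and `X`, `y` taken from the simulation sketch above; the result is essentially the least-squares estimate \(\hat{\beta}\).

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def neg_log_likelihood(beta):
    mu = X @ beta                                     # E[y_i] = beta^T x_i
    return -np.sum(norm.logpdf(y, loc=mu, scale=sigma))

res = minimize(neg_log_likelihood, x0=np.zeros(X.shape[1]))
beta_ml = res.x   # numerically close to the least-squares estimate beta_hat
```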

2.3 Generalised Linear Models

In a nutshell, this course will generalise the premises used for Linear Regression as follows:

  • The probability density function \(p_{y|x,\beta}\) is a member of the exponential family of distributions. The Normal distribution is a member of that family, but other distributions are also available to deal with various situations, such as when the outcome \(y\) is not an element of \(\mathbb{R}\) (as assumed by the Normal distribution) but is, for instance, binary (e.g. the outcome \(y\) indicates a failure \(0\) or a success \(1\)).

  • The expectation of \(y\) given \(x\) and \(\beta\), defined by: \[\mathbb{E}[y]=\int_{\mathbb{Y}} y \ p_{y|x,\beta}(y|x,\beta) \ dy, \quad \text{with } \mathbb{Y} \text{ the domain of definition of the outcome $y$,}\] is now related to the explanatory variables through a link function \(g\) such that \[g\left(\mathbb{E}[y]\right)=\beta^{T}x\] For instance, in the case of Linear Regression the link function \(g\) that we used is the identity function, defined by \(g(z)=z\) for \(z\in\mathbb{R}\). The link function \(g\) will be chosen to be bijective, so that its inverse \(g^{-1}\) exists. In general, this function maps the space of the expectation \(\mathbb{E}[y]\) to the space \(\mathbb{R}\) where \(\beta^T x\) takes its values. A minimal sketch for a binary outcome with its corresponding link function is given after this list.
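As announced in the list above, here is a minimal GLM sketch for a binary outcome: a Bernoulli response with the logit link \(g(\mu)=\log\bigl(\mu/(1-\mu)\bigr)\), so that \(g(\mathbb{E}[y])=\beta^Tx\), fitted by maximum likelihood. The simulated data, the true coefficients and the generic optimiser are illustrative choices only.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit   # inverse logit: g^{-1}(z) = 1 / (1 + exp(-z))

rng = np.random.default_rng(1)
N = 500
X_bin = np.column_stack([np.ones(N), rng.normal(size=N)])
beta_true = np.array([-0.5, 1.5])                    # illustrative coefficients
y_bin = rng.binomial(1, expit(X_bin @ beta_true))    # binary outcomes: 0 (failure) or 1 (success)

def neg_log_likelihood(beta):
    p = expit(X_bin @ beta)   # E[y | x, beta] = g^{-1}(beta^T x)
    return -np.sum(y_bin * np.log(p) + (1 - y_bin) * np.log(1 - p))

beta_hat_glm = minimize(neg_log_likelihood, x0=np.zeros(2)).x
```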