class: center, middle, inverse, title-slide

# Backpropagation
## Machine Learning
### Prof. Rozenn Dahyot
---
## Supervised Machine Learning

.center[ ![Supervised Machine Learning](images/Machine.drawio.svg) ]

---
## Machine design in Deep Neural Networks

In DNNs, the machine is designed as a **composition of functions**:

$$ f_{\theta}(x)= \color{darkgreen}{\mathrm{W}_L} \circ \cdots \circ \color{blue}{\sigma_l} \circ \color{darkgreen}{\mathrm{W}_l} \circ \cdots \circ \color{blue}{\sigma_1} \circ \color{darkgreen}{\mathrm{W}_1} (x) $$

A given layer `\(l\)` of the network is characterized by

- a linear transformation `\(\mathbb{R}^{N_{l-1}} \rightarrow \mathbb{R}^{N_{l}}: x \rightarrow \color{darkgreen}{\mathrm{W}_l} \ x\)`, with the `\(N_l \times N_{l-1}\)` matrix of weights `\(\color{darkgreen}{\mathrm{W}_l}\)`,
- activation functions `\(\lbrace \sigma_{l,k} \rbrace _{k=1,\cdots,N_l}\)`: `\(\color{blue}{\sigma_l}(x)=\left(\sigma_{l,1}(x_1),\cdots,\sigma_{l,N_l}(x_{N_l}) \right)\)`.

.center[ ![Neuron](images/Neuron1.drawio.svg) ]

---
## Machine design in Deep Neural Networks

`\(\sigma_{l,n}:\mathbb{R} \rightarrow \mathbb{R}\)` is the **activation function** of the neuron indexed by `\((l, n)\)`. In a conventional setup, the activation function is fixed:

`$$\sigma_{l,n}(x)=\sigma(x+b_{l,n}) \quad \quad \text{with bias } b_{l,n}\in \mathbb{R}$$`

A common choice is `\(\sigma(x)=\max(0,x)\)`, called ReLU (rectified linear unit).

`\(L\)` is the **depth** of the neural net, i.e. its number of layers.

.center[ ![Neuron](images/Neuron1.drawio.svg) ]

---
## Machine design in Deep Neural Networks

The parameters `\(\theta\)` are composed of all the weights `\(\lbrace \mathrm{W}_l \rbrace_{l=1,\cdots,L}\)` and all the biases `\(\lbrace \mathbf{b}_{l} \rbrace_{l=1,\cdots,L}\)`, i.e.

`$$\theta \equiv \left\lbrace \lbrace (\mathrm{W}_l,\mathbf{b}_{l}) \rbrace_{l=1,\cdots,L} \right \rbrace$$`

The parameters `\(\theta\)` control the machine `\(f_{\theta}\)`.

.center[ ![Neuron](images/Neuron1.drawio.svg) ]

---
## Stochastic Gradient Descent (SGD)

To optimize neural networks, derivatives (w.r.t. the parameters `\(\theta\)`) need to be computed. Starting from an initial guess `\(\theta^{(0)}\)`, update until convergence:

`$$\theta^{(t+1)} = \theta^{(t)} - \frac{\eta_t }{N}\sum_{i=1}^N \nabla_{\theta} \mathcal{L}\left(f_{\theta}(x_i),y_i \right) \quad \quad \text{(SGD)}$$`

`\(\eta_t\)` is the **learning rate** (or *step size*) hyperparameter.

**Remark:** the cost that is minimized is an empirical mean approximating an expectation, and so is its gradient:

`$$\mathbb{E}\left\lbrack \nabla_{\theta} \mathcal{L} \right\rbrack\approx\frac{1 }{N}\sum_{i=1}^N \nabla_{\theta} \mathcal{L}\left(f_{\theta}(x_i),y_i \right)$$`

---
## Stochastic Gradient Descent (SGD)

**Example of learning rate decay:**

`$$\eta_t = \underbrace{\eta_{0}}_{\text{initial learning rate }} / (1 + \underbrace{k}_{\text{decay rate}}\cdot \underbrace{t}_{\text{epoch number}})$$`

1. Start with a larger learning rate.
2. Reduce it in the course of training.

---
## Stochastic Gradient Descent (SGD)
### Iterative method (SGD)

- When `\(N\)` is large or samples are recorded over time, the true gradient is approximated by the gradient at a single sample:
`$$\theta^{(t+1)} = \theta^{(t)} - \eta \ \nabla_{\theta} \mathcal{L}\left(f_{\theta}(x_i),y_i \right)$$`
This update is repeated iteratively `\(\forall i=1,\cdots, N\)`.

- A compromise is to use a **batch**, i.e. a randomly selected subsample of size `\(n\leq N\)`, denoted `\(\lbrace (x_i^*,y_i^*)\rbrace_{i=1,\cdots,n}\)`, taken from the full dataset `\(\lbrace (x_i,y_i)\rbrace_{i=1,\cdots,N}\)` (see the code sketch on the next slide):
`$$\theta^{(t+1)} = \theta^{(t)} - \eta \ \frac{1}{n}\sum_{i=1}^n \nabla_{\theta}\mathcal{L}\left(f_{\theta}(x_i^*),y_i^* \right)$$`
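
---
## Stochastic Gradient Descent (SGD)
### Mini-batch update: code sketch

A minimal Python/NumPy sketch of the mini-batch update above. It is illustrative only: the function `grad_loss` (assumed to return the per-sample gradient `\(\nabla_{\theta}\mathcal{L}(f_{\theta}(x),y)\)`), the variable names and the default hyperparameter values are assumptions, not part of any specific library.

```python
import numpy as np

def sgd(theta, X, Y, grad_loss, eta=0.01, batch_size=5, epochs=10):
    """Mini-batch SGD: theta <- theta - eta * (mean gradient over a batch)."""
    N = len(X)
    for _ in range(epochs):                          # one epoch = one full pass over the data
        order = np.random.permutation(N)             # reshuffle the training set
        for start in range(0, N, batch_size):
            idx = order[start:start + batch_size]    # one batch (x_i*, y_i*)
            # empirical mean of the per-sample gradients over the batch
            g = np.mean([grad_loss(theta, X[i], Y[i]) for i in idx], axis=0)
            theta = theta - eta * g                  # SGD update
    return theta
```

With `\(N=200\)`, `batch_size = 5` and `epochs = 1000`, this loop performs the `\(40 \times 1000 = 40000\)` parameter updates of the example on the next slide.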
---
## Stochastic Gradient Descent (SGD)

Given a training dataset `\(\lbrace (x_i,y_i)\rbrace_{i=1,\cdots,N}\)` of size `\(N\)`:

- A **batch** refers to equally sized subsets of the dataset. The batch size is the number of samples processed before the model is updated. $$ \text{batch size} \leq N$$

- An **epoch** means that each sample in the training dataset has had an opportunity to update the model parameters `\(\theta\)`. An epoch comprises one or more batches. The number of epochs is the number of complete passes through the training dataset.

**Example:** `\(N= 200\)` samples, batch size 5, and 1,000 epochs: each epoch divides the training dataset into `\(\frac{200}{5}=40\)` batches, so the model parameters `\(\theta\)` are updated `\(40 \times 1000 = 40000\)` times:
`$$\lbrace\theta^{(t)}\rbrace_{t=0,\cdots,40000}$$`

---
## Stochastic Gradient Descent (SGD)
### Learning curves

.center[ ![Example](images/convergence.drawio.svg)]

.footnote[ Excerpt from *Harmonic Convolutional Networks based on Discrete Cosine Transform*, M. Ulicny et al. (2021), https://arxiv.org/pdf/2001.06570.pdf ]

---
## Backpropagation

**Backpropagation** can compute this gradient `\(\sum_{i=1}^N\nabla_{\theta}\mathcal{L}\left(f_{\theta}(x_i),y_i \right)\)`!

Backpropagation procedure:

- the forward pass computes the loss,
- the backward pass computes the derivatives of the loss function with respect to the parameters.

Remember the chain rule:

`$$\frac{d}{dt}\left( g\circ z(t) \right)=\frac{d}{dt}\left( g(z(t)) \right)=\frac{dg}{dz} \cdot\frac{dz}{dt}=g'(z(t))\cdot z'(t)$$`

---
## Univariate Chain Rule: Example

.center[ ![Example](images/backpropagation.drawio.svg) ]

---
## Univariate Chain Rule: Example

.pull-left[
Computing the loss (forward pass)

$$ z=wx+b $$

`$$\hat{y}=\sigma(z)$$`

$$ \mathcal{L}=\frac{1}{2}(\hat{y}-y)^2 $$

![Example](images/backpropagation.drawio.svg)
]

.pull-right[
Computing the derivatives (backward pass)

$$ \frac{d\mathcal{L}}{d\hat{y}}=\hat{y}-y $$

$$ \frac{d\mathcal{L}}{dz}=\frac{d\mathcal{L}}{d\hat{y}}\frac{d\hat{y}}{dz}=\frac{d\mathcal{L}}{d\hat{y}}\cdot \sigma'(z) $$

$$ \frac{d\mathcal{L}}{dw}=\frac{d\mathcal{L}}{dz}\frac{dz}{dw}=\frac{d\mathcal{L}}{dz}\cdot x $$

$$ \frac{d\mathcal{L}}{db}=\frac{d\mathcal{L}}{dz}\frac{dz}{db}=\frac{d\mathcal{L}}{dz} $$
]

---
## Univariate Chain Rule: Example

**Exercise:**

1. Compute the forward and backward pass values for the training sample `\((x,y)=(1,1)\)`, starting from the initial guess `$$\theta^{(0)}=(w^{(0)},b^{(0)})=(0,0)$$` and using the identity activation function `\(\sigma(x)=x\)`.
1. How would `\(\theta^{(1)}\)` be updated?

<center> <img src="images/backpropagation.drawio.svg" width="50%" /> </center>

???

`$$\theta^{(1)} = \theta^{(0)} - \eta \ \nabla_{\theta} \mathcal{L}\left(f_{\theta}(x),y \right)$$`

`$$\left( \begin{array}{c} w^{(1)}\\ b^{(1)}\\ \end{array}\right)=\left(\begin{array}{c} w^{(0)}\\ b^{(0)}\\ \end{array}\right)-\eta \left(\begin{array}{c} -1 \\ -1\\ \end{array}\right)$$`
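
A quick numerical check of this solution, as a minimal Python sketch (the value `eta = 0.1` is an assumption for illustration; the exercise leaves `\(\eta\)` symbolic):

```python
# One SGD step for the univariate example: z = w*x + b, y_hat = sigma(z) = z,
# L = 0.5 * (y_hat - y)^2, with (x, y) = (1, 1) and (w, b) = (0, 0).
x, y = 1.0, 1.0
w, b = 0.0, 0.0
eta = 0.1                      # assumed learning rate (not specified in the exercise)

# forward pass
z = w * x + b                  # z = 0
y_hat = z                      # identity activation: y_hat = 0
loss = 0.5 * (y_hat - y) ** 2  # L = 0.5

# backward pass (chain rule)
dL_dyhat = y_hat - y           # -1
dL_dz = dL_dyhat * 1.0         # sigma'(z) = 1 for the identity activation
dL_dw = dL_dz * x              # -1
dL_db = dL_dz                  # -1

# SGD update: theta^(1) = theta^(0) - eta * gradient
w, b = w - eta * dL_dw, b - eta * dL_db
print(dL_dw, dL_db, w, b)      # -1.0 -1.0 0.1 0.1
```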