class: center, middle, inverse, title-slide

# Loss functions
## Machine Learning
### Prof. Rozenn Dahyot
###
---
## Supervised Machine Learning

.center[ ![Supervised Machine Learning](data:image/png;base64,#images/Machine.drawio.svg) ]

???

- What are the loss functions `\(\mathcal{L}\)` that are (commonly) used?

---
## Loss functions `\(\mathcal{L}\)` for Regression

For regression: `\(\hat{y}_i\in \mathbb{R}^d\)` and `\(y_i\in \mathbb{R}^d\)`

$$
\mathcal{L}(\hat{y}_i,y_i)=\frac{1}{2}\|\hat{y}_i-y_i\|^2 \quad \quad\quad \text{(Quadratic Loss)}
$$

This corresponds to using a Euclidean distance.

**Extension: Robust regression**, e.g. [Huber Loss](https://en.wikipedia.org/wiki/Huber_loss)

$$
\mathcal{L}(\hat{y}_i,y_i)=\frac{1}{2}\rho\left(\|\hat{y}_i-y_i\|\right) \quad \quad\quad \text{(M-estimation Loss)}
$$

.footnote[[A General and Adaptive Robust Loss Function](https://openaccess.thecvf.com/content_CVPR_2019/papers/Barron_A_General_and_Adaptive_Robust_Loss_Function_CVPR_2019_paper.pdf), J. Barron (CVPR 2019)]

???

- several functions `\(\rho\)` exist

---
## Loss functions `\(\mathcal{L}\)` for Regression

.pull-left[
Using M-estimators:

![](data:image/png;base64,#images/CVPR2000.png)

[*Robust visual recognition of colour images*](https://roznn.github.io/PDF/htm_Cvpr00.pdf), Computer Vision and Pattern Recognition (CVPR 2000) [DOI:10.1109/CVPR.2000.855886](http://doi.org/10.1109/CVPR.2000.855886)
]

.pull-right[
![](data:image/png;base64,#images/d41586-019-03013-5_17247076.png)

*Why deep-learning AIs are so easy to fool*, Nature (2019) [DOI:10.1038/d41586-019-03013-5](https://doi.org/10.1038/d41586-019-03013-5)
]

???

- robust machines are resilient to *outliers* in the data

---
## Regression
### Denoising & Super-resolution

.center[ ![MLSP2018](data:image/png;base64,#images/MLSP2018Albluwi.png) ]

.footnote[
Image Deblurring and Super-Resolution Using Deep Convolutional Neural Networks, F. Albluwi et al., IEEE Workshop on Machine Learning for Signal Processing (MLSP 2018) [DOI:10.1109/MLSP.2018.8516983](https://doi.org/10.1109/MLSP.2018.8516983) https://github.com/Fatma-Albluwi/DBSRCNN
]

???

- The Mean Squared Error (MSE) is used as the cost function in this paper https://doi.org/10.1109/MLSP.2018.8516983.
- Training datasets for such an application are easy to create.

---
## Regression
### Colorization

.center[
![](data:image/png;base64,#https://camo.githubusercontent.com/c84c51c1a70c67a968b2f69533f5800957ed6b16d974c48a08bb12818353c268/68747470733a2f2f692e696d6775722e636f6d2f577072517750352e6a7067)

https://github.com/jantic/DeOldify
]

.footnote[
Image Colorization: A Survey and Dataset, S. Anwar et al. (2020) https://arxiv.org/pdf/2008.10774.pdf
]

???

- Training datasets for such an application are easy to create.
- *The loss function utilized is the least-squares error... the Huber loss... Mean Squared Error loss...*

---
## Regression
### Depth Estimation

.pull-left[
Example of application: depth from a single image.

MiDaS models for computing relative depth from a single image.

https://pytorch.org/hub/intelisl_midas_v2/
]

.pull-right[
<img src="data:image/png;base64,#https://pytorch.org/assets/images/midas_samples.png" width="60%" />
]

.footnote[
[Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer](https://arxiv.org/abs/1907.01341), R. Ranftl et al., [DOI:10.1109/TPAMI.2020.3019967](https://dx.doi.org/10.1109/TPAMI.2020.3019967), https://pytorch.org/hub/intelisl_midas_v2/
]

???

- Training datasets for such an application are not easy to create, e.g. registering image and depth information that are not recorded by the same sensor is needed.
- Losses used include the MSE and robust `\(\rho(\cdot)\)` functions.
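---
## Loss functions `\(\mathcal{L}\)` for Regression
### Code sketch

A minimal NumPy sketch of the two regression losses above (an illustration, not code from the cited papers); the Huber threshold `delta` and the toy vectors are assumed values.

```python
import numpy as np

def quadratic_loss(y_hat, y):
    """Quadratic loss: 1/2 * ||y_hat - y||^2."""
    r = np.linalg.norm(y_hat - y)
    return 0.5 * r ** 2

def huber_loss(y_hat, y, delta=1.0):
    """M-estimation loss 1/2 * rho(||y_hat - y||) with a Huber-type rho.

    delta is an illustrative threshold (an assumption, not a value from the slides).
    """
    r = np.linalg.norm(y_hat - y)
    rho = r ** 2 if r <= delta else 2.0 * delta * r - delta ** 2  # quadratic near 0, linear for large errors
    return 0.5 * rho

y_hat = np.array([1.0, 2.0, 3.0])
y = np.array([1.5, 2.0, 8.0])          # the last component is an outlier-like error
print(quadratic_loss(y_hat, y))        # ~12.62: grows quadratically with the error
print(huber_loss(y_hat, y))            # ~4.52: grows only linearly for large errors
```

The quadratic loss grows quadratically with large errors, while the M-estimation loss with a Huber-type `\(\rho\)` grows only linearly, which is what makes it more robust to outliers.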
---
## Classification

.center[ ![softmax](data:image/png;base64,#images/classification.drawio.svg)]

---
## Loss functions `\(\mathcal{L}\)` for classification

Classification ( `\(J\)` classes)

`$$\mathcal{L}(\hat{y}_i,y_i)=- \sum_{j=1}^{J} y_{i,j}\ \log(\hat{y}_{i,j} ) \quad \quad \text{(Cross-entropy)}$$`

.center[ ![softmax](data:image/png;base64,#images/softmax.drawio.svg)]

---
## Loss functions `\(\mathcal{L}\)` for classification
### Example ( `\(J=4\)` )

Using cross-entropy to compare the output `\(\hat{y}\)` with the true label `\(y\)`:

`$$\text{softmax}\left(\underbrace{ \left\lbrack \begin{array}{c} 3.2\\ 1.3\\ 0.2\\ 0.8\\ \end{array}\right\rbrack}_{z}\right) = \underbrace{ \left\lbrack \begin{array}{c} 0.775\\ 0.116\\ 0.039\\ 0.070\\ \end{array}\right\rbrack}_{\hat{y}} \quad \text{Comparison with} \quad\underbrace{ \left\lbrack \begin{array}{c} 1\\ 0\\ 0\\ 0\\ \end{array}\right\rbrack}_{y}$$`

$$
\mathcal{L}(\hat{y},y)= - 1 \times \log (0.775) =0.2548922
$$

---
## Loss functions `\(\mathcal{L}\)` for classification

For binary classification ( `\(J=2\)` ), for simplicity `\(\hat{y}\in [0;1]\)` and `\(y\in\lbrace 0,1\rbrace\)` are scalars, and the cross-entropy is:

$$
\mathcal{L}(\hat{y},y)=- y\ \log(\hat{y} )-(1-y)\ \log(1-\hat{y}) \quad \quad \text{(Cross-entropy)}
$$

and `\(\hat{y}\)` can be computed from `\(z\in\mathbb{R}\)` with the sigmoid function:

`$$\text{sigmoid:}\quad \hat{y}=\frac{1}{1+\exp(-z)}$$`

<!-- https://towardsdatascience.com/cross-entropy-loss-function-f38c4ec8643e -->

---
## Loss with Regularisation

The cost function used to estimate the best parameters `\(\theta\)` of the machine M can be extended with a regularization term:

`$$\hat{\theta}=\arg\min_{\theta} \left\lbrace \underbrace{\frac{1}{N}\sum_{i=1}^N \mathcal{L}(\hat{y}_i,y_i)}_{\text{model fit}}+\lambda \ \underbrace{\mathcal{R}(\theta)}_{\text{regularization}} \right\rbrace$$`

`\(\lambda\)` controls the regularization strength.

Examples of regularization terms:

- L2 regularization: `\(\mathcal{R}(\theta)=\|\theta\|_{2}^2=\sum_{k=1}^{\dim{\theta}} \theta_k^2\)`
- L1 regularization: `\(\mathcal{R}(\theta)=\|\theta\|_{1}=\sum_{k=1}^{\dim{\theta}} |\theta_k|\)`
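---
## Loss with Regularisation
### Code sketch

A minimal NumPy sketch (illustration only) combining the softmax/cross-entropy worked example with an L2-regularized cost; the values of `lam` and `theta` below are assumptions, not from the slides.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))          # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy(y_hat, y):
    return -np.sum(y * np.log(y_hat))

z = np.array([3.2, 1.3, 0.2, 0.8])
y = np.array([1.0, 0.0, 0.0, 0.0])     # one-hot true label
y_hat = softmax(z)

print(np.round(y_hat, 3))              # ~[0.775 0.116 0.039 0.07]
print(cross_entropy(y_hat, y))         # ~0.255, as in the worked example

# Regularized cost: model fit + lambda * R(theta), with an L2 penalty.
# theta and lam are toy values (assumptions, not from the slides).
theta = np.array([0.5, -1.0, 2.0])
lam = 0.1
cost = cross_entropy(y_hat, y) + lam * np.sum(theta ** 2)
print(cost)                            # ~0.78
```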
---
## Loss with Regularisation

Purpose of Regularization:

- Express preferences among models
- Avoid overfitting: prefer simple models that generalize better

---
## Example: Ridge Regression

Consider the dataset `\(\lbrace (\mathbf{x}_i,y_i)\rbrace_{i=1,\cdots,N}\)`

`$$\hat{\theta}=\arg\min_{\theta}\left\lbrace\sum_{i=1}^N (y_i-\mathbf{x}_i^T\theta)^2 + \lambda \ \|\theta\|^2=\|\mathbf{y}-\mathrm{X}\theta\|^2+\lambda \|\theta\|^2\right\rbrace$$`

with matrix

`$$\mathrm{X}= \left\lbrack \begin{array}{cccc} 1 & x_{1,1} &\cdots & x_{d,1}\\ 1 & x_{1,2} &\cdots & x_{d,2}\\ \vdots &\vdots & &\vdots\\ 1 & x_{1,N} &\cdots & x_{d,N}\\ \end{array}\right\rbrack \quad \text{and}\quad \mathbf{y}=\left\lbrack \begin{array}{c} y_1\\ y_2\\ \vdots\\ y_{N} \end{array}\right\rbrack$$`

---
## Example: Ridge Regression

`$$\frac{\partial}{\partial \theta} \|\theta\|^2 =\frac{\partial}{\partial \theta} \theta^T\theta = 2\ \theta$$`

`$$\frac{\partial}{\partial \theta} \|\mathbf{y}-\mathrm{X}\theta\|^2 =-2\mathrm{X}^T\mathbf{y}+2\mathrm{X}^{T}\mathrm{X}\theta$$`

Hence the solution to

`$$\hat{\theta}=\arg\min_{\theta}\left\lbrace\sum_{i=1}^N (y_i-\mathbf{x}_i^T\theta)^2 + \lambda \ \|\theta\|^2=\|\mathbf{y}-\mathrm{X}\theta\|^2+\lambda \|\theta\|^2\right\rbrace$$`

is such that

`$$-2\mathrm{X}^T\mathbf{y}+2\mathrm{X}^{T}\mathrm{X}\hat{\theta}+\lambda \ 2\hat{\theta}=0$$`

---
## Example: Ridge Regression

With `\(\mathrm{I}\)` the identity matrix, the ridge regression solution is written:

`$$\hat{\theta}=\left(\mathrm{X}^T\mathrm{X}+\lambda \ \mathrm{I}\right)^{-1}\mathrm{X}^T\mathbf{y}$$`

1. What is the solution of ridge regression when `\(\lambda=0\)`?
1. What is the solution of ridge regression when `\(\lambda\rightarrow\infty\)`?

---
## Other Operations

Other operations are used in DNNs to help their optimization:

- Dropout: [Dropout: A Simple Way to Prevent Neural Networks from Overfitting](https://www.jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf), N. Srivastava et al., JMLR 2014
- Batch Normalization: [Decorrelated Batch Normalization](https://openaccess.thecvf.com/content_cvpr_2018/papers/Huang_Decorrelated_Batch_Normalization_CVPR_2018_paper.pdf), L. Huang et al., CVPR 2018

---
## Questions

1. Why is softmax used in the final layer of a neural network classifier?
1. If vector `\(\hat{\mathbf{y}}\)` is computed with softmax, what are the properties of vector `\(\hat{\mathbf{y}}\)`?
1. Why is regularization used when optimizing a Neural Network?
1. Consider the model https://keras.io/examples/vision/mnist_convnet/ : Is this model performing regression or classification? Explain.

---
## Supervised Machine Learning

.center[ ![Supervised Machine Learning](data:image/png;base64,#images/Machine.drawio.svg) ]
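---
## Example: Ridge Regression
### Code sketch

A minimal NumPy sketch (illustration only, not part of the original slides) of the closed-form ridge solution above, on synthetic data; the toy dataset, random seed, and `\(\lambda\)` values are assumed.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 50, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, d))])  # design matrix with an intercept column
theta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=N)               # noisy synthetic observations

def ridge(X, y, lam):
    """Closed-form ridge solution: (X^T X + lam * I)^(-1) X^T y."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

print(ridge(X, y, lam=0.0))    # lambda = 0: reduces to the ordinary least-squares solution
print(ridge(X, y, lam=100.0))  # a large lambda shrinks the coefficients towards 0
```

Note that `np.linalg.solve` is used rather than an explicit matrix inverse, which is the numerically preferable way to apply the closed-form solution.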