class: center, middle, inverse, title-slide

.title[
# Machine Design
]
.subtitle[
## CNNs & Transformers
]
.author[
### Prof. Rozenn Dahyot
]
.institute[
###
]
---
## Introduction

.center[
<div class="figure">
<img src="images/PastAI.svg" alt="The classic landscape: one architecture per community" width="90%" />
<p class="caption">The classic landscape: one architecture per community</p>
</div>
]

.footnote[
Excerpt from [Lucas Beyer's slides (2022)](http://lucasb.eyer.be/transformer) on Transformers: [video](https://youtu.be/UpfcyzoZ644)
]

---
## Computer Vision tasks

.center[
<img src="images/SegmentationDefinition.drawio.svg" width="70%" />
]

.footnote[
Excerpt from Justin Johnson's [online lectures](https://www.youtube.com/playlist?list=PL5-TkQAfAZFbzxjBHtzdVCWE0Zbhomg7r), lecture #16
]

---
## Convolutional Neural Networks

Convolutional Neural Networks (CNNs) consist of three main layer types:

- **Convolutional layer**: abstracts the input image into feature maps by convolving it with learned filters (kernels).
- **Pooling layer**: downsamples feature maps by summarizing the presence of features in patches of the feature map.
- **Fully connected layer**: connects every neuron in one layer to every neuron in the next layer.

---
## ConvNet

**Example:**

```python
from tensorflow import keras
from tensorflow.keras import layers

num_classes = 10           # MNIST digit classes
input_shape = (28, 28, 1)  # 28x28 grayscale images

model = keras.Sequential(
    [
        keras.Input(shape=input_shape),
        layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Conv2D(64, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Flatten(),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ]
)
```

.footnote[
From https://keras.io/examples/vision/mnist_convnet/
]

???
- CNNs are powerful visual models that yield hierarchies of features.
- ConvNets are inherently translation invariant: their basic components (convolution, pooling, and activation functions) operate on local input regions and depend only on relative spatial coordinates.
- The fully connected layers of these nets have fixed dimensions and throw away spatial coordinates.

---
## ResNet

.right-column[
To make mappings close to the identity easier to learn, the ResNet (Residual Network) block learns

`$$X_{l+1}=\sigma(X_l+\delta(X_l))$$`

instead of

`$$X_{l+1}=\sigma(f(X_l))$$`

- Identity shortcut connections add neither extra parameters nor computational complexity.
- These residual networks are easier to optimize, and can gain accuracy from considerably increased depth.

A minimal code sketch follows on the next slide.
]
.left-column[
![ResNet block](images/ResNetBlock.drawio.svg)
]

.footnote[
[Deep Residual Learning for Image Recognition](https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf), CVPR 2016
]

???
- Deeper neural networks are more difficult to train.
- The residual layer approximates the identity function better.
- The shortcut connections simply perform identity mapping, and their outputs are added to the outputs of the stacked layers.
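---
## ResNet: code sketch

A minimal Keras sketch of the residual block above, with the residual branch as two 3x3 convolutions. The layer sizes are illustrative, and the paper's blocks also use batch normalization, omitted here for brevity:

```python
from tensorflow import keras
from tensorflow.keras import layers

def residual_block(x, filters=64):
    """X_{l+1} = sigma(X_l + delta(X_l)), with delta = two 3x3 convolutions."""
    shortcut = x                                      # identity shortcut: no extra parameters
    y = layers.Conv2D(filters, 3, padding="same")(x)  # first conv of delta(X_l)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)  # second conv of delta(X_l)
    y = layers.Add()([shortcut, y])                   # X_l + delta(X_l)
    return layers.Activation("relu")(y)               # sigma(...)

inputs = keras.Input(shape=(32, 32, 64))  # channels must match `filters` so Add works
outputs = residual_block(inputs)
model = keras.Model(inputs, outputs)
```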
---
## FCNs

.center[
<img src="images/overall_1505.04366.svg" width="80%" />
]

.pull-left[
- [Fully Convolutional Networks (CVPR 2015)](https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Long_Fully_Convolutional_Networks_2015_CVPR_paper.pdf)
- Deconvolution network. Image from [Learning Deconvolution Network (ICCV 2015)](https://www.cv-foundation.org/openaccess/content_iccv_2015/papers/Noh_Learning_Deconvolution_Network_ICCV_2015_paper.pdf).
]
.pull-right[
<img src="images/deconvolution_1505.04366.svg" width="80%" />
]

.footnote[
See Justin Johnson's [lecture notes #16](https://web.eecs.umich.edu/~justincj/slides/eecs498/498_FA2019_lecture16.pdf), 2019
]

???
The deconvolution network is a mirrored version of the convolution network, made of multiple series of unpooling, deconvolution, and rectification layers.

In [lecture notes #16](https://web.eecs.umich.edu/~justincj/slides/eecs498/498_FA2019_lecture16.pdf):
- max unpooling: slide #47
- deconvolution: slides #48-61

---
## U-Net

.center[
<img src="images/u-net-illustration-correct-scale2.svg" width="70%" />
]

.footnote[
[U-Net: Convolutional Networks for Biomedical Image Segmentation](https://arxiv.org/abs/1505.04597), International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI) 2015

Example from 2021: [DR-VNet: Retinal Vessel Segmentation via Dense Residual UNet](https://arxiv.org/pdf/2111.04739.pdf)
]

???
- U-Net is based on the fully convolutional network and is designed to work with fewer training images and to yield more precise segmentations.
- Note the concatenation operations (horizontal grey arrows).

---
## Region-Based CNNs

Region-Based Convolutional Neural Networks (R-CNNs):

- Region of Interest (RoI) pooling
- FC layers + proposal classification

<img src="images/Fast.RCNN.15044.08083.svg" width="100%" />

.footnote[
[Fast R-CNN](https://openaccess.thecvf.com/content_iccv_2015/html/Girshick_Fast_R-CNN_ICCV_2015_paper.html), ICCV 2015

[Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks](https://proceedings.neurips.cc/paper/2015/file/14bfa6bb14875e45bba028a21ed38046-Paper.pdf), NeurIPS 2015
]

???
- **Region Proposal Network (RPN)**: a network that proposes candidate object regions within an image.
- Fast R-CNN extracts features from each candidate box using RoIPool (Region of Interest Pooling) and performs classification and bounding-box regression.

In [lecture notes #16](https://web.eecs.umich.edu/~justincj/slides/eecs498/498_FA2019_lecture16.pdf): slide #73

---
## Mask R-CNN

![](images/Mask.RCNN_1703.06870teaser.svg)

.footnote[
From [Mask R-CNN](https://openaccess.thecvf.com/content_iccv_2017/html/He_Mask_R-CNN_ICCV_2017_paper.html), International Conference on Computer Vision (ICCV) 2017
]

???
Mask R-CNN is a Convolutional Neural Network and was state-of-the-art for image segmentation. It detects objects in an image and generates a high-quality segmentation mask for each instance. Mask R-CNN builds on Faster R-CNN, a Region-Based Convolutional Neural Network.

In [lecture notes #16](https://web.eecs.umich.edu/~justincj/slides/eecs498/498_FA2019_lecture16.pdf): slide #74

---
## CoordConv

Adding **positional encoding** to CNNs (sketched in code on the next slide):

.center[
<img src="images/CoordConv.png" width="80%" />
]

.footnote[
From <a href="https://proceedings.neurips.cc/paper/2018/file/60106888f8977b71e1f15db7bc9a88d1-Paper.pdf" target="_blank">An Intriguing Failing of Convolutional Neural Networks and the CoordConv Solution</a>, NeurIPS 2018.
]
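---
## CoordConv: code sketch

A minimal sketch of the CoordConv idea (illustrative, not the authors' implementation): two coordinate channels, normalised to [-1, 1], are concatenated to the input before a standard convolution.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

class AddCoords(layers.Layer):
    """Append normalised y/x coordinate channels to the input tensor."""
    def call(self, x):
        b = tf.shape(x)[0]
        h, w = tf.shape(x)[1], tf.shape(x)[2]
        ys = tf.linspace(-1.0, 1.0, h)                # one value per row
        xs = tf.linspace(-1.0, 1.0, w)                # one value per column
        yy, xx = tf.meshgrid(ys, xs, indexing="ij")   # (h, w) coordinate grids
        coords = tf.stack([yy, xx], axis=-1)          # (h, w, 2)
        coords = tf.tile(coords[None], [b, 1, 1, 1])  # repeat over the batch
        return tf.concat([x, coords], axis=-1)        # (b, h, w, c + 2)

inputs = keras.Input(shape=(64, 64, 3))
x = AddCoords()(inputs)  # 3 channels -> 5 channels
x = layers.Conv2D(32, 3, padding="same", activation="relu")(x)
model = keras.Model(inputs, x)
```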
---
## Contrastive learning

.center[
<img src="images/ContrastiveLearning.png" width="80%" />
]

.footnote[
From <a href="https://proceedings.neurips.cc/paper/2020/file/d89a66c7c80a29b1bdbab0f2a1a94af8-Paper.pdf" target="_blank">Supervised Contrastive Learning</a>, NeurIPS 2020
]

---
## Other CNNs

.left-column[
RoadTracer: a search algorithm, guided by a decision function implemented with a CNN, that computes the road-network graph iteratively.
]
.right-column[
<img src="images/RoadTracer_qualitative_results.jpg" width="80%" />
]

.footnote[
[RoadTracer: Automatic Extraction of Road Networks from Aerial Images](https://openaccess.thecvf.com/content_cvpr_2018/papers/Bastani_RoadTracer_Automatic_Extraction_CVPR_2018_paper.pdf), F. Bastani et al., CVPR 2018
]

---
## Remarks

- A convolution operation assumes that nearby pixels are more important than far-away pixels.
- Only after several convolutional layers are stacked together does the receptive field grow large enough to attend to the entire image.
- Adding attention to models lets them look at different parts of the input at each time step.
- The Transformer is a neural network architecture that relies on attention alone (a minimal sketch closes this deck).

.footnote[
Have a look at [Facebook Detectron2](https://github.com/facebookresearch/detectron2)
]

---
## Transformer's takeover: one community at a time

.center[
<img src="images/TransformerTakeOver.svg" width="90%" />
]

.footnote[
Excerpt from [Lucas Beyer's slides (2022)](http://lucasb.eyer.be/transformer) on Transformers: [online video](https://youtu.be/UpfcyzoZ644)
]

---
## Transformer

![](images/vit_model_scheme_convertio.io.svg)

.footnote[
From [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/pdf/2010.11929v2.pdf), A. Dosovitskiy et al., International Conference on Learning Representations (ICLR) 2021
]

???
Suggested reading: https://e2eml.school/transformers.html

---
## Transformer

.center[
<img src="images/SwinTransformer_2103.14030.png" width="80%" />
]

.footnote[
[Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://openaccess.thecvf.com/content/ICCV2021/papers/Liu_Swin_Transformer_Hierarchical_Vision_Transformer_Using_Shifted_Windows_ICCV_2021_paper.pdf), Z. Liu et al., ICCV 2021 best paper: https://github.com/microsoft/Swin-Transformer
]

???
- Extension of ViT to segmentation and detection
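---
## Attention: code sketch

As noted in the Remarks slide, Transformers rely on attention alone. Below is a minimal NumPy sketch of scaled dot-product self-attention; the shapes are illustrative (e.g. 16 patch embeddings, as in ViT), not a full Transformer:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention: softmax(Q K^T / sqrt(d)) V."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv  # project tokens to queries, keys, values
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)     # (n, n) pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                # each token is a weighted mix of all tokens

rng = np.random.default_rng(0)
n, d = 16, 32                                  # e.g. 16 image patches, 32-dim embeddings
X = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)            # shape (16, 32)
```

Unlike convolution, every token can attend to every other token in a single layer, regardless of spatial distance.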