Computer Vision & Machine Learning

1 Introduction

This page provides links to useful materials for my undergraduate and postgraduate students learning about Computer Vision and related fields.

1.1 Useful Prerequisites

Linear Algebra e.g. Computational Linear Algebra
Programming e.g. Python

1.2 Online References

Books

Probabilistic Machine Learning: Advanced Topics by Kevin P. Murphy
Computer Vision: Algorithms and Applications by Richard Szeliski
Pattern Recognition and Machine Learning by Christopher M. Bishop
Data Mining and Analysis: Fundamental Concepts and Algorithms by Mohammed J. Zaki and Wagner Meira Jr.
https://d2l.ai/
https://www.deeplearningbook.org/

The Modern Mathematics of Deep Learning (advanced)
Graph Representation Learning by W. L. Hamilton
Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto
High-Dimensional Data Analysis with Low-Dimensional Models: Principles, Computation, and Applications by John Wright and Yi Ma
Principles of Deep Learning

YouTube videos and PDF slides

http://introtodeeplearning.com MIT Introduction to Deep Learning
Deep Learning for Computer Vision by Justin Johnson
ICLR 2021 Keynote - “Geometric Deep Learning: The Erlangen Programme of ML” - M Bronstein
CS video courses

Miscellaneous

1.3 Conferences & Journals

2 Lecture Notes

2.1 Miscellaneous

2.2 Slides CS431 Principal Component Analysis

2.3 Slides CS410 Computer Vision

Computer Vision with Machine Learning: Overview, dataset preparation \(\lbrace x_i,y_i\rbrace_{i=1,\cdots,N}\), and AI Ethics
Loss Functions: Loss functions \(\mathcal{L}\)
Backpropagation: computation of parameters \(\hat{\theta}\)
Machine Design: Machines \(f_{\theta}\)
Formulas

3 Architectures

CNN Explainer https://poloclub.github.io/cnn-explainer/
Types of convolution https://xzz201920.medium.com/conv1d-conv2d-and-conv3d-8a59182c4d6

3.1 ConvNet

Source: https://keras.io/examples/vision/mnist_convnet/

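from tensorflow import keras
from tensorflow.keras import layers

num_classes = 10           # ten digit classes
input_shape = (28, 28, 1)  # 28x28 grey-level MNIST images

model = keras.Sequential(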
    [
        keras.Input(shape=input_shape),
        layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Conv2D(64, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Flatten(),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ]
)

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d (Conv2D)              (None, 26, 26, 32)        320       
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 13, 13, 32)        0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 11, 11, 64)        18496     
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 5, 5, 64)          0         
_________________________________________________________________
flatten (Flatten)            (None, 1600)              0         
_________________________________________________________________
dropout (Dropout)            (None, 1600)              0         
_________________________________________________________________
dense (Dense)                (None, 10)                16010     
=================================================================
Total params: 34,826
Trainable params: 34,826
Non-trainable params: 0
_________________________________________________________________

This model starts with a convolution layer that takes an image of size \((28\times 28 \times 1)\) (MNIST images are grey-level images of width 28 and height 28) and computes 32 feature maps using 32 filters of size \((3\times 3\times 1)\), creating an output tensor of size \((26 \times 26 \times 32)\) (as no padding is applied). The 32 filters have \(3\times3\times 1\times 32=288\) parameters, and 32 biases also need to be estimated, leading to a total of \(288+32=320\) parameters for this first convolution layer.

Convolution of the input image with 32 filters of size 3x3 on the first layer

The max-pooling operation down-samples by 2 in each spatial dimension, creating a tensor of size \((\frac{26}{2} \times \frac{26}{2} \times 32)=(13 \times 13 \times 32)\).

The second convolution layer operates on the input tensor \((13 \times 13 \times 32)\) using 64 filters of size \((3\times 3 \times 32)\), creating an output tensor of size \((11 \times 11 \times 64)\) (no padding is used). The 64 filters have \(3\times3\times 32\times 64=18432\) parameters, and 64 biases also need to be estimated, leading to a total of \(18432+64=18496\) parameters for this second convolution layer.

Second convolutional layer with 64 filters
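The output sizes and parameter counts of the two convolution layers above can be checked with a few lines of plain Python. This is only a sketch: the helper names `conv_output_size` and `conv_param_count` are made up for illustration and are not part of Keras.

```python
def conv_output_size(in_size, kernel_size, stride=1, padding=0):
    """Output length along one spatial axis of a convolution."""
    return (in_size + 2 * padding - kernel_size) // stride + 1


def conv_param_count(kernel_size, in_channels, out_channels, bias=True):
    """Weights of `out_channels` square filters spanning all input channels, plus biases."""
    weights = kernel_size * kernel_size * in_channels * out_channels
    return weights + (out_channels if bias else 0)


# Layer 1: 28x28x1 image, 32 filters of size 3x3x1, no padding.
size_1 = conv_output_size(28, 3)
params_1 = conv_param_count(3, in_channels=1, out_channels=32)

# Layer 2: 13x13x32 tensor (after max pooling), 64 filters of size 3x3x32.
size_2 = conv_output_size(13, 3)
params_2 = conv_param_count(3, in_channels=32, out_channels=64)

print(size_1, params_1)  # 26 320
print(size_2, params_2)  # 11 18496
```

Both results agree with the `Output Shape` and `Param #` columns of the model summary.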

The max-pooling operation down-samples by 2 in each spatial dimension, creating a tensor of size \((\lfloor\frac{11}{2}\rfloor \times \lfloor\frac{11}{2}\rfloor \times 64)=(5 \times 5 \times 64)\). This tensor is flattened into a column vector of \(5 \times 5 \times 64=1600\) dimensions.
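To see where the floor comes from, here is a minimal pure-Python sketch of \(2\times 2\) max pooling with stride 2 on a single channel. The function `max_pool_2x2` and the toy feature map are illustrative assumptions, not the Keras implementation; the sketch assumes that a trailing odd row or column is simply dropped, which is what the summary's \(11 \rightarrow 5\) reduction implies.

```python
def max_pool_2x2(grid):
    """2x2 max pooling with stride 2; a trailing odd row/column is dropped."""
    out_h, out_w = len(grid) // 2, len(grid[0]) // 2
    return [
        [
            max(grid[2 * i][2 * j], grid[2 * i][2 * j + 1],
                grid[2 * i + 1][2 * j], grid[2 * i + 1][2 * j + 1])
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Toy 5x5 feature map with values 0..24, filled row by row.
feature_map = [[5 * r + c for c in range(5)] for r in range(5)]
pooled = max_pool_2x2(feature_map)
print(pooled)  # [[6, 8], [16, 18]]

# Same rule on an 11x11 map (one channel of the second convolution's output).
pooled_11 = max_pool_2x2([[0] * 11 for _ in range(11)])
flattened_length = len(pooled_11) * len(pooled_11[0]) * 64  # 64 channels
print(len(pooled_11), len(pooled_11[0]), flattened_length)  # 5 5 1600
```

The last row and column of the \(5\times 5\) toy map never enter any window, which is why an odd input size rounds down.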

The dense layer multiplies the \(1600\times 1\) vector \(\mathbf{x}\) by a weight matrix \(\mathrm{W}\) of dimension \(10\times 1600\) and adds a \(10\times 1\) vector of biases \(\mathbf{b}\), creating the \(10\times 1\) output vector \(\mathbf{z}\), which is fed into a softmax function for classification into 10 classes (the 10 digits). This layer has \(10\times 1600+10=16010\) parameters. \[\mathrm{W}\mathbf{x}+\mathbf{b}=\mathbf{z} \rightarrow \text{softmax}(\mathbf{z})\]
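The dense layer and softmax can be sketched in plain Python as follows. The values of `x`, `W` and `b` are random placeholders standing in for the flattened tensor and the trained parameters, and the helpers `dense` and `softmax` are written here for illustration only.

```python
import math
import random


def dense(weights, x, biases):
    """Compute z = W x + b for a weight matrix given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]


def softmax(z):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    z_max = max(z)
    exps = [math.exp(v - z_max) for v in z]
    total = sum(exps)
    return [v / total for v in exps]


random.seed(0)  # arbitrary seed, only so that the toy example is reproducible
num_features, num_classes = 1600, 10

# Random placeholders, NOT trained values.
x = [random.uniform(0.0, 1.0) for _ in range(num_features)]
W = [[random.gauss(0.0, 0.01) for _ in range(num_features)] for _ in range(num_classes)]
b = [0.0] * num_classes

z = dense(W, x, b)
probabilities = softmax(z)
predicted_class = probabilities.index(max(probabilities))
dense_params = num_classes * num_features + num_classes

print(len(z), dense_params)  # 10 16010
print(round(sum(probabilities), 6))  # 1.0
```

Adding these 16,010 parameters to the 320 and 18,496 of the two convolution layers gives 34,826, the total reported by `model.summary()`.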

3.2 ResNet18

The code at https://pytorch.org/hub/pytorch_vision_resnet/ loads deep residual networks pre-trained on ImageNet.

Paper: https://arxiv.org/pdf/1512.03385.pdf

input: image

output: scores

The output is a set of scores over the 1000 classes in ImageNet. After a softmax is applied, the scores over all the classes sum to 1. The top 5 are displayed:

Samoyed 0.8846219182014465
Arctic fox 0.04580527916550636
white wolf 0.04427633807063103
Pomeranian 0.005621336866170168
Great Pyrenees 0.004651939496397972
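A listing like the one above can be produced by normalising the raw scores with a softmax and sorting. The sketch below uses a toy 6-class problem: the `logits` values and the extra label "tabby cat" are invented for illustration and are not real ResNet outputs.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def top_k(probabilities, labels, k):
    """Return the k (label, probability) pairs with the highest probability."""
    ranked = sorted(zip(labels, probabilities), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical raw scores; a real ImageNet model would output 1000 of them.
labels = ["Samoyed", "Arctic fox", "white wolf", "Pomeranian", "Great Pyrenees", "tabby cat"]
logits = [9.0, 6.0, 5.9, 3.9, 3.7, -1.0]

probabilities = softmax(logits)
top5 = top_k(probabilities, labels, k=5)
for label, p in top5:
    print(label, p)
```

Only the ranking and the sum-to-one property carry over to the real model; the probabilities printed here depend entirely on the made-up scores.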

Conv2d: Convolution
BatchNorm2d: Batch Normalisation
ReLU: rectified linear unit function (activation function)
MaxPool2d: Max Pooling
Linear: linear transformation

ResNet(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer2): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer3): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(128, 256, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer4): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
  (fc): Linear(in_features=512, out_features=1000, bias=True)
)
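The kernel sizes, strides and paddings printed above determine how the spatial resolution shrinks through the network. The sketch below traces this for a \(224\times 224\) input; that input size is an assumption (it is not stated in the printout), and only the first convolution of each layer is listed because the remaining \(3\times 3\) convolutions use stride 1 and padding 1 and therefore keep the size unchanged.

```python
def output_size(in_size, kernel_size, stride, padding):
    """Output length along one axis for Conv2d/MaxPool2d (floor mode, dilation 1)."""
    return (in_size + 2 * padding - kernel_size) // stride + 1


# (name, kernel_size, stride, padding, out_channels) of each step that can change
# the resolution, read off the printed ResNet18 structure.
stages = [
    ("conv1", 7, 2, 3, 64),
    ("maxpool", 3, 2, 1, 64),
    ("layer1", 3, 1, 1, 64),
    ("layer2", 3, 2, 1, 128),
    ("layer3", 3, 2, 1, 256),
    ("layer4", 3, 2, 1, 512),
]

size = 224  # ASSUMED input height and width
trace = []
for name, kernel, stride, padding, channels in stages:
    size = output_size(size, kernel, stride, padding)
    trace.append((name, channels, size))
    print(f"{name:8s} -> {channels} x {size} x {size}")

# AdaptiveAvgPool2d((1, 1)) averages each of the 512 channels to a single value,
# so `fc` maps 512 features to 1000 class scores.
fc_params = 512 * 1000 + 1000
print(fc_params)  # 513000
```

Under this assumption the resolution goes 112, 56, 56, 28, 14, 7 before the average pooling. The \(1\times 1\) `downsample` convolutions with stride 2 give the same halving on the skip path, so the two branches of each residual block can be added.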

3.3 About (Batch) Normalization

From "Normalization is dead, long live normalization!" (ICLR 2022)

3.4 Attention (Transformer)

See Attention mechanism pages 99-100

3.5 VGG

https://pytorch.org/hub/pytorch_vision_vgg/

VGG(
  (features): Sequential(
    (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU(inplace=True)
    (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (4): ReLU(inplace=True)
    (5): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (6): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (7): ReLU(inplace=True)
    (8): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (9): ReLU(inplace=True)
    (10): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (11): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (12): ReLU(inplace=True)
    (13): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (14): ReLU(inplace=True)
    (15): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (16): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (17): ReLU(inplace=True)
    (18): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (19): ReLU(inplace=True)
    (20): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(7, 7))
  (classifier): Sequential(
    (0): Linear(in_features=25088, out_features=4096, bias=True)
    (1): ReLU(inplace=True)
    (2): Dropout(p=0.5, inplace=False)
    (3): Linear(in_features=4096, out_features=4096, bias=True)
    (4): ReLU(inplace=True)
    (5): Dropout(p=0.5, inplace=False)
    (6): Linear(in_features=4096, out_features=1000, bias=True)
  )
)
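The printed structure is enough to count the parameters by hand. The sketch below copies the channel and feature sizes from the printout above; none of the convolutions are printed with `bias=False`, so each is assumed to have one bias per output channel.

```python
# (in_channels, out_channels) of every 3x3 convolution in the printed `features` block.
conv_layers = [(3, 64), (64, 128), (128, 256), (256, 256),
               (256, 512), (512, 512), (512, 512), (512, 512)]
# (in_features, out_features) of every Linear layer in the printed `classifier` block.
linear_layers = [(25088, 4096), (4096, 4096), (4096, 1000)]

# Each convolution has 3*3*c_in*c_out weights plus c_out biases.
conv_params = sum(3 * 3 * c_in * c_out + c_out for c_in, c_out in conv_layers)
# Each linear layer has f_in*f_out weights plus f_out biases.
linear_params = sum(f_in * f_out + f_out for f_in, f_out in linear_layers)

# The 7x7 adaptive average pooling over 512 channels explains in_features=25088.
flattened = 512 * 7 * 7

print(flattened)                                         # 25088
print(conv_params, linear_params)                        # 9220480 123642856
print(conv_params + linear_params)                       # 132863336
```

Most of the parameters sit in the three linear layers of the classifier, not in the convolutions.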

3.6 UNet

https://pytorch.org/hub/mateuszbuda_brain-segmentation-pytorch_unet/

3.7 Using Matlab

Example of how to use a pretrained Convolutional Neural Network (CNN) as a feature extractor for training an image category classifier

4 Demos

Image Segmentation: ODISE demo https://huggingface.co/spaces/xvjiarui/ODISE (CVPR 2023 paper: https://jerryxu.net/ODISE/)