Computer Vision & Machine Learning
1 Introduction
This page provides links to useful materials for my undergraduate and postgraduate students learning about Computer Vision and related fields.
1.1 Useful Prerequisites
Linear Algebra e.g. Computational Linear Algebra
Programming e.g. Python
1.2 Online References
Books
Probabilistic Machine Learning: Advanced Topics by Kevin P. Murphy
Computer Vision: Algorithms and Applications by Richard Szeliski
Pattern Recognition and Machine Learning by Christopher M. Bishop
Data Mining and Analysis: Fundamental Concepts and Algorithms by Mohammed J. Zaki and Wagner Meira Jr.
The Modern Mathematics of Deep Learning (advanced)
Graph Representation Learning by W. L. Hamilton
Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto
High-Dimensional Data Analysis with Low-Dimensional Models: Principles, Computation, and Applications by John Wright and Yi Ma
YouTube videos and PDF slides
MIT Introduction to Deep Learning http://introtodeeplearning.com
Deep Learning for Computer Vision by Justin Johnson
ICLR 2021 Keynote - “Geometric Deep Learning: The Erlangen Programme of ML” - M. Bronstein
Miscellaneous
1.3 Conferences & Journals
2 Lecture Notes
2.1 Miscellaneous
2.2 Slides CS431 Principal Component Analysis
2.3 Slides CS410 Computer Vision
Computer Vision with Machine Learning: Overview, dataset preparation \(\lbrace x_i,y_i\rbrace_{i=1,\cdots,N}\), and AI Ethics
Loss Functions: Loss functions \(\mathcal{L}\)
Backpropagation: computation of parameters \(\hat{\theta}\)
Machine Design: Machines \(f_{\theta}\)
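Taken together, the slides cover the standard supervised-learning recipe: given the dataset \(\lbrace x_i,y_i\rbrace_{i=1,\cdots,N}\), the parameters of the machine \(f_{\theta}\) are estimated by minimising the average loss \[\hat{\theta} = \arg\min_{\theta} \frac{1}{N}\sum_{i=1}^{N} \mathcal{L}\left(f_{\theta}(x_i), y_i\right)\] with the minimisation carried out by gradient descent, using gradients computed by backpropagation.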
3 Architectures
CNN Explainer https://poloclub.github.io/cnn-explainer/
Types of convolution (Conv1D, Conv2D, Conv3D) https://xzz201920.medium.com/conv1d-conv2d-and-conv3d-8a59182c4d6 (see the sketch below)
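To complement the link above, a minimal PyTorch sketch of the input/output shapes expected by the three convolution types (the channel counts and kernel sizes here are illustrative choices):

import torch
from torch import nn

x1 = torch.randn(1, 16, 50)        # Conv1d input: (batch, channels, length), e.g. audio
x2 = torch.randn(1, 3, 28, 28)     # Conv2d input: (batch, channels, height, width), e.g. images
x3 = torch.randn(1, 3, 8, 28, 28)  # Conv3d input: (batch, channels, depth, height, width), e.g. video

print(nn.Conv1d(16, 32, kernel_size=3)(x1).shape)  # torch.Size([1, 32, 48])
print(nn.Conv2d(3, 32, kernel_size=3)(x2).shape)   # torch.Size([1, 32, 26, 26])
print(nn.Conv3d(3, 32, kernel_size=3)(x3).shape)   # torch.Size([1, 32, 6, 26, 26])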
3.1 ConvNet
Source: https://keras.io/examples/vision/mnist_convnet/
from tensorflow import keras
from tensorflow.keras import layers

num_classes = 10            # MNIST digits 0-9
input_shape = (28, 28, 1)   # 28x28 grey-level images, one channel

model = keras.Sequential(
    [
        keras.Input(shape=input_shape),
        layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Conv2D(64, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Flatten(),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ]
)
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d (Conv2D) (None, 26, 26, 32) 320
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 13, 13, 32) 0
_________________________________________________________________
conv2d_1 (Conv2D) (None, 11, 11, 64) 18496
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 5, 5, 64) 0
_________________________________________________________________
flatten (Flatten) (None, 1600) 0
_________________________________________________________________
dropout (Dropout) (None, 1600) 0
_________________________________________________________________
dense (Dense) (None, 10) 16010
=================================================================
Total params: 34,826
Trainable params: 34,826
Non-trainable params: 0
_________________________________________________________________
This model is designed with a first convolution layer taking an image \((28\times 28 \times 1)\) (MNIST images are grey-level, of width 28 and height 28) and computing 32 feature maps using 32 filters of size \((3\times 3\times 1)\), creating an output tensor of size \(26 \times 26 \times 32\) (as no padding is applied). The 32 filters have \(3\times3\times 1\times 32=288\) weights, and in addition 32 biases need to be estimated, leading to a total of \(288+32=320\) parameters to be estimated in this first convolution layer.
The maxpooling operation down-samples by 2 in the spatial dimensions, creating a tensor of size \((\frac{26}{2} \times \frac{26}{2} \times 32)=(13 \times 13 \times 32)\).
The second convolution layer operates on the input tensor \((13 \times 13 \times 32)\) using 64 filters of size \((3\times 3 \times 32)\), creating an output tensor of size \((11 \times 11 \times 64)\) (no padding used). The 64 filters have \(3\times3\times 32\times 64=18432\) weights, and in addition 64 biases need to be estimated, leading to a total of \(18432+64=18496\) parameters to be estimated in this second convolution layer.
The maxpooling operation down-samples by 2 in the spatial dimensions, creating a tensor of size \((\lfloor\frac{11}{2}\rfloor \times \lfloor\frac{11}{2}\rfloor \times 64)=(5 \times 5 \times 64)\). This tensor is flattened into a column vector of \(5 \times 5 \times 64=1600\) dimensions.
The dense layer takes the \(1600\times 1\) vector \(\mathbf{x}\), multiplies it by a weight matrix \(\mathrm{W}\) of dimension \(10\times 1600\), and adds a \(10\times 1\) vector of biases \(\mathbf{b}\), creating the \(10\times 1\) output vector \(\mathbf{z}\) that is fed into a softmax function for classification into 10 classes (the 10 digits). \[\mathrm{W}\mathbf{x}+\mathbf{b}=\mathbf{z} \rightarrow \text{softmax}(\mathbf{z})\]
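As a sanity check, the parameter counts reported by model.summary() above can be reproduced with plain arithmetic:

conv1 = 3 * 3 * 1 * 32 + 32    # 320: 32 filters of size 3x3x1, plus 32 biases
conv2 = 3 * 3 * 32 * 64 + 64   # 18496: 64 filters of size 3x3x32, plus 64 biases
dense = 10 * 1600 + 10         # 16010: weight matrix W (10x1600), plus 10 biases
print(conv1, conv2, dense, conv1 + conv2 + dense)  # 320 18496 16010 34826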
3.2 ResNet18
The following code https://pytorch.org/hub/pytorch_vision_resnet/ loads a deep residual network (ResNet18) pre-trained on ImageNet.
Paper: https://arxiv.org/pdf/1512.03385.pdf
input: image
output: scores
The output is a set of scores over the 1000 classes of ImageNet. The sum of the scores over all classes is equal to 1 (they are softmax probabilities). The top 5 are displayed:
Samoyed 0.8846219182014465
Arctic fox 0.04580527916550636
white wolf 0.04427633807063103
Pomeranian 0.005621336866170168
Great Pyrenees 0.004651939496397972
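A minimal sketch of how such scores can be obtained, following the PyTorch Hub example at the URL above ('dog.jpg' is a placeholder filename; mapping class indices to names requires the ImageNet labels file from that page):

import torch
from PIL import Image
from torchvision import transforms

# Load the pre-trained ResNet18 from the PyTorch Hub.
model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet18', pretrained=True)
model.eval()

# Standard ImageNet preprocessing: resize, centre-crop, normalise.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
input_batch = preprocess(Image.open('dog.jpg')).unsqueeze(0)  # batch of size 1

with torch.no_grad():
    logits = model(input_batch)                                # raw scores, shape (1, 1000)
probabilities = torch.nn.functional.softmax(logits[0], dim=0)  # sums to 1
top5_prob, top5_idx = torch.topk(probabilities, 5)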
Conv2d: Convolution
BatchNorm2d: Batch Normalisation
ReLU: rectified linear unit (activation function)
MaxPool2d: Max Pooling
Linear: linear transformation (fully-connected layer)
ResNet(
(conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
(layer1): Sequential(
(0): BasicBlock(
(conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(1): BasicBlock(
(conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(layer2): Sequential(
(0): BasicBlock(
(conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(downsample): Sequential(
(0): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(1): BasicBlock(
(conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(layer3): Sequential(
(0): BasicBlock(
(conv1): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(downsample): Sequential(
(0): Conv2d(128, 256, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(1): BasicBlock(
(conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(layer4): Sequential(
(0): BasicBlock(
(conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(downsample): Sequential(
(0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(1): BasicBlock(
(conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
(fc): Linear(in_features=512, out_features=1000, bias=True)
)
3.3 About (Batch) Normalization
From “Normalization is dead, long live normalization!”, ICLR 2022
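As a reminder of what the BatchNorm2d modules in the printouts above compute, here is a minimal sketch (training mode, where the statistics are taken per channel over the batch and spatial dimensions; the tensor sizes are illustrative):

import torch

# BatchNorm2d normalises each channel: y = (x - mean) / sqrt(var + eps) * gamma + beta
bn = torch.nn.BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
x = torch.randn(8, 64, 13, 13)  # (batch, channels, height, width)
y = bn(x)

# Manual computation with batch statistics (biased variance, as used for normalisation).
mean = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
gamma = bn.weight.view(1, -1, 1, 1)  # learnable scale, initialised to 1
beta = bn.bias.view(1, -1, 1, 1)     # learnable shift, initialised to 0
y_manual = (x - mean) / torch.sqrt(var + bn.eps) * gamma + beta
print(torch.allclose(y, y_manual, atol=1e-5))  # True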
3.4 Attention (Transformer)
See Attention mechanism pages 99-100
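As a quick reminder of the core operation, here is a minimal sketch of scaled dot-product attention (the formulation of “Attention Is All You Need”, Vaswani et al. 2017; the shapes below are illustrative):

import math
import torch

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    return torch.softmax(scores, dim=-1) @ V

Q = torch.randn(10, 64)  # 10 query tokens, dimension 64
K = torch.randn(20, 64)  # 20 key tokens
V = torch.randn(20, 64)  # 20 value tokens
out = scaled_dot_product_attention(Q, K, V)  # shape (10, 64)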
3.5 VGG
https://pytorch.org/hub/pytorch_vision_vgg/ (the printout below is the VGG11 variant; note that the classifier's first Linear layer takes \(512\times 7\times 7=25088\) inputs, the flattened output of the adaptive average pooling)
VGG(
(features): Sequential(
(0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): ReLU(inplace=True)
(2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(3): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(4): ReLU(inplace=True)
(5): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(6): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(7): ReLU(inplace=True)
(8): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(9): ReLU(inplace=True)
(10): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(11): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(12): ReLU(inplace=True)
(13): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(14): ReLU(inplace=True)
(15): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(16): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(17): ReLU(inplace=True)
(18): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(19): ReLU(inplace=True)
(20): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
)
(avgpool): AdaptiveAvgPool2d(output_size=(7, 7))
(classifier): Sequential(
(0): Linear(in_features=25088, out_features=4096, bias=True)
(1): ReLU(inplace=True)
(2): Dropout(p=0.5, inplace=False)
(3): Linear(in_features=4096, out_features=4096, bias=True)
(4): ReLU(inplace=True)
(5): Dropout(p=0.5, inplace=False)
(6): Linear(in_features=4096, out_features=1000, bias=True)
)
)
4 Demos
- Image Segmentation: ODISE demo https://huggingface.co/spaces/xvjiarui/ODISE; paper (CVPR 2023) https://jerryxu.net/ODISE/