Transformers
1 Introduction
Transformers are models that transform a set of vectors in some representation space into a corresponding set of vectors of the same dimension: \[ \mathrm{X} \rightarrow \mathrm{Y}=\mathcal{T}(\mathrm{X}) \quad \text{with}\quad \dim(\mathrm{X} )=\dim(\mathrm{Y}) \]
Transformers are based on a processing concept called attention, which allows a network to give different weights to different inputs.
The goal of the transformation is to obtain a richer internal representation that is better suited to solving downstream tasks.
The transformer is especially well suited to massively parallel processing hardware such as GPUs.
Transformers originated in NLP with the paper Attention Is All You Need (NeurIPS 2017).
The transformer’s takeover then proceeded one community at a time.
2 Attention
Considering the input data as a tensor \(\mathrm{X}\) with \(N\) tokens \(\mathbf{x}_n\) of \(D\) dimensions, the \(N\) outputs \(\mathbf{y}_n\) are computed as weighted sums of the input tokens: \[ \forall n\in\lbrace 1,\cdots,N\rbrace, \quad \mathbf{y}_n= \sum_{m=1}^N \underbrace{\color{green}{a_{nm}}}_{\color{green}{\text{attention weight}}}\ \underbrace{\mathbf{x}_m}_{\text{input token}} \]
- Because the input tokens \(\lbrace \mathbf{x}_n\rbrace\) are \(D\)-dimensional, the output tokens \(\lbrace \mathbf{y}_n\rbrace\) are also \(D\)-dimensional.
- Because \(N\) output tokens are produced, they can likewise be stored in a tensor \(\mathrm{Y}\) of dimension \(N\times D\) (the same shape as \(\mathrm{X}\)).
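As a minimal NumPy sketch, the weighted sum above is a single matrix product \(\mathrm{Y}=\mathrm{A}\mathrm{X}\); the weight matrix here is random purely for illustration (it is not yet computed from the data):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 4, 8                      # N tokens of dimension D
X = rng.normal(size=(N, D))      # input tensor X

# Illustrative attention weights: each row is non-negative and sums to 1.
A = rng.random(size=(N, N))
A = A / A.sum(axis=1, keepdims=True)

# Each output token y_n is a convex combination of the input tokens.
Y = A @ X
assert Y.shape == X.shape        # outputs keep the N x D shape of X
```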
Wish list for the attention coefficients:
- The attention weights or coefficients should be close to zero for input tokens that have little influence on the output \(\mathbf{y}_n\), and largest for those with the most influence. Hence, the following constraints are imposed: \[ \left(\sum_{m=1}^N a_{nm}=1 \right) \quad \wedge \quad \left( 0\leq a_{nm}\leq 1 \right) \]
- The transformation \(\mathcal{T}: \mathrm{X} \longrightarrow \mathrm{Y}\) should be learnable, i.e. depend on parameters that can be tuned using a training set.
2.1 Self Attention
The attention coefficients depend on the input tokens, and the Softmax function is used to make them positive and sum to 1 (as per the wish list): \[ \text{Self attention coefficient:}\quad a_{nm}=\color{red}{\text{Softmax}}(\mathbf{x}_n^T \mathbf{x}_m)=\frac{\exp(\mathbf{x}_n^T \mathbf{x}_m)}{\sum_{k=1}^N \exp(\mathbf{x}_n^T \mathbf{x}_k)} \]
Using matrix notation, this gives: \[ \text{Self attention:}\quad \underbrace{\mathrm{Y}}_{N\times D}=\underbrace{\text{Softmax}\left\lbrack \color{green}{\mathrm{X}\mathrm{X}^T}\right\rbrack}_{N\times N} \ \underbrace{\mathrm{X}}_{N\times D} \]
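A minimal NumPy sketch of this parameter-free self-attention; the row-wise Softmax subtracts the row maximum before exponentiating, a standard trick for numerical stability:

```python
import numpy as np

def softmax(S):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    S = S - S.max(axis=-1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)

def self_attention(X):
    """Y = Softmax(X X^T) X; no learnable parameters."""
    A = softmax(X @ X.T)   # N x N attention weights, each row sums to 1
    return A @ X           # N x D outputs

X = np.random.default_rng(0).normal(size=(4, 8))
Y = self_attention(X)      # same N x D shape as X
```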
2.2 Learnable Self Attention
Introducing a learnable \(D\times D\) parameter matrix \(\color{red}{\mathrm{U}}\): \[ \text{Learnable Self attention:}\quad \underbrace{\mathrm{Y}}_{N\times D}=\underbrace{\text{Softmax}\left\lbrack \mathrm{X}\color{red}{\mathrm{U}\mathrm{U}^T}\mathrm{X}^T\right\rbrack}_{N\times N} \ \underbrace{\mathrm{X} \color{red}{\mathrm{U}}}_{N\times D} \]
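The same sketch with the matrix \(\mathrm{U}\) inserted; here \(\mathrm{U}\) is randomly initialised as a stand-in for a trained parameter:

```python
import numpy as np

def softmax(S):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    S = S - S.max(axis=-1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)

def learnable_self_attention(X, U):
    """Y = Softmax(X U U^T X^T) (X U), with a D x D parameter matrix U."""
    XU = X @ U                     # N x D projected tokens
    return softmax(XU @ XU.T) @ XU

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
U = rng.normal(size=(8, 8))        # in practice U is learned, not random
Y = learnable_self_attention(X, U)
```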
2.3 Learnable Attention
- The self-attention matrices \(\mathrm{X}\mathrm{X}^T\) and \(\mathrm{X}\mathrm{U}\mathrm{U}^T\mathrm{X}^T\) are symmetric.
- Asymmetric information is often important, so the definition of attention is extended by including a separate set of learnable parameters for the \(\color{blue}{\text{Query}}\), \(\color{red}{\text{Key}}\), and \(\color{green}{\text{Value}}\) vectors: \[ \text{Learnable attention:}\quad \mathrm{Y}=\text{Softmax}\left\lbrack \underbrace{\color{blue}{\mathrm{X}\mathrm{W}_q}}_{\color{blue}{\mathrm{Q}}\text{uery}}\underbrace{\color{red}{\mathrm{W}_k^T\mathrm{X}^T}}_{\color{red}{\mathrm{K}^T}\text{ey}}\right\rbrack \ \underbrace{\color{green}{\mathrm{X} \mathrm{W}_v}}_{\color{green}{\mathrm{V}}\text{alue}} \] i.e. \[ \text{Learnable attention:}\quad \mathrm{Y} =\text{Softmax}\left\lbrack \color{blue}{\mathrm{Q}}\ \color{red}{\mathrm{K}^T}\right\rbrack \ \color{green}{\mathrm{V}} \]
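A sketch of this query/key/value form, with random weight matrices standing in for learned parameters; \(\mathrm{Q}\mathrm{K}^T\) is no longer symmetric because the queries and keys use different projections:

```python
import numpy as np

def softmax(S):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    S = S - S.max(axis=-1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)

def attention(X, Wq, Wk, Wv):
    """Y = Softmax(Q K^T) V with Q = X Wq, K = X Wk, V = X Wv."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    return softmax(Q @ K.T) @ V      # Q K^T is asymmetric in general

rng = np.random.default_rng(0)
N, D, Dk, Dv = 4, 8, 8, 8
X  = rng.normal(size=(N, D))
Wq = rng.normal(size=(D, Dk))        # query projection
Wk = rng.normal(size=(D, Dk))        # key projection (same Dk as queries)
Wv = rng.normal(size=(D, Dv))        # value projection
Y = attention(X, Wq, Wk, Wv)         # N x Dv outputs
```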
2.4 Scaled Attention
The variance of the dot products \(\color{blue}{\mathrm{Q}}\color{red}{\mathrm{K}^T}\) grows with the dimension \(D_k\) of the query and key vectors, which pushes the Softmax into regions of very small gradient; dividing by \(\sqrt{D_k}\) keeps the logits at unit scale: \[ \text{Learnable scaled attention:}\quad \mathrm{Y} =\text{Softmax}\left\lbrack \frac{ \color{blue}{\mathrm{Q}}\ \color{red}{\mathrm{K}^T}}{\sqrt{D_k}}\right\rbrack \ \color{green}{\mathrm{V}} \]
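The only change to the previous sketch is the \(\sqrt{D_k}\) divisor inside the Softmax:

```python
import numpy as np

def softmax(S):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    S = S - S.max(axis=-1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)

def scaled_attention(X, Wq, Wk, Wv):
    """Y = Softmax(Q K^T / sqrt(Dk)) V; scaling keeps the logits O(1)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    Dk = Q.shape[-1]                       # query/key dimension
    return softmax(Q @ K.T / np.sqrt(Dk)) @ V
```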
3 Vision Transformer (ViT)
4 References
Lucas Beyer, Transformers, lecture slides and accompanying online video, 2022.
Bishop & Bishop, Deep Learning: Foundations and Concepts, Chapter 12, Springer, 2024.