Transformers

1 Introduction

  • Transformers are models that transform a set of vectors in some representation space into a corresponding set of vectors of the same dimension: \[ \mathrm{X} \rightarrow \mathrm{Y}=\mathcal{T}(\mathrm{X}) \quad \text{with}\quad \dim(\mathrm{X})=\dim(\mathrm{Y}) \]

  • Transformers are based on a processing concept called attention, which allows a network to give different weights to different inputs.

  • The goal of the transformation is to have a richer internal representation better suited to solving downstream tasks.

  • The transformer architecture is especially well suited to massively parallel processing hardware such as GPUs.

  • Transformers started in NLP: *Attention Is All You Need*, Vaswani et al., NeurIPS 2017

  • Transformer’s takeover: One community at a time

The classic landscape: One architecture per community. Excerpt from Lucas Beyer’s slides (2022), http://lucasb.eyer.be/transformer

Transformer takeover. Excerpt from Lucas Beyer’s slides (2022), http://lucasb.eyer.be/transformer

2 Attention

Considering the input data as a tensor \(\mathrm{X}\) with \(N\) tokens \(\mathbf{x}_n\) of \(D\) dimensions, the \(N\) outputs \(\mathbf{y}_n\) are computed as weighted sums of the input tokens: \[ \forall n\in\lbrace 1,\cdots,N\rbrace, \quad \mathbf{y}_n= \sum_{m=1}^N \underbrace{\color{green}{a_{nm}}}_{\color{green}{\text{attention weight}}}\ \underbrace{\mathbf{x}_m}_{\text{input token}} \]

  • because the input tokens \(\lbrace \mathbf{x}_n\rbrace\) are of \(D\) dimensions, the output tokens \(\lbrace \mathbf{y}_n\rbrace\) are also of \(D\) dimensions

  • because \(N\) output tokens are created, they can be stored likewise in a tensor \(\mathrm{Y}\) of dimension \(N\times D\) (same as \(\mathrm{X}\)).

Data matrix with \(N\) tokens in a feature space of \(D\) dimensions (inspired by Bishop & Bishop, 2024)
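As an illustration of the shapes involved, here is a minimal NumPy sketch (not part of the original notes) in which the coefficients \(a_{nm}\) are simply arbitrary row-normalised values; how the coefficients are actually computed is the subject of the rest of this section.

```python
import numpy as np

# Minimal sketch: N tokens of dimension D stored as the rows of X.
N, D = 4, 8
rng = np.random.default_rng(0)
X = rng.normal(size=(N, D))

# A holds hypothetical attention coefficients a_nm: non-negative values
# normalised so that each row sums to 1, as required by the wish list below.
A = rng.random(size=(N, N))
A = A / A.sum(axis=1, keepdims=True)

# Each output token y_n is a weighted sum of the input tokens x_m.
Y = A @ X
assert Y.shape == X.shape  # (N, D): outputs live in the same space as inputs
```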

Wish list for creating attention coefficients:

  1. The attention weights (or coefficients) should be close to zero for input tokens that have little influence on the output \(\mathbf{y}_n\) and largest for the inputs that have the most influence. Hence, the following constraints are applied: \[ \left(\sum_{m=1}^N a_{nm}=1 \right) \quad \wedge \quad \left( 0\leq a_{nm}\leq 1 \right) \]
  2. The transformation \(\mathcal{T}: \mathrm{X} \longrightarrow \mathrm{Y}\) should be learnable (i.e. depend on parameters that can be learned or tuned using a training set).

2.1 Self Attention

The attention coefficients depend on the input tokens themselves, and the Softmax function is introduced to make them positive and sum to 1 (as per the wish list): \[ \text{Self attention coefficient:}\quad a_{nm}=\color{red}{\text{Softmax}}(\mathbf{x}_n^T \mathbf{x}_m)=\frac{\exp(\mathbf{x}_n^T \mathbf{x}_m)}{\sum_{k=1}^N \exp(\mathbf{x}_n^T \mathbf{x}_k)} \]

Using matrix notation, this gives: \[ \text{Self attention:}\quad \underbrace{\mathrm{Y}}_{N\times D}=\underbrace{\text{Softmax}\left\lbrack \color{green}{\mathrm{X}\mathrm{X}^T}\right\rbrack}_{N\times N} \ \underbrace{\mathrm{X}}_{N\times D} \]
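A possible NumPy sketch of this formula (a simple illustration, not a reference implementation), with a row-wise `softmax` helper defined here for convenience:

```python
import numpy as np

def softmax(Z):
    # Row-wise softmax; subtracting the row maximum only improves numerical
    # stability and does not change the result.
    Z = Z - Z.max(axis=-1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=-1, keepdims=True)

def self_attention(X):
    # Y = Softmax[X X^T] X : entry (n, m) of X X^T is the score x_n^T x_m.
    A = softmax(X @ X.T)   # (N, N) attention coefficients, rows sum to 1
    return A @ X           # (N, D) outputs, same shape as X
```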

2.2 Learnable Self Attention

Introducing a learnable \(D\times D\) parameter matrix \(\color{red}{\mathrm{U}}\): \[ \text{Learnable Self attention:}\quad \underbrace{\mathrm{Y}}_{N\times D}=\underbrace{\text{Softmax}\left\lbrack \mathrm{X}\color{red}{\mathrm{U}\mathrm{U}^T}\mathrm{X}^T\right\rbrack}_{N\times N} \ \underbrace{\mathrm{X} \color{red}{\mathrm{U}}}_{N\times D} \]
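Continuing the previous sketch (same `softmax` helper and NumPy import), the learnable version only adds the projection by the hypothetical parameter matrix `U`:

```python
def learnable_self_attention(X, U):
    # U is a learnable D x D parameter matrix; the scores are X U U^T X^T.
    XU = X @ U              # (N, D) projected tokens
    A = softmax(XU @ XU.T)  # (N, N) attention coefficients
    return A @ XU           # (N, D) outputs
```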

2.3 Learnable Attention

  • The self-attention score matrices \(\mathrm{X}\mathrm{X}^T\) and \(\mathrm{X}\mathrm{U}\mathrm{U}^T\mathrm{X}^T\) are symmetric.
  • Asymmetric information is often important, so the definition of attention is extended by including a separate set of learnable parameters for the \(\color{blue}{\text{Query}}\), \(\color{red}{\text{Key}}\), and \(\color{green}{\text{Value}}\) vectors: \[ \text{Learnable attention:}\quad \mathrm{Y}=\text{Softmax}\left\lbrack \underbrace{\color{blue}{\mathrm{X}\mathrm{W}_q}}_{\color{blue}{\mathrm{Q}}\text{uery}}\underbrace{\color{red}{\mathrm{W}_k^T\mathrm{X}^T}}_{\color{red}{\mathrm{K}^T}\text{ey}}\right\rbrack \ \underbrace{\color{green}{\mathrm{X} \mathrm{W}_v}}_{\color{green}{\mathrm{V}}\text{alue}} \] i.e. \[ \text{Learnable attention:}\quad \mathrm{Y} =\text{Softmax}\left\lbrack \color{blue}{\mathrm{Q}}\ \color{red}{\mathrm{K}^T}\right\rbrack \ \color{green}{\mathrm{V}} \]
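A sketch of the query/key/value form under the same assumptions, with `Wq`, `Wk`, `Wv` standing for the three learnable projection matrices (and `softmax` as defined above):

```python
def qkv_attention(X, Wq, Wk, Wv):
    # Separate projections break the symmetry of the score matrix Q K^T.
    Q = X @ Wq   # queries
    K = X @ Wk   # keys
    V = X @ Wv   # values
    A = softmax(Q @ K.T)   # (N, N) attention coefficients
    return A @ V           # (N, D) outputs when Wv is D x D
```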

2.4 Scaled Attention

\[ \text{Learnable scaled attention:}\quad \mathrm{Y} =\text{Softmax}\left\lbrack \frac{ \color{blue}{\mathrm{Q}}\ \color{red}{\mathrm{K}^T}}{\sqrt{D_k}}\right\rbrack \ \color{green}{\mathrm{V}} \]

where \(D_k\) is the dimension of the key vectors: dividing the scores by \(\sqrt{D_k}\) keeps their magnitude roughly independent of \(D_k\) and prevents the Softmax from saturating.
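In code, the only change with respect to the previous sketch is the division of the scores by \(\sqrt{D_k}\) (again reusing the `softmax` helper and NumPy import from above):

```python
def scaled_qkv_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    Dk = K.shape[-1]   # dimension of the key vectors
    # Dividing the scores by sqrt(D_k) keeps their magnitude roughly
    # independent of the key dimension, so the Softmax does not saturate.
    A = softmax(Q @ K.T / np.sqrt(Dk))
    return A @ V
```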

Scaled Attention

3 Vision Transformer (ViT)

From *An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale*, A. Dosovitskiy et al., International Conference on Learning Representations (ICLR) 2021; https://arxiv.org/pdf/2010.11929v2.pdf
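As a rough, hypothetical sketch of the idea in the title, the image is cut into non-overlapping 16x16 patches and each patch is flattened into one token; the learnable linear projection and the positional embeddings of the actual ViT are omitted here:

```python
import numpy as np

def patchify(image, patch=16):
    # Split an (H, W, C) image into non-overlapping patch x patch blocks and
    # flatten each block into one token of dimension patch*patch*C.
    H, W, C = image.shape
    return (image
            .reshape(H // patch, patch, W // patch, patch, C)
            .transpose(0, 2, 1, 3, 4)
            .reshape(-1, patch * patch * C))

# Example: a 224 x 224 RGB image becomes 14 * 14 = 196 tokens of 768 values;
# ViT then maps each token to D dimensions with a learnable linear projection.
tokens = patchify(np.zeros((224, 224, 3)))
assert tokens.shape == (196, 768)
```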

4 References