Principal Component Analysis
PCA
1 Introduction
These notes follow the notation used in the recommended reading: Section 12.1.1 of Pattern Recognition and Machine Learning by Christopher Bishop.
Appendix C of the same book also provides a refresher on linear algebra (e.g. eigenvectors of matrices).
Some of the results presented here use the following linear algebra formulas for differentiating w.r.t. a vector \(\mathbf{x}\), where \(\mathrm{A}\) is a matrix that does not depend on \(\mathbf{x}\):
\[\frac{\partial (\mathrm{A}\mathbf{x})}{\partial \mathbf{x}}=\mathrm{A}^{T} \quad (\diamond) \]
\[\frac{\partial (\mathbf{x}^{T}\mathrm{A})}{\partial \mathbf{x}}=\mathrm{A}\quad (\diamond^2) \] \[\frac{\partial (\mathbf{x}^{T}\mathbf{x})}{\partial \mathbf{x}}=2 \mathbf{x} \quad (\diamond^3) \]
\[\frac{\partial (\mathbf{x}^{T}\mathrm{A}\mathbf{x})}{\partial \mathbf{x}}=\mathrm{A} \mathbf{x}+\mathrm{A}^{T} \mathbf{x}\quad (\diamond^4)\]
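As a quick sanity check of \((\diamond^4)\), here is a minimal NumPy sketch comparing the analytical expression with a central finite-difference approximation (the dimensions and random seed are illustrative):

```python
import numpy as np

# Numerical check of (diamond^4): d(x^T A x)/dx = A x + A^T x
rng = np.random.default_rng(0)
D = 4
A = rng.normal(size=(D, D))   # A does not depend on x
x = rng.normal(size=D)

analytic = A @ x + A.T @ x    # right-hand side of (diamond^4)

eps = 1e-6
numeric = np.zeros(D)
for i in range(D):
    e = np.zeros(D)
    e[i] = eps
    # central difference of f(x) = x^T A x along coordinate i
    numeric[i] = ((x + e) @ A @ (x + e) - (x - e) @ A @ (x - e)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))  # expected: True
```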
- In addition, Lagrange multipliers are used in the derivation of PCA.
2 Definitions
2.1 Data
Consider that we have a set of vectors \(\lbrace \mathbf{x}_n \rbrace _{n=1,\cdots ,N}\) in \(\mathbb{R}^D\). This means that any vector \(\mathbf{x}_n\) has \(D\) coordinates as follows: \[ \mathbf{x}_n = \left\lbrack \begin{array}{c} x_n[1]\\ x_n[2]\\ \vdots\\ x_n[D]\\ \end{array} \right\rbrack \]
2.2 Mean
We can define the mean \(\overline{\mathbf{x}}\) such that \[ \overline{\mathbf{x}}=\frac{1}{N}\sum_{n=1}^{N} \mathbf{x}_n \] Spatially, the mean can be understood as the center of gravity of the cloud of points \(\lbrace \mathbf{x}_n \rbrace _{n=1,\cdots,N}\).
\(\overline{\mathbf{x}}\) is also a vector of dimension \(D\): \[ \overline{\mathbf{x}} = \left\lbrack \begin{array}{c} \frac{1}{N} \sum_{n=1}^N x_n[1]\\ \frac{1}{N} \sum_{n=1}^N x_n[2]\\ \vdots\\ \frac{1}{N} \sum_{n=1}^N x_n[D]\\ \end{array} \right\rbrack \]
2.3 Covariance matrix
The covariance matrix \(\mathrm{S}\) is defined as: \[ \mathrm{S}=\frac{1}{N} \sum_{n=1}^{N} (\mathbf{x}_n-\overline{\mathbf{x}}) (\mathbf{x}_n-\overline{\mathbf{x}})^{T} \]
Defining \(\mathbf{\tilde{x}}_n=\mathbf{x}_n-\mathbf{\overline{x}}\) for all \(n\), we have \[ \mathrm{S}=\frac{1}{N} \sum_{n=1}^{N} \mathbf{\tilde{x}}_n \ \mathbf{\tilde{x}}_n^T \] or, written entry by entry, \[ \mathrm{S}=\frac{1}{N} \sum_{n=1}^N \left \lbrack \begin{array}{cccc} (\tilde{x}_{n}[1])^2 & \tilde{x}_{n}[1]\, \tilde{x}_{n}[2] &\cdots & \tilde{x}_{n}[1]\, \tilde{x}_{n}[D]\\ \tilde{x}_{n}[2]\, \tilde{x}_{n}[1] & (\tilde{x}_{n}[2])^2 & & \vdots\\ \vdots & &\ddots& \\ \tilde{x}_{n}[D]\, \tilde{x}_{n}[1] & \cdots & & (\tilde{x}_{n}[D])^2 \\ \end{array} \right\rbrack \]
Defining the \(D\times N\) matrix \(\mathrm{\tilde{X}}=[\mathbf{\tilde{x}}_{1},\cdots,\mathbf{\tilde{x}}_{N}]\), the covariance matrix can be written as \[ \mathrm{S}=\frac{1}{N} \mathrm{\tilde{X}}\mathrm{\tilde{X}}^T \] The covariance matrix \(\mathrm{S}\) is of size \(D\times D\).
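These definitions translate directly into NumPy. Here is a minimal sketch, assuming the observations are stacked column-wise in a \(D\times N\) array (variable names are illustrative); the result matches `np.cov` with `bias=True`, which also uses the \(1/N\) normalisation:

```python
import numpy as np

rng = np.random.default_rng(1)
D, N = 3, 100
X = rng.normal(size=(D, N))            # N observations of dimension D, one per column

x_bar = X.mean(axis=1, keepdims=True)  # mean vector, shape (D, 1)
X_tilde = X - x_bar                    # centered observations, shape (D, N)
S = (X_tilde @ X_tilde.T) / N          # covariance matrix, shape (D, D)

print(np.allclose(S, np.cov(X, bias=True)))  # expected: True
```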
3 Maximum variance formulation
Consider unit vector \(\mathbf{u}_1\) (i.e. such that \(\|\mathbf{u}_1\|^2=\mathbf{u}_1^T\mathbf{u}_1=1\)).
The variance of the projected data is: \[ \frac{1}{N}\sum_{n=1}^N \left( \mathbf{u}_1^T\mathbf{x}_n - \mathbf{u}_1^T\overline{\mathbf{x}} \right)^2 = \frac{1}{N}\sum_{n=1}^N \mathbf{u}_1^T (\mathbf{x}_n-\overline{\mathbf{x}})(\mathbf{x}_n-\overline{\mathbf{x}})^T \mathbf{u}_1 = \mathbf{u}_1^T \mathrm{S}\ \mathbf{u}_1 \] We want to find the vector \(\mathbf{u}_1\) such that:
\[
\hat{\mathbf{u}}_1=\arg \max_{\mathbf{u}_1} \lbrace \mathbf{u}_1^T
\mathrm{S}\ \mathbf{u}_1\rbrace \quad \text{subject to} \quad
\|\mathbf{u}_1 \|^2=1
\] Introducing the Lagrange multiplier \(\lambda_1\), this is the same as
optimising:
\[
\hat{\mathbf{u}}_1,\hat{\lambda}_1=\arg
\max_{\mathbf{u}_1,\lambda_1}\ \lbrace \
\mathcal{E}(\mathbf{u}_1,\lambda_1)=\mathbf{u}_1^T \mathrm{S}\
\mathbf{u}_1 +\lambda_{1} \ (1-\mathbf{u}_1^T\mathbf{u}_1 ) \ \rbrace
\] This means that we are looking for the vector \(\mathbf{u}_1\) that maximizes the energy
function \(\mathcal{E}\).
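As a numerical illustration (not part of the derivation), this constrained maximisation can be carried out by projected gradient ascent: take a step along the gradient \(2\,\mathrm{S}\mathbf{u}_1\) of the objective, then renormalise to unit length. A minimal NumPy sketch with illustrative data and step size; it converges to the same answer derived analytically in the next subsection:

```python
import numpy as np

rng = np.random.default_rng(5)
D, N = 4, 400
X = rng.normal(size=(D, N)) * np.array([[2.0], [1.0], [0.5], [0.2]])  # anisotropic cloud
X_tilde = X - X.mean(axis=1, keepdims=True)
S = (X_tilde @ X_tilde.T) / N

# Projected gradient ascent on u^T S u subject to ||u|| = 1:
# gradient step (gradient is 2 S u), then renormalisation.
u = rng.normal(size=D)
u /= np.linalg.norm(u)
step = 0.1
for _ in range(1000):
    u = u + step * 2 * (S @ u)
    u /= np.linalg.norm(u)

print(u @ S @ u)                  # maximised projected variance
print(np.linalg.eigvalsh(S)[-1])  # largest eigenvalue of S (should match)
```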
3.1 Finding the first principal component
The solution is such that the derivatives of \(\mathcal{E}\) are zero, so we need to find \((\mathbf{u}_1,\lambda_1)\) such that:
\[ \frac{\partial\mathcal{E}}{\partial \mathbf{u}_1}=0 \quad (\star) \] and \[ \frac{\partial\mathcal{E}}{\partial \lambda_1}=0 \quad (\star \star) \]
3.1.1 Solving \((\star\star)\)
\(\lambda_1\) is a scalar, so: \[ \frac{\partial\mathcal{E}}{\partial \lambda_1}= (1-\mathbf{u}_1^T\mathbf{u}_1 )=0 \] Differentiating w.r.t. the Lagrange multiplier \(\lambda_1\) thus recovers the constraint \[ \mathbf{u}_1^T\mathbf{u}_1=\|\mathbf{u}_1\|^2=1 \]
3.1.2 Solving \((\star)\)
The derivative w.r.t. the vector \(\mathbf{u}_1\) can be computed using the linear algebra formulas provided in the introduction: \[ \frac{\partial\mathcal{E}}{\partial \mathbf{u}_1}=\underbrace{\frac{\partial (\mathbf{u}_1^T \mathrm{S}\ \mathbf{u}_1 )}{\partial \mathbf{u}_1}}_{(a)} -\lambda_{1} \ \underbrace{\frac{\partial (\mathbf{u}_1^T\mathbf{u}_1 )}{\partial \mathbf{u}_1}}_{(b)}+\lambda_{1} \ \underbrace{\frac{\partial (1)}{\partial \mathbf{u}_1}}_{=0} \]
To compute (a), use formula \((\diamond^4)\):
\[ \frac{\partial (\mathbf{u}_1^T \mathrm{S}\ \mathbf{u}_1 )}{\partial \mathbf{u}_1} = \mathrm{S}\ \mathbf{u}_1 +\mathrm{S}^T\ \mathbf{u}_1=2\ \mathrm{S}\ \mathbf{u}_1 \] because the covariance matrix \(\mathrm{S}\) is a \(D\times D\) symmetric matrix hence \(\mathrm{S}^T=\mathrm{S}\).
To compute (b), use formula \((\diamond^3)\):
\[ \frac{\partial (\mathbf{u}_1^T\mathbf{u}_1 )}{\partial \mathbf{u}_1}=2\ \mathbf{u}_1 \]
3.1.3 Solution
Setting \((\star)\) to zero gives \(2\,\mathrm{S}\,\mathbf{u}_1-2\,\lambda_1\,\mathbf{u}_1=0\), so the vector \(\mathbf{u}_1\) that maximizes the variance is an eigenvector of the covariance matrix \(\mathrm{S}\) with eigenvalue \(\lambda_1\): \[ \mathrm{S}\ \mathbf{u}_1=\lambda_1\ \mathbf{u}_1 \]
Left-multiplying by \(\mathbf{u}_1^T\) and using \(\mathbf{u}_1^T\mathbf{u}_1=1\) shows that the eigenvalue \(\lambda_1\) is also the variance of the projections: \[ \mathbf{u}_1^T\mathrm{S}\mathbf{u}_1=\lambda_1 \] Since this is the quantity being maximized, \(\lambda_1\) is the highest eigenvalue of \(\mathrm{S}\).
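This result can be checked numerically. Here is a minimal NumPy sketch with illustrative data, using `np.linalg.eigh` (which returns the eigenvalues of a symmetric matrix in ascending order):

```python
import numpy as np

rng = np.random.default_rng(2)
D, N = 3, 500
X = rng.normal(size=(D, N)) * np.array([[3.0], [1.0], [0.5]])  # anisotropic cloud
X_tilde = X - X.mean(axis=1, keepdims=True)
S = (X_tilde @ X_tilde.T) / N

# eigh returns eigenvalues in ascending order for a symmetric matrix
eigvals, eigvecs = np.linalg.eigh(S)
u1, lam1 = eigvecs[:, -1], eigvals[-1]   # eigenvector with the largest eigenvalue

# S u1 = lambda1 u1, and u1^T S u1 = lambda1 (variance of the projections)
print(np.allclose(S @ u1, lam1 * u1))    # expected: True
print(np.isclose(u1 @ S @ u1, lam1))     # expected: True

# u1 yields a larger projected variance than a random unit direction
v = rng.normal(size=D)
v /= np.linalg.norm(v)
print(u1 @ S @ u1 >= v @ S @ v)          # expected: True
```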
3.2 How to find the direction with the second largest variance?
Once we have the direction \(\mathbf{u}_1\) with its associated eigenvalue \(\lambda_1\) (the variance of the projections of the centered data on \(\mathbf{u}_1\)), the direction \(\mathbf{u}_2\) with the second largest variance is found by maximising \[ \mathbf{u}_2^T\mathrm{S}\mathbf{u}_2 \] subject to the constraints \(\|\mathbf{u}_2\|=1\) and \(\mathbf{u}_2^T\mathbf{u}_1=0\) (\(\mathbf{u}_2\) is orthogonal to \(\mathbf{u}_1\)).
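A brief sketch of the argument, following the same Lagrangian steps as for \(\mathbf{u}_1\) (the multiplier \(\phi\) for the orthogonality constraint is introduced here and is not part of the notes above):
\[
\mathcal{E}(\mathbf{u}_2,\lambda_2,\phi)=\mathbf{u}_2^T \mathrm{S}\ \mathbf{u}_2 +\lambda_{2}\ (1-\mathbf{u}_2^T\mathbf{u}_2) +\phi\ \mathbf{u}_2^T\mathbf{u}_1
\]
Setting \(\frac{\partial\mathcal{E}}{\partial \mathbf{u}_2}=2\,\mathrm{S}\mathbf{u}_2-2\,\lambda_2\mathbf{u}_2+\phi\,\mathbf{u}_1=0\) and left-multiplying by \(\mathbf{u}_1^T\) gives \(\phi=0\) (using \(\mathbf{u}_1^T\mathbf{u}_1=1\), \(\mathbf{u}_1^T\mathbf{u}_2=0\) and \(\mathbf{u}_1^T\mathrm{S}\mathbf{u}_2=\lambda_1\mathbf{u}_1^T\mathbf{u}_2=0\)). Hence \(\mathrm{S}\,\mathbf{u}_2=\lambda_2\,\mathbf{u}_2\): \(\mathbf{u}_2\) is the eigenvector of \(\mathrm{S}\) associated with the second largest eigenvalue \(\lambda_2\).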
4 Principal components
4.1 Computing PCA
From a set of vectors \(\lbrace \mathbf{x}_n \rbrace_{n=1,\cdots,N}\)
- Compute the mean vector \(\overline{\mathbf{x}}\in \mathbb{R}^D\)
- Center each vector observation \(\mathbf{\tilde{x}}_n=\mathbf{x}_n-\overline{\mathbf{x}}\)
- Compute the covariance matrix \[ \mathrm{S}=\frac{1}{N} \sum_{n=1}^{N} \mathbf{\tilde{x}}_n \mathbf{\tilde{x}}_n^T \]
- Compute the eigenvectors of \(\mathrm{S}\) and sort them from the one associated with the highest eigenvalue to the one associated with the lowest eigenvalue.
\[ \text{eigenvectors} \quad \equiv \quad \text{Principal components} \]
Note that the \(D\) principal components \(\mathrm{U}=\lbrack\mathbf{u}_1, \mathbf{u}_2, \cdots, \mathbf{u}_D\rbrack\) of the covariance matrix \(\mathrm{S}\) are sorted such that:
\[
\lambda_{1}\geq \lambda_{2} \geq \cdots \geq \lambda_{D} \geq 0
\] with \(\lambda_n\) the eigenvalue associated with eigenvector \(\mathbf{u}_n\), \(\forall n\). A NumPy sketch of this procedure is given below.
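A minimal NumPy sketch of the procedure above (the function name `pca` and the test data are illustrative; `np.linalg.eigh` is used because \(\mathrm{S}\) is symmetric, and its ascending eigenvalues are re-sorted to match the ordering above):

```python
import numpy as np

def pca(X):
    """PCA of a D x N data matrix X (one observation per column).

    Returns the mean (D,), the eigenvalues sorted in decreasing order (D,)
    and the principal components U = [u_1, ..., u_D] as columns (D, D).
    """
    D, N = X.shape
    x_bar = X.mean(axis=1, keepdims=True)   # mean vector
    X_tilde = X - x_bar                     # centered observations
    S = (X_tilde @ X_tilde.T) / N           # covariance matrix (D x D)
    eigvals, eigvecs = np.linalg.eigh(S)    # ascending order for symmetric S
    order = np.argsort(eigvals)[::-1]       # sort from largest to smallest
    return x_bar[:, 0], eigvals[order], eigvecs[:, order]

# Example usage on random data
rng = np.random.default_rng(3)
X = rng.normal(size=(5, 200))
mean, lambdas, U = pca(X)
print(lambdas)     # lambda_1 >= lambda_2 >= ... >= lambda_D >= 0
print(U.T @ U)     # approximately the identity (orthonormal components)
```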
4.2 Using PCA
Each vector in the set can be written as a linear combination of the mean and the eigenvectors: \[ \mathbf{x} = \overline{\mathbf{x}} +\sum_{j=1}^D \alpha_j\ \mathbf{u}_j \] where \(\alpha_j=\mathbf{u}_j^T(\mathbf{x}-\overline{\mathbf{x}})\), since the eigenvectors form an orthonormal basis of \(\mathbb{R}^D\). The principal components associated with the \(d\) highest eigenvalues provide a good reconstruction: \[ \mathbf{x} = \overline{\mathbf{x}} +\underbrace{\sum_{j=1}^d \alpha_j\ \mathbf{u}_j}_{\text{reconstruction}} +\underbrace{\sum_{j=d+1}^D \alpha_j\ \mathbf{u}_j}_{\text{reconstruction error} } \]
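A minimal NumPy sketch of the reconstruction with the top \(d\) components (data, dimensions and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
D, N, d = 10, 300, 3
X = rng.normal(size=(D, N)) * np.linspace(3.0, 0.1, D)[:, None]  # decaying variances

x_bar = X.mean(axis=1, keepdims=True)
X_tilde = X - x_bar
S = (X_tilde @ X_tilde.T) / N
eigvals, U = np.linalg.eigh(S)
U = U[:, ::-1]                              # principal components, largest variance first

x = X[:, 0]                                 # a vector to reconstruct
alpha = U.T @ (x - x_bar[:, 0])             # coefficients alpha_j = u_j^T (x - x_bar)
x_hat = x_bar[:, 0] + U[:, :d] @ alpha[:d]  # reconstruction with the top d components

error = np.linalg.norm(x - x_hat)           # norm of the reconstruction error term
print(d, error)
```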
5 Applications
- Compression / Dimensionality reduction
- Visualisation of data distribution
- Example: PCA applied to images of faces: https://en.wikipedia.org/wiki/Eigenface
- see Python examples
6 Questions
What is the dimension of the matrix \(\tilde{\mathrm{X}}\)?
What is the dimension of the covariance matrix \(\mathrm{S}\)?
What are the properties of the covariance matrix \(\mathrm{S}\)?