Principal Component Analysis
PCA
1 Introduction
These notes follow the notation used in the recommended reading: Section 12.1.1 of Pattern Recognition and Machine Learning by Christopher Bishop.
Appendix C of the same book also provides a refresher on linear algebra (e.g. eigenvectors of matrices).
Some of the results presented here use the following linear algebra formulas for differentiating w.r.t. a vector \(\mathbf{x}\), where \(\mathrm{A}\) is a matrix that does not depend on \(\mathbf{x}\):
\[\frac{\partial (\mathrm{A}\mathbf{x})}{\partial \mathbf{x}}=\mathrm{A}^{T} \quad (\diamond) \]
\[\frac{\partial (\mathbf{x}^{T}\mathrm{A})}{\partial \mathbf{x}}=\mathrm{A}\quad (\diamond^2) \] \[\frac{\partial (\mathbf{x}^{T}\mathbf{x})}{\partial \mathbf{x}}=2 \mathbf{x} \quad (\diamond^3) \]
\[\frac{\partial (\mathbf{x}^{T}\mathrm{A}\mathbf{x})}{\partial \mathbf{x}}=\mathrm{A} \mathbf{x}+\mathrm{A}^{T} \mathbf{x}\quad (\diamond^4)\]
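As a quick sanity check of \((\diamond^4)\), here is a minimal NumPy sketch comparing the analytical expression with a central finite-difference approximation (the dimensions and random seed are illustrative):

```python
import numpy as np

# Numerical check of (diamond^4): d(x^T A x)/dx = A x + A^T x
rng = np.random.default_rng(0)
D = 4
A = rng.normal(size=(D, D))   # A does not depend on x
x = rng.normal(size=D)

analytic = A @ x + A.T @ x    # right-hand side of (diamond^4)

eps = 1e-6
numeric = np.zeros(D)
for i in range(D):
    e = np.zeros(D)
    e[i] = eps
    # central difference of f(x) = x^T A x along coordinate i
    numeric[i] = ((x + e) @ A @ (x + e) - (x - e) @ A @ (x - e)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))  # expected: True
```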
- In addition, Lagrange multipliers are used in the derivation of PCA.
2 Definitions
2.1 Data
Consider that we have a set of vectors \(\lbrace \mathbf{x}_n \rbrace _{n=1,\cdots ,N}\) in \(\mathbb{R}^D\). This means that any vector \(\mathbf{x}_n\) has \(D\) coordinates as follows: \[ \mathbf{x}_n = \left\lbrack \begin{array}{c} x_n[1]\\ x_n[2]\\ \vdots\\ x_n[D]\\ \end{array} \right\rbrack \]
2.2 Mean
We can define the mean \(\overline{\mathbf{x}}\) such that \[ \overline{\mathbf{x}}=\frac{1}{N}\sum_{n=1}^{N} \mathbf{x}_n \] Spatially, the mean can be understood as the center of gravity of the cloud of points \(\lbrace \mathbf{x}_n \rbrace _{n=1,\cdots,N}\).
\(\overline{\mathbf{x}}\) is also a vector of dimension \(D\): \[ \overline{\mathbf{x}} = \left\lbrack \begin{array}{c} \frac{1}{N} \sum_{n=1}^N x_n[1]\\ \frac{1}{N} \sum_{n=1}^N x_n[2]\\ \vdots\\ \frac{1}{N} \sum_{n=1}^N x_n[D]\\ \end{array} \right\rbrack \]
2.3 Covariance matrix
The covariance matrix \(\mathrm{S}\) is defined as: \[ \mathrm{S}=\frac{1}{N} \sum_{n=1}^{N} (\mathbf{x}_n-\overline{\mathbf{x}}) (\mathbf{x}_n-\overline{\mathbf{x}})^{T} \]
Defining \(\mathbf{\tilde{x}}_n=\mathbf{x}_n-\mathbf{\overline{x}}\) for all \(n\), we have \[ \mathrm{S}=\frac{1}{N} \sum_{n=1}^{N} \mathbf{\tilde{x}}_n \ \mathbf{\tilde{x}}_n^T \] or, written entry by entry, \[ \mathrm{S}=\frac{1}{N} \sum_{n=1}^N \left \lbrack \begin{array}{cccc} (\tilde{x}_{n}[1])^2 & \tilde{x}_{n}[1]\, \tilde{x}_{n}[2] &\cdots & \tilde{x}_{n}[1]\, \tilde{x}_{n}[D]\\ \tilde{x}_{n}[2]\, \tilde{x}_{n}[1] & (\tilde{x}_{n}[2])^2 & & \vdots\\ \vdots & &\ddots& \\ \tilde{x}_{n}[D]\, \tilde{x}_{n}[1] & \cdots & & (\tilde{x}_{n}[D])^2 \\ \end{array} \right\rbrack \]
Defining the \(D\times N\) matrix \(\mathrm{\tilde{X}}=[\mathbf{\tilde{x}}_{1},\cdots,\mathbf{\tilde{x}}_{N}]\), the covariance matrix can be written as \[ \mathrm{S}=\frac{1}{N} \mathrm{\tilde{X}}\mathrm{\tilde{X}}^T \] The covariance matrix \(\mathrm{S}\) is of size \(D\times D\).
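These definitions translate directly into NumPy. Here is a minimal sketch, assuming the observations are stacked column-wise in a \(D\times N\) array (variable names are illustrative); the result matches `np.cov` with `bias=True`, which also uses the \(1/N\) normalisation:

```python
import numpy as np

rng = np.random.default_rng(1)
D, N = 3, 100
X = rng.normal(size=(D, N))            # N observations of dimension D, one per column

x_bar = X.mean(axis=1, keepdims=True)  # mean vector, shape (D, 1)
X_tilde = X - x_bar                    # centered observations, shape (D, N)
S = (X_tilde @ X_tilde.T) / N          # covariance matrix, shape (D, D)

print(np.allclose(S, np.cov(X, bias=True)))  # expected: True
```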
3 Maximum variance formulation
Consider unit vector \(\mathbf{u}_1\) (i.e. such that \(\|\mathbf{u}_1\|^2=\mathbf{u}_1^T\mathbf{u}_1=1\)).
The variance of the projected data is: \[ \frac{1}{N}\sum_{n=1}^N \left( \mathbf{u}_1^T\mathbf{x}_n - \mathbf{u}_1^T\overline{\mathbf{x}} \right)^2 = \frac{1}{N}\sum_{n=1}^N \mathbf{u}_1^T (\mathbf{x}_n-\overline{\mathbf{x}})(\mathbf{x}_n-\overline{\mathbf{x}})^T \mathbf{u}_1 = \mathbf{u}_1^T \mathrm{S}\ \mathbf{u}_1 \] We want to find the vector \(\mathbf{u}_1\) such that:
\[
\hat{\mathbf{u}}_1=\arg \max_{\mathbf{u}_1} \lbrace \mathbf{u}_1^T
\mathrm{S}\ \mathbf{u}_1\rbrace \quad \text{subject to} \quad
\|\mathbf{u}_1 \|^2=1
\] Introducing the Lagrange multiplier \(\lambda_1\), this is the same as
optimising:
\[
\hat{\mathbf{u}}_1,\hat{\lambda}_1=\arg
\max_{\mathbf{u}_1,\lambda_1}\ \lbrace \
\mathcal{E}(\mathbf{u}_1,\lambda_1)=\mathbf{u}_1^T \mathrm{S}\
\mathbf{u}_1 +\lambda_{1} \ (1-\mathbf{u}_1^T\mathbf{u}_1 ) \ \rbrace
\] This means that we are looking for the vector \(\mathbf{u}_1\) that maximizes the energy
function \(\mathcal{E}\).
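As a numerical illustration (not part of the derivation), this constrained maximisation can be carried out by projected gradient ascent: take a step along the gradient \(2\,\mathrm{S}\mathbf{u}_1\) of the objective, then renormalise to unit length. A minimal NumPy sketch with illustrative data and step size; it converges to the same answer derived analytically in the next subsection:

```python
import numpy as np

rng = np.random.default_rng(5)
D, N = 4, 400
X = rng.normal(size=(D, N)) * np.array([[2.0], [1.0], [0.5], [0.2]])  # anisotropic cloud
X_tilde = X - X.mean(axis=1, keepdims=True)
S = (X_tilde @ X_tilde.T) / N

# Projected gradient ascent on u^T S u subject to ||u|| = 1:
# gradient step (gradient is 2 S u), then renormalisation.
u = rng.normal(size=D)
u /= np.linalg.norm(u)
step = 0.1
for _ in range(1000):
    u = u + step * 2 * (S @ u)
    u /= np.linalg.norm(u)

print(u @ S @ u)                  # maximised projected variance
print(np.linalg.eigvalsh(S)[-1])  # largest eigenvalue of S (should match)
```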
3.1 Finding the first principal component
The solution is such that the derivatives of \(\mathcal{E}\) are zero, so we need to find \((\mathbf{u}_1,\lambda_1)\) such that:
\[ \frac{\partial\mathcal{E}}{\partial \mathbf{u}_1}=0 \quad (\star) \] and \[ \frac{\partial\mathcal{E}}{\partial \lambda_1}=0 \quad (\star \star) \]
3.1.1 Solving \((\star\star)\)
\(\lambda_1\) is a scalar, so: \[ \frac{\partial\mathcal{E}}{\partial \lambda_1}= (1-\mathbf{u}_1^T\mathbf{u}_1 )=0 \] Differentiating w.r.t. the Lagrange multiplier \(\lambda_1\) thus recovers the constraint \[ \mathbf{u}_1^T\mathbf{u}_1=\|\mathbf{u}_1\|^2=1 \]
3.1.2 Solving \((\star)\)
The derivative w.r.t. the vector \(\mathbf{u}_1\) can be computed using the linear algebra formulas provided in the introduction: \[ \frac{\partial\mathcal{E}}{\partial \mathbf{u}_1}=\underbrace{\frac{\partial (\mathbf{u}_1^T \mathrm{S}\ \mathbf{u}_1 )}{\partial \mathbf{u}_1}}_{(a)} -\lambda_{1} \ \underbrace{\frac{\partial (\mathbf{u}_1^T\mathbf{u}_1 )}{\partial \mathbf{u}_1}}_{(b)}+\lambda_{1} \ \underbrace{\frac{\partial (1)}{\partial \mathbf{u}_1}}_{=0} \]
To compute (a), use formula \((\diamond^4)\):
\[ \frac{\partial (\mathbf{u}_1^T \mathrm{S}\ \mathbf{u}_1 )}{\partial \mathbf{u}_1} = \mathrm{S}\ \mathbf{u}_1 +\mathrm{S}^T\ \mathbf{u}_1=2\ \mathrm{S}\ \mathbf{u}_1 \] because the covariance matrix \(\mathrm{S}\) is a \(D\times D\) symmetric matrix hence \(\mathrm{S}^T=\mathrm{S}\).
To compute (b), use formula \((\diamond^3)\):
\[ \frac{\partial (\mathbf{u}_1^T\mathbf{u}_1 )}{\partial \mathbf{u}_1}=2\ \mathbf{u}_1 \]
3.1.3 Solution
Setting \((\star)\) to zero gives \(2\,\mathrm{S}\,\mathbf{u}_1-2\,\lambda_1\,\mathbf{u}_1=0\), so the vector \(\mathbf{u}_1\) that maximizes the variance is an eigenvector of the covariance matrix \(\mathrm{S}\) with eigenvalue \(\lambda_1\): \[ \mathrm{S}\ \mathbf{u}_1=\lambda_1\ \mathbf{u}_1 \]
Left-multiplying by \(\mathbf{u}_1^T\) and using \(\mathbf{u}_1^T\mathbf{u}_1=1\) shows that the eigenvalue \(\lambda_1\) is also the variance of the projections: \[ \mathbf{u}_1^T\mathrm{S}\mathbf{u}_1=\lambda_1 \] Since this is the quantity being maximized, \(\lambda_1\) is the highest eigenvalue of \(\mathrm{S}\).
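This result can be checked numerically. Here is a minimal NumPy sketch with illustrative data, using `np.linalg.eigh` (which returns the eigenvalues of a symmetric matrix in ascending order):

```python
import numpy as np

rng = np.random.default_rng(2)
D, N = 3, 500
X = rng.normal(size=(D, N)) * np.array([[3.0], [1.0], [0.5]])  # anisotropic cloud
X_tilde = X - X.mean(axis=1, keepdims=True)
S = (X_tilde @ X_tilde.T) / N

# eigh returns eigenvalues in ascending order for a symmetric matrix
eigvals, eigvecs = np.linalg.eigh(S)
u1, lam1 = eigvecs[:, -1], eigvals[-1]   # eigenvector with the largest eigenvalue

# S u1 = lambda1 u1, and u1^T S u1 = lambda1 (variance of the projections)
print(np.allclose(S @ u1, lam1 * u1))    # expected: True
print(np.isclose(u1 @ S @ u1, lam1))     # expected: True

# u1 yields a larger projected variance than a random unit direction
v = rng.normal(size=D)
v /= np.linalg.norm(v)
print(u1 @ S @ u1 >= v @ S @ v)          # expected: True
```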
3.2 How to find the direction with the second largest variance?
Once we have the direction \(\mathbf{u}_1\) with its associated eigenvalue \(\lambda_1\) (the variance of the projections of the centered data on \(\mathbf{u}_1\)), the direction \(\mathbf{u}_2\) with the second largest variance is found by maximising \[ \mathbf{u}_2^T\mathrm{S}\mathbf{u}_2 \] subject to the constraints \(\|\mathbf{u}_2\|=1\) and \(\mathbf{u}_2^T\mathbf{u}_1=0\) (\(\mathbf{u}_2\) is orthogonal to \(\mathbf{u}_1\)).
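A brief sketch of the argument, following the same Lagrangian steps as for \(\mathbf{u}_1\) (the multiplier \(\phi\) for the orthogonality constraint is introduced here and is not part of the notes above):
\[
\mathcal{E}(\mathbf{u}_2,\lambda_2,\phi)=\mathbf{u}_2^T \mathrm{S}\ \mathbf{u}_2 +\lambda_{2}\ (1-\mathbf{u}_2^T\mathbf{u}_2) +\phi\ \mathbf{u}_2^T\mathbf{u}_1
\]
Setting \(\frac{\partial\mathcal{E}}{\partial \mathbf{u}_2}=2\,\mathrm{S}\mathbf{u}_2-2\,\lambda_2\mathbf{u}_2+\phi\,\mathbf{u}_1=0\) and left-multiplying by \(\mathbf{u}_1^T\) gives \(\phi=0\) (using \(\mathbf{u}_1^T\mathbf{u}_1=1\), \(\mathbf{u}_1^T\mathbf{u}_2=0\) and \(\mathbf{u}_1^T\mathrm{S}\mathbf{u}_2=\lambda_1\mathbf{u}_1^T\mathbf{u}_2=0\)). Hence \(\mathrm{S}\,\mathbf{u}_2=\lambda_2\,\mathbf{u}_2\): \(\mathbf{u}_2\) is the eigenvector of \(\mathrm{S}\) associated with the second largest eigenvalue \(\lambda_2\).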
4 Principal components
4.1 Computing PCA
From a set of vectors \(\lbrace \mathbf{x}_n \rbrace_{n=1,\cdots,N}\)
- Compute the mean vector \(\overline{\mathbf{x}}\in \mathbb{R}^D\)
- Center each vector observation \(\mathbf{\tilde{x}}_n=\mathbf{x}_n-\overline{\mathbf{x}}\)
- Compute the covariance matrix \[ \mathrm{S}=\frac{1}{N} \sum_{n=1}^{N} \mathbf{\tilde{x}}_n \mathbf{\tilde{x}}_n^T \]
- Compute the eigenvectors of \(\mathrm{S}\) and sort them from the one associated with the highest eigenvalue to the one associated with the lowest eigenvalue.
\[ \text{eigenvectors} \quad \equiv \quad \text{Principal components} \]
Note that the \(D\) principal components \(\mathrm{U}=\lbrack\mathbf{u}_1, \mathbf{u}_2, \cdots, \mathbf{u}_D\rbrack\) of the covariance matrix \(\mathrm{S}\) are sorted such that:
\[
\lambda_{1}\geq \lambda_{2} \geq \cdots \geq \lambda_{D} \geq 0
\] with \(\lambda_n\) the eigenvalue associated with eigenvector \(\mathbf{u}_n\), \(\forall n\). A NumPy sketch of this procedure is given below.
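A minimal NumPy sketch of the procedure above (the function name `pca` and the test data are illustrative; `np.linalg.eigh` is used because \(\mathrm{S}\) is symmetric, and its ascending eigenvalues are re-sorted to match the ordering above):

```python
import numpy as np

def pca(X):
    """PCA of a D x N data matrix X (one observation per column).

    Returns the mean (D,), the eigenvalues sorted in decreasing order (D,)
    and the principal components U = [u_1, ..., u_D] as columns (D, D).
    """
    D, N = X.shape
    x_bar = X.mean(axis=1, keepdims=True)   # mean vector
    X_tilde = X - x_bar                     # centered observations
    S = (X_tilde @ X_tilde.T) / N           # covariance matrix (D x D)
    eigvals, eigvecs = np.linalg.eigh(S)    # ascending order for symmetric S
    order = np.argsort(eigvals)[::-1]       # sort from largest to smallest
    return x_bar[:, 0], eigvals[order], eigvecs[:, order]

# Example usage on random data
rng = np.random.default_rng(3)
X = rng.normal(size=(5, 200))
mean, lambdas, U = pca(X)
print(lambdas)     # lambda_1 >= lambda_2 >= ... >= lambda_D >= 0
print(U.T @ U)     # approximately the identity (orthonormal components)
```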
4.2 Using PCA
Each vector in the set can be written as a linear combination of the mean and the eigenvectors: \[ \mathbf{x} = \overline{\mathbf{x}} +\sum_{j=1}^D \alpha_j\ \mathbf{u}_j \] where \(\alpha_j=\mathbf{u}_j^T(\mathbf{x}-\overline{\mathbf{x}})\), since the eigenvectors form an orthonormal basis of \(\mathbb{R}^D\). The principal components associated with the \(d\) highest eigenvalues provide a good reconstruction: \[ \mathbf{x} = \overline{\mathbf{x}} +\underbrace{\sum_{j=1}^d \alpha_j\ \mathbf{u}_j}_{\text{reconstruction}} +\underbrace{\sum_{j=d+1}^D \alpha_j\ \mathbf{u}_j}_{\text{reconstruction error} } \]
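A minimal NumPy sketch of the reconstruction with the top \(d\) components (data, dimensions and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
D, N, d = 10, 300, 3
X = rng.normal(size=(D, N)) * np.linspace(3.0, 0.1, D)[:, None]  # decaying variances

x_bar = X.mean(axis=1, keepdims=True)
X_tilde = X - x_bar
S = (X_tilde @ X_tilde.T) / N
eigvals, U = np.linalg.eigh(S)
U = U[:, ::-1]                              # principal components, largest variance first

x = X[:, 0]                                 # a vector to reconstruct
alpha = U.T @ (x - x_bar[:, 0])             # coefficients alpha_j = u_j^T (x - x_bar)
x_hat = x_bar[:, 0] + U[:, :d] @ alpha[:d]  # reconstruction with the top d components

error = np.linalg.norm(x - x_hat)           # norm of the reconstruction error term
print(d, error)
```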
5 Applications
- Compression / Dimensionality reduction
- Visualisation of data distribution
- Example: PCA applied to images of faces: https://en.wikipedia.org/wiki/Eigenface
- see Python examples
6 Questions
What is the dimension of the matrix \(\tilde{\mathrm{X}}\)?
What is the dimension of the covariance matrix \(\mathrm{S}\)?
What are the properties of the covariance matrix \(\mathrm{S}\)?