DFFS and DIFS

PCA

1 Introduction

In this lecture, we look at how to use eigenspaces defined by the principal components computed with PCA to define two measures:

  • Difference-From-Feature-Space (DFFS)

  • Distance-In-Feature-Space (DIFS)

both of which are introduced in Probabilistic visual learning for object representation by B. Moghaddam and A. Pentland (PAMI 1997).

2 Representation with Principal Components

Having applied PCA to a dataset \(\lbrace\mathbf{x}_n\rbrace_{n=1,\cdots,N}\) with \(\forall n, \mathbf{x}_n \in \mathbb{R}^D\), we denote by \(\overline{\mathbf{x}}\in \mathbb{R}^D\) the mean vector and by \(\mathrm{U}=\lbrack\mathbf{u}_1, \mathbf{u}_2, \cdots, \mathbf{u}_D\rbrack\) the matrix whose columns are the \(D\) eigenvectors (principal components) of the covariance matrix \(\mathrm{S}\), sorted such that:
\[ \lambda_{1}\geq \lambda_{2} \geq \cdots \geq \lambda_{D} \geq 0 \] with \(\lambda_n\) the eigenvalue associated with eigenvector \(\mathbf{u}_n\), \(\forall n\).

Because \(\mathbf{u}_i^T\mathbf{u}_j=0, \forall i\neq j\) (orthogonality) and \(\mathbf{u}_n^T\mathbf{u}_n=1,\forall n\) (unit norm), the \(D\) eigenvectors form an orthonormal basis of the space \(\mathbb{R}^D\).

Any vector \(\mathbf{x}\in \mathbb{R}^D\) can be expressed in that basis of eigenvectors: \[ \mathbf{x}= \overline{\mathbf{x}} +\sum_{i=1}^D \alpha_i \ \mathbf{u}_{i} \] or, with the notation \(\tilde{\mathbf{x}}=\mathbf{x}-\overline{\mathbf{x}}\), \[ \tilde{\mathbf{x}}= \sum_{i=1}^D \alpha_i \ \mathbf{u}_{i} \] with the coordinates \(\alpha_i=(\mathbf{x}-\overline{\mathbf{x}})^T \mathbf{u}_i\), \(\forall i=1,\cdots, D\).

The vector of coordinates \(\pmb{\alpha}=[\alpha_1, \cdots, \alpha_D]^T\) represents \(\mathbf{x}\) in a space centered on the mean \(\overline{\mathbf{x}}\) and with the orthonormal basis defined by the eigenvectors \(\lbrace\mathbf{u}_i\rbrace_{i=1,\cdots,D}\):

\[\mathbf{x}= \left\lbrack\begin{array}{c} x_1\\ x_2\\ \vdots\\ x_D\\ \end{array}\right\rbrack=\left\lbrack\begin{array}{c} \overline{x}_1\\ \overline{x}_2\\ \vdots\\ \overline{x}_D\\ \end{array}\right\rbrack + \sum_{i=1}^D \alpha_i \ \mathbf{u}_i \] or \[\mathbf{x} =\underbrace{ \left\lbrack\begin{array}{c} \overline{x}_1\\ \overline{x}_2\\ \vdots\\ \overline{x}_D\\ \end{array}\right\rbrack}_{\overline{\mathbf{x}}} + \alpha_1 \underbrace{ \left\lbrack\begin{array}{c} u_{1,1}\\ u_{1,2}\\ \vdots\\ u_{1,D}\\ \end{array}\right\rbrack}_{\mathbf{u}_1} + \alpha_2 \underbrace{\left\lbrack\begin{array}{c} u_{2,1}\\ u_{2,2}\\ \vdots\\ u_{2,D}\\ \end{array}\right\rbrack}_{\mathbf{u}_2} +\ ...\ + \alpha_D \underbrace{\left\lbrack\begin{array}{c} u_{D,1}\\ u_{D,2}\\ \vdots\\ u_{D,D}\\ \end{array}\right\rbrack}_{\mathbf{u}_D} \]

This representation with \(\pmb{\alpha}\) is more useful than the original representation of \(\mathbf{x}\) in the space centered on the origin \(\pmb{0}_{D}\) with the standard basis:
\[ \mathbf{x}= \left\lbrack\begin{array}{c} x_1\\ x_2\\ \vdots\\ x_D\\ \end{array}\right\rbrack= \left\lbrack\begin{array}{c} 0\\ 0\\ \vdots\\ 0\\ \end{array}\right\rbrack +\sum_{i=1}^D x_i \ \mathbf{e}_i = \underbrace{ \left\lbrack\begin{array}{c} 0\\ 0\\ \vdots\\ 0\\ \end{array}\right\rbrack}_{\pmb{0}_D} +x_1 \underbrace{ \left\lbrack\begin{array}{c} 1\\ 0\\ \vdots\\ 0\\ \end{array}\right\rbrack}_{\mathbf{e}_1} + x_2 \underbrace{\left\lbrack\begin{array}{c} 0\\ 1\\ \vdots\\ 0\\ \end{array}\right\rbrack}_{\mathbf{e}_2} +\ ...\ + x_D \underbrace{\left\lbrack\begin{array}{c} 0\\ 0\\ \vdots\\ 1\\ \end{array}\right\rbrack}_{\mathbf{e}_D} \]
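As an illustration, here is a minimal NumPy sketch of this change of basis; the toy data and the variable names (`X`, `x_bar`, `U`, `lam`, `alpha`) are ours and not part of the original paper.

```python
import numpy as np

# Toy dataset: N samples of dimension D (synthetic, for illustration only).
rng = np.random.default_rng(0)
N, D = 200, 5
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))   # correlated samples

# PCA: mean vector and eigendecomposition of the sample covariance matrix S.
x_bar = X.mean(axis=0)                     # mean vector (D,)
S = np.cov(X, rowvar=False)                # D x D covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)       # eigh: for symmetric matrices
order = np.argsort(eigvals)[::-1]          # sort by decreasing eigenvalue
lam = eigvals[order]                       # lambda_1 >= ... >= lambda_D
U = eigvecs[:, order]                      # columns of U are u_1, ..., u_D

# Coordinates alpha_i = (x - x_bar)^T u_i of one vector x in the eigenbasis.
x = X[0]
alpha = U.T @ (x - x_bar)

# Reconstruction check: x = x_bar + sum_i alpha_i u_i
assert np.allclose(x, x_bar + U @ alpha)
```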

3 PCA Whitening

PCA Whitening corresponds to normalising (dividing) each coordinate \(\alpha_i\) by its standard deviation, the square root of the corresponding eigenvalue:

\[ \alpha_i^W = \frac{\alpha_i}{\sqrt{\lambda_i}} \]
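Continuing the NumPy sketch above, whitening amounts to one extra division (assuming all eigenvalues are strictly positive):

```python
# PCA whitening: divide each coordinate by its standard deviation sqrt(lambda_i).
alpha_w = alpha / np.sqrt(lam)

# Over the whole dataset, each whitened coordinate has sample variance ~ 1.
A = (X - x_bar) @ U                        # coordinates of all samples (N x D)
A_w = A / np.sqrt(lam)
print(A_w.var(axis=0, ddof=1))             # approximately [1, 1, ..., 1]
```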

See Whitening transformation on Wikipedia for additional information.

Whitening (or batch normalisation) is a standard operation used in Neural Networks (e.g. see Decorrelated Batch Normalization (CVPR 2018)).

4 DFFS and DIFS

A subspace \(F\) is created using the first \(d\) principal components, those associated with the \(d\) highest eigenvalues, with: \[ 0\leq d \leq D \] The remaining subspace, noted \(F^{\perp}\), is spanned by the \((D-d)\) remaining eigenvectors.

\[ \mathbb{R}^D= F \oplus F^{\perp} \] and \[F \cap F^{\perp} =\lbrace 0 \rbrace \] Any vector \(\tilde{\mathbf{x}}\in\mathbb{R}^D\) centered on the mean \(\overline{\mathbf{x}}\) can be decomposed into a vector in \(F\) and another vector in \(F^{\perp}\): \[ \tilde{\mathbf{x}}= \underbrace{\sum_{i=1}^d \alpha_i \ \mathbf{u_{i}}}_{\tilde{\mathbf{x}}_F\in F} +\underbrace{\sum_{i=d+1}^D \alpha_i \ \mathbf{u_{i}}}_{\tilde{\mathbf{x}}_{F^{\perp}}\in F^{\perp}} \]

Question: Show \(\|\tilde{\mathbf{x}}\|^2=\|\tilde{\mathbf{x}}_F\|^2+\|\tilde{\mathbf{x}}_{F^{\perp}}\|^2\)
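As a sanity check (not a proof), the decomposition and the norm identity above can be verified numerically, reusing the variables of the earlier sketch and a hypothetical choice \(d=2\):

```python
d = 2                                      # hypothetical choice of dim(F)
x_tilde = x - x_bar

x_F     = U[:, :d] @ alpha[:d]             # component of x_tilde in F
x_Fperp = U[:, d:] @ alpha[d:]             # component of x_tilde in F_perp

assert np.allclose(x_tilde, x_F + x_Fperp)
# Norm identity, a consequence of the orthogonality of F and F_perp:
assert np.isclose(x_tilde @ x_tilde, x_F @ x_F + x_Fperp @ x_Fperp)
```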

We define two distances as measures of similarity between any vector \(\mathbf{x}\in \mathbb{R}^D\) and a PCA-learned eigenspace \(F\).

4.1 Distance-In-Feature-Space (DIFS)

In the feature space \(F\), the Mahalanobis distance is often used to define the Distance-In-Feature-Space (DIFS): \[ DIFS(\mathbf{x})=\sum_{i=1}^d \frac{\alpha_i^2}{\lambda_i} \]

4.2 Difference-From-Feature-Space (DFFS)

In \(F^{\perp}\), the Distance From Feature Space (DFFS) is defined as: \[ DFFS(\mathbf{x})=\|\tilde{\mathbf{x}}_{F^{\perp}}\|^2=\sum_{i=d+1}^D \alpha_i^2=\|\tilde{\mathbf{x}}\|^2- \|\mathbf{\tilde{x}}_{F}\|^2\]

In practice, only the first \(d\) coordinates \(\lbrace \alpha_i\rbrace_{i=1,\cdots,d}\) are computed; these are used to compute DIFS and \(\|\tilde{\mathbf{x}}_{F}\|^2\), and DFFS is then obtained as \(\|\tilde{\mathbf{x}}\|^2- \|\tilde{\mathbf{x}}_{F}\|^2\).
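A possible implementation, continuing the NumPy sketch above (the function name `difs_dffs` is ours), computes both quantities from the first \(d\) coordinates only:

```python
def difs_dffs(x, x_bar, U, lam, d):
    """DIFS and DFFS of x w.r.t. the eigenspace F spanned by the first d
    eigenvectors (columns of U), using only the first d coordinates."""
    x_tilde = x - x_bar
    alpha_d = U[:, :d].T @ x_tilde                  # first d coordinates only
    difs = np.sum(alpha_d**2 / lam[:d])             # Mahalanobis distance in F
    dffs = x_tilde @ x_tilde - np.sum(alpha_d**2)   # ||x_tilde||^2 - ||x_F||^2
    return difs, dffs

difs, dffs = difs_dffs(x, x_bar, U, lam, d=2)
```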

4.3 Choosing \(d=\dim(F)\)

The dimension \(d\) of the space \(F\) can be chosen using the percentage of explained variance: \[ \text{Explained variance}(d)=100\times\frac{\sum_{i=1}^d \lambda_{i}}{\sum_{i=1}^D \lambda_{i}} \] \(F\) can be set (and \(d\) chosen) so that, for instance, 90% of the variance is explained.
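For instance, a small helper (ours, continuing the sketch above) can pick the smallest \(d\) reaching a target percentage of explained variance:

```python
def choose_d(lam, target=90.0):
    """Smallest d such that the first d eigenvalues explain at least
    `target` percent of the total variance."""
    explained = 100.0 * np.cumsum(lam) / np.sum(lam)   # increasing in d
    return int(np.searchsorted(explained, target)) + 1

d = choose_d(lam, target=90.0)
```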

5 Relation to Multivariate Normal Distribution

The Multivariate Normal Distribution is a probability density function that can be used to describe the distribution of a random vector \(\mathbf{x}\) in \(\mathbb{R}^D\).

Only two parameters are needed to define the Multivariate Normal:

  • \(\pmb{\mu}\) its mean

  • \(\Sigma\) its covariance matrix

Given a set of observations (dataset \(\lbrace \mathbf{x}_n\rbrace_{n=1,\cdots,N}\)) of the random vector \(\mathbf{x}\):

  • \(\pmb{\mu}\) can be estimated by its sample mean \(\overline{\mathbf{x}}\),

  • \(\Sigma\) can be estimated by the covariance matrix \(\mathrm{S}\) computed with the dataset.

\[ p(\mathbf{x})=\frac{1}{(2\pi)^{D/2}\ \det(\mathrm{S})^{1/2}} \exp\left(-\frac{1}{2} (\mathbf{x}-\overline{\mathbf{x}})^T\mathrm{S}^{-1} (\mathbf{x}-\overline{\mathbf{x}}) \right) \]

Since \(\mathrm{S}=\mathrm{U}\Lambda\mathrm{U}^T\) with \(\mathrm{U}=[\mathbf{u}_1,\cdots,\mathbf{u}_D]\) (eigenvectors as columns) and \(\Lambda\) the diagonal matrix of the eigenvalues: \[ \Lambda= \left\lbrack \begin{array}{cccc} \lambda_1&0&\cdots&0\\ 0&\lambda_2&&\vdots\\ \vdots&&\ddots&0\\ 0&\cdots&0&\lambda_D\\ \end{array} \right\rbrack \] we have (using the fact that \(\mathrm{U}\) is an orthogonal matrix, so \(\mathrm{U}^{-1}=\mathrm{U}^T\)): \[ \mathrm{S}^{-1}=\mathrm{U}\Lambda^{-1}\mathrm{U}^T \] so that, with \(\pmb{\alpha}=\mathrm{U}^T(\mathbf{x}-\overline{\mathbf{x}})\) the vector of coordinates defined above: \[ \begin{array}{lcl} (\mathbf{x}-\overline{\mathbf{x}})^T\mathrm{S}^{-1} (\mathbf{x}-\overline{\mathbf{x}}) &=&(\mathbf{x}-\overline{\mathbf{x}})^T\mathrm{U}\Lambda^{-1}\mathrm{U}^T (\mathbf{x}-\overline{\mathbf{x}})\\ &=& \lbrack\mathrm{U}^T(\mathbf{x}-\overline{\mathbf{x}})\rbrack^T\Lambda^{-1}\lbrack\mathrm{U}^T (\mathbf{x}-\overline{\mathbf{x}})\rbrack\\ &=&\pmb{\alpha}^T\Lambda^{-1}\pmb{\alpha}\\ &=&\sum_{i=1}^D \frac{\alpha_i^2}{\lambda_i}\\ &\simeq& \underbrace{\sum_{i=1}^d \frac{\alpha_i^2}{\lambda_i}}_{DIFS} +\frac{1}{\rho}\ \underbrace{\sum_{i=d+1}^D \alpha_i^2}_{DFFS}\\ \end{array} \]

The term \((\mathbf{x}-\overline{\mathbf{x}})^T\mathrm{S}^{-1} (\mathbf{x}-\overline{\mathbf{x}})\) is the Mahalanobis distance in the space \(\mathbb{R}^D\), and it can be approximated as a weighted sum of DIFS and DFFS. The parameter \(\rho\) can be set to the average of the eigenvalues that were left out: \[ \rho=\frac{1}{D-d} \sum_{i=d+1}^D \lambda_i \]
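Numerically, the approximation can be compared with the exact Mahalanobis distance, reusing the helper and variables of the earlier sketches (the choice \(d=2\) is again hypothetical):

```python
d = 2                                           # hypothetical choice of dim(F)
rho = lam[d:].mean()                            # average of the left-out eigenvalues
difs, dffs = difs_dffs(x, x_bar, U, lam, d)
approx = difs + dffs / rho                      # DIFS + (1/rho) * DFFS

# Exact Mahalanobis distance for comparison.
exact = (x - x_bar) @ np.linalg.inv(S) @ (x - x_bar)
print(approx, exact)
```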

See the Wikipedia example of a Multivariate Normal Distribution in \(\mathbb{R}^2\).