Component Analysis and Discriminants (0) - PCA
Introduction
In this article, I survey the component-analysis-type (or dimension-reduction) learning methods such as PCA (Principal Component Analysis), LDA (Linear Discriminant Analysis), FLD (Fisher Linear Discriminant), MDA (Multivariate Discriminant Analysis), and CCA (Canonical Correlation Analysis). The description follows chronological order: each time we move to the next methodology, the conditions of the problems we face become more complicated. There are many articles about these methodologies, but I could not find any report that focuses on the relationships between the methods. I want to know why our predecessors developed them, from PCA to CCA and other more complicated methods. My final goal is to set up a unified table of all the methods using the same mathematical notation for easier understanding.
PCA (Principal Component Analysis)
Line Fitting on Sample Points
Before we start to discuss PCA, let's review the typical line fitting problem based on the sum of squared errors. Most of the basic machine learning methods are built on minimizing the sum of squared errors.
Let's consider a regression line $\mathbf{x}_{0}+a\mathbf{e}$ (where $\left\Vert \mathbf{e}\right\Vert =1$) fitted to the sample points $\mathbf{x}_{n}$.
The objective function $f$ to minimize is: \begin{equation} f(\mathbf{x}_{0},\mathbf{e},a_{0},...a_{N})=\sum_{n}^{N}\left\Vert \mathbf{x}_{n}-(\mathbf{x}_{0}+a_{n}\mathbf{e})\right\Vert ^{2} \end{equation} where $\mathbf{x}_{0}+a_{n}\mathbf{e}$ is the closest point to $\mathbf{x}_{n}$ on the regression line.
This sum of squared errors (SSE) is just one of several candidate cost functions; the L1 norm or other norms could also be used. If a specific cost function is given for a problem, we must use it instead of SSE. The reason most standard methods use SSE is that the resulting formulas are easy to handle: SSE is differentiable, so we can derive its optimal parameters in closed form. (cf. in logistic regression we have to resort to numerical optimization.)
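To make the contrast concrete, here is a minimal sketch (assuming NumPy and SciPy are available; the toy data and variable names are my own) showing that the SSE fit has a direct closed-form solution, while an L1 fit needs an iterative numerical optimizer:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.uniform(-5, 5, size=50)
y = 2.0 * x + 1.0 + rng.normal(0, 1, size=50)      # toy 1D regression data
A = np.column_stack([x, np.ones_like(x)])          # design matrix [x, 1]

# SSE (L2) cost is differentiable, so its minimizer has a closed form;
# np.linalg.lstsq solves it directly.
w_l2, *_ = np.linalg.lstsq(A, y, rcond=None)

# L1 cost is not differentiable everywhere, so we fall back to a numerical optimizer.
l1_cost = lambda w: np.sum(np.abs(A @ w - y))
w_l1 = minimize(l1_cost, x0=np.zeros(2), method="Nelder-Mead").x

print("closed-form L2 fit:", w_l2)
print("numerical   L1 fit:", w_l1)
```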
I will now check the relationship between the regression line and $\mu$, the mean of the $\mathbf{x}_{n}$.
First, calculate the derivative with respect to $a_{n}$ in terms of $\mathbf{x}_{0},\mathbf{x}_{n},\mathbf{e}$. \begin{equation} f(\mathbf{x}_{0},\mathbf{e},a_{0},...a_{N})=\sum_{n}^{N}\left\Vert a_{n}\mathbf{e}+\mathbf{x}_{0}-\mathbf{x}_{n}\right\Vert ^{2} \end{equation} \begin{eqnarray*} \frac{\partial f}{\partial a_{n}} & = & \frac{\partial\left(\sum_{n}^{N}\left\Vert a_{n}\mathbf{e}+\mathbf{x}_{0}-\mathbf{x}_{n}\right\Vert ^{2}\right)}{\partial a_{n}}=\frac{\partial\left(\left\Vert a_{n}\mathbf{e}+\mathbf{x}_{0}-\mathbf{x}_{n}\right\Vert ^{2}\right)}{\partial a_{n}}\\ & = & \frac{\partial\left(a_{n}^{2}\left\Vert \mathbf{e}\right\Vert ^{2}+2a_{n}\mathbf{e}^{T}(\mathbf{x}_{0}-\mathbf{x}_{n})+\left\Vert \mathbf{x}_{0}-\mathbf{x}_{n}\right\Vert ^{2}\right)}{\partial a_{n}}\\ & = & 2a_{n}+2\mathbf{e}^{T}(\mathbf{x}_{0}-\mathbf{x}_{n}) \end{eqnarray*} We obtain $\underset{a_{n}}{\arg\min}(f)$ from ${\displaystyle \frac{\partial f}{\partial a_{n}}=0}$: \begin{equation} a_{n}=\mathbf{e}^{T}(\mathbf{x}_{n}-\mathbf{x}_{0}) \end{equation}
This means that $a_{n}$ is the projection of $(\mathbf{x}_{n}-\mathbf{x}_{0})$ onto $\mathbf{e}$. Most books skip this proof since it is basic linear algebra.
Now analyze $f$ with respect to $\mathbf{x}_{0}$.
When expanding $f$, I keep $a_{n}$ in place of $\mathbf{e}^{T}(\mathbf{x}_{n}-\mathbf{x}_{0})$ for a while, just to make the calculation easier. \begin{eqnarray*} f(\mathbf{x}_{0},\mathbf{e}) & = & \sum_{n}^{N}\left\Vert \mathbf{x}_{0}-\mathbf{x}_{n}+a_{n}\mathbf{e}\right\Vert ^{2}\\ & = & \sum_{n}^{N}\left(\left\Vert \mathbf{x}_{0}-\mathbf{x}_{n}\right\Vert ^{2}+2(a_{n}\mathbf{e})^{T}(\mathbf{x}_{0}-\mathbf{x}_{n})+\left\Vert a_{n}\mathbf{e}\right\Vert ^{2}\right)\\ & = & \sum_{n}^{N}\left\Vert \mathbf{x}_{0}-\mathbf{x}_{n}\right\Vert ^{2}+2\sum_{n}^{N}a_{n}\overbrace{\left(\mathbf{e}^{T}(\mathbf{x}_{0}-\mathbf{x}_{n})\right)}^{=-a_{n}}+\sum_{n}^{N}a_{n}^{2}\overbrace{\left\Vert \mathbf{e}\right\Vert ^{2}}^{=1}\\ & = & \sum_{n}^{N}\left\Vert \mathbf{x}_{0}\right\Vert ^{2}-2\sum_{n}^{N}\mathbf{x}_{n}^{T}\mathbf{x}_{0}+\sum_{n}^{N}\left\Vert \mathbf{x}_{n}\right\Vert ^{2}-2\sum_{n}^{N}a_{n}(a_{n})+\sum_{n}^{N}a_{n}^{2}\\ & = & \sum_{n}^{N}\left\Vert \mathbf{x}_{0}\right\Vert ^{2}-2\sum_{n}^{N}\mathbf{x}_{n}^{T}\mathbf{x}_{0}+\sum_{n}^{N}\left\Vert \mathbf{x}_{n}\right\Vert ^{2}-\sum_{n}^{N}a_{n}^{2}\\ & = & \sum_{n}^{N}\left\Vert \mathbf{x}_{0}\right\Vert ^{2}-2\sum_{n}^{N}\mathbf{x}_{n}^{T}\mathbf{x}_{0}+\sum_{n}^{N}\left\Vert \mathbf{x}_{n}\right\Vert ^{2}-\sum_{n}^{N}(\mathbf{x}_{0}-\mathbf{x}_{n})^{T}\mathbf{e}\mathbf{e}^{T}(\mathbf{x}_{0}-\mathbf{x}_{n})\\ & = & N\left\Vert \mathbf{x}_{0}\right\Vert ^{2}-2\sum_{n}^{N}\mathbf{x}_{n}^{T}\mathbf{x}_{0}+\sum_{n}^{N}\left\Vert \mathbf{x}_{n}\right\Vert ^{2}-\sum_{n}^{N}\left(\mathbf{x}_{0}^{T}\mathbf{e}\mathbf{e}^{T}\mathbf{x}_{0}-2\mathbf{x}_{n}^{T}\mathbf{e}\mathbf{e}^{T}\mathbf{x}_{0}+\mathbf{x}_{n}^{T}\mathbf{e}\mathbf{e}^{T}\mathbf{x}_{n}\right)\\ & = & N\left\Vert \mathbf{x}_{0}\right\Vert ^{2}-\sum_{n}^{N}\mathbf{x}_{0}^{T}\mathbf{e}\mathbf{e}^{T}\mathbf{x}_{0}-2\sum_{n}^{N}\mathbf{x}_{n}^{T}\mathbf{x}_{0}+2\sum_{n}^{N}\mathbf{x}_{n}^{T}\mathbf{e}\mathbf{e}^{T}\mathbf{x}_{0}+\sum_{n}^{N}\left\Vert \mathbf{x}_{n}\right\Vert ^{2}-\sum_{n}^{N}\mathbf{x}_{n}^{T}\mathbf{e}\mathbf{e}^{T}\mathbf{x}_{n} \end{eqnarray*} Similarly to the case of $a_{n}$, set ${\displaystyle \frac{\partial f}{\partial\mathbf{x}_{0}}=0}$: \begin{equation} \frac{\partial f(\mathbf{x}_{0},\mathbf{e})}{\partial\mathbf{x}_{0}}=2N\mathbf{x}_{0}-2N\mathbf{e}\mathbf{e}^{T}\mathbf{x}_{0}-2\sum_{n}^{N}\mathbf{x}_{n}+2\mathbf{e}\mathbf{e}^{T}\sum_{n}^{N}\mathbf{x}_{n} \end{equation} We do not need a closed-form expression for $\mathbf{x}_{0}$; the implicit form below is enough. \begin{eqnarray*} 0=\frac{\partial f(\mathbf{x}_{0},\mathbf{e})}{\partial\mathbf{x}_{0}} & = & 2N\mathbf{x}_{0}-2N\mathbf{e}\mathbf{e}^{T}\mathbf{x}_{0}-2\sum_{n}^{N}\mathbf{x}_{n}+2\mathbf{e}\mathbf{e}^{T}\sum_{n}^{N}\mathbf{x}_{n}\\ & = & 2N\mathbf{x}_{0}-2\sum_{n}^{N}\mathbf{x}_{n}+2\mathbf{e}\mathbf{e}^{T}\sum_{n}^{N}(\mathbf{x}_{n}-\mathbf{x}_{0})\\ & = & 2N\mathbf{x}_{0}-2\sum_{n}^{N}\mathbf{x}_{n}+2\mathbf{e}\sum_{n}^{N}a_{n} \end{eqnarray*} \begin{equation} \frac{\partial f}{\partial\mathbf{x}_{0}}=0\Longleftrightarrow\frac{1}{N}\sum_{n}^{N}\mathbf{x}_{n}=\mu=\mathbf{x}_{0}+\mathbf{e}\left(\frac{1}{N}\sum_{n}^{N}a_{n}\right) \end{equation} The meaning of Eq.(5) is that the mean must lie on the regression line $\mathbf{x}_{0}+a\mathbf{e}$, which is intuitively reasonable.
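As a sanity check, the following sketch (NumPy and SciPy assumed; the 2D toy data, the angle parameterization of $\mathbf{e}$, and all names are my own choices) minimizes the SSE of Eq.(1) numerically and verifies that the mean $\mu$ lies on the fitted line, as Eq.(5) predicts:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])  # elongated 2D cloud

def sse(params):
    # params = (x0_1, x0_2, theta); e is the unit direction at angle theta.
    x0, theta = params[:2], params[2]
    e = np.array([np.cos(theta), np.sin(theta)])
    a = (X - x0) @ e                       # optimal a_n = e^T (x_n - x0)
    residual = (X - x0) - np.outer(a, e)   # x_n - (x0 + a_n e)
    return np.sum(residual ** 2)

res = minimize(sse, x0=np.array([1.0, -1.0, 0.3]), method="Nelder-Mead")
x0_hat, theta_hat = res.x[:2], res.x[2]
e_hat = np.array([np.cos(theta_hat), np.sin(theta_hat)])

mu = X.mean(axis=0)
d = mu - x0_hat
dist_to_line = np.linalg.norm(d - (d @ e_hat) * e_hat)  # distance from mu to the fitted line
print("distance from mean to fitted line:", dist_to_line)  # approximately 0
```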
This treatment of line fitting is not fully general; it is written this way to make the relationship between PCA and regression easier to describe.
PCA
I will explain PCA as an extension of line fitting. The strategy of PCA is:
Analyze the covariance of the sample vectors and extract a new set of basis vectors that represents the samples in a more "discriminable" way. Then build a smaller basis set by discarding the less discriminative basis vectors. These processes are called "dimension reduction" because the number of new basis vectors is smaller than the original dimension.
Here "discriminability" means larger variance.
See the following figure. After obtaining the new basis vectors $\mathbf{e}_1$ and $\mathbf{e}_2$, we simply ignore $\mathbf{e}_2$. The sample data can then be represented by their projected values on $\mathbf{e}_1$. This process reduces 2D to 1D.
It is reasonable that $\mathbf{e}_1$ is more discriminative than $\mathbf{e}_2$, since $\mathbf{e}_1$ maximizes the new variance, i.e., the variance of the values projected onto it.
$\mathbf{e}_2$ is in turn the most discriminative basis vector of the subspace orthogonal to $\mathbf{e}_1$. In a reduction from 3D to 2D, $\mathbf{e}_2$ would also be kept as a member of the new basis.
Now I will explain the fundamentals of PCA step by step.
The following mathematical description is based on the famous textbook, "Richard O. Duda et al., Pattern Classification 2nd Ed.".
Let's start by considering only one new basis vector for the samples. This is exactly the line fitting problem explained in the previous section.
Rearrange the main formula of the previous section in terms of $\mathbf{e}$ by setting $\mathbf{x}_0=\mu$. I will explain later, using the previous analysis, why $\mathbf{x}_0=\mu$ is a valid choice; for this step, just assume that $\mu$ is the best representative value of the samples' distribution.
\begin{eqnarray*} f(\mathbf{e}) & = & \sum_{n}^{N}\left\Vert a_{n}\mathbf{e}-(\mathbf{x}_{n}-\mu)\right\Vert ^{2}\\ & = & \sum_{n}^{N}\left(\left\Vert a_{n}\mathbf{e}\right\Vert ^{2}-2a_{n}\mathbf{e}^{T}(\mathbf{x}_{n}-\mu)+\left\Vert \mathbf{x}_{n}-\mu\right\Vert ^{2}\right)\\ & = & \sum_{n}^{N}\left(a_{n}^{2}-2a_{n}^{2}+\left\Vert \mathbf{x}_{n}-\mu\right\Vert ^{2}\right)\\ & = & \sum_{n}^{N}\left(-a_{n}^{2}+\left\Vert \mathbf{x}_{n}-\mu\right\Vert ^{2}\right)\\ & = & -\sum_{n}^{N}a_{n}^{2}+\sum_{n}^{N}\left\Vert \mathbf{x}_{n}-\mu\right\Vert ^{2}\\ & = & -\sum_{n}^{N}\mathbf{e}^{T}(\mathbf{x}_{n}-\mu)(\mathbf{x}_{n}-\mu)^{T}\mathbf{e}+\sum_{n}^{N}\left\Vert \mathbf{x}_{n}-\mu\right\Vert ^{2}\\ & = & -N\mathbf{e}^{T}\mathbf{\Sigma}\mathbf{e}+\sum_{n}^{N}\left\Vert \mathbf{x}_{n}-\mu\right\Vert ^{2} \end{eqnarray*}
where $\mathbf{\Sigma}=\frac{1}{N}\sum_{n}^{N}(\mathbf{x}_{n}-\mu)(\mathbf{x}_{n}-\mu)^{T}$ is the covariance matrix.
To minimize $f$, the constraint $\left\Vert \mathbf{e}\right\Vert =1$ must be taken into account. $f$ is converted into the Lagrangian $L$ with a Lagrange multiplier $\lambda$.
\begin{equation} L(\mathbf{e},\lambda)=-N\mathbf{e}^{T}\mathbf{\Sigma}\mathbf{e}+\sum_{n}^{N}\left\Vert \mathbf{x}_{n}-\mu\right\Vert ^{2}+\lambda(\mathbf{e}^{T}\mathbf{e}-1) \end{equation} \begin{equation} \frac{\partial L(\mathbf{e},\lambda)}{\partial\mathbf{e}}=-2N\mathbf{\Sigma}\mathbf{e}+2\lambda\mathbf{e}=0\Longleftrightarrow N\mathbf{\Sigma}\mathbf{e}=\lambda\mathbf{e} \end{equation}
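In code, this eigenvalue problem can be solved directly from the sample covariance matrix. Below is a minimal NumPy sketch (the toy data and names are illustrative; `np.linalg.eigh` is used because $\mathbf{\Sigma}$ is symmetric):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 3))   # correlated 3D samples
N = len(X)

mu = X.mean(axis=0)
Xc = X - mu
Sigma = (Xc.T @ Xc) / N                    # covariance with the 1/N convention used above

eigvals, eigvecs = np.linalg.eigh(Sigma)   # ascending eigenvalues, orthonormal columns
e = eigvecs[:, -1]                         # eigenvector with the largest eigenvalue
lam = N * eigvals[-1]                      # lambda in  N * Sigma * e = lambda * e

print(np.allclose(N * Sigma @ e, lam * e)) # True: the eigen equation above holds
```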
This is a typical eigenvalue problem, and $(\mathbf{e},\lambda)$ is an (eigenvector, eigenvalue) pair.
From these equations we can make the following observations:
1. The rank of the covariance matrix gives the number of eigenvectors with nonzero eigenvalue, i.e., the number of usable new basis vectors.
2. Since $\sum_{n}^{N}a_{n}^{2}$ is $N$ times the variance of the $a_{n}$ and $\sum_{n}^{N}a_{n}^{2}=N\mathbf{e}^{T}\mathbf{\Sigma}\mathbf{e}=\mathbf{e}^{T}\lambda\mathbf{e}=\lambda$, a larger $\lambda$ means a larger variance of the projected values (this is checked numerically in the sketch after this list).
3. Since $f=-N\mathbf{e}^{T}\mathbf{\Sigma}\mathbf{e}+\dots=-\lambda+\sum_{n}^{N}\left\Vert \mathbf{x}_{n}-\mu\right\Vert ^{2}$, the $\mathbf{e}$ with the largest eigenvalue gives the global minimum of $f$; the other eigenvectors are only stationary points of the constrained problem.
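Here is a quick numerical check of observations 2 and 3 (NumPy assumed; the toy data and names are again my own):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))
N = len(X)
mu = X.mean(axis=0)
Xc = X - mu
Sigma = (Xc.T @ Xc) / N

eigvals, eigvecs = np.linalg.eigh(Sigma)
e_top = eigvecs[:, -1]             # eigenvector with the largest eigenvalue
lam_top = N * eigvals[-1]          # corresponding lambda of N * Sigma

# Observation 2: sum_n a_n^2 equals lambda, so the projected variance is lambda / N.
a = Xc @ e_top
print(np.isclose(np.sum(a ** 2), lam_top))

# Observation 3: f(e) = -N e^T Sigma e + const, so the top eigenvector gives the
# smallest f among all the eigenvectors.
const = np.sum(Xc ** 2)
f = lambda e: -N * e @ Sigma @ e + const
print(all(f(e_top) <= f(eigvecs[:, i]) for i in range(eigvecs.shape[1])))
```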
Let's extend this analysis to the case of multiple basis vectors $\mathbf{e}_{1},\dots,\mathbf{e}_{D}$:
\begin{eqnarray*} f(\mathbf{e}_{1},\mathbf{e}_{2},...,\mathbf{e}_{D}) & = & \sum_{n}^{N}\left\Vert \sum_{d}^{D}a_{n,d}\mathbf{e}_{d}-(\mathbf{x}_{n}-\mu)\right\Vert ^{2}\\ & = & \sum_{n}^{N}\left(\left\Vert \sum_{d}^{D}a_{n,d}\mathbf{e}_{d}\right\Vert ^{2}-2\sum_{d}^{D}a_{n,d}\mathbf{e}_{d}^{T}(\mathbf{x}_{n}-\mu)+\left\Vert \mathbf{x}_{n}-\mu\right\Vert ^{2}\right)\\ & = & \sum_{n}^{N}\left(\sum_{d}^{D}a_{n,d}^{2}+2\sum_{d\neq d'}^{D}a_{n,d}a_{n,d'}\underbrace{\mathbf{e}_{d}^{T}\mathbf{e}_{d'}}_{=0}-2\sum_{d}^{D}a_{n,d}\mathbf{e}_{d}^{T}(\mathbf{x}_{n}-\mu)+\left\Vert \mathbf{x}_{n}-\mu\right\Vert ^{2}\right)\\ & = & \sum_{n}^{N}\left(\sum_{d}^{D}\left(a_{n,d}^{2}-2a_{n,d}^{2}\right)+\left\Vert \mathbf{x}_{n}-\mu\right\Vert ^{2}\right)\\ & = & \sum_{n}^{N}\left(-\sum_{d}^{D}a_{n,d}^{2}+\left\Vert \mathbf{x}_{n}-\mu\right\Vert ^{2}\right)\\ & = & -N\sum_{d}^{D}\mathbf{e}_{d}^{T}\mathbf{\Sigma}\mathbf{e}_{d}+\sum_{n}^{N}\left\Vert \mathbf{x}_{n}-\mu\right\Vert ^{2} \end{eqnarray*}
where $D\leq\dim(\mathbf{x}_n)$.
The process for obtaining each individual basis vector is the same as before.
\begin{equation} \frac{\partial L(\mathbf{e}_{d},\lambda_{d})}{\partial\mathbf{e}_{d}}=-2N\mathbf{\Sigma}\mathbf{e}_{d}+2\lambda_{d}\mathbf{e}_{d}=0\Longleftrightarrow N\mathbf{\Sigma}\mathbf{e}_{d}=\lambda_{d}\mathbf{e}_{d} \end{equation}
Since eigenvectors are mutually orthogonal, the eigenvectors of $N\mathbf{\Sigma}$ (equivalently, of $\mathbf{\Sigma}$) serve as the new basis vectors for PCA.
As mentioned before, a larger $\lambda$ makes $f$ smaller, so we choose the $D$ eigenvectors with the largest eigenvalues.
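Putting the pieces together, the whole procedure fits in a few lines. The sketch below (NumPy assumed; the data, the choice $D=2$, and all names are illustrative) sorts the eigenvectors by eigenvalue, keeps the top $D$ as the new basis, and reconstructs the samples from the reduced representation:

```python
import numpy as np

def pca(X, D):
    """Return the sample mean, the top-D basis vectors, and the coefficients a_{n,d}."""
    mu = X.mean(axis=0)
    Xc = X - mu
    Sigma = (Xc.T @ Xc) / len(X)
    eigvals, eigvecs = np.linalg.eigh(Sigma)   # ascending order
    order = np.argsort(eigvals)[::-1][:D]      # indices of the D largest eigenvalues
    E = eigvecs[:, order]                      # columns are e_1, ..., e_D
    A = Xc @ E                                 # a_{n,d} = e_d^T (x_n - mu)
    return mu, E, A

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 5))
mu, E, A = pca(X, D=2)                 # reduce 5D samples to 2D coefficients
X_hat = mu + A @ E.T                   # reconstruction from the reduced basis
print(A.shape, np.sum((X - X_hat) ** 2))   # (300, 2) and the reconstruction SSE (= f)
```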
Now let's go back to why $\mathbf{x}_0$ should be $\mu$. As I explained in the line fitting section, $\mu$ lies on the regression line along $\mathbf{e}_1$ (the basis vector with the largest eigenvalue). Every $\mathbf{e}_d$ satisfies the same formula:
\begin{equation} \mu=\mathbf{x}_{0}+\left(\frac{1}{N}\sum_{n}^{N}a_{n,d}\right)\mathbf{e}_{d} \end{equation} So $\mu$ must lie on the line of every $\mathbf{e}_{d}$. If $D>1$, the only choice of $\mathbf{x}_{0}$ consistent with all of them is $\mathbf{x}_{0}=\mu$.