Kernel principal component analysis (kernel PCA) is an extension of principal component analysis (PCA) using techniques of kernel methods. Using a kernel, the originally linear operations of PCA are done in a reproducing kernel Hilbert space with a non-linear mapping.
KPCA extends PCA to non-linear distributions. Instead of directly performing PCA, the original data points are mapped into a higher-dimensional (possibly infinite-dimensional) feature space defined by a (usually nonlinear) function φ; the explicit computation of this mapping is avoided through a mathematical device known as the “kernel trick”.
The Kernel Trick
The kernel trick transforms any algorithm that solely depends on the dot product between two vectors. Wherever a dot product is used, it is replaced with the kernel function. Thus, a linear algorithm can easily be transformed into a non-linear algorithm. This non-linear algorithm is equivalent to the linear algorithm operating in the range space of φ. However, because kernels are used, the φ function is never explicitly computed. This is desirable, because the high-dimensional space may be infinite-dimensional (as is the case when the kernel is a Gaussian).
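As an illustration (a standalone sketch, not part of the library code), a Gaussian kernel returns the feature-space inner product of two points using nothing but their input-space coordinates:

    // Standalone sketch: the Gaussian (RBF) kernel evaluates <phi(x), phi(y)>
    // in an infinite-dimensional feature space without ever computing phi.
    static double GaussianKernel(double[] x, double[] y, double sigma)
    {
        // Squared Euclidean distance between the input vectors
        double norm = 0;
        for (int i = 0; i < x.Length; i++)
        {
            double d = x[i] - y[i];
            norm += d * d;
        }

        // k(x, y) = exp(-||x - y||^2 / (2 sigma^2))
        return System.Math.Exp(-norm / (2.0 * sigma * sigma));
    }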
An interesting video by Udi Aharoni demonstrates how points that are not linearly separable in a 2D space can almost always be linearly separated in a higher-dimensional (3D) space.
Principal Components in Kernel Space
Like in PCA, the overall idea is to find a transformation that maximizes the variance captured by the new variables while minimizing the covariance between them. Using the kernel trick, the covariance matrix is replaced by the kernel matrix and the analysis is carried out analogously in feature space. An eigenvalue decomposition is performed and the eigenvectors are sorted in descending order of their eigenvalues, so that they form a basis in feature space whose first dimensions explain most of the variance in the data.
However, because the principal components live in feature space, we will not be performing dimensionality reduction in the usual sense. Suppose that the number of observations m exceeds the input dimensionality n. In linear PCA, we can find at most n nonzero eigenvalues. With kernel PCA, on the other hand, we can find up to m nonzero eigenvalues, because we will be operating on an m × m kernel matrix.
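Concretely, following Schölkopf et al. (this is a sketch of the standard derivation, stated here for reference), the principal directions in feature space are never formed explicitly. Writing each direction v as a linear combination of the mapped training points, v = Σ a_i φ(x_i), the eigenvalue problem reduces to one on the kernel matrix, and projecting a point x onto that direction requires only kernel evaluations:

    K a = m λ a,                  where K_ij = k(x_i, x_j)
    v · φ(x) = Σ_{i=1..m} a_i k(x_i, x)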
A more detailed discussion can be found in the works of Mika et al. (Kernel PCA and De-Noising in Feature Spaces) and Schölkopf et al. (Nonlinear Component Analysis as a Kernel Eigenvalue Problem). A more accessible explanation of the topic is also available in the work by Ian Fasel.
The Pre-image Problem
In standard PCA, it is possible to revert the transformation, retrieving the original data points and enabling the use of PCA for compression* and denoising. However, since the results produced by kernel PCA live in some high-dimensional feature space, they need not have pre-images in input space. In fact, even if a pre-image exists, it need not be unique.
Thus, in KPCA it is only possible to obtain an approximate pre-image for a given set of patterns projected onto the feature space. This problem can be (and has been) tackled by many approaches, some of them iterative in nature, although closed-form solutions also exist. Typically, those solutions exploit the fact that distances in feature space are related to distances in input space, and try to find a transformation that embeds the feature points back into input space while respecting those distance relationships.
One of the first solutions of this form was proposed by Kwok and Tsang in their paper “The Pre-Image Problem in Kernel Methods”. A better approach is also described by Bakir in his thesis, “Extensions to Kernel Dependency Estimation”, along with a comprehensive description of various other methods for handling the pre-image problem.
Note: in practice, the use of KPCA for compression is somewhat limited, since one needs to store the original data set in order to compute transformations and their approximate reversions.
Code Overview
Below is the main code behind KPCA. This is a direct, naive implementation of the algorithm that may not scale well to very large datasets. It is implemented on top of the previous standard PCA code and thus on top of the AForge.NET Framework.
public override void Compute()
{
    int rows = Source.GetLength(0);
    int cols = Source.GetLength(1);

    // Center (and optionally standardize) the source data
    sourceCentered = adjust(Source);

    // Create the Gram (kernel) matrix of the training data
    double[,] K = new double[rows, rows];
    for (int i = 0; i < rows; i++)
    {
        for (int j = i; j < rows; j++)
        {
            // Kernel matrices are symmetric, so only half of the entries need to be computed
            double k = kernel.Kernel(sourceCentered.GetRow(i), sourceCentered.GetRow(j));
            K[i, j] = k;
            K[j, i] = k;
        }
    }

    // Center the kernel matrix so the mapped data has zero mean in feature space
    if (centerFeatureSpace)
        K = centerKernel(K);

    // Perform the eigenvalue decomposition of the kernel matrix
    EigenValueDecomposition evd = new EigenValueDecomposition(K);

    double[] evals = evd.RealEigenValues;
    double[,] eigs = evd.EigenVectors;

    // Sort the eigenvalues and corresponding eigenvectors
    // in descending order of absolute value
    int[] indices = new int[rows];
    for (int i = 0; i < rows; i++) indices[i] = i;
    Array.Sort(evals, indices, new AbsoluteComparer(true));
    eigs = eigs.Submatrix(0, rows - 1, indices);

    // Normalize the eigenvectors so the corresponding
    // components have unit length in feature space
    if (centerFeatureSpace)
    {
        for (int j = 0; j < rows; j++)
        {
            double eig = System.Math.Sqrt(System.Math.Abs(evals[j]));
            for (int i = 0; i < rows; i++)
                eigs[i, j] = eigs[i, j] / eig;
        }
    }

    this.SingularValues = new double[rows];
    this.EigenValues = evals;
    this.ComponentMatrix = eigs;

    // Project the training data onto the kernel principal components
    double[,] result = new double[rows, rows];
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < rows; j++)
            for (int k = 0; k < rows; k++)
                result[i, j] += K[i, k] * eigs[k, j];

    this.Result = result;

    createComponents();
}
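The centerKernel method called above is not shown in the article, but for the square training kernel matrix it can be implemented directly from the standard centering formula K' = K - 1K - K1 + 1K1, where 1 denotes the m × m matrix with all entries equal to 1/m. The sketch below is one possible implementation under that assumption (the method name is hypothetical, and the actual library code may differ; centering the rectangular kernel matrix used in Transform() follows a slightly different formula involving the training kernel matrix).

    // Possible implementation of feature-space centering for a square kernel matrix
    // (hypothetical helper, not taken from the library source).
    private static double[,] centerSquareKernel(double[,] K)
    {
        int m = K.GetLength(0);

        // Row means, column means and the grand mean of K
        double[] rowMean = new double[m];
        double[] colMean = new double[m];
        double grandMean = 0;

        for (int i = 0; i < m; i++)
            for (int j = 0; j < m; j++)
            {
                rowMean[i] += K[i, j] / m;
                colMean[j] += K[i, j] / m;
                grandMean += K[i, j] / ((double)m * m);
            }

        // Subtract row and column means, then add back the grand mean
        double[,] Kc = new double[m, m];
        for (int i = 0; i < m; i++)
            for (int j = 0; j < m; j++)
                Kc[i, j] = K[i, j] - rowMean[i] - colMean[j] + grandMean;

        return Kc;
    }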
public override double[,] Transform(double[,] data, int components)
{
    int rows = data.GetLength(0);
    int cols = data.GetLength(1);
    int N = sourceCentered.GetLength(0);

    // Center the new data with the same adjustments applied to the training data
    data = adjust(data);

    // Create the kernel matrix between the new data and the training data
    double[,] K = new double[rows, N];
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < N; j++)
            K[i, j] = kernel.Kernel(data.GetRow(i), sourceCentered.GetRow(j));

    // Center the kernel matrix in feature space, if required
    if (centerFeatureSpace)
        K = centerKernel(K);

    // Project the new data onto the first kernel principal components
    double[,] result = new double[rows, components];
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < components; j++)
            for (int k = 0; k < N; k++)
                result[i, j] += K[i, k] * ComponentMatrix[k, j];

    return result;
}
The code below implements the reversion of the projection. It is a direct, naive implementation of the closed-form solution proposed by Kwok and Tsang. It is currently not optimized at all, but it gives a clearer overview of the algorithm in its essential form. The source code available for download may be better optimized in the future.
public double[,] Revert(double[,] data, int neighbors)
{
    int rows = data.GetLength(0);
    int cols = data.GetLength(1);

    double[,] reversion = new double[rows, sourceCentered.GetLength(1)];

    // Use at most as many neighbors as there are training points
    int nn = System.Math.Min(neighbors, sourceCentered.GetLength(0));

    // Compute the approximate pre-image of each point independently
    for (int p = 0; p < rows; p++)
    {
        // Projected point whose pre-image we are looking for
        double[] y = data.GetRow(p);

        // Training data in input space
        double[,] X = sourceCentered;

        // Compute the distances from y to the projections of every training point
        // (using only the first y.Length components)
        double[] d2 = new double[Result.GetLength(0)];
        int[] inx = new int[Result.GetLength(0)];
        for (int i = 0; i < X.GetLength(0); i++)
        {
            inx[i] = i;
            d2[i] = kernel.Distance(y, Result.GetRow(i).Submatrix(y.Length));
        }

        // Select the nn nearest neighbors
        Array.Sort(d2, inx);
        inx = inx.Submatrix(nn);
        X = X.Submatrix(inx).Transpose();
        d2 = d2.Submatrix(nn);

        // Centering matrix H = I - (1/nn) * 1
        double[,] H = Matrix.Identity(nn).Subtract(Matrix.Create(nn, 1.0 / nn));

        // Singular value decomposition of the centered neighbors
        SingularValueDecomposition svd = new SingularValueDecomposition(X.Multiply(H));

        double[,] U = svd.LeftSingularVectors;
        double[,] L = Matrix.Diagonal(nn, svd.Diagonal);
        double[,] V = svd.RightSingularVectors;

        // Coordinates of the neighbors in the subspace spanned by U
        double[,] Z = L.Multiply(V.Transpose());

        // Squared distances of the projected neighbors to the origin
        double[] d02 = Matrix.Sum(Matrix.DotPow(Z, 2));

        // Solve for the coordinates of the pre-image in that subspace
        double[,] inv = Matrix.PseudoInverse(Z.Transpose());
        double[] z = (-0.5).Multiply(inv).Multiply(d2.Subtract(d02));

        // Map back to input space and add the mean of the neighbors
        double[] x = (U.Multiply(z)).Add(Matrix.Sum(X.Transpose()).Multiply(1.0 / nn));

        // Store the approximate pre-image
        for (int i = 0; i < reversion.GetLength(1); i++)
            reversion[p, i] = x[i];
    }

    // Undo the initial data adjustments (standardization or mean removal)
    if (this.Method == AnalysisMethod.Correlation)
    {
        for (int i = 0; i < reversion.GetLength(0); i++)
            for (int j = 0; j < reversion.GetLength(1); j++)
                reversion[i, j] = (reversion[i, j] * StandardDeviations[j]) + Means[j];
    }
    else
    {
        for (int i = 0; i < reversion.GetLength(0); i++)
            for (int j = 0; j < reversion.GetLength(1); j++)
                reversion[i, j] = reversion[i, j] + Means[j];
    }

    return reversion;
}
Using the Code
As with PCA, code usage is simple. First, choose a kernel implementing the IKernel interface, then pass it to the KernelPrincipalComponentAnalysis constructor together with the data points.
Gaussian kernel = new Gaussian(10.0);
var kpca = new KernelPrincipalComponentAnalysis(data, kernel);
Then, simply call Compute() to perform the analysis. Once done, data about the analysis will be available through the object properties, and transformations on additional data can be computed by calling Transform().
kpca.Compute();
double[,] result = kpca.Transform(data, 2);
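If an approximate reconstruction in input space is needed, the Revert() method discussed earlier can be used. The sketch below assumes we want the pre-images of the points just transformed; the number of nearest neighbors (10) is only an illustrative choice.

// Approximate pre-images of the projected points, using 10 nearest neighbors
double[,] reconstruction = kpca.Revert(result, 10);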
Sample Application
To demonstrate the use of KPCA, I created a simple Windows Forms application which performs statistical analysis and KPCA transformations. It can also perform projection reversion by solving the approximate pre-image problem.
Left: Schölkopf toy example. Middle: Kernel principal components detected by KPCA. Right: The first two dimensions obtained from the first two kernel principal components.
Left: Wikipedia example. Middle: Projection using a non-centered Gaussian kernel. Right: Reconstruction of the original data using all Kernel principal components, showing some denoising ability.
Linear PCA as a Special Case
Additionally, we may check that linear PCA is indeed a special case of kernel PCA by using a linear kernel on the example given by Lindsay Smith in his A Tutorial on Principal Component Analysis.
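In code, this amounts to running the same analysis with a linear kernel. The sketch below assumes the framework provides a Linear kernel class implementing IKernel (the exact class name may differ):

// With a linear kernel, kernel PCA reduces to standard linear PCA
var lpca = new KernelPrincipalComponentAnalysis(data, new Linear());
lpca.Compute();
double[,] linearResult = lpca.Transform(data, 2);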
References
- BAKIR, Gökhan. Extensions to Kernel Dependency Estimation. Doctoral thesis.
Available at: http://opus.kobv.de/tuberlin/volltexte/2006/1256/pdf/bakir_goekhan.pdf
- FASEL, Ian. Scholkopf, Smola and Muller: Kernel PCA.
Available at: http://cseweb.ucsd.edu/classes/fa01/cse291/kernelPCA_article.pdf
- HOFFMANN, Heiko. Unsupervised Learning of Visuomotor Associations. Dissertation.
Available at: http://www.heikohoffmann.de/htmlthesis/hoffmann_diss.html
- KWOK, James; TSANG, Ivor. The Pre-Image Problem in Kernel Methods. Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), Washington DC, 2003.
Paper available at: http://www.hpl.hp.com/conferences/icml2003/papers/345.pdf
Slides available at: http://www.cse.ust.hk/~jamesk/papers/icml03_slides.pdf
- MIKA, Sebastian; SCHÖLKOPF, Bernhard; et al. Kernel PCA and De-Noising in Feature Spaces.
Available at: http://www.cs.dartmouth.edu/farid/teaching/cs88/nips99.pdf
- WIKIPEDIA. Kernel Principal Component Analysis.
Available at: http://en.wikipedia.org/wiki/Kernel_principal_component_analysis
- SMITH, Lindsay. A Tutorial on Principal Component Analysis. 2002.
Available at: http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
See Also