How is a linear autoencoder equal to PCA?

I would like to see a mathematical proof of this. Does anyone know of a paper for it, or can someone work out the math?

https://pvirie.wordpress.com/2016/03/29/linear-autoencoders-do-pca/
PCA is restricted to a linear map, while autoencoders can have nonlinear encoders/decoders.
A single-layer autoencoder with a linear transfer function is nearly equivalent to PCA, where "nearly" means that the weight matrix W found by the AE and the one found by PCA won't be the same, but the subspaces spanned by the respective W's will be.

Here is a detailed mathematical explanation:
https://arxiv.org/abs/1804.10253
This paper also shows that with a linear autoencoder it is possible to recover not only the subspace spanned by the principal directions, but the principal components themselves.
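To see the subspace claim numerically, here is a minimal sketch (my own illustration, not taken from the paper): it trains a single-layer linear autoencoder by gradient descent with numpy and checks that its decoder spans the same subspace as the top-k PCA directions. The data, layer size, learning rate and iteration count are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10)) @ rng.normal(size=(10, 10))  # correlated toy data
X = X - X.mean(axis=0)        # center, as PCA does
X = X / X.std(axis=0)         # standardize to keep the gradients tame

k = 3
_, _, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt[:k].T                  # top-k PCA directions, shape (10, k)

# Single-layer linear autoencoder: X_hat = (X W_enc) W_dec, squared-error loss
W_enc = 0.1 * rng.normal(size=(10, k))
W_dec = 0.1 * rng.normal(size=(k, 10))
lr, n = 0.01, X.shape[0]
for _ in range(20000):
    H = X @ W_enc
    R = H @ W_dec - X                     # reconstruction residual
    W_dec -= lr * (H.T @ R) / n           # gradient of 0.5*||R||^2 / n
    W_enc -= lr * (X.T @ R @ W_dec.T) / n

# Compare the two subspaces via their projection matrices: the W's themselves
# differ, but the spanned subspaces should (nearly) coincide.
Q, _ = np.linalg.qr(W_dec.T)
print(np.linalg.norm(V @ V.T - Q @ Q.T))  # should be close to 0 after training

The decoder W_dec differs from the PCA loading matrix by an invertible k-by-k transformation, which is exactly the "same subspace, different W" statement above.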

Related

Weights in eigenface approach

1) In the eigenface approach the eigenfaces are combinations of elements from different faces. What are these elements?
2) The output face is an image composed of different eigenfaces with different weights. What do the weights of the eigenfaces exactly mean? I know that a weight is the proportion of an eigenface in the image, but what does that mean exactly; is it the number of selected pixels?
Please study PCA to understand the physical meaning of eigenfaces when PCA is applied to images. The answer lies in understanding the eigenvectors and eigenvalues associated with PCA.
EigenFaces is based on Principal Component Analysis.
Principal Component Analysis performs dimensionality reduction: it finds the distinctive features in the training images and discards the features the face images have in common.
By keeping only the distinctive features, the recognition task gets simpler.
Using PCA you compute the eigenvectors of your face image data.
From these eigenvectors you compute the EigenFace of every training subject, i.e. one EigenFace per class in your data.
So if you have 9 classes, the number of EigenFaces will be 9.
A weight usually means how important something is.
In EigenFaces, the weight of a particular EigenFace is a vector which tells you how important that particular EigenFace is in contributing to the MeanFace.
If you have 9 EigenFaces, then for every EigenFace you get exactly one weight vector of dimension N, where N is the number of eigenvectors.
So every one of the N elements in a weight vector tells you how important the corresponding eigenvector is for that EigenFace.
Facial recognition with EigenFaces is done by comparing the weights of training images and test images with some kind of distance function.
You can refer to this GitHub notebook: https://github.com/jayshah19949596/Computer-Vision-Course-Assignments/blob/master/EigenFaces/EigenFaces.ipynb
The code at the above link is well documented, so if you know the basics you will understand it.
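As a small, self-contained illustration of eigenfaces and weights (not taken from the linked notebook): the sketch below uses scikit-learn's PCA on a matrix of flattened face images. The image size, the number of components, and the random placeholder data are assumptions; with real data you would load the training faces into the faces array.

import numpy as np
from sklearn.decomposition import PCA

n_components = 9                        # e.g. one eigenface per class, as above
faces = np.random.rand(90, 64 * 64)     # placeholder for flattened training faces

pca = PCA(n_components=n_components)
weights_train = pca.fit_transform(faces)    # (90, 9): weight vector per image
eigenfaces = pca.components_                # (9, 64*64): the eigenfaces
mean_face = pca.mean_                       # (64*64,): the mean face

# Recognition: project a test image onto the eigenfaces and compare its
# weight vector to the training weights with a distance function.
test = np.random.rand(64 * 64)              # placeholder for a test face
w_test = pca.transform(test.reshape(1, -1))
nearest = np.argmin(np.linalg.norm(weights_train - w_test, axis=1))
print("closest training image:", nearest)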

Given a feature vector, how to find whether my data points are linearly separable

I have a feature vector in matrix notation, and I have data points in the 2D plane. How can I find out whether my data points are linearly separable with that feature vector?
In 2D one can check whether there exists a line that divides the data points into the two classes. If there isn't such a line, how does one check for linear separability in higher dimensions?
A theoretical answer
If we assume the samples of the two classes are distributed according to a Gaussian, we will get a quadratic function describing the decision boundary in the general case.
If the covariance matrices are identical we get a linear decision boundary.
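To spell that out: with class densities $\mathcal{N}(\mu_1,\Sigma_1)$ and $\mathcal{N}(\mu_2,\Sigma_2)$, the decision boundary is the set where the log-likelihood ratio vanishes,
\[
\log\frac{p(x\mid 1)}{p(x\mid 2)}
= -\tfrac{1}{2}\, x^\top\!\left(\Sigma_1^{-1}-\Sigma_2^{-1}\right) x
+ \left(\Sigma_1^{-1}\mu_1-\Sigma_2^{-1}\mu_2\right)^{\!\top} x + c = 0,
\]
which is quadratic in $x$ in general; when $\Sigma_1=\Sigma_2$ the quadratic term cancels and a hyperplane remains. (The constant $c$ collects the determinants, the $\mu^\top\Sigma^{-1}\mu$ terms, and the class priors.)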
A practical answer
See the SO discussion here.
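One common practical check, sketched below under the assumption that scikit-learn is available: fit a linear SVM with a very large C (which approximates a hard margin) and test whether it classifies every training point correctly. This works in any number of dimensions, not just 2D; the toy data here (XOR, which is not separable) is just for illustration.

import numpy as np
from sklearn.svm import LinearSVC

X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])  # toy 2D points
y = np.array([0, 0, 1, 1])                                      # XOR labels

# Large C approximates a hard margin; perfect training accuracy indicates
# the points can be separated by a hyperplane.
clf = LinearSVC(C=1e6, max_iter=100000).fit(X, y)
print("linearly separable:", clf.score(X, y) == 1.0)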

Why Gaussian radial basis function maps the examples into an infinite-dimensional space?

I've just run through the Wikipedia page about SVMs, and this line caught my eye:
"If the kernel used is a Gaussian radial basis function, the corresponding feature space is a Hilbert space of infinite dimensions." http://en.wikipedia.org/wiki/Support_vector_machine#Nonlinear_classification
In my understanding, if I apply a Gaussian kernel in an SVM, the resulting feature space will be m-dimensional (where m is the number of training samples), since you choose your landmarks to be your training examples and you measure the "similarity" between a specific example and all the training examples with the Gaussian kernel. As a consequence, for a single example you'll have as many similarity values as training examples. These become the new feature vectors, which are m-dimensional vectors, not infinite-dimensional ones.
Could somebody explain to me what I am missing?
Thanks,
Daniel
The dual formulation of the linear SVM depends only on scalar products of the training vectors. A scalar product essentially measures the similarity of two vectors. We can then generalize it by replacing it with any other "well-behaved" similarity measure (it should be positive-definite, which preserves convexity and enables Mercer's theorem). RBF is just one of them.
If you take a look at the formula here you'll see that the RBF kernel is basically a scalar product in a certain infinite-dimensional space.
Thus the RBF kernel is, in a sense, a weighted combination of polynomial kernels of all possible degrees.
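To make that concrete, expand the exponential for the kernel $k(x,z)=\exp(-\|x-z\|^2/2\sigma^2)$:
\[
\exp\!\left(-\frac{\|x-z\|^2}{2\sigma^2}\right)
= \exp\!\left(-\frac{\|x\|^2}{2\sigma^2}\right)\exp\!\left(-\frac{\|z\|^2}{2\sigma^2}\right)
\sum_{n=0}^{\infty}\frac{(x^\top z)^n}{\sigma^{2n}\, n!},
\]
so the RBF kernel is an infinite weighted sum of polynomial kernels $(x^\top z)^n$, and its implicit feature map stacks the feature maps of all these polynomial kernels, hence the infinite-dimensional feature space.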
The other answers are correct but don't really tell the right story here. Importantly, you are correct. If you have m distinct training points then the Gaussian radial basis kernel makes the SVM operate in an m-dimensional space. We say that the radial basis kernel maps to a space of infinite dimension because you can make m as large as you want and the space it operates in keeps growing without bound.
However, other kernels, like the polynomial kernel, do not have this property of the dimensionality scaling with the number of training samples. For example, if you have 1000 2D training samples and you use the polynomial kernel <x,y>^2, then the SVM will operate in a 3-dimensional space, not a 1000-dimensional space.
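For that 2D example, the explicit feature map behind $\langle x,y\rangle^2$ is easy to write down:
\[
\langle x, y\rangle^2 = (x_1 y_1 + x_2 y_2)^2
= x_1^2 y_1^2 + 2\,x_1 x_2\, y_1 y_2 + x_2^2 y_2^2
= \langle \phi(x), \phi(y)\rangle,
\qquad \phi(x) = \left(x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2\right),
\]
a fixed 3-dimensional map that does not depend on the number of training samples.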
The short answer is that this business about infinite dimensional spaces is only part of the theoretical justification, and of no practical importance. You never actually touch an infinite-dimensional space in any sense. It's part of the proof that the radial basis function works.
Basically, SVMs are proved to work the way they do by relying on properties of dot products over vector spaces. You can't just swap in the radial basis function and expect that it necessarily works. To prove that it does, you show that the radial basis function is actually like a dot product over a different vector space, so it's as if we're doing regular SVMs in a transformed space, which works. It happens that infinite dimensionality is fine, and that the radial basis function does correspond to a dot product in such a space. So you can say SVMs still work when you use this particular kernel.

Feature scaling and mean normalization in a sparse matrix

Is it a good idea to perform feature scaling and mean normalization on a sparse matrix? I have a matrix that is 70% sparse. Usually, feature scaling and mean normalization improve the algorithm's performance, but in the case of a sparse matrix they add a lot of non-zero terms.
If it's important that the representation be sparse, in order to fit into memory for example, then you can't mean-normalize in the representation itself, no. It becomes completely dense and defeats the purpose.
Usually you push the mean-normalization math into another part of the formula or computation (a sketch follows below). Or you can just do the normalization as you access the elements, having previously computed the mean and variance.
Or you can pick an algorithm that doesn't need normalization, if possible.
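As an illustration of pushing the normalization math around (a sketch with made-up numbers, assuming a linear model score w·x): for standardization (x - mu)/s, the 1/s factor can be absorbed into the weights and the mean into a constant bias, so a sparse row never has to be densified.

import numpy as np
from scipy.sparse import csr_matrix

w = np.array([0.5, -1.0, 2.0, 0.0])      # model weights (hypothetical)
mu = np.array([0.1, 0.2, 0.3, 0.4])      # precomputed feature means
s = np.array([1.0, 2.0, 0.5, 1.0])       # precomputed feature std deviations

w_scaled = w / s                          # absorb the 1/s factor into the weights
bias = -np.dot(w_scaled, mu)              # absorb the -mu term into a constant

x = csr_matrix(np.array([[0.0, 3.0, 0.0, 1.0]]))   # sparse input row
score_sparse = x.dot(w_scaled)[0] + bias            # touches non-zeros only

# Same result as densifying and normalizing explicitly:
score_dense = np.dot((x.toarray()[0] - mu) / s, w)
print(np.allclose(score_sparse, score_dense))       # True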
If you are using scikit-learn you can do it as below:
from sklearn.preprocessing import StandardScaler
# with_mean=False skips mean centering so a sparse input stays sparse;
# only scaling to unit variance is applied
scaler = StandardScaler(with_mean=False)
scaler.fit(data)
Here with_mean=False disables the mean subtraction in order to maintain sparsity, as you can see in the documentation here.

Is there a way to approximate the Laplacian matrix eigenvalues by using random walk like algorithm

It seems to me that many advanced graph analysis algorithms are based on spectral graph analysis, relying more specifically on properties of the Laplacian matrix.
I know there are some alternatives for clustering that are based on random-walk-type algorithms and make no use of a Laplacian matrix factorization.
I am curious whether there exists anything that goes a bit further and determines the Laplacian matrix eigenvalues (especially the second one) without using spectral methods, but rather by wandering on the graph.
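Not a complete answer, but one direction that only uses walk steps (matrix-vector products with a lazy walk matrix) rather than a factorization: power iteration with deflation against the known leading eigenvector. The sketch below is my own illustration on a tiny dense example, not a reference algorithm; the graph and the iteration count are arbitrary choices.

import numpy as np

A = np.array([[0, 1, 1, 0, 0],            # small example adjacency matrix
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)

d = A.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
S = D_inv_sqrt @ A @ D_inv_sqrt            # normalized adjacency
M = 0.5 * (np.eye(len(A)) + S)             # lazy walk matrix: spectrum in [0, 1]

v1 = np.sqrt(d) / np.linalg.norm(np.sqrt(d))   # known top eigenvector of M

rng = np.random.default_rng(0)
x = rng.normal(size=len(A))
x -= v1 * (v1 @ x)                         # deflate: stay orthogonal to v1
for _ in range(2000):
    x = M @ x                              # one lazy walk step per iteration
    x -= v1 * (v1 @ x)
    x /= np.linalg.norm(x)

nu2 = x @ (M @ x)                          # Rayleigh quotient: 2nd eigenvalue of M
lambda2 = 2.0 * (1.0 - nu2)                # 2nd-smallest eigenvalue of the normalized Laplacian
print(lambda2, np.sort(np.linalg.eigvalsh(np.eye(len(A)) - S))[1])  # compare

Each iteration is one application of the lazy walk matrix, so only local edge information is touched and no eigendecomposition is ever formed.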
