Is there any problem on performing both the feature selection and PCA
After performing PCA you get a set of eigenvalues and a set of eigenvectors. When you sort your eigenvectors by eigenvalue from largest to smallest, you sort them virtually by importance. Now get the first n of those and dump the rest.
Thus PCA is a feature selection algorithm.
Related
What measure of average is considered for calculating the target average value in Decision Trees? Is it mean or median? Also, if it is mean, then, why is median not considered since it is a better measure of center in the presence of outliers?
I found the answer. It is mean and not median. This is because calculating median requires the data to be sorted, and so the process becomes more intensive. The cases where median is a better measure of center is when the data has skewed data points. However, in Decision Trees the data points that are part of the leaf are most likely to not have the outlier because the outlier would land in other leaf.
I know that feature selection helps me remove features that may have low contribution. I know that PCA helps reduce possibly correlated features into one, reducing the dimensions. I know that normalization transforms features to the same scale.
But is there a recommended order to do these three steps? Logically I would think that I should weed out bad features by feature selection first, followed by normalizing them, and finally use PCA to reduce dimensions and make the features as independent from each other as possible.
Is this logic correct?
Bonus question - are there any more things to do (preprocess or transform)
to the features before feeding them into the estimator?
If I were doing a classifier of some sort I would personally use this order
Normalization
PCA
Feature Selection
Normalization: You would do normalization first to get data into reasonable bounds. If you have data (x,y) and the range of x is from -1000 to +1000 and y is from -1 to +1 You can see any distance metric would automatically say a change in y is less significant than a change in X. we don't know that is the case yet. So we want to normalize our data.
PCA: Uses the eigenvalue decomposition of data to find an orthogonal basis set that describes the variance in data points. If you have 4 characteristics, PCA can show you that only 2 characteristics really differentiate data points which brings us to the last step
Feature Selection: once you have a coordinate space that better describes your data you can select which features are salient.Typically you'd use the largest eigenvalues(EVs) and their corresponding eigenvectors from PCA for your representation. Since larger EVs mean there is more variance in that data direction, you can get more granularity in isolating features. This is a good method to reduce number of dimensions of your problem.
of course this could change from problem to problem, but that is simply a generic guide.
Generally speaking, Normalization is needed before PCA.
The key to the problem is the order of feature selection, and it's depends on the method of feature selection.
A simple feature selection is to see whether the variance or standard deviation of the feature is small. If these values are relatively small, this feature may not help the classifier. But if you do normalization before you do this, the standard deviation and variance will become smaller (generally less than 1), which will result in very small differences in std or var between the different features.If you use zero-mean normalization, the mean of all the features will equal 0 and std equals 1.At this point, it might be bad to do normalization before feature selection
Feature selection is flexible, and there are many ways to select features. The order of feature selection should be chosen according to the actual situation
Good answers here. One point needs to be highlighted. PCA is a form of dimensionality reduction. It will find a lower dimensional linear subspace that approximates the data well. When the axes of this subspace align with the features that one started with, it will lead to interpretable feature selection as well. Otherwise, feature selection after PCA, will lead to features that are linear combinations of the original set of features and they are difficult to interpret based on the original set of features.
1) In eigenface approach the eigenfaces is a combination of elements from different faces. What are these elements?
2) The output face is an image composed of different eigenfaces with different weights. What does the weights of eigenfaces exactly mean? I know that the weight is percentage of eigenfacein the image, but what does it mean exactly, is mean the number of selected pixels?
Please study about PCA to understand what is the physical meaning of eigenfaces, when PCA is applied to an image. The answer lies in the understanding of eigenvectors and eigenvalues associated with PCA.
EigenFaces is based on Principal Component Analysis
Principal Component Analysis does dimensionality reduction and finds unique features in the training images and removes the similar features from the face images
By getting unique features our recognition task gets simpler
By using PCA you calculate the eigenvectors for your face image data
From these eigenvectors you calculate EigenFace of every training subject or you can say calculating EigenFace for every class in your data
So if you have 9 classes then the number of EigenFaces will be 9
The weight usually means how important something is
In EigenFaces weight of a particular EigenFace is a vector which just tells you how important that particular EigenFace is in contributing the MeanFace
Now if you have 9 EigenFaces then for every EigenFace you will get exactly one Weight vector which will be of N dimension where N is number of eigenvectors
So every element out N elements in one weight vector will tell you how important that particular eigenvector is for that corresponding EigenFace
The facial Recognition in EigenFaces is done by comparing the weights of training images and testing images with some kind of distance function
You can refer this github link: https://github.com/jayshah19949596/Computer-Vision-Course-Assignments/blob/master/EigenFaces/EigenFaces.ipynb
The code on the above link is a good documented code so If you know the basics you will understand the code
I am not sure whether I am applying PCA correctly or not! I have p features and n observations (instances). I put these in an nxp matrix X. I perform mean normalization and I get the normalized matrix B. I calculate the eigenvalues and eigenvectors of the pxp covariance matrix C=(1/(n-1))B*.B where * denotes the conjugate transpose.
The eigenvectors corresponding to the descendingly ordered eigenvalues are in a pxp matrix E. Let's say I want to reduce the number of attributes from p to k. I use the equation X_new=B.E_reduced where E_reduced is produced by choosing the first k columns of E. Here are my questions:
1) Should it be X_new=B.E_reduced or X_new=X.E_reduced?
2) Should I repeat the above calculations in the testing phase? If testing phase is similar to training phase, then no speed-up is gained because I have to calculate all the p features for each instance in the testing phase and PCA makes the algorithm slower because of eigenvector calculation overhead.
3) After applying PCA, I noticed that the accuracy decreased. Is this related to the number k (I set k=p/2) or the fact that I am using linear PCA instead of kernel PCA? What is the best way to choose the number k? I read that I can find the ratio of summation of k eigenvalues over the summation of all eigenvalues and make a decision based on this ratio.
You apply the multiplication to the centered data usually, so your projected data is also centered.
Never re-run PCA during testing. Only usenit on training data, and keep the shift vector and projection matrix. You need to apply exactly the same projection as during training, not recompute a new projection.
Decreased performance can have many reasons. E.g. did you also apply scaling using the roots of the eigenvalues? And what method did you use the first place?
I'm looking to set up a linear regression using 2D Gaussian basis functions. My input training variables cover a two dimensional space. Before applying the machine learning (Bayesian linear regression), I need to select parameters for the Gaussians - mean and variance and also decide how many basis functions to use.
I am currently spacing the means (of a preallocated number of basis Gaussians) evenly over a grid, and just assuming constant variance. This is obviously not the best approach.
Any ideas on how to calculate these variables?