I'm quite new to machine learning and just got introduced to principle component analysis as a dimensionality reduction method. What I don't understand, in which circumstances is PCA any better than simply removing some features from the model? If the aim is to obtain lower dimensional data, why don't we just group those features that are correlated and retain one single feature from each group?
There is a fundamental difference between feature reduction (such as PCA) and feature selection (which you describe). The crucial difference is that feature reduction (PCA) maps your data to lower dimensional through some projection of all original dimensions, for example PCA uses linear combination of each. So final data embedding has information from all features. If you perform feature selection you discard information, you completely loose anything that was present there. Furthermore, PCA guarantees you retaining given fraction of the data variance.
Related
In machine learning, more features or dimensions can decrease a model’s accuracy since there is more data that needs to be generalized and this is known as the curse of dimensionality.
Dimensionality reduction is a way to reduce the complexity of a model and avoid overfitting. Principal Component Analysis (PCA) algorithm is used to compress a dataset onto a lower-dimensional feature to reduce the complexity of the model.
When/How should I consider that my data set has many numbers of features and I should look for PCA for dimension reduction?
simple answer is , Its is used When we need to tackle the curse of dimensionality
When should I use PCA?
Do you want to reduce the number of variables, but aren’t able to identify variables to completely remove from consideration?
Do you want to ensure your variables are independent of one another?
Are you comfortable making your independent variables less interpretable?
If you answered “yes” to all three questions, then PCA is a good method to use. If you answered “no” to question 3, you should not use PCA.
Good tutorial is here
Let me provide another view into this.
In general, you can use Principal Component Analysis for two main reasons:
For compression:
To reduce space to store your data, for example.
To speed up your learning algorithm (selecting the principal components with more
variance). Looking at the cumulative variance of the components.
For visualization purposes, using 2 or 3 components.
I am still exploring this area of Machine Learning and although I know what's the difference between Feature Selection and Dimensionality Reduction, I am finding some difficulties grasping the concepts of when to do Feature Selection or Dimensionality Reduction (or both together).
Assuming that I have a dataset with around 40 features, is it good practice to perform Dimensionality Reduction alone or Feature Selection alone? Or should there be a hybrid of both approaches (i.e. Do feature selection first and then dimensionality reduction - or vice versa)?
The term feature selection is a bit misleading. It can have two meanings:
Selecting features by incorporating the domain knowledge (this involves constructing new features as well).
For example, finding the rotation invariant points in an image data set or creating BMI as a new feature when you have height and weight as features.
Keeping only the features of high importance according to a some measure
This is a one step of the dimensionality reduction process. The so-called dimensionality reduction process actually involves two steps:
Transforming the original features to new (artificial) features by changing the basis.
eg. PCA does so by finding a set of orthogonal features so that the variance along each axis is maximized.
Keeping only the most important (importance is defined by some measure) features resulted in the above step. This is actually a feature selection step.
eg. In PCA, this is achieved by keeping only the top-k number of features that have the highest explained variances.
As for the order of above (1) and (2) should happen: I think this is problem dependent.
If there's enough domain knowledge to construct/select features to cater the problem at hand, we should do the manual feature engineering (plus selection) first. If this feature engineering/selection process still results in a large number of features, then the so-called dimensionality reduction can be done to find a subspace that can represent the data with an even lesser number of totally new features that have almost no meaning in real life.
If the domain knowledge can't add anything to the data set, just doing dimensionality reduction would be fine which actually contain a feature selection step in it.
In a broad sense, we can think that feature selection is actually a special case of dimensionality reduction where no basis change occurs to the original data set.
I am developing a recommendation engine with the help of kNN. Data, though, is sparse, have around 1500 samples and around 200 features. I have an ordinal target having values 1 or 0.
What would be the techniques to do feature selection for it? I am assuming that if i choose Random forest for feature selection then the selected features may be different that what kNN assume important features are.
Also, is there any restriction on the number of features containing i have so less number of samples?
Features selection techniques want either to exclude irrelevant features, or/and to exclude redundant ones. One proven technique is to use Supervized discretization based on entropy (some more generic explanation can be found here) to meaningfully reduce the size of your data, and then use Info Gain to get top k most correlated features with the target variable. There are at least 5 various methods that you can try, it also depends on the ml library/framework that you are using to implement your app.
I would try with Relief algorithm, since its core part is the nearest neighbour search.
I am planning to do SVM classification on multidimensional sensor data. There are two classes and 13 sensors. Suppose that I want to extract features e.g. average, standard deviation etc. I read from somewhere that we need to do feature scaling before apply to SVM. I am wondering when I should do the scaling, before extracting features or after extracting features?
As you have read, and as already pointed out, you would:
do feature derivation
do feature normalization (scaling, deskewing if necessary, etc)
hand data to training/evaluating model(s).
For the example you mentioned, just to be clear: I assume you mean that you want to derive (the same) features for each sample, so that you have e.g. a mean feature, standard deviation feature, etc. for each sample - which is how it should be done. Normalization, in turn, has to be done per feature over all samples.
for my final thesis i am trying to build up an 3d face recognition system by combining color and depth information. the first step i did, is to realign the data-head to an given model-head using the iterative closest point algorithm. for the detection step i was thinking about using the libsvm. but i dont understand how to combine the depth and the color information to one feature vector? they are dependent information (each point consist of color (RGB), depth information and also scan quality).. what do you suggest to do? something like weighting?
edit:
last night i read an article about SURF/SIFT features i would like to use them! could it work? the concept would be the following: extracting this features out of the color image and the depth image (range image), using each feature as a single feature vector for the svm?
Concatenation is indeed a possibility. However, as you are working on 3d face recognition you should have some strategy as to how you go about it. Rotation and translation of faces will be hard to recognize using a "straightforward" approach.
You should decide whether you attempt to perform a detection of the face as a whole, or of sub-features. You could attempt to detect rotation by finding some core features (eyes, nose, etc).
Also, remember that SVMs are inherently binary (i.e. they separate between two classes). Depending on your exact application you will very likely have to employ some multi-class strategy (One-against-all or One-against-many).
I would recommend doing some literature research to see how others have attacked the problem (a google search will be a good start).
It sounds simple, but you can simply concatenate the two vectors into one. Many researchers do this.
What you arrived at is an important open problem. Yes, there are some ways to handle it, as mentioned here by Eamorr. For example you can concatenate and do PCA (or some non linear dimensionality reduction method). But it is kind of hard to defend the practicality of doing so, considering that PCA takes O(n^3) time in the number of features. This alone might be unreasonable for data in vision that may have thousands of features.
As mentioned by others, the easiest approach is to simply combine the two sets of features into one.
SVM is characterized by the normal to the maximum-margin hyperplane, where its components specify the weights/importance of the features, such that higher absolute values have a larger impact on the decision function. Thus SVM assigns weights to each feature all on its own.
In order for this to work, obviously you would have to normalize all the attributes to have the same scale (say transform all features to be in the range [-1,1] or [0,1])