In machine learning, adding more features or dimensions can decrease a model's accuracy, because there is more data over which the model must generalize; this is known as the curse of dimensionality.
Dimensionality reduction is a way to reduce the complexity of a model and avoid overfitting. The Principal Component Analysis (PCA) algorithm compresses a dataset onto a lower-dimensional feature subspace in order to reduce the model's complexity.
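For concreteness, here is a minimal sketch of how that compression step is usually done with scikit-learn; the Iris data and the choice of 2 components are placeholder assumptions, not part of the original answer.

```python
# Minimal PCA sketch: compress a data set onto a lower-dimensional subspace.
# The Iris data and n_components=2 are placeholder choices for illustration.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)              # 150 samples, 4 features
X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scale

pca = PCA(n_components=2)                      # keep 2 principal components
X_reduced = pca.fit_transform(X_scaled)        # shape (150, 2)

print(X_reduced.shape)
print(pca.explained_variance_ratio_)           # variance captured by each component
```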
When/how should I decide that my data set has too many features and that I should turn to PCA for dimensionality reduction?
The simple answer is: it is used when we need to tackle the curse of dimensionality.
When should I use PCA?
Do you want to reduce the number of variables, but aren’t able to identify variables to completely remove from consideration?
Do you want to ensure your variables are independent of one another?
Are you comfortable making your independent variables less interpretable?
If you answered “yes” to all three questions, then PCA is a good method to use. If you answered “no” to question 3, you should not use PCA.
A good tutorial can be found here.
Let me provide another view into this.
In general, you can use Principal Component Analysis for two main reasons:
For compression:
To reduce the space needed to store your data, for example.
To speed up your learning algorithm, by selecting the principal components that explain the most variance (look at the cumulative explained variance of the components).
For visualization purposes, using 2 or 3 components.
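To make those two uses concrete, here is a hedged sketch on synthetic data (the make_classification settings and the 95% threshold are arbitrary assumptions): the first part picks the number of components from the cumulative explained variance, the second plots a 2-component projection.

```python
# Sketch: choose the number of components from the cumulative explained
# variance (compression), then plot a 2-component projection (visualization).
# The synthetic data and the 95% threshold are arbitrary illustration choices.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=30, n_informative=10,
                           random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Compression: keep enough components to explain ~95% of the variance.
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
n_components = int(np.searchsorted(cumulative, 0.95) + 1)
print("components needed for 95% variance:", n_components)

# Visualization: project onto the first 2 components.
X_2d = PCA(n_components=2).fit_transform(X_scaled)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, s=10)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()
```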
Related
I am trying to apply a clustering method to my datasets (with numerical dimensions). But I'm convinced that the features have different weights for different clusters. I read that there is an approach called soft subspace clustering that tries to identify the clusters and the weights of the features for each cluster simultaneously. However, the algorithms that I have found are applied only to categorical data.
I am trying to find a soft subspace clustering algorithm for numerical data. Do you know of any, or how I can adapt methods originally designed for categorical data to numerical data (I think it would be necessary to propose some way of measuring the relevance of each numerical feature in each cluster)?
Yes, there are dozens of subspace clustering algorithms.
You'll need to do a proper literature review; this is too broad to cover in a Q&A site like Stack Overflow. Look for (surprise) "subspace clustering", but also include "biclustering", for example.
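If you want something runnable while you survey the literature, scikit-learn does ship a spectral biclustering implementation that works directly on numerical matrices. The sketch below is only an entry point into the biclustering family mentioned above (the checkerboard data and the cluster counts are illustrative assumptions), not a soft subspace clustering method per se.

```python
# Hedged sketch: spectral biclustering on a numerical matrix, as one entry
# point into the biclustering family (not soft subspace clustering itself).
# The checkerboard data and the (4, 3) cluster grid are illustrative only.
from sklearn.cluster import SpectralBiclustering
from sklearn.datasets import make_checkerboard

data, rows, cols = make_checkerboard(shape=(300, 50), n_clusters=(4, 3),
                                     noise=10, random_state=0)

model = SpectralBiclustering(n_clusters=(4, 3), random_state=0)
model.fit(data)

print(model.row_labels_[:10])     # cluster labels for the first rows (samples)
print(model.column_labels_[:10])  # which columns (features) belong to which cluster
```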
I am still exploring this area of machine learning, and although I know the difference between Feature Selection and Dimensionality Reduction, I am having some difficulty grasping when to do Feature Selection or Dimensionality Reduction (or both together).
Assuming that I have a dataset with around 40 features, is it good practice to perform Dimensionality Reduction alone or Feature Selection alone? Or should there be a hybrid of both approaches (i.e. do feature selection first and then dimensionality reduction, or vice versa)?
The term feature selection is a bit misleading. It can have two meanings:
Selecting features by incorporating domain knowledge (this involves constructing new features as well).
For example, finding the rotation invariant points in an image data set or creating BMI as a new feature when you have height and weight as features.
Keeping only the features of high importance according to some measure
This is one step of the dimensionality reduction process. The so-called dimensionality reduction process actually involves two steps:
Transforming the original features to new (artificial) features by changing the basis.
e.g., PCA does so by finding a set of orthogonal features so that the variance along each axis is maximized.
Keeping only the most important features (importance defined by some measure) that result from the above step. This is actually a feature selection step.
e.g., in PCA this is achieved by keeping only the top k components with the highest explained variance.
As for the order in which the above (1) and (2) should happen: I think this is problem dependent.
If there's enough domain knowledge to construct/select features that cater to the problem at hand, we should do the manual feature engineering (plus selection) first. If this feature engineering/selection process still results in a large number of features, then dimensionality reduction can be applied to find a subspace that represents the data with an even smaller number of entirely new features that have almost no meaning in real life.
If domain knowledge can't add anything to the data set, just doing dimensionality reduction is fine, since it already contains a feature selection step.
In a broad sense, we can think of feature selection as a special case of dimensionality reduction in which no basis change is applied to the original data set, as the sketch below illustrates.
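A small sketch of that contrast (the synthetic data and k=5 are assumptions): univariate feature selection keeps a subset of the original columns untouched, while PCA changes the basis first and then keeps the top components.

```python
# Sketch: feature selection keeps a subset of the original columns unchanged,
# while PCA changes the basis first and then keeps the top components.
# The synthetic data and k=5 are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=300, n_features=20, n_informative=6,
                           random_state=0)

# Feature selection: no basis change, just a subset of the original features.
selector = SelectKBest(score_func=f_classif, k=5).fit(X, y)
X_selected = selector.transform(X)
print("kept original columns:", selector.get_support(indices=True))

# Dimensionality reduction: new features mix all original columns.
pca = PCA(n_components=5).fit(X)
X_pca = pca.transform(X)
print("variance explained by kept components:", pca.explained_variance_ratio_)
```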
I am developing a recommendation engine with the help of kNN. The data, though, is sparse: I have around 1500 samples and around 200 features. I have a binary target with values 1 or 0.
What techniques would be appropriate for feature selection here? I am assuming that if I choose a random forest for feature selection, then the selected features may differ from what kNN considers the important features to be.
Also, is there any restriction on the number of features I can keep, given that I have so few samples?
Feature selection techniques aim to exclude irrelevant features and/or redundant ones. One proven technique is to use supervised, entropy-based discretization (a more general explanation can be found here) to meaningfully reduce the size of your data, and then use information gain to get the top k features most correlated with the target variable. There are at least 5 different methods that you can try; it also depends on the ML library/framework that you are using to implement your app.
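A hedged sketch of that pipeline with scikit-learn is below. Note the assumptions: scikit-learn has no supervised entropy-based discretizer, so an unsupervised quantile discretizer stands in for it, mutual information plays the role of information gain, and the bin count and k are arbitrary.

```python
# Sketch: discretize the features, rank them by mutual information with the
# target, and keep the top k. The bin count, the quantile strategy (an
# unsupervised stand-in for supervised entropy-based discretization) and
# k=20 are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.preprocessing import KBinsDiscretizer

X, y = make_classification(n_samples=1500, n_features=200, n_informative=20,
                           random_state=0)

X_binned = KBinsDiscretizer(n_bins=5, encode="ordinal",
                            strategy="quantile").fit_transform(X)

selector = SelectKBest(score_func=mutual_info_classif, k=20).fit(X_binned, y)
X_top = selector.transform(X_binned)
print("selected feature indices:", selector.get_support(indices=True))
```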
I would try the Relief algorithm, since its core component is a nearest-neighbour search.
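To show what that nearest-neighbour core looks like, here is a bare-bones, unoptimized Relief sketch for a binary target written directly in NumPy; it is illustrative only, and a maintained implementation should be preferred in practice.

```python
# Bare-bones Relief sketch for a binary target: for each sampled instance,
# find its nearest hit (same class) and nearest miss (other class) and
# update the feature weights. Purely illustrative, O(n^2) and unoptimized.
import numpy as np

def relief_weights(X, y, n_iterations=100, random_state=0):
    rng = np.random.default_rng(random_state)
    n_samples, n_features = X.shape
    # Scale differences by each feature's range so weights are comparable.
    feature_range = X.max(axis=0) - X.min(axis=0)
    feature_range[feature_range == 0] = 1.0
    weights = np.zeros(n_features)

    for _ in range(n_iterations):
        i = rng.integers(n_samples)
        distances = np.linalg.norm(X - X[i], axis=1)
        distances[i] = np.inf                      # exclude the instance itself
        same = (y == y[i])
        same[i] = False
        nearest_hit = np.argmin(np.where(same, distances, np.inf))
        nearest_miss = np.argmin(np.where(~same, distances, np.inf))
        diff_hit = np.abs(X[i] - X[nearest_hit]) / feature_range
        diff_miss = np.abs(X[i] - X[nearest_miss]) / feature_range
        weights += diff_miss - diff_hit            # relevant features separate classes

    return weights / n_iterations

# Example on random data where only feature 0 drives the class label,
# so it should rank near the top.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)
print(np.argsort(relief_weights(X, y))[::-1])
```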
I'm quite new to machine learning and just got introduced to principal component analysis as a dimensionality reduction method. What I don't understand is: in which circumstances is PCA any better than simply removing some features from the model? If the aim is to obtain lower-dimensional data, why don't we just group the features that are correlated and retain a single feature from each group?
There is a fundamental difference between feature reduction (such as PCA) and feature selection (which you describe). The crucial difference is that feature reduction (PCA) maps your data to a lower-dimensional space through a projection of all the original dimensions; PCA, for example, uses a linear combination of them. So the final data embedding contains information from all features. If you perform feature selection, you discard information: you completely lose anything that was present only in the dropped features. Furthermore, PCA guarantees that you retain a given fraction of the data's variance.
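You can see both points directly from a fitted PCA object. In this hedged sketch (the wine data and 2 components are arbitrary assumptions), every retained component has a loading on every original feature, and the explained variance ratio tells you the fraction of variance kept.

```python
# Sketch: each principal component has a loading on every original feature,
# and explained_variance_ratio_ reports the fraction of variance retained.
# The wine data and 2 components are arbitrary illustration choices.
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_wine(return_X_y=True)              # 178 samples, 13 features
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2).fit(X_scaled)
print(pca.components_)                         # 2 rows, 13 loadings each: all features contribute
print(pca.explained_variance_ratio_.sum())     # fraction of total variance retained
```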
Does anyone have an idea of how a simple k-means algorithm could be tuned to handle data sets of this form (clusters that are not linearly separable)?
The most direct way to handle data of that form while still using k-means is to use a kernelized version of k-means. Two implementations of it exist in the JSAT library (see here: https://github.com/EdwardRaff/JSAT/blob/67fe66db3955da9f4192bb8f7823d2aa6662fc6f/JSAT/src/jsat/clustering/kmeans/ElkanKernelKMeans.java).
As Nicholas said, another option is to create a new feature space on which you run k-means. However, this requires some prior knowledge of what kind of data you will be clustering.
Beyond that, you really just need to move to a different algorithm. k-means is a simple algorithm that makes simple assumptions about the world, and when those assumptions are violated too strongly (non-linearly-separable clusters being one such violation), you just have to accept that and pick a more appropriate algorithm.
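As one concrete example of a more appropriate algorithm (not the JSAT code above), here is a hedged scikit-learn sketch on the classic two-circles data, where plain k-means struggles but a graph/kernel-style method typically recovers the rings; the data generator and the neighbour count are illustrative assumptions.

```python
# Sketch: plain KMeans vs. SpectralClustering on two concentric circles,
# a case where the linear-separability assumption of k-means breaks down.
# The data settings and n_neighbors are illustrative assumptions.
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_circles
from sklearn.metrics import adjusted_rand_score

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
sc_labels = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                               n_neighbors=10, random_state=0).fit_predict(X)

print("k-means ARI:  ", adjusted_rand_score(y, km_labels))  # typically near 0
print("spectral ARI: ", adjusted_rand_score(y, sc_labels))  # typically near 1
```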
One possible solution to this problem is to add another dimension to your data set, along which there is a split between the two classes.
Obviously this is not applicable in many cases, but if you have applied some sort of dimensionality reduction to your data, then it may be something worth investigating.
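For the concentric-rings kind of data discussed above, one such extra dimension is the distance from the origin. The sketch below is an assumption-laden illustration: the radius feature is hand-weighted so that the split along the new axis dominates the k-means objective.

```python
# Sketch: append a radius feature (distance from the origin) so the two rings
# become separable along the new axis, then run plain k-means on the
# augmented data. The x3 weight on the new feature is a hand-picked
# illustration choice, not a general recipe.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_circles
from sklearn.metrics import adjusted_rand_score

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

radius = np.linalg.norm(X, axis=1, keepdims=True)   # the added dimension
X_aug = np.hstack([X, 3.0 * radius])                # hand-weighted for illustration

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_aug)
print("ARI with the extra radius feature:", adjusted_rand_score(y, labels))
```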