I am still exploring this area of Machine Learning, and although I know the difference between Feature Selection and Dimensionality Reduction, I am having some difficulty grasping when to do Feature Selection, when to do Dimensionality Reduction, and when to do both together.
Assuming that I have a dataset with around 40 features, is it good practice to perform Dimensionality Reduction alone or Feature Selection alone? Or should there be a hybrid of both approaches (i.e. do feature selection first and then dimensionality reduction, or vice versa)?
The term feature selection is a bit misleading. It can have two meanings:
Selecting features by incorporating domain knowledge (this can involve constructing new features as well).
For example, finding the rotation invariant points in an image data set or creating BMI as a new feature when you have height and weight as features.
Keeping only the features of high importance according to some measure.
This is one step of the dimensionality reduction process. The so-called dimensionality reduction process actually involves two steps:
Transforming the original features to new (artificial) features by changing the basis.
e.g. PCA does so by finding a set of orthogonal directions such that the variance along each axis is maximized.
Keeping only the most important features (importance being defined by some measure) resulting from the above step. This is actually a feature selection step.
e.g. In PCA, this is achieved by keeping only the top k components with the highest explained variance.
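To make these two steps concrete, here is a minimal sketch using scikit-learn's PCA; the toy data and the choice of k = 3 are placeholder assumptions, not part of the original answer:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Toy data: 200 samples, 10 original features (placeholder for your dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))

# Step 1: change of basis -- PCA finds orthogonal directions of maximal variance
X_scaled = StandardScaler().fit_transform(X)
pca = PCA()
X_transformed = pca.fit_transform(X_scaled)

# Step 2: "feature selection" on the new features -- keep only the top-k
# components with the highest explained variance
k = 3
X_reduced = X_transformed[:, :k]
print(pca.explained_variance_ratio_[:k])  # variance explained by the kept components
```

Equivalently, PCA(n_components=k) performs the basis change and the truncation in one go.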
As for the order in which (1) and (2) above should happen: I think this is problem dependent.
If there's enough domain knowledge to construct/select features suited to the problem at hand, we should do the manual feature engineering (plus selection) first. If this feature engineering/selection process still results in a large number of features, then the so-called dimensionality reduction can be applied to find a subspace that represents the data with even fewer, entirely new features that have almost no meaning in real life.
If domain knowledge can't add anything to the data set, just doing dimensionality reduction would be fine, since it already contains a feature selection step.
In a broad sense, we can think of feature selection as a special case of dimensionality reduction in which no basis change is applied to the original data set.
Related
In machine learning, more features or dimensions can decrease a model's accuracy, since there is more data that the model needs to generalize over; this is known as the curse of dimensionality.
Dimensionality reduction is a way to reduce the complexity of a model and avoid overfitting. The Principal Component Analysis (PCA) algorithm is used to compress a dataset onto a lower-dimensional feature space to reduce the complexity of the model.
When/how should I decide that my data set has too many features and that I should use PCA for dimensionality reduction?
The simple answer is: it is used when we need to tackle the curse of dimensionality.
When should I use PCA?
Do you want to reduce the number of variables, but aren’t able to identify variables to completely remove from consideration?
Do you want to ensure your variables are independent of one another?
Are you comfortable making your independent variables less interpretable?
If you answered “yes” to all three questions, then PCA is a good method to use. If you answered “no” to question 3, you should not use PCA.
A good tutorial is here.
Let me provide another view into this.
In general, you can use Principal Component Analysis for two main reasons:
For compression:
To reduce space to store your data, for example.
To speed up your learning algorithm, by selecting the principal components with the most variance and looking at the cumulative variance of the components.
For visualization purposes, using 2 or 3 components.
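A rough sketch of both uses, assuming scikit-learn and matplotlib; the digits dataset, the 95% variance threshold, and the 2-component plot are arbitrary illustrative choices:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)           # example dataset with 64 features
X_scaled = StandardScaler().fit_transform(X)

# Compression: keep just enough components to explain 95% of the variance
pca_95 = PCA(n_components=0.95)
X_compressed = pca_95.fit_transform(X_scaled)
print(X_compressed.shape)                     # far fewer than 64 columns

# Or inspect the cumulative explained variance and pick a cut-off yourself
cumvar = np.cumsum(PCA().fit(X_scaled).explained_variance_ratio_)
print(np.argmax(cumvar >= 0.95) + 1)          # number of components for 95%

# Visualization: project onto the first 2 components
X_2d = PCA(n_components=2).fit_transform(X_scaled)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, s=5)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()
```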
I have a dataset that contains around 30 features and I want to find out which features contribute the most to the outcome. I have 5 algorithms:
Neural Networks
Logistic Regression
Naive Bayes
Random Forest
Adaboost
I have read a lot about the Information Gain technique, and it seems to be independent of the machine learning algorithm used; it is like a preprocessing technique.
My question is: is it best practice to compute feature importance for each algorithm separately, or to just use Information Gain? If the former, what are the techniques used for each?
First of all, it's worth stressing that you have to perform the feature selection based on the training data only, even if it is a separate algorithm. During testing, you then select the same features from the test dataset.
Some approaches that spring to mind:
Mutual information based feature selection (eg here), independent of the classifier; see the sketch after this list.
Backward or forward selection (see stackexchange question), applicable to any classifier but potentially costly since you need to train/test many models.
Regularisation techniques that are part of the classifier optimisation, eg Lasso or elastic net. The latter can be better in datasets with high collinearity.
Principal components analysis or any other dimensionality reduction technique that groups your features (example).
Some models compute latent variables which you can use for interpretation instead of the original features (e.g. Partial Least Squares or Canonical Correlation Analysis).
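As a small sketch of the first option (mutual information, fitted on the training split only, as stressed above); the toy dataset and k = 10 are placeholders for your own data and choices:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split

# Toy data standing in for your 30-feature dataset
X, y = make_classification(n_samples=500, n_features=30, n_informative=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit the selector on the training data only...
selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_train_sel = selector.fit_transform(X_train, y_train)

# ...then apply the exact same selection to the test data
X_test_sel = selector.transform(X_test)
print(selector.get_support(indices=True))  # indices of the selected features
```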
Specific classifiers can aid interpretability by providing extra information about the features/predictors, off the top of my head:
Logistic regression: you can obtain a p-value for every feature. In your interpretation you can focus on those that are 'significant' (eg p-value < 0.05). (The same goes for two-class Linear Discriminant Analysis.)
Random Forest: can return a variable importance index that ranks the variables from most to least important.
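To illustrate these last two points, a rough sketch; the toy data stands in for your own features, and the logistic regression p-values come from statsmodels (scikit-learn does not report them):

```python
import numpy as np
import statsmodels.api as sm
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, n_informative=4, random_state=0)

# Logistic regression: one p-value per feature
logit = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
print(logit.pvalues[1:])          # skip the intercept

# Random Forest: impurity-based importances, ranked most to least important
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
ranking = np.argsort(rf.feature_importances_)[::-1]
print(ranking)                    # feature indices, most important first
```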
I have a dataset that contains around 30 features and I want to find out which features contribute the most to the outcome.
This will depend on the algorithm. If you have 5 algorithms, you will likely get 5 slightly different answers, unless you perform the feature selection prior to classification (eg using mutual information). One reason is that Random Forests and neural networks would pick up nonlinear relationships while logistic regression wouldn't. Furthermore, Naive Bayes is blind to interactions.
So unless your research is explicitly about these 5 models, I would rather select one model and proceed with it.
Since your purpose is to get some intuition on what's going on, here is what you can do:
Let's start with Random Forest for simplicity, but you can do this with other algorithms too. First, you need to build a good model: good in the sense that you are satisfied with its performance, and robust, meaning that you should use a validation and/or a test set. These points are very important because we will analyse how the model makes its decisions, so if the model is bad you will get bad intuitions.
After having built the model, you can analyse it at two levels: for the whole dataset (understanding your process), or for a given prediction. For this task I suggest you look at the SHAP library, which computes feature contributions (i.e. how much a feature influences the prediction of the classifier) that can be used for both purposes.
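A rough sketch of that two-level analysis with the SHAP library; the data and model are placeholders, and the handling of the class axis is only an assumption about how your shap version lays out the output:

```python
import numpy as np
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy stand-in for your own, already validated, model and data
X, y = make_classification(n_samples=1000, n_features=15, n_informative=5, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_train, y_train)

# SHAP contributions (TreeExplainer is the fast path for tree models)
explainer = shap.TreeExplainer(model)
sv = np.asarray(explainer.shap_values(X_valid))
# Older shap versions return a list with one (samples, features) array per class,
# newer ones a single (samples, features, classes) array; take the class-1 slice.
sv_class1 = sv[1] if sv.shape[0] == 2 else sv[:, :, 1]

# Dataset-level view: mean absolute contribution of each feature
global_importance = np.abs(sv_class1).mean(axis=0)
print(np.argsort(global_importance)[::-1])   # features ranked by overall influence

# Single-prediction view: contributions pushing the first sample towards class 1
print(sv_class1[0])
```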
For detailed instructions about this process and more tools, you can look at the excellent fast.ai machine learning course, where lessons 2/3/4/5 cover this subject.
Hope it helps!
I am developing a recommendation engine with the help of kNN. The data, though, is sparse: I have around 1500 samples and around 200 features, and a binary target with values 1 or 0.
What would be suitable techniques to do feature selection for it? I am assuming that if I choose Random Forest for feature selection, then the selected features may be different from what kNN would consider important.
Also, is there any restriction on the number of features, given that I have so few samples?
Feature selection techniques aim to exclude irrelevant features and/or redundant ones. One proven technique is to use supervised discretization based on entropy (a more generic explanation can be found here) to meaningfully reduce the size of your data, and then use Information Gain to get the top k features most correlated with the target variable. There are at least 5 different methods that you can try; it also depends on the ML library/framework that you are using to implement your app.
I would try the Relief algorithm, since its core part is a nearest neighbour search.
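Below is a minimal sketch of the core Relief idea (reward features that separate the nearest miss, punish features that differ on the nearest hit) in plain numpy; it is for intuition only, and in practice you would reach for a tested implementation such as ReliefF in the scikit-rebate (skrebate) package.

```python
import numpy as np

def relief_weights(X, y, n_iter=100, rng=None):
    """Basic Relief weights; assumes features scaled to [0, 1] and a binary target."""
    rng = np.random.default_rng(rng)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(n_iter):
        i = rng.integers(n_samples)
        xi, yi = X[i], y[i]
        dists = np.abs(X - xi).sum(axis=1)           # L1 distance to every sample
        dists[i] = np.inf                            # exclude the sample itself
        same, other = (y == yi), (y != yi)
        hit = np.argmin(np.where(same, dists, np.inf))    # nearest same-class sample
        miss = np.argmin(np.where(other, dists, np.inf))  # nearest other-class sample
        w += (np.abs(xi - X[miss]) - np.abs(xi - X[hit])) / n_iter
    return w

# Toy usage: ~1500 sparse samples, 200 features, binary target (placeholders)
rng = np.random.default_rng(0)
X = (rng.random((1500, 200)) < 0.1).astype(float)    # sparse binary features
y = (X[:, 0] + X[:, 1] > 0).astype(int)              # target driven by features 0 and 1
weights = relief_weights(X, y, n_iter=200, rng=0)
print(np.argsort(weights)[::-1][:10])                # top-10 features by Relief weight
```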
I'm quite new to machine learning and just got introduced to principal component analysis as a dimensionality reduction method. What I don't understand is: in which circumstances is PCA any better than simply removing some features from the model? If the aim is to obtain lower-dimensional data, why don't we just group the features that are correlated and retain a single feature from each group?
There is a fundamental difference between feature reduction (such as PCA) and feature selection (which you describe). The crucial difference is that feature reduction (PCA) maps your data to a lower-dimensional space through some projection of all the original dimensions; for example, PCA uses a linear combination of all of them. So the final data embedding carries information from all features. If you perform feature selection you discard information: you completely lose anything that was present in the dropped features. Furthermore, PCA guarantees that you retain a given fraction of the data's variance.
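To make the variance point concrete, here is a small sketch comparing the variance retained by the top-k principal components with the variance retained by keeping any k of the original (standardised) features; the synthetic, deliberately correlated data is an illustrative assumption:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 features built from 3 latent signals, so they are correlated
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(500, 10))
Xs = StandardScaler().fit_transform(X)

k = 3
# Fraction of total variance retained by the top-k principal components
pca = PCA().fit(Xs)
print("PCA, top-%d components: %.2f" % (k, pca.explained_variance_ratio_[:k].sum()))

# Fraction retained by keeping any k original features: each standardised feature
# carries 1/10 of the total variance, regardless of which ones you keep
print("Any %d original features: %.2f" % (k, k / Xs.shape[1]))
```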
Essentially I have a data set that has a feature vector and a label indicating whether each item is spam or non-spam.
To get the labels for this data, 2 distinct types of expert were used, each using a different approach to evaluate the item; the type of expert used then also became a feature in the vector.
Training and then testing on a separate portion of the data has achieved a high degree of accuracy using a Random Forest algorithm.
However, it is clear now that, the feature describing the expert who made the label will not be available in a live environment. So I have tried a number of approaches to reflect this:
Remove the feature from the set and retrain and test
Split the data into 2 distinct sets based on the feature, and then train and test 2 separate classifiers
For the test data, set the feature in question all to the same value
With all 3 approaches, the classifiers have dropped from being highly accurate to being virtually useless.
So I am looking for any advice or intuitions as to why this has occurred, and how I might approach resolving it so as to regain some of the accuracy I was previously seeing.
To be clear I have no background in machine learning or statistics and am simply using a third party c# code library as a black box to achieve these results.
Sounds like you've completely overfit to the "who labeled what" feature (and combinations of this feature with other features). You can find out for sure by inspecting the random forest's feature importances and checking whether the annotator feature ranks high. Another way to find out is to let the annotators check each other's annotations and compute an agreement score such as Cohen's kappa. A low value, say less than .5, indicates disagreement among the annotators, which makes machine learning very hard.
Since the feature will not be available at test time, there's no easy way to get the performance back.
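As a concrete way to run the two checks suggested above (how dominant the annotator feature is, and how well the annotators agree), a rough sketch; the toy data, the column index of the annotator feature, and the two expert label arrays are placeholders for your own data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import cohen_kappa_score

# Toy stand-in: 20 ordinary features plus an "annotator type" column at index 20
X, y = make_classification(n_samples=2000, n_features=20, n_informative=6, random_state=0)
annotator = (np.random.default_rng(0).random(2000) < 0.5).astype(int)
X_with_annotator = np.column_stack([X, annotator])

# Check 1: where does the annotator feature rank in the forest's importances?
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_with_annotator, y)
rank = np.argsort(rf.feature_importances_)[::-1].tolist().index(20)
print("annotator feature importance rank:", rank)

# Check 2: agreement between the two experts on items they both labelled
# (labels_expert_a / labels_expert_b are hypothetical arrays of their labels)
labels_expert_a = np.random.default_rng(1).integers(0, 2, 100)
labels_expert_b = np.random.default_rng(2).integers(0, 2, 100)
print("Cohen's kappa:", cohen_kappa_score(labels_expert_a, labels_expert_b))
```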