Random Forest important features output stability issue - machine-learning

I'm fitting 2 almost identical Random Forest regression models. Both models use the same data set that have 60 features and 90 data points. The only difference is they're using different targets (the target column of each model is excluded from the respective features dataframes, of course). All of the cross validation settings are same of both models (number of folds, number of iterations, scoring) and the hyperparameter grids are also identical.
I'm interested in the feature importance output. However, one of the model consistently output the same top features while the other doesn't. Does anyone know why this is the case?

You can set a seed or the parameter random_state in case you rely on sklearn.ensemble.RandomForestRegressor in order to stabilize your results.
It's quite normal to get varying feature importance since the forest is assembled randomly. Furthermore, feature importance may not be the optimal metric to evaluate actual feature importance. You could try Boruta-Algorithm/Permutation Feature Importance to get a different perspective.
Towards your actual question, maybe your regressors are better suited to predict one target variable over the other.
How do both models perform accuracy-wise on the data? This might be one possibility to explain why one model is more stable. Do feature importances remain unstable for a larger amount of trees fitted?

Related

Shouldn't the variables ranking be the same for MLP and RF?

I have a question about variable importance ranking.
I built an MLP and an RF model using the same dataset with 34 variables and achieved the same accuracy on a similar test dataset. As you can see in the picture below the top variables for the SHAP summary plot and the RF VIM are quite different.
Interestingly, I removed the low-ranked variable from the MLP and the accuracy increased. However, the RF result didn’t change.
Does that mean the RF is not a good choice for modeling this dataset?
It’s still strange to me that the rankings are so different:
SHAP summary plot vs. RF VIM, I numbered the top and low-ranked variable
Shouldn't the variables ranking be the same for MLP and RF?
No. There may be tendency for different algos to rank certain features higher, but there is no reason for ranking to be the same.
Different algorithms:
May have different objective functions to achieve intended goal.
May use features differently to achieve min (max) of the objective function.
On top, what you cite as RF "feature importances" (mean decrease in Gini) is only one of the many ways to calculate "feature importance" for RF (including which metric you use, and how you calculate total decrease due to a feature). In contrast, SHAP is model agnostic when it comes to explaining feature contributions to outcome.
In sum:
Different models will have different opinions about what is important and not. What is important for one algo may be not so important for another and vice versa. It doesn't tell anything about applicability of a model to a specific dataset.
Use SHAP values (or any other feature importance metric that you and your clients understand) to explain a model (if necessary).
Choose "best" model based on your goals: performance or explainability.

Feature selection - how to go about it when you have way too many features?

Let's assume you have 1,400 columns/data points for 200k entries and your goal is to determine which of these columns show the most signal towards a simple classification task.
I've already removed columns with a threshold of null values, low variance, bad and also too many levels for categorical, and I still have 900+ columns.
I can use lasso if I only include the 500+ numerical columns, but if I try to include the categorical as well I keep crashing, it's too much data to process.
How would you go about further reducing features in that case? My goal, more than the classification itself, is to identify the features that bring in the most information towards the classification task.
You could use a data driven approach, for example the most simple one would be to use the L1 regularisation on a logistic regression (with your simple classification task) and looking at the weights you select the ones that are not zero or close to zero.
Basically the L1 norm on the model weights enforces the sparsity of the weights vector, and in doing so, the only surviving weight are the ones corresponding to the "important" features.
In any case be careful and normalise the data before using this technique and also be careful about categorical and scalar features...
You can also use a Neural network, and then compute the gradient w.r.t. the input to see what influences the decision more.
Or some other technique like: https://link.springer.com/chapter/10.1007/978-3-030-33778-0_24
Alternatively you can also use a Random Forest model and do feature importance like: https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html

Which predictive models in sklearn are affected by the order of the columns in the training dataframe?

I'm wondering if any of the estimators that Sci-kit Learn provides is affected by the order of the columns in the dataframe by which it is being trained. I tried establishing a baseline by using ExtraTreesRegressor and it came out to 3 different scores:
.531687 for the regular order
.535309 for the reverse order
.554458 for the regular order
Obviously ExtraTreesRegressor is not a good example here, so I tried LinearRegression but it gave .295898 no matter what the order of the columns were.
What I want to know is if there are ANY estimators that are affected by the order of the columns and if there are not then can you point me in the direction of some way, or provide some code, that I can use to make sure that the order of the columns does matter?
Any algorithm that involves some randomness in selecting features while building the model is expected to be affected from their order; AFAIK, the only cases present in scikit-learn are the Extra Trees and the Random Forest (in both their incarnations as classifiers or regressors), which indeed share some similarities.
The smoking gun for such a behavior is the argument max_features; from the RF docs (the description is identical in the Extra Trees as well):
max_features : {“auto”, “sqrt”, “log2”} int or float, default=”auto”
The number of features to consider when looking for the best split
I am not aware of other algorithms that involve such kind of random feature selection (linear models, decision trees, SVMs, naive Bayes, neural nets, and gradient boosted trees do not), but if you glimpse something similar enough in the documentation, you can bet that the respective algorithm is also affected by the order of the features.
Keep in mind that such slight discrepancies that should not happen in theory are rather to be expected in models where randomness enters from way too many angles. For a similar case with RF in R (slightly different results when asking for importance=TRUE), check my answer in Why does the importance parameter influence performance of Random Forest in R?

Random Forest - Max Features

I do have a question and I need your support. I have a data set which I am analyzing. I need to predict a target. To do this I did some data cleaning, among others drop highly (linear correlated feautes)
After preparing my data I applied random forest regressor (it is a regression problem). I am stucked a bit, since I really cannot catch the meaning and thus the value for max_features
I found the following page answer, where it is written
features=n_features for regression is a mistake on scikit's part. The original paper for RF gave max_features = n_features/3 for regression
I do get different results if I use max_features=sqrt(n) or max_features=n_features
Can any1 give me a good explanation how to approach this parameter?
That would be really great
max_features is a parameter that needs to be tuned. Values such as sqrt or n/3 are defaults and usually perform decently, but the parameter needs to be optimized for every dataset, as it will depend on the features you have, their correlations and importances.
Therefore, I suggest training the model many times with a grid of values for max_features, trying every possible value from 2 to the total number of your features. Train your RandomForestRegressor with oob_score=True and use oob_score_ to assess the performance of the Forest. Once you have looped over all possible values of max_features, keep the one that obtained the highest oob_score.
For safety, keep the n_estimators on the high end.
PS: this procedure is basically a grid search optimization for one parameter, and is usually done via Cross Validation. Since RFs give you OOB scores, you can use these instead of CV scores, as they are quicker to compute.

Model selection for classification with random train/test sets

I'm working with an extremelly unbalanced and heterogeneous multiclass {K = 16} database for research, with a small N ~= 250. For some labels the database has a sufficient amount of examples for supervised machine learning, but for others I have almost none. I'm also not in a position to expand my database for a number of reasons.
As a first approach I divided my database into training (80%) and test (20%) sets in a stratified way. On top of that, I applied several classification algorithms that provide some results. I applied this procedure over 500 stratified train/test sets (as each stratified sampling takes individuals randomly within each stratum), hoping to select an algorithm (model) that performed acceptably.
Because of my database, depending on the specific examples that are part of the train set, the performance on the test set varies greatly. I'm dealing with runs that have as high (for my application) as 82% accuracy and runs that have as low as 40%. The median over all runs is around 67% accuracy.
When facing this situation, I'm unsure on what is the standard procedure (if there is any) when selecting the best performing model. My rationale is that the 90% model may generalize better because the specific examples selected in the training set are be richer so that the test set is better classified. However, I'm fully aware of the possibility of the test set being composed of "simpler" cases that are easier to classify or the train set comprising all hard-to-classify cases.
Is there any standard procedure to select the best performing model considering that the distribution of examples in my train/test sets cause the results to vary greatly? Am I making a conceptual mistake somewhere? Do practitioners usually select the best performing model without any further exploration?
I don't like the idea of using the mean/median accuracy, as obviously some models generalize better than others, but I'm by no means an expert in the field.
Confusion matrix of the predicted label on the test set of one of the best cases:
Confusion matrix of the predicted label on the test set of one of the worst cases:
They both use the same algorithm and parameters.
Good Accuracy =/= Good Model
I want to firstly point out that a good accuracy on your test set need not equal a good model in general! This has (in your case) mainly to do with your extremely skewed distribution of samples.
Especially when doing a stratified split, and having one class dominatingly represented, you will likely get good results by simply predicting this one class over and over again.
A good way to see if this is happening is to look at a confusion matrix (better picture here) of your predictions.
If there is one class that seems to confuse other classes as well, that is an indicator for a bad model. I would argue that in your case it would be generally very hard to find a good model unless you do actively try to balance your classes more during training.
Use the power of Ensembles
Another idea is indeed to use ensembling over multiple models (in your case resulting from different splits), since it is assumed to generalize better.
Even if you might sacrifice a lot of accuracy on paper, I would bet that a confusion matrix of an ensemble is likely to look much better than the one of a single "high accuracy" model. Especially if you disregard the models that perform extremely poor (make sure that, again, the "poor" performance comes from an actual bad performance, and not just an unlucky split), I can see a very good generalization.
Try k-fold Cross-Validation
Another common technique is k-fold cross-validation. Instead of performing your evaluation on a single 80/20 split, you essentially divide your data in k equally large sets, and then always train on k-1 sets, while evaluating on the other set. You then not only get a feeling whether your split was reasonable (you usually get all the results for different splits in k-fold CV implementations, like the one from sklearn), but you also get an overall score that tells you the average of all folds.
Note that 5-fold CV would equal a split into 5 20% sets, so essentially what you are doing now, plus the "shuffling part".
CV is also a good way to deal with little training data, in settings where you have imbalanced classes, or where you generally want to make sure your model actually performs well.

Resources