I have six features for my model: f1, f2, f3, f4, f5 and f6.
Their feature importance scores are ordered
f1 > f2 > f3 > f4 > f5 > f6
but the RMSE of the model with features f1, f4 and f5 is lower than the RMSE of the model with all six features (f1, f2, f3, f4, f5 and f6), and also lower than that of the model with features f1, f2 and f3. Is there any possible reason for this?
It is hard to guess without the data.
However, typically this results from correlated features.
For example, if f2 == f1, adding f2 to a model that already contains f1 provides no additional value. However, adding an uncorrelated feature such as f4 can add a lot, even though f2 ranks higher than f4 in the importance scores.
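As a rough illustration of that point, here is a minimal sketch on synthetic data (not your dataset): x2 is almost a copy of x1, while x4 carries independent signal, so adding x2 barely changes the RMSE, whereas adding x4 helps a lot.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # almost a copy of x1
x4 = rng.normal(size=n)                    # independent, informative feature
y = 3 * x1 + 2 * x4 + rng.normal(size=n)

def rmse_for(features):
    X = np.column_stack(features)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    pred = LinearRegression().fit(X_tr, y_tr).predict(X_te)
    return mean_squared_error(y_te, pred) ** 0.5

print(rmse_for([x1]))        # baseline
print(rmse_for([x1, x2]))    # near-duplicate adds almost nothing
print(rmse_for([x1, x4]))    # uncorrelated informative feature helps a lot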
I have a question about variable importance ranking.
I built an MLP and an RF model using the same dataset with 34 variables and achieved the same accuracy on a similar test dataset. As you can see in the picture below, the top variables in the SHAP summary plot and in the RF VIM are quite different.
Interestingly, when I removed the low-ranked variable from the MLP, its accuracy increased. The RF result, however, didn't change.
Does that mean the RF is not a good choice for modeling this dataset?
It’s still strange to me that the rankings are so different:
[Figure: SHAP summary plot vs. RF VIM; the top- and low-ranked variables are numbered]
Shouldn't the variable rankings be the same for the MLP and the RF?
No. There may be a tendency for different algorithms to rank certain features higher, but there is no reason for the rankings to be the same.
Different algorithms:
May optimise different objective functions to achieve the intended goal.
May use the features differently to reach the minimum (or maximum) of that objective function.
On top of that, what you cite as RF "feature importance" (mean decrease in Gini impurity) is only one of many ways to calculate feature importance for an RF (the result depends on which metric you use and how you aggregate the total decrease attributable to a feature). In contrast, SHAP is model-agnostic when it comes to explaining feature contributions to the outcome.
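To illustrate, here is a minimal sketch (on a synthetic dataset, not yours) showing that sklearn's impurity-based importances and permutation importances can already disagree for the very same RF model:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Mean decrease in impurity (what RF "VIM" usually refers to)
print(rf.feature_importances_)

# Permutation importance on held-out data: a different definition of "importance"
perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
print(perm.importances_mean)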
In sum:
Different models will have different opinions about what is important and what is not. What is important for one algorithm may be less important for another, and vice versa. This by itself doesn't tell you anything about the applicability of a model to a specific dataset.
Use SHAP values (or any other feature importance metric that you and your clients understand) to explain a model (if necessary).
Choose "best" model based on your goals: performance or explainability.
I'm fairly new to data analysis and machine learning. I've been carrying out some KNN classification analysis on a breast cancer dataset in Python's sklearn module. I have the following code, which attempts to find the optimal k for classification of a target variable.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt
breast_cancer_data = load_breast_cancer()
training_data, validation_data, training_labels, validation_labels = train_test_split(breast_cancer_data.data, breast_cancer_data.target, test_size = 0.2, random_state = 40)
results = []
for k in range(1,101):
    classifier = KNeighborsClassifier(n_neighbors = k)
    classifier.fit(training_data, training_labels)
    results.append(classifier.score(validation_data, validation_labels))
k_list = range(1,101)
plt.plot(k_list, results)
plt.ylim(0.85,0.99)
plt.xlabel("k")
plt.ylabel("Accuracy")
plt.title("Breast Cancer Classifier Accuracy")
plt.show()
The code loops through 1 to 100 and generates 100 KNN models with 'k' set to incremental values in the range 1 to 100. The performance of each of those models is saved to a list and a plot is generated showing 'k' on the x-axis and model performance on the y-axis.
The problem I have is that when I change the random_state parameter used to split the data into training and testing partitions, I get completely different plots, indicating varying model performance for different 'k' values across different dataset partitions.
This makes it difficult for me to decide which 'k' is optimal, as the algorithm performs differently for different 'k's under different random states. Surely this doesn't mean that, for this particular dataset, 'k' is arbitrary? Can anyone help shed some light on this?
Thanks in anticipation
This is completely expected. When you do the train-test split, you are effectively sampling from your original population. This means that when you fit a model, any statistic (such as a model parameter estimate, or a model score) will itself be a sample estimate drawn from some distribution. What you really want is a confidence interval around this score, and the easiest way to get that is to repeat the sampling and remeasure the score.
But you have to be very careful how you do this. Here are some robust options:
1. Cross Validation
The most common solution to this problem is to use K-fold cross-validation. In order not to confuse this K with the k from KNN, I'm going to use a capital letter for cross-validation (but bear in mind this is not standard nomenclature). This is a scheme that implements the suggestion above but without a target leak. Instead of creating many splits at random, you split the data into K parts (called folds). You then train K models, each time on K-1 folds of the data, leaving aside a different fold as your test set each time. Now each model is independent and without a target leak. It turns out that the mean of whatever success score you use across these K models on their K separate test sets is a good estimate of the performance of training a model with those hyperparameters on the whole set. So now you should get a more stable score for each of your different values of k (lower-case k for KNN), and you can choose a final k this way.
Some extra notes:
Accuracy is a poor measure of classification performance. Look at metrics like precision and recall, AUROC or F1 instead.
Don't try to program CV yourself; use sklearn's GridSearchCV.
If you are doing any preprocessing on your data that calculates some sort of state from the data, that state needs to be computed on only the training data in each fold. For example, if you are scaling your data, you can't include the test data when you fit the scaler. You need to fit (and transform) the scaler on the training data and then use that same scaler to transform your test data (don't fit it again). To get this to work inside CV you need to use sklearn Pipelines (see the sketch after these notes). This is very important, make sure you understand it.
You might get more stability if you stratify your train-test-split based on the output class. See the stratify argument on train_test_split.
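Here is a minimal sketch of what that could look like for your example (the grid, the scorer and the scaler are illustrative choices, not the only sensible ones):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=40)

# Scaling happens inside each CV fold, so there is no leakage from the held-out fold
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])
grid = GridSearchCV(pipe, {"knn__n_neighbors": list(range(1, 101))}, cv=5, scoring="f1")
grid.fit(X_train, y_train)

print(grid.best_params_)              # the k chosen by cross-validation
print(grid.score(X_test, y_test))     # final check on the untouched test set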
Note the CV is the industry standard and that's what you should do, but there are other options:
2. Bootstrapping
You can read about this in detail in An Introduction to Statistical Learning, section 5.2 (p. 187), with examples in section 5.3.4.
The idea is to take your training set and draw a random sample from it with replacement. This means you end up with some repeated records. You take this new training set, train a model, and then score it on the records that didn't make it into the bootstrapped sample (often called out-of-bag samples). You repeat this process multiple times. You can now get a distribution of your score (e.g. accuracy), which you can use to choose your hyperparameter rather than just the point estimate you were using before.
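A rough sketch of that idea, using sklearn's resample helper (the number of bootstrap rounds and k = 5 are arbitrary choices):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.utils import resample

X, y = load_breast_cancer(return_X_y=True)
n = len(X)
scores = []
for b in range(200):
    # Bootstrap sample: draw n rows with replacement
    idx = resample(np.arange(n), replace=True, n_samples=n, random_state=b)
    oob = np.setdiff1d(np.arange(n), idx)          # out-of-bag rows
    model = KNeighborsClassifier(n_neighbors=5).fit(X[idx], y[idx])
    scores.append(model.score(X[oob], y[oob]))

# Point estimate plus a rough interval for the accuracy
print(np.mean(scores), np.percentile(scores, [2.5, 97.5]))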
3. Making sure your validation set is representative of your test set
Jeremy Howard has a very interesting suggestion on how to calibrate your validation set so that it is a good representation of your test set. You only need to watch about 5 minutes from where that link starts. The idea is to split into three sets (which you should be doing anyway to choose a hyperparameter like k), train a bunch of very different but simple, quick models on your train set, and then score them on both your validation and test sets. It is OK to use the test set here because these aren't real models that will influence your final model. Then plot the validation scores vs. the test scores. They should fall roughly on a straight line (the y = x line). If they do, this means the validation set and test set are both either good or bad, i.e. performance on the validation set is representative of performance on the test set. If they don't fall on this straight line, it means the model scores you get from your validation set are not indicative of the scores you'll get on unseen data, and thus you can't use that split to train a sensible model.
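A minimal sketch of that diagnostic (the particular quick models and split sizes below are arbitrary choices):

import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

quick_models = [DummyClassifier(), GaussianNB(), DecisionTreeClassifier(max_depth=3),
                KNeighborsClassifier(3), LogisticRegression(max_iter=5000)]
val_scores, test_scores = [], []
for m in quick_models:
    m.fit(X_train, y_train)
    val_scores.append(m.score(X_val, y_val))
    test_scores.append(m.score(X_test, y_test))

# Points close to the y = x line mean validation performance tracks test performance
plt.scatter(val_scores, test_scores)
plt.plot([0.5, 1.0], [0.5, 1.0])
plt.xlabel("Validation accuracy")
plt.ylabel("Test accuracy")
plt.show()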
4. Get a larger data set
This is obviously not very practical for your situation but I thought I'd mention it for completeness. As your sample size increases, your standard error drops (i.e. you can get tighter bounds on your confidence intervals). But you'll need more training and more test data. While you might not have access to that here, it's worth keeping in mind for real world situations where you can assess the trade-off of the cost of gathering new data vs the desired accuracy in assessing your model performance (and probably the performance itself too).
This "behavior" is to be expected. Of course you get different results, when training and test is split differently.
You can approach the problem statistically, by repeating each 'k' several times with new train-validation-splits. Then take the median performance for each k. Or even better: look at the performance distribution and the median. A narrow performance distribution for a given 'k' is also a good sign that the 'k' is chosen well.
Afterwards you can use the test set to test your model
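A minimal sketch of that approach (the number of repeats and the k range are arbitrary choices):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
k_values = range(1, 51)
scores = np.zeros((20, len(k_values)))        # rows: repeated splits, columns: k values

for rep in range(20):
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=rep)
    for j, k in enumerate(k_values):
        model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
        scores[rep, j] = model.score(X_val, y_val)

medians = np.median(scores, axis=0)
spread = scores.std(axis=0)
best = int(np.argmax(medians))
print(k_values[best], medians[best], spread[best])   # k with the highest median accuracy, plus its spread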
I have ~12 features and not much data. I would like to train a machine learning model but tell it that, according to prior information I have, some features are more important than others. Is there a way to do that? One idea I came up with was to generate a lot of extra data based on the pre-existing data with small changes, keeping the same labels, and thus cover more of the search space. I would like the relative feature importance matrix below to carry some weight in the final feature importance (as generated by a classification tree, for example).
Ideally it would be like
Relative feature importance matrix:
N     F1    F2    F3
F1    1     2     N
F2    0.5   1     1
F3    N     1     1
If I understand the question, you want some features to be treated as more important than others. One way to do this is to assign weights to the individual features themselves, based on which ones you want the model to take into account more heavily.
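This is easiest to make concrete for distance-based models such as KNN, where rescaling a feature directly changes how much it contributes to the distance; note that tree-based models are largely insensitive to such rescaling, so for trees you would need a different mechanism (e.g. feature selection or per-feature constraints). A minimal sketch, with load_breast_cancer as a stand-in dataset and made-up weights:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Hypothetical prior importances, one weight per feature (here: emphasise the first three)
weights = np.ones(X.shape[1])
weights[:3] = 3.0

# After standardising, scaling a column up makes it contribute more to KNN's distance
model = make_pipeline(
    StandardScaler(),
    FunctionTransformer(lambda Z: Z * weights),
    KNeighborsClassifier(n_neighbors=5),
)
model.fit(X, y)
print(model.score(X, y))   # training score, just to show the pipeline runs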
This question is rather broad so I hope this can be of help.
I have 6 text features (say f1, f2, ..., f6) available in the data on which I have trained a model. But when the model is deployed and a new data point arrives for which I have to make a prediction, it has only 2 features (f1 and f2). So there is a feature mismatch problem. How can I tackle it?
I have a few thoughts, but they are not very efficient.
Use only two features for training (f1 and f2) and discard the other features (f3, ..., f6). But this leads to a loss of information, and my test set accuracy decreases.
Learn some relation between (f3, ..., f6) and (f1, f2), so that even though (f3, ..., f6) are missing from the new data point, their information can be reconstructed from f1 and f2 alone.
The best way is, of course, to train a new model using f1, f2 and any new data you may have.
Don't want to do that? If you don't have f3...f6, you shouldn't magically expect the model to work as intended.
Now, think about what those "f3...f6" actually are. Are they related to the information you do have? If they are, you may be able to approximate them. We can't tell you exactly what to do because we don't have any clue what they are. Interpolation? Regression? A rough approximation?
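If you do want to try the regression route, here is a minimal sketch of the mechanics with numeric stand-in features (your text features would first have to be turned into numeric representations; all data below is made up for illustration):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 6))                          # columns 0-1 play the role of f1, f2
X_train[:, 2] = X_train[:, 0] + 0.1 * rng.normal(size=500)   # pretend f3 happens to be related to f1
y_train = (X_train[:, 0] + X_train[:, 2] > 0).astype(int)

# Learn to predict f3..f6 from f1, f2 on the training data
imputer = LinearRegression().fit(X_train[:, :2], X_train[:, 2:])
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# At prediction time only f1 and f2 arrive; fill in estimates for f3..f6
x_new = rng.normal(size=(1, 2))
x_full = np.hstack([x_new, imputer.predict(x_new)])
print(clf.predict(x_full))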
My suggestion: you are missing most of the predictors for your model. Your old model is meaningless. Please just train a new one.
Perhaps you could fill in the values for f3 to f6 with an average value computed over all the training data that includes those features. That way the values for f3 through f6 won't stand out too much and won't lean your classifier one way or the other; the classifier will be more likely to rely on the features that are actually provided, f1 and f2.
When calculating this, make sure the averages are computed within each class first and then averaged across classes. That way, if your data set contains a large amount of one class, it won't skew the average.
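A minimal sketch of that filling strategy, assuming a plain numeric feature matrix in which f3..f6 are the last four columns (the data and the helper name are made up for illustration):

import numpy as np

def class_balanced_means(X_train, y_train, cols):
    """Mean of each column in `cols`, computed per class first and then averaged,
    so an imbalanced class distribution doesn't skew the fill values."""
    classes = np.unique(y_train)
    per_class = np.array([X_train[y_train == c][:, cols].mean(axis=0) for c in classes])
    return per_class.mean(axis=0)

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 6))          # 6 features
y_train = rng.integers(0, 2, size=200)       # binary labels

fill = class_balanced_means(X_train, y_train, cols=[2, 3, 4, 5])

# At prediction time, a point with only f1 and f2 gets the neutral fill values for f3..f6
x_new = np.concatenate([rng.normal(size=2), fill])
print(x_new)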
Of course this might be an oversimplification, and it would work best for binary classification. It depends on the data set and the classification task.
Hope this helps :)
I have a data set with two classes and was trying to get an optimal classifier using Weka. The best classifier I could obtain was about 79% accuracy. Then I tried adding attributes to my data by classifying it and saving the probability distribution generated by this classification in the data itself.
When I reran the training process on the modified data I got over 93% accuracy! I'm sure this is wrong, but I can't figure out exactly why.
These are the exact steps I went through:
Open the data in Weka.
Click on add Filter and select AddClassification from Supervised->attribute.
Select a classifier. I select J48 with default settings.
Set "Output Classification" to false and set Output Distribution to true.
Run the filter and restore the class to be your original nominal class. Note the additional attributes added to the end of the attribute list. They will have the names: distribution_yourFirstClassName and distribution_yourSecondClassName.
Go to the Classify tab and select a classifier: again I selected J48.
Run it. In this step I got much higher accuracy than before.
Is this a valid way of creating classifiers? Didn't I "cheat" by adding classification information to the original data? If it is valid, how would one proceed to create a classifier that can predict unlabeled data? How would it add the additional attributes (the distribution)?
I did try reproducing the same effect using a FilteredClassifier but it didn't work.
Thanks.
The process that you appear to have undertaken seems somewhat close to the Stacking ensemble method, where classifier outputs are used to generate an ensemble output (more on that here).
In your case, however, the original attributes and the output of a previously trained classifier are being used together to predict your class. It is likely that most of the second J48 model's rules will be based on the first (since the first model's output will correlate more strongly with the class than the other attributes do), with some fine-tuning to improve accuracy. In this sense, the concept of 'two heads are better than one' is used to improve the overall performance of the model.
That's not to say it is all good, though. If you need to apply your second J48 to unseen data, you will also need the first J48 that generated the extra attributes (unless you saved it previously). Additionally, you are adding more processing work by using more than one classifier instead of a single J48. These costs also need to be weighed against the problem you are tackling.
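For reference, the principled version of this idea looks roughly like the Python/sklearn sketch below (DecisionTreeClassifier is only a rough stand-in for J48; StackingClassifier generates the base model's predictions out-of-fold, so the second learner never sees predictions that were made on data the first learner was trained on):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# The base learner's out-of-fold predicted probabilities become inputs to the final learner
stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier(random_state=0))],
    final_estimator=LogisticRegression(max_iter=5000),
    cv=10,
    passthrough=True,   # also pass the original attributes to the final learner
)
print(cross_val_score(stack, X, y, cv=5).mean())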
Hope this helps!
Okay, here is how I did cascaded learning:
I have the dataset D, and I divided it into 10 equal-sized stratified folds (D1 to D10) without repetition.
I applied algorithm A1 to train a classifier C1 on D1 to D9 and then, just like you, applied C1 to D10 to get the additional distribution of positive and negative classes. I call this D10 with the additional two (or more, depending on what information from C1 you want included) attributes/features D10_new.
Next, I applied the same algorithm to train a classifier C2 on D1 to D8 plus D10 and then, just like you, applied C2 to D9 to get the additional class distribution. I call this D9 with the additional attributes/features D9_new.
In this way I create D1_new to D10_new.
Then I trained another classifier (perhaps with a different algorithm, A2) on D1_new to D10_new to predict the labels (a 10-fold CV is a good choice for evaluating it).
In this setup you remove the bias of the first-level classifier having already seen the data it is asked to predict on. Also, it is advisable for A1 and A2 to be different.
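In Python/sklearn the same out-of-fold construction can be written compactly with cross_val_predict; a minimal sketch (DecisionTreeClassifier and LogisticRegression are arbitrary stand-ins for A1 and A2):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Each row's class distribution comes from a model that never saw that row (C1..C10)
oof_proba = cross_val_predict(DecisionTreeClassifier(random_state=0), X, y,
                              cv=folds, method="predict_proba")

# D_new = original attributes plus the out-of-fold class distribution
X_new = np.hstack([X, oof_proba])

# Second-level learner (A2), evaluated with CV on the augmented data
print(cross_val_score(LogisticRegression(max_iter=5000), X_new, y, cv=10).mean())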