Interpreting XGB feature importance and SHAP values

Interpreting XGB feature importance and SHAP values - machine-learning

For a particular prediction problem, I observed that a certain variable ranks high in the XGBoost feature importance that gets generated (on the basis of Gain) while it ranks quite low in the SHAP output.
How to interpret this? As in, is the variable highly important or not that important for our prediction problem?

Impurity-based importances (such as sklearn and xgboost built-in routines) summarize the overall usage of a feature by the tree nodes. This naturally gives more weight to high cardinality features (more feature values yield more possible splits), while gain may be affected by tree structure (node order matters even though predictions may be same). There may be lots of splits with little effect on the prediction or the other way round (many splits diluting the average importance) - see https://towardsdatascience.com/interpretable-machine-learning-with-xgboost-9ec80d148d27 and https://www.actuaries.digital/2019/06/18/analytics-snippet-feature-importance-and-the-shap-approach-to-machine-learning-models/ for various mismatch examples.
In an oversimplified way:
impurity-base importance explains the feature usage for generalizing on the train set;
permutation importance explains the contribution of a feature to the model accuracy;
SHAP explains how much would changing a feature value affect the prediction (not necessarily correct).

Related

Shouldn't the variables ranking be the same for MLP and RF?

I have a question about variable importance ranking.
I built an MLP and an RF model using the same dataset with 34 variables and achieved the same accuracy on a similar test dataset. As you can see in the picture below the top variables for the SHAP summary plot and the RF VIM are quite different.
Interestingly, I removed the low-ranked variable from the MLP and the accuracy increased. However, the RF result didn’t change.
Does that mean the RF is not a good choice for modeling this dataset?
It’s still strange to me that the rankings are so different:
SHAP summary plot vs. RF VIM, I numbered the top and low-ranked variable

Shouldn't the variables ranking be the same for MLP and RF?
No. There may be tendency for different algos to rank certain features higher, but there is no reason for ranking to be the same.
Different algorithms:
May have different objective functions to achieve intended goal.
May use features differently to achieve min (max) of the objective function.
On top, what you cite as RF "feature importances" (mean decrease in Gini) is only one of the many ways to calculate "feature importance" for RF (including which metric you use, and how you calculate total decrease due to a feature). In contrast, SHAP is model agnostic when it comes to explaining feature contributions to outcome.
In sum:
Different models will have different opinions about what is important and not. What is important for one algo may be not so important for another and vice versa. It doesn't tell anything about applicability of a model to a specific dataset.
Use SHAP values (or any other feature importance metric that you and your clients understand) to explain a model (if necessary).
Choose "best" model based on your goals: performance or explainability.

Python/SKlearn: Using KFold Results in big ROC_AUC Variations

Based on data that our business department supplied to us, I used the sklearn decision tree algorithm to determine the ROC_AUC for a binary classification problem.
The data consists of 450 rows and there are 30 features in the data.
I used 10 times StratifiedKFold repetition/split of training and test data. As a result, I got the following ROC_AUC values:
0.624
0.594
0.522
0.623
0.585
0.656
0.629
0.719
0.589
0.589
0.592
As I am new in machine learning, I am unsure whether such a variation in the ROC_AUC values can be expected (with minimum values of 0.522 and maximum values of 0.719).
My questions are:
Is such a big variation to be expected?
Could it be reduced with more data (=rows)?
Will the ROC_AUC variance get smaller, if the ROC_AUC gets better ("closer to 1")?

Well, you do k-fold splits to actually evaluate how well your model generalizes.
Therefore, from your current results I would assume the following:
This is a difficult problem, the AUCs are usually low.
0.71 is an outlier, you were just lucky there (probably).
Important questions that will help us help you:
What is the proportion of the binary classes? Are they balanced?
What are the features? Are they all continuous? If categorical, are they ordinal or nominal?
Why Decision Tree? Have you tried other methods? Logistic Regression for instance is a good start before you move on to more advanced ML methods.
You should run more iterations, instead of k fold use the ShuffleSplit function and run at least 100 iterations, compute the Average AUC with 95% Confidence Intervals. That will give you a better idea of how well the models perform.
Hope this helps!

Is such a big variation to be expected?
This is a textbook case of high variance.
Depending on the difficulty of your problem, 405 training samples may not be enough for it to generalize properly, and the random forest may be too powerful.
Try adding some regularization, by limiting the number of splits that the trees are allowed to make. This should reduce the variance in your model, though you might expect a potentially lower average performance.
Could it be reduced with more data (=rows)?
Yes, adding data is the other popular way of lowering the variance of your model. If you're familiar with deep learning, you'll know that deep models usually need LOTS of samples to learn properly. That's because they are very powerful models with an intrinsically high variance, and therefore a lot of data is needed for them to generalize.
Will the ROC_AUC variance get smaller, if the ROC_AUC gets better ("closer to 1")?
Variance will decrease with regularization and adding data, it has no relation to the actual performance "number" that you get.
Cheers

What are the metrics to evaluate a machine learning algorithm

I would like to know what are the various techniques and metrics used to evaluate how accurate/good an algorithm is and how to use a given metric to derive a conclusion about a ML model.
one way to do this is to use precision and recall, as defined here in wikipedia.
Another way is to use the accuracy metric as explained here. So, what I would like to know is whether there are other metrics for evaluating an ML model?

I've compiled, a while ago, a list of metrics used to evaluate classification and regression algorithms, under the form of a cheatsheet. Some metrics for classification: precision, recall, sensitivity, specificity, F-measure, Matthews correlation, etc. They are all based on the confusion matrix. Others exist for regression (continuous output variable).
The technique is mostly to run an algorithm on some data to get a model, and then apply that model on new, previously unseen data, and evaluate the metric on that data set, and repeat.
Some techniques (actually resampling techniques from statistics):
Jacknife
Crossvalidation
K-fold validation
bootstrap.

Talking about ML in general is a quite vast field, but I'll try to answer any way. The Wikipedia definition of ML is the following
Machine learning, a branch of artificial intelligence, concerns the construction and study of systems that can learn from data.
In this context learning can be defined parameterization of an algorithm. The parameters of the algorithm are derived using input data with a known output. When the algorithm has "learned" the association between input and output, it can be tested with further input data for which the output is well known.
Let's suppose your problem is to obtain words from speech. Here the input is some kind of audio file containing one word (not necessarily, but I supposed this case to keep it quite simple). You'd record X words N times and then use (for example) N/2 of the repetitions to parameterize your algorithm, disregarding - at the moment - how your algorithm would look like.
Now on the one hand - depending on the algorithm - if you feed your algorithm with one of the remaining repetitions, it may give you some certainty estimate which may be used to characterize the recognition of just one of the repetitions. On the other hand you may use all of the remaining repetitions to test the learned algorithm. For each of the repetitions you pass it to the algorithm and compare the expected output with the actual output. After all you'll have an accuracy value for the learned algorithm calculated as the quotient of correct and total classifications.
Anyway, the actual accuracy will depend on the quality of your learning and test data.
A good start to read on would be Pattern Recognition and Machine Learning by Christopher M Bishop

There are various metrics for evaluating the performance of ML model and there is no rule that there are 20 or 30 metrics only. You can create your own metrics depending on your problem. There are various cases wherein when you are solving real - world problem where you would need to create your own custom metrics.
Coming to the existing ones, it is already listed in the first answer, I would just highlight each metrics merits and demerits to better have an understanding.
Accuracy is the simplest of the metric and it is commonly used. It is the number of points to class 1/ total number of points in your dataset. This is for 2 class problem where some points belong to class 1 and some to belong to class 2. It is not preferred when the dataset is imbalanced because it is biased to balanced one and it is not that much interpretable.
Log loss is a metric that helps to achieve probability scores that gives you better understanding why a specific point is belonging to class 1. The best part of this metric is that it is inbuild in logistic regression which is famous ML technique.
Confusion metric is best used for 2-class classification problem which gives four numbers and the diagonal numbers helps to get an idea of how good is your model.Through this metric there are others such as precision, recall and f1-score which are interpretable.

unigrams & bigrams (tf-idf) less accurate than just unigrams (ff-idf)?

This is a question about linear regression with ngrams, using Tf-IDF (term frequency - inverse document frequency). To do this, I am using numpy sparse matrices and sklearn for linear regression.
I have 53 cases and over 6000 features when using unigrams. The predictions are based on cross validation using LeaveOneOut.
When I create a tf-idf sparse matrix of only unigram scores, I get slightly better predictions than when I create a tf-idf sparse matrix of unigram+bigram scores. The more columns I add to the matrix (columns for trigram, quadgram, quintgrams, etc.), the less accurate the regression prediction.
Is this common? How is this possible? I would have thought that the more features, the better.

It's not common for bigrams to perform worse than unigrams, but there are situations where it may happen. In particular, adding extra features may lead to overfitting. Tf-idf is unlikely to alleviate this, as longer n-grams will be rarer, leading to higher idf values.
I'm not sure what kind of variable you're trying to predict, and I've never done regression on text, but here's some comparable results from literature to get you thinking:
In random text generation with small (but non-trivial) training sets, 7-grams tend to reconstruct the input text almost verbatim, i.e. cause complete overfit, while trigrams are more likely to generate "new" but still somewhat grammatical/recognizable text (see Jurafsky & Martin; can't remember which chapter and I don't have my copy handy).
In classification-style NLP tasks performed with kernel machines, quadratic kernels tend to fare better than cubic ones because the latter often overfit on the training set. Note that unigram+bigram features can be thought of as a subset of the quadratic kernel's feature space, and {1,2,3}-grams of that of the cubic kernel.
Exactly what is happening depends on your training set; it might simply be too small.

As larsmans said, adding more variables / features makes it easier for the model to overfit hence lose in test accuracy. In the master branch of scikit-learn there is now a min_df parameter to cut-off any feature with less than that number of occurrences. Hence min_df==2 to min_df==5 might help you get rid of spurious bi-grams.
Alternatively you can use L1 or L1 + L2 penalized linear regression (or classification) using either the following classes:
sklearn.linear_model.Lasso (regression)
sklearn.linear_model.ElasticNet (regression)
sklearn.linear_model.SGDRegressor (regression) with penalty == 'elastic_net' or 'l1'
sklearn.linear_model.SGDClassifier (classification) with penalty == 'elastic_net' or 'l1'
This will make it possible to ignore spurious features and lead to a sparse model with many zero weights for noisy features. Grid Searching the regularization parameters will be very important though.
You can also try univariate feature selection such as done the text classification example of scikit-learn (check the SelectKBest and chi2 utilities.

Ways to improve the accuracy of a Naive Bayes Classifier?

I am using a Naive Bayes Classifier to categorize several thousand documents into 30 different categories. I have implemented a Naive Bayes Classifier, and with some feature selection (mostly filtering useless words), I've gotten about a 30% test accuracy, with 45% training accuracy. This is significantly better than random, but I want it to be better.
I've tried implementing AdaBoost with NB, but it does not appear to give appreciably better results (the literature seems split on this, some papers say AdaBoost with NB doesn't give better results, others do). Do you know of any other extensions to NB that may possibly give better accuracy?

In my experience, properly trained Naive Bayes classifiers are usually astonishingly accurate (and very fast to train--noticeably faster than any classifier-builder i have everused).
so when you want to improve classifier prediction, you can look in several places:
tune your classifier (adjusting the classifier's tunable paramaters);
apply some sort of classifier combination technique (eg,
ensembling, boosting, bagging); or you can
look at the data fed to the classifier--either add more data,
improve your basic parsing, or refine the features you select from
the data.
w/r/t naive Bayesian classifiers, parameter tuning is limited; i recommend to focus on your data--ie, the quality of your pre-processing and the feature selection.
I. Data Parsing (pre-processing)
i assume your raw data is something like a string of raw text for each data point, which by a series of processing steps you transform each string into a structured vector (1D array) for each data point such that each offset corresponds to one feature (usually a word) and the value in that offset corresponds to frequency.
stemming: either manually or by using a stemming library? the popular open-source ones are Porter, Lancaster, and Snowball. So for
instance, if you have the terms programmer, program, progamming,
programmed in a given data point, a stemmer will reduce them to a
single stem (probably program) so your term vector for that data
point will have a value of 4 for the feature program, which is
probably what you want.
synonym finding: same idea as stemming--fold related words into a single word; so a synonym finder can identify developer, programmer,
coder, and software engineer and roll them into a single term
neutral words: words with similar frequencies across classes make poor features
II. Feature Selection
consider a prototypical use case for NBCs: filtering spam; you can quickly see how it fails and just as quickly you can see how to improve it. For instance, above-average spam filters have nuanced features like: frequency of words in all caps, frequency of words in title, and the occurrence of exclamation point in the title. In addition, the best features are often not single words but e.g., pairs of words, or larger word groups.
III. Specific Classifier Optimizations
Instead of 30 classes use a 'one-against-many' scheme--in other words, you begin with a two-class classifier (Class A and 'all else') then the results in the 'all else' class are returned to the algorithm for classification into Class B and 'all else', etc.
The Fisher Method (probably the most common way to optimize a Naive Bayes classifier.) To me,
i think of Fisher as normalizing (more correctly, standardizing) the input probabilities An NBC uses the feature probabilities to construct a 'whole-document' probability. The Fisher Method calculates the probability of a category for each feature of the document then combines these feature probabilities and compares that combined probability with the probability of a random set of features.

I would suggest using a SGDClassifier as in this and tune it in terms of regularization strength.
Also try to tune the formula in TFIDF you're using by tuning the parameters of TFIFVectorizer.
I usually see that for text classification problems SVM or Logistic Regressioin when trained one-versus-all outperforms NB. As you can see in this nice article by Stanford people for longer documents SVM outperforms NB. The code for the paper which uses a combination of SVM and NB (NBSVM) is here.
Second, tune your TFIDF formula (e.g. sublinear tf, smooth_idf).
Normalize your samples with l2 or l1 normalization (default in Tfidfvectorization) because it compensates for different document lengths.
Multilayer Perceptron, usually gets better results than NB or SVM because of the non-linearity introduced which is inherent to many text classification problems. I have implemented a highly parallel one using Theano/Lasagne which is easy to use and downloadable here.
Try to tune your l1/l2/elasticnet regularization. It makes a huge difference in SGDClassifier/SVM/Logistic Regression.
Try to use n-grams which is configurable in tfidfvectorizer.
If your documents have structure (e.g. have titles) consider using different features for different parts. For example add title_word1 to your document if word1 happens in the title of the document.
Consider using the length of the document as a feature (e.g. number of words or characters).
Consider using meta information about the document (e.g. time of creation, author name, url of the document, etc.).
Recently Facebook published their FastText classification code which performs very well across many tasks, be sure to try it.

Using Laplacian Correction along with AdaBoost.
In AdaBoost, first a weight is assigned to each data tuple in the training dataset. The intial weights are set using the init_weights method, which initializes each weight to be 1/d, where d is the size of the training data set.
Then, a generate_classifiers method is called, which runs k times, creating k instances of the Naïve Bayes classifier. These classifiers are then weighted, and the test data is run on each classifier. The sum of the weighted "votes" of the classifiers constitutes the final classification.

Improves Naive Bayes classifier for general cases
Take the logarithm of your probabilities as input features
We change the probability space to log probability space since we calculate the probability by multiplying probabilities and the result will be very small. when we change to log probability features, we can tackle the under-runs problem.
Remove correlated features.
Naive Byes works based on the assumption of independence when we have a correlation between features which means one feature depends on others then our assumption will fail.
More about correlation can be found here
Work with enough data not the huge data
naive Bayes require less data than logistic regression since it only needs data to understand the probabilistic relationship of each attribute in isolation with the output variable, not the interactions.
Check zero frequency error
If the test data set has zero frequency issue, apply smoothing techniques “Laplace Correction” to predict the class of test data set.
More than this is well described in the following posts
Please refer below posts.
machinelearningmastery site post
Analyticvidhya site post

keeping the n size small also make NB to give high accuracy result. and at the core, as the n size increase its accuracy degrade,

Select features which have less correlation between them. And try using different combination of features at a time.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart