Difference between feature_contribs and feature importance - machine-learning

The XGBoost API exposes two kinds of information about features.
Feature importance, which can be accessed by
xgb_bst.get_score(importance_type='gain')
The documentation explains this as:
Get feature importance of each feature. Importance type can be defined as:
‘gain’: the average gain across all splits the feature is used in.
Feature contributions, which can be accessed by
feature_contribs = xgb_bst.predict(dtest, pred_contribs=True)
Per the documentation, the output will be a matrix of size (nsample, nfeats + 1), with each record indicating the feature contributions (SHAP values) for that prediction. The sum of all feature contributions is equal to the raw untransformed margin value of the prediction.
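For concreteness, here is a minimal sketch of how I obtain both; the toy data and training parameters below are placeholders, not my actual setup:

import numpy as np
import xgboost as xgb

# placeholder data standing in for the real training/test sets
X_train, y_train = np.random.rand(200, 5), np.random.randint(0, 2, 200)
X_test = np.random.rand(50, 5)
feature_names = ["f%d" % i for i in range(5)]
dtrain = xgb.DMatrix(X_train, label=y_train, feature_names=feature_names)
dtest = xgb.DMatrix(X_test, feature_names=feature_names)

xgb_bst = xgb.train({"objective": "binary:logistic"}, dtrain, num_boost_round=50)

# (1) global importance: average gain across all splits using each feature
gain_importance = xgb_bst.get_score(importance_type='gain')

# (2) per-prediction contributions (SHAP values): shape (nsample, nfeats + 1);
#     the last column is the bias term, and each row sums to the raw margin
feature_contribs = xgb_bst.predict(dtest, pred_contribs=True)

print(gain_importance)          # e.g. {'f0': ..., 'f1': ..., ...}
print(feature_contribs.shape)   # (50, 6)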
My question here is: what is the correlation between the two? From what I see, a feature with maximum importance (high gain) is not necessarily the top (positive) contributor for most of the individual data points, and vice versa.
Intuitively, how do I interpret feature importance vs. feature contributions? How do they differ?

Related

Interpreting XGB feature importance and SHAP values

For a particular prediction problem, I observed that a certain variable ranks high in the XGBoost feature importance that gets generated (on the basis of gain), while it ranks quite low in the SHAP output.
How should I interpret this? Is the variable highly important or not that important for our prediction problem?
Impurity-based importances (such as the sklearn and xgboost built-in routines) summarize the overall usage of a feature by the tree nodes. This naturally gives more weight to high-cardinality features (more feature values yield more possible splits), while gain may be affected by the tree structure (node order matters even though the predictions may be the same). There may be lots of splits with little effect on the prediction, or the other way around (many splits diluting the average importance). See https://towardsdatascience.com/interpretable-machine-learning-with-xgboost-9ec80d148d27 and https://www.actuaries.digital/2019/06/18/analytics-snippet-feature-importance-and-the-shap-approach-to-machine-learning-models/ for various mismatch examples.
In an oversimplified way (a code sketch follows this list):
impurity-based importance explains the feature usage for generalizing on the train set;
permutation importance explains the contribution of a feature to the model accuracy;
SHAP explains how much changing a feature value would affect the prediction (which is not necessarily correct).
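As a rough illustration of the first two notions (and of SHAP), here is a minimal sketch on made-up data, using sklearn's RandomForestClassifier, permutation_importance, and the optional shap package (none of which come from the question):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
import shap  # optional, only needed for the SHAP part

# toy data standing in for a real train/validation set
X, y = np.random.rand(300, 4), np.random.randint(0, 2, 300)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# impurity-based importance: normalized total impurity decrease per feature
print(model.feature_importances_)

# permutation importance: drop in score when one feature's values are shuffled
perm = permutation_importance(model, X, y, n_repeats=10, random_state=0)
print(perm.importances_mean)

# SHAP values: per-sample additive contributions to each prediction
shap_values = shap.TreeExplainer(model).shap_values(X)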

Shouldn't the variable rankings be the same for MLP and RF?

I have a question about variable importance ranking.
I built an MLP and an RF model using the same dataset with 34 variables and achieved the same accuracy on a similar test dataset. As you can see in the picture below, the top variables in the SHAP summary plot and in the RF variable importance (VIM) are quite different.
Interestingly, I removed the low-ranked variable from the MLP and the accuracy increased. However, the RF result didn’t change.
Does that mean the RF is not a good choice for modeling this dataset?
It’s still strange to me that the rankings are so different:
[Figure: SHAP summary plot vs. RF VIM, with the top- and low-ranked variables numbered]
Shouldn't the variable rankings be the same for MLP and RF?
No. There may be a tendency for different algorithms to rank certain features higher, but there is no reason for the rankings to be the same.
Different algorithms:
May have different objective functions to achieve the intended goal.
May use features differently to achieve the minimum (maximum) of the objective function.
On top of that, what you cite as RF "feature importances" (mean decrease in Gini) is only one of many ways to calculate "feature importance" for an RF (they differ in which metric you use and in how you calculate the total decrease due to a feature). In contrast, SHAP is model-agnostic when it comes to explaining feature contributions to the outcome.
In sum:
Different models will have different opinions about what is and isn't important. What is important for one algorithm may not be so important for another, and vice versa. This doesn't tell you anything about the applicability of a model to a specific dataset.
Use SHAP values (or any other feature importance metric that you and your clients understand) to explain a model, if necessary (a small sketch follows below).
Choose the "best" model based on your goals: performance or explainability.
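To make the model-agnostic point concrete, here is a hedged sketch that applies the same SHAP recipe (KernelExplainer from the shap package, on invented data with 34 features) to both an RF and an MLP; the resulting rankings will generally differ:

import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

# invented data with 34 features, loosely mimicking the described setup
X, y = np.random.rand(200, 34), np.random.randint(0, 2, 200)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
mlp = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0).fit(X, y)

background = X[:50]  # a small background sample keeps KernelExplainer tractable
for name, model in (("RF", rf), ("MLP", mlp)):
    explainer = shap.KernelExplainer(lambda d, m=model: m.predict_proba(d)[:, 1], background)
    sv = explainer.shap_values(X[:20])
    # rank features by mean absolute SHAP value; the orders usually differ by model
    print(name, np.argsort(np.abs(sv).mean(axis=0))[::-1][:5])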

Information gain on non discrete dataset

Jiawei Han's book on Data Mining, 2nd edition (Attribute Selection Measures, pp. 297 through 300), explains how to calculate the information gain achieved by each attribute (age, income, credit_rating) for a class (buys_computer yes or no).
In this example, each of the attribute values is discrete; e.g., age can be youth/middle-aged/senior, income can be high/low/medium, credit_rating fair/excellent, etc.
I would like to know how the same information gain can be applied to attributes that take non-discrete (continuous) values. For example, the income attribute can take any currency amount, like 100.68, 120.90, etc.
If there are 1000 students, there could be 1000 different amount values.
How can we apply the same information gain to non-discrete data? Any tutorial, sample example, or video URL would be of great help.
When your target variable is discrete (categorical), you just calculate entropy over the empirical distribution of categories in the left/right split you're considering, and compare their weighted average to the entropy without the split.
For a continuous target variable, like income, this is defined analogously as differential entropy. For your purpose you would assume that the values in your set have a normal distribution, and calculate the differential entropy accordingly. From Wikipedia, the differential entropy of a normal distribution is h(X) = 1/2 * ln(2 * pi * e * sigma^2).
That is, it's just a function of the variance of the values. Note that this is in nats, not bits of entropy. To compare it to the Shannon entropy above, you'd have to convert, which is just a multiplication (by log2(e)).
The most common way to do splitting for a continuous variable (1d) is picking a threshold (from a discretized set of thresholds, or you can choose a prior). So you can compute the information gain for a continuous value by first sorting it (you have to have an order) and then scanning it for the best threshold. http://dilekylmzr.files.wordpress.com/2011/09/data-mining_lecture9.ppt
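A minimal sketch of that sort-and-scan procedure in plain NumPy (the income values and labels are made up for illustration):

import numpy as np

def entropy(labels):
    # Shannon entropy (in bits) of a discrete label array
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def best_threshold(values, labels):
    # scan candidate thresholds of a continuous attribute and return the
    # split with the highest information gain w.r.t. a discrete target
    order = np.argsort(values)
    values, labels = values[order], labels[order]
    parent = entropy(labels)
    best_gain, best_t = 0.0, None
    for i in range(1, len(values)):
        if values[i] == values[i - 1]:
            continue
        t = (values[i] + values[i - 1]) / 2   # midpoint between distinct values
        left, right = labels[:i], labels[i:]
        w = len(left) / len(labels)
        gain = parent - (w * entropy(left) + (1 - w) * entropy(right))
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain

income = np.array([100.68, 120.90, 85.5, 300.0, 250.75, 90.1])
buys = np.array(["no", "no", "no", "yes", "yes", "no"])
print(best_threshold(income, buys))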
Example of using this technique in random forests
Often this technique is used in random forests (or decision trees), so I will post a few references to resources on that.
More information on random forests and this technique can be found here: http://www.cs.ubc.ca/~nando/540-2013/lectures.html . See the lectures on YouTube, because the slides are not very informative. The lecture describes how to match body parts using random forests for Kinect, so it is quite interesting.
You can also look it up here: https://research.microsoft.com/pubs/145347/bodypartrecognition.pdf - the original paper discussed in the lecture.
Note that for information gain you can also use Gaussian entropy. It basically means fitting a Gaussian to the data before and after the split.
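A short sketch of that Gaussian-entropy variant, assuming the standard differential entropy of a fitted normal, 0.5 * ln(2 * pi * e * variance):

import numpy as np

def gaussian_entropy(x):
    # differential entropy (in nats) of a normal fitted to x
    return 0.5 * np.log(2 * np.pi * np.e * np.var(x))

def gaussian_gain(parent, left, right):
    # entropy reduction from splitting a continuous target into two groups
    w = len(left) / len(parent)
    return gaussian_entropy(parent) - (w * gaussian_entropy(left)
                                       + (1 - w) * gaussian_entropy(right))

incomes = np.array([100.68, 120.90, 85.5, 300.0, 250.75, 90.1])
print(gaussian_gain(incomes, incomes[incomes < 200], incomes[incomes >= 200]))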

Adding attributes to a training set

If I have a feature calories and another feature number of people, why does adding the feature calories per person, or adding the feature calories/10, help to improve testing? I don't see how performing simple arithmetic on two features gains you more information.
Thanks
Consider you're using a classifier/regression mechanism which is linear (or log-linear) in the feature space. If your instance x has features x_i, then being linear means the score is something like:
y = \sum_i x_i * w_i
Now consider you think there are some important interactions between the features---maybe you think that x_i is only important if x_j takes a similar value, or their sum is more important than the individual values, or whatever. One way of incorporating this information is to have the algorithm explicitly model cross products, e.g.:
y = [ \sum_i x_i * w_i ] + [ \sum_{i,j} x_i * x_j * w_{ij} ]
However, linear algorithms are ubiquitous and easy to use, so a way of getting interaction-like terms into your standard linear classifier/regression mechanism is to augment the feature space so for every pair x_i, x_j you create a feature of the form [x_i * x_j] or [x_i / x_j] or whatever. Now you can model interactions between features without needing to use a non-linear algorithm.
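A minimal sketch of that feature-space augmentation using sklearn's PolynomialFeatures; the data here is invented so that the product of the two raw features drives the label:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

# invented data where the interaction x0 * x1 determines the class
X = np.random.rand(500, 2)
y = (X[:, 0] * X[:, 1] > 0.25).astype(int)

# interaction_only=True adds the cross-product terms x_i * x_j
X_aug = PolynomialFeatures(degree=2, interaction_only=True,
                           include_bias=False).fit_transform(X)

# the same linear classifier, with and without the interaction feature
print(LogisticRegression().fit(X, y).score(X, y))          # raw features only
print(LogisticRegression().fit(X_aug, y).score(X_aug, y))  # with x0 * x1 added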
Performing that type of arithmetic allows you to use that information in models that don't explicitly consider nonlinear combinations of variables. Some classifiers attempt to find the features that best explain/predict the training data, and often the best feature may be a nonlinear combination of the raw inputs.
Using your data, suppose you wanted to predict whether a group of people will - on average - gain weight. And suppose the "correct" answer is that the group will gain weight if people in the group consume over an average of 3,000 calories per day. If your inputs are group_size and group_calories, you will need to use both of those variables to make an accurate prediction. But if you also provide group_avg_calories (which is just group_calories / group_size), you could just use that single feature to make the prediction. Even if the first two features added some additional information, if you were to feed those 3 features to a decision tree classifier, it would almost certainly pick group_avg_calories as the root node and you would end up with a much simpler tree structure. There is also a downside to adding lots of arbitrary nonlinear combinations of features to your model, which is that it can add significantly to the classifier's training time.
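A quick sketch of that scenario with made-up numbers (the variable names and the 3,000-calorie threshold below are illustrative):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
group_size = rng.integers(2, 50, size=500)
group_calories = group_size * rng.uniform(2000, 4000, size=500)
gains_weight = (group_calories / group_size > 3000).astype(int)

# raw features only: the tree must approximate the ratio with many splits;
# adding the derived ratio lets a single split solve the problem
raw = np.column_stack([group_size, group_calories])
derived = np.column_stack([group_size, group_calories, group_calories / group_size])

for X in (raw, derived):
    tree = DecisionTreeClassifier(random_state=0).fit(X, gains_weight)
    print(tree.get_depth(), tree.score(X, gains_weight))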
With regard to calories/10, it's not clear why you would do that specifically, but normalizing the input features can improve convergence rates for some classifiers (e.g., ANNs) and can also provide better performance for clustering algorithms because the input features will all be at the same scale (i.e., distances along different feature axes are comparable).

What is the relation between the number of support vectors, the training data, and the classifier's performance? [closed]

I am using LibSVM to classify some documents. The documents seem to be a bit difficult to classify, as the final results show. However, I have noticed something while training my models, and that is: if my training set is, for example, 1000 samples, around 800 of them are selected as support vectors.
I have looked everywhere to find out whether this is a good thing or a bad one. I mean, is there a relation between the number of support vectors and the classifier's performance?
I have read this previous post, but I am performing parameter selection, and I am also sure that the attributes in the feature vectors are all ordered.
I just need to know the relation.
Thanks.
P.S.: I use a linear kernel.
Support Vector Machines are an optimization problem. They are attempting to find a hyperplane that divides the two classes with the largest margin. The support vectors are the points which fall within this margin. It's easiest to understand if you build it up from simple to more complex.
Hard Margin Linear SVM
In a training set where the data is linearly separable, and you are using a hard margin (no slack allowed), the support vectors are the points which lie along the supporting hyperplanes (the hyperplanes parallel to the dividing hyperplane at the edges of the margin).
All of the support vectors lie exactly on the margin. Regardless of the number of dimensions or the size of the data set, the number of support vectors could be as few as 2.
Soft-Margin Linear SVM
But what if our dataset isn't linearly separable? We introduce the soft-margin SVM. We no longer require that our data points lie outside the margin; we allow some of them to stray over the line into the margin. We use the slack parameter C to control this (nu in nu-SVM). This gives us a wider margin and greater error on the training dataset, but improves generalization and/or allows us to find a linear separation of data that is not linearly separable.
Now, the number of support vectors depends on how much slack we allow and on the distribution of the data. If we allow a large amount of slack, we will have a large number of support vectors. If we allow very little slack, we will have very few support vectors. The accuracy depends on finding the right level of slack for the data being analyzed. For some data it will not be possible to get a high level of accuracy; we must simply find the best fit we can.
Non-Linear SVM
This brings us to the non-linear SVM. We are still trying to linearly divide the data, but we are now trying to do it in a higher-dimensional space. This is done via a kernel function, which of course has its own set of parameters. When we translate this back to the original feature space, the result is non-linear.
Now, the number of support vectors still depends on how much slack we allow, but it also depends on the complexity of our model. Each twist and turn in the final model in our input space requires one or more support vectors to define it. Ultimately, the output of an SVM is the support vectors and an alpha, which in essence defines how much influence that specific support vector has on the final decision.
Here, accuracy depends on the trade-off between a high-complexity model, which may over-fit the data, and a large margin, which will incorrectly classify some of the training data in the interest of better generalization. The number of support vectors can range from very few to every single data point if you completely over-fit your data. This trade-off is controlled via C and through the choice of kernel and kernel parameters.
I assume when you said performance you were referring to accuracy, but I thought I would also speak to performance in terms of computational complexity. In order to test a data point using an SVM model, you need to compute the dot product of each support vector with the test point. Therefore the computational complexity of the model is linear in the number of support vectors. Fewer support vectors means faster classification of test points.
A good resource:
A Tutorial on Support Vector Machines for Pattern Recognition
800 out of 1000 basically tells you that the SVM needs to use almost every single training sample to encode the training set, which in turn tells you that there isn't much regularity in your data.
Sounds like you have major issues with not enough training data. Also, maybe think about some specific features that separate this data better.
Both the number of samples and the number of attributes may influence the number of support vectors, making the model more complex. I believe you use words or even n-grams as attributes, so there are quite a lot of them, and natural language models are very complex themselves. So 800 support vectors out of 1000 samples seems to be OK. (Also pay attention to karenu's comments about the C/nu parameters, which also have a large effect on the number of SVs.)
To get some intuition about this, recall the main idea of SVM. SVM works in a multidimensional feature space and tries to find a hyperplane that separates all given samples. If you have a lot of samples and only 2 features (2 dimensions), the data and the hyperplane may look like this:
Here there are only 3 support vectors; all the others are behind them and thus don't play any role. Note that these support vectors are defined by only 2 coordinates.
Now imagine that you have a 3-dimensional space, and thus the support vectors are defined by 3 coordinates.
This means that there's one more parameter (coordinate) to be adjusted, and this adjustment may need more samples to find the optimal hyperplane. In other words, in the worst case SVM finds only 1 hyperplane coordinate per sample.
When the data is well-structured (i.e. holds patterns quite well), only a few support vectors may be needed - all the others will stay behind them. But text is very, very badly structured data. SVM does its best, trying to fit the samples as well as possible, and thus takes even more samples as support vectors than it drops. With an increasing number of samples this "anomaly" is reduced (more insignificant samples appear), but the absolute number of support vectors stays very high.
SVM classification is linear in the number of support vectors (SVs). The number of SVs is in the worst case equal to the number of training samples, so 800/1000 is not yet the worst case, but it's still pretty bad.
Then again, 1000 training documents is a small training set. You should check what happens when you scale up to 10000s or more documents. If things don't improve, consider using linear SVMs, trained with LibLinear, for document classification; those scale up much better (model size and classification time are linear in the number of features and independent of the number of training samples).
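For reference, a minimal sketch of that suggestion using scikit-learn's LinearSVC (which wraps LibLinear) on a tiny invented corpus:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# toy corpus; a real document collection would be much larger
docs = ["cheap meds buy now", "meeting agenda attached",
        "win money now", "project status report"]
labels = [1, 0, 1, 0]  # 1 = spam-like, 0 = not

X = TfidfVectorizer().fit_transform(docs)
clf = LinearSVC(C=1.0).fit(X, labels)  # LibLinear backend; no support vectors stored
print(clf.score(X, labels))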
There is some confusion between sources. In the textbook ISLR 6th Ed, for instance, C is described as a "boundary violation budget", from which it follows that a higher C will allow more boundary violations and more support vectors.
But in the SVM implementations in R and Python, the parameter C is implemented as a "violation penalty", which is the opposite, so you will observe that for higher values of C there are fewer support vectors.
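A quick way to see this in the Python implementation (scikit-learn's SVC, which wraps LibSVM; the overlapping Gaussian blobs are made up for illustration):

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(1.5, 1, (200, 2))])
y = np.array([0] * 200 + [1] * 200)

# here C is a violation *penalty*: larger C punishes margin violations more,
# so fewer points end up inside the margin as support vectors
for C in (0.01, 0.1, 1, 10, 100):
    svc = SVC(kernel="linear", C=C).fit(X, y)
    print(C, svc.n_support_.sum())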
