Lets say I have 100 independent features - 90 are binary (e.g. 0/1) and 10 are continuous variables (e.g. age, height, weight, etc). I use the 100 features to predict a classifier problem with an adequate amount of samples.
When I set a XGBClassifier function and fit it, then the 10 most important features from the standpoint of gain are always the 10 continuous variable. For now I am not interested in cover or frequency. The 10 continuous variables take up like .8 to .9 of space in gain list ( sum(gain) = 1).
I tried tuning the gamma, reg_alpha , reg_lambda , max_depth, colsample. Still top 10 features by gain are always the 10 continuous features.
Any suggestions?
small update -- someone asked why I think this is happening. I believe it's because a continuous variable can be split on multiple times per decision tree. A binary variable can only be split on once. Hence, the higher prevalence of continuous variables in trees and thus a higher gain score
Yes, it's well-known that a tree(/forest) algorithm (xgboost/rpart/etc.) will generally 'prefer' continuous variables over binary categorical ones in its variable selection, since it can choose the continuous split-point wherever it wants to maximize the information gain (and can freely choose different split-points for that same variable at other nodes, or in other trees). If that's the optimal tree (for those particular variables), well then it's the optimal tree. See Why do Decision Trees/rpart prefer to choose continuous over categorical variables? on sister site CrossValidated.
When you say "any suggestions", depends what exactly do you want, it could be one of the following:
a) To find which of the other 90 binary categorical features give the most information gain
b) To train a suboptimal tree just to find out which features those are
c) To engineer some "compound" features by combining the binary features into n-bit categorical features which have more information gain (while being sure to remove the individual binary features from the input)
d) You could look into association rules : What is the practical difference between association rules and decision trees in data mining?
If you want to explore a)...c), suggest something vaguely like this:
exclude various subsets of the 10 continuous variables, then see which binary features show up as having the most gain. Let's say that gives you N candidate features. N will be << 90, let's assume N < 20 to make the following more computationally efficient.
then compute the pairwise measure of association or correlation (Spearman or Kendall) between each of the N features. Look at a corrplot. Pick the clusters of variables which are most associated with each other. Create compound n-bit variables which combine those individual binary features. Then retrain the tree, including the compound variables, and excluding the individual binary variables (to avoid changing the total variance in the input).
iterate for excluding various subsets of the 10 continuous variables. See which patterns emerge in your compound variables. I'm sure there's an algorithm for doing this (compound feature-engineering of n-bit categoricals) more formally and methodically, I just don't know it.
Anyway, for hacking a tree-based method for better performance, I imagine the most naive way is "at every step, pick the two most highly-correlated/associated categorical features and combine them". Then retrain the tree (include new feature, exclude its constituent features) and use the revised gain numbers.
perhaps a more robust way might be:
Pick some threshold T for correlation/association, say start at a high level T = 0.9 or 0.95
At each step, merge any features whose absolute correlation/association to each other >= T
If there were no merges at this step, reduce T by some value (like T -= 0.05) or ratio (e.g. T *= 0.9 . If still no merges, keep reducing T until there are merges, or until you hit some termination value (e.g. T = 0.03)
Retrain the tree including the compound variables, excluding their constituent subvariables.
Now go back and retrain what should be an improved tree with all 10 continuous variables, and your compound categorical features.
Or you could early-terminate the compound feature selection to see what the full retrained tree looks like.
This issue arose in the 2014 Kaggle Allstate Purchase Prediction Challenge, where the policy coverage options A,B,C,D,E,F,G were each categoricals with between 2-4 values, and very highly correlated with each other. (The current option of C, "C_previous", is one of the input features). See that competitions's forums and published solutions for more. Be aware that policy = (A,B,C,D,E,F,G) is the output. But C_previous is an input variable.
Some general fast-and-dirty rules-of-thumb on feature selection from Kaggle are:
throw out any near-constant/ very-low-variance variables (because they have near-zero information content)
throw out any very-high-cardinality categorical variables (cardinality >~ training-set-size/2), (because they will also tend to have low information content, but cause lots of spurious overfitting and blow up training time). This can include customer IDs, row IDs, transaction IDs, sequence IDs, and other variables which shouldn't be trained on in the first place but accidentally ended up in the training set.
I can suggest few things for you to try.
Test your model without this data (only 90 features) and evaluate the decrease in your score. If it's insignificant you might want to remove those features.
Turn them into groups.
For example, age can be categorized into groups, 0 : 0-7, 1 : 8-16, 2 : 17-25 and so on.
Turn them into binary. Out of the box idea on how to chose the best value to split them into binary is: Build 1 tree with 1 node (max depth = 1) and use only 1 feature. (1 out of the continuous features). then, dump the model to a .txt file and see the value it chose to split on. using this value, you can transform all that feature column into binary
I'm dealing myself with very similar problems right now, So i'll be happy to hear your results and the paths you chose to try.
I learned a lot from the answer by #smci, so I would recommend to follow his suggestions.
In the case, when your binary categorical features are in fact OHE representations of several categorical features with several classes in each, you can follow two more approaches:
Convert OHE into label encoding. Yes, this has the caveat that one introduces an order into a categorical features, which might be meaningless, for example green=3 > red=2 > blue=1. But in practice is seems that trees handle label=encoded categorical variables (even with meaningless order) reasonably well.
Convert OHE into target-/mean-/likelihood encoding. This is tricky, because you need to apply regularisation to avoid data leakage.
Both of those ideas are meant to group together several binary features into a single one based on prior knowledge about feature meaning. If you do not have that luxury, you can also try to deduce such groups by doing scalar product of columns and finding those giving zero product.
Related
I found several examples of two types.
Single feature
Given a data with only two items classes. For example only blue and yellow balls. I.e. we have only one feature in this case is color. This is clear example to show "divide and conquer" rule applicable to entropy. But this is senselessly for any prediction or classification problems because if we have an object with only one feature and the value is known we don't need a tree to decide that "this ball is yellow".
Multiple features
Given a data with multiple features and a feature to predict (known for training data). We can calculate a predicate based on minimum average entropy for each feature. Closer to life, isn't it? It was clear to me until I haven't tried to implement the algorithm.
And now I have a collision in my mind.
If we calculate entropy relatively to a known features (one per node) we will have meaningful results at classification with a tree only if unknown feature is strictly dependent from every known feature. Otherwise a single unbound known feature could break all prediction driving a decision in a wrong way. But if we calculate entropy relatively to a values of the feature which we want to predict at classification we are returned to the first senseless example. In this way there is no difference which of a known feature to use for a node...
And a question about a tree building process.
Should I calculate entropy only for known features and just believe that all the known features are bound to an unknown? Or maybe I should calculate entropy for unknown feature (known for training data) TOO to determine which feature more affects result?
I had the same problem (in maybe a similar programing task) some years ago: Do I calculate the entropy against the complete set of features, the relevant features for a branch or the relevant features for a level?
Turned out like this: In a decision tree it comes down to comparing entropies between different branches to determine the optimal branch. Comparison requires equal base sets, i.e. whenever you want to compare two entropie values, they must be based on the same feature set.
For your problem you can go with the features relevant to the set of branches you want to compare, as long as you are aware that with this solution you cannot compare entropies between different branch sets.
Otherwise go with the whole feature set.
(Disclaimer: Above solution is a mind protocol from a problem that lead to about an hour of thinking some years ago. Hopefully I got everything right.)
PS: Beware of the car dataset! ;)
I have a twitter-like(another micro blog) data set with 1.6 million datapoints and tried to predict the its retweet numbers based on its content. I extracted its keyword and use the keywords as the bag of words feature. Then I got 1.2 million dimension feature. The feature vector is very sparse,usually only ten dimension in one data point. And I use SVR to do the regression. Now it has taken 2 days. I think the training time might take quite a long time. I don't know if I do this task like this is normal. Is there any way or is it necessary to optimize this problem?
BTW. If in this case , I don't use any kernel and the machine is 32GB RAM and i-7 16 cores. How long the training time will be in estimation? I used the lib pyml.
You need to find a dimensionality reduction approach that works for your problem.
I've worked on a similar problem to yours and I found that Information Gain worked well, but there are others.
I found this paper (Fabrizio Sebastiani, Machine Learning in Automated Text Categorization, ACM Computing Surveys, Vol. 34, No.1, pp.1-47, 2002) to be a good theoretical treatment of text classification, including feature reduction by a variety of methods from the simple (Term Frequency) to the complex (Information-Theoretic).
These functions try to capture the intuition that the best terms for ci are the
ones distributed most differently in the sets of positive and negative examples of
ci. However, interpretations of this principle vary across different functions. For instance, in the experimental sciences χ2 is used to measure how the results of an observation differ (i.e., are independent) from the results expected according to an initial hypothesis (lower values indicate lower dependence). In DR we measure how independent tk and ci are. The terms tk with the lowest value for χ2(tk, ci) are thus the most independent from ci; since we are interested in the terms which are not, we select the terms for which χ2(tk, ci) is highest.
These techniques help you choose terms that are most useful in separating the training documents into the given classes; the terms with the highest predictive value for your problem.
I've been successful using Information Gain for feature reduction and found this paper (Entropy based feature selection for text categorization Largeron, Christine and Moulin, Christophe and Géry, Mathias - SAC - Pages 924-928 2011) to be a very good practical guide.
Here the authors present a simple formulation of entropy-based feature selection that's useful for implementation in code:
Given a term tj and a category ck, ECCD(tj , ck) can be
computed from a contingency table. Let A be the number
of documents in the category containing tj ; B, the number
of documents in the other categories containing tj ; C, the
number of documents of ck which do not contain tj and D,
the number of documents in the other categories which do
not contain tj (with N = A + B + C + D):
Using this contingency table, Information Gain can be estimated by:
This approach is easy to implement and provides very good Information-Theoretic feature reduction.
You needn't use a single technique either; you can combine them. Ter-Frequency is simple, but can also be effective. I've combined the Information Gain approach with Term Frequency to do feature selection successfully. You should experiment with your data to see which technique or techniques work most effectively.
At first you can simply remove all words with high frequency and all words with low frequency, because both of them don't tell you much about content of a text, then you have to do a word-stemming.
After that you can try to reduce dimensionality of your space, with Feature hashing, or some more advance dimensionality reduction trick (PCA, ICA), or even both of them.
I need some point of view to know if what I am doing is good or wrong or if there is better way to do it.
I have 10 000 elements. For each of them I have like 500 features.
I am looking to measure the separability between 2 sets of those elements. (I already know those 2 groups I don't try to find them)
For now I am using svm. I train the svm on 2000 of those elements, then I look at how good the score is when I test on the 8000 other elements.
Now I would like to now which features maximize this separation.
My first approach was to test each combination of feature with the svm and follow the score given by the svm. If the score is good those features are relevant to separate those 2 sets of data.
But this takes too much time. 500! possibility.
The second approach was to remove one feature and see how much the score is impacted. If the score changes a lot that feature is relevant. This is faster, but I am not sure if it is right. When there is 500 feature removing just one feature don't change a lot the final score.
Is this a correct way to do it?
Have you tried any other method ? Maybe you can try decision tree or random forest, it would give out your best features based on entropy gain. Can i assume all the features are independent of each other. if not please remove those as well.
Also for Support vectors , you can try to check out this paper:
http://axon.cs.byu.edu/Dan/778/papers/Feature%20Selection/guyon2.pdf
But it's based more on linear SVM.
You can do statistical analysis on the features to get indications of which terms best separate the data. I like Information Gain, but there are others.
I found this paper (Fabrizio Sebastiani, Machine Learning in Automated Text Categorization, ACM Computing Surveys, Vol. 34, No.1, pp.1-47, 2002) to be a good theoretical treatment of text classification, including feature reduction by a variety of methods from the simple (Term Frequency) to the complex (Information-Theoretic).
These functions try to capture the intuition that the best terms for ci are the
ones distributed most differently in the sets of positive and negative examples of
ci. However, interpretations of this principle vary across different functions. For instance, in the experimental sciences χ2 is used to measure how the results of an observation differ (i.e., are independent) from the results expected according to an initial hypothesis (lower values indicate lower dependence). In DR we measure how independent tk and ci are. The terms tk with the lowest value for χ2(tk, ci) are thus the most independent from ci; since we are interested in the terms which are not, we select the terms for which χ2(tk, ci) is highest.
These techniques help you choose terms that are most useful in separating the training documents into the given classes; the terms with the highest predictive value for your problem. The features with the highest Information Gain are likely to best separate your data.
I've been successful using Information Gain for feature reduction and found this paper (Entropy based feature selection for text categorization Largeron, Christine and Moulin, Christophe and Géry, Mathias - SAC - Pages 924-928 2011) to be a very good practical guide.
Here the authors present a simple formulation of entropy-based feature selection that's useful for implementation in code:
Given a term tj and a category ck, ECCD(tj , ck) can be
computed from a contingency table. Let A be the number
of documents in the category containing tj ; B, the number
of documents in the other categories containing tj ; C, the
number of documents of ck which do not contain tj and D,
the number of documents in the other categories which do
not contain tj (with N = A + B + C + D):
Using this contingency table, Information Gain can be estimated by:
This approach is easy to implement and provides very good Information-Theoretic feature reduction.
You needn't use a single technique either; you can combine them. Term-Frequency is simple, but can also be effective. I've combined the Information Gain approach with Term Frequency to do feature selection successfully. You should experiment with your data to see which technique or techniques work most effectively.
If you want a single feature to discriminate your data, use a decision tree, and look at the root node.
SVM by design looks at combinations of all features.
Have you thought about Linear Discriminant Analysis (LDA)?
LDA aims at discovering a linear combination of features that maximizes the separability. The algorithm works by projecting your data in a space where the variance within classes is minimum and the one between classes is maximum.
You can use it reduce the number of dimensions required to classify, and also use it as a linear classifier.
However with this technique you would lose the original features with their meaning, and you may want to avoid that.
If you want more details I found this article to be a good introduction.
Example:
I have m sets of ~1000 text documents, ~10 are predictive of a binary result, roughly 990 aren't.
I want to train a classifier to take a set of documents and predict the binary result.
Assume for discussion that the documents each map the text to 100 features.
How is this modeled in terms of training examples and features? Do I merge all the text together and map it to a fixed set of features? Do I have 100 features per document * ~1000 documents (100,000 features) and one training example per set of documents? Do I classify each document separately and analyze the resulting set of confidences as they relate to the final binary prediction?
The most common way to handle text documents is with a bag of words model. The class proportions are irrelevant. Each word gets mapped to a unique index. Make the value at that index equal to the number of times that token occurs (there are smarter things to do). The number of features/dimension is then the number of unique tokens/words in your corpus. There are manny issues with this, and some of them are discussed here. But it works well enough for many things.
I would want to approach it as a two stage problem.
Stage 1: predict the relevancy of a document from the set of 1000. For best combination with stage 2, use something probabilistic (logistic regression is a good start).
Stage 2: Define features on the output of stage 1 to determine the answer to the ultimate question. These could be things like the counts of words for the n most relevant docs from stage 1, the probability of the most probable document, the 99th percentile of those probabilities, variances in probabilities, etc. Whatever you think will get you the correct answer (experiment!)
The reason for this is as follows: concatenating documents together will drown you in irrelevant information. You'll spend ages trying to figure out which words/features allow actual separation between the classes.
On the other hand, if you concatenate feature vectors together, you'll run into an exchangeability problem. By that I mean, word 1 in document 1 will be in position 1, word 1 in document 2 will be in position 1001, in document 3 it will be in position 2001, etc. and there will be no way to know that the features are all related. Furthermore, an alternate presentation of the order of the documents would lead to the positions in the feature vector changing its order, and your learning algorithm won't be smart to this. Equally valid presentations of the document orders will lead to completely different results in an entirely non-deterministic and unsatisfying way (unless you spend a long time designing a custom classifier that's not afficted with this problem, which might ultimately be necessary but it's not the thing I'd start with).
If I had a feature calories and another feature number of people, why does adding the feature calorie per person or adding the feature calories/10 help in improving testing? I don't see how performing simple arithmetic on two features will gain you more information.
Thanks
Consider you're using a classifier/regression mechanism which is linear (or log-linear) in the feature space. If your instance x has features x_i, then being linear means the score is something like:
y_i = \sum_i x_i * w_i
Now consider you think there are some important interactions between the features---maybe you think that x_i is only important if x_j takes a similar value, or their sum is more important than the individual values, or whatever. One way of incorporating this information is to have the algorithm explicitly model cross products, e.g.:
y_i = [ \sum_i x_i * w_i ] + [\sum_i,j x_i * x_j * w_ij]
However, linear algorithms are ubiquitous and easy to use, so a way of getting interaction-like terms into your standard linear classifier/regression mechanism is to augment the feature space so for every pair x_i, x_j you create a feature of the form [x_i * x_j] or [x_i / x_j] or whatever. Now you can model interactions between features without needing to use a non-linear algorithm.
Performing that type of arithmetic allows you to use that information in models that don't explicitly consider nonlinear combinations of variables. Some classifiers attempt to find features that best explain/predict the training data and often the best feature may be nonlinear.
Using your data, suppose you wanted to predict whether a group of people will - on average - gain weight. And suppose the "correct" answer is that the group will gain weight if people in the group consume over an average of 3,000 calories per day. If your inputs are group_size and group_calories, you will need to use both of those variables to make an accurate prediction. But if you also provide group_avg_calories (which is just group_calories / group_size), you could just use that single feature to make the prediction. Even if the first two features added some additional information, if you were to feed those 3 features to a decision tree classifier, it would almost certainly pick group_avg_calories as the root node and you would end up with a much simpler tree structure. There is also a downside to adding lots of arbitrary nonlinear combinations of features to your model, which is that it can add significantly to the classifier's training time.
With regard to calories/10, it's not clear why you would do that specifically, but normalizing the input features can improve convergence rates for some classifiers (e.g., ANNs) and can also provide better performance for clustering algorithms because the input features will all be at the same scale (i.e., distances along different feature axes are comparable).