Decision tree entropy calculation target - machine-learning

I found several examples of two types.
Single feature
Given data with only two classes of items, for example only blue and yellow balls. In other words, we have only one feature, which in this case is color. This is a clear example for showing how the "divide and conquer" rule applies to entropy, but it is useless for any prediction or classification problem: if an object has only one feature and its value is already known, we don't need a tree to decide that "this ball is yellow".
Multiple features
Given data with multiple features and a feature to predict (known for the training data). We can choose a predicate for each node based on the minimum average entropy over the features. Closer to life, isn't it? It was clear to me until I tried to implement the algorithm.
And now I have a contradiction in my mind.
If we calculate entropy relative to the known features (one per node), classification with the tree will give meaningful results only if the unknown feature depends strictly on every known feature. Otherwise a single unrelated known feature could break the whole prediction and drive a decision the wrong way. But if we calculate entropy relative to the values of the feature we want to predict, we are back to the first, senseless example: then it makes no difference which of the known features we use for a node...
And a question about the tree-building process.
Should I calculate entropy only for the known features and just trust that all the known features are related to the unknown one? Or should I also calculate entropy for the unknown feature (which is known for the training data) to determine which feature affects the result the most?

I had the same problem (in perhaps a similar programming task) some years ago: do I calculate the entropy against the complete set of features, the features relevant to a branch, or the features relevant to a level?
It turned out like this: in a decision tree it comes down to comparing entropies between different branches to determine the optimal branch. Comparison requires equal base sets, i.e. whenever you want to compare two entropy values, they must be based on the same feature set.
For your problem you can go with the features relevant to the set of branches you want to compare, as long as you are aware that with this solution you cannot compare entropies between different branch sets.
Otherwise go with the whole feature set.
(Disclaimer: The above solution is a mental protocol from a problem that led to about an hour of thinking some years ago. Hopefully I got everything right.)
PS: Beware of the car dataset! ;)
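For reference, a minimal sketch of the standard ID3-style criterion (my own illustration of the usual convention, not part of the answer above): entropy is always measured on the label you want to predict, a candidate known feature only determines how the rows are partitioned, and the feature with the highest resulting information gain is picked for the node.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of the target label distribution."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """How much splitting on this feature reduces the label entropy."""
    n = len(labels)
    groups = {}
    for v, y in zip(feature_values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Toy, made-up data: 'color' is a known feature, 'label' is what we predict.
color = ["blue", "blue", "yellow", "yellow"]
label = ["small", "small", "big", "big"]
print(information_gain(color, label))  # 1.0: color perfectly predicts label
```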

Related

Selected features in random forest subsampling

I am trying to figure out which features are considered in each subsample in my classification problem. For this, I am assuming that a random subset of features of length max_features is considered when building every tree.
I am interested in this because I am using two different types of features for my problem, so I want to make sure that both types of features are used for every node split in every tree. One way to at least make each tree consider all features is by setting the max_features parameter to None. So one question here would be:
Does that mean that both types of features are being considered for every node split?
Another one derived from the previous question is:
Since random forests perform subsampling for every tree, is this subsampling over cases (rows) or over columns (features) as well? Also, can this subsampling be done by groups of rows instead of randomly?
Besides, it does not seem to be a good idea to use all the features via the max_features parameter, neither in decision trees nor in random forests, since it goes against the whole point and definition of random forests in terms of correlation among trees (I am not completely sure about this statement).
Does anyone know if this is something that can be modified in the source code or if at least it can be approached differently?
Any suggestion or comment is very welcome.
Feel free to correct any assumption.
I have been reading the source code but could not find where this is defined.
Source code inspected so far:
splitter.py code from decision tree
forest.py code from random forest
Does that mean that both types of features are being considered for every node split?
Given that you have correctly pointed out above that setting max_features to None will indeed force the algorithm to consider all features in every split, it is unclear what exactly you are asking here: all means all, and, from the point of view of the algorithm, there are no different "types" of features.
Since random forests perform subsampling for every tree, is this subsampling over cases (rows) or over columns (features) as well?
Both. But, regarding the rows, it is not exactly subsampling, it is actually bootstrap sampling, i.e. sampling with replacement, which means that, in each sample, some rows will be missing while others will be present multiple times.
Random forest is in fact the combination of two independent ideas: bagging, and random selection of features. The latter corresponds essentially to "column subsampling", while the former includes the bootstrap sampling I have just described.
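As a rough sketch, this is how the two ideas map onto scikit-learn parameters (synthetic data just for illustration; check the docs of your installed version for exact defaults):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,
    bootstrap=True,        # row "subsampling": bootstrap, i.e. with replacement
    max_samples=None,      # None = each bootstrap sample has n_samples rows
    max_features="sqrt",   # column subsampling: candidate features per split
    random_state=0,
).fit(X, y)

# max_features=None makes every feature a candidate at every split, which
# reduces the forest to plain bagging of (almost) unrestricted trees.
bagging_like = RandomForestClassifier(
    n_estimators=200, bootstrap=True, max_features=None, random_state=0
).fit(X, y)
```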
Also, can this subsampling be done by groups of rows instead of randomly?
AFAIK no, at least in the standard implementations (including scikit-learn).
Does anyone know if this is something that can be modified in the source code or if at least it can be approached differently?
Everything can be modified in the source code, literally; now, if it is really necessary (or even a good idea) to do so is a different story...
Besides, it does not seem to be a good idea to use all the features via the max_features parameter
It does not indeed, as this is the very characteristic that distinguishes RF from the simpler bagging approach (bagging is short for bootstrap aggregating). Experiments have indeed shown that adding this random selection of features at each split boosts performance relative to simple bagging.
Although your question (and issue) sounds rather vague, my advice would be to "sit back and relax", letting the (powerful enough) RF algorithm do its job with your data...

How to derive the top contributing factors in a binary classification problem

I have a binary classification problem with about 30 features and an ultimate pass/fail label. I first trained a classifier to be able to predict if new instances will pass or fail but now I want to get a deeper understanding.
How can I derive some analysis about why these items pass or fail based on their features? I would ideally like to be able to show the top contributing factors with a weight associated with each one. Complicating this is that my features are not necessarily statistically independent of each other. What sorts of methods should I look into, what keywords will point me in the right direction?
Some initial thoughts: Use a decision tree classifier (ID3 or CART) and look at the top of the tree for top factors. I am not sure how robust this approach would be and it isn't immediately clear to me how one can assign the importance of each factor (one would just get an ordered list).
If I understand your objectives correctly, you might want to consider a Random Forest model. Random forests have the advantage of naturally providing an importance to the features by virtue of how the algorithm works.
In Python's scikit-learn, check out sklearn.ensemble.RandomForestClassifier(). feature_importances_ would return the "weights" I believe you're looking for. Check out the example in the documentation.
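A small sketch along those lines (the data and column names here are synthetic, purely for illustration):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=30, n_informative=8,
                           random_state=0)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(30)])

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

# Impurity-based importances: one weight per feature, summing to 1.
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))
```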
Alternatively, you can use R's randomForest package. After constructing the model, you can use importance() to extract the feature importance values.

XGBoost: minimize influence of continuous linear features as opposed to categorical

Let's say I have 100 independent features: 90 are binary (e.g. 0/1) and 10 are continuous variables (e.g. age, height, weight, etc.). I use the 100 features to fit a classifier, with an adequate number of samples.
When I set up an XGBClassifier and fit it, the 10 most important features from the standpoint of gain are always the 10 continuous variables. For now I am not interested in cover or frequency. The 10 continuous variables take up about 0.8 to 0.9 of the gain list (with sum(gain) = 1).
I tried tuning gamma, reg_alpha, reg_lambda, max_depth, and colsample. Still, the top 10 features by gain are always the 10 continuous features.
Any suggestions?
Small update: someone asked why I think this is happening. I believe it's because a continuous variable can be split on multiple times per decision tree, whereas a binary variable can only be split on once. Hence the higher prevalence of continuous variables in the trees, and thus the higher gain score.
Yes, it's well-known that a tree(/forest) algorithm (xgboost/rpart/etc.) will generally 'prefer' continuous variables over binary categorical ones in its variable selection, since it can choose the continuous split-point wherever it wants to maximize the information gain (and can freely choose different split-points for that same variable at other nodes, or in other trees). If that's the optimal tree (for those particular variables), well then it's the optimal tree. See Why do Decision Trees/rpart prefer to choose continuous over categorical variables? on sister site CrossValidated.
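As a quick way to see this in your own model, something like the following inspects the gain-based importances (synthetic data and made-up column names, just to show the calls):

```python
import numpy as np
import pandas as pd
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
n = 2000
X = pd.DataFrame(rng.integers(0, 2, size=(n, 90)),
                 columns=[f"bin_{i}" for i in range(90)])
for j in range(10):
    X[f"cont_{j}"] = rng.normal(size=n)            # 10 continuous columns
y = (X["cont_0"] + X["bin_0"] + rng.normal(scale=0.5, size=n) > 0.5).astype(int)

model = XGBClassifier(n_estimators=200, max_depth=4).fit(X, y)

# Total gain accumulated by splits on each feature; continuous columns can be
# split repeatedly per tree, so they tend to collect a large share of it.
gain = model.get_booster().get_score(importance_type="total_gain")
print(sorted(gain.items(), key=lambda kv: kv[1], reverse=True)[:10])
```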
When you say "any suggestions", depends what exactly do you want, it could be one of the following:
a) To find which of the other 90 binary categorical features give the most information gain
b) To train a suboptimal tree just to find out which features those are
c) To engineer some "compound" features by combining the binary features into n-bit categorical features which have more information gain (while being sure to remove the individual binary features from the input)
d) You could look into association rules: What is the practical difference between association rules and decision trees in data mining?
If you want to explore a)...c), I suggest something vaguely like this:
exclude various subsets of the 10 continuous variables, then see which binary features show up as having the most gain. Let's say that gives you N candidate features. N will be << 90, let's assume N < 20 to make the following more computationally efficient.
then compute a pairwise measure of association or correlation (Spearman or Kendall) between each pair of the N features. Look at a corrplot. Pick the clusters of variables that are most associated with each other. Create compound n-bit variables which combine those individual binary features. Then retrain the tree, including the compound variables and excluding the individual binary variables (to avoid changing the total variance in the input).
Iterate over excluding various subsets of the 10 continuous variables. See which patterns emerge in your compound variables. I'm sure there's an algorithm for doing this (compound feature engineering of n-bit categoricals) more formally and methodically, I just don't know it.
Anyway, for hacking a tree-based method for better performance, I imagine the most naive way is "at every step, pick the two most highly-correlated/associated categorical features and combine them". Then retrain the tree (include new feature, exclude its constituent features) and use the revised gain numbers.
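A rough sketch of that naive step, assuming X is a pandas DataFrame holding the 0/1 features (the retraining part is up to you; X and binary_cols are placeholders):

```python
import numpy as np
import pandas as pd

def most_associated_pair(X_bin: pd.DataFrame):
    """Return the pair of binary columns with the largest absolute Kendall tau."""
    corr = X_bin.corr(method="kendall").abs()
    np.fill_diagonal(corr.values, 0.0)        # ignore self-association
    return corr.stack().idxmax()              # (col_a, col_b)

def combine(X_bin: pd.DataFrame, a: str, b: str) -> pd.Series:
    """Merge two 0/1 columns into one 2-bit categorical with values 0..3."""
    return 2 * X_bin[a] + X_bin[b]

# Usage: merge the top pair, drop the constituents, then retrain the tree and
# look at the revised gain numbers.
# a, b = most_associated_pair(X[binary_cols])
# X[f"{a}__{b}"] = combine(X, a, b)
# X = X.drop(columns=[a, b])
```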
Perhaps a more robust way might be the following (a sketch follows below):
Pick some threshold T for correlation/association, say start at a high level T = 0.9 or 0.95
At each step, merge any features whose absolute correlation/association to each other >= T
If there were no merges at this step, reduce T by some value (like T -= 0.05) or ratio (e.g. T *= 0.9). If there are still no merges, keep reducing T until there are merges, or until you hit some termination value (e.g. T = 0.03).
Retrain the tree including the compound variables, excluding their constituent subvariables.
Now go back and retrain what should be an improved tree with all 10 continuous variables, and your compound categorical features.
Or you could early-terminate the compound feature selection to see what the full retrained tree looks like.
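A loose sketch of that threshold-lowering loop, again assuming a DataFrame of binary features and leaving the retraining step out; treat the Kendall measure on the compound (no longer binary) columns as a heuristic only:

```python
import numpy as np
import pandas as pd

def merge_by_threshold(X_bin: pd.DataFrame, t_start=0.95, t_step=0.05, t_stop=0.03):
    """Repeatedly merge highly associated columns, lowering the threshold T."""
    X = X_bin.copy()
    T = t_start
    while T >= t_stop:
        corr = X.corr(method="kendall").abs()
        np.fill_diagonal(corr.values, 0.0)
        # all column pairs whose association meets the current threshold
        pairs = [(a, b) for a in corr.columns for b in corr.columns
                 if a < b and corr.loc[a, b] >= T]
        if not pairs:
            T -= t_step                      # no merges: lower the threshold
            continue
        for a, b in pairs:
            if a in X.columns and b in X.columns:
                # mixed-radix pairing keeps distinct value combinations distinct
                X[f"{a}__{b}"] = X[a] * (X[b].max() + 1) + X[b]
                X = X.drop(columns=[a, b])
    return X
```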
This issue arose in the 2014 Kaggle Allstate Purchase Prediction Challenge, where the policy coverage options A, B, C, D, E, F, G were each categoricals with between 2 and 4 values, and very highly correlated with each other. (The current value of option C, "C_previous", is one of the input features.) See that competition's forums and published solutions for more. Be aware that policy = (A,B,C,D,E,F,G) is the output, but C_previous is an input variable.
Some general fast-and-dirty rules-of-thumb on feature selection from Kaggle are:
throw out any near-constant/ very-low-variance variables (because they have near-zero information content)
throw out any very-high-cardinality categorical variables (cardinality >~ training-set-size/2), (because they will also tend to have low information content, but cause lots of spurious overfitting and blow up training time). This can include customer IDs, row IDs, transaction IDs, sequence IDs, and other variables which shouldn't be trained on in the first place but accidentally ended up in the training set.
I can suggest a few things for you to try.
Test your model without this data (only 90 features) and evaluate the decrease in your score. If it's insignificant you might want to remove those features.
Turn them into groups.
For example, age can be categorized into groups: 0: 0-7, 1: 8-16, 2: 17-25, and so on.
Turn them into binary. An out-of-the-box idea for how to choose the best value to split on: build one tree with one node (max depth = 1) using only one of the continuous features, then dump the model to a text file and see the value it chose to split on. Using this value, you can transform the whole feature column into binary (a sketch follows below).
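A small sketch of both ideas, assuming df is your DataFrame and y your target; here a scikit-learn depth-1 tree is used to read off the split value directly instead of dumping an xgboost model to a text file, but the idea is the same:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# 1) Binning: age -> groups 0: 0-7, 1: 8-16, 2: 17-25, ... (bin edges made up)
age_groups = pd.cut(df["age"], bins=[0, 7, 16, 25, 40, 120], labels=False)

# 2) Binarising: fit a one-node tree on a single continuous feature ...
stump = DecisionTreeClassifier(max_depth=1).fit(df[["age"]], y)
threshold = stump.tree_.threshold[0]      # the split value the tree chose
# ... and use that value to turn the whole column into a binary feature.
age_binary = (df["age"] > threshold).astype(int)
```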
I'm dealing with very similar problems myself right now, so I'll be happy to hear your results and the paths you chose to try.
I learned a lot from the answer by @smci, so I would recommend following his suggestions.
In the case where your binary categorical features are in fact one-hot encoded (OHE) representations of several categorical features with several classes each, you can follow two more approaches:
Convert OHE into label encoding. Yes, this has the caveat that it introduces an order into a categorical feature, which might be meaningless, for example green=3 > red=2 > blue=1. But in practice it seems that trees handle label-encoded categorical variables (even with a meaningless order) reasonably well.
Convert OHE into target-/mean-/likelihood encoding. This is tricky, because you need to apply regularisation to avoid data leakage.
Both of those ideas are meant to group several binary features into a single one based on prior knowledge about feature meaning. If you do not have that luxury, you can also try to deduce such groups by taking scalar products of columns and finding those that give a zero product.
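A small sketch of the first idea plus the scalar-product trick, with made-up column names and an assumed DataFrame df holding the features:

```python
# Collapse one OHE group back into a single label-encoded column.
ohe_cols = ["color_red", "color_green", "color_blue"]   # hypothetical OHE group
df["color"] = df[ohe_cols].values.argmax(axis=1)        # 0, 1 or 2 per row
df = df.drop(columns=ohe_cols)

# Scalar-product trick: OHE columns of the same underlying feature are mutually
# exclusive, so their columnwise dot product is exactly zero.
binary_cols = [c for c in df.columns if set(df[c].unique()) <= {0, 1}]
gram = df[binary_cols].T @ df[binary_cols]               # co-occurrence counts
never_together = (gram == 0)                             # candidates for one group
```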

Machine learning: Which algorithm is used to identify relevant features in a training set?

I've got a problem where I've potentially got a huge number of features. Essentially a mountain of data points (for discussion let's say it's in the millions of features). I don't know what data points are useful and what are irrelevant to a given outcome (I guess 1% are relevant and 99% are irrelevant).
I do have the data points and the final outcome (a binary result). I'm interested in reducing the feature set so that I can identify the most useful set of data points to collect to train future classification algorithms.
My current data set is huge, and I can't generate as many training examples with the mountain of data as I could if I were to identify the relevant features, cut down how many data points I collect, and increase the number of training examples. I expect that I would get better classifiers with more training examples given fewer feature data points (while maintaining the relevant ones).
What machine learning algorithms should I focus on to, first,
identify the features that are relevant to the outcome?
From some reading I've done it seems like SVM provides weighting per feature that I can use to identify the most highly scored features. Can anyone confirm this? Expand on the explanation? Or should I be thinking along another line?
Feature weights in a linear model (logistic regression, naive Bayes, etc) can be thought of as measures of importance, provided your features are all on the same scale.
Your model can be combined with a regularizer for learning that penalises certain kinds of feature vectors (essentially folding feature selection into the classification problem). L1 regularized logistic regression sounds like it would be perfect for what you want.
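A minimal sketch of that approach with scikit-learn (X and y are your data; the regularisation strength C is something you would tune):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)   # same scale -> comparable weights

# Smaller C = stronger L1 penalty = more coefficients driven to exactly zero.
# (For very wide, sparse data consider solver="saga" and scaling without centering.)
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X_scaled, y)

# Features with non-zero coefficients survive the penalty; rank by |weight|.
selected = np.flatnonzero(clf.coef_[0])
ranking = np.argsort(-np.abs(clf.coef_[0]))
```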
Maybe you can use PCA or a maximum-entropy method to reduce the data set...
You can go for chi-square tests or entropy-based measures, depending on your data type. Supervised discretization greatly reduces the size of your data in a smart way (take a look at the Recursive Minimal Entropy Partitioning algorithm proposed by Fayyad & Irani).
If you work in R, the SIS package has a function that will do this for you.
If you want to do things the hard way, what you want is feature screening: a massive preliminary dimension reduction before you do feature selection and model selection on a sane-sized set of features. Figuring out what the sane size is can be tricky, and I don't have a magic answer for that, but you can prioritize the order in which you'd want to include the features by:
1) for each feature, split the data in two groups by the binary response
2) find the Kolmogorov-Smirnov statistic comparing the two sets
The features with the highest KS statistic are most useful in modeling.
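A rough sketch of that screening step, assuming X is an (n_samples, n_features) array and y the binary outcome:

```python
import numpy as np
from scipy.stats import ks_2samp

# For each feature, compare its distribution in the two response groups.
ks_stats = np.array([
    ks_2samp(X[y == 0, j], X[y == 1, j]).statistic
    for j in range(X.shape[1])
])

# Highest KS statistic = largest separation between the groups, so these are
# the features to include first.
priority_order = np.argsort(-ks_stats)
```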
There's a paper "out there" titled "A selctive overview of feature screening for ultrahigh-dimensional data" by Liu, Zhong, and Li, I'm sure a free copy is floating around the web somewhere.
4 years later I'm now halfway through a PhD in this field and I want to add that the definition of a feature is not always simple. In the case that your features are a single column in your dataset, the answers here apply quite well.
However, take the case of an image being processed by a convolutional neural network, for example, a feature is not one pixel of the input, rather it's much more conceptual than that. Here's a nice discussion for the case of images:
https://medium.com/@ageitgey/machine-learning-is-fun-part-3-deep-learning-and-convolutional-neural-networks-f40359318721

Does it make sense to construct a learning model using only one feature?

In order to improve the accuracy of an AdaBoost classifier (for image classification), I am using genetic programming to derive new statistical measures. Every time a new feature is generated, I evaluate its fitness by training an AdaBoost classifier and testing its performance. But I want to know if that procedure is correct, I mean the use of a single feature to train a learning model.
You can build a model on one feature. I assume that by "one feature" you mean simply one real number (otherwise it would be completely "traditional" usage). However, this means that you are building a classifier in a one-dimensional space, and as such many classifiers will be redundant (as it is a really simple problem). More importantly, checking whether you can correctly classify objects using one particular dimension does not tell you whether it is a good or bad feature once you use a combination of them. In particular, it may be the case that:
Many features may "discover" the same phenomena in data, and so - each of them separatly can yield good results, but once combined - they won't be any better then each of them (as they simply capture same information)
Features may be useless until used in combination. Some phenomena can be described only in a multi-dimensional space, and if you are analyzing only one-dimensional data you won't ever discover their true value. As a simple example, consider the four points (0,0), (0,1), (1,0), (1,1) such that (0,0) and (1,1) are elements of one class and the rest belong to the other. If you look at each dimension separately, then the best possible accuracy is 0.5 (as you always have points of two different classes at exactly the same values, 0 and 1). Once combined, you can easily separate them, as it is an XOR problem.
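A tiny sketch of that XOR example with a decision tree, showing that each single dimension gives at best 0.5 accuracy while both together are trivially separable:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])           # (0,0) and (1,1) in one class

for dims in ([0], [1], [0, 1]):
    clf = DecisionTreeClassifier().fit(X[:, dims], y)
    print(dims, clf.score(X[:, dims], y))
# [0] and [1] alone score 0.5; [0, 1] scores 1.0
```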
To sum up - it is ok to build a classifier in one dimensional space, but:
Such a problem can be solved without "heavy machinery".
The results should not be used as a basis for feature selection (or, to be more precise, doing so can be very deceptive).
