I have a dataset which has 5 variables and 1 response. The variables are discrete. I want to find the key variable and its value which leads to a significant increase or decrease to the response.
You will need to perform some statistical tests in order to find which variables are the most significant.
If you are familiar with python you could use SelectKBest from scikit-learn. It will give you a score, the highest the score, the stronger the link between the feature and the output.
Additionally you can train an explainable ML model, strong enough to converge, and find the pattern within the data, from that you could compute the feature importance.
For example you could use DecisionTreeClasifier from scikit-learn. It has a decision_path class function that will plot the decision path taken by the tree, decision_path has a property called feature_importances_ that uses Gini coefficient to compute the importance of the features.
Last but not the least, you can use feature reduction techniques, such as PCA, it's used to find the variance between variables, from the PCA you will compute new Principal Components that are linked to the features, from the most explenatory ones you can find the features importance. Check this stack overflow answer that explains everything you should know for that.
Lets say I have 100 independent features - 90 are binary (e.g. 0/1) and 10 are continuous variables (e.g. age, height, weight, etc). I use the 100 features to predict a classifier problem with an adequate amount of samples.
When I set a XGBClassifier function and fit it, then the 10 most important features from the standpoint of gain are always the 10 continuous variable. For now I am not interested in cover or frequency. The 10 continuous variables take up like .8 to .9 of space in gain list ( sum(gain) = 1).
I tried tuning the gamma, reg_alpha , reg_lambda , max_depth, colsample. Still top 10 features by gain are always the 10 continuous features.
Any suggestions?
small update -- someone asked why I think this is happening. I believe it's because a continuous variable can be split on multiple times per decision tree. A binary variable can only be split on once. Hence, the higher prevalence of continuous variables in trees and thus a higher gain score
Yes, it's well-known that a tree(/forest) algorithm (xgboost/rpart/etc.) will generally 'prefer' continuous variables over binary categorical ones in its variable selection, since it can choose the continuous split-point wherever it wants to maximize the information gain (and can freely choose different split-points for that same variable at other nodes, or in other trees). If that's the optimal tree (for those particular variables), well then it's the optimal tree. See Why do Decision Trees/rpart prefer to choose continuous over categorical variables? on sister site CrossValidated.
When you say "any suggestions", depends what exactly do you want, it could be one of the following:
a) To find which of the other 90 binary categorical features give the most information gain
b) To train a suboptimal tree just to find out which features those are
c) To engineer some "compound" features by combining the binary features into n-bit categorical features which have more information gain (while being sure to remove the individual binary features from the input)
d) You could look into association rules : What is the practical difference between association rules and decision trees in data mining?
If you want to explore a)...c), suggest something vaguely like this:
exclude various subsets of the 10 continuous variables, then see which binary features show up as having the most gain. Let's say that gives you N candidate features. N will be << 90, let's assume N < 20 to make the following more computationally efficient.
then compute the pairwise measure of association or correlation (Spearman or Kendall) between each of the N features. Look at a corrplot. Pick the clusters of variables which are most associated with each other. Create compound n-bit variables which combine those individual binary features. Then retrain the tree, including the compound variables, and excluding the individual binary variables (to avoid changing the total variance in the input).
iterate for excluding various subsets of the 10 continuous variables. See which patterns emerge in your compound variables. I'm sure there's an algorithm for doing this (compound feature-engineering of n-bit categoricals) more formally and methodically, I just don't know it.
Anyway, for hacking a tree-based method for better performance, I imagine the most naive way is "at every step, pick the two most highly-correlated/associated categorical features and combine them". Then retrain the tree (include new feature, exclude its constituent features) and use the revised gain numbers.
perhaps a more robust way might be:
Pick some threshold T for correlation/association, say start at a high level T = 0.9 or 0.95
At each step, merge any features whose absolute correlation/association to each other >= T
If there were no merges at this step, reduce T by some value (like T -= 0.05) or ratio (e.g. T *= 0.9 . If still no merges, keep reducing T until there are merges, or until you hit some termination value (e.g. T = 0.03)
Retrain the tree including the compound variables, excluding their constituent subvariables.
Now go back and retrain what should be an improved tree with all 10 continuous variables, and your compound categorical features.
Or you could early-terminate the compound feature selection to see what the full retrained tree looks like.
This issue arose in the 2014 Kaggle Allstate Purchase Prediction Challenge, where the policy coverage options A,B,C,D,E,F,G were each categoricals with between 2-4 values, and very highly correlated with each other. (The current option of C, "C_previous", is one of the input features). See that competitions's forums and published solutions for more. Be aware that policy = (A,B,C,D,E,F,G) is the output. But C_previous is an input variable.
Some general fast-and-dirty rules-of-thumb on feature selection from Kaggle are:
throw out any near-constant/ very-low-variance variables (because they have near-zero information content)
throw out any very-high-cardinality categorical variables (cardinality >~ training-set-size/2), (because they will also tend to have low information content, but cause lots of spurious overfitting and blow up training time). This can include customer IDs, row IDs, transaction IDs, sequence IDs, and other variables which shouldn't be trained on in the first place but accidentally ended up in the training set.
I can suggest few things for you to try.
Test your model without this data (only 90 features) and evaluate the decrease in your score. If it's insignificant you might want to remove those features.
Turn them into groups.
For example, age can be categorized into groups, 0 : 0-7, 1 : 8-16, 2 : 17-25 and so on.
Turn them into binary. Out of the box idea on how to chose the best value to split them into binary is: Build 1 tree with 1 node (max depth = 1) and use only 1 feature. (1 out of the continuous features). then, dump the model to a .txt file and see the value it chose to split on. using this value, you can transform all that feature column into binary
I'm dealing myself with very similar problems right now, So i'll be happy to hear your results and the paths you chose to try.
I learned a lot from the answer by #smci, so I would recommend to follow his suggestions.
In the case, when your binary categorical features are in fact OHE representations of several categorical features with several classes in each, you can follow two more approaches:
Convert OHE into label encoding. Yes, this has the caveat that one introduces an order into a categorical features, which might be meaningless, for example green=3 > red=2 > blue=1. But in practice is seems that trees handle label=encoded categorical variables (even with meaningless order) reasonably well.
Convert OHE into target-/mean-/likelihood encoding. This is tricky, because you need to apply regularisation to avoid data leakage.
Both of those ideas are meant to group together several binary features into a single one based on prior knowledge about feature meaning. If you do not have that luxury, you can also try to deduce such groups by doing scalar product of columns and finding those giving zero product.
I am using a DRF model generated with h2o flow. When running fresh input data against this model (using its MOJO in a java program with the EasyPredictModelWrapper), there are a large number of UnknownCategoricalLevels (checking with the getUnknownCategoricalLevelsSeen() and getUnknownCategoricalLevelsSeenPerColumn() methods).
My workaround for this was to only use those predictions that had a prediction confidence above a certain threshold (say 0.90). Ie. the classProbability selected by the model must be grater than threshold to be used.
My questions are:
Is this solution wrong-headed (ie. does not actually address/workaround the problem (eg. unknownlevels don't actually affect the class probability values)) or is it a valid workaround to the problem?
Is there a better way to address this issue?
Thanks.
The unknown categorical level is treated as an NA for that column.
Without knowing the details of your data (including the cost implications of false positives and false negatives), I wouldn't say that you need to threshold rows that have NAs any differently than for rows that do not. (The NA is already handled quite well by DRF.)
Note the built-in threshold is max-F1 (not 0.5). So if you are changing the threshold for rows with unknown values, it's relative to max-F1 (not 0.5). Using your own threshold is certainly a valid approach.
If you want to visualize your trees to more easily see how the NAs behave, you can do so following the instructions here:
http://docs.h2o.ai/h2o/latest-stable/h2o-genmodel/javadoc/overview-summary.html#viewing-a-mojo
There are also other strategies for dealing with it, like target-encoding your categorical input column and treating an NA as the average target value. (This effectively turns a categorical variable into a numeric one, but requires you to preprocess the data.)
I have a (probably stupid) question about predicting a new instance with a missing predictor(s).
I am given a data. Let's say I preprocess, clean data and as a result, let's say, 10 predictors left. Then, I train my model on a resulting data, so I am ready to use model to predict.
Now, what should I do if I want to predict a new instance which 1 or 2 predictors are missing?
There are at least two reasonable solutions.
(1) Average the output over the possible values of the missing variable or variables, conditional on the values of the non-missing variables. That is, compute a weighted average of the output prediction(missing, non-missing) for each possible value of missing, weighted by the probability of missing given non-missing. This is essentially a variety of what's called "multiple imputation" in the literature.
The first thing to try is to just weight by the unconditional distribution of missing. If that seems too complicated, a very rough approximation is to substitute the mean value of missing into the prediction.
(2) Build a model for each combination variables. If you have n variables, this means building 2^n variables. If n = 10, 1024 models is not a big deal these days. Then if you are missing some variables, just use the model for the ones that are present.
By the way, you might get more interest in this question at stats.stackexchange.com.
I am working on to find strength of association between binary dependent variable and ordinal independent variable(IV). I tried chi square to see the cross tabulation and clearly few categories from (IV) have more association if dependent variable(yes or no). But many cells have expected values less than 5.
I also used logistic regression however it gives me significant value such as 1.000 0.999 etc and no significant value among all the (IV)levels.
I am not sure if I am able to clearly specify my problem. I can only use SPSS!
How did you treat the independent variable? Just entering it as a linear predictor is not appropriate if it is ordinal scale.
You might also want to collapse categories. A decision tree might also be useful.