I want to predict whether students play cricket or not (this is the target variable).
Suppose I have 3 columns:
Gender, Class, Age
As we can see, I have 2 categorical attributes and one continuous attribute.
While deciding the root node, I know that the two categorical attributes can be compared in the usual way using the Gini criterion. How should I split the continuous attribute, and which criterion should I use for it, so that it can compete against the two categorical attributes for the root node?
You can split a continuous variable into intervals. Let's suppose you have a continuous variable ranging from 1 to 10: you can put 1 to 5 in one category and 6 to 10 in another.
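To make that concrete for a decision tree, here is a minimal sketch (the column values are made up) of how a CART-style split point on a continuous attribute is usually chosen: sort the values, try the midpoints between consecutive values as thresholds, and keep the one with the lowest weighted Gini impurity. That impurity is directly comparable to the Gini of the categorical splits, so the continuous attribute competes for the root node on equal terms.

```python
import numpy as np

def gini(y):
    """Gini impurity of an array of class labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_threshold(x, y):
    """Best binary split on a continuous feature, scored by weighted Gini impurity."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_t, best_g = None, np.inf
    for t in (x[:-1] + x[1:]) / 2.0:            # candidate thresholds: midpoints between values
        left, right = y[x <= t], y[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        g = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if g < best_g:
            best_t, best_g = t, g
    return best_t, best_g

# toy example: Age vs. "plays cricket"
age = np.array([10, 11, 12, 13, 14, 15, 16])
plays = np.array([0, 0, 1, 1, 1, 0, 1])
print(best_threshold(age, plays))   # compare best_g with the Gini of the categorical splits
```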
It really depends on which model (algorithm) you are using to do the splitting. In general, though, the F-test is what is normally used when splitting on continuous variables. Have a look at what SAS uses in its implementation: SAS - splitting criteria. Also, here is a quite good explanation of decision trees: Decision tree. It begins here.
I'm new to Machine Learning, and I'd like to ask a question about model generalization. In my case, I'm going to produce some mechanical parts, and I'm interested in controlling the input parameters to obtain certain properties on the final part.
More particularly, I'm interested in 8 parameters (say, P1, P2, ..., P8). To limit the number of pieces I need to produce while maximizing the combinations of parameters explored, I've divided the problem into 2 sets. For the first set of pieces, I'll vary the first 4 parameters (P1 ... P4) while the others are held constant. For the second set, I'll do the opposite (P5 ... P8 variable and P1 ... P4 constant).
So I'd like to know whether it's possible to build a single model that takes all eight parameters as inputs to predict the properties of the final part. I ask because, as I'm not varying all 8 variables at once, I thought I might have to build one model per set of parameters, and then the predictions of the two models couldn't be related to one another.
Thanks in advance.
In most cases, two separate models will achieve better accuracy than one big model. The reason is that each local model only looks at 4 features and can identify patterns among them to make its predictions.
But this approach will almost certainly fail to scale. Right now you only have two sets of data, but what if that grows to 20 sets? It will not be practical to create and maintain 20 ML models in production.
What works best for your case will need some experimentation. Take a random sample from your data and train both options: one big model and two local models, then evaluate their performance. Look not just at accuracy but also at the F1 score, AUC-PR, and the ROC curve to find out what works best for you. If you do not see a major performance drop, one big model for the entire dataset will be the better option. If you know your data will always be divided into these two sets and you don't care about scalability, go with two local models.
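If you want to make that experiment concrete, a rough sketch could look like the following. The file name, the set_id column and the "property" target are placeholders for your data; I use a regressor and R² here since the target sounds continuous, but you can swap in a classifier and the f1/roc_auc scorers mentioned above.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# df is assumed to have columns P1..P8, a target column "property",
# and a "set_id" column saying which experimental set each piece came from.
df = pd.read_csv("parts.csv")                      # placeholder file name
features = [f"P{i}" for i in range(1, 9)]

# One global model using all eight parameters.
global_score = cross_val_score(
    RandomForestRegressor(random_state=0),
    df[features], df["property"], cv=5, scoring="r2"
).mean()

# Two local models, one per experimental set, each seeing only its varied parameters.
local_scores = {}
for set_id, cols in [(1, ["P1", "P2", "P3", "P4"]), (2, ["P5", "P6", "P7", "P8"])]:
    sub = df[df["set_id"] == set_id]
    local_scores[set_id] = cross_val_score(
        RandomForestRegressor(random_state=0),
        sub[cols], sub["property"], cv=5, scoring="r2"
    ).mean()

print("global:", global_score, "local:", local_scores)
```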
Let's say I have 100 independent features: 90 are binary (0/1) and 10 are continuous (e.g. age, height, weight). I use the 100 features in a classification problem with an adequate number of samples.
When I set up an XGBClassifier and fit it, the 10 most important features from the standpoint of gain are always the 10 continuous variables. For now I am not interested in cover or frequency. The 10 continuous variables account for about 0.8 to 0.9 of the total gain (with gain normalized so that sum(gain) = 1).
I tried tuning gamma, reg_alpha, reg_lambda, max_depth, and colsample. Still, the top 10 features by gain are always the 10 continuous features.
Any suggestions?
Small update: someone asked why I think this is happening. I believe it's because a continuous variable can be split on multiple times per decision tree, while a binary variable can only be split on once. Hence the higher prevalence of continuous variables in the trees, and thus their higher gain scores.
Yes, it's well-known that a tree(/forest) algorithm (xgboost/rpart/etc.) will generally 'prefer' continuous variables over binary categorical ones in its variable selection, since it can choose the continuous split-point wherever it wants to maximize the information gain (and can freely choose different split-points for that same variable at other nodes, or in other trees). If that's the optimal tree (for those particular variables), well then it's the optimal tree. See Why do Decision Trees/rpart prefer to choose continuous over categorical variables? on sister site CrossValidated.
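As a quick sanity check, you can read the gain numbers straight off the fitted booster; a minimal sketch with the sklearn XGBClassifier wrapper on made-up data:

```python
import numpy as np
from xgboost import XGBClassifier

# toy data: 90 binary + 10 continuous features, just to illustrate the API
rng = np.random.default_rng(0)
X = np.hstack([rng.integers(0, 2, (500, 90)), rng.normal(size=(500, 10))])
y = rng.integers(0, 2, 500)

model = XGBClassifier(n_estimators=50, max_depth=3).fit(X, y)

gain = model.get_booster().get_score(importance_type="gain")   # dict: feature -> average gain
total = sum(gain.values())
for feat, g in sorted(gain.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(f"{feat}: {g / total:.3f}")      # share of total gain, as in the question
```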
When you say "any suggestions", depends what exactly do you want, it could be one of the following:
a) To find which of the other 90 binary categorical features give the most information gain
b) To train a suboptimal tree just to find out which features those are
c) To engineer some "compound" features by combining the binary features into n-bit categorical features which have more information gain (while being sure to remove the individual binary features from the input)
d) You could look into association rules: see What is the practical difference between association rules and decision trees in data mining?
If you want to explore a)...c), I suggest something vaguely like this:
Exclude various subsets of the 10 continuous variables, then see which binary features show up as having the most gain. Let's say that gives you N candidate features; N will be << 90, and let's assume N < 20 to keep the following computationally manageable.
Then compute a pairwise measure of association or correlation (Spearman or Kendall) between the N features (see the sketch after these steps). Look at a corrplot. Pick the clusters of variables which are most associated with each other. Create compound n-bit variables which combine those individual binary features. Then retrain the tree, including the compound variables and excluding their individual binary constituents (to avoid changing the total variance in the input).
Iterate over excluding various subsets of the 10 continuous variables and see which patterns emerge in your compound variables. I'm sure there's an algorithm for doing this (compound feature engineering of n-bit categoricals) more formally and methodically; I just don't know it.
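A rough sketch of that association step (the file and column names are hypothetical): compute the pairwise rank correlation among the N candidate binaries, inspect the clusters, and pack a cluster into one compound categorical by treating its bits as a binary number.

```python
import pandas as pd

# X is assumed to hold only the N candidate binary features
X = pd.read_csv("candidates.csv")            # placeholder

assoc = X.corr(method="kendall").abs()        # pairwise association matrix
print(assoc.round(2))                         # or plot it as a corrplot/heatmap

# suppose inspection shows b3, b7, b12 form a tight cluster: pack them into one
# 3-bit categorical (values 0..7) and drop the individual bits
cluster = ["b3", "b7", "b12"]
X["b3_b7_b12"] = X[cluster].astype(int).dot([4, 2, 1])
X = X.drop(columns=cluster)
```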
Anyway, for hacking a tree-based method for better performance, I imagine the most naive way is "at every step, pick the two most highly-correlated/associated categorical features and combine them". Then retrain the tree (include new feature, exclude its constituent features) and use the revised gain numbers.
Perhaps a more robust way might be (sketched in code after these steps):
Pick some threshold T for correlation/association, say start at a high level T = 0.9 or 0.95
At each step, merge any features whose absolute correlation/association to each other >= T
If there were no merges at this step, reduce T by some value (like T -= 0.05) or ratio (e.g. T *= 0.9). If there are still no merges, keep reducing T until there are merges, or until you hit some termination value (e.g. T = 0.03).
Retrain the tree including the compound variables, excluding their constituent subvariables.
Now go back and retrain what should be an improved tree with all 10 continuous variables, and your compound categorical features.
Or you could early-terminate the compound feature selection to see what the full retrained tree looks like.
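A minimal sketch of that loop, assuming the candidate binary features sit in a pandas DataFrame; the packing scheme is just one way to build the compound values.

```python
import pandas as pd

def merge_correlated(X, t_start=0.95, t_step=0.05, t_min=0.03):
    """Greedily merge the most-associated pair of features into a compound
    categorical, lowering the threshold T whenever no pair qualifies."""
    X = X.copy()
    T = t_start
    while T >= t_min:
        assoc = X.corr(method="spearman").abs()
        cols = list(assoc.columns)
        pairs = [(assoc.loc[a, b], a, b)
                 for i, a in enumerate(cols) for b in cols[i + 1:]
                 if assoc.loc[a, b] >= T]
        if not pairs:
            T -= t_step                          # no merge at this level: relax T
            continue
        _, a, b = max(pairs)                     # most associated pair
        base = int(X[b].max()) + 1               # pack the two values without collisions
        X[f"{a}__{b}"] = X[a].astype(int) * base + X[b].astype(int)
        X = X.drop(columns=[a, b])
    return X

# usage: X_merged = merge_correlated(X_binary), then retrain the tree on X_merged
```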
This issue arose in the 2014 Kaggle Allstate Purchase Prediction Challenge, where the policy coverage options A,B,C,D,E,F,G were each categoricals with between 2 and 4 values, and very highly correlated with each other. See that competition's forums and published solutions for more. Be aware that policy = (A,B,C,D,E,F,G) is the output, while the previously held option for C, "C_previous", is an input variable.
Some general fast-and-dirty rules-of-thumb on feature selection from Kaggle are:
throw out any near-constant / very-low-variance variables (because they have near-zero information content)
throw out any very-high-cardinality categorical variables (cardinality >~ training-set-size/2), (because they will also tend to have low information content, but cause lots of spurious overfitting and blow up training time). This can include customer IDs, row IDs, transaction IDs, sequence IDs, and other variables which shouldn't be trained on in the first place but accidentally ended up in the training set.
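A quick sketch of both rules of thumb (the file name and thresholds are arbitrary):

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

df = pd.read_csv("train.csv")                  # placeholder

# 1. drop near-constant columns
numeric = df.select_dtypes("number")
keep = VarianceThreshold(threshold=1e-4).fit(numeric).get_support()
df = df.drop(columns=numeric.columns[~keep])

# 2. drop very-high-cardinality categoricals (IDs and the like)
for col in df.select_dtypes("object"):
    if df[col].nunique() > len(df) / 2:
        df = df.drop(columns=col)
```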
I can suggest a few things for you to try.
Test your model without the continuous features (i.e. with only the 90 binary ones) and evaluate the decrease in your score. If it's insignificant, you might want to remove those features.
Turn them into groups.
For example, age can be categorized into groups: 0 for ages 0-7, 1 for ages 8-16, 2 for ages 17-25, and so on (see the sketch below).
Turn them into binary. An out-of-the-box idea on how to choose the best value to binarize them: build a tree with a single node (max depth = 1) using only one of the continuous features, then dump the model to a .txt file and see the value it chose to split on. Using this value, you can transform that whole feature column into a binary one (also sketched below).
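A sketch of both ideas; the file and column names are made up, and instead of parsing a text dump you can also read the split point straight off a depth-1 scikit-learn tree:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("train.csv")                 # placeholder; assumes "age" and "target" columns

# 1. bin a continuous feature into groups
df["age_group"] = pd.cut(df["age"], bins=[0, 7, 16, 25, 120], labels=[0, 1, 2, 3])

# 2. find a single best binary split point with a depth-1 tree ("stump")
stump = DecisionTreeClassifier(max_depth=1).fit(df[["age"]], df["target"])
threshold = stump.tree_.threshold[0]          # the value the root node splits on
df["age_binary"] = (df["age"] > threshold).astype(int)
```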
I'm dealing with very similar problems myself right now, so I'll be happy to hear your results and the paths you chose to try.
I learned a lot from the answer by smci, so I would recommend following those suggestions.
In the case when your binary categorical features are in fact one-hot-encoded (OHE) representations of several categorical features with several classes each, you can follow two more approaches:
Convert OHE into label encoding. Yes, this has the caveat that it introduces an order into a categorical feature, which might be meaningless, for example green=3 > red=2 > blue=1. But in practice it seems that trees handle label-encoded categorical variables (even with a meaningless order) reasonably well.
Convert OHE into target-/mean-/likelihood encoding. This is tricky, because you need to apply regularisation to avoid data leakage.
Both of those ideas are meant to group several binary features into a single one based on prior knowledge about feature meaning. If you do not have that luxury, you can also try to deduce such groups by taking the scalar product of pairs of columns and finding the pairs whose product is zero.
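A sketch of that last idea: one-hot columns that came from the same original categorical are mutually exclusive, so their pairwise scalar products are zero; once such a group is found you can collapse it back into one label-encoded column (the file and column names here are made up).

```python
import numpy as np
import pandas as pd

X = pd.read_csv("ohe_features.csv")           # placeholder: every column is 0/1

G = X.T.dot(X)                                 # G.loc[a, b] == 0  =>  a and b are never 1 together

# hypothetical group of mutually exclusive columns found by inspecting G
group = ["colour_red", "colour_green", "colour_blue"]
off_diag = ~np.eye(len(group), dtype=bool)
assert (G.loc[group, group].values[off_diag] == 0).all()

# collapse the group back into a single label-encoded column
# (assumes every row has exactly one 1 within the group)
X["colour"] = X[group].values.argmax(axis=1)
X = X.drop(columns=group)
```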
How can I convert or visualize a decision table as a decision tree graph?
Is there an algorithm to do it, or software to visualize it?
For example, I want to visualize my decision table below:
http://i.stack.imgur.com/Qe2Pw.jpg
Gotta say that is an interesting question.
I don't know the definitive answer, but I'd propose such a method:
use a Karnaugh map to turn your decision table into a minimized boolean function
turn your function into a tree
Let's simplify with an example, and assume that the Karnaugh map gave you the function (a and b) or c or d. You can turn that into a tree like this:
(Tree diagram from the original answer; source: my own.)
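Since the diagram isn't reproduced here, a rough rendering of that tree as nested decisions (each if is an internal node, each return a leaf):

```python
def decide(a: bool, b: bool, c: bool, d: bool) -> bool:
    """Decision tree for (a and b) or c or d."""
    if c:          # root: test c
        return True
    if d:          # next level: test d
        return True
    if a:          # then a, and finally b
        return b
    return False
```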
It is certainly easier to generate a decision table from a decision tree than the other way around.
But the way I see it, you could convert your decision table into a data set. Let 'Disease' be the class attribute and treat the evidence columns as simple binary instance attributes. From that you can easily generate a decision tree using one of the available decision tree induction algorithms, for example C4.5. Just remember to disable pruning and lower the minimum-number-of-objects parameter.
During that process you would lose a bit of information, but the accuracy would remain the same. Take a look at the two rows describing disease D04: the second row is in fact more general than the first. A decision tree generated from this data would recognize that disease from the E11, E12 and E13 attributes alone, since that is enough to correctly label the instance.
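A minimal sketch of that idea (C4.5 itself isn't in scikit-learn, so this uses its CART-style tree with the entropy criterion; the table is assumed to have 0/1 evidence columns plus a Disease column):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

table = pd.read_csv("decision_table.csv")    # placeholder: evidence columns + "Disease"
X = table.drop(columns="Disease")
y = table["Disease"]

# no pruning, allow single-row leaves, so every rule can be represented
tree = DecisionTreeClassifier(criterion="entropy", min_samples_leaf=1, ccp_alpha=0.0)
tree.fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))
```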
I've spent a few hours looking for a good algorithm, but I'm happy with my results.
My code is too dirty right now to paste here (I can share it privately on request, at your discretion), but the general idea is as follows.
Assume you have a data set with some decision criteria and an outcome.
1. Define a tree structure (e.g. data.tree in R) and create a "Start" root node.
2. Calculate the outcome entropy of your data set. If the entropy is zero, you are done.
3. Using each criterion, one by one, as a candidate tree node, calculate the entropy of all branches created with this criterion and take the minimum entropy over those branches.
4. The branches created by the criterion with the smallest (minimum) entropy become your next tree node. Add them as child nodes.
5. Split your data according to the decision point/tree node found in step 4 and remove the criterion that was used.
6. Repeat steps 2-5 for each branch until all branches have entropy = 0.
Enjoy your ideal decision tree :)
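A minimal Python sketch of that loop, assuming the table lives in a pandas DataFrame with an "outcome" column (the original code was in R with data.tree; this version ranks criteria by weighted branch entropy rather than the minimum, but the structure follows the steps above):

```python
import numpy as np
import pandas as pd

def entropy(labels):
    p = labels.value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

def build_tree(df, criteria, outcome="outcome"):
    """Recursively pick the criterion whose split gives the lowest weighted entropy."""
    if entropy(df[outcome]) == 0 or not criteria:        # pure node or no criteria left
        return df[outcome].iloc[0]                       # (a sketch; majority vote would be safer)
    def weighted_entropy(col):
        return sum(len(g) / len(df) * entropy(g[outcome]) for _, g in df.groupby(col))
    best = min(criteria, key=weighted_entropy)
    remaining = [c for c in criteria if c != best]       # remove the criterion that was used
    return {best: {value: build_tree(group, remaining, outcome)   # split and recurse
                   for value, group in df.groupby(best)}}

# toy usage
df = pd.DataFrame({"weather": ["sun", "sun", "rain", "rain"],
                   "wind":    ["low", "high", "low", "high"],
                   "outcome": ["play", "play", "play", "stay"]})
print(build_tree(df, ["weather", "wind"]))
```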
Suppose that for a given ML problem, we have a feature describing which car the person possesses. We can encode this information in one of the following ways:
Assign an id to each car. Make a column 'CAR_POSSESSED' and put the car's id as the value.
Make a column for each car and put 0 or 1 according to whether that car is possessed by the sample in question. Columns will be like "BMW_POSSESSED", "AUDI_POSSESSED".
In my experiments the 2nd way performed much better than the 1st when tried with an SVM.
How does the choice of encoding affect model learning, and are there resources in which the effect of encoding has been studied? Or do we need to use trial and error to check what performs best?
The problem with the first way is that you use arbitrary numbers to represent the categories (e.g. BMW=2, etc.) and the SVM takes those numbers seriously, as if they had an order: e.g. it may try to use cases with CAR_POSSESSED > 3 for the prediction.
So the second way is better.
Chapter 2.1 Categorical Features:
http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf
You'll find many more if you search for "svm Categorical Features"
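A small sketch of the two encodings with scikit-learn (the car labels are made up):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

cars = pd.DataFrame({"car": ["bmw", "audi", "bmw", "none", "toyota"]})

# first way: arbitrary integer ids; the SVM will treat them as if they were ordered
print(OrdinalEncoder().fit_transform(cars))

# second way: one indicator column per car, no spurious ordering
print(OneHotEncoder().fit_transform(cars).toarray())
```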
I have a decision tree trained on the columns (Age, Sex, Time, Day, Views, Clicks), which classifies into two classes, Yes or No, representing the buying decision for an item X.
Using these values, I'm trying to predict probabilities for 1000 samples (customers) that look like ('12', 'Male', '9:30', 'Monday', '10', '3'), ('50', 'Female', '10:40', 'Sunday', '50', '6'), and so on.
I want to get an individual probability or score that will help me recognize which customers are most likely to buy item X, so that I can sort the predictions and show the item to only the 5 customers most likely to buy it.
How can I achieve this ?
Will a decision tree serve the purpose?
Is there any other method?
I'm new to ML so please forgive me for any vocabulary errors.
Using a decision tree with a small sample set, you will definitely run into overfitting problems, especially at the lower levels of the tree, where you will have exponentially less data to train your decision boundaries. Your data set should have many more samples than the number of categories, and enough samples for each category.
Speaking of decision boundaries, make sure you understand how you are handling the data type of each dimension. For example, 'sex' is categorical, whereas 'age', 'time of day', etc. are real-valued inputs (discrete or continuous), so different parts of your tree will need different treatment. Otherwise, your model might end up handling 9:30, 9:31, 9:32, ... as separate classes.
Try some other algorithms, starting with simple ones like k-nearest neighbours (KNN). Have a validation set to test each algorithm. Use Matlab (or similar software) where you can use libraries to quickly try different methods and see which one works best. There is not enough information here to recommend something very specific. Plus,
I suggest you try KNN too. KNN is able to capture affinity in the data. Say product X is bought by people around age 20, during evenings, after about 5 clicks on the product page; KNN will be able to tell you how close each new customer is to the customers who bought the item. Based on this you can just pick the top 5. It is very easy to implement and works great as a benchmark for more complex methods.
(Assuming Views and Clicks mean the number of views and clicks by each customer for product X.)
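A sketch of that with scikit-learn, handling the mixed column types explicitly (column names follow the question; the time column is assumed to be already converted to a numeric Hour, and "Bought" is the assumed target column):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("customers.csv")     # placeholder: Age, Sex, Hour, Day, Views, Clicks, Bought

pre = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex", "Day"]),
    ("num", StandardScaler(), ["Age", "Hour", "Views", "Clicks"]),
])
knn = make_pipeline(pre, KNeighborsClassifier(n_neighbors=15))
knn.fit(df.drop(columns="Bought"), df["Bought"])
```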
A decision tree is a classifier, and in general it is not suitable as a basis for a recommender system. But, given that you are only predicting the likelihood of buying one item, not tens of thousands, it kind of makes sense to use a classifier.
You simply score all of your customers and retain the 5 whose probability of buying X is highest, yes. Is there any more to the question?
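Concretely, with any fitted scikit-learn classifier (the KNN pipeline above, a decision tree, whatever), the ranking is just the sorted predicted probabilities; a sketch assuming the labels are 'Yes'/'No' as in the question and that new_customers holds the 1000 rows to score:

```python
import numpy as np

# probability of the "Yes" class for each new customer
proba = knn.predict_proba(new_customers)[:, list(knn.classes_).index("Yes")]
top5 = np.argsort(proba)[::-1][:5]       # indices of the five customers most likely to buy X
print(top5, proba[top5])
```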