What does depth of decision tree depend on? - machine-learning

Below is a parameter for DecisionTreeClassifier: max_depth
http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
max_depth : int or None, optional (default=None)
The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
I always thought that the depth of a decision tree should be less than or equal to the number of features (attributes) in a given dataset. What if we find pure classes before reaching the value given for that parameter? Does it stop splitting, or does it keep splitting until it reaches that value?
Is it possible to use the same attribute at two different levels of a decision tree while splitting?

If the number of features is very high, a decision tree can grow very large. To answer your question: yes, it will stop splitting once it finds a pure class.
This is another reason decision trees tend to overfit.
You would like to use the max_depth parameter when you are using Random Forest, which does not select all features for any specific tree, so the trees are not all expected to grow to the maximum possible depth, which in turn would require pruning. Decision trees are weak learners, and in a Random Forest these depth-limited trees take part in voting. More details on the relation between RF and DT are easy to find online; there is a range of published articles.
So, generally, you would like to use max_depth when you have a large number of features. Also, in practice you would usually use RandomForest rather than a single DecisionTree.
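The two points in the question (stopping early on pure leaves, and reusing an attribute at several levels) can be checked with a small experiment. This is only a minimal sketch on a made-up one-feature dataset, assuming scikit-learn's DecisionTreeClassifier; the data and variable names are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data (an assumption for illustration): a single feature whose labels
# alternate in blocks, so one threshold cannot separate the classes.
X = np.arange(8, dtype=float).reshape(-1, 1)
y = np.array([0, 0, 1, 1, 0, 0, 1, 1])

# max_depth is only an upper bound; splitting stops once every leaf is pure.
clf = DecisionTreeClassifier(max_depth=10, random_state=0).fit(X, y)

depth = clf.get_depth()
n_features = X.shape[1]
train_acc = clf.score(X, y)

# tree_.feature holds the feature index used at each internal node
# (negative values mark leaves).
used_features = clf.tree_.feature[clf.tree_.feature >= 0]

print("depth:", depth, "n_features:", n_features)
print("features used at internal nodes:", used_features)
print("training accuracy:", train_acc)
```

Here the tree is deeper than the number of features because the same feature is split on again with a different threshold, and it stays well below max_depth because the leaves become pure first.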

Related

Why in some cases random forest with n_estimators equals to 1 performs worse than decision tree [duplicate]

Why, in some cases, does a random forest with n_estimators equal to 1 perform worse than a decision tree, even after setting bootstrap to False?
I am trying different machine learning models to predict credit card default rates. I tried a random forest and a decision tree, but the random forest seems to perform worse. I then tried a random forest with only 1 tree, which is supposed to be the same as a decision tree, but it still performed worse.
A specific answer to your observations depends on the implementation of the decision tree (DT) and random forest (RF) methods that you're using. That said, there are three most likely reasons:
bootstrapping: Although you mention that you set that to False, in the most general form, RFs use two forms of bootstrapping: of the dataset and of the features. Perhaps the setting only controls one of these. Even if both of these are off, some RF implementations have other parameters that control the number of attributes considered for each split of the tree and how they are selected.
tree hyperparameters: Related to my remark on the previous point, the other aspect to check is whether all of the other tree hyperparameters are the same. Tree depth, number of points per leaf node, etc. would all have to be matched to make the methods directly comparable.
growing method: Lastly, it is important to remember that trees are learned via indirect/heuristic losses that are often greedily optimized. Accordingly, there are different algorithms to grow the trees (e.g., C4.5), and the DT and RF implementation may be using different approaches.
If all of these match, then the differences should really be minor. If there are still differences (i.e., "in some cases"), these may be because of randomness in initialization and the greedy learning schemes which lead to suboptimal trees. That is the main reason for RFs, in which the ensemble diversity is used to mitigate these issues.
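As a rough illustration of matching the settings, here is a sketch that assumes scikit-learn is the implementation in question (the original post does not say so) and uses a synthetic dataset. It turns off both sources of resampling in the forest: bootstrap=False for the rows and max_features=None for the features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the credit card data (an assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

dt = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# A one-tree forest with row bootstrapping and feature subsampling disabled.
rf = RandomForestClassifier(
    n_estimators=1,
    bootstrap=False,     # use the full training set, no resampling of rows
    max_features=None,   # consider every feature at every split
    random_state=0,
).fit(X_train, y_train)

dt_train_acc = dt.score(X_train, y_train)
rf_train_acc = rf.score(X_train, y_train)

# Fraction of test points on which the two models give the same prediction.
agreement = (dt.predict(X_test) == rf.predict(X_test)).mean()

print("train accuracy  DT:", dt_train_acc, " RF:", rf_train_acc)
print("test accuracy   DT:", dt.score(X_test, y_test), " RF:", rf.score(X_test, y_test))
print("test agreement:", agreement)
```

Any remaining disagreement on the test set would come from how ties between equally good splits are broken, since the tree inside the forest gets its own random seed.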

Which predictive models in sklearn are affected by the order of the columns in the training dataframe?

I'm wondering if any of the estimators that scikit-learn provides are affected by the order of the columns in the dataframe on which they are trained. I tried establishing a baseline by using ExtraTreesRegressor and it came out to 3 different scores:
.531687 for the regular order
.535309 for the reverse order
.554458 for the regular order
Obviously ExtraTreesRegressor is not a good example here, so I tried LinearRegression, but it gave .295898 no matter what the order of the columns was.
What I want to know is if there are ANY estimators that are affected by the order of the columns and if there are not then can you point me in the direction of some way, or provide some code, that I can use to make sure that the order of the columns does matter?
Any algorithm that involves some randomness in selecting features while building the model is expected to be affected by their order; AFAIK, the only such cases in scikit-learn are Extra Trees and Random Forest (in both their incarnations, as classifiers and as regressors), which indeed share some similarities.
The smoking gun for such a behavior is the argument max_features; from the RF docs (the description is identical in the Extra Trees as well):
max_features : {"auto", "sqrt", "log2"}, int or float, default="auto"
The number of features to consider when looking for the best split
I am not aware of other algorithms that involve such kind of random feature selection (linear models, decision trees, SVMs, naive Bayes, neural nets, and gradient boosted trees do not), but if you glimpse something similar enough in the documentation, you can bet that the respective algorithm is also affected by the order of the features.
Keep in mind that such slight discrepancies that should not happen in theory are rather to be expected in models where randomness enters from way too many angles. For a similar case with RF in R (slightly different results when asking for importance=TRUE), check my answer in Why does the importance parameter influence performance of Random Forest in R?
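One way to test a given estimator is to fit it twice, once with the columns in their original order and once reversed, and compare the scores. The following is a sketch on synthetic data; the helper score_with_order is a hypothetical name introduced here, not part of scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic regression data (an assumption for illustration).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

original_cols = np.arange(X.shape[1])
reversed_cols = original_cols[::-1]

def score_with_order(model, cols):
    """Fit on the columns in the given order and return the test R^2."""
    model.fit(X_train[:, cols], y_train)
    return model.score(X_test[:, cols], y_test)

lin_scores = [
    score_with_order(LinearRegression(), cols)
    for cols in (original_cols, reversed_cols)
]
et_scores = [
    score_with_order(ExtraTreesRegressor(n_estimators=50, random_state=0), cols)
    for cols in (original_cols, reversed_cols)
]

print("LinearRegression:   ", lin_scores)
print("ExtraTreesRegressor:", et_scores)
```

The linear model should give the same score up to floating-point error, while the Extra Trees scores may differ even with a fixed random_state, because the same random draws now pick different features.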

Decision Tree Performance, ML

If we don't give any constraints such as max_depth or a minimum number of samples per node, can a decision tree always give 0 training error, or does it depend on the dataset? What about the dataset shown?
edit: It is possible to have a split which results in lower accuracy than the parent node, right? According to the theory of decision trees, it should stop splitting there, even if the end result after several more splits could be good. Am I correct?
Decision tree will always find a split that improves accuracy/score
For example, I've built a decision tree on data similar to yours:
A decision tree can get to 100% accuracy on any data set where there are no 2 samples with the same feature values but different labels.
This is one reason why decision trees tend to overfit, especially on many features or on categorical data with many options.
Indeed, sometimes, we prevent a split in a node if the improvement created by the split is not high enough. This is problematic as some relationships, like y=x_1 xor x_2 cannot be expressed by trees with this limitation.
So commonly, a tree doesn't stop because it cannot improve the model on training data.
The reason you don't see trees with 100% accuracy is because we use techniques to reduce overfitting, such as:
Tree pruning like this relatively new example. This basically means that you build your entire tree, but then you go back and prune nodes that did not contribute enough to the model's performance.
Using a ratio instead of gain for the splits. Basically this is a way to express the fact that we expect less improvement from a 50%-50% split than a 10%-90% split.
Setting hyperparameters, such as max_depth and min_samples_leaf, to prevent the tree from splitting too much.
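The XOR remark and the "same features, different labels" condition can both be reproduced in a few lines. This is a sketch that assumes scikit-learn's DecisionTreeClassifier and uses its min_impurity_decrease parameter as one example of a "minimum improvement" rule; the tiny datasets are made up for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# y = x_1 xor x_2: no single split improves purity, but two levels solve it.
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])

# Unconstrained tree: keeps splitting even when the first split gains nothing.
free_tree = DecisionTreeClassifier(random_state=0).fit(X_xor, y_xor)

# Tree that refuses any split whose impurity decrease is below 0.1.
gated_tree = DecisionTreeClassifier(
    min_impurity_decrease=0.1, random_state=0
).fit(X_xor, y_xor)

free_acc = free_tree.score(X_xor, y_xor)
gated_acc = gated_tree.score(X_xor, y_xor)

# Two samples with identical features but different labels: no tree can
# classify both correctly, so training accuracy stays below 100%.
X_clash = np.array([[0], [0], [1]])
y_clash = np.array([0, 1, 1])
clash_acc = DecisionTreeClassifier(random_state=0).fit(X_clash, y_clash).score(
    X_clash, y_clash
)

print("XOR, unconstrained:       ", free_acc)
print("XOR, min improvement rule:", gated_acc)
print("contradictory samples:    ", clash_acc)
```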

Regression trees with standard deviation reduction

I have a data set of 1k records and my job is to do a decision algorithm based on those records.
Here is what I can share:
The target is a continuous value.
Some of the predictors (or attributes) are continuous values,
some of them are discrete and some are arrays of discrete values
(there can be more than one option)
My initial thoughts were to separate the arrays of discrete values and make them individual features (predictors). For the continuous values in the predictors I was thinking about just randomly picking a few decision boundaries and see which one reduces the entropy the most. Then make a decision tree (or a random forest) which use standard deviation reduction when creating the tree.
My question is: Am I on the right path? Is there a better way to do that?
I know this probably comes a bit late, but what you are searching for are model trees. Model trees are decision trees with continuous rather than categorical values in the leaves. In general, these values are predicted by linear regression models. One of the more prominent model trees, and one that more or less suits your needs, is the M5 model tree introduced by Quinlan. Wang and Witten re-implemented M5 and extended its functionality so that it can handle both continuous and categorical attributes. Their version is called M5'; you can find an implementation e.g. in Weka. The only thing left would be to handle the arrays. However, your description is a bit generic in that respect. From what I gather, your choices are either flattening or, as you suggested, separating them.
Note that, since Wang and Witten's work, more sophisticated model trees have been introduced. However, M5' is robust and does not need any parameterization in its original formulation, which makes it easy to use.
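For reference, the standard deviation reduction criterion mentioned in the question can be sketched for a single continuous predictor as follows. This is a minimal illustration, not the M5 implementation: the function names are hypothetical, and the candidate boundaries are taken as midpoints between sorted distinct values rather than picked at random.

```python
import numpy as np

def sd_reduction(x, y, threshold):
    """Standard deviation of y minus the size-weighted SD of the two children."""
    left, right = y[x <= threshold], y[x > threshold]
    weighted_child_sd = (len(left) * left.std() + len(right) * right.std()) / len(y)
    return y.std() - weighted_child_sd

def best_threshold(x, y):
    """Try every midpoint between consecutive distinct x values."""
    values = np.unique(x)  # sorted distinct values
    candidates = (values[:-1] + values[1:]) / 2.0
    reductions = [sd_reduction(x, y, t) for t in candidates]
    best = int(np.argmax(reductions))
    return candidates[best], reductions[best]

# Toy data: the target jumps from 1 to 5 between x = 3 and x = 10.
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([1.0, 1.0, 1.0, 5.0, 5.0, 5.0])

threshold, reduction = best_threshold(x, y)
print("best threshold:", threshold, "SD reduction:", reduction)
```

A regression tree built this way applies the same search to every predictor at every node and splits on the predictor and threshold with the largest reduction.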

Multivariate Decision Tree learner

Many univariate decision tree learner implementations (C4.5 etc.) exist, but does anyone actually know of multivariate decision tree learner algorithms?
Bennett and Blue's A Support Vector Machine Approach to Decision Trees does multivariate splits by using embedded SVMs for each decision in the tree.
Similarly, in Multicategory classification via discrete support vector machines (2009) , Orsenigo and Vercellis embed a multicategory variant of discrete support vector machines (DSVM) into the decision tree nodes.
The CART algorithm for decision trees can be made multivariate. CART is a binary splitting algorithm, as opposed to C4.5, which creates a node per unique value for discrete attributes. They use the same algorithm for MARS as for missing values too.
To create a multivariate tree, you compute the best split at each node, but instead of throwing away all the splits that weren't the best, you keep a portion of them (maybe all). You then evaluate all of the data's attributes against each of the potential splits at that node, weighted by their order. So the first split (which led to the maximum gain) is weighted at 1, the next-highest-gain split is weighted by some fraction < 1.0, and so on, with the weights decreasing as the gain of the split decreases. That number is then compared to the same calculation of the nodes within the left node: if it's above that number, go left; otherwise, go right. That's a pretty rough description, but that's a multivariate split for decision trees.
Yes, there are some, such as OC1, but they are less common than ones which make univariate splits. Adding multivariate splits expands the search space enormously. As a sort of compromise, I have seen some logical learners which simply calculate linear discriminant functions and add them to the candidate variable list.
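The compromise described in the last answer, adding a linear discriminant function to the candidate variable list, might look roughly like this with scikit-learn. It is a sketch on synthetic data with a diagonal class boundary, not an implementation of OC1 or of any of the cited papers.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption): the class boundary is the diagonal x1 = x2,
# which axis-aligned splits can only approximate with a staircase.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(300, 2))
y = (X[:, 0] > X[:, 1]).astype(int)

# Fit a linear discriminant and append its output as an extra column.
lda = LinearDiscriminantAnalysis().fit(X, y)
lda_feature = lda.decision_function(X).reshape(-1, 1)
X_aug = np.hstack([X, lda_feature])

plain_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
aug_tree = DecisionTreeClassifier(random_state=0).fit(X_aug, y)

print("depth with original features only:", plain_tree.get_depth())
print("depth with discriminant feature:  ", aug_tree.get_depth())
```

A split on the appended column is a split on a linear combination of the original features, so the tree still uses a univariate learner while effectively making a multivariate cut.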

Resources