Why can't we select the root node randomly in Decision Trees?

I just got into learning about Decision Trees. So the questions might be a bit silly.
The idea of selecting the root node is a bit confusing. Why can't we randomly select the root node? The only difference it seems to make is that it would make the decision tree longer and more complex, but it would eventually get the same result.
Also, as an extension of the feature selection process in decision trees: why can't we use something as simple as the correlation between a feature and the target, or a Chi-Square test, to figure out which feature to start with?

Why can't we randomly select the root node?
We can, but the same question then applies to the root's child nodes, and to their children, and so on...
The only difference it seems to make is that it would make the Decision Tree longer and complex, but would get the same result eventually.
The more complex the tree is, the higher variance it will have, meaning two things:
small changes in the training dataset can greatly affect the shape of the tree
it overfits the training set
Neither of these is good, and even if you make a sensible choice at each step, based on entropy or the Gini impurity index, you will still probably end up with a larger tree than you would like. Yes, that tree might have good accuracy on the training set, but it will probably overfit.
Most of the algorithms that use decision trees have their own ways to combat this variance, in one way or another. For the plain decision tree algorithm itself, the way to reduce variance is to first grow the tree and then prune it afterwards to make it smaller and less prone to overfitting. Random forests address it by averaging over a large number of trees while randomly restricting which predictors can be considered for a split every time that decision has to be made.
So, randomly picking the root node will eventually lead to the same result, but only on the training set, and only once the overfitting is so extreme that the tree fits the training set with 100% accuracy. But the more the tree overfits the training set, the lower its accuracy on a test set will be (in general), and we care about accuracy on the test set, not on the training set.
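To make the pruning point concrete, here is a minimal sketch (not from the original answer; it assumes scikit-learn, and the dataset and ccp_alpha value are arbitrary illustrations) comparing a fully grown tree with a cost-complexity-pruned one:

# Sketch: unpruned vs. cost-complexity-pruned tree (scikit-learn >= 0.22).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fully grown tree: typically near-100% train accuracy, lower test accuracy.
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Pruned tree: ccp_alpha penalizes tree size, trading train fit for variance.
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)

for name, tree in [("full", full), ("pruned", pruned)]:
    print(name, tree.get_n_leaves(),
          tree.score(X_train, y_train), tree.score(X_test, y_test))

The pruned tree is smaller and usually gives up a little training accuracy in exchange for better (or at least no worse) test accuracy.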

Why does a random forest with n_estimators equal to 1 sometimes perform worse than a decision tree?

Why does a random forest with n_estimators equal to 1 sometimes perform worse than a decision tree, even after setting bootstrap to False?
I'm trying different machine learning models to predict credit card default rates. I tried a random forest and a decision tree, but the random forest seemed to perform worse. Then I tried a random forest with only one tree, which is supposed to be the same as a decision tree, but it still performed worse.
A specific answer to your observations depends on the implementations of the decision tree (DT) and random forest (RF) methods that you're using. That said, there are three likely reasons:
bootstrapping: Although you mention that you set that to False, in the most general form, RFs use two forms of bootstrapping: of the dataset and of the features. Perhaps the setting only controls one of these. Even if both of these are off, some RF implementations have other parameters that control the number of attributes considered for each split of the tree and how they are selected.
tree hyperparameters: Related to my remark on the previous point, the other aspect to check is whether all of the other tree hyperparameters are the same. Tree depth, number of points per leaf node, etc.: these would all have to be matched to make the methods directly comparable.
growing method: Lastly, it is important to remember that trees are learned via indirect/heuristic losses that are often greedily optimized. Accordingly, there are different algorithms to grow the trees (e.g., C4.5), and the DT and RF implementation may be using different approaches.
If all of these match, then the differences should really be minor. If there are still differences (i.e., "in some cases"), these may be because of randomness in initialization and the greedy learning schemes which lead to suboptimal trees. That is the main reason for RFs, in which the ensemble diversity is used to mitigate these issues.
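As a rough sketch of how to line the two up (assuming scikit-learn, which may not be the library in use; the data and parameter values are illustrative), the key is to disable row bootstrapping and feature subsampling, and to match the remaining tree hyperparameters:

# Sketch: making a 1-tree random forest comparable to a plain decision tree.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

dt = DecisionTreeClassifier(random_state=0)

# Key settings: no bootstrap sample of the rows, and consider *all*
# features at each split (RFs default to a random subset, e.g. sqrt).
rf = RandomForestClassifier(n_estimators=1, bootstrap=False,
                            max_features=None, random_state=0)

print(dt.fit(X, y).score(X, y), rf.fit(X, y).score(X, y))

With these settings the two models should behave essentially identically, up to ties between equally good splits.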

Regression tree output

I'm confused about the intuition behind decision trees when used to model continuous targets in machine learning.
I understand that decision trees use splits based on feature values to decide which branches of the tree to go down to reach a leaf value.
It intuitively makes sense to me when doing inference for classification with nominal targets, because each leaf has a specific value (label), so after going down enough branches one eventually arrives at a discrete value, which is the label.
But if we're doing regression where a machine learning model predicts a value on a continuum, for example a real number between 0 and 100, how could there be enough leaves to allow the model to output any real number between 0 and 100?
Regression trees are only what you could call 'pseudo continuous', in contrast, for example, to linear regression models. At the 'leaves', the output takes a constant value over certain ranges of the independent variable(s), determined by the aforementioned 'splits'.
However, there exists some academic work that fits (regression) models in the nodes (...). See the accepted answer here:
https://stats.stackexchange.com/questions/439756/decision-tree-that-fits-a-regression-at-leaf-nodes
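A small sketch may help make the 'pseudo continuous' point concrete (assuming scikit-learn; the toy data is arbitrary). A depth-3 regression tree has at most 2^3 = 8 leaves, so it can output at most 8 distinct values over the whole 0-100 range:

# Sketch: a regression tree's output is piecewise constant, not continuous.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 100, size=(200, 1)), axis=0)
y = X.ravel() + rng.normal(scale=5, size=200)  # target on a 0-100 continuum

tree = DecisionTreeRegressor(max_depth=3).fit(X, y)
pred = tree.predict(np.linspace(0, 100, 1000).reshape(-1, 1))

print(np.unique(pred).size)  # at most 8: a step function over the input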

Decision Tree Performance, ML

If we don't set any constraints, such as max_depth or the minimum number of samples per node, can a decision tree always achieve 0 training error? Or does it depend on the dataset? What about the dataset shown?
edit: it is possible to have a split which results in lower accuracy than the parent node, right? According to decision tree theory it should stop splitting there, even if the end result after several more splits could be good. Am I correct?
A decision tree will always find a split that improves accuracy/score. For example, I've built a decision tree on data similar to yours (the resulting tree figure is omitted here).
A decision tree can get to 100% accuracy on any data set where there are no 2 samples with the same feature values but different labels.
This is one reason why decision trees tend to overfit, especially on many features or on categorical data with many options.
Indeed, sometimes we prevent a split in a node if the improvement created by the split is not high enough. This is problematic, as some relationships, like y = x1 XOR x2, cannot be expressed by trees with this limitation (see the sketch below).
So a tree does not usually stop because it cannot improve its fit on the training data.
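As a sketch of why the XOR relationship defeats a strictly-positive-gain rule (plain NumPy; the impurity function is the standard Gini formula):

# Sketch: for y = x1 XOR x2, no single-feature split improves Gini impurity,
# so a tree that demands a strictly positive gain would stop at the root.
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = X[:, 0] ^ X[:, 1]  # XOR target

parent = gini(y)
for f in (0, 1):
    left, right = y[X[:, f] == 0], y[X[:, f] == 1]
    child = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
    print(f"split on x{f + 1}: gain = {parent - child}")  # 0.0 for both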
The reason you don't usually see trees with 100% accuracy is that we use techniques to reduce overfitting, such as:
Tree pruning like this relatively new example. This basically means that you build your entire tree, but then you go back and prune nodes that did not contribute enough to the model's performance.
Using a gain ratio instead of raw gain for the splits. Basically, this is a way to express the fact that we expect less improvement from a 50%-50% split than from a 10%-90% split.
Setting hyperparameters, such as max_depth and min_samples_leaf, to prevent the tree from splitting too much.
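To illustrate the 100%-training-accuracy point (a sketch assuming scikit-learn; the random data is arbitrary), an unconstrained tree will happily memorize even pure-noise labels:

# Sketch: with default (unlimited) depth and distinct feature rows,
# a decision tree memorizes the training set, even with random labels.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = rng.integers(0, 2, size=500)  # pure noise labels

tree = DecisionTreeClassifier().fit(X, y)
print(tree.score(X, y))  # 1.0, even though the labels carry no signal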

Model selection for classification with random train/test sets

I'm working with an extremely unbalanced and heterogeneous multiclass (K = 16) database for research, with a small N ≈ 250. For some labels the database has a sufficient number of examples for supervised machine learning, but for others I have almost none. I'm also not in a position to expand my database, for a number of reasons.
As a first approach, I divided my database into training (80%) and test (20%) sets in a stratified way. On top of that, I applied several classification algorithms that produce some results. I applied this procedure over 500 stratified train/test splits (as each stratified sampling draws individuals randomly within each stratum), hoping to select an algorithm (model) that performs acceptably.
Because of my database, the performance on the test set varies greatly depending on the specific examples that end up in the train set. I'm dealing with runs that have accuracy as high (for my application) as 82% and runs as low as 40%. The median over all runs is around 67% accuracy.
When facing this situation, I'm unsure what the standard procedure is (if there is any) for selecting the best-performing model. My rationale is that the best-scoring model may generalize better because the specific examples selected for its training set are richer, so that the test set is better classified. However, I'm fully aware of the possibility of the test set being composed of "simpler" cases that are easier to classify, or of the train set comprising all the hard-to-classify cases.
Is there any standard procedure for selecting the best-performing model, considering that the distribution of examples in my train/test sets causes the results to vary greatly? Am I making a conceptual mistake somewhere? Do practitioners usually just select the best-performing model without any further exploration?
I don't like the idea of using the mean/median accuracy, as obviously some models generalize better than others, but I'm by no means an expert in the field.
Confusion matrix of the predicted labels on the test set for one of the best cases: [figure omitted]
Confusion matrix of the predicted labels on the test set for one of the worst cases: [figure omitted]
They both use the same algorithm and parameters.
Good Accuracy =/= Good Model
I first want to point out that good accuracy on your test set need not mean a good model in general! In your case this has mainly to do with your extremely skewed distribution of samples.
Especially when doing a stratified split with one class overwhelmingly represented, you will likely get good results by simply predicting that one class over and over again.
A good way to see if this is happening is to look at a confusion matrix (better picture here) of your predictions.
If one class attracts the predictions of samples from other classes as well, that is an indicator of a bad model. I would argue that in your case it will generally be very hard to find a good model unless you actively try to balance your classes more during training.
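A minimal sketch of that check (assuming scikit-learn; y_true and y_pred stand in for your own labels and predictions):

# Sketch: spotting a model that defaults to the majority class.
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 0, 0, 2]  # class 0 absorbs most predictions

print(confusion_matrix(y_true, y_pred))
# Rows are true classes, columns are predicted classes; a heavy first
# column means the model keeps predicting the majority class.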
Use the power of Ensembles
Another idea is indeed to use ensembling over multiple models (in your case, the models resulting from different splits), since an ensemble is generally assumed to generalize better.
Even if you sacrifice some accuracy on paper, I would bet that the confusion matrix of an ensemble is likely to look much better than that of a single "high accuracy" model. Especially if you disregard the models that perform extremely poorly (making sure, again, that the "poor" performance comes from an actually bad model and not just an unlucky split), I can see this generalizing very well.
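A sketch of that idea (assuming scikit-learn; the synthetic data and the number of models are arbitrary): train one tree per random stratified split, then average the predicted class probabilities:

# Sketch: soft-voting ensemble over models from different stratified splits.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=250, n_classes=4, n_informative=6,
                           random_state=0)
X_pool, X_hold, y_pool, y_hold = train_test_split(X, y, stratify=y,
                                                  random_state=0)

models = []
for seed in range(20):  # one model per random stratified split
    X_tr, _, y_tr, _ = train_test_split(X_pool, y_pool, train_size=0.8,
                                        stratify=y_pool, random_state=seed)
    models.append(DecisionTreeClassifier(random_state=seed).fit(X_tr, y_tr))

# Average class probabilities across models, then take the argmax.
proba = np.mean([m.predict_proba(X_hold) for m in models], axis=0)
print((proba.argmax(axis=1) == y_hold).mean())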
Try k-fold Cross-Validation
Another common technique is k-fold cross-validation. Instead of performing your evaluation on a single 80/20 split, you essentially divide your data into k equally large sets, and then always train on k-1 of them while evaluating on the remaining one. You then not only get a feeling for whether your split was reasonable (you usually get the results for all the different splits in k-fold CV implementations, like the one from sklearn), but you also get an overall score that tells you the average over all folds.
Note that 5-fold CV would equal a split into 5 20% sets, so essentially what you are doing now, plus the "shuffling part".
CV is also a good way to deal with little training data, in settings where you have imbalanced classes, or where you generally want to make sure your model actually performs well.
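A minimal sketch (assuming scikit-learn; the model and data are placeholders for your own):

# Sketch: stratified 5-fold CV reports per-fold scores plus their spread.
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=250, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print(scores, scores.mean(), scores.std())  # fold scores, average, spread

A large scores.std() relative to the mean is exactly the split-to-split variability you are observing.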

Decision Tree Uniqueness sklearn

I have some questions regarding decision tree and random forest classifier.
Question 1: Is a trained Decision Tree unique?
I believe that it should be unique, as it maximizes information gain over each split. If it is unique, why is there a random_state parameter in the decision tree classifier? As it is unique, it will be reproducible every time, so there should be no need for random_state.
Question 2: What does a decision tree actually predict?
While going through the random forest algorithm, I read that it averages the probability of each class over its individual trees. But as far as I know, a decision tree predicts a class, not the probability of each class.
Even without checking out the code, you will see this note in the docs:
The features are always randomly permuted at each split. Therefore, the best found split may vary, even with the same training data and max_features=n_features, if the improvement of the criterion is identical for several splits enumerated during the search of the best split. To obtain a deterministic behaviour during fitting, random_state has to be fixed.
For splitter='best', this is happening here:
# Draw a feature at random
f_j = rand_int(n_drawn_constants, f_i - n_found_constants,
               random_state)
And for your other question, read this:
...
Just build the tree so that the leaves contain not just a single class estimate, but also a probability estimate as well. This could be done simply by running any standard decision tree algorithm, and running a bunch of data through it and counting what portion of the time the predicted label was correct in each leaf; this is what sklearn does. These are sometimes called "probability estimation trees," and though they don't give perfect probability estimates, they can be useful. There was a bunch of work investigating them in the early '00s, sometimes with fancier approaches, but the simple one in sklearn is decent for use in forests.
...
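As a small illustration of those probability-estimation leaves in scikit-learn (the shallow depth is chosen deliberately so some leaves stay impure):

# Sketch: predict_proba returns the class fractions of the training
# points that fell into the same leaf as the query sample.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

print(tree.predict_proba(X[50:53]))  # per-class leaf fractions
print(tree.predict(X[50:53]))        # the argmax of those fractions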
