I'm confused about the intuition behind decision trees when used to describe continuous targets in machine learning.
I understand that decision trees use splits based on feature values to decide which branch of the tree to follow until a leaf value is reached.
It intuitively makes sense to me for inference on classification problems with nominal targets, because each leaf has a specific value (a label), so after going down enough branches one eventually arrives at a discrete value, which is the label.
But if we're doing regression where a machine learning model predicts a value on a continuum, for example a real number between 0 and 100, how could there be enough leaves to allow the model to output any real number between 0 and 100?
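Regression trees are only what you could call 'pseudo-continuous', in contrast to, for example, linear regression models. Each leaf outputs one constant value (typically the mean of the training targets that fall into it), so the prediction stays fixed over a range of the independent variable(s) and jumps at the 'splits' you mention. A tree with k leaves can therefore output at most k distinct values.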
However, there exists some academic work that fits (regression) models in the nodes (...). See the accepted answer here:
https://stats.stackexchange.com/questions/439756/decision-tree-that-fits-a-regression-at-leaf-nodes
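To make the 'pseudo-continuous' behaviour concrete, here is a minimal sketch in plain Python of a tiny one-feature regression tree whose leaves store the mean training target. The helper names (fit_tree, predict) and the toy data are illustrative assumptions, not part of any library. Evaluated on a fine grid of inputs, a depth-2 tree still returns at most four distinct values:

```python
from statistics import mean

def sse(ys):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not ys:
        return 0.0
    m = mean(ys)
    return sum((y - m) ** 2 for y in ys)

def fit_tree(xs, ys, depth):
    """Fit a tiny 1-D regression tree; a leaf is just the mean target."""
    best = None
    if depth > 0:
        # Every distinct value except the smallest is a candidate threshold,
        # so both sides of a split are always non-empty.
        for t in sorted(set(xs))[1:]:
            left = [y for x, y in zip(xs, ys) if x < t]
            right = [y for x, y in zip(xs, ys) if x >= t]
            cost = sse(left) + sse(right)
            if best is None or cost < best[0]:
                best = (cost, t)
    if best is None:
        return mean(ys)  # leaf: one constant prediction
    t = best[1]
    left = [(x, y) for x, y in zip(xs, ys) if x < t]
    right = [(x, y) for x, y in zip(xs, ys) if x >= t]
    return (
        t,
        fit_tree([x for x, _ in left], [y for _, y in left], depth - 1),
        fit_tree([x for x, _ in right], [y for _, y in right], depth - 1),
    )

def predict(node, x):
    """Follow the splits until a leaf (a plain number) is reached."""
    while isinstance(node, tuple):
        t, left, right = node
        node = left if x < t else right
    return node

xs = [i / 2 for i in range(21)]    # inputs 0.0, 0.5, ..., 10.0
ys = [10.0 * x for x in xs]        # targets on a continuum from 0 to 100
tree = fit_tree(xs, ys, depth=2)   # depth 2 means at most 4 leaves

grid = [i / 100 for i in range(1001)]
distinct = sorted({predict(tree, x) for x in grid})
print(len(distinct), distinct)
```

Growing the tree deeper adds more steps to the staircase, but the output never becomes a true continuum.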
Why in some cases does a random forest with n_estimators equal to 1 perform worse than a decision tree, even after setting bootstrap to False?
I am trying different machine learning models to predict credit card default rates. I tried a random forest and a decision tree, but the random forest seems to perform worse. I then tried a random forest with only 1 tree, which is supposed to be the same as a decision tree, but it still performed worse.
A specific answer to your observations depends on the implementation of the decision tree (DT) and random forest (RF) methods that you're using. That said, there are three most likely reasons:
bootstrapping: Although you mention that you set that to False, in the most general form RFs use two forms of resampling: bootstrapping of the dataset and random subsampling of the features. Perhaps the setting only controls one of these; in scikit-learn, for example, bootstrap only governs resampling of the rows, while max_features governs how many features are considered at each split. Even if both of these are off, some RF implementations have other parameters that control the number of attributes considered for each split of the tree and how they are selected.
tree hyperparameters: Related to my remark on the previous point, the other aspect to check is whether all of the other tree hyperparameters are the same. Tree depth, number of points per leaf node, etc. would all have to match to make the methods directly comparable.
growing method: Lastly, it is important to remember that trees are learned via indirect/heuristic losses that are often greedily optimized. Accordingly, there are different algorithms to grow the trees (e.g., C4.5), and the DT and RF implementations may be using different approaches.
If all of these match, then the differences should really be minor. If there are still differences (i.e., "in some cases"), these may be because of randomness in initialization and the greedy learning schemes which lead to suboptimal trees. That is the main reason for RFs, in which the ensemble diversity is used to mitigate these issues.
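One way to see the effect of the per-split feature subsampling mentioned in the first point is a plain-Python sketch of a single split search. The data, the function names and the choice of 3 out of 9 features (roughly the square root, a common RF default) are assumptions for illustration, not the behaviour of any specific library:

```python
import random

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(rows, labels, features):
    """Return (weighted_gini, feature, threshold) of the best split among `features`."""
    best = None
    n = len(rows)
    for f in features:
        for t in sorted({r[f] for r in rows})[1:]:
            left = [y for r, y in zip(rows, labels) if r[f] < t]
            right = [y for r, y in zip(rows, labels) if r[f] >= t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

rng = random.Random(0)
rows, labels = [], []
for _ in range(200):
    y = rng.randint(0, 1)
    # Feature 0 carries the signal; features 1..8 are pure noise.
    rows.append([y + rng.gauss(0, 0.3)] + [rng.gauss(0, 1) for _ in range(8)])
    labels.append(y)

all_features = list(range(9))
full = best_split(rows, labels, all_features)        # plain decision tree: sees every feature
subset = rng.sample(all_features, 3)                 # forest-style: random subset per split
restricted = best_split(rows, labels, subset)

print("all features:", full)
print("subset", subset, ":", restricted)
```

Whenever the informative feature is not in the sampled subset, the restricted search has to settle for a worse split, which is enough to make a one-tree forest differ from a plain decision tree.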
I am working on optimizing a manufacturing based dataset which consists of a huge number of controllable parameters. The goal is to attain the best run settings of these parameters.
While doing my research I familiarized myself with several predictive algorithms. If I use, say, Random Forest to predict my dependent variable and to understand how important each independent variable is, is there a way to extract the final equation/relationship the algorithm uses?
I'm not sure if my question was clear enough, please let me know if there's anything else I can add here.
There is no general way to get an interpretable equation from a random forest that explains how your covariates affect the dependent variable. For that you can use a different, more suitable model, e.g., linear regression (perhaps with kernel functions) or a decision tree. Note that you can use one model for prediction and another for descriptive analysis; there's no inherent reason to stick with a single model.
use Random Forest to predict my dependent variable to understand how important each independent variable is
Understanding how important each independent variable is does not necessarily require what the title of your question asks for, namely the actual relationship. Most random forest packages have a method that quantifies how much each covariate affected the model over the training set.
There are a number of methods for estimating feature importance from a trained model. For Random Forest, the best-known methods are MDI (Mean Decrease in Impurity) and MDA (Mean Decrease in Accuracy). Many popular ML libraries support feature importance estimation out of the box for Random Forest.
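As a rough illustration of the permutation idea behind MDA, the sketch below shuffles one feature at a time and records how much the accuracy drops. It is a simplified version under stated assumptions: the "model" is a hand-written stand-in for a trained forest, the data are synthetic, and all function names are hypothetical rather than taken from any library.

```python
import random

def accuracy(predict, rows, labels):
    """Fraction of rows for which the model's prediction matches the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(predict, rows, labels, n_features, rng):
    """Accuracy drop per feature after shuffling that feature's column."""
    base = accuracy(predict, rows, labels)
    drops = []
    for f in range(n_features):
        column = [r[f] for r in rows]
        rng.shuffle(column)  # break the link between feature f and the target
        shuffled = [r[:f] + [v] + r[f + 1:] for r, v in zip(rows, column)]
        drops.append(base - accuracy(predict, shuffled, labels))
    return drops

rng = random.Random(0)
rows = [[rng.random(), rng.random()] for _ in range(500)]
labels = [int(r[0] > 0.5) for r in rows]   # only feature 0 matters

def model(row):
    """Stand-in for a trained model: it only looks at feature 0."""
    return int(row[0] > 0.5)

drops = permutation_importance(model, rows, labels, n_features=2, rng=rng)
print(drops)
```

A large drop means the model relied on that feature; a drop of zero means the model's predictions did not depend on it at all. This ranks covariates without giving an equation for the relationship.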
I have some questions regarding decision tree and random forest classifier.
Question 1: Is a trained Decision Tree unique?
I believe that it should be unique, as it maximizes information gain at each split. But if it is unique, why is there a random_state parameter in the decision tree classifier? A unique tree would be reproducible every time, so random_state would not be needed.
Question 2: What does a decision tree actually predict?
While going through the random forest algorithm, I read that it averages the probability of each class over its individual trees. But as far as I know, a decision tree predicts a class, not a probability for each class.
Even without checking out the code, you will see this note in the docs:
The features are always randomly permuted at each split. Therefore, the best found split may vary, even with the same training data and max_features=n_features, if the improvement of the criterion is identical for several splits enumerated during the search of the best split. To obtain a deterministic behaviour during fitting, random_state has to be fixed.
For splitter='best', this is happening here:
# Draw a feature at random
f_j = rand_int(n_drawn_constants, f_i - n_found_constants,
random_state)
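The tie-breaking effect described in the docs note can be reproduced with a plain-Python sketch. It is not scikit-learn's actual splitter: the functions and data below are hypothetical, and the feature visiting order is passed in explicitly so that the outcome is easy to follow. Feature 1 is an exact copy of feature 0, so both offer identical improvement and whichever is visited first wins:

```python
import random

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_feature(rows, labels, order):
    """Index of the best split feature; ties go to the feature visited first."""
    best_score, best_f = None, None
    for f in order:
        for t in sorted({r[f] for r in rows})[1:]:
            left = [y for r, y in zip(rows, labels) if r[f] < t]
            right = [y for r, y in zip(rows, labels) if r[f] >= t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            # Strict "<": an equally good split found later does not replace it.
            if best_score is None or score < best_score:
                best_score, best_f = score, f
    return best_f

values = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
rows = [[v, v] for v in values]   # feature 1 duplicates feature 0
labels = [0, 0, 0, 1, 1, 1]

print(best_feature(rows, labels, [0, 1]))  # visiting order 0, 1
print(best_feature(rows, labels, [1, 0]))  # visiting order 1, 0

# A seeded generator makes the random visiting order reproducible,
# which is the role a fixed random_state plays.
order = [0, 1]
random.Random(42).shuffle(order)
print(order, best_feature(rows, labels, order))
```

Both trees are equally good on the training data, but they are not the same tree, so the fitted model is only unique once the seed is fixed.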
And for your other question, read this:
...
Just build the tree so that the leaves contain not just a single class estimate, but also a probability estimate as well. This could be done simply by running any standard decision tree algorithm, and running a bunch of data through it and counting what portion of the time the predicted label was correct in each leaf; this is what sklearn does. These are sometimes called "probability estimation trees," and though they don't give perfect probability estimates, they can be useful. There was a bunch of work investigating them in the early '00s, sometimes with fancier approaches, but the simple one in sklearn is decent for use in forests.
...
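The counting approach in the quote can be sketched in a few lines of plain Python: a leaf's probability estimate is the fraction of training samples of each class that ended up in it, and a forest averages these estimates over its trees. The function names and the class labels are illustrative assumptions.

```python
from collections import Counter

def leaf_probabilities(leaf_labels, classes):
    """Class fractions among the training samples that reached one leaf."""
    counts = Counter(leaf_labels)
    return {c: counts[c] / len(leaf_labels) for c in classes}

def average_probabilities(per_tree, classes):
    """Forest-style estimate: mean of the per-tree class probabilities."""
    return {c: sum(p[c] for p in per_tree) / len(per_tree) for c in classes}

classes = ["default", "no_default"]

# Training labels in the leaf that a given sample reaches, in two different trees.
tree_1 = leaf_probabilities(["default", "default", "no_default", "default"], classes)
tree_2 = leaf_probabilities(["no_default", "default"], classes)

forest = average_probabilities([tree_1, tree_2], classes)
predicted = max(forest, key=forest.get)   # class with the highest averaged probability
print(tree_1, tree_2, forest, predicted)
```

A single tree's class prediction is then just the most probable class in the leaf.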
I would like to know what are the various techniques and metrics used to evaluate how accurate/good an algorithm is and how to use a given metric to derive a conclusion about a ML model.
One way to do this is to use precision and recall, as defined here on Wikipedia.
Another way is to use the accuracy metric, as explained here. So what I would like to know is whether there are other metrics for evaluating an ML model.
A while ago I compiled a list of metrics used to evaluate classification and regression algorithms, in the form of a cheat sheet. Some metrics for classification: precision, recall, sensitivity, specificity, F-measure, Matthews correlation, etc. They are all based on the confusion matrix. Others exist for regression (continuous output variable).
The technique is mostly to run an algorithm on some data to get a model, and then apply that model on new, previously unseen data, and evaluate the metric on that data set, and repeat.
Some techniques (actually resampling techniques from statistics):
Jackknife
Cross-validation
K-fold validation
Bootstrap
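As an example of the "fit, evaluate on unseen data, repeat" technique, here is a minimal plain-Python sketch of k-fold cross-validation. The fit and score callables, the mean predictor and the toy data are stand-ins chosen for illustration; in practice the data would usually be shuffled before splitting.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size

def cross_validate(fit, score, xs, ys, k):
    """Average the score obtained on each held-out fold."""
    scores = []
    for train, test in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(score(model, [xs[i] for i in test], [ys[i] for i in test]))
    return sum(scores) / k

def fit_mean(xs, ys):
    """Stand-in model: always predict the mean training target."""
    return sum(ys) / len(ys)

def mean_absolute_error(model, xs, ys):
    return sum(abs(y - model) for y in ys) / len(ys)

xs = list(range(10))
ys = [2.0 * x for x in xs]
folds = list(k_fold_indices(len(xs), 3))
cv_error = cross_validate(fit_mean, mean_absolute_error, xs, ys, k=3)
print(folds, cv_error)
```

Every sample is used for testing exactly once and never appears in the training set of its own fold.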
ML in general is quite a vast field, but I'll try to answer anyway. The Wikipedia definition of ML is the following:
Machine learning, a branch of artificial intelligence, concerns the construction and study of systems that can learn from data.
In this context, learning can be defined as the parameterization of an algorithm. The parameters of the algorithm are derived using input data with a known output. Once the algorithm has "learned" the association between input and output, it can be tested with further input data for which the output is also known.
Let's suppose your problem is to obtain words from speech. Here the input is some kind of audio file containing one word (not necessarily, but I assume this case to keep it simple). You'd record X words N times and then use (for example) N/2 of the repetitions to parameterize your algorithm, disregarding for the moment what your algorithm would look like.
Now on the one hand, depending on the algorithm, if you feed it one of the remaining repetitions, it may give you a certainty estimate that can be used to characterize the recognition of just that repetition. On the other hand, you may use all of the remaining repetitions to test the learned algorithm: you pass each repetition to the algorithm and compare the expected output with the actual output. In the end you'll have an accuracy value for the learned algorithm, calculated as the number of correct classifications divided by the total number of classifications.
Anyway, the actual accuracy will depend on the quality of your learning and test data.
A good place to start reading would be Pattern Recognition and Machine Learning by Christopher M. Bishop.
There are various metrics for evaluating the performance of an ML model, and there is no rule that only 20 or 30 metrics exist. You can create your own metrics depending on your problem; when solving real-world problems there are various cases where you need a custom metric.
As for the existing ones, they are already listed in the first answer, so I will just highlight each metric's merits and demerits to give a better understanding.
Accuracy is the simplest metric and is commonly used. It is the number of correctly classified points divided by the total number of points in your dataset. It is not preferred when the dataset is imbalanced: in a 2-class problem where most points belong to class 1, a model that always predicts class 1 still gets a high accuracy, so the number is not very interpretable.
Log loss is a metric computed on predicted probability scores rather than on predicted labels, which gives you a better understanding of how confidently a specific point is assigned to class 1. The best part of this metric is that it is built into logistic regression, which is a well-known ML technique.
The confusion matrix is best suited to 2-class classification problems, where it gives four numbers, and the diagonal numbers help to get an idea of how good your model is. From this matrix other metrics such as precision, recall and F1-score are derived, which are interpretable.
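The metrics above can be computed directly from the four confusion-matrix counts. The sketch below is plain Python under a few assumptions that are not in the answer: labels are coded 0/1 with 1 as the positive class, undefined ratios are reported as 0.0, and probabilities are clipped before taking logarithms.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for 0/1 labels, with 1 as the positive class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

def classification_metrics(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0   # 0.0 when undefined (assumption)
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of the true labels under predicted P(class 1)."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)   # avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(y_true)

# Imbalanced data: 95 negatives, 5 positives, and a model that always predicts 0.
y_true = [0] * 95 + [1] * 5
always_majority = [0] * 100
metrics = classification_metrics(y_true, always_majority)
print(metrics)
print(log_loss([1, 0], [0.5, 0.5]))
```

The always-negative model scores high accuracy while its recall is zero, which is the weakness of accuracy on imbalanced data described above.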
I have a data set of 1k records and my job is to do a decision algorithm based on those records.
Here is what I can share:
The target is a continuous value.
Some of the predictors (or attributes) are continuous values, some of them are discrete, and some are arrays of discrete values (there can be more than one option).
My initial thoughts were to separate the arrays of discrete values and make them individual features (predictors). For the continuous values in the predictors, I was thinking about just randomly picking a few decision boundaries and seeing which one reduces the entropy the most. Then I would build a decision tree (or a random forest) that uses standard deviation reduction when creating the tree.
My question is: Am I on the right path? Is there a better way to do that?
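For reference, the two ingredients described in the question, turning an array-valued attribute into individual indicator features and scoring a candidate boundary by standard deviation reduction, could be sketched as follows. This is plain Python with hypothetical function names and toy data; using midpoints between sorted values as candidate boundaries (instead of random ones) is an assumption of the sketch.

```python
from statistics import pstdev

def multi_hot(values, vocabulary):
    """One 0/1 indicator per known option; several can be set at once."""
    return [1 if option in values else 0 for option in vocabulary]

def candidate_thresholds(xs):
    """Midpoints between consecutive distinct values of a continuous predictor."""
    distinct = sorted(set(xs))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

def sd_reduction(xs, ys, threshold):
    """Std. dev. of the target minus the size-weighted std. dev. of the two sides."""
    left = [y for x, y in zip(xs, ys) if x < threshold]
    right = [y for x, y in zip(xs, ys) if x >= threshold]
    if not left or not right:
        return 0.0
    n = len(ys)
    weighted = len(left) / n * pstdev(left) + len(right) / n * pstdev(right)
    return pstdev(ys) - weighted

encoded = multi_hot({"red", "blue"}, ["blue", "green", "red"])

xs = [1.0, 2.0, 3.0, 4.0]
ys = [10.0, 10.0, 20.0, 20.0]
thresholds = candidate_thresholds(xs)
best = max(thresholds, key=lambda t: sd_reduction(xs, ys, t))
print(encoded, thresholds, best, sd_reduction(xs, ys, best))
```

Scanning every midpoint is cheap for 1k records, so there is little to gain from sampling boundaries at random.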
I know this probably comes a bit late, but what you are searching for are model trees. Model trees are decision trees with continuous rather than categorical values in the leaves. In general these values are predicted by linear regression models. One of the more prominent model trees, and one that more or less suits your needs, is the M5 model tree introduced by Quinlan. Wang and Witten re-implemented M5 and extended its functionality so that it can handle both continuous and categorical attributes. Their version is called M5'; you can find an implementation in, e.g., Weka. The only thing left would be to handle the arrays. However, your description is a bit generic in that respect. From what I gather, your choices are either flattening or, as you suggested, separating them.
Note that, since Wang and Witten's work, more sophisticated model trees have been introduced. However, M5' is robust and does not need any parameterization in its original formulation, which makes it easy to use.
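To illustrate the basic idea of a model tree (a linear regression in each leaf), here is a deliberately tiny plain-Python sketch with a single split. It is not M5 or M5': there is no split selection, pruning or smoothing, the split point is supplied by hand, and all names and data are made up for the example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; b is 0 if x is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx if sxx else 0.0
    return mean_y - b * mean_x, b

def fit_model_stump(xs, ys, threshold):
    """One split at `threshold`, with a separate linear model on each side."""
    left = [(x, y) for x, y in zip(xs, ys) if x < threshold]
    right = [(x, y) for x, y in zip(xs, ys) if x >= threshold]
    return threshold, fit_line(*zip(*left)), fit_line(*zip(*right))

def predict(model, x):
    threshold, left_line, right_line = model
    a, b = left_line if x < threshold else right_line
    return a + b * x

# Piecewise-linear toy target: slope 2 below x = 5, slope -1 from x = 5 on.
xs = list(range(10))
ys = [2.0 * x if x < 5 else 30.0 - x for x in xs]

model = fit_model_stump(xs, ys, threshold=5)
print(predict(model, 2.5), predict(model, 7.5))
```

Unlike a leaf that stores a single constant, each leaf here can return a different value for every input that reaches it.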