Multi-Class Random Forest

If a decision tree splits into 2 classes, how is a random forest able to create multiple buckets in classification?
Can you post any link about this theory?
What is the theory behind it?

A junction in a decision tree doesn't split into two classes. It splits into two subtrees. The outcome of a decision tree is determined by following junctions until you arrive at a leaf node. A simple 2-level tree with 3 binary junctions has 4 leaves:
        J
      /   \
     J     J
    / \   / \
   L   L L   L
With 4 leaves, you can have up to 4 classes, but in general many leaf nodes will belong to the same class.
Of course, with a forest, each tree has many leaves and there are many trees, so there are many, many leaves across the whole forest.
You can even look at the distribution of votes across the decision trees to decide how reliable an outcome is. If your forest has 100 trees and 3 classes, one input may result in a 90-6-4 distribution and another input may give a 50-30-20 outcome. Both inputs are apparently class 1, but the second input is less certainly so.
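For illustration, here is a minimal sketch of reading that vote distribution out of a forest with scikit-learn; the 3-class data is synthetic and only an assumption for the example:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic 3-class problem, forest of 100 trees
X, y = make_classification(n_samples=500, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Each tree ends in one leaf and votes for one class; the spread of the
# votes tells you how certain the forest is about this particular input.
votes = np.array([tree.predict(X[:1])[0] for tree in forest.estimators_])
print(np.bincount(votes.astype(int), minlength=3))   # e.g. something like [90  6  4]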

Related

Meaning of 2 datasets having different distributions and why can't a neural network work on them together?

I am using datasets from different projects with input features (depth of inheritance tree, number of children, number of methods), where these features have values per class in each project.
I have read many papers saying that a neural network, or any other model, can't work on datasets of different distributions.
My question is:
1. What is the meaning of datasets with different distributions (where a single dataset has a number of samples, each sample corresponding to a class in that project)?
2. Why can't a NN, or any other algorithm, work on 2 datasets of different distributions?
Thanks in advance.
One of the most common assumptions when formulating a statistical learning problem is that the samples are IID; this means your samples are identically distributed, so all the samples should come from the same distribution. When you say that you have two different datasets, it means that this assumption is not true, and the majority of the theoretical guarantees no longer hold. Now, maybe your question is what "a data distribution" means: it is just the joint law p(x, y), where x are the features and y the labels. So two datasets having different distributions means that p_{1}(x, y) != p_{2}(x, y).
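A minimal sketch of what that can look like in practice; the two Gaussian datasets below are made up purely to illustrate p_{1}(x, y) != p_{2}(x, y):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Dataset 1: the label depends on the first feature
X1 = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
y1 = (X1[:, 0] > 0).astype(int)

# Dataset 2: the features are shifted AND the label rule is different,
# so the joint law p_2(x, y) differs from p_1(x, y)
X2 = rng.normal(loc=3.0, scale=1.0, size=(500, 2))
y2 = (X2[:, 1] > 3).astype(int)

model = LogisticRegression().fit(X1, y1)   # trained under p_1
print(model.score(X1, y1))                 # high accuracy on its own distribution
print(model.score(X2, y2))                 # near chance level under p_2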

Why can different stocks be merged together to build a single prediction model?

Given n samples with d features of stock A, we can build a (d+1)-dimensional linear model to predict the profit. However, in some books, I found that if we have m different stocks with n samples and d features each, then they merge these data to get m*n samples with d features and build a single (d+1)-dimensional linear model to predict the profit.
My confusion is that different stocks usually have little connection with each other, and their profits are influenced by different factors and environments, so why can they be merged to build a single model?
If you are using R as your tool of choice, you might like the time series embedding howto and its appendix -- the mathematics behind that is Takens's theorem:
[Takens's theorem gives] conditions under which a chaotic dynamical system can be reconstructed from a sequence of observations of the state of a dynamical system.
It looks to me as if the statements you quote relate to exactly this theorem: for d features (we are lucky if we know that number -- we usually don't), we need d+1 dimensions.
If more time series are to be predicted, we can use the same embedding space as long as the features are the same. The d dimensions are usually simple variables (e.g. temperature for different energy commodity stocks) -- this example helped me to grasp the idea intuitively.
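For what it's worth, the time-delay embedding itself is a few lines of numpy; this is just a sketch with a made-up series and an assumed d = 3:

import numpy as np

def delay_embed(series, d, tau=1):
    # Build (d+1)-dimensional delay vectors (x_t, x_{t+tau}, ..., x_{t+d*tau}),
    # the construction behind Takens's theorem.
    n = len(series) - d * tau
    return np.column_stack([series[i * tau : i * tau + n] for i in range(d + 1)])

t = np.linspace(0, 20 * np.pi, 2000)
x = np.sin(t) + 0.1 * np.sin(3.7 * t)   # toy series standing in for a price curve
E = delay_embed(x, d=3)                 # shape (1997, 4): d+1 dimensions for d = 3
print(E.shape)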
Further reading
Forecasting with Embeddings

5 input and 3 output features for machine learning

Need some advice here.
I am trying to build a model that can predict 3 different output features when 5 input features are given.
For example,
5 input features: size of the house, house floor, house condition, number of rooms, parking.
3 output features: price for selling, price for buying, price for renting.
What I am confused about right now is: is it possible for the trained model to predict the 3 outputs? What I found in others' examples/tutorials is that they mostly try to do only one thing with their model.
Sorry if my explanations are bad, I am new to tensorflow and machine learning.
A neural network can definitely predict/approximate multiple outputs. I have experience with a neural controller whose net produced control signals for two motors.
I don't have experience with tensorflow, but this framework is from Google and is quite popular, so I'm almost sure there is multi-output functionality.
There is a nice example of such a thing.
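For instance, a minimal multi-output sketch in tf.keras; the 5-input / 3-output shapes mirror the question, everything else (layer sizes, random data) is an assumption:

import numpy as np
import tensorflow as tf

# Toy data: 5 input features, 3 continuous targets (sell / buy / rent price)
X = np.random.rand(200, 5).astype("float32")
y = np.random.rand(200, 3).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(5,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(3),               # one output unit per target
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, verbose=0)
print(model.predict(X[:1]))                 # three predicted prices at once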
In common practice, we build a model to predict only one output. That is because in supervised learning we input certain kinds of variables and look for a relation between them and one desired output; this relation generally does not hold between the input and another desired output.
But we can use a special technique to fit your problem:
If we have four input variables I1, I2, I3, I4, and we want three output labels (generally discrete) O1, O2, O3, we can create a new label O4 by merging the original three outputs. For example, if O1, O2, O3 can only be 0 or 1, then O4 has 2^3 = 8 possible values in total. So we can build a prediction model between the four input variables and the output O4, and once the value of O4 is known, O1-O3 are all known as well.
However, if the output variables are not all discrete, especially if a regression technique is used, the technique above won't work. So, to predict three outputs, we normally train three times and build three models.
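As a small illustration of the label-merging idea (the binary labels below are made up; any single-output classifier could then be trained on O4):

import numpy as np

# Three binary output labels for 5 samples (made-up data)
O = np.array([[0, 1, 1],
              [1, 0, 0],
              [0, 0, 1],
              [1, 1, 1],
              [0, 0, 0]])

# Merge them into a single label O4 in {0, ..., 7}
O4 = O[:, 0] * 4 + O[:, 1] * 2 + O[:, 2]
print(O4)                                   # [3 4 1 7 0]

# After predicting O4, unpack the bits to recover O1, O2, O3
pred = 5                                    # an example predicted value of O4
o1, o2, o3 = (pred >> 2) & 1, (pred >> 1) & 1, pred & 1
print(o1, o2, o3)                           # 1 0 1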

Decision Trees combined with Logistic Regression

Basically my question is related to the following paper (it is enough to read only Section 1 Introduction, the beginning of Section 3 Prediction model structure, and Section 3.1 Decision tree feature transforms; everything else can be skipped):
https://pdfs.semanticscholar.org/daf9/ed5dc6c6bad5367d7fd8561527da30e9b8dd.pdf
This paper suggests that binary classification can show better performance with combined decision trees + linear classification (e.g. logistic regression) compared to using ONLY decision trees or ONLY linear classification (not both).
Simply speaking, the trick is that we have several decision trees (assume 2 trees for simplicity, the 1st tree with 3 leaf nodes and the 2nd tree with 2 leaf nodes) and some real-valued feature vector x which goes as input to all of the decision trees.
So,
- if first tree's decision is leaf node 1 and second tree's decision is leaf node 2 then linear classifier will receive binary string [ 1 0 0 0 1 ]
- if first tree's decision is leaf node 2 and second tree's decision is leaf node 1 then linear classifier will receive binary string [ 0 1 0 1 0 ]
and so on
If we used only decision trees (without the linear classifier), clearly we would have either class 100 / class 010 / class 001 for the 1st tree and class 10 / class 01 for the 2nd tree; but in this scheme the outputs of the trees are combined into a binary string which is fed to the linear classifier. So it's not clear how to train these decision trees. What we have is the aforementioned vector x and click/no-click, which is the output of the linear classifier, not of a tree.
Any ideas?
In my view, you need to train boosted decision trees by minimizing the log-loss criterion (binary classification). Once you have trained your trees (assume you have 2 trees with 3 and 2 leaves), then for each instance you predict the leaf index of each tree.
Example
If for an instance you get leaf 1 for the first tree and leaf 2 for the second tree, i.e. you get the vector (1, 0, 0, 0, 1) (it is a binary vector, not a string), then you have two strategies:
You train a linear classifier (e.g. logistic regression) on the result of your trees' predictions; your dataset has dimension (N, 5), where N is the number of instances. You will train a logistic regression on binary data.
You concatenate this 5-dimensional vector with your initial vector of features and train a linear classifier. You will train a logistic regression on both real and binary data.
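A minimal sketch of that two-stage setup (strategy 1) using scikit-learn; the synthetic data, tree count, and hyperparameters are assumptions, not the paper's exact configuration:

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 1) Boosted trees trained on the raw features with log-loss
gbt = GradientBoostingClassifier(n_estimators=20, max_depth=3, random_state=0)
gbt.fit(X_tr, y_tr)

# 2) For every instance, take the leaf index reached in each tree ...
leaves_tr = gbt.apply(X_tr)[:, :, 0]        # shape (n_samples, n_trees)
leaves_te = gbt.apply(X_te)[:, :, 0]

# ... and one-hot encode the indices into the sparse binary vector described above
enc = OneHotEncoder(handle_unknown="ignore")
Z_tr = enc.fit_transform(leaves_tr)
Z_te = enc.transform(leaves_te)

# 3) Logistic regression on the binary leaf features
lr = LogisticRegression(max_iter=1000).fit(Z_tr, y_tr)
print("test accuracy:", lr.score(Z_te, y_te))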

Multivariate Decision Tree learner

A lot of univariate decision tree learner implementations (C4.5 etc.) exist, but does anyone actually know of multivariate decision tree learner algorithms?
Bennett and Blue's A Support Vector Machine Approach to Decision Trees does multivariate splits by using embedded SVMs for each decision in the tree.
Similarly, in Multicategory classification via discrete support vector machines (2009) , Orsenigo and Vercellis embed a multicategory variant of discrete support vector machines (DSVM) into the decision tree nodes.
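To make the idea concrete, here is a rough sketch of a single multivariate (oblique) node split using a linear SVM as the splitting hyperplane; this only illustrates the general approach, not Bennett and Blue's exact algorithm:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# At a node, fit a linear model on the samples that reached it; the split then
# tests a linear combination of ALL features instead of a single feature.
svm = LinearSVC(C=1.0, max_iter=10000).fit(X, y)
go_left = svm.decision_function(X) < 0      # w . x + b < 0  -> left subtree

X_left, y_left = X[go_left], y[go_left]
X_right, y_right = X[~go_left], y[~go_left]
# Recurse on each subset to grow the rest of the tree.
print(len(y_left), len(y_right))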
The CART algorithm for decision trees can be made multivariate. CART is a binary splitting algorithm, as opposed to C4.5, which creates a node per unique value of a discrete attribute. The same algorithm is used for MARS and for handling missing values too.
To create a multivariate tree, you compute the best split at each node, but instead of throwing away all the splits that weren't the best, you keep a portion of them (maybe all). Then you evaluate all of the data's attributes against each of the candidate splits at that node, weighted by their rank: the split which led to the maximum gain is weighted at 1, the next-highest-gain split is weighted by some fraction < 1.0, and so on, with the weights decreasing as the gain of the split decreases. That number is then compared to the same calculation for the nodes within the left child; if it is above that number, go left, otherwise go right. That's a pretty rough description, but that's a multivariate split for decision trees.
Yes, there are some, such as OC1, but they are less common than ones which make univariate splits. Adding multivariate splits expands the search space enormously. As a sort of compromise, I have seen some logical learners which simply calculate linear discriminant functions and add them to the candidate variable list.
