Do the decision tree regressors in random forests have different parameters? - random-forest

Decision tree regressors have several tunable parameters, e.g. criterion, max_depth, min_samples_leaf, etc. Do these trees have different parameters?

They all have the same hyperparameter values, but the actual fitted values, e.g. the depth of each tree, may vary.
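For instance, with scikit-learn (assumed here, since the parameter names in the question match its API), you can inspect the fitted trees of a forest and see that they share hyperparameter values while their fitted depths differ; a minimal sketch:

    # Minimal sketch, assuming scikit-learn: all trees in the forest share the same
    # hyperparameter values, but their fitted structure (e.g. actual depth) differs
    # because each tree sees a different bootstrap sample and random feature subsets.
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor

    X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
    forest = RandomForestRegressor(n_estimators=5, max_depth=None, random_state=0).fit(X, y)

    for i, tree in enumerate(forest.estimators_):
        # same hyperparameter value (None) for every tree, different fitted depth
        print(i, tree.get_params()["max_depth"], tree.get_depth())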

Related

How to weigh features or determine feature importance in unsupervised learning

I have two sets each with 15-20 attributes. I am using similarity/distance metrics like Jaccard or Hamming to find the similarity/distance between the two sets.
I am looking at an option to weight the features before computing the similarity of the two sets; for example, attribute 1 would carry more weight than attribute 2 in determining the similarity between the sets.
I understand that feature importance can be determined when we have a target variable, but how can this be done when we do not have a target?
Would options like PCA, or filter methods such as computing the variance, help here? If so, are there any references?
The attributes are mostly categorical, both nominal and ordinal.
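One way to make the weighting concrete, independent of how the weights are chosen (they would still have to come from domain knowledge or an unsupervised proxy such as per-attribute variance), is a weighted Hamming-style match score; the function below is a hypothetical sketch, not from any library:

    # Hypothetical sketch of a weighted Hamming-style similarity between two
    # records of categorical attributes. The weights are assumed to be given;
    # choosing them is the open question above.
    def weighted_similarity(a, b, weights):
        # a, b: equal-length sequences of categorical values
        # weights: one non-negative weight per attribute
        matched = sum(w for x, y, w in zip(a, b, weights) if x == y)
        return matched / sum(weights)

    record_1 = ["red", "small", "round"]
    record_2 = ["red", "large", "round"]
    print(weighted_similarity(record_1, record_2, weights=[2.0, 1.0, 1.0]))  # 0.75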

Number of Trees in Random Forest Regression

I am learning the random forest regression model. I know that it builds many trees (models) and that we can then predict the target variable by averaging the results of all trees. I also have a decent understanding of the decision tree regression algorithm. How can we determine the best number of trees?
For example, I have a dataset where I am predicting a person's salary and I have only two input variables, 'Years of Experience' and 'Performance Score'. How many random trees can I build from such a dataset? Does the number of trees in a random forest depend on the number of input variables? Any good example would be highly appreciated.
Thanks in advance.
A decision tree is trained on the entire dataset and only one model is created. In a random forest, multiple decision trees are created, and each tree is trained on a subset of the data by sampling the rows and limiting the features considered. In your case you have only two features, so each tree will still be trained on a subset of the rows.
You can create any number of trees for your data. Usually in a random forest, more trees give better performance but also more computation time. Experiment with your data and see how performance changes with different numbers of trees; if performance stays the same, use fewer trees for faster computation. You can use grid search for this.
You can also experiment with other ML models, such as linear regression, which might perform well in your case.
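As a rough sketch of that grid-search idea (scikit-learn assumed; the feature matrix below is synthetic stand-in data for 'Years of Experience' and 'Performance Score', and the candidate tree counts are illustrative only):

    # Sketch: compare different numbers of trees with cross-validated grid search.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GridSearchCV

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 10, size=(200, 2))                      # two input variables
    y = 30_000 + 5_000 * X[:, 0] + 2_000 * X[:, 1] + rng.normal(0, 1_000, 200)

    search = GridSearchCV(
        RandomForestRegressor(random_state=0),
        param_grid={"n_estimators": [10, 50, 100, 300]},
        cv=5,
        scoring="neg_mean_squared_error",
    )
    search.fit(X, y)
    print(search.best_params_)  # if scores plateau, prefer the smaller forest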

What does depth of decision tree depend on?

Below is a parameter for DecisionTreeClassifier: max_depth
http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
max_depth : int or None, optional (default=None)
The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
I always thought that the depth of a decision tree should be equal to or less than the number of features (attributes) in a given dataset. What if we find pure classes before reaching the value given for that parameter? Does it stop splitting, or does it keep splitting until the given depth is reached?
Is it possible to use the same attribute at two different levels of a decision tree while splitting?
If the number of features is very high, a decision tree can grow very large. To answer your question: yes, it stops splitting a node once that node is pure.
This is another reason decision trees tend to overfit.
You would typically use the max_depth parameter with a random forest, which does not consider all features for any specific tree, so not every tree is expected to grow to the maximum possible depth; limiting the depth plays a role similar to pruning. Decision trees are weak learners, and in a random forest these depth-limited trees participate in the vote (or average). More details about the relationship between random forests and decision trees can easily be found online; a range of articles has been published.
So, generally, you would want to use max_depth when you have a large number of features. Also, in practice you would usually use a RandomForest rather than a single DecisionTree.
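A small sketch (scikit-learn assumed) illustrating the points above: with max_depth=None the tree keeps splitting until leaves are pure (or min_samples_split is reached), its depth can exceed the number of features, and the same feature is reused at several levels:

    from collections import Counter
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                               n_redundant=0, random_state=0)
    clf = DecisionTreeClassifier(max_depth=None, random_state=0).fit(X, y)

    print("number of features:", X.shape[1])   # 2
    print("fitted depth:", clf.get_depth())    # typically well above 2
    # how often each feature index is used at internal nodes (leaves are marked -2)
    print(Counter(f for f in clf.tree_.feature if f >= 0))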

What splitting criterion does Random Tree in Weka 3.7.11 use for numerical attributes?

I'm using RandomForest from Weka 3.7.11 which in turn is bagging Weka's RandomTree. My input attributes are numerical and the output attribute(label) is also numerical.
When training the RandomTree, K attributes are chosen at random for each node of the tree. Several splits based on those attributes are attempted and the "best" one is chosen. How does Weka determine what split is best in this (numerical) case?
For nominal attributes I believe Weka is using the information gain criterion which is based on conditional entropy.
IG(T|a) = H(T) - H(T|a)
Is something similar used for numerical attributes? Maybe differential entropy?
When a tree is split on a numerical attribute, it is split on a condition like a > 5. This condition effectively becomes a binary variable, and the criterion (information gain) is exactly the same.
P.S. For regression, the commonly used criterion is the sum of squared errors (computed for each leaf, then summed over the leaves), but I do not know what Weka uses specifically.
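To make the first point concrete, here is a generic sketch of information gain for a threshold test (not the Weka implementation):

    import numpy as np

    def entropy(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return float(-(p * np.log2(p)).sum())

    def information_gain(a, labels, threshold):
        # the numeric test "a > threshold" acts as a binary attribute
        left, right = labels[a <= threshold], labels[a > threshold]
        if len(left) == 0 or len(right) == 0:
            return 0.0
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        return entropy(labels) - weighted

    a = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])
    labels = np.array([0, 0, 0, 1, 1, 1])
    print(information_gain(a, labels, threshold=5.0))  # 1.0 bit: a perfect split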

Decision tree with high cardinality attribute

I want to learn a decision tree for a reasonable discrete target attribute with 5 possible values.
However, there are discrete high-cardinality input attributes (thousands of different possible string values), and I wonder whether it makes sense to include them. Is there any policy on the maximum cardinality an attribute should have to be included when training a decision tree?
There is no maximum cardinality, no. Of course, you could omit values that do not actually appear in the data.
You will have to use a random decision forest implementation that handles multi-valued categorical features directly rather than converting them to a series of binary indicator features.
For a categorical feature with N values there are 2^N - 2 possible decision rules on the feature, which is far too many to consider exhaustively. The heuristic I have used is to compute the entropy of the target when you divide up the data by the N categorical feature values, then order the values by entropy and evaluate the N-2 rules you get by considering prefixes of that list.
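A hedged sketch of that heuristic (names are illustrative, not from any particular library): order the category values by the entropy of the target within each value, then score only the splits formed by prefixes of that ordering, which is linear in N rather than exponential.

    from collections import Counter
    from math import log2

    def entropy(labels):
        total = len(labels)
        return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

    def candidate_splits(categories, targets):
        # group the targets by category value and compute per-value entropy
        by_value = {}
        for c, t in zip(categories, targets):
            by_value.setdefault(c, []).append(t)
        ordered = sorted(by_value, key=lambda c: entropy(by_value[c]))
        # each prefix of the ordering defines one "value in subset?" rule
        return [set(ordered[:k]) for k in range(1, len(ordered))]

    categories = ["a", "a", "b", "b", "c", "c", "d", "d"]
    targets    = [ 0,   0,   0,   1,   1,   1,   0,   1 ]
    for rule in candidate_splits(categories, targets):
        print(rule)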
