Should I put a random state for random forests? - random-forest

Just a general question about the random forest algorithm:
I guess that, in order for the random forest to work well, the training data needs to vary from tree to tree, so I don't need to set
random_state=0
Is this right?
Thanks a lot

Related

Selecting important features to perform random forest classification

I have 9 parameters, and I want to select the 6 most important ones and discard the other 3. What is the best method to do this? I have seen methods that rank parameters by recursive feature elimination (e.g. RFECV). Can I use random forest classification to rank the parameters, select the important ones, and then use them in a random forest classifier? My question is: when using a random forest for feature selection, how can I make sure I have used the best hyperparameters? Is it valid to use an untuned random forest classifier to decide the importance of the parameters? Are there any other methods for selecting important features?
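As a sketch of both approaches mentioned in the question (this is not from the thread: make_classification stands in for the real 9-parameter dataset, and the hyperparameter values are arbitrary placeholders, not tuned choices):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Hypothetical stand-in for the real 9-parameter dataset
X, y = make_classification(n_samples=500, n_features=9, n_informative=6, random_state=0)

# Option 1: rank features by the forest's impurity-based importances, keep the top 6
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
top6 = np.argsort(rf.feature_importances_)[::-1][:6]
print("top 6 feature indices:", top6)

# Option 2: recursive feature elimination down to 6 features,
# re-ranking with the forest's importances at each elimination step
rfe = RFE(RandomForestClassifier(n_estimators=200, random_state=0), n_features_to_select=6)
rfe.fit(X, y)
print("RFE keeps:", rfe.support_)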

Number of Trees in Random Forest Regression

I am learning the random forest regression model. I know that it builds many trees (models) and predicts the target variable by averaging the results of all the trees. I also have a decent understanding of the decision tree regression algorithm. How do we choose the best number of trees?
For example, I have a dataset where I am predicting a person's salary, and I have only two input variables, 'Years of Experience' and 'Performance Score'. How many random trees can I build on such a dataset? Does the number of trees in a random forest depend on the number of input variables? A good example would be highly appreciated.
Thanks in advance
A decision tree trains a single model on the entire dataset. In a random forest, multiple decision trees are created, and each tree is trained on a subset of the data, limiting both the rows and the features. In your case you only have two features, so each tree will be trained on a subset of the rows.
You can create any number of trees for your data. Usually in a random forest, more trees give better performance but also more computation time. Experiment with your data and see how performance changes with different numbers of trees; if performance stays the same, use fewer trees for faster computation. You can use grid search for this, as in the sketch below.
You can also experiment with other ML models, such as linear regression, which might perform well in your case.
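As a minimal sketch of the grid-search idea (the two-feature salary data here is synthetic and hypothetical, and the n_estimators values are arbitrary choices, not recommendations):

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Hypothetical two-feature data: years of experience, performance score
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 2))
y = 30_000 + 4_000 * X[:, 0] + 2_000 * X[:, 1] + rng.normal(0, 5_000, 300)

# Cross-validated search over the number of trees
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [10, 50, 100, 300]},
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X, y)
print("best number of trees:", search.best_params_["n_estimators"])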

Random Forests Decorrelation

In random forests, at each node you choose from m randomly selected features rather than the complete feature set. This is said to de-correlate the predictors. Intuitively I understand it, but is there any statistics behind it? When can we say the predictors will be de-correlated, and how can we prove that this is the case here?
Random forest tries to improve on bagging by de-correlating the trees and reducing variance, which it does by randomly selecting the predictors used to grow each tree.
Typically m = log2(p), where p is the number of features (m = sqrt(p) is another common choice). We can't see which predictors are randomly selected; that is taken care of internally by the algorithm. The variance argument is sketched below.
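As a sketch of the statistics (this is the standard textbook argument, e.g. Hastie, Tibshirani & Friedman, The Elements of Statistical Learning, section 15.2, not something proved in this thread): if the B trees are identically distributed with variance \sigma^2 and positive pairwise correlation \rho, the variance of their average is

\[
\operatorname{Var}\Bigl(\frac{1}{B}\sum_{b=1}^{B} T_b\Bigr) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2 .
\]

As B grows, the second term vanishes, so the ensemble variance is floored at \rho\sigma^2. Restricting each split to m < p randomly chosen features lowers the correlation \rho between trees and hence lowers that floor; the trees are never fully de-correlated, only less correlated.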

Does a random forest randomly sample the data for each tree?

I appreciate that bagging randomly resamples the training set for each tree, and that random forests randomly select a subset of features at each split of each tree.
My question, though, is: does a random forest also resample the training set, as well as taking a random subset of features? Is it, in effect, doubly random?
The answer is yes, most of the time, if you want it to.
Random forests bootstrap the data and randomly select features.
Bootstrapping means sampling a dataset of the same size as the original, but with replacement. So if you have N data points, each tree will use N data points, but some may be duplicated (since they are sampled one by one with replacement).
However, it really is up to you what you do. In the sklearn implementation, the default is to bootstrap, but you can set bootstrap=False, and then only the random feature selection remains (see the sketch below).
See the documentation here:
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
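As a minimal sketch of the two sources of randomness (the data from make_classification is a placeholder, and the parameter values are just illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Default: bootstrap the rows AND subsample the features at each split
rf_double = RandomForestClassifier(bootstrap=True, max_features="sqrt", random_state=0).fit(X, y)

# bootstrap=False: every tree sees all N rows; only the per-split
# feature subsampling remains as a source of randomness
rf_features_only = RandomForestClassifier(bootstrap=False, max_features="sqrt", random_state=0).fit(X, y)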

Manipulate Random forest to produce a score rather than 0/1 label

Here is my situation: I am able to use a random forest for a binary classification problem; note that, given a feature vector, a typical random forest model predicts whether it belongs to group 1 or group 0, making a binary classification.
However, for several reasons, for each feature vector I want a score ranging from 0 to 1 instead of the 0/1 label. Ideally, the higher the score, the more confident I am that the feature vector should be put into the 1 set; otherwise it should belong to the 0 set.
So it is still 0/1 classification, but this time I want a score ranging from 0 to 1, instead of the 0 or 1 label.
I was told that some statistical classification methods, such as naive Bayes, can generate a probability score representing whether a given feature vector should be put into the 0 set or the 1 set. However, I ran a quick 10-fold cross-validation using naive Bayes on my dataset, and compared with the random forest, its performance looks very bad:
               precision  recall
random forest    0.901    0.907
naive Bayes      0.752    0.653
Too bad... I want to keep the high performance of the random forest while also acquiring a score.
I am aware that a random forest has a special tree-like structure, and as a newbie to machine learning, I have no idea how to manipulate it to generate a score.
So here is my question: how can I manipulate a random forest so that, given a feature vector, it generates a score ranging from 0 to 1 instead of the 0 or 1 label? Am I clear enough? Thank you!
This is normally a built-in feature of random forests. The easiest way to get it: each tree in the forest gives a 0/1 decision; take the average of those decisions, and you get a score in the [0, 1] range.
If your random forest package doesn't provide this feature, look for another implementation that does (or check the documentation; you may have missed it).
For example, in scikit-learn you call the predict_proba method to get probabilities and plain predict to get the hard decision.
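A minimal sketch of that call (the dataset here is a hypothetical stand-in):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

print(rf.predict(X[:3]))              # hard 0/1 labels
# predict_proba averages the trees' probability estimates,
# close in spirit to the vote fraction described above
print(rf.predict_proba(X[:3]))        # per-class scores in [0, 1]
print(rf.predict_proba(X[:3])[:, 1])  # score for class 1 only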
