Do 10 random forests with 50 trees each on same data, equal one random forest on same data with 500 trees? - random-forest

I have a data set with 1 million rows.
While running 1 random survival forest containing 500 trees, with the package randomForestSRC in R, it is taking a lot of time due to memory problems.
So, can I run 10 random survival forests with 50 trees on the same data, with different seed each time, and average the results of the 10 random forests (by dividing by 10), so that I can get a reasonably similar result as the one with 500 trees?

Yes, results should be similar. A random forest is simply a collection of decision trees. Adding more trees later is no problem, as long as you use the same data and parameters with each of your 10 sets of 50 trees. Also, you could look at more efficient versions of the random forest algorithm, e.g. the package ranger, which can also do survival forests, iirc.
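The reason this works can be checked with plain arithmetic: for any quantity the forest computes as an average over trees, the mean of ten equal-size group means equals the mean over all 500 trees. A minimal sketch in Python, where the per-tree predictions are simulated random numbers (an assumption standing in for real tree outputs for one observation), not output from randomForestSRC or ranger:

```python
import random

random.seed(0)

N_FORESTS = 10
TREES_PER_FOREST = 50

# Hypothetical per-tree predictions for a single observation; in practice
# each value would come from one tree grown with a different seed.
tree_predictions = [
    random.gauss(0.0, 1.0) for _ in range(N_FORESTS * TREES_PER_FOREST)
]

# One big forest: average over all 500 trees.
big_forest_prediction = sum(tree_predictions) / len(tree_predictions)

# Ten small forests: average each group of 50, then average the 10 results.
small_forest_predictions = [
    sum(tree_predictions[i:i + TREES_PER_FOREST]) / TREES_PER_FOREST
    for i in range(0, len(tree_predictions), TREES_PER_FOREST)
]
combined_prediction = sum(small_forest_predictions) / N_FORESTS

# The two agree up to floating-point rounding.
print(abs(big_forest_prediction - combined_prediction) < 1e-12)
```

The identity relies on every small forest having the same number of trees; with unequal sizes a weighted average would be needed.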

Related

Large difference between different classification algorithms

EDITED:
I have a classification dataset of 350000 rows and 500 features. The features are a Tfidf vector.
My target Y takes values from 1 to 16, classifying the sentences into 16 types.
The training and test sets are split randomly.
When I run my data through different classification algorithms, I get a huge difference in accuracy:
SVM and Naive Bayes give 20%+ (which is too low)
RandomForest gives around 55% accuracy, which is better but still low
Is there a reason why I'm getting such a huge difference across different algorithms, and is there a way to further increase the accuracy?
I'm trying to predict a person's personality from their tweets.

Why do Tensorflow tf.learn classification results vary a lot?

I use the TensorFlow high-level API tf.learn to train and evaluate a DNN classifier for a series of binary text classifications (actually I need multi-label classification but at the moment I check every label separately). My code is very similar to the tf.learn Tutorial
classifier = tf.contrib.learn.DNNClassifier(
    hidden_units=[10],
    n_classes=2,
    dropout=0.1,
    feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(training_set.data))
classifier.fit(x=training_set.data, y=training_set.target, steps=100)
val_accuracy_score = classifier.evaluate(x=validation_set.data, y=validation_set.target)["accuracy"]
Accuracy score varies roughly from 54% to 90%, with 21 documents in the validation (test) set which are always the same.
What does the very significant deviation mean? I understand there are some random factors (eg. dropout), but to my understanding the model should converge towards an optimum.
I use words (lemmas), bi- and trigrams, sentiment scores and LIWC scores as features, so I do have a very high-dimensional feature space, with only 28 training and 21 validation documents. Can this cause problems? How can I consistently improve the results apart from collecting more training data?
Update: To clarify, I generate a dictionary of occurring words and n-grams and discard those that occur only 1 time, so I only use words (n-grams) that exist in the corpus.
This has nothing to do with TensorFlow. This dataset is ridiculously small, so you can obtain almost any result. You have 28 + 21 points in a space with an "infinite" number of dimensions (there are around 1,000,000 English words, thus 10^18 trigrams; most of them do not exist, and certainly do not occur in your 49 documents, but you still have at least 1,000,000 dimensions). For such a problem, you have to expect huge variance in the results.
How can I consistently improve the results apart from collecting more training data?
You pretty much cannot. This is simply way too small a sample to do any statistical analysis.
Consequently, the best you can do is change the evaluation scheme: instead of splitting the data 28/21, do 10-fold cross-validation. With ~50 points this means running 10 experiments, each with about 45 training documents and 4 or 5 test ones, and averaging the results. This is the only thing you can do to reduce the variance. Remember, however, that even with CV, a dataset this small gives you no guarantee of how well your model will actually behave "in the wild" (once applied to never-before-seen data).
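The 10-fold scheme described above can be sketched in plain Python. This is only an illustration of the splitting and averaging; `fit_and_score` is a hypothetical callback (not part of tf.learn) that would train on the training indices and return the accuracy on the test indices.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold j takes every k-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[j::k] for j in range(k)]
    for j in range(k):
        test_idx = folds[j]
        train_idx = [i for m, fold in enumerate(folds) if m != j for i in fold]
        yield train_idx, test_idx


def cross_validated_score(n_samples, k, fit_and_score):
    """Average the score returned by fit_and_score over all k folds."""
    scores = [fit_and_score(tr, te) for tr, te in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


# 28 + 21 = 49 documents, as in the question.
splits = list(k_fold_indices(49, 10))
fold_sizes = [len(test) for _, test in splits]
print(fold_sizes)
```

Every document is used for testing exactly once, and the reported figure is the mean over the 10 runs rather than the result of a single 28/21 split.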

random forest tuning - tree depth and number of trees

I have a basic question about tuning a random forest classifier. Is there any relation between the number of trees and the tree depth? Is it necessary that the tree depth should be smaller than the number of trees?
For most practical concerns, I agree with Tim.
Yet, other parameters do affect when the ensemble error converges as a function of added trees. I guess limiting the tree depth would typically make the ensemble converge a little earlier. I would rarely fiddle with tree depth: although computing time is lowered, it gives no other benefit. Lowering the bootstrap sample size gives both lower run time and lower tree correlation, and thus often better model performance at comparable run time.
A rarely mentioned trick: when the RF model's explained variance is below 40% (seemingly noisy data), one can lower the sample size to ~10-50% and increase the number of trees to e.g. 5000 (usually unnecessarily many). The ensemble error will converge later as a function of the number of trees. But, due to lower tree correlation, the model becomes more robust and will reach a lower OOB error plateau.
You see below that the reduced sample size gives the best long-run convergence, whereas maxnodes starts from a lower point but converges less. For this noisy data, limiting maxnodes is still better than default RF. For low-noise data, the decrease in variance from lowering maxnodes or the sample size does not make up for the increase in bias due to lack of fit.
For many practical situations, you would simply give up if you could only explain 10% of the variance. Thus default RF is typically fine. If you're a quant who can bet on hundreds or thousands of positions, 5-10% explained variance is awesome.
The green curve is maxnodes, which is kind of like tree depth, but not exactly.
library(randomForest)
X = data.frame(replicate(6,(runif(1000)-.5)*3))
ySignal = with(X, X1^2 + sin(X2) + X3 + X4)
yNoise = rnorm(1000,sd=sd(ySignal)*2)
y = ySignal + yNoise
plot(y,ySignal,main=paste("cor=",cor(ySignal,y)))
#std RF
rf1 = randomForest(X,y,ntree=5000)
print(rf1)
plot(rf1,log="x",main="black default, red samplesize, green tree depth")
#reduced sample size
rf2 = randomForest(X,y,sampsize=.1*length(y),ntree=5000)
print(rf2)
points(1:5000,rf2$mse,col="red",type="l")
#limiting tree depth (not exact )
rf3 = randomForest(X,y,maxnodes=24,ntree=5000)
print(rf3)
points(1:5000,rf3$mse,col="darkgreen",type="l")
It is true that more trees generally result in better accuracy. However, more trees also mean more computational cost, and after a certain number of trees the improvement is negligible. An article by Oshiro et al. (2012) pointed out that, based on their tests with 29 data sets, there is no significant improvement beyond 128 trees (which is in line with the graph from Soren).
Regarding the tree depth, the standard random forest algorithm grows full decision trees without pruning. A single decision tree does need pruning in order to overcome over-fitting. In a random forest, however, this issue is largely mitigated by randomly selecting the variables at each split and by bootstrap aggregation (the mechanism that also produces the OOB samples).
Reference:
Oshiro, T.M., Perez, P.S. and Baranauskas, J.A., 2012, July. How many trees in a random forest?. In MLDM (pp. 154-168).
I agree with Tim that there is no rule-of-thumb ratio between the number of trees and tree depth. Generally you want as many trees as will improve your model. More trees also mean more computational cost, and after a certain number of trees the improvement is negligible. As you can see in the figure below, after some point there is no significant improvement in error rate even if we keep increasing the number of trees.
The depth of the tree is how far the tree is allowed to grow. A larger tree conveys more information, whereas a smaller tree gives less precise information. So the depth should be large enough to split each node down to your desired number of observations.
Below is an example of a short tree (leaf node=3) and a long tree (leaf node=6) for the Iris dataset. The short tree (leaf node=3) gives less precise information than the long tree (leaf node=6).
Short tree(leaf node=3):
Long tree(leaf node=6):
It all depends on your data set.
I have an example where I was building a Random Forest classifier on the Adult Income dataset, and reducing the depth of the trees (from 42 to 6) improved the performance of the model. A side effect of reducing the depth of the trees was a smaller model size (in RAM and on disk after saving).
Regarding the number of trees, I was doing the experiment on 72 classification tasks from OpenML-CC18 benchmark and I found that:
the more rows in the data, the more trees are needed,
the best performance is obtained by tuning the number of trees with single-tree precision. Train a large Random Forest (for example with 1000 trees) and then use validation data to find the optimal number of trees.
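The "train one large forest, then pick the number of trees on validation data" idea can be sketched as follows. This is a minimal illustration under stated assumptions: the per-tree predictions are simulated (true value plus independent noise) rather than produced by a real forest, and all names are hypothetical.

```python
import random

random.seed(1)

N_TREES = 1000
N_VALIDATION = 200

# Hypothetical validation targets and simulated per-tree predictions;
# a real forest would supply one prediction list per tree instead.
y_val = [random.uniform(0.0, 1.0) for _ in range(N_VALIDATION)]
tree_preds = [[y + random.gauss(0.0, 0.5) for y in y_val] for _ in range(N_TREES)]

running_sum = [0.0] * N_VALIDATION
mse_by_n_trees = []
for n, preds in enumerate(tree_preds, start=1):
    # Add one more tree to the ensemble and score the averaged prediction.
    running_sum = [s + p for s, p in zip(running_sum, preds)]
    mse = sum((s / n - y) ** 2 for s, y in zip(running_sum, y_val)) / N_VALIDATION
    mse_by_n_trees.append(mse)

# The number of trees with the lowest validation error, at single-tree precision.
best_n_trees = min(range(1, N_TREES + 1), key=lambda n: mse_by_n_trees[n - 1])
print(best_n_trees, mse_by_n_trees[best_n_trees - 1])
```

Because the ensemble prediction is a running average, the forest is trained only once and every prefix of 1 to 1000 trees is evaluated in a single pass.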

caret: using random forest and include cross-validation

I used the caret package to train a random forest, including repeated cross-validation. I’d like to know whether the OOB, as in the original RF by Breiman, is used or whether this is replaced by the cross-validation. If it is replaced, do I have the same advantages as described in Breiman 2001, like increased accuracy by reducing the correlation between input data? As OOB is drawn with replacement and CV is drawn without replacement, are both procedures comparable? What is the OOB estimate of error rate (based on CV)?
How are the trees grown? Is CART used?
As this is my first thread, please let me know if you need more details. Many thanks in advance.
There are a lot of basic questions here and you would be better served by reading a book on machine learning or predictive modeling. That's probably why you haven't gotten much of a response.
For caret you should also consult the package website where some of these questions are answered.
Here are some notes:
CV and OOB estimation for RF are somewhat different. This post might help explain how. For this application, the OOB rate from random forest is computed while the model is being built, whereas CV uses holdout samples that are predicted after the random forest model is computed.
The original random forest model (used here) uses unpruned CART trees. Again, this is in many text books and papers.
Max
I recently got a little confused with this too, but reading chapter 4 in Applied Predictive Modeling by Max Kuhn helped me to understand the difference.
If you use randomForest in R, you grow a number of decision trees by sampling N cases with replacement (N is the number of cases in the training set). You then sample m variables at each node where m is less than the number of predictors. Each tree is then grown fully and terminal nodes are assigned to a class based on the mode of cases in that node. New cases are classified by sending them down all the trees and then taking a vote; the majority vote wins.
The key points to note here are:
how the trees are grown - sampling WITH replacement (a bootstrap). This means that some cases will be represented many times in your bootstrap sample and others may not be represented at all. The bootstrap sample will be the same size as your training dataset.
The cases that are not selected for a tree's bootstrap sample are referred to as the OOB samples; an OOB error estimate is calculated by classifying each case using only the trees that did not see it. About 63% of the training cases appear at least once in a given bootstrap sample, so roughly 37% are out-of-bag for each tree.
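The 63% figure comes from sampling N cases with replacement: the chance that a given case is drawn at least once is 1 - (1 - 1/N)^N, which approaches 1 - 1/e as N grows. A small simulation sketch in Python (the function name and sample size are illustrative choices, not part of randomForest):

```python
import random


def bootstrap_in_bag_fraction(n_cases, seed=0):
    """Draw n_cases indices with replacement; return the fraction of distinct cases drawn."""
    rng = random.Random(seed)
    sample = [rng.randrange(n_cases) for _ in range(n_cases)]
    return len(set(sample)) / n_cases


n = 10000
simulated = bootstrap_in_bag_fraction(n)
expected = 1 - (1 - 1 / n) ** n  # close to 1 - 1/e, i.e. about 0.632

print(round(simulated, 3), round(expected, 3))
```

The remaining cases, roughly 37% per tree, are the out-of-bag cases used for the OOB error estimate.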
If you use caret in R, you will normally use caret::train(....) and specify the method as "rf" and trControl = trainControl(method = "repeatedcv"). You can change the trainControl method to "oob" if you want out-of-bag estimates. The way this works is as follows (I'm going to use a simple example of a 10-fold CV repeated 5 times): the training dataset is split into 10 folds of roughly equal size, and a number of trees will be built using only 9 folds, omitting the 1st fold (which is held out). The held-out fold is predicted by running its cases through the trees and is used to estimate performance measures. The first fold is returned to the training set and the procedure repeats with the 2nd fold held out, and so on. The process is repeated 10 times. This whole procedure can be repeated multiple times (in my example, I do this 5 times); for each of the 5 runs, the training dataset will be split into 10 slightly different folds. It should be noted that 50 different held-out samples are used to calculate model efficacy.
The key points to note are:
this involves sampling WITHOUT replacement - you split the training data, build a model on 9 folds, predict the held-out fold (the remaining 1 of the 10), and repeat this process as above
the model is built using a dataset that is smaller than the training dataset; this is different to the bootstrap method discussed above
You are using 2 different resampling techniques, which will yield different results; therefore they are not directly comparable. Repeated k-fold CV tends to have low bias for large k; where k is 2 or 3, the bias is high and comparable to that of the bootstrap method. K-fold CV tends to have high variance though...
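The repeated-CV bookkeeping described above (10 folds, 5 repeats, 50 held-out samples, no replacement within a repeat) can be checked with a short sketch. This illustrates only the splitting, under the assumption of a hypothetical dataset of 200 cases; it is not caret's implementation.

```python
import random
from collections import Counter


def repeated_k_fold(n_cases, k, repeats, seed=0):
    """Return a list of held-out index lists, k per repeat."""
    rng = random.Random(seed)
    held_out_sets = []
    for _ in range(repeats):
        indices = list(range(n_cases))
        rng.shuffle(indices)  # a fresh shuffle gives different folds each repeat
        held_out_sets.extend(indices[j::k] for j in range(k))
    return held_out_sets


held_out_sets = repeated_k_fold(n_cases=200, k=10, repeats=5)
times_held_out = Counter(i for fold in held_out_sets for i in fold)

print(len(held_out_sets))            # 10 folds x 5 repeats
print(set(times_held_out.values()))  # how often each case was held out
```

Unlike a bootstrap sample, each case is held out exactly once per repeat and never appears twice in the same training set.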

How to estimate amount of memory needed for binary classifier?

Say I wanna create a binary classifier for detecting SPAM messages. I have a billion training examples and about 20 features. I want my trained classifier to fit in memory (I will run it in the cloud, where disk operations, which are actually RPC calls, will be very expensive).
My question is: how can I estimate the amount of memory I'll need for it? Say my classifier is Random Forest and I know nothing about distribution of SPAM messages in my training set.
Only numbers: two classes, billion examples, 20 features.
Is such an estimation possible at all? How can it be done?
For spam classification you should probably run a linear classifier on word occurrences features + bigrams + domain names or ip addresses occurring in links + stuff extracted from the headers and the SMTP context.
In that case you can hash the features onto 2 ** 18 dimensions (using vowpal wabbit for instance); at 8 bytes per feature weight, that gives you a 2 MB model in memory.
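The 2 MB figure is simple arithmetic, sketched below together with a toy version of the hashing trick. The 8 bytes per weight are taken from the answer; the use of `zlib.crc32` and the example token are assumptions for illustration only and are not the hash function vowpal wabbit uses.

```python
import zlib

HASH_BITS = 18
BYTES_PER_WEIGHT = 8  # one 64-bit weight per hashed feature, as assumed in the answer

n_weights = 2 ** HASH_BITS
model_bytes = n_weights * BYTES_PER_WEIGHT
model_mib = model_bytes / 2 ** 20

print(n_weights, model_bytes, model_mib)


def hashed_index(feature_name):
    """Map an arbitrary feature string to a slot in the fixed-size weight vector."""
    return zlib.crc32(feature_name.encode("utf-8")) % n_weights


# Hypothetical token; any word, bigram, or domain name lands in the same table.
print(hashed_index("free-money.example.com"))
```

The model size depends only on the number of hash bits, not on the billion training examples or on how many distinct raw features occur.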
