random forest tuning - tree depth and number of trees - random-forest

I have basic question about tuning a random forest classifier. Is there any relation between the number of trees and the tree depth? Is it necessary that the tree depth should be smaller than the number of trees?

For most practical concerns, I agree with Tim.
Yet, other parameters do affect when the ensemble error converges as a function of added trees. I guess limiting the tree depth typically would make the ensemble converge a little earlier. I would rarely fiddle with tree depth, as though computing time is lowered, it does not give any other bonus. Lowering bootstrap sample size both gives lower run time and lower tree correlation, thus often a better model performance at comparable run-time.
A not so mentioned trick: When RF model explained variance is lower than 40%(seemingly noisy data), one can lower samplesize to ~10-50% and increase trees to e.g. 5000(usually unnecessary many). The ensemble error will converge later as a function of trees. But, due to lower tree correlation, the model becomes more robust and will reach a lower OOB error level converge plateau.
You see below samplesize gives the best long run convergence, whereas maxnodes starts from a lower point but converges less. For this noisy data, limiting maxnodes still better than default RF. For low noise data, the decrease in variance by lowering maxnodes or sample size does not make the increase in bias due to lack-of-fit.
For many practical situations, you would simply give up, if you only could explain 10% of variance. Thus is default RF typically fine. If your a quant, who can bet on hundreds or thousands of positions, 5-10% explained variance is awesome.
the green curve is maxnodes which kinda tree depth but not exactly.
library(randomForest)
X = data.frame(replicate(6,(runif(1000)-.5)*3))
ySignal = with(X, X1^2 + sin(X2) + X3 + X4)
yNoise = rnorm(1000,sd=sd(ySignal)*2)
y = ySignal + yNoise
plot(y,ySignal,main=paste("cor="),cor(ySignal,y))
#std RF
rf1 = randomForest(X,y,ntree=5000)
print(rf1)
plot(rf1,log="x",main="black default, red samplesize, green tree depth")
#reduced sample size
rf2 = randomForest(X,y,sampsize=.1*length(y),ntree=5000)
print(rf2)
points(1:5000,rf2$mse,col="red",type="l")
#limiting tree depth (not exact )
rf3 = randomForest(X,y,maxnodes=24,ntree=5000)
print(rf2)
points(1:5000,rf3$mse,col="darkgreen",type="l")

It is true that generally more trees will result in better accuracy. However, more trees also mean more computational cost and after a certain number of trees, the improvement is negligible. An article from Oshiro et al. (2012) pointed out that, based on their test with 29 data sets, after 128 of trees there is no significant improvement(which is inline with the graph from Soren).
Regarding the tree depth, standard random forest algorithm grow the full decision tree without pruning. A single decision tree do need pruning in order to overcome over-fitting issue. However, in random forest, this issue is eliminated by random selecting the variables and the OOB action.
Reference:
Oshiro, T.M., Perez, P.S. and Baranauskas, J.A., 2012, July. How many trees in a random forest?. In MLDM (pp. 154-168).

I agree with Tim that there is no thumb ratio between the number of trees and tree depth. Generally you want as many trees as will improve your model. More trees also mean more computational cost and after a certain number of trees, the improvement is negligible. As you can see in figure below, after sometime there is no significant improvement in error rate even if we are increasing no of tree.
The depth of the tree meaning length of tree you desire. Larger tree helps you to convey more info whereas smaller tree gives less precise info.So depth should large enough to split each node to your desired number of observations.
Below is example of short tree(leaf node=3) and long tree(leaf node=6) for Iris dataset: Short tree(leaf node=3) gives less precise info compared to long tree(leaf node=6).
Short tree(leaf node=3):
Long tree(leaf node=6):

It all depends on your data set.
I have an example where I was building the Random Forest classifier on Adult Income dataset and reducing the depth of trees (from 42 to 6) improved the performance of the model. The side effect of reducing the depth of trees was How can I reduce the long feature vector which is a list of double values? model size (in RAM and disk space after save)
Regarding the number of trees, I was doing the experiment on 72 classification tasks from OpenML-CC18 benchmark and I found that:
the more rows in the data, the more trees are needed,
the best performance is obtained by tuning the number of trees with 1 tree precision. Train large Random Forest (for example with 1000 trees) and then use validation data to find optimal number of trees.

Related

Maximum number of feature dimensions

I have a classification problem and my current feature vector does not seem to hold enough information.
My training set has 10k entries and I am using a SVM as classifier (scikit-learn).
What is the maximum reasonable feature vector size (how many dimension)?
(Training and evaluation using Labtop CPU)
100? 1k? 10k? 100k? 1M?
The thing is not how many features should it be for a certain number of cases (i.e. entries) but rather the opposite:
It’s not who has the best algorithm that wins. It’s who has the most data. (Banko and Brill, 2001)
Banko and Brill in 2001 made a comparison among 4 different algorithms, they kept increasing the Training Set Size to millions and came up with the above-quoted conclusion.
Moreover, Prof. Andrew Ng clearly covered this topic, and I’m quoting here:
If a learning algorithm is suffering from high variance, getting more training data is likely to help.
If a learning algorithm is suffering from high bias, getting more training data will not (by itself) help much
So as a rule of thumb, your data cases must be greater than the number of features in your dataset taking into account that all features should be informative as much as possible (i.e. the features are not highly collinear (i.e. redundant)).
I read once in more than one place and somewhere in Scikit-Learn Documentation, that the number of inputs (i.e. samples) must be at least the square size of the number of features (i.e. n_samples > n_features ** 2 ).
Nevertheless, for SVM in particular, the number of features n v.s number of entries m is an important factor to specify the type of kernel to use initially, as a second rule of thumb for SVM in particular (also according to Prof. Andrew Ng):
If thr number of features is much greater than number of entries (i.e. n is up to 10K and m is up to 1K) --> use SVM without a kernel (i.e. "linear kernel") or use Logistic Regression.
If the number of features is small and if the number of entries is intermediate (i.e. n is up to 1K and m is up to 10K) --> use SVM with Gaussian kernel.
If the number of feature is small and if the number of entries is much larger (i.e. n is up to 1K and m > 50K) --> Create/add more features, then use SVM without a kernel or use Logistic Regression.

Python/SKlearn: Using KFold Results in big ROC_AUC Variations

Based on data that our business department supplied to us, I used the sklearn decision tree algorithm to determine the ROC_AUC for a binary classification problem.
The data consists of 450 rows and there are 30 features in the data.
I used 10 times StratifiedKFold repetition/split of training and test data. As a result, I got the following ROC_AUC values:
0.624
0.594
0.522
0.623
0.585
0.656
0.629
0.719
0.589
0.589
0.592
As I am new in machine learning, I am unsure whether such a variation in the ROC_AUC values can be expected (with minimum values of 0.522 and maximum values of 0.719).
My questions are:
Is such a big variation to be expected?
Could it be reduced with more data (=rows)?
Will the ROC_AUC variance get smaller, if the ROC_AUC gets better ("closer to 1")?
Well, you do k-fold splits to actually evaluate how well your model generalizes.
Therefore, from your current results I would assume the following:
This is a difficult problem, the AUCs are usually low.
0.71 is an outlier, you were just lucky there (probably).
Important questions that will help us help you:
What is the proportion of the binary classes? Are they balanced?
What are the features? Are they all continuous? If categorical, are they ordinal or nominal?
Why Decision Tree? Have you tried other methods? Logistic Regression for instance is a good start before you move on to more advanced ML methods.
You should run more iterations, instead of k fold use the ShuffleSplit function and run at least 100 iterations, compute the Average AUC with 95% Confidence Intervals. That will give you a better idea of how well the models perform.
Hope this helps!
Is such a big variation to be expected?
This is a textbook case of high variance.
Depending on the difficulty of your problem, 405 training samples may not be enough for it to generalize properly, and the random forest may be too powerful.
Try adding some regularization, by limiting the number of splits that the trees are allowed to make. This should reduce the variance in your model, though you might expect a potentially lower average performance.
Could it be reduced with more data (=rows)?
Yes, adding data is the other popular way of lowering the variance of your model. If you're familiar with deep learning, you'll know that deep models usually need LOTS of samples to learn properly. That's because they are very powerful models with an intrinsically high variance, and therefore a lot of data is needed for them to generalize.
Will the ROC_AUC variance get smaller, if the ROC_AUC gets better ("closer to 1")?
Variance will decrease with regularization and adding data, it has no relation to the actual performance "number" that you get.
Cheers

Classification - What can we use when the number of dimensions is greater than the number of samples?

I've read on the scikit-learn website that SVM is a good choice when the number of dimensions is greater than the number of samples.
I would like to know what you think (as experienced users) is the more efficient in these cases where the class to predict is binary.
And especially what to do when the number of labeled samples is about 50.
Algorithms that should work ? Things to care about ?
If you have dense data, and n < d, you have a high risk of overfitting: each dimension or differences of dimensions could uniquely identify your training records.
That doesn't mean it won't work, only that you have to be extra careful when evaluating your method, because of the high risk of overfitting.
Methods that only use one dimension at a time - such as decision trees - could be less affected. In particular, if you apply pruning techniques to your tree.

big number of attributes best classifiers

I have dataset which is built from 940 attributes and 450 instance and I'm trying to find the best classifier to get the best results.
I have used every classifier that WEKA suggest (such as J48, costSensitive, combinatin of several classifiers, etc..)
The best solution I have found is J48 tree with accuracy of 91.7778 %
and the confusion matrix is:
394 27 | a = NON_C
10 19 | b = C
I want to get better reuslts in the confution matrix for TN and TP at least 90% accuracy for each.
Is there something that I can do to improve this (such as long time run classifiers which scans all options? other idea I didn't think about?
Here is the file:
https://googledrive.com/host/0B2HGuYghQl0nWVVtd3BZb2Qtekk/
Please help!!
I'd guess that you got a data set and just tried all possible algorithms...
Usually, it is a good to think about the problem:
to find and work only with relevant features(attributes), otherwise
the task can be noisy. Relevant features = features that have high
correlation with class (NON_C,C).
your dataset is biased, i.e. number of NON_C is much higher than C.
Sometimes it can be helpful to train your algorithm on the same portion of positive and negative (in your case NON_C and C) examples. And cross-validate it on natural (real) portions
size of your training data is small in comparison with the number of
features. Maybe increasing number of instances would help ...
...
There are quite a few things you can do to improve the classification results.
First, it seems that your training data is severly imbalanced. By training with that imbalance you are creating a significant bias in almost any classification algorithm
Second, you have a larger number of features than examples. Consider using L1 and/or L2 regularization to improve the quality of your results.
Third, consider projecting your data into a lower dimension PCA space, say containing 90 % of the variance. This will remove much of the noise in the training data.
Fourth, be sure you are training and testing on different portions of your data. From your description it seems like you are training and evaluating on the same data, which is a big no no.

Confusion related to kernel svm

I have this confusion related to kernel svm. I read that with kernel svm the number of support vectors retained is large. That's why it is difficult to train and is time consuming. I didn't get this part why is it difficult to optimize. Ok I can say that noisy data requires large number of support vectors. But what does it have to do with the training time.
Also I was reading another article where they were trying to convert non linear SVM kernel to linear SVM kernel. In the case of linear kernel it is just the dot product of the original features themselves. But in the case of non linear one it is RBF and others. I didn't get what they mean by "manipulating the kernel matrix imposes significant computational bottle neck". As far as I know, the kernel matrix is static isn't it. For linear kernel it is just the dot product of the original features. In the case of RBF it is using the gaussian kernel. So I just need to calculate it once, then I am done isn't it. So what's the point of manipulating and the bottleneck thinkg
Support Vector Machine (SVM) (Cortes and Vap- nik, 1995) as the state-of-the-art classification algo- rithm has been widely applied in various scientific do- mains. The use of kernels allows the input samples to be mapped to a Reproducing Kernel Hilbert S- pace (RKHS), which is crucial to solving linearly non- separable problems. While kernel SVMs deliver the state-of-the-art results, the need to manipulate the k- ernel matrix imposes significant computational bottle- neck, making it difficult to scale up on large data.
It's because the kernel matrix is a matrix that is N rows by N columns in size where N is the number of training samples. So imagine you have 500,000 training samples, then that would mean the matrix needs 500,000*500,000*8 bytes (1.81 terabytes) of RAM. This is huge and would require some kind of parallel computing cluster to deal with in any reasonable way. Not to mention the time it takes to compute each element. For example, if it took your computer 1 microsecond to compute 1 kernel evaluation then it would take 69.4 hours to compute the entire kernel matrix. For comparison, a good linear solver can handle a problem of this size in a few minutes or an hour on a regular desktop workstation. So that's why linear SVMs are preferred.
To understand why they are so much faster you have to take a step back and think about how these optimizers work. At the highest level you can think of them as searching for a function that gives the correct outputs on all the training samples. Moreover, most solvers are iterative in the sense that they have a current best guess at what this function should be and in each iteration they test it on the training data and see how good it is. Then they update the function in some way to improve it. They keep doing this until they find the best function.
Keeping this in mind, the main reason why linear solvers are so fast is because the function they are learning is just a dot product between a fixed size weight vector and a training sample. So in each iteration of the optimization it just needs to compute the dot product between the current weight vector and all the samples. This takes O(N) time, moreover, good solvers converge in just a few iterations regardless of how many training samples you have. So the working memory for the solver is just the memory required to store the single weight vector and all the training samples. This means the entire process takes only O(N) time and O(N) bytes of RAM.
A non-linear solver on the other hand is learning a function that is not just a dot product between a weight vector and a training sample. In this case, it is a function that is the sum of a bunch of kernel evaluations between a test sample and all the other training samples. So in this case, just evaluating the function you are learning against one training sample takes O(N) time. Therefore, to evaluate it against all training samples takes O(N^2) time. There have been all manner of clever tricks devised to try and keep the non-linear function compact to speed this up. But all of them are at least a little bit heuristic or approximate in some sense while good linear solvers find exact solutions. So that's part of the reason for the popularity of linear solvers.

Resources