Random Forest Train / Test meaning - machine-learning

I have the following:
rf = RandomForestClassifier(n_estimators=500, criterion='entropy', random_state=42)
rf.fit(X_train, y_train)
From this, I get:
1.0 accuracy on training set
0.6990116801437556 accuracy on test set
Since we're not setting the max_depth, it seems the trees are overfitting to the training data.
My question is: what does this tell us about the training data? Does the fact that it has reasonable accuracy imply that the test data is very like the training data and that's the only reason we're getting such an accuracy?

Since you don't specify max_depth, each tree grows until all of its leaves are pure. It is therefore natural for the forest to overfit, and 100% accuracy on the training set is expected (or at least very high accuracy, as long as the minimum number of samples per node is not too large).
This fact is not very informative about the training set.
The fact that you are getting "reasonable" accuracy on the test set could indeed point to a similarity between the training and test distributions (which is expected to some degree if both are drawn from the same phenomenon) and suggests that the forest generalizes to some extent.
As a general rule, I would say it is wrong to draw conclusions from a single result, especially when the model is overfitting the training set. Additionally, whether 0.69 counts as "good" accuracy depends on the problem at hand. A 30-point gap between training and test accuracy could be huge in many applications.
To understand your problem better and get more robust results, it would be better to evaluate the random forest with cross-validation.
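As a rough illustration of the cross-validation suggestion, here is a minimal sketch using scikit-learn's cross_validate. The synthetic data from make_classification is only a stand-in (an assumption) for the asker's X_train and y_train, and the number of trees is reduced to keep it quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the real data (assumption; substitute X_train, y_train).
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           random_state=42)

rf = RandomForestClassifier(n_estimators=100, criterion="entropy",
                            random_state=42)

# 5-fold cross-validation; return_train_score keeps the per-fold training
# accuracy so the train/validation gap can be seen on every fold.
cv = cross_validate(rf, X, y, cv=5, scoring="accuracy",
                    return_train_score=True)

train_mean = cv["train_score"].mean()
test_mean = cv["test_score"].mean()
test_std = cv["test_score"].std()

print(f"train accuracy:      {train_mean:.3f}")
print(f"validation accuracy: {test_mean:.3f} +/- {test_std:.3f}")
```

The spread across folds gives a sense of how much a single train/test result can move around, which is the point of not trusting one number.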

Related

How to quantify bias and variance given train data samples

I have a model that I train using polynomial and radial basis functions. I split the data into a train set and a test set, and I take a lot of samples from the train set. Now I'm at a loss for the next step. I know bias is the loss of the sample with the least loss. Do I calculate this on the train data or the test data? Is the variance just the variance of the losses on the test set?
The main goal of this tradeoff is to find the right amount of complexity for the decision boundary.
High complexity: the model could memorize the past and may not generalize to the future (high-variance problem).
Low complexity: the model could fail to learn enough from the past because its decision boundary is too simple, and again may fail to predict well (high-bias problem).
This could be shown simply with a figure (not preserved here).
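One way to put numbers on the two terms, sketched under stated assumptions: for squared loss, fit the same model class on many independent training samples, then look at the predictions on fixed test points. The squared gap between the average prediction and the truth estimates bias, and the spread of the predictions estimates variance. This is the standard squared-error decomposition, not necessarily the definition the question has in mind, and true_fn, the noise level and the polynomial degrees below are all hypothetical choices made so that the truth is known.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    # Hypothetical ground truth; known here only because the data is synthetic.
    return np.sin(2 * np.pi * x)

# Fixed held-out points where bias and variance are evaluated.
x_test = np.linspace(0.0, 1.0, 50)

def bias_variance(degree, n_repeats=200, n_train=30, noise=0.3):
    """Estimate squared bias and variance of a polynomial fit of `degree`."""
    preds = np.empty((n_repeats, x_test.size))
    for i in range(n_repeats):
        # Draw a fresh noisy training sample and fit the polynomial to it.
        x_train = rng.uniform(0.0, 1.0, n_train)
        y_train = true_fn(x_train) + rng.normal(0.0, noise, n_train)
        coefs = np.polyfit(x_train, y_train, degree)
        preds[i] = np.polyval(coefs, x_test)
    mean_pred = preds.mean(axis=0)
    bias_sq = np.mean((mean_pred - true_fn(x_test)) ** 2)  # average squared bias
    variance = np.mean(preds.var(axis=0))                  # average variance
    return bias_sq, variance

for degree in (1, 3, 7):
    b, v = bias_variance(degree)
    print(f"degree {degree}: bias^2 = {b:.4f}, variance = {v:.4f}")
```

Both quantities are computed from predictions at the test points, across models trained on different samples, rather than from losses on any single training sample.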

Test accuracy is greater than train accuracy what to do?

I am using a random forest. My test accuracy is 70%, while my train accuracy is 34%. What should I do? How can I solve this problem?
Test accuracy should not be higher than train accuracy, since the model is optimized for the latter. Ways in which this behavior might happen:
You did not use the same source dataset for the test set. You should do a proper train/test split in which both parts have the same underlying distribution. Most likely you provided a completely different (and more agreeable) dataset for testing.
An unreasonably high degree of regularization was applied. Even so, there would need to be some element of "test data distribution is not the same as that of train" for the observed behavior to occur.
The other answers are correct in most cases. But I'd like to offer another perspective. There are specific training regimes that could cause the training data to be harder for the model to learn - for instance, adversarial training or adding Gaussian noise to the training examples. In these cases, the benign test accuracy could be higher than train accuracy, because benign examples are easier to evaluate. This isn't always a problem, however!
If this applies to you, and the gap between train and test accuracies is larger than you'd like (~30%, as in your question, is a pretty big gap), then this indicates that your model is underfitting the harder patterns, so you'll need to increase the expressiveness of your model. In the case of random forests, this might mean training the trees to a greater depth.
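To see how tree depth changes what a forest can fit, a small sweep over max_depth is one option. This is only a sketch on synthetic data (an assumption, since the asker's dataset is not available), and the depth values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic placeholder data (assumption, not the asker's dataset).
X, y = make_classification(n_samples=1000, n_features=20, n_informative=8,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

results = {}
for depth in (2, 5, 10, None):  # None lets each tree grow until leaves are pure
    rf = RandomForestClassifier(n_estimators=100, max_depth=depth,
                                random_state=0)
    rf.fit(X_train, y_train)
    # Store (train accuracy, test accuracy) for this depth.
    results[depth] = (rf.score(X_train, y_train), rf.score(X_test, y_test))
    print(f"max_depth={str(depth):>4}: train={results[depth][0]:.3f} "
          f"test={results[depth][1]:.3f}")
```

If train accuracy stays low even for deep trees, the limit is probably not model capacity, and the data or the split deserves a closer look.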
First you should check the data that is used for training. I think there is some problem with the data, the data may not be properly pre-processed.
Also, in this case, you should try more epochs. Plot the learning curve to analyze when the model is going to converge.
You should check the following:
Both training and validation accuracy scores should increase, and loss should decrease.
If this stops holding after a particular epoch, train your model only up to that epoch, because it is overfitting after that.

Is it possible to overfit on 250,000 examples in a few epochs?

Generally speaking, is it possible to tell if training a given neural network of depth X on Y training examples for Z epochs is likely to overfit? Or can overfitting only be detected for sure by looking at loss and accuracy graphs of training vs test set?
Concretely I have ~250,000 examples, each of which is a flat image 200x200px. The model is a CNN with about 5 convolution + pooling layers, followed by 2 dense layers with 1024 units each. The model classifies 12 different classes. I've been training it for about 35 hours with ~90% accuracy on training set and ~80% test set.
Generally speaking, is it possible to tell if training a given neural network of depth X on Y training examples for Z epochs is likely to overfit?
Generally speaking, no. Fitting deep learning models is still an almost exclusively empirical art, and the theory behind it is still (very) poor. And although by gaining more and more experience one is more likely to tell beforehand if a model is prone to overfit, the confidence will generally be not high (extreme cases excluded), and the only reliable judge will be the experiment.
Elaborating a little further: if you take the Keras MNIST CNN example and remove the intermediate dense layer(s) (the previous version of the script used to include 2x200 dense layers instead of 1x128 now), thus keeping only conv/pooling layers and the final softmax one, you will end up with ~ 98.8% test accuracy after only 20 epochs, but I am unaware of anyone that could reliably predict this beforehand...
Or can overfitting only be detected for sure by looking at loss and accuracy graphs of training vs test set?
Exactly, this is the only safe way. The telltale signature of overfitting is the divergence of the learning curves (training error still decreasing, while validation or test error heading up). But even if we have diagnosed overfitting, the cause might not be always clear-cut (see a relevant question and answer of mine here).
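The divergence described above can also be checked programmatically once the per-epoch losses are recorded. The helper below is a hypothetical sketch (first_divergence_epoch and its patience parameter are not from any library), and the loss values are made up for illustration.

```python
def first_divergence_epoch(train_loss, val_loss, patience=3):
    """Return the epoch index with the best validation loss, provided that
    validation loss has not improved for `patience` epochs while training
    loss is lower than it was at that best epoch. Return None otherwise."""
    best_epoch = 0
    for epoch in range(1, len(val_loss)):
        if val_loss[epoch] < val_loss[best_epoch]:
            best_epoch = epoch  # validation is still improving
        elif epoch - best_epoch >= patience:
            # Validation has stalled; check that training kept improving.
            if train_loss[epoch] < train_loss[best_epoch]:
                return best_epoch
    return None

# Made-up curves: training loss keeps falling, validation loss turns upward.
train_loss = [1.00, 0.80, 0.60, 0.50, 0.40, 0.30, 0.25, 0.20]
val_loss = [1.10, 0.90, 0.80, 0.75, 0.78, 0.82, 0.90, 1.00]

print(first_divergence_epoch(train_loss, val_loss))
```

A check like this only flags the pattern; as the answer notes, it does not say why the curves diverge.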
~90% accuracy on training set and ~80% test set
Again very generally speaking and only in principle, this does not sound bad for a problem with 12 classes. You already seem to know that, if you are worried about possible overfitting, it is the curves rather than the values themselves (or the training time) that you have to monitor.
On the more general topic of the poor theory behind deep learning models as related to the subject of model interpretability, you might find this answer of mine useful...

Why test accuracy remains constant and do not increase in binary classification when test and train dataset are from different source

I have a train dataset and a test dataset from two different sources. That is, they come from two different experiments, but both produce the same kind of biological images. I want to do binary classification using a deep CNN, and I have the following results for test accuracy and train accuracy. The blue line shows train accuracy and the red line shows test accuracy after almost 250 epochs. Why is the test accuracy almost constant and not rising? Is that because the test and train datasets come from different distributions?
Edited:
After adding a dropout layer, regularization terms and mean subtraction, I still get the following strange results, which suggest the model is overfitting from the beginning!
There could be two reasons. First, you may be overfitting the training data. You can check this by comparing the validation score with the test score. If so, you can use standard techniques to combat overfitting, such as weight decay and dropout.
The second is that your test data is too different from the training data to be learned like this. This is harder to solve. You should first look at the value spread of both sets of images. Are they both normalized? Note that Matplotlib normalizes automatically when plotting images. If this still does not work, you might want to look into augmentation to make your training data more similar to the test data. I cannot tell you what to use here without seeing both the training set and the test set.
Edit:
For normalization, the test set and the training set should have a similar value spread. If you do dataset-level normalization, you calculate the mean and standard deviation on the training set. You then need to apply those same values to the test set, rather than computing separate statistics from the test set. This only makes sense if the value spread is similar for the training and test sets. If that is not the case, you might want to do per-sample normalization first.
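A minimal NumPy sketch of both options, assuming the images are stacked in arrays of shape (n_images, height, width). The random arrays and the helper name per_sample_normalize are placeholders for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Placeholder image stacks (assumption): 100 training and 20 test images.
train_images = rng.uniform(0, 255, size=(100, 32, 32)).astype(np.float32)
test_images = rng.uniform(0, 255, size=(20, 32, 32)).astype(np.float32)

# Dataset-level normalization: statistics come from the training set only.
mean = train_images.mean()
std = train_images.std()

train_norm = (train_images - mean) / std
test_norm = (test_images - mean) / std  # reuse the training statistics

def per_sample_normalize(images, eps=1e-8):
    """Normalize every image by its own mean and standard deviation."""
    flat = images.reshape(len(images), -1)
    m = flat.mean(axis=1, keepdims=True)
    s = flat.std(axis=1, keepdims=True)
    return ((flat - m) / (s + eps)).reshape(images.shape)

test_per_sample = per_sample_normalize(test_images)
```

Per-sample normalization removes differences in brightness and contrast between the two sources, at the cost of discarding any absolute intensity information.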
Other augmentations that are commonly used for any dataset are oversampling, random channel shifts, random rotations, random translations and random zoom. These make the model invariant to those operations.

Machine Learning Experiment Design with Small Positive Sample Set in Sci-kit Learn

I am interested in any tips on how to train a set with a very limited positive set and a large negative set.
I have about 40 positive examples (quite lengthy articles about a particular topic) and about 19,000 negative samples (most drawn from the scikit-learn newsgroups dataset). I also have about 1,000,000 tweets that I could work with, all negative with respect to the topic I am trying to train on. Is the size of the negative set relative to the positive set going to negatively influence training a classifier?
I would like to use cross-validation in scikit-learn. Do I need to break this into train / test-dev / test sets? I know there are some pre-built libraries in scikit-learn. Any implementation examples that you recommend or have used previously would be helpful.
Thanks!
The answer to your first question is yes; how much it affects your results depends on the algorithm. My advice would be to keep an eye on per-class statistics such as recall and precision (found in classification_report).
For RandomForest() you can look at this thread, which discusses the sample weight parameter. In general, sample_weight is what you're looking for in scikit-learn.
For SVMs, have a look at either this example or this example.
For NB classifiers, this should be handled implicitly by Bayes' rule; however, in practice you may see some poor performance.
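For the random forest case, a sketch of passing sample_weight to fit and then reading the per-class precision and recall might look like this. The imbalanced synthetic data is an assumption standing in for the articles and newsgroup posts, and the "balanced" weighting is just one reasonable choice.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_sample_weight

# Imbalanced synthetic data (assumption): roughly 3% positives.
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.97, 0.03],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Weight each sample inversely to the frequency of its class.
weights = compute_sample_weight(class_weight="balanced", y=y_train)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train, sample_weight=weights)

# Per-class precision and recall say more than overall accuracy here.
report = classification_report(y_test, clf.predict(X_test), zero_division=0)
print(report)
```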
For your second question, it's up for discussion. Personally, I break my data into a training and test split, perform cross-validation on the training set for parameter estimation, retrain on all the training data, and then test on my test set. However, the amount of data you have may influence the way you split it (more data means more options).
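That workflow maps fairly directly onto GridSearchCV, roughly as sketched here. The data, the parameter grid and the F1 scoring are illustrative assumptions rather than recommendations for the actual problem.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic imbalanced data as a placeholder (assumption).
X, y = make_classification(n_samples=1500, n_features=20, weights=[0.9, 0.1],
                           random_state=0)

# 1. Hold out a test set that is never touched during tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# 2. Cross-validate on the training set only to choose hyperparameters.
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid={"max_depth": [4, None], "class_weight": [None, "balanced"]},
    scoring="f1",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    refit=True,  # 3. Retrain the best configuration on all the training data.
)
search.fit(X_train, y_train)

# 4. Report performance once on the held-out test set.
test_f1 = search.score(X_test, y_test)
print(search.best_params_, f"test F1 = {test_f1:.3f}")
```

Stratified folds keep some positives in every fold, which matters when there are only a few dozen of them.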
You could probably use Random Forest for your classification problem. There are basically three parameters for dealing with data imbalance: class weight, sample size and cutoff.
Class weight: the higher the weight a class is given, the more its error rate is decreased.
Sample size: oversample the minority class when drawing the sample for each tree, to reduce class imbalance (not sure whether scikit-learn supports this; it used to be a parameter in R).
Cutoff: if more than x% of the trees vote for the minority class, classify the sample as the minority class. By default x is 50 for a 2-class problem in Random Forest. You can set it to a lower value for the minority class.
Check out balancing predict error at https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
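In scikit-learn, the class weight and cutoff ideas can be approximated as sketched below. Two caveats: scikit-learn's predict_proba averages the per-tree class probabilities rather than counting hard votes, so this is not identical to the vote cutoff described above, and predict_with_cutoff is a hypothetical helper, not a library function. The data and cutoff values are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data (assumption): roughly 5% minority class.
X, y = make_classification(n_samples=3000, n_features=20, weights=[0.95, 0.05],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# class_weight="balanced" weights classes inversely to their frequency.
rf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                            random_state=0)
rf.fit(X_train, y_train)

# Averaged per-tree probability of the minority class (label 1).
minority_score = rf.predict_proba(X_test)[:, 1]

def predict_with_cutoff(scores, cutoff):
    """Label a sample as minority (1) when its score reaches the cutoff."""
    return (scores >= cutoff).astype(int)

recalls = {}
for cutoff in (0.5, 0.3, 0.1):
    pred = predict_with_cutoff(minority_score, cutoff)
    recalls[cutoff] = recall_score(y_test, pred)
    precision = precision_score(y_test, pred, zero_division=0)
    print(f"cutoff={cutoff}: recall={recalls[cutoff]:.3f} "
          f"precision={precision:.3f}")
```

Lowering the cutoff can only add predicted positives, so minority recall never drops, but precision usually pays for it.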
For the second question: if you are using Random Forest, you do not need to keep separate train/validation/test sets. Random Forest does not choose any parameters based on a validation set, so a validation set is unnecessary.
Also, during the training of a Random Forest, the data for each individual tree is obtained by sampling with replacement from the training data, so each training sample is left out of roughly 1/3 of the trees. We can use the votes of those trees to obtain an out-of-bag (OOB) prediction for each sample. Thus, with OOB accuracy you only need a training set, and not validation or test data, to estimate performance on unseen data. See the out-of-bag error section at https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm for further study.
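A short sketch of the out-of-bag estimate in scikit-learn, using the oob_score option. The data is synthetic (an assumption), and a held-out test set is kept here only to compare the two numbers, not because the method needs it.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic placeholder data (assumption).
X, y = make_classification(n_samples=1000, n_features=20, n_informative=8,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Chance that a given sample is absent from one bootstrap sample of size n.
# This tends to exp(-1), about 0.368, which is the "roughly 1/3" above.
n = len(X_train)
left_out_fraction = (1 - 1 / n) ** n

# oob_score=True scores each sample using only the trees that never saw it.
rf = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=0)
rf.fit(X_train, y_train)

oob_accuracy = rf.oob_score_
test_accuracy = rf.score(X_test, y_test)
print(f"left-out fraction per tree: {left_out_fraction:.3f}")
print(f"OOB accuracy:  {oob_accuracy:.3f}")
print(f"test accuracy: {test_accuracy:.3f}")
```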

Resources