Is this classification model overfitting? - machine-learning

I am performing a url classification (phishing - nonphishing) and I plotted the learning curves (training vs cross validation score) for my model (Gradient Boost).
My View
It seems that these two curves converge and the difference is not significant (it's normal for the training set to have a slightly higher accuracy). (Figure 1)
The Question
I have limited experience in machine learning, so I am asking for your opinion. Is the way I am approaching the problem right? Is this model fine, or is it overfitting?
Note: The classes are balanced and the features are well chosen
Relevant code
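import numpy as np
from sklearn.model_selection import StratifiedKFold
from yellowbrick.model_selection import LearningCurve

def plot_learning_curves(X, y, model):
    # Create the learning curve visualizer with stratified 5-fold CV
    cv = StratifiedKFold(n_splits=5)
    # Evaluate at 8 training-set sizes, from 10% to 100% of the data
    sizes = np.linspace(0.1, 1.0, 8)
    visualizer = LearningCurve(model, cv=cv, train_sizes=sizes, n_jobs=4)
    visualizer.fit(X, y)  # Fit the data to the visualizer
    visualizer.show()  # show() replaces the deprecated poof() in Yellowbrick 1.0+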

Firstly, in your graph there are 8 different models.
It's hard to tell if one of them is overfitting, because overfitting is detected with an "epoch vs performance (train / valid)" graph (there would be 8 in your case).
Overfitting means that, after a certain number of epochs, as the number of epochs increases, training accuracy goes up while validation accuracy goes down. This can be the case, for example, when you have too few data points relative to the complexity of your problem, so your model ends up relying on spurious correlations.
With your graph, what we can say is that the complexity of your problem seems to require a "high" number of training instances, because your validation performance keeps increasing as you add more training instances. There is a chance that the model with <10000 instances is overfitting, but your >50000 model could be overfitting too, and we don't see it because you are using early stopping!
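For a gradient boosting model, one way to get the "epoch vs performance" view described above is to score the model after every boosting stage, which plays the role of an epoch here. The following is only a minimal sketch, assuming scikit-learn's GradientBoostingClassifier and synthetic data from make_classification standing in for the real URL features; staged_accuracy is a hypothetical helper name.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def staged_accuracy(model, X, y):
    """Accuracy after each boosting stage (hypothetical helper)."""
    return np.array([accuracy_score(y, pred) for pred in model.staged_predict(X)])


# Synthetic stand-in for the real URL features (assumption, not the asker's data).
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = GradientBoostingClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

train_acc = staged_accuracy(model, X_train, y_train)
valid_acc = staged_accuracy(model, X_valid, y_valid)

# Stage with the best validation accuracy (stages are reported 1-indexed).
best_stage = int(np.argmax(valid_acc)) + 1
print(f"best validation accuracy {valid_acc.max():.3f} at stage {best_stage}")
```

If validation accuracy peaks early and then drifts down while training accuracy keeps climbing, that is the overfitting pattern; plotting train_acc and valid_acc against the stage number makes it visible.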
Hope it helps

Related

Test accuracy is greater than train accuracy what to do?

I am using a random forest. My test accuracy is 70%, whereas my train accuracy is 34%. What should I do? How can I solve this problem?
Test accuracy should not be higher than train accuracy, since the model is optimized for the latter. Ways in which this behavior might happen:
You did not use the same source dataset for test. You should do a proper train/test split in which both parts have the same underlying distribution. Most likely you provided a completely different (and more agreeable) dataset for test.
An unreasonably high degree of regularization was applied. Even so, there would need to be some element of "test data distribution is not the same as that of train" for the observed behavior to occur.
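A "proper train/test split" from a single source can be sketched as below. This assumes scikit-learn's train_test_split and uses random placeholder arrays, not the asker's data; stratifying on the labels keeps the class proportions nearly identical in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))             # placeholder features
y = (rng.random(1000) < 0.3).astype(int)   # placeholder labels, roughly 30% positives

# Shuffle and split one dataset so train and test share the same distribution.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, stratify=y, random_state=0
)

print("positive rate, train:", y_train.mean())
print("positive rate, test: ", y_test.mean())
```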
The other answers are correct in most cases. But I'd like to offer another perspective. There are specific training regimes that could cause the training data to be harder for the model to learn - for instance, adversarial training or adding Gaussian noise to the training examples. In these cases, the benign test accuracy could be higher than train accuracy, because benign examples are easier to evaluate. This isn't always a problem, however!
If this applies to you, and the gap between train and test accuracies is larger than you'd like (~30%, as in your question, is a pretty big gap), then this indicates that your model is underfitting the harder patterns, so you'll need to increase the expressiveness of your model. In the case of random forests, this might mean training the trees to a greater depth.
First you should check the data that is used for training. I think there is some problem with the data, the data may not be properly pre-processed.
Also, in this case, you should try more epochs. Plot the learning curve to analyze when the model is going to converge.
You should check the following:
1. Both training and validation accuracy scores should increase, and loss should decrease.
2. If something goes wrong in step 1 after a particular epoch, train your model only up to that epoch, because your model is over-fitting after that.

Is it possible to overfit on 250,000 examples in a few epochs?

Generally speaking, is it possible to tell if training a given neural network of depth X on Y training examples for Z epochs is likely to overfit? Or can overfitting only be detected for sure by looking at loss and accuracy graphs of training vs test set?
Concretely I have ~250,000 examples, each of which is a flat image 200x200px. The model is a CNN with about 5 convolution + pooling layers, followed by 2 dense layers with 1024 units each. The model classifies 12 different classes. I've been training it for about 35 hours with ~90% accuracy on training set and ~80% test set.
Generally speaking, is it possible to tell if training a given neural network of depth X on Y training examples for Z epochs is likely to overfit?
Generally speaking, no. Fitting deep learning models is still an almost exclusively empirical art, and the theory behind it is still (very) poor. And although by gaining more and more experience one is more likely to tell beforehand if a model is prone to overfit, the confidence will generally be not high (extreme cases excluded), and the only reliable judge will be the experiment.
Elaborating a little further: if you take the Keras MNIST CNN example and remove the intermediate dense layer(s) (the previous version of the script used to include 2x200 dense layers instead of 1x128 now), thus keeping only conv/pooling layers and the final softmax one, you will end up with ~ 98.8% test accuracy after only 20 epochs, but I am unaware of anyone that could reliably predict this beforehand...
Or can overfitting only be detected for sure by looking at loss and accuracy graphs of training vs test set?
Exactly, this is the only safe way. The telltale signature of overfitting is the divergence of the learning curves (training error still decreasing, while validation or test error heading up). But even if we have diagnosed overfitting, the cause might not be always clear-cut (see a relevant question and answer of mine here).
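The divergence signature can also be checked programmatically once the curves are recorded. Below is a rough sketch in plain Python; first_divergence and its window parameter are hypothetical, and the two curves are made-up numbers for illustration only.

```python
def first_divergence(train_err, val_err, window=3):
    """Return the index of the first epoch of a run of `window` consecutive
    epochs in which training error falls while validation error rises,
    or None if no such run exists."""
    if len(train_err) != len(val_err):
        raise ValueError("curves must have the same length")
    run = 0
    for i in range(1, len(train_err)):
        if train_err[i] < train_err[i - 1] and val_err[i] > val_err[i - 1]:
            run += 1
            if run == window:
                return i - window + 1
        else:
            run = 0
    return None


# Made-up curves for illustration only.
train_err = [0.50, 0.40, 0.32, 0.26, 0.21, 0.17, 0.14, 0.12]
val_err = [0.52, 0.44, 0.38, 0.35, 0.34, 0.36, 0.39, 0.43]
print(first_divergence(train_err, val_err))  # 5
```

Requiring several consecutive epochs is a design choice to avoid reacting to a single noisy validation measurement.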
~90% accuracy on training set and ~80% test set
Again very generally speaking and only in principle, this does not sound bad for a problem with 12 classes. You already seem to know that, if you worry for possible overfitting, it is the curves rather than the values themselves (or the training time) that you have to monitor.
On the more general topic of the poor theory behind deep learning models as related to the subject of model interpretability, you might find this answer of mine useful...

Test accuracy vs Training time on Weka

From what I know, test accuracy should increase as training time increases (up to some point), but experimenting with Weka yielded the opposite. I am wondering if I misunderstood something.
I used diabetes.arff for classification, with 70% for training and 30% for testing. I used the MultilayerPerceptron classifier and tried training times of 100, 500, 1000, 3000, 5000 and 10000.
Here are my results,
Training time    Accuracy
100              75.2174 %
500              75.2174 %
1000             74.7826 %
3000             72.6087 %
5000             70.4348 %
10000            68.6957 %
What can be the reason for this? Thank you!
You got a very nice example of overfitting.
Here is the short explanation of what happened:
Your model (it doesn't matter whether this is a multilayer perceptron, decision trees or literally anything else) can fit the training data in two ways.
The first is generalization: the model tries to find patterns and trends and uses them to make predictions. The second is memorizing the exact data points from the training dataset.
Imagine a computer vision task: classify images into two categories, humans vs trucks. A good model will find common features that are present in pictures of humans but not in pictures of trucks (smooth curves, skin-colored surfaces). This is generalization. Such a model will be able to handle new pictures pretty well. A bad, overfitted model will just memorize the exact images, the exact pixels of the training dataset, and will have no idea what to do with new images in the test set.
What can you do to prevent overfitting?
There are a few common approaches to deal with overfitting:
Use simpler models. With fewer parameters, it will be difficult for a model to remember the dataset
Use regularization. Constrain the weights of the model and/or use dropout in your perceptron.
Stop the training process. Split your training data once more, so you will have three parts of the data: training, dev, and test. Then train your model using training data only and stop the training when the error on the dev set stopped decreasing.
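The "stop the training process" approach is usually implemented with a patience counter. Here is a minimal sketch in plain Python; early_stopping_epoch, the patience value and the dev errors are all hypothetical and only illustrate the stopping rule, not any particular library's implementation.

```python
def early_stopping_epoch(dev_errors, patience=2):
    """Return (best_epoch, stop_epoch), both 0-indexed.

    Training stops once the dev error has failed to improve on its best
    value for `patience` consecutive epochs; the weights from best_epoch
    would be the ones to keep.
    """
    best_epoch, best_error, waited = 0, float("inf"), 0
    for epoch, error in enumerate(dev_errors):
        if error < best_error:
            best_epoch, best_error, waited = epoch, error, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch
    # Never triggered: training ran to the last recorded epoch.
    return best_epoch, len(dev_errors) - 1


dev_errors = [0.30, 0.26, 0.24, 0.25, 0.27, 0.31]  # made-up dev-set errors
print(early_stopping_epoch(dev_errors))  # (2, 4)
```

The test set is deliberately not involved in this decision, so it stays an unbiased estimate of final performance.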
A good starting point to read about overfitting is Wikipedia: https://en.wikipedia.org/wiki/Overfitting

What does this learning curve show ? And how to handle non representativity of a sample?

==> to see learning curves
I am trying a random forest regressor for a machine learning problem (price estimation of spatial points). I have a sample of spatial points in a city. The sample is not randomly drawn since there are very few observations downtown. And I want to estimate prices for all addresses in the city.
I have a good cross validation score (absolute mean squared error) and also a good test score after splitting the training set. But predictions are very bad.
What could explain these results?
I plotted the learning curve (link above): the cross validation score increases with the number of instances (that sounds logical), while the training score remains high (should it decrease?). What do these learning curves show? And in general, how do we "read" learning curves?
Moreover, I suppose that the sample is not representative. I tried to make the dataset for which I want predictions spatially similar to the training set by drawing without replacement according to the proportions of observations in each district of the training set. But this didn't change the result. How can I handle this non-representativity?
Thanks in advance for any help
There are a few common cases that pop up when looking at training and cross-validation scores:
Overfitting: When your model has a very high training score but a poor cross-validation score. Generally this occurs when your model is too complex, allowing it to fit the training data exceedingly well but giving it poor generalization to the validation dataset.
Underfitting: When neither the training nor the cross-validation scores are high. This occurs when your model is not complex enough.
Ideal fit: When both the training and cross-validation scores are fairly high. Your model not only learns to represent the training data, but it generalizes well to new data.
Here's a nice graphic from this Quora post showing how model complexity and error relate to the type of fit a model exhibits.
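The three cases above can be turned into a rough rule of thumb. The sketch below is only illustrative: diagnose_fit is a hypothetical helper, and the good and max_gap thresholds are assumptions that depend entirely on the problem and the metric.

```python
def diagnose_fit(train_score, cv_score, good=0.85, max_gap=0.05):
    """Label a (train, cross-validation) score pair.

    `good` and `max_gap` are illustrative thresholds, not standard values;
    scores are assumed to lie in [0, 1] with higher being better.
    """
    if train_score < good and cv_score < good:
        return "underfitting"
    if train_score - cv_score > max_gap:
        return "overfitting"
    return "ideal fit"


for train, cv in [(0.99, 0.80), (0.62, 0.60), (0.93, 0.91)]:
    print(train, cv, diagnose_fit(train, cv))
```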
In the plot above, the errors for a given complexity are the errors found at equilibrium. In contrast, learning curves show how the score progresses throughout the entire training process. Generally you never want to see the score decreasing during training, as this usually means your model is diverging. But the difference between the training and validation scores as they move forward in time (towards equilibrium) indicates how well your model is fitting.
Notice that even when you have an ideal fit (middle of complexity axis) it is common to see a training score that's higher than the cross-validation score, since the model's parameters are updated using the training data. But since you're getting poor predictions, and since validation score is ~10% lower than training score (assuming the score is out of 1), I would guess that your model is overfitting and could benefit from less complexity.
To answer your second point, models will generalize better if the training data is a better representation of the validation data. So when splitting the data into training and validation sets, I recommend finding a way to randomly segregate the data. For example, you could generate a list of all the points in the city, iterate over the list, and for each point draw from a uniform distribution to decide which dataset that point belongs to.
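The per-point uniform draw described above can be sketched in a few lines of plain Python. The function name random_split, the 20% validation fraction and the grid of placeholder points are assumptions for illustration.

```python
import random


def random_split(points, validation_fraction=0.2, seed=0):
    """Assign each point to train or validation with an independent uniform draw."""
    rng = random.Random(seed)
    train, validation = [], []
    for point in points:
        # One uniform draw in [0, 1) per point decides its dataset.
        if rng.random() < validation_fraction:
            validation.append(point)
        else:
            train.append(point)
    return train, validation


# Placeholder grid of (x, y) locations standing in for the city's points.
points = [(x, y) for x in range(50) for y in range(50)]
train, validation = random_split(points)
print(len(train), len(validation))
```

Because each point is drawn independently, the validation share is only approximately 20%, but neither set is biased toward a particular district.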

Cross Validation in Classification

I have two different datasets, dataset X and dataset Y, from which I calculate features to use for classification.
Case 1. When I combine both together as one large dataset and then use 10-fold cross validation, I get very good classification results, with accuracy and AUC > 95%.
Case 2. Yet if I use one of the datasets for training and the other for testing, results fall severely, with both accuracy and AUC becoming ~50%.
My questions are:
Which of the cases' results is more reliable??
And why the huge difference in results??
Thanks..
There could be a bias in the way the datasets were obtained that makes you get worse results.
Read this.
Another thing is that in one case you are training your classifier with a smaller dataset (the two combined is larger, assuming they are about the same size, even with the 10-fold cross validation). This tends to cause poorer performance.
So my answers would be:
Depends on how you obtained both datasets and on how the final classifier will be used.
Differences in the size of the training set and bias on how they are obtained.
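To illustrate how a collection bias alone can produce this gap, here is a toy sketch, assuming scikit-learn and fully synthetic data: the two datasets contain the same task, but one feature is shifted by a constant offset in the second dataset. This is one possible mechanism, not a diagnosis of the asker's data; make_dataset and the offsets are hypothetical.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)


def make_dataset(offset, n_per_class=200):
    """Two well-separated classes on one feature, shifted by `offset`."""
    x0 = rng.normal(loc=offset, scale=0.3, size=n_per_class)        # class 0
    x1 = rng.normal(loc=offset + 2.0, scale=0.3, size=n_per_class)  # class 1
    X = np.concatenate([x0, x1]).reshape(-1, 1)
    y = np.array([0] * n_per_class + [1] * n_per_class)
    return X, y


X_a, y_a = make_dataset(offset=0.0)    # plays the role of dataset X
X_b, y_b = make_dataset(offset=10.0)   # plays the role of dataset Y (shifted feature)

# Case 1: pool both datasets, then run 10-fold cross-validation.
X_pool = np.concatenate([X_a, X_b])
y_pool = np.concatenate([y_a, y_b])
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
pooled_acc = cross_val_score(
    DecisionTreeClassifier(random_state=0), X_pool, y_pool, cv=cv
).mean()

# Case 2: train on one dataset, test on the other.
cross_acc = DecisionTreeClassifier(random_state=0).fit(X_a, y_a).score(X_b, y_b)

print(f"pooled 10-fold accuracy: {pooled_acc:.3f}")
print(f"train on A, test on B:   {cross_acc:.3f}")
```

In the pooled case every fold sees examples from both datasets, so the tree learns a threshold for each; in the cross-dataset case the tree has never seen the shifted range and assigns every test point to one class, which gives chance-level accuracy on balanced data.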
