Training a deep learning model; after approximately 100 epochs:
Train accuracy : 93 %
Test accuracy : 54 %
Then, as training continues, training accuracy increases and test accuracy decreases:
Train accuracy : 94 %
Test accuracy : 53 %
Train accuracy : 95 %
Test accuracy : 52 %
Train accuracy : 96 %
Test accuracy : 51 %
For the initial version of the model we are satisfied with 54% accuracy, but I don't know what to make of training accuracy increasing while test accuracy decreases, other than that the model is overfitting. Should I stop training the model and use the parameters from the point where the maximum test accuracy was achieved, in this case 54%?
What knowledge can I gain from observing training accuracy increasing while test accuracy decreases? Is this an example of progressively stronger overfitting?
Yes, this is definitely overfitting. You should stop the training procedure at the point where the test accuracy stops increasing. By the numbers you show, your model is in fact overfitting quite a lot. You should consider adding regularization, which may increase the test accuracy.
(Me adding): regularization is, as #Djib2011 says, the way to go to help prevent overfitting. You could look into e.g. L2 regularization or dropout, which are among the most common options. A hedged sketch of what that could look like follows below.
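Here is a minimal Keras sketch of the combination of early stopping and regularization, assuming a generic classifier; the layer sizes, hyperparameters, and the X_train/y_train arrays are placeholders, not anything from the original question. EarlyStopping keeps the weights from the epoch with the best validation accuracy, which is exactly the "use the parameters at maximum test accuracy" idea.

```python
# Hypothetical sketch: early stopping plus L2 and dropout regularization in Keras.
# Layer sizes and the X_train / y_train arrays are placeholders.
import tensorflow as tf
from tensorflow.keras import layers, regularizers

model = tf.keras.Sequential([
    layers.Dense(128, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4)),  # L2 weight penalty
    layers.Dropout(0.5),                                     # dropout regularization
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Stop once validation accuracy stops improving and restore the best weights.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_accuracy", patience=10, restore_best_weights=True)

model.fit(X_train, y_train,
          validation_split=0.2,
          epochs=200,
          callbacks=[early_stop])
```

In practice you would tune the L2 strength, dropout rate, and patience; the point is only that regularization and early stopping address exactly the train-up/test-down pattern described above.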
The question was answered in the comments and, since no one wrote an answer, I made this answer a community wiki answer. This is to remove the question from the unanswered list.
The original answer was by #Djib2011. The OP is encouraged to accept this as the answer to remove the question's status as unanswered. (If the person who answered in the comments decides to write an answer, the OP can, and should, accept that answer instead.)
Yes, definitely overfitting. When I first started building logistic regressions in SAS, we used a rule of thumb that a model's train and test performances should not be more than 10% apart.
Another approach is to use k-fold cross-validation and check for balanced performance across all folds; a sketch follows below.
Overall, that implies the model is stable: we are fitting it to the actual trends in the data and not to the noise.
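As a hedged illustration of the k-fold idea, assuming scikit-learn; the estimator and the X/y arrays are placeholders, not from the original post:

```python
# Hypothetical sketch: checking that performance is balanced across folds.
# X and y are placeholder feature/label arrays.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=5, scoring="accuracy")
print("fold accuracies:", np.round(scores, 3))
print("mean / std:", scores.mean(), scores.std())
# A large spread across folds suggests the model is fitting noise
# rather than a stable trend in the data.
```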
Related
When training my LSTM (using the Keras library in Python) the validation loss keeps increasing, although it eventually does reach a higher validation accuracy. This leads me to two questions:
How/Why does it obtain a (significantly) higher validation accuracy at a (significantly) higher validation loss?
Is it problematic that the validation loss increases? (Because it eventually does obtain a good validation accuracy either way.)
This is an example history log of my LSTM for which this applies:
As visible when comparing epoch 0 with epoch ~430:
52% val accuracy at 1.1 val loss vs. 61% val accuracy at 1.8 val loss
For the loss function I'm using tf.keras.losses.CategoricalCrossentropy, and I'm using the SGD optimizer at a high learning rate of 50-60% (as it obtained the best validation accuracy with it).
Initially I thought it might be overfitting, but then I don't understand how the validation accuracy eventually ends up substantially higher at almost twice the validation loss.
Any insights would be much appreciated.
EDIT: Here is another example from a different run, with less fluctuation in validation accuracy but still a significantly higher validation accuracy as the validation loss increases:
In this run I used a low instead of high dropout.
As you stated, you are training "at a high learning rate of 50-60%"; this might be the reason why the graphs are oscillating. Lowering the learning rate or adding regularization should solve the oscillation problem.
More generally:
Cross-entropy loss is not a bounded loss, so a few very bad outliers can make it explode.
Accuracy can still go higher, which means your model is able to learn the rest of the dataset apart from those outliers.
If the validation set contains many such outliers, they cause the oscillation of the loss values.
To conclude whether you are overfitting or not, you should inspect the validation set for outliers; a toy illustration follows below.
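Here is a small numeric sketch (the predictions are invented, not the OP's data) of how accuracy can rise while cross-entropy also rises: a single very confident mistake is enough to blow up the unbounded loss even though more samples end up on the correct side of the decision boundary.

```python
# Toy illustration: accuracy goes up while cross-entropy loss also goes up.
# All numbers are made up for illustration only.
import numpy as np

def cross_entropy(y_true, p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def accuracy(y_true, p):
    return np.mean((p > 0.5) == y_true)

y = np.array([1, 1, 1, 1, 0, 0, 0, 0])

# "Early" model: unsure everywhere, 5/8 correct.
p_early = np.array([0.60, 0.55, 0.60, 0.45, 0.40, 0.45, 0.55, 0.60])
# "Later" model: 7/8 correct, but its single mistake is extremely confident.
p_late = np.array([0.95, 0.90, 0.90, 0.85, 0.10, 0.05, 0.10, 0.999])

print(accuracy(y, p_early), cross_entropy(y, p_early))  # ~0.625, ~0.66
print(accuracy(y, p_late), cross_entropy(y, p_late))    # ~0.875, ~0.95
```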
I am using resnet50 to classify pictures of flowers from a Kaggle dataset. I would like to clarify some things about my results.
epoch  train_loss  valid_loss  error_rate  time
0      0.205352    0.226580    0.077546    02:01
1      0.148942    0.205224    0.074074    02:01
These are the last two epochs of training. As you can see, the second epoch shows some overfitting because the train_loss is a good margin lower than the validation loss. Despite the overfitting, the error_rate and the validation loss decreased. I am wondering whether the model had actually improved in spite of the overfitting. Is it better to use the model from epoch 0 or epoch 1 for unseen data? Thank you!
Sadly, "overfitting" is a much abused term nowadays, used to mean almost everything linked to suboptimal performance; nevertheless, and practically speaking, overfitting means something very specific: its telltale signature is when your validation loss starts increasing, while your training loss continues decreasing, i.e.:
(Image adapted from Wikipedia entry on overfitting)
It's clear that nothing of the sort happens in your case; the "margin" between your training and validation loss is another story altogether (it is called the generalization gap), and does not signify overfitting.
Thus, in principle, you have absolutely no reason at all to choose a model with higher validation loss (i.e. your first one) instead of one with a lower validation loss (your second one).
Generally speaking, is it possible to tell if training a given neural network of depth X on Y training examples for Z epochs is likely to overfit? Or can overfitting only be detected for sure by looking at loss and accuracy graphs of training vs test set?
Concretely I have ~250,000 examples, each of which is a flat image 200x200px. The model is a CNN with about 5 convolution + pooling layers, followed by 2 dense layers with 1024 units each. The model classifies 12 different classes. I've been training it for about 35 hours with ~90% accuracy on training set and ~80% test set.
Generally speaking, is it possible to tell if training a given neural network of depth X on Y training examples for Z epochs is likely to overfit?
Generally speaking, no. Fitting deep learning models is still an almost exclusively empirical art, and the theory behind it is still (very) poor. And although, by gaining more and more experience, one becomes more likely to tell beforehand whether a model is prone to overfit, the confidence will generally not be high (extreme cases excluded), and the only reliable judge will be the experiment.
Elaborating a little further: if you take the Keras MNIST CNN example and remove the intermediate dense layer(s) (a previous version of the script included 2x200 dense layers instead of the current 1x128), thus keeping only the conv/pooling layers and the final softmax one, you will end up with ~98.8% test accuracy after only 20 epochs; but I am unaware of anyone who could have reliably predicted this beforehand...
Or can overfitting only be detected for sure by looking at loss and accuracy graphs of training vs test set?
Exactly, this is the only safe way. The telltale signature of overfitting is the divergence of the learning curves (training error still decreasing, while validation or test error heading up). But even if we have diagnosed overfitting, the cause might not be always clear-cut (see a relevant question and answer of mine here).
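A hedged sketch of inspecting those curves, assuming a Keras-style history object returned by model.fit() (the variable name is a placeholder):

```python
# Sketch: plot training vs. validation loss to spot divergence.
# Assumes `history` is the object returned by a Keras model.fit() call.
import matplotlib.pyplot as plt

plt.plot(history.history["loss"], label="training loss")
plt.plot(history.history["val_loss"], label="validation loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.show()
# Overfitting signature: the validation curve turns upward while
# the training curve keeps decreasing.
```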
~90% accuracy on training set and ~80% test set
Again, very generally speaking and only in principle, this does not sound bad for a problem with 12 classes. You already seem to know that, if you worry about possible overfitting, it is the curves rather than the values themselves (or the training time) that you have to monitor.
On the more general topic of the poor theory behind deep learning models, as related to the subject of model interpretability, you might find this answer of mine useful...
I wonder why our objective is to maximize AUC, when maximizing accuracy would seem to yield the same result.
I would think that, with maximizing accuracy as the primary goal, the AUC will automatically be large as well.
I guess we use AUC because it explains how well our method is able to separate the data independently of a threshold.
For some applications, we don't want false positives or false negatives. And when we use accuracy, we make an a priori choice of the best threshold to separate the data, regardless of specificity and sensitivity.
In binary classification, accuracy is a performance metric of a single model at a certain threshold, whereas the AUC (area under the ROC curve) is a performance metric of the series of classifiers you obtain by sweeping the threshold over a single model's scores.
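To make the threshold dependence concrete, here is a small sketch with invented labels and scores (not taken from the question): accuracy changes as you move the threshold, while the ROC AUC is a single, threshold-free number summarizing the ranking.

```python
# Illustration with made-up labels/scores: accuracy depends on the threshold,
# while AUC summarizes ranking quality across all thresholds.
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 0, 1, 0, 1, 1, 1])
scores = np.array([0.10, 0.30, 0.45, 0.50, 0.60, 0.70, 0.80, 0.90])

for t in (0.40, 0.50, 0.55):
    acc = np.mean((scores >= t).astype(int) == y_true)
    print(f"threshold={t}: accuracy={acc:.3f}")

print("AUC:", roc_auc_score(y_true, scores))  # one number, no threshold needed
```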
Thanks to this question, I have learnt quite a bit about comparing AUC and accuracy. I don't think there is a simple relationship between the two, and I believe this is still an open problem. At the end of this answer, I've added some links that I think will be useful.
One scenario where accuracy fails:
Example Problem
Let's consider a binary classification problem where you evaluate the performance of your model on a data set of 100 samples (98 of class 0 and 2 of class 1).
Take your sophisticated machine learning model and replace the whole thing with a dumb system that always outputs 0, whatever input it receives.
What is the accuracy now?
Accuracy = Correct predictions/Total predictions = 98/100 = 0.98
We got a stunning 98% accuracy on the "Always 0" system.
Now you convert your system into a cancer diagnosis system and start predicting (0 - no cancer, 1 - cancer) on a set of patients. Assuming only a few cases correspond to class 1, you will still achieve a high accuracy.
Despite having a high accuracy, what is the point of the system if it fails to do well on the class 1 (Identifying patients with cancer)?
This observation suggests that accuracy is not a good evaluation metric for every type of machine learning problem. The above is known as an imbalanced class problem, and there are plenty of practical problems of this nature. A minimal sketch of this scenario follows below.
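Here is the "always 0" system above in code, assuming scikit-learn (the data are synthetic): accuracy looks excellent, while the AUC correctly reports that a constant classifier has no discriminative power.

```python
# The "always 0" system from the example: 98 negatives, 2 positives.
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = np.array([0] * 98 + [1] * 2)
y_pred = np.zeros(100)   # hard predictions: always class 0
y_score = np.zeros(100)  # scores carry no ranking information at all

print("accuracy:", accuracy_score(y_true, y_pred))  # 0.98
print("AUC:", roc_auc_score(y_true, y_score))       # 0.5, i.e. no better than chance
```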
As for the comparison of accuracy and AUC, here are some links I think would be useful,
An introduction to ROC analysis
Area under curve of ROC vs. overall accuracy
Why is AUC higher for a classifier that is less accurate than for one that is more accurate?
What does AUC stand for and what is it?
Understanding ROC curve
ROC vs. Accuracy vs. AROC
I'm training a simple logistic regression classifier using LIBLINEAR. There are only 3 features, and the label is binary (0/1).
Sample input file:
1 1:355.55660999775586 2:-3.401379785 3:5
1 1:252.43759050148728 2:-3.96044759307 3:9
1 1:294.15085871437088 2:-13.1649273486 3:14
1 1:432.10492221032933 2:-2.72636786196 3:9
0 1:753.80863694081768 2:-12.4841741178 3:14
1 1:376.54927850355756 2:-6.9494008935 3:7
Now, if I use "-s 6", which is "L1-regularized logistic regression", the 10-fold cross-validation accuracy is around 70%, and each iteration finishes within seconds.
But if I use "-s 7", which is "L2-regularized logistic regression (dual)", the training exceeds 1000 iterations, and the 10-fold accuracy is only 60%.
Has anybody seen this kind of strange behavior? From my understanding, the only difference between L1 and L2 is whether the regularization term uses abs(x) or pow(x, 2).
Thanks for posting this! I work with liblinear fairly often and almost always use the L2 loss without thinking about it. This article does a pretty good job of explaining the difference: http://www.chioka.in/differences-between-l1-and-l2-as-loss-function-and-regularization/
Based on that, I'm guessing that not only do you have a small number of features, but maybe also a small dataset? Have you tried increasing the number of input points?
I don't think this is "strange" behavior, in my opinion. You have to run a trial to confirm which one fits your case better when you have no prior sense of it. Theoretically, L1 regularization yields sparse solutions, acting somewhat like feature selection, while L2 regularization is smoother.
I just realized that LIBLINEAR provides two L2-regularized logistic regression solvers:
0 -- L2-regularized logistic regression (primal)
7 -- L2-regularized logistic regression (dual)
I was using 7, which doesn't converge even after 1000 iterations.
After I switched to 0, it converged very fast and was able to get to ~70% accuracy.
I believe the dual vs. primal is mainly the difference in optimization methods, so I think this is probably some numerical computation issue.
For more info on dual form vs. primal form:
https://stats.stackexchange.com/questions/29059/logistic-regression-how-to-get-dual-function
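For anyone who wants to reproduce this comparison without the LIBLINEAR command line, here is a hedged sketch using scikit-learn's liblinear-backed LogisticRegression; "data.txt" is a placeholder for the LIBSVM-format file, and the mapping to the -s options is only approximate (dual=True corresponds to the dual L2 solver, dual=False to the primal one).

```python
# Hedged sketch: L1 vs. L2 regularization and primal vs. dual with the
# liblinear backend in scikit-learn. "data.txt" is a placeholder file name.
from sklearn.datasets import load_svmlight_file
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_svmlight_file("data.txt")

models = {
    "L1 (roughly -s 6)":        LogisticRegression(penalty="l1", solver="liblinear"),
    "L2 primal (roughly -s 0)": LogisticRegression(penalty="l2", solver="liblinear", dual=False),
    "L2 dual (roughly -s 7)":   LogisticRegression(penalty="l2", solver="liblinear", dual=True),
}
for name, clf in models.items():
    scores = cross_val_score(clf, X, y, cv=10)
    print(f"{name}: mean 10-fold accuracy = {scores.mean():.3f}")
```

One thing worth trying with features of very different magnitudes (as in the sample above, where feature 1 is in the hundreds) is standardizing them first; poorly scaled features can also slow down the convergence of iterative solvers.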