I am using resnet50 to classify pictures of flowers from a Kaggle dataset. I would like to clarify some things about my results.
epoch train_loss valid_loss error_rate time
0 0.205352 0.226580 0.077546 02:01
1 0.148942 0.205224 0.074074 02:01
These are the last two epochs of training. As you can see, the second epoch shows some overfitting, because the train_loss is a good margin lower than the validation loss. Despite the overfitting, the error_rate and the validation loss decreased. I am wondering whether the model has actually improved in spite of the overfitting. Is it better to use the model from epoch 0 or epoch 1 for unseen data? Thank you!
Sadly, "overfitting" is a much abused term nowadays, used to mean almost everything linked to suboptimal performance; nevertheless, and practically speaking, overfitting means something very specific: its telltale signature is when your validation loss starts increasing, while your training loss continues decreasing, i.e.:
(Image adapted from Wikipedia entry on overfitting)
It's clear that nothing of the sort happens in your case; the "margin" between your training and validation loss is another story altogether (it is called the generalization gap), and it does not signify overfitting.
Thus, in principle, you have absolutely no reason at all to choose a model with higher validation loss (i.e. your first one) instead of one with a lower validation loss (your second one).
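If you want to make this choice automatically, a common pattern is to checkpoint the weights whenever the validation loss improves and use that checkpoint for unseen data. Here is a minimal, framework-agnostic sketch in the PyTorch style (the train_one_epoch and evaluate callables are placeholders, not code from the question):

```python
import copy

def fit_and_keep_best(model, train_one_epoch, evaluate, n_epochs):
    """Keep the weights from the epoch with the lowest validation loss."""
    best_val_loss = float("inf")
    best_state = copy.deepcopy(model.state_dict())

    for epoch in range(n_epochs):
        train_one_epoch(model)           # placeholder: one epoch of training
        val_loss = evaluate(model)       # placeholder: returns validation loss

        if val_loss < best_val_loss:     # lower validation loss -> keep these weights
            best_val_loss = val_loss
            best_state = copy.deepcopy(model.state_dict())

    model.load_state_dict(best_state)    # restore the best epoch before inference
    return model
```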
Related
When training my LSTM (using the Keras library in Python), the validation loss keeps increasing, although it eventually does obtain a higher validation accuracy. This leads me to two questions:
How/Why does it obtain a (significantly) higher validation accuracy at a (significantly) higher validation loss?
Is it problematic that the validation loss increases? (Because it eventually does obtain a good validation accuracy either way.)
This is an example history log of my LSTM for which this applies:
As visible when comparing epoch 0 with epoch ~430:
52% val accuracy at 1.1 val loss vs. 61% val accuracy at 1.8 val loss
For the loss function I'm using tf.keras.losses.CategoricalCrossentropy, and I'm using the SGD optimizer at a high learning rate of 50-60% (as it obtained the best validation accuracy with it).
Initially I thought it might be overfitting, but then I don't understand how the validation accuracy eventually gets quite a lot higher at almost twice the validation loss.
Any insights would be much appreciated.
EDIT: Another example from a different run, with a less fluctuating validation accuracy but still a significantly higher validation accuracy as the validation loss increases:
In this run I used a low instead of high dropout.
As you stated, you are training "at a high learning rate of 50-60%"; this is likely why the graphs are oscillating. Lowering the learning rate or adding regularization should solve the oscillation problem.
More generally,
Cross-entropy loss is not a bounded loss, so a few very bad outliers can make it explode.
Accuracy can still go higher, which means your model is able to learn the rest of the dataset apart from the outliers.
The validation set may have too many outliers, which causes the oscillation of the loss values.
To conclude whether you are overfitting or not, you should inspect your validation set for outliers.
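To see concretely how a single outlier can blow up cross-entropy while barely touching accuracy, here is a small self-contained NumPy illustration (the numbers are made up for the example):

```python
import numpy as np

# Toy binary example: nine reasonably confident, correct predictions
# and one confidently wrong one (the "outlier").
y_true = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 1])
y_prob = np.array([0.9, 0.8, 0.95, 0.85, 0.9, 0.1, 0.2, 0.05, 0.1, 1e-4])

eps = 1e-12  # numerical safety for log(0)
cross_entropy = -np.mean(
    y_true * np.log(y_prob + eps) + (1 - y_true) * np.log(1 - y_prob + eps)
)
accuracy = np.mean((y_prob > 0.5) == y_true)

print(f"accuracy      = {accuracy:.2f}")       # 0.90 -- still high
print(f"cross entropy = {cross_entropy:.2f}")  # ~1.03, dominated by the single outlier
```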
I have a CNN that is performing very well (96% accuracy, loss around 1) on training data but poorly (50% accuracy, 3.5 loss) on testing data.
The telltale signature of overfitting is when your validation loss starts increasing, while your training loss continues decreasing, i.e.:
(Image adapted from Wikipedia entry on overfitting)
Here are some other plots indicating overfitting (source):
See also the SO thread How to know if underfitting or overfitting is occuring?.
Clearly, your loss plot does exhibit such behavior, so yes, you are indeed overfitting.
On the contrary, the plot you have linked to in a comment:
does not exhibit such behavior, hence here you are not actually overfitting (you have just reached a saturation point, beyond which your validation error is not improving further).
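If you want to check this on your own runs, plotting the two curves side by side makes the signature easy to spot. A minimal sketch, assuming a Keras-style history object with 'loss' and 'val_loss' entries:

```python
import matplotlib.pyplot as plt

def plot_loss_curves(history):
    """Plot training vs. validation loss from a Keras History object.

    A validation curve that turns upward while the training curve keeps
    falling is the overfitting signature described above.
    """
    plt.plot(history.history["loss"], label="training loss")
    plt.plot(history.history["val_loss"], label="validation loss")
    plt.xlabel("epoch")
    plt.ylabel("loss")
    plt.legend()
    plt.show()
```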
96% accuracy suggests you have a really close fit to your training data. 50% accuracy on testing data shows that your model cannot account for the noise/variability of the data being studied. This looks like textbook overfitting.
You seem to be calling your validation data your test data. Maybe you can better partition your data?
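On the last point, a clean train/validation/test split is easy to set up; a minimal scikit-learn sketch on a synthetic dataset (the 15% fractions are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for your data; replace with your own features and labels.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# First carve out a held-out test set, then a validation set from what remains.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.15, random_state=42, stratify=y_trainval
)
# Tune and monitor on (X_val, y_val); touch (X_test, y_test) only once, at the very end.
```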
I am using dice loss for my implementation of a Fully Convolutional Network (FCN) which involves hypernetworks. The model has two inputs and one output, which is a binary segmentation map. The model is updating its weights, but the loss is constant.
It is not even overfitting on only three training examples.
I have used other loss functions as well, like dice + binary cross-entropy loss, Jaccard loss, and MSE loss, but the loss is almost constant.
I have also tried almost every activation function, like ReLU, LeakyReLU, and Tanh. Moreover, I have to use sigmoid at the output because I need my outputs to be in the range [0, 1].
The learning rate is 0.01. I have also tried different learning rates, like 0.0001, 0.001, and 0.1, and no matter what loss the training starts at, it always ends up at this value.
This shows the gradients for three training examples, and the overall loss:
tensor(0.0010, device='cuda:0')
tensor(0.1377, device='cuda:0')
tensor(0.1582, device='cuda:0')
Epoch 9, Overall loss = 0.9604763123724196, mIOU=0.019766070265581623
tensor(0.0014, device='cuda:0')
tensor(0.0898, device='cuda:0')
tensor(0.0455, device='cuda:0')
Epoch 10, Overall loss = 0.9616242945194244, mIOU=0.01919178702228237
tensor(0.0886, device='cuda:0')
tensor(0.2561, device='cuda:0')
tensor(0.0108, device='cuda:0')
Epoch 11, Overall loss = 0.960331304506822, mIOU=0.01983801422510155
I expect the loss to converge in a few epochs.
What should I do?
It's not really a question for Stack Overflow. There are a million things that could be wrong, and it's usually not possible to post enough code to allow us to pinpoint the issue; even if it were, nobody could be bothered to read that much.
That being said, there are some general guidelines which often work for me.
Try reducing the problem. If you replace your network with a single convolutional layer, will it converge? If yes, apparently something's wrong with your network.
Look at the data as you feed it, as well as the labels (matplotlib plots, etc.). Perhaps you're misaligning input with output (cropping issues, etc.), or your data augmentation is way too strong.
Look for, well..., bugs. Perhaps you're returning torch.sigmoid(x) from your network and then feeding it into torch.nn.functional.binary_cross_entropy_with_logits (effectively applying sigmoid twice; see the sketch after this list). Maybe your last layer is ReLU and your network just cannot (by construction) output negative values where you would expect them.
Finally, I've personally never had much success training with dice as the primary loss function, so I would definitely try to get it working with cross entropy first, and then move on to dice.
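To illustrate the double-sigmoid bug from the list above, here is a small PyTorch snippet with made-up tensor shapes (not the asker's actual model):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 1, 8, 8)                     # raw outputs for a toy segmentation map
targets = torch.randint(0, 2, (4, 1, 8, 8)).float()  # binary ground-truth masks

# Buggy pattern: sigmoid is applied twice (once here, once inside the *_with_logits loss).
buggy_loss = F.binary_cross_entropy_with_logits(torch.sigmoid(logits), targets)

# Correct alternatives: either pass raw logits to the with_logits version...
loss_from_logits = F.binary_cross_entropy_with_logits(logits, targets)
# ...or apply sigmoid once and use plain binary cross-entropy.
loss_from_probs = F.binary_cross_entropy(torch.sigmoid(logits), targets)

print(buggy_loss.item(), loss_from_logits.item(), loss_from_probs.item())
```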
@Muhammad Hamza Mughal
You've got to add the code of at least your forward and train functions for us to pinpoint the issue; @Jatentaki is right, there are so many things that could mess up an ML/DL codebase. I also moved to PyTorch from Keras recently, and it took some time to get used to it. But here are the things I'd do:
1) As you're dealing with images, try to pre-process them a bit (rotation, normalization, Gaussian noise, etc.).
2) Zero the gradients of your optimizer at the beginning of each batch you fetch, and step the optimizer after you have calculated the loss and called loss.backward().
3) Add a weight decay term to your optimizer call, typically L2; as you're dealing with convolutional networks, a decay term of 5e-4 or 5e-5 is a reasonable starting point.
4) Add a learning rate scheduler to your optimizer, to change the learning rate if there's no improvement over time.
We really can't include code in our answers. It's up to the practitioner to scout for how to implement all this stuff. Hope this helps.
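That said, a rough PyTorch sketch of points 2)–4) above might look like the following; every argument here is a placeholder for your own objects, not code from the question:

```python
import torch

def train(model, train_loader, criterion, validate, num_epochs):
    """Minimal training loop illustrating points 2)-4); all arguments are placeholders."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                                momentum=0.9, weight_decay=5e-4)   # 3) L2 weight decay
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.1, patience=5)             # 4) LR scheduler

    for epoch in range(num_epochs):
        model.train()
        for inputs, targets in train_loader:
            optimizer.zero_grad()                    # 2) zero gradients for each batch
            loss = criterion(model(inputs), targets)
            loss.backward()
            optimizer.step()                         # 2) step after loss.backward()
        scheduler.step(validate(model))              # reduce LR when validation loss plateaus
```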
@MuhammadHamzaMughal, since you are using sigmoid to generate predictions, have you made sure that the target attributes in the ground truth / training data / validation data are all in the range [0, 1]?
Normalize the data with min-max normalization so that it is in [0-1] range.
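A minimal sketch of that normalization for a target tensor (assuming PyTorch tensors; the epsilon is just a guard against a constant tensor):

```python
import torch

def min_max_normalize(t: torch.Tensor) -> torch.Tensor:
    """Rescale a tensor to the [0, 1] range so it is comparable with sigmoid outputs."""
    t_min, t_max = t.min(), t.max()
    return (t - t_min) / (t_max - t_min + 1e-8)
```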
I am using a random forest. My test accuracy is 70%; on the other hand, my train accuracy is 34%. What should I do? How can I solve this problem?
Test accuracy should not be higher than train accuracy, since the model is optimized for the latter. Ways in which this behavior might happen:
You did not use the same source dataset for testing. You should do a proper train/test split in which both sets have the same underlying distribution. Most likely you provided a completely different (and more agreeable) dataset for testing.
An unreasonably high degree of regularization was applied. Even so, there would need to be some element of "test data distribution is not the same as that of train" for the observed behavior to occur.
The other answers are correct in most cases. But I'd like to offer another perspective. There are specific training regimes that could cause the training data to be harder for the model to learn - for instance, adversarial training or adding Gaussian noise to the training examples. In these cases, the benign test accuracy could be higher than train accuracy, because benign examples are easier to evaluate. This isn't always a problem, however!
If this applies to you, and the gap between train and test accuracies is larger than you'd like (~30%, as in your question, is a pretty big gap), then this indicates that your model is underfitting to the harder patterns, so you'll need to increase the expressibility of your model. In the case of random forests, this might mean training the trees to a higher depth.
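If you want to experiment with capacity, a minimal scikit-learn sketch on synthetic data (the parameters are illustrative, not tuned):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for your dataset.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Deeper trees (max_depth=None lets them grow fully) and more estimators increase capacity.
clf = RandomForestClassifier(n_estimators=300, max_depth=None, random_state=0)
clf.fit(X_train, y_train)
print("train accuracy:", clf.score(X_train, y_train))
print("test accuracy: ", clf.score(X_test, y_test))
```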
First you should check the data that is used for training. I think there is some problem with the data; it may not have been properly pre-processed.
Also, in this case, you should try more epochs. Plot the learning curve to analyze when the model is going to converge.
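For the learning-curve suggestion, scikit-learn has a helper that does the repeated fitting for you; a minimal sketch on synthetic data (substitute your own X and y):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Synthetic data purely for illustration.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

train_sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(random_state=0), X, y,
    cv=5, train_sizes=np.linspace(0.1, 1.0, 5)
)
plt.plot(train_sizes, train_scores.mean(axis=1), label="training accuracy")
plt.plot(train_sizes, val_scores.mean(axis=1), label="cross-validation accuracy")
plt.xlabel("training set size")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```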
You should check the following:
Both training and validation accuracy scores should increase and loss should decrease.
If there is something wrong in step 1 after any particular epoch, then train your model until that epoch only, because your model is over-fitting after that.
Would you please guide me how to interpret the following results?
1) loss < validation_loss
2) loss > validation_loss
It seems that the training loss should always be less than the validation loss, but both of these cases happen when training a model.
Really a fundamental question in machine learning.
If validation loss >> training loss you can call it overfitting.
If validation loss > training loss you can call it some overfitting.
If validation loss < training loss you can call it some underfitting.
If validation loss << training loss you can call it underfitting.
Your aim is to make the validation loss as low as possible.
Some overfitting is nearly always a good thing; all that matters in the end is whether the validation loss is as low as you can get it.
This often occurs when the training loss is quite a bit lower than the validation loss.
Also check how to prevent overfitting.
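One standard way to keep the validation loss as low as possible is early stopping; here is a minimal Keras sketch on random data (the model, sizes, and patience are arbitrary placeholders):

```python
import numpy as np
import tensorflow as tf

# Tiny synthetic setup purely to demonstrate the callback; replace with your own model and data.
x_train, y_train = np.random.rand(800, 20), np.random.randint(0, 2, 800)
x_val, y_val = np.random.rand(200, 20), np.random.randint(0, 2, 200)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",            # watch the validation loss...
    patience=10,                   # ...tolerate 10 stagnant epochs...
    restore_best_weights=True,     # ...then roll back to the best epoch
)
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=200, callbacks=[early_stop], verbose=0)
```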
In machine learning and deep learning there are basically three cases:
1) Underfitting
This is the only case where loss > validation_loss, but only slightly; if loss is far higher than validation_loss, please post your code and data so that we can have a look.
2) Overfitting
loss << validation_loss
This means that your model fits the training data very nicely, but not the validation data at all; in other words, it's not generalizing correctly to unseen data.
3) Perfect fitting
loss == validation_loss
If both values end up roughly the same, and if the values are converging (plot the loss over time), then chances are very high that you are doing it right.
1) Your model performs better on the training data than on the unknown validation data. A bit of overfitting is normal, but higher amounts need to be regulated with techniques like dropout to ensure generalization.
2) Your model performs better on the validation data. This can happen when you use augmentation on the training data, making it harder to predict in comparison to the unmodified validation samples. It can also happen when your training loss is calculated as a moving average over 1 epoch, whereas the validation loss is calculated after the learning phase of the same epoch.
Aurélien Geron made a good Twitter thread about this phenomenon. Summary:
Regularization is typically applied only during training, not during validation and testing. For example, if you're using dropout, the model has fewer features available to it during training (a small illustration follows after this list).
Training loss is measured after each batch, while the validation loss is measured after each epoch, so on average the training loss is measured ½ an epoch earlier. This means that the validation loss has the benefit of extra gradient updates.
The val set can be easier than the training set. For example, data augmentations often distort or occlude parts of the image. This can also happen if you get unlucky during sampling (the val set has too many easy classes, or too many easy examples), or if your val set is too small. Or the train set leaked into the val set.
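The first point is easy to see directly; for instance, with dropout in PyTorch (a toy tensor, not taken from the thread):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

layer.train()   # training mode: roughly half the activations are zeroed (the rest are rescaled)
print(layer(x))

layer.eval()    # evaluation mode: dropout is a no-op, the full signal passes through
print(layer(x))
```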
If your validation loss is less than your training loss, you may not have split the training data correctly: it suggests that the distribution of the training and validation sets is different, when ideally it should be the same. Moreover, a good fit looks like this: in the ideal case, the training and validation losses both drop and stabilize at specified points, indicating an optimal fit, i.e. a model that neither overfits nor underfits.