I am working on multilabel classification problem. The classes are highly imbalance. However, I balanced the imbalance problem with class weights. I am using "Binary cross entropy" as cost funtion and sigmoid activation function at output layer. But, I am confused with loss curve (since the validation loss and testing loss are parallel ). Is this the case of overfitting?
The telltale signature of overfitting is when your validation loss starts increasing, while your training loss continues decreasing, i.e.:
(Image adapted from Wikipedia entry on overfitting)
Here are some other plots indicating overfitting (source):
See also the SO thread How to know if underfitting or overfitting is occuring?.
Clearly, your plot does not exhibit such behavior, hence you are not overfitting.
Related
When training my LSTM ( using the Keras library in Python ) the validation loss keeps increasing, although it eventually does obtain a higher validation accuracy. Which leads me to 2 questions:
How/Why does it obtain a (significantly) higher validation accuracy at a (significantly) higher validation loss?
Is it problematic that the validation loss increases? ( because it eventually does obtain a good validation accuracy either way )
This is an example history log of my LSTM for which this applies:
As visible when comparing epoch 0 with epoch ~430:
52% val accuracy at 1.1 val loss vs. 61% val accuracy at 1.8 val loss
For the loss function I'm using tf.keras.losses.CategoricalCrossentropy and I'm using the SGD optimizer at a high learning rate of 50-60% ( as it obtained the best validation accuracy with it ).
Initially I thought it may be overfitting, but then I don't understand how the validation accuracy does eventually get quite a lot higher at almost 2 times as high of a validation loss.
Any insights would be much appreciated.
EDIT: Another example of a different run, less fluctuating validation accuracy but still significantly higher validation accuracy as the validation loss increases:
In this run I used a low instead of high dropout.
As you stated, "at a high learning rate of 50-60%", this might be the reason why graphs are oscillating. Lowering the learning rate or adding regularization should solve the oscillating problem.
More generally,
Cross Entropy loss is not a bounded loss, so having very badly outliers would make it explode.
Accuracy can go higher which means your model is able to learn the rest of the dataset except the outliers.
Validation set has too many outliers that causing the oscillation of the loss values.
To conclude if you are overfitting or not, you should inspect validation set for outliers.
I have a CNN that is performing very well (96% accuracy, 1.~ loss) on training data but poorly (50% accuracy, 3.5 loss) on testing data.
The telltale signature of overfitting is when your validation loss starts increasing, while your training loss continues decreasing, i.e.:
(Image adapted from Wikipedia entry on overfitting)
Here are some other plots indicating overfitting (source):
See also the SO thread How to know if underfitting or overfitting is occuring?.
Clearly, your loss plot does exhibit such behavior, so yes, you are indeed overfitting.
On the contrary, the plot you have linked to in a comment:
does not exhibit such behavior, hence here you are not actually overfitting (you just have reached a saturation point, beyond which your validation error is not further improving).
96% accuracy suggests you have a really close fit to your training data. 50% accuracy on testing data shows that your model cannot account for the noise/variability of the data being studied. This looks like textbook overfitting.
You seem to be calling your validation data your test data. Maybe you can better partition your data?
Would you please guide me how to interpret the following results?
1) loss < validation_loss
2) loss > validation_loss
It seems that the training loss always should be less than validation loss. But, both of these cases happen when training a model.
Really a fundamental question in machine learning.
If validation loss >> training loss you can call it overfitting.
If validation loss > training loss you can call it some overfitting.
If validation loss < training loss you can call it some underfitting.
If validation loss << training loss you can call it underfitting.
Your aim is to make the validation loss as low as possible.
Some overfitting is nearly always a good thing. All that matters in the end is: is the validation loss as low as you can get it.
This often occurs when the training loss is quite a bit lower.
Also check how to prevent overfitting.
In machine learning and deep learning there are basically three cases
1) Underfitting
This is the only case where loss > validation_loss, but only slightly, if loss is far higher than validation_loss, please post your code and data so that we can have a look at
2) Overfitting
loss << validation_loss
This means that your model is fitting very nicely the training data but not at all the validation data, in other words it's not generalizing correctly to unseen data
3) Perfect fitting
loss == validation_loss
If both values end up to be roughly the same and also if the values are converging (plot the loss over time) then chances are very high that you are doing it right
1) Your model performs better on the training data than on the unknown validation data. A bit of overfitting is normal, but higher amounts need to be regulated with techniques like dropout to ensure generalization.
2) Your model performs better on the validation data. This can happen when you use augmentation on the training data, making it harder to predict in comparison to the unmodified validation samples. It can also happen when your training loss is calculated as a moving average over 1 epoch, whereas the validation loss is calculated after the learning phase of the same epoch.
Aurélien Geron made a good Twitter thread about this phenomenon. Summary:
Regularization is typically only applied during training, not validation and testing. For example, if you're using dropout, the model has fewer features available to it during training.
Training loss is measured after each batch, while the validation loss is measured after each epoch, so on average the training loss is measured ½ an epoch earlier. This means that the validation loss has the benefit of extra gradient updates.
the val set can be easier than the training set. For example, data augmentations often distort or occlude parts of the image. This can also happen if you get unlucky during sampling (val set has too many easy classes, or too many easy examples), or if your val set is too small. Or, the train set leaked into the val set.
If your validation loss is less than your training loss, you have not correctly split the training data. This correctly indicates that the distribution of the training and validation sets is different. It should ideally be the same. MOROVER, Good Fit: In the ideal case, the training and validation losses both drop and stabilize at specified points, indicating an optimal fit, i.e. a model that does neither overfit or underfit.
My colleague and I are trying to wrap our heads around the difference between logistic regression and an SVM. Clearly they are optimizing different objective functions. Is an SVM as simple as saying it's a discriminative classifier that simply optimizes the hinge loss? Or is it more complex than that? How do the support vectors come into play? What about the slack variables? Why can't you have deep SVM's the way you can't you have a deep neural network with sigmoid activation functions?
I will answer one thing at at time
Is an SVM as simple as saying it's a discriminative classifier that simply optimizes the hinge loss?
SVM is simply a linear classifier, optimizing hinge loss with L2 regularization.
Or is it more complex than that?
No, it is "just" that, however there are different ways of looking at this model leading to complex, interesting conclusions. In particular, this specific choice of loss function leads to extremely efficient kernelization, which is not true for log loss (logistic regression) nor mse (linear regression). Furthermore you can show very important theoretical properties, such as those related to Vapnik-Chervonenkis dimension reduction leading to smaller chance of overfitting.
Intuitively look at these three common losses:
hinge: max(0, 1-py)
log: y log p
mse: (p-y)^2
Only the first one has the property that once something is classified correctly - it has 0 penalty. All the remaining ones still penalize your linear model even if it classifies samples correctly. Why? Because they are more related to regression than classification they want a perfect prediction, not just correct.
How do the support vectors come into play?
Support vectors are simply samples placed near the decision boundary (losely speaking). For linear case it does not change much, but as most of the power of SVM lies in its kernelization - there SVs are extremely important. Once you introduce kernel, due to hinge loss, SVM solution can be obtained efficiently, and support vectors are the only samples remembered from the training set, thus building a non-linear decision boundary with the subset of the training data.
What about the slack variables?
This is just another definition of the hinge loss, more usefull when you want to kernelize the solution and show the convexivity.
Why can't you have deep SVM's the way you can't you have a deep neural network with sigmoid activation functions?
You can, however as SVM is not a probabilistic model, its training might be a bit tricky. Furthermore whole strength of SVM comes from efficiency and global solution, both would be lost once you create a deep network. However there are such models, in particular SVM (with squared hinge loss) is nowadays often choice for the topmost layer of deep networks - thus the whole optimization is actually a deep SVM. Adding more layers in between has nothing to do with SVM or other cost - they are defined completely by their activations, and you can for example use RBF activation function, simply it has been shown numerous times that it leads to weak models (to local features are detected).
To sum up:
there are deep SVMs, simply this is a typical deep neural network with SVM layer on top.
there is no such thing as putting SVM layer "in the middle", as the training criterion is actually only applied to the output of the network.
using of "typical" SVM kernels as activation functions is not popular in deep networks due to their locality (as opposed to very global relu or sigmoid)
I just implemented AdaDelta (http://arxiv.org/abs/1212.5701) for my own Deep Neural Network Library.
The paper kind of says that SGD with AdaDelta is not sensitive to hyperparameters, and that it always converge to somewhere good. (at least the output reconstruction loss of AdaDelta-SGD is comparable to that of well-tuned Momentum method)
When I used AdaDelta-SGD as learning method in in Denoising AutoEncoder, it did converge in some specific settings, but not always.
When I used MSE as loss function, and Sigmoid as activation function, it converged very quickly, and after iterations of 100 epochs, the final reconstruction loss was better than all of plain SGD, SGD with Momentum, and AdaGrad.
But when I used ReLU as activation function, it didn't converge but continued to be stacked(oscillating) with high(bad) reconstruction loss (just like the case when you used plain SGD with very high learning rate).
The magnitude of reconstruction loss it stacked was about 10 to 20 times higher than the final reconstruction loss generated with Momentum method.
I really don't understand why it happened since the paper says AdaDelta is just good.
Please let me know the reason behind the phenomena and teach me how I could avoid it.
The activation of a ReLU is unbounded, making its use in Auto Encoders difficult since your training vectors likely do not have arbitrarily large and unbounded responses! ReLU simply isn't a good fit for that type of network.
You can force a ReLU into an auto encoder by applying some transformation to the output layer, as is done here. However, hey don't discuss the quality of the results in terms of an auto-encoder, but instead only as a pre-training method for classification. So its not clear that its a worth while endeavor for building an auto encoder either.