Meaning of a constant very low , training loss in learning curve - machine-learning

I have a trained a fasttext model on a binary text classification problem and generated the learning curve over increasing training size
I get very quick a very low training loss , close to 0, which stays constant.
I interpret this as the model overfitting on the data.
But the validation loss curve looks good to me, slowly decreasing.
Crossvalidation on unknow data produces as well accuracies with little variation, about 90% accuracy.
So I am wondering, if I indeed have an "Overfiiting" model as the learning curve suggests.
Is there any other check I can do on my model ?
As the fasttext model uses as well epochs, I am even wondering if a learning curve should vary the epochs (and keep training size constant) or "slowly increase training set size while keep epoch constant" (or both ...)

Related

Keras accuracy plots flat while loss plots not flat

I am doing deep learning using a multi-layer perceptron for regression. The loss curve turns flat in the third epoch however accuracy curve remains flat at the beginning. I wonder whether this makes sense.
Since you didn't provide the code, it would be harder to narrow down what is the problem. Being said, here are some pointers that might help you see what is the problem:
Validation set is either small or it is a bad representation of your training set. (bear in mind, if you are using validation_split in fit function, then keras will only take the last percentage of your training set and will keep it the same for all epochs. link]).
You are not using any regularization (Dropout, Regularization, Constraints).
The model could be small (layers- and neurons-wise), so it is underfitting.
Hope these pointers help you with your problem.

Reducing (Versus Delaying) Overfitting in Neural Network

In neural nets, regularization (e.g. L2, dropout) is commonly used to reduce overfitting. For example, the plot below shows typical loss vs epoch, with and without dropout. Solid lines = Train, dashed = Validation, blue = baseline (no dropout), orange = with dropout. Plot courtesy of Tensorflow tutorials.
Weight regularization behaves similarly.
Regularization delays the epoch at which validation loss starts to increase, but regularization apparently does not decrease the minimum value of validation loss (at least in my models and the tutorial from which the above plot is taken).
If we use early stopping to stop training when validation loss is minimum (to avoid overfitting) and if regularization is only delaying the minimum validation loss point (vs. decreasing the minimum validation loss value) then it seems that regularization does not result in a network with greater generalization but rather just slows down training.
How can regularization be used to reduce the minimum validation loss (to improve model generalization) as opposed to just delaying it? If regularization is only delaying minimum validation loss and not reducing it, then why use it?
Over-generalizing from a single tutorial plot is arguably not a good idea; here is a relevant plot from the original dropout paper:
Clearly, if the effect of dropout was to delay convergence it would not be of much use. But of course it does not work always (as your plot clearly suggests), hence it should not be used by default (which is arguably the lesson here)...

Why would a neural networks validation loss and accuracy fluctuate at first?

I am training a neural network and at the beginning of training my networks loss and accuracy on the validation data fluctuates a lot, but towards the end of training it stabilizes. I am reduce learning rate on plateau for this network. Could it be that the network starts with a high learning rate and as the learning rate decreases both accuracy and loss stabilize?
For SGD, the amount of change in the parameters is a multiple of the learning rate and the gradient of the parameter values with respect to the loss.
θ = θ − α ∇θ E[J(θ)]
Every step it takes will be in a sub-optimal direction (ie slightly wrong) as the optimiser has usually only seen some of the values. At the start of training you are relatively from the optimal solution, so the gradient ∇θ E[J(θ)] is large, therefore each sub-optimal step has a large effect on your loss and accuracy.
Over time, as you (hopefully) get closer to the optimal solution, the gradient is smaller, so the steps become smaller, meaning that the effects of being slightly wrong are diminished. Smaller errors on each step makes your loss decrease more smoothly, so reduces fluctuations.

How to judge whether model is overfitting or not

I am doing video classification with a model combining CNN and LSTM.
In the training data, the accuracy rate is 100%, but the accuracy rate of the test data is not so good.
The number of training data is small, about 50 per class.
In such a case, can I declare that over learning is occurring?
Or is there another cause?
Most likely you are indeed overfitting if the performance of your model is perfect on the training data, yet poor on test/validation data set.
A good way of observing that effect is to evaluate your model on both training and validation data after each epoch of training. You might observe that while you train, the performance on your validation set is increasing initially, and then starts to decrease. That is the moment when your model starts to overfit and where you can interrupt your training.
Here's a plot demonstrating this phenomenon with the blue and red lines corresponding to errors on training and validation sets respectively.

Learning curve for noisy data

I am doing a supervised classification of small texts, and the data is very noisy. I plotted a learning curve: x-axis is # instances. y-axis is the value of F-measure. The curve is falling: the more instances I use, the lower the F-measure score is. Is it typical for noisy data? Or there is some other reason for this behavior?
Did you calculate F-measure using training set or test set?
If you calculated it using training set then falling learning curve is pretty normal.
If you calculated it using test set then there may be many causes, the most probable is that training and test sets are not iid.

Resources