How to judge whether a model is overfitting or not

I am doing video classification with a model that combines a CNN and an LSTM.
On the training data the accuracy is 100%, but the accuracy on the test data is not nearly as good.
The training set is small, about 50 examples per class.
In such a case, can I conclude that overfitting is occurring?
Or is there another possible cause?

Most likely you are indeed overfitting if your model's performance is perfect on the training data yet poor on the test/validation set.
A good way to observe this effect is to evaluate your model on both the training and validation data after each epoch of training. You may find that validation performance increases at first and then starts to decrease; that is the moment your model begins to overfit, and where you can stop training.
Here's a plot demonstrating this phenomenon, with the blue and red lines corresponding to the errors on the training and validation sets respectively.
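For illustration, here is a minimal sketch of that per-epoch monitoring in Keras, with early stopping automating the "interrupt when validation stops improving" idea. The tiny synthetic dataset and two-layer network are placeholders I've made up so the snippet runs on its own; they are not details from the question.

```python
import numpy as np
from tensorflow.keras import layers, models
from tensorflow.keras.callbacks import EarlyStopping

# Toy stand-in data: 500 samples, 20 features (placeholder, not the video data).
x = np.random.rand(500, 20).astype("float32")
y = (x.sum(axis=1) > 10).astype("float32")

# Small placeholder network; the CNN+LSTM from the question would go here.
model = models.Sequential([
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

history = model.fit(
    x, y,
    validation_split=0.2,   # hold out 20% to draw the validation curve
    epochs=100,
    callbacks=[EarlyStopping(monitor="val_loss", patience=5,
                             restore_best_weights=True)],  # stop at the turn
    verbose=0,
)
# history.history["loss"] vs. history.history["val_loss"] gives the two
# curves described above; a widening gap is the overfitting signal.
```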

Related

Time series classification and prediction of sensor data from a CAN bus

I am working on a time-series classification task where I have to predict/classify a driving scenario in advance, and also measure how early or how delayed each prediction is. I am working with an LSTM model, but it is badly overfitted: the validation accuracy curves do not change, and the confusion matrix detects only one of my 3 labels. I tried SMOTE and scikit-learn class weights, but nothing improved. My question is: does data augmentation exist for time-series data too? Hints on the other problems would also be appreciated.
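Data augmentation does exist for time series. Below is a hedged sketch of three transforms often used as analogues of image augmentation (jittering, per-channel scaling, window slicing); the sequence shape is an assumption, not something from the post.

```python
import numpy as np

def jitter(x, sigma=0.03):
    """Add small Gaussian noise to every timestep/channel."""
    return x + np.random.normal(0.0, sigma, size=x.shape)

def scale(x, sigma=0.1):
    """Multiply each channel by a random factor close to 1."""
    factors = np.random.normal(1.0, sigma, size=(1, x.shape[1]))
    return x * factors

def window_slice(x, ratio=0.9):
    """Keep a random contiguous window as a shorter training sample."""
    length = int(len(x) * ratio)
    start = np.random.randint(0, len(x) - length + 1)
    return x[start:start + length]

# Placeholder sequence standing in for one CAN-bus sample: (timesteps, channels).
x = np.random.rand(200, 5)
augmented = [jitter(x), scale(x), window_slice(x)]
```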

Overfitting and data augmentation in random forest prediction

I want to build a prediction model using a Random Forest, but overfitting occurs. We adjusted various parameters because of the overfitting, but there is no big change.
When I looked up the reason, I found posts online saying it could be caused by a small amount of data (about 1,000 samples). As you know, in image classification, data augmentation increases the amount of data by gradually transforming the shape and angle of the images.
Can the amount of data be increased the same way for a prediction task like this? We copied the entire dataset, making about three times as much data (roughly 3,000 samples), and this seemed to prevent overfitting and increase accuracy.
But I'm not sure whether this is the right approach from a data-science point of view, which is why I am asking.
Besides these methods, I would like to ask how to avoid overfitting in a prediction problem, or how to increase the amount of data.
Thank you!
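One caution worth adding: exact copies give the forest no new information, so apparent gains from duplication usually reflect how the data were split rather than better generalization. A more standard remedy is to constrain tree complexity and verify it with cross-validation. Here is a hedged sketch on a synthetic stand-in dataset; the parameter values are illustrative, not recommendations from the post.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the ~1,000-sample dataset described in the question.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Regularize the trees instead of duplicating rows.
param_grid = {
    "max_depth": [4, 8, 16, None],    # shallower trees generalize better
    "min_samples_leaf": [1, 5, 20],   # larger leaves smooth predictions
    "max_features": ["sqrt", 0.5],    # more randomness per split
}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=200, random_state=0),
    param_grid, cv=5, scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```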

Meaning of a constant, very low training loss in a learning curve

I have trained a fastText model on a binary text-classification problem and generated the learning curve over increasing training-set sizes.
Very quickly I get a very low training loss, close to 0, which then stays constant.
I interpret this as the model overfitting the data.
But the validation loss curve looks good to me, slowly decreasing.
Cross-validation on unseen data likewise produces accuracies with little variation, about 90%.
So I am wondering whether I really have an overfitting model, as the learning curve suggests.
Is there any other check I can do on my model?
Since the fastText model also trains over epochs, I am even wondering whether a learning curve should vary the epochs (keeping the training size constant), or slowly increase the training-set size (keeping the epochs constant), or both.
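For reference, the size-based variant is the usual convention for learning curves. Here is a hedged sketch using scikit-learn on synthetic data as a stand-in for the fastText setup; the classifier and dataset are placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the text data; LogisticRegression stands in for fastText.
X, y = make_classification(n_samples=2000, n_features=50, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8),  # vary only the training size
    cv=5, scoring="accuracy",
)

for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n:5d}  train={tr:.3f}  val={va:.3f}")
# A near-perfect training score with a validation score that keeps improving
# as data is added indicates high capacity, not necessarily harmful overfitting.
```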

Overfitting in convolutional neural network

I was applying a CNN to the classification of hand gestures: I have 10 gestures and 100 images per gesture. The model I constructed gives around 97% accuracy on the training data, and I got 89% accuracy on the test data. Can I say that my model is overfitted, or is it acceptable to have such an accuracy graph (shown below)?
Add more data to training set
When you have a large amount of data (all kinds of instances) in your training set, even an overfitted model can work well.
Example: say you want to detect just one gesture, 'thumbs-up' (a binary classification problem), and you have created a positive training set of around 1,000 images in which the images are rotated, translated, scaled, recolored, taken from different angles and viewpoints, with cluttered backgrounds, etc. If your training accuracy is 99%, your test accuracy will also be somewhere close.
Because the training set is big enough to cover all instances of the positive class, even if the model is overfitted it will perform well on the test set, since the test instances will only be slight variations of the training instances.
In your case, your model is good, but if you can add some more data you will get even better accuracy (a sketch of such augmentation follows below).
What kind of data should you add?
Manually go through the test samples the model got wrong and look for patterns. If you can figure out what kinds of samples fail, add more of that kind to your training set and retrain.
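As a hedged illustration of the augmentation described above (rotation, translation, scaling), here is a short sketch using Keras preprocessing layers; the image batch is a made-up placeholder for the gesture data.

```python
import numpy as np
from tensorflow.keras import layers, models

# Random augmentations roughly matching the transforms listed above.
augment = models.Sequential([
    layers.RandomRotation(0.05),          # rotate up to ~18 degrees
    layers.RandomTranslation(0.1, 0.1),   # shift up to 10% each way
    layers.RandomZoom(0.1),               # scale up to 10%
])

# Placeholder batch standing in for gesture images: (n, height, width, channels).
images = np.random.rand(8, 64, 64, 3).astype("float32")
augmented = augment(images, training=True)  # training=True applies the randomness
```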

Increasing training examples reduces accuracy for maximum entropy classifier

I am using a MaxEnt part-of-speech tagger for POS-tag classification of a language corpus. I know from theory that increasing the number of training examples should generally improve classification accuracy. But I am observing that, in my case, the tagger gives the maximum F-measure when I take 3/4 of the data for training and the rest for testing. If I increase the training size to 85% or 90% of the whole corpus, the accuracy decreases. Even on reducing the training size to 50% of the full corpus, the accuracy decreases.
I would like to know the possible reason for this decrease in accuracy with increasing training examples.
I suspect that the reduced test set ends up holding mostly extreme samples: as you move the more general samples into the training set, you reduce the number of test samples that resemble what your model has seen, so the measured accuracy drops.
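A related point worth checking: a smaller test set also makes the accuracy estimate noisier, so part of the drop may be sampling variance. Here is a hedged sketch that measures this with repeated random splits on a synthetic placeholder dataset (not your corpus).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Synthetic placeholder corpus; the classifier stands in for the MaxEnt tagger.
X, y = make_classification(n_samples=2000, n_features=30, random_state=0)

for test_size in (0.25, 0.15, 0.10):
    cv = ShuffleSplit(n_splits=20, test_size=test_size, random_state=0)
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
    print(f"test={test_size:.0%}  mean={scores.mean():.3f}  std={scores.std():.3f}")
# If the standard deviation grows as the test set shrinks, single-split
# accuracy differences of a few percent may just be noise.
```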
