Exploring overfitting of a model - machine-learning

I am trying to detect overfitting in my deep-learning model, but I am confused about what overfitting looks like.
For example, my training accuracy and loss are high, and my development accuracy and loss are almost equal to them.
Questions:
i) What does this mean?
ii) What actions should I take?
iii) What are the possible results of those actions?

There's a very simple way to find out whether your model is overfitting, as taught by Professor Andrew Ng.
If your training set accuracy is much higher than your development set accuracy, your model is overfitting the training data, and you can try different things, such as:
1.) Introduce L2 regularization.
2.) Introduce dropout in the network (or increase the network's dropout probability).
3.) Add more data to your training set if possible, as long as it is representative of the data in your dev or test set.
4.) Change your neural network architecture.
5.) Introduce random noise in your training data (data augmentation).
If there's not much difference between your training and development accuracy, then the network is not overfitting the data.
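The check described above can be sketched in a few lines. This is only an illustration: the function names and the `max_gap` tolerance are assumptions made for the example, and a sensible tolerance depends on the problem.

```python
def accuracy_gap(train_acc: float, dev_acc: float) -> float:
    """Return how far training accuracy exceeds dev accuracy (both in [0, 1])."""
    return train_acc - dev_acc


def looks_overfit(train_acc: float, dev_acc: float, max_gap: float = 0.05) -> bool:
    """Flag likely overfitting when the train/dev gap exceeds max_gap.

    max_gap is an assumed, problem-dependent tolerance, not a standard value.
    """
    return accuracy_gap(train_acc, dev_acc) > max_gap


if __name__ == "__main__":
    print(looks_overfit(0.99, 0.80))  # large gap between train and dev
    print(looks_overfit(0.90, 0.89))  # train and dev nearly equal
```

A small gap only rules out overfitting; it says nothing about whether the accuracy itself is good enough.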

Related

Should I get the same accuracy on the test set and the training set?

I am new to machine learning. I have built a model that predicts whether a client will subscribe in the following month. I got 73.4 on the training set and 72.8 on the test set. Is that okay, or do I have overfitting?
It's ok.
Overfitting shows up when the accuracy on the training set is high and the accuracy on the test set is substantially lower.
This is what overfitting looks like.
Train accuracy: 99.4%
Test accuracy: 71.4%
You may, however, be able to increase the accuracy by trying different models and feature engineering.
We call it overfitting when the accuracy on the training data is abnormally high (say, greater than 95%) and the accuracy on the test data is much lower (say, less than 65%). These cutoffs are rough rules of thumb; what matters is the size of the gap.
In your case, the training and testing accuracies are almost the same, so there is no overfitting.
Try evaluating on more test data and check whether the accuracy drops. You can also try to improve the model by:
Trying different algorithms
Increasing the size of the training data
Trying K-fold cross-validation
Hyperparameter tuning
Using regularization methods
Standardizing feature variables
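As a rough illustration of the K-fold cross-validation suggestion, here is a minimal standard-library sketch. The toy threshold classifier, the synthetic data and all function names are assumptions made up for the example; in practice you would plug in your own model and dataset, or use a library implementation.

```python
import random
from statistics import mean


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the indices once and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_threshold(xs, ys):
    """Toy 1-D classifier: threshold halfway between the two class means.

    Assumes class 1 tends to have larger x than class 0, and that both
    classes are present in the training portion.
    """
    mean0 = mean(x for x, y in zip(xs, ys) if y == 0)
    mean1 = mean(x for x, y in zip(xs, ys) if y == 1)
    return (mean0 + mean1) / 2


def cross_val_accuracy(xs, ys, k=5):
    """Return the held-out accuracy for each of the k folds."""
    scores = []
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        threshold = fit_threshold([xs[i] for i in train], [ys[i] for i in train])
        correct = sum((xs[i] > threshold) == bool(ys[i]) for i in held_out)
        scores.append(correct / len(held_out))
    return scores


if __name__ == "__main__":
    rng = random.Random(42)
    ys = [i % 2 for i in range(200)]            # synthetic, balanced labels
    xs = [rng.gauss(3.0 * y, 1.0) for y in ys]  # class 1 is centred at 3.0
    scores = cross_val_accuracy(xs, ys, k=5)
    print("fold accuracies:", [round(s, 3) for s in scores])
    print("mean accuracy:", round(mean(scores), 3))
```

If the fold accuracies vary widely, a single train/test split is probably giving you a noisy estimate.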

Dealing with imbalanced classification data?

I am building a predictive model that predicts whether a client will subscribe again. I already have the dataset, and the problem is that it is imbalanced (there are more NOs than YESes). I believe that my model is biased, but when I compare the accuracy on the training set and the test set, the two are really close (0.8879 on the training set and 0.8868 on the test set). What confuses me is this: if my model is biased, why are the training and test accuracies so close? Or is my model not biased?
Quick response: yes, your model is very likely predicting everything as the majority class.
Let's think of it in a simpler way. The optimizer in the training process tries to maximize the accuracy (minimize the misclassification). Suppose you have a training set of 1000 images with only 10 tigers in it, and you intend to learn a classifier to distinguish tigers from non-tigers.
What the optimizer is very likely to do is predict non-tiger for every single image. Why? Because that is a much simpler model that is easier to reach, and it also gets to 99% accuracy! The same shortcut works equally well on a test set with the same class balance, which is why your training and test accuracies are so close.
I suggest you read more about imbalanced data problems (this one seems to be a good place to start: https://machinelearningmastery.com/what-is-imbalanced-classification/). Depending on the problem you are solving, you might try down-sampling, over-sampling, or more advanced solutions, such as changing the loss function and metrics (using F1 or AUC) and/or doing ranking instead of classification.
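To make the tiger example concrete, the following sketch computes accuracy next to precision, recall and F1 for a classifier that always predicts the majority class. The helper name `binary_metrics` and the convention of reporting 0.0 when a denominator is zero are assumptions made for this illustration.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    # Assumed convention: a metric with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# The tiger example: 1000 images, 10 of which are tigers (label 1).
y_true = [1] * 10 + [0] * 990
# A degenerate classifier that always answers "non-tiger" (label 0).
y_pred = [0] * 1000

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Accuracy comes out at 0.99 while recall and F1 for the tiger class are 0, which is why accuracy alone is a poor guide on imbalanced data.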

Why does pre-trained ResNet18 have a higher validation accuracy than training?

For PyTorch's tutorial on performing transfer learning for computer vision (https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html), we can see that there is a higher validation accuracy than training accuracy. Applying the same steps to my own dataset, I see similar results. Why is this the case? Does it have something to do with ResNet 18's architecture?
Assuming there aren't bugs in your code and the train and validation data are in the same domain, there are a couple of reasons why this may occur.
Training loss/acc is computed as the average across an entire training epoch. The network begins the epoch with one set of weights and ends the epoch with a different (hopefully better!) set of weights. During validation you're evaluating everything using only the most recent weights. This means that the comparison between validation and train accuracy is misleading since training accuracy/loss was computed with samples from potentially much worse states of your model. This is usually most noticeable at the start of training or right after the learning rate is adjusted since the network often starts the epoch in a much worse state than it ends. It's also often noticeable when the training data is relatively small (as is the case in your example).
Another difference is that data augmentations are used during training but not during validation. During training you randomly crop and flip the training images. While these random augmentations are useful for increasing the ability of your network to generalize, they aren't performed during validation because they would diminish performance.
If you were really motivated and didn't mind spending the extra computational power you could get a more meaningful comparison by running the training data back through your network at the end of each epoch using the same data transforms used for validation.
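The first effect can be illustrated with plain arithmetic. The per-batch accuracies below are made-up numbers, and the last batch is used as a rough stand-in for the accuracy of the end-of-epoch weights; this is a toy sketch, not a measurement from the tutorial.

```python
from statistics import mean


def epoch_average(batch_accuracies):
    """Training accuracy as usually logged: the mean over all batches in the epoch."""
    return mean(batch_accuracies)


# Hypothetical per-batch training accuracies within one epoch; the model
# improves as its weights are updated after every batch.
batch_accuracies = [0.50, 0.60, 0.70, 0.80, 0.90]

logged_train_acc = epoch_average(batch_accuracies)
# Rough proxy for what the end-of-epoch weights can do, which is the state
# of the model that validation actually evaluates.
end_of_epoch_acc = batch_accuracies[-1]

print("logged training accuracy:", round(logged_train_acc, 3))
print("end-of-epoch accuracy:", end_of_epoch_acc)
```

The logged average is pulled down by the early, weaker batches, so it understates what the final weights of the epoch can do.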
The short answer is that the training and validation data effectively come from different distributions, and it's "easier" for the model to predict the target on the validation data than on the training data.
The likely reason in this particular case, as indicated by this answer, is data augmentation during training. This is a way to regularize your model by increasing variability in the training data.
Other architectures may use dropout (or its variants), which deliberately "hurts" training performance to reduce the potential for overfitting.
Note that you're using a pretrained model, which already contains some information about how to solve the classification problem. If your domain is not that different from the data it was trained on, you can expect good performance off the shelf.

Test accuracy is greater than train accuracy: what to do?

I am using a random forest. My test accuracy is 70%, whereas my train accuracy is 34%. What should I do? How can I solve this problem?
Test accuracy should not normally be higher than train accuracy, since the model is optimized for the latter. Ways in which this behavior might happen:
You did not use the same source dataset for the test set. You should do a proper train/test split in which both parts have the same underlying distribution. Most likely you provided a completely different (and more agreeable) dataset for testing.
An unreasonably high degree of regularization was applied. Even so, there would need to be some element of "test data distribution is not the same as that of train" for the observed behavior to occur.
The other answers are correct in most cases. But I'd like to offer another perspective. There are specific training regimes that could cause the training data to be harder for the model to learn - for instance, adversarial training or adding Gaussian noise to the training examples. In these cases, the benign test accuracy could be higher than train accuracy, because benign examples are easier to evaluate. This isn't always a problem, however!
If this applies to you, and the gap between train and test accuracies is larger than you'd like (~30%, as in your question, is a pretty big gap), then this indicates that your model is underfitting the harder patterns, so you'll need to increase the expressiveness of your model. In the case of random forests, this might mean growing the trees to a greater depth.
First, you should check the data used for training. I think there is some problem with the data; it may not have been pre-processed properly.
Also, in this case, you should try more epochs. Plot the learning curve to see when the model converges.
You should check the following:
1) Both training and validation accuracy should increase, and the loss should decrease.
2) If step 1 stops holding after a particular epoch, then train your model only until that epoch, because your model is overfitting after that.
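The "train only until that epoch" rule is commonly implemented as early stopping on the validation loss. Below is a minimal sketch, assuming you already have a list of per-epoch validation losses; the function name and the `patience` parameter are choices made for this example.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch with the best validation loss.

    Scanning stops once the validation loss has failed to improve for
    `patience` consecutive epochs. Returns -1 for an empty list.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs in a row
    return best_epoch


if __name__ == "__main__":
    # Hypothetical validation losses: improving at first, then rising.
    losses = [1.00, 0.80, 0.70, 0.75, 0.90, 0.95]
    print("stop at epoch index:", early_stopping_epoch(losses))
```

In a real training loop you would also save the weights at the best epoch so that you can restore them after stopping.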

Training Loss and Validation Loss in Deep Learning

Could you please guide me on how to interpret the following results?
1) loss < validation_loss
2) loss > validation_loss
It seems that the training loss should always be less than the validation loss, but both of these cases happen when training a model.
Really a fundamental question in machine learning.
If validation loss >> training loss you can call it overfitting.
If validation loss > training loss you can call it some overfitting.
If validation loss < training loss you can call it some underfitting.
If validation loss << training loss you can call it underfitting.
Your aim is to make the validation loss as low as possible.
Some overfitting is nearly always a good thing. All that matters in the end is: is the validation loss as low as you can get it.
This often occurs when the training loss is quite a bit lower.
Also check how to prevent overfitting.
In machine learning and deep learning there are basically three cases:
1) Underfitting
This is the only case where loss > validation_loss, but only slightly. If loss is far higher than validation_loss, please post your code and data so that we can have a look.
2) Overfitting
loss << validation_loss
This means that your model fits the training data very well but not the validation data; in other words, it is not generalizing correctly to unseen data.
3) Perfect fitting
loss == validation_loss
If both values end up roughly the same, and the values are converging (plot the loss over time), then chances are very high that you are doing it right.
1) Your model performs better on the training data than on the unseen validation data. A bit of overfitting is normal, but larger amounts need to be reduced with techniques like dropout to ensure generalization.
2) Your model performs better on the validation data. This can happen when you use augmentation on the training data, making it harder to predict in comparison to the unmodified validation samples. It can also happen when your training loss is calculated as a moving average over 1 epoch, whereas the validation loss is calculated after the learning phase of the same epoch.
Aurélien Géron made a good Twitter thread about this phenomenon. Summary:
Regularization is typically only applied during training, not validation and testing. For example, if you're using dropout, the model has fewer features available to it during training.
Training loss is measured after each batch, while the validation loss is measured after each epoch, so on average the training loss is measured ½ an epoch earlier. This means that the validation loss has the benefit of extra gradient updates.
The val set can be easier than the training set. For example, data augmentations often distort or occlude parts of the training images. This can also happen if you get unlucky during sampling (the val set has too many easy classes, or too many easy examples), or if your val set is too small. Or the train set leaked into the val set.
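The last point, leakage of training examples into the validation set, can be checked cheaply for exact duplicates. The sketch below assumes each example can be serialized to bytes; the function names are made up for the illustration, and near-duplicates (for example, a resized copy of the same image) will not be caught.

```python
import hashlib


def fingerprint(example: bytes) -> str:
    """Hash the raw bytes of one example so duplicates can be compared cheaply."""
    return hashlib.sha256(example).hexdigest()


def find_leaks(train_examples, val_examples):
    """Return the indices of validation examples that also appear in training."""
    train_hashes = {fingerprint(e) for e in train_examples}
    return [i for i, e in enumerate(val_examples) if fingerprint(e) in train_hashes]


if __name__ == "__main__":
    # Hypothetical serialized examples; "cat-1" appears in both splits.
    train = [b"cat-1", b"dog-1", b"cat-2"]
    val = [b"dog-2", b"cat-1"]
    print("leaked validation indices:", find_leaks(train, val))
```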
If your validation loss is lower than your training loss, it may be that the training data was not split correctly, so that the training and validation sets follow different distributions; ideally they should follow the same one. Moreover, in the ideal case (a good fit), the training and validation losses both decrease and then stabilize, indicating a model that neither overfits nor underfits.
