I would like to show an example of a model that overfits a test set and does not generalize well to future data.
I split the news dataset into 3 sets:
train set length: 11314
test set length: 5500
future set length: 2031
Since it is a text dataset, I build a CountVectorizer.
I am running a grid search (without cross-validation); each loop tests some parameters of the vectorizer ('min_df', 'max_df') and some parameters of my LogisticRegression model ('C', 'fit_intercept', 'tol', ...).
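The post does not show the search code, so here is a minimal sketch of that kind of manual grid search without cross-validation, assuming the 20 newsgroups data (whose official training split has 11,314 documents) and illustrative parameter grids rather than the poster's exact ones:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ParameterGrid, train_test_split

train = fetch_20newsgroups(subset='train')
test = fetch_20newsgroups(subset='test')

# Hold back part of the official test split as the "future" set.
X_test_txt, X_future_txt, y_test, y_future = train_test_split(
    test.data, test.target, test_size=2031, random_state=0)

best = None
for vec_params in ParameterGrid({'binary': [False, True],
                                 'min_df': [1, 2], 'max_df': [0.9, 1.0]}):
    vectorizer = CountVectorizer(**vec_params)
    X_train = vectorizer.fit_transform(train.data)
    X_test = vectorizer.transform(X_test_txt)
    for clf_params in ParameterGrid({'C': [0.1, 1.0, 10.0],
                                     'fit_intercept': [True, False],
                                     'tol': [1e-4]}):
        clf = LogisticRegression(max_iter=1000, **clf_params)
        clf.fit(X_train, train.target)
        score = clf.score(X_test, y_test)  # model selection reads the test set directly
        if best is None or score > best[0]:
            best = (score, vec_params, clf_params)
print(best)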
The best result I get is:
({'binary': False, 'max_df': 1.0, 'min_df': 1},
{'C': 0.1, 'fit_intercept': True, 'tol': 0.0001},
test set score: 0.64018181818181819,
training set score: 0.92902598550468451)
but now if I run it on the future set I will get a score similar to the test set:
clf.score(X_future, y_future): 0.6509108813392418
How can I demonstrate a case where I overfitted my test set so it does not generalize well to future data?
You have a model trained on some data "train set".
Performing a classification task on these data you get a score of 92%.
Then you take new data, not seen during the training, such as "test set" or "future set".
Performing a classification task on either of these unseen datasets, you get a score of 65%.
This is exactly the definition of a model which is overfitting: it has a very high variance, a big difference in the performance between seen and unseen data.
By the way, taking into account your specific case, some parameter choices which could cause overfitting are the following:
min_df = 0
a high C value for logistic regression (which means low regularization); see the quick check below.
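To make that last point concrete, here is a quick, hypothetical check on synthetic data (not the poster's news dataset) showing how weakening the regularization widens the train/test gap:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=200, n_informative=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = LogisticRegression(C=C, max_iter=5000).fit(X_tr, y_tr)
    print(C, clf.score(X_tr, y_tr), clf.score(X_te, y_te))
# With large C (weak regularization) the training score approaches 1.0
# while the test score typically lags behind.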
I wrote a comment on alsora's answer but I think I really should expand on it as an actual answer.
As I said, there is no way to "over-fit" the test set because over-fit implies something negative. A theoretical model that fits the test set at 92% but fits the training set to only 65% is a very good model indeed (assuming your sets are balanced).
I think what you are referring to as your "test set" might actually be a validation set, and your "future set" is actually the test set. Let's clarify.
You have a set of 18,845 examples. You divide them into 3 sets.
Training set: The examples the model gets to look at and learn from. Every time your model makes a guess on this set, you tell it whether it was right or wrong and it adjusts accordingly.
Validation set: After every epoch (one run through the training set), you check the model on these examples, which it has never seen before. You compare the training loss and training accuracy to the validation loss and validation accuracy. If training accuracy is much higher than validation accuracy, or training loss much lower than validation loss, then your model is over-fitting and training should stop. You can either stop it early (early stopping) or add dropout. You should not give feedback to your model based on examples from the validation set. As long as you follow that rule and your validation set is well mixed, you can't over-fit to this data.
Testing set: Used to assess the accuracy of your model once training has completed. This is the one that matters, because it is based on examples your model has never seen before. Again, you can't over-fit to this data.
Of your 18,845 examples, you have 11,314 in the training set, 5,500 in the validation set, and 2,031 in the testing set.
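For reference, a minimal sketch of producing a three-way split with those sizes, using placeholder arrays in place of the real 18,845 examples:

import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the 18,845 examples (hypothetical features/labels).
X = np.random.rand(18845, 10)
y = np.random.randint(0, 20, size=18845)

# First carve off the 11,314-example training set ...
X_train, X_rest, y_train, y_rest = train_test_split(X, y, train_size=11314, random_state=0)
# ... then divide the remaining 7,531 into 5,500 validation and 2,031 test examples.
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, train_size=5500, random_state=0)
print(len(X_train), len(X_val), len(X_test))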
Related
Let's say I initially split my dataset into training (80%) and test (20%) sets, perform a 10-fold CV on my training set, and obtain an average R² of 75%. After that I check the best model's accuracy on the test set and obtain an R² of 74%, which indicates that the model is fairly robust. Now, before deploying it to real applications, I tune it with the whole data. Someone asks me for the model's approximate R²; if I say 74% or 75%, I will be ignoring the fact that the model has now been tuned with more data (the test set). Is it a reasonable approach to perform a leave-one-out CV on the chosen model with the whole data, compare the predicted targets with the real ones, check the R² (let's say it's 80% now), and say that the real-world model will most likely have an R² of 80%? I see no problems with that, but I do not know if this approach is correct.
It is true that you should train again on the whole data and it might lead to performance improvements. However, in this context your whole data should not be train + test! It should be just the training dataset, but without any cross-validation. So before, you had 80% for training and you were doing 10-fold CV, meaning that you were actually training your model on 72% of your complete data (train + test) and keeping 8% for validation. Now you should train it on the whole 80% and report your final results, again on the unseen test set.
If you do LOOCV on train + test, you cannot report your performance on the validation samples, because that is how the model is fine-tuned and you might well overfit to the validation data.
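A minimal sketch of that recipe, with synthetic regression data standing in for the real problem: cross-validate on the 80% training portion only, then refit on that whole portion and report a single R² on the untouched 20% test set.

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_regression(n_samples=1000, n_features=20, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = Ridge(alpha=1.0)
cv_r2 = cross_val_score(model, X_train, y_train, cv=10, scoring='r2')
print('10-fold CV R^2 on the training portion:', cv_r2.mean())

model.fit(X_train, y_train)  # refit on the full 80%
print('R^2 on the unseen test set:', model.score(X_test, y_test))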
What is the difference between the Holdout dataset and the Validation dataset in Machine Learning context?
The validation dataset is used during training to track the performance of your model on "unseen" data. I wrote "unseen" in quotes because, although the model doesn't directly see the data in the validation set, you optimize the hyper-parameters to decrease the loss on the validation set (since an increasing validation loss means over-fitting). However, by doing so, you may over-fit the hyper-parameters to the validation set (so that the loss is low on that specific validation set but becomes worse on any other unseen set). That's why you usually keep a third set, called the test set (or held-out set), which is your truly unseen data, and you test the performance of your model on that test set only once, after training your final model.
I have a simple question that has made me doubt my work all of a sudden.
If I only have a training and validation set, am I allowed to monitor val_loss while training, or does that add bias to my training? I want to test my accuracy at the end of training on my validation set, but now I am wondering: if I am monitoring that dataset while training, would that be problematic or not?
Short answer - yes, monitoring validation error and using it as a basis for decisions about the specific setup of your algorithm adds bias to your algorithm. To elaborate a bit:
1) You fix the hyperparameters of an ML algorithm and then train it on the train set. Your resulting ML algorithm with this specific hyperparameter setup overfits to the training set, and you use the validation set to estimate what performance you can expect with these hyperparameters on unseen data.
2) But you obviously want to adjust your hyperparameters to get the best performance. You may be doing a grid search or something like it to find the best hyperparameter settings for this specific algorithm using the validation set. As a result, your hyperparameter settings overfit to the validation set. Think of it as some information about the validation set leaking into your model through the hyperparameters.
3) As a result, you must do the following: split the data into a training set, a validation set and a test set. Use the training set for training and the validation set to make decisions about specific hyperparameters. When you are done (fully done!) with fine-tuning your model, you must use the test set, which the model has never seen, to get an estimate of the final performance in "combat mode".
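A minimal sketch of that three-set workflow, assuming synthetic data and an illustrative hyperparameter grid (none of it from the original post):

from sklearn.datasets import make_classification
from sklearn.model_selection import ParameterGrid, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=30, random_state=0)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

best_score, best_params = -1.0, None
for params in ParameterGrid({'C': [0.1, 1, 10], 'gamma': ['scale', 0.01]}):
    model = SVC(**params).fit(X_train, y_train)
    val_score = model.score(X_val, y_val)  # all tuning decisions use the validation set
    if val_score > best_score:
        best_score, best_params = val_score, params

# The test set is touched exactly once, after all tuning is finished.
final_model = SVC(**best_params).fit(X_train, y_train)
print('test accuracy:', final_model.score(X_test, y_test))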
I have split the dataset (around 28K images) into a 75% train set and a 25% test set. Then I took 15% of the train set at random and 15% of the test set at random to create a validation set. The goal is to classify the images into two categories. The exact image sample can't be shared, but it is similar to the one attached. I'm using this model: VGG19 with ImageNet weights, with the last two layers trainable and 4 dense layers appended. I am also using ImageDataGenerator to augment the images. I trained the model for 30 epochs and found that the training accuracy is 95% and the validation accuracy is 96%, yet on the test dataset it fell enormously, to only 75%.
I have tried regularization and dropout to tackle the overfitting, if that is what it is suffering from. I have also tried one more thing: using the test set as the validation set and testing the model on that same test set. The results were: train accuracy = 96%, validation accuracy = 96.3% and test accuracy = 68%. I don't understand what I should do.
First off, when you split the data you need to make sure that the relative size of every class is the same in each of the new datasets. The data can be imbalanced if that is the distribution of your initial data, but it must have the same imbalance in all datasets after the split.
Now, regarding the split: if you need train, validation and test sets, they must all be independent of each other (no shared samples). This is important if you don't want to cheat yourself with the results you are getting.
In general, in machine learning we start from a training set and a test set. For choosing the best model architecture/hyper-parameters, we further divide the training set to get the validation set (the test set should not be touched).
After determining the best architecture/hyper-parameters for our model, we combine the training and validation sets and train the best-case model from scratch on the combined full training set. Only now do we get to test the results on the test set.
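A minimal sketch of that procedure, with a stratified split and synthetic tabular data plus an arbitrary classifier standing in for the image problem above:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=40, weights=[0.7, 0.3], random_state=0)

# Stratify so every split keeps the same class imbalance.
X_trval, X_test, y_trval, y_test = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trval, y_trval, test_size=0.2, stratify=y_trval, random_state=0)

# Pick a hyperparameter using the validation set ...
best_depth, best_val = None, -1.0
for depth in (3, 5, 10):
    rf = RandomForestClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    val = rf.score(X_val, y_val)
    if val > best_val:
        best_val, best_depth = val, depth

# ... then retrain from scratch on train + validation and test only once.
final = RandomForestClassifier(max_depth=best_depth, random_state=0)
final.fit(np.concatenate([X_train, X_val]), np.concatenate([y_train, y_val]))
print('test accuracy:', final.score(X_test, y_test))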
I had faced a similar issue in one of my practice projects.
My InceptionV3 model gave a high training accuracy (99%), a high validation accuracy (95%+) but a very low testing accuracy (55%).
The dataset was a subset of the popular Dogs vs. Cats dataset (https://www.kaggle.com/c/dogs-vs-cats/data), made by me, having 15k images split into 3 folders: train, valid, and test in the ratio of 60:20:20 (9000, 3000, 3000 each halved for cats folder and dogs folder).
The error in my case was actually in my code; it had nothing to do with the model or the data. The model had been defined inside a function, and that was creating an untrained instance during the evaluation. Hence, an untrained model was being tested on the test dataset. After correcting the errors in my notebook I got a 96%+ testing accuracy.
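A hypothetical sketch of that kind of bug (not the author's actual notebook code): the model is built inside a function, and the evaluation step accidentally calls the builder again, so a fresh untrained instance is what gets scored.

import numpy as np
from tensorflow import keras

def build_model():
    model = keras.Sequential([
        keras.Input(shape=(20,)),
        keras.layers.Dense(32, activation='relu'),
        keras.layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model

X_train, y_train = np.random.rand(500, 20), np.random.randint(0, 2, 500)
X_test, y_test = np.random.rand(200, 20), np.random.randint(0, 2, 200)

model = build_model()
model.fit(X_train, y_train, epochs=5, verbose=0)

# Bug: this creates a fresh, untrained model, so test accuracy sits near chance.
untrained = build_model()
print('buggy evaluation:', untrained.evaluate(X_test, y_test, verbose=0))

# Fix: evaluate the instance that was actually trained.
print('correct evaluation:', model.evaluate(X_test, y_test, verbose=0))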
Links:
https://colab.research.google.com/drive/1-PO1KJYvXdNC8LbvrdL70oG6QbHg_N-e?usp=sharing&fbclid=IwAR2k9ZCXvX_y_UNWpl4ljs1y0P3budKmlOggVrw6xI7ht0cgm03_VeoKVTI
https://drive.google.com/drive/u/3/folders/1h6jVHasLpbGLtu6Vsnpe1tyGCtR7bw_G?fbclid=IwAR3Xtsbm_EZA3TOebm5EfSvJjUmndHrWXm4Iet2fT3BjE6pPJmnqIwW8KWY
Other probable causes:
One possibility is that the testing set has a different distribution than the validation set (this could be ruled out by joining all the data, shuffling, and splitting again into train, validation and test).
Try swapping the validation and test sets with each other and see if it has an effect (sometimes one set simply contains relatively harder examples).
The training may have effectively overfitted to the validation set (is it possible that during training, at one or more steps, the checkpoint with the best validation score is the one that is kept?).
Overlapping images between sets, or a lack of shuffling.
In the deep learning world, if something seems way too odd to be true, or way too good to be true, a good guess is that it's probably a bug unless proven otherwise!
I was wondering whether a model also trains itself on the test data while being evaluated on it multiple times, leading to an over-fitting scenario. Normally we split the data into train and test splits, and I noticed some people split it into 3 sets: train, test and eval, where eval is for the final evaluation of the model. I might be wrong, but my point is that if the above-mentioned scenario is not true, then there is no need for an eval data set.
Need some clarification.
The best way to evaluate how well a model will perform in the 'wild' is to evaluate its performance on a data set it has not seen (i.e., not been trained on) -- assuming you have the labels in a supervised learning problem.
People split their data into train/test/eval and use the training data to estimate/learn the model parameters and the test set to tune the model (e.g., by trying different hyperparameter combinations). A model is usually selected based on the hyperparameter combination that optimizes a test metric (regression - MSE, R^2, etc.; classification - AUC, accuracy, etc.). The selected model is then usually retrained on the combined train + test data set. After retraining, the model is evaluated based on its performance on the eval data set (assuming you have some ground-truth labels to evaluate your predictions). The eval metric is what you report as the generalization metric -- that is, how well your model performs on novel data.
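A compact regression sketch of that train/test/eval workflow, with synthetic data and an illustrative alpha grid (all hypothetical):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1500, n_features=25, noise=15, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_test, X_eval, y_test, y_eval = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Tune on the test split ...
best_alpha, best_r2 = None, -np.inf
for alpha in (0.1, 1.0, 10.0):
    r2 = Ridge(alpha=alpha).fit(X_train, y_train).score(X_test, y_test)
    if r2 > best_r2:
        best_alpha, best_r2 = alpha, r2

# ... retrain on train + test, then report R^2 once on the untouched eval split.
final = Ridge(alpha=best_alpha).fit(np.concatenate([X_train, X_test]),
                                    np.concatenate([y_train, y_test]))
print('generalization R^2 (eval):', final.score(X_eval, y_eval))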
Does this help?
Consider that you have train and test datasets. The train dataset is the one for which you know the output; you train your model on the train dataset and then try to predict the output for the test dataset.
Most people split the train dataset further into train and validation sets. So first you fit your model on the train data and evaluate it on the validation set. Then you run the model on the test dataset.
Now you may be wondering how this helps and what use it is.
It helps you understand your model's performance on seen data (the validation data) and unseen data (your test data).
Here the bias-variance trade-off comes into the picture:
https://machinelearningmastery.com/gentle-introduction-to-the-bias-variance-trade-off-in-machine-learning/
Let's consider a binary classification example where a student's previous semester grades, sports achievements, extracurriculars, etc. are used to predict whether or not they will pass the final semester.
Let's say we have around 10000 samples (data of 10000 students).
Now we split them:
Training set - 6000 samples
Validation set - 2000 samples
Test set - 1000 samples
The data is generally split into three (training set, validation set, and test set) for the following reasons:
1) Feature Selection: Let's assume you have trained the model using some algorithm. You calculate the training accuracy and validation accuracy, plot the learning curves, and check whether the model is overfitting or underfitting, then make changes (add or remove features, add more samples, etc.). Repeat until you have the best validation accuracy. Now test the model with the test set to get your final score.
2) Parameter Selection: When you use algorithms like KNN, you need to find the K value that fits the model best. You can plot the accuracy for different K values, choose the one with the best validation accuracy, and use that model on your test set (the same applies when choosing n_estimators for random forests, etc.); see the sketch after this list.
3) Model Selection: You can also train the model with different algorithms and choose the one that better fits the data by checking accuracy on the validation set.
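A minimal sketch of point 2, choosing K for KNN on a validation set and then scoring once on the test set, with synthetic data standing in for the student records and the split sizes quoted above:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=9000, n_features=12, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, train_size=6000, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, train_size=2000, random_state=0)

val_scores = {}
for k in range(1, 21, 2):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    val_scores[k] = knn.score(X_val, y_val)

best_k = max(val_scores, key=val_scores.get)
final = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print('best K:', best_k, 'test accuracy:', final.score(X_test, y_test))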
So basically the validation set helps you evaluate your model's performance and decide how to fine-tune it for the best accuracy.
Hope you find this helpful.