Data augmentation in test/validation set? - machine-learning

Is it common practice to augment data (add samples programmatically, such as random crops in the case of an image dataset) on both the training and test sets, or just on the training set?

Only on training. Data augmentation is used to increase the size of the training set and to get more different images.
Technically, you could use data augmentation on the test set to see how the model behaves on such images, but usually, people don't do it.

Data augmentation is done only on the training set, as it helps the model generalize better and become more robust. So there's no point in augmenting the test set.

This answer on stats.SE makes the case for applying crops to the validation / test sets so as to make that input similar to the input the network was trained on.

Do it only on the training set. And, of course, make sure that the augmentation does not make the label wrong (e.g. when rotating 6 and 9 by about 180°).
The reason why we use a training and a test set in the first place is that we want to estimate the error our system will have in reality. So the data for the test set should be as close to real data as possible.
If you do it on the test set, you might introduce errors. For example, say you want to recognize digits and you augment by rotating. Then a 6 might look like a 9. But not all examples are that easy to spot. Better safe than sorry.

I would argue that, in some cases, using data augmentation for the validation set can be helpful.
For example, I train a lot of CNNs for medical image segmentation. Many of the augmentation transforms that I use are meant to reduce the image quality so that the network is trained to be robust against such data. If the training set looks bad and the validation set looks nice, it will be hard to compare the losses during training and therefore assessing overfit will be complicated.
I would never use augmentation for the test set unless I'm using test-time augmentation to improve results or estimate aleatoric uncertainty.
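Test-time augmentation, as mentioned above, can be sketched roughly as follows. This is only an illustration under stated assumptions: `toy_model`, `hflip`, `vflip` and `predict_with_tta` are hypothetical names, images are plain lists of rows, and the "model" is a stand-in function rather than a trained network. The only point is that predictions over several views of one test image are averaged.

```python
from statistics import mean

def hflip(image):
    """Mirror a 2D image (list of rows) left to right."""
    return [row[::-1] for row in image]

def vflip(image):
    """Mirror a 2D image top to bottom."""
    return image[::-1]

def toy_model(image):
    """Hypothetical stand-in for a trained network: returns a score in [0, 1].

    It only looks at the left half of the image, so its output
    changes when the image is flipped horizontally.
    """
    width = len(image[0])
    total = sum(sum(row) for row in image)
    left = sum(sum(row[: width // 2]) for row in image)
    return left / total if total else 0.5

def predict_with_tta(model, image, transforms):
    """Average the model output over the original image and its augmented views."""
    views = [image] + [t(image) for t in transforms]
    return mean(model(v) for v in views)

image = [[4, 0], [4, 0]]
single = toy_model(image)  # prediction on the original view only
averaged = predict_with_tta(toy_model, image, [hflip, vflip])
```

The spread of the per-view predictions (not just their mean) is what one would look at when using this idea to get a rough uncertainty estimate.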

In computer vision, you can use data augmentation during test time to obtain different views on the test image. You then have to aggregate the results obtained from each image for example by averaging them.
For example, given the symbol below, changing the point of view can lead to different interpretations: [image not available]

Some image preprocessing software tools like Roboflow (https://roboflow.com/) apply data augmentation to test data as well. I'd say that if one is dealing with small and rare objects, say, cerebral microbleeds (which are tiny and difficult to spot on magnetic resonance images), augmenting one's test set could be useful. Then you can verify that your model has learned to detect these objects given different orientation and brightness conditions (given that your training data has been augmented in the same way).

The goal of data augmentation is to make the model generalize and learn more orientations of the images, so that during testing the model is able to handle the test data well. So it is common practice to use augmentation only on the training set.

The point of adding validation data is to build a generalized model, that is, one that predicts real-world data well. In order to predict real-world data, the validation set should contain real data. There is no problem with augmenting validation data, but it won't increase the accuracy of the model.

Here are my two cents:
You train your model on the training data and the validation data: the former to optimize your parameters, and the latter to give you an appropriate stopping condition. The test data is to give you a real-world estimate of how well you can expect your model to perform.
For training, you can augment your training data to increase robustness to various factors including, but not limited to, sampling error, bias between data sources, shifts in global data distribution, positioning, and any other sort of variation you would like to account for.
The validation data should indicate to the training method when the model is most generalizable. By this logic, if you expect to see some variation in real-world data that can be simulated using data augmentation, then by all means, the validation dataset should be augmented.
The test data, on the other hand, should not be augmented, except potentially in special scenarios where data is very limited, and an estimate of real-world performance on test data has too much variance.

You can use augmentation data in training, validation and test sets.
The only thing to avoid is using the same data from the training set in validation or test sets.
For example, if you generate 3 augmented instances from a record of the training data, make sure that none of these 3 augmented instances accidentally ends up in the validation or test sets.
Using data from the training set, even augmented data, to validate or test a model is a methodological mistake.
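One way to avoid this kind of leakage is to split the original records first and augment only afterwards. The sketch below assumes a made-up record layout (`id`, `x`, `y`) and a hypothetical `augment` function that jitters one feature; neither comes from the answer above.

```python
import random

def augment(sample, n_copies, rng):
    """Hypothetical augmentation: jitter the feature, keep the label and source id."""
    return [
        {"id": sample["id"], "x": sample["x"] + rng.uniform(-0.1, 0.1), "y": sample["y"]}
        for _ in range(n_copies)
    ]

rng = random.Random(0)
records = [{"id": i, "x": float(i), "y": i % 2} for i in range(10)]

# 1. Split the ORIGINAL records first.
rng.shuffle(records)
train, val, test = records[:6], records[6:8], records[8:]

# 2. Augment only after splitting (here just the training split),
#    so augmented copies can never cross a split boundary.
train_aug = train + [a for s in train for a in augment(s, 3, rng)]

# Sanity check: no source record appears on both sides.
train_ids = {s["id"] for s in train_aug}
held_out_ids = {s["id"] for s in val + test}
assert train_ids.isdisjoint(held_out_ids)
```

Keeping the source `id` on every augmented copy makes the check at the end possible even after the data has been shuffled again.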

Related

Is it possible to apply image pre-processing techniques on test data

In the case of a deep CNN task, I do understand that sometimes image pre-processing techniques such as gaussian filtering and cropping can be helpful for deep CNN modeling. I wonder if it is also acceptable to be applied to testing data as well. I've always thought that test data should never be touched whatsoever so that the model performance can be evaluated accurately.
As a matter of fact, you do need to apply those filters, which you have used on training data, on your test as well!
The rule that you should not touch your test data is about not using it during training, so that generalization is learned only from the training data and, when you evaluate on the test set, you get a realistic measure of your model's performance and quality.
Any filtering, like Gaussian, applied on train data, before injecting them to model training, should be done on the test data as well.
For cropping, it really depends on how you crop and what you crop. If your photos always have frames around them, and in the training dataset you crop to remove those, I highly suggest doing the same for the test set.
Yes. The preprocessing is then part of your model and should therefore be part of testing.
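A minimal sketch of "preprocessing is part of the model": the same `preprocess` function is applied to training and test images alike, by wrapping it into the prediction call. Everything here is an assumption for illustration: the function names are made up, the horizontal box blur is a crude stand-in for a Gaussian filter, and `toy_model` is not a real CNN.

```python
def crop_border(image, margin=1):
    """Remove a fixed-width frame from a 2D image (list of rows)."""
    return [row[margin:-margin] for row in image[margin:-margin]]

def box_blur_rows(image):
    """1x3 horizontal smoothing with edge replication (stand-in for a Gaussian filter)."""
    out = []
    for row in image:
        padded = [row[0]] + row + [row[-1]]
        out.append([(padded[i] + padded[i + 1] + padded[i + 2]) / 3 for i in range(len(row))])
    return out

def preprocess(image):
    """The single preprocessing pipeline shared by training and testing."""
    return box_blur_rows(crop_border(image))

def toy_model(image):
    """Hypothetical model: mean pixel intensity of the preprocessed image."""
    pixels = [p for row in image for p in row]
    return sum(pixels) / len(pixels)

def predict(raw_image):
    """Raw test images go through the same preprocessing as training images."""
    return toy_model(preprocess(raw_image))

# A 2x2 image surrounded by a bright frame that the crop removes.
framed = [
    [9, 9, 9, 9],
    [9, 1, 3, 9],
    [9, 3, 1, 9],
    [9, 9, 9, 9],
]
score = predict(framed)
```

Routing every image through `predict` means there is no code path where a test image could reach the model without the crop and filter.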

(cross) validation of CNN models - when to bring in test data?

For a research project, I am working on using a CNN for defect detection on images of weld beads, using a dataset containing about 500 images. To do so, I am currently testing different models (e.g. ResNet18-50), as well as data augmentation and transfer learning techniques.
After having experimented for a while, I am wondering which way of training/validation/testing is best suited to provide an accurate measurement of performance while at the same time keeping the computational cost at a reasonable level, considering I want to test as many models etc. as possible.
I guess it would be reasonable to perform some sort of cross validation (cv), especially since the dataset is so small. However, after conducting some research, I could not find a clear best way to apply this. It seems to me that different people follow different strategies, and I am a little confused.
Could somebody maybe clarify:
Do you use the validation set only to find the best performing epoch/weights while training and then directly test every model on the test set (and repeat this k times as a k-fold-cv)?
Or do you find the best performing model by comparing the average validation set accuracies of all models (with k runs per model) and then check its accuracy on the test set? If so, which exact weights do you load for testing, or would one perform another cv for the best model to determine the final test accuracy?
Would it be an option to perform multiple consecutive training-validation-test runs for each model, shuffling the dataset before each run and splitting it into "new" training, validation and test sets to determine an average test accuracy (like Monte Carlo cv, but maybe with fewer runs)?
Thanks a lot!
In your case I would set aside 100 images as your test set. You should make sure that they resemble what you expect your final model to be able to handle. E.g. it should contain objects which are not in the training or validation set, because you probably also want it to generalize to new weld beads. Then you can do something like 5-fold cross validation on the remaining 400 images, where you randomly or intelligently sample 100 images as your validation set. You then train your model on the other 300 and use the sampled 100 for validation purposes. From the repeated runs you can compute a confidence interval on the performance of the model. This confidence interval is what you use to tune your hyperparameters.
The test set is then useful to predict the performance on novel data, but you should NEVER use it to tune hyperparameters.
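The procedure described in this answer might look roughly like the sketch below. It assumes 500 image ids, five random 300/100 train/validation draws from the 400 non-test images, and a normal-theory interval using the t-value for 4 degrees of freedom. `train_and_score` is a hypothetical placeholder that returns a noisy number instead of training a CNN.

```python
import random
import statistics

def train_and_score(train_ids, val_ids, rng):
    """Hypothetical placeholder: train on train_ids, return validation accuracy.

    Here it just returns a noisy number so the sketch runs end to end.
    """
    return 0.85 + rng.uniform(-0.03, 0.03)

rng = random.Random(42)
all_ids = list(range(500))
rng.shuffle(all_ids)

# Set the test images aside once; they are not touched during tuning.
test_ids, dev_ids = all_ids[:100], all_ids[100:]

scores = []
for _ in range(5):
    shuffled = rng.sample(dev_ids, len(dev_ids))
    val_ids, train_ids = shuffled[:100], shuffled[100:]  # 100 validation, 300 training
    scores.append(train_and_score(train_ids, val_ids, rng))

mean_score = statistics.mean(scores)
# Approximate 95% interval; 2.776 is the two-sided t-value for 4 degrees of freedom.
half_width = 2.776 * statistics.stdev(scores) / len(scores) ** 0.5
interval = (mean_score - half_width, mean_score + half_width)
```

Two hyperparameter settings whose intervals overlap heavily are hard to tell apart with this little data, which is an argument for preferring the simpler one.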

Is it a good practice to use your full data set for predictions?

I know you're supposed to separate your training data from your testing data, but when you make predictions with your model is it OK to use the entire data set?
I assume separating your training and testing data is valuable for assessing the accuracy and prediction strength of different models, but once you've chosen a model I can't think of any downsides to using the full data set for predictions.
You can use the full data for prediction, but it is better to retain the indexes of the train and test data. Here are the pros and cons:
Pro:
If you retain the indexes of the rows belonging to the train and test data, then you only need to predict once (which saves time) to get all results. You can calculate performance indicators (R2/MAE/AUC/F1/precision/recall etc.) for train and test data separately after subsetting the actual and predicted values using the train and test set indexes.
Cons:
If you calculate a performance indicator for the entire data set (without clearly separating train and test using indexes), you will get overly optimistic estimates. This happens because the model, having been trained on the train data, gives good results on the train data. Depending on the train/test split percentage, this gives deceptively good performance indicator values.
Processing large test data at once may cause a memory spike, which can result in a crash in all-objects-in-memory languages like R.
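As a rough illustration of the "predict once, score by index" idea above, the sketch below fits a one-feature least-squares line on the training rows, predicts for every row, and then computes MAE separately per subset. The data values and the names `fit_line`, `mae`, `train_idx` and `test_idx` are invented for the example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

def mae(actual, predicted, indexes):
    """Mean absolute error restricted to the given row indexes."""
    return sum(abs(actual[i] - predicted[i]) for i in indexes) / len(indexes)

x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [0.1, 0.9, 2.2, 2.8, 4.1, 5.3, 5.8, 7.2]

train_idx = [0, 1, 2, 3, 4, 5]  # retained row indexes
test_idx = [6, 7]

a, b = fit_line([x[i] for i in train_idx], [y[i] for i in train_idx])

# Predict once for the full data set...
pred = [a * xi + b for xi in x]

# ...then report each metric on its own subset.
train_mae = mae(y, pred, train_idx)
test_mae = mae(y, pred, test_idx)
overall_mae = mae(y, pred, train_idx + test_idx)  # mixes both; not a fair estimate
```

The overall figure is just a size-weighted average of the two subset figures, so with a large training share it mostly reflects training performance.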
In general, you're right - when you've finished selecting your model and tuning the parameters, you should use all of your data to actually build the model (exception below).
The reason for dividing data into train and test is that, without out-of-bag samples, high-variance algorithms will do better than low-variance ones, almost by definition. Consequently, it's necessary to split data into train and test parts for questions such as:
deciding whether kernel-SVR is better or worse than linear regression, for your data
tuning the parameters of kernel-SVR
However, once these questions are settled, then, in general, as long as your data is generated by the same process, the more data you use, the better the predictions will be, so you should use all of it.
An exception is the case where the data is, say, non-stationary. Suppose you're training for the stock market, and you have data from 10 years ago. It is unclear that the process hasn't changed in the meantime. You might be harming your prediction, by including more data, in this case.
Yes, there are techniques for doing this, e.g. k-fold cross-validation:
One of the main reasons for using cross-validation instead of using the conventional validation (e.g. partitioning the data set into two sets of 70% for training and 30% for test) is that there is not enough data available to partition it into separate training and test sets without losing significant modelling or testing capability. In these cases, a fair way to properly estimate model prediction performance is to use cross-validation as a powerful general technique.
That said, there may not be a good reason for doing so if you have plenty of data, because it means that the model you're using hasn't actually been tested on real data. You're inferring that it probably will perform well, since models trained using the same methods on less data also performed well. That's not always a safe assumption. Machine learning algorithms can be sensitive in ways you wouldn't expect a priori. Unless you're very starved for data, there's really no reason for it.

Combining training, validation and test datasets

Is it possible to train a model on the training and validation data sets, basically combining both of them to create a new model, and then use that combined model to classify all of the data in the test dataset?
This is what is usually done, assuming that you know how to transfer hyperparameters: you usually fit the model on the train data and select hyperparameters based on the score on the validation data. When you combine train + valid you get a significantly bigger dataset, so the "optimal hyperparameters" might be completely different from the ones you selected before. So in general, yes, this is exactly what is usually done, but it might be trickier than you expect (especially if your method is highly stochastic, non-deterministic, etc.).
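A small sketch of that workflow, under the assumption that the selected hyperparameter carries over unchanged to the larger data set (which, as noted above, it may not). The model is a one-feature ridge regression without intercept, chosen only because it fits in a few lines; the data and names are invented.

```python
def fit_ridge(xs, ys, lam):
    """One-feature ridge without intercept: w = sum(x*y) / (sum(x*x) + lam)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def mse(w, xs, ys):
    """Mean squared error of the prediction w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

train_x, train_y = [1.0, 2.0, 3.0], [2.1, 3.9, 6.2]
val_x, val_y = [4.0, 5.0], [8.1, 9.8]
test_x, test_y = [6.0, 7.0], [12.2, 13.9]

# 1. Select the hyperparameter by fitting on train and scoring on validation.
candidates = [0.0, 0.1, 1.0, 10.0]
best_lam = min(
    candidates,
    key=lambda lam: mse(fit_ridge(train_x, train_y, lam), val_x, val_y),
)

# 2. Refit on train + validation with the selected value
#    (assumes the value transfers to the larger data set).
final_w = fit_ridge(train_x + val_x, train_y + val_y, best_lam)

# 3. Touch the test set only once, for the final estimate.
test_error = mse(final_w, test_x, test_y)
```

For a regularization strength like `lam`, the best value often depends on the number of samples, which is one concrete way the transfer can go wrong.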

What is the right way to measure if a machine learning model has overfit?

I understand the intuitive meaning of overfitting and underfitting. Now, given a particular machine learning model that is trained upon the training data, how can you tell if the training overfitted or underfitted the data? Is there a quantitative way to measure these factors?
Can we look at the error and say if it has overfit or underfit?
I believe the easiest approach is to have two sets of data: training data and validation data. You train the model on the training data as long as the fitness of the model on the training data is close to the fitness of the model on the validation data. When the model's fitness is increasing on the training data but not on the validation data, then you're overfitting.
The usual way, I think, is known as cross-validation. The idea is to split the training set into several pieces, known as folds, then pick one at a time for evaluation and train on the remaining ones.
It does not, of course, measure the actual overfitting or underfitting, but if you can vary the complexity of the model, e.g. by changing the regularization term, you can find the optimal point. This is as far as one can go with just training and testing, I think.
You don't look at the error on the training data, but on the validation data only.
A common way of testing is to try different model complexities, and see how the error changes with model complexity. Usually these have a typical curve. In the beginning, the errors quickly improve. Then there is saturation (where the model is good). After that, the training error keeps decreasing, not because the model is better but because it is overfitting, while the validation error no longer improves. You want to be on the low complexity end of the plateau: the simplest model that provides a reasonable generalization.
The existing answers are not strictly speaking wrong, but they are not complete. Yes, you do need a validation set, but an important issue here is that you do not simply look at the model error on the validation set and try to minimize it. That will lead to overfitting all the same, because you will effectively be fitting on the validation set that way. The right approach is not to minimize the error on your sets, but to make the error independent of which training and validation sets you use. If the error on the validation set is significantly different from the error on the training set (it doesn't matter if it is worse or better), then the model is overfit. Also, this should certainly be done in a cross-validation way, where you train on some random set and then validate on another random set.
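The complexity sweep described above can be tried on synthetic data. The sketch below is an assumption-laden toy: the "model" is a piecewise-constant regressor whose complexity is its number of bins, the data is a noisy sine wave, and all names are made up. It prints training and validation error for each complexity so the shape of the two curves can be compared.

```python
import math
import random

def fit_bins(xs, ys, n_bins):
    """Piecewise-constant regressor on [0, 1): mean of y in each of n_bins equal bins."""
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for x, y in zip(xs, ys):
        b = min(int(x * n_bins), n_bins - 1)
        sums[b] += y
        counts[b] += 1
    overall = sum(ys) / len(ys)
    # Bins with no training points fall back to the overall mean.
    return [s / c if c else overall for s, c in zip(sums, counts)]

def mse(model, xs, ys):
    """Mean squared error of the binned model on (xs, ys)."""
    n_bins = len(model)
    return sum(
        (model[min(int(x * n_bins), n_bins - 1)] - y) ** 2 for x, y in zip(xs, ys)
    ) / len(xs)

def make_data(n, rng):
    """Noisy samples of sin(2*pi*x) on [0, 1)."""
    xs = [rng.random() for _ in range(n)]
    ys = [math.sin(2 * math.pi * x) + rng.gauss(0, 0.3) for x in xs]
    return xs, ys

rng = random.Random(1)
train_x, train_y = make_data(40, rng)
val_x, val_y = make_data(200, rng)

curve = []  # (complexity, training error, validation error)
for n_bins in [1, 2, 4, 8, 16, 32, 64]:
    model = fit_bins(train_x, train_y, n_bins)
    curve.append((n_bins, mse(model, train_x, train_y), mse(model, val_x, val_y)))

for n_bins, train_err, val_err in curve:
    print(f"bins={n_bins:3d}  train={train_err:.3f}  val={val_err:.3f}")
```

Because each bin count refines the previous one, the training error here can only go down as complexity grows; the validation column is the one that shows where extra complexity stops helping.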

Resources