(cross) validation of CNN models - when to bring in test data? - machine-learning

For a research project, I am working on using a CNN for defect detection on images of weld beads, using a dataset containing about 500 images. To do so, I am currently testing different models (e.g. ResNet18-50), as well as data augmentation and transfer learning techniques.
After having experimented for a while, I am wondering which way of training/validation/testing is best suited to provide an accurate measurement of performance while at the same time keeping the computational expensiveness at a reasonable level, considering I want to test as many models etc.as possible.
I guess it would be reasonable performing some sort of cross validation (cv), especially since the dataset is so small. However, after conducting some research, I could not find a clear best way to apply this and it seems to me that different people follow different strategies an I am a little confused.
Could somebody maybe clarify:
Do you use the validation set only to find the best performing epoch/weights while training and then directly test every model on the test set (and repeat this k times as a k-fold-cv)?
Or do you find the best performing model by comparing the average validation set accuracies of all models (with k-runs per model) and then check its accuracy on the testset? If so, which exact weigths do you load for testing or would one perform another cv for the best model to determine the final test accuracy?
Would it be an option to perform multiple consecutive training-validation-test runs for each model, while before each run shuffling the dataset and splitting it in “new” training-, validation- & testsets to determine an average test accuracy (like monte-carlo-cv, but maybe with less amount of runs)?
Thanks a lot!

In your case I would remove 100 images for your test set. You should make sure that they resemble what you expect your final model to be able to handle. E.g. it should be objects which are not in the training or validation set, because you probably also want it to generalize to new weld beads. Then you can do something like 5 fold cross validation, where you randomly or intelligently sample 100 images as your validation set. Then you train your model on the remaining 300 and use the remaining 100 for validation purposes. Then you can use a confidence interval on the performance of the model. This confidence interval is what you use to tune your hyperparameters.
The test set is then useful to predict the performance on novel data, but you should !!!NEVER!!! use it to tune hyperparmeter.

Related

Model selection for classification with random train/test sets

I'm working with an extremelly unbalanced and heterogeneous multiclass {K = 16} database for research, with a small N ~= 250. For some labels the database has a sufficient amount of examples for supervised machine learning, but for others I have almost none. I'm also not in a position to expand my database for a number of reasons.
As a first approach I divided my database into training (80%) and test (20%) sets in a stratified way. On top of that, I applied several classification algorithms that provide some results. I applied this procedure over 500 stratified train/test sets (as each stratified sampling takes individuals randomly within each stratum), hoping to select an algorithm (model) that performed acceptably.
Because of my database, depending on the specific examples that are part of the train set, the performance on the test set varies greatly. I'm dealing with runs that have as high (for my application) as 82% accuracy and runs that have as low as 40%. The median over all runs is around 67% accuracy.
When facing this situation, I'm unsure on what is the standard procedure (if there is any) when selecting the best performing model. My rationale is that the 90% model may generalize better because the specific examples selected in the training set are be richer so that the test set is better classified. However, I'm fully aware of the possibility of the test set being composed of "simpler" cases that are easier to classify or the train set comprising all hard-to-classify cases.
Is there any standard procedure to select the best performing model considering that the distribution of examples in my train/test sets cause the results to vary greatly? Am I making a conceptual mistake somewhere? Do practitioners usually select the best performing model without any further exploration?
I don't like the idea of using the mean/median accuracy, as obviously some models generalize better than others, but I'm by no means an expert in the field.
Confusion matrix of the predicted label on the test set of one of the best cases:
Confusion matrix of the predicted label on the test set of one of the worst cases:
They both use the same algorithm and parameters.
Good Accuracy =/= Good Model
I want to firstly point out that a good accuracy on your test set need not equal a good model in general! This has (in your case) mainly to do with your extremely skewed distribution of samples.
Especially when doing a stratified split, and having one class dominatingly represented, you will likely get good results by simply predicting this one class over and over again.
A good way to see if this is happening is to look at a confusion matrix (better picture here) of your predictions.
If there is one class that seems to confuse other classes as well, that is an indicator for a bad model. I would argue that in your case it would be generally very hard to find a good model unless you do actively try to balance your classes more during training.
Use the power of Ensembles
Another idea is indeed to use ensembling over multiple models (in your case resulting from different splits), since it is assumed to generalize better.
Even if you might sacrifice a lot of accuracy on paper, I would bet that a confusion matrix of an ensemble is likely to look much better than the one of a single "high accuracy" model. Especially if you disregard the models that perform extremely poor (make sure that, again, the "poor" performance comes from an actual bad performance, and not just an unlucky split), I can see a very good generalization.
Try k-fold Cross-Validation
Another common technique is k-fold cross-validation. Instead of performing your evaluation on a single 80/20 split, you essentially divide your data in k equally large sets, and then always train on k-1 sets, while evaluating on the other set. You then not only get a feeling whether your split was reasonable (you usually get all the results for different splits in k-fold CV implementations, like the one from sklearn), but you also get an overall score that tells you the average of all folds.
Note that 5-fold CV would equal a split into 5 20% sets, so essentially what you are doing now, plus the "shuffling part".
CV is also a good way to deal with little training data, in settings where you have imbalanced classes, or where you generally want to make sure your model actually performs well.

Data augmentation in test/validation set?

It is common practice to augment data (add samples programmatically, such as random crops, etc. in the case of a dataset consisting of images) on both training and test set, or just the training data set?
Only on training. Data augmentation is used to increase the size of the training set and to get more different images.
Technically, you could use data augmentation on the test set to see how the model behaves on such images, but usually, people don't do it.
Data augmentation is done only on training set as it helps the model become more generalize and robust. So there's no point of augmenting the test set.
This answer on stats.SE makes the case for applying crops on the validation / test sets so as to make that input similar the the input in the training set that the network was trained on.
Do it only on the training set. And, of course, make sure that the augmentation does not make the label wrong (e.g. when rotating 6 and 9 by about 180°).
The reason why we use a training and a test set in the first place is that we want to estimate the error our system will have in reality. So the data for the test set should be as close to real data as possible.
If you do it on the test set, you might have the problem that you introduce errors. For example, say you want to recognize digits and you augment by rotating. Then a 6 might look like a 9. But not all examples are that easy. Better be save than sorry.
I would argue that, in some cases, using data augmentation for the validation set can be helpful.
For example, I train a lot of CNNs for medical image segmentation. Many of the augmentation transforms that I use are meant to reduce the image quality so that the network is trained to be robust against such data. If the training set looks bad and the validation set looks nice, it will be hard to compare the losses during training and therefore assessing overfit will be complicated.
I would never use augmentation for the test set unless I'm using test-time augmentation to improve results or estimate aleatoric uncertainty.
In computer vision, you can use data augmentation during test time to obtain different views on the test image. You then have to aggregate the results obtained from each image for example by averaging them.
For example, given this symbol below, changing the point of view can lead to different interpretations :
Some image preprocessing software tools like Roboflow (https://roboflow.com/) apply data augmentation to test data as well. I'd say that if one is dealing with small and rare objects, say, cerebral microbleeds (which are tiny and difficult to spot on magnetic resonance images), augmenting one's test set could be useful. Then you can verify that your model has learned to detect these objects given different orientation and brightness conditions (given that your training data has been augmented in the same way).
The goal of data augmentation is to generalize the model and make it learn more orientation of the images, such that the during testing the model is able to apprehend the test data well. So, it is well practiced to use augmentation technique only for training sets.
The point of adding validation data is to build generalized model so it is nothing but to predict real-world data. inorder to predict real-world data, the validation set should contain real data. There is no problem with augmenting validation data but it won't increase the accuracy of the model.
Here are my two cents:
You train your model on the training data and the validation data: the former to optimize your parameters, and the latter to give you an appropriate stopping condition. The test data is to give you a real-world estimate of how well you can expect your model to perform.
For training, you can augment your training data to increase robustness to various factors including, but not limited to, sampling error, bias between data sources, shifts in global data distribution, positioning, and any other sort of variation you would like to account for.
The validation data should indicate to the training method when the model is most generalizable. By this logic, if you expect to see some variation in real-world data that can be simulated using data augmentation, then by all means, the validation dataset should be augmented.
The test data, on the other hand, should not be augmented, except potentially in special scenarios where data is very limited, and an estimate of real-world performance on test data has too much variance.
You can use augmentation data in training, validation and test sets.
The only thing to avoid is using the same data from the training set in validation or test sets.
For example, if you generate 3 augmented instances from an register of the training data, make sure that no one of these 3 augmented instances accidentally ends up in the validation or test sets.
It turns out that using data from the training set, even augmented data, to validate or test a model is a methodology mistake.

Is it a good practice to use your full data set for predictions?

I know you're supposed to separate your training data from your testing data, but when you make predictions with your model is it OK to use the entire data set?
I assume separating your training and testing data is valuable for assessing the accuracy and prediction strength of different models, but once you've chosen a model I can't think of any downsides to using the full data set for predictions.
You can use full data for prediction but better retain indexes of train and test data. Here are pros and cons of it:
Pro:
If you retain index of rows belonging to train and test data then you just need to predict once (and so time saving) to get all results. You can calculate performance indicators (R2/MAE/AUC/F1/precision/recall etc.) for train and test data separately after subsetting actual and predicted value using train and test set indexes.
Cons:
If you calculate performance indicator for entire data set (not clearly differentiating train and test using indexes) then you will have overly optimistic estimates. This happens because (having trained on train data) model gives good results of train data. Which depending of % split of train and test, will gives illusionary good performance indicator values.
Processing large test data at once may create memory bulge which is can result in crash in all-objects-in-memory languages like R.
In general, you're right - when you've finished selecting your model and tuning the parameters, you should use all of your data to actually build the model (exception below).
The reason for dividing data into train and test is that, without out-of-bag samples, high-variance algorithms will do better than low-variance ones, almost by definition. Consequently, it's necessary to split data into train and test parts for questions such as:
deciding whether kernel-SVR is better or worse than linear regression, for your data
tuning the parameters of kernel-SVR
However, once these questions are determined, then, in general, as long as your data is generated by the same process, the better predictions will be, and you should use all of it.
An exception is the case where the data is, say, non-stationary. Suppose you're training for the stock market, and you have data from 10 years ago. It is unclear that the process hasn't changed in the meantime. You might be harming your prediction, by including more data, in this case.
Yes, there are techniques for doing this, e.g. k-fold cross-validation:
One of the main reasons for using cross-validation instead of using the conventional validation (e.g. partitioning the data set into two sets of 70% for training and 30% for test) is that there is not enough data available to partition it into separate training and test sets without losing significant modelling or testing capability. In these cases, a fair way to properly estimate model prediction performance is to use cross-validation as a powerful general technique.
That said, there may not be a good reason for doing so if you have plenty of data, because it means that the model you're using hasn't actually been tested on real data. You're inferring that it probably will perform well, since models trained using the same methods on less data also performed well. That's not always a safe assumption. Machine learning algorithms can be sensitive in ways you wouldn't expect a priori. Unless you're very starved for data, there's really no reason for it.

Splitting training and test data

Can someone recommend what is the best percent of divided the training data and testing data in Machine learning. What are the disadvantages if i split training and test data in 30-70 percentage ?
There is no one "right way" to split your data unfortunately, people use different values which are chosen based on different heuristics, gut feeling and personal experience/preference. A good starting point is the Pareto principle (80-20).
Sometimes using a simple split isn't an option as you might simply have too much data - in that case you might need to sample your data or use smaller testing sets if your algorithm is computationally complex. An important part is to randomly choose your data. The trade-off is pretty simple: less testing data = the performance of your algorithm will have bigger variance. Less training data = parameter estimates will have bigger variance.
For me personally more important than the size of the split is that you obviously shouldn't always perform your tests only once on the same test split as it might be biased (you might be lucky or unlucky with your split). That's why you should do tests for several configurations (for instance you run your tests X times each time choosing a different 20% for testing). Here you can have problems with the so called model variance - different splits will result in different values. That's why you should run tests several times and average out the results.
With the above method you might find it troublesome to test all the possible splits. A well established method of splitting data is the so called cross validation and as you can read in the wiki article there are several types of it (both exhaustive and non-exhaustive). Pay extra attention to the k-fold cross validation.
Read up on the various strategies for cross-validation.
A 10%-90% split is popular, as it arises from 10x cross-validation.
But you could do 3x or 4x cross validation, too. (33-67 or 25-75)
Much larger errors arise from:
having duplicates in both test and train
unbalanced data
Make sure to first merge all duplicates, and do stratified splits if you have unbalanced data.

Training Random forest with different datasets gives totally different result! Why?

I am working with a dataset which contains 12 attributes including the timestamp and one attribute as the output. Also it has about 4000 rows. Besides there is no duplication in the records. I am trying to train a random forest to predict the output. For this purpose I created two different datasets:
ONE: Randomly chose 80% of data for the training and the other 20% for the testing.
TWO: Sort the dataset based on timestamp and then the first 80% for the training and the last 20% for the testing.
Then I removed the timestamp attribute from the both dataset and used the other 11 attributes for the training and the testing (I am sure the timestamp should not be part of the training).
RESULT: I am getting totally different result for these two datasets. For the first one AUC(Area under the curve) is 85%-90% (I did the experiment several times) and for the second one is 45%-50%.
I do appreciate if someone can help me to know
why I have this huge difference.
Also I need to have the test dataset with the latest timestamps (same as the dataset in the second experiment). Is there anyway to select data from the rest of the dataset for the training to improve the
training.
PS: I already test the random selection from the first 80% of the timestamp and it doesn't improved the performance.
First of all, it is not clear how exactly you're testing. Second, either way, you are doing the testing wrong.
RESULT: I am getting totally different result for these two datasets. For the first one AUC(Area under the curve) is 85%-90% (I did the experiment several times) and for the second one is 45%-50%.
Is this for the training set or the test set? If the test set, that means you have poor generalization.
You are doing it wrong because you are not allowed to tweak your model so that it performs well on the same test set, because it might lead you to a model that does just that, but that generalizes badly.
You should do one of two things:
1. A training-validation-test split
Keep 60% of the data for training, 20% for validation and 20% for testing in a random manner. Train your model so that it performs well on the validation set using your training set. Make sure you don't overfit: the performance on the training set should be close to that on the validation set, if it's very far, you've overfit your training set. Do not use the test set at all at this stage.
Once you're happy, train your selected model on the training set + validation set and test it on the test set you've held out. You should get acceptable performance. You are not allowed to tweak your model further based on the results you get on this test set, if you're not happy, you have to start from scratch.
2. Use cross validation
A popular form is 10-fold cross validation: shuffle your data and split it into 10 groups of equal or almost equal size. For each of the 10 groups, train on the other 9 and test on the remaining one. Average your results on the test groups.
You are allowed to make changes on your model to improve that average score, just run cross validation again after each change (make sure to reshuffle).
Personally I prefer cross validation.
I am guessing what happens is that by sorting based on timestamp, you make your algorithm generalize poorly. Maybe the 20% you keep for testing differ significantly somehow, and your algorithm is not given a chance to capture this difference? In general, your data should be sorted randomly in order to avoid such issues.
Of course, you might also have a buggy implementation.
I would suggest you try cross validation and see what results you get then.

Resources