In a post on Quora, someone says:
At test time, the layer is supposed to see only one test data point at
a time, hence computing the mean / variance along a whole batch is
infeasible (and is cheating).
But as long as testing data have not been seen by the network during training isn't it ok to use several testing images?
I mean, our network as been train to predict using batches, so what is the issue with giving it batches?
If someone could explain what informations our network gets from batches that it is not suppose to have that would be great :)
Thank you
But as long as testing data have not been seen by the network during training isn't it ok to use several testing images ?
First of all, it's ok to use batches for testing. Second, in test mode, batchnorm doesn't compute the mean or variance for the test batch. It takes the mean and variance it already has (let's call them mu and sigma**2), which are computed based solely on the training data. The result of batch norm in test mode is that all tensors x are normalized to (x - mu) / sigma.
At test time, the layer is supposed to see only one test data point at a time, hence computing the mean / variance along a whole batch is infeasible (and is cheating)
I just skimmed through Quora discussion, may be this quote has a different context. But taken on its own, it's just wrong. No matter what the batch is, all tensors will go through the same transformation, because mu and sigma are not changed during testing, just like all other variables. So there's no cheating there.
The claim is very simple, you train your model so it is useful for some task. And in classification the task is usually - you get a data point and you output the class, there is no batch. Of course in some practical applications you can have batches (say many images from the same user etc.). So that's it - it is application dependent, so if you want to claim something "in general" about the learning model you cannot make an assumption that during test time one has access to batches, that's all.
Related
For a research project, I am working on using a CNN for defect detection on images of weld beads, using a dataset containing about 500 images. To do so, I am currently testing different models (e.g. ResNet18-50), as well as data augmentation and transfer learning techniques.
After having experimented for a while, I am wondering which way of training/validation/testing is best suited to provide an accurate measurement of performance while at the same time keeping the computational expensiveness at a reasonable level, considering I want to test as many models etc.as possible.
I guess it would be reasonable performing some sort of cross validation (cv), especially since the dataset is so small. However, after conducting some research, I could not find a clear best way to apply this and it seems to me that different people follow different strategies an I am a little confused.
Could somebody maybe clarify:
Do you use the validation set only to find the best performing epoch/weights while training and then directly test every model on the test set (and repeat this k times as a k-fold-cv)?
Or do you find the best performing model by comparing the average validation set accuracies of all models (with k-runs per model) and then check its accuracy on the testset? If so, which exact weigths do you load for testing or would one perform another cv for the best model to determine the final test accuracy?
Would it be an option to perform multiple consecutive training-validation-test runs for each model, while before each run shuffling the dataset and splitting it in “new” training-, validation- & testsets to determine an average test accuracy (like monte-carlo-cv, but maybe with less amount of runs)?
Thanks a lot!
In your case I would remove 100 images for your test set. You should make sure that they resemble what you expect your final model to be able to handle. E.g. it should be objects which are not in the training or validation set, because you probably also want it to generalize to new weld beads. Then you can do something like 5 fold cross validation, where you randomly or intelligently sample 100 images as your validation set. Then you train your model on the remaining 300 and use the remaining 100 for validation purposes. Then you can use a confidence interval on the performance of the model. This confidence interval is what you use to tune your hyperparameters.
The test set is then useful to predict the performance on novel data, but you should !!!NEVER!!! use it to tune hyperparmeter.
I am reading the a deep learning with python book.
After reading chapter 4, Fighting Overfitting, I have two questions.
Why might increasing the number of epochs cause overfitting?
I know increasing increasing the number of epochs will involve more attempts at gradient descent, will this cause overfitting?
During the process of fighting overfitting, will the accuracy be reduced ?
I'm not sure which book you are reading, so some background information may help before I answer the questions specifically.
Firstly, increasing the number of epochs won't necessarily cause overfitting, but it certainly can do. If the learning rate and model parameters are small, it may take many epochs to cause measurable overfitting. That said, it is common for more training to do so.
To keep the question in perspective, it's important to remember that we most commonly use neural networks to build models we can use for prediction (e.g. predicting whether an image contains a particular object or what the value of a variable will be in the next time step).
We build the model by iteratively adjusting weights and biases so that the network can act as a function to translate between input data and predicted outputs. We turn to such models for a number of reasons, often because we just don't know what the function is/should be or the function is too complex to develop analytically. In order for the network to be able to model such complex functions, it must be capable of being highly-complex itself. Whilst this complexity is powerful, it is dangerous! The model can become so complex that it can effectively remember the training data very precisely but then fail to act as an effective, general function that works for data outside of the training set. I.e. it can overfit.
You can think of it as being a bit like someone (the model) who learns to bake by only baking fruit cake (training data) over and over again – soon they'll be able to bake an excellent fruit cake without using a recipe (training), but they probably won't be able to bake a sponge cake (unseen data) very well.
Back to neural networks! Because the risk of overfitting is high with a neural network there are many tools and tricks available to the deep learning engineer to prevent overfitting, such as the use of dropout. These tools and tricks are collectively known as 'regularisation'.
This is why we use development and training strategies involving test datasets – we pretend that the test data is unseen and monitor it during training. You can see an example of this in the plot below (image credit). After about 50 epochs the test error begins to increase as the model has started to 'memorise the training set', despite the training error remaining at its minimum value (often training error will continue to improve).
So, to answer your questions:
Allowing the model to continue training (i.e. more epochs) increases the risk of the weights and biases being tuned to such an extent that the model performs poorly on unseen (or test/validation) data. The model is now just 'memorising the training set'.
Continued epochs may well increase training accuracy, but this doesn't necessarily mean the model's predictions from new data will be accurate – often it actually gets worse. To prevent this, we use a test data set and monitor the test accuracy during training. This allows us to make a more informed decision on whether the model is becoming more accurate for unseen data.
We can use a technique called early stopping, whereby we stop training the model once test accuracy has stopped improving after a small number of epochs. Early stopping can be thought of as another regularisation technique.
More attempts of decent(large number of epochs) can take you very close to the global minima of the loss function ideally, Now since we don't know anything about the test data, fitting the model so precisely to predict the class labels of the train data may cause the model to lose it generalization capabilities(error over unseen data). In a way, no doubt we want to learn the input-output relationship from the train data, but we must not forget that the end goal is for the model to perform well over the unseen data. So, it is a good idea to stay close but not very close to the global minima.
But still, we can ask what if I reach the global minima, what can be the problem with that, why would it cause the model to perform badly on unseen data?
The answer to this can be that in order to reach the global minima we would be trying to fit the maximum amount of train data, this will result in a very complex model(since it is less probable to have a simpler spatial distribution of the selected number of train data that is fortunately available with us). But what we can assume is that a large amount of unseen data(say for facial recognition) will have a simpler spatial distribution and will need a simpler Model for better classification(I mean the entire world of unseen data, will definitely have a pattern that we can't observe just because we have an access small fraction of it in the form of training data)
If you incrementally observe points from a distribution(say 50,100,500, 1000 ...), we will definitely find the structure of the data complex until we have observed a sufficiently large number of points (max: the entire distribution), but once we have observed enough points we can expect to observe the simpler pattern present in the data that can be easily classified.
In short, a small fraction of train data should have a complex structure as compared to the entire dataset. And overfitting to the train data may cause our model to perform worse on the test data.
One analogous example to emphasize the above phenomenon from day to day life is as follows:-
Say we meet N number of people till date in our lifetime, while meeting them we naturally learn from them(we become what we are surrounded with). Now if we are heavily influenced by each individual and try to tune to the behaviour of all the people very closely, we develop a personality that closely resembles the people we have met but on the other hand we start judging every individual who is unlike me -> unlike the people we have already met. Becoming judgemental takes a toll on our capability to tune in with new groups since we trained very hard to minimize the differences with the people we have already met(the training data). This according to me is an excellent example of overfitting and loss in genralazition capabilities.
I am trying to over-fit my model over my training data that consists of only a single sample. The training accuracy comes out to be 1.00. But, when I predict the output for my test data which consists of the same single training input sample, the results are not accurate. The model has been trained for 100 epochs and the loss ~ 1e-4.
What could be the possible sources of error?
As mentioned in the comments of your post, it isn't possible to give specific advice without you first providing more details.
Generally speaking, your approach to overfitting a tiny batch (in your case one image) is in essence providing three sanity checks, i.e. that:
backprop is functioning
the weight updates are doing their job
the learning rate is in the correct order of magnitude
As is pointed out by Andrej Karpathy in Lecture 5 of CS231n course at Stanford - "if you can't overfit on a tiny batch size, things are definitely broken".
This means, given your description, that your implementation is incorrect. I would start by checking each of those three points listed above. For example, alter your test somehow by picking several different images or a btach-size of 5 images instead of one. You could also revise your predict function, as that is where there is definitely some discrepancy, given you are getting zero error during training (and so validation?).
I know you're supposed to separate your training data from your testing data, but when you make predictions with your model is it OK to use the entire data set?
I assume separating your training and testing data is valuable for assessing the accuracy and prediction strength of different models, but once you've chosen a model I can't think of any downsides to using the full data set for predictions.
You can use full data for prediction but better retain indexes of train and test data. Here are pros and cons of it:
Pro:
If you retain index of rows belonging to train and test data then you just need to predict once (and so time saving) to get all results. You can calculate performance indicators (R2/MAE/AUC/F1/precision/recall etc.) for train and test data separately after subsetting actual and predicted value using train and test set indexes.
Cons:
If you calculate performance indicator for entire data set (not clearly differentiating train and test using indexes) then you will have overly optimistic estimates. This happens because (having trained on train data) model gives good results of train data. Which depending of % split of train and test, will gives illusionary good performance indicator values.
Processing large test data at once may create memory bulge which is can result in crash in all-objects-in-memory languages like R.
In general, you're right - when you've finished selecting your model and tuning the parameters, you should use all of your data to actually build the model (exception below).
The reason for dividing data into train and test is that, without out-of-bag samples, high-variance algorithms will do better than low-variance ones, almost by definition. Consequently, it's necessary to split data into train and test parts for questions such as:
deciding whether kernel-SVR is better or worse than linear regression, for your data
tuning the parameters of kernel-SVR
However, once these questions are determined, then, in general, as long as your data is generated by the same process, the better predictions will be, and you should use all of it.
An exception is the case where the data is, say, non-stationary. Suppose you're training for the stock market, and you have data from 10 years ago. It is unclear that the process hasn't changed in the meantime. You might be harming your prediction, by including more data, in this case.
Yes, there are techniques for doing this, e.g. k-fold cross-validation:
One of the main reasons for using cross-validation instead of using the conventional validation (e.g. partitioning the data set into two sets of 70% for training and 30% for test) is that there is not enough data available to partition it into separate training and test sets without losing significant modelling or testing capability. In these cases, a fair way to properly estimate model prediction performance is to use cross-validation as a powerful general technique.
That said, there may not be a good reason for doing so if you have plenty of data, because it means that the model you're using hasn't actually been tested on real data. You're inferring that it probably will perform well, since models trained using the same methods on less data also performed well. That's not always a safe assumption. Machine learning algorithms can be sensitive in ways you wouldn't expect a priori. Unless you're very starved for data, there's really no reason for it.
Can someone recommend what is the best percent of divided the training data and testing data in Machine learning. What are the disadvantages if i split training and test data in 30-70 percentage ?
There is no one "right way" to split your data unfortunately, people use different values which are chosen based on different heuristics, gut feeling and personal experience/preference. A good starting point is the Pareto principle (80-20).
Sometimes using a simple split isn't an option as you might simply have too much data - in that case you might need to sample your data or use smaller testing sets if your algorithm is computationally complex. An important part is to randomly choose your data. The trade-off is pretty simple: less testing data = the performance of your algorithm will have bigger variance. Less training data = parameter estimates will have bigger variance.
For me personally more important than the size of the split is that you obviously shouldn't always perform your tests only once on the same test split as it might be biased (you might be lucky or unlucky with your split). That's why you should do tests for several configurations (for instance you run your tests X times each time choosing a different 20% for testing). Here you can have problems with the so called model variance - different splits will result in different values. That's why you should run tests several times and average out the results.
With the above method you might find it troublesome to test all the possible splits. A well established method of splitting data is the so called cross validation and as you can read in the wiki article there are several types of it (both exhaustive and non-exhaustive). Pay extra attention to the k-fold cross validation.
Read up on the various strategies for cross-validation.
A 10%-90% split is popular, as it arises from 10x cross-validation.
But you could do 3x or 4x cross validation, too. (33-67 or 25-75)
Much larger errors arise from:
having duplicates in both test and train
unbalanced data
Make sure to first merge all duplicates, and do stratified splits if you have unbalanced data.