I am studying the transformer model with this tutorial and I am confused about the difference between evaluation and inference. In my understanding, evaluation happens after the model is trained, by only giving it the source and asking it to predict the target one token at a time (in the seq2seq problem).
However, in the tutorial evaluation is done in the same way as training, which is computing the loss from a forward pass through the model, given both source and target. And the inference step is more similar to what I understand as evaluation. In this case, I tried the model and it does really well in evaluation and testing, but at the inference step I found that it can't output anything meaningful.
Can anyone explain to me the difference between evaluation and inference?
Evaluation and inference are indeed two different subjects, but the former can be performed only by means of the latter. Practically, you repeatedly make inferences in order to perform the evaluation.
At this specific step:
train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
valid_loss = evaluate(model, valid_iterator, criterion)
After each epoch on the training set, we need to evaluate/make an inference on each sample from the validation set, in order to check whether we have overfitting, underfitting, or other phenomena. This happens in any robust machine learning process (or at least it should).
If you have problems when testing, it can be due to two different reasons:
You do not feed the input at test time the same way you fed it during training and validation.
The test set is statistically so different from the training and validation sets that your model does not behave well when subjected to such inputs.
Also, note that evaluation and inference are not specific to seq2seq, nor are they different in other machine learning problems as compared to seq2seq; the same concepts of evaluation and inference apply to all machine learning tasks.
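To make the distinction concrete, here is a minimal sketch of the two steps, assuming a PyTorch encoder-decoder that exposes a model(src, trg) interface with batch-first tensors and the usual <sos>/<eos> special tokens (these interface details are assumptions, not the tutorial's exact code):

```python
import torch

# Assumed interface: model(src, trg) returns logits of shape [batch, trg_len, vocab_size];
# sos_idx / eos_idx are the assumed special-token indices.

def evaluate(model, iterator, criterion):
    """Validation: teacher forcing, exactly like training but without gradient updates."""
    model.eval()
    total_loss = 0.0
    with torch.no_grad():
        for src, trg in iterator:
            logits = model(src, trg[:, :-1])          # feed the gold target shifted right
            loss = criterion(
                logits.reshape(-1, logits.size(-1)),  # [batch * len, vocab]
                trg[:, 1:].reshape(-1),               # predict the next gold token
            )
            total_loss += loss.item()
    return total_loss / len(iterator)

def greedy_decode(model, src, sos_idx, eos_idx, max_len=50):
    """Inference: no gold target is available, so feed the model its own predictions."""
    model.eval()
    ys = torch.full((src.size(0), 1), sos_idx, dtype=torch.long)
    with torch.no_grad():
        for _ in range(max_len):
            logits = model(src, ys)                   # re-run on everything decoded so far
            next_tok = logits[:, -1].argmax(dim=-1, keepdim=True)
            ys = torch.cat([ys, next_tok], dim=1)
            if (next_tok == eos_idx).all():           # stop once every sequence emitted <eos>
                break
    return ys
```

Validation computes the loss with the gold target fed in (teacher forcing), exactly like training; inference must feed the model its own previous predictions. That is why a model can show a good validation loss and still decode garbage if, for example, the <sos> token or the tokenization used at decoding time differs from what was used during training.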
I'm studying the difference between GLM models (OLS, Logistic Regression, Zero Inflated, etc.), which are deterministic, since we can infer the parameters exactly, and some CART models (Random Forest, LightGBM, CatBoost, etc.) that are based on stochastic prediction.
What I've heard is that for stochastic models we should split into train and test sets to avoid over-fitting, something that supposedly does not happen with deterministic models, because they use Linear Programming to find the best parameters.
I'd like to start some discussion about it.
My opinion is that it's true. Deterministic models are just solved equations, so they should not over-fit the data at all, unlike stochastic models, which rely on randomness to make predictions.
But what I found was every course saying to split every dataset, regardless of whether the model is deterministic or not.
There is confusion over multiple concepts in your question.
Should one use train/test set splits for deterministic models? If you are training a model for prediction, absolutely! The important thing to remember is that a prediction model needs to generalize to data other than the one used for training. This is evaluated using the test set. Even if a model is being learned simply as a means to explore the data, this is still recommended as a way to verify that one isn't just overfitting to the noise.
The second point of confusion is that splitting into train and test sets avoids overfitting. This is not true per se. The separation is there so that one can use the test set to verify whether the model is overfitting. If the performance on the train and test sets differs "dramatically", then the model is likely overfitting and needs to be simplified, regularized, or otherwise constrained somehow.
The other point pertains to what constitutes a stochastic model. All of the CART models that you mention are actually deterministic in the sense that, once you train them, they always yield exactly the same output for the same input. The stochasticity that you may have been referring to is that the training uses random initializations, which may result in quite different final models. If this is a concern (because of local optima, for example), then use multiple initializations (a.k.a. multiple restarts, or Monte Carlo runs) to resolve it.
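A quick scikit-learn sketch of that distinction (the dataset and seeds are arbitrary): the randomness lives in training, not in prediction.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, random_state=0)

rf_a = RandomForestClassifier(random_state=1).fit(X, y)
rf_b = RandomForestClassifier(random_state=2).fit(X, y)

# Once trained, each forest is deterministic: repeated predictions agree exactly.
assert (rf_a.predict(X) == rf_a.predict(X)).all()

# The randomness is in training: different seeds can produce different models.
print("fraction of points where the two forests disagree:",
      (rf_a.predict(X) != rf_b.predict(X)).mean())
```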
Finally, you mentioned that deterministic models don't need this split because they cannot overfit. This is not true. Consider an SVM classifier with a Gaussian kernel of sufficiently small bandwidth. If solved to optimality, the training is deterministic and will most assuredly overfit the training data.
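A small scikit-learn sketch of that last point (the dataset and the gamma/C values are chosen purely to exaggerate the effect): the RBF-kernel SVM is trained deterministically, yet the gap between train and test accuracy shows it has overfit.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Noisy labels (flip_y) give the model something to memorize.
X, y = make_classification(n_samples=500, n_features=20, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Very small bandwidth (large gamma): each training point is essentially isolated.
svm = SVC(kernel="rbf", gamma=100.0, C=1e6).fit(X_tr, y_tr)

print("train accuracy:", svm.score(X_tr, y_tr))   # close to 1.0: memorizes the noise
print("test accuracy:", svm.score(X_te, y_te))    # much lower: does not generalize
```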
So recently I've been following the tutorial at https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html and I came up with the following question: is there a training/validation split happening internally?
The thing is, in this tutorial, the main dataset is split into training and testing. Here, the training set is used for training and the testing set is used in the evaluate() function.
To my knowledge, when dealing with neural networks the data is usually split into 3 sets: training, validation and testing. In this tutorial though, it is only split into training and testing. From what I know, usually the model is trained and then evaluated, and the weights are then updated according to what was learnt in the evaluation step. However, I can't seem to find any connection between the evaluate function and training. Therefore, in this example the model is being evaluated AND tested using the same dataset.
Is there something here that I might be missing? Is there an internal split of the training dataset happening during training (into training and validation) and the function evaluate() is simply used for testing the performance of the model?
```python
for epoch in range(num_epochs):
    # train for one epoch, printing every 10 iterations
    train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq=10)
    # update the learning rate
    lr_scheduler.step()
    # evaluate on the test dataset
    evaluate(model, data_loader_test, device=device)
```
Is there a training/validation split happening internally?
Is there an internal split of the training dataset happening during training (into training and validation) and the function evaluate() is simply used for testing the performance of the model?
No, you are not missing anything. What you see is exactly what's being done there. There is no internal splitting happening. It's just an example to show how something is done in PyTorch without cluttering it unnecessarily.
Some datasets, such as CIFAR10/CIFAR100, come only with a train/test split, and it has usually been the norm in examples to just train and then evaluate on the test set. However, nothing stops you from splitting the training set however you like; it's up to you. In such tutorials they just tried to keep everything as simple as possible.
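For instance, a carve-out like the following would give the tutorial's loop a proper validation loader (a sketch only: `dataset` stands for the tutorial's training dataset, and whatever batch_size/collate_fn the tutorial uses should be passed through unchanged via the keyword arguments):

```python
import torch
from torch.utils.data import DataLoader, random_split

def make_train_val_loaders(dataset, val_fraction=0.2, seed=42, **loader_kwargs):
    """Carve a validation set out of an existing training dataset."""
    n_val = int(val_fraction * len(dataset))                 # e.g. hold out 20%
    train_ds, val_ds = random_split(
        dataset, [len(dataset) - n_val, n_val],
        generator=torch.Generator().manual_seed(seed),       # reproducible split
    )
    train_loader = DataLoader(train_ds, shuffle=True, **loader_kwargs)
    val_loader = DataLoader(val_ds, shuffle=False, **loader_kwargs)
    return train_loader, val_loader
```

Inside the epoch loop you would then call evaluate(model, data_loader_val, device=device) after each epoch, and keep data_loader_test for a single final evaluation.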
When developing a neural net one typically partitions training data into Train, Test, and Holdout datasets (many people call these Train, Validation, and Test respectively. Same things, different names). Many people advise selecting hyperparameters based on performance in the Test dataset. My question is: why? Why not maximize performance of hyperparameters in the Train dataset, and stop training the hyperparameters when we detect overfitting via a drop in performance in the Test dataset? Since Train is typically larger than Test, would this not produce better results compared to training hyperparameters on the Test dataset?
UPDATE July 6 2016
Terminology change, to match the comment below. Datasets are now termed Train, Validation, and Test in this post. I do not use the Test dataset for training. I am using a GA to optimize hyperparameters. At each iteration of the outer GA training process, the GA chooses a new hyperparameter set, trains on the Train dataset, and evaluates on the Validation and Test datasets. The GA adjusts the hyperparameters to maximize accuracy in the Train dataset. Network training within an iteration stops when network overfitting is detected (in the Validation dataset), and the outer GA training process stops when overfitting of the hyperparameters is detected (again in Validation). The result is hyperparameters pseudo-optimized for the Train dataset. The question is: why do many sources (e.g. https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf, Section B.1) recommend optimizing the hyperparameters on the Validation set, rather than the Train set? Quoting from Srivastava, Hinton, et al. (link above): "Hyperparameters were tuned on the validation set such that the best validation error was produced..."
The reason is that developing a model always involves tuning its configuration: for example, choosing the number of layers or the size of the layers (called the hyper-parameters of the model, to distinguish them from the parameters, which are the network’s weights). You do this tuning by using as a feedback signal the performance of the model on the validation data. In essence, this tuning is a form of learning: a search for a good configuration in some parameter space. As a result, tuning the configuration of the model based on its performance on the validation set can quickly result in overfitting to the validation set, even though your model is never directly trained on it.
Central to this phenomenon is the notion of information leaks. Every time you tune a hyperparameter of your model based on the model’s performance on the validation set, some information about the validation data leaks into the model. If you do this only once, for one parameter, then very few bits of information will leak, and your validation set will remain reliable to evaluate the model. But if you repeat this many times—running one experiment, evaluating on the validation set, and modifying your model as a result—then you’ll leak an increasingly significant amount of information about the validation set into the model.
At the end of the day, you’ll end up with a model that performs artificially well on the validation data, because that’s what you optimized it for. You care about performance on completely new data, not the validation data, so you need to use a completely different, never-before-seen dataset to evaluate the model: the test dataset. Your model shouldn’t have had access to any information about the test set, even indirectly. If anything about the model has been tuned based on test set performance, then your measure of generalization will be flawed.
There are two things you are missing here. The first, minor, one is that the test set is never used to do any training. That is the purpose of the validation set (the test set is just to assess your final, testing performance). The major misunderstanding is what it means "to use the validation set to fit hyperparameters". This means exactly what you describe: you train a model with given hyperparameters on the training set, and use validation simply to check whether you are overfitting (you use it to estimate generalization), but you do not really "train" on it; you simply check your scores on this subset (which, as you noticed, is much smaller).
You cannot "stop training hyperparameters" because this is not a continuous process; usually hyperparameters are just "possible sets of values", and you have to simply test lots of them. There is no valid way of defining a direct training procedure between the actual metric you are interested in (like accuracy) and the hyperparameters (like the size of the hidden layer in a NN, or even the C parameter in an SVM), as the functional link between the two is not differentiable, is highly non-convex, and is in general "ugly" to optimize. If you can define a nice optimization procedure in terms of a hyperparameter, then it is usually not called a hyperparameter but a parameter; the crucial distinction in this naming convention is what makes it hard to optimize directly. We call a hyperparameter a parameter that cannot be directly optimized against, so you need a "meta method" (like simply testing on a validation set) to select it.
However, you can define a "nice" meta-optimization protocol for hyperparameters, but this will still use the validation set as an estimator. For example, Bayesian optimization of hyperparameters does exactly this: it tries to fit a function describing how well your model behaves in the space of hyperparameters, but in order to have any "training data" for this meta-method, you need the validation set to estimate it for any given set of hyperparameters (the input to your meta-method).
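As a concrete, minimal sketch of the hold-out version of this (scikit-learn, an MLP, and a tiny grid of hidden sizes are used purely for illustration; a Bayesian optimizer would only change how the candidate values are proposed):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)

best_score, best_h = -1.0, None
for hidden in [8, 32, 128]:                      # the "possible set of values" for the hyperparameter
    model = MLPClassifier(hidden_layer_sizes=(hidden,), max_iter=500, random_state=0)
    model.fit(X_train, y_train)                  # weights are fit on the training set only
    score = model.score(X_val, y_val)            # the validation set only ranks the candidates
    if score > best_score:
        best_score, best_h = score, hidden

final = MLPClassifier(hidden_layer_sizes=(best_h,), max_iter=500, random_state=0)
final.fit(X_trainval, y_trainval)                # refit with the chosen hyperparameter
print("test accuracy:", final.score(X_test, y_test))   # the test set is touched exactly once
```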
simple answer: we do
In the case of a simple feedforward neural network you do have to select, for example, the number of layers and the number of units per layer, and the regularization (plus non-continuous choices like the topology, if the network is not feedforward, and the loss function) at the beginning, and you would optimize over those.
So, in summary you optimize:
ordinary parameters only during training but not during validation
hyperparameters during training and during validation
It is very important not to touch the many ordinary parameters (weights and biases) during validation. That's because there are thousands of degrees of freedom in them, which means they can learn the data you train them on. But then the model doesn't generalize as well to new data (even when that new data originated from the same distribution). You usually have only very few degrees of freedom in the hyperparameters, which usually control the rigidity of the model (regularization).
This holds true for other machine learning algorithms like decision trees, forests, etc as well.
I'm currently wondering when to stop training of Deep Autoencoders, especially when it seems to be stuck in a local minimum.
Is it essential to drive the training criterion (e.g. MSE) down to e.g. 0.000001 and force it to perfectly reconstruct the input, or is it okay to keep some difference (e.g. stop when the MSE is at about 0.5), depending on the dataset used?
I know that a better reconstruction might lead to better classification results afterwards but is there a "rule of thumb" when to stop? I'm especially interested in rules that have no heuristic character like "if the MSE doesn't get smaller in x iterations".
I don't think it's possible to derive a general rule of thumb for this, as training NNs / machine learning models is a very problem-specific procedure and, generally, there is no free lunch. How to decide what is a "good" training error to terminate at depends on various problem-specific factors, e.g. the noise in the data. Evaluating your NN only with regard to the training set, with the sole objective of minimising the MSE, will often lead to overfitting. With only the training error as feedback, you might tune your NN to the noise in the training data (hence the overfitting). One method to avoid this is holdout validation. Instead of only training your NN on the given data, you divide your data set into a training set, a validation set (and a test set).
Training set: used for training and as feedback to the NN; its error will naturally keep decreasing with longer training (at least down to "OK" MSE values for the specific problem).
Validation set: evaluate your NN with respect to it, but don't feed the result back to your NN/genetic algorithm.
Along with the evaluation feedback from your training set, you should hence also evaluate on the validation set, however without giving feedback to your neural network (NN).
Track the decrease in MSE for the training as well as the validation set; generally the training error will steadily decrease, whereas, at some point, the validation error will reach a minimum and start to increase with further training. Of course, you cannot know during runtime where this minimum occurs, so generally one stores the NN with the lowest validation error, and after this has seemingly not been updated for some time (i.e., in retrospect: we've passed a minimum in validation error), the algorithm is terminated.
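A minimal sketch of that early-stopping loop, assuming a PyTorch nn.Module (train_step and val_mse are hypothetical callables standing in for your own one-epoch training routine and validation-set evaluation; the patience value is arbitrary):

```python
import copy

def train_with_early_stopping(model, train_step, val_mse, max_epochs=1000, patience=20):
    """Stop once the validation MSE has not improved for `patience` epochs,
    then restore the weights that achieved the lowest validation MSE."""
    best_mse, best_state, epochs_since_best = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_step()                             # one epoch of training on the training set
        mse = val_mse()                          # evaluate on the validation set (no feedback)
        if mse < best_mse:                       # new best validation error: remember this model
            best_mse = mse
            best_state = copy.deepcopy(model.state_dict())
            epochs_since_best = 0
        else:
            epochs_since_best += 1
        if epochs_since_best >= patience:        # in retrospect we passed the minimum: stop
            break
    model.load_state_dict(best_state)            # restore the weights with the lowest val error
    return best_mse
```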
See e.g. the following article Neural Network: Train-validate-Test Stopping for details, as well as this SE-statistics thread discussing two different validation methods.
For the training/validation of Deep Autoencoders/Deep Learning, specifically w.r.t. overfitting, I find the article Dropout: A Simple Way to Prevent Neural Networks from Overfitting (*) to be valuable.
(*) By N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, University of Toronto.
I've been studying neural networks for a bit and recently learned about the dropout training algorithm. There are excellent papers out there to understand how it works, including the ones from the authors.
So I built a neural network with dropout training (it was fairly easy) but I'm a bit confused about how to perform model selection. From what I understand, it looks like dropout is a method to be used when training the final model obtained through model selection.
As for the test part, papers always talk about using the complete network with halved weights, but they do not mention how to use it in the training/validation part (at least the ones I read).
I was thinking about using the network without dropout for the model selection part. Say that makes me find that the net performs well with N neurons. Then, for the final training (the one I use to train the network for the test part) I use 2N neurons with dropout probability p=0.5. That ensures I have exactly N neurons active on average, thus using the network at the right capacity most of the time.
Is this a correct approach?
By the way, I'm aware of the fact that dropout might not be the best choice with small datasets. The project I'm working on has academic purposes, so it's not really needed that I use the best model for the data, as long as I stick with machine learning good practices.
First of all, model selection and the training of a particular model are completely different issues. For model selection, you usually need a data set that is completely independent of both the training set used to build the model and the test set used to estimate its performance. So if you're doing, for example, a cross-validation, you would need an inner cross-validation (to train the models and estimate their performance in general) and an outer cross-validation to do the model selection.
To see why, consider the following thought experiment (shamelessly stolen from this paper). You have a model that makes a completely random prediction. It has a number of parameters that you can set, but they have no effect. If you try different parameter settings long enough, you'll eventually get a model that performs better than all the others simply because you're sampling from a random distribution. If you're using the same data for all of these models, this is the model you will choose. If you have a separate test set, it will quickly tell you that there is no real effect, because the parameter setting that achieved good results during the model-building phase does not perform any better on the separate set.
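That thought experiment is easy to simulate (a self-contained sketch with purely random labels and purely random "models"; the sizes and the number of tries are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
y_val = rng.integers(0, 2, size=200)       # validation labels: pure noise
y_test = rng.integers(0, 2, size=200)      # test labels: pure noise

best_val, best_test = 0.0, 0.0
for _ in range(1000):                      # "tune" by trying 1000 useless models
    preds_val = rng.integers(0, 2, size=200)
    preds_test = rng.integers(0, 2, size=200)
    acc_val = (preds_val == y_val).mean()
    if acc_val > best_val:                 # select on validation accuracy only
        best_val = acc_val
        best_test = (preds_test == y_test).mean()

print("validation accuracy of chosen model:", best_val)   # noticeably above 0.5
print("test accuracy of the same model:    ", best_test)  # still about 0.5
```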
Now, back to neural networks with dropout. You didn't refer to any particular paper; I'm assuming that you mean Srivastava et al., "Dropout: A Simple Way to Prevent Neural Networks from Overfitting". I'm not an expert on the subject, but the method to me seems to be similar to what's used in random forests or bagging: mitigating the flaws an individual learner may exhibit by applying it repeatedly in slightly different contexts. If I understood the method correctly, essentially what you end up with is an average over several possible models, very similar to random forests.
This is a way to make an individual model better, but not for model selection. The dropout is a way of adjusting the learned weights for a single neural network model.
To do model selection on this, you would need to train and test neural networks with different parameters and then evaluate those on completely different sets of data, as described in the paper I've referenced above.
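For completeness, a scikit-learn sketch of such a two-level setup (the MLP and its tiny parameter grid are illustrative only, not anything from the dropout paper): GridSearchCV runs a cross-validation that trains and scores each candidate configuration and picks the best one, while cross_val_score wraps the whole selection procedure in a second cross-validation, so that it is always evaluated on data it never saw during selection.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Cross-validated search over the candidate configurations (the model selection step).
search = GridSearchCV(
    MLPClassifier(max_iter=500, random_state=0),
    param_grid={"hidden_layer_sizes": [(16,), (64,)], "alpha": [1e-4, 1e-2]},
    cv=3,
)

# A second cross-validation around the whole selection procedure gives an
# estimate of its performance on data never used for the selection itself.
outer_scores = cross_val_score(search, X, y, cv=5)
print("nested CV accuracy: %.3f +/- %.3f" % (outer_scores.mean(), outer_scores.std()))
```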