Monitoring val_loss while training - machine-learning

I have a simple question that has made me doubt my work all of a sudden.
If I only have a training set and a validation set, am I allowed to monitor val_loss while training, or does that add bias to my training? I want to test my accuracy on the validation set at the end of training, but now I am wondering: if I am monitoring that dataset while training, would that be a problem?

Short answer: yes, monitoring the validation error and using it as the basis for decisions about the specific setup of your algorithm adds bias. To elaborate a bit:
1) You fix the hyperparameters of your ML algorithm and then train it on the training set. The resulting model, with that specific hyperparameter setup, fits (and partly overfits) the training set, and you use the validation set to estimate the performance you can expect with these hyperparameters on unseen data.
2) But you obviously want to adjust your hyperparameters to get the best performance. You may be doing a grid search or something similar to find the best hyperparameter settings for this algorithm using the validation set. As a result, your hyperparameter settings overfit the validation set. Think of it as some information about the validation set leaking into your model through the hyperparameters.
3) As a result you should do the following: split the dataset into a training set, a validation set and a test set. Use the training set for training and the validation set to make decisions about hyperparameters. When you are done (fully done!) fine-tuning your model, use the test set, which the model has never seen, to estimate the final performance in production (see the sketch below).
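
To make point 3 concrete, here is a minimal sketch of that workflow, assuming scikit-learn; the synthetic dataset, the candidate C values and the 60/20/20 split are illustrative, not part of the original answer.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)

# 60/20/20 split: carve off the test set first, then split the remainder.
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)

best_C, best_score = None, -1.0
for C in [0.01, 0.1, 1.0, 10.0]:                              # candidate hyperparameters
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    score = model.score(X_val, y_val)                         # decisions use the validation set only
    if score > best_score:
        best_C, best_score = C, score

final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X_train, y_train)
print("test accuracy:", final_model.score(X_test, y_test))   # the test set is touched exactly once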

Related

Hyper-parameter Tuning for a machine learning model

Why a hyper-parameter like regularization parameter (a real number) cannot be trained over training data along with model parameters? What will go wrong?
This is generally done to prevent overfitting. Model parameters are trained using the training set. Hyper-parameter tuning is done using a validation set that is (ideally) completely independent of the training data. The final performance should be evaluated on a test set. Typical splits are 80/10/10 or 60/20/20.
If you tune your hyperparameters on the training set, you will very likely overfit heavily and suffer a performance hit on the test set.
Try it out! See the difference in performance on your test set when you do hyper-parameter tuning on the training set, vs on a separate validation set.
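One way to run that experiment, as a hedged sketch assuming scikit-learn: a decision tree's max_depth is picked once by training-set score and once by validation-set score, and the two choices are compared on the test set. The model, the depth grid and the synthetic data are illustrative stand-ins.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1500, n_informative=5, random_state=1)
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=1)

depths = range(1, 20)
models = {d: DecisionTreeClassifier(max_depth=d, random_state=1).fit(X_train, y_train) for d in depths}

best_by_train = max(depths, key=lambda d: models[d].score(X_train, y_train))  # favours the deepest tree
best_by_val = max(depths, key=lambda d: models[d].score(X_val, y_val))        # favours what generalizes

print("depth tuned on train:", best_by_train, "-> test accuracy:", models[best_by_train].score(X_test, y_test))
print("depth tuned on val:  ", best_by_val, "-> test accuracy:", models[best_by_val].score(X_test, y_test))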

How does overfitting in machine learning work

How are noise in the data, target complexity, and the size of the training set related to over-fitting?
I am guessing that you are a beginner. Suppose you have a dataset with lots of features (as in columns). You create a model and evaluate it on your training and test datasets, and you notice that it gives you an accuracy of 100 percent on the training set but only 60-70 percent on the test set. This is an example of overfitting. It happens because you have included many features that are not related to predicting the outcome.
You can reduce it by dropping those irrelevant columns (which act as noise) and by applying K-fold cross-validation on your data; a minimal sketch follows below.
This video might help you get a better understanding:
https://www.youtube.com/watch?v=Anq4PgdASsc
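
For reference, a minimal K-fold cross-validation sketch, assuming scikit-learn; the synthetic dataset and the logistic regression model are purely illustrative.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# 50 features but only 5 informative ones, mimicking a dataset with noisy columns.
X, y = make_classification(n_samples=1000, n_features=50, n_informative=5, random_state=0)

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)  # 5-fold cross-validation
print("fold accuracies:", scores)
print("mean accuracy:  ", scores.mean())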

Train, dev set and test set advice for implementation and hyperparameter tuning

I have some doubts about the implementation and tuning of parameters and hyperparameters when using the classic train, validation and test sets. It would be of great help if somebody could clarify these concepts for me and give me some hints on implementing them in a language like Python.
For example, if I have a neural network, as far as I know the parameters (let's consider the number of hidden layers and neurons per layer) could be tuned with the training set. So when it comes to the validation set, which is approximately 20% of the dataset, I can tune my hyperparameters with the following algorithm:
Example: Tuning batch size and learning rate:
hyperListB = []
hyperListL = []
# let's suppose both lists have the same dimensions
values = []
for i in range(len(hyperListB)):
    model = fit(train_set, hyperListB[i], hyperListL[i])
    values.append(evaluate(model, validation_set))  # add the score of each run
plot_loss_functions(values)
# select the best set of hyperparameters
model = fit(test_set, selected_hyperparameters)
evaluate(model)
Would this sequence of steps be correct? I have searched through different pages and did not find anything that could help me with this. Please bear in mind that I do not want to use cross-validation or library-based techniques such as GridSearchCV.
Thanks
In a train/validation/test split, the fit method is called only on the train data.
Validation data is used for hyperparameter tuning: a set of hyperparameters is selected, the model is trained on the train set, and then that model is evaluated on the validation set. This is repeated until all combinations of the different hyperparameters have been exhausted.
The best set of hyperparameters is the one that gave the best result on the validation set. This method is called grid search.
The test set is used to evaluate the model with the best hyperparameters selected. This gives the final unbiased accuracy and loss.
The fit method is never called on the validation or test set.
Your example would then look like:
hyperListB = []
hyperListL = []
# let's suppose both lists have the same dimensions
values = []
for hyperB in hyperListB:
    for hyperL in hyperListL:
        model = fit(train_set, hyperB, hyperL)
        values.append(evaluate(model, validation_set))  # add the score of each run
plot_loss_functions(values)
# select the best set of hyperparameters and refit on the train set
model = fit(train_set, selected_hyperparameters)
evaluate(model, test_set)
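
To turn the corrected pseudocode into something runnable, here is a hedged sketch assuming scikit-learn; SGDClassifier's eta0 (learning rate) and alpha (regularization strength) stand in for the batch-size/learning-rate pair from the question, the data is synthetic, and no GridSearchCV is used.

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)

hyperListL = [0.001, 0.01, 0.1]    # candidate learning rates
hyperListA = [1e-5, 1e-4, 1e-3]    # candidate regularization strengths
results = []
for lr in hyperListL:
    for alpha in hyperListA:
        model = SGDClassifier(learning_rate="constant", eta0=lr, alpha=alpha,
                              random_state=0).fit(X_train, y_train)
        results.append((model.score(X_val, y_val), lr, alpha))   # score on the validation set only

best_score, best_lr, best_alpha = max(results)                    # best validation score wins
best_model = SGDClassifier(learning_rate="constant", eta0=best_lr, alpha=best_alpha,
                           random_state=0).fit(X_train, y_train)  # refit on the train set
print("validation accuracy:", best_score, "test accuracy:", best_model.score(X_test, y_test))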

How can I voluntarily overfit my model for text classification

I would like to show an example of a model that overfits the test set and does not generalize well to future data.
I split the news dataset into 3 sets:
train set length: 11314
test set length: 5500
future set length: 2031
I am using a text dataset and building a CountVectorizer.
I am creating a grid search (without cross-validation); each loop tests some parameters of the vectorizer ('min_df', 'max_df') and some parameters of my LogisticRegression model ('C', 'fit_intercept', 'tol', ...).
The best result I get is:
({'binary': False, 'max_df': 1.0, 'min_df': 1},
{'C': 0.1, 'fit_intercept': True, 'tol': 0.0001},
test set score: 0.64018181818181819,
training set score: 0.92902598550468451)
But now if I run it on the future set, I get a score similar to the test set score:
clf.score(X_future, y_future): 0.6509108813392418
How can I demonstrate a case where I overfitted my test set so it does not generalize well to future data?
You have a model trained on some data "train set".
Performing a classification task on these data you get a score of 92%.
Then you take new data, not seen during the training, such as "test set" or "future set".
Performing a classification task on either of these unseen datasets, you get a score of 65%.
This is exactly the definition of a model which is overfitting: it has a very high variance, a big difference in the performance between seen and unseen data.
By the way, taking into account your specific case, some parameter choices which could cause overfitting are the following (both are illustrated in the sketch after this list):
min_df = 0
high C value for logistic regression (which means low regularization)
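
As a hedged illustration of those two choices, assuming scikit-learn and its bundled 20 newsgroups data as a stand-in for the question's news dataset: keeping every term (min_df=1) and using a very large C (weak regularization) pushes the model toward the kind of train/test gap described above.

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

clf = make_pipeline(
    CountVectorizer(min_df=1, max_df=1.0, binary=False),  # keep all terms, even one-off noise
    LogisticRegression(C=1000.0, max_iter=1000),          # high C means low regularization
)
clf.fit(train.data, train.target)
print("train accuracy:", clf.score(train.data, train.target))
print("test accuracy: ", clf.score(test.data, test.target))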
I wrote a comment on alsora's answer but I think I really should expand on it as an actual answer.
As I said, there is no way to "over-fit" the test set because over-fit implies something negative. A theoretical model that fits the test set at 92% but fits the training set to only 65% is a very good model indeed (assuming your sets are balanced).
I think what you are referring to as your "test set" might actually be a validation set, and your "future set" is actually the test set. Let's clarify.
You have a set of 18,845 examples. You divide them into 3 sets.
Training set: The examples the model gets to look at and learn off of. Every time your model makes a guess from this set you tell it whether it was right or wrong and it adjusts accordingly.
Validation set: After every epoch (one pass through the training set), you check the model on these examples, which it has never seen before. You compare the training loss and accuracy to the validation loss and accuracy. If training accuracy is clearly higher than validation accuracy (or training loss clearly lower than validation loss), your model is over-fitting and training should stop. You can either stop early (early stopping) or add dropout; a minimal early-stopping sketch appears at the end of this answer. You should not give feedback to your model based on examples from the validation set. As long as you follow the above rule and your validation set is well mixed, you can't over-fit this data.
Testing set: Used to assess the accuracy of your model once training has completed. This is the one that matters, because it's based on examples your model has never seen before. Again, you can't over-fit this data.
Of your 18,845 examples you have 11,314 in the training set, 5,500 in the validation set, and 2,031 in the testing set.
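
As a hedged sketch of the validation-monitoring and early-stopping advice above, assuming TensorFlow/Keras; the tiny synthetic dataset and network below are purely illustrative.

import numpy as np
import tensorflow as tf

# Synthetic stand-in data, pre-split into train / validation / test.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20)).astype("float32")
y = (X[:, 0] > 0).astype("float32")
X_train, X_val, X_test = X[:600], X[600:800], X[800:]
y_train, y_val, y_test = y[:600], y[600:800], y[800:]

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Stop when val_loss stops improving, and roll back to the best weights seen so far.
stopper = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3, restore_best_weights=True)
model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=100, callbacks=[stopper])

# The test set is only touched once, after training has completed.
model.evaluate(X_test, y_test)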

Why not optimize hyperparameters on train dataset?

When developing a neural net one typically partitions training data into Train, Test, and Holdout datasets (many people call these Train, Validation, and Test respectively. Same things, different names). Many people advise selecting hyperparameters based on performance in the Test dataset. My question is: why? Why not maximize performance of hyperparameters in the Train dataset, and stop training the hyperparameters when we detect overfitting via a drop in performance in the Test dataset? Since Train is typically larger than Test, would this not produce better results compared to training hyperparameters on the Test dataset?
UPDATE July 6 2016
Terminology change, to match the comment below. Datasets are now termed Train, Validation, and Test in this post. I do not use the Test dataset for training. I am using a GA to optimize hyperparameters. At each iteration of the outer GA training process, the GA chooses a new hyperparameter set, trains on the Train dataset, and evaluates on the Validation and Test datasets. The GA adjusts the hyperparameters to maximize accuracy on the Train dataset. Network training within an iteration stops when network overfitting is detected (on the Validation dataset), and the outer GA training process stops when overfitting of the hyperparameters is detected (again on Validation). The result is hyperparameters pseudo-optimized for the Train dataset. The question is: why do many sources (e.g. https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf, Section B.1) recommend optimizing the hyperparameters on the Validation set, rather than the Train set? Quoting from Srivastava, Hinton, et al. (link above): "Hyperparameters were tuned on the validation set such that the best validation error was produced..."
The reason is that developing a model always involves tuning its configuration: for example, choosing the number of layers or the size of the layers (called the hyper-parameters of the model, to distinguish them from the parameters, which are the network’s weights). You do this tuning by using as a feedback signal the performance of the model on the validation data. In essence, this tuning is a form of learning: a search for a good configuration in some parameter space. As a result, tuning the configuration of the model based on its performance on the validation set can quickly result in overfitting to the validation set, even though your model is never directly trained on it.
Central to this phenomenon is the notion of information leaks. Every time you tune a hyperparameter of your model based on the model’s performance on the validation set, some information about the validation data leaks into the model. If you do this only once, for one parameter, then very few bits of information will leak, and your validation set will remain reliable to evaluate the model. But if you repeat this many times—running one experiment, evaluating on the validation set, and modifying your model as a result—then you’ll leak an increasingly significant amount of information about the validation set into the model.
At the end of the day, you’ll end up with a model that performs artificially well on the validation data, because that’s what you optimized it for. You care about performance on completely new data, not the validation data, so you need to use a completely different, never-before-seen dataset to evaluate the model: the test dataset. Your model shouldn’t have had access to any information about the test set, even indirectly. If anything about the model has been tuned based on test set performance, then your measure of generalization will be flawed.
There are two things you are missing here. The first, minor one is that the test set is never used to do any training; that is the purpose of validation (the test set is just to assess your final performance). The major misunderstanding is about what it means "to use the validation set to fit hyperparameters". This means exactly what you describe: you train a model with given hyperparameters on the training set, and use validation simply to check whether you are overfitting (you use it to estimate generalization). You do not really "train" on it; you simply check your scores on this subset (which, as you noticed, is much smaller).
You cannot "stop training hyperparameters" because this is not a continuous process. Usually hyperparameters are just "possible sets of values", and you simply have to test lots of them: there is no valid way of defining a direct training procedure between the actual metric you are interested in (like accuracy) and the hyperparameters (like the size of a hidden layer in a NN or even the C parameter in an SVM), since the functional link between the two is not differentiable, is highly non-convex and is in general "ugly" to optimize. If you can define a nice optimization procedure in terms of a hyperparameter, then it is usually not called a hyperparameter but a parameter; the crucial distinction in this naming convention is precisely what makes it hard to optimize directly. We call something a hyperparameter when it cannot be directly optimized against, so you need a "meta-method" (like simply testing on the validation set) to select it.
However, you can define a "nice" meta-optimization protocol for hyperparameters, but it will still use the validation set as an estimator. For example, Bayesian optimization of hyperparameters does exactly this: it tries to fit a function describing how well your model behaves in the space of hyperparameters, but in order to have any "training data" for this meta-method, you need the validation set to estimate it for any given set of hyperparameters (the input to your meta-method).
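
A hedged sketch of such a meta-optimization, assuming the Optuna library; the synthetic dataset, the model and the search range for C are illustrative only. The validation score is the only feedback the search ever sees.

import optuna
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

def objective(trial):
    # The meta-method proposes a hyperparameter; the validation set scores it.
    C = trial.suggest_float("C", 1e-3, 1e2, log=True)
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    return model.score(X_val, y_val)

study = optuna.create_study(direction="maximize")   # Bayesian-style search (TPE sampler by default)
study.optimize(objective, n_trials=25)
print("best hyperparameters:", study.best_params)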
simple answer: we do
In the case of a simple feedforward neural network you do have to select, for example, the number of layers and the number of units per layer, and the regularization (plus non-continuous choices like the topology, if it is not feedforward, and the loss function) at the beginning, and you would optimize over those.
So, in summary you optimize:
ordinary parameters only during training but not during validation
hyperparameters during training and during validation
It is very important not to touch the many ordinary parameters (weights and biases) during validation. That's because there are thousands of degrees of freedom in them which means they can learn the data you train them on. But then the model doesn't generalize to new data as well (even when that new data originated from the same distribution). You usually only have very few degrees of freedom in the hyperparameters which usually control the rigidity of the model (regularization).
This holds true for other machine learning algorithms like decision trees, forests, etc as well.
