Hyper-parameter Tuning for a machine learning model - machine-learning

Why can't a hyper-parameter such as the regularization parameter (a real number) be trained on the training data along with the model parameters? What would go wrong?

Hyper-parameters are tuned separately from the model parameters mainly to prevent overfitting. Model parameters are trained using the training set. Hyper-parameter tuning is done using a validation set that is (ideally) completely independent of the training data. The final performance should be evaluated on a test set. Typical splits are 80/10/10 or 60/20/20.
If you tune your hyper-parameters on the training set, you will very likely overfit badly and suffer a performance hit on the test set.
Try it out! See the difference in performance on your test set when you do hyper-parameter tuning on the training set, vs on a separate validation set.
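One way to try this with nothing beyond NumPy is sketched here. It is only an illustration under assumptions that are not in the question: synthetic data, ridge regression as the model, and hypothetical names (`fit_ridge`, `mse`, `lambdas`). The training error of ridge regression can only grow as the regularization strength grows, so choosing the strength on the training set always picks the smallest candidate (here 0, i.e. no regularization at all).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression problem: few samples relative to the number of features.
n_features = 20
true_w = rng.normal(size=n_features)

def make_data(n):
    X = rng.normal(size=(n, n_features))
    y = X @ true_w + rng.normal(scale=2.0, size=n)
    return X, y

X_train, y_train = make_data(30)
X_val, y_val = make_data(200)
X_test, y_test = make_data(200)

def fit_ridge(X, y, lam):
    """Closed-form ridge regression; the returned weights are the model parameters."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

lambdas = [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]  # candidate regularization strengths
weights = [fit_ridge(X_train, y_train, lam) for lam in lambdas]  # always fit on train

train_errors = [mse(w, X_train, y_train) for w in weights]
val_errors = [mse(w, X_val, y_val) for w in weights]

lam_by_train = lambdas[int(np.argmin(train_errors))]  # tuned on the training set
lam_by_val = lambdas[int(np.argmin(val_errors))]      # tuned on the validation set

for name, lam in [("train", lam_by_train), ("validation", lam_by_val)]:
    w = fit_ridge(X_train, y_train, lam)
    print(f"lambda chosen on {name} set: {lam}, test MSE: {mse(w, X_test, y_test):.3f}")
```

Comparing the two printed test errors shows what each selection strategy costs on data neither the parameters nor the hyper-parameter has seen.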

Related

What is the difference between Holdout dataset vs Validation dataset?

What is the difference between the Holdout dataset and the Validation dataset in Machine Learning context?
The validation dataset is used during training to track the performance of your model on "unseen" data. I put "unseen" in quotes because, although the model doesn't directly see the data in the validation set, you optimize the hyper-parameters to decrease the loss on it (a rising validation loss signals over-fitting). By doing so, however, you may over-fit the hyper-parameters to the validation set, so that the loss is low on that specific validation set but worse on any other unseen set. That's why you usually keep a third set, called the test set (or held-out set), which is your truly unseen data. You test the performance of your model on that test set only once, after training your final model.
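The three sets can be produced by shuffling the sample indices once and cutting them into disjoint parts. This is a minimal sketch using NumPy; the function name `three_way_split` and its default fractions are assumptions chosen for illustration, not a standard API.

```python
import numpy as np

def three_way_split(n_samples, val_frac=0.1, test_frac=0.1, seed=0):
    """Return disjoint index arrays (train, val, test) covering range(n_samples)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)  # shuffle once, then cut into three parts
    n_test = int(round(n_samples * test_frac))
    n_val = int(round(n_samples * val_frac))
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = three_way_split(1000)  # an 80/10/10 split
print(len(train_idx), len(val_idx), len(test_idx))    # 800 100 100
```

The test indices are set aside first and are not touched again until the final evaluation.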

Train, dev, and test set advice for implementation and hyperparameter tuning

I have some doubts about implementing and tuning parameters and hyperparameters with the classic train, validation, and test sets. It would be a great help if somebody could clarify these concepts for me and give me some hints for implementing them in a language like Python.
For example, if I have a neural network, as far as I know the parameters (let's consider the number of hidden layers and neurons per layer) could be tuned with the training set. Then, with the validation set, which is approximately 20% of the dataset, I can tune my hyperparameters with the following algorithm:
Example: Tuning batch size and learning rate:
# fit, evaluate and plot_loss_functions are placeholders
hyperListB = []  # candidate batch sizes
hyperListL = []  # candidate learning rates
# let's suppose both lists have the same length
values = []
for i in range(len(hyperListB)):
    model = fit(train_set, hyperListB[i], hyperListL[i])
    values.append(evaluate(model, validation_set))  # add scores of each run
plot_loss_functions(values)
# select best set of hyperparameters
model = fit(test_set, selected_hyperparameters)
evaluate(model)
Would this sequence of steps be correct? I have searched through different pages and did not find anything that could help me with this. Please bear in mind that I do not want to use cross-validation or other library-based techniques such as GridSearchCV.
Thanks
In a train/validation/test split, the fit method is called only on the training data.
Validation data is used for hyperparameter tuning. A set of hyperparameters is selected and the model is trained on the training set. Then this model is evaluated on the validation set. This is repeated until all combinations of the different hyperparameter values have been exhausted.
The best set of hyperparameters is the one that gave the best result on the validation set. This method is called grid search.
The test set is used to evaluate the model with the best hyperparameters selected. This gives the final unbiased accuracy and loss.
The fit method will never be called on the validation or test set.
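Your example would look like:
# fit, evaluate and plot_loss_functions are placeholders
hyperListB = []  # candidate batch sizes
hyperListL = []  # candidate learning rates
# the two lists may have different lengths; every combination is tried
values = []  # one (validation score, hyperB, hyperL, model) tuple per run
for hyperB in hyperListB:
    for hyperL in hyperListL:
        model = fit(train_set, hyperB, hyperL)
        values.append((evaluate(model, validation_set), hyperB, hyperL, model))
plot_loss_functions(values)
# select the best set of hyperparameters (assuming a higher score is better)
best_score, best_B, best_L, best_model = max(values, key=lambda v: v[0])
evaluate(best_model, test_set)  # only once, at the very end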
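For something that actually runs without GridSearchCV or cross-validation, the same procedure can be sketched with NumPy alone. Everything specific here is an assumption made for illustration: the synthetic data, the 60/20/20 split, the linear model trained by mini-batch SGD standing in for a neural network, the candidate values, and the helper names `fit` and `evaluate`.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data standing in for a real dataset, split 60/20/20.
X = rng.normal(size=(500, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=500)
X_train, y_train = X[:300], y[:300]
X_val, y_val = X[300:400], y[300:400]
X_test, y_test = X[400:], y[400:]

def fit(X, y, batch_size, learning_rate, epochs=20, seed=0):
    """Train a linear model with mini-batch SGD; returns the weight vector."""
    order_rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = order_rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            batch = order[start:start + batch_size]
            error = X[batch] @ w - y[batch]
            grad = 2.0 * X[batch].T @ error / len(batch)
            w -= learning_rate * grad
    return w

def evaluate(w, X, y):
    """Mean squared error (lower is better)."""
    return float(np.mean((X @ w - y) ** 2))

batch_sizes = [8, 32, 128]
learning_rates = [0.001, 0.01, 0.1]

results = []  # one (validation loss, batch size, learning rate, model) per run
for batch_size, learning_rate in itertools.product(batch_sizes, learning_rates):
    model = fit(X_train, y_train, batch_size, learning_rate)  # fit on train only
    results.append((evaluate(model, X_val, y_val), batch_size, learning_rate, model))

# Select on validation loss, then touch the test set exactly once.
best_val_loss, best_bs, best_lr, best_model = min(results, key=lambda r: r[0])
test_loss = evaluate(best_model, X_test, y_test)
print(f"best batch size={best_bs}, learning rate={best_lr}, "
      f"validation MSE={best_val_loss:.4f}, test MSE={test_loss:.4f}")
```

Keeping the fitted model alongside each score means the model evaluated on the test set is the one that won on validation, not simply the last one trained.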

Monitoring val_loss while training

I have a simple question that has made me doubt my work all of a sudden.
If I only have a training and validation set, am I allowed to monitor val_loss while training, or does that add bias to my training? I want to test my accuracy at the end of training on my validation set, but now I wonder: if I am monitoring that dataset while training, is that a problem?
Short answer: yes. Monitoring the validation error and using it as the basis for decisions about the specific setup of the algorithm adds bias to your algorithm. To elaborate a bit:
1) You fix the hyperparameters of an ML algorithm and then train it on the training set. The resulting model, with that specific hyperparameter setup, overfits to the training set, and you use the validation set to estimate what performance you can get with these hyperparameters on unseen data.
2) But you obviously want to adjust your hyperparameters to get the best performance. You may run a grid search or something similar to find the best hyperparameter settings for this specific algorithm using the validation set. As a result, your hyperparameter settings overfit to the validation set. Think of it as some of the information about the validation set leaking into your model through the hyperparameters.
3) As a result, you must do the following: split the data set into a training set, a validation set, and a test set. Use the training set for training and the validation set to decide on specific hyperparameters. When you are done (fully done!) fine-tuning your model, use the test set, which the model has never seen, to estimate the final performance in "combat mode".
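A common way of acting on a monitored val_loss is early stopping, and it shows why the monitored set stops being unbiased: the stopping epoch is itself chosen from validation data. This is a minimal sketch; the function name `early_stopping`, the `patience` rule, and the loss values are hypothetical and not tied to any particular framework.

```python
def early_stopping(val_losses, patience=2):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation losses.

    Training stops once the validation loss has failed to improve for
    `patience` consecutive epochs; the weights from `best_epoch` are kept.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses) - 1

# Hypothetical validation losses recorded after each epoch.
val_losses = [1.00, 0.80, 0.70, 0.75, 0.72, 0.90, 0.60]
best_epoch, stop_epoch = early_stopping(val_losses, patience=2)
print(best_epoch, stop_epoch)  # 2 4
```

The loss at `best_epoch` is the minimum over everything that was monitored, so it is an optimistic estimate. Reporting it as the final result is where the bias enters; a separate test set avoids that.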

Why does overfitting still happen after cross validation?

It is a binary photo classification problem; I extracted the features using AlexNet. The metric is log-loss. There are 25000 records in total in the training set, 12500 "1"s and 12500 "0"s, so the data set is balanced.
I trained an XGBoost model. After tuning the parameters using cross validation, the training log-loss is 0.078 and the validation log-loss is 0.09. But when I make predictions on the test set, the log-loss is 2.1. It seems that over-fitting is still pretty serious.
Why is that? Do I have to further tune the parameters or try another pre-trained model?

Why not optimize hyperparameters on train dataset?

When developing a neural net one typically partitions training data into Train, Test, and Holdout datasets (many people call these Train, Validation, and Test respectively. Same things, different names). Many people advise selecting hyperparameters based on performance in the Test dataset. My question is: why? Why not maximize performance of hyperparameters in the Train dataset, and stop training the hyperparameters when we detect overfitting via a drop in performance in the Test dataset? Since Train is typically larger than Test, would this not produce better results compared to training hyperparameters on the Test dataset?
UPDATE July 6 2016
Terminology change, to match comment below. Datasets are now termed Train, Validation, and Test in this post. I do not use the Test dataset for training. I am using a GA to optimize hyperparameters. At each iteration of the outer GA training process, the GA chooses a new hyperparameter set, trains on the Train dataset, and evaluates on the Validation and Test datasets. The GA adjusts the hyperparameters to maximize accuracy in the Train dataset. Network training within an iteration stops when network overfitting is detected (in the Validation dataset), and the outer GA training process stops when overfitting of the hyperparameters is detected (again in Validation). The result is hyperparameters pseudo-optimized for the Train dataset. The question is: why do many sources (e.g. https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf, Section B.1) recommend optimizing the hyperparameters on the Validation set, rather than the Train set? Quoting from Srivastava, Hinton, et al. (link above): "Hyperparameters were tuned on the validation set such that the best validation error was produced..."
The reason is that developing a model always involves tuning its configuration: for example, choosing the number of layers or the size of the layers (called the hyper-parameters of the model, to distinguish them from the parameters, which are the network’s weights). You do this tuning by using as a feedback signal the performance of the model on the validation data. In essence, this tuning is a form of learning: a search for a good configuration in some parameter space. As a result, tuning the configuration of the model based on its performance on the validation set can quickly result in overfitting to the validation set, even though your model is never directly trained on it.
Central to this phenomenon is the notion of information leaks. Every time you tune a hyperparameter of your model based on the model’s performance on the validation set, some information about the validation data leaks into the model. If you do this only once, for one parameter, then very few bits of information will leak, and your validation set will remain reliable to evaluate the model. But if you repeat this many times—running one experiment, evaluating on the validation set, and modifying your model as a result—then you’ll leak an increasingly significant amount of information about the validation set into the model.
At the end of the day, you’ll end up with a model that performs artificially well on the validation data, because that’s what you optimized it for. You care about performance on completely new data, not the validation data, so you need to use a completely different, never-before-seen dataset to evaluate the model: the test dataset. Your model shouldn’t have had access to any information about the test set, even indirectly. If anything about the model has been tuned based on test set performance, then your measure of generalization will be flawed.
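The leak can be seen in a small simulation. This is a sketch under deliberately artificial assumptions: the labels are coin flips and every "configuration" is a random guesser, so no configuration has any real skill. The variable names and set sizes are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

n_val, n_test, n_configs = 100, 2_000, 1_000

# Labels are coin flips, so no model can truly beat 50% accuracy.
y_val = rng.integers(0, 2, size=n_val)
y_test = rng.integers(0, 2, size=n_test)

# Each "configuration" is a model that guesses at random on both sets.
val_preds = rng.integers(0, 2, size=(n_configs, n_val))
test_preds = rng.integers(0, 2, size=(n_configs, n_test))

val_acc = (val_preds == y_val).mean(axis=1)
test_acc = (test_preds == y_test).mean(axis=1)

best = int(np.argmax(val_acc))  # "tune" by picking the best validation score
print(f"validation accuracy of selected config: {val_acc[best]:.2f}")
print(f"test accuracy of the same config:       {test_acc[best]:.2f}")
```

The selected configuration looks clearly better than chance on the validation set only because it was chosen for that score. On the test set it falls back to roughly 50%.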
There are two things you are missing here. The first, minor one is that the test set is never used for any training or tuning; that is the purpose of the validation set (the test set is only there to assess your final performance). The major misunderstanding is what it means "to use the validation set to fit hyperparameters". It means exactly what you describe: you train a model with given hyperparameters on the training set and use the validation set simply to check whether you are overfitting (you use it to estimate generalization). You do not really "train" on it; you simply check your scores on this subset (which, as you noticed, is much smaller).
You cannot "stop training hyperparameters", because this is not a continuous process. Hyperparameters are usually just "possible sets of values", and you simply have to test lots of them. There is no valid way of defining a direct training procedure between the actual metric you are interested in (like accuracy) and hyperparameters (like the size of the hidden layer in a NN or even the C parameter in an SVM), as the functional link between the two is not differentiable, is highly non-convex, and is in general "ugly" to optimize. If you can define a nice optimization procedure in terms of a hyperparameter, then it is usually not called a hyperparameter but a parameter. The crucial distinction in this naming convention is what makes it hard to optimize directly: we call a parameter a hyperparameter when it cannot be directly optimized against, so you need a "meta method" (like simply testing on the validation set) to select it.
However, you can define a "nice" meta-optimization protocol for hyperparameters, but this will still use the validation set as an estimator. For example, Bayesian optimization of hyperparameters does exactly this: it tries to fit a function describing how well your model behaves across the space of hyperparameters, but in order to have any "training data" for this meta-method, you need a validation set to estimate the score for any given set of hyperparameters (the input to your meta-method).
Simple answer: we do.
In the case of a simple feedforward neural network, you do have to select, e.g., the layer count and units per layer and the regularization (and non-continuous choices such as the topology, if not feedforward, and the loss function) at the beginning, and you would optimize over those.
So, in summary you optimize:
ordinary parameters only during training but not during validation
hyperparameters during training and during validation
It is very important not to touch the many ordinary parameters (weights and biases) during validation. That's because there are thousands of degrees of freedom in them which means they can learn the data you train them on. But then the model doesn't generalize to new data as well (even when that new data originated from the same distribution). You usually only have very few degrees of freedom in the hyperparameters which usually control the rigidity of the model (regularization).
This holds true for other machine learning algorithms, such as decision trees and forests, as well.

Resources