I am trying to optimise the hyperparameters of a random forest regressor in Python.
I have 3 separate datasets: train/validate/test. Therefore, rather than using a cross-validation method, I want to use the specific validation set to tune the hyperparameters, i.e. the "First Approach" described in this Stack Overflow post.
Now, sklearn has some nice inbuilt methods for hyperparameter optimisation using cross-validation (e.g. this tutorial), but what if I want to tune my hyperparameters with a specific validation set? Is it still possible to use a method like RandomizedSearchCV?
It is indeed possible with the cv option. As the documentation suggests, one of the possible inputs is an iterable of (train, test) index tuples:
An iterable yielding (train, test) splits as arrays of indices.
So, a list of size one, with the train and validation indices packed as a tuple, would work.
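For concreteness, here is a minimal sketch of that idea (the toy data, the parameter grid, and all variable names are my own assumptions, not part of the question): stack the training and validation arrays, then pass RandomizedSearchCV a list containing a single (train_indices, val_indices) pair.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Toy data standing in for your own pre-split train/validation arrays.
X_train, y_train = make_regression(n_samples=200, n_features=10, random_state=0)
X_val, y_val = make_regression(n_samples=50, n_features=10, random_state=1)

# Stack train and validation, then describe the split by index.
X = np.vstack([X_train, X_val])
y = np.concatenate([y_train, y_val])
train_idx = np.arange(len(X_train))
val_idx = np.arange(len(X_train), len(X))

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions={"n_estimators": [100, 300, 500],
                         "max_depth": [None, 10, 20]},
    n_iter=5,
    cv=[(train_idx, val_idx)],  # one (train, test) split -> no cross-validation
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```

An equivalent, arguably tidier, option is sklearn's PredefinedSplit, where you mark the training samples with -1 and the validation samples with 0 in its test_fold argument.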
I think some wording should be clarified first:
'Validation set'
A validation set is used to evaluate your model on an unseen set of data, i.e. data not used for training. This is to simulate how your model would behave on new data. We use the validation set to tune our hyperparameters, such as the number of trees, max depth, etc., and choose the hyperparameters which work best on the validation set.
'Cross-validate'
When you CV (cross-validate) with, say, 5 folds, you divide your data into 5 sets, where sets [1,2,3,4] are used for training and set 5 is used for validation. Then you use [2,3,4,5] for training and set 1 for validation; you repeat this until all sets (i.e. 5 times when using 5 folds) have been used as a validation set, and then you average your 5 validation scores (e.g. accuracy) to get one score, which you (often) want to maximize.
Answer
So, to answer your question: yes, you can use GridSearchCV on your validation set, but that is not usually how it is done. You would often do one of the following:
a) Use a single (i.e. one) validation set to tune your hyperparameters against, as explained in "Validation set"
b) Pool all your data, i.e. train + validation, into one dataset and then run a, say, 5-fold grid search with CV over it, as explained in "Cross-validate" (see the sketch after this list)
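A hedged sketch of option (b), with toy data and a made-up grid (none of these names come from the question): pool train + validation into one array and let GridSearchCV handle the 5-fold splitting itself.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Toy stand-ins for your own train and validation arrays.
X_train, y_train = make_regression(n_samples=200, n_features=10, random_state=0)
X_val, y_val = make_regression(n_samples=50, n_features=10, random_state=1)

# Pool the two sets; GridSearchCV will create the 5 folds internally.
X_pool = np.vstack([X_train, X_val])
y_pool = np.concatenate([y_train, y_val])

grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10]},
    cv=5,
    scoring="neg_mean_squared_error",
)
grid.fit(X_pool, y_pool)
print(grid.best_params_, grid.best_score_)
```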
Related
I have some trouble grasping the standard way to use cross-validation for hyperparameter tuning and evaluation. I am trying to do 10-fold CV. Which of the following is the correct way?
All of the data gets used for parameter tuning (e.g. using random grid search with cross-validation). This returns the best hyperparameters. Then, a new model is constructed with these hyperparameters, and it can be evaluated by doing a cross-validation (nine folds for training, one for testing; in the end the metrics, like accuracy or the confusion matrix, get averaged).
Another way that I found is to first split the data into a train and a test set, and then only perform cross-validation on the training set. One would then evaluate using the test set. However, as I understand it, that would undermine the whole concept of cross-validation, as the idea behind it is to be independent of the split, right?
Lastly, my supervisor told me to use eight folds for training, one for hyperparameter estimation and one for testing (and therefore evaluation). However, I could not find any material where this approach had been used. Is that a standard procedure or have I just misunderstood something?
In general, you can split your data into three sets:
training set
validation set
test set
Test set:
The test set is the easiest one to explain.
Once you've created your test set (15-30% of the data), you store it somewhere and you DON'T TOUCH it ANYMORE until you think you're done.
- The reason for this is simple: once you start to focus on this data set (e.g. to increase the AUC or ...), you start to overfit to it ...
The same holds (more or less) for the validation set: when you're tuning your hyperparameters etc., you start focusing on this set ... which means that you aren't generalizing anymore (and a good model should work on all data, not only on the test and validation sets).
That being said, you now only have the training and validation sets left.
Cross validation: one motivation to use cross validation is to get better generalization and a better view of your model/data (imagine that some special cases only exist in one validation set, etc.), and you don't take a single split for granted.
- The main downside of e.g. 10-fold cross validation is that it takes 10 times longer to finish ... but it gives you more trustworthy results (e.g. if you do 10-fold cross validation and your AUC fluctuates like 80, 85, 75, 77, 81, 65, ..., then you might have some data issues; in a perfect scenario, the differences between the fold AUCs should be small). A quick way to check this spread is sketched below.
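As a rough sketch of that check (the toy data and model are assumptions), you can look at the per-fold scores that cross_val_score returns and at their spread:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# One AUC per fold; a large spread across folds can hint at data issues.
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         cv=10, scoring="roc_auc")
print(scores)
print(scores.mean(), scores.std())  # small std -> folds agree, large std -> investigate
```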
Nevertheless, here is what I would do (it also depends on your resources, model, time and data set size):
Create 10 random folds (and keep track of them).
Do a 10-fold grid search if possible, to get a general view of the importance of each parameter (you don't have to take small steps ... e.g. random forest has a max_features parameter, and if you notice that all the models perform worse when that value is log2, then you can just eliminate that value of the hyperparameter).
Check which hyperparameters performed well.
Do a 10-fold random search or grid search in the areas which performed well.
But always use the same folds for each new experiment; this way you can compare the models with each other (see the sketch after this list). Often you'll see that some folds are more difficult than other folds, but they are difficult for all the models.
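A minimal sketch of the "same folds for every experiment" idea (the toy data, grids and values are my assumptions): build one KFold object with a fixed random_state and pass that same object to every search.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold, RandomizedSearchCV

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# The same 10 folds are reproduced every time this splitter is used.
folds = KFold(n_splits=10, shuffle=True, random_state=42)

# Coarse grid search first, to see which regions of the space look promising ...
coarse = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"max_features": ["sqrt", "log2", None],
                "n_estimators": [100, 200]},
    cv=folds, scoring="roc_auc",
)
coarse.fit(X, y)

# ... then a finer random search around the promising region, on the SAME folds,
# so the results stay comparable with the coarse search.
fine = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": range(100, 300, 20)},
    n_iter=5, cv=folds, scoring="roc_auc", random_state=0,
)
fine.fit(X, y)
print(coarse.best_params_, fine.best_params_)
```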
I am training a deep learning model using a 5-fold CV over three random seeds (random seeds are for model initialization, CV is split once). For each fold, I save the best model. Hence, I get 15 models after the simulation. To assess the performance, I take the best of these 15 models (unchanged during the entire evaluation process) and evaluate it using the validation fold of all the 5-folds for each seed. I then average the results across these seeds.
I would like to know if I am doing the right thing here.
I have read that there are two ways to compute CV performance: [1] pooling, where the performance is calculated globally over the union of all the test sets [2] averaging, where the performance is computed for every test set separately, with results being the average of these.
I intend to use method two (averaging).
Yes, you can use the averaging method for the 5-fold CV, but I don't understand what you mean by "For each fold, I save the best model". Moreover, three random seed values are not enough. You should use at least 10 different values and plot a boxplot for the corresponding results across these seeds.
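As a hedged sketch of that suggestion (the scoring function below is a hypothetical placeholder for your own train-and-evaluate step, and the data are toy arrays): fix the CV split once, loop over ten or more seeds, average the per-fold scores for each seed (the "averaging" scheme), and box-plot the per-seed results.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X, y = rng.normal(size=(300, 10)), rng.normal(size=300)

def train_and_score(train_idx, val_idx, seed):
    # Hypothetical placeholder: train your model with this seed on X[train_idx]
    # and return its score on X[val_idx]. Here it just returns a dummy number.
    return rng.normal(loc=0.8, scale=0.05)

kf = KFold(n_splits=5, shuffle=True, random_state=0)   # CV split fixed once
per_seed_means = []
for seed in range(10):                                 # >= 10 seeds, as suggested
    fold_scores = [train_and_score(tr, va, seed) for tr, va in kf.split(X)]
    per_seed_means.append(np.mean(fold_scores))        # "averaging", not "pooling"

plt.boxplot(per_seed_means)
plt.ylabel("mean 5-fold CV score per seed")
plt.show()
```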
I am trying to understand the process of model evaluation and validation in machine learning. Specifically, in which order and how the training, validation and test sets must be used.
Let's say I have a dataset and I want to use linear regression. I am hesitating among various polynomial degrees (hyper-parameters).
This Wikipedia article seems to imply that the sequence should be:
Split data into training set, validation set and test set
Use the training set to fit the model (find the best parameters: coefficients of the polynomial).
Afterwards, use the validation set to find the best hyper-parameters (in this case, polynomial degree) (wikipedia article says: "Successively, the fitted model is used to predict the responses for the observations in a second dataset called the validation dataset")
Finally, use the test set to score the model fitted with the training set.
However, this seems strange to me: how can you fit your model with the training set if you haven't chosen yet your hyper-parameters (polynomial degree in this case)?
I see three alternative approaches; I am not sure if they would be correct.
First approach
Split data into training set, validation set and test set
For each polynomial degree, fit the model with the training set and give it a score using the validation set.
For the polynomial degree with the best score, fit the model with the training set.
Evaluate with the test set
Second approach
Split data into training set, validation set and test set
For each polynomial degree, use cross-validation only on the validation set to fit and score the model
For the polynomial degree with the best score, fit the model with the training set.
Evaluate with the test set
Third approach
Split data into only two sets: the training/validation set and the test set
For each polynomial degree, use cross-validation only on the training/validation set to fit and score the model
For the polynomial degree with the best score, fit the model with the training/validation set.
Evaluate with the test set
So the question is:
Is the wikipedia article wrong or am I missing something?
Are the three approaches I envisage correct? Which one would be preferable? Would there be another approach better than these three?
The Wikipedia article is not wrong; according to my own experience, this is a frequent point of confusion among newcomers to ML.
There are two separate ways of approaching the problem:
Either you use an explicit validation set to do hyperparameter search & tuning
Or you use cross-validation
So, the standard point is that you always put aside a portion of your data as test set; this is used for no other reason than assessing the performance of your model in the end (i.e. not back-and-forth and multiple assessments, because in that case you are using your test set as a validation set, which is bad practice).
After you have done that, you choose if you will cut another portion of your remaining data to use as a separate validation set, or if you will proceed with cross-validation (in which case, no separate and fixed validation set is required).
So, essentially, both your first and third approaches are valid (and mutually exclusive, i.e. you should choose which one you will go with). The second one, as you describe it (CV only in the validation set?), is certainly not (as said, when you choose to go with CV you don't assign a separate validation set). Apart from a brief mention of cross-validation, what the Wikipedia article actually describes is your first approach.
Questions of which approach is "better" cannot of course be answered at that level of generality; both approaches are indeed valid, and are used depending on the circumstances. Very loosely speaking, I would say that in most "traditional" (i.e. non deep learning) ML settings, most people choose to go with cross-validation; but there are cases where this is not practical (most deep learning settings, again loosely speaking), and people are going with a separate validation set instead.
What Wikipedia means is actually your first approach.
1 Split data into training set, validation set and test set
2 Use the training set to fit the model (find the best parameters: coefficients of the polynomial).
That just means that you use your training data to fit a model.
3 Afterwards, use the validation set to find the best hyper-parameters (in this case, polynomial degree) (the Wikipedia article says: "Successively, the fitted model is used to predict the responses for the observations in a second dataset called the validation dataset")
That means that you use the model previously trained on the training set to predict the values of the validation dataset, in order to get a score of how well your model performs on unseen data.
You repeat steps 2 and 3 for all hyperparameter combinations you want to look at (in your case the different polynomial degrees you want to try) to get a score (e.g. accuracy) for every hyperparameter combination.
Finally, use the test set to score the model fitted with the training set.
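Putting the four steps together for the polynomial example, a minimal sketch (the data, split sizes and degree range are made up for illustration) could look like this:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=300)

# Step 1: split into training / validation / test (60/20/20 here).
X_train, X_val, X_test = X[:180], X[180:240], X[240:]
y_train, y_val, y_test = y[:180], y[180:240], y[240:]

# Steps 2 and 3: fit on the training set for each degree, score on validation.
best_degree, best_score = None, -np.inf
for degree in range(1, 10):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    score = model.score(X_val, y_val)          # R^2 on the validation set
    if score > best_score:
        best_degree, best_score = degree, score

# Step 4: refit with the chosen degree (on train + validation, per the Edit
# note at the end of this answer) and report one final score on the test set.
final_model = make_pipeline(PolynomialFeatures(best_degree), LinearRegression())
final_model.fit(np.vstack([X_train, X_val]), np.concatenate([y_train, y_val]))
print(best_degree, final_model.score(X_test, y_test))
```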
Why you need the validation set is pretty well explained in this stackexchange question
https://datascience.stackexchange.com/questions/18339/why-use-both-validation-set-and-test-set
In the end you can use any of your three approaches.
First approach:
is the fastest, because you only train one model for every hyperparameter.
Also, you don't need as much data as for the other two.
Second approach:
is the slowest, because for every hyperparameter combination you train k classifiers (one per fold) plus a final one on all your training data.
You also need a lot of data, because you split your data three times and then split one of those parts again into k folds.
But here you have the least variance in your results. It's pretty unlikely to get k good classifiers and a good validation result by coincidence; that is more likely to happen in the first approach. Cross-validation is also far less likely to overfit.
Third approach:
sits between the other two in its pros and cons. Here you are also less likely to overfit.
In the end it will depend on how much data you have, whether you get into more complex models like neural networks, and how much time/computing power you have and are willing to spend.
Edit: As @desertnaut mentioned, keep in mind that you should use the training and validation sets together as training data for your final evaluation with the test set. Also, you confused the training set with the validation set in your second approach.
I have read the general steps for K-fold cross validation at
https://machinelearningmastery.com/k-fold-cross-validation/
It describes the general procedure as follows:
Shuffle the dataset randomly.
Split the dataset into k groups (folds)
For each unique group:
Take the group as a hold-out or test data set
Take the remaining groups as a training data set
Fit a model on the training set and evaluate it on the test set
Retain the evaluation score and discard the model
Summarize the skill of the model using the sample of model evaluation scores
So if it is K-fold, then K models will be built, right? But why does the following link from H2O say that it builds K+1 models?
https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/product/tutorials/gbm/gbmTuning.ipynb
Arguably, "I read somewhere else" is too vague a statement (where?), because context does matter.
Most probably, such statements refer to some libraries which, by default, after finishing the CV procedure proper, go on to build a model on the whole training data using the hyperparameters found by CV to give the best performance; see for example the relevant train function of the caret R package, which, apart from performing CV (if requested), also returns the finalModel:
finalModel
A fit object using the best parameters
Similarly, scikit-learn's GridSearchCV also has a relevant parameter, refit:
refit : boolean, or string, default=True
Refit an estimator using the best found parameters on the whole dataset.
[...]
The refitted estimator is made available at the best_estimator_ attribute and permits using predict directly on this GridSearchCV instance.
But even then, the models fitted are almost never just K+1: when you use CV in practice for hyperparameter tuning (and keep in mind that there are other uses for CV, too), you will end up fitting m*K models, where m is the length of your hyperparameter combination set (all K folds in a single round are run with one single set of hyperparameters).
In other words, if your hyperparameter search grid consists of, say, 3 values for the number of trees and 2 values for the tree depth, you will fit 2*3*K = 6*K models during the CV procedure, and possibly +1 for fitting your model at the end on the whole data with the best hyperparameters found.
So, to summarize:
By definition, each K-fold CV procedure consists of fitting just K models, one for each fold, with fixed hyperparameters across all folds
In case of CV for hyperparameter search, this procedure will be repeated for each hyperparameter combination of the search grid, leading to m*K fits
Having found the best hyperparameters, you may want to use them for fitting the final model, i.e. 1 more fit
leading to a total of m*K + 1 model fits.
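To make that count concrete, here is a small sketch (toy data; the 3 x 2 grid mirrors the example above, and the model choice is arbitrary). With 6 candidate combinations and 5 folds, GridSearchCV performs 30 fits during the search, plus one refit on the whole data because refit=True:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

grid = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100, 200],   # 3 values for the no. of trees
                "max_depth": [2, 3]},             # 2 values for the depth -> m = 6
    cv=5,            # K = 5
    refit=True,      # the "+1": refit the best combination on the whole data
    verbose=1,       # prints "Fitting 5 folds for each of 6 candidates, totalling 30 fits"
)
grid.fit(X, y)
print(grid.best_params_)
print(grid.best_estimator_)   # the model from the extra (m*K + 1)-th fit
```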
Hope this helps...