I have observed in many articles and books that model selection is done before model tuning.
Model selection is generally done using some form of cross validation like k-fold where multiple models' metrics are calculated and the best one is selected.
And then the model selected is tuned to get the best hyperparameters.
But my question is that a model that was not selected might perform better with the right hyperparameters.
So why aren't all the models we are interested in tuned to get the right hyperparameters and then the best model be selected by cross validation.
It depends on the experimental set-up followed in each article/book, but in short the correct way of performing model selection + hyperparameter optimisation in the same experiment is to use Nested Cross-validation:
An Outer loop that evaluates the performance of the model (as usual)
An Inner loop (that splits again the dataset formed by the N-1 training partitions of the outer loop) that performs hyperparameter optimisation in every fold.
You can have a look at this other question to learn more about this validation scheme.
Note, however, that in some cases it can be acceptable to just do a general comparison with all the models, and then optimise only the top performing ones. But, in a rigorous study, this is far from ideal.
Related
I'm using a grid search to tune the hyperparameters of my DNN, which has 2 depth layers. I'm currently scoring each model based on the average loss in the test set, but I'm not sure if this is the best approach. Would it be better to use the accuracy, or both the loss and accuracy, as a scoring metric? How do other people typically score their models during hyperparameter tuning? Any advice or insights would be greatly appreciated.
The first thing in your experimental setup is using the test set while making hyperparameter tunning. You should train your model with your train set and make your hyperparameter tunning with your validation set. After finishing this process, you need to use test set to get the model score is the best option to the way of using/splitting the dataset correctly.
The second part of your question is very open-ended, but you may benefit from the following tips:
Different metrics may be suitable for different tasks, so it is important to choose the right metric. For instance, in some classification tasks you would like to track accuracy, and some of them recall or precision etc. (or you can use and track multiple metrics to understand your model behavior more deeper)
The recent advancement on this topic is generally referred to as AutoML and there are many different applications/libraries/methodologies that are used for hyperparameter tuning. So you may also want to search other methods rather than just using GridSeach. If you want to continue with GridSearch, to find the optimal parameters for your problem, you can switch the GridSearchCV so you can test your model more than once with a different part of the dataset which makes your hyperparameter tunning operation more robust.
I'm working on an assignment in which I have to compare two regressors (random forest and svr) that I implement with scikit-learn. I want to evaluate both regressors and I was googling around a lot and came across the nested cross validation where you use the inner loop to tune the hyperparameter and the outer loop to validate on the k folds of the training set. I would like to use the inner loop to tune both my regressors and the outer loop to validate both so I will have the same test and train folds for both rergessors.
Is this a proper way to compare two ml algorithms with each other? Are there better ways to compare two algorithms with each other? Especialy regressors?
I found some entries in blogs but I could not find any scientific paper stating this is a good technique to compare two algorithms with each other, which would be important to me. If there are some links to current papers I would be glad if you could post them, too.
Thanks for the help in advance!
EDIT
I have a very low amount of data (apprx. 200 samples) with a high amount of features (apprx. 250, after using feature selection, otherwise about 4500) so I decided to use cross validation.My dependent variable is a continous value from 0 to 1.
The problem is a recommender problem so it makes no sense to test for accuracy in this case. As this is only an assignment I can only measure the ml algorithms with statistical methods rather than asking users for their opinion or measure the purchases done by them.
I think it depends on what you want to compare. If you just want to compare different models with regards to prediction power (classifier and regressor alike), nested cross validation is usually good in order to not report overly optimistic metrics: https://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html while allowing you to find the best set of hyperparameters.
However, sometimes it seems like is just overkilling: https://arxiv.org/abs/1809.09446
Also, depending on how the ml algorithms behave, what datasets are you talking about, their characteristics, etc etc, maybe your "comparison" might need to take into consideration a lot of other things rather than just prediction power. Maybe if you give some more details we will be able to help more.
I am trying to understand the process of model evaluation and validation in machine learning. Specifically, in which order and how the training, validation and test sets must be used.
Let's say I have a dataset and I want to use linear regression. I am hesitating among various polynomial degrees (hyper-parameters).
In this wikipedia article, it seems to imply that the sequence should be:
Split data into training set, validation set and test set
Use the training set to fit the model (find the best parameters: coefficients of the polynomial).
Afterwards, use the validation set to find the best hyper-parameters (in this case, polynomial degree) (wikipedia article says: "Successively, the fitted model is used to predict the responses for the observations in a second dataset called the validation dataset")
Finally, use the test set to score the model fitted with the training set.
However, this seems strange to me: how can you fit your model with the training set if you haven't chosen yet your hyper-parameters (polynomial degree in this case)?
I see three alternative approachs, I am not sure if they would be correct.
First approach
Split data into training set, validation set and test set
For each polynomial degree, fit the model with the training set and give it a score using the validation set.
For the polynomial degree with the best score, fit the model with the training set.
Evaluate with the test set
Second approach
Split data into training set, validation set and test set
For each polynomial degree, use cross-validation only on the validation set to fit and score the model
For the polynomial degree with the best score, fit the model with the training set.
Evaluate with the test set
Third approach
Split data into only two sets: the training/validation set and the test set
For each polynomial degree, use cross-validation only on the training/validation set to fit and score the model
For the polynomial degree with the best score, fit the model with the training/validation set.
Evaluate with the test set
So the question is:
Is the wikipedia article wrong or am I missing something?
Are the three approaches I envisage correct? Which one would be preferrable? Would there be another approach better than these three?
The Wikipedia article is not wrong; according to my own experience, this is a frequent point of confusion among newcomers to ML.
There are two separate ways of approaching the problem:
Either you use an explicit validation set to do hyperparameter search & tuning
Or you use cross-validation
So, the standard point is that you always put aside a portion of your data as test set; this is used for no other reason than assessing the performance of your model in the end (i.e. not back-and-forth and multiple assessments, because in that case you are using your test set as a validation set, which is bad practice).
After you have done that, you choose if you will cut another portion of your remaining data to use as a separate validation set, or if you will proceed with cross-validation (in which case, no separate and fixed validation set is required).
So, essentially, both your first and third approaches are valid (and mutually exclusive, i.e. you should choose which one you will go with). The second one, as you describe it (CV only in the validation set?), is certainly not (as said, when you choose to go with CV you don't assign a separate validation set). Apart from a brief mention of cross-validation, what the Wikipedia article actually describes is your first approach.
Questions of which approach is "better" cannot of course be answered at that level of generality; both approaches are indeed valid, and are used depending on the circumstances. Very loosely speaking, I would say that in most "traditional" (i.e. non deep learning) ML settings, most people choose to go with cross-validation; but there are cases where this is not practical (most deep learning settings, again loosely speaking), and people are going with a separate validation set instead.
What Wikipedia means is actually your first approach.
1 Split data into training set, validation set and test set
2 Use the
training set to fit the model (find the best parameters: coefficients
of the polynomial).
That just means that you use your training data to fit a model.
3 Afterwards, use the validation set to find the best hyper-parameters
(in this case, polynomial degree) (wikipedia article says:
"Successively, the fitted model is used to predict the responses for
the observations in a second dataset called the validation dataset")
That means that you use your validation dataset to predict its values with the previously (on the training set) trained model to get a score of how good your model performs on unseen data.
You repeat step 2 and 3 for all hyperparameter combinations you want to look at (in your case the different polynomial degrees you want to try) to get a score (e.g. accuracy) for every hyperparmeter combination.
Finally, use the test set to score the model fitted with the training
set.
Why you need the validation set is pretty well explained in this stackexchange question
https://datascience.stackexchange.com/questions/18339/why-use-both-validation-set-and-test-set
In the end you can use any of your three aproaches.
approach:
is the fastest because you only train one model for every hyperparameter.
also you don't need as much data as for the other two.
approach:
is slowest because you train for k folds k classifiers plus the final one with all your training data to validate it for every hyperparameter combination.
You also need a lot of data because you split your data three times and that first part again in k folds.
But here you have the least variance in your results. Its pretty unlikely to get k good classifiers and a good validation result by coincidence. That could happen more likely in the first approach. Cross Validation is also way more unlikely to overfit.
approach:
is in its pros and cons in between of the other two. Here you also have less likely overfitting.
In the end it will depend on how much data you have and if you get into more complex models like neural networks, how much time/calculationpower you have and are willing to spend.
Edit As #desertnaut mentioned: Keep in mind that you should use training- and validationset as training data for your evaluation with the test set. Also you confused training with validation set in your second approach.
I am trying to understand the process of model evaluation and validation in machine learning. Specifically, in which order and how the training, validation and test sets must be used.
Let's say I have a dataset and I want to use linear regression. I am hesitating among various polynomial degrees (hyper-parameters).
In this wikipedia article, it seems to imply that the sequence should be:
Split data into training set, validation set and test set
Use the training set to fit the model (find the best parameters: coefficients of the polynomial).
Afterwards, use the validation set to find the best hyper-parameters (in this case, polynomial degree) (wikipedia article says: "Successively, the fitted model is used to predict the responses for the observations in a second dataset called the validation dataset")
Finally, use the test set to score the model fitted with the training set.
However, this seems strange to me: how can you fit your model with the training set if you haven't chosen yet your hyper-parameters (polynomial degree in this case)?
I see three alternative approachs, I am not sure if they would be correct.
First approach
Split data into training set, validation set and test set
For each polynomial degree, fit the model with the training set and give it a score using the validation set.
For the polynomial degree with the best score, fit the model with the training set.
Evaluate with the test set
Second approach
Split data into training set, validation set and test set
For each polynomial degree, use cross-validation only on the validation set to fit and score the model
For the polynomial degree with the best score, fit the model with the training set.
Evaluate with the test set
Third approach
Split data into only two sets: the training/validation set and the test set
For each polynomial degree, use cross-validation only on the training/validation set to fit and score the model
For the polynomial degree with the best score, fit the model with the training/validation set.
Evaluate with the test set
So the question is:
Is the wikipedia article wrong or am I missing something?
Are the three approaches I envisage correct? Which one would be preferrable? Would there be another approach better than these three?
The Wikipedia article is not wrong; according to my own experience, this is a frequent point of confusion among newcomers to ML.
There are two separate ways of approaching the problem:
Either you use an explicit validation set to do hyperparameter search & tuning
Or you use cross-validation
So, the standard point is that you always put aside a portion of your data as test set; this is used for no other reason than assessing the performance of your model in the end (i.e. not back-and-forth and multiple assessments, because in that case you are using your test set as a validation set, which is bad practice).
After you have done that, you choose if you will cut another portion of your remaining data to use as a separate validation set, or if you will proceed with cross-validation (in which case, no separate and fixed validation set is required).
So, essentially, both your first and third approaches are valid (and mutually exclusive, i.e. you should choose which one you will go with). The second one, as you describe it (CV only in the validation set?), is certainly not (as said, when you choose to go with CV you don't assign a separate validation set). Apart from a brief mention of cross-validation, what the Wikipedia article actually describes is your first approach.
Questions of which approach is "better" cannot of course be answered at that level of generality; both approaches are indeed valid, and are used depending on the circumstances. Very loosely speaking, I would say that in most "traditional" (i.e. non deep learning) ML settings, most people choose to go with cross-validation; but there are cases where this is not practical (most deep learning settings, again loosely speaking), and people are going with a separate validation set instead.
What Wikipedia means is actually your first approach.
1 Split data into training set, validation set and test set
2 Use the
training set to fit the model (find the best parameters: coefficients
of the polynomial).
That just means that you use your training data to fit a model.
3 Afterwards, use the validation set to find the best hyper-parameters
(in this case, polynomial degree) (wikipedia article says:
"Successively, the fitted model is used to predict the responses for
the observations in a second dataset called the validation dataset")
That means that you use your validation dataset to predict its values with the previously (on the training set) trained model to get a score of how good your model performs on unseen data.
You repeat step 2 and 3 for all hyperparameter combinations you want to look at (in your case the different polynomial degrees you want to try) to get a score (e.g. accuracy) for every hyperparmeter combination.
Finally, use the test set to score the model fitted with the training
set.
Why you need the validation set is pretty well explained in this stackexchange question
https://datascience.stackexchange.com/questions/18339/why-use-both-validation-set-and-test-set
In the end you can use any of your three aproaches.
approach:
is the fastest because you only train one model for every hyperparameter.
also you don't need as much data as for the other two.
approach:
is slowest because you train for k folds k classifiers plus the final one with all your training data to validate it for every hyperparameter combination.
You also need a lot of data because you split your data three times and that first part again in k folds.
But here you have the least variance in your results. Its pretty unlikely to get k good classifiers and a good validation result by coincidence. That could happen more likely in the first approach. Cross Validation is also way more unlikely to overfit.
approach:
is in its pros and cons in between of the other two. Here you also have less likely overfitting.
In the end it will depend on how much data you have and if you get into more complex models like neural networks, how much time/calculationpower you have and are willing to spend.
Edit As #desertnaut mentioned: Keep in mind that you should use training- and validationset as training data for your evaluation with the test set. Also you confused training with validation set in your second approach.
Is it possible to train a model based on training and validation data sets.Basically end up combining both of them to create a new model. And from that combined model use it to classify all of the data in the test dataset.
This is what is usually done. Assuming that you know how to transfer hyperparameters, as you usually fit model on train data, select hyperparameters based on the score on the valid one. Thus when you combine train + valid you get significantly bigger dataset, thus "optimal hyperparameters" might be completely different from the ones you selected before. So in general - yes, this is exactly what is usually done, but it might be more tricky than you expect (especially if your method is highly stochastic, non deterministic, etc.).