Why does K-fold cross validation build K+1 models? - machine-learning

I have read the general step for K-fold cross validation under
https://machinelearningmastery.com/k-fold-cross-validation/
It describes the general procedure as follows:
Shuffle the dataset randomly.
Split the dataset into k groups (folds).
For each unique group:
take the group as a hold-out or test data set;
take the remaining groups as a training data set;
fit a model on the training set and evaluate it on the test set;
retain the evaluation score and discard the model.
Summarize the skill of the model using the sample of model evaluation scores.
So if it is K-fold, then K models will be built, right? But then why does the following link from H2O say that it builds K+1 models?
https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/product/tutorials/gbm/gbmTuning.ipynb

Arguably, "I read it somewhere" is too vague a statement on its own, because context does matter here.
Most probably, such statements refer to libraries which, by default, after finishing the CV procedure proper, go on to build a model on the whole training data using the hyperparameters that CV found to give the best performance; see for example the train function of the caret R package, which, apart from performing CV (if requested), also returns the finalModel:
finalModel
A fit object using the best parameters
Similarly, scikit-learn's GridSearchCV also has a relevant parameter, refit:
refit : boolean, or string, default=True
Refit an estimator using the best found parameters on the whole dataset.
[...]
The refitted estimator is made available at the best_estimator_ attribute and permits using predict directly on this GridSearchCV instance.
But even then, the models fitted are almost never just K+1: when you use CV in practice for hyperparameter tuning (and keep in mind that there are other uses for CV, too), you will end up fitting m*K models, where m is the number of hyperparameter combinations in your search set (all K folds in a single round are run with one single set of hyperparameters).
In other words, if your hyperparameter search grid consists of, say, 3 values for the number of trees and 2 values for the tree depth, you will fit 2*3*K = 6*K models during the CV procedure, plus possibly 1 more for fitting your model at the end on the whole data with the best hyperparameters found.
So, to summarize:
By definition, each K-fold CV procedure consists of fitting just K models, one for each fold, with fixed hyperparameters across all folds
In case of CV for hyperparameter search, this procedure will be repeated for each hyperparameter combination of the search grid, leading to m*K fits
Having found the best hyperparameters, you may want to use them for fitting the final model, i.e. 1 more fit
leading to a total of m*K + 1 model fits.
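To make the counting concrete, here is a minimal scikit-learn sketch (my own illustration, not the H2O notebook): a 3x2 grid searched with 5-fold CV gives m*K = 6*5 = 30 fits, plus the 1 refit on the whole data because refit=True (the default).

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)
param_grid = {"n_estimators": [50, 100, 200], "max_depth": [2, 3]}  # m = 3 * 2 = 6 combinations

search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      param_grid, cv=5, refit=True, verbose=1)
search.fit(X, y)                 # with verbose=1, scikit-learn reports the 6 candidates x 5 folds = 30 CV fits
print(search.best_params_)       # best hyperparameter combination found by CV
print(search.best_estimator_)    # the "+1" model, refit on all of X, y with those hyperparameters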
Hope this helps...

Related

How to initialize the parameters in the cross-validation method and get the final model after training and evaluating with this method?

As I have learned about the cross-validation algorithm from articles on the web, there are a variety of cross-validation methods. Here I want to be clear about the k-fold cross-validation technique.
In the k-fold cross-validation algorithm, we split the training set into k non-overlapping folds.
As we split the training data into k folds, we have to train the model in k iterations.
So, in each iteration, we train the model with (k-1) folds and validate it with the remaining fold.
In each split we can calculate the desired metric(s) of our model.
At the end, we can report the cross-validation error by taking the average of the scores across all iterations.
But what is the final trained model?
Some points in those articles are not clear to me:
Should I reinitialize the model's parameters in each iteration?
I ask this because, if I don't reinitialize the parameters, the model could retain patterns from data that I want to remain unseen in the next iteration, and so on.
Should I save the initial parameters of the split in which I got the best score as the best initial values of the parameters?
Should I retrain the model, initializing it with the values from my second question, and then feed it the whole training dataset to obtain the final trained model?
Alright, so before answering your question I will go back a bit to explain the purpose of cross validation and model evaluation. You can read these slides or research statistical learning theory if you want to go deeper.
Train/test split
Suppose you have a model with defined hyperparameters (or none) and you train it on the training split. If you calculate the metrics over the test split, this will give you the risk of the model on new data. You then know that this particular model will perform roughly like that on unseen data.
So we have a learning process B that takes a dataset S (here the training dataset) as well as hyperparameters h, and gives a fitted model m: B(S, h) -> m (training with B on S with hyperparameters h gives a model m, with its parameters). We then test this model on the test dataset to estimate its risk R.
k-fold Cross validation
When doing k-fold cross validation, you fit k models using the learning process B. Each model is fitted on a different training set, and the risk is computed on non-overlapping samples.
Then you calculate the mean risk across the folds. A common mistake is to think that this gives you the performance of the model; that's not true. It gives you the mean (or expected) performance of the learning process B (with hyperparameters h). That means that if you train a new model using B (and hyperparameters h), its expected performance will be around the calculated metrics (of course, this is not always true).
For your questions
Yes, you should train the model from scratch in each fold, if possible with the same initial parameters (if initialization is not random), to avoid any difference between folds. Warm-starting with the previous fold's parameters would modify the learning process and the fit.
No; if initialization is random, let it be random, and if it is fixed, use the same initial parameters for all folds.
For the two previous questions: if by initial parameters you meant hyperparameters, then you should keep them the same for all folds, otherwise the calculated risk will be useless. If you want to try multiple hyperparameter settings, you have to repeat the cross validation for each of them, and then you can select the best one based on the calculated risk.
Once you have tuned your hyperparameters, you can train the model on your whole training set. This will give you your final model m. Before doing cross validation, keep a small test split aside to evaluate this final model on unseen data.
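As a rough illustration of these points (my own sketch on synthetic data, not your model), here is a manual K-fold loop that retrains a fresh copy of the model in each fold with fixed hyperparameters, averages the fold scores to estimate the learning process B, and then fits the final model m on the whole training set, with a test split kept aside:

import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, train_test_split

X, y = make_classification(n_samples=600, random_state=0)

# Keep a test split aside *before* cross validation, to evaluate the final model
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fixed hyperparameters h for the learning process B
base_model = LogisticRegression(C=1.0, max_iter=1000)

scores = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X_trainval):
    model = clone(base_model)  # a fresh, untrained copy each fold: train from scratch
    model.fit(X_trainval[train_idx], y_trainval[train_idx])
    scores.append(model.score(X_trainval[val_idx], y_trainval[val_idx]))

print(np.mean(scores))  # expected performance of the learning process B with hyperparameters h

# Final model m: fit B on the whole training set, then evaluate once on the held-out test set
final_model = clone(base_model).fit(X_trainval, y_trainval)
print(final_model.score(X_test, y_test))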

Hyperparameter optimisation in Python with a separate validation set

I am trying to optimise the hyperparameters of a random forest regressor in Python.
I have 3 separate datasets: train/validate/test. Therefore, rather than using a cross-validation method, I want to use the specific validation set to tune the hyperparameters, i.e. the "First Approach" described in this stackoverflow post.
Now, sklearn has some nice inbuilt methods for hyperparameter optimisation using cross validation (e.g. this tutorial), but what if I want to tune my hyperparameters with a specific validation set? Is it still possible to use a method like RandomizedSearchCV?
It is indeed possible with the cv option. As the documentation suggests, one of the possible inputs is an iterable of (train, test) index tuples:
An iterable yielding (train, test) splits as arrays of indices.
So a list of size one, with the train and validation indices packed as a tuple, would be fine.
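For example, something along these lines should work (a sketch with synthetic data standing in for your existing train/validation split; the arrays and parameter ranges are just placeholders):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-ins for an existing train/validation split
X, y = make_regression(n_samples=300, n_features=10, random_state=0)
X_train, y_train = X[:200], y[:200]
X_val, y_val = X[200:], y[200:]

# Stack train and validation, then describe the split by index
X_all = np.vstack([X_train, X_val])
y_all = np.concatenate([y_train, y_val])
train_idx = np.arange(len(X_train))
val_idx = np.arange(len(X_train), len(X_all))

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions={"n_estimators": [50, 100, 200], "max_depth": [None, 5, 10]},
    n_iter=5,
    cv=[(train_idx, val_idx)],  # a single (train, validation) split instead of k folds
    random_state=0,
)
search.fit(X_all, y_all)
print(search.best_params_)

Alternatively, sklearn's PredefinedSplit can be used to express the same single split.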
I think we should just have some wording clarified:
'Validation set'
A validation set is used to evaluate your model on an unseen set of data, i.e. data not used for training. This is to simulate how your model would behave on new data. We use the validation set to tune our hyperparameters, such as the number of trees, max depth etc., and choose the hyperparameters which work best on the validation set.
'Cross-validate'
When you CV (cross-validate) with, say, 5 folds, you divide your data into 5 sets, where sets [1,2,3,4] are used for training and set 5 is used for validation. Then you use [2,3,4,5] for training and set 1 for validation; you repeat this until all sets have been used as a validation set (i.e. 5 times when using 5 folds), and then you average your 5 validation scores (e.g. accuracy) to get one score, which you (often) want to maximize.
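As a minimal sketch of that rotation-and-average in scikit-learn (cross_val_score handles the 5 splits for you; the model and dataset here are just illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=5, scoring="accuracy")
print(scores)         # one accuracy per fold (5 values)
print(scores.mean())  # the single averaged score you typically report / maximize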
Answer
So, to answer your question: yes, you can use GridSearchCV with your validation set, but that is not often how it is done. You would usually do one of the following:
a) Use a single (i.e. one) validation set to tune your hyperparameters against, as explained in "Validation set"
b) Use all your data, i.e. train+validation, as one data set and then run, say, a 5-fold grid CV search as explained in "Cross-validate"

K Nearest Neighbour Classifier - random state for train test split leads to different accuracy scores

I'm fairly new to data analysis and machine learning. I've been carrying out some KNN classification analysis on a breast cancer dataset in python's sklearn module. I have the following code which attempts to find the optimal k for classification of a target variable.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt

breast_cancer_data = load_breast_cancer()
training_data, validation_data, training_labels, validation_labels = train_test_split(
    breast_cancer_data.data, breast_cancer_data.target, test_size=0.2, random_state=40)

# fit a KNN model for each k and record its validation accuracy
results = []
for k in range(1, 101):
    classifier = KNeighborsClassifier(n_neighbors=k)
    classifier.fit(training_data, training_labels)
    results.append(classifier.score(validation_data, validation_labels))

# plot accuracy against k
k_list = range(1, 101)
plt.plot(k_list, results)
plt.ylim(0.85, 0.99)
plt.xlabel("k")
plt.ylabel("Accuracy")
plt.title("Breast Cancer Classifier Accuracy")
plt.show()
The code loops through 1 to 100 and generates 100 KNN models with 'k' set to incremental values in the range 1 to 100. The performance of each of those models is saved to a list and a plot is generated showing 'k' on the x-axis and model performance on the y-axis.
The problem I have is that when I change the random_state parameter when splitting the data into training and testing partitions, I get completely different plots, indicating varying model performance for different 'k' values under different dataset partitions.
For me this makes it difficult to decide which 'k' is optimal, as the algorithm performs differently for different 'k's using different random states. Surely this doesn't mean that, for this particular dataset, 'k' is arbitrary? Can anyone help shed some light on this?
Thanks in anticipation
This is completely expected. When you do the train-test split, you are effectively sampling from your original population. This means that when you fit a model, any statistic (such as a model parameter estimate, or a model score) will itself be a sample estimate taken from some distribution. What you really want is a confidence interval around this score, and the easiest way to get that is to repeat the sampling and remeasure the score.
But you have to be very careful how you do this. Here are some robust options:
1. Cross Validation
The most common solution to this problem is to use k-fold cross-validation. In order not to confuse this k with the k from kNN, I'm going to use a capital K for cross-validation (but bear in mind this is not standard nomenclature). This is a scheme that implements the suggestion above, but without a target leak. Instead of creating many splits at random, you split the data into K parts (called folds). You then train K models, each time on K-1 folds of the data, leaving aside a different fold as your test set each time. Now each model is independent and without a target leak. It turns out that the mean of whatever success score you use from these K models on their K separate test sets is a good estimate of the performance of training a model with those hyperparameters on the whole set. So now you should get a more stable score for each of your different values of k (lowercase k for kNN), and you can choose a final k this way.
Some extra notes:
Accuracy is a bad measure for classification performance. Look at scores like precision vs. recall, AUROC, or F1.
Don't try to program CV yourself; use sklearn's GridSearchCV (see the sketch after this list).
If you are doing any preprocessing on your data that calculates some sort of state from the data, that needs to be done only on the training data in each fold. For example, if you are scaling your data, you can't include the test data when you fit the scaler. You need to fit (and transform) the scaler on the training data and then use that same scaler to transform your test data (don't fit it again). To get this to work inside CV you need to use sklearn Pipelines. This is very important; make sure you understand it.
You might get more stability if you stratify your train-test-split based on the output class. See the stratify argument on train_test_split.
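Putting those notes together, here is a sketch (my own, not your exact setup) of a scaler-plus-kNN Pipeline tuned with GridSearchCV over k on a stratified split, with a single final check on a held-out test set:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Stratified hold-out test set; CV happens only on the training portion
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=40)

# Scaling lives inside the pipeline, so it is refit on the training folds only
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

search = GridSearchCV(pipe,
                      param_grid={"knn__n_neighbors": list(range(1, 101))},
                      cv=5, scoring="f1")
search.fit(X_train, y_train)

print(search.best_params_)            # the k chosen by cross-validation
print(search.score(X_test, y_test))   # single final check on the held-out test set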
Note that CV is the industry standard and that's what you should do, but there are other options:
2. Bootstrapping
You can read about this in detail in introduction to statistical learning section 5.2 (pg 187) with examples in section 5.3.4.
The idea is to take your training set and draw a random sample from it with replacement. This means you end up with some repeated records. You take this new training set, train a model, and then score it on the records that didn't make it into the bootstrapped sample (often called out-of-bag samples). You repeat this process multiple times. You can now get a distribution of your score (e.g. accuracy), which you can use to choose your hyperparameter, rather than just the point estimate you were using before.
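A minimal sketch of that procedure (my own illustration, with the kNN fixed at k=5; in practice you would repeat it for each candidate k):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)

scores = []
for _ in range(200):  # number of bootstrap rounds
    boot = rng.integers(0, len(X), size=len(X))   # sample row indices with replacement
    oob = np.setdiff1d(np.arange(len(X)), boot)   # out-of-bag rows: never drawn
    model = KNeighborsClassifier(n_neighbors=5).fit(X[boot], y[boot])
    scores.append(model.score(X[oob], y[oob]))

scores = np.array(scores)
print(scores.mean(), np.percentile(scores, [2.5, 97.5]))  # point estimate and a rough 95% interval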
3. Making sure your validation set is representative of your test set
Jeremy Howard has a very interesting suggestion on how to calibrate your validation set to be a good representation of your test set. You only need to watch about 5 minutes from where that link starts. The idea is to split into three sets (which you should be doing anyway to choose a hyperparameter like k), train a bunch of very different but simple, quick models on your train set, and then score them on both your validation and test set. It is OK to use the test set here because these aren't real models that will influence your final model. Then plot the validation scores vs the test scores. They should fall roughly on a straight line (the y=x line). If they do, this means the validation set and test set are both either good or bad, i.e. performance on the validation set is representative of performance on the test set. If they don't fall on this straight line, it means the model scores you get from your validation set are not indicative of the score you'll get on unseen data, and thus you can't use that split to train a sensible model.
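A rough sketch of that check (my own illustration; the particular quick models are arbitrary choices):

import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)

quick_models = [DummyClassifier(strategy="most_frequent"),
                GaussianNB(),
                DecisionTreeClassifier(max_depth=2, random_state=0),
                LogisticRegression(max_iter=5000)]

val_scores, test_scores = [], []
for m in quick_models:
    m.fit(X_train, y_train)
    val_scores.append(m.score(X_val, y_val))
    test_scores.append(m.score(X_test, y_test))

plt.scatter(val_scores, test_scores)
plt.plot([0.5, 1.0], [0.5, 1.0])   # the y = x reference line
plt.xlabel("Validation accuracy")
plt.ylabel("Test accuracy")
plt.show()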
4. Get a larger data set
This is obviously not very practical for your situation but I thought I'd mention it for completeness. As your sample size increases, your standard error drops (i.e. you can get tighter bounds on your confidence intervals). But you'll need more training and more test data. While you might not have access to that here, it's worth keeping in mind for real world situations where you can assess the trade-off of the cost of gathering new data vs the desired accuracy in assessing your model performance (and probably the performance itself too).
This "behavior" is to be expected. Of course you get different results, when training and test is split differently.
You can approach the problem statistically, by repeating each 'k' several times with new train-validation-splits. Then take the median performance for each k. Or even better: look at the performance distribution and the median. A narrow performance distribution for a given 'k' is also a good sign that the 'k' is chosen well.
Afterwards you can use the test set to test your model
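A small sketch of that idea (my own illustration; 20 repeats is an arbitrary choice):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
ks = range(1, 101)
scores = np.zeros((20, len(ks)))  # 20 different random splits x 100 values of k

for i in range(20):
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=i)
    for j, k in enumerate(ks):
        scores[i, j] = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr).score(X_val, y_val)

median_per_k = np.median(scores, axis=0)
spread_per_k = scores.max(axis=0) - scores.min(axis=0)  # a narrow spread is a good sign
best = int(np.argmax(median_per_k))
print(ks[best], median_per_k[best], spread_per_k[best])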

How do you do cross validation correctly? [duplicate]

I am trying to understand the process of model evaluation and validation in machine learning. Specifically, in which order and how the training, validation and test sets must be used.
Let's say I have a dataset and I want to use linear regression. I am hesitating among various polynomial degrees (hyper-parameters).
This wikipedia article seems to imply that the sequence should be:
Split data into training set, validation set and test set
Use the training set to fit the model (find the best parameters: coefficients of the polynomial).
Afterwards, use the validation set to find the best hyper-parameters (in this case, polynomial degree) (wikipedia article says: "Successively, the fitted model is used to predict the responses for the observations in a second dataset called the validation dataset")
Finally, use the test set to score the model fitted with the training set.
However, this seems strange to me: how can you fit your model with the training set if you haven't yet chosen your hyper-parameters (polynomial degree in this case)?
I see three alternative approaches, and I am not sure if they would be correct.
First approach
Split data into training set, validation set and test set
For each polynomial degree, fit the model with the training set and give it a score using the validation set.
For the polynomial degree with the best score, fit the model with the training set.
Evaluate with the test set
Second approach
Split data into training set, validation set and test set
For each polynomial degree, use cross-validation only on the validation set to fit and score the model
For the polynomial degree with the best score, fit the model with the training set.
Evaluate with the test set
Third approach
Split data into only two sets: the training/validation set and the test set
For each polynomial degree, use cross-validation only on the training/validation set to fit and score the model
For the polynomial degree with the best score, fit the model with the training/validation set.
Evaluate with the test set
So the question is:
Is the wikipedia article wrong or am I missing something?
Are the three approaches I envisage correct? Which one would be preferable? Would there be another approach better than these three?
The Wikipedia article is not wrong; according to my own experience, this is a frequent point of confusion among newcomers to ML.
There are two separate ways of approaching the problem:
Either you use an explicit validation set to do hyperparameter search & tuning
Or you use cross-validation
So, the standard point is that you always put aside a portion of your data as test set; this is used for no other reason than assessing the performance of your model in the end (i.e. not back-and-forth and multiple assessments, because in that case you are using your test set as a validation set, which is bad practice).
After you have done that, you choose if you will cut another portion of your remaining data to use as a separate validation set, or if you will proceed with cross-validation (in which case, no separate and fixed validation set is required).
So, essentially, both your first and third approaches are valid (and mutually exclusive, i.e. you should choose which one you will go with). The second one, as you describe it (CV only in the validation set?), is certainly not (as said, when you choose to go with CV you don't assign a separate validation set). Apart from a brief mention of cross-validation, what the Wikipedia article actually describes is your first approach.
Questions of which approach is "better" cannot of course be answered at that level of generality; both approaches are indeed valid, and are used depending on the circumstances. Very loosely speaking, I would say that in most "traditional" (i.e. non deep learning) ML settings, most people choose to go with cross-validation; but there are cases where this is not practical (most deep learning settings, again loosely speaking), and people are going with a separate validation set instead.
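To make this concrete, here is a minimal sketch of the third approach on synthetic data (the dataset and degree grid are my own illustration): cross-validation over the polynomial degree on the training/validation data, with GridSearchCV refitting the best degree on all of it, and a single final evaluation on the test set.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * X[:, 0] ** 3 - X[:, 0] + rng.normal(scale=2.0, size=200)

# Third approach: training/validation data stays together, test set is held out once
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

pipe = make_pipeline(PolynomialFeatures(), LinearRegression())
search = GridSearchCV(pipe,
                      param_grid={"polynomialfeatures__degree": [1, 2, 3, 4, 5]},
                      cv=5)
search.fit(X_trainval, y_trainval)   # CV over degrees, then refit on all of X_trainval

print(search.best_params_)           # the chosen polynomial degree
print(search.score(X_test, y_test))  # final R^2 on the untouched test set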
What Wikipedia means is actually your first approach.
1. Split data into training set, validation set and test set
2. Use the training set to fit the model (find the best parameters: coefficients of the polynomial).
That just means that you use your training data to fit a model.
3. Afterwards, use the validation set to find the best hyper-parameters (in this case, polynomial degree) (wikipedia article says: "Successively, the fitted model is used to predict the responses for the observations in a second dataset called the validation dataset")
That means that you use your validation dataset to make predictions with the model previously trained on the training set, to get a score of how well your model performs on unseen data.
You repeat steps 2 and 3 for all hyperparameter combinations you want to look at (in your case, the different polynomial degrees you want to try) to get a score (e.g. accuracy) for every hyperparameter combination.
Finally, use the test set to score the model fitted with the training set.
Why you need the validation set is pretty well explained in this stackexchange question
https://datascience.stackexchange.com/questions/18339/why-use-both-validation-set-and-test-set
In the end, you can use any of your three approaches.
First approach:
It is the fastest, because you only train one model per hyperparameter combination.
You also don't need as much data as for the other two.
Second approach:
It is the slowest, because for every hyperparameter combination you train k classifiers over the k folds, plus a final one on all your training data to validate it.
You also need a lot of data, because you split your data three times and the first part again into k folds.
But here you have the least variance in your results. It is pretty unlikely to get k good classifiers and a good validation result by coincidence; that could happen more easily in the first approach. Cross-validation is also far less likely to overfit.
Third approach:
Its pros and cons lie in between the other two. Here you are also less likely to overfit.
In the end it will depend on how much data you have, and, if you get into more complex models like neural networks, on how much time/computational power you have and are willing to spend.
Edit: As @desertnaut mentioned, keep in mind that you should use the training and validation sets together as training data for your final evaluation with the test set. Also, you confused the training set with the validation set in your second approach.

