What's the difference between K-fold cross validation and out-of-sample cross validation? Could you use a few sentences to describe the steps for each CV method?
K-fold cross validation is a type of out-of-sample cross validation. The name "out of sample" comes from the following fact: if we fit the model and compute the MSE on the training set, we will get an optimistically biased assessment of how well the model will fit an independent data set. This biased estimate is called the in-sample estimate of the fit (we would be using training samples), whereas the cross-validation estimate is an out-of-sample estimate.
In k-fold cross-validation, the original sample is randomly partitioned into k equal sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. The cross-validation process is then repeated k times, with each of the k subsamples used exactly once as the validation data. The k results can then be averaged to produce a single estimation. The advantage of this method over repeated random sub-sampling is that all observations are used for both training and validation, and each observation is used for validation exactly once.
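The steps above can be sketched in a few lines. This is only an illustration under my own assumptions: the helper names `k_fold_indices` and `k_fold_mse`, the synthetic data, and the choice of ordinary least squares as the model are all made up for the example.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Randomly partition range(n_samples) into k nearly equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), k)

def k_fold_mse(X, y, k=5, seed=0):
    """Average validation MSE of ordinary least squares over k folds."""
    folds = k_fold_indices(len(y), k, seed)
    scores = []
    for i, val_idx in enumerate(folds):
        # The other k - 1 folds form the training data for this round.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        coef, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
        residual = y[val_idx] - X[val_idx] @ coef
        scores.append(float(np.mean(residual ** 2)))
    # Average the k results into a single estimate.
    return float(np.mean(scores)), scores

# Synthetic regression data (an assumption for the demo).
rng = np.random.default_rng(42)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = X @ np.array([1.0, 2.0, -3.0]) + rng.normal(scale=0.5, size=100)

mean_mse, fold_mses = k_fold_mse(X, y, k=5)
print(f"per-fold MSE: {np.round(fold_mses, 3)}, mean: {mean_mse:.3f}")
```

Every observation lands in exactly one validation fold, which is the property that distinguishes k-fold from repeated random sub-sampling.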
For other methods, check Wikipedia, which has excellent summaries: https://en.wikipedia.org/wiki/Cross-validation_(statistics)#Types
While learning about cross-validation from articles on the web, I found that there are a variety of cross-validation methods. Here I want to be clear about the k-fold cross-validation technique.
In k-fold cross-validation, we split the training set into k non-overlapping folds.
Having split the training data into k folds, we train the model in k iterations.
In each iteration, we train the model on (k-1) folds and validate it on the remaining fold.
In each iteration we can calculate the desired metric(s) of our model.
At the end we can report the cross-validation error by averaging the scores of all iterations.
But what is the final trained model?
Some points in those articles are not clear to me:
Should I reinitialize the model's parameters in each iteration?
I ask because if I don't reinitialize the parameters, the model could retain patterns from data that I want to be unseen in the next iteration, and so on.
Should I save the initial parameters of the split in which I got the best score as the best initial parameter values?
Should I then retrain the model, initializing it with the parameter values from my second question, on the whole training dataset to obtain the final trained model?
Alright, so before answering your question I will go back a bit to explain the purpose of cross validation and model evaluation. You can read these slides or research more about statistical learning theory if you want to go deeper.
Train/test split
Suppose you have a model with fixed hyperparameters (or none) and you train it on the training split. If you calculate the metrics over the test split, this gives you an estimate of the risk of the model on new data, so you know roughly how this particular model will perform on unseen data.
So we have a learning process B that takes a dataset S (here the training dataset) and hyperparameters h, and gives a fitted model m: B(S, h) -> m (training B on S with hyperparameters h gives a model m with its parameters). We then test this model to evaluate the risk R on the test dataset.
k-fold Cross validation
When doing k-fold cross validation, you fit k models using the learning process B. Each model is fitted on a different training set, and the risk is computed on non-overlapping validation samples.
Then you calculate the mean risk across the folds. A common mistake is to think that this gives you the performance of a particular model; that's not true. It gives you the mean (or expected) performance of the learning process B (with hyperparameters h). That means that if you train a new model using B (and hyperparameters h), its expected performance will be around the calculated metrics (of course this is not always true).
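To make the "learning process versus model" distinction concrete, here is a small sketch using scikit-learn. The dataset, the choice of logistic regression as B, and the value C=1.0 as h are assumptions for the example only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data standing in for the dataset S (an assumption for the demo).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# The learning process B with fixed hyperparameters h.
learner = LogisticRegression(C=1.0, max_iter=1000)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
# cross_val_score fits a fresh clone of `learner` on each training fold,
# so no parameters carry over from one fold to the next.
scores = cross_val_score(learner, X, y, cv=cv, scoring="accuracy")

print("per-fold accuracy:", scores.round(3))
print("mean:", scores.mean().round(3), "std:", scores.std().round(3))
```

Note that `learner` itself is still unfitted afterwards: the k fitted clones are discarded, and what you keep is the estimate of how B performs with h.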
For your questions
Yes, you should train the model from scratch, if possible with the same initial parameters (if initialization is not random), to avoid any difference between folds. Using a warm start from the previous parameters can change the learning process and the fit.
No. If initialization is random, leave it random; if it is fixed, use the same initial parameters for all folds.
For the two previous questions, if by initial parameters you meant hyperparameters, then you should keep them the same for all folds, otherwise the calculated risk will be useless. If you want to try multiple hyperparameter settings, you have to repeat the cross validation once per setting, and then you can select the best one based on the calculated risk.
Once you have tuned your hyperparameters, you can train the model on your whole training set. This will give you a model m. Before your cross validation, you can set aside a small test split to evaluate this final model on unseen data.
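One possible way to put these answers together with scikit-learn is shown below. The SVC model, the grid of C values, and the 80/20 split are assumptions for illustration, not a recommendation for your problem.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Keep a small test split aside before any cross validation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# One full 5-fold cross validation is run per hyperparameter setting.
search = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.1, 1, 10]},
    cv=5,
    refit=True,  # retrain on the whole training set with the best setting
)
search.fit(X_train, y_train)

final_model = search.best_estimator_  # the model m
test_accuracy = final_model.score(X_test, y_test)
print("best hyperparameters:", search.best_params_)
print("accuracy on the held-out test split:", round(test_accuracy, 3))
```

The final model is not any of the per-fold models; it is a new fit on all of the training data using the selected hyperparameters.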
Regularization Term
Overfitting in Polynomial Regression, Comparing the training set's root mean squared error and the validation set's root-mean-squared-error.
Graph of the root-mean-square-error vs lnλ for the M=9 polynomial
I didn't understand this graph properly. While training the model to learn the parameters, we have to set λ = 0, since it doesn't make sense to select the value of λ in advance and then proceed with training. So how does the training error vary as we vary the value of λ? We divided the dataset into validation and training sets so that we train the model on the training set and then validate it on the validation set.
You may be confusing a few concepts here.
Why the loss function?
You apply a shrinkage penalty to your loss function.
This will nudge your model towards finding weights closer to zero, which can be helpful as a form of regularization that trades variance for bias (see e.g. ridge regression in “An Introduction to Statistical Learning” by James, Witten, Hastie, and Tibshirani).
How to train?
Setting λ after training does not affect your trained model. Think about it in terms of linear regression: as soon as you have fitted the linear regression function, your error function does not matter anymore. You just apply the linear regression function.
Thus, you have to set the λ parameter during training.
Otherwise, your model will optimize the parameters without the regularization term and will thus not shrink the sum of weights. Hence, you are actually training your model without regularization.
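A quick numerical check of this point, using the closed-form ridge solution. The function name `ridge_fit`, the synthetic data, and the value λ = 10 are assumptions for the example.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimise ||y - Xw||^2 + lam * ||w||^2 in closed form."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
y = X @ np.array([3.0, -2.0, 0.0, 1.0, 0.5]) + rng.normal(scale=0.5, size=30)

w_unreg = ridge_fit(X, y, lam=0.0)   # training without regularization
w_reg = ridge_fit(X, y, lam=10.0)    # λ is fixed before training starts

print("weight norm with λ = 0: ", np.linalg.norm(w_unreg).round(3))
print("weight norm with λ = 10:", np.linalg.norm(w_reg).round(3))
```

λ enters only through the fit. Once `w_reg` exists, predictions are just `X @ w_reg`, and λ plays no further role.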
How to find a good value for λ?
You have to distinguish multiple steps:
Training: You train a model on your training data with λ set to a fixed value. λ does not change during training, and it must not be zero (if you want regularization).
This training will yield a model; let’s call it Model λ1.
Single Validation Run: You validate how well Model λ1 performs on the validation set. This tells you whether, and by how much, λ improved your model, but not whether there is a better λ.
Cross-Validation Idea: Train multiple models and use validation runs to evaluate their performance to find the best λ. In other words, you train multiple models (e.g. Model λ2 .. λ10) with different λ values. Then you can compare their validation set performance among each other, to see which value for λ is best (on the validation set).
Why have a three-way split (train / validation / test): If you pick a final model with this procedure (e.g. Model λ3), you still don’t know how well your model generalizes, because you have been using the validation set to find a good value for λ. Thus, you would expect your model to perform rather well on the validation set.
Hence, you evaluate your final model on the test set, which your model has never seen and on which you never performed any kind of parameter optimization. The performance you measure there is the final performance of your model. It is important that you do not evaluate multiple models on the test set and then select the best one, because then you would again be optimizing the performance on the test set.
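The steps above can be sketched for an M = 9 polynomial. I am assuming noisy samples of sin(2πx) as the data (a common textbook setup, not something stated in the question), and for simplicity the penalty here also covers the constant term. All function names are made up for the example.

```python
import numpy as np

def design(x, degree=9):
    """Polynomial features 1, x, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)

def fit_ridge(x, t, lam, degree=9):
    """Least squares with penalty lam * ||w||^2, solved as an augmented system."""
    Phi = design(x, degree)
    A = np.vstack([Phi, np.sqrt(lam) * np.eye(degree + 1)])
    b = np.concatenate([t, np.zeros(degree + 1)])
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return w

def rmse(x, t, w):
    return float(np.sqrt(np.mean((design(x, len(w) - 1) @ w - t) ** 2)))

def sample(n, rng):
    x = rng.uniform(0.0, 1.0, size=n)
    return x, np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)

rng = np.random.default_rng(0)
x_train, t_train = sample(10, rng)
x_val, t_val = sample(100, rng)
x_test, t_test = sample(100, rng)

log_lambdas = np.arange(-20, 1, 2)  # ln λ from -20 to 0
train_rmse, val_rmse = [], []
for log_lam in log_lambdas:
    # One separate training run per λ value.
    w = fit_ridge(x_train, t_train, np.exp(log_lam))
    train_rmse.append(rmse(x_train, t_train, w))
    val_rmse.append(rmse(x_val, t_val, w))

# Pick λ on the validation set, then touch the test set exactly once.
best = int(np.argmin(val_rmse))
w_best = fit_ridge(x_train, t_train, np.exp(log_lambdas[best]))
test_rmse = rmse(x_test, t_test, w_best)
print("best ln λ:", log_lambdas[best], "test RMSE:", round(test_rmse, 3))
```

Each point on a training-error curve like the one in the question comes from a separate training run with its own fixed λ, which is why the training error changes as λ changes.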
How to interpret the plot?
This is actually hard without some more knowledge about the problem you are tackling.
At first glance, it seems like your model is overfitting for small values of λ and improves its performance on the validation set for larger values due to regularization.
I am training a multi-layer perceptron and have two questions. First, how can k-fold prevent overfitting? A train-test split does the same thing: it trains on one part and validates the model on the other. K-fold does the same, except that there are multiple folds. Since overfitting is possible with train_test_split, how does k-fold prevent it? As I see it, the model could also overfit the training part of each fold. What do you think?
Second, I am getting 95%+ accuracy from k-fold, but my instructor told me that there is too much variance. How is that possible if k-fold resolves overfitting?
K-Fold cross-validation won't reduce overfitting on its own, but using it will generally give you better insight into your model, which eventually can help you avoid or reduce overfitting.
Using a simple training/validation split, the model may appear to perform well if the split isn't indicative of the true data distribution. K-Fold cross-validation splits the data into k chunks & performs training k times, each time using a particular chunk as the validation set & the rest of the chunks as the training set. Therefore, the model may perform quite well on some folds, but relatively worse on others. This will give you a better indication of how well the model truly performs.
A relatively high training accuracy combined with a substantially lower validation accuracy indicates overfitting (high variance & low bias). The goal would be to keep both variance & bias at low levels, potentially at the expense of slightly worse training accuracy, as this would indicate that the learnt model has generalised well to unseen instances. You can read more on the bias vs variance tradeoff.
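Here is one way to look at the gap between training and validation accuracy per fold. I am using an unpruned decision tree instead of an MLP purely because it overfits quickly and deterministically; the noisy synthetic dataset is also an assumption for the demo.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# flip_y adds label noise, so a model that memorises the training folds
# cannot score equally well on the validation folds.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)

result = cross_validate(
    DecisionTreeClassifier(random_state=0),
    X, y, cv=5, return_train_score=True,
)
train_acc = result["train_score"]
val_acc = result["test_score"]  # scikit-learn calls the validation folds "test"
gap = train_acc.mean() - val_acc.mean()

print("train accuracy per fold:     ", train_acc.round(3))
print("validation accuracy per fold:", val_acc.round(3))
print("mean gap:", round(gap, 3))
```

A large gap, or validation scores that swing widely from fold to fold, is the kind of signal k-fold gives you. It reports the overfitting; it does not remove it.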
Choosing the number of folds may also play a part in this insight, as explained in this answer. Depending on the size of the data, the training folds being used may be too large compared to the validation data.
K-fold can help with overfitting because you essentially split your data into several different train/test splits instead of doing it once. By running on multiple different splits as opposed to just one, you get a better understanding of how your model is actually performing on the dataset and on unseen data. It doesn’t completely prevent overfitting, and it all boils down to your data at the end of the day (if the data you use for training, testing, and validation is not truly representative of future points, you can still end up with an overfit model).
I was wondering whether a model learns from the test data as well when it is evaluated on it multiple times, leading to an overfitting scenario. Normally we split the training data into train and test splits, and I noticed some people split it into 3 sets of data: train, test, and eval. eval is for the final evaluation of the model. I might be wrong, but my point is that if the scenario above is not true, then there is no need for an eval data set.
Need some clarification.
The best way to evaluate how well a model will perform in the 'wild' is to evaluate its performance on a data set it has not seen (i.e., been trained on) -- assuming you have the labels in a supervised learning problem.
People split their data into train/test/eval and use the training data to estimate/learn the model parameters and the test set to tune the model (e.g., by trying different hyperparameter combinations). A model is usually selected based on the hyperparameter combination that optimizes a test metric (regression - MSE, R^2, etc.; classification - AUC, accuracy, etc.). Then the selected model is usually retrained on the combined train + test data set. After retraining, the model is evaluated based on its performance on the eval data set (assuming you have some ground truth labels to evaluate your predictions). The eval metric is what you report as the generalization metric -- that is, how well your model performs on novel data.
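A minimal sketch of that workflow, keeping the train/test/eval naming used above. The random forest, the `max_depth` candidates, and the split sizes are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# eval is split off first and not touched until the very end.
X_rest, X_eval, y_rest, y_eval = train_test_split(
    X, y, test_size=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

# Tune a hyperparameter using the test set.
candidates = [2, 5, None]  # max_depth values to try
test_scores = {}
for depth in candidates:
    model = RandomForestClassifier(n_estimators=50, max_depth=depth, random_state=0)
    test_scores[depth] = model.fit(X_train, y_train).score(X_test, y_test)
best_depth = max(test_scores, key=test_scores.get)

# Retrain the selected configuration on train + test combined.
final = RandomForestClassifier(
    n_estimators=50, max_depth=best_depth, random_state=0
).fit(X_rest, y_rest)

# Report the generalization metric on eval only.
eval_accuracy = final.score(X_eval, y_eval)
print("selected max_depth:", best_depth, "eval accuracy:", round(eval_accuracy, 3))
```

The test scores are optimistic for the selected model because they were used to choose it, which is exactly why the separate eval set is needed.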
Does this help?
Suppose you have train and test datasets. The train dataset is the one for which you know the output; you train your model on it and try to predict the output for the test dataset.
Most people split the train dataset into train and validation sets. So first you fit your model on the train data and evaluate it on the validation set. Then you run the model on the test dataset.
Now you may wonder how this helps and what use it is.
It helps you understand your model's performance on seen data (validation data) and unseen data (your test data).
This is where the bias-variance trade-off comes into the picture.
https://machinelearningmastery.com/gentle-introduction-to-the-bias-variance-trade-off-in-machine-learning/
Let's consider a binary classification example where a student's previous semester grades, sports achievements, extracurriculars, etc. are used to predict whether or not he will pass the final semester.
Let's say we have around 10000 samples (data of 10000 students).
Now we split them:
Training set - 6000 samples
Validation set - 2000 samples
Test set - 2000 samples
The data is generally split into three parts (training set, validation set, and test set) for the following reasons:
1) Feature Selection: Let's assume you have trained the model using some algorithm. You calculate the training accuracy and validation accuracy. You plot the learning curves, find out whether the model is overfitting or underfitting, and make changes (add or remove features, add more samples, etc.). Repeat until you have the best validation accuracy. Now test the model with the test set to get your final score.
2) Parameter Selection: When you use an algorithm like KNN, you need to find the K value that fits the data best. You can plot the validation accuracy for different K values, choose the K with the best validation accuracy, and use it for your test set. (The same applies when you choose n_estimators for random forests, etc.)
3) Model Selection: You can also train models with different algorithms and choose the one that fits the data best by comparing their accuracy on the validation set.
So basically the validation set helps you evaluate your model's performance and decide how to fine-tune it for the best accuracy.
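For the KNN case in point 2, a rough sketch could look like this. The synthetic data, the list of K values, and the 60/20/20 split are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)

# 60% train, 20% validation, 20% test.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, random_state=0
)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=0
)

# Compare K values using validation accuracy only.
val_accuracy = {}
for k in [1, 3, 5, 7, 9, 15]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    val_accuracy[k] = knn.score(X_val, y_val)

best_k = max(val_accuracy, key=val_accuracy.get)

# The test set is used once, for the final score.
final_knn = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
test_accuracy = final_knn.score(X_test, y_test)
print("validation accuracy by K:", val_accuracy)
print("best K:", best_k, "test accuracy:", round(test_accuracy, 3))
```

The same loop works for feature subsets or for different algorithms: only what varies inside the loop changes.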
Hope you find this helpful.
I built a classifier with 13 features (none binary), normalized individually for each sample using scikit-learn (Normalizer().transform).
When I make predictions, it predicts everything in the training set as positive and everything in the test set as negative (regardless of whether it is actually positive or negative).
What anomalies should I look for: in my classifier, features, or data?
Notes: 1) I normalize the test and training sets separately (individually for each sample).
2) I tried cross validation, but the performance is the same.
3) I used SVMs with both linear and RBF kernels.
4) I tried without normalizing too, but got the same poor results.
5) I have the same number of positive and negative training samples (400 each), and the test set has 34 positive samples and 1000+ negative samples.
If you're training on balanced data the fact that "it predicts all training sets as positive" is probably enough to conclude that something has gone wrong.
Try building something very simple (e.g. a linear SVM with one or two features) and look at the model as well as a visualization of your training data; follow the scikit-learn example: http://scikit-learn.org/stable/auto_examples/svm/plot_iris.html
There's also a possibility that your input data has many large outliers impacting the transform process...
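To see what the per-sample transform does when a row contains a large value, here is a tiny made-up example. The three rows are invented for illustration, and StandardScaler is shown only as a per-feature contrast; it is not something the question used.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

X_train = np.array([
    [1.0, 2.0, 2.0],
    [2.0, 4.0, 4.0],
    [1.0, 2.0, 2000.0],  # last feature is an outlier in this row
])

# Normalizer rescales each ROW to unit L2 norm, independently of other rows.
row_scaled = Normalizer().fit_transform(X_train)

# StandardScaler rescales each COLUMN using statistics learned from the data
# it is fitted on (fit on training data, then reuse on test data).
scaler = StandardScaler().fit(X_train)
col_scaled = scaler.transform(X_train)

print("row-normalized:\n", row_scaled.round(4))
print("column-standardized:\n", col_scaled.round(4))
```

After row normalization the first two samples become identical, and in the third sample the outlier squeezes the other two features towards zero. It is worth printing a few transformed rows of your real data to check for this.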
Try doing feature selection on the training data (separately from your test/validation data).
Feature selection on your whole dataset can easily lead to overfitting.
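One way to keep feature selection inside the training data during cross validation is to put it in a scikit-learn pipeline. The synthetic 13-feature dataset, the choice of SelectKBest with k=5, and the scaler are assumptions for the sketch.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a balanced dataset with 13 features.
X, y = make_classification(
    n_samples=800, n_features=13, n_informative=4, random_state=0
)

# Scaling and feature selection are refitted on the training part of each
# fold, so the validation part never influences which features are kept.
pipeline = make_pipeline(
    StandardScaler(),
    SelectKBest(f_classif, k=5),
    SVC(kernel="linear"),
)

scores = cross_val_score(pipeline, X, y, cv=5)
print("per-fold accuracy:", scores.round(3), "mean:", scores.mean().round(3))
```

Selecting features on the full dataset first and cross-validating afterwards would leak information from the validation folds into the selection step.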