K-fold cross-validation makes no part of the data blind to the model

I have a conceptual question about k-fold cross-validation.
In general, we train a model on training data and validate it with test data. We assume the system is blind to the test data, and this is why we can evaluate whether the system really learnt or not.
Now with k-fold, the final model has actually (if indirectly) seen all of the data, so why is it still valid? It has already seen all the data, and we do not know how it will predict on unseen data.
Given this, why do we consider this method valid?
Thanks.

In k-fold cross-validation, you actually train K different models. Let's say we are doing 5-fold CV and the size of the dataset is 100 samples. Then, in each fold, we randomly split the data into 80 train samples and 20 test samples. We train on the 80 train samples, test the trained model on the 20 left-out test samples, compute the accuracy, and note it. At the end, we will have 5 different models, and we can average the accuracies of the folds and report this as the average performance of the model.
Coming to your question, you need to think about why we need k-fold cross-validation in the first place. The answer is that you need to report the performance of your model. However, if you just train and evaluate your model on a single split, there is a possibility that your estimate is biased towards this specific split. In that one split, a rare case may arise, such as a large domain shift between the train and test sets, which is bad for the measured performance.
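A minimal sketch of the 5-fold procedure described above, assuming scikit-learn and a synthetic 100-sample dataset (the logistic-regression classifier is just a placeholder):

```python
# Minimal sketch of 5-fold cross-validation as described above.
# The synthetic dataset and the logistic-regression classifier are
# placeholders; swap in your own data and model.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=100, n_features=10, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_accuracies = []

for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    model = LogisticRegression(max_iter=1000)       # a fresh model per fold
    model.fit(X[train_idx], y[train_idx])           # train on the 80 train samples
    acc = accuracy_score(y[test_idx], model.predict(X[test_idx]))  # test on the 20 left out
    fold_accuracies.append(acc)
    print(f"fold {fold}: accuracy = {acc:.3f}")

print(f"mean accuracy over 5 folds: {np.mean(fold_accuracies):.3f}")
```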

TL;DR: Think of your 'test data' more like 'validation data', which you hope represents truly unseen test data. Ideally if the model performs well for many different validation datasets it will work well when applied to real life test data which wasn't used in the training-validation process.
This confusion is justified. You are correct.
This is where the terminology of training data, validation data and test data can make things more clear. Models are trained on training data. This is data directly seen by the model as it goes through the process of updating its parameters and learning. Validation data is data that we use to validate how well the model has actually learned. It is not directly seen by the model, and we use it to judge things like under- or overfitting. It is assumed that the validation data is a good representation of the test data. Test data is what we will end up applying our model to in the real world; it has never been seen in any way by the model.
Test and validation data are often used interchangeably, with most people just using training and test terminology.
An example:
If you are building a cat detector, you collect images of cats and split these images into training and validation sets. You assume the validation set is an accurate representation of the kinds of cat images people will use your model on in the real world. You train your model on the training data, validate how well it has learned on the validation data, and once you think it has learned well, you deploy the model. People will use it on their own images to detect cats. These images are the true test data, which have never been seen by the model, but hopefully your validation set was a good indicator of how your model will perform on these images.
K-fold cross-validation is best used when your validation set may be small, or when you are unsure how well it represents test data (e.g. if there are only ginger cats in your validation set, that could lead to your model failing on test data, so you would like to mix the validation set up). By performing k-fold cross-validation you can validate your model more times, with different choices of validation set, which hopefully gives a better indication of your model's generalizability.
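A small sketch of the training/validation/test terminology above, assuming scikit-learn; the 60/20/20 proportions and the synthetic data are just assumptions for illustration:

```python
# Two calls to train_test_split: first carve off a held-out test set the
# model never sees, then split the remainder into training and validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# Hold out 20% as the "real world" test set.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Split the remainder into training data (seen by the model) and
# validation data (used only to judge under- or overfitting).
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```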

Related

Can I predict and evaluate a model with the whole dataset?

I split the dataset into train and test sets with an 80-20 ratio. I predicted and evaluated on the test dataset. My question is: can we evaluate and predict with the model on the whole dataset (after shuffling the entire dataset)? If not, why shouldn't we do that? What is wrong with doing it like that?
Data snooping is the quick answer to what you are looking for.
In other words, your model would seem to outperform on your test data if it was first trained on 100% of the data. The model would become overfitted: it would predict the data it has already seen with high accuracy, but would fail to do so on any sort of unseen test data.
You can do it, but it would result in an overfitted model. You can try the k-fold cross-validation method instead.
If you use the whole dataset for training, the model will fit all the variance in the data (overfitting). As a result, the performance of your model on similar data will be high. However, the model will exhibit low performance on unseen data with a distribution different from your training dataset. One way to prevent this is to: a) split your data into training, validation, and testing datasets (see the note below), b) apply k-fold cross-validation on the training and validation splits, c) verify the performance of your models from step b on the third split (the test dataset).
Note: There is no consensus on the naming of the splits. Some sources name them training-validation-testing while others use training-testing-validation.
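A hedged sketch of steps (a)-(c) above, assuming scikit-learn and synthetic data; the random forest is only an example model choice:

```python
# (a) split off a final test set, (b) run k-fold cross-validation on the
# remaining data, (c) verify on the untouched test split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# (a) training+validation portion vs. test split
X_trval, X_test, y_trval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# (b) k-fold cross-validation on the training+validation portion
model = RandomForestClassifier(random_state=0)
cv_scores = cross_val_score(model, X_trval, y_trval, cv=5)
print("cross-validation accuracy:", cv_scores.mean())

# (c) final check on the held-out test split
model.fit(X_trval, y_trval)
print("test accuracy:", model.score(X_test, y_test))
```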

Training Data Vs. Test Data

This might sound like an elementary question, but I am having major confusion regarding the training set and the test set.
When we use supervised learning techniques such as classification to predict something, a common practice is to split the dataset into two parts: a training set and a test set. The training set has the predictor variable; we train the model on this dataset and "predict" things.
Let's take an example. We are going to predict loan defaulters in a bank, and we have the German credit dataset, where we are predicting defaulters and non-defaulters, but there is already a definition column which says whether a customer is a defaulter or non-defaulter.
I understand the logic of prediction on UNSEEN data, like the Titanic survival data, but what is the point of prediction when the class is already given, as in the German credit lending data?
As you said, the idea is to come up with a model that can predict UNSEEN data. The test data is only used to measure the performance of the model created from the training data. You want to make sure the model you come up with does not "overfit" your training data; that's why the test data is important. Eventually, you will use the model to predict whether a new borrower is going to default or not, and thus make a business decision on whether to approve the loan application.
The reason they include the default labels is so that you can verify that the model is working as expected and predicting the correct results. Without them, there is no way for anyone to be confident that their model is working as expected.
The ultimate purpose of training a model is to apply it to what you call UNSEEN data.
Even in your German credit lending example, at the end of the day you will have a trained model that you could use to predict if new - unseen - credit applications will default or not. And you should be able to use it in the future for any new credit application, as long as you are able to represent the new credit data in the same format you used to train your model.
On the other hand, the test set is just a formalism used to estimate how good the model is. You cannot know for sure how accurate your model is going to be with future credit applications, but what you can do is save a small part of your training data and use it only to check the model's performance after it has been built. That's what you would call the test set (or, more precisely, a validation set).
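As an illustration of this workflow, here is a hypothetical sketch; the file german_credit.csv and the "default" label column are assumptions for the example, not the actual schema of the German credit dataset:

```python
# Hypothetical sketch: fit on training rows, measure on held-out test rows,
# then apply the same model to brand-new (unseen) applications.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("german_credit.csv")            # assumed file with a "default" label column
X = pd.get_dummies(df.drop(columns=["default"])) # one-hot encode categorical features
y = df["default"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out test accuracy:", model.score(X_test, y_test))

# Later, score a genuinely new application (unseen data), represented in the
# same format as the training features:
# new_application = ...   # one-row DataFrame with the same columns as X
# model.predict(new_application)
```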

Is it a good practice to use your full data set for predictions?

I know you're supposed to separate your training data from your testing data, but when you make predictions with your model is it OK to use the entire data set?
I assume separating your training and testing data is valuable for assessing the accuracy and prediction strength of different models, but once you've chosen a model I can't think of any downsides to using the full data set for predictions.
You can use the full data for prediction, but you had better retain the indexes of the train and test data. Here are the pros and cons of doing so:
Pro:
If you retain the indexes of the rows belonging to the train and test data, then you only need to predict once (saving time) to get all the results. You can then calculate performance indicators (R2/MAE/AUC/F1/precision/recall etc.) for the train and test data separately, after subsetting the actual and predicted values using the train and test set indexes.
Cons:
If you calculate performance indicators for the entire dataset (without clearly differentiating train and test rows using the indexes), you will get overly optimistic estimates. This happens because the model, having been trained on the train data, gives good results on the train data, which, depending on the train/test split percentage, will produce illusorily good performance indicator values.
Processing a large test set at once may also create a memory bulge, which can result in a crash in all-objects-in-memory languages like R.
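A small sketch of the pro above, assuming scikit-learn: predict once over the full feature matrix, then compute metrics separately from the retained train and test indexes:

```python
# Predict once on all rows, then score train and test subsets separately
# using the saved index arrays.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, noise=10.0, random_state=0)
idx = np.arange(len(y))
train_idx, test_idx = train_test_split(idx, test_size=0.2, random_state=0)

model = LinearRegression().fit(X[train_idx], y[train_idx])

y_pred = model.predict(X)                        # single prediction pass over all rows

for name, subset in [("train", train_idx), ("test", test_idx)]:
    print(name,
          "R2:", round(r2_score(y[subset], y_pred[subset]), 3),
          "MAE:", round(mean_absolute_error(y[subset], y_pred[subset]), 2))
```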
In general, you're right - when you've finished selecting your model and tuning the parameters, you should use all of your data to actually build the model (exception below).
The reason for dividing data into train and test is that, without out-of-bag samples, high-variance algorithms will do better than low-variance ones, almost by definition. Consequently, it's necessary to split data into train and test parts for questions such as:
deciding whether kernel-SVR is better or worse than linear regression, for your data
tuning the parameters of kernel-SVR
However, once these questions are settled, then in general, as long as your data is generated by the same process, the more of it you train on, the better your predictions will be, and you should use all of it.
An exception is the case where the data is, say, non-stationary. Suppose you're training for the stock market, and you have data from 10 years ago. It is unclear that the process hasn't changed in the meantime. In this case, you might be harming your predictions by including more data.
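To make the "select with splits, then fit on everything" workflow concrete, here is a sketch using scikit-learn's GridSearchCV, which answers the model-selection questions by cross-validation and then, with refit=True (its default), retrains the chosen configuration on all of the data it was given; the SVR parameter grid is just an example:

```python
# Tune hyperparameters by cross-validation, then refit the best
# configuration on the full dataset passed to fit().
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

X, y = make_regression(n_samples=300, noise=5.0, random_state=0)

search = GridSearchCV(
    SVR(kernel="rbf"),
    param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.01]},
    cv=5,
)
search.fit(X, y)  # cross-validates each candidate, then refits the winner on all of X, y

print("best parameters:", search.best_params_)
final_model = search.best_estimator_             # trained on the full dataset
```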
Yes, there are techniques for doing this, e.g. k-fold cross-validation:
One of the main reasons for using cross-validation instead of using the conventional validation (e.g. partitioning the data set into two sets of 70% for training and 30% for test) is that there is not enough data available to partition it into separate training and test sets without losing significant modelling or testing capability. In these cases, a fair way to properly estimate model prediction performance is to use cross-validation as a powerful general technique.
That said, there may not be a good reason for doing so if you have plenty of data, because it means that the model you're using hasn't actually been tested on real data. You're inferring that it probably will perform well, since models trained using the same methods on less data also performed well. That's not always a safe assumption. Machine learning algorithms can be sensitive in ways you wouldn't expect a priori. Unless you're very starved for data, there's really no reason for it.

Combining training, validation and test datasets

Is it possible to train a model on both the training and validation datasets? Basically, you end up combining both of them to create a new model, and then use that combined model to classify all of the data in the test dataset.
This is what is usually done, assuming that you know how to transfer the hyperparameters: you usually fit the model on the train data and select the hyperparameters based on the score on the validation data. When you combine train + validation you get a significantly bigger dataset, so the "optimal hyperparameters" might be completely different from the ones you selected before. So in general, yes, this is exactly what is usually done, but it might be trickier than you expect (especially if your method is highly stochastic, non-deterministic, etc.).
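A minimal sketch of this combine-and-refit step, assuming scikit-learn, synthetic data, and a ridge regression whose alpha is the only hyperparameter being transferred:

```python
# Pick a hyperparameter on the validation split, then retrain with that
# setting on train + validation before scoring the test set.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, noise=10.0, random_state=0)
X_trval, X_test, y_trval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trval, y_trval, test_size=0.25, random_state=0)

# Select the hyperparameter on the validation set.
alphas = [0.01, 0.1, 1.0, 10.0]
val_scores = [Ridge(alpha=a).fit(X_train, y_train).score(X_val, y_val) for a in alphas]
best_alpha = alphas[int(np.argmax(val_scores))]

# Refit on train + validation with the chosen hyperparameter, then test.
final_model = Ridge(alpha=best_alpha).fit(X_trval, y_trval)
print("best alpha:", best_alpha, "test R2:", final_model.score(X_test, y_test))
```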

What is the right way to measure if a machine learning model has overfit?

I understand the intuitive meaning of overfitting and underfitting. Now, given a particular machine learning model that is trained upon the training data, how can you tell if the training overfitted or underfitted the data? Is there a quantitative way to measure these factors?
Can we look at the error and say if it has overfit or underfit?
I believe the easiest approach is to have two sets of data: training data and validation data. You train the model on the training data for as long as the model's fitness on the training data stays close to its fitness on the validation data. When the model's fitness keeps increasing on the training data but not on the validation data, you're overfitting.
The usual way, I think, is known as cross-validation. The idea is to split the training set into several pieces, known as folds, then pick one at a time for evaluation and train on the remaining ones.
It does not, of course, measure the actual overfitting or underfitting, but if you can vary the complexity of the model, e.g. by changing the regularization term, you can find the optimal point. This is as far as one can go with just training and testing, I think.
You don't look at the error on the training data, but on the validation data only.
A common way of testing is to try different model complexities and see how the error changes with model complexity. These curves usually have a typical shape: in the beginning, the error quickly improves; then there is a saturation region (where the model is good); then the error starts getting worse again, not because the model is learning anything better, but because it is overfitting. You want to be at the low-complexity end of the plateau: the simplest model that provides reasonable generalization.
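One way to draw such a curve is scikit-learn's validation_curve; in this sketch, the model complexity being varied is the depth of a decision tree, which is just an example choice:

```python
# Compare cross-validated train and validation accuracy as model
# complexity (tree depth) grows.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_informative=5, random_state=0)
depths = np.arange(1, 16)

train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name="max_depth", param_range=depths, cv=5,
)

for d, tr, va in zip(depths, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"depth {d:2d}: train acc {tr:.3f}, validation acc {va:.3f}")
# Validation accuracy typically rises, plateaus, then drops as the tree
# becomes complex enough to overfit; pick the simplest depth on the plateau.
```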
The existing answers are not strictly speaking wrong, but they are not complete. Yes, you do need a validation set, but an important point is that you do not simply look at the model's error on the validation set and try to minimize it. That leads to overfitting all the same, because you will effectively be fitting to the validation set. The right approach is not to minimize the error on your sets, but to make the error independent of which training and validation sets you use. If the error on the validation set is significantly different from the error on the training set (it does not matter whether it is worse or better), then the model is overfit. And certainly, this should be done in a cross-validation fashion, where you train on one random set and then validate on another random set.
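As a sketch of that check, assuming scikit-learn: cross_validate with return_train_score=True reports both training and validation scores per fold, and a large, consistent gap between them is the usual sign of overfitting:

```python
# Compare training and validation scores across several random folds.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=1000, n_informative=5, random_state=0)

scores = cross_validate(
    RandomForestClassifier(random_state=0), X, y,
    cv=5, return_train_score=True,
)

train_mean = np.mean(scores["train_score"])
val_mean = np.mean(scores["test_score"])
print(f"train accuracy {train_mean:.3f}, validation accuracy {val_mean:.3f}, "
      f"gap {train_mean - val_mean:.3f}")
```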
