Let's say I want to use a Random Forest model to predict future data. I'm thinking about two ways of training this model, picking the best hyperparameters, and putting this model in production. The difference between the two approaches is that the first one splits the data into a training and test set, while the second does not.
Can I use both these approaches? Is one of these better to use than the other? I guess one downside of the 2nd approach is that there is no unbiased performance estimate, but does this really matter?
1)
Split data into train and test set (80/20)
Use k-fold cross validation on the train data set.
Choose hyperparameters which perform best on the k validation sets.
Train this best model on complete training data
Get an unbiased performance estimate on test set
Train best model on complete data set
Predict future data using final model
2)
Use k-fold cross validation on the complete data set.
Choose hyperparameters which perform best on the k validation sets.
Train best model on complete data
Predict future data using final model
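As a rough sketch of the first workflow (assuming scikit-learn, a synthetic dataset from make_classification, and an illustrative hyperparameter grid, none of which come from the question), the steps might look like this:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the real data (assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hold out 20% as a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# 5-fold CV on the training set picks the hyperparameters; refit=True
# (the default) then retrains the winner on the complete training set.
param_grid = {"max_depth": [3, None], "n_estimators": [25, 50]}  # illustrative
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X_train, y_train)

# Performance estimate on data that was never used for tuning.
test_accuracy = search.score(X_test, y_test)

# Retrain with the chosen hyperparameters on the complete data set.
final_model = RandomForestClassifier(random_state=0, **search.best_params_)
final_model.fit(X, y)

print(search.best_params_, round(test_accuracy, 3))
```

The second workflow would amount to calling search.fit(X, y) directly and using search.best_estimator_; the only score available then is search.best_score_, which was also used to pick the hyperparameters and so tends to be optimistic.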
A single train/test split is one specific case of k-fold validation, where k = 1/test_fraction (k = 5 for an 80/20 split) and only 1 round of validation is run.
So you do not need a separate hold-out split for choosing hyperparameters when you already optimize through k-fold validation.
From most of the articles on the web, I have learned that there are a variety of cross-validation methods. Here I want to be clear about the k-fold cross-validation technique.
In the k-fold cross-validation algorithm, we split the training set into k non-overlapping folds.
Because the training data is split into k folds, we have to train the model in k iterations.
In each iteration, we train the model on (k-1) folds and validate it on the remaining fold.
In each split we can calculate the desired metric(s) of our model.
At the end we can report the cross-validation error by averaging the scores of all iterations.
But what is the final trained model?
Some points in those articles are not clear to me:
Should I re-initialize the model's parameters in each iteration?
I ask this because, if I don't re-initialize the parameters, the model could retain patterns from data that I want to be unseen in the next iteration, and so on.
Should I save the initial parameters of the split in which I got the best score, as the best initial values of the parameters?
Should I retrain the model, initializing it with the initial parameter values from my second question, and then feed it the whole training dataset to get the final trained model?
Alright, so before answering your question, I will step back a bit to explain the purpose of cross-validation and model evaluation. You can read these slides or research more about statistical learning theory if you want to go deeper.
Train/test split
Suppose you have a model with fixed hyperparameters (or none) and you train it on the training split. If you calculate the metrics over the test split, this gives you an estimate of the risk of the model on new data. You then know roughly how this particular model will perform on unseen data.
So we have a learning process B that takes a dataset S (here the training dataset) as well as hyperparameters h, and gives a fitted model m: B(S, h) -> m (training B on S with hyperparameters h gives a model m, with its parameters). We then test this model to evaluate the risk R on the test dataset.
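In symbols, and assuming a per-example loss function \ell and a test set T (both are notation introduced here, not part of the answer above), the quantity measured on the test split could be written as:

```latex
m = B(S_{\text{train}}, h), \qquad
\hat{R}(m) = \frac{1}{|T|} \sum_{(x, y) \in T} \ell\bigl(m(x), y\bigr)
```

For classification with 0-1 loss, \hat{R}(m) is simply the error rate of m on the test split.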
k-fold Cross validation
When doing k-fold cross-validation, you fit k models using the learning process B. Each model is fitted on a different training set, and the risk is computed on non-overlapping samples.
Then, you calculate the mean risk across the folds. A common misconception is that this gives you the performance of the model; that's not true. It gives you the mean (or expected) performance of the learning process B (with hyperparameters h). That means that if you train a new model using B (with hyperparameters h), its expected performance will be around the calculated metrics (of course, this is only an estimate and will not always hold).
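To make the distinction concrete, here is a minimal sketch, assuming scikit-learn, a Random Forest as the learning process B, synthetic data, and the error rate as the risk. The helper name cross_validated_risk is made up for illustration. Note that a fresh, unfitted copy of the model is created in every fold:

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold


def cross_validated_risk(learner, X, y, k=5, seed=0):
    """Mean and per-fold error rate of a learning process over k folds."""
    fold_risks = []
    splitter = KFold(n_splits=k, shuffle=True, random_state=seed)
    for train_idx, val_idx in splitter.split(X):
        # clone() returns an unfitted copy with the same hyperparameters,
        # so every fold starts from scratch.
        model = clone(learner).fit(X[train_idx], y[train_idx])
        fold_risks.append(1.0 - model.score(X[val_idx], y[val_idx]))
    return float(np.mean(fold_risks)), fold_risks


X, y = make_classification(n_samples=300, n_features=8, random_state=0)
learner = RandomForestClassifier(n_estimators=30, max_depth=4, random_state=0)
mean_risk, fold_risks = cross_validated_risk(learner, X, y, k=5)
print(f"mean risk over {len(fold_risks)} folds: {mean_risk:.3f}")
```

None of the k fitted models is returned: the number describes the learning process and its hyperparameters, not any single fitted model.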
For your questions
Yes, you should train the model from scratch, if possible with the same initial parameters (if initialization is not random), to avoid any difference between folds. Using a warm start from the previous parameters can change the learning process and the fit.
No. If initialization is random, let it be; if it is fixed, use the same initial parameters for all folds.
For the two previous questions, if by initial parameters you meant hyperparameters, then you should keep them the same for all folds, otherwise the calculated risk will be useless. If you want to try multiple hyperparameter settings, you have to repeat the cross-validation once per setting, and then you can select the best one based on the calculated risk.
Once you have tuned your hyperparameters, you can train the model on your whole training set. This will give you a model m. Before your cross-validation, you can set aside a small test split to evaluate this final model on unseen data.
When an MLP model is trained using a training and a validation dataset, we can find the number of epochs that best fits the model. Once training is done and we know the best number of epochs, would it be fine to retrain the model with the same number of epochs on the entire data set rather than only the training set, so the model can see more data? Or could this number of epochs give a good MLP model in the first case but an overfitted one in the second?
There is no single answer to that. It depends on factors such as the validation strategy (e.g. k-fold cross-validation vs. using the test set itself as the validation set), whether the model is learning on the fly or offline, and whether the validation set is biased or imbalanced.
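For illustration only, the retraining option described in the question could be sketched as below. This assumes scikit-learn's MLPClassifier, whose partial_fit performs one pass over the data per call, plus synthetic data and an arbitrary budget of 30 epochs. It is one possible procedure, not a recommendation:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)
classes = np.unique(y)
max_epochs = 30  # illustrative budget


def new_mlp():
    # Same architecture and seed each time, so only the data differs.
    return MLPClassifier(hidden_layer_sizes=(16,), random_state=0)


# Phase 1: find the epoch count with the best validation accuracy.
mlp = new_mlp()
val_scores = []
for _ in range(max_epochs):
    mlp.partial_fit(X_train, y_train, classes=classes)  # one pass over the data
    val_scores.append(mlp.score(X_val, y_val))
best_epochs = int(np.argmax(val_scores)) + 1

# Phase 2: retrain from scratch on train + validation for that many epochs.
final_mlp = new_mlp()
for _ in range(best_epochs):
    final_mlp.partial_fit(X, y, classes=classes)

print(best_epochs, round(max(val_scores), 3))
```

One caveat with this sketch: with a fixed batch size, an epoch over the larger data set contains more gradient updates, so the same epoch count does not mean the same amount of training. That is part of why the number may not transfer cleanly.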
You may find the following sources useful:
https://stats.stackexchange.com/questions/11602/training-on-the-full-dataset-after-cross-validation
https://stats.stackexchange.com/questions/402055/fitting-after-training-and-validation
https://stats.stackexchange.com/questions/361494/how-to-correctly-retrain-model-using-all-data-after-cross-validation-with-early
https://www.reddit.com/r/datascience/comments/7xqszr/should_final_model_be_retrained_on_full_dataset/
https://www.quora.com/Should-we-train-neural-networks-on-the-training-set-combined-with-the-validation-set-after-tuning
I split the dataset into train and test sets in an 80-20 ratio, and I predicted and evaluated with the test dataset. My question is: can we predict and evaluate with the whole dataset instead, provided I shuffle the entire dataset first? If not, why shouldn't we? What is wrong with doing it like that?
Data snooping is the quick answer you are looking for.
In other words, your model would seem to perform better on your test data if it had been trained on 100% of the data first, because it has already seen those examples. It would predict seen data with high accuracy, but that score would not tell you whether it can do the same on any sort of unseen data, so an overfitted model would go undetected.
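A small experiment can make this visible. The sketch below assumes scikit-learn and uses purely random labels (an artificial worst case, chosen so that there is nothing real to learn), then compares accuracy on data the model was trained on with accuracy on held-out data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = rng.integers(0, 2, size=1000)  # labels are pure noise: nothing to learn

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

seen_accuracy = model.score(X_train, y_train)  # evaluated on training data
unseen_accuracy = model.score(X_test, y_test)  # evaluated on held-out data
print(f"seen: {seen_accuracy:.2f}  unseen: {unseen_accuracy:.2f}")
```

The accuracy on seen data should come out far higher than on the held-out data, which should hover around chance level. Evaluating on the data used for training would report only the first number.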
You can do it, but the evaluation would not reveal whether the model is overfitted. You can try the k-fold cross-validation method instead.
If you use the whole dataset for both training and evaluation, the model may fit all the variance in the data (overfitting) without you noticing, because its measured performance on the data it has seen will be high. However, the model may perform poorly on unseen data, especially data with a different distribution from your training dataset. One way to guard against this is to: a) split your data into training, validation, and testing datasets (see the note below), b) apply k-fold cross-validation on the training and validation splits, c) verify the performance of your models from step b on the third split (test dataset).
Note: There is no consensus on the naming of the splits. Some sources name them training-validation-testing while others use training-testing-validation.
I was wondering whether a model learns from the test data as well when it is evaluated on it multiple times, leading to an overfitting scenario. Normally we split the data into train and test sets, and I noticed some people split it into 3 sets: train, test and eval, where eval is for the final evaluation of the model. I might be wrong, but my point is that if the above-mentioned scenario is not true, then there is no need for an eval data set.
Need some clarification.
The best way to evaluate how well a model will perform in the 'wild' is to evaluate its performance on a data set it has not seen (i.e., been trained on) -- assuming you have the labels in a supervised learning problem.
People split their data into train/test/eval and use the training data to estimate/learn the model parameters and the test set to tune the model (e.g., by trying different hyperparameter combinations). A model is usually selected based on the hyperparameter combination that optimizes a test metric (regression - MSE, R^2, etc.; classification - AUC, accuracy, etc.). Then the selected model is usually retrained on the combined train + test data set. After retraining, the model is evaluated based on its performance on the eval data set (assuming you have some ground truth labels to evaluate your predictions). The eval metric is what you report as the generalization metric -- that is, how well your model performs on novel data.
Does this help?
Suppose you have train and test datasets. The train dataset is the one for which you know the output; you train your model on the train dataset and try to predict the output of the test dataset.
Most people split the train dataset into train and validation sets. So first you fit your model on the train data and evaluate it on the validation set. Then you run the model on the test dataset.
Now you may be wondering how this helps.
It lets you compare your model's performance on data you used while tuning it (the validation data) and on data that played no part in building it (the test data).
This is where the bias-variance trade-off comes into the picture.
https://machinelearningmastery.com/gentle-introduction-to-the-bias-variance-trade-off-in-machine-learning/
Let's consider a binary classification example where a student's previous semester grades, sports achievements, extracurriculars, etc. are used to predict whether or not they will pass the final semester.
Let's say we have around 10000 samples (data of 10000 students).
Now we split them:
Training set - 6000 samples
Validation set - 2000 samples
Test set - 2000 samples
The data is generally split into three (training set, validation set, and test set) for the following reasons:
1) Feature Selection: Let's assume you have trained the model using some algorithm. You calculate the training accuracy and validation accuracy. You plot the learning curves, find out whether the model is overfitting or underfitting, and make changes (add or remove features, add more samples, etc.). Repeat until you have the best validation accuracy. Now test the model with the test set to get your final score.
2) Parameter Selection: When you use an algorithm like KNN, you need to find the K value that fits the data best. You can plot the validation accuracy for different K values, choose the K with the best validation accuracy, and use it on your test set. (The same applies when you choose n_estimators for Random Forests, etc.)
3) Model Selection: You can also train models with different algorithms and choose the one that fits the data best by comparing their accuracy on the validation set.
So basically, the validation set helps you evaluate your model's performance and shows how to fine-tune it for the best accuracy.
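As a small illustration of the parameter selection case, here is a sketch assuming scikit-learn, 1000 synthetic samples instead of the student data, a 60/20/20 split, and an arbitrary list of candidate K values:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# 600 training samples, then the remaining 400 split evenly
# into validation and test sets.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=600, random_state=0
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=200, random_state=0
)

# Parameter selection: score each candidate K on the validation set only.
val_accuracy = {}
for k in (1, 3, 5, 9, 15):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    val_accuracy[k] = knn.score(X_val, y_val)
best_k = max(val_accuracy, key=val_accuracy.get)

# Final score: touch the test set once, with the chosen K.
best_knn = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
test_accuracy = best_knn.score(X_test, y_test)
print(best_k, round(test_accuracy, 3))
```

The test set is used exactly once, after K has been fixed, so its score is not influenced by the search over K.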
Hope you find this helpful.
Suppose I split my data into training set and validation set. I perform a 5-fold cross-validation on my training set to obtain the optimal hyper-parameters for my model, then I use the optimal hyper-parameters to train my model and apply the resulting model on my validation set. My question is, is it reasonable to combine the training and validation set, and use the hyper-parameters obtained from the training set to build a final model?
It is reasonable if the training data was relatively small and adding the validation set makes your model significantly stronger. However, at the same time, adding new data makes your previously selected hyperparameters possibly suboptimal (it is really hard to say what kind of transformation of the hyperparameters you should apply when you add new data to your training set). Thus you balance two things: the gain in model quality from more data, and the possible loss due to a hard-to-predict change in what the hyperparameters mean. To some extent you can simulate this process to make sure it makes sense. If you have N points in the training data and M in the validation data, you can split the training data further into chunks with the same proportion (so one now has N * (N/(N+M)) points and the other N * (M/(N+M))), tune on the first one, and check whether its optimal hyperparameters transfer (more or less) to the optimal ones on the whole training set. If so, you can safely add the validation data, as they should transfer as well. If they do not, the risk might not be worth the gain.
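A minimal sketch of that simulation follows. It assumes scikit-learn, synthetic data with N = 800 and M = 200, a decision tree in place of the real model, and an illustrative grid over max_depth; the helper name best_depth is made up for this example:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

N, M = 800, 200  # training and validation sizes (illustrative)
X, y = make_classification(n_samples=N + M, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=M, random_state=0
)

# Shrink the training set by the same proportion: N * N / (N + M) = 640
# points; the other N * M / (N + M) = 160 points are set aside.
n_sub = N * N // (N + M)
X_sub, _, y_sub, _ = train_test_split(
    X_train, y_train, train_size=n_sub, random_state=0
)


def best_depth(X_fit, y_fit):
    """Depth chosen by 5-fold CV on the given data (illustrative grid)."""
    grid = {"max_depth": [2, 4, 8, None]}
    search = GridSearchCV(DecisionTreeClassifier(random_state=0), grid, cv=5)
    return search.fit(X_fit, y_fit).best_params_["max_depth"]


depth_sub = best_depth(X_sub, y_sub)       # tuned on the smaller chunk
depth_full = best_depth(X_train, y_train)  # tuned on the whole training set
print("transfers" if depth_sub == depth_full else "does not transfer",
      depth_sub, depth_full)
```

If the two optima agree (or are close), that is some evidence the hyperparameters will also carry over from N to N + M points. A single split is noisy, so repeating this with several random seeds would give a more reliable picture.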