I want to know how to select a model from the k-fold cross-validation method. In k-fold cross-validation we get k models and an overall accuracy score by averaging the k models' accuracies. Can you please suggest a method to obtain the final best model from cross-validation?
K-fold cross-validation is for comparing the performance of two models, not for building models. Say we designed two seq2seq generative models with different structures, our dataset is small, and we want to choose one of them. We can follow the k-fold cross-validation procedure, get an average score for each model, and choose the one with the higher score.
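As a minimal sketch of that comparison, here is what it could look like with scikit-learn's cross_val_score; the two simple classifiers below are only stand-ins for the seq2seq models, which would need a custom training loop:

```python
# Sketch: compare two candidate models by their mean k-fold CV score.
# The classifiers and dataset here are placeholders, not from the question.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

model_a = LogisticRegression(max_iter=1000)
model_b = DecisionTreeClassifier(random_state=0)

# k-fold CV gives one score per fold; compare the averages.
scores_a = cross_val_score(model_a, X, y, cv=5)
scores_b = cross_val_score(model_b, X, y, cv=5)

print("model A mean accuracy:", scores_a.mean())
print("model B mean accuracy:", scores_b.mean())
```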
We don't need to choose one of the k models; instead, we can ensemble the k models into one by using bagging (one of the three main ensemble methods). For more information please refer to this blog: Bagging and Random Forest Ensemble Algorithms for Machine Learning.
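If you do want to combine the k fold models, a rough sketch of that bagging-style combination is to average their predicted probabilities; the base estimator and dataset below are only placeholders:

```python
# Sketch: train one model per CV fold, then ensemble them by averaging
# their probability estimates on new data (a bagging-style combination).
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=500, random_state=0)
X_new = X[:10]  # stand-in for unseen samples

fold_models = []
for train_idx, _ in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    m = clone(LogisticRegression(max_iter=1000)).fit(X[train_idx], y[train_idx])
    fold_models.append(m)

# Average the k models' probabilities and take the most likely class.
avg_proba = np.mean([m.predict_proba(X_new) for m in fold_models], axis=0)
ensemble_pred = avg_proba.argmax(axis=1)
print(ensemble_pred)
```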
Reference:
1. https://stats.stackexchange.com/a/52277/103153
2. https://stats.stackexchange.com/a/19053/103153
K-fold cross-validation is a technique for splitting the data into K folds for training and testing. The goal is to estimate the generalizability of a machine learning model. The model is trained K times, once on each training fold, and then tested on the corresponding test fold.
Suppose I want to compare a Decision Tree and a Logistic Regression model on some arbitrary dataset with 10 Folds. Suppose after training each model on each of the 10 folds and obtaining the corresponding test accuracies, Logistic Regression has a higher mean accuracy across the test folds, indicating that it is the better model for the dataset.
Now, for application and deployment: do I retrain the Logistic Regression model on all the data, or do I create an ensemble from the 10 Logistic Regression models that were trained on the K folds?
The main goal of CV is to validate that we did not get the numbers by chance. So, I believe you can just use a single model for deployment.
If you are already satisfied with the hyper-parameters and model performance, one option is to train on all the data you have and deploy that model.
The other, obvious option is to deploy one of the CV models.
As for the ensemble option, I don't believe it would give significantly better results than a model trained on all the data: each fold model is trained for the same amount of time with similar parameters and the same architecture, and only the training data differs slightly, so they shouldn't behave very differently. In my experience, ensembling helps when the models' outputs differ because of architecture or input data (like different image sizes).
The models trained during k-fold CV should never be reused. CV is only used for reliably estimating the performance of a model.
As a consequence, the standard approach is to re-train the final model on the full training data after CV.
Note that evaluating different models is akin to hyper-parameter tuning, so in theory the performance of the selected best model should be reevaluated on a fresh test set. But with only two models tested I don't think this is important in your case.
You can find more details about k-fold cross-validation here and there.
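As a minimal sketch of that standard workflow, assuming a scikit-learn estimator: use k-fold CV only to estimate performance, then discard the fold models and refit on the full training data.

```python
# Sketch: estimate performance with k-fold CV, then retrain the final
# model on all the training data for deployment.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

model = LogisticRegression(max_iter=1000)
cv_scores = cross_val_score(model, X, y, cv=10)
print("estimated accuracy:", cv_scores.mean())

# The CV fold models themselves are discarded; the deployed model is
# refit on the complete training data.
final_model = model.fit(X, y)
```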
I'm trying to compare multiple species distribution modeling approaches via k-fold cross-validation. Currently I'm calculating the RMSE and AUC to compare model performance. A friend suggested additionally using the sum of log-likelihoods as a metric to compare models. However, one of the models is a random forest fitted with the ranger package. If it is actually possible, how would I calculate the log-likelihood for a random forest model, and would it be a comparable metric to use with the other models (GAM, GLM)?
Thanks for your help.
Let's say I want to use a Random Forest model to predict future data. I'm thinking about two ways of training this model, picking the best hyperparameters, and putting this model in production. The difference between the two approaches is that the first one splits the data into a training and test set, while the second does not.
Can I use both these approaches? Is one of these better to use than the other? I guess one downside of the 2nd approach is that there is no unbiased performance estimate, but does this really matter?
1)
Split data into train and test set (80/20)
Use k-fold cross validation on the train data set.
Choose hyperparameters which perform best on the k validation sets.
Train this best model on complete training data
Get an unbiased performance estimate on test set
Train best model on complete data set
Predict future data using final model
2)
Use k-fold cross validation on the complete data set.
Choose hyperparameters which perform best on the k validation sets.
Train best model on complete data
Predict future data using final model
A single train/test split is just one specific round of k-fold validation with k = 1/split_rate (an 80/20 split corresponds to one fold of 5-fold CV).
So you do not need a separate hold-out split when you already optimize through k-fold cross-validation.
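A sketch of approach 2 under the assumption of a scikit-learn workflow: GridSearchCV runs k-fold CV over the complete data set to pick hyperparameters and, with its default refit=True, retrains the best configuration on all the data before predicting new samples.

```python
# Sketch of approach 2: k-fold CV for hyperparameter selection on the
# complete data set, then refit the best model on everything.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)

param_grid = {"n_estimators": [100, 300], "max_depth": [None, 5]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5)  # refit=True by default
search.fit(X, y)

# search.best_estimator_ is already retrained on the complete data set.
X_future = X[:5]  # stand-in for future data
print(search.best_estimator_.predict(X_future))
```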
I have to solve a 2-class classification problem.
I have 2 classifiers that output probabilities. Both of them are neural networks with different architectures.
Those 2 classifiers are trained and saved into 2 files.
Now I want to build meta classifier that will take probabilities as input and learn weights of those 2 classifiers.
So it will automatically decide how much should I "trust" each of my classifiers.
This model is described here:
http://rasbt.github.io/mlxtend/user_guide/classifier/StackingClassifier/#stackingclassifier
I plan to use the mlxtend library, but it seems that StackingClassifier refits the models.
I do not want to refit because it takes a huge amount of time.
On the other hand, I understand that refitting is necessary to "coordinate" the work of each classifier and "tune" the whole system.
What should I do in such situation?
I won't talk about mlxtend because I haven't worked with it, but I'll tell you the general idea.
You don't have to refit these models to the full training set, but you do have to refit them to parts of it so you can create out-of-fold predictions.
Specifically, split your training data in a few pieces (usually 3 to 10). Keep one piece (i.e. fold) as validation data and train both models on the other folds. Then, predict the probabilities for the validation data using both models. Repeat the procedure treating each fold as a validation set. In the end, you should have the probabilities for all data points in the training set.
Then, you can train a meta-classifier using these probabilities and the ground truth labels. You can use the trained meta-classifier on your new data.
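Here is a minimal sketch of that procedure using scikit-learn's cross_val_predict, with two simple classifiers standing in for your neural networks (which you would refit per fold with your own training code):

```python
# Sketch: build out-of-fold probabilities for two base models, then train
# a meta-classifier on those probabilities (stacking).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=500, random_state=0)

base_a = LogisticRegression(max_iter=1000)
base_b = RandomForestClassifier(random_state=0)

# Out-of-fold probabilities: every training point is predicted by a model
# that never saw it during fitting.
proba_a = cross_val_predict(base_a, X, y, cv=5, method="predict_proba")[:, 1]
proba_b = cross_val_predict(base_b, X, y, cv=5, method="predict_proba")[:, 1]

meta_features = np.column_stack([proba_a, proba_b])
meta_clf = LogisticRegression().fit(meta_features, y)

# At prediction time, stack the probabilities from the fully trained base
# models in the same column order and feed them to meta_clf.
```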
I want to evaluate the performance of different models such as SVM, random forest, CNN, etc., but I only have one dataset. So I split the dataset into a training set and a testing set, train each model on the training data, and test it on the testing data.
My question: can I get the real performance of the different models on only one dataset? For example, if I find that the SVM model gets the best result, should I select the SVM as my final classification model?
It's probably a better idea to validate your models on different test samples through cross-validation to avoid bias. Also check your models against different evaluation metrics depending on your application type. For instance, use recall, accuracy and AUC for each model if it's a classification problem.
Evaluation results can be pretty deceptive and require extensive validation.
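For example, a rough sketch with scikit-learn's cross_validate and several metrics at once; the models below are just placeholders for your own candidates:

```python
# Sketch: cross-validate several models with multiple metrics.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
scoring = ["accuracy", "recall", "roc_auc"]

for name, model in [("SVM", SVC()),
                    ("RandomForest", RandomForestClassifier(random_state=0))]:
    res = cross_validate(model, X, y, cv=5, scoring=scoring)
    print(name, {m: res[f"test_{m}"].mean() for m in scoring})
```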
You can plot the ROC curve for all the models. The model with the highest AUC will be the best model.
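A small sketch of that, assuming scikit-learn models and a held-out test set; RocCurveDisplay overlays one ROC curve (with its AUC) per model on the same axes:

```python
# Sketch: overlay ROC curves for several fitted models on one plot.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import RocCurveDisplay
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ax = plt.gca()
for name, model in [("SVM", SVC()),
                    ("RandomForest", RandomForestClassifier(random_state=0))]:
    model.fit(X_train, y_train)
    RocCurveDisplay.from_estimator(model, X_test, y_test, name=name, ax=ax)
plt.show()
```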