I want to evaluate the performance of different models such as SVM, Random Forest, CNN, etc., but I only have one dataset. So I split the dataset into a training set and a testing set, train the different models on the training data, and test them on the testing data.
My question: can I get the real performance of the different models from only one dataset? For example, if I find that the SVM model gets the best result, should I select the SVM as my final classification model?
It's probably a better idea to validate your models on different test samples through cross-validation to avoid bias. Also check your models against different evaluation metrics depending on your application type. For instance, use recall, accuracy and AUC for each model if it's a classification problem.
Evaluation results can be pretty deceptive and require extensive validation.
You can plot the ROC curve for all the models. The model with the highest AUC will be the best model.
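For instance, here is a minimal sketch with scikit-learn (synthetic data standing in for your single dataset; the two models and the metrics are just examples):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

# Synthetic binary dataset standing in for your single dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

models = {
    "SVM": SVC(probability=True, random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
}

# 5-fold cross-validation, scored with several classification metrics.
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=5,
                            scoring=["accuracy", "recall", "roc_auc"])
    print(name,
          "accuracy=%.3f" % scores["test_accuracy"].mean(),
          "recall=%.3f" % scores["test_recall"].mean(),
          "AUC=%.3f" % scores["test_roc_auc"].mean())
```

The model with the highest cross-validated AUC (or whichever metric matters for your application) is the one to prefer, rather than the winner of a single train/test split.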
K-fold cross-validation is a technique for splitting the data into K folds for training and testing. The goal is to estimate the generalizability of a machine learning model. The model is trained K times, once on each training split, and then tested on the corresponding test fold.
Suppose I want to compare a Decision Tree and a Logistic Regression model on some arbitrary dataset with 10 folds. After training each model on each of the 10 folds and obtaining the corresponding test accuracies, Logistic Regression has a higher mean accuracy across the test folds, indicating that it is the better model for this dataset.
Now, for application and deployment. Do I retrain the Logistic Regression model on all the data, or do I create an ensemble from the 10 Logistic Regression models that were trained on the K-Folds?
The main goal of CV is to validate that we did not get the numbers by chance. So, I believe you can just use a single model for deployment.
If you are already satisfied with the hyper-parameters and model performance, one option is to train on all the data you have and deploy that model.
The other, obvious option is to deploy one of the CV models.
About the ensemble option: I believe it should not give significantly better results than a model trained on all the data, since each model trains for the same amount of time with similar parameters and the same architecture; only the training data is slightly different. So they shouldn't show very different performance. In my experience, an ensemble helps when the outputs of the models differ due to architecture or input data (like different image sizes).
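Putting the "CV to pick the model, then train on all data" route into code, here is a minimal scikit-learn sketch (synthetic data; the two candidates mirror the decision tree vs. logistic regression comparison in the question):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic dataset standing in for "some arbitrary dataset".
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

candidates = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

# Estimate each candidate's generalization accuracy with 10-fold CV.
mean_scores = {name: cross_val_score(model, X, y, cv=10).mean()
               for name, model in candidates.items()}
best_name = max(mean_scores, key=mean_scores.get)

# Retrain the winner on all available data and deploy that single model.
final_model = candidates[best_name].fit(X, y)
```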
The models trained during k-fold CV should never be reused. CV is only used for reliably estimating the performance of a model.
As a consequence, the standard approach is to re-train the final model on the full training data after CV.
Note that evaluating different models is akin to hyper-parameter tuning, so in theory the performance of the selected best model should be reevaluated on a fresh test set. But with only two models tested I don't think this is important in your case.
You can find more details about k-fold cross-validation here and there.
When an MLP model is trained using a training and a validation dataset, we can determine the number of epochs that best fits the model. Once training is done and we know the best number of epochs, in order to get the best MLP model, would it be fine to retrain the model not only on the training set but on the entire dataset, using that same number of epochs, so the model can see more data? Or could this number of epochs give a good MLP model for the first approach but an overfitted one for the second?
There is not a single approach to this. It depends on factors such as the validation strategy (e.g. k-fold cross-validation vs. a separate validation set used as the test set itself), whether the model learns online or offline, and whether the validation set contains biased or imbalanced data.
You may find useful the following sources:
https://stats.stackexchange.com/questions/11602/training-on-the-full-dataset-after-cross-validation
https://stats.stackexchange.com/questions/402055/fitting-after-training-and-validation
https://stats.stackexchange.com/questions/361494/how-to-correctly-retrain-model-using-all-data-after-cross-validation-with-early
https://www.reddit.com/r/datascience/comments/7xqszr/should_final_model_be_retrained_on_full_dataset/
https://www.quora.com/Should-we-train-neural-networks-on-the-training-set-combined-with-the-validation-set-after-tuning
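As a concrete illustration of the "retrain on everything with the best epoch count" idea, here is a minimal sketch with scikit-learn's MLPClassifier and synthetic data (whether this is appropriate still depends on the factors above):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic data standing in for the real problem.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Step 1: find the epoch count with the best validation accuracy.
# Each call to partial_fit performs one pass (epoch) over the training split.
mlp = MLPClassifier(hidden_layer_sizes=(32,), random_state=0)
classes = np.unique(y)
val_scores = []
for epoch in range(200):
    mlp.partial_fit(X_tr, y_tr, classes=classes)
    val_scores.append(mlp.score(X_val, y_val))
best_epochs = int(np.argmax(val_scores)) + 1

# Step 2: retrain from scratch on the entire dataset for that many epochs.
final_mlp = MLPClassifier(hidden_layer_sizes=(32,), max_iter=best_epochs,
                          random_state=0)
final_mlp.fit(X, y)
```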
I was wondering whether a model also learns from the test data when it is evaluated on it multiple times, leading to an overfitting scenario. Normally we split the data into train-test splits, but I noticed some people split it into 3 sets: train, test and eval, where eval is for the final evaluation of the model. I might be wrong, but my point is that if the above scenario is not true, then there is no need for an eval dataset.
Need some clarification.
The best way to evaluate how well a model will perform in the 'wild' is to evaluate its performance on a data set it has not seen (i.e., been trained on) -- assuming you have the labels in a supervised learning problem.
People split their data into train/test/eval and use the training data to estimate/learn the model parameters and the test set to tune the model (e.g., by trying different hyperparameter combinations). A model is usually selected based on the hyperparameter combination that optimizes a test metric (regression - MSE, R^2, etc.; classification - AUC, accuracy, etc.). Then the selected model is usually retrained on the combined train + test data set. After retraining, the model is evaluated based on its performance on the eval data set (assuming you have some ground truth labels to evaluate your predictions). The eval metric is what you report as the generalization metric -- that is, how well your model performs on novel data.
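A rough sketch of that workflow with scikit-learn (synthetic data, and the hyperparameter grid is just an example):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the labelled dataset.
X, y = make_classification(n_samples=3000, n_features=20, random_state=0)

# Split into train (learn parameters), test (tune), and eval (final report).
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.4, random_state=0)
X_test, X_eval, y_test, y_eval = train_test_split(X_hold, y_hold, test_size=0.5, random_state=0)

# Pick the hyperparameter combination that optimizes the test metric.
best_C, best_score = None, -1.0
for C in [0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    score = accuracy_score(y_test, model.predict(X_test))
    if score > best_score:
        best_C, best_score = C, score

# Retrain the selected model on train + test, then report the eval-set metric
# as the generalization estimate.
final = LogisticRegression(C=best_C, max_iter=1000).fit(
    np.vstack([X_train, X_test]), np.concatenate([y_train, y_test]))
print("generalization accuracy:", accuracy_score(y_eval, final.predict(X_eval)))
```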
Does this help?
Suppose you have a train and a test dataset. The train dataset is the one for which you know the output; you train your model on it and then try to predict the output for the test dataset.
Most people split the train dataset further into train and validation sets. So first you fit your model on the train data and evaluate it on the validation set, and then you run the model on the test dataset.
Now you may be wondering how this helps and whether it is of any use.
It helps you understand your model's performance on seen data (the validation data) and unseen data (your test data).
Here comes bias-variance trade-off into picture.
https://machinelearningmastery.com/gentle-introduction-to-the-bias-variance-trade-off-in-machine-learning/
Let's consider a binary classification example where a student's previous semester grades, sports achievements, extracurriculars, etc. are used to predict whether or not they will pass the final semester.
Let's say we have around 10000 samples (data of 10000 students).
Now we split them:
Training set - 6000 samples
Validation set - 2000 samples
Test set - 1000 samples
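In code, that split could look like this (a sketch with scikit-learn; synthetic data stands in for the student records, with sizes chosen to match the split above):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the student records (sizes match the split above).
X, y = make_classification(n_samples=9000, n_features=10, random_state=42)

# 6000 for training, then the remaining 3000 split into 2000 validation + 1000 test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=6000, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, train_size=2000, stratify=y_rest, random_state=42)
```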
The data is generally split into three parts (training set, validation set, and test set) for the following reasons:
1) Feature Selection: Let's assume you have trained the model using some algorithm. You calculate the training accuracy and validation accuracy. You plot the learning curves and find if the model is overfitting or underfitting and make changes (add or remove features, add more samples etc). Repeat until you have the best validation accuracy. Now test the model with the test set to get your final score.
2) Parameter Selection: When you use algorithms like KNN, you need to find the K value that fits the model best. You can plot the accuracy for different K values, choose the one with the best validation accuracy, and then use it for your test set (see the KNN sketch below; the same applies when you tune n_estimators for Random Forests, etc.).
3) Model Selection: Also you can train the model with different algorithms and choose the model which better fits the data by testing out the accuracy using validation set.
So basically, the validation set helps you evaluate your model's performance and decide how to fine-tune it for the best accuracy.
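For the KNN parameter-selection case mentioned above, a minimal sketch (scikit-learn, synthetic data) might look like this:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data, split into train / validation / test sets.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Parameter selection: pick the K with the best validation accuracy.
val_scores = {}
for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    val_scores[k] = knn.score(X_val, y_val)
best_k = max(val_scores, key=val_scores.get)

# Final score: evaluate the chosen K once on the held-out test set.
final_knn = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print("best K:", best_k, "test accuracy:", final_knn.score(X_test, y_test))
```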
Hope you find this helpful.
I have to solve 2 class classification problem.
I have 2 classifiers that output probabilities. Both of them are neural networks of different architecture.
Those 2 classifiers are trained and saved into 2 files.
Now I want to build meta classifier that will take probabilities as input and learn weights of those 2 classifiers.
So it will automatically decide how much should I "trust" each of my classifiers.
This model is described here:
http://rasbt.github.io/mlxtend/user_guide/classifier/StackingClassifier/#stackingclassifier
I plan to use mlxtend library, but it seems that StackingClassifier refits models.
I do not want to refit them because it takes a huge amount of time.
On the other hand, I understand that refitting is necessary to "coordinate" the work of each classifier and "tune" the whole system.
What should I do in such situation?
I won't talk about mlxtend because I haven't worked with it but I'll tell you the general idea.
You don't have to refit these models on the full training set, but you do have to fit them on parts of it so that you can create out-of-fold predictions.
Specifically, split your training data in a few pieces (usually 3 to 10). Keep one piece (i.e. fold) as validation data and train both models on the other folds. Then, predict the probabilities for the validation data using both models. Repeat the procedure treating each fold as a validation set. In the end, you should have the probabilities for all data points in the training set.
Then, you can train a meta-classifier using these probabilities and the ground truth labels. You can use the trained meta-classifier on your new data.
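Here is a rough sketch of that procedure in plain scikit-learn (synthetic data; the two base estimators are just stand-ins for your saved networks, and cross_val_predict does the per-fold fitting described above):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neural_network import MLPClassifier

# Synthetic binary problem; the base models stand in for the two networks.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
base_models = [MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
               RandomForestClassifier(random_state=0)]

# Out-of-fold probabilities: every data point is predicted by a copy of the
# model that never saw it during training (5 folds here).
oof_probs = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# The meta-classifier learns how much to "trust" each base model.
meta = LogisticRegression().fit(oof_probs, y)

# At inference time, feed it the probabilities from the base models
# (re)fitted on the full training set.
fitted = [m.fit(X, y) for m in base_models]
new_probs = np.column_stack([m.predict_proba(X[:5])[:, 1] for m in fitted])
print(meta.predict(new_probs))
```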
What is the difference between classification and prediction in machine learning?
Classification is the prediction of a categorical variable within a predefined vocabulary, based on training examples.
The prediction of numerical (continuous) variables is called regression.
In summary, classification is one kind of prediction, but there are others. Hence, prediction is a more general problem.
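A tiny scikit-learn illustration of the difference (synthetic data, arbitrary model choices):

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LinearRegression, LogisticRegression

# Classification: the prediction is a categorical label from a predefined vocabulary.
X_c, y_c = make_classification(n_samples=200, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_c, y_c)
print(clf.predict(X_c[:3]))   # class labels, e.g. 0 or 1

# Regression: the prediction is a continuous numerical value.
X_r, y_r = make_regression(n_samples=200, random_state=0)
reg = LinearRegression().fit(X_r, y_r)
print(reg.predict(X_r[:3]))   # real-valued outputs
```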
Functionality
Classification is about determining a (categorical) class (or label) for an element in a dataset.
Prediction is about predicting a missing/unknown element (a continuous value) of a dataset.
Working Strategy
In classification, data is grouped into categories based on a training dataset.
In prediction, a classification or regression model is built to predict the outcome (a continuous value).
Example
In a hospital, grouping patients based on their medical records or treatment outcomes is considered classification, whereas using a classification model to predict the treatment outcome for a new patient is considered prediction.
Classification is the process of identifying the category or class label of the new observation to which it belongs.
Prediction is the process of identifying the missing or unavailable numerical data for a new observation.
That is the key difference between classification and prediction: prediction is not concerned with the class label in the way classification is.
Predictions can be made using both regression and classification models. Once a model is trained on the training data, the next phase is to make predictions for data whose real/ground-truth values are either unknown or kept aside to evaluate the performance of the model. If the nature of the problem is determining classes/labels/categories, then it is classification; if the problem is about determining real (numeric) values, then it is regression. In a nutshell, predictions are made with both classification and regression models on the test dataset.
1. Prediction is saying something about what may happen in the future. A prediction may be a kind of classification.
2. Prediction is mostly based on our assumptions about the future,
whereas
1. Classification is the categorization of things or data that we already have. This categorization can be based on any kind of technique or algorithm.
2. Classification is mostly based on our current or past data.