Machine learning: training model from test data - machine-learning

I was wondering if a model trains itself from the test data as well while evaluating it multiple times, leading to a over-fitting scenario. Normally we split the training data into train-test splits and I noticed some people split it into 3 sets of data - train, test and eval. eval is for final evaluation of the model. I might be wrong but my point is that if the above mentioned scenario is not true, then there is no need for an eval data set.
Need some clarification.

The best way to evaluate how well a model will perform in the 'wild' is to evaluate its performance on a data set it has not seen (i.e., been trained on) -- assuming you have the labels in a supervised learning problem.
People split their data into train/test/eval and use the training data to estimate/learn the model parameters and the test set to tune the model (e.g., by trying different hyperparameter combinations). A model is usually selected based on the hyperparameter combination that optimizes a test metric (regression - MSE, R^2, etc.; classification - AUC, accuracy, etc.). Then the selected model is usually retrained on the combined train + test data set. After retraining, the model is evaluated based on its performance on the eval data set (assuming you have some ground truth labels to evaluate your predictions). The eval metric is what you report as the generalization metric -- that is, how well your model performs on novel data.
Does this help?

Consider you have train and test datasets. Train dataset is the one in which you know the output and you train your model on train dataset and you try to predict the output of Test dataset.
Most people split train dataset into train and validation. So first you run your model on train data and evaluate it on validation set. Then again you run the model on test dataset.
Now you are wondering how this will help and of any use?
This helps you to understand your model performance on seen data(validation data) and unseen data(your test data).
Here comes bias-variance trade-off into picture.
https://machinelearningmastery.com/gentle-introduction-to-the-bias-variance-trade-off-in-machine-learning/

Let's consider a binary classification example where a student's previous semester grades, Sports achievements, Extracurriculars etc are used to predict whether or not he will pass the final semester.
Let's say we have around 10000 samples (data of 10000 students).
Now we split them:
Training set - 6000 samples
Validation set - 2000 samples
Test set - 1000 samples
The training data is generally split into three (training set, validation set, and test set) for the following reasons:
1) Feature Selection: Let's assume you have trained the model using some algorithm. You calculate the training accuracy and validation accuracy. You plot the learning curves and find if the model is overfitting or underfitting and make changes (add or remove features, add more samples etc). Repeat until you have the best validation accuracy. Now test the model with the test set to get your final score.
2) Parameter Selection: When you use algorithms like KNN, And you need to find the best K value which fits the model properly. You can plot the accuracy of different K value and choose the best validation accuracy and use it for your test set. (same applies when you find n_estimators for Random forests etc)
3) Model Selection: Also you can train the model with different algorithms and choose the model which better fits the data by testing out the accuracy using validation set.
So basically the Validation set helps you evaluate your model's performance how you must fine-tune it for best accuracy.
Hope you find this helpful.

Related

How to initialize the parameter in the cross validation method and get the final model after training and evaluating using this method?

As I learned about cross-validation algorithm, from most of the articles on the web, there are variety of cross-validation methods. Here I want to be clear about the k-fold cross-validation technique.
In the k-fold cross-validation algorithm, we can split the training set in to k not-overlapped folds.
As we split the training data in to k folds, we have to train the model in k iterations.
So, in each iteration, we train the model with (k-1) folds and validate it with the remained fold.
In each split we can calculate the desired metric(s) of our model.
At the end we can report the training error by taking the average of scores of all iterations.
But what is the final trained model?
Some points in those articles are not clear for me?
Should I initiate model's parameters in each iteration?
I ask this, because if I don’t initialize the parameter's it could save the pattern of data which I want to be unseen in the next iteration and so on…
Should I save the initial parameter of the split in which I gained the best score, as the best initial values of the parameters?
Should I retrain the model initiating it with the initial values of the parameters gained in my second question and then feed it with whole training dataset and gain the final trained model?
Alright so before answering your question I will go a bit back to explain the purpose of cross validation and model evaluation. You can read these slides or research more about statistical learning theory if you want to go deeper.
Train/test split
Suppose you have a model with defined hyperparameter (or none) and you train it on the training split. If you calculate the metrics over the test split, this will give you the risk of the model on new data. Then you know that this particular model will perform like that on unseen data.
So we have a learning process B, that takes a dataset S (here the training dataset) as well as hyperparameters h, and gives a fitted model m; then B(S, h)->m (training B on S with hp h gives a model m, with its parameters). Then we tested this model to evaluate the risk R on the test dataset.
k-fold Cross validation
When doing k-fold cross validation, you fit k models using the learning process B. Each model is fitted on a different training set, and the risk is computed on non overlapping samples.
Then, you calculate the mean risk among the folds. A common mistake is that it gives you the performance of the model, that's not true. This gives you the mean (or expected) performances of the learning process B (and hyperparams h). That means, if you train a new model using B (and hyperparams h), its expected performance will be around the calculated metrics (of course this is not always true).
For your questions
Yes you should train the model from scratch, if possible with the same initial parameters (if initialization is not random) to avoid any difference between folds. Using a warm start with the previous parameters can modify the learning process, and the fitting.
No, if initialization is random let it be, if it is fixed use the same initial parameters for all folds
For the two previous questions, if by initial parameters you meant hyperparameters, then you should keep the same for all folds, otherwise the calculated risk will be useless. If you want to try multiple hyperparameters, you have to repeat the cross validation multiple times, and then you can select the best ones based on the risk calculated.
Once you tuned your hyperparameters you can train the model on your whole training set. This will give you a model m. Before your cross validation you can keep a small test split to evaluate this final model on unseen data

Cross validation: train/test set split necessary?

Let's say I want to use a Random Forest model to predict future data. I'm thinking about two ways of training this model, picking the best hyperparameters, and putting this model in production. The difference between the two approaches is that the first one splits the data into a training and test set, while the second does not.
Can I use both these approaches? Is one of these better to use than the other? I guess one downside of the 2nd approach is that there is no unbiased performance estimate, but does this really matter?
1)
Split data into train and test set (80/20)
Use k-fold cross validation on the train data set.
Choose hyperparameters which perform best on the k validation sets.
Train this best model on complete training data
Get an unbiased performance estimate on test set
Train best model on complete data set
Predict future data using final model
Use k-fold cross validation on the complete data set.
Choose hyperparameters which perform best on the k validation sets.
Train best model on complete data
Predict future data using final model
Cross-validation is one specific case of k-fold validation where k = (1/split_rate) - 1 and doing just 1 round of validation.
So you do not need cross-validation when you already do optimization through k-fold validation.

Can i predict and evaluated model with the whole dataset?

I split the dataset into train and test of 80-20 ration respectively. I predicted and evaluated with test dataset. And my question is can we evaluate and predict model with the whole dataset before that I shuffle entire dataset. Can we do that? If not, why should not we do that? what is wrongdoing like that?
Data Snooping is the quick answer what you are looking for.
In other words, your model would seem outperforming on your test data if it was trained on 100% data first. Model would become an overfitted model that basically would predict seen data with higher accuracy however would fail to do so with any sort of unseen test data.
You can do it, however it would result in overfitted model. You can try k fold cross validation method in stead.
If you use the whole dataset for training, the model will fit to all the variances in data (overfitting). As a result, the performance of your model on similar data will be high. However, the model will exhibit low performance on unseen data with a different distribution compared to your training dataset. One way to prevent this is to: a) split your data into training, validation, and testing datasets (see the note below), b) apply k-fold cross-validation on training and validation splits, c) verify the performance of your models from step b on the third split (test dataset).
Note: There is no consensus on the naming of the splits. Some sources name them training-validation-testing while others use training-testing-validation.

Overfitting and Data splitting

Let's say that I have a data file like:
Index,product_buying_date,col1,col2
0,2013-01-16,34,Jack
1,2013-01-12,43,Molly
2,2013-01-21,21,Adam
3,2014-01-09,54,Peirce
4,2014-01-17,38,Goldberg
5,2015-01-05,72,Chandler
..
..
2000000,2015-01-27,32,Mike
with some more data and I have a target variable y. Assume something as per your convenience.
Now I am aware that we divide the data into 2 parts i.e. Train and Test. And then we divide Train into 70:30, build the model with 70% and validate it with 30%. We tune the parameters so that model does not get overfit. And then predict with the Test data. For example: I divide 2000000 into two equal parts. 1000000 is train and I divide it in validate i.e. 30% of 1000000 which is 300000 and 70% is where I build the model i.e. 700000.
QUESTION: Is the above logic depending upon how the original data splits?
Generally we shuffle the data and then break it into train, validate and test. (train + validate = Train). (Please don't confuse here)
But what if the split is alternate. Like When I divide it in Train and Test first, I give even rows to Test and odd rows to Train. (Here data is initially sort on the basis of 'product_buying_date' column so when i split it in odd and even rows it gets uniformly split.
And when I build the model with Train I overfit it so that I get maximum AUC with Test data.
QUESTION: Isn't overfitting helping in this case?
QUESTION: Is the above logic depending upon how the original data
splits?
If dataset is large(hundred of thousand), you can randomly split the data and you should not have any problem but if dataset is small then you can adopt the different approaches like cross-validation to generate the data set. Cross-validation states that you split you make n number of training-validation set out of your Training set.
suppose you have 2000 data points, you split like
1000 - Training dataset
1000 - testing dataset.
5-cross validation would mean that you would make five 800/200 training/validation dataset.
QUESTION: Isn't overfitting helping in this case?
Number one rule of the machine learning is that, you don't touch the test data set. It's a holly data set that should not be touched.
If you overfit the test data to get maximum AUC score then there won't be any meaning of validation dataset. Foremost aim of any ml algorithm is to reduce the generalization error i.e. algorithm should be able to perform good on unseen data. If you would tune your algorithm with testing data. you won't be able to meet this criteria. In cross-validation also you do not touch your testing set. you select your algorithm. tune its parameter with validation dataset and after you have done with that apply your algorithm to test dataset which is your final score.

How to evaluate the performance of different model on one dataset?

I want to evaluate the performance different model such as SVM, RandForest, CNN etc, I only have one dataset. So I split the dataset to training set and testing set and train different model on this dataset with training data and test with testing dataset.
My question: Can I get the real performance of different model on only one dataset? For example: I found SVM model get the best result, So Should I select the SVM as my final classification model?
Its probably a better idea to cross validate your models with different test samples through cross validation to avoid biases. Also check your models against different evaluation metrics depending upon your application type. For instance use recall, accuracy and AUC for each model if its a classification problem.
Evaluation results can be pretty deceptive and require extensive validation.
You can Plot ROC curve for all the models.The model for which AUC is highest will be best model.

Resources