Does it make sense using validation set together with a crossvalidation approach? - machine-learning

I want to train a MultiLayerPerceptron using Weka with ~200 samples and 6 attributes.
I was thinking of spliting into train and test, and on train, specify a certain % of the train as Validation set.
But then I considered using fold-crossvalidation in order to make a better use of my set of samples.
My question is: Does it make sense to specify a validation set when doing a crossvalidation approach?
And, considering the size of the sample, can you suggest me some numbers for the two approaches? (e.g. 2/3 for train, 1/3 test, and 20% validation... and for CV: 10-fold, 2-fold, or LOOCV instead...)
Thank you in advance!

Your questions sounds like you're not exactly familiar with cross-validation. Like you noticed there is a parameter for the number of folds to run. For a simple cross-validation the parameter defines the number of subsets which are created out of your original set. Let that parameter be k. Your original set is splitted into k equally sized subset. Then for each run, the trainig is run on k-1 subsets and the validation is done on the remaining, k-th subset. Then another permutation of k-1 subsets of the k subsets is used for training, and so on. So you run k iterations of this process.
For your data set size, k=10 sounds alright, but basically everything is worth testing, as long as you take all results into account and don't take the best one.
For the very simple evaluation you just use 2/3 as training set and the 1/3 "test set" is actually your validation set. There are more sophisticated approaches though which use the test set as a termination criterion and another validation set as the final evaluation (since your results might be overfitted to the test set as well, because it defines the termination). For this approach you obviously need to split up the set differently (e.g. 2/3 training, 3/12 test and 1/12 validation).

You should be carefully because you don't have much sample. On the other hand if you want to check your model accuracy you should partition a test set for your model. Cross validation splits your data as train and validation data. Then when we consider that you don't have much sample and your validation set will be so small you can have a look at that approach:
5×2 cross-validation, which uses training cross-validation and
validation sets of equal size (Dietterich (1998))
You can find more info at Ethem Alpaydin's Machine Learning book about it.
Don't memorize the data and don't test on small amounts of sample it looks like a dilemma but the certain decision depends on your data set.

Related

Tuning of hyperparameters and evaluation using cross validation

I have some trouble grasping the standard way of how to use cross validation for hyperparameter tuning and evaluation. I try to do 10-fold CV. Which of the following is the correct way?
All of the data gets used for parameter tuning (e. g. using random grid search with cross validation). This returns the best hyperparameters. Then, a new model is constructed with these hyperparameters, and it can be evaluated by doing a cross validation (nine folds for training, one for testing, in the end the metrics like accuracy or confusion matrix get averaged).
Another way that I found is to first split the data into a train and a test set, and then only perform cross validation on the training set. One then would evaluate using the testset. However, as I understand, that would undermine the whole concept of cross validation, as the idea behind it is to be independent of the split, right?
Lastly, my supervisor told me that I would use eight folds for training, one for hyperparameter estimation and one for testing (and therefore evaluation). However, I could not find any material where this approach had been used. Is that a standard procedure or have I just understood something wrong there?
in general you can split your data into 3 sets.
training set
validation set
test set
Test set:
The test set is the most easiest one to explain.
Once you've created your test set (15-30% of the data). You store this data set somewhere and you DON'T TOUCH that data set ANYMORE until you think you're done.
- The reason for this is simple, once you start to focus on this data set (e.g. to increase the AUC or ...) then you're starting to over fit your data ...
The same also counts for the validation set (+/-). When you're hyper-tuning your parameters etc. you're starting to focusing on this set ... which means that you aren't generalizing anymore. (and a good model, should work on all data, not only on the test and validation set).
That been said, now you've only the training- and validation set over.
Cross validation: some motivations to use cross validation is to have a better generalization and view of your model/data (imagine, that some special cases only existed in the validation set etc. + you don't take a single decision for granted.
- the main downside of e.g. 10-fold cross validation is ... it takes 10 times longer to finish ... but it gives you a more trustworthy results ... (e.g. if you do 10 fold cross validation and your AUC fluctuates from 80 85 75 77 81 65 ... --> then you might have some data issues ... in a perfect scenario, the diff between the AUC should be small ...
Nevertheless ... what I would do (and it also depends on your resources, model, time, data set size)
Create 10 random folds. (and keep track of them)
Do a 10 fold- grid search if possible (to have a general view the importance of each parameter, (you don't have to take small steps ... E.g. Random forest has a max_features parameter, but if you notice that all the models perform less when that value is log2, then you can just eliminate that hyper parameter)
check which hyper-parameters performed well
do a 10 fold random search or grid search in the area's which performed well
but always use the same folds for each new experiment, in this way you can compare the models with each other. + Often you'll see that some folds are more difficult then other folds but they are difficult for all the the models

How do you do cross validation correctly? [duplicate]

I am trying to understand the process of model evaluation and validation in machine learning. Specifically, in which order and how the training, validation and test sets must be used.
Let's say I have a dataset and I want to use linear regression. I am hesitating among various polynomial degrees (hyper-parameters).
In this wikipedia article, it seems to imply that the sequence should be:
Split data into training set, validation set and test set
Use the training set to fit the model (find the best parameters: coefficients of the polynomial).
Afterwards, use the validation set to find the best hyper-parameters (in this case, polynomial degree) (wikipedia article says: "Successively, the fitted model is used to predict the responses for the observations in a second dataset called the validation dataset")
Finally, use the test set to score the model fitted with the training set.
However, this seems strange to me: how can you fit your model with the training set if you haven't chosen yet your hyper-parameters (polynomial degree in this case)?
I see three alternative approachs, I am not sure if they would be correct.
First approach
Split data into training set, validation set and test set
For each polynomial degree, fit the model with the training set and give it a score using the validation set.
For the polynomial degree with the best score, fit the model with the training set.
Evaluate with the test set
Second approach
Split data into training set, validation set and test set
For each polynomial degree, use cross-validation only on the validation set to fit and score the model
For the polynomial degree with the best score, fit the model with the training set.
Evaluate with the test set
Third approach
Split data into only two sets: the training/validation set and the test set
For each polynomial degree, use cross-validation only on the training/validation set to fit and score the model
For the polynomial degree with the best score, fit the model with the training/validation set.
Evaluate with the test set
So the question is:
Is the wikipedia article wrong or am I missing something?
Are the three approaches I envisage correct? Which one would be preferrable? Would there be another approach better than these three?
The Wikipedia article is not wrong; according to my own experience, this is a frequent point of confusion among newcomers to ML.
There are two separate ways of approaching the problem:
Either you use an explicit validation set to do hyperparameter search & tuning
Or you use cross-validation
So, the standard point is that you always put aside a portion of your data as test set; this is used for no other reason than assessing the performance of your model in the end (i.e. not back-and-forth and multiple assessments, because in that case you are using your test set as a validation set, which is bad practice).
After you have done that, you choose if you will cut another portion of your remaining data to use as a separate validation set, or if you will proceed with cross-validation (in which case, no separate and fixed validation set is required).
So, essentially, both your first and third approaches are valid (and mutually exclusive, i.e. you should choose which one you will go with). The second one, as you describe it (CV only in the validation set?), is certainly not (as said, when you choose to go with CV you don't assign a separate validation set). Apart from a brief mention of cross-validation, what the Wikipedia article actually describes is your first approach.
Questions of which approach is "better" cannot of course be answered at that level of generality; both approaches are indeed valid, and are used depending on the circumstances. Very loosely speaking, I would say that in most "traditional" (i.e. non deep learning) ML settings, most people choose to go with cross-validation; but there are cases where this is not practical (most deep learning settings, again loosely speaking), and people are going with a separate validation set instead.
What Wikipedia means is actually your first approach.
1 Split data into training set, validation set and test set
2 Use the
training set to fit the model (find the best parameters: coefficients
of the polynomial).
That just means that you use your training data to fit a model.
3 Afterwards, use the validation set to find the best hyper-parameters
(in this case, polynomial degree) (wikipedia article says:
"Successively, the fitted model is used to predict the responses for
the observations in a second dataset called the validation dataset")
That means that you use your validation dataset to predict its values with the previously (on the training set) trained model to get a score of how good your model performs on unseen data.
You repeat step 2 and 3 for all hyperparameter combinations you want to look at (in your case the different polynomial degrees you want to try) to get a score (e.g. accuracy) for every hyperparmeter combination.
Finally, use the test set to score the model fitted with the training
set.
Why you need the validation set is pretty well explained in this stackexchange question
https://datascience.stackexchange.com/questions/18339/why-use-both-validation-set-and-test-set
In the end you can use any of your three aproaches.
approach:
is the fastest because you only train one model for every hyperparameter.
also you don't need as much data as for the other two.
approach:
is slowest because you train for k folds k classifiers plus the final one with all your training data to validate it for every hyperparameter combination.
You also need a lot of data because you split your data three times and that first part again in k folds.
But here you have the least variance in your results. Its pretty unlikely to get k good classifiers and a good validation result by coincidence. That could happen more likely in the first approach. Cross Validation is also way more unlikely to overfit.
approach:
is in its pros and cons in between of the other two. Here you also have less likely overfitting.
In the end it will depend on how much data you have and if you get into more complex models like neural networks, how much time/calculationpower you have and are willing to spend.
Edit As #desertnaut mentioned: Keep in mind that you should use training- and validationset as training data for your evaluation with the test set. Also you confused training with validation set in your second approach.

Order between using validation, training and test sets

I am trying to understand the process of model evaluation and validation in machine learning. Specifically, in which order and how the training, validation and test sets must be used.
Let's say I have a dataset and I want to use linear regression. I am hesitating among various polynomial degrees (hyper-parameters).
In this wikipedia article, it seems to imply that the sequence should be:
Split data into training set, validation set and test set
Use the training set to fit the model (find the best parameters: coefficients of the polynomial).
Afterwards, use the validation set to find the best hyper-parameters (in this case, polynomial degree) (wikipedia article says: "Successively, the fitted model is used to predict the responses for the observations in a second dataset called the validation dataset")
Finally, use the test set to score the model fitted with the training set.
However, this seems strange to me: how can you fit your model with the training set if you haven't chosen yet your hyper-parameters (polynomial degree in this case)?
I see three alternative approachs, I am not sure if they would be correct.
First approach
Split data into training set, validation set and test set
For each polynomial degree, fit the model with the training set and give it a score using the validation set.
For the polynomial degree with the best score, fit the model with the training set.
Evaluate with the test set
Second approach
Split data into training set, validation set and test set
For each polynomial degree, use cross-validation only on the validation set to fit and score the model
For the polynomial degree with the best score, fit the model with the training set.
Evaluate with the test set
Third approach
Split data into only two sets: the training/validation set and the test set
For each polynomial degree, use cross-validation only on the training/validation set to fit and score the model
For the polynomial degree with the best score, fit the model with the training/validation set.
Evaluate with the test set
So the question is:
Is the wikipedia article wrong or am I missing something?
Are the three approaches I envisage correct? Which one would be preferrable? Would there be another approach better than these three?
The Wikipedia article is not wrong; according to my own experience, this is a frequent point of confusion among newcomers to ML.
There are two separate ways of approaching the problem:
Either you use an explicit validation set to do hyperparameter search & tuning
Or you use cross-validation
So, the standard point is that you always put aside a portion of your data as test set; this is used for no other reason than assessing the performance of your model in the end (i.e. not back-and-forth and multiple assessments, because in that case you are using your test set as a validation set, which is bad practice).
After you have done that, you choose if you will cut another portion of your remaining data to use as a separate validation set, or if you will proceed with cross-validation (in which case, no separate and fixed validation set is required).
So, essentially, both your first and third approaches are valid (and mutually exclusive, i.e. you should choose which one you will go with). The second one, as you describe it (CV only in the validation set?), is certainly not (as said, when you choose to go with CV you don't assign a separate validation set). Apart from a brief mention of cross-validation, what the Wikipedia article actually describes is your first approach.
Questions of which approach is "better" cannot of course be answered at that level of generality; both approaches are indeed valid, and are used depending on the circumstances. Very loosely speaking, I would say that in most "traditional" (i.e. non deep learning) ML settings, most people choose to go with cross-validation; but there are cases where this is not practical (most deep learning settings, again loosely speaking), and people are going with a separate validation set instead.
What Wikipedia means is actually your first approach.
1 Split data into training set, validation set and test set
2 Use the
training set to fit the model (find the best parameters: coefficients
of the polynomial).
That just means that you use your training data to fit a model.
3 Afterwards, use the validation set to find the best hyper-parameters
(in this case, polynomial degree) (wikipedia article says:
"Successively, the fitted model is used to predict the responses for
the observations in a second dataset called the validation dataset")
That means that you use your validation dataset to predict its values with the previously (on the training set) trained model to get a score of how good your model performs on unseen data.
You repeat step 2 and 3 for all hyperparameter combinations you want to look at (in your case the different polynomial degrees you want to try) to get a score (e.g. accuracy) for every hyperparmeter combination.
Finally, use the test set to score the model fitted with the training
set.
Why you need the validation set is pretty well explained in this stackexchange question
https://datascience.stackexchange.com/questions/18339/why-use-both-validation-set-and-test-set
In the end you can use any of your three aproaches.
approach:
is the fastest because you only train one model for every hyperparameter.
also you don't need as much data as for the other two.
approach:
is slowest because you train for k folds k classifiers plus the final one with all your training data to validate it for every hyperparameter combination.
You also need a lot of data because you split your data three times and that first part again in k folds.
But here you have the least variance in your results. Its pretty unlikely to get k good classifiers and a good validation result by coincidence. That could happen more likely in the first approach. Cross Validation is also way more unlikely to overfit.
approach:
is in its pros and cons in between of the other two. Here you also have less likely overfitting.
In the end it will depend on how much data you have and if you get into more complex models like neural networks, how much time/calculationpower you have and are willing to spend.
Edit As #desertnaut mentioned: Keep in mind that you should use training- and validationset as training data for your evaluation with the test set. Also you confused training with validation set in your second approach.

Overfitting and Data splitting

Let's say that I have a data file like:
Index,product_buying_date,col1,col2
0,2013-01-16,34,Jack
1,2013-01-12,43,Molly
2,2013-01-21,21,Adam
3,2014-01-09,54,Peirce
4,2014-01-17,38,Goldberg
5,2015-01-05,72,Chandler
..
..
2000000,2015-01-27,32,Mike
with some more data and I have a target variable y. Assume something as per your convenience.
Now I am aware that we divide the data into 2 parts i.e. Train and Test. And then we divide Train into 70:30, build the model with 70% and validate it with 30%. We tune the parameters so that model does not get overfit. And then predict with the Test data. For example: I divide 2000000 into two equal parts. 1000000 is train and I divide it in validate i.e. 30% of 1000000 which is 300000 and 70% is where I build the model i.e. 700000.
QUESTION: Is the above logic depending upon how the original data splits?
Generally we shuffle the data and then break it into train, validate and test. (train + validate = Train). (Please don't confuse here)
But what if the split is alternate. Like When I divide it in Train and Test first, I give even rows to Test and odd rows to Train. (Here data is initially sort on the basis of 'product_buying_date' column so when i split it in odd and even rows it gets uniformly split.
And when I build the model with Train I overfit it so that I get maximum AUC with Test data.
QUESTION: Isn't overfitting helping in this case?
QUESTION: Is the above logic depending upon how the original data
splits?
If dataset is large(hundred of thousand), you can randomly split the data and you should not have any problem but if dataset is small then you can adopt the different approaches like cross-validation to generate the data set. Cross-validation states that you split you make n number of training-validation set out of your Training set.
suppose you have 2000 data points, you split like
1000 - Training dataset
1000 - testing dataset.
5-cross validation would mean that you would make five 800/200 training/validation dataset.
QUESTION: Isn't overfitting helping in this case?
Number one rule of the machine learning is that, you don't touch the test data set. It's a holly data set that should not be touched.
If you overfit the test data to get maximum AUC score then there won't be any meaning of validation dataset. Foremost aim of any ml algorithm is to reduce the generalization error i.e. algorithm should be able to perform good on unseen data. If you would tune your algorithm with testing data. you won't be able to meet this criteria. In cross-validation also you do not touch your testing set. you select your algorithm. tune its parameter with validation dataset and after you have done with that apply your algorithm to test dataset which is your final score.

Training Random forest with different datasets gives totally different result! Why?

I am working with a dataset which contains 12 attributes including the timestamp and one attribute as the output. Also it has about 4000 rows. Besides there is no duplication in the records. I am trying to train a random forest to predict the output. For this purpose I created two different datasets:
ONE: Randomly chose 80% of data for the training and the other 20% for the testing.
TWO: Sort the dataset based on timestamp and then the first 80% for the training and the last 20% for the testing.
Then I removed the timestamp attribute from the both dataset and used the other 11 attributes for the training and the testing (I am sure the timestamp should not be part of the training).
RESULT: I am getting totally different result for these two datasets. For the first one AUC(Area under the curve) is 85%-90% (I did the experiment several times) and for the second one is 45%-50%.
I do appreciate if someone can help me to know
why I have this huge difference.
Also I need to have the test dataset with the latest timestamps (same as the dataset in the second experiment). Is there anyway to select data from the rest of the dataset for the training to improve the
training.
PS: I already test the random selection from the first 80% of the timestamp and it doesn't improved the performance.
First of all, it is not clear how exactly you're testing. Second, either way, you are doing the testing wrong.
RESULT: I am getting totally different result for these two datasets. For the first one AUC(Area under the curve) is 85%-90% (I did the experiment several times) and for the second one is 45%-50%.
Is this for the training set or the test set? If the test set, that means you have poor generalization.
You are doing it wrong because you are not allowed to tweak your model so that it performs well on the same test set, because it might lead you to a model that does just that, but that generalizes badly.
You should do one of two things:
1. A training-validation-test split
Keep 60% of the data for training, 20% for validation and 20% for testing in a random manner. Train your model so that it performs well on the validation set using your training set. Make sure you don't overfit: the performance on the training set should be close to that on the validation set, if it's very far, you've overfit your training set. Do not use the test set at all at this stage.
Once you're happy, train your selected model on the training set + validation set and test it on the test set you've held out. You should get acceptable performance. You are not allowed to tweak your model further based on the results you get on this test set, if you're not happy, you have to start from scratch.
2. Use cross validation
A popular form is 10-fold cross validation: shuffle your data and split it into 10 groups of equal or almost equal size. For each of the 10 groups, train on the other 9 and test on the remaining one. Average your results on the test groups.
You are allowed to make changes on your model to improve that average score, just run cross validation again after each change (make sure to reshuffle).
Personally I prefer cross validation.
I am guessing what happens is that by sorting based on timestamp, you make your algorithm generalize poorly. Maybe the 20% you keep for testing differ significantly somehow, and your algorithm is not given a chance to capture this difference? In general, your data should be sorted randomly in order to avoid such issues.
Of course, you might also have a buggy implementation.
I would suggest you try cross validation and see what results you get then.

Resources