N fold cross validation in weka for tweet classification - twitter

My aim is to use Weka to classify a set of tweets into three predefined classes (say news, education, sports).
In this case the training set and the test set are different in nature (training on lengthy web pages, testing on tweets of just one or two lines).
How do I perform N-fold cross-validation for this problem?
Do I need to merge the training and test data into a single file and apply N-fold cross-validation to it, or do I need to train the classifier first and then apply N-fold cross-validation to the test set in Weka?
I presume the latter makes sense, but I am not sure. Please help me sort this out.

The nature of your data should be the same in the training set and the test set. Meeting this requirement is what makes the N-fold cross-validation technique usable.
For the problems related to the model selection, have a look at this:
https://vimeo.com/29569892

Related

Tuning of hyperparameters and evaluation using cross validation

I have some trouble grasping the standard way of how to use cross validation for hyperparameter tuning and evaluation. I try to do 10-fold CV. Which of the following is the correct way?
All of the data gets used for parameter tuning (e. g. using random grid search with cross validation). This returns the best hyperparameters. Then, a new model is constructed with these hyperparameters, and it can be evaluated by doing a cross validation (nine folds for training, one for testing, in the end the metrics like accuracy or confusion matrix get averaged).
Another way that I found is to first split the data into a train and a test set, and then only perform cross validation on the training set. One would then evaluate using the test set. However, as I understand it, that would undermine the whole concept of cross validation, as the idea behind it is to be independent of the split, right?
Lastly, my supervisor told me that I would use eight folds for training, one for hyperparameter estimation and one for testing (and therefore evaluation). However, I could not find any material where this approach had been used. Is that a standard procedure or have I just understood something wrong there?
In general you can split your data into 3 sets:
training set
validation set
test set
Test set:
The test set is the easiest one to explain.
Once you've created your test set (15-30% of the data), you store it somewhere and you DON'T TOUCH it ANYMORE until you think you're done.
- The reason for this is simple: once you start to focus on this data set (e.g. to increase the AUC), you start to overfit to it.
The same also holds, more or less, for the validation set. When you're tuning your hyperparameters, you start focusing on this set, which means that you aren't generalizing anymore (and a good model should work on all data, not only on the test and validation sets).
That being said, you are now left with only the training and validation sets.
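As a minimal sketch of the split described above, assuming a 20% test share (within the 15-30% mentioned) and a 20% validation share, which is my own choice for illustration; the function name `three_way_split` is hypothetical:

```python
import random

def three_way_split(items, test_frac=0.2, val_frac=0.2, seed=0):
    """Shuffle once, then carve off a test set and a validation set."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]                 # set aside; look at it only once, at the end
    val = shuffled[n_test:n_test + n_val]    # used while tuning hyperparameters
    train = shuffled[n_test + n_val:]        # used to fit the model
    return train, val, test

train, val, test = three_way_split(range(100))
print(len(train), len(val), len(test))
```

The fixed seed is there so that the same test set is set aside every time the script is rerun.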
Cross validation: some motivations for using cross validation are better generalization and a better view of your model/data (imagine that some special cases only existed in the validation set), and you don't take a single decision for granted.
- The main downside of e.g. 10-fold cross validation is that it takes roughly 10 times longer to finish, but it gives you more trustworthy results. E.g. if you do 10-fold cross validation and your AUC fluctuates from fold to fold (80, 85, 75, 77, 81, 65, ...), then you might have some data issues. In a perfect scenario, the difference in AUC between folds should be small.
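One way to put a number on that fluctuation is to look at the spread of the per-fold scores. The sketch below uses only the six AUC values quoted above (the remaining folds are not given); which summary statistics to use is my assumption, not something the answer prescribes:

```python
from statistics import mean, stdev

fold_auc = [80, 85, 75, 77, 81, 65]   # the example values from the text, in percent

avg = mean(fold_auc)
spread = stdev(fold_auc)               # sample standard deviation across folds
worst_gap = max(fold_auc) - min(fold_auc)

print(f"mean={avg:.1f} stdev={spread:.1f} range={worst_gap}")
```

A large range relative to the mean, as here, is the kind of signal that would prompt a closer look at the data in the weakest fold.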
Nevertheless, here is what I would do (it also depends on your resources, model, time and data set size):
Create 10 random folds (and keep track of them).
Do a 10-fold grid search if possible, to get a general view of the importance of each parameter. You don't have to take small steps. E.g. random forest has a max_features parameter, but if you notice that all the models perform worse when that value is log2, then you can just eliminate that option.
Check which hyperparameters performed well.
Do a 10-fold random search or grid search in the areas which performed well.
But always use the same folds for each new experiment; this way you can compare the models with each other. Often you'll see that some folds are more difficult than others, but they are difficult for all the models.
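The point about reusing the same folds can be sketched as follows. Everything here is illustrative: the data is synthetic, the toy 1-D k-nearest-neighbour classifier is only a stand-in for a real model such as a random forest, and the names `make_folds` and `fold_accuracies` are hypothetical.

```python
import random

def make_folds(n_samples, k=10, seed=42):
    """Shuffle the indices once and deal them into k folds; reuse the result everywhere."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def knn_predict(train, x, n_neighbors):
    """Toy 1-D k-nearest-neighbour classifier (a stand-in for any real model)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:n_neighbors]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0

def fold_accuracies(data, folds, n_neighbors):
    """One accuracy per fold, always in the same fold order."""
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [data[i] for i in range(len(data)) if i not in held]
        hits = sum(knn_predict(train, data[i][0], n_neighbors) == data[i][1]
                   for i in held_out)
        scores.append(hits / len(held_out))
    return scores

rng = random.Random(0)
data = [(rng.gauss(label * 2.0, 1.0), label) for label in [0, 1] * 50]  # synthetic data

folds = make_folds(len(data))            # created once, kept for every experiment
results = {k: fold_accuracies(data, folds, k) for k in (1, 5, 15)}  # hypothetical grid

for k, scores in results.items():
    print(k, round(sum(scores) / len(scores), 3), scores)
```

Because every candidate is scored on identical folds, the per-fold lists line up position by position, so a fold that is hard for one setting can be checked against the others.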

How do you do cross validation correctly? [duplicate]

I am trying to understand the process of model evaluation and validation in machine learning. Specifically, in which order and how the training, validation and test sets must be used.
Let's say I have a dataset and I want to use linear regression. I am hesitating among various polynomial degrees (hyper-parameters).
In this wikipedia article, it seems to imply that the sequence should be:
Split data into training set, validation set and test set
Use the training set to fit the model (find the best parameters: coefficients of the polynomial).
Afterwards, use the validation set to find the best hyper-parameters (in this case, polynomial degree) (wikipedia article says: "Successively, the fitted model is used to predict the responses for the observations in a second dataset called the validation dataset")
Finally, use the test set to score the model fitted with the training set.
However, this seems strange to me: how can you fit your model with the training set if you haven't yet chosen your hyper-parameters (polynomial degree in this case)?
I see three alternative approaches; I am not sure if they would be correct.
First approach
Split data into training set, validation set and test set
For each polynomial degree, fit the model with the training set and give it a score using the validation set.
For the polynomial degree with the best score, fit the model with the training set.
Evaluate with the test set
Second approach
Split data into training set, validation set and test set
For each polynomial degree, use cross-validation only on the validation set to fit and score the model
For the polynomial degree with the best score, fit the model with the training set.
Evaluate with the test set
Third approach
Split data into only two sets: the training/validation set and the test set
For each polynomial degree, use cross-validation only on the training/validation set to fit and score the model
For the polynomial degree with the best score, fit the model with the training/validation set.
Evaluate with the test set
So the question is:
Is the wikipedia article wrong or am I missing something?
Are the three approaches I envisage correct? Which one would be preferable? Would there be another approach better than these three?
The Wikipedia article is not wrong; according to my own experience, this is a frequent point of confusion among newcomers to ML.
There are two separate ways of approaching the problem:
Either you use an explicit validation set to do hyperparameter search & tuning
Or you use cross-validation
So, the standard point is that you always put aside a portion of your data as test set; this is used for no other reason than assessing the performance of your model in the end (i.e. not back-and-forth and multiple assessments, because in that case you are using your test set as a validation set, which is bad practice).
After you have done that, you choose if you will cut another portion of your remaining data to use as a separate validation set, or if you will proceed with cross-validation (in which case, no separate and fixed validation set is required).
So, essentially, both your first and third approaches are valid (and mutually exclusive, i.e. you should choose which one you will go with). The second one, as you describe it (CV only in the validation set?), is certainly not (as said, when you choose to go with CV you don't assign a separate validation set). Apart from a brief mention of cross-validation, what the Wikipedia article actually describes is your first approach.
Questions of which approach is "better" cannot of course be answered at that level of generality; both approaches are indeed valid, and are used depending on the circumstances. Very loosely speaking, I would say that in most "traditional" (i.e. non deep learning) ML settings, most people choose to go with cross-validation; but there are cases where this is not practical (most deep learning settings, again loosely speaking), and people go with a separate validation set instead.
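To make the third approach concrete, here is a minimal sketch in plain Python. The data is synthetic, the candidate degrees and the 80/20 split are my own assumptions, and `fit_poly`, `mse` and `cv_mse` are hypothetical helpers written for this example (a real project would more likely use a library for the fitting).

```python
import random

def fit_poly(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations (fine for small degrees)."""
    m = degree + 1
    # Build the augmented matrix [A^T A | A^T y] for the Vandermonde matrix A.
    aug = [[sum(x ** (i + j) for x in xs) for j in range(m)]
           + [sum(y * x ** i for x, y in zip(xs, ys))] for i in range(m)]
    for col in range(m):                      # Gauss-Jordan elimination, partial pivoting
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(m):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][m] / aug[i][i] for i in range(m)]   # coefficients, lowest power first

def mse(coefs, xs, ys):
    preds = [sum(c * x ** p for p, c in enumerate(coefs)) for x in xs]
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)

def cv_mse(xs, ys, degree, k=5):
    """Average validation MSE over k folds of the training/validation data."""
    total = 0.0
    for f in range(k):
        tr = [i for i in range(len(xs)) if i % k != f]
        va = [i for i in range(len(xs)) if i % k == f]
        coefs = fit_poly([xs[i] for i in tr], [ys[i] for i in tr], degree)
        total += mse(coefs, [xs[i] for i in va], [ys[i] for i in va])
    return total / k

rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(100)]
ys = [1 + 2 * x - 3 * x * x + rng.gauss(0, 0.1) for x in xs]   # synthetic quadratic data

# Step 1: set the test set aside (the points were generated in random order).
x_test, y_test = xs[:20], ys[:20]
x_tv, y_tv = xs[20:], ys[20:]

# Step 2: cross-validate each candidate degree on the training/validation data only.
cv_scores = {d: cv_mse(x_tv, y_tv, d) for d in (1, 2, 3, 4)}
best_degree = min(cv_scores, key=cv_scores.get)

# Step 3: refit on all training/validation data, then score once on the test set.
final = fit_poly(x_tv, y_tv, best_degree)
test_mse = mse(final, x_test, y_test)
print(best_degree, round(test_mse, 4))
```

Note that the test points never enter `cv_mse`; they are used exactly once, after the degree has been fixed.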
What Wikipedia means is actually your first approach.
1. Split data into training set, validation set and test set.
2. Use the training set to fit the model (find the best parameters: coefficients of the polynomial).
That just means that you use your training data to fit a model.
3. Afterwards, use the validation set to find the best hyper-parameters (in this case, polynomial degree) (Wikipedia article says: "Successively, the fitted model is used to predict the responses for the observations in a second dataset called the validation dataset")
That means that you use the model previously trained on the training set to predict the values of your validation dataset, to get a score of how well your model performs on unseen data.
You repeat steps 2 and 3 for all hyperparameter combinations you want to look at (in your case the different polynomial degrees you want to try) to get a score (e.g. accuracy) for every hyperparameter combination.
4. Finally, use the test set to score the model fitted with the training set.
Why you need the validation set is explained pretty well in this Stack Exchange question:
https://datascience.stackexchange.com/questions/18339/why-use-both-validation-set-and-test-set
In the end you can use any of your three approaches.
First approach:
is the fastest because you only train one model for every hyperparameter combination.
Also, you don't need as much data as for the other two.
Second approach:
is the slowest because, for every hyperparameter combination, you train k classifiers for the k folds plus the final one on all your training data to validate it.
You also need a lot of data because you split your data three times and then split the first part again into k folds.
But here you have the least variance in your results. It's pretty unlikely to get k good classifiers and a good validation result by coincidence. That is more likely to happen in the first approach. Cross-validation is also much less likely to overfit.
Third approach:
lies between the other two in its pros and cons. Here, too, overfitting is less likely.
In the end it will depend on how much data you have and, if you get into more complex models like neural networks, how much time and computing power you have and are willing to spend.
Edit: As @desertnaut mentioned, keep in mind that you should use both the training and validation sets as training data for your evaluation with the test set. Also, you confused the training set with the validation set in your second approach.

How to actually use a validation set when using support vector machines in sklearn

While working with SVMs, I am seeing that it is good practice to perform a three-way split on the original data set, something along the lines of, say, a 70/15/15 split.
This split would correspond to 70% for training, 15% for testing, and 15% for what is referred to as "validation."
I'm fairly clear on why this is a good practice, but I'm not sure about the nuts and bolts needed to actually perform this. Lots of online sources discuss the importance, but I can't seem to find a definite (or at least algorithmic) description of the process. For example, sklearn discusses it here but stops before giving any solid tools.
Here's my idea:
Train the algorithm, using training set
Find error rate, using testing set
?? tweak parameters
Get error rate again, using validation set
If anyone could point me in the direction of a good resource, I'd be grateful.
The role of the validation set in all supervised learning algorithms is to find the optimum for the parameters of the algorithm (if there are any).
After splitting your data into training/validation/test data, the best practice for training an algorithm is as follows:
1. Choose initial learning parameters.
2. Train the algorithm using the training set and the parameters.
3. Get the (validation) accuracy using the validation set (cross-validation test).
4. Change the parameters and continue with step 2 until you have found the parameters leading to the best validation accuracy.
5. Get the (test) accuracy using the test set, which represents the actual expected accuracy of your trained algorithm on new unseen data.
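The five steps above can be sketched as a small loop. This is only an illustration under stated assumptions: the data is synthetic, the 70/15/15 split follows the question, and a toy 1-D k-nearest-neighbour classifier with its `n_neighbors` setting stands in for an SVM and its parameters. The function `tune_with_validation` is a hypothetical name, not an sklearn or libsvm API.

```python
import random

def knn_fit(train, n_neighbors):
    return train, n_neighbors      # a k-NN "model" is just the data plus its hyperparameter

def knn_accuracy(model, points):
    train, n_neighbors = model
    hits = 0
    for x, label in points:
        nearest = sorted(train, key=lambda p: abs(p[0] - x))[:n_neighbors]
        pred = 1 if 2 * sum(l for _, l in nearest) > n_neighbors else 0
        hits += pred == label
    return hits / len(points)

def tune_with_validation(candidates, fit, score, train, val, test):
    """Try each parameter setting, keep the best on validation, then test once."""
    best_params, best_model, best_val = None, None, float("-inf")
    for params in candidates:                 # steps 1 and 4: pick / change parameters
        model = fit(train, params)            # step 2: train on the training set
        val_acc = score(model, val)           # step 3: validation accuracy
        if val_acc > best_val:
            best_params, best_model, best_val = params, model, val_acc
    test_acc = score(best_model, test)        # step 5: test accuracy, computed once
    return best_params, best_val, test_acc

rng = random.Random(1)
data = [(rng.gauss(2.0 * label, 1.0), label) for label in [0, 1] * 100]  # synthetic
rng.shuffle(data)
train, val, test = data[:140], data[140:170], data[170:]   # 70/15/15 split

candidates = [1, 3, 7, 15]                    # hypothetical values to try
best_params, best_val, test_acc = tune_with_validation(
    candidates, knn_fit, knn_accuracy, train, val, test)
print(best_params, best_val, test_acc)
```

The test set appears in exactly one line of the loop function, after the search has finished, which is what keeps the reported test accuracy honest.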
There are some more advanced approaches for performing the cross-validation test. Some libraries, such as libsvm, include one of them: k-fold cross validation.
In k-fold cross validation you split your training data randomly into k same-sized portions. You train using k-1 portions and validate on the remaining portion. You do this k times, each time holding out a different portion, and finally use the average of the k results.
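A minimal, library-free sketch of that procedure is shown below. The names `k_fold_splits` and `k_fold_average` are hypothetical, and the toy "model" (the mean of the training values, scored by negative mean squared error) is only there to make the example runnable.

```python
import random

def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)                  # random assignment to portions
    folds = [idx[i::k] for i in range(k)]             # k portions of (nearly) equal size
    for i, val_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx

def k_fold_average(data, k, fit, score):
    """Train on k-1 portions, validate on the remaining one, and average the k scores."""
    scores = []
    for train_idx, val_idx in k_fold_splits(len(data), k):
        model = fit([data[i] for i in train_idx])
        scores.append(score(model, [data[i] for i in val_idx]))
    return sum(scores) / k

# Toy usage: the "model" is the mean of the training values, the score is negative MSE.
data = [float(v) for v in range(20)]
fit_mean = lambda train: sum(train) / len(train)
neg_mse = lambda m, val: -sum((v - m) ** 2 for v in val) / len(val)
print(k_fold_average(data, k=5, fit=fit_mean, score=neg_mse))
```

Every sample lands in the validation portion exactly once, so the averaged score uses all of the training data for validation without ever validating on a sample the model was fitted on.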
Wikipedia is a good source:
http://en.wikipedia.org/wiki/Supervised_learning
http://en.wikipedia.org/wiki/Cross-validation_%28statistics%29

Does it make sense using validation set together with a crossvalidation approach?

I want to train a MultiLayerPerceptron using Weka with ~200 samples and 6 attributes.
I was thinking of splitting the data into train and test, and specifying a certain % of the training set as a validation set.
But then I considered using k-fold cross-validation in order to make better use of my set of samples.
My question is: Does it make sense to specify a validation set when doing a crossvalidation approach?
And, considering the size of the sample, can you suggest me some numbers for the two approaches? (e.g. 2/3 for train, 1/3 test, and 20% validation... and for CV: 10-fold, 2-fold, or LOOCV instead...)
Thank you in advance!
Your question sounds like you're not entirely familiar with cross-validation. As you noticed, there is a parameter for the number of folds to run. For a simple cross-validation the parameter defines the number of subsets which are created out of your original set. Let that parameter be k. Your original set is split into k equally sized subsets. Then for each run, the training is done on k-1 subsets and the validation is done on the remaining k-th subset. Then another combination of k-1 of the k subsets is used for training, and so on. So you run k iterations of this process.
For your data set size, k=10 sounds alright, but basically everything is worth testing, as long as you take all results into account and don't just pick the best one.
For the very simple evaluation you just use 2/3 as the training set, and the 1/3 "test set" is actually your validation set. There are more sophisticated approaches, though, which use the test set as a termination criterion and another validation set for the final evaluation (since your results might be overfitted to the test set as well, because it defines the termination). For this approach you obviously need to split up the set differently (e.g. 2/3 training, 3/12 test and 1/12 validation).
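As a quick worked check of that example split, the sketch below applies the three shares to 200 samples (roughly the size mentioned in the question). The set names follow this answer's usage, and handing the rounding leftover to the training set is my own assumption.

```python
from fractions import Fraction

parts = {"training": Fraction(2, 3), "test": Fraction(3, 12), "validation": Fraction(1, 12)}
assert sum(parts.values()) == 1          # the three shares cover the whole data set

n_samples = 200                          # roughly the data set size in the question
sizes = {name: int(n_samples * share) for name, share in parts.items()}
sizes["training"] += n_samples - sum(sizes.values())   # give rounding leftovers to training
print(sizes)   # {'training': 134, 'test': 50, 'validation': 16}
```

With only about 16 samples left for the final evaluation, a single misclassification moves the accuracy by more than six percentage points, which is one reason to consider cross-validation at this data set size.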
You should be careful because you don't have many samples. On the other hand, if you want to check your model's accuracy you should set aside a test set. Cross-validation splits your data into training and validation data. Given that you don't have many samples and your validation set will be very small, you can have a look at this approach:
5×2 cross-validation, which uses training and validation sets of equal size (Dietterich (1998))
You can find more info about it in Ethem Alpaydin's machine learning book.
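For orientation, the splitting scheme behind 5×2 cross-validation can be sketched as five repetitions of a random halving, with each half used once for training and once for validation. This only generates the ten splits; it does not implement the statistical test from Dietterich's paper, and the function name is hypothetical.

```python
import random

def five_by_two_splits(n_samples, seed=0):
    """Yield 10 (train_indices, validation_indices) pairs: 5 repetitions of 2-fold CV."""
    rng = random.Random(seed)
    for _ in range(5):
        idx = list(range(n_samples))
        rng.shuffle(idx)                       # a fresh random halving per repetition
        half = n_samples // 2
        a, b = idx[:half], idx[half:]
        yield a, b                             # train on one half, validate on the other
        yield b, a                             # then swap the roles

splits = list(five_by_two_splits(200))
print(len(splits), len(splits[0][0]), len(splits[0][1]))   # 10 100 100
```

With about 200 samples, every validation set here has 100 samples, compared with about 20 per fold in 10-fold cross-validation.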
Don't memorize the data, and don't test on too small a sample. It looks like a dilemma, but the right decision depends on your data set.
