10 fold cross validation with weka api - machine-learning

How can I build a classification model with 10-fold cross-validation using the Weka API?
Should I cross-validate the model first, e.g.
evaluation.crossValidateModel(classifier, trainingSet, 10, new Random(1));
and then build a new classifier on this training set, e.g.
NaiveBayes nb2 = new NaiveBayes();
nb2.buildClassifier(train);
and then save and use this model (nb2)?

You are mixing concepts. Cross-validation is used to estimate the performance of learning techniques on a dataset. The common procedure is to perform CV on the whole dataset, usually with 10 folds. Then you can see which learning technique obtains the better performance, and you can use that technique to learn a model over the whole dataset for future predictions.
http://en.wikipedia.org/wiki/Cross-validation_(statistics)
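As a rough sketch of that workflow with the Weka API (the file name "data.arff" and the choice of NaiveBayes are placeholders), cross-validation is used only to estimate performance, and the model that is kept is trained on the full dataset afterwards:

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.SerializationHelper;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidateThenTrain {
    public static void main(String[] args) throws Exception {
        // "data.arff" is a placeholder for your own dataset.
        Instances data = DataSource.read("data.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // 1) Estimate performance with 10-fold cross-validation.
        NaiveBayes nb = new NaiveBayes();
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(nb, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());

        // 2) The fold models are discarded; build the final model on the whole dataset.
        NaiveBayes finalModel = new NaiveBayes();
        finalModel.buildClassifier(data);

        // 3) Save the final model for future predictions.
        SerializationHelper.write("naivebayes.model", finalModel);
    }
}

Note that crossValidateModel trains internal copies of the classifier for each fold, so the classifier you pass in is left untrained; that is why the separate buildClassifier call on the full data is needed for the model you save.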

Related

Application and Deployment of K-Fold Cross-Validation

K-fold cross-validation is a technique for splitting the data into K folds for training and testing. The goal is to estimate the generalizability of a machine learning model. The model is trained K times, each time on the other K-1 folds, and then tested on the corresponding held-out fold.
Suppose I want to compare a Decision Tree and a Logistic Regression model on some arbitrary dataset with 10 folds. Suppose that, after training each model across the 10 folds and obtaining the corresponding test accuracies, Logistic Regression has a higher mean accuracy across the test folds, indicating that it is the better model for the dataset.
Now, for application and deployment. Do I retrain the Logistic Regression model on all the data, or do I create an ensemble from the 10 Logistic Regression models that were trained on the K-Folds?
The main goal of CV is to validate that we did not get the numbers by chance. So, I believe you can just use a single model for deployment.
If you are already satisfied with the hyper-parameters and model performance, one option is to train on all the data that you have and deploy that model.
The other, obvious option is to deploy one of the CV models.
About the ensemble option, I believe it should not give significantly better results than a model trained on all the data: each model trains for the same amount of time, with similar parameters and the same architecture, and only the training data differs slightly, so they shouldn't show very different performance. In my experience, an ensemble helps when the outputs of the models differ because of architecture or input data (like different image sizes).
The models trained during k-fold CV should never be reused. CV is only used for reliably estimating the performance of a model.
As a consequence, the standard approach is to re-train the final model on the full training data after CV.
Note that evaluating different models is akin to hyper-parameter tuning, so in theory the performance of the selected best model should be reevaluated on a fresh test set. But with only two models tested I don't think this is important in your case.
You can find more details about k-fold cross-validation here and there.
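A minimal sketch of that selection-then-retraining step with the Weka API (J48 and Logistic stand in for the decision tree and logistic regression, "data.arff" is a placeholder, and plain accuracy is used as the selection criterion):

import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.Logistic;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SelectAndRetrain {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data.arff"); // placeholder dataset
        data.setClassIndex(data.numAttributes() - 1);

        Classifier[] candidates = { new J48(), new Logistic() };
        Classifier best = null;
        double bestAccuracy = -1;

        // 10-fold CV is only used to estimate each candidate's performance.
        for (Classifier c : candidates) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(c, data, 10, new Random(1));
            if (eval.pctCorrect() > bestAccuracy) {
                bestAccuracy = eval.pctCorrect();
                best = c;
            }
        }

        // The fold models are thrown away; re-train the winner on all the data.
        best.buildClassifier(data);
    }
}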

How to use over-sampled data in cross validation?

I have an imbalanced dataset and I am using SMOTE (Synthetic Minority Oversampling Technique) to perform oversampling. When performing binary classification, I use 10-fold cross-validation on this oversampled dataset.
However, I recently came across the paper Joint use of over- and under-sampling techniques and cross-validation for the development and assessment of prediction models, which mentions that it is incorrect to use the oversampled dataset during cross-validation, as it leads to over-optimistic performance estimates.
What is the correct approach/procedure for using over-sampled data in cross-validation?
To avoid overoptimistic performance estimates from cross-validation in Weka when using a supervised filter, use FilteredClassifier (in the meta category) and configure it with the filter (e.g. SMOTE) and classifier (e.g. Naive Bayes) that you want to use.
For each cross-validation fold Weka will use only that fold's training data to parameterise the filter.
When you do this with SMOTE you won't see a difference in the number of instances in the Weka results window. What is happening is that Weka builds the model on the SMOTE-applied dataset but shows the output of evaluating it on the unfiltered training set, which makes sense in terms of understanding the real performance. Try changing the SMOTE filter settings (e.g. the -P setting, which controls how many additional minority-class instances are generated as a percentage of the number in the dataset) and you should see the performance change, showing you that the filter is actually doing something.
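A minimal setup along those lines might look like the following sketch (it assumes the SMOTE package is installed, and uses Naive Bayes and a placeholder ARFF file):

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.supervised.instance.SMOTE;

public class SmoteInsideCV {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("imbalanced.arff"); // placeholder dataset
        data.setClassIndex(data.numAttributes() - 1);

        // The SMOTE filter (from the Weka SMOTE package) is applied inside each
        // training fold only, because it is wrapped in a FilteredClassifier.
        SMOTE smote = new SMOTE();
        smote.setPercentage(100.0); // the -P setting: 100% extra minority-class instances

        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(smote);
        fc.setClassifier(new NaiveBayes());

        // The held-out evaluation folds themselves are left untouched by SMOTE.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(fc, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}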
The use of FilteredClassifier is illustrated in this video and these slides from the More Data Mining with Weka online course. In this example the filtering operation is supervised discretisation, not SMOTE, but the same principle applies to any supervised filter.
If you have further questions about the SMOTE technique I suggest asking them on Cross Validated and/or the Weka mailing list.
The correct approach would be to first split the data into multiple folds and then apply sampling only to the training data, leaving the validation data as is. In other words, the dataset should be resampled within each training fold in a K-fold fashion.
If you want to achieve this in Python, there is a library for that: https://pypi.org/project/k-fold-imblearn/
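The same split-first, oversample-later procedure can also be coded by hand against the Weka API; this rough sketch (again assuming the SMOTE package is installed; file name and fold count are placeholders) makes it explicit that only each training fold is resampled:

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.instance.SMOTE;

public class ManualFoldOversampling {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("imbalanced.arff"); // placeholder dataset
        data.setClassIndex(data.numAttributes() - 1);
        data.randomize(new Random(1));

        int numFolds = 10;
        Evaluation eval = new Evaluation(data);
        for (int fold = 0; fold < numFolds; fold++) {
            Instances train = data.trainCV(numFolds, fold, new Random(1));
            Instances test = data.testCV(numFolds, fold);

            // Oversample the training fold only; the validation fold stays as is.
            SMOTE smote = new SMOTE();
            smote.setInputFormat(train);
            Instances trainSmoted = Filter.useFilter(train, smote);

            NaiveBayes nb = new NaiveBayes();
            nb.buildClassifier(trainSmoted);
            eval.evaluateModel(nb, test); // accumulate results across folds
        }
        System.out.println(eval.toSummaryString());
    }
}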

How to correctly combine my classifiers?

I have to solve a 2-class classification problem.
I have 2 classifiers that output probabilities. Both of them are neural networks with different architectures.
Those 2 classifiers are trained and saved into 2 files.
Now I want to build a meta-classifier that will take the probabilities as input and learn weights for those 2 classifiers.
So it will automatically decide how much I should "trust" each of my classifiers.
This model is described here:
http://rasbt.github.io/mlxtend/user_guide/classifier/StackingClassifier/#stackingclassifier
I plan to use mlxtend library, but it seems that StackingClassifier refits models.
I do not want to refit because it takes a huge amount of time.
On the other hand, I understand that refitting is necessary to "coordinate" the work of each classifier and "tune" the whole system.
What should I do in such situation?
I won't talk about mlxtend because I haven't worked with it, but I'll tell you the general idea.
You don't have to refit these models to the whole training set, but you do have to fit them to parts of it so you can create out-of-fold predictions.
Specifically, split your training data in a few pieces (usually 3 to 10). Keep one piece (i.e. fold) as validation data and train both models on the other folds. Then, predict the probabilities for the validation data using both models. Repeat the procedure treating each fold as a validation set. In the end, you should have the probabilities for all data points in the training set.
Then, you can train a meta-classifier using these probabilities and the ground truth labels. You can use the trained meta-classifier on your new data.
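As a rough illustration of that out-of-fold procedure with the Weka API (the two Weka classifiers below merely stand in for your two pre-trained networks, and "train.arff" is a placeholder), the meta-classifier here is a Logistic model trained on the stacked probabilities:

import java.util.ArrayList;
import java.util.Random;

import weka.classifiers.AbstractClassifier;
import weka.classifiers.Classifier;
import weka.classifiers.functions.Logistic;
import weka.classifiers.functions.MultilayerPerceptron;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class OutOfFoldStacking {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("train.arff"); // placeholder dataset
        data.setClassIndex(data.numAttributes() - 1);
        data.randomize(new Random(1));

        // Stand-ins for the two pre-trained neural networks in the question.
        Classifier[] bases = { new MultilayerPerceptron(), new Logistic() };
        int numFolds = 5;

        // Meta-level dataset: one probability per base model plus the class label.
        ArrayList<Attribute> atts = new ArrayList<>();
        atts.add(new Attribute("p_model1"));
        atts.add(new Attribute("p_model2"));
        atts.add((Attribute) data.classAttribute().copy());
        Instances meta = new Instances("meta", atts, data.numInstances());
        meta.setClassIndex(meta.numAttributes() - 1);

        for (int fold = 0; fold < numFolds; fold++) {
            Instances train = data.trainCV(numFolds, fold, new Random(1));
            Instances test = data.testCV(numFolds, fold);

            // Fit fresh copies of the base models on the training folds only.
            Classifier[] foldModels = new Classifier[bases.length];
            for (int m = 0; m < bases.length; m++) {
                foldModels[m] = AbstractClassifier.makeCopy(bases[m]);
                foldModels[m].buildClassifier(train);
            }

            // Record out-of-fold probabilities for the held-out instances.
            for (int i = 0; i < test.numInstances(); i++) {
                Instance inst = test.instance(i);
                double[] vals = new double[meta.numAttributes()];
                for (int m = 0; m < bases.length; m++) {
                    vals[m] = foldModels[m].distributionForInstance(inst)[0];
                }
                vals[meta.numAttributes() - 1] = inst.classValue();
                meta.add(new DenseInstance(1.0, vals));
            }
        }

        // Train the meta-classifier on out-of-fold probabilities and true labels.
        Logistic metaClassifier = new Logistic();
        metaClassifier.buildClassifier(meta);
    }
}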

How to evaluate the performance of different models on one dataset?

I want to evaluate the performance of different models such as SVM, Random Forest, CNN, etc., but I only have one dataset. So I split the dataset into a training set and a testing set, train each model on the training data, and test it on the testing data.
My question: can I get the real performance of the different models with only one dataset? For example, if I find that the SVM model gets the best result, should I select the SVM as my final classification model?
It's probably a better idea to validate your models on different test samples through cross-validation to avoid bias. Also check your models against different evaluation metrics depending on your application type. For instance, use recall, accuracy and AUC for each model if it's a classification problem.
Evaluation results can be pretty deceptive and require extensive validation.
You can plot the ROC curve for all the models. The model with the highest AUC will be the best model.
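A small sketch of such a multi-metric comparison with the Weka API (SMO and RandomForest are example candidates and "data.arff" is a placeholder):

import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CompareWithSeveralMetrics {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data.arff"); // placeholder dataset
        data.setClassIndex(data.numAttributes() - 1);

        Classifier[] models = { new SMO(), new RandomForest() };
        for (Classifier model : models) {
            // 10-fold CV gives a less biased estimate than a single train/test split.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(model, data, 10, new Random(1));
            System.out.println(model.getClass().getSimpleName());
            System.out.println("  Accuracy: " + eval.pctCorrect());
            System.out.println("  Recall:   " + eval.weightedRecall());
            // AUC is most meaningful for classifiers with calibrated probability output.
            System.out.println("  AUC:      " + eval.weightedAreaUnderROC());
        }
    }
}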

How do I update a trained model (weka.classifiers.functions.MultilayerPerceptron) with new training data in Weka?

I would like to load a model I trained before and then update this model with new training data. But I found this task hard to accomplish.
I have learnt from Weka Wiki that
Classifiers implementing the weka.classifiers.UpdateableClassifier interface can be trained incrementally.
However, the regression model I trained uses the weka.classifiers.functions.MultilayerPerceptron classifier, which does not implement UpdateableClassifier.
Then I checked the Weka API and it turns out that no regression classifier implements UpdateableClassifier.
How can I train a regression model in Weka, and then update the model later with new training data after loading the model?
I have some data mining experience in Weka as well as in scikit-learn and R, and updatable regression models do not exist in Weka or scikit-learn as far as I know. Some R libraries do support updating regression models (take a look at this linear regression example: http://stat.ethz.ch/R-manual/R-devel/library/stats/html/update.html), so if you are free to switch data mining tools this might help you out.
If you need to stick to Weka then I'm afraid you would probably need to implement such a model yourself, but since I'm not a complete Weka expert please check with the folks on the Weka mailing list (http://weka.wikispaces.com/Weka+Mailing+List).
The SGD classifier implementation in Weka supports multiple loss functions. Among them are two loss functions that are meant for linear regression, viz. the epsilon-insensitive and Huber loss functions.
Therefore one can use a linear regression model trained with SGD, as long as either of these two loss functions is used to minimize the training error.
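As a rough sketch of how that could look with the Weka API (the constant names SGD.HUBER and SGD.TAGS_SELECTION, and the file names, are my assumptions; please verify them against your Weka version):

import weka.classifiers.functions.SGD;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.SelectedTag;
import weka.core.SerializationHelper;
import weka.core.converters.ConverterUtils.DataSource;

public class UpdateableRegression {
    public static void main(String[] args) throws Exception {
        Instances initial = DataSource.read("initial.arff"); // placeholder dataset
        initial.setClassIndex(initial.numAttributes() - 1);  // numeric class attribute

        // SGD implements UpdateableClassifier; with the Huber (or epsilon-insensitive)
        // loss it performs linear regression. The constant names below are assumed.
        SGD sgd = new SGD();
        sgd.setLossFunction(new SelectedTag(SGD.HUBER, SGD.TAGS_SELECTION));
        sgd.buildClassifier(initial);
        SerializationHelper.write("sgd.model", sgd);

        // Later: load the saved model and update it instance by instance.
        SGD loaded = (SGD) SerializationHelper.read("sgd.model");
        Instances newData = DataSource.read("new.arff"); // placeholder new training data
        newData.setClassIndex(newData.numAttributes() - 1);
        for (Instance inst : newData) {
            loaded.updateClassifier(inst);
        }
    }
}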
