Different sampling methods in Weka - machine-learning

I have an unbalanced dataset and am trying to balance it using different resampling methods. So far I know of three sampling methods: 1. random sampling, 2. cross-validation, 3. bootstrap.
I am using Weka for data preprocessing. I know how to use cross-validation in Weka; it is offered as a test option with classifiers such as Random Forest, Naive Bayes, or any other.
But I did not find random sampling or bootstrap.
I found supervised -> instance -> Resample and unsupervised -> instance -> Resample.
I would like to know the difference between the two Resample filters. This post is not very helpful.
How can I use bootstrap in Weka? Are there any options for that?

Bootstrapping isn't really an evaluation method within Weka.
See Eibe's reply on the Wekalist mailing list from a few years ago:
https://list.waikato.ac.nz/hyperkitty/list/wekalist#list.waikato.ac.nz/thread/WIHQM6EK5HM4J4FHOOFNKDINK2EEWYZI/
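For readers who only want to see what a bootstrap sample is, here is a minimal plain-Java sketch. It does not use the Weka API at all; the class and method names are made up for illustration. As far as I know, both Resample filters in Weka draw with replacement by default, and the supervised one additionally takes the class attribute into account, but check the filter documentation for your Weka version before relying on that.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Plain-Java sketch of bootstrap resampling (hypothetical helper, not Weka code). */
final class BootstrapSketch {

    /** Draws n row indices uniformly at random WITH replacement from 0..n-1. */
    static int[] bootstrapIndices(int n, Random rng) {
        int[] drawn = new int[n];
        for (int i = 0; i < n; i++) {
            drawn[i] = rng.nextInt(n);
        }
        return drawn;
    }

    /** Indices that were never drawn ("out-of-bag"); these can serve as a test set. */
    static List<Integer> outOfBagIndices(int n, int[] drawn) {
        boolean[] seen = new boolean[n];
        for (int i : drawn) {
            seen[i] = true;
        }
        List<Integer> oob = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            if (!seen[i]) {
                oob.add(i);
            }
        }
        return oob;
    }

    public static void main(String[] args) {
        int n = 10;
        int[] drawn = bootstrapIndices(n, new Random(42));
        System.out.println("bootstrap rows: " + java.util.Arrays.toString(drawn));
        System.out.println("out-of-bag rows: " + outOfBagIndices(n, drawn));
    }
}
```

Training on the drawn rows and testing on the out-of-bag rows, repeated many times, is the usual way a bootstrap is turned into an evaluation scheme.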

Related

How to use over-sampled data in cross validation?

I have an imbalanced dataset. I am using SMOTE (Synthetic Minority Oversampling Technique) to perform oversampling. When performing the binary classification, I use 10-fold cross-validation on this oversampled dataset.
However, I recently came across the paper "Joint use of over- and under-sampling techniques and cross-validation for the development and assessment of prediction models", which states that it is incorrect to use the oversampled dataset during cross-validation because it leads to overoptimistic performance estimates.
What is the correct procedure for using over-sampled data in cross-validation?
To avoid overoptimistic performance estimates from cross-validation in Weka when using a supervised filter, use FilteredClassifier (in the meta category) and configure it with the filter (e.g. SMOTE) and classifier (e.g. Naive Bayes) that you want to use.
For each cross-validation fold Weka will use only that fold's training data to parameterise the filter.
When you do this with SMOTE you won't see a difference in the number of instances in the Weka results window. What happens is that Weka builds the model on the SMOTE-applied training data but reports the results of evaluating it on the unfiltered data of the held-out fold, which makes sense for understanding the real performance. Try changing the SMOTE filter settings (e.g. the -P setting, which controls how many additional minority-class instances are generated, as a percentage of the number in the dataset) and you should see the performance change, showing that the filter is actually doing something.
The use of FilteredClassifier is illustrated in this video and these slides from the More Data Mining with Weka online course. In this example the filtering operation is supervised discretisation, not SMOTE, but the same principle applies to any supervised filter.
If you have further questions about the SMOTE technique I suggest asking them on Cross Validated and/or the Weka mailing list.
The correct approach is to split the data into folds first, then apply resampling only to the training data of each fold and leave the validation data as is. The resampling is repeated in this way for each of the K folds.
If you want to do this in Python, there is a library for that:
Link to the library: https://pypi.org/project/k-fold-imblearn/
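The fold-then-resample order can be sketched in plain Java as follows. This is only an illustration under stated assumptions: it uses simple random oversampling (duplicating minority rows) as a stand-in for SMOTE, assumes binary labels 0/1, and all names are hypothetical. It is not how Weka or the linked library implement it.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Sketch: oversample inside each training fold only; the test fold stays untouched. */
final class FoldwiseOversampling {

    /** Splits the row indices 0..n-1 into k shuffled folds of near-equal size. */
    static List<List<Integer>> kFolds(int n, int k, Random rng) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            rows.add(i);
        }
        Collections.shuffle(rows, rng);
        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) {
            folds.add(new ArrayList<>());
        }
        for (int i = 0; i < n; i++) {
            folds.get(i % k).add(rows.get(i));
        }
        return folds;
    }

    /** Random oversampling (stand-in for SMOTE): repeats minority rows until both classes match. */
    static List<Integer> oversample(List<Integer> train, int[] labels, Random rng) {
        List<Integer> ones = new ArrayList<>();
        List<Integer> zeros = new ArrayList<>();
        for (int row : train) {
            (labels[row] == 1 ? ones : zeros).add(row);
        }
        List<Integer> minority = ones.size() < zeros.size() ? ones : zeros;
        int target = Math.max(ones.size(), zeros.size());
        List<Integer> balanced = new ArrayList<>(train);
        if (minority.isEmpty()) {
            return balanced; // nothing to duplicate in this fold
        }
        for (int c = minority.size(); c < target; c++) {
            balanced.add(minority.get(rng.nextInt(minority.size())));
        }
        return balanced;
    }

    public static void main(String[] args) {
        int[] labels = new int[20]; // 16 majority rows (0) and 4 minority rows (1)
        for (int i = 16; i < 20; i++) {
            labels[i] = 1;
        }
        Random rng = new Random(7);
        List<List<Integer>> folds = kFolds(labels.length, 4, rng);
        for (int f = 0; f < folds.size(); f++) {
            List<Integer> test = folds.get(f);
            List<Integer> train = new ArrayList<>();
            for (int g = 0; g < folds.size(); g++) {
                if (g != f) {
                    train.addAll(folds.get(g));
                }
            }
            List<Integer> balanced = oversample(train, labels, rng);
            // A classifier would be trained on `balanced` and scored on `test` here.
            System.out.println("fold " + f + ": train " + train.size()
                    + " -> " + balanced.size() + " rows, test " + test.size() + " rows");
        }
    }
}
```

The point of the design is that no duplicated (or synthetic) row can ever end up in the fold it is evaluated on, which is exactly the leak that oversampling before splitting creates.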

The difference between supervised and unsupervised learning when using PCA

I have read the answer here, but I can't apply it to one of my examples, so I probably still don't get it.
Here is my example:
Suppose that my program is trying to learn PCA (principal component analysis).
Or diagonalization process.
I have a matrix, and the answer is its diagonalization:
A = P D P^{-1}
If I understand correctly:
In supervised learning I will have all the trials together with their errors.
My question is:
What will I have in unsupervised learning?
Will I have error for each trial as I go along in trials and not all errors in advance? Or is it something else?
First of all, PCA is used for neither classification nor clustering. It is an analysis tool that finds the principal components of the data, which can be used for, e.g., dimensionality reduction. PCA uses no labels, so the distinction between supervised and unsupervised learning has little relevance to what it computes.
However, PCA can often be applied to data before a learning algorithm is used.
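To connect the diagonalization from the question with PCA, the standard relationship can be written out as below. The symbols X (an n-by-d data matrix with mean-centred columns), C, Z, and k are notation introduced here for illustration; they do not appear in the question.

```latex
\begin{align}
C &= \frac{1}{n-1} X^{\top} X
  && \text{sample covariance matrix (symmetric)} \\
C &= P D P^{-1} = P D P^{\top},
  \quad D = \operatorname{diag}(\lambda_1, \dots, \lambda_d)
  && \text{$P$ orthogonal, so } P^{-1} = P^{\top} \\
Z &= X P_k
  && \text{projection onto the first $k$ columns of } P
\end{align}
```

The columns of P are the principal components and the eigenvalues in D are the variances along them. No labels or per-trial errors enter any of these steps.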
In supervised learning, you have (as you say) a labeled set of data with "errors".
In unsupervised learning you don't have any labels, i.e., you can't validate the result against known answers. All you can do is cluster the data somehow. The goal is often to achieve clusters that are internally more homogeneous. Success can be measured, e.g., using the within-cluster variance metric.
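As a rough illustration of the within-cluster variance idea, the sketch below computes the within-cluster sum of squares (the sum of squared distances from each point to its cluster centroid) for a given assignment. The class name and the toy points are made up; smaller values mean more homogeneous clusters.

```java
/** Sketch: within-cluster sum of squares for a fixed cluster assignment. */
final class WithinClusterVariance {

    /** points[i] belongs to cluster[i], with cluster ids in 0..k-1. */
    static double wcss(double[][] points, int[] cluster, int k) {
        int dim = points[0].length;
        double[][] centroid = new double[k][dim];
        int[] count = new int[k];
        // Accumulate per-cluster coordinate sums, then divide to get centroids.
        for (int i = 0; i < points.length; i++) {
            count[cluster[i]]++;
            for (int d = 0; d < dim; d++) {
                centroid[cluster[i]][d] += points[i][d];
            }
        }
        for (int c = 0; c < k; c++) {
            for (int d = 0; d < dim; d++) {
                if (count[c] > 0) {
                    centroid[c][d] /= count[c];
                }
            }
        }
        // Sum squared distances from each point to its own centroid.
        double total = 0.0;
        for (int i = 0; i < points.length; i++) {
            for (int d = 0; d < dim; d++) {
                double diff = points[i][d] - centroid[cluster[i]][d];
                total += diff * diff;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        double[][] pts = {{0, 0}, {0, 2}, {10, 0}, {10, 2}};
        System.out.println(wcss(pts, new int[]{0, 0, 1, 1}, 2)); // prints 4.0 (tight clusters)
        System.out.println(wcss(pts, new int[]{0, 1, 0, 1}, 2)); // prints 100.0 (poor clusters)
    }
}
```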
Supervised Learning:
-> You give labeled example data as input, along with the correct answers.
-> The algorithm learns from it and starts predicting the correct result based on the input.
example: email spam filter
Unsupervised Learning:
-> You give just the data and don't provide any labels or correct answers.
-> The algorithm automatically analyses patterns in the data.
example: Google News

Machine Learning Text Classification technique

I am new to machine learning. I am working on a project where machine learning concepts need to be applied.
Problem Statement:
I have a large number (say 3000) of keywords. These need to be classified into seven fixed categories. Each category has training data (sample keywords). I need to come up with an algorithm that, when a new keyword is passed to it, predicts which category the keyword belongs to.
I am not aware of which text classification technique needs to be applied for this. Are there any tools that can be used?
Please help.
Thanks in advance.
This comes under linear classification. You can use a Naive Bayes classifier for this. Most ML frameworks have an implementation of Naive Bayes, e.g. Mahout.
Yes, I would also suggest using Naive Bayes, which is more or less the baseline classification algorithm here. On the other hand, there are obviously many other algorithms. Random forests and Support Vector Machines come to mind. See http://machinelearningmastery.com/use-random-forest-testing-179-classifiers-121-datasets/ If you use a standard toolkit, such as Weka, RapidMiner, etc., these algorithms should be available. There is also OpenNLP for Java, which comes with a maximum entropy classifier.
You could use the Word2Vec cosine distance between a description of each of your categories and the keywords in the dataset, and then simply match each keyword to the category with the closest distance.
Alternatively, you could create a training dataset from keywords that are already matched to a category and use any ML classifier, for example one based on artificial neural networks, with the vector of a keyword's cosine distances to each category as the input to your model. But it could require a large quantity of training data to reach good accuracy. For example, the MNIST dataset contains 70000 samples, and it allowed me to reach 99.62% cross-validation accuracy with a simple CNN; for another dataset with only 2000 samples I was able to reach only about 90% accuracy.
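The nearest-category matching described above could look roughly like this in Java. The three-dimensional vectors are made up for the example; real ones would come from a trained Word2Vec (or similar) model, which this sketch does not include. Note that it maximises cosine similarity, which is the same as minimising cosine distance.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: assign a keyword to the category whose vector is closest by cosine similarity. */
final class NearestCategory {

    /** Cosine similarity of two equal-length, non-zero vectors. */
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Returns the name of the category with the highest cosine similarity to the keyword. */
    static String nearest(double[] keyword, Map<String, double[]> categories) {
        String best = null;
        double bestSim = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, double[]> e : categories.entrySet()) {
            double sim = cosine(keyword, e.getValue());
            if (sim > bestSim) {
                bestSim = sim;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical embedding vectors, invented for this example.
        Map<String, double[]> categories = new LinkedHashMap<>();
        categories.put("sports", new double[]{0.9, 0.1, 0.0});
        categories.put("finance", new double[]{0.1, 0.9, 0.2});
        double[] keyword = {0.8, 0.2, 0.1};
        System.out.println(nearest(keyword, categories));
    }
}
```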
There are many classification algorithms. Your example looks like a text classification problem; some good classifiers to try out would be SVM and Naive Bayes. For SVM, the liblinear and libshorttext classifiers are good options (and have been used in many industrial applications):
liblinear: https://www.csie.ntu.edu.tw/~cjlin/liblinear/
libshorttext: https://www.csie.ntu.edu.tw/~cjlin/libshorttext/
They are also included with ML tools such as scikit-learn and WEKA.
Even once an algorithm is chosen, it still takes some work to build and validate a practically useful classifier. One of the challenges is to mix discrete (boolean and enumerable) and continuous ('numbers') predictive variables seamlessly. Some algorithmic preprocessing is generally necessary.
Neural networks do offer the possibility of using both types of variables. However, they require skilled data scientists to yield good results. A straightforward option is to use an online classifier web service like Insight Classifiers to build and validate a classifier in one go. N-fold cross-validation is used there.
You can represent the presence or absence of each word in a separate column. The outcome variable is the desired category.
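A minimal sketch of that presence/absence encoding, assuming whitespace tokenisation and lower-casing (both are my simplifications, and the class name is hypothetical): each distinct word in the training keywords becomes one column, and a row holds 1 where the word occurs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Sketch: one 0/1 column per vocabulary word. */
final class PresenceAbsenceEncoding {

    /** Lower-cases the text and splits it on whitespace. */
    static List<String> tokens(String text) {
        String trimmed = text.toLowerCase().trim();
        if (trimmed.isEmpty()) {
            return Collections.<String>emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    /** Collects the distinct words of all training texts, in order of first appearance. */
    static List<String> vocabulary(List<String> texts) {
        Set<String> vocab = new LinkedHashSet<>();
        for (String text : texts) {
            vocab.addAll(tokens(text));
        }
        return new ArrayList<>(vocab);
    }

    /** Returns a row with 1 in each column whose word occurs in the text; unknown words are ignored. */
    static int[] encode(String text, List<String> vocab) {
        int[] row = new int[vocab.size()];
        for (String token : tokens(text)) {
            int column = vocab.indexOf(token);
            if (column >= 0) {
                row[column] = 1;
            }
        }
        return row;
    }

    public static void main(String[] args) {
        List<String> training = Arrays.asList("cheap flights", "cheap hotel deals");
        List<String> vocab = vocabulary(training);
        System.out.println(vocab);
        System.out.println(Arrays.toString(encode("hotel flights", vocab)));
    }
}
```

The resulting rows, together with the category as the outcome column, can be fed to any of the classifiers mentioned in the other answers.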

How to choose classifier on specific dataset

Given a dataset, normally a matrix of m instances by n features, how do I choose the classifier that is most appropriate for it?
This is like asking which algorithm to use for a problem about prime numbers. No single algorithm solves every problem; each problem has a finite number of algorithms suited to it. In machine learning you can apply several different algorithms to one type of problem.
If the matrix contains real-valued features, you can use the KNN algorithm. If the matrix has words as features, you can use a Naive Bayes classifier, which is one of the best for text classification. Machine learning has tons of algorithms; you can read about them and apply whichever fits your problem best.
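For the real-valued case, a bare-bones k-nearest-neighbours classifier can be sketched as below. It assumes Euclidean distance and integer class labels, and the toy data are made up; a toolkit implementation would add feature scaling, tie handling, and faster neighbour search.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

/** Sketch: k-nearest-neighbours classification with squared Euclidean distance. */
final class KnnSketch {

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    /** Returns the most frequent label among the k training rows closest to the query. */
    static int classify(double[][] x, int[] y, double[] query, int k) {
        Integer[] order = new Integer[x.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Sort the row indices by their distance to the query point.
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> squaredDistance(x[i], query)));
        Map<Integer, Integer> votes = new HashMap<>();
        int best = -1;
        int bestVotes = 0;
        for (int j = 0; j < Math.min(k, order.length); j++) {
            int label = y[order[j]];
            int count = votes.merge(label, 1, Integer::sum);
            if (count > bestVotes) {
                bestVotes = count;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] x = {{0, 0}, {0, 1}, {5, 5}, {6, 5}};
        int[] y = {0, 0, 1, 1};
        System.out.println(classify(x, y, new double[]{5, 6}, 3)); // prints 1
    }
}
```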
An interesting but much more general map I found:
http://scikit-learn.org/stable/tutorial/machine_learning_map/
If you have Weka, you can use the Experimenter to run different algorithms on the same dataset and compare the resulting models.
This project compares many different classifiers on different typical datasets.
If you have no idea, you could use the simple tool Auto-WEKA, which will test all the different classifiers you selected within different constraints. Before using Auto-WEKA, you may need to convert your data to ARFF, either with Weka or manually (there are many tutorials on YouTube).
The best classifier depends on your data (binary/string/real/tags, patterns, distribution...), what kind of output to predict (binary class / multi-class / evolving classes / a value from regression?) and the expected performance (time, memory, accuracy). It also depends on whether you want to update your model frequently or not (i.e. if it is a stream, better use an online classifier).
Please note that the best classifier may not be one but an ensemble of different classifiers.
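One simple way to combine classifiers is a majority vote, sketched below. Each ensemble member is modelled as a function from a feature vector to an integer label; the three threshold rules in `main` are made-up stand-ins for real trained models.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToIntFunction;

/** Sketch: majority-vote ensemble over arbitrary classifiers. */
final class MajorityVoteEnsemble {

    /** Returns the label predicted by the most members (the first to reach the top count wins ties). */
    static int predict(List<ToIntFunction<double[]>> members, double[] x) {
        Map<Integer, Integer> votes = new HashMap<>();
        int best = -1;
        int bestVotes = 0;
        for (ToIntFunction<double[]> member : members) {
            int label = member.applyAsInt(x);
            int count = votes.merge(label, 1, Integer::sum);
            if (count > bestVotes) {
                bestVotes = count;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical members: simple threshold rules standing in for trained classifiers.
        List<ToIntFunction<double[]>> members = new ArrayList<>();
        members.add(x -> x[0] > 0.5 ? 1 : 0);
        members.add(x -> x[1] > 0.5 ? 1 : 0);
        members.add(x -> x[0] + x[1] > 1.5 ? 1 : 0);
        System.out.println(predict(members, new double[]{0.9, 0.2})); // prints 0 (votes: 1, 0, 0)
        System.out.println(predict(members, new double[]{0.9, 0.9})); // prints 1 (votes: 1, 1, 1)
    }
}
```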

WEKA's MultilayerPerceptron: training then training again

I am trying to do the following with Weka's MultilayerPerceptron:
Train with a small subset of the training Instances for a portion of the input epochs,
Train with the whole set of Instances for the remaining epochs.
However, when I do the following in my code, the network seems to reset itself to start with a clean slate the second time.
mlp.setTrainingTime(smallTrainingSetEpochs);
mlp.buildClassifier(smallTrainingSet);
mlp.setTrainingTime(wholeTrainingSetEpochs);
mlp.buildClassifier(wholeTrainingSet);
Am I doing something wrong, or is this the way that the algorithm is supposed to work in weka?
If you need more information to answer this question, please let me know. I am kind of new to programming with weka and am unsure as to what information would be helpful.
This thread on the Weka mailing list is a question very similar to yours.
It seems that this is how Weka's MultilayerPerceptron is supposed to work. It is designed to be a 'batch' learner, whereas you are trying to use it incrementally. Only classifiers that implement weka.classifiers.UpdateableClassifier can be trained incrementally.
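The difference between the two training styles can be shown with a toy perceptron in plain Java. This is not Weka's code: `build` and `update` are hypothetical names that only mimic the behaviour described above, where a batch-style build always starts from fresh weights and an incremental update keeps what was learned.

```java
/** Toy perceptron contrasting batch-style building with incremental updates (not Weka code). */
final class BatchVsIncremental {

    private double[] w; // w[0] is the bias, w[1..] are the feature weights

    /** Batch style: always re-initialises the weights, so earlier training is lost. */
    void build(double[][] x, int[] y, int epochs) {
        w = new double[x[0].length + 1];
        for (int e = 0; e < epochs; e++) {
            for (int i = 0; i < x.length; i++) {
                update(x[i], y[i]);
            }
        }
    }

    /** Incremental style: nudges the existing weights with one example (labels are 0 or 1). */
    void update(double[] xi, int yi) {
        int error = yi - predict(xi); // perceptron learning rule
        w[0] += error;
        for (int d = 0; d < xi.length; d++) {
            w[d + 1] += error * xi[d];
        }
    }

    int predict(double[] xi) {
        double sum = w[0];
        for (int d = 0; d < xi.length; d++) {
            sum += w[d + 1] * xi[d];
        }
        return sum > 0 ? 1 : 0;
    }

    double[] weights() {
        return w.clone();
    }

    public static void main(String[] args) {
        double[][] x = {{0, 0}, {0, 1}, {1, 0}, {1, 1}};
        int[] y = {0, 0, 0, 1};
        BatchVsIncremental model = new BatchVsIncremental();
        model.build(x, y, 1);
        System.out.println(java.util.Arrays.toString(model.weights())); // learned something
        model.build(x, y, 0);
        System.out.println(java.util.Arrays.toString(model.weights())); // back to all zeros
    }
}
```

In this toy, continuing training means calling `update` on the already built model rather than calling `build` again, which is the same division of labour that the UpdateableClassifier interface expresses in Weka.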
