How to do these in weka: cross validation + imbalanced data + feature selection - machine-learning

I have an imbalanced dataset (classification dataset).
Using the Weka platform, I want to apply these techniques: cross-validation, balancing the training folds, and feature selection.
So, I did the following (From Classify tab):
I chose the 10-fold cross-validation technique.
I chose FilteredClassifier, and edited its properties by:
choosing a classifier.
choosing the MultiFilter filter, and editing its properties by adding two filters:
SMOTE as the first filter.
AttributeSelection as the second filter.
Is my work correct?

The Preprocess panel and the Select attributes tab should only be used to explore the data (hence the name Weka Explorer).
To incorporate preprocessing such as balancing the training data or selecting attributes, use Weka's meta classifiers (you can nest them):
balancing: FilteredClassifier with SMOTE
feature selection: AttributeSelectedClassifier with your choice of search/evaluation method and base classifier
Balancing is the outermost classifier, which uses the feature-selection classifier as its base classifier.
You can use the Weka Experimenter to compare various setups and obtain statistical-significance results. See the Weka manual PDF for details.
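To see why this nesting matters, here is a rough, stdlib-only Python sketch of the same structure. This is NOT the Weka API: the class names, the 1-NN base classifier, and the duplicate-based "balancing" stand-in for SMOTE are all invented for illustration; in Weka the real pieces are FilteredClassifier, AttributeSelectedClassifier, and the SMOTE filter.

```python
# Rough sketch of the nesting (NOT the Weka API). The outer wrapper balances
# the TRAINING data only; the inner wrapper selects attributes; the base
# classifier sits innermost. All names here are illustrative inventions.

class OneNN:
    """Trivial base classifier: predict the label of the nearest training row."""
    def fit(self, X, y):
        self.X, self.y = X, y
    def predict_one(self, row):
        dists = [sum((a - b) ** 2 for a, b in zip(row, t)) for t in self.X]
        return self.y[dists.index(min(dists))]

class AttributeSelected:
    """Inner wrapper: keep only some attribute indices, then train the base."""
    def __init__(self, base, keep):
        self.base, self.keep = base, keep
    def fit(self, X, y):
        self.base.fit([[row[i] for i in self.keep] for row in X], y)
    def predict_one(self, row):
        return self.base.predict_one([row[i] for i in self.keep])

class Filtered:
    """Outer wrapper: balance the TRAINING data only, then train the base."""
    def __init__(self, base, filter_fn):
        self.base, self.filter_fn = base, filter_fn
    def fit(self, X, y):
        Xb, yb = self.filter_fn(X, y)  # filter fitted/applied on training data only
        self.base.fit(Xb, yb)
    def predict_one(self, row):
        return self.base.predict_one(row)  # test rows pass through unfiltered

def duplicate_minority(X, y):
    """Naive stand-in for SMOTE: duplicate minority rows until classes balance."""
    counts = {c: y.count(c) for c in set(y)}
    minority = min(counts, key=counts.get)
    deficit = max(counts.values()) - counts[minority]
    rows = [r for r, c in zip(X, y) if c == minority]
    extra = [rows[i % len(rows)] for i in range(deficit)]
    return X + extra, y + [minority] * deficit

# balancing is outermost; feature selection sits inside it
model = Filtered(AttributeSelected(OneNN(), keep=[0]), duplicate_minority)
model.fit([[1, 9], [2, 8], [3, 7], [4, 6]], ["no", "no", "no", "yes"])
```

The key property mirrored here is that the balancing filter only ever sees the data passed to fit, so held-out test rows are never resampled.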

Related

How to use over-sampled data in cross validation?

I have an imbalanced dataset. I am using SMOTE (Synthetic Minority Oversampling Technique) to perform oversampling. When performing binary classification, I use 10-fold cross-validation on this oversampled dataset.
However, I recently came across the paper Joint use of over- and under-sampling techniques and cross-validation for the development and assessment of prediction models, which mentions that it is incorrect to use the oversampled dataset during cross-validation, as that leads to overoptimistic performance estimates.
What is the correct approach/procedure for using over-sampled data in cross-validation?
To avoid overoptimistic performance estimates from cross-validation in Weka when using a supervised filter, use FilteredClassifier (in the meta category) and configure it with the filter (e.g. SMOTE) and classifier (e.g. Naive Bayes) that you want to use.
For each cross-validation fold Weka will use only that fold's training data to parameterise the filter.
When you do this with SMOTE you won't see a difference in the number of instances in the Weka results window. What is happening is that Weka builds the model on the SMOTE-applied dataset but shows the output of evaluating it on the unfiltered data, which makes sense in terms of understanding the real performance. Try changing the SMOTE filter settings (e.g. the -P setting, which controls how many additional minority-class instances are generated as a percentage of the number in the dataset) and you should see the performance change, showing you that the filter is actually doing something.
The use of FilteredClassifier is illustrated in this video and these slides from the More Data Mining with Weka online course. In this example the filtering operation is supervised discretisation, not SMOTE, but the same principle applies to any supervised filter.
If you have further questions about the SMOTE technique I suggest asking them on Cross Validated and/or the Weka mailing list.
The correct approach is to first split the data into multiple folds, then apply sampling only to the training data and leave the validation data as is. In other words, each training fold is resampled independently, while the corresponding validation fold stays untouched.
If you want to achieve this in Python, there is a library for that: https://pypi.org/project/k-fold-imblearn/
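The fold-then-oversample procedure described above can be sketched in plain Python with the standard library only. The duplicate-based oversampler and the 1-NN learner below are illustrative stand-ins, not SMOTE or a real classifier:

```python
# Sketch of the CORRECT procedure: split into folds first, then oversample
# ONLY the training part of each fold; the test fold is never resampled.
import random

def oversample(X, y):
    # naive stand-in for SMOTE: duplicate minority rows until classes balance
    counts = {c: y.count(c) for c in set(y)}
    minority = min(counts, key=counts.get)
    deficit = max(counts.values()) - counts[minority]
    rows = [r for r, c in zip(X, y) if c == minority]
    extra = [rows[i % len(rows)] for i in range(deficit)]
    return X + extra, y + [minority] * deficit

def train_1nn(X, y):
    # trivial learner: predict the label of the nearest training value
    def predict(row):
        dists = [abs(row[0] - t[0]) for t in X]
        return y[dists.index(min(dists))]
    return predict

def cross_validate(X, y, k, seed=0):
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for test_idx in folds:
        held = set(test_idx)
        X_tr = [X[j] for j in idx if j not in held]
        y_tr = [y[j] for j in idx if j not in held]
        X_tr, y_tr = oversample(X_tr, y_tr)  # training portion only
        model = train_1nn(X_tr, y_tr)
        # the test fold is evaluated exactly as it was: no synthetic rows
        hits = sum(model(X[j]) == y[j] for j in test_idx)
        scores.append(hits / len(test_idx))
    return scores

# made-up imbalanced data: 8 "no" rows, 2 "yes" rows
X = [[float(i)] for i in range(10)]
y = ["no"] * 8 + ["yes"] * 2
scores = cross_validate(X, y, k=5)
```

Oversampling before splitting would instead leak (near-)copies of minority rows into the test folds, which is exactly the source of the overoptimistic estimates.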

How can I apply feature reduction methods in Weka?

1) How can I apply feature reduction methods like LSI in Weka for text classification?
2) Can applying feature reduction methods like LSI improve classification accuracy?
Take a look at the FilteredClassifier class or at AttributeSelectedClassifier. With FilteredClassifier you can use feature reduction methods such as Principal Component Analysis (PCA). Here is a video on how to filter your dataset using PCA, so that you can try different classifiers on the reduced dataset.
It can help, but there is no guarantee. If you remove redundant features, or transform features in some way (as SVM or PCA do), the classification task can become simpler. In any case, a large number of features usually leads to the curse of dimensionality, and attribute selection is a way to avoid it.

weka J48 feature selection

I am using Weka and applying J48 to build my classifier. I have 40 features with 2000 instances (700 class a and 1300 class b).
The J48 decision tree is using just 2 features out of 40! Is there any way to make J48 use all features, or is there another algorithm that allows using all features?
Thanks in advance.
Maybe it is because J48 does not need more attributes.
You can check feature correlations in the Select attributes tab: run the selector with Ranker as the search method and Principal Components as the evaluator. It will show you the relations between the features and each class, and it will also tell you which features best describe your classes.
It is not necessarily the case that all 40 features are needed for classification, because some features might be redundant (e.g. correlated) or might not contain discriminatory information.
You can run feature selection beforehand from the Select attributes tab in the Weka Explorer and see which features are important.
You can also test classifiers such as SVM (LibSVM or SMO), a neural network (MultilayerPerceptron) and/or Random Forest, as they tend to give the best classification results in general (problem dependent).

Does Weka do automatic preprocessing of numeric attributes?

Does Weka preprocess numeric attributes like speed (metres per second) before classification?
I want to use the Weka toolkit to classify numeric speed and step data. In the related work Weka is often used, and it is mentioned that the authors have used the mean, standard deviation and max for classification. Does Weka do that preprocessing automatically, or do I have to do it before classification?
Weka doesn't do that automatically, but it does have filters for it. With the AddExpression filter you can compute the mean, standard deviation and max of a number of attributes, just as you described.
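If you would rather derive those summary attributes outside Weka before building the ARFF file, they are one-liners with the Python standard library. The speed readings below are made up for illustration:

```python
# Deriving summary attributes (mean, standard deviation, max) from raw
# readings before classification. Values are invented for illustration.
import statistics

def summary_features(window):
    """Collapse a window of raw readings into summary attributes."""
    return {
        "mean": statistics.mean(window),
        "std": statistics.stdev(window),  # sample standard deviation
        "max": max(window),
    }

speeds = [1.2, 1.5, 1.1, 1.9, 1.4]  # hypothetical metres-per-second readings
features = summary_features(speeds)
```

Each window of raw sensor readings then becomes one instance whose attributes are these summaries.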

Weka machine learning: how to interpret the Naive Bayes classifier?

I am using the Explorer for classification. My .arff data file has 10 features with numeric and binary values (only the instance ID is nominal). I have about 16 instances. The class to predict is Yes/No. I have used Naive Bayes but I cannot interpret the results. Does anyone know how to interpret the results of a Naive Bayes classification?
Naive Bayes doesn't select any important features. As you mentioned, the result of training a Naive Bayes classifier is the mean and variance of every feature. The classification of a new sample as 'Yes' or 'No' is based on whether the sample's feature values best match the trained mean and variance for 'Yes' or for 'No'.
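To make that concrete, here is a small stdlib-only sketch (not Weka's implementation) of how those per-class means and variances are used at prediction time. The data and labels are invented, and class priors are omitted because the two classes are equally sized:

```python
# Gaussian Naive Bayes in miniature: store per-class (mean, variance) for each
# feature, then classify by whichever class's parameters fit the row best.
import math
import statistics

def train_gnb(X, y):
    # per class, each feature's (mean, variance): essentially the table
    # that Weka prints for a trained Naive Bayes model
    model = {}
    for c in set(y):
        rows = [r for r, lab in zip(X, y) if lab == c]
        model[c] = [(statistics.mean(col), statistics.variance(col))
                    for col in zip(*rows)]
    return model

def log_likelihood(row, params):
    # Gaussian log-density of the row under one class's means/variances
    return sum(-0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
               for x, (mu, var) in zip(row, params))

def predict(model, row):
    # pick the class whose trained means/variances best match the row
    # (equal class priors assumed here, so they are left out)
    return max(model, key=lambda c: log_likelihood(row, model[c]))

# invented toy data: two numeric features, class Yes/No
X = [[1.0, 5.0], [1.2, 5.5], [0.8, 4.5], [3.0, 1.0], [3.2, 1.5], [2.8, 0.5]]
y = ["No", "No", "No", "Yes", "Yes", "Yes"]
model = train_gnb(X, y)
```

Reading Weka's Naive Bayes output is the same exercise: for each feature, compare a sample's value against the per-class mean and variance shown in the model summary.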
You could use other algorithms to find the most informative attributes. For example, you might use a decision tree classifier, e.g. J48 in WEKA (the open-source implementation of the C4.5 decision tree algorithm). The first node in the resulting decision tree tells you which feature has the most predictive power.
Even better (as stated by Rushdi Shams in the other post), Weka's Explorer offers purpose-built options for finding the most useful attributes in a dataset. These options can be found under the Select attributes tab.
As Sicco said, NB cannot give you the best features. A decision tree is a good choice because the branching can sometimes tell you which feature is important - but not always. To handle anything from a simple to a complex feature set, you can use WEKA's Select attributes tab. There you can find search methods and attribute evaluators; depending on your task, you can choose the one that best suits you. They will provide you with a ranking of the features (either from the training data or from a k-fold cross-validation). Personally, I believe decision trees perform poorly when the model is overfitting; in that case, a ranking of features is the standard way to select the best ones. Most of the time I use InfoGain with the Ranker method. Once your attributes are ranked from 1 to k, it is easy to figure out which features are required and which are unnecessary.
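As an illustration of what an information-gain ranking computes, here is a stdlib-only sketch for discrete features. The toy rows are invented; in Weka, InfoGainAttributeEval with the Ranker search does the equivalent for you:

```python
# Information gain of each feature with respect to the class, then a ranking.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels):
    # class entropy minus the weighted entropy after splitting on the feature
    n = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [lab for fv, lab in zip(values, labels) if fv == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# invented toy rows: feature 0 predicts the class, feature 1 is pure noise
rows = [("a", "x", "yes"), ("a", "y", "yes"), ("b", "x", "no"), ("b", "y", "no")]
labels = [r[-1] for r in rows]
gains = [info_gain([r[i] for r in rows], labels) for i in range(2)]
ranking = sorted(range(len(gains)), key=lambda i: gains[i], reverse=True)
```

The perfectly predictive feature scores the full class entropy (1 bit here) while the noise feature scores 0, so the ranking puts feature 0 first - the same kind of ordered list the Ranker output gives you.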
