WEKA has extensive support for kNN classifiers (many different distance metrics, etc.).
Unfortunately, WEKA doesn't support multi-label problems.
One possible solution is to use the binary relevance approach (see the sketch below).
I am not sure whether this is a correct workaround. What do you think?
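For concreteness, here is a minimal sketch of what the binary relevance approach amounts to. It uses scikit-learn rather than WEKA and made-up toy data, so treat it as an illustration of the idea rather than a WEKA recipe: one independent binary classifier is trained per label.

```python
# Binary relevance sketch (illustration only, scikit-learn instead of WEKA):
# one independent binary kNN classifier per label, on random toy data.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.random((100, 5))                  # 100 samples, 5 features
Y = rng.integers(0, 2, size=(100, 3))     # 3 labels, each 0/1 per sample

models = [KNeighborsClassifier().fit(X, Y[:, j]) for j in range(Y.shape[1])]

# Each label is predicted independently, then the columns are stacked back together.
preds = np.column_stack([m.predict(X) for m in models])
print(preds.shape)  # (100, 3)
```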
You can try Meka, which is based on WEKA and is designed to handle multi-label classification problems.
I am using scikit-learn to build a multiclass classification model. To this end, I have tried the RandomForestClassifier, KNeighborsClassifier, LogisticRegression, MultinomialNB, and SVC algorithms, and I am satisfied with the generated output. However, I have a question about the default mechanism these algorithms use for multiclass classification. I read that all scikit-learn classifiers are capable of multiclass classification, but I could not find any information about the default mechanism they use.
One-vs-rest (also called one-vs-all) is the most commonly used and a fair default strategy for multiclass classification. One binary classifier is fitted per class, with that class treated as the positive class and all other classes as negative. Check here for more information: https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html
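As a rough illustration of that wrapper (on toy data, not your actual setup), scikit-learn also exposes the strategy explicitly via OneVsRestClassifier, even though many of its estimators handle multiclass natively:

```python
# Sketch: explicit one-vs-rest wrapping of a binary classifier (toy data).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = make_classification(n_samples=200, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(len(ovr.estimators_))  # one binary classifier fitted per class -> 3
```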
I used Logistic Regression as a classifier. I have six features, and I want to know which features influence the result more than the others. I tried Information Gain, but it does not seem to depend on the classifier used. Is there any method to rank the features by their importance with respect to a specific classifier (like Logistic Regression)?
Any help would be highly appreciated.
You could use a Random Forest classifier to give you a ranking of your features. You could then select the top x features from this ranking and use them for logistic regression, although Random Forest would work perfectly well as a classifier too.
Check out variable importance at https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
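As a rough sketch of that idea (toy data and generic feature indices, not your actual six features):

```python
# Sketch: rank features with a Random Forest, then keep the top ones
# for logistic regression if desired (toy data, generic feature names).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

ranking = np.argsort(rf.feature_importances_)[::-1]  # most important first
for idx in ranking:
    print(f"feature_{idx}: {rf.feature_importances_[idx]:.3f}")
```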
One way to do this is by null hypothesis significance testing. Basically, for each feature, you test for evidence that the coefficient of that feature is nonzero. Most statistical software reports the results of these tests by default in the model summary (scikit-learn and other machine-learning-oriented tools tend not to). With a small number of features, you can use this information and stepwise regression to rank the importance of the features.
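If you want those test results in Python, one option is statsmodels, which reports per-coefficient significance tests in its model summary; a minimal sketch on synthetic data (not your actual features) might look like this:

```python
# Sketch: coefficient significance tests for logistic regression via statsmodels
# (scikit-learn's LogisticRegression does not report p-values).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.random((200, 6))                                   # toy data, 6 features
y = (X[:, 0] + 0.5 * X[:, 1] + 0.3 * rng.standard_normal(200) > 0.75).astype(int)

model = sm.Logit(y, sm.add_constant(X)).fit()
print(model.summary())   # per-coefficient z statistics and p-values
```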
I am trying to implement multiclass classification in WEKA.
I have a lot of rows, say bank transactions, and each is tagged as Food, Medicine, Rent, etc. I want to build a classifier that can be trained on the data I already have and predict the class of future transactions. If I am right, this is multiclass and not multi-label, since each transaction can belong to only one class.
Below are a few algorithms I am considering:
Naive Bayes
Multinomial Logistic Regression
Multiclass SVM
Max Entropy
Neural Networks (if possible)
In my data the number of features is far smaller than the number of transactions, and hence I am thinking of a one-vs-rest binary classifier instead of one-vs-one.
Are there any other algorithms I should look into that would help with my goal?
Are any of the algorithms I listed unsuitable for my goal?
Also, I found that scikit-learn in Python is considered better than WEKA, but that I can run scikit-learn on only one processor. Is this true?
Answers to any question would be helpful.
Thanks!
You can look at Random Forest, which is a well-known and quite efficient classifier.
In scikit-learn, some estimators, such as RandomForestClassifier, can run on several cores. They take an n_jobs constructor parameter that sets the number of cores to use; passing -1 uses every available core. In general, any estimator whose constructor has an n_jobs parameter can be run on several cores; check the documentation.
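A minimal sketch of that (toy data; the classifier choice and sizes are arbitrary):

```python
# Sketch: RandomForestClassifier running on all available cores via n_jobs.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=5000, n_features=20, n_informative=10,
                           n_classes=4, random_state=0)
clf = RandomForestClassifier(n_estimators=300, n_jobs=-1)  # n_jobs=-1 -> all cores
clf.fit(X, y)
```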
I'm using an OpenCV Haar classifier in my work, but I keep reading conflicting reports on whether the OpenCV Haar classifier is an SVM or not. Can anyone clarify whether it uses an SVM? Also, if it does not use an SVM, what advantages does the Haar method offer over an SVM approach?
SVM and boosting (AdaBoost, GentleBoost, etc.) are feature classification strategies/algorithms. Support Vector Machines solve a complex optimization problem, often using kernel functions, which allows us to separate samples by working in a much higher-dimensional feature space. Boosting, on the other hand, is a strategy based on combining lots of "cheap" weak classifiers in a smart way, which leads to very fast classification. Those weak classifiers can even be SVMs.
Haar-like features are a kind of feature computed from integral images and very well suited to computer vision problems.
That is, you can combine Haar features with either of the two classification schemes.
It isn't an SVM. Here is the documentation:
http://docs.opencv.org/modules/objdetect/doc/cascade_classification.html#haar-feature-based-cascade-classifier-for-object-detection
It uses boosting (supporting AdaBoost and a variety of other similar methods -- all based on boosting).
The important difference is evaluation speed: the cascade structure and its stage-based boosting algorithms allow very fast evaluation with high accuracy (and, in particular, support training with many negatives), giving a better speed/accuracy balance than an SVM for this particular application.
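For reference, here is a rough sketch of using a pre-trained boosted Haar cascade from Python with a recent OpenCV build; the image path is a placeholder:

```python
# Sketch: using OpenCV's pre-trained boosted Haar cascade (not an SVM).
import cv2

cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
detector = cv2.CascadeClassifier(cascade_path)

img = cv2.imread("input.jpg")            # placeholder image path
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# The cascade evaluates boosted stages of Haar-like features over a sliding window.
faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
print(f"Detected {len(faces)} face(s)")
```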
Is it possible to use the GPML toolbox (http://www.gaussianprocess.org/gpml/) for multi-class classification? (The dataset I am using has about 7000 training samples from 3 different classes.)
You can always use binary classifiers for multi-class classification. Notably, by making all-pairs comparisons, a simple majority vote gives a very reliable answer. It does, however, need O(n²) binary classifiers for a problem with n classes, but that's not an issue for 3 classes.
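To make the all-pairs scheme concrete, here is a rough sketch using scikit-learn in place of GPML (toy 3-class data; GPML itself is a MATLAB/Octave toolbox):

```python
# Sketch: all-pairs (one-vs-one) majority voting built from binary classifiers.
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
ovo = OneVsOneClassifier(SVC(kernel="rbf")).fit(X, y)
print(len(ovo.estimators_))  # n*(n-1)/2 = 3 pairwise binary classifiers
```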