LibSVM for Multi-Class SVM Accuracy - machine-learning

I have a set of one-vs-all SVMs: LibSVM binary SVMs, each trained on one genuine class against all other classes. I want to report the FAR and FRR of the system, but I appear to be getting very large FRR values and very small FAR values. This is because, for each test, I use a test set from the genuine class as positive test data and the positive data from all other classes as negative test data. This means I get an equal number of false acceptances and false rejections: if a genuine sample is falsely rejected, then another SVM will falsely accept it in another test for another user.
The false-acceptance and false-rejection counts are therefore the same, but the FAR and FRR percentages are extremely different, because the negative dataset can be up to 100 times bigger than the positive set. If we have n false rejections (and consequently n false acceptances), then FRR = n/pos_data_size while FAR = n/(pos_data_size*100)!
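To make the arithmetic concrete, here is a minimal sketch; the set sizes and the error count n below are made-up illustrations, not numbers from the actual system:

    # Illustrative sizes only: one genuine class and a pooled negative set
    # roughly 100x larger, with n false rejections (= n false acceptances).
    pos_data_size = 200
    neg_data_size = 100 * pos_data_size
    n = 10

    frr = n / pos_data_size   # 10 / 200    = 5%
    far = n / neg_data_size   # 10 / 20,000 = 0.05%
    print(f"FRR = {frr:.2%}, FAR = {far:.3%}")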
I would like to represent these error rates in a meaningful way, but that seems very difficult to do. Is there an approach that would work in this case?

Related

Random Forest classifier class_weight

I have an unbalanced dataset: about 200000 descriptions are class 0, and something like 10000 are class 1. However, my training dataset has an equal number of 'positive' and 'negative' samples, about 8000 each. So now I am confused about how to properly use the class_weight option of the classifier. It seems to work only if the ratio of 'positive' to 'negative' samples in the training data is the same as in the whole dataset; in this case that would be 8000 'positive' and 160000 'negative' samples, which is not really feasible. Reducing the number of 'positive' samples doesn't seem like a good idea either. Or am I wrong?
The class_weight option does nothing more than increase the penalty for making an error on the under-represented class. In other words, misclassifying the rare class is punished more harshly.
The classifier is likely to perform better on your test set (where both classes are represented equally, so both are equally important), but that is something you can easily verify yourself.
A side effect is that predict_proba returns probabilities that are far away from the actual class probabilities. (If you want to understand why, plot the baseline class frequencies and the distribution of predicted scores with and without different class_weight settings, and see how the predicted scores shift.) Depending on your final use case (classification, ranking, probability estimation), you should take this into account when choosing your model.
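As a minimal sketch of that comparison (the synthetic data and the 20x weight below are assumptions for illustration, not values from the question):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # Synthetic stand-in for a balanced training set (~8000 per class).
    X, y = make_classification(n_samples=16000, weights=[0.5, 0.5],
                               random_state=0)

    plain = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    weighted = RandomForestClassifier(n_estimators=100, random_state=0,
                                      class_weight={0: 1, 1: 20}).fit(X, y)

    # With class_weight, the distribution of predicted scores shifts relative
    # to the unweighted model.
    print(plain.predict_proba(X)[:, 1].mean())
    print(weighted.predict_proba(X)[:, 1].mean())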
Strictly speaking, from the point of view of your training set, you don't face a class imbalance issue, so you could very well leave class_weight to its default None value.
The real issue here, and with imbalanced datasets in general (about which you don't provide any info), is whether the cost of misclassification is the same for both classes. And this is a "business" decision (i.e. not a statistical/algorithmic one).
Usually, imbalanced datasets go hand-in-hand with problems with different misclassification costs; medical diagnosis is a textbook example here, since:
The datasets are almost by default imbalanced, since healthy people vastly outnumber infected ones
We would prefer a false alarm (misclassifying someone as having the disease, while he/she doesn't) rather than a missed detection (misclassifying an infected person as healthy, hence risking his/her life)
So, this is the actual problem you should be thinking about (i.e. even before building your training set).
If, for the business problem you are trying to address, there is no difference between misclassifying a "0" as a "1" and a "1" as a "0", and given that your training set is balanced, you can proceed without worrying about assigning different class weights...

Naive Bayes classifier: output percentage is too low

I'm writing a naive Bayes classifier for a class project and I just got it working... sort of. While I do get an error-free output, the winning output label has an output probability of 3.89*10^-85.
Wow.
I have a couple of ideas about what I might be doing wrong. First, I am not normalizing the output probabilities across the classes, so all of them are effectively zero. While normalizing would give me numbers that look nicer, I don't know if that's the correct thing to do.
My second idea was to reduce the number of features. Our input data is a list of pseudo-images in the form of a very long text file. Currently, our features are just the binary value of every pixel of the image, and with a 28x28 image that's a lot of features. If I instead chopped the image into blocks of size, say, 7x7, how much would that actually improve the output percentages?
tl;dr Here are the general things I'm trying to understand about naive Bayes:
1) Do you need to normalize the output percentages from testing each class?
2) How much of an effect does having too many features have on the results?
Thanks in advance for any help you can give me.
It could be normal. The output of a naive Bayes classifier is not meant to be a real probability; it is meant to produce a score that ranks the competing classes.
The reason the probability is so low is that many naive Bayes implementations compute the product of the probabilities of all the observed features of the instance being classified. If you are classifying text, each feature may have a low conditional probability for each class (for example, lower than 0.01). If you multiply thousands of feature probabilities, you quickly end up with numbers like the one you reported.
Also, the probabilities returned are not the probabilities of each class given the instance, but an estimate of the probability of observing this set of features given the class. Thus, the more features you have, the less likely it is to observe that exact combination of features. Bayes' theorem is used to turn argmax_c P(class_c | features) into argmax_c P(class_c) * P(features | class_c), and P(features | class_c) is further simplified by the independence assumption, which turns it into the product of the probabilities of observing each individual feature given the class. These transformations don't change the argmax (the winning class).
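A small numeric sketch of that effect; the per-feature likelihoods below are made up, and the 784 features correspond to the 28x28 pixel example from the question:

    import math

    n_features = 784      # 28x28 binary pixels
    p_given_c1 = 0.30     # assumed average per-pixel likelihood, class 1
    p_given_c2 = 0.25     # assumed average per-pixel likelihood, class 2

    # The raw products underflow to 0.0 in double precision...
    print(p_given_c1 ** n_features, p_given_c2 ** n_features)

    # ...but summing log-probabilities keeps the comparison intact, and the
    # winning class (the argmax) is the same either way.
    print(n_features * math.log(p_given_c1), n_features * math.log(p_given_c2))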
If I were you, I would not worry much about the probability output; focus instead on the accuracy of your classifier, and take action to improve the accuracy rather than the calculated probabilities.

What do the features given by a feature selection method mean in a binary classifier which has a cross validation accuracy of 0?

So I know that, given a binary classifier, the farther away you are from an accuracy of 0.5 the better your classifier is (i.e. a binary classifier that gets everything wrong can be converted into one that gets everything right by always inverting its decisions).
However, I have an inner feature selection process, which provides me "good" features to use (I'm trying out recursive feature elimination, and another based on Spearman's rank correlation coefficient). Given that the classifier using these "good" features gets a cross validation accuracy of 0, can I still conclude that the features selected are useful and are predictive of the class in this binary prediction problem?
To simplify, let's assume you're testing on some balanced set. Half the testing data is positive and half the testing data is negative.
I would say that something strange is happening that is flipping the sign of your decisions. The classifier you're evaluating is actually very useful, but you would need to flip the decisions it makes. You should probably check your code to make sure you're not flipping the classes of the training data. Some libraries (LIBSVM, for example) require that the first training example be from the positive class.
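A minimal sketch of that sanity check, with placeholder 0/1 labels; if accuracy is near 0, inverting the predictions should take it near 1:

    import numpy as np
    from sklearn.metrics import accuracy_score

    # Placeholder labels/predictions standing in for one cross-validation fold.
    y_true = np.array([1, 0, 1, 1, 0, 0])
    y_pred = np.array([0, 1, 0, 0, 1, 1])       # every prediction is wrong

    print(accuracy_score(y_true, y_pred))        # 0.0
    print(accuracy_score(y_true, 1 - y_pred))    # 1.0 after flipping the labels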
To summarize: It seems the features you're selecting are useful, but it seems you have a bug that is flipping the classes.

Suggestions to improve my normalized accuracy with libsvm

I'm having a problem when I try to classify my data using libsvm. My training and test data are highly unbalanced. When I do the grid search for the SVM parameters and train my data with class weights, the test accuracy is 96.8113%. But because the test data is unbalanced, all of the correctly predicted values come from the negative class, which is much larger than the positive class.
I tried a lot of things, from changing the weights to changing the gamma and cost values, but my normalized accuracy (which takes both the positive and negative classes into account) gets lower with every try. Training on 50% positives and 50% negatives with the default grid.py parameters, I get a very low normalized accuracy (18.4234%).
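For reference, here is a rough sketch of a grid search over cost and gamma scored by a normalized (balanced) accuracy, using scikit-learn's GridSearchCV as a stand-in for grid.py; the data and parameter grid below are placeholders, not the question's actual setup:

    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    # Placeholder imbalanced data (~10% positives) standing in for the real set.
    rng = np.random.RandomState(0)
    X = rng.randn(400, 10)
    y = np.array([1] * 40 + [0] * 360)

    # Grid over cost (C) and gamma, scored with balanced accuracy so the
    # majority class cannot dominate model selection.
    grid = GridSearchCV(
        SVC(kernel="rbf", class_weight="balanced"),
        param_grid={"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]},
        scoring="balanced_accuracy",
        cv=5,
    )
    grid.fit(X, y)
    print(grid.best_params_, grid.best_score_)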
I want to know whether the problem is in my description (how I build the feature vectors), in the imbalance (should I use balanced data in another way?), or whether I should change my classifier.
Better data always helps.
I think that imbalance is part of the problem, but a more significant part is how you're evaluating your classifier. Evaluating plain accuracy given the distribution of positives and negatives in your data is pretty much useless. So is training on 50%/50% data and testing on data that is distributed 99% vs 1%.
There are real-life problems like the one you're studying (with a great imbalance between positives and negatives). Let me give you two examples:
Information retrieval: given all documents in a huge collection, return the subset that is relevant to a search term q.
Face detection: given a large image, mark all locations where there are human faces.
Many approaches to these types of systems are classifier-based. To compare classifiers, a few tools are commonly used: ROC curves, precision-recall curves, and the F-score. These tools give a more principled way to evaluate when one classifier is working better than another.
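As a rough sketch of those evaluation tools with scikit-learn (the labels and scores below are placeholders, not real classifier output):

    import numpy as np
    from sklearn.metrics import (average_precision_score, f1_score,
                                 roc_auc_score)

    # Placeholder imbalanced ground truth and classifier scores.
    y_true = np.array([0] * 95 + [1] * 5)
    scores = np.random.RandomState(0).rand(100)   # stand-in decision values
    y_pred = (scores > 0.5).astype(int)

    print("ROC AUC:              ", roc_auc_score(y_true, scores))
    print("Precision-recall AUC: ", average_precision_score(y_true, scores))
    print("F-score:              ", f1_score(y_true, y_pred))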

One versus rest classifier

I'm implementing a one-versus-rest classifier to discriminate between neural data corresponding to (1) moving a computer cursor up and (2) moving it in any of the other seven cardinal directions or not moving at all. I'm using an SVM classifier with an RBF kernel (created by LIBSVM), and I did a grid search to find the best gamma and cost parameters for my classifier. I have tried training on 338 elements from each of the two classes (undersampling my large "rest" class) and on 338 elements from my first class and 7218 from my second one with a weighted SVM.
I have also used feature selection to bring the number of features I'm using down from 130 to 10. I tried using the ten "best" features and the ten "worst" features when training my classifier. I have also used the entire feature set.
Unfortunately, my results are not very good, and moreover, I cannot find an explanation for why. I tested with 37759 data points, where 1687 of them came from the "one" (i.e. "up") class and the remaining 36072 came from the "rest" class. In all cases, my classifier is 95% accurate, BUT everything it predicts correctly falls into the "rest" class (i.e. all my data points are predicted as "rest", and everything that is incorrectly predicted belongs to the "one"/"up" class). When I tested with 338 data points from each class (the same ones I used for training), I found that the number of support vectors was 666, which is ten fewer than the number of data points. In this case, the accuracy is only 71%, which is surprising since my training and testing data are exactly the same.
Do you have any idea what could be going wrong? If you have any suggestions, please let me know.
Thanks!
If your test dataset is the same as your training data, then your training accuracy was 71%. There is nothing wrong with that in itself; the data may simply not be well separable with the kernel you used.
However, one point of concern is that the high number of support vectors suggests probable overfitting.
Not sure if this amounts to an answer - it would probably be hard to give one without actually seeing the data - but here are some ideas regarding the issue you describe:
In general, an SVM tries to find a hyperplane that best separates your classes. However, since you have opted for one-vs-rest classification, you have no choice but to mix all negative cases together (your 'rest' class). This might make the 'best' separation much less suited to solving your problem. I'm guessing that this might be a major issue here.
To verify if that's the case, I suggest trying to use only one other cardinal direction as the negative set, and see if that improves results. In case it does, you can train 7 classifiers, one for each direction. Another option might be to use the multiclass option of libSVM, or a tool like SVMLight, which is able to classify one against many.
One caveat of most SVM implementations is that they do not handle big size differences between the positive and negative sets well, even with weighting. In my experience, weighting factors above 4-5 are problematic in many cases. On the other hand, since the variety on the negative side is large, taking equal sizes might also be less than optimal. Thus, I'd suggest using something like the 338 positive examples and around 1000-1200 random negative examples, with weighting.
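A minimal sketch of that suggestion (placeholder features; the 3x class weight is an assumed value within the range mentioned above; scikit-learn's SVC wraps LIBSVM, where the same weight would be passed as the -w1 option):

    import numpy as np
    from sklearn.svm import SVC

    # Placeholder features: 338 "up" examples and 7218 "rest" examples.
    rng = np.random.RandomState(0)
    X_pos = rng.randn(338, 10) + 1.0
    X_neg = rng.randn(7218, 10)

    # Keep all positives, subsample ~1000 negatives, and weight the rare class.
    neg_sample = X_neg[rng.choice(len(X_neg), 1000, replace=False)]
    X = np.vstack([X_pos, neg_sample])
    y = np.array([1] * len(X_pos) + [0] * len(neg_sample))

    clf = SVC(kernel="rbf", C=1.0, gamma="scale", class_weight={1: 3.0})
    clf.fit(X, y)
    print("support vectors per class:", clf.n_support_)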
A little aside from your question: I would also consider other types of classifiers; to start with, I'd suggest looking at k-NN.
Hope it helps :)
