Weird phenomenon with SVM: negative examples score higher - machine-learning

I use VLFeat and LIBLINEAR for a two-class classification problem. The ratio #(-)/#(+) in the training set is 35.01, and each feature vector has 3.6e5 dimensions. I have around 15000 examples.
I set the weight of positive examples to 35.01 and left the weight of negative examples at the default of 1. But what I get is extremely poor performance on the test dataset.
To find out why, I ran the trained model on the training examples themselves. What I see is that negative examples get slightly higher decision values than positive ones. It is really weird, right? I've checked the input to make sure I did not mislabel the examples, and I've normalized the histogram vectors.
Has anybody met this situation before?
Here are the parameters of the trained model. Values such as bias, regularizer and dualityGap look strange to me: they are so small that they could easily lose precision.
model.info =
solver: 'sdca'
lambda: 0.0100
biasMultiplier: 1
bias: -1.6573e-14
objective: 1.9439
regularizer: 6.1651e-04
loss: 1.9432
dualObjective: 1.9439
dualLoss: 1.9445
dualityGap: -2.6645e-15
iteration: 43868
epoch: 2
elapsedTime: 228.9374

One thing that could be happening is that LIBSVM treats the class of the first example in the data set as the positive class and the other class as the negative one. Since you have 35x more negatives than positives, your first example may well be negative, in which case your classes are being inverted. How to check this? Make sure that the first data point in the training set belongs to the positive class.
I've checked the FAQ of LIBLINEAR, and it seems the same thing happens in LIBLINEAR as well (I'm not as familiar with LIBLINEAR):
http://www.csie.ntu.edu.tw/~cjlin/liblinear/FAQ.html (search for reversed)
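A minimal sketch of that check in Python, assuming (as the FAQ entry describes) that raw decision values are reported relative to whichever label appears first in the training data. The helper name `orient_decision_values` and its arguments are hypothetical and not part of LIBLINEAR or VLFeat.

```python
def orient_decision_values(dec_values, first_train_label, positive_label=1):
    """Return decision values oriented so that larger means 'more positive'.

    Assumption: the solver treats the label of the first training example
    as its internal +1 class, so raw scores are relative to that label.
    """
    sign = 1.0 if first_train_label == positive_label else -1.0
    return [sign * v for v in dec_values]


# Hypothetical raw scores from a model whose first training example was negative.
raw = [0.8, -0.3, 1.2]
print(orient_decision_values(raw, first_train_label=-1))  # signs are flipped
```

If the scores only make sense after the flip, reordering the training set so that a positive example comes first should remove the need for it.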

Related

What would be the impact of adding a lot of positive examples to a binary classifier?

Say I have a binary classifier trained with an equal number of examples per class: N positive and N negative. Now I add another N positive examples for training. What would be the effect of this?
More generally, what is the effect of having disproportionate numbers of training examples per label?
In general, it would mean that you bias your classification algorithm towards the positive examples. For optimal results, it is therefore important that your training data set has the same proportion of positive/negative samples as your validation data set (and as the data set that you will use in production later on).
The details may, however, depend on the type of algorithm you are using and on whether the added positive samples are independent of the positive samples already present.
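If the extra positives cannot be avoided, one common way to offset the shift is to reweight the classes inversely to their frequency. The sketch below is only an illustration of that heuristic; the function name `balanced_class_weights` is hypothetical, and whether the weights help depends on the algorithm, as noted above.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * cnt) for c, cnt in counts.items()}


# N = 100 negatives, and 2N = 200 positives after adding the extra examples.
labels = [1] * 200 + [0] * 100
print(balanced_class_weights(labels))  # positives get the smaller weight
```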

Why does one not use IOU for training?

When people try to solve the task of semantic segmentation with CNNs, they usually use a softmax cross-entropy loss during training (see Fully conv. - Long). But when it comes to comparing the performance of different approaches, measures like intersection-over-union are reported.
My question is why don't people train directly on the measure they want to optimize? Seems odd to me to train on some measure during training, but evaluate on another measure for benchmarks.
I can see that the IOU has problems for training samples where the class is not present (union=0 and intersection=0 => division of zero by zero). But when I can ensure that every sample of my ground truth contains all classes, is there another reason for not using this measure?
Check out this paper, where they come up with a way to make the concept of IoU differentiable. I implemented their solution with amazing results!
It is like asking "why do we train on log loss and not on accuracy for classification?". The reason is really simple: you cannot directly train for most metrics, because they are not differentiable with respect to your parameters (or at least do not produce a nice error surface). Log loss (softmax cross-entropy) is a valid surrogate for accuracy. You are completely right that it is plain wrong to train with something that is not a valid surrogate for the metric you are interested in, and the linked paper does not do a good job here, since for at least a few of the metrics they consider we could easily give a good surrogate (for weighted accuracy, for example, all you have to do is weight the log loss as well).
Here's another way to think about this in a simple manner.
Remember that it is not sufficient to simply evaluate a metric such as accuracy or IoU while solving a relevant image problem. Evaluating the metric must also tell the network in which direction the weights should be nudged, so that it can learn effectively over iterations and epochs.
Being able to evaluate this direction is what the earlier comments mean when they say the error must be differentiable. I suppose that there is nothing about the IoU metric that the network can use to say: "hey, it's not exactly here, but maybe I have to move my bounding box a little to the left!"
Just a trickle of an explanation, but I hope it helps.
I always use mean IOU for training a segmentation model. More exactly, -log(MIOU). Plain -MIOU as a loss function will easily trap your optimizer around 0 because of its narrow range (0,1) and thus its steep surface. Taking the log makes the loss surface change more gradually, which is better for training.
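As a rough illustration of the two ideas above (a differentiable IoU and the -log variant), here is a minimal sketch for a single class over a flattened mask. It assumes per-pixel probabilities in [0, 1] and uses products and sums in place of the set intersection and union; the function names and the `eps` smoothing term are assumptions for this example, not taken from the linked paper. A mean-IoU loss would average `soft_iou` over classes before taking the log.

```python
import math


def soft_iou(probs, targets, eps=1e-7):
    """Soft IoU: products stand in for intersection, p + t - p*t for union."""
    inter = sum(p * t for p, t in zip(probs, targets))
    union = sum(p + t - p * t for p, t in zip(probs, targets))
    # eps keeps the ratio defined when the class is absent (0 / 0 case).
    return (inter + eps) / (union + eps)


def neg_log_iou_loss(probs, targets):
    """-log(IoU): 0 for a perfect match, growing as the overlap shrinks."""
    return -math.log(soft_iou(probs, targets))


targets = [1, 0, 1, 0]
print(neg_log_iou_loss([0.9, 0.1, 0.8, 0.2], targets))  # close prediction
print(neg_log_iou_loss([0.5, 0.5, 0.5, 0.5], targets))  # uninformative prediction
```

Because every operation here is a sum, product or log of the predicted probabilities, the same expression written in an autodiff framework yields gradients, which the hard, thresholded IoU does not.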

Suggestions to improve my normalized accuracy with libsvm

I have a problem when I try to classify my data using libsvm. My training and test data are highly unbalanced. When I do the grid search for the SVM parameters and train on my data with class weights, testing gives an accuracy of 96.8113%. But because the testing data is unbalanced, all the correctly predicted values are from the negative class, which is larger than the positive class.
I tried a lot of things, from changing the weights to changing the gamma and cost values, but my normalized accuracy (which takes into account both the positive and the negative class) gets lower with each try. Training on 50% positives and 50% negatives with the default grid.py parameters, I get a very low accuracy (18.4234%).
I want to know whether the problem is in my description (how I build the feature vectors) or in the imbalance (should I use balanced data in another way?), or whether I should change my classifier.
Better data always helps.
I think that imbalance is part of the problem. But a more significant part of the problem is how you're evaluating your classifier. Evaluating accuracy given the distribution of positives and negatives in your data is pretty much useless. So is training on 50% and 50% and testing on data that is distributed 99% vs 1%.
There are problems in real life that are like the one you're studying (that have a great imbalance between positives and negatives). Let me give you two examples:
Information retrieval: given all documents in a huge collection, return the subset that is relevant to search term q.
Face detection: given a large image, mark all locations where there are human faces.
Many approaches to these types of systems are classifier-based. To compare two classifiers, three tools are commonly used: ROC curves, precision-recall curves and the F-score. These tools give a more principled way to decide when one classifier is working better than another.
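To make the point about accuracy concrete, here is a small sketch that computes precision, recall, F-score and balanced accuracy from confusion-matrix counts. The counts are hypothetical: a test set of 968 negatives and 32 positives, chosen only to resemble the accuracy in the question, with a classifier that predicts "negative" for everything.

```python
def imbalance_metrics(tp, fp, tn, fn):
    """Metrics from confusion counts; undefined ratios are reported as 0.0."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "balanced_accuracy": (recall + specificity) / 2,
    }


# Hypothetical "always negative" classifier on 968 negatives and 32 positives.
for name, value in imbalance_metrics(tp=0, fp=0, tn=968, fn=32).items():
    print(f"{name}: {value:.3f}")
```

Accuracy comes out high while recall and F-score are zero and balanced accuracy sits at chance level, which is why the latter metrics are more informative here.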

Good performance only for one class naive bayes

I use Naive Bayes from Weka to do text classification. I have two classes for my sentences, "Positive" and "Negative". I collected about 207 sentences with positive meaning and 189 sentences with negative meaning, in order to create my training set.
When I ran Naive Bayes with a test set that contains sentences with strong negative meaning, such as those containing the word "hate", the accuracy of the results is pretty good, about 88%. But when I use sentences with positive meaning, such as those containing the word "love", as a test set, the accuracy is much worse, about 56%.
I think that this difference probably has something to do with my training set and especially its "Positive" sentences.
Can you think of any reason that could explain this difference? Or maybe a way to help me find out where the problem begins?
Thanks a lot for your time,
Nantia
Instead of creating test sets which contain only positive or only negative samples, I would just create a test set with mixed samples. You can then view the resulting confusion matrix in Weka, which allows you to see how well both the positive and the negative samples were classified. Furthermore, I would use (10-fold) cross-validation to get a more stable measure of the performance (once you have done this, you might want to edit your post with the confusion matrix from the cross-validation results, and we might be able to help out more).
It may be that your negative sentences have words that are more consistently present, whereas your positive sentences have more variations in the words that are present or those words may also often be present in the negative sentences.
It is hard to give specific advice without knowing the size of your dictionary (i.e., number of attributes), size of your test set, etc. Since the Naive Bayes Classifier calculates the product of the probabilities of individual words being present or absent, I would take some of the misclassified positive examples and examine the conditional probabilities for both positive and negative classification to see why the examples are being misclassified.
To better understand how your classifier works, you can inspect the parameters to see which words the classifier thinks are the most predictive of a positive or a negative sentence. Can you print out the top predictors for positive and negative cases?
e.g.,
top positive predictors:
p('love'|positive) = 0.05
p('like'|positive) = 0.016
...
top negative predictors:
p('hate'|negative) = 0.25
p('dislike'|negative) = 0.17
...
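A listing like the one above could be produced along the following lines. This is only a sketch: it assumes whitespace tokenization and a multinomial model with Laplace smoothing, which may differ from what Weka's Naive Bayes does with your attributes, and the toy sentences and function names are hypothetical.

```python
from collections import Counter


def word_likelihoods(sentences, vocab, alpha=1.0):
    """Estimate p(word | class) from one class's sentences, Laplace-smoothed."""
    counts = Counter(w for s in sentences for w in s.lower().split())
    denom = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / denom for w in vocab}


def top_predictors(likelihoods, k=3):
    """Return the k words with the highest conditional probability."""
    return sorted(likelihoods.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Toy training sentences, for illustration only.
positive = ["i love this", "love it a lot"]
negative = ["i hate this", "hate it so much"]
vocab = sorted({w for s in positive + negative for w in s.split()})

pos_lik = word_likelihoods(positive, vocab)
neg_lik = word_likelihoods(negative, vocab)
print("top positive predictors:", top_predictors(pos_lik))
print("top negative predictors:", top_predictors(neg_lik))
```

Comparing `pos_lik[w]` and `neg_lik[w]` for the words of a misclassified sentence shows which words pull it towards the wrong class.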

How to purposely overfit Weka tree classifiers?

I have a binary class dataset (0 / 1) with a large skew towards the "0" class (about 30000 vs 1500). There are 7 features for each instance, no missing values.
When I use the J48 or any other tree classifier, I get almost all of the "1" instances misclassified as "0".
Setting the classifier to "unpruned", setting minimum number of instances per leaf to 1, setting confidence factor to 1, adding a dummy attribute with instance ID number - all of this didn't help.
I just can't create a model that overfits my data!
I've also tried almost all of the other classifiers Weka provides, but got similar results.
Using IB1 gets 100% accuracy (trainset on trainset) so it's not a problem of multiple instances with the same feature values and different classes.
How can I create a completely unpruned tree?
Or otherwise force Weka to overfit my data?
Thanks.
Update: Okay, this is absurd. I've used only about 3100 negative and 1200 positive examples, and this is the tree I got (unpruned!):
J48 unpruned tree
------------------
F <= 0.90747: 1 (201.0/54.0)
F > 0.90747: 0 (4153.0/1062.0)
Needless to say, IB1 still gives 100% precision.
Update 2: Don't know how I missed it - unpruned SimpleCart works and gives 100% accuracy train on train; pruned SimpleCart is not as biased as J48 and has a decent false positive and negative ratio.
Weka contains two meta-classifiers of interest:
weka.classifiers.meta.CostSensitiveClassifier
weka.classifiers.meta.MetaCost
They allow you to make any algorithm cost-sensitive (not restricted to SVM) and to specify a cost matrix (the penalty for each kind of error); you would give a higher penalty for misclassifying 1 instances as 0 than for erroneously classifying 0 as 1.
The result is that the algorithm would then try to:
minimize the expected misclassification cost (rather than simply predict the most likely class)
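The decision rule behind minimizing expected cost can be sketched in a few lines. This illustrates the idea only and is not how the Weka meta-classifiers are implemented; the function name and the cost values are hypothetical.

```python
def min_cost_class(p_positive, cost_fn, cost_fp):
    """Pick the class (0 or 1) with the lower expected misclassification cost.

    cost_fn: penalty for predicting 0 when the true class is 1.
    cost_fp: penalty for predicting 1 when the true class is 0.
    """
    cost_if_predict_0 = p_positive * cost_fn          # risk of a missed positive
    cost_if_predict_1 = (1.0 - p_positive) * cost_fp  # risk of a false alarm
    return 1 if cost_if_predict_1 < cost_if_predict_0 else 0


# With equal costs, an instance with p(1 | x) = 0.2 is labelled 0 ...
print(min_cost_class(0.2, cost_fn=1.0, cost_fp=1.0))
# ... but with a 10x penalty on missed positives it is labelled 1.
print(min_cost_class(0.2, cost_fn=10.0, cost_fp=1.0))
```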
The quick and dirty solution is to resample. Throw away all but 1500 of your negative (class "0") examples, so that they match the 1500 or so positives, and train on a balanced data set. I am pretty sure there is a resample component in Weka to do this.
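Outside Weka, the same random undersampling can be sketched as follows. The function name `undersample_majority` and the toy data are hypothetical; the sketch simply keeps as many examples of every class as the smallest class has.

```python
import random


def undersample_majority(examples, labels, seed=0):
    """Randomly keep min-class-size examples per class; returns (x, y) pairs."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(examples, labels):
        by_class.setdefault(y, []).append(x)
    n_keep = min(len(xs) for xs in by_class.values())
    balanced = [(x, y) for y, xs in by_class.items()
                for x in rng.sample(xs, n_keep)]
    rng.shuffle(balanced)  # avoid all examples of one class coming first
    return balanced


# Toy data with the same 20:1 skew as in the question (300 vs 15).
labels = [0] * 300 + [1] * 15
examples = list(range(len(labels)))
balanced = undersample_majority(examples, labels)
print(len(balanced), sum(y for _, y in balanced))
```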
The other solution is to use a classifier with a variable cost for each class. I'm pretty sure libSVM allows you to do this and I know Weka can wrap libSVM. However I haven't used Weka in a while so I can't be of much practical help here.
