I am using TensorFlow to implement object recognition. I followed this tutorial but use my own dataset. https://www.tensorflow.org/versions/r0.8/tutorials/mnist/pros/index.html#deep-mnist-for-experts
I used 212 positive samples and 120 negative samples to train. The test set contains 100 positive and 20 negative samples.
The training precision is only 32.15%, but the test precision is 83.19%.
I am wondering what makes the test precision higher than the training precision. Is my dataset not large enough? Does the data not show any statistical meaning? Or is this normal, because I have seen some people say that training precision doesn't mean much? But why is that?
There are two problems here.
First, precision is not a very good measure of performance when classes are unbalanced.
Second, and more importantly, the ratio of negatives to positives in your test set is off. Your test set should come from the same process as the training one, but in your case negatives are ~36% of the training set and only ~17% of the test set. Not very surprisingly, a classifier which simply answers "positive" for every single input will get 83% precision on your test set (as positives are 83% of the test data).
Thus it is not a matter of the number of test samples; it is a matter of incorrect construction of the training/test datasets. I can also imagine there are more issues with this split; quite possibly the training and test sets have completely different structure.
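To see this concretely, here is a minimal sketch (using scikit-learn's precision_score, which is not part of the original setup) showing that an always-"positive" classifier reaches roughly 83% precision on a test set of 100 positives and 20 negatives:

    # Minimal sketch: precision of a classifier that always predicts "positive"
    # on a test set with 100 positive and 20 negative samples.
    from sklearn.metrics import precision_score

    y_true = [1] * 100 + [0] * 20   # 100 positives, 20 negatives
    y_pred = [1] * 120              # trivial classifier: everything is positive

    print(precision_score(y_true, y_pred))  # 100 / 120 ~ 0.83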
Related
Say I have a binary classifier trained with an equal number of positive and negative examples, N of each. Now I try to add another N positive examples for training. What would be the effect of this?
What would be the effect of having a disproportionate number of training examples with respect to label type?
In general, it would bias your classification algorithm towards the positive examples. For optimal results, it is therefore important that your training dataset has the same proportion of positive/negative samples as your validation dataset (and the dataset that you will use in production later on).
The details, however, may depend on the type of algorithm that you are using and on whether the added positive samples are independent of the positive samples already present.
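As a rough illustration (a synthetic scikit-learn sketch, not tied to any particular algorithm from the question), duplicating the positive examples shifts the learned classifier towards predicting the positive class more often:

    # Synthetic sketch: adding extra positive examples biases predictions
    # towards the positive class.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.RandomState(0)
    X_pos = rng.normal(loc=+1.0, size=(1000, 1))
    X_neg = rng.normal(loc=-1.0, size=(1000, 1))
    X_eval = np.vstack([rng.normal(+1.0, size=(500, 1)),
                        rng.normal(-1.0, size=(500, 1))])  # balanced evaluation set

    def positive_rate(X_p, X_n):
        X = np.vstack([X_p, X_n])
        y = np.r_[np.ones(len(X_p)), np.zeros(len(X_n))]
        clf = LogisticRegression().fit(X, y)
        return clf.predict(X_eval).mean()   # fraction predicted positive

    print(positive_rate(X_pos, X_neg))                      # close to 0.5
    print(positive_rate(np.vstack([X_pos, X_pos]), X_neg))  # noticeably above 0.5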
I have a training dataset and a test dataset from two different sources. I mean they come from two different experiments, but both consist of the same kind of biological images. I want to do binary classification using a deep CNN, and I have the following results for test accuracy and training accuracy. The blue line shows training accuracy and the red line shows test accuracy over almost 250 epochs. Why is the test accuracy almost constant and not rising? Is that because the test and training datasets come from different distributions?
Edited:
After adding a dropout layer, regularization terms and mean subtraction, I still get the following strange results, which suggest the model is overfitting from the beginning!
There could be two reasons. First, you overfit the training data. This can be checked by using the validation score as a comparison metric against the test data. If so, you can use standard techniques to combat overfitting, like weight decay and dropout.
The second is that your data is too different to be learned like this. This is harder to solve. You should first look at the value spread of both image sets. Are they both normalized? Matplotlib normalizes plotted images automatically. If this still does not work, you might want to look into augmentation to make your training data more similar to the test data. Here I cannot tell you what to use without seeing both the training set and the test set.
Edit:
For normalization, the test set and the training set should have a similar value spread. If you do dataset normalization, you calculate the mean and standard deviation on the training set, but you then need to apply those same values to the test set rather than recomputing them from the test set itself. This only makes sense if the value spread is similar for both the training and the test set. If this is not the case, you might want to do per-sample normalization first.
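A minimal NumPy sketch of this (with placeholder random arrays standing in for the actual image data):

    # Compute normalization statistics on the training set only and reuse
    # them on the test set (placeholder data for illustration).
    import numpy as np

    X_train = np.random.rand(1000, 32, 32, 3)   # placeholder training images
    X_test = np.random.rand(200, 32, 32, 3)     # placeholder test images

    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0) + 1e-8            # avoid division by zero

    X_train_norm = (X_train - mean) / std
    X_test_norm = (X_test - mean) / std         # same train statistics, not recomputed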
Other augmentations that are commonly used for almost every dataset are oversampling, random channel shifts, random rotations, random translations and random zoom. This makes the model invariant to those operations.
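If you happen to be using Keras (an assumption; the question does not say which framework the CNN is built in), most of these augmentations can be expressed in a single ImageDataGenerator configuration:

    # Hypothetical Keras sketch of the augmentations mentioned above.
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    augmenter = ImageDataGenerator(
        rotation_range=15,          # random rotations (degrees)
        width_shift_range=0.1,      # random horizontal translation
        height_shift_range=0.1,     # random vertical translation
        zoom_range=0.1,             # random zoom
        channel_shift_range=10.0,   # random channel shifts
    )

    # During training, flow() yields augmented batches, e.g.
    # model.fit(augmenter.flow(X_train_norm, y_train, batch_size=32), epochs=50)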
I use VL-Feat and LIBLINEAR to handle a 2-category classification. The #(-)/#(+) ratio for the training set is 35.01 and the dimension of each feature vector is 3.6e5. I have around 15,000 examples.
I have set the weight of positive examples to 35.01 and left the weight of negative examples at the default of 1. But what I get is extremely poor performance on the test dataset.
So in order to find out the reason, I fed the training examples back in as input. What I see is that negative examples get slightly higher decision values than positive ones. That is really weird, right? I've checked the input to make sure I did not mislabel the examples, and I've normalized the histogram vectors.
Has anybody met this situation before?
Here are the parameters of the trained model. Parameters like bias, regularizer and dualityGap look suspicious to me, because they are so small that numerical accuracy could easily be lost.
model.info =
solver: 'sdca'
lambda: 0.0100
biasMultiplier: 1
bias: -1.6573e-14
objective: 1.9439
regularizer: 6.1651e-04
loss: 1.9432
dualObjective: 1.9439
dualLoss: 1.9445
dualityGap: -2.6645e-15
iteration: 43868
epoch: 2
elapsedTime: 228.9374
One thing that could be happening is that LIBSVM treats the label of the first example in the dataset as the positive class and the other label as the negative class. Since you have 35x more negatives than positives, it could be that your first example is negative and your classes are being inverted. How to check this? Make sure that the first data point in the training set belongs to the positive class.
I've checked the FAQ of LIBLINEAR and it seems this happens in LIBLINEAR as well (I'm not as familiar with LIBLINEAR):
http://www.csie.ntu.edu.tw/~cjlin/liblinear/FAQ.html (search for reversed)
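A quick way to check this (a plain-Python sketch; "train.txt" is a hypothetical name for your LIBSVM-format training file):

    # Print the label of the first training example, since LIBSVM/LIBLINEAR
    # treat the first label they encounter as the positive (+1) class.
    with open("train.txt") as f:              # hypothetical path to the training file
        first_label = f.readline().split()[0]

    print("first label in the file:", first_label)
    # If this is the negative label, either put a positive example first
    # or flip the sign of the decision values when interpreting them.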
I have a problem when I try to classify my data using LIBSVM. My training and test data are highly unbalanced. When I do the grid search for the SVM parameters and train my data with weights for the classes, the test accuracy is 96.8113%. But because the test data is unbalanced, all the correctly predicted values are from the negative class, which is larger than the positive class.
I have tried a lot of things, from changing the weights to changing the gamma and cost values, but my normalized accuracy (which takes both the positive and the negative class into account) gets lower with each try. Training on 50% positives and 50% negatives with the default grid.py parameters, I get a very low accuracy (18.4234%).
I want to know if the problem is in my description (how I build the feature vectors), in the unbalancing (should I use balanced data in another way?), or whether I should change my classifier.
Better data always helps.
I think that imbalance is part of the problem, but a more significant part of the problem is how you're evaluating your classifier. Evaluating accuracy given the distribution of positives and negatives in your data is pretty much useless. So is training on 50%/50% data and testing on data that is distributed 99% vs 1%.
There are problems in real life that are like the one you're studying (that have a great imbalance of positives to negatives). Let me give you two examples:
Information retrieval: given all documents in a huge collection, return the subset that is relevant to a search term q.
Face detection: given a large image, mark all locations where there are human faces.
Many approaches to these types of systems are classifier-based. To compare classifiers, a few tools are commonly used: ROC curves, precision-recall curves and the F-score. These tools give a more principled way to decide when one classifier is working better than another.
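For example, with scikit-learn (an illustration only, not your LIBSVM pipeline) these metrics can be computed from the classifier's decision values:

    # Threshold-free evaluation with ROC AUC, a precision-recall curve and
    # the F-score, instead of raw accuracy (placeholder scores for illustration).
    import numpy as np
    from sklearn.metrics import roc_auc_score, precision_recall_curve, f1_score

    y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])                 # imbalanced labels
    y_score = np.array([.1, .2, .15, .3, .05, .2, .4, .35, .8, .6])   # classifier scores

    print("ROC AUC:", roc_auc_score(y_true, y_score))
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    print("F1 at a 0.5 threshold:", f1_score(y_true, y_score >= 0.5))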
I use Naive Bayes from Weka to do text classification. I have two classes for my sentences, "Positive" and "Negative". I collected about 207 sentences with positive meaning and 189 sentences with negative meaning in order to create my training set.
When I run Naive Bayes with a test set that contains sentences with strongly negative meaning, such as ones containing the word "hate", the accuracy of the results is pretty good, about 88%. But when I use sentences with positive meaning, such as ones containing the word "love", as a test set, the accuracy is much worse, about 56%.
I think that this difference probably has something to do with my training set and especially its "Positive" sentences.
Can you think of any reason that could explain this difference? Or maybe a way to help me find out where the problem begins?
Thanks a lot for your time,
Nantia
Instead of creating test sets which contain only positive or only negative samples, I would just create a test set with mixed samples. You can then view the resulting confusion matrix in Weka, which lets you see how well both the positive and the negative samples were classified. Furthermore, I would use (10-fold) cross-validation to get a more stable measure of performance (once you have done this, you might want to edit your post with the confusion matrix and cross-validation results, and we might be able to help out more).
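In case it helps to see the idea outside Weka, here is an illustrative scikit-learn sketch (toy sentences, not your data) of 10-fold cross-validated predictions and the resulting confusion matrix:

    # 10-fold cross-validated predictions on a mixed set and the confusion matrix.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.model_selection import cross_val_predict
    from sklearn.metrics import confusion_matrix

    sentences = ["I love this", "great and lovely", "I hate this", "awful and bad"] * 10
    labels = [1, 1, 0, 0] * 10              # 1 = positive, 0 = negative

    X = CountVectorizer().fit_transform(sentences)
    y_pred = cross_val_predict(MultinomialNB(), X, labels, cv=10)
    print(confusion_matrix(labels, y_pred))   # rows = true class, columns = predicted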
It may be that your negative sentences have words that are more consistently present, whereas your positive sentences have more variation in the words that are present, or those words may also often appear in the negative sentences.
It is hard to give specific advice without knowing the size of your dictionary (i.e., number of attributes), size of your test set, etc. Since the Naive Bayes Classifier calculates the product of the probabilities of individual words being present or absent, I would take some of the misclassified positive examples and examine the conditional probabilities for both positive and negative classification to see why the examples are being misclassified.
To better understand how your classifier works, you can inspect its parameters to see which words the classifier thinks are the most predictive of a positive or negative sentence. Can you print out the top predictors for positive and negative cases?
e.g.,
top positive predictors:
p('love'|positive) = 0.05
p('like'|positive) = 0.016
...
top negative predictors:
p('hate'|negative) = 0.25
p('dislike'|negative) = 0.17
...
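If you end up doing this outside Weka, scikit-learn's MultinomialNB exposes the same information through feature_log_prob_; a small illustrative sketch (toy sentences again):

    # Print the words with the highest conditional probability for each class.
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    sentences = ["I love this", "great and lovely", "I hate this", "awful and bad"]
    labels = [1, 1, 0, 0]                   # 1 = positive, 0 = negative

    vec = CountVectorizer()
    nb = MultinomialNB().fit(vec.fit_transform(sentences), labels)

    words = np.array(vec.get_feature_names_out())
    # Rows of feature_log_prob_ follow nb.classes_, i.e. [0, 1] = [negative, positive].
    for row, class_name in zip(nb.feature_log_prob_, ["negative", "positive"]):
        top = np.argsort(row)[::-1][:3]     # indices of the 3 largest log-probabilities
        print("top", class_name, "predictors:", words[top])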