I read that when using CNNs, we should have approximately equal number of samples per class. I am doing binary classification, detecting pedestrians from background so the 2 classes are pedestrian and background (anything not pedestrian really).
If I were to incorporate hard negative mining in my training, I would end up with more negative samples than positive if I am getting a lot of false positives.
1) Would this be okay?
2) If not then how do I solve this issue?
3) And what are the consequences of training a CNN with more negative than positive samples?
4) If it is okay to have more negative than positive samples, is there a maximum limit that I should not exceed ? Like for eg. I should not have 3x more negative samples than positives.
5) I can augment my positive samples by jittering but how much additional samples per image should I create? Is there a 'too much'? Like if I start off with 2000 positive samples, how much additional samples is too much? Is generating a total of 100k samples from the 2k samples via jittering too much?
It depends on which cost function you use but if you set it to be a log_loss then I can show you how intuitively not balanced dataset may harm your training and what are the possible solutions for this problem:
a. If you don't change the distribution of your classes and leave them unbalanced then - if your model is able to achieve relatively small value of a loss function then it will not only be a good detector of a pedestrian on an image but also it will learn that pedestrian detection is a relatively rare event and it may prevent you from a lot of false positives. So if you are able to spend a lot more time on training a bigger model - it may bring you a really good results.
b. If you change the distribution of your classes - then you could probably achieve relatively good results with much smaller model in shorter time - but on the other hand - because of the fact that your classifier will learn different distribution - you may achieve a lot of False positives.
But - if the training phase of your classifier is not lasting too long - you may find a good compromise between these two methods. You may set a multiplication factor (e.g. if you will increase the number of samples by 2, 3 or n times) as meta parameter and optimise the value of it e.g. using grid search schema.
Related
I have a dataset with thousand of sentences belonging to a subject. I would like to know what would be best to create a classifier that will predict a text as "True" or "False" depending on whether they talk about that subject or not.
I've been using solutions with Weka (basic classifiers) and Tensorflow (neural network approaches).
I use string to word vector to preprocess the data.
Since there are no negative samples, I deal with a single class. I've tried one-class classifier (libSVM in Weka) but the number of false positives is so high I cannot use it.
I also tried adding negative samples but when the text to predict does not fall in the negative space, the classifiers I've tried (NB, CNN,...) tend to predict it as a false positive. I guess it's because of the sheer amount of positive samples
I'm open to discard ML as the tool to predict the new incoming data if necessary
Thanks for any help
I have eventually added data for the negative class and build a Multilineal Naive Bayes classifier which is doing the job as expected.
(the size of the data added is around one million samples :) )
My answer is based on the assumption that that adding of at least 100 negative samples for author’s dataset with 1000 positive samples is acceptable for the author of the question, since I have no answer for my question about it to the author yet
Since this case with detecting of specific topic is looks like particular case of topics classification I would recommend using classification approach with the two simple classes 1 class – your topic and another – all other topics for beginning
I succeeded with the same approach for face recognition task – at the beginning I built model with one output neuron with high level of output for face detection and low if no face detected
Nevertheless such approach gave me too low accuracy – less than 80%
But when I tried using 2 output neurons – 1 class for face presence on image and another if no face detected on the image, then it gave me more than 90% accuracy for MLP, even without using of CNN
The key point here is using of SoftMax function for the output layer. It gives significant increase of accuracy. From my experience, it increased accuracy of the MNIST dataset even for MLP from 92% up to 97% for the same model
About dataset. Majority of classification algorithms with a trainer, at least from my experience are more efficient with equal quantity of samples for each class in a training data set. In fact, if I have for 1 class less than 10% of average quantity for other classes it makes model almost useless for the detection of this class. So if you have 1000 samples for your topic, then I suggest creating 1000 samples with as many different topics as possible
Alternatively, if you don’t want to create a such big set of negative samples for your dataset, you can create a smaller set of negative samples for your dataset and use batch training with a size of batch = 2x your negative sample quantity. In order to do so, split your positive samples in n chunks with the size of each chunk ~ negative samples quantity and when train your NN by N batches for each iteration of training process with chunk[i] of positive samples and all your negative samples for each batch. Just be aware, that lower accuracy will be the price for this trade-off
Also, you could consider creation of more generic detector of topics – figure out all possible topics which can present in texts which your model should analyze, for example – 10 topics and create a training dataset with 1000 samples per each topic. It also can give higher accuracy
One more point about the dataset. The best practice is to train your model only with part of a dataset, for example – 80% and use the rest 20% for cross-validation. This cross-validation of unknown previously data for model will give you a good estimation of your model accuracy in real life, not for the training data set and allows to avoid overfitting issues
About building of model. I like doing it by "from simple to complex" approach. So I would suggest starting from simple MLP with SoftMax output and dataset with 1000 positive and 1000 negative samples. After reaching 80%-90% accuracy you can consider using CNN for your model, and also I would suggest increasing training dataset quantity, because deep learning algorithms are more efficient with bigger dataset
For text data you can use Spy EM.
The basic idea is to combine your positive set with a whole bunch of random samples, some of which you hold out. You initially treat all the random documents as the negative class, and train a classifier with your positive samples and these negative samples.
Now some of those random samples will actually be positive, and you can conservatively relabel any documents that are scored higher than the lowest scoring held out true positive samples.
Then you iterate this process until it stablizes.
I've read on the scikit-learn website that SVM is a good choice when the number of dimensions is greater than the number of samples.
I would like to know what you think (as experienced users) is the more efficient in these cases where the class to predict is binary.
And especially what to do when the number of labeled samples is about 50.
Algorithms that should work ? Things to care about ?
If you have dense data, and n < d, you have a high risk of overfitting: each dimension or differences of dimensions could uniquely identify your training records.
That doesn't mean it won't work, only that you have to be extra careful when evaluating your method, because of the high risk of overfitting.
Methods that only use one dimension at a time - such as decision trees - could be less affected. In particular, if you apply pruning techniques to your tree.
I have a data set with 150 rows, 45 features and 40 outputs. I can well overfit the data but I cannot obtain acceptable results for my cross validation set.
With 25 hidden layers and quite large number of iterations, I was able to get ~94% accuracy on my training set; put a smile on my face. But cross validation result turned out to be less than 15%.
So to mitigate overfitting I started playing with the regularization parameter (lambda) and also the number of hidden layers. The best result (CV) I could get was 24% on training set and 34% on the training set with lambda=1, 70 hidden layers and 14000 iterations. Increasing the number of iters also made it worse; I can't understand why I cannot improve CV results with increased lambda and iters?
Here is the lambda-hiddenLayer-iter combinations I have tried:
https://docs.google.com/spreadsheets/d/11ObRTg05lZENpjUj4Ei3CbHOh5mVzF7h9PKHq6Yn6T4/edit?usp=sharing
Any suggested way(s) of trying smarter regulationParameter-hiddenLayer-iters combinations? Or other ways of improving my NN? I using my matlab code from Andrew Ng's ML class (uses backpropagation algorithm.)
It's very hard to learn anything from just 150 training examples with 45 features (and if I read your question right, 40 possible output classes). You need far more labeled training examples if you want to learn a reasonable classifier - probably tens or hundreds of thousands if you do have 40 possible classes. Even for binary classification or regression, you likely need thousands of examples of you have 45 meaningful features.
Some suggestions:
overfitting occurs primarily when the structure of the neural network is too complex for the problem in hand. If the structure of the NN isn't too complex, increasing the number of iterations shouldn't decrease accuracy of prediction
70 hidden layers is quite a lot, you may try to dramatically decrease the number of hidden layers (to 3-15) and increase the number of iterations. It seems from your file that 15 hidden layers are fine in comparison to 70 hidden layers
while reducing the number of hidden layers you may vary the number of neurons in hidden layers (increase/decrease) and check how the results are changing
I agree with Logan. What you see in your dataset makes perfect sense. If you simply train a NN classifier with 45 features for 40 classes you will get great accuracy because you have more features than output-classes. So the model can basically "assign" each feature to one of the output classe, but the resulting model will be highly over-fitted and probably not represent whatever you are modeling. Your significantly lower cross-validation results seem to be right.
You should rethink your approach: Why do you have 40 classes? Maybe you can change your problem into a regressive problem instead of a classification problem? Also try to look into some other algorithms like Random Forrest for example. Or decrease the number of features significantly.
I have dataset which is built from 940 attributes and 450 instance and I'm trying to find the best classifier to get the best results.
I have used every classifier that WEKA suggest (such as J48, costSensitive, combinatin of several classifiers, etc..)
The best solution I have found is J48 tree with accuracy of 91.7778 %
and the confusion matrix is:
394 27 | a = NON_C
10 19 | b = C
I want to get better reuslts in the confution matrix for TN and TP at least 90% accuracy for each.
Is there something that I can do to improve this (such as long time run classifiers which scans all options? other idea I didn't think about?
Here is the file:
https://googledrive.com/host/0B2HGuYghQl0nWVVtd3BZb2Qtekk/
Please help!!
I'd guess that you got a data set and just tried all possible algorithms...
Usually, it is a good to think about the problem:
to find and work only with relevant features(attributes), otherwise
the task can be noisy. Relevant features = features that have high
correlation with class (NON_C,C).
your dataset is biased, i.e. number of NON_C is much higher than C.
Sometimes it can be helpful to train your algorithm on the same portion of positive and negative (in your case NON_C and C) examples. And cross-validate it on natural (real) portions
size of your training data is small in comparison with the number of
features. Maybe increasing number of instances would help ...
...
There are quite a few things you can do to improve the classification results.
First, it seems that your training data is severly imbalanced. By training with that imbalance you are creating a significant bias in almost any classification algorithm
Second, you have a larger number of features than examples. Consider using L1 and/or L2 regularization to improve the quality of your results.
Third, consider projecting your data into a lower dimension PCA space, say containing 90 % of the variance. This will remove much of the noise in the training data.
Fourth, be sure you are training and testing on different portions of your data. From your description it seems like you are training and evaluating on the same data, which is a big no no.
I'm with a problem when I try to classify my data using libsvm. My training and test data are highly unbalanced. When I do the grid search for the svm parameters and train my data with weights for the classes, the testing gives the accuracy of 96.8113%. But because the testing data is unbalanced, all the correct predicted values are from the negative class, which is larger than the positive class.
I tried a lot of things, from changing the weights until changing the gamma and cost values, but my normalized accuracy (which takes into account the positive classes and negative classes) is lower in each try. Training 50% of positives and 50% of negatives with the default grid.py parameters i have a very low accuracy (18.4234%).
I want to know if the problem is in my description (how to build the feature vectors), in the unbalancing (should i use balanced data in another way?) or should i change my classifier?
Better data always helps.
I think that imbalance is part of the problem. But a more significant part of the problem is how you're evaluating your classifier. Evaluating accuracy given the distribution of positives and negatives in your data is pretty much useless. So is training on 50% and 50% and testing on data that is distributed 99% vs 1%.
There are problems in real life that are like the one your studying (that have a great imbalance in positives to negatives). Let me give you two examples:
Information retrieval: given all documents in a huge collection return the subset that are relevant to search term q.
Face detection: this large image mark all locations where there are human faces.
Many approaches to these type of systems are classifier-based. To evaluate two classifiers two tools are commonly used: ROC curves, Precision Recall curves and the F-score. These tools give a more principled approach to evaluate when one classifier is working better than the another.