What would be the impact of adding a lot of positive examples to a binary classifier? - machine-learning

Say I have a binary classifier trained on N positive examples and N negative examples. Now I add another N positive examples to the training set. What would be the effect of this?
More generally, what is the effect of a training set whose label proportions are imbalanced?

In general, it would bias your classification algorithm towards the positive class. For best results, your training set should therefore have the same proportion of positive and negative samples as your validation set (and as the data you will see in production later on).
The details may, however, depend on the type of algorithm you are using and on whether the added positive samples are independent of the positive samples already present.
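One way to see where the bias comes from: a classifier that applies Bayes' rule typically estimates its class priors from the label frequencies in the training set. The sketch below is only an illustration under that assumption; the likelihood values and the function name posterior_positive are made up for the example.

```python
def posterior_positive(n_pos, n_neg, lik_pos, lik_neg):
    """P(positive | x) via Bayes' rule, with priors estimated from class counts."""
    prior_pos = n_pos / (n_pos + n_neg)
    prior_neg = 1.0 - prior_pos
    evidence = lik_pos * prior_pos + lik_neg * prior_neg
    return lik_pos * prior_pos / evidence

# Hypothetical likelihoods for an ambiguous input that slightly favors "negative".
lik_pos, lik_neg = 0.4, 0.6

balanced = posterior_positive(1000, 1000, lik_pos, lik_neg)  # N positives, N negatives
skewed = posterior_positive(2000, 1000, lik_pos, lik_neg)    # 2N positives, N negatives
print(round(balanced, 3), round(skewed, 3))
```

With the balanced counts the input stays below 0.5 and is labeled negative; with the extra positives the same input crosses 0.5 and is labeled positive, even though the evidence itself has not changed.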

Related

Machine Learning - Huge Only positive text dataset

I have a dataset with thousands of sentences belonging to a subject. I would like to know the best way to create a classifier that will predict a text as "True" or "False" depending on whether it talks about that subject or not.
I've been using solutions with Weka (basic classifiers) and Tensorflow (neural network approaches).
I use string to word vector to preprocess the data.
Since there are no negative samples, I am dealing with a single class. I've tried a one-class classifier (libSVM in Weka), but the number of false positives is so high that I cannot use it.
I also tried adding negative samples, but when the text to predict does not fall in the negative space, the classifiers I've tried (NB, CNN,...) tend to predict it as positive, giving a false positive. I guess it's because of the sheer amount of positive samples.
I'm open to discarding ML as the tool to predict the new incoming data if necessary.
Thanks for any help
I eventually added data for the negative class and built a Multinomial Naive Bayes classifier, which is doing the job as expected.
(the size of the data added is around one million samples :) )
My answer assumes that adding at least 100 negative samples to the author’s dataset of 1000 positive samples is acceptable, since the author has not yet answered my question about it.
Since detecting a specific topic looks like a particular case of topic classification, I would recommend starting with a two-class classification approach: one class for your topic and another for all other topics.
I succeeded with the same approach on a face recognition task. At first I built a model with one output neuron that gave a high output when a face was detected and a low output when no face was detected.
However, that approach gave me accuracy that was too low: less than 80%.
But when I used 2 output neurons, one class for a face present in the image and another for no face detected, I got more than 90% accuracy with an MLP, even without using a CNN.
The key point here is using the SoftMax function for the output layer. It gives a significant increase in accuracy. In my experience, it increased accuracy on the MNIST dataset from 92% up to 97% for the same MLP model.
About the dataset. Most supervised classification algorithms, at least in my experience, work better with an equal number of samples for each class in the training set. In fact, if one class has less than 10% of the average number of samples of the other classes, the model becomes almost useless for detecting that class. So if you have 1000 samples for your topic, I suggest creating 1000 negative samples covering as many different topics as possible.
Alternatively, if you don’t want to create such a big set of negative samples, you can create a smaller set and use batch training with a batch size of 2x the number of negative samples. To do so, split your positive samples into n chunks, each about the size of the negative set, and in each training iteration train your NN on n batches, where batch i consists of chunk[i] of the positive samples plus all of your negative samples. Just be aware that lower accuracy will be the price of this trade-off.
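A minimal sketch of that batching scheme, assuming samples are held in plain Python lists. The function name balanced_batches and the placeholder sample strings are hypothetical, and the actual training step is left out.

```python
def balanced_batches(positives, negatives):
    """Yield batches that pair one chunk of positives with ALL negatives."""
    if not negatives:
        raise ValueError("need at least one negative sample")
    chunk_size = len(negatives)  # each positive chunk is about as big as the negative set
    for start in range(0, len(positives), chunk_size):
        chunk = positives[start:start + chunk_size]
        # Each batch is a list of (sample, label) pairs: 1 = topic, 0 = other.
        yield [(x, 1) for x in chunk] + [(x, 0) for x in negatives]

positives = [f"pos_{i}" for i in range(1000)]  # placeholder samples
negatives = [f"neg_{i}" for i in range(100)]

batches = list(balanced_batches(positives, negatives))
print(len(batches), len(batches[0]))
```

Every batch is balanced, but the same negatives are reused in each one, which is one plausible reason for the accuracy cost mentioned above. In practice you would probably also shuffle each batch before feeding it to the network.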
Also, you could consider creating a more generic topic detector: figure out all the topics that can appear in the texts your model should analyze (for example, 10 topics) and create a training dataset with 1000 samples per topic. This can also give higher accuracy.
One more point about the dataset. The best practice is to train your model on only part of the dataset, for example 80%, and hold out the remaining 20% for validation. Validating on data the model has not seen before gives you a good estimate of your model's real-life accuracy, rather than its accuracy on the training set, and helps you detect overfitting.
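As a rough illustration of the 80/20 idea, here is a small hold-out split in plain Python. Splitting each class separately (stratifying) is an assumption added here so that both parts keep the same class proportions; the function name stratified_holdout and the placeholder documents are made up for the example.

```python
import random

def stratified_holdout(samples, labels, test_fraction=0.2, seed=0):
    """Split into train/test parts, keeping class proportions equal in both."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    train = [(samples[i], labels[i]) for i in train_idx]
    test = [(samples[i], labels[i]) for i in test_idx]
    return train, test

samples = [f"doc_{i}" for i in range(2000)]  # placeholder documents
labels = [1] * 1000 + [0] * 1000             # 1000 positive, 1000 negative

train, test = stratified_holdout(samples, labels)
print(len(train), len(test))
```

The model would be fitted on train only, and its accuracy reported on test.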
About building the model. I like a "from simple to complex" approach. So I would suggest starting from a simple MLP with SoftMax output and a dataset with 1000 positive and 1000 negative samples. After reaching 80%-90% accuracy you can consider using a CNN, and I would also suggest increasing the size of the training dataset, because deep learning algorithms work better with bigger datasets.
For text data you can use Spy EM.
The basic idea is to combine your positive set with a whole bunch of random samples, some of which you hold out. You initially treat all the random documents as the negative class, and train a classifier with your positive samples and these negative samples.
Now some of those random samples will actually be positive, and you can conservatively relabel any documents that are scored higher than the lowest-scoring held-out true positive sample.
Then you iterate this process until it stabilizes.

Machine Learning Experiment Design with Small Positive Sample Set in Sci-kit Learn

I am interested in any tips on how to train a classifier with a very limited positive set and a large negative set.
I have about 40 positive examples (quite lengthy articles about a particular topic), and about 19,000 negative samples (most drawn from the scikit-learn newsgroups dataset). I also have about 1,000,000 tweets that I could work with, all negative with respect to the topic I am trying to train on. Is the size of the negative set versus the positive set going to negatively influence training a classifier?
I would like to use cross-validation in scikit-learn. Do I need to break this into train / test-dev / test sets? I know there are some pre-built libraries in scikit-learn. Any implementation examples that you recommend or have used previously would be helpful.
Thanks!
The answer to your first question is yes; the amount by which it will affect your results depends on the algorithm. My advice would be to keep an eye on the class-based statistics such as recall and precision (found in classification_report).
For RandomForest() you can look at this thread, which discusses the sample weight parameter. In general, sample_weight is what you're looking for in scikit-learn.
For SVMs, have a look at either this example or this example.
For NB classifiers, this should be handled implicitly by Bayes' rule; in practice, however, you may see poor performance.
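To make the weighting idea concrete, here is a small sketch that derives per-class weights inversely proportional to class frequency, using the counts from the question (40 positive, 19,000 negative). The formula n_samples / (n_classes * class_count) is, as far as I know, the heuristic scikit-learn documents for class_weight="balanced"; the function name balanced_class_weights is hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

labels = [1] * 40 + [0] * 19000  # class sizes taken from the question

weights = balanced_class_weights(labels)
# One weight per training sample, in the same order as the labels.
sample_weight = [weights[y] for y in labels]
print(weights)
```

With these weights both classes contribute the same total weight to training. A per-sample list like sample_weight is the kind of value you would pass to an estimator's fit method, assuming the estimator accepts a sample_weight argument.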
For your second question, it's up for discussion. Personally, I break my data into a training and test split, perform cross-validation on the training set for parameter estimation, retrain on all the training data and then test on my test set. However, the amount of data you have may influence the way you split your data (more data means more options).
You could probably use Random Forest for your classification problem. There are basically 3 parameters for dealing with data imbalance: Class Weight, Samplesize and Cutoff.
Class Weight- The higher the weight a class is given, the more its error rate is decreased.
Samplesize- Oversample the minority class to reduce class imbalance while sampling the data for each tree (not sure if scikit-learn supports this; it used to be a param in R).
Cutoff- If more than x% of the trees vote for the minority class, classify the sample as minority class. By default x is 1/2 in Random Forest for a 2-class problem. You can set it to a lower value for the minority class.
Check out balancing predict error at https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
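The cutoff idea can be sketched without any library: count the fraction of trees voting for the minority class and compare it with a threshold. The vote list and the function name classify_with_cutoff are hypothetical. In scikit-learn the closest equivalent would presumably be thresholding the output of predict_proba yourself, but that is an assumption rather than something the answer states.

```python
def classify_with_cutoff(tree_votes, cutoff=0.5):
    """Return 1 (minority class) if the fraction of trees voting 1 exceeds cutoff."""
    fraction = sum(tree_votes) / len(tree_votes)
    return 1 if fraction > cutoff else 0

# Hypothetical forest of 100 trees: 30 vote "minority" (1), 70 vote "majority" (0).
votes = [1] * 30 + [0] * 70

default_label = classify_with_cutoff(votes)               # cutoff of 1/2
lowered_label = classify_with_cutoff(votes, cutoff=0.25)  # lower cutoff for the minority class
print(default_label, lowered_label)
```

Lowering the cutoff trades more minority-class recall for more false positives, so it is worth checking precision and recall after changing it.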
For the 2nd question, if you are using Random Forest, you do not need to keep a separate train/validation/test set. Random Forest does not choose any parameters based on a validation set, so a validation set is unnecessary.
Also, during the training of Random Forest, the data for training each individual tree is obtained by sampling with replacement from the training data, so each training sample is left out of roughly 1/3 of the trees. We can use the votes of those trees to compute the out-of-bag (OOB) prediction of the Random Forest for that sample. Thus with OOB accuracy you just need a training set, and not validation or test data, to estimate performance on unseen data. Check Out of Bag error at https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm for further study.
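The "roughly 1/3" figure can be checked with a quick simulation: draw a bootstrap sample of size n with replacement and count how many original samples were never drawn. The expected fraction is (1 - 1/n)^n, which approaches 1/e (about 0.368) for large n. The function name out_of_bag_fraction is made up for this sketch.

```python
import random

def out_of_bag_fraction(n_samples, seed=0):
    """Fraction of samples left out of one bootstrap sample of size n_samples."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n_samples) for _ in range(n_samples)}
    return 1 - len(drawn) / n_samples

frac = out_of_bag_fraction(10000)
limit = (1 - 1 / 10000) ** 10000  # analytic expectation, close to 1/e

print(round(frac, 3), round(limit, 3))
```

So each tree sees about 63% of the distinct training samples, and the remaining samples act as a built-in test set for that tree.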

Input matches no features in training set; how much more training data do I need?

I am new to text mining. I am working on a spam filter. I did text cleaning and removed stop words. n-grams are my features. I built a frequency matrix and trained a model using Naive Bayes. I have a very limited set of training data, so I am facing the following problem.
When a sentence comes to me for classification and if none of its features match with the existing features in training then my frequency vector has only zeros.
When I send this vector for classification, I obviously get a useless result.
What would be the ideal size of the training data to get better results?
Generally, the more data you have, the better. You will get diminishing returns at some point. It is often a good idea to see if your training set size is a problem by plotting the cross-validation performance while varying the size of the training set. scikit-learn has an example of this type of "learning curve."
Scikit-learn Learning Curve Example
You may consider bringing in outside sample posts to increase the size of your training set.
As you grow your training set, you may want to try reducing the bias of your classifier. This could be done by adding n-gram features, or switching to a logistic regression or SVM model.
When a sentence comes to me for classification and if none of its features match with the existing features in training then my frequency vector has only zeros.
You should normalize your input so that it forms some kind of rough distribution around 0. A common method is to apply this transformation:
input_signal = (feature - feature_mean) / feature_stddev
Then all zeroes would only happen if all features were exactly at the mean.
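A minimal sketch of that transformation on a tiny, made-up frequency matrix. The helper names fit_standardizer and standardize are hypothetical, and replacing a zero standard deviation with 1.0 is an assumption added to avoid division by zero for constant features.

```python
from statistics import mean, pstdev

def fit_standardizer(rows):
    """Compute per-feature mean and standard deviation from training rows."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # A constant feature has stddev 0; use 1.0 instead to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds

def standardize(row, means, stds):
    """Apply (feature - feature_mean) / feature_stddev to one row."""
    return [(x - m) / s for x, m, s in zip(row, means, stds)]

# Toy frequency matrix: 3 training sentences, 3 n-gram features.
train_rows = [[2, 0, 1], [0, 1, 1], [1, 2, 1]]
means, stds = fit_standardizer(train_rows)

unseen = standardize([0, 0, 0], means, stds)  # sentence matching no known feature
print(unseen)
```

An all-zero count vector no longer maps to all zeros after the transformation. Note that standardized features can be negative, which count-based models such as multinomial Naive Bayes do not accept, so this is better suited to models like logistic regression or an SVM.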

Suggestions to improve my normalized accuracy with libsvm

I have a problem when I try to classify my data using libsvm. My training and test data are highly unbalanced. When I do the grid search for the SVM parameters and train on my data with weights for the classes, testing gives an accuracy of 96.8113%. But because the testing data is unbalanced, all the correctly predicted values are from the negative class, which is larger than the positive class.
I tried a lot of things, from changing the weights to changing the gamma and cost values, but my normalized accuracy (which takes into account both the positive and negative classes) is lower on each try. Training on 50% positives and 50% negatives with the default grid.py parameters, I get a very low accuracy (18.4234%).
I want to know if the problem is in my description (how I build the feature vectors), in the imbalance (should I use balanced data in another way?), or whether I should change my classifier.
Better data always helps.
I think that imbalance is part of the problem. But a more significant part of the problem is how you're evaluating your classifier. Evaluating accuracy given the distribution of positives and negatives in your data is pretty much useless. So is training on 50% and 50% and testing on data that is distributed 99% vs 1%.
There are problems in real life that are like the one you're studying (that have a great imbalance of positives to negatives). Let me give you two examples:
Information retrieval: given all documents in a huge collection return the subset that are relevant to search term q.
Face detection: given this large image, mark all locations where there are human faces.
Many approaches to these types of systems are classifier-based. To compare two classifiers, three tools are commonly used: ROC curves, precision-recall curves and the F-score. These tools give a more principled way to decide when one classifier is working better than another.
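To show why plain accuracy misleads here, the sketch below computes accuracy, precision, recall, F1 and balanced accuracy for a hypothetical classifier that always predicts the negative class on a 3%/97% test set. The data and the function name binary_metrics are made up for illustration.

```python
def binary_metrics(y_true, y_pred):
    """Compute basic metrics for labels in {0, 1}, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "balanced_accuracy": (recall + specificity) / 2,
    }

y_true = [1] * 3 + [0] * 97  # 3 positives, 97 negatives
y_pred = [0] * 100           # classifier that always says "negative"

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Accuracy looks excellent, while recall and F1 for the positive class are zero and balanced accuracy is no better than chance.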

Good performance only for one class naive bayes

I use Naive Bayes from Weka to do text classification. I have two classes for my sentences, "Positive" and "Negative". I collected about 207 sentences with positive meaning and 189 sentences with negative meaning, in order to create my training set.
When I ran Naive Bayes with a test set that contains sentences with strong negative meaning, such as ones containing the word "hate", the accuracy of the results is pretty good, about 88%. But when I use sentences with positive meaning, such as ones containing the word "love", as a test set, the accuracy is much worse, about 56%.
I think that this difference probably has something to do with my training set and especially its "Positive" sentences.
Can you think of any reason that could explain this difference? Or maybe a way to help me find out where the problem begins?
Thanks a lot for your time,
Nantia
Instead of creating test sets which contain only positive or only negative samples, I would just create a test set with mixed samples. You can then view the resulting confusion matrix in Weka, which allows you to see how well both the positive and negative samples were classified. Furthermore, I would use (10-fold) cross-validation to get a more stable measure of the performance (once you have done this, you might want to edit your post with the confusion matrix from the cross-validation results and we might be able to help out more).
It may be that your negative sentences have words that are more consistently present, whereas your positive sentences have more variations in the words that are present or those words may also often be present in the negative sentences.
It is hard to give specific advice without knowing the size of your dictionary (i.e., number of attributes), size of your test set, etc. Since the Naive Bayes Classifier calculates the product of the probabilities of individual words being present or absent, I would take some of the misclassified positive examples and examine the conditional probabilities for both positive and negative classification to see why the examples are being misclassified.
To better understand how your classifier works, you can inspect the parameters to see which words the classifier thinks are the most predictive of a positive or negative sentence. Can you print out the top predictors for positive and negative cases?
e.g.,
top positive predictors:
p('love'|positive) = 0.05
p('like'|positive) = 0.016
...
top negative predictors:
p('hate'|negative) = 0.25
p('dislike'|negative) = 0.17
...
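A listing like the one above could be produced along these lines. This is only a sketch on a toy corpus, not Weka's implementation: it estimates p(word|class) with add-one (Laplace) smoothing and ranks words by the ratio of the two class likelihoods, which is an assumption about what "top predictor" should mean. The sentences and function names are made up.

```python
from collections import Counter

def word_likelihoods(sentences, vocabulary, alpha=1.0):
    """Estimate p(word | class) with additive smoothing over a shared vocabulary."""
    counts = Counter(w for s in sentences for w in s.lower().split())
    denom = sum(counts.values()) + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / denom for w in vocabulary}

def top_predictors(p_target, p_other, k=2):
    """Rank words by how much more likely they are in the target class."""
    ranked = sorted(p_target, key=lambda w: p_target[w] / p_other[w], reverse=True)
    return [(w, p_target[w]) for w in ranked[:k]]

# Toy training sentences (hypothetical).
positive = ["i love this", "i like this a lot"]
negative = ["i hate this", "i dislike this a lot"]
vocabulary = sorted({w for s in positive + negative for w in s.lower().split()})

p_pos = word_likelihoods(positive, vocabulary)
p_neg = word_likelihoods(negative, vocabulary)

top_pos = top_predictors(p_pos, p_neg)
top_neg = top_predictors(p_neg, p_pos)
print("top positive predictors:", top_pos)
print("top negative predictors:", top_neg)
```

Words such as "i" and "this" are frequent in both classes and drop out of the ranking, which is why the ratio is used here instead of the raw probability.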

Resources