Which class to predict for imbalanced data? - machine-learning

For machine learning binary classification problems with imbalanced classes, does it matter which class is considered the positive class? So if class A is the majority class, by convention do I want to predict that or the minority class (class B)? Does it even matter?

Technically it does not matter, but the choice depends on your underlying problem. For example, if you are building a classifier for a medical test, where positive corresponds to 'disease is present' and we assume that positive samples are the minority, you probably want to predict the probability that a person is sick, i.e. belongs to the minority class. By convention, the class of interest (often the minority) is usually labelled positive.
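As a small illustration, per-class metrics such as recall do change depending on which class you call positive. This is only a sketch; scikit-learn is my assumption and is not mentioned in the question.

# Minimal sketch: the choice of positive class matters for per-class metrics.
from sklearn.metrics import recall_score

y_true = ["A", "A", "A", "A", "B"]   # class A is the majority, class B the minority
y_pred = ["A", "A", "A", "A", "A"]   # a naive model that always predicts the majority class

print(recall_score(y_true, y_pred, pos_label="A"))  # 1.0 -> looks perfect
print(recall_score(y_true, y_pred, pos_label="B"))  # 0.0 -> every minority case is missed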

Related

Is Naive Bayes biased?

I have a use case where text needs to be classified into one of three categories. I started with Naive Bayes [Apache OpenNLP, Java], but I was informed that the algorithm is biased, meaning that if my training data has 60% of the data as classA, 30% as classB and 10% as classC, then the algorithm tends to be biased towards classA and thus predicts texts of the other classes as classA.
If this is true, is there a way to overcome this issue?
There are other algorithms that I came across, like an SVM classifier or logistic regression (maximum entropy model), but I am not sure which will be more suitable for my use case. Please advise.
Is there a way to overcome this issue?
Yes, there is. But first you need to understand why it happens.
Basically, your dataset is imbalanced.
An imbalanced dataset is one in which some classes have many more instances than others; in other words, the number of observations is not the same for all the classes in a classification dataset.
In this scenario, your model becomes biased towards the class with the majority of samples, because you have more training data for that class.
Solutions
Under-sampling:
Randomly remove samples from the majority class to balance the dataset.
Over-sampling:
Add more samples of the minority classes (e.g. by duplication or synthesis) to balance the dataset.
Change the performance metric:
Use the F1-score, recall or precision to measure the performance of your model instead of plain accuracy.
There are a few more solutions; if you want to know more, refer to this blog.
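For illustration, here is a rough sketch of random over-sampling using plain NumPy; the helper function and its name are mine, not from the post, and libraries such as imbalanced-learn provide ready-made RandomOverSampler, RandomUnderSampler and SMOTE implementations.

import numpy as np

def random_oversample(X, y, random_state=0):
    # Duplicate minority-class rows (sampling with replacement) until every
    # class has as many rows as the majority class.
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    X_parts, y_parts = [], []
    for cls, cnt in zip(classes, counts):
        idx = np.where(y == cls)[0]
        extra = rng.choice(idx, size=n_max - cnt, replace=True)
        keep = np.concatenate([idx, extra])
        X_parts.append(X[keep])
        y_parts.append(y[keep])
    return np.concatenate(X_parts), np.concatenate(y_parts)

Apply this (or any resampling) to the training split only, never to the test data.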
There are other algorithms that I came across, like an SVM classifier or logistic regression (maximum entropy model), but I am not sure which will be more suitable for my use case.
You will never know unless you try; I would suggest you try 3-4 different algorithms on your data.

Balance classes in cross validation

I would like to build a GBM model with H2O. My data set is imbalanced, so I am using the balance_classes parameter. For grid search (parameter tuning) I would like to use 5-fold cross validation. I am wondering how H2O deals with class balancing in that case. Will only the training folds be rebalanced? I want to be sure the test-fold is not rebalanced.
In class-imbalance settings, artificially balancing the test/validation set does not make any sense: these sets must remain realistic, i.e. you want to test your classifier's performance in the real-world setting where, say, the negative class includes 99% of the samples, in order to see how well your model does at predicting the 1% positive class of interest without too many false positives. Artificially inflating the minority class or reducing the majority one will lead to performance metrics that are unrealistic, bearing no relation to the real-world problem you are trying to solve.
For corroboration, here is Max Kuhn, creator of the caret R package and co-author of the (highly recommended) Applied Predictive Modeling textbook, in Chapter 11: Subsampling For Class Imbalances of the caret ebook:
You would never want to artificially balance the test set; its class frequencies should be in-line with what one would see “in the wild”.
Re-balancing makes sense only in the training set, so as to prevent the classifier from simply and naively classifying all instances as negative for a perceived accuracy of 99%.
Hence, you can rest assured that in the setting you describe the rebalancing takes action only for the training set/folds.
Another way to force balancing is to use a weights column that assigns different weights to the different classes; in H2O this is the weights_column parameter.
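As a sketch of how this might look with H2O's Python API (the file name, column names and parameter values below are placeholders, not taken from the question):

import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()
train = h2o.import_file("train.csv")        # placeholder path
train["label"] = train["label"].asfactor()  # make the target categorical

gbm = H2OGradientBoostingEstimator(
    nfolds=5,              # 5-fold cross validation for the grid search
    balance_classes=True,  # per the answer above, rebalancing applies to the training folds only
    seed=42,
)
gbm.train(x=[c for c in train.columns if c != "label"], y="label", training_frame=train)

Alternatively, a per-row weight column can be passed via the weights_column parameter instead of balance_classes.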

How to interpret scored probabilities in machine learning classification algorithm?

I am using two neural networks for two-class text classification. I'm getting 90% accuracy on test data. I am also using different performance metrics like precision, recall, F-score and a confusion matrix to make sure that the model is performing as expected.
In the predictive experiment using the trained model, I'm fetching probabilities for each prediction. The output looks as follows (I can't provide code; it's implemented in Azure ML Studio).
For example (class 1 probability, class 2 probability -> predicted class):
class 1 (0.99), class 2 (0.01) -> class 1
class 1 (0.53), class 2 (0.47) -> class 1
class 1 (0.20), class 2 (0.80) -> class 2
As per my understanding so far, by looking at the probability we can tell how confident the model is about its prediction. And 90% accuracy means that out of 100 records, 10 predictions could go wrong.
Now my question is: by looking at the probability (confidence), can we tell which bucket the current record falls into, the 90% (correct predictions) or the 10% (wrong predictions)?
What I'm trying to achieve is to give the end user some metric telling them that a given prediction is probably wrong, so they might want to change it to some other class before using the results.
90% accuracy means that out of 100 records, 10 predictions could go wrong.
It is not exactly like that; accuracy is always (although implicitly) linked to the specific test set we have used to measure it: so, 90% means that out of 100 records our classifier indeed misclassified 10 (i.e. there is no "could").
What we hope for in machine learning is that the performance of our models on new, unseen data will be comparable to that on our test set (which, as far as the training of our model is concerned, is also unseen). Roughly speaking, provided that our new data come from the same statistical distribution as our training & test sets, this is not an unreasonable expectation.
What I'm trying to achieve is to give the end user some metric telling them that a given prediction is probably wrong, so they might want to change it to some other class before using the results.
Intuitively, you should already know the answer to this: interpreting the returned probabilities as confidence is, at least in principle, not an invalid interpretation, and their values tell you something about how "certain" your model is about its answers. So what you could do is provide the end users with these probability values; in your example, a record predicted as "Question" with probability 0.97 is indeed qualitatively not the same as one predicted as "Question" with probability ~0.50.
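One way to act on this is to flag low-confidence predictions for human review. Below is a small sketch with made-up numbers; the 0.75 cut-off is an arbitrary threshold you would tune on validation data, not something taken from Azure ML Studio.

import numpy as np

# Scored probabilities per class, as in the example above.
probs = np.array([[0.99, 0.01],
                  [0.53, 0.47],
                  [0.20, 0.80]])

predicted = probs.argmax(axis=1)      # 0 -> class 1, 1 -> class 2
confidence = probs.max(axis=1)
needs_review = confidence < 0.75      # flag low-confidence predictions for the end user

for cls, conf, flag in zip(predicted, confidence, needs_review):
    print(f"class {cls + 1}  confidence={conf:.2f}  review={'yes' if flag else 'no'}")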

Use Generative or Discriminative model for classification?

Beginner at machine learning here! I'd just like to get a sense of how I should approach a classification problem. Given that the problem at hand is to classify whether an object belongs to class A or class B, I am wondering whether I should use a generative or a discriminative model. I have 2 questions.
A discriminative model seems to do a better job at classification problems because it is purely concerned with how the decision boundary is drawn and nothing else.
Q: However, with a small dataset of around 80 class A objects and fewer than 10 class B objects to train and test on, would a discriminative model overfit, and would a generative model therefore perform better?
Also, with a very large difference between the number of class A objects and the number of class B objects, the trained model is likely to only pick up on class A objects. Even if the model classifies all objects as class A, this would still result in a very high accuracy score.
Q: Any ideas on how to reduce this bias, given that there is no other way of increasing the size of class B's dataset?
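One common mitigation when class B cannot be enlarged is to weight the classes inversely to their frequency. Here is a rough sketch with scikit-learn (my assumption, not mentioned in the question), using synthetic data that merely mimics the roughly 80-vs-10 split described above.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for ~80 class A objects vs ~10 class B objects.
X, y = make_classification(n_samples=90, weights=[0.89, 0.11], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)

# Judge by minority-class recall rather than raw accuracy.
print("plain   :", recall_score(y_te, plain.predict(X_te), pos_label=1))
print("weighted:", recall_score(y_te, weighted.predict(X_te), pos_label=1))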

Rearranging 6-class model to 6 2-class models. Will it give any improvement?

I have an SVM model with 6 classes and 19 features. It works well: 95% accuracy.
I'm evaluating how to get the last 5%. My idea is to create other models with other features or training instances.
Another idea is to rearrange the existing model from 6 classes to 6 models each with 2 classes, where one class is positive and the other 5 classes are negative. The features will remain the same. Will it bring any new classification results, or is it just a redundant model?
Thank you!
My idea is to create other models with other features or training instances.
Yes, it's a good idea. Check the performance of other models on your data.
Another idea is to rearrange the existing model from 6 classes to 6 models each with 2 classes, where one class is positive and the other 5 classes are negative.
SVM is inherently a binary classifier; a multiclass SVM classifier internally uses either one-vs-all or one-vs-one. What you are suggesting is one-vs-all. libsvm uses the one-vs-one technique. You can use one-vs-all, but this usually doesn't improve accuracy, since one-vs-one already uses a larger number of classifiers.
SVM is only actually capable of doing binary classification. The multi-class adaptation uses several models and votes on what the class should be in a one-vs-one scheme.
Quick example:
class1 vs class2
class2 vs class3
class1 vs class3
would all be used in a 3-class SVM; the models would then vote on what class an observation should be. One-vs-all is another popular way to use SVM in a multi-class scenario. To answer your question: that is already, more or less, what is going on behind the scenes. It is possible that building even more models could improve your accuracy by a small margin, so it's worth a shot if you want to see whether it helps.
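For completeness, here is a sketch of both decompositions with scikit-learn's SVC; the digits dataset is just a stand-in for the 6-class, 19-feature problem, so the accuracies will obviously differ on your data.

from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

ovo = OneVsOneClassifier(SVC(kernel="rbf", gamma="scale"))   # k*(k-1)/2 binary models, majority vote
ovr = OneVsRestClassifier(SVC(kernel="rbf", gamma="scale"))  # k one-vs-all binary models

print("one-vs-one :", cross_val_score(ovo, X, y, cv=5).mean())
print("one-vs-rest:", cross_val_score(ovr, X, y, cv=5).mean())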
