Is there any difference between a skewed class and an imbalanced class in machine learning, or are they the same thing under different names?
Basically, they mean the same thing and are commonly used in the same context.
In machine learning, skewed classes means that the observations in the dataset belong to one of the two classes far more often than to the other. For example, in a cancer classification problem, the people who have cancer (y = 1) may make up 1% of the data, while the people who do not have cancer (y = 0) make up 99%. There is an imbalance between the classes in the dataset.
So skewed classes and imbalanced classes are one and the same.
I'm working on a binary classification problem. I used the logistic regression and support vector machine models from sklearn. Both models were fit on the same imbalanced training data, with class weights adjusted, and they achieved comparable performance. When I used these two trained models to predict on a new dataset, the LR and SVM models predicted a similar number of instances as positive, and the predicted positives overlap heavily.
However, when I looked at the probability scores of the instances classified as positive, the scores from LR range from 0.5 to 1, while those from the SVM start at around 0.1. I called model.predict(prediction_data) to find the instances predicted as each class and model.predict_proba(prediction_data) to get the probability scores for class 0 (neg) and class 1 (pos), and I assumed both models use a default threshold of 0.5.
There is no error in my code, and I have no idea why the SVM also labels instances with probability scores < 0.5 as positive. Any thoughts on how to interpret this situation?
That's a known behavior in sklearn for binary classification problems with SVC(), and it is reported, for instance, in these GitHub issues (here and here). It is also documented in the User Guide, which says:
In addition, the probability estimates may be inconsistent with the scores:
the “argmax” of the scores may not be the argmax of the probabilities; in binary classification, a sample may be labeled by predict as belonging to the positive class even if the output of predict_proba is less than 0.5; and similarly, it could be labeled as negative even if the output of predict_proba is more than 0.5.
or directly in the libsvm FAQ, which says:
Let's just consider two-class classification here. After probability information is obtained in training, we do not have prob >= 0.5 if and only if decision value >= 0.
All in all, the point is that:
On one side, predictions are based on decision_function values: if the decision value computed on a new instance is positive, the predicted class is the positive class, and vice versa.
On the other side, as stated in one of the GitHub issues, np.argmax(self.predict_proba(X), axis=1) != self.predict(X), which is where the inconsistency comes from. In other words, to always have consistency on binary classification problems you would need a classifier whose predictions are based on the output of predict_proba() (which is, by the way, what you get when using calibrators), like so:
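    def predict(self, X):
        # Derive the label from the probabilities so that both always agree.
        # argmax returns column indices, which equal the labels when classes are 0 and 1.
        y_proba = self.predict_proba(X)
        return np.argmax(y_proba, axis=1)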
I'd also suggest this post on the topic.
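To make the inconsistency described above concrete, here is a minimal sketch on synthetic data. The dataset, kernel and seeds are assumptions made for illustration and are not taken from the question. It checks that predict() follows the sign of decision_function, and counts how often the argmax of predict_proba disagrees with it.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Hypothetical imbalanced toy data (roughly 90% negatives, 10% positives).
X, y = make_classification(
    n_samples=400, n_features=10, weights=[0.9, 0.1], random_state=0
)

# probability=True fits an extra Platt-scaling step on top of the SVM scores.
clf = SVC(kernel="rbf", class_weight="balanced", probability=True, random_state=0)
clf.fit(X, y)

scores = clf.decision_function(X)   # signed distance-like scores
proba = clf.predict_proba(X)        # calibrated probabilities
labels = clf.predict(X)             # what predict() actually returns

# predict() follows the sign of the decision function ...
labels_from_scores = clf.classes_[(scores > 0).astype(int)]
# ... while thresholding the probabilities at 0.5 may give other labels.
labels_from_proba = clf.classes_[np.argmax(proba, axis=1)]

n_disagree = int(np.sum(labels != labels_from_proba))
print("predict() matches sign of decision_function:",
      np.array_equal(labels, labels_from_scores))
print("samples where predict() and argmax(predict_proba) disagree:", n_disagree)
```

The number of disagreements depends on the data and may well be zero on an easy problem; the point is only that nothing forces it to be zero.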
We are attempting to implement multi-label classification using a CNN in PyTorch. We have 8 labels and around 260 images, using a 90/10 split for the train/validation sets.
The classes are highly imbalanced, with the most frequent class occurring in over 140 images. On the other hand, the least frequent class occurs in fewer than 5 images.
We initially tried the BCEWithLogitsLoss function, which led to the model predicting the same label for all images.
We then implemented a focal loss approach to handle class imbalance as follows:
import torch.nn as nn
import torch

class FocalLoss(nn.Module):
    def __init__(self, alpha=1, gamma=2):
        super(FocalLoss, self).__init__()
        self.alpha = alpha
        self.gamma = gamma

    def forward(self, outputs, targets):
        bce_criterion = nn.BCEWithLogitsLoss()
        bce_loss = bce_criterion(outputs, targets)
        pt = torch.exp(-bce_loss)
        focal_loss = self.alpha * (1 - pt) ** self.gamma * bce_loss
        return focal_loss
This resulted in the model predicting empty sets (no labels) for every image, since it could not reach a confidence greater than 0.5 for any class.
Is there an approach in PyTorch to help address this situation?
There are basically three ways of dealing with this:
1. Discard data from the more common class
2. Weight minority class loss values more heavily
3. Oversample the minority class
Option 1 is implemented by selecting the files you include in your Dataset.
Option 2 is implemented with the pos_weight parameter of BCEWithLogitsLoss.
Option 3 is implemented with a custom Sampler passed to your DataLoader.
For deep learning, oversampling typically works best.
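As a rough sketch of options 2 and 3, the snippet below builds a per-label pos_weight and a WeightedRandomSampler on random stand-in data. The data, the linear placeholder model and the per-sample weighting heuristic are all assumptions made for illustration; in a multi-label setting there is no single correct way to assign a sampling weight to an image.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

torch.manual_seed(0)

# Hypothetical toy data: 260 samples, 8 labels with very different frequencies.
num_samples, num_labels = 260, 8
label_probs = torch.tensor([0.55, 0.40, 0.30, 0.20, 0.10, 0.05, 0.03, 0.02])
targets = (torch.rand(num_samples, num_labels) < label_probs).float()
features = torch.randn(num_samples, 16)  # stand-in for image features

# Option 2: weight the positives of each label by (#negatives / #positives).
pos_count = targets.sum(dim=0).clamp(min=1)  # avoid division by zero
pos_weight = (num_samples - pos_count) / pos_count
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

# Option 3: oversample images that carry rare labels.
# Heuristic (an assumption): a sample's weight is the largest inverse label
# frequency among its labels; samples without labels get the smallest weight.
inv_freq = 1.0 / pos_count
sample_weights = (targets * inv_freq).max(dim=1).values
sample_weights = torch.where(sample_weights > 0, sample_weights, inv_freq.min())
sampler = WeightedRandomSampler(
    sample_weights, num_samples=num_samples, replacement=True
)
loader = DataLoader(TensorDataset(features, targets), batch_size=32, sampler=sampler)

model = nn.Linear(16, num_labels)  # placeholder for the CNN
batch_x, batch_y = next(iter(loader))
loss = criterion(model(batch_x), batch_y)
print("pos_weight per label:", pos_weight)
print("loss on one oversampled batch:", loss.item())
```

Combining both options at once can overcorrect, so it may be worth trying them one at a time and checking per-label precision and recall on the validation set.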
I want to estimate how well classifiers would work on an imbalanced dataset of mine. When I try to fit the KNN classifier from sklearn, it learns nothing for the minority class. So I fit the classifier with k = R (where 1:R is the imbalance ratio), predict probabilities for each test point, and assign a point to the minority class if the classifier's probability output for the minority class is greater than R. I do this to get an estimate of how the classifier performs (F1-score). I don't need the classifier in production. Is what I'm doing right?
Since you mentioned in the comments that you don't want to use resampling, one way out is batching. Split your majority class into multiple datasets so that each has a 1:1 ratio with the minority class. Train multiple models, each getting one part of the majority set and all of the minority set. Then make a prediction with all the models and take a vote among them to decide the final outcome.
But I would suggest using SMOTE over this method.
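The batching idea above could be sketched roughly as follows. The synthetic data, the choice of KNN with 5 neighbours as the base model, and the majority-vote rule are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical imbalanced data: about 90% class 0 and 10% class 1.
X, y = make_classification(
    n_samples=1100, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

rng = np.random.default_rng(0)
minority_idx = np.flatnonzero(y_train == 1)
majority_idx = rng.permutation(np.flatnonzero(y_train == 0))

# Split the majority class into chunks of roughly the minority size (about 1:1).
n_chunks = max(1, len(majority_idx) // len(minority_idx))
models = []
for chunk in np.array_split(majority_idx, n_chunks):
    idx = np.concatenate([chunk, minority_idx])  # one chunk + all minority points
    models.append(KNeighborsClassifier(n_neighbors=5).fit(X_train[idx], y_train[idx]))

# Majority vote over the ensemble; ties go to the minority class here.
votes = np.stack([m.predict(X_test) for m in models])
y_pred = (votes.mean(axis=0) >= 0.5).astype(int)
print("models trained:", len(models))
print("F1 on the minority class:", f1_score(y_test, y_pred))
```

The test set keeps its original imbalance, so the F1-score reflects how the ensemble would behave on data distributed like the real problem.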
For machine learning binary classification problems with imbalanced classes, does it matter which class is considered the positive class? So if class A is the majority class, by convention do I want to predict that or the minority class (class B)? Does it even matter?
Technically it does not matter, but the natural choice depends on your underlying problem. For example, if you want to classify a medical test where positive corresponds to 'disease is present', and we assume that positive samples are the minority, you probably want to predict how high the probability is that a person is sick, i.e. belongs to the minority.
Beginner at machine learning here! I'd just like to get a sense of how I should approach a classification problem. Given that the problem at hand is to classify whether an object belongs to class A or class B, I am wondering whether I should use a generative or a discriminative model. I have 2 questions.
A discriminative model seems to do a better job at classification problems because it is purely concerned with how the decision boundary is drawn and nothing else.
Q: However, with a small dataset of around 80 class A objects and fewer than 10 class B objects to train and test on, would a discriminative model overfit, so that a generative model would perform better?
Also, with a very large difference between the number of class A objects and class B objects, the trained model is likely to pick up only on class A objects. Even if the model classifies all objects as class A, this would still result in a very high accuracy score.
Q: Any ideas on how to reduce this bias, given that there is no way of increasing the size of class B's dataset?
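The accuracy concern raised above can be checked with a few lines. This is only an illustration: it assumes exactly 80 class A objects and 10 class B objects (the question says fewer than 10) and a trivial model that always predicts class A.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# 0 = class A (80 objects), 1 = class B (10 objects, an assumed count).
y_true = np.array([0] * 80 + [1] * 10)
y_pred = np.zeros_like(y_true)  # trivial model: always predict class A

accuracy = accuracy_score(y_true, y_pred)
balanced = balanced_accuracy_score(y_true, y_pred)
recall_b = recall_score(y_true, y_pred, pos_label=1)

print(f"accuracy:          {accuracy:.3f}")
print(f"balanced accuracy: {balanced:.3f}")
print(f"recall on class B: {recall_b:.3f}")
```

Accuracy comes out at 80/90, or about 0.889, even though no class B object is ever found, while balanced accuracy (0.5) and class B recall (0) expose the problem.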