We are attempting to implement multi-label classification using a CNN in PyTorch. We have 8 labels and around 260 images, with a 90/10 train/validation split.
The classes are highly imbalanced: the most frequent class occurs in over 140 images, while the least frequent occurs in fewer than 5.
We initially tried BCEWithLogitsLoss, which led to the model predicting the same label for all images.
We then implemented a focal loss approach to handle class imbalance as follows:
import torch
import torch.nn as nn

class FocalLoss(nn.Module):
    def __init__(self, alpha=1, gamma=2):
        super(FocalLoss, self).__init__()
        self.alpha = alpha
        self.gamma = gamma

    def forward(self, outputs, targets):
        # Compute per-element BCE (reduction='none') so the focal term
        # weights each label's loss individually rather than scaling
        # the batch mean.
        bce_loss = nn.functional.binary_cross_entropy_with_logits(
            outputs, targets, reduction='none')
        pt = torch.exp(-bce_loss)
        focal_loss = self.alpha * (1 - pt) ** self.gamma * bce_loss
        return focal_loss.mean()
This resulted in the model predicting empty sets (no labels) for every image, since it never reached greater than 0.5 confidence for any class.
Is there an approach in PyTorch to help address this situation?
There are basically three ways of dealing with this:
Discard data from the more common class
Weight minority class loss values more heavily
Oversample the minority class
Option 1 is implemented by selecting the files you include in your Dataset.
Option 2 is implemented with the pos_weight parameter for BCEWithLogitsLoss
Option 3 is implemented with a custom Sampler passed to your DataLoader
For deep learning, oversampling typically works best; a sketch of options 2 and 3 follows.
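Here is a minimal sketch of options 2 and 3, assuming your 8-label multi-hot setup; the label counts and tensor names are illustrative stand-ins, not taken from your data:

import torch
from torch.utils.data import WeightedRandomSampler

# Option 2: pos_weight for BCEWithLogitsLoss.
# label_counts[c] = number of training images containing class c (illustrative).
label_counts = torch.tensor([140., 90., 60., 40., 25., 15., 8., 4.])
num_images = 260
# Ratio of negatives to positives per class; rare classes get large weights.
pos_weight = (num_images - label_counts) / label_counts
criterion = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)

# Option 3: oversample rare images with a WeightedRandomSampler.
# `train_labels` is an (N, 8) multi-hot tensor (illustrative stand-in).
train_labels = torch.randint(0, 2, (num_images, 8)).float()
inv_freq = 1.0 / label_counts
# Weight each image by the inverse frequency of its rarest label.
sample_weights = (train_labels * inv_freq).max(dim=1).values.clamp(min=1e-8)
sampler = WeightedRandomSampler(sample_weights, num_samples=num_images, replacement=True)
# loader = DataLoader(train_dataset, batch_size=16, sampler=sampler)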
Related
In many training datasets, the class distribution is skewed, i.e. one class is very frequent (e.g. 95%) while another is rare (e.g. 5%). In such applications, the naive classifier that assigns all test cases to the majority class achieves very high accuracy, yet it never predicts the minority class, which is often the more important one.
So are there ways to improve accuracy on the minority class just by modifying the training dataset, or do we have to modify the classification algorithm as well?
This is the problem of imbalanced classification. Below are the methods commonly used to treat imbalanced datasets:
1. Undersampling
2. Oversampling
3. Synthetic Data Generation
4. Cost-Sensitive Learning
1. Undersampling
This method works with the majority class. It reduces the number of observations from the majority class to make the dataset balanced. It is best used when the dataset is huge and reducing the number of training samples helps improve run time and storage requirements.
Undersampling methods are of two types: random and informative.
2. Oversampling
This method works with the minority class. It replicates observations from the minority class to balance the data; it is also known as upsampling. Like undersampling, it can be divided into two types: random oversampling and informative oversampling.
3. Synthetic Data Generation
In simple words, instead of replicating and adding observations from the minority class, this overcomes the imbalance by generating artificial data. It is also a type of oversampling technique.
Regarding synthetic data generation, the synthetic minority oversampling technique (SMOTE) is a powerful and widely used method. The SMOTE algorithm creates artificial data based on feature-space (rather than data-space) similarities among minority samples. Put another way, it generates a random set of minority-class observations to shift the classifier's learning bias towards the minority class.
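The guide linked at the end of this answer is R-based; as a rough Python illustration, the imbalanced-learn package provides SMOTE (the toy dataset below is made up):

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Toy 95%/5% imbalanced dataset, purely illustrative.
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
X_resampled, y_resampled = SMOTE(random_state=0).fit_resample(X, y)
# The minority class is synthetically grown until the classes are balanced.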
4. Cost-Sensitive Learning (CSL)
This is another commonly used method to handle classification problems with imbalanced data. In simple words, this method evaluates the cost associated with misclassifying observations.
It does not create a balanced data distribution. Instead, it addresses the imbalanced learning problem by using cost matrices that describe the cost of misclassification in a particular scenario. Recent research has shown that cost-sensitive learning often outperforms sampling methods, so it provides a viable alternative; a rough illustration follows.
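A minimal Python sketch of the idea: many scikit-learn estimators accept a class_weight mapping that acts as a simple misclassification cost (the 10x cost below is made up):

from sklearn.linear_model import LogisticRegression

# Penalize misclassifying the minority class (label 1) ten times more
# heavily than the majority class; the exact cost is illustrative.
clf = LogisticRegression(class_weight={0: 1.0, 1: 10.0})
# clf.fit(X_train, y_train)  # X_train, y_train: your training data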
To build these in R, refer to https://www.analyticsvidhya.com/blog/2016/03/practical-guide-deal-imbalanced-classification-problems/
There are many ways. You can upsample the minority class, downsample the majority class, weight the loss, or take other measures. Since you've tagged your post as TF-related, I suggest you look at Importance Sampling for Keras (http://idiap.ch/~katharas/importance-sampling/); it has some ready-to-use wrappers that you put around your model.
I am currently trying to use satellite imagery to recognize apple orchards, and I am facing a small problem with the number of representative examples for each class.
In fact, my question is:
Is it possible to randomly take a different subset of images from my "not-apples" class at each epoch? I have many more of these (compared to the "apples" class), and I want to increase the probability that my network will correctly reject unrepresentative images.
Thanks in advance for your help
That is not possible in Keras. Keras will, by default, shuffle your training data and then train on it in a mini-batch fashion. However, there are still ways to re-balance your dataset.
The imbalanced training data problem that you are facing is pretty common. You have many options available to you; I list a few below:
You can adjust the relative weights of your classes using the class_weight keyword of the model.fit() function.
You can "up-sample" your "apples" class or "down-sample" your "non-apples" class to have equal numbers of both classes during training.
You can generate synthetic images of your "apples" class to augment your data set. To this end, the ImageDataGenerator class in Keras can be particularly useful. This Keras tutorial is a good introduction to its usage.
In my experience, I've found #2 and #3 to be most useful; a sketch of #1 and #3 appears below. #1 is limited by the fact that the convergence of stochastic gradient descent suffers when using class weights that differ by a couple of orders of magnitude together with smaller batch sizes.
Jason Brownlee has put together a list of tactics for dealing with imbalanced classes that might also be useful to you.
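A minimal Keras sketch of #1 and #3, assuming a binary apples/not-apples model; the weights, transform ranges, and variable names are illustrative:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Option #1: weigh the rare "apples" class (label 1) more heavily.
# The 5x weight is a made-up starting point; tune it to your class ratio.
# model.fit(x_train, y_train, class_weight={0: 1.0, 1: 5.0})

# Option #3: synthesize extra "apples" images with random transforms.
datagen = ImageDataGenerator(rotation_range=20,
                             horizontal_flip=True,
                             zoom_range=0.2)
# model.fit(datagen.flow(x_apples, y_apples, batch_size=32), epochs=10)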
I am analyzing the Secom dataset from the UCI Machine Learning repository by lasso-regularized logistic regression, but the results are bad.
https://archive.ics.uci.edu/ml/datasets/SECOM
Characteristics:
1546 data samples with 590 numeric attributes
106 positive samples (production failures)
The goal is to predict the positive class accurately, and also to perform feature selection.
I optimize lambda by 10-fold cross-validation with the glmnet package in R, but the results are terrible: the model tends to assign all test samples to a single class.
Is it simply the wrong kind of model for this dataset?
Predicting with imbalanced classes can be a very difficult problem. Thankfully, there is a huge bibliography on how to solve such problems, and one particular approach has worked really well for me. It involves using up-sampling and/or down-sampling techniques:
down-sampling: randomly subset all the classes in the training set so that their class frequencies match the least prevalent class. For example, suppose that 80% of the training set samples are the first class and the remaining 20% are in the second class. Down-sampling would randomly sample the first class to be the same size as the second class (so that only 40% of the total training set is used to fit the model). caret contains a function (downSample) to do this.
up-sampling: randomly sample (with replacement) the minority class to be the same size as the majority class. caret contains a function (upSample) to do this.
hybrid methods: techniques such as SMOTE and ROSE down-sample the majority class and synthesize new data points in the minority class. There are two packages (DMwR and ROSE) that implement these procedures.
I took the above bullet points from caret's documentation. That post contains examples of each bullet point along with R code; a rough Python analogue is sketched below. You should be able to use lasso logistic regression and get better results after transforming your data with the above techniques.
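Since the caret functions are R, here is a rough Python analogue of down-sampling using scikit-learn's resample (the function name and data shapes are illustrative):

import numpy as np
from sklearn.utils import resample

def downsample(X, y):
    # Shrink every class to the size of the rarest class.
    n_min = min(np.sum(y == c) for c in np.unique(y))
    idx = np.concatenate([
        resample(np.where(y == c)[0], replace=False,
                 n_samples=n_min, random_state=0)
        for c in np.unique(y)])
    return X[idx], y[idx]

# For up-sampling, use replace=True and n_samples equal to the size
# of the largest class instead.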
I have a dataset for a classification problem with 50 classes in total.
Class1: 10,000 examples
Class2: 10 examples
Class3: 5 examples
Class4: 35 examples
.
.
.
and so on.
I tried to train my classifier using an SVM (both linear and Gaussian kernels). My accuracy on test data is very bad: 65% and 72% respectively. Now I am thinking of trying a neural network. Do you have any suggestions for a machine learning model and algorithm for large imbalanced data? It would be extremely helpful to me.
You should provide more information about the dataset features and the class distribution; this would help others to advise you.
In any case, I don't think a neural network fits here as this data set is too small for it.
Assuming 50% or more of the samples are of class 1, I would first look for a classifier that differentiates between class 1 and non-class-1 samples (binary classification). This classifier should outperform a naive benchmark classifier that randomly chooses a classification with a prior corresponding to the training set class distribution.
For example, assuming there are 1,000 samples, out of which 700 are of class 1, the benchmark classifier would classify a new sample as class 1 with probability 700/1,000 = 0.7 (like an unfair coin toss).
Once you have found a classifier with good accuracy, the next phase can be classifying the non-class-1 samples into one of the other 49 classes. Assuming these classes are more balanced, I would start with RF, NB, and KNN; a sketch of the two-stage setup follows.
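A minimal sketch of the two-stage idea with scikit-learn; the estimator choice and the toy data are illustrative only:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

X_train = np.random.randn(200, 10)             # illustrative features
y_train = np.random.randint(1, 51, size=200)   # illustrative labels 1..50

# Stage 1: class 1 vs. everything else.
stage1 = RandomForestClassifier(class_weight='balanced', random_state=0)
stage1.fit(X_train, (y_train == 1).astype(int))

# Stage 2: distinguish the remaining 49 classes among non-class-1 samples.
mask = y_train != 1
stage2 = RandomForestClassifier(random_state=0)
stage2.fit(X_train[mask], y_train[mask])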
There are multiple ways to handle imbalanced datasets; you can try:
Up-sampling
Down-sampling
Class weights
I would suggest either up-sampling or providing class weights to balance it.
https://towardsdatascience.com/5-techniques-to-work-with-imbalanced-data-in-machine-learning-80836d45d30c
You should also think about your performance metric: don't use accuracy, use log loss or another suitable metric (a quick sketch follows the link below).
https://machinelearningmastery.com/failure-of-accuracy-for-imbalanced-class-distributions/
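A tiny scikit-learn sketch of the metric point, with made-up toy predictions:

import numpy as np
from sklearn.metrics import log_loss, classification_report

y_true = np.array([0, 0, 0, 1])  # toy, heavily skewed labels
proba = np.array([[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.4, 0.6]])
y_pred = proba.argmax(axis=1)

print(log_loss(y_true, proba))                # penalizes confident mistakes
print(classification_report(y_true, y_pred))  # per-class precision/recall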
From my experience, the most successful ways to deal with unbalanced classes are:
Changing the distribution of inputs: 20,000 samples (the approximate number of examples you have) is not a big number, so you could change your dataset distribution simply by using every sample from the less frequent classes multiple times (see the sketch after this list). Depending on the number of classes, you could set the number of examples from each of them to e.g. 6,000 or 8,000 in your training set. In this case, remember not to change the distribution of the test and validation sets.
Increase the training time: in the case of neural networks, when changing the input distribution is impossible, I strongly advise training the network for quite a long time (e.g. 1,000 epochs). In this case you have to remember regularisation; I usually use dropout and an l2 weight regulariser, with their parameters found by random search.
Reduce the batch size: in the neural network case, reducing the batch size might improve performance on the less frequent classes.
Change your loss function: using MAPE instead of cross-entropy may also improve accuracy on the less frequent classes.
Feel free to test different combinations of these approaches, e.g. via random search.
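A sketch of the first suggestion: replicate samples from the rarer classes up to a per-class target (the 6,000 target comes from the answer above; the function and variable names are illustrative):

import numpy as np

def replicate_to_target(X, y, target=6000):
    # Sample each class's indices with replacement until every class
    # contributes `target` examples; apply to the training set only.
    idx = np.concatenate([
        np.random.choice(np.where(y == c)[0], size=target, replace=True)
        for c in np.unique(y)])
    np.random.shuffle(idx)
    return X[idx], y[idx]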
Data-level methods:
Undersampling runs the risk of losing important information by removing data. Oversampling runs the risk of overfitting the training data, especially if the added copies of the minority class are replicas of existing data. Many sophisticated sampling techniques have been developed to mitigate these risks.
One such technique is two-phase learning. You first train your model on resampled data, which can be obtained by randomly undersampling large classes until each class has only N instances. You then fine-tune your model on the original data; a sketch follows.
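A sketch of the two-phase idea; N, the helper name, and the training comments are illustrative:

import numpy as np

def undersample_to_n(X, y, n=50):
    # Phase 1 data: at most n randomly chosen instances per class.
    idx = np.concatenate([
        np.random.choice(np.where(y == c)[0],
                         size=min(n, np.sum(y == c)), replace=False)
        for c in np.unique(y)])
    return X[idx], y[idx]

# Phase 1: train the model on undersample_to_n(X_train, y_train).
# Phase 2: fine-tune the same model on the full (X_train, y_train),
# typically with a lower learning rate.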
Another technique is dynamic sampling: oversample the low-performing classes and undersample the high-performing classes during the training process. Introduced by Pouyanfar et al., the method aims to show the model less of what it has already learned and more of what it has not.
Algorithm-level methods
Cost-sensitive learning
Class-balanced loss (see the sketch after this list)
Focal loss
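For class-balanced loss, the common weighting from Cui et al. ("Class-Balanced Loss Based on Effective Number of Samples") scales each class by (1 - beta) / (1 - beta^n_c). A PyTorch sketch with made-up class counts:

import torch

# samples_per_class[c] = training count of class c (illustrative numbers).
samples_per_class = torch.tensor([10000., 500., 50., 5.])
beta = 0.999  # closer to 1 means stronger rebalancing
weights = (1.0 - beta) / (1.0 - beta ** samples_per_class)
weights = weights / weights.sum() * len(samples_per_class)  # normalize
criterion = torch.nn.CrossEntropyLoss(weight=weights)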
References:
Designing Machine Learning Systems
Survey on deep learning with class imbalance
How should I set my gamma and cost parameters in libSVM when I am using an imbalanced dataset that consists of 75% 'true' labels and 25% 'false' labels? I'm consistently getting all predicted labels set to 'true' due to the data imbalance.
If the issue isn't with libSVM but with my dataset, how should I handle this imbalance from a theoretical machine learning standpoint? The number of features I'm using is between 4 and 10, and I have a small set of 250 data points.
Class imbalance has nothing to do with the selection of C and gamma; to deal with this issue you should use a class weighting scheme, which is available, for example, in the scikit-learn package (built on libsvm).
Selection of the best C and gamma is performed using grid search with cross-validation. You should try a vast range of values here: for C it is reasonable to choose values between 1 and 10^15, while a simple and good heuristic for the gamma range is to compute the pairwise distances between all your data points and select gamma according to percentiles of this distribution. Think of it as placing at each point a Gaussian with variance 1/gamma: if you select gamma such that these distributions overlap many points, you get a very "smooth" model, while a small variance leads to overfitting. A sketch of both points follows.
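A sketch of both points with scikit-learn, which wraps libsvm; the percentile choices, grid sizes, and toy data are illustrative:

import numpy as np
from scipy.spatial.distance import pdist
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

# Toy 75%/25% dataset mirroring the question's imbalance.
X, y = make_classification(n_samples=250, n_features=8,
                           weights=[0.75, 0.25], random_state=0)

# Gamma heuristic: derive candidates from percentiles of the pairwise
# distances, treating each point as a Gaussian with variance 1/gamma.
d = np.percentile(pdist(X), [10, 50, 90])
param_grid = {'C': np.logspace(0, 15, 6), 'gamma': 1.0 / (2 * d ** 2)}

# class_weight='balanced' reweights C per class by inverse frequency.
search = GridSearchCV(SVC(class_weight='balanced'), param_grid, cv=5)
search.fit(X, y)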
Imbalanced data sets can be tackled in various ways. Class balance has no effect on kernel parameters such as gamma for the RBF kernel.
The two most popular approaches are:
Use different misclassification penalties per class; this basically means changing C. Typically the smallest class gets weighted higher; a common approach is npos * wpos = nneg * wneg (see the snippet after this list). LIBSVM allows you to do this using its -wX flags.
Subsample the overrepresented class to obtain an equal amount of positives and negatives and proceed with training as you traditionally would for a balanced set. Take note that you basically ignore a large chunk of data this way, which is intuitively a bad idea.
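The penalty rule can be instantiated directly; a tiny sketch with scikit-learn's SVC (libsvm's -w flags achieve the same thing), using the question's 75/25 split as made-up data:

import numpy as np
from sklearn.svm import SVC

y = np.array([1] * 75 + [0] * 25)  # 75% 'true', 25% 'false' (illustrative)
n_pos, n_neg = np.sum(y == 1), np.sum(y == 0)
# Fix wpos = 1; then wneg = npos / nneg satisfies npos * wpos == nneg * wneg.
clf = SVC(class_weight={1: 1.0, 0: n_pos / n_neg})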
I know this has been asked some time ago, but I would like to answer it since you might find my answer useful.
As others have mentioned, you might want to consider using different weights for the minority classes or using different misclassification penalties. However, there is a more clever way of dealing with the imbalanced datasets.
You can use the SMOTE (Synthetic Minority Over-sampling Technique) algorithm to generate synthesized data for the minority class. It is a simple algorithm that can deal with some imbalanced datasets pretty well.
In each iteration, SMOTE takes a minority-class instance and one of its nearest minority-class neighbors, and adds an artificial example of the same class somewhere in between. The algorithm keeps injecting such samples into the dataset until the two classes become balanced or some other criterion is met (e.g. a certain number of examples has been added).
Associating a weight with the minority class is a special case of this idea: when you associate weight $w_i$ with instance $i$, you are effectively adding $w_i - 1$ extra copies of instance $i$!
What you need to do is augment your initial dataset with the samples created by this algorithm and train the SVM on the new dataset. You can also find many implementations online in different languages such as Python and Matlab.
There are other extensions of this algorithm; I can point you to more material if you want.
To test the classifier you need to split the dataset into train and test sets, add synthetic instances to the train set (DO NOT ADD ANY TO THE TEST SET), train the model on the train set, and finally test it on the test set. If you include the generated instances when testing, you will end up with biased (and ridiculously high) accuracy and recall. A sketch follows.
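Putting that caveat into code with imbalanced-learn (the toy data and split are illustrative): split first, then oversample only the training portion:

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

# Split BEFORE oversampling so no synthetic point leaks into the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)

clf = SVC().fit(X_res, y_res)
print(clf.score(X_test, y_test))  # evaluated only on untouched real data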