Must a dataset for a classifier be perfectly balanced? - machine-learning

In machine learning it is generally accepted that the dataset for a binary classifier should be balanced, e.g. class 1 at 50% and class 0 at 50%.
My question is: does the dataset have to be perfectly balanced (50/50)? I have a dataset with 1568 observations: class 1 is 47%, class 0 is 53%. Can this create issues?

Ideally, having 50% of each class gives nice results while training the model, since the model is trying to optimise accuracy.
For example, imagine you have 99% class 0 and 1% class 1. When training, you will obtain a model that always predicts 0; by doing so, the model achieves an overall accuracy of 99%, but is that what you want? Balancing the data helps you avoid this type of "silly" model. This is an extreme case, but it is the reason balanced classes are used.
When balancing, watch out for overfitting on the minority class: if you oversample it, the model sees the same samples multiple times and may simply memorise them.
Finally, with data as balanced as yours (47% / 53%), balancing is not needed. You would probably only start balancing around a 35% / 65% split or so. Obviously, how the data behaves under balancing depends on the dataset.
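As a rough illustration, here is a minimal sketch, assuming a scikit-learn workflow and synthetic data standing in for the 1568 observations: with a mild 47/53 split you can skip resampling entirely and let the estimator reweight the classes instead.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1568, 4))
# A synthetic stand-in for the roughly 47% / 53% split in the question.
y = (X[:, 0] + 0.3 * rng.normal(size=1568) > 0.1).astype(int)
print(np.bincount(y) / len(y))  # check the actual class proportions

# class_weight="balanced" reweights samples inversely to class frequency,
# so nothing has to be duplicated or discarded.
clf = LogisticRegression(class_weight="balanced").fit(X, y)
```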

Related

Sample Weight vs. Down Sampling for Handling Imbalanced Data

Suppose you have an imbalanced dataset. Setting aside generating new data, how can you handle it efficiently? I know we can use sample weights or downsampling, but I am not sure which of the two to choose. Also, suppose you need to build a classification model on this imbalanced data: how will these two techniques influence model performance differently?
It depends entirely on how many observations you have left after downsampling, and on how well the downsampled class still covers the variety within that class.
For example, suppose class 1 consists of 100 observations and class 2 contains 2,000 (class 1 is ~5%). Downsampling won't make sense here, as there won't be enough observations left to effectively fit a model: 100 observations are very few, and the model would have high training error.
But if class 1 has 100,000 observations and class 2 has 2,000,000 (5% again), then downsampling still makes sense, as you have enough observations to train the model.
So the answer depends entirely on the kind of data you have. I personally would go with SMOTE. Hope this helps.
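For concreteness, here is a hedged scikit-learn sketch of both options on synthetic data mirroring the 100-vs-2,000 example (the weighting scheme and model choice are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

rng = np.random.default_rng(0)
X = rng.normal(size=(2100, 3))
y = np.array([1] * 100 + [0] * 2000)  # ~5% minority, as in the example

# (a) Sample weights: every observation is kept, but minority samples
# count for more in the loss ("balanced" weights: n / (k * n_c)).
w = np.where(y == 1, len(y) / (2 * (y == 1).sum()), len(y) / (2 * (y == 0).sum()))
clf_weighted = LogisticRegression().fit(X, y, sample_weight=w)

# (b) Downsampling: shrink the majority class to the minority size.
maj = np.where(y == 0)[0]
keep = resample(maj, n_samples=(y == 1).sum(), replace=False, random_state=0)
idx = np.concatenate([np.where(y == 1)[0], keep])
clf_down = LogisticRegression().fit(X[idx], y[idx])
```

Option (a) keeps every observation; option (b) is only safe when the minority class is large enough on its own, which is the point above. SMOTE, by contrast, synthesises new minority samples (it lives in the separate imbalanced-learn package).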

tackle class imbalance in single shot object detector

I am training an object detection model for multi-class objects in images. The dataset is custom-collected data, labelled with bounding boxes and class labels in the ground truth.
I trained MobileNet+SSD, SqueezeDet and YOLOv3 networks on this custom data but get poor results. The rationale for choosing these models is their fast performance and light weight (low memory footprint); their single-shot detector approach has also been shown to perform well in the literature.
The class instance distribution in the dataset is as below
Class 1 -- 2469
Class 2 -- 5660
Class 3 -- 7614
Class 4 -- 13253
Class 5 -- 35262
Each image can have objects from any of the five classes. Class 4 and 5 have very high incidence.
The performance is very skewed with high recall scores and Average Precision for the class 4 and 5 , and an order of magnitude difference (lower) for the other 3 classes.
I have tried tuning different filtering parameters, the NMS threshold, and model training parameters, to no avail.
Question:
How can I tackle such class imbalance to boost detection average precision and accuracy for all classes in object detection models?
Low precision means your model is suffering from false positives, so you can try hard negative mining: run your model, find the false positives, and include them in your training data. You can even try using only these false positives as your negative examples.
As you'd expect, another way is to collect more data, if possible.
If that is not possible, you may consider adding synthetic data (e.g. change the brightness of the image, or the viewpoint: multiply by a matrix so the image looks stretched).
One last option is to have an equal amount of data for each class, e.g. 5k examples each.
PS: Keep in mind that the flexibility of your model has a great impact, so be aware of overfitting and underfitting.
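Picking up the hard-negative-mining suggestion above, here is a minimal classification-style sketch (a real detector would instead collect high-confidence background boxes; the data and model here are synthetic stand-ins):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + 0.5 * rng.normal(size=1000) > 1.2).astype(int)  # rare positives

clf = LogisticRegression().fit(X, y)
pred = clf.predict(X)

# Hard negatives: background samples the model confidently calls positive.
fp_mask = (pred == 1) & (y == 0)
X_hard, y_hard = X[fp_mask], y[fp_mask]

# Emphasise the hard negatives by appending extra copies, then retrain.
X_mined = np.vstack([X, X_hard])
y_mined = np.concatenate([y, y_hard])
clf_refined = LogisticRegression().fit(X_mined, y_mined)
```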
When generating synthetic data, as mentioned by the previous author, do not apply illumination or viewpoint variations etc. to your whole dataset, but rather randomly. The class counts are also way off; it would be best either to cap the larger classes or to gather more data for the under-represented ones. You could also try applying class weights to penalise the over-represented classes more. You are making a lot of assumptions; simple experimentation will yield results that may surprise you. Remember, deep learning is part science and a lot of art.
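As a small sketch of the "apply augmentations randomly, not to everything" advice, assuming a tf.image-based input pipeline (any augmentation library works the same way):

```python
import tensorflow as tf

def random_augment(image, p=0.5):
    """Apply each perturbation with probability p, so no single
    transformation is baked into the whole dataset."""
    if tf.random.uniform([]) < p:
        image = tf.image.random_brightness(image, max_delta=0.2)
    if tf.random.uniform([]) < p:
        image = tf.image.random_contrast(image, lower=0.8, upper=1.2)
    if tf.random.uniform([]) < p:
        # For detection, geometric ops like flips must also transform the boxes.
        image = tf.image.random_flip_left_right(image)
    return image
```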

Balance classes in cross validation

I would like to build a GBM model with H2O. My data set is imbalanced, so I am using the balance_classes parameter. For grid search (parameter tuning) I would like to use 5-fold cross validation. I am wondering how H2O deals with class balancing in that case. Will only the training folds be rebalanced? I want to be sure the test-fold is not rebalanced.
In class imbalance settings, artificially balancing the test/validation set does not make any sense: these sets must remain realistic, i.e. you want to test your classifier's performance in the real-world setting where, say, the negative class includes 99% of the samples, to see how well your model predicts the 1% positive class of interest without too many false positives. Artificially inflating the minority class or shrinking the majority one leads to performance metrics that are unrealistic and bear no relation to the real-world problem you are trying to solve.
For corroboration, here is Max Kuhn, creator of the caret R package and co-author of the (highly recommended) Applied Predictive Modelling textbook, in Chapter 11: Subsampling For Class Imbalances of the caret ebook:
You would never want to artificially balance the test set; its class frequencies should be in-line with what one would see “in the wild”.
Re-balancing makes sense only in the training set, so as to prevent the classifier from simply and naively classifying all instances as negative for a perceived accuracy of 99%.
Hence, you can rest assured that in the setting you describe the rebalancing takes action only for the training set/folds.
Another way to force balancing is to use a weights column that assigns different weights to different classes; in H2O this is the weights_column parameter.
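A hedged sketch of that setup in the H2O Python API (the file name and column names are illustrative): balance_classes only touches the data the model trains on, and the cross-validation metrics come from the untouched holdout folds.

```python
import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()
train = h2o.import_file("train.csv")        # hypothetical file
train["label"] = train["label"].asfactor()  # hypothetical target column

gbm = H2OGradientBoostingEstimator(
    balance_classes=True,  # rebalances only the data used for training
    nfolds=5,
    seed=42,
)
gbm.train(x=[c for c in train.columns if c != "label"], y="label",
          training_frame=train)

# Alternative: keep the data untouched and pass per-row weights instead:
# gbm.train(..., weights_column="w")
```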

Neural Network gets stuck

I am experimenting with classification using neural networks (I am using TensorFlow).
And unfortunately the training of my neural network gets stuck at 42% accuracy.
I have 4 classes, into which I try to classify the data.
And unfortunately, my data set is not well balanced, meaning that:
43% of the data belongs to class 1 (and yes, my network gets stuck predicting only this)
37% to class 2
13% to class 3
7% to class 4
The optimizer I am using is AdamOptimizer and the cost function is tf.nn.softmax_cross_entropy_with_logits.
I was wondering whether the reason my training gets stuck at 42% is really that my data set is not well balanced, or whether the data is essentially random and there are no patterns to be found.
Currently my NN consists of:
input layer
2 convolution layers
7 fully connected layers
output layer
I tried changing this structure of the network, but the result is always the same.
I also tried Support Vector Classification, and the result is pretty much the same, with small variations.
Did somebody else encounter similar problems?
Could anybody please provide me some hints how to get out of this issue?
Thanks,
Gerald
I will assume that you have already double, triple and quadruple checked that the data going in matches what you expect.
The question is quite open-ended, and even a topic for research. But there are some things that can help.
In terms of better training, there are two common ways people train neural networks on an unbalanced dataset.
Oversample the examples with lower frequency, such that the proportion of examples for each class that the network sees is equal. e.g. in every batch, enforce that 1/4 of the examples are from class 1, 1/4 from class 2, etc.
Weight the error for misclassifying each class inversely to its proportion, e.g. incorrectly classifying an example of class 1 costs 100/43, while incorrectly classifying an example of class 4 costs 100/7 (see the sketch below).
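Here is a minimal sketch of option 2, using the same loss function as in the question; the weights come directly from the stated class shares (43%, 37%, 13%, 7%):

```python
import tensorflow as tf

# Inverse-proportion weights from the stated class shares.
class_weights = tf.constant([100/43, 100/37, 100/13, 100/7])

def weighted_loss(labels_onehot, logits):
    # Per-example cross entropy, scaled by the weight of the true class.
    per_example = tf.nn.softmax_cross_entropy_with_logits(
        labels=labels_onehot, logits=logits)
    weights = tf.reduce_sum(class_weights * labels_onehot, axis=1)
    return tf.reduce_mean(per_example * weights)
```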
That being said, if your learning rate is good, neural networks will often eventually (after many hours of just sitting there) jump out of only predicting for one class, but they still rarely end well with a badly skewed dataset.
If you want to know whether there are patterns in your data that can be found at all, there is a simple test:
Create a new dataset by randomly selecting elements from all of your classes such that you have an equal number of each (e.g. if there are 700 examples of class 4, randomly select 700 examples from every class).
Then you can use all of your techniques on this new, balanced dataset.
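A small numpy sketch of that balanced-subset construction:

```python
import numpy as np

def balanced_subset(X, y, seed=0):
    """Randomly draw the same number of examples from every class,
    capped at the size of the smallest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n = counts.min()
    idx = np.concatenate([
        rng.choice(np.where(y == c)[0], size=n, replace=False)
        for c in classes
    ])
    rng.shuffle(idx)  # avoid blocks of a single class
    return X[idx], y[idx]
```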
One caveat: this paper suggests that a network can fit some pattern even with random labels, so treat the result of this test with care.
Firstly, you should check whether your model is overfitting or underfitting, both of which can cause low accuracy. Compare the accuracy on the training set and the dev set: if training accuracy is much higher than dev/test accuracy, the model may be overfitting; if training accuracy is as low as on the dev/test set, it could be underfitting.
For overfitting, more data or a simpler architecture may help, while a more complex architecture and longer training time may solve an underfitting problem.

Machine learning model suggestion for large imbalanced data

I have data set for classification problem. I have in total 50 classes.
Class1: 10,000 examples
Class2: 10 examples
Class3: 5 examples
Class4: 35 examples
.
.
.
and so on.
I tried to train my classifier using an SVM (both linear and Gaussian kernels). My accuracy on test data is very bad: 65% and 72% respectively. Now I am thinking of moving to a neural network. Do you have any suggestions for a machine learning model and algorithm for large imbalanced data? It would be extremely helpful to me.
You should provide more information about the data set features and the class distribution; this would help others to advise you.
In any case, I don't think a neural network fits here as this data set is too small for it.
Assuming 50% or more of the samples are of class 1 then I would first start by looking for a classifier that differentiates between class 1 and non-class 1 samples (binary classification). This classifier should outperform a naive classifier (benchmark) which randomly chooses a classification with a prior corresponding to the training set class distribution.
For example, assuming there are 1,000 samples, out of which 700 are of class 1, the benchmark classifier would classify a new sample as class 1 with probability 700/1,000 = 0.7 (like an unfair coin toss).
Once you have found a classifier with good accuracy, the next stage can classify the samples labelled as non-class-1 into one of the other 49 classes; assuming these classes are more balanced, I would start with RF, NB and KNN.
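A hedged scikit-learn sketch of that staged setup (the random-forest choice and helper names are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fit_two_stage(X, y, majority_class=1):
    # Stage 1: binary "is it the majority class?" model.
    stage1 = RandomForestClassifier().fit(X, (y == majority_class).astype(int))
    # Stage 2: multiclass model over the remaining classes only.
    rest = y != majority_class
    stage2 = RandomForestClassifier().fit(X[rest], y[rest])
    return stage1, stage2

def predict_two_stage(stage1, stage2, X, majority_class=1):
    is_majority = stage1.predict(X).astype(bool)
    out = np.full(len(X), majority_class, dtype=int)
    if (~is_majority).any():
        out[~is_majority] = stage2.predict(X[~is_majority])
    return out
```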
There are multiple ways to handle imbalanced datasets; you can try:
Up-sampling
Down-sampling
Class weights
I would suggest either up-sampling or providing class weights to balance it.
https://towardsdatascience.com/5-techniques-to-work-with-imbalanced-data-in-machine-learning-80836d45d30c
You should also think about your performance metric: don't use accuracy, as it is misleading on imbalanced data. You can use log loss or any other suitable metric.
https://machinelearningmastery.com/failure-of-accuracy-for-imbalanced-class-distributions/
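To make the metric advice concrete, here is a minimal sketch, assuming scikit-learn and synthetic data: train with class weights and score with log loss instead of accuracy.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = (X[:, 0] > 1.5).astype(int)  # roughly 7% positives: imbalanced

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = LogisticRegression(class_weight="balanced").fit(X_tr, y_tr)

# Log loss scores predicted probabilities, so it is not fooled by a model
# that always predicts the majority class with high confidence.
print(log_loss(y_te, clf.predict_proba(X_te)))
```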
In my experience, the most successful ways to deal with unbalanced classes are:
Changing the distribution of inputs: 20,000 samples (the approximate number of examples you have) is not a big number, so you can change your dataset distribution simply by using every sample from the less frequent classes multiple times. Depending on the number of classes, you could set the number of examples from each of them to e.g. 6,000 or 8,000 in your training set. In this case, remember not to change the distribution of the test and validation sets.
Increasing training time: for neural networks, when changing the input distribution is impossible, I strongly advise training the network for quite a long time (e.g. 1,000 epochs). In this case you have to remember regularisation; I usually use dropout and an L2 weight regulariser, with their parameters found by random search.
Reducing the batch size: for neural networks, reducing the batch size might improve performance on the less frequent classes.
Changing the loss function: using MAPE instead of cross-entropy may also improve accuracy on the less frequent classes.
Feel free to test different combinations of these approaches, e.g. via random search.
Data-level methods:
Undersampling risks losing important information by removing data; oversampling risks overfitting on the training data, especially if the added copies of the minority class are exact replicas of existing points. Many sophisticated sampling techniques have been developed to mitigate these risks.
One such technique is two-phase learning. You first train your model on the resampled data. This resampled data can be achieved by randomly undersampling large classes until each class has only N instances. You then fine-tune your model on the original data.
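A hedged sketch of two-phase learning, with an incrementally trainable scikit-learn model standing in for a neural network (N, the data, and the model choice are illustrative; older scikit-learn versions spell the loss "log"):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def undersample_to_n(X, y, n, seed=0):
    """Randomly undersample every class to at most n instances."""
    rng = np.random.default_rng(seed)
    idx = np.concatenate([
        rng.choice(np.where(y == c)[0], min(n, (y == c).sum()), replace=False)
        for c in np.unique(y)
    ])
    return X[idx], y[idx]

rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 8))
y = (X[:, 0] > 1.3).astype(int)  # imbalanced synthetic labels

clf = SGDClassifier(loss="log_loss", random_state=0)
X_bal, y_bal = undersample_to_n(X, y, n=500)
clf.partial_fit(X_bal, y_bal, classes=np.unique(y))  # phase 1: balanced
clf.partial_fit(X, y)                                # phase 2: fine-tune on original
```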
Another technique is dynamic sampling: oversample the low-performing classes and undersample the high-performing classes during the training process. Introduced by Pouyanfar et al., the method aims to show the model less of what it has already learned and more of what it has not.
Algorithm-level methods:
Cost-sensitive learning
Class-balanced loss
Focal loss
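For instance, focal loss can be sketched in a few lines (TensorFlow assumed; the alpha here is a single scalar, whereas the paper's alpha is class-dependent):

```python
import tensorflow as tf

def focal_loss(labels_onehot, logits, gamma=2.0, alpha=0.25):
    """Focal loss (Lin et al., 2017): down-weights well-classified
    examples so training focuses on hard, often minority, ones."""
    probs = tf.nn.softmax(logits)
    ce = tf.nn.softmax_cross_entropy_with_logits(
        labels=labels_onehot, logits=logits)
    p_t = tf.reduce_sum(labels_onehot * probs, axis=1)  # prob of the true class
    return tf.reduce_mean(alpha * tf.pow(1.0 - p_t, gamma) * ce)
```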
References:
Designing Machine Learning Systems
Survey on deep learning with class imbalance
