Suppose you have an imbalanced dataset. Setting aside the option of generating new data, how can you handle it efficiently? I know we can use sample weights or downsampling, but I am not sure which of the two to choose. Also, suppose you need to build a classification model on this imbalanced data: how will these two techniques influence the model's performance differently?
It depends largely on how many observations you have left after downsampling, and on whether the downsampled class still captures the variety of the original class.
For example, suppose class 1 consists of 100 observations and class 2 contains 2,000 (class 1 is ~5%). Downsampling won't make sense there, because there won't be enough observations left to fit a model effectively; 100 observations is very little, and the model would have high training error.
But if class 1 has 100,000 observations and class 2 has 2,000,000 (5% again), then downsampling still makes sense, because you have enough observations left to train a model.
So the answer totally depends on the type of data you have. I personally would go with SMOTE. Hope this helps.
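If you do go the SMOTE route, a minimal sketch with the imbalanced-learn package; the data below is a placeholder mimicking the 100-vs-2,000 example above, not anything from the question:

import numpy as np
from imblearn.over_sampling import SMOTE

# Placeholder data: ~5% minority, 100 vs 2000 observations, 10 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(2100, 10))
y = np.array([0] * 100 + [1] * 2000)

# SMOTE synthesises new minority-class points by interpolating between neighbours.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print(np.bincount(y_res))  # both classes now have 2000 samples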
Related
I am training a machine learning model on a classification problem. My dataset has 10,000 observations spread over 37 classes, but the data is imbalanced: some classes have around 100 observations while others have 3,000 or 4,000.
After searching for ways to engineer this type of data to improve the algorithm's performance, I found two solutions:
upsampling, which means obtaining more data for the minority classes
downsampling, which means removing data from the majority classes
Regarding the first solution:
I have many classes with only a few observations, so it would require a lot more data and a long time; that would be hard for me.
And with the second one:
I think all classes would end up with only a few observations and the data would be very small, making it hard for the algorithm to generalize.
So Is there another solution I can try for this problem?
You can change the weights in your loss function so that the smaller classes have greater importance when optimizing. In TensorFlow you can use tf.nn.weighted_cross_entropy_with_logits, for example, or pass per-class weights to Keras via the class_weight argument of model.fit.
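A minimal sketch of the class_weight route, assuming a Keras classifier; the data, layer sizes, and the assumption that all 37 labels occur in y_train are placeholders, not from the question:

import numpy as np
import tensorflow as tf
from sklearn.utils.class_weight import compute_class_weight

# Placeholder data: 10000 samples, 20 features, 37 integer class labels.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(10000, 20)).astype("float32")
y_train = rng.integers(0, 37, size=10000)

# "balanced" weights each class inversely to its frequency; this assumes every
# label 0..36 occurs at least once so the dict keys line up with the labels.
weights = compute_class_weight("balanced", classes=np.unique(y_train), y=y_train)
class_weight = dict(enumerate(weights))

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(37, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(X_train, y_train, epochs=1, class_weight=class_weight, verbose=0)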
You could use a combination of both.
It sounds like you are worried about getting a dataset that is too large if you upsample all minority classes to match the majority classes. If this is the case, you can downsample the majority classes to something like 25% or 50%, and at the same time upsample the minority classes. An alternative to upsampling is synthesising samples for the minority classes using an algorithm like SMOTE.
If you are training a neural network in mini-batches, it is good to make sure that the training set is properly shuffled and that you have an even-ish distribution of minority/majority samples across the mini-batches.
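A rough sketch of that combined down/up sampling with pandas; the helper, the column name, and the sampling fraction are illustrative choices, not anything prescribed:

import pandas as pd

def rebalance(df, label_col="label", majority_frac=0.5):
    """Downsample large classes to `majority_frac` of the largest class and
    upsample small classes (with replacement) to the same size."""
    counts = df[label_col].value_counts()
    target = int(counts.max() * majority_frac)
    parts = []
    for cls, n in counts.items():
        group = df[df[label_col] == cls]
        parts.append(group.sample(target, replace=(n < target), random_state=0))
    return pd.concat(parts).sample(frac=1, random_state=0)  # shuffle rows

# Tiny illustrative frame with a 3000-vs-100 imbalance.
demo = pd.DataFrame({"x": range(3100),
                     "label": ["big"] * 3000 + ["small"] * 100})
print(rebalance(demo)["label"].value_counts())  # both classes end up at 1500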
I am training an object detection model for multi-class objects in images. The dataset is custom collected and labelled, with bounding boxes and class labels in the ground truth data.
I trained MobileNet+SSD, SqueezeDet and YOLOv3 networks with this custom data but get poor results. The rationale for choosing these models is their speed and light weight (low memory footprint), and their single-shot detector approach has been shown to perform well in the literature.
The class instance distribution in the dataset is as below
Class 1 -- 2469
Class 2 -- 5660
Class 3 -- 7614
Class 4 -- 13253
Class 5 -- 35262
Each image can have objects from any of the five classes. Classes 4 and 5 have a very high incidence.
The performance is very skewed, with high recall and average precision for classes 4 and 5, and scores an order of magnitude lower for the other three classes.
I have tried tuning different filtering parameters, the NMS threshold, and model training parameters, to no avail.
Question:
How can I tackle this class imbalance to boost the average precision and detection accuracy for all classes in object detection models?
Low precision means your model is suffering from false positives, so you can try hard negative mining: run your model, find the false positives, and include them in your training data as negative examples. You can even try using only the hardest false positives as negatives.
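A rough sketch of that mining step, assuming axis-aligned boxes given as (x1, y1, x2, y2) tuples; the detection and ground-truth lists below are placeholders for whatever your detector and labels actually produce:

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def hard_negatives(detections, ground_truths, iou_thr=0.5):
    """Detections overlapping no ground-truth box: candidate false positives
    to feed back into training as negative examples."""
    return [d for d in detections
            if all(iou(d, g) < iou_thr for g in ground_truths)]

# Example: the detection far from the only ground-truth box is a hard negative.
print(hard_negatives([(0, 0, 10, 10), (100, 100, 120, 120)],
                     [(1, 1, 11, 11)]))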
As you would expect, another option is to collect more data, if possible.
If that is not possible, you may consider adding synthetic data (e.g. change the brightness of an image, or its viewpoint by multiplying with a matrix so it looks stretched).
One last option may be to have the same amount of data for each class, e.g. 5k each.
PS: Keep in mind that the flexibility of your model has a big impact, so watch out for overfitting and underfitting.
When generating your synthetic data as mentioned by the previous author, do not apply illumination or viewpoint variations etc. to your whole dataset, but rather randomly per image (see the sketch below). The class counts are also way off; it would be best to either limit the larger classes or gather more data for the smaller ones. You could also try applying class weights so that the over-represented classes carry relatively less weight. You are making a lot of assumptions; simple experimentation may yield results that surprise you. Remember, deep learning is part science and a lot of art.
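For instance, a minimal sketch of per-image random augmentation with NumPy; the brightness and stretch ranges are arbitrary illustrative values, not tuned ones:

import numpy as np

rng = np.random.default_rng()

def random_augment(img: np.ndarray) -> np.ndarray:
    """img: H x W x 3 uint8 array. Randomly jitter brightness and randomly
    stretch the image horizontally (a crude viewpoint change)."""
    out = img.astype(np.float32)
    if rng.random() < 0.5:                      # brightness jitter on ~half the images
        out *= rng.uniform(0.7, 1.3)
    if rng.random() < 0.5:                      # horizontal stretch by resampling columns
        w = img.shape[1]
        cols = np.linspace(0, w - 1, int(w * rng.uniform(0.8, 1.2))).astype(int)
        out = out[:, cols, :]
    return np.clip(out, 0, 255).astype(np.uint8)

For detection data, any geometric change such as the horizontal stretch would of course have to be applied to the bounding-box coordinates as well.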
I have about a 30%/70% split between class 0 (minority class) and class 1 (majority class). Since I do not have a lot of data, I am planning to oversample the minority class to balance the classes into a 50-50 split. I was wondering whether oversampling should be done before or after splitting my data into train and test sets. I have generally seen it done before splitting in online examples, like this:
import pandas as pd  # 'train' is assumed to be a DataFrame with a binary predict_var column
df_class0 = train[train.predict_var == 0]
df_class1 = train[train.predict_var == 1]
# resample class 1 with replacement so that it matches the size of class 0
df_class1_over = df_class1.sample(len(df_class0), replace=True)
# stack the two pieces back into a single frame
df_over = pd.concat([df_class0, df_class1_over], axis=0)
However, wouldn't that mean that the test data will likely have duplicated samples from the training set (because we have oversampled the training set)? This means that testing performance wouldn't necessarily be on new, unseen data. I am fine doing this, but I would like to know what is considered good practice. Thank you!
I was wondering if oversampling should be done before or after splitting my data into train and test sets.
It should certainly be done after splitting, i.e. it should be applied only to your training set, and not to your validation and test ones; see also my related answer here.
I have generally seen it done before splitting in online examples, like this
From the code snippet you show, it is not at all obvious that it is done before splitting, as you claim. It depends on what exactly the train variable is here: if it is the product of a train-test split, then the oversampling takes place after splitting indeed, as it should be.
However, wouldn't that mean that the test data will likely have duplicated samples from the training set (because we have oversampled the training set)? This means that testing performance wouldn't necessarily be on new, unseen data.
Exactly, this is the reason why the oversampling should be done after splitting to train-test, and not before.
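For concreteness, a minimal sketch of the correct order with scikit-learn and pandas; the column name and the toy 30/70 data are placeholders, not from the question:

import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder frame with a binary "predict_var" column, 30% class 0 / 70% class 1.
df = pd.DataFrame({"feature": range(1000),
                   "predict_var": [0] * 300 + [1] * 700})

# 1. Split first, stratifying so both sets keep the original 30/70 ratio.
train, test = train_test_split(df, test_size=0.2, stratify=df["predict_var"],
                               random_state=42)

# 2. Oversample the minority class only inside the training set.
minority = train[train.predict_var == 0]
majority = train[train.predict_var == 1]
minority_over = minority.sample(len(majority), replace=True, random_state=42)
train_balanced = pd.concat([majority, minority_over])

# The test set is left untouched, so it stays unseen and duplicate-free.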
(I once witnessed a case where the modeller was struggling to understand why he was getting a ~ 100% test accuracy, much higher than his training one; turned out his initial dataset was full of duplicates -no class imbalance here, but the idea is similar- and several of these duplicates naturally ended up in his test set after the split, without of course being new or unseen data...).
I am fine doing this
You shouldn't :)
In my experience this is bad practice. As you mentioned, the test data should contain unseen samples, so that the evaluation of the training process is not optimistically biased. If you need to increase the sample size, think about data transformation possibilities.
E.g. in human/cat image classification, since the images are roughly symmetric, you can double the sample size by mirroring them.
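As a minimal sketch, assuming images are loaded as H x W x C NumPy arrays:

import numpy as np

def mirror(img: np.ndarray) -> np.ndarray:
    """Left-right mirrored copy of an H x W x C image array."""
    return img[:, ::-1, :]

# Keeping both the original and the mirrored copy doubles the sample size, e.g.:
# augmented = images + [mirror(im) for im in images]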
I am building a sentiment analysis program using some tweets I have gathered. The labeled data goes through a neural network which classifies the tweets into two classes, positive and negative.
The data is still being labeled. So far I have observed that the positive category has a very small number of observations.
The records for the positive category could be about 5% of the training set (the same ratio probably holds in the population as well).
Would this create problems in the final "program"?
Size of the data set is about 5000 records.
Yes, yes it can. There are two things to consider:
5% of 5,000 is 250, so you will try to model the distribution of that class from just 250 samples. This might be orders of magnitude too small for a neural network; you might need 40x more data to have a representative sample. While you can easily reduce the majority class through subsampling without much risk of destroying its structure, there is no way to get "more structure" from fewer points (you can replicate points, add noise, etc., but this does not add structure, it just adds assumptions).
Class imbalance can also lead to convergence to naive solutions, like "always predict negative", which already has 95% accuracy. Here you can simply adjust the cost function to make it more robust to imbalance (in particular, the balanced train splits suggested by #PureW are essentially a "black box" way of giving the minority class a bigger weight in the loss. When you have access to your classifier's loss, as in a neural network, you should not do this; instead change the cost function and keep all the data).
Without even splits of the different classes, you may want to introduce weights in your loss function such that errors in the smaller class are deemed more important.
Another solution, since 5,000 samples may or may not be much data depending on your problem, could be to sample additional datasets. You take the set of 5,000 samples and draw data points from it so that you get a new dataset with an even split of the classes. This new dataset is only about 10% of your original dataset, but it is evenly split between the classes. You can repeat this sampling many times and end up with several datasets, which are useful for bootstrap aggregating.
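A sketch of that resampling scheme; the helper function and the placeholder data below are illustrative, not from the question:

import numpy as np

def balanced_subsets(X, y, n_sets=10, seed=0):
    """Draw several datasets from (X, y), each with an even binary class split,
    for use in bootstrap aggregating."""
    rng = np.random.default_rng(seed)
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
    n = min(len(pos), len(neg))          # size of the minority class
    subsets = []
    for _ in range(n_sets):
        idx = np.concatenate([rng.choice(pos, n, replace=False),
                              rng.choice(neg, n, replace=False)])
        rng.shuffle(idx)
        subsets.append((X[idx], y[idx]))
    return subsets

# Placeholder data: 5000 samples, roughly 5% positive.
X = np.random.default_rng(1).normal(size=(5000, 8))
y = (np.random.default_rng(2).random(5000) < 0.05).astype(int)
sets = balanced_subsets(X, y)
print(len(sets), sets[0][1].mean())      # 10 balanced subsets, each 50% positive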
I'm new to data mining and I'm trying to train a decision tree, but the data set I've chosen is very imbalanced, so the results I am getting are also biased. I searched online and came across balanced accuracy, but I'm not satisfied with the result.
Would it be a good idea to sample my data set so that the classes are in equal proportion, e.g. 1,000 cases of YES and 1,000 of NO?
One way to handle class imbalance is to undersample the larger class so that the class distribution is approximately half and half.
The answer to your question is yes, provided 1,000 is the size of the smaller class, so that you lose as few larger-class data points as possible.
Note: when selecting from the larger-class data points, try to leave out those with more missing values.
You can also use weighting while modelling: assign a higher weight to the minority class to compensate for the imbalance.
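For example, in scikit-learn a decision tree can apply such weights directly; a minimal sketch with placeholder data:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Placeholder data: a heavily imbalanced binary problem.
rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 5))
y = (rng.random(10000) < 0.1).astype(int)

# class_weight="balanced" weights each class inversely to its frequency, so
# splits that misclassify the minority class are penalised more heavily.
clf = DecisionTreeClassifier(class_weight="balanced", random_state=0).fit(X, y)
print(clf.get_depth())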