Using SMOTE on unbalanced dataset - machine-learning

I have a two-class imbalanced dataset with a class ratio of 20:1.
I am using SMOTE to oversample the minority class. When using SMOTE to develop a usable model, is it best to oversample until the minority class is the same size as the majority class (i.e. 1:1), or to establish through trial and error the lowest ratio that improves the model to an acceptable level (e.g. F1 score > 0.7) without using too many synthetic samples?
Any thoughts or advice appreciated.

In my experience it is better to undersample the majority class than to use SMOTE, which has never helped me. My suggestion is to keep all (or most) of the minority-class cases and undersample the majority class at different ratios to find the sweet spot in terms of F1 score.
-Thanks
Satish
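The undersampling suggestion above can be sketched roughly as follows. This is a minimal illustration using only the standard library; the function name `undersample_majority`, the toy label list, and the ratios tried are assumptions made for the example, not something from the answer. In practice you would train and score a model (e.g. by F1) on each subset.

```python
import random

def undersample_majority(labels, majority_label, ratio, seed=0):
    """Keep every minority row and about `ratio` majority rows per minority row.

    Returns the sorted indices of the rows to keep.
    """
    rng = random.Random(seed)
    majority_idx = [i for i, y in enumerate(labels) if y == majority_label]
    minority_idx = [i for i, y in enumerate(labels) if y != majority_label]
    # Never ask for more majority rows than actually exist.
    n_keep = min(len(majority_idx), round(ratio * len(minority_idx)))
    kept_majority = rng.sample(majority_idx, n_keep)
    return sorted(minority_idx + kept_majority)

# Toy 20:1 dataset (assumed sizes): 2000 majority (0) and 100 minority (1) labels.
labels = [0] * 2000 + [1] * 100
for ratio in (10, 5, 2, 1):
    idx = undersample_majority(labels, majority_label=0, ratio=ratio)
    n_minority = sum(labels[i] for i in idx)
    print(ratio, len(idx) - n_minority, n_minority)
```

Each candidate ratio gives a different training subset; the evaluation set should keep the original 20:1 distribution.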

You can try different SMOTE percentages and nearest-neighbor values, then choose the best parameter values based on, for example, your F1 score.
Your best result will not necessarily be the one with the highest SMOTE percentage.
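To make "SMOTE percentage" concrete, here is a small arithmetic sketch of how many synthetic samples each target ratio implies. The dataset sizes (2000 vs. 100) are assumed for illustration, and the percentage follows the convention where 100% means one synthetic sample per original minority sample.

```python
def synthetic_needed(n_majority, n_minority, target_ratio):
    """Synthetic minority samples needed so that minority/majority reaches target_ratio."""
    target_minority = round(target_ratio * n_majority)
    return max(0, target_minority - n_minority)

# Assumed 20:1 dataset sizes, for illustration only.
n_majority, n_minority = 2000, 100
for target_ratio in (0.1, 0.25, 0.5, 1.0):
    needed = synthetic_needed(n_majority, n_minority, target_ratio)
    smote_percent = 100 * needed / n_minority
    print(target_ratio, needed, smote_percent)
```

Each of these settings would then be compared on a validation set that contains no synthetic samples.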

Related

Evaluation of generative models like variational autoencoder

I hope everyone is doing well.
I need some help with generative models.
I'm working on a project where the main task is to build a binary classification model. The dataset contains 300,000 samples and 100 features, and the two classes are imbalanced: the majority class is much larger than the minority class.
To handle this problem, I'm using a VAE (variational autoencoder).
I train the VAE on the minority class, then use the decoder part of the VAE to generate new (fake) samples that are similar to the minority class, and concatenate this new data with the training set to obtain a balanced training set.
My question is: is there any way to evaluate generative models like a VAE, i.e. a way to know whether the generated data is similar to the real data?
I have read that there are metrics for evaluating generated data, such as the Inception Score and the Fréchet Inception Distance, but I have only seen them used on image data.
I want to know whether I can use them on my dataset too.
Thanks in advance
I assume your data is not image data, since you say there are 100 features. I believe you can check the similarity between the synthesized features and the original features (the ones belonging to the minority class), and keep only the synthesized samples that reach a certain similarity. Cosine similarity would be useful for this.
It would also be worth looking at a scatter plot of the synthesized features together with the original ones to see whether they lie close to each other. t-SNE would be useful for this.
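The similarity filter described above might look roughly like this. It is a standard-library sketch; the helper names, the 0.9 threshold, and the tiny three-feature rows are assumptions for illustration. Note that cosine similarity ignores vector magnitude, so a synthetic row scaled far away from the real data would still pass.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def filter_synthetic(synthetic, real_minority, threshold=0.9):
    """Keep synthetic rows whose best similarity to any real minority row is >= threshold."""
    kept = []
    for row in synthetic:
        best = max(cosine_similarity(row, real) for real in real_minority)
        if best >= threshold:
            kept.append(row)
    return kept

# Hypothetical toy data: two real minority rows and three generated rows.
real_minority = [[1.0, 2.0, 3.0], [2.0, 1.0, 0.5]]
synthetic = [[1.1, 2.1, 2.9], [-1.0, -2.0, -3.0], [2.0, 1.1, 0.4]]
print(filter_synthetic(synthetic, real_minority))
```

Here the second generated row points in the opposite direction to the real rows and is dropped.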

Sample Weight vs. Down Sampling for Handling Imbalanced Data

Suppose you have an imbalanced dataset. Without generating new data for it, how can you handle it efficiently? I know we can use sample weights or downsampling, but I am not sure which of the two to choose. Also, if you need to build a classification model on this imbalanced data, how will these two techniques influence model performance differently?
It depends on how many observations you have left after downsampling, and on how well the downsampled class still represents the variety of the original class.
For example, suppose class 1 has 100 observations and class 2 has 2,000 (class 1 is about 5%). Downsampling would not make sense here, because too few observations would remain to train a model effectively: 100 observations is very few, and the model would likely generalize poorly.
But if class 1 has 100,000 observations and class 2 has 2,000,000 (about 5% again), then downsampling still makes sense, as you have enough observations to train a model.
So the answer depends on the data you have. I personally would go with SMOTE. Hope this helps.
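To compare the two options on the 100 vs. 2,000 example above, here is a small sketch. The weighting formula n_samples / (n_classes * class_count) is one common "balanced" heuristic, and the function names are assumptions for this illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Per-class weight n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * n_c) for c, n_c in counts.items()}

def downsampled_size(labels):
    """Rows left if every class is cut down to the size of the smallest class."""
    counts = Counter(labels)
    return min(counts.values()) * len(counts)

labels = [1] * 100 + [2] * 2000
weights = balanced_class_weights(labels)
print(weights)                   # class 1 is weighted 20x more than class 2
print(downsampled_size(labels))  # rows remaining after downsampling to 1:1
```

With weights, all 2,100 rows are still used and each class contributes the same total weight (100 × 10.5 = 2000 × 0.525); with downsampling to 1:1, only 200 rows remain.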

tackle class imbalance in single shot object detector

I am training an object detection model for multi-class objects in the image. The dataset is custom collected and labelled data with bounding boxes and class labels in the ground truth data.
I trained the MobileNet+SSD, SqueezeDet and YOLOv3 networks on this custom data but got poor results. I chose these models for their speed and light weight (low memory footprint). Their single-shot detection approach has also been shown to perform well in the literature.
The class instance distribution in the dataset is as below
Class 1 -- 2469
Class 2 -- 5660
Class 3 -- 7614
Class 4 -- 13253
Class 5 -- 35262
Each image can contain objects from any of the five classes. Classes 4 and 5 occur far more often than the others.
The performance is very skewed: recall and Average Precision are high for classes 4 and 5, and an order of magnitude lower for the other three classes.
I have tried tuning different filtering parameters, the NMS threshold, and the model training parameters, to no avail.
Question:
How can I tackle such class imbalance to boost the Average Precision and detection accuracy for all classes in object detection models?
Low precision means your model is suffering from false positives, so you can try hard negative mining: run your model, find the false positives, and include them in your training data as negative examples. You can even try using only these false positives as negative examples.
As you would expect, another way is to collect more data if possible.
If that is not possible, you may consider adding synthetic data (e.g. change the brightness of the image, or the viewpoint, by multiplying with a matrix so that it looks stretched).
One last option is to balance the data per class, e.g. about 5k instances for each.
PS: Keep in mind that the flexibility of your model has a great impact, so watch out for overfitting and underfitting.
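The hard negative mining step suggested above boils down to finding detections that match no ground-truth box. A minimal sketch, assuming boxes are `(x1, y1, x2, y2)` tuples and using a hypothetical IoU threshold of 0.5 (the function names and example boxes are made up for illustration):

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def hard_negatives(detections, ground_truth, iou_threshold=0.5):
    """Return detections that match no ground-truth box of the same class."""
    negatives = []
    for det_box, det_class in detections:
        matched = any(
            gt_class == det_class and iou(det_box, gt_box) >= iou_threshold
            for gt_box, gt_class in ground_truth
        )
        if not matched:
            negatives.append((det_box, det_class))
    return negatives

# Hypothetical example: one labelled object and three model detections.
ground_truth = [((0, 0, 10, 10), 1)]
detections = [((1, 1, 10, 10), 1), ((20, 20, 30, 30), 5), ((0, 0, 10, 10), 4)]
print(hard_negatives(detections, ground_truth))
```

The second detection overlaps nothing and the third has the wrong class, so both are returned as candidates to feed back into training as negatives.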
When generating synthetic data as suggested in the previous answer, do not apply illumination or viewpoint variations to your whole dataset; apply them randomly instead. The number of instances per class is also far out of balance, so it would be best either to cap the counts for the frequent classes or to gather more data for the rare ones. You could also try applying class weights to penalize the over-represented classes more. You are making a lot of assumptions; simple experimentation may yield results that surprise you. Remember that deep learning is part science and a lot of art.

Balance classes in cross validation

I would like to build a GBM model with H2O. My data set is imbalanced, so I am using the balance_classes parameter. For grid search (parameter tuning) I would like to use 5-fold cross validation. I am wondering how H2O deals with class balancing in that case. Will only the training folds be rebalanced? I want to be sure the test-fold is not rebalanced.
In class imbalance settings, artificially balancing the test/validation set does not make any sense: these sets must remain realistic, i.e. you want to test your classifier's performance in the real-world setting, where, say, the negative class makes up 99% of the samples, in order to see how well your model predicts the 1% positive class of interest without too many false positives. Artificially inflating the minority class or reducing the majority one will lead to unrealistic performance metrics that bear no real relation to the real-world problem you are trying to solve.
For corroboration, here is Max Kuhn, creator of the caret R package and co-author of the (highly recommended) Applied Predictive Modeling textbook, in Chapter 11: Subsampling For Class Imbalances of the caret ebook:
You would never want to artificially balance the test set; its class frequencies should be in-line with what one would see “in the wild”.
Re-balancing makes sense only in the training set, so as to prevent the classifier from simply and naively classifying all instances as negative for a perceived accuracy of 99%.
Hence, you can rest assured that in the setting you describe the rebalancing takes action only for the training set/folds.
One way to force balancing is to use a weights column that assigns different weights to different classes; in H2O this is the weights_column parameter.
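The principle discussed in this thread (rebalance the training folds, leave the held-out fold untouched) can be illustrated without any library. The sketch below is not H2O's implementation; the helper names and the toy 190:10 label list are assumptions, and random duplication stands in for whatever rebalancing method is actually used.

```python
import random
from collections import Counter, defaultdict

def stratified_folds(labels, n_folds=5, seed=0):
    """Split row indices into folds that preserve the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(n_folds)]
    for idx in by_class.values():
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % n_folds].append(i)
    return folds

def oversample_to_balance(indices, labels, seed=0):
    """Duplicate rows of smaller classes (with replacement) until all classes match the largest."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i in indices:
        by_class[labels[i]].append(i)
    target = max(len(idx) for idx in by_class.values())
    balanced = []
    for idx in by_class.values():
        balanced.extend(idx)
        balanced.extend(rng.choices(idx, k=target - len(idx)))
    return balanced

labels = [0] * 190 + [1] * 10
folds = stratified_folds(labels, n_folds=5)
for k, val_idx in enumerate(folds):
    train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
    train_balanced = oversample_to_balance(train_idx, labels)  # training folds only
    train_counts = Counter(labels[i] for i in train_balanced)
    val_counts = Counter(labels[i] for i in val_idx)           # left as-is
    print(k, dict(train_counts), dict(val_counts))
```

Every validation fold keeps the original 19:1 class ratio, while each training set is balanced before fitting.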

data imbalance in SVM using libSVM

How should I set my gamma and Cost parameters in libSVM when I am using an imbalanced dataset that consists of 75% 'true' labels and 25% 'false' labels? I'm getting a constant error of having all the predicted labels set on 'True' due to the data imbalance.
If the issue isn't with libSVM, but with my dataset, how should I handle this imbalance from a Theoretical Machine Learning standpoint? *The number of features I'm using is between 4-10 and I have a small set of 250 data points.
Class imbalance has nothing to do with the selection of C and gamma. To deal with it, use a class weighting scheme, which is available in, for example, the scikit-learn package (whose SVC is built on libsvm).
The best C and gamma are selected using grid search with cross-validation. You should try a wide range of values here: for C it is reasonable to choose values between 1 and 10^15, while a simple and good heuristic for the gamma range is to compute the pairwise distances between all your data points and select gamma according to the percentiles of this distribution. Think of placing a Gaussian with variance on the order of 1/gamma on each point: if you select a gamma such that this Gaussian overlaps with many points you will get a very "smooth" model, while a small variance leads to overfitting.
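The percentile heuristic for gamma can be sketched as below. This takes the "variance equal to 1/gamma" reading at face value, so gamma = 1/d² for a distance d at a given percentile; with the usual RBF form exp(-gamma * |u - v|²) a constant factor may differ, which does not matter much when the values only seed a grid search. The function names, the nearest-rank percentile, and the four toy points are assumptions.

```python
import math
from itertools import combinations

def pairwise_distances(points):
    """Sorted Euclidean distances between all pairs of points."""
    return sorted(math.dist(a, b) for a, b in combinations(points, 2))

def gamma_candidates(points, percentiles=(10, 50, 90)):
    """Map each percentile to gamma = 1 / d**2, with d the distance at that percentile."""
    dists = pairwise_distances(points)
    candidates = {}
    for p in percentiles:
        # Nearest-rank percentile of the sorted distances.
        rank = max(1, math.ceil(p / 100 * len(dists)))
        d = dists[rank - 1]
        if d > 0:  # skip duplicate points, which would give an infinite gamma
            candidates[p] = 1.0 / (d * d)
    return candidates

points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (0.0, 1.0)]
print(gamma_candidates(points))
```

Small percentiles give a large gamma (narrow kernel), large percentiles a small gamma (wide kernel); the resulting values would be the gamma axis of the cross-validated grid.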
Imbalanced data sets can be tackled in various ways. Class balance has no effect on kernel parameters such as gamma for the RBF kernel.
The two most popular approaches are:
Use different misclassification penalties per class, which basically means changing C per class. Typically the smallest class gets weighted higher; a common approach is npos * wpos = nneg * wneg. LIBSVM allows you to do this using its -wX flags.
Subsample the overrepresented class to obtain an equal amount of positives and negatives and proceed with training as you traditionally would for a balanced set. Take note that you basically ignore a large chunk of data this way, which is intuitively a bad idea.
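The weighting rule npos * wpos = nneg * wneg from the first approach can be worked out as below. The counts (75 'true' and 25 'false' per 100 points) come from the 75%/25% split in the question; the function name and the choice to give the largest class weight 1.0 are assumptions for the example.

```python
def balancing_weights(class_counts):
    """Weights such that count * weight is equal across classes; the largest class gets 1.0."""
    n_max = max(class_counts.values())
    return {label: n_max / n for label, n in class_counts.items()}

counts = {"true": 75, "false": 25}
weights = balancing_weights(counts)
print(weights)
```

With these weights, 75 × 1.0 = 25 × 3.0, so an error on the minority 'false' class is penalized three times as heavily; in LIBSVM the per-class weight multiplies C for that class.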
I know this has been asked some time ago, but I would like to answer it since you might find my answer useful.
As others have mentioned, you might want to consider using different weights for the minority classes or using different misclassification penalties. However, there is a more clever way of dealing with the imbalanced datasets.
You can use the SMOTE (Synthetic Minority Over-sampling Technique) algorithm to generate synthetic data for the minority class. It is a simple algorithm that deals with some imbalanced datasets pretty well.
In each iteration, SMOTE picks an instance of the minority class and one of its nearest minority-class neighbors, and adds an artificial example of the same class somewhere on the line segment between them. The algorithm keeps adding such samples to the dataset until the two classes are balanced or some other criterion is met (e.g. a certain number of examples has been added). The original answer included a picture illustrating what the algorithm does for a simple dataset in a 2D feature space.
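A bare-bones sketch of the interpolation step is shown below. It follows the usual SMOTE formulation, where the second point is one of the k nearest minority-class neighbors of the first rather than an arbitrary minority instance. The function name `smote_sketch`, its parameters, and the toy points are assumptions; a maintained library implementation is preferable for real work.

```python
import math
import random

def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points, each between a minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # Rank the other minority points by distance to the base point.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        t = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + t * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Toy 2D minority class (needs at least two points).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sketch(minority, n_synthetic=6, k=2)
print(new_points)
```

Because every new point lies on a segment between two existing minority points, the synthetic data stays inside the region already covered by the minority class.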
Associating weight with the minority class is a special case of this algorithm. When you associate weight $w_i$ with instance i, you are basically adding the extra $w_i - 1$ instances on top of the instance i!
What you need to do is augment your initial dataset with the samples created by this algorithm, and train the SVM on this new dataset. You can find many implementations online in different languages, such as Python and MATLAB.
There have been other extensions of this algorithm, I can point you to more materials if you want.
To test the classifier, split the dataset into train and test sets, add synthetic instances to the train set only (DO NOT ADD ANY TO THE TEST SET), train the model on the train set, and finally evaluate it on the test set. If you include the generated instances when testing, you will end up with a biased (and unrealistically high) accuracy and recall.
