Machine learning model suggestion for large imbalance data - machine-learning

I have data set for classification problem. I have in total 50 classes.
Class1: 10,000 examples
Class2: 10 examples
Class3: 5 examples
Class4: 35 examples
.
.
.
and so on.
I tried to train my classifier using SVM ( both linear and Gaussian kernel). My accurate is very bad on test data 65 and 72% respectively. Now I am thinking to go for a neural network. Do you have any suggestion for any machine learning model and algorithm for large imbalanced data? It would be extremely helpful to me

You should provide more information about the data set features and the class distribution, this would help others to advice you.
In any case, I don't think a neural network fits here as this data set is too small for it.
Assuming 50% or more of the samples are of class 1 then I would first start by looking for a classifier that differentiates between class 1 and non-class 1 samples (binary classification). This classifier should outperform a naive classifier (benchmark) which randomly chooses a classification with a prior corresponding to the training set class distribution.
For example, assuming there are 1,000 samples, out of which 700 are of class 1, then the benchmark classifier would classify a new sample as class 1 in a probability of 700/1,000=0.7 (like an unfair coin toss).
Once you found a classifier with good accuracy, the next phase can be classifying the non-class 1 classified samples as one of the other 49 classes, assuming these classes are more balanced then I would start with RF, NB and KNN.

There are multiple ways to handle with imbalanced datasets, you can try
Up sampling
Down Sampling
Class Weights
I would suggest either Up sampling or providing class weights to balance it
https://towardsdatascience.com/5-techniques-to-work-with-imbalanced-data-in-machine-learning-80836d45d30c
You should think about your performance metric, don't use Accuracy score as your performance metric , You can use Log loss or any other suitable metric
https://machinelearningmastery.com/failure-of-accuracy-for-imbalanced-class-distributions/

From my experience the most successful ways to deal with unbalanced classes are :
Changing distribiution of inputs: 20000 samples (the approximate number of examples which you have) is not a big number so you could change your dataset distribiution simply by using every sample from less frequent classes multiple times. Depending on a number of classes you could set the number of examples from them to e.g. 6000 or 8000 each in your training set. In this case remember to not change distribiution on test and validation set.
Increase the time of training: in case of neural networks, when changing distribiution of your input is impossible I strongly advise you trying to learn network for quite a long time (e.g. 1000 epochs). In this case you have to remember about regularisation. I usually use dropout and l2 weight regulariser with their parameters learnt by random search algorithm.
Reduce the batch size: In neural networks case reducing a batch size might lead to improving performance on less frequent classes.
Change your loss function: using MAPE insted of Crossentropy may also improve accuracy on less frequent classes.
Feel invited to test different combinations of approaches shown by e.g. random search algorithm.

Data-level methods:
Undersampling runs the risk of losing important data from removing data. Oversampling runs the risk of overfitting on training data, especially if the added copies of the minority class are replicas of existing data. Many sophisticated sampling techniques have been developed to mitigate these risks.
One such technique is two-phase learning. You first train your model on the resampled data. This resampled data can be achieved by randomly undersampling large classes until each class has only N instances. You then fine-tune your model on the original data.
Another technique is dynamic sampling: oversample the low-performing classes and undersample the high-performing classes during the training process. Introduced by Pouyanfar et al., the method aims to show the model less of what it has already learned and more of what it has not.
Algorithm-level methods
Cost-sensitive learning
Class-balanced loss
Focal loss
References:
esigning Machine Learning Systems
Survey on deep learning with class imbalance

Related

Feature Engineering on imbalanced data

I am training a machine learning model on a classification problem. My dataset is 10000 observations with 37 categorical class. But the data is imbalanced, I have some classes with 100 observations and some other classes with 3000 and 4000 observations.
After searching on how to do some feature engineering on this type of data to improve the performance of the algorithm. I found 2 solutions:
upsampling which means to get more data about the minority classes
downsampling which means to remove data about the majority classes
According to the first solution:
I have many classes with a few observations so it will require much more data and long time. so it will be a hard for me!
And by applying the second one:
I think all classes will have a few observations and the data will be very small so that it will be hard for the algorithm to generalize.
So Is there another solution I can try for this problem?
You can change the weights in your loss function so that the smaller classes have larger importance when optimizing. In keras you can use weighted_cross_entropy_with_logits, for example.
You could use a combination of both.
It sounds like you are worried about getting a dataset that is too large if you upsample all minority classes to match the majority classes. If this is the case, you can downsample the majority classes to something like 25% or 50%, and at the same time upsample the minority classes. An alternative to upsampling is synthesising samples for the minority classes using an algorithm like SMOTE.
If you are training a neural network in batch, it is good to make sure that the training set is properly shuffled and that you have an even-is distribution of minority/majority samples across the mini batches.

How can we modify the training dataset to achieve better accuracy on the minority class for any classification algorithm?

In many training datasets, the class distribution is skewed, I.e. one class is very frequent (e.g. 95%), while another is rare (e.g. 5%). In such applications, the Naive classifier that assigns all test cases to the majority class achieves very high accuracy. However, the Naive classifier does not predict any cases of the minority class, although it is often more important than the majority class.
So are there any ways to improve the accuracy on the minority class just by modifying the training dataset? or Do we have to modify the classification algorithm as well?
This is the issue of the Imbalanced classification. Below are the methods used to treat the Imbalanced Datasets.
1.Undersampling
2.Oversampling
3.Synthetic Data Generation
4.Cost-Sensitive Learning
1.Undersampling
This method works with the majority class. It reduces the number of observations from the majority class to make the data set balanced. This method is best to use when the data set is huge and reducing the number of training samples helps to improve run time and storage troubles.
Undersampling methods are of 2 types: Random and Informative.
Oversampling
This method works with minority class. It replicates the observations from minority class to balance the data. It is also known as upsampling. Similar to undersampling, this method also can be divided into two types: Random Oversampling and Informative Oversampling.
Synthetic Data Generation
In simple words, instead of replicating and adding observations from the minority class, it overcomes imbalances by generates artificial data. It is also a type of oversampling technique.
In regards to synthetic data generation, synthetic minority oversampling technique (SMOTE) is a powerful and widely used method. SMOTE algorithm creates artificial data based on feature space (rather than data space) similarities from minority samples. We can also say, it generates a random set of minority class observations to shift the classifier learning bias towards minority class.
Cost-Sensitive Learning (CSL)
It is another commonly used method to handle classification problems with imbalanced data. It’s an interesting method. In simple words, this method evaluates the cost associated with misclassifying observations.
It does not create a balanced data distribution. Instead, it highlights the imbalanced learning problem by using cost matrices which describe the cost for misclassification in a particular scenario. Recent researches have shown that cost-sensitive learning has many times outperformed sampling methods. Therefore, this method provides a likely alternative to sampling methods.
To build these in R refer the link https://www.analyticsvidhya.com/blog/2016/03/practical-guide-deal-imbalanced-classification-problems/
There are many ways. You can upsample miority class, downsample majority, do some weighting based on loss, or some other measures. As you've tagged your post to be related to TF I suggest you look at Importance Sampling for Keras, http://idiap.ch/~katharas/importance-sampling/ it has some ready to use wrappers that you put around your model.

Machine Learning Experiment Design with Small Positive Sample Set in Sci-kit Learn

I am interested in any tips on how to train a set with a very limited positive set and a large negative set.
I have about 40 positive examples (quite lengthy articles about a particular topic), and about 19,000 negative samples (most drawn from the sci-kit learn newsgroups dataset). I also have about 1,000,000 tweets that I could work with.. negative about the topic I am trying to train on. Is the size of the negative set versus the positive going to negatively influence training a classifier?
I would like to use cross-validation in sci-kit learn. Do I need to break this into train / test-dev / test sets? Is know there are some pre-built libraries in sci-kit. Any implementation examples that you recommend or have used previously would be helpful.
Thanks!
The answer to your first question is yes, the amount by which it will affect your results depends on the algorithm. My advive would be to keep an eye on the class-based statistics such as recall and precision (found in classification_report).
For RandomForest() you can look at this thread which discusses
the sample weight parameter. In general sample_weight is what
you're looking for in scikit-learn.
For SVM's have a look at either this example or this
example.
For NB classifiers, this should be handled implicitly by Bayes
rule, however in practice you may see some poor performances.
For you second question it's up for discussion, personally I break my data into a training and test split, perform cross validation on the training set for parameter estimation, retrain on all the training data and then test on my test set. However the amount of data you have may influence the way you split your data (more data means more options).
You could probably use Random Forest for your classification problem. There are basically 3 parameters to deal with data imbalance. Class Weight, Samplesize and Cutoff.
Class Weight-The higher the weight a class is given, the more its error rate is decreased.
Samplesize- Oversample the minority class to improve class imbalance while sampling the defects for each tree[not sure if Sci-kit supports this, used to be param in R)
Cutoff- If >x% trees vote for the minority class, classify it as minority class. By default x is 1/2 in Random forest for 2-class problem. You can set it to a lower value for the minority class.
Check out balancing predict error at https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
For the 2nd question if you are using Random Forest, you do not need to keep separate train/validation/test set. Random Forest does not choose any parameters based on a validation set, so validation set is un-necessary.
Also during the training of Random Forest, the data for training each individual tree is obtained by sampling by replacement from the training data, thus each training sample is not used for roughly 1/3 of the trees. We can use the votes of these 1/3 trees to predict the out of box probability of the Random forest classification. Thus with OOB accuracy you just need a training set, and not validation or test data to predict performance on unseen data. Check Out of Bag error at https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm for further study.

Training on imbalanced data using TensorFlow

The Situation:
I am wondering how to use TensorFlow optimally when my training data is imbalanced in label distribution between 2 labels. For instance, suppose the MNIST tutorial is simplified to only distinguish between 1's and 0's, where all images available to us are either 1's or 0's. This is straightforward to train using the provided TensorFlow tutorials when we have roughly 50% of each type of image to train and test on. But what about the case where 90% of the images available in our data are 0's and only 10% are 1's? I observe that in this case, TensorFlow routinely predicts my entire test set to be 0's, achieving an accuracy of a meaningless 90%.
One strategy I have used to some success is to pick random batches for training that do have an even distribution of 0's and 1's. This approach ensures that I can still use all of my training data and produced decent results, with less than 90% accuracy, but a much more useful classifier. Since accuracy is somewhat useless to me in this case, my metric of choice is typically area under the ROC curve (AUROC), and this produces a result respectably higher than .50.
Questions:
(1) Is the strategy I have described an accepted or optimal way of training on imbalanced data, or is there one that might work better?
(2) Since the accuracy metric is not as useful in the case of imbalanced data, is there another metric that can be maximized by altering the cost function? I can certainly calculate AUROC post-training, but can I train in such a way as to maximize AUROC?
(3) Is there some other alteration I can make to my cost function to improve my results for imbalanced data? Currently, I am using a default suggestion given in TensorFlow tutorials:
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(pred, y))
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)
I have heard this may be possible by up-weighting the cost of miscategorizing the smaller label class, but I am unsure of how to do this.
(1)It's ok to use your strategy. I'm working with imbalanced data as well, which I try to use down-sampling and up-sampling methods first to make the training set even distributed. Or using ensemble method to train each classifier with an even distributed subset.
(2)I haven't seen any method to maximise the AUROC. My thought is that AUROC is based on true positive and false positive rate, which doesn't tell how well it works on each instance. Thus, it may not necessarily maximise the capability to separate the classes.
(3)Regarding weighting the cost by the ratio of class instances, it similar to Loss function for class imbalanced binary classifier in Tensor flow
and the answer.
Regarding imbalanced datasets, the first two methods that come to mind are (upweighting positive samples, sampling to achieve balanced batch distributions).
Upweighting positive samples
This refers to increasing the losses of misclassified positive samples when training on datasets that have much fewer positive samples. This incentivizes the ML algorithm to learn parameters that are better for positive samples. For binary classification, there is a simple API in tensorflow that achieves this. See (weighted_cross_entropy) referenced below
https://www.tensorflow.org/api_docs/python/tf/nn/weighted_cross_entropy_with_logits
Batch Sampling
This involves sampling the dataset so that each batch of training data has an even distribution positive samples to negative samples. This can be done using the rejections sampling API provided from tensorflow.
https://www.tensorflow.org/api_docs/python/tf/contrib/training/rejection_sample
I'm one who struggling with imbalanced data. What my strategy to counter imbalanced data are as below.
1) Use cost function calculating 0 and 1 labels at the same time like below.
cost = tf.reduce_mean(-tf.reduce_sum(y*tf.log(_pred) + (1-y)*tf.log(1-_pred), reduction_indices=1))
2) Use SMOTE, oversampling method making number of 0 and 1 labels similar. Refer to here, http://comments.gmane.org/gmane.comp.python.scikit-learn/5278
Both strategy worked when I tried to make credit rating model.
Logistic regression is typical method to handle imbalanced data and binary classification such as predicting default rate. AUROC is one of the best metric to counter imbalanced data.
1) Yes. This is well received strategy to counter imbalanced data. But this strategy is good in Neural Nets only if you using SGD.
Another easy way to balance the training data is using weighted examples. Just amplify the per-instance loss by a larger weight/smaller when seeing imbalanced examples. If you use online gradient descent, it can be as simple as using a larger/smaller learning rate when seeing imbalanced examples.
Not sure about 2.

data imbalance in SVM using libSVM

How should I set my gamma and Cost parameters in libSVM when I am using an imbalanced dataset that consists of 75% 'true' labels and 25% 'false' labels? I'm getting a constant error of having all the predicted labels set on 'True' due to the data imbalance.
If the issue isn't with libSVM, but with my dataset, how should I handle this imbalance from a Theoretical Machine Learning standpoint? *The number of features I'm using is between 4-10 and I have a small set of 250 data points.
Classes imbalance has nothing to do with selection of C and gamma, to deal with this issue you should use the class weighting scheme which is avaliable in for example scikit-learn package (built on libsvm)
Selection of best C and gamma is performed using grid search with cross validation. You should try vast range of values here, for C it is reasonable to choose values between 1 and 10^15 while a simple and good heuristic of gamma range values is to compute pairwise distances between all your data points and select gamma according to the percentiles of this distribution - think about putting in each point a gaussian distribution with variance equal to 1/gamma - if you select such gamma that this distribution overlaps will many points you will get very "smooth" model, while using small variance leads to the overfitting.
Imbalanced data sets can be tackled in various ways. Class balance has no effect on kernel parameters such as gamma for the RBF kernel.
The two most popular approaches are:
Use different misclassification penalties per class, this basically means changing C. Typically the smallest class gets weighed higher, a common approach is npos * wpos = nneg * wneg. LIBSVM allows you to do this using its -wX flags.
Subsample the overrepresented class to obtain an equal amount of positives and negatives and proceed with training as you traditionally would for a balanced set. Take note that you basically ignore a large chunk of data this way, which is intuitively a bad idea.
I know this has been asked some time ago, but I would like to answer it since you might find my answer useful.
As others have mentioned, you might want to consider using different weights for the minority classes or using different misclassification penalties. However, there is a more clever way of dealing with the imbalanced datasets.
You can use the SMOTE (Synthetic Minority Over-sampling Technique) algorithm to generate synthesized data for the minority class. It is a simple algorithm that can deal with some imbalance datasets pretty well.
In each iteration of the algorithm, SMOTE considers two random instances of the minority class and add an artificial example of the same class somewhere in between. The algorithm keeps injecting the dataset with the samples until the two classes become balanced or some other criteria(e.g. add certain number of examples). Below you can find a picture describing what the algorithm does for a simple dataset in 2D feature space.
Associating weight with the minority class is a special case of this algorithm. When you associate weight $w_i$ with instance i, you are basically adding the extra $w_i - 1$ instances on top of the instance i!
What you need to do is to augment your initial dataset with the samples created by this algorithm, and train the SVM with this new dataset. You can also find many implementation online in different languages like Python and Matlab.
There have been other extensions of this algorithm, I can point you to more materials if you want.
To test the classifier you need to split the dataset into test and train, add synthetic instances to the train set (DO NOT ADD ANY TO THE TEST SET), train the model on the train set, and finally test it on the test set. If you consider the generated instances when you are testing you will end up with a biased(and ridiculously higher) accuracy and recall.

Resources