Feature Engineering on imbalanced data - machine-learning

I am training a machine learning model on a classification problem. My dataset has 10,000 observations across 37 classes. The data is imbalanced: some classes have about 100 observations, while others have 3,000 or 4,000.
After searching for ways to handle this type of data and improve the performance of the algorithm, I found two solutions:
upsampling, which means getting more data for the minority classes
downsampling, which means removing data from the majority classes
Regarding the first solution:
I have many classes with only a few observations, so it would require much more data and a long time. That would be hard for me!
And regarding the second one:
I think all classes would end up with only a few observations, and the dataset would be so small that it would be hard for the algorithm to generalize.
So is there another solution I can try for this problem?

You can change the weights in your loss function so that the smaller classes have larger importance when optimizing. In TensorFlow you can use tf.nn.weighted_cross_entropy_with_logits, for example; with Keras, the class_weight argument of fit serves the same purpose.
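As a rough illustration of how such weights might be chosen, here is a minimal sketch that weights each class by n_samples / (n_classes * class_count). The function name and the label counts are made up for the example; other weighting schemes are equally valid.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Hypothetical label list: one large class and two small ones.
labels = ["a"] * 3000 + ["b"] * 100 + ["c"] * 400
weights = balanced_class_weights(labels)
for cls in sorted(weights):
    print(cls, round(weights[cls], 3))
```

With this scheme every class contributes the same total weight to the loss, so the rarest class gets the largest per-sample weight. A dictionary of this kind (keyed by integer class index) is the sort of thing a class-weighted loss expects.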

You could use a combination of both.
It sounds like you are worried about getting a dataset that is too large if you upsample all minority classes to match the majority classes. If this is the case, you can downsample the majority classes to something like 25% or 50%, and at the same time upsample the minority classes. An alternative to upsampling is synthesising samples for the minority classes using an algorithm like SMOTE.
If you are training a neural network in batches, it is good to make sure that the training set is properly shuffled and that you have an even-ish distribution of minority/majority samples across the mini-batches.
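The combination of downsampling and upsampling can be sketched as follows. This is only an illustration using random resampling (not SMOTE); the function name, the target of 200 per class and the toy data are assumptions made for the example.

```python
import random
from collections import Counter, defaultdict

def resample_to_target(samples, labels, target, seed=0):
    """Bring every class to `target` items: downsample big classes, upsample small ones."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    out = []
    for y, items in by_class.items():
        if len(items) >= target:
            chosen = rng.sample(items, target)  # downsample without replacement
        else:
            # keep all originals, then draw the missing ones with replacement
            chosen = items + rng.choices(items, k=target - len(items))
        out.extend((x, y) for x in chosen)
    rng.shuffle(out)  # shuffle so mini-batches mix the classes
    return out

# Hypothetical data: 1000 majority samples and 50 minority samples.
samples = list(range(1050))
labels = [0] * 1000 + [1] * 50
balanced = resample_to_target(samples, labels, target=200)
counts = Counter(y for _, y in balanced)
print(counts)
```

Resampling like this should only be applied to the training split, so that validation and test data keep the original class distribution.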

Related

Is Naive Bayes biased?

I have a use case wherein text needs to be classified into one of three categories. I started with Naive Bayes [Apache OpenNLP, Java], but I was informed that the algorithm is biased, meaning that if my training data has 60% of the data as classA, 30% as classB and 10% as classC, then the algorithm tends to be biased towards classA and thus predicts texts of the other classes to be classA.
If this is true, is there a way to overcome this issue?
There are other algorithms that I came across, like an SVM classifier or logistic regression (maximum entropy model); however, I am not sure which will be more suitable for my use case. Please advise.
Is there a way to overcome this issue?
Yes, there is. But first you need to understand why it happens.
Basically, your dataset is imbalanced.
An imbalanced dataset means that the number of observations is not the same for all the classes in a classification dataset; some classes have many more instances than others.
In this scenario, your model becomes biased towards the class with the majority of samples, as you have more training data for that class.
Solutions
Under sampling:
Randomly remove samples from the majority class to balance the dataset.
Over sampling:
Add more samples of the minority classes to balance the dataset.
Change Performance Metrics
Use F1-score, recall or precision to measure the performance of your model.
There are a few more solutions; if you want to know more, refer to this blog.
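To see why these metrics matter, here is a small sketch that computes precision, recall and F1 per class for a hypothetical classifier that always predicts classA on a 60/30/10 split like the one in the question. The helper name and the toy predictions are assumptions made for the example.

```python
def per_class_scores(y_true, y_pred):
    """Return {class: (precision, recall, f1)} computed one-vs-rest."""
    scores = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[cls] = (precision, recall, f1)
    return scores

# Hypothetical 60/30/10 split and a classifier that always answers "A".
y_true = ["A"] * 60 + ["B"] * 30 + ["C"] * 10
y_pred = ["A"] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
scores = per_class_scores(y_true, y_pred)
macro_f1 = sum(f1 for _, _, f1 in scores.values()) / len(scores)
print(accuracy, macro_f1)
```

Accuracy looks acceptable here even though two of the three classes are never predicted, while the macro-averaged F1 exposes the problem.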
There are other algorithms that I came across, like an SVM classifier or logistic regression (maximum entropy model); however, I am not sure which will be more suitable for my use case.
You will never know unless you try. I would suggest you try 3-4 different algorithms on your data.

Does the ratio of two classes matter in classification problems?

I am building a sentiment analysis program using some tweets I have gathered. The labeled data which I have gathered would go through a neural network which classifies them into two classes, positive and negative.
The data is still being labeled. So far I have observed that the positive category has a very small number of observations.
The records for the positive category could be about 5% of the training data set (the same ratio could be reflected in the population as well).
Would this create problems in the final "program"?
The size of the data set is about 5000 records.
Yes, yes it can. There are two things to consider:
5% of 5000 is 250. Thus you will try to model the data distribution of your class based on just 250 samples. This might be orders of magnitude too small for a neural network. Consequently you might need 40x more data to have a representative sample of your data. While you can easily reduce the majority class through subsampling without a big risk of destroying the structure, there is no way to get "more structure" from fewer points (you can replicate points, add noise etc., but this does not add structure, it just adds assumptions).
Class imbalance can also lead to convergence to naive solutions, like "always False", which has 95% accuracy. Here you can simply play around with the cost function to make it more robust to imbalance. In particular, the training splits suggested by @PureW are nothing more than a "black box" method of changing the loss function so that it puts a bigger weight on the minority class. When you have access to your classifier's loss, as in a NN, you should not do this, but instead change the cost function and still keep all the data.
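Changing the cost function could look roughly like the sketch below: a binary cross-entropy in which errors on the positive (minority) class are multiplied by a weight. The function name, the pos_weight value of 19 (the 95/5 ratio) and the toy predictions are assumptions for the example.

```python
import math

def weighted_bce(y_true, p_pred, pos_weight=1.0, eps=1e-12):
    """Mean binary cross-entropy with the positive-class term scaled by pos_weight."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Hypothetical batch with 5% positives and a model that says "negative" for everything.
y_true = [1] * 5 + [0] * 95
always_negative = [0.01] * 100

unweighted = weighted_bce(y_true, always_negative)
weighted = weighted_bce(y_true, always_negative, pos_weight=19.0)
print(unweighted, weighted)
```

With the weight applied, the "always negative" solution is penalised far more heavily, so it stops being an attractive minimum, and no data has to be thrown away.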
Without even splits of the different classes, you may want to introduce weights in your loss function such that errors in the smaller class are deemed more important.
Another solution, since 5000 samples may or may not be much data depending on your problem, could be to sample more datasets. You basically take this set of 5000 samples, and sample datapoints from it such that you have a new dataset with even split of the classes. This means the new dataset is only 10% of your original dataset. But it is evenly split between the classes. You can redo this sampling many times and end up with several datasets, useful in bootstrap aggregating.
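The repeated sampling described above can be sketched like this. It is a minimal illustration with stand-in integer "samples"; the function name and the number of subsets are assumptions for the example.

```python
import random

def balanced_subsets(positives, negatives, n_subsets, seed=0):
    """Draw several datasets, each with an even split between the two classes."""
    rng = random.Random(seed)
    k = min(len(positives), len(negatives))  # size of the smaller class
    subsets = []
    for _ in range(n_subsets):
        pos = rng.sample(positives, k)
        neg = rng.sample(negatives, k)
        subset = [(x, 1) for x in pos] + [(x, 0) for x in neg]
        rng.shuffle(subset)
        subsets.append(subset)
    return subsets

# Stand-in data: 250 positive and 4750 negative records out of 5000.
positives = list(range(250))
negatives = list(range(250, 5000))
subsets = balanced_subsets(positives, negatives, n_subsets=10)
print(len(subsets), len(subsets[0]))
```

Each subset holds 500 records (10% of the original 5000). One model can be trained per subset and their predictions combined by voting or averaging, as in bootstrap aggregating.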

Machine learning model suggestion for large imbalance data

I have data set for classification problem. I have in total 50 classes.
Class1: 10,000 examples
Class2: 10 examples
Class3: 5 examples
Class4: 35 examples
.
.
.
and so on.
I tried to train my classifier using SVM (both linear and Gaussian kernel). My accuracy on test data is very bad: 65% and 72% respectively. Now I am thinking of going for a neural network. Do you have any suggestions for a machine learning model and algorithm for large imbalanced data? It would be extremely helpful to me.
You should provide more information about the data set features and the class distribution; this would help others to advise you.
In any case, I don't think a neural network fits here as this data set is too small for it.
Assuming 50% or more of the samples are of class 1, I would first start by looking for a classifier that differentiates between class 1 and non-class 1 samples (binary classification). This classifier should outperform a naive classifier (benchmark) which randomly chooses a classification with a prior corresponding to the training set class distribution.
For example, assuming there are 1,000 samples, out of which 700 are of class 1, the benchmark classifier would classify a new sample as class 1 with a probability of 700/1,000 = 0.7 (like an unfair coin toss).
Once you have found a classifier with good accuracy, the next phase can be classifying the samples classified as non-class 1 into one of the other 49 classes. Assuming these classes are more balanced, I would start with RF, NB and KNN.
There are multiple ways to handle imbalanced datasets; you can try:
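For the 700/1,000 example, the accuracy such a benchmark is expected to reach can be worked out directly. The sketch below assumes the test data follows the same class distribution as the training data; the function names are made up for the illustration.

```python
def prior_benchmark_accuracy(class_counts):
    """Expected accuracy of guessing each class with probability equal to its prior."""
    total = sum(class_counts.values())
    return sum((count / total) ** 2 for count in class_counts.values())

def majority_benchmark_accuracy(class_counts):
    """Accuracy of always predicting the most frequent class."""
    return max(class_counts.values()) / sum(class_counts.values())

counts = {"class 1": 700, "non-class 1": 300}
prior_acc = prior_benchmark_accuracy(counts)        # 0.7 * 0.7 + 0.3 * 0.3
majority_acc = majority_benchmark_accuracy(counts)  # 700 / 1000
print(round(prior_acc, 2), round(majority_acc, 2))
```

A binary classifier for class 1 versus the rest is only worth keeping if it clearly beats both numbers.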
Up sampling
Down Sampling
Class Weights
I would suggest either Up sampling or providing class weights to balance it
https://towardsdatascience.com/5-techniques-to-work-with-imbalanced-data-in-machine-learning-80836d45d30c
You should think about your performance metric: don't use accuracy as your performance metric. You can use log loss or any other suitable metric.
https://machinelearningmastery.com/failure-of-accuracy-for-imbalanced-class-distributions/
From my experience, the most successful ways to deal with unbalanced classes are:
Changing the distribution of inputs: 20000 samples (the approximate number of examples which you have) is not a big number, so you could change your dataset distribution simply by using every sample from the less frequent classes multiple times. Depending on the number of classes, you could set the number of examples from them to e.g. 6000 or 8000 each in your training set. In this case, remember not to change the distribution of the test and validation sets.
Increase the training time: in the case of neural networks, when changing the distribution of your input is impossible, I strongly advise you to try training the network for quite a long time (e.g. 1000 epochs). In this case you have to remember about regularisation. I usually use dropout and an l2 weight regulariser, with their parameters found by a random search algorithm.
Reduce the batch size: in the case of neural networks, reducing the batch size might lead to improved performance on less frequent classes.
Change your loss function: using MAPE instead of cross-entropy may also improve accuracy on less frequent classes.
Feel free to test different combinations of these approaches, chosen by e.g. a random search algorithm.
Data-level methods:
Undersampling runs the risk of losing important data from removing data. Oversampling runs the risk of overfitting on training data, especially if the added copies of the minority class are replicas of existing data. Many sophisticated sampling techniques have been developed to mitigate these risks.
One such technique is two-phase learning. You first train your model on the resampled data. This resampled data can be achieved by randomly undersampling large classes until each class has only N instances. You then fine-tune your model on the original data.
Another technique is dynamic sampling: oversample the low-performing classes and undersample the high-performing classes during the training process. Introduced by Pouyanfar et al., the method aims to show the model less of what it has already learned and more of what it has not.
Algorithm-level methods
Cost-sensitive learning
Class-balanced loss
Focal loss
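As an illustration of the last item, focal loss scales the usual cross-entropy term by (1 - p)^gamma, where p is the probability the model assigns to the true class, so well-classified examples contribute much less. This is a minimal per-example sketch; the function name and the default gamma of 2 are assumptions for the example.

```python
import math

def focal_loss(p_true, gamma=2.0, alpha=1.0, eps=1e-12):
    """Focal loss for one example, given the predicted probability of its true class."""
    p = min(max(p_true, eps), 1.0)  # clip to avoid log(0)
    return -alpha * (1 - p) ** gamma * math.log(p)

easy_ce = focal_loss(0.9, gamma=0.0)   # gamma = 0 gives plain cross-entropy
easy_fl = focal_loss(0.9)              # confident, correct prediction
hard_ce = focal_loss(0.1, gamma=0.0)
hard_fl = focal_loss(0.1)              # poorly classified example
print(easy_fl / easy_ce, hard_fl / hard_ce)
```

The easy example keeps only about 1% of its cross-entropy loss, while the hard one keeps about 81%, which shifts the training signal towards the examples the model still gets wrong.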
References:
Designing Machine Learning Systems
Survey on deep learning with class imbalance

data imbalance in SVM using libSVM

How should I set my gamma and Cost parameters in libSVM when I am using an imbalanced dataset that consists of 75% 'true' labels and 25% 'false' labels? I'm getting a constant error of having all the predicted labels set on 'True' due to the data imbalance.
If the issue isn't with libSVM, but with my dataset, how should I handle this imbalance from a Theoretical Machine Learning standpoint? *The number of features I'm using is between 4-10 and I have a small set of 250 data points.
Class imbalance has nothing to do with the selection of C and gamma. To deal with this issue you should use a class weighting scheme, which is available in, for example, the scikit-learn package (whose SVM implementation is built on libsvm).
Selection of the best C and gamma is performed using grid search with cross-validation. You should try a wide range of values here: for C it is reasonable to choose values between 1 and 10^15, while a simple and good heuristic for the range of gamma values is to compute pairwise distances between all your data points and select gamma according to the percentiles of this distribution. Think of placing a Gaussian with variance proportional to 1/gamma at each point: if you select a gamma such that this Gaussian overlaps with many points you will get a very "smooth" model, while using a small variance leads to overfitting.
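One possible reading of the percentile heuristic is sketched below: compute all pairwise squared distances and take gamma = 1 / d^2 at a few percentiles. The exact mapping from distance to gamma, the chosen percentiles and the function name are assumptions for the example, not something the answer specifies.

```python
import random

def gamma_candidates(points, percentiles=(10, 50, 90)):
    """Return one RBF gamma per percentile of the pairwise squared distances."""
    sq_dists = sorted(
        sum((a - b) ** 2 for a, b in zip(p, q))
        for i, p in enumerate(points)
        for q in points[i + 1:]
    )
    gammas = []
    for pct in percentiles:
        idx = min(len(sq_dists) - 1, int(pct / 100 * len(sq_dists)))
        d2 = sq_dists[idx]
        if d2 > 0:  # skip degenerate zero distances (duplicate points)
            gammas.append(1.0 / d2)
    return gammas

# Hypothetical data: 50 points with 4 features each.
rng = random.Random(0)
points = [tuple(rng.gauss(0, 1) for _ in range(4)) for _ in range(50)]
gammas = gamma_candidates(points)
print(gammas)
```

The candidates come out from largest to smallest gamma, that is, from the most local model to the smoothest one, and can then be fed into the grid search together with the values for C.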
Imbalanced data sets can be tackled in various ways. Class balance has no effect on kernel parameters such as gamma for the RBF kernel.
The two most popular approaches are:
Use different misclassification penalties per class; this basically means changing C per class. Typically the smallest class gets weighted higher, and a common approach is npos * wpos = nneg * wneg. LIBSVM allows you to do this using its -wX flags.
Subsample the overrepresented class to obtain an equal amount of positives and negatives and proceed with training as you traditionally would for a balanced set. Take note that you basically ignore a large chunk of data this way, which is intuitively a bad idea.
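The rule npos * wpos = nneg * wneg from the first approach can be worked out as in this small sketch. The counts (188 'true' and 62 'false', roughly 75%/25% of 250 points) and the function name are assumptions for the example.

```python
def balancing_weights(n_pos, n_neg):
    """Return (w_pos, w_neg) with n_pos * w_pos == n_neg * w_neg; the larger class gets weight 1."""
    if n_pos >= n_neg:
        return 1.0, n_pos / n_neg
    return n_neg / n_pos, 1.0

# Hypothetical counts for a 250-point dataset with about 75% 'true' labels.
w_pos, w_neg = balancing_weights(n_pos=188, n_neg=62)
print(w_pos, round(w_neg, 3))
```

The resulting per-class weights are the values one would supply to the per-class penalty options of the SVM implementation in use.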
I know this has been asked some time ago, but I would like to answer it since you might find my answer useful.
As others have mentioned, you might want to consider using different weights for the minority classes or using different misclassification penalties. However, there is a more clever way of dealing with the imbalanced datasets.
You can use the SMOTE (Synthetic Minority Over-sampling Technique) algorithm to generate synthesized data for the minority class. It is a simple algorithm that can deal with some imbalanced datasets pretty well.
In each iteration of the algorithm, SMOTE takes an instance of the minority class and one of its nearest neighbours from the same class, and adds an artificial example of that class somewhere in between. The algorithm keeps injecting such samples into the dataset until the two classes become balanced or some other criterion is met (e.g. a certain number of examples has been added). Below you can find a picture describing what the algorithm does for a simple dataset in a 2D feature space.
Associating weight with the minority class is a special case of this algorithm. When you associate weight $w_i$ with instance i, you are basically adding the extra $w_i - 1$ instances on top of the instance i!
What you need to do is augment your initial dataset with the samples created by this algorithm, and train the SVM on this new dataset. You can also find many implementations online in different languages like Python and Matlab.
There have been other extensions of this algorithm, I can point you to more materials if you want.
To test the classifier you need to split the dataset into test and train sets, add synthetic instances to the train set (DO NOT ADD ANY TO THE TEST SET), train the model on the train set, and finally test it on the test set. If you include the generated instances when you are testing, you will end up with a biased (and ridiculously higher) accuracy and recall.
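The procedure can be sketched in a few lines. This is a simplified SMOTE-style illustration (interpolating between a minority point and one of its k nearest minority neighbours), not a reference implementation; the function name, k and the toy 2D data are assumptions for the example.

```python
import random

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of `base`, excluding `base` itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        t = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(base, other)))
    return synthetic

# Hypothetical minority class in a 2D feature space.
rng = random.Random(1)
minority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(20)]

# Split first, then oversample the training part only.
train_minority, test_minority = minority[:15], minority[15:]
new_points = smote(train_minority, n_new=30)
augmented_train = train_minority + new_points
print(len(augmented_train), len(test_minority))
```

The test split is never passed to the oversampler, so the evaluation only ever sees real instances.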

What's the difference between ANN, SVM and KNN classifiers?

I am doing remote sensing image classification. I am using the object-oriented method: first I segmented the image to different regions, then I extract the features from regions such as color, shape and texture. The number of all features in a region may be 30 and commonly there are 2000 regions in all, and I will choose 5 classes with 15 samples for every class.
In summary:
Sample data: 15 × 5 × 30
Test data: 1975 × 30
How do I choose the proper classifier? If there are 3 classifiers (ANN, SVM, and KNN), which should I choose for better classification?
KNN is the most basic machine learning algorithm to parameterise and implement but, as alluded to by @etov, it would likely be outperformed by SVM due to the small training data size. ANNs have also been observed to be limited by insufficient training data. However, KNN makes the fewest assumptions about your data, other than that accurate training data should form relatively discrete clusters. ANN and SVM are notoriously difficult to parameterise, especially if you wish to repeat the process across multiple datasets, and they rely upon certain assumptions, such as that your data is linearly separable (SVM).
I would also recommend the Random Forests algorithm, as it is easy to implement and is relatively insensitive to training data size, but I would advise against using very small training data sizes.
The scikit-learn module contains these algorithms and is able to cope with large training data sizes, so you could increase the number of training data samples. The best way to know for sure would be to investigate them yourself, as suggested by @etov.
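To show how little there is to parameterise in KNN, here is a minimal sketch with a leave-one-out accuracy check, which is one way to evaluate a classifier when only a few labelled samples are available. The function names and the toy two-cluster data are assumptions for the example.

```python
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Predict the majority label among the k training points closest to x."""
    nearest = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

def leave_one_out_accuracy(X, y, k=3):
    """Classify each point using all the others and report the hit rate."""
    hits = 0
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        hits += knn_predict(rest_X, rest_y, X[i], k) == y[i]
    return hits / len(X)

# Toy data: two well-separated clusters.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
y = ["a"] * 4 + ["b"] * 4
loo_accuracy = leave_one_out_accuracy(X, y, k=3)
prediction = knn_predict(X, y, (0.5, 0.5))
print(loo_accuracy, prediction)
```

The only choices are k and the distance measure. The same leave-one-out loop can be wrapped around any other classifier to compare candidates on the same data.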
If your "sample data" is the train set, it seems very small. I'd first suggest using more than 15 examples per class.
As said in the comments, it's best to match the algorithm to the problem, so you can simply test to see which algorithm works better. But to start with, I'd suggest SVM: it works better than KNN with small train sets, and is generally easier to train than ANN, as there are fewer choices to make.
Have a look at the mind map below.
KNN: KNN performs well when the sample size is < 100K records, for non-textual data. If accuracy is not high, immediately move to SVC (the Support Vector Classifier of SVM).
SVM: When the sample size is > 100K records, go for SVM with SGDClassifier.
ANN: ANNs have evolved over time and they are powerful. You can use both ANN and SVM in combination to classify images.
More details are available at semanticscholar.org.

Resources