I am analyzing the Secom dataset from the UCI Machine Learning repository by lasso-regularized logistic regression, but the results are bad.
https://archive.ics.uci.edu/ml/datasets/SECOM
Characteristics:
1546 data samples with 590 numeric attributes
106 positive samples (production failures)
The goal is to predict the positive class accurately, and also to perform feature selection.
I optimize lambda by 10 fold cross validation with the glmnet package in R. But the results are terrible, as the model tends to assign all test samples to only a single class.
Is it simply the wrong kind of model for this dataset?
Prediction with imbalanced classes can be a very difficult problem. Thankfully, there is a large body of literature on how to approach it. One approach that worked really well for me is to use up-sampling and/or down-sampling techniques:
down-sampling: randomly subset all the classes in the training set so that their class frequencies match the least prevalent class. For example, suppose that 80% of the training set samples are the first class and the remaining 20% are in the second class. Down-sampling would randomly sample the first class to be the same size as the second class (so that only 40% of the total training set is used to fit the model). caret contains a function (downSample) to do this.
up-sampling: randomly sample (with replacement) the minority class to be the same size as the majority class. caret contains a function (upSample) to do this.
hybrid methods: techniques such as SMOTE and ROSE down-sample the majority class and synthesize new data points in the minority class. There are two packages (DMwR and ROSE) that implement these procedures.
I took the bullet points above from the caret documentation, which contains examples and R code for each of them. You should be able to use a lasso logistic regression and get better results after you have transformed your training data with one of these techniques.
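As a rough illustration of what random down-sampling and up-sampling do, here is a minimal sketch in plain Python. It is not caret's implementation (downSample and upSample are R functions); the helper names `down_sample` and `up_sample` and the toy 80/20 data are assumptions made for this example only.

```python
import random
from collections import Counter


def _indices_by_class(y):
    """Group row indices by their class label."""
    groups = {}
    for i, label in enumerate(y):
        groups.setdefault(label, []).append(i)
    return groups


def down_sample(X, y, seed=0):
    """Randomly subset every class to the size of the least prevalent class."""
    rng = random.Random(seed)
    groups = _indices_by_class(y)
    n_min = min(len(idx) for idx in groups.values())
    keep = []
    for idx in groups.values():
        keep.extend(rng.sample(idx, n_min))  # sampling without replacement
    rng.shuffle(keep)
    return [X[i] for i in keep], [y[i] for i in keep]


def up_sample(X, y, seed=0):
    """Sample smaller classes with replacement up to the majority class size."""
    rng = random.Random(seed)
    groups = _indices_by_class(y)
    n_max = max(len(idx) for idx in groups.values())
    keep = []
    for idx in groups.values():
        keep.extend(idx)  # keep every original row
        keep.extend(rng.choices(idx, k=n_max - len(idx)))  # add duplicates
    rng.shuffle(keep)
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data matching the 80% / 20% example in the text.
X = [[i] for i in range(100)]
y = [0] * 80 + [1] * 20

X_down, y_down = down_sample(X, y)
X_up, y_up = up_sample(X, y)
print(Counter(y_down), Counter(y_up))
```

Either function should be applied to the training folds only, so that the cross-validation used to pick lambda is still scored on data with the original class distribution.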
We are attempting to implement multi-label classification using CNN in pytorch. We have 8 labels and around 260 images using a 90/10 split for train/validation sets.
The classes are highly imbalanced with the most frequent class occurring in over 140 images. On the other hand, the least frequent class occurs in less than 5 images.
We initially used the BCEWithLogitsLoss function, which led to the model predicting the same label for all images.
We then implemented a focal loss approach to handle class imbalance as follows:
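import torch
import torch.nn as nn


class FocalLoss(nn.Module):
    def __init__(self, alpha=1, gamma=2):
        super().__init__()
        self.alpha = alpha  # overall scaling factor
        self.gamma = gamma  # focusing parameter; larger values down-weight easy examples

    def forward(self, outputs, targets):
        # outputs are raw logits; targets are multi-hot floats of the same shape.
        # BCEWithLogitsLoss defaults to reduction='mean', so bce_loss is a single
        # scalar and the focal term below is applied to the batch mean, not per element.
        bce_criterion = nn.BCEWithLogitsLoss()
        bce_loss = bce_criterion(outputs, targets)
        pt = torch.exp(-bce_loss)
        focal_loss = self.alpha * (1 - pt) ** self.gamma * bce_loss
        return focal_loss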
This resulted in the model predicting empty sets (no labels) for every image since it could not get a greater than 0.5 confidence for any classes.
Is there an approach in pytorch to help address this situation?
There are basically three ways of dealing with this.
1. Discard data from the more common class
2. Weight minority class loss values more heavily
3. Oversample the minority class
Option 1 is implemented by selecting the files you include in your Dataset.
Option 2 is implemented with the pos_weight parameter for BCEWithLogitsLoss
Option 3 is implemented with a custom Sampler passed to your Dataloader
For deep learning, oversampling typically works best.
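To make options 2 and 3 more concrete, here is a minimal sketch of how the weights could be derived from a multi-hot label matrix. It uses plain Python so that it runs without a GPU stack; the helper names `pos_weights` and `sample_weights`, the toy targets, and the "rarest label" weighting rule are assumptions for illustration, not something prescribed by PyTorch.

```python
def pos_weights(targets):
    """Per-label ratio of negative to positive examples.

    targets: list of multi-hot label lists, one per image.
    """
    n = len(targets)
    n_labels = len(targets[0])
    weights = []
    for j in range(n_labels):
        pos = sum(row[j] for row in targets)
        weights.append((n - pos) / max(pos, 1))  # guard against labels with no positives
    return weights


def sample_weights(targets):
    """Weight each image by the inverse frequency of its rarest positive label."""
    n_labels = len(targets[0])
    counts = [max(sum(row[j] for row in targets), 1) for j in range(n_labels)]
    weights = []
    for row in targets:
        present = [counts[j] for j in range(n_labels) if row[j]]
        rarest = min(present) if present else len(targets)  # images with no labels get the lowest weight
        weights.append(1.0 / rarest)
    return weights


# Toy data: label 0 appears in 6 images, label 1 in only 2.
targets = [[1, 0]] * 6 + [[0, 1]] * 2

pw = pos_weights(targets)      # one weight per label
sw = sample_weights(targets)   # one weight per image
print(pw, sw)
```

In a PyTorch training script, `pw` would typically be wrapped in a tensor and passed as `pos_weight` to `nn.BCEWithLogitsLoss`, and `sw` would be passed as `weights` to `torch.utils.data.WeightedRandomSampler` (with `replacement=True`), which is then given to the DataLoader through its `sampler` argument.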
In many training datasets, the class distribution is skewed, i.e. one class is very frequent (e.g. 95%), while another is rare (e.g. 5%). In such applications, the naive classifier that assigns all test cases to the majority class achieves very high accuracy. However, the naive classifier does not predict any cases of the minority class, although it is often more important than the majority class.
So are there any ways to improve the accuracy on the minority class just by modifying the training dataset? Or do we have to modify the classification algorithm as well?
This is the problem of imbalanced classification. Below are the methods used to treat imbalanced datasets.
1. Undersampling
2. Oversampling
3. Synthetic Data Generation
4. Cost-Sensitive Learning
1. Undersampling
This method works with the majority class. It reduces the number of observations from the majority class to make the data set balanced. This method is best to use when the data set is huge and reducing the number of training samples helps to improve run time and storage troubles.
Undersampling methods are of 2 types: Random and Informative.
2. Oversampling
This method works with the minority class. It replicates observations from the minority class to balance the data. It is also known as upsampling. Similar to undersampling, this method can also be divided into two types: Random Oversampling and Informative Oversampling.
3. Synthetic Data Generation
In simple words, instead of replicating and adding observations from the minority class, it overcomes imbalances by generating artificial data. It is also a type of oversampling technique.
With regard to synthetic data generation, the synthetic minority oversampling technique (SMOTE) is a powerful and widely used method. The SMOTE algorithm creates artificial data based on feature space (rather than data space) similarities between minority samples. We can also say that it generates a random set of minority class observations to shift the classifier's learning bias towards the minority class.
4. Cost-Sensitive Learning (CSL)
It is another commonly used method to handle classification problems with imbalanced data. It’s an interesting method. In simple words, this method evaluates the cost associated with misclassifying observations.
It does not create a balanced data distribution. Instead, it addresses the imbalanced learning problem by using cost matrices, which describe the cost of each kind of misclassification in a particular scenario. Research has shown that cost-sensitive learning has often outperformed sampling methods. Therefore, this method provides a likely alternative to sampling methods.
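One simple way to act on a cost matrix is to move the decision threshold of a probabilistic classifier. The sketch below assumes correct predictions cost nothing, so that predicting "positive" has the lower expected cost whenever (1 - p) * cost_fp < p * cost_fn, i.e. p > cost_fp / (cost_fp + cost_fn). The function names and the cost values are made up for this example, and this is only one form of cost-sensitive learning (applied at decision time rather than during training).

```python
def cost_sensitive_threshold(cost_fp, cost_fn):
    """Probability above which predicting 'positive' minimises expected cost.

    Assumes true positives and true negatives have zero cost.
    """
    return cost_fp / (cost_fp + cost_fn)


def predict(probabilities, cost_fp, cost_fn):
    """Turn predicted positive-class probabilities into 0/1 decisions."""
    threshold = cost_sensitive_threshold(cost_fp, cost_fn)
    return [1 if p >= threshold else 0 for p in probabilities]


probs = [0.05, 0.2, 0.6]

# Equal costs give the usual 0.5 cut-off.
equal_cost = predict(probs, cost_fp=1, cost_fn=1)

# Missing a rare positive is assumed to be 9 times worse than a false alarm,
# which lowers the cut-off to 0.1.
skewed_cost = predict(probs, cost_fp=1, cost_fn=9)
print(equal_cost, skewed_cost)
```

With the skewed costs the second case (p = 0.2) is flagged as positive even though it is well below 0.5, which is the behaviour one usually wants for a rare but expensive class.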
To build these in R, refer to https://www.analyticsvidhya.com/blog/2016/03/practical-guide-deal-imbalanced-classification-problems/
There are many ways. You can upsample the minority class, downsample the majority class, weight the loss per class, or use some other measures. As you've tagged your post as related to TF, I suggest you look at Importance Sampling for Keras, http://idiap.ch/~katharas/importance-sampling/ which has some ready-to-use wrappers that you put around your model.
I am currently trying to use satellite imagery to recognize apple orchards, and I am facing a small problem with the number of representative samples for each class.
In fact my question is:
Is it possible to randomly draw a different subset of images from my "not-apples" class at each epoch? I have many more of these than of the "apples" class, and I want to increase the probability that my network will correctly reject an unrepresentative image.
Thanks in advance for your help
Keras does not do this for you out of the box. By default, it shuffles your training data and then trains on it in mini-batches. However, there are still ways to re-balance your dataset.
The imbalanced training data problem that you are facing is pretty common. You have many options available to you; I list a few below:
You can adjust the relative weights of your classes using class_weight keyword of the model.fit() function.
You can "up-sample" your "apples" class or "down-sample" your "non-apples" class to have equal numbers of both classes during training.
You can generate synthetic images of your "apples" class to augment your data set. To this end, the ImageDataGenerator class in Keras can be particularly useful. This Keras tutorial is a good introduction to its usage.
In my experience, I've found #2 and #3 to be most useful. #1 is limited by the fact that the convergence of stochastic gradient descent suffers when using class weights differing by a couple orders of magnitude and smaller batch sizes.
Jason Brownlee has put together a list of tactics for dealing with imbalanced classes that might also be useful to you.
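For the class_weight option above, the weights are often chosen inversely proportional to class frequency. Here is a minimal sketch, assuming the common "balanced" rule n_samples / (n_classes * class_count); the helper name and the 900/100 label counts are invented for the example.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical labels: 0 = "not-apples" (900 images), 1 = "apples" (100 images).
labels = [0] * 900 + [1] * 100
class_weight = balanced_class_weights(labels)
print(class_weight)
```

The resulting dictionary has the shape that `model.fit(..., class_weight=class_weight)` expects, with integer class indices as keys. Here the weights differ by a factor of 9, which is still far from the "couple orders of magnitude" range where the convergence problems mentioned above tend to appear.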
How should I set my gamma and Cost parameters in libSVM when I am using an imbalanced dataset that consists of 75% 'true' labels and 25% 'false' labels? I'm getting a constant error of having all the predicted labels set on 'True' due to the data imbalance.
If the issue isn't with libSVM, but with my dataset, how should I handle this imbalance from a Theoretical Machine Learning standpoint? *The number of features I'm using is between 4-10 and I have a small set of 250 data points.
Class imbalance has nothing to do with the selection of C and gamma. To deal with this issue you should use a class weighting scheme, which is available in, for example, the scikit-learn package (built on libsvm).
The best C and gamma are selected using grid search with cross-validation. You should try a wide range of values here. For C it is reasonable to choose values between 1 and 10^15, while a simple and good heuristic for the range of gamma is to compute the pairwise distances between all your data points and select gamma according to the percentiles of this distribution. Think of placing a Gaussian with variance equal to 1/gamma at each point: if you select a gamma such that this Gaussian overlaps with many points you will get a very "smooth" model, while a small variance leads to overfitting.
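The gamma heuristic can be sketched as follows. This is one possible reading of it, under the assumption that each candidate is 1 / d², where d² is a percentile of the squared pairwise Euclidean distances; the function names, the nearest-rank percentile rule, and the toy points are choices made for this example.

```python
import math
from itertools import combinations


def pairwise_sq_distances(points):
    """Sorted squared Euclidean distances between all pairs of points."""
    return sorted(
        sum((a - b) ** 2 for a, b in zip(p, q))
        for p, q in combinations(points, 2)
    )


def percentile(sorted_values, q):
    """Nearest-rank percentile of an already sorted list, q in (0, 100]."""
    rank = max(1, math.ceil(q / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def gamma_candidates(points, percentiles=(10, 50, 90)):
    """One gamma per percentile: small distances give large gamma and vice versa."""
    d2 = pairwise_sq_distances(points)
    values = [percentile(d2, q) for q in percentiles]
    return [1.0 / v for v in values if v > 0]  # skip duplicate points (distance 0)


points = [(0, 0), (1, 0), (0, 2), (3, 3)]
gammas = gamma_candidates(points)
print(gammas)
```

The candidates from the low percentiles give narrow kernels (risk of overfitting), those from the high percentiles give wide, smooth ones. The resulting list is what would go into the grid search together with the range of C values.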
Imbalanced data sets can be tackled in various ways. Class balance has no effect on kernel parameters such as gamma for the RBF kernel.
The two most popular approaches are:
Use different misclassification penalties per class, this basically means changing C. Typically the smallest class gets weighed higher, a common approach is npos * wpos = nneg * wneg. LIBSVM allows you to do this using its -wX flags.
Subsample the overrepresented class to obtain an equal amount of positives and negatives and proceed with training as you traditionally would for a balanced set. Take note that you basically ignore a large chunk of data this way, which is intuitively a bad idea.
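As a worked instance of the weighting rule npos * wpos = nneg * wneg, take the 75% / 25% split from the question and (as an assumption for this example) treat 'true' as the positive class, with n the total number of data points:

```latex
n_{\text{pos}}\, w_{\text{pos}} = n_{\text{neg}}\, w_{\text{neg}}
\quad\Longrightarrow\quad
\frac{w_{\text{neg}}}{w_{\text{pos}}}
  = \frac{n_{\text{pos}}}{n_{\text{neg}}}
  = \frac{0.75\, n}{0.25\, n}
  = 3
```

So with a weight of 1 on the 'true' class, the 'false' class would get a weight of 3, i.e. its misclassification penalty becomes 3C.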
I know this has been asked some time ago, but I would like to answer it since you might find my answer useful.
As others have mentioned, you might want to consider using different weights for the minority classes or using different misclassification penalties. However, there is a more clever way of dealing with the imbalanced datasets.
You can use the SMOTE (Synthetic Minority Over-sampling Technique) algorithm to generate synthesized data for the minority class. It is a simple algorithm that can deal with some imbalanced datasets pretty well.
In each iteration of the algorithm, SMOTE takes an instance of the minority class and one of its nearest minority-class neighbours, and adds an artificial example of the same class somewhere on the line segment between them. The algorithm keeps injecting such samples into the dataset until the two classes become balanced or some other criterion is met (e.g. a certain number of examples has been added).
[Figure: SMOTE applied to a simple dataset in a 2D feature space]
Associating weight with the minority class is a special case of this algorithm. When you associate weight $w_i$ with instance i, you are basically adding the extra $w_i - 1$ instances on top of the instance i!
What you need to do is to augment your initial dataset with the samples created by this algorithm, and train the SVM with this new dataset. You can also find many implementations online in different languages like Python and Matlab.
There have been other extensions of this algorithm, I can point you to more materials if you want.
To test the classifier you need to split the dataset into test and train, add synthetic instances to the train set (DO NOT ADD ANY TO THE TEST SET), train the model on the train set, and finally test it on the test set. If you include the generated instances when you are testing, you will end up with a biased (and ridiculously high) accuracy and recall.
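The interpolation step and the "train set only" rule can be sketched like this. It is a simplified illustration of the SMOTE idea, not a reference implementation: the function name `smote`, the default k = 3, and the toy points are assumptions for this example.

```python
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each interpolated between a random
    minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)

    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: sq_dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        t = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + t * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Training split only; the test split is left untouched.
minority_train = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
majority_train = [[5.0 + i, 5.0] for i in range(10)]

synthetic = smote(minority_train, n_new=len(majority_train) - len(minority_train))

train_X = majority_train + minority_train + synthetic
train_y = [0] * len(majority_train) + [1] * (len(minority_train) + len(synthetic))
print(len(train_X), train_y.count(0), train_y.count(1))
```

The SVM would then be trained on `train_X` and `train_y` and evaluated on the original, unaugmented test split.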
I'm working on a text classification problem, and I have problems with missing values on some features.
I'm calculating class probabilities of words from labeled training data.
For example;
Say the word foo belongs to class A 100 times and to class B 200 times. In this case, I compute the class probability vector as [0.33, 0.67] and give it, along with the word itself, to the classifier.
The problem is that, in the test set, there are some words that have not been seen in the training data, so they have no probability vectors.
What could I do about this problem?
I've tried using the average class probability vector of all words for the missing values, but it did not improve accuracy.
Is there a way to make the classifier ignore some features during evaluation, just for the specific instances which do not have a value for the given feature?
Regards
There are many ways to achieve that.
1. Create and train classifiers for all subsets of the features you have. You can train each of them on the same data used to train the main classifier.
For each sample, just look at the features it has and use the classifier that fits it best. Don't try to do boosting with those classifiers.
2. Just create a special class for samples that can't be classified, or for which your experiments gave results that are too poor with so few features.
Sometimes humans can't successfully classify samples either. In many cases, samples that can't be classified should just be ignored. The problem is not in the classifier but in the input, or can be explained by the context.
3. From an NLP point of view, many words have a meaning/usage that is very similar across many applications. So you can use stemming/lemmatization to create classes of words.
You can also use syntactic corrections, synonyms, and translations (does the word come from another part of the world?).
If this problem is important enough to you, you will end up with a combination of the 3 previous points.
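The stemming idea can be sketched as a lookup with a fallback: if a test word was never seen, try the probability vector pooled over all training words that share its stem. In this sketch `crude_stem` is a hypothetical stand-in for a real stemmer or lemmatizer, and the counts for "testing" are made up; only the "foo" counts come from the question.

```python
def class_probabilities(class_counts):
    """Turn per-key class counts into per-key class probability vectors."""
    probs = {}
    for key, counts in class_counts.items():
        total = sum(counts)
        probs[key] = [c / total for c in counts]
    return probs


def crude_stem(word):
    """Hypothetical stand-in for a real stemmer: lowercase, strip a few suffixes."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def pool_counts_by_stem(word_class_counts):
    """Add up the class counts of all words that share a stem."""
    pooled = {}
    for word, counts in word_class_counts.items():
        stem = crude_stem(word)
        previous = pooled.get(stem, [0] * len(counts))
        pooled[stem] = [a + b for a, b in zip(previous, counts)]
    return pooled


def lookup(word, word_probs, stem_probs):
    """Exact word first, then its stem; None means 'no information'."""
    if word in word_probs:
        return word_probs[word]
    return stem_probs.get(crude_stem(word))


# Counts per word as [class A, class B].
word_class_counts = {"foo": [100, 200], "testing": [30, 10]}
word_probs = class_probabilities(word_class_counts)
stem_probs = class_probabilities(pool_counts_by_stem(word_class_counts))

seen = lookup("foo", word_probs, stem_probs)
unseen_with_known_stem = lookup("tests", word_probs, stem_probs)
unknown = lookup("bar", word_probs, stem_probs)
print(seen, unseen_with_known_stem, unknown)
```

Words that still come back as `None` are the ones that point 2 would route to a separate "cannot classify" class, or that point 1 would hand to a classifier trained without this feature.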