For example, I have a classification problem with 2 classes, but the data is skewed (the samples of the two classes are in a 1:10 ratio).
How can I handle imbalanced data using SVM?
I found no parameter for weighting the different classes (does OpenCV really have no parameter for this?).
It has a class_weights parameter in CvSVMParams::CvSVMParams:
class_weights – Optional weights in the C_SVC problem, assigned to particular classes. They are multiplied by C so that the parameter C of class #i becomes class_weights_i * C. Thus these weights affect the misclassification penalty for the different classes: the larger the weight, the larger the penalty on misclassification of data from the corresponding class.
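For reference, here is a minimal sketch using the newer cv2.ml Python bindings, where the equivalent knob is setClassWeights; the toy data and the 10:1 weighting are assumptions matching the question's class ratio:

import numpy as np
import cv2

# Toy data with a 1:10 class ratio (class 0 is the minority).
samples = np.random.rand(110, 2).astype(np.float32)
labels = np.array([0] * 10 + [1] * 100, dtype=np.int32)

svm = cv2.ml.SVM_create()
svm.setType(cv2.ml.SVM_C_SVC)
svm.setKernel(cv2.ml.SVM_RBF)
svm.setC(1.0)
# One weight per class: misclassifying the minority class is penalised
# ~10x more heavily, since C_i = class_weights_i * C.
svm.setClassWeights(np.array([10.0, 1.0]))
svm.train(samples, cv2.ml.ROW_SAMPLE, labels)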
Hi, I came across documentation in PyTorch that implements the cross-entropy loss function in two ways:
import torch
import torch.nn as nn

# Example of target with class indices
loss = nn.CrossEntropyLoss()
input = torch.randn(3, 5, requires_grad=True)
target = torch.empty(3, dtype=torch.long).random_(5)
output = loss(input, target)
output.backward()
# Example of target with class probabilities
input = torch.randn(3, 5, requires_grad=True)
target = torch.randn(3, 5).softmax(dim=1)
output = loss(input, target)
output.backward()
One method uses the target as a probability vector over the classes, and the other uses class indices (effectively one-hot vectors). To me, the implementation with class probabilities is closer to the definition of the loss function, but in most places I have seen the other method. Can someone clarify the difference between these methods?
Thanks
I think there are two things worth explaining. From what I understood, you are asking for the following:
the difference between soft and hard labels,
and the difference between the one-hot and the dense encoding of hard labels.
Just to be clear: a probability distribution determines how probability is distributed over the values of a random variable (here the values correspond to the different classes of your classification task). A soft label represents the distribution itself, while a hard label represents the value of the random variable that has the highest probability (i.e. the most probable class).
If you're applying the loss function against a target that represents a probability distribution (its values are positive and sum to one), then this would most likely correspond to a pseudo-label. In practice, this means you are applying a soft cross-entropy loss and explicitly supervising the whole distribution.
Now, hard labels are what you can expect when using ground-truth annotations. You can choose to represent a label in one of two ways (this is called the encoding of the label):
either with a one-hot encoding vector: all 0s, and a single 1 for the true class;
or with a dense representation, which is simply the index of the true class.
With hard labels, only the probability of the true class (the class given by the ground-truth annotation) is explicitly supervised: you are pushing to maximize the probability mass it is predicted with, and implicitly minimizing the mass of all other classes, since the total mass is finite (i.e. equal to 1).
In PyTorch, nn.CrossEntropyLoss historically expected dense labels (class indices) as the target; since v1.10 it also accepts probability targets, which is what your second example uses. TensorFlow's implementation, on the other hand, takes targets as one-hot (or soft) vectors. This lets you apply the function not only with one-hot encodings (as intended for classical classification tasks), but also with soft targets.
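As a small check of the equivalence (PyTorch >= 1.10; the tensor shapes are arbitrary): a dense target and its one-hot encoding yield the same loss value, and soft targets simply generalise the one-hot case to arbitrary distributions.

import torch
import torch.nn as nn
import torch.nn.functional as F

loss = nn.CrossEntropyLoss()
logits = torch.randn(3, 5)
dense = torch.tensor([1, 0, 4])                    # hard labels as class indices
one_hot = F.one_hot(dense, num_classes=5).float()  # the same labels, one-hot encoded

print(loss(logits, dense))    # dense (index) target
print(loss(logits, one_hot))  # probability target; identical value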
I am working on a classification problem with progressive classes. In other words, there is a hierarchy of categories such that A < B < C, e.g. low, medium, high.
What loss function and activation function for the output layer should I use to take advantage of the class hierarchy?
My ideas are:
1) Assign some value to each category, use one output unit with a sigmoid activation and an RMSE loss function, and then map each class to an interval, e.g. 0-0.33 → class A, 0.33-0.66 → class B, 0.66-1 → class C.
It seems to do the trick, but it can favor the extreme categories over the middle ones.
2) Use K softmax output units, integer labels instead of one-hot encoding, and the sparse categorical cross-entropy loss function.
In this case I am not sure how exactly sparse categorical cross-entropy works and whether it really takes the hierarchy into account.
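For what it's worth, a minimal Keras sketch of idea 2 (the layer sizes and toy data are assumptions): sparse categorical cross-entropy is ordinary cross-entropy that simply accepts integer labels instead of one-hot vectors, so by itself it does not take the class order into account.

import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(3, activation='softmax'),  # K = 3 classes: A < B < C
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

X = np.random.rand(100, 8).astype('float32')
y = np.random.randint(0, 3, size=(100,))  # integer labels 0, 1, 2 (not one-hot)
model.fit(X, y, epochs=1, verbose=0)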
Let's suppose we have a highly imbalanced binary classification problem in place.
Now, XGBoost provides us with 2 options to manage class imbalance during training: one is the scale_pos_weight parameter, and the other is the weight argument of the DMatrix.
For example, I can either use
params = {'scale_pos_weight': some_value}
or give per-instance weights while creating the DMatrix, like
dtrain = xgb.DMatrix(features, label=target, weight=weights)
Can somebody please explain the difference between these 2 cases? And how do the resulting scores differ?
The Difference
As explained in the (Python) documentation, scale_pos_weight is a float (i.e. a single value) that controls the balance of positive and negative weights, i.e. it tunes the model's overall tendency to predict positive or negative across the entire dataset.
DMatrix's weight argument requires an array-like object and is used to specify a "Weight for each instance". This allows more control over how the classifier makes its predictions, as each weight is used to scale the loss function that is being optimised.
An Example
To make this difference more concrete, imagine we are trying to predict the presence of cats in pictures under two scenarios.
Scenario One
In this scenario, our dataset consists of images either with a cat or with no animal at all. The dataset is imbalanced, with most images having no animal. Here we might use scale_pos_weight to increase the weighting of positive (with cats) images to deal with the imbalance.
In general, we tend to set scale_pos_weight proportionally to the imbalance. For example, if 20% of the images contain a cat, we would set scale_pos_weight to 4 (the negative-to-positive ratio, 80/20). (Of course, this hyperparameter should be set empirically, e.g. using cross-validation, but this is a sensible initial/default value.)
Scenario Two
In this scenario, our dataset is again imbalanced, with similar proportions of cat vs. no-cat images. However, this time it also includes some images with a dog. Our classifier may tend to mistake dogs for cats, decreasing its performance and raising the false positive rate. In this instance, we may wish to specify per-sample weights using DMatrix's weight argument, so as to penalise dog-related false positives specifically; this would not be possible with a single factor applied uniformly to all positive examples (see the sketch below).
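A hedged sketch of both options with the xgboost training API (the data and ratios are made up for illustration):

import numpy as np
import xgboost as xgb

X = np.random.rand(1000, 5)
y = (np.random.rand(1000) < 0.2).astype(int)  # ~20% positives

# Option 1: one global factor for all positive instances.
dtrain = xgb.DMatrix(X, label=y)
params = {'objective': 'binary:logistic',
          'scale_pos_weight': (y == 0).sum() / (y == 1).sum()}
bst1 = xgb.train(params, dtrain, num_boost_round=10)

# Option 2: a weight per instance; here we simply up-weight positives,
# but any per-sample scheme (e.g. down-weighting the confusing 'dog'
# negatives) is possible.
w = np.where(y == 1, 4.0, 1.0)
dtrain_w = xgb.DMatrix(X, label=y, weight=w)
bst2 = xgb.train({'objective': 'binary:logistic'}, dtrain_w, num_boost_round=10)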
I have a training set where the input vectors are speed, acceleration and turn angle change. The output is a crisp class: an activity state from the given set {rest, walk, run}, e.g. input vector [3.1 1.2 2] → run; [2.1 1 1] → walk, and so on.
I am using Weka to develop a neural network model. I define the outputs as crisp (or rather qualitative, in words: categorical values). After training, the model classifies the test data fairly well.
I was wondering how the internal process (the mapping function) takes place. Are the qualitative output states assigned some nominal value inside the model and, after processing, converted back to categorical data? A NN cannot map float input values to categorical data through hidden neurons directly, so what is actually happening, even though the model works fine?
If the model converts the categorical outputs into nominal ones and then starts processing, on what basis does it convert a categorical value into some arbitrary numerical value?
Yes, categorical values are usually converted to numbers, and the network learns to associate input data with these numbers. However, these numbers are often further encoded, so as not to use only a single output neuron. The most common way to do this for unordered labels is to add one dummy output neuron per category and use 1-of-C encoding, with 0.1 and 0.9 as target values. The output is then interpreted using the winner-take-all paradigm.
Using only one neuron and encoding categories with different numbers often leads to problems for unordered labels, as the network will treat the middle categories as "averages" of the boundary categories. This may, however, sometimes be desired if you have ordered categorical data.
You can find a very good explanation of this issue in this part of the online Neural Network FAQ.
The neural net's computations all take place on continuous values. To do multiclass classification with discrete output, its final layer produces a vector of such values, one for each class. To make a discrete class prediction, take the index of the maximum element in that vector.
So if the final layer in a classification network for four classes predicts [0 -1 2 1], then the third element of the vector is the largest and the third class is selected. Often, these values are also constrained to form a probability distribution by means of a softmax activation function.
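The same computation in numpy, for concreteness: softmax turns the raw scores into a probability distribution, and argmax picks the predicted class.

import numpy as np

scores = np.array([0.0, -1.0, 2.0, 1.0])       # final-layer outputs, one per class
probs = np.exp(scores) / np.exp(scores).sum()  # softmax
print(probs)           # [0.087 0.032 0.644 0.237] (rounded)
print(probs.argmax())  # 2 -> the third class is selected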
How should I set my gamma and cost parameters in libSVM when I am using an imbalanced dataset consisting of 75% 'true' labels and 25% 'false' labels? I keep getting all predicted labels set to 'True' due to the data imbalance.
If the issue isn't with libSVM but with my dataset, how should I handle this imbalance from a theoretical machine learning standpoint? (The number of features I'm using is between 4 and 10, and I have a small set of 250 data points.)
Class imbalance has nothing to do with the selection of C and gamma. To deal with this issue you should use a class weighting scheme, which is available for example in the scikit-learn package (built on libsvm).
Selection of the best C and gamma is performed using grid search with cross-validation. You should try a vast range of values here; for C it is reasonable to choose values between 1 and 10^15, while a simple and good heuristic for the gamma range is to compute the pairwise distances between all your data points and select gamma according to percentiles of this distribution. Think of placing at each point a Gaussian distribution with variance equal to 1/gamma: if you select a gamma such that this distribution overlaps many points, you will get a very "smooth" model, while using a small variance leads to overfitting.
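A sketch of this suggestion with scikit-learn (the grids are illustrative and much smaller than the full ranges mentioned above): class weighting handles the imbalance, while the grid search picks C and gamma by cross-validation.

import numpy as np
from scipy.spatial.distance import pdist
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X = np.random.rand(250, 6)
y = (np.random.rand(250) < 0.75).astype(int)  # ~75% 'true' labels

# Percentiles of the pairwise distances guide the gamma grid
# (variance ~ distance^2, and gamma ~ 1 / variance).
d = np.percentile(pdist(X), [10, 50, 90])
gammas = 1.0 / d ** 2

grid = GridSearchCV(
    SVC(kernel='rbf', class_weight='balanced'),  # weights inversely proportional to class frequency
    param_grid={'C': np.logspace(0, 6, 7), 'gamma': gammas},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)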
Imbalanced data sets can be tackled in various ways. Class balance has no effect on kernel parameters such as gamma for the RBF kernel.
The two most popular approaches are:
Use different misclassification penalties per class; this basically means changing C. Typically the smallest class gets weighted higher, and a common approach is npos * wpos = nneg * wneg. LIBSVM allows you to do this using its -wX flags (see the sketch after this list).
Subsample the overrepresented class to obtain an equal number of positives and negatives, and proceed with training as you traditionally would for a balanced set. Note that you basically ignore a large chunk of data this way, which is intuitively a bad idea.
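As a sketch of the first option through scikit-learn's libsvm wrapper (the toy data is an assumption; LIBSVM's -w0/-w1 flags set the same per-class factors directly):

import numpy as np
from sklearn.svm import SVC

X = np.random.rand(250, 6)
y = (np.random.rand(250) < 0.75).astype(int)
npos, nneg = (y == 1).sum(), (y == 0).sum()

# npos * wpos = nneg * wneg: the smaller (negative) class gets the larger weight.
clf = SVC(kernel='rbf', class_weight={1: 1.0, 0: npos / nneg})
clf.fit(X, y)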
I know this has been asked some time ago, but I would like to answer it since you might find my answer useful.
As others have mentioned, you might want to consider using different weights for the minority classes or using different misclassification penalties. However, there is a more clever way of dealing with the imbalanced datasets.
You can use the SMOTE (Synthetic Minority Over-sampling Technique) algorithm to generate synthesized data for the minority class. It is a simple algorithm that can deal with some imbalanced datasets pretty well.
In each iteration, SMOTE takes a random instance of the minority class and one of its nearest minority-class neighbours, and adds an artificial example of the same class somewhere on the line segment between them. The algorithm keeps injecting such samples into the dataset until the two classes are balanced or some other criterion is met (e.g. a certain number of examples has been added).
[Figure: SMOTE interpolating new minority-class samples in a simple 2D feature space.]
Associating a weight with the minority class is a special case of this idea: when you associate weight $w_i$ with instance $i$, you are effectively adding $w_i - 1$ extra copies of instance $i$!
What you need to do is augment your initial dataset with the samples created by this algorithm and train the SVM on this new dataset. You can also find many implementations online, in different languages such as Python and Matlab.
There have been other extensions of this algorithm, I can point you to more materials if you want.
To test the classifier you need to split the dataset into train and test sets, add synthetic instances to the train set (DO NOT ADD ANY TO THE TEST SET), train the model on the train set, and finally evaluate it on the test set. If you include the generated instances when testing, you will end up with biased (and unrealistically high) accuracy and recall.
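A hedged sketch with the imbalanced-learn package, following exactly this procedure: split first, then oversample only the training set.

import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X = np.random.rand(500, 4)
y = (np.random.rand(500) < 0.1).astype(int)  # roughly 1:10 imbalance

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Synthetic minority samples are added to the TRAIN set only.
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)

clf = SVC().fit(X_res, y_res)
print(clf.score(X_test, y_test))  # evaluated on untouched, original test data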