renormalizing class weights for imbalanced data - machine-learning

I have an imbalanced data set for training a CNN.
I want to calculate class weights based on the frequency of each label, such that less frequent labels are boosted in the back-propagation term and end up well represented.
What I did so far:
I have a list A with the frequency of each label:
A = [1009, 2910, 4014, 152, 605]
So I did the following:
class_weights_new = 1/(A/np.min(A))
This produced a list of weights that scale the learning down in proportion to the frequency of the label, to keep any one label from being over-learned relative to the others.
Now I have two questions regarding the matter:
Is there something wrong with my logic? Am I missing something?
So far this calculation has produced worse performance, and I want to smooth the weights so that they still carry some imbalance. I mean that the ratio between the labels should stay roughly the same, but all the weights should move closer to 1.
What is the mathematical operation that gives such a result?
Thanks!

The most common weight calculation is to normalize the counts so they sum to 1:
class_weights = np.array(A) / np.sum(A)
That puts the values on a proper scale.
Your approach also works; as you can see below, the high-frequency classes get low weights:
import numpy as np
import matplotlib.pyplot as plt

# label frequencies
A = np.array([1009, 2910, 4014, 152, 605])

# inverse-frequency weights: the rarest label gets weight 1, the rest get less
class_weights_new = 1 / (A / np.min(A))

plt.plot(A)
plt.plot(class_weights_new * 4000)  # rescaled so both curves fit on one axis
plt.legend(['freq', 'weights'])
plt.show()

print(class_weights_new)
You can use scikit-learn to compute class weight too: https://scikit-learn.org/stable/modules/generated/sklearn.utils.class_weight.compute_class_weight.html
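For example, a minimal sketch using scikit-learn's 'balanced' heuristic (the label vector y here is rebuilt from the counts in A purely for illustration):

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# hypothetical integer label vector reconstructed from the counts in A
A = [1009, 2910, 4014, 152, 605]
y = np.repeat(np.arange(len(A)), A)

# 'balanced' gives n_samples / (n_classes * count) for each class
weights = compute_class_weight(class_weight='balanced', classes=np.unique(y), y=y)
print(dict(zip(np.unique(y), np.round(weights, 2))))

With this heuristic, rare classes get weights above 1 and frequent classes get weights below 1, while the relative ordering of the classes is preserved.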

Related

SMOTE oversampling for anomaly detection using a classifier

I have sensor data and I want to do live anomaly detection: I use LOF on the training set to detect anomalies, and then feed the labeled data to a classifier to classify new data points. I thought about using SMOTE because I want more anomaly points in the training data to overcome the imbalanced classification problem, but the issue is that SMOTE creates many points that lie inside the normal range.
How can I do oversampling without creating samples in the normal data range?
(figure: the data before applying SMOTE)
(figure: the data after SMOTE)
SMOTE linearly interpolates synthetic points between a minority-class sample and its k nearest minority-class neighbors. This means that you're going to end up with points between a sample and its neighbors. When the samples are scattered all over the place like this, it makes sense that you're going to create synthetic points in the middle.
SMOTE should really be used to identify more specific regions in the feature space as the decision region for the minority class. This doesn't seem to be your use case. You want to know which points "don't belong," per se.
This seems like a fairly nice use case for DBSCAN, a density-based clustering algorithm that will identify points beyond some distance, eps, as not belonging to the same neighborhood.
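A minimal sketch of that idea, assuming a 2-D numpy array X of sensor readings (the eps and min_samples values are placeholders you would tune):

import numpy as np
from sklearn.cluster import DBSCAN

# stand-in feature matrix: a dense "normal" region plus a few scattered points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(500, 2)),
               rng.uniform(-6, 6, size=(15, 2))])

# points farther than eps from any dense neighborhood get the noise label -1
labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(X)
outliers = X[labels == -1]
print(len(outliers), "points flagged as not belonging to any cluster")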

How can I make Weka classify the smaller class, with a 2:1 class imbalance?

How can I make Weka classify the smaller class? I have a data set where the positive class is 35% of the data and the negative class is 65%. I want Weka to predict the positive class, but in some cases the resulting model predicts every instance as negative. Either way, it keeps favoring the negative (larger) class. How can I force it to classify the positive (smaller) class?
One simple solution is to adjust your training set to be more balanced (50% positive, 50% negative) to encourage classification for both cases. I would guess that more of your cases are negative in the problem space, and therefore you would need to find some way to ensure that the negative cases still represent the problem well.
Since the ratio of positive to negative is 1:2, you could also try duplicating the positive cases in the training set to make it 2:2 and see how that goes.
Use stratified sampling (e.g. train on a 50%/50% sample) or class weights/class priors. It would also help greatly if you told us which specific classifier you are using; Weka seems to have at least 50.
Is the penalty for Type I errors equal to the penalty for Type II errors? This is a special case of receiver operating characteristic (ROC) analysis.
If the penalties are not equal, experiment with the cutoff value and the AUC.
You probably also want to read the sister site Cross Validated for the statistics side.
Use CostSensitiveClassifier, which is available under the "meta" classifiers.
You will need to set its "classifier" to your J48 and (!) change the cost matrix
to something like [(0,1), (2,0)]. This tells J48 that misclassifying a positive instance is twice as costly as misclassifying a negative instance. Of course, you should adjust the cost matrix according to your business values.
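To see what such a cost matrix does to the decision rule, here is an illustrative Python sketch (not Weka code), assuming rows index the actual class and columns the predicted class: with costs [(0,1), (2,0)] the minimum-expected-cost prediction flips to positive once P(positive) exceeds 1/3 rather than 1/2.

import numpy as np

# cost[actual][predicted]: a false negative costs 2, a false positive costs 1
cost = np.array([[0, 1],
                 [2, 0]])

def min_cost_prediction(p_positive, cost):
    # expected cost of predicting [negative, positive] given class probabilities
    p = np.array([1 - p_positive, p_positive])
    expected_cost = p @ cost
    return int(np.argmin(expected_cost))

for p in (0.25, 0.40, 0.60):
    print(p, "->", "positive" if min_cost_prediction(p, cost) else "negative")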

One-class Support Vector Machine Sensitivity Drops when the number of training sample increase

I am using a One-Class SVM for outlier detection. It appears that as the number of training samples increases, the sensitivity TP/(TP+FN) of the One-Class SVM drops, while the classification rate and specificity both increase.
What's the best way of explaining this relationship in terms of the hyperplane and the support vectors?
Thanks
The more training examples you have, the fewer true positives your classifier is able to detect.
It means that the new data does not fit well with the model you are training.
Here is a simple example.
Below you have two classes, and we can easily separate them using a linear kernel.
The sensitivity of the blue class is 1.
As I add more yellow training data near the decision boundary, the generated hyperplane can't fit the data as well as before.
As a consequence, we now see that there are two misclassified blue data points.
The sensitivity of the blue class is now 0.92.
As the number of training points increases, the support vectors generate a somewhat less optimal hyperplane, perhaps because the extra data turns a linearly separable data set into one that is no longer linearly separable. In such a case, trying a different kernel, such as an RBF kernel, can help.
EDIT: more information about the RBF kernel:
In this video you can see what happens with an RBF kernel.
The same logic applies: if the training data is not easily separable in n dimensions, you will get worse results.
You should try to select a better C using cross-validation.
In this paper, Figure 3 illustrates that the results can be worse if C is not properly selected:
"More training data could hurt if we did not pick a proper C. We need to
cross-validate on the correct C to produce good results."
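A minimal sketch of that cross-validation step with scikit-learn, assuming labeled arrays X and y (the C/gamma grid and the synthetic data are purely illustrative; note that sklearn's OneClassSVM exposes nu and gamma rather than C, so you would adapt the grid accordingly):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# stand-in imbalanced data; replace with your own X, y
X, y = make_classification(n_samples=500, n_features=10, weights=[0.9, 0.1],
                           random_state=0)

# 5-fold cross-validated search over C (and gamma for the RBF kernel)
param_grid = {'C': [0.01, 0.1, 1, 10, 100], 'gamma': ['scale', 0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel='rbf'), param_grid, scoring='recall', cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)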

Features scaling and mean normalization in a sparse matrix

Is it a good idea to perform feature scaling and mean normalization on a sparse matrix? I have a matrix that is 70% sparse. Usually, feature scaling and mean normalization improve algorithm performance, but in the case of a sparse matrix they add a lot of non-zero terms.
If it's important that the representation be sparse, in order to fit into memory for example, then you can't mean-normalize in the representation itself, no. It becomes completely dense and defeats the purpose.
Usually you push around the mean normalization math into another part of the formula or computation. Or you can just do the normalization as you access the elements, having previously computed the mean and variance.
Or you can pick an algorithm that doesn't need normalization, if possible.
If you are using scikit-learn you can do it as below:
from sklearn.preprocessing import StandardScaler

# with_mean=False skips centering, so the sparse structure is preserved
scaler = StandardScaler(with_mean=False)
scaler.fit(data)
Setting with_mean=False skips subtracting the mean, so the matrix stays sparse while still being scaled to unit variance, as described in the StandardScaler documentation here.
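A small usage sketch with a synthetic SciPy sparse matrix (the shape and density are arbitrary):

import scipy.sparse as sp
from sklearn.preprocessing import StandardScaler

# random CSR matrix, roughly 30% non-zero entries
data = sp.random(1000, 50, density=0.3, format='csr', random_state=0)

scaler = StandardScaler(with_mean=False)     # scale to unit variance, no centering
data_scaled = scaler.fit_transform(data)

print(type(data_scaled), data_scaled.nnz)    # still sparse, same number of non-zeros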

Precision/recall for multiclass-multilabel classification

I'm wondering how to calculate precision and recall measures for multiclass multilabel classification, i.e. classification where there are more than two labels, and where each instance can have multiple labels?
For multi-label classification you have two ways to go
First consider the following.
$n$ is the number of examples.
$Y_i$ is the ground-truth label assignment of the $i^{th}$ example.
$x_i$ is the $i^{th}$ example.
$h(x_i)$ is the set of labels predicted for the $i^{th}$ example.
Example based
The metrics are computed per data point: for each example, a score is computed over its predicted and true label sets, and these scores are then averaged over all the data points.
Precision = $\frac{1}{n}\sum_{i=1}^{n}\frac{|Y_i \cap h(x_i)|}{|h(x_i)|}$, the ratio of how much of the predicted set is correct. The numerator counts how many predicted labels are also in the ground truth, so the ratio says what fraction of the predicted labels is actually correct.
Recall = $\frac{1}{n}\sum_{i=1}^{n}\frac{|Y_i \cap h(x_i)|}{|Y_i|}$, the ratio of how many of the actual labels were predicted. The numerator again counts the labels shared between the prediction and the ground truth (as above), but the ratio is taken against the number of actual labels, giving the fraction of the actual labels that was predicted.
There are other metrics as well.
Label based
Here things are done label-wise: for each label the metrics (e.g. precision, recall) are computed and then these label-wise metrics are aggregated. Hence, in this case you end up computing the precision/recall for each label over the entire dataset, as you would for a binary classification problem (each label has a binary assignment), and then aggregate.
The easy way is to present the general form.
This is just an extension of the standard multi-class equivalent.
Macro averaged: $\frac{1}{q}\sum_{j=1}^{q} B(TP_j, FP_j, TN_j, FN_j)$
Micro averaged: $B\left(\sum_{j=1}^{q} TP_j, \sum_{j=1}^{q} FP_j, \sum_{j=1}^{q} TN_j, \sum_{j=1}^{q} FN_j\right)$
Here $TP_j$, $FP_j$, $TN_j$, $FN_j$ are the true positive, false positive, true negative and false negative counts for the $j^{th}$ label only, and $q$ is the number of labels.
Here $B$ stands for any confusion-matrix based metric; in your case you would plug in the standard precision and recall formulas. For the macro average you apply the metric to each label's counts and then average over the labels; for the micro average you pool the counts over all labels first and then apply the metric.
You might be interested to have a look at the code for the multi-label metrics here, which is part of the mldr package in R. You might also want to look at the Java multi-label library MULAN.
This is a nice paper to get into the different metrics: A Review on Multi-Label Learning Algorithms
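As a rough sketch of these averages with scikit-learn (the binary indicator matrices below are made up for illustration):

import numpy as np
from sklearn.metrics import precision_score, recall_score

# binary indicator matrices: rows = examples, columns = labels
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 1],
                   [1, 0, 0],
                   [0, 0, 1]])

# 'samples' is the example-based variant, 'macro'/'micro' the label-based ones
for avg in ('micro', 'macro', 'samples'):
    p = precision_score(y_true, y_pred, average=avg, zero_division=0)
    r = recall_score(y_true, y_pred, average=avg, zero_division=0)
    print(avg, "precision:", round(p, 2), "recall:", round(r, 2))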
The answer is that you have to compute precision and recall for each class, then average them together. E.g. if you have classes A, B, and C, then your precision is:
(precision(A) + precision(B) + precision(C)) / 3
Same for recall.
I'm no expert, but this is what I have determined based on the following sources:
https://list.scms.waikato.ac.nz/pipermail/wekalist/2011-March/051575.html
http://stats.stackexchange.com/questions/21551/how-to-compute-precision-recall-for-multiclass-multilabel-classification
Let us assume that we have a 3-class classification problem with labels A, B and C.
The first thing to do is to generate a confusion matrix. Note that the values on the diagonal are always the true positives (TP).
Now, to compute recall for label A you can read the values off the confusion matrix and compute:
Recall_A = TP_A / (TP_A + FN_A)
         = TP_A / (total gold labels for A)
Now, to compute precision for label A, you again read the values off the confusion matrix:
Precision_A = TP_A / (TP_A + FP_A)
            = TP_A / (total predicted as A)
You just need to do the same for the remaining labels B and C. This applies to any multi-class classification problem.
Here is the full article that talks about how to compute precision and recall for any multi-class classification problem, including examples.
In Python, using sklearn and numpy:
from sklearn.metrics import confusion_matrix
import numpy as np

labels = ...       # true labels
predictions = ...  # predicted labels

cm = confusion_matrix(labels, predictions)
recall = np.diag(cm) / np.sum(cm, axis=1)     # per-class recall: TP / row sum
precision = np.diag(cm) / np.sum(cm, axis=0)  # per-class precision: TP / column sum
Simple averaging will do if the classes are balanced.
Otherwise, recall for each real class needs to be weighted by the prevalence of that class, and precision for each predicted label needs to be weighted by the bias (the probability of predicting that label); either way you get Rand Accuracy.
A more direct way is to make a normalized contingency table (divide by N so the table sums to 1 over all combinations of label and class) and add up the diagonal to get Rand Accuracy.
But if the classes aren't balanced, the bias remains and a chance-corrected measure such as kappa is more appropriate, or better still ROC analysis or a chance-corrected measure such as informedness (the height above the chance line in ROC).
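A small numeric sketch of that claim (the labels and predictions are made up): prevalence-weighted recall comes out equal to the diagonal sum of the normalized contingency table, i.e. Rand Accuracy.

import numpy as np
from sklearn.metrics import confusion_matrix

# made-up imbalanced labels and predictions for illustration
labels      = np.array([0]*50 + [1]*30 + [2]*20)
predictions = np.array([0]*45 + [1]*5 + [1]*25 + [2]*5 + [2]*15 + [0]*5)

cm = confusion_matrix(labels, predictions)
recall = np.diag(cm) / cm.sum(axis=1)
prevalence = cm.sum(axis=1) / cm.sum()

weighted_recall = np.sum(prevalence * recall)   # recall weighted by class prevalence
rand_accuracy = np.trace(cm / cm.sum())         # diagonal of the normalized table
print(weighted_recall, rand_accuracy)           # both are 0.85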
