When we evaluate a classifier in WEKA, for example a 2-class classifier, it gives us three f-measures: the f-measure for class 1, the f-measure for class 2, and the weighted f-measure.
I'm so confused! I thought the f-measure was a balanced measure that summarizes performance across multiple classes, so what do the f-measures for class 1 and class 2 mean?
The f-score (or f-measure) is calculated based on the precision and recall. The calculation is as follows:
Precision = t_p / (t_p + f_p)
Recall = t_p / (t_p + f_n)
F-score = 2 * Precision * Recall / (Precision + Recall)
Where t_p is the number of true positives, f_p the number of false positives and f_n the number of false negatives. Precision is defined as the fraction of elements correctly classified as positive out of all the elements the algorithm classified as positive, whereas recall is the fraction of elements correctly classified as positive out of all the positive elements.
In the multiclass case, each class i has its own precision and recall, where a "true positive" is an element predicted to be in class i that really is in it, and a "true negative" is an element predicted not to be in class i that really isn't in it.
Thus, with this new definition of precision and recall, each class can have its own f-score by doing the same calculation as in the binary case. This is what Weka's showing you.
The weighted f-score is a weighted average of the classes' f-scores, weighted by the proportion of how many elements are in each class.
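If you want to reproduce those numbers outside Weka, here is a minimal sketch using scikit-learn (the labels below are made up for illustration): average=None gives one f-score per class, and average='weighted' gives the class-frequency-weighted average.
from sklearn.metrics import f1_score

# toy ground truth and predictions for a 2-class problem (values are illustrative)
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 0, 1, 1]

print(f1_score(y_true, y_pred, average=None))        # one f-score per class
print(f1_score(y_true, y_pred, average='weighted'))  # averaged, weighted by class frequency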
I am confused too.
I used the same equation for the f-score of each class, based on its precision and recall, but the results are different!
For example, my hand-computed f-score is different from Weka's calculation.
I have an imbalanced dataset with 43323 rows, of which 9 belong to the 'failure' class and the rest to the 'normal' class. I trained a classifier with 100% recall and 94.89% AUC on the test data (0.75/0.25 split with stratify=y). However, the classifier has 0.18% precision and a 0.37% F1 score. I assumed I could find a better F1 score by changing the threshold, but I failed (I checked thresholds between 0 and 1 with step = 0.01). It also seems weird to me, because usually when dealing with an imbalanced dataset it is hard to get a high recall. The goal is to get a better F1 score. What can I do next? Thanks!
(To be clear, I used SMOTE to upsample the failure samples in the training dataset.)
Getting 100% recall is trivial in fact: just classify everything as 1.
Is the precision/recall curve any good? Perhaps a more thorough scan could yield a better result:
import numpy as np
from sklearn.metrics import precision_recall_curve

probabilities = model.predict_proba(X_test)[:, 1]  # probability of the positive class
precision, recall, thresholds = precision_recall_curve(y_test, probabilities)
# precision and recall have one more entry than thresholds, so drop the last point
f1_scores = 2 * recall[:-1] * precision[:-1] / (recall[:-1] + precision[:-1])
best_f1 = np.nanmax(f1_scores)
best_thresh = thresholds[np.nanargmax(f1_scores)]
According to the PyTorch doc of nn.BCEWithLogitsLoss, pos_weight is an optional argument that takes the weight of positive examples. I don't fully understand the statement "pos_weight > 1 increases recall and pos_weight < 1 increases precision" on that page. How do you understand this statement?
The binary cross-entropy with logits loss (nn.BCEWithLogitsLoss, equivalent to F.binary_cross_entropy_with_logits) is a sigmoid layer (nn.Sigmoid) followed by a binary cross-entropy loss (nn.BCELoss). The general case assumes you are in a multi-label classification task, i.e. a single input can be labeled with multiple classes. One common sub-case is having a single class: the binary classification task. Define q as your tensor of predicted logits and p as the ground truth in [0, 1], the true probability for each class.
The explicit formulation for the binary cross-entropy would be:
z = torch.sigmoid(q)
loss = -(w_p*p*torch.log(z) + (1-p)*torch.log(1-z))
introducing w_p, the weight associated with the positive (true) label for each class. Read this post for more details on the weighting scheme used by BCELoss.
For a given class:
precision = TP / (TP + FP)
recall = TP / (TP + FN)
Then if w_p > 1, the loss puts more weight on the positive label, so the model is pushed toward classifying as true more often. This tends to reduce false negatives (FN), increasing recall, but it also tends to increase false positives (FP), thus decreasing precision. Similarly, if w_p < 1, we are decreasing the weight on the positive label, so the model predicts positive less often; this tends to increase false negatives (FN), which decreases recall, while reducing false positives (FP), which increases precision.
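As a quick numerical check of the formula above (a minimal sketch; the logit and weight values are made up), you can compare nn.BCEWithLogitsLoss with and without pos_weight on a single misclassified positive example and see that the penalty scales by w_p, which is exactly what nudges the model toward predicting positives more often:
import torch
import torch.nn as nn

q = torch.tensor([-1.0])   # logit for an example the model leans "negative" on
p = torch.tensor([1.0])    # but the true label is positive
w_p = torch.tensor([4.0])  # pos_weight > 1

plain = nn.BCEWithLogitsLoss()(q, p)
weighted = nn.BCEWithLogitsLoss(pos_weight=w_p)(q, p)

# manual formula from above, for comparison
z = torch.sigmoid(q)
manual = -(w_p * p * torch.log(z) + (1 - p) * torch.log(1 - z))

print(plain.item(), weighted.item(), manual.item())  # weighted ≈ 4 * plain ≈ manual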
I understand that the F1 score is more important if false positives/false negatives are crucial for judging a classifier. I read on a site that "F1 Score is the weighted average of Precision and Recall; therefore, this score takes both false positives and false negatives into account. Intuitively it is not as easy to understand as accuracy, but F1 is usually more useful than accuracy, especially if you have an uneven class distribution". The fact that the F1 score is more suitable for uneven or unbalanced classes was also stated on other sites, but what is the reason for this?
Let's say you have 1000 examples of class A and 100 examples of class B, and you use accuracy as the evaluation metric, where
Accuracy = (correct predictions for class A + correct predictions for class B) / total predictions
Say that out of the 1000 class A examples, 950 are predicted correctly, and out of the 100 class B examples, only 10 are predicted correctly. Then, per the accuracy formula,
Accuracy = (950 class A correct predictions + 10 class B correct predictions) / 1100
Accuracy = 960 / 1100 = 0.8727 (about 87%)
In this imbalanced case we get 87% accuracy, which looks good, but notice that for class B we only predicted 10 records out of 100 correctly. Our model cannot really predict class B, yet the accuracy metric says the model is very good (87% accuracy).
So for this case we use the f1-score, which handles the evaluation of imbalanced problems better.
F1 = 2 * (precision * recall) / (precision + recall)
The f1-score takes precision and recall into consideration, hence it is important to evaluate a model with the f1-score when the data is imbalanced. If you still want to use accuracy as a metric, use it class-wise, i.e. report accuracy for class A and accuracy for class B separately.
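To see this numerically (a minimal sketch with synthetic labels matching the counts above, assuming scikit-learn is available), overall accuracy looks high while the per-class recall and f1-score expose the problem with class B:
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# 1000 class-A (0) and 100 class-B (1) examples; 950 and 10 predicted correctly
y_true = np.array([0] * 1000 + [1] * 100)
y_pred = np.array([0] * 950 + [1] * 50 + [0] * 90 + [1] * 10)

print(accuracy_score(y_true, y_pred))              # ~0.87, looks fine
print(recall_score(y_true, y_pred, average=None))  # per-class recall: [0.95, 0.10]
print(f1_score(y_true, y_pred, average=None))      # per-class f1 shows class B failing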
I use a CatBoostClassifier and my classes are highly imbalanced. I applied the scale_pos_weight parameter to account for that. While training with an evaluation dataset (test), CatBoost shows a high precision on test. However, when I make predictions on test using the predict method, I only get a low precision score (calculated using sklearn.metrics).
I think this might be related to the class weights that I applied. However, I don't quite understand how the precision score is affected by this.
import numpy as np
from frozendict import frozendict

params = frozendict({
'task_type': 'CPU',
'loss_function': 'Logloss',
'eval_metric': 'F1',
'custom_metric': ['F1', 'Precision', 'Recall'],
'iterations': 100,
'random_seed': 20190128,
'scale_pos_weight': 56.88657244809081,
'learning_rate': 0.5412829495147387,
'depth': 7,
'l2_leaf_reg': 9.526905230698302
})
from catboost import CatBoostClassifier
model = CatBoostClassifier(**params)
model.fit(
X_train, y_train,
cat_features=np.where(X_train.dtypes == object)[0],
eval_set=(X_test, y_test),
verbose=False,
plot=True
)
model.get_best_score()
{'learn': {'Recall': 0.9243007537531925,
'Logloss': 0.15892360013680026,
'F1': 0.9416723809244181,
'Precision': 0.9640191600545249},
'validation_0': {'Recall': 0.914252301192093,
'Logloss': 0.1714387314107052,
'F1': 0.9357892623978286,
'Precision': 0.9642642597943112}}
y_test_pred = model.predict(data=X_test)
from sklearn.metrics import balanced_accuracy_score, recall_score, precision_score, f1_score
print('Balanced accuracy: {:.2f}'.format(balanced_accuracy_score(y_test, y_test_pred)))
print('Precision: {:.2f}'.format(precision_score(y_test, y_test_pred)))
print('Recall: {:.2f}'.format(recall_score(y_test, y_test_pred)))
print('F1: {:.2f}'.format(f1_score(y_test, y_test_pred)))
Balanced accuracy: 0.94
Precision: 0.29
Recall: 0.91
F1: 0.44
I expected to get the same precision as CatBoost show while training, however, it's not so. What am I doing wrong?
By default use_weights is set to True for the evaluation metrics, which means the class weights are applied when computing them, e.g. Precision:use_weights=True.
To make your own precision calculation match CatBoost's, change it to Precision:use_weights=False.
Also, get_best_score gives the highest score over the iterations, while predict uses the final model, so you need to make sure the same iteration is used for prediction. You can set use_best_model=True in model.fit to automatically choose the best iteration.
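For example (a sketch based on the metric-parameter syntax mentioned above; the parameter values are taken from the question), you could request unweighted metrics directly so that the values reported during training are comparable to what sklearn.metrics computes:
from catboost import CatBoostClassifier

model = CatBoostClassifier(
    loss_function='Logloss',
    eval_metric='F1:use_weights=False',
    custom_metric=['F1:use_weights=False',
                   'Precision:use_weights=False',
                   'Recall:use_weights=False'],
    scale_pos_weight=56.88657244809081,
    iterations=100,
)
With that, model.get_best_score() should report precision and recall computed without the class weights, the same way sklearn.metrics does.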
The predict function uses a standard threshold of 0.5 to convert the predicted probabilities into a binary value. When you are dealing with an imbalanced problem, a threshold of 0.5 is not always the best value, which is why you are getting poor precision on the test set.
In order to find a better threshold, CatBoost has some methods that help you do so, like get_roc_curve, get_fpr_curve and get_fnr_curve. These three methods help you visualize the true positive, false positive and false negative rates as the prediction threshold changes.
Besides these visualization methods, CatBoost has a method called select_threshold which gives you the threshold that optimizes one of these curves.
You can check this on their documentation.
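A rough sketch of how that could look (assuming the utilities live in catboost.utils as described in the docs, that select_threshold returns a probability boundary you can compare predict_proba against, and picking an arbitrary 5% false-negative-rate target just for illustration):
import numpy as np
from catboost import Pool
from catboost.utils import select_threshold

eval_pool = Pool(X_test, y_test,
                 cat_features=np.where(X_test.dtypes == object)[0])

# pick the threshold that keeps the false-negative rate around 5% (illustrative target)
threshold = select_threshold(model=model, data=eval_pool, FNR=0.05)

probs = model.predict_proba(X_test)[:, 1]
y_test_pred = (probs > threshold).astype(int)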
In addition to setting use_best_model=True, ensure that the class balance in both datasets is the same, or use balanced accuracy metrics to account for a different class balance.
If you've done both of these and you still see much worse accuracy metrics on the test set than on the train set, it is a sign of overfitting. I'd recommend taking advantage of CatBoost's overfitting detector. The most common first step is to set early_stopping_rounds to an integer like 10, which stops training once no improvement in the selected loss function has been achieved for that number of training rounds (see the early_stopping_rounds documentation).
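In code, that combination could look roughly like this (a sketch reusing params, X_train, y_train, X_test and y_test from the question; 10 is just a common starting value):
import numpy as np
from catboost import CatBoostClassifier

model = CatBoostClassifier(**params)
model.fit(
    X_train, y_train,
    cat_features=np.where(X_train.dtypes == object)[0],
    eval_set=(X_test, y_test),
    use_best_model=True,        # keep the iteration that scored best on eval_set
    early_stopping_rounds=10,   # stop once the eval metric stops improving for 10 rounds
    verbose=False,
)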
Consider the below scenario:
I have batches of data whose features and labels have similar distribution.
Say something like 4000000 negative labels and 25000 positive labels
As its a highly imbalanced set, I have undersampled the negative labels so that my training set (taken from one of the batch) now contains 25000 positive labels and 500000 negative labels.
Now I am trying to measure the precision and recall from a test set after training (generated from a different batch)
I am using XGBoost with 30 estimators.
Now if I use all 40000000 negative labels, I get a worse precision-recall result (0.1 precision and 0.1 recall at a 0.7 threshold) than if I use a subset of just 500000 negative labels (0.4 precision with 0.1 recall at a 0.3 threshold).
What could be a potential reason that this could happen?
A few thoughts I had:
The features of the 500000 negative labels are vastly different from the rest in the overall 40000000 negative labels.
But when I plot the individual features, their central tendencies closely match with the subset.
Are there any other ways to identify why I get a lower, worse precision-recall when the number of negative labels increases so much?
Are there any ways to compare the distributions?
Is my undersampled training a cause for this?
To understand this, we first need to understand how precision and recall are calculated. For this I will use the following variables:
P - total number of positives
N - total number of negatives
TP - number of true positives
TN - number of true negatives
FP - number of false positives
FN - number of false negatives
It is important to note that:
P = TP + FN
N = TN + FP
Now, precision is TP/(TP + FP)
recall is TP/(TP + FN), therefore TP/P.
Accuracy is (TP + TN)/(TP + TN + FP + FN), hence (TP + TN)/(P + N).
In your case, where the data is imbalanced, we have that N >> P.
Now imagine some random model. We can usually say that such a model has an accuracy of around 50%, but that holds only if the data is balanced. In your case, there will tend to be more FPs and TNs than TPs and FNs, because a random selection of the data is more likely to return a negative sample.
So we can establish that the higher the proportion of negative samples N/(P + N), the more FPs and TNs we get. That is, whenever your model is not able to select the correct label, it will pick an essentially random label out of P and N, and that will mostly be N.
Recall that FP appears in the denominator of precision: this means that precision also decreases with increasing N/(P + N).
Recall, on the other hand, contains neither FP nor TN in its derivation, so it will likely not change much with increasing N/(P + N). As can be seen in your example, it clearly stays the same.
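Here is a small simulation illustrating this (a sketch with a made-up classifier that has a fixed true-positive rate and false-positive rate per sample, just to isolate the effect of adding more negatives):
import numpy as np

rng = np.random.default_rng(0)
tpr, fpr = 0.8, 0.01   # assumed fixed per-sample behaviour of the classifier
P = 25_000             # number of positives, as in the question

for N in (500_000, 4_000_000):
    TP = rng.binomial(P, tpr)
    FP = rng.binomial(N, fpr)
    FN = P - TP
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    print(f"N={N:>9,}  precision={precision:.3f}  recall={recall:.3f}")

# precision drops as N grows (more FP for the same false-positive rate), recall stays about the same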
Therefore, I would try to make the data more balanced to get better results. A ratio of 1:1.5 should do.
You can also use a different metric like the F1 score that combines precision and recall to get a better understanding of the performance.
Also check some of the other points made here on how to combat imbalanced data.