Looking at a project, and I don't understand why both classes have a recall score when recall only involves the positive class ('Converted' is the positive class).
Code for the confusion matrix:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import classification_report, confusion_matrix

def metrics_score(actual, predicted):
    # Per-class precision, recall and F1
    print(classification_report(actual, predicted))
    # Confusion matrix heatmap: rows are actual labels, columns are predicted
    cm = confusion_matrix(actual, predicted)
    plt.figure(figsize=(8, 5))
    sns.heatmap(cm, annot=True, fmt='.2f',
                xticklabels=['Not Converted', 'Converted'],
                yticklabels=['Not Converted', 'Converted'])
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.show()
[Confusion matrix plot]
Which recall score is the correct measure for my model? I believe it is 86%, but then the precision score would be 93%. That doesn't make sense, as the model is doing better on recall, given there are fewer false negatives than false positives (57 vs 224, accounting for the ratio difference of having more 'Not Converted' cases than 'Converted' cases). What am I getting wrong here?
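For context on how the report arrives at two recall values, here is a minimal sketch with made-up labels: classification_report computes a recall for each class by treating that class as the positive one in turn, while recall_score with pos_label lets you pick a single class.

import numpy as np
from sklearn.metrics import classification_report, recall_score

# Hypothetical labels: 0 = Not Converted, 1 = Converted
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 1])

# One recall per row: each class is treated as "positive" in turn
print(classification_report(y_true, y_pred, target_names=['Not Converted', 'Converted']))

print(recall_score(y_true, y_pred, pos_label=1))  # TP / (TP + FN) for 'Converted'
print(recall_score(y_true, y_pred, pos_label=0))  # recall of 'Not Converted', i.e. TN / (TN + FP)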
Related
I have an imbalanced dataset with 43,323 rows, of which only 9 belong to the 'failure' class; the rest belong to the 'normal' class. I trained a classifier with 100% recall and 94.89% AUC on the test data (0.75/0.25 split with stratify=y). However, the classifier has 0.18% precision and 0.37% F1 score. I assumed I could find a better F1 score by changing the threshold, but I failed (I checked thresholds between 0 and 1 with step = 0.01). Also, it seems weird to me, since usually when dealing with an imbalanced dataset it is hard to get a high recall. The goal is to get a better F1 score. What can I do next? Thanks!
(To be clear, I used SMOTE to upsample the failure samples in the training dataset.)
Getting 100% recall is in fact trivial: just classify everything as 1.
Is the precision/recall curve any good? Perhaps a more thorough scan could yield a better result:
import numpy as np
from sklearn.metrics import precision_recall_curve

# Probability of the positive class
probabilities = model.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, probabilities)
# precision and recall have one more entry than thresholds, so drop the last point
f1_scores = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1])
best_f1 = np.nanmax(f1_scores)
best_thresh = thresholds[np.nanargmax(f1_scores)]
I'm running CatBoostClassifier on an imbalanced dataset (binary classification), optimizing Logloss with F1 as the evaluation metric. The resulting plot shows different results for F1:use_weights = True and F1:use_weights = False, and both differ from the scores I get on the training and validation predictions.
params = {
    'iterations': 500,
    'learning_rate': 0.2,
    'eval_metric': 'F1',
    'loss_function': 'Logloss',
    'custom_metric': ['F1', 'Precision', 'Recall'],
    'scale_pos_weight': 19,
    'use_best_model': True,
    'max_depth': 8
}
modelcat = CatBoostClassifier(**params)
modelcat.fit(
    train_pool,
    eval_set=validation_pool,
    verbose=False,
    plot=True
)
When I predict on the validation and training sets and check the F1 score using sklearn's f1_score, I get these scores:
ypredcat0 = modelcat.predict(valX_cat0)  # validation predictions
print(f"F1: {f1_score(y_val, ypredcat0)}")
F1: 0.4163473818646233
ytrainpredcat0 = modelcat.predict(trainX_cat0)  # training predictions
print(f"F1: {f1_score(y_train, ytrainpredcat0)}")
F1: 0.42536905412793874
But when I look at the plot created by plot=True, I see different convergence scores:
[plot with use_weights = False]
[plot with use_weights = True]
In the plots, the training F1 has clearly reached a score of 1, but when making predictions it is only 0.42. Why the difference? And how is use_weights working here?
Okay, I figured out an answer. The difference lies in how the F1 score is calculated, depending on the averaging method. By default, for binary classification scikit-learn uses average='binary', so the binary F1 score is 0.42. When I changed it to average='macro' it gave an F1 score of 0.67, which is what CatBoost shows with use_weights = False. With average='micro' it gave an F1 score of 0.88, even higher than what the plot shows, but in any case that answers both of my questions.
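For reference, a minimal sketch of the averaging options, reusing the validation predictions from the code above (the exact numbers are of course specific to my data):

from sklearn.metrics import f1_score

print(f1_score(y_val, ypredcat0, average='binary'))  # F1 of the positive class only
print(f1_score(y_val, ypredcat0, average='macro'))   # unweighted mean of the per-class F1 scores
print(f1_score(y_val, ypredcat0, average='micro'))   # computed from global TP/FP/FN counts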
My anomaly detection algorithm gave me an array of scores, where all values greater than 0 should belong to the positive class (labeled 0) and all the others should be classified as anomalies (labeled 1). I built my classifier as follows (I have three datasets: one with only normal values and two with only anomalous values):
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve, RocCurveDisplay

normal = np.load('normal_score.pkl')  # scores for known-normal samples
anom_1 = np.load('anom1_score.pkl')   # scores for known-anomalous samples
anom_2 = np.load('anom2_score.pkl')

y_normal = np.asarray([0] * len(normal))  # I know they are normal
y_anom_1 = np.asarray([1] * len(anom_1))  # I know they are anomalies
y_anom_2 = np.asarray([1] * len(anom_2))  # I know they are anomalies

score = np.concatenate([normal, anom_1, anom_2])
y = np.concatenate([y_normal, y_anom_1, y_anom_2])

auc = roc_auc_score(y, score)
fpr, tpr, thresholds = roc_curve(y, score)
display = RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=auc)
display.plot()
The AUC score I get is 0.02, and the plot looks like this:
From what I understand, this result is great because I should just reverse the labels to make it almost 0.98, but my question is: is there a way to specify this and automatically reverse it through a function?
The values in my normal score data are all in the range (21, 57) and the anomaly values are in the range (-1836, -1090), so it should be easy to separate them.
"I should just reverse the labels to make it almost 0.98"
That's not how it should be done, because being able to predict "normal" with, say, 95% confidence does not mean you can also predict "anomaly" with the same confidence.
This becomes crucial with heavily imbalanced data, which is probably the case here.
You should decide which of the two classes you want to predict with high confidence and what the target prediction metrics are. For example, if you have a precision and recall target for predicting "anomaly", then that should be your class 1 and you should calculate the metrics accordingly, and vice versa.
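To make this concrete, here is a minimal sketch under the assumption that "anomaly" is the class you care about: negate the score so that higher means "more anomalous", then evaluate precision and recall for class 1 at a chosen cut-off (the threshold of 0 below is a hypothetical value sitting between the two score ranges, not something prescribed by a library).

import numpy as np
from sklearn.metrics import roc_auc_score, precision_score, recall_score

anomaly_score = -score  # raw scores are high for normal points, so flip the sign
print(roc_auc_score(y, anomaly_score))  # now close to 0.98 instead of 0.02

y_pred = (anomaly_score > 0).astype(int)  # hypothetical cut-off between the two ranges
print(precision_score(y, y_pred, pos_label=1))
print(recall_score(y, y_pred, pos_label=1))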
Say I'm building an ML model to predict whether a patient has the flu or not. I know that, on average, only 2 out of 100 patients in the population have the flu.
Usually, to estimate a model's accuracy, I just calculate what percentage of new data the model labels correctly:
accuracy rate = (correctly identified patients / total number of patients)
But in this case, I can write a model that labels all patients as not having the flu, and it will be accurate 98% of the time.
So probably the estimator should consider not only how many patients the model labeled correctly, but also how many of the sick patients it actually found, something like
accuracy rate = (correctly identified patients / total number of patients) *
                (correctly identified patients with flu / total number of patients with flu)
But this estimator has no real-world interpretation.
Is this the right way to think about it, and how would you calculate the accuracy rate of a model on such skewed data? Thanks!
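A minimal sketch of the "label everyone as healthy" baseline described above, with a made-up population of 100 patients of whom 2 have the flu:

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([1] * 2 + [0] * 98)  # 2 out of 100 patients have the flu
y_pred = np.zeros(100, dtype=int)      # a "model" that always predicts "no flu"

print(accuracy_score(y_true, y_pred))  # 0.98, despite the model being useless
print(recall_score(y_true, y_pred))    # 0.0 -- it finds none of the sick patients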
If you want a balanced model, the long answer is "it depends"; the short answer you can look into is something called the Matthews Correlation Coefficient (MCC), also known as the phi coefficient.
As you saw, accuracy is a really bad metric when facing imbalanced datasets. The MCC takes the class sizes into account and corrects for them. It delivers the same result for the same model performance, no matter what the makeup of the dataset is.
TP = Number of true positives
TN = Number of true negatives
FP = Number of false positives
FN = Number of false negatives
MCC = (TP * TN - FP * FN) / sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
MCC = 1 -> perfect prediction
MCC = 0 -> no correlation
MCC = -1 -> complete contradiction
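A minimal sketch (with made-up labels) checking this formula against scikit-learn's matthews_corrcoef:

import numpy as np
from sklearn.metrics import matthews_corrcoef

# Toy example with hypothetical labels: 0 = no flu, 1 = flu
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))
tn = np.sum((y_true == 0) & (y_pred == 0))
fp = np.sum((y_true == 0) & (y_pred == 1))
fn = np.sum((y_true == 1) & (y_pred == 0))

mcc_manual = (tp * tn - fp * fn) / np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
print(mcc_manual)                         # formula above
print(matthews_corrcoef(y_true, y_pred))  # should match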
Just from experience (in my field, therefore with a huge grain of salt): reasonable models for the companies I work with usually start around MCC >= 0.75.
I think you have to use MAP (mean average precision). And for it you need to calculate recall and precision:
Recall = (True Positive) / (True Positive + False Negative)
Precision = (True Positive) / (True Positive + False Positive)
Positive: the patient has the flu
Negative: the patient does not have the flu
True: correctly identified
False: incorrectly identified
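A quick check of these two formulas with scikit-learn, reusing the toy flu labels from the MCC sketch above (both helpers treat class 1 as the positive class by default):

import numpy as np
from sklearn.metrics import precision_score, recall_score

# Same toy flu labels as in the MCC sketch above (1 = has flu)
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])

print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 1 / 2
print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 1 / 2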
Let's assume we have a classification problem with 3 classes and highly imbalanced data. Say class 1 has 185 data points, class 2 has 199, and class 3 has 720.
For calculating the AUC on a multiclass problem, there are the macro-averaging method (giving equal weight to the classification of each label) and the micro-averaging method (considering each element of the label indicator matrix as a binary prediction), as described in the scikit-learn tutorial.
For such an imbalanced dataset, should micro-averaging or macro-averaging of the AUC be used?
I'm unsure, because with the confusion matrix shown below I get a micro-averaged AUC of 0.76 and a macro-averaged AUC of 0.55.
Since the class with the majority of the data points is classified with much higher precision, the overall precision computed with micro-averaging is going to be higher than the same quantity computed with macro-averaging.
Here,
P1 = 12 / 185 = 0.06486486
P2 = 11 / 199 = 0.05527638
P3 = 670 / 720 = 0.9305556
The overall precision with macro-averaging = (P1 + P2 + P3) / 3 = 0.3502323, which is much less than the overall precision with micro-averaging = (12 + 11 + 670) / (185 + 199 + 720) = 0.6277174.
The same holds true for the AUC.
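A small sketch of the arithmetic above, assuming the counts 12, 11 and 670 are the correctly classified points per class:

import numpy as np

correct = np.array([12, 11, 670])   # correctly classified points per class
totals = np.array([185, 199, 720])  # class sizes

per_class = correct / totals          # P1, P2, P3
macro = per_class.mean()              # equal weight per class  -> ~0.350
micro = correct.sum() / totals.sum()  # weighted by class size  -> ~0.628
print(per_class, macro, micro)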