I have an imbalanced dataset with 43,323 rows, of which only 9 belong to the 'failure' class; the rest belong to the 'normal' class. I trained a classifier with 100% recall and 94.89% AUC on the test data (0.75/0.25 split with stratify=y). However, the classifier has 0.18% precision and a 0.37% F1 score. I assumed I could find a better F1 score by changing the threshold, but I failed (I checked thresholds between 0 and 1 with step = 0.01). It also seems weird to me, since with imbalanced datasets it is usually hard to get a high recall. The goal is to get a better F1 score. What can I do as a next step? Thanks!
(To be clear, I used SMOTE to upsample the failure samples in the training dataset.)
Getting 100% recall is in fact trivial: just classify everything as 1.
Is the precision/recall curve any good? Perhaps a more thorough scan could yield a better result:
import numpy as np
from sklearn.metrics import precision_recall_curve

# use the probability of the positive class, not the full (n_samples, 2) array
probabilities = model.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, probabilities)
# F1 at every threshold; the last precision/recall pair has no threshold
f1_scores = 2 * recall[:-1] * precision[:-1] / (recall[:-1] + precision[:-1])
best_f1 = np.nanmax(f1_scores)
best_thresh = thresholds[np.nanargmax(f1_scores)]
I understand that the F1 score is more important when false positives/false negatives are crucial for judging a classifier. I read on a site that "F1 Score is the weighted average of Precision and Recall; therefore, this score takes both false positives and false negatives into account. Intuitively it is not as easy to understand as accuracy, but F1 is usually more useful than accuracy, especially if you have an uneven class distribution." The claim that the F1 score is more suitable for uneven or imbalanced classes also appears on other sites, but what is the reason for this?
Let's say you have 1000 samples of class A and 100 of class B, and you use accuracy as the evaluation metric, where
Accuracy = (correct predictions for class A + correct predictions for class B) / total predictions
Say that out of the 1000 class A samples, 950 are predicted correctly, and out of the 100 class B samples, only 10 are predicted correctly. Then, per the accuracy formula,
Accuracy = (950 class A correct predictions + 10 class B correct predictions) / 1100
Accuracy = 0.8727 (about 87%)
In this imbalanced case we get 87% accuracy, which looks good. But notice that for class B we only predicted 10 records out of 100 correctly: the model cannot really predict class B, yet the accuracy metric suggests the model is very good (87%).
For such cases we use the F1 score, which is better suited to evaluating imbalanced problems:
F1 = 2 * (precision * recall) / (precision + recall)
The F1 score takes precision and recall into consideration, which is why it is the metric to evaluate with on imbalanced data. Alternatively, if you still want to use accuracy, report it class-wise, i.e. accuracy for class A and accuracy for class B separately.
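A small illustration of the example above with scikit-learn, using the counts from this answer (class A as label 0, class B as label 1): accuracy looks high, while the F1 score of the minority class exposes the problem.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# 1000 samples of class A (label 0), 100 of class B (label 1)
y_true = np.array([0] * 1000 + [1] * 100)
# 950 of class A and only 10 of class B are predicted correctly
y_pred = np.array([0] * 950 + [1] * 50 + [1] * 10 + [0] * 90)

print(accuracy_score(y_true, y_pred))         # about 0.87, looks good
print(f1_score(y_true, y_pred, pos_label=1))  # F1 for the minority class, far lower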
I have 38 variables, like oxygen, temperature, pressure, etc., and the task is to determine the total yield produced every day from these variables. When I calculate the regression coefficients and intercept, they seem abnormal and very high (impractical). For example, the 'temperature' coefficient came out as +375.456, and I cannot give that a meaning by saying that a one-unit increase in temperature would increase yield by 375.456 g; that is impractical in my scenario. However, the prediction accuracy seems right. I would like to know how to interpret this huge intercept (-5341.27355) and the huge beta values shown below. One other important point: I removed multicollinear columns, and I am not scaling/normalizing the variables because I need the beta coefficients to keep their meaning, so that I could say an increase in temperature by one unit increases yield by 10 g or so. Your inputs are highly appreciated!
modl.intercept_
Out[375]: -5341.27354961415
modl.coef_
Out[376]:
array([ 1.38096017e+00, -7.62388829e+00, 5.64611255e+00, 2.26124164e-01,
4.21908571e-01, 4.50695302e-01, -8.15167717e-01, 1.82390184e+00,
-3.32849969e+02, 3.31942553e+02, 3.58830763e+02, -2.05076898e-01,
-3.06404757e+02, 7.86012402e+00, 3.21339318e+02, -7.00817205e-01,
-1.09676321e+04, 1.91481734e+00, 6.02929848e+01, 8.33731416e+00,
-6.23433431e+01, -1.88442804e+00, 6.86526274e+00, -6.76103795e+01,
-1.11406021e+02, 2.48270706e+02, 2.94836048e+01, 1.00279016e+02,
1.42906659e-02, -2.13019683e-03, -6.71427100e+02, -2.03158515e+02,
9.32094007e-03, 5.56457014e+01, -2.91724945e+00, 4.78691176e-01,
8.78121854e+00, -4.93696073e+00])
It's very unlikely that all of these variables are linearly related to the target, so I would suggest that you have a look at simple non-linear regression techniques, such as decision trees or kernel ridge regression. These are, however, more difficult to interpret.
Going back to your issue, these high weights may well be due to a high amount of correlation between the variables, or to simply not having very much training data.
If you use Lasso regression instead of plain linear regression, the solution is biased away from high regression coefficients, and the fit will likely improve as well.
A small example on how to do this in scikit-learn, including cross validation of the regularization hyper-parameter:
import numpy as np
from sklearn.linear_model import LassoCV

# Make up some data
n_samples = 100
n_features = 5
X = np.random.random((n_samples, n_features))
# Make y linearly dependent on the features
y = np.sum(np.random.random((1, n_features)) * X, axis=1)

model = LassoCV(cv=5, n_alphas=100, fit_intercept=True)
model.fit(X, y)
print(model.intercept_)
If you have a linear regression, the formula looks like this (y = target, x = feature inputs):
y = x1*b1 + x2*b2 + x3*b3 + x4*b4 + ... + c
where b1, b2, b3, b4, ... are your modl.coef_ and c is the intercept. As you already noticed, one of your biggest coefficients is 3.319e+02 ≈ 332, and the intercept is also quite large at about -5341.
As you already mentioned, a coefficient tells you how much the target variable changes if that feature changes by one unit while all other features are held constant.
So, for your interpretation: the larger the absolute coefficient, the larger that feature's influence on the prediction. It is also worth noting that the model uses many large coefficients, which means it does not depend on only one variable.
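A quick way to sanity-check that interpretation; a minimal sketch, assuming modl is your fitted scikit-learn LinearRegression and X_test is a NumPy feature matrix with the same 38 columns (hypothetical names):
import numpy as np

# take one observation and raise only its first feature by one unit
x = np.asarray(X_test[0], dtype=float).copy()
x_plus_one = x.copy()
x_plus_one[0] += 1.0

delta = modl.predict([x_plus_one])[0] - modl.predict([x])[0]
print(delta, modl.coef_[0])  # for a linear model these two values match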
Say, for example, a dataset contains 60% instances of the "Yes" class and 40% instances of the "No" class.
In this scenario, precision and recall for the random classifier are
Precision = 60%
Recall = 50%
Then, what will be the accuracy of the random classifier in this scenario?
Some caution is required here, since the very definition of a random classifier is somewhat ambiguous; this is best illustrated in cases of imbalanced data.
The expected accuracy of a binary classifier whose predictions are independent of the true class is
acc = P(class=0) * P(prediction=0) + P(class=1) * P(prediction=1)
where P stands for probability.
Indeed, if we stick to the intuitive definition of a random binary classifier as giving
P(prediction=0) = P(prediction=1) = 0.5
then the accuracy computed by the above formula is always 0.5, irrespective of the class distribution (i.e. the values of P(class=0) and P(class=1)).
However, in this definition, there is an implicit assumption, i.e. that our classes are balanced, each one consisting of 50% of our dataset.
This assumption (and the corresponding intuition) breaks down in cases of class imbalance: if we have a dataset where, say, 90% of samples are of class 0 (i.e. P(class=0)=0.9), then it doesn't make much sense to use the above definition of a random binary classifier; instead, we should use the percentages of the class distributions themselves as the probabilities of our random classifier, i.e.:
P(prediction=0) = P(class=0) = 0.9
P(prediction=1) = P(class=1) = 0.1
Now, plugging these values to the formula defining the accuracy, we get:
acc = P(class=0) * P(prediction=0) + P(class=1) * P(prediction=1)
= (0.9 * 0.9) + (0.1 * 0.1)
= 0.82
which is nowhere close to the naive value of 0.5...
As I already said, AFAIK there are no clear-cut definitions of a random classifier in the literature. Sometimes the "naive" random classifier (always flip a fair coin) is referred to as a "random guess" classifier, while what I have described is referred to as a "weighted guess" one, but still this is far from being accepted as a standard...
The bottom line here is the following: since the main reason for using a random classifier is as a baseline, it makes sense to do so only in relatively balanced datasets. In your case of a 60-40 balance, the result turns out to be 0.52, which is admittedly not far from the naive one of 0.5; but for highly imbalanced datasets (e.g. 90-10), the usefulness itself of the random classifier as a baseline ceases to exist, since the correct baseline has become "always predict the majority class", which here would give an accuracy of 90%, in contrast to the random classifier accuracy of just 82% (let alone the 50% accuracy of the naive approach)...
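These baselines are easy to verify numerically; here is a small sketch using scikit-learn's DummyClassifier, where 'uniform' is the naive coin flip, 'stratified' the weighted guess, and 'most_frequent' the majority-class baseline, on a made-up 90-10 dataset:
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# 90% class 0, 10% class 1
rng = np.random.default_rng(0)
y = (rng.random(100_000) < 0.1).astype(int)
X = np.zeros((len(y), 1))  # features are irrelevant for dummy classifiers

for strategy in ["uniform", "stratified", "most_frequent"]:
    clf = DummyClassifier(strategy=strategy, random_state=0).fit(X, y)
    print(strategy, accuracy_score(y, clf.predict(X)))
# roughly 0.5, 0.82 and 0.9, matching the three baselines discussed above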
As @desertnaut mentioned, if you're after a naive benchmark for your model, you're always better off using "always predict the majority class" as your benchmark, which achieves an accuracy equal to the fraction of samples in the majority class (at least as good as either a random guess or a weighted guess).
In Deepchecks (a package I maintain) we have a check that automatically compares the performance of your model to a simple model (either weighted random, majority class or simple decision tree).
from deepchecks.checks import SimpleModelComparison
from deepchecks import Dataset
SimpleModelComparison().run(Dataset(train_df, label='target'), Dataset(test_df, label='target'), model)
I am developing a segmentation neural network with only two classes, 0 and 1 (0 is the background and 1 the object that I want to find in the image). In each image, about 80% of the pixels are 1 and 20% are 0. As you can see, the dataset is unbalanced, and it makes the results misleading. My accuracy is 85% and my loss is low, but that is only because my model is good at finding the background!
I would like to base the optimizer on another metric, such as precision or recall, which is more useful in this case.
Does anyone know how to implement this?
You don't optimize precision or recall directly. You just track them as validation scores to pick the best weights. Don't mix up the loss, the optimizer and the metrics: they are not meant for the same thing.
import numpy as np
from keras import backend as K
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout

THRESHOLD = 0.5

def precision(y_true, y_pred, threshold_shift=0.5 - THRESHOLD):
    # just in case
    y_pred = K.clip(y_pred, 0, 1)
    # shifting the prediction threshold from .5 if needed
    y_pred_bin = K.round(y_pred + threshold_shift)

    tp = K.sum(K.round(y_true * y_pred_bin)) + K.epsilon()
    fp = K.sum(K.round(K.clip(y_pred_bin - y_true, 0, 1)))

    precision = tp / (tp + fp)
    return precision

def recall(y_true, y_pred, threshold_shift=0.5 - THRESHOLD):
    # just in case
    y_pred = K.clip(y_pred, 0, 1)
    # shifting the prediction threshold from .5 if needed
    y_pred_bin = K.round(y_pred + threshold_shift)

    tp = K.sum(K.round(y_true * y_pred_bin)) + K.epsilon()
    fn = K.sum(K.round(K.clip(y_true - y_pred_bin, 0, 1)))

    recall = tp / (tp + fn)
    return recall

def fbeta(y_true, y_pred, beta=2, threshold_shift=0.5 - THRESHOLD):
    # just in case
    y_pred = K.clip(y_pred, 0, 1)
    # shifting the prediction threshold from .5 if needed
    y_pred_bin = K.round(y_pred + threshold_shift)

    tp = K.sum(K.round(y_true * y_pred_bin)) + K.epsilon()
    fp = K.sum(K.round(K.clip(y_pred_bin - y_true, 0, 1)))
    fn = K.sum(K.round(K.clip(y_true - y_pred_bin, 0, 1)))  # use the thresholded predictions here too

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)

    beta_squared = beta ** 2
    return (beta_squared + 1) * (precision * recall) / (beta_squared * precision + recall)

def model_fit(X, y, X_test, y_test):
    # weight the minority class inversely to its frequency
    class_weight = {
        1: 1 / (np.sum(y) / len(y)),
        0: 1}

    np.random.seed(47)

    model = Sequential()
    model.add(Dense(1000, input_shape=(X.shape[1],)))
    model.add(Activation('relu'))
    model.add(Dropout(0.35))
    model.add(Dense(500))
    model.add(Activation('relu'))
    model.add(Dropout(0.35))
    model.add(Dense(250))
    model.add(Activation('relu'))
    model.add(Dropout(0.35))
    model.add(Dense(1))
    model.add(Activation('sigmoid'))

    model.compile(loss='binary_crossentropy', optimizer='adamax',
                  metrics=[fbeta, precision, recall])
    model.fit(X, y, validation_data=(X_test, y_test), epochs=200, batch_size=50,
              verbose=2, class_weight=class_weight)
    return model
No. To do gradient descent, you need to compute a gradient, and for that the function needs to be reasonably smooth. Precision/recall or accuracy is not a smooth function; it has only sharp edges, on which the gradient is infinite, and flat regions, on which the gradient is zero. Hence you cannot use any kind of numerical method to find the minimum of such a function; you would have to use some kind of combinatorial optimization, and that would be NP-hard.
As others have stated, precision/recall is not directly usable as a loss function. However, better proxy loss functions have been found that help with a whole family of precision/recall related functions (e.g. ROC AUC, precision at fixed recall, etc.)
The research paper Scalable Learning of Non-Decomposable Objectives covers this with a method to sidestep the combinatorial optimization by the use of certain calculated bounds, and some Tensorflow code by the authors is available at the tensorflow/models repository. Additionally, there is a followup question on StackOverflow that has an answer that adapts this into a usable Keras loss function.
Special thanks to Francois Chollet and other participants on the Keras issue thread here that turned up that research paper. You may also find that thread provides other useful insights into the problem at hand.
Having had the same problem with an unbalanced dataset, I'd suggest you use the F1 score as the metric you optimize for.
Andrew Ng teaches that having ONE metric for the model is the simplest (best?) way to train a model. If you have two metrics, like precision and recall, it's not clear which one is more important, and trying to set limits on one metric obviously impacts the other.
The F1 score combines precision and recall: it is their harmonic mean.
Keras, which I'm using, unfortunately has no built-in F1 score metric, unlike accuracy and many other Keras metrics (https://keras.io/api/metrics/).
I found an implementation of the F1 score as a Keras metric, evaluated at each epoch, at:
https://medium.com/@aakashgoel12/how-to-add-user-defined-function-get-f1-score-in-keras-metrics-3013f979ce0d
I implemented the simple function from that article, and the model now trains with the F1 score as its Keras metric. Results on the test set: accuracy went down a bit and the F1 score went up a lot.
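For reference, a minimal sketch of such an epoch-wise F1 metric (my own sketch, not necessarily the exact function from the linked article; it is computed batch-wise, so it only approximates the F1 over the full epoch):
from keras import backend as K

def f1_metric(y_true, y_pred):
    # threshold the predictions and count true positives
    y_pred = K.round(K.clip(y_pred, 0, 1))
    tp = K.sum(y_true * y_pred)
    precision = tp / (K.sum(y_pred) + K.epsilon())
    recall = tp / (K.sum(y_true) + K.epsilon())
    return 2 * precision * recall / (precision + recall + K.epsilon())

# model.compile(loss='binary_crossentropy', optimizer='adam', metrics=[f1_metric])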
I have the same problem with an unbalanced dataset for binary classification, and I want to increase the recall too. I found out that there is a built-in recall metric in tf.keras that can be used in the compile statement as follows:
from tensorflow.keras.metrics import Recall, Accuracy
model.compile(loss='binary_crossentropy' , optimizer=opt, metrics=[Accuracy(),Recall()])
The recommended approach to dealing with an unbalanced dataset like yours is to use class_weight or sample_weight. See the Model.fit API for details.
Quote:
class_weight: Optional dictionary mapping class indices (integers) to a weight (float) value, used for weighting the loss function (during training only). This can be useful to tell the model to "pay more attention" to samples from an under-represented class.
With weights that are inversely proportional to the class frequency the loss will avoid just predicting the background class.
I understand that this is not how you formulated the question but imho it is the most practical approach to the issue you are facing.
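A minimal sketch of what that could look like, assuming a compiled Keras model named model and binary 0/1 labels in y_train (hypothetical names):
import numpy as np

# class weights inversely proportional to class frequency
neg, pos = np.bincount(y_train.astype(int).ravel())
class_weight = {0: 1.0, 1: neg / pos}

model.fit(X_train, y_train, epochs=10, class_weight=class_weight)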
I think the callback and early-stopping mechanisms provide techniques that can get you as close as possible to what you want to achieve. Please read the following article by Jason Brownlee about early stopping (read to the end!):
https://machinelearningmastery.com/how-to-stop-training-deep-neural-networks-at-the-right-time-using-early-stopping/
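For instance, a minimal sketch, assuming a tf.keras model compiled with a Recall() metric so that 'val_recall' appears in the logs (X_train, y_train, X_val, y_val are placeholder names):
from tensorflow.keras.callbacks import EarlyStopping

# stop when validation recall stops improving and keep the best weights
early_stop = EarlyStopping(monitor='val_recall', mode='max',
                           patience=10, restore_best_weights=True)
model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=200, callbacks=[early_stop])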
When we evaluate a classifier in WEKA, for example a 2-class classifier, it gives us three F-measures: the F-measure for class 1, the F-measure for class 2, and the weighted F-measure.
I'm confused! I thought the F-measure is a balanced measure that shows balanced performance across multiple classes, so what do the F-measures for class 1 and class 2 mean?
The f-score (or f-measure) is calculated based on the precision and recall. The calculation is as follows:
Precision = t_p / (t_p + f_p)
Recall = t_p / (t_p + f_n)
F-score = 2 * Precision * Recall / (Precision + Recall)
Where t_p is the number of true positives, f_p the number of false positives and f_n the number of false negatives. Precision is defined as the fraction of elements correctly classified as positive out of all the elements the algorithm classified as positive, whereas recall is the fraction of elements correctly classified as positive out of all the positive elements.
In the multiclass case, each class i has its own precision and recall, where a "true positive" is an element predicted to be in class i that really is in it, and a "true negative" is an element predicted not to be in class i that indeed isn't in it.
Thus, with this new definition of precision and recall, each class can have its own f-score by doing the same calculation as in the binary case. This is what Weka's showing you.
The weighted f-score is a weighted average of the classes' f-scores, weighted by the proportion of how many elements are in each class.
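As a small illustration (with scikit-learn rather than Weka itself), here are the per-class F1 scores and their weighted average on a toy 2-class prediction:
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 0, 1, 1]

print(f1_score(y_true, y_pred, average=None))        # F1 for class 0 and for class 1
print(f1_score(y_true, y_pred, average='weighted'))  # average weighted by class support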
I am confused too.
I used the same equation for the F-score of each class, based on its precision and recall, but the results are different!
Example: my F-score differs from the Weka calculation.