I have an imbalanced dataset with 43,323 rows, of which only 9 belong to the 'failure' class; all other rows belong to the 'normal' class. I trained a classifier with 100% recall and 94.89% AUC on the test data (0.75/0.25 split with stratify=y). However, the classifier has 0.18% precision and a 0.37% F1 score. I assumed I could find a better F1 score by changing the threshold, but I failed (I checked thresholds between 0 and 1 with step = 0.01). It also seems weird to me, because usually when dealing with an imbalanced dataset it is hard to get high recall. The goal is to get a better F1 score. What can I do as the next step? Thanks!
(To be clear, I used SMOTE to upsample the failure samples in the training dataset.)
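Roughly, the setup looks like the sketch below (the classifier and the random seeds are just placeholders, not the exact model I used):

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier  # placeholder classifier
from imblearn.over_sampling import SMOTE

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# oversample the minority 'failure' class in the training set only;
# k_neighbors must be smaller than the number of minority training samples
X_res, y_res = SMOTE(k_neighbors=3, random_state=42).fit_resample(X_train, y_train)

model = RandomForestClassifier(random_state=42).fit(X_res, y_res)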
In fact, getting 100% recall is trivial: just classify everything as 1.
Is the precision/recall curve any good? Perhaps a more thorough scan could yield a better result:
import numpy as np
from sklearn.metrics import precision_recall_curve

# probability of the positive ('failure') class
probabilities = model.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, probabilities)
# precision and recall have one more entry than thresholds, so drop the last point
f1_scores = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best_f1 = np.max(f1_scores)
best_thresh = thresholds[np.argmax(f1_scores)]
I understand that the F1 score matters more when false positives/false negatives are crucial for judging a classifier. I read on a site that "F1 Score is the weighted average of Precision and Recall; therefore, this score takes both false positives and false negatives into account. Intuitively it is not as easy to understand as accuracy, but F1 is usually more useful than accuracy, especially if you have an uneven class distribution". The claim that the F1 score is more suitable for uneven or imbalanced classes appears on other sites as well, but what is the reason behind it?
Let's say you have class A with 1000 samples and class B with 100 samples, and you use accuracy as the evaluation metric,
where
Accuracy = (correct predictions for class A + correct predictions for class B) / total predictions
Let's say that out of the 1000 samples of class A, 950 are predicted correctly, and out of the 100 samples of class B, only 10 are predicted correctly.
Then, per the accuracy formula,
Accuracy = (950 + 10) / 1100 = 0.8727 (about 87%)
In this imbalanced case we get 87% accuracy, which looks good, but notice that for class B we predicted only 10 records out of 100 correctly. Our model cannot really predict class B, yet the accuracy metric says the model is very good (87%).
So for this case we use the f1-score, which handles evaluation of imbalanced problems better.
F1 = 2 * (precision * recall) / (precision + recall)
The f1-score takes both precision and recall into consideration, hence it is important to evaluate a model with the f1-score on imbalanced data. If you still want to use accuracy, use it per class, i.e. report accuracy for class A and accuracy for class B separately.
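To make this concrete, here is a small sketch (using sklearn, with made-up predictions matching the counts above) that reproduces the numbers:

import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# class A (label 0): 1000 samples, 950 predicted correctly
# class B (label 1):  100 samples,  10 predicted correctly
y_true = np.array([0] * 1000 + [1] * 100)
y_pred = np.array([0] * 950 + [1] * 50 + [1] * 10 + [0] * 90)

print(accuracy_score(y_true, y_pred))          # ~0.87, looks fine
print(f1_score(y_true, y_pred, pos_label=1))   # 0.125, exposes the weak class B performance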
What if different inputs give the same output, since non-linear equations can have more than one root?
Also, can multiple entirely different weight values give approximately the same optimal predictions?
For example,
import numpy as np
import matplotlib.pyplot as plt

length = 1000
X = np.random.rand(length, 1) * 100

plt.plot(X, 3.5 * (X ** 2) + 3.45 * X + 3.44, "g.", alpha=0.8)
plt.plot(X, 3.60818952 * (X ** 2) - 5.94902958 * X - 17.74136614, "y.")
plt.show()
gives two curves that look the same.
[figure: matplotlib plot showing the two overlapping curves]
I am developing a segmentation neural network with only two classes, 0 and 1 (0 is the background and 1 is the object I want to find in the image). In each image, about 80% of the pixels are 1 and 20% are 0. As you can see, the dataset is imbalanced, and it skews the results: my accuracy is 85% and my loss is low, but that is only because my model is good at finding the background!
I would like to base the optimizer on another metric, like precision or recall, which is more useful in this case.
Does anyone know how to implement this?
You don't optimize precision or recall directly. You just track them as validation scores to pick the best weights. Don't mix up loss, optimizer, and metrics; they are not meant for the same thing.
import numpy as np
from keras import backend as K
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout

THRESHOLD = 0.5

def precision(y_true, y_pred, threshold_shift=0.5 - THRESHOLD):
    # just in case
    y_pred = K.clip(y_pred, 0, 1)

    # shifting the prediction threshold from .5 if needed
    y_pred_bin = K.round(y_pred + threshold_shift)

    tp = K.sum(K.round(y_true * y_pred_bin)) + K.epsilon()
    fp = K.sum(K.round(K.clip(y_pred_bin - y_true, 0, 1)))

    precision = tp / (tp + fp)
    return precision


def recall(y_true, y_pred, threshold_shift=0.5 - THRESHOLD):
    # just in case
    y_pred = K.clip(y_pred, 0, 1)

    # shifting the prediction threshold from .5 if needed
    y_pred_bin = K.round(y_pred + threshold_shift)

    tp = K.sum(K.round(y_true * y_pred_bin)) + K.epsilon()
    fn = K.sum(K.round(K.clip(y_true - y_pred_bin, 0, 1)))

    recall = tp / (tp + fn)
    return recall


def fbeta(y_true, y_pred, beta=2, threshold_shift=0.5 - THRESHOLD):
    # just in case
    y_pred = K.clip(y_pred, 0, 1)

    # shifting the prediction threshold from .5 if needed
    y_pred_bin = K.round(y_pred + threshold_shift)

    tp = K.sum(K.round(y_true * y_pred_bin)) + K.epsilon()
    fp = K.sum(K.round(K.clip(y_pred_bin - y_true, 0, 1)))
    fn = K.sum(K.round(K.clip(y_true - y_pred_bin, 0, 1)))  # use the binarized predictions here

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)

    beta_squared = beta ** 2
    return (beta_squared + 1) * (precision * recall) / (beta_squared * precision + recall)


def model_fit(X, y, X_test, y_test):
    # weight the minority class inversely to its frequency
    class_weight = {
        1: 1 / (np.sum(y) / len(y)),
        0: 1}

    np.random.seed(47)

    model = Sequential()
    model.add(Dense(1000, input_shape=(X.shape[1],)))
    model.add(Activation('relu'))
    model.add(Dropout(0.35))
    model.add(Dense(500))
    model.add(Activation('relu'))
    model.add(Dropout(0.35))
    model.add(Dense(250))
    model.add(Activation('relu'))
    model.add(Dropout(0.35))
    model.add(Dense(1))
    model.add(Activation('sigmoid'))

    model.compile(loss='binary_crossentropy', optimizer='adamax',
                  metrics=[fbeta, precision, recall])
    model.fit(X, y, validation_data=(X_test, y_test), epochs=200, batch_size=50,
              verbose=2, class_weight=class_weight)
    return model
No. To do gradient descent, you need to compute a gradient, and for that the function needs to be somewhat smooth. Precision, recall, and accuracy are not smooth functions: they have only sharp edges, where the gradient is infinite, and flat regions, where the gradient is zero. Hence you cannot use any kind of numerical method to find a minimum of such a function; you would have to use some kind of combinatorial optimization, and that would be NP-hard.
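As a quick illustration of the "flat regions" point, here is a toy sketch (made-up data, single weight w) showing that accuracy stays constant over wide ranges of the weight and only jumps at isolated points, so its gradient is zero almost everywhere:

import numpy as np

# toy 1-feature model: predict 1 if w * x > 0.5
x = np.array([0.2, 0.4, 0.6, 0.8])
y = np.array([0, 0, 1, 1])

for w in np.linspace(0.5, 1.5, 11):
    acc = np.mean((w * x > 0.5).astype(int) == y)
    print(f"w = {w:.1f}  accuracy = {acc:.2f}")
# accuracy is piecewise constant in w: flat plateaus with sudden jumps,
# which is exactly why gradient descent cannot optimize it directly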
As others have stated, precision/recall is not directly usable as a loss function. However, better proxy loss functions have been found that help with a whole family of precision/recall related functions (e.g. ROC AUC, precision at fixed recall, etc.)
The research paper Scalable Learning of Non-Decomposable Objectives covers this with a method to sidestep the combinatorial optimization by the use of certain calculated bounds, and some Tensorflow code by the authors is available at the tensorflow/models repository. Additionally, there is a followup question on StackOverflow that has an answer that adapts this into a usable Keras loss function.
Special thanks to Francois Chollet and other participants on the Keras issue thread here that turned up that research paper. You may also find that thread provides other useful insights into the problem at hand.
Having faced the same problem with an unbalanced dataset, I'd suggest you use the F1 score as the metric you track while training.
Andrew Ng teaches that having ONE metric for the model is the simplest (best?) way to train a model. If you have two metrics, like precision and recall, it's not clear which one is more important, and trying to set limits on one metric inevitably impacts the other...
The F1 score combines recall and precision: it is their harmonic mean.
Keras, which I'm using, unfortunately has no built-in implementation of the F1 score as a metric, unlike accuracy and the many other metrics at https://keras.io/api/metrics/.
I found an implementation of the F1 score as a Keras metric, computed at each epoch, here:
https://medium.com/#aakashgoel12/how-to-add-user-defined-function-get-f1-score-in-keras-metrics-3013f979ce0d
I've implemented the simple function from the article above, and the model now tracks the F1 score as a Keras metric during training. Results on test: accuracy went down a bit and the F1 score went up a lot.
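For reference, a minimal sketch of that kind of batch-wise F1 metric (this is my paraphrase of the idea, not the article's exact code; the function name is my own):

from keras import backend as K

def f1_metric(y_true, y_pred):
    # batch-wise F1: counts are computed per batch, so this is only an approximation
    y_pred_bin = K.round(K.clip(y_pred, 0, 1))
    tp = K.sum(y_true * y_pred_bin)
    precision = tp / (K.sum(y_pred_bin) + K.epsilon())
    recall = tp / (K.sum(y_true) + K.epsilon())
    return 2 * precision * recall / (precision + recall + K.epsilon())

# model.compile(loss='binary_crossentropy', optimizer='adam', metrics=[f1_metric])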
I have the same problem with an unbalanced dataset for binary classification, and I also want to increase recall. I found that there is a built-in Recall metric in tf.keras that can be used in the compile statement as follows:
from tensorflow.keras.metrics import Recall, BinaryAccuracy

# BinaryAccuracy thresholds the sigmoid output; plain Accuracy() would expect exact label matches
model.compile(loss='binary_crossentropy', optimizer=opt,
              metrics=[BinaryAccuracy(), Recall()])
The recommended approach to deal with an unbalanced dataset like you have is to use class_weights or sample_weights. See the model fit API for details.
Quote:
class_weight: Optional dictionary mapping class indices (integers) to a weight (float) value, used for weighting the loss function (during training only). This can be useful to tell the model to "pay more attention" to samples from an under-represented class.
With weights that are inversely proportional to the class frequency, the loss will avoid just predicting the background class.
I understand that this is not how you formulated the question but imho it is the most practical approach to the issue you are facing.
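For example, a minimal sketch (the weight values and variable names here are illustrative; compute the weights from your own class frequencies):

# weights inversely proportional to class frequency
# (assuming label 1 is the minority class, roughly a 4:1 imbalance)
class_weight = {0: 1.0, 1: 4.0}

model.fit(X_train, y_train,
          validation_data=(X_val, y_val),
          epochs=50,
          batch_size=32,
          class_weight=class_weight)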
I think that the Callbacks and Early Stopping mechanisms provide techniques that can get you as close as possible to what you want to achieve. Please read the following article by Jason Brownlee about Early Stopping (read to the end!):
https://machinelearningmastery.com/how-to-stop-training-deep-neural-networks-at-the-right-time-using-early-stopping/
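For instance, a sketch of how that could look in Keras (this assumes a Recall() metric was passed to model.compile, so the monitored quantity is named 'val_recall'; variable names are placeholders):

from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

# stop training, and keep the best weights, based on a validation metric such as recall
callbacks = [
    EarlyStopping(monitor='val_recall', mode='max', patience=10,
                  restore_best_weights=True),
    ModelCheckpoint('best_model.h5', monitor='val_recall', mode='max',
                    save_best_only=True),
]

model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=200, callbacks=callbacks)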
When we evaluate a classifier in WEKA, for example a 2-class classifier, it gives us 3 f-measures: the f-measure for class 1, the f-measure for class 2, and the weighted f-measure.
I'm so confused! I thought the f-measure was a balanced performance measure across multiple classes, so what do the f-measures for class 1 and class 2 mean?
The f-score (or f-measure) is calculated based on the precision and recall. The calculation is as follows:
Precision = t_p / (t_p + f_p)
Recall = t_p / (t_p + f_n)
F-score = 2 * Precision * Recall / (Precision + Recall)
Where t_p is the number of true positives, f_p the number of false positives and f_n the number of false negatives. Precision is defined as the fraction of elements correctly classified as positive out of all the elements the algorithm classified as positive, whereas recall is the fraction of elements correctly classified as positive out of all the positive elements.
In the multiclass case, each class i has its own precision and recall, where a "true positive" is an element predicted to be in class i that really is in it, and a "true negative" is an element predicted not to be in class i that isn't in it.
Thus, with this new definition of precision and recall, each class can have its own f-score by doing the same calculation as in the binary case. This is what Weka's showing you.
The weighted f-score is a weighted average of the classes' f-scores, weighted by the proportion of elements in each class.
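Outside Weka, the same idea can be seen with sklearn (toy labels purely for illustration):

import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 1, 0, 0, 1, 1, 0, 1, 0])

print(f1_score(y_true, y_pred, average=None))        # one f-score per class
print(f1_score(y_true, y_pred, average='weighted'))  # average weighted by class support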
I am confused too.
I used the same equation for the f-score of each class, based on its precision and recall, but the results are different!
Example:
[screenshot: hand-calculated f-score differing from Weka's output]