What is the impact of the `pos_weight` argument in `BCEWithLogitsLoss`? - machine-learning

According to the PyTorch documentation for nn.BCEWithLogitsLoss, pos_weight is an optional argument that takes the weight of positive examples. I don't fully understand the statement "pos_weight > 1 increases recall and pos_weight < 1 increases precision" on that page. How do you understand this statement?

The binary cross-entropy with logits loss (nn.BCEWithLogitsLoss, equivalent to F.binary_cross_entropy_with_logits) is a sigmoid layer (nn.Sigmoid) followed by a binary cross-entropy loss (nn.BCELoss). The general case assumes you are in a multi-label classification task, i.e. a single input can be labeled with multiple classes. One common sub-case is having a single class: the binary classification task. Define q as your tensor of predicted logits and p as the ground-truth tensor in [0, 1] containing the true probability for each class.
The explicit formulation for the binary cross-entropy would be:
z = torch.sigmoid(q)
loss = -(w_p*p*torch.log(z) + (1-p)*torch.log(1-z))
introducing w_p, the weight associated with the positive (true) label for each class. Read this post for more details on the weighting scheme used by BCELoss.
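If you want to check that this explicit formulation matches the built-in loss, here is a quick sketch (the tensor size and the value w_p = 2.0 are arbitrary choices for illustration):

import torch
import torch.nn.functional as F

q = torch.randn(8)                      # raw logits
p = torch.randint(0, 2, (8,)).float()   # ground-truth labels in {0, 1}
w_p = torch.tensor(2.0)                 # weight on the positive term

# explicit formulation from above, averaged over the batch
z = torch.sigmoid(q)
manual = -(w_p * p * torch.log(z) + (1 - p) * torch.log(1 - z)).mean()

# built-in equivalent with pos_weight
builtin = F.binary_cross_entropy_with_logits(q, p, pos_weight=w_p)

print(torch.allclose(manual, builtin))  # True (up to numerical precision)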
For a given class:
precision = TP / (TP + FP)
recall = TP / (TP + FN)
Then if w_p > 1, the penalty for missing a positive example (a false negative) is increased, so the model tends to predict positive more often. This reduces false negatives (FN), which increases recall, but it also tends to increase false positives (FP), which decreases precision. Similarly, if w_p < 1, we decrease the weight on the positive class, so the model predicts positive less often: false positives go down (precision increases), but false negatives go up (recall decreases).

Related

Why do we aim to minimize cross-entropy in classification problems in deep neural networks?

I am reading about why cross-entropy is widely used as the loss function in deep neural networks for classification problems.
As per my understanding, cross-entropy compares two probability distributions, and if both distributions (target and predicted) are the same, then the cross-entropy equals the entropy.
The formulas are as follows:
entropy = -SUM_{i=0..n} p(x_i) * log(p(x_i))
cross entropy = -SUM_{i=0..n} p(x_i) * log(q(x_i))
p(x) --> probability distribution of target values
q(x) --> probability distribution of predicted values
If we aim to minimize the value of the cross-entropy, how does the model learn the probability distribution of the actual values?
Let's forget about cross entropy and instead start by saying
"I want to maximise the probability that all my predictions p match
targets y"
Now maximising a probability and maximising its logarithm are the same thing, so let's work with the logarithm for computational simplicity: this way we don't have to multiply a lot of numbers smaller than one (which could lead to super tiny numbers or underflow). For simplicity let's do this for just the binary case, but multi-class works the same way.
log PROD_i P(p_i=y_i) = SUM_i log P(p_i=y_i)
= SUM_i [y_i * log P(p_i=1) + (1-y_i) log P(p_i = 0)]
= SUM_i [y_i * log P(p_i=1) + (1-y_i) log (1-P(p_i = 1))]
= -CE( y || p )
The crucial transformation happened when we noticed that log P(p_i = y_i) equals log P(p_i = 1) when y_i is 1 (which is equivalent to multiplying that term by y_i), and log P(p_i = 0) = log(1 - P(p_i = 1)) when y_i is 0 (which is equivalent to multiplying that term by 1 - y_i).
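Here is a small numerical check of that chain of equalities (the predicted probabilities below are made up for illustration): the log of the product of per-sample probabilities equals minus the binary cross-entropy.

import numpy as np

p = np.array([0.9, 0.2, 0.7, 0.4])   # predicted P(y_i = 1)
y = np.array([1, 0, 1, 0])           # binary targets

# probability the model assigns to the correct label of each sample
per_sample = np.where(y == 1, p, 1 - p)

log_prod = np.log(np.prod(per_sample))                    # log PROD_i P(p_i = y_i)
neg_ce = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))  # -CE(y || p)

print(np.isclose(log_prod, neg_ce))  # True

So maximising the likelihood of the targets is exactly the same as minimising the cross-entropy.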

Accuracy and error rate of example Siamese network in Keras

I have been following this example here and I want to know how exactly this accuracy function works:
def compute_accuracy(y_true, y_pred):
    '''Compute classification accuracy with a fixed threshold on distances.'''
    pred = y_pred.ravel() < 0.5
    return np.mean(pred == y_true)
As far as I know, the output of the network in this case is going to be the distance between the two samples of a pair. So how can we calculate the accuracy in this case? What does the "0.5" threshold refer to? Also, how can I calculate the error rate?
It seems there are some gaps in the understanding of that example which need to be filled first:
If you study the data preparation step (i.e. create_pairs method), you would realize that the positive pairs (i.e. pairs of samples belonging to the same class) are assigned a label of 1 (i.e. positive/true) and the negative pairs (i.e. pairs of samples belonging to different classes) are assigned a label of 0 (i.e. negative/false).
Further, the Siamese network in the example is designed such that, given a pair of samples as input, it predicts their distance as output. By using the contrastive loss as the loss function of the model, the model is trained such that given a positive pair as input a small distance value is predicted (because they belong to the same class and therefore their distance should be low, i.e. to convey similarity), and given a negative pair as input a large distance value is predicted (because they belong to different classes and therefore their distance should be high, i.e. to convey dissimilarity). As an exercise, try to confirm these points numerically (i.e. when y_true is 1 and when y_true is 0) using the contrastive loss definition in the code.
So, the accuracy function in the example applies a fixed, arbitrary threshold of 0.5 to the predicted distance values, i.e. y_pred (this means the author of this example has decided that distance values of less than 0.5 indicate positive pairs; you may decide to use another threshold value, but it should be a reasonable choice based on experiment/experience). The result is then compared with the true label values, i.e. y_true:
When y_pred is lower than 0.5 (y_pred < 0.5 would be equal to True): if y_true is 1 (i.e. positive) then this means the prediction of the network is consistent with the true label (i.e. True == 1 is equal to True) and therefore the prediction for this sample is counted towards correct predictions (i.e. accuracy). However, if y_true is 0 (i.e. negative) then the prediction for this sample is not correct (i.e. True == 0 is equal to False) and therefore this would not contribute to correct predictions.
When y_pred is equal or greater than 0.5 (y_pred < 0.5 would be equal to False): Same reasoning as above applies (left as an exercise!).
(Note: don't forget that the model is trained on batches of samples. Therefore, y_pred and y_true are not single values; rather, they are arrays of values, and all the calculations/comparisons mentioned above are applied element-wise.)
Let's look at an (imaginary) numerical example on an input batch of 5 sample pairs and how the accuracy is calculated for predictions of the model on this batch:
>>> y_pred = np.array([1.5, 0.7, 0.1, 0.3, 3.2])
>>> y_true = np.array([1, 0, 0, 1, 0])
>>> pred = y_pred < 0.5
>>> pred
array([False, False, True, True, False])
>>> result = pred == y_true
>>> result
array([False, True, False, True, True])
>>> accuracy = np.mean(result)
>>> accuracy
0.6
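As for the error rate asked about in the question: since every pair is counted as either correctly or incorrectly classified, the error rate is simply the complement of the accuracy, e.g. continuing the example above:

>>> error_rate = 1 - accuracy
>>> error_rate
0.4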

In a classification problem, why is the F1-score more suitable than accuracy when the classes are unbalanced?

I understand that the F1 score is more important when false positives/false negatives are crucial for determining a good classifier. I read on a site that "F1 Score is the weighted average of Precision and Recall; therefore, this score takes both false positives and false negatives into account. Intuitively it is not as easy to understand as accuracy, but F1 is usually more useful than accuracy, especially if you have an uneven class distribution". The fact that the F1 score is more suitable for an uneven or unbalanced class distribution was also stated on other sites, but what is the reason for this?
Let's say you have 1000 samples of class A and 100 samples of class B.
When you use accuracy as the evaluation metric,
where
Accuracy = (correct predictions for class A + correct predictions for class B) / total predictions
Let's say that out of the 1000 class A samples, 950 are predicted correctly, and out of the 100 class B samples, only 10 are predicted correctly.
Then, as per the accuracy formula,
Accuracy = (950 class A correct predictions + 10 class B correct predictions) / 1100
Accuracy = 0.8727 (about 87%)
In this imbalanced case we get 87% accuracy, which looks good, but notice that for class B only 10 out of 100 records were predicted correctly. The model is barely able to predict class B, yet the accuracy metric suggests the model is very good (87% accuracy).
So for this case we use the F1-score, which handles the evaluation of imbalanced problems better.
F1 = 2 * (precision * recall) / (precision + recall)
The F1-score takes precision and recall into consideration, hence it is important to evaluate the model with the F1-score in the case of imbalanced data. If you still want to use accuracy as a metric, use it class-wise, i.e. report accuracy for class A and accuracy for class B separately.
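To make this concrete, here is the same scenario worked out in code, treating class B as the positive class. Since the problem is binary, the 50 misclassified class A samples must have been predicted as B, and the 90 misclassified class B samples as A (these counts are inferred from the numbers above, not given explicitly):

tp = 10    # class B samples correctly predicted as B
fn = 90    # class B samples wrongly predicted as A
fp = 50    # class A samples wrongly predicted as B
tn = 950   # class A samples correctly predicted as A

accuracy = (tp + tn) / (tp + tn + fp + fn)          # ~0.873
precision = tp / (tp + fp)                          # ~0.167
recall = tp / (tp + fn)                             # 0.1
f1 = 2 * precision * recall / (precision + recall)  # ~0.125

print(accuracy, precision, recall, f1)

The 87% accuracy hides the fact that the model almost never gets class B right, while an F1-score of roughly 0.125 makes that failure obvious.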

Is there an optimizer in keras based on precision or recall instead of loss?

I am developing a segmentation neural network with only two classes, 0 and 1 (0 is the background and 1 is the object that I want to find in the image). In each image, about 80% of the pixels are 1 and 20% are 0. As you can see, the dataset is unbalanced and it makes the results misleading: my accuracy is 85% and my loss is low, but that is only because my model is good at finding the background!
I would like to base the optimizer on another metric, such as precision or recall, which is more useful in this case.
Does anyone know how to implement this?
You don't optimize precision or recall directly. You just track them as validation scores to pick the best weights. Don't mix up the loss, the optimizer and the metrics; they are not meant for the same thing.
from keras import backend as K

THRESHOLD = 0.5

def precision(y_true, y_pred, threshold_shift=0.5 - THRESHOLD):
    # just in case
    y_pred = K.clip(y_pred, 0, 1)

    # shifting the prediction threshold from .5 if needed
    y_pred_bin = K.round(y_pred + threshold_shift)

    tp = K.sum(K.round(y_true * y_pred_bin)) + K.epsilon()
    fp = K.sum(K.round(K.clip(y_pred_bin - y_true, 0, 1)))

    precision = tp / (tp + fp)
    return precision


def recall(y_true, y_pred, threshold_shift=0.5 - THRESHOLD):
    # just in case
    y_pred = K.clip(y_pred, 0, 1)

    # shifting the prediction threshold from .5 if needed
    y_pred_bin = K.round(y_pred + threshold_shift)

    tp = K.sum(K.round(y_true * y_pred_bin)) + K.epsilon()
    fn = K.sum(K.round(K.clip(y_true - y_pred_bin, 0, 1)))

    recall = tp / (tp + fn)
    return recall


def fbeta(y_true, y_pred, beta=2, threshold_shift=0.5 - THRESHOLD):
    # just in case
    y_pred = K.clip(y_pred, 0, 1)

    # shifting the prediction threshold from .5 if needed
    y_pred_bin = K.round(y_pred + threshold_shift)

    tp = K.sum(K.round(y_true * y_pred_bin)) + K.epsilon()
    fp = K.sum(K.round(K.clip(y_pred_bin - y_true, 0, 1)))
    fn = K.sum(K.round(K.clip(y_true - y_pred_bin, 0, 1)))

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)

    beta_squared = beta ** 2
    return (beta_squared + 1) * (precision * recall) / (beta_squared * precision + recall)
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout

def model_fit(X, y, X_test, y_test):
    # weight the positive class inversely to its frequency
    class_weight = {
        1: 1 / (np.sum(y) / len(y)),
        0: 1}

    np.random.seed(47)

    model = Sequential()
    model.add(Dense(1000, input_shape=(X.shape[1],)))
    model.add(Activation('relu'))
    model.add(Dropout(0.35))
    model.add(Dense(500))
    model.add(Activation('relu'))
    model.add(Dropout(0.35))
    model.add(Dense(250))
    model.add(Activation('relu'))
    model.add(Dropout(0.35))
    model.add(Dense(1))
    model.add(Activation('sigmoid'))

    model.compile(loss='binary_crossentropy', optimizer='adamax',
                  metrics=[fbeta, precision, recall])
    model.fit(X, y, validation_data=(X_test, y_test), epochs=200, batch_size=50,
              verbose=2, class_weight=class_weight)
    return model
No. To do gradient descent, you need to compute a gradient, and for that the function needs to be reasonably smooth. Precision, recall and accuracy are not smooth functions: they only have sharp edges, on which the gradient is undefined, and flat regions, on which the gradient is zero. Hence you cannot use gradient-based numerical methods to find a minimum of such a function; you would have to use some kind of combinatorial optimization, and that would be NP-hard.
As others have stated, precision/recall is not directly usable as a loss function. However, better proxy loss functions have been found that help with a whole family of precision/recall related functions (e.g. ROC AUC, precision at fixed recall, etc.)
The research paper Scalable Learning of Non-Decomposable Objectives covers this with a method to sidestep the combinatorial optimization by the use of certain calculated bounds, and some Tensorflow code by the authors is available at the tensorflow/models repository. Additionally, there is a followup question on StackOverflow that has an answer that adapts this into a usable Keras loss function.
Special thanks to Francois Chollet and other participants on the Keras issue thread here that turned up that research paper. You may also find that thread provides other useful insights into the problem at hand.
Having the same problem with an unbalanced dataset, I'd suggest you use the F1 score as the metric of your optimizer.
Andrew Ng teaches that having ONE metric for the model is the simplest (best?) way to train a model. If you have 2 metrics, like precision and recall - it's not clear which one is more important. Trying to set limits on one metric obviously impacts the other metric...
The F1 score is derived from recall and precision: it is their harmonic mean.
The version of Keras that I'm using unfortunately has no built-in implementation of the F1 score as a metric, unlike accuracy and many other Keras metrics: https://keras.io/api/metrics/.
I found an implementation of the F1 score as a Keras metric, used at each epoch at:
https://medium.com/#aakashgoel12/how-to-add-user-defined-function-get-f1-score-in-keras-metrics-3013f979ce0d
I've implemented the simple function from the above article, and the model now uses the F1 score as its Keras metric. Results on the test set: accuracy went down a bit and the F1 score went up a lot.
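The article above gives one implementation; as an alternative, here is a minimal sketch of an F1 metric built from the precision and recall helpers defined earlier in this thread (it assumes K is the Keras backend and model is your Keras model):

def f1(y_true, y_pred):
    # harmonic mean of the precision and recall metrics defined above;
    # K.epsilon() guards against a zero denominator
    p = precision(y_true, y_pred)
    r = recall(y_true, y_pred)
    return 2 * p * r / (p + r + K.epsilon())

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=[f1])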
I have the same problem regarding an unbalanced dataset for binary classification, and I want to increase the recall too. I found out that there is a built-in metric for recall in tf.keras that can be used in the compile statement as follows:
from tensorflow.keras.metrics import Recall, Accuracy
model.compile(loss='binary_crossentropy' , optimizer=opt, metrics=[Accuracy(),Recall()])
The recommended approach to deal with an unbalanced dataset like you have is to use class_weights or sample_weights. See the model fit API for details.
Quote:
class_weight: Optional dictionary mapping class indices (integers) to a weight (float) value, used for weighting the loss function (during training only). This can be useful to tell the model to "pay more attention" to samples from an under-represented class.
With weights that are inversely proportional to the class frequencies, the loss will avoid just predicting the background class.
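For example, a minimal sketch (X_train and y_train are hypothetical placeholders for your own data, and the weights here are simply the inverse class frequencies):

import numpy as np

pos_frac = np.mean(y_train)                  # fraction of samples/pixels labeled 1
class_weight = {0: 1.0 / (1.0 - pos_frac),   # weight inversely proportional to frequency
                1: 1.0 / pos_frac}

model.fit(X_train, y_train, epochs=10, class_weight=class_weight)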
I understand that this is not how you formulated the question but imho it is the most practical approach to the issue you are facing.
I think that the callbacks and early stopping mechanisms provide techniques that can get you as close as possible to what you want to achieve. Please read the following article by Jason Brownlee about early stopping (read it to the end!):
https://machinelearningmastery.com/how-to-stop-training-deep-neural-networks-at-the-right-time-using-early-stopping/

What is the f-measure for each class in WEKA?

When we evaluate a classifier in WEKA, for example a 2-class classifier, it gives us 3 f-measures: f-measure for class 1, for class 2 and the weighted f-measure.
I'm so confused! I thought the f-measure was a balanced measure that shows balanced performance across multiple classes, so what do the f-measures for class 1 and class 2 mean?
The f-score (or f-measure) is calculated based on the precision and recall. The calculation is as follows:
Precision = t_p / (t_p + f_p)
Recall = t_p / (t_p + f_n)
F-score = 2 * Precision * Recall / (Precision + Recall)
Where t_p is the number of true positives, f_p the number of false positives and f_n the number of false negatives. Precision is defined as the fraction of elements correctly classified as positive out of all the elements the algorithm classified as positive, whereas recall is the fraction of elements correctly classified as positive out of all the positive elements.
In the multiclass case, each class i has its own precision and recall, in which a "true positive" is an element predicted to be in class i that really is in it, and a "true negative" is an element predicted not to be in class i that indeed isn't in it.
Thus, with this new definition of precision and recall, each class can have its own f-score by doing the same calculation as in the binary case. This is what Weka's showing you.
The weighted f-score is a weighted average of the classes' f-scores, weighted by the proportion of how many elements are in each class.
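Outside of Weka, you can reproduce both the per-class and the weighted f-scores with scikit-learn, for example (the labels below are made up):

from sklearn.metrics import f1_score

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

print(f1_score(y_true, y_pred, average=None))        # one f-score per class
print(f1_score(y_true, y_pred, average='weighted'))  # averaged, weighted by class support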
I am confused too.
I used the same equation for the f-score of each class, based on their precision and recall, but the results are different!
Example:
f-score different from the Weka calculation
