Binary Crossentropy to penalize all components of one-hot vector - machine-learning

I understand that binary cross-entropy is the same as categorical cross-entropy in the case of two classes.
Further, it is clear to me what softmax is.
Therefore, I see that categorical cross-entropy just penalizes the one component (probability) that should be 1.
But why can't or shouldn't I use binary cross-entropy on a one-hot vector?
Normal case for single-label, mutually exclusive multi-class classification:
################
pred = [0.1 0.3 0.2 0.4]
label (one hot) = [0 1 0 0]
costfunction: categorical crossentropy
= sum(label * -log(pred)) // just consider the 1-label
= 0.523   (using base-10 logarithms)
Why not that?
################
pred = [0.1 0.3 0.2 0.4]
label (one hot) = [0 1 0 0]
costfunction: binary crossentropy
= sum(- label * log(pred) - (1 - label) * log(1 - pred))
= 1*(-log(0.3)) - log(1-0.1) - log(1-0.2) - log(1-0.4)
= 0.887   (again with base-10 logarithms)
I see that in binary cross-entropy the zero is a target class, and corresponds to the following one-hot encoding:
target class zero 0 -> [1 0]
target class one 1 -> [0 1]
In summary: why do we only compute the negative log likelihood for the class whose label is 1? Why don't we also penalize the other should-be-zero classes?
If one applied binary cross-entropy to a one-hot vector, the probabilities predicted for the zero labels would be penalized too.

See my answer on a similar question. In short, the binary cross-entropy formula doesn't make sense for a one-hot vector. Depending on the task, you can either apply softmax cross-entropy (for two or more mutually exclusive classes) or use a label vector of (independent) probabilities.
But why, can't or shouldn't I use binary crossentropy on a one-hot vector?
What you compute is binary cross-entropy of 4 independent features:
pred = [0.1 0.3 0.2 0.4]
label = [0 1 0 0]
The model inference predicted that the first feature is on with 10% probability, the second feature is on with 30% probability, and so on. The target label is interpreted this way: all features are off, except for the second one. Note that [1, 1, 1, 1] is a perfectly valid label as well, i.e. it is not a one-hot vector, and pred = [0.5, 0.8, 0.7, 0.1] is a valid prediction, i.e. the sum doesn't have to equal one.
In other words, your computation is valid, but for a completely different problem: multi-label non-exclusive binary classification.
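For concreteness, here is a minimal NumPy sketch of the two computations above (using base-10 logarithms so the numbers match the question):
import numpy as np

pred = np.array([0.1, 0.3, 0.2, 0.4])
label = np.array([0, 1, 0, 0])

# Categorical cross-entropy: only the component with label 1 contributes
cce = np.sum(label * -np.log10(pred))
# Binary cross-entropy: every component is treated as an independent feature
bce = np.sum(-label * np.log10(pred) - (1 - label) * np.log10(1 - pred))

print(cce, bce)   # ~0.523 and ~0.887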
See also the difference between softmax and sigmoid cross-entropy loss functions in tensorflow.

Related

What is the impact of `pos_weight` argument in `BCEWithLogitsLoss`?

According to the PyTorch doc of nn.BCEWithLogitsLoss, pos_weight is an optional argument that takes the weight of positive examples. I don't fully understand the statement "pos_weight > 1 increases recall and pos_weight < 1 increases precision" on that page. How do you understand this statement?
The binary cross-entropy with logits loss (nn.BCEWithLogitsLoss, equivalent to F.binary_cross_entropy_with_logits) is a sigmoid layer (nn.Sigmoid) followed by a binary cross-entropy loss (nn.BCELoss). The general case assumes you are in a multi-label classification task, i.e. a single input can be labeled with multiple classes. One common sub-case is to have a single class: the binary classification task. Define q as your tensor of predicted logits and p as the ground-truth tensor in [0, 1] holding the true probabilities for each class.
The explicit formulation for the binary cross-entropy would be:
z = torch.sigmoid(q)                                          # logits -> probabilities
loss = -(w_p * p * torch.log(z) + (1 - p) * torch.log(1 - z))
where w_p is the weight associated with the positive label of each class (the pos_weight argument). Read this post for more details on the weighting scheme used by the BCELoss.
For a given class:
precision = TP / (TP + FP)
recall = TP / (TP + FN)
Then if w_p > 1, we increase the weight on positive classifications (classifying as true). This will tend to increase false positives (FP), thus decreasing the precision, while recovering more true positives and fewer false negatives, i.e. increasing the recall. Similarly, if w_p < 1, we decrease the weight on the positive class, which will tend to increase false negatives (FN), which decreases recall (and, symmetrically, increases precision).
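For illustration, a small sketch (with made-up logits and targets) comparing the built-in loss to the manual formula above:
import torch
import torch.nn as nn

logits = torch.tensor([[0.8, -1.2, 0.3],
                       [-0.5, 2.0, -0.1]])      # q: raw scores for 3 classes, batch of 2
targets = torch.tensor([[1., 0., 1.],
                        [0., 1., 0.]])          # p: ground-truth labels
pos_weight = torch.tensor([2.0, 2.0, 2.0])      # w_p > 1 for every class

criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
builtin = criterion(logits, targets)

# Manual computation following the formula above (mean over all elements)
z = torch.sigmoid(logits)
manual = -(pos_weight * targets * torch.log(z)
           + (1 - targets) * torch.log(1 - z)).mean()

print(builtin.item(), manual.item())            # both values should agree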

Understanding dense layer in LSTM architecture (labels & logits)

I'm working through this notebook -- https://github.com/aamini/introtodeeplearning/blob/master/lab1/solutions/Part2_Music_Generation_Solution.ipynb -- where we are using an embedding layer, LSTM, and final dense layer w/ softmax to generate music.
I'm a little confused, however, on how we're calculating loss; it is my understanding that in this notebook (in compute_loss()), in any given batch, we are comparing expected labels (which are the notes themselves) to the logits (i.e. predictions from the dense layer). However, aren't these predictions supposed to be a probability distribution? When are we actually selecting the label that we are predicting against?
A little more clarification on my question: if the shape of our labels is (batch_size, # of time steps), and the shape of our logits is (batch_size, # of time steps, vocab_size), at what point in the compute_loss() function are we actually selecting a label for each time step?
The short answer is that the Keras loss function sparse_categorical_crossentropy() does everything you need.
At each timestep of the LSTM model, the top dense layer and softmax function inside that loss function together generate a probability distribution over the model's vocabulary, which in this case are musical notes. Suppose the vocabulary comprises the notes A, B, C, D. Then one possible probability distribution generated is: [0.01, 0.70, 0.28, 0.01], meaning that the model is putting a lot of probability on note B (index 1), like so:
Label:    A     B     C     D
Index:    0     1     2     3
Prob:  0.01  0.70  0.28  0.01
Suppose the true note should be C, which is represented by the number 2, since it is at index 2 in the distribution array (with indexing starting at 0). To measure the difference between the predicted distribution and the true value, use the sparse_categorical_crossentropy() function to produce a floating-point number representing the loss.
More information can be found on this TensorFlow documentation page. On that page, they have the example:
y_true = [1, 2]
y_pred = [[0.05, 0.95, 0], [0.1, 0.8, 0.1]]
loss = tf.keras.losses.sparse_categorical_crossentropy(y_true, y_pred)
You can see in that example there is a batch of two instances. For the first instance, the true label is 1 and the predicted distribution is [0.05, 0.95, 0]; for the second instance, the true label is 2 while the predicted distribution is [0.1, 0.8, 0.1]. The resulting losses are -log(0.95) ≈ 0.051 and -log(0.1) ≈ 2.303, i.e. the negative natural log of the probability each instance assigns to its true index.
This function is used in your Jupyter Notebook in section 2.5:
To train our model on this classification task, we can use a form of the crossentropy loss (negative log likelihood loss). Specifically, we will use the sparse_categorical_crossentropy loss, as it utilizes integer targets for categorical classification tasks. We will want to compute the loss using the true targets -- the labels -- and the predicted targets -- the logits.
So to answer your questions directly:
it is my understanding that in this notebook (in compute_loss()), in any given batch, we are comparing expected labels (which are the notes themselves) to the logits (i.e. predictions from the dense layer).
Yes, your understanding is correct.
However, aren't these predictions supposed to be a probability distribution?
Yes, they are.
When are we actually selecting the label that we are predicting against?
It is done inside the sparse_categorical_crossentropy() function. If your distribution is [0.05, 0.95, 0], then that implicitly means that the function is predicting 0.05 probability for index 0, 0.95 probability for index 1, and 0.0 probability for index 2.
A little more clarification on my question: if the shape of our labels is (batch_size, # of time steps), and the shape of our logits is (batch_size, # of time steps, vocab_size), at what point in the compute_loss() function are we actually selecting a label for each time step?
It's inside that function.
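For illustration, here is a short sketch (arbitrary shapes, random logits) showing that the per-timestep label selection happens inside the loss function:
import tensorflow as tf

batch_size, time_steps, vocab_size = 2, 3, 4
labels = tf.constant([[0, 2, 1],
                      [3, 0, 2]])                      # shape (batch_size, time_steps)
logits = tf.random.normal((batch_size, time_steps, vocab_size))

# from_logits=True because the dense layer outputs raw scores, not probabilities
loss = tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)
print(loss.shape)   # (2, 3): one loss value per timestep, selected by the integer label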

Accuracy and error rate of example Siamese network in Keras

I have been following this example here and I want to know how exactly this accuracy function works:
def compute_accuracy(y_true, y_pred):
    '''Compute classification accuracy with a fixed threshold on distances.'''
    pred = y_pred.ravel() < 0.5
    return np.mean(pred == y_true)
As far as I know, the output of the network in this case is going to be the distance between the two members of a pair. So how can we calculate the accuracy in this case? What does the "0.5" threshold refer to? Also, how can I calculate the error rate?
It seems there are some gaps in the understanding of that example which need to be filled first:
If you study the data preparation step (i.e. create_pairs method), you would realize that the positive pairs (i.e. pairs of samples belonging to the same class) are assigned a label of 1 (i.e. positive/true) and the negative pairs (i.e. pairs of samples belonging to different classes) are assigned a label of 0 (i.e. negative/false).
Further, the Siamese network in the example is designed such that, given a pair of samples as input, it predicts their distance as output. By using the contrastive loss as the loss function of the model, the model is trained such that given a positive pair as input a small distance value is predicted (because they belong to the same class and therefore their distance should be low, i.e. to convey similarity), and given a negative pair as input a large distance value is predicted (because they belong to different classes and therefore their distance should be high, i.e. to convey dissimilarity). As an exercise, try to confirm these points by considering them numerically (i.e. when y_true is 1 and when y_true is 0) using the contrastive loss definition in the code, sketched just below.
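For reference, the contrastive loss in that example has roughly the following form (a sketch assuming the standard Keras mnist_siamese definition with a margin of 1):
from tensorflow.keras import backend as K

def contrastive_loss(y_true, y_pred):
    margin = 1
    # y_true == 1 (positive pair): loss = d^2                    -> pushes distance towards 0
    # y_true == 0 (negative pair): loss = max(margin - d, 0)^2   -> pushes distance above the margin
    return K.mean(y_true * K.square(y_pred) +
                  (1 - y_true) * K.square(K.maximum(margin - y_pred, 0)))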
So, the accuracy function in the example is implemented such that a fixed, arbitrary threshold, i.e. 0.5, is applied to the predicted distance values, i.e. y_pred (this means the author of this example has decided that distance values of less than 0.5 indicate positive pairs; you may decide to use another threshold value, but it should be a reasonable choice based on experiment/experience). Then the result is compared with the true label values, i.e. y_true:
When y_pred is lower than 0.5 (y_pred < 0.5 would be equal to True): if y_true is 1 (i.e. positive) then this means the prediction of the network is consistent with the true label (i.e. True == 1 is equal to True) and therefore the prediction for this sample is counted towards correct predictions (i.e. accuracy). However, if y_true is 0 (i.e. negative) then the prediction for this sample is not correct (i.e. True == 0 is equal to False) and therefore this would not contribute to correct predictions.
When y_pred is equal or greater than 0.5 (y_pred < 0.5 would be equal to False): Same reasoning as above applies (left as an exercise!).
(Note: don't forget that the model is trained on batches of samples. Therefore, y_pred or y_true are not a single value; rather, they are arrays of values, and all the calculations/comparisons mentioned above are applied element-wise).
Let's look at an (imaginary) numerical example on an input batch of 5 sample pairs and how the accuracy is calculated for predictions of the model on this batch:
>>> import numpy as np
>>> y_pred = np.array([1.5, 0.7, 0.1, 0.3, 3.2])
>>> y_true = np.array([1, 0, 0, 1, 0])
>>> pred = y_pred < 0.5
>>> pred
array([False, False, True, True, False])
>>> result = pred == y_true
>>> result
array([False, True, False, True, True])
>>> accuracy = np.mean(result)
>>> accuracy
0.6
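The error rate asked about in the question is then simply the complement of this accuracy:
>>> error_rate = 1 - accuracy
>>> error_rate
0.4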

Deep Learning Log Likelihood

I am a newbie to the Deep Learning field, and I am using the log-likelihood method to compare against the MSE metric. Could anyone show how to calculate the following 2 predicted output examples with 3 output neurons each? Thanks
yt = [[1, 0, 0], [0, 0, 1]]
yp = [[0.9, 0.2, 0.2], [0.2, 0.8, 0.3]]
MSE or Mean Squared Error is simply the expected value of the squared difference between the predicted and the ground truth labels, represented as
\text{MSE}(\hat{\theta}) = E\left[(\hat{\theta} - \theta)^2\right]
where \theta denotes the ground-truth labels and \hat{\theta} the predicted labels.
I am not sure what you are referring to exactly, whether a theoretical question or a piece of code.
As a Python implementation:
import numpy as np

def mean_squared_error(A, B):
    return np.square(np.subtract(A, B)).mean()

yt = [[1, 0, 0], [0, 0, 1]]
yp = [[0.9, 0.2, 0.2], [0.2, 0.8, 0.3]]

mse = mean_squared_error(yt, yp)
print(mse)
This will give a value of 0.21
If you are using one of the DL frameworks like TensorFlow, then they already provide a function which calculates the MSE loss between tensors:
tf.losses.mean_squared_error
where
tf.losses.mean_squared_error(
    labels,
    predictions,
    weights=1.0,
    scope=None,
    loss_collection=tf.GraphKeys.LOSSES,
    reduction=Reduction.SUM_BY_NONZERO_WEIGHTS
)
Args:
labels: The ground truth output tensor, same dimensions as 'predictions'.
predictions: The predicted outputs.
weights: Optional Tensor whose rank is either 0, or the same rank as labels, and must be broadcastable to labels (i.e., all dimensions
must be either 1, or the same as the corresponding losses dimension).
scope: The scope for the operations performed in computing the loss.
loss_collection: collection to which the loss will be added.
reduction: Type of reduction to apply to loss.
Returns:
Weighted loss float Tensor. If reduction is NONE, this has the same
shape as labels; otherwise, it is scalar.
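A usage sketch (TF 1.x API, matching the signature above), which should reproduce the ~0.21 value from the NumPy version:
import tensorflow as tf   # TF 1.x

labels = tf.constant([[1., 0., 0.], [0., 0., 1.]])
predictions = tf.constant([[0.9, 0.2, 0.2], [0.2, 0.8, 0.3]])

loss = tf.losses.mean_squared_error(labels, predictions)

with tf.Session() as sess:
    print(sess.run(loss))   # ~0.21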

Cost function training target versus accuracy desired goal

When we train neural networks, we typically use gradient descent, which relies on a continuous, differentiable real-valued cost function. The final cost function might, for example, take the mean squared error. Or put another way, gradient descent implicitly assumes the end goal is regression - to minimize a real-valued error measure.
Sometimes what we want a neural network to do is perform classification - given an input, classify it into two or more discrete categories. In this case, the end goal the user cares about is classification accuracy - the percentage of cases classified correctly.
But when we are using a neural network for classification, though our goal is classification accuracy, that is not what the neural network is trying to optimize. The neural network is still trying to optimize the real-valued cost function. Sometimes these point in the same direction, but sometimes they don't. In particular, I've been running into cases where a neural network trained to correctly minimize the cost function has a classification accuracy worse than a simple hand-coded threshold comparison.
I've boiled this down to a minimal test case using TensorFlow. It sets up a perceptron (a neural network with no hidden layers), trains it on an absolutely minimal dataset (one input variable, one binary output variable), assesses the classification accuracy of the result, then compares it to the classification accuracy of a simple hand-coded threshold comparison; the results are 60% and 80% respectively. Intuitively, this is because a single outlier with a large input value generates a correspondingly large output value, so the way to minimize the cost function is to try extra hard to accommodate that one case, in the process misclassifying two more ordinary cases. The perceptron is correctly doing what it was told to do; it's just that this does not match what we actually want of a classifier. But the classification accuracy is not a continuous differentiable function, so we can't use it as the target for gradient descent.
How can we train a neural network so that it ends up maximizing classification accuracy?
import numpy as np
import tensorflow as tf

sess = tf.InteractiveSession()
tf.set_random_seed(1)

# Parameters
epochs = 10000
learning_rate = 0.01

# Data
train_X = [
    [0],
    [0],
    [2],
    [2],
    [9],
]
train_Y = [
    0,
    0,
    1,
    1,
    0,
]

rows = np.shape(train_X)[0]
cols = np.shape(train_X)[1]

# Inputs and outputs
X = tf.placeholder(tf.float32)
Y = tf.placeholder(tf.float32)

# Weights
W = tf.Variable(tf.random_normal([cols]))
b = tf.Variable(tf.random_normal([]))

# Model
pred = tf.tensordot(X, W, 1) + b
cost = tf.reduce_sum((pred - Y)**2 / rows)
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
tf.global_variables_initializer().run()

# Train
for epoch in range(epochs):
    # Print update at successive doublings of time
    if epoch & (epoch - 1) == 0 or epoch == epochs - 1:
        print('{} {} {} {}'.format(
            epoch,
            cost.eval({X: train_X, Y: train_Y}),
            W.eval(),
            b.eval(),
        ))
    optimizer.run({X: train_X, Y: train_Y})

# Classification accuracy of perceptron
classifications = [pred.eval({X: x}) > 0.5 for x in train_X]
correct = sum([p == y for (p, y) in zip(classifications, train_Y)])
print('{}/{} = perceptron accuracy'.format(correct, rows))

# Classification accuracy of hand-coded threshold comparison
classifications = [x[0] > 1.0 for x in train_X]
correct = sum([p == y for (p, y) in zip(classifications, train_Y)])
print('{}/{} = threshold accuracy'.format(correct, rows))
How can we train a neural network so that it ends up maximizing classification accuracy?
I'm asking for a way to get a continuous proxy function that's closer to the accuracy
To start with, the loss function used today for classification tasks in (deep) neural nets was not invented with them; it goes back several decades and actually comes from the early days of logistic regression. Here is the equation for the simple case of binary classification:

L(y, p) = -\left[ y \log(p) + (1 - y) \log(1 - p) \right]
The idea behind it was exactly to come up with a continuous & differentiable function, so that we would be able to exploit the (vast, and still expanding) arsenal of convex optimization for classification problems.
It is safe to say that the above loss function is the best we have so far, given the desired mathematical constraints mentioned above.
Should we consider this problem (i.e. better approximating the accuracy) solved and finished? At least in principle, no. I am old enough to remember an era when the only activation functions practically available were tanh and sigmoid; then came ReLU and gave a real boost to the field. Similarly, someone may eventually come up with a better loss function, but arguably this is going to happen in a research paper, and not as an answer to a SO question...
That said, the very fact that the current loss function comes from very elementary considerations of probability and information theory (fields that, in sharp contrast with the current field of deep learning, stand upon firm theoretical foundations) creates at least some doubt as to if a better proposal for the loss may be just around the corner.
There is another subtle point on the relation between loss and accuracy, which makes the latter something qualitatively different than the former, and is frequently lost in such discussions. Let me elaborate a little...
All the classifiers related to this discussion (i.e. neural nets, logistic regression etc) are probabilistic ones; that is, they do not return hard class memberships (0/1) but class probabilities (continuous real numbers in [0, 1]).
Limiting the discussion for simplicity to the binary case, when converting a class probability to a (hard) class membership, we are implicitly involving a threshold, usually equal to 0.5, such as if p[i] > 0.5, then class[i] = "1". Now, we can find many cases where this naive default choice of threshold will not work (heavily imbalanced datasets are the first to come to mind), and we'll have to choose a different one. But the important point for our discussion here is that this threshold selection, while being of central importance to the accuracy, is completely external to the mathematical optimization problem of minimizing the loss, and serves as a further "insulation layer" between them, compromising the simplistic view that the loss is just a proxy for the accuracy (it is not). As nicely put in the answer of this Cross Validated thread:
the statistical component of your exercise ends when you output a probability for each class of your new sample. Choosing a threshold beyond which you classify a new observation as 1 vs. 0 is not part of the statistics any more. It is part of the decision component.
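A tiny illustration of this decoupling (with made-up probabilities): the log-loss of the predicted probabilities is fixed, yet the accuracy depends entirely on the threshold chosen afterwards.
import numpy as np

probs = np.array([0.2, 0.4, 0.6, 0.8, 0.9])     # hypothetical class probabilities
y_true = np.array([0, 1, 1, 1, 0])              # hypothetical true labels

for threshold in (0.3, 0.5, 0.7):
    y_hat = (probs > threshold).astype(int)
    print(threshold, np.mean(y_hat == y_true))  # accuracy changes; the loss does not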
Enlarging somewhat an already broad discussion: can we possibly move completely away from the (very) limiting constraint of mathematical optimization of continuous & differentiable functions? In other words, can we do away with back-propagation and gradient descent?
Well, we are actually doing so already, at least in the sub-field of reinforcement learning: 2017 was the year when new research from OpenAI on something called Evolution Strategies made headlines. And as an extra bonus, here is an ultra-fresh (Dec 2017) paper by Uber on the subject, again generating much enthusiasm in the community.
I think you are forgetting to pass your output through a sigmoid. Fixed below:
import numpy as np
import tensorflow as tf

sess = tf.InteractiveSession()
tf.set_random_seed(1)

# Parameters
epochs = 10000
learning_rate = 0.01

# Data
train_X = [
    [0],
    [0],
    [2],
    [2],
    [9],
]
train_Y = [
    0,
    0,
    1,
    1,
    0,
]

rows = np.shape(train_X)[0]
cols = np.shape(train_X)[1]

# Inputs and outputs
X = tf.placeholder(tf.float32)
Y = tf.placeholder(tf.float32)

# Weights
W = tf.Variable(tf.random_normal([cols]))
b = tf.Variable(tf.random_normal([]))

# Model
# CHANGE HERE: Remember, you need an activation function!
pred = tf.nn.sigmoid(tf.tensordot(X, W, 1) + b)
cost = tf.reduce_sum((pred - Y)**2 / rows)
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
tf.global_variables_initializer().run()

# Train
for epoch in range(epochs):
    # Print update at successive doublings of time
    if epoch & (epoch - 1) == 0 or epoch == epochs - 1:
        print('{} {} {} {}'.format(
            epoch,
            cost.eval({X: train_X, Y: train_Y}),
            W.eval(),
            b.eval(),
        ))
    optimizer.run({X: train_X, Y: train_Y})

# Classification accuracy of perceptron
classifications = [pred.eval({X: x}) > 0.5 for x in train_X]
correct = sum([p == y for (p, y) in zip(classifications, train_Y)])
print('{}/{} = perceptron accuracy'.format(correct, rows))

# Classification accuracy of hand-coded threshold comparison
classifications = [x[0] > 1.0 for x in train_X]
correct = sum([p == y for (p, y) in zip(classifications, train_Y)])
print('{}/{} = threshold accuracy'.format(correct, rows))
The output:
0 0.28319069743156433 [ 0.75648874] -0.9745011329650879
1 0.28302448987960815 [ 0.75775659] -0.9742625951766968
2 0.28285878896713257 [ 0.75902224] -0.9740257859230042
4 0.28252947330474854 [ 0.76154679] -0.97355717420578
8 0.28187844157218933 [ 0.76656926] -0.9726400971412659
16 0.28060704469680786 [ 0.77650583] -0.970885694026947
32 0.27818527817726135 [ 0.79593837] -0.9676888585090637
64 0.2738055884838104 [ 0.83302218] -0.9624817967414856
128 0.26666420698165894 [ 0.90031379] -0.9562843441963196
256 0.25691407918930054 [ 1.01172411] -0.9567816257476807
512 0.2461051195859909 [ 1.17413962] -0.9872989654541016
1024 0.23519910871982574 [ 1.38549554] -1.088881492614746
2048 0.2241383194923401 [ 1.64616168] -1.298340916633606
4096 0.21433120965957642 [ 1.95981205] -1.6126530170440674
8192 0.2075471431016922 [ 2.31746769] -1.989408016204834
9999 0.20618653297424316 [ 2.42539024] -2.1028473377227783
4/5 = perceptron accuracy
4/5 = threshold accuracy
