How to pass a gradient through a categorical distribution? - machine-learning

I have a sequence of words:
my_sequence = "This is my sequence"
I also have model_1 that predicts a label for each word, out of 2 possible labels:
model_1_probabilities_predictions = [[0.1, 0.9], [0.4, 0.6], [0.7, 0.3], [0.9, 0.1]]
So far so good. Now here is where I'm stuck:
I have another model, call it model_2. This model needs to take the categorical predictions (i.e., the argmax of model_1_probabilities_predictions) in order to make a prediction. It only uses the words corresponding to the first label.
model_2_predictions = model_2( [This, is] )
Using this I can get a loss:
loss = loss_function(model_2_predictions, true_label)
There are 3 issues:
1. I can't differentiate through argmax
2. Even if there were a way to differentiate through argmax, the loss would effectively be 0 when the categorical labels are exactly right and 1 otherwise (e.g., 0 for [label_1, label_1, label_0, label_0] but 1 for [label_0, label_1, label_0, label_0]).
3. Since model_2 only uses the words corresponding to the first label, the loss can't really propagate to the other words
I looked into the Gumbel-Softmax trick, but it's not quite what I need: model_2 must receive the categorical labels themselves, so there isn't really a way around the discrete selection.
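For reference, a minimal sketch of the straight-through variant I looked at (PyTorch's F.gumbel_softmax with hard=True): it yields hard one-hot samples in the forward pass while gradients flow through the soft relaxation in the backward pass. The mask-based selection below is only an illustration; model_2 would have to consume such a differentiable mask rather than a literal list of words, which is why it doesn't quite fit my case.
import torch
import torch.nn.functional as F

# Stand-in for model_1's output probabilities for the 4 words
probs = torch.tensor([[0.1, 0.9],
                      [0.4, 0.6],
                      [0.7, 0.3],
                      [0.9, 0.1]], requires_grad=True)
logits = probs.log()

# Straight-through Gumbel-Softmax: one-hot rows in the forward pass,
# soft gradients in the backward pass
hard_labels = F.gumbel_softmax(logits, tau=1.0, hard=True)   # shape (4, 2)

# A differentiable stand-in for "keep only the words with the first label":
# a 0/1 mask that could weight the word representations fed to model_2
mask = hard_labels[:, 0]
# masked_inputs = mask.unsqueeze(-1) * word_embeddings   # word_embeddings is hypothetical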

Related

What is the correct way to penalize one prediction more than another?

I have a BERT-based sequence classification model that takes 4 strings as input and outputs 2 labels for each one:
my_input = [string_1, string_2, string_3, string_4]
out_logits = model(my_input).logits
out_softmax = torch.softmax(out_logits, dim=1)
out_softmax
>>> tensor([[0.8666, 0.1334],
            [0.8686, 0.1314],
            [0.8673, 0.1327],
            [0.8665, 0.1335]], device='cuda:0', grad_fn=<SoftmaxBackward0>)
My loss function is nn.CrossEntropyLoss() and my labels are tensors with indices corresponding to the correct labels: tensor([0., 0., 0., 1.]). Note that every label except for one is 0.
loss = loss_fun(out_softmax, labels_tensor)
# step
optim.zero_grad()
loss.backward()
optim.step()
The issue I'm having, as can be seen above, is that the model learns to just predict one class (e.g., the first column above). I'm not entirely sure why it's happening, but I thought that penalizing the prediction that should be 1 more heavily might help.
How can I penalize that prediction more?
You can pass a weight tensor (one weight for each class) to the constructor of nn.CrossEntropyLoss to get such a weighting:
Parameters:
weight (Tensor, optional) – a manual rescaling weight given to each class. If given, has to be a Tensor of size C
where C is the number of classes.
But you should also think about alternatives; see the comment of @Sean above or e.g. this question.
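A minimal sketch of that weighting (the weight values are illustrative, not tuned). Note also that nn.CrossEntropyLoss expects raw logits and integer class indices, so the softmax output and float labels from the question are replaced here:
import torch
import torch.nn as nn

# Illustrative per-class weights: up-weight class 1 relative to class 0
class_weights = torch.tensor([1.0, 3.0])
loss_fun = nn.CrossEntropyLoss(weight=class_weights)

out_logits = torch.randn(4, 2, requires_grad=True)   # stand-in for model(my_input).logits
labels_tensor = torch.tensor([0, 0, 0, 1])            # class indices as long integers

loss = loss_fun(out_logits, labels_tensor)            # pass logits, not softmax output
loss.backward()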

binary classification: why use +1/0 as label, what's the difference between +1/-1 or even +100/-100

In a binary classification problem, we usually use +1 for the positive label and 0 for the negative label. Why is that? Especially, why use 0 rather than -1 for the negative label?
What's the difference if we use -1 for the negative label, or, even more generally, can we use +100 for the positive label and -100 for the negative label?
As the name suggests, labeling is used for differentiating the classes. You can use 0/1, +1/-1, cat/dog, etc. (any name that fits your problem).
For example:
If you want to distinguish between cat and dog images, then use cat and dog labels.
If you want to detect spam, then labels will be spam/genuine.
However, because ML algorithms work with numbers, the labels are transformed to a numeric format before training.
Using labels of 0 and 1 arises naturally from some of the historically first methods used for binary classification. E.g. logistic regression directly models the probability of an event happening, the event in this case being that an object belongs to the positive class. When we use training data with labels 0 and 1, it basically means that objects with label 0 have probability 0 of belonging to the given class, and objects with label 1 have probability 1 of belonging to it. E.g. for spam classification, emails that are not spam would have label 0, which means they have probability 0 of being spam, and emails that are spam would have label 1, because their probability of being spam is 1.
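Concretely, the 0/1 coding is what lets the logistic-regression log-likelihood be written in the usual compact form (a standard identity, stated here for illustration):
\ell(\theta) = \sum_{i} \left[ y_i \log p_\theta(x_i) + (1 - y_i) \log\bigl(1 - p_\theta(x_i)\bigr) \right], \qquad y_i \in \{0, 1\}
With y_i = 1 only the first term survives and with y_i = 0 only the second, which is exactly the "label as probability" reading above; a -1/+1 coding would not collapse the likelihood this way.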
So using labels of 0 and 1 makes perfect sense mathematically. When a binary classification model outputs e.g. 0.4 for some input, we can usually interpret this as the probability of belonging to class 1 (although strictly speaking it's not always the case, as pointed out for example here).
There are classification methods that don't make use of convenient properties of labels 0 and 1, such as support vector machines or linear discriminant analysis, but in their case no other labels would provide more convenience than 0 and 1, so using 0 and 1 is still okay.
Even the encoding of classes for multiclass classification makes use of probabilities of belonging to a given class. For example, in classification with three classes, objects from the first class would be encoded as [1 0 0], from the second class as [0 1 0], and from the third class as [0 0 1], which again can be interpreted with probabilities. (This is called one-hot encoding.) The output of a multiclass classification model is often a vector of the form [0.1 0.6 0.3], which can be conveniently interpreted as a vector of class probabilities for the given object.
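A minimal sketch of that one-hot encoding and its probabilistic reading in NumPy (the class indices and values are made up for illustration):
import numpy as np

labels = np.array([0, 2, 1, 0])       # class indices for 4 objects, 3 classes
one_hot = np.eye(3)[labels]            # each row is the one-hot vector for one object
# one_hot == [[1, 0, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0]]

probs = np.array([0.1, 0.6, 0.3])      # a model's output for one object
predicted_class = probs.argmax()       # -> 1, the class with the highest probability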

Is there a way to implement a Neural Network able to work with vector target?

I'm trying to implement a neural network model using Keras, where the output is a vector of five elements.
Basically the target contains elements from 0 to 4 and nan, so I can have some targets like
[0,3,2,1,4] and others like [nan, 0, nan, 1, 2]. The important thing is that the elements in the vector are not repeated; only nan can be.
One solution I tried was to use something like a one-hot encoder for the target: I transformed each target into a 25-component vector, all zeros with a 1 at the position corresponding to the number to map (i.e. [nan, 0, nan, 1, 2] -> [(0,0,0,0,0),(1,0,0,0,0),(0,0,0,0,0),(0,1,0,0,0),(0,0,1,0,0)] - I'm using the round brackets only to highlight groups of five elements).
Any ideas please?
As far as I have understood, what you're trying to predict is a list of 5 elements, each of which takes a discrete value from the set {nan, 0, 1, 2, 3, 4}.
What you'll need to do is train 5 neural networks (one for each position of the list), each of which predicts a value from that set; thus, you need to one-hot-encode the outputs, apply a softmax, and select the highest probability for each neural network.
When trying to predict the output list for a new sample, you predict every position, put them in the list, and voilà!
import numpy as np

def predict_sample(sample):
    # one prediction per position, each from its own network
    pos_0 = nn0.predict(sample)[0]
    pos_1 = nn1.predict(sample)[0]
    pos_2 = nn2.predict(sample)[0]
    pos_3 = nn3.predict(sample)[0]
    pos_4 = nn4.predict(sample)[0]
    # use a float array so that nan can be stored and boolean indexing works
    outp = np.array([pos_0, pos_1, pos_2, pos_3, pos_4], dtype=float)
    # if nan is encoded as class 5 then:
    outp[outp == 5] = np.nan
    return outp
You cannot assume the predicted values will be unique at prediction time; only the data influences that. What you can do, for example, is take the second-highest probability for a position when its value has already been predicted at another position of the list.
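For completeness, a rough sketch of one of the five per-position classifiers in Keras (the layer sizes, input dimension, and the choice of encoding nan as class 5 are all illustrative assumptions, not requirements):
from tensorflow import keras

NUM_CLASSES = 6  # values 0..4 plus one extra class (5) standing in for nan

def build_position_model(input_dim):
    # a small classifier for a single list position; the architecture is just an example
    model = keras.Sequential([
        keras.layers.Input(shape=(input_dim,)),
        keras.layers.Dense(32, activation="relu"),
        keras.layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    return model

# nn0 = build_position_model(input_dim=10)  # one such model per position, trained separately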

Understanding dense layer in LSTM architecture (labels & logits)

I'm working through this notebook -- https://github.com/aamini/introtodeeplearning/blob/master/lab1/solutions/Part2_Music_Generation_Solution.ipynb -- where we are using an embedding layer, LSTM, and final dense layer w/ softmax to generate music.
I'm a little confused, however, on how we're calculating loss; it is my understanding that in this notebook (in compute_loss()), in any given batch, we are comparing expected labels (which are the notes themselves) to the logits (i.e. predictions from the dense layer). However, aren't these predictions supposed to be a probability distribution? When are we actually selecting the label that we are predicting against?
A little more clarification on my question: if the shape of our labels is (batch_size, # of time steps), and the shape of our logits is (batch_size, # of time steps, vocab_size), at what point in the compute_loss() function are we actually selecting a label for each time step?
The short answer is that the Keras loss function sparse_categorical_crossentropy() does everything you need.
At each timestep of the LSTM model, the top dense layer and the softmax inside that loss function together generate a probability distribution over the model's vocabulary, which in this case is the set of musical notes. Suppose the vocabulary comprises the notes A, B, C, D. Then one possible probability distribution generated is [0.01, 0.70, 0.28, 0.01], meaning that the model is putting a lot of probability on note B (index 1), like so:
Label:   A     B     C     D
Index:   0     1     2     3
Prob:  0.01  0.70  0.28  0.01
Suppose the true note should be C, which is represented by the number 2, since it is at index 2 in the distribution array (with indexing starting at 0). To measure the difference between the predicted distribution and the true value, use the sparse_categorical_crossentropy() function to produce a floating-point number representing the loss.
More information can be found on this TensorFlow documentation page. On that page, they have the example:
y_true = [1, 2]
y_pred = [[0.05, 0.95, 0], [0.1, 0.8, 0.1]]
loss = tf.keras.losses.sparse_categorical_crossentropy(y_true, y_pred)
You can see in that example there is a batch of two instances. For the first instance, the true label is 1 and the predicted distribution is [0.05, 0.95, 0], and for the second instance, the true label is 2 while the predicted distribution is [0.1, 0.8, 0.1].
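As a quick sanity check of what the function computes (the printed values are just the negative log of the probability assigned to the true class in each instance):
print(loss.numpy())   # ~[0.0513, 2.3026] = [-log(0.95), -log(0.1)]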
This function is used in your Jupyter Notebook in section 2.5:
To train our model on this classification task, we can use a form of the crossentropy loss (negative log likelihood loss). Specifically, we will use the sparse_categorical_crossentropy loss, as it utilizes integer targets for categorical classification tasks. We will want to compute the loss using the true targets -- the labels -- and the predicted targets -- the logits.
So to answer your questions directly:
it is my understanding that in this notebook (in compute_loss()), in any given batch, we are comparing expected labels (which are the notes themselves) to the logits (i.e. predictions from the dense layer).
Yes, your understanding is correct.
However, aren't these predictions supposed to be a probability distribution?
Yes, they are.
When are we actually selecting the label that we are predicting against?
It is done inside the sparse_categorical_crossentropy() function. If your distribution is [0.05, 0.95, 0], then that implicitly means that the function is predicting 0.05 probability for index 0, 0.95 probability for index 1, and 0.0 probability for index 2.
A little more clarification on my question: if the shape of our labels is (batch_size, # of time steps), and the shape of our logits is (batch_size, # of time steps, vocab_size), at what point in the compute_loss() function are we actually selecting a label for each time step?
It's inside that function.
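For concreteness, a minimal sketch of what compute_loss() amounts to, assuming (as in that notebook) that the dense layer outputs raw logits, hence from_logits=True:
import tensorflow as tf

def compute_loss(labels, logits):
    # labels: (batch_size, num_timesteps)  integer note indices
    # logits: (batch_size, num_timesteps, vocab_size)  raw scores from the dense layer
    # For each timestep the loss looks up the predicted probability of the true label
    # and returns its negative log; no explicit argmax or label selection is needed.
    return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)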

Deep Learning Log Likelihood

I am new to the Deep Learning field, and I am using the log-likelihood method to compare against the MSE metric. Could anyone show how to calculate the following 2 predicted output examples, with 3 output neurons each? Thanks
yt = [ [1,0,0],[0,0,1]]
yp = [ [0.9, 0.2,0.2], [0.2,0.8,0.3] ]
MSE or Mean Squared Error is simply the expected value of the squared difference between the predicted and the ground truth labels, represented as
\text{MSE}(\hat{\theta}) = E\left[(\hat{\theta} - \theta)^2\right]
where \theta is the ground truth label and \hat{\theta} is the predicted label.
I am not sure what you are referring to exactly: a theoretical question or a part of code.
As a Python implementation
import numpy as np

def mean_squared_error(A, B):
    return np.square(np.subtract(A, B)).mean()

yt = [[1, 0, 0], [0, 0, 1]]
yp = [[0.9, 0.2, 0.2], [0.2, 0.8, 0.3]]
mse = mean_squared_error(yt, yp)
print(mse)
This will give a value of 0.21
If you are using one of the DL frameworks like TensorFlow, they already provide a function that calculates the MSE loss between tensors:
tf.losses.mean_squared_error
where
tf.losses.mean_squared_error(
    labels,
    predictions,
    weights=1.0,
    scope=None,
    loss_collection=tf.GraphKeys.LOSSES,
    reduction=Reduction.SUM_BY_NONZERO_WEIGHTS
)
Args:
labels: The ground truth output tensor, same dimensions as 'predictions'.
predictions: The predicted outputs.
weights: Optional Tensor whose rank is either 0, or the same rank as labels, and must be broadcastable to labels (i.e., all dimensions
must be either 1, or the same as the corresponding losses dimension).
scope: The scope for the operations performed in computing the loss.
loss_collection: collection to which the loss will be added.
reduction: Type of reduction to apply to loss.
Returns:
Weighted loss float Tensor. If reduction is NONE, this has the same
shape as labels; otherwise, it is scalar.
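Using the same yt/yp as above, a minimal usage sketch (shown with the TF 2.x Keras equivalent of the function quoted above; the value matches the NumPy computation):
import tensorflow as tf

yt = tf.constant([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
yp = tf.constant([[0.9, 0.2, 0.2], [0.2, 0.8, 0.3]])

mse = tf.keras.losses.MeanSquaredError()   # TF 2.x counterpart of tf.losses.mean_squared_error
print(mse(yt, yp).numpy())                 # ~0.21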
