I am trying to understand how to compute the gradient of log-softmax function in matrix form for policy function parameter update in Policy Gradient RL.
I found one Python implementation online from here using the Cartpole environment as an example.
In the example, the author uses softmax policy. This is implemented as per below:
def policy(state, w):
"""Computes softmax policy, i.e. the decision probabilities
for each action in our action space.
:param state: state vector,
:param theta: parameters,
:return: vector of decision probabilities.
"""
z = state.dot(w)
exp = np.exp(z)
return exp/np.sum(exp)
In the article, instead of computing the gradient of log-softmax directly, we obtain it by (1) first computing the Jacobian-matrix of the softmax, (2) then by dividing the a "sliced-Jacobian" with the policy probability and finally (3) by multiplying this ratio from (1) and (2) with the state vector.
The Jacobian of the softmax is computed using the following code:
def softmax_grad(softmax):
"""Computes the Jacobian of the softmax function."""
s = softmax.reshape(-1,1)
return np.diagflat(s) — np.dot(s, s.T)
The derivation of the Jacobian for the softmax function is rather easy. However, it is the next part that is confusing me:
# Computation of the gradient for log-softmax using Jacobian of softmax.
dsoftmax = softmax_grad(probs)[action, :] # Why do we only take one row?
dlog = dsoftmax / probs[0,action]
grad = state.T.dot(dlog[None, :]) # Why do we multiply with the state?
In the above, the grad is the final gradient for log-softmax function. I just don't understand why.
My questions are the following:
What is the mathematical justification for this computation?
How can we derive this computation mathematically?
Why are we slicing the Jacobian?
Why are we multiplying the resulting probabilities with the state?
I'd appreciate a formal and mathematical derivation for this. Any ideas?
Thanks!
Related
I'm building Kmeans in pytorch using gradient descent on centroid locations, instead of expectation-maximisation. Loss is the sum of square distances of each point to its nearest centroid. To identify which centroid is nearest to each point, I use argmin, which is not differentiable everywhere. However, pytorch is still able to backprop and update weights (centroid locations), giving similar performance to sklearn kmeans on the data.
Any ideas how this is working, or how I can figure this out within pytorch? Discussion on pytorch github suggests argmax is not differentiable: https://github.com/pytorch/pytorch/issues/1339.
Example code below (on random pts):
import numpy as np
import torch
num_pts, batch_size, n_dims, num_clusters, lr = 1000, 100, 200, 20, 1e-5
# generate random points
vector = torch.from_numpy(np.random.rand(num_pts, n_dims)).float()
# randomly pick starting centroids
idx = np.random.choice(num_pts, size=num_clusters)
kmean_centroids = vector[idx][:,None,:] # [num_clusters,1,n_dims]
kmean_centroids = torch.tensor(kmean_centroids, requires_grad=True)
for t in range(4001):
# get batch
idx = np.random.choice(num_pts, size=batch_size)
vector_batch = vector[idx]
distances = vector_batch - kmean_centroids # [num_clusters, #pts, #dims]
distances = torch.sum(distances**2, dim=2) # [num_clusters, #pts]
# argmin
membership = torch.min(distances, 0)[1] # [#pts]
# cluster distances
cluster_loss = 0
for i in range(num_clusters):
subset = torch.transpose(distances,0,1)[membership==i]
if len(subset)!=0: # to prevent NaN
cluster_loss += torch.sum(subset[:,i])
cluster_loss.backward()
print(cluster_loss.item())
with torch.no_grad():
kmean_centroids -= lr * kmean_centroids.grad
kmean_centroids.grad.zero_()
As alvas noted in the comments, argmax is not differentiable. However, once you compute it and assign each datapoint to a cluster, the derivative of loss with respect to the location of these clusters is well-defined. This is what your algorithm does.
Why does it work? If you had only one cluster (so that the argmax operation didn't matter), your loss function would be quadratic, with minimum at the mean of the data points. Now with multiple clusters, you can see that your loss function is piecewise (in higher dimensions think volumewise) quadratic - for any set of centroids [C1, C2, C3, ...] each data point is assigned to some centroid CN and the loss is locally quadratic. The extent of this locality is given by all alternative centroids [C1', C2', C3', ...] for which the assignment coming from argmax remains the same; within this region the argmax can be treated as a constant, rather than a function and thus the derivative of loss is well-defined.
Now, in reality, it's unlikely you can treat argmax as constant, but you can still treat the naive "argmax-is-a-constant" derivative as pointing approximately towards a minimum, because the majority of data points are likely to indeed belong to the same cluster between iterations. And once you get close enough to a local minimum such that the points no longer change their assignments, the process can converge to a minimum.
Another, more theoretical way to look at it is that you're doing an approximation of expectation maximization. Normally, you would have the "compute assignments" step, which is mirrored by argmax, and the "minimize" step which boils down to finding the minimizing cluster centers given the current assignments. The minimum is given by d(loss)/d([C1, C2, ...]) == 0, which for a quadratic loss is given analytically by the means of data points within each cluster. In your implementation, you're solving the same equation but with a gradient descent step. In fact, if you used a 2nd order (Newton) update scheme instead of 1st order gradient descent, you would be implicitly reproducing exactly the baseline EM scheme.
Imagine this:
t = torch.tensor([-0.0627, 0.1373, 0.0616, -1.7994, 0.8853,
-0.0656, 1.0034, 0.6974, -0.2919, -0.0456])
torch.argmax(t).item() # outputs 6
We increase t[0] for some, δ close to 0, will this update the argmax? It will not, so we are dealing with 0 gradients, all the time. Just ignore this layer, or assume it is frozen.
The same is for argmin, or any other function where the dependent variable is in discrete steps.
I have 2 images as input, x1 and x2 and try to use convolution as a similarity measure. The idea is that the learned weights substitute more traditional measure of similarity (cross correlation, NN, ...). Defining my forward function as follows:
def forward(self,x1,x2):
out_conv1a = self.conv1(x1)
out_conv2a = self.conv2(out_conv1a)
out_conv3a = self.conv3(out_conv2a)
out_conv1b = self.conv1(x2)
out_conv2b = self.conv2(out_conv1b)
out_conv3b = self.conv3(out_conv2b)
Now for the similarity measure:
out_cat = torch.cat([out_conv3a, out_conv3b],dim=1)
futher_conv = nn.Conv2d(out_cat)
My question is as follows:
1) Would Depthwise/Separable Convolutions as in the google paper yield any advantage over 2d convolution of the concatenated input. For that matter can convolution be a similarity measure, cross correlation and convolution are very similar.
2) It is my understanding that the groups=2 option in conv2d would provide 2 separate inputs to train weights with, in this case each of the previous networks weights. How are these combined afterwards?
For a basic concept see here.
Using a nn.Conv2d layer you assume weights are trainable parameters. However, if you want to filter one feature map with another, you can dive deeper and use torch.nn.functional.conv2d to explicitly define both input and filter yourself:
out = torch.nn.functional.conv2d(out_conv3a, out_conv3b)
Given a classification problem in Machine Learning the hypothesis is described as below.
hθ(x)=g(θ'x)
z = θ'x
g(z) = 1 / (1+e^−z)
In order to get our discrete 0 or 1 classification, we can translate the output of the hypothesis function as follows:
hθ(x)≥0.5→y=1
hθ(x)<0.5→y=0
The way our logistic function g behaves is that when its input is greater than or equal to zero, its output is greater than or equal to 0.5:
g(z)≥0.5
whenz≥0
Remember.
z=0,e0=1⇒g(z)=1/2
z→∞,e−∞→0⇒g(z)=1
z→−∞,e∞→∞⇒g(z)=0
So if our input to g is θTX, then that means:
hθ(x)=g(θTx)≥0.5
whenθTx≥0
From these statements we can now say:
θ'x≥0⇒y=1
θ'x<0⇒y=0
If The decision boundary is the line that separates the area where y = 0 and where y = 1 and is created by our hypothesis function:
What part of this relates to the Decision Boundary? Or where does the Decision Boundary algorithm come from?
This is basic logistic regression with a threshold. So your theta' * x is just the vector notation of your weight vector multiplied by your input. If you put that into the logistic function which outputs a value between 0 and 1 exclusively, you'll threshold that value at 0.5. So if it's equal and above this, you'll treat it as a positive sample and as a negative one otherwise.
The classification algorithm is just that simple. The training is a bit more complicated and the goal of it is the find a weight vector theta which satisfies the condition to correctly classify all your labeled data...or at least as much as possible. The way to do this is to minimize a cost function which measures the difference between the output of your function and the expected label. You can do this using gradient descent. I guess, Andrew Ng is teaching this.
Edit: Your classification algorithm is g(theta'x)>=0.5 and g(theta'x)<0.5, so a basic step function.
Courtesy of other posters on a different tech forum.
Solving for theta'*x >= 0 and theta'*x<0 gives the decision boundary. The RHS of the inequality ( i.e. 0) comes from the sigmoid function.
Theta gives you the hypothesis that best fits the training set.
From theta, you can compute the decision boundary - it is the locus of points where (X * theta) = 0, or equivalently where g(X * theta) = 0.5.
I am trying to produce a mathematical operation selection nn model, which is based on the scalar input. The operation is selected based on the softmax result which is produce by the nn. Then this operation has to be applied to the scalar input in order to produce the final output. So far I’ve come up with applying argmax and onehot on the softmax output in order to produce a mask which then is applied on the concated values matrix from all the possible operations to be performed (as show in the pseudo code below). The issue is that neither argmax nor onehot appears to be differentiable. I am new to this, so any would be highly appreciated. Thanks in advance.
#perform softmax
logits = tf.matmul(current_input, W) + b
softmax = tf.nn.softmax(logits)
#perform all possible operations on the input
op_1_val = tf_op_1(current_input)
op_2_val = tf_op_2(current_input)
op_3_val = tf_op_2(current_input)
values = tf.concat([op_1_val, op_2_val, op_3_val], 1)
#create a mask
argmax = tf.argmax(softmax, 1)
mask = tf.one_hot(argmax, num_of_operations)
#produce the input, by masking out those operation results which have not been selected
output = values * mask
I believe that this is not possible. This is similar to Hard Attention described in this paper. Hard attention is used in Image captioning to allow the model to focus only on a certain part of the image at each step. Hard attention is not differentiable but there are 2 ways to go around this:
1- Use Reinforcement Learning (RL): RL is made to train models that makes decisions. Even though, the loss function won't back-propagate any gradients to the softmax used for the decision, you can use RL techniques to optimize the decision. For a simplified example, you can consider the loss as penalty, and send to the node, with the maximum value in the softmax layer, a policy gradient proportional to the penalty in order to decrease the score of the decision if it was bad (results in a high loss).
2- Use something like soft attention: instead of picking only one operation, mix them with weights based on the softmax. so instead of:
output = values * mask
Use:
output = values * softmax
Now, the operations will converge down to zero based on how much the softmax will not select them. This is easier to train compared to RL but it won't work if you must completely remove the non-selected operations from the final result (set them to zero completely).
This is another answer that talks about Hard and Soft attention that you may find helpful: https://stackoverflow.com/a/35852153/6938290
Assuming after performing median frequency balancing for images used for segmentation, we have these class weights:
class_weights = {0: 0.2595,
1: 0.1826,
2: 4.5640,
3: 0.1417,
4: 0.9051,
5: 0.3826,
6: 9.6446,
7: 1.8418,
8: 0.6823,
9: 6.2478,
10: 7.3614,
11: 0.0}
The idea is to create a weight_mask such that it could be multiplied by the cross entropy output of both classes. To create this weight mask, we can broadcast the values based on the ground_truth labels or the predictions. Some mathematics in my implementation:
Both labels and logits are of shape [batch_size, height, width, num_classes]
The weight mask is of shape [batch_size, height, width, 1]
The weight mask is broadcasted to the num_classes number of channels of the multiplication between the softmax of the logit and the labels to give an output shape of [batch_size, height, width, num_classes]. In this case, num_classes is 12.
Reduce sum for each example in a batch, then perform reduce mean for all examples in one batch to get a single scalar value of loss.
In this case, should we create the weight mask based on the predictions or the ground truth?
If we build it based on the ground_truth, then it means no matter what the predicted pixel labels are, they get penalized based on the actual labels of the class, which doesn't seem to guide the training in a sensible way.
But if we build it based on the predictions, then for whatever logit predictions that are produced, if the predicted label (from taking the argmax of the logit) is dominant, then the logit values for that pixel will all be reduced by a significant amount.
--> Although this means the maximum logit will still be the maximum since all of the logits in the 12 channels will be scaled by the same value, the final softmax probability of the label predicted (which is still the same before and after scaling), will be lower than before scaling (did some simple math to estimate). --> a lower loss is predicted
But the problem is this: If a lower loss is predicted as a result of this weighting, then wouldn't it contradict the idea that predicting dominant labels should give you a greater loss?
The impression I get in total for this method is that:
For the dominant labels, they are penalized and rewarded much lesser.
For the less dominant labels, they are rewarded highly if the predictions are correct, but they're also penalized heavily for a wrong prediction.
So how does this help to tackle the issue of class-balancing? I don't quite get the logic here.
IMPLEMENTATION
Here is my current implementation for calculating the weighted cross entropy loss, although I'm not sure if it is correct.
def weighted_cross_entropy(logits, onehot_labels, class_weights):
if not logits.dtype == tf.float32:
logits = tf.cast(logits, tf.float32)
if not onehot_labels.dtype == tf.float32:
onehot_labels = tf.cast(onehot_labels, tf.float32)
#Obtain the logit label predictions and form a skeleton weight mask with the same shape as it
logit_predictions = tf.argmax(logits, -1)
weight_mask = tf.zeros_like(logit_predictions, dtype=tf.float32)
#Obtain the number of class weights to add to the weight mask
num_classes = logits.get_shape().as_list()[3]
#Form the weight mask mapping for each pixel prediction
for i in xrange(num_classes):
binary_mask = tf.equal(logit_predictions, i) #Get only the positions for class i predicted in the logits prediction
binary_mask = tf.cast(binary_mask, tf.float32) #Convert boolean to ones and zeros
class_mask = tf.multiply(binary_mask, class_weights[i]) #Multiply only the ones in the binary mask with the specific class_weight
weight_mask = tf.add(weight_mask, class_mask) #Add to the weight mask
#Multiply the logits with the scaling based on the weight mask then perform cross entropy
weight_mask = tf.expand_dims(weight_mask, 3) #Expand the fourth dimension to 1 for broadcasting
logits_scaled = tf.multiply(logits, weight_mask)
return tf.losses.softmax_cross_entropy(onehot_labels=onehot_labels, logits=logits_scaled)
Could anyone verify whether my concept of this weighted loss is correct, and whether my implementation is correct? This is my first time getting acquainted with a dataset with imbalanced class, and so I would really appreciate it if anyone could verify this.
TESTING RESULTS: After doing some tests, I found the implementation above results in a greater loss. Is this supposed to be the case? i.e. Would this make the training harder but produce a more accurate model eventually?
SIMILAR THREADS
Note that I have checked a similar thread here: How can I implement a weighted cross entropy loss in tensorflow using sparse_softmax_cross_entropy_with_logits
But it seems that TF only has a sample-wise weighting for loss but not a class-wise one.
Many thanks to all of you.
Here is my own implementation in Keras using the TensorFlow backend:
def class_weighted_pixelwise_crossentropy(target, output):
output = tf.clip_by_value(output, 10e-8, 1.-10e-8)
with open('class_weights.pickle', 'rb') as f:
weight = pickle.load(f)
return -tf.reduce_sum(target * weight * tf.log(output))
where weight is just a standard Python list with the indexes of the weights matched to those of the corresponding class in the one-hot vectors. I store the weights as a pickle file to avoid having to recalculate them. It is an adaptation of the Keras categorical_crossentropy loss function. The first line simply clips the value to make sure we never take the log of 0.
I am unsure why one would calculate the weights using the predictions rather than the ground truth; if you provide further explanation I can update my answer in response.
Edit: Play around with this numpy code to understand how this works. Also review the definition of cross entropy.
import numpy as np
weights = [1,2]
target = np.array([ [[0.0,1.0],[1.0,0.0]],
[[0.0,1.0],[1.0,0.0]]])
output = np.array([ [[0.5,0.5],[0.9,0.1]],
[[0.9,0.1],[0.4,0.6]]])
crossentropy_matrix = -np.sum(target * np.log(output), axis=-1)
crossentropy = -np.sum(target * np.log(output))