Related
Background:
Thanks for your attention! I am learning the basics of 2D convolution, linear algebra and PyTorch, and I have run into an implementation problem concerning the pseudo-inverse of the convolution operator. Specifically, I have no idea how to implement it in an efficient way. Please see the problem statement below for details. Any help/tip/suggestion is welcome.
(Thanks a lot for your attention!)
The Original Problem:
I have an image feature x with shape [b, c, h, w] and a 3x3 convolution kernel K with shape [c, c, 3, 3], so that y = K * x. How can I apply the corresponding pseudo-inverse to y in an efficient way?
In other words, given y = K * x = Ax, how do I implement x_hat = (A^+) y?
I guess there should be some way to do this with torch.fft, but I still have no idea how to implement it, and I do not know whether an implementation already exists.
import torch
import torch.nn.functional as F
c = 32
K = torch.randn(c, c, 3, 3)
x = torch.randn(1, c, 128, 128)
y = F.conv2d(x, K, padding=1)
print(y.shape)
# How to implement pseudo-inverse for y = K * x in an efficient way?
Some of My Efforts:
I know that a 2D convolution is a linear operator, so it is equivalent to a matrix product. We could actually write out the matrix form of the convolution and compute its pseudo-inverse, but I expect this to be inefficient, and I have no idea how to implement it efficiently.
According to Wikipedia, the pseudo-inverse should satisfy A(A^+ y) = y for any y in the range of A, where A is the convolution operator and A^+ is its pseudo-inverse.
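For example, a brute-force sketch of this matrix form (only feasible at tiny sizes; apply_conv is just a helper name) might look like this:

import torch
import torch.nn.functional as F

# Materialize the convolution as an explicit matrix A by pushing unit vectors
# through conv2d, then invert with torch.linalg.pinv (O(n^3): tiny sizes only).
c, h, w = 2, 8, 8
K = torch.randn(c, c, 3, 3)
n = c * h * w

def apply_conv(v):  # v: flat vector of length c*h*w
    return F.conv2d(v.view(1, c, h, w), K, padding=1).reshape(-1)

A = torch.stack([apply_conv(torch.eye(n)[i]) for i in range(n)], dim=1)  # column i = A @ e_i
A_pinv = torch.linalg.pinv(A)

x = torch.randn(1, c, h, w)
y = F.conv2d(x, K, padding=1)
x_hat = (A_pinv @ y.reshape(-1)).view(1, c, h, w)
print(torch.mean((x - x_hat) ** 2) / torch.mean(x ** 2))  # reconstruction error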
(Thanks again for reading such a long post!)
This takes the problem to another level.
The convolution itself is a linear operation, so you can determine the matrix of the operation and solve a least-squares problem directly [1], or compute the pseudo-inverse as you mentioned and then apply it to different outputs, predicting a projection of the input.
I am changing your code to use padding=0:
import torch
import torch.nn.functional as F
# your code
c = 32
K = torch.randn(c, c, 3, 3)
x = torch.randn(4, c, 128, 128)
y = F.conv2d(x, K, bias=torch.zeros((c,)))
Also, as you already suspected, the convolution can be computed as ifft(fft(h)*fft(x)). However, the conv2d function actually computes a cross-correlation, so you have to conjugate the filter, giving ifft(conj(fft(h))*fft(x)). You also have to apply this along both spatial axes and make sure both FFTs are computed with the same size; since the data is real, we can use the multi-dimensional real FFT. To be complete, conv2d works on multiple channels, so we have to compute sums of convolutions. Since the FFT is linear, we can simply compute those sums in the frequency domain using einsum.
s = x.shape[-2:]  # (128, 128): FFT over the full input size
K_f = torch.fft.rfftn(K, s)
x_f = torch.fft.rfftn(x, s)
y_f = torch.einsum('jkxy,ikxy->ijxy', K_f.conj(), x_f)
y_hat = torch.fft.irfftn(y_f, s)
Except for the borders it should be accurate (remember that the FFT computes a cyclic convolution).
torch.max(abs(y_hat[:,:,:-2,:-2] - y[:,:,:,:]))
Now, notice the pattern jk,ik->ij in the einsum: it means y_f[i,j] = sum_k K_f[j,k] * x_f[i,k], i.e. y_f = x_f @ K_f.T if @ denotes the matrix product over the first two dimensions. So to invert this operation we can interpret the first two dimensions as matrices. The function torch.linalg.pinv computes pseudo-inverses over the last two axes, so in order to use it we have to permute the axes. If we right-multiply the output by the pseudo-inverse of the transposed K_f, we invert the operation.
s = 128,128
K_f = torch.fft.rfftn(K, s)
K_f_inv = torch.linalg.pinv(K_f.T).T
y_f = torch.fft.rfftn(y_hat, s)
x_f = torch.einsum('jkxy,ikxy->ijxy', K_f_inv.conj(), y_f)
x_hat = torch.fft.irfftn(x_f, s)
print(torch.mean((x - x_hat)**2) / torch.mean((x)**2))
Notice that I am using the full convolution, but conv2d actually crops the image. Let's apply that cropping:
k = K.shape[-1]  # kernel size (3)
y_hat[:,:,128-(k-1):,:] = 0
y_hat[:,:,:,128-(k-1):] = 0
If you repeat the calculation you will see that the recovered input is no longer accurate, so you have to be careful about what you do with your convolution; but in situations where you can get this to work, it will in fact be efficient.
s = 128,128
K_f = torch.fft.rfftn(K, s)
K_f_inv = torch.linalg.pinv(K_f.T).T
y_f = torch.fft.rfftn(y_hat, s)
x_f = torch.einsum('jkxy,ikxy->ijxy', K_f_inv.conj(), y_f)
x_hat = torch.fft.irfftn(x_f, s)
print(torch.mean((x - x_hat)**2) / torch.mean((x)**2))
I have a pretrained model which was saved by
torch.save(net, 'lenet5_mnist_model')
And now I am loading it back and trying to calculate the Fisher information matrix like this:
import torch
import torch.nn.functional as F
from copy import deepcopy
from torch.autograd import Variable

precision_matrices = {}
batch_size = 32
my_model = torch.load('lenet5_mnist_model')
my_model.eval()  # I tried to comment this off, but still no luck

for n, p in deepcopy({n: p for n, p in my_model.named_parameters()}).items():
    p = torch.tensor(p, requires_grad=True)
    p.data.zero_()
    precision_matrices[n] = Variable(p.data)

# `images` is the dataset tensor, defined elsewhere
for idx in range(int(images.shape[0] / batch_size)):
    x = images[idx*batch_size : (idx+1)*batch_size]
    my_model.zero_grad()
    x = Variable(x.cuda(), requires_grad=True)
    output = my_model(x).view(1, -1)
    label = output.max(1)[1].view(-1)
    loss = F.nll_loss(F.log_softmax(output, dim=1), label)
    loss = Variable(loss, requires_grad=True)
    loss.backward()

    for n, p in my_model.named_parameters():
        precision_matrices[n].data += p.grad.data**2
Finally, the above code crashes at the last line because p.grad is None. The error is:
AttributeError: 'NoneType' object has no attribute 'data'.
Could someone provide some guidance on what caused the NoneType grad for the parameters? How should I fix this?
Your loss does not backpropagate gradients through the model, because you are creating a new loss tensor that merely holds the value of the actual loss. That new tensor is a leaf of the computational graph, so there is no history to backpropagate through.
loss.backward() needs to be called on the loss tensor that F.nll_loss(F.log_softmax(output, dim=1), label) returns, not on a new tensor created from its value.
I'm assuming that you thought you need to create a tensor with requires_grad=True, to be able to calculate the gradients. That is not the case. Tensors created with requires_grad=True are the leaves of the computational graph (they start the graph) and every operation performed on any tensor that is part of the graph is tracked such that the gradients can flow through the intermediate results to the leaves. Only tensors that need to be optimised (i.e. learnable parameters) should set requires_grad=True manually (the model's parameters do that automatically), everything else regarding the gradients is inferred. Neither x nor the loss are learnable parameters.
This confusion presumably arose due to the use of Variable. It was deprecated in PyTorch 0.4.0, which was released over 2 years ago, and all of its functionality has been merged into the tensors. Please do not use Variable.
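As a minimal sketch of the difference (hypothetical toy tensors, not your model):

import torch

w = torch.randn(3, requires_grad=True)  # leaf / learnable parameter
x = torch.randn(3)                      # plain data, no requires_grad needed
loss = (w * x).sum()
print(loss.grad_fn)                     # <SumBackward0 ...> -> history exists, backward reaches w

detached = torch.tensor(loss.item(), requires_grad=True)  # new leaf holding only the value
print(detached.grad_fn)                 # None -> nothing to backpropagate through
detached.backward()
print(w.grad)                           # None: the gradient never reached w

With that in mind, the relevant part of your loop simply becomes: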
x = images[idx*batch_size : (idx+1)*batch_size]
my_model.zero_grad()
x = x.cuda()
output = my_model(x).view(1,-1)
label = output.max(1)[1].view(-1)
loss = F.nll_loss(F.log_softmax(output, dim=1), label)
loss.backward()
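For completeness, here is a minimal sketch of the whole accumulation without Variable anywhere; it assumes my_model and images exactly as in your snippet and is just one way to write it:

import torch
import torch.nn.functional as F

# Squared-gradient (empirical Fisher) accumulation per parameter.
precision_matrices = {n: torch.zeros_like(p) for n, p in my_model.named_parameters()}
batch_size = 32

my_model.eval()
for idx in range(images.shape[0] // batch_size):
    x = images[idx*batch_size : (idx+1)*batch_size].cuda()
    my_model.zero_grad()
    output = my_model(x).view(1, -1)
    label = output.max(1)[1].view(-1)
    loss = F.nll_loss(F.log_softmax(output, dim=1), label)
    loss.backward()
    for n, p in my_model.named_parameters():
        precision_matrices[n] += p.grad.detach() ** 2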
I am trying to create a list based on my neural network outputs and use it in Tensorflow as a loss function.
Assume that results is a list of shape [1, batch_size] that is output by a neural network. I check whether each value of this list is in a specific range, passed in as a placeholder called valid_range; if it is, I append 1 to a list, and if it is not, I append -1. The goal is to make all predictions of the network fall in the range, so the correct prediction is a tensor of all 1s, which I call correct_predictions.
values_list = []
for j in range(batch_size):
    a = results[0, j] >= valid_range[0]
    b = results[0, j] <= valid_range[1]
    c = tf.logical_and(a, b)
    if (c == 1):
        values_list.append(1.)
    else:
        values_list.append(-1.)
values_list_tensor = tf.convert_to_tensor(values_list)

correct_predictions = tf.ones([batch_size, ], tf.float32)
Now, I want to use this as a loss function in my network, so that I can force all the predictions to be in the specified range. I try to train like this:
loss = tf.reduce_mean(tf.squared_difference(values_list_tensor, correct_predictions))
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
gradients, variables = zip(*optimizer.compute_gradients(loss))
gradients, _ = tf.clip_by_global_norm(gradients, gradient_clip_threshold)
optimize = optimizer.apply_gradients(zip(gradients, variables))
This, however, has a problem and throws an error on the last optimize line, saying:
ValueError: No gradients provided for any variable: ['<tensorflow.python.training.optimizer._RefVariableProcessor object at 0x7f0245d4afd0>',
'<tensorflow.python.training.optimizer._RefVariableProcessor object at 0x7f0245d66050>'
...
I tried to debug this in TensorBoard, and I noticed that the list I am creating does not appear in the graph, so basically the values_list_tensor part of the loss function is not part of the network itself. Is there some way to create a list based on the predictions of a neural network and use it in a TensorFlow loss function to train the network?
Please help, I have been stuck on this for a few days now.
Edit:
Following what was suggested in the comments, I decided to use an L2 loss multiplied by the binary vector I had before, values_list_tensor. The binary vector now has values 1 and 0 instead of 1 and -1, so that when a prediction is in the range the loss is 0, and otherwise it is the normal L2 loss. As I am unable to see the values of the tensors, I am not sure whether this is correct; however, I can see the final loss and it is always 0, so something is wrong here. I am not sure whether the multiplication is being done correctly and whether values_list_tensor is computed accurately. Can someone tell me what could be wrong?
loss = tf.reduce_mean(tf.nn.l2_loss(tf.matmul(tf.transpose(tf.expand_dims(values_list_tensor, 1)), tf.expand_dims(result[0, :], 1))))
Thanks
To answer the question in the comment: one way to write a piece-wise function is to use tf.cond. For example, here is a function that returns 0 on [-1, 1] and x everywhere else:
sess = tf.InteractiveSession()
x = tf.placeholder(tf.float32)
y = tf.cond(tf.logical_or(tf.greater(x, 1.0), tf.less(x, -1.0)), lambda : x, lambda : 0.0)
y.eval({x: 1.5}) # prints 1.5
y.eval({x: 0.5}) # prints 0.0
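Not from the original answer, just a sketch: tf.cond picks a single branch from a scalar predicate, so for a whole batch of predictions the element-wise counterpart is tf.where, which applies the piece-wise rule to every entry at once (reusing the InteractiveSession created above):

x = tf.placeholder(tf.float32, shape=[None])
y = tf.where(tf.logical_or(tf.greater(x, 1.0), tf.less(x, -1.0)),
             x,                  # outside [-1, 1]: keep x
             tf.zeros_like(x))   # inside  [-1, 1]: return 0
print(y.eval({x: [1.5, 0.5, -2.0]}))  # [ 1.5  0.  -2. ]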
Could someone give a clear explanation of backpropagation for LSTM RNNs?
This is the type of structure I am working with. My question is not about what backpropagation is; I understand it is a reverse-order method of calculating the error of the hypothesis against the output, used for adjusting the weights of neural networks. My question is how LSTM backpropagation is different from that in regular neural networks.
I am unsure how to find the initial error of each gate. Do you use the first error (calculated as hypothesis minus output) for each gate? Or do you adjust the error for each gate through some calculation? I am unsure how the cell state plays a role in the backprop of LSTMs, if it does at all. I have looked thoroughly for a good source on LSTMs but have yet to find one.
That's a good question. You should certainly take a look at the suggested posts for details, but a complete example here would be helpful too.
RNN Backpropagation
I think it makes sense to talk about an ordinary RNN first (because the LSTM diagram is particularly confusing) and understand its backpropagation.
When it comes to backpropagation, the key idea is network unrolling, which is a way to transform the recursion in an RNN into a feed-forward sequence (like in the picture above). Note that the abstract RNN is unbounded (it can be unrolled arbitrarily far), but each particular implementation is limited because memory is limited. As a result, the unrolled network really is a long feed-forward network, with a few complications, e.g. the weights in different layers are shared.
Let's take a look at a classic example, char-rnn by Andrej Karpathy. Here each RNN cell produces two outputs, h[t] (the state which is fed into the next cell) and y[t] (the output at this step), by the following formulas, where Wxh, Whh and Why are the shared parameters:

h[t] = tanh(Wxh · x[t] + Whh · h[t-1] + bh)
y[t] = Why · h[t] + by
In the code, it's simply three matrices and two bias vectors:
# model parameters
Wxh = np.random.randn(hidden_size, vocab_size)*0.01 # input to hidden
Whh = np.random.randn(hidden_size, hidden_size)*0.01 # hidden to hidden
Why = np.random.randn(vocab_size, hidden_size)*0.01 # hidden to output
bh = np.zeros((hidden_size, 1)) # hidden bias
by = np.zeros((vocab_size, 1)) # output bias
The forward pass is pretty straightforward; this example uses softmax and a cross-entropy loss. Note that each iteration uses the same W* matrices and h* arrays, but the output and hidden state are different:
# forward pass
for t in xrange(len(inputs)):
    xs[t] = np.zeros((vocab_size,1)) # encode in 1-of-k representation
    xs[t][inputs[t]] = 1
    hs[t] = np.tanh(np.dot(Wxh, xs[t]) + np.dot(Whh, hs[t-1]) + bh) # hidden state
    ys[t] = np.dot(Why, hs[t]) + by # unnormalized log probabilities for next chars
    ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t])) # probabilities for next chars
    loss += -np.log(ps[t][targets[t],0]) # softmax (cross-entropy loss)
Now, the backward pass is performed exactly as if it were a feed-forward network, but the gradients of the W* and h* arrays accumulate the gradients from all cells:
for t in reversed(xrange(len(inputs))):
    dy = np.copy(ps[t])
    dy[targets[t]] -= 1
    dWhy += np.dot(dy, hs[t].T)
    dby += dy
    dh = np.dot(Why.T, dy) + dhnext # backprop into h
    dhraw = (1 - hs[t] * hs[t]) * dh # backprop through tanh nonlinearity
    dbh += dhraw
    dWxh += np.dot(dhraw, xs[t].T)
    dWhh += np.dot(dhraw, hs[t-1].T)
    dhnext = np.dot(Whh.T, dhraw)
Both passes above are done in chunks of size len(inputs), which corresponds to the size of the unrolled RNN. You might want to make it bigger to capture longer dependencies in the input, but you pay for it by storing all outputs and gradients for each cell.
What's different in LSTMs
The LSTM picture and formulas look intimidating, but once you have coded a plain vanilla RNN, the implementation of an LSTM is pretty much the same. For example, here is the backward pass:
# Loop over all cells, like before
d_h_next_t = np.zeros((N, H))
d_c_next_t = np.zeros((N, H))
for t in reversed(xrange(T)):
    d_x_t, d_h_prev_t, d_c_prev_t, d_Wx_t, d_Wh_t, d_b_t = lstm_step_backward(d_h_next_t + d_h[:,t,:], d_c_next_t, cache[t])
    d_c_next_t = d_c_prev_t
    d_h_next_t = d_h_prev_t

    d_x[:,t,:] = d_x_t
    d_h0 = d_h_prev_t
    d_Wx += d_Wx_t
    d_Wh += d_Wh_t
    d_b += d_b_t
# The step in each cell
# Captures all LSTM complexity in a few formulas.
def lstm_step_backward(d_next_h, d_next_c, cache):
    """
    Backward pass for a single timestep of an LSTM.

    Inputs:
    - d_next_h: Gradients of next hidden state, of shape (N, H)
    - d_next_c: Gradients of next cell state, of shape (N, H)
    - cache: Values from the forward pass

    Returns a tuple of:
    - d_x: Gradient of input data, of shape (N, D)
    - d_prev_h: Gradient of previous hidden state, of shape (N, H)
    - d_prev_c: Gradient of previous cell state, of shape (N, H)
    - d_Wx: Gradient of input-to-hidden weights, of shape (D, 4H)
    - d_Wh: Gradient of hidden-to-hidden weights, of shape (H, 4H)
    - d_b: Gradient of biases, of shape (4H,)
    """
    x, prev_h, prev_c, Wx, Wh, a, i, f, o, g, next_c, z, next_h = cache

    d_z = o * d_next_h
    d_o = z * d_next_h
    d_next_c += (1 - z * z) * d_z

    d_f = d_next_c * prev_c
    d_prev_c = d_next_c * f
    d_i = d_next_c * g
    d_g = d_next_c * i

    d_a_g = (1 - g * g) * d_g
    d_a_o = o * (1 - o) * d_o
    d_a_f = f * (1 - f) * d_f
    d_a_i = i * (1 - i) * d_i
    d_a = np.concatenate((d_a_i, d_a_f, d_a_o, d_a_g), axis=1)

    d_prev_h = d_a.dot(Wh.T)
    d_Wh = prev_h.T.dot(d_a)

    d_x = d_a.dot(Wx.T)
    d_Wx = x.T.dot(d_a)

    d_b = np.sum(d_a, axis=0)

    return d_x, d_prev_h, d_prev_c, d_Wx, d_Wh, d_b
Summary
Now, back to your questions.
My question is how LSTM backpropagation is different from that in regular neural networks
There are shared weights across layers, and a few additional variables (states) that you need to pay attention to. Other than this, there is no difference at all.
Do you use the first error (calculated by hypothesis minus output) for each gate? Or do you adjust the error for each gate through some calculation?
First of all, the loss function is not necessarily L2. In the example above it is a cross-entropy loss, so the initial error signal is its gradient:
# remember that ps is the probability distribution from the forward pass
dy = np.copy(ps[t])
dy[targets[t]] -= 1
Note that it is the same error signal as in an ordinary feed-forward neural network. If you use the L2 loss, the signal is indeed the actual output minus the ground truth.
In the case of an LSTM, it's slightly more complicated: d_next_h = d_h_next_t + d_h[:,t,:], where d_h is the upstream gradient from the loss function, which means that the error signal of each cell gets accumulated. But once again, if you unroll the LSTM, you will see a direct correspondence with the network wiring.
I think your questions cannot be answered in a short response. Nico's simple LSTM has a link to a great paper by Lipton et al.; please read it. Also, his simple Python code sample helps to answer most of your questions.
If you understand Nico's last sentence
ds = self.state.o * top_diff_h + top_diff_s
in detail, please give me feedback. At the moment I have a final problem with his "Putting all this s and h derivations together".
I need to perform a custom spatial convolution in Torch. Rather than simply multiplying each input pixel by a weight for that pixel and adding the results together with the filter's bias to form each output pixel, I need to apply a more complex mathematical function to the input pixels before adding them together.
I know how to do this, but I do not know a GOOD way to do this. The best way I've come up with is to take the full input tensor, create a bunch of secondary tensors that are "views" of the original without allocating additional memory, put those into a Replicate layer (with the output filter count as the replication count), and feed that into a ParallelTable layer containing a bunch of regular layers whose parameters are shared between filters.
The trouble is, even though this is fine memory-wise with a very manageable overhead, we are talking about inputwidth x inputheight x inputdepth x outputdepth mini-networks here. Maybe there's some way to create massive "long and tall" networks that work on the entire replicated input set at once, but how do I create layers that are partially connected (like convolutions) instead of fully connected?
I would have liked to just use inheritance to create a modified copy of the regular SpatialConvolution "class", but I can't even try because it's implemented in an external C library. I can't just use regular layers before a regular SpatialConvolution layer, because I need to do my math with different weights and biases for each filter (shared between applications of the same filter to different input coordinates).
Good question. You made me give it some serious thought.
Your approach has a flaw: it does not let you take advantage of vectorized computations, since each mini-network works independently.
My idea is as follows:
Suppose the network's input and output are 2D tensors. We can produce (efficiently, without copying memory) an auxiliary 4D tensor
rf_input (kernel_size x kernel_size x output_h x output_w)
such that rf_input[:, :, k, l] is a 2D tensor of size kernel_size x kernel_size containing the receptive field from which output[k, l] will be computed. Then we iterate over positions inside the kernel: rf_input[i, j, :, :] gives the pixels at position (i, j) inside all receptive fields, and we compute their contribution to every output[k, l] at once using vectorization.
Example:
Let our "convolving" function be, for example, a product of tangents of sums:
Then its partial derivative w.r.t. the input pixel at position (s,t) in its receptive field is
Derivative w.r.t. weight is the same.
At the end, of course, we must sum up the gradients coming from the different output[k, l] points: each input[m, n] contributes to at most kernel_size^2 outputs (as part of their receptive fields), and each weight[i, j] contributes to all output_h x output_w outputs.
A simple implementation may look like this:
require 'nn'

local CustomConv, parent = torch.class('nn.CustomConv', 'nn.Module')

-- This module takes and produces a 2D map.
-- To work with multiple input/output feature maps and batches,
-- you have to iterate over them or further vectorize computations inside the loops.

function CustomConv:__init(ker_size)
    parent.__init(self)
    self.ker_size = ker_size
    self.weight = torch.rand(self.ker_size, self.ker_size):add(-0.5)
    self.gradWeight = torch.Tensor(self.weight:size()):zero()
end

function CustomConv:_get_recfield_input(input)
    local rf_input = {}
    for i = 1, self.ker_size do
        rf_input[i] = {}
        for j = 1, self.ker_size do
            rf_input[i][j] = input[{{i, i - self.ker_size - 1}, {j, j - self.ker_size - 1}}]
        end
    end
    return rf_input
end

function CustomConv:updateOutput(_)
    local output = torch.Tensor(self.rf_input[1][1]:size())
    -- Kernel-specific: our kernel is multiplicative, so we start with ones
    output:fill(1)
    --
    for i = 1, self.ker_size do
        for j = 1, self.ker_size do
            local ker_pt = self.rf_input[i][j]:clone()
            local w = self.weight[i][j]
            -- Kernel-specific
            output:cmul(ker_pt:add(w):tan())
            --
        end
    end
    return output
end

function CustomConv:updateGradInput_and_accGradParameters(_, gradOutput)
    local gradInput = torch.Tensor(self.input:size()):zero()
    for i = 1, self.ker_size do
        for j = 1, self.ker_size do
            local ker_pt = self.rf_input[i][j]:clone()
            local w = self.weight[i][j]
            -- Kernel-specific
            -- Compute x + w once, so ker_pt is not mutated in place while the
            -- expression below is being evaluated.
            local shifted = torch.add(ker_pt, w)
            local subGradInput = torch.cmul(gradOutput, torch.cdiv(self.output, torch.cmul(torch.tan(shifted), torch.cos(shifted):pow(2))))
            local subGradWeight = subGradInput
            --
            gradInput[{{i, i - self.ker_size - 1}, {j, j - self.ker_size - 1}}]:add(subGradInput)
            self.gradWeight[{i, j}] = self.gradWeight[{i, j}] + torch.sum(subGradWeight)
        end
    end
    return gradInput
end

function CustomConv:forward(input)
    self.input = input
    self.rf_input = self:_get_recfield_input(input)
    self.output = self:updateOutput(_)
    return self.output
end

function CustomConv:backward(input, gradOutput)
    local gradInput = self:updateGradInput_and_accGradParameters(_, gradOutput)
    return gradInput
end
If you change this code a bit:
updateOutput:
output:fill(0)
[...]
output:add(ker_pt:mul(w))
updateGradInput_and_accGradParameters:
local subGradInput = torch.mul(gradOutput, w)
local subGradWeight = torch.cmul(gradOutput, ker_pt)
then it will work exactly as nn.SpatialConvolutionMM with zero bias (I've tested it).