I'm implementing a deep neural network in Torch7 with a dataset made of two torch.Tensor() objects.
The first is made of 12 elements (completeTable), the other one is made of 1 element (presentValue).
Each dataset row is an array of these two tensors:
dataset[p] = {torch.Tensor(completeTable[p]), torch.Tensor(presentValue)};
Everything works for the neural network training and testing.
But now I want to switch and use only half of the 12 elements of completeTable, that is, only 6 elements (firstChromRegionProfile).
dataset_firstChromRegion[p] = {torch.Tensor(firstChromRegionProfile), torch.Tensor(presentValue)};
If I run the same neural network architecture with this new dataset, it does not work. It says that the trainer:train(dataset_firstChromRegion) function cannot work because of "size mismatch".
Here's my neural network function:
-- Neural network application
function neuralNetworkApplication(input_number, output_number, datasetTrain, datasetTest, dropOutFlag, hiddenUnits, hiddenLayers)
    require "nn"
    -- act_function = nn.Sigmoid();
    act_function = nn.Tanh();

    print('input_number '.. input_number);
    print('output_number '.. output_number);

    -- NEURAL NETWORK CREATION - <START>
    perceptron = nn.Sequential(); -- make a multi-layer perceptron
    perceptron:add(nn.Linear(input_number, hiddenUnits));
    perceptron:add(act_function);
    if dropOutFlag==TRUE then perceptron:add(nn.Dropout()) end -- DROPOUT

    -- we add w layers DEEP LEARNING
    for w=0, hiddenLayers do
        perceptron:add(nn.Linear(hiddenUnits,hiddenUnits)) -- DEEP LEARNING layer
        perceptron:add(act_function); -- DEEP LEARNING
        if dropOutFlag==TRUE then
            perceptron:add(nn.Dropout()) -- DROPOUT
        end
    end

    print('\n#datasetTrain '.. #datasetTrain);
    print('#datasetTrain[1] '.. #datasetTrain[1]);
    print('(#datasetTrain[1][1])[1] '..(#datasetTrain[1][1])[1]);
    print('\n#datasetTest '.. #datasetTest);
    print('#datasetTest[1] '.. #datasetTest[1]);
    print('(#datasetTest[1][1])[1] '..(#datasetTest[1][1])[1]);

    perceptron:add(nn.Linear(hiddenUnits, output_number));
    perceptron:add(act_function);
    criterion = nn.MSECriterion(); -- MSE: Mean Square Error
    trainer = nn.StochasticGradient(perceptron, criterion)
    trainer.learningRate = LEARNING_RATE_CONST;
    trainer:train(datasetTrain);

    idp=3;
    predValueVector={}
    for i=1,(#datasetTest) do
        pred = perceptron:forward(datasetTest[i][1]); -- get the prediction of the perceptron
        predValueVector[i] = pred[1];
    end
    -- NEURAL NETWORK CREATION - <END>

    return predValueVector;
end
Here's the error log:
input_number 6
output_number 1
#datasetTrain 13416
#datasetTrain[1] 2
(#datasetTrain[1][1])[1] 6
#datasetTest 3354
#datasetTest[1] 2
(#datasetTest[1][1])[1] 6
# StochasticGradient: training
/mnt/work1/software/torch/7/bin/luajit: /mnt/work1/software/torch/7/share/lua/5.1/nn/Linear.lua:71: size mismatch
stack traceback:
[C]: in function 'addmv'
/mnt/work1/software/torch/7/share/lua/5.1/nn/Linear.lua:71: in function 'updateGradInput'
/mnt/work1/software/torch/7/share/lua/5.1/nn/Sequential.lua:36: in function 'updateGradInput'
...software/torch/7/share/lua/5.1/nn/StochasticGradient.lua:37: in function 'train'
siamese_neural_network.lua:278: in function 'neuralNetworkApplication'
siamese_neural_network.lua:223: in function 'kfold_cross_validation_separate'
siamese_neural_network.lua:753: in main chunk
[C]: in function 'dofile'
...1/software/torch/7/lib/luarocks/rocks/trepl/scm-1/bin/th:131: in main chunk
[C]: at 0x004057d0
All of your activation layers share the same nn.Tanh() object. That is the problem. Try something like this instead:
act_function = nn.Tanh
perceptron:add( act_function() )
Why?
To perform a backward propagation step, we have to compute the gradient of the layer's output w.r.t. its input. In our case:
tanh'(input) = 1 - tanh(input)^2
One can notice that tanh(input) is exactly the output of the layer's forward step. You can store this output inside the layer and use it during the backward pass to speed up training. This is exactly what happens inside the nn library:
// torch/nn/generic/Tanh.c/Tanh_updateGradInput:
for(i = 0; i < THTensor_(nElement)(gradInput); i++)
{
real z = ptr_output[i];
ptr_gradInput[i] = ptr_gradOutput[i] * (1. - z*z);
}
The output sizes of your activation layers don't match, so the error occurs. Even if they did match, sharing a single module would still lead to a wrong result, because its stored output is overwritten on every forward call.
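Applied to the network-construction code from the question, the fix looks roughly like this (a sketch, keeping the variable names from the question and using Lua's lowercase true for the dropout check):
act_function = nn.Tanh   -- keep the constructor, not a single shared instance

perceptron = nn.Sequential()
perceptron:add(nn.Linear(input_number, hiddenUnits))
perceptron:add(act_function())                    -- fresh Tanh module
if dropOutFlag == true then perceptron:add(nn.Dropout()) end
for w = 0, hiddenLayers do
    perceptron:add(nn.Linear(hiddenUnits, hiddenUnits))
    perceptron:add(act_function())                -- another fresh Tanh per layer
    if dropOutFlag == true then perceptron:add(nn.Dropout()) end
end
perceptron:add(nn.Linear(hiddenUnits, output_number))
perceptron:add(act_function())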
Sorry about my English.
I was surprisingly able to fix my problem, by eliminating the line:
act_function = nn.Tanh();
and consequently by replacing any occurrences of act_function with nn.Tanh()
I do not know why, but now everything works...
So the lesson is: never assign an activation function to a variable (!?).
I am trying to create a list based on my neural network outputs and use it in TensorFlow as a loss function.
Assume that results is a list of shape [1, batch_size] that is output by a neural network. I check whether the first value of this list is in a specific range passed in as a placeholder called valid_range, and if it is, I add 1 to a list; if it is not, I add -1. The goal is to make all predictions of the network fall in the range, so the correct predictions are a tensor of all 1s, which I call correct_predictions.
values_list = []
for j in range(batch_size):
    a = results[0, j] >= valid_range[0]
    b = results[0, j] <= valid_range[1]
    c = tf.logical_and(a, b)
    if (c == 1):
        values_list.append(1)
    else:
        values_list.append(-1.)
values_list_tensor = tf.convert_to_tensor(values_list)
correct_predictions = tf.ones([batch_size, ], tf.float32)
Now, I want to use this as a loss function in my network, so that I can force all the predictions to be in the specified range. I try to train like this:
loss = tf.reduce_mean(tf.squared_difference(values_list_tensor, correct_predictions))
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
gradients, variables = zip(*optimizer.compute_gradients(loss))
gradients, _ = tf.clip_by_global_norm(gradients, gradient_clip_threshold)
optimize = optimizer.apply_gradients(zip(gradients, variables))
This, however, has a problem and throws an error on the last optimize line, saying:
ValueError: No gradients provided for any variable: ['<tensorflow.python.training.optimizer._RefVariableProcessor object at 0x7f0245d4afd0>',
'<tensorflow.python.training.optimizer._RefVariableProcessor object at 0x7f0245d66050>'
...
I tried to debug this in TensorBoard, and I noticed that the list I am creating does not appear in the graph, so basically the x part of the loss function is not part of the network itself. Is there some way to accurately create a list based on the predictions of a neural network and use it in the loss function in TensorFlow to train the network?
Please help, I have been stuck on this for a few days now.
Edit:
Following what was suggested in the comments, I decided to use an L2 loss function, multiplying it by the binary vector I had from before, values_list_tensor. The binary vector now has values 1 and 0 instead of 1 and -1, so that when the prediction is in the range the loss is 0, and otherwise it is the normal L2 loss. As I am unable to see the values of the tensors, I am not sure whether this is correct. However, I can view the final loss and it is always 0, so something is wrong here. I am unsure whether the multiplication is being done correctly and whether values_list_tensor is calculated accurately. Can someone help and tell me what could be wrong?
loss = tf.reduce_mean(tf.nn.l2_loss(tf.matmul(tf.transpose(tf.expand_dims(values_list_tensor, 1)), tf.expand_dims(result[0, :], 1))))
Thanks
To answer the question in the comment: one way to write a piecewise function is using tf.cond. For example, here is a function that returns 0 on [-1, 1] and x everywhere else:
sess = tf.InteractiveSession()
x = tf.placeholder(tf.float32)
y = tf.cond(tf.logical_or(tf.greater(x, 1.0), tf.less(x, -1.0)), lambda : x, lambda : 0.0)
y.eval({x: 1.5}) # prints 1.5
y.eval({x: 0.5}) # prints 0.0
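If the same check has to be applied element-wise to a whole batch rather than to a single scalar, tf.where is the vectorized counterpart. A small sketch under the same session, with a hypothetical 1-D preds placeholder:
preds = tf.placeholder(tf.float32, [None])
in_range = tf.logical_and(tf.greater_equal(preds, -1.0), tf.less_equal(preds, 1.0))
y_vec = tf.where(in_range, tf.zeros_like(preds), preds)  # 0 inside [-1, 1], x elsewhere
y_vec.eval({preds: [0.5, 1.5, -2.0]})  # [0.0, 1.5, -2.0]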
Could someone give a clear explanation of backpropagation for LSTM RNNs?
This is the type of structure I am working with. My question is not about what backpropagation is; I understand it is a reverse-order method of calculating the error between the hypothesis and the output, which is used to adjust the weights of neural networks. My question is how LSTM backpropagation is different than in regular neural networks.
I am unsure of how to find the initial error of each gate. Do you use the first error (calculated by hypothesis minus output) for each gate? Or do you adjust the error for each gate through some calculation? I am unsure how the cell state plays a role in the backprop of LSTMs, if it does at all. I have looked thoroughly for a good source on LSTMs but have yet to find any.
That's a good question. You certainly should take a look at suggested posts for details, but a complete example here would be helpful too.
RNN Backpropagation
I think it makes sense to talk about an ordinary RNN first (because the LSTM diagram is particularly confusing) and understand its backpropagation.
When it comes to backpropagation, the key idea is network unrolling, which is a way to transform the recursion in an RNN into a feed-forward sequence (as in the picture above). Note that the abstract RNN is unbounded (it can be arbitrarily long), but each particular implementation is limited because memory is limited. As a result, the unrolled network really is a long feed-forward network, with a few complications, e.g. the weights in different layers are shared.
Let's take a look at a classic example, char-rnn by Andrej Karpathy. Here each RNN cell produces two outputs, h[t] (the state which is fed into the next cell) and y[t] (the output on this step), by the following formulas, where Wxh, Whh and Why are the shared parameters:
h[t] = tanh(Wxh * x[t] + Whh * h[t-1] + bh)
y[t] = Why * h[t] + by
In the code, it's simply three matrices and two bias vectors:
# model parameters
Wxh = np.random.randn(hidden_size, vocab_size)*0.01 # input to hidden
Whh = np.random.randn(hidden_size, hidden_size)*0.01 # hidden to hidden
Why = np.random.randn(vocab_size, hidden_size)*0.01 # hidden to output
bh = np.zeros((hidden_size, 1)) # hidden bias
by = np.zeros((vocab_size, 1)) # output bias
The forward pass is pretty straightforward; this example uses a softmax and cross-entropy loss. Note that each iteration uses the same W* and h* arrays, but the output and hidden state are different:
# forward pass
for t in xrange(len(inputs)):
    xs[t] = np.zeros((vocab_size,1)) # encode in 1-of-k representation
    xs[t][inputs[t]] = 1
    hs[t] = np.tanh(np.dot(Wxh, xs[t]) + np.dot(Whh, hs[t-1]) + bh) # hidden state
    ys[t] = np.dot(Why, hs[t]) + by # unnormalized log probabilities for next chars
    ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t])) # probabilities for next chars
    loss += -np.log(ps[t][targets[t],0]) # softmax (cross-entropy loss)
Now, the backward pass is performed exactly as if it were a feed-forward network, but the gradients of the W* and h* arrays accumulate the contributions from all cells:
for t in reversed(xrange(len(inputs))):
    dy = np.copy(ps[t])
    dy[targets[t]] -= 1
    dWhy += np.dot(dy, hs[t].T)
    dby += dy
    dh = np.dot(Why.T, dy) + dhnext # backprop into h
    dhraw = (1 - hs[t] * hs[t]) * dh # backprop through tanh nonlinearity
    dbh += dhraw
    dWxh += np.dot(dhraw, xs[t].T)
    dWhh += np.dot(dhraw, hs[t-1].T)
    dhnext = np.dot(Whh.T, dhraw)
Both passes above are done in chunks of size len(inputs), which corresponds to the size of the unrolled RNN. You might want to make it bigger to capture longer dependencies in the input, but you pay for it by storing all outputs and gradients for each cell.
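For reference, this is roughly how the char-rnn gist drives these two passes in chunks; a simplified sketch, where data, char_to_ix, seq_length, hidden_size and lossFun come from that gist and the parameter update is omitted:
p = 0
hprev = np.zeros((hidden_size, 1))              # carries the hidden state across chunks
while True:
    if p + seq_length + 1 >= len(data):
        p = 0
        hprev = np.zeros((hidden_size, 1))      # wrap around and reset the memory
    inputs = [char_to_ix[ch] for ch in data[p:p + seq_length]]
    targets = [char_to_ix[ch] for ch in data[p + 1:p + seq_length + 1]]
    loss, dWxh, dWhh, dWhy, dbh, dby, hprev = lossFun(inputs, targets, hprev)
    # ... apply the parameter update here (Adagrad in the original gist) ...
    p += seq_length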
What's different in LSTMs
The LSTM picture and formulas look intimidating, but once you have coded a plain vanilla RNN, the implementation of an LSTM is pretty much the same. For example, here is the backward pass:
# Loop over all cells, like before
d_h_next_t = np.zeros((N, H))
d_c_next_t = np.zeros((N, H))
for t in reversed(xrange(T)):
    d_x_t, d_h_prev_t, d_c_prev_t, d_Wx_t, d_Wh_t, d_b_t = lstm_step_backward(d_h_next_t + d_h[:,t,:], d_c_next_t, cache[t])
    d_c_next_t = d_c_prev_t
    d_h_next_t = d_h_prev_t

    d_x[:,t,:] = d_x_t
    d_h0 = d_h_prev_t
    d_Wx += d_Wx_t
    d_Wh += d_Wh_t
    d_b += d_b_t
# The step in each cell
# Captures all LSTM complexity in few formulas.
def lstm_step_backward(d_next_h, d_next_c, cache):
    """
    Backward pass for a single timestep of an LSTM.

    Inputs:
    - dnext_h: Gradients of next hidden state, of shape (N, H)
    - dnext_c: Gradients of next cell state, of shape (N, H)
    - cache: Values from the forward pass

    Returns a tuple of:
    - dx: Gradient of input data, of shape (N, D)
    - dprev_h: Gradient of previous hidden state, of shape (N, H)
    - dprev_c: Gradient of previous cell state, of shape (N, H)
    - dWx: Gradient of input-to-hidden weights, of shape (D, 4H)
    - dWh: Gradient of hidden-to-hidden weights, of shape (H, 4H)
    - db: Gradient of biases, of shape (4H,)
    """
    x, prev_h, prev_c, Wx, Wh, a, i, f, o, g, next_c, z, next_h = cache

    d_z = o * d_next_h
    d_o = z * d_next_h
    d_next_c += (1 - z * z) * d_z

    d_f = d_next_c * prev_c
    d_prev_c = d_next_c * f
    d_i = d_next_c * g
    d_g = d_next_c * i

    d_a_g = (1 - g * g) * d_g
    d_a_o = o * (1 - o) * d_o
    d_a_f = f * (1 - f) * d_f
    d_a_i = i * (1 - i) * d_i
    d_a = np.concatenate((d_a_i, d_a_f, d_a_o, d_a_g), axis=1)

    d_prev_h = d_a.dot(Wh.T)
    d_Wh = prev_h.T.dot(d_a)

    d_x = d_a.dot(Wx.T)
    d_Wx = x.T.dot(d_a)

    d_b = np.sum(d_a, axis=0)

    return d_x, d_prev_h, d_prev_c, d_Wx, d_Wh, d_b
Summary
Now, back to your questions.
My question is how LSTM backpropagation is different than in regular neural networks
There are shared weights in different layers, and a few more additional variables (states) that you need to pay attention to. Other than this, there is no difference at all.
Do you use the first error (calculated by hypothesis minus output) for each gate? Or do you adjust the error for each gate through some calculation?
First up, the loss function is not necessarily L2. In the example above it's a cross-entropy loss, so the initial error signal is its gradient:
# remember that ps is the probability distribution from the forward pass
dy = np.copy(ps[t])
dy[targets[t]] -= 1
Note that it's the same error signal as in an ordinary feed-forward neural network. If you use L2 loss, the signal indeed equals the ground truth minus the actual output.
In the case of an LSTM, it's slightly more complicated: d_next_h = d_h_next_t + d_h[:,t,:], where d_h is the upstream gradient from the loss function, which means that the error signal of each cell gets accumulated. But once again, if you unroll the LSTM, you'll see a direct correspondence with the network wiring.
I don't think your questions can be answered in a short response. Nico's simple LSTM has a link to a great paper by Lipton et al.; please read it. Also, his simple Python code sample helps to answer most of your questions.
If you understand Nico's last sentence
ds = self.state.o * top_diff_h + top_diff_s
in detail, please give me feedback. At the moment I have a final problem with his "Putting all this s and h derivations together".
I have a convolutional neural network whose output is a 4-channel 2D image. I want to apply a sigmoid activation function to the first two channels and then use BCECriterion to compute the loss between the produced images and the ground-truth ones. I want to apply a squared loss function to the last two channels and finally compute the gradients and do backprop. I would also like to multiply the cost of the squared loss for each of the two last channels by a desired scalar.
So the cost has the following form:
cost = crossEntropyCh[{1, 2}] + l1 * squaredLossCh_3 + l2 * squaredLossCh_4
The way I'm thinking about doing this is as follows:
criterion1 = nn.BCECriterion()
criterion2 = nn.MSECriterion()
error = criterion1:forward(model.output[{{}, {1, 2}}], groundTruth1) + l1 * criterion2:forward(model.output[{{}, {3}}], groundTruth2) + l2 * criterion2:forward(model.output[{{}, {4}}], groundTruth3)
However, I don't think this is the correct way of doing it since I will have to do 3 separate backprop steps, one for each of the cost terms. So I wonder, can anyone give me a better solution to do this in Torch?
SplitTable and ParallelCriterion might be helpful for your problem.
Your current output layer can be followed by nn.SplitTable, which splits your output channels and converts your output tensor into a table. You can then combine different criterions by using ParallelCriterion, so that each criterion is applied to the corresponding entry of the output table.
For details, I suggest you read the Torch documentation about tables.
Following the comments, I added the following code segment, which solves the original question.
M = 100
C = 4
H = 64
W = 64
dataIn = torch.rand(M, C, H, W)
layerOfTables = nn.Sequential()
-- Because SplitTable discards the dimension it is applied on, we insert
-- an additional dimension.
layerOfTables:add(nn.Reshape(M,C,1,H,W))
-- We want to split over the second dimension (i.e. channels).
layerOfTables:add(nn.SplitTable(2, 5))
-- We use ConcatTable in order to create paths accessing the data for the
-- various criterions. Each branch of the ConcatTable will have access to
-- the data (i.e. the output table).
criterionPath = nn.ConcatTable()
-- Starting from offset 1, NarrowTable will select 2 elements. Since you
-- want to use this portion as a single 2-channel tensor, we need to combine
-- them by using JoinTable. Without JoinTable, the output would again be a
-- table with 2 elements.
criterionPath:add(nn.Sequential():add(nn.NarrowTable(1, 2)):add(nn.JoinTable(2)))
-- SelectTable is a simplified version of NarrowTable, and it fetches the desired element.
criterionPath:add(nn.SelectTable(3))
criterionPath:add(nn.SelectTable(4))
layerOfTables:add(criterionPath)
-- Here goes the criterion container. You can use this as if it is a regular
-- criterion function (Please see the examples on documentation page).
criterionContainer = nn.ParallelCriterion()
criterionContainer:add(nn.BCECriterion())
criterionContainer:add(nn.MSECriterion())
criterionContainer:add(nn.MSECriterion())
Since I used almost every possible table operation, it looks a little bit nasty. However, this is the only way I could solve this problem. I hope that it helps you and others suffering from the same problem. This is what the result looks like:
dataOut = layerOfTables:forward(dataIn)
print(dataOut)
{
1 : DoubleTensor - size: 100x2x64x64
2 : DoubleTensor - size: 100x1x64x64
3 : DoubleTensor - size: 100x1x64x64
}
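One last detail from the original cost formula: the l1 and l2 multipliers. nn.ParallelCriterion accepts an optional weight for each added criterion, so the scaling can be expressed directly; a small sketch, assuming l1 and l2 are the scalars from the question:
criterionContainer = nn.ParallelCriterion()
criterionContainer:add(nn.BCECriterion())        -- channels 1-2
criterionContainer:add(nn.MSECriterion(), l1)    -- channel 3, scaled by l1
criterionContainer:add(nn.MSECriterion(), l2)    -- channel 4, scaled by l2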
I have a 2x16x3x10x10 tensor that I feed into my network. My network has two parts that work in parallel. The first part takes the 16x3x10x10 tensor and computes the sum over the last two dimensions, returning a 16x3 tensor.
The second part is a convolutional neural network that produces a 16x160 tensor.
Whenever I try to run this model, I get the following error:
...903/nTorch/Torch7/install/share/lua/5.1/torch/Tensor.lua:457: expecting a contiguous tensor
stack traceback:
[C]: in function 'assert'
...903/nTorch/Torch7/install/share/lua/5.1/torch/Tensor.lua:457: in function 'view'
...8/osu7903/nTorch/Torch7/install/share/lua/5.1/nn/Sum.lua:26: in function 'updateGradInput'
...03/nTorch/Torch7/install/share/lua/5.1/nn/Sequential.lua:40: in function 'updateGradInput'
...7903/nTorch/Torch7/install/share/lua/5.1/nn/Parallel.lua:52: in function 'updateGradInput'
...su7903/nTorch/Torch7/install/share/lua/5.1/nn/Module.lua:30: in function 'backward'
...03/nTorch/Torch7/install/share/lua/5.1/nn/Sequential.lua:73: in function 'backward'
./train_v2_with_batch.lua:144: in function 'opfunc'
...su7903/nTorch/Torch7/install/share/lua/5.1/optim/sgd.lua:43: in function 'sgd'
./train_v2_with_batch.lua:160: in function 'train'
run.lua:93: in main chunk
[C]: in function 'dofile'
...rch/Torch7/install/lib/luarocks/rocks/trepl/scm-1/bin/th:131: in main chunk
[C]: at 0x00405800
Here is the relevant part of the model:
local first_part = nn.Parallel(1,2)
local CNN = nn.Sequential()
local sums = nn.Sequential()
sums:add(nn.Sum(3))
sums:add(nn.Sum(3))
first_part:add(sums)
-- stage 1: conv+max
CNN:add(nn.SpatialConvolutionMM(nfeats, convDepth_L1,receptiveFieldWidth_L1,receptiveFieldHeight_L1))
-- Since the default stride of the receptive field is 1, then
-- (assuming receptiveFieldWidth_L1 = receptiveFieldHeight_L1 = 3) the number of receptive fields is (10-3+1)x(10-3+1) or 8x8
-- so the output volume is (convDepth_L1 X 8 X 8) or 10 x 8 x 8
--CNN:add(nn.Threshold())
CNN:add(nn.ReLU())
CNN:add(nn.SpatialMaxPooling(poolsize,poolsize,poolsize,poolsize))
-- if poolsize=2, then the output of this is 10x4x4
CNN:add(nn.Reshape(convDepth_L1*outputWdith_L2*outputWdith_L2,true))
first_part:add(CNN)
The code works when the input tensor is 2x1x3x10x10, but not when the tensor is 2x16x3x10x10.
Edit: I only just realized that this happens when I do model:backward and not model:forward. Here is the relevant code:
local y = model:forward(x)
local E = loss:forward(y,yt)
-- estimate df/dW
local dE_dy = loss:backward(y,yt)
print(dE_dy)
model:backward(x,dE_dy)
x is a 2x16x3x10x10 tensor and dE_dy is 16x2.
This is a flaw in the torch.nn library. To perform a backward step, nn.Parallel splits the gradOutput it receives from the higher module into pieces and sends them to its parallel submodules. The splitting is done efficiently, without copying memory, and thus those pieces are non-contiguous (unless you split on the 1st dimension).
local first_part = nn.Parallel(1,2)
-- ^
-- Merging on the 2nd dimension;
-- Chunks of splitted gradOutput will not be contiguous
The problem is that nn.Sum cannot work with non-contiguous gradOutput. I haven't got a better idea than to make changes to it:
Sum_nc, _ = torch.class('nn.Sum_nc', 'nn.Sum')

function Sum_nc:updateGradInput(input, gradOutput)
    local size = input:size()
    size[self.dimension] = 1
    -- modified code:
    if gradOutput:isContiguous() then
        gradOutput = gradOutput:view(size) -- doesn't work with non-contiguous tensors
    else
        gradOutput = gradOutput:resize(size) -- slower because of memory reallocation and changes gradOutput
        -- gradOutput = gradOutput:clone():resize(size) -- doesn't change gradOutput; safer and even slower
    end
    --
    self.gradInput:resizeAs(input)
    self.gradInput:copy(gradOutput:expandAs(input))
    return self.gradInput
end

[...]

sums = nn.Sequential()
sums:add(nn.Sum_nc(3)) -- <- will use torch.view
sums:add(nn.Sum_nc(3)) -- <- will use torch.resize
I have been trying to work out exactly how backpropagation in a neural network works, what the derivative is, and what function it is trying to minimize.
Below I tried to make the simplest model I could: an input, hidden, and output layer with 1 node in each, represented by a(). The network uses a sigmoid function; w() denotes the weights between the nodes. We calculate each of the deltas, then update the weights accordingly. Please let me know if I have made any errors in my model.
My question is: what exactly are we trying to minimize? We are trying to minimize the delta for each node, correct? Delta does not seem to be the derivative of anything. I know that x*(1-x) is the derivative of the logistic function 1/(1+e^-x), but for delta(a1), a(1) is the x in this case, and it would not be the x in the sigmoid function for a(1). And if delta(a1) is a derivative, what function is it a derivative of, exactly?
Input           Hidden          Output
a(0) ---w(1)--- a(1) ---w(2)--- a(2)
delta(a2) = a(2) - y
delta(a1) = [delta(a2) * w(2)] * a(1) * [1 - a(1)]
w1 = w1 - [a(0) * delta(a1)]
w2 = w2 - [a(1) * delta(a2)]
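For concreteness, here is the model above transcribed into a runnable sketch (plain Python, sigmoid activations, no bias terms, following the formulas exactly as written):
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_step(a0, y, w1, w2):
    # forward pass
    a1 = sigmoid(w1 * a0)                        # hidden activation
    a2 = sigmoid(w2 * a1)                        # output activation
    # deltas, as in the formulas above
    delta_a2 = a2 - y
    delta_a1 = (delta_a2 * w2) * a1 * (1 - a1)
    # weight updates
    w1 = w1 - a0 * delta_a1
    w2 = w2 - a1 * delta_a2
    return w1, w2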