I have a 2x16x3x10x10 tensor that I feed into my network. My network has two parts that work in parallel. The first part takes a 16x3x10x10 tensor and computes the sum over the last two dimensions, returning a 16x3 tensor.
The second part is a convolutional neural network that produces a 16x160 tensor.
Whenever I try to run this model, I get the following error:
...903/nTorch/Torch7/install/share/lua/5.1/torch/Tensor.lua:457: expecting a contiguous tensor
stack traceback:
[C]: in function 'assert'
...903/nTorch/Torch7/install/share/lua/5.1/torch/Tensor.lua:457: in function 'view'
...8/osu7903/nTorch/Torch7/install/share/lua/5.1/nn/Sum.lua:26: in function 'updateGradInput'
...03/nTorch/Torch7/install/share/lua/5.1/nn/Sequential.lua:40: in function 'updateGradInput'
...7903/nTorch/Torch7/install/share/lua/5.1/nn/Parallel.lua:52: in function 'updateGradInput'
...su7903/nTorch/Torch7/install/share/lua/5.1/nn/Module.lua:30: in function 'backward'
...03/nTorch/Torch7/install/share/lua/5.1/nn/Sequential.lua:73: in function 'backward'
./train_v2_with_batch.lua:144: in function 'opfunc'
...su7903/nTorch/Torch7/install/share/lua/5.1/optim/sgd.lua:43: in function 'sgd'
./train_v2_with_batch.lua:160: in function 'train'
run.lua:93: in main chunk
[C]: in function 'dofile'
...rch/Torch7/install/lib/luarocks/rocks/trepl/scm-1/bin/th:131: in main chunk
[C]: at 0x00405800
Here is the relevant part of the model:
local first_part = nn.Parallel(1,2)
local CNN = nn.Sequential()
local sums = nn.Sequential()
sums:add(nn.Sum(3))
sums:add(nn.Sum(3))
first_part:add(sums)
-- stage 1: conv+max
CNN:add(nn.SpatialConvolutionMM(nfeats, convDepth_L1,receptiveFieldWidth_L1,receptiveFieldHeight_L1))
-- Since the default stride of the receptive field is 1, then
-- (assuming receptiveFieldWidth_L1 = receptiveFieldHeight_L1 = 3) the number of receptive fields is (10-3+1)x(10-3+1) or 8x8
-- so the output volume is (convDepth_L1 X 8 X 8) or 10 x 8 x 8
--CNN:add(nn.Threshold())
CNN:add(nn.ReLU())
CNN:add(nn.SpatialMaxPooling(poolsize,poolsize,poolsize,poolsize))
-- if poolsize=2, then the output of this is 10x4x4
CNN:add(nn.Reshape(convDepth_L1*outputWdith_L2*outputWdith_L2,true))
first_part:add(CNN)
The code works when the input tensor is 2x1x3x10x10, but not when the tensor is 2x16x3x10x10.
Edit: I only just realized that this happens when I do model:backward and not model:forward. Here is the relevant code:
local y = model:forward(x)
local E = loss:forward(y,yt)
-- estimate df/dW
local dE_dy = loss:backward(y,yt)
print(dE_dy)
model:backward(x,dE_dy)
x is a 2x16x3x10x10 tensor and dE_dy is 16x2.
This is a flaw in the torch.nn library. To perform a backward step, nn.Parallel splits the gradOutput it receives from the higher module into pieces and sends them to its parallel submodules. The splitting is done efficiently, without copying memory, and as a result those pieces are non-contiguous (unless you split along the 1st dimension).
local first_part = nn.Parallel(1,2)
-- ^
-- Merging on the 2nd dimension;
-- Chunks of the split gradOutput will not be contiguous
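To see the effect, here is a quick standalone illustration (the 16x163 size is just an arbitrary example): narrowing a tensor along the 1st dimension keeps it contiguous, narrowing along the 2nd does not.
require 'torch'
local t = torch.rand(16, 163)
print(t:narrow(1, 1, 3):isContiguous()) -- true: the selected rows are adjacent in memory
print(t:narrow(2, 1, 3):isContiguous()) -- false: the selected columns are strided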
The problem is that nn.Sum cannot work with non-contiguous gradOutput. I haven't got a better idea than to make changes to it:
Sum_nc, _ = torch.class('nn.Sum_nc', 'nn.Sum')
function Sum_nc:updateGradInput(input, gradOutput)
local size = input:size()
size[self.dimension] = 1
-- modified code:
if gradOutput:isContiguous() then
gradOutput = gradOutput:view(size) -- doesn't work with non-contiguous tensors
else
gradOutput = gradOutput:resize(size) -- slower because of memory reallocation and changes gradOutput
-- gradOutput = gradOutput:clone():resize(size) -- doesn't change gradOutput; safer and even slower
end
--
self.gradInput:resizeAs(input)
self.gradInput:copy(gradOutput:expandAs(input))
return self.gradInput
end
[...]
sums = nn.Sequential()
sums:add(nn.Sum_nc(3)) -- <- will use torch.view
sums:add(nn.Sum_nc(3)) -- <- will use torch.resize
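As an alternative sketch (assuming your nn version ships the nn.Contiguous module), you could instead append a module to the sums branch that copies the gradOutput coming back from nn.Parallel into contiguous memory before it reaches the stock nn.Sum:
-- Assumes nn.Contiguous is available and first_part is the nn.Parallel(1,2) from the question.
local sums = nn.Sequential()
sums:add(nn.Sum(3))
sums:add(nn.Sum(3))
sums:add(nn.Contiguous()) -- copies a non-contiguous gradOutput during the backward pass
first_part:add(sums)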
On GitHub:
https://github.com/torch/tutorials/blob/master/2_supervised/4_train.lua
there is an example of a script defining a training procedure. I'm interested in the construction of the feval function in this script.
-- create closure to evaluate f(X) and df/dX
local feval = function(x)
-- get new parameters
if x ~= parameters then
parameters:copy(x)
end
-- reset gradients
gradParameters:zero()
-- f is the average of all criterions
local f = 0
-- evaluate function for complete mini batch
for i = 1,#inputs do
-- estimate f
local output = model:forward(inputs[i])
local err = criterion:forward(output, targets[i])
f = f + err
-- estimate df/dW
local df_do = criterion:backward(output, targets[i])
model:backward(inputs[i], df_do)
-- update confusion
confusion:add(output, targets[i])
end
-- normalize gradients and f(X)
gradParameters:div(#inputs)
f = f/#inputs
-- return f and df/dX
return f,gradParameters
end
I tried to modify this function by removing the loop:
for i = 1,#inputs do ...
So instead of doing the forward and backward passes input by input (inputs[i]), I do them for the whole mini-batch (inputs). This really speeds up the process. Here is the modified script:
-- create closure to evaluate f(X) and df/dX
local feval = function(x)
-- get new parameters
if x ~= parameters then
parameters:copy(x)
end
-- reset gradients
gradParameters:zero()
-- f is the average of all criterions
local f = 0
-- evaluate function for complete mini batch
-- estimate f
local output = model:forward(inputs)
local f = criterion:forward(output, targets)
-- estimate df/dW
local df_do = criterion:backward(output, targets)
-- update weight
model:backward(inputs, df_do)
-- update confusion
confusion:batchAdd(output, targets)
-- return f and df/dX
return f,gradParameters
end
But when I check in detail the values returned by feval (f, gradParameters) for a given mini-batch, I don't get the same result with the loop and without it.
So my questions are :
1 - Why do we have this loop?
2 - And is it possible to get the same result without this loop?
Regards
Sam
NB: I'm a beginner in Torch7.
I'm sure you noticed getting the second way to work requires a bit more than simply changing feval.
In your second example, inputs needs to be a 4D tensor, rather than a table of 3D tensors (unless something has changed since I last updated). These tensors have different sizes depending on the loss criterion/model used. Whoever implemented the example must have thought the loop was the easier way to go here. In addition, ClassNLLCriterion does not seem to like batch processing (one would usually use nn.CrossEntropyCriterion to get around this).
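For reference, here is a minimal sketch (not part of the tutorial) of packing a Lua table of same-sized 3D tensors into the 4D batch tensor that the second version expects; targets are stacked the same way:
-- Assumes every inputs[i] has the same size (e.g. 3x32x32) and targets[i] is a class index.
local batchInputs = torch.Tensor(#inputs, inputs[1]:size(1), inputs[1]:size(2), inputs[1]:size(3))
local batchTargets = torch.Tensor(#inputs)
for i = 1, #inputs do
   batchInputs[i]:copy(inputs[i])
   batchTargets[i] = targets[i]
end
-- model:forward(batchInputs) now processes the whole mini-batch at once.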
All of this aside though, the two methods should give the same result. The only slight difference is that the first example uses the average error/gradient, and the second uses the sum, as you can see from:
gradParameters:div(inputs:size(1))
f = f/inputs:size(1)
In the second case, f and gradParameters should differ from the first only by a factor of opt.batchSize. The two are mathematically equivalent for optimization purposes.
I have been running a piece of code, train.lua, found here: https://github.com/karpathy/char-rnn/blob/master/train.lua
This is a character-level language prediction model based on RNNs/LSTMs. It had been working perfectly fine on OS X with the CPU until I tried implementing a word-level prediction model instead. Namely, the network predicts the next word, as opposed to the next character. The vocabulary size (number of possible outcomes) went up to 13320, and the number of parameters also increased to 39963. With LuaJIT, I got an error message "not enough memory", and I was looking around for a solution. I found the issue of the LuaJIT memory limit brought up here: https://github.com/karpathy/char-rnn/issues/80
So I removed Torch and reinstalled it with plain Lua. However, neither LUA51, LUA52, nor LUA53 worked: I ran into the same memory issue. It just says "Kill: 9" every time I run the training code. In particular, the issue arises when I get it to create the T (the sequence length, i.e. the number of time steps) hidden layers, which share the same weights, using the "model_utils.clone_many_times" function in the util/model_utils.lua file.
In my case, the function runs up to the point where it clones 7 hidden layers, and the process is killed there. I set both rnn_size and batch_size to 1. Of course, I want to run much bigger networks, but the code still fails at this small size.
Update:
Here is the workaround I am working on.
The cloning process seems somewhat redundant, as it stores T hidden layers. Maybe we can change the function so that it only carries the unit activations, as opposed to the entire layers, through the T time steps. I feel the only issue is backprop: the activation levels of the hidden units are carried over from batch to batch by the table init_state_global, so we somehow need to establish back-propagation over multiple batches.
Here is a workaround I found. All else equal, the results I got were almost the same as the original ones, except for some floating-point precision differences for some reason. It saves memory (seq_length does not even affect the memory footprint). I set the number of clones in the "model_utils.clone_many_times" function to 1 (so we probably don't even need this memory-consuming function anymore) and just store the hidden unit activations for backprop.
function feval(x)
if x ~= params then
params:copy(x)
end
grad_params:zero()
------------------ get minibatch -------------------
local x, y = loader:next_batch(1)
x,y = prepro(x,y) -- seq_length by batch_size tensor
------------------- forward pass -------------------
local rnn_state = {[0] = init_state_global}
local predictions = {} -- softmax outputs
local loss = 0
local hidden_units = {}
for t=1,opt.seq_length do
clones.rnn[1]:training() -- make sure we are in correct mode (this is cheap, sets flag)
local lst = clones.rnn[1]:forward{x[t], unpack(rnn_state[t-1])}
rnn_state[t] = {}
for i=1,#init_state do table.insert(rnn_state[t], lst[i]) end -- extract the state, without output
hidden_units[t] = {}
local j = 1
for k = 1, #clones.rnn[1].modules do
if clones.rnn[1].modules[k].output then
if not (type(clones.rnn[1].modules[k].output) == 'table') then
hidden_units[t][j] = clones.rnn[1].modules[k].output:clone()
else
hidden_units[t][j] = {}
for l=1, #clones.rnn[1].modules[k].output do
hidden_units[t][j][l] = clones.rnn[1].modules[k].output[l]:clone()
end
end
j = j+1
end
end
predictions[t] = lst[#lst] -- last element is the prediction
loss = loss + clones.criterion[1]:forward(predictions[t], y[t])
end
loss = loss / opt.seq_length
------------------ backward pass -------------------
-- initialize gradient at time t to be zeros (there's no influence from future)
local drnn_state = {[opt.seq_length] = clone_list(init_state, true)} -- true also zeros the clones
for t=opt.seq_length,1,-1 do
-- backprop through loss, and softmax/linear
local j = 1
for k = 1, #clones.rnn[1].modules do
if clones.rnn[1].modules[k].output then
clones.rnn[1].modules[k].output = hidden_units[t][j]
j = j+1
end
end
local doutput_t = clones.criterion[1]:backward(predictions[t], y[t])
table.insert(drnn_state[t], doutput_t)
local dlst = clones.rnn[1]:backward({x[t], unpack(rnn_state[t-1])}, drnn_state[t])
drnn_state[t-1] = {}
for k,v in pairs(dlst) do
if k > 1 then -- k == 1 is gradient on x, which we dont need
-- note we do k-1 because first item is dembeddings, and then follow the
-- derivatives of the state, starting at index 2. I know...
drnn_state[t-1][k-1] = v
end
end
end
------------------------ misc ----------------------
-- transfer final state to initial state (BPTT)
init_state_global = rnn_state[#rnn_state] -- NOTE: I don't think this needs to be a clone, right?
-- grad_params:div(opt.seq_length) -- this line should be here but since we use rmsprop it would have no effect. Removing for efficiency
-- clip gradient element-wise
--Lets not clip gradient this time grad_params:clamp(-opt.grad_clip, opt.grad_clip)
return loss, grad_params
end
I have a convolutional neural network whose output is a 4-channel 2D image. I want to apply a sigmoid activation function to the first two channels and then use BCECriterion to compute the loss of the produced images against the ground-truth ones. I want to apply a squared loss function to the last two channels and finally compute the gradients and do backprop. I would also like to multiply the cost of the squared loss for each of the last two channels by a desired scalar.
So the cost has the following form:
cost = crossEntropyCh[{1, 2}] + l1 * squaredLossCh_3 + l2 * squaredLossCh_4
The way I'm thinking about doing this is as follow:
criterion1 = nn.BCECriterion()
criterion2 = nn.MSECriterion()
error = criterion1:forward(model.output[{{}, {1, 2}}], groundTruth1) + l1 * criterion2:forward(model.output[{{}, {3}}], groundTruth2) + l2 * criterion2:forward(model.output[{{}, {4}}], groundTruth3)
However, I don't think this is the correct way of doing it since I will have to do 3 separate backprop steps, one for each of the cost terms. So I wonder, can anyone give me a better solution to do this in Torch?
SplitTable and ParallelCriterion might be helpful for your problem.
You can follow your current output layer with nn.SplitTable, which splits the output channels and converts your output tensor into a table. You can also combine different loss functions by using ParallelCriterion, so that each criterion is applied to the corresponding entry of the output table.
For details, I suggest you read the Torch documentation on table layers.
Following the comments, I added the code segment below, which solves the original question.
M = 100
C = 4
H = 64
W = 64
dataIn = torch.rand(M, C, H, W)
layerOfTables = nn.Sequential()
-- Because SplitTable discards the dimension it is applied on, we insert
-- an additional dimension.
layerOfTables:add(nn.Reshape(M,C,1,H,W))
-- We want to split over the second dimension (i.e. channels).
layerOfTables:add(nn.SplitTable(2, 5))
-- We use ConcatTable in order to create separate paths to the data, one for
-- each criterion. Each branch of the ConcatTable will have access to the
-- data (i.e. the output table).
criterionPath = nn.ConcatTable()
-- Starting from offset 1, NarrowTable will select 2 elements. Since you
-- want to use this portion as a 2-channel tensor, we need to combine
-- them by using JoinTable. Without JoinTable, the output would again be a
-- table with 2 elements.
criterionPath:add(nn.Sequential():add(nn.NarrowTable(1, 2)):add(nn.JoinTable(2)))
-- SelectTable is a simplified version of NarrowTable; it fetches the desired element.
criterionPath:add(nn.SelectTable(3))
criterionPath:add(nn.SelectTable(4))
layerOfTables:add(criterionPath)
-- Here goes the criterion container. You can use this as if it is a regular
-- criterion function (Please see the examples on documentation page).
criterionContainer = nn.ParallelCriterion()
criterionContainer:add(nn.BCECriterion())
criterionContainer:add(nn.MSECriterion())
criterionContainer:add(nn.MSECriterion())
Since I used almost every possible table operation, it looks a little bit nasty. However, this is the only way I could solve this problem. I hope it helps you and others suffering from the same problem. This is what the result looks like:
dataOut = layerOfTables:forward(dataIn)
print(dataOut)
{
1 : DoubleTensor - size: 100x2x64x64
2 : DoubleTensor - size: 100x1x64x64
3 : DoubleTensor - size: 100x1x64x64
}
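One more point, since the question scales the two squared-loss terms by l1 and l2: nn.ParallelCriterion:add accepts an optional per-criterion weight, so those scalars (assumed to be defined as in the question) can be folded into the container directly:
criterionContainer = nn.ParallelCriterion()
criterionContainer:add(nn.BCECriterion())      -- channels 1-2
criterionContainer:add(nn.MSECriterion(), l1)  -- channel 3, weighted by l1
criterionContainer:add(nn.MSECriterion(), l2)  -- channel 4, weighted by l2
-- usage: loss = criterionContainer:forward(dataOut, {target12, target3, target4})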
I need to perform a custom spatial convolution in Torch. Rather than simply multiplying each input pixel by a weight for that pixel and adding them together with the filter's bias to form each output pixel, I need to apply a more complex mathematical function to the input pixels before adding them together.
I know how to do this, but I do not know a GOOD way to do it. The best way I've come up with is to take the full input tensor, create a bunch of secondary tensors that are "views" of the original without allocating additional memory, put those into a Replicate layer (with the output filter count as the replication count), and feed that into a ParallelTable layer containing a bunch of regular layers whose parameters are shared between filters.
The trouble is, even though this is fine memory-wise with a very manageable overhead, we're talking inputWidth * inputHeight * inputDepth * outputDepth mini-networks here. Maybe there's some way to create massive "long and tall" networks that work on the entire replicated input set at once, but how do I create layers that are partially connected (like convolutions) instead of fully connected?
I would have liked to just use inheritance to create a special copy of the regular SpatialConvolution "class" and modify it, but I can't even try because it's implemented in an external C library. I can't just use regular layers before a regular SpatialConvolution layer because I need to do my math with different weights and biases for each filter (shared between applications of the same filter to different input coordinates).
Good question. You made me give it some serious thought.
Your approach has a flaw: it does not allow you to take advantage of vectorized computation, since each mini-network works independently.
My idea is as follows:
Suppose the network's input and output are 2D tensors. We can produce (efficiently, without copying memory) an auxiliary 4D tensor
rf_input (kernel_size x kernel_size x output_h x output_w)
such that rf_input[:, :, k, l] is a 2D tensor of size kernel_size x kernel_size containing the receptive field from which output[k, l] is computed. Then we iterate over the positions inside the kernel: rf_input[i, j, :, :] gives the pixels at position (i, j) inside all receptive fields, and we compute their contribution to every output[k, l] at once using vectorization.
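As an aside (a standalone sketch, not part of the implementation below), one way to build such a view without copying memory is tensor:unfold; note that the resulting dimension order is output_h x output_w x kernel_size x kernel_size rather than the order written above:
local K = 3
local input = torch.rand(10, 10)
-- Unfold twice: rf[k][l] is the K x K receptive field of output[k, l],
-- and rf shares storage with input (no copy is made).
local rf = input:unfold(1, K, 1):unfold(2, K, 1)
print(rf:size()) -- 8 x 8 x 3 x 3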
Example:
Let our "convolving" function be, for example, a product of tangents of sums. Its partial derivative w.r.t. the input pixel at position (s, t) of its receptive field then follows from the product rule, and the derivative w.r.t. the corresponding weight has exactly the same form, since the weight enters through the same sum.
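Written out explicitly (a reconstruction implied by the implementation below, with rf_input[i, j, k, l] denoting the pixel at kernel position (i, j) of the receptive field belonging to output[k, l], and K the kernel size):
\mathrm{output}[k,l] = \prod_{i,j=1}^{K} \tan\bigl(\mathrm{rf\_input}[i,j,k,l] + w[i,j]\bigr)
\frac{\partial\,\mathrm{output}[k,l]}{\partial\,\mathrm{rf\_input}[s,t,k,l]} = \frac{\mathrm{output}[k,l]}{\tan\bigl(\mathrm{rf\_input}[s,t,k,l] + w[s,t]\bigr)\,\cos^{2}\bigl(\mathrm{rf\_input}[s,t,k,l] + w[s,t]\bigr)}
\frac{\partial\,\mathrm{output}[k,l]}{\partial\,w[s,t]} = \frac{\partial\,\mathrm{output}[k,l]}{\partial\,\mathrm{rf\_input}[s,t,k,l]}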
At the end, of course, we must sum up gradients from different output[k,l] points. For example, each input[m, n] contributes to at most kernel_size^2 outputs as a part of their receptive fields, and each weight[i, j] contributes to all output_h x output_w outputs.
A simple implementation may look like this:
require 'nn'
local CustomConv, parent = torch.class('nn.CustomConv', 'nn.Module')
-- This module takes and produces a 2D map.
-- To work with multiple input/output feature maps and batches,
-- you have to iterate over them or further vectorize computations inside the loops.
function CustomConv:__init(ker_size)
parent.__init(self)
self.ker_size = ker_size
self.weight = torch.rand(self.ker_size, self.ker_size):add(-0.5)
self.gradWeight = torch.Tensor(self.weight:size()):zero()
end
function CustomConv:_get_recfield_input(input)
local rf_input = {}
for i = 1, self.ker_size do
rf_input[i] = {}
for j = 1, self.ker_size do
rf_input[i][j] = input[{{i, i - self.ker_size - 1}, {j, j - self.ker_size - 1}}]
end
end
return rf_input
end
function CustomConv:updateOutput(_)
local output = torch.Tensor(self.rf_input[1][1]:size())
-- Kernel-specific: our kernel is multiplicative, so we start with ones
output:fill(1)
--
for i = 1, self.ker_size do
for j = 1, self.ker_size do
local ker_pt = self.rf_input[i][j]:clone()
local w = self.weight[i][j]
-- Kernel-specific
output:cmul(ker_pt:add(w):tan())
--
end
end
return output
end
function CustomConv:updateGradInput_and_accGradParameters(_, gradOutput)
local gradInput = torch.Tensor(self.input:size()):zero()
for i = 1, self.ker_size do
for j = 1, self.ker_size do
local ker_pt = self.rf_input[i][j]:clone()
local w = self.weight[i][j]
-- Kernel-specific
local subGradInput = torch.cmul(gradOutput, torch.cdiv(self.output, ker_pt:add(w):tan():cmul(ker_pt:add(w):cos():pow(2))))
local subGradWeight = subGradInput
--
gradInput[{{i, i - self.ker_size - 1}, {j, j - self.ker_size - 1}}]:add(subGradInput)
self.gradWeight[{i, j}] = self.gradWeight[{i, j}] + torch.sum(subGradWeight)
end
end
return gradInput
end
function CustomConv:forward(input)
self.input = input
self.rf_input = self:_get_recfield_input(input)
self.output = self:updateOutput(_)
return self.output
end
function CustomConv:backward(input, gradOutput)
local gradInput = self:updateGradInput_and_accGradParameters(_, gradOutput)
return gradInput
end
If you change this code a bit:
updateOutput:
output:fill(0)
[...]
output:add(ker_pt:mul(w))
updateGradInput_and_accGradParameters:
local subGradInput = torch.mul(gradOutput, w)
local subGradWeight = torch.cmul(gradOutput, ker_pt)
then it will work exactly as nn.SpatialConvolutionMM with zero bias (I've tested it).
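For completeness, a minimal usage sketch under the same assumptions (a single 2D feature map and a 3x3 kernel); the output size follows from the valid-convolution geometry:
require 'nn'
local conv = nn.CustomConv(3)
local input = torch.rand(10, 10)
local output = conv:forward(input)                    -- (10-3+1) x (10-3+1) = 8 x 8 map
local gradOutput = torch.Tensor(output:size()):uniform()
local gradInput = conv:backward(input, gradOutput)    -- same size as the input: 10 x 10
print(output:size(), gradInput:size())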
I'm implementing a deep neural network in Torch7 with a dataset made of two torch.Tensor() objects.
The first is made of 12 elements (completeTable), the other one is made of 1 element (presentValue).
Each dataset row is an array of these two tensors:
dataset[p] = {torch.Tensor(completeTable[p]), torch.Tensor(presentValue)};
Everything works for the neural network training and testing.
But now I want to switch and use only half of the 12 elements of completeTable, i.e. only 6 elements (firstChromRegionProfile).
dataset_firstChromRegion[p] = {torch.Tensor(firstChromRegionProfile), torch.Tensor(presentValue)};
If I run the same neural network architecture with this new dataset, it does not work. The trainer:train(dataset_firstChromRegion) call fails because of a "size mismatch".
Here's my neural network function:
-- Neural network application
function neuralNetworkApplication(input_number, output_number, datasetTrain, datasetTest, dropOutFlag, hiddenUnits, hiddenLayers)
require "nn"
-- act_function = nn.Sigmoid();
act_function = nn.Tanh();
print('input_number '.. input_number);
print('output_number '.. output_number);
-- NEURAL NETWORK CREATION - <START>
perceptron=nn.Sequential(); -- make a multi-layer perceptron
perceptron:add(nn.Linear(input_number, hiddenUnits));
perceptron:add(act_function);
if dropOutFlag==TRUE then perceptron:add(nn.Dropout()) end -- DROPOUT
-- we add w layers DEEP LEARNING
for w=0, hiddenLayers do
perceptron:add(nn.Linear(hiddenUnits,hiddenUnits)) -- DEEP LEARNING layer
perceptron:add(act_function); -- DEEP LEARNING
if dropOutFlag==TRUE then
perceptron:add(nn.Dropout()) -- DROPOUT
end
end
print('\n#datasetTrain '.. #datasetTrain);
print('#datasetTrain[1] '.. #datasetTrain[1]);
print('(#datasetTrain[1][1])[1] '..(#datasetTrain[1][1])[1]);
print('\n#datasetTest '.. #datasetTest);
print('#datasetTest[1] '.. #datasetTest[1]);
print('(#datasetTest[1][1])[1] '..(#datasetTest[1][1])[1]);
perceptron:add(nn.Linear(hiddenUnits, output_number));
perceptron:add(act_function);
criterion = nn.MSECriterion(); -- MSE: Mean Square Error
trainer = nn.StochasticGradient(perceptron, criterion)
trainer.learningRate = LEARNING_RATE_CONST;
trainer:train(datasetTrain);
idp=3;
predValueVector={}
for i=1,(#datasetTest) do
pred=perceptron:forward(datasetTest[i][1]); -- get the prediction of the perceptron
predValueVector[i]=pred[1];
end
-- NEURAL NETWORK CREATION - <END>
return predValueVector;
end
Here's the error log:
input_number 6
output_number 1
#datasetTrain 13416
#datasetTrain[1] 2
(#datasetTrain[1][1])[1] 6
#datasetTest 3354
#datasetTest[1] 2
(#datasetTest[1][1])[1] 6
# StochasticGradient: training
/mnt/work1/software/torch/7/bin/luajit: /mnt/work1/software/torch/7/share/lua/5.1/nn/Linear.lua:71: size mismatch
stack traceback:
[C]: in function 'addmv'
/mnt/work1/software/torch/7/share/lua/5.1/nn/Linear.lua:71: in function 'updateGradInput'
/mnt/work1/software/torch/7/share/lua/5.1/nn/Sequential.lua:36: in function 'updateGradInput'
...software/torch/7/share/lua/5.1/nn/StochasticGradient.lua:37: in function 'train'
siamese_neural_network.lua:278: in function 'neuralNetworkApplication'
siamese_neural_network.lua:223: in function 'kfold_cross_validation_separate'
siamese_neural_network.lua:753: in main chunk
[C]: in function 'dofile'
...1/software/torch/7/lib/luarocks/rocks/trepl/scm-1/bin/th:131: in main chunk
[C]: at 0x004057d0
All of your activation layers share the same nn.Tanh() object. That is the problem. Try something like this instead:
act_function = nn.Tanh
perceptron:add( act_function() )
Why?
To perform a backward propagation step, we have to compute a gradient of the layer w.r.t. its input. In our case:
tanh'(input) = 1 - tanh(input)^2
One can notice that tanh(input) is exactly the output of the layer's forward step. You can store this output inside the layer and use it during the backward pass to speed up training. This is exactly what happens inside the nn library:
// torch/nn/generic/Tanh.c/Tanh_updateGradInput:
for(i = 0; i < THTensor_(nElement)(gradInput); i++)
{
real z = ptr_output[i];
ptr_gradInput[i] = ptr_gradOutput[i] * (1. - z*z);
}
Because the single shared nn.Tanh object only stores the output of the last layer it was applied to, the stored output sizes don't match during the backward pass, so the error occurs. And even if the sizes did match, sharing the object would lead to wrong results.
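Applied to the network from the question, the fix is simply to construct a fresh activation module every time one is added; a minimal sketch, assuming the same variables as in the question:
perceptron = nn.Sequential()
perceptron:add(nn.Linear(input_number, hiddenUnits))
perceptron:add(nn.Tanh())                              -- fresh instance
if dropOutFlag == TRUE then perceptron:add(nn.Dropout()) end
for w = 0, hiddenLayers do
   perceptron:add(nn.Linear(hiddenUnits, hiddenUnits))
   perceptron:add(nn.Tanh())                           -- a new nn.Tanh() per layer
   if dropOutFlag == TRUE then perceptron:add(nn.Dropout()) end
end
perceptron:add(nn.Linear(hiddenUnits, output_number))
perceptron:add(nn.Tanh())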
Sorry about my English.
Surprisingly, I was able to fix my problem by eliminating the line:
act_function = nn.Tanh();
and consequently replacing every occurrence of act_function with nn.Tanh().
I do not know why, but now everything works...
So the lesson is: never assign an activation function to a variable (!?).