On GitHub:
https://github.com/torch/tutorials/blob/master/2_supervised/4_train.lua
we have an example of a script defining a training procedure. I'm interested in the construction of the feval function in this script.
-- create closure to evaluate f(X) and df/dX
local feval = function(x)
    -- get new parameters
    if x ~= parameters then
        parameters:copy(x)
    end
    -- reset gradients
    gradParameters:zero()
    -- f is the average of all criterions
    local f = 0
    -- evaluate function for complete mini batch
    for i = 1, #inputs do
        -- estimate f
        local output = model:forward(inputs[i])
        local err = criterion:forward(output, targets[i])
        f = f + err
        -- estimate df/dW
        local df_do = criterion:backward(output, targets[i])
        model:backward(inputs[i], df_do)
        -- update confusion
        confusion:add(output, targets[i])
    end
    -- normalize gradients and f(X)
    gradParameters:div(#inputs)
    f = f / #inputs
    -- return f and df/dX
    return f, gradParameters
end
I tried to modify this function by removing the loop:
for i = 1,#inputs do ...
So instead of doing the forward and backward pass input by input (inputs[i]), I do it for the whole mini-batch (inputs) at once. This really speeds up the process. This is the modified script:
-- create closure to evaluate f(X) and df/dX
local feval = function(x)
    -- get new parameters
    if x ~= parameters then
        parameters:copy(x)
    end
    -- reset gradients
    gradParameters:zero()
    -- evaluate function for the complete mini batch at once
    -- estimate f
    local output = model:forward(inputs)
    local f = criterion:forward(output, targets)
    -- estimate df/dW (backward accumulates gradients into gradParameters)
    local df_do = criterion:backward(output, targets)
    model:backward(inputs, df_do)
    -- update confusion
    confusion:batchAdd(output, targets)
    -- return f and df/dX
    return f, gradParameters
end
But when I check the return values of feval (f, gradParameters) in detail for a given mini-batch, the results with the loop and without the loop are not the same.
So my questions are:
1 - Why do we have this loop?
2 - Is it possible to get the same result without this loop?
Regards
Sam
NB: I'm a beginner in Torch7
I'm sure you noticed that getting the second way to work requires a bit more than simply changing feval.
In your second example, inputs needs to be a 4D tensor rather than a table of 3D tensors (unless something has changed since I last updated). These tensors have different sizes depending on the loss criterion/model used. Whoever implemented the example must have thought the loop was the easier way to go here. In addition, ClassNLLCriterion does not seem to like batch processing (one would usually use CrossEntropyCriterion to get around this).
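For concreteness, here is a minimal sketch of that packing step; the 3x32x32 sample size, the inputs/targets tables, and the criterion swap are my own assumptions for illustration, not code from the tutorial:

-- pack a table of N 3D samples (assumed 3x32x32 each) into one 4D batch tensor
local batch = torch.Tensor(#inputs, 3, 32, 32)
local batchTargets = torch.Tensor(#targets)
for i = 1, #inputs do
    batch[i]:copy(inputs[i])
    batchTargets[i] = targets[i] -- class indices as a 1D tensor
end
-- CrossEntropyCriterion handles batches and fuses LogSoftMax + ClassNLL,
-- so the model itself would then output raw scores (no LogSoftMax layer)
local criterion = nn.CrossEntropyCriterion()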
All of this aside though, the two methods should give the same result. The only slight difference is that the first example uses the average error/gradient, and the second uses the sum, as you can see from:
gradParameters:div(inputs:size(1))
f = f/inputs:size(1)
In the second case, f and gradParameters should therefore differ from the first only by a factor of opt.batchSize; the two are mathematically equivalent for optimization purposes.
I have been running a piece of code, train.lua, found here: https://github.com/karpathy/char-rnn/blob/master/train.lua
This is a character-level language prediction model based on SRNNs/LSTMs. It had been working perfectly fine on OS X with the CPU until I tried implementing a word-level prediction model instead; that is, the network predicts the next word, as opposed to the next character. The vocabulary size (number of possible outcomes) went up to 13320, and the number of parameters also increased to 39963. With LuaJIT, I got a "not enough memory" error, and I was looking around for a solution. I found the issue of LuaJIT's memory limit brought up here: https://github.com/karpathy/char-rnn/issues/80
So I removed Torch and installed plain Lua. However, neither Lua 5.1, 5.2, nor 5.3 worked; I ran into the same memory issue. It just says "Kill: 9" every time I run the training code. In particular, the issue arises when I get it to create T (the sequence length, i.e. the number of time steps) hidden layers, which share the same weights, using the model_utils.clone_many_times function in the util/model_utils.lua file.
In my case, the function runs up to the point where it clones 7 hidden layers, and the process is killed there. I set rnn_size and batch_size both to 1. Of course, I want to run much bigger networks, but the code still fails even at this small size.
Update:
Here is the workaround I am working on.
The cloning process seems somewhat redundant, as it stores T whole hidden layers. Maybe we can change the function so that it only carries the unit activations, as opposed to the entire layers, through the T time steps. I feel the only issue is backprop: the activation levels of the hidden units are carried over from batch to batch by the table init_state_global, so we somehow need to establish back-propagation over multiple batches.
Here is a workaround I found. Everything else equal, the results I got were almost the same as the original, except for some floating-point precision differences for some reason. It saves memory (seq_length does not even affect the memory footprint). I set the number of clones in the model_utils.clone_many_times function to 1 (so we probably don't even need this memory-consuming function anymore), and just store the hidden-unit activations for backprop.
function feval(x)
    if x ~= params then
        params:copy(x)
    end
    grad_params:zero()
    ------------------ get minibatch -------------------
    local x, y = loader:next_batch(1)
    x, y = prepro(x, y) -- seq_length by batch_size tensor
    ------------------- forward pass -------------------
    local rnn_state = {[0] = init_state_global}
    local predictions = {} -- softmax outputs
    local loss = 0
    local hidden_units = {}
    for t = 1, opt.seq_length do
        clones.rnn[1]:training() -- make sure we are in correct mode (this is cheap, sets flag)
        local lst = clones.rnn[1]:forward{x[t], unpack(rnn_state[t-1])}
        rnn_state[t] = {}
        for i = 1, #init_state do table.insert(rnn_state[t], lst[i]) end -- extract the state, without output
        -- snapshot every module's output at time t so it can be restored
        -- before backprop (we only keep a single clone)
        hidden_units[t] = {}
        local j = 1
        for k = 1, #clones.rnn[1].modules do
            if clones.rnn[1].modules[k].output then
                if not (type(clones.rnn[1].modules[k].output) == 'table') then
                    hidden_units[t][j] = clones.rnn[1].modules[k].output:clone()
                else
                    hidden_units[t][j] = {}
                    for l = 1, #clones.rnn[1].modules[k].output do
                        hidden_units[t][j][l] = clones.rnn[1].modules[k].output[l]:clone()
                    end
                end
                j = j + 1
            end
        end
        predictions[t] = lst[#lst] -- last element is the prediction
        loss = loss + clones.criterion[1]:forward(predictions[t], y[t])
    end
    loss = loss / opt.seq_length
    ------------------ backward pass -------------------
    -- initialize gradient at time t to be zeros (there's no influence from future)
    local drnn_state = {[opt.seq_length] = clone_list(init_state, true)} -- true also zeros the clones
    for t = opt.seq_length, 1, -1 do
        -- restore the stored activations for time step t, then backprop
        -- through loss and softmax/linear
        local j = 1
        for k = 1, #clones.rnn[1].modules do
            if clones.rnn[1].modules[k].output then
                clones.rnn[1].modules[k].output = hidden_units[t][j]
                j = j + 1
            end
        end
        local doutput_t = clones.criterion[1]:backward(predictions[t], y[t])
        table.insert(drnn_state[t], doutput_t)
        local dlst = clones.rnn[1]:backward({x[t], unpack(rnn_state[t-1])}, drnn_state[t])
        drnn_state[t-1] = {}
        for k, v in pairs(dlst) do
            if k > 1 then -- k == 1 is the gradient on x, which we don't need
                -- note we use k-1 because the first item is dembeddings, and then
                -- the derivatives of the state follow, starting at index 2. I know...
                drnn_state[t-1][k-1] = v
            end
        end
    end
    ------------------------ misc ----------------------
    -- transfer final state to initial state (BPTT)
    init_state_global = rnn_state[#rnn_state] -- NOTE: I don't think this needs to be a clone, right?
    -- grad_params:div(opt.seq_length) -- this line should be here, but since we use rmsprop it would have no effect; removed for efficiency
    -- clip gradient element-wise
    -- let's not clip the gradient this time: grad_params:clamp(-opt.grad_clip, opt.grad_clip)
    return loss, grad_params
end
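For context, train.lua drives this closure through the optimizer roughly as follows (a sketch; the optim_state fields follow the script's rmsprop configuration as I recall it, and the surrounding loop is abbreviated):

local optim_state = {learningRate = opt.learning_rate, alpha = opt.decay_rate}
for i = 1, iterations do
    -- rmsprop calls feval once, then updates params in place
    local _, loss = optim.rmsprop(feval, params, optim_state)
    print(string.format('iteration %d, loss = %.4f', i, loss[1]))
end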
I am following this demo:
https://github.com/torch/demos/blob/master/linear-regression/example-linear-regression.lua
feval = function(x_new)
    -- set x to x_new, if different
    -- (in this simple example, x_new will typically always point to x,
    -- so the copy is really useless)
    if x ~= x_new then
        x:copy(x_new)
    end
    -- select a new training sample
    _nidx_ = (_nidx_ or 0) + 1
    if _nidx_ > (#data)[1] then _nidx_ = 1 end
    local sample = data[_nidx_]
    local target = sample[{ {1} }]   -- this funny-looking syntax allows
    local inputs = sample[{ {2,3} }] -- slicing of arrays
    -- reset gradients, then run the forward/backward passes
    dl_dx:zero()
    local loss_x = criterion:forward(model:forward(inputs), target)
    model:backward(inputs, criterion:backward(model.output, target))
    return loss_x, dl_dx
end
I have a few doubts about this function:
Where is the argument x_new (or its copy x) used in the code?
What does _nidx_ = (_nidx_ or 0) + 1 mean?
What is the value of _nidx_ when the function is first called?
Where is dl_dx updated? Ideally it should be updated just after local loss_x is computed, but it isn't written explicitly
EDIT:
My point #4 is very clear now. For those who are interested:
(source: Deep Learning, Oxford, practical 3 lab sheet)
Where is the argument x_new (or its copy x) used in the code?
x is the tensor of parameters of your model. It was previously acquired via x, dl_dx = model:getParameters(). model:forward() and model:backward() automatically use this parameter tensor. x_new is a new set of parameters for your model and is provided by the optimizer (SGD). If it is ever different from your model's parameter tensor, your model's parameters will be set to these new parameters via x:copy(x_new) (an in-place copy of x_new's values into x).
What does _nidx_ = (_nidx_ or 0) + 1 mean?
It increases the value of _nidx_ by 1 ((_nidx_) + 1) or sets it to 1 ((0) + 1) if _nidx_ was not yet defined.
What is the value of _nidx_ when the function is first called?
It is never set before that function is called. Variables which have not yet been set have the value nil in Lua.
Where is dl_dx updated? Ideally it should be updated just after local loss_x is computed, but it isn't written explicitly
dl_dx is the model's tensor of gradients. model:backward() computes the gradient per parameter given a loss and adds it to the model's gradient tensor. As dl_dx is the model's gradient tensor, its values will be increased. Notice that the gradient values are added, which is why you need to call dl_dx:zero() (which sets the values of dl_dx to zero in place); otherwise your gradient values would keep increasing with every call of feval.
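A quick standalone way to see this accumulation behaviour (a sketch with an assumed toy model, not code from the demo):

require 'nn'
local model = nn.Linear(2, 1)
local x, dl_dx = model:getParameters()
local input, gradOut = torch.rand(2), torch.ones(1)
dl_dx:zero()
model:forward(input)
model:backward(input, gradOut)
local once = dl_dx:clone()
model:backward(input, gradOut)          -- second backward, no zero() in between
print((dl_dx - once * 2):abs():max())   -- 0: gradients were added, not replaced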
x is a global variable; see line 126 of the demo script. The function only seems to update it, not to use it.
This is a common Lua idiom: you set things to a parameter, or to a default value if the parameter is not present. Typical use in functions:
function foo(a, b)
local a = a or 0
local b = b or "foo"
end
The idea is that an expression using and or or evaluates to one of its two operands, depending on their values: x and y yields y if x is neither nil nor false, and yields x (i.e. nil or false) otherwise.
x or y yields y if x is not present (nil or false), and yields x otherwise. Therefore, or is used for default arguments.
The two can be rewritten the following way:
-- x and y
if x then
return y
else
return x
end
-- x or y
if x then
return x
else
return y
end
You have _nidx_ = (_nidx_ or 0) + 1, so at the first call of the function, _nidx_ is nil, since it has been defined nowhere. After that, it is (globally) set to 1 (0 + 1).
I'm not sure what you mean exactly. It is reset in line 152 of the demo and returned by the function itself. It is a global variable, so maybe there is an outer use for it?
I need to perform a custom spatial convolution in Torch. Rather than simply multiplying each input pixel by a weight for that pixel and adding them together with the filter's bias to form each output pixel, I need to do a more complex mathematical function to the input pixels before adding them together.
I know how to do this, but I do not know a GOOD way to do this. The best way I've come up with is to take the full input tensor, create a bunch of secondary tensors that are "views" of the original without allocating additional memory, put those into a Replicate layer (the output filter count being the replication count), and feed that into a ParallelTable layer containing a bunch of regular layers that have their parameters shared between filters.
The trouble is, even though this is fine memory-wise with a very manageable overhead, we're talking inputwidth x inputheight x inputdepth x outputdepth mini-networks here. Maybe there's some way to create massive "long and tall" networks that work on the entire replicated input set at once, but how do I create layers that are partially connected (like convolutions) instead of fully connected?
I would have liked to just use inheritance to create a special copy of the regular SpatialConvolution "class" and modify it, but I can't even try because it's implemented in an external C library. I can't just use regular layers before a regular SpatialConvolution layer because I need to do my math with different weights and biases for each filter (shared between applications of the same filter to different input coordinates).
Good question. You made me give this some serious thought.
Your approach has a flaw: it does not allow you to take advantage of vectorized computation, since each mini-network works independently.
My idea is as follows:
Suppose the network's input and output are 2D tensors. We can produce (efficiently, without memory copying) an auxiliary 4D tensor
rf_input (kernel_size x kernel_size x output_h x output_w)
such that rf_input[:, :, k, l] is a 2D tensor of size kernel_size x kernel_size containing the receptive field from which output[k, l] is computed. Then we iterate over the positions (i, j) inside the kernel: rf_input[i, j, :, :] gives the pixels at position (i, j) inside all receptive fields, so we can compute their contribution to every output[k, l] at once using vectorization.
Example:
Let our "convolving" function be, for example, a product of tangents of sums (with $K$ = kernel_size):
$$\mathrm{output}[k, l] = \prod_{i=1}^{K} \prod_{j=1}^{K} \tan\big(\mathrm{input}[k+i-1,\; l+j-1] + \mathrm{weight}[i, j]\big)$$
Then its partial derivative w.r.t. the input pixel $x_{st}$ at position $(s, t)$ in its receptive field is
$$\frac{\partial\, \mathrm{output}[k, l]}{\partial x_{st}} = \frac{\mathrm{output}[k, l]}{\tan(x_{st} + w_{st})} \cdot \frac{1}{\cos^2(x_{st} + w_{st})}$$
The derivative w.r.t. the weight $w_{st}$ has the same form.
At the end, of course, we must sum up the gradients coming from the different output[k, l] points: each input[m, n] contributes to at most kernel_size^2 outputs as part of their receptive fields, and each weight[i, j] contributes to all output_h x output_w outputs.
A simple implementation may look like this:
require 'nn'

local CustomConv, parent = torch.class('nn.CustomConv', 'nn.Module')

-- This module takes and produces a 2D map.
-- To work with multiple input/output feature maps and batches,
-- you have to iterate over them or further vectorize computations inside the loops.

function CustomConv:__init(ker_size)
    parent.__init(self)
    self.ker_size = ker_size
    self.weight = torch.rand(self.ker_size, self.ker_size):add(-0.5)
    self.gradWeight = torch.Tensor(self.weight:size()):zero()
end

-- rf_input[i][j] is a view (no copy) of the pixels at offset (i, j)
-- inside all receptive fields; negative end indices count from the end
function CustomConv:_get_recfield_input(input)
    local rf_input = {}
    for i = 1, self.ker_size do
        rf_input[i] = {}
        for j = 1, self.ker_size do
            rf_input[i][j] = input[{{i, i - self.ker_size - 1}, {j, j - self.ker_size - 1}}]
        end
    end
    return rf_input
end

function CustomConv:updateOutput(_)
    local output = torch.Tensor(self.rf_input[1][1]:size())
    -- Kernel-specific: our kernel is multiplicative, so we start with ones
    output:fill(1)
    for i = 1, self.ker_size do
        for j = 1, self.ker_size do
            local ker_pt = self.rf_input[i][j]:clone()
            local w = self.weight[i][j]
            -- Kernel-specific
            output:cmul(ker_pt:add(w):tan())
        end
    end
    return output
end

function CustomConv:updateGradInput_and_accGradParameters(_, gradOutput)
    local gradInput = torch.Tensor(self.input:size()):zero()
    for i = 1, self.ker_size do
        for j = 1, self.ker_size do
            local w = self.weight[i][j]
            local xw = self.rf_input[i][j]:clone():add(w) -- x + w
            -- Kernel-specific: d output / d x = output / (tan(x+w) * cos(x+w)^2).
            -- Note: computed without chaining in-place methods on one tensor,
            -- which would silently reuse already-modified values
            local denom = torch.tan(xw):cmul(torch.cos(xw):pow(2))
            local subGradInput = torch.cmul(gradOutput, torch.cdiv(self.output, denom))
            local subGradWeight = subGradInput
            gradInput[{{i, i - self.ker_size - 1}, {j, j - self.ker_size - 1}}]:add(subGradInput)
            self.gradWeight[{i, j}] = self.gradWeight[{i, j}] + torch.sum(subGradWeight)
        end
    end
    return gradInput
end

function CustomConv:forward(input)
    self.input = input
    self.rf_input = self:_get_recfield_input(input)
    self.output = self:updateOutput(_)
    return self.output
end

function CustomConv:backward(input, gradOutput)
    local gradInput = self:updateGradInput_and_accGradParameters(_, gradOutput)
    return gradInput
end
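Example usage might look like this (a sketch; the 8x8 input size is an arbitrary assumption):

local conv = nn.CustomConv(3)
local input = torch.rand(8, 8)
local output = conv:forward(input)           -- a 6x6 map for a 3x3 kernel
local gradOutput = torch.rand(output:size())
local gradInput = conv:backward(input, gradOutput)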
If you change this code a bit:
updateOutput:
output:fill(0)
[...]
output:add(ker_pt:mul(w))
updateGradInput_and_accGradParameters:
local subGradInput = torch.mul(gradOutput, w)
local subGradWeight = torch.cmul(gradOutput, ker_pt)
then it will behave exactly like nn.SpatialConvolutionMM with zero bias (I've tested it).
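For reference, the check I mean looks roughly like this (a sketch; the weight reshape assumes SpatialConvolutionMM's flattened kH x kW layout, and the sizes are arbitrary):

local ker = 3
local custom = nn.CustomConv(ker)    -- with the linear-kernel changes above
local ref = nn.SpatialConvolutionMM(1, 1, ker, ker)
ref.weight:copy(custom.weight:view(1, ker * ker))
ref.bias:zero()
local input = torch.rand(8, 8)
local diff = custom:forward(input) - ref:forward(input:view(1, 8, 8)):squeeze()
print(diff:abs():max())              -- should be ~0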
I have a 2x16x3x10x10 tensor that I feed into my network. My network has two parts that work in parallel. The first part takes a 16x3x10x10 tensor and computes the sum over the last two dimensions, returning a 16x3 tensor.
The second part is a convolutional neural network that produces a 16x160 tensor.
Whenever I try to run this model, I get the following error:
...903/nTorch/Torch7/install/share/lua/5.1/torch/Tensor.lua:457: expecting a contiguous tensor
stack traceback:
[C]: in function 'assert'
...903/nTorch/Torch7/install/share/lua/5.1/torch/Tensor.lua:457: in function 'view'
...8/osu7903/nTorch/Torch7/install/share/lua/5.1/nn/Sum.lua:26: in function 'updateGradInput'
...03/nTorch/Torch7/install/share/lua/5.1/nn/Sequential.lua:40: in function 'updateGradInput'
...7903/nTorch/Torch7/install/share/lua/5.1/nn/Parallel.lua:52: in function 'updateGradInput'
...su7903/nTorch/Torch7/install/share/lua/5.1/nn/Module.lua:30: in function 'backward'
...03/nTorch/Torch7/install/share/lua/5.1/nn/Sequential.lua:73: in function 'backward'
./train_v2_with_batch.lua:144: in function 'opfunc'
...su7903/nTorch/Torch7/install/share/lua/5.1/optim/sgd.lua:43: in function 'sgd'
./train_v2_with_batch.lua:160: in function 'train'
run.lua:93: in main chunk
[C]: in function 'dofile'
...rch/Torch7/install/lib/luarocks/rocks/trepl/scm-1/bin/th:131: in main chunk
[C]: at 0x00405800
Here is the relevant part of the model:
local first_part = nn.Parallel(1,2)
local CNN = nn.Sequential()
local sums = nn.Sequential()
sums:add(nn.Sum(3))
sums:add(nn.Sum(3))
first_part:add(sums)
-- stage 1: conv+max
CNN:add(nn.SpatialConvolutionMM(nfeats, convDepth_L1,receptiveFieldWidth_L1,receptiveFieldHeight_L1))
-- Since the default stride of the receptive field is 1, then
-- (assuming receptiveFieldWidth_L1 = receptiveFieldHeight_L1 = 3) the number of receptive fields is (10-3+1)x(10-3+1) or 8x8
-- so the output volume is (convDepth_L1 X 8 X 8) or 10 x 8 x 8
--CNN:add(nn.Threshold())
CNN:add(nn.ReLU())
CNN:add(nn.SpatialMaxPooling(poolsize,poolsize,poolsize,poolsize))
-- if poolsize=2, then the output of this is 10x4x4
CNN:add(nn.Reshape(convDepth_L1*outputWdith_L2*outputWdith_L2,true))
first_part:add(CNN)
The code works when the input tensor is 2x1x3x10x10, but not when the tensor is 2x16x3x10x10.
Edit: I only just realized that this happens when I do model:backward and not model:forward. Here is the relevant code:
local y = model:forward(x)
local E = loss:forward(y,yt)
-- estimate df/dW
local dE_dy = loss:backward(y,yt)
print(dE_dy)
model:backward(x,dE_dy)
x is a 2x16x3x10x10 tensor and dE_dy is 16x2.
This is a flaw in the torch.nn library. To perform a backward step, nn.Parallel splits the gradOutput it receives from the higher module into pieces and sends them to its parallel submodules. The splitting is done efficiently, without copying memory, and thus those pieces are non-contiguous (unless you split along the 1st dimension):
local first_part = nn.Parallel(1,2)
-- ^
-- Merging on the 2nd dimension;
-- Chunks of splitted gradOutput will not be contiguous
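To see the problem in isolation, a quick standalone sketch:

local t = torch.rand(4, 6)
local chunk = t:narrow(2, 1, 3)  -- a view of the first 3 columns, no copy
print(chunk:isContiguous())      -- false: row elements are strided in storage
-- chunk:view(4, 3, 1)           -- would raise "expecting a contiguous tensor"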
The problem is that nn.Sum cannot work with non-contiguous gradOutput. I haven't got a better idea than to make changes to it:
Sum_nc, _ = torch.class('nn.Sum_nc', 'nn.Sum')

function Sum_nc:updateGradInput(input, gradOutput)
    local size = input:size()
    size[self.dimension] = 1
    -- modified code:
    if gradOutput:isContiguous() then
        gradOutput = gradOutput:view(size) -- doesn't work with non-contiguous tensors
    else
        gradOutput = gradOutput:resize(size) -- slower because of memory reallocation and changes gradOutput
        -- gradOutput = gradOutput:clone():resize(size) -- doesn't change gradOutput; safer and even slower
    end
    --
    self.gradInput:resizeAs(input)
    self.gradInput:copy(gradOutput:expandAs(input))
    return self.gradInput
end
[...]
sums = nn.Sequential()
sums:add(nn.Sum_nc(3)) -- <- will use torch.view
sums:add(nn.Sum_nc(3)) -- <- will use torch.resize
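If you'd rather not subclass at all, a one-line alternative (a sketch; it pays for a copy whenever the tensor is non-contiguous) is to force contiguity before the view:

-- inside Sum:updateGradInput, in place of the bare view:
gradOutput = gradOutput:contiguous():view(size)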
I'm implementing a deep neural network in Torch7 with a dataset made of two torch.Tensor() objects.
The first is made of 12 elements (completeTable), the other of 1 element (presentValue).
Each dataset row is an array of these two tensors:
dataset[p] = {torch.Tensor(completeTable[p]), torch.Tensor(presentValue)};
Everything works for the neural network training and testing.
But now I want to switch to using only half of the 12 elements of completeTable, that is, only 6 elements (firstChromRegionProfile).
dataset_firstChromRegion[p] = {torch.Tensor(firstChromRegionProfile), torch.Tensor(presentValue)};
If I run the same neural network architecture with this new dataset, it does not work: the trainer:train(dataset_firstChromRegion) call fails with a "size mismatch" error.
Here's my neural network function:
-- Neural network application
function neuralNetworkApplication(input_number, output_number, datasetTrain, datasetTest, dropOutFlag, hiddenUnits, hiddenLayers)
    require "nn"
    -- act_function = nn.Sigmoid();
    act_function = nn.Tanh();
    print('input_number ' .. input_number);
    print('output_number ' .. output_number);
    -- NEURAL NETWORK CREATION - <START>
    perceptron = nn.Sequential(); -- make a multi-layer perceptron
    perceptron:add(nn.Linear(input_number, hiddenUnits));
    perceptron:add(act_function);
    if dropOutFlag == TRUE then perceptron:add(nn.Dropout()) end -- DROPOUT
    -- we add w layers DEEP LEARNING
    for w = 0, hiddenLayers do
        perceptron:add(nn.Linear(hiddenUnits, hiddenUnits)) -- DEEP LEARNING layer
        perceptron:add(act_function); -- DEEP LEARNING
        if dropOutFlag == TRUE then
            perceptron:add(nn.Dropout()) -- DROPOUT
        end
    end
    print('\n#datasetTrain ' .. #datasetTrain);
    print('#datasetTrain[1] ' .. #datasetTrain[1]);
    print('(#datasetTrain[1][1])[1] ' .. (#datasetTrain[1][1])[1]);
    print('\n#datasetTest ' .. #datasetTest);
    print('#datasetTest[1] ' .. #datasetTest[1]);
    print('(#datasetTest[1][1])[1] ' .. (#datasetTest[1][1])[1]);
    perceptron:add(nn.Linear(hiddenUnits, output_number));
    perceptron:add(act_function);
    criterion = nn.MSECriterion(); -- MSE: Mean Square Error
    trainer = nn.StochasticGradient(perceptron, criterion)
    trainer.learningRate = LEARNING_RATE_CONST;
    trainer:train(datasetTrain);
    idp = 3;
    predValueVector = {}
    for i = 1, (#datasetTest) do
        pred = perceptron:forward(datasetTest[i][1]); -- get the prediction of the perceptron
        predValueVector[i] = pred[1];
    end
    -- NEURAL NETWORK CREATION - <END>
    return predValueVector;
end
Here's the error log:
input_number 6
output_number 1
#datasetTrain 13416
#datasetTrain[1] 2
(#datasetTrain[1][1])[1] 6
#datasetTest 3354
#datasetTest[1] 2
(#datasetTest[1][1])[1] 6
# StochasticGradient: training
/mnt/work1/software/torch/7/bin/luajit: /mnt/work1/software/torch/7/share/lua/5.1/nn/Linear.lua:71: size mismatch
stack traceback:
[C]: in function 'addmv'
/mnt/work1/software/torch/7/share/lua/5.1/nn/Linear.lua:71: in function 'updateGradInput'
/mnt/work1/software/torch/7/share/lua/5.1/nn/Sequential.lua:36: in function 'updateGradInput'
...software/torch/7/share/lua/5.1/nn/StochasticGradient.lua:37: in function 'train'
siamese_neural_network.lua:278: in function 'neuralNetworkApplication'
siamese_neural_network.lua:223: in function 'kfold_cross_validation_separate'
siamese_neural_network.lua:753: in main chunk
[C]: in function 'dofile'
...1/software/torch/7/lib/luarocks/rocks/trepl/scm-1/bin/th:131: in main chunk
[C]: at 0x004057d0
All of your activation layers share the same nn.Tanh() object. That is the problem. Try something like this instead:
act_function = nn.Tanh
perceptron:add( act_function() )
Why?
To perform a backward-propagation step, we have to compute the gradient of the layer w.r.t. its input. In our case:
tanh'(input) = 1 - tanh(input)^2
Notice that tanh(input) is exactly the output of the layer's forward step. You can store this output inside the layer and use it during the backward pass to speed up training. This is exactly what happens inside the nn library:
// torch/nn/generic/Tanh.c/Tanh_updateGradInput:
for (i = 0; i < THTensor_(nElement)(gradInput); i++)
{
    real z = ptr_output[i];
    ptr_gradInput[i] = ptr_gradOutput[i] * (1. - z*z);
}
The output sizes of your activation layers don't match, so an error occurs. And even if they did match, reusing the stored output would lead to wrong results.
Sorry about my English.
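A minimal repro of the pitfall (a sketch with assumed layer sizes, separate from the code above):

require 'nn'
local act = nn.Tanh()              -- ONE module object...
local net = nn.Sequential()
net:add(nn.Linear(6, 4)):add(act)  -- ...used here
net:add(nn.Linear(4, 2)):add(act)  -- ...and here again
net:forward(torch.rand(6))
-- act.output now holds only the last (size-2) activation; backward through
-- the first occurrence uses that stale output, and the preceding nn.Linear
-- then fails with a size mismatch, just like in the traceback above
net:backward(torch.rand(6), torch.rand(2))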
Surprisingly, I was able to fix my problem by eliminating the line:
act_function = nn.Tanh();
and consequently replacing every occurrence of act_function with nn.Tanh().
I don't know why, but now everything works...
So the lesson is: never assign an activation-function object to a variable and reuse it (!?).