linear-regression with torch7 demo - lua

I am following this demo-
https://github.com/torch/demos/blob/master/linear-regression/example-linear-regression.lua
feval = function(x_new)
   -- set x to x_new, if different
   -- (in this simple example, x_new will typically always point to x,
   -- so the copy is really useless)
   if x ~= x_new then
      x:copy(x_new)
   end
   -- select a new training sample
   _nidx_ = (_nidx_ or 0) + 1
   if _nidx_ > (#data)[1] then _nidx_ = 1 end
   local sample = data[_nidx_]
   local target = sample[{ {1} }]      -- this funny looking syntax allows
   local inputs = sample[{ {2,3} }]    -- slicing of arrays.
   dl_dx:zero()
   local loss_x = criterion:forward(model:forward(inputs), target)
   model:backward(inputs, criterion:backward(model.output, target))
   return loss_x, dl_dx
end
I have a few doubts about this function:
1. Where is the argument x_new (or its copy x) used in the code?
2. What does _nidx_ = (_nidx_ or 0) + 1 mean?
3. What is the value of _nidx_ when the function is first called?
4. Where is dl_dx updated? Ideally it should have been just after local loss_x is updated, but it isn't written explicitly.
EDIT:
My point#4 is very clear now. For those who are interested-
(source- deep learning, oxford, practical 3 lab sheet)

Where is the argument x_new (or its copy x) used in the code?
x is the tensor of parameters of your model. It was previously acquired via x, dl_dx = model:getParameters(). model:forward() and model:backward() automatically use this parameter tensor. x_new is a new set of parameters for your model and is provided by the optimizer (SGD). If it is ever different from your model's parameter tensor, your model's parameters will be set to these new parameters via x:copy(x_new) (in-place copy of tensor's x_new values to x).
What does _nidx_ = (_nidx_ or 0) + 1 mean?
It increases the value of _nidx_ by 1 ((_nidx_) + 1) or sets it to 1 ((0) + 1) if _nidx_ was not yet defined.
What is the value of _nidx_ when the function is first called?
It is never set before that function. Variables which were not yet set have the value nil in Lua.
Where is dl_dx updated? Ideally it should have been just after local loss_x is updated, but it isn't written explicitly.
dl_dx is the model's tensor of gradients. model:backward() computes the gradient per parameter given a loss and adds it to the model's gradient tensor. As dl_dx is the model's gradient tensor, its values will be increased. Notice that the gradient values are added, which is why you need to call dl_dx:zero() (sets the values of dl_dx in-place to zero); otherwise your gradient values would keep accumulating with every call of feval.
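To make the accumulation point concrete, here is a minimal plain-Python sketch. It is not Torch code: TinyLinear, its fields and squared_error_grad are hypothetical stand-ins for the model, x, dl_dx and the criterion, and the numbers are made up. It only illustrates that a backward pass which adds into a gradient buffer gives a doubled gradient if the buffer is not zeroed between calls.

```python
class TinyLinear:
    """Hypothetical one-weight model: output = weight * input."""

    def __init__(self, weight):
        self.weight = weight      # plays the role of x
        self.grad_weight = 0.0    # plays the role of dl_dx

    def forward(self, inp):
        self.output = self.weight * inp
        return self.output

    def backward(self, inp, grad_output):
        # Like model:backward(): the gradient is ADDED to the buffer.
        self.grad_weight += grad_output * inp

    def zero_grad(self):
        # Like dl_dx:zero(): reset the buffer in place.
        self.grad_weight = 0.0


def squared_error_grad(output, target):
    # Derivative of (output - target)^2 with respect to output.
    return 2.0 * (output - target)


model = TinyLinear(weight=0.5)
inp, target = 2.0, 3.0

# Two evaluations WITHOUT zeroing in between: gradients pile up.
for _ in range(2):
    out = model.forward(inp)
    model.backward(inp, squared_error_grad(out, target))
accumulated = model.grad_weight

# One evaluation after zeroing: the gradient of a single call.
model.zero_grad()
out = model.forward(inp)
model.backward(inp, squared_error_grad(out, target))
single = model.grad_weight

print(accumulated, single)
```

Here accumulated comes out as twice single, which is the drift that dl_dx:zero() prevents.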

x is a global variable, see line 126. The function only seems to update it, not to use it.
This is a common lua idiom: you set things to a parameter or a default value if it is not present. Typical use in functions:
function foo(a, b)
   local a = a or 0
   local b = b or "foo"
end
The idea is that an expression using and or or evaluates to the first or the second argument, according to the values. x and y yields y if x is not nil or false and x (nil or false) otherwise.
x or y yields y if x is not present (nil or false) and x otherwise. Therefore, or is used for default arguments.
The two can be rewritten the following way:
-- x and y
if x then
   return y
else
   return x
end
-- x or y
if x then
   return x
else
   return y
end
You have _nidx_ = (_nidx_ or 0) + 1, so at the first call of the function, _nidx_ is nil, since it has been defined nowhere. After that, it is (globally) set to 1 (0 + 1).
I'm not sure what you mean exactly. It is reset in line 152 and returned by the function itself. It is a global variable, so maybe there is an outer use for it?
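For readers more used to Python, the two points about and/or and the wrapping counter can be sketched as follows. The helpers lua_truthy, lua_or and lua_and are hypothetical names written for this illustration, and n_samples is a made-up dataset size standing in for (#data)[1]. Note that Python's own or treats 0 and "" as falsy, whereas in Lua only nil and false are, so the helpers spell the Lua rule out explicitly.

```python
def lua_truthy(v):
    # In Lua only nil and false are falsy; 0 and "" count as true.
    return v is not None and v is not False


def lua_or(x, y):
    # Lua's `x or y`: x if x is truthy, otherwise y.
    return x if lua_truthy(x) else y


def lua_and(x, y):
    # Lua's `x and y`: y if x is truthy, otherwise x.
    return y if lua_truthy(x) else x


n_samples = 3   # made-up dataset size
nidx = None     # an unset Lua global reads as nil

visited = []
for _ in range(5):
    nidx = lua_or(nidx, 0) + 1   # first call: (nil or 0) + 1 == 1
    if nidx > n_samples:         # wrap around after the last sample
        nidx = 1
    visited.append(nidx)

print(visited)  # [1, 2, 3, 1, 2]
```

So successive calls of feval walk through the samples one at a time and start over after the last one.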

Related

LSTM multiple feature regression data preparation

I am modeling an LSTM model that contains multiple features and one target value. It is a regression problem.
I have doubts that my data preparation for the LSTM is erroneous; mainly because the model learns nothing but the average of the target value.
The following code I wrote is for preparing the data for the LSTM:
# df is a pandas data frame that contains the feature columns (f1 to f5) and the target value named 'target'
# all columns of the df are time series data (including the 'target')
# seq_length is the sequence length
def prepare_data_multiple_feature(df):
    X = []
    y = []
    for x in range(len(df)):
        start_id = x
        end_id = x + seq_length
        one_data_point = []
        if end_id + 1 <= len(df):
            # prepare X
            for col in ['f1', 'f2', 'f3', 'f4', 'f5']:
                one_data_point.append(np.array(df[col].values[start_id:end_id]))
            X.append(np.array(one_data_point))
            # prepare y
            y.append(np.array(df['target'].values[end_id]))
    assert len(y) == len(X)
    return X, y
Then, I reshape the data as follows:
X, y = prepare_data_multiple_feature(df)
X = X.reshape((len(X), seq_length, 5)) #5 is the number of features, i.e., f1 to f5
is my data preparation method and data reshaping correct?
As @isp-zax mentioned, please provide a reprex so we can reproduce the outcome and see where the problem lies.
As an aside, you could use for col in df.columns instead of listing all the column names, and (a minor optimisation) the outer loop should run for x in range(len(df) - seq_length); otherwise, at the end you execute the loop seq_length - 1 many times without actually processing any data. Also, df.values[a:b] will not include the element at index b, so if you want to include the "window" with the last row inside your X, end_id can be equal to len(df), i.e. you could execute your inner condition (prepare and append) for if end_id <= len(df):
Apart from that, I think it would be simpler to read if you sliced the dataframe across columns and rows at the same time, without using one_data_point, i.e.
to select seq_length rows without the (last) target column, simply do:
df.values[start_id:end_id, :-1]
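A minimal sketch of that row-and-column slicing, using plain nested lists in place of df.values so it runs without pandas. The function name make_windows and the toy rows are made up for illustration, and the target is assumed to be the last column. Each window produced this way already has the layout (seq_length, n_features). By contrast, the loop in the question builds each sample as (n_features, seq_length), and a reshape to (seq_length, 5) does not transpose it, so that may be worth checking.

```python
def make_windows(rows, seq_length):
    """Build sliding windows from rows of [f1, ..., fn, target].

    Returns X with one (seq_length x n_features) window per sample and
    y with the target of the row that follows each window.
    """
    X, y = [], []
    for start in range(len(rows) - seq_length):
        end = start + seq_length
        # Rows start..end-1, every column except the last (the target).
        X.append([row[:-1] for row in rows[start:end]])
        # Target of the row right after the window.
        y.append(rows[end][-1])
    return X, y


# Toy data: two feature columns and a target column (made-up values).
rows = [[t, 10 * t, 100 * t] for t in range(6)]
X, y = make_windows(rows, seq_length=3)

print(len(X), len(X[0]), len(X[0][0]))  # samples, seq_length, n_features
print(X[0], y[0])
```

With pandas, the same window would be df.values[start:end, :-1], and stacking the windows with np.array(X) gives a (samples, seq_length, n_features) array directly.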

Select prior probability of inclusion in CausalImpact or bsts?

In the CausalImpact package, the supplied covariates are independently selected with some prior probability M/J, where M is the expected model size and J is the number of covariates. However, on page 11 of the paper, they say to get the values by "asking about the expected model size M." I checked the documentation for CausalImpact but was unable to find any more information. Where is this done in the package? Is there a parameter I can set in a function call to specify my desired M?
You are right, this is not directly possible with CausalImpact, but it is possible. CausalImpact uses bsts behind the scenes, and that package allows you to set the parameter. So you have to define your model using bsts first, set the parameter, and then provide it to your CausalImpact call like this (modified example from the CausalImpact manual):
post.period <- c(71, 100)
post.period.response <- y[post.period[1] : post.period[2]]
y[post.period[1] : post.period[2]] <- NA
ss <- AddLocalLevel(list(), y)
bsts.model <- bsts(y ~ x1, ss, niter = 1000, expected.model.size = 4)
impact <- CausalImpact(bsts.model = bsts.model,
post.period.response = post.period.response)

Create a List and Use it in Loss Function Tensorflow

I am trying to create a list based on my neural network outputs and use it in Tensorflow as a loss function.
Assume that results is a list of size [1, batch_size] that is output by a neural network. I check whether each value of this list is in a specific range passed in as a placeholder called valid_range, and if it is, add 1 to a list. If it is not, add -1. The goal is to make all predictions of the network fall in the range, so the correct predictions are a tensor of all 1s, which I call correct_predictions.
values_list = []
for j in range(batch_size):
    a = results[0, j] >= valid_range[0]
    b = results[0, j] <= valid_range[1]
    c = tf.logical_and(a, b)
    if (c == 1):
        values_list.append(1)
    else:
        values_list.append(-1.)
values_list_tensor = tf.convert_to_tensor(values_list)
correct_predictions = tf.ones([batch_size, ], tf.float32)
Now, I want to use this as a loss function in my network, so that I can force all the predictions to be in the specified range. I try to train like this:
loss = tf.reduce_mean(tf.squared_difference(values_list_tensor, correct_predictions))
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
gradients, variables = zip(*optimizer.compute_gradients(loss))
gradients, _ = tf.clip_by_global_norm(gradients, gradient_clip_threshold)
optimize = optimizer.apply_gradients(zip(gradients, variables))
This, however, has a problem and throws an error on the last optimize line, saying:
ValueError: No gradients provided for any variable: ['<tensorflow.python.training.optimizer._RefVariableProcessor object at 0x7f0245d4afd0>',
'<tensorflow.python.training.optimizer._RefVariableProcessor object at 0x7f0245d66050>'
...
I tried to debug this in Tensorboard, and I notice that the list I am creating does not appear in the graph, so basically the x part of the loss function is not part of the network itself. Is there some way to accurately create a list based on the predictions of a neural network and use it in the loss function in Tensorflow to train the network?
Please help, I have been stuck on this for a few days now.
Edit:
Following what was suggested in the comments, I decided to use a l2 loss function, multiplying it by the binary vector I had from before values_list_tensor. The binary vector now has values 1 and 0 instead of 1 and -1. This way when the prediction is in the range the loss is 0, else it is the normal l2 loss. As I am unable to see the values of the tensors, I am not sure if this is correct. However, I can view the final loss and it is always 0, so something is wrong here. I am unsure if the multiplication is being done correctly and if values_list_tensor is calculated accurately? Can someone help and tell me what could be wrong?
loss = tf.reduce_mean(tf.nn.l2_loss(tf.matmul(tf.transpose(tf.expand_dims(values_list_tensor, 1)), tf.expand_dims(results[0, :], 1))))
Thanks
To answer the question in the comment. One way to write a piece-wise function is using tf.cond. For example, here is a function that returns 0 in [-1, 1] and x everywhere else:
sess = tf.InteractiveSession()
x = tf.placeholder(tf.float32)
y = tf.cond(tf.logical_or(tf.greater(x, 1.0), tf.less(x, -1.0)), lambda : x, lambda : 0.0)
y.eval({x: 1.5}) # prints 1.5
y.eval({x: 0.5}) # prints 0.0
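As a side note on why the original +1/-1 list cannot be trained on: a hard in-range/out-of-range indicator is flat almost everywhere, so it carries no gradient. One common alternative, named plainly here as a swapped-in technique rather than anything from the question, is a squared hinge penalty on the distance to the range. The sketch below is plain Python, not TensorFlow; range_penalty, range_penalty_grad and indicator_loss are hypothetical names used only for illustration.

```python
def range_penalty(x, low, high):
    # Zero inside [low, high]; grows with the squared distance outside it.
    below = max(0.0, low - x)
    above = max(0.0, x - high)
    return below ** 2 + above ** 2


def range_penalty_grad(x, low, high):
    # Analytic derivative of range_penalty with respect to x.
    if x < low:
        return -2.0 * (low - x)
    if x > high:
        return 2.0 * (x - high)
    return 0.0


def indicator_loss(x, low, high):
    # The question's idea: +1 inside the range, -1 outside, compared with 1.
    value = 1.0 if low <= x <= high else -1.0
    return (value - 1.0) ** 2


low, high = -1.0, 1.0
print(range_penalty(1.5, low, high), range_penalty_grad(1.5, low, high))
print(range_penalty(0.5, low, high), range_penalty_grad(0.5, low, high))
# The indicator loss does not change when x moves a little, so its
# gradient with respect to x is zero wherever it is defined.
print(indicator_loss(1.5, low, high), indicator_loss(1.6, low, high))
```

In TensorFlow the same penalty could be assembled from differentiable ops such as tf.maximum and tf.square applied to the whole results tensor, which also keeps it inside the graph.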

About feval function on tutorials/2_supervised/4_train.lua

On GitHub:
https://github.com/torch/tutorials/blob/master/2_supervised/4_train.lua
we have an example of a script defining a training procedure. I'm interested in the construction of the feval function in this script.
-- create closure to evaluate f(X) and df/dX
local feval = function(x)
   -- get new parameters
   if x ~= parameters then
      parameters:copy(x)
   end
   -- reset gradients
   gradParameters:zero()
   -- f is the average of all criterions
   local f = 0
   -- evaluate function for complete mini batch
   for i = 1,#inputs do
      -- estimate f
      local output = model:forward(inputs[i])
      local err = criterion:forward(output, targets[i])
      f = f + err
      -- estimate df/dW
      local df_do = criterion:backward(output, targets[i])
      model:backward(inputs[i], df_do)
      -- update confusion
      confusion:add(output, targets[i])
   end
   -- normalize gradients and f(X)
   gradParameters:div(#inputs)
   f = f/#inputs
   -- return f and df/dX
   return f,gradParameters
end
I tried to modify this function by removing the loop:
for i = 1,#inputs do ...
So instead of doing the forward and backward passes input by input (inputs[i]), I'm doing them for the whole mini batch (inputs). This really speeds up the process. This is the modified script:
-- create closure to evaluate f(X) and df/dX
local feval = function(x)
   -- get new parameters
   if x ~= parameters then
      parameters:copy(x)
   end
   -- reset gradients
   gradParameters:zero()
   -- f is the average of all criterions
   local f = 0
   -- evaluate function for complete mini batch
   -- estimate f
   local output = model:forward(inputs)
   local f = criterion:forward(output, targets)
   -- estimate df/dW
   local df_do = criterion:backward(output, targets)
   -- update weight
   model:backward(inputs, df_do)
   -- update confusion
   confusion:batchAdd(output, targets)
   -- return f and df/dX
   return f,gradParameters
end
But when I check in detail what feval returns (f, gradParameters) for a given mini batch, I don't get the same result with the loop and without it.
So my questions are :
1 - Why do we have this loop ?
2 - And is it possible to get the same result without this loop ?
Regards
Sam
NB: I'm a beginner in Torch7
I'm sure you noticed getting the second way to work requires a bit more than simply changing feval.
In your second example, inputs needs to be a 4D tensor, rather than a table of 3D tensors (unless something has changed since I last updated). These tensors have different sizes depending on the loss criterion/model used. Whoever implemented the example must have thought the loop was an easier way to go here. In addition, ClassNLLCriterion does not seem to like batch processing (one would usually use CrossEntropy criterion to get around this).
All of this aside though, the two methods should give the same result. The only slight difference is that the first example uses the average error/gradient, and the second uses the sum, as you can see from:
gradParameters:div(inputs:size(1))
f = f/inputs:size(1)
In the second case, f and gradParameters should differ from the first only by a factor of opt.batchSize. These are mathematically equivalent for optimization purposes (up to a rescaling of the learning rate).
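The factor can be checked on a toy problem. The sketch below is plain Python with made-up samples and a hypothetical one-weight model with squared error; it is not the tutorial's model or criterion. It compares the per-sample loop followed by a division with an unnormalized sum over the batch.

```python
# (input, target) pairs, made-up numbers
samples = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0)]
w = 0.5  # single model weight


def loss_and_grad(w, x, t):
    # Squared error of the prediction w * x and its derivative w.r.t. w.
    err = w * x - t
    return err ** 2, 2.0 * err * x


# Loop version: accumulate per sample, then divide by the batch size.
f_loop, g_loop = 0.0, 0.0
for x, t in samples:
    f, g = loss_and_grad(w, x, t)
    f_loop += f
    g_loop += g
f_loop /= len(samples)
g_loop /= len(samples)

# Batch version without normalization: plain sums over the batch.
f_sum = sum(loss_and_grad(w, x, t)[0] for x, t in samples)
g_sum = sum(loss_and_grad(w, x, t)[1] for x, t in samples)

print(f_sum / f_loop, g_sum / g_loop)  # both ratios equal the batch size
```

A plain SGD step with learning rate lr on the summed gradient therefore matches a step with learning rate lr * batch size on the averaged one.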

Custom Spatial Convolution In Torch

I need to perform a custom spatial convolution in Torch. Rather than simply multiplying each input pixel by a weight for that pixel and adding them together with the filter's bias to form each output pixel, I need to do a more complex mathematical function to the input pixels before adding them together.
I know how to do this, but I do not know a GOOD way to do this. The best way I've come up with is to take the full input tensor, create a bunch of secondary tensors that are "views" of the original without allocating additional memory, putting those into a Replicate layer (the output filter count being the replication count), and feeding that into a ParallelTable layer containing a bunch of regular layers that have their parameters shared between filters.
The trouble is, even though this is fine memory-wise with a very manageable overhead, we're talking inputwidth^inputheight^inputdepth^outputdepth mini-networks, here. Maybe there's some way to create massive "long and tall" networks that work on the entire replicated input set at once, but how do I create layers that are partially-connected (like convolutions) instead of fully-connected?
I would have liked to just use inheritance to create a special copy of the regular SpatialConvolution "class" and modify it, but I can't even try because it's implemented in an external C library. I can't just use regular layers before a regular SpatialConvolution layer because I need to do my math with different weights and biases for each filter (shared between applications of the same filter to different input coordinates).
Good question. You made me give it some serious thought.
Your approach has a flaw: it does not let you take advantage of vectorized computations, since each mini-network works independently.
My idea is as follows:
Suppose network's input and output are 2D tensors. We can produce (efficiently, without memory copying) an auxiliary 4D tensor
rf_input (kernel_size x kernel_size x output_h x output_w)
such that rf_input[:, :, k, l] is a 2D tensor of size kernel_size x kernel_size containing a receptive field which output[k, l] will be gotten from. Then we iterate over positions inside the kernel rf_input[i, j, :, :] getting pixels at position (i, j) inside all receptive fields and computing their contribution to each output[k, l] at once using vectorization.
Example:
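Let our "convolving" function be, for example, a product of tangents of sums:
$$\text{output}[k, l] = \prod_{i,j} \tan\left(x_{ij} + w_{ij}\right)$$
where $x_{ij}$ is the input pixel at position (i, j) in the receptive field of output[k, l] and $w_{ij}$ is the corresponding weight.
Then its partial derivative w.r.t. the input pixel at position (s,t) in its receptive field is
$$\frac{\partial\, \text{output}[k, l]}{\partial x_{st}} = \frac{\text{output}[k, l]}{\tan\left(x_{st} + w_{st}\right) \cos^2\left(x_{st} + w_{st}\right)}$$
Derivative w.r.t. weight is the same.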
At the end, of course, we must sum up gradients from different output[k,l] points. For example, each input[m, n] contributes to at most kernel_size^2 outputs as a part of their receptive fields, and each weight[i, j] contributes to all output_h x output_w outputs.
Simple implementation may look like this:
require 'nn'
local CustomConv, parent = torch.class('nn.CustomConv', 'nn.Module')
-- This module takes and produces a 2D map.
-- To work with multiple input/output feature maps and batches,
-- you have to iterate over them or further vectorize computations inside the loops.
function CustomConv:__init(ker_size)
   parent.__init(self)
   self.ker_size = ker_size
   self.weight = torch.rand(self.ker_size, self.ker_size):add(-0.5)
   self.gradWeight = torch.Tensor(self.weight:size()):zero()
end
function CustomConv:_get_recfield_input(input)
   -- rf_input[i][j] is a view (no copy) holding pixel (i, j) of every receptive field
   local rf_input = {}
   for i = 1, self.ker_size do
      rf_input[i] = {}
      for j = 1, self.ker_size do
         rf_input[i][j] = input[{{i, i - self.ker_size - 1}, {j, j - self.ker_size - 1}}]
      end
   end
   return rf_input
end
function CustomConv:updateOutput(_)
   local output = torch.Tensor(self.rf_input[1][1]:size())
   -- Kernel-specific: our kernel is multiplicative, so we start with ones
   output:fill(1)
   --
   for i = 1, self.ker_size do
      for j = 1, self.ker_size do
         local ker_pt = self.rf_input[i][j]:clone()
         local w = self.weight[i][j]
         -- Kernel-specific
         output:cmul(ker_pt:add(w):tan())
         --
      end
   end
   return output
end
function CustomConv:updateGradInput_and_accGradParameters(_, gradOutput)
   local gradInput = torch.Tensor(self.input:size()):zero()
   for i = 1, self.ker_size do
      for j = 1, self.ker_size do
         local ker_pt = self.rf_input[i][j]:clone()
         local w = self.weight[i][j]
         -- Kernel-specific
         -- add() works in place, so keep input + w in u and use the
         -- non-mutating torch.tan / torch.cos to build the denominator
         local u = ker_pt:add(w)
         local subGradInput = torch.cmul(gradOutput, torch.cdiv(self.output, torch.tan(u):cmul(torch.cos(u):pow(2))))
         local subGradWeight = subGradInput
         --
         gradInput[{{i, i - self.ker_size - 1}, {j, j - self.ker_size - 1}}]:add(subGradInput)
         self.gradWeight[{i, j}] = self.gradWeight[{i, j}] + torch.sum(subGradWeight)
      end
   end
   return gradInput
end
function CustomConv:forward(input)
   self.input = input
   self.rf_input = self:_get_recfield_input(input)
   self.output = self:updateOutput(_)
   return self.output
end
function CustomConv:backward(input, gradOutput)
   local gradInput = self:updateGradInput_and_accGradParameters(_, gradOutput)
   return gradInput
end
If you change this code a bit:
updateOutput:
output:fill(0)
[...]
output:add(ker_pt:mul(w))
updateGradInput_and_accGradParameters:
local subGradInput = torch.mul(gradOutput, w)
local subGradWeight = torch.cmul(gradOutput, ker_pt)
then it will work exactly as nn.SpatialConvolutionMM with zero bias (I've tested it).
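The shifted-view idea for the linear case can be sketched in plain Python to check that iterating over kernel positions gives the same result as the usual per-output loop. The names conv_naive and conv_shifted and the toy arrays are made up for illustration. Like nn's convolution modules, this computes a cross-correlation (no kernel flip), with no padding and no bias. In pure Python the inner loops are of course not vectorized; in Torch each shifted view would be handled by a single tensor operation.

```python
def conv_naive(inp, ker):
    # Per-output loop: visit each output pixel and sum over its receptive field.
    k = len(ker)
    out_h, out_w = len(inp) - k + 1, len(inp[0]) - k + 1
    out = [[0.0] * out_w for _ in range(out_h)]
    for r in range(out_h):
        for c in range(out_w):
            for i in range(k):
                for j in range(k):
                    out[r][c] += ker[i][j] * inp[r + i][c + j]
    return out


def conv_shifted(inp, ker):
    # Per-kernel-position loop: for each (i, j), take the shifted view that
    # holds pixel (i, j) of every receptive field and add its contribution
    # to all outputs at once.
    k = len(ker)
    out_h, out_w = len(inp) - k + 1, len(inp[0]) - k + 1
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(k):
        for j in range(k):
            view = [row[j:j + out_w] for row in inp[i:i + out_h]]
            for r in range(out_h):
                for c in range(out_w):
                    out[r][c] += ker[i][j] * view[r][c]
    return out


inp = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
ker = [[1, 0],
       [0, 1]]

print(conv_naive(inp, ker))
print(conv_shifted(inp, ker))
```

Both functions return the same 2 x 2 map for this input, which is the equivalence the modified module relies on.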

Resources