Julia ReverseDiff: how to take a gradient w.r.t. only a subset of inputs? - machine-learning

In my data flow, I'm querying a small subset of a database, using those results to construct about a dozen arrays, and then, given some parameter values, computing a likelihood value. Then repeating for a subset of the database. I want to compute the gradient of the likelihood function with respect to the parameters but not the data. But ReverseDiff computes the gradient with respect to all inputs. How can I get around this? Specifically, how can I construct a ReverseDiff.Tape object
TL;DR: How to marry stochastic gradient descent and ReverseDiff? (I'm not wedded to using ReverseDiff. It just seemed like the right tool for the job.)
It seems like this has to be a common coding pattern. It's used all the time in my field. But I'm missing something. Julia's scoping rules seem to undermine the scoped/anonymous function approach, and ReverseDiff is holding on to the original data values in the tape generated instead of using the mutated values.
some sample code of things that don't work
using ReverseDiff
using Base.Test
mutable struct data
X::Array{Float64, 2}
end
const D = data(zeros(Float64, 2, 2))
# baseline known data to compare against
function f1(params)
X = float.([1 2; 3 4])
f2(params, X)
end
# X is data, want derivative wrt to params only
function f2(params, X)
sum(params[1]' * X[:, 1] - (params[1] .* params[2])' * X[:, 2].^2)
end
# store data of interest in D.X so that we can call just f2(params) and get our
# gradient
f2(params) = f2(params, D.X)
# use an inner function and swap out Z's data
function scope_test()
function f2_only_params(params)
f2(params, Z)
end
Z = float.([6 7; 1 3])
f2_tape = ReverseDiff.GradientTape(f2_only_params, [1, 2])
Z[:] = float.([1 2; 3 4])
grad = ReverseDiff.gradient!(f2_tape, [3,4])
return grad
end
function struct_test()
D.X[:] = float.([6 7; 1 3])
f2_tape = ReverseDiff.GradientTape(f2, [1., 2.])
D.X[:] = float.([1 2; 3 4])
grad = ReverseDiff.gradient!(f2_tape, [3., 4.])
return grad
end
function struct_test2()
D.X[:] = float.([1 2; 3 4])
f2_tape = ReverseDiff.GradientTape(f2, [3., 4.])
D.X[:] = float.([1 2; 3 4])
grad = ReverseDiff.gradient!(f2_tape, [3., 4.])
return grad
end
D.X[:] = float.([1 2; 3 4])
#test f1([3., 4.]) == f2([3., 4.], D.X)
#test f1([3., 4.]) == f2([3., 4.])
f1_tape = ReverseDiff.GradientTape(f1, [3,4])
f1_grad = ReverseDiff.gradient!(f1_tape, [3,4])
# fails! uses line 33 values
#test scope_test() == f1_grad
# fails, uses line 42 values
#test struct_test() == f1_grad
# succeeds, so, not completely random
#test struct_test2() == f1_grad

This is currently not possible (sadly). And there is a GitHub issue with the two work-arounds:
https://github.com/JuliaDiff/ReverseDiff.jl/issues/36
either do not use a prerecorded tape
or differentiate relative to all arguments and ignore the gradient for some of the input parameters.
I had the same issue, and I used the grad function of Knet instead. I supports only the differentiation relative to one argument, but this argument can be quite flexible (e.g. an array of arrays or an dict or arrays).

Thanks Alex, your answer was 90% of the way there. AutoGrad (what Knet is using at the time of writing) does provide a very nice interface that is natural I think for most users. However, it turns out that using anonymous functions with ReverseDiff is faster than the approach taken by AutoGrad, for reasons I don't quite understand.
If you follow the chain of issues referenced in what you linked, this seems to be what the ReverseDiff/ForwardDiff folks want people doing:
ReverseDiff.gradient(p -> f(p, non_differentiated_data), params)
Certainly disappointing that we can't get a precompiled tape with this incredibly common usage scenario, and maybe future work will change things. But this seems to be where things stand now.
Some references for those interested in further reading:
https://github.com/JuliaDiff/ForwardDiff.jl/issues/77
https://github.com/JuliaDiff/ForwardDiff.jl/issues/32
https://github.com/JuliaDiff/ForwardDiff.jl/pull/182

Related

How to Record Variables in Pytorch Without Breaking Gradient Computation?

I am trying to implement some policy gradient training, similar to this. However, I would like to manipulate the rewards(like discounted future sum and other differentiable operations) before doing the backward propagation.
Consider the manipulate function defined to calculate the reward to go:
def manipulate(reward_pool):
n = len(reward_pool)
R = np.zeros_like(reward_pool)
for i in reversed(range(n)):
R[i] = reward_pool[i] + (R[i+1] if i+1 < n else 0)
return T.as_tensor(R)
I tried to store the rewards in a list:
#pseudocode
reward_pool = [0 for i in range(batch_size)]
for k in batch_size:
act = net(state)
state, reward = env.step(act)
reward_pool[k] = reward
R = manipulate(reward_pool)
R.backward()
optimizer.step()
It seems like inplace operation breaks the gradient computation, the code gives me an error: one of the variables needed for gradient computation has been modified by an inplace operation.
I also tried to initialize an empty tensor first, and store it in the tensor, but inplace operation is still the issue - a view of a leaf Variable that requires grad is being used in an in-place operation.
I am kind of new to PyTorch. Does anyone know what the right way recording rewards is in this case?
The issue is due to assignment to existing object.
Simply initialize the empty pool(list) for each iteration and append to the pool when new reward is calculated, i.e.
reward_pool = []
for k in batch_size:
act = net(state)
state, reward = env.step(act)
reward_pool.append(reward)
R = manipulate(reward_pool)
R.backward()
optimizer.step()

Vectorizing distance to several points on Octave (Matlab)

I'm writing a k-means algorithm. At each step, I want to compute the distance of my n points to k centroids, without a for loop, and for d dimensions.
The problem is I have a hard time splitting on my number of dimensions with the Matlab functions I know. Here is my current code, with x being my n 2D-points and y my k centroids (also 2D-points of course), and with the points distributed along dimension 1, and the spatial coordinates along the dimension 2:
dist = #(a,b) (a - b).^2;
dx = bsxfun(dist, x(:,1), y(:,1)'); % x is (n,1) and y is (1,k)
dy = bsxfun(dist, x(:,2), y(:,2)'); % so the result is (n,k)
dists = dx + dy; % contains the square distance of each points to the k centroids
[_,l] = min(dists, [], 2); % we then argmin on the 2nd dimension
How to vectorize furthermore ?
First edit 3 days later, searching on my own
Since asking this question I made progress on my own towards vectorizing this piece of code.
The code above runs in approximately 0.7 ms on my example.
I first used repmat to make it easy to do broadcasting:
dists = permute(permute(repmat(x,1,1,k), [3,2,1]) - y, [3,2,1]).^2;
dists = sum(dists, 2);
[~,l] = min(dists, [], 3);
As expected it is slightly slower since we replicate the matrix, it runs at 0.85 ms.
From this example it was pretty easy to use bsxfun for the whole thing, but it turned out to be extremely slow, running in 150 ms so more than 150 times slower than the repmat version:
dist = #(a, b) (a - b).^2;
dists = permute(bsxfun(dist, permute(x, [3, 2, 1]), y), [3, 2, 1]);
dists = sum(dists, 2);
[~,l] = min(dists, [], 3);
Why is it so slow ? Isn't vectorizing always an improvement on speed, since it uses vector instructions on the CPU ? I mean of course simple for loops could be optimized to use it aswell, but how can vectorizing make the code slower ? Did I do it wrong ?
Using a for loop
For the sake of completeness, here's the for loop version of my code, surprisingly the fastest running in 0.4 ms, not sure why..
for i=1:k
dists(:,i) = sum((x - y(i,:)).^2, 2);
endfor
[~,l] = min(dists, [], 2);
Note: This answer was written when the question was also tagged MATLAB. Links to Octave documentation added after the MATLAB tag was removed.
You can use the pdist2MATLAB/Octave function to calculate pairwise distances between two sets of observations.
This way, you offload the bother of vectorization to the people who wrote MATLAB/Octave (and they have done a pretty good job of it)
X = rand(10,3);
Y = rand(5,3);
D = pdist2(X, Y);
D is now a 10x5 matrix where the i, jth element is the distance between the ith X and jth Y point.
You can pass it the kind of distance you want as the third argument -- e.g. 'euclidean', 'minkowski', etc, or you could pass a function handle to your custom function like so:
dist = #(a,b) (a - b).^2;
D = pdist2(X, Y, dist);
As saastn mentions, pdist2(..., 'smallest', k) makes things easier in k-means. This returns just the smallest k values from each column of pdist2's result. Octave doesn't have this functionality, but it's easily replicated using sort()MATLAB/Octave.
D_smallest = sort(D);
D_smallest = D_smallest(1:k, :);

Using quantile in Flux (Julia) in loss function

I am trying to use quantile in a loss function to train! (for some robustness, like least trimmed squares), but it mutates the array and Zygote throws an error Mutating arrays is not supported, coming from sort! . Below is a simple example (the content does not make sense of course):
using Flux, StatsBase
xdata = randn(2, 100)
ydata = randn(100)
model = Chain(Dense(2,10), Dense(10, 1))
function trimmedLoss(x,y; trimFrac=0.f05)
yhat = model(x)
absRes = abs.(yhat .- y) |> vec
trimVal = quantile(absRes, 1.f0-trimFrac)
s = sum(ifelse.(absRes .> trimVal, 0.f0 , absRes ))/(length(absRes)*(1.f0-trimFrac))
#s = sum(absRes)/length(absRes) # using this and commenting out the two above works (no surprise)
end
println(trimmedLoss(xdata, ydata)) #works ok
Flux.train!(trimmedLoss, params(model), zip([xdata], [ydata]), ADAM())
println(trimmedLoss(xdata, ydata)) #changed loss?
This is all in Flux 0.10 with Julia 1.2
Thanks in advance for any hints or workaround!
Ideally, we'd define a custom adjoint for quantile so that this works out of the box. (Feel free to open an issue to remind us to do this.)
In the mean time there's a quick workaround. It's actually the sorting that causes trouble here so if you do quantile(xs, p, sorted=true) it'll work. Obviously this requires xs to be sorted to get correct results, so you might need to use quantile(sort(xs), ...).
Depending on your Zygote version you might also need an adjoint for sort. That one's pretty easy:
julia> using Zygote: #adjoint
julia> #adjoint function sort(x)
p = sortperm(x)
x[p], x̄ -> (x̄[invperm(p)],)
end
julia> gradient(x -> quantile(sort(x), 0.5, sorted=true), [1, 2, 3, 3])
([0.0, 0.5, 0.5, 0.0],)
We'll make that built-in in the next Zygote release, but for now if you add that to your script it'll get your code working.

Create a List and Use it in Loss Function Tensorflow

I am trying to create a list based on my neural network outputs and use it in Tensorflow as a loss function.
Assume that results is list of size [1, batch_size] that is output by a neural network. I check to see whether the first value of this list is in a specific range passed in as a placeholder called valid_range, and if it is add 1 to a list. If it is not, add -1. The goal is to make all predictions of the network in the range, so the correct predictions is a tensor of all 1, which I call correct_predictions.
values_list = []
for j in range(batch_size):
a = results[0, j] >= valid_range[0]
b = result[0, j] <= valid_range[1]
c = tf.logical_and(a, b)
if (c == 1):
values_list.append(1)
else:
values_list.append(-1.)
values_list_tensor = tf.convert_to_tensor(values_list)
correct_predictions = tf.ones([batch_size, ], tf.float32)
Now, I want to use this as a loss function in my network, so that I can force all the predictions to be in the specified range. I try to train like this:
loss = tf.reduce_mean(tf.squared_difference(values_list_tensor, correct_predictions))
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
gradients, variables = zip(*optimizer.compute_gradients(loss))
gradients, _ = tf.clip_by_global_norm(gradients, gradient_clip_threshold)
optimize = optimizer.apply_gradients(zip(gradients, variables))
This, however, has a problem and throws an error on the last optimize line, saying:
ValueError: No gradients provided for any variable: ['<tensorflow.python.training.optimizer._RefVariableProcessor object at 0x7f0245d4afd0>',
'<tensorflow.python.training.optimizer._RefVariableProcessor object at 0x7f0245d66050>'
...
I tried to debug this in Tensorboard, and I notice that the list I am creating does not appear in the graph, so basically the x part of the loss function is not part of the network itself. Is there some way to accurately create a list based on the predictions of a neural network and use it in the loss function in Tensorflow to train the network?
Please help, I have been stuck on this for a few days now.
Edit:
Following what was suggested in the comments, I decided to use a l2 loss function, multiplying it by the binary vector I had from before values_list_tensor. The binary vector now has values 1 and 0 instead of 1 and -1. This way when the prediction is in the range the loss is 0, else it is the normal l2 loss. As I am unable to see the values of the tensors, I am not sure if this is correct. However, I can view the final loss and it is always 0, so something is wrong here. I am unsure if the multiplication is being done correctly and if values_list_tensor is calculated accurately? Can someone help and tell me what could be wrong?
loss = tf.reduce_mean(tf.nn.l2_loss(tf.matmul(tf.transpose(tf.expand_dims(values_list_tensor, 1)), tf.expand_dims(result[0, :], 1))))
Thanks
To answer the question in the comment. One way to write a piece-wise function is using tf.cond. For example, here is a function that returns 0 in [-1, 1] and x everywhere else:
sess = tf.InteractiveSession()
x = tf.placeholder(tf.float32)
y = tf.cond(tf.logical_or(tf.greater(x, 1.0), tf.less(x, -1.0)), lambda : x, lambda : 0.0)
y.eval({x: 1.5}) # prints 1.5
y.eval({x: 0.5}) # prints 0.0

Custom Spatial Convolution In Torch

I need to perform a custom spatial convolution in Torch. Rather than simply multiplying each input pixel by a weight for that pixel and adding them together with the filter's bias to form each output pixel, I need to do a more complex mathematical function to the input pixels before adding them together.
I know how to do this, but I do not know a GOOD way to do this. The best way I've come up with is to take the full input tensor, create a bunch of secondary tensors that are "views" of the original without allocating additional memory, putting those into a Replicate layer (the output filter count being the replication count), and feeding that into a ParallelTable layer containing a bunch of regular layers that have their parameters shared between filters.
The trouble is, even though this is fine memory-wise with a very manageable overhead, we're talking inputwidth^inputheight^inputdepth^outputdepth mini-networks, here. Maybe there's some way to create massive "long and tall" networks that work on the entire replicated input set at once, but how do I create layers that are partially-connected (like convolutions) instead of fully-connected?
I would have liked to just use inheritance to create a special copy of the regular SpatialConvolution "class" and modify it, but I can't even try because it's implemented in an external C library. I can't just use regular layers before a regular SpatialConvolution layer because I need to do my math with different weights and biases for each filter (shared between applications of the same filter to different input coordinates).
Good question. You made me give some serious thought.
Your approach has a flaw: it does not allow to take advantage of vectorized computations since each mini-network works independently.
My idea is as follows:
Suppose network's input and output are 2D tensors. We can produce (efficiently, without memory copying) an auxiliary 4D tensor
rf_input (kernel_size x kernel_size x output_h x output_w)
such that rf_input[:, :, k, l] is a 2D tensor of size kernel_size x kernel_size containing a receptive field which output[k, l] will be gotten from. Then we iterate over positions inside the kernel rf_input[i, j, :, :] getting pixels at position (i, j) inside all receptive fields and computing their contribution to each output[k, l] at once using vectorization.
Example:
Let our "convolving" function be, for example, a product of tangents of sums:
Then its partial derivative w.r.t. the input pixel at position (s,t) in its receptive field is
Derivative w.r.t. weight is the same.
At the end, of course, we must sum up gradients from different output[k,l] points. For example, each input[m, n] contributes to at most kernel_size^2 outputs as a part of their receptive fields, and each weight[i, j] contributes to all output_h x output_w outputs.
Simple implementation may look like this:
require 'nn'
local CustomConv, parent = torch.class('nn.CustomConv', 'nn.Module')
-- This module takes and produces a 2D map.
-- To work with multiple input/output feature maps and batches,
-- you have to iterate over them or further vectorize computations inside the loops.
function CustomConv:__init(ker_size)
parent.__init(self)
self.ker_size = ker_size
self.weight = torch.rand(self.ker_size, self.ker_size):add(-0.5)
self.gradWeight = torch.Tensor(self.weight:size()):zero()
end
function CustomConv:_get_recfield_input(input)
local rf_input = {}
for i = 1, self.ker_size do
rf_input[i] = {}
for j = 1, self.ker_size do
rf_input[i][j] = input[{{i, i - self.ker_size - 1}, {j, j - self.ker_size - 1}}]
end
end
return rf_input
end
function CustomConv:updateOutput(_)
local output = torch.Tensor(self.rf_input[1][1]:size())
-- Kernel-specific: our kernel is multiplicative, so we start with ones
output:fill(1)
--
for i = 1, self.ker_size do
for j = 1, self.ker_size do
local ker_pt = self.rf_input[i][j]:clone()
local w = self.weight[i][j]
-- Kernel-specific
output:cmul(ker_pt:add(w):tan())
--
end
end
return output
end
function CustomConv:updateGradInput_and_accGradParameters(_, gradOutput)
local gradInput = torch.Tensor(self.input:size()):zero()
for i = 1, self.ker_size do
for j = 1, self.ker_size do
local ker_pt = self.rf_input[i][j]:clone()
local w = self.weight[i][j]
-- Kernel-specific
local subGradInput = torch.cmul(gradOutput, torch.cdiv(self.output, ker_pt:add(w):tan():cmul(ker_pt:add(w):cos():pow(2))))
local subGradWeight = subGradInput
--
gradInput[{{i, i - self.ker_size - 1}, {j, j - self.ker_size - 1}}]:add(subGradInput)
self.gradWeight[{i, j}] = self.gradWeight[{i, j}] + torch.sum(subGradWeight)
end
end
return gradInput
end
function CustomConv:forward(input)
self.input = input
self.rf_input = self:_get_recfield_input(input)
self.output = self:updateOutput(_)
return self.output
end
function CustomConv:backward(input, gradOutput)
gradInput = self:updateGradInput_and_accGradParameters(_, gradOutput)
return gradInput
end
If you change this code a bit:
updateOutput:
output:fill(0)
[...]
output:add(ker_pt:mul(w))
updateGradInput_and_accGradParameters:
local subGradInput = torch.mul(gradOutput, w)
local subGradWeight = torch.cmul(gradOutput, ker_pt)
then it will work exactly as nn.SpatialConvolutionMM with zero bias (I've tested it).

Resources