Julia: add to multiple arrays in parallel

I want to compute and add gradients to multiple arrays in parallel:
a = zeros(1,3); b = zeros(1,5)
a, b = @parallel (+) for i = 1:10
    f(a,b)
end
Where f(a,b) returns gradients of a and b (these are arrays the same size as a and b, respectively). Obviously the above method doesn't work because tuples are immutable, but I can't think of a way to do this that doesn't involve combining a and b into a larger matrix. Any ideas?

Not the most elegant, but this works:
# combine two (grad_a, grad_b) tuples elementwise
function ta(t1, t2)
    t1[1] .+ t2[1], t1[2] .+ t2[2]
end
a, b = @parallel (ta) for i = 1:10
    f(a, b)
end

Related

Efficient pseudo-inverse for PyTorch 2D convolution

Background:
Thanks for your attention! I am learning the basics of 2D convolution, linear algebra and PyTorch, and I have run into a problem implementing the pseudo-inverse of the convolution operator. Specifically, I have no idea how to implement it efficiently. Please see the problem statement below for details. Any help/tip/suggestion is welcome.
(Thanks a lot for your attention!)
The Original Problem:
I have an image feature x with shape [b,c,h,w] and a 3x3 convolutional kernel K with shape [c,c,3,3], giving y = K * x. How can I implement the corresponding pseudo-inverse on y in an efficient way?
That is, with y = K * x = Ax, how do I implement x_hat = (A^+)y?
I guess there should be some way to do this with torch.fft, but I still have no idea how to implement it, and I do not know whether an implementation already exists.
import torch
import torch.nn.functional as F
c = 32
K = torch.randn(c, c, 3, 3)
x = torch.randn(1, c, 128, 128)
y = F.conv2d(x, K, padding=1)
print(y.shape)
# How to implement pseudo-inverse for y = K * x in an efficient way?
Some of My Efforts:
I know that 2D convolution is a linear operator, equivalent to a matrix product, so we could actually write out the matrix form of the convolution and compute its pseudo-inverse. However, I think this kind of operation would be inefficient, and I have no idea how to implement it in an efficient way.
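For illustration, here is a minimal sketch of that dense-matrix route on a toy problem (the sizes and variable names below are my own; with the real c=32, 128x128 feature map, A would have (c*h*w)^2 entries, which is exactly why this does not scale):
import torch
import torch.nn.functional as F

# toy sizes, for illustration only
c, h, w = 2, 8, 8
K = torch.randn(c, c, 3, 3)

# build the dense matrix A column by column by pushing standard basis
# "images" through the convolution (conv2d is linear, so A @ vec(x) = vec(y))
n_in = c * h * w
basis = torch.eye(n_in).reshape(n_in, c, h, w)
A = F.conv2d(basis, K, padding=1).reshape(n_in, -1).T   # (n_out, n_in)

x = torch.randn(1, c, h, w)
y = F.conv2d(x, K, padding=1)

# pseudo-inverse reconstruction x_hat = (A^+) y -- cubic cost in c*h*w
x_hat = (torch.linalg.pinv(A) @ y.reshape(-1)).reshape(1, c, h, w)
print(torch.mean((x - x_hat) ** 2) / torch.mean(x ** 2))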
According to Wikipedia, the pseudo-inverse satisfies A A^+ A = A, so with A the convolution operator and A^+ its pseudo-inverse, A^+(A x) should recover x (exactly when A has full column rank) for any image feature x.
(Thanks again for reading such a long post!)
This takes the problem to another level.
The convolution itself is a linear operation, so you can determine the matrix of the operation and solve a least-squares problem directly [1], or compute the pseudo-inverse as you mentioned and then apply it to different outputs, predicting a projection of the input.
I am changing your code a little, using padding=0 (the default) and a batch of 4:
import torch
import torch.nn.functional as F
# your code
c = 32
K = torch.randn(c, c, 3, 3)   # keep the 3x3 kernel from the question
x = torch.randn(4, c, 128, 128)
y = F.conv2d(x, K, bias=torch.zeros((c,)))
Also, as you suggested, the convolution can be computed as ifft(fft(h)*fft(x)). However, conv2d actually computes a cross-correlation, so you have to conjugate the filter, leading to ifft(conj(fft(h))*fft(x)). You also have to apply this along two axes, and you have to make sure both FFTs are computed at the same size; since the data is real, we can use the multi-dimensional real FFT. To be complete, conv2d works on multiple channels, so we have to compute sums of convolutions, and since the FFT is linear we can simply compute those sums in the frequency domain using einsum.
s = x.shape[-2:]                                   # (128, 128): FFT at the full input size
K_f = torch.fft.rfftn(K, s)                        # kernel spectrum, zero-padded to size s
x_f = torch.fft.rfftn(x, s)                        # input spectrum
y_f = torch.einsum('jkxy,ikxy->ijxy', K_f.conj(), x_f)   # per-frequency sum over input channels k
y_hat = torch.fft.irfftn(y_f, s)
Except for the borders it should be accurate (remember FFT computes a cyclic convolution).
torch.max(abs(y_hat[:,:,:-2,:-2] - y[:,:,:,:]))
Now, notice the pattern jk,ik->ij in the einsum: it means y_f[i,j] = sum(K_f[j,k] * x_f[i,k]) = x_f @ K_f.T, where @ is the matrix product over the first two dimensions. So to invert this operation we can interpret the first two dimensions as matrices. The function pinv computes pseudo-inverses over the last two axes, so in order to use it we have to permute the axes. If we right-multiply the output by the pseudo-inverse of the transposed K_f, we should invert this operation.
s = 128, 128
K_f = torch.fft.rfftn(K, s)
K_f_inv = torch.linalg.pinv(K_f.T).T               # .T puts the channel dims last, so pinv acts per frequency
y_f = torch.fft.rfftn(y_hat, s)
x_f = torch.einsum('jkxy,ikxy->ijxy', K_f_inv.conj(), y_f)
x_hat = torch.fft.irfftn(x_f, s)
print(torch.mean((x - x_hat)**2) / torch.mean((x)**2))   # relative reconstruction error
Notice that I am using the full (circular) convolution, while conv2d actually crops the image. Let's apply that cropping:
k = K.shape[-1]   # kernel size
y_hat[:,:,128-(k-1):,:] = 0
y_hat[:,:,:,128-(k-1):] = 0
Repeating the calculation, you will see that the reconstructed input is no longer accurate, so you have to be careful about what you do with your convolution; but in situations where you can get this to work, it will indeed be efficient.
s = 128,128
K_f = torch.fft.rfftn(K, s)
K_f_inv = torch.linalg.pinv(K_f.T).T
y_f = torch.fft.rfftn(y_hat, s)
x_f = torch.einsum('jkxy,ikxy->ijxy', K_f_inv.conj(), y_f)
x_hat = torch.fft.irfftn(x_f, s)
print(torch.mean((x - x_hat)**2) / torch.mean((x)**2))

Julia ReverseDiff: how to take a gradient w.r.t. only a subset of inputs?

In my data flow, I query a small subset of a database, use those results to construct about a dozen arrays, and then, given some parameter values, compute a likelihood value; then I repeat for another subset of the database. I want to compute the gradient of the likelihood function with respect to the parameters but not the data. However, ReverseDiff computes the gradient with respect to all inputs. How can I get around this? Specifically, how can I construct a ReverseDiff.Tape object that differentiates only with respect to the parameters?
TL;DR: How to marry stochastic gradient descent and ReverseDiff? (I'm not wedded to using ReverseDiff. It just seemed like the right tool for the job.)
It seems like this must be a common coding pattern; it's used all the time in my field, but I'm missing something. Julia's scoping rules seem to undermine the scoped/anonymous-function approach, and ReverseDiff holds on to the original data values in the generated tape instead of using the mutated values.
Here is some sample code illustrating things that don't work:
using ReverseDiff
using Base.Test
mutable struct data
    X::Array{Float64, 2}
end
const D = data(zeros(Float64, 2, 2))
# baseline known data to compare against
function f1(params)
    X = float.([1 2; 3 4])
    f2(params, X)
end
# X is data, want derivative wrt to params only
function f2(params, X)
    sum(params[1]' * X[:, 1] - (params[1] .* params[2])' * X[:, 2].^2)
end
# store data of interest in D.X so that we can call just f2(params) and get our
# gradient
f2(params) = f2(params, D.X)
# use an inner function and swap out Z's data
function scope_test()
    function f2_only_params(params)
        f2(params, Z)
    end
    Z = float.([6 7; 1 3])
    f2_tape = ReverseDiff.GradientTape(f2_only_params, [1, 2])
    Z[:] = float.([1 2; 3 4])
    grad = ReverseDiff.gradient!(f2_tape, [3,4])
    return grad
end
function struct_test()
    D.X[:] = float.([6 7; 1 3])
    f2_tape = ReverseDiff.GradientTape(f2, [1., 2.])
    D.X[:] = float.([1 2; 3 4])
    grad = ReverseDiff.gradient!(f2_tape, [3., 4.])
    return grad
end
function struct_test2()
    D.X[:] = float.([1 2; 3 4])
    f2_tape = ReverseDiff.GradientTape(f2, [3., 4.])
    D.X[:] = float.([1 2; 3 4])
    grad = ReverseDiff.gradient!(f2_tape, [3., 4.])
    return grad
end
D.X[:] = float.([1 2; 3 4])
@test f1([3., 4.]) == f2([3., 4.], D.X)
@test f1([3., 4.]) == f2([3., 4.])
f1_tape = ReverseDiff.GradientTape(f1, [3,4])
f1_grad = ReverseDiff.gradient!(f1_tape, [3,4])
# fails! uses the values Z held when the tape was recorded
@test scope_test() == f1_grad
# fails, uses the values D.X held when the tape was recorded
@test struct_test() == f1_grad
# succeeds, so, not completely random
@test struct_test2() == f1_grad
This is currently not possible (sadly). There is a GitHub issue describing the two work-arounds:
https://github.com/JuliaDiff/ReverseDiff.jl/issues/36
either do not use a prerecorded tape
or differentiate relative to all arguments and ignore the gradient for some of the input parameters.
I had the same issue, and I used the grad function of Knet instead. It supports differentiation with respect to only one argument, but that argument can be quite flexible (e.g. an array of arrays, or a dict of arrays).
Thanks Alex, your answer got me 90% of the way there. AutoGrad (which Knet uses at the time of writing) provides a very nice interface that I think is natural for most users. However, it turns out that using anonymous functions with ReverseDiff is faster than the approach taken by AutoGrad, for reasons I don't quite understand.
If you follow the chain of issues referenced in what you linked, this seems to be what the ReverseDiff/ForwardDiff folks want people doing:
ReverseDiff.gradient(p -> f(p, non_differentiated_data), params)
It is certainly disappointing that we can't get a precompiled tape for this incredibly common usage scenario, and maybe future work will change things, but this seems to be where things stand now.
Some references for those interested in further reading:
https://github.com/JuliaDiff/ForwardDiff.jl/issues/77
https://github.com/JuliaDiff/ForwardDiff.jl/issues/32
https://github.com/JuliaDiff/ForwardDiff.jl/pull/182

How tf.gradients work in TensorFlow

Given a linear model like the following, I would like to get the gradient vector of the cost with respect to W and b.
# tf Graph Input
X = tf.placeholder("float")
Y = tf.placeholder("float")
# Set model weights
W = tf.Variable(rng.randn(), name="weight")
b = tf.Variable(rng.randn(), name="bias")
# Construct a linear model
pred = tf.add(tf.mul(X, W), b)
# Mean squared error
cost = tf.reduce_sum(tf.pow(pred-Y, 2))/(2*n_samples)
However, if I try something like the following, where cost is a function cost(x,y,w,b) and I only want the gradients with respect to w and b:
grads = tf.gradients(cost, tf.all_variables())
My placeholders will also be included (X and Y).
Even if I do get a gradient over [x,y,w,b], how do I know which element of the gradient belongs to which parameter, since it is just a list with no names saying which parameter each derivative was taken with respect to?
In this question I'm using parts of this code and I build on this question.
Quoting the docs for tf.gradients
Constructs symbolic partial derivatives of sum of ys w.r.t. x in xs.
So, this should work:
dc_dw, dc_db = tf.gradients(cost, [W, b])
Here, tf.gradients() returns the gradient of cost wrt each tensor in the second argument as a list in the same order.
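For example, here is a minimal self-contained sketch in the graph-style (TF1) API used in the question; the placeholder data, rng and n_samples from the question are replaced with concrete stand-ins:
import numpy as np
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

X = tf.placeholder("float")
Y = tf.placeholder("float")
W = tf.Variable(np.float32(np.random.randn()), name="weight")
b = tf.Variable(np.float32(np.random.randn()), name="bias")
pred = tf.add(tf.multiply(X, W), b)
n_samples = 5
cost = tf.reduce_sum(tf.pow(pred - Y, 2)) / (2 * n_samples)

# one gradient per tensor in the list, returned in the same order
dc_dw, dc_db = tf.gradients(cost, [W, b])

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    x_val = np.arange(n_samples, dtype=np.float32)
    y_val = 2 * x_val + 1
    print(sess.run([dc_dw, dc_db], feed_dict={X: x_val, Y: y_val}))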
Read tf.gradients for more information.

Custom Spatial Convolution In Torch

I need to perform a custom spatial convolution in Torch. Rather than simply multiplying each input pixel by a weight for that pixel and adding them together with the filter's bias to form each output pixel, I need to do a more complex mathematical function to the input pixels before adding them together.
I know how to do this, but I do not know a GOOD way to do this. The best way I've come up with is to take the full input tensor, create a bunch of secondary tensors that are "views" of the original without allocating additional memory, put those into a Replicate layer (with the output filter count as the replication count), and feed that into a ParallelTable layer containing a bunch of regular layers whose parameters are shared between filters.
The trouble is, even though this is fine memory-wise, with a very manageable overhead, we're talking inputwidth x inputheight x inputdepth x outputdepth mini-networks here. Maybe there's some way to create massive "long and tall" networks that work on the entire replicated input set at once, but how do I create layers that are partially connected (like convolutions) instead of fully connected?
I would have liked to just use inheritance to create a special copy of the regular SpatialConvolution "class" and modify it, but I can't even try because it's implemented in an external C library. I can't just use regular layers before a regular SpatialConvolution layer because I need to do my math with different weights and biases for each filter (shared between applications of the same filter to different input coordinates).
Good question. You made me give it some serious thought.
Your approach has a flaw: it does not let you take advantage of vectorized computation, since each mini-network works independently.
My idea is as follows:
Suppose the network's input and output are 2D tensors. We can produce (efficiently, without copying memory) an auxiliary 4D tensor
rf_input (kernel_size x kernel_size x output_h x output_w)
such that rf_input[:, :, k, l] is a 2D tensor of size kernel_size x kernel_size containing the receptive field from which output[k, l] will be computed. Then we iterate over positions (i, j) inside the kernel: rf_input[i, j, :, :] gives the pixels at position (i, j) inside all receptive fields, and we compute their contribution to every output[k, l] at once, using vectorization.
Example:
Let our "convolving" function be, for example, a product of tangents of sums:
Then its partial derivative w.r.t. the input pixel at position (s,t) in its receptive field is
Derivative w.r.t. weight is the same.
At the end, of course, we must sum up gradients from different output[k,l] points. For example, each input[m, n] contributes to at most kernel_size^2 outputs as a part of their receptive fields, and each weight[i, j] contributes to all output_h x output_w outputs.
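Written out (my notation, with RF(k, l) denoting the set of input positions inside the receptive field of output[k, l]), the accumulated gradients are:
gradWeight[i, j] = sum over all (k, l) of gradOutput[k, l] * d output[k, l] / d weight[i, j]
gradInput[m, n] = sum over (k, l) with (m, n) in RF(k, l) of gradOutput[k, l] * d output[k, l] / d input[m, n]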
A simple implementation may look like this:
require 'nn'
local CustomConv, parent = torch.class('nn.CustomConv', 'nn.Module')
-- This module takes and produces a 2D map.
-- To work with multiple input/output feature maps and batches,
-- you have to iterate over them or further vectorize computations inside the loops.
function CustomConv:__init(ker_size)
    parent.__init(self)
    self.ker_size = ker_size
    self.weight = torch.rand(self.ker_size, self.ker_size):add(-0.5)
    self.gradWeight = torch.Tensor(self.weight:size()):zero()
end
function CustomConv:_get_recfield_input(input)
    local rf_input = {}
    for i = 1, self.ker_size do
        rf_input[i] = {}
        for j = 1, self.ker_size do
            rf_input[i][j] = input[{{i, i - self.ker_size - 1}, {j, j - self.ker_size - 1}}]
        end
    end
    return rf_input
end
function CustomConv:updateOutput(_)
    local output = torch.Tensor(self.rf_input[1][1]:size())
    -- Kernel-specific: our kernel is multiplicative, so we start with ones
    output:fill(1)
    --
    for i = 1, self.ker_size do
        for j = 1, self.ker_size do
            local ker_pt = self.rf_input[i][j]:clone()
            local w = self.weight[i][j]
            -- Kernel-specific
            output:cmul(ker_pt:add(w):tan())
            --
        end
    end
    return output
end
function CustomConv:updateGradInput_and_accGradParameters(_, gradOutput)
    local gradInput = torch.Tensor(self.input:size()):zero()
    for i = 1, self.ker_size do
        for j = 1, self.ker_size do
            local ker_pt = self.rf_input[i][j]:clone()
            local w = self.weight[i][j]
            -- Kernel-specific
            local subGradInput = torch.cmul(gradOutput, torch.cdiv(self.output, ker_pt:add(w):tan():cmul(ker_pt:add(w):cos():pow(2))))
            local subGradWeight = subGradInput
            --
            gradInput[{{i, i - self.ker_size - 1}, {j, j - self.ker_size - 1}}]:add(subGradInput)
            self.gradWeight[{i, j}] = self.gradWeight[{i, j}] + torch.sum(subGradWeight)
        end
    end
    return gradInput
end
function CustomConv:forward(input)
    self.input = input
    self.rf_input = self:_get_recfield_input(input)
    self.output = self:updateOutput(_)
    return self.output
end
function CustomConv:backward(input, gradOutput)
    gradInput = self:updateGradInput_and_accGradParameters(_, gradOutput)
    return gradInput
end
If you change this code a bit:
updateOutput:
output:fill(0)
[...]
output:add(ker_pt:mul(w))
updateGradInput_and_accGradParameters:
local subGradInput = torch.mul(gradOutput, w)
local subGradWeight = torch.cmul(gradOutput, ker_pt)
then it will work exactly as nn.SpatialConvolutionMM with zero bias (I've tested it).
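For readers working in PyTorch rather than Torch7, here is a rough sketch (my own code, not the author's) of the same receptive-field vectorization for the product-of-tangents kernel; unfold plays the role of _get_recfield_input, and autograd replaces the hand-written gradient code:
import torch

def custom_conv(x, weight):
    # x: (H, W) single 2D map, weight: (k, k)
    k = weight.shape[0]
    patches = x.unfold(0, k, 1).unfold(1, k, 1)   # (H-k+1, W-k+1, k, k) receptive fields
    return torch.tan(patches + weight).flatten(-2).prod(dim=-1)

x = torch.randn(16, 16)
w = (torch.rand(3, 3) - 0.5).requires_grad_()
out = custom_conv(x, w)        # (14, 14) output map
out.sum().backward()           # fills w.grad, the analogue of gradWeight
print(out.shape, w.grad.shape)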

How to store complex numbers in OpenCV matrix?

I have two matrices A and B containing real numbers. I want a complex-valued matrix C such that C[k] = A[k] + i*B[k] for every element k (A holds the real parts, B the imaginary parts).
My question is how to create such a complex matrix C, and how to copy the values of A and B into it.
I found that I can create matrix C as follows:
CvMat* C_Matrix = cvCreateMat(5, 5, CV_64FC2);
But how do I now copy the values of A and B into C_Matrix?
I think you can use the merge() function here; see the documentation.
It says: Composes a multi-channel array from several single-channel arrays.
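A quick sketch of what that looks like (shown with the Python bindings for brevity; the C++ cv::merge call behaves the same way):
import cv2
import numpy as np

# A and B hold the real and imaginary parts as separate single-channel arrays
A = np.random.rand(5, 5)
B = np.random.rand(5, 5)

# merge() interleaves them into one 2-channel (CV_64FC2-style) array:
# C[r, c, 0] is the real part, C[r, c, 1] is the imaginary part
C = cv2.merge([A, B])
print(C.shape, C.dtype)   # (5, 5, 2) float64

# split() recovers the two planes
A2, B2 = cv2.split(C)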
