In Lua Torch, the product of two zero matrices has nan entries - lua

I have encountered a strange behavior of the torch.mm function in Lua/Torch. Here is a simple program that demonstrates the problem.
iteration = 0;
a = torch.Tensor(2, 2);
b = torch.Tensor(2, 2);
prod = torch.Tensor(2,2);
a:zero();
b:zero();
repeat
prod = torch.mm(a,b);
ent = prod[{2,1}];
iteration = iteration + 1;
until ent ~= ent
print ("error at iteration " .. iteration);
print (prod);
The program consists of one loop, in which the program multiplies two zero 2x2 matrices and tests if entry ent of the product matrix is equal to nan. It seems that the program should run forever since the product should always be equal to 0, and hence ent should be 0. However, the program prints:
error at iteration 548
0.000000 0.000000
nan nan
[torch.DoubleTensor of size 2x2]
Why is this happening?
Update:
The problem disappears if I replace prod = torch.mm(a,b) with torch.mm(prod,a,b), which suggests that something is wrong with the memory allocation.
My version of Torch was compiled without BLAS & LAPACK libraries. After I recompiled torch with OpenBLAS, the problem disappeared. However, I am still interested in its cause.

The part of code that auto-generates the Lua wrapper for torch.mm can be found here.
When you write prod = torch.mm(a,b) within your loop it corresponds to the following C code behind the scenes (generated by this wrapper thanks to cwrap):
/* this is the tensor that will hold the results */
arg1 = THDoubleTensor_new();
THDoubleTensor_resize2d(arg1, arg5->size[0], arg6->size[1]);
arg3 = arg1;
/* .... */
luaT_pushudata(L, arg1, "torch.DoubleTensor");
/* effective matrix multiplication operation that will fill arg1 */
THDoubleTensor_addmm(arg1,arg2,arg3,arg4,arg5,arg6);
So:
a new result tensor is created and resized with the proper dimensions,
but this new tensor is NOT initialized, i.e. there is no calloc or explicit fill here so it points to junk memory and could contain NaN-s,
this tensor is pushed on the stack so as to be available on the Lua side as the return value.
The last point means that this returned tensor is different from the initial prod one (i.e. within the loop, prod shadows the initial value).
On the other hand calling torch.mm(prod,a,b) does use your initial prod tensor to store the results (behind the scenes there is no need to create a dedicated tensor in that case). Since in your code snippet you do not initialize / fill it with given values it could also contain junk.
In both cases the core operation is a gemm multiplication like C = beta * C + alpha * A * B, with beta=0 and alpha=1. The naive implementation looks like that:
real *a_ = a;
for(i = 0; i < m; i++)
{
real *b_ = b;
for(j = 0; j < n; j++)
{
real sum = 0;
for(l = 0; l < k; l++)
sum += a_[l*lda]*b_[l];
b_ += ldb;
/*
* WARNING: beta*c[j*ldc+i] could give NaN even if beta=0
* if the other operand c[j*ldc+i] is NaN!
*/
c[j*ldc+i] = beta*c[j*ldc+i]+alpha*sum;
}
a_++;
}
Comments are mine.
So:
with torch.mm(a,b): at each iteration, a new result tensor is created without being initialized (it could contain NaN-s). So every iteration presents a risk of returning NaN-s (see above warning),
with torch.mm(prod,a,b): there is the same risk since you do not initialized the prod tensor. BUT: this risk only exists at the first iteration of the repeat / until loop since right after prod is filled with 0-s and re-used for the subsequent iterations.
So this is why you do not observe a problem here (it is less frequent).
In case 1: this should be improved at the Torch level, i.e. make sure the wrapper initializes the output (e.g. with THDoubleTensor_fill(arg1, 0);).
In case 2: you should initialize prod initially and use the torch.mm(prod,a,b) construct to avoid any NaN problem.
--
EDIT: this problem is now fixed (see this pull request).

Related

Thread index as an memory location index in CUDA

By definition, a thread is a path of execution within a process.
But during the implementation of a kernel, a thread_id or global_index is generated to access a memory location allocated. For instance, in the Matrix Multiplication code below, ROW and COL are generated to access matrix A and B sequential.
My doubt here is, index generated isn't pointing to a thread(by definition), instead, it is used to access the location of the data in the memory, then why do we refer to it as thread index or global thread index and why not memory index or something else?
__global__ void matrixMultiplicationKernel(float* A, float* B, float* C, int N) {
int ROW = blockIdx.y*blockDim.y+threadIdx.y;
int COL = blockIdx.x*blockDim.x+threadIdx.x;
float tmpSum = 0;
if (ROW < N && COL < N) {
// each thread computes one element of the block sub-matrix
for (int i = 0; i < N; i++) {
tmpSum += A[ROW * N + i] * B[i * N + COL];
}
}
C[ROW * N + COL] = tmpSum;
}
 This question seems to be mostly about semantics, so let's start at Wikipedia
.... a thread of execution is the smallest sequence of programmed
instructions that can be managed independently by a scheduler ....
That is pretty much describes exactly what s thread in CUDA is -- the kernel is the sequence of instructions, and the scheduler is the warp/thread scheduler in each streaming multiprocessor on the GPU.
The code in your question is calculating the unique ID of the thread in the kernel launch, as it is abstracted in the CUDA programming/execution model. It has no intrinsic relationship to memory layouts, only to the unique ID in the kernel launch. The fact it is being used to ensure that each parallel operation is being performed on a different memory location is programming technique and nothing more.
Thread ID seems like a logical moniker to me, but to paraphrase Miles Davis when he was asked what the name of the jam his band just played at the Isle of Wight festival in 1970: "call it whatever you want".

Performing an "online" linear interpolation

I have a problem where I need to do a linear interpolation on some data as it is acquired from a sensor (it's technically position data, but the nature of the data doesn't really matter). I'm doing this now in matlab, but since I will eventually migrate this code to other languages, I want to keep the code as simple as possible and not use any complicated matlab-specific/built-in functions.
My implementation initially seems OK, but when checking my work against matlab's built-in interp1 function, it seems my implementation isn't perfect, and I have no idea why. Below is the code I'm using on a dataset already fully collected, but as I loop through the data, I act as if I only have the current sample and the previous sample, which mirrors the problem I will eventually face.
%make some dummy data
np = 109; %number of data points for x and y
x_data = linspace(3,98,np) + (normrnd(0.4,0.2,[1,np]));
y_data = normrnd(2.5, 1.5, [1,np]);
%define the query points the data will be interpolated over
qp = [1:100];
kk=2; %indexes through the data
cc = 1; %indexes through the query points
qpi = qp(cc); %qpi is the current query point in the loop
y_interp = qp*nan; %this will hold our solution
while kk<=length(x_data)
kk = kk+1; %update the data counter
%perform online interpolation
if cc<length(qp)-1
if qpi>=y_data(kk-1) %the query point, of course, has to be in-between the current value and the next value of x_data
y_interp(cc) = myInterp(x_data(kk-1), x_data(kk), y_data(kk-1), y_data(kk), qpi);
end
if qpi>x_data(kk), %if the current query point is already larger than the current sample, update the sample
kk = kk+1;
else %otherwise, update the query point to ensure its in between the samples for the next iteration
cc = cc + 1;
qpi = qp(cc);
%It is possible that if the change in x_data is greater than the resolution of the query
%points, an update like the above wont work. In this case, we must lag the data
if qpi<x_data(kk),
kk=kk-1;
end
end
end
end
%get the correct interpolation
y_interp_correct = interp1(x_data, y_data, qp);
%plot both solutions to show the difference
figure;
plot(y_interp,'displayname','manual-solution'); hold on;
plot(y_interp_correct,'k--','displayname','matlab solution');
leg1 = legend('show');
set(leg1,'Location','Best');
ylabel('interpolated points');
xlabel('query points');
Note that the "myInterp" function is as follows:
function yi = myInterp(x1, x2, y1, y2, qp)
%linearly interpolate the function value y(x) over the query point qp
yi = y1 + (qp-x1) * ( (y2-y1)/(x2-x1) );
end
And here is the plot showing that my implementation isn't correct :-(
Can anyone help me find where the mistake is? And why? I suspect it has something to do with ensuring that the query point is in-between the previous and current x-samples, but I'm not sure.
The problem in your code is that you at times call myInterp with a value of qpi that is outside of the bounds x_data(kk-1) and x_data(kk). This leads to invalid extrapolation results.
Your logic of looping over kk rather than cc is very confusing to me. I would write a simple for loop over cc, which are the points at which you want to interpolate. For each of these points, advance kk, if necessary, such that qp(cc) is in between x_data(kk) and x_data(kk+1) (you can use kk-1 and kk instead if you prefer, just initialize kk=2 to ensure that kk-1 exists, I just find starting at kk=1 more intuitive).
To simplify the logic here, I'm limiting the values in qp to be inside the limits of x_data, so that we don't need to test to ensure that x_data(kk+1) exists, nor that x_data(1)<pq(cc). You can add those tests in if you wish.
Here's my code:
qp = [ceil(x_data(1)+0.1):floor(x_data(end)-0.1)];
y_interp = qp*nan; % this will hold our solution
kk=1; % indexes through the data
for cc=1:numel(qp)
% advance kk to where we can interpolate
% (this loop is guaranteed to not index out of bounds because x_data(end)>qp(end),
% but needs to be adjusted if this is not ensured prior to the loop)
while x_data(kk+1) < qp(cc)
kk = kk + 1;
end
% perform online interpolation
y_interp(cc) = myInterp(x_data(kk), x_data(kk+1), y_data(kk), y_data(kk+1), qp(cc));
end
As you can see, the logic is a lot simpler this way. The result is identical to y_interp_correct. The inner while x_data... loop serves the same purpose as your outer while loop, and would be the place where you read your data from wherever it's coming from.

How to randomly get a value from a table [duplicate]

I am working on programming a Markov chain in Lua, and one element of this requires me to uniformly generate random numbers. Here is a simplified example to illustrate my question:
example = function(x)
local r = math.random(1,10)
print(r)
return x[r]
end
exampleArray = {"a","b","c","d","e","f","g","h","i","j"}
print(example(exampleArray))
My issue is that when I re-run this program multiple times (mash F5) the exact same random number is generated resulting in the example function selecting the exact same array element. However, if I include many calls to the example function within the single program by repeating the print line at the end many times I get suitable random results.
This is not my intention as a proper Markov pseudo-random text generator should be able to run the same program with the same inputs multiple times and output different pseudo-random text every time. I have tried resetting the seed using math.randomseed(os.time()) and this makes it so the random number distribution is no longer uniform. My goal is to be able to re-run the above program and receive a randomly selected number every time.
You need to run math.randomseed() once before using math.random(), like this:
math.randomseed(os.time())
From your comment that you saw the first number is still the same. This is caused by the implementation of random generator in some platforms.
The solution is to pop some random numbers before using them for real:
math.randomseed(os.time())
math.random(); math.random(); math.random()
Note that the standard C library random() is usually not so uniformly random, a better solution is to use a better random generator if your platform provides one.
Reference: Lua Math Library
Standard C random numbers generator used in Lua isn't guananteed to be good for simulation. The words "Markov chain" suggest that you may need a better one. Here's a generator widely used for Monte-Carlo calculations:
local A1, A2 = 727595, 798405 -- 5^17=D20*A1+A2
local D20, D40 = 1048576, 1099511627776 -- 2^20, 2^40
local X1, X2 = 0, 1
function rand()
local U = X2*A2
local V = (X1*A2 + X2*A1) % D20
V = (V*D20 + U) % D40
X1 = math.floor(V/D20)
X2 = V - X1*D20
return V/D40
end
It generates a number between 0 and 1, so r = math.floor(rand()*10) + 1 would go into your example.
(That's multiplicative random number generator with period 2^38, multiplier 5^17 and modulo 2^40, original Pascal code by http://osmf.sscc.ru/~smp/)
math.randomseed(os.clock()*100000000000)
for i=1,3 do
math.random(10000, 65000)
end
Always results in new random numbers. Changing the seed value will ensure randomness. Don't follow os.time() because it is the epoch time and changes after one second but os.clock() won't have the same value at any close instance.
There's the Luaossl library solution: (https://github.com/wahern/luaossl)
local rand = require "openssl.rand"
local randominteger
if rand.ready() then -- rand has been properly seeded
-- Returns a cryptographically strong uniform random integer in the interval [0, n−1].
randominteger = rand.uniform(99) + 1 -- randomizes an integer from range 1 to 100
end
http://25thandclement.com/~william/projects/luaossl.pdf

Torch Insert tensors of different dimension into a table

I'm trying to insert tensors of different dimensions into a lua table. But the insertion is writing the last tensor to all the previous elements in the table.
MWE:
require 'nn';
char = nn.LookupTable(100,10,0,1)
charRep = nn.Sequential():add(char):add(nn.Squeeze())
c = {}
c[1] = torch.IntTensor(5):random(1,100)
c[2] = torch.IntTensor(2):random(1,100)
c[3] = torch.IntTensor(3):random(1,100)
--This works fine
print(c)
charFeatures = {}
for i=1,3 do
charFeatures[i] = charRep:forward(c[i])
--table.insert(charFeatures, charRep:forward(c[i]))
-- No difference when table.insert is used
end
--This fails
print(charFeatures)
Maybe i haven't understood how tables work in Lua. But this code copies the last tensor to all previous charFeatures elements.
The issue is not related with tables, but is very common in Torch though. When you call the forward method on a neural net, its state value output is changed. Now when you save this value into charFeatures[i] you actually create a reference from charFeatures[i] to charRep.output. Then in the next iteration of the loop charRep.output is modified and consequently all the elements of charFeatures are modified too, since they point to the same value which is charRep.output.
Note that this behavior is the same as when you do
a = torch.Tensor(5):zero()
b = a
a[1] = 0
-- then b is also modified
Finally to solve your problem you should clone the output of the network:
charFeatures[i] = charRep:forward(c[i]):clone()
And all will work as expected!

Randomly scan through every coordinate

TL;DR Randomly access every tile in a tilemap
I have a way of generating random positions of tiles by just filling an entire layer of them (its only 10x10) then running a forloop like for (int x = 0; x < 13; x++)
{
for (int y = 0; y < 11; y++)}}
Where I randomly delete tiles. I also have a cap to this, which is at about 30. The problem is that when the loop runs over, it uses up the cap to the left (Because it starts at x=0 y=0 then does x0 x1 x2 x3 ...). I tried randomly generating coordinates, but this didn't work because it doesnt go over all the coordinates.
Does anyone know a better practice for scanning over every coordinate in a map in random order?
Number your tiles 0 to n anyway you want. Create a NSMutableIndexSet, and add all to it. Use a random number generator scaled to the number of items still in the index set (actually the range of first to last), grab one, then remove it from the set. If the random number is not in the set generate a new be, etc, until you find one in the set.
I think the best practice to accomplish this is via double hashing. To find out more read this link: Double Hashing. I will try to explain it simply.
You have two auxiliary hash functions which need to be pre-computed. And a main hash function which will run in a for loop. Lets test this out (this will be pseudo code):
key = random_number() //lets get a random number and call it "key"
module = map_size_x // map size for our module
//form of hash function 1 is: h1(key) = key % module, lets compute the hash 1 for our main hash function
aux1 = key % module
//form of hash function 2 is: h2(key) = 1 + (key % module'), where module' is module smaller for a small number (lets use 1), lets compute it:
aux2 = 1 + (key % (module - 1))
//the main hash function which will generate a random permutation is in the form of: h(key, index) = (h1(key) + index*h2(key)) % module. we already have h1 and h2 so lets loop this through:
for (i = 0; i < map_size_x; i++)
{
randomElement = (aux1 + i*aux2) % module //here we have index of the random element
//DO STUFF HERE
}
To get another permutation simply change the value of key. For more info check the link.
Hope this helps. Cheers.

Resources