I am working on an OpenACC computational fluid dynamics code, and I want to increase the granularity of the computation inside a loop by breaking the overall work down into a bunch of small operations. My final goal is to reduce the number of registers per thread by splitting the original complex task into a series of smaller, simpler tasks on the GPU.
For instance, I have many formulas to compute for a specific node of the computational domain:
!$acc parallel loop ...
do i=1,n
D1 = s(i+1,1) - s(i-1,1)
D2 = s(i+1,2) - s(i-1,2)
...
R = D1 + D2 + ...
enddo
As you can see, I can spread the computation across the threads of a block and, at the end, sum up the results into R (by reduction). Therefore, I defined an inner parallel loop as follows:
!$acc parallel loop
do i=1,n
!$acc parallel loop ...
do j=1,m
D(j) = s(i+1,j) - s(i-1,j)
enddo
!$acc parallel loop reduction(+:R)
do j=1,m
R = R + D(j)
enddo
enddo
However, I need D to live in memory shared by all threads, but I don't know what the best way to do that is in OpenACC (I tried !$acc cache but got worse performance). I also need to place some unchanging data in constant memory, and again I don't know how.
Is there an efficient way to implement this idea in OpenACC? I really appreciate your help.
Thanks a lot,
Behzad
I think what you're looking for is declaring the D array to be gang-private. If you were in C, I'd say that you can just declare it inside the i loop. Since you're in Fortran, try doing the following:
!$acc parallel loop private(D)
do i=1,n
!$acc loop ...
do j=1,m
D(j) = s(i+1,j) - s(i-1,j)
enddo
!$acc loop reduction(+:R)
do j=1,m
R = R + D(j)
enddo
enddo
Once it's private to the gangs, you might actually find that the cache directive is more effective.
I was looking at a code doing something similar earlier today and you might also want to try expanding the size of D to be n x m and then splitting the i loop between the two j loops. Right now the compiler will need to insert synchronization between the two j loops, plus it will need to strip-mine the i loop into vectors, so you're probably losing M/vector_length parallelism. If you break it into two doubly-nested loops, you could collapse the i and j loops together and get more parallelism. That would look something like this.
!$acc parallel loop collapse(2)
do i=1,n
do j=1,m
D(j,i) = s(i+1,j) - s(i-1,j)
enddo
enddo
!$acc parallel loop collapse(2) reduction(+:R)
do i=1,n
do j=1,m
R = R + D(j,i)
enddo
enddo
The trade-off is that D will require more storage and the cache performance on the CPU will suffer. It may be worth experimenting though to see if the trade-off is worthwhile.
The previous serial version of the code looked like this:
do j=1,n
do i=1,m
a_x = s(i+1,j,1) - s(i-1,j,1)
b_x = s(i+1,j,2) - s(i-1,j,2)
c_x = s(i+1,j,3) - s(i-1,j,3)
a_y = s(i,j+1,1) - s(i,j-1,1)
b_y = s(i,j+1,2) - s(i,j-1,2)
c_y = s(i,j+1,3) - s(i,j-1,3)
...
R1 = b_x + c_y
R2 = a_x + a_y
R3 = c_x + b_y
enddo
enddo
The new approach is to split the computations into the x and y directions as follows (serial version):
do j=1,n
do i=1,m
do k=1,2
do l=1,3
D(l) = s(i+dir(k,1),j+dir(k,2),l) - s(i-dir(k,1),j-dir(k,2),l)
enddo
...
R_1(k) = a(k+1)
R_2(k) = a(1)
R_3(k) = a(3)
enddo
do k=1,2
R1 = R1 + R_1(k)
R2 = R2 + R_2(k)
R3 = R3 + R_3(k)
enddo
enddo
enddo
where dir(2,2) = {(1,0),(0,1)} determines the index shift in 2D for each direction k.
Now I am trying to port this code to the GPU with OpenACC.
Related
I am trying to solve the "Two Sum" problem (you may recognize it) in Lua, and I've run into an error that I cannot figure out.
Code:
num = {2,7,11,15}
target = 9
current = 0
repeat
createNum1 = tonumber(num[math.random(1,#num)])
createNum2 = tonumber(num[math.random(1,#num)])
current = createNum1 + createNum2
until current == target
print(table.find(num,createNum1), table.find(num,createNum2))
Error:
lua5.3: HelloWorld.lua:9: attempt to call a nil value (field 'find')
stack traceback:
HelloWorld.lua:9: in main chunk
[C]: in ?
Thank you!
Lua has no table.find function in its very small standard library; just take a look at the reference manual.
You could implement your own table.find function, but that would just be monkey-patching an overall broken algorithm. There is no need to use a probabilistic algorithm, which probably runs in at least quadratic time if there is only one pair of numbers that adds up to the desired number. Instead, you should leverage Lua's tables - associative arrays - here. First build an index of [number] = last index:
local num = {2,7,11,15}
local target = 9
local idx = {}
for i, n in ipairs(num) do idx[n] = i end
then loop over the numbers; given a number m you just need to look for target - m in your idx lookup:
for i, n in ipairs(num) do local j = idx[target - n]; if j then print(i, j) break end end
if you want to exit early - sometimes without building the full idx table - you can fuse the two loops:
local idx = {}
for i, n in ipairs(num) do
local j = idx[target - n]
if j then
print(j, i)
break
end
idx[n] = i
end
other solutions exist (e.g. using sorting, which requires no auxiliary space), but this one is elegant in that it runs in O(n) time & O(n) space to produce a solution and leverages Lua's builtin data structures.
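For completeness, here is a minimal sketch of the sorting-based alternative mentioned above (the helper name twoSumSorted is made up for illustration; it sorts a copy, so it still uses auxiliary space unless you sort num in place, and it returns the matching values rather than their original indices):
local function twoSumSorted(nums, target)
  local sorted = {table.unpack(nums)} -- copy so the input order is preserved
  table.sort(sorted)
  local lo, hi = 1, #sorted
  while lo < hi do
    local sum = sorted[lo] + sorted[hi]
    if sum == target then
      return sorted[lo], sorted[hi] -- the pair of values adding up to target
    elseif sum < target then
      lo = lo + 1 -- sum too small: advance the lower pointer
    else
      hi = hi - 1 -- sum too large: pull the upper pointer back
    end
  end
end

print(twoSumSorted({2, 7, 11, 15}, 9)) --> 2 7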
I have a problem where I need to do a linear interpolation on some data as it is acquired from a sensor (it's technically position data, but the nature of the data doesn't really matter). I'm doing this now in MATLAB, but since I will eventually migrate this code to other languages, I want to keep the code as simple as possible and not use any complicated MATLAB-specific/built-in functions.
My implementation initially seems OK, but when checking my work against matlab's built-in interp1 function, it seems my implementation isn't perfect, and I have no idea why. Below is the code I'm using on a dataset already fully collected, but as I loop through the data, I act as if I only have the current sample and the previous sample, which mirrors the problem I will eventually face.
%make some dummy data
np = 109; %number of data points for x and y
x_data = linspace(3,98,np) + (normrnd(0.4,0.2,[1,np]));
y_data = normrnd(2.5, 1.5, [1,np]);
%define the query points the data will be interpolated over
qp = [1:100];
kk=2; %indexes through the data
cc = 1; %indexes through the query points
qpi = qp(cc); %qpi is the current query point in the loop
y_interp = qp*nan; %this will hold our solution
while kk<=length(x_data)
kk = kk+1; %update the data counter
%perform online interpolation
if cc<length(qp)-1
if qpi>=y_data(kk-1) %the query point, of course, has to be in-between the current value and the next value of x_data
y_interp(cc) = myInterp(x_data(kk-1), x_data(kk), y_data(kk-1), y_data(kk), qpi);
end
if qpi>x_data(kk), %if the current query point is already larger than the current sample, update the sample
kk = kk+1;
else %otherwise, update the query point to ensure its in between the samples for the next iteration
cc = cc + 1;
qpi = qp(cc);
%It is possible that if the change in x_data is greater than the resolution of the query
%points, an update like the above wont work. In this case, we must lag the data
if qpi<x_data(kk),
kk=kk-1;
end
end
end
end
%get the correct interpolation
y_interp_correct = interp1(x_data, y_data, qp);
%plot both solutions to show the difference
figure;
plot(y_interp,'displayname','manual-solution'); hold on;
plot(y_interp_correct,'k--','displayname','matlab solution');
leg1 = legend('show');
set(leg1,'Location','Best');
ylabel('interpolated points');
xlabel('query points');
Note that the "myInterp" function is as follows:
function yi = myInterp(x1, x2, y1, y2, qp)
%linearly interpolate the function value y(x) over the query point qp
yi = y1 + (qp-x1) * ( (y2-y1)/(x2-x1) );
end
And here is the plot showing that my implementation isn't correct :-(
Can anyone help me find where the mistake is? And why? I suspect it has something to do with ensuring that the query point is in-between the previous and current x-samples, but I'm not sure.
The problem in your code is that you at times call myInterp with a value of qpi that is outside of the bounds x_data(kk-1) and x_data(kk). This leads to invalid extrapolation results.
Your logic of looping over kk rather than cc is very confusing to me. I would write a simple for loop over cc, which are the points at which you want to interpolate. For each of these points, advance kk, if necessary, such that qp(cc) is in between x_data(kk) and x_data(kk+1) (you can use kk-1 and kk instead if you prefer, just initialize kk=2 to ensure that kk-1 exists, I just find starting at kk=1 more intuitive).
To simplify the logic here, I'm limiting the values in qp to be inside the limits of x_data, so that we don't need to test that x_data(kk+1) exists, nor that x_data(1) < qp(cc). You can add those tests in if you wish.
Here's my code:
qp = [ceil(x_data(1)+0.1):floor(x_data(end)-0.1)];
y_interp = qp*nan; % this will hold our solution
kk=1; % indexes through the data
for cc=1:numel(qp)
% advance kk to where we can interpolate
% (this loop is guaranteed to not index out of bounds because x_data(end)>qp(end),
% but needs to be adjusted if this is not ensured prior to the loop)
while x_data(kk+1) < qp(cc)
kk = kk + 1;
end
% perform online interpolation
y_interp(cc) = myInterp(x_data(kk), x_data(kk+1), y_data(kk), y_data(kk+1), qp(cc));
end
As you can see, the logic is a lot simpler this way. The result is identical to y_interp_correct. The inner while x_data... loop serves the same purpose as your outer while loop, and would be the place where you read your data from wherever it's coming from.
I am working on programming a Markov chain in Lua, and one element of this requires me to uniformly generate random numbers. Here is a simplified example to illustrate my question:
example = function(x)
local r = math.random(1,10)
print(r)
return x[r]
end
exampleArray = {"a","b","c","d","e","f","g","h","i","j"}
print(example(exampleArray))
My issue is that when I re-run this program multiple times (mash F5), the exact same random number is generated, resulting in the example function selecting the exact same array element. However, if I include many calls to the example function within a single program, by repeating the print line at the end many times, I get suitably random results.
This is not my intention as a proper Markov pseudo-random text generator should be able to run the same program with the same inputs multiple times and output different pseudo-random text every time. I have tried resetting the seed using math.randomseed(os.time()) and this makes it so the random number distribution is no longer uniform. My goal is to be able to re-run the above program and receive a randomly selected number every time.
You need to run math.randomseed() once before using math.random(), like this:
math.randomseed(os.time())
From your comment I see that the first number is still the same. This is caused by the implementation of the random generator on some platforms.
The solution is to pop some random numbers before using them for real:
math.randomseed(os.time())
math.random(); math.random(); math.random()
Note that the standard C library random() is usually not so uniformly random; a better solution is to use a better random generator if your platform provides one.
Reference: Lua Math Library
The standard C random number generator used in Lua isn't guaranteed to be good for simulation. The words "Markov chain" suggest that you may need a better one. Here's a generator widely used for Monte-Carlo calculations:
local A1, A2 = 727595, 798405 -- 5^17=D20*A1+A2
local D20, D40 = 1048576, 1099511627776 -- 2^20, 2^40
local X1, X2 = 0, 1
function rand()
local U = X2*A2
local V = (X1*A2 + X2*A1) % D20
V = (V*D20 + U) % D40
X1 = math.floor(V/D20)
X2 = V - X1*D20
return V/D40
end
It generates a number between 0 and 1, so r = math.floor(rand()*10) + 1 would go into your example.
(That's a multiplicative random number generator with period 2^38, multiplier 5^17 and modulus 2^40; original Pascal code by http://osmf.sscc.ru/~smp/)
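As a sketch, plugging that into the example function from the question might look like this (assuming the rand() definition above is in scope):
example = function(x)
  local r = math.floor(rand() * #x) + 1 -- rand() is in [0,1), so this is a uniform index in 1..#x
  print(r)
  return x[r]
end

exampleArray = {"a","b","c","d","e","f","g","h","i","j"}
print(example(exampleArray))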
math.randomseed(os.clock()*100000000000)
for i=1,3 do
math.random(10000, 65000)
end
This always results in new random numbers. Changing the seed value will ensure randomness. Don't rely on os.time(), because it is the epoch time and only changes once per second, whereas os.clock() won't have the same value at any two close instants.
There's also the luaossl library solution (https://github.com/wahern/luaossl):
local rand = require "openssl.rand"
local randominteger
if rand.ready() then -- rand has been properly seeded
-- Returns a cryptographically strong uniform random integer in the interval [0, n−1].
randominteger = rand.uniform(100) + 1 -- randomizes an integer from the range 1 to 100
end
http://25thandclement.com/~william/projects/luaossl.pdf
I'm writing an LSL to Lua translator, and I'm having all sorts of trouble implementing incrementing and decrementing operators. LSL has such things using the usual C-like syntax (x++, x--, ++x, --x), but Lua does not. Just to avoid massive amounts of typing, I refer to these sorts of operators as "crements". In the code below, I'll use "..." to represent other parts of the expression.
... x += 1 ...
Won't work, because Lua only has simple assignment.
... x = x + 1 ...
Won't work because that's a statement, and Lua can't use statements in expressions. LSL can use crements in expressions.
function preIncrement(x) x = x + 1; return x; end
... preIncrement(x) ...
While it does provide the correct value in the expression, Lua is pass-by-value for numbers, so the original variable is not changed. If I could get this to actually change the variable, then all would be good. Messing with the environment might not be such a good idea, since I don't know what scope x is. I think I'll investigate that next. The translator could output scope details.
Assuming the above function exists -
... x = preIncrement(x) ...
Won't work, for the "it's a statement" reason.
Other solutions start to get really messy.
x = preIncrement(x)
... x ...
Works fine, except when the original LSL code is something like this -
while (doOneThing(x++))
{
doOtherThing(x);
}
Which becomes a whole can of worms. Using tables in the function -
function preIncrement(x) x[1] = x[1] + 1; return x[1]; end
temp = {x}
... preIncrement(temp) ...
x = temp[1]
Is even messier, and has the same problems.
Starting to look like I might have to actually analyse the surrounding code instead of just doing simple translations to sort out what the correct way to implement any given crement will be. Anybody got any simple ideas?
I think to really do this properly you're going to have to do some more detailed analysis and split some expressions into multiple statements, although many can probably be translated pretty straightforwardly.
Note that at least in C, you can delay post-increments/decrements to the next "sequence point", and put pre-increments/decrements before the previous sequence point; sequence points are only located in a few places: between statements, at "short-circuit operators" (&& and ||), etc. (more info here)
So it's fine to replace x = *y++ + z * f (); with { x = *y + z * f(); y = y + 1; }—the user isn't allowed to assume that y will be incremented before anything else in the statement, only that the value used in *y will be y before it's incremented. Similarly, x = *--y + z * f(); can be replaced with { y = y - 1; x = *y + z * f (); }
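As an illustration of that kind of splitting, the OP's while (doOneThing(x++)) example could plausibly be translated along these lines (a sketch; it assumes doOneThing's result has already been mapped to a Lua boolean, since LSL's integer truthiness would need separate handling):
while true do
  local old_x = x -- the value that x++ yields
  x = x + 1       -- the post-increment, applied before the next sequence point
  if not doOneThing(old_x) then break end
  doOtherThing(x)
end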
Lua is designed to be pretty much impervious to implementations of this sort of thing. It may be done as kind of a compiler/interpreter issue, since the interpreter can know that variables only change when a statement is executed.
There's no way to implement this kind of thing in Lua. Not in the general case. You could do it for global variables by passing a string to the increment function. But obviously it wouldn't work for locals, or for variables that are in a table that is itself global.
Lua doesn't want you to do it; it's best to find a way to work within the restriction. And that means code analysis.
Your proposed solution will only work when your Lua variables are all global. Unless this is something LSL also does, you will run into trouble translating LSL programs that use variables with the same name in different places.
Lua is only able to modify one lvalue per statement - tables being passed to functions are the only exception to this rule. You could use a local table to store all locals, and that would help you out with the pre-...-crements; they can be evaluated before the expression that contains them is evaluated. But the post-...-crements have to be evaluated later on, which is simply not possible in Lua - at least not without some ugly code involving anonymous functions.
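That local-table idea for the pre-...-crements could look something like this (a sketch; the vars table and the helper name are made up for illustration):
local vars = { x = 0 } -- all translated locals live in this table
local function preIncrement(name)
  vars[name] = vars[name] + 1
  return vars[name]
end

-- ... preIncrement("x") ... -- usable inside an expression, since it's just a function call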
So you have one option: you must accept that some LSL statements will get translated into several Lua statements.
Say you have a LSL statement with increments like this:
f(integer x) {
integer y = x + x++;
return (y + ++y)
}
You can translate this into Lua statements like this:
function f(x)
  local post_incremented_x = x + 1 -- extra statement 1 for post increment
  local y = x + post_incremented_x
  x = post_incremented_x -- extra statement 2 for post increment
  local pre_incremented_y = y + 1
  return y + pre_incremented_y
  -- y = pre_incremented_y -- this write-back would never execute (Lua doesn't even allow a statement after return)
end
So you basically will have to add two statements per ..-crement used in your statements. For complex structures, that will mean calculating the order in which the expressions are evaluated.
For what it's worth, I like having post-decrements and pre-decrements as individual statements in languages. But I consider it a flaw of the language when they can also be used as expressions. The syntactic sugar quickly becomes semantic diabetes.
After some research and thinking I've come up with an idea that might work.
For globals -
function preIncrement(x)
_G[x] = _G[x] + 1
return _G[x]
end
... preIncrement("x") ...
For locals and function parameters (which are locals too), I know at the time I'm parsing the crement that it is local, so I can store four flags in the variable's AST structure to tell me which of the four crements is being used. Then, when it comes time to output the variable's definition, I can output something like this -
local x;
function preIncrement_x() x = x + 1; return x; end
function postDecrement_x() local y = x; x = x - 1; return y; end
... preIncrement_x() ...
In most of your assessment you are trying to hard-pass data types from one language into the other and calling that a 'translator'. In doing so you miss regex and other pattern-matching capabilities, which are far more present in Lua than in LSL. Since the LSL code is being passed into Lua, try using them, along with other functions; that would make the work more of a translation than a hard pass.
Yes, I know this was asked a while ago, but for other viewers of this topic: never forget the environments you are working in. EVER. Use what they give you to the best of your ability.
Suppose you've got a singly linked list of size N, and you want to perform an operation on every element, beginning at the end.
I've come up with the following pseudocode:
while N > 0
Current = LinkedList
for 0 to N
Current = Current.tail
end
Operation(Current.head)
N := N-1
end
Now I've got to determine which Big-O this algorithm is.
Supposing that Operation() is O(1), I think it's something like this:
N + (N-1) + (N-2) + ... + 2 + 1
But I'm not sure what Big-O that actually is. I think it is definitely smaller than O(N^2), but I don't think you can say it's O(N) either ...
Your equation is basically that of the triangular numbers, and sums to N(N+1)/2. I'll leave you to determine the O() from that!
A quicker way to do this is to construct a new list that is the reverse of the original list, and then perform the operations on that.
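As a minimal Lua sketch of that idea (assuming each node is a table {head = value, tail = nextNode}, mirroring the pseudocode above; the helper names are made up):
local function reversed(list)
  local rev = nil
  local current = list
  while current do
    rev = { head = current.head, tail = rev } -- prepend, so the order ends up reversed
    current = current.tail
  end
  return rev
end

local function operateBackToFront(list, operation)
  local current = reversed(list) -- one O(N) pass to build the reversed list
  while current do
    operation(current.head)      -- a second O(N) pass visits the elements last-to-first
    current = current.tail
  end
end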
Your algorithm is O(n^2) as you suggest in your post. You can do it in O(n), though.
It's important to remember that Big-O notation is an upper bound on the algorithm's time complexity.
1+2+3+...+n = n*(n+1)/2 = 0.5*n^2+O(n)
This is O(n^2), and O(n^2) is tight, i.e. there is no lower runtime order that'd contain your runtime.
A faster algorithm that works from front-to-back could have O(n) instead of O(n^2)
Your runtime analysis is correct: the runtime is 1 + 2 + ... + N, which is the sum of an arithmetic progression and therefore equals (N² + N) / 2.