Which OpenACC directive will tell the compiler to execute a statement on the device only?

I am learning OpenACC with Fortran (with a suite of tools from Nvidia) and am doing it by porting my implementation of the Conjugate Gradient (CG) solver to GPUs.
Clearly, I am trying to keep as much data as possible on the device (GPU memory), with the following commands:
! Copy matrix (a_sparse), vectors (ax - b) and scalars (alpha - pap) to GPU
!$acc enter data copyin(a_sparse)
!$acc enter data copyin(a_sparse % row(:))
!$acc enter data copyin(a_sparse % col(:))
!$acc enter data copyin(a_sparse % val(:))
!$acc enter data copyin(ax(:), ap(:), x(:), p(:), r(:), b(:))
!$acc enter data copyin(alpha, beta, rho, rho_old, pap)
From that point on, all operations constituting the solution algorithm of the CG solver are done with the present clause. For a vector operation, an excerpt looks like:
49 !$acc parallel loop &
50 !$acc& present(r, b, ax)
51 do i = 1, n
52 r(i) = b(i) - ax(i)
53 end do
I do the same things with scalars, for example:
87 !$acc kernels present(alpha, rho, pap)
88 alpha = rho / pap
89 !$acc end kernels
All scalar variables are on the device. With lines 87-89 I am trying to execute the command alpha = rho / pap on device only, avoiding any data transfer from or to host, but nsight-sys profiler shows me the following:
To my astonishment, there seems to be data transfer at line 87, both before (red "Enter Data" square) and after (red "Exit Data" square) the compute construct (blue "Cg.f90: 87" square).
Could anyone tell me what is going on? Are lines 87-89 executed on the device? If so, why does there seem to be data transfer between the host and the device? If not, is there an OpenACC directive that would direct the compiler to execute a statement, which is not necessarily a loop, on the device only? Moreover, why are there no corresponding CUDA calls for these "Enter Data" and "Exit Data" fields?
I noticed the same for array operations such as the ones in lines 49-53 above: there is some data transfer there too, but I could attribute it to the variable n, which has to be passed to the device.

It could be a few things. The Fortran standard specifies that the right-hand side of an array syntax operation must be fully evaluated before assignment to the left-hand side, so the compiler may be allocating a temporary array to hold the result of the evaluation. Often the compiler can optimize away the need for the temporary, so this may or may not be the issue. Try making this an explicit loop rather than using array syntax to see if that solves it.
A second possibility is that the compiler needs to copy the array descriptors since it can't tell whether they have changed. In that case, though, I'd expect to see some data movement rather than just the enter/exit regions.
The third possibility is that this is just the present check itself, which still makes the enter/exit runtime calls. Instead of copying data, the call looks up the device pointer, which is later passed to the kernel call, and the reference counter is incremented/decremented.


OpenCL kernel out of resources based on number of loop iterations INSIDE the kernel. Can the compiled kernel be too large to fit on the GPU?

TL;DR I have an OpenCL kernel that loops for a large number of iterations and calls several user-made functions. The kernel works fine for few iterations, but increasing the number of iterations causes a CL_OUT_OF_RESOURCES (-5) error. If the same kernel is executed on a better GPU, it is able to loop for more iterations without the error. What can be causing this error based on the number of iterations? Is it possible that the loops are being unrolled and generating code larger than the GPU can hold?
I am developing an OpenCL kernel to run on GPU that computes a very complex function. To keep things organized, I have a "kernel.cl" file with the main kernel (the __kernel void function) and an "aux_functions.cl" file with ~20 auxiliary functions (they are of type int, int2, int16, but not __kernel) that are called several times by the kernel and by themselves.
The problem specification is roughly as follows (justification for so many loops):
I have two arrays representing full HD images (1920x1080 integers)
For each 128x128 patch of one image, I must find the value of 4 parameters that optimize a given function (the second image is used to evaluate how good it is)
For the same 128x128 patch and the same 4 parameters, each 4x4 sub-patch is transformed slightly different based on its position inside the larger 128x128 patch
And I tried to model the execution as follows:
Each workgroup will compute the kernel for one 128x128 patch (I started processing only the 10 first patches -- 10 workgroups)
Each workgroup is composed of 256 workitems
Each workitem will test a distinct set of values (a fraction of a predefined set) for the 4 parameters based on its ID
The main structure of the kernel is as follows:
__kernel void funct(__global int *referenceFrameSamples, __global int *currentFrameSamples, const int frameWidth, const int frameHeight, __global int *result){
    // Initialize some variables and get global and local IDs
    for(executed X times){ // These 4 outer loops are used to test different combinations of parameters to be applied in a given function in the inner loops
        for(executed Y times){
            for(executed Z times){
                for(executed W times){
                    // Simple assignments based on the outer for loops
                    for(executed 32x){ // Each execution of the inner loop applies a function to a 4x4 patch
                        for(executed 32x){
                            // The relevant computation is performed here
                            // Calls a couple of lightweight functions using only the private variables
                            // Calls a complex function that uses the __global int *referenceFrameSamples variable
                        }
                    }
                    // Compute something and use select() to update the best value
                }
            }
        }
    }
    // Write the best value to __global *results buffer
}
The problem is that when the outer 4 loops are repeated a few times the kernel runs fine, but if I increase the iterations the kernel crashes with the error ERROR! clWaitForEvents returned CL_OUT_OF_RESOURCES (-5). I am testing it on a notebook with a GPU GeForce 940MX with 2GB, and the kernel starts crashing when X * Y * Z * W = 1024.
The clEnqueueNDRangeKernel() call returns no error; only the clWaitForEvents() call after it returns an error. I am using CLIntercept to profile the errors and running time of the kernel. Also, when the kernel runs smoothly I can measure the execution time correctly (shown below), but when it crashes, the "measured" execution time is ridiculously wrong (billions of milliseconds) even though it crashes within the first seconds.
cl_ulong time_start;
cl_ulong time_end;
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(time_start), &time_start, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(time_end), &time_end, NULL);
double nanoSeconds = (double)(time_end - time_start);
printf("OpenCL execution time is: %0.3f milliseconds\n", nanoSeconds / 1000000.0);
What I tested:
Improve the complex auxiliary function that used the __global variable: instead of passing the __global pointer, I read the relevant part of the array into a private array and passed it as an argument. Outcome: improved running time in the success cases, but it still fails in the same case
Reduce workgroups and workitems: even using 1 workgroup and 1 workitem (the absolute minimum) with the same number of iterations yields the same error. For a smaller number of iterations, running time decreases with fewer workitems/groups
Running the same kernel on a better GPU: after doing the previous 2 modifications (improved function and reduced workitems) I launched the kernel on a desktop equipped with a GPU Titan V with 12GB. It is able to compute the kernel with a larger number of iterations (I tried up to 1 million iterations) without giving the CL_OUT_OF_RESOURCES, and the running time seems to increase linearly with the iterations. Although this is the computer that will actually run the kernel over a dataset to solve my problem, it is a server that must be accessed remotely. I would prefer to do the development on my notebook and deploy the final code on the server.
My guess: I know that function calls are inlined in GPU. Since the program is crashing based on the number of iterations, my only guess is that these for loops are being unrolled, and with the inlined functions, the compiled kernel is too big to fit on the GPU (even with a single workitem). This also explains why using a better GPU allows increasing the number of iterations.
Question: What could be causing this CL_OUT_OF_RESOURCES error based on the number of iterations?
Of course I could reduce the number of iterations in each workitem, but then I would need multiple workgroups to process the same set of data (the same 128x128 patch) and would need to access global memory to select the best result between workgroups of the same patch. I may end up proceeding in this direction, but I would really like to know what is happening with the kernel at this moment.
Update after #doqtor comment:
Using -cl-nv-verbose when building the program reports the following resource usage. Strangely, these values do not change with the number of iterations, whether the program runs successfully or crashes.
ptxas info : 0 bytes gmem, 532 bytes cmem[3]
ptxas info : Compiling entry function 'naive_affine_2CPs_CU_128x128_SR_64x64' for 'sm_50'
ptxas info : Function properties for naive_affine_2CPs_CU_128x128_SR_64x64
ptxas . 66032 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 140 registers, 360 bytes cmem[0]
Running clinfo reports that my GPU has
Registers per block (NV) 65536
Global memory size 2101870592 (1.958GiB)
Max memory allocation 525467648 (501.1MiB)
Local memory size 49152 (48KiB)
It seems that I am not using too many registers, but I don't know how those stack frame, cmem[0] and cmem[3] relate to the memory information reported by clinfo.
"Is it possible that the loops are being unrolled and generating code larger than the GPU can hold?"
Yes, that is part of the problem. The compiler sees that you have loops with fixed, small ranges and automatically unrolls them. With six nested loops the generated assembly blows up, which can lead to register spilling into global memory and makes the application very slow.
However, even if the compiler does not unroll the loops, every thread performs X*Y*Z*W*32*32 iterations of "The relevant computation" (1,048,576 iterations when X*Y*Z*W = 1024), which takes an eternity. The system thinks the GPU has frozen, and you get CL_OUT_OF_RESOURCES.
Can you really not parallelize any of these six nested loops? The best solution would be to parallelize them all, that is, include them in the kernel range and launch a few hundred million threads that do "The relevant computation" without any loops. You should have as many independent threads / workgroups as possible to get the best performance (saturate the GPU).
Remember, your GPU has thousands of cores grouped into warps of 32 and SMs of 2 or 4 warps; if you launch only a single workgroup, it will run on a single SM with 64 or 128 cores and the remaining cores stay idle.
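To illustrate what "including the loops in the kernel range" means, here is a minimal sketch in Python (used only as a neutral notation, not OpenCL). It decodes a flat work-item index into one index per former loop. The function name unflatten and the loop sizes in dims are hypothetical; the question does not state the actual X, Y, Z, W values.

```python
def unflatten(flat_id, dims):
    """Decode a flat work-item index into one index per former loop."""
    indices = []
    for size in reversed(dims):      # the innermost loop varies fastest
        indices.append(flat_id % size)
        flat_id //= size
    return tuple(reversed(indices))

# Hypothetical loop sizes (X, Y, Z, W) with X*Y*Z*W = 1024.
dims = (4, 4, 8, 8)
total_work_items = dims[0] * dims[1] * dims[2] * dims[3]

# Each work-item would evaluate exactly one parameter combination.
combinations = [unflatten(i, dims) for i in range(total_work_items)]
print(combinations[9])
```

In an actual kernel the same arithmetic would be applied to get_global_id(0), so that each work-item tests one parameter combination instead of looping over all of them.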

How to keep track of the seed

In Lua it's common knowledge that you can use math.randomseed, but it's also obvious that math.random changes the seed as well (calling it twice does not return the same result). What does it set the seed to, and how can I keep track of it? If that's impossible, please explain why.
This is not a Lua question, but a general question about how some RNG algorithms work.
First, Lua doesn't have its own RNG: it just outputs a (slightly mangled) value from the RNG of the underlying C library. Most RNG implementations do not reveal their inner state, but sometimes you can calculate it yourself.
For example, when you use Lua on Windows, you'll be using the LCG-based RNG from the MS C library. The numbers you get are a slice of the seed, not the full value. There are two ways you can deal with that:
If you know how many times you called random, you can take the initial seed value, feed it to your own copy of the same algorithm (with the same constants that are hardcoded in the MS library) and get the exact value of the seed.
If you don't, but you can be sure that nobody interferes between your two calls to random, you can take two generated numbers and reverse the LCG algorithm by shifting the bits back to their place. This leaves you with several missing bits (plus one more bit thanks to Lua's mangling) that you simply need to brute-force: iterate over all values of the missing bits until your copy of the algorithm produces exactly the same two "random" numbers you recorded before. That will be the current seed stored inside the library's RNG as well. A well-programmed solution in Lua can brute-force this in about 0.2-0.5s on a somewhat dated PC; I have done it in the past. Here's an example on Crypto.SE discussing this task in more detail: Predicting values from a Linear Congruential Generator.
The first approach can be used with any other RNG algorithm that doesn't use any real entropy, the second with most RNGs that don't mask so many bits of the slice that brute-forcing becomes unreasonable.
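Both approaches can be sketched as follows, in Python for brevity. The constants 214013 and 2531011 and the "bits 16..30" output slice are those of the Microsoft C runtime's rand(). The sketch assumes you have the raw rand() outputs, so Lua's extra scaling is ignored, and the helper names are made up for illustration.

```python
A, C, MASK32 = 214013, 2531011, 0xFFFFFFFF

def step(state):
    """Advance the 32-bit LCG state by one call to rand()."""
    return (A * state + C) & MASK32

def output(state):
    """rand() returns bits 16..30 of the state."""
    return (state >> 16) & 0x7FFF

def state_after(seed, calls):
    """Approach 1: replay the generator from a known seed and call count."""
    state = seed & MASK32
    for _ in range(calls):
        state = step(state)
    return state

def candidate_states(out1, out2):
    """Approach 2: brute-force the 16 hidden low bits of the state behind out1.

    Bit 31 never influences later outputs, so it is left as zero.
    Several candidates may survive; a third output narrows them further.
    """
    found = []
    for low in range(1 << 16):
        state = (out1 << 16) | low
        if output(step(state)) == out2:
            found.append(state)
    return found

s1 = state_after(1, 1)          # state after srand(1) and one rand() call
s2 = step(s1)
print(output(s1), output(s2))   # the first two rand() values
print((s1 & 0x7FFFFFFF) in candidate_states(output(s1), output(s2)))
```

Two outputs usually leave a handful of candidate states, so in practice you would record one or two more numbers and keep only the candidate that reproduces them all.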
The real answer, though, is: you don't need to keep track of the seed at all. What you want is probably something else.
If you set a seed, all numbers math.random() generates are pseudo-random (this is always the case, as the system will generate a seed by itself otherwise).
So if you reset the seed to the same value, you can predict the upcoming values, but only as many of them as you have already generated with that seed.
What the seed does not do is keep the output of math.random() the same. It would only stay the same if you kept resetting the seed to the same value.
An analogy as an example
Imagine the random number is an integer between 0 and 9 (instead of a double between 0 and 1).
math.random() could traverse pi's decimals from an arbitrary starting position (default could be system time).
What you do when you use math.randomseed() is (not literally; this is an analogy, as mentioned) set the starting position in pi's decimals from which your numbers are retrieved.
If you now reset the seed to the same starting position, the numbers are going to be the same as the last time you reset the starting position.
You will know the numbers up to the last call; after that you can't be certain anymore.
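The effect of reseeding can be demonstrated with a few lines of Python. Python's random module is used here only as a stand-in for math.randomseed and math.random; the seed value 12345 is arbitrary.

```python
import random

random.seed(12345)                       # choose a starting position
first_run = [random.random() for _ in range(5)]

random.seed(12345)                       # reset to the same starting position
second_run = [random.random() for _ in range(5)]

# Same seed, same sequence; consecutive calls still differ from each other.
print(first_run == second_run)
print(first_run[0] != first_run[1])
```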

32 bit multiplication on 24 bit ALU

I want to port a 32 by 32 bit unsigned multiplication to a 24-bit DSP (it's a Linear Congruential Generator, so I'm not allowed to truncate, and I don't want to replace the current LCG with a 24-bit one yet). The available data types are 24 and 48 bit ints.
Only the 32 least significant bits are needed. Do you know any hacks to implement this in fewer multiplies, masks and shifts than the usual way?
The line looks like this:
//val is an int(32 bit)
val = (1664525 * val) + 1013904223;
An outline would be (in my current compiler style):
static uint48_t val = SEED;
val = 0xFFFFFFFFUL & ((1664525UL * val) + 1013904223UL);
and hopefully the compiler will recognise:
it can use a multiply-and-accumulate instruction
it only needs a reduced multiply algorithm, since the "high word" of the constant is zero
the AND could be effected by resetting the upper bits, or by multiplying by a constant and restoring
...other stuff depends on your {mystery dsp} target
If you scale up the coefficients by 2^16, you can get truncation for free, but given the lack of info you will have to explore and decide whether it is better overall.
(This is more an elaboration of why two multiplications 24×24→n, 31<n, are enough for 32×32→min(n, 40).)
The question discloses amazingly little about the capabilities available for building a 32×21→32 method in fewer [24×24] multiplies, masks and shifts than the usual way: just 24 and 48 bit ints and a DSP (which I read as a high-throughput, low-latency 24×24→48 multiply).
As long as there is indeed a 24×24→48 multiply (or even a 24×24+56→56 MAC) and one factor is less than 24 bits wide, the question is moot: a second multiply is the compelling solution.
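The two-multiply approach for the LCG from the question can be sketched as below. Python integers stand in for the DSP's 48-bit type, and mul24 is a hypothetical model of a 24×24→48 multiply; the constants are the ones from the question, and the multiplier 1664525 fits in 24 bits.

```python
MASK24 = (1 << 24) - 1
MASK32 = (1 << 32) - 1
MULT, INC = 1664525, 1013904223   # LCG constants from the question

def mul24(a, b):
    """Model of a 24x24 -> 48 bit multiply (assumed DSP primitive)."""
    assert 0 <= a <= MASK24 and 0 <= b <= MASK24
    return a * b

def lcg_next(val):
    """One LCG step on a 32-bit value using two 24x24 multiplies."""
    lo = val & MASK24                     # low 24 bits of the state
    hi = (val >> 24) & 0xFF               # high 8 bits of the state
    low_part = mul24(MULT, lo)            # full partial product, fits in 48 bits
    high_part = mul24(MULT, hi) & 0xFF    # only 8 bits survive the shift by 24
    return (low_part + (high_part << 24) + INC) & MASK32

print(hex(lcg_next(0xDEADBEEF)))
```

The intermediate sum stays below 2^48, so under these assumptions it never overflows a 48-bit accumulator before the final 32-bit mask.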
The usual composition of a (24<n<48)×(24<m<48)→(24<p) multiply from 24×24→48 multiplies uses three of the latter; a compiler should know as well as a coder that "the fourth multiply" would yield bits with a significance/position exceeding the combined lengths of the lower parts of the factors.
So, is it possible to generate "the long product" using just a second 24×24→48?
Let the (bytes of the) factors be w_xyz and W_XYZ, respectively, the underscores suggesting that "the Ws" are the lower-significance bits of the higher-significance words/ints when interpreted as 24-bit ints. The first 24×24→48 multiply (xyz × XYZ) gives the sum of the partial products of the low words; what is still needed is wZ + zW.
This can be computed using one combined multiplication:
((w<<16)|(z & 0xff)) × ((W<<16)|(Z & 0xff)). (Never mind the 17th bit of wZ+zW "running" into wW.)
(In the first revision of this answer, I foolishly produced wZ and zW separately; their sum is what is wanted in the end, anyway.)
(Annoyingly, this is about all you can do with 24×24→24 as a base operation too: beyond this "combining multiplication", you need four multiplies instead of one.)
Another angle to explore is choosing a different PRNG.
It may have to be >24 bits (tell!).
On a 24 bit machine, XorShift* (or even XorShift+) 48/32 seems worth a look.

CUDA 128 bytes read in a single instruction

I am new to CUDA and am currently optimizing an existing application for molecular dynamics. It takes an array of double4 with coordinates and computes forces based on the neighbor list. I wrote a kernel with the following lines:
double4 mPos=d_arr_xyz[gid];
Then Calc takes d_arr_xyz[id] and calculates the force. That gives 1 read of double4 plus 65 reads of (int + double4) inside every call of Calc (65 is the average number of neighbors, i.e. entries not equal to -1, in d_neib_list for each particle).
Is it possible to reduce those reads? The neighbor lists of different particles, i.e. d_arr_xyz[gid] and d_arr_xyz[id], do not correlate, so I cannot use shared memory to cache d_arr_xyz for the block of threads.
What I see is that if I could somehow load the whole list of int*MAX_NEIGHBORS into shared memory in one or a few large transactions, that would remove 65 separate reads of int.
So the question is: is it possible to have those 65 reads of int translated into several large transactions? I read in the documentation that reads can be as long as 128 bytes. What exactly should I write so that the assembler makes 1 large call?
Thank you for your replies. Following the answer from user talonmies below, I changed the code by swapping the x and y dimensions of the neighbors matrix. Now consecutive threads load consecutive int[gid], which I guess may result in a 128 byte read. The program runs 8% faster.
All memory transactions are issued (where possible) on a per warp basis. So the 128 byte transaction you are asking about is when all 32 threads in a warp issue a memory load instruction which can be serviced in a single "coalesced" transaction. A single thread can't issue large memory transactions, only a warp of 32 threads can, and only when the memory coalescing requirements of whichever architecture you run the code on can be satisfied.
I couldn't really follow your description of what your code is actually doing, but from first principles alone, the answer would appear to be no.
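To see why the layout of the neighbor matrix matters for coalescing, here is a deliberately simplified model in Python. It only counts how many aligned 128-byte segments the 32 loads of one warp fall into; real hardware rules differ between architectures. MAX_NEIGHBORS = 128 and N_PARTICLES = 4096 are hypothetical values, not taken from the question.

```python
def segments_touched(byte_addresses, segment_bytes=128):
    """Count the distinct aligned segments that a warp's loads fall into."""
    return len({addr // segment_bytes for addr in byte_addresses})

WARP, INT_BYTES = 32, 4
MAX_NEIGHBORS, N_PARTICLES = 128, 4096   # hypothetical sizes
j = 0                                    # neighbor slot read by every thread

# Layout A: neib[gid * MAX_NEIGHBORS + j], threads of a warp are far apart.
per_particle_rows = [(gid * MAX_NEIGHBORS + j) * INT_BYTES for gid in range(WARP)]

# Layout B: neib[j * N_PARTICLES + gid], threads of a warp read adjacent ints.
transposed = [(j * N_PARTICLES + gid) * INT_BYTES for gid in range(WARP)]

print(segments_touched(per_particle_rows))   # one segment per thread
print(segments_touched(transposed))          # a single 128-byte segment
```

In this model the transposed layout, which is what the update to the question describes, lets the 32 int loads of a warp be served from one 128-byte segment instead of 32 separate ones.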

Trying to implement an 8 point 1D DCT-II in labview; can only put one value in my output array

I am trying to implement a 1D DCT type II filter in Labview. The formula for this can be seen here
As you can see, each xk is a sum over an iteration of n.
As far as I know, the nested for loop should handle the function, with the shift registers keeping a running total of the output. My problem lies with the output to the array xk. There is either only one output to the array, or each output overwrites the last one because there is no indexing. Trying to put the array inside the for loop results in an error between the shift register and the array:
You have connected two terminals of different types.
The source is a double and the sink is a 1D array of double
Anyone know how I can index the output to the array?
I believe this should work. Please check the math.
The inner for-loop will run either 8 times or as many times as there are elements in the array xn. LabVIEW uses whichever number is smaller to determine the iteration count. So if xn is empty, the for loop won't run at all. If it has 20 elements, the for loop will run 8 times.
Regardless, the outer loop will always run 8 times, so xk will have 8 elements in total.
Also, shift registers that are not initialized at the beginning of a for or while loop can cause problems, unless that is what you intend. The value left in the shift register after the first run could be a problem the second time you run the VI.
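As a text reference for checking the block diagram's math, here is a minimal Python sketch of the same loop structure. It assumes the unnormalized DCT-II, X_k = sum over n of x_n * cos(pi/N * (n + 1/2) * k); the formula linked in the question may use a different scaling. The list result plays the role of the auto-indexed output array, and total plays the role of a shift register that is initialized on every outer iteration.

```python
import math

def dct2(x):
    """Unnormalized DCT-II of the sequence x (assumed definition, see above)."""
    n_points = len(x)
    result = []                       # one element is appended per outer iteration
    for k in range(n_points):         # outer loop: one output value per k
        total = 0.0                   # running sum, reset for every k
        for n in range(n_points):     # inner loop: accumulate over the inputs
            total += x[n] * math.cos(math.pi / n_points * (n + 0.5) * k)
        result.append(total)
    return result

# A constant input has energy only in the k = 0 term.
spectrum = dct2([1.0] * 8)
print(len(spectrum), round(spectrum[0], 6))
```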
