difference between parameter __local and within-kernel __local in OpenCL - local

Sometimes openCL kernels have in their parameters
kernel someName (some globalparameters, __local int* myArray) {
while other scripts have somewhere inside the kernel
__local int myArray[length];
what is the difference between these? I thought they were only syntactic differences, but now I see inside one of the official AMD samples (RadixSort) the comment
__local KV_TYPE localDataArray[TPG*4*2]; // Faster than using it as a parameter !!!
so apparently I was wrong. And effectively, when I try it in my scripts it goes faster. Buy why? Aren't both supposed to be simply array pointers?

I'm not sure about the performance issue, but the primary difference is not mere syntax. Passing a local buffer as a parameter allows the size to be determined at runtime. Defining the buffer in the kernel itself requires the size to be known at compile time.

Related

Is it safe for an OpenCL kernel to randomly write to a __global buffer?

I want to run an instrumented OpenCL kernel to get some execution metrics. More specifically, I have added a hidden global buffer which will be initialized from the host code with N zeros. Each of the N values are integers and they represent a different metric, which each kernel instance will increment in a different manner, depending on its execution path.
A simplistic example:
__kernel void test(__global int *a, __global int *hiddenCounter) {
if (get_global_id(0) == 0) {
// do stuff and then increment the appropriate counter (random numbers here)
hiddenCounter[0] += 3;
}
else {
// do stuff...
hiddenCounter[1] += 5;
}
}
After the kernel execution is complete, I need the host code to aggregate (a simple element-wise vector addition) all the hiddenCounter buffers and print the appropriate results.
My question is whether there are race conditions when multiple kernel instances try to write to the same index of the hiddenCounter buffer (which will definitely happen in my project). Do I need to enforce some kind of synchronization? Or is this impossible with __global arguments and I need to change it to __private? Will I be able to aggregate __private buffers from the host code afterwards?
My question is whether there are race conditions when multiple kernel instances try to write to the same index of the hiddenCounter buffer
The answer to this is emphatically yes, your code will be vulnerable to race conditions as currently written.
Do I need to enforce some kind of synchronization?
Yes, you can use global atomics for this purpose. All but the most ancient GPUs will support this. (anything supporting OpenCL 1.2, or cl_khr_global_int32_base_atomics and similar extensions)
Note that this will have a non-trivial performance overhead. Depending on your access patterns and frequency, collecting intermediate results in private or local memory and writing them out to global memory at the end of the kernel may be faster. (In the local case, the whole work group would share just one global atomic call for each updated cell - you'll need to use local atomics or a reduction algorithm to accumulate the values from individual work items across the group though.)
Another option is to use a much larger global memory buffer, with counters for each work item or group. In that case, you will not need atomics to write to them, but you will subsequently need to combine the values on the host. This uses much more memory, obviously, and likely more memory bandwidth too - modern GPUs should cache accesses to your small hiddenCounter buffer. So you'll need to work out/try which is the lesser evil in your case.

OpenCL When to use global, private, local, constant address spaces

I'm trying to learn OpenCL but I'm a having a hard time deciding which address spaces to use, as I only find assembled resources declaring what these address spaces are, but not why they exist or when to use them. The resources are at least too scattered, so with this question I hope to assemble all this information: what are all the address spaces, why do they exist, when to use which address space and what are the advantages and disadvantages regarding memory and performance.
As I understand it (which is probably too simplified), the GPU has two physical types of memory: global memory, far from the actual processors, so slow but pretty big and available to all workers, and local memory, close to the actual processors, so fast but small and not accessible from other workers.
Intuitively, the local qualifier makes sure a variable is placed on local memory and the global qualifier makes sure a variable is placed on global memory, though I'm not sure this is exactly what happens. This leaves the private and constant qualifiers. What's the purpose of those?
There also are some implicit qualifiers. For example, the specifications mention the generic address space, which is used for arguments with no qualifiers, I think. What does this do exactly? Then there also are local function variables. What's the address space for those?
Here is an example using my intuition, but without knowing what I'm actually doing:
Example:
Say I pass an array of type long and length 10000 to a kernel which I will only use to read, then I would declare it global const as it must be available to all workers and it will not change. Why wouldn't I use the constant qualifier? When setting the buffer for this array via the CPU, I actually also just could have made the array read-only, which in my eyes says the same as declaring it const. So again, when and why would I declare something constant or global const?
When performing memory-intensive tasks, would it be better to copy the array to a local array inside the kernel? My guess is that local memory would be too small, but what if the array only had a length of 10? When would the array be too big/small? More general: when is it worth copying data from global to local memory?
Say I also want to pass the length of this array, then I would add const int length to the arguments of my kernel, but I'm unsure why I would omit the global qualifier except because I have seen other people do it. After all, length must be accessible for all workers. If I'm right, then length would have a generic address space, but again, I don't really know what that means.
I hope someone with some experience can clear this up. That would be great not only for me, but I hope also for other enthusiasts who want to gain some practical knowledge concerning memory management on the GPU.
Constant: A small portion of cached global memory visible by all workers. Use it if you can, read only.
Global: Slow, visible by all, read or write. It is where all your data will end, so some accesses to it are always necessary.
Local: Do you need to share something in a local group? Use local! Do all your local workers access the same global memory? Use local!
Local memory is only visible inside local workers, and is limited in size, however is very fast.
Private: Memory that is only visible to a worker, consider it like registers. All non defined values are private by default.
Say I pass an array of type long and length 10000 to a kernel which I
will only use to read, then I would declare it global const as it must
be available to all workers and it will not change. Why wouldn't I use
the constant qualifier?
Actually, yes, you can and you should use constant qualifier. Which places your data on the constant memory (a small portion of read only memory quickly accessible by all workers). This is used by GPUs to transfer uniforms to all vertex shaders.
When setting the buffer for this array via the CPU, I actually also
just could have made the array read-only, which in my eyes says the
same as declaring it const. So again, when and why would I declare
something constant or global const?
Not really, when you create a buffer read only you are only specifiying to OpenCL you plan to use it read only, so it can do optimizations in the back, but you can actually write to it from a kernel.
global const is just a safeguard for the developer, so you don't accidentally write to it, it will give an error at compile time.
Basically, the same as in plain C host side computing. Programs will also work fine if all memory is non-const.
When performing memory-intensive tasks, would it be better to copy the array to a local array inside the kernel? My guess is that local memory would be too small, but what if the array only had a length of 10? When would the array be too big/small? More general: when is it worth copying data from global to local memory?
It is only worth if it is read by all workers. If each worker reads a single value of the global memory, then it is not worth.
Useful here:
Worker0 -> Reads 0,1,2,3
Worker1 -> Reads 0,1,2,3
Worker2 -> Reads 0,1,2,3
Worker3 -> Reads 0,1,2,3
Not useful here:
Worker0 -> Reads 0
Worker1 -> Reads 1
Worker2 -> Reads 2
Worker3 -> Reads 3
Say I also want to pass the length of this array, then I would add
const int length to the arguments of my kernel, but I'm unsure why I
would omit the global qualifier except because I have seen other
people do it. After all, length must be accessible for all workers. If
I'm right, then length would have a generic address space, but again,
I don't really know what that means.
When you don't specify a qualifier in the kernel parameter it typically defaults to constant, which is what you want for those small elements, to have a fast access by all workers.
The rules normally OpenCL compilers follow for kernel parameters is: if it only read and fits in constant, constant, otherwise global.

Passing arguments through __local memory in OpenCL

I am confused about the the __local memory in OpenCL here.
I read some spec saying that the data flow has to be from Host to
__Global, and then __Local.
But I also see some kernel function like this:
__kernel void foo(__local float * a)
I was wondering how the data was transferred directly into the __local
memory in this way?
Thanks.
It is not possible to fill local buffer on the host side. Therefore you have to follow the flow host -> __global -> __local.
Local buffer can be either created on the host side and then it is passed as a kernel parameter or on gpu side inside the kernel.
Creating local buffer on the host side gives the advantage to decide about its size before the kernel is run which can be important if the local buffer size needs to be different each time the kernel is run.
Local memory is not visible to anything but a single work-group, and may be allocated as the work-group is dispatched by hardware on many architectures. Hardware that can mix multiple work-groups from different kernels on each CU will allow the scheduling component to chunk up the local memory for each of the groups being issued. It doesn't exist before the group is launched, and does not exist after the group terminates. The size of this region is what you pass in as other answers have pointed out.
The result of this is that the only way on many architectures for filling local memory from the host would be for kernel code to be inserted by the compiler that would copy data in from global memory. Given that as the basis, it isn't any worse in terms of performance for the programmer to do it manually, and gives more control over exactly what happens. You do not end up in a situation where the compiler always generates copy code and ends up copying more than was really necessary because the API didn't make it clear what memory was copy-in and what was not.
In summary, you cannot fill local memory in any automated way. In practice you will rarely want to, because doing it manually gives you the opportunity to only put the result of a first stage into local, removing extra copy operations, or to transform the data on the way in to local, allowing padding or data transposition to remove bank conflicts and so on.
As #doqtor said, the size of local memory on kernel parameter can be specified by clSetKernelArg calls.
Fortunately, OpenCL 1.2+ support VLA(variable length array), local memory kernel parameter is not required any more.

Why dead code in OpenCL kernel influence result in Nvidia GTX550ti?

I am using OpenCL dev software of Nvidia on GTX550ti graphics card, and encounter a strange problem. (I am freshman for OpenCL).
My kernel code is like this:
__kernel void kernel_name(...)
{
size_t d = get_local_id(0);
char abc[8];
...
}
Actually, the char abc[8] is useless (dead code) for my case. But, if I have the char abc[8] in my kernel code, the result will be totally messy and the running time of kernel will be much longer (2095712 ns). If I comment out the char abc[8], the result becomes correct, and the running time of kernel becomes shorter (697856 ns). The compiler of kernel won't wipe off the dead code?
The above is just an explicit example that I can repeat. I also encounter more stranger case that one program gets different result when run at different time in totally the same environment.
Is that related to memory allocation or..? Anyone can give me some advice on how to find the problem?
By the way, oclDeviceQuery output information is listed as follows:
Platform Version = OpenCL 1.1
CUDA 4.2.1,
SDK Revision = 7027912
My OS is Windows XP.
Today is 2012-07-17, and I think I have resolved this problem.
don't use #include in kernel source file.
don't use ultra length line (for example, you write program to generate some line data for kernel source file) in kernel source file.
You're right, that shouldn't effect anything.
That's not your real code though, and I suspect given those run-times that your kernel isn't a simple thing. Possibly you're pushing your locals over some limit which means that variables are having to be stored in some slower memory which pushes your run-times up.
Something like that might also cause a change in behaviour if you had an uninitialised variable bug somewhere. In the fast store it happens to get a value that works. In the slow store it gets something else.
To check this theory I'd try to remove some other local data structure and see if it has the same effect. Anything else 8 bytes or larger should have the same effect.
...of course it's possibly you've found a bug in the OpenCL implementation, but that's easy to check. Just compile the kernel for a different OpenCL device, e.g. the CPU. This is worth doing anyway because different compiler pick up different issues.
Other than that I think you're back to standard debug techniques.
BTW: at one point in your question you call the array abs[8] rather than abc[8]. I assume that's a typo, but if it isn't then that could be your problem as the abs name will clash with the abs() function. That could confuse a stupid compiler.

Question about cl_mem in OpenCL

I have been using cl_mem in some of my OpenCL boilerplate code, but I have been using it through context and not a sharp understanding of what exactly it is. I have been using it as a type for the memory I push on and off the board, which has so far been floats. I tried looking at the OpenCL docs, but cl_mem doesn't show up (does it?). Is there any documentation on it, or is it simple and can someone explain.
The cl_mem type is a handle to a "Memory Object" (as described in Section 3.5 of the OpenCL 1.1 Spec). These essentially are inputs and outputs for OpenCL kernels, and are returned from OpenCL API calls in host code such as clCreateBuffer
cl_mem clCreateBuffer (cl_context context, cl_mem_flags flags,
size_t size, void *host_ptr, cl_int *errcode_ret)
The memory areas represented can be permitted different access patterns e.g. Read Only, or be allocated in different memory regions, depending on the flags set in the create buffer calls.
The handle is typically stored to allow a later call to release the memory, e.g:
cl_int clReleaseMemObject (cl_mem memobj)
In short, it provides an abstraction over where the memory actually is: you can copy data into the associated memory or back out via the OpenCL APIs clEnqueueWriteBuffer and clEnqueueReadBuffer, but the OpenCL implementation can allocate the space where it wants.
For the computer a cl_mem is a number (like a file handler for Linux) that is reserved for the use as a "memory identifier"( the API/driver whatever stores information about your memory under this number that it knows what it holds/how big it is and stuff like that)

Resources