Declare __global object inside kernel - memory

I tried to declare a __global memory chunk inside the kernel, like
__global float arr[200];
I assume this would create an array in global memory that I could refer to in the kernel. The program compiled successfully, but when I ran it, it reported:
error: variable with automatic storage duration
cannot be stored in the named address space
I don't know why this happens.
In order to use global memory, do we have to create a buffer on the host side before using it?
If I want to create an array shared by all the threads, what can I do other than passing this global array in as another kernel argument?

You can allocate it in program scope, at least in OpenCL 2.
__global float arr[200];

kernel void foo()
{
    if (get_global_id(0) == 0)
        arr[0] = 3;
}
Be careful with initialization, though: there is no way to synchronize work-items across the dispatch, so it is not really practical to initialize the array and use it in the same kernel if you have multiple work-groups.
It doesn't really make much sense to allocate it in kernel scope. If the work-groups are serialized, what would the lifetime of a global array allocated in kernel code be? Should it outlast a work-group, a dispatch, or stay permanently to be shared between that kernel and the next? The obvious answer might be that it has the same lifetime as the kernel, but then it would be impossible to initialize and use it without a race. If it is persistent across multiple kernels, then host allocation or program-scope allocation makes more sense.
Why is passing a new argument such a problem?

A __global memory object can be allocated only via an API call on the host side.
You can also use a __local memory object, which can be allocated either via an API call on the host side or inside the kernel, and which is visible to all threads within the work-group.
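For the __global case, a minimal host-side sketch (illustrative variable names, error checking omitted): create a 200-float buffer and bind it to a kernel parameter declared as __global float* arr.

cl_int err;
cl_mem arr_buf = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                200 * sizeof(cl_float), NULL, &err);
err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &arr_buf);
/* ... enqueue the kernel as usual; it sees the buffer as "__global float* arr" ... */
clReleaseMemObject(arr_buf);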

creating temporary global memory variables inside kernels [duplicate]

As per my knowledge, atomicAdd can be used on shared memory and global memory. I need to atomically add floating point numbers from threads of different blocks; hence, I need to use a global temporary to hold the sum.
Is there a way to allocate temporary globals from inside a kernel?
Currently, I allocate a temporary global and pass a pointer to my kernel. This doesn't appear to be very user-friendly.
TL;DR: I need a temporary variable for atomic addition across different blocks, without having to explicitly allocate a global and pass a pointer to it to the kernel.
You can use malloc() inside kernel code. However, it's rarely a good idea to do so. It's usually much better to pre-allocate scratch space before the kernel is launched, pass it as an argument, and let each thread, or group of threads, have some formula for determining the location they will use for their common atomics within that scratch area.
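For instance, a minimal sketch of that pre-allocation pattern (illustrative names; assumes float atomicAdd is available, i.e. compute capability 2.0 or later): the host allocates and zeroes a single float of scratch in global memory, and every thread in every block atomically adds its contribution to it.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void sum_kernel(const float* data, float* scratch, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(scratch, data[i]);              // all blocks add into the same global accumulator
}

int main()
{
    const int n = 1 << 20;
    float *d_data, *d_scratch;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMalloc(&d_scratch, sizeof(float));        // the pre-allocated temporary global
    cudaMemset(d_data, 0, n * sizeof(float));     // placeholder input data
    cudaMemset(d_scratch, 0, sizeof(float));      // the accumulator must start at zero

    sum_kernel<<<(n + 255) / 256, 256>>>(d_data, d_scratch, n);

    float result = 0.0f;
    cudaMemcpy(&result, d_scratch, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum = %f\n", result);

    cudaFree(d_data);
    cudaFree(d_scratch);
    return 0;
}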
Now, you've written this isn't very "user-friendly"; I guess you mean developer-friendly. Well, it can be made more friendly! For example, my CUDA Modern C++ API wrappers library offers an equivalent of std::unique_ptr - but for device memory:
#include <cuda/api_wrappers.hpp>
// ... etc. etc. ...
{
    auto scratch = cuda::memory::device::make_unique<float[]>(1024, my_cuda_device_id);
    my_kernel<<<blah, blah, blah>>>(output, input, scratch.get());
} // the device memory is released here!
(This is for synchronous launches, of course; otherwise the memory could be released while the kernel is still using it.)
Something else you can do to be more developer-friendly is to use some kind of proxy function to get the location in that scratch memory relevant to a specific thread / warp / group of threads / whatever it is that shares the same address for atomics. That should at least hide away some of the repetitive, annoying address arithmetic your kernel might otherwise contain.
There's also the option of using global __device__ variables (as @RobertCrovella mentioned), but I wouldn't encourage that: the size would have to be fixed at compile time, you wouldn't be able to use it from two kernels at once without it being painful, etc.
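For completeness, a minimal sketch of that (discouraged) __device__-variable approach, with illustrative names:

__device__ float g_accumulator;                       // size and type fixed at compile time

__global__ void accumulate(const float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&g_accumulator, data[i]);           // shared by all blocks of this kernel
}

// Host side: reset and read it with cudaMemcpyToSymbol() / cudaMemcpyFromSymbol().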

OpenCL When to use global, private, local, constant address spaces

I'm trying to learn OpenCL, but I'm having a hard time deciding which address spaces to use, as I can only find scattered resources stating what these address spaces are, but not why they exist or when to use them. With this question I hope to gather all of this information: what are all the address spaces, why do they exist, when to use which address space, and what are the advantages and disadvantages regarding memory and performance.
As I understand it (which is probably too simplified), the GPU has two physical types of memory: global memory, far from the actual processors, so slow but pretty big and available to all workers, and local memory, close to the actual processors, so fast but small and not accessible from other workers.
Intuitively, the local qualifier makes sure a variable is placed in local memory and the global qualifier makes sure a variable is placed in global memory, though I'm not sure this is exactly what happens. This leaves the private and constant qualifiers. What's the purpose of those?
There also are some implicit qualifiers. For example, the specifications mention the generic address space, which is used for arguments with no qualifiers, I think. What does this do exactly? Then there also are local function variables. What's the address space for those?
Here is an example using my intuition, but without knowing what I'm actually doing:
Example:
Say I pass an array of type long and length 10000 to a kernel which I will only use to read, then I would declare it global const as it must be available to all workers and it will not change. Why wouldn't I use the constant qualifier? When setting the buffer for this array via the CPU, I actually also just could have made the array read-only, which in my eyes says the same as declaring it const. So again, when and why would I declare something constant or global const?
When performing memory-intensive tasks, would it be better to copy the array to a local array inside the kernel? My guess is that local memory would be too small, but what if the array only had a length of 10? When would the array be too big/small? More general: when is it worth copying data from global to local memory?
Say I also want to pass the length of this array, then I would add const int length to the arguments of my kernel, but I'm unsure why I would omit the global qualifier except because I have seen other people do it. After all, length must be accessible for all workers. If I'm right, then length would have a generic address space, but again, I don't really know what that means.
I hope someone with some experience can clear this up. That would be great not only for me, but I hope also for other enthusiasts who want to gain some practical knowledge concerning memory management on the GPU.
Constant: A small portion of cached global memory visible to all workers. Use it if you can; it is read-only.
Global: Slow, visible to all, read or write. It is where all your data ends up, so some accesses to it are always necessary.
Local: Do you need to share something within a work-group? Use local! Do all the workers in a group read the same piece of global memory? Use local!
Local memory is only visible to workers in the same work-group, and is limited in size, but it is very fast.
Private: Memory that is only visible to a single worker; think of it like registers. Anything declared without an address-space qualifier is private by default. (A short kernel putting all four qualifiers together is sketched below.)
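A minimal, illustrative kernel (names and sizes are made up) showing where each address space typically appears:

__kernel void example(__global float* data,        // large read/write data set
                      __constant float* coeffs,    // small read-only table, cached
                      __local float* tile,         // scratch shared by the work-group
                      const int n)                 // scalar argument, private to each worker
{
    size_t gid = get_global_id(0);
    float x = 0.0f;                                // plain local variable: __private by default
    if (gid < (size_t)n)
        x = data[gid] * coeffs[0];
    tile[get_local_id(0)] = x;                     // visible to the whole work-group
    barrier(CLK_LOCAL_MEM_FENCE);                  // make the writes visible within the group
    if (gid < (size_t)n)
        data[gid] = tile[get_local_id(0)];
}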
Say I pass an array of type long and length 10000 to a kernel which I will only use to read, then I would declare it global const as it must be available to all workers and it will not change. Why wouldn't I use the constant qualifier?
Actually, yes, you can and you should use the constant qualifier, which places your data in constant memory (a small portion of read-only memory quickly accessible by all workers). This is what GPUs use to transfer uniforms to all vertex shaders.
When setting the buffer for this array via the CPU, I actually also just could have made the array read-only, which in my eyes says the same as declaring it const. So again, when and why would I declare something constant or global const?
Not really. When you create a read-only buffer, you are only telling OpenCL that you plan to use it read-only, so it can optimize behind the scenes, but you can actually still write to it from a kernel.
global const is just a safeguard for the developer so you don't accidentally write to it; doing so will give an error at compile time.
Basically, the same as in plain C host side computing. Programs will also work fine if all memory is non-const.
When performing memory-intensive tasks, would it be better to copy the array to a local array inside the kernel? My guess is that local memory would be too small, but what if the array only had a length of 10? When would the array be too big/small? More general: when is it worth copying data from global to local memory?
It is only worth it if the data is read by all workers (see the sketch after the example below). If each worker reads just a single value of the global memory, then it is not worth it.
Useful here:
Worker0 -> Reads 0,1,2,3
Worker1 -> Reads 0,1,2,3
Worker2 -> Reads 0,1,2,3
Worker3 -> Reads 0,1,2,3
Not useful here:
Worker0 -> Reads 0
Worker1 -> Reads 1
Worker2 -> Reads 2
Worker3 -> Reads 3
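A minimal sketch of the "useful" pattern above, with illustrative names: the whole work-group needs the same 16-entry table from global memory, so the workers stage it in __local once and then read it many times.

__kernel void apply_table(__global const float* table_global,
                          __global float* data,
                          __local float* table)   /* size set via clSetKernelArg on the host */
{
    size_t lid = get_local_id(0);

    /* Cooperative copy: the workers of the group share the work of staging
       the 16-entry table from global into local memory. */
    for (size_t i = lid; i < 16; i += get_local_size(0))
        table[i] = table_global[i];
    barrier(CLK_LOCAL_MEM_FENCE);                 /* wait until the whole table is in local memory */

    /* Every worker now reads all 16 entries, but from fast local memory. */
    size_t gid = get_global_id(0);
    float acc = 0.0f;
    for (int i = 0; i < 16; ++i)
        acc += data[gid] * table[i];
    data[gid] = acc;
}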
Say I also want to pass the length of this array, then I would add const int length to the arguments of my kernel, but I'm unsure why I would omit the global qualifier except because I have seen other people do it. After all, length must be accessible for all workers. If I'm right, then length would have a generic address space, but again, I don't really know what that means.
When you don't specify a qualifier on a kernel parameter, it typically defaults to constant, which is what you want for those small elements, so that all workers have fast access.
The rule OpenCL compilers normally follow for kernel parameters is: if it is only read and it fits in constant memory, use constant; otherwise, use global.

Questions about CUDA memory

I am quite new to CUDA programming and there are some things about the memory model that are quite unclear to me. How does it work? For example, if I have a simple kernel:
__global__ void kernel(const int* a, int* b)
{
    // some computation where different threads in different blocks
    // might write at the same index of b
}
So I imagine a will be in the so-called constant memory. But what about b? Since different threads in different blocks will write to it, how will it work? I read somewhere that in the case of concurrent writes to global memory by different threads in the same block, at least one write is guaranteed to happen, but there is no guarantee about the others. Do I need to worry about that, i.e. for example have every thread in a block write to shared memory and, once they are all done, have one thread write it all to global memory? Or does CUDA take care of it for me?
So I imagine a will be in the so-called constant memory.
Yes, the pointer a will be in constant memory, but not because it is marked const (that is completely orthogonal). The pointer b is also in constant memory. All kernel arguments are passed in constant memory (except on CC 1.x). The memory pointed to by a and b could, in theory, be anything (device global memory, host pinned memory, anything addressable by UVA, I believe). Where it resides is chosen by you, the user.
I read somewhere that it was guaranteed that in the case of concurrent writes in global memory by different threads in the same block at least one would be written, but there's no guarantee about the others.
Assuming your code looks like this:
b[0] = 10; // Executed by all threads
Then yes, that's a (benign) race condition, because all threads write the same value to the same location. The result of the write is defined, however the number of writes is unspecified and so is the thread that does the "final" write. The only guarantee is that at least one write happens. In practice, I believe one write per warp is issued, which is a waste of bandwidth if your blocks contain more than one warp (which they should).
On the other hand, if your code looks like this:
b[0] = threadIdx.x;
This is plain undefined behavior.
Do I need to worry about that, ie for example have every thread in a block write in shared memory and once they are all done, have one write it all to the global memory?
Yes, that's how it's usually done.
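For example, a minimal sketch of that pattern (illustrative; assumes blockDim.x is a power of two and float atomicAdd is available, i.e. compute capability 2.0 or later): each block first reduces into shared memory, then a single thread per block writes to global memory.

__global__ void block_sum(const float* a, float* b, int n)
{
    extern __shared__ float partial[];             // one float per thread, sized at launch

    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;
    partial[tid] = (i < n) ? a[i] : 0.0f;
    __syncthreads();

    // Tree reduction within the block, entirely in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    // Only one thread per block touches global memory.
    if (tid == 0)
        atomicAdd(&b[0], partial[0]);
}

// Launch with the dynamic shared memory size as the third parameter, e.g.:
// block_sum<<<grid, block, block * sizeof(float)>>>(a, b, n);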

Passing arguments through __local memory in OpenCL

I am confused about __local memory in OpenCL here.
I read some spec saying that the data flow has to be from the host to __global, and then to __local.
But I also see some kernel functions like this:
__kernel void foo(__local float * a)
I was wondering how the data is transferred directly into __local memory in this way?
Thanks.
It is not possible to fill a local buffer from the host side. Therefore you have to follow the flow host -> __global -> __local.
A local buffer can be created either on the host side, where its size is set and it is passed as a kernel parameter, or on the GPU side inside the kernel.
Creating the local buffer on the host side has the advantage that you can decide its size just before the kernel is run, which can be important if the local buffer size needs to be different each time the kernel is run.
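A minimal sketch of both options (kernel and buffer names are illustrative): on the host, passing a NULL pointer together with a size to clSetKernelArg allocates that many bytes of __local memory for the corresponding kernel parameter; alternatively, the kernel declares a fixed-size __local array itself.

/* Host side, for a kernel declared as: __kernel void foo(__local float* a, __global float* data) */
size_t local_bytes = 256 * sizeof(cl_float);       /* size chosen at launch time */
clSetKernelArg(kernel, 0, local_bytes, NULL);      /* NULL pointer => allocate __local memory */
clSetKernelArg(kernel, 1, sizeof(cl_mem), &data_buf);

/* Kernel side, the fixed-size alternative: */
__kernel void bar(__global float* data)
{
    __local float scratch[256];                    /* size fixed at compile time */
    /* ... fill scratch from data, barrier(CLK_LOCAL_MEM_FENCE), then use it ... */
}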
Local memory is not visible to anything but a single work-group, and may be allocated as the work-group is dispatched by hardware on many architectures. Hardware that can mix multiple work-groups from different kernels on each CU will allow the scheduling component to chunk up the local memory for each of the groups being issued. It doesn't exist before the group is launched, and does not exist after the group terminates. The size of this region is what you pass in as other answers have pointed out.
The result of this is that the only way on many architectures for filling local memory from the host would be for kernel code to be inserted by the compiler that would copy data in from global memory. Given that as the basis, it isn't any worse in terms of performance for the programmer to do it manually, and gives more control over exactly what happens. You do not end up in a situation where the compiler always generates copy code and ends up copying more than was really necessary because the API didn't make it clear what memory was copy-in and what was not.
In summary, you cannot fill local memory in any automated way. In practice you will rarely want to, because doing it manually gives you the opportunity to only put the result of a first stage into local, removing extra copy operations, or to transform the data on the way in to local, allowing padding or data transposition to remove bank conflicts and so on.
As @doqtor said, the size of a local-memory kernel parameter can be specified via clSetKernelArg calls.
Fortunately, OpenCL 1.2+ supports VLAs (variable-length arrays), so a local-memory kernel parameter is not required any more.

How to allocate all of the available shared memory to a single block in CUDA?

I want to allocate all the available shared memory of an SM to one block. I am doing this because I don't want multiple blocks to be assigned to the same SM.
My GPU card has 64KB (Shared+L1) memory. In my current configuration, 48KB is assigned to the Shared memory and 16KB to the L1.
I wrote the following code to use up all of the available Shared memory.
__global__ void foo()
{
    __shared__ char array[49152];
    // ...
}
I have two questions:
How can I make sure that all of the shared memory space is used up?
I can increase "48K" (49152) to a much higher value (without getting any error or warning). Can anyone explain this?
Thanks in advance,
Iman
You can read the size of the available per-block shared memory from cudaDeviceProp::sharedMemPerBlock, which you can obtain by calling cudaGetDeviceProperties().
You do not have to specify the size of your array at compile time. Instead, you may pass the shared memory size dynamically as the third kernel launch configuration parameter.
The "clock" CUDA SDK sample illustrates how you can specify shared memory size at launch time.
