In CUDA programming, if we want to use shared memory, we need to bring the data from global memory to shared memory. Threads are used for transferring such data.
I read somewhere (in online resources) that it is better not to involve all the threads in the block for copying data from global memory to shared memory. Such idea makes sense that all the threads are not executed together. Threads in a warp execute together. But my concern is all the warps are not executed sequentially. Say, a block with threads is divided into 3 warps: war p0 (0-31 threads), warp 1 (32-63 threads), warp 2 (64-95 threads). It is not guaranteed that warp 0 will be executed first (am I right?).
So which threads should I use to copy the data from global to shared memory?
To use a single warp to load a shared memory array, just do something like this:
__global__
void kernel(float *in_data)
{
__shared__ float buffer[1024];
if (threadIdx.x < warpSize) {
for(int i = threadIdx; i <1024; i += warpSize) {
buffer[i] = in_data[i];
}
}
__syncthreads();
// rest of kernel follows
}
[disclaimer: written in browser, never tested, use at own risk]
The key point here is the use of __syncthreads() to ensure that all threads in the block wait until the warp performing the load to shared memory have finished the load. The code I posted used the first warp, but you can calculate a warp number by dividing the thread index within the block by the warpSize. I also assumed a one-dimensional block, it is trivial to compute the thread index in a 2D or 3D block, so I leave that as an exercise to the reader.
As block is assigned to multiprocessor it resides there until all threads within that block are finished and during this time warp scheduler is mixing among warps that have ready operands. So if there is one block on multiprocessor with three warps and just one warp is fetching data from global to shared memory and other two warps are staying idle and probably waiting on __syncthreads() barrier, you loose nothing and you are limited just by latency of global memory what you would have been anyway. As soon as fetching is finished warps can go ahead in their work.
Therefore, no guarantee that warp0 is executed first is needed and you can use any threads. The only two things to keep in mind are to ensure as much coalesced access to global memory as possible and avoidance of bank conflicts.
Related
I am writing an algorithm which all blocks are reading a same address. Such as we have a list=[1, 2, 3, 4], and all blocks are reading it and store it to their own shared memory...My test shows the more blocks reading it, the slower it will be...I guess no broadcast happen here? Any idea I can make it faster? Thank you!!!
I learnt from previous post that this can be broadcast in one wrap, seems can not happen in different wrap....(Actually in my case, the threads in one wrap are not reading a same location...)
Once list element is accessed by first warp of a SM unit, the second warp in same SM unit gets it from cache and broadcasts to all simt lanes. But another SM unit's warp may not have it in L1 cache so it fetches from L2 to L1 first.
It is similar in __constant__ memory but it requires same address to be accessed by all threads. Its latency is closer to register access. __constant__ memory is like instruction cache, you get more performance when all threads do same thing.
For example, if you have a Gaussian-filter that iterates over same coefficient-list of filter on all threads, it is better to use constant memory. Using shared memory does not have much advantage as the filter array is not scanned randomly. Shared memory is better when the filter array content is different per block or if it needs random access.
You can also combine constant memory and shared memory. Get half of list from constant memory, then the other half from shared memory. This should let 1024 threads hide latency of one memory type hidden behind the other.
If list is small enough, you can use registers directly (has to be compile-time known indices). But it increases register pressure and may decrease occupancy so be careful about this.
Some old cuda architectures (in case of fma operation) required one operand fetched from constant memory and the other operand from a register to achieve better performance in compute-bottlenecked algorithms.
In a test with 12000 floats as filter to be applied on all threads inputs, shared memory version with 128 threads-per-block completed work in 330 milliseconds while constant-memory version completed in 260 milliseconds and the L1 access performance was the real bottleneck in both versions so the real constant-memory performance is even better, as long as it is similar-index for all threads.
I want to run an instrumented OpenCL kernel to get some execution metrics. More specifically, I have added a hidden global buffer which will be initialized from the host code with N zeros. Each of the N values are integers and they represent a different metric, which each kernel instance will increment in a different manner, depending on its execution path.
A simplistic example:
__kernel void test(__global int *a, __global int *hiddenCounter) {
if (get_global_id(0) == 0) {
// do stuff and then increment the appropriate counter (random numbers here)
hiddenCounter[0] += 3;
}
else {
// do stuff...
hiddenCounter[1] += 5;
}
}
After the kernel execution is complete, I need the host code to aggregate (a simple element-wise vector addition) all the hiddenCounter buffers and print the appropriate results.
My question is whether there are race conditions when multiple kernel instances try to write to the same index of the hiddenCounter buffer (which will definitely happen in my project). Do I need to enforce some kind of synchronization? Or is this impossible with __global arguments and I need to change it to __private? Will I be able to aggregate __private buffers from the host code afterwards?
My question is whether there are race conditions when multiple kernel instances try to write to the same index of the hiddenCounter buffer
The answer to this is emphatically yes, your code will be vulnerable to race conditions as currently written.
Do I need to enforce some kind of synchronization?
Yes, you can use global atomics for this purpose. All but the most ancient GPUs will support this. (anything supporting OpenCL 1.2, or cl_khr_global_int32_base_atomics and similar extensions)
Note that this will have a non-trivial performance overhead. Depending on your access patterns and frequency, collecting intermediate results in private or local memory and writing them out to global memory at the end of the kernel may be faster. (In the local case, the whole work group would share just one global atomic call for each updated cell - you'll need to use local atomics or a reduction algorithm to accumulate the values from individual work items across the group though.)
Another option is to use a much larger global memory buffer, with counters for each work item or group. In that case, you will not need atomics to write to them, but you will subsequently need to combine the values on the host. This uses much more memory, obviously, and likely more memory bandwidth too - modern GPUs should cache accesses to your small hiddenCounter buffer. So you'll need to work out/try which is the lesser evil in your case.
I am quite new to CUDA programming and there are some stuff about the memory model that are quite unclear to me. Like, how does it work? For example if I have a simple kernel
__global__ void kernel(const int* a, int* b){
some computation where different threads in different blocks might
write at the same index of b
}
So I imagine a will be in the so-called constant memory. But what about b? Since different threads in different blocks will write in it, how will it work? I read somewhere that it was guaranteed that in the case of concurrent writes in global memory by different threads in the same block at least one would be written, but there's no guarantee about the others. Do I need to worry about that, ie for example have every thread in a block write in shared memory and once they are all done, have one write it all to the global memory? Or is CUDA taking care of it for me?
So I imagine a will be in the so-called constant memory.
Yes, a the pointer will be in constant memory, but not because it is marked const (this is completely orthogonal). b the pointer is also in constant memory. All kernel arguments are passed in constant memory (except in CC 1.x). The memory pointed-to by a and b could, in theory, be anything (device global memory, host pinned memory, anything addressable by UVA, I believe). Where it resides is chosen by you, the user.
I read somewhere that it was guaranteed that in the case of concurrent writes in global memory by different threads in the same block at least one would be written, but there's no guarantee about the others.
Assuming your code looks like this:
b[0] = 10; // Executed by all threads
Then yes, that's a (benign) race condition, because all threads write the same value to the same location. The result of the write is defined, however the number of writes is unspecified and so is the thread that does the "final" write. The only guarantee is that at least one write happens. In practice, I believe one write per warp is issued, which is a waste of bandwidth if your blocks contain more than one warp (which they should).
On the other hand, if your code looks like this:
b[0] = threadIdx.x;
This is plain undefined behavior.
Do I need to worry about that, ie for example have every thread in a block write in shared memory and once they are all done, have one write it all to the global memory?
Yes, that's how it's usually done.
What does the allocation under "VM: Dispatch continuations" signify?
(http://i.stack.imgur.com/4kuqz.png)
#InkGolem is on the right lines. This is a cache for dispatch blocks inside GCD.
#AbhiBeckert is off by a factor of 1000. 16MB is 2 million 64-bit pointers, not 2 billion.
This cache is allocated on a per-thread basis, and you're just seeing the allocation size of this cache, not what's actually in use. 16 MB is well within range, if you're doing lots of dispatching onto background threads (and since you're using RAC, I'm guessing that you are).
Don't worry about it, basically.
From what I understand, Continuations are a style of function pointer passing so that the process knows what to execute next, in your case I'm assuming those would be dispatch blocks from GCD. I'm assuming that the VM has a bunch of these that it uses over time, and that's what you're seeing in instruments. Then again, I'm not an expert on threading and I could be totally off in left field.
I am implementing a spiking neural network using the CUDA library and am really unsure of how to proceed with regard to the following things:
Allocating memory (cudaMalloc) to many different arrays. Up until now, simply using cudaMalloc 'by hand' has sufficed, as I have not had to make more than 10 or so arrays. However, I now need to make pointers to, and allocate memory for thousands of arrays.
How to decide how much memory to allocate to each of those arrays. The arrays have a height of 3 (1 row for the postsynaptic neuron ids, 1 row for the number of the synapse on the postsynaptic neuron, and 1 row for the efficacy of that synapse), but they have an undetermined length which changes over time with the number of outgoing synapses.
I have heard that dynamic memory allocation in CUDA is very slow and so toyed with the idea of allocating the maximum memory required for each array, however the number of outgoing synapses per neuron varies from 100-10,000 and so I thought this was infeasible, since I have on the order of 1000 neurons.
If anyone could advise me on how to allocate memory to many arrays on the GPU, and/or how to code a fast dynamic memory allocation for the above tasks I would have more than greatly appreciative.
Thanks in advance!
If you really want to do this, you can call cudaMalloc as many times as you want; however, it's probably not a good idea. Instead, try to figure out how to lay out the memory so that neighboring threads in a block will access neighboring elements of RAM whenever possible.
The reason this is likely to be problematic is that threads execute in groups of 32 at a time (a warp). NVidia's memory controller is quite smart, so if neighboring threads ask for neighboring bytes of RAM, it coalesces those loads into a single request that can be efficiently executed. In contrast, if each thread in a warp is accessing a random memory location, the entire warp must wait till 32 memory requests are completed. Furthermore, reads and writes to the card's memory happen a whole cache line at a time, so if the threads don't use all the RAM that was read before it gets evicted from the cache, memory bandwidth is wasted. If you don't optimize for coherent memory access within thread blocks, expect a 10x to 100x slowdown.
(side note: The above discussion is still applicable with post-G80 cards; the first generation of CUDA hardware (G80) was even pickier. It also required aligned memory requests if the programmer wanted the coalescing behavior.)