Compute shader: read data written in one thread from another? - directx

Can somebody tell me whether the following compute shader is possible with DirectX 11?
I want the first thread in a Dispatch that accesses an element in a buffer (g_positionsGrid) to set (compare-exchange) that element to a temporary value to signify that it is taking some action.
In this case the temp value is 0xffffffff, and the first thread then continues on to allocate a value from a structured append buffer (g_positions) and assign it to that element.
All fine so far, but the other threads in the dispatch can come in between the compare-exchange and the allocation by the first thread, and so need to wait until the allocation index is available. I do this with a busy-wait, i.e. the while loop.
Sadly, however, this just locks up the GPU; I'm assuming the value written by the first thread is not propagated to the other threads stuck in the while loop.
Is there any way to get those threads to see that value?
Thanks for any help!
struct Position
{
    float3 m_position;
};

RWStructuredBuffer<Position> g_positions : register(u1);
RWBuffer<uint> g_positionsGrid : register(u2);

void AddPosition( uint address, float3 pos )
{
    uint token = 0;
    // Assign a temp value to signify first thread has accessed this particular element
    InterlockedCompareExchange(g_positionsGrid[address], 0, 0xffffffff, token);
    if(token == 0)
    {
        // If first thread in here, allocate index and assign value which
        // hopefully the other threads will pick up
        uint index = g_positions.IncrementCounter();
        g_positionsGrid[address] = index;
        g_positions[index].m_position = pos;
    }
    else
    {
        if(token == 0xffffffff)
        {
            uint index = g_positionsGrid[address];
            // This never meets its condition
            [allow_uav_condition]
            while(index == 0xffffffff)
            {
                // For some reason this thread never gets the assignment
                // from the first thread above
                index = g_positionsGrid[address];
            }
            g_positions[index].m_position = pos;
        }
        else
        {
            // Just assign value as the first thread has already allocated a valid slot
            g_positions[token].m_position = pos;
        }
    }
}

Thread sync in DirectCompute is easy, but compared to the equivalent CPU threading features it is very inflexible. AFAIK, the only way to sync data between threads in a compute shader is to use groupshared memory and GroupMemoryBarrierWithGroupSync(). That means that you can:
create small temporary buffer in groupshared memory
calculate value
write to groupshared buffer
synchronize threads with GroupMemoryBarrierWithGroupSync()
read the value another thread wrote from the groupshared buffer and use it somehow
To implement all this you need proper array indices. But where do you get them? In DirectCompute they derive from the values passed to Dispatch() and from the system values you can read in the shader (SV_GroupIndex, SV_DispatchThreadID, SV_GroupThreadID, SV_GroupID). Using those values you can calculate indices to access your buffers.
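As a hedged illustration of those five steps (the buffer names, group size, and the doubling computation are all made up):
#define GROUP_SIZE 64

Buffer<float>   g_input  : register(t0);
RWBuffer<float> g_output : register(u0);

// Small temporary buffer in groupshared memory (step 1).
groupshared float s_values[GROUP_SIZE];

[numthreads(GROUP_SIZE, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID, uint gi : SV_GroupIndex)
{
    // Calculate a value and write it to the groupshared buffer (steps 2-3).
    s_values[gi] = g_input[dtid.x] * 2.0f;

    // Wait until every thread in the group has written its slot (step 4).
    GroupMemoryBarrierWithGroupSync();

    // Read a value written by another thread and use it (step 5).
    g_output[dtid.x] = s_values[(gi + 1) % GROUP_SIZE];
}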
Compute shaders are not well documented, and there is no easy way in, but to find out more you can at least:
read the MSDN Compute Shader Overview
watch the DirectCompute Lecture Series videos on Channel 9
examine the compute shader samples from the DirectX SDK and the very nice samples from NVIDIA's SDK (10 and 11)
read this advanced NVIDIA paper where they implemented thread reduction and then optimized their code to run 10 times faster ;)
As for your code: you can probably redesign it a little.
It is always good to have all threads doing the same task: symmetric loading. You cannot assign different tasks to your threads the way you do in CPU code.
If your data first needs some preprocessing, followed by further processing, you may want to divide the work into different Dispatch() calls (different shaders) that you call in sequence:
preprocessShader reads from buffer inputData and writes to preprocessedData
calculateShader reads from preprocessedData and writes to finalData
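A hedged host-side sketch of that sequence in C++/D3D11 (the shader, view, and count variables are placeholders):
// Pass 1: preprocessShader reads inputData (SRV) and writes preprocessedData (UAV).
context->CSSetShader(preprocessShader, nullptr, 0);
context->CSSetShaderResources(0, 1, &inputDataSRV);
context->CSSetUnorderedAccessViews(0, 1, &preprocessedDataUAV, nullptr);
context->Dispatch(groupCountX, 1, 1);

// Unbind the UAV so the same resource can be bound as an SRV in pass 2.
ID3D11UnorderedAccessView* nullUAV = nullptr;
context->CSSetUnorderedAccessViews(0, 1, &nullUAV, nullptr);

// Pass 2: calculateShader reads preprocessedData (SRV) and writes finalData (UAV).
context->CSSetShader(calculateShader, nullptr, 0);
context->CSSetShaderResources(0, 1, &preprocessedDataSRV);
context->CSSetUnorderedAccessViews(0, 1, &finalDataUAV, nullptr);
context->Dispatch(groupCountX, 1, 1);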
In this case you can drop any slow thread sync and slow groupshared memory.
Also look at the "thread reduction" trick mentioned above.
Hope it helps! And happy coding!

Related

Is it safe for an OpenCL kernel to randomly write to a __global buffer?

I want to run an instrumented OpenCL kernel to get some execution metrics. More specifically, I have added a hidden global buffer which will be initialized from the host code with N zeros. Each of the N values is an integer representing a different metric, which each kernel instance will increment in a different manner, depending on its execution path.
A simplistic example:
__kernel void test(__global int *a, __global int *hiddenCounter) {
    if (get_global_id(0) == 0) {
        // do stuff and then increment the appropriate counter (random numbers here)
        hiddenCounter[0] += 3;
    }
    else {
        // do stuff...
        hiddenCounter[1] += 5;
    }
}
After the kernel execution is complete, I need the host code to aggregate (a simple element-wise vector addition) all the hiddenCounter buffers and print the appropriate results.
My question is whether there are race conditions when multiple kernel instances try to write to the same index of the hiddenCounter buffer (which will definitely happen in my project). Do I need to enforce some kind of synchronization? Or is this impossible with __global arguments and I need to change it to __private? Will I be able to aggregate __private buffers from the host code afterwards?
My question is whether there are race conditions when multiple kernel instances try to write to the same index of the hiddenCounter buffer
The answer to this is emphatically yes, your code will be vulnerable to race conditions as currently written.
Do I need to enforce some kind of synchronization?
Yes, you can use global atomics for this purpose. All but the most ancient GPUs support them (anything supporting OpenCL 1.2, or the cl_khr_global_int32_base_atomics and similar extensions).
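For instance, the kernel from the question could be made race-free like this (a sketch assuming OpenCL 1.1+, where atomic_add on 32-bit integers is a core built-in):
__kernel void test(__global int *a, __global int *hiddenCounter) {
    if (get_global_id(0) == 0) {
        // do stuff, then bump the counter with an atomic read-modify-write
        atomic_add(&hiddenCounter[0], 3);
    }
    else {
        atomic_add(&hiddenCounter[1], 5);
    }
}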
Note that this will have a non-trivial performance overhead. Depending on your access patterns and frequency, collecting intermediate results in private or local memory and writing them out to global memory at the end of the kernel may be faster. (In the local case, the whole work group would share just one global atomic call for each updated cell - you'll need to use local atomics or a reduction algorithm to accumulate the values from individual work items across the group though.)
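A sketch of that local-memory variant under the same assumptions (NUM_METRICS is a placeholder for the number of counters):
#define NUM_METRICS 2

__kernel void test(__global int *a, __global int *hiddenCounter) {
    __local int localCounter[NUM_METRICS];

    // One work item zeroes the local counters for the whole group.
    if (get_local_id(0) == 0) {
        for (int i = 0; i < NUM_METRICS; i++)
            localCounter[i] = 0;
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    // Work items accumulate into cheap local atomics.
    if (get_global_id(0) == 0) {
        atomic_add(&localCounter[0], 3);
    }
    else {
        atomic_add(&localCounter[1], 5);
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    // One global atomic per counter per work group.
    if (get_local_id(0) == 0) {
        for (int i = 0; i < NUM_METRICS; i++)
            atomic_add(&hiddenCounter[i], localCounter[i]);
    }
}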
Another option is to use a much larger global memory buffer, with counters for each work item or group. In that case, you will not need atomics to write to them, but you will subsequently need to combine the values on the host. This uses much more memory, obviously, and likely more memory bandwidth too - modern GPUs should cache accesses to your small hiddenCounter buffer. So you'll need to work out/try which is the lesser evil in your case.
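And a sketch of the atomics-free layout, where each work item owns its own block of counters and the host sums across work items afterwards (NUM_METRICS again a placeholder):
#define NUM_METRICS 2

__kernel void test(__global int *a, __global int *counters) {
    // This slice of the buffer belongs to this work item only,
    // so a plain read-modify-write is race-free here.
    __global int *mine = counters + get_global_id(0) * NUM_METRICS;
    if (get_global_id(0) == 0) {
        mine[0] += 3;
    }
    else {
        mine[1] += 5;
    }
}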

MTLBuffer allocation + CPU/GPU synchronisation

I am using a Metal Performance Shaders kernel (MPSImageHistogram) to compute something in an MTLBuffer that I grab, perform computations on, and then display via MTKView. The MTLBuffer output from the shader is small (~4K bytes). So I am allocating a new MTLBuffer object for every render pass, and there are at least 30 renders per second for every video frame.
calculation = MPSImageHistogram(device: device, histogramInfo: &histogramInfo)
let bufferLength = calculation.histogramSize(forSourceFormat: MTLPixelFormat.bgra8Unorm)
let buffer = device.makeBuffer(length: bufferLength, options: .storageModeShared)

let commandBuffer = commandQueue?.makeCommandBuffer()
calculation.encode(to: commandBuffer!, sourceTexture: metalTexture!, histogram: buffer!, histogramOffset: 0)
// Completed handlers must be registered before commit().
commandBuffer?.addCompletedHandler({ (cmdBuffer) in
    let dataPtr = buffer!.contents().assumingMemoryBound(to: UInt32.self)
    ...
})
commandBuffer?.commit()
My questions -
Is it okay to make a new buffer every time using device.makeBuffer(..), or is it better to allocate a few buffers up front and reuse them? If reuse is better, what do we do to synchronize CPU/GPU writes and reads on these buffers?
Another, unrelated question: is it okay to draw the results in MTKView on a non-main thread? Or must MTKView draws happen only on the main thread (even though I read Metal is truly multithreaded)?
Allocations are somewhat expensive, so I'd recommend a reusable buffer scheme. My preferred way to do this is to keep a mutable array (queue) of buffers, enqueuing a buffer when the command buffer that used it completes (or in your case, after you've read back the results on the CPU), and allocating a new buffer when the queue is empty and you need to encode more work. In the steady state, you'll find that this scheme will rarely allocate more than 2-3 buffers total, assuming your frames are completing in a timely fashion. If you need this scheme to be thread-safe, you can protect access to the queue with a mutex (implemented with a dispatch_semaphore).
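A minimal sketch of such a pool (the class, property, and method names are all invented; the semaphore serves as the mutex mentioned above):
import Metal
import Dispatch

final class BufferPool {
    private var available: [MTLBuffer] = []
    private let lock = DispatchSemaphore(value: 1)   // used as a mutex
    private let device: MTLDevice
    private let length: Int

    init(device: MTLDevice, length: Int) {
        self.device = device
        self.length = length
    }

    // Reuse an idle buffer if one exists, otherwise allocate a new one.
    func dequeue() -> MTLBuffer {
        lock.wait(); defer { lock.signal() }
        return available.popLast()
            ?? device.makeBuffer(length: length, options: .storageModeShared)!
    }

    // Return a buffer to the pool once the CPU has read its results.
    func enqueue(_ buffer: MTLBuffer) {
        lock.wait(); defer { lock.signal() }
        available.append(buffer)
    }
}
In your case you would call enqueue from the command buffer's completed handler, after reading back the histogram results.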
You can use another thread to encode rendering work that draws into a drawable vended by an MTKView, as long as you follow standard multithreading precautions. Remember that while command queues are thread-safe (in the sense that you can create and encode to multiple command buffers from the same queue concurrently), command buffers themselves and encoders are not. I'd advise you to profile the single-threaded case and only introduce the complication of multi-threading if/when absolutely necessary.
If it is a small amount of data (under 4K) you can use setBytes(): https://developer.apple.com/documentation/metal/mtlcomputecommandencoder/1443159-setbytes
That might be faster/better than allocating a new buffer every frame. You could also use a triple-buffered approach so that successive frames' accesses to the buffer do not interfere. https://developer.apple.com/library/content/documentation/3DDrawing/Conceptual/MTLBestPracticesGuide/TripleBuffering.html
This tutorial shows how to set up triple buffering for rendering: https://www.raywenderlich.com/146418/metal-tutorial-swift-3-part-3-adding-texture
That's the third part of the tutorial, but it's the part that shows the triple-buffering setup, under "Reusing Uniform Buffers".

Dynamic output from compute shader

If I am generating 0-12 triangles in a compute shader, is there a way I can stream them to a buffer that will then be used for rendering to screen?
My current strategy is:
create a buffer of float3 of size threads * 12, so it can store the maximum possible number of triangles;
write to the buffer using an index that depends on the thread position in the grid, so there are no race conditions.
If I want to render from this, though, I would need to skip the empty memory. It sounds ugly, but there is probably no other way currently. I know geometry shaders can have variable-length output, but I wonder if/how games on iOS can generate variable-length data on the GPU.
UPDATE 1:
As soon as I wrote the question, I thought about the possibility of using a second buffer that records how many triangles are available for each block. The vertex shader would then process all vertices of all triangles of that block.
This does not solve the problem of the unused memory, though, and as I have a large number of threads the total memory wasted would be considerable.
What you're looking for is the Metal equivalent of D3D's "AppendStructuredBuffer". You want a type that can have structures added to it atomically.
I'm not familiar with Metal, but it does support atomic operations such as add, which is all you really need to roll your own append buffer. Initialise the counter to 0, have each thread atomically add 1 to the counter, and use the original value as the index to write to in your buffer.
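A hedged sketch of that idea in the Metal Shading Language (the Triangle struct, kernel name, and buffer slots are all made up):
#include <metal_stdlib>
using namespace metal;

struct Triangle { float3 v0, v1, v2; };

kernel void emitTriangles(device Triangle *out        [[buffer(0)]],
                          device atomic_uint *counter [[buffer(1)]],
                          uint tid [[thread_position_in_grid]])
{
    // Pretend this thread decided to emit one (degenerate) triangle.
    Triangle t;
    t.v0 = t.v1 = t.v2 = float3(float(tid));

    // atomic_fetch_add_explicit returns the value *before* the increment,
    // which becomes this thread's unique write index.
    uint slot = atomic_fetch_add_explicit(counter, 1u, memory_order_relaxed);
    out[slot] = t;
}
The host initialises the counter buffer to 0 before dispatch and reads it back afterwards to learn how many triangles were actually written.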

Copy to shared memory in CUDA

In CUDA programming, if we want to use shared memory, we need to bring the data from global memory to shared memory. Threads are used for transferring such data.
I read somewhere (in online resources) that it is better not to involve all the threads in the block in copying data from global memory to shared memory. The idea makes sense in that not all the threads execute together: threads in a warp execute together, but the warps are not executed sequentially. Say a block of 96 threads is divided into 3 warps: warp 0 (threads 0-31), warp 1 (threads 32-63), and warp 2 (threads 64-95). It is not guaranteed that warp 0 will be executed first (am I right?).
So which threads should I use to copy the data from global to shared memory?
To use a single warp to load a shared memory array, just do something like this:
__global__
void kernel(float *in_data)
{
    __shared__ float buffer[1024];

    // Only the first warp performs the load from global memory.
    if (threadIdx.x < warpSize) {
        for (int i = threadIdx.x; i < 1024; i += warpSize) {
            buffer[i] = in_data[i];
        }
    }
    __syncthreads();

    // rest of kernel follows
}
[disclaimer: written in browser, never tested, use at own risk]
The key point here is the use of __syncthreads() to ensure that all threads in the block wait until the warp performing the load to shared memory has finished. The code I posted uses the first warp, but you can select any warp by computing a warp number: divide the thread index within the block by warpSize. I also assumed a one-dimensional block; it is trivial to compute the thread index in a 2D or 3D block, so I leave that as an exercise for the reader.
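For instance, a hedged variation of the body of the kernel above that lets an arbitrary warp do the load (LOAD_WARP is a made-up constant):
int warp = threadIdx.x / warpSize;   // warp index within the block
int lane = threadIdx.x % warpSize;   // lane index within the warp
if (warp == LOAD_WARP) {
    for (int i = lane; i < 1024; i += warpSize)
        buffer[i] = in_data[i];
}
__syncthreads();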
Once a block is assigned to a multiprocessor, it resides there until all threads within that block are finished, and during this time the warp scheduler switches among the warps that have ready operands. So if there is one block on the multiprocessor with three warps, and just one warp is fetching data from global to shared memory while the other two stay idle, probably waiting on the __syncthreads() barrier, you lose nothing: you are limited only by the latency of global memory, which you would have been anyway. As soon as the fetch is finished, the warps can go ahead with their work.
Therefore no guarantee that warp 0 executes first is needed, and you can use any threads. The only two things to keep in mind are to ensure as much coalesced access to global memory as possible and to avoid bank conflicts.

Pthreads in MPI

Hi, I have written an MPI quicksort program which works like this:
In my cluster the 'Master' will divide the integer data and send it to the 'Slave' nodes. Upon receiving the data, each slave will perform its own sorting operation and send the sorted data back to the Master.
Now my problem is I'm interested in introducing hyper-threading for the slaves.
I have data coming from the master: sub (which denotes the array) and count (the size of the array). Now I have initialized the pthreads with num_pthreads = 12:
pthread_attr_init(&attr);
pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
for (i = 0; i < num_pthreads; i++) {
    if (pthread_create(&thread[i], &attr, new_thread, (void *) &sub[i]))
    {
        printf("error creating a new thread \n");
        exit(1);
    }
    else
    {
        printf(" threading is successful %d at node %d \n \t ", i, rank);
    }
}
and in the new thread function:
void * new_thread(int *sub)
{
    quick_sort(sub, 0, count - 1);
    return(0);
}
I don't understand whether my way is correct or not. Can anyone help me with this problem?
Your basic idea is correct, except you also need to determine how you're going to get results back from the threads.
Normally you would want to pass all the information the thread needs to complete its task through the *arg argument of pthread_create. In your new_thread() function, the count variable is not passed in to the function and so is global across all threads. A better approach is to pass a pointer to a struct through pthread_create:
typedef struct {
    int *sub;   /* Pointer to first element in array to sort, different for each thread. */
    int count;  /* Number of elements in sub. */
    int done;   /* Flag to indicate that sort is complete. */
} thread_params_t;
void *new_thread(void *arg)
{
    /* pthread start routines must take and return void *. */
    thread_params_t *params = arg;
    quick_sort(params->sub, 0, params->count - 1);
    params->done = 1;
    return 0;
}
You would fill in a new thread_params_t for each thread that was spawned.
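A hedged sketch of the spawn/join sequence with that struct (the equal-chunk split is an assumption; num_pthreads, sub, and count come from the question):
/* Split sub[0..count) into one chunk per thread; the last thread
   takes any remainder. */
thread_params_t params[num_pthreads];
pthread_t threads[num_pthreads];
int chunk = count / num_pthreads;

for (int i = 0; i < num_pthreads; i++) {
    params[i].sub   = sub + i * chunk;
    params[i].count = (i == num_pthreads - 1) ? count - i * chunk : chunk;
    params[i].done  = 0;
    pthread_create(&threads[i], NULL, new_thread, &params[i]);
}

/* Join all workers before using the sorted chunks. */
for (int i = 0; i < num_pthreads; i++)
    pthread_join(threads[i], NULL);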
The other item that has to be managed is the sort results. It would be normal for the main thread to do a pthread_join on each thread, which ensures it has completed before continuing. Depending on your requirements you could either have each thread send its results back to the master directly, or have the main function collect the results from each thread and send them back outside the worker threads.
You can use OpenMP instead of pthreads (just for the record, combining MPI with threading is called hybrid programming). It is a lightweight set of compiler pragmas that turn sequential code into parallel code. Support for OpenMP is available in virtually all modern compilers. With OpenMP you introduce so-called parallel regions. A team of threads is created at the beginning of the parallel region, then the code continues to execute concurrently until the end of the parallel region, where all threads are joined and only the main thread continues execution (thread creation and joining is logical, i.e. it doesn't have to be implemented like this in real life, and most implementations actually use thread pools to speed up thread creation):
#pragma omp parallel
{
    // This code gets executed in parallel
    ...
}
You can use omp_get_thread_num() inside the parallel block to get the ID of the thread and make it compute something different. Or you can use one of the worksharing constructs of OpenMP like for, sections, etc. to make it divide the work automatically.
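For example, a minimal worksharing loop (n, data, and process() are placeholders):
// The 'for' construct splits the loop iterations across the team of threads.
#pragma omp parallel for
for (int i = 0; i < n; i++) {
    process(data[i]);
}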
The biggest advantage of OpenMP is that it doesn't introduce deep changes to the source code and it abstracts thread creation/joining away, so you don't have to do it manually. Most of the time you can get away with just a few pragmas. Then you have to enable OpenMP during compilation (with -fopenmp for GCC, -openmp for Intel compilers, -xopenmp for Sun/Oracle compilers, etc.). If you do not enable OpenMP, or the particular compiler doesn't support it, you'll get a serial program.
You can find a quick but comprehensive OpenMP tutorial at LLNL.
