Is a Metal kernel function atomic?

Can a context switch happen between the lines inside a kernel function?
Because I am setting some values before making changes, I want to be sure that if a value is set, the corresponding change has also been made.

The short answer is: yes, context switches will most certainly happen "between the lines". That's the whole point of context switches: if some lines in your shader (whether fragment, vertex, or kernel) need a resource that is not yet available (ALU, special function unit, texture units, memory), the GPU will most certainly switch contexts. This is called latency hiding, and it's very important for GPU performance, since without it GPU cores would spend most of their time stalling on the resources mentioned above. All of that means that Metal kernel functions are definitely not atomic.
As for the problem you have, if you want something to happen atomically, there are two main ways to do it in the Metal Shading Language:
You can use atomic types and functions from the metal_atomic header. These are a subset of the C++14 atomic header and include atomic store, load, exchange, compare-and-exchange, and fetch-and-modify functions (a minimal sketch follows below).
You can use SIMD-group and threadgroup barriers. A barrier makes every thread in a group wait until all threads in the group have reached it before any thread is allowed to continue. Depending on the flags passed to the barrier function, the barrier can also order memory accesses, or it can act purely as an execution barrier (which is probably not what you want if you write data out of your shader).
You can refer to the Metal Shading Language Specification for more information on these functions (check section 5.8.1 for "Threadgroup and SIMD-group Synchronization Functions" and section 5.13 for "Atomic Functions").
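For example, here is a minimal sketch of the first approach, using an atomic counter in device memory. The kernel name, the brightness threshold, and the buffer layout are illustrative assumptions, not part of the question:
#include <metal_stdlib>
using namespace metal;

// Illustrative kernel: counts how many pixels exceed a threshold.
// The host is expected to zero-initialize the counter buffer.
kernel void count_bright_pixels(texture2d<half> src [[ texture(0) ]],
                                device atomic_uint *counter [[ buffer(0) ]],
                                ushort2 gid [[ thread_position_in_grid ]])
{
    half4 clr = src.read(uint2(gid));
    if (clr.r > 0.5h)
    {
        // Safe even though many threads across many threadgroups
        // increment the same counter concurrently.
        atomic_fetch_add_explicit(counter, 1, memory_order_relaxed);
    }
}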
Usually, if your kernel function processes some data that you later need to reduce, you would do something like this (this is a very simple example):
kernel void
my_kernel(texture2d<half> src [[ texture(0) ]],
          texture2d<half, access::write> dst [[ texture(1) ]],
          threadgroup float *intermediate [[ threadgroup(0) ]],
          ushort2 lid [[ thread_position_in_threadgroup ]],
          ushort ti [[ thread_index_in_threadgroup ]],
          ushort2 gid [[ thread_position_in_grid ]])
{
    // Read data
    half4 clr = src.read(gid);
    // Do some work for each thread (placeholder computation)
    float intermediateResult = clr.r;
    intermediate[ti] = intermediateResult;
    // Make sure threadgroup memory writes are visible to other threads
    threadgroup_barrier(mem_flags::mem_threadgroup);
    // One thread in the whole threadgroup calculates some final result
    if (lid.x == 0 && lid.y == 0)
    {
        // Do some work per threadgroup (placeholder result)
        half4 finalResult = half4(intermediate[0]);
        dst.write(finalResult, gid);
    }
}
Here, all threads in the threadgroup read data from the src texture, do their work, store an intermediate result in threadgroup memory, and then one thread calculates and writes out the final result to the dst texture. threadgroup_barrier makes sure that other threads (including the thread with thread_position_in_threadgroup equal to (0, 0), which is going to compute the final result) can see the memory writes.

Related

Is there a way to bind assets once instead of with every command encoder?

I'm rendering with a vertex/fragment shader and a compute kernel.
Every frame I am binding large assets (such as a 450MB texture) in the usual way:
computeEncoder.setTexture(highResTexture, index: 0)
computeEncoder.setBuffer(largeBuffer, offset: 0, index: 0)
...
renderEncoder.setVertexTexture(highResTexture, index: 0)
renderEncoder.setVertexBuffer(largeBuffer, offset: 0, index: 0)
So that is close to 1GB in bandwidth for a single texture, and I have many more assets totaling a few hundred megs, so that is about 1.5GB that I bind for every frame.
Is there any way to bind textures/buffers to the GPU once so that they would then be available in the kernel and vertex functions without binding them every frame?
I could be wrong, but I thought something was introduced in one of the last couple of WWDCs, so I thought I would ask to make sure I'm not missing anything.
EDIT:
Simply binding a texture in the vertex function that I have already bound in the compute encoder does indeed show more texture bandwidth used, even though I am not using it during the capture.
GPU Read Bandwidth:
6.3920 GiB/s without binding
7.1919 GiB/s with binding
(GPU frame capture screenshots, without and with the texture bound, omitted.)
Also, if it works as you describe, why does using multiple command encoders warn about wasted bandwidth? If I use more than one emitter, each with a separate encoder, even though they bind identical resources, I get the performance warning about wasted bandwidth.
I think you are confused. Setting a texture to a command encoder doesn't consume bandwidth. Reading it or sampling it inside the shader does.
When you set a texture or any other buffer on an encoder, the driver just passes a small amount of metadata to the shader, likely through some internal buffer that's not visible to you as the API user. It doesn't "load" the texture anywhere. There's an exception for buffers that are marked as constant address space buffers in the shaders, because those may get pre-fetched by the GPU for better performance.
Another thing that happens is that the resource is made resident, meaning the GPU driver will map a range of addresses in the GPU's virtual memory table to point to the physical memory that stores the texture contents. This also does not consume memory or bandwidth, but it does consume available virtual address space. You might run out of virtual address space in some cases, but that's not a bandwidth issue.
Still, if you do have a lot of textures, you might actually be spending a lot of CPU time just encoding those setTexture commands. Instead, you can use argument buffers. If the hardware you are targeting supports argument buffers tier 2, you can put every texture in an argument buffer. This still requires calling useResource on all of those textures, because the driver needs to know that you are going to use them so it can make them resident, so you will still spend CPU time encoding those commands. To avoid that, you can allocate all the textures from one or more heaps and call useHeaps on those heaps (a rough sketch follows below). This makes the whole heap resident, and you won't need to call useResource on individual resources. There are a bunch of WWDC talks on this topic, the latest one being Explore bindless rendering in Metal.
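As a hedged sketch of the heap approach (the heap size, texture dimensions, and pixel format are illustrative, and textures in a private heap still need to be filled via a blit or render pass):
// Create a heap large enough to hold all the large assets (size is illustrative).
let heapDescriptor = MTLHeapDescriptor()
heapDescriptor.storageMode = .private
heapDescriptor.size = 512 * 1024 * 1024

let heap = device.makeHeap(descriptor: heapDescriptor)!

// Allocate textures from the heap instead of directly from the device.
let textureDescriptor = MTLTextureDescriptor.texture2DDescriptor(pixelFormat: .rgba8Unorm,
                                                                 width: 4096,
                                                                 height: 4096,
                                                                 mipmapped: false)
textureDescriptor.storageMode = .private
let highResTexture = heap.makeTexture(descriptor: textureDescriptor)!

// A single call per encoder makes every allocation in the heap resident.
computeEncoder.useHeap(heap)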
But again, to reiterate: nothing I mentioned here "wastes" bandwidth.
Update:
A very basic example of using argument buffers would look like this:
let argumentDescriptor = MTLArgumentDescriptor()
argumentDescriptor.index = 0
argumentDescriptor.dataType = .texture
argumentDescriptor.textureType = .type2D

// Argument encoders are created from the device (or via reflection from a function).
let argumentEncoder = device.makeArgumentEncoder(arguments: [argumentDescriptor])!
let argumentBuffer = device.makeBuffer(length: argumentEncoder.encodedLength, options: [.storageModeShared])!

argumentEncoder.setArgumentBuffer(argumentBuffer, offset: 0)
argumentEncoder.setTexture(someTexture, index: 0)

commandEncoder.setBuffer(argumentBuffer, offset: 0, index: 0)
commandEncoder.useResource(someTexture, usage: .read)
And in the shader you would write a struct like this:
struct MyTexture
{
    texture2d<float> texture [[ id(0) ]];
};
and then bind it like
device MyTexture& myTexture [[ buffer(0) ]]
and use it like any other struct (a sketch of a kernel that consumes it follows below). This is a very basic example, and you can actually use reflection to create those MTLArgumentEncoders for you from functions and binding indices.
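For completeness, here is a hedged sketch of a kernel that consumes such an argument buffer. The kernel name and the destination texture bound at texture(0) are illustrative assumptions:
#include <metal_stdlib>
using namespace metal;

struct MyTexture
{
    texture2d<float> texture [[ id(0) ]];
};

kernel void copy_from_argument_buffer(device MyTexture &myTexture [[ buffer(0) ]],
                                      texture2d<float, access::write> dst [[ texture(0) ]],
                                      ushort2 gid [[ thread_position_in_grid ]])
{
    // The texture behaves as if it had been bound directly,
    // as long as useResource (or useHeap) was called for it.
    float4 value = myTexture.texture.read(uint2(gid));
    dst.write(value, uint2(gid));
}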

Is it safe for an OpenCL kernel to randomly write to a __global buffer?

I want to run an instrumented OpenCL kernel to get some execution metrics. More specifically, I have added a hidden global buffer which will be initialized from the host code with N zeros. Each of the N values is an integer representing a different metric, which each kernel instance will increment in a different way depending on its execution path.
A simplistic example:
__kernel void test(__global int *a, __global int *hiddenCounter) {
    if (get_global_id(0) == 0) {
        // do stuff and then increment the appropriate counter (random numbers here)
        hiddenCounter[0] += 3;
    }
    else {
        // do stuff...
        hiddenCounter[1] += 5;
    }
}
After the kernel execution is complete, I need the host code to aggregate (a simple element-wise vector addition) all the hiddenCounter buffers and print the appropriate results.
My question is whether there are race conditions when multiple kernel instances try to write to the same index of the hiddenCounter buffer (which will definitely happen in my project). Do I need to enforce some kind of synchronization? Or is this impossible with __global arguments and I need to change it to __private? Will I be able to aggregate __private buffers from the host code afterwards?
My question is whether there are race conditions when multiple kernel instances try to write to the same index of the hiddenCounter buffer
The answer to this is emphatically yes, your code will be vulnerable to race conditions as currently written.
Do I need to enforce some kind of synchronization?
Yes, you can use global atomics for this purpose. All but the most ancient GPUs will support this (anything supporting OpenCL 1.2, or the cl_khr_global_int32_base_atomics and similar extensions).
Note that this will have a non-trivial performance overhead. Depending on your access patterns and frequency, collecting intermediate results in private or local memory and writing them out to global memory at the end of the kernel may be faster. (In the local case, the whole work group would share just one global atomic call for each updated cell; you'll still need local atomics or a reduction algorithm to accumulate the values from individual work items across the group.)
Another option is to use a much larger global memory buffer, with counters for each work item or group. In that case, you will not need atomics to write to them, but you will subsequently need to combine the values on the host. This uses much more memory, obviously, and likely more memory bandwidth too - modern GPUs should cache accesses to your small hiddenCounter buffer. So you'll need to work out/try which is the lesser evil in your case.
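As a minimal sketch of the global atomics approach, keeping the placeholder increments from the question (this assumes OpenCL 1.2 or the relevant atomics extension is available):
__kernel void test(__global int *a, __global int *hiddenCounter) {
    if (get_global_id(0) == 0) {
        // do stuff, then bump the appropriate counter atomically
        atomic_add(&hiddenCounter[0], 3);
    }
    else {
        // do stuff...
        atomic_add(&hiddenCounter[1], 5);
    }
}
As in the original plan, hiddenCounter still needs to be zero-initialized from the host before the kernel runs.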

Running compute kernel on portion of MTLBuffer?

I am populating an MTLBuffer with float2 vectors. The buffer is being created and populated like this:
struct Particle {
    var position: float2
    ...
}

let particleCount = 100000
let bufferSize = MemoryLayout<Particle>.stride * particleCount
particleBuffer = device.makeBuffer(length: bufferSize)!

var pointer = particleBuffer.contents().bindMemory(to: Particle.self, capacity: particleCount)
pointer = pointer.advanced(by: currentParticles)
pointer.pointee.position = [x, y]
In my Metal file the buffer is being accessed like this:
struct Particle {
    float2 position;
    ...
};

kernel void compute(device Particle *particles [[buffer(0)]],
                    uint id [[thread_position_in_grid]] … )
I need to be able to run the compute kernel over a given range of the MTLBuffer. For example, is it possible to run it on, say, the values starting at index 50,000 and ending at index 75,000?
It seems like the offset parameter allows me to specify the start position, but it does not have a length parameter.
I see there is this call:
setBuffers(_:offsets:range:)
Does the range specify what portion of the buffer to run over? It seems like the range specifies which buffer binding slots are set, not the range of values to use.
A compute shader doesn't run "on" a buffer (or portion). It runs on a grid, which is an abstract concept. As far as Metal is concerned, the grid isn't related to a buffer or anything else.
A buffer may be an input that your compute shader uses, but how it uses it is up to you. Metal doesn't know or care.
Here's my answer to a similar question.
So, the dispatch command you encode using a compute command encoder governs how many times your shader is invoked. It also dictates what thread_position_in_grid (and related) values each invocation receives. If your shader correlates each grid position to an element of an array backed by a buffer, then the number of threads you specify in your dispatch command governs how much of the buffer you end up accessing. (Again, this is not something Metal dictates; it's implicit in the way you code your shader.)
Now, to start at the 50,000th element, using an offset on the buffer binding to make that the effective start of the buffer from the point of view of the shader is a good approach. But it would also work to just add 50,000 to the index the shader computes when it accesses the buffer. And, if you only want to process 25,000 elements (75,000 minus 50,000), then just dispatch 25,000 threads.
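A hedged host-side sketch of that approach follows. The encoder and pipeline-state names echo the question's style, but the exact offset arithmetic and the use of dispatchThreads (which requires non-uniform threadgroup support) are assumptions:
let startIndex = 50_000
let count = 25_000

// Make element 50,000 appear as element 0 from the shader's point of view.
// Note: some devices impose alignment requirements on buffer offsets.
computeEncoder.setBuffer(particleBuffer,
                         offset: startIndex * MemoryLayout<Particle>.stride,
                         index: 0)

// Dispatch exactly 25,000 threads, one per particle in the range.
let threadsPerGrid = MTLSize(width: count, height: 1, depth: 1)
let threadsPerThreadgroup = MTLSize(width: min(pipelineState.threadExecutionWidth, count),
                                    height: 1, depth: 1)
computeEncoder.dispatchThreads(threadsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)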

Memory write performance - GPU CPU Shared Memory

I'm allocating both input and output MTLBuffer using posix_memalign according to the shared GPU/CPU documentation provided by memkite.
Aside: it is easier to just use the latest API than to muck around with posix_memalign:
let metalBuffer = self.metalDevice.newBufferWithLength(byteCount, options: .StorageModeShared)
My kernel function operates on roughly 16 million complex value structs and writes out an equal number of complex value structs to memory.
I've performed some experiments and my Metal kernel's 'complex math section' executes in 0.003 seconds (yes!), but writing the result to the buffer takes >0.05 seconds (no!). In my experiments, I commented out the math part and just assigned zero to memory, and it took 0.05 seconds; commenting out the assignment and adding the math back took 0.003 seconds.
Is the shared memory slow in this case, or is there some other tip or trick I might try?
Additional detail
Test platforms
iPhone 6S - ~0.039 seconds per frame
iPad Air 2 - ~0.130 seconds per frame
The streaming data
Each update to the shader receives approximately 50000 complex numbers in the form of a pair of float types in a struct.
struct ComplexNumber {
    float real;
    float imaginary;
};
Kernel signature
kernel void processChannelData(const device Parameters *parameters [[ buffer(0) ]],
                               const device ComplexNumber *inputSampleData [[ buffer(1) ]],
                               const device ComplexNumber *partAs [[ buffer(2) ]],
                               const device float *partBs [[ buffer(3) ]],
                               const device int *lookups [[ buffer(4) ]],
                               device float *outputImageData [[ buffer(5) ]],
                               uint threadIdentifier [[ thread_position_in_grid ]]);
All the buffers contain - currently - unchanging data except inputSampleData which receives the 50000 samples I'll be operating on. The other buffers contain roughly 16 million values (128 channels x 130000 pixels) each. I perform some operations on each 'pixel' and sum the complex result across channels and finally take the absolute value of the complex number and assign the resulting float to outputImageData.
Dispatch
commandEncoder.setComputePipelineState(pipelineState)
commandEncoder.setBuffer(parametersMetalBuffer, offset: 0, atIndex: 0)
commandEncoder.setBuffer(inputSampleDataMetalBuffer, offset: 0, atIndex: 1)
commandEncoder.setBuffer(partAsMetalBuffer, offset: 0, atIndex: 2)
commandEncoder.setBuffer(partBsMetalBuffer, offset: 0, atIndex: 3)
commandEncoder.setBuffer(lookupsMetalBuffer, offset: 0, atIndex: 4)
commandEncoder.setBuffer(outputImageDataMetalBuffer, offset: 0, atIndex: 5)
let threadExecutionWidth = pipelineState.threadExecutionWidth
let threadsPerThreadgroup = MTLSize(width: threadExecutionWidth, height: 1, depth: 1)
let threadGroups = MTLSize(width: self.numberOfPixels / threadsPerThreadgroup.width, height: 1, depth:1)
commandEncoder.dispatchThreadgroups(threadGroups, threadsPerThreadgroup: threadsPerThreadgroup)
commandEncoder.endEncoding()
metalCommandBuffer.commit()
metalCommandBuffer.waitUntilCompleted()
GitHub example
I've written an example called Slow and put it up on GitHub. It seems the bottleneck is writing the values into the input buffer. So, I guess the question becomes: how do I avoid the bottleneck?
Memory copy
I wrote up a quick test to compare the performance of various byte copying methods.
Current Status
I've reduced execution time to around 0.02 seconds, which doesn't sound like a lot, but it makes a big difference in the number of frames per second. Currently the biggest improvements are a result of switching to cblas_scopy().
Reduce the size of the type
Originally, I was pre-converting signed 16-bit integers to 32-bit floats, since ultimately that is how they'll be used. This is a case where performance starts forcing you to store the values as 16 bits to cut your data size in half.
Objective-C over Swift
For the code dealing with movement of data, you might choose Objective-C over Swift (a Warren Moore recommendation). The performance of Swift in these special situations still isn't up to scratch. You can also try calling out to memcpy or similar methods. I've seen a couple of examples that used for loops over buffer pointers, and in my experiments these performed slowly.
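As a hedged sketch, a straight memcpy into the shared buffer's contents might look like this (the samples array is illustrative; ComplexNumber mirrors the struct above, and inputSampleDataMetalBuffer is the shared buffer from the dispatch code):
struct ComplexNumber {
    var real: Float
    var imaginary: Float
}

// samples: [ComplexNumber] prepared on the CPU each frame
let byteCount = samples.count * MemoryLayout<ComplexNumber>.stride
samples.withUnsafeBytes { source in
    memcpy(inputSampleDataMetalBuffer.contents(), source.baseAddress!, byteCount)
}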
Difficulty of testing
I really wanted to do some of the experiments with the various copying methods in a playground on the machine, and unfortunately this was useless. The iOS device versions of the same experiments performed completely differently. One might think that the relative performance would be similar, but I found this to be an invalid assumption as well. It would be really convenient if you could have a playground that used the iOS device as the interpreter.
You might get a large speedup via encoding your data to huffman codes and decoding on the GPU, see MetalHuffman. It depends on your data though.

Dynamic output from compute shader

If I am generating 0-12 triangles in a compute shader, is there a way I can stream them to a buffer that will then be used for rendering to screen?
My current strategy is:
create a buffer of float3 of size threads * 12, so it can store the maximum possible number of triangles;
write to the buffer using an index that depends on the thread position in the grid, so there are no race conditions.
If I want to render from this, though, I would need to skip the empty memory. It sounds ugly, but there is probably no other way currently. I know geometry shaders in other APIs can have variable-length output, but I wonder if/how games on iOS generate variable-length data on the GPU.
UPDATE 1:
As soon as I wrote the question, I thought about the possibility of using a second buffer that would indicate how many triangles are available for each block. The vertex shader would then process all vertices of all triangles of that block.
This does not solve the problem of the unused memory, though, and since I have a large number of threads, the total memory wasted would be considerable.
What you're looking for is the Metal equivalent of D3D's "AppendStructuredBuffer". You want a type that can have structures added to it atomically.
I'm not familiar with Metal, but it does support atomic operations such as add, which is all you really need to roll your own append buffer. Initialise the counter to 0, have each thread atomically add 1 to the counter, and use the original (pre-increment) value as the index to write to in your buffer.
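A hedged Metal Shading Language sketch of that pattern follows. The Triangle layout, buffer indices, and the per-thread triangle count are illustrative, and the host must zero the counter buffer before the dispatch:
#include <metal_stdlib>
using namespace metal;

struct Triangle {
    packed_float3 v0, v1, v2;
};

kernel void emit_triangles(device Triangle *outTriangles [[ buffer(0) ]],
                           device atomic_uint *counter [[ buffer(1) ]],
                           uint gid [[ thread_position_in_grid ]])
{
    // Illustrative: each thread decides how many triangles (0-12) it produces.
    uint produced = gid % 13;
    for (uint i = 0; i < produced; ++i)
    {
        Triangle t;
        t.v0 = float3(0.0);
        t.v1 = float3(1.0, 0.0, 0.0);
        t.v2 = float3(0.0, 1.0, 0.0);
        // fetch_add returns the value before the addition, which becomes
        // this triangle's slot in the densely packed output buffer.
        uint slot = atomic_fetch_add_explicit(counter, 1, memory_order_relaxed);
        outTriangles[slot] = t;
    }
}
The counter's final value then tells the host (or an indirect draw) how many triangles were actually written, so no empty memory needs to be skipped.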
