Performance of multiple Metal kernel function calls - iOS

I'm building a rigid body simulation for iPhone/iPad using Apple Metal. To do this, I need to make many kernel function calls, and I see that this takes a long time compared to CUDA, for example.
I implemented the Metal kernel function calls as described in Apple's tutorial:
let commandQueue = device.newCommandQueue()
var commandBuffers:[MTLCommandBuffer]=[]
var gpuPrograms:[MTLFunction]=[]
var computePipelineFilters:[MTLComputePipelineState]=[]
var computeCommandEncoders:[MTLComputeCommandEncoder]=[]
// here I fill all the arrays for my command queue
// and then I execute it
let threadsPerGroup = MTLSize(width:1,height:1,depth:1)
let numThreadgroups = MTLSize(width:threadsAmount, height:1, depth:1)
for computeCommandEncoder in computeCommandEncoders
{
    computeCommandEncoder.dispatchThreadgroups(numThreadgroups, threadsPerThreadgroup: threadsPerGroup)
}
for computeCommandEncoder in computeCommandEncoders
{
    computeCommandEncoder.endEncoding()
}
for commandBuffer in commandBuffers
{
    commandBuffer.enqueue()
}
for commandBuffer in commandBuffers
{
    commandBuffer.commit()
}
for commandBuffer in commandBuffers
{
    commandBuffer.waitUntilCompleted()
}
I dispatch up to a few dozen Metal kernel functions every frame, and it runs too slowly. I tested it with empty kernel functions, and that showed me the problem is in the Swift part of the execution. I mean, when I want to execute a kernel function in CUDA, I just call it like an ordinary function and it runs very fast. But here I must perform many actions for every execution of every function, every frame. Maybe I am missing something, but I want to create all the additional objects once and then just do something like
commandQueue.execute()
to execute all the kernel functions.
Am I right in how I execute many kernel functions, or is there some other way to do it faster?

I have a few projects that use multiple shaders in a single step. I create only a single command buffer and encoder, but multiple pipeline states, one for each compute function.
Remember that MTLCommandQueue is persistent and only needs to be created once, so my MetalKit view's drawRect() function is roughly this (there are more shaders and textures being passed between them, but you get an idea of the structure):
let commandBuffer = commandQueue.commandBuffer()
let commandEncoder = commandBuffer.computeCommandEncoder()
commandEncoder.setComputePipelineState(advect_pipelineState)
commandEncoder.dispatchThreadgroups(threadgroupsPerGrid,
                                    threadsPerThreadgroup: threadsPerThreadgroup)
commandEncoder.setComputePipelineState(divergence_pipelineState)
commandEncoder.dispatchThreadgroups(threadgroupsPerGrid,
                                    threadsPerThreadgroup: threadsPerThreadgroup)
[...]
commandEncoder.endEncoding()
commandBuffer.commit()
My code actually iterates over one of the shaders twenty times and still runs pretty nippily, so if you reorganise to follow this structure, with a single buffer and a single encoder, and only call endEncoding() and commit() once per pass, you may see an improvement in performance.
"May" being the operative word :)
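For reference, a minimal sketch of the one-time setup this answer relies on, written with the current Swift API names (makeCommandQueue() and friends, rather than the older newCommandQueue() in the question); the kernel names here are placeholders for your own functions:

import Metal

let device = MTLCreateSystemDefaultDevice()!
let commandQueue = device.makeCommandQueue()!   // persistent: create once, reuse every frame
let library = device.makeDefaultLibrary()!

// One pipeline state per compute function, also created once up front
let advect_pipelineState = try! device.makeComputePipelineState(
    function: library.makeFunction(name: "advect")!)        // hypothetical kernel name
let divergence_pipelineState = try! device.makeComputePipelineState(
    function: library.makeFunction(name: "divergence")!)    // hypothetical kernel name

After this setup, only the command buffer and encoder in the snippet above need to be created each frame.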

Related

Using Metal discard_fragment() to discard individual samples in an MSAA attachment

For an MSAA attachment, the following simple Metal fragment shader is meant to be run in multiple render passes, once per sample, to fill the stencil attachment with different reference values per sample. It does not work as expected, and effectively fills all stencil pixel samples with the reference value on each render pass.
struct _10
{
    int _m0;
};

struct main0_out
{
    float gl_FragDepth [[depth(any)]];
};

fragment main0_out main0(constant _10& _12 [[buffer(0)]], uint gl_SampleID [[sample_id]])
{
    main0_out out = {};
    if (gl_SampleID != _12._m0)
    {
        discard_fragment();
    }
    out.gl_FragDepth = 0.5;
    return out;
}
The problem seems to be using discard_fragment() on a per-sample basis. The intended operation of discarding one sample but writing another does not occur. Instead, the sample is never discarded, regardless of the comparison value passed in the buffer.
In fact, from what I can tell from GPU capture shader tracing results, it appears that the entire if-discard clause is optimized away by the Metal compiler. My guess is that Metal probably recognizes the disconnect between per-sample invocations and discard_fragment(), and removes it, but I can't be sure.
I can't find any Metal documentation on discard_fragment() and its use with MSAA, so I can't tell whether discard_fragment() is supposed to work with individual sample invocations in that environment, or whether it can only discard the entire fragment (which admittedly the function name implies, but what does that mean for per-sample invocations?).
Does the logic and intention of this shader make sense? Is discard_fragment() supposed to work with individual sample invocations? And why would the Metal compiler possibly be removing the discard operation from my shader?

MTLBuffer allocation + CPU/GPU synchronisation

I am using a Metal Performance Shader (MPSImageHistogram) to compute something in an MTLBuffer that I grab, perform computations on, and then display via MTKView. The MTLBuffer output from the shader is small (~4K bytes). So I am allocating a new MTLBuffer object for every render pass, and there are at least 30 renders per second for every video frame.
calculation = MPSImageHistogram(device: device, histogramInfo: &histogramInfo)
let bufferLength = calculation.histogramSize(forSourceFormat: MTLPixelFormat.bgra8Unorm)
let buffer = device.makeBuffer(length: bufferLength, options: .storageModeShared)
let commandBuffer = commandQueue?.makeCommandBuffer()
calculation.encode(to: commandBuffer!, sourceTexture: metalTexture!, histogram: buffer!, histogramOffset: 0)
// note: the completed handler must be added before commit()
commandBuffer?.addCompletedHandler({ (cmdBuffer) in
    let dataPtr = buffer!.contents().assumingMemoryBound(to: UInt32.self)
    ...
    ...
})
commandBuffer?.commit()
My questions:
Is it okay to make a new buffer every time using device.makeBuffer(...), or is it better to statically allocate a few buffers and reuse them? If reuse is better, what do we do to synchronize CPU/GPU writes and reads on these buffers?
Another unrelated question: is it okay to draw the results in MTKView on a non-main thread? Or must MTKView draws happen only on the main thread (even though I read that Metal is truly multithreaded)?
Allocations are somewhat expensive, so I'd recommend a reusable buffer scheme. My preferred way to do this is to keep a mutable array (queue) of buffers, enqueuing a buffer when the command buffer that used it completes (or in your case, after you've read back the results on the CPU), and allocating a new buffer when the queue is empty and you need to encode more work. In the steady state, you'll find that this scheme will rarely allocate more than 2-3 buffers total, assuming your frames are completing in a timely fashion. If you need this scheme to be thread-safe, you can protect access to the queue with a mutex (implemented with a dispatch_semaphore).
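As an illustration, here is a minimal sketch of that scheme, assuming a fixed buffer length; the BufferPool name and its API are made up for this example:

import Metal
import Dispatch

final class BufferPool {
    private var freeBuffers: [MTLBuffer] = []
    private let lock = DispatchSemaphore(value: 1)   // protects the queue
    private let device: MTLDevice
    private let length: Int

    init(device: MTLDevice, length: Int) {
        self.device = device
        self.length = length
    }

    // Take a free buffer, or allocate one only when the pool is empty.
    // In the steady state this rarely grows past 2-3 buffers.
    func dequeue() -> MTLBuffer {
        lock.wait(); defer { lock.signal() }
        return freeBuffers.popLast()
            ?? device.makeBuffer(length: length, options: .storageModeShared)!
    }

    // Return a buffer once the command buffer that used it has completed
    // (or, in your case, once you have read the results back on the CPU).
    func enqueue(_ buffer: MTLBuffer) {
        lock.wait(); defer { lock.signal() }
        freeBuffers.append(buffer)
    }
}

In the completed handler above, you would call pool.enqueue(buffer) after reading dataPtr, and call pool.dequeue() instead of device.makeBuffer(...) when encoding the next frame.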
You can use another thread to encode rendering work that draws into a drawable vended by an MTKView, as long as you follow standard multithreading precautions. Remember that while command queues are thread-safe (in the sense that you can create and encode to multiple command buffers from the same queue concurrently), command buffers themselves and encoders are not. I'd advise you to profile the single-threaded case and only introduce the complication of multi-threading if/when absolutely necessary.
If it is a small amount of data (under 4K) you can use setBytes(): https://developer.apple.com/documentation/metal/mtlcomputecommandencoder/1443159-setbytes
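For example, a sketch of binding a small value with setBytes() (Uniforms is a hypothetical struct of your own):

struct Uniforms { var exposure: Float = 1.0 }   // hypothetical small value type, well under 4K

var uniforms = Uniforms()
commandEncoder.setBytes(&uniforms, length: MemoryLayout<Uniforms>.stride, index: 0)

setBytes() copies the data, so there is no buffer lifetime to manage; note, though, that it only provides data the GPU reads, so a GPU-written output like the histogram still needs a real MTLBuffer.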
That might be faster/better than allocating a new buffer every frame. You could also use a triple-buffered approach so that successive frames' accesses to the buffer do not interfere. https://developer.apple.com/library/content/documentation/3DDrawing/Conceptual/MTLBestPracticesGuide/TripleBuffering.html
This tutorial shows how to set up triple buffering for rendering: https://www.raywenderlich.com/146418/metal-tutorial-swift-3-part-3-adding-texture
That's actually the third part of the tutorial, but it is the part that shows the triple-buffering setup, under "Reusing Uniform Buffers".
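A condensed sketch of that triple-buffering pattern, assuming the device, commandQueue, and bufferLength from the question:

import Metal
import Dispatch

let maxInflightBuffers = 3
let inflightSemaphore = DispatchSemaphore(value: maxInflightBuffers)
let buffers: [MTLBuffer] = (0..<maxInflightBuffers).map { _ in
    device.makeBuffer(length: bufferLength, options: .storageModeShared)!
}
var bufferIndex = 0

func render() {
    inflightSemaphore.wait()                 // blocks if all three buffers are still in flight
    let buffer = buffers[bufferIndex]
    let commandBuffer = commandQueue.makeCommandBuffer()!
    // ... encode work that writes into / reads from `buffer` ...
    commandBuffer.addCompletedHandler { _ in
        inflightSemaphore.signal()           // this buffer is safe to touch again
    }
    commandBuffer.commit()
    bufferIndex = (bufferIndex + 1) % maxInflightBuffers
}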

How to use custom compute shaders with Metal and get very smooth performance?

I'm trying to apply live camera filters through Metal using the default MPSKernel filters provided by Apple and custom compute shaders.
In the compute shader I did the in-place encoding with MPSImageGaussianBlur, and the code goes here:
func encode(to commandBuffer: MTLCommandBuffer, sourceTexture: MTLTexture, destinationTexture: MTLTexture, cropRect: MTLRegion = MTLRegion.init(), offset: CGPoint) {
    let blur = MPSImageGaussianBlur(device: device, sigma: 0)
    blur.clipRect = cropRect
    blur.offset = MPSOffset(x: Int(offset.x), y: Int(offset.y), z: 0)

    let threadsPerThreadgroup = MTLSizeMake(4, 4, 1)
    let threadgroupsPerGrid = MTLSizeMake(sourceTexture.width / threadsPerThreadgroup.width, sourceTexture.height / threadsPerThreadgroup.height, 1)

    let commandEncoder = commandBuffer.makeComputeCommandEncoder()
    commandEncoder.setComputePipelineState(pipelineState!)
    commandEncoder.setTexture(sourceTexture, at: 0)
    commandEncoder.setTexture(destinationTexture, at: 1)
    commandEncoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
    commandEncoder.endEncoding()

    autoreleasepool {
        var inPlaceTexture = destinationTexture
        blur.encode(commandBuffer: commandBuffer, inPlaceTexture: &inPlaceTexture, fallbackCopyAllocator: nil)
    }
}
But sometimes the in-place texture tends to fail, and eventually it creates a jerky effect on the screen.
So if anyone can suggest a solution without using the in-place texture, or how to use the fallbackCopyAllocator, or a different way of using compute shaders, that would be really helpful.
I have done enough coding in this area (applying compute shaders to the video stream from the camera), and the most common problem you run into is the "pixel buffer reuse" issue.
The Metal texture you create from the sample buffer is backed by a pixel buffer, which is managed by the video session and can be reused for subsequent video frames, unless you retain a reference to the sample buffer (retaining a reference to the Metal texture is not enough).
Feel free to take a look at my code at https://github.com/snakajima/vs-metal, which applies various compute shaders to a live video stream.
The VSContext:set() method takes an optional sampleBuffer parameter in addition to the texture parameter, and retains the reference to the sampleBuffer until the compute shader's computation is completed (in the VSRuntime:encode() method).
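The retention pattern boils down to something like this sketch (the function and parameter names are illustrative, not vs-metal's actual API):

import Metal
import CoreMedia

func encode(commandBuffer: MTLCommandBuffer,
            texture: MTLTexture,
            sampleBuffer: CMSampleBuffer) {
    // ... encode compute work that reads `texture` ...
    commandBuffer.addCompletedHandler { _ in
        // Capturing sampleBuffer in this closure keeps the backing pixel
        // buffer alive until the GPU is done; only then can the capture
        // session safely reuse it for a later frame.
        _ = sampleBuffer
    }
}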
The in-place operation method can be hit or miss depending on what the underlying filter is doing. If it is a single-pass filter only for some parameters, then you'll end up running out of place for the other cases.
Since that method was added, MPS has added an underlying MTLHeap to manage memory a bit more transparently for you. If your MPSImage doesn't need to be viewed by the CPU and exists only for a short period of time on the GPU, it is recommended that you just use an MPSTemporaryImage instead. When its readCount hits 0, the backing store is recycled through the MPS heap and made available for other MPSTemporaryImages and other temporary resources used downstream. Likewise, the backing store for it isn't actually allocated from the heap until absolutely necessary (e.g. the texture is written to, or .texture is called). A separate heap is allocated for each command buffer.
Using temporary images should help reduce memory usage quite a lot. For example, in an Inception v3 neural network graph, which has over a hundred passes, the heap was able to automatically reduce the graph to just four allocations.
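A minimal sketch of swapping in an MPSTemporaryImage for a transient intermediate, reusing the blur and commandBuffer from the question (the size and pixel format here are placeholders):

import MetalPerformanceShaders

let desc = MTLTextureDescriptor.texture2DDescriptor(pixelFormat: .bgra8Unorm,
                                                    width: 1280, height: 720,
                                                    mipmapped: false)
desc.usage = [.shaderRead, .shaderWrite]

// The backing store comes from the per-command-buffer MPS heap and is not
// allocated until the texture is actually used.
let intermediate = MPSTemporaryImage(commandBuffer: commandBuffer, textureDescriptor: desc)
intermediate.readCount = 1   // recycled after one downstream read (1 is the default)

blur.encode(commandBuffer: commandBuffer,
            sourceTexture: sourceTexture,
            destinationTexture: intermediate.texture)
// ... encode the one kernel that reads intermediate.texture; after that read,
// the backing store is recycled for other temporary resources ...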

Metal rendering really slow - how to speed it up

I have a working Metal application that is extremely slow and needs to run faster. I believe the problem is that I am creating too many MTLCommandBuffer objects.
The reason I am creating so many MTLCommandBuffer objects is that I need to send different uniform values to the pixel shader. I've pasted a snippet of code to illustrate the problem below.
for (int obj_i = 0; obj_i < n; ++obj_i)
{
    // I create one command buffer per object I draw so I can use different uniforms
    id <MTLCommandBuffer> mtlCommandBuffer = [metal_info.g_commandQueue commandBuffer];
    id <MTLRenderCommandEncoder> renderCommand = [mtlCommandBuffer renderCommandEncoderWithDescriptor:<#(MTLRenderPassDescriptor *)#>];

    // glossing over details, but this call has per-object-specific data
    memcpy([global_uniform_buffer contents], per_object_data, sizeof(per_data_object));

    [renderCommand setVertexBuffer:object_vertices offset:0 atIndex:0];
    // I am reusing a single buffer for all shader calls
    // this is killing performance
    [renderCommand setVertexBuffer:global_uniform_buffer offset:0 atIndex:1];
    [renderCommand drawIndexedPrimitives:MTLPrimitiveTypeTriangle
                              indexCount:per_object_index_count
                               indexType:MTLIndexTypeUInt32
                             indexBuffer:indicies
                       indexBufferOffset:0];
    [renderCommand endEncoding];

    [mtlCommandBuffer presentDrawable:frameDrawable];
    [mtlCommandBuffer commit];
}
The above code draws as expected, but is EXTREMELY slow. I'm guessing there is a better way to force pixel shader evaluation than creating one MTLCommandBuffer per object.
I've considered simply allocating a buffer much larger than is needed for a single shader pass and using offsets to queue up several calls in one render command encoder, then executing them. This method seems pretty unorthodox, and I want to make sure I'm solving the issue of needing to send custom per-object data in a Metal-friendly way.
What is the fastest way to render using multiple passes of the same pixel/vertex shader with per-call custom uniform data?
Don't reuse the same uniform buffer for every object. Doing that destroys all parallelism between the CPU and GPU and causes regular sync points.
Instead, make a separate uniform buffer for each object you are going to render in the frame. In fact, you should really create two per object and alternate between them each frame, so that the GPU can be rendering the last frame while you prepare the next frame on the CPU.
After you do that, simply refactor your loop so the command buffer and render command encoder are created once per frame. Your loop body should consist only of copying the uniform data, setting the vertex buffers, and calling the draw method, as in the sketch below.
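A sketch of that structure in Swift (the question is Objective-C, but the shape is the same; PerObjectData, the scene arrays, and the device/commandQueue/passDescriptor/drawable objects are all assumed placeholders):

// Two uniform buffers per object, alternated per frame.
let framesInFlight = 2
let uniformBuffers: [[MTLBuffer]] = (0..<objectCount).map { _ in
    (0..<framesInFlight).map { _ in
        device.makeBuffer(length: MemoryLayout<PerObjectData>.stride,
                          options: .storageModeShared)!
    }
}
var frameIndex = 0

// Per frame: one command buffer, one encoder, one draw per object.
let commandBuffer = commandQueue.makeCommandBuffer()!
let encoder = commandBuffer.makeRenderCommandEncoder(descriptor: passDescriptor)!
for i in 0..<objectCount {
    let uniforms = uniformBuffers[i][frameIndex]
    // perObjectData is a hypothetical var array of PerObjectData values
    uniforms.contents().copyMemory(from: &perObjectData[i],
                                   byteCount: MemoryLayout<PerObjectData>.stride)
    encoder.setVertexBuffer(objectVertices[i], offset: 0, index: 0)
    encoder.setVertexBuffer(uniforms, offset: 0, index: 1)
    encoder.drawIndexedPrimitives(type: .triangle,
                                  indexCount: indexCounts[i],
                                  indexType: .uint32,
                                  indexBuffer: indexBuffers[i],
                                  indexBufferOffset: 0)
}
encoder.endEncoding()
commandBuffer.present(drawable)
commandBuffer.commit()
frameIndex = (frameIndex + 1) % framesInFlight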

Compute shader: read data written in one thread from another?

Can somebody tell me whether the following compute shader is possible with DirectX 11?
I want the first thread in a Dispatch that accesses an element in a buffer (g_positionsGrid) to set (via compare-exchange) that element to a temporary value to signify that it is taking some action.
In this case the temp value is 0xffffffff, and the first thread is going to continue on, allocate a value from a structured append buffer (g_positions), and assign it to that element.
So all fine so far, but the other threads in the dispatch can come in between the first thread's compare-exchange and its allocation, and so they need to wait until the allocation index is available. I do this with a busy wait, i.e. the while loop.
However, sadly, this just locks up the GPU, as I'm assuming the value written by the first thread is not propagated to the other threads stuck in the while loop.
Is there any way to get those threads to see that value?
Thanks for any help!
// a struct element type is assumed here so the m_position accesses below
// compile; the original declaration was RWStructuredBuffer<float3>
struct Position
{
    float3 m_position;
};

RWStructuredBuffer<Position> g_positions : register(u1);
RWBuffer<uint> g_positionsGrid : register(u2);

void AddPosition(uint address, float3 pos)
{
    uint token = 0;
    // Assign a temp value to signify that the first thread has accessed this particular element
    InterlockedCompareExchange(g_positionsGrid[address], 0, 0xffffffff, token);
    if (token == 0)
    {
        // If we are the first thread in here, allocate an index and assign the
        // value, which hopefully the other threads will pick up
        uint index = g_positions.IncrementCounter();
        g_positionsGrid[address] = index;
        g_positions[index].m_position = pos;
    }
    else
    {
        if (token == 0xffffffff)
        {
            uint index = g_positionsGrid[address];
            // This never meets its condition
            [allow_uav_condition]
            while (index == 0xffffffff)
            {
                // For some reason this thread never sees the assignment
                // made by the first thread above
                index = g_positionsGrid[address];
            }
            g_positions[index].m_position = pos;
        }
        else
        {
            // Just assign the value, as the first thread has already allocated a valid slot
            g_positions[token].m_position = pos;
        }
    }
}
Thread sync in DirectCompute is very easy, but compared to the equivalent CPU threading features it is very inflexible. AFAIK, the only way to sync data between threads in a compute shader is to use groupshared memory and GroupMemoryBarrierWithGroupSync(). That means you can:
create a small temporary buffer in groupshared memory
calculate a value
write it to the groupshared buffer
synchronize the threads with GroupMemoryBarrierWithGroupSync()
read from the groupshared buffer on another thread and use the value somehow
To implement all of this, you need proper array indices. Where can you get them? In DirectCompute you have the values passed to Dispatch() and the related system values available in the shader (SV_GroupIndex, SV_DispatchThreadID, SV_GroupThreadID, SV_GroupID). Using those values, you can calculate indices to access your buffers.
Compute shaders are not well documented, and there is no easy way, but to find out more you can:
read the MSDN Compute Shader Overview
watch the DirectCompute Lecture Series videos on Channel 9
examine the compute shader samples from the DirectX SDK, and the very nice samples from NVIDIA's SDK (10 and 11)
read this advanced NVIDIA paper where they implemented thread reduction and then optimized their code to run 10 times faster ;)
As for your code: you can probably redesign it a little.
It is always good when all threads do the same task (symmetric loading). You really cannot assign different tasks to your threads the way you would in CPU code.
If your data first needs some preprocessing before the main processing, you may want to divide the work into different Dispatch() calls (different shaders) that you run in sequence:
preprocessShader reads from the inputData buffer and writes to preprocessedData
calculateShader reads from preprocessedData and writes to finalData
In this case you can drop all the slow thread sync and slow groupshared memory.
Look at the "thread reduction" trick mentioned above.
Hope it helps! And happy coding!
