I have very dynamic data which changes each frame. The data itself is relatively small so I use
[commandEncoder setVertexBytes:_vertices.buffer() length:_vertices.size() * sizeof(float) atIndex:0];
to set the vertex data.
However I need to set the index data as well when drawing using:
[commandEncoder drawIndexedPrimitives: ...];
How can I update the MTLBuffer that drawIndexedPrimitives uses for index data? I need to be able to update the index buffer efficiently.
This article contains an in-depth explanation of using multiple buffers to synchronize data between the CPU and GPU without forcing them to work in lock-step. You should read it carefully. I'll summarize the approach below.
Select a size for your index buffers that is large enough to hold the maximum number of indices you might need to draw.
Allocate a few (2-3) buffers of this size and place them in an array (of type NSArray<id<MTLBuffer>>). This is your reuse pool; I'll call it indexBufferPool below.
When initializing your renderer, create a dispatch semaphore whose value is equal to the number of buffers you put in your reuse pool. I'll call it bufferSemaphore below.
Create a buffer index member variable and initialize it to 0. I'll call it bufferIndex below.
Every time you draw, wait on the semaphore by calling dispatch_semaphore_wait(bufferSemaphore, DISPATCH_TIME_FOREVER).
When the semaphore wait function returns, it's safe to copy new index data into the buffer at the current buffer index. Using memcpy or some other copying technique, copy index data into the contents of indexBufferPool[bufferIndex].
Draw your primitives using indexBufferPool[bufferIndex] as the index buffer.
Increment bufferIndex by setting bufferIndex = (bufferIndex + 1) % ResourceCount, where ResourceCount is the number of buffers in the reuse pool.
Before committing it, add a completed handler to the current command buffer. The completed handler should call dispatch_semaphore_signal(bufferSemaphore). This lets any pending calls to the draw method know that it's safe to write to the buffer at the current buffer index.
The documentation for setVertexBytes says:
Use this method for single-use data smaller than 4 KB. Create a MTLBuffer object if your data exceeds 4 KB in length or persists for multiple uses.
What exactly does single-use mean?
For example, if I have a uniforms struct which is less than 4 KB (and is updated every frame), is it better to use a triple-buffering technique or simply use setVertexBytes?
From what I understand using setVertexBytes would copy the data every time into a MTLBuffer that Metal manages. This sounds slower than using triple buffering.
But then if I have different objects, each with its own uniforms, I would have to triple buffer everything, since it's dynamically updated.
And if I have a material that updates rarely but is passed to the shader every frame, would it be better to keep it in a buffer or pass it as a pointer using setVertexBytes?
It's not necessarily the case that Metal manages a distinct resource into which this data is written. As user Columbo notes in their comment, some hardware allows constant data to be recorded directly into command buffer memory, from which it can be subsequently read by a shader.
As always, you should profile in order to find the difference between the two approaches on your target hardware, but if the amount of data you're pushing per draw call is small, you might very well find that using setVertexBytes:... is faster than writing into a buffer and calling setVertexBuffer:....
For data that doesn't vary every frame (your slow-varying material use case), it may indeed be more efficient to keep that data in a buffer (double- or triple-buffered) rather than using setVertexBytes:....
I am using a Metal Performance Shaders kernel (MPSImageHistogram) to compute a histogram into an MTLBuffer that I grab, perform computations on, and then display via MTKView. The MTLBuffer output from the shader is small (~4 KB). So I am allocating a new MTLBuffer object for every render pass, and there are at least 30 renders per second for every video frame.
calculation = MPSImageHistogram(device: device, histogramInfo: &histogramInfo)
let bufferLength = calculation.histogramSize(forSourceFormat: MTLPixelFormat.bgra8Unorm)
let buffer = device.makeBuffer(length: bufferLength, options: .storageModeShared)
let commandBuffer = commandQueue?.makeCommandBuffer()
calculation.encode(to: commandBuffer!, sourceTexture: metalTexture!, histogram: buffer!, histogramOffset: 0)
// The completed handler must be added before commit()
commandBuffer?.addCompletedHandler({ (cmdBuffer) in
    let dataPtr = buffer!.contents().assumingMemoryBound(to: UInt32.self)
    ...
})
commandBuffer?.commit()
My questions -
Is it okay to make a new buffer every time using device.makeBuffer(...), or is it better to statically allocate a few buffers and reuse them? If reuse is better, how do we synchronize CPU/GPU writes and reads on these buffers?
Another unrelated question: is it okay to draw the results into the MTKView on a non-main thread? Or must MTKView draws happen only on the main thread (even though I read Metal is truly multithreaded)?
Allocations are somewhat expensive, so I'd recommend a reusable buffer scheme. My preferred way to do this is to keep a mutable array (queue) of buffers, enqueuing a buffer when the command buffer that used it completes (or in your case, after you've read back the results on the CPU), and allocating a new buffer when the queue is empty and you need to encode more work. In the steady state, you'll find that this scheme will rarely allocate more than 2-3 buffers total, assuming your frames are completing in a timely fashion. If you need this scheme to be thread-safe, you can protect access to the queue with a mutex (implemented with a dispatch_semaphore).
You can use another thread to encode rendering work that draws into a drawable vended by an MTKView, as long as you follow standard multithreading precautions. Remember that while command queues are thread-safe (in the sense that you can create and encode to multiple command buffers from the same queue concurrently), command buffers themselves and encoders are not. I'd advise you to profile the single-threaded case and only introduce the complication of multi-threading if/when absolutely necessary.
If it is a small amount of data (under 4K) you can use setBytes(): https://developer.apple.com/documentation/metal/mtlcomputecommandencoder/1443159-setbytes
That might be faster/better than allocating a new buffer every frame. You could also use a triple-buffered approach so that successive frames' access to the buffer do not interfere. https://developer.apple.com/library/content/documentation/3DDrawing/Conceptual/MTLBestPracticesGuide/TripleBuffering.html
This tutorial shows how to set up triple buffering for rendering: https://www.raywenderlich.com/146418/metal-tutorial-swift-3-part-3-adding-texture
That's actually like the third part of the tutorial but it is the part that shows the triple-buffering setup, under "Reusing Uniform Buffers".
If I am generating 0-12 triangles in a compute shader, is there a way I can stream them to a buffer that will then be used for rendering to screen?
My current strategy is:
create a buffer of float3 with capacity threads * 12, so it can store the maximum possible number of triangles;
write to the buffer using an index that depends on the thread position in the grid, so there are no race conditions.
If I want to render from this though, I would need to skip the empty memory. It sounds ugly, but probably there is no other way currently. I know CUDA geometry shaders can have variable length output, but I wonder if/how games on iOS can generate variable-length data on GPU.
UPDATE 1:
As soon as I wrote the question, I thought about the possibility of using a second buffer that would point out how many triangles are available for each block. The vertex shader would then process all vertices of all triangles of that block.
This will not solve the problem of the unused memory though and as I have a big number of threads, the total memory wasted would be considerable.
What you're looking for is the Metal equivalent of D3D's "AppendStructuredBuffer". You want a type that can have structures added to it atomically.
I'm not familiar with Metal, but it does support atomic operations such as atomic add, which is all you really need to roll your own append buffer. Initialise the counter to 0 and have each thread atomically add 1 to it, using the original (pre-increment) value as the index to write to in your buffer.
I have a working metal application that is extremely slow, and needs to run faster. I believe the problem is I am creating too many MTLCommandBuffer objects.
The reason I am creating so many MTLCommandBuffer objects is that I need to send different uniform values to the pixel shader. I've pasted a snippet of code to illustrate the problem below.
for (int obj_i = 0; obj_i < n; ++obj_i)
{
    // I create one command buffer per object I draw so I can use different uniforms
    id <MTLCommandBuffer> mtlCommandBuffer = [metal_info.g_commandQueue commandBuffer];
    id <MTLRenderCommandEncoder> renderCommand = [mtlCommandBuffer renderCommandEncoderWithDescriptor:<#(MTLRenderPassDescriptor *)#>];
    // glossing over details, but this call has per-object specific data
    memcpy([global_uniform_buffer contents], per_object_data, sizeof(per_data_object));
    [renderCommand setVertexBuffer:object_vertices offset:0 atIndex:0];
    // I am reusing a single buffer for all shader calls
    // this is killing performance
    [renderCommand setVertexBuffer:global_uniform_buffer offset:0 atIndex:1];
    [renderCommand drawIndexedPrimitives:MTLPrimitiveTypeTriangle
                              indexCount:per_object_index_count
                               indexType:MTLIndexTypeUInt32
                             indexBuffer:indicies
                       indexBufferOffset:0];
    [renderCommand endEncoding];
    [mtlCommandBuffer presentDrawable:frameDrawable];
    [mtlCommandBuffer commit];
}
The above code draws as expected, but is EXTREMELY slow. I'm guessing there is a better way to trigger pixel shader evaluation than creating one MTLCommandBuffer per object.
I've considered simply allocating a buffer much larger than is needed for a single shader pass and using offsets to queue up several calls in one render command encoder, then executing them. This method seems pretty unorthodox, and I want to make sure I'm solving the issue of needing to send custom data per object in a Metal-friendly way.
What is the fastest way to render using multiple passes of the same pixel/vertex shader with per call custom uniform data?
Don't reuse the same uniform buffer for every object. Doing that destroys all parallelism between the CPU and GPU and causes regular sync points.
Instead, make a separate uniform buffer for each object you are going to render in the frame. In fact you should really create 2 per object and alternate between them each frame so that the GPU can be rendering the last frame whilst you are preparing the next frame on the CPU.
After you do that, refactor your loop so that the command buffer and render command encoder are created once per frame. The loop body should only consist of copying the uniform data, setting the vertex buffers, and issuing the draw call.
Can somebody tell me whether the following compute shader is possible with DirectX 11?
I want the first thread in a Dispatch that accesses an element in a buffer (g_positionsGrid) to set (compare exchange) that element with a temporary value to signify that it is taking some action.
In this case the temp value is 0xffffffff, and the first thread is going to continue on, allocate a value from a structured append buffer (g_positions), and assign it to that element.
So all fine so far, but the other threads in the dispatch can come in in between the first thread's compare-exchange and its allocation, and so need to wait until the allocation index is available. I do this with a busy wait, i.e. the while loop.
However, sadly this just locks up the GPU, as I'm assuming the value written by the first thread is not propagated through to the other threads stuck in the while loop.
Is there any way to get those threads to see that value?
Thanks for any help!
RWStructuredBuffer<float3> g_positions : register(u1);
RWBuffer<uint> g_positionsGrid : register(u2);

void AddPosition( uint address, float3 pos )
{
    uint token = 0;
    // Assign a temp value to signify the first thread has accessed this particular element
    InterlockedCompareExchange(g_positionsGrid[address], 0, 0xffffffff, token);
    if(token == 0)
    {
        // If first thread in here, allocate an index and assign the value,
        // which hopefully the other threads will pick up
        uint index = g_positions.IncrementCounter();
        g_positionsGrid[address] = index;
        g_positions[index].m_position = pos;
    }
    else
    {
        if(token == 0xffffffff)
        {
            uint index = g_positionsGrid[address];
            // This never meets its condition
            [allow_uav_condition]
            while(index == 0xffffffff)
            {
                // For some reason this thread never gets the assignment
                // from the first thread above
                index = g_positionsGrid[address];
            }
            g_positions[index].m_position = pos;
        }
        else
        {
            // Just assign the value, as the first thread has already allocated a valid slot
            g_positions[token].m_position = pos;
        }
    }
}
Thread synchronization in DirectCompute is easy, but compared to CPU threading it is very inflexible. AFAIK, the only way to share data between threads in a compute shader is to use groupshared memory and GroupMemoryBarrierWithGroupSync(). That means you can:
create small temporary buffer in groupshared memory
calculate value
write to groupshared buffer
synchronize threads with GroupMemoryBarrierWithGroupSync()
read from the groupshared buffer in another thread and use the value somehow
To implement all this, you need proper array indices. But where do you get them? In DirectCompute, they come from the values passed to Dispatch() and the system values available in the shader (SV_GroupIndex, SV_DispatchThreadID, SV_GroupThreadID, SV_GroupID). Using those values you can calculate indices to address your buffers.
Compute shaders are not well documented and there is no easy way in, but to find out more you can at least:
read MSDN: Compute shader overview
watch DirectCompute Lecture Series videos on channel9
examine the compute shader samples from the DirectX SDK
look at the very nice samples from NVIDIA's SDKs (10 and 11)
read this advanced NVIDIA paper where they implemented thread reduction and then optimize their code to run 10 times faster ;)
As for your code: you can probably redesign it a little.
It is always good for all threads to do the same task (symmetric loading). You cannot assign different tasks to your threads the way you would in CPU code.
If your data first needs some preprocessing and then further processing, you may want to divide the work into different Dispatch() calls (different shaders) that you invoke in sequence:
preprocessShader reads from buffer inputData and writes to preprocessedData
calculateShader reads from preprocessedData and writes to finalData
In this case you can drop any slow thread sync and slow groupshared memory entirely.
Look at "Thread reduction" trick mentioned above.
Hope it helps! And happy coding!