I am using a Metal Performance Shader (MPSImageHistogram) to compute a histogram into an MTLBuffer, which I then read back, run further computations on, and display via MTKView. The MTLBuffer output from the shader is small (~4K bytes), so I am allocating a new MTLBuffer object for every render pass, and there are at least 30 renders per second (one per video frame).
calculation = MPSImageHistogram(device: device, histogramInfo: &histogramInfo)
let bufferLength = calculation.histogramSize(forSourceFormat: MTLPixelFormat.bgra8Unorm)
let buffer = device.makeBuffer(length: bufferLength, options: .storageModeShared)
let commandBuffer = commandQueue?.makeCommandBuffer()
calculation.encode(to: commandBuffer!, sourceTexture: metalTexture!, histogram: buffer!, histogramOffset: 0)
// Completion handlers must be registered before the command buffer is committed.
commandBuffer?.addCompletedHandler({ (cmdBuffer) in
    let dataPtr = buffer!.contents().assumingMemoryBound(to: UInt32.self)
    ...
    ...
})
commandBuffer?.commit()
My questions:
Is it okay to make a new buffer every time using device.makeBuffer(..), or is it better to statically allocate a few buffers and reuse them? If reuse is better, how do we synchronize CPU/GPU reads and writes on these buffers?
Another, unrelated question: is it okay to draw the results in MTKView on a non-main thread? Or must MTKView draws happen only on the main thread (even though I read that Metal is truly multithreaded)?
Allocations are somewhat expensive, so I'd recommend a reusable buffer scheme. My preferred way to do this is to keep a mutable array (queue) of buffers, enqueuing a buffer when the command buffer that used it completes (or in your case, after you've read back the results on the CPU), and allocating a new buffer when the queue is empty and you need to encode more work. In the steady state, you'll find that this scheme will rarely allocate more than 2-3 buffers total, assuming your frames are completing in a timely fashion. If you need this scheme to be thread-safe, you can protect access to the queue with a mutex (implemented with a dispatch_semaphore).
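The queue-based scheme described above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the name ReusePool and its methods are hypothetical, and the pool is generic over a Resource type (which would be MTLBuffer in a Metal app) so that the sketch compiles without a GPU. A DispatchSemaphore with an initial value of 1 plays the role of the mutex.

```swift
import Foundation

/// Hypothetical thread-safe reuse pool. `Resource` stands in for `MTLBuffer`.
final class ReusePool<Resource> {
    private var available: [Resource] = []
    private let mutex = DispatchSemaphore(value: 1)  // binary semaphore used as a mutex
    private let makeResource: () -> Resource
    private(set) var allocationCount = 0

    init(makeResource: @escaping () -> Resource) {
        self.makeResource = makeResource
    }

    /// Returns a recycled resource if one is queued, otherwise allocates a new one.
    func dequeue() -> Resource {
        mutex.wait()
        defer { mutex.signal() }
        if !available.isEmpty {
            return available.removeFirst()
        }
        allocationCount += 1
        return makeResource()
    }

    /// Call once the GPU work that used the resource has completed
    /// and any CPU readback of its contents has finished.
    func enqueue(_ resource: Resource) {
        mutex.wait()
        defer { mutex.signal() }
        available.append(resource)
    }
}
```

In a Metal renderer, the factory closure would presumably wrap device.makeBuffer(length:options:), dequeue() would be called before encoding, and enqueue(_:) would be called at the end of the command buffer's completed handler, after the histogram has been read back.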
You can use another thread to encode rendering work that draws into a drawable vended by an MTKView, as long as you follow standard multithreading precautions. Remember that while command queues are thread-safe (in the sense that you can create and encode to multiple command buffers from the same queue concurrently), command buffers themselves and encoders are not. I'd advise you to profile the single-threaded case and only introduce the complication of multi-threading if/when absolutely necessary.
If it is a small amount of data (under 4K) you can use setBytes(): https://developer.apple.com/documentation/metal/mtlcomputecommandencoder/1443159-setbytes
That might be faster/better than allocating a new buffer every frame. You could also use a triple-buffered approach so that successive frames' access to the buffer do not interfere. https://developer.apple.com/library/content/documentation/3DDrawing/Conceptual/MTLBestPracticesGuide/TripleBuffering.html
This tutorial shows how to set up triple buffering for rendering: https://www.raywenderlich.com/146418/metal-tutorial-swift-3-part-3-adding-texture
That's actually like the third part of the tutorial but it is the part that shows the triple-buffering setup, under "Reusing Uniform Buffers".
Related
I'm rendering a vertex/frag shader with a compute kernel.
Every frame I am binding large assets (such as a 450MB texture) in the usual way:
computeEncoder.setTexture(highResTexture, index: 0)
computeEncoder.setBuffer(largeBuffer, offset: 0, index: 0)
...
renderEncoder.setVertexTexture(highResTexture, index: 0)
renderEncoder.setVertexBuffer(largeBuffer, offset: 0, index: 0)
So that is close to 1GB in bandwidth for a single texture, and I have many more assets totaling a few hundred megs, so that is about 1.5GB that I bind for every frame.
Is there any way to bind textures/buffers to the GPU once so that they would then be available in the kernel and vertex functions without binding every frame?
I could be wrong, but I thought something like this was introduced in one of the last couple of WWDCs, so I thought I would ask to make sure I'm not missing anything.
EDIT:
By simply binding a texture in the vertex function that I have already bound in the compute encoder it does indeed show more texture bandwidth used, even though I am not using it for the capture.
GPU Read Bandwidth:
6.3920 GiB/s without binding
7.1919 GiB/s with binding
Without binding the texture:
With binding the texture but not using it in any way:
Also, if it works as you describe, why does using multiple command encoders warn about wasted bandwidth? If I use more than one emitter, each with a separate encoder, even though they bind identical resources, I get the performance warning:
I think you are confused. Setting a texture to a command encoder doesn't consume bandwidth. Reading it or sampling it inside the shader does.
When you set a texture or a buffer on an encoder, the driver just passes a small amount of metadata to the shader using some mechanism, likely an internal buffer that's not visible to you as the API user. It doesn't "load" the texture anywhere. There's an exception for buffers that are declared in the constant address space in the shaders, because those may get pre-fetched by the GPU for better performance.
Another thing that happens is that the resource is made resident, meaning the GPU driver will map a range of addresses in the GPU's virtual address table to point to the physical memory that stores the texture contents. This also does not consume memory, but it does consume available virtual address space. You might run out of virtual address space in some cases, but that's not a bandwidth issue.
Still, if you do have a lot of textures, you might be actually spending a lot of CPU time just encoding those setTexture commands. Instead, you can use argument buffers. If the hardware you are targeting supports argument buffers tier 2, you can put every texture in an argument buffer. This will require calling useResource on all of those textures, because the driver needs to know that you are going to use those textures to make them resident, so you will still spend CPU time encoding those commands. To avoid that, you can allocate all the textures from one or more heaps and call useHeaps on those heaps. This will make the whole heap resident, and you won't need to call useResource on individual resources. There are a bunch of WWDC talks on this topic, latest one being Explore bindless rendering in Metal.
But again, to reiterate: nothing I mentioned here "wastes" bandwidth.
Update:
A very basic example of using argument buffers would be to use it like this.
let argumentDescriptor = MTLArgumentDescriptor()
argumentDescriptor.index = 0
argumentDescriptor.dataType = .texture
argumentDescriptor.textureType = .type2D
let argumentEncoder = device.makeArgumentEncoder(arguments: [argumentDescriptor])!
let argumentBuffer = device.makeBuffer(length: argumentEncoder.encodedLength, options: [.storageModeShared])
argumentEncoder.setArgumentBuffer(argumentBuffer, offset: 0)
argumentEncoder.setTexture(someTexture, index: 0)
commandEncoder.setBuffer(argumentBuffer, offset: 0, index: 0)
commandEncoder.useResource(someTexture, usage: .read)
And in the shader you would write a struct like this:
struct MyTexture
{
texture2d<float> texture [[ id(0) ]];
};
and then bind it like
device MyTexture& myTexture [[ buffer(0) ]]
and use it like any other struct. This is a very basic example and you can actually use reflection to create those MTLArgumentEncoders for you from functions and binding indices.
I have very dynamic data which changes each frame. The data itself is relatively small so I use
[commandEncoder setVertexBytes:_vertices.buffer() length:_vertices.size() * sizeof(float) atIndex:0];
to set the vertex data.
However I need to set the index data as well when drawing using:
[commandEncoder drawIndexedPrimitives: ...];
How can I update the MTLBuffer used for the index data in the drawIndexedPrimitives method? I need to be able to update the index buffer efficiently.
This article contains an in-depth explanation of using multiple buffers to synchronize data between the CPU and GPU without forcing them to work in lock-step. You should read it carefully. I'll summarize the approach below.
Select a size for your index buffers that is large enough to hold the maximum number of indices you might need to draw.
Allocate a few (2-3) buffers of this size and place them in an array (of type NSArray<id<MTLBuffer>>). This is your reuse pool; I'll call it indexBufferPool below.
When initializing your renderer, create a dispatch semaphore whose value is equal to the number of buffers you put in your reuse pool. I'll call it bufferSemaphore below.
Create a buffer index member variable and initialize it to 0. I'll call it bufferIndex below.
Every time you draw, wait on the semaphore by calling dispatch_semaphore_wait(bufferSemaphore, DISPATCH_TIME_FOREVER).
When the semaphore wait function returns, it's safe to copy new index data into the buffer at the current buffer index. Using memcpy or some other copying technique, copy index data into the contents of indexBufferPool[bufferIndex].
Draw your primitives using indexBufferPool[bufferIndex] as the index buffer.
Increment bufferIndex by setting bufferIndex = (bufferIndex + 1) % ResourceCount, where ResourceCount is the number of buffers in the reuse pool.
Before committing it, add a completed handler to the current command buffer. The completed handler should call dispatch_semaphore_signal(bufferSemaphore). This lets any pending calls to the draw method know that it's safe to write to the buffer at the current buffer index.
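The steps above might look roughly like the following sketch. It is written in Swift rather than Objective-C, and the BufferRing type, its Slot parameter (a stand-in for id<MTLBuffer>), and its method names are all hypothetical; the semaphore logic is the part that mirrors the steps. It assumes acquire() is only ever called from a single render thread, since the index itself is not protected by a lock.

```swift
import Foundation

/// Hypothetical ring of `count` reusable slots guarded by a counting semaphore.
final class BufferRing<Slot> {
    private let slots: [Slot]
    private let semaphore: DispatchSemaphore
    private var index = 0

    init(count: Int, makeSlot: () -> Slot) {
        slots = (0..<count).map { _ in makeSlot() }
        semaphore = DispatchSemaphore(value: count)  // one permit per slot
    }

    /// Waits until a slot is free, returns it, and advances the ring index.
    /// Returns nil if no slot becomes free before `timeout`.
    func acquire(timeout: DispatchTime = .distantFuture) -> Slot? {
        guard semaphore.wait(timeout: timeout) == .success else { return nil }
        let slot = slots[index]
        index = (index + 1) % slots.count
        return slot
    }

    /// Call from the command buffer's completed handler to free one slot.
    func release() {
        semaphore.signal()
    }
}
```

With real Metal buffers, each frame would call acquire(), copy the new index data into the returned buffer's contents, encode the indexed draw with it, and register a completed handler that calls release() before committing the command buffer.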
The documentation for setVertexBytes says:
Use this method for single-use data smaller than 4 KB. Create a MTLBuffer object if your data exceeds 4 KB in length or persists for multiple uses.
What exactly does single-use mean?
For example, if I have a uniforms struct that is smaller than 4 KB (and is updated every frame), is it better to use a triple-buffering technique or to simply use setVertexBytes?
From what I understand using setVertexBytes would copy the data every time into a MTLBuffer that Metal manages. This sounds slower than using triple buffering.
But then if I have different objects, each with its own uniforms, I would have to triple buffer everything, since it's dynamically updated.
And if I have a material that updates rarely but is passed to the shader every frame, would it be better to keep it in a buffer or pass it as a pointer using setVertexBytes?
It's not necessarily the case that Metal manages a distinct resource into which this data is written. As user Columbo notes in their comment, some hardware allows constant data to be recorded directly into command buffer memory, from which it can be subsequently read by a shader.
As always, you should profile in order to find the difference between the two approaches on your target hardware, but if the amount of data you're pushing per draw call is small, you might very well find that using setVertexBytes:... is faster than writing into a buffer and calling setVertexBuffer:....
For data that doesn't vary every frame (your slow-varying material use case), it may indeed be more efficient to keep that data in a buffer (double- or triple-buffered) rather than using setVertexBytes:....
If I am generating 0-12 triangles in a compute shader, is there a way I can stream them to a buffer that will then be used for rendering to screen?
My current strategy is:
create a buffer of float3 of size threads * 12, so can store the maximum possible number of triangles;
write to the buffer using an index that depends on the thread position in the grid, so there are no race conditions.
If I want to render from this though, I would need to skip the empty memory. It sounds ugly, but probably there is no other way currently. I know CUDA geometry shaders can have variable length output, but I wonder if/how games on iOS can generate variable-length data on GPU.
UPDATE 1:
As soon as I wrote the question, I thought about the possibility of using a second buffer that would point out how many triangles are available for each block. The vertex shader would then process all vertices of all triangles of that block.
This will not solve the problem of the unused memory though and as I have a big number of threads, the total memory wasted would be considerable.
What you're looking for is the Metal equivalent of D3D's "AppendStructuredBuffer". You want a type that can have structures added to it atomically.
I'm not familiar with Metal, but it does support Atomic operations such as 'Add' which is all you really need to roll your own Append Buffer. Initialise the counter to 0 and have each thread add '1' to the counter and use the original value as the index to write to in your buffer.
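The counter idea can be illustrated on the CPU with a small Swift emulation. This is only a sketch of the reservation logic, not GPU code: AppendCounter and appendDemo are hypothetical names, a lock stands in for the hardware atomic, and the per-thread item count is made up. In a Metal kernel the same reservation would presumably be done with atomic_fetch_add_explicit on a device atomic_uint counter. Because each thread here produces several items, it adds its whole item count at once instead of 1, which reserves a contiguous range.

```swift
import Foundation

/// CPU stand-in for a GPU atomic counter: adds `n` and returns the previous value.
final class AppendCounter {
    private var value = 0
    private let lock = NSLock()

    func fetchAdd(_ n: Int) -> Int {
        lock.lock()
        defer { lock.unlock() }
        let old = value
        value += n
        return old
    }

    var count: Int {
        lock.lock()
        defer { lock.unlock() }
        return value
    }
}

/// Emulates `threadCount` GPU threads, each appending 0...maxPerThread items.
/// Returns only the densely packed, used part of the output buffer.
func appendDemo(threadCount: Int, maxPerThread: Int) -> [Int] {
    let counter = AppendCounter()
    let capacity = threadCount * maxPerThread  // worst case, as in the question
    let output = UnsafeMutablePointer<Int>.allocate(capacity: capacity)
    output.initialize(repeating: -1, count: capacity)
    defer { output.deallocate() }

    DispatchQueue.concurrentPerform(iterations: threadCount) { thread in
        let produced = thread % (maxPerThread + 1)  // made-up per-thread item count
        let base = counter.fetchAdd(produced)       // reserve a contiguous range
        for i in 0..<produced {
            output[base + i] = thread               // tag each item with its producer
        }
    }
    return Array(UnsafeBufferPointer(start: output, count: counter.count))
}
```

The buffer still has to be sized for the worst case, but the written items are packed at the front with no gaps, and the final counter value gives the number of items to draw. The order of items across threads is not deterministic.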
We often need to stream textures to the graphics card (in games: terrains; in my case: images from different input sources such as cameras, capture cards, and videos).
Of course, in the camera case I receive my data in a separate thread, but I still need to upload that data to the GPU for display.
I know 2 models for it.
Use a dynamic resource:
You create a dynamic texture that has the same size and format as your input image. When you receive a new image, you set a flag indicating that an upload is needed, and then use Map on the device context to upload the texture data (possibly double-buffered, of course).
The advantage is that you have a single memory location, hence you don't have memory fragmentation over time.
The drawback is that you need to upload in the immediate context, so your upload has to be in your render loop.
Use immutable and load/discard
In that case you upload in the image receiving thread, by creating a new resource, push the data and discard the old resource.
Advantage is you should have a stall free upload (no need for immediate context, you can still run your command list while texture is uploading), resource can be used with a simple trigger once available (to swap SRV).
The drawback is that you can fragment memory over time by constantly allocating and freeing resources (30 fps for a standard camera, for example).
Also you have to deal with throttling yourself (but that part is not a big deal).
So is there something I missed in those techniques, or is there an even better way to handle this?
These are the two main methods of updating textures in D3D11.
However, the assumption that the first method will not result in memory usage patterns identical to the second case is dependent on the driver, and likely is not true. You would use D3D11_MAP_WRITE_DISCARD if you are overwriting the whole image (which it sounds like what you are doing), meaning that the current contents of the buffer become undefined. However, this is only true from the CPU's point-of-view. They are retained for the GPU, if they are potentially used in a pending draw operation. Most (maybe all?) drivers will actually allocate new storage for the write location of the mapped texture in this case, otherwise command buffer processing would need to stall. The same holds if you do not use the discard flag. Instead, when the map command is processed in the command buffer, the resource's buffer is updated to the value returned from Map in D3D11_MAPPED_SUBRESOURCE.
Also, it is not true that you must update dynamic textures in the immediate context. The only restriction is that if you update them in a deferred context, you must use the D3D11_MAP_WRITE_DISCARD flag. This means you could update the texture on a worker thread, as long as you are overwriting the entire texture.
The bottom line is that, since the CPU/GPU system on a PC is not a unified memory system, there will be synchronization issues updating GPU resources coming from the CPU.