Metal rendering really slow - how to speed it up - iOS

I have a working Metal application that is extremely slow and needs to run faster. I believe the problem is that I am creating too many MTLCommandBuffer objects.
The reason I am creating so many MTLCommandBuffer objects is that I need to send different uniform values to the pixel shader. I've pasted a snippet of code below to illustrate the problem.
for (int obj_i = 0; obj_i < n; ++obj_i)
{
    // I create one command buffer per object I draw so I can use different uniforms
    id<MTLCommandBuffer> mtlCommandBuffer = [metal_info.g_commandQueue commandBuffer];
    // renderPassDescriptor stands in for the Xcode placeholder in the original snippet
    id<MTLRenderCommandEncoder> renderCommand = [mtlCommandBuffer renderCommandEncoderWithDescriptor:renderPassDescriptor];

    // glossing over details, but this call has per-object-specific data
    memcpy([global_uniform_buffer contents], per_object_data, sizeof(per_data_object));
    [renderCommand setVertexBuffer:object_vertices offset:0 atIndex:0];

    // I am reusing a single buffer for all shader calls;
    // this is killing performance
    [renderCommand setVertexBuffer:global_uniform_buffer offset:0 atIndex:1];
    [renderCommand drawIndexedPrimitives:MTLPrimitiveTypeTriangle
                              indexCount:per_object_index_count
                               indexType:MTLIndexTypeUInt32
                             indexBuffer:indicies
                       indexBufferOffset:0];
    [renderCommand endEncoding];
    [mtlCommandBuffer presentDrawable:frameDrawable];
    [mtlCommandBuffer commit];
}
The above code draws as expected, but is EXTREMELY slow. I'm guessing that's because there is a better way to trigger pixel shader evaluation than creating an MTLCommandBuffer per object.
I've considered simply allocating a buffer much larger than is needed for a single shader pass and using offsets to queue up several calls in one render command encoder, then executing them together. This method seems pretty unorthodox, and I want to make sure I'm solving the issue of needing to send custom data per object in a Metal-friendly way.
What is the fastest way to render using multiple passes of the same pixel/vertex shader with per-call custom uniform data?

Don't reuse the same uniform buffer for every object. Doing that destroys all parallelism between the CPU and GPU and causes regular sync points.
Instead, make a separate uniform buffer for each object you are going to render in the frame. In fact you should really create 2 per object and alternate between them each frame so that the GPU can be rendering the last frame whilst you are preparing the next frame on the CPU.
After you do that, refactor your loop so that the command buffer and render command encoder are created once per frame. The loop body should consist only of copying the uniform data, setting the vertex buffers, and issuing the draw call, as in the sketch below.
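A minimal Swift sketch of that structure (the same calls exist in Objective-C). The `objects` array, the `Uniforms` type, and the pre-allocated `uniformBuffers` (one per object; two per object if you double-buffer frames as suggested above) are assumptions for illustration:

// One command buffer and one render command encoder for the whole frame.
let commandBuffer = commandQueue.makeCommandBuffer()!
let encoder = commandBuffer.makeRenderCommandEncoder(descriptor: renderPassDescriptor)!

for (i, object) in objects.enumerated() {
    // Each object writes into its own uniform buffer, so no buffer is
    // overwritten while the GPU may still need its previous contents.
    var uniforms = object.uniforms
    memcpy(uniformBuffers[i].contents(), &uniforms, MemoryLayout<Uniforms>.size)

    encoder.setVertexBuffer(object.vertexBuffer, offset: 0, index: 0)
    encoder.setVertexBuffer(uniformBuffers[i], offset: 0, index: 1)
    encoder.drawIndexedPrimitives(type: .triangle,
                                  indexCount: object.indexCount,
                                  indexType: .uint32,
                                  indexBuffer: object.indexBuffer,
                                  indexBufferOffset: 0)
}

encoder.endEncoding()
commandBuffer.present(frameDrawable)
commandBuffer.commit()

The single-large-buffer idea from the question works just as well: allocate one buffer of alignedUniformSize * objectCount bytes and pass offset: i * alignedUniformSize when binding it. Either way, the key points are one command buffer per frame and no buffer that is rewritten mid-frame.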

Related

Using Metal discard_fragment() to discard individual samples in an MSAA attachment

For an MSAA attachment, the following simple Metal fragment shader is meant to be run in multiple render passes, once per sample, to fill the stencil attachment with a different reference value per sample. It does not work as expected, and effectively fills all stencil pixel samples with the reference value on each render pass.
struct _10
{
    int _m0;
};

struct main0_out
{
    float gl_FragDepth [[depth(any)]];
};

fragment main0_out main0(constant _10& _12 [[buffer(0)]], uint gl_SampleID [[sample_id]])
{
    main0_out out = {};
    if (gl_SampleID != _12._m0)
    {
        discard_fragment();
    }
    out.gl_FragDepth = 0.5;
    return out;
}
The problem seems to be using discard_fragment() on a per-sample basis. The intended operation of discarding one sample but writing another does not occur. Instead, the sample is never discarded, regardless of the comparison value passed in the buffer.
In fact, from what I can tell from GPU capture shader tracing results, it appears that the entire if-discard clause is optimized away by the Metal compiler. My guess is that Metal probably recognizes the disconnect between per-sample invocations and discard_fragment(), and removes it, but I can't be sure.
I can't find any Metal documentation on discard_fragment() and its use with MSAA, so I can't tell whether discard_fragment() is supposed to work with individual sample invocations in that environment, or whether it can only discard the entire fragment (which admittedly the function name implies, but what does that mean for per-sample invocations?).
Does the logic and intention of this shader make sense? Is discard_fragment() supposed to work with individual sample invocations? And why would the Metal compiler possibly be removing the discard operation from my shader?

What is the correct sequence for uploading a uniform block?

The example page at https://www.lighthouse3d.com/tutorials/glsl-tutorial/uniform-blocks/ has this:
uniformBlockBinding()
bindBuffer()
bufferData()
bindBufferBase()
But conceptually, wouldn't this be more correct?
bindBuffer()
bufferData()
uniformBlockBinding()
bindBufferBase()
The idea being that uploading to a buffer (bindBuffer+bufferData) should be agnostic about what the buffer will be used for - and then, separately, uniformBlockBinding()+bindBufferBase() would be used to update those uniforms, per shader, when the relevant buffer has changed?
Adding an answer since the accepted answer has lots of info irrelevant to WebGL2.
At init time you call uniformBlockBinding. For the given program, it sets which uniform buffer bind point that program will pull a particular uniform block's data from.
At render time you call bindBufferRange or bindBufferBase to bind a specific buffer to a specific uniform buffer bind point.
If you also need to upload new data to that buffer, you can then call bufferData.
In pseudo code:
// at init time
for each uniform block
    gl.uniformBlockBinding(program, indexOfBlock, indexOfBindPoint)

// at render time
for each uniform block
    gl.bindBufferRange(gl.UNIFORM_BUFFER, indexOfBindPoint, buffer, offset, size)
    if (need to update data in buffer)
        gl.bufferData/gl.bufferSubData(gl.UNIFORM_BUFFER, data, ...)
Note that there is no "correct" sequence. How you update your buffers is really up to you. Since you might store data for multiple uniform blocks in a single buffer at different offsets, calling gl.bufferData/gl.bufferSubData as above is not really "correct" either; it's just one way of hundreds.
WebGL2 (GLSL ES 3.00) does not support the layout(binding = x) mentioned in the accepted answer. There is also no such thing as glGenBuffers in WebGL2.
Neither is "more correct" than the other; both work. But if you're talking about separation of concerns, the first one better emphasizes correct separation.
glUniformBlockBinding modifies the program; it doesn't affect the nature of the buffer object or context buffer state. Indeed, by all rights, that call shouldn't even be in the same function; it's part of program object setup. In a modern GL tutorial, they would use layout(binding=X) to set the binding, so the function wouldn't even appear. For older code, it should be set to a known, constant value after creating the program and then left alone.
So calling the function between allocating storage for the buffer and binding it to an indexed bind point for use creates the impression that they should be calling glUniformBlockBinding every frame, which is the wrong impression.
And speaking of wrong impressions, glBindBufferBase shouldn't even be called there. The rest of that code is buffer setup code; it should only be done once, at the beginning of the application. glBindBufferBase should be called as part of the rendering process, not the setup process. In a good application, that call shouldn't be anywhere near the glGenBuffers call.

MTLBuffer allocation + CPU/GPU synchronisation

I am using a Metal Performance Shaders kernel (MPSImageHistogram) to compute something in an MTLBuffer that I grab, perform computations on, and then display via MTKView. The MTLBuffer output from the shader is small (~4K bytes). So I am allocating a new MTLBuffer object for every render pass, and there are at least 30 renders per second (one for every video frame).
calculation = MPSImageHistogram(device: device, histogramInfo: &histogramInfo)
let bufferLength = calculation.histogramSize(forSourceFormat: MTLPixelFormat.bgra8Unorm)
let buffer = device.makeBuffer(length: bufferLength, options: .storageModeShared)
let commandBuffer = commandQueue?.makeCommandBuffer()
calculation.encode(to: commandBuffer!, sourceTexture: metalTexture!, histogram: buffer!, histogramOffset: 0)
// Completed handlers must be registered before commit.
commandBuffer?.addCompletedHandler({ (cmdBuffer) in
    let dataPtr = buffer!.contents().assumingMemoryBound(to: UInt32.self)
    ...
    ...
})
commandBuffer?.commit()
My questions:
Is it okay to make a new buffer every time using device.makeBuffer(...), or is it better to statically allocate a few buffers and reuse them? If reuse is better, what do we do to synchronize CPU/GPU writes and reads on these buffers?
Another, unrelated question: is it okay to draw the results in MTKView on a non-main thread? Or must MTKView draws happen only on the main thread (even though I read that Metal is truly multithreaded)?
Allocations are somewhat expensive, so I'd recommend a reusable buffer scheme. My preferred way to do this is to keep a mutable array (queue) of buffers, enqueuing a buffer when the command buffer that used it completes (or in your case, after you've read back the results on the CPU), and allocating a new buffer when the queue is empty and you need to encode more work. In the steady state, you'll find that this scheme will rarely allocate more than 2-3 buffers total, assuming your frames are completing in a timely fashion. If you need this scheme to be thread-safe, you can protect access to the queue with a mutex (implemented with a dispatch_semaphore).
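A minimal Swift sketch of that scheme (the class and names are illustrative, assuming shared-storage buffers of one fixed length; the semaphore plays the role of the mutex mentioned above):

import Metal
import Dispatch

final class BufferPool {
    private let device: MTLDevice
    private let length: Int
    private var free: [MTLBuffer] = []
    private let lock = DispatchSemaphore(value: 1) // mutex protecting `free`

    init(device: MTLDevice, length: Int) {
        self.device = device
        self.length = length
    }

    // Reuse an idle buffer if one is available; otherwise allocate a new one.
    func dequeue() -> MTLBuffer {
        lock.wait(); defer { lock.signal() }
        return free.popLast() ?? device.makeBuffer(length: length, options: .storageModeShared)!
    }

    // Return a buffer once the GPU is done with it and the CPU has read the results.
    func enqueue(_ buffer: MTLBuffer) {
        lock.wait(); defer { lock.signal() }
        free.append(buffer)
    }
}

Call pool.dequeue() before encoding each pass, and pool.enqueue(buffer) from the command buffer's completed handler after reading the histogram back on the CPU.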
You can use another thread to encode rendering work that draws into a drawable vended by an MTKView, as long as you follow standard multithreading precautions. Remember that while command queues are thread-safe (in the sense that you can create and encode to multiple command buffers from the same queue concurrently), command buffers themselves and encoders are not. I'd advise you to profile the single-threaded case and only introduce the complication of multi-threading if/when absolutely necessary.
If it is a small amount of data (under 4K) you can use setBytes(): https://developer.apple.com/documentation/metal/mtlcomputecommandencoder/1443159-setbytes
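For CPU-to-GPU constants the call looks roughly like this (a sketch; FilterParams and the buffer index are made up for illustration):

// setBytes copies the data into the command stream, so no MTLBuffer
// management or CPU/GPU synchronisation is needed for small inputs.
var params = FilterParams(threshold: 0.5) // hypothetical struct
encoder.setBytes(&params, length: MemoryLayout<FilterParams>.stride, index: 0)

Note that setBytes only covers data flowing from the CPU to the GPU; the histogram output in the question still needs a real MTLBuffer.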
That might be faster/better than allocating a new buffer every frame. You could also use a triple-buffered approach so that successive frames' access to the buffer do not interfere. https://developer.apple.com/library/content/documentation/3DDrawing/Conceptual/MTLBestPracticesGuide/TripleBuffering.html
This tutorial shows how to set up triple buffering for rendering: https://www.raywenderlich.com/146418/metal-tutorial-swift-3-part-3-adding-texture
That's actually the third part of the tutorial, but it is the part that shows the triple-buffering setup, under "Reusing Uniform Buffers".
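The core of the triple-buffering pattern is small enough to sketch here (illustrative names; device, commandQueue and uniformsLength are assumed context, and a semaphore with an initial count of 3 keeps the CPU at most three frames ahead of the GPU):

let maxFramesInFlight = 3
let frameSemaphore = DispatchSemaphore(value: maxFramesInFlight)
// One uniform buffer per in-flight frame, allocated once at startup.
let uniformBuffers: [MTLBuffer] = (0..<maxFramesInFlight).map { _ in
    device.makeBuffer(length: uniformsLength, options: .storageModeShared)!
}
var frameIndex = 0

func draw() {
    frameSemaphore.wait()                      // blocks if 3 frames are already queued
    let uniformBuffer = uniformBuffers[frameIndex]
    frameIndex = (frameIndex + 1) % maxFramesInFlight
    // ... write this frame's uniforms into uniformBuffer ...

    let commandBuffer = commandQueue.makeCommandBuffer()!
    commandBuffer.addCompletedHandler { _ in
        frameSemaphore.signal()                // GPU finished with this slot
    }
    // ... encode the render pass using uniformBuffer, then present ...
    commandBuffer.commit()
}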

Dynamic output from compute shader

If I am generating 0-12 triangles in a compute shader, is there a way I can stream them to a buffer that will then be used for rendering to screen?
My current strategy is:
create a buffer of float3 of size threads * 12, so it can store the maximum possible number of triangles;
write to the buffer using an index that depends on the thread position in the grid, so there are no race conditions.
If I want to render from this, though, I would need to skip the empty memory. It sounds ugly, but probably there is no other way currently. I know geometry shaders on desktop APIs can have variable-length output, but I wonder if/how games on iOS generate variable-length data on the GPU.
UPDATE 1:
As soon as I wrote the question, I thought about the possibility of using a second buffer that would point out how many triangles are available for each block. The vertex shader would then process all vertices of all triangles of that block.
This will not solve the problem of the unused memory, though, and as I have a large number of threads, the total memory wasted would be considerable.
What you're looking for is the Metal equivalent of D3D's "AppendStructuredBuffer". You want a type that can have structures added to it atomically.
I'm not familiar with Metal, but it does support atomic operations such as add, which is all you really need to roll your own append buffer. Initialise the counter to 0, have each thread atomically add 1 to the counter, and use the returned original value as the index to write to in your buffer.
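A sketch of that idea applied to the triangle question above, with the kernel written in Metal Shading Language inside a Swift string so it can be compiled at runtime (the kernel body, buffer indices, and triangle-writing logic are illustrative placeholders):

import Metal

let source = """
#include <metal_stdlib>
using namespace metal;

// Append buffer: a device-memory counter reserves compact slots atomically.
kernel void appendTriangles(device float3      *vertices [[buffer(0)]],
                            device atomic_uint *counter  [[buffer(1)]],
                            uint tid [[thread_position_in_grid]])
{
    uint triangleCount = 1; // placeholder: each thread decides 0-12 here
    for (uint t = 0; t < triangleCount; ++t) {
        // The old counter value is this triangle's index; no gaps, no races.
        uint slot = atomic_fetch_add_explicit(counter, 1u, memory_order_relaxed);
        vertices[slot * 3 + 0] = float3(0.0); // placeholder vertex data
        vertices[slot * 3 + 1] = float3(0.0);
        vertices[slot * 3 + 2] = float3(0.0);
    }
}
"""

let device = MTLCreateSystemDefaultDevice()!
let library = try! device.makeLibrary(source: source, options: nil)
let pipeline = try! device.makeComputePipelineState(function: library.makeFunction(name: "appendTriangles")!)

Because the counter ends up holding exactly how many triangles were written, the render pass never has to skip empty memory: read the counter back (or copy it into an indirect-draw argument buffer) and draw 3 * triangleCount vertices.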

best approach to constructing an OpenGL ES 2.0 shader-based dynamic chain of filters

I'm on iOS 6 (7 too, if that makes any difference) and OpenGL ES 2.0.
The idea is for a CAEAGLLayer to have a dynamic chain of shader-based filters that processes its contents property and displays the final result. Filters can be added / removed at any point in the chain.
So far I've come up with an implementation, but I'm wondering if there are better ways to go about it. My implementation roughly works like this:
A base filter class from which concrete filters inherit, creating a shader program (vertex / fragment combo) for whatever filter / imaging they implement.
A CAEAGLLayer subclass which implements the filter chain and to which filters are added. The high-level processing algorithm is:
// 1 - Assume whenever the layer's content property is changed to an image, a copy of the image gets stored in a sourceImage property.
// 2 - Assume changing the content property or adding / removing an image unit triggers this algorithm.
// 3 - Assume the whole filter chain basically processes a quad with position and texture coordinates thru a VBO.
// 4 - Assume all shader programs (by shader program I mean a vertex and fragment shader pair in a single program) have access to texture unit 0.
// 5 - Assume P shader programs.
load imageSource into a texture object bound to GL_TEXTURE_2D on texture unit GL_TEXTURE0
attach the bound texture object to GL_FRAMEBUFFER's GL_COLOR_ATTACHMENT0 (so we are doing render-to-texture, which will be accessible to fragment shaders)
for p = program identifier 0 up to P - 2:
    glUseProgram(p)
    glDrawArrays()
attach GL_RENDERBUFFER to GL_FRAMEBUFFER's GL_COLOR_ATTACHMENT0 (the GL_RENDERBUFFER in turn has its storage set to the layer itself)
p = program identifier P - 1 (the last program in the chain)
glUseProgram(p)
glDrawArrays()
present GL_RENDERBUFFER onscreen
This approach seems to work so far, but there's a number of things I'm wondering about:
Best way to implement adding / removing of filters:
Adding and removing programs seems the most logical approach right now. However, this means one program per filter and switching between all of them at render time. I wonder how these other approaches would compare:
Attaching / detaching shader-pairs and re-linking a single composite program, instead of adding / removing programs. The OpenGL ES 2.0 Programming Guide says you cannot do it. However, since desktop GL allows for multiple shader objects in one program, I'm anyway curious if it would be a better approach if ES supported it.
Keeping the filters in text format (their code within a function other than main) and instead compile them all into a monolithic shader pair (with an added main of course) each time a filter is added / removed.
Best way to implement per-filter caching:
Right now, adding / removing any number of filters at any point in the chain requires running all programs again to render the final image. It'd be nice, however, if I could somehow cache the output of each filter. That way, removing, adding or bypassing a filter would only require re-running the filters past the point of insertion / deletion / bypassing in the chain. I can think of a naive approach: on each program pass, bind a different texture object to GL_TEXTURE0 and to the GL_COLOR_ATTACHMENT0 of the framebuffer. That way I can keep the output of every filter around. However, creating a new texture, binding it and changing the framebuffer attachment once per filter seems inefficient.
I don't have much to say about the filter output caching problem, but as for filter switching... The EXT_separate_shader_objects extension is designed to solve this very problem, and it's supported on every device that runs iOS 5.0 or later. Here's a brief overview:
There's a new convenience API for compiling shader programs that also takes care of making them "separable":
_vertexProgram = glCreateShaderProgramvEXT(GL_VERTEX_SHADER, 1, &source);
Program Pipeline Objects manage your program state and let you mix and match already-compiled shaders:
GLuint _ppo;
glGenProgramPipelinesEXT(1, &_ppo);
glBindProgramPipelineEXT(_ppo);
glUseProgramStagesEXT(_ppo, GL_VERTEX_SHADER_BIT_EXT, _vertexProgram);
glUseProgramStagesEXT(_ppo, GL_FRAGMENT_SHADER_BIT_EXT, _fragmentProgram);
Mixing and matching shaders can make attribute binding a pain, so you can specify that in the shader (likewise for varyings):
#extension GL_EXT_separate_shader_objects : enable
layout(location = 0) attribute vec4 position;
layout(location = 1) attribute vec3 normal;
Uniforms are set for the shader program they belong to:
glProgramUniformMatrix3fvEXT(_vertexProgram, u_normalMatrix, 1, 0, _normalMatrix.m);
