Multiple models in Metal. How?

This is an absolute beginner question.
Background: I’m not really a game developer, but I’m trying to learn the basics of low-level 3D programming, because it’s a fun and interesting topic. I’ve picked Apple’s Metal as the graphics framework. I know about SceneKit and other higher level frameworks, but I’m intentionally trying to learn the low level bits. Unfortunately I’m way out of my depth, and there seems to be very little beginner-oriented Metal resources on the web.
By reading the Apple documentation and following the tutorials I could find, I’ve managed to implement a simple vertex shader and a fragment shader and draw an actual 3D model on the screen. Now I’m trying to draw a second model, but I’m kind of stuck, because I’m not sure what the best way to go about it is.
Do I…
Use a single vertex buffer and index buffer for all of my models, and tell the MTLRenderCommandEncoder the offsets when rendering the individual models?
Have a separate vertex buffer / index buffer for each model? Would such an approach scale?
Something else?
TL;DR: What is the recommended way to store the vertex data of multiple models in Metal (or any other 3D framework)?

There is no one recommended way. When you're working at such a low level as Metal, there are many possibilities, and the one you pick depends heavily on the situation and what performance characteristics you want/need to optimize for. If you're just playing around with intro projects, most of these decisions are irrelevant, because the performance issues won't bite until you scale up to a "real" project.
Typically, game engines use one buffer (or set of vertex/index buffers) per model, especially if each model requires different render states (e.g. shaders, bound textures). This means that when new models are introduced to the scene, or old ones are no longer needed, the requisite resources can be loaded into / removed from GPU memory (by way of creating / destroying MTL objects).
The main use case for doing multiple draws out of (different parts of) the same buffer is when you're mutating the buffer. For example, on frame n you're using the first 1KB of a buffer to draw with, while at the same time you're computing / streaming in new vertex data and writing it to the second 1KB of the buffer... then for frame n + 1 you switch which parts of the buffer are being used for what.
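To make the two options concrete, here is a minimal sketch. The model structs, byte offsets, and index counts are hypothetical bookkeeping, not Metal API; recent SDKs spell the buffer-slot label index:, older ones at:.
// One buffer pair per model: bind and draw each model separately.
for model in [modelA, modelB] {
    encoder.setVertexBuffer(model.vertexBuffer, offset: 0, index: 0)
    encoder.drawIndexedPrimitives(type: .triangle,
                                  indexCount: model.indexCount,
                                  indexType: .uint16,
                                  indexBuffer: model.indexBuffer,
                                  indexBufferOffset: 0)
}

// Shared buffers: the same calls, but the offsets select each model's slice.
encoder.setVertexBuffer(sharedVertexBuffer, offset: modelB.vertexByteOffset, index: 0)
encoder.drawIndexedPrimitives(type: .triangle,
                              indexCount: modelB.indexCount,
                              indexType: .uint16,
                              indexBuffer: sharedIndexBuffer,
                              indexBufferOffset: modelB.indexByteOffset)
Either way the draw calls look alike; the difference is only where the bytes live and how you manage their lifetime.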

To add a bit to rickster's answer, I would encapsulate your model in a class that contains one buffer (or two, if you count the index buffer) per model, and pass an optional parameter with the number of instances of that model you want to create.
Then, keep an additional buffer where you store whatever variations you want to introduce per instance. Usually, it's just the transform and a different material. For instance,
struct PerInstanceUniforms {
    var transform : Transform
    var material : Material
}
In my case, the material contains a UV transform, but the texture has to be the same for all the instances.
Then your model class would look something like this,
class Model {
    fileprivate var indexBuffer : MTLBuffer!
    fileprivate var vertexBuffer : MTLBuffer!
    var perInstanceUniforms : [PerInstanceUniforms]
    let uniformBuffer : MTLBuffer!
    // ... constructors, numIndices, numInstances, etc.

    func draw(_ encoder: MTLRenderCommandEncoder) {
        encoder.setVertexBuffer(vertexBuffer, offset: 0, at: 0)
        RenderManager.sharedInstance.setUniformBuffer(encoder, atIndex: 1)
        encoder.setVertexBuffer(self.uniformBuffer, offset: 0, at: 2)
        encoder.drawIndexedPrimitives(type: .triangle, indexCount: numIndices, indexType: .uint16, indexBuffer: indexBuffer, indexBufferOffset: 0, instanceCount: self.numInstances)
    }

    // this gets called when we need to update the buffers used by the GPU
    func updateBuffers(_ syncBufferIndex: Int) {
        let uniformB = uniformBuffer.contents()
        let uniformData = uniformB.advanced(by: MemoryLayout<PerInstanceUniforms>.size * perInstanceUniforms.count * syncBufferIndex).assumingMemoryBound(to: Float.self)
        memcpy(uniformData, &perInstanceUniforms, MemoryLayout<PerInstanceUniforms>.size * perInstanceUniforms.count)
    }
}
Your vertex shader with instances will look something like this,
vertex VertexInOut passGeometry(uint vid [[ vertex_id ]],
                                uint iid [[ instance_id ]],
                                constant TexturedVertex* vdata [[ buffer(0) ]],
                                constant Uniforms& uniforms [[ buffer(1) ]],
                                constant Transform* perInstanceUniforms [[ buffer(2) ]])
{
    VertexInOut outVertex;
    Transform t = perInstanceUniforms[iid];
    float4x4 m = uniforms.projectionMatrix * uniforms.viewMatrix;
    TexturedVertex v = vdata[vid];
    outVertex.position = m * float4(t * v.position, 1.0);
    outVertex.uv = float2(0,0);
    outVertex.color = float4(0.5 * v.normal + 0.5, 1);
    return outVertex;
}
Here's an example I wrote of using instancing, with a performance analysis: http://tech.metail.com/performance-quaternions-gpu/
You can find the full code for reference here: https://github.com/endavid/VidEngine
I hope that helps.

Related

Running compute kernel on portion of MTLBuffer?

I am populating an MTLBuffer with float2 vectors. The buffer is being created and populated like this:
struct Particle {
    var position: float2
    ...
}
let particleCount = 100000
let bufferSize = MemoryLayout<Particle>.stride * particleCount
particleBuffer = device.makeBuffer(length: bufferSize)!
var pointer = particleBuffer.contents().bindMemory(to: Particle.self, capacity: particleCount)
pointer = pointer.advanced(by: currentParticles)
pointer.pointee.position = [x, y]
In my Metal file the buffer is being accessed like this:
struct Particle {
    float2 position;
    ...
};

kernel void compute(device Particle *particles [[buffer(0)]],
                    uint id [[thread_position_in_grid]] … )
I need to be able to compute a given range of the MTLBuffer. For example, is it possible to run the compute kernel on say starting from the 50,000 value and ending at 75,000 value?
It seems like the offset parameter allows me to specify the start position, but it does not have a length parameter.
I see there is this call:
setBuffers(_:offsets:range:)
Does the range specify what portion of the buffer to run? It seems like the range specifies what buffers are used and not the range of values to use.
A compute shader doesn't run "on" a buffer (or portion). It runs on a grid, which is an abstract concept. As far as Metal is concerned, the grid isn't related to a buffer or anything else.
A buffer may be an input that your compute shader uses, but how it uses it is up to you. Metal doesn't know or care.
Here's my answer to a similar question.
So, the dispatch command you encode using a compute command encoder governs how many times your shader is invoked. It also dictates what thread_position_in_grid (and related) values each invocation receives. If your shader correlates each grid position to an element of an array backed by a buffer, then the number of threads you specify in your dispatch command governs how much of the buffer you end up accessing. (Again, this is not something Metal dictates; it's implicit in the way you code your shader.)
Now, to start at the 50,000th element, using an offset on the buffer binding to make that the effective start of the buffer from the point of view of the shader is a good approach. But it would also work to just add 50,000 to the index the shader computes when it accesses the buffer. And, if you only want to process 25,000 elements (75,000 minus 50,000), then just dispatch 25,000 threads.
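A minimal sketch of that approach, using the buffer and struct from the question (the encoder and pipeline state names are placeholders, and the 50,000 / 25,000 split is just the example range):
let startIndex = 50_000
let count = 25_000   // 75,000 - 50,000

// Bind the buffer with a byte offset so element 50,000 shows up as index 0 in the kernel.
computeEncoder.setBuffer(particleBuffer,
                         offset: MemoryLayout<Particle>.stride * startIndex,
                         index: 0)

// Dispatch only enough threads to cover 25,000 elements; the grid size, not the
// buffer size, decides how many elements actually get processed.
let w = pipelineState.threadExecutionWidth
let threadsPerThreadgroup = MTLSize(width: w, height: 1, depth: 1)
let threadgroups = MTLSize(width: (count + w - 1) / w, height: 1, depth: 1)
computeEncoder.dispatchThreadgroups(threadgroups, threadsPerThreadgroup: threadsPerThreadgroup)
// Because the threadgroup count is rounded up, the kernel should still bail out
// when id >= 25,000, e.g. by passing the count in a small uniform buffer.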

How to use custom compute shaders with Metal and get very smooth performance?

I'm trying to apply live camera filters through Metal using the default MPSKernel filters provided by Apple and custom compute shaders.
In the compute shader I did the in-place encoding with MPSImageGaussianBlur,
and the code goes like this:
func encode(to commandBuffer: MTLCommandBuffer, sourceTexture: MTLTexture, destinationTexture: MTLTexture, cropRect: MTLRegion = MTLRegion.init(), offset : CGPoint) {
    let blur = MPSImageGaussianBlur(device: device, sigma: 0)
    blur.clipRect = cropRect
    blur.offset = MPSOffset(x: Int(offset.x), y: Int(offset.y), z: 0)
    let threadsPerThreadgroup = MTLSizeMake(4, 4, 1)
    let threadgroupsPerGrid = MTLSizeMake(sourceTexture.width / threadsPerThreadgroup.width, sourceTexture.height / threadsPerThreadgroup.height, 1)
    let commandEncoder = commandBuffer.makeComputeCommandEncoder()
    commandEncoder.setComputePipelineState(pipelineState!)
    commandEncoder.setTexture(sourceTexture, at: 0)
    commandEncoder.setTexture(destinationTexture, at: 1)
    commandEncoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
    commandEncoder.endEncoding()
    autoreleasepool {
        var inPlaceTexture = destinationTexture
        blur.encode(commandBuffer: commandBuffer, inPlaceTexture: &inPlaceTexture, fallbackCopyAllocator: nil)
    }
}
But sometimes the in-place texture tends to fail, and eventually it creates a jerky effect on the screen.
So if anyone can suggest a solution without using the in-place texture, or how to use the fallbackCopyAllocator, or how to use the compute shaders in a different way, that would be really helpful.
I have done enough coding in this area (applying compute shaders to a video stream from the camera), and the most common problem you run into is the "pixel buffer reuse" issue.
The Metal texture you create from the sample buffer is backed by a pixel buffer, which is managed by the video session and can be re-used for subsequent video frames, unless you retain a reference to the sample buffer (retaining a reference to the Metal texture is not enough).
Feel free to take a look at my code at https://github.com/snakajima/vs-metal, which applies various compute shaders to a live video stream.
The VSContext:set() method takes an optional sampleBuffer parameter in addition to the texture parameter, and retains the reference to the sampleBuffer until the compute shader's computation is completed (in the VSRuntime:encode() method).
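In Metal terms the fix boils down to tying the sample buffer's lifetime to the command buffer. A minimal sketch, assuming a hypothetical inFlightSampleBuffer property on whatever class does the encoding:
// Keep the CMSampleBuffer (and therefore its pixel buffer) alive while the GPU
// still reads the texture that was created from it.
self.inFlightSampleBuffer = sampleBuffer

commandBuffer.addCompletedHandler { [weak self] _ in
    // The GPU is done with this frame; the capture session may now reuse the pixel buffer.
    self?.inFlightSampleBuffer = nil
}
commandBuffer.commit()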
The in-place operation method can be hit or miss depending on what the underlying filter is doing. If it is a single-pass filter for some parameters, then you'll end up running out of place for those cases.
Since that method was added, MPS has added an underlying MTLHeap to manage memory a bit more transparently for you. If your MPSImage doesn't need to be viewed by the CPU and exists only for a short period of time on the GPU, it is recommended that you just use an MPSTemporaryImage instead. When the readCount hits 0 on that, the backing store will be recycled through the MPS heap and made available for other MPSTemporaryImages and other temporary resources used downstream. Likewise, the backing store for it isn't actually allocated from the heap until absolutely necessary (e.g. the texture is written to, or .texture is called). A separate heap is allocated for each command buffer.
Using temporary images should help reduce memory usage quite a lot. For example, in an Inception v3 neural network graph, which has over a hundred passes, the heap was able to automatically reduce the graph to just four allocations.
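A rough sketch of the MPSTemporaryImage route (the pixel format and usage flags here are placeholder choices; the key points are allocating from the command buffer and setting readCount):
let desc = MTLTextureDescriptor.texture2DDescriptor(pixelFormat: .bgra8Unorm,
                                                    width: sourceTexture.width,
                                                    height: sourceTexture.height,
                                                    mipmapped: false)
desc.usage = [.shaderRead, .shaderWrite]

// Backing store comes from the MPS heap tied to this command buffer; nothing is
// actually allocated until the texture is first used.
let scratch = MPSTemporaryImage(commandBuffer: commandBuffer, textureDescriptor: desc)
scratch.readCount = 1   // we will read it exactly once downstream

blur.encode(commandBuffer: commandBuffer,
            sourceTexture: sourceTexture,
            destinationTexture: scratch.texture)
// The next pass that reads scratch.texture drops readCount to 0, after which
// the heap can recycle the memory for other temporary resources in this command buffer.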

Memory write performance - GPU CPU Shared Memory

I'm allocating both input and output MTLBuffers using posix_memalign according to the shared GPU/CPU documentation provided by memkite.
Aside: it is easier to just use the latest API than muck around with posix_memalign:
let metalBuffer = self.metalDevice.newBufferWithLength(byteCount, options: .StorageModeShared)
My kernel function operates on roughly 16 million complex value structs and writes out an equal number of complex value structs to memory.
I've performed some experiments and my Metal kernel's 'complex math section' executes in 0.003 seconds (Yes!), but writing the result to the buffer takes >0.05 seconds (No!). In my experiment I commented out the math part and just assigned zero to memory, and it took 0.05 seconds; commenting out the assignment and adding the math back, 0.003 seconds.
Is the shared memory slow in this case, or is there some other tip or trick I might try?
Additional detail
Test platforms
iPhone 6S - ~0.039 seconds per frame
iPad Air 2 - ~0.130 seconds per frame
The streaming data
Each update to the shader receives approximately 50000 complex numbers in the form of a pair of float types in a struct.
struct ComplexNumber {
    float real;
    float imaginary;
};
Kernel signature
kernel void processChannelData(const device Parameters *parameters [[ buffer(0) ]],
                               const device ComplexNumber *inputSampleData [[ buffer(1) ]],
                               const device ComplexNumber *partAs [[ buffer(2) ]],
                               const device float *partBs [[ buffer(3) ]],
                               const device int *lookups [[ buffer(4) ]],
                               device float *outputImageData [[ buffer(5) ]],
                               uint threadIdentifier [[ thread_position_in_grid ]]);
All the buffers contain - currently - unchanging data except inputSampleData which receives the 50000 samples I'll be operating on. The other buffers contain roughly 16 million values (128 channels x 130000 pixels) each. I perform some operations on each 'pixel' and sum the complex result across channels and finally take the absolute value of the complex number and assign the resulting float to outputImageData.
Dispatch
commandEncoder.setComputePipelineState(pipelineState)
commandEncoder.setBuffer(parametersMetalBuffer, offset: 0, atIndex: 0)
commandEncoder.setBuffer(inputSampleDataMetalBuffer, offset: 0, atIndex: 1)
commandEncoder.setBuffer(partAsMetalBuffer, offset: 0, atIndex: 2)
commandEncoder.setBuffer(partBsMetalBuffer, offset: 0, atIndex: 3)
commandEncoder.setBuffer(lookupsMetalBuffer, offset: 0, atIndex: 4)
commandEncoder.setBuffer(outputImageDataMetalBuffer, offset: 0, atIndex: 5)
let threadExecutionWidth = pipelineState.threadExecutionWidth
let threadsPerThreadgroup = MTLSize(width: threadExecutionWidth, height: 1, depth: 1)
let threadGroups = MTLSize(width: self.numberOfPixels / threadsPerThreadgroup.width, height: 1, depth:1)
commandEncoder.dispatchThreadgroups(threadGroups, threadsPerThreadgroup: threadsPerThreadgroup)
commandEncoder.endEncoding()
metalCommandBuffer.commit()
metalCommandBuffer.waitUntilCompleted()
GitHub example
I've written an example called Slow and put it up on GitHub. It seems the bottleneck is the write of the values into the input buffer. So, I guess the question becomes how to avoid the bottleneck?
Memory copy
I wrote up a quick test to compare the performance of various byte copying methods.
Current Status
I've reduced execution time to 0.02ish seconds which doesn't sound like a lot, but it makes a big difference in the number of frames per second. Currently the biggest improvements are a result of switching to cblas_scopy().
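For reference, the cblas_scopy() path looks roughly like this (the samples array name is a placeholder for the per-frame input; Accelerate's BLAS is the only dependency):
import Accelerate

// `samples` stands in for the ~50,000 complex values arriving each frame,
// viewed as a flat [Float] (real, imaginary, real, imaginary, ...).
let dst = inputSampleDataMetalBuffer.contents().assumingMemoryBound(to: Float.self)
samples.withUnsafeBufferPointer { src in
    // cblas_scopy(n, x, incx, y, incy): copy n floats with stride 1 on both sides.
    cblas_scopy(Int32(samples.count), src.baseAddress, 1, dst, 1)
}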
Reduce the size of the type
Originally, I was pre-converting the signed 16-bit integers to 32-bit floats, since ultimately that is how they'll be used. This is a case where performance starts forcing you to store the values as 16 bits to cut your data size in half.
Objective-C over Swift
For the code dealing with movement of data, you might choose Objective-C over Swift (Warren Moore's recommendation). Performance of Swift in these special situations still isn't up to scratch. You can also try calling out to memcpy or similar methods. I've seen a couple of examples that used for-loops over buffer pointers, and in my experiments this performed slowly.
Difficulty of testing
I really wanted to do some of the experiments with relation to various copying methods in a playground on the machine, and unfortunately this was useless. The iOS device versions of the same experiments performed completely differently. One might think that the relative performance would be similar, but I found this to also be an invalid assumption. It would be really convenient if you could have a playground that used the iOS device as the interpreter.
You might get a large speedup by encoding your data with Huffman codes and decoding on the GPU; see MetalHuffman. It depends on your data, though.

How to design a simple GLSL wrapper for shader use

UPDATE: Because I needed something right away, I've created a simple shader wrapper that does the sort of thing I need. You can find it here: ShaderManager on GitHub. Note that it's designed for Objective-C / iOS, so may not be useful to everyone. If you have any suggestions for design improvements, please let me know!
Original Problem:
I'm new to using GLSL shaders. I'm familiar enough with the GLSL language and the OpenGL interface, but I'm having trouble designing a simple API through which to use shaders.
OpenGL's C interface to interact with shaders seems cumbersome. I can't seem to find any tutorials on the net that cover the API design of such things.
My question is this: does any one have a good, simple, API design or pattern to wrap the OpenGL shader program API?
Take the following simple example. Say I have one vertex shader that just emulates fixed functionality, and two fragment shaders - one for drawing smooth rectangles and one for drawing smooth circles. I have the following files:
Shader.vsh : Simple vertex shader, with the following inputs/outputs:
-- Uniforms: mat4 Model, mat4 View, mat4 Projection
-- Attributes: vec4 Vertex, vec2 TexCoord, vec4 Color
-- Varying: vec4 vColor, vec2 vTexCoord
Square.fsh : Fragment shader for drawing squares based on tex coord / color
Circle.fsh : Fragment shader for drawing circles based on tex coord / color
Basic Linking
Now what is the standard way to use these? Do I link the above shaders into two OpenGL shader programs? That is:
Shader.vsh + Square.fsh = SquareProgram
Shader.vsh + Circle.fsh = CircleProgram
Or do I instead create one big program where the fragment shaders check some conditional uniform variables and call out to a shader function to generate their result? E.g.:
Shader.vsh + Square.fsh + Circle.fsh + Main.fsh = ShaderProgram
//Main.fsh here would simply check whether to call out to square or circle
With two individual programs I would presumably need to call
glUseProgram(CircleProgram); or glUseProgram(SquareProgram);
before each type of element I want to draw. I would then need to set the uniforms (Model / View / Projection) and attributes of each program before I use it. This seems so unwieldy.
With the single ShaderProgram option I would still need to set some sort of boolean switch (circle or square) in the fragment shader that would be checked before drawing each pixel. This also seems complicated.
As a side note, am I allowed to link two fragment shaders, each with a main() function, into one shader program? How would OpenGL know which one to call?
Setting Variables
The calls:
glUniform*
glVertexAttribPointer
Are used to set uniforms and attribute pointer locations on the current program.
Different classes and structures may need to access and set variables on the current shader (or change the current shader) from different places in the code. I can't think of a nice way to do this that decouples the shader code from the code that wants to use it.
That is, each shape I want to draw will need to set vertex and texture coordinate attributes - requiring the handles to those attributes generated by OpenGL.
The camera will need to set its projection matrix as a uniform in the vertex shader, while the class managing the model matrix stack will need to set its own uniform in the vertex shader.
Changing shaders part-way through drawing a scene would mean that all these classes will need to set their uniforms and attributes again.
How do most people design around this?
A global dictionary of shaders accessed by handle or name, with getters and setters for their parameters?
An OO design with shader objects that each have parameters?
I've looked at the following wrappers:
Jon's Teapot: GLSL Shader Manager - This wraps shaders in C++ classes. It seems like little more than a wrapper that enforces OO principles on a C API, resulting in a C++ API that is much the same.
I am after any sort of design that simplifies the use of shader programs, and am not concerned about the particular paradigm used (OO, procedural, and so on).
I see this is tagged with iOS, so if you're partial to Objective-C, I'd take a good look at Jeff LaMarche's GLProgram wrapper class, which he describes here and has source available here. I've used it within my own applications to simplify some of the shader program setup, and to make the code a little cleaner.
For example, you can set up a shader and its attributes and uniforms using code like the following:
sphereDepthProgram = [[GLProgram alloc] initWithVertexShaderFilename:@"SphereDepth" fragmentShaderFilename:@"SphereDepth"];
[sphereDepthProgram addAttribute:@"position"];
[sphereDepthProgram addAttribute:@"inputImpostorSpaceCoordinate"];

if (![sphereDepthProgram link])
{
    NSLog(@"Depth shader link failed");
    NSString *progLog = [sphereDepthProgram programLog];
    NSLog(@"Program Log: %@", progLog);
    NSString *fragLog = [sphereDepthProgram fragmentShaderLog];
    NSLog(@"Frag Log: %@", fragLog);
    NSString *vertLog = [sphereDepthProgram vertexShaderLog];
    NSLog(@"Vert Log: %@", vertLog);
    [sphereDepthProgram release];
    sphereDepthProgram = nil;
}

sphereDepthPositionAttribute = [sphereDepthProgram attributeIndex:@"position"];
sphereDepthImpostorSpaceAttribute = [sphereDepthProgram attributeIndex:@"inputImpostorSpaceCoordinate"];
sphereDepthModelViewMatrix = [sphereDepthProgram uniformIndex:@"modelViewProjMatrix"];
sphereDepthRadius = [sphereDepthProgram uniformIndex:@"sphereRadius"];
When you need to use the shader program, you then do something like the following:
[sphereDepthProgram use];
This doesn't address the issues of branching vs. individual shaders that you bring up above, but Jeff's implementation does provide a nice encapsulation of some of the OpenGL ES boilerplate shader setup code.
Basic Linking:
There is no standard way here. There are at least 2 general approaches:
Monolithic - one shader covers many cases, using uniform boolean switches. These branches don't hurt performance because the condition result is constant for any fragment group (actually, for all of the fragments).
Multi-object program compositing - the main shader declares a set of entry points (like 'get_diffuse', 'get_specular', etc.), which are implemented in separate attached shader objects. This implies an individual program for each object, but any kind of caching helps.
Setting Variables: Uniforms
I will just describe the approach I developed.
Each shader program has a list of uniform dictionaries. It's used to fill the uniform source list upon program (re-)linking. When the program is activated, it goes through the uniform list, fetches values from their sources, and uploads them to GL. As a result, the data is not directly connected with the shader program, and whatever manages the data does not care about the program using it.
One of these dictionaries can be, for example, a core one, containing the model and view transformations, the camera projection, and maybe something else.
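As a rough illustration of that dictionary idea (a Swift-flavored sketch, not the kri implementation; uploadUniform stands in for the glUniform* plumbing):
// A uniform "source" pairs a name with a closure that yields the current value,
// so the data owner (camera, transform stack, ...) never touches the program directly.
struct UniformSource {
    let name: String
    let value: () -> [Float]   // e.g. a matrix flattened to 16 floats
}

final class ShaderProgram {
    var uniformSources: [UniformSource] = []

    func activate() {
        // glUseProgram(handle) would go here.
        for source in uniformSources {
            uploadUniform(named: source.name, values: source.value())
        }
    }

    private func uploadUniform(named name: String, values: [Float]) {
        // Look up the uniform location for `name` and call the matching glUniform*.
    }
}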
Setting Variables: Attributes
First of all, the shader program is an attribute consumer, so it is what has to extract these attributes from a mesh (or any other data storage) and upload them to GL in the way it needs. It should also make sure that the types of the provided attributes match the requested types.
When using the monolithic shader approach, there is a possible unpleasant situation in which one of the disabled branches requires a vertex attribute that is not provided. I would advise using another attribute's data to supply the missing one, because we don't care about the actual values in this case.
P.S.
You can find an actual implementation of these ideas here: http://code.google.com/p/kri/

XNA - vertex streams?

Would someone provide/point me to an explanation of or tutorial on using multiple vertex streams in HLSL and XNA? I'm interested in how they're stored/accessed by the GPU, advantages of or uses for streams in custom shaders, etc.
I've seen a few examples on using multiple vertex streams for instanced geometry, but I'm having a hard time wrapping my head around the underlying mechanism.
Update
If I have a vertex shader which accepts two parameters (borrowed from this tutorial)
InstancingVSoutput InstancingVS(InstancingVSinput input, float4x4 instanceTransform : TEXCOORD0, float4 color : TEXCOORD4)
{
    InstancingVSoutput output;
    float4 pos = input.Position;
    pos = mul(pos, transpose(instanceTransform));
    pos = mul(pos, WVP);
    output.Position = pos;
    output.Color = color;
    return output;
}
It seems, from the example I pulled this from, that instanceTransform and input are pulled from separate streams. However, in this case, the input stream is a list of six vertices, and the instanceTransform comes from a stream with a much larger number of elements, consisting of translation matrices. This is supposed to be used for instanced geometry.
I'm confused about how many times this shader gets executed - is it VertexBuffer0.VertexCount*VertexBuffer1.VertexCount? The problem with this kind of thing is that, once someone's figured it out, they don't bother contributing a well-written document back to the community detailing their discovery.
Argh.
Since no one else has chimed in yet, I'll give it a go :-) This is a great thread on the App Hub forums about vertex streams:
http://forums.create.msdn.com/forums/p/46229/276901.aspx
From one of the answers:
The gist is this: different streams can have different data layouts, and your VertexDeclaration determines what data gets pulled from what stream. So, for instance, you could have one buffer that stores all your positions and one buffer which stores all your colors, and you could set those to different streams; alternatively you could munge them into a single stream, but this isn't always convenient.
Hope it helps ;-)
