Memory write performance - GPU/CPU shared memory - iOS

I'm allocating both the input and output MTLBuffers using posix_memalign, following the shared GPU/CPU memory documentation provided by memkite.
Aside: it is easier to just use the latest API than to muck around with posix_memalign:
let metalBuffer = self.metalDevice.newBufferWithLength(byteCount, options: .StorageModeShared)
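In current Swift the same allocation is spelled with makeBuffer(length:options:), which returns an optional (metalDevice and byteCount as above):
let metalBuffer = self.metalDevice.makeBuffer(length: byteCount, options: [.storageModeShared])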
My kernel function operates on roughly 16 million complex value structs and writes out an equal number of complex value structs to memory.
I've performed some experiments: my Metal kernel's 'complex math section' executes in 0.003 seconds (Yes!), but writing the result out to the buffer takes >0.05 seconds (No!). When I commented out the math part and just assigned zero to memory, it took 0.05 seconds; commenting out the assignment and adding the math back took 0.003 seconds.
Is the shared memory slow in this case, or is there some other tip or trick I might try?
Additional detail
Test platforms
iPhone 6S - ~0.039 seconds per frame
iPad Air 2 - ~0.130 seconds per frame
The streaming data
Each update to the shader receives approximately 50000 complex numbers in the form of a pair of float types in a struct.
struct ComplexNumber {
    float real;
    float imaginary;
};
Kernel signature
kernel void processChannelData(const device Parameters *parameters [[ buffer(0) ]],
                               const device ComplexNumber *inputSampleData [[ buffer(1) ]],
                               const device ComplexNumber *partAs [[ buffer(2) ]],
                               const device float *partBs [[ buffer(3) ]],
                               const device int *lookups [[ buffer(4) ]],
                               device float *outputImageData [[ buffer(5) ]],
                               uint threadIdentifier [[ thread_position_in_grid ]]);
All the buffers contain - currently - unchanging data except inputSampleData which receives the 50000 samples I'll be operating on. The other buffers contain roughly 16 million values (128 channels x 130000 pixels) each. I perform some operations on each 'pixel' and sum the complex result across channels and finally take the absolute value of the complex number and assign the resulting float to outputImageData.
Dispatch
commandEncoder.setComputePipelineState(pipelineState)
commandEncoder.setBuffer(parametersMetalBuffer, offset: 0, atIndex: 0)
commandEncoder.setBuffer(inputSampleDataMetalBuffer, offset: 0, atIndex: 1)
commandEncoder.setBuffer(partAsMetalBuffer, offset: 0, atIndex: 2)
commandEncoder.setBuffer(partBsMetalBuffer, offset: 0, atIndex: 3)
commandEncoder.setBuffer(lookupsMetalBuffer, offset: 0, atIndex: 4)
commandEncoder.setBuffer(outputImageDataMetalBuffer, offset: 0, atIndex: 5)
let threadExecutionWidth = pipelineState.threadExecutionWidth
let threadsPerThreadgroup = MTLSize(width: threadExecutionWidth, height: 1, depth: 1)
let threadGroups = MTLSize(width: self.numberOfPixels / threadsPerThreadgroup.width, height: 1, depth:1)
commandEncoder.dispatchThreadgroups(threadGroups, threadsPerThreadgroup: threadsPerThreadgroup)
commandEncoder.endEncoding()
metalCommandBuffer.commit()
metalCommandBuffer.waitUntilCompleted()
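One thing to double-check in a setup like this (not something the original post calls out): the integer division drops any remainder, so a pixel count that isn't a multiple of threadExecutionWidth leaves the last partial group of pixels undispatched. A ceiling-division variant, assuming the kernel bounds-checks its thread ID against the pixel count, looks like:
// Round up so a non-multiple pixel count still gets a final (partial) threadgroup.
let groupCount = (self.numberOfPixels + threadExecutionWidth - 1) / threadExecutionWidth
let threadGroups = MTLSize(width: groupCount, height: 1, depth: 1)
commandEncoder.dispatchThreadgroups(threadGroups, threadsPerThreadgroup: threadsPerThreadgroup)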
GitHub example
I've written an example called Slow and put it up on GitHub. The bottleneck seems to be the write of the values into the input buffer, so I guess the question becomes: how do I avoid that bottleneck?
Memory copy
I wrote up a quick test to compare the performance of various byte copying methods.
Current Status
I've reduced execution time to roughly 0.02 seconds, which doesn't sound like a lot, but it makes a big difference in the number of frames per second. Currently the biggest improvements are a result of switching to cblas_scopy().
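For reference, a minimal sketch (not from the original post) of what the cblas_scopy path can look like when filling the shared input buffer; samples here stands in for the incoming float data:
import Metal
import Accelerate

func fillSharedBuffer(with samples: [Float], buffer: MTLBuffer) {
    samples.withUnsafeBufferPointer { src in
        let destination = buffer.contents().assumingMemoryBound(to: Float.self)
        // Copy `count` single-precision values, stride 1 on both source and destination.
        cblas_scopy(Int32(samples.count), src.baseAddress, 1, destination, 1)
    }
}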

Reduce the size of the type
Originally, I was pre-converting signed 16-bit integers to 32-bit floats, since ultimately that is how they'll be used. This is a case where performance starts forcing you to store the values as 16-bit integers to cut your data size in half.
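A small illustrative sketch (samples16 and inputSampleDataMetalBuffer are hypothetical names): keep the samples as Int16 on the CPU side and let the kernel widen them, so only half the bytes cross shared memory:
let byteCount = samples16.count * MemoryLayout<Int16>.stride
samples16.withUnsafeBytes { rawSamples in
    // Copy the raw 16-bit values straight into the shared buffer; the kernel converts to float.
    memcpy(inputSampleDataMetalBuffer.contents(), rawSamples.baseAddress, byteCount)
}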
Objective-C over Swift
For the code dealing with movement of data, you might choose Objective-C over Swift (a Warren Moore recommendation); Swift's performance in these special situations still isn't up to scratch. You can also try calling out to memcpy or similar methods. I've seen a couple of examples that looped over buffer pointers, and in my experiments this performed slowly.
Difficulty of testing
I really wanted to run some of the copying-method experiments in a playground on the Mac, but unfortunately this was useless. The iOS-device versions of the same experiments performed completely differently. One might think that the relative performance would at least be similar, but I found this to be an invalid assumption as well. It would be really convenient if you could have a playground that used the iOS device as the interpreter.

You might get a large speedup by encoding your data as Huffman codes and decoding on the GPU; see MetalHuffman. It depends on your data, though.

Related

Is there a way to bind assets once instead of with every command encoder?

I'm rendering with a vertex/fragment shader alongside a compute kernel.
Every frame I am binding large assets (such as a 450MB texture) in the usual way:
computeEncoder.setTexture(highResTexture, index: 0)
computeEncoder.setBuffer(largeBuffer, offset: 0, index: 0)
...
renderEncoder.setVertexTexture(highResTexture, index: 0)
renderEncoder.setVertexBuffer(largeBuffer, offset: 0, index: 0)
So that is close to 1 GB of bandwidth for a single texture, and I have many more assets totaling a few hundred megabytes, so it is roughly 1.5 GB that I bind every frame.
Is there any way to bind textures/buffers to the GPU once so that they would then be available in the kernel and vertex functions without binding every frame?
I could be wrong, but I thought something was introduced in one of the last couple of WWDCs, so I wanted to ask to make sure I'm not missing anything.
EDIT:
Simply binding a texture in the vertex function that I have already bound in the compute encoder does indeed show more texture bandwidth used, even though I am not using it in the capture.
GPU Read Bandwidth:
6.3920 GiB/s without binding
7.1919 GiB/s with binding
(Frame capture screenshots: one without binding the texture, one with the texture bound but not used in any way.)
Also, if it works as you describe, why does using multiple command encoders warn about wasted bandwidth? If I use more than one emitter, each with a separate encoder, I get the performance warning even though they bind identical resources.
I think you are confused. Setting a texture on a command encoder doesn't consume bandwidth; reading or sampling it inside the shader does.
When you set a texture or any other buffer on an encoder, the driver just passes a small amount of metadata to the shader through some mechanism, likely an internal buffer that's not visible to you as the API user. It doesn't "load" the texture anywhere. There is an exception for buffers that are marked as constant address-space buffers in the shaders, because those may get pre-fetched by the GPU for better performance.
Another thing that happens is that the resource is made resident, meaning the GPU driver will map a range of addresses in the GPU's virtual memory table to point to the physical memory that stores the texture contents. This does not consume memory either, but it does consume available virtual address space. You might run out of virtual address space in some cases, but that's not a bandwidth issue.
Still, if you do have a lot of textures, you might be actually spending a lot of CPU time just encoding those setTexture commands. Instead, you can use argument buffers. If the hardware you are targeting supports argument buffers tier 2, you can put every texture in an argument buffer. This will require calling useResource on all of those textures, because the driver needs to know that you are going to use those textures to make them resident, so you will still spend CPU time encoding those commands. To avoid that, you can allocate all the textures from one or more heaps and call useHeaps on those heaps. This will make the whole heap resident, and you won't need to call useResource on individual resources. There are a bunch of WWDC talks on this topic, latest one being Explore bindless rendering in Metal.
But again, to reiterate: nothing I mentioned here "wastes" bandwidth.
Update:
A very basic example of using argument buffers looks like this:
let argumentDescriptor = MTLArgumentDescriptor()
argumentDescriptor.index = 0
argumentDescriptor.dataType = .texture
argumentDescriptor.textureType = .type2D

let argumentEncoder = device.makeArgumentEncoder(arguments: [argumentDescriptor])!
let argumentBuffer = device.makeBuffer(length: argumentEncoder.encodedLength, options: [.storageModeShared])!

argumentEncoder.setArgumentBuffer(argumentBuffer, offset: 0)
argumentEncoder.setTexture(someTexture, index: 0)

commandEncoder.setBuffer(argumentBuffer, offset: 0, index: 0)
commandEncoder.useResource(someTexture, usage: .read)
And in the shader you would write a struct like this:
struct MyTexture
{
    texture2d<float> texture [[ id(0) ]];
};
and then bind it like
device MyTexture& myTexture [[ buffer(0) ]]
and use it like any other struct. This is a very basic example and you can actually use reflection to create those MTLArgumentEncoders for you from functions and binding indices.
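To sketch the heap-based residency path mentioned above (a rough example with assumed sizes and formats, not a drop-in recipe):
let heapDescriptor = MTLHeapDescriptor()
heapDescriptor.storageMode = .private
heapDescriptor.size = 512 * 1024 * 1024   // assumed budget; size it to your actual assets

let heap = device.makeHeap(descriptor: heapDescriptor)!

let textureDescriptor = MTLTextureDescriptor.texture2DDescriptor(pixelFormat: .rgba8Unorm,
                                                                 width: 4096,
                                                                 height: 4096,
                                                                 mipmapped: false)
textureDescriptor.storageMode = .private
let highResTexture = heap.makeTexture(descriptor: textureDescriptor)!

// One call per encoder makes everything allocated from this heap resident,
// replacing per-resource useResource calls.
commandEncoder.useHeap(heap)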

MPSImageHistogramEqualization throws assertion that offset must be < [buffer length]

I'm trying to do histogram equalization using MPSImageHistogramEqualization on iOS, but it ends up throwing an assertion I do not understand. Here is my code:
// Calculate Histogram
var histogramInfo = MPSImageHistogramInfo(
    numberOfHistogramEntries: 256,
    histogramForAlpha: false,
    minPixelValue: vector_float4(0, 0, 0, 0),
    maxPixelValue: vector_float4(1, 1, 1, 1))
let calculation = MPSImageHistogram(device: self.mtlDevice, histogramInfo: &histogramInfo)
let bufferLength = calculation.histogramSize(forSourceFormat: sourceTexture.pixelFormat)
let histogramInfoBuffer = self.mtlDevice.makeBuffer(length: bufferLength, options: [.storageModePrivate])!

calculation.encode(to: commandBuffer,
                   sourceTexture: sourceTexture,
                   histogram: histogramInfoBuffer,
                   histogramOffset: 0)

let histogramEqualization = MPSImageHistogramEqualization(device: self.mtlDevice, histogramInfo: &histogramInfo)
histogramEqualization.encodeTransform(to: commandBuffer, sourceTexture: sourceTexture, histogram: histogramInfoBuffer, histogramOffset: 0)
And here is the resulting assert that happens on that last line:
-[MTLDebugComputeCommandEncoder setBuffer:offset:atIndex:]:283: failed assertion `offset(4096) must be < [buffer length](4096).'
Any suggestions on what might be going on here?
This appears to be a bug in a specialized path in MPSImageHistogramEqualization, and I encourage you to file feedback on it.
When numberOfHistogramEntries is greater than 256, the image kernel allocates an internal buffer large enough to hold the data it needs to work with (for N=512, this is 8192 bytes), plus an extra bit of space (32 bytes). When the internal optimized256BinsUseCase flag is set, it allocates exactly 4096 bytes, omitting that last bit of extra storage. My suspicion is that subsequent operations rely on having more space after the initial data chunk, and inadvertently set the buffer offset past the length of the internal buffer.
You may be able to work around this by using a different number of histogram bins, like 512. This wastes a little space and time, but I assume it will produce the same results.
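A minimal sketch of that workaround, assuming the rest of the setup stays exactly as in the question:
var histogramInfo = MPSImageHistogramInfo(
    numberOfHistogramEntries: 512,   // sidesteps the specialized 256-bin path
    histogramForAlpha: false,
    minPixelValue: vector_float4(0, 0, 0, 0),
    maxPixelValue: vector_float4(1, 1, 1, 1))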
Alternatively, you might be able to avoid this crash by disabling the Metal validation layer, but I strongly discourage that, since you'll just be masking the underlying issue till it gets fixed.
Note: I did my reverse-engineering of the MetalPerformanceShaders framework on macOS Catalina. Different platforms and different software versions likely have different code paths.

How to use a custom compute shaders using metal and get very smooth performance?

I'm trying to apply live camera filters through Metal using the default MPSKernel filters given by Apple and custom compute shaders.
In the compute shader I did the in-place encoding with MPSImageGaussianBlur, and the code goes here:
func encode(to commandBuffer: MTLCommandBuffer, sourceTexture: MTLTexture, destinationTexture: MTLTexture, cropRect: MTLRegion = MTLRegion.init(), offset: CGPoint) {
    let blur = MPSImageGaussianBlur(device: device, sigma: 0)
    blur.clipRect = cropRect
    blur.offset = MPSOffset(x: Int(offset.x), y: Int(offset.y), z: 0)

    let threadsPerThreadgroup = MTLSizeMake(4, 4, 1)
    let threadgroupsPerGrid = MTLSizeMake(sourceTexture.width / threadsPerThreadgroup.width, sourceTexture.height / threadsPerThreadgroup.height, 1)

    let commandEncoder = commandBuffer.makeComputeCommandEncoder()
    commandEncoder.setComputePipelineState(pipelineState!)
    commandEncoder.setTexture(sourceTexture, at: 0)
    commandEncoder.setTexture(destinationTexture, at: 1)
    commandEncoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
    commandEncoder.endEncoding()

    autoreleasepool {
        var inPlaceTexture = destinationTexture
        blur.encode(commandBuffer: commandBuffer, inPlaceTexture: &inPlaceTexture, fallbackCopyAllocator: nil)
    }
}
But sometimes the in-place texture tends to fail, and eventually it creates a jerky effect on the screen.
So if anyone can suggest a solution that avoids the in-place texture, shows how to use the fallbackCopyAllocator, or uses the compute shaders in a different way, that would be really helpful.
I have done enough coding in this area (applying compute shaders to the video stream from the camera), and the most common problem you run into is the "pixel buffer reuse" issue.
The Metal texture you create from the sample buffer is backed by a pixel buffer, which is managed by the video session and can be re-used for subsequent video frames unless you retain a reference to the sample buffer (retaining a reference to the Metal texture is not enough).
Feel free to take a look at my code at https://github.com/snakajima/vs-metal, which applies various compute shaders to a live video stream.
The VSContext:set() method takes an optional sampleBuffer parameter in addition to the texture parameter, and retains the reference to the sampleBuffer until the compute shader's computation is completed (in the VSRuntime:encode() method).
The in-place operation method can be hit or miss depending on what the underlying filter is doing. If it is a single-pass filter for some parameters, then it will end up having to run out of place for those cases.
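To illustrate the fallbackCopyAllocator route the question asks about, here is a minimal sketch (not from the original answer); it simply hands MPS a fresh texture to write into when the in-place encode can't succeed:
let fallbackAllocator: MPSCopyAllocator = { kernel, commandBuffer, sourceTexture in
    // Allocate a texture matching the source that the kernel can write into instead.
    let descriptor = MTLTextureDescriptor.texture2DDescriptor(pixelFormat: sourceTexture.pixelFormat,
                                                              width: sourceTexture.width,
                                                              height: sourceTexture.height,
                                                              mipmapped: false)
    descriptor.usage = [.shaderRead, .shaderWrite]
    return commandBuffer.device.makeTexture(descriptor: descriptor)!
}

var inPlaceTexture = destinationTexture
// Returns false only when in-place encoding fails and no allocator is provided.
_ = blur.encode(commandBuffer: commandBuffer,
                inPlaceTexture: &inPlaceTexture,
                fallbackCopyAllocator: fallbackAllocator)
// When the allocator is used, inPlaceTexture is updated to point at the replacement texture.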
Since that method was added, MPS has added an underlying MTLHeap to manage memory a bit more transparently for you. If your MPSImage doesn't need to be viewed by the CPU and exists for only a short period of time on the GPU, it is recommended that you just use an MPSTemporaryImage instead. When its readCount hits 0, the backing store is recycled through the MPS heap and made available for other MPSTemporaryImages and other temporary resources used downstream. Likewise, the backing store isn't actually allocated from the heap until absolutely necessary (e.g. the texture is written to, or .texture is called). A separate heap is allocated for each command buffer.
Using temporary images should help reduce memory usage quite a lot. For example, in an Inception v3 neural network graph, which has over a hundred passes, the heap was able to automatically reduce the graph to just four allocations.
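A rough sketch of that pattern, with the descriptor values (pixel format, size, readCount) purely illustrative:
let tempDescriptor = MTLTextureDescriptor.texture2DDescriptor(pixelFormat: .bgra8Unorm,
                                                              width: sourceTexture.width,
                                                              height: sourceTexture.height,
                                                              mipmapped: false)
tempDescriptor.usage = [.shaderRead, .shaderWrite]

let intermediate = MPSTemporaryImage(commandBuffer: commandBuffer, textureDescriptor: tempDescriptor)
intermediate.readCount = 1   // storage is recycled after one downstream read

// Write the blur result into the temporary image, then read it exactly once later in the command buffer.
blur.encode(commandBuffer: commandBuffer,
            sourceTexture: sourceTexture,
            destinationTexture: intermediate.texture)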

Is it safe to feed unaligned buffers to MTLBuffer?

When trying to use Metal to rapidly draw pixel buffers to the screen from memory, we create MTLBuffer objects using MTLDevice.makeBuffer(bytesNoCopy:..) to allow the GPU to directly read the pixels from memory without having to copy it. Shared memory is really a must-have for achieving good pixel transfer performance.
The catch is that makeBuffer requires a page-aligned memory address and a page aligned length. Those requirements are not only in the documentation -- they are also enforced using runtime assertions.
The code I am writing has to deal with a variety of incoming resolutions and pixel formats, and occasionally I get unaligned buffers or unaligned lengths. After researching this I discovered a hack that allows me to use shared memory for those instances.
Basically what I do is I round the unaligned buffer address down to the nearest page boundary, and use the offset parameter from makeTexture to ensure that the GPU starts reading from the right place. Then I round up length to the nearest page size. Obviously that memory is going to be valid (because allocations can only occur on page boundaries), and I think it's safe to assume the GPU isn't writing to or corrupting that memory.
Here is the code I'm using to allocate shared buffers from unaligned buffers:
extension MTLDevice {
    func makeTextureFromUnalignedBuffer(textureDescriptor: MTLTextureDescriptor, bufferPtr: UnsafeMutableRawPointer, bufferLength: UInt, bytesPerRow: Int) -> MTLTexture? {
        var calculatedBufferLength = bufferLength
        let pageSize = UInt(getpagesize())
        let pageSizeBitmask = pageSize - 1

        // Round the pointer down to the enclosing page boundary and remember the offset into it.
        let alignedBufferAddr = UnsafeMutableRawPointer(bitPattern: UInt(bitPattern: bufferPtr) & ~pageSizeBitmask)
        let offset = UInt(bitPattern: bufferPtr) & pageSizeBitmask

        assert(bytesPerRow % 64 == 0 && offset % 64 == 0, "Supplied bufferPtr and bytesPerRow must be aligned on a 64-byte boundary!")

        // Round the length up to the next page boundary.
        calculatedBufferLength += offset
        if (calculatedBufferLength & pageSizeBitmask) != 0 {
            calculatedBufferLength &= ~pageSizeBitmask
            calculatedBufferLength += pageSize
        }

        guard let buffer = self.makeBuffer(bytesNoCopy: alignedBufferAddr!, length: Int(calculatedBufferLength), options: .storageModeShared, deallocator: nil) else {
            return nil
        }
        return buffer.makeTexture(descriptor: textureDescriptor, offset: Int(offset), bytesPerRow: bytesPerRow)
    }
}
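A hypothetical call site (frameBaseAddress, bytesPerRow, width, and height are stand-ins for whatever your capture source provides):
let descriptor = MTLTextureDescriptor.texture2DDescriptor(pixelFormat: .bgra8Unorm,
                                                          width: width,
                                                          height: height,
                                                          mipmapped: false)
// The extension above handles the page rounding; only 64-byte alignment is asserted.
let texture = device.makeTextureFromUnalignedBuffer(textureDescriptor: descriptor,
                                                    bufferPtr: frameBaseAddress,
                                                    bufferLength: UInt(bytesPerRow * height),
                                                    bytesPerRow: bytesPerRow)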
I've tested this on numerous different buffers and it seems to work perfectly (only tested on iOS, not on macOS). My question is: Is this approach safe? Any obvious reasons why this wouldn't work?
Then again, if it is safe, why were the requirements imposed in the first place? Why isn't the API just doing this for us?
I have submitted an Apple TSI (Technical Support Incident) for this question, and the answer is basically yes, it is safe. Here is the exact response in case anyone is interested:
After discussing your approach with engineering we concluded that it
was valid and safe. Some noteworthy quotes:
“The framework shouldn’t care about the fact that the user doesn’t own
the entire page, because it shouldn’t ever read before the offset
where the valid data begins.”
“It really shouldn’t [care], but in general if the developer can use
page-allocators rather than malloc for their incoming images, that
would be nice.”
As to why the alignment constraints/assertions are in place:
“Typically mapping memory you don’t own into another address space is
a bit icky, even if it works in practice. This is one reason why we
required mapping to be page aligned, because the hardware really is
mapping (and gaining write access) to the entire page.”

Multiple models in Metal. How?

This is an absolute beginner question.
Background: I’m not really a game developer, but I’m trying to learn the basics of low-level 3D programming, because it’s a fun and interesting topic. I’ve picked Apple’s Metal as the graphics framework. I know about SceneKit and other higher level frameworks, but I’m intentionally trying to learn the low level bits. Unfortunately I’m way out of my depth, and there seems to be very little beginner-oriented Metal resources on the web.
By reading the Apple documentation and following the tutorials I could find, I've managed to implement a simple vertex shader and a fragment shader and draw an actual 3D model on the screen. Now I'm trying to draw a second model, but I'm kind of stuck, because I'm not really sure of the best way to go about it.
Do I…
Use a single vertex buffer and index buffer for all of my models, and tell the MTLRenderCommandEncoder the offsets when rendering the individual models?
Have a separate vertex buffer / index buffer for each model? Would such an approach scale?
Something else?
TL;DR: What is the recommended way to store the vertex data of multiple models in Metal (or any other 3D framework)?
There is no one recommended way. When you're working at such a low level as Metal, there are many possibilities, and the one you pick depends heavily on the situation and what performance characteristics you want/need to optimize for. If you're just playing around with intro projects, most of these decisions are irrelevant, because the performance issues won't bite until you scale up to a "real" project.
Typically, game engines use one buffer (or set of vertex/index buffers) per model, especially if each model requires different render states (e.g. shaders, bound textures). This means that when new models are introduced to the scene or old ones no longer needed, the requisite resources can be loaded into / removed from GPU memory (by way of creating / destroying MTL objects).
The main use case for doing multiple draws out of (different parts of) the same buffer is when you're mutating the buffer. For example, on frame n you're using the first 1KB of a buffer to draw with, while at the same time you're computing / streaming in new vertex data and writing it to the second 1KB of the buffer... then for frame n + 1 you switch which parts of the buffer are being used for what.
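For concreteness, a hedged sketch of the offsets approach from option 1 (modelRanges, sharedVertexBuffer, and sharedIndexBuffer are hypothetical names; the buffers hold every model's data back to back):
for range in modelRanges {
    // Point the encoder at this model's slice of the shared vertex data.
    encoder.setVertexBuffer(sharedVertexBuffer, offset: range.vertexByteOffset, at: 0)
    encoder.drawIndexedPrimitives(type: .triangle,
                                  indexCount: range.indexCount,
                                  indexType: .uint16,
                                  indexBuffer: sharedIndexBuffer,
                                  indexBufferOffset: range.indexByteOffset)
}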
To add a bit to rickster's answer, I would encapsulate your model in a class that contains one buffer (or two, if you count the index buffer) per model, and pass an optional parameter with the number of instances of that model you want to create.
Then, keep an additional buffer where you store whatever variations you want to introduce per instance. Usually, it's just the transform and a different material. For instance,
struct PerInstanceUniforms {
    var transform : Transform
    var material : Material
}
In my case, the material contains a UV transform, but the texture has to be the same for all the instances.
Then your model class would look something like this,
class Model {
    fileprivate var indexBuffer : MTLBuffer!
    fileprivate var vertexBuffer : MTLBuffer!
    var perInstanceUniforms : [PerInstanceUniforms]
    let uniformBuffer : MTLBuffer!

    // ... constructors, etc.

    func draw(_ encoder: MTLRenderCommandEncoder) {
        encoder.setVertexBuffer(vertexBuffer, offset: 0, at: 0)
        RenderManager.sharedInstance.setUniformBuffer(encoder, atIndex: 1)
        encoder.setVertexBuffer(self.uniformBuffer, offset: 0, at: 2)
        encoder.drawIndexedPrimitives(type: .triangle, indexCount: numIndices, indexType: .uint16, indexBuffer: indexBuffer, indexBufferOffset: 0, instanceCount: self.numInstances)
    }

    // This gets called when we need to update the buffers used by the GPU.
    func updateBuffers(_ syncBufferIndex: Int) {
        let uniformB = uniformBuffer.contents()
        let uniformData = uniformB.advanced(by: MemoryLayout<PerInstanceUniforms>.size * perInstanceUniforms.count * syncBufferIndex).assumingMemoryBound(to: Float.self)
        memcpy(uniformData, &perInstanceUniforms, MemoryLayout<PerInstanceUniforms>.size * perInstanceUniforms.count)
    }
}
Your vertex shader with instances will look something like this,
vertex VertexInOut passGeometry(uint vid [[ vertex_id ]],
                                uint iid [[ instance_id ]],
                                constant TexturedVertex* vdata [[ buffer(0) ]],
                                constant Uniforms& uniforms [[ buffer(1) ]],
                                constant Transform* perInstanceUniforms [[ buffer(2) ]])
{
    VertexInOut outVertex;
    Transform t = perInstanceUniforms[iid];
    float4x4 m = uniforms.projectionMatrix * uniforms.viewMatrix;
    TexturedVertex v = vdata[vid];
    outVertex.position = m * float4(t * v.position, 1.0);
    outVertex.uv = float2(0, 0);
    outVertex.color = float4(0.5 * v.normal + 0.5, 1);
    return outVertex;
}
Here's an example I wrote of using instancing, with a performance analysis: http://tech.metail.com/performance-quaternions-gpu/
You can find the full code for reference here: https://github.com/endavid/VidEngine
I hope that helps.
