MPSImageHistogramEqualization throws assertion that offset must be < [buffer length] - ios

I'm trying to do histogram equalization using MPSImageHistogramEqualization on iOS, but it ends up throwing an assertion I do not understand. Here is my code:
// Calculate Histogram
var histogramInfo = MPSImageHistogramInfo(
    numberOfHistogramEntries: 256,
    histogramForAlpha: false,
    minPixelValue: vector_float4(0, 0, 0, 0),
    maxPixelValue: vector_float4(1, 1, 1, 1))
let calculation = MPSImageHistogram(device: self.mtlDevice, histogramInfo: &histogramInfo)
let bufferLength = calculation.histogramSize(forSourceFormat: sourceTexture.pixelFormat)
let histogramInfoBuffer = self.mtlDevice.makeBuffer(length: bufferLength, options: [.storageModePrivate])!
calculation.encode(to: commandBuffer,
                   sourceTexture: sourceTexture,
                   histogram: histogramInfoBuffer,
                   histogramOffset: 0)

let histogramEqualization = MPSImageHistogramEqualization(device: self.mtlDevice, histogramInfo: &histogramInfo)
histogramEqualization.encodeTransform(to: commandBuffer,
                                      sourceTexture: sourceTexture,
                                      histogram: histogramInfoBuffer,
                                      histogramOffset: 0)
And here is the assertion failure triggered by that last line:
-[MTLDebugComputeCommandEncoder setBuffer:offset:atIndex:]:283: failed assertion `offset(4096) must be < [buffer length](4096).'
Any suggestions on what might be going on here?

This appears to be a bug in a specialized path in MPSImageHistogramEqualization, and I encourage you to file feedback on it.
When numberOfHistogramEntries is greater than 256, the image kernel allocates an internal buffer large enough to hold the data it needs to work with (for N=512, this is 8192 bytes), plus an extra bit of space (32 bytes). When the internal optimized256BinsUseCase flag is set, it allocates exactly 4096 bytes, omitting that last bit of extra storage. My suspicion is that subsequent operations rely on having more space after the initial data chunk, and inadvertently set the buffer offset past the length of the internal buffer.
You may be able to work around this by using a different number of histogram bins, like 512. This wastes a little space and time, but I assume it will produce the same results.
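For example, a minimal sketch of that change against the code in the question (same self.mtlDevice and sourceTexture as above):

import MetalPerformanceShaders

// Workaround sketch: 512 bins instead of 256 to sidestep the optimized
// 256-bin code path. The rest of the encoding code stays the same, because
// histogramSize(forSourceFormat:) accounts for the larger bin count.
var histogramInfo = MPSImageHistogramInfo(
    numberOfHistogramEntries: 512,
    histogramForAlpha: false,
    minPixelValue: vector_float4(0, 0, 0, 0),
    maxPixelValue: vector_float4(1, 1, 1, 1))
let calculation = MPSImageHistogram(device: self.mtlDevice, histogramInfo: &histogramInfo)
let bufferLength = calculation.histogramSize(forSourceFormat: sourceTexture.pixelFormat)
let histogramInfoBuffer = self.mtlDevice.makeBuffer(length: bufferLength,
                                                    options: [.storageModePrivate])!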
Alternatively, you might be able to avoid this crash by disabling the Metal validation layer, but I strongly discourage that, since you'd just be masking the underlying issue until it gets fixed.
Note: I did my reverse-engineering of the MetalPerformanceShaders framework on macOS Catalina. Different platforms and different software versions likely have different code paths.

Related

Is there a way to bind assets once instead of with every command encoder?

I'm rendering with a vertex/fragment shader and also running a compute kernel.
Every frame I am binding large assets (such as a 450MB texture) in the usual way:
computeEncoder.setTexture(highResTexture, index: 0)
computeEncoder.setBuffer(largeBuffer, offset: 0, index: 0)
...
renderEncoder.setVertexTexture(highResTexture, index: 0)
renderEncoder.setVertexBuffer(largeBuffer, offset: 0, index: 0)
So that is close to 1GB in bandwidth for a single texture, and I have many more assets totaling a few hundred megs, so that is about 1.5GB that I bind for every frame.
Is there any way to bind textures/buffers to the GPU once so that they would then be available in the kernel and vertex functions without binding every frame?
I could be wrong, but I thought something was introduced in one of the last couple of WWDCs, so I thought I would ask to make sure I'm not missing anything.
EDIT:
Simply binding a texture in the vertex function that I have already bound in the compute encoder does indeed show more texture read bandwidth in the GPU capture, even though I never actually use it there.
GPU Read Bandwidth:
6.3920 GiB/s without binding
7.1919 GiB/s with binding
Also, if it works as you describe, why does using multiple command encoders produce a warning about wasted bandwidth? If I use more than one emitter, each with a separate encoder, I get the performance warning even though they bind identical resources.
I think you are confused. Setting a texture to a command encoder doesn't consume bandwidth. Reading it or sampling it inside the shader does.
When you set a texture or any other buffer on an encoder, what happens is that the driver just passes a small amount of metadata to the shader using some mechanism, likely an internal buffer that's not visible to you as the API user. It doesn't "load" the texture anywhere. There's an exception for buffers that are marked as constant address space buffers in the shaders, because those may get prefetched by the GPU for better performance.
Another thing that happens is that the resource is made resident, meaning the GPU driver will map a range of addresses in the GPU's virtual memory table to point to the physical memory that stores the texture contents. This also does not consume bandwidth or memory, but it does consume available virtual address space. You might run out of virtual address space in some cases, but that's not a bandwidth issue.
Still, if you do have a lot of textures, you might actually be spending a lot of CPU time just encoding those setTexture commands. Instead, you can use argument buffers. If the hardware you are targeting supports argument buffers tier 2, you can put every texture in an argument buffer. This will require calling useResource on all of those textures, because the driver needs to know that you are going to use them so it can make them resident, so you will still spend CPU time encoding those commands. To avoid that, you can allocate all the textures from one or more heaps and call useHeap on those heaps. This makes the whole heap resident, and you won't need to call useResource on individual resources. There are a bunch of WWDC talks on this topic, the latest being Explore bindless rendering in Metal.
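As a sketch, the capability check for that path could look like this (device is your MTLDevice):

// Sketch: take the bindless path only when argument buffers tier 2 is available.
if device.argumentBuffersSupport == .tier2 {
    // Gather the textures into an argument buffer, allocate them from heaps,
    // and call useHeap(_:) once per encoder instead of useResource(_:usage:) per texture.
} else {
    // Fall back to binding each texture/buffer on the encoder every frame.
}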
But again, to reiterate: nothing I mentioned here "wastes" bandwidth.
Update:
A very basic example of using argument buffers looks like this:
let argumentDescriptor = MTLArgumentDescriptor()
argumentDescriptor.index = 0
argumentDescriptor.dataType = .texture
argumentDescriptor.textureType = .type2D

// Argument encoders are created from the device (or via reflection from a function).
let argumentEncoder = device.makeArgumentEncoder(arguments: [argumentDescriptor])!
let argumentBuffer = device.makeBuffer(length: argumentEncoder.encodedLength, options: [.storageModeShared])!

argumentEncoder.setArgumentBuffer(argumentBuffer, offset: 0)
argumentEncoder.setTexture(someTexture, index: 0)

commandEncoder.setBuffer(argumentBuffer, offset: 0, index: 0)
commandEncoder.useResource(someTexture, usage: .read)
And in the shader you would write a struct like this:
struct MyTexture
{
    texture2d<float> texture [[ id(0) ]];
};
and then bind it like
device MyTexture& myTexture [[ buffer(0) ]]
and use it like any other struct. This is a very basic example and you can actually use reflection to create those MTLArgumentEncoders for you from functions and binding indices.
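For the heap route mentioned earlier, a rough sketch could look like the following; the heap size, highResTextureDescriptor, and argumentBuffer are placeholders you would provide yourself:

import Metal

// Sketch: allocate textures from a heap so a single useHeap() call makes
// them all resident, instead of one useResource() call per texture.
let heapDescriptor = MTLHeapDescriptor()
heapDescriptor.storageMode = .private
heapDescriptor.size = 512 * 1024 * 1024   // placeholder; size it via heapTextureSizeAndAlign(descriptor:)
let heap = device.makeHeap(descriptor: heapDescriptor)!

let highResTexture = heap.makeTexture(descriptor: highResTextureDescriptor)!
// ...fill the texture via a blit encoder as usual...

// When encoding each frame:
commandEncoder.useHeap(heap)                                   // residency for everything in the heap
commandEncoder.setBuffer(argumentBuffer, offset: 0, index: 0)  // argument buffer referencing the textures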

How to efficiently create a large vector of items initialized to the same value?

I'm looking to allocate a vector of small-sized structs.
This takes 30 milliseconds and increases linearly:
let v = vec![[0, 0, 0, 0]; 1024 * 1024];
This takes tens of microseconds:
let v = vec![0; 1024 * 1024];
Is there a more efficient solution to the first case? I'm okay with unsafe code.
Fang Zhang's answer is correct in the general case. The code you asked about is a little bit special: it could use alloc_zeroed, but it does not. As Stargateur also points out in the question comments, with future language and library improvements it is possible both cases could take advantage of this speedup.
This usually should not be a problem. Initializing a whole big vector at once probably isn't something you do extremely often. Big allocations are usually long-lived, so you won't be creating and freeing them in a tight loop -- the cost of initializing the vector will only be paid rarely. Sooner than resorting to unsafe, I would take a look at my algorithms and try to understand why a single memset is causing so much trouble.
However, if you happen to know that all-bits-zero is an acceptable initial value, and if you absolutely cannot tolerate the slowdown, you can do an end-run around the standard library by calling alloc_zeroed and creating the Vec using from_raw_parts. Vec::from_raw_parts is unsafe, so you have to be absolutely sure the size and alignment of the allocated memory is correct. Since Rust 1.44, you can use Layout::array to do this easily. Here's an example:
pub fn make_vec() -> Vec<[i8; 4]> {
    let layout = std::alloc::Layout::array::<[i8; 4]>(1_000_000).unwrap();
    // I copied the following unsafe code from Stack Overflow without understanding
    // it. I was advised not to do this, but I didn't listen. It's my fault.
    unsafe {
        Vec::from_raw_parts(
            std::alloc::alloc_zeroed(layout) as *mut _,
            1_000_000,
            1_000_000,
        )
    }
}
See also
How to perform efficient vector initialization in Rust?
vec![0; 1024 * 1024] is a special case. If you change it to vec![1; 1024 * 1024], you will see performance degrade dramatically.
Typically, for a non-zero element e, vec![e; n] will clone the element n times, which is the major cost. For an element equal to 0, the memory can be requested from the system already zeroed, which is much faster.
So the answer to your question is no.

How to use custom compute shaders with Metal and get very smooth performance?

I'm trying to apply live camera filters through Metal, using the default MPSKernel filters provided by Apple together with custom compute shaders.
In the compute pass I do in-place encoding with MPSImageGaussianBlur,
and here is the code:
func encode(to commandBuffer: MTLCommandBuffer, sourceTexture: MTLTexture, destinationTexture: MTLTexture, cropRect: MTLRegion = MTLRegion.init(), offset: CGPoint) {
    let blur = MPSImageGaussianBlur(device: device, sigma: 0)
    blur.clipRect = cropRect
    blur.offset = MPSOffset(x: Int(offset.x), y: Int(offset.y), z: 0)

    let threadsPerThreadgroup = MTLSizeMake(4, 4, 1)
    let threadgroupsPerGrid = MTLSizeMake(sourceTexture.width / threadsPerThreadgroup.width, sourceTexture.height / threadsPerThreadgroup.height, 1)

    let commandEncoder = commandBuffer.makeComputeCommandEncoder()
    commandEncoder.setComputePipelineState(pipelineState!)
    commandEncoder.setTexture(sourceTexture, at: 0)
    commandEncoder.setTexture(destinationTexture, at: 1)
    commandEncoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
    commandEncoder.endEncoding()

    autoreleasepool {
        var inPlaceTexture = destinationTexture
        blur.encode(commandBuffer: commandBuffer, inPlaceTexture: &inPlaceTexture, fallbackCopyAllocator: nil)
    }
}
But sometimes the in-place encode fails, and eventually it creates a jerk effect on the screen.
So if anyone can suggest a solution that avoids the in-place texture, explains how to use the fallbackCopyAllocator, or uses the compute shaders in a different way, that would be really helpful.
I have done enough coding in this area (applying compute shaders to the video stream from the camera), and the most common problem you run into is the "pixel buffer reuse" issue.
The Metal texture you create from the sample buffer is backed by a pixel buffer, which is managed by the video session and can be reused for following video frames, unless you retain the reference to the sample buffer (retaining the reference to the Metal texture is not enough).
Feel free to take a look at my code at https://github.com/snakajima/vs-metal, which applies various compute shaders to a live video stream.
The VSContext:set() method takes an optional sampleBuffer parameter in addition to the texture parameter, and retains the reference to the sampleBuffer until the compute shader's work is completed (in the VSRuntime:encode() method).
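As a rough illustration of the idea (not the exact vs-metal code; sampleBuffer and commandBuffer are whatever your capture pipeline already has):

import CoreMedia
import Metal

// Sketch: keep the CMSampleBuffer (and its backing pixel buffer) alive until
// the GPU work that reads the texture derived from it has completed.
func retain(_ sampleBuffer: CMSampleBuffer, until commandBuffer: MTLCommandBuffer) {
    commandBuffer.addCompletedHandler { _ in
        // The strong capture of sampleBuffer is released only here, after the GPU
        // is done, so the capture session cannot recycle the pixel buffer too early.
        _ = sampleBuffer
    }
}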
The in-place operation method can be hit or miss depending on what the underlying filter is doing. If it is a single-pass filter for some parameters, then you'll end up running out of place for those cases.
Since that method was added, MPS has added an underlying MTLHeap to manage memory a bit more transparently for you. If your MPSImage doesn't need to be viewed by the CPU and exists for only a short period of time on the GPU, it is recommended that you just use an MPSTemporaryImage instead. When the readCount hits 0 on that, the backing store will be recycled through the MPS heap and made available for other MPSTemporaryImages and other temporary resources used downstream. Likewise, the backing store for it isn't actually allocated from the heap until absolutely necessary (e.g. the texture is written to, or .texture is called). A separate heap is allocated for each command buffer.
Using temporary images should help reduce memory usage quite a lot. For example, in an Inception v3 neural network graph, which has over a hundred passes, the heap was able to automatically reduce the graph to just four allocations.
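A minimal sketch of the MPSTemporaryImage route (assuming the same blur, commandBuffer, and sourceTexture as in the question, and a pixel format that matches your pipeline):

import MetalPerformanceShaders

// Sketch: give the blur a temporary image as its destination so the backing
// store comes from the MPS heap attached to the command buffer.
let descriptor = MTLTextureDescriptor.texture2DDescriptor(pixelFormat: .bgra8Unorm,
                                                          width: sourceTexture.width,
                                                          height: sourceTexture.height,
                                                          mipmapped: false)
descriptor.usage = [.shaderRead, .shaderWrite]
let intermediate = MPSTemporaryImage(commandBuffer: commandBuffer, textureDescriptor: descriptor)
intermediate.readCount = 1   // it will be read exactly once downstream

blur.encode(commandBuffer: commandBuffer,
            sourceTexture: sourceTexture,
            destinationTexture: intermediate.texture)
// Once that single read has happened, the backing store is recycled automatically.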

Is it safe to feed unaligned buffers to MTLBuffer?

When trying to use Metal to rapidly draw pixel buffers to the screen from memory, we create MTLBuffer objects using MTLDevice.makeBuffer(bytesNoCopy:..) to allow the GPU to directly read the pixels from memory without having to copy it. Shared memory is really a must-have for achieving good pixel transfer performance.
The catch is that makeBuffer requires a page-aligned memory address and a page aligned length. Those requirements are not only in the documentation -- they are also enforced using runtime assertions.
The code I am writing has to deal with a variety of incoming resolutions and pixel formats, and occasionally I get unaligned buffers or unaligned lengths. After researching this I discovered a hack that allows me to use shared memory for those instances.
Basically, I round the unaligned buffer address down to the nearest page boundary, and use the offset parameter of makeTexture to ensure that the GPU starts reading from the right place. Then I round the length up to the next page boundary. Obviously that memory is going to be valid (because allocations can only occur on page boundaries), and I think it's safe to assume the GPU isn't writing to or corrupting that memory.
Here is the code I'm using to allocate shared buffers from unaligned buffers:
extension MTLDevice {
    func makeTextureFromUnalignedBuffer(textureDescriptor: MTLTextureDescriptor, bufferPtr: UnsafeMutableRawPointer, bufferLength: UInt, bytesPerRow: Int) -> MTLTexture? {
        var calculatedBufferLength = bufferLength
        let pageSize = UInt(getpagesize())
        let pageSizeBitmask = pageSize - 1

        let alignedBufferAddr = UnsafeMutableRawPointer(bitPattern: UInt(bitPattern: bufferPtr) & ~pageSizeBitmask)
        let offset = UInt(bitPattern: bufferPtr) & pageSizeBitmask

        assert(bytesPerRow % 64 == 0 && offset % 64 == 0, "Supplied bufferPtr and bytesPerRow must be aligned on a 64-byte boundary!")

        calculatedBufferLength += offset
        if (calculatedBufferLength & pageSizeBitmask) != 0 {
            calculatedBufferLength &= ~pageSizeBitmask
            calculatedBufferLength += pageSize
        }

        let buffer = self.makeBuffer(bytesNoCopy: alignedBufferAddr!, length: Int(calculatedBufferLength), options: .storageModeShared, deallocator: nil)
        return buffer?.makeTexture(descriptor: textureDescriptor, offset: Int(offset), bytesPerRow: bytesPerRow)
    }
}
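For context, a hypothetical call site (the descriptor, pointer, and dimensions are made-up names):

// Hypothetical usage of the extension above.
let descriptor = MTLTextureDescriptor.texture2DDescriptor(pixelFormat: .bgra8Unorm,
                                                          width: width,
                                                          height: height,
                                                          mipmapped: false)
let texture = device.makeTextureFromUnalignedBuffer(textureDescriptor: descriptor,
                                                    bufferPtr: incomingPixelPtr,
                                                    bufferLength: UInt(height * bytesPerRow),
                                                    bytesPerRow: bytesPerRow)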
I've tested this on numerous different buffers and it seems to work perfectly (only tested on iOS, not on macOS). My question is: Is this approach safe? Any obvious reasons why this wouldn't work?
Then again, if it is safe, why were the requirements imposed in the first place? Why doesn't the API just do this for us?
I have submitted an Apple TSI (Technical Support Incident) for this question, and the answer is basically yes, it is safe. Here is the exact response in case anyone is interested:
After discussing your approach with engineering we concluded that it was valid and safe. Some noteworthy quotes:
“The framework shouldn’t care about the fact that the user doesn’t own the entire page, because it shouldn’t ever read before the offset where the valid data begins.”
“It really shouldn’t [care], but in general if the developer can use page-allocators rather than malloc for their incoming images, that would be nice.”
As to why the alignment constraints/assertions are in place:
“Typically mapping memory you don’t own into another address space is a bit icky, even if it works in practice. This is one reason why we required mapping to be page aligned, because the hardware really is mapping (and gaining write access) to the entire page.”

Memory write performance - GPU CPU Shared Memory

I'm allocating both input and output MTLBuffers using posix_memalign, according to the shared GPU/CPU documentation provided by memkite.
Aside: it is easier to just use the latest API than to muck around with posix_memalign:
let metalBuffer = self.metalDevice.newBufferWithLength(byteCount, options: .StorageModeShared)
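In current Swift/Metal naming, that call is spelled:
let metalBuffer = self.metalDevice.makeBuffer(length: byteCount, options: [.storageModeShared])!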
My kernel function operates on roughly 16 million complex value structs and writes out an equal number of complex value structs to memory.
I've performed some experiments and my Metal kernel's 'complex math section' executes in 0.003 seconds (Yes!), but writing the result to the buffer takes >0.05 seconds (No!). In my experiment I commented out the math part and just assigned zero to memory, and it took 0.05 seconds; commenting out the assignment and adding the math back, 0.003 seconds.
Is the shared memory slow in this case, or is there some other tip or trick I might try?
Additional detail
Test platforms
iPhone 6S - ~0.039 seconds per frame
iPad Air 2 - ~0.130 seconds per frame
The streaming data
Each update to the shader receives approximately 50000 complex numbers in the form of a pair of float types in a struct.
struct ComplexNumber {
    float real;
    float imaginary;
};
Kernel signature
kernel void processChannelData(const device Parameters *parameters [[ buffer(0) ]],
                               const device ComplexNumber *inputSampleData [[ buffer(1) ]],
                               const device ComplexNumber *partAs [[ buffer(2) ]],
                               const device float *partBs [[ buffer(3) ]],
                               const device int *lookups [[ buffer(4) ]],
                               device float *outputImageData [[ buffer(5) ]],
                               uint threadIdentifier [[ thread_position_in_grid ]]);
All the buffers contain - currently - unchanging data except inputSampleData which receives the 50000 samples I'll be operating on. The other buffers contain roughly 16 million values (128 channels x 130000 pixels) each. I perform some operations on each 'pixel' and sum the complex result across channels and finally take the absolute value of the complex number and assign the resulting float to outputImageData.
Dispatch
commandEncoder.setComputePipelineState(pipelineState)
commandEncoder.setBuffer(parametersMetalBuffer, offset: 0, atIndex: 0)
commandEncoder.setBuffer(inputSampleDataMetalBuffer, offset: 0, atIndex: 1)
commandEncoder.setBuffer(partAsMetalBuffer, offset: 0, atIndex: 2)
commandEncoder.setBuffer(partBsMetalBuffer, offset: 0, atIndex: 3)
commandEncoder.setBuffer(lookupsMetalBuffer, offset: 0, atIndex: 4)
commandEncoder.setBuffer(outputImageDataMetalBuffer, offset: 0, atIndex: 5)
let threadExecutionWidth = pipelineState.threadExecutionWidth
let threadsPerThreadgroup = MTLSize(width: threadExecutionWidth, height: 1, depth: 1)
let threadGroups = MTLSize(width: self.numberOfPixels / threadsPerThreadgroup.width, height: 1, depth:1)
commandEncoder.dispatchThreadgroups(threadGroups, threadsPerThreadgroup: threadsPerThreadgroup)
commandEncoder.endEncoding()
metalCommandBuffer.commit()
metalCommandBuffer.waitUntilCompleted()
GitHub example
I've written an example called Slow and put it up on GitHub. It seems the bottleneck is writing the values into the input buffer. So, I guess the question becomes: how do I avoid the bottleneck?
Memory copy
I wrote up a quick test to compare the performance of various byte copying methods.
Current Status
I've reduced execution time to around 0.02 seconds, which doesn't sound like a lot, but it makes a big difference in the number of frames per second. Currently the biggest improvements are a result of switching to cblas_scopy().
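As an illustration of that copy path (a sketch, not the exact project code; source and count are placeholders):

import Accelerate
import Metal

// Sketch: copy `count` floats into a shared MTLBuffer with cblas_scopy,
// which benchmarked faster here than looping over buffer pointers.
func copyFloats(_ source: UnsafePointer<Float>, count: Int, into buffer: MTLBuffer) {
    let destination = buffer.contents().assumingMemoryBound(to: Float.self)
    cblas_scopy(Int32(count), source, 1, destination, 1)
}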
Reduce the size of the type
Originally, I was pre-converting signed 16-bit integers to 32-bit floats, since ultimately that is how they'll be used. This is a case where performance starts forcing you to store the values as 16 bits to cut your data size in half.
Objective-C over Swift
For the code dealing with movement of data, you might choose Objective-C over Swift (Warren Moore's recommendation). The performance of Swift in these special situations still isn't up to scratch. You can also try calling out to memcpy or similar methods. I've seen a couple of examples that used for loops over buffer pointers, and in my experiments these performed slowly.
Difficulty of testing
I really wanted to do some of the experiments with the various copying methods in a playground on the machine, but unfortunately this was useless. The iOS device versions of the same experiments performed completely differently. One might think that the relative performance would be similar, but I found this to be an invalid assumption as well. It would be really convenient if you could have a playground that used the iOS device as the interpreter.
You might get a large speedup by encoding your data as Huffman codes and decoding on the GPU; see MetalHuffman. It depends on your data, though.
