Metal compute pipeline absurdly slow - iOS

I saw an opportunity to improve my app's performance by using a Metal compute pipeline. However, my initial testing revealed that the compute pipeline was absurdly slow (at least on older devices).
So I built a sample project to compare the performance of the compute and render pipelines. The program takes a 2048 x 2048 source texture and converts it to grayscale in a destination texture.
On an iPhone 5S, the fragment shader takes 3 ms to do the conversion. However, the compute kernel takes 177 ms to do the same thing. That is 59 times longer!
What is your experience with the compute pipeline on older devices? Isn't it absurdly slow?
Here are my fragment and compute functions:
// Grayscale Fragment Function
fragment half4 grayscaleFragment(RasterizerData in [[stage_in]],
                                 texture2d<half> inTexture [[texture(0)]])
{
    constexpr sampler textureSampler;
    half4 inColor = inTexture.sample(textureSampler, in.textureCoordinate);
    half gray = dot(inColor.rgb, kRec709Luma);
    return half4(gray, gray, gray, 1.0);
}

// Grayscale Kernel Function
kernel void grayscaleKernel(uint2 gid [[thread_position_in_grid]],
                            texture2d<half, access::read> inTexture [[texture(0)]],
                            texture2d<half, access::write> outTexture [[texture(1)]])
{
    half4 inColor = inTexture.read(gid);
    half gray = dot(inColor.rgb, kRec709Luma);
    outTexture.write(half4(gray, gray, gray, 1.0), gid);
}
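For reference, a bounds-checked variant of the kernel (a sketch; it assumes kRec709Luma is defined in the shader file, as above) would let the dispatch over-cover textures whose dimensions aren't exact multiples of the threadgroup size:

// Sketch: same grayscale kernel with a bounds check, so the threadgroup
// count may round up past the texture edges without out-of-bounds writes.
kernel void grayscaleKernelSafe(uint2 gid [[thread_position_in_grid]],
                                texture2d<half, access::read> inTexture [[texture(0)]],
                                texture2d<half, access::write> outTexture [[texture(1)]])
{
    if (gid.x >= outTexture.get_width() || gid.y >= outTexture.get_height()) {
        return;
    }
    half4 inColor = inTexture.read(gid);
    half gray = dot(inColor.rgb, kRec709Luma);
    outTexture.write(half4(gray, gray, gray, 1.0), gid);
}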
Compute and render methods
- (void)compute {
    id<MTLCommandBuffer> commandBuffer = [_commandQueue commandBuffer];

    // Compute encoder
    id<MTLComputeCommandEncoder> computeEncoder = [commandBuffer computeCommandEncoder];
    [computeEncoder setComputePipelineState:_computePipelineState];
    [computeEncoder setTexture:_srcTexture atIndex:0];
    [computeEncoder setTexture:_dstTexture atIndex:1];
    [computeEncoder dispatchThreadgroups:_threadgroupCount threadsPerThreadgroup:_threadgroupSize];
    [computeEncoder endEncoding];

    [commandBuffer commit];
    [commandBuffer waitUntilCompleted];
}
- (void)render {
    id<MTLCommandBuffer> commandBuffer = [_commandQueue commandBuffer];

    // Render pass descriptor
    MTLRenderPassDescriptor *renderPassDescriptor = [MTLRenderPassDescriptor renderPassDescriptor];
    renderPassDescriptor.colorAttachments[0].loadAction = MTLLoadActionDontCare;
    renderPassDescriptor.colorAttachments[0].texture = _dstTexture;
    renderPassDescriptor.colorAttachments[0].storeAction = MTLStoreActionStore;

    // Render encoder
    id<MTLRenderCommandEncoder> renderEncoder = [commandBuffer renderCommandEncoderWithDescriptor:renderPassDescriptor];
    [renderEncoder setRenderPipelineState:_renderPipelineState];
    [renderEncoder setFragmentTexture:_srcTexture atIndex:0];
    [renderEncoder drawPrimitives:MTLPrimitiveTypeTriangleStrip vertexStart:0 vertexCount:4];
    [renderEncoder endEncoding];

    [commandBuffer commit];
    [commandBuffer waitUntilCompleted];
}
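To rule out CPU scheduling overhead in measurements like these, the command buffer's GPU timestamps (available from iOS 10.3) can be read once the work completes; a minimal sketch:

// Sketch: GPUStartTime/GPUEndTime are CFTimeInterval seconds and are
// valid only after the command buffer has completed.
[commandBuffer waitUntilCompleted];
CFTimeInterval gpuTime = commandBuffer.GPUEndTime - commandBuffer.GPUStartTime;
NSLog(@"GPU time: %.3f ms", gpuTime * 1000.0);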
And Metal setup:
- (void)setupMetal
{
    // Get the Metal device
    _device = MTLCreateSystemDefaultDevice();

    // Create the command queue
    _commandQueue = [_device newCommandQueue];
    id<MTLLibrary> defaultLibrary = [_device newDefaultLibrary];

    // Create compute pipeline state
    _computePipelineState = [_device newComputePipelineStateWithFunction:[defaultLibrary newFunctionWithName:@"grayscaleKernel"] error:nil];

    // Create render pipeline state
    MTLRenderPipelineDescriptor *pipelineStateDescriptor = [[MTLRenderPipelineDescriptor alloc] init];
    pipelineStateDescriptor.vertexFunction = [defaultLibrary newFunctionWithName:@"vertexShader"];
    pipelineStateDescriptor.fragmentFunction = [defaultLibrary newFunctionWithName:@"grayscaleFragment"];
    pipelineStateDescriptor.colorAttachments[0].pixelFormat = MTLPixelFormatBGRA8Unorm;
    _renderPipelineState = [_device newRenderPipelineStateWithDescriptor:pipelineStateDescriptor error:nil];

    // Create the source and destination texture descriptor.
    // Since the compute kernel doesn't check whether pixels are within the bounds of the destination
    // texture, make sure the texture width and height are multiples of the pipeline's
    // threadExecutionWidth and (maxTotalThreadsPerThreadgroup / threadExecutionWidth), respectively.
    MTLTextureDescriptor *textureDescriptor = [MTLTextureDescriptor texture2DDescriptorWithPixelFormat:MTLPixelFormatBGRA8Unorm
                                                                                                 width:2048
                                                                                                height:2048
                                                                                             mipmapped:NO];

    // Create the source texture
    textureDescriptor.usage = MTLTextureUsageShaderRead;
    _srcTexture = [_device newTextureWithDescriptor:textureDescriptor];

    // Create the destination texture
    textureDescriptor.usage = MTLTextureUsageShaderWrite | MTLTextureUsageRenderTarget;
    _dstTexture = [_device newTextureWithDescriptor:textureDescriptor];

    // Set the compute kernel's threadgroup size
    NSUInteger threadWidth = _computePipelineState.threadExecutionWidth;
    NSUInteger threadMax = _computePipelineState.maxTotalThreadsPerThreadgroup;
    _threadgroupSize = MTLSizeMake(threadWidth, threadMax / threadWidth, 1);

    // Set the compute kernel's threadgroup count
    _threadgroupCount.width = (_srcTexture.width + _threadgroupSize.width - 1) / _threadgroupSize.width;
    _threadgroupCount.height = (_srcTexture.height + _threadgroupSize.height - 1) / _threadgroupSize.height;
    _threadgroupCount.depth = 1;
}

The Metal compute pipeline is essentially unusable on A7-class CPU/GPU devices, while the same compute pipeline performs very well on A8 and newer devices. Your options for dealing with this are to create fragment shader implementations for A7 devices and use compute kernels on everything newer, or to offload the computation to the CPU on A7 (devices of this class have at least 2 CPU cores). You could also just use fragment shaders on all devices, but complex code can perform much better as a compute kernel, so it is something to think about.
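If you branch per device, a runtime feature-set check keeps this to one code path (a sketch using the sample's own methods; A7 GPUs report only iOS GPU family 1, while A8 and newer also report family 2):

// Sketch: choose the compute path on A8+ and fall back to the fragment
// path on A7-class devices.
BOOL useCompute = [_device supportsFeatureSet:MTLFeatureSet_iOS_GPUFamily2_v1];
if (useCompute) {
    [self compute];
} else {
    [self render];
}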

Related

Metal Texture is not filterable

I am trying to generate mipmaps for a texture contained in an MTLTexture object. This texture was loaded from an OpenCV Mat. I can correctly run kernels on this texture, so I know my import process is correct.
Unfortunately, the generate-mipmaps call fails with this rather opaque error. I get a similar error even if I change temp to be BGRA.
-[MTLDebugBlitCommandEncoder generateMipmapsForTexture:]:1074:
failed assertion `tex(MTLPixelFormatR8Uint) is not filterable.'
// create an MTLTexture
{
    MTLTextureDescriptor *textureDescriptor = [MTLTextureDescriptor
        texture2DDescriptorWithPixelFormat:MTLPixelFormatR8Uint
                                     width:cols
                                    height:rows
                                 mipmapped:NO];
    textureDescriptor.usage = MTLTextureUsageShaderRead;
    _mImgTex = [_mDevice newTextureWithDescriptor:textureDescriptor];
}
{
    MTLTextureDescriptor *textureDescriptor = [MTLTextureDescriptor
        texture2DDescriptorWithPixelFormat:MTLPixelFormatR8Uint
                                     width:cols
                                    height:rows
                                 mipmapped:YES];
    textureDescriptor.mipmapLevelCount = 5;
    textureDescriptor.usage = MTLTextureUsageShaderRead | MTLTextureUsageShaderWrite;
    _mPyrTex = [_mDevice newTextureWithDescriptor:textureDescriptor];
}
// copy data to the GPU
cv::Mat temp;
cv::cvtColor(image, temp, cv::COLOR_BGRA2GRAY);
MTLRegion region = MTLRegionMake2D(0, 0, cols, rows);
const int bytesPerPixel = 1 * 1; // 1 byte per channel * 1 channel
const int bytesPerRow = bytesPerPixel * cols;
[_mImgTex replaceRegion:region mipmapLevel:0 withBytes:temp.data bytesPerRow:bytesPerRow];

// try to mipmap
id<MTLBlitCommandEncoder> blitEncoder = [commandBuffer blitCommandEncoder];
MTLOrigin origin = MTLOriginMake(0, 0, 0);
MTLSize size = MTLSizeMake(cols, rows, 1);
[blitEncoder copyFromTexture:_mImgTex sourceSlice:0 sourceLevel:0 sourceOrigin:origin sourceSize:size toTexture:_mPyrTex destinationSlice:0 destinationLevel:0 destinationOrigin:origin];
[blitEncoder generateMipmapsForTexture:_mPyrTex];
[blitEncoder endEncoding];
The documentation for generateMipmapsForTexture: says:
Mipmap generation works only for textures with color-renderable and color-filterable pixel formats.
If you look at the "Pixel Format Capabilities" table here, you can see that R8Uint supports neither filtering (Filter) nor color rendering (Color).
Perhaps R8Unorm (MTLPixelFormatR8Unorm) will work for your needs. Otherwise, you might need to write your own mip-generation code with a compute kernel (although I'm not sure there's a use case for mipmaps of non-filterable textures).
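Concretely, switching the pyramid texture over might look like this (a sketch; it assumes the rest of your pipeline can work with normalized 0-1 values instead of integer counts):

// Sketch: R8Unorm is both color-renderable and filterable, so
// generateMipmapsForTexture: should accept it. Kernels must then read
// and write normalized floats rather than uints.
MTLTextureDescriptor *textureDescriptor = [MTLTextureDescriptor
    texture2DDescriptorWithPixelFormat:MTLPixelFormatR8Unorm
                                 width:cols
                                height:rows
                             mipmapped:YES];
textureDescriptor.mipmapLevelCount = 5;
textureDescriptor.usage = MTLTextureUsageShaderRead | MTLTextureUsageShaderWrite;
_mPyrTex = [_mDevice newTextureWithDescriptor:textureDescriptor];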

Compute Kernel Metal - How to retrieve results and debug?

I've downloaded Apple's TrueDepth Streamer example and am trying to add a compute pipeline. I think I'm retrieving the results of the computation, but am not sure, as they all seem to be zero.
I'm a beginner at iOS development, so there may be quite a few mistakes; please bear with me!
The pipeline setup (I wasn't quite sure how to create the results buffer, since the kernel outputs a float3):
int resultsCount = CVPixelBufferGetWidth(depthFrame) * CVPixelBufferGetHeight(depthFrame);

// because I will be outputting 3 floats for each value in the depth frame
id<MTLBuffer> resultsBuffer = [self.device newBufferWithLength:(sizeof(float) * 3 * resultsCount) options:MTLResourceOptionCPUCacheModeDefault];

_threadgroupSize = MTLSizeMake(16, 16, 1);

// Calculate the number of rows and columns of threadgroups given the width of the input image.
// Ensure that you cover the entire image (or more) so you process every pixel.
_threadgroupCount.width = (inTexture.width + _threadgroupSize.width - 1) / _threadgroupSize.width;
_threadgroupCount.height = (inTexture.height + _threadgroupSize.height - 1) / _threadgroupSize.height;
// Since we're only dealing with a 2D data set, set depth to 1
_threadgroupCount.depth = 1;

id<MTLComputeCommandEncoder> computeEncoder = [commandBuffer computeCommandEncoder];
[computeEncoder setComputePipelineState:_computePipelineState];
[computeEncoder setTexture:inTexture atIndex:0];
[computeEncoder setBuffer:resultsBuffer offset:0 atIndex:1];
[computeEncoder setBytes:&intrinsics length:sizeof(intrinsics) atIndex:0];
[computeEncoder dispatchThreadgroups:_threadgroupCount threadsPerThreadgroup:_threadgroupSize];
[computeEncoder endEncoding];

// Finalize rendering here & push the command buffer to the GPU
[commandBuffer commit];
// for testing
[commandBuffer waitUntilCompleted];
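(Aside: once this works, I understand the blocking waitUntilCompleted can be replaced with a completion handler that reads the buffer when the GPU is done; a sketch of what I mean:)

// Sketch: non-blocking readback. Register the handler before commit;
// it runs once the GPU has finished, so the shared buffer is safe to read.
[commandBuffer addCompletedHandler:^(id<MTLCommandBuffer> cb) {
    float *results = (float *)resultsBuffer.contents;
    NSLog(@"first value: %f", results[0]);
}];
[commandBuffer commit];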
I have added the following compute kernel:
kernel void
calc(texture2d<float, access::read> inTexture [[texture(0)]],
     device float3 *resultsBuffer [[buffer(1)]],
     constant float3x3 &cameraIntrinsics [[buffer(0)]],
     uint2 gid [[thread_position_in_grid]])
{
    float val = inTexture.read(gid).x * 1000.0f;
    float xrw = (gid.x - cameraIntrinsics[2][0]) * val / cameraIntrinsics[0][0];
    float yrw = (gid.y - cameraIntrinsics[2][1]) * val / cameraIntrinsics[1][1];
    int vertex_id = ((gid.y * inTexture.get_width()) + gid.x);
    resultsBuffer[vertex_id] = float3(xrw, yrw, val);
}
Code for seeing the buffer result (I tried two different ways, and both output all zeroes at the moment):
void *output = [resultsBuffer contents];
for (int i = 0; i < 10; ++i) {
    NSLog(@"value is %f", *(float *)(output)); //= *(float *)(output + 4 * i);
}

NSData *data = [NSData dataWithBytesNoCopy:resultsBuffer.contents length:(sizeof(float) * 3 * resultsCount) freeWhenDone:NO];
float *finalArray = new float[resultsCount * 3];
[data getBytes:&finalArray[0] length:sizeof(finalArray)];
for (int i = 0; i < 10; ++i) {
    NSLog(@"here is output %f", finalArray[i]);
}
I see a couple of problems here, but neither of them is related to your Metal code per se.
In your first output loop, as written, you're just printing the first element of the results buffer 10 times. The first element may legitimately be 0, leading you to believe all of the results are zero. But when I changed the first log line to
NSLog(#"value is %f", ((float *)output)[i]);
I saw different values printed when running your kernel on a test image.
The other issue is related to your getBytes:length: call. You want to pass the number of bytes to copy, but sizeof(finalArray) is actually the size of the finalArray pointer, i.e., 4 bytes, not the total size of the buffer it points to. This is an extremely common error in C and C++ code.
Instead, you can use the same byte count as the one you used when allocating space:
[data getBytes:&finalArray[0] length:(sizeof(float) * 3 * resultsCount)];
You should then find that you get the same (non-zero) values printed as in the previous step.
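One more caveat worth double-checking: in the Metal Shading Language, a device float3 occupies 16 bytes, so indexing a tightly packed 12-byte-stride allocation like yours through device float3 * can read and write past where you expect; declaring the kernel parameter as device packed_float3 * matches this layout. With that, a direct readback from the shared buffer looks like:

// Sketch: read the tightly packed (x, y, z) triples straight from the
// buffer contents, assuming the kernel writes packed_float3 values.
const float *results = (const float *)resultsBuffer.contents;
for (int i = 0; i < 10; ++i) {
    NSLog(@"xyz[%d] = (%f, %f, %f)", i,
          results[3 * i], results[3 * i + 1], results[3 * i + 2]);
}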

Error evaluating CoreML custom layer "----" on GPU?

I get this error without any other details.
This is the Metal code:
#include <metal_stdlib>
using namespace metal;

kernel void copy(texture2d_array<half, access::read> in_texture [[texture(0)]],
                 texture2d_array<half, access::write> out_texture [[texture(1)]],
                 ushort3 gid [[thread_position_in_grid]])
{
    if (gid.x >= out_texture.get_width() || gid.y >= out_texture.get_height()) {
        return;
    }
    const float4 x = float4(in_texture.read(gid.xy, gid.z));
    out_texture.write(half4(x), gid.xy, gid.z);
}
This is the encodeToCommandBuffer: implementation.
- (BOOL)encodeToCommandBuffer:(id<MTLCommandBuffer>)commandBuffer inputs:(NSArray<id<MTLTexture>> *)inputs outputs:(NSArray<id<MTLTexture>> *)outputs error:(NSError *__autoreleasing _Nullable *)error
{
    id<MTLTexture> input = inputs[0];   // single input/output assumed
    id<MTLTexture> output = outputs[0];

    auto encoder = [commandBuffer computeCommandEncoder];
    [encoder setTexture:input atIndex:0];
    [encoder setTexture:output atIndex:1];
    [encoder setComputePipelineState:_pipeline];

    // Set the compute kernel's threadgroup size to 16x16
    MTLSize thread_group_size = MTLSizeMake(16, 16, 1);
    MTLSize thread_group_count = MTLSizeMake(0, 0, 1);

    // Calculate the number of rows and columns of threadgroups given the width of the input image.
    // Ensure that you cover the entire image (or more) so you process every pixel.
    thread_group_count.width = (input.width + thread_group_size.width - 1) / thread_group_size.width;
    thread_group_count.height = (input.height + thread_group_size.height - 1) / thread_group_size.height;

    [encoder dispatchThreadgroups:thread_group_count threadsPerThreadgroup:thread_group_size];
    [encoder endEncoding];

    return YES;
}
Based on: https://machinethink.net/blog/coreml-custom-layers/
What can I use to pin down the error?
I tried using an empty encodeToCommandBuffer and I still get the same error.

Metal vertex shader draw points of a Texture

I want to execute a Metal (or OpenGL ES 3.0) shader that draws point primitives with blending. To do that, I need to pass all the pixel coordinates of the texture to the vertex shader as vertices; the vertex shader computes the position of each vertex to be passed to the fragment shader, and the fragment shader simply outputs the color for the point with blending enabled. My question is whether there is an efficient way to pass the coordinates of the vertices to the vertex shader, since there would be too many vertices for a 1920x1080 image, and this needs to be done 30 times a second. In a compute shader we would do it with the dispatchThreadgroups command, except that a compute shader cannot draw geometry with blending enabled.
EDIT: This is what I did -
let vertexFunctionRed = library!.makeFunction(name: "vertexShaderHistogramBlenderRed")
let fragmentFunctionAccumulator = library!.makeFunction(name: "fragmentShaderHistogramAccumulator")
let renderPipelineDescriptorRed = MTLRenderPipelineDescriptor()
renderPipelineDescriptorRed.vertexFunction = vertexFunctionRed
renderPipelineDescriptorRed.fragmentFunction = fragmentFunctionAccumulator
renderPipelineDescriptorRed.colorAttachments[0].pixelFormat = .bgra8Unorm
renderPipelineDescriptorRed.colorAttachments[0].isBlendingEnabled = true
renderPipelineDescriptorRed.colorAttachments[0].rgbBlendOperation = .add
renderPipelineDescriptorRed.colorAttachments[0].sourceRGBBlendFactor = .one
renderPipelineDescriptorRed.colorAttachments[0].destinationRGBBlendFactor = .one
do {
    histogramPipelineRed = try device.makeRenderPipelineState(descriptor: renderPipelineDescriptorRed)
} catch {
    print("Unable to compile render pipeline state Histogram Red!")
    return
}
Drawing code:
let commandBuffer = commandQueue?.makeCommandBuffer()
let renderEncoder = commandBuffer?.makeRenderCommandEncoder(descriptor: renderPassDescriptor!)
renderEncoder?.setRenderPipelineState(histogramPipelineRed!)
renderEncoder?.setVertexTexture(metalTexture, index: 0)
renderEncoder?.drawPrimitives(type: .point, vertexStart: 0, vertexCount: 1, instanceCount: metalTexture!.width*metalTexture!.height)
renderEncoder?.drawPrimitives(type: .point, vertexStart: 0, vertexCount: metalTexture!.width*metalTexture!.height, instanceCount: 1)
and Shaders:
vertex MappedVertex vertexShaderHistogramBlenderRed(texture2d<float, access::sample> inputTexture [[texture(0)]],
                                                    unsigned int vertexId [[vertex_id]])
{
    MappedVertex out;
    constexpr sampler s(s_address::clamp_to_edge, t_address::clamp_to_edge, min_filter::linear, mag_filter::linear, coord::pixel);
    ushort width = inputTexture.get_width();
    ushort height = inputTexture.get_height();
    float X = (vertexId % width) / (1.0 * width);
    float Y = (vertexId / width) / (1.0 * height);
    int red = inputTexture.sample(s, float2(X, Y)).r;
    out.position = float4(-1.0 + (red * 0.0078125), 0.0, 0.0, 1.0);
    out.pointSize = 1.0;
    out.colorFactor = half3(1.0, 0.0, 0.0);
    return out;
}

fragment half4 fragmentShaderHistogramAccumulator(MappedVertex in [[stage_in]])
{
    half3 colorFactor = in.colorFactor;
    return half4(colorFactor * (1.0 / 256.0), 1.0);
}
Maybe you can draw a single point instanced 1920x1080 times. Something like:
vertex float4 my_func(texture2d<float, access::read> image [[texture(0)]],
                      constant uint &width [[buffer(0)]],
                      uint instance_id [[instance_id]])
{
    // decompose the instance ID to a position
    uint2 pos = uint2(instance_id % width, instance_id / width);
    return float4(image.read(pos).r * 255, 0, 0, 0);
}
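Host-side, the matching draw call could look like this (a sketch in Objective-C for consistency with the rest of this page; metalTexture and renderEncoder are placeholders for your own objects):

// Sketch: draw one point, instanced once per pixel. The width uniform
// lets the vertex function decompose instance_id into (x, y).
uint32_t width = (uint32_t)metalTexture.width;
[renderEncoder setVertexBytes:&width length:sizeof(width) atIndex:0];
[renderEncoder setVertexTexture:metalTexture atIndex:0];
[renderEncoder drawPrimitives:MTLPrimitiveTypePoint
                  vertexStart:0
                  vertexCount:1
                instanceCount:metalTexture.width * metalTexture.height];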

Median Filter for ios

I'm looking to apply a median filter to a UIImage in my iOS application.
Due to my company's restrictions, I cannot use OpenGL filters.
Any ideas or existing implementations would be very welcome.
Thanks.
Apple's Core Image framework may be your solution. To be precise, you need a subclass of CIFilter that implements a median filter. (I guess you would be interested in CIMedianFilter; have a look at the filter reference.)
CIImage *inputImage = // ...
CIFilter *filter = [CIFilter filterWithName:@"CIMedianFilter"];
[filter setDefaults];
[filter setValue:inputImage forKey:@"inputImage"];
CIImage *outputImage = [filter outputImage];
To convert the CIImage to UIImage and vice versa:
CIImage *ciImage = [UIImage imageNamed:@"test.png"].CIImage;
UIImage *uiImage = [[UIImage alloc] initWithCIImage:ciImage];
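One caveat: UIImage.CIImage is nil unless the UIImage was itself created from a CIImage, so the conversion above can silently fail for images loaded from files. A more robust round-trip (a sketch) renders through a CIContext:

// Sketch: explicit round-trip that also works for CGImage-backed UIImages.
CIImage *ciImage = [CIImage imageWithCGImage:[UIImage imageNamed:@"test.png"].CGImage];
// ... run the filter on ciImage as above ...
CIContext *context = [CIContext contextWithOptions:nil];
CGImageRef cgImage = [context createCGImage:outputImage fromRect:outputImage.extent];
UIImage *uiImage = [UIImage imageWithCGImage:cgImage];
CGImageRelease(cgImage);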
Look at the github project:
https://github.com/BradLarson/GPUImage
It has a lot of CIFilter-style filters that are GPU accelerated.
Another alternative: use a convolution filter and roll your own median filter using CI's kernel language.
For those of you who can use OpenGL ES in your iOS app, this is how you calculate the median in a pixel neighborhood radius of your choosing:
kernel vec4 medianUnsharpKernel(sampler u) {
    vec4 pixel = unpremultiply(sample(u, samplerCoord(u)));
    vec2 xy = destCoord();
    int radius = 3;
    int bounds = (radius - 1) / 2;
    vec4 sum = vec4(0.0);
    for (int i = (0 - bounds); i <= bounds; i++)
    {
        for (int j = (0 - bounds); j <= bounds; j++)
        {
            sum += unpremultiply(sample(u, samplerTransform(u, vec2(xy + vec2(i, j)))));
        }
    }
    vec4 mean = vec4(sum / vec4(pow(float(radius), 2.0)));
    float mean_avg = float(mean);
    float comp_avg = 0.0;
    vec4 comp = vec4(0.0);
    vec4 median = mean;
    for (int i = (0 - bounds); i <= bounds; i++)
    {
        for (int j = (0 - bounds); j <= bounds; j++)
        {
            comp = unpremultiply(sample(u, samplerTransform(u, vec2(xy + vec2(i, j)))));
            comp_avg = float(comp);
            median = (comp_avg < mean_avg) ? max(median, comp) : median;
        }
    }
    return premultiply(vec4(vec3(abs(pixel.rgb - median.rgb)), 1.0));
}
A brief description of the steps:
1. Calculate the mean of the values of the pixels surrounding the source pixel in a 3x3 neighborhood.
2. Find the maximum pixel value of all pixels in the same neighborhood that are less than the mean.
3. [OPTIONAL] Subtract the median pixel value from the source pixel value for edge detection.
If you're using the median value for edge detection, there are a couple of ways to modify the above code for better results, namely hybrid median filtering and truncated median filtering (a substitute for, and better than, 'mode' filtering). If you're interested, please ask.
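If you wrap the kernel above yourself, applying it looks roughly like this (a sketch; the ROI callback pads the destination rect by the kernel's one-pixel neighborhood bounds):

// Sketch: compile the kernel source shown above and apply it to a CIImage.
NSString *source = @"kernel vec4 medianUnsharpKernel(sampler u) { /* ... body above ... */ }";
CIKernel *kernel = [CIKernel kernelWithString:source];
CIImage *result = [kernel applyWithExtent:inputImage.extent
                              roiCallback:^CGRect(int index, CGRect destRect) {
                                  // sampling reaches 1 pixel beyond each output pixel
                                  return CGRectInset(destRect, -1.0, -1.0);
                              }
                                arguments:@[inputImage]];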
If someone would like to do it in Xamarin (for iOS), here is the code:
public static UIImage MedianFilter(UIImage image)
{
    CIImage ciImage = new CIImage(image);
    var medianFilter = new CIMedianFilter() { Image = ciImage };
    CIImage output = medianFilter.OutputImage;
    var context = CIContext.FromOptions(null);
    var cgimage = context.CreateCGImage(output, output.Extent);
    var uiImage = UIImage.FromImage(cgimage);
    return uiImage;
}
