How to Speed Up Metal Code for iOS/macOS

I'm trying to implement code in Metal that performs a 1D convolution between two vectors. I've implemented the following, which works correctly:
kernel void convolve(const device float *dataVector [[ buffer(0) ]],
                     const device int& dataSize [[ buffer(1) ]],
                     const device float *filterVector [[ buffer(2) ]],
                     const device int& filterSize [[ buffer(3) ]],
                     device float *outVector [[ buffer(4) ]],
                     uint id [[ thread_position_in_grid ]]) {
    int outputSize = dataSize - filterSize + 1;
    for (int i = 0; i < outputSize; i++) {
        float sum = 0.0;
        for (int j = 0; j < filterSize; j++) {
            sum += dataVector[i+j] * filterVector[j];
        }
        outVector[i] = sum;
    }
}
My problem is that processing the same data with Metal (computation plus data transfer to/from the GPU) takes about 10 times longer than doing it in Swift on the CPU. My question is: how do I replace the inner loop with a single vector operation, or is there another way to speed up the above code?
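For reference, the CPU side of this comparison can itself be reduced to a single vectorized call with the Accelerate framework; a minimal sketch, assuming dataVector and filterVector are [Float] arrays matching the kernel's inputs:

import Accelerate

// With a positive filter stride, vDSP_conv computes the same sliding dot product
// (correlation) as the kernel above; output length = dataCount - filterCount + 1.
let outputCount = dataVector.count - filterVector.count + 1
var output = [Float](repeating: 0, count: outputCount)
vDSP_conv(dataVector, 1,
          filterVector, 1,
          &output, 1,
          vDSP_Length(outputCount),
          vDSP_Length(filterVector.count))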

The key to taking advantage of the GPU's parallelism in this case is to let it manage the outer loop for you. Instead of invoking the kernel once for the entire data vector, we'll invoke it for each element in the data vector. The kernel function simplifies to this:
kernel void convolve(const device float *dataVector [[ buffer(0) ]],
                     const constant int &dataSize [[ buffer(1) ]],
                     const constant float *filterVector [[ buffer(2) ]],
                     const constant int &filterSize [[ buffer(3) ]],
                     device float *outVector [[ buffer(4) ]],
                     uint id [[ thread_position_in_grid ]])
{
    float sum = 0.0;
    for (int i = 0; i < filterSize; ++i) {
        sum += dataVector[id + i] * filterVector[i];
    }
    outVector[id] = sum;
}
In order to dispatch this work, we select a threadgroup size based on the thread execution width recommended by the compute pipeline state. The one tricky thing here is making sure that there's enough padding in the input and output buffers so that we can slightly overrun the actual size of the data. This does cause us to waste a small amount of memory and computation, but saves us the complexity of doing a separate dispatch just to compute the convolution for the elements at the end of the buffer.
// We should ensure here that the data buffer and output buffer each have a size that is a multiple of
// the compute pipeline's threadExecutionWidth, by padding the amount we allocate for each of them.
// After execution, we ignore the extraneous elements in the output buffer beyond the first (dataCount - filterCount + 1).
let iterationCount = dataCount - filterCount + 1
let threadsPerThreadgroup = MTLSize(width: min(iterationCount, computePipeline.threadExecutionWidth), height: 1, depth: 1)
let threadgroups = (iterationCount + threadsPerThreadgroup.width - 1) / threadsPerThreadgroup.width
let threadgroupsPerGrid = MTLSize(width: threadgroups, height: 1, depth: 1)
let commandEncoder = commandBuffer.computeCommandEncoder()
commandEncoder.setComputePipelineState(computePipeline)
commandEncoder.setBuffer(dataBuffer, offset: 0, at: 0)
commandEncoder.setBytes(&dataCount, length: MemoryLayout<Int>.stride, at: 1)
commandEncoder.setBuffer(filterBuffer, offset: 0, at: 2)
commandEncoder.setBytes(&filterCount, length: MemoryLayout<Int>.stride, at: 3)
commandEncoder.setBuffer(outBuffer, offset: 0, at: 4)
commandEncoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
commandEncoder.endEncoding()
In my experiments, this parallelized approach runs 400-1000x faster than the serial version in the question. I'm curious to hear how it compares to your CPU implementation.
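A minimal sketch of how the padding mentioned in the code comment could be set up, assuming device, computePipeline, dataCount, and filterCount from the surrounding code:

let w = computePipeline.threadExecutionWidth
let outputCount = dataCount - filterCount + 1
let paddedOutputCount = ((outputCount + w - 1) / w) * w    // round up to a multiple of w
let paddedDataCount = paddedOutputCount + filterCount - 1  // covers the furthest read: id + filterCount - 1

let dataBuffer = device.makeBuffer(length: paddedDataCount * MemoryLayout<Float>.stride,
                                   options: .storageModeShared)!
let outBuffer = device.makeBuffer(length: paddedOutputCount * MemoryLayout<Float>.stride,
                                  options: .storageModeShared)!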

The following code shows how to encode rendering commands in parallel for the GPU using the Objective-C Metal API (the threading code above only divides rendering of the output into grid sections for parallel processing; the calculations themselves are still not performed in parallel). It is what you're referring to in your question, even though it's not exactly what you want. I've provided this answer to help anyone who might have stumbled upon this question thinking it was going to be about parallel rendering (when, in fact, it is not):
- (void)drawInMTKView:(MTKView *)view
{
    dispatch_async(((AppDelegate *)UIApplication.sharedApplication.delegate).cameraViewQueue, ^{
        id <CAMetalDrawable> drawable = [view currentDrawable]; //[(CAMetalLayer *)view.layer nextDrawable];
        MTLRenderPassDescriptor *renderPassDesc = [view currentRenderPassDescriptor];
        renderPassDesc.colorAttachments[0].loadAction = MTLLoadActionClear;
        renderPassDesc.colorAttachments[0].clearColor = MTLClearColorMake(0.0, 0.0, 0.0, 1.0);
        renderPassDesc.renderTargetWidth = self.texture.width;
        renderPassDesc.renderTargetHeight = self.texture.height;
        renderPassDesc.colorAttachments[0].texture = drawable.texture;
        if (renderPassDesc != nil)
        {
            dispatch_semaphore_wait(self._inflight_semaphore, DISPATCH_TIME_FOREVER);
            id <MTLCommandBuffer> commandBuffer = [self.metalContext.commandQueue commandBuffer];
            [commandBuffer enqueue];
            // START PARALLEL RENDERING OPERATIONS HERE
            id <MTLParallelRenderCommandEncoder> parallelRCE = [commandBuffer parallelRenderCommandEncoderWithDescriptor:renderPassDesc];
            // FIRST PARALLEL RENDERING OPERATION
            id <MTLRenderCommandEncoder> renderEncoder = [parallelRCE renderCommandEncoder];
            [renderEncoder setRenderPipelineState:self.metalContext.renderPipelineState];
            [renderEncoder setVertexBuffer:self.metalContext.vertexBuffer offset:0 atIndex:0];
            [renderEncoder setVertexBuffer:self.metalContext.uniformBuffer offset:0 atIndex:1];
            [renderEncoder setFragmentBuffer:self.metalContext.uniformBuffer offset:0 atIndex:0];
            [renderEncoder setFragmentTexture:self.texture atIndex:0];
            [renderEncoder drawPrimitives:MTLPrimitiveTypeTriangleStrip
                              vertexStart:0
                              vertexCount:4
                            instanceCount:1];
            [renderEncoder endEncoding];
            // ADD SECOND, THIRD, ETC. PARALLEL RENDERING OPERATION HERE
            // ...
            // SUBMIT ALL RENDERING OPERATIONS IN PARALLEL HERE
            [parallelRCE endEncoding];
            __block dispatch_semaphore_t block_sema = self._inflight_semaphore;
            [commandBuffer addCompletedHandler:^(id<MTLCommandBuffer> buffer) {
                dispatch_semaphore_signal(block_sema);
            }];
            if (drawable)
                [commandBuffer presentDrawable:drawable];
            [commandBuffer commit];
            [commandBuffer waitUntilScheduled];
        }
    });
}
In the above example, you would duplicate the renderEncoder-related code for each operation you want to perform in parallel. I don't see how this would benefit your code example, as one operation appears to be dependent on another. The best you can probably hope for, then, is the code provided to you by warrenm, even though that doesn't really qualify as parallel rendering.
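Note that the main benefit of MTLParallelRenderCommandEncoder is that the sub-encoders can be filled from different CPU threads while still targeting a single render pass; the commands execute in the order the sub-encoders were created. A minimal Swift sketch of that pattern, assuming commandBuffer, renderPassDesc, and pipelineState are already set up:

let parallelEncoder = commandBuffer.makeParallelRenderCommandEncoder(descriptor: renderPassDesc)!
// Create one sub-encoder per worker thread up front; execution order follows creation order.
let subEncoders = (0..<2).map { _ in parallelEncoder.makeRenderCommandEncoder()! }
DispatchQueue.concurrentPerform(iterations: subEncoders.count) { i in
    let encoder = subEncoders[i]
    encoder.setRenderPipelineState(pipelineState)
    // ... encode this thread's share of the draw calls ...
    encoder.endEncoding()
}
// All sub-encoders have finished encoding; close the parallel encoder.
parallelEncoder.endEncoding()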

Related

Optimize metal compute shader for image histogram

I have a metal shader that computes an image histogram like this:
#define CHANNEL_SIZE (256)

typedef atomic_uint HistoBuffer[CHANNEL_SIZE];

kernel void
computeHisto(texture2d<half, access::read> sourceTexture [[ texture(0) ]],
             device HistoBuffer &histo [[ buffer(0) ]],
             uint2 grid [[ thread_position_in_grid ]]) {
    if (grid.x >= sourceTexture.get_width() || grid.y >= sourceTexture.get_height()) { return; }
    half gray = sourceTexture.read(grid).r;
    uint grayvalue = uint(gray * (CHANNEL_SIZE - 1));
    atomic_fetch_add_explicit(&histo[grayvalue], 1, memory_order_relaxed);
}
This works as expected but takes too long (>1 ms). I then tried to optimise it by reducing the number of atomic operations, and came up with the following improved code. The idea is to compute local histograms per threadgroup and later add them atomically into the global histo buffer.
kernel void
computeHisto_fast(texture2d<half, access::read> sourceTexture [[ texture(0) ]],
                  device HistoBuffer &histo [[ buffer(0) ]],
                  uint2 t_pos_grid [[ thread_position_in_grid ]],
                  uint2 tg_pos_grid [[ threadgroup_position_in_grid ]],
                  uint2 t_pos_tg [[ thread_position_in_threadgroup ]],
                  uint t_idx_tg [[ thread_index_in_threadgroup ]],
                  uint2 t_per_tg [[ threads_per_threadgroup ]])
{
    threadgroup uint localhisto[CHANNEL_SIZE] = { 0 };

    if (t_pos_grid.x >= sourceTexture.get_width() || t_pos_grid.y >= sourceTexture.get_height()) { return; }

    half gray = sourceTexture.read(t_pos_grid).r;
    uint grayvalue = uint(gray * (CHANNEL_SIZE - 1));
    localhisto[grayvalue]++;

    // wait for all threads in threadgroup to finish
    threadgroup_barrier(mem_flags::mem_none);

    // copy the thread group result atomically into global histo buffer
    if (t_idx_tg == 0) {
        for (uint i = 0; i < CHANNEL_SIZE; i++) {
            atomic_fetch_add_explicit(&histo[i], localhisto[i], memory_order_relaxed);
        }
    }
}
There are two problems:
1. The improved routine does not yield results identical to the first one, and I currently don't see why.
2. The runtime didn't improve. In fact, it takes four times the runtime of the unoptimised version. According to the debugger the for loop is the problem. But I do not understand this, since the number of atomic operations is reduced by three orders of magnitude, i.e. the threadgroup size, here (32x32) = 1024.
Anybody who can explain what I am doing wrong here? Thanks
EDIT 2019-12-22:
Following Matthijs' answer, I have also changed the local histogram to use atomic operations, like this:
threadgroup atomic_uint localhisto[CHANNEL_SIZE] = {0};
half gray = sourceTexture.read(t_pos_grid).r;
uint grayvalue = uint(gray * (CHANNEL_SIZE - 1));
atomic_fetch_add_explicit(&localhisto[grayvalue], 1, memory_order_relaxed);
However, the result still is not the same as in the reference implementation above. There must be another severe conceptual bug?
You'll still need to use atomic operations on the threadgroup memory, since it's still being shared by multiple threads. This should be faster than in your first version because there is less contention for the same locks.
I think the problem is with initializing the shared memory; I don't think this definition does the job. Also, threadgroup-level memory synchronization is required between zeroing the shared memory and the atomic updates.
As for the device memory update, doing it with a single thread is clearly suboptimal. Updating the whole 256-entry histogram in each threadgroup can have a huge overhead depending on the threadgroup size.
A sample I used for a small (16-element) histogram with 8x8 threadgroups:
kernel void gaussian_filter(device const uchar *data,
                            device atomic_uint *p_hist,
                            uint2 imageShape [[ threads_per_grid ]],
                            uint2 idx [[ thread_position_in_grid ]],
                            uint tidx [[ thread_index_in_threadgroup ]])
{
    threadgroup atomic_uint sh_hist[16];
    if (tidx < 16)
        atomic_store_explicit(sh_hist + tidx, 0, memory_order_relaxed);
    threadgroup_barrier(mem_flags::mem_threadgroup);

    uint histBin = (uint)data[imageShape[0]*idx[1] + idx[0]] / 16;
    atomic_fetch_add_explicit(sh_hist + histBin, 1, memory_order_relaxed);
    threadgroup_barrier(mem_flags::mem_threadgroup);

    if (tidx < 16)
        atomic_fetch_add_explicit(p_hist + tidx, atomic_load_explicit(sh_hist + tidx, memory_order_relaxed), memory_order_relaxed);
}
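A host-side sketch for dispatching the sample above with 8x8 threadgroups, with pipeline and buffer names assumed, and assuming data and p_hist bind at buffer indices 0 and 1 in declaration order:

let threadsPerThreadgroup = MTLSize(width: 8, height: 8, depth: 1)
let grid = MTLSize(width: imageWidth, height: imageHeight, depth: 1)
let encoder = commandBuffer.makeComputeCommandEncoder()!
encoder.setComputePipelineState(histogramPipeline)
encoder.setBuffer(imageBuffer, offset: 0, index: 0)     // data (assumed index 0)
encoder.setBuffer(histogramBuffer, offset: 0, index: 1) // p_hist (assumed index 1)
// dispatchThreads keeps threads_per_grid equal to the image size; it requires a GPU with
// non-uniform threadgroup support. Otherwise, round the grid up and use dispatchThreadgroups.
encoder.dispatchThreads(grid, threadsPerThreadgroup: threadsPerThreadgroup)
encoder.endEncoding()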

Access pixels beyond grid position in metal compute kernel?

I have a metal kernel function. Usually you access pixels like this:
kernel void edgeDetect(texture2d<half, access::sample> inTexture [[ texture(0) ]],
                       texture2d<half, access::write> outTexture [[ texture(1) ]],
                       device const uint *roi [[ buffer(0) ]],
                       uint2 grid [[ thread_position_in_grid ]]) {
    if (grid.x >= outTexture.get_width() || grid.y >= outTexture.get_height()) {
        return;
    }
    half c[9];
    for (int i = 0; i < 3; ++i) {
        for (int j = 0; j < 3; ++j) {
            c[3*i+j] = inTexture.read(grid + uint2(i-1,j-1)).x;
        }
    }
    half3 Lx = 2.0*(c[7]-c[1]) + c[6] + c[8] - c[2] - c[0];
    half3 Ly = 2.0*(c[3]-c[5]) + c[6] + c[0] - c[2] - c[8];
    half3 G = sqrt(Lx*Lx+Ly*Ly);
    outTexture.write(half4(G, 0.0), grid);
}
Now I need to access pixels in the neighbourhood of the current grid position like this:
half4 inColor = inTexture.read(grid - uint2(-1,-1));
Basically this works, but on the thread boundaries I have "discontinuities" as shown in this image (the brick wall pattern).
This is clear since each thread is passed only its sub-texture to process, so beyond thread boundaries I can't access pixels.
My question is: what is the concept when I need to address pixels beyond the current grid position in a compute kernel? Is this possible with compute kernels at all?
I have found the issue:
The line
c[3*i+j] = inTexture.read(grid + uint2(i-1,j-1)).x;
must be changed to:
c[3*i+j] = inTexture.read(grid + uint2(i,j)).x;
Obviously the position indices of -1 into the texture failed and produced the brick-wall-like artefacts shown in the image above.
To make sure this comment thread has an actual answer attached to it: there is no restriction on which pixels you can access in a compute shader. Your grid size affects scheduling only.
Your error is constructing the unsigned type uint2 with negative numbers. At the first iteration of your loop you attempt to construct uint2(-1, -1), which is the same as uint2(4294967295, 4294967295) and therefore way out of bounds.
You can use int2, or, as per your self-answer, just avoid negative numbers.

Draw portion of MTLBuffer?

I am rendering point fragments from a buffer with this call:
renderEncoder.drawPrimitives(type: .point,
                             vertexStart: 0,
                             vertexCount: 1,
                             instanceCount: emitter.currentParticles)
emitter.currentParticles is the total number of particles in the buffer. Is it possible to somehow draw only a portion of the buffer?
I have tried this, but it draws the first half of the buffer:
renderEncoder.drawPrimitives(type: .point,
                             vertexStart: emitter.currentParticles / 2,
                             vertexCount: 1,
                             instanceCount: emitter.currentParticles / 2)
In fact, it seems that vertexStart has no effect. I can seemingly set it to any value, and it still starts at 0.
Edit:
Pipeline configuration:
private func buildParticlePipelineStates() {
    do {
        guard let library = Renderer.device.makeDefaultLibrary(),
              let function = library.makeFunction(name: "compute") else { return }

        // particle update pipeline state
        particlesPipelineState = try Renderer.device.makeComputePipelineState(function: function)

        // render pipeline state
        let vertexFunction = library.makeFunction(name: "vertex_particle")
        let fragmentFunction = library.makeFunction(name: "fragment_particle")
        let descriptor = MTLRenderPipelineDescriptor()
        descriptor.vertexFunction = vertexFunction
        descriptor.fragmentFunction = fragmentFunction
        descriptor.colorAttachments[0].pixelFormat = renderPixelFormat
        descriptor.colorAttachments[0].isBlendingEnabled = true
        descriptor.colorAttachments[0].rgbBlendOperation = .add
        descriptor.colorAttachments[0].alphaBlendOperation = .add
        descriptor.colorAttachments[0].sourceRGBBlendFactor = .sourceAlpha
        descriptor.colorAttachments[0].sourceAlphaBlendFactor = .sourceAlpha
        descriptor.colorAttachments[0].destinationRGBBlendFactor = .oneMinusSourceAlpha
        descriptor.colorAttachments[0].destinationAlphaBlendFactor = .oneMinusSourceAlpha
        renderPipelineState = try Renderer.device.makeRenderPipelineState(descriptor: descriptor)
    } catch let error {
        print(error.localizedDescription)
    }
}
Vertex shader:
struct VertexOut {
    float4 position [[ position ]];
    float point_size [[ point_size ]];
    float4 color;
};

vertex VertexOut vertex_particle(constant float2 &size [[ buffer(0) ]],
                                 device Particle *particles [[ buffer(1) ]],
                                 constant float2 &emitterPosition [[ buffer(2) ]],
                                 uint instance [[ instance_id ]])
{
    VertexOut out;
    float2 position = particles[instance].position + emitterPosition;
    out.position.xy = position.xy / size * 2.0 - 1.0;
    out.position.z = 0;
    out.position.w = 1;
    out.point_size = particles[instance].size * particles[instance].scale;
    out.color = particles[instance].color;
    return out;
}

fragment float4 fragment_particle(VertexOut in [[ stage_in ]],
                                  texture2d<float> particleTexture [[ texture(0) ]],
                                  float2 point [[ point_coord ]]) {
    constexpr sampler default_sampler;
    float4 color = particleTexture.sample(default_sampler, point);
    if ((color.a < 0.01) || (in.color.a < 0.01)) {
        discard_fragment();
    }
    color = float4(in.color.xyz, 0.2 * color.a * in.color.a);
    return color;
}
You're not using a vertex descriptor or a [[stage_in]] parameter for your vertex shader, so Metal is not fetching/gathering vertex data for you. You're just indexing into a buffer that's laid out with your vertex data already in the format you want. That's fine. See my answer here for more info about vertex descriptors.
Given that, though, the vertexStart parameter of the draw call only affects the value of a parameter to your vertex function with the [[vertex_id]] attribute. Your vertex function doesn't have such a parameter, let alone use it. Instead it uses an [[instance_id]] parameter to index into the vertex data buffer. You can read another of my answers here for a quick primer on draw calls and how they result in calls to your vertex shader function.
There are a couple of ways you could change things to draw only half of the points. You could change the draw call you use to:
renderEncoder.drawPrimitives(type: .point,
                             vertexStart: 0,
                             vertexCount: 1,
                             instanceCount: emitter.currentParticles / 2,
                             baseInstance: emitter.currentParticles / 2)
This would not require any changes to the vertex shader. It just changes the range of values fed to the instance parameter. However, since it doesn't seem like this is really a case of instancing, I recommend that you change the shader and your draw call. For the shader, rename the instance parameter to vertex or vid and change its attribute from [[instance_id]] to [[vertex_id]]. Then, change the draw call to:
renderEncoder.drawPrimitives(type: .point,
                             vertexStart: emitter.currentParticles / 2,
                             vertexCount: emitter.currentParticles / 2)
In truth, they basically behave the same way in this case, but the latter better represents what you're doing (and the draw call is simpler, which is nice).

Filling Float buffer in Metal

Problem:
I need to fill a MTLBuffer of Floats with a constant value — say 1729.68921. I also need it to be as fast as possible.
Therefore I'm prohibited from filling the buffer on the CPU side (i.e. getting an UnsafeMutablePointer<Float> from the MTLBuffer and assigning values in a serial manner).
My approach
Ideally I'd use MTLBlitCommandEncoder.fill(); however, AFAIK it's only capable of filling a buffer with UInt8 values (given that UInt8 is 1 byte long and Float is 4 bytes long, I can't specify an arbitrary value for my Float constant).
So far I can see only 2 options left, but both seem to be overkill:
create another buffer B filled with the constant value and copy its contents into my buffer via MTLBlitCommandEncoder
create a kernel function that'd fill the buffer
Questions
What's the fastest way of filling an MTLBuffer of Floats with a constant value?
Using a compute shader that writes to multiple buffer elements from each thread was the fastest approach in my experiments. This is hardware-dependent, so you should test on the full range of devices you expect the app to be deployed on.
I wrote two compute shaders: one that fills 16 contiguous array elements without checking against the array bounds, and one that sets a single array element after checking against the length of the buffer:
kernel void fill_16_unchecked(device float *buffer [[ buffer(0) ]],
                              constant float &value [[ buffer(1) ]],
                              uint index [[ thread_position_in_grid ]])
{
    for (int i = 0; i < 16; ++i) {
        buffer[index * 16 + i] = value;
    }
}

kernel void single_fill_checked(device float *buffer [[ buffer(0) ]],
                                constant float &value [[ buffer(1) ]],
                                constant uint &buffer_length [[ buffer(2) ]],
                                uint index [[ thread_position_in_grid ]])
{
    if (index < buffer_length) {
        buffer[index] = value;
    }
}
If you know that your buffer count will always be a multiple of the thread execution width multiplied by the number of elements you set in the loop, you can just use the first function. The second function is a fallback for when you might dispatch a grid that would otherwise overrun the buffer.
Once you have two pipelines built from these functions, you can dispatch the work with a pair of compute commands as follows:
NSInteger executionWidth = [unchecked16Pipeline threadExecutionWidth];

id<MTLComputeCommandEncoder> computeEncoder = [commandBuffer computeCommandEncoder];
[computeEncoder setBuffer:buffer offset:0 atIndex:0];
[computeEncoder setBytes:&value length:sizeof(float) atIndex:1];

if (bufferCount / (executionWidth * 16) != 0) {
    [computeEncoder setComputePipelineState:unchecked16Pipeline];
    [computeEncoder dispatchThreadgroups:MTLSizeMake(bufferCount / (executionWidth * 16), 1, 1)
                   threadsPerThreadgroup:MTLSizeMake(executionWidth, 1, 1)];
}

if (bufferCount % (executionWidth * 16) != 0) {
    int remainder = bufferCount % (executionWidth * 16);
    [computeEncoder setComputePipelineState:checkedSinglePipeline];
    [computeEncoder setBytes:&bufferCount length:sizeof(bufferCount) atIndex:2];
    [computeEncoder dispatchThreadgroups:MTLSizeMake((remainder / executionWidth) + 1, 1, 1)
                   threadsPerThreadgroup:MTLSizeMake(executionWidth, 1, 1)];
}

[computeEncoder endEncoding];
Note that doing the work in this manner will not necessarily be faster than the naive approach that just writes one element per thread. In my tests, it was 40% faster on A8, roughly equivalent on A10, and 2-3x slower (!) on A9. Always test with your own workload.
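Since the question is posed in Swift, here is a sketch of the same pair of dispatches using the Swift Metal API, with pipeline, buffer, and count names assumed to match the Objective-C version above:

let executionWidth = unchecked16Pipeline.threadExecutionWidth
let computeEncoder = commandBuffer.makeComputeCommandEncoder()!
computeEncoder.setBuffer(buffer, offset: 0, index: 0)
var fillValue: Float = value
computeEncoder.setBytes(&fillValue, length: MemoryLayout<Float>.stride, index: 1)

if bufferCount / (executionWidth * 16) != 0 {
    computeEncoder.setComputePipelineState(unchecked16Pipeline)
    computeEncoder.dispatchThreadgroups(MTLSize(width: bufferCount / (executionWidth * 16), height: 1, depth: 1),
                                        threadsPerThreadgroup: MTLSize(width: executionWidth, height: 1, depth: 1))
}
if bufferCount % (executionWidth * 16) != 0 {
    let remainder = bufferCount % (executionWidth * 16)
    computeEncoder.setComputePipelineState(checkedSinglePipeline)
    // the shader declares buffer_length as uint, so pass a 32-bit value
    var length = UInt32(bufferCount)
    computeEncoder.setBytes(&length, length: MemoryLayout<UInt32>.stride, index: 2)
    computeEncoder.dispatchThreadgroups(MTLSize(width: remainder / executionWidth + 1, height: 1, depth: 1),
                                        threadsPerThreadgroup: MTLSize(width: executionWidth, height: 1, depth: 1))
}
computeEncoder.endEncoding()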

Metal iOS simple passthrough compute kernel takes 10 milliseconds on iPhone 5s

I created a simple passthrough compute kernel:
kernel void filter(texture2d<float, access::read> inTexture [[ texture(0) ]],
                   texture2d<float, access::write> outTexture [[ texture(1) ]],
                   uint2 gridPos [[ thread_position_in_grid ]]) {
    float4 color = inTexture.read(gridPos);
    outTexture.write(color, gridPos);
}
Measuring the execution time:
[self.timer start];
[commandBuffer commit];
[commandBuffer waitUntilCompleted];
CGFloat ms = [self.timer elapse];
The Timer class works like this:
- (void)start {
    self.startMach = mach_absolute_time();
}

- (CGFloat)elapse {
    uint64_t end = mach_absolute_time();
    uint64_t elapsed = end - self.startMach;
    uint64_t nanosecs = elapsed * self.info.numer / self.info.denom;
    uint64_t millisecs = nanosecs / 1000000;
    return millisecs;
}
Dispatch call:
static const NSUInteger kGroupSize = 16;

- (MTLSize)threadGroupSize {
    return MTLSizeMake(kGroupSize, kGroupSize, 1);
}

- (MTLSize)threadGroupsCount:(MTLSize)threadGroupSize {
    return MTLSizeMake(self.provider.texture.width / kGroupSize,
                       self.provider.texture.height / kGroupSize, 1);
}

[commandEncoder dispatchThreadgroups:threadgroups
               threadsPerThreadgroup:threadgroupSize];
This gives me 13 ms for a 512x512 RGBA image, and it grows linearly if I perform more passes.
Is this correct? It seems like too much overhead for a real-time application.
Compute kernels are known to have rather high overhead on A7 processors. One thing to consider, though, is that this is basically the least flattering test you can run: a one-shot threadgroup dispatch might take ~2ms to get scheduled, but scheduling of subsequent dispatches can be up to an order of magnitude faster. Additionally there's little chance for latency hiding here. In practice, a much more complex kernel probably wouldn't take substantially longer to execute, and if you can interleave it with whatever rendering you might be doing, you might find performance to be acceptable.
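Note that the measurement above is wall-clock time around commit/waitUntilCompleted, so it includes scheduling overhead. On iOS 10.3 and later, the command buffer's own GPU timestamps can separate pure execution time from that overhead; a minimal sketch, assuming the same commandBuffer:

commandBuffer.addCompletedHandler { buffer in
    // gpuStartTime/gpuEndTime are in seconds and exclude CPU-side scheduling latency
    let gpuMilliseconds = (buffer.gpuEndTime - buffer.gpuStartTime) * 1000.0
    print("GPU execution time: \(gpuMilliseconds) ms")
}
commandBuffer.commit()
commandBuffer.waitUntilCompleted()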
