I have a Metal shader that computes an image histogram like this:
#define CHANNEL_SIZE (256)
typedef atomic_uint HistoBuffer[CHANNEL_SIZE];
kernel void
computeHisto(texture2d<half, access::read> sourceTexture [[ texture(0) ]],
             device HistoBuffer &histo [[buffer(0)]],
             uint2 grid [[thread_position_in_grid]]) {
    if (grid.x >= sourceTexture.get_width() || grid.y >= sourceTexture.get_height()) { return; }
    half gray = sourceTexture.read(grid).r;
    uint grayvalue = uint(gray * (CHANNEL_SIZE - 1));
    atomic_fetch_add_explicit(&histo[grayvalue], 1, memory_order_relaxed);
}
This works as expected but takes too long (>1ms). I tried to optimise it by reducing the number of atomic operations and came up with the following improved code. The idea is to compute local histograms per threadgroup and then add them atomically into the global histogram buffer.
kernel void
computeHisto_fast(texture2d<half, access::read> sourceTexture [[ texture(0) ]],
                  device HistoBuffer &histo [[buffer(0)]],
                  uint2 t_pos_grid [[thread_position_in_grid]],
                  uint2 tg_pos_grid [[ threadgroup_position_in_grid ]],
                  uint2 t_pos_tg [[ thread_position_in_threadgroup]],
                  uint t_idx_tg [[ thread_index_in_threadgroup ]],
                  uint2 t_per_tg [[ threads_per_threadgroup ]]
                  )
{
    threadgroup uint localhisto[CHANNEL_SIZE] = { 0 };
    if (t_pos_grid.x >= sourceTexture.get_width() || t_pos_grid.y >= sourceTexture.get_height()) { return; }
    half gray = sourceTexture.read(t_pos_grid).r;
    uint grayvalue = uint(gray * (CHANNEL_SIZE - 1));
    localhisto[grayvalue]++;
    // wait for all threads in threadgroup to finish
    threadgroup_barrier(mem_flags::mem_none);
    // copy the thread group result atomically into global histo buffer
    if(t_idx_tg == 0) {
        for(uint i=0;i<CHANNEL_SIZE;i++) {
            atomic_fetch_add_explicit(&histo[i], localhisto[i], memory_order_relaxed);
        }
    }
}
There are 2 problems:
The improved routine does not yield identical results compared to the first one, and I currently don't see why.
The run time didn't improve. In fact it takes 4 times the runtime of the unoptimised version. According to the debugger the for loop is the problem. But I do not understand this, since the number of atomic operations is reduced by 3 orders of magnitude, i.e. by the threadgroup size, here (32x32) = 1024.
Anybody who can explain what I am doing wrong here? Thanks
EDIT: 2019-12-22:
Following Matthijs' answer, I have also changed the local histogram to use atomic operations, like this:
threadgroup atomic_uint localhisto[CHANNEL_SIZE] = {0};
half gray = sourceTexture.read(t_pos_grid).r;
uint grayvalue = uint(gray * (CHANNEL_SIZE - 1));
atomic_fetch_add_explicit(&localhisto[grayvalue], 1, memory_order_relaxed);
However, the result still is not the same as in the reference implementation above. There must be another severe conceptual bug?
You'll still need to use atomic operations on the threadgroup memory, since it's still being shared by multiple threads. This should be faster than in your first version because there is less contention for the same locks.
I think the problem is with the initialization of the shared memory; I don't think that definition does the job. Also, a threadgroup-level memory barrier is required between zeroing the shared memory and the atomic updates.
As for the device memory update, doing it from a single thread is clearly suboptimal. Having one thread update the whole 256-element histogram for each threadgroup can add a huge overhead, depending on the threadgroup size.
A sample I used for a small (16-element) histogram using 8x8 threadgroups:
kernel void gaussian_filter(device const uchar* data,
                            device atomic_uint* p_hist,
                            uint2 imageShape [[threads_per_grid]],
                            uint2 idx [[thread_position_in_grid]],
                            uint tidx [[thread_index_in_threadgroup]])
{
    threadgroup atomic_uint sh_hist[16];
    if (tidx < 16)
        atomic_store_explicit(sh_hist + tidx, 0, memory_order_relaxed);
    threadgroup_barrier(mem_flags::mem_threadgroup);
    uint histBin = (uint)data[imageShape[0]*idx[1] + idx[0]]/16;
    atomic_fetch_add_explicit(sh_hist + histBin, 1, memory_order_relaxed);
    threadgroup_barrier(mem_flags::mem_threadgroup);
    if (tidx < 16)
        atomic_fetch_add_explicit(p_hist + tidx, atomic_load_explicit(sh_hist + tidx, memory_order_relaxed), memory_order_relaxed);
}
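Applying the same pattern to the 256-bin histogram from the question gives something like the following sketch. It reuses CHANNEL_SIZE and HistoBuffer from the question, assumes the threadgroup has at least 256 threads (e.g. the 32x32 groups mentioned above), and the kernel name is illustrative. Note that the early return is replaced by a bounds check so every thread reaches the barriers, and the shared bins are zeroed and flushed in parallel rather than by a single thread:
kernel void
computeHisto_shared(texture2d<half, access::read> sourceTexture [[ texture(0) ]],
                    device HistoBuffer &histo [[ buffer(0) ]],
                    uint2 t_pos_grid [[ thread_position_in_grid ]],
                    uint t_idx_tg [[ thread_index_in_threadgroup ]])
{
    // One shared histogram per threadgroup.
    threadgroup atomic_uint localhisto[CHANNEL_SIZE];

    // Zero the shared bins in parallel (assumes >= CHANNEL_SIZE threads per group).
    if (t_idx_tg < CHANNEL_SIZE) {
        atomic_store_explicit(&localhisto[t_idx_tg], 0, memory_order_relaxed);
    }
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // Accumulate into the shared histogram; no early return, so all threads hit the barriers.
    if (t_pos_grid.x < sourceTexture.get_width() && t_pos_grid.y < sourceTexture.get_height()) {
        half gray = sourceTexture.read(t_pos_grid).r;
        uint grayvalue = uint(gray * (CHANNEL_SIZE - 1));
        atomic_fetch_add_explicit(&localhisto[grayvalue], 1, memory_order_relaxed);
    }
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // Flush the shared bins to device memory, one bin per thread.
    if (t_idx_tg < CHANNEL_SIZE) {
        atomic_fetch_add_explicit(&histo[t_idx_tg],
                                  atomic_load_explicit(&localhisto[t_idx_tg], memory_order_relaxed),
                                  memory_order_relaxed);
    }
}
Each threadgroup then performs CHANNEL_SIZE global atomic adds instead of one per pixel, and the flush is spread across the threads instead of being serialized on thread 0.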
I have a Metal kernel function. Usually you access pixels like this:
kernel void edgeDetect(texture2d<half, access::sample> inTexture [[ texture(0) ]],
                       texture2d<half, access::write> outTexture [[ texture(1) ]],
                       device const uint *roi [[ buffer(0) ]],
                       uint2 grid [[ thread_position_in_grid ]]) {
    if (grid.x >= outTexture.get_width() || grid.y >= outTexture.get_height()) {
        return;
    }
    half c[9];
    for (int i=0; i < 3; ++i) {
        for (int j=0; j < 3; ++j) {
            c[3*i+j] = inTexture.read(grid + uint2(i-1,j-1)).x;
        }
    }
    half3 Lx = 2.0*(c[7]-c[1]) + c[6] + c[8] - c[2] - c[0];
    half3 Ly = 2.0*(c[3]-c[5]) + c[6] + c[0] - c[2] - c[8];
    half3 G = sqrt(Lx*Lx+Ly*Ly);
    outTexture.write(half4(G, 0.0), grid);
}
Now I need to access pixels in the neighbourhood of the current grid position like this:
half4 inColor = inTexture.read(grid - uint2(-1,-1));
Basically this works, but on the thread boundaries I have "discontinuities" as shown in this image (the brick wall pattern).
This is clear since each thread is passed only its sub-texture to process. So beyond thread boundaries I can't access pixels.
My question is: what is the concept when I need to address pixels beyond the current position in a compute kernel? Is this possible with compute kernels at all?
I have found the issue:
The line
c[3*i+j] = inTexture.read(grid + uint2(i-1,j-1)).x;
must be changed to:
c[3*i+j] = inTexture.read(grid + uint2(i,j)).x;
Obviously the position indices of -1 into the texture failed and produced the brick-wall-like artefacts shown in the image above.
To make sure this is captured as an answer rather than just a comment: there is no restriction on which pixels you can access in a compute shader. Your grid size affects scheduling only.
Your error is instantiating unsigned uint2 with negative numbers. At the first iteration of your loop you will attempt to construct uint2(-1, -1), which is the same as uint2(4294967295, 4294967295) and therefore way out of bounds.
You can use int2, or, as per your self-answer, just avoid negative numbers.
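For illustration, here is a minimal sketch of the 3x3 neighbourhood read from the question using signed offsets, with clamping at the texture border (the clamping strategy is my assumption, not part of the original answer; inTexture and grid are the kernel arguments from the question):
// Read the 3x3 neighbourhood using signed offsets, clamping at the texture edges.
half c[9];
int2 size = int2(inTexture.get_width(), inTexture.get_height());
for (int i = 0; i < 3; ++i) {
    for (int j = 0; j < 3; ++j) {
        int2 pos = clamp(int2(grid) + int2(i - 1, j - 1), int2(0), size - 1);
        c[3*i + j] = inTexture.read(uint2(pos)).x;
    }
}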
I haven't written many Metal kernel shaders yet; here's a fledgling "fade" shader between two RGBX-32 images, using a tween value from 0.0 (inBuffer1) to 1.0 (inBuffer2).
Is there something I'm missing here? It strikes me that this may be terribly inefficient.
My first inkling is to attempt the subtraction and multiplication using the vector data types (e.g. char4), thinking that might be better, but the results of that are certainly undefined (as some components will be negative).
Also, is there some advantage to using MTLTexture versus MTLBuffer objects as I've done?
kernel void fade_Kernel(device const uchar4 *inBuffer1 [[ buffer(0) ]],
                        device const uchar4 *inBuffer2 [[ buffer(1) ]],
                        device const float *tween [[ buffer(2) ]],
                        device uchar4 *outBuffer [[ buffer(3) ]],
                        uint gid [[ thread_position_in_grid ]])
{
    const float t = tween[0];
    uchar4 pixel1 = inBuffer1[gid];
    uchar4 pixel2 = inBuffer2[gid];
    // these values will be negative
    short r = (pixel2.r - pixel1.r) * t;
    short g = (pixel2.g - pixel1.g) * t;
    short b = (pixel2.b - pixel1.b) * t;
    outBuffer[gid] = uchar4(pixel1.r + r, pixel1.g + g, pixel1.b + b, 0xff);
}
First, you should probably declare the tween parameter as:
constant float &tween [[ buffer(2) ]],
Using the constant address space is more appropriate for a value like this that's the same for all invocations of the function (and not indexed into by grid position or the like). Also, making it a reference instead of a pointer tells the compiler that you won't be indexing other elements of the "array" that a pointer might point to.
Finally, there's a mix() function that performs exactly the sort of computation that you're doing here. So, you could replace the body of the function with:
uchar4 pixel1 = inBuffer1[gid];
uchar4 pixel2 = inBuffer2[gid];
outBuffer[gid] = uchar4(uchar3(mix(float3(pixel1.rgb), float3(pixel2.rgb), tween)), 0xff);
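Putting the two suggestions together, the kernel might look like the following sketch (buffer indices kept from the question):
kernel void fade_Kernel(device const uchar4 *inBuffer1 [[ buffer(0) ]],
                        device const uchar4 *inBuffer2 [[ buffer(1) ]],
                        constant float &tween [[ buffer(2) ]],
                        device uchar4 *outBuffer [[ buffer(3) ]],
                        uint gid [[ thread_position_in_grid ]])
{
    uchar4 pixel1 = inBuffer1[gid];
    uchar4 pixel2 = inBuffer2[gid];
    // mix() interpolates in float, avoiding the signed-intermediate problem.
    outBuffer[gid] = uchar4(uchar3(mix(float3(pixel1.rgb), float3(pixel2.rgb), tween)), 0xff);
}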
As to whether it would be better to use textures, that depends somewhat on what you plan to do with the result after running this kernel. If you're going to be doing texture-like things with it anyway, it might be better to use textures all throughout. Indeed, it might be better to use drawing operations with blending rather than a compute kernel. After all, such blending is something GPUs have to do all the time, so that path is probably fast. You'd have to test the performance of each approach.
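For illustration only, one way the rendering route could look is a full-screen quad whose fragment function does the crossfade. This is a sketch under my own assumptions: the struct, names, and bindings are illustrative, and the vertex stage and pipeline setup are omitted.
struct FadeVertexOut {
    float4 position [[position]];
    float2 texCoord;
};

fragment half4 fade_fragment(FadeVertexOut in [[stage_in]],
                             texture2d<half> texture1 [[ texture(0) ]],
                             texture2d<half> texture2 [[ texture(1) ]],
                             constant float &tween [[ buffer(0) ]])
{
    constexpr sampler s(coord::normalized, filter::linear);
    half4 color1 = texture1.sample(s, in.texCoord);
    half4 color2 = texture2.sample(s, in.texCoord);
    // Crossfade between the two images; alpha is forced to 1.
    return half4(mix(color1.rgb, color2.rgb, half(tween)), 1.0h);
}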
If you are dealing with images, it's much more efficient to use MTLTexture than MTLBuffer. It is also better to use "half" than "uchar". I've learned this directly from an Apple engineer at WWDC this year.
kernel void alpha(texture2d<half, access::read> inTexture2 [[texture(0)]],
                  texture2d<half, access::read> inTexture1 [[texture(1)]],
                  texture2d<half, access::write> outTexture [[texture(2)]],
                  const device float& tween [[ buffer(3) ]],
                  uint2 gid [[thread_position_in_grid]])
{
    // Check if the pixel is within the bounds of the output texture
    if((gid.x >= outTexture.get_width()) || (gid.y >= outTexture.get_height())) {
        // Return early if the pixel is out of bounds
        return;
    }
    half4 color1 = inTexture1.read(gid);
    half4 color2 = inTexture2.read(gid);
    outTexture.write(half4(mix(color1.rgb, color2.rgb, half(tween)), color1.a), gid);
}
Problem:
I need to fill a MTLBuffer of Floats with a constant value — say 1729.68921. I also need it to be as fast as possible.
Therefore I'm prohibited from filling the buffer on the CPU side (i.e. getting UnsafeMutablePointer<Float> from the MTLBuffer and assigning in serial manner).
My approach
Ideally I'd use MTLBlitCommandEncoder.fill(), however AFAIK it's only capable of filling a buffer with a repeated UInt8 value (given that UInt8 is 1 byte long and Float is 4 bytes long, I can't specify an arbitrary value for my Float constant).
So far I can see only 2 options left, but both seem to be overkill:
create another buffer B filled with the constant value and copy its contents into my buffer via MTLBlitCommandEncoder
create a kernel function that'd fill the buffer
Questions
What's the fastest way of filling an MTLBuffer of Floats with a constant value?
Using a compute shader that writes to multiple buffer elements from each thread was the fastest approach in my experiments. This is hardware-dependent, so you should test on the full range of devices you expect the app to be deployed on.
I wrote two compute shaders: one that fills 16 contiguous array elements without checking against the array bounds, and one that sets a single array element after checking against the length of the buffer:
kernel void fill_16_unchecked(device float *buffer [[buffer(0)]],
                              constant float &value [[buffer(1)]],
                              uint index [[thread_position_in_grid]])
{
    for (int i = 0; i < 16; ++i) {
        buffer[index * 16 + i] = value;
    }
}

kernel void single_fill_checked(device float *buffer [[buffer(0)]],
                                constant float &value [[buffer(1)]],
                                constant uint &buffer_length [[buffer(2)]],
                                uint index [[thread_position_in_grid]])
{
    if (index < buffer_length) {
        buffer[index] = value;
    }
}
If you know that your buffer count will always be a multiple of the thread execution width multiplied by the number of elements you set in the loop, you can just use the first function. The second function is a fallback for when you might dispatch a grid that would otherwise overrun the buffer.
Once you have two pipelines built from these functions, you can dispatch the work with a pair of compute commands as follows:
NSInteger executionWidth = [unchecked16Pipeline threadExecutionWidth];
id<MTLComputeCommandEncoder> computeEncoder = [commandBuffer computeCommandEncoder];
[computeEncoder setBuffer:buffer offset:0 atIndex:0];
[computeEncoder setBytes:&value length:sizeof(float) atIndex:1];
if (bufferCount / (executionWidth * 16) != 0) {
    [computeEncoder setComputePipelineState:unchecked16Pipeline];
    [computeEncoder dispatchThreadgroups:MTLSizeMake(bufferCount / (executionWidth * 16), 1, 1)
                   threadsPerThreadgroup:MTLSizeMake(executionWidth, 1, 1)];
}
if (bufferCount % (executionWidth * 16) != 0) {
    int remainder = bufferCount % (executionWidth * 16);
    [computeEncoder setComputePipelineState:checkedSinglePipeline];
    [computeEncoder setBytes:&bufferCount length:sizeof(bufferCount) atIndex:2];
    [computeEncoder dispatchThreadgroups:MTLSizeMake((remainder / executionWidth) + 1, 1, 1)
                   threadsPerThreadgroup:MTLSizeMake(executionWidth, 1, 1)];
}
[computeEncoder endEncoding];
Note that doing the work in this manner will not necessarily be faster than the naive approach that just writes one element per thread. In my tests, it was 40% faster on A8, roughly equivalent on A10, and 2-3x slower (!) on A9. Always test with your own workload.
I'm trying to implement code in Metal that performs a 1D convolution between two vectors of given lengths. I've implemented the following, which works correctly:
kernel void convolve(const device float *dataVector [[ buffer(0) ]],
                     const device int& dataSize [[ buffer(1) ]],
                     const device float *filterVector [[ buffer(2) ]],
                     const device int& filterSize [[ buffer(3) ]],
                     device float *outVector [[ buffer(4) ]],
                     uint id [[ thread_position_in_grid ]]) {
    int outputSize = dataSize - filterSize + 1;
    for (int i=0;i<outputSize;i++) {
        float sum = 0.0;
        for (int j=0;j<filterSize;j++) {
            sum += dataVector[i+j] * filterVector[j];
        }
        outVector[i] = sum;
    }
}
My problem is that it takes about 10 times longer to process (computation + data transfer to/from the GPU) the same data using Metal than in Swift on the CPU. My question is: how do I replace the inner loop with a single vector operation, or is there another way to speed up the above code?
The key to taking advantage of the GPU's parallelism in this case is to let it manage the outer loop for you. Instead of invoking the kernel once for the entire data vector, we'll invoke it for each element in the data vector. The kernel function simplifies to this:
kernel void convolve(const device float *dataVector [[ buffer(0) ]],
                     const constant int &dataSize [[ buffer(1) ]],
                     const constant float *filterVector [[ buffer(2) ]],
                     const constant int &filterSize [[ buffer(3) ]],
                     device float *outVector [[ buffer(4) ]],
                     uint id [[ thread_position_in_grid ]])
{
    float sum = 0.0;
    for (int i = 0; i < filterSize; ++i) {
        sum += dataVector[id + i] * filterVector[i];
    }
    outVector[id] = sum;
}
In order to dispatch this work, we select a threadgroup size based on the thread execution width recommended by the compute pipeline state. The one tricky thing here is making sure that there's enough padding in the input and output buffers so that we can slightly overrun the actual size of the data. This does cause us to waste a small amount of memory and computation, but saves us the complexity of doing a separate dispatch just to compute the convolution for the elements at the end of the buffer.
// We should ensure here that the data buffer and output buffer each have a size that is a multiple of
// the compute pipeline's threadExecutionWidth, by padding the amount we allocate for each of them.
// After execution, we ignore the extraneous elements in the output buffer beyond the first (dataCount - filterCount + 1).
let iterationCount = dataCount - filterCount + 1
let threadsPerThreadgroup = MTLSize(width: min(iterationCount, computePipeline.threadExecutionWidth), height: 1, depth: 1)
let threadgroups = (iterationCount + threadsPerThreadgroup.width - 1) / threadsPerThreadgroup.width
let threadgroupsPerGrid = MTLSize(width: threadgroups, height: 1, depth: 1)
let commandEncoder = commandBuffer.computeCommandEncoder()
commandEncoder.setComputePipelineState(computePipeline)
commandEncoder.setBuffer(dataBuffer, offset: 0, at: 0)
commandEncoder.setBytes(&dataCount, length: MemoryLayout<Int>.stride, at: 1)
commandEncoder.setBuffer(filterBuffer, offset: 0, at: 2)
commandEncoder.setBytes(&filterCount, length: MemoryLayout<Int>.stride, at: 3)
commandEncoder.setBuffer(outBuffer, offset: 0, at: 4)
commandEncoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
commandEncoder.endEncoding()
In my experiments, this parallelized approach runs 400-1000x faster than the serial version in the question. I'm curious to hear how it compares to your CPU implementation.
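If padding the buffers is undesirable, a bounds-checked variant of the kernel is a simple alternative; the following is only a sketch (same buffer bindings as above, and the kernel name is illustrative). Each thread skips work beyond the last valid output element, so the grid can safely be rounded up to a whole number of threadgroups:
kernel void convolve_checked(const device float *dataVector [[ buffer(0) ]],
                             const constant int &dataSize [[ buffer(1) ]],
                             const constant float *filterVector [[ buffer(2) ]],
                             const constant int &filterSize [[ buffer(3) ]],
                             device float *outVector [[ buffer(4) ]],
                             uint id [[ thread_position_in_grid ]])
{
    // Ignore threads beyond the last valid output element.
    int outputSize = dataSize - filterSize + 1;
    if (int(id) >= outputSize) {
        return;
    }
    float sum = 0.0;
    for (int i = 0; i < filterSize; ++i) {
        sum += dataVector[id + i] * filterVector[i];
    }
    outVector[id] = sum;
}
This trades a comparison per thread for the unused padding memory and the extra (ignored) computation at the end of the buffer.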
The following code shows how to render encoded commands in parallel on the GPU using the Objective-C Metal API (the threading code above only divides rendering of the output into grid sections for parallel processing; the calculations are still not performed in parallel). It is what you're referring to in your question, even though it's not exactly what you want. I've provided this answer to help anyone who might have stumbled upon this question thinking it was going to provide an answer related to parallel rendering (when, in fact, it does not):
- (void)drawInMTKView:(MTKView *)view
{
    dispatch_async(((AppDelegate *)UIApplication.sharedApplication.delegate).cameraViewQueue, ^{
        id <CAMetalDrawable> drawable = [view currentDrawable]; //[(CAMetalLayer *)view.layer nextDrawable];
        MTLRenderPassDescriptor *renderPassDesc = [view currentRenderPassDescriptor];
        renderPassDesc.colorAttachments[0].loadAction = MTLLoadActionClear;
        renderPassDesc.colorAttachments[0].clearColor = MTLClearColorMake(0.0,0.0,0.0,1.0);
        renderPassDesc.renderTargetWidth = self.texture.width;
        renderPassDesc.renderTargetHeight = self.texture.height;
        renderPassDesc.colorAttachments[0].texture = drawable.texture;
        if (renderPassDesc != nil)
        {
            dispatch_semaphore_wait(self._inflight_semaphore, DISPATCH_TIME_FOREVER);
            id <MTLCommandBuffer> commandBuffer = [self.metalContext.commandQueue commandBuffer];
            [commandBuffer enqueue];

            // START PARALLEL RENDERING OPERATIONS HERE
            id <MTLParallelRenderCommandEncoder> parallelRCE = [commandBuffer parallelRenderCommandEncoderWithDescriptor:renderPassDesc];

            // FIRST PARALLEL RENDERING OPERATION
            id <MTLRenderCommandEncoder> renderEncoder = [parallelRCE renderCommandEncoder];
            [renderEncoder setRenderPipelineState:self.metalContext.renderPipelineState];
            [renderEncoder setVertexBuffer:self.metalContext.vertexBuffer offset:0 atIndex:0];
            [renderEncoder setVertexBuffer:self.metalContext.uniformBuffer offset:0 atIndex:1];
            [renderEncoder setFragmentBuffer:self.metalContext.uniformBuffer offset:0 atIndex:0];
            [renderEncoder setFragmentTexture:self.texture
                                      atIndex:0];
            [renderEncoder drawPrimitives:MTLPrimitiveTypeTriangleStrip
                              vertexStart:0
                              vertexCount:4
                            instanceCount:1];
            [renderEncoder endEncoding];

            // ADD SECOND, THIRD, ETC. PARALLEL RENDERING OPERATION HERE
            .
            .
            .

            // SUBMIT ALL RENDERING OPERATIONS IN PARALLEL HERE
            [parallelRCE endEncoding];

            __block dispatch_semaphore_t block_sema = self._inflight_semaphore;
            [commandBuffer addCompletedHandler:^(id<MTLCommandBuffer> buffer) {
                dispatch_semaphore_signal(block_sema);
            }];
            if (drawable)
                [commandBuffer presentDrawable:drawable];
            [commandBuffer commit];
            [commandBuffer waitUntilScheduled];
        }
    });
}
In the above example, you would duplicate the renderEncoder-related code for each calculation you want to perform in parallel. I do not see how this would be of benefit to you in your code example, as one operation appears to be dependent on another. Probably, then, the best you could hope for is the code provided to you by warrenm, even though that doesn't really qualify as parallel rendering.
I created a simple passthrough compute kernel:
kernel void filter(texture2d<float, access::read> inTexture [[texture(0)]],
                   texture2d<float, access::write> outTexture [[texture(1)]],
                   uint2 gridPos [[ thread_position_in_grid ]]) {
    float4 color = inTexture.read(gridPos);
    outTexture.write(color, gridPos);
}
Measuring the execution time
[self.timer start];
[commandBuffer commit];
[commandBuffer waitUntilCompleted];
CGFloat ms = [self.timer elapse];
The Timer class works like this:
- (void)start {
    self.startMach = mach_absolute_time();
}

- (CGFloat)elapse {
    uint64_t end = mach_absolute_time();
    uint64_t elapsed = end - self.startMach;
    uint64_t nanosecs = elapsed * self.info.numer / self.info.denom;
    uint64_t millisecs = nanosecs / 1000000;
    return millisecs;
}
Dispatch call:
static const NSUInteger kGroupSize = 16;

- (MTLSize)threadGroupSize {
    return MTLSizeMake(kGroupSize, kGroupSize, 1);
}

- (MTLSize)threadGroupsCount:(MTLSize)threadGroupSize {
    return MTLSizeMake(self.provider.texture.width / kGroupSize,
                       self.provider.texture.height / kGroupSize, 1);
}

[commandEncoder dispatchThreadgroups:threadgroups
               threadsPerThreadgroup:threadgroupSize];
gives me 13 ms on a 512x512 RGBA image, and it grows linearly if I perform more passes.
Is this correct? It seems too much overhead for real time application.
Compute kernels are known to have rather high overhead on A7 processors. One thing to consider, though, is that this is basically the least flattering test you can run: a one-shot threadgroup dispatch might take ~2ms to get scheduled, but scheduling of subsequent dispatches can be up to an order of magnitude faster. Additionally there's little chance for latency hiding here. In practice, a much more complex kernel probably wouldn't take substantially longer to execute, and if you can interleave it with whatever rendering you might be doing, you might find performance to be acceptable.