I have a Metal kernel function. Usually you access pixels like this:
kernel void edgeDetect(texture2d<half, access::sample> inTexture [[ texture(0) ]],
                       texture2d<half, access::write> outTexture [[ texture(1) ]],
                       device const uint *roi [[ buffer(0) ]],
                       uint2 grid [[ thread_position_in_grid ]]) {
    // Skip threads that fall outside the output texture.
    if (grid.x >= outTexture.get_width() || grid.y >= outTexture.get_height()) {
        return;
    }
    // Gather the 3x3 neighbourhood around the current pixel.
    half c[9];
    for (int i=0; i < 3; ++i) {
        for (int j=0; j < 3; ++j) {
            c[3*i+j] = inTexture.read(grid + uint2(i-1,j-1)).x;
        }
    }
    // Sobel gradients in x and y, then the gradient magnitude.
    half3 Lx = 2.0*(c[7]-c[1]) + c[6] + c[8] - c[2] - c[0];
    half3 Ly = 2.0*(c[3]-c[5]) + c[6] + c[0] - c[2] - c[8];
    half3 G = sqrt(Lx*Lx+Ly*Ly);
    outTexture.write(half4(G, 0.0), grid);
}
Now I need to access pixels in the neighbourhood of the current grid position like this:
half4 inColor = inTexture.read(grid - uint2(-1,-1));
Basically this works, but on the thread boundaries I have "discontinuities" as shown in this image (the brick wall pattern).
This is clear since each thread is passed only its sub-texture to process. So beyond thread boundaries I can't access pixels.
My question is: what is the concept when I need to address pixels beyond the current position in a compute kernel? Is this possible with compute kernels at all?
I have found the issue:
The line
c[3*i+j] = inTexture.read(grid + uint2(i-1,j-1)).x;
must be changed to:
c[3*i+j] = inTexture.read(grid + uint2(i,j)).x;
Obviously the position indices of -1 into the texture failed and produced the brick-wall-like artefacts shown in the image above.
Posting this as an answer so that one is attached to the question: there is no restriction on which pixels you can access in a compute shader. Your grid size affects scheduling only.
Your error is constructing an unsigned uint2 from negative numbers. In the first iteration of your loop you attempt to construct uint2(-1, -1), which is the same as uint2(4294967295, 4294967295) and therefore way out of bounds.
You can use int2, or as per your self-answer just avoid negative numbers.
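As a rough illustration of the point above, here is a small CPU-side sketch in C++ (not Metal). It shows the wraparound that happens when -1 is converted to a 32-bit unsigned value, and one possible way to compute neighbour coordinates in signed arithmetic and clamp them to the texture bounds. `Int2`, `asUnsigned` and `clampedNeighbour` are hypothetical names made up for this sketch; clamping at the border is an assumption, and other border policies would work as well.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for Metal's int2, for illustration only.
struct Int2 { int x; int y; };

// Converting a negative int to a 32-bit unsigned value wraps modulo 2^32.
inline std::uint32_t asUnsigned(int v) {
    return static_cast<std::uint32_t>(v);
}

// Clamp v into the inclusive range [lo, hi].
inline int clampInt(int v, int lo, int hi) {
    return std::max(lo, std::min(v, hi));
}

// Compute a neighbour coordinate in signed arithmetic and clamp it to the
// texture bounds, so that offsets such as (-1, -1) stay valid at the border.
inline Int2 clampedNeighbour(Int2 p, Int2 offset, int width, int height) {
    Int2 q;
    q.x = clampInt(p.x + offset.x, 0, width - 1);
    q.y = clampInt(p.y + offset.y, 0, height - 1);
    return q;
}
```

In a kernel, the same idea would mean doing the offset arithmetic on signed values and converting to an unsigned coordinate only after the bounds check.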
I have a Metal shader that computes an image histogram like this:
#define CHANNEL_SIZE (256)
typedef atomic_uint HistoBuffer[CHANNEL_SIZE];

kernel void
computeHisto(texture2d<half, access::read> sourceTexture [[ texture(0) ]],
             device HistoBuffer &histo [[buffer(0)]],
             uint2 grid [[thread_position_in_grid]]) {
    if (grid.x >= sourceTexture.get_width() || grid.y >= sourceTexture.get_height()) { return; }
    half gray = sourceTexture.read(grid).r;
    uint grayvalue = uint(gray * (CHANNEL_SIZE - 1));
    atomic_fetch_add_explicit(&histo[grayvalue], 1, memory_order_relaxed);
}
This works as expected but takes too long (>1ms). I then tried to optimise it by reducing the number of atomic operations and came up with the following code. The idea is to compute a local histogram per thread group and add it atomically into the global histogram buffer afterwards.
kernel void
computeHisto_fast(texture2d<half, access::read> sourceTexture [[ texture(0) ]],
                  device HistoBuffer &histo [[buffer(0)]],
                  uint2 t_pos_grid [[thread_position_in_grid]],
                  uint2 tg_pos_grid [[ threadgroup_position_in_grid ]],
                  uint2 t_pos_tg [[ thread_position_in_threadgroup]],
                  uint t_idx_tg [[ thread_index_in_threadgroup ]],
                  uint2 t_per_tg [[ threads_per_threadgroup ]]) {
    threadgroup uint localhisto[CHANNEL_SIZE] = { 0 };
    if (t_pos_grid.x >= sourceTexture.get_width() || t_pos_grid.y >= sourceTexture.get_height()) { return; }
    half gray = sourceTexture.read(t_pos_grid).r;
    uint grayvalue = uint(gray * (CHANNEL_SIZE - 1));
    localhisto[grayvalue]++;
    // wait for all threads in threadgroup to finish
    threadgroup_barrier(mem_flags::mem_threadgroup);
    // copy the thread group result atomically into global histo buffer
    if(t_idx_tg == 0) {
        for(uint i=0;i<CHANNEL_SIZE;i++) {
            atomic_fetch_add_explicit(&histo[i], localhisto[i], memory_order_relaxed);
        }
    }
}
There are 2 problems:
1. The improved routine does not yield results identical to the first one, and I currently don't see why.
2. The run time didn't improve. In fact, it takes 4 times as long as the unoptimised version. According to the debugger the for loop is the problem. But I do not understand this, since the number of atomic operations is reduced by 3 orders of magnitude, i.e. by the thread group size, here 32x32 = 1024.
Can anybody explain what I am doing wrong here? Thanks
EDIT: 2019-12-22:
Following Matthijs' answer, I have changed the local histogram to use atomic operations as well, like this:
threadgroup atomic_uint localhisto[CHANNEL_SIZE] = {0};
half gray = sourceTexture.read(t_pos_grid).r;
uint grayvalue = uint(gray * (CHANNEL_SIZE - 1));
atomic_fetch_add_explicit(&localhisto[grayvalue], 1, memory_order_relaxed);
However, the result still is not the same as in the reference implementation above. There must be another severe conceptual bug.
You'll still need to use atomic operations on the threadgroup memory, since it's still being shared by multiple threads. This should be faster than in your first version because there is less contention for the same locks.
I think the problem is with initializing the shared memory; I don't think this definition does the job. Also, threadgroup-level memory synchronization is required between zeroing the shared memory and the atomic updates.
As for the device memory update, doing it using a single thread is clearly suboptimal. Updating the whole 256 length histogram in each threadblock can have a huge overhead depending on the size of the threadblock.
A sample I used for a small (16 element) histogram using 8x8 threadblocks:
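kernel void gaussian_filter(device const uchar* data,
                            device atomic_uint* p_hist,
                            uint2 imageShape [[threads_per_grid]],
                            uint2 idx [[thread_position_in_grid]],
                            uint tidx [[thread_index_in_threadgroup]])
{
    threadgroup atomic_uint sh_hist[16];
    // Zero the shared histogram: one bin per thread for the first 16 threads.
    if (tidx < 16)
        atomic_store_explicit(sh_hist + tidx, 0, memory_order_relaxed);
    // Make the zeroed bins visible to the whole threadgroup before counting.
    threadgroup_barrier(mem_flags::mem_threadgroup);

    uint histBin = (uint)data[imageShape[0]*idx[1] + idx[0]]/16;
    atomic_fetch_add_explicit(sh_hist + histBin, 1, memory_order_relaxed);
    // Wait until every thread in the threadgroup has counted its pixel.
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // Merge the shared histogram into device memory, one bin per thread.
    if (tidx < 16)
        atomic_fetch_add_explicit(p_hist + tidx, atomic_load_explicit(sh_hist + tidx, memory_order_relaxed), memory_order_relaxed);
}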
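The zero, accumulate, merge pattern discussed in this thread can be checked on the CPU. The following is a minimal serial sketch in C++ (not Metal): each "group" of pixels fills its own zero-initialised local histogram, which is then added into the global one. `histogramDirect` and `histogramGrouped` are hypothetical names, and the sketch ignores concurrency entirely; it only illustrates that the grouped result matches the direct one when every local histogram starts at zero and is merged exactly once.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kBins = 256;
using Histogram = std::array<std::uint32_t, kBins>;

// Reference version: one increment of the global histogram per pixel.
inline Histogram histogramDirect(const std::vector<std::uint8_t>& pixels) {
    Histogram h{};
    for (std::uint8_t p : pixels) ++h[p];
    return h;
}

// Grouped version: every group of `groupSize` pixels (groupSize > 0) fills a
// local histogram that is zeroed first and merged into the global one after.
inline Histogram histogramGrouped(const std::vector<std::uint8_t>& pixels,
                                  std::size_t groupSize) {
    Histogram global{};
    for (std::size_t start = 0; start < pixels.size(); start += groupSize) {
        Histogram local{};  // must start at zero for every group
        const std::size_t end = std::min(start + groupSize, pixels.size());
        for (std::size_t i = start; i < end; ++i) ++local[pixels[i]];
        // Merge step: one addition per bin, regardless of the group size.
        for (std::size_t bin = 0; bin < kBins; ++bin) global[bin] += local[bin];
    }
    return global;
}
```

Note that the merge costs one addition per bin for every group, which is why spreading it over several threads, as in the sample above, tends to matter for a 256-bin histogram.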
I haven't written many Metal kernel shaders yet; here's a fledgling "fade" shader between two RGBX-32 images, using a tween value of 0.0 to 1.0 between inBuffer1 (0.0) to inBuffer2 (1.0).
Is there something I'm missing here? It strikes me that this may be terribly inefficient.
My first inkling is to do the subtraction and multiplication using the vector data types (e.g. char4), thinking that might be better, but the results of this are certainly undefined (as some components will be negative).
Also, is there some advantage to using MTLTexture versus MTLBuffer objects as I've done?
kernel void fade_Kernel(device const uchar4 *inBuffer1 [[ buffer(0) ]],
                        device const uchar4 *inBuffer2 [[ buffer(1) ]],
                        device const float *tween [[ buffer(2) ]],
                        device uchar4 *outBuffer [[ buffer(3) ]],
                        uint gid [[ thread_position_in_grid ]])
{
    const float t = tween[0];
    uchar4 pixel1 = inBuffer1[gid];
    uchar4 pixel2 = inBuffer2[gid];
    // these values will be negative
    short r=(pixel2.r-pixel1.r)*t;
    short g=(pixel2.g-pixel1.g)*t;
    short b=(pixel2.b-pixel1.b)*t;
    outBuffer[gid]=uchar4(pixel1.r+r,pixel1.g+g,pixel1.b+b,0xff);
}
First, you should probably declare the tween parameter as:
constant float &tween [[ buffer(2) ]],
Using the constant address space is more appropriate for a value like this that's the same for all invocations of the function (and not indexed into by grid position or the like). Also, making it a reference instead of a pointer tells the compiler that you won't be indexing other elements in the "array" that a pointer might be.
Finally, there's a mix() function that performs exactly the sort of computation that you're doing here. So, you could replace the body of the function with:
uchar4 pixel1 = inBuffer1[gid];
uchar4 pixel2 = inBuffer2[gid];
outBuffer[gid] = uchar4(uchar3(mix(float3(pixel1.rgb), float3(pixel2.rgb), tween)), 0xff);
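For readers unfamiliar with mix(), it is a linear interpolation, x + (y - x) * a. The sketch below reproduces that formula on the CPU in C++ for a single 8-bit channel. `mixf` and `fadeChannel` are hypothetical helper names, and rounding to nearest is an assumption of this sketch; the uchar3(...) conversion in the snippet above may truncate instead.

```cpp
#include <cassert>
#include <cstdint>

// Linear interpolation using the same formula as mix(): x + (y - x) * a.
inline float mixf(float x, float y, float a) {
    return x + (y - x) * a;
}

// Fade one 8-bit channel from `from` to `to` for t in [0, 1].
// The difference (to - from) may be negative, which is harmless in float.
inline std::uint8_t fadeChannel(std::uint8_t from, std::uint8_t to, float t) {
    const float v = mixf(static_cast<float>(from), static_cast<float>(to), t);
    return static_cast<std::uint8_t>(v + 0.5f);  // round to nearest
}
```

Doing the arithmetic in float sidesteps the negative-component problem raised in the question, because the conversion back to an unsigned channel happens only after the interpolation.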
As to whether it would be better to use textures, that depends somewhat on what you plan to do with the result after running this kernel. If you're going to be doing texture-like things with it anyway, it might be better to use textures all throughout. Indeed, it might be better to use drawing operations with blending rather than a compute kernel. After all, such blending is something GPUs have to do all the time, so that path is probably fast. You'd have to test the performance of each approach.
If you are dealing with images, it's much more efficient to use MTLTexture than MTLBuffer. It is also better to use "half" than "uchar". I've learned this directly from an Apple engineer at WWDC this year.
kernel void alpha(texture2d<half, access::read> inTexture2 [[texture(0)]],
                  texture2d<half, access::read> inTexture1 [[texture(1)]],
                  texture2d<half, access::write> outTexture [[texture(2)]],
                  const device float& tween [[ buffer(3) ]],
                  uint2 gid [[thread_position_in_grid]])
{
    // Check if the pixel is within the bounds of the output texture
    if((gid.x >= outTexture.get_width()) || (gid.y >= outTexture.get_height())) {
        // Return early if the pixel is out of bounds
        return;
    }
    half4 color1 = inTexture1.read(gid);
    half4 color2 = inTexture2.read(gid);
    outTexture.write(half4(mix(color1.rgb, color2.rgb, half(tween)), color1.a), gid);
}
I'm trying to implement code in Metal that performs a 1D convolution between two vectors of given lengths. I've implemented the following, which works correctly:
kernel void convolve(const device float *dataVector [[ buffer(0) ]],
                     const device int& dataSize [[ buffer(1) ]],
                     const device float *filterVector [[ buffer(2) ]],
                     const device int& filterSize [[ buffer(3) ]],
                     device float *outVector [[ buffer(4) ]],
                     uint id [[ thread_position_in_grid ]]) {
    int outputSize = dataSize - filterSize + 1;
    for (int i=0;i<outputSize;i++) {
        float sum = 0.0;
        for (int j=0;j<filterSize;j++) {
            sum += dataVector[i+j] * filterVector[j];
        }
        outVector[i] = sum;
    }
}
My problem is it takes about 10 times longer to process (computation + data transfer to/from GPU) the same data using Metal than in Swift on a CPU. My question is how do I replace the inner loop with a single vector operation or is there another way to speed up the above code?
The key to taking advantage of the GPU's parallelism in this case is to let it manage the outer loop for you. Instead of invoking the kernel once for the entire data vector, we'll invoke it for each element in the data vector. The kernel function simplifies to this:
kernel void convolve(const device float *dataVector [[ buffer(0) ]],
                     const constant int &dataSize [[ buffer(1) ]],
                     const constant float *filterVector [[ buffer(2) ]],
                     const constant int &filterSize [[ buffer(3) ]],
                     device float *outVector [[ buffer(4) ]],
                     uint id [[ thread_position_in_grid ]])
{
    float sum = 0.0;
    for (int i = 0; i < filterSize; ++i) {
        sum += dataVector[id + i] * filterVector[i];
    }
    outVector[id] = sum;
}
In order to dispatch this work, we select a threadgroup size based on the thread execution width recommended by the compute pipeline state. The one tricky thing here is making sure that there's enough padding in the input and output buffers so that we can slightly overrun the actual size of the data. This does cause us to waste a small amount of memory and computation, but saves us the complexity of doing a separate dispatch just to compute the convolution for the elements at the end of the buffer.
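To make the restructuring concrete, here is a minimal CPU-side sketch in C++ (not Metal). `convolveAt` plays the role of a single kernel invocation for one output index, and `convolveAll` stands in for the dispatch by looping over all indices; both names are hypothetical. Because each index only reads its inputs and writes its own output element, the iterations are independent, which is what allows the GPU to run them as separate threads. The sketch assumes the data vector is at least as long as the filter.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Work of one "thread": the output element at index `id`.
inline float convolveAt(const std::vector<float>& data,
                        const std::vector<float>& filter,
                        std::size_t id) {
    float sum = 0.0f;
    for (std::size_t j = 0; j < filter.size(); ++j) {
        sum += data[id + j] * filter[j];
    }
    return sum;
}

// CPU stand-in for the dispatch: every id is independent of the others.
// Assumes data.size() >= filter.size().
inline std::vector<float> convolveAll(const std::vector<float>& data,
                                      const std::vector<float>& filter) {
    const std::size_t outputSize = data.size() - filter.size() + 1;
    std::vector<float> out(outputSize);
    for (std::size_t id = 0; id < outputSize; ++id) {
        out[id] = convolveAt(data, filter, id);
    }
    return out;
}
```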
// We should ensure here that the data buffer and output buffer each have a size that is a multiple of
// the compute pipeline's threadExecutionWidth, by padding the amount we allocate for each of them.
// After execution, we ignore the extraneous elements in the output buffer beyond the first (dataCount - filterCount + 1).
let iterationCount = dataCount - filterCount + 1
let threadsPerThreadgroup = MTLSize(width: min(iterationCount, computePipeline.threadExecutionWidth), height: 1, depth: 1)
let threadgroups = (iterationCount + threadsPerThreadgroup.width - 1) / threadsPerThreadgroup.width
let threadgroupsPerGrid = MTLSize(width: threadgroups, height: 1, depth: 1)
let commandEncoder = commandBuffer.computeCommandEncoder()
commandEncoder.setBuffer(dataBuffer, offset: 0, at: 0)
commandEncoder.setBytes(&dataCount, length: MemoryLayout<Int>.stride, at: 1)
commandEncoder.setBuffer(filterBuffer, offset: 0, at: 2)
commandEncoder.setBytes(&filterCount, length: MemoryLayout<Int>.stride, at: 3)
commandEncoder.setBuffer(outBuffer, offset: 0, at: 4)
commandEncoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
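The padding arithmetic can be spelled out separately. The C++ sketch below uses hypothetical helper names and is only a reading of the dispatch code above: the threadgroup count is a ceiling division, the number of launched threads is that count times the threadgroup width, and, since the kernel reads dataVector[id + i] for every launched id, the input buffer would need room for the padded thread count plus filterCount - 1 elements. That last formula is my own derivation from the kernel, not something stated in the answer.

```cpp
#include <cassert>
#include <cstddef>

// Ceiling division: number of threadgroups needed to cover all iterations.
inline std::size_t threadgroupCount(std::size_t iterationCount,
                                    std::size_t threadsPerGroup) {
    return (iterationCount + threadsPerGroup - 1) / threadsPerGroup;
}

// Total threads actually launched; also the padded output length.
inline std::size_t paddedThreadCount(std::size_t iterationCount,
                                     std::size_t threadsPerGroup) {
    return threadgroupCount(iterationCount, threadsPerGroup) * threadsPerGroup;
}

// Input elements touched when every launched thread reads filterCount values
// starting at its own id (assumption derived from the kernel above).
inline std::size_t requiredDataLength(std::size_t iterationCount,
                                      std::size_t filterCount,
                                      std::size_t threadsPerGroup) {
    return paddedThreadCount(iterationCount, threadsPerGroup) + filterCount - 1;
}
```

For example, 100 output elements with a width of 32 give 4 threadgroups and 128 launched threads, of which the last 28 produce values that are ignored.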
In my experiments, this parallelized approach runs 400-1000x faster than the serial version in the question. I'm curious to hear how it compares to your CPU implementation.
The following code shows how to render encoded commands in parallel on the GPU using the Objective-C Metal API. (The threading code above only divides rendering of the output into grid sections for parallel processing; the calculations are still not performed in parallel.) It is what you're referring to in your question, even though it's not exactly what you want. I've provided this answer to help anyone who might have stumbled upon this question thinking that it was going to provide an answer related to parallel rendering, when in fact it does not:
- (void)drawInMTKView:(MTKView *)view
{
    dispatch_async(((AppDelegate *)UIApplication.sharedApplication.delegate).cameraViewQueue, ^{
        id <CAMetalDrawable> drawable = [view currentDrawable]; //[(CAMetalLayer *)view.layer nextDrawable];
        MTLRenderPassDescriptor *renderPassDesc = [view currentRenderPassDescriptor];
        renderPassDesc.colorAttachments[0].loadAction = MTLLoadActionClear;
        renderPassDesc.colorAttachments[0].clearColor = MTLClearColorMake(0.0,0.0,0.0,1.0);
        renderPassDesc.renderTargetWidth = self.texture.width;
        renderPassDesc.renderTargetHeight = self.texture.height;
        renderPassDesc.colorAttachments[0].texture = drawable.texture;
        if (renderPassDesc != nil)
        {
            dispatch_semaphore_wait(self._inflight_semaphore, DISPATCH_TIME_FOREVER);
            id <MTLCommandBuffer> commandBuffer = [self.metalContext.commandQueue commandBuffer];
            [commandBuffer enqueue];
            id <MTLParallelRenderCommandEncoder> parallelRCE = [commandBuffer parallelRenderCommandEncoderWithDescriptor:renderPassDesc];
            id <MTLRenderCommandEncoder> renderEncoder = [parallelRCE renderCommandEncoder];
            [renderEncoder setRenderPipelineState:self.metalContext.renderPipelineState];
            [renderEncoder setVertexBuffer:self.metalContext.vertexBuffer offset:0 atIndex:0];
            [renderEncoder setVertexBuffer:self.metalContext.uniformBuffer offset:0 atIndex:1];
            [renderEncoder setFragmentBuffer:self.metalContext.uniformBuffer offset:0 atIndex:0];
            [renderEncoder setFragmentTexture:self.texture
            [renderEncoder drawPrimitives:MTLPrimitiveTypeTriangleStrip
            [renderEncoder endEncoding];
            [parallelRCE endEncoding];
            __block dispatch_semaphore_t block_sema = self._inflight_semaphore;
            [commandBuffer addCompletedHandler:^(id<MTLCommandBuffer> buffer) {
                dispatch_semaphore_signal(block_sema);
            }];
            if (drawable)
                [commandBuffer presentDrawable:drawable];
            [commandBuffer commit];
            [commandBuffer waitUntilScheduled];
        }
    });
}
In the above example, you would duplicate the renderEncoder-related calls for each calculation you want to perform in parallel. I do not see how this would benefit your code example, as one operation appears to depend on another. The best you could hope for, then, is probably the code provided by warrenm, even though that doesn't really qualify as parallel rendering.
In iOS 8 there was a problem with the division of floats in Metal preventing proper texture projection, which I solved.
Today I discovered that the texture projection on iOS 9 is broken again, although I'm not sure why.
The results of warping a texture on the CPU (with OpenCV) and on the GPU are not the same. You can see this on your iPhone if you run this example app (which already includes the fix for iOS 8) on iOS 9.
The expected CPU warp is colored red, while the GPU warp done by Metal is colored green, so where they overlap they are yellow. Ideally you should not see green or red, but only shades of yellow.
Can you:
confirm the problem exists on your end;
give any advice on anything that might be wrong?
The shader code is:
struct VertexInOut
{
    float4 position [[ position ]];
    float3 warpedTexCoords;
    float3 originalTexCoords;
};

vertex VertexInOut warpVertex(uint vid [[ vertex_id ]],
                              device float4 *positions [[ buffer(0) ]],
                              device float3 *texCoords [[ buffer(1) ]])
{
    VertexInOut v;
    v.position = positions[vid];
    // example homography
    simd::float3x3 h = {
        {1.03140473, 0.0778113901, 0.000169219566},
        {0.0342947133, 1.06025684, 0.000459250761},
        {-0.0364957005, -38.3375587, 0.818259298}
    };
    v.warpedTexCoords = h * texCoords[vid];
    v.originalTexCoords = texCoords[vid];
    return v;
}

fragment half4 warpFragment(VertexInOut inFrag [[ stage_in ]],
                            texture2d<half, access::sample> original [[ texture(0) ]],
                            texture2d<half, access::sample> cpuWarped [[ texture(1) ]])
{
    constexpr sampler s(coord::pixel, filter::linear, address::clamp_to_zero);
    half4 gpuWarpedPixel = half4(original.sample(s, inFrag.warpedTexCoords.xy * (1.0 / inFrag.warpedTexCoords.z)).r, 0, 0, 255);
    half4 cpuWarpedPixel = half4(0, cpuWarped.sample(s, inFrag.originalTexCoords.xy).r, 0, 255);
    return (gpuWarpedPixel + cpuWarpedPixel) * 0.5;
}
Do not ask me why, but if I multiply the warped coordinates by 1.00005 or any number close to 1.0, it is fixed (apart from very tiny details). See last commit in the example app repo.
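For reference when comparing against a CPU warp, the projective mapping that the shaders in the question perform (multiply by a 3x3 homography, then divide x and y by the third component) can be sketched in C++ as follows. `Vec3`, `Mat3`, `fromColumns`, `mul` and `project` are hypothetical names. The sketch treats each inner brace group of the shader's initializer as a matrix column and assumes the input texture coordinate has a third component of 1; both are assumptions worth checking against the CPU code, which may store the homography row by row.

```cpp
#include <array>
#include <cassert>

using Vec3 = std::array<float, 3>;
using Mat3 = std::array<Vec3, 3>;  // m[c] is column c (column-major)

// Build a matrix from its three columns.
inline Mat3 fromColumns(const Vec3& c0, const Vec3& c1, const Vec3& c2) {
    Mat3 m;
    m[0] = c0;
    m[1] = c1;
    m[2] = c2;
    return m;
}

// Matrix-vector product for a column-major matrix.
inline Vec3 mul(const Mat3& m, const Vec3& v) {
    Vec3 r = {0.0f, 0.0f, 0.0f};
    for (int c = 0; c < 3; ++c) {
        for (int row = 0; row < 3; ++row) {
            r[row] += m[c][row] * v[c];
        }
    }
    return r;
}

// Apply the homography to (x, y, 1) and perform the perspective divide.
inline std::array<float, 2> project(const Mat3& h, float x, float y) {
    const Vec3 in = {x, y, 1.0f};
    const Vec3 w = mul(h, in);
    const std::array<float, 2> out = {w[0] / w[2], w[1] / w[2]};
    return out;
}
```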
I'm in the process of procedural planet generation; so far I have done the dynamic LOD work, but my current software algorithm is very, very slow. I decided to do it using DX11's new tessellation features instead.
Currently my sphere is a subdivided icosahedron (20 sides, all equilateral triangles).
Back when I was subdividing using my software algorithm, one triangle would be
split into four children across the midpoints of the parent, forming the Hyrule symbol each time, like this: http://puu.sh/1xFIx
As you can see, each subdivided triangle created more and more equilateral triangles, i.e. each one was exactly the same shape.
But now that I am using the GPU to tessellate in HLSL, the result is definitely not
what I am looking for: http://puu.sh/1xFx7
Is there anything I can do in the Hull and Domain shaders to change the tessellation
so that it subdivides into sets of equilateral triangles like the first image?
Should I be using the geometry shader for something like this? If so, would it be
slower than the tessellator?
I tried using the tessellation shader, but I encountered a problem: the domain shader only receives the uv coordinate (SV_DomainLocation) and the input patch for positioning the vertices. When the domain location of a vertex is (0.3, 0.3, 0.3) (the center vertex), it is impossible to know the correct position, because you need information about the other vertices or an (x, y) iteration index, which the domain shader stage does not provide.
Because of this problem I wrote the code in a geometry shader. This shader is very limited for tessellation because the output stream cannot be bigger than 1024 bytes (in shader model 5.0). I implemented the calculation of vertex positions using the uv (like SV_DomainLocation), but this only tessellates the triangles; you must use part of your own code to calculate the added positions in the center of the triangles to create the precise final result.
This is the code for equilateral triangle tessellation:
// required for array

void DrawTriangle(float4 p0, float4 p1, float4 p2, inout TriangleStream<VS_OUT> stream)
{
    VS_OUT v0;
    v0.pos = p0;
    VS_OUT v1;
    v1.pos = p1;
    VS_OUT v2;
    v2.pos = p2;
    // Emit the three vertices as one standalone triangle.
    stream.Append(v0);
    stream.Append(v1);
    stream.Append(v2);
    stream.RestartStrip();
}

[maxvertexcount(128)] // directx rule: maxvertexcount * sizeof(VS_OUT) <= 1024
void gs(triangle VS_OUT input[3], inout TriangleStream<VS_OUT> stream)
{
    int itc = min(tess, MAX_ITERATIONS);
    float fitc = itc;
    float4 past_pos[MAX_ITERATIONS];
    float4 array_pass[MAX_ITERATIONS];
    for (int pi = 0; pi < MAX_ITERATIONS; pi++)
    {
        past_pos[pi] = float4(0, 0, 0, 0);
        array_pass[pi] = float4(0, 0, 0, 0);
    }
    // -------------------------------------
    // Tessellation kernel for the control points
    for (int x = 0; x <= itc; x++)
    {
        float4 last;
        for (int y = 0; y <= x; y++)
        {
            float2 seg = float2(x / fitc, y / fitc);
            float3 uv;
            uv.x = 1 - seg.x;
            uv.z = seg.y;
            uv.y = 1 - (uv.x + uv.z);
            // ---------------------------------------
            // Domain Stage
            // uv Domain Location
            // x,y IterationIndex
            float4 fpos = input[0].pos * uv.x;
            fpos += input[1].pos * uv.y;
            fpos += input[2].pos * uv.z;
            if (x > 0 && y > 0)
            {
                DrawTriangle(past_pos[y - 1], last, fpos, stream);
                if (y < x)
                {
                    // add adjacent triangle
                    DrawTriangle(past_pos[y - 1], fpos, past_pos[y], stream);
                }
            }
            array_pass[y] = fpos;
            last = fpos;
        }
        // Keep this row's positions for building the next row.
        for (int i = 0; i < MAX_ITERATIONS; i++)
        {
            past_pos[i] = array_pass[i];
        }
    }
}
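The row-by-row scheme used by the geometry shader can be tried out on the CPU. Below is a minimal C++ port of the same loop structure that records barycentric coordinates instead of positions. `Bary`, `Triangle` and `tessellate` are hypothetical names, and `n` corresponds to `itc` in the shader and is assumed to be at least 1. With n segments per edge the loop emits n * n triangles, which is a quick way to check an output budget such as maxvertexcount (3 vertices per emitted triangle).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Barycentric weights of one generated vertex (u + v + w == 1).
struct Bary { float u, v, w; };
using Triangle = std::array<Bary, 3>;

// Subdivide a triangle into a regular grid with n segments per edge (n >= 1).
inline std::vector<Triangle> tessellate(int n) {
    std::vector<Triangle> tris;
    std::vector<Bary> prevRow(static_cast<std::size_t>(n) + 1);
    std::vector<Bary> curRow(static_cast<std::size_t>(n) + 1);
    for (int x = 0; x <= n; ++x) {
        Bary last{};
        for (int y = 0; y <= x; ++y) {
            Bary p;
            p.u = 1.0f - static_cast<float>(x) / static_cast<float>(n);
            p.w = static_cast<float>(y) / static_cast<float>(n);
            p.v = 1.0f - (p.u + p.w);
            if (x > 0 && y > 0) {
                // Triangle pointing towards the previous row.
                tris.push_back(Triangle{{prevRow[y - 1], last, p}});
                if (y < x) {
                    // Adjacent triangle that fills the gap next to it.
                    tris.push_back(Triangle{{prevRow[y - 1], p, prevRow[y]}});
                }
            }
            curRow[y] = p;
            last = p;
        }
        prevRow = curRow;  // this row becomes the reference for the next one
    }
    return tris;
}
```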