I haven't written many Metal kernel shaders yet; here's a fledgling "fade" shader between two RGBX-32 images, driven by a tween value that runs from 0.0 (fully inBuffer1) to 1.0 (fully inBuffer2).
Is there something I'm missing here? Something tells me this may be terribly inefficient.
My first inkling was to attempt the subtraction and multiplication using the vector data types (e.g. char4), thinking that might be better, but the results would certainly be undefined (as some components will be negative).
Also, is there some advantage to using MTLTexture versus MTLBuffer objects as I've done?
kernel void fade_Kernel(device const uchar4 *inBuffer1 [[ buffer(0) ]],
                        device const uchar4 *inBuffer2 [[ buffer(1) ]],
                        device const float *tween [[ buffer(2) ]],
                        device uchar4 *outBuffer [[ buffer(3) ]],
                        uint gid [[ thread_position_in_grid ]])
{
    const float t = tween[0];
    uchar4 pixel1 = inBuffer1[gid];
    uchar4 pixel2 = inBuffer2[gid];
    // these values will be negative
    short r = (pixel2.r - pixel1.r) * t;
    short g = (pixel2.g - pixel1.g) * t;
    short b = (pixel2.b - pixel1.b) * t;
    outBuffer[gid] = uchar4(pixel1.r + r, pixel1.g + g, pixel1.b + b, 0xff);
}
First, you should probably declare the tween parameter as:
constant float &tween [[ buffer(2) ]],
Using the constant address space is more appropriate for a value like this that's the same for all invocations of the function (and not indexed into by grid position or the like). Also, making it a reference instead of a pointer tells the compiler that you won't be indexing any other elements of the "array" that a pointer might imply.
Finally, there's a mix() function that performs exactly the sort of computation that you're doing here. So, you could replace the body of the function with:
uchar4 pixel1 = inBuffer1[gid];
uchar4 pixel2 = inBuffer2[gid];
outBuffer[gid] = uchar4(uchar3(mix(float3(pixel1.rgb), float3(pixel2.rgb), tween)), 0xff);
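Putting the two suggestions together, the whole kernel might look like this (a sketch, with the buffer indices kept as in the question):

kernel void fade_Kernel(device const uchar4 *inBuffer1 [[ buffer(0) ]],
                        device const uchar4 *inBuffer2 [[ buffer(1) ]],
                        constant float &tween [[ buffer(2) ]],
                        device uchar4 *outBuffer [[ buffer(3) ]],
                        uint gid [[ thread_position_in_grid ]])
{
    uchar4 pixel1 = inBuffer1[gid];
    uchar4 pixel2 = inBuffer2[gid];
    // Interpolate in float, then narrow back to uchar for storage.
    outBuffer[gid] = uchar4(uchar3(mix(float3(pixel1.rgb), float3(pixel2.rgb), tween)), 0xff);
}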
As to whether it would be better to use textures, that depends somewhat on what you plan to do with the result after running this kernel. If you're going to be doing texture-like things with it anyway, it might be better to use textures all throughout. Indeed, it might be better to use drawing operations with blending rather than a compute kernel. After all, such blending is something GPUs have to do all the time, so that path is probably fast. You'd have to test the performance of each approach.
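If you do go the render-pass route with textures, the fade itself becomes a one-liner in a fragment shader. A minimal sketch (not from the answer above; FadeVertexOut, a pass-through vertex stage supplying textureCoord, and the texture/buffer indices are all assumptions):

struct FadeVertexOut {
    float4 position [[ position ]];
    float2 textureCoord;
};

fragment half4 fade_fragment(FadeVertexOut in [[ stage_in ]],
                             texture2d<half> tex1 [[ texture(0) ]],
                             texture2d<half> tex2 [[ texture(1) ]],
                             constant float &tween [[ buffer(0) ]])
{
    constexpr sampler s(filter::linear);
    // Same mix() as the compute version, but on the rasterization path,
    // where the texturing hardware does the reads.
    half4 c1 = tex1.sample(s, in.textureCoord);
    half4 c2 = tex2.sample(s, in.textureCoord);
    return half4(mix(c1.rgb, c2.rgb, half(tween)), 1.0h);
}

Fixed-function blending (drawing the second image over the first with a constant blend factor) would avoid even this shader, but as noted above, profiling both paths is the only way to know which is faster.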
If you are dealing with images, it's much more efficient to use MTLTexture than MTLBuffer. It is also better to use "half" than "uchar". I've learned this directly from an Apple engineer at WWDC this year.
kernel void alpha(texture2d<half, access::read> inTexture2 [[ texture(0) ]],
                  texture2d<half, access::read> inTexture1 [[ texture(1) ]],
                  texture2d<half, access::write> outTexture [[ texture(2) ]],
                  const device float &tween [[ buffer(3) ]],
                  uint2 gid [[ thread_position_in_grid ]])
{
    // Check if the pixel is within the bounds of the output texture
    if ((gid.x >= outTexture.get_width()) || (gid.y >= outTexture.get_height())) {
        // Return early if the pixel is out of bounds
        return;
    }
    half4 color1 = inTexture1.read(gid);
    half4 color2 = inTexture2.read(gid);
    outTexture.write(half4(mix(color1.rgb, color2.rgb, half(tween)), color1.a), gid);
}
I have an MTLTexture in RGBA8Unorm format and a screen texture (in MTKView) in BGRA8Unorm format (reversed). In the Metal shader, when I sample from that texture using sample(), I get a float4. When I write to a texture in the shader, I also write a float4. It seems that inside the shader code, a float4 always represents the components in the same RGBA order, regardless of the original format of the texture ([0] for red, [1] for green, [2] for blue, and [3] for alpha). Is my conclusion correct that the meaning of the components of a sampled/written float4 is always the same inside the shader, regardless of the storage format of the texture?
UPDATE: I use the following code to write to a texture in RGBA8Unorm format:
kernel void
computeColourMap(constant Uniforms &uniforms [[ buffer(0) ]],
                 constant array<float, 120> &amps [[ buffer(1) ]],
                 constant array<float, 120> &red [[ buffer(2) ]],
                 constant array<float, 120> &green [[ buffer(3) ]],
                 constant array<float, 120> &blue [[ buffer(4) ]],
                 texture2d<float, access::write> output [[ texture(0) ]],
                 uint2 id [[ thread_position_in_grid ]])
{
    if (id.x >= output.get_width() || id.y >= output.get_height()) {
        return;
    }
    uint i = id.x % 120;
    float4 col(0, 0, 0, 1);
    col.x += amps[i] * red[i];
    col.y += amps[i] * green[i];
    col.z += amps[i] * blue[i];
    output.write(col, id);
}
I then use the following shaders for the rendering stage:
vertex VertexOut
vertexShader(const device VertexIn *vertexArray [[ buffer(0) ]],
             unsigned int vid [[ vertex_id ]])
{
    VertexIn vertex_in = vertexArray[vid];
    VertexOut vertex_out;
    vertex_out.position = vertex_in.position;
    vertex_out.textureCoord = vertex_in.textureCoord;
    return vertex_out;
}
fragment float4
fragmentShader(VertexOut interpolated [[ stage_in ]],
               texture2d<float> colorTexture [[ texture(0) ]])
{
    // nearest-neighbour sampler used to fetch the colour-map texels
    constexpr sampler nearestSampler(coord::normalized, filter::nearest);
    const float4 colorSample = colorTexture.sample(nearestSampler,
                                                   interpolated.textureCoord);
    return colorSample;
}
where colorTexture, passed into the fragment shader, is the one I generated in RGBA8Unorm format, and in Swift I have:
let renderPipelineDescriptor = MTLRenderPipelineDescriptor()
renderPipelineDescriptor.vertexFunction = library.makeFunction(name: "vertexShader")!
renderPipelineDescriptor.fragmentFunction = library.makeFunction(name: "fragmentShader")!
renderPipelineDescriptor.colorAttachments[0].pixelFormat = colorPixelFormat
The colorPixelFormat of the MTKView is BGRA8Unorm (reversed relative to my texture), which is not the same as my texture's format, yet the colours on the screen come out correct.
UPDATE 2: A further hint that, within a shader, the colour held in a float4 is always in RGBA order: the float4 type has accessors named v.r, v.g, v.b, v.rgb, etc.
The vector always has 4 components, but the type of the components is not necessarily float. When you declare a texture, you specify the component type as a template argument (texture2d<float ...> in your code).
For example, from Metal Shading Language Specification v2.1, section 5.10.1:
The following member functions can be used to sample from a 1D texture.

Tv sample(sampler s, float coord) const

Tv is a 4-component vector type based on the templated type used to declare the texture type. If T is float, Tv is float4. If T is half, Tv is half4. If T is int, Tv is int4. If T is uint, Tv is uint4. If T is short, Tv is short4, and if T is ushort, Tv is ushort4.
The same Tv type is used in the declaration of write(). The functions for other texture types are documented in a similar manner.
And, yes, component .r always contains the red component (if present), etc. And [0] always corresponds to .r (or .x).
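As a concrete illustration of the quoted rule, here is a minimal sketch (not from the question; the kernel name and texture indices are made up) for a texture declared with half:

kernel void readExample(texture2d<half, access::read> inTexture [[ texture(0) ]],
                        texture2d<half, access::write> outTexture [[ texture(1) ]],
                        uint2 gid [[ thread_position_in_grid ]])
{
    half4 texel = inTexture.read(gid); // Tv is half4 because T is half
    half red = texel.r;                // .r, [0] and .x all name the red component
    outTexture.write(half4(red, texel.gba), gid);
}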
I have a Metal shader that computes an image histogram like this:
#define CHANNEL_SIZE (256)
typedef atomic_uint HistoBuffer[CHANNEL_SIZE];

kernel void
computeHisto(texture2d<half, access::read> sourceTexture [[ texture(0) ]],
             device HistoBuffer &histo [[ buffer(0) ]],
             uint2 grid [[ thread_position_in_grid ]]) {
    if (grid.x >= sourceTexture.get_width() || grid.y >= sourceTexture.get_height()) { return; }
    half gray = sourceTexture.read(grid).r;
    uint grayvalue = uint(gray * (CHANNEL_SIZE - 1));
    atomic_fetch_add_explicit(&histo[grayvalue], 1, memory_order_relaxed);
}
This works as expected but takes too long (>1 ms). I then tried to optimise it by reducing the number of atomic operations. I came up with the following improved code. The idea is to compute a local histogram per threadgroup and later add the local histograms atomically into the global histo buffer.
kernel void
computeHisto_fast(texture2d<half, access::read> sourceTexture [[ texture(0) ]],
                  device HistoBuffer &histo [[ buffer(0) ]],
                  uint2 t_pos_grid [[ thread_position_in_grid ]],
                  uint2 tg_pos_grid [[ threadgroup_position_in_grid ]],
                  uint2 t_pos_tg [[ thread_position_in_threadgroup ]],
                  uint t_idx_tg [[ thread_index_in_threadgroup ]],
                  uint2 t_per_tg [[ threads_per_threadgroup ]])
{
    threadgroup uint localhisto[CHANNEL_SIZE] = { 0 };

    if (t_pos_grid.x >= sourceTexture.get_width() || t_pos_grid.y >= sourceTexture.get_height()) { return; }

    half gray = sourceTexture.read(t_pos_grid).r;
    uint grayvalue = uint(gray * (CHANNEL_SIZE - 1));
    localhisto[grayvalue]++;

    // wait for all threads in threadgroup to finish
    threadgroup_barrier(mem_flags::mem_none);

    // copy the thread group result atomically into global histo buffer
    if (t_idx_tg == 0) {
        for (uint i = 0; i < CHANNEL_SIZE; i++) {
            atomic_fetch_add_explicit(&histo[i], localhisto[i], memory_order_relaxed);
        }
    }
}
There are two problems:
1. The improved routine does not yield results identical to those of the first version, and I currently don't see why.
2. The run time didn't improve. In fact, it takes four times the runtime of the unoptimised version. According to the debugger, the for loop is the problem. I don't understand this, since the number of atomic operations is reduced by three orders of magnitude, i.e. by the threadgroup size, here (32x32) = 1024.
Can anybody explain what I am doing wrong here? Thanks
EDIT: 2019-12-22:
Following Matthijs' answer, I have changed the local histogram to use atomic operations as well:
threadgroup atomic_uint localhisto[CHANNEL_SIZE] = {0};
half gray = sourceTexture.read(t_pos_grid).r;
uint grayvalue = uint(gray * (CHANNEL_SIZE - 1));
atomic_fetch_add_explicit(&localhisto[grayvalue], 1, memory_order_relaxed);
However, the result is still not the same as in the reference implementation above. There must be another severe conceptual bug.
You'll still need to use atomic operations on the threadgroup memory, since it's still being shared by multiple threads. This should be faster than in your first version because there is less contention for the same locks.
I think the problem is with the initialization of the shared memory; I don't think this declaration does the job. Also, threadgroup-level memory synchronization is required between zeroing the shared memory and the atomic updates.
As for the device memory update, doing it from a single thread is clearly suboptimal. Updating the whole 256-entry histogram in each threadblock can add huge overhead, depending on the threadblock size.
A sample I used for a small (16-element) histogram with 8x8 threadblocks:
kernel void gaussian_filter(device const uchar *data,
                            device atomic_uint *p_hist,
                            uint2 imageShape [[ threads_per_grid ]],
                            uint2 idx [[ thread_position_in_grid ]],
                            uint tidx [[ thread_index_in_threadgroup ]])
{
    threadgroup atomic_uint sh_hist[16];
    if (tidx < 16)
        atomic_store_explicit(sh_hist + tidx, 0, memory_order_relaxed);
    threadgroup_barrier(mem_flags::mem_threadgroup);

    uint histBin = (uint)data[imageShape[0] * idx[1] + idx[0]] / 16;
    atomic_fetch_add_explicit(sh_hist + histBin, 1, memory_order_relaxed);
    threadgroup_barrier(mem_flags::mem_threadgroup);

    if (tidx < 16)
        atomic_fetch_add_explicit(p_hist + tidx,
                                  atomic_load_explicit(sh_hist + tidx, memory_order_relaxed),
                                  memory_order_relaxed);
}
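The same pattern, adapted to the 256-bin histogram from the question, might look like the following (a hedged, untested sketch; CHANNEL_SIZE is the 256-entry constant defined in the question, and note that no thread returns before a barrier — out-of-bounds threads simply skip the read):

kernel void
computeHisto_tg(texture2d<half, access::read> sourceTexture [[ texture(0) ]],
                device atomic_uint *histo [[ buffer(0) ]],
                uint2 t_pos_grid [[ thread_position_in_grid ]],
                uint t_idx_tg [[ thread_index_in_threadgroup ]],
                uint2 t_per_tg [[ threads_per_threadgroup ]])
{
    threadgroup atomic_uint localhisto[CHANNEL_SIZE];
    const uint tg_size = t_per_tg.x * t_per_tg.y;

    // Zero the threadgroup histogram cooperatively, then synchronize.
    for (uint i = t_idx_tg; i < CHANNEL_SIZE; i += tg_size)
        atomic_store_explicit(&localhisto[i], 0, memory_order_relaxed);
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // Accumulate into threadgroup memory; contention is cheaper here
    // than on device memory.
    if (t_pos_grid.x < sourceTexture.get_width() && t_pos_grid.y < sourceTexture.get_height()) {
        half gray = sourceTexture.read(t_pos_grid).r;
        uint grayvalue = uint(gray * (CHANNEL_SIZE - 1));
        atomic_fetch_add_explicit(&localhisto[grayvalue], 1, memory_order_relaxed);
    }
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // Flush to the device histogram with the whole threadgroup sharing the work,
    // instead of a single thread looping over all 256 bins.
    for (uint i = t_idx_tg; i < CHANNEL_SIZE; i += tg_size)
        atomic_fetch_add_explicit(&histo[i],
                                  atomic_load_explicit(&localhisto[i], memory_order_relaxed),
                                  memory_order_relaxed);
}

With 16x16 threadgroups, for example, each of the 256 threads zeroes and flushes exactly one bin.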
I have a Metal kernel function. Usually you access pixels like this:
kernel void edgeDetect(texture2d<half, access::sample> inTexture [[ texture(0) ]],
                       texture2d<half, access::write> outTexture [[ texture(1) ]],
                       device const uint *roi [[ buffer(0) ]],
                       uint2 grid [[ thread_position_in_grid ]]) {
    if (grid.x >= outTexture.get_width() || grid.y >= outTexture.get_height()) {
        return;
    }
    half c[9];
    for (int i = 0; i < 3; ++i) {
        for (int j = 0; j < 3; ++j) {
            c[3*i + j] = inTexture.read(grid + uint2(i-1, j-1)).x;
        }
    }
    half3 Lx = 2.0 * (c[7] - c[1]) + c[6] + c[8] - c[2] - c[0];
    half3 Ly = 2.0 * (c[3] - c[5]) + c[6] + c[0] - c[2] - c[8];
    half3 G = sqrt(Lx*Lx + Ly*Ly);
    outTexture.write(half4(G, 0.0), grid);
}
Now I need to access pixels in the neighbourhood of the current grid position like this:
half4 inColor = inTexture.read(grid - uint2(-1,-1));
Basically this works, but at the thread boundaries I get "discontinuities", as shown in this image (the brick-wall pattern).
This is clear, since each thread is passed only its sub-texture to process; so beyond thread boundaries I can't access pixels.
My question is: what is the right approach when I need to address pixels beyond the current grid position in a compute kernel? Is this possible with compute kernels at all?
I have found the issue:
The line
c[3*i+j] = inTexture.read(grid + uint2(i-1,j-1)).x;
must be changed to:
c[3*i+j] = inTexture.read(grid + uint2(i,j)).x;
Obviously the position indices of -1 into the texture failed and produced the brick-wall-like artefacts shown in the image above.
To make sure this comment is preserved as an answer: there is no restriction on which pixels you can access in a compute shader. Your grid size affects scheduling only.
Your error is instantiating unsigned uint2 with negative numbers. At the first iteration of your loop you will attempt to construct uint2(-1, -1), which is the same as uint2(4294967295, 4294967295) and therefore way out of bounds.
You can use int2, or as per your self-answer just avoid negative numbers.
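For completeness, a hedged sketch of the read loop rewritten with signed offsets plus clamping, which keeps the 3x3 window centred on grid without ever wrapping at the borders (the accepted uint2(i,j) change instead shifts the window by one pixel). This is a drop-in replacement for the loop inside edgeDetect above:

half c[9];
const int2 bounds = int2(inTexture.get_width() - 1, inTexture.get_height() - 1);
for (int i = 0; i < 3; ++i) {
    for (int j = 0; j < 3; ++j) {
        // Do the arithmetic in signed integers first, then clamp into the texture.
        int2 pos = clamp(int2(grid) + int2(i - 1, j - 1), int2(0), bounds);
        c[3*i + j] = inTexture.read(uint2(pos)).x;
    }
}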
I am trying to add a smudge effect to my paint brush project. To achieve that, I think I need to sample the current results (which are in paintedTexture) at the coordinates where the brush stroke started and pass that to the fragment shader.
I have a vertex shader such as:
vertex VertexOut vertex_particle(device Particle *particles [[ buffer(0) ]],
                                 constant RenderParticleParameters *params [[ buffer(1) ]],
                                 texture2d<half> imageTexture [[ texture(0) ]],
                                 texture2d<half> paintedTexture [[ texture(1) ]],
                                 uint instance [[ instance_id ]])
{
    VertexOut out;
And drawing with a fragment shader such as:
fragment half4 fragment_particle(VertexOut in [[ stage_in ]],
                                 half4 existingColor [[ color(0) ]],
                                 texture2d<half> brushTexture [[ texture(0) ]],
                                 float2 point [[ point_coord ]]) {
Is it possible to create a clipped texture from the paintedTexture and send it to the fragment shader?
paintedTexture is the current results that have been painted to the canvas. I would like to create a new texture from paintedTexture using the same area as the brush texture and pass it to the fragment shader.
The existingColor [[color(0)]] in the fragment shader is of no use since it is the current color, not the color at the beginning of a stroke. If I use existingColor, it's like using transparency (or a transfer mode based on what math is used to combine it with a new color).
If I am barking up the wrong tree, any suggestions on how to achieve a smudging effect with Metal would potentially be acceptable answers.
Update: I tried using a texture2d in the VertexOut struct:
struct VertexOut {
    float4 position [[ position ]];
    float point_size [[ point_size ]];
    texture2d<half> paintedTexture;
};
But it fails to compile with the error:
vertex function has invalid return type 'VertexOut'
It doesn't seem possible to have an array in the VertexOut struct either (which isn't nearly as ideal as a texture, but it could be a path forward):
struct VertexOut {
    float4 position [[ position ]];
    float point_size [[ point_size ]];
    half4 paintedPixels[65536];
};
Gives me the error:
type 'VertexOut' is not valid for attribute 'stage_in'
It's not possible for shaders to create textures. They could fill an existing one, but I don't think that's what you want or need here.
I would expect you could pass paintedTexture to the fragment shader and use the vertex shader to note where, from that texture, to sample. So, just coordinates.
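A hedged sketch of that idea (SmudgeVertexOut, smudgeCoord, and the texture indices are hypothetical names, not from the question): the vertex stage computes where in paintedTexture the stroke started, and the fragment stage samples there.

struct SmudgeVertexOut {
    float4 position [[ position ]];
    float2 smudgeCoord; // normalized coordinate into paintedTexture (hypothetical)
};

fragment half4 fragment_particle_smudge(SmudgeVertexOut in [[ stage_in ]],
                                        texture2d<half> brushTexture [[ texture(0) ]],
                                        texture2d<half> paintedTexture [[ texture(1) ]],
                                        float2 point [[ point_coord ]])
{
    constexpr sampler s(filter::linear);
    half4 brush = brushTexture.sample(s, point);
    // Pick up colour from the canvas at the stroke-start coordinate and
    // deposit it through the brush's alpha.
    half4 picked = paintedTexture.sample(s, in.smudgeCoord);
    return half4(picked.rgb, brush.a);
}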
In iOS 8 there was a problem with the division of floats in Metal preventing proper texture projection, which I solved.
Today I discovered that the texture projection on iOS 9 is broken again, although I'm not sure why.
The results of warping a texture on the CPU (with OpenCV) and on the GPU are not the same. You can see this on your iPhone if you run this example app (which already includes the fix for iOS 8) on iOS 9.
The expected CPU warp is colored red, while the GPU warp done by Metal is colored green, so where they overlap they are yellow. Ideally you should not see green or red, but only shades of yellow.
Can you:
confirm the problem exists on your end;
give any advice on anything that might be wrong?
The shader code is:
struct VertexInOut
{
    float4 position [[ position ]];
    float3 warpedTexCoords;
    float3 originalTexCoords;
};

vertex VertexInOut warpVertex(uint vid [[ vertex_id ]],
                              device float4 *positions [[ buffer(0) ]],
                              device float3 *texCoords [[ buffer(1) ]])
{
    VertexInOut v;
    v.position = positions[vid];

    // example homography
    simd::float3x3 h = {
        {1.03140473, 0.0778113901, 0.000169219566},
        {0.0342947133, 1.06025684, 0.000459250761},
        {-0.0364957005, -38.3375587, 0.818259298}
    };
    v.warpedTexCoords = h * texCoords[vid];
    v.originalTexCoords = texCoords[vid];
    return v;
}
fragment half4 warpFragment(VertexInOut inFrag [[ stage_in ]],
                            texture2d<half, access::sample> original [[ texture(0) ]],
                            texture2d<half, access::sample> cpuWarped [[ texture(1) ]])
{
    constexpr sampler s(coord::pixel, filter::linear, address::clamp_to_zero);
    half4 gpuWarpedPixel = half4(original.sample(s, inFrag.warpedTexCoords.xy * (1.0 / inFrag.warpedTexCoords.z)).r, 0, 0, 255);
    half4 cpuWarpedPixel = half4(0, cpuWarped.sample(s, inFrag.originalTexCoords.xy).r, 0, 255);
    return (gpuWarpedPixel + cpuWarpedPixel) * 0.5;
}
Do not ask me why, but if I multiply the warped coordinates by 1.00005 or any number close to 1.0, it is fixed (apart from very tiny details). See last commit in the example app repo.