Does anyone know a proper way to calculate the mean value of a buffer of random float numbers in a Metal kernel?
Dispatching work on the compute command encoder:
threadsPerGroup = MTLSizeMake(1, 1, inputTexture.arrayLength);
numThreadGroups = MTLSizeMake(1, 1, inputTexture.arrayLength / threadsPerGroup.depth);
[commandEncoder dispatchThreadgroups:numThreadGroups
threadsPerThreadgroup:threadsPerGroup];
Kernel code:
kernel void mean(texture2d_array<float, access::read> inTex [[ texture(0) ]],
device float *means [[ buffer(1) ]],
uint3 id [[ thread_position_in_grid ]]) {
if (id.x == 0 && id.y == 0) {
float mean = 0.0;
for (uint i = 0; i < inTex.get_width(); ++i) {
for (uint j = 0; j < inTex.get_height(); ++j) {
mean += inTex.read(uint2(i, j), id.z)[0];
}
}
float textureArea = inTex.get_width() * inTex.get_height();
mean /= textureArea;
means[id.z] = mean;
}
}
The data is stored in a texture of type texture2d_array with the R32Float pixel format.
If you can use an array of uint (instead of float) as your data source, I would suggest using the atomic fetch-and-modify functions (as described in the Metal Shading Language specification) to write atomically to a buffer.
Here's an example of a kernel function which takes an input buffer (data, an array of uint) and adds every element of the buffer into an atomic sum buffer (sum, a pointer to a uint):
kernel void sum(device uint *data [[ buffer(0) ]],
volatile device atomic_uint *sum [[ buffer(1) ]],
uint gid [[ thread_position_in_grid ]])
{
atomic_fetch_add_explicit(sum, data[gid], memory_order_relaxed);
}
In your swift file, you would set the buffers:
...
let data: [UInt32] = [1, 2, 3, 4]
let dataBuffer = device.makeBuffer(bytes: data, length: data.count * MemoryLayout<UInt32>.size, options: [])
commandEncoder.setBuffer(dataBuffer, offset: 0, at: 0)
var sum: UInt32 = 0
let sumBuffer = device.makeBuffer(bytes: &sum, length: MemoryLayout<UInt32>.size, options: [])
commandEncoder.setBuffer(sumBuffer, offset: 0, at: 1)
commandEncoder.endEncoding()
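Note that, before endEncoding(), the encoder also needs the compute pipeline state set and a dispatch encoded, one thread per element. A minimal sketch (not in the original answer), assuming pipelineState was built from the "sum" function and that data.count fits in a single threadgroup:
commandEncoder.setComputePipelineState(pipelineState)
let threadsPerGroup = MTLSize(width: data.count, height: 1, depth: 1)
let threadgroups = MTLSize(width: 1, height: 1, depth: 1)
commandEncoder.dispatchThreadgroups(threadgroups, threadsPerThreadgroup: threadsPerGroup)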
Commit, wait and then fetch the data from the GPU:
commandBuffer.commit()
commandBuffer.waitUntilCompleted()
let nsData = NSData(bytesNoCopy: sumBuffer.contents(),
length: sumBuffer.length,
freeWhenDone: false)
nsData.getBytes(&sum, length:sumBuffer.length)
let mean = Float(sum) / Float(data.count)
print(mean)
Alternatively, if your initial data source has to be an array of Float, you could use the vDSP_meanv function from the Accelerate framework, which is very fast for this kind of computation.
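A minimal sketch of the Accelerate approach (the array contents here are just for illustration):
import Accelerate

let values: [Float] = [1.5, 2.5, 3.5, 4.5]
var mean: Float = 0
vDSP_meanv(values, 1, &mean, vDSP_Length(values.count))
print(mean) // 3.0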
I hope that helped, cheers!
I have a requirement to pass color values to the shader as an array. Right now I'm passing them as a structure, RGBColors, receiving them as a structure, and it works fine.
But I want to receive the value as a float3 in the shader. As soon as I change it to float3 it acts weird and starts to flicker; it does not give me the proper color.
Here is the code I used to set the fragment buffer,
func setFragmentBuffer(_ values: [Float], at index: Int) {
let bufferValues = values
let datasize = 16 * values.count / 3
let colorBuffer = device.makeBuffer(bytes: bufferValues, length: datasize, options: [])
renderEncoder.setFragmentBuffer(colorBuffer, offset: 0, at: index)
}
Here is the structure for the RGBColors
struct RGBColors {
var r: Float
var g: Float
var b: Float
func floatBuffers() -> [Float] {
return [r,g,b]
}
}
From this structure I will create an array of Float values and set it to fragment buffer.
In the line let datasize = 16 * values.count / 3, I used the value 16 because, although the float data type in C++ is 4 bytes, float3 in simd is 16 bytes.
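As a quick check of those sizes (using SIMD3<Float>, the current spelling of simd's float3; the comments show what Swift reports on a 64-bit platform):
import simd

print(MemoryLayout<Float>.stride)           // 4
print(MemoryLayout<SIMD3<Float>>.size)      // 16 (padded out to four floats)
print(MemoryLayout<SIMD3<Float>>.stride)    // 16
print(MemoryLayout<SIMD3<Float>>.alignment) // 16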
And in the shader I'm implementing the method
fragment float4
singleShader(RasterizerData in [[stage_in]],
texture2d<half> sourceTexture [[ texture(0) ]],
const device float3 &rgbColor [[ buffer(1) ]])
{
constexpr sampler textureSampler (mag_filter::linear,
min_filter::linear);
// Sample the texture and return the color to colorSample
const half4 colorSample = sourceTexture.sample (textureSampler, in.textureCoordinate);
float4 outputColor;
float red = colorSample.r * rgbColor.r;
float green = colorSample.g * rgbColor.g;
float blue = colorSample.b * rgbColor.b;
outputColor = float4(red, green,blue, colorSample.a);
outputColor = float4((outputColor.rgb * param1 + param2) / 4, colorSample.a);
return outputColor;
}
Finally, I'm not getting the right output color.
How do I match the simd float3 data type to the Swift Float data type? Can someone suggest a way?
Edit:
I found a solution about how to create a MTLBuffer from float3 and here is the code:
func setFragmentBuffer(_ values: [float3], at index: Int) {
var valueBuffer = values
let bufferCreated = device.makeBuffer(length: MemoryLayout.size(ofValue: valueBuffer[0]) * 2 * valueBuffer.count , options: [])
let bufferPointer = bufferCreated.contents()
memcpy(bufferPointer, &valueBuffer, 16 * valueBuffer.count)
renderEncoder.setFragmentBuffer(bufferCreated, offset: 0, at: index)
}
This code works perfectly fine. But in the line let bufferCreated = device.makeBuffer(length: MemoryLayout.size(ofValue: valueBuffer[0]) * 2 * valueBuffer.count, options: []) you can see that the length has to be multiplied by a factor of 2 for the code to work.
Why is that multiplier needed to make it work? I don't understand this. Could someone explain?
I've written a compute shader that outputs to a texture. The coordinate system of the output texture is in pixels. I then have a basic vertex and fragment shader that should simply sample the value and return it, working in what I thought would be normalised coordinates. I expected the mapping between my programmatically drawn texture and the vertices of my rendering surface to match up.
The Compute Function
Can be summarized as
texture.write(color, uint2(x, y));
where x and y are integer pixel locations.
The Vertex Data
// position.x, position.y, texCoords.x, texCoords.y
let vertexData = [Float](arrayLiteral:
-1, 1, 0, 0,
-1, -1, 0, 1,
1, -1, 1, 1,
1, -1, 1, 1,
1, 1, 1, 0,
-1, 1, 0, 0)
The Metal Shader
typedef struct {
packed_float2 position;
packed_float2 texCoords;
} VertexIn;
typedef struct {
float4 position [[ position ]];
float2 texCoords;
} FragmentVertex;
vertex FragmentVertex simple_vertex(device VertexIn *vertexArray [[ buffer(0) ]],
uint vertexIndex [[ vertex_id ]])
{
VertexIn in = vertexArray[vertexIndex];
FragmentVertex out;
out.position = float4(in.position, 0.f, 1.f);
out.texCoords = in.texCoords;
return out;
}
fragment float4 simple_fragment(FragmentVertex in [[ stage_in ]],
texture2d<uint, access::sample> inputTexture [[ texture(0) ]],
sampler linearSampler [[ sampler(0) ]])
{
const uint2 imageSizeInPixels = uint2(360, 230);
float imageSizeInPixelsWidth = imageSizeInPixels.x;
float imageSizeInPixelsHeight = imageSizeInPixels.y;
float2 coords = float2(in.position.x / 360.f, in.position.y / 230.f);
float color = inputTexture.sample(linearSampler, in.texCoords).x / 255.f;
return float4(float3(color), 1.f);
}
The Sampler
let samplerDescriptor = MTLSamplerDescriptor()
samplerDescriptor.normalizedCoordinates = true
samplerDescriptor.minFilter = .linear
samplerDescriptor.magFilter = .linear
samplerDescriptor.sAddressMode = .clampToZero
samplerDescriptor.rAddressMode = .clampToZero
self.samplerState = self.metalDevice?.makeSamplerState(descriptor: samplerDescriptor)
In this experiment the only value that seems to work is coords, based on the normalized in.position value. in.texCoords always seems to be zero. Shouldn't the texCoords and position received by the vertex and fragment shaders be in the range of values defined in the vertex data?
My Vertex Buffer was right, but wrong
In the process of converting Obj-C code to Swift, I failed to copy the vertex data completely.
The Correct Copy
let byteCount = vertexData.count * MemoryLayout<Float>.size
let vertexBuffer = self.metalDevice?.makeBuffer(bytes: vertexData, length: byteCount, options: options)
The Source of my Woes
let vertexBuffer = self.metalDevice?.makeBuffer(bytes: vertexData, length: vertexData.count, options: options)
The Complete Vertex Buffer Creation
// Vertex data for a full-screen quad. The first two numbers in each row represent
// the x, y position of the point in normalized coordinates. The second two numbers
// represent the texture coordinates for the corresponding position.
let vertexData = [Float](arrayLiteral:
-1, 1, 0, 0,
-1, -1, 0, 1,
1, -1, 1, 1,
1, -1, 1, 1,
1, 1, 1, 0,
-1, 1, 0, 0)
// Create a buffer to hold the static vertex data
let options = MTLResourceOptions().union(.storageModeShared)
let byteCount = vertexData.count * MemoryLayout<Float>.size
let vertexBuffer = self.metalDevice?.makeBuffer(bytes: vertexData, length: byteCount, options: options)
vertexBuffer?.label = "Image Quad Vertices"
self.vertexBuffer = vertexBuffer
I am trying to load a model (from .OBJ) and draw it to the screen on iOS with MetalKit. The problem is that instead of my model, I get some random polygons...
Here is the code that is intended to load the model (the code is based on a tutorial from raywenderlich.com):
let allocator = MTKMeshBufferAllocator(device: device)
let vertexDescriptor = MDLVertexDescriptor()
let vertexLayout = MDLVertexBufferLayout()
vertexLayout.stride = sizeof(Vertex)
vertexDescriptor.layouts = [vertexLayout]
vertexDescriptor.attributes = [MDLVertexAttribute(name: MDLVertexAttributePosition, format: MDLVertexFormat.Float3, offset: 0, bufferIndex: 0),
MDLVertexAttribute(name: MDLVertexAttributeColor, format: MDLVertexFormat.Float4, offset: sizeof(float3), bufferIndex: 0),
MDLVertexAttribute(name: MDLVertexAttributeTextureCoordinate, format: MDLVertexFormat.Float2, offset: sizeof(float3)+sizeof(float4), bufferIndex: 0),
MDLVertexAttribute(name: MDLVertexAttributeNormal, format: MDLVertexFormat.Float3, offset: sizeof(float3)+sizeof(float4)+sizeof(float2), bufferIndex: 0)]
var error: NSError?
let asset = MDLAsset(URL: path, vertexDescriptor: vertexDescriptor, bufferAllocator: allocator, preserveTopology: true, error: &error)
if error != nil{
print(error)
return nil
}
let model = asset.objectAtIndex(0) as! MDLMesh
let mesh = try MTKMesh(mesh: model, device: device)
And here is my drawing method:
func render(commandQueue: MTLCommandQueue, pipelineState: MTLRenderPipelineState,drawable: CAMetalDrawable,projectionMatrix: float4x4,modelViewMatrix: float4x4, clearColor: MTLClearColor){
dispatch_semaphore_wait(bufferProvider.availibleResourcesSemaphore, DISPATCH_TIME_FOREVER)
let renderPassDescriptor = MTLRenderPassDescriptor()
renderPassDescriptor.colorAttachments[0].texture = drawable.texture
renderPassDescriptor.colorAttachments[0].loadAction = .Clear
renderPassDescriptor.colorAttachments[0].clearColor = clearColor
renderPassDescriptor.colorAttachments[0].storeAction = .Store
let commandBuffer = commandQueue.commandBuffer()
commandBuffer.addCompletedHandler { (buffer) in
dispatch_semaphore_signal(self.bufferProvider.availibleResourcesSemaphore)
}
let renderEncoder = commandBuffer.renderCommandEncoderWithDescriptor(renderPassDescriptor)
renderEncoder.setCullMode(MTLCullMode.None)
renderEncoder.setRenderPipelineState(pipelineState)
renderEncoder.setVertexBuffer(vertexBuffer, offset: 0, atIndex: 0)
renderEncoder.setFragmentTexture(texture, atIndex: 0)
if let samplerState = samplerState{
renderEncoder.setFragmentSamplerState(samplerState, atIndex: 0)
}
var nodeModelMatrix = self.modelMatrix()
nodeModelMatrix.multiplyLeft(modelViewMatrix)
uniformBuffer = bufferProvider.nextUniformsBuffer(projectionMatrix, modelViewMatrix: nodeModelMatrix, light: light)
renderEncoder.setVertexBuffer(self.uniformBuffer, offset: 0, atIndex: 1)
renderEncoder.setFragmentBuffer(uniformBuffer, offset: 0, atIndex: 1)
if indexBuffer != nil{
renderEncoder.drawIndexedPrimitives(.Triangle, indexCount: self.indexCount, indexType: self.indexType, indexBuffer: self.indexBuffer!, indexBufferOffset: 0)
}else{
renderEncoder.drawPrimitives(.Triangle, vertexStart: 0, vertexCount: vertexCount, instanceCount: vertexCount/3)
}
renderEncoder.endEncoding()
commandBuffer.presentDrawable(drawable)
commandBuffer.commit()
}
Here is my vertex shader:
struct VertexIn{
packed_float3 position;
packed_float4 color;
packed_float2 texCoord;
packed_float3 normal;
};
struct VertexOut{
float4 position [[position]];
float3 fragmentPosition;
float4 color;
float2 texCoord;
float3 normal;
};
struct Light{
packed_float3 color;
float ambientIntensity;
packed_float3 direction;
float diffuseIntensity;
float shininess;
float specularIntensity;
};
struct Uniforms{
float4x4 modelMatrix;
float4x4 projectionMatrix;
Light light;
};
vertex VertexOut basic_vertex(
const device VertexIn* vertex_array [[ buffer(0) ]],
const device Uniforms& uniforms [[ buffer(1) ]],
unsigned int vid [[ vertex_id ]]) {
float4x4 mv_Matrix = uniforms.modelMatrix;
float4x4 proj_Matrix = uniforms.projectionMatrix;
VertexIn VertexIn = vertex_array[vid];
VertexOut VertexOut;
VertexOut.position = proj_Matrix * mv_Matrix * float4(VertexIn.position,1);
VertexOut.fragmentPosition = (mv_Matrix * float4(VertexIn.position,1)).xyz;
VertexOut.color = VertexIn.color;
VertexOut.texCoord = VertexIn.texCoord;
VertexOut.normal = (mv_Matrix * float4(VertexIn.normal, 0.0)).xyz;
return VertexOut;
}
And here is what it looks like:
(link to screenshot)
Actually, I have another class, written entirely by me, that loads models. It works fine, but the problem is that it does not use indexing, so if I try to load models more complex than a low-poly sphere, the GPU crashes... Anyway, I tried to modify it to use indexing and got the same result... Then I added hardcoded indices for testing and got a really weird result: when I had 3 indices it drew a triangle, when I added 3 more it drew the same triangle, and after 3 more vertices it drew 2 triangles...
Edit:
Here is my Vertex structure:
struct Vertex:Equatable{
var x,y,z: Float
var r,g,b,a: Float
var s,t: Float
var nX,nY,nZ:Float
func floatBuffer()->[Float]{
return [x,y,z,r,g,b,a,s,t,nX,nY,nZ]
}
}
I see a couple of potential issues here.
1) Your vertex descriptor does not map exactly to your Vertex struct. The position variables (x, y, z) occupy 12 bytes, so the color variables start at an offset of 12 bytes. This matches the packed_float3 position field in your shader's VertexIn struct, but in the vertex descriptor you provide to Model I/O, you use sizeof(float3), which is 16, as the offset of the color attribute. Because you're packing the position field, you should use sizeof(Float) * 3 for this value instead, and likewise in the subsequent offsets. I suspect this is the main cause of your problems.
More generally, it's a good idea to use strideof rather than sizeof to account for alignment, though by chance it wouldn't make a difference here. (A sketch of the corrected descriptor follows after point 2.)
2) Model I/O is allowed to use a single MTLBuffer to store both vertices and indices, so you should use the offset member of each MTKMeshBuffer when setting the vertex buffer or specifying the index buffer in each draw call, rather than assuming the offsets to be 0.
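For concreteness, here are two sketches (not from the original answer). The first spells out the packed offsets for point 1, in the same Swift dialect as the question; the offsets simply follow the 3 + 4 + 2 + 3 float layout of the Vertex struct:
let vertexDescriptor = MDLVertexDescriptor()
let vertexLayout = MDLVertexBufferLayout()
vertexLayout.stride = sizeof(Vertex) // 12 floats = 48 bytes
vertexDescriptor.layouts = [vertexLayout]
vertexDescriptor.attributes = [
    MDLVertexAttribute(name: MDLVertexAttributePosition, format: MDLVertexFormat.Float3, offset: 0, bufferIndex: 0),
    MDLVertexAttribute(name: MDLVertexAttributeColor, format: MDLVertexFormat.Float4, offset: sizeof(Float) * 3, bufferIndex: 0),
    MDLVertexAttribute(name: MDLVertexAttributeTextureCoordinate, format: MDLVertexFormat.Float2, offset: sizeof(Float) * 7, bufferIndex: 0),
    MDLVertexAttribute(name: MDLVertexAttributeNormal, format: MDLVertexFormat.Float3, offset: sizeof(Float) * 9, bufferIndex: 0)]
The second illustrates point 2: take the vertex and index offsets from the MTKMesh rather than assuming 0 (mesh is the MTKMesh created in the question; the rest is hypothetical):
let meshVertexBuffer = mesh.vertexBuffers[0]
renderEncoder.setVertexBuffer(meshVertexBuffer.buffer, offset: meshVertexBuffer.offset, atIndex: 0)
for submesh in mesh.submeshes {
    renderEncoder.drawIndexedPrimitives(submesh.primitiveType,
                                        indexCount: submesh.indexCount,
                                        indexType: submesh.indexType,
                                        indexBuffer: submesh.indexBuffer.buffer,
                                        indexBufferOffset: submesh.indexBuffer.offset)
}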
As mentioned in Apple's documentation, the shading language's texture2d can have an int component type. I have tried to use a texture2d of int type as a shader function parameter, but the write method of texture2d did not work.
kernel void dummy(texture2d<int, access::write> outTexture [[ texture(0) ]],
uint2 gid [[ thread_position_in_grid ]])
{
outTexture.write( int4( 2, 4, 6, 8 ), gid );
}
However, if I replace int with float, it works.
kernel void dummy(texture2d<float, access::write> outTexture [[ texture(0) ]],
uint2 gid [[ thread_position_in_grid ]])
{
outTexture.write( float4( 1.0, 0, 0, 1.0 ), gid );
}
Can other texture2d types, such as texture2d of int, texture2d of short, and so on, be used as shader function parameters, and how should they be used? Thanks for reviewing my question.
The related host code:
MTLTextureDescriptor *desc = [MTLTextureDescriptor texture2DDescriptorWithPixelFormat:MTLPixelFormatRGBA8Unorm
                                                                                width:w
                                                                               height:h
                                                                            mipmapped:NO];
desc.usage = MTLTextureUsageShaderWrite;
id<MTLTexture> texture = [device newTextureWithDescriptor:desc];
[commandEncoder setTexture:texture atIndex:0];
The code to print the output computed by the GPU (w and h represent the width and height of the texture, respectively):
uint8_t* imageBytes = malloc(w*h*4);
memset( imageBytes, 0, w*h*4 );
MTLRegion region = MTLRegionMake2D(0, 0, [texture width], [texture height]);
[texture getBytes:imageBytes bytesPerRow:[texture width]*4 fromRegion:region mipmapLevel:0];
for( int j = 0; j < h; j++ )
{
printf("%3d: ", j);
for( int i = 0; i < w*pixel_size; i++ )
{
printf(" %3d",imageBytes[j*w*pixel_size+i] );
}
printf("\n")
}
The problem is that the pixel format you used to create this texture (MTLPixelFormatRGBA8Unorm) is normalized, meaning that the expected pixel value range is 0.0-1.0. For normalized pixel types, the required data type for reading or writing to this texture within a Metal kernel is float or half-float.
In order to write to a texture with integers, you must select an integer pixel format. Here are all of the available formats:
https://developer.apple.com/documentation/metal/mtlpixelformat
The Metal Shading Language Guide states that:
Note: If T is int or short, the data associated with the texture must use a signed integer format. If T is uint or ushort, the data associated with the texture must use an unsigned integer format.
All you have to do is make sure the texture you write to in the API (host code) matches what you have in the kernel function. Alternatively, you can also cast the int values into float before writing to the outTexture.
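For example, host code that matches texture2d<int, access::write> might create the texture with a signed-integer format. This is only a sketch; MTLPixelFormatRGBA8Sint is just one of the signed-integer choices, and w/h are the width and height from the question:
MTLTextureDescriptor *desc =
    [MTLTextureDescriptor texture2DDescriptorWithPixelFormat:MTLPixelFormatRGBA8Sint
                                                       width:w
                                                      height:h
                                                   mipmapped:NO];
desc.usage = MTLTextureUsageShaderWrite;
id<MTLTexture> texture = [device newTextureWithDescriptor:desc];
// With an 8-bit signed integer format, int4(2, 4, 6, 8) is stored as the raw
// values 2, 4, 6, 8 rather than being normalized.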
I have a MTLTexture containing 16bit unsigned integers (MTLPixelFormatR16Uint). The values range from about 7000 to 20000, with 0 being used as a 'nodata' value, which is why it is skipped in the code below. I'd like to find the minimum and maximum values so I can rescale these values between 0-255. Ultimately I'll be looking to base the minimum and maximum values on a histogram of the data (it has some outliers), but for now I'm stuck on simply extracting the min/max.
I can read the data from the GPU to CPU and pull the min/max values out but would prefer to perform this task on the GPU.
First attempt
The command encoder is dispatched with 16x16 threads per threadgroup; the number of threadgroups is based on the texture size (e.g. width = textureWidth / 16, height = textureHeight / 16).
typedef struct {
atomic_uint min;
atomic_uint max;
} BandMinMax;
kernel void minMax(texture2d<ushort, access::read> band1 [[texture(0)]],
device BandMinMax &out [[buffer(0)]],
uint2 gid [[thread_position_in_grid]])
{
ushort value = band1.read(gid).r;
if (value != 0) {
uint currentMin = atomic_load_explicit(&out.min, memory_order_relaxed);
uint currentMax = atomic_load_explicit(&out.max, memory_order_relaxed);
if (value > currentMax) {
atomic_store_explicit(&out.max, value, memory_order_relaxed);
}
if (value < currentMin) {
atomic_store_explicit(&out.min, value, memory_order_relaxed);
}
}
}
From this I get a minimum and maximum value, but for the same dataset the min and max will often return different values. Fairly certain this is the min and max from a single thread when there are multiple threads running.
Second attempt
Building on the previous attempt, this time I'm storing the individual min/max values from each thread, all 256 (16x16).
kernel void minMax(texture2d<ushort, access::read> band1 [[texture(0)]],
device BandMinMax *out [[buffer(0)]],
uint2 gid [[thread_position_in_grid]],
uint tid [[ thread_index_in_threadgroup ]])
{
ushort value = band1.read(gid).r;
if (value != 0) {
uint currentMin = atomic_load_explicit(&out[tid].min, memory_order_relaxed);
uint currentMax = atomic_load_explicit(&out[tid].max, memory_order_relaxed);
if (value > currentMax) {
atomic_store_explicit(&out[tid].max, value, memory_order_relaxed);
}
if (value < currentMin) {
atomic_store_explicit(&out[tid].min, value, memory_order_relaxed);
}
}
}
This returns an array containing 256 sets of min/max values. From these I guess I could find the lowest of the minimum values, but this seems like a poor approach. Would appreciate a pointer in the right direction, thanks!
The Metal Shading Language has atomic compare-and-swap functions you can use to compare the existing value at a memory location with an expected value and replace it with a new value if they compare equal. With these, you can create a set of atomic compare-and-replace-if-[greater|less]-than operations:
static void atomic_uint_exchange_if_less_than(volatile device atomic_uint *current, uint candidate)
{
uint val;
do {
val = *((device uint *)current);
} while ((candidate < val || val == 0) && !atomic_compare_exchange_weak_explicit(current,
&val,
candidate,
memory_order_relaxed,
memory_order_relaxed));
}
static void atomic_uint_exchange_if_greater_than(volatile device atomic_uint *current, uint candidate)
{
uint val;
do {
val = *((device uint *)current);
} while (candidate > val && !atomic_compare_exchange_weak_explicit(current,
&val,
candidate,
memory_order_relaxed,
memory_order_relaxed));
}
To apply these, you might create a buffer that contains one interleaved min, max pair per threadgroup. Then, in the kernel function, read from the texture and conditionally write the min and max values:
kernel void min_max_per_threadgroup(texture2d<ushort, access::read> texture [[texture(0)]],
device uint *mapBuffer [[buffer(0)]],
uint2 tpig [[thread_position_in_grid]],
uint2 tgpig [[threadgroup_position_in_grid]],
uint2 tgpg [[threadgroups_per_grid]])
{
ushort val = texture.read(tpig).r;
device atomic_uint *atomicBuffer = (device atomic_uint *)mapBuffer;
atomic_uint_exchange_if_less_than(atomicBuffer + ((tgpig[1] * tgpg[0] + tgpig[0]) * 2),
val);
atomic_uint_exchange_if_greater_than(atomicBuffer + ((tgpig[1] * tgpg[0] + tgpig[0]) * 2) + 1,
val);
}
Finally, run a separate kernel to reduce over this buffer and collect the final min, max values across the entire texture:
kernel void min_max_reduce(constant uint *mapBuffer [[buffer(0)]],
device uint *reduceBuffer [[buffer(1)]],
uint2 tpig [[thread_position_in_grid]])
{
uint minv = mapBuffer[tpig[0] * 2];
uint maxv = mapBuffer[tpig[0] * 2 + 1];
device atomic_uint *atomicBuffer = (device atomic_uint *)reduceBuffer;
atomic_uint_exchange_if_less_than(atomicBuffer, minv);
atomic_uint_exchange_if_greater_than(atomicBuffer + 1, maxv);
}
Of course, you can only reduce over the total allowed thread execution width of the device (~256), so you may need to do the reduction in multiple passes, with each one reducing the size of the data to be operated on by a factor of the maximum thread execution width.
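(A host-side note that is not part of the original answer.) The map buffer holds one interleaved (min, max) pair of 32-bit values per threadgroup, and the exchange-if-less-than helper above treats 0 as "not yet written", so zero-filling it is a sufficient initial state. A Swift sizing sketch, assuming threadgroupsPerGrid has already been computed:
let pairCount = threadgroupsPerGrid.width * threadgroupsPerGrid.height
let zeros = [UInt32](repeating: 0, count: pairCount * 2)
let mapBuffer = device.makeBuffer(bytes: zeros,
                                  length: zeros.count * MemoryLayout<UInt32>.stride,
                                  options: .storageModeShared)
// The kernel's "val == 0" test lets the first thread that touches a slot
// install its value as the initial minimum; a zero max slot is beaten by
// any nonzero value.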
Disclaimer: This may not be the best technique, but it does appear to be correct in my limited testing of an OS X implementation. It was marginally faster than a naive CPU implementation on a 256x256 texture on Intel Iris Pro, but substantially slower on an Nvidia GT 750M (because of dispatch overhead).
Very interesting discussion.
I am going to share my Metal code, which may help your understanding.
kernel void grayscale_minmax(texture2d<half, access::read> inTexture [[texture(0)]],
texture2d<half, access::write> outTexture [[texture(1)]],
device atomic_uint *min_max [[buffer(0)]],
uint2 gid [[thread_position_in_grid]],
uint tid [[thread_index_in_threadgroup]],
uint2 tsz [[threads_per_threadgroup]])
{
// local_atomic[0]: min value, local_atomic[1]: max value
threadgroup atomic_uint local_count, local_atomic[2];
if (tid == 0) { // initialize thread group local vars
atomic_store_explicit(&local_atomic[0], 255, memory_order_relaxed);
atomic_store_explicit(&local_atomic[1], 0, memory_order_relaxed);
atomic_store_explicit(&local_count, 0, memory_order_relaxed);
}
if ((gid.x >= outTexture.get_width()) || (gid.y >= outTexture.get_height())) {
atomic_fetch_add_explicit(&local_count, 1, memory_order_relaxed);
uint count = atomic_load_explicit(&local_count, memory_order_relaxed);
// when threadgroup calculation ends up, update device vars
if (count >= (tsz.x*tsz.y)) {
uint threadgroup_min_val = atomic_load_explicit(&local_atomic[0], memory_order_relaxed);
uint threadgroup_max_val = atomic_load_explicit(&local_atomic[1], memory_order_relaxed);
atomic_fetch_min_explicit(&min_max[0], threadgroup_min_val, memory_order_relaxed);
atomic_fetch_max_explicit(&min_max[1], threadgroup_max_val, memory_order_relaxed);
}
return;
}
// true color to gray scaled
const half4 inColor = inTexture.read(gid);
const half outColor = dot(inColor.rgb, half3(0.299, 0.587, 0.114));
const uint intColor = uint(clamp(outColor, 0.h, 1.h)*255.h);
// wait for other threads in the thread group stopping work
threadgroup_barrier(mem_flags::mem_threadgroup);
// update local variables
atomic_fetch_min_explicit(&local_atomic[0], intColor, memory_order_relaxed);
atomic_fetch_max_explicit(&local_atomic[1], intColor, memory_order_relaxed);
atomic_fetch_add_explicit(&local_count, 1, memory_order_relaxed);
uint count = atomic_load_explicit(&local_count, memory_order_relaxed);
// when threadgroup calculation ends up, update device vars
if (count >= (tsz.x*tsz.y)) {
uint threadgroup_min_val = atomic_load_explicit(&local_atomic[0], memory_order_relaxed);
uint threadgroup_max_val = atomic_load_explicit(&local_atomic[1], memory_order_relaxed);
atomic_fetch_min_explicit(&min_max[0], threadgroup_min_val, memory_order_relaxed);
atomic_fetch_max_explicit(&min_max[1], threadgroup_max_val, memory_order_relaxed);
}
outTexture.write(half4(half3(outColor), 1.h), gid);
}
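(A hypothetical host-side note, not from the original post.) Since the kernel folds each threadgroup's result into min_max with atomic_fetch_min/atomic_fetch_max, the device buffer has to start from values the first update can beat, e.g. min = 255 and max = 0 for this 8-bit grayscale range. A Swift sketch:
let seed: [UInt32] = [255, 0]   // [initial min, initial max]
let minMaxBuffer = device.makeBuffer(bytes: seed,
                                     length: seed.count * MemoryLayout<UInt32>.stride,
                                     options: .storageModeShared)
// Bind this buffer at buffer(0) on the compute command encoder.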