I have an MTLTexture containing 16-bit unsigned integers (MTLPixelFormatR16Uint). The values range from about 7000 to 20000, with 0 being used as a 'nodata' value, which is why it is skipped in the code below. I'd like to find the minimum and maximum values so I can rescale the data to the range 0-255. Ultimately I'll base the minimum and maximum on a histogram of the data (it has some outliers), but for now I'm stuck on simply extracting the min/max.
I can read the data from the GPU to CPU and pull the min/max values out but would prefer to perform this task on the GPU.
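For reference, the rescale step described above is a linear map from [min, max] to [0, 255]. A minimal CPU-side C++ sketch follows; the function name `rescaleTo8Bit` is hypothetical, and passing 0 through unchanged is an assumption about how 'nodata' should be handled, not something stated above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper: linearly map a 16-bit sample from [minV, maxV] to [0, 255].
// Assumption: 0 is the 'nodata' value and is passed through as 0.
inline std::uint8_t rescaleTo8Bit(std::uint16_t value, std::uint16_t minV, std::uint16_t maxV)
{
    assert(minV < maxV);       // a degenerate range cannot be rescaled
    if (value == 0) return 0;  // keep 'nodata' as 0
    // Clamp first so outliers outside [minV, maxV] saturate instead of wrapping.
    const std::uint32_t clamped =
        std::min<std::uint32_t>(std::max<std::uint32_t>(value, minV), maxV);
    const std::uint32_t range = static_cast<std::uint32_t>(maxV) - minV;
    // Rounded integer form of (v - min) * 255 / (max - min).
    return static_cast<std::uint8_t>(((clamped - minV) * 255u + range / 2u) / range);
}
```

Note that a sample equal to the minimum also maps to 0, so the output alone cannot distinguish it from 'nodata'.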
First attempt
The command encoder is dispatched with 16x16 threads per threadgroup; the number of threadgroups is based on the texture size (e.g. width = textureWidth / 16, height = textureHeight / 16).
typedef struct {
atomic_uint min;
atomic_uint max;
} BandMinMax;
kernel void minMax(texture2d<ushort, access::read> band1 [[texture(0)]],
device BandMinMax &out [[buffer(0)]],
uint2 gid [[thread_position_in_grid]])
{
ushort value = band1.read(gid).r;
if (value != 0) {
uint currentMin = atomic_load_explicit(&out.min, memory_order_relaxed);
uint currentMax = atomic_load_explicit(&out.max, memory_order_relaxed);
if (value > currentMax) {
atomic_store_explicit(&out.max, value, memory_order_relaxed);
}
if (value < currentMin) {
atomic_store_explicit(&out.min, value, memory_order_relaxed);
}
}
}
From this I get a minimum and maximum value, but for the same dataset the min and max often differ from run to run. I'm fairly certain this is the min and max from a single thread when there are multiple threads running.
Second attempt
Building on the previous attempt, this time I'm storing the individual min/max values from each thread, all 256 (16x16).
kernel void minMax(texture2d<ushort, access::read> band1 [[texture(0)]],
device BandMinMax *out [[buffer(0)]],
uint2 gid [[thread_position_in_grid]],
uint tid [[ thread_index_in_threadgroup ]])
{
ushort value = band1.read(gid).r;
if (value != 0) {
uint currentMin = atomic_load_explicit(&out[tid].min, memory_order_relaxed);
uint currentMax = atomic_load_explicit(&out[tid].max, memory_order_relaxed);
if (value > currentMax) {
atomic_store_explicit(&out[tid].max, value, memory_order_relaxed);
}
if (value < currentMin) {
atomic_store_explicit(&out[tid].min, value, memory_order_relaxed);
}
}
}
This returns an array containing 256 sets of min/max values. From these I guess I could find the lowest of the minimum values, but this seems like a poor approach. Would appreciate a pointer in the right direction, thanks!
The Metal Shading Language has atomic compare-and-swap functions that compare the existing value at a memory location with an expected value and replace the value at that location only if the two compare equal. With these, you can create a set of atomic compare-and-replace-if-[greater|less]-than operations:
static void atomic_uint_exchange_if_less_than(volatile device atomic_uint *current, uint candidate)
{
uint val;
do {
val = atomic_load_explicit(current, memory_order_relaxed); // atomic read instead of a plain pointer cast
} while ((candidate < val || val == 0) && !atomic_compare_exchange_weak_explicit(current,
&val,
candidate,
memory_order_relaxed,
memory_order_relaxed));
}
static void atomic_uint_exchange_if_greater_than(volatile device atomic_uint *current, uint candidate)
{
uint val;
do {
val = atomic_load_explicit(current, memory_order_relaxed); // atomic read instead of a plain pointer cast
} while (candidate > val && !atomic_compare_exchange_weak_explicit(current,
&val,
candidate,
memory_order_relaxed,
memory_order_relaxed));
}
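The same compare-and-swap pattern can be tried out on the CPU. The following is a minimal C++ sketch using `std::atomic`, not Metal code; the names `atomicStoreMin` and `atomicStoreMax` are hypothetical. Unlike the Metal helpers above, it assumes the minimum slot is initialized to the largest representable value rather than using 0 as an "unset" marker.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Store 'candidate' only if it is smaller than the value currently held.
// Assumption: 'current' starts at UINT32_MAX, so no "unset" sentinel is needed.
inline void atomicStoreMin(std::atomic<std::uint32_t>& current, std::uint32_t candidate)
{
    std::uint32_t observed = current.load(std::memory_order_relaxed);
    // When the exchange fails, 'observed' is refreshed with the latest value,
    // so the comparison is repeated against what another thread just wrote.
    while (candidate < observed &&
           !current.compare_exchange_weak(observed, candidate,
                                          std::memory_order_relaxed,
                                          std::memory_order_relaxed)) {
    }
}

// Store 'candidate' only if it is larger than the value currently held.
inline void atomicStoreMax(std::atomic<std::uint32_t>& current, std::uint32_t candidate)
{
    std::uint32_t observed = current.load(std::memory_order_relaxed);
    while (candidate > observed &&
           !current.compare_exchange_weak(observed, candidate,
                                          std::memory_order_relaxed,
                                          std::memory_order_relaxed)) {
    }
}
```

The loop exits either because the candidate no longer improves on the stored value or because the exchange succeeded, which is what makes the update safe when several threads race on the same slot.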
To apply these, you might create a buffer that contains one interleaved min, max pair per threadgroup. Then, in the kernel function, read from the texture and conditionally write the min and max values:
kernel void min_max_per_threadgroup(texture2d<ushort, access::read> texture [[texture(0)]],
device uint *mapBuffer [[buffer(0)]],
uint2 tpig [[thread_position_in_grid]],
uint2 tgpig [[threadgroup_position_in_grid]],
uint2 tgpg [[threadgroups_per_grid]])
{
ushort val = texture.read(tpig).r;
if (val == 0) return; // skip 'nodata'; the min helper also treats 0 as "not yet written"
device atomic_uint *atomicBuffer = (device atomic_uint *)mapBuffer;
atomic_uint_exchange_if_less_than(atomicBuffer + ((tgpig[1] * tgpg[0] + tgpig[0]) * 2),
val);
atomic_uint_exchange_if_greater_than(atomicBuffer + ((tgpig[1] * tgpg[0] + tgpig[0]) * 2) + 1,
val);
}
Finally, run a separate kernel to reduce over this buffer and collect the final min, max values across the entire texture:
kernel void min_max_reduce(constant uint *mapBuffer [[buffer(0)]],
device uint *reduceBuffer [[buffer(1)]],
uint2 tpig [[thread_position_in_grid]])
{
uint minv = mapBuffer[tpig[0] * 2];
uint maxv = mapBuffer[tpig[0] * 2 + 1];
device atomic_uint *atomicBuffer = (device atomic_uint *)reduceBuffer;
if (minv != 0) atomic_uint_exchange_if_less_than(atomicBuffer, minv); // 0 marks a slot that was never written
atomic_uint_exchange_if_greater_than(atomicBuffer + 1, maxv);
}
Of course, you can only reduce over the total allowed thread execution width of the device (~256), so you may need to do the reduction in multiple passes, with each one reducing the size of the data to be operated on by a factor of the maximum thread execution width.
Disclaimer: This may not be the best technique, but it does appear to be correct in my limited testing of an OS X implementation. It was marginally faster than a naive CPU implementation on a 256x256 texture on Intel Iris Pro, but substantially slower on an Nvidia GT 750M (because of dispatch overhead).
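To make the two-pass structure concrete, here is a CPU-side C++ sketch of the same idea: one partial min/max per tile (standing in for a threadgroup), followed by a reduction over the partial results. It is not Metal code. The names `MinMax`, `minMaxPerTile` and `reduceMinMax` are hypothetical, and it assumes 0 is 'nodata' and that the minimum starts at the largest representable value.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

struct MinMax {
    std::uint32_t min = std::numeric_limits<std::uint32_t>::max(); // identity for min
    std::uint32_t max = 0;                                         // identity for max
};

// Pass 1: one MinMax per tile x tile block, skipping the 0 'nodata' value.
inline std::vector<MinMax> minMaxPerTile(const std::vector<std::uint16_t>& pixels,
                                         std::size_t width, std::size_t height,
                                         std::size_t tile)
{
    assert(tile > 0 && pixels.size() == width * height);
    const std::size_t tilesX = (width + tile - 1) / tile;  // round up for partial tiles
    const std::size_t tilesY = (height + tile - 1) / tile;
    std::vector<MinMax> partial(tilesX * tilesY);
    for (std::size_t y = 0; y < height; ++y) {
        for (std::size_t x = 0; x < width; ++x) {
            const std::uint32_t v = pixels[y * width + x];
            if (v == 0) continue;
            MinMax& m = partial[(y / tile) * tilesX + (x / tile)];
            m.min = std::min(m.min, v);
            m.max = std::max(m.max, v);
        }
    }
    return partial;
}

// Pass 2: fold the per-tile results into one MinMax for the whole image.
inline MinMax reduceMinMax(const std::vector<MinMax>& partial)
{
    MinMax result;
    for (const MinMax& m : partial) {
        result.min = std::min(result.min, m.min);
        result.max = std::max(result.max, m.max);
    }
    return result;
}
```

Because an all-'nodata' tile keeps the identity values, it does not disturb the final reduction.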
Very interesting discussion.
Here is my Metal code, which may help your understanding.
kernel void grayscale_minmax(texture2d<half, access::read> inTexture [[texture(0)]],
texture2d<half, access::write> outTexture [[texture(1)]],
device atomic_uint *min_max [[buffer(0)]],
uint2 gid [[thread_position_in_grid]],
uint tid [[thread_index_in_threadgroup]],
uint2 tsz [[threads_per_threadgroup]])
{
// local_atomic[0]: min value, local_atomic[1]: max value
threadgroup atomic_uint local_count, local_atomic[2];
if (tid == 0) { // initialize thread group local vars
atomic_store_explicit(&local_atomic[0], 255, memory_order_relaxed);
atomic_store_explicit(&local_atomic[1], 0, memory_order_relaxed);
atomic_store_explicit(&local_count, 0, memory_order_relaxed);
}
if ((gid.x >= outTexture.get_width()) || (gid.y >= outTexture.get_height())) {
atomic_fetch_add_explicit(&local_count, 1, memory_order_relaxed);
uint count = atomic_load_explicit(&local_count, memory_order_relaxed);
// once every thread in the threadgroup has been counted, update the device variables
if (count >= (tsz.x*tsz.y)) {
uint threadgroup_min_val = atomic_load_explicit(&local_atomic[0], memory_order_relaxed);
uint threadgroup_max_val = atomic_load_explicit(&local_atomic[1], memory_order_relaxed);
atomic_fetch_min_explicit(&min_max[0], threadgroup_min_val, memory_order_relaxed);
atomic_fetch_max_explicit(&min_max[1], threadgroup_max_val, memory_order_relaxed);
}
return;
}
// convert true color to grayscale
const half4 inColor = inTexture.read(gid);
const half outColor = dot(inColor.rgb, half3(0.299, 0.587, 0.114));
const uint intColor = uint(clamp(outColor, 0.h, 1.h)*255.h);
// wait here so that thread 0's initialization of the threadgroup variables is visible
threadgroup_barrier(mem_flags::mem_threadgroup);
// update local variables
atomic_fetch_min_explicit(&local_atomic[0], intColor, memory_order_relaxed);
atomic_fetch_max_explicit(&local_atomic[1], intColor, memory_order_relaxed);
atomic_fetch_add_explicit(&local_count, 1, memory_order_relaxed);
uint count = atomic_load_explicit(&local_count, memory_order_relaxed);
// once every thread in the threadgroup has been counted, update the device variables
if (count >= (tsz.x*tsz.y)) {
uint threadgroup_min_val = atomic_load_explicit(&local_atomic[0], memory_order_relaxed);
uint threadgroup_max_val = atomic_load_explicit(&local_atomic[1], memory_order_relaxed);
atomic_fetch_min_explicit(&min_max[0], threadgroup_min_val, memory_order_relaxed);
atomic_fetch_max_explicit(&min_max[1], threadgroup_max_val, memory_order_relaxed);
}
outTexture.write(half4(half3(outColor), 1.h), gid);
}
Related
Is the initial value of a threadgroup atomic_uint zero? I didn't see anything in the MSL spec. Or do I need to do something like the following to initialize it to zero?
threadgroup atomic_uint flags;
if(localIndex == 0) {
atomic_store_explicit(&flags, 0, memory_order_relaxed);
}
threadgroup_barrier(mem_flags::mem_threadgroup);
Local threadgroup memory is not initialized by default, so you will have to do something like you've done:
kernel void compute_kernel(texture2d<float, access::read> inTexture [[texture(0)]],
volatile device atomic_uint* outBins [[buffer(0)]],
uint2 gid [[thread_position_in_grid]],
uint2 threadIndex [[thread_position_in_threadgroup]])
{
threadgroup atomic_uint flags;
if (all(threadIndex == 0)) { // Only do the initialization on one thread
atomic_store_explicit(&flags, 0, memory_order_relaxed);
}
threadgroup_barrier(mem_flags::mem_threadgroup); // Make all threads wait until initialization is done
// flags is now initialized and usable
}
I have sample Metal code that I'm trying to port to iOS. Is there an iOS-compatible value that I can use for bt601?
#include <metal_stdlib>
#include "utilities.h" // error not found
using namespace metal;
kernel void laplace(texture2d<half, access::read> inTexture [[ texture(0) ]],
texture2d<half, access::read_write> outTexture [[ texture(1) ]],
uint2 gid [[ thread_position_in_grid ]]) {
constexpr int kernel_size = 3;
constexpr int radius = kernel_size / 2;
half3x3 laplace_kernel = half3x3(0, 1, 0,
1, -4, 1,
0, 1, 0);
half4 acc_color(0, 0, 0, 0);
for (int j = 0; j <= kernel_size - 1; j++) {
for (int i = 0; i <= kernel_size - 1; i++) {
uint2 textureIndex(gid.x + (i - radius), gid.y + (j - radius));
acc_color += laplace_kernel[i][j] * inTexture.read(textureIndex).rgba;
}
}
half value = dot(acc_color.rgb, bt601); //bt601 not defined
half4 gray_color(value, value, value, 1.0);
outTexture.write(gray_color, gid);
}
It seems that the intention here is simply to derive a single "luminance" value from the RGB output of the kernel. In that case, bt601 would be a three-element vector whose components are the desired weights of the respective channels, summing to 1.0.
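Borrowing values from Rec. 601, we might define it like this (if the other operand of dot is a half3, as in the question, declare it as half3 instead, since Metal does not implicitly convert between vector types):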
float3 bt601(0.299f, 0.587f, 0.114f);
This is certainly a common choice. Another popular choice uses coefficients found in the Rec. 709 standard. That would look like this:
float3 bt709(0.212671f, 0.715160f, 0.072169f);
Both of these vectors will give you a single gray value that approximates the brightness of a linear sRGB color. Whether either of them is "correct" depends on the provenance of your data and how you process it further down the pipeline.
For whatever it's worth, the MetalPerformanceShaders MPSImageThresholdBinary kernel seems to favor the BT.601 values.
I'd recommend taking a look at this answer for more detail on the issues, and conditions under which the use of these values is appropriate.
I am writing CNN code in Metal.
Metal provides MPSCNNLocalContrastNormalization,
but since Instance Normalization is a slightly different concept, I intend to implement it as a kernel function.
The problem is that I need the mean and variance of each of the R, G and B channels, where the features are stored as R, G, B in the texture the kernel function receives as input.
I would like some hints on how to implement this.
kernel void instance_normalization_2darray(texture2d_array<float, access::sample> src [[ texture(0) ]],
texture2d_array<float, access::write> dst [[ texture(1) ]],
uint3 tid [[thread_position_in_grid]]) {
}
kernel void calculate_avgA(texture2d_array<float, access::read> texture_in [[texture(0)]],
texture2d_array<float, access::write> texture_out [[texture(1)]],
uint3 tid [[thread_position_in_grid]])
{
int width = texture_in.get_width();
int height = texture_in.get_height();
int depth = texture_in.get_array_size();
float4 outColor;
uint3 kernelIndex(0,0,0);
uint3 textureIndex(0,0,0);
for(int k = 0; k < depth; k++) {
outColor = float4(0.0); // the parenthesized list used the comma operator; construct the vector explicitly
for (int i=0; i < width; i++)
{
for (int j=0; j < height; j++)
{
kernelIndex = uint3(i, j, k);
textureIndex = uint3(tid.x + i, tid.y + j, tid.z + k);
float4 color = texture_in.read(textureIndex.xy, textureIndex.z).rgba;
outColor += color;
}
}
outColor = outColor / (width * height);
texture_out.write(float4(outColor.rgba), tid.xy, textureIndex.z);
}
}
Mr.Bista
I had the same problem; Apple doesn't provide a fast function for this.
For now I just use MPSCNNPoolingAverage to calculate the mean before running my kernels.
This may only be a temporary solution.
In my tests, other algorithms, such as a reduction sum, were no better than this.
I will keep looking for a better implementation.
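For reference, the per-channel statistics the question asks about can be written out on the CPU as follows. This is only a C++ sketch of the arithmetic, not a Metal kernel; the names `Pixel`, `ChannelStats`, `channelStats` and `instanceNormalize` are hypothetical, and the `epsilon` term is an assumption borrowed from the usual formulation of instance normalization.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Pixel = std::array<float, 4>; // one RGBA texel, i.e. four feature channels

struct ChannelStats {
    Pixel mean{};     // per-channel mean, zero-initialized
    Pixel variance{}; // per-channel (population) variance, zero-initialized
};

// Compute mean and variance of each channel over all pixels of one slice.
inline ChannelStats channelStats(const std::vector<Pixel>& pixels)
{
    assert(!pixels.empty());
    ChannelStats s;
    const float n = static_cast<float>(pixels.size());
    for (const Pixel& p : pixels)
        for (std::size_t c = 0; c < 4; ++c) s.mean[c] += p[c];
    for (std::size_t c = 0; c < 4; ++c) s.mean[c] /= n;
    for (const Pixel& p : pixels)
        for (std::size_t c = 0; c < 4; ++c) {
            const float d = p[c] - s.mean[c];
            s.variance[c] += d * d;
        }
    for (std::size_t c = 0; c < 4; ++c) s.variance[c] /= n;
    return s;
}

// Normalize each channel in place: (x - mean) / sqrt(variance + epsilon).
inline void instanceNormalize(std::vector<Pixel>& pixels, float epsilon = 1e-5f)
{
    const ChannelStats s = channelStats(pixels);
    for (Pixel& p : pixels)
        for (std::size_t c = 0; c < 4; ++c)
            p[c] = (p[c] - s.mean[c]) / std::sqrt(s.variance[c] + epsilon);
}
```

On the GPU, the two accumulation loops are the part that would be replaced by a pooling or reduction pass; the final normalization is a straightforward per-pixel kernel.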
Does anyone know a proper way to calculate the mean value of a buffer of random float numbers in a Metal kernel?
Dispatching work on the compute command encoder:
threadsPerGroup = MTLSizeMake(1, 1, inputTexture.arrayLength);
numThreadGroups = MTLSizeMake(1, 1, inputTexture.arrayLength / threadsPerGroup.depth);
[commandEncoder dispatchThreadgroups:numThreadGroups
threadsPerThreadgroup:threadsPerGroup];
Kernel code:
kernel void mean(texture2d_array<float, access::read> inTex [[ texture(0) ]],
device float *means [[ buffer(1) ]],
uint3 id [[ thread_position_in_grid ]]) {
if (id.x == 0 && id.y == 0) {
float mean = 0.0;
for (uint i = 0; i < inTex.get_width(); ++i) {
for (uint j = 0; j < inTex.get_height(); ++j) {
mean += inTex.read(uint2(i, j), id.z)[0];
}
}
float textureArea = inTex.get_width() * inTex.get_height();
mean /= textureArea;
means[id.z] = mean; // write to the declared output buffer
}
}
The buffer is stored in a texture2d_array texture with the R32Float pixel format.
If you can use an array of uint (instead of float) as your data source, I would suggest using one of the "Atomic Fetch and Modify" functions (as described in the Metal Shading Language specification) to write atomically to a buffer.
Here's an example of a kernel function that takes an input buffer (data, an array of uint) and accumulates the sum of the buffer into an atomic buffer (sum, a pointer to an atomic_uint):
kernel void sum(device uint *data [[ buffer(0) ]],
volatile device atomic_uint *sum [[ buffer(1) ]],
uint gid [[ thread_position_in_grid ]])
{
atomic_fetch_add_explicit(sum, data[gid], memory_order_relaxed);
}
In your Swift file, you would set the buffers:
...
let data: [UInt32] = [1, 2, 3, 4] // UInt32 matches the kernel's 32-bit uint
let dataBuffer = device.makeBuffer(bytes: data, length: (data.count * MemoryLayout<UInt32>.size), options: [])
commandEncoder.setBuffer(dataBuffer, offset: 0, at: 0)
var sum: UInt32 = 0
let sumBuffer = device.makeBuffer(bytes: &sum, length: MemoryLayout<UInt32>.size, options: [])
commandEncoder.setBuffer(sumBuffer, offset: 0, at: 1)
commandEncoder.endEncoding()
Commit, wait and then fetch the data from the GPU:
commandBuffer.commit()
commandBuffer.waitUntilCompleted()
let nsData = NSData(bytesNoCopy: sumBuffer.contents(),
length: sumBuffer.length,
freeWhenDone: false)
nsData.getBytes(&sum, length:sumBuffer.length)
let mean = Float(sum) / Float(data.count) // convert before dividing to avoid integer division
print(mean)
Alternatively, if your initial data source has to be an array of float, you could use the vDSP_meanv method of the Accelerate framework which is very fast for such computation.
I hope that helps, cheers!
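If the data has to stay as float, the same "partial results, then combine" structure used for GPU reductions can be sketched on the CPU. The C++ below is only an illustration of that structure, not a Metal kernel. The name `meanByChunks` is hypothetical; each chunk stands in for the work of one threadgroup. It uses double accumulators, which is an assumption that holds on the CPU but not inside a Metal kernel.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Mean of a float buffer computed as per-chunk partial sums followed by a final sum.
inline float meanByChunks(const std::vector<float>& data, std::size_t chunk)
{
    assert(chunk > 0 && !data.empty());
    std::vector<double> partial; // one partial sum per chunk
    for (std::size_t start = 0; start < data.size(); start += chunk) {
        const std::size_t end = std::min(start + chunk, data.size());
        double s = 0.0;
        for (std::size_t i = start; i < end; ++i) s += data[i];
        partial.push_back(s);
    }
    double total = 0.0; // second stage: combine the partial sums
    for (double s : partial) total += s;
    return static_cast<float>(total / static_cast<double>(data.size()));
}
```

Summing in chunks also tends to lose less precision than one long running float sum, which matters for large buffers.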
As mentioned in Apple's documentation, a texture2d in the shading language can be of int type. I tried to use a texture2d of int type as a shader parameter, but its write method failed to work.
kernel void dummy(texture2d<int, access::write> outTexture [[ texture(0) ]],
uint2 gid [[ thread_position_in_grid ]])
{
outTexture.write( int4( 2, 4, 6, 8 ), gid );
}
However, if I replace int with float, it works.
kernel void dummy(texture2d<float, access::write> outTexture [[ texture(0) ]],
uint2 gid [[ thread_position_in_grid ]])
{
outTexture.write( float4( 1.0, 0, 0, 1.0 ), gid );
}
Can other types of texture2d, such as texture2d of int or texture2d of short, be used as shader function parameters, and if so, how? Thanks for reviewing my question.
The related host code:
MTLTextureDescriptor *desc = [MTLTextureDescriptor texture2DDescriptorWithPixelFormat:MTLPixelFormatRGBA8Unorm width:w height:h mipmapped:NO];
desc.usage = MTLTextureUsageShaderWrite;
id<MTLTexture> texture = [device newTextureWithDescriptor:desc];
[commandEncoder setTexture:texture atIndex:0];
The code that prints the output computed by the GPU; w and h are the width and height of the texture, respectively.
uint8_t* imageBytes = malloc(w*h*4);
memset( imageBytes, 0, w*h*4 );
MTLRegion region = MTLRegionMake2D(0, 0, [texture width], [texture height]);
[texture getBytes:imageBytes bytesPerRow:[texture width]*4 fromRegion:region mipmapLevel:0];
for( int j = 0; j < h; j++ )
{
printf("%3d: ", j);
for( int i = 0; i < w*pixel_size; i++ )
{
printf(" %3d",imageBytes[j*w*pixel_size+i] );
}
printf("\n");
}
The problem is that the pixel format you used to create this texture (MTLPixelFormatRGBA8Unorm) is normalized, meaning that the expected pixel value range is 0.0-1.0. For normalized pixel formats, the required data type for reading from or writing to the texture within a Metal kernel is float or half.
In order to write to a texture with integers, you must select an integer pixel format. Here are all of the available formats:
https://developer.apple.com/documentation/metal/mtlpixelformat
The Metal Shading Language Guide states that:
Note: If T is int or short, the data associated with the texture must use a signed integer format. If T is uint or ushort, the data associated with the texture must use an unsigned integer format.
All you have to do is make sure the pixel format of the texture you create in the API (host code) matches the type you declare in the kernel function. Alternatively, you can convert the int values to normalized floats (for an 8-bit unorm format, divide by 255.0) before writing to outTexture.
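The conversion in that last alternative can be illustrated on the CPU. This C++ sketch shows the mapping between 8-bit integers and the normalized range. The function names `toUnormFloat` and `fromUnormFloat` are hypothetical, and the rounding in the float-to-byte direction is an assumption about typical unorm behavior rather than a quote from the Metal documentation.

```cpp
#include <algorithm>
#include <cstdint>

// Map an integer intended for an 8-bit unorm channel to the normalized range [0, 1].
inline float toUnormFloat(int value)
{
    const int clamped = std::min(std::max(value, 0), 255);
    return static_cast<float>(clamped) / 255.0f;
}

// Map a normalized float back to the 8-bit value stored in the texture.
// Values outside [0, 1] saturate, which is why writing float4(2, 4, 6, 8)
// to a unorm texture would read back as all 255.
inline std::uint8_t fromUnormFloat(float value)
{
    const float clamped = std::min(std::max(value, 0.0f), 1.0f);
    return static_cast<std::uint8_t>(clamped * 255.0f + 0.5f); // round to nearest
}
```

With this mapping, the question's int4(2, 4, 6, 8) would be written to a float texture as roughly float4(2, 4, 6, 8) / 255.0 and read back on the host as the bytes 2, 4, 6, 8.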