Converting an MLMultiArray to an image or an OpenGL / Metal texture - iOS

I'm trying to do background segmentation of a live video using CoreML. I used DeepLabV3 as provided by Apple. The model works OK, though it already takes 100 ms to process a 513x513 image. I then want to display the output, which is a 513x513 array of Int32. Converting it to an image, as done in CoreMLHelpers, takes 300 ms, and I'm looking for a much faster way to display the result. I was thinking it might be faster to somehow dump this into an OpenGL or Metal texture.
What is the best way to handle MLMultiArray for live inputs?

This answer is based on processing the MLMultiArray in Metal.
Create an MTLBuffer:
let device = MTLCreateSystemDefaultDevice()!
let segmentationMaskBuffer = device.makeBuffer(length: segmentationHeight * segmentationWidth * MemoryLayout<Int32>.stride, options: [])!
Copy MLMultiArray to MTLBuffer:
memcpy(segmentationMaskBuffer.contents(), mlOutput.semanticPredictions.dataPointer, segmentationMaskBuffer.length)
Set up the Metal-related variables:
let commandQueue = device.makeCommandQueue()!
let library = device.makeDefaultLibrary()!
let function = library.makeFunction(name: "binaryMask")!
let computePipeline = try! device.makeComputePipelineState(function: function)
Create a struct for the segmentation size:
let segmentationWidth = 513
let segmentationHeight = 513

struct MixParams {
    var width: Int32 = Int32(segmentationWidth)
    var height: Int32 = Int32(segmentationHeight)
}
Create an output texture:
let textureDescriptor = MTLTextureDescriptor.texture2DDescriptor(pixelFormat: .bgra8Unorm, width: segmentationWidth, height: segmentationHeight, mipmapped: false)
textureDescriptor.usage = [.shaderRead, .shaderWrite]
let outputTexture = device.makeTexture(descriptor: textureDescriptor)!
Pass the MTLBuffer and the output texture to the kernel function:
let buffer = commandQueue.makeCommandBuffer()!
let maskCommandEncoder = buffer.makeComputeCommandEncoder()!
maskCommandEncoder.setTexture(outputTexture, index: 1)
maskCommandEncoder.setBuffer(segmentationMaskBuffer, offset: 0, index: 0)
var params = MixParams()
maskCommandEncoder.setBytes(&params, length: MemoryLayout<MixParams>.size, index: 1)
let w = computePipeline.threadExecutionWidth
let h = computePipeline.maxTotalThreadsPerThreadgroup / w
let threadGroupSize = MTLSizeMake(w, h, 1)
let threadGroups = MTLSizeMake(
    (outputTexture.width + threadGroupSize.width - 1) / threadGroupSize.width,
    (outputTexture.height + threadGroupSize.height - 1) / threadGroupSize.height,
    1)
maskCommandEncoder.setComputePipelineState(computePipeline)
maskCommandEncoder.dispatchThreadgroups(threadGroups, threadsPerThreadgroup: threadGroupSize)
maskCommandEncoder.endEncoding()
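The snippet above never submits the work. As a minimal sketch (using the buffer variable from above), commit the command buffer and read the output texture only after it has completed; for live video you would typically rely on the completion handler rather than blocking:
buffer.addCompletedHandler { _ in
    // The output texture is safe to read here (e.g. build the CIImage and display it).
}
buffer.commit()
buffer.waitUntilCompleted()   // or rely solely on the completion handler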
Write your kernel function in the Shaders.metal file:
#include <metal_stdlib>
using namespace metal;

struct MixParams {
    int segmentationWidth;
    int segmentationHeight;
};

static inline int get_class(float2 pos, int width, int height, device int* mask) {
    const int x = int(pos.x * width);
    const int y = int(pos.y * height);
    return mask[y * width + x];
}

static float get_person_probability(float2 pos, int width, int height, device int* mask) {
    return get_class(pos, width, height, mask) == 15;
}

kernel void binaryMask(
    texture2d<float, access::write> outputTexture [[texture(1)]],
    device int* segmentationMask [[buffer(0)]],
    constant MixParams& params [[buffer(1)]],
    uint2 gid [[thread_position_in_grid]])
{
    float width = outputTexture.get_width();
    float height = outputTexture.get_height();
    if (gid.x >= width || gid.y >= height) return;

    const float2 pos = float2(float(gid.x) / width,
                              float(gid.y) / height);
    const float is_person = get_person_probability(pos, params.segmentationWidth,
                                                   params.segmentationHeight,
                                                   segmentationMask);
    float4 outPixel;
    if (is_person < 0.5f) {
        outPixel = float4(0.0, 0.0, 0.0, 0.0);
    } else {
        outPixel = float4(1.0, 1.0, 1.0, 1.0);
    }
    outputTexture.write(outPixel, gid);
}
Finally, get a CIImage from the output texture once the command buffer has completed:
let kciOptions: [CIImageOption: Any] = [CIImageOption.colorSpace: CGColorSpaceCreateDeviceRGB()]
let maskImage = CIImage(mtlTexture: outputTexture, options: kciOptions)!.oriented(.downMirrored)

Instead of outputting an MLMultiArray you can change the model to make it output an image of type CVPixelBuffer. Then you can use CVMetalTextureCacheCreateTextureFromImage to turn the pixel buffer into an MTLTexture. (I think this works but I don't recall if I ever tried it. Not all pixel buffer objects can be turned into textures and I'm not sure if Core ML outputs a CVPixelBuffer object with the "Metal compatibility flag" turned on.)
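If you go that route, a minimal sketch of the pixel-buffer-to-texture step could look like the following. The pixelBuffer parameter stands for the model's CVPixelBuffer output, and .r8Unorm is an assumed format for a one-channel mask; as noted above, this only works if the pixel buffer really is Metal compatible:
import CoreVideo
import Metal

let device = MTLCreateSystemDefaultDevice()!

// Create the texture cache once and reuse it for every frame.
var textureCache: CVMetalTextureCache?
CVMetalTextureCacheCreate(kCFAllocatorDefault, nil, device, nil, &textureCache)

func makeTexture(from pixelBuffer: CVPixelBuffer) -> MTLTexture? {
    var cvTexture: CVMetalTexture?
    CVMetalTextureCacheCreateTextureFromImage(kCFAllocatorDefault,
                                              textureCache!,
                                              pixelBuffer,
                                              nil,
                                              .r8Unorm,   // pick a format matching the model output
                                              CVPixelBufferGetWidth(pixelBuffer),
                                              CVPixelBufferGetHeight(pixelBuffer),
                                              0,
                                              &cvTexture)
    return cvTexture.flatMap(CVMetalTextureGetTexture)
}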
Alternatively, you can write a compute kernel that takes in the MLMultiArray and converts it to a texture, which then gets drawn into a Metal view. This has the advantage that you can apply all kinds of effects to the segmentation map in the compute kernel at the same time.

Related

How can I calculate the mean and variance value of an image with 16 channels using Metal Shading Language

How can I calculate the mean and variance of an image with 16 channels using Metal?
I want to calculate the mean and variance of each channel separately!
e.g.:
kernel void meanandvariance(texture2d_array<float, access::read> in[[texture(0)]],
texture2d_array<float, access::write> out[[texture(1)]],
ushort3 gid[[thread_position_in_grid]],
ushort tid[[thread_index_in_threadgroup]],
ushort3 tg_size[[threads_per_threadgroup]]) {
}
There's probably a way to do this by creating a sequence of texture views on the input texture array and output texture array, encoding an MPSImageStatisticsMeanAndVariance kernel invocation for each slice.
But let's take a look at how to do it ourselves. There are many different possible approaches, so I chose one that was simple and used some interesting results from statistics.
Essentially, we'll do the following:
Write a kernel that can produce a subset mean and variance for a single row of the image.
Write a kernel that can produce an overall mean and variance from the partial results from step 1.
Here are the kernels:
kernel void compute_row_mean_variance_array(texture2d_array<float, access::read> inTexture [[texture(0)]],
                                            texture2d_array<float, access::write> outTexture [[texture(1)]],
                                            uint3 tpig [[thread_position_in_grid]])
{
    uint row = tpig.x;
    uint slice = tpig.y;
    uint width = inTexture.get_width();
    if (row >= inTexture.get_height() || slice >= inTexture.get_array_size()) { return; }

    float4 mean(0.0f);
    float4 var(0.0f);
    for (uint col = 0; col < width; ++col) {
        float4 rgba = inTexture.read(ushort2(col, row), slice);
        // http://datagenetics.com/blog/november22017/index.html
        float weight = 1.0f / (col + 1);
        float4 oldMean = mean;
        mean = mean + (rgba - mean) * weight;
        var = var + (rgba - oldMean) * (rgba - mean);
    }
    var = var / width;

    outTexture.write(mean, ushort2(row, 0), slice);
    outTexture.write(var, ushort2(row, 1), slice);
}
kernel void reduce_mean_variance_array(texture2d_array<float, access::read> inTexture [[texture(0)]],
                                       texture2d_array<float, access::write> outTexture [[texture(1)]],
                                       uint3 tpig [[thread_position_in_grid]])
{
    uint width = inTexture.get_width();
    uint slice = tpig.x;
    // https://arxiv.org/pdf/1007.1012.pdf
    float4 mean(0.0f);
    float4 meanOfVar(0.0f);
    float4 varOfMean(0.0f);
    for (uint col = 0; col < width; ++col) {
        float weight = 1.0f / (col + 1);
        float4 oldMean = mean;
        float4 submean = inTexture.read(ushort2(col, 0), slice);
        mean = mean + (submean - mean) * weight;
        float4 subvar = inTexture.read(ushort2(col, 1), slice);
        meanOfVar = meanOfVar + (subvar - meanOfVar) * weight;
        varOfMean = varOfMean + (submean - oldMean) * (submean - mean);
    }
    // Overall variance = mean of the row variances + variance of the row means
    float4 var = meanOfVar + varOfMean / width;

    outTexture.write(mean, ushort2(0, 0), slice);
    outTexture.write(var, ushort2(1, 0), slice);
}
In summary, to achieve step 1, we use an "online" (incremental) algorithm to calculate the partial mean/variance of the row in a way that's more numerically-stable than just adding all the pixel values and dividing by the width. My reference for writing this kernel was this post. Each thread in the grid writes its row's statistics to the appropriate column and slice of an intermediate texture array.
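Written out, the per-row update the loop performs (with weight = 1/(col+1) playing the role of $1/k$) is:

$$m_k = m_{k-1} + \frac{x_k - m_{k-1}}{k}, \qquad M_k = M_{k-1} + (x_k - m_{k-1})(x_k - m_k), \qquad \sigma^2_{\text{row}} = \frac{M_W}{W}$$

where $x_k$ is the $k$-th pixel in the row and $W$ is the image width.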
To achieve step 2, we need to find a statistically-sound way of computing the overall statistics from the partial results. This is quite simple in the case of finding the mean: the mean of the population is the mean of the means of the subsets (this holds when the sample size of each subset is the same; in the general case, the overall mean is a weighted sum of the subset means). The variance is trickier, but it turns out that the variance of the population is the sum of the mean of the variances of the subsets and the variance of the means of the subsets (the same caveat about equally-sized subsets applies here). This is a convenient fact that we can combine with our incremental approach above to produce the final mean and variance of each slice, which is written to the corresponding slice of the output texture.
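In symbols, for equally sized subsets this is the law of total variance with the row as the conditioning variable:

$$\operatorname{Var}(X) = \underbrace{\operatorname{E}\big[\operatorname{Var}(X \mid \text{row})\big]}_{\text{mean of row variances}} \;+\; \underbrace{\operatorname{Var}\big(\operatorname{E}[X \mid \text{row}]\big)}_{\text{variance of row means}}$$

which is exactly what the reduction kernel computes as meanOfVar + varOfMean / width.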
For completeness, here's the Swift code I used to drive these kernels:
let library = device.makeDefaultLibrary()!
let meanVarKernelFunction = library.makeFunction(name: "compute_row_mean_variance_array")!
let meanVarComputePipelineState = try! device.makeComputePipelineState(function: meanVarKernelFunction)
let reduceKernelFunction = library.makeFunction(name: "reduce_mean_variance_array")!
let reduceComputePipelineState = try! device.makeComputePipelineState(function: reduceKernelFunction)
let width = sourceTexture.width
let height = sourceTexture.height
let arrayLength = sourceTexture.arrayLength
let textureDescriptor = MTLTextureDescriptor.texture2DDescriptor(pixelFormat: .rgba32Float, width: width, height: height, mipmapped: false)
textureDescriptor.textureType = .type2DArray
textureDescriptor.arrayLength = arrayLength
textureDescriptor.width = height
textureDescriptor.height = 2
textureDescriptor.usage = [.shaderRead, .shaderWrite]
let partialResultsTexture = device.makeTexture(descriptor: textureDescriptor)!
textureDescriptor.width = 2
textureDescriptor.height = 1
textureDescriptor.usage = .shaderWrite
let destTexture = device.makeTexture(descriptor: textureDescriptor)!
let commandBuffer = commandQueue.makeCommandBuffer()!
let computeCommandEncoder = commandBuffer.makeComputeCommandEncoder()!
computeCommandEncoder.setComputePipelineState(meanVarComputePipelineState)
computeCommandEncoder.setTexture(sourceTexture, index: 0)
computeCommandEncoder.setTexture(partialResultsTexture, index: 1)
let meanVarGridSize = MTLSize(width: sourceTexture.height, height: sourceTexture.arrayLength, depth: 1)
let meanVarThreadgroupSize = MTLSizeMake(meanVarComputePipelineState.threadExecutionWidth, 1, 1)
let meanVarThreadgroupCount = MTLSizeMake((meanVarGridSize.width + meanVarThreadgroupSize.width - 1) / meanVarThreadgroupSize.width,
(meanVarGridSize.height + meanVarThreadgroupSize.height - 1) / meanVarThreadgroupSize.height,
1)
computeCommandEncoder.dispatchThreadgroups(meanVarThreadgroupCount, threadsPerThreadgroup: meanVarThreadgroupSize)
computeCommandEncoder.setComputePipelineState(reduceComputePipelineState)
computeCommandEncoder.setTexture(partialResultsTexture, index: 0)
computeCommandEncoder.setTexture(destTexture, index: 1)
let reduceThreadgroupSize = MTLSizeMake(1, 1, 1)
let reduceThreadgroupCount = MTLSizeMake(arrayLength, 1, 1)
computeCommandEncoder.dispatchThreadgroups(reduceThreadgroupCount, threadsPerThreadgroup: reduceThreadgroupSize)
computeCommandEncoder.endEncoding()
// Comparison path referenced below: meanVarKernel (presumably an MPSImageStatisticsMeanAndVariance
// instance) and sourceTexture2D (a 2D version of the source) are not defined in this snippet.
let destTexture2DDesc = MTLTextureDescriptor.texture2DDescriptor(pixelFormat: .rgba32Float, width: 2, height: 1, mipmapped: false)
destTexture2DDesc.usage = .shaderWrite
let destTexture2D = device.makeTexture(descriptor: destTexture2DDesc)!
meanVarKernel.encode(commandBuffer: commandBuffer, sourceTexture: sourceTexture2D, destinationTexture: destTexture2D)
#if os(macOS)
let blitCommandEncoder = commandBuffer.makeBlitCommandEncoder()!
blitCommandEncoder.synchronize(resource: destTexture)
blitCommandEncoder.synchronize(resource: destTexture2D)
blitCommandEncoder.endEncoding()
#endif
commandBuffer.commit()
commandBuffer.waitUntilCompleted()
In my experiments, this program produced the same results as MPSImageStatisticsMeanAndVariance, give or take some differences on the order of 1e-7. It was also 2.5x slower than MPS on my Mac, probably due in part to its failure to exploit latency hiding through finer-grained parallelism.
#include <metal_stdlib>
using namespace metal;
kernel void instance_norm(constant float4* scale[[buffer(0)]],
constant float4* shift[[buffer(1)]],
texture2d_array<float, access::read> in[[texture(0)]],
texture2d_array<float, access::write> out[[texture(1)]],
ushort3 gid[[thread_position_in_grid]],
ushort tid[[thread_index_in_threadgroup]],
ushort3 tg_size[[threads_per_threadgroup]]) {
ushort width = in.get_width();
ushort height = in.get_height();
const ushort thread_count = tg_size.x * tg_size.y;
threadgroup float4 shared_mem [256];
float4 sum = 0;
for(ushort xIndex = gid.x; xIndex < width; xIndex += tg_size.x) {
for(ushort yIndex = gid.y; yIndex < height; yIndex += tg_size.y) {
sum += in.read(ushort2(xIndex, yIndex), gid.z);
}
}
shared_mem[tid] = sum;
threadgroup_barrier(mem_flags::mem_threadgroup);
// Reduce to 32 values
sum = 0;
if (tid < 32) {
for (ushort i = tid + 32; i < thread_count; i += 32) {
sum += shared_mem[i];
}
}
shared_mem[tid] += sum;
threadgroup_barrier(mem_flags::mem_threadgroup);
// Calculate mean
sum = 0;
if (tid == 0) {
ushort top = min(ushort(32), thread_count);
for (ushort i = 0; i < top; i += 1) {
sum += shared_mem[i];
}
shared_mem[0] = sum / (width * height);
}
threadgroup_barrier(mem_flags::mem_threadgroup);
const float4 mean = shared_mem[0];
threadgroup_barrier(mem_flags::mem_threadgroup);
// Variance
sum = 0;
for(ushort xIndex = gid.x; xIndex < width; xIndex += tg_size.x) {
for(ushort yIndex = gid.y; yIndex < height; yIndex += tg_size.y) {
sum += pow(in.read(ushort2(xIndex, yIndex), gid.z) - mean, 2);
}
}
shared_mem[tid] = sum;
threadgroup_barrier(mem_flags::mem_threadgroup);
// Reduce to 32 values
sum = 0;
if (tid < 32) {
for (ushort i = tid + 32; i < thread_count; i += 32) {
sum += shared_mem[i];
}
}
shared_mem[tid] += sum;
threadgroup_barrier(mem_flags::mem_threadgroup);
// Calculate variance
sum = 0;
if (tid == 0) {
ushort top = min(ushort(32), thread_count);
for (ushort i = 0; i < top; i += 1) {
sum += shared_mem[i];
}
shared_mem[0] = sum / (width * height);
}
threadgroup_barrier(mem_flags::mem_threadgroup);
const float4 sigma = sqrt(shared_mem[0] + float4(1e-4));
float4 multiplier = scale[gid.z] / sigma;
for(ushort xIndex = gid.x; xIndex < width; xIndex += tg_size.x) {
for(ushort yIndex = gid.y; yIndex < height; yIndex += tg_size.y) {
float4 val = in.read(ushort2(xIndex, yIndex), gid.z);
out.write(clamp((val - mean) * multiplier + shift[gid.z], -10.0, 10.0), ushort2(xIndex, yIndex), gid.z);
}
}
}
This is how Bender implements instance normalization (the kernel above), but I do not think it is correct. Can anybody verify it?
https://github.com/xmartlabs/Bender/blob/master/Sources/Metal/instanceNorm.metal

How to pass an array of float3 values from CPU to GPU in Metal?

I need to pass color values to the shader as an array. Right now I'm passing them as a structure, RGBColors, and receiving them as a structure, and that works fine.
But I want to receive the values as float3 in the shader. As soon as I change it to float3 it starts acting weird and flickering; it doesn't give me the proper color.
Here is the code I used to set the fragment buffer,
func setFragmentBuffer(_ values: [Float], at index: Int) {
let bufferValues = values
let datasize = 16 * values.count / 3
let colorBuffer = device.makeBuffer(bytes: bufferValues, length: datasize, options: [])
renderEncoder.setFragmentBuffer(colorBuffer, offset: 0, at: index)
}
Here is the structure for the RGBColors
struct RGBColors {
var r: Float
var g: Float
var b: Float
func floatBuffers() -> [Float] {
return [r,g,b]
}
}
From this structure I create an array of Float values and set it as the fragment buffer.
In the line let datasize = 16 * values.count / 3 I used the value 16 because, although the float data type in C is 4 bytes, float3 in simd is 16 bytes.
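As a side note, you can check these layouts from Swift itself; a minimal sketch (the printed values in the comments are what MemoryLayout typically reports for these types):
import simd

// Swift's Float is 4 bytes; simd's float3 (SIMD3<Float>) is padded out to 16 bytes.
print(MemoryLayout<Float>.stride)    // 4
print(MemoryLayout<float3>.size)     // 16
print(MemoryLayout<float3>.stride)   // 16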
And in the shader I'm implementing the method
fragment float4
singleShader(RasterizerData in [[stage_in]],
texture2d<half> sourceTexture [[ texture(0) ]],
const device float3 &rgbColor [[ buffer(1) ]])
{
constexpr sampler textureSampler (mag_filter::linear,
min_filter::linear);
// Sample the texture and return the color to colorSample
const half4 colorSample = sourceTexture.sample (textureSampler, in.textureCoordinate);
float4 outputColor;
float red = colorSample.r * rgbColor.r;
float green = colorSample.g * rgbColor.g;
float blue = colorSample.b * rgbColor.b;
outputColor = float4(red, green,blue, colorSample.a);
outputColor = float4((outputColor.rgb * param1 + param2) / 4, colorSample.a);
return outputColor;
}
Finally, I'm not getting the right output color.
How do I match the simd float3 data type to the Swift Float data type? Can someone advise?
Edit:
I found a solution about how to create a MTLBuffer from float3 and here is the code:
func setFragmentBuffer(_ values: [float3], at index: Int) {
var valueBuffer = values
let bufferCreated = device.makeBuffer(length: MemoryLayout.size(ofValue: valueBuffer[0]) * 2 * valueBuffer.count , options: [])
let bufferPointer = bufferCreated.contents()
memcpy(bufferPointer, &valueBuffer, 16 * valueBuffer.count)
renderEncoder.setFragmentBuffer(bufferCreated, offset: 0, at: index)
}
This code works perfectly fine. But in the line let bufferCreated = device.makeBuffer(length: MemoryLayout.size(ofValue: valueBuffer[0]) * 2 * valueBuffer.count, options: []) you can see that the length has to be multiplied by a factor of 2 for the code to work.
Why is this multiplier needed? I don't understand it. Could someone explain?

Get RGB "CVPixelBuffer" from ARKit

I'm trying to get a CVPixelBuffer in RGB color space from the Apple's ARKit. In func session(_ session: ARSession, didUpdate frame: ARFrame) method of ARSessionDelegate I get an instance of ARFrame. On page Displaying an AR Experience with Metal I found that this pixel buffer is in YCbCr (YUV) color space.
I need to convert this to RGB color space (I actually need CVPixelBuffer and not UIImage). I've found something about color conversion on iOS but I was not able to get this working in Swift 3.
There are several ways to do this, depending on what you're after. The best way to do it in real time (say, to render the buffer to a view) is to use a custom shader to convert the YCbCr CVPixelBuffer to RGB.
Using Metal:
If you make a new project, select "Augmented Reality App," and select "Metal" for the content technology, the project generated will contain the code and shaders necessary to make this conversion.
Using OpenGL:
The GLCameraRipple example from Apple uses an AVCaptureSession to capture the camera, and shows how to map the resulting CVPixelBuffer to GL textures, which are then converted to RGB in shaders (again, provided in the example).
Non Realtime:
The answer to this stackoverflow question addresses converting the buffer to a UIImage, and offers a pretty simple way to do it.
I was also stuck on this question for several days. All of the code snippets I could find on the Internet for converting a CVPixelBuffer to a UIImage were written in Objective-C rather than Swift.
Finally, the following code snippet works perfectly for me to convert a YUV image to either JPG or PNG file format; you can then write it to a local file in your application.
func pixelBufferToUIImage(pixelBuffer: CVPixelBuffer) -> UIImage {
let ciImage = CIImage(cvPixelBuffer: pixelBuffer)
let context = CIContext(options: nil)
let cgImage = context.createCGImage(ciImage, from: ciImage.extent)
let uiImage = UIImage(cgImage: cgImage!)
return uiImage
}
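If, as in the original question, you need an RGB CVPixelBuffer rather than a UIImage, the same CIImage route can render into a fresh pixel buffer. A minimal sketch, assuming a BGRA output format is acceptable (the function name and defaults here are illustrative):
import CoreImage
import CoreVideo

func pixelBufferToBGRA(_ pixelBuffer: CVPixelBuffer, context: CIContext = CIContext()) -> CVPixelBuffer? {
    let ciImage = CIImage(cvPixelBuffer: pixelBuffer)
    var output: CVPixelBuffer?
    CVPixelBufferCreate(kCFAllocatorDefault,
                        CVPixelBufferGetWidth(pixelBuffer),
                        CVPixelBufferGetHeight(pixelBuffer),
                        kCVPixelFormatType_32BGRA,
                        nil,
                        &output)
    guard let rgbBuffer = output else { return nil }
    // Core Image performs the YCbCr -> RGB conversion while rendering.
    context.render(ciImage, to: rgbBuffer)
    return rgbBuffer
}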
The docs explicitly say that you need to access the luma and chroma planes:
ARKit captures pixel buffers in a planar YCbCr format (also known as YUV) format. To render these images on a device display, you'll need to access the luma and chroma planes of the pixel buffer and convert pixel values to an RGB format.
So there's no way to get the RGB planes directly; you'll have to handle this in your shaders, either in Metal or OpenGL, as described by @joshue.
You may want the Accelerate framework's image conversion functions. Perhaps a combination of vImageConvert_420Yp8_Cb8_Cr8ToARGB8888 and vImageConvert_ARGB8888toRGB888 (If you don't want the alpha channel). In my experience these work in real time.
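If you go the Accelerate route, here is a rough sketch for ARKit's bi-planar full-range buffers (kCVPixelFormatType_420YpCbCr8BiPlanarFullRange). Because the data is bi-planar, the bi-planar variant vImageConvert_420Yp8_CbCr8ToARGB8888 is the closer match; treat the pixel-range values and the 601 matrix below as assumptions to verify against the vImage documentation:
import Accelerate
import CoreVideo

// Build the conversion info once and reuse it.
var conversionInfo = vImage_YpCbCrToARGB()
var pixelRange = vImage_YpCbCrPixelRange(Yp_bias: 0, CbCr_bias: 128,
                                         YpRangeMax: 255, CbCrRangeMax: 255,
                                         YpMax: 255, YpMin: 1,
                                         CbCrMax: 255, CbCrMin: 0)
vImageConvert_YpCbCrToARGB_GenerateConversion(kvImage_YpCbCrToARGBMatrix_ITU_R_601_4!,
                                              &pixelRange,
                                              &conversionInfo,
                                              kvImage420Yp8_CbCr8,
                                              kvImageARGB8888,
                                              vImage_Flags(kvImageNoFlags))

// `dest` must be pre-allocated as an 8-bit, 4-channel buffer with the luma plane's dimensions.
func convertToARGB(_ pixelBuffer: CVPixelBuffer, into dest: inout vImage_Buffer) {
    CVPixelBufferLockBaseAddress(pixelBuffer, .readOnly)
    defer { CVPixelBufferUnlockBaseAddress(pixelBuffer, .readOnly) }

    var ypBuffer = vImage_Buffer(data: CVPixelBufferGetBaseAddressOfPlane(pixelBuffer, 0),
                                 height: vImagePixelCount(CVPixelBufferGetHeightOfPlane(pixelBuffer, 0)),
                                 width: vImagePixelCount(CVPixelBufferGetWidthOfPlane(pixelBuffer, 0)),
                                 rowBytes: CVPixelBufferGetBytesPerRowOfPlane(pixelBuffer, 0))
    var cbCrBuffer = vImage_Buffer(data: CVPixelBufferGetBaseAddressOfPlane(pixelBuffer, 1),
                                   height: vImagePixelCount(CVPixelBufferGetHeightOfPlane(pixelBuffer, 1)),
                                   width: vImagePixelCount(CVPixelBufferGetWidthOfPlane(pixelBuffer, 1)),
                                   rowBytes: CVPixelBufferGetBytesPerRowOfPlane(pixelBuffer, 1))

    vImageConvert_420Yp8_CbCr8ToARGB8888(&ypBuffer, &cbCrBuffer, &dest,
                                         &conversionInfo, nil, 255,
                                         vImage_Flags(kvImageNoFlags))
}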
I struggled with this for a long while as well, and ended up writing the following code, which works for me:
// Helper macro to ensure pixel values are bounded between 0 and 255
#define clamp(a) (a > 255 ? 255 : (a < 0 ? 0 : a))
- (void)processImageBuffer:(CVImageBufferRef)imageBuffer
{
OSType type = CVPixelBufferGetPixelFormatType(imageBuffer);
if (type == kCVPixelFormatType_420YpCbCr8BiPlanarFullRange)
{
CVPixelBufferLockBaseAddress(imageBuffer, 0);
// We know the return format of the base address based on the YpCbCr8BiPlanarFullRange format (as per doc)
StandardBuffer baseAddress = (StandardBuffer)CVPixelBufferGetBaseAddress(imageBuffer);
// Get the number of bytes per row for the pixel buffer, width and height
size_t bytesPerRow = CVPixelBufferGetBytesPerRow(imageBuffer);
size_t width = CVPixelBufferGetWidth(imageBuffer);
size_t height = CVPixelBufferGetHeight(imageBuffer);
// Get buffer info and planar pixel data
CVPlanarPixelBufferInfo_YCbCrBiPlanar *bufferInfo = (CVPlanarPixelBufferInfo_YCbCrBiPlanar *)baseAddress;
uint8_t* cbrBuff = (uint8_t *)CVPixelBufferGetBaseAddressOfPlane(imageBuffer, 1);
// This just moved the pointer past the offset
baseAddress = (uint8_t *)CVPixelBufferGetBaseAddressOfPlane(imageBuffer, 0);
int bytesPerPixel = 4;
uint8_t *rgbData = rgbFromYCrCbBiPlanarFullRangeBuffer(baseAddress,
cbrBuff,
bufferInfo,
width,
height,
bytesPerRow);
[self doStuffOnRGBBuffer:rgbData width:width height:height bitsPerComponent:8 bytesPerPixel:bytesPerPixel bytesPerRow:bytesPerRow];
free(rgbData);
CVPixelBufferUnlockBaseAddress(imageBuffer, 0);
}
else
{
NSLog(@"Unsupported image buffer type");
}
}
uint8_t * rgbFromYCrCbBiPlanarFullRangeBuffer(uint8_t *inBaseAddress,
uint8_t *cbCrBuffer,
CVPlanarPixelBufferInfo_YCbCrBiPlanar * inBufferInfo,
size_t inputBufferWidth,
size_t inputBufferHeight,
size_t inputBufferBytesPerRow)
{
int bytesPerPixel = 4;
NSUInteger yPitch = EndianU32_BtoN(inBufferInfo->componentInfoY.rowBytes);
uint8_t *rgbBuffer = (uint8_t *)malloc(inputBufferWidth * inputBufferHeight * bytesPerPixel);
NSUInteger cbCrPitch = EndianU32_BtoN(inBufferInfo->componentInfoCbCr.rowBytes);
uint8_t *yBuffer = (uint8_t *)inBaseAddress;
for(int y = 0; y < inputBufferHeight; y++)
{
uint8_t *rgbBufferLine = &rgbBuffer[y * inputBufferWidth * bytesPerPixel];
uint8_t *yBufferLine = &yBuffer[y * yPitch];
uint8_t *cbCrBufferLine = &cbCrBuffer[(y >> 1) * cbCrPitch];
for(int x = 0; x < inputBufferWidth; x++)
{
int16_t y = yBufferLine[x];
int16_t cb = cbCrBufferLine[x & ~1] - 128;
int16_t cr = cbCrBufferLine[x | 1] - 128;
uint8_t *rgbOutput = &rgbBufferLine[x*bytesPerPixel];
int16_t r = (int16_t)roundf( y + cr * 1.4 );
int16_t g = (int16_t)roundf( y + cb * -0.343 + cr * -0.711 );
int16_t b = (int16_t)roundf( y + cb * 1.765);
// ABGR image representation
rgbOutput[0] = 0xff;
rgbOutput[1] = clamp(b);
rgbOutput[2] = clamp(g);
rgbOutput[3] = clamp(r);
}
}
return rgbBuffer;
}

Indexed drawing with metal

I am trying to load a model (from .OBJ) and draw it to the screen on iOS with MetalKit. The problem is that instead of my model, I get some random polygons...
Here is the code that is supposed to load the model (it is based on a tutorial from raywenderlich.com):
let allocator = MTKMeshBufferAllocator(device: device)
let vertexDescriptor = MDLVertexDescriptor()
let vertexLayout = MDLVertexBufferLayout()
vertexLayout.stride = sizeof(Vertex)
vertexDescriptor.layouts = [vertexLayout]
vertexDescriptor.attributes = [MDLVertexAttribute(name: MDLVertexAttributePosition, format: MDLVertexFormat.Float3, offset: 0, bufferIndex: 0),
MDLVertexAttribute(name: MDLVertexAttributeColor, format: MDLVertexFormat.Float4, offset: sizeof(float3), bufferIndex: 0),
MDLVertexAttribute(name: MDLVertexAttributeTextureCoordinate, format: MDLVertexFormat.Float2, offset: sizeof(float3)+sizeof(float4), bufferIndex: 0),
MDLVertexAttribute(name: MDLVertexAttributeNormal, format: MDLVertexFormat.Float3, offset: sizeof(float3)+sizeof(float4)+sizeof(float2), bufferIndex: 0)]
var error: NSError?
let asset = MDLAsset(URL: path, vertexDescriptor: vertexDescriptor, bufferAllocator: allocator, preserveTopology: true, error: &error)
if error != nil{
print(error)
return nil
}
let model = asset.objectAtIndex(0) as! MDLMesh
let mesh = try MTKMesh(mesh: model, device: device)
And here is my drawing method:
func render(commandQueue: MTLCommandQueue, pipelineState: MTLRenderPipelineState,drawable: CAMetalDrawable,projectionMatrix: float4x4,modelViewMatrix: float4x4, clearColor: MTLClearColor){
dispatch_semaphore_wait(bufferProvider.availibleResourcesSemaphore, DISPATCH_TIME_FOREVER)
let renderPassDescriptor = MTLRenderPassDescriptor()
renderPassDescriptor.colorAttachments[0].texture = drawable.texture
renderPassDescriptor.colorAttachments[0].loadAction = .Clear
renderPassDescriptor.colorAttachments[0].clearColor = clearColor
renderPassDescriptor.colorAttachments[0].storeAction = .Store
let commandBuffer = commandQueue.commandBuffer()
commandBuffer.addCompletedHandler { (buffer) in
dispatch_semaphore_signal(self.bufferProvider.availibleResourcesSemaphore)
}
let renderEncoder = commandBuffer.renderCommandEncoderWithDescriptor(renderPassDescriptor)
renderEncoder.setCullMode(MTLCullMode.None)
renderEncoder.setRenderPipelineState(pipelineState)
renderEncoder.setVertexBuffer(vertexBuffer, offset: 0, atIndex: 0)
renderEncoder.setFragmentTexture(texture, atIndex: 0)
if let samplerState = samplerState{
renderEncoder.setFragmentSamplerState(samplerState, atIndex: 0)
}
var nodeModelMatrix = self.modelMatrix()
nodeModelMatrix.multiplyLeft(modelViewMatrix)
uniformBuffer = bufferProvider.nextUniformsBuffer(projectionMatrix, modelViewMatrix: nodeModelMatrix, light: light)
renderEncoder.setVertexBuffer(self.uniformBuffer, offset: 0, atIndex: 1)
renderEncoder.setFragmentBuffer(uniformBuffer, offset: 0, atIndex: 1)
if indexBuffer != nil{
renderEncoder.drawIndexedPrimitives(.Triangle, indexCount: self.indexCount, indexType: self.indexType, indexBuffer: self.indexBuffer!, indexBufferOffset: 0)
}else{
renderEncoder.drawPrimitives(.Triangle, vertexStart: 0, vertexCount: vertexCount, instanceCount: vertexCount/3)
}
renderEncoder.endEncoding()
commandBuffer.presentDrawable(drawable)
commandBuffer.commit()
}
Here is my vertex shader:
struct VertexIn{
packed_float3 position;
packed_float4 color;
packed_float2 texCoord;
packed_float3 normal;
};
struct VertexOut{
float4 position [[position]];
float3 fragmentPosition;
float4 color;
float2 texCoord;
float3 normal;
};
struct Light{
packed_float3 color;
float ambientIntensity;
packed_float3 direction;
float diffuseIntensity;
float shininess;
float specularIntensity;
};
struct Uniforms{
float4x4 modelMatrix;
float4x4 projectionMatrix;
Light light;
};
vertex VertexOut basic_vertex(
const device VertexIn* vertex_array [[ buffer(0) ]],
const device Uniforms& uniforms [[ buffer(1) ]],
unsigned int vid [[ vertex_id ]]) {
float4x4 mv_Matrix = uniforms.modelMatrix;
float4x4 proj_Matrix = uniforms.projectionMatrix;
VertexIn VertexIn = vertex_array[vid];
VertexOut VertexOut;
VertexOut.position = proj_Matrix * mv_Matrix * float4(VertexIn.position,1);
VertexOut.fragmentPosition = (mv_Matrix * float4(VertexIn.position,1)).xyz;
VertexOut.color = VertexIn.color;
VertexOut.texCoord = VertexIn.texCoord;
VertexOut.normal = (mv_Matrix * float4(VertexIn.normal, 0.0)).xyz;
return VertexOut;
}
And here is how it looks:
link
Actually, I have another class, written entirely by me, to load models. It works fine, but it does not use indexing, so if I try to load models that are more complex than a low-poly sphere, the GPU crashes... Anyway, I tried to modify it to use indexing and got the same result. Then I added hard-coded indices for testing and got a really weird result: with 3 indices it drew a triangle, with 3 more it drew the same triangle again, and only after 3 more did it draw 2 triangles...
Edit:
Here is my Vertex structure:
struct Vertex:Equatable{
var x,y,z: Float
var r,g,b,a: Float
var s,t: Float
var nX,nY,nZ:Float
func floatBuffer()->[Float]{
return [x,y,z,r,g,b,a,s,t,nX,nY,nZ]
}
}
I see a couple of potential issues here.
1) Your vertex descriptor does not map exactly to your Vertex struct. The position variables (x, y, z) occupy 12 bytes, so the color variables start at an offset of 12 bytes. This matches the packed_float3 position field in your shader's VertexIn struct, but in the vertex descriptor you provide to Model I/O, you use sizeof(float3), which is 16, as the offset of the color attribute. Because you're packing the position field, you should use sizeof(Float) * 3 for this value instead, and likewise for the subsequent offsets. I suspect this is the main cause of your problems (see the corrected descriptor sketched below).
More generally, it's a good idea to use strideof rather than sizeof to account for alignment, though, by chance, it wouldn't make a difference here.
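A sketch of what the corrected offsets could look like, in the same Swift 2 era syntax as the question; the numbers follow directly from the packed layout of the Vertex struct (12 floats, 48 bytes):
vertexLayout.stride = sizeof(Float) * 12   // 48 bytes, matching the 12 floats in Vertex
vertexDescriptor.attributes = [
    MDLVertexAttribute(name: MDLVertexAttributePosition, format: MDLVertexFormat.Float3, offset: 0, bufferIndex: 0),
    MDLVertexAttribute(name: MDLVertexAttributeColor, format: MDLVertexFormat.Float4, offset: sizeof(Float) * 3, bufferIndex: 0),             // 12
    MDLVertexAttribute(name: MDLVertexAttributeTextureCoordinate, format: MDLVertexFormat.Float2, offset: sizeof(Float) * 7, bufferIndex: 0), // 28
    MDLVertexAttribute(name: MDLVertexAttributeNormal, format: MDLVertexFormat.Float3, offset: sizeof(Float) * 9, bufferIndex: 0)]            // 36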
2) Model I/O is allowed to use a single MTLBuffer to store both vertices and indices, so you should use the offset member of each MTKMeshBuffer when setting the vertex buffer or specifying the index buffer in each draw call, rather than assuming the offsets to be 0.
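For example, a minimal sketch of drawing an MTKMesh while respecting each MTKMeshBuffer's offset, written with current Metal/MetalKit API names (adapt the selector spellings to the older Swift used in the question):
// `mesh` is the MTKMesh loaded above; `renderEncoder` is the current render command encoder.
for (index, vertexBuffer) in mesh.vertexBuffers.enumerated() {
    renderEncoder.setVertexBuffer(vertexBuffer.buffer, offset: vertexBuffer.offset, index: index)
}
for submesh in mesh.submeshes {
    renderEncoder.drawIndexedPrimitives(type: submesh.primitiveType,
                                        indexCount: submesh.indexCount,
                                        indexType: submesh.indexType,
                                        indexBuffer: submesh.indexBuffer.buffer,
                                        indexBufferOffset: submesh.indexBuffer.offset)
}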

ios metal: multiple kernel calls in one command buffer

I'm having a problem with the implementation of multiple kernel functions in Metal in combination with Swift.
My target is to implement a block-wise DCT transformation over an image. The DCT is implemented with two matrix multiplications.
J = H * I * H^-1
The following code shows the kernel functions themselves and the calls in the Swift code. If I run each kernel function alone it works, but I can't manage to hand the buffer written by the first kernel function over to the second one. The second function therefore always returns a buffer filled with just zeros.
All the image input and output buffers are 400x400 pixels with RGB (a 16-bit integer for each component). The matrices are 8x8 with 16-bit integer entries.
Is there a special command needed to synchronize the buffer read and write accesses of the different kernel functions? Or am I doing something else wrong?
Thanks for your help
shaders.metal
struct Image3D16{
short data[400][400][3];
};
struct Matrix{
short data[8 * 8];
};
kernel void dct1(device Image3D16 *inputImage [[buffer(0)]],
device Image3D16 *outputImage [[buffer(1)]],
device Matrix *mult [[buffer(2)]],
uint2 gid [[thread_position_in_grid]],
uint2 tid [[thread_position_in_threadgroup]]){
int red = 0, green = 0, blue = 0;
for(int x=0;x<8;x++){
short r = inputImage->data[gid.x-tid.x + x][gid.y][0];
short g = inputImage->data[gid.x-tid.x + x][gid.y][1];
short b = inputImage->data[gid.x-tid.x + x][gid.y][2];
red += r * mult->data[tid.x*8 + x];
green += g * mult->data[tid.x*8 + x];
blue += b * mult->data[tid.x*8 + x];
}
outputImage->data[gid.x][gid.y][0] = red;
outputImage->data[gid.x][gid.y][1] = green;
outputImage->data[gid.x][gid.y][2] = blue;
}
kernel void dct2(device Image3D16 *inputImage [[buffer(0)]],
device Image3D16 *outputImage [[buffer(1)]],
device Matrix *mult [[buffer(2)]],
uint2 gid [[thread_position_in_grid]],
uint2 tid [[thread_position_in_threadgroup]]){
int red = 0, green = 0, blue = 0;
for(int y=0;y<8;y++){
short r = inputImage->data[gid.x][gid.y-tid.y + y][0];
short g = inputImage->data[gid.x][gid.y-tid.y + y][1];
short b = inputImage->data[gid.x][gid.y-tid.y + y][2];
red += r * mult->data[tid.y*8 + y];
green += g * mult->data[tid.y*8 + y];
blue += b * mult->data[tid.y*8 + y];
}
outputImage->data[gid.x][gid.y][0] = red;
outputImage->data[gid.x][gid.y][1] = green;
outputImage->data[gid.x][gid.y][2] = blue;
}
ViewController.swift
...
let commandBuffer = commandQueue.commandBuffer()
let computeEncoder1 = commandBuffer.computeCommandEncoder()
computeEncoder1.setComputePipelineState(computeDCT1)
computeEncoder1.setBuffer(input, offset: 0, atIndex: 0)
computeEncoder1.setBuffer(tmpBuffer3D1, offset: 0, atIndex: 1)
computeEncoder1.setBuffer(dctMatrix1, offset: 0, atIndex: 2)
computeEncoder1.dispatchThreadgroups(blocks, threadsPerThreadgroup: dctSize)
computeEncoder1.endEncoding()
let computeEncoder2 = commandBuffer.computeCommandEncoder()
computeEncoder2.setComputePipelineState(computeDCT2)
computeEncoder2.setBuffer(tmpBuffer3D1, offset: 0, atIndex: 0)
computeEncoder2.setBuffer(output, offset: 0, atIndex: 1)
computeEncoder2.setBuffer(dctMatrix2, offset: 0, atIndex: 2)
computeEncoder2.dispatchThreadgroups(blocks, threadsPerThreadgroup: dctSize)
computeEncoder2.endEncoding()
commandBuffer.commit()
commandBuffer.waitUntilCompleted()
I found the error. My kernel function tried to read outside of its allocated memory. Metal then stops executing all following commands in the command buffer, so the output was always zero because the computation was never performed. A drop in the application's GPU usage can also be used to detect this kind of error.
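As a side note, a quick way to see whether a command buffer was aborted like this is to inspect its status and error after waiting; a minimal sketch (modern Swift spelling):
commandBuffer.commit()
commandBuffer.waitUntilCompleted()
// A command buffer that was aborted by the GPU ends up in the .error state.
if commandBuffer.status == .error {
    print("GPU execution failed: \(String(describing: commandBuffer.error))")
}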
