Processing camera feed data on GPU (metal) and CPU (OpenCV) on iPhone

I'm doing real-time video processing on iOS at 120 fps. I want to first preprocess each frame on the GPU (downsampling, color conversion, and other steps that are not fast enough on the CPU) and then postprocess the frame on the CPU using OpenCV.
What's the fastest way to share camera feed between GPU and CPU using Metal?
In other words, the pipeline would look like:
CMSampleBufferRef -> MTLTexture or MTLBuffer -> OpenCV Mat
I'm converting CMSampleBufferRef -> MTLTexture the following way:
CVPixelBufferRef pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer);
// textureRGBA
size_t width = CVPixelBufferGetWidth(pixelBuffer);
size_t height = CVPixelBufferGetHeight(pixelBuffer);
MTLPixelFormat pixelFormat = MTLPixelFormatBGRA8Unorm;
CVMetalTextureRef texture = NULL;
CVReturn status = CVMetalTextureCacheCreateTextureFromImage(NULL, _textureCache, pixelBuffer, NULL, pixelFormat, width, height, 0, &texture);
if (status == kCVReturnSuccess) {
    textureBGRA = CVMetalTextureGetTexture(texture);
}
After my Metal shader has finished, I convert the MTLTexture to an OpenCV Mat:
cv::Mat image;
CGSize imageSize = CGSizeMake(drawable.texture.width, drawable.texture.height);
int imageByteCount = int(imageSize.width * imageSize.height * 4);
int mbytesPerRow = 4 * int(imageSize.width);
MTLRegion region = MTLRegionMake2D(0, 0, int(imageSize.width), int(imageSize.height));
CGSize resSize = CGSizeMake(drawable.texture.width, drawable.texture.height);
[drawable.texture getBytes:image.data bytesPerRow:mbytesPerRow fromRegion:region mipmapLevel:0];
Some observations:
1) Unfortunately, MTLTexture.getBytes seems expensive (copying data from GPU to CPU?) and takes around 5 ms on my iPhone 5S, which is too much when processing at ~100 fps.
2) I noticed some people use MTLBuffer instead of MTLTexture with the following method:
metalDevice.newBufferWithLength(byteCount, options: .StorageModeShared)
(see: Memory write performance - GPU CPU Shared Memory)
However, the CMSampleBufferRef and its accompanying CVPixelBufferRef are managed by CoreVideo, I guess.

The fastest way to do this is to use an MTLTexture backed by an MTLBuffer: a special kind of MTLTexture that shares its memory with an MTLBuffer. However, your CPU processing (OpenCV) will run a frame or two behind. This is unavoidable, because you need to encode and submit the commands to the GPU and the GPU needs to render them. Calling waitUntilCompleted to make sure the GPU has finished just ties up the CPU and is wasteful.
So the process would be: first create the MTLBuffer, then use the MTLBuffer method "newTextureWithDescriptor:offset:bytesPerRow:" to create the special MTLTexture. Create this texture beforehand (as an instance variable). Then set up a standard rendering pipeline (faster than using compute shaders) that takes the MTLTexture created from the CMSampleBufferRef and renders it into your special MTLTexture; in that pass you can downscale and do any colour conversion in one go. Then submit the command buffer to the GPU. In a subsequent pass, once that command buffer has completed, you can just call [theMTLbuffer contents] to get a pointer to the bytes that back your special MTLTexture, for use in OpenCV.
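One detail the steps above leave implicit is the row stride. A buffer-backed texture is linear, so the bytesPerRow you pass must respect the device's linear texture alignment, and the CPU side (for example the step of a cv::Mat wrapping the same memory) has to use the same padded stride. Below is a minimal C sketch of that arithmetic. The helper names are hypothetical, and the alignment value is an assumption to be supplied by the caller; on a real device it would come from minimumLinearTextureAlignmentForPixelFormat:.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round n up to the next multiple of alignment (a power of two). */
static size_t align_up(size_t n, size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (n + alignment - 1) & ~(alignment - 1);
}

/* Padded row stride, in bytes, for a linear 4-bytes-per-pixel (BGRA8) texture.
   `alignment` is an assumed input; query the device for the real value. */
static size_t padded_bytes_per_row(size_t width, size_t alignment) {
    return align_up(width * 4, alignment);
}

/* Address of pixel (x, y) in the shared buffer, honouring the padded stride.
   The CPU must step rows by `stride`, not by width * 4. */
static uint8_t *pixel_at(uint8_t *base, size_t stride, size_t x, size_t y) {
    return base + y * stride + x * 4;
}
```

With an assumed alignment of 64, a 300-pixel-wide frame gets a 1216-byte stride rather than 1200, so the buffer length should be stride * height and OpenCV should be told about the stride explicitly.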
Any technique that forces the CPU or GPU to stall will never be efficient, because half the time is spent waiting: the CPU waits for the GPU to finish, and the GPU then waits for the next commands to be encoded. While the GPU is working, you want the CPU to be encoding the next frame and doing the OpenCV work rather than waiting for the GPU to finish.
Also, when people refer to real-time processing they usually mean processing with real-time (visual) feedback. All modern iOS devices from the 4s onwards have a 60Hz screen refresh rate, so presenting feedback faster than that is pointless. If you need 2 frames (at 120Hz) to make 1 (at 60Hz), you will need a custom timer or a modified CADisplayLink.


Texture atlas to texture array via PIXEL_UNPACK_BUFFER

I have two questions:
First, is there any more direct, sane way to go from a texture atlas image to a texture array in WebGL than what I'm doing below? I've not tried this, but doing it entirely in WebGL seems possible, though four-times the work and I still have to make two round trips to the GPU to do it.
And am I right that because buffer data for texImage3D() must come from PIXEL_UNPACK_BUFFER, this data must come directly from the CPU side? I.e. There is no way to copy from one block of GPU memory to a PIXEL_UNPACK_BUFFER without copying it to the CPU first. I'm pretty sure the answer to this is a hard "no".
In case my questions themselves are stupid (and they may be), my ultimate goal here is simply to convert a texture atlas PNG to a texture array. From what I've tried, the fastest way to do this by far is via PIXEL_UNPACK_BUFFER, rather than extracting each sub-image and sending them in one at a time, which for large atlases is extremely slow.
This is basically how I'm currently getting my pixel data.
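const imageToBinary = async (image: HTMLImageElement) => {
  const canvas = document.createElement('canvas');
  canvas.width = image.width;
  canvas.height = image.height;
  const context = canvas.getContext('2d');
  if (!context) throw new Error('2D canvas context unavailable');
  // Draw the decoded image, then read the pixels back as RGBA bytes.
  context.drawImage(image, 0, 0);
  const imageData = context.getImageData(0, 0, image.width, image.height);
  return imageData.data; // assumed return value; the snippet is cut off at this point
};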
So, I'm creating an HTMLImageElement object, which contains the uncompressed pixel data I want, but has no methods to get at it directly. Then I'm creating a 2D context version containing the same pixel data a second time. Then I'm repopulating the GPU with the same pixel data a third time. Seems bonkers to me, but I don't see a way around it.

Most efficient way to create lower resolution texture from a texture in Metal?

I have a large MTLTexture (up to 16k) that I need to create a lower resolution texture from - say half or quarter scale.
I can draw the high res texture into a lower resolution texture with:
let descriptorSmallerCanvas = MTLRenderPassDescriptor()
descriptorSmallerCanvas.colorAttachments[0].texture = canvasTextureSmaller
descriptorSmallerCanvas.colorAttachments[0].storeAction = .store
descriptorSmallerCanvas.colorAttachments[0].loadAction = .clear
let renderSmallCanvas = commandBuffer.makeRenderCommandEncoder(descriptor: descriptorSmallerCanvas)
renderSmallCanvas?.pushDebugGroup("Render Small Texture")
renderSmallCanvas?.setFragmentTexture(canvasTexture, index: 0)
renderSmallCanvas?.setVertexBuffer(uniformOrthoBuffer, offset: 0, index: 0)
renderSmallCanvas?.drawPrimitives(type: .triangleStrip, vertexStart: 0, vertexCount: 4, instanceCount: 1)
That works, but is there a more efficient way to do this? Would MTLBlitCommandEncoder's generateMipmaps(for: canvasTexture) be more efficient?
The answer depends on the algorithm you want to downscale with.
Without knowing what renderCanvasPipelineState is, I would guess that it renders a fullscreen quad. Rendering a larger texture into a smaller one uses the minFilter of the sampler state in your shader (I guess it's a constexpr sampler, since you don't bind one here), so at best it uses MTLSamplerMinMagFilterLinear, which isn't "high-quality" downscaling.
I think your best bet without writing your own resize is to use MPSImageLanczosScale or MPSImageBilinearScale, depending on the quality tradeoffs you want to make.
P.S. You don't need to .clear your attachment if you are filling the whole screen without blending; set the load action to .dontCare and every pixel will be overwritten anyway.
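To make the quality trade-off concrete: a linear sampler blends at most the four texels nearest the sample point. At exactly half scale, with each sample landing in the middle of a 2x2 block, that is roughly a box average of the block. At quarter scale it still reads only four texels and ignores the rest, which is where the aliasing comes from. The sketch below is a CPU illustration in C of the half-scale case on a single-channel 8-bit image. The function name is hypothetical, and this is not how the MPS scalers are implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Halve a tightly packed single-channel 8-bit image by averaging each 2x2 block.
   src is src_w x src_h (both even); dst must hold (src_w / 2) * (src_h / 2) bytes. */
static void box_downsample_2x(const uint8_t *src, size_t src_w, size_t src_h,
                              uint8_t *dst) {
    assert(src_w % 2 == 0 && src_h % 2 == 0);
    size_t dst_w = src_w / 2;
    size_t dst_h = src_h / 2;
    for (size_t y = 0; y < dst_h; y++) {
        for (size_t x = 0; x < dst_w; x++) {
            /* Top-left texel of the 2x2 source block for this output pixel. */
            const uint8_t *p = src + (2 * y) * src_w + 2 * x;
            unsigned sum = p[0] + p[1] + p[src_w] + p[src_w + 1];
            dst[y * dst_w + x] = (uint8_t)((sum + 2) / 4); /* rounded average */
        }
    }
}
```

Applying it twice gives quarter scale with every source texel contributing, which is also what a mipmap chain does one level at a time.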

iOS Metal – reading old values while writing into texture

I have a kernel function (compute shader) that reads nearby pixels of a pixel from a texture and based on the old nearby-pixel values updates the value of the current pixel (it's not a simple convolution).
I've tried creating a copy of the texture using BlitCommandEncoder and feeding the kernel function with 2 textures - one read-only and another write-only. Unfortunately, this approach is GPU-wise time consuming.
What is the most efficient (GPU- and memory-wise) way of reading old values from a texture while updating its content?
(Bit late but oh well)
There is no way you could make it work with only one texture, because the GPU is a highly parallel processor: Your kernel that you wrote for a single pixel gets called in parallel on all pixels, you can't tell which one goes first.
So you definitely need 2 textures. The way you probably should do it is by using 2 textures where one is the "old" one and the other the "new" one. Between passes, you switch the role of the textures, now old is new and new is old. Here is some pseudoswift:
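var currentText: MTLTexture   // "old" values, read by the kernel
var nextText: MTLTexture      // "new" values, written by the kernel
let semaphore = DispatchSemaphore(value: 1)
func update() {
    semaphore.wait() // Wait until the previous update has finished
    guard let commands = commandQueue.makeCommandBuffer(),
          let encoder = commands.makeComputeCommandEncoder() else {
        semaphore.signal()
        return
    }
    encoder.setTexture(currentText, index: 0)
    encoder.setTexture(nextText, index: 1)
    // ... set the compute pipeline state and dispatch the threadgroups here ...
    encoder.endEncoding()
    // When the GPU is done, swap the textures and signal that the update has finished
    commands.addCompletedHandler { _ in
        swap(&currentText, &nextText)
        semaphore.signal()
    }
    commands.commit()
}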
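The same ping-pong idea can be shown on the CPU, where the read/write separation is easy to see. This is a minimal C sketch, not Metal code: the struct, the function name and the neighbour-averaging rule are all hypothetical stand-ins for the real kernel. The point is that each step reads only from one buffer, writes only to the other, and then swaps the two pointers instead of copying any data.

```c
#include <assert.h>
#include <stddef.h>

/* Two equally sized grids: a step reads only `cur` and writes only `next`. */
typedef struct {
    float *cur;
    float *next;
    size_t width, height;
} PingPong;

/* Example update rule (an assumption for illustration): each cell becomes the
   mean of its four neighbours, with coordinates clamped at the edges. */
static void pingpong_step(PingPong *pp) {
    size_t w = pp->width, h = pp->height;
    assert(w > 0 && h > 0);
    for (size_t y = 0; y < h; y++) {
        for (size_t x = 0; x < w; x++) {
            size_t xl = x > 0 ? x - 1 : x;
            size_t xr = x + 1 < w ? x + 1 : x;
            size_t yu = y > 0 ? y - 1 : y;
            size_t yd = y + 1 < h ? y + 1 : y;
            pp->next[y * w + x] = 0.25f * (pp->cur[y * w + xl] + pp->cur[y * w + xr] +
                                           pp->cur[yu * w + x] + pp->cur[yd * w + x]);
        }
    }
    /* Swap roles: "new" becomes "old" for the next step. No pixel data is copied. */
    float *tmp = pp->cur;
    pp->cur = pp->next;
    pp->next = tmp;
}
```

Compared with blitting a copy before every pass, the only per-step overhead is the pointer swap, at the cost of keeping two full-size buffers alive.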
I have written plenty of iOS Metal code that samples (or reads) from the same texture it is rendering into. I am using the render pipeline, setting my texture as the render target attachment, and also loading it as a source texture. It works just fine.
To be clear, a more efficient approach is to use the [[color(n)]] attribute on a fragment shader input, but that is only suitable if all you need is the value of the current fragment, not any nearby positions. If you need to read from other positions in the render target, I would just bind the render target as a source texture for the fragment shader.

Render speed for individual pixels in loop

I'm working on drawing individual pixels to a UIView to create fractal images. My problem is my rendering speed. I am currently running this loop 260,000 times, but would like to render even more pixels. As it is, it takes about 5 seconds to run on my iPad Mini.
I was using a UIBezierPath before, but that was even a bit slower (about 7 seconds). I've been looking in NSBitMap stuff, but I'm not exactly sure if that would speed it up or how to implement it in the first place.
I was also thinking about trying to store the pixels from my loop into an array, and then draw them all together after my loop. Again though, I am not quite sure what the best process would be to store and then retrieve pixels into and from an array.
Any help on speeding up this process would be great.
for (int i = 0; i < 260000; i++) {
    float RN = drand48();
    // Pick the first bucket whose threshold exceeds RN
    for (int i = 1; i < numBuckets; i++) {
        if (RN < bucket[i]) {
            col = i;
            CGContextSetFillColor(context, CGColorGetComponents([UIColor colorWithRed:(colorSelector[i][0]) green:(colorSelector[i][1]) blue:(colorSelector[i][2]) alpha:(1)].CGColor));
            break;
        }
    }
    // Apply the affine transform for the chosen bucket
    xT = myTextFieldArray[1][1][col]*x1 + myTextFieldArray[1][2][col]*y1 + myTextFieldArray[1][5][col];
    yT = myTextFieldArray[1][3][col]*x1 + myTextFieldArray[1][4][col]*y1 + myTextFieldArray[1][6][col];
    x1 = xT;
    y1 = yT;
    if (i > 10000) {
        // After the warm-up iterations, draw the point
        CGContextFillRect(context, CGRectMake(xOrigin+(xT-xMin)*sizeScalar,yOrigin-(yT-yMin)*sizeScalar,.5,.5));
    } else if (i < 10000) {
        // During warm-up, track the bounding box of the points
        if (x1 < xMin) {
            xMin = x1;
        } else if (x1 > xMax) {
            xMax = x1;
        }
        if (y1 < yMin) {
            yMin = y1;
        } else if (y1 > yMax) {
            yMax = y1;
        }
    } else if (i == 10000) {
        // Fit the larger dimension of the bounding box into 960 points
        if (xMax - xMin > yMax - yMin) {
            sizeScalar = 960/(xMax - xMin);
        } else {
            sizeScalar = 960/(yMax - yMin);
        }
    }
}
I created a multidimensional array to store UIColors into, so I could use a bitmap to draw my image. It is significantly faster, but my colors are not working appropriately now.
Here is where I am storing my UIColors into the array:
int xPixel = xOrigin+(xT-xMin)*sizeScalar;
int yPixel = yOrigin-(yT-yMin)*sizeScalar;
pixelArray[1000-yPixel][xPixel] = customColors[col];
Here is my drawing stuff:
CGDataProviderRef provider = CGDataProviderCreateWithData(nil, pixelArray, 1000000, nil);
CGColorSpaceRef colorSpace = CGColorSpaceCreateDeviceRGB();
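CGImageRef image = CGImageCreate(1000, // width
                                 1000, // height
                                 8,    // bits per component
                                 32,   // bits per pixel
                                 4000, // bytes per row
                                 colorSpace,
                                 kCGBitmapByteOrder32Big | kCGImageAlphaNoneSkipLast,
                                 provider,
                                 nil, //No decode
                                 NO, //No interpolation
                                 kCGRenderingIntentDefault); // Default rendering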
CGContextDrawImage(context, self.bounds, image);
Not only are the colors not what they are supposed to be, but every time I render my image, the colors are completely different from the previous time. I have been testing different things with the colors, but I still have no idea why the colors are wrong, and I'm even more confused about why they keep changing.
Per-pixel drawing — with complicated calculations for each pixel, like fractal rendering — is one of the hardest things you can ask a computer to do. Each of the other answers here touches on one aspect of its difficulty, but that's not quite all. (Luckily, this kind of rendering is also something that modern hardware is optimized for, if you know what to ask it for. I'll get to that.)
Both @jcaron and @JustinMeiners note that vector drawing operations (even rect fill) in CoreGraphics take a penalty for CPU-based rasterization. Manipulating a buffer of bitmap data would be faster, but not a lot faster.
Getting that buffer onto the screen also takes time, especially if you have to go through a process of creating bitmap image buffers and then drawing them in a CG context: that means a lot of sequential drawing work on the CPU and a lot of memory-bandwidth work to copy that buffer around. So @JustinMeiners is right that direct access to GPU texture memory would be a big help.
However, if you're still filling your buffer in CPU code, you're still hampered by two costs (at best, worse if you do it naively):
sequential work to render each pixel
memory transfer cost from texture memory to frame buffer when rendering
@JustinMeiners' answer is good for his use case: image sequences are pre-rendered, so he knows exactly what each pixel is going to be and he just has to schlep it into texture memory. But your use case requires a lot of per-pixel calculations.
Luckily, per-pixel calculations are what GPUs are designed for! Welcome to the world of pixel shaders. For each pixel on the screen, you can run an independent calculation to determine the relationship of that point to your fractal set and thus what color to draw it in. The GPU can run that calculation in parallel for many pixels at once, and its output goes straight to the screen, so there's no memory overhead to dump a bitmap into the framebuffer.
One easy way to work with pixel shaders on iOS is SpriteKit — it can handle most of the necessary OpenGL/Metal setup for you, so all you have to write is the per-pixel algorithm in GLSL (actually, a subset of GLSL that gets automatically translated to Metal shader language on Metal-supported devices). Here's a good tutorial on that, and here's another on OpenGL ES pixel shaders for iOS in general.
If you really want to change many different pixels individually, your best option is probably to allocate a chunk of memory (of size width * height * bytes per pixel), make the changes directly in memory, and then convert the whole thing into a bitmap at once with CGBitmapContextCreateWithData
There may be even faster methods than this (see Justin's answer).
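A rough C sketch of the "chunk of memory" approach is shown below. All names here are hypothetical. The buffer holds colour bytes (four per pixel, RGBA) rather than colour objects, which is the layout that CGBitmapContextCreateWithData and CGImageCreate expect for a 32-bit RGBA image; creating the image from the buffer is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint8_t *bytes;        /* width * height * 4 bytes, RGBA order, tightly packed */
    size_t width, height;
} PixelBuffer;

/* Allocate a zeroed buffer; zero bytes mean black with alpha 0. */
static PixelBuffer pixel_buffer_create(size_t width, size_t height) {
    PixelBuffer pb = { calloc(width * height, 4), width, height };
    assert(pb.bytes != NULL);
    return pb;
}

/* Write one opaque pixel directly into memory; out-of-range points are skipped. */
static void pixel_buffer_set(PixelBuffer *pb, long x, long y,
                             uint8_t r, uint8_t g, uint8_t b) {
    if (x < 0 || y < 0 || (size_t)x >= pb->width || (size_t)y >= pb->height) {
        return;
    }
    uint8_t *p = pb->bytes + ((size_t)y * pb->width + (size_t)x) * 4;
    p[0] = r;
    p[1] = g;
    p[2] = b;
    p[3] = 255;
}

static void pixel_buffer_destroy(PixelBuffer *pb) {
    free(pb->bytes);
    pb->bytes = NULL;
}
```

In the fractal loop, each CGContextFillRect call would become one pixel_buffer_set call, and the image would be built from the buffer once, after the loop.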
If you want to maximize render speed I would recommend bitmap rendering. Vector rasterization is much slower and CGContext drawing isn't really intended for high performance realtime rendering.
I faced a similar technical challenge and found CVOpenGLESTextureCacheRef to be the fastest. The texture cache allows you to upload a bitmap directly into graphics memory for fast rendering. Rendering uses OpenGL, but because it's just a 2D fullscreen image, you really don't need to learn much about OpenGL to use it.
You can see an example I wrote of using the texture cache here:
My original question related to this is here:
How to directly update pixels - with CGImage and direct CGDataProvider
My project renders bitmaps from files so it is a little bit different but you could look at ISSequenceView.m for an example of how to use the texture cache and setup OpenGL for this kind of rendering.
Your rendering procedure could look something like:
1. Draw to buffer (raw bytes)
2. Lock texture cache
3. Copy buffer to texture cache
4. Unlock texture cache.
5. Draw fullscreen quad with texture

Retaining CMSampleBufferRef from camera feed

I'm writing an AR app that uses the camera feed to take pictures positioned at certain places in the world. I've now run into a problem that I'm not sure how to handle.
I'm using CVOpenGLESTextureCacheRef to create textures from CMSampleBufferRef. The camera feed is shown and works perfectly. The problem occurs when I capture 12 photos and create textures from them. The way it works is that once I detect a match with the target, I create a texture like this:
CVImageBufferRef pixelBuffer = CMSampleBufferGetImageBuffer(sampleBufferCopy);
size_t frameWidth = CVPixelBufferGetWidth(pixelBuffer);
size_t frameHeight = CVPixelBufferGetHeight(pixelBuffer);
CVOpenGLESTextureRef texture = NULL;
CVReturn err = CVOpenGLESTextureCacheCreateTextureFromImage(kCFAllocatorDefault,
if (!texture || err) {
NSLog(@"CVOpenGLESTextureCacheCreateTextureFromImage failed (error: %d)", err);
CVOpenGLESTextureCacheFlush(cache, 0);
The texture is then mapped to photo location in the world and is being rendered. I am not releasing texture here because I need it in the future. The texture used as the camera feed is obviously being released.
The issue appears when the 12th photo is taken: the captureOutput:didOutputSampleBuffer:fromConnection: callback is no longer called. I understand this happens because the pool is full, as pointed out in the documentation:
If your application is causing samples to be dropped by retaining the provided CMSampleBufferRef objects for too long, but it needs access to the sample data for a long period of time, consider copying the data into a new buffer and then releasing the sample buffer (if it was previously retained) so that the memory it references can be reused.
However, I am not sure what to do. I tried using CMSampleBufferCreateCopy to create a copy of the buffer, but it did not work because, as the documentation says, it creates a shallow copy.
How do I handle this in a most efficient way?
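The documentation's suggestion amounts to copying the pixel bytes out of the pool-owned buffer so that the sample buffer can be released. The C sketch below shows only that copy, for a single-plane image whose rows may be padded. The function name is hypothetical; on iOS the source pointer and stride would come from CVPixelBufferGetBaseAddress and CVPixelBufferGetBytesPerRow, read between CVPixelBufferLockBaseAddress and CVPixelBufferUnlockBaseAddress. Creating a texture from the copied bytes is not shown.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Deep-copy `height` rows of `row_bytes` useful bytes each from a source whose
   rows start `src_stride` bytes apart into a new, tightly packed allocation.
   Returns NULL if the allocation fails; the caller owns and frees the result. */
static uint8_t *copy_rows_packed(const uint8_t *src, size_t src_stride,
                                 size_t row_bytes, size_t height) {
    assert(row_bytes <= src_stride);
    uint8_t *dst = malloc(row_bytes * height);
    if (dst == NULL) {
        return NULL;
    }
    for (size_t y = 0; y < height; y++) {
        /* Skip the padding at the end of each source row. */
        memcpy(dst + y * row_bytes, src + y * src_stride, row_bytes);
    }
    return dst;
}
```

Once the bytes are copied, nothing in the capture pool is referenced any more, so the original sample buffer can be released straight away.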
