I am building an iOS app that renders frames from the camera to Metal textures in real time. I want to use CoreML to perform style transfer on subregions of the Metal texture (imagine the camera output as a 2x2 grid, where each of the 4 squares is used as input to a style transfer network, and the output pasted back into the displayed texture). I am trying to figure out the best way to use CoreML inside a Metal pipeline to fill non-overlapping subregions of the texture with the output of the mlmodel (hopefully without decomposing the mlmodel into an MPSNNGraph). Is it possible to feed a MTLTexture or MTLBuffer to a CoreML model directly? I'd like to avoid format conversions as much as possible (for speed).
My mlmodel takes CVPixelBuffers as its inputs and outputs. Can it be made to take MTLTextures instead?
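For concreteness, this is the kind of zero-copy sharing I'm hoping is possible, sketched with CVMetalTextureCache (untested on my end; `device` is assumed to be the pipeline's MTLDevice):

```swift
import CoreVideo
import Metal

// Sketch only: create a Metal-compatible CVPixelBuffer and wrap it as a MTLTexture
// that shares the same memory, so CoreML and Metal could both touch it without copies.
func makeSharedBuffer(device: MTLDevice, width: Int, height: Int) -> (CVPixelBuffer, MTLTexture)? {
    let attrs: [String: Any] = [
        kCVPixelBufferMetalCompatibilityKey as String: true,
        kCVPixelBufferIOSurfacePropertiesKey as String: [:] as [String: Any]
    ]
    var pixelBuffer: CVPixelBuffer?
    guard CVPixelBufferCreate(kCFAllocatorDefault, width, height,
                              kCVPixelFormatType_32BGRA,
                              attrs as CFDictionary, &pixelBuffer) == kCVReturnSuccess,
          let buffer = pixelBuffer else { return nil }

    var textureCache: CVMetalTextureCache?
    CVMetalTextureCacheCreate(kCFAllocatorDefault, nil, device, nil, &textureCache)

    var cvTexture: CVMetalTexture?
    CVMetalTextureCacheCreateTextureFromImage(kCFAllocatorDefault, textureCache!,
                                              buffer, nil, .bgra8Unorm,
                                              width, height, 0, &cvTexture)
    guard let texture = cvTexture.flatMap(CVMetalTextureGetTexture) else { return nil }
    return (buffer, texture)
}
```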
The first thing I tried was cutting the given sample buffer into subregions (by copying the pixel data, ugh), inferring on each subregion, and then pasting the results together into a new sample buffer, which was then turned into a MTLTexture and displayed. This approach did not take advantage of Metal at all, since the textures were not created until after inference. It also had a lot of circuitous conversion/copy/paste operations that slow everything down.
The second thing I tried was sending the camera data to the MTLTexture directly, inferring on subregions of the sample buffer, and pasting the results into the currently displayed texture with MTLTexture.replace(region:...withBytes:) for each subregion. However, MTLTexture.replace() uses the CPU and is not fast enough for live video.
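For reference, the per-tile paste currently looks roughly like this (a sketch with placeholder names), and it's the CPU-side copy that can't keep up:

```swift
import Metal

// Sketch of the per-tile paste I'm doing now (names are placeholders). This runs
// on the CPU, which is the part that is too slow for live video.
func pasteTile(into texture: MTLTexture, tileBytes: UnsafeRawPointer,
               x: Int, y: Int, width: Int, height: Int) {
    let region = MTLRegionMake2D(x, y, width, height)
    texture.replace(region: region,
                    mipmapLevel: 0,
                    withBytes: tileBytes,
                    bytesPerRow: width * 4)   // assuming 4 bytes per pixel (BGRA)
}
```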
The idea I am about to try is: convert my mlmodel to an MPSNNGraph, get frames as textures, use the MPSNNGraph for inference on subregions, and display the output. I figured I'd check here before going through all of the effort of converting the mlmodel, though. Sorry if this is too broad; I mainly work in TensorFlow and am a bit out of my depth here.
Related
I'm writing a video effect in iOS using Metal that requires pixel data of the current frame as well as many previous frames of the original input video. Is there a common pattern or best practice for how to achieve this type of shader that makes the most efficient use of memory and the GPU?
I am trying to implement the MoG background subtraction algorithm based on the OpenCV CUDA implementation.
What I need is to maintain a set of Gaussian parameters independently for each pixel location across multiple frames. Currently I just allocate a single big MTLBuffer to do the job, and on every frame I have to invoke the commandEncoder.setBuffer API. Is there a better way? I read about imageblocks but I am not sure if they are relevant.
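For context, the current setup amounts to this sketch (GaussianParams and the buffer index are placeholders of mine, not taken from the OpenCV code):

```swift
import Metal

// Placeholder per-pixel state; the real layout must match the compute kernel's struct.
struct GaussianParams {
    var mean: SIMD3<Float>
    var variance: Float
    var weight: Float
}

// Allocate the per-pixel state once, in private storage so it never leaves the GPU.
func makeStateBuffer(device: MTLDevice, width: Int, height: Int,
                     componentsPerPixel: Int = 5) -> MTLBuffer? {
    let length = width * height * componentsPerPixel * MemoryLayout<GaussianParams>.stride
    return device.makeBuffer(length: length, options: .storageModePrivate)
}

// Per frame: re-bind the same buffer; only the encoder call repeats, not the allocation.
func bindState(_ encoder: MTLComputeCommandEncoder, buffer: MTLBuffer) {
    encoder.setBuffer(buffer, offset: 0, index: 1)
}
```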
Also, I would be really happy if you could point out anything that shouldn't be directly translated from CUDA to Metal.
Allocate an 8 bit texture and store intermediate values into that texture in your compute shader. Then, after the texture is rendered, you can rebind it as an input texture to whatever other passes need to read from it in the rest of the rendering. You can find a very detailed example of this sort of thing in this GitHub example project of a parallel prefix sum on top of Metal. The example also shows how to write XCTest regression tests for your Metal shaders: Github MetalPrefixSum
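A quick sketch of that allocation, assuming one compute pass writes the values and later passes read them back (the usage flags are the part that matters):

```swift
import Metal

// Single-channel 8 bit texture that a compute kernel writes and later passes read.
func makeIntermediateTexture(device: MTLDevice, width: Int, height: Int) -> MTLTexture? {
    let desc = MTLTextureDescriptor.texture2DDescriptor(pixelFormat: .r8Unorm,
                                                        width: width,
                                                        height: height,
                                                        mipmapped: false)
    desc.usage = [.shaderWrite, .shaderRead]   // written by one pass, read by the next
    desc.storageMode = .private                // stays on the GPU
    return device.makeTexture(descriptor: desc)
}

// First pass: bind it as the kernel's output texture. Later passes: bind the same
// object again as an input, e.g. encoder.setTexture(intermediate, index: 1)
```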
I'm interested in processing data from the TrueDepth camera. I need to obtain the data of a person's face, build a 3D model of the face, and save this model to an .obj file.
Since the 3D model needs to include the person's eyes and teeth, ARKit / SceneKit is not suitable, because ARKit / SceneKit do not fill these areas with data.
However, with the help of the SceneKit.ModelIO library, I managed to export ARSCNView.scene (of type SCNScene) in .obj format.
I tried to take this project as a basis:
https://developer.apple.com/documentation/avfoundation/cameras_and_media_capture/streaming_depth_data_from_the_truedepth_camera
In this project, the TrueDepth camera data is processed using Metal, but if I'm not mistaken, an MTKView rendered using Metal is not a 3D model and cannot be exported as .obj.
Please tell me: is there a way to export an MTKView to a SCNScene, or directly to .obj?
If there is no such method, how can I make a 3D model from AVDepthData?
Thanks.
It's possible to make a 3D model from AVDepthData, but that probably isn't what you want. One depth buffer is just that — a 2D array of pixel distance-from-camera values. So the only "model" you're getting from that isn't very 3D; it's just a height map. That means you can't look at it from the side and see contours that you couldn't have seen from the front. (The "Using Depth Data" sample code attached to the WWDC 2017 talk on depth photography shows an example of this.)
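To make the height-map point concrete, here's a sketch of what a single depth frame gives you once converted to 32-bit floats (`depthData` is assumed to come from your capture callback):

```swift
import AVFoundation
import CoreVideo

// Sketch: one AVDepthData frame is just a 2D grid of Float32 distances.
func depthValues(from depthData: AVDepthData) -> [[Float]] {
    let converted = depthData.converting(toDepthDataType: kCVPixelFormatType_DepthFloat32)
    let map = converted.depthDataMap
    CVPixelBufferLockBaseAddress(map, .readOnly)
    defer { CVPixelBufferUnlockBaseAddress(map, .readOnly) }

    let width = CVPixelBufferGetWidth(map)
    let height = CVPixelBufferGetHeight(map)
    let rowBytes = CVPixelBufferGetBytesPerRow(map)
    let base = CVPixelBufferGetBaseAddress(map)!

    return (0..<height).map { y in
        let row = base.advanced(by: y * rowBytes).assumingMemoryBound(to: Float32.self)
        return (0..<width).map { x in row[x] }
    }
}
```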
If you want more of a truly-3D "model", akin to what ARKit offers, you need to be doing the work that ARKit does — using multiple color and depth frames over time, along with a machine learning system trained to understand human faces (and hardware optimized for running that system quickly). You might not find doing that yourself to be a viable option...
It is possible to get an exportable model out of ARKit using Model I/O. The outline of the code you'd need goes something like this (a rough code sketch follows the list):
1. Get ARFaceGeometry from a face tracking session.
2. Create MDLMeshBuffers from the face geometry's vertices, textureCoordinates, and triangleIndices arrays. (Apple notes the texture coordinate and triangle index arrays never change, so you only need to create those once — vertices you have to update every time you get a new frame.)
3. Create a MDLSubmesh from the index buffer, and a MDLMesh from the submesh plus the vertex and texture coordinate buffers. (Optionally, use MDLMesh functions to generate a vertex normals buffer after creating the mesh.)
4. Create an empty MDLAsset and add the mesh to it.
5. Export the MDLAsset to a URL (providing a URL with the .obj file extension so that it infers the format you want to export).
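Here is that outline as a rough Swift sketch. It assumes you already have an ARFaceGeometry from the session delegate; the vertex descriptor details (positions in one buffer, texture coordinates in another, float3/float2 formats) are my reading of the ARFaceGeometry layout, so check the strides against your SDK:

```swift
import ARKit
import ModelIO

// Rough sketch of steps 2-5 for a single frame. `face` is the ARFaceGeometry
// from the latest ARFaceAnchor.
func exportFace(_ face: ARFaceGeometry, to url: URL) throws {
    let allocator = MDLMeshBufferDataAllocator()

    // Step 2: pack vertices, texture coordinates, and triangle indices into MDLMeshBuffers.
    let vertexData = Data(bytes: face.vertices,
                          count: face.vertices.count * MemoryLayout<SIMD3<Float>>.stride)
    let texcoordData = Data(bytes: face.textureCoordinates,
                            count: face.textureCoordinates.count * MemoryLayout<SIMD2<Float>>.stride)
    let indexData = Data(bytes: face.triangleIndices,
                         count: face.triangleIndices.count * MemoryLayout<Int16>.stride)
    let vertexBuffer = allocator.newBuffer(with: vertexData, type: .vertex)
    let texcoordBuffer = allocator.newBuffer(with: texcoordData, type: .vertex)
    let indexBuffer = allocator.newBuffer(with: indexData, type: .index)

    // Describe the layout: positions in buffer 0, texture coordinates in buffer 1.
    let descriptor = MDLVertexDescriptor()
    descriptor.attributes[0] = MDLVertexAttribute(name: MDLVertexAttributePosition,
                                                  format: .float3, offset: 0, bufferIndex: 0)
    descriptor.attributes[1] = MDLVertexAttribute(name: MDLVertexAttributeTextureCoordinate,
                                                  format: .float2, offset: 0, bufferIndex: 1)
    descriptor.layouts[0] = MDLVertexBufferLayout(stride: MemoryLayout<SIMD3<Float>>.stride)
    descriptor.layouts[1] = MDLVertexBufferLayout(stride: MemoryLayout<SIMD2<Float>>.stride)

    // Step 3: submesh from the index buffer, mesh from the submesh plus vertex buffers.
    let submesh = MDLSubmesh(indexBuffer: indexBuffer,
                             indexCount: face.triangleIndices.count,
                             indexType: .uInt16,
                             geometryType: .triangles,
                             material: nil)
    let mesh = MDLMesh(vertexBuffers: [vertexBuffer, texcoordBuffer],
                       vertexCount: face.vertices.count,
                       descriptor: descriptor,
                       submeshes: [submesh])
    mesh.addNormals(withAttributeNamed: MDLVertexAttributeNormal, creaseThreshold: 0.0) // optional

    // Steps 4-5: empty asset, add the mesh, export (format inferred from the .obj extension).
    let asset = MDLAsset()
    asset.add(mesh)
    try asset.export(to: url)
}
```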
That sequence doesn't require SceneKit (or Metal, or any ability to display the mesh) at all, which might prove useful depending on your need. If you do want to involve SceneKit and Metal you can probably skip a few steps:
1. Create ARSCNFaceGeometry on your Metal device and pass it an ARFaceGeometry from a face tracking session.
2. Use MDLMesh(scnGeometry:) to get a Model I/O representation of that geometry, then follow steps 4-5 above to export it to an .obj file.
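Sketched, that shortcut looks something like this (`device` and `faceAnchor` are assumed to come from your existing Metal setup and ARSessionDelegate callbacks):

```swift
import ARKit
import ModelIO
import SceneKit.ModelIO

// Sketch of the SceneKit shortcut: ARSCNFaceGeometry -> MDLMesh -> MDLAsset -> .obj
func exportFace(anchor faceAnchor: ARFaceAnchor, device: MTLDevice, to url: URL) throws {
    guard let scnFaceGeometry = ARSCNFaceGeometry(device: device) else { return }
    scnFaceGeometry.update(from: faceAnchor.geometry)

    let mesh = MDLMesh(scnGeometry: scnFaceGeometry)   // Model I/O view of the SCNGeometry
    let asset = MDLAsset()
    asset.add(mesh)
    try asset.export(to: url)                          // .obj inferred from the URL's extension
}
```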
Any way you slice it, though... if it's a strong requirement to model eyes and teeth, none of the Apple-provided options will help you because none of them do that. So, some food for thought:
Consider whether that's a strong requirement?
Replicate all of Apple's work to do your own face-model inference from color + depth image sequences?
Cheat on eye modeling using spheres centered according to the leftEyeTransform/rightEyeTransform reported by ARKit? (A sketch of this follows the list.)
Cheat on teeth modeling using a pre-made model of teeth, composed with the ARKit-provided face geometry for display? (Articulate your inner-jaw model with a single open-shut joint and use ARKit's blendShapes[.jawOpen] to animate it alongside the face.)
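A sketch of the eye-sphere cheat, assuming you have an SCNNode tracking the ARFaceAnchor; the sphere radius is a made-up value:

```swift
import ARKit
import SceneKit

// Sketch: approximate eyes with spheres driven by ARKit's per-frame eye transforms.
// `faceNode` is assumed to be the SCNNode that tracks the ARFaceAnchor.
let leftEye = SCNNode(geometry: SCNSphere(radius: 0.012))   // radius is a guess
let rightEye = SCNNode(geometry: SCNSphere(radius: 0.012))

func attachEyes(to faceNode: SCNNode) {
    faceNode.addChildNode(leftEye)
    faceNode.addChildNode(rightEye)
}

// Call whenever the ARFaceAnchor updates; the eye transforms are relative to the anchor.
func updateEyes(from faceAnchor: ARFaceAnchor) {
    leftEye.simdTransform = faceAnchor.leftEyeTransform
    rightEye.simdTransform = faceAnchor.rightEyeTransform
}
```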
I have an interesting problem I need to research related to very low level video streaming.
Has anyone had any experience converting a raw stream of bytes (separated into per-pixel information, but not a standard video format) into a low resolution video stream? I believe that I can map the data to RGB values per pixel, as the color values that correspond to the values in the raw data will be determined by us. I'm not sure where to go from there, or what the RGB format needs to be per pixel.
I've looked at FFmpeg, but its documentation is massive and I don't know where to start.
Specific questions I have include: is it possible to create a CVPixelBuffer with that pixel data? If I were to do that, what sort of format would the per-pixel data need to be converted to?
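For example, is something along these lines the right direction? (A sketch, untested; I'm assuming 32-bit BGRA, which may be the wrong choice.)

```swift
import Foundation
import CoreVideo

// Sketch: wrap one frame of raw 32-bit BGRA bytes (4 bytes per pixel, row by row)
// in a CVPixelBuffer. The BGRA format choice is an assumption on my part.
func makePixelBuffer(from bgraBytes: [UInt8], width: Int, height: Int) -> CVPixelBuffer? {
    var pixelBuffer: CVPixelBuffer?
    let status = CVPixelBufferCreate(kCFAllocatorDefault, width, height,
                                     kCVPixelFormatType_32BGRA, nil, &pixelBuffer)
    guard status == kCVReturnSuccess, let buffer = pixelBuffer else { return nil }

    CVPixelBufferLockBaseAddress(buffer, [])
    defer { CVPixelBufferUnlockBaseAddress(buffer, []) }

    let dest = CVPixelBufferGetBaseAddress(buffer)!
    let destRowBytes = CVPixelBufferGetBytesPerRow(buffer)
    let srcRowBytes = width * 4
    // Copy row by row because the buffer's rows may be padded.
    bgraBytes.withUnsafeBytes { src in
        for row in 0..<height {
            memcpy(dest + row * destRowBytes,
                   src.baseAddress! + row * srcRowBytes,
                   srcRowBytes)
        }
    }
    return buffer
}
```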
Also, should I be looking deeper into OpenGL, and if so, where would be the best place to look for information on this topic?
What about CGBitmapContextCreate? For example, if I went with something like this, what would a typical pixel byte need to look like? Would this be fast enough to keep the frame rate above 20fps?
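This is roughly what I had in mind for the CGBitmapContext route (a sketch; I'm guessing at premultiplied RGBA as the pixel layout):

```swift
import CoreGraphics

// Sketch: build a CGImage from raw RGBA bytes (R, G, B, A order, one byte each).
func makeImage(from rgbaBytes: inout [UInt8], width: Int, height: Int) -> CGImage? {
    return rgbaBytes.withUnsafeMutableBytes { buffer -> CGImage? in
        guard let context = CGContext(data: buffer.baseAddress,
                                      width: width,
                                      height: height,
                                      bitsPerComponent: 8,
                                      bytesPerRow: width * 4,
                                      space: CGColorSpaceCreateDeviceRGB(),
                                      bitmapInfo: CGImageAlphaInfo.premultipliedLast.rawValue)
        else { return nil }
        return context.makeImage()   // copies the pixels into an immutable CGImage
    }
}
```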
EDIT:
I think with the excellent help of you two, and some more research on my own, I've put together a plan: construct the raw RGBA data, then construct a CGImage from that data, and in turn create a CVPixelBuffer from that CGImage, following this: CVPixelBuffer from CGImage.
However, to then play that live as the data comes in, I'm not sure what kind of FPS I would be looking at. Do I paint the frames to a CALayer, or is there some class similar to AVAssetWriter that I could use to play them as I append CVPixelBuffers? The experience I have is using AVAssetWriter to export constructed Core Animation hierarchies to video, so the videos are always fully constructed before they begin playing, rather than being displayed as live video.
I've done this before, and I know that you found my GPUImage project a little while ago. As I replied on the issues there, the GPUImageRawDataInput is what you want for this, because it does a fast upload of RGBA, BGRA, or RGB data directly into an OpenGL ES texture. From there, the frame data can be filtered, displayed to the screen, or recorded into a movie file.
Your proposed path of going through a CGImage to a CVPixelBuffer is not going to yield very good performance, based on my personal experience. There's too much overhead when passing through Core Graphics for realtime video. You want to go directly to OpenGL ES for the fastest display speed here.
I might even be able to improve my code to make it faster than it is right now. I currently use glTexImage2D() to update texture data from local bytes, but it would probably be even faster to use the texture caches introduced in iOS 5.0 to speed up refreshing data within a texture that maintains its size. There's some overhead in setting up the caches that makes them a little slower for one-off uploads, but rapidly updating data should be faster with them.
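For reference, the plain upload path boils down to a call like this (a Swift sketch, not the actual GPUImage code, which is Objective-C):

```swift
import OpenGLES

// Sketch of the plain upload path: push one frame of RGBA bytes into an
// existing OpenGL ES texture with glTexImage2D().
func uploadFrame(_ bytes: UnsafeRawPointer, width: Int, height: Int, texture: GLuint) {
    glBindTexture(GLenum(GL_TEXTURE_2D), texture)
    glTexImage2D(GLenum(GL_TEXTURE_2D), 0, GL_RGBA,
                 GLsizei(width), GLsizei(height), 0,
                 GLenum(GL_RGBA), GLenum(GL_UNSIGNED_BYTE), bytes)
}
```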
My 2 cents:
I made an OpenGL game which lets the user record a 3D scene. Playback was done by replaying the scene (instead of playing a video, because realtime encoding did not yield a comfortable FPS).
There is a technique which could help out; unfortunately I didn't have time to implement it:
http://allmybrain.com/2011/12/08/rendering-to-a-texture-with-ios-5-texture-cache-api/
This technique should cut down the time spent getting pixels back from OpenGL. You might get an acceptable video encoding rate.
I am trying to use the GPUImage library for image processing (not for image output to the screen). By default, the library outputs one BGRA texture. I would like to instead output multiple single-channel/single-byte textures. Up to this point I have been bit-packing multiple calculations for each pixel into BGRA. I have reached the limits of that method because a) I now have more than 4 return values for each pixel, and b) the overhead of de-interleaving BGRA-BGRA-BGRA-BGRA..... into BBBB..,GGGG..,RRRR..,AAAA.., is starting to really bog down my program.
I know there is some sample code for using multiple input textures with GPUImage, but I have not seen anything for multiple output textures. For single-byte output I believe I could use GL_ALPHA textures (?), so I am guessing it is a matter of binding multiple output textures to variables in my filter kernel.
Thanks!
Probably what you need is Multiple Render Targets (MRT for short). You can check how many color attachments your target hardware supports by querying GL_MAX_COLOR_ATTACHMENTS. Unfortunately, I believe not all iOS devices support it, if any.
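A sketch of what the setup looks like on hardware that does support it, using OpenGL ES 3.0 calls (texture and framebuffer creation omitted; names are placeholders):

```swift
import OpenGLES

// Sketch: attach several single-channel textures to one framebuffer and enable
// drawing to all of them (OpenGL ES 3.0).
func enableMRT(framebuffer: GLuint, outputTextures: [GLuint]) {
    var maxAttachments: GLint = 0
    glGetIntegerv(GLenum(GL_MAX_COLOR_ATTACHMENTS), &maxAttachments)

    glBindFramebuffer(GLenum(GL_FRAMEBUFFER), framebuffer)
    var drawBuffers: [GLenum] = []
    for (i, texture) in outputTextures.enumerated() {
        let attachment = GLenum(GL_COLOR_ATTACHMENT0 + Int32(i))
        glFramebufferTexture2D(GLenum(GL_FRAMEBUFFER), attachment,
                               GLenum(GL_TEXTURE_2D), texture, 0)
        drawBuffers.append(attachment)
    }
    glDrawBuffers(GLsizei(drawBuffers.count), drawBuffers)

    // The fragment shader then declares one output per attachment, e.g.
    // layout(location = 0) out float resultA; layout(location = 1) out float resultB; ...
}
```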