iOS Swift: buffer 30 FPS video for realtime object detection

I have trained an ObjectDetector for iOS. Now I want to use it on a video with a frame rate of 30 FPS.
The ObjectDetector is a bit too slow: it needs 85 ms per frame, but for 30 FPS it would have to stay below 33 ms.
Now I am wondering whether it is possible to buffer the frames and the predictions for a specified time x and then play the video on the screen.

If you have already tried using a smaller/faster model (and have also ensured that your model is fully optimized to run in CoreML on the neural engine), another option is to run inference only on every nth frame, which worked well for us.
The results were suitable for our use-case and you couldn't really tell that we were only doing it at 5 fps because we were able to continue to display the camera output at full frame-rate.
If you don't need realtime then yes, certainly you could store the video and do the processing per frame afterwards; this would let you parallelize things into bigger batch sizes as well.
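The "every nth frame" idea can be sketched in a few lines. This is a minimal, language-agnostic sketch in Python rather than the answerer's actual iOS code; `inference_stride`, `annotate_stream` and the `detect` callback are hypothetical names introduced for illustration. It picks the smallest n for which the detector keeps up, and reuses the most recent detections for the frames in between.

```python
import math


def inference_stride(inference_ms: float, fps: float) -> int:
    """Smallest n such that detecting on every nth frame keeps up with the stream."""
    frame_interval_ms = 1000.0 / fps
    return max(1, math.ceil(inference_ms / frame_interval_ms))


def annotate_stream(frames, detect, stride):
    """Yield (frame, detections) for every frame.

    `detect` (a hypothetical callback) runs only on every `stride`-th frame;
    the frames in between are paired with the most recent result.
    """
    latest = None
    for index, frame in enumerate(frames):
        if index % stride == 0:
            latest = detect(frame)
        yield frame, latest


# 85 ms per inference at 30 FPS (about 33.3 ms per frame) gives a stride of 3,
# i.e. detection runs at roughly 10 FPS while every frame is still displayed.
stride = inference_stride(85.0, 30.0)
```

In a real app the detector would presumably run on a background queue so that it does not block display; the loop above only shows which frames get a fresh prediction.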

Related

Taking Frame from Video vs Taking a Photo

My specific question is: What are the drawbacks to using a snipped frame from a video vs taking a photo?
Details:
I want to use frames from live video streams to replace taking pictures because it is faster. I have already researched and considered:
Videos need faster shutter speed, leading to higher possibility of blurring
Faster shutter speed also means less exposure to light, leading to potentially darker images
A snipped frame from a video will probably be lower resolution (although perhaps we can turn up the resolution to compensate for this?)
Video might take up more memory -- I am still exploring the details with another post (What is being stored and where when you use cv2.VideoCapture()?)
Anything else?
I will reword my question to make it (possibly) easier to answer: What changes must I make to a "snip frame from video" process to make the result equivalent to taking a photo? Are these changes worth it?
The maximum resolution in picamera is 2592x1944 for still photos and 1920x1080 for video recording. Other issues to take into account are that you cannot receive all formats from VideoCapture, so now conversion of the YUV frame to JPG will be your responsibility. OK, OpenCV can handle this, but it takes considerable CPU time and memory.

Is there a quality difference between output of AVCaptureMovieFileOutput and AVCaptureVideoDataOutput?

In the process of capturing a light trail photo, I noticed that for fast moving objects, there is slightly more discontinuity between successive frames if I use the sample buffers from AVCaptureVideoDataOutput compared to if I record a movie and extract frames and run the same algo.
Is there a refresh rate/frame rate difference if the two modes are used?
A colleague who has experience in professional photography claims that there is a visible lag even in Apple's default camera app when comparing the preview in Photo mode and Video mode but it is not something very obvious to me.
Furthermore, I am actually capturing video at a low frame rate (close to the highest exposure).
To conclude these experiments, I need to know whether there is any definitive proof that confirms or disproves this.

Implementing audio waveform view and audio timeline view in iOS?

I am working on an app that will allow users to record from the mic, and I am using audio units for the purpose. I have the audio backend figured out and working, and I am starting to work on the views/controls etc.
There are two things I am yet to implement:
1) I will be using OpenGL ES to draw waveform of the audio input, there seems to be no easier way to do it for real-time drawing. I will be drawing inside a GLKView. After something is recorded, the user should be able to scroll back and forth and see the waveform without glitches. I know it's doable, but having a hard time understanding how that can be implemented. Suppose, the user is scrolling, would I need to re-read the recorded audio every time and re-draw everything? I obviously don't want to store the whole recording in memory, and reading from disk is slow.
2) For the scrolling etc., the user should see a timeline, and even if I have an idea for question 1, I don't know how to implement the timeline.
All the functionality I'm describing is do-able since it can be seen in the Voice Memos app. Any help is always appreciated.
I have done just this. The way I did it was to create a data structure that holds data for different "zoom levels" of the audio. Unless you are displaying the audio at a resolution of 1 sample per pixel, you don't need every sample to be read from disk, so what you do is downsample your samples to a much smaller array that can be stored in memory ahead of time. A naive example is a waveform that displays audio at a ratio of 64 samples per pixel. Let's say you have an array of 65536 stereo samples (presumably interleaved, so 32768 L/R pairs): you would average each L/R pair of samples into a positive mono value, then average 64 of the positive mono values into one float. Then your array of 65536 audio samples can be visualized with an array of 512 "visual samples". My real-world implementation became much more complicated than this, as I have ways to display all zoom levels, live resampling and such, but this is the basic idea. It's essentially a mipmap for audio.
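The naive example above could look roughly like this. It is a sketch under stated assumptions, not the answerer's implementation: the 65536 values are assumed to be interleaved L/R samples (the reading under which the quoted 512 comes out), "positive mono value" is taken to mean the mean of the absolute L and R values, and `build_zoom_level` is a hypothetical name.

```python
def build_zoom_level(interleaved, samples_per_pixel=64):
    """Reduce interleaved stereo samples to one "visual sample" per pixel.

    Assumption: `interleaved` is [L0, R0, L1, R1, ...].
    """
    # Step 1: average each L/R pair into a positive mono value.
    mono = [
        (abs(interleaved[i]) + abs(interleaved[i + 1])) / 2.0
        for i in range(0, len(interleaved) - 1, 2)
    ]
    # Step 2: average each run of `samples_per_pixel` mono values into one float.
    visual = []
    for start in range(0, len(mono), samples_per_pixel):
        chunk = mono[start:start + samples_per_pixel]
        visual.append(sum(chunk) / len(chunk))
    return visual


# 65536 interleaved values -> 32768 mono values -> 512 visual samples.
level = build_zoom_level([0.5, -0.5] * 32768)
```

Calling the function again with a larger `samples_per_pixel` would give a coarser zoom level; precomputing a few such arrays is what makes scrolling cheap, since only the small arrays need to stay in memory.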

What is the most efficient way to search for faces in all 4 orientations of a video with OpenCV using a GPU?

I am new to GPU programming and I have started by passing haarcascade_frontalface_alt.xml and a video file to this compiled example:
https://github.com/Itseez/opencv/blob/master/samples/gpu/cascadeclassifier.cpp
It seems to take about 3 seconds to load the video into the GPU and then another 2 seconds to search for faces. This works well, but the videos could have been recorded in any orientation, so if no faces are found, I rotate the video by 90 degrees and try again. The problem is that this approach takes at least 20 seconds to determine whether any faces were found in all 4 orientations, and hence the correct orientation of the video.
Is it possible to use a rotation-invariant cascade classifier to determine the orientation of the video? Or is it possible to transpose the video on the GPU without having to reload a rotated version? Or is it possible to apply a rotated version of the cascade classifier? How can I search for faces in all 4 orientations without having to load 4 versions of the video into the GPU?
Many things are possible in the world of computer vision, but few are robust/reliable :). Rotation invariance is not the way to go (since effectively rotation invariance means that rotation-information is somehow dropped).
The simplest approach: Image rotation on a GPU is quite fast, so you could try rotating each image after having uploaded it to the device, using gpu::rotate.
A faster approach: The typical approach would be to learn four different detectors and apply all of them. Detection scales quite well in the number of detectors with some recent advances.
But I am still not sure what you want to achieve. If you do not want to find all faces, but rather estimate the orientation of the video (as it sounds from parts of your question), you only need to process a subsample of all frames and infer the orientation from those (as head rotations do not tend to be randomly distributed :) )
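The subsampling idea can be sketched as a simple vote over a few frames. This is an illustrative Python sketch on plain 2D lists, not OpenCV GPU code; `rotate90_clockwise`, `estimate_orientation`, the `detect_faces` callback and the `sample_every` default are all hypothetical. Each sampled frame is rotated in memory through the four orientations, and the rotation that most often yields a face wins.

```python
from collections import Counter


def rotate90_clockwise(image):
    """Rotate a row-major 2D list by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def estimate_orientation(frames, detect_faces, sample_every=30):
    """Return the clockwise rotation (0, 90, 180 or 270) that most often yields faces.

    `detect_faces` is a hypothetical callback returning a truthy value when
    at least one upright face is found. Returns None if no face is ever found.
    """
    votes = Counter()
    for frame in frames[::sample_every]:  # only a subsample of the frames
        rotated = frame
        for quarter_turns in range(4):
            if detect_faces(rotated):
                votes[quarter_turns * 90] += 1
            rotated = rotate90_clockwise(rotated)
    return votes.most_common(1)[0][0] if votes else None
```

The point of the design is that each frame is loaded once and only rotated afterwards, so the cost of trying four orientations is paid per sampled frame rather than per full reload of the video.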

Take photo during video-input

I'm currently trying to take an image in the best quality while capturing video at a lower quality. The problem is that I'm using the video stream to check whether faces are in front of the camera, and this needs lots of resources, so I'm using a lower-quality video stream, and if any faces are detected I want to take a photo in high quality.
Best regards and thanks for your help!
You cannot have multiple capture sessions, so at some point you will need to swap to a higher resolution. First, you are saying that face detection takes too many resources when using high-res snapshots. Why not simply down-sample the image and keep using high resolution all the time (send the down-sampled image to the face detection, display the high-res one)?
I would start with Apple's most common graphics context and try to downscale with it. If that takes too much CPU, you could try to do the same on the GPU (find a library that does it or write a simple program yourself), or you could even simply drop the odd lines and columns of the raw image data. In any of those cases, note that you probably do not need to run the face detection on the same thread as the display, and you most likely do not even need a high frame rate for the detection (you display the camera at full FPS but update the face detection at 10 FPS, for instance).
Another thing you can do is keep the whole thing in low res; then, when you need to take the image, stop the session, start a high-res session, take a snapshot and swap back to low res for face detection.
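The "drop odd lines and columns" variant is easy to illustrate. The following is a rough Python sketch on a plain 2D list standing in for raw pixel data, not iOS code; `drop_odd_rows_and_columns` and `scale_box` are hypothetical helpers. The second function maps a face box found on the small image back to coordinates in the full-resolution frame, which is needed if the high-res image is the one being displayed or saved.

```python
def drop_odd_rows_and_columns(pixels):
    """Halve width and height by keeping only even-indexed rows and columns."""
    return [row[::2] for row in pixels[::2]]


def scale_box(box, factor=2):
    """Map an (x, y, width, height) box from the small image to the full-res one."""
    x, y, w, h = box
    return (x * factor, y * factor, w * factor, h * factor)


full_res = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
small = drop_odd_rows_and_columns(full_res)  # [[0, 2], [8, 10]]
```

Plain decimation like this does no filtering, so it can alias on fine detail; it is only meant as the cheapest possible input for a detector that does not need the full resolution.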
