Compress CVSampleBuffer in BrodcastExtension - ios

I have an app that broadcasts the screen of the device. I am using OpenTok library to send the frames. Their library does not handle the compressing process so every time I send a frame using this code to consume Buffer:
self.videoCaptureConsumer?.consumeImageBuffer(pixelBuffer, orientation:sample.orientation.openTokOrientation,timestamp:CMSampleBufferGetPresentatioTimeStamp(sample) ,metadata:nil)
the Broadcast extension crashes because of the memory limitation(50MB).
I have been searching in SO as well as repos in GitHub finally I ended up using a CPU image processing using Accelerate. I created an extension of CVPixelBuffer to resize the buffer. Here is the extension
I can resize the CVPixelBuffer and then send the new buffer to the OpenTok library. But the problem is since it is done by CPU, on iPhones X and below, the broadcast extension is stopped by the high CPU usage by the system.
So, I thought I have to find a way to compress-resize the buffer in a faster and memory-safer way with GPU acceleration.
Then I ended up checking Telegram's iOS app. I discovered the Telegram`s Broadcast Extension which actually works like a charm and it is what I need but I lack information on how it works since it uses C libraries.
My question is how I can compress-resize the CVPixelBuffer in a similar way that Telegram does but at least written with Swift language without passing memory limit and using GPU acceleration?

Related

Should CUDA stream be waited to be complete even if the output data are to be sent to OpenGL instead of CPU?

This is a general question, and although I use OpenCV as a framework, the question is broader than OpenCV's realm.
I am developing an image processing tool that will effectively get image from a webcam (yielding a host-memory located cv::Mat), upload it to a GPU device memory in CUDA (i.e. cv::GpuMat), do some processing using CUDA and get a result finalCudaMat, and finally send the result to OpenGL (i.e. cv::ogl::Buffer::mapDevice + finalCudaMat.copyTo(mappedOglBuffer)). Everything works as intended.
Since the whole process involves multiple steps, I use a CUDA stream object (cv::cuda::Stream) to be able to make CUDA calls asynchronous and not wait on every single operation to be finished on CPU side. Now if someone instead is to eventually copy the result to a CPU matrix (i.e. finalCudaMat.download(finalCpuMat)), as in a customary situation, typically a wait on the stream is required (cudaStream.waitForCompletion()) to ensure the result is ready before using the CPU side matrix.
In my case, the the result never gets back to the CPU as it continues to be rendered on the screen (a bit of OpenGL operations and shaders are also involved).
One way, it might be appropriate to wait for CUDA work to finish before starting to copy the GpuMat to OpenGL Buffer. So if I add the wait on stream, everything is working fine and the CUDA operations take ~2.5ms.
Another way, it feels like I don't need to wait for completion of the stream (all the results are consumed by the GPU anyway -- CPU is never invovled again). Therefore I can remove the cudaStream.waitForCompletion() call before performing finalCudaMat.copyTo(mappedOglBuffer), and everything seems to be working fine. The whole CUDA processing operation (basically any GPU task minus OpenGL related) apparently takes ~1.8ms for me.
In the past I have had bad experience of not properly synchronization GPU work if two different APIs were involved (e.g. do something on Direct3D 9, do not wait for it to finish, and then copy the resulting texture to a Direct3D 10 texture, and clearly on some frames the image becomes empty or torn).
At this point, the difference is tiny and doesn't affect my 60 FPS throughput. But I wonder if I am technically doing a correct work by removing the wait-on-stream operation. Any thoughts on this? Or maybe a document regarding OpenGL/CUDA interop that could help me?
The rules are defined in this document: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#graphics-interoperability
In particular it says that
Accessing a resource through OpenGL, Direct3D, or another CUDA context while it is mapped produces undefined results.
That's a very strong hint that the needed synchronization is performed by cudaGraphicsUnmapResources, which is confirmed by its documentation:
This function provides the synchronization guarantee that any CUDA work issued in stream before cudaGraphicsUnmapResources() will complete before any subsequently issued graphics work begins.
So you won't need to make the CPU wait on CUDA completion, but you must call cudaGraphicsUnmapResources which will put the appropriate barrier in the asynchronous instruction stream. Note that unlike your CPU transfer code, this call goes after CUDA copies data into the OpenGL buffer.
As Ben Voigt already pointed out, CUDA requires explicit synchronization with OpenGL (or any other graphics API that interoperates with it). Now this used the be kind of a chore, where one had to submit callbacks to the compute stream and use them to manually work with e.g. OpenGL fences.
However due to the advent of Vulkan and with it the support for external resources (and OpenGL extensions for that) you can in fact synchonize between CUDA and OpenGL command streams, by having both sides import platform native semaphores (cudaImportExternalSemaphore, GL_EXT_semaphore) and use them for mutual synchronization. It usually still involves a whole round trip through the CPU side driver, but since that part has to manage the command streams anyway it's not really an issue of efficiency.

Memory limit issue with screen casting using the broadcast extension and WebRTC protocol on iOS

This is my first question posted on stackoverflow.
I'm trying to make screen cast app using BroadcastExtension and WebRTC protocol. But broadcast extension's memory limit(50mb) is so tight that if an application tries to send the original video(886 x 1918 30fps) without any processing, it immediately dies after receiving a memory usage warning. After lowering the resolution and frame rate of the video, there is no problem. Investigating the application using the profiler does not seem to cause any problems with memory leaks. I guess it is because of the frames allocated during the encoding process inside WebRTC framework.
So my question is, is it possible to send the original video using WebRTC without any other processing, such as down scaling or lowering the frame rate?
Possible.
I forgot to mention in the question, but the library I used is Google WebRTC. I made two mistakes. One is to build the modified framework in debug mode, and the other is to use a software encoder(default is VP8). Because of this, it seems that the processing of the video frames was delayed and accumulated in the memory. DefaultEncoderFactory basically provides an encoder that operates in SW. (At least on iOS. Android seems to support HW-based decoder encoders automatically.) Fortunately, the iOS version google WebRTC framework supports the H264 hardware encoder(EncoderFactoryH264). In other cases you have to implement it yourself.
However, when using H264 to transmit, there is a problem that some platforms cannot play, for example, Android. The Google webrtc group seems to be aware of this problem, but at least it seems to me that it has not been resolved properly. Additional work is needed to solve this.

Access the whole video memory

I'm looking for a way to read the whole video memory that a video card outputs to a display. That includes also hardware accelerated output, video playback and output in fullscreen mode (that somehow I feel could be different from windowed mode).
In short: I want to be able to capture everything that is going to be represented on a display.
I suppose that IF that's possible it would be os-dependant. The targets I'm interested in are Windows OSX and Linux.
Do you have any hint?
For windows I guess you could take CamStudio, strip it down and use it to record the screen then do whatever you want with the output, other than that you could look into forensic kernel drivers for accessing RAM. It's not exactly as simple as a pointer pointing to the video memory anymore, haha.
Digital Rights Management, requested feature of Windows, attempts to block your access to blocks of graphics-card frame buffer memory. Using an open-source driver under Linux would seem to be the only way to access this memory, or as mentioned earlier, some 3rd party software that knows some back doors or hacks or ways to locate other program's frame buffer space.
Unless of course, you are trying to capture output from your own program (ie you are calling the video/graphics creation functions yourself), there are APIs to manipulate display frames in DirectX and OpenGL.
I think I found some resources that can help to capture the display memory in Windows
Fastest method of screen capturing
How to save backbuffer to file in DirectX 10?
http://betterlogic.com/roger/2010/07/fast-screen-capture/

Fastest way to get frames from webcam

I have a little wee of a problem developing one of my programs in C++ (Visual studio) - Right now im struggling with connection of multiple webcams (connected via usb cables), creating for each of them separate thread to capture frames, and separate frame for processing image.
I use OpenCV to process frames, but the problem is that i dont get a peak of webcam possibilities (it supports 25 fps, i get only 18) is there some library that i could use to get frames, than process them with OpenCV that would made frames be captured faster?
I was researching a bit and the most popular way is to use directshow to get frames and OpenCV to process them.
Do You agree? Or do You have another solution?
I wouldn't be offended by some links :)
DirectShow is only used, if you open your capture using the
CV_CAP_DSHOW flag, like:
VideoCapture capture( CV_CAP_DSHOW + 0 ); // 0,1,2, your cam id there
(without it, it defaults to vfw )
the capture already runs in a separate thread, so wrapping it with more threads won't give you any gain.
another obstacle with multiple cams is the usb bandwidth, so if you got ports on the back & the front of your machine, dont plug all your cams into the same port/controller else you just saturate it
OpenCV uses DirectShow. Using DirectShow (primary video capture API in Windows) directly will obviously get you par or better performance (and even more likely so if OpenCV is set to use Video for Windows). USB cams typically hit USB bandwidth and hence frame rate limit, using DirectShow to capture in compressed formats or in formats with less bits/pixel is the way to reach higher frame rates within the same USB bandwidth limit.
Another typical problem causing low frame rates is slow synchronous processing delaying the capture. You typically identify this by putting trivial processing into the same capture loop and seeing higher FPS compared to processing-enabled operation.

How to use html canvas toDataURL on IOS without increasing memory?

Firstly, the following thread discusses the issue that toDataURL may increase memory consumption, but doesn't offer a way to use toDataURL safely:
javascript memory leak with HTML5 getImageData
In my application, I need to call toDataURL. I have a PhoneGap application running on IOS that take photos using the native camera, tiles the images together into one collage image, and sends the final image as a binary 64 string via ajax post to a server. The part of my code that tiles the images uses an html canvas and toDataURL to accomplish that. This tiling occurs repeatedly over the lifetime of the software process. I am seeing the application memory increase until IOS aborts the process.
What would you recommend to do to be able to call toDataURL but not run out of memory? I don't see how to release this memory.
Thanks.
Instead of trying to memory manage, it may just be easier to send the independent images to the server and let the server create the collage for you.

Resources