Does OpenCV UMat always reside on GPU? - opencv

I would like to know whether an OpenCV UMat always resides on the GPU side if an OpenCL-compatible GPU is available. Does "cv::ocl::setUseOpenCL(true)" make any difference?
If the UMat does sit on the GPU side, does that mean the data transfer between CPU and GPU only happens when I call umat.getMat()?
Thanks a lot!

No, a UMat does not necessarily reside on a single hardware component like the CPU or the GPU. UMat is backed by the OpenCL framework, which tries to harness the processing power of any capable device attached to the machine: the CPU, the GPU, or even a digital signal processor on mobile devices. OpenCL then distributes the work efficiently across the available devices capable of processing it. For more information follow this link
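For illustration, here is a minimal C++ sketch of the Transparent API behaviour described above. The file name and blur parameters are just examples, and whether the filter actually runs on an OpenCL device depends on your build and drivers:

    #include <iostream>
    #include <opencv2/core.hpp>
    #include <opencv2/core/ocl.hpp>
    #include <opencv2/imgcodecs.hpp>
    #include <opencv2/imgproc.hpp>

    int main()
    {
        cv::ocl::setUseOpenCL(true);  // allow OpenCL dispatch for UMat operations
        std::cout << "OpenCL available: " << cv::ocl::haveOpenCL() << std::endl;

        cv::UMat src, blurred;
        cv::imread("input.png").copyTo(src);                 // upload into a UMat

        cv::GaussianBlur(src, blurred, cv::Size(5, 5), 1.5); // may run via OpenCL

        // Accessing the data as a Mat forces a device-to-host transfer if the
        // buffer currently lives on the OpenCL device.
        cv::Mat host = blurred.getMat(cv::ACCESS_READ);
        std::cout << "size: " << host.size() << std::endl;
        return 0;
    }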

Related

What platform to use for YOLO output when using AMD GPU?

I have long been tormented by this question, and I am asking your advice on which direction to move. The objective is to develop a universal application with YOLO on Windows that can use the computing power of AMD/Nvidia/Intel GPUs or AMD/Intel CPUs (one of the devices will be used). As far as I know, the OpenCV DNN module leads in CPU computation; a DNN + CUDA combination is planned for Nvidia graphics cards and DNN + OpenCL for Intel GPUs. But while testing an AMD GPU (RX 580) with DNN + OpenCL, I ran into the following problem: https://github.com/opencv/opencv/issues/17656. Does this module not support AMD GPU computing at all? If so, could you please let me know on which platform this is possible and, preferably, as efficiently as possible. A possible solution might be Tencent's ncnn, but I'm not sure about its performance on the desktop. By output I mean the coordinates of detected objects and their names (in the OpenCV DNN module I got them with cv::dnn::Net::forward()). Also, correct me if I'm wrong somewhere. Any feedback would be appreciated.
I tried the OpenCV DNN + OpenCL module and expected high performance, but this combination does not work.
I believe OpenCV doesn't support AMD GPUs for GPU optimization. If you're interested in running DL models on non-Nvidia GPUs, I suggest looking at PlaidML, YOLO-OpenCL, and DeepCL.
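For reference, selecting the DNN backend/target looks roughly like this (a minimal sketch; the YOLO file names are placeholders, and OpenCL acceleration of the DNN module is only officially tested on Intel GPUs, which is what the linked issue is about):

    #include <opencv2/dnn.hpp>

    cv::dnn::Net loadYolo()
    {
        // Model/config file names are placeholders.
        cv::dnn::Net net = cv::dnn::readNetFromDarknet("yolov4.cfg", "yolov4.weights");

        // Plain OpenCV backend; the OpenCL target is the one that hit the linked
        // issue on the AMD RX 580 (Intel GPUs are the officially tested target).
        net.setPreferableBackend(cv::dnn::DNN_BACKEND_OPENCV);
        net.setPreferableTarget(cv::dnn::DNN_TARGET_OPENCL);

        // On an Nvidia card, a CUDA-enabled build would instead use:
        // net.setPreferableBackend(cv::dnn::DNN_BACKEND_CUDA);
        // net.setPreferableTarget(cv::dnn::DNN_TARGET_CUDA);
        return net;
    }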

Pi 3 GPU functionality with camera

In my project I have two main tasks: image recognition on the camera frames and saving the videos.
I am thinking of using the Pi's GPU to accelerate this.
Is it possible, using the Pi's GPU, to get the frames from the camera, then convert and save them to the SD card?
And in the meantime pass the frames to the processor for image recognition?
Can someone please provide some info about how I can use the GPU and the processor separately, and what camera-related operations the GPU can do?
Thanks
I think you really just want to use the UMat class. It makes a lot of OpenCV's functions run on the GPU (where possible), and it can in some cases free up a lot of CPU time for other tasks.
Some OpenCV functions are also often several times faster when run on a GPU.
See opencv-transparent-api
You can also easily find examples using it here on stack overflow.
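As a rough sketch of what that looks like, assuming an OpenCV build with OpenCL support on the Pi (the camera index, codec, and frame rate are just examples):

    #include <opencv2/core.hpp>
    #include <opencv2/videoio.hpp>
    #include <opencv2/imgproc.hpp>

    int main()
    {
        cv::VideoCapture cap(0);                             // camera index 0 as an example
        cv::Size size(static_cast<int>(cap.get(cv::CAP_PROP_FRAME_WIDTH)),
                      static_cast<int>(cap.get(cv::CAP_PROP_FRAME_HEIGHT)));
        cv::VideoWriter writer("out.avi",
                               cv::VideoWriter::fourcc('M', 'J', 'P', 'G'),
                               30.0, size);

        cv::UMat frame, gray;
        while (cap.read(frame))                              // read() accepts a UMat
        {
            cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);   // may run via OpenCL
            // ... run the image-recognition step on 'gray' here ...
            writer.write(frame);                             // save the original frame
        }
        return 0;
    }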

OpenGL rendering to Framebuffer, transfer to OpenCV UMat for OpenCL-accelerated processing

In an older version of OpenCV I could render with OpenGL to the back buffer and use glReadPixels to "copy" the pixels into an OpenCV image (IplImage?) for some processing (blurring, template matching against another OpenCV-loaded image). However, that cost me a transfer from GPU to CPU and then, if I wanted to display the result, back to the GPU.
Now I can do something similar with just OpenGL and OpenCL, by using clEnqueueAcquireGLObjects, and I do not have to transfer at all: I render with OpenGL to a framebuffer and let OpenCL take control of it.
However, this forces me to write my own OpenCL kernels (nobody has time for that... it is actually terribly hard to debug OpenCL on an Nvidia card) for whatever processing I would like to do. Now that OpenCV has some great OpenCL-accelerated functions, I would like to try them out.
So my question: is it possible to render to the framebuffer (or another GL context on the GPU) and give control of it (or copy it) to an OpenCV context (a UMat?) for OpenCL-accelerated processing? If so, how (big picture, key components)?
I have a feeling I can use cv::ogl::Buffer to wrap the buffer, but the documentation is not exactly clear on this, and then copy it using ogl::Buffer::copyTo. Similar: Is it possible to bind an OpenCV GpuMat as an OpenGL texture?
Other reference: Transfer data from Mat/oclMat to cl_mem (OpenCV + OpenCL)
Yes, it is possible now. I wrote demos that demonstrate OpenGL/OpenCL/VAAPI interop. tetra-demo renders a rotating tetrahedron using OpenGL, passes the framebuffer (actually a texture attached to a framebuffer) on to OpenCV/CL (as a cv::UMat) and encodes it to VP9, all on the GPU. The only problem is that the required fixes reside in my OpenCV 4.x fork, and you'll have to build it yourself.
It also requires two OpenCL extensions: cl_khr_gl_sharing and cl_intel_va_api_media_sharing.
There are two open github issues addressing my efforts:
OpenGL/OpenCL and VAAPI/OpenCL interop at the same time
Capture MJPEG from V4l2 using VAAPI hardware acceleration to OpenCL using VAAPI/OpenCL interop
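For the OpenGL-to-UMat part specifically, a hedged sketch using the interop helpers in opencv2/core/opengl.hpp (assuming an OpenCV build with both OpenGL and OpenCL support, cl_khr_gl_sharing, and an RGBA texture attached to the framebuffer) could look like this:

    #include <opencv2/core.hpp>
    #include <opencv2/core/opengl.hpp>
    #include <opencv2/imgproc.hpp>

    void processFramebufferTexture(unsigned int glTextureId, int width, int height)
    {
        // One-time: create OpenCV's OpenCL context from the current GL context
        // (requires cl_khr_gl_sharing).
        static bool initialized = false;
        if (!initialized) {
            cv::ogl::ocl::initializeContextFromGL();
            initialized = true;
        }

        // Wrap the texture attached to the framebuffer (no pixel copy yet).
        cv::ogl::Texture2D tex(height, width, cv::ogl::Texture2D::RGBA, glTextureId);

        // Copy the texture contents into a UMat that stays on the device.
        cv::UMat frame, blurred;
        cv::ogl::convertFromGLTexture2D(tex, frame);

        // Any OpenCL-accelerated OpenCV call can now be used.
        cv::GaussianBlur(frame, blurred, cv::Size(9, 9), 2.0);

        // Optionally write the result back into the GL texture for display.
        cv::ogl::convertToGLTexture2D(blurred, tex);
    }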

Would it work and be faster if I call function in OpenCV GPU module in my kernel function?

OpenCV has a GPU-accelerated Computer Vision module (http://docs.opencv.org/modules/gpu/doc/gpu.html). There are many functions that already use GPU techniques, so I can directly use the functions OpenCV provides. But I wonder whether it would be faster if I wrote my own kernel and, inside each kernel, called functions from the OpenCV GPU module. This is for the case where I have many images: to handle each image I would call an OpenCV GPU-module function, so it would be nested parallelism (parallel within parallel).
Your question is not entirely clear to me, but I would like to say this: it's impossible to say which would be faster unless somebody has already implemented the same algorithm using the approach you have in mind and shared a report of the benchmark tests.
There are a number of factors involved:
It depends on the type of operation you are trying to implement: techniques with high arithmetic intensity are certainly a better fit for GPUs; however, not all problems can be modeled for GPUs.
The size of the input images matters: the time wasted sending data from RAM to the GPU might not pay off in the end, so running the algorithm on the CPU can be faster for small images.
The model/power of the CPU/GPU: if the computer has a really crappy GPU, then it's probably better to run the algorithms on the CPU.
What I'm saying is: don't assume OpenCV's GPU module will always run its algorithms faster than the CPU you have. Test it, measure it! The only way to know for sure is through experimentation and benchmarking.
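As an illustration of that advice, a minimal (and admittedly crude) benchmark sketch comparing Mat and UMat might look like this; the image size, filter, and iteration count are arbitrary examples:

    #include <iostream>
    #include <opencv2/core.hpp>
    #include <opencv2/core/ocl.hpp>
    #include <opencv2/imgproc.hpp>

    int main()
    {
        cv::Mat cpuImg(2048, 2048, CV_8UC3);
        cv::randu(cpuImg, 0, 255);
        cv::UMat gpuImg;
        cpuImg.copyTo(gpuImg);

        cv::Mat cpuOut;
        cv::UMat gpuOut;
        cv::TickMeter tm;

        tm.start();
        for (int i = 0; i < 100; ++i)
            cv::GaussianBlur(cpuImg, cpuOut, cv::Size(15, 15), 3.0);
        tm.stop();
        std::cout << "Mat:  " << tm.getTimeMilli() << " ms" << std::endl;

        tm.reset();
        tm.start();
        // Note: the first OpenCL call includes kernel compilation, so a fair
        // comparison should warm up or average over many iterations.
        for (int i = 0; i < 100; ++i)
            cv::GaussianBlur(gpuImg, gpuOut, cv::Size(15, 15), 3.0);
        cv::ocl::finish();                     // wait for queued OpenCL work
        tm.stop();
        std::cout << "UMat: " << tm.getTimeMilli() << " ms" << std::endl;
        return 0;
    }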

Calling BLAS routines inside OpenCL kernels

Currently I am working on some image-processing algorithms using OpenCL. Basically my algorithm requires solving a linear system of equations for each pixel. Each system is independent of the others, so going for a parallel implementation is natural.
I have looked at several BLAS packages such as ViennaCL and AMD APPML, but it seems all of them have the same usage pattern (the host calls BLAS subroutines, which are then executed on the CL device).
What I need is a BLAS library that could be called inside an OpenCL kernel so that I can solve many linear systems in parallel.
I found this similar question on the AMD forums.
Calling APPML BLAS functions from the kernel
Thanks
It's not possible. clBLAS routines make a series of kernel launches; the kernel launches for some 'solve' routines are really complicated. clBLAS routines take cl_mem objects and command queues as arguments, so if your buffer is already on the device, clBLAS will act on it directly. It does not accept host buffers or manage host-to-device transfers.
If you want to have a look at which kernels are generated and launched, uncomment this line https://github.com/clMathLibraries/clBLAS/blob/master/src/library/blas/generic/common.c#L461 and build clBLAS. It will dump all kernels being called.
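To illustrate the host-side pattern described above, here is a hedged sketch of a clBLAS call; the command queue and device buffers are assumed to have been created already, and error handling is omitted:

    #include <clBLAS.h>

    void multiplyOnDevice(cl_command_queue queue,
                          cl_mem bufA, cl_mem bufB, cl_mem bufC,
                          size_t M, size_t N, size_t K)
    {
        clblasSetup();                       // one-time library initialisation

        cl_event done;
        // C = 1.0 * A * B + 0.0 * C, all operands already resident on the device.
        clblasSgemm(clblasRowMajor, clblasNoTrans, clblasNoTrans,
                    M, N, K,
                    1.0f, bufA, 0, K,        // A is M x K, leading dimension K
                          bufB, 0, N,        // B is K x N, leading dimension N
                    0.0f, bufC, 0, N,        // C is M x N, leading dimension N
                    1, &queue,               // the command queue(s) to enqueue on
                    0, NULL, &done);
        clWaitForEvents(1, &done);

        clblasTeardown();
    }

The key point is that clBLAS itself enqueues its kernels on the queue you pass in; there is no entry point you could call from inside one of your own OpenCL kernels.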
