What is the difference between UMat and Mat in OpenCV? - opencv

I have been through the documentation and didn't get a clear detailed description about UMat; however I think it has something to relate with GPU and CPU. Please help me out.
Thank you.

Perhaps section 3 of this document will help: [link now broken]
https://software.intel.com/sites/default/files/managed/2f/19/inde_opencv_3.0_arch_guide.pdf
Specifically, section 3.1:
A unified abstraction cv::UMat that enables the same APIs to be implemented using CPU or OpenCL code, without a requirement to call OpenCL accelerated version explicitly. These functions use an OpenCL-enabled GPU if exists in the system, and automatically switch to CPU operation otherwise.
and section 3.3:
Generally, the cv::UMat is the C++ class, which is very similar to cv::Mat. But the actual UMat data can be located in a regular system memory, dedicated video memory, or shared memory.
Link to usage suggested in the comments by #BourbonCreams:
https://docs.opencv.org/3.0-rc1/db/dfa/tutorial_transition_guide.html#tutorial_transition_hints_opencl

Related

OpenCL subbuffers, why is important?

I try to implement a multi-gpu OpenCL code. In my model, GPUs have to communicate and
exchange data.
I found (I don't remember where, it is been some time) that one solution is to deal with
subbuffers. Can anybody explain, as simple as possible, why subbuffers are important
in OpenCL? As far as I can understand, one can do exactly the same using only buffers.
Thanks a lot,
Giorgos
Supplementary Question:
What is the best way to exchange data between GPUs?
I am not sure(or I do not know) how sub-buffers will provide solutions to your problem when dealing with Multiple GPU's. AFAIK sub buffers provide a view into a buffer i.e a single buffer can be divided into chunks of smaller buffers(sub buffers) providing a layer of software abstraction, Sub-buffers are advantageous in same cases where in you need keep an offset first element to be zero.
To address multiGPU or MultiDevice problem OpenCL 1.2 provides API from where you can copy memory objects directly from One GPU to other using clEnqueueMigrateMemObjects OpenCL API call http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clEnqueueMigrateMemObjects.html

setting up OpenCL with DirectX 9

I'm completely new to OpenCL and GPU programming in general. Right now I am working on a project where I'm trying to see the performance saves that making use of the GPU in a game has. With this, however, I have ran into a snag; how do I set up my Directx project to speak to the OpenCL code base?
I've been googling this for about a week and haven't been able to find anything. If someone could point me in the right direction, I would be greatful.
OpenCL does not have anything to do with DirectX, it's simply another library.
For OpenCL you'll need an implementation ('SDK'), as Khronos don't provide those (they only provide the specifications).
Intel, AMD and Nvidia all provide one, but they have different requirements and limitations. See here for some of the existing implementations
After installing one of these, you'll have the necessary headers and libraries to code against the OpenCL API and link with OpenCL.dll
There are lots of sample sources in the SDKs or online, you have to write the kernel, the rest is mostly boilerplate code for initialization and kernel compilation.
The specific OpenCL extension that allows sharing of OpenCL buffers as textures and vice versa is cl_khr_d3d10_sharing.txt. http://www.khronos.org/registry/cl/extensions/khr/cl_khr_d3d10_sharing.txt
OpenCL has extensions for sharing memory between DirectX and OpenCL (and also between OpenGL and OpenCL.) This allows you to read or write DirectX buffers, including textures from within OpenCL. Ani's answer mentioned the extension for DirectX 10, but since the question is about DirectX 9, the extension you'll actually be using is cl_khr_dx9_media_sharing.
This extension has just 4 functions:
clGetDeviceIDsFromDX9MediaAdapterKHR
This function allows you to get the OpenCL device IDs of the OpenCL device(s) that can share memory with a given Direct3D 9 device.
clCreateFromDX9MediaSurfaceKHR
This function gets an OpenCL cl_mem memory object for a given Direct3D 9 memory object.
clEnqueueAcquireDX9MediaSurfacesKHR
This function locks the specified shared memory object so that you can read and/or write to it from OpenCL.
clEnqueueReleaseDX9MediaSurfacesKHR
This function unlocks the specified memory object from OpenCL, so that Direct3D can read/write it again.
Once you've used the above functions to share and synchronize access to the memory buffers, everything else on both the Direct3D 9 side and the OpenCL side works as it would otherwise with those particular APIs.
Note that your GPU will need to support the cl_khr_dx9_media_sharing extension in order for this to work. You can check the extensions property of the OpenCL platform and device in order to confirm that this extension is supported.
Some NVidia GPUs support a different extension instead, called cl_nv_d3d9_sharing. The basic idea of how it works is the same as with the cl_khr_dx9_media_sharing extension, but the exact details are a bit different. The biggest difference is just that it has different functions for getting cl_mem objects for different types of Direct3D 9 buffers, rather than just one function to cover all of them.

What language and compiler to write ATI GPU code for?

I know Nvidia has CUDA, but what does ATI have? I dont want to use OpenCL because I want to keep as low level to the hardware as possible.
Is it brook, or stream?
The documentation available is pretty pathetic! CUDA seems easy to get programming, but I want to use ATI specifically because of their hardware.
OpenCL is AMD's currently preferred GPU/compute language.
Brook is deprecated.
However, you can write code at a very low level, using AMD's
shader and kernel analyzer
http://developer.amd.com/tools/shader/Pages/default.aspx.
http://developer.amd.com/tools/AMDAPPKernelAnalyzer/Pages/default.aspx
E.g. http://developer.amd.com/tools/shader/PublishingImages/GSA.png
shows OpenCL code, and the Radeon 5870 assembly produced.
You can actually code directly in several forms of "assembly".
Or at least you could - the webpages no longer mention this.
(I used to have this installed for tuning and testing, but do not at the moment.)
More usually, you can code in any of several forms of AMD IL, Intermediate Language,
which is closer to the machine than OpenCL. The kernel analyzer web page says
"If your kernel is an IL kernel Stream, KernelAnalyzer will automatically compile the IL..."
I would recommend that you use OpenCL, and then look at the disassembly and tweak the OpenCL code to be better tuned. But you can work in IL, and probably still can work at an even lower level.

Does replacing int with short help the performance in CUDA

assume that we have enough global memory. Does replacing int with short improve the performance in CUDA? (like short saves the usage of shared memory, registers, etc)
Advices are welcomed. Thanks.
Using short in shared memory will most likely reduce performance due to bank-conflicts, until you use short2.
Also, as far as I know, all registers on GPU are 32-bit, so it's unlikely that using short would reduce register usage.
Depends:
If your program is memory bound then Yes transferring the input as shorts could be beneficial.
If your kernel is computation bound is more likely to be No because the kernel have to do an extra operation to convert from short to int and then back to short each time.
Tesla-class hardware (SM 1.x) has surprisingly rich support for "half registers," so you might get some mileage from using short instead of int on those platforms. You can confirm by using cuobjdump to look at the microcode in the cubin. But Fermi removed that support.
With SM 2.1, NVIDIA added support for "video" instructions that implement 32-bit-wide SIMD operations on 32-bit registers - see section 8.7.9 of the PTX 2.1 spec.
http://developer.download.nvidia.com/compute/cuda/3_1/toolkit/docs/ptx_isa_2.1.pdf

Newbie to GPU programming: what to learn?

I am rendering a certain scene to an off-screen frame buffer (FBO) and then I'm reading the rendered image using glReadPixels() for processing on the CPU. The processing involves some very simple scanning routines and extraction of data.
After profiling I realized that most of what my application does is spend time in glReadPixels() - more than 50% of the time. So the natural step is to move the processing to the GPU so that the data would not have to be copied.
So my question is - what would be the best way to program such a thing to the GPU?
GLSL?
CUDA?
Anything else I'm not currently aware of?
The main requirements is that it'll have access to The rendered off-screen frame bufferes (or texture data since it is possible to render to a texture) and to be able to output some information to the CPU, say in the order of 1-2Kb per frame.
You might find the answers in the "Intro to GPU programming" questions useful.
-Adam
There are a number of pointers to getting started with GPU programming in other questions, but if you have an application that is already built using OpenGL, then probably your question really is "which one will interoperate with OpenGL"?
After all, your whole point is to avoid the overhead of reading your FBO back from the GPU to the CPU with glReadPixels(). If, for example you had to read it back anyway, then copy the data into a CUDA buffer, then transfer it back to the gpu using CUDA APIs, there wouldn't be much point.
So you need a GPGPU package that will take your OpenGL FBO object as an input directly, without any extra copying.
That would probably rule out everything except GLSL.
I'm not 100% sure whether CUDA has any way of operating directly on an OpenGL buffer object, but I don't think it has that feature.
I am sure that ATI's Stream SDK doesn't do that. (Although it will interoperate with DirectX.)
I doubt that the DirectX 11 "technology preview" with compute shaders has that feature, either.
EDIT: Follow-up: it looks like CUDA, at least the most recent version, has some support for OpenGL interoperability. If so, that's probably your best bet.
I recently found this Modern GPU
You may find OpenAI Triton useful

Resources