OpenCL sub-buffers: why are they important? - communication

I am trying to implement a multi-GPU OpenCL code. In my model, the GPUs have to communicate and
exchange data.
I found (I don't remember where, it has been some time) that one solution is to work with
sub-buffers. Can anybody explain, as simply as possible, why sub-buffers are important
in OpenCL? As far as I can tell, one can do exactly the same thing using only buffers.
Thanks a lot,
Giorgos
Supplementary Question:
What is the best way to exchange data between GPUs?

I am not sure (or I do not know) how sub-buffers will solve your problem when dealing with multiple GPUs. AFAIK, a sub-buffer provides a view into a buffer, i.e. a single buffer can be divided into chunks of smaller buffers (sub-buffers), providing a layer of software abstraction. Sub-buffers are advantageous in some cases, for example when you need the offset of the first element to be zero.
To address the multi-GPU or multi-device problem, OpenCL 1.2 provides an API with which you can migrate memory objects directly from one GPU to another: the clEnqueueMigrateMemObjects call, http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clEnqueueMigrateMemObjects.html
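As a rough illustration (the queue, buffer and size names below are placeholders, not from the original question), migrating a buffer to another GPU's queue and carving a sub-buffer out of a larger buffer might look like this:

```cpp
#include <CL/cl.h>

// Sketch only: 'queue_gpu1' and 'buf' are placeholder names for a command
// queue on the destination GPU and a buffer created in the shared context.
cl_int migrate_to_gpu1(cl_command_queue queue_gpu1, cl_mem buf)
{
    // Enqueue the migration on the destination device's queue; the runtime
    // moves the backing storage of 'buf' to that device (OpenCL 1.2+).
    return clEnqueueMigrateMemObjects(queue_gpu1, 1, &buf,
                                      0 /* migrate to the queue's device */,
                                      0, NULL, NULL);
}

// A sub-buffer is just a view onto a region of an existing buffer, which is
// handy when each device should work on its own slice of one large buffer.
// The origin must respect the device's CL_DEVICE_MEM_BASE_ADDR_ALIGN.
cl_mem make_slice(cl_mem parent, size_t origin_bytes, size_t size_bytes,
                  cl_int *err)
{
    cl_buffer_region region = { origin_bytes, size_bytes };
    return clCreateSubBuffer(parent, CL_MEM_READ_WRITE,
                             CL_BUFFER_CREATE_TYPE_REGION, &region, err);
}
```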

Related

What does the Streaming stand for in Streaming SIMD Extensions (SSE)?

I've looked everywhere and I still can't figure it out. I know of two associations you can make with streams:
Wrappers for backing data stores meant as an abstraction layer between consumers and suppliers
Data becoming available with time, not all at once
SIMD stands for Single Instruction, Multiple Data; in the literature the instructions are often said to come from a stream of instructions. This corresponds to the second association.
However, I don't exactly understand the reason for the "Streaming" in Streaming SIMD Extensions (or in Streaming Multiprocessor, for that matter). The instructions come from a stream, but can they come from anywhere else? Do we, or could we, have just SIMD extensions or just multiprocessors?
Tl;dr: can CPU instructions be non-streaming, i.e. not come from a stream?
SSE was introduced as an instruction set to improve performance in multimedia applications. The aim was to quickly stream in some data (a chunk of a DVD to decode, for example), process it quickly using SIMD, and then stream the result out (e.g. to the graphics RAM). (Almost) all SSE instructions have a variant that reads 16 bytes from memory, and the instruction set also contains instructions to control the CPU cache and the hardware prefetcher. It's pretty much just a marketing term.
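As a minimal sketch of that "stream in, process with SIMD, stream out" pattern (the function, array names and alignment assumptions are illustrative, not part of the original answer):

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics

// Scale a float array by 2.0, four elements at a time. Assumes n is a
// multiple of 4 and both pointers are 16-byte aligned (illustrative only).
void scale_stream(const float *src, float *dst, std::size_t n)
{
    const __m128 two = _mm_set1_ps(2.0f);
    for (std::size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_load_ps(src + i);  // read 16 bytes in one instruction
        v = _mm_mul_ps(v, two);           // Single Instruction, Multiple Data
        _mm_stream_ps(dst + i, v);        // non-temporal "streaming" store,
                                          // bypasses the cache
    }
    _mm_sfence();  // make the streamed stores globally visible
}
```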

What is the difference between UMat and Mat in OpenCV?

I have been through the documentation and didn't find a clear, detailed description of UMat; however, I think it has something to do with the GPU and CPU. Please help me out.
Thank you.
Perhaps section 3 of this document will help (link now broken): https://software.intel.com/sites/default/files/managed/2f/19/inde_opencv_3.0_arch_guide.pdf
Specifically, section 3.1:
A unified abstraction cv::UMat that enables the same APIs to be implemented using CPU or OpenCL code, without a requirement to call OpenCL accelerated version explicitly. These functions use an OpenCL-enabled GPU if exists in the system, and automatically switch to CPU operation otherwise.
and section 3.3:
Generally, the cv::UMat is the C++ class, which is very similar to cv::Mat. But the actual UMat data can be located in a regular system memory, dedicated video memory, or shared memory.
Link to usage, suggested in the comments by @BourbonCreams:
https://docs.opencv.org/3.0-rc1/db/dfa/tutorial_transition_guide.html#tutorial_transition_hints_opencl
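As a small hedged example of that transparent API in practice (the file name and filter choice are placeholders): the same cv::blur call below may be dispatched to OpenCL if a suitable device is present, and otherwise runs on the CPU.

```cpp
#include <opencv2/core.hpp>
#include <opencv2/imgcodecs.hpp>
#include <opencv2/imgproc.hpp>

int main()
{
    // Read into a regular (CPU) cv::Mat; "input.png" is just a placeholder.
    cv::Mat img = cv::imread("input.png", cv::IMREAD_GRAYSCALE);

    // Copy into a cv::UMat; its data may live in system, video or shared
    // memory, and OpenCV decides where to run the operation.
    cv::UMat uimg, ublurred;
    img.copyTo(uimg);
    cv::blur(uimg, ublurred, cv::Size(5, 5));

    // Bring the result back to a cv::Mat when CPU-side access is needed.
    cv::Mat result = ublurred.getMat(cv::ACCESS_READ);
    return 0;
}
```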

Spartan 6 SP605 VHDL external ram usage?

I'm new to using VHDL and have run into an issue with my project. I'm trying to make an FPGA design that converts from one communication protocol to another, and for this purpose it would be useful to be able to store (hopefully multiple) packets before converting.
Previously I tried to store this data in arrays, but it quickly became apparent that this takes up far too much space on the FPGA. Therefore, I have been searching for a way to store the data in the DDR3 RAM on the SP605 board (http://www.xilinx.com/support/documentation/boards_and_kits/xtp067_sp605_schematics.pdf, page 9). However, I cannot find instructions on how to write data to or read data from it. I'm trying to store one 8-bit std_logic_vector per clock cycle, to be accessed later.
Can anyone advise me on how to proceed?
Xilinx offers an IP core generator; its IP catalog contains a Memory Interface Generator (MIG), which generates an IP core to access different memory types. Configure this core for DDR3.
Writing a DDR3 controller in VHDL is not a project for a beginner, nor even for an experienced designer.
The state machine is simple and well known, but the calibration logic is very costly.
You should consider a caching or burst read/write technique, because DDR memory cannot be accessed in every cycle.

Understanding Core Video's CVPixelBufferPool and CVOpenGLESTextureCache semantics

I'm refactoring my iOS OpenGL-based rendering pipeline. My pipeline consists of many rendering steps, hence I need a lot of intermediate textures to render to and read from. Those textures are of various types (unsigned byte and half float) and may have different numbers of channels.
To save memory and allocation effort I recycled textures that were used by previous steps in the pipeline and were no longer needed. In my previous implementation I did that on my own.
In my new implementation I want to use the APIs provided by the Core Video framework for that; especially since they provide much faster access to the texture memory from the CPU. I understand that the CVOpenGLESTextureCache allows me to create OpenGL textures out of CVPixelBuffers that can be created directly or using a CVPixelBufferPool. However, I am unable to find any documentation describing how they really work and how they play together.
Here are the things I want to know:
For getting a texture from the CVOpenGLESTextureCache I always need to provide a pixel buffer. Why is it called a "cache" if I need to provide the memory anyway and am not able to retrieve an old, unused texture?
The CVOpenGLESTextureCacheFlush function "flushes currently unused resources". How does the cache know if a resource is "unused"? Are textures returned to the cache when I release the corresponding CVOpenGLESTextureRef? The same question applies to the CVPixelBufferPool.
Am I able to maintain textures with different properties (type, number of channels, ...) in one texture cache? Does the cache know whether a texture can be re-used or needs to be created for my request?
CVPixelBufferPools seem only to be able to manage buffers of the same size and type. This means I need to create one dedicated pool for each texture configuration I'm using, correct?
I'd be really happy if at least some of those questions could be clarified.
Yes, well, you will not actually be able to find anything. I looked and looked, and the short answer is that you just need to test things out to see how the implementation behaves. You can find my blog post on the subject, along with example code, at opengl_write_texture_cache. Basically, the way it seems to work is that the texture cache object "holds" the association between a buffer (in the pool) and the OpenGL texture that is bound when a triangle render is executed. The result is that the same buffer should not be returned by the pool until after OpenGL is done with it. In the weird case of some kind of race condition, the pool might grow by one buffer to account for a buffer that is held too long. What is really nice about the texture cache API is that one only needs to write to the data buffer once, as opposed to calling an API like glTexImage2D(), which would "upload" the data to the graphics card.
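A rough per-frame sketch of how the pieces fit together, assuming a CVPixelBufferPoolRef and a CVOpenGLESTextureCacheRef were already created elsewhere with matching attributes (the names, dimensions and pixel format are placeholders, not from the question):

```cpp
#include <CoreVideo/CoreVideo.h>
#include <OpenGLES/ES2/gl.h>

void renderOneFrame(CVPixelBufferPoolRef pool,
                    CVOpenGLESTextureCacheRef cache,
                    int width, int height)
{
    // 1. Ask the pool for a pixel buffer; a previously released buffer is
    //    recycled if one is free, otherwise a new one is allocated.
    CVPixelBufferRef pixelBuffer = NULL;
    CVPixelBufferPoolCreatePixelBuffer(kCFAllocatorDefault, pool, &pixelBuffer);

    // 2. Wrap the buffer in an OpenGL ES texture via the texture cache.
    //    The GL format here must match the pool's pixel format attributes.
    CVOpenGLESTextureRef texture = NULL;
    CVOpenGLESTextureCacheCreateTextureFromImage(kCFAllocatorDefault, cache,
                                                 pixelBuffer, NULL,
                                                 GL_TEXTURE_2D, GL_RGBA,
                                                 width, height,
                                                 GL_RGBA, GL_UNSIGNED_BYTE,
                                                 0, &texture);

    glBindTexture(CVOpenGLESTextureGetTarget(texture),
                  CVOpenGLESTextureGetName(texture));
    // ... render into / sample from this texture ...

    // 3. Releasing the refs does not free the memory immediately: the cache
    //    and pool keep the buffer alive until OpenGL is done with it, after
    //    which the pool may hand the same buffer out again.
    CFRelease(texture);
    CVPixelBufferRelease(pixelBuffer);
    CVOpenGLESTextureCacheFlush(cache, 0);
}
```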

Newbie to GPU programming: what to learn?

I am rendering a certain scene to an off-screen frame buffer (FBO) and then I'm reading the rendered image using glReadPixels() for processing on the CPU. The processing involves some very simple scanning routines and extraction of data.
After profiling I realized that most of what my application does is spend time in glReadPixels() - more than 50% of the time. So the natural step is to move the processing to the GPU so that the data would not have to be copied.
So my question is - what would be the best way to program such a thing to the GPU?
GLSL?
CUDA?
Anything else I'm not currently aware of?
The main requirement is that it has access to the rendered off-screen frame buffers (or texture data, since it is possible to render to a texture) and is able to output some information to the CPU, say on the order of 1-2 KB per frame.
You might find the answers in the "Intro to GPU programming" questions useful.
-Adam
There are a number of pointers to getting started with GPU programming in other questions, but if you have an application that is already built using OpenGL, then probably your question really is "which one will interoperate with OpenGL"?
After all, your whole point is to avoid the overhead of reading your FBO back from the GPU to the CPU with glReadPixels(). If, for example, you had to read it back anyway, then copy the data into a CUDA buffer, then transfer it back to the GPU using CUDA APIs, there wouldn't be much point.
So you need a GPGPU package that will take your OpenGL FBO object as an input directly, without any extra copying.
That would probably rule out everything except GLSL.
I'm not 100% sure whether CUDA has any way of operating directly on an OpenGL buffer object, but I don't think it has that feature.
I am sure that ATI's Stream SDK doesn't do that. (Although it will interoperate with DirectX.)
I doubt that the DirectX 11 "technology preview" with compute shaders has that feature, either.
EDIT: Follow-up: it looks like CUDA, at least the most recent version, has some support for OpenGL interoperability. If so, that's probably your best bet.
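For what it's worth, a minimal host-side sketch of what that CUDA/OpenGL interop path can look like with the CUDA runtime API (the texture id, flags and the kernel that does the scanning are placeholders, not part of the original answers):

```cpp
#include <GL/gl.h>
#include <cuda_runtime.h>
#include <cuda_gl_interop.h>

// Placeholder handle for the FBO's colour texture once registered with CUDA.
static cudaGraphicsResource_t cudaTexResource = nullptr;

void registerFboTexture(GLuint fboColorTex)
{
    // One-time registration of the OpenGL texture with CUDA (read-only here).
    cudaGraphicsGLRegisterImage(&cudaTexResource, fboColorTex, GL_TEXTURE_2D,
                                cudaGraphicsRegisterFlagsReadOnly);
}

void processFrameOnGpu()
{
    // Map the texture for CUDA access; OpenGL must be done rendering to it.
    cudaGraphicsMapResources(1, &cudaTexResource, 0);

    cudaArray_t array = nullptr;
    cudaGraphicsSubResourceGetMappedArray(&array, cudaTexResource, 0, 0);

    // ... bind 'array' to a CUDA texture/surface object and launch a kernel
    // that scans the image and writes its small (1-2 KB) result into a device
    // buffer, which is then copied back to the host with cudaMemcpy ...

    cudaGraphicsUnmapResources(1, &cudaTexResource, 0);
}
```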
I recently found this: Modern GPU
You may find OpenAI Triton useful
