Transfer data from Mat/oclMat to cl_mem (OpenCV + OpenCL) - opencv

I am working on a project that needs a lot of OpenCL code. I am using OpenCV's ocl module to develop my project faster but there are some functions not implemented and I will have to write my own OpenCL code.
My question is this: what is the quickest and cheapest way to transfer data from Mat and/or oclMat to a cl_mem array. Re-wording this, is there a good way to transfer or enqueue (clEnqueueWriteBuffer) data from oclMat or Mat?
Currently, I am using a for-loop to read data from Mat (or download from oclMat and then use for-loops) and then enqueuing it. This is turning out to be costly, hence my question.
Thanks to anyone who sees this question :)

I've written a set of interop functions for the Boost.Compute library which ease the use of OpenCL and OpenCV. Take a look at the opencv_copy_mat_to_buffer() function.
There are also functions for copying from a OpenCL buffer back to the host cv::Mat and for copying cv::Mat to OpenCL image2d objects.

Calculate memory bandwidth, achieved in Host-Device interconnections.
If you get ~60% and more of maximal bandwidth, you've nothing to do, memory transfer is as fast as it can be. But if your bandwidth results are lower that 55% - 60% of theoretical maximum, try to use multiple command queues with unblocking operations (don't forget to sync at the end). Also, pay attention on avg image size. Small data transfers usually have big overhead rate.
If your Device uses shared memory, use memory mapping instead of read/write, this may dramatically save time. If Device has it's own memory, apply pinned memory technique, which is well described in NVIDIA OpenCL Best Practices Guide.

The documentation of oclMat states that there is some sort of functionality to the underlying ocl buffer data:
//! pointer to the data(OCL memory object)
uchar *data;
If you have clMat already in the device, you can simply perform a copy buffer from clMat.data to your clBuffer. But you will have to hack a little bit the memory, accessing some private members of the oclMat
Something like:
clEnqueueCopyBuffer(command_queue, (clBuffer *)oclMat.data, dst_buffer, 0, 0, size);
NOTE: Take care with the casting, maybe you have to cast another pointer.

For your comment, it's right. The oclMat can be used as cl_mem(void *) for device, since it was alloced by OpenCL device.
Additionally, you can creat svm memory(for example void* svmdata) at first, and then assign a Mat like: Mat A(rows, cols, CV_32FC1, svmdata).
Now you can process the Mat A between host and device without memory copy.
(PS. The svm memory is the new character of OCL, it can be created by clSVMAlloc).

Related

When to use setVertexBytes/setVertexBuffer when dealing with small data in Metal?

The documentation for setVertexBytes says:
Use this method for single-use data smaller than 4 KB. Create a MTLBuffer object if your data exceeds 4 KB in length or persists for multiple uses.
What exactly does single-use mean?
For example, if I have a uniforms struct which is less than 4KB(and is updated every frame), is it better to use a triple buffer technique or simply use setVertexBytes?
From what I understand using setVertexBytes would copy the data every time into a MTLBuffer that Metal manages. This sounds slower than using triple buffering.
But then if I have different objects, each with its own uniforms, I would have to triple buffer everything, since it's dynamically updated.
And if I have a material that updates rarely but is passed to the shader every frame, would it be better to keep it in a buffer or pass it as a pointer using setVertexBytes?
It's not necessarily the case that Metal manages a distinct resource into which this data is written. As user Columbo notes in their comment, some hardware allows constant data to be recorded directly into command buffer memory, from which it can be subsequently read by a shader.
As always, you should profile in order to find the difference between the two approaches on your target hardware, but if the amount of data you're pushing per draw call is small, you might very well find that using setVertexBytes:... is faster than writing into a buffer and calling setVertexBuffer:....
For data that doesn't vary every frame (your slow-varying material use case), it may indeed be more efficient to keep that data in a buffer (double- or triple-buffered) rather than using setVertexBytes:....

why it's so slow in data exchanging between CPU and GPU memory?

It's my first time using openCL on ARM(CPU:Qualcomm Snapdragon MSM8930, GPU:Adreno(TM)305).
I find using openCL is really very effective, but data exchanging between CPU and GPU takes too much time, as much as I can't imaging.
Here is an example:
cv::Mat mat(640,480,CV_8UC3,cv::Scalar(0,0,0));
cv::ocl::oclMat mat_ocl;
//cpu->gpu
mat_ocl.upload(mat);
//gpu->cpu
mat = (cv::Mat)mat_ocl;
Just a small image like this, the upload option takes 10ms, and download option takes 20ms! That takes too long.
Can anyone could tell me is this situation normal? Or something goes wrong here?
Thank you in advance!
added:
my messuring method is
clock_t start,end;
start=clock();
mat_ocl.upload(mat);
end = clock();
__android_log_print(ANDROID_LOG_INFO,"tag","upload time = %f s",(double)(end-start)/CLOCKS_PER_SEC);
Actually, I'm not using openCL exactly, but ocl module in openCV(although it says they are equal). When reading openCV documents, I find it's just tell us to transform cv::Mat to cv::ocl::oclMat (which is data uploading from CPU to GPU)to do GPU calculation, but I haven't found memory mapping method in the ocl module documents.
Well, I found some useful introductions in openCV doc:
In a heterogeneous device environment, there may be cost associated with data transfer. This would be the case, for example, when data needs to be moved from host memory (accessible to the CPU), to device memory (accessible to a discrete GPU). in the case of integrated graphics chips, there may be performance issues, relating to memory coherency between access from the GPU “part” of the integrated device, or the CPU “part.” For best performance, in either case, it is recommended that you do not introduce data transfers between CPU and the discrete GPU, except in the beginning and the end of the algorithmic pipeline.
So, it seems explain the reason why speed of data transfer between CPU and GPU is so slow. But I still don't know how to fix this issue.
Provide exact measuring methods and results.
From experience of OpenCL development under ARM platforms (not Qcom, though), I can say that you shouldn't expect much of read-write operations. Memory bus is usually like 64bit, plus DDR3 isn't that fast.
Use shared memory for your advantage - go for mapping/unmapping instead of read/write.
P. S. actual operation time is measured, using cl_event profiling:
cl_ulong getTimeNanoSeconds(cl_event event)
{
cl_ulong start = 0, end = 0;
cl_int ret = clWaitForEvents(1, &event);
if (ret != CL_SUCCESS)
throw(ret);
ret = clGetEventProfilingInfo(
event,
CL_PROFILING_COMMAND_START,
sizeof(cl_ulong),
&start,
NULL);
if (ret != CL_SUCCESS)
throw(ret);
ret = clGetEventProfilingInfo(
event,
CL_PROFILING_COMMAND_END,
sizeof(cl_ulong),
&end,
NULL);
if (ret != CL_SUCCESS)
throw(ret);
return (end - start);
}

Cascading Image filters in opencl without using Texture Memory

I am working on a custom device that supports OpenCL 1.2 Embedded Profile and does not have Image support or Texture Memory. I have to pass an image through a Sobel filter and then a Median filter. What could be the best (fastest) way of doing this? Can I avoid having to send the image back to the host after Sobel filter and then reading it back on the device for Median filter? Where to store the intermediate image, global memory, local memory or elsewhere?
You can keep the buffer in the global memory of the device between kernel calls to avoid the extra copies. When you create the buffer, make sure you use the flag 'CL_MEM_READ_WRITE', this will allow the Sobel kernel to write to it, and the Median kernel to read from it afterward. You can get away with two buffers, but I would use three if memory is not a restriction.
create 3 buffers. call them whatever you'd like. (originalBuff, middleBuff, finalBuff)
copy the image data to originalBuff
optionally set other buffers to an all-zero state (can be done on the device by the kernels which write to these buffers)
call the sobel filter kernel with params (originalBuff, middleBuff)
call median kernel with params (middleBuff, finalBuff)
read finalBuff back to host
I left out the other steps, such as creating context/program/queue/etc.. in order to focus on the answer to your question.
Read about clCreateBuffer here.
EDIT:
I have not tried the flag 'CL_MEM_HOST_NO_ACCESS' before, but I think it is worth a try. In my example, middleBuff might benefit from this flag. Like most opencl features, any possible benefit would be implementation-dependent.

Fast way to swap endianness using opencl

I'm reading and writing lots of FITS and DNG images which may contain data of an endianness different from my platform and/or opencl device.
Currently I swap the byte order in the host's memory if necessary which is very slow and requires an extra step.
Is there a fast way to pass a buffer of int/float/short having wrong endianess to an opencl-kernel?
Using an extra kernel run just for fixing the endianess would be ok; using some overheadless auto-fixing-read/-write operation would be perfect.
I know about the variable attribute ((endian(host/device))) but this doesn't help with a big endian FITS file on a little endian platform using a little endian device.
I thought about a solution like this one (neither implemented nor tested, yet):
uint4 mask = (uint4) (3, 2, 1, 0);
uchar4 swappedEndianness = shuffle(originalEndianness, mask);
// to be applied on a float/int-buffer somehow
Hoping there's a better solution out there.
Thanks in advance,
runtimeterror
Sure. Since you have a uchar4 - you can simply swizzle the components and write them back.
output[tid] = input[tid].wzyx;
swizzling is very also performant on SIMD architectures with very little cost, so you should be able to combine it with other operations in your kernel.
Hope this helps!
Most processor architectures perform best when using instructions to complete the operation which can fit its register width, for example 32/64-bit width. When CPU/GPU performs such byte-wise operators, using subscripts .wxyz for uchar4, they needs to use a mask to retrieve each byte from the integer, shift the byte, and then using integer add or or operator to the result. For the endianness swaping, the processor needs to perform above integer and, shift, add/or for 4 times because there are 4 bytes.
The most efficient way is as follows
#define EndianSwap(n) (rotate(n & 0x00FF00FF, 24U)|(rotate(n, 8U) & 0x00FF00FF)
n could be in any gentype, for example, an uint4 variable. Because OpenCL does not allow C++ type overloading, so the best choice is macro.

Question about cl_mem in OpenCL

I have been using cl_mem in some of my OpenCL boilerplate code, but I have been using it through context and not a sharp understanding of what exactly it is. I have been using it as a type for the memory I push on and off the board, which has so far been floats. I tried looking at the OpenCL docs, but cl_mem doesn't show up (does it?). Is there any documentation on it, or is it simple and can someone explain.
The cl_mem type is a handle to a "Memory Object" (as described in Section 3.5 of the OpenCL 1.1 Spec). These essentially are inputs and outputs for OpenCL kernels, and are returned from OpenCL API calls in host code such as clCreateBuffer
cl_mem clCreateBuffer (cl_context context, cl_mem_flags flags,
size_t size, void *host_ptr, cl_int *errcode_ret)
The memory areas represented can be permitted different access patterns e.g. Read Only, or be allocated in different memory regions, depending on the flags set in the create buffer calls.
The handle is typically stored to allow a later call to release the memory, e.g:
cl_int clReleaseMemObject (cl_mem memobj)
In short, it provides an abstraction over where the memory actually is: you can copy data into the associated memory or back out via the OpenCL APIs clEnqueueWriteBuffer and clEnqueueReadBuffer, but the OpenCL implementation can allocate the space where it wants.
For the computer a cl_mem is a number (like a file handler for Linux) that is reserved for the use as a "memory identifier"( the API/driver whatever stores information about your memory under this number that it knows what it holds/how big it is and stuff like that)

Resources