OpenCV CUDA API very slow at the first call

I am using cuda::resize to resize a vector of images (in GpuMat).
It shows the first call takes ~15 ms, and the rest take only ~0.3 ms. So I want to ask if there are ways to shorten the time of the first call.
Here is my code (simplified):
for (int i = 0; i < num_images; ++i)
{
    full_img = v_GpuMat[i].clone(); // v_GpuMat is a vector of images in cuda::GpuMat
    seam_scale = 0.4377;
    cuda::resize(full_img, img, Size(), seam_scale, seam_scale, INTER_LINEAR);
}
Thank you very much.

CUDA device memory allocations, and copying data from device to host and vice versa, are very slow. Please try to allocate memory and load data outside the main loop. Cloning a matrix allocates new device memory each time; try copying the data instead of cloning it, which should speed up your code.

After checking the result in the NVIDIA Visual Profiler, I found it is cudaLaunchKernel that takes ~20 ms, and it is only called the first time.
If you have to run a continuous process like mine, one solution may be a dry run before you process your own tasks. For this sample, making one cuda::resize call outside the loop makes the loop itself much faster.
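For illustration, a minimal warm-up sketch (assuming OpenCV was built with CUDA and opencv2/cudawarping.hpp is included; the sizes and variable names here are placeholders, not the poster's actual code):

cv::cuda::GpuMat warm_src(64, 64, CV_8UC3), warm_dst;
warm_src.setTo(cv::Scalar::all(0));
// Throw-away call: absorbs the one-time kernel/module load cost before timing starts
cv::cuda::resize(warm_src, warm_dst, cv::Size(), 0.5, 0.5, cv::INTER_LINEAR);
// ...now run the real resize loop from the question; each call should stay
// near the ~0.3 ms steady-state time.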

Related

why it's so slow in data exchanging between CPU and GPU memory?

It's my first time using OpenCL on ARM (CPU: Qualcomm Snapdragon MSM8930, GPU: Adreno (TM) 305).
I find OpenCL really effective, but data exchange between the CPU and the GPU takes far more time than I could have imagined.
Here is an example:
cv::Mat mat(640,480,CV_8UC3,cv::Scalar(0,0,0));
cv::ocl::oclMat mat_ocl;
//cpu->gpu
mat_ocl.upload(mat);
//gpu->cpu
mat = (cv::Mat)mat_ocl;
For just a small image like this, the upload takes 10 ms and the download takes 20 ms! That is far too long.
Can anyone tell me whether this is normal, or whether something is wrong here?
Thank you in advance!
Added:
My measuring method is:
clock_t start,end;
start=clock();
mat_ocl.upload(mat);
end = clock();
__android_log_print(ANDROID_LOG_INFO,"tag","upload time = %f s",(double)(end-start)/CLOCKS_PER_SEC);
Actually, I'm not using OpenCL directly, but the ocl module in OpenCV (although it is said they are equivalent). Reading the OpenCV documentation, it just tells us to transform cv::Mat into cv::ocl::oclMat (which uploads data from CPU to GPU) to do GPU calculations, but I haven't found a memory-mapping method in the ocl module documentation.
Well, I found some useful explanation in the OpenCV docs:
In a heterogeneous device environment, there may be cost associated with data transfer. This would be the case, for example, when data needs to be moved from host memory (accessible to the CPU), to device memory (accessible to a discrete GPU). in the case of integrated graphics chips, there may be performance issues, relating to memory coherency between access from the GPU “part” of the integrated device, or the CPU “part.” For best performance, in either case, it is recommended that you do not introduce data transfers between CPU and the discrete GPU, except in the beginning and the end of the algorithmic pipeline.
So that seems to explain why data transfer between the CPU and the GPU is so slow. But I still don't know how to fix this issue.
Provide exact measuring methods and results.
From experience of OpenCL development on ARM platforms (not Qualcomm, though), I can say that you shouldn't expect much from read/write operations. The memory bus is usually only about 64 bits wide, and DDR3 isn't that fast.
Use shared memory to your advantage: go for mapping/unmapping instead of read/write.
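A minimal sketch of the map/unmap idea with the plain OpenCL C API (queue, buffer, host_data and size are placeholders, error checking omitted; on many SoCs the buffer should also be created with CL_MEM_ALLOC_HOST_PTR to get true zero-copy behaviour):

cl_int err = CL_SUCCESS;
// Map the device buffer into the host address space instead of copying it
void* ptr = clEnqueueMapBuffer(queue, buffer, CL_TRUE, CL_MAP_WRITE,
                               0, size, 0, NULL, NULL, &err);
memcpy(ptr, host_data, size);   // fill it directly from the CPU
clEnqueueUnmapMemObject(queue, buffer, ptr, 0, NULL, NULL);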
P.S. The actual operation time is measured using cl_event profiling:
cl_ulong getTimeNanoSeconds(cl_event event)
{
    cl_ulong start = 0, end = 0;
    cl_int ret = clWaitForEvents(1, &event);
    if (ret != CL_SUCCESS)
        throw(ret);
    ret = clGetEventProfilingInfo(
        event,
        CL_PROFILING_COMMAND_START,
        sizeof(cl_ulong),
        &start,
        NULL);
    if (ret != CL_SUCCESS)
        throw(ret);
    ret = clGetEventProfilingInfo(
        event,
        CL_PROFILING_COMMAND_END,
        sizeof(cl_ulong),
        &end,
        NULL);
    if (ret != CL_SUCCESS)
        throw(ret);
    return (end - start);
}
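Possible usage of the helper above, assuming the command queue was created with CL_QUEUE_PROFILING_ENABLE (queue, buffer, host_ptr and size are placeholders):

cl_event ev;
clEnqueueWriteBuffer(queue, buffer, CL_FALSE, 0, size, host_ptr, 0, NULL, &ev);
cl_ulong ns = getTimeNanoSeconds(ev);   // waits on the event, then reads the timestamps
printf("upload took %.3f ms\n", ns / 1.0e6);
clReleaseEvent(ev);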

Transfer data from Mat/oclMat to cl_mem (OpenCV + OpenCL)

I am working on a project that needs a lot of OpenCL code. I am using OpenCV's ocl module to develop my project faster but there are some functions not implemented and I will have to write my own OpenCL code.
My question is this: what is the quickest and cheapest way to transfer data from a Mat and/or oclMat to a cl_mem array? Rewording this: is there a good way to transfer or enqueue (clEnqueueWriteBuffer) data from an oclMat or Mat?
Currently, I am using a for-loop to read data from Mat (or download from oclMat and then use for-loops) and then enqueuing it. This is turning out to be costly, hence my question.
Thanks to anyone who sees this question :)
I've written a set of interop functions for the Boost.Compute library which ease the use of OpenCL and OpenCV. Take a look at the opencv_copy_mat_to_buffer() function.
There are also functions for copying from an OpenCL buffer back to the host cv::Mat, and for copying a cv::Mat to OpenCL image2d objects.
Calculate the memory bandwidth achieved on the host-device interconnect.
If you get ~60% or more of the maximal bandwidth, there is nothing more to do; the memory transfer is as fast as it can be. But if your bandwidth results are lower than 55-60% of the theoretical maximum, try using multiple command queues with non-blocking operations (don't forget to synchronize at the end). Also, pay attention to the average image size: small data transfers usually have a large relative overhead.
If your device uses shared memory, use memory mapping instead of read/write; this may save a dramatic amount of time. If the device has its own memory, apply the pinned-memory technique, which is well described in the NVIDIA OpenCL Best Practices Guide.
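A rough sketch of that pinned-memory pattern (following the NVIDIA guide; context, queue, host_data and size are placeholders, error checks omitted):

cl_int err;
// 1. Pinned (page-locked) host-side staging buffer
cl_mem pinned = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                               size, NULL, &err);
// 2. Ordinary device buffer
cl_mem device_buf = clCreateBuffer(context, CL_MEM_READ_WRITE, size, NULL, &err);
// 3. Map the pinned buffer once to obtain a page-locked host pointer
void* staging = clEnqueueMapBuffer(queue, pinned, CL_TRUE, CL_MAP_WRITE,
                                   0, size, 0, NULL, NULL, &err);
// 4. Transfers from the page-locked staging area can use DMA, which is
//    usually faster than transfers from ordinary pageable memory
memcpy(staging, host_data, size);
clEnqueueWriteBuffer(queue, device_buf, CL_FALSE, 0, size, staging, 0, NULL, NULL);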
The documentation of oclMat states that it exposes the underlying OpenCL memory object:
//! pointer to the data(OCL memory object)
uchar *data;
If you already have the oclMat on the device, you can simply copy the buffer from oclMat.data to your cl_mem buffer. But you will have to hack around a little, accessing some private members of the oclMat.
Something like:
clEnqueueCopyBuffer(command_queue, (cl_mem)oclMat.data, dst_buffer, 0, 0, size, 0, NULL, NULL);
NOTE: Take care with the cast; you may have to cast a different pointer.
Regarding your comment, that's right. The oclMat can be used as a cl_mem (void *) on the device side, since it was allocated by the OpenCL device.
Additionally, you can create SVM memory first (for example void* svmdata) and then wrap it in a Mat like: Mat A(rows, cols, CV_32FC1, svmdata).
Now you can process the Mat A between host and device without any memory copy.
(PS: SVM memory is a new feature of OpenCL 2.0; it can be created with clSVMAlloc.)
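A small sketch of that SVM idea, assuming an OpenCL 2.0 context (context, rows and cols are placeholders):

// Allocate shared virtual memory visible to both host and device
void* svmdata = clSVMAlloc(context, CL_MEM_READ_WRITE,
                           rows * cols * sizeof(float), 0);
// Wrap it in a cv::Mat header -- no copy is made
cv::Mat A(rows, cols, CV_32FC1, svmdata);
A.setTo(cv::Scalar(0));   // host-side access
// For device-side access, pass the same pointer to a kernel via
// clSetKernelArgSVMPointer(kernel, arg_index, svmdata),
// and release it later with clSVMFree(context, svmdata).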

DirectX 11, Combining pixel shaders to prevent bottlenecks

I'm trying to implement a complex algorithm on the GPU. The only problem is HW limitations: the maximum available feature level is 9_3.
The algorithm is basically a "stereo matching"-like algorithm for two images. Because of the mentioned limitations, all calculations have to be performed in vertex/pixel shaders only (there is no compute API available). Vertex shaders are rather useless here, so I treat them as pass-through.
Let me shortly describe the algorithm:
Take two images and calculate the cost volume maps (basically converting RGB to grayscale -> translating the right image by D and subtracting it from the left image). This step is repeated around 20 times for different D, which generates a Texture3D.
Problem here: I cannot simply create one pixel shader which calculates those 20 repetitions in one go, because of the pixel shader size limitation (max. 512 arithmetic instructions), so I'm forced to call Draw() in a loop in C++, which unnecessarily involves the CPU while all operations are done on the same two images - it seems to me like I have one bottleneck here. I know that there are multiple render targets, but: there are max. 8 targets (I need 20+), and if I try to generate 8 results in one pixel shader I exceed its size limit (512 arithmetic instructions for my HW).
Then, for each of the calculated textures, I need to compute a box filter with a window radius r > 9.
Another problem here: because the window is so big, I need to split the box filtering into two pixel shaders (vertical and horizontal passes separately), because the loop-unrolling stage produces very long code. Manually implementing those loops won't help, since it would still create too big a pixel shader. So another bottleneck here: the CPU needs to be involved to pass results from the temp texture (the result of the V pass) to the second pass (the H pass).
Then, in the next step, some arithmetic operations are applied to each pair of results from the 1st and 2nd steps.
I haven't reached this point in my development yet, so I have no idea what kind of bottlenecks are waiting for me here.
Then the minimal D (the parameter value from the 1st step) is taken for each pixel, based on the pixel values from step 3.
... same as in step 3.
Here is a VERY simple graph showing my current implementation (excluding steps 3 and 4).
The red dots are temporary buffers (textures) where partial results are stored, and at every red dot the CPU gets involved.
Question 1: Isn't it possible somehow to let the GPU know how to perform each branch from top to bottom without involving the CPU and creating a bottleneck? I.e., to program the sequence of graphics pipelines in one go and then let the GPU do its job.
One additional question about the render-to-texture approach: do all textures reside in GPU memory all the time, even between Draw() calls and pixel/vertex shader switches? Or is there a transfer from GPU to CPU happening? Because this may be another issue here which leads to a bottleneck.
Any help would be appreciated!
Thank you in advance.
Best regards,
Lukasz
Writing computational algorithms in pixel shaders can be very difficult. Writing such algorithms for the 9_3 target can be close to impossible - too many restrictions. But, well, I think I know how to work around your problems.
1. Shader repetition
First of all, it is unclear what you call a "bottleneck" here. Yes, theoretically, draw calls in a for loop are a performance loss. But is it the bottleneck? Does your application really lose performance here? How much? Only profilers (CPU and GPU) can answer that. But to run one, you must first complete your algorithm (stages 3 and 4). So I'd stick with the current solution, implement the whole algorithm, then profile and then fix the performance issues.
But if you feel ready for tweaks... the common "repetition" technique is instancing. You can create one more vertex buffer (called an instance buffer), which contains parameters not for each vertex but for each draw instance. Then you do all the work with one DrawInstanced() call; see the sketch below.
For your first stage, the instance buffer can contain your D value and the index of the target Texture3D layer. You can pass them through from the vertex shader.
As always, you have a trade-off here: simplicity of code versus (probably) performance.
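A rough D3D11 sketch of the instancing setup (not your code; QuadVertex, pQuadVB, pDevice and pContext are assumed to exist, and the input layout must declare the second stream as D3D11_INPUT_PER_INSTANCE_DATA):

struct InstanceData { float d; float layerIndex; };   // one entry per repetition
InstanceData instances[20];
// ...fill instances[i].d with the i-th disparity value...
D3D11_BUFFER_DESC desc = {};
desc.ByteWidth = sizeof(instances);
desc.Usage = D3D11_USAGE_IMMUTABLE;
desc.BindFlags = D3D11_BIND_VERTEX_BUFFER;
D3D11_SUBRESOURCE_DATA init = { instances, 0, 0 };
ID3D11Buffer* pInstanceBuffer = nullptr;
pDevice->CreateBuffer(&desc, &init, &pInstanceBuffer);
// Quad vertices in slot 0, per-instance data in slot 1
ID3D11Buffer* buffers[] = { pQuadVB, pInstanceBuffer };
UINT strides[] = { sizeof(QuadVertex), sizeof(InstanceData) };
UINT offsets[] = { 0, 0 };
pContext->IASetVertexBuffers(0, 2, buffers, strides, offsets);
// One draw call: 4 vertices per full-screen quad, 20 instances (one per D)
pContext->DrawInstanced(4, 20, 0, 0);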
2. Multi-pass rendering
CPU needs to be involved to pass results from temp texture (result of V pass) to the second pass (H pass)
Typically, you chain the passes like this, so no CPU round-trip is involved:
// Pass 1: from pTexture0 to pTexture1
// ...set up pipeline state for Pass 1 here...
pContext->PSSetShaderResources(slot, 1, &pTexture0SRV);   // SRV of the source texture
pContext->OMSetRenderTargets(1, &pTexture1RTV, nullptr);  // RTV of the target texture
pContext->Draw(...);
// Pass 2: from pTexture1 to pTexture2
// ...set up pipeline state for Pass 2 here...
pContext->PSSetShaderResources(slot, 1, &pTexture1SRV);   // previous target is now the source
pContext->OMSetRenderTargets(1, &pTexture2RTV, nullptr);
pContext->Draw(...);
// Pass 3: ...
Note that pTexture1 must have both the D3D11_BIND_SHADER_RESOURCE and D3D11_BIND_RENDER_TARGET flags. You can have multiple input textures and multiple render targets. Just make sure that every pass knows what the previous pass outputs.
And if a previous pass uses more resources than the current one, don't forget to unbind the unneeded ones to prevent hard-to-find errors:
ID3D11ShaderResourceView* pNullSRV = nullptr;
pContext->PSSetShaderResources(2, 1, &pNullSRV);
pContext->PSSetShaderResources(3, 1, &pNullSRV);
pContext->PSSetShaderResources(4, 1, &pNullSRV);
// Only texture slots 0 and 1 will be used
3. Resource data location
Do all textures reside in GPU memory all the time, even between Draw() calls and pixel/vertex shader switches?
We can never know for sure; the driver chooses the appropriate location for resources. But if you create your resources with DEFAULT usage and a CPU access flag of 0, you can be almost sure they will always stay in video memory.
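For example, an intermediate texture for the pass chain above might be created like this (a sketch; width, height and format are whatever your passes need):

D3D11_TEXTURE2D_DESC desc = {};
desc.Width = width;
desc.Height = height;
desc.MipLevels = 1;
desc.ArraySize = 1;
desc.Format = DXGI_FORMAT_R32_FLOAT;
desc.SampleDesc.Count = 1;
desc.Usage = D3D11_USAGE_DEFAULT;   // lets the driver keep it in video memory
desc.CPUAccessFlags = 0;            // no CPU access, so no reason to move it
desc.BindFlags = D3D11_BIND_SHADER_RESOURCE | D3D11_BIND_RENDER_TARGET;
ID3D11Texture2D* pTexture = nullptr;
pDevice->CreateTexture2D(&desc, nullptr, &pTexture);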
Hope it helps. Happy coding!

Leaking memory in detectMultiScale when detecting faces

I'm using the following function (successfully) to detect faces with OpenCV on iOS, but according to Instruments it leaks 4-5 MB of memory every second.
The function is called from processFrame() regularly.
By a process of elimination, it's the line that calls detectMultiScale on the face_cascade that's causing the problem.
I've tried surrounding sections with an autorelease pool (as I've had issues before releasing memory on non-UI threads when doing video processing), but that didn't make a difference.
I've also tried forcing the faces vector to release its memory, but again to no avail.
Does anyone have any ideas?
- (bool)detectAndDisplay :(Mat)frame
{
    BOOL bFaceFound = false;
    vector<cv::Rect> faces;
    Mat frame_gray;

    cvtColor(frame, frame_gray, CV_BGRA2GRAY);
    equalizeHist(frame_gray, frame_gray);

    // the following line leaks 5Mb of memory per second
    face_cascade.detectMultiScale(frame_gray, faces, 1.1, 2,
                                  0 | CV_HAAR_SCALE_IMAGE, cv::Size(100, 100));

    for (unsigned int i = 0; i < faces.size(); ++i)
    {
        rectangle(frame, cv::Point(faces[i].x, faces[i].y),
                  cv::Point(faces[i].x + faces[i].width, faces[i].y + faces[i].height),
                  cv::Scalar(0, 255, 255));
        bFaceFound = true;
    }

    return bFaceFound;
}
I am using the same source code as you and suffering from exactly the same problem: memory leakage. The only differences are that I use Qt5 on Windows and I am loading separate .jpg images (thousands of them, actually). I have tried the same techniques to prevent the crashes, but in vain. I wonder whether you have already solved the problem?
A similar issue is described here (bold paragraph at the very bottom of the page); however, that write-up is for the previous version of the OpenCV interface. The author says:
The above code (function detect and draw) has a serious memory leak when run in an infinite for loop for real time face detection.
My humble guess is that the leakage is caused by badly handled resources inside the detectMultiScale() method. I haven't checked it yet, but the cvHaarDetectObjects() method explained here might be a better solution (although using an old version of OpenCV is probably not the best idea).
Combined with a suggestion from the previous link (add this line at the end of the operations: cvReleaseMemStorage(&storage)), the leakage should be plugged.
Writing this post made me want to try it out, so I will let you know as soon as I know whether it works or not.
EDIT: My guess was probably wrong. Try simply clearing the 'faces' vector after every detection instead of releasing its memory. I have been running the program for quite some time now, a few hundred faces detected, and still no sign of problems.
EDIT 2: Yep, this is it. Just add faces.clear() after every detection and everything will work just fine. Cheers.
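Applied to the code from the question, the suggestion amounts to something like this (a sketch, not tested on iOS):

face_cascade.detectMultiScale(frame_gray, faces, 1.1, 2,
                              0 | CV_HAAR_SCALE_IMAGE, cv::Size(100, 100));
// ...draw the rectangles as before...
faces.clear();   // drop the detections before processing the next frame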

how to get page size

I was asked this question in an interview. Please tell me the answer:
You have no documentation of the kernel. You only know that your kernel supports paging.
How will you find the page size? There is no flag or macro that can tell you the page size.
I was given the hint that you can use timing to get the answer. I still have no clue.
Run code like the following:
for (int stride = 1; stride < maxpossiblepagesize; stride += searchgranularity) {
    char* somemem = (char*)malloc(veryverybigsize*stride);
    starttime = getcurrentveryaccuratetime();
    for (pos = somemem; pos < somemem + veryverybigsize*stride; pos += stride) {
        // iterate over "veryverybigsize" chunks of size "stride"
        *pos = 'Q'; // Just write something to force the page back into physical memory
    }
    endtime = getcurrentveryaccuratetime();
    printf("stride %u, runtime %u", stride, endtime - starttime);
}
Graph the results with stride on the X axis and runtime on the Y axis. There should be a point at stride = pagesize where the performance stops degrading.
This works by incurring a number of page faults. Once the stride surpasses the page size, the number of faults ceases to increase, so the program's performance no longer degrades noticeably.
If you want to be cleverer, you could exploit the fact that the mprotect system call must work on whole pages. Try it with something smaller and you'll get an error. I'm sure there are other "holes" like that too, but the code above will work on any system which supports paging and where disk access is much more expensive than RAM access. That would be every semi-normal modern system.
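A sketch of that mprotect probe (POSIX-only, for illustration; it relies on mmap returning page-aligned memory and on mprotect rejecting addresses that are not page-aligned):

#include <sys/mman.h>
#include <cstdio>

int main() {
    const size_t region = 1 << 22;   // 4 MiB, spans many pages
    char* base = static_cast<char*>(mmap(nullptr, region, PROT_READ | PROT_WRITE,
                                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));
    if (base == MAP_FAILED) return 1;
    // base is page-aligned, so mprotect(base + probe, ...) only succeeds
    // once probe is a multiple of the (power-of-two) page size
    for (size_t probe = 1; probe < region; probe <<= 1) {
        if (mprotect(base + probe, 1, PROT_READ | PROT_WRITE) == 0) {
            std::printf("page size is probably %zu bytes\n", probe);
            break;
        }
    }
    munmap(base, region);
    return 0;
}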
It looks to me like a question about how paging actually works.
They want you to explain the impact that changing the page size will have on the execution of the system.
I am a bit rusty on this stuff, but when memory fills up, the system starts page swapping, which slows everything down. So you want to run something that fills up memory to different sizes and measure the time it takes to do a task. At some point there will be a jump, where the time taken to do the task suddenly increases.
Like I said, I am a bit rusty on how to implement this, but I'm pretty sure that is the shape of the answer they were after.
Whatever answer they were expecting, it would almost certainly be a brittle solution. For one thing, you can have multiple page sizes, so any answer you get for one small allocation may be irrelevant for the next multi-megabyte allocation (see things like Linux's large page support).
I suspect the question was aimed more at seeing how you approached the problem than at the final solution you came up with.
By the way, this question isn't about Linux, because there you do have documentation (as well as POSIX compliance), for which you just call sysconf(_SC_PAGE_SIZE).
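For reference, on a documented POSIX system that is just (illustrative):

#include <unistd.h>
long page_size = sysconf(_SC_PAGE_SIZE);   // e.g. 4096 on typical x86-64 Linux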
