Calling BLAS routines inside OpenCL kernels - image-processing

I am currently implementing some image-processing algorithms in OpenCL. My algorithm requires solving a linear system of equations for each pixel. Each system is independent of the others, so a parallel implementation is natural.
I have looked at several BLAS packages such as ViennaCL and AMD APPML, but they all seem to follow the same usage pattern: the host calls BLAS subroutines that are executed on the CL device.
What I need is a BLAS library that can be called from inside an OpenCL kernel, so that I can solve many linear systems in parallel.
I found this similar question on the AMD forums: "Calling APPML BLAS functions from the kernel".
Thanks

It's not possible. clBLAS routines make a series of kernel launches, and the launches for some 'solve' routines are quite complicated. clBLAS routines take cl_mem buffers and command queues as arguments, so if your buffer is already on the device, clBLAS acts on it directly. It doesn't accept host buffers or manage host->device transfers.
If you want to see which kernels are generated and launched, uncomment this line https://github.com/clMathLibraries/clBLAS/blob/master/src/library/blas/generic/common.c#L461 and build clBLAS. It will dump all the kernels being called.
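Since library routines cannot be called from inside a kernel, one possible workaround (an assumption on my part, not something the answer proposes) is to write a small dense solver by hand when each per-pixel system is small. Below is a minimal sketch in plain C++ using Gaussian elimination with partial pivoting. The names solveSmallSystem and solvePerPixel are hypothetical, and the loop over pixels stands in for what would be one work-item per pixel in a kernel.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>

// Hypothetical helper: solves the N x N system A * x = b in place using
// Gaussian elimination with partial pivoting. On success b holds x.
// Returns false if a pivot is (nearly) zero, i.e. the system is singular.
template <std::size_t N>
bool solveSmallSystem(float (&A)[N][N], float (&b)[N]) {
    for (std::size_t col = 0; col < N; ++col) {
        // Pick the row with the largest absolute value in this column.
        std::size_t pivot = col;
        for (std::size_t row = col + 1; row < N; ++row) {
            if (std::fabs(A[row][col]) > std::fabs(A[pivot][col])) pivot = row;
        }
        if (std::fabs(A[pivot][col]) < 1e-12f) return false;
        if (pivot != col) {
            for (std::size_t k = 0; k < N; ++k) std::swap(A[col][k], A[pivot][k]);
            std::swap(b[col], b[pivot]);
        }
        // Eliminate the entries below the pivot.
        for (std::size_t row = col + 1; row < N; ++row) {
            const float factor = A[row][col] / A[col][col];
            for (std::size_t k = col; k < N; ++k) A[row][k] -= factor * A[col][k];
            b[row] -= factor * b[col];
        }
    }
    // Back substitution on the upper-triangular system.
    for (std::size_t i = N; i-- > 0;) {
        float sum = b[i];
        for (std::size_t k = i + 1; k < N; ++k) sum -= A[i][k] * b[k];
        b[i] = sum / A[i][i];
    }
    return true;
}

// Hypothetical per-pixel driver: each loop iteration is independent, so in a
// kernel it would become the body of one work-item. Returns how many systems
// were solved successfully.
template <std::size_t N>
std::size_t solvePerPixel(float (*A)[N][N], float (*b)[N], std::size_t pixelCount) {
    assert(A != nullptr && b != nullptr);
    std::size_t solved = 0;
    for (std::size_t p = 0; p < pixelCount; ++p) {
        if (solveSmallSystem(A[p], b[p])) ++solved;
    }
    return solved;
}
```

Calling solvePerPixel(A, b, pixelCount) with an array of 3x3 matrices and right-hand sides overwrites each b[p] with the solution for pixel p. The solver uses only fixed-size arrays and simple loops, so it should port to OpenCL C with few changes, but whether it is fast enough depends on the system size and the device.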

Related

How to use OpenCV functions in Metal on iOS?

I have developed an Xcode project that uses OpenCV functions to process the iPhone camera's live stream.
Processing one frame takes long enough that the output does not look real-time.
Is it possible to accelerate the calculation by integrating OpenCV and Metal?
For example, the OpenCV function "grabCut" takes more than 1 second to detect certain foreground objects.
How can I reduce the processing time to about 10 ms using Metal?
You can't call OpenCV functions from Metal.
If you want to speed up this algorithm, you could try porting it to Metal, but that's only an option if the algorithm -- or major parts of it -- is highly parallel.
Now, it looks like grabCut has a CUDA implementation (which I found by googling for "grabcut cuda"), which means that implementing this in Metal might actually be worth doing. If you can find the CUDA source code, it's usually a relatively straightforward port.

Is OpenCL a shared, distributed or hybrid memory system

I'm having a hard time understanding whether OpenCL, and OpenCL 2.0+ in particular, is a shared, distributed, or distributed shared memory architecture, especially on a computer that has many OpenCL devices in the same PC.
I can see that it's a shared memory system in that all compute units can access global memory, but there's a network-like aspect to the compute units that makes me question whether it could classically be classed as a distributed shared memory architecture.
Looking at it from a generic OpenCL coding perspective, your answer is "yes, maybe, unless it's not."
If you are talking about some specific hardware, there are (somewhere) clear and concise answers of how things work on the chip(s) and how OpenCL uses them.
By examining the OpenCL capacities and capabilities at runtime, you could modify some parameters of your OpenCL program or choose one of various kernels that is the best fit.

OpenCV BackgroundSubtractorMOG2

I recently finished a video-based foreground-extraction algorithm, but it processes each frame too slowly. OpenCV 3.0 has an algorithm based on a Gaussian mixture model, BackgroundSubtractorMOG2, and I find that it is nearly 15 times faster than mine per frame. Is it accelerated by OpenCL on the GPU, or does it just run on the CPU? P.S. I've looked at some of its source code and noticed OpenCL blocks, but I'm not sure what they do since I'm new to this. I would really appreciate it if anyone could help me figure it out!
If you look at the API page, you will find this line:
The function implements a sparse iterative version of the Lucas-Kanade optical flow in pyramids. See [Bouguet00]. The function is parallelized with the TBB library.
TBB is a parallelization library used to "write parallel C++ programs that take full advantage of multicore performance". This means it uses more than one CPU core at a time, which is a much quicker way of processing. This can be seen on lines like this (Line 566):
parallel_for_(Range(0, image.rows),
              MOG2Invoker(image, fgmask,
                          (GMM*)bgmodel.data,
                          (float*)(bgmodel.data + sizeof(GMM)*nmixtures*image.rows*image.cols),
                          bgmodelUsedModes.data, nmixtures, (float)learningRate,
                          (float)varThreshold,
                          backgroundRatio, varThresholdGen,
                          fVarInit, fVarMin, fVarMax, float(-learningRate*fCT), fTau,
                          bShadowDetection, nShadowDetection));
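The pattern in the call above, handing ranges of image rows to worker threads, can be sketched without OpenCV or TBB. This is only an illustration using std::thread, not how OpenCV implements parallel_for_. The names parallelForRows and thresholdImage are hypothetical, and the simple threshold stands in for the per-row work that MOG2Invoker would do.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for a row-parallel loop: splits [0, rows) into
// contiguous chunks and runs body(begin, end) for each chunk on its own thread.
void parallelForRows(std::size_t rows, std::size_t workers,
                     const std::function<void(std::size_t, std::size_t)>& body) {
    assert(workers > 0);
    const std::size_t chunk = (rows + workers - 1) / workers;  // ceiling division
    std::vector<std::thread> pool;
    for (std::size_t begin = 0; begin < rows; begin += chunk) {
        const std::size_t end = std::min(begin + chunk, rows);
        pool.emplace_back([&body, begin, end] { body(begin, end); });
    }
    for (std::thread& t : pool) t.join();
}

// Hypothetical per-row work: turns an 8-bit row-major image into a 0/255 mask.
// Each thread writes only to its own rows, so no locking is needed.
std::vector<unsigned char> thresholdImage(const std::vector<unsigned char>& image,
                                          std::size_t rows, std::size_t cols,
                                          unsigned char threshold,
                                          std::size_t workers) {
    assert(image.size() == rows * cols);
    std::vector<unsigned char> mask(image.size(), 0);
    parallelForRows(rows, workers, [&](std::size_t begin, std::size_t end) {
        for (std::size_t r = begin; r < end; ++r) {
            for (std::size_t c = 0; c < cols; ++c) {
                const std::size_t i = r * cols + c;
                mask[i] = static_cast<unsigned char>(image[i] > threshold ? 255 : 0);
            }
        }
    });
    return mask;
}
```

Calling thresholdImage(image, rows, cols, 100, 4) processes the rows in up to four chunks on separate CPU threads. This is CPU multithreading only; nothing in the sketch touches the GPU.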

Would it work, and be faster, if I called OpenCV GPU module functions from my own kernel function?

OpenCV has a gpu module, "GPU-accelerated Computer Vision" (http://docs.opencv.org/modules/gpu/doc/gpu.html). Many of its functions already use GPU techniques, so I can call them directly. But I wonder whether it would be faster if I wrote my own kernel and called OpenCV GPU module functions from within it. This is for the case where I have many images: to handle each image I would call an OpenCV function from the GPU module, so the result would be nested parallelism.
Your question is not entirely clear to me, but I would like to say this: it's impossible to say which would be faster, unless somebody already implemented that same algorithm using the approach you have in mind, and then shared a report about the benchmark tests.
There are a number of factors involved:
It depends on the type of operation you are trying to implement: techniques with a high arithmetic intensity are certainly a better fit for GPUs, but not all problems can be modeled for GPUs.
The size of the input images matters: the time spent sending data from RAM to the GPU might not pay off in the end, so running the algorithm on the CPU can be faster for small images.
The model/power of the CPU/GPU: if the computer has a really crappy GPU, then it's probably better to run the algorithms on the CPU.
What I'm saying is: don't assume OpenCV's GPU module will always run its algorithms faster than the CPU you've got. Test it, measure it! The only way to know for sure is through experimentation and benchmarking.
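As a starting point for that kind of measurement, here is a minimal timing sketch in plain C++. The helper name medianMillis is hypothetical. It times any callable, so the CPU and GPU variants of an algorithm would each be wrapped in a lambda. For a fair comparison, the GPU lambda would need to include the upload and download of the image and wait for the device to finish before returning.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>
#include <vector>

// Hypothetical helper: runs fn `repeats` times and returns the median
// wall-clock time in milliseconds (the upper middle sample for even counts).
// The median is less sensitive than the mean to one-off slow runs such as
// the first call, which often pays for lazy initialization.
double medianMillis(const std::function<void()>& fn, int repeats) {
    assert(repeats > 0);
    std::vector<double> samples;
    samples.reserve(static_cast<std::size_t>(repeats));
    for (int i = 0; i < repeats; ++i) {
        const auto start = std::chrono::steady_clock::now();
        fn();
        const auto stop = std::chrono::steady_clock::now();
        samples.push_back(
            std::chrono::duration<double, std::milli>(stop - start).count());
    }
    std::sort(samples.begin(), samples.end());
    return samples[samples.size() / 2];
}
```

Measuring both variants with the same images and the same repeat count, for example medianMillis(runOnCpu, 20) against medianMillis(runOnGpu, 20) where both callables are your own, gives numbers that can be compared directly across image sizes.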

Re-write openCV functions using Cuda only

My code is written in C++ and uses OpenCV functions for image-processing tasks.
I want to run it on the GPU (using CUDA) to read camera/stream input and perform the image-processing tasks on each frame in parallel.
I've read somewhere that I can't include OpenCV functions in .cu code, since NVCC can't compile OpenCV functions (please correct me if this is not true).
I found the OpenCV gpu module in the OpenCV documentation, but I don't want to run just individual functions in parallel; I want the whole algorithm to be processed in parallel (in other words, include OpenCV in CUDA, not vice versa). So I've thought about rewriting all of my OpenCV functions in CUDA, but I'm new to CUDA.
My questions:
1- Are there CUDA functions that can be used instead of the following OpenCV functions:
split, inRange
fillHoles
Morphology (erosion, dilation, closing)
Contours (findContours, moments, boundingRect, approxPolyDP)
Drawing functions (drawContours, rectangle, circle)
kmeans (or any other function for clustering)
I found some of them on GitHub but haven't tested any yet. Any documentation would be highly appreciated.
2- Does CUDA read only the .pgm image format, and should I convert the .jpg frames before copying them to the device? Is it impossible to read the camera input directly into GPU global memory?
3- Do you suggest keeping my code in OpenCV and using another library for parallel processing, such as OpenCL? Or using the CPU (instead of the GPU) for parallel processing with OpenMP? What would be the best option to go with?
Before you go down this route, I would recommend that you go through the first few lessons of this tutorial:
https://www.udacity.com/course/cs344
Then you will have a better idea of whether a GPU is suitable for what your application requires.
In any case, OpenCV 1.0 is mostly in C, and CUDA kernels are in C, so maybe you could try wrapping some of those functions into CUDA kernels.
Cheers

Resources