Hardware optimizations using Qualcomm Snapdragon 800 and Adreno 330 - opencv

I am developing a real-time computer vision project that runs on an Ubuntu (Linaro) board with an ARM CPU (Snapdragon 800).
Some parts of the software operate on HD images, which means a huge amount of data. These parts slow down execution and act as a bottleneck.
These operations include:
Finding all local minimum and maximum values in a 2D array (image). Currently, it is implemented in the naive, trivial way.
Building a KD-Tree and performing a K-Nearest-Neighbors search. This is currently done using the FLANN library included in OpenCV.
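For reference, the naive approach to the first operation might look roughly like the sketch below. It is an assumption about what "naive" means here, not the actual project code: every pixel is compared with its (up to 8) neighbours, and the names `Extremum` and `find_local_extrema` are made up for illustration.

```cpp
#include <cstddef>
#include <vector>

// One detected extremum: its position and whether it is a maximum or a minimum.
struct Extremum {
    std::size_t row;
    std::size_t col;
    bool is_max;
};

// Naive scan over a row-major image of size rows x cols.
// A pixel is reported as a strict local maximum if it is greater than all of
// its neighbours, and as a strict local minimum if it is smaller than all of them.
std::vector<Extremum> find_local_extrema(const std::vector<float>& img,
                                         std::size_t rows, std::size_t cols) {
    std::vector<Extremum> out;
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            const float v = img[r * cols + c];
            bool is_max = true, is_min = true, has_neighbour = false;
            for (int dr = -1; dr <= 1; ++dr) {
                for (int dc = -1; dc <= 1; ++dc) {
                    if (dr == 0 && dc == 0) continue;  // skip the pixel itself
                    const std::ptrdiff_t nr = static_cast<std::ptrdiff_t>(r) + dr;
                    const std::ptrdiff_t nc = static_cast<std::ptrdiff_t>(c) + dc;
                    if (nr < 0 || nc < 0 ||
                        nr >= static_cast<std::ptrdiff_t>(rows) ||
                        nc >= static_cast<std::ptrdiff_t>(cols)) {
                        continue;  // neighbour lies outside the image
                    }
                    has_neighbour = true;
                    const float n = img[static_cast<std::size_t>(nr) * cols +
                                        static_cast<std::size_t>(nc)];
                    if (n >= v) is_max = false;
                    if (n <= v) is_min = false;
                }
            }
            if (has_neighbour && is_max) out.push_back({r, c, true});
            else if (has_neighbour && is_min) out.push_back({r, c, false});
        }
    }
    return out;
}
```

Each output pixel depends only on a small, fixed neighbourhood of the input, which is the kind of access pattern that usually maps well onto SIMD or GPU kernels.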
I am looking for ways to utilize the available Adreno 330 GPU, and accelerate these computations.
I looked at OpenCL, but found out that the Adreno 330 supports it only as an "embedded profile". I do not know what that means or how it affects things.
I have also heard about NEON in ARM processors, but I do not know whether it would be of any use to me.
Any help, tips and links will be appreciated.


What platform to use for YOLO output when using AMD GPU?

I have been struggling with this question for a long time and would like advice on which direction to take. The objective is to develop a universal YOLO application on Windows that can use the computing power of an AMD/Nvidia/Intel GPU or an AMD/Intel CPU (one of these devices will be used at a time). As far as I know, the OpenCV DNN module is the leading option for CPU computation; I plan to use DNN + CUDA for Nvidia graphics cards and DNN + OpenCL for Intel GPUs. But when testing an AMD RX 580 GPU with DNN + OpenCL, I ran into the following problem: https://github.com/opencv/opencv/issues/17656. Does this module not support AMD GPU computing at all? If so, could you please tell me which platform makes this possible, preferably as efficiently as possible? A possible solution might be Tencent's ncnn, but I'm not sure about its performance on the desktop. By output I mean the coordinates of the detected objects and their names (in the OpenCV DNN module I got them with cv::dnn::Net::forward()). Also, correct me if I'm wrong somewhere. Any feedback would be appreciated.
I tried the OpenCV DNN + OpenCL module and expected high performance, but this combination does not work.
I believe OpenCV doesn't support AMD for GPU optimization. If you're interested in running DL models on non-Nvidia GPUs, I suggest reading about PlaidML, YOLO-OpenCL, and DeepCL.
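Whichever backend ends up running the network, the "coordinates and names" step can stay the same. Here is a minimal sketch of that step, assuming the Darknet YOLOv3/v4 row layout that cv::dnn::Net::forward() returns for those models: [cx, cy, w, h, objectness, class scores...], with box values normalised to the image size. `Detection` and `decode_row` are hypothetical names, and the code deliberately has no OpenCV dependency.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// A decoded detection in pixel coordinates (hypothetical helper type).
struct Detection {
    int left, top, width, height;
    std::string label;
    float confidence;
};

// Decodes one output row laid out as [cx, cy, w, h, objectness, score_0, ...].
// Assumption: cx, cy, w, h are normalised to [0, 1] relative to the image.
// Returns true and fills `det` if the best class score exceeds `threshold`.
bool decode_row(const std::vector<float>& row,
                const std::vector<std::string>& names,
                int img_w, int img_h, float threshold, Detection& det) {
    if (names.empty() || row.size() != 5 + names.size()) return false;

    // Pick the class with the highest score.
    std::size_t best = 0;
    for (std::size_t i = 1; i < names.size(); ++i) {
        if (row[5 + i] > row[5 + best]) best = i;
    }
    const float score = row[5 + best];
    if (score <= threshold) return false;

    // Convert the normalised centre/size box to a top-left pixel rectangle.
    det.width = static_cast<int>(row[2] * img_w);
    det.height = static_cast<int>(row[3] * img_h);
    det.left = static_cast<int>(row[0] * img_w) - det.width / 2;
    det.top = static_cast<int>(row[1] * img_h) - det.height / 2;
    det.label = names[best];
    det.confidence = score;
    return true;
}
```

In a real application, overlapping boxes would still need non-maximum suppression afterwards. Other YOLO versions or export formats may use a different row layout, so the indices above should be checked against the model actually used.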

OpenCV BackgroundSubtractorMOG2

I recently finished a video-based foreground extraction algorithm, but it processes each frame too slowly. OpenCV 3.0 includes an algorithm based on a Gaussian mixture model, BackgroundSubtractorMOG2, and I find it runs nearly 15 times faster per frame than mine. Is it accelerated by OpenCL on the GPU, or does it just run on the CPU? P.S. I have looked at some of its source code and noticed there are OpenCL blocks, but I'm not sure since I'm new to this. I would really appreciate it if anyone could help me figure this out!
If you look at the API page here, you will find the line:
The function implements a sparse iterative version of the Lucas-Kanade optical flow in pyramids. See [Bouguet00]. The function is parallelized with the TBB library.
TBB is a parallelization library used to "write parallel C++ programs that take full advantage of multicore performance". This means the code uses more than one CPU core at a time, which is a much quicker way of processing. This can be seen in lines like this (line 566):
    parallel_for_(Range(0, image.rows),
                  MOG2Invoker(image, fgmask,
                              (float*)(bgmodel.data + sizeof(GMM)*nmixtures*image.rows*image.cols),
                              bgmodelUsedModes.data, nmixtures, (float)learningRate,
                              backgroundRatio, varThresholdGen,
                              fVarInit, fVarMin, fVarMax, float(-learningRate*fCT), fTau,
                              bShadowDetection, nShadowDetection));
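To illustrate the idea behind that call, here is a rough sketch of row-striped parallelism using only the C++ standard library. It is not how OpenCV or TBB implement parallel_for_; `parallel_rows` and `threshold_image` are hypothetical names, and the thread count of 4 is an arbitrary assumption.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Splits the row range [0, rows) into contiguous stripes and runs
// body(begin, end) for each stripe on its own thread.
void parallel_rows(int rows, int num_threads,
                   const std::function<void(int, int)>& body) {
    if (rows <= 0) return;
    num_threads = std::max(1, std::min(num_threads, rows));
    const int stripe = (rows + num_threads - 1) / num_threads;  // ceiling division
    std::vector<std::thread> workers;
    for (int begin = 0; begin < rows; begin += stripe) {
        const int end = std::min(begin + stripe, rows);
        workers.emplace_back(body, begin, end);
    }
    for (std::thread& w : workers) w.join();  // wait for every stripe to finish
}

// Example body: threshold a row-major image into a 0/255 mask.
// Each thread writes only to the rows of its own stripe, so no locking is needed.
void threshold_image(const std::vector<float>& image, int rows, int cols,
                     float thresh, std::vector<unsigned char>& mask) {
    mask.assign(image.size(), 0);
    parallel_rows(rows, 4, [&](int begin, int end) {
        for (int r = begin; r < end; ++r) {
            for (int c = 0; c < cols; ++c) {
                const std::size_t i = static_cast<std::size_t>(r) * cols + c;
                mask[i] = image[i] > thresh ? 255 : 0;
            }
        }
    });
}
```

The per-row work here plays the role that MOG2Invoker plays in the OpenCV snippet: a functor that is handed a sub-range of rows and processes it independently of the others.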

Advice on a GPU for Dell Precision T3500 for image processing

I am a grad student, and in our lab we have a Dell Precision T3500 (http://www.dell.com/us/business/p/precision-t3500/pd). We use it primarily for image processing research, and we need to use the "ocl" module of OpenCV 2.4.7, i.e., its OpenCL bindings, to parallelize our work for some publications.
I looked at the workstation's specs, and they say we can get an NVIDIA Quadro 5000 or an AMD FirePro V7900 (the best from each manufacturer for this workstation).
This is where I am confused. Most of the reviews compare performance for CAD/CAM, Maya and other software, but we will be writing our own code using OpenCV. Can anyone help me choose the better of these two GPUs? Or is there any way I can get a better GPU by upgrading the power supply?
We would greatly appreciate all the advice we can get at this stage!
Thank you very much.
If you are using OpenCL, I agree with DarkZeros: you should probably buy AMD hardware. Nvidia supports OpenCL only grudgingly, as they want everyone to use CUDA.
Both of the cards you mentioned seem rather similar, with a theoretical maximum of around 1 TFLOPS. However, both are rather old and very expensive. If you are not bound by any purchasing agreement, I really recommend you buy a consumer card. The specs on dell.com only mean that if you purchase the computer from them, you can select one of those GPUs for it. They do not limit what you can do afterwards.
Depending on the chassis you could change your power supply. That would mean you could purchase something like this http://www.amazon.com/XFX-RADEON-1000MHz-Graphics-R9290XENFC/dp/B00G2OTRMA . It has double the memory of either of those professional cards and over 5x the theoretical processing power.
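For what it's worth, the "theoretical processing power" figures come from a simple formula: shader cores times clock times FLOPs per core per cycle. A small sketch, where the 2816 shader cores for the R9 290X are taken from AMD's published specs rather than from this thread (so treat that number as an assumption), and the 1000 MHz clock is the one in the product listing:

```cpp
// Theoretical single-precision peak in GFLOPS:
// shader cores x clock (GHz) x FLOPs per core per cycle.
// A fused multiply-add counts as 2 FLOPs, hence the default of 2.
double peak_gflops(int shader_cores, double clock_mhz, int flops_per_cycle = 2) {
    return shader_cores * (clock_mhz / 1000.0) * flops_per_cycle;
}
```

With those inputs, `peak_gflops(2816, 1000.0)` gives 5632 GFLOPS, i.e. about 5.6 TFLOPS. Real OpenCV workloads are often limited by memory bandwidth rather than arithmetic, so this is an upper bound for comparing cards, not a prediction of actual speedup.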
To be fair, if you have the money to spend, the GTX Titan is still an excellent choice. It is about as fast as that AMD card, and you can use CUDA with it if you need to. Considering how common CUDA is in scientific computing, it might be wise to go with that.
However, if you cannot switch your power supply, for example because it is a non-standard size, then you are more limited. In that case you should search for pretty much the heftiest card that can run on 150 W. Even those have perhaps double the performance of the cards the computer was originally available with.

Linear Algebra library using OpenGL ES 2.0 for iOS

Does anyone know of a linear algebra library for iOS that uses OpenGL ES 2.0 under the covers?
Specifically, I am looking for a way to do matrix multiplication on arbitrary-sized matrices (e.g., much larger than 4x4, more like 5,000 x 100,000) using the GPUs on iOS devices.
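A quick back-of-the-envelope check on those sizes, as a sketch (the element type is an assumption; 32-bit floats are typical for GPU work, and `matrix_bytes` is just an illustrative helper):

```cpp
#include <cstdint>

// Storage needed for a dense rows x cols matrix, in bytes.
std::uint64_t matrix_bytes(std::uint64_t rows, std::uint64_t cols,
                           std::uint64_t bytes_per_element) {
    return rows * cols * bytes_per_element;
}
```

`matrix_bytes(5000, 100000, 4)` is 2,000,000,000 bytes, so a single operand of that size is about 2 GB in single precision. That is likely more than an app can comfortably hold in memory on many iOS devices, so the multiplication would probably have to be tiled regardless of which backend performs it.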
Is there a specific reason you're asking for "uses OpenGL ES 2.0 under the covers?" Or do you just want a fast, hardware optimized linear algebra library such as BLAS, which is built into iOS?
MetalPerformanceShaders.framework provides some tuned BLAS-like functions. It is not GLES; it is Metal and runs on the GPU. See MetalPerformanceShaders/MPSMatrixMultiplication.h.
OpenGL on iOS is probably the wrong way to go. Metal support on iOS would be the better way to go if you're going GPU.
You could use Apple's support for Metal Compute shaders. I've written high-performance code for my PhD in it. An early experiment I made calculating some fractals using Metal might give you some ideas to start with.
Ultimately, this question is too broad. What do you intend to use the library for, or how do you intend to use it? Is it a one off multiplication? Have you tested with current libraries and found the performance to be too slow? If so, by how much?
In general, you can run educational or purely informational experiments on performance of algorithm X on CPU vs. GPU vs. specialized hardware, but most often you run up against Amdahl's law and your code vs. a team of experts in the field.
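To make the Amdahl's law point concrete, here is a minimal sketch of the formula: if a fraction p of the run time can be accelerated by a factor s, the overall speedup is 1 / ((1 - p) + p / s). The function name is made up for illustration.

```cpp
// Amdahl's law: overall speedup when a fraction `p` (0..1) of the run time
// is accelerated by a factor `s`, and the remaining (1 - p) is unchanged.
double amdahl_speedup(double p, double s) {
    return 1.0 / ((1.0 - p) + p / s);
}
```

For example, if the matrix multiplication accounts for 80% of the run time and the GPU makes that part 20 times faster, `amdahl_speedup(0.8, 20.0)` is only about 4.2. The remaining 20% (data preparation, transfers, everything else) dominates the result.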
You can also look into the Accelerate framework which offers BLAS.
Apple, according to the WWDC 2014 talk What's new in the Accelerate Framework, has hand tuned the Linear Algebra libraries targeted at their current generation hardware. They aren't just fast, but energy efficient. There are newer talks as well.

SIFT hardware accelerator for smartphones

I'm a recently graduated electronics engineer with some experience in computer vision. I want to ask whether it's feasible to build a hardware accelerator for the SIFT algorithm, or any other OpenCV algorithm, to be used on smartphones instead of the current software implementation.
What are the advantages (much less computation, lower power, more complex applications becoming possible, ...) and the disadvantages (not being better than the current software implementation, ...)?
Do you have any insight into this?
You might be interested in NEON optimizations. NEON is the SIMD instruction set extension of ARM processors and is supported by, for example, the Nvidia Tegra 3. Some OpenCV functions are NEON-optimized.
Start by reading this nice article, Realtime Computer Vision with OpenCV; it has performance comparisons about using NEON, etc.
I also recommend you to start here and here, you will find great insights.
OpenCV supports both CUDA and (experimentally) OpenCL.
There are specific optimizations for Nvidia's Tegra chipset, which is used in a lot of phones/tablets. I don't know if any phones use OpenCL.
