TBB parallel_for with a small number of threads - OpenCV

I have written multi-view face detection code using the OpenCV face detector. I run five detectors (trained for different pose angles) over an image and combine their weights to detect faces. I parallelized the code with TBB parallel_for, but it improved performance by only about 1.7x. Is there a better way to run the five detectors in parallel?
I am running my code on a cluster node with 16 cores. I think the number of threads (five in my case) is too small to utilize all of the cores.
Any suggestions?
Thanks,

Some possible problems to look into:
One of the detectors takes longer than the others to run. For example, if one detector takes 4 units of time and the other four detectors each take 1 unit of time, the maximum possible speedup is 2x. Parallelizing the slow detector itself might help in this kind of situation.
The detectors run so fast that the parallel_for does not have time to spread the work. If the detectors each take at least 0.1 sec, this should not be a problem.
Memory bandwidth can be a limiting resource, particularly if working sets do not fit in outer-level cache.
A profiler such as Intel(R) VTune(TM) Amplifier can sometimes help to track down these problems. Both commercial and non-commercial licenses exist for Amplifier. [Disclaimer: I work for Intel.]
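As a point of reference, here is a minimal sketch (not the original poster's code; the cascade file names and the input image are placeholders) of dispatching five cascade detectors with tbb::parallel_for. Each task loads its own cv::CascadeClassifier, since detectMultiScale is not guaranteed to be thread-safe on a shared instance:

    #include <opencv2/core.hpp>
    #include <opencv2/imgcodecs.hpp>
    #include <opencv2/objdetect.hpp>
    #include <tbb/parallel_for.h>
    #include <string>
    #include <vector>

    int main() {
        // Hypothetical cascade files, one per pose angle.
        const std::vector<std::string> cascades = {
            "frontal.xml", "left30.xml", "left60.xml", "right30.xml", "right60.xml"
        };
        cv::Mat gray = cv::imread("input.png", cv::IMREAD_GRAYSCALE); // placeholder input

        std::vector<std::vector<cv::Rect>> faces(cascades.size());

        // One task per detector; with only five tasks, at most five of the 16 cores
        // are busy unless the detectors themselves are parallelized internally.
        tbb::parallel_for(std::size_t(0), cascades.size(), [&](std::size_t i) {
            cv::CascadeClassifier detector(cascades[i]);  // per-task instance
            detector.detectMultiScale(gray, faces[i]);
        });

        // faces[i] now holds the detections of detector i; combine/weight them as before.
        return 0;
    }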

Related

How fast can an implementation of KMeans clustering on an FPGA board get?

I wrote code in OpenCV/C++ that segments the RGB images of a video: I load each image and cluster areas within it based on their colours. I chose KMeans to perform the segmentation, and the results are good. However, with images of size 960x1280 it takes several seconds to cluster a single image, and at this rate I can't use it in real time. My code is meant to be rewritten in VHDL in order to run on an FPGA board in real time, so I wonder, given the parallelization capabilities of modern FPGA boards, how much acceleration can I get? I have little experience with such boards, so I can't predict how fast my code could get if implemented on one of them. I want to have your opinions; maybe I will have to replace KMeans with some other image segmentation algorithm if the acceleration is not enough, and if so, it's better to do it now rather than after having written the VHDL version.
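For context, a rough sketch of the kind of colour-based KMeans segmentation described above, using cv::kmeans on the CPU (the file names and the cluster count K are assumptions), could look like this:

    #include <opencv2/core.hpp>
    #include <opencv2/imgcodecs.hpp>

    int main() {
        cv::Mat img = cv::imread("frame.png");                  // hypothetical input frame
        cv::Mat samples = img.reshape(1, img.rows * img.cols);  // one row per pixel: B, G, R
        samples.convertTo(samples, CV_32F);

        const int K = 8;                                        // assumed cluster count
        cv::Mat labels, centers;
        cv::kmeans(samples, K, labels,
                   cv::TermCriteria(cv::TermCriteria::EPS + cv::TermCriteria::MAX_ITER, 10, 1.0),
                   3, cv::KMEANS_PP_CENTERS, centers);

        // Recolour each pixel with its cluster centre to get the segmented image.
        cv::Mat segmented(img.size(), img.type());
        for (int r = 0; r < img.rows; ++r) {
            for (int c = 0; c < img.cols; ++c) {
                const float* ctr = centers.ptr<float>(labels.at<int>(r * img.cols + c));
                segmented.at<cv::Vec3b>(r, c) = cv::Vec3b(cv::saturate_cast<uchar>(ctr[0]),
                                                          cv::saturate_cast<uchar>(ctr[1]),
                                                          cv::saturate_cast<uchar>(ctr[2]));
            }
        }
        cv::imwrite("segmented.png", segmented);
        return 0;
    }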

Google Object Detection API: Fluctuation in TotalLoss

I am using the Google Object Detection API with my own dataset. Usually after 50K steps it begins to converge at around 60 percent accuracy. I think it works fine in general. But if you look at the TotalLoss graph, or at all the loss graphs in general, it fluctuates a lot. It looks like this:
What could be the reason for this? Is it normal or not? If not, what is the explanation?
Also, occasionally I see in the example images that some bounding boxes are doubled up in one area; why is that?
Yes, fluctuation in the loss is very normal particularly because the detection pipelines are usually trained with small batch sizes (batch size 1 in the case of Faster R-CNN), so you typically only see a meaningful decrease in the loss if you average over many steps.
Yes, as Jonathan explained, these 'wiggles' are mostly observed when you have smaller batch sizes. Unfortunately, you are going to need at least 16 GB of memory to handle batch sizes larger than 1 when using Faster R-CNN with the TensorFlow API. Unless you have more processing power, the only option is to be patient until some thousands of iterations have completed; in my case it was more than 100,000.

Would it work and be faster if I call function in OpenCV GPU module in my kernel function?

OpenCV has a GPU-accelerated computer vision module, gpu (http://docs.opencv.org/modules/gpu/doc/gpu.html). There are many functions in it that already use GPU techniques, so I can directly use the functions OpenCV provides. But I wonder whether it would be faster if I write my own kernel and, in each kernel, call functions of the OpenCV GPU module. This is for the case where I have many images: to handle each image I would call an OpenCV function from the GPU module, so it would be nested parallelism (parallel within parallel).
Your question is not entirely clear to me, but I would like to say this: it's impossible to say which would be faster, unless somebody already implemented that same algorithm using the approach you have in mind, and then shared a report about the benchmark tests.
There's a number of factors involved:
It depends on the type of operation you are trying to implement: techniques that have a high arithmetic intensity are better fit for GPUs for sure, however, not all problems can be modeled for GPUs.
The size of the input images matter: wasting time sending data from RAM to the GPU might not compensate in the end, so running the algorithm on the CPU can be faster for small images.
The model/power of the CPU/GPU: if the computer has a really crappy GPU, then it's probably better to run the algorithms on the CPU.
What I'm saying is: don't assume OpenCV's GPU module will always run its algorithms faster than the CPU you have. Test it, measure it! The only way to know for sure is through experimentation and benchmarking.
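As a starting point for such a benchmark, here is a small sketch (my own example, using the OpenCV 3.x CUDA module naming; the input file is a placeholder) that times the same Gaussian filter on the CPU and on the GPU, including the upload/download cost:

    #include <opencv2/core.hpp>
    #include <opencv2/imgcodecs.hpp>
    #include <opencv2/imgproc.hpp>
    #include <opencv2/cudafilters.hpp>
    #include <cstdint>
    #include <iostream>

    int main() {
        cv::Mat src = cv::imread("test.png", cv::IMREAD_GRAYSCALE);  // placeholder input
        cv::Mat dstCpu, dstGpu;

        // CPU timing.
        int64_t t0 = cv::getTickCount();
        cv::GaussianBlur(src, dstCpu, cv::Size(7, 7), 1.5);
        double cpuMs = (cv::getTickCount() - t0) * 1000.0 / cv::getTickFrequency();

        // GPU: warm up once so context initialization is not counted.
        cv::cuda::GpuMat gsrc, gdst;
        cv::Ptr<cv::cuda::Filter> gauss =
            cv::cuda::createGaussianFilter(CV_8UC1, CV_8UC1, cv::Size(7, 7), 1.5);
        gsrc.upload(src);
        gauss->apply(gsrc, gdst);

        // GPU timing, including upload/download, which is part of the real cost.
        t0 = cv::getTickCount();
        gsrc.upload(src);
        gauss->apply(gsrc, gdst);
        gdst.download(dstGpu);
        double gpuMs = (cv::getTickCount() - t0) * 1000.0 / cv::getTickFrequency();

        std::cout << "CPU: " << cpuMs << " ms, GPU (incl. transfer): " << gpuMs << " ms\n";
        return 0;
    }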

fastest image processing library?

I'm working on a robot vision system whose main purpose is to detect objects. I want to choose one of these libraries (CImg, OpenCV), and I have knowledge of both of them.
The robot I'm using runs Linux with a 1 GHz CPU and 1 GB of RAM, I'm using C++, and the image size is 320p.
I want near-real-time image processing, about 20 out of 25 frames per second.
In your opinion, which library is more powerful? I have tested both and they have about the same processing time; OpenCV is slightly better, and I think that's because I use pointers in my OpenCV code.
Please share your idea and your reason.
thanks.
I think you can probably get the best performance by integrating OpenCV with IPP.
See this reference, http://software.intel.com/en-us/articles/intel-integrated-performance-primitives-intel-ipp-open-source-computer-vision-library-opencv-faq/
Here is another reference http://experienceopencv.blogspot.com/2011/07/speed-up-with-intel-integrated.html
Further, once you have frozen an algorithm that works well, you can usually isolate it and work your way toward serious optimization (such as memory optimization, porting hot spots to assembly, etc.), which may not be available off the shelf.
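As a quick sanity check, here is a small sketch (assuming an OpenCV build configured with IPP support enabled at build time) that verifies at runtime whether the optimized code paths are active:

    #include <opencv2/core.hpp>
    #include <iostream>

    int main() {
        std::cout << "Optimized code paths enabled: " << cv::useOptimized() << std::endl;
        cv::setUseOptimized(true);  // make sure the accelerated implementations are used
        // The build information reports whether IPP was compiled in.
        std::cout << cv::getBuildInformation() << std::endl;
        return 0;
    }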
It really depends on what you want to do (what kind of objects you want to detect, accuracy, what algorithm you are using etc..) and how much time you have got. If it is for generic computer vision/image processing, I would stick with OpenCV. As Dipan said, do consider further optimization. In my experience with optimization for Computer Vision, the bottleneck usually is in memory interconnect bandwidth (or memory itself) and so you might have to trade in cycles (computation) to save on communication. Do understand the algorithm really well to further optimize the algorithm (which at times can give huge improvements as compared to compilers).

Fastest method to compute convolution

I have to apply a convolution filter on each row of many images. The classic case is 360 images of 1024x1024 pixels. In my use case it is 720 images of 560x600 pixels.
The problem is that my code is much slower than what is advertised in articles.
I implemented the naive convolution, and it takes 2m 30s. I then switched to an FFT using FFTW. I used complex-to-complex transforms, filtering two rows in each transform. I'm now around 20s.
The thing is that articles advertise around 10s, and even less, for the classic case.
So I'd like to ask the experts here if there could be a faster way to compute the convolution.
Numerical Recipes suggests avoiding the reordering done in the DFT and adapting the frequency-domain filter function accordingly, but there is no code example of how this could be done.
Maybe I lose time copying data. With a real-to-real transform I wouldn't have to copy the data into complex values, but I have to pad with zeros anyway.
EDIT: see my own answer below for progress feedback and further information on solving this issue.
Question (precise reformulation):
I'm looking for an algorithm or piece of code to apply a very fast convolution to a discrete, non-periodic function (512 to 2048 values). Apparently the discrete Fourier transform is the way to go, but I'd like to avoid data copies, conversion to complex values, and the butterfly reordering.
FFT is the fastest technique known for convolving signals, and FFTW is the fastest free library available for computing the FFT.
The key for you to get maximum performance (outside of hardware ... the GPU is a good suggestion) will be to pad your signals to a power of two. When using FFTW, use the 'patient' setting (FFTW_PATIENT) when creating your plan to get the best performance. It's highly unlikely that you will hand-roll a faster implementation than what FFTW provides (forget about N.R.). Also, be sure to use the real version of the forward 1D FFT, not the complex version, and use single (float) precision if you can.
If FFTW is not cutting it for you, then I would look at Intel's (very affordable) IPP library. The have hand tuned FFT's for Intel processors that have been optimized for images with various bit depths.
Paul
CenterSpace Software
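To make this concrete, here is a sketch (my own, not FFTW sample code) of an FFT-based row convolution along those lines: real-to-complex forward transforms, a pointwise complex multiply, a complex-to-real inverse, zero-padding to a power of two, single precision, and an FFTW_PATIENT plan. In a real application the plans should be created once and reused for every row:

    #include <fftw3.h>
    #include <cstring>

    // Full linear convolution of one row with a kernel; 'out' must hold
    // rowLen + kernLen - 1 values.
    void convolve_row(const float* row, int rowLen,
                      const float* kernel, int kernLen, float* out) {
        int n = 1;
        while (n < rowLen + kernLen - 1) n <<= 1;   // pad to a power of two
        const int nc = n / 2 + 1;                   // length of the r2c spectrum

        float* a = fftwf_alloc_real(n);
        float* b = fftwf_alloc_real(n);
        fftwf_complex* A = fftwf_alloc_complex(nc);
        fftwf_complex* B = fftwf_alloc_complex(nc);

        // Plan BEFORE filling the buffers: FFTW_PATIENT may clobber them while planning.
        fftwf_plan fwd = fftwf_plan_dft_r2c_1d(n, a, A, FFTW_PATIENT);
        fftwf_plan inv = fftwf_plan_dft_c2r_1d(n, A, a, FFTW_PATIENT);

        std::memset(a, 0, n * sizeof(float));
        std::memcpy(a, row, rowLen * sizeof(float));
        std::memset(b, 0, n * sizeof(float));
        std::memcpy(b, kernel, kernLen * sizeof(float));

        fftwf_execute(fwd);                 // A = FFT(padded row)
        fftwf_execute_dft_r2c(fwd, b, B);   // B = FFT(padded kernel), reusing the plan

        for (int k = 0; k < nc; ++k) {      // pointwise complex product, scaled by 1/n
            const float re = A[k][0] * B[k][0] - A[k][1] * B[k][1];
            const float im = A[k][0] * B[k][1] + A[k][1] * B[k][0];
            A[k][0] = re / n;
            A[k][1] = im / n;
        }

        fftwf_execute(inv);                 // a = linear convolution (length rowLen + kernLen - 1)
        std::memcpy(out, a, (rowLen + kernLen - 1) * sizeof(float));

        fftwf_destroy_plan(fwd);
        fftwf_destroy_plan(inv);
        fftwf_free(a); fftwf_free(b); fftwf_free(A); fftwf_free(B);
    }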
You may want to add image processing as a tag.
But this article may be of interest, especially with the assumption that the image size is a power of 2. You can also see where they optimize the FFT. I expect that the articles you are looking at made some assumptions and then optimized the equations for those.
http://www.gamasutra.com/view/feature/3993/sponsored_feature_implementation_.php
If you want to go faster you may want to use the GPU to actually do the work.
This book may be helpful for you, if you go with the GPU:
http://www.springerlink.com/content/kd6qm361pq8mmlx2/
This answer is to collect progress report feedback on this issue.
Edit 11 oct.:
The execution time I measured doesn't reflect the effective time of the FFT. I noticed that when my program ends, the CPU is still busy in system time, up to 42%, for 10 s. When I wait until the CPU is back to 0% before restarting my program, I then get a 15.35 s execution time, which comes from the GPU processing. I get the same time if I comment out the FFT filtering.
So the FFT is in fact currently faster than the GPU and was simply hindered by a competing system task. I don't know yet what this system task is. I suspect it results from the allocation of a huge heap block into which I copy the processing result before writing it to disk. For the input data I use a memory map.
I'll now change my code to get an accurate measurement of the FFT processing time. Making it faster is still relevant because there is room to optimize the GPU processing, for instance by pipelining the transfer of the data to be processed.
