I have to apply a convolution filter to each row of many images. The classic case is 360 images of 1024x1024 pixels. In my use case it is 720 images of 560x600 pixels.
The problem is that my code is much slower than what is advertised in articles.
I have implemented the naive convolution, and it takes 2m 30s. I then switched to FFT using FFTW. I used complex-to-complex transforms, filtering two rows in each transform. I'm now at around 20s.
The thing is that articles advertise around 10s and even less for the classic condition.
So I'd like to ask the experts here if there could be a faster way to compute the convolution.
Numerical Recipes suggests avoiding the reordering done in the DFT and adapting the frequency-domain filter function accordingly. But there is no code example of how this could be done.
Maybe I lose time copying data. With a real-to-real transform I wouldn't have to copy the data into complex values. But I have to pad with zeros anyway.
EDIT: see my own answer below for progress feedback and further information on solving this issue.
Question (precise reformulation):
I'm looking for an algorithm or piece of code to apply a very fast convolution to a discrete non-periodic function (512 to 2048 values). Apparently the discrete Fourier transform is the way to go. However, I'd like to avoid copying the data and converting it to complex, and to avoid the butterfly reordering.

FFT is the fastest technique known for convolving signals, and FFTW is the fastest free library available for computing the FFT.
The key for you to get maximum performance (outside of hardware ... the GPU is a good suggestion) will be to pad your signals to a power of two. When using FFTW, use the 'patient' setting when creating your plan to get the best performance. It's highly unlikely that you will hand-roll a faster implementation than what FFTW provides (forget about N.R.). Also be sure to use the real version of the forward 1D FFT rather than the complex version, and use single precision (float) if you can.
If FFTW is not cutting it for you, then I would look at Intel's (very affordable) IPP library. They have hand-tuned FFTs for Intel processors that have been optimized for images with various bit depths.
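To make the zero-padding step concrete, here is a minimal sketch of linear (non-periodic) convolution of one row via FFT. It is written in plain Python so it runs anywhere; the small recursive `fft` is only a stand-in for an FFTW plan, and the helper names (`next_pow2`, `fft`, `fft_convolve`) are made up for this illustration. It is not tuned for speed.

```python
import cmath

def next_pow2(n):
    """Smallest power of two that is >= n."""
    p = 1
    while p < n:
        p *= 2
    return p

def fft(x, inverse=False):
    """Recursive radix-2 FFT, unscaled. len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def fft_convolve(row, kernel):
    """Linear convolution of two real sequences using zero-padded FFTs."""
    out_len = len(row) + len(kernel) - 1
    n = next_pow2(out_len)  # pad so the circular wrap-around cannot alias
    a = fft([complex(v) for v in row] + [0j] * (n - len(row)))
    b = fft([complex(v) for v in kernel] + [0j] * (n - len(kernel)))
    y = fft([p * q for p, q in zip(a, b)], inverse=True)
    return [v.real / n for v in y[:out_len]]  # 1/n undoes the unscaled transforms

print(fft_convolve([1, 2, 3, 4, 5], [1, 0, -1]))
```

In a real implementation the transform of the kernel would be computed once and reused for every row, so each row costs one forward transform, one pointwise multiply, and one inverse transform.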
You may want to add image processing as a tag.
But this article may be of interest, especially with the assumption that the image size is a power of 2. You can also see where they optimize the FFT. I expect that the articles you are looking at made some assumptions and then optimized the equations for those.
If you want to go faster you may want to use the GPU to actually do the work.
This book may be helpful for you, if you go with the GPU:

This answer is to collect progress report feedback on this issue.
Edit 11 oct.:
The execution time I measured doesn't reflect the effective time of the FFT. I noticed that when my program ends, the CPU is still busy in system time, at up to 42%, for 10s. When I wait until the CPU is back to 0% before restarting my program, I get a 15.35s execution time, which comes from the GPU processing. I get the same time if I comment out the FFT filtering.
So the FFT is in fact currently faster than the GPU and was simply hindered by a competing system task. I don't know yet what this system task is. I suspect it results from the allocation of a huge heap block where I copy the processing result before writing it to disk. For the input data I use a memory map.
I'll now change my code to get an accurate measurement of the FFT processing time. Making it faster is still relevant because there is room to optimize the GPU processing, for instance by pipelining the transfer of the data to process.


Incremental Singular Value Decomposition for OpenCV

I have a large archive of images from an outdoor camera: close to 200000 items, each 1280x960 color pixels. I would like to index this database by constructing an SVD (eigen-images) for this data and making reduced vectors of data (say a 100-dimensional vector for every picture).
Loading all this data into RAM at once would require about 200GB of RAM.
Firstly, I don't have so much RAM.
Secondly, it won't scale well. So, I am looking for an implementation of incremental singular value decomposition, which probably exists for libraries like OpenCV or Eigen.
I don't want to reduce the resolution before computing the SVD because I believe that small details (distant objects that are still resolved) may be important to me, and reducing the resolution would lose all high-frequency features.
I found that NN algorithms GHA or APEX could help here.
Yet another algorithm:
I haven't seen an implementation using Eigen. But it doesn't seem that difficult to code the same method that scikit-learn uses for incremental PCA.
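For intuition about how a streaming method can avoid holding the data in RAM, here is a minimal sketch of Oja's rule, which is the single-component case of GHA. It is not the scikit-learn method mentioned above and not an OpenCV or Eigen API; the function names, the learning rate, and the synthetic 2D data are assumptions made for the illustration, and the samples are assumed to be zero-mean.

```python
import math
import random

def oja_update(w, x, lr):
    """One streaming update of the weight vector from a single sample x."""
    y = sum(wi * xi for wi, xi in zip(w, x))  # projection of x onto w
    return [wi + lr * y * (xi - y * wi) for wi, xi in zip(w, x)]

def stream_samples(count, seed=0):
    """Synthetic zero-mean 2D samples with most variance along (0.6, 0.8)."""
    rng = random.Random(seed)
    d1, d2 = (0.6, 0.8), (-0.8, 0.6)
    for _ in range(count):
        a, b = rng.gauss(0.0, 1.0), rng.gauss(0.0, 0.1)
        yield [a * d1[0] + b * d2[0], a * d1[1] + b * d2[1]]

def estimate_first_component(samples, dim, lr=0.01):
    """Consume samples one at a time; only w (dim values) stays in memory."""
    w = [1.0] + [0.0] * (dim - 1)
    for x in samples:
        w = oja_update(w, x, lr)
    norm = math.sqrt(sum(wi * wi for wi in w))
    return [wi / norm for wi in w]

w = estimate_first_component(stream_samples(2000), dim=2)
print(w)  # should be close to +/-(0.6, 0.8)
```

For images, `x` would be one flattened, mean-subtracted picture read from disk, and GHA extends the same update to several components at once.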

Apple Accelerate vDSP fft vs DFT and scaling factors

I am an experienced programmer but I don't have a lot of experience implementing DSP routines.
I've been banging my head against this for weeks if not months. My question is two fold, concerning Apple's Accelerate framework:
In the vDSP.h header there are comments to the effect of: please use vDSP_DFT_XXX instead of the (I guess) older versions vDSP_fft_XXX. However there are zero examples of this outside of Apple's Maybe it's just that the DFT functions are newer? If so, fine and dandy.
Scaling factors. According to the documentation, in the case of an FFT on a real input, like the audio that I am working with, the resulting value of each of the Fourier coefficients is 2x the actual, mathematical value.
And yet, in every example, including Apple's own, the scaling factor passed to the subsequent vsmul() call looks like it is 1/(2N) instead of 1/2 as I would expect.
Further, there is no documentation about the scaling factors for the vDSP_DFT_XXX routines, but I assume that they just wrap the older ones?
Any insight into either of these questions would be greatly appreciated! Hopefully I'm just missing something basic about the way that FFTs are implemented in this framework (or in general).
There are at least 3 different FFT scaling options that produce "mathematical" results, and there is no single standard scaling. With an energy-preserving FFT (see Parseval's theorem), the results need to be scaled by a factor on the order of 1/N to recover input magnitudes, since a longer signal of the same magnitude will have proportionally more energy. vDSP uses an energy-preserving forward FFT.
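A small numerical check can show where the 1/N comes from. The sketch below uses a plain, unscaled DFT written in Python, not vDSP itself; `dft_bin` and `cosine` are names invented for the example. For a cosine of amplitude A that falls exactly on a bin, the unscaled bin magnitude is A*N/2, so it grows with the transform length.

```python
import cmath
import math

def dft_bin(x, k):
    """Unscaled forward DFT of x evaluated at bin k."""
    n = len(x)
    return sum(v * cmath.exp(-2j * cmath.pi * k * i / n) for i, v in enumerate(x))

def cosine(amplitude, cycles, n):
    """n samples of a cosine that completes `cycles` periods in the window."""
    return [amplitude * math.cos(2 * math.pi * cycles * i / n) for i in range(n)]

for n in (64, 256):
    raw = abs(dft_bin(cosine(3.0, 5, n), 5))
    # raw grows with n; raw / n is the two-sided coefficient A/2;
    # 2 * raw / n recovers the amplitude A.
    print(n, round(raw, 6), round(raw / n, 6), round(2 * raw / n, 6))
```

If, as the question reads the documentation, the real forward transform returns twice the mathematical value, then the 1/2 that undoes that factor and the 1/N shown here combine into the 1/(2N) seen in the examples.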

TBB parallel_for with less number of threads

I have written multi-view face detection code using the OpenCV face detector. I am running five detectors (trained for different pose angles) over an image and combining their weights to detect faces in the image. I have made the code parallel using TBB parallel_for, but it improved the performance by only 1.7 times. I would like to ask if there is a better way to run five detectors in parallel.
I am running my code on a cluster with 16 cores. I think the number of threads (5 in my case) is too small to use the full power of the machine.
Any suggestions?
Some possible problems to look into:
One of the detectors takes longer than the other detectors to run. For example, if one detector takes 4 units of time, and the other four detectors each take 1 unit of time, the most possible speedup is 2x. Parallelizing the slow detector itself might help in this kind of situation.
The detectors run so fast that the parallel_for does not have time to spread the work. If the detectors each take at least 0.1 sec, this should not be a problem.
Memory bandwidth can be a limiting resource, particularly if working sets do not fit in outer-level cache.
A profiler such as Intel(R) VTune(TM) Amplifier can sometimes help to track down these problems. Both commercial and non-commercial licenses exist for Amplifier. [Disclaimer: I work for Intel.]
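The load-imbalance point can be checked with a few lines of arithmetic. This is a minimal sketch with a hypothetical helper, `max_speedup`, that assumes each detector runs on a single worker and is never split; the timings are the illustrative units from the example above, not measurements.

```python
def max_speedup(task_times, workers):
    """Upper bound on speedup when whole tasks are assigned to workers."""
    serial = sum(task_times)
    # The parallel run cannot finish before the longest task does,
    # nor before perfectly balanced work across all workers would.
    parallel_lower_bound = max(max(task_times), serial / workers)
    return serial / parallel_lower_bound

print(max_speedup([4, 1, 1, 1, 1], workers=16))  # one slow detector
print(max_speedup([1, 1, 1, 1, 1], workers=16))  # balanced detectors
```

With five whole-detector tasks the bound can never exceed 5 regardless of the core count, which is why splitting the work inside each detector is the way to use more of the 16 cores.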

Real Time Cuda Image Processing advice

I am trying to implement an algorithm for a system in which the camera captures 1000 fps. I need to get the value of each pixel in all images and perform various calculations on the evolution of pixel[i][j] over N images, for all the pixels in the images. I have the image data as (unsigned char *ptr), and I want to transfer it to the GPU and start implementing the algorithm, but I am not sure what the best option for real-time processing would be.
my system:
CPU Intel Xeon x5660 2.8Ghz(2 processors)
GPU NVIDIA Quadro 5000
I got the following questions:
Do I need to add any image processing library in addition to CUDA? If yes, what do you suggest?
Can I create a matrix for pixel[i,j] containing the values from images [1:n], for each pixel in the image? For example, for 1000 images of size 200x200 I would end up with 40000 matrices, each containing 1000 values for one pixel. Does CUDA give me something like OpenCV's matrices or vectors?
1 - Do I need to add any image processing library in addition to CUDA?
Apples and oranges. Each has a different purpose. An image processing library like OpenCV offers a lot more than simple accelerated matrix computations. Maybe you don't need OpenCV to do the processing in this project as you seem to rather use CUDA directly. But you could still have OpenCV around to make it easier to load and write different image formats from the disk.
2 - Does CUDA give me something like OpenCV's matrices?
Absolutely. Some time ago I wrote a simple (educational) application that used OpenCV to load an image from the disk and CUDA to convert it to grayscale. The project is named cuda-grayscale. I haven't tested it with CUDA 4.x, but the code shows how to do the basics when combining OpenCV and CUDA.
It sounds like you will have 40000 independent calculations, where each calculation works only within one (temporal) pixel. If so, this should be a good task for the GPU. Your 352 core Fermi GPU should be able to beat your 12 hyperthreaded Xeon cores.
Is the algorithm you plan to run a common operation? It sounds like it might not be, in which case you will likely have to write your own kernels.
Yes, you can have arrays of elements of any type in CUDA.
Having this be a "streaming-oriented" approach is good for a GPU implementation in that it maximizes the number of calculations relative to transfers over the PCIe bus. It might also introduce difficulties: if you want to process the 1000 values for a given pixel in a specific order (oldest to newest, for instance), you will probably want to avoid continuously shifting all the frames in memory to make room for the newest frame. It will slightly complicate your addressing of the pixel values, but the best approach to avoid shifting the frames may be to overwrite the oldest frame with the newest frame each time a new frame is added. That way, you end up with a "stack of frames" that is fairly well ordered but has a discontinuity between old and new frames somewhere within it.
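The overwrite-the-oldest scheme is a ring buffer over frames. Here is a minimal host-side sketch of the addressing, written in Python for brevity; the class `FrameRing` and its methods are hypothetical names, and in a CUDA kernel the same index arithmetic would be applied to a device buffer.

```python
class FrameRing:
    """Fixed-capacity stack of frames where the newest overwrites the oldest."""

    def __init__(self, capacity, frame_size):
        self.capacity = capacity
        self.frame_size = frame_size
        self.data = bytearray(capacity * frame_size)  # one flat buffer, no shifting
        self.next_slot = 0  # slot that the next frame will overwrite
        self.count = 0      # number of valid frames stored so far

    def push(self, frame):
        if len(frame) != self.frame_size:
            raise ValueError("unexpected frame size")
        start = self.next_slot * self.frame_size
        self.data[start:start + self.frame_size] = frame
        self.next_slot = (self.next_slot + 1) % self.capacity
        self.count = min(self.count + 1, self.capacity)

    def pixel_history(self, pixel_index):
        """Values of one pixel across stored frames, oldest to newest."""
        oldest = (self.next_slot - self.count) % self.capacity
        return [
            self.data[((oldest + t) % self.capacity) * self.frame_size + pixel_index]
            for t in range(self.count)
        ]

ring = FrameRing(capacity=3, frame_size=2)
for frame in ([1, 10], [2, 20], [3, 30], [4, 40]):
    ring.push(bytes(frame))
print(ring.pixel_history(0))  # → [2, 3, 4]
```

The discontinuity mentioned above sits at `next_slot`; the modulo in `pixel_history` is the only extra cost of reading a pixel's values in temporal order.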
Do I need to add any image processing library in addition to CUDA? If yes, what do you suggest?
Disclosure: My company develops and markets CUVILib.
There are very few options when it comes to GPU-accelerated imaging libraries that also offer general-purpose functionality. CUVILib is one of those options. It offers the following, which suits your specific needs:
CuviImage object which holds your image data and image as a 2D matrix
You can write your own GPU function and use CuviImage as a 2D GPU matrix.
CUVILib already provides a rich set of imaging functionality such as color operations, image statistics, feature detection, motion estimation, FFT, and image transforms, so chances are that you will find the functionality you need.
As for the question of whether GPUs are suited for your application: Yes! Imaging is one of those domains which are ideal for parallel computation.

Use Digital Signal Processors to accelerate calculations in the same fashion as GPUs

I read that several DSP cards that process audio can calculate Fourier transforms very fast, along with some other functions involved in sound processing. There are some scientific problems (not many), such as in quantum mechanics, that involve Fourier transform calculations. I wonder if DSPs could be used to accelerate calculations in this fashion, as GPUs do in some other cases, and if you know of successful examples.
Any linear operations are easier and faster to do on DSP chips. Their architecture allows you to perform a linear operation (take two numbers, multiply each of them by a constant and add the results) in a single clock cycle. This is one of the reasons FFT can be calculated so quickly on a DSP chip. This is also a reason many other linear operations can be accelerated with their use. I guess I have three main points to make concerning performance and code optimization for such processors.
1) Perhaps less relevant, but I'd like to mention it nonetheless. In order to take full advantage of DSP processor's architecture, you have to code in Assembly. I'm pretty sure that regular C code will not be fully optimized by the compiler to do what you want. You literally have to specify each register, etc. It does pay off, however. The same way, you are able to make use of circular buffers and other DSP-specific things. Circular buffers are also very useful for calculating the FFT and FFT-based (circular) convolution.
2) FFT can be found in solutions to many problems, such as heat flow (Fourier himself actually came up with the solution back in the 1800s), analysis of mechanical oscillations (or any linear oscillators for that matter, including oscillators in quantum physics), analysis of brain waves (EEG), seismic activity, planetary motion and many other things. Any mathematical problem that involves convolution can be easily solved via the Fourier transform, analog or discrete.
3) For some of the applications listed above, including audio processing, transforms other than the FFT are constantly being invented, discovered, and applied to processing, such as Mel-Cepstrum (e.g. MPEG codecs), wavelet transform (e.g. JPEG2000 codecs), discrete cosine transform (e.g. JPEG codecs) and many others. In quantum physics, however, the Fourier Transform is inherent in the equation of angular momentum. It arises naturally, not just for the purposes of analysis or ease of calculation. For this reason, I would not necessarily put the reasons to use the Fourier Transform in audio processing and quantum mechanics into the same category. For signal processing, it's a tool; for quantum physics, it's in the nature of the phenomenon.
Before GPUs and SIMD instruction sets in mainstream CPUs, this was the only way to get performance for some applications. In the late 20th century I worked for a company that made PCI cards to place extra processors in a PCI slot. Some of these were DSP cards using a TI C64x DSP, others were PowerPC cards to provide Altivec. The processor on the cards would typically run without an operating system, which gave more predictable real-time scheduling than the host. Customers would buy an industrial PC with a large PCI backplane and attach multiple cards. We would also make cards in form factors such as PMC, CompactPCI, and VME for more rugged environments.
People would develop code to run on these cards, and host applications which communicated with the add-in card over the PCI bus. These weren't easy platforms to develop for, and the modern libraries for GPU computing are much easier.
Nowadays this is much less common. The price/performance ratio is so much better for general-purpose CPUs and GPUs that DSPs for scientific computing are vanishing. Current DSP manufacturers tend to target lower-power embedded applications or cost-sensitive high-volume devices like digital cameras. Compare GPUFFTW with these Analog Devices benchmarks. The DSP peaks at 3.2 GFlops, and the Nvidia 8800 reaches 29 GFlops.
