Parallel I/O: file per process vs libraries like HDF5 - hdf5

For high performance computing applications with parallel I/O onto Lustre file systems, does file-per-process output give the upper limit to performance? I had always used HDF5, assuming it was some sort of high performance library, until I realized how terrible the parallel I/O performance was compared to file-per-process for my specific application. Sure, file-per-process is not as beautiful, and may require some (cheap) postprocessing to get into a useful format, but after wasting so much time trying to optimize HDF5 and getting terrible performance in the end I am wondering why anyone would use such a library for parallel I/O for high performance computing. What is wrong with file-per-process output and why is it common to discourage it? For bandwidth, is there any way to beat it?

Related

Would it work and be faster if I call function in OpenCV GPU module in my kernel function?

OpenCV has a gpu. GPU-accelerated Computer Vision module (http://docs.opencv.org/modules/gpu/doc/gpu.html). There are many functions which is already use GPU techniques. So I can directly use the function OpenCV applies. But I wonder whether it would be faster if I write my own kernel and in each kernel I call function of OpenCV GPU module. This is in the case I have many images. To handle each image I call OpenCV funtion in GPU module. Then it would be parallel-nested-parallel.
Your question is not entirely clear to me, but I would like to say this: it's impossible to say which would be faster, unless somebody already implemented that same algorithm using the approach you have in mind, and then shared a report about the benchmark tests.
There's a number of factors involved:
It depends on the type of operation you are trying to implement: techniques that have a high arithmetic intensity are better fit for GPUs for sure, however, not all problems can be modeled for GPUs.
The size of the input images matter: wasting time sending data from RAM to the GPU might not compensate in the end, so running the algorithm on the CPU can be faster for small images.
The model/power of the CPU/GPU: if the computer has a really crappy GPU, then it's probably better to run the algorithms on the CPU.
What I'm saying is: don't assume OpenCV GPU's module will always run it's algorithms faster than the CPU you got. Test it, measure it! The only way to know for sure is through experimentation and benchmark.

fastest image processing library?

I'm working on robot vision system and its main purpose is to detect objects, i want to choose one of these libraries (CImg , OpenCV) and I have knowledge about both of them.
The robot I'm using has Linux , 1GHz CPU and 1G ram and I'm using C++ the size of image is 320p.
I want to have a real-time image processing near 20 out of 25 frames per seconds.
In your opinion which library is more powerful l although I have tested both and they have the same process time, open cv is slightly better and I think that's because I use pointers with open cv codes.
Please share your idea and your reason.
thanks.
I think you can possibly get best performance when you integrated - OpenCV with IPP.
See this reference, http://software.intel.com/en-us/articles/intel-integrated-performance-primitives-intel-ipp-open-source-computer-vision-library-opencv-faq/
Here is another reference http://experienceopencv.blogspot.com/2011/07/speed-up-with-intel-integrated.html
Further, if you freeze the algorithm that works perfectly, usually you can isolate your algorithm and work your way towards doing serious optimization (such as memory optimization, porting to assembly etc.) which might not be ready to use.
It really depends on what you want to do (what kind of objects you want to detect, accuracy, what algorithm you are using etc..) and how much time you have got. If it is for generic computer vision/image processing, I would stick with OpenCV. As Dipan said, do consider further optimization. In my experience with optimization for Computer Vision, the bottleneck usually is in memory interconnect bandwidth (or memory itself) and so you might have to trade in cycles (computation) to save on communication. Do understand the algorithm really well to further optimize the algorithm (which at times can give huge improvements as compared to compilers).

Why do queries with shorter inverted lists perform better on CPU's when compared to GPU's

Moreover, why do queries with longer inverted list perform better on GPU's?
I read this result in a paper called Using Graphics Processors for High Performance IR querying.
Queries with longer lists work better on GPUs, because GPUs are highly parallel, and search is a mostly parallel problem.
However, GPUs (and other massively parallel computers) don't process things the same way few-core CPUs do. Like with any other problem, there is non-negligible work to be done to set up the problem for the GPU. For small problem sizes, this overhead swamps out any speedup provided by the GPUs.

Is it practical to include an adaptive or optimizing memory strategy into a library?

I have a library that does I/O. There are a couple of external knobs for tuning the sizes of the memory buffers used internally. When I ran some tests I found that the sizes of the buffers can affect performance significantly.
But the optimum size seems to depend on a bunch of things - the available memory on the PC, the the size of the files being processed (varies from very small to huge), the number of files, the speed of the output stream relative to the input stream, and I'm not sure what else.
Does it make sense to build an adaptive memory strategy into the library? or is it better to just punt on that, and let the users of the library figure out what to use?
Has anyone done something like this - and how hard is it? Did it work?
Given different buffer sizes, I suppose the library could track the time it takes for various operations, and then it could make some decisions about which size was optimal. I could imagine having the library rotate through various buffer sizes in the initial I/O rounds... and then it eventually would do the calculations and adjust the buffer size in future rounds depending on the outcomes. But then, how often to re-check? How often to adjust?
The adaptive approach is sometimes referred to as "autonomic", using the analogy of a Human's autonomic nervous system: you don't conciously control your heart rate and respiration, your autonomic nervous system does that.
You can read about some of this here, and here (apologies for the plugs, but I wanted to show that the concept is being taken seriously, and is manifesting in real products.)
My experience of using products that try to do this is that they do acually work, but can make me unhappy: that's because there is a tendency for them to take a "Father knows best" approach. You make some (you believe) small change to your app, or the environment and something unexecpected happens. You don't know why, and you don't know if it's good. So my rule for autonomy is:
Tell me what you are doing and why
Now sometimes the underlying math is quite complex - consider that some autonomic systems are trending and hence making predictive changes (number of requests of this type growing, let's provision more of resource X) so the mathematical models are non-trivial. Hence simple explanations are not always available. However some level of feedback to the watching humans can be reassuring.

Fastest method to compute convolution

I have to apply a convolution filter on each row of many images. The classic is 360 images of 1024x1024 pixels. In my use case it is 720 images 560x600 pixels.
The problem is that my code is much slower than what is advertised in articles.
I have implemented the naive convolution, and it takes 2m 30s. I then switched to FFT using fftw. I used complex 2 complex, filtering two rows in each transform. I'm now around 20s.
The thing is that articles advertise around 10s and even less for the classic condition.
So I'd like to ask the experts here if there could be a faster way to compute the convolution.
Numerical recipes suggest to avoid the sorting done in the dft and adapt the frequency domain filter function accordingly. But there is no code example how this could be done.
Maybe I lose time in copying data. With real 2 real transform I wouldn't have to copy the data into the complexe values. But I have to pad with 0 anyway.
EDIT: see my own answer below for progress feedback and further information on solving this issue.
Question (precise reformulation):
I'm looking for an algorithm or piece of code to apply a very fast convolution to a discrete non periodic function (512 to 2048 values). Apparently the discrete time Fourier transform is the way to go. Though, I'd like to avoid data copy and conversion to complex, and avoid the butterfly reordering.
FFT is the fastest technique known for convolving signals, and FFTW is the fastest free library available for computing the FFT.
The key for you to get maximum performance (outside of hardware ... the GPU is a good suggestion) will be to pad your signals to a power of two. When using FFTW use the 'patient' setting when creating your plan to get the best performance. It's highly unlikely that you will hand-roll a faster implementation than what FFTW provides (forget about N.R.). Also be sure to be using the Real version of the forward 1D FFT and not the Complex version; and only use single (floating point) precision if you can.
If FFTW is not cutting it for you, then I would look at Intel's (very affordable) IPP library. The have hand tuned FFT's for Intel processors that have been optimized for images with various bit depths.
Paul
CenterSpace Software
You may want to add image processing as a tag.
But, this article may be of interest, esp with the assumption the image is a power or 2. You can also see where they optimize the FFT. I expect that the articles you are looking at made some assumptions and then optimized the equations for those.
http://www.gamasutra.com/view/feature/3993/sponsored_feature_implementation_.php
If you want to go faster you may want to use the GPU to actually do the work.
This book may be helpful for you, if you go with the GPU:
http://www.springerlink.com/content/kd6qm361pq8mmlx2/
This answer is to collect progress report feedback on this issue.
Edit 11 oct.:
The execution time I measured doesn't reflect the effective time of the FFT. I noticed that when my program ends, the CPU is still busy in system time up to 42% for 10s. When I wait until the CPU is back to 0%, before restarting my program I then get the 15.35s execution time which comes from the GPU processing. I get the same time if I comment out the FFT filtering.
So the FFT is in fact currently faster then the GPU and was simply hindered by a competing system task. I don't know yet what this system task is. I suspect it results from the allocation of a huge heap block where I copy the processing result before writing it to disk. For the input data I use a memory map.
I'll now change my code to get an accurate measurement of the FFT processing time. Making it faster is still actuality because there is room to optimize the GPU processing like for instance by pipelining the transfer of data to process.

Resources