GPU processing in Julia with minimal changes - machine-learning

Is there a way to execute your Julia ML code on a GPU without extensive editing?

Related

Parallel I/O: file per process vs libraries like HDF5

For high performance computing applications with parallel I/O onto Lustre file systems, does file-per-process output give the upper limit to performance? I had always used HDF5, assuming it was some sort of high performance library, until I realized how terrible the parallel I/O performance was compared to file-per-process for my specific application. Sure, file-per-process is not as beautiful, and may require some (cheap) postprocessing to get into a useful format, but after wasting so much time trying to optimize HDF5 and getting terrible performance in the end I am wondering why anyone would use such a library for parallel I/O for high performance computing. What is wrong with file-per-process output and why is it common to discourage it? For bandwidth, is there any way to beat it?
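For concreteness, here is a minimal sketch of what file-per-process output looks like, using mpi4py and NumPy (both assumptions on my part; the file naming scheme is hypothetical): each rank writes its own file independently, with no coordination, which is what makes it hard to beat on raw bandwidth.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each rank holds its own slice of the output (dummy data here).
local_data = np.random.rand(1_000_000)

# File-per-process: every rank writes its own file, independently and in parallel.
# The name pattern is hypothetical; any scheme that encodes the rank works.
with open(f"snapshot_0001.rank{rank:05d}.bin", "wb") as f:
    local_data.tofile(f)

# Optional: wait until all ranks have finished the snapshot.
comm.Barrier()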

If I train a neural network with CUDA, do I need to run the resulting model with CUDA?

Let's say I used CUDA to train an object tracking program. Could I then put that program on another computer that doesn't have a powerful GPU and run it there? Or is GPU support required to run the resulting model as well as to train it?
No, it does not matter how you trained your model. You can execute it in a completely different environment: CPU, GPU, cloud, or whatever you want. Since execution is usually much cheaper than training, you will usually need much less powerful hardware.
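As a concrete illustration (PyTorch is just an assumed example framework here, and the file name is hypothetical), a model trained on a CUDA GPU can be loaded and run on a CPU-only machine like this:

import torch

# "tracker.pt" is a placeholder for a model file that was trained on a CUDA GPU.
# map_location="cpu" remaps every CUDA tensor in the checkpoint to host memory,
# so the file can be loaded and run on a machine without any GPU at all.
model = torch.load("tracker.pt", map_location="cpu")
model.eval()

with torch.no_grad():                      # inference only, no gradient bookkeeping
    frame = torch.rand(1, 3, 224, 224)     # dummy input standing in for a video frame
    prediction = model(frame)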

OpenCV BackgroundSubtractorMOG2

I recently finished an algorithm for foreground extraction from video, but it processes each frame too slowly. OpenCV 3.0 has an algorithm based on a Gaussian mixture model, BackgroundSubtractorMOG2, and I find it processes each frame nearly 15 times faster than mine. I just wonder: is it accelerated by OpenCL on the GPU, or does it just run on the CPU? P.S. I've looked at some of its source code and noticed there are OpenCL blocks, but I'm not sure since I'm new to this. I would really appreciate it if anyone could help me figure this out!
If you look at the API page here, you will find the line:
The function implements a sparse iterative version of the Lucas-Kanade optical flow in pyramids. See [Bouguet00]. The function is parallelized with the TBB library.
The TBB library is a parallelization library used to "write parallel C++ programs that take full advantage of multicore performance". In other words, it uses more than one CPU core at a time, which is a much quicker way of processing. This can be seen in lines like this (Line 566):
parallel_for_(Range(0, image.rows),
              MOG2Invoker(image, fgmask,
                          (GMM*)bgmodel.data,
                          (float*)(bgmodel.data + sizeof(GMM)*nmixtures*image.rows*image.cols),
                          bgmodelUsedModes.data, nmixtures, (float)learningRate,
                          (float)varThreshold,
                          backgroundRatio, varThresholdGen,
                          fVarInit, fVarMin, fVarMax, float(-learningRate*fCT), fTau,
                          bShadowDetection, nShadowDetection));
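Incidentally, regardless of how the C++ loop above is parallelized internally, calling the subtractor is simple. A minimal usage sketch through the OpenCV Python bindings (the video path and parameter values are just placeholders):

import cv2

cap = cv2.VideoCapture("video.avi")                 # placeholder input video
mog2 = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16,
                                          detectShadows=True)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    # apply() runs the per-pixel mixture update shown above (TBB-parallelized in C++)
    fgmask = mog2.apply(frame)
cap.release()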

machine learning - predicting one instance at a time - lots of instances - trying not to use I/O

I have a large dataset and I'm trying to build a DAgger classifier for it.
As you know, at training time I need to run the initially learned classifier on the training instances (predict them), one instance at a time.
LibSVM is too slow even for the initial learning.
I'm using OLL, but it requires writing each instance to a file, running the test code on it, and reading back the prediction, which involves a lot of disk I/O.
I have considered vowpal_wabbit (though I'm not sure it would help with the disk I/O), but I don't have permission to install it on the cluster I'm working on.
Liblinear is too slow and, I believe, again needs disk I/O.
What are the other alternatives I can use?
I recommend trying Vowpal Wabbit (VW). If Boost (and gcc or clang) is installed on the cluster, you can simply compile VW yourself (see the Tutorial). If Boost is not installed, you can compile Boost yourself as well.
VW contains more modern algorithms than OLL. Moreover, it includes several structured prediction algorithms (SEARN, DAgger) as well as C++ and Python interfaces. See the iPython notebook tutorial.
As for the disk I/O: for one-pass learning, you can pipe the input data directly to vw (cat data | vw) or run vw --daemon. For multi-pass learning, you must use a cache file (the input data in a binary, fast-to-load format), which takes some time to create (during the first pass, unless it already exists), but the subsequent passes are much faster because of the binary format.
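For the asker's one-instance-at-a-time prediction scenario, vw --daemon avoids file I/O entirely: you send one example line over a socket and read one prediction line back. A rough sketch in Python (the port number, model file, and feature names are just examples; the daemon is assumed to have been started with something like vw --daemon --port 26542 -i model.vw -t --quiet):

import socket

HOST, PORT = "localhost", 26542    # must match the port the vw daemon listens on

def predict(example_line: str) -> float:
    """Send one VW-format example to the daemon and read back one prediction line."""
    with socket.create_connection((HOST, PORT)) as sock:
        sock.sendall((example_line.strip() + "\n").encode())
        reply = sock.makefile().readline()   # the daemon answers one line per example
    return float(reply.split()[0])

print(predict("| height:0.23 width:0.25 weight:0.05"))

In practice you would keep a single connection open and reuse it rather than reconnecting for every example.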

Would it work, and would it be faster, if I call functions from the OpenCV GPU module inside my kernel function?

OpenCV has a gpu module: GPU-accelerated Computer Vision (http://docs.opencv.org/modules/gpu/doc/gpu.html). Many of its functions already use GPU techniques, so I can directly use the functions OpenCV provides. But I wonder whether it would be faster if I wrote my own kernel and, inside each kernel, called functions from the OpenCV GPU module. This is for the case where I have many images: to handle each image I would call an OpenCV GPU-module function, so it would be nested parallelism.
Your question is not entirely clear to me, but I would like to say this: it's impossible to say which would be faster unless somebody has already implemented that same algorithm using the approach you have in mind and shared benchmark results.
There are a number of factors involved:
It depends on the type of operation you are trying to implement: techniques with high arithmetic intensity are certainly a better fit for GPUs, but not all problems map well to a GPU.
The size of the input images matters: the time spent sending data from RAM to the GPU might not pay off in the end, so running the algorithm on the CPU can be faster for small images.
The model/power of the CPU and GPU matters: if the computer has a really crappy GPU, then it's probably better to run the algorithms on the CPU.
What I'm saying is: don't assume OpenCV's GPU module will always run its algorithms faster than your CPU. Test it, measure it! The only way to know for sure is through experimentation and benchmarking.
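For example, a minimal benchmarking sketch through the OpenCV Python bindings, assuming a build with the CUDA modules enabled (the image size, iteration count, and the choice of resize as the workload are arbitrary):

import time
import cv2
import numpy as np

img = np.random.randint(0, 256, (2160, 3840, 3), dtype=np.uint8)   # synthetic 4K frame
N = 50

# CPU path
t0 = time.perf_counter()
for _ in range(N):
    cv2.resize(img, None, fx=0.5, fy=0.5)
cpu_ms = (time.perf_counter() - t0) / N * 1000

# GPU path: the upload is done once here; in a fair comparison the
# host-to-device (and device-to-host) transfer time should be measured too.
gpu_img = cv2.cuda_GpuMat()
gpu_img.upload(img)
t0 = time.perf_counter()
for _ in range(N):
    cv2.cuda.resize(gpu_img, (1920, 1080))
gpu_ms = (time.perf_counter() - t0) / N * 1000

print(f"CPU: {cpu_ms:.2f} ms/frame   GPU: {gpu_ms:.2f} ms/frame")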

Resources