How does partial evaluation (PE) take place for a Truffle interpreter AOT-compiled with native-image?

Using native-image to improve startup times of Truffle interpreters seems to be common.
My understanding is that AOT compilation with native-image results in methods compiled to native code that run on the special-purpose SubstrateVM.
I also understand that the Truffle framework relies on dynamically gathered profiling information to determine which trees of nodes to partially evaluate, and that PE works by taking the JVM bytecode of the nodes in question and analyzing it with the help of the Graal JIT compiler.
And here's where I'm confused. If we pass a Truffle interpreter through native-image, the code for each node's methods will be native code. How can PE proceed, then? In fact, is Graal even available in SubstrateVM?

Besides the native code of the interpreter, SVM also stores in the image a representation of the interpreter (the group of methods that make up the interpreter) for partial evaluation. The format of this representation is not JVM bytecode, but graphs already parsed into Graal IR form. PE runs on these graphs, producing even smaller, optimized graphs that are then fed to the Graal compiler; so yes, SVM ships the Graal compiler in the native image as well.
Why the Graal graphs and not the bytecode? Bytecode was used in the past, but storing the graphs directly saves the bytecode-to-Graal-IR parsing step.
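To make "the group of methods that make up the interpreter" concrete, here is a minimal, hypothetical Truffle-style AST in Java (a sketch for illustration only, not taken from any real language implementation). The execute methods below are exactly the kind of interpreter methods whose Graal IR graphs native-image stores in the image alongside their AOT-compiled native code:

    // Sketch only: a tiny arithmetic "language" to show what PE consumes.
    import com.oracle.truffle.api.frame.VirtualFrame;
    import com.oracle.truffle.api.nodes.Node;
    import com.oracle.truffle.api.nodes.RootNode;

    abstract class ExprNode extends Node {
        abstract int executeInt(VirtualFrame frame);
    }

    final class LiteralNode extends ExprNode {
        private final int value;
        LiteralNode(int value) { this.value = value; }
        @Override
        int executeInt(VirtualFrame frame) { return value; } // PE folds this to a constant
    }

    final class AddNode extends ExprNode {
        @Child private ExprNode left;
        @Child private ExprNode right;
        AddNode(ExprNode left, ExprNode right) { this.left = left; this.right = right; }
        @Override
        int executeInt(VirtualFrame frame) {
            // PE inlines both children, so this virtual dispatch disappears in compiled code
            return left.executeInt(frame) + right.executeInt(frame);
        }
    }

    final class ExprRootNode extends RootNode {
        @Child private ExprNode body;
        ExprRootNode(ExprNode body) { super(null); this.body = body; }
        @Override
        public Object execute(VirtualFrame frame) { return body.executeInt(frame); }
    }

When a call target built from ExprRootNode gets hot, PE starts from the stored graph of its execute method, follows the @Child fields through the stored graphs of the node methods, and folds the whole tree into one small optimized graph that is handed to the Graal compiler shipped in the image.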

Related

Use of third-party library

I'm interested in using Alea GPU with a third-party library and am trying to get a sense of my options. Specifically, I'm interested in using this L-BFGS library. I'm fairly new to the F# ecosystem but do have experience with both CUDA and functional programming.
I've been using that L-BFGS library as part of a program which implements logistic regression. It would be neat if I could assume the library is correct and write the rest of my code (including the code that runs on the GPU) in type-safe F#.
It seems possible to link C++ with F#. Assuming I figure out how to integrate the L-BFGS library into an F# program, would introducing Alea GPU cause any issues?
What I am trying to avoid is re-writing L-BFGS in F# using Alea. However, maybe that's actually the easiest path to using F#. If Alea has any facilities for nonlinear optimization, I could probably use those instead.
Alea GPU does not have a nonlinear optimizer yet. The CUDA version has a slightly different implementation than the standard CPU L-BFGS, which sometimes causes accuracy issues. Apart from that I did not face any issues with the code, except that the performance win also depends significantly on the objective function; the objective function for logistic regression is numerically relatively cheap.
We have an internal C# version of this code ported to Alea GPU, which could also be used from F#, and we plan to release it in a future version.

machine learning - predicting one instance at a time - lots of instances - trying not to use I/O

I have a large dataset and I'm trying to build a DAgger classifier for it.
As you know, at training time I need to run the initially learned classifier on the training instances (predict them), one instance at a time.
LIBSVM is too slow even for the initial learning.
I'm using OLL, but that requires writing each instance to a file, running the test code on it, and reading back the prediction, which involves a lot of disk I/O.
I have considered working with vowpal_wabbit (though I'm not sure it would help with the disk I/O), but I don't have permission to install it on the cluster I'm working with.
LIBLINEAR is too slow and, I believe, also needs disk I/O.
What are the other alternatives I can use?
I recommend trying Vowpal Wabbit (VW). If Boost (and gcc or clang) is installed on the cluster, you can simply compile VW yourself (see the Tutorial). If Boost is not installed, you can compile Boost yourself as well.
VW contains more modern algorithms than OLL. Moreover, it contains several structured prediction algorithms (SEARN, DAgger) and also C++ and Python interfaces. See the iPython notebook tutorial.
As for the disk I/O: for one-pass learning, you can pipe the input data directly to vw (cat data | vw) or run vw --daemon. For multi-pass learning, you must use a cache file (the input data in a binary, fast-to-load format), which takes some time to create (during the first pass, unless it already exists), but the subsequent passes are much faster thanks to the binary format.
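To make the daemon option concrete for the "one instance at a time, no disk I/O" use case: start a daemon (e.g. vw --daemon --port 26542 -i your.model -t) and talk to it over a TCP socket, sending one example in VW text format per line and reading one prediction line back per example. Below is a minimal sketch in Java; the host, port, and feature strings are made up for illustration:

    // Minimal sketch: query a locally running VW daemon (started with, e.g.,
    //   vw --daemon --port 26542 -i your.model -t
    // ) one example at a time over TCP, with no temporary files.
    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.InputStreamReader;
    import java.io.OutputStreamWriter;
    import java.net.Socket;

    public class VwDaemonClient {
        public static void main(String[] args) throws Exception {
            try (Socket socket = new Socket("localhost", 26542); // assumed host/port
                 BufferedWriter out = new BufferedWriter(
                         new OutputStreamWriter(socket.getOutputStream()));
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(socket.getInputStream()))) {

                // Hypothetical instances in VW input format, just for illustration.
                String[] examples = {
                    "|f word_count:3 has_verb:1",
                    "|f word_count:7 has_verb:0"
                };
                for (String example : examples) {
                    out.write(example);
                    out.write('\n');
                    out.flush();                       // send one instance at a time
                    String prediction = in.readLine(); // daemon replies with one prediction per line
                    System.out.println(prediction);
                }
            }
        }
    }

The same line-based protocol works from any language with sockets, which fits a DAgger-style loop of querying the current policy on one instance at a time without touching the disk.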

Calling BLAS routines inside OpenCL kernels

Currently I am working on some image processing algorithms using OpenCL. Basically, my algorithm requires solving a linear system of equations for each pixel. Each system is independent of the others, so going for a parallel implementation is natural.
I have looked at several BLAS packages such as ViennaCL and AMD APPML, but it seems all of them have the same usage pattern (the host calls BLAS subroutines that are executed on the CL device).
What I need is a BLAS library that could be called inside an OpenCL kernel so that I can solve many linear systems in parallel.
I found this similar question on the AMD forums.
Calling APPML BLAS functions from the kernel
Thanks
It's not possible. clBLAS routines make a series of kernel launches; some 'solve' routines launch quite complicated sequences of kernels. clBLAS routines take cl_mem buffers and command queues as arguments, so if your buffer is already on the device, clBLAS will act directly on it. It does not accept host buffers or manage host-to-device transfers.
If you want to have a look at which kernels are generated and launched, uncomment this line https://github.com/clMathLibraries/clBLAS/blob/master/src/library/blas/generic/common.c#L461 and build clBLAS; it will dump all the kernels being called.

CUDA Performance comparison for Computer Vision applications

I am currently working on a performance comparison of various computer vision applications. The research is based on evaluating how these different algorithms perform with CUDA and with OpenMP.
Do you have any source codes in CUDA as well as the serial implementation in C for these kind of applications?
Where can I find them?
The CUDA SDK is full of examples, many of which come with both a GPU implementation and a CPU reference implementation, and the sources are included.
Here is a list of the samples you get by installing it.
You could start from there :)

Is OpenCV 2.0 optimized for AMD processors?

I know that in the past OpenCV was based on IPP and was optimized only for Intel CPUs. Is this still the case with OpenCV 2.0?
History says that OpenCV was originally developed by Intel.
If you check the OpenCV FAQ, it says:
OpenCV itself is open source and written in quite portable C/C++; it already runs on other processors and should be fairly easy to port (for example, there are already some CUDA optimizations for NVIDIA). On the other hand, OpenCV can sometimes run much faster on Intel processors (and sometimes AMD) because it can take advantage of SSE optimizations. OpenCV can also be compiled statically with Intel's IPP libraries, which can speed up some functions.
I have used it on other processors and different OSes and I've always been very happy, including for video processing applications.
