C++0x parallelization constructs vs. OpenMP - comparison

I am curious about the advantages of using OpenMP (and consequently linking against a third-party library, assuming you are a C++ programmer) when C++0x offers good parallel constructs of its own.
Could someone provide me with the pros and cons of using OpenMP instead of the C++0x built-in constructs?

I have to admit that I haven’t yet delved deeply into C++0x, but as far as I can see it “merely” offers some primitives for generic parallelization.
OpenMP, on the other hand, is a relatively high-level abstraction for parallelizing code with a single purpose: to improve performance by distributing work across multiple CPU cores (rather than, say, improving UI responsiveness or communicating over an asynchronous channel).
OpenMP makes this very easy because it offers a compact syntax and does a lot automatically, e.g. managing a thread pool and scheduling work across threads to distribute it evenly. In the best case, this means that parallelizing an existing algorithm is as easy as putting the following into your code (at the appropriate position):
#pragma omp parallel for
(Of course it’s usually a bit more complicated.)
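For instance, here is a minimal sketch of what that can look like in practice (the loop body is a made-up placeholder; compile with OpenMP enabled, e.g. -fopenmp for GCC):

#include <cmath>
#include <vector>

// Hypothetical example: apply an operation to every element of a vector.
// The pragma asks OpenMP to split the iterations across its thread pool;
// thread management and scheduling happen behind the scenes.
void sqrt_all(std::vector<double>& data) {
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(data.size()); ++i) {
        data[i] = std::sqrt(data[i]);
    }
}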
However, this comes at a cost that is twofold:
OpenMP is implemented by means of pragmas and integrates poorly with C++ syntax. For example, the following straightforward-looking code is illegal:
void f() {
    #pragma omp critical
    {
        return;
    }
}
That’s because you cannot prematurely leave OpenMP “blocks”. Quite the bummer.
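A sketch of the workaround this forces on you (hypothetical code): record the decision inside the block and perform the early exit after it:

void f() {
    bool should_return = false;
    #pragma omp critical
    {
        should_return = true;  // record the decision inside the block
    }
    if (should_return) {
        return;                // the early exit happens outside the block
    }
}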
OpenMP strives to be as platform-independent as possible. As a consequence, it lacks a few interesting primitives: for example, there’s no yield command in OpenMP, no fetch_and_add primitive, and no compare_and_swap or LL/SC.
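For comparison, C++0x does expose exactly these kinds of primitives via <atomic> and <thread>; a minimal sketch:

#include <atomic>
#include <thread>

std::atomic<int> counter(0);

void worker() {
    counter.fetch_add(1);  // fetch-and-add

    int expected = 1;      // compare-and-swap
    counter.compare_exchange_strong(expected, 2);

    std::this_thread::yield();  // yield the current thread
}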

For OpenMP with GCC, libgomp ships with GCC itself and is not third-party. My understanding is that this is similar for other compilers.

Related

Verilog or Vivado HLS or Vivado SDSoC

I want to convert my lane detection code, written in C++ (OpenCV), to an FPGA. Vivado HLS or Vivado SDSoC can help embed the C++ code into the FPGA, or I can rewrite the lane detection code in Verilog. The question is: what are the advantages and disadvantages of these three ways?
I want to use one of the cheap Zynq-7000 FPGAs.
Verilog is considered low-level these days. Compare it with assembly in the software domain: people use it only to get performance that they cannot attain with high-level languages such as C or Java.
In the hardware domain, C (for Vivado HLS) and OpenCL are considered high-level languages. OpenCL was developed with portability to other architectures, such as GPUs and CPUs, in mind. However, it has a lot more overhead than Vivado HLS in terms of communicating with the FPGA.
Vivado HLS by itself produces just hardware modules in VHDL or Verilog, which you still have to connect to FPGA pins, ARM processors, etc. It does not take care of the communication with your module; you will still have to integrate your module into a Vivado block design or a top-level VHDL or Verilog implementation yourself.
SDSoC (not "Vivado SDSoC", by the way) also lets you write your entire implementation (hardware and software) in C. Under the hood, it will invoke Vivado HLS to implement the hardware module. Afterwards, the tool will take care of implementing an interface between your hardware and the on-board ARM processors that will run the software.
In summary, I recommend SDSoC unless you have a good reason not to use it. I do want to warn you, however, that analyzing the synthesis results of Vivado HLS is a lot harder than analyzing Vivado output for Verilog or VHDL. Therefore, I always recommend making sure that your code works as a software implementation first. With minimal effort, you should be able to compile any such code in gcc or another compiler too. Don't use the synthesis results to debug your code, just to analyze the performance.
SDSoC is better and easier; HLS is like a black box, and even UG902 has a great many pages.
That is only my own opinion.
Take a look at Xilinx XAPP1167 and the Xilinx HLS Video Library Wiki.
That app note is a few years old (older than the SDSoC tools) but has a reference design for accelerating OpenCV applications on a Zynq using HLS.
I can't speak to SDSoC, but I would highly recommend starting with HLS over a rewrite in Verilog. It sounds like you have exactly the intended use case for HLS: implementing existing C++ applications in an FPGA. The downsides are (1) you'll likely need to modify your code a bit, since HLS doesn't support all C++ features, and (2) the performance may not be quite as good as a pure Verilog implementation.
Even if you have hardware design experience, manually translating C++ to Verilog will require some significant effort. I'd avoid that approach unless HLS or SDSoC doesn't give you the performance you need.
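To give a flavour of the kind of C++ that HLS accepts, here is a hypothetical per-pixel kernel (function and parameter names are made up), using a real HLS pipelining pragma:

const int WIDTH  = 640;
const int HEIGHT = 480;

// Hypothetical HLS-friendly kernel: statically sized frame, no dynamic
// allocation, no recursion -- the restricted C++ style HLS can synthesize.
void threshold_frame(const unsigned char in[HEIGHT][WIDTH],
                     unsigned char out[HEIGHT][WIDTH]) {
    for (int y = 0; y < HEIGHT; ++y) {
        for (int x = 0; x < WIDTH; ++x) {
            #pragma HLS PIPELINE II=1
            out[y][x] = (in[y][x] > 128) ? 255 : 0;  // placeholder operation
        }
    }
}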
Start using OpenCL with SDAccel or the Intel SDK. OpenCL has a verbose and well-defined API, which is a good thing. It is very easy to learn, and you can have parallel code execution similar to multi-module instances in Verilog/VHDL. Compared with HLS, OpenCL has the benefit of not requiring you to re-invent the whole system for managing data, I/O, pipes, etc. You get quite a bit of helper logic in the OpenCL BSP (Intel) or shell (Xilinx). Yeah, and start reading those long guides.
I would recommend SDAccel, as it is much more C++ "software" user friendly. At the same time, don't quote me on this, but I think they provide an OpenCV implementation out of the box, which means that you probably only need to massage your non-OpenCV code to achieve the performance you want.

Are there any tools that assist in porting F# to OCaml?

Unfortunately, due to .NET's lack of an incremental GC (either in the MS or Mono implementation), building soft real-time software such as games with F# is problematic. I've written a language in F# that, if -
a) it doesn't perform adequately in the face of the generational GC (arbitrary pauses during the interactive simulation), and
b) OCaml gets a good complete port to the LLVM backend -
I will port it from F# to OCaml. I have avoided as many .NET-specific libraries as I could, and since F#'s syntax is based on OCaml's, I'm assuming there should be some automated tools to assist in converting the code.
Anyone know of such things, either finished or in progress?
Thanks deeply!
To answer your question in an answer: as far as I know, there are no such tools, and I do not think it is likely somebody will create them.
Although F# is inspired by OCaml, it has evolved a lot and is different in a number of ways (see this SO discussion), so automatic conversion is not trivial. Even if somebody did it, the result would be more like compilation to hard-to-read OCaml than conversion to idiomatic code that you could later continue working on.
To add a few general comments: when you speak about "real-time", I imagine controlling some robot in a factory dealing with dangerous stuff, or an airplane control system. In these areas, concerns about GC are certainly valid. However, I do not think games are necessarily "real-time". You need good performance, that's for sure, but people have been writing games with .NET and F# quite happily. For some F# examples, see:
a nice blog with a couple of game samples (that you can actually try & buy)
a 3D airplane shooter game that also looks fairly realistic
and there is also a book that uses games to explain F#
These are probably simpler than what you're aiming for, but it may be good enough to show that writing games using GC is doable.
Unfortunately, due to .NET's lack of an incremental GC (either in the MS or Mono implementation), building soft real-time software such as games with F# is problematic.
A few points here:
Incremental GCs are not the only way to get low pause times. Concurrent GCs like VCGC do the work in bulk but do it concurrently with mutators running, e.g. the VCGC implementation I described in the non-free article here was running with sub-millisecond pause times.
Incremental GC does not necessarily mean low pause times. For example, OCaml's GC typically incurs 10ms pauses and can incur arbitrarily-long pauses when it encounters a deep thread stack or long array in the heap.
I have measured typical pause times of 10ms with OCaml and 30ms with F# on .NET 3. With a simple implementation I was able to build a fault tolerant server in F# from scratch that handled 20k msgs/s with 50% of latencies under 114us and 95% under 500us.
I've written a language in F# that, if -
a) it doesn't perform adequately in the face of the generational GC (arbitrary pauses during the interactive simulation), and
I wouldn't give up on the platform if your first working version has unacceptable latency. There are lots of things you can do to bring the max latency down.
b) OCaml gets a good complete port to the LLVM backend -
I seriously doubt OCaml will ever get what I'd consider to be a "good complete port to the LLVM backend". They'll just retarget LLVM with the current typeless IR and it won't do much better than the current ocamlopt compiler because LLVM isn't designed to optimize that kind of workload.
I will port it from F# to OCaml. I have avoided as many .NET-specific libraries as I could, and since F#'s syntax is based on OCaml's, I'm assuming there should be some automated tools to assist in converting the code.
No automated tools, but I've ported hundreds of thousands of lines of code between OCaml and F# now, and it is generally very easy because most code is written in the core ML subset of both languages.

Why is LuaJIT so good?

EDIT: unfortunately LuaJIT was taken out of the comparison in the link below.
This comparison of programming languages shows that LuaJIT has an over tenfold improvement over the normal Lua implementation.
Why is the change so big? Is there something specific about Lua that makes it benefit a lot from JIT compilation?
Python is dynamically typed and compiled to bytecode as well, so why doesn't PyPy (which has a JIT now, I believe) show such a large jump in performance?
Mike Pall has talked about this in a few places:
http://article.gmane.org/gmane.comp.lang.lua.general/58908
http://lambda-the-ultimate.org/node/3851
http://www.reddit.com/user/mikemike
As with every performant system, the answer in the end comes down to two things: algorithms and engineering. LuaJIT uses advanced compilation techniques, and it also has a very finely engineered implementation. For example, when the fancy compilation techniques can't handle a piece of code, LuaJIT falls back to a very fast interpreter written in x86 assembly.
LuaJIT gets double points on the engineering aspect, because not only is LuaJIT itself well-engineered, but the Lua language itself has a simpler and more coherent design than Python and JavaScript. This makes it (marginally) easier for an implementation to provide consistently good performance.

What's the quickest way to parallelize code?

I have an image processing routine that I believe could be made very parallel very quickly. Each pixel needs to have roughly 2k operations done on it in a way that doesn't depend on the operations done on neighbors, so splitting the work up into different units is fairly straightforward.
My question is, what's the best way to approach this change such that I get the quickest speedup bang-for-the-buck?
Ideally, the library/approach I'm looking for should meet these criteria:
Still be around in 5 years. Something like CUDA or ATI's variant may get replaced with a less hardware-specific solution in the not-too-distant future, so I'd like something a bit more robust to time. If my impression of CUDA is wrong, I welcome the correction.
Be fast to implement. I've already written this code and it works in serial mode, albeit very slowly. Ideally, I'd just take my code and recompile it to be parallel, but I think that might be a fantasy. If I just rewrite it using a different paradigm (i.e., as shaders or something), then that would be fine too.
Not require too much knowledge of the hardware. I'd like to be able to not have to specify the number of threads or operational units, but rather to have something automatically figure all of that out for me based on the machine being used.
Be runnable on cheap hardware. That may mean a $150 graphics card, or whatever.
Be runnable on Windows. Something like GCD might be the right call, but the customer base I'm targeting won't switch to Mac or Linux any time soon. Note that this does make the response to the question a bit different than to this other question.
What libraries/approaches/languages should I be looking at? I've looked at things like OpenMP, CUDA, GCD, and so forth, but I'm wondering if there are other things I'm missing.
I'm leaning right now toward something like shaders and OpenGL 2.0, but that may not be the right call, since I'm not sure how many memory accesses I can get that way -- those 2k operations require accessing all the neighboring pixels in a lot of ways.
The easiest way is probably to divide your picture into the number of parts that you can process in parallel (4, 8, 16, depending on cores). Then just run a different process for each part.
In terms of doing this specifically, take a look at OpenCL. It will hopefully be around for longer since it's not vendor specific and both NVidia and ATI want to support it.
In general, since you don't need to share too much data, the process is really pretty straightforward.
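A rough sketch of that row-splitting idea, here with C++11 threads in one process rather than separate processes (all names are illustrative):

#include <thread>
#include <vector>

// Stand-in for the real per-pixel work (the ~2k operations per pixel).
void process_rows(unsigned char* image, int width, int row_begin, int row_end) {
    for (int y = row_begin; y < row_end; ++y)
        for (int x = 0; x < width; ++x)
            image[y * width + x] = 255 - image[y * width + x];  // placeholder
}

void process_image(unsigned char* image, int width, int height) {
    unsigned parts = std::thread::hardware_concurrency();  // e.g. 4, 8, 16
    if (parts == 0) parts = 4;
    std::vector<std::thread> workers;
    int rows_per_part = height / static_cast<int>(parts);
    for (unsigned i = 0; i < parts; ++i) {
        int begin = static_cast<int>(i) * rows_per_part;
        int end = (i + 1 == parts) ? height : begin + rows_per_part;
        workers.emplace_back(process_rows, image, width, begin, end);
    }
    for (std::thread& t : workers) t.join();  // wait for every part to finish
}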
I would also recommend Threading Building Blocks. We use this with the Intel® Integrated Performance Primitives for the image analysis at the company I work for.
Threading Building Blocks (TBB) is similar to both OpenMP and Cilk, but it does the multithreading with its own task scheduler, wrapped in a simpler interface. With it you don't have to worry about how many threads to make; you just define tasks. It will split the tasks, if it can, to keep everything busy, and it does the load balancing for you.
Intel Integrated Performance Primitives (IPP) has optimized libraries for vision, most of which are multithreaded. For the functions we need that aren't in IPP, we thread them ourselves using TBB.
Using these, we obtain the best results when we use the IPP method for creating the images. What it does is pad each row so that any given cache line is entirely contained in one row. Then we make sure not to divvy up a row of the image across threads; that way we avoid false sharing from two threads trying to write to the same cache line.
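For a flavour of what defining a task looks like, a minimal TBB sketch (the per-row kernel is a made-up placeholder):

#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

void process_row(int /*y*/) { /* per-row image work goes here */ }

void process_image(int height) {
    // Describe the work as a task over a range of rows; TBB decides how
    // to split the range across threads and load-balances for you.
    tbb::parallel_for(tbb::blocked_range<int>(0, height),
        [](const tbb::blocked_range<int>& rows) {
            for (int y = rows.begin(); y != rows.end(); ++y)
                process_row(y);
        });
}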
Have you seen Intel's (Open Source) Threading Building Blocks?
I haven't used it, but take a look at Cilk. One of the bigwigs on their team is Charles E. Leiserson; he is the "L" in CLRS, the most widely used and respected algorithms book on the planet.
I think it caters well to your requirements.
From my brief reading, all you have to do is "tag" your existing code and then run it through their compiler, which will automatically/seamlessly parallelize the code. This is their big selling point, so you don't need to start from scratch with parallelism in mind, unlike other options (like OpenMP).
If you already have working serial code in one of C, C++ or Fortran, you should give serious consideration to OpenMP. One of its big advantages over a lot of other parallelisation libraries/languages/systems is that you can parallelise one loop at a time, which means you can get useful speed-up without having to re-write or, worse, re-design your program.
In terms of your requirements:
OpenMP is much used in high-performance computing; there's a lot of 'weight' behind it and an active development community -- www.openmp.org.
Fast enough to implement if you're lucky enough to have chosen C, C++ or Fortran.
OpenMP implements a shared-memory approach to parallel computing, so it's a big plus on the 'don't need to understand hardware' criterion. You can leave the program to figure out how many processors it has at run time, then distribute the computation across whatever is available -- another plus (a minimal sketch follows this list).
Runs on the hardware you already have, no need for expensive, or cheap, additional graphics cards.
Yep, there are implementations for Windows systems.
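As a quick sketch of that 'figure it out at run time' point, using standard OpenMP API calls:

#include <omp.h>
#include <cstdio>

int main() {
    // No thread count is hard-coded; the runtime picks one for the machine.
    #pragma omp parallel
    {
        #pragma omp single
        std::printf("running with %d threads\n", omp_get_num_threads());
    }
    return 0;
}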
Of course, if you were unwise enough to have not chosen C, C++ or Fortran in the beginning, a lot of this advice will only apply after you have re-written your program in one of those languages!
Regards
Mark

Is it possible that F# will be optimized more than other .Net languages in the future?

Is it possible that Microsoft will be able, either at VM execution time or (more likely) at compile time, to detect that a program was built with a functional language and automatically parallelize it better?
Right now I believe there is no such effort to automatically execute a program that was built as a single-threaded program as a multi-threaded program.
That is to say, the developer would code a single-threaded program, and the compiler would spit out a compiled program that is multi-threaded, complete with mutexes and synchronization where needed.
Would these optimizations be visible in task manager in the process thread count, or would it be lower level than that?
I think this is unlikely in the near future. And if it does happen, I think it would be more likely at the IL level (assembly rewriting) rather than language level (e.g. something specific to F#/compiler). It's an interesting question, and I expect that some fine minds have been looking at this and will continue to look at this for a while, but in the near-term, I think the focus will be on making it easier for humans to direct the threading/parallelization of programs, rather than just having it all happen as if by magic.
(Language features like F# async workflows, and libraries like the task-parallel library and others, are good examples of near-term progress here; they can do most of the heavy lifting for you, especially when your program is more declarative than imperative, but they still require the programmer to opt-in, do analysis for correctness/meaningfulness, and probably make slight alterations to the structure of the code to make it all work.)
Anyway, that's all speculation; who can say what the future will bring? I look forward to finding out (and hopefully making some of it happen). :)
Given that F# is derived from OCaml, and OCaml compilers can optimize your programs far better than other compilers, it probably could be done.
I don't believe it is possible to automatically parallelize code in a generally useful way, and the functional programming facet of F# is essentially irrelevant in this context.
The hardest problem is not detecting when you can perform subcomputations in parallel, it is determining when that will not degrade performance, i.e. when the subtasks will take sufficiently long to compute that it is worth taking the performance hit of a parallel spawn.
We have researched this in detail in the context of scientific computing and we have adopted a hybrid approach in our F# for Numerics library. Our parallel algorithms, built upon Microsoft's Task Parallel Library, require an additional parameter that is a function giving the estimated computational complexity of a subtask. This allows our implementation to avoid excessive subdivision and ensure optimal performance. Moreover, this solution is ideal for the F# programming language because the function parameter describing the complexity is typically an anonymous first-class function.
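The idea is not specific to F#; here is a hypothetical C++ sketch of cost-guided task spawning (not the F# for Numerics API, just the concept it describes):

#include <future>

// Hypothetical cost-guided parallel sum over [lo, hi): only spawn a
// parallel subtask when the caller-supplied complexity estimate says
// the work is big enough to amortize the cost of the spawn.
template <typename Work, typename Cost>
long parallel_sum(int lo, int hi, Work work, Cost cost, long threshold) {
    if (hi - lo < 2 || cost(lo, hi) < threshold) {
        long total = 0;  // too cheap to parallelize: run sequentially
        for (int i = lo; i < hi; ++i) total += work(i);
        return total;
    }
    int mid = lo + (hi - lo) / 2;
    auto right = std::async(std::launch::async, [&] {
        return parallel_sum(mid, hi, work, cost, threshold);
    });
    long left = parallel_sum(lo, mid, work, cost, threshold);
    return left + right.get();
}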
Cheers,
Jon Harrop.
I think the question misses the point of the .NET architecture -- F#, C# and VB (etc.) all get compiled to IL, which then gets compiled to machine code via the JIT compiler. The fact that a program was written in a functional language isn't relevant -- if there are optimizations (like tail recursion, etc.) available to the JIT compiler from the IL, the compiler should take advantage of them.
Naturally, this doesn't mean that writing functional code is irrelevant-- obviously, there are ways to write IL which will parallelize better-- but many of these techniques could be used in any .NET language.
So, there's no need to flag the IL as coming from F# in order to examine it for potential parallelism, nor would such a thing be desirable.
There's active research on auto-parallelization and auto-vectorization for a variety of languages, and one could hope (since I really like F#) that they would conceive a way to determine whether a "pure", side-effect-free subset was used and then parallelize that.
Also, since Simon Peyton Jones, the father of Haskell, is working at Microsoft, I have a hard time not believing there's some fantastic stuff coming.
It's possible but unlikely. Microsoft spends most of its time supporting and implementing features requested by its biggest clients. That usually means C#, VB.NET, and C++ (not necessarily in that order). F# doesn't seem to be high on the list of priorities.
Microsoft is currently developing two avenues for parallelisation of code: PLINQ (Parallel LINQ, which owes much to functional languages) and the Task Parallel Library (TPL), which was originally part of Robotics Studio. A beta of PLINQ is available here.
I would put my money on PLINQ becoming the norm for auto-parallelisation of .NET code.
