I want to convert my lane detection code, written in C++ (OpenCV), to an FPGA. Vivado HLS or SDSoC can help embed the C++ code into the FPGA, or I can rewrite the lane detection code in Verilog. The question is: what are the advantages and disadvantages of these three approaches?
I want to use one of the cheap Zynq-7000 FPGAs.
Verilog is considered low-level these days; compare it to assembly in the software domain. People use it only to get performance they cannot attain otherwise, just as assembly is used in software when high-level languages such as C or Java are not fast enough.
In the hardware domain, C (for Vivado HLS) and OpenCL are considered high-level languages. OpenCL was developed with portability to other architectures such as GPUs and CPUs in mind; however, it has a lot more overhead than Vivado HLS in terms of communicating with the FPGA.
Vivado HLS by itself produces just hardware modules in VHDL or Verilog, which you still have to connect to FPGA pins, ARM processors, etc. It does not take care of the communication with your module; you will still have to integrate the module into a Vivado block design or a top-level VHDL or Verilog implementation yourself.
SDSoC, not "Vivado SDSoC" by the way, also lets you to write your entire implementation (hardware and software) in C. Under the hood, it will invoke Vivado HLS to implement the hardware module. Afterwards, the tool will take care of implementing an interface between your hardware and the on-board ARM processors that will run the software.
In summary, I recommend SDSoC unless you have a good reason not to use it. I do want to warn, however, that analyzing the synthesis results of Vivado HLS is a lot harder than analyzing Vivado output for Verilog or VHDL. Therefore, I always recommend making sure that your code works as a software implementation first; with minimal effort, you should be able to compile the same code with gcc or another compiler. Don't use the synthesis results to debug your code, only to analyze its performance.
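To make that concrete, here is a minimal, hypothetical sketch (not taken from your lane detection code) of the style HLS expects: fixed-size buffers, no dynamic allocation, and an optional HLS pragma that gcc simply ignores, so the same file doubles as a plain software model with its own testbench.

#include <cstdint>

// Hypothetical 1-D gradient stage in the synthesizable C++ subset:
// fixed-size arrays, no dynamic allocation, no recursion. The HLS
// pragma is ignored by gcc, so the file also builds as ordinary C++.
void abs_gradient(const uint8_t in[480], uint8_t out[480]) {
    for (int i = 1; i < 479; ++i) {
#pragma HLS PIPELINE II=1
        int d = in[i + 1] - in[i - 1];
        out[i] = (uint8_t)(d < 0 ? -d : d);
    }
}

// Plain software testbench: compile and run with g++ first, and only
// then feed the function to Vivado HLS for synthesis.
int main() {
    uint8_t in[480], out[480] = {0};
    for (int i = 0; i < 480; ++i) in[i] = (uint8_t)i;
    abs_gradient(in, out);
    return (out[10] == 2) ? 0 : 1;
}

gcc will warn about the unknown pragma but still compile, which is exactly the point: debug in software, and use synthesis only for performance analysis.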
SDSoC is better and easier; HLS feels like a black box, and even UG902 (the HLS user guide) runs to a huge number of pages. This is only my own opinion.
Take a look at Xilinx XAPP1167 and the Xilinx HLS Video Library Wiki.
That appnote is a few years old (older than the SDSoC tools) but has a reference design for accelerating OpenCV applications in a Zynq using HLS.
I can't speak to SDSoC, but I would highly recommend starting with HLS over a rewrite in Verilog. It sounds like you have exactly the intended use case for HLS: implementing an existing C++ application on an FPGA. The downsides are (1) you'll likely need to modify your code a bit, since HLS doesn't support all C++ features, and (2) the performance may not be quite as good as that of a pure Verilog implementation.
Even if you have hardware design experience, manually translating C++ to Verilog will require some significant effort. I'd avoid that approach unless HLS or SDSoC doesn't give you the performance you need.
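As a hypothetical illustration of the kind of modification point (1) above refers to (not taken from your code): dynamic allocation is one of the C++ features HLS does not synthesize, so a heap-allocated working buffer typically has to become a statically sized one.

#include <cstdint>

// Software style, not synthesizable by HLS because of new/delete:
//   int* hist = new int[256];
//   ...fill...
//   delete[] hist;

// HLS-friendly version: statically sized buffers with known bounds.
void histogram256(const uint8_t pixels[640 * 480], int hist[256]) {
    for (int i = 0; i < 256; ++i) hist[i] = 0;
    for (int i = 0; i < 640 * 480; ++i) ++hist[pixels[i]];
}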
Start with OpenCL, using SDAccel or the Intel FPGA SDK. OpenCL has a verbose and well-defined API, which is a good thing. It is very easy to learn, and you can get parallel code execution similar to multiple module instances in Verilog/VHDL. Compared with HLS, OpenCL has the benefit of not requiring you to reinvent the whole system for managing data, I/O, pipes, etc.; you get quite a bit of helper logic in the OpenCL BSP (Intel) or shell (Xilinx). And yes, start reading those long guides.
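To give a feel for that verbosity, here is a generic, hypothetical host-side sketch (error handling omitted, not tied to a particular board). In a real SDAccel/Intel flow the device would be an accelerator and the program would be loaded from a precompiled FPGA binary via clCreateProgramWithBinary rather than built from source.

#define CL_TARGET_OPENCL_VERSION 120
#include <CL/cl.h>
#include <cstdio>
#include <vector>

int main() {
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, nullptr);
    // On an FPGA platform this would be CL_DEVICE_TYPE_ACCELERATOR.
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, nullptr);

    cl_int err;
    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);

    // Toy kernel built from source for illustration only.
    const char* src =
        "__kernel void scale(__global float* a) { a[get_global_id(0)] *= 2.0f; }";
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, nullptr, &err);
    clBuildProgram(prog, 1, &device, nullptr, nullptr, nullptr);
    cl_kernel kernel = clCreateKernel(prog, "scale", &err);

    std::vector<float> data(1024, 1.0f);
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                data.size() * sizeof(float), data.data(), &err);
    clSetKernelArg(kernel, 0, sizeof(buf), &buf);

    size_t global = data.size();
    clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &global, nullptr,
                           0, nullptr, nullptr);
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, data.size() * sizeof(float),
                        data.data(), 0, nullptr, nullptr);
    std::printf("%f\n", data[0]);  // expect 2.0
    return 0;
}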
I would recommend SDAccel, as it is much more C++ "software" user friendly. At the same time, don't quote me on this, but I think they provide an OpenCV implementation out of the box, which means you probably only need to massage your non-OpenCV code to achieve the performance you want.
I was creating distributed systems in OOP languages using message-passing libraries like MPI, ZeroMQ, RabbitMQ and so on. Then I found myself watching some Erlang promotional material and realized that a lot of what we emulate in OOP languages like C++ and C# using libraries (1,000,000 socket connections per process, distributed messaging, distributed process monitoring and visualization) has been in Erlang for many years. It seemed reasonable to get to know the language better, and I found myself asking one last question: are there any implementations/prototypes of an Erlang-like VM that could run/spawn processes not only on the CPU but also on the GPU?
Because that would definitely make Erlang (and dialects like Elixir, which are more readable to someone with my OOP background) the language of choice for most future projects.
A GPU is fast only with sequential (coalesced) memory access, and I can hardly imagine garbage collection in GPU RAM. A GPU is NOT just a cool, massively parallel CPU; it requires much more effort to write for. So most probably there is no Erlang compiler targeting the GPU.
I doubt there is any implementation that can run Erlang processes on a GPU, but you can use two techniques to run GPU computations from Erlang:
use a C library through NIFs (natively implemented functions) - see http://www.erlang.org/doc/man/erl_nif.html and an example of such an implementation: msantos/procket on GitHub (I'm sorry, I can't post the link due to low reputation :); a minimal sketch of this route follows below
use a native OS process and communicate with it through an Erlang "port" - see http://www.erlang.org/doc/reference_manual/ports.html
The first one is faster and the latter is safer (a NIF can crash the whole VM).
This is not specific to GPU computations. Erlang is not well suited for high-performance number crunching - it is better to do that in C and manipulate the results in Erlang anyway. The communication between the C code and Erlang should be implemented in one of the two ways described above.
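For illustration, a minimal sketch of the NIF route (hypothetical module name my_math and a trivial add/2 function; a real GPU binding would launch a CUDA/OpenCL kernel where the addition happens). On the Erlang side, the my_math module would call erlang:load_nif/2 and declare add/2 as a stub.

#include <erl_nif.h>

/* Trivial NIF: adds two integers. In a real GPU binding this is where
   you would launch a kernel and copy the result back. */
static ERL_NIF_TERM add(ErlNifEnv* env, int argc, const ERL_NIF_TERM argv[])
{
    int a, b;
    if (argc != 2 ||
        !enif_get_int(env, argv[0], &a) ||
        !enif_get_int(env, argv[1], &b))
        return enif_make_badarg(env);
    return enif_make_int(env, a + b);
}

static ErlNifFunc nif_funcs[] = {
    {"add", 2, add}
};

/* Ties the C functions to the Erlang module my_math. */
ERL_NIF_INIT(my_math, nif_funcs, NULL, NULL, NULL, NULL)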
Unfortunately, due to .NET's lack of an incremental GC (either in the MS or Mono implementation), building soft real-time software such as games with F# is problematic. I've written a language in F# that, if -
a) it doesn't perform adequately in the face of the generational GC (arbitrary pauses during the interactive simulation), and
b) OCaml gets a good complete port to the LLVM backend -
I will port it from F# to OCaml. I have avoided .NET-specific libraries as much as I could, and since F#'s syntax is based on OCaml's, I'm assuming there should be some automated tools to assist in converting the code.
Anyone know of such things, either finished or in progress?
Thanks deeply!
To answer your question in an answer - as far as I know, there are no such tools, and I do not think it is likely that somebody will create them.
Although F# is inspired by OCaml, it has evolved a lot and differs in a number of ways (see this SO discussion), so automatic conversion is not trivial. Even if somebody did it, the result would be more like compilation to hard-to-read OCaml than conversion to idiomatic code that you could continue working on.
To add a few general comments: when you speak about "real-time", I imagine controlling some robot in a factory dealing with dangerous stuff, or airplane control. In these areas, concerns about GC are certainly valid. However, I do not think games are necessarily "real-time". You need good performance, that's for sure, but people have been writing games with .NET and F# quite happily. For some F# examples, see:
... a nice blog with a couple of game samples (that you can actually try & buy)
a 3D airplane shooter game that also looks fairly realistic
and there is also a book that uses games to explain F#
These are probably simpler than what you're aiming for, but it may be good enough to show that writing games using GC is doable.
Unfortunately, due to .NET's lack of an incremental GC (either in the MS or Mono implementation), building soft real-time software such as games with F# is problematic.
A few points here:
Incremental GCs are not the only way to get low pause times. Concurrent GCs like VCGC do the work in bulk but do it concurrently with mutators running, e.g. the VCGC implementation I described in the non-free article here was running with sub-millisecond pause times.
Incremental GC does not necessarily mean low pause times. For example, OCaml's GC typically incurs 10ms pauses and can incur arbitrarily-long pauses when it encounters a deep thread stack or long array in the heap.
I have measured typical pause times of 10ms with OCaml and 30ms with F# on .NET 3. With a simple implementation I was able to build a fault-tolerant server in F# from scratch that handled 20k msgs/s, with 50% of latencies under 114µs and 95% under 500µs.
I've written a language in F# that, if -
a) it doesn't perform adequately in the face of the generational GC (arbitrary pauses during the interactive simulation), and
I wouldn't give up on the platform if your first working version has unacceptable latency. There are lots of things you can do to bring the maximum latency down.
b) OCaml gets a good complete port to the LLVM backend -
I seriously doubt OCaml will ever get what I'd consider to be a "good complete port to the LLVM backend". They'll just retarget LLVM with the current typeless IR and it won't do much better than the current ocamlopt compiler because LLVM isn't designed to optimize that kind of workload.
I will port it from F# to OCaml. I have avoided .NET-specific libraries as much as I could, and since F#'s syntax is based on OCaml's, I'm assuming there should be some automated tools to assist in converting the code.
No automated tools but I've ported hundreds of thousands of lines of code between OCaml and F# now and it is generally very easy because most code is written in the core ML subset of both languages.
I usually use F# for writing numerical algorithms. Functional programming constructs in F# help to express algorithms in a very natural way. I often end up with a succinct and understandable implementation, and I may be able to parallelize it quite quickly if there is an opportunity for parallelism.
I wonder whether there is a way to compile F# programs down to an FPGA. That way, I could still use F# to avoid the boilerplate code of FPGA programming while making use of the high-performance computing an FPGA offers. Is it possible to do so? If yes, could you provide some hints on where to start?
I've read about (but never used) Avalda's F# to FPGA conversion, but their site is currently returning a completely blank page. I don't know if that's just temporary or if it means they've gone belly-up.
F# should be ideal for this task because it is derived from the ML family of languages that were bred for metaprogramming. However, I am not aware of any work in this area (although I have had the idea of working on it myself).
I would focus on writing a compiler in F# that compiled a DSL to an FPGA, rather than trying to compile general F# code.
Here's a list of HLS tools that use C. My experience with one of them in 2006 was not favourable, but I expect them to be much better today.
Regarding F#, I doubt this will exist any time soon.
I am curious about the advantages of using OpenMP (and consequently linking against a third-party library, assuming you are a C++ programmer) when C++0x offers good built-in parallel constructs.
Could someone provide the pros and cons of using OpenMP instead of the C++0x built-in constructs?
I have to admit that I haven’t yet delved deeply into C++0x, but as far as I can see it “merely” offers some primitives for generic parallelization.
OpenMP on the other hand is a relatively high-level abstraction to parallelize code with a single purpose: to improve performance by distributing work across multiple CPU cores (rather than, say, improve UI responsiveness, or communicate with an asynchronous channel).
OpenMP makes this very easy because it offers a compact syntax and does a lot automatically, e.g. the managing of a thread pool and the scheduling of threads to distribute the work evenly. In the best case, this means that parallelizing an existing algorithm is as easy as putting the following into your code (at the appropriate position):
#pragma omp parallel for
(Of course it’s usually a bit more complicated.)
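For instance, a small self-contained example (a generic sketch, not from the question) that sums a vector in parallel:

#include <cstdio>
#include <vector>

int main() {
    std::vector<double> v(1000000, 0.5);
    const long n = (long)v.size();
    double sum = 0.0;

    // OpenMP manages the thread pool and splits the iterations; the
    // reduction clause gives each thread a private partial sum that is
    // combined when the loop finishes.
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; ++i)
        sum += v[i];

    std::printf("%f\n", sum);  // 500000.000000
    return 0;
}

Compile with g++ -fopenmp; without the flag the pragma is ignored and the program simply runs sequentially.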
However, this comes at a cost that is twofold:
OpenMP is implemented by means of pragmas and integrates poorly with C++ syntax. For example, the following straightforward-looking code is illegal:
void f() {
    #pragma omp critical
    {
        return;
    }
}
That’s because you cannot prematurely leave OpenMP “blocks”. Quite the bummer.
OpenMP strives to be as platform-independent as possible. As a consequence, it lacks a few interesting primitives. For example, there’s no yield command in OpenMP, and no fetch_and_add primitive, nor a compare_and_swap or LL/SC.
For OpenMP with gcc, libgomp ships with gcc itself and is not a third-party library. My understanding is that the situation is similar for other compilers.
Section 3.5 of Structure and Interpretation of Computer Programs describes streams. Does Common Lisp have such streams built in or is there a good Common Lisp library implementing such streams?
[I mean streams in all the generality presented in section 3.5 of SICP; not just your usual i/o streams.]
SERIES is a featureful library providing that sort of functionality. For a shorter and more readable example of how the concept of streams maps to Common Lisp, see Pipes.