ArrayFire AF_BACKEND_CPU not multi-threaded?

I usually use the OpenCL backend when I use ArrayFire. I was using Intel OpenCL on my i7 CPU. When I switched to the AF_BACKEND_CPU backend, my code was about 10-15x slower. I checked and noticed that it was only running on one core. I also suspect that it is not using SSE or AVX instructions, which would account for the rest of the slowdown, since my processor only has 4 cores. I feel like the ArrayFire CPU backend should be faster. Is there a way to make it multithreaded?

The CPU backend is not yet multithreaded, but I suspect that will change with version 3.4.0 (see the "Sparse Support, Thread Safety, Parallel CPU" milestone at https://github.com/arrayfire/arrayfire/milestones).

I was wondering the same thing. It turns out in the meantime the milestone has been moved to 3.5.0 (https://github.com/arrayfire/arrayfire/issues/451).
As far as I can tell, ArrayFire currently uses only one core, so of your 4 cores, 3 sit idle.
In general I'd suggest using AF with the GPU backend and creating af::arrays only when needed, since there is otherwise no way to keep data exclusively on the GPU or exclusively on the CPU (see "How to explicitly get linear indices from arrayfire?" and http://forums.accelereyes.com/forums/viewtopic.php?f=17&t=43097&p=61730&hilit=copy+host+memory+into+an+array#p61727 on how to construct af::arrays ad hoc; a small sketch follows at the end of this answer).
Also, as a general rule of thumb, for many tasks the GPU implementation is still much faster than the CPU implementation, even if the task is not perfectly suited to the GPU. Sorting is a good example: it normally involves a lot of branching, yet the GPU still wins.
If you insist on using the CPU in parallel, you could also try putting OpenMP, MPI, or plain std::thread on top of AF and parallelizing that way. I didn't gain much with std::thread for sorting operations, though.
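For reference, here is a minimal sketch of constructing an af::array ad hoc from host memory and copying the result back. The backend choice, sizes, and the sort call are purely illustrative, and it assumes the ArrayFire 3.x C++ API with the unified backend:

```cpp
#include <arrayfire.h>
#include <vector>

int main() {
    af::setBackend(AF_BACKEND_OPENCL);      // or AF_BACKEND_CPU / AF_BACKEND_CUDA (unified backend)

    std::vector<float> host(1024, 1.0f);    // data lives on the host until it is needed

    // Construct an af::array ad hoc from host memory (copies to the device).
    af::array A(host.size(), host.data());

    af::array B = af::sort(A);              // do the heavy lifting on the device

    B.host(host.data());                    // copy the result back into host memory
    return 0;
}
```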

Related

Are there any implementations/prototypes of an Erlang-like VM that could run not only on CPU but also on GPU?

I have been creating distributed systems in OOP languages using message-passing libraries like MPI, ZeroMQ, RabbitMQ and so on. Recently I found myself watching some Erlang promotional material and realized that lots of things we emulate in OOP languages like C++ and C# using libraries (1,000,000 socket connections per process, distributed messaging and distributed process monitoring visualization) have been in Erlang for many years now. It seemed reasonable to get to know the language better, and I found myself asking one last question: are there any implementations/prototypes of an Erlang-like VM that could run/spawn some processes not only on the CPU but also on the GPU?
Because that would definitely make Erlang (and its dialects like Elixir, which read more naturally to my OOP background) the language of choice for most future projects.
A GPU is fast only with sequential memory access. I can hardly imagine garbage collection in GPU RAM. A GPU is NOT a cool, parallel CPU; it takes more effort to write for. So most probably there is no Erlang compiler for the GPU.
I doubt there's any implementation that can run Erlang processes on a GPU, but you can use two techniques to run computations on the GPU from Erlang:
use a C library through NIFs (natively implemented functions) - see http://www.erlang.org/doc/man/erl_nif.html and, for an example of such an implementation, msantos/procket on GitHub (sorry, I can't post the link due to low reputation :)
use a native OS process and communicate with it through an Erlang "port" - see http://www.erlang.org/doc/reference_manual/ports.html
The first one is faster, the latter is safer (NIFs can crash the whole VM).
This is not specific to GPU computations. Erlang is not well suited for high-performance number crunching - it's better to do that in C and manipulate the results in Erlang. The communication between the C code and Erlang should be implemented in one of the two ways described above.
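To make the port approach concrete, here is a minimal sketch of the external side in C++, assuming the port is opened with {packet, 2} framing (2-byte big-endian length prefix, which limits each payload to 64 KB); the actual GPU call is left as a placeholder:

```cpp
// Minimal external program for an Erlang port opened with {packet, 2}.
// Reads length-prefixed requests from stdin, writes length-prefixed replies to stdout.
#include <cstdint>
#include <cstdio>
#include <vector>

static bool read_packet(std::vector<uint8_t>& buf) {
    uint8_t hdr[2];
    if (std::fread(hdr, 1, 2, stdin) != 2) return false;   // EOF: Erlang closed the port
    size_t len = (hdr[0] << 8) | hdr[1];                    // big-endian length prefix
    buf.resize(len);
    return std::fread(buf.data(), 1, len, stdin) == len;
}

static void write_packet(const std::vector<uint8_t>& buf) {
    uint8_t hdr[2] = { uint8_t(buf.size() >> 8), uint8_t(buf.size() & 0xff) };
    std::fwrite(hdr, 1, 2, stdout);
    std::fwrite(buf.data(), 1, buf.size(), stdout);
    std::fflush(stdout);
}

int main() {
    std::vector<uint8_t> msg;
    while (read_packet(msg)) {
        // ... hand msg to the GPU kernel here (placeholder) ...
        write_packet(msg);                                  // echo the payload back for now
    }
    return 0;
}
```

On the Erlang side the port would be opened with something like open_port({spawn, "./gpu_worker"}, [binary, {packet, 2}]).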

Why do the CLR and JVM use a stack-based architecture?

I am always curious as to why the JVM and CLR have a stack-based architecture?
Why don't they use a register-based approach?
What benefits does it have over the register-based approach?
I used to ponder the differences between register and stack machines and compare instruction sequences, and run benchmarks...
Then I spent a couple of years implementing both types of machine while working on the Parrot VM, which was a register machine. We started, naively, with a fixed register set in combination with data and register stacks, but eventually concluded that this was an artificial limitation, so we changed to an infinite register set and an allocator. At some point the Parrot fast core (GCC computed goto) outperformed the Mono and JVM interpreter cores (non-JIT), but the difference came down to the JIT: Parrot's JIT never matched the quality of the others. It is the quality of the JITter that makes the eventual machine, and that is generally what people care about. If all VMs played by the same rules (i.e. they were constrained to run in interpreted mode with no JIT), then my evidence shows that a register machine has the performance edge over an equivalent stack machine: larger instructions, but fewer of them, means higher throughput and better cache locality of reference. The Dalvik VM actually supports my findings; Dalvik had no JIT for a couple of years and competed on its interpreter core alone.
Very few mainstream VMs run exclusively in interpretation mode (AFAIK); they JIT compile, and that's what we benchmark. The point of the interpreter core is to establish a presence on the platform, do bytecode verification, and provide a failsafe execution core in the absence of the JIT. This isn't a rule, of course; there are billions of devices running ARM-accelerated JVMs without a JIT, but in the absence of memory or CPU constraints, this applies.
I worked and worked at tweaking the core, testing and tuning, only to find that in the end we really wanted a fast JIT. I arrived at the conclusion that if you are going to eventually JIT, it doesn't matter much whether you implement a stack or register machine to start, do what you like; but you will get "to market" faster with a stack machine. Doing a lot of pseudo-register-machine virtual optimizations for bytecode interpretation by a virtual machine core is partially a wasted effort, because it isn't real native optimization. The soft-core doesn't do branch prediction, register renaming, instruction reordering, parallel execution or prefetch like a real processor. My feeling is that once we have a high quality JIT to native binary, we arrive at the same destination.
For those reasons, I technically favor a stack-based machine for:
Simplicity - much less code to maintain = fewer bugs
Time to implement
But visually and emotionally I favor a register machine for:
Visual-conceptual models more closely match the machine, and my brain
Flexibility - compilers can evaluate their expression trees in different orders using SSA.
Note I didn't say compilers could more "easily" generate code; that seems to be what people who have worked mostly with stack machines like to argue. I don't believe that and didn't find it to be true. I saw many hobby compilers written in a short time on both Parrot and the CLR, though I would admit the ones on the CLR are of higher quality; but that is mainly a matter of ecosystem and the quality of available tools. I wrote compilers on both platforms myself and found there are trade-offs, but not enough to lose sleep over.
This is an educated guess, because my real-world experience does not include writing a full JITter so I don't have first-hand experience comparing the pros and cons of JITting various forms of opcodes, but my opinion is, if you plan to include a JIT, then creating an extremely sophisticated virtual machine opcode core amounts to premature optimization. Your time is better spent elsewhere.
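To make the instruction-count trade-off concrete, here is a toy sketch (not the actual Parrot, JVM or CLR encodings) of the same expression a = b + c*d as stack-style versus register-style code, together with a minimal switch-dispatch interpreter for the stack form:

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// Toy stack-machine opcodes (illustrative only).
enum Op : uint8_t { PUSH, ADD, MUL, HALT };

// a = b + c*d as stack code: PUSH b, PUSH c, PUSH d, MUL, ADD, HALT  -> six small instructions.
// A register form would be two larger instructions: MUL r1, c, d ; ADD a, b, r1
int64_t run(const std::vector<int64_t>& code) {
    std::vector<int64_t> stack;
    for (size_t pc = 0; pc < code.size(); ++pc) {
        switch (static_cast<Op>(code[pc])) {
        case PUSH: stack.push_back(code[++pc]); break;                                  // operand follows opcode
        case ADD:  { int64_t r = stack.back(); stack.pop_back(); stack.back() += r; break; }
        case MUL:  { int64_t r = stack.back(); stack.pop_back(); stack.back() *= r; break; }
        case HALT: return stack.back();
        }
    }
    return stack.back();
}

int main() {
    // b = 2, c = 3, d = 4  ->  a = 2 + 3*4 = 14
    std::vector<int64_t> code = { PUSH, 2, PUSH, 3, PUSH, 4, MUL, ADD, HALT };
    std::cout << run(code) << "\n";   // prints 14
}
```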
It is usually not appropriate to just link out to an article but this time I'll make an exception: This article by Eric Lippert answers just this question.

How pthreads does cross-threading and scheduling

I was wondering, how does pthreads-win32 (the Windows implementation of pthreads) implement cross-threading? Is it written exclusively with the Windows API? I checked some of the sources and it seems that most of it is indeed written with the Windows API, though I was wondering whether it also uses the Windows scheduler to switch between threads (and cores) or implements its own. Specifically, many processors these days implement their own scheduler (I've read about the Itanium architecture, for example: the hardwired logic supports two threads per core and even switches between them automatically, so evidently OS support for multiple cores is not necessarily needed). So if I have an obsolete OS like a 32-bit Windows or something that doesn't support multi-core processors, would a program written with pthreads-win32 still run on more than one processor core, or would only one core be used?
How about pthreads implementations (untainted POSIX threads)? Do they support multi-core processors even if the OS they are running on doesn't?
I am guessing the answer is no for both the Windows and POSIX versions: only one core is used if the OS doesn't support multiple cores. Though this is just an educated guess and I would like to confirm it, so please leave a comment.
As a side request, can you please recommend a library that DOES support multi-core thread execution even if the OS the program is running on DOESN'T, if any exist?
Also, is there a way to ensure that two threads written with pthreads are executed on different cores, or does the OS (or the processor, or the pthreads library) do the assignment automatically? Does pthreads guarantee execution on different cores if they are available?
Cheers, Val
EDIT:
I know most of these questions are implementation specific, so I was referring to this implementation of pthreads for Windows: http://sourceware.org/pthreads-win32/. I didn't specifically mention it before because, as far as I know, it is the most popular and widely used implementation of pthreads for Windows.
So from what I'm gathering, the most important thing to note in all of this is that threading has very little to do with parallelism (like UMA with multi-core processors). So while threading might be a technique to implement concurrency, it is not a way of ensuring ACTUAL parallel execution, which is what I was looking for in the first place, since I am studying parallel and distributed systems and algorithms.
So, to answer one question at a time: yes, pthreads, and probably most (if not all) other threading APIs out there, are based on the underlying OS API, which of course gives them the same limits that the OS has. Meaning, yes, if the OS (concretely in this case, some Windows running, for example, pthreads-win32) doesn't support multiple cores, only one core is in use at all times. As is pointed out on the wiki page nob provided, to cite: "Hyper-threading requires not only that the operating system support multiple processors, but also that it be specifically optimised for HTT, and Intel recommends disabling HTT when using operating systems that have not been so optimized." (http://en.wikipedia.org/wiki/Hyper-threading) Meaning that, in most cases, the processor's hardwired (basic) scheduler alone is not enough to take advantage of multiple cores; it has to be supported/used by software (OS support).
While this might not be definitive proof, I believe enough evidence points in the same direction to confirm this is the case.
I did not sift through the pthreads (for POSIX-compliant OSes) sources, but I am guessing the same goes for that API, since it is more than likely to use the underlying OS API. You will have to confirm this on your own. :)
As for any potential libraries out there that might support execution on multiple cores even if the OS they're running on doesn't support multiple cores: you will have to find them on your own (if they exist); please leave a comment.
To ensure parallelism (execution on different cores) manually, Linux does provide a way to pin a thread to a specific virtual processor (under certain conditions). To pin an entire process to a specific (virtual) processor/core, sched_setaffinity() (from sched.h) can be used. As nos pointed out, pthreads provides pthread_setaffinity_np() to pin a particular thread to a specific core. Windows supports similar functionality with SetThreadAffinityMask(). So clearly, assigning threads manually to run in parallel on different cores is possible (if the OS supports multiple cores); a small sketch follows below.
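As a minimal sketch of the Linux route (pthread_setaffinity_np is a non-portable GNU extension; on Windows you would use SetThreadAffinityMask instead):

```cpp
// Pin the calling thread to CPU core 0 using the GNU extension
// pthread_setaffinity_np (Linux; compile with -pthread).
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <cstdio>

int main() {
    cpu_set_t set;
    CPU_ZERO(&set);                 // start with an empty CPU set
    CPU_SET(0, &set);               // allow only core 0

    int rc = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    if (rc != 0) {
        std::fprintf(stderr, "pthread_setaffinity_np failed: %d\n", rc);
        return 1;
    }
    std::printf("thread pinned to core 0\n");
    return 0;
}
```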
From my experience coding with pthreads, if you write code that uses multiple threads (more than 2), they SHOULD be executed on more than one physical core, if available (which is probably an OS feature relied on by pthreads).
My questions were quite general to begin with; since most of these things are implementation specific, it's hard to give one answer. I hope this answer is detailed enough to help clarify a few things.
Cheers, Val
Generally, every modern OS supports threads by itself and schedules them onto the different (virtual) cores of a system. The OS provides some general synchronization techniques (like mutexes, semaphores or barriers) which are used by pthreads to implement the pthreads API.
With two threads per core (I think you mean Hyper-Threading) on some Intel processors (like Itanium), the OS sees two "virtual" cores. The processor indeed schedules the two threads onto one physical core. (See Wikipedia.)
However, there are examples where runtime platforms implement their own thread concepts and do the scheduling themselves: I'm thinking of (at least older) implementations of Java having their own scheduling routines.

What's the quickest way to parallelize code?

I have an image processing routine that I believe could be made very parallel very quickly. Each pixel needs to have roughly 2k operations done on it in a way that doesn't depend on the operations done on neighbors, so splitting the work up into different units is fairly straightforward.
My question is, what's the best way to approach this change such that I get the quickest speedup bang-for-the-buck?
Ideally, the library/approach I'm looking for should meet these criteria:
Still be around in 5 years. Something like CUDA or ATI's variant may get replaced with a less hardware-specific solution in the not-too-distant future, so I'd like something a bit more robust to time. If my impression of CUDA is wrong, I welcome the correction.
Be fast to implement. I've already written this code and it works in a serial mode, albeit very slowly. Ideally, I'd just take my code and recompile it to be parallel, but I think that that might be a fantasy. If I just rewrite it using a different paradigm (ie, as shaders or something), then that would be fine too.
Not require too much knowledge of the hardware. I'd like to be able to not have to specify the number of threads or operational units, but rather to have something automatically figure all of that out for me based on the machine being used.
Be runnable on cheap hardware. That may mean a $150 graphics card, or whatever.
Be runnable on Windows. Something like GCD might be the right call, but the customer base I'm targeting won't switch to Mac or Linux any time soon. Note that this does make the response to the question a bit different than to this other question.
What libraries/approaches/languages should I be looking at? I've looked at things like OpenMP, CUDA, GCD, and so forth, but I'm wondering if there are other things I'm missing.
I'm leaning right now towards something like shaders and OpenGL 2.0, but that may not be the right call, since I'm not sure how many memory accesses I can get that way - those 2k operations require accessing all the neighboring pixels in a lot of ways.
The easiest way is probably to divide your picture into the number of parts that you can process in parallel (4, 8, 16, depending on cores) and then just run a different process for each part (a thread-based sketch follows below).
In terms of doing this specifically, take a look at OpenCL. It will hopefully be around for longer since it's not vendor specific and both NVidia and ATI want to support it.
In general, since you don't need to share too much data, the process is really pretty straightforward.
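The answer suggests separate processes; since the pixels are independent, the same decomposition also works with plain threads in shared memory. A minimal sketch with std::thread, where process_pixel() is a hypothetical stand-in for the ~2k per-pixel operations:

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Placeholder for the ~2k operations done on each pixel (hypothetical).
void process_pixel(std::vector<float>& img, std::size_t idx) {
    img[idx] *= 2.0f;
}

// Process the rows [row_begin, row_end) of a width-wide image.
void process_strip(std::vector<float>& img, std::size_t width,
                   std::size_t row_begin, std::size_t row_end) {
    for (std::size_t y = row_begin; y < row_end; ++y)
        for (std::size_t x = 0; x < width; ++x)
            process_pixel(img, y * width + x);
}

int main() {
    const std::size_t width = 1920, height = 1080;
    std::vector<float> image(width * height, 1.0f);

    // One worker per hardware thread; each gets a horizontal strip of the image.
    const std::size_t n = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < n; ++i)
        workers.emplace_back(process_strip, std::ref(image),
                             width, height * i / n, height * (i + 1) / n);
    for (auto& t : workers) t.join();
}
```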
I would also recommend Threading Building Blocks. We use it together with the Intel® Integrated Performance Primitives for the image analysis at the company I work for.
Threading Building Blocks (TBB) is similar in spirit to OpenMP and Cilk, but it does the multithreading with its own task scheduler, wrapped in a simpler interface. With it you don't have to worry about how many threads to create; you just define tasks. It will split the tasks, if it can, to keep everything busy, and it does the load balancing for you (a small parallel_for sketch follows below).
Intel Integrated Performance Primitives (IPP) has optimized libraries for vision, most of which are multithreaded. For the functions we need that aren't in IPP, we thread them using TBB.
Using these, we get the best results when we use the IPP method for creating the images: it pads each row so that any given cache line is entirely contained in one row. Then we never split a row of the image across threads, so we avoid false sharing from two threads trying to write to the same cache line.
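Not the poster's actual code, but a minimal sketch of what a row-wise split looks like with TBB's parallel_for (header names assume a reasonably recent TBB/oneTBB; the per-pixel work is a placeholder):

```cpp
// TBB picks the thread count and the chunking of the row range for us.
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <cstddef>
#include <vector>

int main() {
    const std::size_t width = 1920, height = 1080;
    std::vector<float> image(width * height, 1.0f);

    tbb::parallel_for(tbb::blocked_range<std::size_t>(0, height),
        [&](const tbb::blocked_range<std::size_t>& rows) {
            for (std::size_t y = rows.begin(); y != rows.end(); ++y)
                for (std::size_t x = 0; x < width; ++x)
                    image[y * width + x] *= 2.0f;   // placeholder per-pixel work
        });
}
```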
Have you seen Intel's (Open Source) Threading Building Blocks?
I haven't used it, but take a look at Cilk. One of the bigwigs on their team is Charles E. Leiserson; he is the "L" in CLRS, the most widely used and respected algorithms book on the planet.
I think it caters well to your requirements.
From my brief reading, all you have to do is "tag" your existing code and then run it through their compiler, which will automatically/seamlessly parallelize the code. This is their big selling point: you don't need to start from scratch with parallelism in mind, unlike other options (like OpenMP).
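Roughly, the "tag your existing code" idea looks like this. Note this is a hedged sketch using the cilk_for keyword from the later Cilk Plus / OpenCilk dialects rather than the original MIT Cilk syntax, and the loop body is a placeholder:

```cpp
// The serial loop becomes parallel by changing one keyword (compile with a
// Cilk-enabled compiler, e.g. OpenCilk's clang with -fopencilk).
#include <cilk/cilk.h>
#include <cstddef>
#include <vector>

int main() {
    const std::size_t width = 1920, height = 1080;
    std::vector<float> image(width * height, 1.0f);

    cilk_for (std::size_t y = 0; y < height; ++y)   // was: for (...)
        for (std::size_t x = 0; x < width; ++x)
            image[y * width + x] *= 2.0f;           // placeholder per-pixel work
}
```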
If you already have working serial code in one of C, C++ or Fortran, you should give serious consideration to OpenMP. One of its big advantages over a lot of other parallelisation libraries / languages / systems is that you can parallelise one loop at a time, which means that you can get useful speed-up without having to re-write or, worse, re-design your program (a minimal example follows below).
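For example, parallelising a single pixel loop is one pragma. A minimal sketch; the loop body is a placeholder for the real per-pixel work:

```cpp
// Compile with -fopenmp on GCC/Clang or /openmp on MSVC.
#include <cstddef>
#include <vector>

int main() {
    const std::size_t width = 1920, height = 1080;
    std::vector<float> image(width * height, 1.0f);

    #pragma omp parallel for                        // the only change to the serial code
    for (long long y = 0; y < (long long)height; ++y)
        for (std::size_t x = 0; x < width; ++x)
            image[y * width + x] *= 2.0f;           // placeholder per-pixel work
}
```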
In terms of your requirements:
OpenMP is much used in high-performance computing; there's a lot of 'weight' behind it and an active development community - www.openmp.org.
Fast enough to implement if you're lucky enough to have chosen C, C++ or Fortran.
OpenMP implements a shared-memory approach to parallel computing, so that's a big plus on the 'don't need to understand the hardware' front. You can leave the program to figure out how many processors it has at run time, then distribute the computation across whatever is available - another plus.
Runs on the hardware you already have; no need for an expensive, or even cheap, additional graphics card.
Yep, there are implementations for Windows systems.
Of course, if you were unwise enough not to have chosen C, C++ or Fortran in the beginning, a lot of this advice will only apply after you have re-written your code in one of those languages!
Regards
Mark

Datamining models in FORTRAN or C (or managed code)?

We are planning to develop a datamining package for windows. The program core / calculation engine will be developed in F# with GUI stuff / DB bindings etc done in C# and F#.
However, we have not yet decided on the model implementations. Since we need high performance, we probably can't use managed code here (any objections here?). The question is, is it reasonable to develop the models in FORTRAN or should we stick to C (or maybe C++). We are looking into using OpenCL at some point for suitable models - it feels funny having to go from managed code -> FORTRAN -> C -> OpenCL invocation for these situations.
Any recommendations?
F# compiles to the CLR, which has a just-in-time compiler. It's a dialect of ML, which is strongly typed, allowing all of the nice optimisations that go with that type of architecture; this means you will probably get reasonable performance from F#. For comparison, you could also try porting your code to OCaml (IIRC this compiles to native code) and see if that makes a material difference.
If it really is too slow then see how far scaling the hardware will get you. With the performance available through a modern PC or server it seems unlikely that you would need to go to anything exotic unless you are working with truly Brobdingnagian data sets. Users with smaller data sets may well be OK on an ordinary PC.
Workstations give you perhaps an order of magnitude more capacity than a standard desktop PC. A high-end workstation like an HP Z800 or XW9400 (similar kit is available from several other manufacturers) can take two 4- or 6-core CPU chips, tens of gigabytes of RAM (up to 192GB in some cases) and has various options for high-speed I/O like SAS disks, external disk arrays or SSDs. This type of hardware is expensive but may be cheaper than a large body of programmer time. Your existing desktop support infrastructure should be able to handle this sort of kit. The most likely problem is compatibility issues running 32-bit software on a 64-bit OS; in this case you have various options like VMs or KVM switches to work around the compatibility issues.
The next step up is a 4- or 8-socket server. Fairly ordinary Wintel servers go up to 8 sockets (32-48 cores) and perhaps 512GB of RAM - without having to move off the Wintel platform. This gives you a fairly wide range of options within your platform of choice before you have to go to anything exotic.
Finally, if you can't make it run quickly in F#, validate the F# prototype and build a C implementation using the F# prototype as a control. If that's still not fast enough you've got problems.
If your application can be structured in a way that suits the platform then you could look at a more exotic platform. Depending on what will work with your application, you might be able to host it on a cluster or a cloud provider, or build the core engine on a GPU, Cell processor or FPGA. However, in doing this you're taking on (quite substantial) additional costs and exotic dependencies that might cause support issues. You will probably also have to bring in a third-party consultant who knows how to program the platform.
After all that, the best advice is: suck it and see. If you're comfortable with F# you should be able to prototype your application fairly quickly. See how fast it runs and don't worry too much about performance until you have some clear indication that it really will be an issue. Remember, Knuth said that premature optimisation is the root of all evil about 97% of the time. Keep a weather eye out for issues and re-evaluate your strategy if you think performance really will cause trouble.
Edit: If you want to make a packaged application then you will probably be more performance-sensitive than otherwise. In this case performance will probably become an issue sooner than it would with a bespoke system. However, this doesn't affect the basic 'suck it and see' principle.
For example, at the risk of starting a game of buzzword bingo, if your application can be parallelized and made to work on a shared-nothing architecture you might see if one of the cloud server providers [ducks] could be induced to host it. An appropriate front-end could be built to run locally or through a browser. However, on this type of architecture the internet connection to the data source becomes a bottleneck. If you have large data sets then uploading these to the service provider becomes a problem. It may be quicker to process a large dataset locally than to upload it through an internet connection.
I would advise not to bother with optimizations yet. First try to get a working prototype, then find out where computation time is spent. You can probably move the biggest bottlenecks out into C or Fortran when and if needed -- then see how much difference it makes.
As they say, often 90% of the computation is spent in 10% of the code.
