Deeplearning4j: how to configure for best performance?

I just updated my server and I am running a web service using Deeplearning4j. I wanted to check out different configurations for performance to decide where to put future efforts. From what I know, the options are:
1) Native (CPU), which is my current default.
2) CUDA (GPU): set the environment in code to use CUDA.
3) GTX-specific GPU settings.
For #2, I am thinking I need to remove the ND4J native jars from the classpath, in addition to setting the environment in my Java code; something like the sketch below is what I have in mind.
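A minimal sketch, assuming a Maven build; the exact ND4J artifact names and CUDA versions differ between releases, so treat them as placeholders:

    import org.nd4j.linalg.api.ndarray.INDArray;
    import org.nd4j.linalg.factory.Nd4j;

    public class BackendCheck {
        public static void main(String[] args) {
            // ND4J picks its backend from the classpath, not from code: depend on
            // org.nd4j:nd4j-native-platform for the CPU backend, or the matching
            // org.nd4j:nd4j-cuda-*-platform artifact for the GPU backend (and drop
            // the one you don't want, so there is no ambiguity).
            System.out.println("ND4J backend: " + Nd4j.getBackend().getClass().getName());

            // A trivial op to confirm the selected backend actually initialises.
            INDArray x = Nd4j.rand(1000, 1000);
            System.out.println("Sum: " + x.sumNumber());
        }
    }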
For #3, I am not sure whether that applies when I run inference with my DNN model, or only during training?
Many thanks for any feedback!

After running some tests, it seems there is no difference between the configurations. However, the server upgrade itself (going from a 3rd-generation i7 to a 10th-generation part and moving the memory from 800 MT/s to 2400 MT/s) seemed to give the biggest gain overall. PCIe is unchanged (still 3.0) and the GPU is still a GTX 1080. I have not tried overclocking the memory or the GTX 1080 yet. That suggests the critical path is the memory or the GPU; I suspect memory is where the biggest gain is for my users.
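For anyone who wants to reproduce this kind of backend comparison, a minimal timing sketch along the lines below should do; the model file name and input shape are placeholders for whatever your network expects.

    import java.io.File;

    import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
    import org.deeplearning4j.util.ModelSerializer;
    import org.nd4j.linalg.api.ndarray.INDArray;
    import org.nd4j.linalg.factory.Nd4j;

    public class InferenceTiming {
        public static void main(String[] args) throws Exception {
            // Placeholder model file and input shape; adjust to your own network.
            MultiLayerNetwork model =
                    ModelSerializer.restoreMultiLayerNetwork(new File("model.zip"));
            INDArray input = Nd4j.rand(1, 128);

            // Warm up so JIT compilation and backend initialisation don't skew the numbers.
            for (int i = 0; i < 100; i++) {
                model.output(input);
            }

            int iterations = 1000;
            long start = System.nanoTime();
            for (int i = 0; i < iterations; i++) {
                model.output(input);
            }
            double avgMs = (System.nanoTime() - start) / 1_000_000.0 / iterations;
            System.out.println("Average inference time: " + avgMs + " ms");
        }
    }

Running the same class once with the native backend on the classpath and once with the CUDA backend gives a like-for-like comparison.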

Related

Qualitative comparison between Petalinux and FreeRTOS

I'm going to start development of an application on a Zynq board. My task is basically to port an existing application running on a MicroBlaze to the dual-core ARM.
What I'm wondering is which OS to use on the new system, because I have no experience at all in this field.
It seems to me that there are four main approaches:
1) Petalinux (use both cores)
2) Petalinux+FreeRTOS (use both cores)
3) FreeRTOS (use only one core)
4) Bare metal (use only one core)
What my application has to do is move a large amount of data between Ethernet and multiple custom links, so it has to serve a lot of interrupts and issue a lot of DMA operations.
How much overhead does Petalinux introduce in interrupt servicing compared to bare metal or FreeRTOS? For this kind of work, do you think a single-core application running without any OS would be faster than, for example, a Petalinux application that carries the overhead of the OS (and of synchronization mechanisms like semaphores or mutexes)?
I know the question is imprecise and quite vague, but having no experience in the field I really need some initial hints.
Thank you.
As you say, this is too vague to give a considered answer because it really depends on your application (when does it not?). If you need all the 'stuff' that is available for Linux and boot time is not an issue, then go with that. If you need actual real-time behaviour, fast boot time, simplicity, and don't need anything Linux-specific, then FreeRTOS might be your best choice. There is a Zynq FreeRTOS TCP project that uses the BSD-style sockets interface (like Linux) here: http://www.freertos.org/FreeRTOS-Plus/FreeRTOS_Plus_TCP/TCPIP_FAT_Examples_Xilinx_Zynq.html
Usually the performance should not differ a lot.
If you compile your Linux with a well-optimizing compiler, there is a good chance it will be faster than bare metal.
But if you need hard real time, Linux is not suitable for you.
There is a good whitepaper from Altera, but it should apply to Xilinx too:
whitepaper on real time jitter

ArrayFire AF_BACKEND_CPU not multi-threaded?

I usually use the OpenCL backend when I use ArrayFire. I was using Intel OpenCL on my i7 CPU. When I switched to the AF_BACKEND_CPU backend, my code was about 10-15x slower. I checked and noticed that it was only running on one core. I also suspect that it is not using SSE or AVX instructions, which would account for the rest of the slowdown, since my processor only has 4 cores. I feel like the ArrayFire CPU backend should be faster. Is there a way to make it multithreaded?
The CPU backend is not yet multithreaded. But from version 3.4.0, I suspect it will change (see "Sparse Support, Thread Safety, Parallel CPU" on https://github.com/arrayfire/arrayfire/milestones)
I was wondering the same thing. It turns out that in the meantime the milestone has been moved to 3.5.0 (https://github.com/arrayfire/arrayfire/issues/451).
As far as I can see, AF still only uses one core for now, so 3 of your 4 cores sit idle.
In general I'd suggest using AF together with the GPU and creating af::arrays only when needed, since there is otherwise no way to hold data only on the GPU or only on the CPU (see "How to explicitly get linear indices from arrayfire?" and http://forums.accelereyes.com/forums/viewtopic.php?f=17&t=43097&p=61730&hilit=copy+host+memory+into+an+array#p61727 for how to construct af::arrays ad hoc).
Also, as a general rule of thumb, for many tasks the GPU implementation is still much faster than the CPU implementation, even if the task is not perfectly suited to the GPU. See for example sorting algorithms, which normally involve a lot of branching.
If you insist on using the CPU in parallel, you could also try putting OpenMP, MPI or just std::thread on top of AF and parallelizing that way. I didn't gain a lot with std::thread for sorting operations, though.

Neo4j and Hugepages

Since Neo4j works primarily in memory, I was wondering if it would be advantageous to enable hugepages (https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt) in my Linux kernel, and then use -XX:+UseLargePages or maybe -XX:+UseHugeTLBFS in the (OpenJDK 8) JVM?
If so, what rule of thumb should I use to decide how many hugepages to configure?
The Neo4j Performance Guide (http://neo4j.com/docs/stable/performance-guide.html) does not mention this, and Google didn't turn up anyone else discussing it (in the first couple of search pages anyway), so I thought I'd ask.
I'm struggling to get acceptable performance from my new Neo4j instance (2.3.2-community). Any little bit will help. I want to know if this is worth trying before I bring down the database to change JVM flags... I'm hoping someone else has done some experiments along these lines already.
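In case it helps anyone else wondering the same, here is a sketch of how to check what the JVM currently has for these flags (assuming HotSpot/OpenJDK; it only inspects the JVM it runs in, for example an embedded Neo4j, and some of the flags are Linux-only):

    import java.lang.management.ManagementFactory;

    import com.sun.management.HotSpotDiagnosticMXBean;
    import com.sun.management.VMOption;

    public class LargePageCheck {
        public static void main(String[] args) {
            // Ask HotSpot for the current value and origin of its large-page flags.
            HotSpotDiagnosticMXBean hotspot =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            for (String flag : new String[]{"UseLargePages", "UseTransparentHugePages", "UseHugeTLBFS"}) {
                try {
                    VMOption option = hotspot.getVMOption(flag);
                    System.out.println(flag + " = " + option.getValue()
                            + " (origin: " + option.getOrigin() + ")");
                } catch (IllegalArgumentException e) {
                    // The flag does not exist on this JVM/platform.
                    System.out.println(flag + " is not available here");
                }
            }
        }
    }

For a standalone Neo4j server process, jcmd <pid> VM.flags or jinfo -flag UseLargePages <pid> should report the same information without touching the database.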
Since Neo4j does its own file paging and doesn't rely on the OS to do this, it should be advantageous or at least not hurt. Huge pages will reduce the probability of TLB cache misses when you use a large amount of memory, which Neo4j often would like to do when there's a lot of data stored in it.
However, Neo4j does not use hugepages directly, even though it could and it would be a nice addition. This means you have to rely on transparent huge pages and whatever features the JVM provides. Transparent huge pages can cause more or less short stalls when smaller pages are merged.
If you have a representative staging environment then I advise you to make the changes there first, and measure their effect.
Transparent huge pages are mostly a problem for programs that use mmap because I think it can lead to changing the size of the unit of IO, which will make the hard-pagefault latency much higher. I'm not entirely sure about this, though, so please correct me if I'm wrong.
The JVM actually does use mmap for telemetry and tooling, through a file in /tmp, so make sure this directory is mounted on tmpfs to avoid gnarly IO stalls, for instance during safepoints (!!!). Always do this, even if you don't use huge pages.
Also make sure you are using the latest Linux kernel and the latest Java version.
You may be able to squeeze some percentage points out of it with tuning G1, but this is a bit of a black art.

Why do the CLR and JVM use a stack-based architecture?

I have always been curious as to why the JVM and CLR use a stack-based architecture.
Why don't they use a register-based approach?
What benefits does it have over the register-based approach?
I used to ponder the differences between register and stack machines, compare instruction sequences, and run benchmarks...
Then I spent a couple of years implementing both types of machines while working on the Parrot VM, which was a register machine. We started, naively, with a fixed register set in combination with data and register stacks, but eventually concluded that this was an artificial limitation, so we changed to an infinite register set and an allocator. At some point, the Parrot fast core (GCC computed goto) outperformed the Mono and JVM interpreter cores (non-JIT), but the difference came down to the JIT. Parrot's JIT never matched the quality of the others. It is the quality of the JITter that makes the eventual machine, and that is generally what people care about. If all VMs played by the same rules (i.e. they were constrained to run in interpreted mode with no JIT), then my evidence shows a register machine has the performance edge over an equivalent stack machine. Larger instructions, but fewer of them == higher throughput (IPC) and better cache locality of reference. The Dalvik VM actually supports my findings: it had no JIT for a couple of years and competed using only its interpreter core.
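To make the size-versus-count trade-off concrete, here is a small illustration; the stack-machine comments reflect what javap -c typically shows for this method, while the register form is a simplified, Dalvik-style rendering for comparison.

    public class AddDemo {
        // On the stack-based JVM, javap -c typically disassembles add() to:
        //   iload_0   // push a onto the operand stack
        //   iload_1   // push b
        //   iadd      // pop both, push a + b
        //   ireturn   // pop the result and return it
        //
        // A register machine in the spirit of Dalvik expresses the same work in
        // roughly one wider instruction (mnemonic simplified here):
        //   add-int v0, v1, v2   // v0 = v1 + v2
        // Fewer, larger instructions versus more, smaller ones is exactly the
        // trade-off described above.
        static int add(int a, int b) {
            return a + b;
        }

        public static void main(String[] args) {
            System.out.println(add(2, 3));
        }
    }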
Very few mainstream VMs run in interpretation mode exclusively (AFAIK); they JIT compile, and that's what we benchmark. The point of the interpreter core is to establish a presence on the platform, do bytecode verification, and provide a failsafe execution core in the absence of the JIT. This isn't a rule, of course; there are billions of devices running ARM-accelerated JVMs without a JIT, but in the absence of memory or CPU constraints, this applies.
I worked and worked at tweaking the core, testing and tuning, only to find that in the end we really wanted a fast JIT. I arrived at the conclusion that if you are going to JIT eventually, it doesn't matter much whether you implement a stack or register machine to start with; do what you like, but you will get "to market" faster with a stack machine. Doing a lot of pseudo-register-machine virtual optimizations for bytecode interpretation by a virtual machine core is partially wasted effort, because it isn't real native optimization. The soft core doesn't do branch prediction, register renaming, instruction reordering, parallel execution or prefetching like a real processor does. My feeling is that once we have a high-quality JIT to native binary, we arrive at the same destination.
For those reasons, I technically favor a stack-based machine for:
Simplicity - much less code to maintain = fewer bugs
Time to implement
But visually and emotionally, I favor a register machine for:
Visual/conceptual models that more closely match the machine, and my brain
Flexibility - compilers can evaluate their expression trees in different orders using SSA.
Note I didn't say compilers could more "easily" generate code. That seems to be what people who have worked mostly with stack machines like to argue. I don't believe that, and didn't find it to be true. I saw many hobby compilers written in a short time on both Parrot and the CLR, and while I would admit the ones on the CLR are of higher quality, that is mainly a matter of ecosystem and the quality of available tools. I wrote compilers on both platforms myself and found there are trade-offs, but not enough to lose sleep over.
This is an educated guess, because my real-world experience does not include writing a full JITter, so I don't have first-hand experience comparing the pros and cons of JITting various forms of opcodes. But my opinion is that if you plan to include a JIT, then creating an extremely sophisticated virtual machine opcode core amounts to premature optimization. Your time is better spent elsewhere.
It is usually not appropriate to just link out to an article but this time I'll make an exception: This article by Eric Lippert answers just this question.

Datamining models in FORTRAN or C (or managed code)?

We are planning to develop a datamining package for windows. The program core / calculation engine will be developed in F# with GUI stuff / DB bindings etc done in C# and F#.
However, we have not yet decided on the model implementations. Since we need high performance, we probably can't use managed code here (any objections?). The question is: is it reasonable to develop the models in FORTRAN, or should we stick to C (or maybe C++)? We are looking into using OpenCL at some point for suitable models - it feels funny having to go from managed code -> FORTRAN -> C -> OpenCL invocation for these situations.
Any recommendations?
F# compiles to the CLR, which has a just-in-time compiler. It's a dialect of ML, which is strongly typed, allowing all of the nice optimisations that go with that type of architecture; this means you will probably get reasonable performance from F#. For comparison, you could also try porting your code to OCaml (IIRC this compiles to native code) and see if that makes a material difference.
If it really is too slow, then see how far scaling the hardware will get you. With the performance available from a modern PC or server it seems unlikely that you would need to go to anything exotic unless you are working with truly brobdingnagian data sets. Users with smaller data sets may well be OK on an ordinary PC.
Workstations give you perhaps an order of magnitude more capacity than a standard desktop PC. A high-end workstation like an HP Z800 or XW9400 (similar kit is available from several other manufacturers) can take two 4- or 6-core CPUs, tens of gigabytes of RAM (up to 192GB in some cases) and has various options for high-speed I/O like SAS disks, external disk arrays or SSDs. This type of hardware is expensive but may be cheaper than a large body of programmer time. Your existing desktop support infrastructure should be able to handle this sort of kit. The most likely problem is compatibility issues running 32-bit software on a 64-bit O/S. In this case you have various options like VMs or KVM switches to work around the compatibility issues.
The next step up is a 4- or 8-socket server. Fairly ordinary Wintel servers go up to 8 sockets (32-48 cores) and perhaps 512GB of RAM, without having to move off the Wintel platform. This gives you a fairly wide range of options within your platform of choice before you have to go to anything exotic.
Finally, if you can't make it run quickly in F#, validate the F# prototype and build a C implementation using the F# prototype as a control. If that's still not fast enough you've got problems.
If your application can be structured in a way that suits the platform, then you could look at a more exotic platform. Depending on what will work with your application, you might be able to host it on a cluster or cloud provider, or build the core engine on a GPU, Cell processor or FPGA. However, in doing this you're taking on (quite substantial) additional costs and exotic dependencies that might cause support issues. You will probably also have to bring in a third-party consultant who knows how to program the platform.
After all that, the best advice is: suck it and see. If you're comfortable with F# you should be able to prototype your application fairly quickly. See how fast it runs and don't worry too much about performance until you have some clear indication that it really will be an issue. Remember, Knuth said that premature optimisation is the root of all evil about 97% of the time. Keep a weather eye out for issues and re-evaluate your strategy if you think performance really will cause trouble.
Edit: If you want to make a packaged application then you will probably be more performance-sensitive than otherwise. In this case performance will probably become an issue sooner than it would with a bespoke system. However, this doesn't affect the basic 'suck it and see' principle.
For example, at the risk of starting a game of buzzword bingo, if your application can be parallelized and made to work on a shared-nothing architecture you might see if one of the cloud server providers [ducks] could be induced to host it. An appropriate front-end could be built to run locally or through a browser. However, on this type of architecture the internet connection to the data source becomes a bottleneck. If you have large data sets then uploading these to the service provider becomes a problem. It may be quicker to process a large dataset locally than to upload it through an internet connection.
I would advise not to bother with optimizations yet. First try to get a working prototype, then find out where computation time is spent. You can probably move the biggest bottlenecks out into C or Fortran when and if needed -- then see how much difference it makes.
As they say, often 90% of the computation is spent in 10% of the code.
