Does MXNet use Nvidia's NCCL library for multi-GPU communication?

On Nvidia's website, they claim that MXNet uses NCCL (https://developer.nvidia.com/nccl). However, I haven't found any reference in MXNet's GitHub repository showing that it actually uses the NCCL library.
The Chainer blog also claims that Chainer achieves better performance than MXNet on 4 GPUs because of its use of the NCCL library (https://chainer.org/general/2017/02/08/Performance-of-Distributed-Deep-Learning-Using-ChainerMN.html).
In some older posts in the MXNet repository, I can see discussion of the difficulty of including the NCCL library in MXNet.
My first question is: is there any version of MXNet with NCCL support?
Second, what might be the performance implications of using the NCCL library (e.g. lower memory usage, less communication overhead across multiple GPUs)?

There is no official release at this time that supports NCCL.
1) There was a PR for this, which was closed (see the discussion here: https://github.com/apache/incubator-mxnet/issues/2919). It's possible to pull that code into an older commit.
2) See this quote from ptrendx about NCCL-related performance, posted on Sept 10:
"As part of supporting DGX, NVIDIA provides optimized versions of most major DL frameworks as docker containers. The version of MXNet that is part of this DGX software stack has NCCL support (which I guess is why that page lists MXNet as supported).
We do upstream our optimizations and NCCL support is available as a PR since February (#5521), but it is not yet accepted into upstream MXNet due to the API changes required.
That said, MXNet actually has a very good communication scheme, and as long as your network does not have a very large number of parameters (for which you need the bandwidth given by NCCL and NVLink) you may get as good or better results with MXNet's native device kvstore."
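For reference, NCCL support did eventually become available upstream as an optional kvstore type in later MXNet releases. Below is a minimal sketch of how the gradient-aggregation scheme is selected in the Python API, assuming such a build with NCCL compiled in; the toy model and random data are placeholders, not part of the answer above.

```python
# Minimal data-parallel sketch contrasting MXNet's native "device" kvstore
# with the NCCL kvstore. Assumes a later MXNet release built with NCCL
# support; the toy model and random data are placeholders.
import numpy as np
import mxnet as mx

ctx = [mx.gpu(i) for i in range(4)]              # four GPUs in one machine

# Native aggregation on GPU -- the "device kvstore" the quote refers to.
kv = mx.kvstore.create("device")
# Once NCCL support is compiled in, the same call selects NCCL instead:
# kv = mx.kvstore.create("nccl")

# A toy symbolic network; any real model would go here.
data = mx.sym.Variable("data")
net = mx.sym.SoftmaxOutput(mx.sym.FullyConnected(data, num_hidden=10),
                           name="softmax")

train_iter = mx.io.NDArrayIter(
    data=np.random.uniform(size=(1000, 100)).astype(np.float32),
    label=np.random.randint(0, 10, size=1000).astype(np.float32),
    batch_size=64,
)

mod = mx.mod.Module(net, context=ctx)
# Passing the kvstore picks the gradient-aggregation scheme across the GPUs.
mod.fit(train_iter, num_epoch=1, kvstore=kv)
```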

Do I need nvidia-container-runtime, and why? [closed]

I want to access my NVIDIA GPUs from inside containers. Can I do this without nvidia-container-runtime?
Requiring a custom Docker runtime just to talk to one device seems very strange. There is a whole universe of PCI devices out there. Why does this one need its own runtime? For example, suppose I had both NVIDIA and AMD GPUs. Would I be unable to access both from inside one container?
I understand that nvidia-container-runtime lets me control which GPUs are visible via NVIDIA_VISIBLE_DEVICES. But I do not care about this. I am not using containers to isolate devices; I am using containers to manage CUDA/CUDNN/TensorFlow version h*ll. And if I did want to isolate devices, I would use the same mechanism as forever: By controlling access to nodes in /dev.
In short, the whole "custom runtime" design looks flawed to me.
So, questions:
What am I missing?
Can I obtain access to my NVIDIA GPUs using the stock Docker (or podman) runtime?
If not, why not?
I certainly won't be able to answer every conceivable question related to this. I will try to give a summary. Some of what I write here is based on what's documented here and here. My discussion here will also be focused on linux, and docker (not windows, not singularity, not podman, etc.). I'm also not likely to be able to address in detail questions like "why don't other PCI devices have to do this?". I'm also not trying to make my descriptions of how docker works perfectly accurate to an expert in the field.
The NVIDIA GPU driver has components that run in user space and also other components that run in kernel space. These components work together and must be in harmony. This means the kernel mode component(s) for driver XYZ.AB must be used only with user-space components from driver XYZ.AB (not any other version), and vice-versa.
Roughly speaking, docker is a mechanism to provide an isolated user-space linux presence that runs on top of, and interfaces to, the linux kernel (where all the kernel space stuff lives). The linux kernel is in the base machine (outside the container) and much/most of linux user space code is inside the container. This is one of the architectural factors that allow you to do neato things like run an ubuntu container on a RHEL kernel.
From the NVIDIA driver perspective, some of its components need to be installed inside the container and some need to be installed outside the container.
Can I obtain access to my NVIDIA GPUs using the stock Docker (or podman) runtime?
Yes, you can, and this is what people did before nvidia-docker or the nvidia-container-toolkit existed. You need to install the exact same driver in the base machine as well as in the container. Last time I checked, this works (although I don't intend to provide instructions here.) If you do this, the driver components inside the container match those outside the container, and it works.
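To make the "matching driver" approach concrete, here is a sketch (not full instructions) using the Docker SDK for Python with the stock runtime: the NVIDIA device nodes are passed through directly, and the image name is a placeholder for an image that already contains user-space driver components identical to the host's.

```python
# Sketch of GPU access with the stock Docker runtime (pre-nvidia-docker style).
# Assumes the image "my-cuda-image" already contains user-space driver
# components that exactly match the host driver version, as described above.
import docker

client = docker.from_env()

output = client.containers.run(
    "my-cuda-image",                 # hypothetical image with a matching driver
    "nvidia-smi",
    devices=[                        # expose the NVIDIA device nodes directly
        "/dev/nvidia0:/dev/nvidia0",
        "/dev/nvidiactl:/dev/nvidiactl",
        "/dev/nvidia-uvm:/dev/nvidia-uvm",
    ],
    remove=True,
)
print(output.decode())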
What am I missing?
NVIDIA (and presumably others) would like a more flexible scenario. The above description means that if a container was built with any other driver version (than the one installed on your base machine) it cannot work. This is inconvenient.
The original purpose of nvidia-docker was to do the following: At container load time, install the runtime components of the driver, which are present in the base machine, into the container. This harmonizes things, and although it does not resolve every compatibility scenario, it resolves a bunch of them. With a simple rule "keep your driver on the base machine updated to the latest" it effectively resolves every compatibility scenario that might arise from a mismatched driver/CUDA runtime. (The CUDA toolkit, and anything that depends on it, like CUDNN, need only be installed in the container.)
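For contrast, a sketch of the same request when the NVIDIA container toolkit is installed: GPUs are requested through the normal Docker API, and the toolkit injects the host's user-space driver libraries at container start, so the image only needs the CUDA toolkit. The image tag below is just an example.

```python
# Sketch of the nvidia-container-toolkit path: the toolkit injects the host's
# user-space driver libraries at container start, so the image does not need
# a matching driver of its own. Requires a recent docker-py (>= 4.3).
import docker

client = docker.from_env()

output = client.containers.run(
    "nvidia/cuda:12.2.0-base-ubuntu22.04",   # example tag; any CUDA base works
    "nvidia-smi",
    device_requests=[
        docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]])
    ],
    remove=True,
)
print(output.decode())
```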
As you point out, the nvidia-container-toolkit has picked up a variety of other, presumably useful, functionality over time.
I'm not spending a lot of time here talking about the compatibility strategy ("forward") that exists for compiled CUDA code, and the compatibility strategy ("backward") that exists when talking about a specific driver and the CUDA versions supported by that driver. I'm also not intending to provide instructions for use of the nvidia-container-toolkit, that is already documented, and many questions/answers about it already exist also.
I won't be able to respond to follow up questions like "why was it architected that way?" or "that shouldn't be necessary, why don't you do this?"
To answer my own question: No, we do not need nvidia-container-runtime.
The NVIDIA shared libraries are tightly coupled to each point release of the driver. NVIDIA likes to say "the driver has components that run in user space", but of course that is a contradiction in terms. So for any version of the driver, you need to make the corresponding release of these shared libraries accessible inside the container.
A brief word on why this is a bad design: Apart from the extra complexity, the NVIDIA shared libraries have dependencies on other shared libraries in the system, in particular C and X11. If a newer release of the NVIDIA libraries ever required features from newer C or X11 libraries, a system running those newer libraries could never host an older container. (Because the container would not be able to run the newer injected libraries.) The ability to run old containers on new systems is one of the most important features of containers, at least in some applications. I guess we have to hope that never happens.
The HPC community figured this out and made it work some time ago. Here are some old instructions for creating a portable Singularity GPU container which injects the required NVIDIA shared libraries when the container runs. You could easily follow a similar procedure to create a portable OCI or Docker GPU container.
These days, Singularity supports a --nv flag to inject the necessary shared libraries automatically. It also supports a --rocm flag for AMD GPUs. (Yes, AMD chose the same bad design.) Presumably you could combine these flags if you needed both.
All of these details are pretty well-documented in the Singularity manual.
Bottom line: If you are asking the same question I was, try Singularity.

Recommended Debian version for development

I am using a BeagleBone Black for the first time. When choosing a Linux OS version, people almost always recommend using the release before the latest one, because it is more free of bugs.
My question is: do I need to apply the same philosophy to the BeagleBone (i.e. choose the previous version, 8.7, instead of 9.3, the latest as of today, 23/02/2018)?
The BeagleBone web page recommends I use the latest one, 9.3, which just amplifies my doubts.
I want to work on a bug-free version with well-supported drivers for my master's thesis.
I'd strongly recommend going with Debian 9.3. The Debian side is not what you need to worry about; it's extremely stable.
There are some aspects you should be aware of, though, given the many different write-ups out there:
DT overlays get applied in U-Boot (cf. uenv.txt)
uenv.txt lives in /boot nowadays, not on the FAT partition
If any write-up mentions a 3.8 kernel, disregard it; it will be horribly outdated
There are userspace mechanisms now to control pin muxing and pin options (see the sketch after this list)
Avoid /dev/mem; there be many dragons!
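As an illustration of the userspace pin-control point above, here is a small sketch assuming the Adafruit_BBIO Python package is installed (one of several userspace mechanisms; the pin chosen is arbitrary):

```python
# Toggle a header pin from userspace on a BeagleBone, assuming the
# Adafruit_BBIO package is available -- no /dev/mem poking required.
import time
import Adafruit_BBIO.GPIO as GPIO

PIN = "P8_14"                # example header pin, chosen arbitrarily

GPIO.setup(PIN, GPIO.OUT)    # configure the pin as an output from userspace
for _ in range(5):
    GPIO.output(PIN, GPIO.HIGH)
    time.sleep(0.5)
    GPIO.output(PIN, GPIO.LOW)
    time.sleep(0.5)
GPIO.cleanup()
```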
The question you will need to ask yourself, though, is which Linux kernel version you want to run. That will also depend on your actual use case for the hardware; it's impossible to give a recommendation without knowing more. One very important point is that you MUST NOT choose a real-time (RT) Linux kernel without understanding the impact it will have (it actually makes a lot of things slower, cf. Steven's talk).

Is it possible for GPU drivers to be rewritten by someone other than the manufacturer?

My GPU received new drivers from Windows Update. However, the latest driver provided by ATI/AMD (I'm not sure which) on their site is more than a year old. I do not know what happened. Did Microsoft write new drivers on its own? Did Microsoft force AMD to do it?
The questions are: is it possible that someone has written their own GPU drivers that work far better on a specific OS than the drivers provided by the manufacturer? Wouldn't that require technical documentation of the GPU (which is probably kept confidential by the manufacturer)?

IDE vs Library vs SDK vs Framework vs Toolkit

I did some research before asking this, but I couldn't really understand the differences between the things listed above. In-depth information would be much appreciated. Thanks in advance.
API - a set of functions and structures (classes) for performing a selected task (e.g. the libcurl API for network requests).
A framework is something you can build upon. Usually it is complete (or almost complete) to the point that it can be started out of the box (though it probably wouldn't do anything useful yet) and provides APIs to override some functionality (see the sketch after this list).
A toolkit is a set of utilities/tools you can use for some task (e.g. Kali Linux is a network penetration toolkit).
An SDK (Software Development Kit) is a toolkit (usually official) that can be used to interact with or program some device or platform. It may also provide APIs and frameworks internally (e.g. the Android SDK lets you develop, build, test and deploy applications for, well, Android; it also describes the APIs accessible from different OS versions).
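The library/framework distinction is easiest to see in code: with a library (or its API) your code is in charge and calls in, while with a framework you plug into its structure and it calls you. A toy Python sketch with made-up names:

```python
# Toy illustration of "library vs framework" (all names are hypothetical).

# Library style: your code drives the program and calls the API when needed.
import json                               # json here stands in for any library
doc = json.dumps({"task": "request"})

# Framework style: you fill in hooks, and the framework owns the control flow.
class Framework:
    """Stand-in for a framework: it runs the main loop and calls your code."""
    def run(self):
        for event in ["start", "work", "stop"]:
            self.handle(event)            # inversion of control
    def handle(self, event):              # hook you are expected to override
        raise NotImplementedError

class MyApp(Framework):
    def handle(self, event):
        print(f"handling {event}")

MyApp().run()                             # the framework calls MyApp.handle()
```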

alea.cuBase and CUBLAS

I'm starting down the exciting road of GPU programming, and if I'm going to do some heavyweight number-crunching, I'd like to use the best libraries that are out there. I would especially like to use cuBLAS from an F# environment. CUDAfy offers the full set of drivers from their solution, and I have also been looking at Alea.cuBase, which has thrown up a few questions.
The Alea.cuSamples project on GitHub makes a cryptic reference to an Examples solution: "For more advanced test, please go to the MatrixMul projects in the Examples solution." However, I can't find any trace of these mysterious projects.
Does anyone know the location of the elusive "MatrixMul projects in the Examples solution"?
Given that cuSamples performs a straightforward matrix multiplication, would the more advanced version, wherever it lives, use cuBLAS?
If not, is there a way to access cuBLAS from Alea.cuBase a la CUDAfy?
With Alea GPU V2, the new version, there are now two options:
The Alea Unbound library provides optimized matrix multiplication implementations: http://quantalea.com/static/app/tutorial/examples/unbound/matrixmult.html
Alea GPU has cuBLAS integrated; see the tutorial: http://quantalea.com/static/app/tutorial/examples/cublas/index.html
The matrixMulCUBLAS project is a C++ project that ships with the CUDA SDK, https://developer.nvidia.com/cuda-downloads. This uses cuBLAS to get astonishingly quick matrix multiplication (139 GFlops) on my home laptop.
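As a rough illustration of why a cuBLAS-backed GEMM is so fast (in Python rather than F#, and unrelated to Alea or CUDAfy), here is a sketch using CuPy, whose dense matrix multiplication dispatches to cuBLAS on the GPU:

```python
# Sketch of cuBLAS-backed matrix multiplication from Python via CuPy
# (illustration only; CuPy is unrelated to Alea/CUDAfy, but its GEMM calls
# go through cuBLAS, which is where the speed comes from).
import cupy as cp

n = 2048
a = cp.random.rand(n, n, dtype=cp.float32)
b = cp.random.rand(n, n, dtype=cp.float32)

c = a @ b                          # dispatches to cuBLAS SGEMM on the GPU
cp.cuda.Stream.null.synchronize()  # wait for the GPU before reading results
print(float(c.sum()))
```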
