I'm trying to create a Windows Docker container with access to GPUs. As a first step, I just wanted to check whether I can access a GPU from inside a Docker container.
Dockerfile
FROM mcr.microsoft.com/windows:1903
CMD [ "ping", "-t", "localhost" ]
Build and run
docker build -t debug_image .
docker run -d --gpus all --mount src="C:\Program Files\NVIDIA Corporation\NVSMI",target="C:\Program Files\NVIDIA Corporation\NVSMI",type=bind debug_image
docker exec -it CONTAINER_ID powershell
Problem and question
Now that I'm inside the container, I try to run the NVIDIA SMI executable from the shared directory. It fails with the error below. The obvious question is why, given that the same executable works on the host.
PS C:\Program Files\NVIDIA Corporation\NVSMI> .\nvidia-smi.exe
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA
driver. Make sure that the latest NVIDIA driver is installed and
running. This can also be happening if non-NVIDIA GPU is running as
primary display, and NVIDIA GPU is in WDDM mode.
As for the NVIDIA driver, AFAIK it should not be the problem, since nvidia-smi works on the host, where the driver is installed.
My host has 2 NVIDIA GPUs and no "primary" display, as it's a server with no screen connected. AFAIK its CPU doesn't have an integrated GPU, so I assume one of the NVIDIA GPUs acts as the primary display, if such a thing exists when no display is connected. I think one of them must, because one renders the screen when I connect through TeamViewer, and dxdiag reports one of them as Display 1.
As for WDDM mode, I've found ways to change it, but haven't found a way to check the current mode.
So the question is: why is it not working? Any insight or help on the points above would be appreciated.
Update.
About:
1) I've updated my drivers from 431 to 441, the latest version available for the GTX 1080 Ti, and the error message remains the same.
2-3) I've confirmed that GTX cards (except some Titan models) cannot run in TCC mode, so they are running in WDDM mode.
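On the open point about checking the current mode: nvidia-smi can report the driver model per GPU through its driver_model.current query field. A minimal sketch, assuming the query is run on the Windows host and its CSV output is piped into a POSIX shell (Git Bash, for instance); the helper name summarize_driver_model is hypothetical:

```shell
# Hypothetical helper: reads "index, name, driver_model.current" CSV rows on stdin,
# as printed by:
#   nvidia-smi --query-gpu=index,name,driver_model.current --format=csv,noheader
summarize_driver_model() {
  # Windows tools emit CRLF line endings, so drop the carriage returns first.
  tr -d '\r' | while IFS=, read -r idx name model; do
    # nvidia-smi separates fields with ", "; strip the leading space.
    name=${name# }
    model=${model# }
    printf 'GPU %s (%s): %s\n' "$idx" "$name" "$model"
  done
}
```

Piping the query into the helper prints one line per GPU. On Windows the reported value is WDDM or TCC; on Linux the field does not apply and shows N/A.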
Related
My goal is to run a Vulkan application in a Docker container using the NVIDIA Container Toolkit, ideally with Ubuntu 22.04 both on the host and in the container.
I've created a git repo to allow others to better reproduce this issue: https://github.com/rickyjames35/vulkan_docker_test
The README explains my findings but I will reiterate them here.
For this test I'm running Ubuntu 22.04 on my host and in the container (FROM ubuntu:22.04). The only device vulkaninfo finds is llvmpipe, which is a CPU-based graphics driver. llvmpipe also can't render when running vkcube, both in the container and on the host. Here is the container output for vkcube:
Selected GPU 0: llvmpipe (LLVM 13.0.1, 256 bits), type: 4
Could not find both graphics and present queues
On my host I can tell it to use llvmpipe:
vkcube --gpu_number 1
Selected GPU 1: llvmpipe (LLVM 13.0.1, 256 bits), type: Cpu
Could not find both graphics and present queues
As you can see, they fail with the same error. Interestingly, if I switch the container to FROM ubuntu:20.04, llvmpipe can render, but this is moot since I don't want CPU rendering. The main issue is that Vulkan cannot detect my NVIDIA GPU from within the container when using the NVIDIA Container Toolkit with NVIDIA_DRIVER_CAPABILITIES=all and NVIDIA_VISIBLE_DEVICES=all. I've also tried the nvidia/vulkan image. When running vulkaninfo in that container I get:
vulkaninfo
ERROR: [Loader Message] Code 0 : vkCreateInstance: Found no drivers!
Cannot create Vulkan instance.
This problem is often caused by a faulty installation of the Vulkan driver or attempting to use a GPU that does not support Vulkan.
ERROR at /vulkan-sdk/1.3.236.0/source/Vulkan-Tools/vulkaninfo/vulkaninfo.h:674:vkCreateInstance failed with ERROR_INCOMPATIBLE_DRIVER
I suspect this has to do with running Ubuntu 22.04 on the host, although the whole point of Docker is that the host OS generally should not affect the container.
In the test above I was using nvidia-driver-525. I've tried different versions of the driver with the same results. At this point I'm not sure whether I'm doing something wrong or whether Vulkan is not supported by the NVIDIA Container Toolkit on Ubuntu 22.04, even though it claims to be.
I had a similar problem when trying to set up a docker container using the nvidia/cuda:12.0.0-devel-ubuntu22.04 image.
I was able to get it to work using the unityci/editor image. This is the docker command I used.
docker run -dit -e DISPLAY=$DISPLAY -v /tmp/.X11-unix/:/tmp/.X11-unix/ -v /dev:/dev --gpus='all,"capabilities=compute,utility,graphics,display"' unityci/editor:ubuntu-2022.2.1f1-base-1.0.1
After setting up the container, I had to apt install vulkan-utils and libnvidia-gl-525; after that everything worked.
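The "Found no drivers!" message from the Vulkan loader generally means it found no ICD manifest (a small JSON file pointing at the driver library), which may be why installing libnvidia-gl-525 helped. A minimal sketch for checking this inside the container, assuming the loader's usual Linux manifest directories; the function name list_vulkan_icds is hypothetical:

```shell
# Hypothetical helper: prints every Vulkan ICD manifest found in the given directories.
list_vulkan_icds() {
  # With no arguments, fall back to the usual manifest directories (an assumption).
  if [ "$#" -eq 0 ]; then
    set -- /etc/vulkan/icd.d /usr/share/vulkan/icd.d /usr/local/share/vulkan/icd.d
  fi
  for dir in "$@"; do
    for manifest in "$dir"/*.json; do
      # An unmatched glob stays literal, so only print paths that really exist.
      if [ -e "$manifest" ]; then
        printf '%s\n' "$manifest"
      fi
    done
  done
  return 0
}
```

If the listing contains no nvidia_icd.json, the loader has no NVIDIA driver to offer, whatever nvidia-smi reports.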
Hope this helps!
Sometimes, when I come back to my workplace from home, I can't communicate with my NVIDIA GPUs inside a Docker container, even though a previously launched process that uses the GPUs is still running well. The running process (training a neural network with PyTorch) is not affected, but I cannot launch a new one.
nvidia-smi gives Failed to initialize NVML: Unknown Error, and torch.cuda.is_available() likewise returns False.
I have run into two different cases:
1) nvidia-smi works fine on the host machine. In this case, the problem can be solved by restarting the container from the host: docker stop $MYCONTAINER followed by docker start $MYCONTAINER.
2) Neither nvidia-smi nor nvcc --version works on the host machine. They fail with Failed to initialize NVML: Driver/library version mismatch and Command 'nvcc' not found, but can be installed with: sudo apt install nvidia-cuda-toolkit, respectively. The strange point is that the current process still runs well. In this case, reinstalling the driver or rebooting the machine solves the problem.
However, these solutions require stopping all current processes, so they are not an option when the running process must not be interrupted.
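The two cases can be told apart from the error text alone, which helps decide which remedy applies before touching anything. A minimal sketch; the function name and the labels it prints are hypothetical:

```shell
# Hypothetical helper: maps an nvidia-smi error message to one of the two cases.
classify_nvml_error() {
  case "$1" in
    *"Driver/library version mismatch"*)
      # Second case: the host itself is affected.
      echo "case 2: host driver mismatch" ;;
    *"Unknown Error"*)
      # First case: the host is fine, only the container lost access.
      echo "case 1: container lost GPU access" ;;
    *)
      echo "unrecognized" ;;
  esac
}
```

Inside the container it could be called as classify_nvml_error "$(nvidia-smi 2>&1)".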
Does anybody have a suggestion for solving this?
Many thanks.
(software)
Docker version: 20.10.14, build a224086
OS: Ubuntu 22.04
Nvidia driver version: 510.73.05
CUDA version: 11.6
(hardware)
Supermicro server
Nvidia A5000 * 8
(pic1) nvidia-smi not working inside a Docker container, while working well on the host machine.
(pic2) nvidia-smi working again after restarting the container, which is case 1 mentioned above.
For the problem of Failed to initialize NVML: Unknown Error and having to restart the container, please see this ticket and post your system/package information there as well:
https://github.com/NVIDIA/nvidia-docker/issues/1671
There's a workaround on the ticket, but it would be good to have others post their configuration to help fix the issue.
Downgrading containerd.io to 1.6.6 works as long as you specify no-cgroups = true in /etc/nvidia-container-runtime/config.toml and pass the devices to docker run explicitly, like docker run --gpus all --device /dev/nvidia0:/dev/nvidia0 --device /dev/nvidia-modeset:/dev/nvidia-modeset --device /dev/nvidia-uvm:/dev/nvidia-uvm --device /dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools --device /dev/nvidiactl:/dev/nvidiactl --rm -it nvidia/cuda:11.4.2-base-ubuntu18.04 bash
So run sudo apt-get install -y --allow-downgrades containerd.io=1.6.6-1 and then sudo apt-mark hold containerd.io to prevent the package from being updated. In short: downgrade and hold the package, edit the config file, and pass all of the /dev/nvidia* devices to docker run.
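Typing one --device flag per node is error-prone, so the flags can be generated from whatever /dev/nvidia* nodes exist on the host. A minimal sketch; the function name nvidia_device_flags is hypothetical, and skipping directories is an assumption (some systems expose /dev/nvidia-caps as a directory):

```shell
# Hypothetical helper: prints one --device=SRC:DST flag per path given as an argument.
nvidia_device_flags() {
  for node in "$@"; do
    # Assumption: only individual device nodes are wanted, so directories are skipped.
    if [ -d "$node" ]; then
      continue
    fi
    printf '%s=%s:%s\n' --device "$node" "$node"
  done
}
```

It could be used as docker run --gpus all $(nvidia_device_flags /dev/nvidia*) --rm -it nvidia/cuda:11.4.2-base-ubuntu18.04 bash, with the substitution left unquoted so that each flag becomes its own argument.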
The Failed to initialize NVML: Driver/library version mismatch issue is caused by the driver packages being updated without a reboot afterwards, so the loaded kernel module no longer matches the newly installed libraries. If this is a production machine, I would also hold the driver package to stop it from auto-updating. You should be able to figure out the package name from something like sudo dpkg --get-selections "*nvidia*"
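One way to confirm that a pending driver update is the cause is to compare the version of the kernel module that is currently loaded with the version installed on disk, which is usually updated together with the libraries. A minimal sketch, assuming the loaded version can be read from /sys/module/nvidia/version and the installed one from modinfo -F version nvidia; the function name is hypothetical:

```shell
# Hypothetical helper: compares two driver version strings and reports the result.
#   $1: version of the loaded module, e.g. "$(cat /sys/module/nvidia/version)"
#   $2: version installed on disk,    e.g. "$(modinfo -F version nvidia)"
check_driver_versions() {
  loaded=$1
  on_disk=$2
  if [ "$loaded" = "$on_disk" ]; then
    echo "match: $loaded"
  else
    echo "mismatch: loaded $loaded, installed $on_disk"
    return 1
  fi
}
```

A mismatch here points at an update that has not taken effect yet, so a reboot (or reloading the module once nothing is using it) is still needed.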
On a Mac, Docker uses HyperKit to create a LinuxKit VM. This means, among other things, that I cannot see the image layers pulled down for a given container in places like /var/lib/docker, since the VM controls all of that.
Is there a way to actually get a shell on that VM to be able to do that sort of introspection?
In Docker Desktop 2.4 for Mac, it is possible to get a nearly full terminal into the LinuxKit VM, with sane tab auto-completion, and to inspect its contents.
For example, to see the layers of pulled down docker images, you may perform the following commands:
$ stty -echo -icanon && nc -U ~/Library/Containers/com.docker.docker/Data/debug-shell.sock && stty sane
/ # ls -al /var/lib/docker/overlay2/
The nc -U ~/Library/Containers/com.docker.docker/Data/debug-shell.sock command may be run on its own, per the Docker release docs, but unless it is combined with stty as in the example above, the output will be hard to read and there will be no tab completion in the VM.
I'm currently working on running DL algorithms inside docker containers and I've been successful. However, I can only get it running by passing --net=host flag to docker run command which makes container use host computer's network interface. If I don't pass that flag it throws the following error:
No EGL Display
nvbufsurftransform: Could not get EGL display connection
No protocol specified
nvbuf_utils: Could not get EGL display connection
When I do
echo $DISPLAY
it outputs :0 which is correct.
But I don't understand what GStreamer, X11 or EGL have to do with host networking. Is there any explanation for this, or any workaround other than the --net=host flag? Because of this I can't map different ports for various containers.
I have also created a topic about this on the NVIDIA DevTalk forum, but it is still a dark spot for me. I wasn't satisfied with the answers I got.
That said, using the --net=host flag is an acceptable way to work around the problem.
Quick heads up: GStreamer does not work natively over X11 forwarding; you are better off using a VNC solution or having access to the physical machine.
Troubleshooting
Is gstreamer installed? apt install -y gstreamer1.0-plugins-base
What does xrandr return?
What does xauth list return?
What does gst-launch-1.0 nvarguscamerasrc ! nvoverlaysink return?
For example:
On my setup, because I do not use a Dockerfile, I copy the cookie from xauth list on the host and paste it into the container:
xauth add user/unix:11 MIT-MAGIC-COOKIE cccccccccccccccccccccccccc
After this I can test the display with xterm&.
Once this is done, xrandr also produces output.
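Copying the cookie by hand can be scripted. A minimal sketch that turns the output of xauth list on the host into xauth add commands to run inside the container; the function name is hypothetical, and it assumes each line has the usual three fields (display name, protocol, hex key):

```shell
# Hypothetical helper: reads `xauth list` output on stdin and prints the matching
# `xauth add` commands, one per entry.
xauth_add_commands() {
  while read -r display proto cookie; do
    # Skip blank or incomplete lines.
    if [ -z "$cookie" ]; then
      continue
    fi
    printf 'xauth add %s %s %s\n' "$display" "$proto" "$cookie"
  done
}
```

On the host, xauth list | xauth_add_commands prints the lines to paste into the container shell.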
Getting more verbose
I also connect to the container over an ssh connection with verbose output (to the host or to the guest, it doesn't matter): ssh -X -v user@192.168.123.123
so the EGL error comes wrapped in debug details.
stream stuff
This is related to NVIDIA's DeepStream and GStreamer customization.
Some NVIDIA threads point out that EGL needs a "sink" but no X11 display.
If a server is running on the host on a designated port, running the container with --net=host allows a client inside the container to connect to it.
According to the doc, there are some servers used by the GPU.
Doc
$DISPLAY
According to NVIDIA threads, unset DISPLAY gives better results.
On my setup, without DISPLAY set, the EGL error is gone, but then the stream cannot be seen.
I use Singularity and I need to install an NVIDIA driver in my Singularity container to do some deep learning with a GTX 1080.
This Singularity image was created from an NVIDIA Docker image from here:
https://ngc.nvidia.com/catalog/containers/nvidia:kaldi and converted to a Singularity container.
I think there were no NVIDIA drivers in it, because nvidia-smi was not found before I installed the driver.
I ran the following commands:
add-apt-repository ppa:graphics-drivers/ppa
sudo apt update
apt install nvidia-418
After that, to check whether the driver was installed correctly, I ran:
nvidia-smi
which returned: Failed to initialize NVML: Driver/library version mismatch
I searched for how to solve this error and found this topic:
NVIDIA NVML Driver/library version mismatch
One answer says to run:
lsmod | grep nvidia
and then to rmmod each listed module except nvidia, and finally to rmmod nvidia.
rmmod drm
But when I do this, as the topic anticipated, I get the error:
rmmod: ERROR: Module nvidia is in use.
The topic says to run lsof /dev/nvidia* and kill the processes that use the module, but I see nothing with drm in its name, and killing those processes (Xorg, gnome-she) seems like a very bad idea.
Here is the output of lsof /dev/nvidia*, followed by lsmod | grep nvidia and then rmmod drm
Rebooting the computer didn't help either.
What should I do to get nvidia-smi working and be able to use my GPU from inside the Singularity container?
Thank you
You may need to do the above steps in the host OS and not in the container itself. /dev is mounted into the container as is and still subject to use by the host, though the processes are run in a different userspace.
Thank you for your answer.
I wanted to install the GPU driver in the Singularity container because inside the container I wasn't able to use the GPU (nvidia-smi: command not found), while outside the container nvidia-smi worked.
You are right, the driver should be installed outside the container. I only tried installing it inside to work around not having access to the driver from within the container.
I have now found the solution: to use the GPU from inside the Singularity container, you must add --nv when calling the container.
Example:
singularity exec --nv singularity_container.simg ~/test_gpu.sh
or
singularity shell --nv singularity_container.simg
When you add --nv, the container has access to the NVIDIA driver and nvidia-smi works.
Without it, you cannot use the GPU and nvidia-smi does not work.