Tensorflow/nvidia/cuda docker mismatched versions - docker

I am trying to use tensorflow and nvidia with docker, but hitting the following error:
docker run --runtime=nvidia -it --rm tensorflow/tensorflow:latest-gpu python -c "import tensorflow as tf; tf.enable_eager_execution(); print(tf.reduce_sum(tf.random_normal([1000, 1000])))"
docker: Error response from daemon: OCI runtime create failed: container_linux.go:344: starting container process caused "process_linux.go:424: container init caused \"process_linux.go:407: running prestart hook 1 caused \\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=#/sbin/ldconfig.real --device=all --compute --utility --require=cuda>=10.0 brand=tesla,driver>=384,driver<385 --pid=5393 /var/lib/docker/overlay2/......./merged]\\nnvidia-container-cli: requirement error: unsatisfied condition: brand = tesla\\n\\"\"": unknown.
I get a similar error when trying to run nvidia-smi:
docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi
but when trying to run nvidia-smi with cuda:9.0-base, it works like a charm:
docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi
Do I need to make sure that cuda 10 works, or can I run tensorflow with cuda 9? And how can I run the tensorflow docker image with cuda:9.0-base? (I'm still a docker newbie.)
Thanks a lot!

Ok, I think I am finally starting to figure out the mess on my machine.
The tensorflow image does NOT care about the cuda image version; it does not use the docker cuda image at all. It only cares about my nvidia drivers, since CUDA is bundled inside the tensorflow image itself.
(The docker cuda image that works with my current drivers is cuda:9.0.)
That means I have to find a tensorflow image that works with my drivers (390.116), or update the drivers.
I tried the same command with tensorflow:1.12.0-gpu-py3, and it ran without any problems.
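The decision above can be sketched as a small script: read the host driver version and pick an image tag accordingly. This is a hedged sketch, not an official compatibility check; the 410.48 threshold is the minimum driver CUDA 10.0 states in its release notes, and the image tags are just the two discussed here.

```shell
#!/bin/sh
# Pick a tensorflow image tag based on the host NVIDIA driver version.
# Assumption: CUDA 10.0 images need driver >= 410.48; older drivers
# (like 390.116 here) need a CUDA 9 build such as tensorflow:1.12.0-gpu-py3.
driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)
echo "host driver: $driver"

# sort -V puts the older version first, so if 410.48 comes out on top,
# the installed driver is at least that new.
if [ "$(printf '%s\n' 410.48 "$driver" | sort -V | head -n1)" = "410.48" ]; then
  image=tensorflow/tensorflow:latest-gpu        # CUDA 10 build
else
  image=tensorflow/tensorflow:1.12.0-gpu-py3    # CUDA 9 build
fi

docker run --runtime=nvidia --rm "$image" \
  python -c "import tensorflow as tf; print(tf.__version__)"
```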

Related

nvidia-smi gives an error inside of a docker container

Sometimes I can't communicate with my Nvidia GPUs inside a docker container when I come back to my workplace from home, even though the previously launched process that uses the GPUs is still running fine. The running process (training a neural network via Pytorch) is not affected by the disconnection, but I cannot launch a new process.
nvidia-smi gives Failed to initialize NVML: Unknown Error, and torch.cuda.is_available() likewise returns False.
I have encountered two different cases:
nvidia-smi works fine on the host machine. In this case, the situation can be solved by restarting the docker container via docker stop $MYCONTAINER followed by docker start $MYCONTAINER on the host machine.
nvidia-smi doesn't work on the host machine, and neither does nvcc --version, throwing Failed to initialize NVML: Driver/library version mismatch and Command 'nvcc' not found, but can be installed with: sudo apt install nvidia-cuda-toolkit errors. The strange part is that the current process still runs fine. In this case, reinstalling the driver or rebooting the machine solves the problem.
However, these solutions require stopping all current processes, which is not an option when I cannot interrupt the running process.
Does anybody have a suggestion for solving this situation?
Many thanks.
(software)
Docker version: 20.10.14, build a224086
OS: Ubuntu 22.04
Nvidia driver version: 510.73.05
CUDA version: 11.6
(hardware)
Supermicro server
Nvidia A5000 * 8
(pic1) nvidia-smi not working inside a docker container, though it worked fine on the host machine.
(pic2) nvidia-smi works after restarting the docker container, which is case 1 mentioned above.
For the problem of Failed to initialize NVML: Unknown Error and having to restart the container, please see this ticket and post your system/package information there as well:
https://github.com/NVIDIA/nvidia-docker/issues/1671
There's a workaround on the ticket, but it would be good to have others post their configuration to help fix the issue.
Downgrading containerd.io to 1.6.6 works as long as you set no-cgroups = true in /etc/nvidia-container-runtime/config.toml and pass the devices to docker run explicitly, e.g. docker run --gpus all --device /dev/nvidia0:/dev/nvidia0 --device /dev/nvidia-modeset:/dev/nvidia-modeset --device /dev/nvidia-uvm:/dev/nvidia-uvm --device /dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools --device /dev/nvidiactl:/dev/nvidiactl --rm -it nvidia/cuda:11.4.2-base-ubuntu18.04 bash
So run sudo apt-get install -y --allow-downgrades containerd.io=1.6.6-1 followed by sudo apt-mark hold containerd.io to prevent the package from being updated. In short: downgrade, edit the config file, and pass all of the /dev/nvidia* devices to docker run.
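Put together, the workaround might look like the following sketch (assuming Debian/Ubuntu with apt; the device list is the one from the ticket and may differ on your machine):

```shell
#!/bin/sh
# Pin containerd.io at 1.6.6 so apt doesn't re-upgrade it.
sudo apt-get install -y --allow-downgrades containerd.io=1.6.6-1
sudo apt-mark hold containerd.io

# Also set the following in /etc/nvidia-container-runtime/config.toml:
#   no-cgroups = true

# Build the --device flags for every NVIDIA device node, then run the container.
devices=""
for d in /dev/nvidia0 /dev/nvidia-modeset /dev/nvidia-uvm \
         /dev/nvidia-uvm-tools /dev/nvidiactl; do
  devices="$devices --device $d:$d"
done
# $devices is intentionally unquoted so it expands into separate arguments.
docker run --gpus all $devices --rm -it nvidia/cuda:11.4.2-base-ubuntu18.04 bash
```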
For the Failed to initialize NVML: Driver/library version mismatch issue, that is caused by the driver being updated while you haven't rebooted yet. If this is a production machine, I would also hold the driver package to stop it from auto-updating as well. You should be able to figure out the package name from something like sudo dpkg --get-selections "*nvidia*"
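Concretely, that hold could look like this (the driver package name below is an example; substitute whichever one the dpkg listing shows on your machine):

```shell
#!/bin/sh
# List installed NVIDIA packages (second column "install" means currently installed).
sudo dpkg --get-selections "*nvidia*" | awk '$2 == "install" { print $1 }'

# Hold the driver package so unattended upgrades cannot create a
# driver/library mismatch before the next planned reboot.
# "nvidia-driver-510" is an example name; use the one printed above.
sudo apt-mark hold nvidia-driver-510
```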

How to let `docker run --gpus all` gracefully fall back to CPU if GPU is not present?

I need to run a docker image on two machines: one with GPU and the other without GPU. In order to use GPU, I currently pass the --gpus all option to docker run as follows (otherwise, GPU is not detected by pytorch code inside the container):
docker run -dt -P --name "ch4-dev" --label "..." -v ".../debugpy:/debugpy:ro,z" --entrypoint "python3" --gpus all "ch4:latest"
The above docker run command is assembled by VS Code python extension. It runs correctly on the machine with the GPU, and torch.cuda.is_available() returns True.
However, on the machine without GPU, the same project generates the following error:
docker: Error response from daemon: could not select device driver ""
with capabilities: [[gpu]]. The terminal process failed to launch
(exit code: 125).
The Question:
Is there a more flexible option than --gpus all that will detect and use a GPU if one exists, and otherwise gracefully fall back to the CPU without an error?
One option is using the nvidia-docker command directly instead of docker.
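Another approach is a small wrapper that probes for a usable GPU before assembling the docker run command. This is a hypothetical sketch, not something VS Code generates; it assumes nvidia-smi being present and working is a good-enough proxy for "docker can use --gpus all":

```shell
#!/bin/sh
# Add --gpus all only when an NVIDIA GPU is actually usable,
# so the same command works on both the GPU and CPU-only machines.
GPU_ARGS=""
if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi -L >/dev/null 2>&1; then
  GPU_ARGS="--gpus all"
fi

# $GPU_ARGS is intentionally unquoted so it expands to zero or two arguments.
docker run -dt -P --name "ch4-dev" --entrypoint python3 $GPU_ARGS ch4:latest
```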

Running docker container on windows with a volume gives me an error

I'm trying to run a simple docker image with a volume
docker run -dit -v J:\dvolumes:/data ubuntu
However I get the following error
docker: Error response from daemon: invalid mode: data.
I'm guessing it thinks the ':' after the J is what separates the volume path, but I don't really know how to make it think otherwise (if that's the case).
Any help would be greatly appreciated
The issue was that I was trying to run the container in WSL2, so I should have been using Linux paths
WSL2
docker run -dit -v /j/dvolumes:/data ubuntu
Windows
docker run -dit -v j:/dvolumes:/data ubuntu

NVIDIA Docker - initialization error: nvml error: driver not loaded

I'm a complete newcomer to Docker, so the following questions might be a bit naive, but I'm stuck and I need help.
I'm trying to reproduce some results in research. The authors just released code along with a specification of how to build a Docker image to reproduce their results. The relevant bit is copied below:
I believe I installed Docker correctly:
$ docker --version
Docker version 19.03.13, build 4484c46d9d
$ sudo docker run hello-world
Hello from Docker!
This message shows that your installation appears to be working correctly.
To generate this message, Docker took the following steps:
1. The Docker client contacted the Docker daemon.
2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
(amd64)
3. The Docker daemon created a new container from that image which runs the
executable that produces the output you are currently reading.
4. The Docker daemon streamed that output to the Docker client, which sent it
to your terminal.
To try something more ambitious, you can run an Ubuntu container with:
$ docker run -it ubuntu bash
Share images, automate workflows, and more with a free Docker ID:
https://hub.docker.com/
For more examples and ideas, visit:
https://docs.docker.com/get-started/
However, when I try checking that my nvidia-docker installation was successful, I get the following error:
$ sudo docker run --gpus all --rm nvidia/cuda:10.1-base nvidia-smi
docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: driver not loaded\\\\n\\\"\"": unknown.
It looks like the key error is:
nvidia-container-cli: initialization error: nvml error: driver not loaded
I don't have a GPU locally and I'm finding conflicting information on whether CUDA needs to be installed before NVIDIA Docker. For instance, this NVIDIA moderator says "A proper nvidia docker plugin installation starts with a proper CUDA install on the base machine."
My questions are the following:
Can I install NVIDIA Docker without having CUDA installed?
If so, what is the source of this error and how do I fix it?
If not, how do I create this Docker image to reproduce the results?
Can I install NVIDIA Docker without having CUDA installed?
Yes, you can. The readme states that nvidia-docker only requires the NVIDIA GPU driver and the Docker engine to be installed:
Note that you do not need to install the CUDA Toolkit on the host system, but the NVIDIA driver needs to be installed
If so, what is the source of this error and how do I fix it?
That's either because you don't have a GPU locally (or it's not an NVIDIA one), or because something went wrong when you installed the drivers. If you have a CUDA-capable GPU, I recommend using the NVIDIA guide to install the drivers. If you don't have a GPU locally, you can still build an image with CUDA and then move it to a machine that has a GPU.
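Building on one machine and moving the image to another could look like the following sketch (the image name is an example, not from the paper's repo):

```shell
#!/bin/sh
# Build the CUDA-based image on the machine without a GPU...
docker build -t repro-experiment:latest .

# ...export it to a compressed archive you can copy anywhere.
docker save repro-experiment:latest | gzip > repro-experiment.tar.gz

# On the GPU machine, load and run it:
#   gunzip -c repro-experiment.tar.gz | docker load
#   docker run --gpus all repro-experiment:latest
```

Alternatively, pushing to a registry with docker push / docker pull avoids copying the archive by hand.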
If not, how do I create this Docker image to reproduce the results?
The problem is that even if you manage to get rid of CUDA in the Docker image, there is software inside it that requires CUDA. In this case fixing the Dockerfile seems unnecessary to me - you can just ignore Docker and fix the code to run it on the CPU instead.
I think you need
ENV NVIDIA_VISIBLE_DEVICES=void
then
RUN your work
and finally
ENV NVIDIA_VISIBLE_DEVICES=all
(As I understand it, setting the variable to void keeps the NVIDIA runtime from injecting GPU support during the build steps, and setting it back to all restores GPU visibility for containers run from the image.)

Cloudera quickstart docker: unable to run/start the container

I am using a Windows 10 machine with Docker for Windows, and pulled the cloudera-quickstart:latest image. While trying to run it, I get the error below.
Can someone please advise?
docker: Error response from daemon: oci runtime error: container_linux.go:262: starting container process caused "exec: \"/usr/bin/docker-quickstart\": stat /usr/bin/docker-quickstart: no such file or directory"
my run command:
docker run --hostname=quickstart.cloudera --privileged=true -t -i cloudera/quickstart /usr/bin/docker-quickstart
The issue was that I had downloaded the docker archive separately and created the image with these commands, which is not supported in cloudera 5.10 and above:
tar xzf cloudera-quickstart-vm-*-docker.tar.gz
docker import - cloudera/quickstart:latest < cloudera-quickstart-vm-*-docker/*.tar
So I finally removed the docker image and then pulled it properly:
docker pull cloudera/quickstart:latest
Now docker is properly up and running.
If you downloaded the CDH v5.13 docker image, then the issue is most likely due to the structure of the image archive; in my case, I found it to be cloudera*.tar.gz > cloudera*.tar > cloudera*.tar! It seems the packaging was done by mistake, and the official documentation doesn't capture this either :( In that case, just perform one more level of extraction to get to the correct cloudera*.tar archive. This post from the cloudera forum helped me.
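Assuming the nesting described above, the extra extraction step might look like this sketch (exact file names depend on the archive version, so the globs are illustrative):

```shell
#!/bin/sh
# The 5.13 archive is nested one level deeper than documented:
# cloudera*.tar.gz contains an outer .tar, which contains the real image .tar.
tar xzf cloudera-quickstart-vm-*-docker.tar.gz   # first level: yields the outer .tar
tar xf  cloudera-quickstart-vm-*-docker.tar      # second level: yields the inner .tar
docker import - cloudera/quickstart:latest < cloudera-quickstart-vm-*-docker/*.tar
```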
