CUDA Version mismatch in Docker with WSL2 backend - docker

I am trying to use Docker (Docker Desktop for Windows 10 Pro) with the WSL2 backend (Windows Subsystem for Linux (WSL), Ubuntu 20.04.4 LTS).
That part seems to be working fine, except I would like to pass my GPU (Nvidia RTX A5000) through to my docker container.
Before I even get that far, I am still trying to set things up. I found a very good tutorial aimed at Ubuntu 18.04, and found all the steps are the same for 20.04, just with some version numbers bumped.
At the end, I can see that my CUDA versions do not match.
The real issue is when I try to run the test command as shown on the docker website:
docker run --rm -it --gpus=all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark
I get this error:
--> docker run --rm -it --gpus=all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark
docker: Error response from daemon: OCI runtime create failed: container_linux.go:380:
starting container process caused: process_linux.go:545: container init caused: Running
hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli:
requirement error: unsatisfied condition: cuda>=11.6, please update your driver to a
newer version, or use an earlier cuda container: unknown.
... and I just don't know what to do, or how I can fix this.
Can someone explain how to get the GPU to pass through to a Docker container successfully?

I had the same issue on Ubuntu when I tried to run the container:
s.evloev@some-pc:~$ docker run --gpus all --rm nvidia/cuda:11.7.0-base-ubuntu18.04
docker: Error response from daemon: failed to create shim: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: requirement error: unsatisfied condition: cuda>=11.7, please update your driver to a newer version, or use an earlier cuda container: unknown.
In my case it occurred when I tried to launch a Docker image whose CUDA version was higher than the one installed on my host.
When I checked the CUDA version installed on my host, I found it was 11.3.
s.evloev@some-pc:~$ nvidia-smi
Thu Jul 21 15:06:33 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01 Driver Version: 465.19.01 CUDA Version: 11.3 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
|... |
+-----------------------------------------------------------------------------+
So when I run an image with the same CUDA version (11.3), it works well:
s.evloev@some-pc:~$ docker run -it --gpus all --rm nvidia/cuda:11.3.0-base-ubuntu18.04 nvidia-smi
Thu Jul 21 12:13:46 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01 Driver Version: 465.19.01 CUDA Version: 11.3 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:65:00.0 Off | N/A |
| 0% 44C P8 7W / 180W | 1404MiB / 8110MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
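In other words, the container's CUDA version must not exceed the "CUDA Version" that nvidia-smi reports for the host driver. A minimal way to check this before picking an image tag (the 11.3.0 tag is just the one matching the driver above; substitute whatever matches yours):
# Highest CUDA version supported by the installed driver
nvidia-smi | grep "CUDA Version"
# Then pick an image whose CUDA version is <= that value
docker run -it --gpus all --rm nvidia/cuda:11.3.0-base-ubuntu18.04 nvidia-smi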

The comment from @RobertCrovella resolved this:
"please update your driver to a newer version": when using WSL, the driver in your WSL setup is not something you install in WSL; it is provided by the driver on the Windows side. Your WSL driver is 472.84 and this is too old to work with CUDA 11.6 (it only supports up to CUDA 11.4). So you would need to update your Windows-side driver to the latest one possible for your GPU if you want to run a CUDA 11.6 test case. Regarding the "mismatch" of CUDA versions, this provides general background material for interpretation.
Downloading the most current Nvidia driver:
Version: R510 U3 (511.79) WHQL
Release Date: 2022.2.14
Operating System: Windows 10 64-bit, Windows 11
Language: English (US)
File Size: 640.19 MB
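After updating the Windows-side driver, a quick way to confirm that WSL picked it up (the driver inside WSL is provided by Windows, so there is nothing to install in the WSL distribution itself) is to re-run nvidia-smi from the WSL shell:
# Run inside the WSL2 distribution, not inside a container
nvidia-smi
# The "Driver Version" and "CUDA Version" shown here mirror the Windows driver
# and should now report CUDA 11.6 or newer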
Now I am able to support CUDA 11.6, and the test from the docker documentation now works:
--> docker run --rm -it --gpus=all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark
Run "nbody -benchmark [-numbodies=<numBodies>]" to measure performance.
-fullscreen (run n-body simulation in fullscreen mode)
-fp64 (use double precision floating point values for simulation)
-hostmem (stores simulation data in host memory)
-benchmark (run benchmark to measure performance)
-numbodies=<N> (number of bodies (>= 1) to run in simulation)
-device=<d> (where d=0,1,2.... for the CUDA device to use)
-numdevices=<i> (where i=(number of CUDA devices > 0) to use for simulation)
-compare (compares simulation results running once on the default GPU and once on the CPU)
-cpu (run n-body simulation on the CPU)
-tipsy=<file.bin> (load a tipsy model file for simulation)
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
GPU Device 0: "Ampere" with compute capability 8.6
> Compute 8.6 CUDA device: [NVIDIA RTX A5000]
65536 bodies, total time for 10 iterations: 58.655 ms
= 732.246 billion interactions per second
= 14644.916 single-precision GFLOP/s at 20 flops per interaction
Thank you for the quick response!

Related

Yolo v7 not detecting objects on image

I'm trying YOLOv7; it seems to be working, but the resulting image has no object detections drawn on it, while it should.
I followed the GitHub instructions on how to set up YOLOv7 on Docker; here are the full commands, so you should be able to reproduce my problem.
git clone https://github.com/WongKinYiu/yolov7
cd yolov7
nvidia-docker run --name yolov7 -it --rm -v "$CWD":/yolov7 --shm-size=64g nvcr.io/nvidia/pytorch:21.08-py3
# inside the container
cd /yolov7
python -m pip install virtualenv
python -m virtualenv venv3
. venv3/bin/activate
pip install -r requirements.txt
apt update
apt install -y zip htop screen libgl1-mesa-glx
pip install seaborn thop
python detect.py --weights yolov7.pt --conf 0.25 --img-size 640 --source inference/images/horses.jpg
And this is the console output of the last command:
# python detect.py --weights yolov7.pt --conf 0.25 --img-size 640 --source inference/images/horses.jpg
Namespace(agnostic_nms=False, augment=False, classes=None, conf_thres=0.25, device='', exist_ok=False, img_size=640, iou_thres=0.45, name='exp', no_trace=False, nosave=False, project='runs/detect', save_conf=False, save_txt=False, source='inference/images/horses.jpg', update=False, view_img=False, weights=['yolov7.pt'])
YOLOR 🚀 v0.1-115-g072f76c torch 1.13.0+cu117 CUDA:0 (NVIDIA GeForce GTX 1650, 3903.875MB)
Fusing layers...
RepConv.fuse_repvgg_block
RepConv.fuse_repvgg_block
RepConv.fuse_repvgg_block
Model Summary: 306 layers, 36905341 parameters, 6652669 gradients
Convert model to Traced-model...
traced_script_module saved!
model is traced!
/yolov7/venv3/lib/python3.8/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3190.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
Done. (150.9ms) Inference, (0.3ms) NMS
The image with the result is saved in: runs/detect/exp6/horses.jpg
Done. (0.616s)
Now, I should be able to see the detections on the generated image runs/detect/exp6/horses.jpg, from the original image inference/images/horses.jpg, right? But the two images look the same, with no difference. What's wrong with the setup?
Nvidia driver:
$ nvidia-smi
Tue Dec 6 09:47:03 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.11 Driver Version: 525.60.11 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:01:00.0 Off | N/A |
| 45% 27C P8 N/A / 75W | 13MiB / 4096MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1152 G /usr/lib/xorg/Xorg 9MiB |
| 0 N/A N/A 1256 G /usr/bin/gnome-shell 2MiB |
+-----------------------------------------------------------------------------+
Ubuntu version:
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.4 LTS
Release: 20.04
Codename: focal
Modify the variable half in the detect.py file to False:
Line 31: half = False
When using the GPU, the program runs the model in half precision only; this is what is written on line 31 of detect.py (# half precision only supported on CUDA).
Changing it to False:
Line 31: half = False
does the magic for me.
I came across the same issue too. It is basically what the others mentioned; some explanation is added below.
The reason is line 31: half = device.type != 'cpu'  # half precision only supported on CUDA.
Not all GPUs, not even all NVIDIA GPUs with CUDA, support half-precision (16-bit) floats, especially if your GPU is a bit older. In my case I was using an AMD 5700 XT (via ROCm), and this GPU also has no fp16 support!
To make it configurable, I added a command-line argument that lets the user override the variable half mentioned in the other answers:
# After line 31~, check whether a command-line override is present.
# device = select_device(opt.device)
half = opt.fp16 and device.type != 'cpu'  # half precision only supported on CUDA
# After the line (~169) with `parser = argparse.ArgumentParser()`.
# Using action="store_true" (rather than type=bool) so a bare --fp16 enables it.
parser.add_argument("--fp16", action="store_true", help="Use float16 (some GPUs only)")

Hardware Acceleration for Headless Chrome on WSL2 Docker

My goal is to server-side render a frame of a WebGL2 application in a performant way.
To do so, I'm using Puppeteer with headless Chrome inside a Docker container running Ubuntu 20.04.4 on WSL2 on Windows 10 22H2.
However, I can't get any hardware acceleration to become active.
It seems that Chrome doesn't detect my Nvidia GTX1080 card, while on the host system with headless: false it's being used and increases render performance drastically.
I have followed the tutorial here to install the CUDA toolkit on WSL2.
A container spun up using sudo docker run --gpus all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark uses the GTX1080.
Running nvidia-smi inside the container shows the following:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.65 Driver Version: 527.37 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:01:00.0 On | N/A |
| N/A 49C P8 10W / N/A | 871MiB / 8192MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
Using flags like
--ignore-gpu-blocklist
--enable-gpu-rasterization
--enable-zero-copy
--use-gl=desktop / --use-gl=egl
did not solve the problem.
I also tried libosmesa6 with --use-gl=osmesa which had no effect.
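For completeness, those switches can also be tried with a direct headless Chrome invocation outside Puppeteer (a sketch; the binary name, URL, and output path are placeholders):
google-chrome --headless --ignore-gpu-blocklist --enable-gpu-rasterization \
  --enable-zero-copy --use-gl=egl \
  --print-to-pdf=/tmp/out.pdf https://example.com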
I'm starting my container with docker-compose including the following section into the service's block:
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: 1
          capabilities: [gpu]
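For a quick sanity check outside Compose, the reservation above corresponds to the --gpus option on a plain docker run (image tag is illustrative; any CUDA-enabled image will do):
docker run --rm --gpus 1 nvidia/cuda:11.7.1-base-ubuntu20.04 nvidia-smi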
Chrome version is HeadlessChrome/108.0.5351.0.
The Chrome WebGL context's WEBGL_debug_renderer_info tells me
vendor: Google Inc. (Google)
renderer: ANGLE (Google, Vulkan 1.3.0 (SwiftShader Device (Subzero) (0x0000C0DE)), SwiftShader driver)
Visiting chrome://gpu with Puppeteer shows me the following:
[screenshot of chrome://gpu output]
Any ideas what might still be missing to have Chrome use the GPU for hardware acceleration?

Running container fails with failed to add the host cannot allocate memory

OS: Red Hat Enterprise Linux release 8.7 (Ootpa)
Version:
$ sudo yum list installed | grep docker
containerd.io.x86_64 1.6.9-3.1.el8 @docker-ce-stable
docker-ce.x86_64 3:20.10.21-3.el8 @docker-ce-stable
docker-ce-cli.x86_64 1:20.10.21-3.el8 @docker-ce-stable
docker-ce-rootless-extras.x86_64 20.10.21-3.el8 @docker-ce-stable
docker-scan-plugin.x86_64 0.21.0-3.el8 @docker-ce-stable
Out of hundreds of Docker calls made over days, a few of them fail. This is the schema of the command line:
/usr/bin/docker run \
  -u 1771:1771 \
  -a stdout \
  -a stderr \
  -v /my_path:/data \
  --rm \
  my_image:latest \
  my_entry \
  --my_args
The failure:
docker: Error response from daemon: failed to create endpoint recursing_aryabhata on network bridge: failed to add the host (veth6ad97f8) <=> sandbox (veth23b66ce) pair interfaces: cannot allocate memory.
It is not reproducible; the failure rate is less than one percent. At the time this error happens, the system has lots of free memory. Around the time of the failure, the application is making around 5 Docker calls per second, and each call takes about 5 to 10 seconds to complete.
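The error comes from creating the veth pair for the default bridge network, so a transient kernel-side allocation failure is one thing worth ruling out. A diagnostic sketch (assuming standard procfs paths; this only gathers evidence and is not a fix), to be run right after one of the sporadic failures:
dmesg | tail -n 50                                   # look for page allocation failures
grep -E 'MemFree|Slab|Committed_AS|CommitLimit' /proc/meminfo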

nvidia driver support on ubuntu in docker from host windows - 'Found no NVIDIA driver on your system' error

I have built a Docker image: Ubuntu 20.04 + Python 3.8 + torch, various libs (llvmlite, cuda, pyyaml, etc.) + my Flask app. The app uses torch and needs the NVIDIA driver inside the container. The host machine is Windows 10 x64.
When running the container and testing it with Postman, this error appeared:
<head>
<title>AssertionError:
Found no NVIDIA driver on your system. Please check that you
have an NVIDIA GPU and installed a driver from
http://www.nvidia.com/Download/index.aspx // Werkzeug Debugger</title>
On my machine nvidia-smi is:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 442.92 Driver Version: 442.92 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 166... WDDM | 00000000:01:00.0 Off | N/A |
| N/A 40C P8 3W / N/A | 382MiB / 6144MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 6212 C+G ...ta\Local\Postman\app-7.31.0\Postman.exe N/A |
| 0 6752 C+G ...are\Brave-Browser\Application\brave.exe N/A |
+-----------------------------------------------------------------------------+
This has been asked many times on SO, and the traditional answers are that NVIDIA can't support GPU acceleration from Windows in a Linux Docker container.
I found similar answers. I have read the question and answers to this question, but those solutions involve an Ubuntu host plus a Docker image with Ubuntu inside.
This link explains how to use nvidia-docker2, but nvidia-docker2 is deprecated according to this answer.
The official nvidia-docker repo has instructions, but for a Linux host only.
But there is also Docker's WSL (Linux backend) support installed; can it be used?
Is there still a way to make an Ubuntu container use the NVIDIA GPU from a Windows host machine?
It looks like you can now run Docker in Ubuntu with the Windows Subsystem for Linux (WSL 2) and do GPU passthrough.
This link goes through installation, setup and running a TensorFlow Jupyter notebook with GPU support:
https://ubuntu.com/blog/getting-started-with-cuda-on-ubuntu-on-wsl-2
Note - I haven't done this myself yet.
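Once WSL 2, the Windows NVIDIA driver, and the NVIDIA Container Toolkit are set up per that guide, the same smoke test used at the top of this page should confirm the passthrough:
docker run --rm -it --gpus=all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark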

GPU becomes unavailable when computer goes to sleep

I am using the Docker installation of TensorFlow.
I initiate the container using
nvidia-docker run -it -p 8888:8888 -v /*/Data/docker:/docker --name TensorFlow gcr.io/tensorflow/tensorflow:latest-gpu /bin/bash
This allows me to link a folder named "docker" on my secondary local drive with a folder inside the Docker container.
The issue is that whenever my computer (Ubuntu, GTX 1070, 6700K Intel CPU) goes to sleep, the GPU becomes unavailable and code runs only on the CPU. When I run the code in an IPython notebook session inside Docker I get:
failed call to cuInit: CUDA_ERROR_UNKNOWN.
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcurand.so locally
E tensorflow/stream_executor/cuda/cuda_driver.cc:491] failed call to cuInit: CUDA_ERROR_UNKNOWN
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:153] retrieving CUDA diagnostic information for host: 123456c234ds
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:160] hostname: 123456c234ds
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:185] libcuda reported version is: 367.57.0
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:356] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module 367.57 Mon Oct 3 20:37:01 PDT 2016
GCC version: gcc version 4.9.3 (Ubuntu 4.9.3-13ubuntu2)
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] kernel reported version is: 367.57.0
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:293] kernel version seems to match DSO: 367.57.0
When I restart the computer, the GPU becomes available again, without the UNKNOWN message.
I have searched the Internet, and solutions such as sudo apt-get install nvidia-modprobe do not solve the issue.
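One workaround that is commonly suggested for cuInit: CUDA_ERROR_UNKNOWN after suspend (not verified on this exact setup) is to reload the nvidia_uvm kernel module on the host instead of rebooting:
# on the host, after resume and before restarting the container
sudo rmmod nvidia_uvm
sudo modprobe nvidia_uvm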
