GPU becomes unavailable when computer goes to sleep - docker

I am using the Docker installation of TensorFlow.
I initiate the container using
nvidia-docker run -it -p 8888:8888 -v /*/Data/docker:/docker --name TensorFlow gcr.io/tensorflow/tensorflow:latest-gpu /bin/bash
This allows me to link a folder named "docker" on my secondary local drive with a folder inside the Docker container.
The issue is that whenever my computer (Ubuntu, GTX 1070, Intel 6700K CPU) goes to sleep, the GPU becomes unavailable and code runs only on the CPU. When I run the code in an IPython notebook session inside Docker I get:
failed call to cuInit: CUDA_ERROR_UNKNOWN.
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcurand.so locally
E tensorflow/stream_executor/cuda/cuda_driver.cc:491] failed call to cuInit: CUDA_ERROR_UNKNOWN
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:153] retrieving CUDA diagnostic information for host: 123456c234ds
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:160] hostname: 123456c234ds
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:185] libcuda reported version is: 367.57.0
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:356] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module 367.57 Mon Oct 3 20:37:01 PDT 2016
GCC version: gcc version 4.9.3 (Ubuntu 4.9.3-13ubuntu2)
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] kernel reported version is: 367.57.0
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:293] kernel version seems to match DSO: 367.57.0
When I restart the computer, the GPU becomes available again without the UNKNOWN message.
I have searched the Internet, and suggested solutions such as sudo apt-get install nvidia-modprobe do not solve the issue.
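To tell quickly whether a run fell back to the CPU, the log can be scanned for the cuInit failure shown above. The following is a minimal sketch; the function name cuinit_error is an illustrative assumption, and it only matches the log wording quoted in this question.

```shell
# Sketch (assumed helper name): read log text on stdin and print the CUDA
# error code from the first "failed call to cuInit" line, if there is one.
cuinit_error() {
  sed -n 's/.*failed call to cuInit: \([A-Z_]*\).*/\1/p' | head -n 1
}

# Example using the error line from the log in the question.
printf '%s\n' 'E tensorflow/stream_executor/cuda/cuda_driver.cc:491] failed call to cuInit: CUDA_ERROR_UNKNOWN' | cuinit_error
# prints: CUDA_ERROR_UNKNOWN
```

An empty result means no cuInit failure was found in the text that was piped in.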

Related

cannot run gazebo9 on docker with privileged on ubuntu 18.04

I have been stuck on this for quite a while now. I have tried searching and experimenting, but I am getting nowhere.
My setup is as follows:
Host
linux Distro: Archlinux
kernel version: 5.14.2
docker version: 20.10.8, build 3967b7d28e
nvidia driver version: 470.63.01-1
nvidia container toolkit version: 1.5.0-2 , cgroups disabled.
amd gpu driver: xf86-video-amdgpu 21.0.0-1
Container
base image: ubuntu:18.04
command line : docker run -it --rm --privileged --gpus all -e DISPLAY=$DISPLAY -e XAUTHORITY=~/.Xauthority --network host --volume /tmp/.X11-unix/:/tmp/.X11-unix --volume $XAUTHORITY:/root/.Xauthority gazebo:libgazebo9-bionic gazebo
Expected results
expected gazebo window to open with hardware acceleration, using privileged access.
Actual results
On using --privileged:
si_init_perfcounters: max_sh_per_se = 2 not supported (inaccurate performance counters)
X Error of failed request: BadAlloc (insufficient resources for operation)
Major opcode of failed request: 149 ()
Minor opcode of failed request: 2
Serial number of failed request: 35
Current serial number in output stream: 36
Without --privileged and specifying graphic cards in --device manually:
gazebo window opens up with hardware acceleration and works smoothly as expected.
Detailed description
I was trying to run Gazebo version 9 in a custom image that I had created using ubuntu:18.04 as the base image. I referred to gazebo:libgazebo9-bionic, nvidia/cuda:11.4.1-cudnn8-devel-ubuntu18.04 and ros:melodic-desktop while writing the Dockerfile. I even tried the same thing for Gazebo 11 on the same base image and got the same issue as above, whereas the exact same setup for ubuntu foxy works smoothly. I really need to use --privileged because I am going to be working on hardware for a long time. Please help me figure out how this should be fixed. Thanks a lot.
P.S. Other GUI applications (rviz, moveit, etc.) are running without any issues. I'm getting this issue with Gazebo only.
Ok found the solution!
Gazebo was working on osrf/ros:noetic-desktop-full but not on osrf/ros:melodic-desktop-full.
I got the exact same error:
X Error of failed request: BadAlloc (insufficient resources for operation)
Major opcode of failed request: 149 ()
The solution was to update the Mesa drivers on the ros:melodic image from Mesa 20.0.8 to Mesa 22.0.2.
sudo add-apt-repository ppa:kisak/kisak-mesa -y
sudo apt update
sudo apt upgrade -y
If you want to check your current Mesa version:
sudo apt install mesa-utils
glxinfo | grep Mesa
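To confirm that the upgrade took effect, the version number can be pulled out of the glxinfo output rather than read by eye. This is a small sketch; the helper name mesa_version is an assumption, and it relies on the version appearing as "Mesa X.Y.Z" in the output, as in the version numbers quoted above.

```shell
# Sketch (assumed helper name): read glxinfo-style text on stdin and print
# the first Mesa version number found, for example 22.0.2.
mesa_version() {
  grep -oE 'Mesa [0-9]+(\.[0-9]+)+' | head -n 1 | cut -d' ' -f2
}

# Usage inside the container, once mesa-utils is installed:
#   glxinfo | mesa_version
```

Running it before and after the upgrade shows whether the PPA packages were actually picked up.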

Dockerized nmap shows incorrect OS versions

I've noticed that when Nmap is dockerized it is yielding incorrect OS results. I've tried various pre-built docker images as well as one I created myself and they all show the same results.
Here are a few of the pre-built images I've tried:
https://hub.docker.com/r/instrumentisto/nmap
https://hub.docker.com/r/uzyexe/nmap/
I've run the same Nmap command with these images and using my locally installed Nmap version and here are the results (all images are using Nmap 7.80):
$ nmap -sV -O 192.168.1.1
------(locally installed nmap result - correct):
OS CPE: cpe:/o:linux:linux_kernel:2.6
OS details: Linux 2.6.8 - 2.6.30
Network Distance: 1 hop
Service Info: OS: Linux; Device: broadband router; CPE: cpe:/o:linux:linux_kernel
------(all docker image nmap results - incorrect):
OS CPE: cpe:/h:hp:jetdirect_170x cpe:/h:hp:inkjet_3000
Aggressive OS guesses: HP 170X print server or Inkjet 3000 printer (85%), HP LaserJet 4000 printer (85%), HP LaserJet 4250 printer (85%)
No exact OS matches for host (test conditions non-ideal).
Service Info: OS: Linux; Device: broadband router; CPE: cpe:/o:linux:linux_kernel
What's interesting to me is that the Service Info is actually correct across the scans, but nothing else is.
I'm trying to figure out if there is a setting/flag that I'm missing when executing the docker command. Here's what I've tried:
Setting the docker network to host (no change in result)
Setting the docker network to bridge (no change in result)
Not setting any network setting (no change in result)
I really need to get Nmap working in a docker container because it's integrated into a rails web app that I'm building utilizing the ruby-nmap gem.
Thanks!

Docker Container nvidia/k8s-device-plugin:1.9 Keeps Reporting Error

I am trying to set up a small Kubernetes cluster on my Ubuntu 18.04 LTS server. Every step is done, but checking the GPU status fails. The container keeps reporting errors:
1. Issue Description
I followed the Quick Start steps, but when I run the test case, it reports an error.
2. Steps to reproduce the issue
Run the shell command:
docker run --security-opt=no-new-privileges --cap-drop=ALL \
  --network=none -it -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins \
  nvidia/k8s-device-plugin:1.9
Check the errors:
2020/02/09 00:20:15 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
2020/02/09 00:20:15 Could not register device plugin: rpc error: code = Unimplemented desc = unknown service deviceplugin.Registration
2020/02/09 00:20:15 Could not contact Kubelet, retrying. Did you enable the device plugin feature gate?
2020/02/09 00:20:15 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2020/02/09 00:20:15 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
3. Environment Information
- outputs of nvidia-docker run --rm dlws/cuda nvidia-smi
NVIDIA-SMI 440.48.02 Driver Version: 440.48.02 CUDA Version: 10.2
contents of /etc/docker/daemon.json:
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
docker version: 19.03.2
kubernetes version: 1.15.2
I finally found the answer, and I hope this post helps others who encounter the same issue:
For Kubernetes 1.15, use k8s-device-plugin:1.11 instead. Version 1.9 is not able to communicate with the kubelet.
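As a sketch of what the fix looks like in practice, the command from the question can be regenerated with only the image tag changed. The function name device_plugin_cmd and its default argument are illustrative assumptions; the flags are copied from the question and the 1.11 tag comes from the answer above.

```shell
# Sketch (assumed helper name): print the device-plugin command from the
# question with a chosen image tag. Nothing is executed here.
device_plugin_cmd() {
  local tag="${1:-1.11}"   # 1.11 is the tag reported to work with Kubernetes 1.15
  printf '%s\n' "docker run --security-opt=no-new-privileges --cap-drop=ALL --network=none -it -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins nvidia/k8s-device-plugin:${tag}"
}

# Print the command for review, then copy it into a terminal to run it.
device_plugin_cmd 1.11
```

Printing rather than running keeps the sketch safe to try on a machine without a kubelet.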

Cannot install docker on OS X Version 10.9.5

I first tried installing VirtualBox by downloading "VirtualBox 5.0 for OS X hosts (amd64)" from the VirtualBox download page, and then installed boot2docker and docker via brew.
The first apparent issue appeared when creating the boot2docker-vm image:
$ boot2docker init
2015/07/27 21:38:13 Creating VM boot2docker-vm...
2015/07/27 21:38:13 Apply interim patch to VM boot2docker-vm (https://www.virtualbox.org/ticket/12748)
2015/07/27 21:38:13 Failed to modify VM "boot2docker-vm": exit status 1
Launching the VirtualBox manager application I can see the boot2docker-vm machine "Running", but looking at the log I see something like this which appears to indicate that the boot2docker-vm "machine" failed to boot:
00:00:04.169546 Guest Log: BIOS: Boot : bseqnr=1, bootseq=4231
00:00:04.169711 Guest Log: BIOS: Boot from Floppy 0 failed
00:00:04.170101 Guest Log: BIOS: Boot : bseqnr=2, bootseq=0423
00:00:04.170490 Guest Log: BIOS: CDROM boot failure code : 0002
00:00:04.170800 Guest Log: BIOS: Boot from CD-ROM failed
00:00:04.171190 Guest Log: BIOS: Boot : bseqnr=3, bootseq=0042
00:00:04.171795 Guest Log: int13_harddisk: function 02, unmapped device for ELDL=80
00:00:04.172304 Guest Log: BIOS: Boot from Hard Disk 0 failed
00:00:04.172706 Guest Log: BIOS: Boot : bseqnr=4, bootseq=0004
00:00:04.172991 Guest Log: BIOS: Booting from LAN...
00:00:04.191271 Display::handleDisplayResize(): uScreenId = 0, pvVRAM=0000000000000000 w=720 h=400 bpp=0 cbLine=0x0, flags=0x1
00:00:06.446949 Guest Log: BIOS: Boot from LAN failed
00:00:06.448852 Guest Log: Could not read from the boot medium! System halted.
I uninstalled everything and then tried downloading and installing from boot2docker download page, which installs VirtualBox, boot2docker, and docker all in one go. But I still see the same problem indicated above (the boot2docker-vm machine fails to boot).
I'm reluctant to make big changes to the OS X version on my laptop, since my hardware is old. But I'll try the installation sequence on a more modern machine and see if it works there.
Has anyone managed to make docker work on OS X Version 10.9.5?
EDIT (adding additional information which comments suggest might be relevant):
My machine has:
2.26GHz Intel Core 2 Duo
4 GB of RAM (1067 MHz DDR3)
NVIDIA GeForce 9400M 256 MB
OS X 10.9.5
I installed everything as the primary User (not root) on my system.
And the versions of everything which I installed are:
VirtualBox 4.3.30 r101610
boot2docker version 1.7.1
docker version 1.7.1
This issue on GitHub might be of help (in the issue described, the latest 4.3.x version of VirtualBox works fine). However, I would suggest using docker-machine instead. Below are the installation steps:
$ docker-machine create --driver virtualbox dev
$ eval "$(docker-machine env dev)"
And then you can use docker commands as usual.
Some of the comments in the GitHub issue suggested by nash_ag and this Stack Overflow question pointed me in the right direction.
This is the sequence of steps I used to get VirtualBox, boot2docker, docker, and docker-machine working in my environment (where $USERNAME is my primary OS X User, who installed VirtualBox), with several wrong turns elided, and most output omitted:
$ rm -rf /Users/$USERNAME/VirtualBox\ VMs/
$ boot2docker delete
(ran VirtualBox Uninstall script from my desktop)
...
$ brew tap caskroom/cask
...
$ brew update
...
$ brew install brew-cask
...
$ brew cask install virtualbox
...
$ VBoxManage -v
5.0.0r101573
$ boot2docker -v
Boot2Docker-cli version: v1.7.1
Git commit: 8fdc6f5
$ VBoxManage list vms
(nothing)
$ boot2docker init -v
...
$ boot2docker up
...
$ eval "$(boot2docker shellinit)"
(writes .pem files)
$ brew install docker-machine
...
$ docker-machine -v
docker-machine version 0.3.1 (HEAD)
$ docker-machine ls
NAME ACTIVE DRIVER STATE URL SWARM
$ docker-machine create --driver virtualbox dev
...
$ docker-machine ls
NAME ACTIVE DRIVER STATE URL SWARM
dev virtualbox Running tcp://192.168.99.100:2376
$ VBoxManage list vms
"boot2docker-vm" {99d5c5c1-e7cc-49bf-93c7-b0cbf626d62c}
"dev" {341fd11e-86f9-46ca-89e6-39ee78458a4b}
$ eval "$(docker-machine env dev)"
$ docker run -d -p 8000:80 nginx
...
$ curl $(docker-machine ip dev):8000
<!DOCTYPE html>
...
At this point things appear to be working well enough for me to use the "standard" docs/instructions for docker and docker-machine, so my original problem is solved.

install local docker registry on centos 7

I am trying to install a local docker.io registry on a CentOS 7 machine following the instructions here:
https://github.com/docker/docker-registry#quick-start
I ran (EDITED, just to show docker is running):
# service docker restart && cd && docker run -p 5000:5000 registry
After a few minutes looking at the prompt, I got a bunch of errors like this:
[...]
OSError: [Errno 2] No such file or directory: './registry._setup_database.lock'
[2015-03-06 16:39:11 +0000] [13] [INFO] Worker exiting (pid: 13)
[2015-03-06 16:39:11 +0000] [14] [INFO] Worker exiting (pid: 14)
Traceback (most recent call last):
File "/usr/local/bin/gunicorn", line 11, in <module>
sys.exit(run())
File "/usr/local/lib/python2.7/dist-packages/gunicorn/app/wsgiapp.py", line 74, in run
WSGIApplication("%(prog)s [OPTIONS] [APP_MODULE]").run()
File "/usr/local/lib/python2.7/dist-packages/gunicorn/app/base.py", line 185, in run
super(Application, self).run()
File "/usr/local/lib/python2.7/dist-packages/gunicorn/app/base.py", line 71, in run
Arbiter(self).run()
File "/usr/local/lib/python2.7/dist-packages/gunicorn/arbiter.py", line 196, in run
self.halt(reason=inst.reason, exit_status=inst.exit_status)
File "/usr/local/lib/python2.7/dist-packages/gunicorn/arbiter.py", line 292, in halt
self.stop()
File "/usr/local/lib/python2.7/dist-packages/gunicorn/arbiter.py", line 343, in stop
time.sleep(0.1)
File "/usr/local/lib/python2.7/dist-packages/gunicorn/arbiter.py", line 209, in handle_chld
self.reap_workers()
File "/usr/local/lib/python2.7/dist-packages/gunicorn/arbiter.py", line 459, in reap_workers
raise HaltServer(reason, self.WORKER_BOOT_ERROR)
gunicorn.errors.HaltServer: <HaltServer 'Worker failed to boot.' 3>
EDITED:
Details of the system:
docker --version
Docker version 1.3.2, build 39fa2fa/1.3.2
System:
cat /etc/centos-release
CentOS Linux release 7.0.1406 (Core)
uname -a
Linux denis1 3.10.0-123.20.1.el7.x86_64 #1 SMP Thu Jan 29 18:05:33 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
Any ideas what I may be doing wrong?
Where should this file be? './registry._setup_database.lock'
EDITED2:
If I try it on my Ubuntu 14.10 laptop, where I installed a new version of docker via a ppa, then it works:
# Upgraded to docker 1.5 via a ppa package in my Ubuntu laptop:
sudo add-apt-repository ppa:docker-maint/testing
sudo apt-get update
sudo apt-get install docker.io
# pull registry latest
sudo docker pull registry:latest
latest: Pulling from registry
fa4fd76b09ce: Downloading 6.931 MB/197.2 MB 3m10s
1c8294cc5160: Download complete
117ee323aaa9: Download complete
fa4fd76b09ce: Pull complete
fa4fd76b09ce: Download complete
1c8294cc5160: Download complete
117ee323aaa9: Download complete
2d24f826cb16: Download complete
777c3edddace: Download complete
f06997673ad7: Download complete
7eafad9a1f16: Download complete
daa8104aee86: Download complete
418dcd975ba2: Download complete
30bff528d188: Download complete
a4f468439f7f: Download complete
e5a8e33139de: Download complete
024a18254446: Download complete
a68f5599e08a: Download complete
511136ea3c5a: Download complete
Status: Downloaded newer image for registry:latest
Any ideas what should I do to get the same result on my CentOS server?
Is there a more recent docker client I can get for CentOS 6 via yum install?
Disable SELinux and firewalld.
SELinux is preventing the execution of certain commands under sudo, which somehow interferes with the registry.
Check firewalld as well.
Neither upgrading the Docker version nor using the "latest" tag will solve your problem; I tried them all. It has to do with SELinux and/or firewalld, so disable both if you can.
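One way to test this theory without making permanent changes is to relax both services only until the next reboot. The sketch below is an assumption about how one might do that, not the answerer's exact steps: setenforce 0 switches SELinux to permissive mode and systemctl stop firewalld stops the firewall service. The function name and the --apply flag are hypothetical.

```shell
# Sketch (assumed helper name and flag): temporarily relax SELinux and stop
# firewalld. Without "--apply" the commands are only printed; with it they
# are executed and must be run as root.
relax_host_security() {
  local run="echo"
  if [ "${1:-}" = "--apply" ]; then
    run=""
  fi
  $run setenforce 0                # SELinux becomes permissive until reboot
  $run systemctl stop firewalld    # firewalld stays stopped until started again
}

# Dry run: shows what would be executed.
relax_host_security
```

If the registry starts cleanly afterwards, the cause is narrowed down and a targeted SELinux or firewall rule can replace the blanket change.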
I upgraded Docker to version 1.5.0 by adding this to yum.repos.d:
[virt7-testing]
name=virt7-testing
baseurl=http://cbs.centos.org/repos/virt7-testing/x86_64/os/
enabled=1
gpgcheck=0
I feel like we can summarize things like this:
The documentation for 'registry' says it is officially supported only on Docker 1.5.0. You're running 1.3, and there's a pretty big jump between these.
Supported Docker versions
This image is officially supported on Docker version 1.5.0.
Support for older versions (down to 1.0) is provided on a best-effort basis.
On CentOS 6.5, you can yum install docker from EPEL (Docker instructions) and get Docker 1.5. Earlier than 6.5, you must yum install docker-io and it appears that 1.4 is the latest available version from EPEL.
In my experience, Docker's support on RedHat-family systems has been poorer than on Debian-family systems, but this gap has closed in the most recent versions of RH and Docker.
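The version comparison above can be checked mechanically. This is a minimal sketch under the assumption that sort supports -V (version sort), as GNU coreutils does; both helper names are illustrative, and the sample string is the docker --version output from the question.

```shell
# Sketch (assumed helper names): compare an installed Docker version with
# the 1.5.0 that the registry image officially supports.
docker_version_number() {
  # Extracts "1.3.2" from text such as "Docker version 1.3.2, build ...".
  sed -nE 's/^Docker version ([0-9]+(\.[0-9]+)*).*/\1/p'
}

version_at_least() {
  # Succeeds when version $1 is greater than or equal to version $2.
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

found="$(printf '%s\n' 'Docker version 1.3.2, build 39fa2fa/1.3.2' | docker_version_number)"
if version_at_least "$found" 1.5.0; then
  echo "Docker $found meets the officially supported version"
else
  echo "Docker $found is older than 1.5.0"
fi
# prints: Docker 1.3.2 is older than 1.5.0
```

On a real host, replace the sample string with the output of docker --version.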
