"docker:19.03-dind" could not select device driver "nvidia" with capabilities: [[gpu]] - docker

I ran into a K8s + DinD issue:
launch a Kubernetes cluster
start a main Docker image and a DinD image inside this cluster
when running a job that requests a GPU, I get the error could not select device driver "nvidia" with capabilities: [[gpu]]
Full error:
http://localhost:2375/v1.40/containers/long-hash-string/start: Internal Server Error ("could not select device driver "nvidia" with capabilities: [[gpu]]")
When I exec into the DinD container inside the K8s pod, nvidia-smi is not available.
After some debugging, it seems to be because the DinD image is missing the NVIDIA Docker toolkit. I had the same error when I ran the same job directly with Docker on my local laptop, and I fixed it there by installing nvidia-docker2: sudo apt-get install -y nvidia-docker2.
I'm thinking maybe I can install nvidia-docker2 into the DinD 19.03 image (docker:19.03-dind), but I'm not sure how to do it. With a multi-stage Docker build?
Thank you very much!
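For reference, what fixed the same error on a plain (non-DinD) Docker host was roughly the following; the CUDA image tag in the test command is only illustrative:
sudo apt-get update
sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker
docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi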
update:
pod spec:
spec:
  containers:
  - name: dind-daemon
    image: docker:19.03-dind

I got it working myself.
Referring to
https://github.com/NVIDIA/nvidia-docker/issues/375
https://github.com/Henderake/dind-nvidia-docker
First, I modified the ubuntu-dind image (https://github.com/billyteves/ubuntu-dind) to install nvidia-docker (i.e. I added the instructions from the nvidia-docker site to the Dockerfile) and changed it to be based on nvidia/cuda:9.2-runtime-ubuntu16.04.
Then I created a pod with two containers: a frontend ubuntu container and a privileged docker daemon container as a sidecar, sketched below. The sidecar's image is the modified one mentioned above.
But since that post is three years old by now, I did spend quite some time matching up dependency versions, dealing with repository migrations over those three years, etc.
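A minimal sketch of that two-container pod, assuming the sidecar image is the one built from the Dockerfile below (pushed as brandsight/dind:nvidia-docker); the pod/container names and the frontend image are illustrative, and whether the GPU resource limit belongs on the sidecar depends on your cluster's device-plugin setup:
apiVersion: v1
kind: Pod
metadata:
  name: dind-gpu-pod
spec:
  containers:
  - name: frontend
    image: ubuntu:20.04
    command: ["sleep", "infinity"]
    env:
    - name: DOCKER_HOST
      value: tcp://localhost:2375
  - name: dind-daemon
    image: brandsight/dind:nvidia-docker
    securityContext:
      privileged: true
    resources:
      limits:
        nvidia.com/gpu: 1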
My modified version of the Dockerfile:
ARG CUDA_IMAGE=nvidia/cuda:11.0.3-runtime-ubuntu20.04
FROM ${CUDA_IMAGE}

ARG DOCKER_CE_VERSION=5:18.09.1~3-0~ubuntu-xenial
RUN apt-get update -q && \
    apt-get install -yq \
        apt-transport-https \
        ca-certificates \
        curl \
        gnupg-agent \
        software-properties-common && \
    curl -fsSL https://download.docker.com/linux/ubuntu/gpg | apt-key add - && \
    add-apt-repository \
        "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
        $(lsb_release -cs) \
        stable" && \
    apt-get update -q && apt-get install -yq docker-ce docker-ce-cli containerd.io

# https://github.com/docker/docker/blob/master/project/PACKAGERS.md#runtime-dependencies
RUN set -eux; \
    apt-get update -q && \
    apt-get install -yq \
        btrfs-progs \
        e2fsprogs \
        iptables \
        xfsprogs \
        xz-utils \
# pigz: https://github.com/moby/moby/pull/35697 (faster gzip implementation)
        pigz \
# zfs \
        wget

# set up subuid/subgid so that "--userns-remap=default" works out-of-the-box
RUN set -x \
    && addgroup --system dockremap \
    && adduser --system -ingroup dockremap dockremap \
    && echo 'dockremap:165536:65536' >> /etc/subuid \
    && echo 'dockremap:165536:65536' >> /etc/subgid

# https://github.com/docker/docker/tree/master/hack/dind
ENV DIND_COMMIT 37498f009d8bf25fbb6199e8ccd34bed84f2874b
RUN set -eux; \
    wget -O /usr/local/bin/dind "https://raw.githubusercontent.com/docker/docker/${DIND_COMMIT}/hack/dind"; \
    chmod +x /usr/local/bin/dind

##### Install nvidia docker #####
# Add the package repositories
RUN curl -fsSL https://nvidia.github.io/nvidia-docker/gpgkey | apt-key add --no-tty -
RUN distribution=$(. /etc/os-release;echo $ID$VERSION_ID) && \
    echo $distribution && \
    curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
    tee /etc/apt/sources.list.d/nvidia-docker.list
RUN apt-get update -qq --fix-missing
RUN apt-get install -yq nvidia-docker2
RUN sed -i '2i \ \ \ \ "default-runtime": "nvidia",' /etc/docker/daemon.json

RUN mkdir -p /usr/local/bin/
COPY dockerd-entrypoint.sh /usr/local/bin/
RUN chmod 777 /usr/local/bin/dockerd-entrypoint.sh
RUN ln -s /usr/local/bin/dockerd-entrypoint.sh /

VOLUME /var/lib/docker
EXPOSE 2375

ENTRYPOINT ["dockerd-entrypoint.sh"]
#ENTRYPOINT ["/bin/sh", "/shared/dockerd-entrypoint.sh"]
CMD []
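The dockerd-entrypoint.sh copied in above is not included in the post; a minimal, purely illustrative stand-in (not the author's actual script) would simply start dockerd on the exposed TCP port:
#!/bin/sh
set -e
# Hypothetical minimal entrypoint: start dockerd on the port the image EXPOSEs.
exec dockerd --host=unix:///var/run/docker.sock --host=tcp://0.0.0.0:2375 "$@"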
When I exec into the Docker-in-Docker container, I can successfully run nvidia-smi (which previously returned a "not found" error, after which no GPU-related docker run would work).
Feel free to pull my image at brandsight/dind:nvidia-docker.
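For example (using the illustrative pod spec above), verification from outside the pod could look like this; the plain docker run works because the sed line above already sets the daemon's default runtime to nvidia:
kubectl exec -it dind-gpu-pod -c dind-daemon -- nvidia-smi
kubectl exec -it dind-gpu-pod -c dind-daemon -- docker run --rm nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi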

Related

docker over virtualbox cannot start

I have a Dockerfile that creates a valid image that runs on my Ubuntu 18.04 machine.
For compatibility with other machines, I've tried to run the container in a VirtualBox Ubuntu machine (to avoid any configuration errors that may occur).
My docker run command line:
docker run -id --net=host --rm --privileged --gpus=all --env="NVIDIA_DRIVER_CAPABILITIES=all" --env="DISPLAY" -e DISPLAY=:0 -v /tmp/.X11-unix:/tmp/.X11-unix:rw -v /run/user/1000/gdm/Xauthority:/root/.Xauthority --env="QT_X11_NO_MITSHM=1" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v /home/git/:/git --name nirge_sim nirge-sim:1.0
The base Dockerfile:
FROM gazebo:gzserver9-bionic
# nvidia-container-runtime
ENV NVIDIA_VISIBLE_DEVICES \
${NVIDIA_VISIBLE_DEVICES:-all}
ENV NVIDIA_DRIVER_CAPABILITIES \
${NVIDIA_DRIVER_CAPABILITIES:+$NVIDIA_DRIVER_CAPABILITIES,}graphics
# install Utilities
RUN apt-get update -y && apt-get install -y apt-utils curl ca-certificates wget \
&& rm -rf /var/lib/apt/lists/*
# install gazebo packages
RUN apt-get update -y && apt-get install -y --allow-unauthenticated --no-install-recommends \
libgazebo9-dev \
&& rm -rf /var/lib/apt/lists/*
# install ros packages
RUN sh -c 'echo "deb http://packages.ros.org/ros/ubuntu $(lsb_release -sc) main" > /etc/apt/sources.list.d/ros-latest.list'
RUN curl -s https://raw.githubusercontent.com/ros/rosdistro/master/ros.asc | apt-key add -
RUN apt-get update && apt-get install -y --allow-unauthenticated \
ros-melodic-desktop-full \
&& rm -rf /var/lib/apt/lists/*
RUN apt-get update && apt-get install -y --allow-unauthenticated --no-install-recommends \
ros-melodic-gazebo-ros-pkgs ros-melodic-gazebo-ros-control \
ros-melodic-gazebo-plugins ros-melodic-gazebo-ros \
ros-melodic-simulators \
&& rm -rf /var/lib/apt/lists/*
# final config for ros
RUN echo 'source /opt/ros/melodic/setup.bash' >> /root/.bashrc
RUN echo 'export LIBGL_ALWAYS_INDIRECT=1' >> /root/.bashrc
CMD ["bash"]
So this works on my host, but not on the guest machine running under VirtualBox.
The error is:
docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: driver not loaded: unknown.
Would appreciate any advice on this issue.
It appears that one of the dependencies (Gazebo) requires a dedicated GPU, which is not emulated by VirtualBox.
Nvidia cards tend to work well in Ubuntu (sourced from the original site).
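One way to confirm this (a diagnostic sketch, not from the original post) is to check for the NVIDIA driver on the VirtualBox guest before passing any GPU flags to docker run:
# Both of these fail / come up empty on the guest when no NVIDIA driver is loaded:
nvidia-smi
lsmod | grep nvidia
# Dropping --gpus=all and the NVIDIA_* variables should let the container start,
# though Gazebo rendering may still need a real GPU.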

How to install aws-cli in docker container based on maven:3.6.3-openjdk-14 image?

I would like to install aws-cli for the image below, but I received the error below. I tried apk and apt, but neither of them worked. Can you please help with how I should update my Dockerfile?
I do not want to change my base image; I need to use maven:3.6.3-openjdk-14.
sh: apt-get: command not found
FROM maven:3.6.3-openjdk-14
RUN apt-get update \
&& apt-get install -y vim jq unzip curl \
&& apt-get upgrade -y \
#install aws 2
RUN curl --silent --show-error --fail "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" && \
unzip awscliv2.zip && \
./aws/install && \
rm -rf awscliv2.zip
The Docker image maven:3.6.3-openjdk-14 is based on Oracle Linux, which uses rpm to manage packages, so apt-get is not available. You can confirm this with:
docker run -i -t maven:3.6.3-openjdk-14 cat /etc/os-release
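A minimal sketch of a Dockerfile that keeps the maven:3.6.3-openjdk-14 base and uses its own package manager instead; this assumes yum is present (as on the Oracle Linux 7 based tags; Oracle Linux 8 slim images would use microdnf instead):
FROM maven:3.6.3-openjdk-14
# Use the base image's package manager rather than apt-get
RUN yum install -y unzip curl && yum clean all
# Install AWS CLI v2 from the official bundle
RUN curl --silent --show-error --fail "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o awscliv2.zip && \
    unzip awscliv2.zip && \
    ./aws/install && \
    rm -rf awscliv2.zip aws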

Not able to access elasticsearch from docker container

Elasticsearch is running successfully in a Docker container, but I'm not able to access it in the browser. I mapped the ports correctly, but the problem is inside the container: Elasticsearch is bound to localhost
127.0.0.1:9200
Dockerfile
FROM ubuntu:16.04
MAINTAINER Rajesh Gurram
RUN apt-get update && \
apt-get install -y net-tools curl wget gnupg
RUN apt-get install -y software-properties-common
RUN add-apt-repository ppa:webupd8team/java && \
apt-get update && \
echo oracle-java7-installer shared/accepted-oracle-license-v1-1 select true | /usr/bin/debconf-set-selections && \
apt-get install -y oracle-java8-installer && apt-get clean
ENV JAVA_HOME /usr/lib/jvm/java-8-oracle
RUN apt-get install apt-transport-https
RUN wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | apt-key add - && \
echo "deb https://artifacts.elastic.co/packages/6.x/apt stable main" | tee -a /etc/apt/sources.list.d/elastic-6.x.list && \
apt update && apt install -y elasticsearch
RUN sed -i 's/#network.host: 192.168.0.1/network.host: 0.0.0.0/g' /etc/elasticsearch/elasticsearch.yml
EXPOSE 9200 9300
Run the below command on the host machine; it will resolve the issue:
$ sysctl -w vm.max_map_count=262144
If you want to use Docker to get an instance of Elasticsearch, you can read the following guide:
https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html
You can also use the Docker images directly from Elastic, if Ubuntu is not a necessary base image:
https://www.docker.elastic.co/
If you want to upgrade to an ELK Stack later on, I recommend a Docker volume for persistence.
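For example, a minimal single-node setup with the official image and a named volume (the version tag and names are illustrative):
docker volume create esdata
docker run -d --name elasticsearch \
  -p 9200:9200 -p 9300:9300 \
  -e "discovery.type=single-node" \
  -v esdata:/usr/share/elasticsearch/data \
  docker.elastic.co/elasticsearch/elasticsearch:6.8.23
curl http://localhost:9200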

Docker Image with Lando Support

I am trying to build a Docker image where Lando should be preinstalled.
My Dockerfile looks like this:
FROM devwithlando/php:7.1-fpm
RUN apt-get update -y \
&& docker-php-ext-install pcntl
RUN curl -fsSL https://download.docker.com/linux/ubuntu/gpg | apt-key add -
RUN add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
RUN apt-get update
RUN apt-get install -y docker-ce
#RUN usermod -aG docker ${USER}
RUN apt-get update
RUN curl -fsSL -o /tmp/lando-latest.deb http://installer.kalabox.io/lando-latest-dev.deb
RUN dpkg -i /tmp/lando-latest.deb
RUN lando version
But it shows "lando: command not found". Is there anything I am missing? Please guide me.
Lando has essentially 3 dependencies:
Docker (Docker CE on Linux)
Docker-Compose
NodeJS (Typically the current LTS)
A container trying to run Lando itself should probably run from the official Docker image, with all the typical "Docker in Docker" modifications and caveats, such as possibly mounting the Docker socket in the container, running in privileged mode, etc.
Your example is running from Lando's PHP-FPM base image, which isn't at all designed to run either Docker or Node. It also isn't based on Ubuntu, but rather directly on Debian (and you are including some Ubuntu-specific code for installing Docker).
All that said, running Lando from within a Docker container is likely to run into issues with permissions and volume mounts, among other potential things. It isn't recommended, though it might be possible.
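For example, a rough sketch of running such an image against the host's Docker daemon (the image name is illustrative, and the caveats above still apply):
docker run -it --rm \
  --privileged \
  -v /var/run/docker.sock:/var/run/docker.sock \
  my-lando-image lando version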
Here is a Dockerfile from a small repo I made a few years back that worked to install an old version of Lando; it could help you make a more up-to-date one:
FROM ubuntu:bionic
RUN mkdir -p /root/.bin && touch /root/.zshrc
RUN apt update && apt upgrade -y && apt install -y \
git \
exuberant-ctags \
neovim \
python3-pip \
software-properties-common \
wget \
zsh
RUN chsh -s $(which zsh)
RUN add-apt-repository ppa:martin-frost/thoughtbot-rcm \
&& apt update \
&& apt install rcm -y
RUN git clone https://github.com/thinktandem/dotfiles.git ~/dotfiles \
&& mkdir -p ${XDG_CONFIG_HOME:=$HOME/.config} \
&& mkdir -p $XDG_CONFIG_HOME/nvim \
&& ln -s ~/.vim/autoload ~/.config/nvim/ \
&& ln -s ~/.vimrc $XDG_CONFIG_HOME/nvim/init.vim \
&& rcup
RUN git clone https://github.com/nodenv/nodenv.git ~/.nodenv
RUN git clone \
https://github.com/nodenv/node-build.git \
/root/.nodenv/plugins/node-build
RUN curl -sS https://dl.yarnpkg.com/debian/pubkey.gpg | apt-key add - && echo "deb https://dl.yarnpkg.com/debian/ stable main" | tee /etc/apt/sources.list.d/yarn.list \
&& apt remove cmdtest \
&& apt update \
&& apt install --no-install-recommends yarn -y
RUN add-apt-repository ppa:cpick/hub \
&& apt update \
&& apt install -y hub
RUN apt remove docker docker-engine docker.io \
&& apt install -y \
apt-transport-https \
ca-certificates \
curl \
software-properties-common && \
curl -fsSL https://download.docker.com/linux/ubuntu/gpg \
| apt-key add - \
&& add-apt-repository \
"deb [arch=amd64] https://download.docker.com/linux/ubuntu \
bionic \
stable" \
&& apt update \
&& apt install -y docker-ce
RUN TEMP_DEB="$(mktemp)" \
&& wget -O "$TEMP_DEB" \
'https://github.com/lando/lando/releases/download/v3.0.0-rc.1/lando-v3.0.0-rc.1.deb' \
&& dpkg -i "$TEMP_DEB" \
&& rm -f "$TEMP_DEB"
RUN curl -L git.io/antigen > ~/antigen.zsh
RUN RCRC=$HOME/dotfiles/rcrc rcup
CMD ["/usr/bin/zsh"]

docker compose not finding file in gitlab job

I'm running a gitlab-ci.yml job with a gitlab-runner that uses the docker executor with the docker.sock mounted from the host (set in the config file).
Docker Compose's run.sh script (https://github.com/docker/compose/blob/master/script/run/run.sh) doesn't seem to find the docker-compose.lint.yml file I need it to run.
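For context, the runner configuration described above would look roughly like this in config.toml (a sketch; the image name is illustrative):
[[runners]]
  executor = "docker"
  [runners.docker]
    image = "my-ci-image:latest"
    volumes = ["/var/run/docker.sock:/var/run/docker.sock", "/cache"]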
Part of the gitlab-ci.yml file:
Lint from image:
  <<: *temp
  stage: Lint
  script:
    - ls -a
    - cat docker-compose.lint.yml
    - . docker-compose_run.sh -f docker-compose.lint.yml up --remove-orphans --force-recreate --abort-on-container-exit
Dockerfile of the image that the gitlab-ci.yml file is using:
FROM debian:stretch
# Set Bash as default shell
RUN rm /bin/sh && \
ln --symbolic /bin/bash /bin/sh
# apt-get
RUN apt-get update && \
apt-get install -y \
# Install SSH
openssh-client \
openssh-server \
openssl \
# Install cURL
curl \
# Install git
git \
# Install locales
locales \
# Install Python
python3 \
python-dev \
python3-pip \
# Build stuff
build-essential \
libssl-dev \
libffi-dev \
apt-transport-https \
ca-certificates \
curl \
gnupg2 \
software-properties-common
# Install Docker
# Add Docker’s official GPG key
RUN curl -fsSL https://download.docker.com/linux/$(. /etc/os-release; echo "$ID")/gpg | apt-key add - && \
# Set up the stable repository
add-apt-repository \
"deb [arch=amd64] https://download.docker.com/linux/$(. /etc/os-release; echo "$ID") \
$(lsb_release -cs) \
stable" && \
# Update package index
apt-get update && \
# Install Docker CE
apt-get install -y docker-ce
# pip
RUN pip3 install \
# Install Docker Compose
docker-compose
# Set locale
RUN sed --in-place '/en_US.UTF-8/s/^# //' /etc/locale.gen && \
locale-gen && \
# Set system locale (add line)
echo "export LANG=en_US.UTF-8" >> /etc/profile && \
# Set system timezone (add line)
echo "export TZ=UTC" >> /etc/profile
Job output:
...more stuff above here
$ ls -a
.
..
.clocignore
.dockerignore
.eslintignore
.eslintrc.json
.git
.gitignore
.gitlab-ci.yml
.nvmrc
.nyc_output
Dockerfile
README.md
__.gitlab-ci.yml
bin
build.sh
build_and_test.sh
coverage
docker-compose.lint.yml
docker-compose.test.yml
docker-compose_run.sh
lib
package-lock.json
package.json
stop.sh
test
test.sh
wait-for-vertica.sh
$ cat docker-compose.lint.yml
version: "3"
services:
ei:
image: lint:1
environment:
- NODE_ENV=test
entrypoint: ["npm", "run", "lint"]
$ . docker-compose_run.sh -f docker-compose.lint.yml up --remove-orphans --force-recreate --abort-on-container-exit
.IOError: [Errno 2] No such file or directory: u'./docker-compose.lint.yml'
ERROR: Job failed: exit code 1
Since this question was the top Google result for my search:
executor failed running [/bin/sh -c apt-get install git-all]: exit code: 1
To fix my problem I added -y to the apt-get command; that is, I changed this:
apt-get install git-all
to this:
apt-get install git-all -y
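In a Dockerfile, the non-interactive form of that step would be, for example:
RUN apt-get update && apt-get install -y git-all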
