I am new to Docker and I am starting to understand how it works. I am trying to create a docker-compose.yml file that uses command as an argument in order to override the CMD instruction in the Dockerfile.
Below are my Dockerfile and docker-compose.yml. In this case the CMD in the Dockerfile does not exist (it is commented out), so I expect the command in the docker-compose.yml file to run.
Dockerfile:
# For more information, please refer to https://aka.ms/vscode-docker-python
FROM python:3.8-slim
# changing init
ENV TINI_VERSION="v0.19.0"
ADD https://github.com/krallin/tini/releases/download/${TINI_VERSION}/tini /tini
RUN chmod +x /tini
# Keeps Python from generating .pyc files in the container
ENV PYTHONDONTWRITEBYTECODE=1
# Turns off buffering for easier container logging
ENV PYTHONUNBUFFERED=1
# keep pip, setuptools and wheel always updated
RUN pip install -U \
    pip \
    setuptools \
    wheel
# create working directory
WORKDIR /app
# Install requirements
COPY requirements.txt .
RUN python -m pip install -r requirements.txt
RUN python -m pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/cpu
# copy everything in the workdir
COPY ./code .
# Creates a non-root user with an explicit UID and adds permission to access the /app folder
# For more info, please refer to https://aka.ms/vscode-docker-python-configure-containers
RUN adduser -u 5678 --disabled-password --gecos "" appuser && chown -R appuser /app
USER appuser
# define entrypoint
ENTRYPOINT ["/tini", "--"]
# During debugging, this entry point will be overridden. For more information, please refer to https://aka.ms/vscode-docker-python-debug
# CMD [ "python", "main.py", "--model_path", "./model_e20_nf112_14122022.pt", "--input_path", "./input_images/", \
# "--output_path", "./output_folder/", "--camera_fps", "1", "--device", "cpu", "--profiler", "True"]
docker-compose.yml file:
services:
  inference_container:
    image: inference
    build:
      context: ./
    networks:
      - inference_network
    volumes: # link host paths into the container
      - ./data/output_images:/app/output_folder
      - ./data/profilers:/app/profiler
    environment: # permissions for the folders
      - PUID=1000
      - PGID=1000
    command: python3 main.py --model_path ./model_e20_nf112_14122022.pt --input_path ./input_images/ --output_path ./output_folder/ --camera_fps 1 --device cpu --profiler True
networks:
  # The presence of these objects is sufficient to define them
  inference_network:
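Note that because the Dockerfile defines an ENTRYPOINT, the command from docker-compose.yml does not replace tini; it is appended as arguments to it, so the container effectively runs something like:
/tini -- python3 main.py --model_path ./model_e20_nf112_14122022.pt --input_path ./input_images/ --output_path ./output_folder/ --camera_fps 1 --device cpu --profiler True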
The output of the main.py script should differ depending on whether the --profiler argument is True or False.
I build the image with
docker-compose -f docker-compose.yml build --no-cache
in order to build from scratch, then I create and run the container with
docker-compose up
I started with profiler = False, then I changed its value to True and executed docker-compose up again. However, the result is the same, even though the change is picked up in the code, as you can see below (the first print is the value of the profiler input).
[+] Running 1/1
Container unibapinferencescript-inference_container-1 Created
Attaching to unibapinferencescript-inference_container-1
unibapinferencescript-inference_container-1 | False
unibapinferencescript-inference_container-1 | Starting Camera Acquisition Process
unibapinferencescript-inference_container-1 | Starting Inference Process
unibapinferencescript-inference_container-1 | Adding Image
unibapinferencescript-inference_container-1 | Retrieving Image for inference
unibapinferencescript-inference_container-1 | Adding Image
unibapinferencescript-inference_container-1 | No more images to acquire
unibapinferencescript-inference_container-1 | Retrieving Image for inference
unibapinferencescript-inference_container-1 | Inference Process terminated
unibapinferencescript-inference_container-1 | Images have been save in ./output_folder/
unibapinferencescript-inference_container-1 exited with code 0
[+] Running 1/1
Container unibapinferencescript-inference_container-1 Created
Attaching to unibapinferencescript-inference_container-1
unibapinferencescript-inference_container-1 | True
unibapinferencescript-inference_container-1 | Starting Camera Acquisition Process
unibapinferencescript-inference_container-1 | Starting Inference Process
unibapinferencescript-inference_container-1 | Adding Image
unibapinferencescript-inference_container-1 | Retrieving Image for inference
unibapinferencescript-inference_container-1 | Adding Image
unibapinferencescript-inference_container-1 | No more images to acquire
unibapinferencescript-inference_container-1 | Retrieving Image for inference
unibapinferencescript-inference_container-1 | Inference Process terminated
unibapinferencescript-inference_container-1 | Images have been save in ./output_folder/
unibapinferencescript-inference_container-1 exited with code 0
Can someone help me solve this problem?
Thanks in advance.
I solved the problem. There was a bug in my code in the script main.py, and that's why it was not working when I changed the parameter in the command. Now it is all working well.
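For anyone hitting something similar, one quick way to confirm that the command from docker-compose.yml actually reaches the container is to check the resolved configuration and inspect the created container (the container name below is taken from the logs above):
docker-compose config
docker inspect --format '{{.Path}} {{json .Args}}' unibapinferencescript-inference_container-1
The corrected outputs for both cases are below.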
Case with profiler=False
[+] Running 1/1
⠿ Container unibapinferencescript-inference_container-1 Created 0.3s
Attaching to unibapinferencescript-inference_container-1
unibapinferencescript-inference_container-1 | Starting camera acquisition process
unibapinferencescript-inference_container-1 |
unibapinferencescript-inference_container-1 | Starting inference process on cpu
unibapinferencescript-inference_container-1 | Adding image
unibapinferencescript-inference_container-1 | Retrieving image for inference
unibapinferencescript-inference_container-1 | Adding image
unibapinferencescript-inference_container-1 | No more images to acquire. Terminating the camera acquisition process.
unibapinferencescript-inference_container-1 | Camera acquisition process have been shut down. Performing inference on the remaining images in the queue.
unibapinferencescript-inference_container-1 | Retrieving image for inference
unibapinferencescript-inference_container-1 | Inference process on cpu terminated. The inference on 2 images has been performed.
unibapinferencescript-inference_container-1 |
unibapinferencescript-inference_container-1 | The images have been saved in the folder ./output_folder/
unibapinferencescript-inference_container-1 exited with code 0
Case with profiler=True
[+] Running 1/1
⠿ Container unibapinferencescript-inference_container-1 Recreated 0.3s
Attaching to unibapinferencescript-inference_container-1
unibapinferencescript-inference_container-1 | Profiler Enabled
unibapinferencescript-inference_container-1 | Starting camera acquisition process
unibapinferencescript-inference_container-1 |
unibapinferencescript-inference_container-1 | Starting inference process on cpu
unibapinferencescript-inference_container-1 | Adding image
unibapinferencescript-inference_container-1 | Retrieving image for inference
unibapinferencescript-inference_container-1 | Adding image
unibapinferencescript-inference_container-1 | No more images to acquire. Terminating the camera acquisition process.
unibapinferencescript-inference_container-1 | Camera acquisition process have been shut down. Performing inference on the remaining images in the queue.
unibapinferencescript-inference_container-1 | Retrieving image for inference
unibapinferencescript-inference_container-1 | Inference process on cpu terminated. The inference on 2 images has been performed.
unibapinferencescript-inference_container-1 |
unibapinferencescript-inference_container-1 | The images have been saved in the folder ./output_folder/
unibapinferencescript-inference_container-1 | 5438 function calls (5562 primitive calls) in 0.531 seconds
unibapinferencescript-inference_container-1 |
unibapinferencescript-inference_container-1 | Random listing order was used
unibapinferencescript-inference_container-1 | List reduced from 311 to 1 due to restriction <'inference_cpu'>
unibapinferencescript-inference_container-1 |
unibapinferencescript-inference_container-1 | ncalls tottime percall cumtime percall filename:lineno(function)
unibapinferencescript-inference_container-1 | 1 0.447 0.447 80.468 80.468 Queue_functions.py:37(inference_cpu)
unibapinferencescript-inference_container-1 |
unibapinferencescript-inference_container-1 |
unibapinferencescript-inference_container-1 | 5438 function calls (5562 primitive calls) in 0.531 seconds
unibapinferencescript-inference_container-1 |
unibapinferencescript-inference_container-1 | Random listing order was used
unibapinferencescript-inference_container-1 | List reduced from 311 to 1 due to restriction <'TorchClasses'>
unibapinferencescript-inference_container-1 |
unibapinferencescript-inference_container-1 | ncalls tottime percall cumtime percall filename:lineno(function)
unibapinferencescript-inference_container-1 | 2 0.004 0.002 79.007 39.504 TorchClasses.py:62(UNet.forward)
unibapinferencescript-inference_container-1 |
unibapinferencescript-inference_container-1 |
unibapinferencescript-inference_container-1 | 5438 function calls (5562 primitive calls) in 0.531 seconds
unibapinferencescript-inference_container-1 |
unibapinferencescript-inference_container-1 | Random listing order was used
unibapinferencescript-inference_container-1 | List reduced from 311 to 1 due to restriction <'camera_acquisition'>
unibapinferencescript-inference_container-1 |
unibapinferencescript-inference_container-1 | ncalls tottime percall cumtime percall filename:lineno(function)
unibapinferencescript-inference_container-1 | 1 0.000 0.000 0.031 0.031 Queue_functions.py:8(camera_acquisition)
unibapinferencescript-inference_container-1 |
unibapinferencescript-inference_container-1 |
unibapinferencescript-inference_container-1 exited with code 0
More details:
I mostly followed the following instructions to set up my init script:
https://cloud.google.com/container-optimized-os/docs/how-to/run-gpus
I used the docker base image:
nvidia/cuda:11.2.1-runtime-ubuntu20.04
My cloud-init ExecStart command for the docker part (referred to as MY COMMAND below) is currently:
docker run --rm --name=myapp -dit -p 80:80 --volume /var/lib/nvidia/lib64:/usr/local/nvidia/lib64 --volume /var/lib/nvidia/bin:/usr/local/nvidia/bin --device /dev/nvidia0:/dev/nvidia0 --device /dev/nvidia-uvm:/dev/nvidia-uvm --device /dev/nvidiactl:/dev/nvidiactl <Docker Container Name> <Uvicorn Startup Command>
When I SSH into the running VM, I do the following to help log which processes are using the GPU:
sudo mount --bind /var/lib/nvidia /var/lib/nvidia
sudo mount -o remount,exec /var/lib/nvidia
On the first pass, I get the following output from /var/lib/nvidia/bin/nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03 Driver Version: 470.141.03 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:00:04.0 Off | 0 |
| N/A 55C P0 27W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
When I go to the root (by doing cd .. twice) and run MY COMMAND, I get the same issue with no processes using the GPU.
However, when I run MY COMMAND in the home/{username} directory, I get the following output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03 Driver Version: 470.141.03 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:00:04.0 Off | 0 |
| N/A 51C P0 28W / 70W | 14734MiB / 15109MiB | 13% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 4694 C /usr/bin/python3 2329MiB |
| 0 N/A N/A 4695 C /usr/bin/python3 2329MiB |
| 0 N/A N/A 4696 C /usr/bin/python3 2521MiB |
| 0 N/A N/A 4700 C /usr/bin/python3 5221MiB |
| 0 N/A N/A 4701 C /usr/bin/python3 2329MiB |
+-----------------------------------------------------------------------------+
My basic question is: How do I do the same thing in my cloud config script as I was able to do manually in my VM?
I already tried adding a user to my cloud-init script like in the example provided by the Google link and starting docker with the -u flag, but that ran into permission issues (specifically PermissionError: [Errno 13] Permission denied: '/.cache').
Edit:
I never found a solution that I fully understood, but it turned out I needed to run
sudo mount --bind /var/lib/nvidia /var/lib/nvidia
sudo mount -o remount,exec /var/lib/nvidia
in the same directory where I was running my docker command, using ExecStartPre.
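As a sketch of that fix (the unit name and working directory are assumptions, the docker run flags are copied from MY COMMAND above; adapt everything to your setup), the relevant part of the cloud-init config looks something like:
#cloud-config
write_files:
- path: /etc/systemd/system/myapp.service
  permissions: "0644"
  owner: root
  content: |
    [Unit]
    Description=Start the application container
    Requires=docker.service
    After=docker.service

    [Service]
    # Run the ExecStartPre mounts and the docker command from the same directory
    WorkingDirectory=/home/myuser
    ExecStartPre=/bin/mount --bind /var/lib/nvidia /var/lib/nvidia
    ExecStartPre=/bin/mount -o remount,exec /var/lib/nvidia
    ExecStart=/usr/bin/docker run --rm --name=myapp -dit -p 80:80 --volume /var/lib/nvidia/lib64:/usr/local/nvidia/lib64 --volume /var/lib/nvidia/bin:/usr/local/nvidia/bin --device /dev/nvidia0:/dev/nvidia0 --device /dev/nvidia-uvm:/dev/nvidia-uvm --device /dev/nvidiactl:/dev/nvidiactl <Docker Container Name> <Uvicorn Startup Command>

runcmd:
- systemctl daemon-reload
- systemctl start myapp.service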
I installed nvidia-docker2 following the instructions here. When running the following command, I get the expected output, as shown below.
sudo docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05 Driver Version: 495.29.05 CUDA Version: 11.5 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:0B:00.0 On | N/A |
| 24% 31C P8 13W / 250W | 222MiB / 11011MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
However, running the above command without "sudo" results in the following error for me:
$ docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
docker: Error response from daemon: failed to create shim task: OCI runtime create
failed: runc create failed: unable to start container process: error during container
init: error running hook #0: error running hook: exit status 1, stdout: , stderr:
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1:
cannot open shared object file: no such file or directory: unknown.
Can anyone please help me with how I can solve this problem?
Add your user to the docker group:
sudo usermod -aG docker your_user
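Note that group membership is only picked up on a new login, so log out and back in (or start a new shell with the group applied) before retrying, for example:
newgrp docker
docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi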
Update:
Check here https://github.com/NVIDIA/nvidia-docker/issues/539
Maybe something from the comments will help you.
Try adding "sudo" to your docker command, e.g.:
sudo docker-compose ...
I'm trying to use the GPU from inside my docker container. I'm using Docker version 19.03 on Ubuntu 18.04.
Outside the docker container if I run nvidia-smi I get the below output.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05 Driver Version: 450.51.05 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
| N/A 30C P8 9W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
If I run the same thing inside a container created from the nvidia/cuda docker image, I get the same output as above and everything runs smoothly. torch.cuda.is_available() returns True.
But if I run the same nvidia-smi command inside any other docker container, it gives the following output, where you can see that the CUDA Version comes up as N/A. Inside these containers torch.cuda.is_available() also returns False.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05 Driver Version: 450.51.05 CUDA Version: N/A |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
| N/A 30C P8 9W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
I have installed nvidia-container-toolkit using the following commands.
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/ubuntu18.04/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
sudo apt-get install nvidia-container-toolkit
sudo systemctl restart docker
I started my containers using the following commands
sudo docker run --rm --gpus all nvidia/cuda nvidia-smi
sudo docker run -it --rm --gpus all ubuntu nvidia-smi
docker run --rm --gpus all nvidia/cuda nvidia-smi should NOT return CUDA Version: N/A if everything (i.e. the NVIDIA driver, the CUDA toolkit, and nvidia-container-toolkit) is installed correctly on the host machine.
Given that docker run --rm --gpus all nvidia/cuda nvidia-smi returns correctly for you: I also had a problem with CUDA Version: N/A inside the container, which I had luck in solving.
Please see my answer at https://stackoverflow.com/a/64422438/2202107 (obviously you need to adjust and install the matching/correct versions of everything).
For anybody arriving here looking for how to do it with Docker Compose, add this to your service:
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          capabilities:
            - gpu
            - utility # nvidia-smi
            - compute # CUDA. Required to avoid "CUDA version: N/A"
            - video # NVDEC/NVENC. For instance to use a hardware accelerated ffmpeg. Skip it if you don't need it
Doc: https://docs.docker.com/compose/gpu-support
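As a minimal sketch of where that block sits (the service and image names here are just placeholders):
services:
  gpu-test:
    image: nvidia/cuda:11.0.3-base-ubuntu20.04
    command: nvidia-smi
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu, utility, compute]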
I'm building an image that requires testing GPU usability during the build. GPU containers run well:
$ docker run --rm --runtime=nvidia nvidia/cuda:9.2-devel-ubuntu18.04 nvidia-smi
Wed Aug 7 07:53:25 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.54 Driver Version: 396.54 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN X (Pascal) Off | 00000000:04:00.0 Off | N/A |
| 24% 43C P8 17W / 250W | 2607MiB / 12196MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
but the build fails when it needs the GPU:
$ cat Dockerfile
FROM nvidia/cuda:9.2-devel-ubuntu18.04
RUN nvidia-smi
# RUN build something
# RUN tests require GPU
$ docker build .
Sending build context to Docker daemon 2.048kB
Step 1/2 : FROM nvidia/cuda:9.2-devel-ubuntu18.04
---> cdf6d16df818
Step 2/2 : RUN nvidia-smi
---> Running in 88f12f9dd7a5
/bin/sh: 1: nvidia-smi: not found
The command '/bin/sh -c nvidia-smi' returned a non-zero code: 127
I'm new to docker, but I think we need sanity checks when building an image. So how can I build a docker image with the CUDA runtime available?
Configuring the docker daemon with --default-runtime=nvidia solved the problem.
Please refer to this wiki for more info.
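As a sketch of that configuration (this is the form commonly documented for nvidia-container-runtime; adjust the path if your installation differs), /etc/docker/daemon.json would contain:
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
followed by a daemon restart:
sudo systemctl restart docker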
Maybe it's because you are using the RUN instruction in the Dockerfile. I'd try CMD (see the documentation for this instruction) or ENTRYPOINT, since those are what 'docker run' executes, with arguments.
I think RUN instructions are for build-time jobs that need to execute before the container becomes available, rather than for a process with output and so on.
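As a sketch of that idea (the build steps are placeholders), the GPU check would move out of the build and into container startup, where the nvidia runtime applies:
FROM nvidia/cuda:9.2-devel-ubuntu18.04
# RUN steps here: build whatever does not need the GPU
# The sanity check runs when the container starts instead of at build time
CMD ["nvidia-smi"]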
Good luck with that,
strace reveals that the escaping I used may cause a problem in the exec form compared to the shell form (on shell form vs. exec form, see https://docs.docker.com/engine/reference/builder/).
exec form with [/* 3 vars */] - breaks / makes trouble
ENTRYPOINT ["strace", "hugo", "server", "--watch=true", "--bind=0.0.0.0", "--source=\"/src\"", "--destination=\"/output\""]
execve("hugo", ["hugo", "server", "--watch=true", "--bind=0.0.0.0", "--source=\"/src\"", "--destination=\"/output\""], [/* 3 vars */]) = 0
shell form with [/* 4 vars */] - works fine
ENTRYPOINT strace hugo server --watch=true --bind=0.0.0.0 --source=""/src"" --destination=""/output""
execve("hugo", ["hugo", "server", "--watch=true", "--bind=0.0.0.0", "--source=/src", "--destination=/output"], [/* 4 vars */]) = 0"
Dockerfile:
(I used ubuntu, as I wasn't able to run strace with alpine:latest.)
# escape=\
# first line can be removed and doesn't change the behavior of the described issue
FROM ubuntu:latest
RUN apt-get update && apt-get install -y hugo strace
RUN hugo new site src
WORKDIR /src
ENTRYPOINT ["strace", "hugo", "server", "--watch=true", "--bind=0.0.0.0", "--source=\"/src\"", "--destination=\"/output\""]
EXPOSE 1313
Command to run and save output:
sudo docker run --security-opt seccomp:unconfined docker-hugo &> docker-hugo.strace
(see https://github.com/moby/moby/issues/20064#issuecomment-291095117 for info regarding --security-opt)
Overview of possible scenarios:
+------------------+-------------------------------------------+------------------------+---------------------------------------------------+
|                  | No Entrypoint                             | Entrypoint (JSON-form) | Entrypoint (shell-form)                           |
+------------------+-------------------------------------------+------------------------+---------------------------------------------------+
| No CMD           | HostConfig.Config.cmd=/bin/bash is called | breaks                 | ok                                                |
|                  | (assumption as of docker inspect)         |                        |                                                   |
+------------------+-------------------------------------------+------------------------+---------------------------------------------------+
| CMD (JSON-form)  | breaks                                    | breaks                 | breaks                                            |
|                  |                                           |                        | (other issue; not handled here)                   |
+------------------+-------------------------------------------+------------------------+---------------------------------------------------+
| CMD (shell-form) | ok                                        | ok                     | breaks [seems to work as designed]                |
|                  |                                           |                        | (both are called with a shell, concatenated)      |
|                  |                                           |                        | Example: /bin/sh -c <ENTRYPOINT> /bin/sh -c <CMD> |
+------------------+-------------------------------------------+------------------------+---------------------------------------------------+
So my question: am I escaping the JSON array incorrectly?