With Jenkins running on an Ubuntu 14.04 LTS server, we began seeing crashes on startup of test containers with the following error:
OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:297: copying bootstrap data to pipe caused \"write init-p: broken pipe\"": unknown
Initially we suspected a misconfiguration in the local Dockerfiles or in the Jenkins server itself. However, running:
docker run --rm -i -a stdin -a stdout ubuntu echo 1
should work regardless of that configuration, and it produced the same error.
It turned out that this was caused by a recent Docker update that has problems with the older 3.x kernel found by default on Ubuntu 14.04 LTS.
Helpfully, it is possible to upgrade the kernel on 14.04 rather than upgrading the whole OS. The process is described in this Ask Ubuntu article, but in short:
sudo apt-get install linux-generic-lts-xenial
sudo reboot
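After rebooting, you can verify that the new kernel is actually in use; with the lts-xenial stack installed I would expect a 4.4.x version here:
uname -r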
NB: searching for the full error message turned up no other current articles online, and searching for parts of it surfaced only a few app-specific forum posts. For that reason I felt it was worth writing a more easily discoverable version here, since the problem affects anyone running containers on 14.04, whether for development, testing, or even production.
Related
Sometimes, when I come back to my workplace from home, I can no longer communicate with my Nvidia GPUs inside a docker container, even though a previously launched process that uses the GPUs is still running fine. The running process (training a neural network via PyTorch) is not affected by the disconnection, but I cannot launch a new process.
nvidia-smi gives Failed to initialize NVML: Unknown Error, and torch.cuda.is_available() likewise returns False.
I have encountered two different cases:
nvidia-smi works fine when run on the host machine. In this case, the situation can be resolved by restarting the docker container via docker stop $MYCONTAINER followed by docker start $MYCONTAINER on the host machine.
nvidia-smi does not work on the host machine either, and neither does nvcc --version, throwing the errors Failed to initialize NVML: Driver/library version mismatch and Command 'nvcc' not found, but can be installed with: sudo apt install nvidia-cuda-toolkit respectively. The strange point is that the current process still runs fine. In this case, reinstalling the driver or rebooting the machine solves the problem.
However, these solutions require stopping all current processes, which is not an option when the running job must not be interrupted.
Does somebody have a suggestion for resolving this situation?
Many thanks.
(software)
Docker version: 20.10.14, build a224086
OS: Ubuntu 22.04
Nvidia driver version: 510.73.05
CUDA version: 11.6
(hardware)
Supermicro server
Nvidia A5000 * 8
(pic1) nvidia-smi failing inside the docker container while working fine on the host machine.
(pic2) nvidia-smi working after restarting the docker container, which is case 1 mentioned above.
For the problem of Failed to initialize NVML: Unknown Error and having to restart the container, please see this ticket and post your system/package information there as well:
https://github.com/NVIDIA/nvidia-docker/issues/1671
There's a workaround on the ticket, but it would be good to have others post their configuration to help fix the issue.
Downgrading containerd.io to 1.6.6 works, as long as you set no-cgroups = true in /etc/nvidia-container-runtime/config.toml and pass the devices to docker run explicitly, for example: docker run --gpus all --device /dev/nvidia0:/dev/nvidia0 --device /dev/nvidia-modeset:/dev/nvidia-modeset --device /dev/nvidia-uvm:/dev/nvidia-uvm --device /dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools --device /dev/nvidiactl:/dev/nvidiactl --rm -it nvidia/cuda:11.4.2-base-ubuntu18.04 bash
So run sudo apt-get install -y --allow-downgrades containerd.io=1.6.6-1, then sudo apt-mark hold containerd.io to prevent the package from being updated again. In short: downgrade the package, edit the config file, and pass all of the /dev/nvidia* devices to docker run.
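For reference, the config change is a one-line edit. A minimal sketch of the relevant section of /etc/nvidia-container-runtime/config.toml (assuming the stock layout shipped by the NVIDIA container toolkit, where the line is present but commented out):
[nvidia-container-cli]
no-cgroups = true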
For the Failed to initialize NVML: Driver/library version mismatch issue, that is caused by the driver packages having been updated without a reboot. If this is a production machine, I would hold the driver package as well to stop it from auto-updating. You should be able to figure out the package name from something like sudo dpkg --get-selections "*nvidia*"
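As a sketch, assuming the metapackage for the 510.73.05 driver above is named nvidia-driver-510 (check the dpkg output for the actual name on your system):
sudo dpkg --get-selections "*nvidia*"
sudo apt-mark hold nvidia-driver-510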
I am trying to run a private tangle on my computer in Linux docker containers.
To do so, I followed the guide at https://wiki.iota.org/chrysalis-docs/tutorials/one_click_private_tangle
Every step succeeded up until I tried to execute
./private_tangle.sh install
This reports
Error 1
as seen in the screenshot below:
We do not get any further information. Is anyone familiar with this error, or does anyone have a clue how to get more information about it, so that we at least know where to look?
Some further information:
After executing docker ps -a we see that not a single container is listed.
I am running on a Windows 10 machine
I execute the commands from within Ubuntu (version 20.04)
Ubuntu, docker-desktop and docker-desktop-data are all running under WSL 2
Docker integration with Ubuntu is activated
I thought the error might come from no Hornet node being installed initially, so I successfully installed a Hornet node according to the guide at https://wiki.iota.org/chrysalis-docs/tutorials/one_click_private_tangle. This changed nothing about the error.
The versions of docker and docker-compose comply with the requirements
If any more details are needed to help me solve this problem, please let me know.
I used the documentation (https://wiki.iota.org/chrysalis-docs/tutorials/one_click_private_tangle) to install these containers on my local Ubuntu 18.04.
My docker version is: 20.10.12
And docker-compose version is: 1.29.2
By following the steps of the tutorial I managed to successfully start all of the containers without trouble.
My guess would be that the permissions on the private-tangle.sh script are not correct, or that there is a permission problem at the Docker level.
You should start by checking the permissions of the private-tangle.sh script using ls -l
Here is my output: -rwxrwxr-x 1 ben ben 9413 Jan 11 11:28 private-tangle.sh
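If your output is missing the execute bits, you can add them (a hypothetical fix for that specific case):
chmod +x private-tangle.sh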
It could also be a Docker permissions issue: if you have to use sudo when executing docker commands, that will cause trouble when the script runs docker itself.
You need to add yourself to the docker group to be able to run docker commands without sudo. You can do this by running sudo usermod -aG docker $USER (with damiaan-vh as $USER).
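The group change only takes effect in a new session, so either log out and back in or refresh the group in your current shell, then verify that docker works without sudo:
newgrp docker
docker run --rm hello-world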
Solution from source https://stackoverflow.com/posts/70665394/edit
I suggest downgrading the Ubuntu version to 18.04, which is a more stable release.
To reinstall docker and docker-compose, follow these documentation pages:
(docker: https://docs.docker.com/engine/install/ubuntu/ )
(docker-compose: https://docs.docker.com/compose/install/ )
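On Ubuntu, the engine documentation also offers a convenience script as a quick route (a sketch; the manual apt repository setup described in the linked page gives you more control):
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh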
I fired up my Ubuntu 20.04 virtual machine today and sadly my docker containers seem to be gone: docker ps -a returns an empty list. I'm not sure what may have caused this, so I need some help troubleshooting. Recently I was testing out docker-compose, which I installed with apt install docker-compose; could that have caused any issues? I also recently ran apt update / apt upgrade.
I don't see any files in /var/lib/docker/containers or /var/lib/docker/volumes, if that's where my data should be. If they're gone, oh well, but I'd like to figure out what happened so it doesn't happen again.
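Edit: two checks that should narrow this down are which docker packages apt touched and which storage driver / data root the daemon is using now; a changed storage driver would make old containers invisible rather than deleted:
grep -i docker /var/log/dpkg.log
docker info --format '{{.DockerRootDir}} {{.Driver}}'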
I'm a complete newcomer to Docker, so the following questions might be a bit naive, but I'm stuck and I need help.
I'm trying to reproduce some results in research. The authors just released code along with a specification of how to build a Docker image to reproduce their results. The relevant bit is copied below:
I believe I installed Docker correctly:
$ docker --version
Docker version 19.03.13, build 4484c46d9d
$ sudo docker run hello-world
Hello from Docker!
This message shows that your installation appears to be working correctly.
To generate this message, Docker took the following steps:
1. The Docker client contacted the Docker daemon.
2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
(amd64)
3. The Docker daemon created a new container from that image which runs the
executable that produces the output you are currently reading.
4. The Docker daemon streamed that output to the Docker client, which sent it
to your terminal.
To try something more ambitious, you can run an Ubuntu container with:
$ docker run -it ubuntu bash
Share images, automate workflows, and more with a free Docker ID:
https://hub.docker.com/
For more examples and ideas, visit:
https://docs.docker.com/get-started/
However, when I try checking that my nvidia-docker installation was successful, I get the following error:
$ sudo docker run --gpus all --rm nvidia/cuda:10.1-base nvidia-smi
docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: driver not loaded\\\\n\\\"\"": unknown.
It looks like the key error is:
nvidia-container-cli: initialization error: nvml error: driver not loaded
I don't have a GPU locally and I'm finding conflicting information on whether CUDA needs to be installed before NVIDIA Docker. For instance, this NVIDIA moderator says "A proper nvidia docker plugin installation starts with a proper CUDA install on the base machine."
My questions are the following:
Can I install NVIDIA Docker without having CUDA installed?
If so, what is the source of this error and how do I fix it?
If not, how do I create this Docker image to reproduce the results?
Can I install NVIDIA Docker without having CUDA installed?
Yes, you can. The readme states that nvidia-docker only requires the NVIDIA GPU driver and the Docker engine to be installed:
Note that you do not need to install the CUDA Toolkit on the host system, but the NVIDIA driver needs to be installed
If so, what is the source of this error and how do I fix it?
That's either because you don't have a GPU locally (or it's not an NVIDIA one), or because something went wrong when installing the drivers. If you have a CUDA-capable GPU, I recommend using the NVIDIA guide to install drivers. If you don't have a GPU locally, you can still build an image with CUDA in it, then move it to a machine that has a GPU and run it there.
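In practice that workflow is just an ordinary build followed by a run on the GPU host (a sketch; my-experiment is a hypothetical image tag):
docker build -t my-experiment .            # on the GPU-less machine; building needs no GPU
docker save my-experiment | gzip > my-experiment.tar.gz    # ship the image, or push to a registry instead
docker run --gpus all --rm my-experiment   # on a machine with an NVIDIA GPU and driver installed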
If not, how do I create this Docker image to reproduce the results?
The problem is that even if you manage to get rid of CUDA in the Docker image, there is software in it that requires CUDA. In this case, fixing the Dockerfile seems unnecessary to me: you can just skip Docker and start fixing the code to run on the CPU.
I think you need
ENV NVIDIA_VISIBLE_DEVICES=void
then
RUN your work
and finally
ENV NVIDIA_VISIBLE_DEVICES=all
so that the NVIDIA runtime does not try to expose a (missing) GPU while the build steps run, but containers started from the finished image see all GPUs again.
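A minimal Dockerfile sketch of that pattern, assuming a CUDA base image (the base tag and the apt packages are placeholders for your actual build steps):
FROM nvidia/cuda:10.1-base-ubuntu18.04
# hide GPUs from the NVIDIA container runtime while build steps run
ENV NVIDIA_VISIBLE_DEVICES=void
# hypothetical build work that must not try to touch a GPU
RUN apt-get update && apt-get install -y --no-install-recommends python3
# re-expose all GPUs for containers started from the final image
ENV NVIDIA_VISIBLE_DEVICES=all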
I am using docker on CentOS Linux release 7.8.2003 (Core) with 16 GB RAM. My docker version is 19.03.7 and my docker-compose version is 1.23.2. I have 30+ docker containers running on my machine.
Everything was working smoothly, but then I ran into a problem. Sometimes, when I try to run a container, I get this error:
ERROR: for container_name Cannot start service container_name: OCI runtime create failed: container_linux.go:349:
starting container process caused "process_linux.go:319: getting the final child's pid from pipe caused \"EOF\"": unknown
When I retry 3-5 times, the container eventually starts successfully. Sometimes I need to restart the docker service, or even the server, to make it work. I don't know why it sometimes gives this error and sometimes starts successfully with the same docker-compose file.
Can somebody explain this weird docker behavior to me? Is it due to the many containers running on my machine, or something else?
I had a similar issue:
OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:722: waiting for init preliminary setup caused: EOF: unknown
and the problem turned out to be the wrong version of my WSL distro, which was 1 instead of 2:
PS C:\Users\myself> wsl -l -v
NAME STATE VERSION
* Ubuntu Running 1
So I used the wsl --set-version command to upgrade it:
PS C:\Users\myself> wsl --set-version Ubuntu 2
PS C:\Users\myself> wsl -l -v
NAME STATE VERSION
* Ubuntu Running 2
Then I was able to successfully build my Docker image.
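If you install more distros later, you can also make WSL 2 the default up front, so they don't need a per-distro upgrade:
PS C:\Users\myself> wsl --set-default-version 2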
Hope this can help someone.
I came across this link, which solved the issue for me. It is apparently aimed at WSL, but it definitely also applied to my Ubuntu 18.04 installation: the latest version(s) of docker have this problem, while versions from a few releases back don't.
I am a complete newb to Docker and am running Ubuntu 18.04.6 (Bionic Beaver)
docker --version reports Docker version 20.10.7, build 20.10.7-0ubuntu5~18.04.3
I'm not sure if this is the solution, but after reinstalling Docker because of various unrelated problems, I ran runc init, killed an old running dockerd process, and was able to get hello-world to run. I've wasted so much time on this that I don't want to hunt down the root cause.