I fired up my Ubuntu 20.04 virtual machine today and sadly my Docker containers seem to be gone. Running docker ps -a returns an empty list. I'm not sure what caused this, so I need some help troubleshooting. Recently, I was testing out docker-compose, which I installed with apt install docker-compose; I'm wondering if that could have caused any issues. I also recently ran apt update / apt upgrade.
I don't see any files in /var/lib/docker/containers or /var/lib/docker/volumes, if that's where my data should be. If they're gone, oh well, but I'd like to figure out what happened so it doesn't happen again.
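One way to start narrowing it down is to check whether more than one Docker install is present and where the active daemon keeps its data. A hedged diagnostic sketch using standard commands, not a guaranteed fix:

# list Docker-related apt packages and any Docker snap
dpkg -l | grep -E 'docker|containerd'
snap list 2>/dev/null | grep docker

# ask the running daemon where its data root actually is
docker info --format '{{.DockerRootDir}}'

If the data root reported by docker info is not /var/lib/docker (for example, a snap-managed directory), the old containers may still exist under the other install's data directory.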
I followed the instructions to install nvidia-docker2 from the official documentation: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
Whenever I run their test example:
sudo docker run --rm --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi
I still get the error:
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
I rebooted, but it had no effect.
I am on Ubuntu 22.04 with my NVIDIA drivers up to date.
nvidia-smi works on the host machine, but it does not work from within Docker.
EDIT (SOLVED): I finally found out what was going on.
After reinstalling, it would work, but after a reboot it fell back to the previous state where it did not work.
This was caused by another Docker service installed via snapd, so I had to purge Docker completely:
sudo snap remove docker - after that I could reinstall everything, and it is finally stable, even after rebooting.
Unfortunately I was not able to properly fix the underlying issue, so I purged all Docker packages and all NVIDIA container packages, reinstalled everything, and now it works!
Good old methods work fine :)
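For reference, a rough sketch of the purge-and-reinstall sequence described above. The package names assume the standard Docker CE and NVIDIA Container Toolkit installs; adjust them to whatever is actually present on your system (check with dpkg -l | grep -E 'docker|nvidia-container'):

# remove the snap-installed Docker that kept coming back after reboots
sudo snap remove docker

# purge the apt-installed Docker and NVIDIA container packages
sudo apt-get purge -y docker-ce docker-ce-cli containerd.io nvidia-docker2 nvidia-container-toolkit

# then reinstall Docker and the NVIDIA Container Toolkit following the official guides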
You need to restart the Docker daemon:
sudo systemctl restart docker
If the problem still occurs, install the nvidia-container-toolkit package and then restart the Docker daemon again.
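A minimal sketch of that sequence, assuming the NVIDIA apt repository is already configured as in the official install guide (nvidia-ctk ships with recent versions of the toolkit; older installs configure /etc/docker/daemon.json by hand instead):

sudo apt-get install -y nvidia-container-toolkit
# register the NVIDIA runtime with Docker
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# then re-run the test container
sudo docker run --rm --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi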
Sometimes, when I come back to my workplace from home, I can't communicate with my NVIDIA GPUs inside a Docker container, even though the previously launched process that uses the GPUs is still running fine. The running process (training a neural network via PyTorch) is not affected by the disconnection, but I cannot launch a new process.
nvidia-smi gives Failed to initialize NVML: Unknown Error, and torch.cuda.is_available() returns False as well.
I have encountered two different cases:
nvidia-smi works fine when it is run on the host machine. In this case, the situation can be resolved by restarting the Docker container via docker stop $MYCONTAINER followed by docker start $MYCONTAINER on the host machine.
nvidia-smi doesn't work on the host machine either, nor does nvcc --version, throwing Failed to initialize NVML: Driver/library version mismatch and Command 'nvcc' not found, but can be installed with: sudo apt install nvidia-cuda-toolkit. The strange part is that the current process still runs fine. In this case, reinstalling the driver or rebooting the machine solves the problem.
However, these solutions require stopping all current processes, which is not an option when I cannot stop the running process.
Does anybody have a suggestion for handling this situation?
Many thanks.
(Software)
Docker version: 20.10.14, build a224086
OS: Ubuntu 22.04
Nvidia driver version: 510.73.05
CUDA version: 11.6
(Hardware)
Supermicro server
Nvidia A5000 * 8
(pic1) nvidia-smi not working inside a Docker container, but working fine on the host machine.
(pic2) nvidia-smi working after restarting the Docker container, which is case 1 mentioned above.
For the problem of Failed to initialize NVML: Unknown Error and having to restart the container, please see this ticket and post your system/package information there as well:
https://github.com/NVIDIA/nvidia-docker/issues/1671
There's a workaround on the ticket, but it would be good to have others post their configuration to help fix the issue.
Downgrading containerd.io to 1.6.6 works as long as you set no-cgroups = true in /etc/nvidia-container-runtime/config.toml and pass the NVIDIA devices explicitly to docker run, e.g.:
docker run --gpus all --device /dev/nvidia0:/dev/nvidia0 --device /dev/nvidia-modeset:/dev/nvidia-modeset --device /dev/nvidia-uvm:/dev/nvidia-uvm --device /dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools --device /dev/nvidiactl:/dev/nvidiactl --rm -it nvidia/cuda:11.4.2-base-ubuntu18.04 bash
So run sudo apt-get install -y --allow-downgrades containerd.io=1.6.6-1 and then sudo apt-mark hold containerd.io to prevent the package from being updated. In short: downgrade and hold containerd.io, edit the config file, and pass all of the /dev/nvidia* devices to docker run.
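For reference, a sketch of the relevant edit in /etc/nvidia-container-runtime/config.toml; only the no-cgroups line changes, everything else stays at its defaults:

[nvidia-container-cli]
#no-cgroups = false
no-cgroups = true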
For the Failed to initialize NVML: Driver/library version mismatch issue, that is caused by the drivers having been updated without a reboot. If this is a production machine, I would also hold the driver package so it does not auto-update. You should be able to figure out the package name from something like sudo dpkg --get-selections "*nvidia*"
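A hedged example of holding the driver, assuming the package turns out to be named nvidia-driver-510 (a hypothetical name based on the 510.73.05 driver in the question; take the real name from the dpkg output):

sudo dpkg --get-selections "*nvidia*"   # find the installed driver package
sudo apt-mark hold nvidia-driver-510    # hypothetical package name - substitute your own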
I am trying to run a private Tangle on my computer through Linux Docker containers.
To do so, I followed the guide at https://wiki.iota.org/chrysalis-docs/tutorials/one_click_private_tangle
Every step succeeded up until we tried to execute
./private_tangle.sh install
This reports
Error 1
as seen in the screenshot below:
We do not get any further information. Is anyone familiar with this error, or does anyone have a clue how to get more information about it so that we at least know where to look?
Some further information:
After executing docker ps -a we see that not a single container is running.
I am running on a Windows 10 machine.
I execute the commands from within Ubuntu (version 20.04).
Ubuntu, docker-desktop and docker-desktop-data are all running on WSL2.
Docker integration with Ubuntu is enabled.
I thought the error might come from no Hornet node being installed initially, so I successfully installed a Hornet node according to the guide at https://wiki.iota.org/chrysalis-docs/tutorials/one_click_private_tangle. This did not change the error.
The versions of docker and docker-compose comply with the requirements.
If any more details are needed to help me solve this problem, please let me know.
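One hedged way to get more detail out of the failing step, assuming private_tangle.sh is a plain bash script as in the tutorial, is to trace it and capture the output:

bash -x ./private_tangle.sh install 2>&1 | tee install.log

# then check whether any containers were created at all, and what they logged
docker ps -a
docker-compose logs    # run from the tutorial's directory so it finds the compose file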
I used the documentation (https://wiki.iota.org/chrysalis-docs/tutorials/one_click_private_tangle) to install these containers on my local Ubuntu 18.04 machine.
My docker version is: 20.10.12
And docker-compose version is: 1.29.2
By following the steps of the tutorial I managed to successfully start all of the containers without trouble.
My guess here would be that the permissions of private-tangle.sh are not correct, or that there is a permission problem at the Docker level.
You should start by checking the permissions of the private-tangle.sh script using ls -l.
Here is my output: -rwxrwxr-x 1 ben ben 9413 Jan 11 11:28 private-tangle.sh
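If the execute bits are missing in your output, a minimal fix (a sketch, assuming the script sits in the current directory) would be:

chmod +x ./private-tangle.sh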
It could also be a Docker permissions issue: if you have to use sudo when executing docker commands, that will cause trouble when the script runs them.
You need to add yourself to the docker group to be able to run docker commands without sudo. You can do this by running sudo usermod -aG docker $USER, with damiaan-vh as $USER.
Solution from source https://stackoverflow.com/posts/70665394/edit
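A short sketch of that group change plus a quick sanity check; note that the group change only takes effect in a new login shell (or via newgrp):

sudo usermod -aG docker $USER
newgrp docker                  # or log out and back in
docker run --rm hello-world    # should now work without sudo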
I would also suggest downgrading Ubuntu to 18.04, as it is a more stable version for this setup.
For reinstalling docker and docker-compose, follow these documentation pages:
(docker: https://docs.docker.com/engine/install/ubuntu/ )
(docker-compose: https://docs.docker.com/compose/install/ )
The docker command stopped working after restarting (using sudo reboot) my Ubuntu (20.04) server.
Now, any command involving docker gives me an error. For example,
$ docker --help
cannot update snap namespace: cannot create symlink in "/etc/docker": existing file in the way
snap-update-ns failed with code 1
When I check manually, there is a file called key.json in the /etc/docker folder which contains a JSON dictionary.
Before restarting, I had a few Docker containers running in the background with volumes attached. When I run systemctl start docker, as suggested in one Stack Overflow answer, I get
Failed to start docker.service: Unit docker.service not found.
It would be great to at least recover the Docker images that were there before restarting.
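A hedged way to check whether the images are still on disk even while the daemon is unreachable: the Docker snap usually keeps its data under /var/snap/docker/common/var-lib-docker (this path is an assumption; verify it on your system), so listing it shows whether the image layers survived the reboot:

sudo ls /var/snap/docker/common/var-lib-docker/image 2>/dev/null
# once the daemon is reachable again, the definitive check is simply:
docker images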
-- Edit --
For some reason, docker is working now. I restarted once more after the initial restart that caused the error, and there was no improvement then; however, it is working fine now. I do not know what solved the issue. Maybe the command journalctl -u docker.service (as suggested in a comment) helped in some way, or it was something else.
So, it would be great if someone could explain the initial cause of the trouble. It might help us avoid this in the future.
It looks like a snap-related issue.
I found a fix on the Snapcraft forum here:
https://forum.snapcraft.io/t/layouts-still-brittle-when-refreshing-snaps/26252/5
sudo rm -rf /etc/docker
sudo snap refresh
Works in both Ubuntu 18.04.5 and 20.04.5 LTS.
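A slightly more cautious variant of the same fix, assuming you want to keep a copy of the existing key.json mentioned in the question before wiping the directory:

sudo cp -a /etc/docker /etc/docker.bak   # back up key.json and any daemon.json
sudo rm -rf /etc/docker
sudo snap refresh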
I am using an Amazon Linux machine (p2 instance).
I have installed this docker version:
Client:
Version: 17.03.2-ce
API version: 1.27
Go version: go1.7.5
Git commit: 7392c3b/17.03.2-ce
Built: Wed Aug 9 22:45:09 2017
OS/Arch: linux/amd64
I'm not sure, but I think the issue started after killing a screen session that was running a docker container.
I'm experiencing this error:
sudo docker ps
Gives:
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
And:
sudo service docker status
Gives:
docker dead but subsys locked
I have tried both:
sudo rm -rf /var/run/docker
sudo rm /var/run/docker.*
I also tried stopping and starting the service:
sudo service docker start/stop
I also rebooted the EC2 machine.
Try this, then restart docker:
yum update device-mapper-libs
sudo service docker restart
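If the daemon still reports "docker dead but subsys locked" after that, another hedged workaround (paths are the usual SysV init locations on Amazon Linux; verify them on your instance) is to clear the stale lock and pid files and start the service again:

sudo rm -f /var/lock/subsys/docker /var/run/docker.pid
sudo service docker start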
I am also facing the same issue, although I (sort of) fixed it by issuing sudo service docker stop and sudo service docker start before running anything in Docker.
Details: I am using Docker on a spot instance, so it is set up every time I need to perform some task. I create the container and upload my files without any problem, but when I issue the command to run an uploaded bash script in Docker, I hit this issue of the daemon not running. So before running the script I just stop and start Docker. Strangely, simply doing sudo service docker start or even sudo service docker restart did not solve my problem; I had to issue both the stop and start commands specifically. I don't have enough data points just yet - it has only been working for the last couple of days, and I am not in a hurry to test this hypothesis (of issuing both commands rather than just one).
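A sketch of that pre-run step as it might appear at the top of the bash script that launches the workload (the stop-then-start pattern is the workaround described above; hello-world is just a hypothetical sanity check):

sudo service docker stop || true   # ignore the error if the daemon is already dead
sudo service docker start
sudo docker run --rm hello-world   # confirm the daemon answers before the real job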
I had 10 Docker containers running on an EC2 instance (t2.large); each container was running as its own service, and all of the services were running in one cluster. I updated the time zone of the EC2 instance, which required a reboot. After rebooting the instance, this problem surfaced. The first thing I noticed was that SSHing into the machine was slower than before; I then realized docker ps was throwing this error. I "magically" resolved it, only to find that some of the containers were running but not serving any pages. docker logs -f CONTAINER_ID showed that nginx didn't start due to privilege issues, and that some files that were supposed to be created had not been created.
I later realized that my magical solution was not really a solution (most magical solutions are not): all 10 containers were trying to start at the same time, which required more memory than my instance could offer. I had to delete the services and recreate them one by one, letting each container start before creating the next one in the same cluster. That was when I had peace. I hope this helps somebody.