Docker restart policy is ignored if no space left on device - docker

Situation:
I run docker services on a production server.
The restart policy is set to always.
Disk usage of some services is volatile (file sharing service), which means the disk is 100% full from time to time, and there is a delay until it is cleaned up again.
Problem:
If a docker service exits while the disk is full, docker does not try to restart that service. Even after disk space becomes available again, the service is not restarted automatically. Manually restarting the service works, but that's not what I want for a production service.
The actual error is:
mkdir /var/lib/docker/overlay2/0e609c8b4059d3e0f1273bd8cb9e9a95c3d76730798a391dd360054ac450f3ed/merged: no space left on device
Question:
Is there a way to keep docker services restarting under any conditions?
Logs:
docker inspect <service> (the official mongodb service in the following example; other services are similar)
[
    {
        "Id": "b5b54e0dcd8c91b5ead96ce77d2b28afb4b973e205f81a7f6f2fb6b11920f40d",
        "Created": "2022-01-13T08:24:03.142162294Z",
        "Path": "docker-entrypoint.sh",
        "Args": [
            "mongod"
        ],
        "State": {
            "Status": "exited",
            "Running": false,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 0,
            "ExitCode": 133,
            "Error": "mkdir /var/lib/docker/overlay2/0e609c8b4059d3e0f1273bd8cb9e9a95c3d76730798a391dd360054ac450f3ed/merged: no space left on device",
            "StartedAt": "2022-01-21T19:12:56.019305407Z",
            "FinishedAt": "2022-01-21T20:11:37.176315302Z"
        },
        ...
        "RestartCount": 4,
        ...
        "HostConfig": {
            "RestartPolicy": {
                "Name": "always",
                "MaximumRetryCount": 0
            },
            ...
        }
    }
]
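As the question itself shows, the restart policy gives up once the start attempt itself errors out (here the overlay mount cannot be created on a full disk), and the container then stays exited. One pragmatic workaround, which is not a built-in Docker feature, is a small watchdog run from cron or a systemd timer that starts exited always-restart containers once disk space is available again. A minimal sketch, assuming GNU df; the 90% threshold is an illustrative choice, not from the original question:

#!/bin/sh
# Hypothetical watchdog: once disk usage of /var/lib/docker is back under
# 90%, start any exited container whose restart policy is "always".
usage=$(df --output=pcent /var/lib/docker | tail -n 1 | tr -dc '0-9')
if [ "$usage" -lt 90 ]; then
    docker ps -a --filter status=exited --format '{{.ID}}' | while read -r id; do
        policy=$(docker inspect -f '{{.HostConfig.RestartPolicy.Name}}' "$id")
        [ "$policy" = "always" ] && docker start "$id"
    done
fi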

Related

Docker swarm service - running state but no logs

In an existing swarm, I created a service from a docker-compose YAML file using the docker stack command.
When I check the service via the docker service ls command, the new service shows up on the list, but it shows 0/1 in the REPLICAS column.
When I check the service using the command below, it shows Running as the Desired State:
docker service ps --no-trunc (service id)
When I check whether there is already a corresponding container for the service, I can see none.
When I try to access the service via the browser, it appears not to have started.
What makes this difficult is that I cannot see any logs to find out why this is happening:
docker service logs (service id)
I figured it may just be slow to start, but I waited for about half an hour and it was still in that state. I'm not sure how to find out the cause of this without any logs. Can anyone help me with this?
EDIT: Below is the result when I did a docker inspect of the service task
[
    {
        "ID": "wt2tdoz64j5wmci4gr3q3io2e",
        "Version": {
            "Index": 3407514
        },
        "CreatedAt": "2020-08-25T00:58:13.012900717Z",
        "UpdatedAt": "2020-08-25T00:58:13.012900717Z",
        "Labels": {},
        "Spec": {
            "ContainerSpec": {
                "Image": "my-ui-image:1.8.006",
                "Labels": {
                    "com.docker.stack.namespace": "myservice-stack"
                },
                "Env": [
                    "BACKEND_HOSTNAME=somewebsite.com",
                    "BACKEND_PORT=3421"
                ],
                "Privileges": {
                    "CredentialSpec": null,
                    "SELinuxContext": null
                },
                "Hosts": [
                    "10.152.30.18 somewebsite.com"
                ],
                "Isolation": "default"
            },
            "Resources": {},
            "Placement": {},
            "Networks": [
                {
                    "Target": "lt87emwtgbeztof5k2r1z2v27",
                    "Aliases": [
                        "myui_poc2"
                    ]
                }
            ],
            "ForceUpdate": 0
        },
        "ServiceID": "nbskoeofakkgxlgj3utgn45c5",
        "Slot": 1,
        "Status": {
            "Timestamp": "2020-08-25T00:58:13.012883476Z",
            "State": "new",
            "Message": "created",
            "PortStatus": {}
        },
        "DesiredState": "running"
    }
]
If you store your images in a private registry, then you must be logged in via docker login and deploy your services with docker stack deploy -c docker-compose.yml your_service --with-registry-auth.
From the docker service ps ... output, you will see a column with the task id. You can get further details of the state of that task by inspecting the task id:
docker inspect $taskid
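For example, to pull out just the status (and error, if any) of the newest task; the service name here is hypothetical:

taskid=$(docker service ps -q myservice-stack_myui | head -n 1)
docker inspect "$taskid" --format '{{json .Status}}'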
My guess is that your app is not redirecting its output to stdout, and that's why you don't get any output when doing docker service logs ....
I would start by looking at this: https://docs.docker.com/config/containers/logging/
How you redirect the app's output to stdout will depend on the language your app is developed in.

POD Definition - Deploying to DC/OS

I'm new to DC/OS and I have been really struggling to deploy a pod. I have tried the simple examples provided in the documentation, but the deployments remain stuck in the deploying stage. There are plenty of resources available, so that is not the issue.
I have 3 containers that I need to exist within a virtual network (queue, PDI, API). I have included my definition file, which starts with a single-container deployment; once I can deploy successfully, I will add the 2 additional containers to the definition. I have been looking at this example but have been unsuccessful.
I have successfully deployed the containers one at a time through Jenkins. All 3 images have been published and exist in the docker registry (JFrog). I have included an example of my marathon.json for one of those successful deployments. I would appreciate any feedback that can help. The service is stuck in the deploying stage, so I'm unable to drill down and see the logs via the command line or the UI.
containers.image = pdi-queue
artifactory server = repos.pdi.com:5010/pdi-queue
1 Container POD Definition - (Error: Stuck in Deployment Stage)
{
    "id": "/pdi-queue",
    "containers": [
        {
            "name": "simple-docker",
            "resources": {
                "cpus": 1,
                "mem": 128,
                "disk": 0,
                "gpus": 0
            },
            "image": {
                "kind": "DOCKER",
                "id": "repos.pdi.com:5010/pdi-queue",
                "portMappings": [
                    {
                        "hostPort": 0,
                        "containerPort": 15672,
                        "protocol": "tcp",
                        "servicePort": 15672
                    }
                ]
            },
            "endpoints": [
                {
                    "name": "web",
                    "containerPort": 80,
                    "protocol": [
                        "http"
                    ]
                }
            ],
            "healthCheck": {
                "http": {
                    "endpoint": "web",
                    "path": "/"
                }
            }
        }
    ],
    "networks": [
        {
            "mode": "container",
            "name": "dcos"
        }
    ]
}
Marathon.json - (No Error: Successful deployment)
{
    "id": "/pdi-queue",
    "backoffFactor": 1.15,
    "backoffSeconds": 1,
    "container": {
        "portMappings": [
            {"containerPort": 15672, "hostPort": 0, "protocol": "tcp", "servicePort": 15672, "name": "health"},
            {"containerPort": 5672, "hostPort": 0, "protocol": "tcp", "servicePort": 5672, "name": "queue"}
        ],
        "type": "DOCKER",
        "volumes": [],
        "docker": {
            "image": "repos.pdi.com:5010/pdi-queue",
            "forcePullImage": true,
            "privileged": false,
            "parameters": []
        }
    },
    "cpus": 0.1,
    "disk": 0,
    "healthChecks": [
        {
            "gracePeriodSeconds": 300,
            "intervalSeconds": 60,
            "maxConsecutiveFailures": 3,
            "portIndex": 0,
            "timeoutSeconds": 20,
            "delaySeconds": 15,
            "protocol": "MESOS_HTTP",
            "path": "/"
        }
    ],
    "instances": 1,
    "maxLaunchDelaySeconds": 3600,
    "mem": 512,
    "gpus": 0,
    "networks": [
        {
            "mode": "container/bridge"
        }
    ],
    "requirePorts": false,
    "upgradeStrategy": {
        "maximumOverCapacity": 1,
        "minimumHealthCapacity": 1
    },
    "killSelection": "YOUNGEST_FIRST",
    "unreachableStrategy": {
        "inactiveAfterSeconds": 300,
        "expungeAfterSeconds": 600
    },
    "fetch": [],
    "constraints": [],
    "labels": {
        "traefik.frontend.redirect.entryPoint": "https",
        "traefik.frontend.redirect.permanent": "true",
        "traefik.enable": "true"
    }
}
I may not know the answer to the issues you are running into, but I think I can share some pointers to help debug this.
First of all, if you are unable to view logs from the DC/OS UI, you can also go to <cluster_url>/mesos and find the simple-docker task under Completed Tasks. It would show up as TASK_FAILED. Click the Sandbox link on the right and then check the stderr and stdout files for the task. There might be some clues there as to why it failed.
Another place to look is the agent logs: note the Agent IP in the Mesos UI where the task failed, SSH into that node, and run sudo journalctl -u dcos-mesos-slave to find the agent log lines corresponding to the failing task.
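For example, with the pod id from the definition above (the agent IP placeholder is yours to fill in):

ssh <agent-ip>
sudo journalctl -u dcos-mesos-slave | grep -i pdi-queue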
One difference between running the application as a pod and the app definition you shared is that your app definition uses DOCKER as the containerizer for the task, while pods use the MESOS containerizer.
I also noticed that you are using a private docker registry for your docker images. One possibility is that your private registry's certificate is not trusted by Mesos even though Docker is already configured to trust it. In that case:
<copy the certificate(s) to /var/lib/dcos/pki/tls/certs>
cd /var/lib/dcos/pki/tls/certs
for file in *.crt; do ln -s "$file" "$(openssl x509 -hash -noout -in "$file")".0; done
This would need to be done on each agent node.
If it's not a certificate issue, it could be a docker registry credential issue. If the docker registry you are using requires authentication, you can specify docker credentials at install time (assuming the advanced install method): https://docs.mesosphere.com/1.11/installing/production/advanced-configuration/configuration-reference/#cluster-docker-credentials

docker restart policy doesn't work?

docker 1.10.2, ansible 2.0
I'm setting all containers to restart_policy:always.
When I'm doing a simple
service docker restart
and not all containers come up (a different set of containers is up every time). I assume this is at the application level (I have dependencies, for example ZooKeeper). Shouldn't the "failed" ones keep trying? Isn't that what the restart policy is all about?
docker inspect shows:
"State": {
"Status": "exited",
"Running": false,
"Paused": false,
"Restarting": false,
"OOMKilled": false,
"Dead": false,
"Pid": 0,
"ExitCode": 143,
"Error": "initlogger: Failed to initialize logging driver: dial tcp 127.0.0.1:24224: getsockopt: connection refused",
"StartedAt": "2016-05-23T13:22:55.849986549Z",
"FinishedAt": "2016-05-23T13:39:58.087484408Z"
},
"Image"

Mesos cannot deploy container from private Docker registry

I have a private Docker registry that is accessible at https://docker.somedomain.com (over standard port 443, not 5000). My infrastructure includes a Mesosphere setup with the Docker containerizer enabled. I am trying to deploy a specific container to a Mesos slave via Marathon; however, this always fails, with Mesos failing the task almost immediately and no data in the stderr and stdout of that sandbox.
I tried deploying an image from the standard Docker Registry and it appears to work fine. I'm having trouble figuring out what is wrong. My private Docker registry does not require password authentication (turned off for debugging this), AND if I shell into the Mesos slave instance and sudo su to root, I can run docker pull docker.somedomain.com/services/myapp successfully every time.
Here is my Marathon post data for starting the task:
{
    "id": "myapp",
    "cpus": 0.5,
    "mem": 64.0,
    "instances": 1,
    "container": {
        "type": "DOCKER",
        "docker": {
            "image": "docker.somedomain.com/services/myapp:2",
            "network": "BRIDGE",
            "portMappings": [
                { "containerPort": 7000, "hostPort": 0, "servicePort": 0, "protocol": "tcp" }
            ]
        },
        "volumes": [
            {
                "containerPath": "application.yml",
                "hostPath": "/var/myapp/application.yml",
                "mode": "RO"
            }
        ]
    },
    "healthChecks": [
        {
            "protocol": "HTTP",
            "portIndex": 0,
            "path": "/",
            "gracePeriodSeconds": 5,
            "intervalSeconds": 20,
            "maxConsecutiveFailures": 3
        }
    ]
}
I've been stuck on this for almost a day now, everything I've tried seems to be yielding the same result. Any insights on this would be much appreciated.
My versions:
Mesos: 0.22.1
Marathon: 0.8.2
Docker: 1.6.2
So this turned out to be an issue with volumes:
"volumes": [
{
"containerPath": "/application.yml",
"hostPath": "/var/myapp/application.yml",
"mode": "RO"
}
]
Mounting directly onto a root-level path of the container may be legal in docker, but Mesos appears not to handle this behavior. Modifying the containerPath to a non-root path resolves this, i.e.:
"volumes": [
{
"containerPath": "/var",
"hostPath": "/var/myapp",
"mode": "RW"
}
]
If it is a problem between Marathon and the registry, the answer should be in the http logs of your registry. If Marathon connects, there will be an entry. And the Mesos master log should contain a clue as well.
It doesn't really sound like a problem between Marathon and Registry though. Are you sure you have 'docker,mesos' in /etc/mesos-slave/containerizers?
Did you, despite having no authentication, try to follow "Using a Private Docker Repository"?
To supply credentials to pull from a private repository, add a .dockercfg to the uris field of your app. The $HOME environment variable will then be set to the same value as $MESOS_SANDBOX so Docker can automatically pick up the config file.
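A minimal sketch of what that looks like in the Marathon app definition, assuming the .dockercfg has been made fetchable at a hypothetical internal URL:

{
    "id": "myapp",
    "container": {
        "type": "DOCKER",
        "docker": { "image": "docker.somedomain.com/services/myapp:2" }
    },
    "uris": ["https://files.somedomain.com/.dockercfg"]
}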

Docker image format

I would like to build a Docker image without docker itself. I have looked at Packer (http://www.packer.io/docs/builders/docker.html), but it requires that Docker be installed on the builder host.
I have looked at the Docker Registry API documentation, but this information doesn't appear to be there.
I guess that the image is simply a tarball, but I would like to see a complete specification of the format, i.e. what exact format is required and whether any metadata files are required. I could try downloading an image from the registry and looking at what's inside, but there is no information on how to fetch the image itself.
The idea of my project is to implement a script that creates an image from artifacts I have compiled and uploads it to the registry. I would like to use OpenEmbedded for this purpose; essentially this would be an extension to BitBake.
The Docker image format is specified here: https://github.com/docker/docker/blob/master/image/spec/v1.md
The simplest possible image is a tar file containing the following:
repositories
uniqid/VERSION
uniqid/json
uniqid/layer.tar
Where VERSION contains 1.0, layer.tar contains the chroot contents and json/repositories are JSON files as specified in the spec above.
The resulting tar can be loaded into docker via docker load < image.tar
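A rough sketch of assembling that layout by hand from a prepared rootfs/ directory; the image id is faked, the repository name is arbitrary, and the json file below is a bare minimum, so whether docker load accepts it depends on the Docker version:

ID=$(head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n')   # fake 64-char hex image id
mkdir -p image/$ID
printf 1.0 > image/$ID/VERSION
tar -C rootfs -cf image/$ID/layer.tar .                     # the chroot contents
printf '{"id":"%s","created":"2014-09-01T00:00:00Z","architecture":"amd64","os":"linux"}' "$ID" > image/$ID/json
printf '{"myimage":{"latest":"%s"}}' "$ID" > image/repositories
tar -C image -cf image.tar .
docker load < image.tar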
After reading James Coyle's blog, I figured that the docker save and docker load commands are what I need.
> docker images
REPOSITORY TAG IMAGE ID CREATED VIRTUAL SIZE
progrium/consul latest e9fe5db22401 11 days ago 25.81 MB
> docker save e9fe5db22401 | tar x
> ls e9fe5db22401*
VERSION json layer.tar
The VERSION file contains only 1.0, and json contains quite a lot of information:
{
    "id": "e9fe5db224015ddfa5ee9dbe43b414ecee1f3108fb6ed91add11d2f506beabff",
    "parent": "68f9e4929a4152df9b79d0a44eeda042b5555fbd30a36f98ab425780c8d692eb",
    "created": "2014-08-20T17:54:30.98176344Z",
    "container": "3878e7e9b9935b7a1988cb3ebe9cd45150ea4b09768fc1af54e79b224bf35f26",
    "container_config": {
        "Hostname": "7f17ad58b5b8",
        "Domainname": "",
        "User": "",
        "Memory": 0,
        "MemorySwap": 0,
        "CpuShares": 0,
        "Cpuset": "",
        "AttachStdin": false,
        "AttachStdout": false,
        "AttachStderr": false,
        "PortSpecs": null,
        "ExposedPorts": {
            "53/udp": {},
            "8300/tcp": {},
            "8301/tcp": {},
            "8301/udp": {},
            "8302/tcp": {},
            "8302/udp": {},
            "8400/tcp": {},
            "8500/tcp": {}
        },
        "Tty": false,
        "OpenStdin": false,
        "StdinOnce": false,
        "Env": [
            "HOME=/",
            "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
            "SHELL=/bin/bash"
        ],
        "Cmd": [
            "/bin/sh",
            "-c",
            "#(nop) CMD []"
        ],
        "Image": "68f9e4929a4152df9b79d0a44eeda042b5555fbd30a36f98ab425780c8d692eb",
        "Volumes": {
            "/data": {}
        },
        "WorkingDir": "",
        "Entrypoint": [
            "/bin/start"
        ],
        "NetworkDisabled": false,
        "OnBuild": [
            "ADD ./config /config/"
        ]
    },
    "docker_version": "1.1.2",
    "author": "Jeff Lindsay <progrium@gmail.com>",
    "config": {
        "Hostname": "7f17ad58b5b8",
        "Domainname": "",
        "User": "",
        "Memory": 0,
        "MemorySwap": 0,
        "CpuShares": 0,
        "Cpuset": "",
        "AttachStdin": false,
        "AttachStdout": false,
        "AttachStderr": false,
        "PortSpecs": null,
        "ExposedPorts": {
            "53/udp": {},
            "8300/tcp": {},
            "8301/tcp": {},
            "8301/udp": {},
            "8302/tcp": {},
            "8302/udp": {},
            "8400/tcp": {},
            "8500/tcp": {}
        },
        "Tty": false,
        "OpenStdin": false,
        "StdinOnce": false,
        "Env": [
            "HOME=/",
            "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
            "SHELL=/bin/bash"
        ],
        "Cmd": [],
        "Image": "68f9e4929a4152df9b79d0a44eeda042b5555fbd30a36f98ab425780c8d692eb",
        "Volumes": {
            "/data": {}
        },
        "WorkingDir": "",
        "Entrypoint": [
            "/bin/start"
        ],
        "NetworkDisabled": false,
        "OnBuild": [
            "ADD ./config /config/"
        ]
    },
    "architecture": "amd64",
    "os": "linux",
    "Size": 0
}
The layer.tar file appears to be empty. So I inspected the parent and the grandparent; neither contained any files in its layer.tar either.
So, assuming that 4.0K is the standard size du reports for an empty tarball:
for layer in $(du -hs */layer.tar | grep -v 4.0K | cut -f2)
do
    (echo "$layer:"; tar tvf "$layer")
done
Running this shows that the layers contain simple incremental changes to the filesystem.
So one conclusion is that it's probably best to just use Docker to build the image and push it to the registry, just as Packer does.
The way to build an image from scratch is described in the docs.
It turns out that docker import - scratch doesn't care about what's in the tarball. It simply assumes that it is the rootfs.
> touch foo
> tar c foo | docker import - scratch
02bb6cd70aa2c9fbaba37c8031c7412272d804d50b2ec608e14db054fc0b9fab
> docker save 02bb6cd70aa2c9fbaba37c8031c7412272d804d50b2ec608e14db054fc0b9fab | tar x
> ls 02bb6cd70aa2c9fbaba37c8031c7412272d804d50b2ec608e14db054fc0b9fab/
VERSION json layer.tar
> tar tvf 02bb6cd70aa2c9fbaba37c8031c7412272d804d50b2ec608e14db054fc0b9fab/layer.tar
drwxr-xr-x 0/0 0 2014-09-01 13:46 ./
-rw-r--r-- 500/500 0 2014-09-01 13:46 foo
In terms of OpenEmbedded integration, it's probably best to build the rootfs tarball, which is something Yocto provides out of the box, use the official Python library to import the rootfs tarball with import_image(src='rootfs.tar', repository='scratch'), and then push it to the private registry.
This is not the most elegant solution, but that's how it would have to work at the moment. Otherwise one can probably just manage and deploy rootfs revisions in their own way and use docker import on the target host, which still won't be a nice fit, but is somewhat simpler.
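The same flow with the plain CLI, as an alternative sketch (the registry host and image name here are made up):

cat rootfs.tar | docker import - scratch
docker tag scratch registry.example.com:5000/myrootfs
docker push registry.example.com:5000/myrootfs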
