Docker container is automatically paused

Hi all, I have found a strange issue in my Docker Swarm standalone-mode infrastructure: on one server, and only one, a service is automatically paused about 5 seconds after starting. docker unpause does not help either.
020450186dc1 myservice:latest "/bin/sh -c '/app/os…" 13 minutes ago Up 13 minutes (Paused) server-15.myservice_2
After some time the service is paused again. I have no idea what could be pausing it or where I can find logging information about it. I know that a service can be killed, but I have never seen one paused like this. The other instances of the same service run stably.
UPD:
I don't see any useful information about this even in Docker debug mode. I just see that something pauses my container:
Apr 29 11:50:01 dockerd[1069]: time="2019-04-29T11:50:01.539113504+02:00" level=debug msg="Calling GET /_ping"
Apr 29 11:50:01 dockerd[1069]: time="2019-04-29T11:50:01.539687070+02:00" level=debug msg="Calling POST /v1.38/containers/debee0c7acb9aa0db2021719ee6ee3d084b88e47d9f2250b1af920f5271bf353/pause"
Apr 29 11:50:01 dockerd[1069]: time="2019-04-29T11:50:01+02:00" level=debug msg="event published" ns=moby topic="/tasks/paused" type=containerd.events.TaskPaused
Apr 29 11:50:01 dockerd[1069]: time="2019-04-29T11:50:01.570125095+02:00" level=debug msg=event module=libcontainerd namespace=moby topic=/tasks/paused
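For anyone trying to pin down the timing, the daemon's event stream can also be watched directly; this is a generic diagnostic, not a fix:
# Stream pause/unpause events live; the timestamps can then be matched
# against cron jobs or anything else scheduled on that host.
docker events --filter event=pause --filter event=unpause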

Related

Docker Swarm: Service keeps getting Ready & Shutdown

I have a couple of Docker Swarm nodes. I tried to create the service on the leader with the command below; the service creation process is still running after more than 40 minutes now.
docker service create \
--mode global \
--mount type=bind,src=/project/m32/,dst=/root/m32/ \
--publish mode=host,target=310,published=310 \
--publish mode=host,target=311,published=311 \
--publish mode=host,target=312,published=312 \
--publish mode=host,target=313,published=313 \
--constraint "node.labels.m32 == true" \
--name m32 \
local-registry/ubuntu:07
overall progress: 1 out of 2 tasks
ew0edluvz39p: ready [======================================> ]
kzc7jf7irsrh: running [==================================================>]
In the service's task list, tasks keep showing as Ready and then Shutdown:
$ docker service ps m32
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
s4q0rqrqbpdn m32.ew0edluvz39pazold0wnv2ean local-registry/ubuntu:07 sl-089 Ready Ready 1 second ago
r6vibgptm5oc \_ m32.ew0edluvz39pazold0wnv2ean local-registry/ubuntu:07 sl-089 Shutdown Complete 1 second ago
joq2p6c9jpnx \_ m32.ew0edluvz39pazold0wnv2ean local-registry/ubuntu:07 sl-089 Shutdown Complete 7 seconds ago
a5h8gac02vfx \_ m32.ew0edluvz39pazold0wnv2ean local-registry/ubuntu:07 sl-089 Shutdown Complete 13 seconds ago
f51stfsdlhvp \_ m32.ew0edluvz39pazold0wnv2ean local-registry/ubuntu:07 sl-089 Shutdown Complete 19 seconds ago
zqcbxkm4fwhr m32.kzc7jf7irsrhnx3kurcwqjb2j local-registry/ubuntu:07 sl-090 Ready Ready less than a second ago
za8efvi9x4yw \_ m32.kzc7jf7irsrhnx3kurcwqjb2j local-registry/ubuntu:07 sl-090 Shutdown Complete less than a second ago
$ sudo systemctl status docker.service
Nov 24 19:58:48 svr2 dockerd[2797]: time="2021-11-24T19:58:48.200421563+05:30" level=info msg="ignoring event" container=ea8b76fedb18159ba0cd8f279a9ca4264399c>
Nov 24 20:01:39 svr2 dockerd[2797]: time="2021-11-24T20:01:39.602028420+05:30" level=info msg="NetworkDB stats svr2(00bbf0799aa6) - netID:ubuzyty9mq4tb7xyb>
Nov 24 20:06:39 svr2 dockerd[2797]: time="2021-11-24T20:06:39.802013427+05:30" level=info msg="NetworkDB stats svr2(00bbf0799aa6) - netID:ubuzyty9mq4tb7xyb>
Nov 24 20:11:40 svr2 dockerd[2797]: time="2021-11-24T20:11:40.001992437+05:30" level=info msg="NetworkDB stats svr2(00bbf0799aa6) - netID:ubuzyty9mq4tb7xyb>
Nov 24 20:14:17 svr2 dockerd[2797]: time="2021-11-24T20:14:17.871605342+05:30" level=error msg="Error getting service xkauq9a599iv: service xkauq9a599iv not f>
Nov 24 20:14:52 svr2 dockerd[2797]: time="2021-11-24T20:14:52.833890158+05:30" level=error msg="Error getting service xkauq9a599iv: service xkauq9a599iv not f>
Nov 24 20:15:12 svr2 dockerd[2797]: time="2021-11-24T20:15:12.395692837+05:30" level=error msg="Error getting service pwaa8cvdd683: service pwaa8cvdd683 not f>
Nov 24 20:15:17 svr2 dockerd[2797]: time="2021-11-24T20:15:17.773200054+05:30" level=error msg="Error getting service xk0v0g2roypx: service xk0v0g2roypx not f>
Nov 24 20:16:18 svr2 dockerd[2797]: time="2021-11-24T20:16:18.529344060+05:30" level=error msg="Error getting service xk0v0g2roypx: service xk0v0g2roypx not f>
Nov 24 20:16:40 svr2 dockerd[2797]: time="2021-11-24T20:16:40.201888504+05:30" level=info msg="NetworkDB stats svr2(00bbf0799aa6) - netID:ubuzyty9mq4tb7xyb>
It looks like a loop keeps creating containers. What is wrong with my approach? Any help to fix this problem will be highly appreciated. Thanks.
You really need to pass --restart-max-attempts 5 to your docker service create command to ensure that services don't restart too many times in a loop. It's bad for the stability of Docker, and hard to debug. Rather have a task just give up and stop so you can see something is wrong and diagnose it.
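A minimal sketch, reusing the create command from the question with only the restart policy added (publish flags omitted here for brevity):
docker service create \
  --mode global \
  --restart-max-attempts 5 \
  --mount type=bind,src=/project/m32/,dst=/root/m32/ \
  --constraint "node.labels.m32 == true" \
  --name m32 \
  local-registry/ubuntu:07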
To see specifically what is wrong, you would want to look at the logs of each task. Use the individual task IDs to see why each one failed:
# The logs for a task
docker service logs s4q0rqrqbpdn
# A general breakdown of a task
docker inspect s4q0rqrqbpdn
Sometimes you need to track down the actual container for the task and inspect that. The docker container commands are not swarm aware, so:
# list the service showing the full task id.
docker service ps <service> --no-trunc
# then docker context use <node> / ssh <node> to switch to a node of interest.
# Then, the container name is "NAME"."ID" from the ps list. For example:
docker context use sl-089
docker container inspect m32.ew0edluvz39pazold0wnv2ean.s4q0rqrqbpdnABCDEFGABCDEFG
Inspecting the container can show if it was killed because of an OOM or certain other reasons that don't otherwise show up.
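As a concrete sketch, reusing the placeholder container name from above, the relevant state fields can be printed directly:
# Prints whether the container was OOM-killed, its exit code, and any runtime error
docker container inspect \
  --format 'OOMKilled={{.State.OOMKilled}} ExitCode={{.State.ExitCode}} Error={{.State.Error}}' \
  m32.ew0edluvz39pazold0wnv2ean.s4q0rqrqbpdnABCDEFGABCDEFG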

Cannot connect to the Docker daemon after failed pull

When I try to pull a certain Docker image, the pull fails and then prevents me from connecting to the Docker daemon again until I reboot my laptop. The image in question is an official Jupyter image which works fine on my other machine. Restarting the daemon does not help, but rebooting my laptop does.
I already tried docker system prune -a; that's why there are no images on my laptop anymore. Does somebody have an idea how to fix this problem?
I think the problem might be connected to one of the images not finishing its extraction.
EDIT
I have the same problem with an Alpine image, see below.
me#mylaptop $ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
me#mylaptop $ docker pull jupyter/datascience-notebook
Using default tag: latest
latest: Pulling from jupyter/datascience-notebook
e6ca3592b144: Extracting [==================================================>] 28.56MB/28.56MB
534a5505201d: Download complete
990916bd23bb: Download complete
979cd14ae800: Download complete
5e8b9f8fa9e0: Download complete
6f224ed88dc4: Download complete
6ee9ec4a62a8: Download complete
7a1ae22ba760: Download complete
a1602338a8d7: Download complete
fce5135a7ea1: Download complete
e62a1c9017ef: Download complete
a5049ad1c512: Download complete
ec06c1612b0a: Download complete
acceda87b341: Download complete
939052532b6f: Download complete
d2dee4cc07fe: Download complete
4fe5e9dd4fad: Download complete
8fd08517e0c6: Download complete
7105a3ca8c38: Download complete
66c0798f609e: Download complete
94f3fc35ed38: Download complete
aa68263474a3: Download complete
6e7d1433394b: Download complete
f5902e69d9b7: Download complete
490bb991b4de: Download complete
fab6e92b04fa: Download complete
failed to register layer: Error processing tar file(exit status 1): Error cleaning up after pivot: remove /.pivot_root297865553: device or resource busy
me#mylaptop $ docker images
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
me#mylaptop $ sudo systemctl start docker
me#mylaptop $ systemctl status docker
● docker.service - Docker Application Container Engine
Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
Active: active (running) since Wed 2020-09-30 08:11:12 CEST; 15min ago
TriggeredBy: ● docker.socket
Docs: https://docs.docker.com
Main PID: 908 (dockerd)
Tasks: 10
Memory: 140.8M
CGroup: /system.slice/docker.service
└─908 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
sep 30 08:11:11 mylaptop dockerd[908]: time="2020-09-30T08:11:11.992016198+02:00" level=warning msg="Your kernel does not support cgroup rt runtime"
sep 30 08:11:11 mylaptop dockerd[908]: time="2020-09-30T08:11:11.992433459+02:00" level=info msg="Loading containers: start."
sep 30 08:11:12 mylaptop dockerd[908]: time="2020-09-30T08:11:12.227615723+02:00" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can b>
sep 30 08:11:12 mylaptop dockerd[908]: time="2020-09-30T08:11:12.296603004+02:00" level=info msg="Loading containers: done."
sep 30 08:11:12 mylaptop dockerd[908]: time="2020-09-30T08:11:12.486944893+02:00" level=warning msg="Not using native diff for overlay2, this may cause degraded performance for building images: >
sep 30 08:11:12 mylaptop dockerd[908]: time="2020-09-30T08:11:12.487273874+02:00" level=info msg="Docker daemon" commit=48a66213fe graphdriver(s)=overlay2 version=19.03.12-ce
sep 30 08:11:12 mylaptop dockerd[908]: time="2020-09-30T08:11:12.491959213+02:00" level=info msg="Daemon has completed initialization"
sep 30 08:11:12 mylaptop dockerd[908]: time="2020-09-30T08:11:12.530816090+02:00" level=info msg="API listen on /run/docker.sock"
sep 30 08:11:12 mylaptop systemd[1]: Started Docker Application Container Engine.
sep 30 08:23:36 mylaptop dockerd[908]: time="2020-09-30T08:23:36.941202710+02:00" level=info msg="Attempting next endpoint for pull after error: failed to register layer: Error processing tar fi>
me#mylaptop $ docker images
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
me#mylaptop $ docker pull alpine:3.12.0
3.12.0: Pulling from library/alpine
df20fa9351a1: Extracting [==================================================>] 2.798MB/2.798MB
failed to register layer: Error processing tar file(exit status 1): Error cleaning up after pivot: remove /.pivot_root517304538: device or resource busy
Solved it. The problem was that my kernel was (or had become) too old.
The warning below in the systemctl output led me to this post on forums.docker.com:
me#mylaptop $ systemctl status docker
...
sep 30 08:11:11 mylaptop dockerd[908]: time="2020-09-30T08:11:11.992016198+02:00" level=warning msg="Your kernel does not support cgroup rt runtime"
...
I'm running Manjaro, so I upgraded my kernel with this command:
sudo mhwd-kernel -i linux54
After which docker worked again.
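For anyone checking whether they are on the same track, the running kernel version can be compared before and after the upgrade (a generic command, nothing Manjaro specific):
uname -r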

Docker + docker-compose up + Cannot start service

We have a docker-compose.yml that contains the configuration for Kafka, ZooKeeper and Schema Registry.
When we start Docker Compose we get the following errors:
docker-compose up -d
Starting kafka-docker-final_zookeeper3_1 ... error
ERROR: for kafka-docker-final_zookeeper3_1 Cannot start service zookeeper3: network dd321821f3cb4a715c31e04b32bff2cf206c85ed5581b01b1c6a94ffa45f330e not found
ERROR: for zookeeper3 Cannot start service zookeeper3: network dd321821f3cb4a715c31e04b32bff2cf206c85ed5581b01b1c6a94ffa45f330e not found
ERROR: Encountered errors while bringing up the project.
and
systemctl status docker
● docker.service - Docker Application Container Engine
Loaded: loaded (/usr/lib/systemd/system/docker.service; disabled; vendor preset: disabled)
Active: active (running) since Thu 2020-03-19 07:57:29 UTC; 1h 55min ago
Docs: https://docs.docker.com
Main PID: 12105 (dockerd)
Tasks: 30
Memory: 654.6M
CGroup: /system.slice/docker.service
└─12105 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
Mar 19 07:57:29 master3 dockerd[12105]: time="2020-03-19T07:57:29.610005717Z" level=info msg="Daemon has completed initialization"
Mar 19 07:57:29 master3 dockerd[12105]: time="2020-03-19T07:57:29.631338594Z" level=info msg="API listen on /var/run/docker.sock"
Mar 19 07:57:29 master3 systemd[1]: Started Docker Application Container Engine.
Mar 19 07:58:12 master3 dockerd[12105]: time="2020-03-19T07:58:12.352833676Z" level=warning msg="Error getting v2 registry: Get https://registry-1.docker.io/v2/: net/http: re...ng headers)"
Mar 19 07:58:12 master3 dockerd[12105]: time="2020-03-19T07:58:12.352916724Z" level=info msg="Attempting next endpoint for pull after error: Get https://registry-1.docker.io/...ng headers)"
Mar 19 07:58:12 master3 dockerd[12105]: time="2020-03-19T07:58:12.353019409Z" level=error msg="Handler for POST /v1.22/images/create returned error: Get https://registry-1.do...ng headers)"
Mar 19 08:03:47 master3 dockerd[12105]: time="2020-03-19T08:03:47.255058871Z" level=warning msg="error locating sandbox id 20ce3c5b6383ad92dae848c3de1d91bbfff9306ca86fdc90fae...c not found"
Mar 19 08:03:47 master3 dockerd[12105]: time="2020-03-19T08:03:47.263976715Z" level=error msg="ef808aa411ae0aaef0920397c77b6d9a327bdd1651877402fe1fc142a513af8a cleanup: faile...h container"
Mar 19 09:50:43 master3 dockerd[12105]: time="2020-03-19T09:50:43.920457464Z" level=warning msg="error locating sandbox id 20ce3c5b6383ad92dae848c3de1d91bbfff9306ca86fdc90fae...c not found"
Mar 19 09:50:43 master3 dockerd[12105]: time="2020-03-19T09:50:43.927744636Z" level=error msg="ef808aa411ae0aaef0920397c77b6d9a327bdd1651877402fe1fc142a513af8a cleanup: faile...h container"
Hint: Some lines were ellipsized, use -l to show in full.
Regarding the error
Cannot start service zookeeper3: network dd321821f3cb4a715c31e04b32bff2cf206c85ed5581b01b1c6a94ffa45f330e not found
how do we fix this issue?
docker container ls -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
6c729cb0bb2c confluentinc/cp-schema-registry:latest "/etc/confluent/dock…" 3 months ago Exited (255) 2 hours ago 0.0.0.0:8081->8081/tcp kafka-docker-schemaregistry_1
ef808aa411ae confluentinc/cp-zookeeper:latest "/etc/confluent/dock…" 3 months ago Exited (255) 2 hours ago
docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
docker network ls
NETWORK ID NAME DRIVER SCOPE
e5566ab8ca6d bridge bridge local
2467d9664593 host host local
c509e32d0d67 kafka-docker-default bridge local
08966157382c none null local
We fixed the issue with the following procedure:
# docker container ls -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
6c729cb0bb2c confluentinc/cp-schema-registry:latest "/etc/confluent/dock…" 3 months ago Exited (255) 5 hours ago 0.0.0.0:8081->8081/tcp kafka-docker-schemaregistry_1
ef808aa411ae confluentinc/cp-zookeeper:latest "/etc/confluent/dock…" 3 months ago Exited (255) 5 hours ago kafka-docker-zookeeper3_1
# docker container rm 6c729cb0bb2c
# docker container rm ef808aa411ae
systemctl stop docker
systemctl start docker
docker-compose up -d
Creating kafka-docker-zookeeper3_1 ... done
Creating kafka-docker-kafka3_1 ... done
Creating kafka-docker-schemaregistry_1 ... done
docker-compose ps
Name Command State Ports
------------------------------------------------------------------------------------------------------------------------------------------------
kafka-docker-kafka3_1 /etc/confluent/docker/run Up 0.0.0.0:9092->9092/tcp
kafka-docker-schemaregistry_1 /etc/confluent/docker/run Up 0.0.0.0:8081->8081/tcp
kafka-docker-zookeeper3_1 /etc/confluent/docker/run Up 0.0.0.0:2181->2181/tcp, 0.0.0.0:2888->2888/tcp, 0.0.0.0:3888->3888/tcp
I had the same problem and for me it was enough to:
sudo docker-compose down
sudo docker-compose up

Docker swarm can be accessed only on nodes where the container is running

I'm currently running Docker Swarm on 3 nodes. First I created a network with
docker network create -d overlay xx_net
and after that a service with
docker service create --network xxx_net --replicas 1 -p 12345:12345 --name nameofservice nameofimage:1
If I read correctly, this uses the routing mesh (which is fine for me). But I can only access the service on the IP of the node where the container is running, even though it should be available on every node's IP.
If I drain a node, the container starts up on a different node and is then available on that new IP.
** More information added below:
I rebooted all servers: 3 workers, one of which is the manager.
After the boot, everything seems to work OK!
I'm using the rabbitmq image from Docker Hub. The Dockerfile is quite small: FROM rabbitmq:3-management. The container has been started on worker 2.
I can connect to rabbitmq's management page from all workers: worker1-ip:15672, worker2-ip:15672, worker3-ip:15672, so I think all the needed ports are open.
After about 1 hour, the rabbitmq container was moved from worker 2 to worker 3; I do not know the reason.
After that I can no longer connect via worker1-ip:15672 or worker2-ip:15672, but worker3-ip:15672 still works!
I drained worker3 with docker node update --availability drain worker3
The container started on worker1.
After that I can only connect via worker1-ip:15672, no longer from worker2 or worker3.
One more test:
I restarted the Docker service on all workers, and everything works again?!
- let's wait a few hours...
Today's status:
2 of 3 nodes are working OK. From the manager's service log:
Jul 12 07:53:32 dockerswarmmanager dockerd[7180]: time="2017-07-12T07:53:32.787953754Z" level=info msg="memberlist: Marking dockerswarmworker2-459b4229d652 as failed, suspect timeout reached"
Jul 12 07:53:39 dockerswarmmanager dockerd[7180]: time="2017-07-12T07:53:39.787783458Z" level=info msg="memberlist: Marking dockerswarmworker2-459b4229d652 as failed, suspect timeout reached"
Jul 12 07:55:27 dockerswarmmanager dockerd[7180]: time="2017-07-12T07:55:27.790564790Z" level=info msg="memberlist: Marking dockerswarmworker2-459b4229d652 as failed, suspect timeout reached"
Jul 12 07:55:41 dockerswarmmanager dockerd[7180]: time="2017-07-12T07:55:41.787974530Z" level=info msg="memberlist: Marking dockerswarmworker2-459b4229d652 as failed, suspect timeout reached"
Jul 12 07:56:33 dockerswarmmanager dockerd[7180]: time="2017-07-12T07:56:33.027525926Z" level=error msg="logs call failed" error="container not ready for logs: context canceled" module="node/agent/taskmanager" node.id=b6vnaouyci7b76ol1apq96zxx
Jul 12 07:56:33 dockerswarmmanager dockerd[7180]: time="2017-07-12T07:56:33.027668473Z" level=error msg="logs call failed" error="container not ready for logs: context canceled" module="node/agent/taskmanager" node.id=b6vnaouyci7b76ol1apq96zxx
Jul 12 08:13:22 dockerswarmmanager dockerd[7180]: time="2017-07-12T08:13:22.787796692Z" level=info msg="memberlist: Marking dockerswarmworker2-03ec8453a81f as failed, suspect timeout reached"
Jul 12 08:21:37 dockerswarmmanager dockerd[7180]: time="2017-07-12T08:21:37.788694522Z" level=info msg="memberlist: Marking dockerswarmworker2-03ec8453a81f as failed, suspect timeout reached"
Jul 12 08:24:01 dockerswarmmanager dockerd[7180]: time="2017-07-12T08:24:01.525570127Z" level=error msg="logs call failed" error="container not ready for logs: context canceled" module="node/agent/taskmanager" node.id=b6vnaouyci7b76ol1apq96zxx
Jul 12 08:24:01 dockerswarmmanager dockerd[7180]: time="2017-07-12T08:24:01.525713893Z" level=error msg="logs call failed" error="container not ready for logs: context canceled" module="node/agent/taskmanager" node.id=b6vnaouyci7b76ol1apq96zxx
And from the problematic worker's Docker log:
Jul 12 08:20:47 dockerswarmworker2 dockerd[677]: time="2017-07-12T08:20:47.486202716Z" level=error msg="Bulk sync to node h999-99-999-185.scenegroup.fi-891b24339f8a timed out"
Jul 12 08:21:38 dockerswarmworker2 dockerd[677]: time="2017-07-12T08:21:38.288117026Z" level=warning msg="memberlist: Refuting a dead message (from: h999-99-999-185.scenegroup.fi-891b24339f8a)"
Jul 12 08:21:39 dockerswarmworker2 dockerd[677]: time="2017-07-12T08:21:39.404554761Z" level=warning msg="Neighbor entry already present for IP 10.255.0.3, mac 02:42:0a:ff:00:03"
Jul 12 08:21:39 dockerswarmworker2 dockerd[677]: time="2017-07-12T08:21:39.404588738Z" level=warning msg="Neighbor entry already present for IP 104.198.180.163, mac 02:42:0a:ff:00:03"
Jul 12 08:21:39 dockerswarmworker2 dockerd[677]: time="2017-07-12T08:21:39.404609273Z" level=warning msg="Neighbor entry already present for IP 10.255.0.6, mac 02:42:0a:ff:00:06"
Jul 12 08:21:39 dockerswarmworker2 dockerd[677]: time="2017-07-12T08:21:39.404622776Z" level=warning msg="Neighbor entry already present for IP 104.198.180.163, mac 02:42:0a:ff:00:06"
Jul 12 08:21:47 dockerswarmworker2 dockerd[677]: time="2017-07-12T08:21:47.486007317Z" level=error msg="Bulk sync to node h999-99-999-185.scenegroup.fi-891b24339f8a timed out"
Jul 12 08:22:47 dockerswarmworker2 dockerd[677]: time="2017-07-12T08:22:47.485821037Z" level=error msg="Bulk sync to node h999-99-999-185.scenegroup.fi-891b24339f8a timed out"
Jul 12 08:23:17 dockerswarmworker2 dockerd[677]: time="2017-07-12T08:23:17.630602898Z" level=error msg="Bulk sync to node h999-99-999-185.scenegroup.fi-891b24339f8a timed out"
And this from the working worker:
Jul 12 08:33:09 h999-99-999-185.scenegroup.fi dockerd[10330]: time="2017-07-12T08:33:09.219973777Z" level=warning msg="Neighbor entry already present for IP 10.0.0.3, mac xxxxx"
Jul 12 08:33:09 h999-99-999-185.scenegroup.fi dockerd[10330]: time="2017-07-12T08:33:09.220539013Z" level=warning msg="Neighbor entry already present for IP "managers ip here", mac xxxxxx"
I restarted Docker on the problematic worker and it started to work again.
I'll keep following...
** Today's results:
2 of the workers are available, one is not.
I didn't do a thing.
After 4 hours of the swarm being left alone, everything seems to work again?!
Services have been moved from one worker to another without any good reason; everything points to a communication problem.
Quite confusing.
Upgrade to Docker 17.06.
Ingress overlay networking was broken for a long time, until about 17.06-rc3.
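To confirm what each node is actually running before upgrading (a generic check, not specific to this setup):
docker version --format '{{.Server.Version}}'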

My coreos/fleet-deployed service is dying and I can't tell why

I'm trying to deploy nsqlookupd using fleet on a shiny new CoreOS cluster in EC2. Here is my systemd unit file:
[Unit]
Description=nsqlookupd service
After=docker.service
Requires=docker.service
[Service]
EnvironmentFile=/etc/environment
ExecStartPre=-/usr/bin/docker kill nsqlookupd
ExecStartPre=-/usr/bin/docker rm nsqlookupd
ExecStart=/usr/bin/docker run -d --name=nsqlookupd -e BROADCAST_ADDRESS=$COREOS_PUBLIC_IPV4 -p 4160:4160 -p 4161:4161 mikedewar/nsqlookupd
ExecStartPost=/usr/bin/etcdctl set /nsqlookupd_broadcast_address $COREOS_PUBLIC_IPV4
ExecStop=/usr/bin/docker stop -t 1 nsqlookupd
ExecStopPost=/usr/bin/etcdctl rm /nsqlookupd_broadcast_address
I've verified the container works fine if I just run the ExecStart command. My docker logs just look like
~ $ docker logs nsqlookupd
2014/08/08 02:23:58 nsqlookupd v0.2.29-alpha (built w/go1.2.2)
2014/08/08 02:23:58 TCP: listening on [::]:4160
2014/08/08 02:23:58 HTTP: listening on [::]:4161
and my fleetctl journal looks like
$ fleetctl journal nsqlookupd.service
-- Logs begin at Sun 2014-08-03 12:49:00 UTC, end at Fri 2014-08-08 02:30:06 UTC. --
Aug 08 02:23:57 ip-10-147-9-249 systemd[1]: Starting nsqlookupd service...
Aug 08 02:23:57 ip-10-147-9-249 docker[6140]: Error response from daemon: No such container: nsqlookupd
Aug 08 02:23:57 ip-10-147-9-249 docker[6140]: 2014/08/08 02:23:57 Error: failed to kill one or more containers
Aug 08 02:23:57 ip-10-147-9-249 docker[6148]: Error response from daemon: No such container: nsqlookupd
Aug 08 02:23:57 ip-10-147-9-249 docker[6148]: 2014/08/08 02:23:57 Error: failed to remove one or more containers
Aug 08 02:23:57 ip-10-147-9-249 etcdctl[6157]: 54.198.93.169
Aug 08 02:23:57 ip-10-147-9-249 systemd[1]: Started nsqlookupd service.
Aug 08 02:23:57 ip-10-147-9-249 docker[6155]: 0fce4465f61c092541ba9d4c4e89ce13c4d6bedc096519034ed585d7adb5e0d7
Aug 08 02:23:59 ip-10-147-9-249 docker[6194]: nsqlookupd
both of which look just fine. But the container dies quietly, and my fleetctl list-units gives
$ fleetctl list-units
UNIT STATE LOAD ACTIVE SUB DESC MACHINE
nsqlookupd.service launched loaded deactivating stop nsqlookupd service 1320802c.../10.147.9.249
Running docker images is a little worrying:
$ docker images
REPOSITORY TAG IMAGE ID CREATED VIRTUAL SIZE
<none> <none> 8ef9d8f9d18d 9 minutes ago 710 MB
mikedewar/nsqadmin latest 432af572bda8 2 days ago 710 MB
mikedewar/nsqd latest 00bd4e474964 2 days ago 710 MB
<none> <none> adf0ed97208e 3 weeks ago 710 MB
mikedewar/nsqlookupd latest 2219c0e783d9 3 weeks ago 710 MB
<none> <none> 35d2212f8932 3 weeks ago 710 MB
mikedewar/nsq latest f9794fe056e1 3 weeks ago 710 MB
busybox latest a9eb17255234 9 weeks ago 2.433 MB
zmarcantel/cassandra latest b1168b45b4f8 4 months ago 738 MB
as I've been updating mikedewar/nsqlookupd quite regularly over the last 3 weeks. Maybe that's the time I first pushed something to docker hub? I'd love to know that the image I'm working with is the up-to-date one. I've tried docker rmi mikedewar/nsqlookupd followed by docker pull mikedewar/nsqlookupd but the CREATED column still says it was created 3 weeks ago.
I don't know if this is useful, but the ExecStopPost=/usr/bin/etcdctl rm /nsqlookupd_broadcast_address command seems to have worked - the etcdctl log line in the fleet journal suggests I managed to set the key to my IP, but after the container dies I can't get that key from etcd.
Any help on where to look next for clues, or any ideas why this is happening would be greatly appreciated! As is probably clear I'm rather new to this sort of thing...
You shouldn't run Docker containers in detached mode in a unit file. Your ExecStart contains it: ExecStart=/usr/bin/docker run -d. This will cause systemd to think the process exited immediately, since it was forked into the background.
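A sketch of the corrected line, identical to the original apart from dropping -d, so the docker client stays in the foreground and systemd can supervise it:
ExecStart=/usr/bin/docker run --name=nsqlookupd -e BROADCAST_ADDRESS=$COREOS_PUBLIC_IPV4 -p 4160:4160 -p 4161:4161 mikedewar/nsqlookupd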
As for managing versions, if you want to be absolutely sure you're getting the latest copy, you should tag your images and then pull mikedewar/nsqlookupd:1.2.3. You can increment this tag each time in your fleet unit file.
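One way to make the pinned version explicit in the unit file (the 1.2.3 tag above is purely illustrative) is to pull it in an ExecStartPre:
ExecStartPre=/usr/bin/docker pull mikedewar/nsqlookupd:1.2.3
# ...and reference mikedewar/nsqlookupd:1.2.3 in the ExecStart line as well.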
