Manager node stuck in preparing state for service - docker

I have a swarm of 3 managers, 3 workers as below:
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS ENGINE VERSION
ocnuul8dcbrf4gjtdzv06t0yf * manager1 Ready Active Leader 18.06.0-ce
z297dhtfon50pt4hllu4qfz6i manager2 Ready Active Reachable 18.06.0-ce
ondpdzyq06pd3oysn34p4xi9o manager3 Ready Active Reachable 18.06.0-ce
0bls0g65gee1wbv7wr6rwgbjk worker1 Ready Active 18.06.0-ce
mxtg28slr5rvljrayaf4k1wkk worker2 Ready Active 18.06.0-ce
hqu1436bvbar9srbr34er3fl4 worker3 Ready Active 18.06.0-ce
All managers are available.
However, when I deploy a service on the swarm, the task on manager3 is stuck in the Preparing state:
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
lmhpsgeqax13 web-fe.1 nigelpoulton/pluralsight-docker-ci:latest worker1 Running Running 19 minutes ago
nivas3gkh0pa web-fe.2 nigelpoulton/pluralsight-docker-ci:latest worker3 Running Running 19 minutes ago
5plwh46jri3t web-fe.3 nigelpoulton/pluralsight-docker-ci:latest worker2 Running Running 19 minutes ago
l1ykqzgzbgmb web-fe.4 nigelpoulton/pluralsight-docker-ci:latest manager2 Running Running 19 minutes ago
q788hrm6rba9 web-fe.5 nigelpoulton/pluralsight-docker-ci:latest manager3 Running Preparing 21 minutes ago
I could see in /var/log/docker.log on manager3 that it is failing while trying to establish a connection with manager2's IP (192.168.99.105:2377):
7T00:10:54.230023789Z" level=warning msg="grpc: addrConn.createTransport failed to connect to {192.168.99.105:2377 0 <nil>}. Err :connection error: desc = \"transport: Err
7T00:10:54.230049538Z" level=info msg="pickfirstBalancer: HandleSubConnStateChange: 0xc420a86940, TRANSIENT_FAILURE" module=grpc
Since manager1 is the leader, I was expecting manager3 to report its progress to manager1 while preparing, but I don't understand why it's trying to connect to manager2.
Could someone help me understand? Also, how do I recover from this and move the task on manager3 from the Preparing to the Running state?
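For reference, a minimal sketch of checks that should narrow this down (web-fe is the service name from the output above; 192.168.99.105 is manager2's address from the log):

# On manager3: show the full task error, if the scheduler recorded one
docker service ps --no-trunc web-fe

# Check whether manager3 can reach manager2 on the swarm port; managers
# maintain gRPC connections to every other manager, not only the leader
nc -zv 192.168.99.105 2377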
Regards

Related

How to access task and node information within docker container service

With docker service ps I can get the following running tasks and associated nodes. I am wondering how each running task can retrieve its node ID and task name. Is there any environment variable to access those? If not, how can I set one?
$ docker service ps appservice
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
0qihejybwf1x appservice.1 appservice:3.0.5 manager1 Running Running 8 seconds
bk658fpbex0d appservice.2 appservice:3.0.5 worker2 Running Running 9 seconds
5ls5s5fldaqg appservice.3 appservice:3.0.5 worker1 Running Running 9 seconds
8ryt076polmc appservice.4 appservice:3.0.5 worker1 Running Running 9 seconds
1x0v8yomsncd appservice.5 appservice:3.0.5 manager1 Running Running 8 seconds
71v7je3el7rr appservice.6 appservice:3.0.5 worker2 Running Running 9 seconds
4l3zm9b7tfr7 appservice.7 appservice:3.0.5 worker2 Running Running 9 seconds
9tfpyixiy2i7 appservice.8 appservice:3.0.5 worker1 Running Running 9 seconds
3w1wu13yupln appservice.9 appservice:3.0.5 manager1 Running Running 8 seconds
8eaxrb2fqpbn appservice.10 appservice:3.0.5 manager1 Running Running 8 seconds
I was able to achieve this by setting environment variables in the docker-compose file. To retrieve the task ID I added the following lines to the service configuration:
environment:
  - MYTASKID={{.Task.ID}}
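The same Go-template placeholders cover the node ID and task name too; a compose sketch (the MYTASKNAME and MYNODEID variable names are arbitrary, not anything Docker defines):

environment:
  - MYTASKID={{.Task.ID}}
  - MYTASKNAME={{.Task.Name}}
  - MYNODEID={{.Node.ID}}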

Trouble connecting to my docker app via VM IP

Solved at bottom
But why do I have to append :4000?
I'm following the docker get-started Guide here, https://docs.docker.com/get-started/part4/
I'm fairly certain I've done everything correctly, but am wondering why I can't connect to view the app after deploying it.
I've set my environment to my VM, myvm1, for the following commands.
docker container ls -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
099e16249604 beresj/getting-started:part2 "python app.py" 12 seconds ago Up 12 seconds 80/tcp getstartedlab_web.5.y0e2k1r1ev47u24e5iufkyn3i
6f9a24b343a7 beresj/getting-started:part2 "python app.py" 12 seconds ago Up 12 seconds 80/tcp getstartedlab_web.3.1pls3osj3uhsb5dyqtt4ts8j6
docker image ls -a
REPOSITORY TAG IMAGE ID CREATED SIZE
beresj/getting-started <none> e290b6208c21 22 hours ago 131MB
docker stack ls
NAME SERVICES ORCHESTRATOR
getstartedlab 1 Swarm
docker-machine ls
NAME ACTIVE DRIVER STATE URL SWARM DOCKER ERRORS
myvm1 * virtualbox Running tcp://192.168.99.100:2376 v18.09.6
myvm2 - virtualbox Running tcp://192.168.99.101:2376 v18.09.6
docker stack ps getstartedlab
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
vkxx79fh3h85 getstartedlab_web.1 beresj/getting-started:part2 myvm2 Running Running 3 minutes ago
qexbaa3wz0pd getstartedlab_web.2 beresj/getting-started:part2 myvm2 Running Running 3 minutes ago
1pls3osj3uhs getstartedlab_web.3 beresj/getting-started:part2 myvm1 Running Running 3 minutes ago
ucuwen1jrncf getstartedlab_web.4 beresj/getting-started:part2 myvm2 Running Running 3 minutes ago
y0e2k1r1ev47 getstartedlab_web.5 beresj/getting-started:part2 myvm1 Running Running 3 minutes ago
curl 192.168.99.100
curl: (7) Failed to connect to 192.168.99.100 port 80: Connection refused
docker info
Containers: 2
Running: 2
Paused: 0
Stopped: 0
Images: 1
Server Version: 18.09.6
...
Swarm: active
NodeID: 0p9qrax9h3by0fupat8ufkfbq
Is Manager: true
ClusterID: 7vnqdk85n8jx6fqck9k7dv2ka
Managers: 1
Nodes: 2
Default Address Pool: 10.0.0.0/8
...
Node Address: 192.168.99.100
Manager Addresses:
192.168.99.100:2377
...
Kernel Version: 4.14.116-boot2docker
Operating System: Boot2Docker 18.09.6 (TCL 8.2.1)
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 989.4MiB
Name: myvm1
I would expect to see the same thing I saw when I just ran it on my local machine, instead of on a VM in a swarm (I think I have the lingo correct?).
Not sure how to check open ports.
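For reference, one way to list the listening ports on the VM (a sketch, assuming the boot2docker image ships BusyBox netstat):

docker-machine ssh myvm1 "netstat -tln"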
Again: this works if I simply remove the stack, unset the docker-machine environment, and just run:
docker stack deploy -c docker-compose.yml getstartedlab
locally, not on the VM.
Thank you in advance. (Also, I'm new, hence the get-started guide, so I appreciate any help.)
Edit
It works if I append :4000 to the VM IP in my URL, e.g. 192.168.99.100:4000 or 192.168.99.101:4000. It shows the two container IDs listed in docker container ls for myvm1, and the other three are from myvm2. Could anyone tell me why I have to append 4000? Is it because I have ports: "4000:80" in my docker-compose.yml?
Not sure if this will help, but if you use docker inspect <instance_id_here> you can see which ports are exposed.
Exposed ports aren't open ports. You would need to bind a host port to a container port in the docker-compose.yml in order for it to be open.
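That is exactly what the 4000:80 mapping in the question does, which is why the app answers on :4000. A minimal compose sketch (image name taken from the question):

version: "3"
services:
  web:
    image: beresj/getting-started:part2
    ports:
      - "4000:80"   # bind host port 4000 to container port 80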

How to use curl -4 http://localhost in the Docker part 3 tutorial?

Using the Docker tutorial I'm stuck at this part: https://docs.docker.com/get-started/part3/#run-your-new-load-balanced-app
I use curl -4 http://localhost but I get a curl: (7) Failed to connect to localhost port 80: Connection refused error.
output of previous step:
docker service ps getstartedlab_web
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
kqu5qggifnlm getstartedlab_web.1 s1mpl3/get-started:part2 moby Running Running 29 minutes ago
prhrmm6hpop3 getstartedlab_web.2 s1mpl3/get-started:part2 moby Running Running 29 minutes ago
ytrwy5gxp2rk getstartedlab_web.3 s1mpl3/get-started:part2 moby Running Running 29 minutes ago
mayvauijghbj getstartedlab_web.4 s1mpl3/get-started:part2 moby Running Running 29 minutes ago
r625x2k7n6ta getstartedlab_web.5 s1mpl3/get-started:part2 moby Running Running 29 minutes ago
So the ERROR and PORTS columns are empty.
What should I analyse to fix this issue?
For part 4, when you deploy to your swarm, you get a URL with docker-machine ls.
NAME ACTIVE DRIVER STATE URL SWARM DOCKER ERRORS
myvm1 * virtualbox Running tcp://192.168.99.100:2376 v17.10.0-ce
myvm2 - virtualbox Running tcp://192.168.99.101:2376 v17.10.0-ce
In the docker-compose.yml file, change 80:80 to 4000:80.
Then use 192.168.99.100:4000 and it should work.
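A sketch of the full round trip, assuming the tutorial's stack name and a compose file with the ports line changed as above:

docker stack deploy -c docker-compose.yml getstartedlab
curl http://192.168.99.100:4000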

When using Mesos, Marathon, and Zookeeper, my mesos-slave doesn't start when I specify the "containerizers" file with "docker,mesos"?

I have 3 CentOS VMs and I have installed Zookeeper, Marathon, and Mesos on the master node, while only putting Mesos on the other 2 VMs. The master node has no mesos-slave running on it. I am trying to run Docker containers, so I specified "docker,mesos" in the containerizers file. One of the mesos-agents starts fine with this configuration and I have been able to deploy a container to that slave. However, the second mesos-agent simply fails when I have this configuration (it works if I take out that containerizers file, but then it doesn't run containers). Here are some of the logs and information that have come up:
Here are some "messages" in the log directory:
Apr 26 16:09:12 centos-minion-3 systemd: Started Mesos Slave.
Apr 26 16:09:12 centos-minion-3 systemd: Starting Mesos Slave...
WARNING: Logging before InitGoogleLogging() is written to STDERR
[main.cpp:243] Build: 2017-04-12 16:39:09 by centos
[main.cpp:244] Version: 1.2.0
[main.cpp:247] Git tag: 1.2.0
[main.cpp:251] Git SHA: de306b5786de3c221bae1457c6f2ccaeb38eef9f
[logging.cpp:194] INFO level logging started!
[systemd.cpp:238] systemd version `219` detected
[main.cpp:342] Inializing systemd state
[systemd.cpp:326] Started systemd slice `mesos_executors.slice`
[containerizer.cpp:220] Using isolation: posix/cpu,posix/mem,filesystem/posix,network/cni
[linux_launcher.cpp:150] Using /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
[provisioner.cpp:249] Using default backend 'copy'
[slave.cpp:211] Mesos agent started on (1)#172.22.150.87:5051
[slave.cpp:212] Flags at startup: --appc_simple_discovery_uri_prefix="http://" --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" --authenticate_http_readwrite="false" --authenticatee="crammd5" --authentication_backoff_factor="1secs" --authorizer="local" --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="docker,mesos" --default_role="*" --disk_watch_interval="1mins" --docker="docker" --docker_kill_orphans="true" --docker_registry="https://registry-1.docker.io" --docker_remove_delay="6hrs" --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" --docker_store_dir="/tmp/mesos/store/docker" --docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" --enforce_container_disk_quota="false" --executor_registration_timeout="1mins" --executor_shutdown_grace_period="5secs" --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" --hadoop_home="" --help="false" --hostname_lookup="true" --http_authenticators="basic" --http_command_executor="false" --http_heartbeat_interval="30secs" --initialize_driver_logging="true" --isolation="posix/cpu,posix/mem" --launcher="linux" --launcher_dir="/usr/libexec/mesos" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --max_completed_executors_per_framework="150" --oversubscribed_resources_interval="15secs" --perf_duration="10secs" --perf_interval="1mins" --qos_correction_interval_min="0ns" --quiet="false" --recover="reconnect" --recovery_timeout="15mins" --registration_backoff_factor="1secs" --revocable_cpu_low_priority="true" --runtime_dir="/var/run/mesos" --sandbox_directory="/mnt/mesos/sandbox" --strict="true" --switch_user="true" --systemd_enable_support="true" --systemd_runtime_directory="/run/systemd/system" --version="false" --work_dir="/var/lib/mesos"
[slave.cpp:541] Agent resources: cpus(*):1; mem(*):919; disk(*):2043; ports(*):[31000-32000]
[slave.cpp:549] Agent attributes: [ ]
[slave.cpp:554] Agent hostname: node3
[status_update_manager.cpp:177] Pausing sending status updates
[state.cpp:62] Recovering state from '/var/lib/mesos/meta'
[state.cpp:706] No committed checkpointed resources found at '/var/lib/mesos/meta/resources/resources.info'
[status_update_manager.cpp:203] Recovering status update manager
[docker.cpp:868] Recovering Docker containers
[containerizer.cpp:599] Recovering containerizer
[provisioner.cpp:410] Provisioner recovery complete
[group.cpp:340] Group process (zookeeper-group(1)#172.22.150.87:5051) connected to ZooKeeper
[group.cpp:830] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
[group.cpp:418] Trying to create path '/mesos' in ZooKeeper
[detector.cpp:152] Detected a new leader: (id='15')
[group.cpp:699] Trying to get '/mesos/json.info_0000000015' in ZooKeeper
[zookeeper.cpp:259] A new leading master (UPID=master#172.22.150.88:5050) is detected
Failed to perform recovery: Collect failed: Failed to run 'docker -H unix:///var/run/docker.sock ps -a': exited with status 1; stderr='Cannot connect to the Docker daemon. Is the docker daemon running on this host?'
To remedy this do as follows:
Step 1: rm -f /var/lib/mesos/meta/slaves/latest
This ensures agent doesn't recover old live executors.
Step 2: Restart the agent.
Apr 26 16:09:13 centos-minion-3 systemd: mesos-slave.service: main process exited, code=exited, status=1/FAILURE
Apr 26 16:09:13 centos-minion-3 systemd: Unit mesos-slave.service entered failed state.
Apr 26 16:09:13 centos-minion-3 systemd: mesos-slave.service failed.
Logs from docker:
$ sudo systemctl status docker
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; disabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/docker.service.d
           └─flannel.conf
   Active: inactive (dead) since Tue 2017-04-25 18:00:03 CDT; 24h ago
     Docs: docs.docker.com
 Main PID: 872 (code=exited, status=0/SUCCESS)
Apr 26 18:25:25 centos-minion-3 systemd[1]: Dependency failed for Docker Application Container Engine.
Apr 26 18:25:25 centos-minion-3 systemd[1]: Job docker.service/start failed with result 'dependency'
Logs from flannel:
[flanneld-start: network.go:102] failed to retrieve network config: client: etcd cluster is unavailable or misconfigured
You have the answer in your logs:
Failed to perform recovery: Collect failed:
Failed to run 'docker -H unix:///var/run/docker.sock ps -a': exited with status 1;
stderr='Cannot connect to the Docker daemon. Is the docker daemon running on this host?'
To remedy this do as follows:
Step 1: rm -f /var/lib/mesos/meta/slaves/latest
This ensures agent doesn't recover old live executors.
Step 2: Restart the agent.
Mesos keeps its state/metadata on local disk. When it's restarted, it tries to load this state. If the configuration changed and is not compatible with the previous state, it won't start.
Just bring Docker back to life by fixing the problems with flannel and etcd, and everything will be fine.
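A sketch of that recovery sequence on the failing agent (unit names assumed; on CentOS the flannel service is typically called flanneld):

# fix the dependency chain first: etcd, then flannel, then docker
sudo systemctl start etcd
sudo systemctl start flanneld
sudo systemctl start docker

# then apply the remedy the Mesos log itself suggests and restart the agent
sudo rm -f /var/lib/mesos/meta/slaves/latest
sudo systemctl restart mesos-slave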
Add the following flag while starting the agent:
--reconfiguration_policy=additive
More details here: http://mesos.apache.org/documentation/latest/agent-recovery/
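On Mesosphere-packaged installs the agent reads flags from files under /etc/mesos-slave (file name = flag name, file contents = value), so a hedged sketch of wiring this in (note the flag only exists in newer Mesos releases, so check your version's agent docs):

echo additive | sudo tee /etc/mesos-slave/reconfiguration_policy
sudo systemctl restart mesos-slave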

Docker daemon does not start

I'm trying to start the Docker daemon:
sudo systemctl start docker
But nothing happens, the cursor just blinks and the process never ends.
Yesterday it was working properly :(
sudo journalctl -fu docker
Aug 18 16:05:24 host docker[1602]: time="2016-08-18T16:05:24.467635627-05:00" level=info msg="New containerd process, pid: 1609\n"
Aug 18 16:05:24 host docker[1602]: time="2016-08-18T16:05:24.482107319-05:00" level=fatal msg="bad listen address format /var/run/docker/libcontainerd/docker-containerd.sock, expected proto://address"
Aug 18 16:05:30 host docker[1602]: time="2016-08-18T16:05:30.470570243-05:00" level=info msg="New containerd process, pid: 1620\n"
Aug 18 16:05:30 host docker[1602]: time="2016-08-18T16:05:30.491495106-05:00" level=fatal msg="bad listen address format /var/run/docker/libcontainerd/docker-containerd.sock, expected proto://address"
Aug 18 16:08:06 host systemd[1]: Stopped Docker Application Container Engine.
-- Reboot --
Aug 18 16:16:52 host systemd[1]: Starting Docker Application Container Engine...
Aug 18 16:16:54 host docker[2294]: time="2016-08-18T16:16:54.360878396-05:00" level=info msg="New containerd process, pid: 2327\n"
Aug 18 16:16:54 host docker[2294]: time="2016-08-18T16:16:54.686503187-05:00" level=fatal msg="bad listen address format /var/run/docker/libcontainerd/docker-containerd.sock, expected proto://address"
Aug 18 16:17:00 host docker[2294]: time="2016-08-18T16:17:00.664023288-05:00" level=info msg="New containerd process, pid: 2368\n"
Aug 18 16:17:00 host docker[2294]: time="2016-08-18T16:17:00.67708602-05:00" level=fatal msg="bad listen address format /var/run/docker/libcontainerd/docker-containerd.sock, expected proto://address"
One interesting thing with systemd is that if it thinks that a daemon is running, then the start command does nothing.
I have had to do the following to make sure I cleanly restart certain daemons:
sudo systemctl stop service-name
# wait a little if the service is slow to stop like the Cassandra database
sudo systemctl start service-name
That has worked for me with various services.
One way to know whether the service is considered running is to check its status:
systemctl status service-name
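Applied to the Docker daemon here, the cycle looks like:

sudo systemctl stop docker
systemctl status docker    # should report "inactive (dead)" once stopped
sudo systemctl start docker
systemctl status docker    # confirm it reports "active (running)"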
