No nodes connecting to host in docker swarm - docker

I just followed this tutorial step by step for setting up a docker swarm in EC2 -- https://docs.docker.com/swarm/install-manual/
I created 4 Amazon Servers using the Amazon Linux AMI.
manager + consul
manager
node1
node2
I followed the instructions to start the swarm and everything seems to go ok regarding making the docker instances.
Server 1
Running docker ps gives:
The Consul logs show this
2016/07/05 20:18:47 [INFO] serf: EventMemberJoin: 729a440e5d0d 172.17.0.2
2016/07/05 20:18:47 [INFO] serf: EventMemberJoin: 729a440e5d0d.dc1 172.17.0.2
2016/07/05 20:18:48 [INFO] raft: Node at 172.17.0.2:8300 [Follower] entering Follower state
2016/07/05 20:18:48 [INFO] consul: adding server 729a440e5d0d (Addr: 172.17.0.2:8300) (DC: dc1)
2016/07/05 20:18:48 [INFO] consul: adding server 729a440e5d0d.dc1 (Addr: 172.17.0.2:8300) (DC: dc1)
2016/07/05 20:18:48 [ERR] agent: failed to sync remote state: No cluster leader
2016/07/05 20:18:49 [WARN] raft: Heartbeat timeout reached, starting election
2016/07/05 20:18:49 [INFO] raft: Node at 172.17.0.2:8300 [Candidate] entering Candidate state
2016/07/05 20:18:49 [INFO] raft: Election won. Tally: 1
2016/07/05 20:18:49 [INFO] raft: Node at 172.17.0.2:8300 [Leader] entering Leader state
2016/07/05 20:18:49 [INFO] consul: cluster leadership acquired
2016/07/05 20:18:49 [INFO] consul: New leader elected: 729a440e5d0d
2016/07/05 20:18:49 [INFO] raft: Disabling EnableSingleNode (bootstrap)
2016/07/05 20:18:49 [INFO] consul: member '729a440e5d0d' joined, marking health alive
2016/07/05 20:18:50 [INFO] agent: Synced service 'consul'
I registered each node using the following command, with the appropriate IPs:
docker run -d swarm join --advertise=x-x-x-x:2375 consul://x-x-x-x:8500
Each of those created a docker instance
Node1
Running docker ps gives:
With logs that suggest there's a problem:
time="2016-07-05T21:33:50Z" level=info msg="Registering on the discovery service every 1m0s..." addr="172.31.17.35:2375" discovery="consul://172.31.3.233:8500"
time="2016-07-05T21:36:20Z" level=error msg="cannot set or renew session for ttl, unable to operate on sessions"
time="2016-07-05T21:37:20Z" level=info msg="Registering on the discovery service every 1m0s..." addr="172.31.17.35:2375" discovery="consul://172.31.3.233:8500"
time="2016-07-05T21:39:50Z" level=error msg="cannot set or renew session for ttl, unable to operate on sessions"
time="2016-07-05T21:40:50Z" level=info msg="Registering on the discovery service every 1m0s..." addr="172.31.17.35:2375" discovery="consul://172.31.3.233:8500"
...
Finally, when I get to the last step and ask for host information on my Consul machine with
docker -H :4000 info
I see no nodes. When I then try the step of running an app, I get the obvious error:
[ec2-user@ip-172-31-3-233 ~]$ docker -H :4000 run hello-world
docker: Error response from daemon: No healthy node available in the cluster.
See 'docker run --help'.
[ec2-user@ip-172-31-3-233 ~]$
Thanks for any insight on this. I'm still pretty confused by much of the swarm model and not sure where to go from here to diagnose.

It looks like Consul is either not listening on an address the other nodes can reach, or is not accessible on that address because of security group or VPC settings. You are setting the discovery URL to consul://172.31.3.233:8500 on the Docker nodes, so I would suggest trying to connect to that address from another machine (for example one of the swarm nodes), either in your browser or via curl like this:
% curl http://172.31.3.233:8500/ui/dist/
HTML
If you cannot connect (connection refused or timeout), add an ingress rule for TCP port 8500 to the security group of the Consul instance and try again.

Looking at your setup, it seems you forgot to open port 2375 for the Docker Engine on all four nodes.
Before starting the Swarm manager or a Swarm node, you have to expose the Docker Engine on a TCP port, because Swarm talks to the Docker Engine through that port.
With Docker on Ubuntu 14.04, you can do this by editing /etc/default/docker and adding -H tcp://0.0.0.0:2375 to DOCKER_OPTS. For example:
DOCKER_OPTS="-H tcp://0.0.0.0:2375 -H unix:///var/run/docker.sock"
After that, restart the Docker Engine:
service docker restart
If you are using CentOS, the solution is the same; see my blog article https://sonnguyen.ws/install-docker-docker-swarm-centos7/
One other thing: I think you should install and run Consul on all four servers, so that each Swarm node can work with the Consul agent on its own host.
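The DOCKER_OPTS edit above can be scripted so that it is safe to run more than once. This is only a sketch: the function name ensure_docker_tcp_opts and the idea of passing the file path as an argument are my own assumptions, and the script appends a new DOCKER_OPTS line, so an earlier DOCKER_OPTS line in the same file would be overridden.

```shell
#!/usr/bin/env bash
# Sketch: make sure a Docker defaults file has a DOCKER_OPTS line that exposes
# the Engine on TCP port 2375. ensure_docker_tcp_opts is a hypothetical helper
# name, not part of Docker.

ensure_docker_tcp_opts() {
    local file="$1"
    local opts='DOCKER_OPTS="-H tcp://0.0.0.0:2375 -H unix:///var/run/docker.sock"'

    # Leave the file alone if an uncommented DOCKER_OPTS line already
    # binds the Engine to port 2375.
    if grep -Eq '^[[:space:]]*DOCKER_OPTS=.*tcp://[^" ]*:2375' "$file" 2>/dev/null; then
        echo "unchanged"
        return 0
    fi

    # Otherwise append the line (a later assignment wins when the file is sourced).
    printf '%s\n' "$opts" >> "$file"
    echo "added"
}

# Example (needs root on a real host):
#   ensure_docker_tcp_opts /etc/default/docker && service docker restart
```

Afterwards, docker -H tcp://NODE_IP:2375 info run from the manager should answer if the port is reachable (NODE_IP is a placeholder).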

Related

unable to run consul in docker host network mode

I am trying to run the consul Docker container in host network mode, as suggested on Docker Hub. I am unable to access the UI on port 8500.
My docker host IP address: 192.168.30.12
network interface which is used by host: ens192
Here is my docker run command:
docker run -d --net=host -v /home/docker/conf.json:/consul/config/config.json -v /home/docker/data/:/consul/data/ -e CONSUL_BIND_INTERFACE=ens192 -e CONSUL_CLIENT_INTERFACE=ens192 --name=consulserver1 -d consul agent -server -bootstrap-expect=1 -client 0.0.0.0 -bind=192.168.30.12
I also see the following errors in the Docker logs:
==> Found address '192.168.30.12' for interface 'ens192', setting bind option...
==> Found address '192.168.30.12' for interface 'ens192', setting client option...
==> Starting Consul agent...
Version: '1.14.4'
Build Date: '2023-01-26 15:47:10 +0000 UTC'
Node ID: 'd8e91718-dcf3-70be-dd29-c558158959f0'
Node name: 'docker-try1'
Datacenter: 'dc1' (Segment: '<all>')
Server: true (Bootstrap: true)
Client Addr: [0.0.0.0] (HTTP: 8500, HTTPS: -1, gRPC: -1, gRPC-TLS: 8503, DNS: 8600)
Cluster Addr: 192.168.30.12 (LAN: 8301, WAN: 8302)
Gossip Encryption: false
Auto-Encrypt-TLS: false
HTTPS TLS: Verify Incoming: false, Verify Outgoing: false, Min Version: TLSv1_2
gRPC TLS: Verify Incoming: false, Min Version: TLSv1_2
Internal RPC TLS: Verify Incoming: false, Verify Outgoing: false (Verify Hostname: false), Min Version: TLSv1_2
==> Log data will now stream in as it occurs:
2023-02-17T15:18:30.052Z [WARN] agent: BootstrapExpect is set to 1; this is the same as Bootstrap mode.
2023-02-17T15:18:30.052Z [WARN] agent: Node name "docker-try1" will not be discoverable via DNS due to invalid characters. Valid characters include all alpha-numerics and dashes.
2023-02-17T15:18:30.052Z [WARN] agent: bootstrap = true: do not enable unless necessary
2023-02-17T15:18:30.057Z [WARN] agent.auto_config: BootstrapExpect is set to 1; this is the same as Bootstrap mode.
2023-02-17T15:18:30.057Z [WARN] agent.auto_config: Node name "docker-try1" will not be discoverable via DNS due to invalid characters. Valid characters include all alpha-numerics and dashes.
2023-02-17T15:18:30.057Z [WARN] agent.auto_config: bootstrap = true: do not enable unless necessary
2023-02-17T15:18:30.061Z [INFO] agent.server.raft: initial configuration: index=1 servers="[{Suffrage:Voter ID:d8e91718-dcf3-70be-dd29-c558158959f0 Address:192.168.30.12:8300}]"
2023-02-17T15:18:30.061Z [INFO] agent.server.raft: entering follower state: follower="Node at 192.168.30.12:8300 [Follower]" leader-address= leader-id=
2023-02-17T15:18:30.062Z [INFO] agent.server.serf.wan: serf: EventMemberJoin: docker-try1.dc1 192.168.30.12
2023-02-17T15:18:30.062Z [WARN] agent.server.serf.wan: serf: Failed to re-join any previously known node
2023-02-17T15:18:30.062Z [INFO] agent.server.serf.lan: serf: EventMemberJoin: docker-try1 192.168.30.12
2023-02-17T15:18:30.063Z [INFO] agent.router: Initializing LAN area manager
2023-02-17T15:18:30.063Z [WARN] agent.server.serf.lan: serf: Failed to re-join any previously known node
2023-02-17T15:18:30.063Z [INFO] agent.server: Adding LAN server: server="docker-try1 (Addr: tcp/192.168.30.12:8300) (DC: dc1)"
2023-02-17T15:18:30.063Z [INFO] agent.server.autopilot: reconciliation now disabled
2023-02-17T15:18:30.064Z [INFO] agent.server: Handled event for server in area: event=member-join server=docker-try1.dc1 area=wan
2023-02-17T15:18:30.064Z [INFO] agent.server.cert-manager: initialized server certificate management
2023-02-17T15:18:30.064Z [INFO] agent: Started DNS server: address=0.0.0.0:8600 network=udp
2023-02-17T15:18:30.065Z [INFO] agent: Started DNS server: address=0.0.0.0:8600 network=tcp
2023-02-17T15:18:30.065Z [INFO] agent: Starting server: address=[::]:8500 network=tcp protocol=http
2023-02-17T15:18:30.065Z [INFO] agent: Started gRPC listeners: port_name=grpc_tls address=[::]:8503 network=tcp
2023-02-17T15:18:30.065Z [INFO] agent: started state syncer
2023-02-17T15:18:30.065Z [INFO] agent: Consul agent running!
2023-02-17T15:18:37.152Z [WARN] agent.cache: handling error in Cache.Notify: cache-type=connect-ca-leaf error="No cluster leader" index=0
2023-02-17T15:18:37.152Z [ERROR] agent.server.cert-manager: failed to handle cache update event: error="leaf cert watch returned an error: No cluster leader"
2023-02-17T15:18:37.248Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No cluster leader"
2023-02-17T15:18:39.483Z [WARN] agent.server.raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
2023-02-17T15:18:39.483Z [INFO] agent.server.raft: entering candidate state: node="Node at 192.168.30.12:8300 [Candidate]" term=7
2023-02-17T15:18:39.486Z [INFO] agent.server.raft: election won: term=7 tally=1
2023-02-17T15:18:39.486Z [INFO] agent.server.raft: entering leader state: leader="Node at 192.168.30.12:8300 [Leader]"
2023-02-17T15:18:39.486Z [INFO] agent.server: cluster leadership acquired
2023-02-17T15:18:39.487Z [INFO] agent.server: New leader elected: payload=docker-try1
2023-02-17T15:18:39.493Z [INFO] agent.server.autopilot: reconciliation now enabled
2023-02-17T15:18:39.493Z [INFO] agent.leader: started routine: routine="federation state anti-entropy"
2023-02-17T15:18:39.493Z [INFO] agent.leader: started routine: routine="federation state pruning"
2023-02-17T15:18:39.493Z [INFO] agent.leader: started routine: routine="streaming peering resources"
2023-02-17T15:18:39.493Z [INFO] agent.leader: started routine: routine="metrics for streaming peering resources"
2023-02-17T15:18:39.493Z [INFO] agent.leader: started routine: routine="peering deferred deletion"
2023-02-17T15:18:39.493Z [INFO] connect.ca: initialized primary datacenter CA from existing CARoot with provider: provider=consul
2023-02-17T15:18:39.493Z [INFO] agent.leader: started routine: routine="intermediate cert renew watch"
2023-02-17T15:18:39.493Z [INFO] agent.leader: started routine: routine="CA root pruning"
2023-02-17T15:18:39.493Z [INFO] agent.leader: started routine: routine="CA root expiration metric"
2023-02-17T15:18:39.493Z [INFO] agent.leader: started routine: routine="CA signing expiration metric"
2023-02-17T15:18:39.493Z [INFO] agent.leader: started routine: routine="virtual IP version check"
2023-02-17T15:18:39.493Z [INFO] agent.leader: stopping routine: routine="virtual IP version check"
2023-02-17T15:18:39.493Z [INFO] agent.leader: stopped routine: routine="virtual IP version check"
2023-02-17T15:18:40.065Z [ERROR] agent.server.autopilot: Failed to reconcile current state with the desired state
2023-02-17T15:18:41.061Z [INFO] agent: Synced node info
I think I figured it out.
A firewall was blocking TCP ports. As soon as I opened all the ports recommended in the Consul documentation (Consul Ports), it started working.
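To confirm that a firewall is the problem, the TCP side of Consul's default ports can be probed from another machine. The following is a rough sketch: port_open and check_consul_ports are made-up helper names, it relies on bash's /dev/tcp and the coreutils timeout command, and it does not test UDP, which ports 8301, 8302 and 8600 also use.

```shell
#!/usr/bin/env bash
# Sketch: probe Consul's default TCP ports from another machine.
# port_open and check_consul_ports are hypothetical helper names.

# Succeeds if a TCP connection to host:port can be opened within 3 seconds.
port_open() {
    local host="$1" port="$2"
    timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Prints one "<port> open" or "<port> closed" line per port.
check_consul_ports() {
    local host="$1" port
    # 8300 server RPC, 8301 LAN gossip, 8302 WAN gossip,
    # 8500 HTTP API and UI, 8600 DNS
    for port in 8300 8301 8302 8500 8600; do
        if port_open "$host" "$port"; then
            echo "$port open"
        else
            echo "$port closed"
        fi
    done
}

# Example: check_consul_ports 192.168.30.12
```

If 8500 shows as closed from a remote machine while the agent log says the HTTP server started, a firewall between the two is the likely cause.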

single node ejabberd on kubernetes -ejabberdctl status shows node down

I'm trying to deploy the ejabberd Docker image in Kubernetes with the following folders mounted from a persistent volume:
/home/ejabberd/logs
/home/ejabberd/conf
/home/ejabberd/database
We populated the conf and database directories with our configuration files and the database folder from the Docker image using an init container. After setting the permissions, we were able to start the ejabberd service, and the logs say that the services (on ports 5222, 5269, 5280) are ready.
When I check the XMPP server status in the container using "ejabberdctl status", the output says "node down".
===========ejabberd.log===================================================
2020-12-16 09:18:58.477630+00:00 [info] <0.3406.0>@mod_mqtt:init_topic_cache/2:611 Building MQTT cache for mydomain this may take a while
2020-12-16 09:18:59.087380+00:00 [info] <0.483.0>@ejabberd_mnesia:create/2:267 Creating Mnesia ram table 'bytestream'
2020-12-16 09:19:01.193203+00:00 [info] <0.126.0>@ejabberd_cluster_mnesia:wait_for_sync/1:123 Waiting for Mnesia synchronization to complete
2020-12-16 09:19:02.401537+00:00 [info] <0.126.0>@ejabberd_app:start/2:62 ejabberd 20.4.0 is started in the node 'ejabberd@mydomain' in 49.77s
2020-12-16 09:19:02.403414+00:00 [info] <0.601.0>@ejabberd_listener:init/4:159 Start accepting TCP connections at [::]:5222 for ejabberd_c2s
2020-12-16 09:19:02.403479+00:00 [info] <0.602.0>@ejabberd_listener:init/4:159 Start accepting TCP connections at [::]:5269 for ejabberd_s2s_in
2020-12-16 09:19:02.403956+00:00 [info] <0.603.0>@ejabberd_listener:init/4:159 Start accepting TLS connections at [::]:5443 for ejabberd_http
2020-12-16 09:19:02.403999+00:00 [info] <0.604.0>@ejabberd_listener:init/4:159 Start accepting TCP connections at [::]:5280 for ejabberd_http
2020-12-16 09:19:02.404098+00:00 [info] <0.605.0>@ejabberd_listener:init/4:159 Start accepting TCP connections at [::]:1883 for mod_mqtt
2020-12-16 09:19:02.404345+00:00 [info] <0.3418.0>@ejabberd_listener:init/4:159 Start accepting TCP connections at 10.42.8.15:7777 for mod_proxy65_stream
========================================ejabberdctl status===========================
~ $ ./bin/ejabberdctl status
Failed RPC connection to the node 'ejabberd@mydomain': nodedown
Commands to start an ejabberd node:
start - Start an ejabberd node in server mode
debug - Attach an interactive Erlang shell to a running ejabberd node
iexdebug - Attach an interactive Elixir shell to a running ejabberd node
live - Start an ejabberd node in live (interactive) mode
iexlive - Start an ejabberd node in live (interactive) mode, within an Elixir shell
foreground - Start an ejabberd node in server mode (attached)
Optional parameters when starting an ejabberd node:
--config-dir dir Config ejabberd: /home/ejabberd/conf
--config file Config ejabberd: /home/ejabberd/conf/ejabberd.yml
--ctl-config file Config ejabberdctl: /home/ejabberd/conf/ejabberdctl.cfg
--logs dir Directory for logs: /home/ejabberd/logs
--spool dir Database spool dir: /home/ejabberd/database/ejabberd@mydomain
--node nodename ejabberd node name: ejabberd@mydomain
If anyone has tried ejabberd on Kubernetes, please share your thoughts on this issue.
Thanks in advance

Docker - Unable to join swarm as manager, able to join as worker

When executing a docker swarm join command (as manager), I face the following error:
Error response from daemon: manager stopped: can't initialize raft node: rpc error: code = Internal desc = connection error: desc = "transport: x509: certificate is not valid for any names, but wanted to match swarm-manager"
Joining the same swarm as a worker works flawlessly.
The logfiles show me the following items:
kmo@GETSTdock-app01 ~ $ sudo tail -f /var/log/upstart/docker.log
time="2018-07-06T09:18:17.890620199+02:00" level=info msg="Listening for connections" addr="[::]:2377" module=node node.id=7j75bmugpf8k2o0onta1yp4zy proto=tcp
time="2018-07-06T09:18:17.892234469+02:00" level=info msg="manager selected by agent for new session: { 10.130.223.107:2377}" module=node/agent node.id=7j75bmugpf8k2o0onta1yp4zy
time="2018-07-06T09:18:17.892364019+02:00" level=info msg="waiting 0s before registering session" module=node/agent node.id=7j75bmugpf8k2o0onta1yp4zy
time="2018-07-06T09:18:18.161362606+02:00" level=error msg="fatal task error" error="cannot create a swarm scoped network when swarm is not active" module=node/agent/taskmanager node.id=7j75bmugpf8k2o0onta1yp4zy service.id=p3ng4om2m8rl7ygoef18ayohp task.id=weaubf3qj5goctlh2039sjvdg
time="2018-07-06T09:18:18.162182077+02:00" level=error msg="fatal task error" error="cannot create a swarm scoped network when swarm is not active" module=node/agent/taskmanager node.id=7j75bmugpf8k2o0onta1yp4zy service.id=6sl9y5rcov6htwbyvm504ewh2 task.id=j3foc6rjszuqszj41qyqb6mpe
time="2018-07-06T09:18:18.184847516+02:00" level=info msg="Stopping manager" module=node node.id=7j75bmugpf8k2o0onta1yp4zy
time="2018-07-06T09:18:18.184993569+02:00" level=info msg="Manager shut down" module=node node.id=7j75bmugpf8k2o0onta1yp4zy
time="2018-07-06T09:18:18.185020917+02:00" level=info msg="shutting down certificate renewal routine" module=node/tls node.id=7j75bmugpf8k2o0onta1yp4zy node.role=swarm-manager
time="2018-07-06T09:18:18.185163663+02:00" level=error msg="cluster exited with error: manager stopped: can't initialize raft node: rpc error: code = Internal desc = connection error: desc = \"transport: x509: certificate is not valid for any names, but wanted to match swarm-manager\""
time="2018-07-06T09:18:18.185492995+02:00" level=error msg="Handler for POST /v1.37/swarm/join returned error: manager stopped: can't initialize raft node: rpc error: code = Internal desc = connection error: desc = \"transport: x509: certificate is not valid for any names, but wanted to match swarm-manager\""
I face similar problems when I join as worker, and then attempt to promote the node to a manager node.
Docker version = 18.03.1
OS = Ubuntu 14.04 LTS
Does anybody have an idea how to resolve this?
For me, I had to open port 2377 in the joining manager node's firewall; that seemed to do the trick. I'm not sure if this is best practice, as I'm still a noob with Docker Swarm, but add it to the list of things to try if you have this issue.
This may or may not work, but you can try the following.
On the manager, run:
docker swarm leave --force
Recreate the swarm using:
docker swarm init --advertise-addr [ip-address for initial manager]
Then try to add managers using the advertised address.
You can also try this:
Comment out the proxy settings in the Docker proxy definition file, /etc/systemd/system/docker.service.d/docker.conf or /etc/systemd/system/docker.service.d/docker_proxy.conf.
Reload the daemon configuration with:
systemctl daemon-reload
Re-execute the docker swarm join command with the manager token.
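The leave / re-init / re-join sequence can be written down as a small script. Treat it as a sketch under assumptions: the run wrapper, the DRY_RUN switch and the rebuild_swarm name are my own, and with DRY_RUN=1 the commands are only printed. Note that docker swarm leave --force on a manager discards that node's swarm state, so services have to be recreated afterwards.

```shell
#!/usr/bin/env bash
# Sketch of the leave / re-init / re-join sequence from the answer.
# run, DRY_RUN and rebuild_swarm are hypothetical names for illustration.

# Runs a command, or only prints it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

rebuild_swarm() {
    local manager_ip="$1"
    # On the initial manager: drop the old swarm state and start a new swarm.
    run docker swarm leave --force
    run docker swarm init --advertise-addr "$manager_ip"
    # Print the complete join command (token included) for additional managers.
    run docker swarm join-token manager
}

# Example: DRY_RUN=1 rebuild_swarm 10.130.223.107
```

The last command prints a docker swarm join line that can be pasted on the node that should become a manager; TCP port 2377 must be reachable between the managers for that join to succeed.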

How to fix docker daemon that will not restart due to hns error

Docker for Windows Server
Windows Server version 1709, with containers
Docker version 17.06.2-ee-6, build e75fdb8
Swarm mode (worker node, part of swarm with ubuntu masters)
After containers connected to an overlay network started intermittently losing their network adapters, I restarted the machine. Now the daemon will not start. Below are the last lines of output from running docker -D.
Please let me know how to fix this.
time="2018-05-15T15:10:06.731160000Z" level=debug msg="Option Experimental: false"
time="2018-05-15T15:10:06.731160000Z" level=debug msg="Option DefaultDriver: nat"
time="2018-05-15T15:10:06.731160000Z" level=debug msg="Option DefaultNetwork: nat"
time="2018-05-15T15:10:06.734183700Z" level=info msg="Restoring existing overlay networks from HNS into docker"
time="2018-05-15T15:10:06.735174400Z" level=debug msg="[GET]=>[/networks/] Request : "
time="2018-05-15T15:12:06.789120400Z" level=debug msg="Network (d4d37ce) restored"
time="2018-05-15T15:12:06.796122200Z" level=debug msg="Endpoint (4114b6e) restored to network (d4d37ce)"
time="2018-05-15T15:12:06.796122200Z" level=debug msg="Endpoint (819eb70) restored to network (d4d37ce)"
time="2018-05-15T15:12:06.797124900Z" level=debug msg="Endpoint (ade55ea) restored to network (d4d37ce)"
time="2018-05-15T15:12:06.798125600Z" level=debug msg="Endpoint (d0054fc) restored to network (d4d37ce)"
time="2018-05-15T15:12:06.798125600Z" level=debug msg="Endpoint (e2af8d8) restored to network (d4d37ce)"
time="2018-05-15T15:12:06.854118500Z" level=debug msg="[GET]=>[/networks/] Request : "
time="2018-05-15T15:14:06.860654000Z" level=debug msg="start clean shutdown of all containers with a 15 seconds timeout..."
Error starting daemon: Error initializing network controller: hnsCall failed in Win32: Server execution failed (0x80080005)
Here is the complete set of steps to rebuild the Docker setup on a swarm host from scratch. Sometimes only some of the steps are needed (specifically the HNS part), so you can try those first.
Remove all Docker services and user-defined networks (that is, all Docker networks except `nat` and `none`).
Leave the swarm cluster (docker swarm leave --force)
Stop the docker service (PS C:\> stop-service docker)
Stop the HNS service (PS C:\> stop-service hns)
In regedit, delete all of the registry keys under these paths:
HKLM:\SYSTEM\CurrentControlSet\Services\vmsmp\parameters\SwitchList
HKLM:\SYSTEM\CurrentControlSet\Services\vmsmp\parameters\NicList
Now go to Device Manager, and disable then remove all network adapters that are “Hyper-V Virtual Ethernet…” adapters
Now rename your HNS.data file (the goal is to effectively “delete” it by renaming it):
C:\ProgramData\Microsoft\Windows\HNS\HNS.data
Also rename C:\ProgramData\docker folder (the goal is to effectively “delete” it by renaming it)
C:\ProgramData\docker
Now reboot your machine

Issue when joining serf nodes located in different Docker containers

Context: Host is AWS-EC2 / Ubuntu 14.04.5 with Docker version 17.05.0-ce. Containers are built from publicly available repo image cbhihe/serf-alpine-bash. All containers are located on the same EC2 instance and share the same default bridge network with net-interface "docker0".
Trying to join nodes serfDC1 (id d4fd90692e18) and serfDC2 (id 6353e7f6134d), by passing cmds from the host's shell:
$ docker exec serfDC1 serf agent -node=Node1 -bind=0.0.0.0:7946
==> Starting Serf agent…
==> Starting Serf agent RPC...
==> Serf agent running!
Node name: 'd4fd90692e18'
Bind addr: '0.0.0.0:7946'
RPC addr: '127.0.0.1:7373'
Encrypted: false
Snapshot: false
Profile: lan
==> Log data will now stream in as it occurs:
2017/06/04 00:01:10 [INFO] agent: Serf agent starting
2017/06/04 00:01:10 [INFO] serf: EventMemberJoin: d4fd90692e18 127.0.0.1
2017/06/04 00:01:11 [INFO] agent: Received event: member-join
^C
After discovering Node1's container's IP=172.17.0.4, I can issue the serf agent -join cmd to Node2:
$ docker exec serfDC2 serf agent -node=Node2 -join=172.17.0.4
==> Starting Serf agent...
==> Starting Serf agent RPC...
==> Serf agent running!
Node name: '6353e7f6134d'
Bind addr: '0.0.0.0:7946'
RPC addr: '127.0.0.1:7373'
Encrypted: false
Snapshot: false
Profile: lan
==> Joining cluster...(replay: false)
Join completed. Synced with 1 initial agents
==> Log data will now stream in as it occurs:
2017/06/04 00:18:35 [INFO] agent: Serf agent starting
2017/06/04 00:18:35 [INFO] serf: EventMemberJoin: 6353e7f6134d 127.0.0.1
2017/06/04 00:18:35 [INFO] agent: joining: [172.17.0.4] replay: false
2017/06/04 00:18:35 [INFO] serf: EventMemberJoin: d4fd90692e18 127.0.0.1
2017/06/04 00:18:35 [INFO] agent: joined: 1 nodes
2017/06/04 00:18:36 [WARN] memberlist: Got ping for unexpected node 'd4fd90692e18' from=127.0.0.1:7946
2017/06/04 00:18:36 [INFO] agent: Received event: member-join
2017/06/04 00:18:37 [WARN] memberlist: Got ping for unexpected node d4fd90692e18 from=127.0.0.1:34876
2017/06/04 00:18:37 [ERR] memberlist: Failed TCP fallback ping: EOF
2017/06/04 00:18:37 [INFO] memberlist: Suspect d4fd90692e18 has failed, no acks received
2017/06/04 00:18:38 [WARN] memberlist: Got ping for unexpected node 'd4fd90692e18' from=127.0.0.1:7946
2017/06/04 00:18:39 [WARN] memberlist: Got ping for unexpected node d4fd90692e18 from=127.0.0.1:34879
2017/06/04 00:18:39 [ERR] memberlist: Failed TCP fallback ping: EOF
2017/06/04 00:18:40 [INFO] memberlist: Suspect d4fd90692e18 has failed, no acks received
2017/06/04 00:18:41 [WARN] memberlist: Got ping for unexpected node 'd4fd90692e18' from=127.0.0.1:7946
2017/06/04 00:18:42 [WARN] memberlist: Got ping for unexpected node d4fd90692e18 from=127.0.0.1:34881
2017/06/04 00:18:42 [ERR] memberlist: Failed TCP fallback ping: EOF
2017/06/04 00:18:42 [INFO] memberlist: Marking d4fd90692e18 as failed, suspect timeout reached (0 peer confirmations)
2017/06/04 00:18:42 [INFO] serf: EventMemberFailed: d4fd90692e18 127.0.0.1
2017/06/04 00:18:43 [INFO] agent: Received event: member-failed
2017/06/04 00:18:44 [INFO] memberlist: Suspect d4fd90692e18 has failed, no acks received
2017/06/04 00:19:05 [INFO] serf: attempting reconnect to d4fd90692e18 127.0.0.1:7946
^C
Resulted in failure to join as shown by:
$ docker exec serfDC2 serf members
6353e7f6134d 127.0.0.1:7946 alive
d4fd90692e18 127.0.0.1:7946 failed
$ docker exec serfDC1 serf members
d4fd90692e18 127.0.0.1:7946 alive
6353e7f6134d 127.0.0.1:7946 failed
I have been at this for quite some time now and am at my wit's end as to where I should turn. Hashicorp's and Docker's documentation do not seem to cover this aspect of the initial handshake between two serf agents in different containers.
Could somebody show me where I took a wrong turn ? Any answer would be great, really. Tx.
Serf nodes need to announce themselves with a routable address. In your case they are telling each other 'hi, I'm localhost:...', so each one tries to answer to localhost, which cannot work because each container has its own localhost.
There is an option that makes the agent advertise its eth0 IP to the other nodes in the network: -iface. With it you should drop the -bind option. The ports you used are the defaults, so there is no need to set them explicitly.
So, for the node1:
serf agent -node=Node1 -iface=eth0
And for the node2:
serf agent -node=Node2 -join=172.17.0.2 -iface=eth0
From docs:
-iface - This flag can be used to provide a binding interface. It can be used instead of -bind if the interface is known but not the address.
It's working properly for me:
Node1:
==> Log data will now stream in as it occurs:
2017/06/04 01:56:40 [INFO] agent: Serf agent starting
2017/06/04 01:56:40 [INFO] serf: EventMemberJoin: Node1 172.17.0.2
2017/06/04 01:56:41 [INFO] agent: Received event: member-join
2017/06/04 01:57:02 [INFO] serf: EventMemberJoin: Node2 172.17.0.3
2017/06/04 01:57:03 [INFO] agent: Received event: member-join
Node2:
==> Log data will now stream in as it occurs:
2017/06/04 01:57:02 [INFO] agent: Serf agent starting
2017/06/04 01:57:02 [INFO] serf: EventMemberJoin: Node2 172.17.0.3
2017/06/04 01:57:02 [INFO] agent: joining: [172.17.0.2] replay: false
2017/06/04 01:57:02 [INFO] serf: EventMemberJoin: Node1 172.17.0.2
2017/06/04 01:57:02 [INFO] agent: joined: 1 nodes
2017/06/04 01:57:03 [INFO] agent: Received event: member-join
Edit:
If each container runs in its own VM (EC2 instance), each instance has its own Docker network and those networks are not interconnected, so you have to advertise the EC2 instance IP and expose the corresponding ports. Use -advertise:
-advertise - The advertise flag is used to change the address that we advertise to other nodes in the cluster.
Node1:
serf agent -node=Node1 -iface=eth0 -advertise=INSTANCE_IP
Node2:
serf agent -node=Node2 -join=NODE1_INSTANCE_IP -iface=eth0
And remember to publish the Serf port in docker run:
docker run -p 7946:7946 (...rest of the command...)
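One detail worth adding to the multi-host case: Serf's gossip uses both TCP and UDP on its port, so both protocols need to be published. A minimal sketch, where serf_publish_flags is a made-up helper and IMAGE and INSTANCE_IP are placeholders:

```shell
#!/usr/bin/env bash
# Sketch: build the docker run publish flags for Serf's gossip port.
# serf_publish_flags is a hypothetical helper; 7946 is Serf's default port.

# Prints -p flags that publish the given port for both TCP and UDP.
serf_publish_flags() {
    local port="${1:-7946}"
    echo "-p ${port}:${port}/tcp -p ${port}:${port}/udp"
}

# Example (IMAGE and INSTANCE_IP are placeholders):
#   docker run -d $(serf_publish_flags) IMAGE \
#       serf agent -node=Node1 -iface=eth0 -advertise=INSTANCE_IP
```

The same two protocols also have to be allowed for port 7946 in the EC2 security group between the instances.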
