Docker Swarm with Consul - Manager not electing primary

I'm trying to set up an HA Docker cluster on 3 dedicated PCs. I've successfully followed the instructions on docs.docker.com/engine/installation/linux/ubuntulinux and am now following the instructions on https://docs.docker.com/swarm/install-manual. Since I'm not using any virtualization, I start at "Set up an consul discovery backend". The PCs (running Ubuntu 14.04 Trusty, server edition) are all in the LAN 192.168.2.0/24: ubuntu001 has .104, ubuntu002 has .106, and ubuntu003 has .105.
I did the following according to the instructions:
arnolde@ubuntu001:~$ docker run -d -p 8500:8500 --name=consul progrium/consul -server -bootstrap
arnolde@ubuntu001:~$ docker run -d -p 4000:4000 swarm manage -H :4000 --replication --advertise 192.168.2.104:4000 consul://192.168.2.104
arnolde@ubuntu002:~# docker run -d swarm manage -H :4000 --replication --advertise 192.168.2.106:4000 consul://192.168.2.104:8500
arnolde@ubuntu003:~$ docker run -d swarm join --advertise=192.168.2.105:2375 consul://192.168.2.104:8500
But then when trying the next step, the swarm manager does NOT show up as "Primary" like it says it should, and no primary is listed:
arnolde@ubuntu001:~$ docker -H :4000 info
Containers: 0
Running: 0
Paused: 0
Stopped: 0
Images: 0
Server Version: swarm/1.1.0
Role: replica
Primary:
Strategy: spread
Filters: health, port, dependency, affinity, constraint
Nodes: 0
Plugins:
Volume:
Network:
Kernel Version: 3.19.0-25-generic
Operating System: linux
Architecture: amd64
CPUs: 0
Total Memory: 0 B
And:
arnolde@ubuntu001:~$ docker -H :4000 run hello-world
docker: Error response from daemon: No elected primary cluster manager.
I searched and found https://github.com/docker/swarm/issues/1491, which recommends using dockerswarm/swarm:master instead. I tried that, but it didn't help:
arnolde@ubuntu001:~$ docker run -d -p 4000:4000 dockerswarm/swarm:master manage -H :4000 --replication --advertise 192.168.2.104:4000 consul://192.168.2.104
I didn't find anything else about swarm+consul+primary, so here I am. Any suggestions? Unfortunately I'm not sure how to troubleshoot, since I don't even know where to look for logging/debugging info, e.g. whether the manager is connecting to Consul successfully.

I was able to solve it myself by explicitly adding the port number to the consul:// parameter. Apparently the Docker docs are incomplete:
arnolde@ubuntu001:~$ docker run -d -p 4000:4000 dockerswarm/swarm:master manage -H :4000 --replication --advertise 192.168.2.104:4000 consul://192.168.2.104:8500
arnolde@ubuntu001:~$ docker -H :4000 info
Containers: 0
Running: 0
Paused: 0
Stopped: 0
Images: 0
Server Version: swarm/1.1.0
Role: replica
Primary: 192.168.2.106:4000
Also I added "-p 4000:4000" to the command on the replica manager (on ubuntu002). Not sure if that was necessary (or even a good idea).
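The fix above comes down to never relying on an implied port in the discovery URL. As a minimal sketch in Python (the helper name with_explicit_port is hypothetical and not part of Docker or Swarm), the following makes a consul:// address explicit before it is handed to swarm manage or swarm join. It assumes port 8500, Consul's default HTTP API port, which is also the port published with -p 8500:8500 in the question.

```python
from urllib.parse import urlsplit

# Consul's default HTTP API port, as published with -p 8500:8500 in the question.
CONSUL_HTTP_PORT = 8500


def with_explicit_port(url: str, default_port: int = CONSUL_HTTP_PORT) -> str:
    """Return the discovery URL with a port, adding default_port if none is given.

    Hypothetical helper for illustration; it only rewrites the string and
    does not contact Consul.
    """
    parts = urlsplit(url)
    if parts.scheme != "consul":
        raise ValueError(f"expected a consul:// URL, got {url!r}")
    if not parts.hostname:
        raise ValueError(f"no host in {url!r}")
    if parts.port is not None:
        return url  # already explicit, leave untouched
    return f"consul://{parts.hostname}:{default_port}{parts.path}"


if __name__ == "__main__":
    print(with_explicit_port("consul://192.168.2.104"))
    print(with_explicit_port("consul://192.168.2.104:8500"))
```

Both calls print the same address with :8500, which is the form that worked on the manager in the answer above.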

First, edit the Docker daemon startup configuration on each node so that the daemon listens on a TCP port and points at the cluster store. My environment is CentOS 7, so my daemon configuration is in /usr/lib/docker/.... On each node, set "ExecStart=/usr/bin/docker daemon -H fd:// -H tcp://0.0.0.0:2375 -H unix:///var/run/docker.sock --cluster-store=consul://192.168.1.102:8500 --cluster-advertise=192.168.1.103:0". Second, run "docker run -d -p 8500:8500 --name=consul progrium/consul -server -bootstrap", and so on...

Related

SSL(curl) connection error in ElasticSearch setup

I have set up a 3-node Elasticsearch cluster using docker-compose, following the steps listed further down.
One of the master nodes, es11, returns the error below, while the same curl command works fine on the other two nodes (es12 and es13):
Error:
curl -X GET 'https://localhost:9316'
curl: (35) Encountered end of file
Below error in logs:
"stacktrace": ["org.elasticsearch.transport.RemoteTransportException: [es13][SOMEIP:9316][internal:cluster/coordination/join]",
"Caused by: org.elasticsearch.transport.ConnectTransportException: [es11][SOMEIP:9316] handshake failed. unexpected remote node {es13}{SOMEVALUE}{SOMEVALUE
"at org.elasticsearch.transport.TransportService.lambda$connectionValidator$6(TransportService.java:468) ~[elasticsearch-7.17.6.jar:7.17.6]",
"at org.elasticsearch.action.ActionListener$MappedActionListener.onResponse(ActionListener.java:95) ~[elasticsearch-7.17.6.jar:7.17.6]",
"at org.elasticsearch.transport.TransportService.lambda$handshake$9(TransportService.java:577) ~[elasticsearch-7.17.6.jar:7.17.6]",
Opening https://localhost:9316 in a browser also gives a "site can't be reached" error. It seems the SSL certificate created in step 4 below has some issue on es11.
Any leads, please? Or, if I repeat step 4, do I need to copy the certs to es12 and es13 again?
elasticsearch.yml:
cluster.name: "docker-cluster"
network.host: 0.0.0.0
Ports as defined in docker-compose.yml on all 3 nodes:
environment:
  - node.name=es11
  - transport.port=9316
ports:
  - 9216:9200
  - 9316:9316
1. Initialize a Docker swarm: on es11, run docker swarm init, then follow the instructions to join es12 and es13 to the swarm.
2. Create an overlay network: docker network create -d overlay --attachable elastic
3. If necessary, bring down the current cluster and remove all associated volumes: docker-compose down -v
4. Create SSL certificates for ES: docker-compose -f create-certs.yml run --rm create_certs
5. Copy the certs for es12 and es13 to the respective servers.
6. Use this busybox container to create the overlay network on es12 and es13: sudo docker run -itd --name containerX --net [network name] busybox
7. Configure certs on es12 and es13: docker-compose -f config-certs.yml run --rm config_certs
8. Start the cluster with docker-compose up -d on each server.
9. Set the passwords for the built-in ES accounts: log into the cluster with docker exec -it es11 sh, then run bin/elasticsearch-setup-passwords interactive --url localhost:9316
(as per your https://discuss.elastic.co thread)
You cannot speak HTTP to the transport port, which you have set with transport.port (9316). You need to talk to port 9200 in the container, which you have mapped to 9216 outside the container.
The transport port runs a binary protocol that is not accessible over HTTP.
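To make the port distinction concrete, here is a small Python sketch that looks up which host port is published for the container's HTTP port and builds the URL that curl should target. The function names host_port_for and rest_url are hypothetical, and the parsing assumes only the short HOST:CONTAINER form used in the question's docker-compose.yml.

```python
# Port mappings from the question's docker-compose.yml (HOST:CONTAINER).
PORTS = ["9216:9200", "9316:9316"]

HTTP_PORT = 9200       # Elasticsearch HTTP (REST) port inside the container
TRANSPORT_PORT = 9316  # transport.port from the question: binary protocol, not HTTP


def host_port_for(mappings, container_port):
    """Return the host port published for container_port, or None if unpublished.

    Assumes every mapping is a plain "HOST:CONTAINER" string.
    """
    for mapping in mappings:
        host, sep, container = str(mapping).partition(":")
        if sep and int(container) == container_port:
            return int(host)
    return None


def rest_url(mappings, scheme="https", host="localhost"):
    """Build the URL for the REST API using the published HTTP port."""
    port = host_port_for(mappings, HTTP_PORT)
    if port is None:
        raise LookupError(f"container port {HTTP_PORT} is not published")
    return f"{scheme}://{host}:{port}"


if __name__ == "__main__":
    print(rest_url(PORTS))                       # the address to use with curl
    print(host_port_for(PORTS, TRANSPORT_PORT))  # node-to-node traffic only
```

With the mappings from the question, the REST URL resolves to port 9216 on the host, while 9316 remains reserved for the transport protocol.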

Connecting with Portainer: "resource is online but isn't responding to connection attempts"

I installed Ubuntu on an older laptop. Docker is now running on it with Portainer, and I want to access Portainer from my main PC on the same network. When I connect to Portainer from the laptop where it is running (using its network address, not localhost), it works fine. But when I try to connect from my PC, I get a timeout. Windows diagnostics says: "resource is online but isn't responding to connection attempts". How can I open Portainer to my local network? Or is this a problem with Ubuntu?
Check whether an OpenSSH server is running so you can reach the machine over SSH. Disable the firewall from a terminal with sudo ufw disable. Then run ifconfig to check whether your network card is named eth0; if it is not, change it by following the steps below.
Netplan is the default these days, and its configuration file is /etc/netplan/00-installer-config.yaml. Before editing it, you need the interface's serial (MAC address).
Find the target devices mac/hw address using the lshw command:
lshw -C network
You'll see some output which looks like:
root@ys:/etc# lshw -C network
*-network
description: Ethernet interface
physical id: 2
logical name: eth0
serial: dc:a6:32:e8:23:19
size: 1Gbit/s
capacity: 1Gbit/s
capabilities: ethernet physical tp mii 10bt 10bt-fd 100bt 100bt-fd 1000bt 1000bt-fd autonegotiation
configuration: autonegotiation=on broadcast=yes driver=bcmgenet driverversion=5.8.0-1015-raspi duplex=full ip=192.168.0.112 link=yes multicast=yes port=MII speed=1Gbit/s
Take the serial:
dc:a6:32:e8:23:19
Note the set-name option in the example below.
This works for the wifi section as well.
If you are using a cable, you can delete everything in the file and add the example below, changing only the serial (MAC). Edit the file with sudo nano /etc/netplan/00-installer-config.yaml:
network:
  version: 2
  ethernets:
    eth0:
      dhcp4: true
      match:
        macaddress: <YOUR MAC ID HERE>
      set-name: eth0
Then, to test this config, run:
netplan try
When you're happy with it, run:
netplan apply
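The netplan stanza above differs between machines only in the MAC address and the interface name. As a rough sketch in Python, the function below renders that stanza with correct indentation and rejects a mistyped MAC before it reaches the file. The name netplan_stanza is hypothetical, and writing the MAC in quotes is my own choice so that a YAML parser always reads it as a string.

```python
import re

# Six colon-separated hex pairs, e.g. dc:a6:32:e8:23:19 (the "serial" from lshw).
MAC_RE = re.compile(r"^[0-9a-f]{2}(:[0-9a-f]{2}){5}$")


def netplan_stanza(mac: str, name: str = "eth0") -> str:
    """Render the netplan config shown above for one wired interface.

    Hypothetical helper: it returns text only and does not write any file.
    """
    mac = mac.strip().lower()
    if not MAC_RE.match(mac):
        raise ValueError(f"not a MAC address: {mac!r}")
    return (
        "network:\n"
        "  version: 2\n"
        "  ethernets:\n"
        f"    {name}:\n"
        "      dhcp4: true\n"
        "      match:\n"
        f'        macaddress: "{mac}"\n'
        f"      set-name: {name}\n"
    )


if __name__ == "__main__":
    # MAC taken from the lshw output above.
    print(netplan_stanza("dc:a6:32:e8:23:19"), end="")
```

The printed text can be pasted into /etc/netplan/00-installer-config.yaml and then checked with netplan try as described above.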
Reboot Ubuntu.
After the restart, stop the Portainer container:
sudo docker stop portainer
Remove the Portainer container:
sudo docker rm portainer
Now run it again with the latest version:
docker run -d -p 8000:8000 -p 9000:9000 \
--name=portainer --restart=always \
-v /var/run/docker.sock:/var/run/docker.sock \
-v portainer_data:/data \
portainer/portainer-ce:2.13.1

Flink cannot be run in Marathon

I have three physical nodes with Docker installed on them. I have one Docker container with Mesos, Marathon, Hadoop and Flink. I configured the master node and slave nodes for Mesos, ZooKeeper and Marathon, doing the work step by step.
First, on the master node, I start the Docker container with this command:
docker run -v /home/user/.ssh:/root/.ssh --privileged -p 5050:5050 -p 5051:5051 -p 5052:5052 -p 2181:2181 -p 8082:8081 -p 6123:6123 -p 8080:8080 -p 50090:50090 -p 50070:50070 -p 9000:9000 -p 2888:2888 -p 3888:3888 -p 4041:4040 -p 7077:7077 -p 52222:22 -e WEAVE_CIDR=10.32.0.2/12 -e MESOS_EXECUTOR_REGISTRATION_TIMEOUT=5mins -e LIBPROCESS_IP=10.32.0.2 -e MESOS_RESOURCES=ports*:[11000-11999] -ti hadoop_marathon_mesos_flink_2 /bin/bash
Then run Mesos and Zookeeper :
/home/zookeeper-3.4.14/bin/zkServer.sh restart
/home/mesos-1.7.2/build/bin/mesos-master.sh --ip=10.32.0.1 --hostname=10.32.0.1 --roles=marathon,flink --quorum=1 --work_dir=/var/run/mesos --log_dir=/var/log/mesos
After that run Marathon in the same container:
/home/marathon-1.7.189-48bfd6000/bin/marathon --master 10.32.0.1:5050 --zk zk://10.32.0.1:2181/marathon --hostname 10.32.0.1 --webui_url 10.32.0.1:8080 --logging_level debug
And finally, I run hadoop:
/opt/hadoop/sbin/start-dfs.sh
Marathon, Mesos and Hadoop run without any problems.
The most important part of my work is running Flink in Marathon. I configured Flink in the Docker container like this:
env.java.home: /opt/java
jobmanager.rpc.address: 10.32.0.1
high-availability: zookeeper
high-availability.storageDir: hdfs:///flink/ha/
high-availability.zookeeper.quorum: 10.32.0.1:2181,10.32.0.2:2181
recovery.zookeeper.path.mesos-workers: /mesos-workers
In the Marathon UI, I create an application and paste in this JSON, but it fails.
{
"id": "flink",
"cmd": "/home/flink-1.7.0/bin/mesos-appmaster.sh
-Dmesos.master=10.32.0.1:5050,10.32.0.2:5050
-Dmesos.initial-tasks=1",
"cpus": 1.0,
"mem": 1024
}
The Flink application fails in the Mesos UI, which shows this error:
I0428 06:01:39.586699 6155 exec.cpp:162] Version: 1.7.2
I0428 06:01:39.596458 6154 exec.cpp:236] Executor registered on agent 984595ae-e811-48fb-a9f5-ca6128e1cc1a-S0
I0428 06:01:39.598870 6157 executor.cpp:188] Received SUBSCRIBED event
I0428 06:01:39.599761 6157 executor.cpp:192] Subscribed executor on 10.32.0.3
I0428 06:01:39.599963 6157 executor.cpp:188] Received LAUNCH event
I0428 06:01:39.601236 6157 executor.cpp:697] Starting task flink.16a7cc18-697b-11e9-928f-ce235caa831e
I0428 06:01:39.613719 6157 executor.cpp:712] Forked command at 6163
I0428 06:01:39.787395 6157 executor.cpp:1013] Command exited with status 1 (pid: 6163)
I0428 06:01:40.791885 6162 process.cpp:927] Stopped the socket accept loop
The strange thing is that I see this text in stdout, even though I set JAVA_HOME in /etc/environment and in flink-conf.yaml:
Please specify JAVA_HOME. Either in Flink config ./conf/flink-conf.yaml or as system-wide JAVA_HOME.
Could you please tell me what I should do about this problem?
Many thanks.
You can check your Flink log on the slave node. It is also better to change your JSON file like this, which makes the application easier to follow:
{
  "id": "flink",
  "cmd": "/home/flink-1.7.0/bin/mesos-appmaster.sh -Djobmanager.heap.mb=1024 -Djobmanager.rpc.port=6123 -Drest.port=8081 -Dmesos.resourcemanager.tasks.mem=1024 -Dtaskmanager.heap.mb=1024 -Dtaskmanager.numberOfTaskSlots=2 -Dparallelism.default=2 -Dmesos.resourcemanager.tasks.cpus=1",
  "cpus": 1.0,
  "mem": 1024,
  "fetch": [
    {
      "uri": "/home/flink-1.7.0/bin/mesos-appmaster.sh",
      "executable": true
    }
  ]
}
Also, add JAVA_HOME to flink-conf.yaml on every node, master and slaves:
env.java.home: /opt/java
With JAVA_HOME added, you no longer see the error in stdout.
I think this is useful.
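A JSON string may not contain a raw line break, so the long cmd value is easy to get wrong when it is wrapped by hand. One way to avoid that, sketched here in Python, is to keep the -D options in a dictionary and let json.dumps produce the application definition. The function name marathon_app is hypothetical, the option values are the ones from the answer above, and the fetch section is left out for brevity.

```python
import json

FLINK_HOME = "/home/flink-1.7.0"  # install path used in the question

# -D options from the answer above, in the same order.
OPTIONS = {
    "jobmanager.heap.mb": 1024,
    "jobmanager.rpc.port": 6123,
    "rest.port": 8081,
    "mesos.resourcemanager.tasks.mem": 1024,
    "taskmanager.heap.mb": 1024,
    "taskmanager.numberOfTaskSlots": 2,
    "parallelism.default": 2,
    "mesos.resourcemanager.tasks.cpus": 1,
}


def marathon_app(app_id, options, cpus=1.0, mem=1024):
    """Build a Marathon application definition as a dict (hypothetical helper).

    The -D flags are joined with single spaces so that "cmd" stays on one line.
    """
    flags = " ".join(f"-D{key}={value}" for key, value in options.items())
    return {
        "id": app_id,
        "cmd": f"{FLINK_HOME}/bin/mesos-appmaster.sh {flags}",
        "cpus": cpus,
        "mem": mem,
    }


if __name__ == "__main__":
    # Valid JSON that can be pasted into the Marathon UI.
    print(json.dumps(marathon_app("flink", OPTIONS), indent=2))
```

Generating the definition this way keeps each option on its own line in the source while the emitted JSON stays valid.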

consul container exits with a protocol version error

I am trying to start a Consul container, and it keeps exiting with the output below. Oddly, I don't really think it is an error:
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)
The following is the command I am using:
docker container run --net host --name consul-server -e 'CONSUL_LOCAL_CONFIG={"skip_leave_on_interrupt": true}' -e CONSUL_BIND_INTERFACE='eth0' consul agent -server -client 0.0.0.0 -dns-port 53 -bootstrap-expect 1 -ui -datacenter dc1 -v "/var/lib/consul:/consul/data" -data-dir /var/lib/consul
It is a single node fresh installation with latest version from registry, so there is no upgrade or version mismatch with any agent/client happening here.
Two things to fix. First, the -v volume argument belongs to the docker command, not to the consul command. Move it before the image name, and keep -data-dir after consul agent, since it is a Consul option:
docker container run -v "/consul/data:/var/lib/consul" --net host --name consul-server -e 'CONSUL_LOCAL_CONFIG={"skip_leave_on_interrupt": true}' -e CONSUL_BIND_INTERFACE='eth0' consul agent -server -client 0.0.0.0 -dns-port 53 -bootstrap-expect 1 -ui -datacenter dc1 -data-dir /var/lib/consul
Also swap the two paths (the format is /host/dir:/container/dir).
Second, by default Consul can't listen on privileged ports (e.g. 53). See https://www.consul.io/docs/guides/forwarding.html, remove -dns-port 53, and implement one of the approaches recommended there:
docker container run -v "/consul/data:/var/lib/consul" --net host --name consul-server -e 'CONSUL_LOCAL_CONFIG={"skip_leave_on_interrupt": true}' -e CONSUL_BIND_INTERFACE='eth0' consul agent -server -client 0.0.0.0 -bootstrap-expect 1 -ui -datacenter dc1 -data-dir /var/lib/consul
I recommend the Dnsmasq setup; it is easy to implement.
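The ordering rule behind the first fix is that everything before the image name is parsed by docker, and everything after it is passed to the container's command. A minimal Python sketch that keeps the two groups apart when assembling the command line follows; consul_run_argv is a hypothetical name, and the paths are the ones used in this answer.

```python
import shlex


def consul_run_argv(host_dir, container_dir, image="consul"):
    """Assemble the docker command line with flags on the correct side of the image.

    Hypothetical helper: it builds the argument list and runs nothing.
    """
    docker_flags = [
        "--net", "host",
        "--name", "consul-server",
        "-v", f"{host_dir}:{container_dir}",  # host path first, container path second
        "-e", 'CONSUL_LOCAL_CONFIG={"skip_leave_on_interrupt": true}',
        "-e", "CONSUL_BIND_INTERFACE=eth0",
    ]
    consul_args = [
        "agent", "-server",
        "-client", "0.0.0.0",
        "-bootstrap-expect", "1",
        "-ui",
        "-datacenter", "dc1",
        "-data-dir", container_dir,  # Consul option, so it goes after the image
    ]
    return ["docker", "container", "run", *docker_flags, image, *consul_args]


if __name__ == "__main__":
    argv = consul_run_argv("/consul/data", "/var/lib/consul")
    print(shlex.join(argv))  # shell-quoted form, ready to paste
```

Passing the resulting list to subprocess.run would avoid shell quoting problems with the JSON in CONSUL_LOCAL_CONFIG, though that step is not shown here.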
@Robert Alright, I think we also went a bit off topic here. The real issue is the message it shows before exiting immediately.
I tried your example and it gives the same message/error (I don't think it is an error, though):
[root@ip-X-X-X-X user]# docker container run --net host --name consul-server -e 'CONSUL_LOCAL_CONFIG={"skip_leave_on_interrupt": true}' -e CONSUL_BIND_INTERFACE='eth0' consul agent -server -client 0.0.0.0 -dns-port 53 -bootstrap-expect 1 -ui -datacenter dc1 -v "/var/lib/consul:/consul/data" -data-dir /var/lib/consul
==> Found address 'X.X.X.X' for interface 'eth0', setting bind option...
Consul v0.8.5
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)
[root@ip-X-X-X-X user]# docker container ls | grep consul-server
[root@ip-10-201-14-34 user]#
Same for the recursors example:
[root@ip-X.X.X.X user]# docker container run --net host --name consul-server -e 'CONSUL_LOCAL_CONFIG={"skip_leave_on_interrupt": true}' -e CONSUL_BIND_INTERFACE='eth0' consul agent -server -client 0.0.0.0 -dns-port 53 -bootstrap-expect 1 -ui -datacenter dc1 -v "/var/lib/consul:/consul/data" -data-dir /var/lib/consul -recursers 8.8.8.8
==> Found address 'X.X.X.X' for interface 'eth0', setting bind option...
Consul v0.8.5
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)
[root@ip-X-X-X-X user]# docker container ls | grep consul-server
[root@ip-10-201-14-34 user]#

Docker remote api don't restart after my computer restart

Last week I struggled to get my Docker remote API working. Since it runs on a VM, I had not restarted the VM since then. Today I finally restarted the VM and the remote API is not working any more (docker and docker-compose work normally). My Docker init file, /etc/init/docker.conf, looks like this:
description "Docker daemon"
start on filesystem and started lxc-net
stop on runlevel [!2345]
respawn
script
/usr/bin/docker -H tcp://0.0.0.0:4243 -d
end script
# description "Docker daemon"
# start on (filesystem and net-device-up IFACE!=lo)
# stop on runlevel [!2345]
# limit nofile 524288 1048576
# limit nproc 524288 1048576
respawn
kill timeout 20
.....
.....
Last time, I applied the settings indicated here.
I used nmap to see whether port 4243 is open:
ubuntu@ubuntu:~$ nmap 0.0.0.0 -p-
Starting Nmap 7.01 ( https://nmap.org ) at 2016-10-12 23:49 CEST
Nmap scan report for 0.0.0.0
Host is up (0.000046s latency).
Not shown: 65531 closed ports
PORT STATE SERVICE
22/tcp open ssh
43978/tcp open unknown
44672/tcp open unknown
60366/tcp open unknown
Nmap done: 1 IP address (1 host up) scanned in 1.11 seconds
As you can see, port 4243 is not open.
When I run:
ubuntu@ubuntu:~$ echo -e "GET /images/json HTTP/1.0\r\n" | nc -U
This is nc from the netcat-openbsd package. An alternative nc is available
in the netcat-traditional package.
usage: nc [-46bCDdhjklnrStUuvZz] [-I length] [-i interval] [-O length]
[-P proxy_username] [-p source_port] [-q seconds] [-s source]
[-T toskeyword] [-V rtable] [-w timeout] [-X proxy_protocol]
[-x proxy_address[:port]] [destination] [port]
I also ran this:
ubuntu@ubuntu:~$ sudo docker -H=tcp://0.0.0.0:4243 -d
flag provided but not defined: -d
See 'docker --help'.
I have restarted my computer many times and tried a lot of things with no success.
I already have a group named docker and my user is in it:
ubuntu@ubuntu:~$ groups $USER
ubuntu : ubuntu adm cdrom sudo dip plugdev lpadmin sambashare docker
Please tell me what is wrong.
Your startup script contains an invalid command:
/usr/bin/docker -H tcp://0.0.0.0:4243 -d
Instead you need something like:
/usr/bin/docker daemon -H tcp://0.0.0.0:4243
As of 1.12, this is now (but docker daemon will still work):
/usr/bin/dockerd -H tcp://0.0.0.0:4243
Please note that this is opening a port that gives remote root access without any password to your docker host.
Anyone that wants to take over your machine can run docker run -v /:/target -H your.ip:4243 busybox /bin/sh to get a root shell with your filesystem mounted at /target. If you'd like to secure your host, follow this guide to setting up TLS certificates.
I finally found www.ivankrizsan.se and it is working fine now. Thanks to this guy (or girl) ;).
These settings work for me on Ubuntu 16.04. Here is how to do it:
Edit the file /lib/systemd/system/docker.service and replace the line ExecStart=/usr/bin/dockerd -H fd:// with
ExecStart=/usr/bin/docker daemon -H fd:// -H tcp://0.0.0.0:4243
Save the file.
Restart with: sudo service docker restart
Test with: curl http://localhost:4243/version
Result: you should see something like this:
{"Version":"1.11.0","ApiVersion":"1.23","GitCommit":"4dc5990","GoVersion":"go1.5.4","Os":"linux","Arch":"amd64","KernelVersion":"4.4.0-22-generic","BuildTime":"2016-04-13T18:38:59.968579007+00:00"}
Attention:
Be aware that binding to 0.0.0.0 is not good for security. For more security, you should use 127.0.0.1.
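The curl test above can also be scripted. The following Python sketch requests /version from the remote API and extracts the two fields that matter most when checking a client against a daemon. The function names parse_version and fetch_version are hypothetical; the default host and port assume the daemon is bound to 127.0.0.1:4243 as recommended above.

```python
import json
from urllib.request import urlopen


def parse_version(body):
    """Extract (Version, ApiVersion) from the JSON body returned by /version."""
    info = json.loads(body)
    return info["Version"], info["ApiVersion"]


def fetch_version(host="127.0.0.1", port=4243, timeout=3.0):
    """Ask the Docker remote API for its version over plain HTTP.

    Assumes an unencrypted TCP listener as configured above; a connection
    failure surfaces as urllib.error.URLError.
    """
    with urlopen(f"http://{host}:{port}/version", timeout=timeout) as response:
        return parse_version(response.read())
```

Calling fetch_version() on the Docker host should return a pair such as the Version and ApiVersion values in the sample output above; an exception means the daemon is not listening on that address and port.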
