[SOLVED] Flannel doesn't work with this setup, so I changed to Weave Net. If you don't want to switch, provide the pod-network-cidr: "10.244.0.0/16" flag in the config.yaml.
I want to build a multi-master setup with Kubernetes and have tried a lot of different ways. Even the last approach I took doesn't work. The problem is that the DNS and the flannel network plugin don't start; they end up in CrashLoopBackOff status every time. The steps I followed are listed below.
First, I created an external etcd cluster with this command on every node (only the addresses change):
nohup etcd --name kube1 --initial-advertise-peer-urls http://192.168.100.110:2380 \
--listen-peer-urls http://192.168.100.110:2380 \
--listen-client-urls http://192.168.100.110:2379,http://127.0.0.1:2379 \
--advertise-client-urls http://192.168.100.110:2379 \
--initial-cluster-token etcd-cluster-1 \
--initial-cluster kube1=http://192.168.100.110:2380,kube2=http://192.168.100.108:2380,kube3=http://192.168.100.104:2380 \
--initial-cluster-state new &
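Since only the name and addresses change from node to node, the per-node command can be generated from one member table. This is a minimal sketch, not part of the original setup: the helper name `etcd_args` and the `MEMBERS` table are hypothetical, and the flags are simply the ones from the command above.

```python
# Hypothetical member table; names and addresses are copied from the question.
MEMBERS = {
    "kube1": "192.168.100.110",
    "kube2": "192.168.100.108",
    "kube3": "192.168.100.104",
}


def etcd_args(name, members, token="etcd-cluster-1"):
    """Build the etcd argument list for one member of a new static cluster."""
    ip = members[name]
    # Every member gets the same --initial-cluster string.
    initial_cluster = ",".join(
        f"{n}=http://{addr}:2380" for n, addr in members.items()
    )
    return [
        "etcd",
        "--name", name,
        "--initial-advertise-peer-urls", f"http://{ip}:2380",
        "--listen-peer-urls", f"http://{ip}:2380",
        "--listen-client-urls", f"http://{ip}:2379,http://127.0.0.1:2379",
        "--advertise-client-urls", f"http://{ip}:2379",
        "--initial-cluster-token", token,
        "--initial-cluster", initial_cluster,
        "--initial-cluster-state", "new",
    ]


if __name__ == "__main__":
    # Print one command line per member for copy and paste.
    for member in MEMBERS:
        print(" ".join(etcd_args(member, MEMBERS)))
```

Generating the flags this way keeps the --name value and the --initial-cluster entry for each node in sync, which is an easy thing to get wrong when editing the command by hand.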
Then I created a config.yaml file for the kubeadm init command.
apiVersion: kubeadm.k8s.io/v1alpha1
kind: MasterConfiguration
api:
  advertiseAddress: 192.168.100.110
etcd:
  endpoints:
  - "http://192.168.100.110:2379"
  - "http://192.168.100.108:2379"
  - "http://192.168.100.104:2379"
apiServerExtraArgs:
  apiserver-count: "3"
apiServerCertSANs:
- "192.168.100.110"
- "192.168.100.108"
- "192.168.100.104"
- "127.0.0.1"
token: "64bhyh.1vjuhruuayzgtykv"
tokenTTL: "0"
Start command: kubeadm init --config /root/config.yaml
Then I copied /etc/kubernetes/pki and the config to the other nodes and started the other master nodes the same way. But it doesn't work.
So what is the right way to initialize a multi-master Kubernetes cluster, or why does my flannel network not start?
Status from a flannel pod:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulMountVolume 8m kubelet, kube2 MountVolume.SetUp succeeded for volume "run"
Normal SuccessfulMountVolume 8m kubelet, kube2 MountVolume.SetUp succeeded for volume "cni"
Normal SuccessfulMountVolume 8m kubelet, kube2 MountVolume.SetUp succeeded for volume "flannel-token-swdhl"
Normal SuccessfulMountVolume 8m kubelet, kube2 MountVolume.SetUp succeeded for volume "flannel-cfg"
Normal Pulling 8m kubelet, kube2 pulling image "quay.io/coreos/flannel:v0.10.0-amd64"
Normal Pulled 8m kubelet, kube2 Successfully pulled image "quay.io/coreos/flannel:v0.10.0-amd64"
Normal Created 8m kubelet, kube2 Created container
Normal Started 8m kubelet, kube2 Started container
Normal Pulled 8m (x4 over 8m) kubelet, kube2 Container image "quay.io/coreos/flannel:v0.10.0-amd64" already present on machine
Normal Created 8m (x4 over 8m) kubelet, kube2 Created container
Normal Started 8m (x4 over 8m) kubelet, kube2 Started container
Warning BackOff 3m (x23 over 8m) kubelet, kube2 Back-off restarting failed container
etcd version
etcd --version
etcd Version: 3.3.6
Git SHA: 932c3c01f
Go Version: go1.9.6
Go OS/Arch: linux/amd64
kubectl version
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.3", GitCommit:"2bba0127d85d5a46ab4b778548be28623b32d0b0", GitTreeState:"clean", BuildDate:"2018-05-21T09:17:39Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.4", GitCommit:"5ca598b4ba5abb89bb773071ce452e33fb66339d", GitTreeState:"clean", BuildDate:"2018-06-06T08:00:59Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Last lines of the nohup output from etcd
2018-06-06 19:44:28.441304 I | etcdserver: name = kube1
2018-06-06 19:44:28.441327 I | etcdserver: data dir = kube1.etcd
2018-06-06 19:44:28.441331 I | etcdserver: member dir = kube1.etcd/member
2018-06-06 19:44:28.441334 I | etcdserver: heartbeat = 100ms
2018-06-06 19:44:28.441336 I | etcdserver: election = 1000ms
2018-06-06 19:44:28.441338 I | etcdserver: snapshot count = 100000
2018-06-06 19:44:28.441343 I | etcdserver: advertise client URLs = http://192.168.100.110:2379
2018-06-06 19:44:28.441346 I | etcdserver: initial advertise peer URLs = http://192.168.100.110:2380
2018-06-06 19:44:28.441352 I | etcdserver: initial cluster = kube1=http://192.168.100.110:2380,kube2=http://192.168.100.108:2380,kube3=http://192.168.100.104:2380
2018-06-06 19:44:28.443825 I | etcdserver: starting member a4df4f699dd66909 in cluster 73f203cf831df407
2018-06-06 19:44:28.443843 I | raft: a4df4f699dd66909 became follower at term 0
2018-06-06 19:44:28.443848 I | raft: newRaft a4df4f699dd66909 [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
2018-06-06 19:44:28.443850 I | raft: a4df4f699dd66909 became follower at term 1
2018-06-06 19:44:28.447834 W | auth: simple token is not cryptographically signed
2018-06-06 19:44:28.448857 I | rafthttp: starting peer 9e0f381e79b9b9dc...
2018-06-06 19:44:28.448869 I | rafthttp: started HTTP pipelining with peer 9e0f381e79b9b9dc
2018-06-06 19:44:28.450791 I | rafthttp: started peer 9e0f381e79b9b9dc
2018-06-06 19:44:28.450803 I | rafthttp: added peer 9e0f381e79b9b9dc
2018-06-06 19:44:28.450809 I | rafthttp: starting peer fc9c29e972d01e69...
2018-06-06 19:44:28.450816 I | rafthttp: started HTTP pipelining with peer fc9c29e972d01e69
2018-06-06 19:44:28.453543 I | rafthttp: started peer fc9c29e972d01e69
2018-06-06 19:44:28.453559 I | rafthttp: added peer fc9c29e972d01e69
2018-06-06 19:44:28.453570 I | etcdserver: starting server... [version: 3.3.6, cluster version: to_be_decided]
2018-06-06 19:44:28.455414 I | rafthttp: started streaming with peer 9e0f381e79b9b9dc (writer)
2018-06-06 19:44:28.455431 I | rafthttp: started streaming with peer 9e0f381e79b9b9dc (writer)
2018-06-06 19:44:28.455445 I | rafthttp: started streaming with peer 9e0f381e79b9b9dc (stream MsgApp v2 reader)
2018-06-06 19:44:28.455578 I | rafthttp: started streaming with peer 9e0f381e79b9b9dc (stream Message reader)
2018-06-06 19:44:28.455697 I | rafthttp: started streaming with peer fc9c29e972d01e69 (writer)
2018-06-06 19:44:28.455704 I | rafthttp: started streaming with peer fc9c29e972d01e69 (writer)
#
If you have no hosting preference and are OK with creating the cluster on AWS, this can be done very easily with kops.
https://github.com/kubernetes/kops
With kops you can easily configure the autoscaling group for the masters and specify the number of masters and nodes your cluster needs.
Flannel doesn't work with this setup, so I changed to Weave Net. If you don't want to switch, provide the pod-network-cidr: "10.244.0.0/16" flag in the config.yaml.
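To make the pod CIDR point concrete: flannel expects the cluster's pod CIDR to agree with the network in its own configuration. Below is a rough sketch of that check. It assumes the net-conf.json from flannel's stock kube-flannel.yml manifest (Network 10.244.0.0/16); the helper `pod_cidr_matches` is hypothetical and only illustrates the comparison, it does not talk to a cluster.

```python
import ipaddress
import json

# Assumption: this mirrors the net-conf.json in flannel's stock manifest.
FLANNEL_NET_CONF = '{"Network": "10.244.0.0/16", "Backend": {"Type": "vxlan"}}'


def pod_cidr_matches(net_conf_json, pod_cidr):
    """Return True if the cluster pod CIDR equals flannel's configured Network.

    pod_cidr is None when no pod CIDR was given to kubeadm at all,
    which is the situation in the config.yaml from the question.
    """
    if not pod_cidr:
        return False
    flannel_net = ipaddress.ip_network(json.loads(net_conf_json)["Network"])
    return ipaddress.ip_network(pod_cidr) == flannel_net
```

With the config.yaml from the question (no pod CIDR set), the check returns False; with "10.244.0.0/16" it returns True.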
Related
I need to bring up Kafka and Cassandra in Minikube.
Host OS is Ubuntu 16.04
$ uname -a
Linux minikuber 4.4.0-31-generic #50-Ubuntu SMP Wed Jul 13 00:07:12 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
Minikube started normally:
$ minikube start
Starting local Kubernetes v1.8.0 cluster...
Starting VM...
Getting VM IP address...
Moving files into cluster...
Setting up certs...
Connecting to cluster...
Setting up kubeconfig...
Starting cluster components...
Kubectl is now configured to use the cluster.
Services list:
$ kubectl get services
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP 10.0.0.1 <none> 443/TCP 1d
ZooKeeper and Cassandra are running, but Kafka is crashing with status "CrashLoopBackOff":
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
zookeeper-775db4cd8-lpl95 1/1 Running 0 1h
cassandra-d84d697b8-p5wcs 1/1 Running 0 1h
kafka-6d889c567-w5n4s 0/1 CrashLoopBackOff 25 1h
View logs:
kubectl logs kafka-6d889c567-w5n4s -p
Output:
waiting for kafka to be ready
...
INFO Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
INFO Waiting for keeper state SyncConnected (org.I0Itec.zkclient.ZkClient)
WARN Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)
INFO Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
WARN Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
...
INFO Terminate ZkClient event thread. (org.I0Itec.zkclient.ZkEventThread)
INFO Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
INFO Session: 0x0 closed (org.apache.zookeeper.ZooKeeper)
INFO EventThread shut down for session: 0x0 (org.apache.zookeeper.ClientCnxn)
FATAL Fatal error during KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
org.I0Itec.zkclient.exception.ZkTimeoutException: Unable to connect to zookeeper server '' with timeout of 6000 ms
...
INFO shutting down (kafka.server.KafkaServer)
INFO shut down completed (kafka.server.KafkaServer)
FATAL Exiting Kafka. (kafka.server.KafkaServerStartable)
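The log above already hints at where the broker is looking for ZooKeeper. Here is a small sketch that pulls that information out of the log text; the function name `zookeeper_targets` is made up for illustration, and the two patterns only cover the message formats visible in this log.

```python
import re

# Patterns for the two message formats seen in the log above.
SOCKET_RE = re.compile(r"Opening socket connection to server (\S+?)/(\S+?):(\d+)")
TIMEOUT_RE = re.compile(r"Unable to connect to zookeeper server '([^']*)'")


def zookeeper_targets(log_text):
    """Return (addresses the client tried, connect strings it reported)."""
    tried = {(host, ip, int(port)) for host, ip, port in SOCKET_RE.findall(log_text)}
    configured = TIMEOUT_RE.findall(log_text)
    return tried, configured
```

For this log it reports localhost/127.0.0.1:2181 as the only address tried and an empty connect string. Inside the Kafka pod, localhost is the pod itself, not the ZooKeeper pod, so the connection is refused. That suggests the broker was started without a ZooKeeper connect string (for the wurstmeister/kafka image this is presumably the KAFKA_ZOOKEEPER_CONNECT environment variable, but that is an assumption, since the deployment itself is not shown).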
Can anyone help me solve the problem of the container restarting?
kubectl describe pod kafka-6d889c567-w5n4s
Describe output:
Name: kafka-6d889c567-w5n4s
Namespace: default
Node: minikube/192.168.99.100
Start Time: Thu, 23 Nov 2017 17:03:20 +0300
Labels: pod-template-hash=284457123
run=kafka
Annotations: kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicaSet","namespace":"default","name":"kafka-6d889c567","uid":"0fa94c8d-d057-11e7-ad48-080027a5dfed","a...
Status: Running
IP: 172.17.0.5
Created By: ReplicaSet/kafka-6d889c567
Controlled By: ReplicaSet/kafka-6d889c567
Info about Containers:
Containers:
kafka:
Container ID: docker://7ed3de8ef2e3e665ba693186f5125c6802283e1fabca8f3c85eb584f8de19526
Image: wurstmeister/kafka
Image ID: docker-pullable://wurstmeister/kafka@sha256:2aa183fd201d693e24d4d5d483b081fc2c62c198a7acb8484838328c83542c96
Port: <none>
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Mon, 27 Nov 2017 09:43:39 +0300
Finished: Mon, 27 Nov 2017 09:43:49 +0300
Ready: False
Restart Count: 1003
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-bnz99 (ro)
Info about Conditions:
Conditions:
Type Status
Initialized True
Ready False
PodScheduled True
Info about volumes:
Volumes:
default-token-bnz99:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-bnz99
Optional: false
QoS Class: BestEffort
Info about events:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Pulling 38m (x699 over 2d) kubelet, minikube pulling image "wurstmeister/kafka"
Warning BackOff 18m (x16075 over 2d) kubelet, minikube Back-off restarting failed container
Warning FailedSync 3m (x16140 over 2d) kubelet, minikube Error syncing pod
Overview
kube-dns can't start (SetupNetworkError) after kubeadm init and network setup:
Error syncing pod, skipping: failed to "SetupNetwork" for
"kube-dns-654381707-w4mpg_kube-system" with SetupNetworkError:
"Failed to setup network for pod
\"kube-dns-654381707-w4mpg_kube-system(8ffe3172-a739-11e6-871f-000c2912631c)\"
using network plugins \"cni\": open /run/flannel/subnet.env:
no such file or directory; Skipping pod"
Kubernetes version
Client Version: version.Info{Major:"1", Minor:"4", GitVersion:"v1.4.4", GitCommit:"3b417cc4ccd1b8f38ff9ec96bb50a81ca0ea9d56", GitTreeState:"clean", BuildDate:"2016-10-21T02:48:38Z", GoVersion:"go1.6.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"4", GitVersion:"v1.4.4", GitCommit:"3b417cc4ccd1b8f38ff9ec96bb50a81ca0ea9d56", GitTreeState:"clean", BuildDate:"2016-10-21T02:42:39Z", GoVersion:"go1.6.3", Compiler:"gc", Platform:"linux/amd64"}
Environment
VMWare Fusion for Mac
OS
NAME="Ubuntu"
VERSION="16.04.1 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.1 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial
Kernel (e.g. uname -a)
Linux ubuntu-master 4.4.0-47-generic #68-Ubuntu SMP Wed Oct 26 19:39:52 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
What is the problem
kube-system kube-dns-654381707-w4mpg 0/3 ContainerCreating 0 2m
FirstSeen LastSeen Count From SubobjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
3m 3m 1 {default-scheduler } Normal Scheduled Successfully assigned kube-dns-654381707-w4mpg to ubuntu-master
2m 1s 177 {kubelet ubuntu-master} Warning FailedSync Error syncing pod, skipping: failed to "SetupNetwork" for "kube-dns-654381707-w4mpg_kube-system" with SetupNetworkError: "Failed to setup network for pod \"kube-dns-654381707-w4mpg_kube-system(8ffe3172-a739-11e6-871f-000c2912631c)\" using network plugins \"cni\": open /run/flannel/subnet.env: no such file or directory; Skipping pod"
What I expected to happen
kube-dns Running
How to reproduce it
root@ubuntu-master:~# kubeadm init
Running pre-flight checks
<master/tokens> generated token: "247a8e.b7c8c1a7685bf204"
<master/pki> generated Certificate Authority key and certificate:
Issuer: CN=kubernetes | Subject: CN=kubernetes | CA: true
Not before: 2016-11-10 11:40:21 +0000 UTC Not After: 2026-11-08 11:40:21 +0000 UTC
Public: /etc/kubernetes/pki/ca-pub.pem
Private: /etc/kubernetes/pki/ca-key.pem
Cert: /etc/kubernetes/pki/ca.pem
<master/pki> generated API Server key and certificate:
Issuer: CN=kubernetes | Subject: CN=kube-apiserver | CA: false
Not before: 2016-11-10 11:40:21 +0000 UTC Not After: 2017-11-10 11:40:21 +0000 UTC
Alternate Names: [172.20.10.4 10.96.0.1 kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local]
Public: /etc/kubernetes/pki/apiserver-pub.pem
Private: /etc/kubernetes/pki/apiserver-key.pem
Cert: /etc/kubernetes/pki/apiserver.pem
<master/pki> generated Service Account Signing keys:
Public: /etc/kubernetes/pki/sa-pub.pem
Private: /etc/kubernetes/pki/sa-key.pem
<master/pki> created keys and certificates in "/etc/kubernetes/pki"
<util/kubeconfig> created "/etc/kubernetes/kubelet.conf"
<util/kubeconfig> created "/etc/kubernetes/admin.conf"
<master/apiclient> created API client configuration
<master/apiclient> created API client, waiting for the control plane to become ready
<master/apiclient> all control plane components are healthy after 14.053453 seconds
<master/apiclient> waiting for at least one node to register and become ready
<master/apiclient> first node is ready after 0.508561 seconds
<master/apiclient> attempting a test deployment
<master/apiclient> test deployment succeeded
<master/discovery> created essential addon: kube-discovery, waiting for it to become ready
<master/discovery> kube-discovery is ready after 1.503838 seconds
<master/addons> created essential addon: kube-proxy
<master/addons> created essential addon: kube-dns
Kubernetes master initialised successfully!
You can now join any number of machines by running the following on each node:
kubeadm join --token=247a8e.b7c8c1a7685bf204 172.20.10.4
root@ubuntu-master:~#
root@ubuntu-master:~# kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system dummy-2088944543-eo1ua 1/1 Running 0 47s
kube-system etcd-ubuntu-master 1/1 Running 3 51s
kube-system kube-apiserver-ubuntu-master 1/1 Running 0 49s
kube-system kube-controller-manager-ubuntu-master 1/1 Running 3 51s
kube-system kube-discovery-1150918428-qmu0b 1/1 Running 0 46s
kube-system kube-dns-654381707-mv47d 0/3 ContainerCreating 0 44s
kube-system kube-proxy-k0k9q 1/1 Running 0 44s
kube-system kube-scheduler-ubuntu-master 1/1 Running 3 51s
root@ubuntu-master:~#
root@ubuntu-master:~# kubectl apply -f https://git.io/weave-kube
daemonset "weave-net" created
root@ubuntu-master:~#
root@ubuntu-master:~#
root@ubuntu-master:~# kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system dummy-2088944543-eo1ua 1/1 Running 0 47s
kube-system etcd-ubuntu-master 1/1 Running 3 51s
kube-system kube-apiserver-ubuntu-master 1/1 Running 0 49s
kube-system kube-controller-manager-ubuntu-master 1/1 Running 3 51s
kube-system kube-discovery-1150918428-qmu0b 1/1 Running 0 46s
kube-system kube-dns-654381707-mv47d 0/3 ContainerCreating 0 44s
kube-system kube-proxy-k0k9q 1/1 Running 0 44s
kube-system kube-scheduler-ubuntu-master 1/1 Running 3 51s
kube-system weave-net-ja736 2/2 Running 0 1h
It looks like you have configured flannel before running kubeadm init. You can try to fix this by removing flannel (it may be sufficient to remove config file rm -f /etc/cni/net.d/*flannel*), but it's best to start fresh.
Open the file below (create it if it does not exist) and paste in the following data:
vim /run/flannel/subnet.env
FLANNEL_NETWORK=10.240.0.0/16
FLANNEL_SUBNET=10.240.0.1/24
FLANNEL_MTU=1450
FLANNEL_IPMASQ=true
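Normally flannel writes this file itself once it is running, so creating it by hand is a workaround. If you do write it manually, it is worth checking that the node subnet really lies inside the flannel network. A minimal sketch, assuming the simple KEY=VALUE format shown above (the function names are hypothetical):

```python
import ipaddress


def parse_subnet_env(text):
    """Parse KEY=VALUE lines as found in /run/flannel/subnet.env."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        env[key] = value
    return env


def subnet_inside_network(env):
    """Return True if FLANNEL_SUBNET lies within FLANNEL_NETWORK."""
    network = ipaddress.ip_network(env["FLANNEL_NETWORK"])
    # FLANNEL_SUBNET is written as a gateway address with a prefix length,
    # e.g. 10.240.0.1/24, so take the network part of the interface.
    subnet = ipaddress.ip_interface(env["FLANNEL_SUBNET"]).network
    return subnet.subnet_of(network)
```

The values in this answer pass the check. A file that mixes, say, a 10.244.0.0/16 network with a 10.240.0.1/24 subnet would not.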
I am trying to set up a 3-node etcd cluster on Ubuntu machines as the Docker data store for networking. I successfully created an etcd cluster using the etcd Docker image. Now, when I try to replicate it, the steps fail on one node. Even after removing the failing node from the setup, the cluster is still looking for the removed node. I get the same error when I use the etcd binary.
I used the following command on all nodes, changing the IP accordingly:
docker run -d -v /usr/share/ca-certificates/:/etc/ssl/certs -p 4001:4001 -p 2380:2380 -p 2379:2379 \
--name etcd quay.io/coreos/etcd \
-name etcd0 \
-advertise-client-urls http://172.27.59.141:2379,http://172.27.59.141:4001 \
-listen-client-urls http://0.0.0.0:2379,http://0.0.0.0:4001 \
-initial-advertise-peer-urls http://172.27.59.141:2380 \
-listen-peer-urls http://0.0.0.0:2380 \
-initial-cluster-token etcd-cluster-1 \
-initial-cluster etcd0=http://172.27.59.141:2380,etcd1=http://172.27.59.244:2380,etcd2=http://172.27.59.232:2380 \
-initial-cluster-state new
Two of the nodes connect properly, but the service on the third node stops. The following is the log of the third node:
2016-06-16 17:16:34.293248 I | etcdmain: etcd Version: 2.3.6
2016-06-16 17:16:34.294368 I | etcdmain: Git SHA: 128344c
2016-06-16 17:16:34.294584 I | etcdmain: Go Version: go1.6.2
2016-06-16 17:16:34.294781 I | etcdmain: Go OS/Arch: linux/amd64
2016-06-16 17:16:34.294962 I | etcdmain: setting maximum number of CPUs to 2, total number of available CPUs is 2
2016-06-16 17:16:34.295142 W | etcdmain: no data-dir provided, using default data-dir ./node2.etcd
2016-06-16 17:16:34.295438 I | etcdmain: listening for peers on http://0.0.0.0:2380
2016-06-16 17:16:34.295654 I | etcdmain: listening for client requests on http://0.0.0.0:2379
2016-06-16 17:16:34.295846 I | etcdmain: listening for client requests on http://0.0.0.0:4001
2016-06-16 17:16:34.296193 I | etcdmain: stopping listening for client requests on http://0.0.0.0:4001
2016-06-16 17:16:34.301139 I | etcdmain: stopping listening for client requests on http://0.0.0.0:2379
2016-06-16 17:16:34.301454 I | etcdmain: stopping listening for peers on http://0.0.0.0:2380
2016-06-16 17:16:34.301718 I | etcdmain: --initial-cluster must include node2=http://172.27.59.232:2380 given --initial-advertise-peer-urls=http://172.27.59.232:2380
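The last log line describes a consistency rule: the member's own name, paired with each of its advertised peer URLs, has to appear in --initial-cluster. The sketch below reproduces that check offline. The function names are hypothetical, and the inputs are taken from the posted command and the log line; the exact command run on the failing node is not shown.

```python
def parse_initial_cluster(value):
    """Turn 'name=url,name=url' into {name: {url, ...}}."""
    members = {}
    for entry in value.split(","):
        name, url = entry.split("=", 1)
        members.setdefault(name, set()).add(url)
    return members


def missing_peer_urls(name, advertise_peer_urls, initial_cluster):
    """Return the advertised peer URLs not listed under `name`."""
    listed = parse_initial_cluster(initial_cluster).get(name, set())
    return [url for url in advertise_peer_urls.split(",") if url not in listed]
```

With the --initial-cluster value from the posted command (members etcd0, etcd1, etcd2) and a member named node2, the peer URL http://172.27.59.232:2380 comes back as missing, which matches the error. The same URL under the name etcd2 passes.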
Even after removing the failing node, I can see that the two remaining nodes are waiting for the third node to connect:
2016-06-16 17:16:12.063893 N | etcdserver: added member 17879927ec74147b [http://172.27.59.232:238] to cluster ba4424e006edb53e
2016-06-16 17:16:12.064431 N | etcdserver: added local member 24d9feabb7e2f26f [http://172.27.59.244:2380] to cluster ba4424e006edb53e
2016-06-16 17:16:12.065229 N | etcdserver: added member 2bda70be57138cfe [http://172.27.59.141:2380] to cluster ba4424e006edb53e
2016-06-16 17:16:12.218560 I | raft: 24d9feabb7e2f26f [term: 1] received a MsgVote message with higher term from 2bda70be57138cfe [term: 29]
2016-06-16 17:16:12.218964 I | raft: 24d9feabb7e2f26f became follower at term 29
2016-06-16 17:16:12.219276 I | raft: 24d9feabb7e2f26f [logterm: 1, index: 3, vote: 0] voted for 2bda70be57138cfe [logterm: 1, index: 3] at term 29
2016-06-16 17:16:12.222667 I | raft: raft.node: 24d9feabb7e2f26f elected leader 2bda70be57138cfe at term 29
2016-06-16 17:16:12.335904 I | etcdserver: published {Name:node1 ClientURLs:[http://172.27.59.244:2379 http://172.27.59.244:4001]} to cluster ba4424e006edb53e
2016-06-16 17:16:12.336459 N | etcdserver: set the initial cluster version to 2.2
2016-06-16 17:16:42.059177 W | rafthttp: the connection to peer 17879927ec74147b is unhealthy
2016-06-16 17:17:12.060313 W | rafthttp: the connection to peer 17879927ec74147b is unhealthy
2016-06-16 17:17:42.060986 W | rafthttp: the connection to peer 17879927ec74147b is unhealthy
As can be seen, despite starting the cluster with two nodes, it is still searching for the third node.
Is there a location on local disk where data is being saved, so that it picks up old data even though none was provided?
Please suggest what I am missing.
Is there a location on local disk where data is being saved, so that it picks up old data even though none was provided?
Yes, the membership data is already stored in node0.etcd and node1.etcd.
You can get the following message from the log which indicates that the server already belongs to a cluster:
etcdmain: the server is already initialized as member before, starting as etcd member...
In order to run a new cluster with two members, just add another argument to your command :
--data-dir bak
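Before reusing a data directory, you can check whether it already holds state from an earlier run. A minimal sketch, assuming the layout shown in etcd's own startup logs, where the member directory is `<data-dir>/member`; the helper name is hypothetical.

```python
import os


def has_member_state(data_dir):
    """Return True if data_dir already contains an etcd member directory."""
    return os.path.isdir(os.path.join(data_dir, "member"))
```

If this returns True for node0.etcd or node1.etcd, the member will start from the stored membership and ignore the new --initial-cluster flags, so either point --data-dir at a fresh location, as suggested above, or move the old directory away first.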
I just followed this tutorial step by step for setting up a docker swarm in EC2 -- https://docs.docker.com/swarm/install-manual/
I created 4 Amazon Servers using the Amazon Linux AMI.
manager + consul
manager
node1
node2
I followed the instructions to start the swarm and everything seems to go ok regarding making the docker instances.
Server 1
Running docker ps gives:
The Consul logs show this
2016/07/05 20:18:47 [INFO] serf: EventMemberJoin: 729a440e5d0d 172.17.0.2
2016/07/05 20:18:47 [INFO] serf: EventMemberJoin: 729a440e5d0d.dc1 172.17.0.2
2016/07/05 20:18:48 [INFO] raft: Node at 172.17.0.2:8300 [Follower] entering Follower state
2016/07/05 20:18:48 [INFO] consul: adding server 729a440e5d0d (Addr: 172.17.0.2:8300) (DC: dc1)
2016/07/05 20:18:48 [INFO] consul: adding server 729a440e5d0d.dc1 (Addr: 172.17.0.2:8300) (DC: dc1)
2016/07/05 20:18:48 [ERR] agent: failed to sync remote state: No cluster leader
2016/07/05 20:18:49 [WARN] raft: Heartbeat timeout reached, starting election
2016/07/05 20:18:49 [INFO] raft: Node at 172.17.0.2:8300 [Candidate] entering Candidate state
2016/07/05 20:18:49 [INFO] raft: Election won. Tally: 1
2016/07/05 20:18:49 [INFO] raft: Node at 172.17.0.2:8300 [Leader] entering Leader state
2016/07/05 20:18:49 [INFO] consul: cluster leadership acquired
2016/07/05 20:18:49 [INFO] consul: New leader elected: 729a440e5d0d
2016/07/05 20:18:49 [INFO] raft: Disabling EnableSingleNode (bootstrap)
2016/07/05 20:18:49 [INFO] consul: member '729a440e5d0d' joined, marking health alive
2016/07/05 20:18:50 [INFO] agent: Synced service 'consul'
I registered each node using the following command with the appropriate IPs:
docker run -d swarm join --advertise=x-x-x-x:2375 consul://x-x-x-x:8500
Each of those created a docker instance
Node1
Running docker ps gives:
With logs that suggest there's a problem:
time="2016-07-05T21:33:50Z" level=info msg="Registering on the discovery service every 1m0s..." addr="172.31.17.35:2375" discovery="consul://172.31.3.233:8500"
time="2016-07-05T21:36:20Z" level=error msg="cannot set or renew session for ttl, unable to operate on sessions"
time="2016-07-05T21:37:20Z" level=info msg="Registering on the discovery service every 1m0s..." addr="172.31.17.35:2375" discovery="consul://172.31.3.233:8500"
time="2016-07-05T21:39:50Z" level=error msg="cannot set or renew session for ttl, unable to operate on sessions"
time="2016-07-05T21:40:50Z" level=info msg="Registering on the discovery service every 1m0s..." addr="172.31.17.35:2375" discovery="consul://172.31.3.233:8500"
...
And lastly when I get to the last step of trying to get host information like so on my Consul machine,
docker -H :4000 info
I see no nodes. Lastly when I try the step of running an app, I get the obvious error:
[ec2-user@ip-172-31-3-233 ~]$ docker -H :4000 run hello-world
docker: Error response from daemon: No healthy node available in the cluster.
See 'docker run --help'.
[ec2-user@ip-172-31-3-233 ~]$
Thanks for any insight on this. I'm still pretty confused by much of the swarm model and not sure where to go from here to diagnose.
It looks like Consul is either not binding to a public IP address, or is not accessible on the public IP due to security group or VPC settings. You are setting the discovery URL to consul://172.31.3.233:8500 on the Docker nodes, so I would suggest trying to connect to that address from an external IP, either in your browser or via curl like this:
% curl http://172.31.3.233:8500/ui/dist/
HTML
If you cannot connect (connection refused or timeout) then add a TCP port 8500 ingress rule to your AWS VMs, and try again.
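If curl is not handy on the node, the same reachability test can be done with a plain TCP connect. This is only a sketch; `can_connect` is a hypothetical helper, and it checks that the port accepts connections, not that Consul answers correctly.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False


# Example, run from one of the swarm nodes:
#   can_connect("172.31.3.233", 8500)   # Consul HTTP API
#   can_connect("172.31.17.35", 2375)   # Docker Engine TCP port on a node
```

A False result for port 8500 points at the security group or at Consul's bind address; a False result for port 2375 means the Docker Engine on that node is not listening on TCP.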
After investigating your issue, I see that you forgot to open port 2375 for Docker Engine on all four nodes.
Before starting the Swarm manager or a Swarm node, you have to open a TCP port for Docker Engine so that Swarm can talk to Docker Engine through that port.
With Docker on Ubuntu 14.04, you can open the port by editing /etc/default/docker and adding -H tcp://0.0.0.0:2375 to DOCKER_OPTS. For example:
DOCKER_OPTS="-H tcp://0.0.0.0:2375 -H unix:///var/run/docker.sock"
After that, restart Docker Engine:
service docker restart
If you are using CentOS, the solution is the same; you can read my blog article https://sonnguyen.ws/install-docker-docker-swarm-centos7/
One more thing: I think you should install and run Consul on all nodes (4 servers), so that Swarm can work with Consul on each node.
I want to run etcd in a Docker container with this command:
docker run -p 2379:2379 -p 4001:4001 --name etcd -v /usr/share/ca-certificates/:/etc/ssl/certs quay.io/coreos/etcd:v2.3.0-alpha.1
and it seems that everything is OK:
2016-02-23 12:22:27.815591 I | etcdmain: etcd Version: 2.3.0-alpha.0+git
2016-02-23 12:22:27.815631 I | etcdmain: Git SHA: 40d3e0d
2016-02-23 12:22:27.815635 I | etcdmain: Go Version: go1.5.3
2016-02-23 12:22:27.815638 I | etcdmain: Go OS/Arch: linux/amd64
2016-02-23 12:22:27.815659 I | etcdmain: setting maximum number of CPUs to 2, total number of available CPUs is 2
2016-02-23 12:22:27.815663 W | etcdmain: no data-dir provided, using default data-dir ./default.etcd
2016-02-23 12:22:27.815896 I | etcdmain: listening for peers on http://localhost:2380
2016-02-23 12:22:27.815973 I | etcdmain: listening for peers on http://localhost:7001
2016-02-23 12:22:27.816030 I | etcdmain: listening for client requests on http://localhost:2379
2016-02-23 12:22:27.816091 I | etcdmain: listening for client requests on http://localhost:4001
2016-02-23 12:22:27.816370 I | etcdserver: name = default
2016-02-23 12:22:27.816383 I | etcdserver: data dir = default.etcd
2016-02-23 12:22:27.816387 I | etcdserver: member dir = default.etcd/member
2016-02-23 12:22:27.816390 I | etcdserver: heartbeat = 100ms
2016-02-23 12:22:27.816392 I | etcdserver: election = 1000ms
2016-02-23 12:22:27.816395 I | etcdserver: snapshot count = 10000
2016-02-23 12:22:27.816404 I | etcdserver: advertise client URLs = http://localhost:2379,http://localhost:4001
2016-02-23 12:22:27.816408 I | etcdserver: initial advertise peer URLs = http://localhost:2380,http://localhost:7001
2016-02-23 12:22:27.816415 I | etcdserver: initial cluster = default=http://localhost:2380,default=http://localhost:7001
2016-02-23 12:22:27.821522 I | etcdserver: starting member ce2a822cea30bfca in cluster 7e27652122e8b2ae
2016-02-23 12:22:27.821566 I | raft: ce2a822cea30bfca became follower at term 0
2016-02-23 12:22:27.821579 I | raft: newRaft ce2a822cea30bfca [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
2016-02-23 12:22:27.821583 I | raft: ce2a822cea30bfca became follower at term 1
2016-02-23 12:22:27.821739 I | etcdserver: starting server... [version: 2.3.0-alpha.0+git, cluster version: to_be_decided]
2016-02-23 12:22:27.822619 N | etcdserver: added local member ce2a822cea30bfca [http://localhost:2380 http://localhost:7001] to cluster 7e27652122e8b2ae
2016-02-23 12:22:28.221880 I | raft: ce2a822cea30bfca is starting a new election at term 1
2016-02-23 12:22:28.222304 I | raft: ce2a822cea30bfca became candidate at term 2
2016-02-23 12:22:28.222545 I | raft: ce2a822cea30bfca received vote from ce2a822cea30bfca at term 2
2016-02-23 12:22:28.222885 I | raft: ce2a822cea30bfca became leader at term 2
2016-02-23 12:22:28.223075 I | raft: raft.node: ce2a822cea30bfca elected leader ce2a822cea30bfca at term 2
2016-02-23 12:22:28.223529 I | etcdserver: setting up the initial cluster version to 2.3
2016-02-23 12:22:28.227050 N | etcdserver: set the initial cluster version to 2.3
2016-02-23 12:22:28.227351 I | etcdserver: published {Name:default ClientURLs:[http://localhost:2379 http://localhost:4001]} to cluster 7e27652122e8b2ae
But when I try to set a key (from same etcd node machine):
curl -L http://localhost:2379/v2/keys/mykey -XPUT -d value="this is awesome"
I get:
The requested URL could not be retrieved
Do I need to configure something more? Docker container is running ok:
docker ps
dba35d3b61c3 quay.io/coreos/etcd:v2.3.0-alpha.1 "/etcd" 2 seconds ago Up 1 seconds 0.0.0.0:2379->2379/tcp, 2380/tcp, 0.0.0.0:4001->4001/tcp, 7001/tcp etcd
You should configure etcd to listen on 0.0.0.0; otherwise it listens on 127.0.0.1, which is not accessible from outside the Docker container:
docker run \
-p 2379:2379 \
-p 4001:4001 \
--name etcd \
-v /usr/share/ca-certificates/:/etc/ssl/certs \
quay.io/coreos/etcd:v2.3.0-alpha.1 \
-listen-client-urls http://0.0.0.0:2379,http://0.0.0.0:4001
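To spot this kind of problem from the startup log, you can check whether a listen URL is bound to loopback only. A small sketch with a hypothetical helper name; it only inspects the URL string and does not probe the container.

```python
import ipaddress
from urllib.parse import urlsplit


def is_loopback_only(listen_url):
    """Return True if the URL's host is localhost or a loopback IP address."""
    host = urlsplit(listen_url).hostname
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Some other hostname: cannot tell without resolving it.
        return False
```

The URLs in the question's log (http://localhost:2379, http://localhost:4001) are loopback-only, so the published ports lead nowhere; http://0.0.0.0:2379 is not, which is why the command above works.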