Local Consul join K8s Consul Mac - docker

So I'm currently running the stable/consul Helm chart on my local Kubernetes cluster (running on Docker).
$ helm install -n wet-fish --namespace consul stable/consul
This creates two services
==> v1/Service
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
wet-fish-consul ClusterIP None <none> 8500/TCP,8400/TCP,8301/TCP,8301/UDP,8302/TCP,8302/UDP,8300/TCP,8600/TCP,8600/UDP 0s
wet-fish-consul-ui NodePort 10.110.229.223 <none> 8500:30276/TCP
So this means I can open localhost:30276 and see the Consul UI.
Now I'm running the following on my local machine:
$ consul agent -dev -config-dir=./consul.d -node=machine
$ consul join 127.0.0.1:30276
This just results in:
Error joining address '127.0.0.1:30276': Unexpected response code: 500 (1 error occurred:
* Failed to join 127.0.0.1: received invalid msgType (72), expected pushPullMsg (6) from=127.0.0.1:30276
)
Failed to join any nodes.
and
2020/01/17 15:17:35 [WARN] agent: (LAN) couldn't join: 0 Err: 1 error occurred:
* Failed to join 127.0.0.1: received invalid msgType (72), expected pushPullMsg (6) from=127.0.0.1:30276
2020/01/17 15:17:35 [ERR] http: Request PUT /v1/agent/join/127.0.0.1:30276, error: 1 error occurred:
* Failed to join 127.0.0.1: received invalid msgType (72), expected pushPullMsg (6) from=127.0.0.1:30276
from=127.0.0.1:59693
There must be a way to have a local Consul agent running that can connect to the k8s Consul server...
This is on a Mac, so the networking isn't as straightforward...

There may be two problems here. The first is that consul agent -dev starts the agent in dev mode, which by default runs both a server and an agent. This might be part of the reason behind the error.
The other problem could be due to localhost: the server running in Kubernetes will attempt to health check local agents, so it needs to be able to reach the local agent. Even if you manage to join in the first step, it would probably fail health checks.
I agree that networking on a Mac does not make things easy. One thing you will probably have to do is set the advertise address for the local (non-kube) agent. Docker for Mac has a host name, docker.for.mac.localhost, which is an IP routable to the local machine from a container. If you set the local agent's advertise address to the IP value of that host, the Kubernetes Consul server should be able to route to the locally running agent.
Potential fix:
1. Ensure the local agent is starting in client mode (configure it manually, not with -dev)
2. Set the advertise address to an IP address which is routable from Kubernetes (the IP behind docker.for.mac.localhost); see the sketch after this list
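A minimal sketch of what starting the local (non-kube) agent might look like under those two points; the IP shown is a placeholder for whatever docker.for.mac.localhost resolves to from inside a container on your machine:
$ consul agent -config-dir=./consul.d -node=machine \
    -advertise=192.168.65.2    # placeholder: the container-routable IP of the Mac host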
Give me a shout if that does not work for you. I have used a setup like this myself, and nine times out of ten it is the networking between Docker and the local machine.
Kind regards,
Nic

Related

Docker Swarm Network Cassandra Datacenter Setup expects host network always

Problem:
Setting up a multi-datacenter cluster using Docker Swarm. The Docker Swarm at each local DC is running Cassandra instances in 1/1+2 mode. There is a password-protected connection between the datacenters.
The seed nodes are DC-1:Node1 and DC-2:Node1 in (1+1) geo mode, and DC-1:Node1, DC-1:Node3, DC-2:Node1, DC-2:Node3 in 1+2 mode of the geo cluster ...
To discover the nodes and construct the topology between DC nodes, the Cassandra storage port always expects a bridge or host network. It does not work with an OVERLAY network using a port-forwarding approach (it works within the same DC on the local network, but not across geo sites).
It expects either a host or bridge network; otherwise it throws an exception as shown below:
DEBUG [MessagingService-Outgoing-site-cassandra-A/15.29.8.10-Gossip] 2020-12-04 09:49:11,325 OutboundTcpConnection.java:546 - Unable to connect to site-cassandra-B/15.29.8.10
java.net.ConnectException: Connection refused
at sun.nio.ch.Net.connect0(Native Method) ~[na:1.8.0_262]
at sun.nio.ch.Net.connect(Net.java:454) ~[na:1.8.0_262]
at sun.nio.ch.Net.connect(Net.java:446) ~[na:1.8.0_262]
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:645) ~[na:1.8.0_262]
at org.apache.cassandra.net.OutboundTcpConnectionPool.newSocket(OutboundTcpConnectionPool.java:146) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.net.OutboundTcpConnectionPool.newSocket(OutboundTcpConnectionPool.java:132) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.net.OutboundTcpConnection.connect(OutboundTcpConnection.java:434) [apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:262) [apache-cassandra-3.11.4.jar:3.11.4]
Workaround:
After we set up a bridge network between the geo-sites, the nodes are able to discover each other, and nodetool status shows two DCs with the expected Cassandra instances and their configured replication percentages.
I would like to know why Cassandra forces the use of a bridge- or host-based network, and why an overlay network with port forwarding does not work.
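For reference, the settings that control how each node advertises itself for inter-DC gossip live in cassandra.yaml and look roughly like this (a sketch with placeholder addresses, not the actual configuration above); the connection refused error is what you see when the advertised broadcast address is not directly reachable on the storage port:
# cassandra.yaml (sketch, placeholder values)
listen_address: 10.0.1.5           # address the node binds to for gossip/storage traffic
broadcast_address: 15.29.8.10      # address advertised to peers in other DCs
storage_port: 7000                 # inter-node port that must be reachable across sites
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "15.29.8.10,15.29.9.10"   # e.g. DC-1:Node1 and DC-2:Node1 (placeholders)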
Thanks
Suresh Perumal

kubeadm join times out on non-default NIC/IP

I am trying to configure a K8s cluster on-prem and the servers are running Fedora CoreOS using multiple NICs.
I am configuring the cluster to use a non-default NIC: a bond defined over 2 interfaces. All servers can reach each other over that interface and have HTTP + HTTPS connectivity to the internet.
kubeadm join hangs at:
I0513 13:24:55.516837 16428 token.go:215] [discovery] Failed to request cluster-info, will try again: Get https://${BOND_IP}:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
The relevant kubeadm init config looks like this:
[...]
localAPIEndpoint:
  advertiseAddress: ${BOND_IP}
  bindPort: 6443
nodeRegistration:
  kubeletExtraArgs:
    volume-plugin-dir: "/opt/libexec/kubernetes/kubelet-plugins/volume/exec/"
    runtime-cgroups: "/systemd/system.slice"
    kubelet-cgroups: "/systemd/system.slice"
    node-ip: ${BOND_IP}
  criSocket: /var/run/dockershim.sock
  name: master
  taints:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
[...]
The join config that I am using looks like this:
apiVersion: kubeadm.k8s.io/v1beta2
kind: JoinConfiguration
discovery:
  bootstrapToken:
    token: ${TOKEN}
    caCertHashes:
    - "${SHA}"
    apiServerEndpoint: "${BOND_IP}:6443"
nodeRegistration:
  kubeletExtraArgs:
    volume-plugin-dir: "/opt/libexec/kubernetes/kubelet-plugins/volume/exec/"
    runtime-cgroups: "/systemd/system.slice"
    kubelet-cgroups: "/systemd/system.slice"
If I configure it using the default eth0, it works without issues.
This is not a connectivity issue. The port test works fine:
# nc -s ${BOND_IP_OF_NODE} -zv ${BOND_IP_OF_MASTER} 6443
Ncat: Version 7.80 ( https://nmap.org/ncat )
Ncat: Connected to ${BOND_IP_OF_MASTER}:6443.
Ncat: 0 bytes sent, 0 bytes received in 0.01 seconds.
I suspect this happens because the kubelet is listening on eth0; if so, can I change it to use a different NIC/IP?
LE: The eth0 connection has been cut off completely (cable out, interface down, connection down).
Now, when we init, if we set the kube-api advertise address to 0.0.0.0, it defaults to the bond, which is what we wanted initially:
kind: InitConfiguration
localAPIEndpoint:
  advertiseAddress: 0.0.0.0
result:
[certs] apiserver serving cert is signed for DNS names [emp-prod-nl-hilv-quortex19 kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local] and IPs [10.0.0.1 ${BOND_IP}]
I have even added an ACCEPT rule for port 6443 in iptables and it still times out. All my Calico pods are up and running (all pods in the kube-system namespace, for that matter).
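For clarity, an accept rule of that kind looks something like this (a sketch; the exact chain and placement in my setup may differ):
# allow inbound connections to the API server port
$ iptables -A INPUT -p tcp --dport 6443 -j ACCEPT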
LLE:
I have tested Calico and Weave Net and both show the same issue. The api-server is up and can be reached from the master using curl, but it times out from the nodes.
LLLE:
On the premise that the kube-api is nothing but an HTTPS server, I have tried two options from the node that cannot reach it when doing the kubeadm join:
1. Ran a python3 simple HTTP server on 6443 and WAS ABLE TO CONNECT from the node
2. Ran an nginx pod, exposed it over another port as a NodePort, and WAS ABLE TO CONNECT from the node
The node just can't reach the api-server on 6443, or any other port for that matter ....
What am I doing wrong...
The cause:
The interface used was a BOND of type ACTIVE-ACTIVE. This made kubeadm try the other interface of the two that were bonded, which apparently was not in the same subnet as the IP of the advertised server...
Switching to ACTIVE-PASSIVE did the trick and I was able to join the nodes.
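For anyone hitting the same thing, the bond mode change that worked looks roughly like this (a sketch assuming NetworkManager with a bond connection named bond0; the connection name and options are placeholders):
# switch the bond to active-backup (the ACTIVE-PASSIVE mode referred to above)
$ nmcli connection modify bond0 bond.options "mode=active-backup,miimon=100"
$ nmcli connection up bond0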
LE: If anyone knows why kubeadm join does not support LACP with ACTIVE-ACTIVE bond setups on Fedora CoreOS, please advise here. Otherwise, if additional configuration is required, I would very much like to know what I have missed.

Docker Swarm running inside VMware Workstation 15 Player doesn't accept worker connection

I am running a Docker Swarm manager in a VMware Workstation 15 Player VM with NAT (VM: Ubuntu 19.10, Host: Windows 10). I ran docker swarm init --advertise-addr 223.181.240.48:2377 on my manager VM. Then I copied the token and used it on my other VM, which is running on another node and another network, also with NAT. It returns the following error:
Error response from daemon: Timeout was reached before node joined.
The attemp to join the swarm will continue in the background. Use the
"docker info" command to see the current swarm status of your node.
Then I tried googling the error and learned that the problem may be caused by the firewall, and that I might have to unblock the port. Also, since I am using NAT, I have to either use bridged networking or port forwarding. First, I tried the bridge (in the VM settings I changed the network to Bridged), but when I searched "my ip", the results were the same on both the host machine and the VM (223.181.240.48). So I tried port forwarding with NAT: I went to C:/ProgramData/VMware/vmnetnat.conf and added the following lines
[incomingtcp]
2377:192.168.172.2:2377
192.168.172.2 is my VM's network gateway address. Then I ran the docker swarm init command again and copied the join command to my other VM. Now I got the following error:
Error response from daemon: rpc error: code =Unavailable desc = all
SubConns are in TransientFailure, latest connection error: connection
error: desc = "transport: Error while dialing dial tcp
233.181.240.48:2377: connect: connection refused"
Then I tried sudo ufw allow 2377/tcp to unblock the port in the VM and retried the whole procedure again. Now I am receiving the timeout error again. Did I miss something in the middle, or do something wrong? And what is the difference between the IP I get through a "my ip" Google search and the IPv4 address I see in the wired connection settings (DHCP on)?
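For reference, Swarm needs more than just the management port reachable between nodes; a sketch of the usual ufw rules (the same ports would also need NAT forwards like the vmnetnat.conf entry above):
$ sudo ufw allow 2377/tcp   # cluster management / join traffic
$ sudo ufw allow 7946/tcp   # node-to-node gossip
$ sudo ufw allow 7946/udp
$ sudo ufw allow 4789/udp   # overlay network (VXLAN) data traffic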

Docker swarm mode load balancing not working as described

Update
I believe the culprit is the master, which does not appear to be listening on port 7946. netstat shows that 7946 is listening on the nodes, but not on the master. When I check the syslogs for the nodes I see the following error
level=error msg="Failed to join memberlist [10.0.0.12] on retry: 1 error(s) occurred:\n\n* Failed to join 10.0.0.12: dial tcp 10.0.0.12:7946: getsockopt: connection refused"
Original Post
I am running a three node Swarm Mode cluster in AWS: one master and two workers. This is swarm mode, not to be confused with the standalone Docker Swarm from before 1.12.
I created all of the machines with docker-machine. Each machine is running Ubuntu 15.10 with Docker 1.12.3.
Linux swarm-master-01 4.2.0-42-generic #49-Ubuntu SMP Tue Jun 28 21:26:26 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
Using the master node I have created a service with the following
docker service create --replicas 1 --name myapp -p 3000 myapp
When I run docker service ps myapp I get the following output
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR
02awst8p9pezgpkfzqgz8z79t myapp.1 myapp:latest swarm-node-01 Running Running 19 minutes ago
The running task is deployed to swarm-node-01.
I checked the auto-selected port which was published publicly
$ docker service inspect myapp | jq .[].Endpoint.Ports[].PublishedPort
30000
According to the documentation:
External components, such as cloud load balancers, can access the service on the PublishedPort of any node in the cluster whether or not the node is currently running the task for the service. All nodes in the swarm route ingress connections to a running task instance.
But when I try to curl the nodes that do not have the task running, I get connection refused.
$ curl $(docker-machine ip swarm-node-01):30000/stats
{"uptime":"2016-11-09T14:48:35Z","requestCount":7,"statuses":{"200":7},"pid":1,"open_db_conns":0}
$ curl $(docker-machine ip swarm-node-02):30000/stats
curl: (7) Failed to connect to [the IP] port 30000: Connection refused
note: I scrubbed the IP of node-02
My Troubleshooting:
1. The nodes are both properly connected to the swarm.
2. Scaling the service up to 5 replicas (which puts a task on every node) makes curl work on every node, but only because each node is then running the task locally (see the command below).
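The scale step referred to above is just (a sketch, using this service name):
$ docker service scale myapp=5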
UPDATE 1
I initialized the swarm with
docker swarm init --advertise-addr 10.0.0.12:2377 --listen-addr 10.0.0.12:2377
I checked the syslogs from the nodes and I'm seeing the following errors
level=error msg="Failed to join memberlist [10.0.0.12] on retry: 1 error(s) occurred:\n\n* Failed to join 10.0.0.12: dial tcp 10.0.0.12:7946: getsockopt: connection refused"
I checked to see if the ingress port was listening and it doesn't seem to be
ubuntu@swarm-master-01:~$ sudo lsof -i :7946
ubuntu@swarm-master-01:~$ cat < /dev/tcp/10.0.0.12/7946
-bash: connect: Connection refused
-bash: /dev/tcp/10.0.0.12/7946: Connection refused
ubuntu@swarm-master-01:~$ cat < /dev/tcp/0.0.0.0/7946
-bash: connect: Connection refused
-bash: /dev/tcp/0.0.0.0/7946: Connection refused
I was able to get around the issue for now, but I don't know what initially caused it. The overlay network port (7946) wasn't listening on swarm-master-01; I figured this out with netstat -nlt. I then searched the syslogs and found these errors related to that port:
Nov 8 20:28:20 ubuntu docker[23092]: time="2016-11-08T20:28:20.171385360Z" level=warning msg="2016/11/08 20:28:20 [ERR] memberlist: Failed TCP fallback ping: read tcp 10.0.0.85:54016->10.0.0.13:7946: i/o timeout"
Nov 9 18:26:17 swarm-node-01 docker[714]: time="2016-11-09T18:26:17.573441271Z" level=warning msg="2016/11/09 18:26:17 [ERR] memberlist: Failed to send indirect ping: write udp [::]:7946->10.0.0.38:7946: use of closed network connection"
For some reason docker refused to open this port and listen on it any more. Here is what I did (albeit undesirable) to circumvent the issue (see the command sketch after this list):
1. Created another node with docker-machine called swarm-master-02
2. Joined swarm-master-02 to the cluster as a master
3. Demoted master-01, which set master-02 as the leader
4. Restarted the docker daemon on each node (might not have been necessary)
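In command form, that rotation looks roughly like this (a sketch; the docker-machine driver, token, and address are placeholders, and the manager join token comes from running docker swarm join-token manager on the current leader):
$ docker-machine create --driver amazonec2 swarm-master-02
$ docker swarm join-token manager                             # on the current leader, prints the manager join command
$ docker swarm join --token <manager-token> 10.0.0.12:2377    # on swarm-master-02
$ docker node demote swarm-master-01                          # from the new manager
$ sudo systemctl restart docker                               # on each node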
Now all of the machines are working as expected except for swarm-master-01. One task is running on swarm-node-01 and curl works against all nodes by forwarding the traffic to the proper container on the proper node. However, swarm-master-01 refuses to listen on the overlay network and curl does not work against this node. I was only able to fix swarm-master-01 by completely removing it from the cluster, restarting the docker daemon, and joining it again as a master. Now 7946 is listening on that machine.

glusterfs geo-replication - server with two interfaces - private IP advertised

I have been trying to set up geo-replication with GlusterFS servers. Everything worked as expected in my test environment and in my staging environment, but then I tried production and got stuck.
Let's say I have:
The GlusterFS server (master) on public IP 1.1.1.1.
The GlusterFS slave on public IP 2.2.2.2, but this IP is on interface eth1.
The eth0 on the GlusterFS slave server is 192.168.0.1.
So when I run the command on 1.1.1.1 (firewall and SSH keys are set properly)
gluster volume geo-replication vol0 2.2.2.2::vol0 create push-pem
I get an error.
Unable to fetch slave volume details. Please check the slave cluster and slave volume.
geo-replication command failed
The error itself is not that important in this case; the problem is the slave IP address
2015-03-16T11:41:08.101229+00:00 xxx kernel: TCP LOGDROP: IN= OUT=eth0 SRC=1.1.1.1 DST=192.168.0.1 LEN=52 TOS=0x00 PREC=0x00 TTL=64 ID=24243 DF PROTO=TCP SPT=1015 DPT=24007 WINDOW=14600 RES=0x00 SYN URGP=0
As you can see in the firewall drop log above, port 24007 of the slave gluster daemon is being advertised on the private IP of interface eth0 on the slave server, when it should be the eth1 IP. So the master cannot connect and will time out.
Is there a way to force the gluster server to advertise interface eth1, or to bind to it only?
I use cfengine and ansible to push configuration, so binding to an interface would be a better solution than binding to an IP, but whatever solution will do.
Thank you in advance.
I've encountered this issue but in a different context.
I was trying to geo-replicate two nodes which were both behind a NAT (AWS instances in different regions).
When the master connects to the slave via the public IP to check for volume compatibility/size and other details, it retrieves the hostname of the slave, which usually resolves to something that only has meaning in that remote region.
Then it uses that hostname to dial back to the slave when later setting up the session, which fails, as that hostname resolves to a private IP in a different region.
My workaround for the issue was to use hostnames when creating the volumes, probing for peers, and establishing geo-replication, and then to add an /etc/hosts entry mapping the slave's hostname to its public IP rather than its private IP.
This gets you to the point where you can establish a session, but I haven't had any luck actually getting it to sync, as it uses the wrong IP somewhere along the way again.
Edit:
I've actually managed to get it running by adding /etc/hosts hacks on both sides.
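As an illustration only (the hostnames are placeholders and the IPs are just the example addresses from the question), the /etc/hosts hack on each side amounts to pinning the other node's hostname to its public address:
# /etc/hosts on the master
2.2.2.2    slave-node        # slave's hostname -> its public (eth1) IP
# /etc/hosts on the slave
1.1.1.1    master-node       # master's hostname -> its public IP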
GlusterFS has no notion of the network layer. Check your routes. If the next-hop for your geo-replication slave is on eth1, then gluster will open a port on that interface for the slave IP address.
Also make sure your firewall is configured to forward geo-replication traffic on this port.
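If it helps, a quick way to check which interface and source address the host will use to reach the other side is (using the example master IP from above):
$ ip route get 1.1.1.1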