I'm trying to launch a multi-region cluster using the Ec2MultiRegionSnitch.
The nodes in one DC can communicate, but when I add nodes from another DC they fail with the following error:
ERROR [main] 2016-05-09 10:57:01,88 CassandraDaemon.java:581 Exception encountered during startup
java.lang.RuntimeException: Unable to gossip with any seeds
I have installed DSE on Ubuntu 14.04 and have 4 nodes running in a cluster in Frankfurt (2 on subnet a and 2 on subnet b).
The problem arises when I try to add more nodes from Ireland.
I have opened the following ports in the security group:
80
8984
7199
61620
7000 - 7001
61620 - 61621
8983
7077
443
4040
8888
22
7080 - 7081
7080
9160
9042
Then I made the following settings in the cassandra.yaml file
listen_address: local ip
rpc_address: local ip
seeds: "public ip seed 1, public ip seed 2"
endpoint_snitch: Ec2MultiRegionSnitch
broadcast_address: public ip
What more do I need to set up for them to communicate?
I ended up going with Cassandra community version 3.2 and using the GossipingPropertyFileSnitch instead, and then it worked.
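For anyone who stays on Ec2MultiRegionSnitch, here is a minimal sketch of a sanity check for the address settings listed in the question. "Unable to gossip with any seeds" generally means the node could not reach any seed on the storage port (7000, or 7001 with SSL), so this only rules out one possible cause; it does not test the security group itself. The function names and the example addresses are hypothetical, and the config is passed as a plain dict rather than parsed from cassandra.yaml.

```python
import ipaddress

# RFC 1918 ranges; EC2 private addresses fall inside these.
RFC1918 = [
    ipaddress.ip_network(n)
    for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def is_rfc1918(addr: str) -> bool:
    """True if addr is inside one of the RFC 1918 private ranges."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in RFC1918)


def check_multiregion_settings(cfg: dict) -> list:
    """Return warnings for a dict that mirrors the cassandra.yaml keys above."""
    warnings = []
    if not is_rfc1918(cfg["listen_address"]):
        warnings.append("listen_address is expected to be the private IP")
    if is_rfc1918(cfg["broadcast_address"]):
        warnings.append("broadcast_address is expected to be the public IP")
    for seed in cfg["seeds"].split(","):
        seed = seed.strip()
        if is_rfc1918(seed):
            warnings.append(f"seed {seed} is a private IP; other regions cannot route to it")
    return warnings


# Hypothetical values: 203.0.113.x stands in for the nodes' public IPs.
example = {
    "listen_address": "10.0.1.5",
    "rpc_address": "10.0.1.5",
    "broadcast_address": "203.0.113.10",
    "seeds": "203.0.113.10, 203.0.113.20",
}
print(check_multiregion_settings(example))
```

If this returns no warnings, the next thing to look at is whether port 7000 (or 7001) is actually open between the public IPs of the two regions.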
Problem:
I am setting up a multi-datacenter Cassandra cluster using Docker Swarm. The Docker Swarm in each local DC runs Cassandra instances in 1/1+2 mode. There is a password connection between datacenters.
The seed nodes are DC-1:Node1 and DC-2:Node1 in the 1+1 geo setup, and DC-1:Node1, DC-1:Node3, DC-2:Node1, DC-2:Node3 in the 1+2 geo cluster setup.
To discover nodes and construct the topology between the DCs over the Cassandra storage port, Cassandra always seems to expect a bridge or host network. It does not work with an overlay network and port forwarding (that approach works within a single DC on the local network, but not across geo sites).
It expects either a host or a bridge network; otherwise it throws the exception shown below:
DEBUG [MessagingService-Outgoing-site-cassandra-A/15.29.8.10-Gossip] 2020-12-04 09:49:11,325 OutboundTcpConnection.java:546 - Unable to connect to site-cassandra-B/15.29.8.10
java.net.ConnectException: Connection refused
at sun.nio.ch.Net.connect0(Native Method) ~[na:1.8.0_262]
at sun.nio.ch.Net.connect(Net.java:454) ~[na:1.8.0_262]
at sun.nio.ch.Net.connect(Net.java:446) ~[na:1.8.0_262]
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:645) ~[na:1.8.0_262]
at org.apache.cassandra.net.OutboundTcpConnectionPool.newSocket(OutboundTcpConnectionPool.java:146) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.net.OutboundTcpConnectionPool.newSocket(OutboundTcpConnectionPool.java:132) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.net.OutboundTcpConnection.connect(OutboundTcpConnection.java:434) [apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:262) [apache-cassandra-3.11.4.jar:3.11.4]
Workaround:
After we set up a bridge network between the geo sites, the nodes are able to discover each other, and nodetool status shows both DCs with the proper Cassandra instances and their configured replication percentages.
I would like to know why Cassandra requires a bridge or host network, and why it does not work with an overlay network and port forwarding.
Thanks
Suresh Perumal
I am starting a WebLogic 12.2.1.4 admin server in Docker from my docker-compose.yml file.
I use a different port mapping, not the default 7001.
My Docker port mapping is 7101:7001.
Everything works fine except for one thing: I constantly get the following exception when I click on the Deployment menu in the web console:
<Feb 12, 2021 5:11:21,002 PM UTC> <Notice> <JMX> <BEA-149535> <JMX Resiliency Activity Server=All Servers : Resolving connection list DomainRuntimeServiceMBean>
javax.ws.rs.ProcessingException: java.net.ConnectException: Tried all: '1' addresses, but could not connect over HTTP to server: 'localhost', port: '7101'
failed reasons:
[0] address:'localhost/127.0.0.1',port:'7101' : java.net.ConnectException: Connection refused
The WL admin server tries to use the public docker port 7101 in the container but actually, WL is listening on the default 7001 port inside the container. Port 7101 is only used from the host machine, and of course, WL is not listening on port 7101 in the container.
My workaround is the following:
Check the IP address of the admin-server container with docker inspect <container-name>
Open the WL console using the container private IP address, e.g.: http://172.19.0.2:7001/console
In this case, the exception does not appear
But if I open the WL console from http://localhost:7101/console which is the mapped port to the host machine by docker, then the exception appears
Maybe this is a WL user interface issue? But I am not sure.
Any idea why this is happening?
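A minimal sketch of the port asymmetry described above, assuming the console makes the server call itself at the same host and port the browser used. The mapping 7101:7001 is from the question; the function names are hypothetical, and only the simple HOST:CONTAINER form of a mapping is handled.

```python
def parse_port_mapping(mapping: str) -> dict:
    """Split a compose-style 'HOST:CONTAINER' port mapping into its two sides."""
    host_port, container_port = mapping.split(":")
    return {"host": int(host_port), "container": int(container_port)}


def reachable_port(mapping: str, caller_is_inside_container: bool) -> int:
    """Pick the port a client has to use, depending on where the client runs."""
    ports = parse_port_mapping(mapping)
    return ports["container"] if caller_is_inside_container else ports["host"]


# The browser on the host machine has to use the published port...
print(reachable_port("7101:7001", caller_is_inside_container=False))  # 7101
# ...but a process inside the container only finds the server on the internal port.
print(reachable_port("7101:7001", caller_is_inside_container=True))   # 7001
```

This matches the workaround in the question: with http://172.19.0.2:7001/console the browser and the server agree on port 7001, so the server's call to itself succeeds.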
I am trying to configure an on-prem K8s cluster; the servers run Fedora CoreOS and have multiple NICs.
I am configuring the cluster to use a non-default NIC: a bond defined with 2 interfaces. All servers can reach each other over that interface and have HTTP and HTTPS connectivity to the internet.
kubeadm join hangs at:
I0513 13:24:55.516837 16428 token.go:215] [discovery] Failed to request cluster-info, will try again: Get https://${BOND_IP}:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
The relevant kubeadm init config looks like this:
[...]
localAPIEndpoint:
  advertiseAddress: ${BOND_IP}
  bindPort: 6443
nodeRegistration:
  kubeletExtraArgs:
    volume-plugin-dir: "/opt/libexec/kubernetes/kubelet-plugins/volume/exec/"
    runtime-cgroups: "/systemd/system.slice"
    kubelet-cgroups: "/systemd/system.slice"
    node-ip: ${BOND_IP}
  criSocket: /var/run/dockershim.sock
  name: master
  taints:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
[...]
The join config that I am using looks like this:
apiVersion: kubeadm.k8s.io/v1beta2
kind: JoinConfiguration
discovery:
  bootstrapToken:
    token: ${TOKEN}
    caCertHashes:
    - "${SHA}"
    apiServerEndpoint: "${BOND_IP}:6443"
nodeRegistration:
  kubeletExtraArgs:
    volume-plugin-dir: "/opt/libexec/kubernetes/kubelet-plugins/volume/exec/"
    runtime-cgroups: "/systemd/system.slice"
    kubelet-cgroups: "/systemd/system.slice"
If I configure the cluster using the default eth0, it works without issues.
This is not a connectivity issue. The port test works fine:
# nc -s ${BOND_IP_OF_NODE} -zv ${BOND_IP_OF_MASTER} 6443
Ncat: Version 7.80 ( https://nmap.org/ncat )
Ncat: Connected to ${BOND_IP_OF_MASTER}:6443.
Ncat: 0 bytes sent, 0 bytes received in 0.01 seconds.
I suspect this happens because kubelet is listening on eth0. If so, can I change it to use a different NIC/IP?
LE: The eth0 connection has been cut off completely (cable out, interface down, connection down).
Now, when we init, if we choose address 0.0.0.0 for the kube-api, it defaults to the bond, which is what we wanted initially:
kind: InitConfiguration
localAPIEndpoint:
  advertiseAddress: 0.0.0.0
Result:
[certs] apiserver serving cert is signed for DNS names [emp-prod-nl-hilv-quortex19 kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local] and IPs [10.0.0.1 ${BOND_IP}]
I have even added port 6443 to iptables as ACCEPT and it still times out. All my Calico pods are up and running (as are all pods in the kube-system namespace).
LLE:
I have tested calico and weavenet and both show the same issue. The api-server is up and can be reached from the master using curl but it times out from the nodes.
LLLE:
On the premise that the kube-api is nothing but an HTTPS server, I tried two things from the node that cannot reach it during kubeadm join:
Ran a Python 3 simple HTTP server on port 6443 and WAS ABLE TO CONNECT from the node
Ran an nginx pod, exposed it on another port as a NodePort, and WAS ABLE TO CONNECT from the node
The node just can't reach the api-server on 6443, or on any other port for that matter.
What am I doing wrong?
The cause:
The interface used was a bond of type ACTIVE-ACTIVE. As a result, kubeadm apparently tried the other of the 2 bonded interfaces, which was not in the same subnet as the advertised server IP.
Switching to ACTIVE-PASSIVE did the trick, and I was able to join the nodes.
LE: If anyone knows why kubeadm join does not support LACP with ACTIVE-ACTIVE bond setups on Fedora CoreOS, please advise here. Otherwise, if additional configuration is required, I would very much like to know what I have missed.
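For anyone debugging a similar case: the nc test above only proves that a TCP connection can be opened, while the kubeadm error is a timeout waiting for a response on that connection. The sketch below separates the two stages by binding to a chosen source IP (like nc -s) and then attempting a TLS handshake. It is a diagnostic sketch only, not what kubeadm does internally; the function name is hypothetical, certificate checks are deliberately disabled, and it says nothing about the bonding mode itself.

```python
import socket
import ssl


def probe_api_server(source_ip: str, host: str, port: int = 6443, timeout: float = 5.0) -> str:
    """Report how far a connection from source_ip gets: 'tcp-failed', 'tls-failed' or 'tls-ok'."""
    try:
        # Bind the local end to source_ip, like `nc -s`; port 0 lets the OS pick a port.
        sock = socket.create_connection(
            (host, port), timeout=timeout, source_address=(source_ip, 0)
        )
    except OSError:
        return "tcp-failed"

    # Certificate checks are disabled: this only tests whether a handshake completes.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with ctx.wrap_socket(sock, server_hostname=host):
            return "tls-ok"
    except OSError:
        # Covers handshake errors as well as a handshake that times out.
        return "tls-failed"
    finally:
        sock.close()
```

A result of "tls-failed" with a working nc test would suggest that packets after the initial connect are being lost or sent out through a different path, which is consistent with the cause described above.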
So I'm currently running the stable/consul chart from helm on my local Kubernetes cluster (running on Docker).
$ helm install -n wet-fish --namespace consul stable/consul
This creates two services
==> v1/Service
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
wet-fish-consul ClusterIP None <none> 8500/TCP,8400/TCP,8301/TCP,8301/UDP,8302/TCP,8302/UDP,8300/TCP,8600/TCP,8600/UDP 0s
wet-fish-consul-ui NodePort 10.110.229.223 <none> 8500:30276/TCP
So this means I can open localhost:30276 and see the Consul UI.
Now I'm running on my local machine
$ consul agent -dev -config-dir=./consul.d -node=machine
$ consul join 127.0.0.1:30276
This just results in:
Error joining address '127.0.0.1:30276': Unexpected response code: 500 (1 error occurred:
* Failed to join 127.0.0.1: received invalid msgType (72), expected pushPullMsg (6) from=127.0.0.1:30276
)
Failed to join any nodes.
and
2020/01/17 15:17:35 [WARN] agent: (LAN) couldn't join: 0 Err: 1 error occurred:
* Failed to join 127.0.0.1: received invalid msgType (72), expected pushPullMsg (6) from=127.0.0.1:30276
2020/01/17 15:17:35 [ERR] http: Request PUT /v1/agent/join/127.0.0.1:30276, error: 1 error occurred:
* Failed to join 127.0.0.1: received invalid msgType (72), expected pushPullMsg (6) from=127.0.0.1:30276
from=127.0.0.1:59693
There must be a way to have a local consul agent running that can connect to the k8s consul server...
This is on a Mac, so networking isn't as good....
There may be two problems here. The first is that consul agent -dev starts the agent in dev mode, and dev mode by default starts both a server and an agent. This might be part of the reason behind the error.
The other problem could be localhost: the server running in Kubernetes will attempt to health check local agents. It needs to be able to ping the local agent, so even if you manage to join in the first step, the agent would probably fail health checks.
I agree that networking on Mac does not make things easy. One thing you will probably have to do is set the advertise address for the local (non-Kubernetes) agent. Docker for Mac provides a host name, docker.for.mac.localhost, which resolves to an IP that is routable from a container to the local machine. If you set the advertise address to that IP when starting the local agent, the Consul server in Kubernetes should be able to route to the locally running agent.
Potential fix:
1. Ensure the local agent starts in client mode (configure it manually, not with -dev)
2. Set the advertise address to an IP address that is routable from Kubernetes (docker.for.mac.localhost)
Give me a shout if that does not work for you. I have used a setup like this myself; 9 times out of 10 the problem is networking between Docker and the local machine.
Kind regards,
Nic
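One more observation on the error text itself, offered as a likely reading rather than a confirmed diagnosis: the "invalid msgType (72)" value is the ASCII code for the letter H. In the service listing, NodePort 30276 maps to 8500, which is the HTTP API/UI port, whereas consul join speaks the gossip protocol that the listing shows on 8301. So the join may simply have received the start of an HTTP response. The status line below is illustrative only.

```python
def describe_first_byte(msg_type: int) -> str:
    """Show the ASCII character behind a numeric message type."""
    return chr(msg_type)


# Illustrative only: the exact status line the server sent is an assumption.
HTTP_STATUS_LINE = b"HTTP/1.1 400 Bad Request"

print(describe_first_byte(72))  # H
print(HTTP_STATUS_LINE[0])      # 72
```

If that reading is right, the join address would need to reach port 8301 of the Consul server, which the NodePort service in the listing does not expose.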
I have 2 VMs.
On the first I run:
docker swarm join-token manager
On the second I run the command that this outputs, i.e.
docker swarm join --token SWMTKN-1-0wyjx6pp0go18oz9c62cda7d3v5fvrwwb444o33x56kxhzjda8-9uxcepj9pbhggtecds324a06u 192.168.65.3:2377
However, this outputs:
Error response from daemon: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 192.168.65.3:2377: connect: connection refused"
Any idea what's going wrong?
If it helps I'm spinning up these VMs using Vagrant.
Just open the port in the firewall on the master side:
firewall-cmd --add-port=2377/tcp --permanent
firewall-cmd --reload
Then try docker swarm join again on the second VM (the node side).
I was facing a similar issue and spent a couple of hours figuring out the root cause, so I am sharing it for those who may have similar issues.
Environment:
Oracle Cloud + AWS EC2 (2 +2)
OS: 20.04.2-Ubuntu
Docker version : 20.10.8
3 dynamic public IP+ 1 elastic IP
Issues
1. Create two instances on Oracle Cloud at the beginning.
2. Instance A (manager): docker swarm init --advertise-addr succeeds.
3. Instance B (worker): joining the swarm as a worker succeeds.
4. When I try to promote B to manager, I encounter this error:
Unable to connect to remote host: No route to host
5. Mesh routing is not working properly.
Investigation
1. I suspected it was related to the network, firewall, security group, or security list.
2. SSH to the B server (worker) and telnet to the manager on port 2377; same error:
Unable to connect to remote host: No route to host
3. Log in to the Oracle console and add ingress rules under the security list for all of the relevant ports:
TCP port 2377 for cluster management communications
TCP and UDP port 7946 for communication among nodes
UDP port 4789 for overlay network traffic
4. Try again; telnet still fails with the same error.
5. Check the OS-level firewall and disable it if it is enabled:
ufw disable
6. Try again; still the same result.
7. I suspected something was wrong with Oracle Cloud, so I decided to install the same OS and Docker versions on AWS.
8. Add a security group that allows all of the relevant ports/protocols, and disable ufw.
9. Test with AWS instances C (leader/master) and D (worker). It works, and D can also be promoted to manager. Mesh routing also works.
10. This confirms that the issue is with the Oracle Cloud instances.
11. Try to join Oracle instance A to C as a worker. It works, but A still cannot be promoted to manager.
12. Use journalctl -f to inspect the log; it confirms socket timeouts from A/B (Oracle instances) to the AWS instance (C).
13. Look at A/B again and find that iptables rules are blocking the requests.
14. Remove all of the rules in iptables:
# remove the rules
iptables -P INPUT ACCEPT
iptables -P OUTPUT ACCEPT
iptables -P FORWARD ACCEPT
iptables -F
Root Cause
It was caused by a firewall, either at the cloud level (security list/WAF/ACL) or at the OS level (firewall rules, e.g. ufw/iptables).
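Since several different errors appear in this thread (connection refused, no route to host, socket timeout), here is a small sketch that tells them apart for a single TCP port. The function name is hypothetical and the interpretations in the comments are common causes, not certainties.

```python
import errno
import socket


def classify_connect(host: str, port: int, timeout: float = 3.0) -> str:
    """Try one TCP connection and name the way it succeeded or failed."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # Often means packets are silently dropped (cloud security list, DROP rule).
        return "timeout"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening, or a rule rejects with a reset.
        return "refused"
    except OSError as exc:
        if exc.errno in (errno.EHOSTUNREACH, errno.ENETUNREACH):
            # "No route to host": a real routing gap, or a host firewall
            # rejecting the packet with an ICMP unreachable/prohibited reply.
            return "unreachable"
        return "other"
```

Running something like classify_connect("<manager ip>", 2377) from the worker before and after each firewall change shows which layer was actually responsible. Note that this only covers TCP; the UDP ports 7946 and 4789 cannot be checked this way.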
I had already run firewall-cmd --add-port=2377/tcp --permanent and firewall-cmd --reload on the master side and was still getting the same error.
I ran telnet <master ip> 2377 on the worker node and then rebooted the master.
After that it worked fine.
It looks like your Docker swarm manager leader is not listening on port 2377. You can check this by running the following command on your swarm manager leader VM. If the swarm is working fine, you will get output similar to this:
[root#host1]# docker node ls
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS
tilzootjbg7n92n4mnof0orf0 * host1 Ready Active Leader
Furthermore, you can check the listening ports on the leader swarm manager node. It should have TCP port 2377 open for cluster management communications and TCP/UDP port 7946 open for communication among nodes.
[root#host1]# netstat -ntulp | grep dockerd
tcp6 0 0 :::2377 :::* LISTEN 2286/dockerd
tcp6 0 0 :::7946 :::* LISTEN 2286/dockerd
udp6 0 0 :::7946 :::* 2286/dockerd
On the second VM, where you are configuring the second swarm manager, make sure you have connectivity to port 2377 of the leader swarm manager. You can use tools like telnet, wget, or nc to test the connectivity, as shown below:
[root#host2]# telnet <swarm manager leader ip> 2377
Trying 192.168.44.200...
Connected to 192.168.44.200.
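If you want to script the listening-port check, here is a minimal sketch that parses netstat -ntulp output in the format shown above and reports which of the expected swarm ports are missing. The function name and the REQUIRED set are my own; the sample text is copied from the output above.

```python
def listening_ports(netstat_output: str, process: str = "dockerd") -> set:
    """Collect (protocol, port) pairs owned by `process` from `netstat -ntulp` output."""
    found = set()
    for line in netstat_output.splitlines():
        fields = line.split()
        # The last column is PID/program name; skip headers and other processes.
        if len(fields) < 4 or not fields[-1].endswith("/" + process):
            continue
        proto = fields[0].rstrip("6")            # fold tcp6/udp6 into tcp/udp
        port = int(fields[3].rsplit(":", 1)[1])  # local address is the 4th column
        found.add((proto, port))
    return found


SAMPLE = """\
tcp6 0 0 :::2377 :::* LISTEN 2286/dockerd
tcp6 0 0 :::7946 :::* LISTEN 2286/dockerd
udp6 0 0 :::7946 :::* 2286/dockerd
"""

# Ports named in this answer; UDP 4789 is used by the kernel for overlay
# traffic, so it is not expected to appear under dockerd here.
REQUIRED = {("tcp", 2377), ("tcp", 7946), ("udp", 7946)}

missing = REQUIRED - listening_ports(SAMPLE)
print(sorted(missing))  # []
```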
In my case I was using Linux and Windows. My Windows Docker private network had the same subnet as my local network, so the Docker daemon could not find the master at the address I gave it, because it looked for that address in its own network.
So I did the following:
1- go to the Docker Desktop app
2- go to Settings
3- go to Resources
4- go to the Network section and change the Docker subnet address (it needs to be different from your local subnet address).
5- Then apply and restart.
6- run docker swarm join on the worker again.
Note: All these steps are performed on the node where the error appears. Make sure that ports 2377, 7946 and 4789 are open on the master (you can use iptables or ufw).
Hope it works for you.
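A quick way to check for the subnet clash described above, as a sketch using Python's standard ipaddress module. The function names and the subnet values are hypothetical; substitute the Docker subnet from Docker Desktop's settings and your own LAN subnet.

```python
import ipaddress


def subnets_conflict(docker_subnet: str, lan_subnet: str) -> bool:
    """True if the two CIDR ranges share any addresses."""
    return ipaddress.ip_network(docker_subnet).overlaps(ipaddress.ip_network(lan_subnet))


def address_in_subnet(address: str, subnet: str) -> bool:
    """True if a single address (e.g. the manager IP) falls inside the subnet."""
    return ipaddress.ip_address(address) in ipaddress.ip_network(subnet)


# Hypothetical values for illustration.
print(subnets_conflict("192.168.65.0/24", "192.168.65.0/24"))  # True
print(subnets_conflict("10.200.0.0/24", "192.168.65.0/24"))    # False
print(address_in_subnet("192.168.65.3", "192.168.65.0/24"))    # True
```

If the manager's address falls inside Docker's own subnet on the joining node, the daemon will look for it locally, which is the situation this answer works around.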