Ejabberd clustering not working - erlang

I followed this guide for ejabberd clustering: http://chad.ill.ac/post/35967173942/easy-ejabberd-clustering-guide-mnesia-mysql
Everything looks fine: both the database and the web admin show two nodes running, a master and a slave. But if I shut down either the master or the slave node, the other node does not continue processing. What should I do so that if one node goes down, the other one keeps working?

Mnesia behaves as a multi-master database, but if you bring nodes down, you must restart them in the reverse order. If you have node1 and node2, and you kill node1 and then kill node2, you should restart node2 first and then node1. That is because Mnesia assumes the node updated most recently is the last one that was stopped.
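As a concrete sketch (hypothetical hostnames node1/node2, assuming ejabberd manages the Mnesia nodes and was stopped on node1 first and node2 last), bring the cluster back in reverse order:

```shell
# node2 was stopped last, so it holds the freshest data: start it first.
ssh node2 'ejabberdctl start'
ssh node2 'ejabberdctl status'   # wait until it reports that the node is started
# Only then start the node that was stopped earlier.
ssh node1 'ejabberdctl start'
```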

Related

How to handle when leader node goes down in docker swarm

I have two Docker nodes running in swarm mode as shown below. I promoted the second node to work as a manager.
imb9cmobjk0fp7s6h5zoivfmo * Node1 Ready Active Leader 19.03.11-ol
a9gsb12wqw436zujakdpbqu5p Node2 Ready Active Reachable 19.03.11-ol
This works fine when the leader node goes to drain/pause. But as part of my test I stopped the Node1 instance, and then I got the error below when I tried to list the nodes (docker node ls) and the running services (docker service ls) on the second node.
Error response from daemon: rpc error: code = Unknown desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online
Also, the containers that were running on Node1 before I stopped the instance do not come up on Node2; only the existing processes keep running. My expectation is that after stopping the Node1 instance, the processes that were running on Node1 would move to Node2. This works fine when a node goes to drain status.
The Raft consensus algorithm fails when it can't find a clear majority.
This means you should never run with 2 manager nodes: one node going down leaves the other with 50%, which is not a majority, so quorum cannot be reached.
More generally, avoid even numbers, especially when splitting managers between availability zones, as a zone split can leave you with a 50/50 partition: again no majority, no quorum, and a dead swarm.
So the valid numbers of swarm managers to try are generally 1, 3, 5, and 7. Going higher than 7 generally reduces performance and doesn't help availability.
1 should only be used if you are running a 1- or 2-node swarm, and in these cases the loss of the manager node equates to the loss of the swarm anyway.
3 managers is really the minimum you should aim for. If you only have 3 nodes, prefer to make all of them managers that also run workloads rather than running 1 manager and 2 workers.
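The arithmetic behind these manager counts is just Raft's majority rule, which can be sketched directly:

```shell
# A swarm of N managers needs floor(N/2)+1 of them online to keep quorum,
# so it tolerates floor((N-1)/2) manager failures.
for n in 1 2 3 5 7; do
  echo "$n managers: quorum=$(( n / 2 + 1 )), tolerated failures=$(( (n - 1) / 2 ))"
done
```

Note that 2 managers tolerate zero failures, no better than a single manager, which is why even counts buy you nothing.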

What happens if master node dies in kubernetes? How to resolve the issue?

I've started learning Kubernetes with Docker and I've been wondering what happens if the master node dies or fails. I've already read the answers here, but they don't cover the remedy for it.
Who is responsible to bring it back? And how to bring it back? Can there be a backup master node to avoid this? If yes, how?
Basically, I'm asking a recommended way to handle master failure in kubernetes setup.
You should have multiple VMs serving as master nodes to avoid a single point of failure. An odd number of master nodes, 3 or 5, is recommended for quorum. Put a load balancer in front of all the master VMs; it balances the traffic, and if one master node dies, the load balancer should mark that VM as unhealthy and stop sending traffic to it.
Also, the etcd cluster is the brain of a Kubernetes cluster, so you should have multiple VMs serving as etcd nodes. Those can be the same VMs as the master nodes or, for a reduced blast radius, separate VMs dedicated to etcd. Again, use an odd number of VMs, 3 or 5. Make sure to take periodic backups of the etcd data so that you can restore the cluster state to a previous state in case of a disaster.
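A minimal sketch of such a backup and restore, assuming the v3 etcdctl client and kubeadm's default certificate paths (adjust endpoints and paths to your cluster):

```shell
# Take a snapshot of etcd (endpoint and certificate paths are assumptions).
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Later, restore the snapshot into a fresh data directory on the new master.
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
  --data-dir=/var/lib/etcd-restored
```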
Check the official docs on how to install an HA Kubernetes cluster using kubeadm.
In short, for Kubernetes you should keep the master nodes functioning properly at all times. There are different methods to replicate the master node so that it stays available on failure. As an example, check this: https://kubernetes.io/docs/tasks/administer-cluster/highly-available-master/
Abhishek, you can run the master node in high availability; as a first step, set up the control plane (aka master node) behind a load balancer. If you have plans to upgrade a single control-plane kubeadm cluster to high availability, you should specify --control-plane-endpoint to set the shared endpoint for all control-plane nodes. Such an endpoint can be either a DNS name or the IP address of a load balancer.
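A sketch of that kubeadm step (the DNS name k8s-api.example.com is a placeholder for your load-balancer endpoint):

```shell
# Initialize the first control-plane node with a shared endpoint so that
# additional control-plane nodes can join behind the same address later.
kubeadm init --control-plane-endpoint "k8s-api.example.com:6443" --upload-certs

# Join an additional control-plane node; token, hash, and certificate key
# come from the kubeadm init output above.
# kubeadm join k8s-api.example.com:6443 --token <token> \
#   --discovery-token-ca-cert-hash sha256:<hash> \
#   --control-plane --certificate-key <key>
```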
By default, for security reasons, the master node does not host pods. If you want to enable scheduling pods on the master node, you can remove the taint with the following command (note the trailing dash, which removes the taint):
kubectl taint nodes --all node-role.kubernetes.io/master-
If you want to restore the master manually, make sure you back up the etcd data directory /var/lib/etcd. You can restore it on the new master and it should work. Read about highly available Kubernetes over here.

How to reschedule containers with swarm when the server dies for a moment

I run two servers using docker-compose and swarm.
When I stop server A, the container on server A is moved to server B.
But when server A starts up again, the container that moved to server B is not moved back to server A.
I want to know how to properly rebalance containers back onto a server after it has been down for a moment.
First, for your swarm to be able to re-create a task when a node goes down, you still need a majority of manager nodes available... so if this is only a two-node swarm, it won't work, because you need three managers for one to fail while another takes the leader role and re-schedules the failed replicas. (Just an FYI.)
I think what you're asking for is "re-balancing". When a node comes back online (or a new one is added), Swarm does nothing with services that are set to the default replicated mode. Swarm doesn't "move" containers, it destroys and re-creates containers, so it considers the service still healthy on Node B and won't move it back to Node A. It wouldn't want to disrupt your active/healthy services on Node B just because Node A came back online.
If Node B does fail, then Swarm would again re-schedule the task on the next best node.
If Node B has a lot of containers and the work is unbalanced (i.e. Node A is empty and Node B has 3 tasks running), then you can force a service update, which will destroy and re-create all replicas of that service and will try to spread them out by default, which may result in one of the tasks ending up back on Node A.
docker service update --force <servicename>

How to reconnect a crashed erlang mnesia node to cluster again?

I'm learning Erlang and Mnesia. I have a question: how do I reconnect a "crashed" Erlang Mnesia node to the cluster?
Erlang/OTP 17 [erts-6.2]
What I did:
Two mnesia nodes: m11@deb83-11 and m12@deb83-12. They were connected with each other well.
(m11@deb83-11)4> mnesia:system_info(running_db_nodes).
['m12@deb83-12','m11@deb83-11']
Then I terminated the erl shell of m12@deb83-12 with "Ctrl-G" and "q", without stopping mnesia.
After that, I restarted the erl shell for the m12@deb83-12 node with the same command line.
I found the restarted node m12@deb83-12 did not connect to m11@deb83-11.
(m11@deb83-11)16> mnesia:system_info(running_db_nodes).
['m11@deb83-11']
Note 1: if I stopped mnesia in step 2, m12@deb83-12 would reconnect to m11@deb83-11 successfully after step 3.
Note 2: I did not create any tables; there is only an empty schema in this cluster.
Thanks in advance!
Ming
Apparently all you need to do is connect to the other node (so that nodes(). returns the other node) and restart Mnesia with mnesia:stop(). followed by mnesia:start().
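A sketch of that sequence, run from the restarted node's shell (node names taken from the question):

```erlang
%% Re-establish the Erlang distribution connection to the running node.
net_adm:ping('m11@deb83-11').   %% should return pong; nodes(). now lists it
%% Restart Mnesia so it rejoins the running db nodes.
mnesia:stop().
mnesia:start().
%% Verify that both nodes show up again.
mnesia:system_info(running_db_nodes).
```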

ejabberd clustering, Slave doesn't work when master goes down

I have set up ejabberd clustering, one node as master and the other as slave, as described here.
I have copied .erlang.cookie and the database files from the master to the slave.
Everything was working fine.
The issue is when I stop the master node:
no requests get routed to the slave anymore, and
when I try to restart the slave node, it will not start once it is down.
I'm stuck here, please help me out.
Thanks
This is the standard behaviour of Mnesia. If the node you start was not the last one stopped in the cluster, it has no way to know whether it holds the latest, most up-to-date data.
The way to start a Mnesia cluster is to start the nodes in the reverse order in which they were shut down.
In case the node that was last seen in the Mnesia cluster cannot start or join the cluster, you need to use a Mnesia command to force a cluster "master", that is, to tell Mnesia that you consider this node to have the most up-to-date content. This is done with the Erlang call mnesia:set_master_nodes/1.
For example, from ejabberd Erlang command-line:
mnesia:set_master_nodes(['node1@myhost']).
In most cases, Mnesia clustering handles everything automatically: when a node goes down, the other nodes notice and keep working transparently. The only case where you need to designate which node holds the reference data (with set_master_nodes/1) is when this is ambiguous for Mnesia, that is, either when starting only nodes that were down while other nodes were still running, or when there is a netsplit.
Follow the steps from the link below:
http://chadillac.tumblr.com/post/35967173942/easy-ejabberd-clustering-guide-mnesia-mysql
and call the join_as_master(NodeName) function of the easy_cluster module.
