ejabberd clustering, Slave doesn't work when master goes down - erlang

I have set up ejabberd clustering with one master and one slave node, as described here.
I have copied .erlang.cookie and the database files from the master to the slave.
Everything is working fine.
The issue is when I stop the master node:
no requests get routed to the slave, and
when I try to restart the slave node, it does not start once it has been stopped.
I am stuck here, please help me out.
Thanks

This is the standard behaviour of Mnesia. If the node you start was not the last one that was stopped in the cluster, then it has no way to know whether it holds the latest, most up-to-date data.
The correct way to start a Mnesia cluster is to start the nodes in the reverse order of the one in which they were shut down.
In case the node that was last seen in the Mnesia cluster cannot start or join the cluster, then you need to use a Mnesia command to force the cluster "master", that is, to tell Mnesia that you consider this node to have the most up-to-date content. This is done with the Erlang function mnesia:set_master_nodes/1.
For example, from ejabberd Erlang command-line:
mnesia:set_master_nodes([node1@myhost]).
In most cases, Mnesia clustering handles everything automatically. When a node goes down, the other nodes are aware of it and transparently keep working. The only case where you need to tell Mnesia which node holds the reference data (with set_master_nodes/1) is when this is ambiguous for Mnesia, that is, when you start only nodes that were already down while other nodes were still running, or after a netsplit.
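For example, to check what the local Mnesia instance currently knows and then mark its copy as the reference, you can open an Erlang shell on the running node with ejabberdctl debug (or ejabberdctl live) and run something like the following. This is only a sketch; the exact node name depends on your setup:

mnesia:system_info(db_nodes).          % all nodes known to the schema
mnesia:system_info(running_db_nodes).  % nodes Mnesia can currently reach
mnesia:set_master_nodes([node()]).     % treat the local node's tables as the reference copy

After forcing the master node, start this node first and then the remaining nodes, as described above.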

Follow the steps from the link below:
http://chadillac.tumblr.com/post/35967173942/easy-ejabberd-clustering-guide-mnesia-mysql
and call the join_as_master(NodeName) function of the easy_cluster module.

Related

Ansible playbook to update daemon.json and restart docker on a running Kubernetes cluster

I have 5 kubernetes clusters where I need to change part of /etc/docker/daemon.json and then restart docker. I am planning on doing this via ansible, and those 2 steps are pretty straightforward. The question I have is: how can I accomplish this without taking down the whole cluster with the docker restart? I assume I would want to do this one node at a time: drain the node, then update/restart docker, wait for the node to come back online, then move on to the next node. I'm not sure exactly how to accomplish that.
You're on the right track. Drain the nodes and update them one by one, uncordoning the nodes as they complete.
As mdaniel mentioned in their comment, you'll likely want to limit the batch size Ansible uses to one host using serial. This means Ansible will only work on one host at a time. You can read about this in Ansible's docs.
You'll also want to structure your playbook so that it doesn't move on to the next host until all tasks are complete for the one it's working on, maybe using blocks.
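As a rough sketch, the per-node sequence such a playbook would drive (with serial: 1) looks like the commands below; the node name is a placeholder, and the actual daemon.json change would be an Ansible template/copy task rather than a manual edit:

kubectl drain worker-1 --ignore-daemonsets                       # evict workloads from the node (extra flags may be needed for pods with local data)
# update /etc/docker/daemon.json on worker-1 here
systemctl restart docker                                         # restart the docker daemon on the drained node
kubectl wait --for=condition=Ready node/worker-1 --timeout=5m    # wait until the node reports Ready again
kubectl uncordon worker-1                                        # allow scheduling on the node again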

Couchdb 3.1.0 cluster - database failed to load after restarting one node

Here is the situation: on a couchdb cluster made of two nodes, each node is a couchdb docker instance on a server (ip1 and ip2). I had to reboot one server and restart docker; after that, both my couchdb instances display "This database failed to load." for each database.
I can connect with Futon and see the full list of databases, but that's all. On "Verify Couchdb Installation" with Futon I have several errors (only 'Create database' is a green check)
The docker logs for the container gives me this error :
"internal_server_error : No DB shards could be opened"
I tried to recover the database locally by copying the .couch and shards/ files to a local instance of couchdb but the same problem occurs.
How can I retrieve the data ?
PS: I checked the connectivity between my two nodes with erl, no problem there. Looks like docker messed up some couchdb config file on restart.
metadata and cloning a node
The individual databases have metadata indicating on which nodes their shards are stored; it is built at creation time based on the cluster options, so copying database files alone does not actually move or mirror the database onto the new node. (If the metadata is set correctly, the shards are copied by couch itself, so copying the files is only done to speed up the process.)
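To check where each database's shards are expected to live versus which nodes the cluster currently knows about, you can query the cluster API; a quick sketch, assuming admin credentials and the default port (adjust host, credentials and database name):

curl -s http://admin:password@127.0.0.1:5984/_membership     # nodes known to / active in the cluster
curl -s http://admin:password@127.0.0.1:5984/mydb/_shards    # which nodes hold each shard range of mydb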
replica count
A 2-node cluster usually does not make sense. As with file system RAID, you can stripe for maximum performance at half the reliability, or you can create a mirror, but unless individual node state has perfect consistency detection you cannot automatically decide which of two nodes is incorrect, whereas deciding which of 3 nodes is incorrect is easy enough to do automatically. Consequently, most clusters are 3 or more nodes, and each shard has 3 replicas spread over any 3 nodes.
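Those defaults come from the [cluster] section of the CouchDB configuration; a sketch of what it typically contains (values are illustrative, not a recommendation for your setup):

[cluster]
q = 2   ; number of shard ranges each database is split into
n = 3   ; number of copies of each shard kept across the cluster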
Alright, just in case someone makes the same mistake:
When you have a 2-node cluster, couchdb@ip1 and couchdb@ip2, and created the cluster from couchdb@ip1:
1) If the node couchdb@ip2 stops, the cluster setup is messed up (couchdb@ip1 will no longer work); on restart the node will not connect correctly, and the databases will appear but will not be available.
2) On the other hand, stopping and starting couchdb@ip1 does not cause any problem.
The solution in case 1 is to recreate the cluster with 2 fresh couchdb instances (couchdb@ip1 and couchdb@ip2), then copy the databases onto one couchdb instance, and all the databases will be back!
Can anyone explain in detail why this happened? It also means that this cluster configuration is absolutely not reliable (if couchdb@ip2 is down then nothing works); I guess it would not be the same with a 3-node cluster?

Adding Hypervisor back to Failover Cluster

Somehow I removed my test hypervisor from a two-node cluster, and now when I try to add it back to the cluster it is not happening. Basically the hypervisor is pointing towards the CSV but is not able to access it when I spin up a VM and place it in a volume on that CSV. What could I possibly be doing wrong? Also, when I try to connect to the existing failover cluster from the same hypervisor, I am not able to connect to it and get an error message that cites issues with the network.
You first have to clear the node's cluster configuration by running the following command:
Clear-ClusterNode -Name nodeName -Force
After the node is cleared, you will be able to add it back to the cluster.
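A minimal PowerShell sketch of the full sequence, assuming the node is called HV02 and the cluster TestCluster (both placeholders):

Clear-ClusterNode -Name HV02 -Force               # wipe the stale cluster configuration from the removed node
Add-ClusterNode -Cluster TestCluster -Name HV02   # join the cleaned node back into the existing cluster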

How to reschedule containers with swarm when the server dies for a moment

I run two servers using docker-compose and swarm
When I stop server A, the containers on server A are moved to server B.
But when server A comes back up, the containers that moved to server B are not moved back to server A.
I want to know how to properly redistribute the dead server's containers when the server only goes down for a moment.
First, for your Swarm to be able to re-create a task when a node goes down, you still need to have a majority of manager nodes available... so if it was only a two-node Swarm, this wouldn't work, because you'd need three managers for one to fail and another to take the leader role and re-schedule the failed replicas (just an FYI).
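To see how many managers you currently have and, if needed, promote another node so the managers keep quorum when one fails, you can run something like this from an existing manager (the node name is a placeholder):

docker node ls                 # the MANAGER STATUS column shows Leader/Reachable for manager nodes
docker node promote node-c     # turn a worker into an additional manager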
I think what you're asking for is "re-balancing". When a node comes back online (or a new one is added), Swarm does nothing with services that are set to the default replicated mode. Swarm doesn't "move" containers, it destroys and re-creates containers, so it considers the service still healthy on Node B and won't move it back to Node A. It wouldn't want to disrupt your active/healthy services on Node B just because Node A came back online.
If Node B does fail, then Swarm would again re-schedule the task on the next best node.
If Node B has a lot of containers and work is unbalanced (i.e. Node A is empty and Node B has 3 tasks running), then you can force a service update, which will destroy and re-create all replicas of that service and will try to spread them out by default, which may result in one of the tasks ending up back on Node A.
docker service update --force <servicename>
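For example, to check where the replicas are running before and after forcing the update (the service name web is a placeholder):

docker service ps web                # list the service's tasks and the node each one is running on
docker service update --force web    # destroy and re-create the replicas, spreading them across available nodes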

Jenkins: 2 master nodes using NFS

I'm thinking about the following high-availability solution for my environment:
Datacenter one with a powered-on Jenkins master node.
A disaster-recovery datacenter with a powered-off Jenkins master node.
Datacenter one is always powered on; the second is only for disasters. My idea is to install the two Jenkins masters using the same IP but with a shared NFS. If the first goes down, the second starts with the same IP and I still have my service.
My question is: can this solution work?
Thanks all for the help ;)
I don't see any particular reason why it should not work. But you still have to monitor the switch-over, because I have faced a situation where jobs that were running when Jenkins abruptly shut down were still in the queue when the service was recovered but never completed afterwards; I had to manually delete those builds using the script console.
On the Jenkins forums a lot of people have reported such bugs; most of them seem to have been fixed, but there are still cases where this can happen, because every time Jenkins is started or restarted the configuration is reloaded from disk. So there can be inconsistencies between the in-memory config that existed earlier and the freshly reloaded config.
So in your case, it might happen that an executor thread is still blocked when the service is recovered. Thus you have to make sure that everything is running fine after recovery.
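For reference, cleaning up stuck entries can be done from the script console (Manage Jenkins > Script Console); a hedged Groovy sketch, where the job name and build number are placeholders:

Jenkins.instance.queue.clear()   // drop everything still sitting in the build queue
Jenkins.instance.getItemByFullName('my-job').getBuildByNumber(42).delete()   // remove one stuck build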
