Related to Neo4j 1.9.3 HA clustering, our team had a few different interpretations of how master node(s) behave if an HA cluster 'splits' due to a network problem. We're trying to understand the impact on master nodes as well as the branched database that occurs if/when the master cannot deliver updates to the slaves.
Given a 5 node deployment, where a network failure splits things into two groups/sub-clusters:
Group[A] consists of 2 nodes
Group[B] consists of 3 nodes
Each member of a given group can communicate with each other, but [A] cannot communicate with [B]. Prior to the split, the original master node (of the 5-node cluster) lived within the new [A] or [B] groups.
Questions:
If the original master node lived within [A] (i.e. in a minority non-quorum group of cluster nodes), will it write lock its database (knowing that it is a branch at this point) until it can rejoin the entire cluster at which time it will honor the newly elected master from [B] (which was able to elect a new master because it had quorum)?
If the original master node lived within [B] (i.e. in a majority quorum group of cluster nodes), will it continue to allow writes to its database, while [A] will be write-locked because it will not have a master? Or is a master elected in [A] even if it doesn't have a quorum for the whole cluster?
Any help is much appreciated!
There will always be only one master in a Neo4j cluster. Since the cluster is aware of the number of members, a master election requires a quorum of more than half of them. In case of a split in the way you've described, the following will happen:
Original master lived in minority partition A: the master will degrade to slave and serve only read requests, but won't accept writes. Partition B has a quorum and will elect a new master. When the partitioning is resolved, the former master will continue to work as a slave.
Original master lived in majority partition B: the master continues to be master and propagates transactions to the other members of B. When the partitioning is resolved, the former members of A will catch up on missing transactions.
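The majority rule described above can be sketched in a few lines of Python. This only illustrates the arithmetic and is not Neo4j code; the function name `partition_with_quorum` and the mapping of partition names to sizes are made up for the example.

```python
def partition_with_quorum(cluster_size, partitions):
    """Return the name of the partition holding a strict majority, or None.

    cluster_size: number of members the cluster knew about before the split.
    partitions:   hypothetical mapping of partition name -> reachable members.
    """
    for name, members in partitions.items():
        # A quorum is strictly more than half of the full cluster.
        if 2 * members > cluster_size:
            return name
    return None


# The 5-node split from the question: A has 2 nodes, B has 3.
split = {"A": 2, "B": 3}
print(partition_with_quorum(5, split))  # B
```

At most one partition can ever satisfy the check, which is why two masters cannot be elected at the same time. An even split (for example 2 and 2 out of 4) leaves no partition with a quorum.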
I have 2 swarm nodes, and I would like that, in case one node shuts down, the other one reschedules all services onto itself.
Right now I have one leader (manager) and one worker, and it works perfectly if the worker goes down, because the leader reschedules all services onto itself.
My problem is when the leader goes down: no one takes over the services that were running on it.
I already tried with two managers, but it didn't work.
So I am thinking about placing all my services on the worker node, so that if the leader node goes down there is no problem at all, and if the worker node goes down, the leader node reschedules all services onto itself.
I tried with
deploy:
  placement:
    constraints:
      - "node.role!=manager"
But this does not work either, because it will never instantiate the service on a manager node.
So I would like to ask whether there is any way to make each of the two nodes take over all services in case the other one goes down.
or
Is there a way to configure a service to "preferably" be deployed on one specific node if that node is available, and otherwise on any other node?
The rub of it is, you need 3 nodes, all managers. It is not a good idea, even with a 2-node swarm, to make both nodes managers, because Docker swarm uses the Raft protocol for manager quorum, and this protocol requires a clear majority. With two manager nodes, if either node goes down, the remaining manager represents only 50% of the swarm managers and so cannot act for the swarm until quorum is restored.
Once you have 3 nodes, all managers, the swarm will tolerate the failure of any single node and move its tasks to the other two nodes.
Don't bother with 4 manager nodes. They don't provide extra protection from single-node failures, and they don't protect from two-node failures because, again, 2 out of 4 does not represent more than 50%. To survive 2 node failures you want 5 managers.
Let's say we have a test setup of 10 nodes: 4 managers and 6 workers.
When the leader manager fails, the other 3 managers will choose another manager as leader.
When this leader fails as well, we only have 2 managers left out of 4. The other managers then say
Error response from daemon: rpc error: code = Unknown desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online.
Because not more than half of the managers are left, they will not be able to choose a new leader, although 2 managers of the cluster are still up.
My questions are:
What is the sense of this rule? The cluster is left without a leader and cannot be managed as long as no additional managers are added to the cluster, although there are 2 managers available.
Why should I choose the worker role for nodes at all? What advantages are there to having nodes as workers? Managers also act as workers by default only with the disadvantage that they cannot take over when manager nodes fail.
Docker recommends using a swarm with an odd number of manager nodes. So your initial setup of 4 managers is only as good as having 3 manager nodes. It is recommended that you start with 5 managers, as you are losing 2 nodes. Also, isn't there a serious issue to be addressed in the way you are running the swarm? (Losing so many nodes is not a good sign.)
If the swarm loses the quorum of managers, the swarm cannot perform management tasks. If your swarm has multiple managers, always have more than two. To maintain quorum, a majority of managers must be available. An odd number of managers is recommended, because the next even number does not make the quorum easier to keep. For instance, whether you have 3 or 4 managers, you can still only lose 1 manager and maintain the quorum. If you have 5 or 6 managers, you can still only lose two.
Having dedicated worker nodes makes sure that they won't participate in the Raft distributed state, make scheduling decisions, or serve the swarm mode HTTP API. So the complete compute power of these nodes is dedicated to running the containers.
because manager nodes use the Raft consensus algorithm to replicate data in a consistent way, they are sensitive to resource starvation
The quotes are taken from the official Docker documentation.
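To make the scenario from the question concrete, here is a small Python sketch that walks through managers failing one at a time. It assumes failed managers stay in the swarm's member list (they are not demoted or removed), so the majority is always counted against the original manager count. The function names are invented for this illustration and are not part of any Docker API.

```python
def has_quorum(total_managers, online_managers):
    """True if strictly more than half of all known managers are online."""
    return 2 * online_managers > total_managers


def walk_failures(total_managers):
    """Yield (online, leader_possible) as managers fail one at a time."""
    for online in range(total_managers, 0, -1):
        yield online, has_quorum(total_managers, online)


# The setup from the question: 4 managers, failing one after another.
for online, ok in walk_failures(4):
    print(f"{online}/4 managers online -> leader election possible: {ok}")
```

With 4 managers the second failure already leaves 2 of 4 online, which is exactly half and therefore not a majority. Running the same walk with 5 managers shows that quorum is only lost on the third failure.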
I am playing with a multi-node Docker swarm in the cloud. I set up a 4-node swarm with 2 managers (1 primary and 1 reachable manager) and 2 worker nodes. While reading the docs, I found out that we have to choose an odd number of manager nodes, like 1, 3, and so on. I am not sure what the technical restriction behind this decision is.
This is related to how consensus across managers is determined when maintaining cluster consistency during an outage. See Raft consensus in swarm mode.
The algorithm used to derive consensus for a cluster of N nodes requires (N/2)+1 of them to agree. For a cluster of 2 managers you would actually be reducing reliability because if either of them goes down the other would be unable to do anything. In general, having an even number of managers provides no benefit over having one less.
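As a rough illustration of that formula, the following Python sketch tabulates the quorum size and the number of failures a cluster can absorb. It reads (N/2)+1 with integer division, which is how a Raft majority is normally counted; the helper names are made up for the example.

```python
def quorum(n):
    """Managers that must agree: floor(N / 2) + 1."""
    return n // 2 + 1


def tolerated_failures(n):
    """Managers that can be lost while a quorum is still reachable."""
    return n - quorum(n)


print("managers  quorum  tolerated failures")
for n in range(1, 8):
    print(f"{n:>8}  {quorum(n):>6}  {tolerated_failures(n):>18}")
```

Each even row tolerates the same number of failures as the odd row just above it (2 managers tolerate 0, like 1; 4 tolerate 1, like 3), which is the sense in which an even number of managers buys nothing over one less.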
I have a two-server Aerospike cluster running with a replication factor of 2. Both servers have the same replicated object count, which means all records are replicated. But the monitoring panel still shows incoming and outgoing migrations going on.
This happened after I restarted one of the servers. Now the I/O rate on both servers is higher than it was before restarting.
Why is this happening?
When a node leaves the cluster, the partition id of any partition that node was a member of advances. When the node returns, it shares its partition info with the cluster, and migrations are required for any partition the returning node is a member of. This is done because, while the node was down, the remaining node may have taken on writes.
For replication factor 2 with 2 nodes, both nodes are members of all partitions.
I'm building a monitoring tool in Erlang. When run on a cluster, it should run a set of data collection functions on all nodes and record that data using RRD on a single "recorder" node.
The current version has a supervisor running on the master node (rolf_node_sup) which attempts to run a 2nd supervisor on each node in the cluster (rolf_service_sup). Each of the on-node supervisors should then start and monitor a bunch of processes which send messages back to a gen_server on the master node (rolf_recorder).
This only works locally. No supervisor is started on any remote node. I use the following code to attempt to load the on-node supervisor from the recorder node:
rpc:call(Node, supervisor, start_child, [{global, rolf_node_sup}, [Services]])
I've found a couple of people suggesting that supervisors are really only designed for local processes. E.g.
Starting processes at remote nodes
how: distributed supervision tree
What is the most OTP way to implement my requirement to have supervised code running on all nodes in a cluster?
A distributed application is suggested as one alternative to a distributed supervisor tree. These don't fit my use case. They provide for failover between nodes, but not for keeping code running on a set of nodes.
The pool module is interesting. However, it provides for running a job on the node which is currently the least loaded, rather than on all nodes.
Alternatively, I could create a set of supervised "proxy" processes (one per node) on the master which use proc_lib:spawn_link to start a supervisor on each node. If something goes wrong on a node, the proxy process should die and then be restarted by its supervisor, which in turn should restart the remote processes. The slave module could be very useful here.
Or maybe I'm overcomplicating. Is directly supervising nodes a bad idea? Perhaps I should instead architect the application to gather data in a more loosely coupled way: build a cluster by running the app on multiple nodes, tell one to be master, and leave it at that!
Some requirements:
The architecture should be able to cope with nodes joining and leaving the pool without manual intervention.
I'd like to build a single-master solution, at least initially, for the sake of simplicity.
I would prefer to use existing OTP facilities over hand-rolled code in my implementation.
Interesting challenges, to which there are multiple solutions. The following are just my suggestions, which will hopefully help you decide how to write your program.
As I understand your program, you want to have one master node where you start your application. This will start the Erlang VM on the nodes in the cluster. The pool module uses the slave module to do this, which requires key-based ssh communication in both directions. It also requires that you have proper DNS working.
A drawback of slave is that if the master dies, so do the slaves. This is by design, as it probably fit the original use case perfectly; however, in your case it might be stupid (you may want to still collect data even if the master is down, for example).
As for the OTP applications, every node may run the same application. In your code you can determine the node's role in the cluster using configuration or discovery.
I would suggest starting the Erlang VM using some OS facility, daemontools, or similar. Every VM would start the same application, where one would be started as the master and the rest as slaves. This has the drawback of making it harder to "automatically" run the software on machines coming up in the cluster, as you could do with slave; however, it is also much more robust.
In every application you could have a suitable supervision tree based on the role of the node. Removing inter-node supervision and spawning makes the system much simpler.
I would also suggest having all the nodes push to the master. This way the master does not really need to care about what's going on in the slave, it might even ignore the fact that the node is down. This also allows new nodes to be added without any change to the master. The cookie could be used as authentication. Multiple masters or "recorders" would also be relatively easy.
The "slave" nodes however will need to watch out for the master going down and coming up and take appropriate action, like storing the monitoring data so it can send it later when the master is back up.
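The store-and-send-later behaviour suggested for the "slave" nodes is not specific to Erlang, so here is a language-neutral sketch of it in Python. Everything in it is hypothetical: the `BufferedReporter` class, the `send` callable (assumed to raise `ConnectionError` while the master is unreachable), and the buffer limit. In an OTP application the same idea would more likely live in a gen_server on each node.

```python
from collections import deque


class BufferedReporter:
    """Hold samples while the recorder is unreachable and flush them later."""

    def __init__(self, send, max_buffered=1000):
        # send: hypothetical callable that raises ConnectionError when the master is down.
        self._send = send
        # Bounded buffer: once full, the oldest samples are dropped first.
        self._pending = deque(maxlen=max_buffered)

    def report(self, sample):
        """Queue a sample and try to deliver everything queued so far."""
        self._pending.append(sample)
        return self.flush()

    def flush(self):
        """Send buffered samples in order; stop at the first failure."""
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                return False
            self._pending.popleft()
        return True


# Usage with a fake recorder that is down at first and comes back later.
state = {"up": False}
received = []


def fake_send(sample):
    if not state["up"]:
        raise ConnectionError("recorder unreachable")
    received.append(sample)


reporter = BufferedReporter(fake_send)
reporter.report({"cpu": 0.4})  # kept locally, the recorder is down
state["up"] = True
reporter.report({"cpu": 0.5})  # delivers both samples, oldest first
print(received)
```

Bounding the buffer is a deliberate trade-off: a node that cannot reach the master for a long time loses its oldest samples instead of growing without limit.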
I would look into riak_core. It provides a layer of infrastructure for managing distributed applications on top of the raw capabilities of Erlang and OTP itself. Under riak_core, no node needs to be designated as master. No node is central in an OTP sense, and any node can take over for other failing nodes. This is the very essence of fault tolerance. Moreover, riak_core provides for elegant handling of nodes joining and leaving the cluster without needing to resort to the master/slave policy.
While this sort of "topological" decentralization is handy, distributed applications usually do need logically special nodes. For this reason, riak_core nodes can advertise that they are providing specific cluster services, e.g., as embodied by your use case, a results collector node.
Another interesting feature/architecture consequence is that riak_core provides a mechanism to maintain global state visible to cluster members through a "gossip" protocol.
Basically, riak_core includes a bunch of useful code to develop high performance, reliable, and flexible distributed systems. Your application sounds complex enough that having a robust foundation will pay dividends sooner rather than later.
otoh, there's almost no documentation yet. :(
Here's a guy who talks about an internal AOL app he wrote with riak_core:
http://www.progski.net/blog/2011/aol_meet_riak.html
Here's a note about a rebar template:
http://lists.basho.com/pipermail/riak-users_lists.basho.com/2011-March/003632.html
...and here's a post about a fork of that rebar template:
https://github.com/rzezeski/try-try-try/blob/7980784b2864df9208e7cd0cd30a8b7c0349f977/2011/riak-core-first-multinode/README.md
...talk on riak_core:
http://www.infoq.com/presentations/Riak-Core
...riak_core announcement:
http://blog.basho.com/2010/07/30/introducing-riak-core/