Mnesia Clustering - Erlang

If I am clustering 2 nodes together, from my experimenting and reading up online I understand that Node A will act like a "master" node and Node B will copy the tables over if I want it to. (Otherwise it will just access them remotely.)
What happens, though, if Node B goes down? Does it just recopy the data that has changed since it was last up?
Also, what happens if Node A goes down? Is Node B still usable? If so, when data is changed on Node B, does Node A copy it over to itself when it comes back? My understanding so far is that Node A doesn't care what Node B says, but someone please tell me I'm wrong.

Since the accepted answer is a link only answer, figured I would document this for anyone who comes along:
Mnesia doesn't quite work by having a primary-secondary architecture. Instead, some nodes have local copies of data, and some have remote copies. (You can see this by running mnesia:info() from the console. There is a list of remote tables, and a list for each of the local table types: ram_copies, disc_copies and disc_only_copies.)
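For illustration, here is a minimal sketch of how you declare where the copies live when creating a table; the node names, table name, and attributes are assumptions:

    %% Run after mnesia:create_schema(['a@host', 'b@host']) and starting
    %% Mnesia on both nodes. 'a@host' keeps a disc copy, 'b@host' a RAM
    %% copy; any node not listed accesses the table remotely.
    {atomic, ok} = mnesia:create_table(session,
                       [{disc_copies, ['a@host']},
                        {ram_copies,  ['b@host']},
                        {attributes,  [key, value]}]).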
If a node goes down, as long as there is some table with a local copy, operations involving that table are fine.
One of the downsides of Mnesia is that it is subject to network partitions. If the network connection between two nodes in your cluster goes bad, then each one will think the other node is down and continue to write data. Recovery from this is complicated. In the more mundane case, though, if one node goes down, the nodes with local copies of data continue along, and when the down node recovers, it syncs back up with the cluster.

Related

Informix 10 replication queues not moving

I'm using Informix IDS 10.00.UC6 on Solaris 10, with two machines having the same database schema and all tables replicating in both directions using Enterprise Replication, so in theory both databases should have the same content.
However, a problem has arisen where one direction of replication (Host A to Host B) continues to work correctly, but the other direction (Host B to Host A) does not work. The symptoms are:
Changes made to a table on Host B do not propagate to Host A (as determined by changing a row on Host B and inspecting the table on Host A)
cdr list serv shows Active and Connected (both directions), but on Host B there is a queue of millions of bytes.
cdr list repl shows non-zero queues for several of the replicates.
cdr stats recvq on Host A shows nothing received from Host B recently.
cdr stats rqm shows data in the spool trg_send_stxn with flags SEND_Q, SPOOLED, PROGRESS_TABLE, NEED_ACK, SENDQ_MASK, SREP_TABLE.
There are no errors or relevant messages in online.log or cdr_mon.log, or any other place I can think to look.
Some of the tables are "out of sync" in that rows have conflicting data or are missing; this is for various reasons relating to past errors where one host was offline. However, even changes to tables with correct data on Host B are not propagated to Host A.
I did a cdr cleanstart on Host B yesterday after this problem had been occurring in both directions, which did at least make the A -> B direction start working (the opposite of what I expected), and the queue on Host B was 0 at that time. After that cleanstart, some changes to tables (with correct data) would propagate to Host A, while changes to some other tables on B would not. But today, no tables are propagating from B to A.
Before the cleanstart I had found by experimenting that sometimes deleting an individual replicate would reduce the size of the stuck queue but the queue remained stuck all the same; and sometimes, deleting a replicate would make the queue move for a time before being stuck again.
There is also a DR host that both A and B do one-way propagation to, and that is propagating correctly with no queue backup.
I'm at a loss now as to how to diagnose why the data in the replication queues is not moving. If there were sync errors (i.e. the replicated change could not be applied due to Host A data differing) I would expect log messages in online.log that the update was rejected, with information saved to $INFORMIXDIR/ats_dr and so on -- this has happened recently. It seems as if there must be something in the queue being refused but not being cleared and not logged, blocking the queue. Host A has heavy live traffic and (thankfully) is correctly replicating to Host B, but not vice versa.
Any ideas of more things to try or ways to diagnose the problem would be most welcome.
Edit - this may or may not be related to "Retrieving or deleting a row with a blob in Informix 10", where it appears that the ER send spooler on Host B has corruption.
If there have been any recent replication definition changes in your environment, I would look at the following for clues. As Jonathan mentioned, IDS 10.0.xC6 is quite old and there were lots of additions to ER in later versions that make it even more robust and resilient to failures:
Host B being Receive Only in the replicates
ATS/RIS files
What does cdr error -a show (run on each server)?
(Just in case you don't have it ... a link to Version 10 ER manual:
http://publibfp.dhe.ibm.com/epubs/pdf/25122792.pdf)
Oh, and are ALL Servers "in time sync" (ntp)?
JJ
Regarding ATS/RIS files, can we assume all replicates do have these options on and this has already worked in the past?
What's in 'onstat -g rcv' (receive statistics) on A, and how does this output change over time?
What does 'onstat -g nif' say on B? Possibly a block in the transmission?
Can we assume both sides have been restarted at least once since the issue started, so that any internal thread confusion has been resolved and ER has been re-initialized at least once on either side?
Is there possibly some huge transaction, from B underway to A, that's clogging replication, e.g. by filling up A's receive queue? Any space problems in queue sbspaces (or queue header dbspaces)?
I guess a cleanstart on B will resolve the problem, but of course a re-sync of all replicated tables would have to occur (since you already did a cleanstart on B, that's required anyway).
Andreas

What is Mnesia replication strategy?

What strategy does Mnesia use to define which nodes will store replicas of particular table?
Can I force Mnesia to use a specific number of replicas for each table? Can this number be changed dynamically?
Are there any sources (besides the source code) with detailed (not just overview) description of Mnesia internal algorithms?
Manual. You're responsible for specifying what is replicated where.
Yes, as above, manually. This can be changed dynamically.
I'm afraid (though may be wrong) that none besides the source code.
In terms of documentation, the whole Erlang distribution is hardly the leader in the software world.
Mnesia does not automatically manage the number of replicas of a given table.
You are responsible for specifying each node that will store a table replica (and hence their number). A replica may then be:
stored in memory,
stored on disk,
stored both in memory and on disk,
not stored on that node - in this case the table will be accessible but data will be fetched on demand from some other node(s).
It's possible to reconfigure the replication strategy when the system is running, though to do it dynamically (based on a node-down event for example) you would have to come up with the solution yourself.
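A minimal sketch of that runtime reconfiguration (the table and node names are assumptions):

    %% Add a disc-based replica of the table on a new node,
    {atomic, ok} = mnesia:add_table_copy(session, 'c@host', disc_copies),
    %% change that replica to a memory-only copy,
    {atomic, ok} = mnesia:change_table_copy_type(session, 'c@host', ram_copies),
    %% or drop the replica from that node again.
    {atomic, ok} = mnesia:del_table_copy(session, 'c@host').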
The Mnesia system events could be used to discover a situation when a node goes down; given you know what tables were stored on that node you could check the number of their online replicas based on the nodes which were still online and then perform a replication if needed.
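A hedged sketch of that idea, assuming a table named session kept as disc_copies (both names are assumptions, and a production version would need more care around races and repeated events):

    -module(replica_guard).
    -export([start/0]).

    %% Subscribe to Mnesia system events and re-replicate when a node
    %% holding a copy goes down.
    start() ->
        spawn(fun() ->
                  {ok, _} = mnesia:subscribe(system),
                  loop()
              end).

    loop() ->
        receive
            {mnesia_system_event, {mnesia_down, Node}} ->
                maybe_add_replica(session, Node),
                loop();
            _Other ->
                loop()
        end.

    %% If the down node held a copy, add a replica on some running
    %% node that doesn't have one yet.
    maybe_add_replica(Tab, DownNode) ->
        Copies = mnesia:table_info(Tab, disc_copies),
        case lists:member(DownNode, Copies) of
            true ->
                case mnesia:system_info(running_db_nodes) -- Copies of
                    [New | _] -> mnesia:add_table_copy(Tab, New, disc_copies);
                    []        -> no_spare_node
                end;
            false ->
                ok
        end.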
I'm not aware of any application/library which already manages this kind of stuff, and it seems like quite an advanced endeavor (from my point of view, at least) to make one.
However, Riak is a database which manages data distribution among its nodes transparently from the user and is configurable with respect to the options you mentioned. That may be the way to go for you.

Erlang clusters

I'm trying to implement a cluster using Erlang as the glue that holds it all together. I like the idea that it creates a fully connected graph of nodes, but upon reading different articles online, it seems as though this doesn't scale well (having a max of 50 - 100 nodes). Did the developers of OTP impose this limitation on purpose? I do know that you can set up nodes to have explicit connections only, as well as have hidden nodes, etc. But it seems as though the default out-of-the-box setup isn't very scalable.
So to the questions:
If you had 5 nodes (A, B, C, D, E) that all had explicit connections such that A-B-C-D-E, does Erlang/OTP allow A to talk directly to E, or does A have to pass messages via B through D to get to E, and is that the reason for the fully connected graph? Again, it makes sense, but it doesn't scale well from what I've seen.
If one were to try to go for a scalable and fault-tolerant system, what are your options? It seems as though, if you can't create a fully connected graph because you have too many nodes, the next best thing would be to create a tree of some kind. But this doesn't seem very fault-tolerant, because if the root or any parent of child nodes dies, you would lose a significant portion of your cluster.
In looking into supervisors and workers, all of the examples I've seen apply this to processes on a single node. Could it be applied to a cluster of nodes to help implement fault-tolerance?
Can nodes be part of several clusters?
Thanks for your help; if there is a semi-recent website or blog post (roughly a year old) that I've missed, I'd be happy to look at those. But I've scoured the internet pretty well.
Yes, you can send messages to a process on any remote node in a cluster, for example, by using its process identifier (pid). This is called location transparency. And yes, it scales well (see Riak, CouchDB, RabbitMQ, etc).
Note that one node can run hundreds of thousands of processes. Erlang has proven to be very scalable and was built for fault tolerance. There are other approaches to building bigger systems, e.g. the SOA approach of CloudI (see comments). You could also build clusters that use hidden nodes if you really, really need to.
At the node level you would take a different approach, for example building identical nodes that are easy to replace if they fail, with the work taken over by the remaining nodes. Check out how Riak handles this (look into riak_core and check the blog post Introducing Riak Core).
Nodes can leave and enter a cluster but cannot be part of multiple clusters at the same time. Connected nodes share one cluster cookie which is used to identify connected nodes. You can set the cookie while the VM is running (see Distributed Erlang).
Read http://learnyousomeerlang.com/ for greater good.
The distribution protocol is about providing robustness, not scalability. What you want to do is group your cluster into smaller areas and then use connections between areas that are not Erlang distribution but, say, plain TCP sessions. You could run 5 groups of 10 machines each. This means the 10 machines have seamless Pid distribution: you can call a pid on another machine. But distributing to another group means you can't seamlessly address the group like that.
You generally want some kind of "route reflection" as in BGP.
1) I think you need a direct connection between nodes to communicate between processes. This does, however, mean that you don't need persistent connections between all the nodes if two will never communicate (say if they're only workers, not coordinators).
2) You can create a not-fully-connected graph of Erlang nodes. The documentation is hard to find, and the approach comes with problems - you disable the global system which handles global names in the cluster, so you have to do everything by locally registered names, or locally registered names on remote nodes. Or just use Pids, as they work too. To start an Erlang node like this, use erl ... -connect_all false .... I hope you know what you're up to, as I couldn't trust myself to do that.
It also turns out that a not-fully-connected graph of Erlang nodes is a current research topic. The RELEASE Project is currently working on exactly that, and has come up with a concept of S-groups, which are essentially fully-connected groups. However, nodes can be members of more than one S-group, and nodes in separate S-groups don't have to be fully connected but can establish the connections they need on demand for direct node-to-node communication. It's worth finding their presentations, because the research is really interesting.
Another thing worth pointing out is that several people have found that you can get up to 150-200 nodes in a fully-connected cluster. Do you really have a use-case for more nodes than that? Surely 150-200 incredibly beefy computers would do most things you could throw at them, unless you have a ridiculous project to do.
3) While you can't start processes on a different node using gen_server:start_link/3,4, you can certainly call servers on a foreign node very easily. It seems that they've overlooked being able to start servers on foreign nodes, but there's probably good reason for it - such as a ridiculous number of error cases.
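For example, calling a server registered on a foreign node only needs the {Name, Node} addressing form (the registered name, node, and messages here are assumptions):

    %% Synchronous call to a gen_server registered locally as
    %% 'game_server' on node 'b@host'.
    Reply = gen_server:call({game_server, 'b@host'}, get_state),
    %% Casts use the same addressing form.
    ok = gen_server:cast({game_server, 'b@host'}, {move, player1, e4}).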
4) Try looking at hidden nodes, and at having a not-fully-connected cluster. They should allow you to group nodes as you see fit.
TL;DR: Scaling is hard, let's go shopping.
There are some good answers already, so I'm trying to be simple.
1) No, if A and E are not connected directly, A cannot talk to E. The distribution protocol runs on direct TCP connection - no routing included.
2) I think a tree structure is good enough - trade-offs always exist.
3) There's no 'supervisor for nodes', but erlang:monitor_node is your friend.
4) Yes. A node can talk to nodes from different 'clusters'. In the local node, use erlang:set_cookie(OtherNode, OtherCookie) to access a remote node with a different cookie.
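A small sketch of points 3) and 4) together (the node names and cookie are assumptions):

    -module(cluster_demo).
    -export([watch/1, cross_cluster_ping/2]).

    %% 3) Monitor a remote node; this process receives {nodedown, Node}
    %% if the connection to it is lost.
    watch(Node) ->
        true = erlang:monitor_node(Node, true),
        receive
            {nodedown, Node} ->
                io:format("~p went down~n", [Node])
        end.

    %% 4) Talk to a node from another 'cluster': set the cookie we must
    %% present to that node, then connect to it.
    cross_cluster_ping(Node, Cookie) ->
        true = erlang:set_cookie(Node, Cookie),
        net_adm:ping(Node).   %% pong on success, pang on failure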
1)
Yes, they talk to each other.
2) 3) and 4)
Generally speaking, when building a scalable and fault-tolerant system, you would want, or rather need, to divide the workload among different "regions" or "clusters". The supervisor/worker model envisions this, hence the topology. What you need is a few processes coordinating work between clusters, while all workers within a single cluster talk to each other to balance out within the group.
As you can see, with this topology, the "limitation" is not really a limitation as long as you divide your tasks carefully and in a balanced fashion. Personally, I believe a tree-like structure for supervisor processes is unavoidable in large-scale systems, and this is the practice I'm following. The reasons vary but boil down to scalability, fault tolerance (as fall-back policy implementation), maintenance needs, and portability of the clusters.
So in conclusion,
2) Use a tree-like topology for your supervisors. Let workers explicitly connect to each other and talk within their own domain via the supervisors.
3) While this is the natively designed environment, as I presume, I'm pretty sure a supervisor can talk to a worker on a different machine. I would not suggest this, as fault tolerance can be hell in a remote-worker scenario.
4) You should never let a node be part of two different clusters at the same moment. You can switch it from one cluster to another, though.

Is this the right way of building an Erlang network server for multi-client apps?

I'm building a small network server for a multi-player board game using Erlang.
This network server uses a local instance of Mnesia DB to store a session for each connected client app. Inside each client's record (session) stored in this local Mnesia, I store the client's PID and NODE (the node where a client is logged in).
I plan to deploy this network server on at least 2 connected servers (Node A & B).
So in order to allow a Client A who is logged in on Node A to search for (query Mnesia about) a Client B who is logged in on Node B, I replicate the Mnesia session table from Node A to Node B or vice versa.
After Client A queries the PID and NODE of Client B, Client A and B can communicate with each other directly.
Is this the right way of establishing connection between two client apps that are logged-in on two different Erlang nodes?
Creating a system where two or more nodes are perfectly in sync is by definition impossible. In practice however, you might get close enough that it works for your particular problem.
You don't say the exact reason behind running on two nodes, so I'm going to assume it is for scalability. With many nodes, your system will also be more available and fault-tolerant if you get it right. However, the problem could be simplified if you know you only ever will run in a single node, and need the other node as a hot-slave to take over if the master is unavailable.
To establish a connection between two processes on two different nodes, you need some global addressing (user id 123 is pid <123,456,0>). If you also care that only one process for User A is running at a time, you also need a lock, or to allow only unique registrations in the addressing. If you also want to grow, you need a way to add more nodes, either while your system is running or when it is stopped.
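As a concrete sketch of addressing plus unique registration, the stock global module (not one of the options listed below) already provides cluster-wide unique names; the {user, 123} key is just an illustration:

    %% Register the current process cluster-wide under a unique key;
    %% global:register_name/2 returns 'no' if the name is already taken,
    %% which doubles as the uniqueness lock.
    yes = global:register_name({user, 123}, self()),

    %% Any node in the cluster can now resolve the key and message the pid.
    Pid = global:whereis_name({user, 123}),
    Pid ! {chat, <<"hello">>},

    %% Clean up when the session ends.
    global:unregister_name({user, 123}).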
Now, there are already some solutions out there that helps solving your problem, with different trade-offs:
gproc in global mode allows registering a process under a given key (which gives you addressing and locking). This is distributed to the entire cluster with no single point of failure; however, the leader election (at least when I last looked at it) works only for nodes that were available when the system started. Adding new nodes requires an experimental version of gen_leader or stopping the system. Within your own code, if you know two players are only ever going to talk to each other, you could start them on the same node.
riak_core allows you to build on top of the well-tested and proven architecture used in Riak KV and Riak Search. It maps keys into buckets in a fashion that allows you to add new nodes and have the keys redistributed. You can plug into this mechanism and move your processes. This approach does not let you decide where to start your processes, so if there is much communication between them, it will go across the network.
Using mnesia with distributed transactions allows you to guarantee that every node has the data before the transaction is committed. This would give you distribution of the addressing and locking, but you would have to do everything else on top of this (like releasing the lock). Note: I have never used distributed transactions in production, so I cannot tell you how reliable they are. Also, because it is distributed, expect latency. Note 2: You should check exactly how you would add more nodes and have the tables replicated, for example whether it is possible without stopping mnesia.
ZooKeeper/Doozer/roll your own provides a centralized highly available database which you may use to store the addressing. In this case you would need to handle unregistering yourself. Adding nodes while the system is running is easy from the addressing point of view, but you need some way to have your application learn about the new nodes and start spawning processes there.
Also, it is not necessary to store the node, as the pid contains enough information to send the messages directly to the correct node.
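Putting that together, a hedged sketch of the register-then-lookup flow the question describes (the table/record definition is an assumption):

    -module(session_store).
    -export([register_session/1, send_to_user/2]).

    %% Assumes the table was created with:
    %%   mnesia:create_table(session,
    %%       [{attributes, record_info(fields, session)}]).
    -record(session, {user_id, pid}).

    %% Store the calling process's pid under its user id. No node field
    %% is needed: the pid itself encodes the node it lives on.
    register_session(UserId) ->
        F = fun() -> mnesia:write(#session{user_id = UserId, pid = self()}) end,
        {atomic, ok} = mnesia:transaction(F),
        ok.

    %% Look up the user's session and message the pid directly,
    %% whichever node it is on.
    send_to_user(UserId, Msg) ->
        case mnesia:transaction(fun() -> mnesia:read(session, UserId) end) of
            {atomic, [#session{pid = Pid}]} -> Pid ! Msg, ok;
            {atomic, []}                    -> {error, not_found}
        end.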
As a cool trick which you may already be aware of, pids may be serialized (as may all data within the VM) to a binary. Use term_to_binary/1 and binary_to_term/1 to convert between the actual pid inside the VM and a binary which you may store in whatever accepts binary data without mangling it in some stupid way.

Online mnesia recovery from network partition [closed]

Is it possible to recover from a network partition in an mnesia cluster without restarting any of the nodes involved? If so, how does one go about it?
I'm interested specifically in knowing:
How this can be done with the standard OTP mnesia (v4.4.7)
What custom code, if any, one needs to write to make this happen (e.g. subscribe to mnesia running_partitioned_network events, determine a new master, merge records from non-master to master, force-load tables from the new master, clear the running partitioned network event -- example code would be greatly appreciated).
Or, that mnesia categorically does not support online recovery and requires that the node(s) that are part of the non-master partition be restarted.
While I appreciate the pointers to general distributed systems theory, in this question I am interested in erlang/OTP mnesia only.
After some experimentation I've discovered the following:
Mnesia considers the network to be partitioned if two nodes disconnect and then reconnect without a Mnesia restart in between.
This is true even if no Mnesia read/write operations occur during the time of the disconnection.
Mnesia itself must be restarted in order to clear the partitioned network event - you cannot force_load_table after the network is partitioned.
Only Mnesia needs to be restarted in order to clear the network partitioned event. You don't need to restart the entire node.
Mnesia resolves the network partitioning by having the newly restarted Mnesia node overwrite its table data with data from another Mnesia node (the startup table load algorithm).
Generally nodes will copy tables from the node that has been up the longest (this was the behaviour I saw; I haven't verified that this is explicitly coded for and not a side-effect of something else). If you disconnect a node from a cluster, make writes in both partitions (the disconnected node and its old peers), shut down all nodes and start them all back up again starting the disconnected node first, the disconnected node will be considered the master and its data will overwrite all the other nodes'. There is no table comparison/checksumming/quorum behaviour.
So to answer my question, one can perform semi online recovery by executing mnesia:stop(), mnesia:start() on the nodes in the partition whose data you decide to discard (which I'll call the losing partition). Executing the mnesia:start() call will cause the node to contact the nodes on the other side of the partition. If you have more than one node in the losing partition, you may want to set the master nodes for table loading to nodes in the winning partition - otherwise I think there is a chance it will load tables from another node in the losing partition and thus return to the partitioned network state.
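A sketch of that semi-online recovery (the 30-second timeout is arbitrary, and applying set_master_nodes/1 across all tables is an assumption; run recover/1 on each node in the losing partition):

    -module(partition_recovery).
    -export([watch/0, recover/1]).

    %% After the partition heals, Mnesia reports the event
    %% {inconsistent_database, running_partitioned_network, Node}.
    watch() ->
        {ok, _} = mnesia:subscribe(system),
        receive
            {mnesia_system_event,
             {inconsistent_database, running_partitioned_network, Node}} ->
                {partitioned_from, Node}
        end.

    %% WinningNodes: the nodes whose data you have decided to keep.
    recover(WinningNodes) ->
        ok = mnesia:set_master_nodes(WinningNodes),
        stopped = mnesia:stop(),
        ok = mnesia:start(),
        ok = mnesia:wait_for_tables(mnesia:system_info(tables), 30000),
        %% Clear the master setting so future restarts load normally.
        ok = mnesia:set_master_nodes([]).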
Unfortunately mnesia provides no support for merging/reconciling table contents during the startup table load phase, nor does it provide for going back into the table load phase once started.
A merge phase would be suitable for ejabberd in particular, as the node would still have user connections and thus know which user records it owns/should be the most up-to-date for (assuming one user connection per cluster). If a merge phase existed, the node could filter userdata tables, save all records for connected users, load tables as per usual, and then write the saved records back to the mnesia cluster.
Sara's answer is great; also look at the article about CAP. The Mnesia developers sacrificed P for CA. If you need P, then you should choose which of C or A you want to sacrifice, and then choose another storage, for example CouchDB (sacrifices C) or Scalaris (sacrifices A).
It works like this. Imagine the sky full of birds. Take pictures until you have captured all the birds.
Place the pictures on the table and map them over each other, so that you see every bird exactly once. Do you see every bird? OK. Then you know that, at that time, the system was stable.
Record what all the birds sound like (messages) and take some more pictures. Then repeat.
If you have a node split, go back to the latest common stable snapshot and try** to replay what happened after that. :)
It's better described in "Distributed Snapshots: Determining Global States of Distributed Systems" by K. Mani Chandy and Leslie Lamport.
** I think there is a problem deciding whose clock to follow when trying to replay what happened.
