Something I don't get: I have two Mnesia nodes, one with a ram copy and the other with a disc copy.
My question is:
Do you create the schema only once? But the schema is where you specify the nodes.
I'm confused, and I couldn't find good documentation on this.
Let's start by clarifying the concepts. A mnesia cluster consists of nodes and tables; nodes may have copies of the tables. The type of the copy, which may be ram_copies, disc_copies, or disc_only_copies, applies to a given table on a given node. A node may have different types of copies of different tables, and a table may have different types of copies on different nodes. A special case is a node which doesn't have any disc-based copies at all; it is called a ram-only node.
The schema is a special table that stores information about the nodes and tables. Each node in the cluster must have a copy of this table; ram-only nodes obviously have a ram copy, other nodes have a disc copy. To be precise, a node must have a disc copy of the schema to have a disc-based copy of any other table.
When you call mnesia:create_schema, you are creating a disc copy of a schema without tables, to be loaded by mnesia when it is started (this function refuses to work if mnesia is already started). If your cluster contains multiple disc-based nodes, the schema is created on all these nodes simultaneously, and when mnesia is started on these nodes, they automatically connect to each other (the nodes know about each other from the schema).
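For illustration, a minimal sketch of that static setup as it might be run from the Erlang shell on one of the disc nodes; the node names a@host and b@host, the table name example, and its attributes are placeholders:

    %% Run once, while mnesia is not running on any of the nodes
    %% (the nodes must already be connected via Erlang distribution).
    DiscNodes = ['a@host', 'b@host'],
    ok = mnesia:create_schema(DiscNodes),

    %% Start mnesia on every node; they find each other via the schema.
    rpc:multicall(DiscNodes, mnesia, start, []),

    %% Create a table with a disc copy on both nodes.
    {atomic, ok} = mnesia:create_table(example,
                                       [{attributes, [key, value]},
                                        {disc_copies, DiscNodes}]).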
When mnesia cannot load the schema from disk at startup, it creates an empty one for itself in ram (or refuses to start, depending on settings). After that, you can either attach it to an existing cluster as a ram-only node by calling mnesia:change_config on a disc-based node of the cluster, in which case the empty schema is replaced and the node is synchronized with the rest of the cluster, or you can start creating tables and adding other ram-only nodes (which still have an empty schema), building a ram-only cluster.
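A sketch of attaching a ram-only node to a running disc-based cluster, under the assumption that the new node is named ram1@host and the table example already exists:

    %% On the new node: start mnesia (it comes up with an empty ram schema).
    ok = mnesia:start(),

    %% On a disc-based node of the cluster: tell mnesia about the new node,
    %% which replaces its empty schema and synchronizes it with the cluster.
    {ok, _Added} = mnesia:change_config(extra_db_nodes, ['ram1@host']),

    %% Optionally give the new node its own ram copy of a table.
    {atomic, ok} = mnesia:add_table_copy(example, 'ram1@host', ram_copies).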
A ram-only node can be turned into a disc node by calling mnesia:change_table_copy_type on the schema table. This way you can build a complete disc-based cluster dynamically from scratch, without creating a disc-based schema beforehand. However, if you have a fixed set of disc nodes, it's much easier to statically initialize the schema on them before starting the cluster for the first time.
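A sketch of that conversion, run on the ram-only node once it has joined the cluster (example is again a placeholder table):

    %% Give this node a disc copy of the schema, making it a disc node.
    {atomic, ok} = mnesia:change_table_copy_type(schema, node(), disc_copies),

    %% Disc-based copies of other tables can now be placed here as well.
    {atomic, ok} = mnesia:add_table_copy(example, node(), disc_copies).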
Related
We need to maintain and modify an in-memory hash table from within a single java process. We also need to persist it, so that its contents can be recovered after a crash, deploy or when the machine running the application fails.
We have tight latency requirements.
Would Apache Geode fit our requirements? We will run two additional nodes, which can be used on application startup to populate the hash table values.
Geode is a distributed key-value cache, kind of like a hash table on steroids, so yes, it would fit your requirements.
You can choose to persist your data, or not.
You can have n nodes holding your data, managed by a locator process that automatically distributes it to all nodes or to a subset of them, on the same machine or across n other machines.
What strategy does Mnesia use to decide which nodes will store replicas of a particular table?
Can I force Mnesia to use specific number of replicas for each table? Can this number be changed dynamically?
Are there any sources (besides the source code) with detailed (not just overview) description of Mnesia internal algorithms?
Manual. You're responsible for specifying what is replicated where.
Yes, as above, manually. This can be changed dynamically.
I'm afraid (though may be wrong) that none besides the source code.
In terms of documentation, the whole Erlang distribution is hardly the leader in the software world.
Mnesia does not automatically manage the number of replicas of a given table.
You are responsible for specifying each node that will store a table replica (and hence their number), typically via the copy-type options to mnesia:create_table (see the sketch after this list). A replica may then be:
stored in memory,
stored on disk,
stored both in memory and on disk,
not stored on that node - in this case the table will be accessible but data will be fetched on demand from some other node(s).
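As a sketch, a table definition spreading different copy types over a hypothetical three-node cluster (all node, table and attribute names are placeholders):

    {atomic, ok} = mnesia:create_table(example,
        [{attributes, [key, value]},
         {ram_copies,       ['a@host']},   % in memory only on a@host
         {disc_copies,      ['b@host']},   % in memory and on disk on b@host
         {disc_only_copies, ['c@host']}]). % on disk only on c@host
    %% Any other node in the cluster holds no replica; reads there are remote.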
It's possible to reconfigure the replication strategy when the system is running, though to do it dynamically (based on a node-down event for example) you would have to come up with the solution yourself.
The Mnesia system events could be used to discover that a node has gone down; given that you know which tables were stored on that node, you could check how many of their replicas are still online on the remaining nodes and then add a replica if needed, roughly as in the sketch below.
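A rough, non-production sketch of that idea, assuming a placeholder table example and a desired minimum of two disc replicas:

    -module(replica_watcher).
    -export([start/0]).

    start() ->
        %% Subscribe to mnesia system events (mnesia must be running).
        {ok, _Node} = mnesia:subscribe(system),
        loop().

    loop() ->
        receive
            {mnesia_system_event, {mnesia_down, Node}} ->
                error_logger:info_msg("mnesia went down on ~p~n", [Node]),
                maybe_add_replica(example),
                loop();
            _Other ->
                loop()
        end.

    maybe_add_replica(Tab) ->
        Holders = mnesia:table_info(Tab, disc_copies),
        Running = mnesia:system_info(running_db_nodes),
        Live    = [N || N <- Holders, lists:member(N, Running)],
        case {Live, Running -- Holders} of
            {[], _} ->
                %% No live replica left to copy from; nothing to do here.
                ok;
            {L, [Target | _]} when length(L) < 2 ->
                %% Naively pick some running node that has no copy yet.
                mnesia:add_table_copy(Tab, Target, disc_copies);
            _ ->
                ok
        end.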
I'm not aware of any application or library which already manages this kind of thing, and it seems like quite an advanced (from my point of view, at least) endeavor to build one.
However, Riak is a database which manages data distribution among its nodes transparently to the user and is configurable with respect to the options you mentioned. That may be the way to go for you.
Can Erlang ETS tables be shared among different processes? That is, if I have two processes running on different Erlang runtime systems (nodes), can I somehow link them so that all the changes I make in one ETS table will be reflected in the other?
Within a single Erlang node, ETS tables can be fully shared by passing the public option to ets:new. (Beware that the table will be destroyed if its owner dies, though, unless you have set up an heir.)
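A minimal single-node sketch (the table name shared_example is a placeholder):

    %% In the owner process: a public, named table any local process can use.
    _Tab = ets:new(shared_example, [named_table, public, set]),

    %% In any other process on the same node:
    true = ets:insert(shared_example, {some_key, 42}),
    [{some_key, 42}] = ets:lookup(shared_example, some_key).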
If you need to share tables across several Erlang nodes, you need to use Mnesia.
You cannot "share" an ETS table between processes on different nodes, an ETS table is only accessible by processes on the node on which it was created. If you want to share ETS tables then you will need create a process on one node, the node with the table, and access the table from the other node through this process. It is not really that difficult.
If I am clustering 2 nodes together, from my experimenting and reading up online I understand that Node A will be like a "master" node and Node B will copy the tables over if I want it to. (Otherwise it will just access them remotely.)
What happens though if Node B goes down? Does it just recopy the data that's been changed since it was last up?
Also, what happens if Node A goes down? Is Node B still usable? If so, if data is changed on Node B, does Node A copy it over to itself? My understanding so far is that Node A doesn't care about what Node B says, but someone please tell me I'm wrong.
Since the accepted answer is a link-only answer, I figured I would document this for anyone who comes along:
Mnesia doesn't quite work with a primary-secondary architecture. Instead, some nodes have local copies of data, and some have remote copies. (You can see this by running mnesia:info() from the console. There is a list of remote tables, and a list for each of the local copy types: ram_copies, disc_copies and disc_only_copies.)
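The same placement information can also be queried programmatically; a small sketch for a placeholder table example:

    mnesia:table_info(example, ram_copies),       % nodes with ram copies
    mnesia:table_info(example, disc_copies),      % nodes with disc copies
    mnesia:table_info(example, disc_only_copies), % nodes with disc-only copies
    mnesia:table_info(example, where_to_read),    % node this node reads from
    mnesia:table_info(example, where_to_write).   % nodes writes are sent to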
If a node goes down, as long as there is some table with a local copy, operations involving that table are fine.
One of the down-sides with Mnesia is that it is subject to network partition events. If in your cluster the network connection between two nodes goes bad, then each one will think that the other node is down, and continue to write data. Recovery from this is complicated. In the more mundane case though, if one node goes down, then nodes with local copies of data continue along, and when the down-node recovers, it syncs back up with the cluster.
Is it possible to recover from a network partition in an mnesia cluster without restarting any of the nodes involved? If so, how does one go about it?
I'm interested specifically in knowing:
How this can be done with the standard OTP mnesia (v4.4.7)
What custom code, if any, one needs to write to make this happen (e.g. subscribe to mnesia running_partitioned_network events, determine a new master, merge records from non-master to master, force-load tables from the new master, clear the running_partitioned_network event -- example code would be greatly appreciated).
Or, that mnesia categorically does not support online recovery and requires that the node(s) that are part of the non-master partition be restarted.
While I appreciate the pointers to general distributed systems theory, in this question I am interested in erlang/OTP mnesia only.
After some experimentation I've discovered the following:
Mnesia considers the network to be partitioned if two nodes disconnect and then reconnect without a Mnesia restart in between.
This is true even if no Mnesia read/write operations occur during the time of the disconnection.
Mnesia itself must be restarted in order to clear the partitioned network event - you cannot force_load_table after the network is partitioned.
Only Mnesia needs to be restarted in order to clear the network partitioned event. You don't need to restart the entire node.
Mnesia resolves the network partitioning by having the newly restarted Mnesia node overwrite its table data with data from another Mnesia node (the startup table load algorithm).
Generally nodes will copy tables from the node that has been up the longest (this was the behaviour I saw; I haven't verified that this is explicitly coded for and not a side effect of something else). If you disconnect a node from a cluster, make writes in both partitions (the disconnected node and its old peers), shut down all nodes and start them all back up again, starting the disconnected node first, the disconnected node will be considered the master and its data will overwrite that of all the other nodes. There is no table comparison/checksumming/quorum behaviour.
So to answer my question, one can perform semi online recovery by executing mnesia:stop(), mnesia:start() on the nodes in the partition whose data you decide to discard (which I'll call the losing partition). Executing the mnesia:start() call will cause the node to contact the nodes on the other side of the partition. If you have more than one node in the losing partition, you may want to set the master nodes for table loading to nodes in the winning partition - otherwise I think there is a chance it will load tables from another node in the losing partition and thus return to the partitioned network state.
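A sketch of that recovery sequence, run on each node of the losing partition; the winning node names are placeholders:

    %% Prefer nodes from the winning partition when loading tables.
    WinningNodes = ['a@host', 'b@host'],
    ok = mnesia:set_master_nodes(WinningNodes),

    %% Restart only mnesia (not the whole Erlang node); on startup it will
    %% load its tables from the master nodes, discarding local changes.
    mnesia:stop(),
    ok = mnesia:start(),

    %% Optionally clear the master-node setting afterwards.
    ok = mnesia:set_master_nodes([]).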
Unfortunately mnesia provides no support for merging/reconciling table contents during the startup table load phase, nor does it provide for going back into the table load phase once started.
A merge phase would be suitable for ejabberd in particular, as the node would still have user connections and thus know which user records it owns and should be the most up to date for (assuming one user connection per cluster). If a merge phase existed, the node could filter the user-data tables, save all records for connected users, load tables as usual and then write the saved records back to the mnesia cluster.
Sara's answer is great; also look at the article about CAP. The Mnesia developers sacrifice P for CA. If you need P, then you should choose which of C or A you want to sacrifice and then pick another storage. For example CouchDB (sacrifices C) or Scalaris (sacrifices A).
It works like this. Imagine the sky full of birds. Take pictures until you've got all the birds.
Place the pictures on the table and lay them over each other so that you see every bird exactly once. Do you see every bird? OK. Then you know that, at that moment, the system was stable.
Record what all the birds sound like (messages) and take some more pictures. Then repeat.
If you have a node split, go back to the latest common stable snapshot and try** to replay what happened after that. :)
It's better described in "Distributed Snapshots: Determining Global States of Distributed Systems" by K. Mani Chandy and Leslie Lamport.
** I think there is a problem deciding whose clock to follow when trying to replay what happened.