One replicated mnesia table has become out-of-sync - erlang

I have an erlang application currently running on four nodes with a replicated mnesia db that stores minimal data regarding connected clients. The mnesia replication has been working seamlessly in the past (as far as I know anyway) but a client recently noticed that one of the nodes is missing some ids related to his application.
I'm not really sure how this happened. Our network may have had a hiccup at the time. Maybe? But, of more urgency at the moment is getting the data into a good state across all nodes. Is there a way to tell mnesia to replicate from a known-good node?

Mnesia is legendary about this issue. It's a huge PITA.
Looking at it from CAP theorem's point of view, most systems built with Mnesia end up being C-A (consistency-availability with no partition tolerance) systems. For most of the time you have (and heavily rely on) its hard consistency. Then a network partition happens...
It's still available for writes, but these writes destroy consistency. And later on, Mnesia has no mechanism for automatic data repair.
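Mnesia will not repair the data for you, but it does at least tell you when it notices the split. The following is a minimal sketch of listening for that event, assuming Mnesia is already running on the local node. The module name partition_watch_sketch and the logging call are my own illustration, not something from a real deployment.

```erlang
-module(partition_watch_sketch).
-export([start/0, loop/0]).

%% Spawn a watcher process. The subscription belongs to the process
%% that calls mnesia:subscribe/1, so it is made inside the new process.
start() ->
    spawn(fun() ->
        {ok, _Node} = mnesia:subscribe(system),
        loop()
    end).

loop() ->
    receive
        %% Context is running_partitioned_network or
        %% starting_partitioned_network.
        {mnesia_system_event, {inconsistent_database, Context, Node}} ->
            error_logger:warning_msg(
                "mnesia: inconsistent database (~p) detected with ~p~n",
                [Context, Node]),
            loop();
        %% Ignore every other system event in this sketch.
        {mnesia_system_event, _Other} ->
            loop();
        stop ->
            ok
    end.
```

What to do once the event arrives (alerting, picking a node to trust, reinitialising) is still up to you; this only covers detection.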
Everyone who uses Mnesia in a cluster should familiarize themselves with these tradeoffs. Your problem is a clear sign that using Mnesia was a poor choice. Doubly so if this data is critical to you.
I too use Mnesia in such a way (sometimes we all need speed you know). But I make sure to only use it to store data that I can easily reconstruct. In general, if you need it stored on disk, Mnesia is no good, except for toy projects.
I make sure to always have this function at hand:
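reinit_mnesia_cluster() ->
    %% Stop Mnesia on every connected node before touching the schema.
    rpc:multicall(mnesia, stop, []),
    AllNodes = [node() | nodes()],
    %% Wipe the old schema (and with it all replicas), then create a fresh one.
    mnesia:delete_schema(AllNodes),
    mnesia:create_schema(AllNodes),
    rpc:multicall(mnesia, start, []).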
Use it only after the network partition has been resolved and all nodes are reachable. This will erase all Mnesia replicas and start it anew. Again, if you can't live with what it does, then using Mnesia was a poor choice.
For important data that needs hard consistency, use SQL. For important data that needs availability, use Riak. For shared state that needs speed, use Redis. Mnesia is no replacement for these systems, although at first it does seem so.
Edit on 2014-11-16: Here is a much better article on the topic, explaining in detail what I said above

Honestly, I think the cleanest way to get an out-of-sync Mnesia to replicate from a known good node is to shut down the application on the bad node, and delete all its Mnesia database files, then do the following.
Write an escript that starts Mnesia up standalone using the "bad" node name and Mnesia directory, replicates the tables from a known good node, and shuts Mnesia down. Run that escript on the bad node.
The act of replicating the tables and shutting Mnesia down gracefully puts the node back in sync with the cluster. Then, when you start the application up on the bad node, it will join up and stay in sync with the cluster.
Of course, this description lacks precise details, but that's the gist of it. There are surely less brute force ways of doing this, but unless you have massive amounts of data to replicate, I think this way is the quickest and cleanest.
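To make the gist a bit more concrete, here is a rough sketch of what the script's core could look like. It assumes the bad node's Mnesia directory has already been emptied, that the Erlang node was started with the bad node's name and cookie, and that the good node is reachable. The module name mnesia_resync_sketch, the function names and the 60 second timeout are placeholders of mine, not a tested procedure.

```erlang
-module(mnesia_resync_sketch).
-export([resync_from/3, copy_table/2]).

%% GoodNode: a node whose data is trusted.
%% Tables:   the tables this node should hold a replica of.
%% CopyType: ram_copies | disc_copies | disc_only_copies.
resync_from(GoodNode, Tables, CopyType) ->
    ok = mnesia:start(),
    %% Connect to the cluster; an empty list means nobody was reachable.
    {ok, [_ | _]} = mnesia:change_config(extra_db_nodes, [GoodNode]),
    %% Disc-based replicas need a disc-based schema on this node first.
    ok = ensure_schema(CopyType),
    lists:foreach(fun(T) -> ok = copy_table(T, CopyType) end, Tables),
    %% Block until the replicas have actually been loaded.
    ok = mnesia:wait_for_tables(Tables, 60000),
    stopped = mnesia:stop(),
    ok.

ensure_schema(ram_copies) ->
    ok;
ensure_schema(_DiscType) ->
    case mnesia:change_table_copy_type(schema, node(), disc_copies) of
        {atomic, ok} -> ok;
        {aborted, {already_exists, _, _, _}} -> ok;
        {aborted, Reason} -> {error, Reason}
    end.

%% The cluster schema may still list this node as a replica holder,
%% in which case the copy is already being loaded and this is a no-op.
copy_table(Tab, CopyType) ->
    case mnesia:add_table_copy(Tab, node(), CopyType) of
        {atomic, ok} -> ok;
        {aborted, {already_exists, Tab, _Node}} -> ok;
        {aborted, Reason} -> {error, Reason}
    end.
```

Any failed step crashes with a badmatch, which is what I would want in a one-off repair script: better to stop than to carry on half-synced.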


How Erlang Mnesia Distribution Works

I am getting through the online examples, and can already use mnesia ram copies and also connect them, but I am a bit confused on a couple of things.
1: Does the starter node (the one who creates the schema), only have the local schema? (for example, in root folder =
I ask because on another node, I can simply start mnesia, and change_config(extra_db_nodes, [node]), and automatically get all the data that is on the starting node.
This seems weird to me, what happens if all nodes go down? This means starter node needs to be run first before you can do anything.
2: There seems to be a lot of different ways to connect nodes, and to copy the tables ... Could I get a list of different ways to do this, and their impacts?
3: From the first question, after calling change_config, how can you know that it's finished downloading all the data before you can start to use it? For example, if someone connects to the node, and you check if they are already online, they might be connected to another node and you don't get that data during the check.
4: After connecting to a node, are you automatically connected to all nodes? And does it automatically update your local ram copies without doing anything? How does it assure synchronization when reading, and writing? Do I have to do anything special?
And about question 1 again -- couldn't you have a node process running that holds the local schema, and use this node to connect all nodes together? And if possible could you forbid mnesia from copying ram copies to this node process?
I know this is a lot, so thank you for your time.
Not a direct answer to your questions, but you can check out Erlang Performance Lab which might help you understand how some operations in Mnesia works by visualizing the messages between different nodes.
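On question 3 specifically, the usual tool is mnesia:wait_for_tables/2, which blocks until the listed tables are loaded locally or the timeout expires. A minimal sketch follows, assuming ram_copies replicas are wanted on the joining node; the module name mnesia_join_sketch and the error tuple are made up for illustration.

```erlang
-module(mnesia_join_sketch).
-export([join/3]).

%% ClusterNode: any node already running Mnesia with the tables.
%% Tables:      tables to hold locally as ram_copies.
%% TimeoutMs:   how long to wait for the initial load.
%% Returns ok, {timeout, BadTables} or {error, Reason}.
join(ClusterNode, Tables, TimeoutMs) ->
    ok = mnesia:start(),
    case mnesia:change_config(extra_db_nodes, [ClusterNode]) of
        {ok, [_ | _]} ->
            lists:foreach(fun add_ram_copy/1, Tables),
            %% Do not serve requests before this returns ok.
            mnesia:wait_for_tables(Tables, TimeoutMs);
        {ok, []} ->
            {error, {cannot_connect, ClusterNode}};
        {error, Reason} ->
            {error, Reason}
    end.

add_ram_copy(Tab) ->
    case mnesia:add_table_copy(Tab, node(), ram_copies) of
        {atomic, ok} -> ok;
        %% This node already holds a replica according to the schema.
        {aborted, {already_exists, Tab, _Node}} -> ok
    end.
```

Only start answering "is this user online?" style lookups after join/3 has returned ok; before that the local replica may still be loading.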

mnesia files damaged need to forensically dump everything

I have damaged my Mnesia database beyond repair as a result of underestimating the fragility of the implementation. When I try the Mnesia API, the records I need are not visible even though the keys are visible in the file. Even though the documentation indicates that Mnesia artifacts are DETS files, they cannot be opened with or identified as DETS artifacts. PS: dump_to_textfile() does not work either.
Eventually I was able to dump my DB. It did not end my Mnesia problems but it gave me options I did not have before.
Originally I had implemented a master-master Mnesia cluster (read the docs). It turns out that not even the most seasoned Erlang programmers use Mnesia replication, as there are too many flaws. In fact, I got this information from the Erlang inner circle and a few L1 teams too. In my case, however, the work was already in production. And that's when problems started.
We started getting DB consistency errors and, my favorite, network or DB partition errors. It takes a very highly skilled and knowledgeable individual to recover as well as a lot of planning and code in advance; which I did not have.
Ultimately I took two steps. (a) removed the second app so that even though the DB was in a master-master cluster; one was a slave because it was never used as a master. (b) In a second implementation I split the cluster so that the app ran on a single node with a single DB. #a was in production and #b was the warm standby. Replication was manual as writes were very rare.
In the single node deployment there are two nodes. The first node is the application, app#ks; on the same hardware there was an "erl" admin node that I used when I needed to rpc into the app and see how things were going.
When I posted this question I was trying to dump the contents of my Mnesia DB. I was having a number of problems because I was trying to access the DB from the admin node as the application node was operational.
Because I was trying to access the mnesia lib from the erl node the DB was not LOCAL to the erl node and so dump_to_textfile produced an empty file. I eventually had success when I used rpc to tell the app#ks node to dump.
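For anyone in the same spot, the rpc call looks roughly like the sketch below. The wrapper module remote_dump_sketch is only for illustration; the node name and file path are whatever applies to your setup, and the file is written on the application node's filesystem, not the admin node's.

```erlang
-module(remote_dump_sketch).
-export([dump/2]).

%% Ask AppNode (the node that actually owns the tables) to write a
%% text dump of its Mnesia database to File.
dump(AppNode, File) ->
    case rpc:call(AppNode, mnesia, dump_to_textfile, [File]) of
        %% The node was unreachable or the call crashed remotely.
        {badrpc, Reason} -> {error, {badrpc, Reason}};
        %% Otherwise hand back whatever dump_to_textfile returned.
        Result -> Result
    end.
```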
When I launched the admin node I set the mnesia dir parameter to the same folder as the app#ks node. I have a vague memory that this is undesirable.
There are many more Mnesia issues to solve but none that refer to the problem I reported. But I still do not know how to extract the raw data from the various DB files.

Postgresql replication in rails with data-fabric gem

I am currently setting up a master-slave app using Ruby on Rails. I am planning to use data-fabric or octopus gem for handling the read/write connections.
This is my first time setting up master-slave DBs. I am confused over the various open source tools available to implement the postgresql replication e.g. pgpool II, pgcluster, Bucardo and Hot Standby/Streaming Replication (built in feature in postgresql 9.1)
My requirements are
fault tolerance(high availability and no data loss on failover)
load balancing
Thanks in advance
Note: I have gone through the stackoverflow post regarding postgresql replication but they are pretty old and not helping to conclude on which tool I should go with.
In your case, streaming replication is the place to start. It is not very flexible but it does what you need regarding database reads as long as you don't need to replicate between major versions.
Database Replication 101
Database replication is a way to ensure that data saved to a specific server becomes stored in a number of other servers. This is often done to better utilize more limited network connections, ensure fault tolerance (so there is essentially a hot back-up), ensure that read-only queries can be distributed over a larger number of databases, etc. This all must be done without sacrificing the basic guarantees of ACID.
There are a number of different overlapping ways to categorize replication solutions. These include:
Page or file-level vs row-level vs statement-level
Synchronous vs Asynchronous
Master-slave vs Multi-Master
In general understanding replication and the tradeoffs between solutions requires relatively strong understanding of database mechanics and ACID guarantees. I will assume you are relatively familiar with storage mechanics, and deterministic vs non-deterministic operations and the like.
What is Being Replicated? File changes (Physical) vs Row Changes (Logical) vs Statements
The simplest approach is to replicate block changes to files, for example as stored in the write-ahead log in PostgreSQL. This replicates changes at the page level and it requires identical file formats. This means you cannot replicate across major versions, CPU architectures, or operating systems. Anything that could affect the alignment of tuples, for example, will cause the replication to either fail or, worse, corrupt the slave's database. This is the approach streaming replication uses. It is simple to set up, and it always replicates everything in the database cluster.
Additionally this approach means you can easily guarantee that the master and slave databases are identical down to the file level. Because of the fact that the PostgreSQL WAL is cluster-global it is unlikely that this approach will ever replicate anything short of the entire database cluster.
As a description of how this works, suppose I:
UPDATE my_table SET rand_value = random() WHERE id > 10000;
In this case, this changes a bunch of data pages and the file operations are replicated to the replicas. The files remain identical between the master and slave.
Another approach, one taken by Slony, Bucardo, and others is to replicate rows in a logical manner. In this approach, changed rows are flagged and logged, and the changes sent to the replicas. The replicas re-run row operations from the master database. Because these are add-on tools which do not replicate file operations but rather logical database operations, they can replicate across CPU architectures, operating systems, etc. Also they are usually designed so that you can replicate some but not all tables in a database, allowing for a lot of flexibility. On the other hand this leads to a lot of potential for errors. "Oops, that table was not replicated" is a real problem.
In this case when I run the update statement above, a trigger is fired capturing the actual rows inserted and deleted and these are logged, replicated, and the row operations re-run. Because this happens after random() is run, the databases are logically, but not necessarily physically identical.
A final approach is statement replication. In this case we replicate statements and re-run the statements on the replicas. Some configurations of PgPool will do this. In this case, you cannot ensure that a database is logically equivalent to its replica if any non-deterministic functions are run. In the statement above, the statement itself will run on each replica, ensuring different pseudorandom numbers in the relevant column.
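The difference between re-running row changes and re-running statements can be shown with a toy model. The sketch below is not how any of the tools above are implemented; it just treats a table as an Erlang map of Id => rand_value and uses rand:uniform() in place of random(). All names are invented for the illustration.

```erlang
-module(replication_kinds_sketch).
-export([demo/0]).

%% The "statement": give every row with Id > Min a fresh random value.
update_stmt(Rows, Min) ->
    maps:map(fun(Id, _Old) when Id > Min -> rand:uniform();
                (_Id, Old) -> Old
             end, Rows).

demo() ->
    Start = maps:from_list([{Id, 0.0} || Id <- lists:seq(1, 5)]),
    Master = update_stmt(Start, 2),
    %% Row-level: capture the rows that changed on the master and
    %% apply exactly those values on the replica.
    Changed = maps:filter(fun(Id, V) -> maps:get(Id, Start) =/= V end,
                          Master),
    RowReplica = maps:merge(Start, Changed),
    %% Statement-level: run the same statement again on the replica,
    %% which draws its own random numbers.
    StmtReplica = update_stmt(Start, 2),
    {RowReplica =:= Master, StmtReplica =:= Master}.
```

The first element of the result is always true. The second is false unless the replica happens to draw the very same random numbers, which is the whole problem with non-deterministic functions under statement replication.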
Synchronous vs Asynchronous
This distinction is important to understand regarding failover guarantees. In an asynchronous replication system, the updates are queued and transferred when possible to the replicas and re-run there. In a synchronous replication system the database which accepts the write will not return a successful commit until at least a certain number of replica databases report a successful commit.
Asynchronous replication is generally more robust and produces better availability than synchronous replication. This is because synchronous replication introduces additional points of failure. If you have one master and one slave, then if either system goes down, your database becomes unavailable at least for write operations.
The tradeoff though is that synchronous replication offers a guarantee that data which is committed is in fact available on replicas in the event that the master, say, suffers catastrophic hardware failure immediately following commit. This is a very low probability event, but in some cases it is important that you know the data is still available. In short this provides additional durability guarantees not present in async replication.
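As a rough model of the two commit styles (not PostgreSQL's actual protocol), the sketch below uses plain Erlang processes as replicas. An async commit returns as soon as the writes are sent; a sync commit returns ok only after a required number of acknowledgements arrive within a timeout. The module name, message shapes and timeout are assumptions made for the example.

```erlang
-module(commit_modes_sketch).
-export([replica/1, commit/3]).

%% A replica process: stores items and acks writes that ask for it.
replica(Store) ->
    receive
        {write, Item} ->
            replica([Item | Store]);
        {write, From, Ref, Item} ->
            From ! {ack, Ref},
            replica([Item | Store])
    end.

%% Asynchronous: fire and forget. A dead replica goes unnoticed here.
commit(async, Replicas, Item) ->
    lists:foreach(fun(R) -> R ! {write, Item} end, Replicas),
    ok;
%% Synchronous: succeed only once Needed replicas have acknowledged.
commit({sync, Needed, TimeoutMs}, Replicas, Item) ->
    Ref = make_ref(),
    lists:foreach(fun(R) -> R ! {write, self(), Ref, Item} end, Replicas),
    wait_acks(Ref, Needed, TimeoutMs).

wait_acks(_Ref, 0, _TimeoutMs) ->
    ok;
wait_acks(Ref, Needed, TimeoutMs) ->
    receive
        {ack, Ref} -> wait_acks(Ref, Needed - 1, TimeoutMs)
    after TimeoutMs ->
        {error, replica_timeout}
    end.
```

With a single replica that is down, commit({sync, 1, T}, ...) returns {error, replica_timeout} while commit(async, ...) still returns ok. That is the availability versus durability tradeoff described above in miniature.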
Multi-Master vs Master-Slave
Most replication systems are master-slave. In this case, all writes begin at one node and are replicated to other nodes. Writes may only begin at one node. They may not begin at other nodes. This makes replication straight-forward because we know that the slaves represent a past state of the master.
Multi-master replication allows writes to occur to more than one node. In an asynchronous replication system, this leads to the problem of conflict resolution. These problems are actually worse than most assume when you add DDL statements. Suppose two different users run the above update statement on two different masters. We will now have a set of records that have to be replicated across but they will conflict.
Multi-master replication typically requires that people think through this conflict resolution process quite carefully. It is never a process that just works out of the box. Often times you write your own conflict resolution routines. For this reason I typically recommend avoiding multi-master replication unless you really need it.
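To give a feel for what "conflict" means here, the toy sketch below compares two masters against the state they started from and lists the rows that both changed to different values. It only detects conflicts; deciding which value wins is the part you end up writing yourself. The map-based representation and all names are assumptions for illustration.

```erlang
-module(conflict_sketch).
-export([conflicts/3]).

%% Base: Id => Value before the masters diverged.
%% A, B: Id => Value on each master after its local writes.
%% Returns the sorted Ids that were changed on both masters
%% and ended up with different values.
conflicts(Base, A, B) ->
    lists:sort(
        [Id || {Id, Old} <- maps:to_list(Base),
               maps:get(Id, A, Old) =/= Old,
               maps:get(Id, B, Old) =/= Old,
               maps:get(Id, A, Old) =/= maps:get(Id, B, Old)]).
```

For example, if row 1 is changed to x on one master and to z on the other, conflicts/3 reports row 1; a row changed on only one master, or changed to the same value on both, is not reported.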

Data Warehouse: One Database or many?

At my new company, they keep all data associated with the data warehouse, including import, staging, audit, dimension and fact tables, together in the same physical database.
I've been a database developer for a number of years now and this consolidation of function and form seems counter to everything I know.
It seems to make security, backup/restore and performance management issues more manually intensive.
Is this something that is done in the industry? Are there substantial reasons for doing or not doing it?
The platform is Netezza. The size is in terabytes, hundreds of millions of rows.
What I'm looking to get from answers to this question is a solid understanding of how right or wrong this path is. From your experience, what are the issues I should be focused on arguing if this is a path that will cause trouble for us down the road. If it is no big deal, then I'd like to know that as well.
In general I would recommend using separate databases. This is the configuration I have always seen used in production and it really makes a lot of sense since - as you mentioned - both databases have fundamentally different purposes / usage patterns / etc.
If you're using one physical server, the fewer instances on that server the simpler the management and the more efficient the process.
If you put TWO instances on the same Physical Server you get:
Half the memory to use
Twice the count of database process
You could take the entire staging db down without affecting the DW
So which is more precious to you, outage windows or CPU and Memory?
On the same physical server multiple instances make performance management issues MUCH more manual to solve. If you look at the health of one of the instances, it might look fine but users are reporting poor performance, so you have to look at the next instance to see if the problem may be coming from there... and so on per instance.
Security is also harder with more than one instance. At best it's just as hard as a single instance but it's never easier. You'll have two admin accounts (SYS or something), Duplicate process accounts, etc.
Tell us why you think it's better to have more than one instance.
Can we be clear on terms? When you say "in the same Database" do you mean to say the same instance, or the same physical server? If you did move the staging to a new instance would it reside on the same physical hardware?
I think people get a little too hung up on instances. If you're going to put two instances on the same piece of hardware, you're only doubling the number of everything to very little advantage. All the server processes will be running twice... all the memory pools will be cut in half.
so let's say you really did mean two separate physical boxes...
Let's say you buy 2 12-way boxes (just say). When your staging db server is done for the day, those 12 CPUs are wasting away. When your users pack up and go home, your prod DW CPUs are wasting away. CPU cycles are perishable, you can't get them back. BUT, if you had one 24 way box... then the staging DB COULD use 20 CPUs at night for some excellent Parallel Execution for building summary tables and your users will have double the capacity for processes during the day.
so let's say you meant the same hardware.
"It seems to make security, backup/restore and performance management issues more manually intensive."
Guaranteed that performance issues are harder to solve the more instances that share the same hardware. Guaranteed.
What security do you do at the instance level?
What DW are you backing up at the instance level? You're not backing up tablespaces, but rather whole instances? Seems like that pattern will fail at a certain size.
Not familiar with the tool specifically. So if it's a single instance on a single box, then the division would seem more logical than physical, and therefore the reason the databases exist is management, not performance. You don't increase your CPUs or memory by adding a database, right? So it doesn't seem like there's any performance upside to it. Each DB may be adding separate processes (performance hit), or it might be completely logical, like schemas in Oracle. If each database is managed by new processes, then data going between them will mean IPC.
Maybe the addition of the Netezza tag will get some traction.
We use databases for every segment (INVENTORY, CRM, BILLING...). There are no performance downsides and maintenance and overview is much better.
Better late than never, but for Netezza:
There are no performance hits while querying cross database. Netezza allows only SELECT operations cross database, no INSERT, UPDATE or DELETE statements allowed.
This means you cannot do:
but you can do \c OTHERDB then
You are also not able to create a materialized view on a cross-database object, for example:
Administration might be where you will decide (though you probably already did long ago) on what kind of database(s) you'll create. Depending on your infrastructure, you might have a TEST/QA system and a PROD system on the same box, or on separate boxes.
You will gain speed in the load and the output if the tables are in the same schema (database). Obvious...but hey, I said it.
There is more overhead the more tables you put into one schema: backup time, size of backups, ease of use.
Where I am, we have many multiple TB databases within one data-warehouse. Our rule of thumb is that a single loading process or a single report query should NOT have to span database. This keeps "like" tables together but gives some allowances for our backups and contingency processes. It also makes it a bit easier to "find" data.
For those processes that need to break this rule, we will either move data from one database to the other or allow the process to join across schemas.
I'm not as familiar with Netezza, so I'm not 100% sure what your options might be.
Few points for you to consider
a) If the data in one or more staging, audit, dimension and fact table has to be joined, you are better off keeping them in one database
b) Typically you will retain dimension tables and fact tables in the same database and distribute on most frequently joined columns to leverage "co-located join" functionality of Netezza
c) You should be able to use SQL grant permission to manage access to all objects (DB, tables, views etc)

Mnesia asynchronous transaction

I would like to have a master-slave setup of Erlang nodes, where read and write operations happen on the master node only. Slave nodes are only kept as hot-standbys.
As I understand the default behavior of Mnesia is to acquire the lock synchronously on all nodes before executing the write operation. This would result in high latency especially for geographically distributed nodes.
My question is: does Mnesia support asynchronous transactions, where locks are only acquired on the master node, and write operations are propagated afterwards towards slave nodes?
I think you will be happier if you build this off-site replication using a message queue system (rabbitmq perhaps) updating the replicated db yourself from the message queue feed. WAN links are more likely to become congested or go down, and message-queue protocols have ways to handle that. Erlang distribution just gives up, and you have to spill the updates into a file until the replica comes up and can consume them.
For best symmetry, have posting-to-the-message queue be the primary method to update the db. So even the master is updated by consuming from the message queue. If a response is needed, the current master can send a message back to the issuer of the message.
Mnesia does have a few different kinds of transaction contexts, but nothing that really fits exactly with what you want.
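For reference, the contexts are selected through mnesia:activity/2: transaction, sync_transaction, async_dirty, sync_dirty and ets. A minimal sketch of running the same write under different contexts is below. The kv table, the module name and the helper names are invented for the example, and it runs on a single node, so it shows the API shape only and not the cross-node latency.

```erlang
-module(access_context_sketch).
-export([setup/0, store/3, fetch/1]).

-record(kv, {key, value}).

%% Start Mnesia and create a small ram_copies table on this node.
setup() ->
    ok = mnesia:start(),
    case mnesia:create_table(kv, [{attributes, record_info(fields, kv)}]) of
        {atomic, ok} -> ok;
        {aborted, {already_exists, kv}} -> ok
    end,
    mnesia:wait_for_tables([kv], 5000).

%% Context is one of: transaction | sync_transaction |
%% async_dirty | sync_dirty | ets.
store(Context, Key, Value) ->
    mnesia:activity(Context,
                    fun() -> mnesia:write(#kv{key = Key, value = Value}) end).

%% Dirty read: no locks are taken.
fetch(Key) ->
    mnesia:activity(async_dirty, fun() -> mnesia:read(kv, Key) end).
```

async_dirty is the closest to "write locally, propagate afterwards", but it gives up transactional guarantees entirely, which is why it is not quite what the question asks for.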
Maybe your application can benefit using sticky locks. I guess that it is quite close to your needs, but...not exactly what you wanted
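A sticky write is requested through the lock kind argument of mnesia:write/3. Here is a small sketch, assuming a table called session that I made up for the example. As far as I understand it, the sticky lock stays on the node that took it after the transaction ends, so later transactions from that same node can take the lock without asking the other nodes; the update itself is still sent to the replicas at commit.

```erlang
-module(sticky_write_sketch).
-export([setup/0, put_sticky/2]).

%% Start Mnesia and create the (hypothetical) session table.
setup() ->
    ok = mnesia:start(),
    case mnesia:create_table(session, [{attributes, [id, data]}]) of
        {atomic, ok} -> ok;
        {aborted, {already_exists, session}} -> ok
    end,
    mnesia:wait_for_tables([session], 5000).

%% Write with a sticky lock instead of an ordinary write lock.
put_sticky(Id, Data) ->
    mnesia:transaction(fun() ->
        mnesia:write(session, {session, Id, Data}, sticky_write)
    end).
```

This only pays off if all writes to a given record really do come from one node, which matches the master-only write pattern in the question.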
Interesting Q and equally interesting A!
Basically, what you are suggesting, Christian, is e.g. to have a gen_server serializing the access to the DB.
First time I did that and then I realized: hang on! Mnesia is transactional so it sounds a bit odd to first serialize access and then sort of do it again by updating the DB via a transaction.
I am still a bit puzzled, however. Given that mnesia enforces transactional semantics, I tend to take that as a hint that you should not have to serialize access yourself, especially since the implementors of mnesia probably know the system better than I do ;)
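For the record, the serializing gen_server I mean looks roughly like the sketch below. The module name db_writer_sketch is made up, and it assumes Mnesia is running and the table for the record already exists. It also shows the oddity mentioned above: the write is serialized once by the process mailbox and then again by the transaction.

```erlang
-module(db_writer_sketch).
-behaviour(gen_server).

-export([start_link/0, write/1]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% Every caller goes through this one process, so writes are
%% applied strictly one at a time, in mailbox order.
write(Record) ->
    gen_server:call(?MODULE, {write, Record}).

init([]) ->
    {ok, no_state}.

handle_call({write, Record}, _From, State) ->
    %% Still a transaction, even though access is already serialized.
    Reply = mnesia:transaction(fun() -> mnesia:write(Record) end),
    {reply, Reply, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.
```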
I understand that this is not quite a direct answer to your question, however, I'd say use mnesia + memorynodes + disknodes. The memorynodes for quick takeover and the disknodes for recovering after a crash/backup.
