Mnesia Replication and Large Numbers of Dirty Operations - erlang

Some applications require a really fast response to meet users' expectations. I am building one such application and I am using Mnesia. When we bypass the Mnesia transaction manager, we get good performance. However, here is the problem: we need to replicate this database as part of load balancing, and after all, Mnesia does the replication for us. We are using ONLY dirty operations in this application, with a few parts using the async_dirty context. I am wondering: would Mnesia replication be affected if we are not using the transaction context at this scale? Frequent dirty operations are occurring on records all the time, so would a request made on the replica on node B find the changes that have just been made on node A via a dirty operation?
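For context, a dirty write is still propagated to every node that holds a replica of the table; what you give up is waiting for the remote copies (and the atomicity/isolation a transaction would give you). A minimal sketch, assuming a hypothetical hits table replicated on both of your nodes:

```erlang
%% Minimal sketch -- module, table and record names are illustrative,
%% not taken from the question.
-module(dirty_demo).
-export([bump/1, read/1]).

-record(hits, {page, count}).

%% Assumes `hits` was created with replicas on both nodes, e.g.:
%%   mnesia:create_table(hits, [{attributes, record_info(fields, hits)},
%%                              {ram_copies, [NodeA, NodeB]}]).

%% Called on node A: bypasses the transaction manager entirely. The write
%% is still sent to every replica of `hits`, but the caller does not wait
%% for the remote nodes to apply it.
bump(Page) ->
    mnesia:dirty_update_counter(hits, Page, 1).

%% Called on node B: reads the local replica. Immediately after bump/1
%% returns on node A, this may or may not see the new value, because
%% replication of dirty writes is asynchronous.
read(Page) ->
    case mnesia:dirty_read(hits, Page) of
        [#hits{count = Count}] -> Count;
        []                     -> 0
    end.
```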

According to the Mnesia User's Guide:
async_dirty activities "will wait for the operation to be performed on one node but not the others".
For sync_dirty activities: "The caller will wait for the updates to be performed on all active replicas".
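In code, the difference is just the activity context passed to mnesia:activity/2. A hedged sketch, reusing the illustrative hits table from above:

```erlang
%% Sketch: the same write issued under the two dirty activity contexts.
-module(dirty_contexts).
-export([write_async/2, write_sync/2]).

%% async_dirty: returns as soon as the operation has been performed on one
%% node (the local one, if it holds a replica); the other replicas catch up
%% asynchronously.
write_async(Page, Count) ->
    mnesia:activity(async_dirty,
                    fun() -> mnesia:write({hits, Page, Count}) end).

%% sync_dirty: the caller waits until the update has been performed on all
%% active replicas, so a dirty read issued on the other node right after
%% this returns will see the new value (still without transaction
%% protection, of course).
write_sync(Page, Count) ->
    mnesia:activity(sync_dirty,
                    fun() -> mnesia:write({hits, Page, Count}) end).
```

So if side B must reliably see what side A has just written, the writes on side A need to go through sync_dirty (or a transaction); plain dirty/async_dirty writes give no such guarantee.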

Related

Give users read-only access to Neo4j while doing Batch Update

This is just a general question, not too technical. We have a use case where we need to load hundreds of thousands of records into an existing Neo4j database. We cannot afford to take the database offline, because of the users who are accessing it. I know that Neo4j requires an exclusive lock on the database while it's performing batch updates. Is there a way around my problem? I don't want to lock my database while doing updates. I still want my users to access it, even if only for read-only access. Thanks.
Neo4j never requires an exclusive lock on the whole database. It selectively locks the portions of the graph that are affected by mutating operations. So there are some things you can do to achieve your goal. Are you a Neo4j Enterprise customer?
Option 1: If so, you can run your batch insert on the master node and route users to slaves for reading.
Option 2: Alternatively, you could do a "blue-green" style deployment where you:
take a backup (B) of your existing database (A), then mark database A read-only
apply your batch inserts onto B, either by starting a separate instance or, even better, by using BatchInserters; that way you'll insert your hundreds of thousands of records in a few seconds
start the new database B
flip a switch on a load balancer, so that users start to be routed to B instead of A
take A down
(Please let me know if you need some tips on how to make a read-only DB.)
Option 3: If you can only afford to run one instance at any one time, there are techniques you can employ to let your users access the database as usual while you still insert large volumes of data. One of them is a single-threaded "writer" with a queue that batches write operations. Because only one thread ever writes to the database, you never run into deadlock scenarios, and people can happily read from the database. For option 3, I suggest using GraphAware Writer.
I've assumed you are not trying to insert hundreds of thousands of nodes into a running Neo4j database using Cypher. If you are, I would start there and change it to use the Java APIs or the BatchInserter API.

Is it a good idea to use MQ to store data in DB?

I'm going to use RabbitMQ as a message broker and switch most of the scripts to sending data to a queue instead of performing direct writes/reads. A consumer will get those messages and perform the corresponding operations. In my dreams this will give me more flexibility in choosing a DB engine, app-level sharding and so on. But is it a good idea generally? Or am I missing something? The current write load is ~15k inserts/deletes for MySQL and 30-50k sets for the Redis instances. The read load is about the same: ~15-20k selects, plus 50-70k gets for Redis.
The biggest issue you'll face is that your DB writes will be processed asynchronously. If a client writes data to the DB and then instantly reads it back, the value might not be what it originally inserted, because the Rabbit queue might have been very busy or slow, delaying the update operation. Or an admin might accidentally purge your queue, and then you'll have all these clients thinking their transactions were committed when nothing was actually stored.
This sounds like a classic case of premature optimization. It's a solution in search of a problem, and you should probably avoid doing it.
With AMQP you can also make these operations synchronous by using an RPC-style pattern: the producer publishes the write and waits for a reply from the consumer. With that kind of architecture you can work around the problems that come with asynchronous operations.
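To make that concrete, here is a rough sketch of such an RPC-style write using the official RabbitMQ Erlang client (amqp_client). The db_writes queue name, the payload format and the timeout are assumptions; a consumer on the other side is expected to apply the write to the DB and publish a confirmation to the reply_to queue:

```erlang
%% Sketch of a synchronous ("RPC-style") write over RabbitMQ. The caller
%% blocks until the consumer confirms that the write has been applied.
-module(db_rpc).
-export([write/2]).

-include_lib("amqp_client/include/amqp_client.hrl").

write(Channel, Payload) ->
    %% Private, exclusive queue for the reply.
    #'queue.declare_ok'{queue = ReplyTo} =
        amqp_channel:call(Channel, #'queue.declare'{exclusive = true}),
    CorrId = integer_to_binary(erlang:unique_integer([positive])),
    amqp_channel:subscribe(Channel,
                           #'basic.consume'{queue = ReplyTo, no_ack = true},
                           self()),
    receive #'basic.consume_ok'{} -> ok end,
    %% Publish the write request to the (assumed) db_writes queue.
    amqp_channel:cast(Channel,
                      #'basic.publish'{exchange = <<>>,
                                       routing_key = <<"db_writes">>},
                      #amqp_msg{props = #'P_basic'{reply_to = ReplyTo,
                                                   correlation_id = CorrId},
                                payload = Payload}),
    %% Wait for the confirmation carrying our correlation id.
    receive
        {#'basic.deliver'{},
         #amqp_msg{props = #'P_basic'{correlation_id = CorrId},
                   payload = Reply}} ->
            {ok, Reply}
    after 5000 ->
        {error, timeout}
    end.
```

Keep in mind that this reintroduces the write latency you were hoping to hide, so it tends to be reserved for the few operations that really need read-your-own-writes behaviour.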

PostgreSQL + Rails concurrency clarification

I'm building a background job that's updating users' statistics for a web application. The job currently takes 55-60 seconds, and I'm concerned about what would happen if a user were to try to load his stats page at the same time that job is running.
From what I've read about PostgreSQL and concurrency, if two clients attempt to access the same row (one updating and one reading) and I'm not explicitly starting any transactions, one of them simply has to wait for the other to finish.
So if I'm understanding that correctly, the only performance hit I'm likely to incur is on the infinitesimally small chance that a user tries to load his stats page at the same moment that the row is being updated. It's not like the whole stats table is locked up during the 55-60 second job unless I were to explicitly configure Postgres to do that, right?
Is that a correct interpretation? Are there other factors I'm missing?
(I mention the Rails part just in case it has any bearing on the above scenario)
(Also: the PostgreSQL version is 9.0.4)
It depends on the transaction isolation level. If I understand your case correctly, you are worried about dirty reads and about readers being delayed. A dirty read is impossible at the default isolation level. In fact, a plain SELECT never waits for a writer: thanks to MVCC it simply sees the last committed version of the row, so only a locking read such as SELECT ... FOR UPDATE would wait for an in-progress update of the same row to finish.
Read Committed is the default isolation level in PostgreSQL. When a transaction runs at this isolation level, a SELECT query sees only data committed before the query began.
See the PostgreSQL documentation on transaction isolation.

How to guarantee data integrity for concurrent Rails/Active Record operations

I need to implement a feature for a rails site that will involve reading and exporting most of my database.
I know this operation is going to take a while. That's fine; I've got delayed_job for that.
What I'm worried about is the data changing during the running of the job, and the resulting export being corrupted because of that.
My initial thought was to do all of the reads within a transaction. However, I would also like to be running the reads concurrently, if possible. ActiveRecord docs say that Transactions cannot be shared between Connections, and Connections cannot be shared between Threads. So it looks as though I am restricted to a single thread with this approach.
Any suggestions for a workaround? Is there another way to give the job a consistent view of the data that doesn't involve transactions? Or is there some alternative to ActiveRecord/Mysql out there that can distribute transactions across threads?

Mnesia asynchronous transaction

I would like to have a master-slave setup of Erlang nodes, where read and write operations happen on the master node only. Slave nodes are only kept as hot-standbys.
As I understand it, the default behavior of Mnesia is to acquire locks synchronously on all replica nodes before executing a write operation. This would result in high latency, especially for geographically distributed nodes.
My question is: does Mnesia support asynchronous transactions, where locks are acquired only on the master node and write operations are propagated afterwards towards the slave nodes?
I think you will be happier if you build this off-site replication with a message-queue system (RabbitMQ perhaps), updating the replicated DB yourself from the message-queue feed. WAN links are more likely to become congested or go down, and message-queue protocols have ways to handle that. Erlang distribution just gives up, and you would have to spill the updates into a file until the replica comes back up and can consume them.
For best symmetry, have posting to the message queue be the primary method of updating the DB. So even the master is updated by consuming from the message queue. If a response is needed, the current master can send a message back to the issuer of the message.
Mnesia does have a few different kinds of transaction contexts, but nothing that fits exactly what you want.
Maybe your application can benefit from using sticky locks. I guess it is quite close to your needs, but... not exactly what you wanted: http://www.erlang.org/documentation/doc-5.8.3/lib/mnesia-4.4.17/doc/html/Mnesia_chap4.html#id70700
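For illustration, a minimal sketch of a transaction using sticky locks (module, table name and record shape are hypothetical). After the first sticky_write from a node, the write lock stays with that node, so subsequent transactions issued from the same node do not have to negotiate the lock with the remote replicas again; the commit itself is still replicated to every node holding the table:

```erlang
-module(sticky_demo).
-export([bump/2]).

%% Increment a counter-like record under a sticky write lock. Works best
%% when almost all writes to a given table come from one node.
bump(Key, Incr) ->
    mnesia:transaction(
      fun() ->
              Old = case mnesia:read(hits, Key, sticky_write) of
                        [{hits, _, N}] -> N;
                        []             -> 0
                    end,
              mnesia:write(hits, {hits, Key, Old + Incr}, sticky_write)
      end).
```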
Interesting Q and equally interesting A!
Basically, what you are suggesting, Christian, is to have, e.g., a gen_server serializing access to the DB (a sketch of that pattern follows after this comment).
The first time I did that, I then realized: hang on! Mnesia is transactional, so it seems a bit odd to first serialize access and then sort of do it again by updating the DB via a transaction.
I am still a bit puzzled. However, given that Mnesia enforces transactional semantics, I tend to take that as a hint that you should not have to serialize access yourself, especially since the implementors of Mnesia probably know the system better than I do ;)
I understand that this is not quite a direct answer to your question; however, I'd say use Mnesia with memory nodes + disk nodes: the memory nodes for quick takeover and the disk nodes for recovering after a crash, and for backups.
HTH,
haavee
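For what it's worth, here is a minimal sketch of the single-writer pattern discussed in this thread: one gen_server owns all updates, so they are applied in a single well-defined order. Here it writes straight to Mnesia with a dirty write, but the body of handle_call/3 could just as well post to a message queue as Christian suggests; all names are illustrative:

```erlang
-module(db_writer).
-behaviour(gen_server).

-export([start_link/0, write/1]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% Callers block until their update has been applied, so a caller that
%% reads its own data back afterwards will see the new value locally.
write(Record) ->
    gen_server:call(?MODULE, {write, Record}).

init([]) ->
    {ok, #{}}.

handle_call({write, Record}, _From, State) ->
    %% Only this process ever writes, so writes are serialized without
    %% transactions; readers can keep using dirty reads directly.
    ok = mnesia:dirty_write(Record),
    {reply, ok, State};
handle_call(_Other, _From, State) ->
    {reply, {error, badarg}, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.
```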
