eventually consistent ACID database - system-design

While reading the book "Acing the system design interview", I spotted this line :
An ACID database including RDBMS databases cannot accept writes when it experiences a network partition, because it cannot maintain ACID consistency if writes occur during a network partition.
I thought it is possible to run an ACID database in the eventually consistent mode. For instance, leader can accept writes and followers might be updated asyncronously. In case of network failures, only the healthy ones can accept the writes.
My question is what the book claims is true? so an ACID database can not accept writes during partition?

Related

Why are read-only nodes called read-only in the case of data store replication?

I was going through the article, https://learn.microsoft.com/en-us/azure/architecture/patterns/cqrs which says, "If separate read and write databases are used, they must be kept in sync". One obvious benefit I can understand from having separate read replicas is that they can be scaled horizontally. However, I have some doubts:
It says, "Updating the database and publishing the event must occur in a single transaction". My understanding is that there is no guarantee that the updated data will be available immediately on the read-only nodes because it depends on when the event will be consumed by the read-only nodes. Did I get it correctly?
Data must be first written to read-only nodes before it can be read i.e. write operations are also performed on the read-only nodes. Why are they called read-only nodes? Is it because the write operations are performed on these nodes not directly by the data producer application; but rather by some serverless function (e.g. AWS Lambda or Azure Function) that picks up the event from the topic (e.g. Kafka topic) to which the write-only node has sent the event?
Is the data sharded across the read-only nodes or does every read-only node have the complete set of data?
All of these have "it depends"-like answers...
Yes, usually, although some implementations might choose to (try to) update read models transactionally with the update. With multiple nodes you're quickly forced to learn the CAP theorem, though, and so in many CQRS contexts, eventual consistency is just accepted as a feature, as the gains from tolerating it usually significantly outweigh the losses.
I suspect the bit you quoted anyway refers to transactionally updating the write store with publishing the event. Even this can be difficult to achieve, and is one of the problems event sourcing seeks to solve.
Yes. It's trivially obvious - in this context - that data must be written before it can be read, but your apps as consumers of the data see them as read-only.
Both are valid outcomes. Usually this part is less an application concern and is more delegated to the capabilities of your chosen read-model infrastructure (Mongo, Cosmos, Dynamo, etc).

distributed storage: why the redundant copy is 3 by default instead of 2?

In distributed storage, to avoid data disasters, we need multiple copies of data.
However, why the total copy quantity is preferred as 3 by default instead of 2?
Two copies will save nearly 50% storage requirements.
What's the main reason of choosing 3 copies?
When using two copies of data, and they differ which version do you choose? The third acts as a tie breaker.
As to why they would differ, if one computer were down for a bit—or even if they can't talk to each other—their data would differ unless the system stops accepting writes. With three computers, though, if one is down or separated from the others, the other two can still accept data without fear of the scenario in the first paragraph. (Unless you have correlated failures, which you should still plan for.)
Update. Generally you'll find that distributed algorithms use a Quorum-based system for ensuring writes. In most it's a simple majority, meaning that at least ceil(n/2) of the nodes must have the value before it is durably written. After that, you are guaranteed that nothing can un-write the value because you cannot get ceil(n/2) more nodes to oust the decision. In a two-node system ceil(n/2) = 2; so if one of the nodes goes down, you cannot accept a write anymore. But in a three node system, ceil(n/2) = 2 still, so one node can go down and the system can still accept writes.
Really it's a question of durability vs cost vs latency. The more nodes you throw at your system, the more likely you'll not lose data. One node is fairly ephemmeral; two nodes slightly less ephemeral. Three nodes is pretty good, and many systems stop there. But systems that need higher durability will have 5, 7, or 9 nodes required.
I work on one of the most reliable systems on the internet and we use 5 nodes in the quorum with up to 16 more nodes as hot backups. For us the cost is little compared to the required durability; we chose to use 5 nodes in the quorum for latency sake with the backups for a little boost in durability and to take some read pressure of the quorum.
Because cost increase is not that significant compared to significant improvement in redundancy.
Adding to Michael's answer in this question, three is chosen because it provides a very simple level of fault tolerance. This is called 't fault-tolerance' in the presence of Byzantine faults, where t is 1. That is at most 1 of those data copies can go stale/corrupt/wrong without bringing down the system.
t is usually chosen before hand as an SLA for the system in question, or via empirical evidence. Given a value of t one needs 2*t+1 copies to handle fault tolerance.

Using ets function to read mnesia table (erlang)

working on an erlang project using mnesia (some tables ram copies, some tables disk copies, some tables both). in an attempt to optimize a certain read (ram table), i used the ets lookup rather than the mnesia dirty_read i had been using, and timed both versions of the routine. the ets lookup was significantly faster than the mnesia dirty_read.
my question is whether there is some 'gotcha' or 'catch' to reading an mnesia table using ets vs mnesia (there must be, otherwise there is no reason for the slower mnesia read to exist). if it makes any difference, i don't need and am not using any "distrubuted" or "nodes." in other words, i am and will only be using a single node on a single computer.
mnesia:dirty_read does a rpc call even if the table is local. Also it checks for the current activity context and maintains it even for dirty lookups. This will result in the extra time required for the lookup.
In your case (where there is only one node with local mnesia), direct ets lookup should work but not recommended as it will be implementation dependent. The better would be to use mnesia:ets(Fun,[, Args]).

What is Mnesia replication strategy?

What strategy does Mnesia use to define which nodes will store replicas of particular table?
Can I force Mnesia to use specific number of replicas for each table? Can this number be changed dynamically?
Are there any sources (besides the source code) with detailed (not just overview) description of Mnesia internal algorithms?
Manual. You're responsible for specifying what is replicated where.
Yes, as above, manually. This can be changed dynamically.
I'm afraid (though may be wrong) that none besides the source code.
In terms of documenation the whole Erlang distribution is hardly the leader
in the software world.
Mnesia does not automatically manage the number of replicas of a given table.
You are responsible for specifying each node that will store a table replica (hence their number). A replica may be then:
stored in memory,
stored on disk,
stored both in memory and on disk,
not stored on that node - in this case the table will be accessible but data will be fetched on demand from some other node(s).
It's possible to reconfigure the replication strategy when the system is running, though to do it dynamically (based on a node-down event for example) you would have to come up with the solution yourself.
The Mnesia system events could be used to discover a situation when a node goes down; given you know what tables were stored on that node you could check the number of their online replicas based on the nodes which were still online and then perform a replication if needed.
I'm not aware of any application/library which already manages this kind of stuff and it seems like a quite an advanced (from my point of view, at least) endeavor to make one.
However, Riak is a database which manages data distribution among it's nodes transparently from the user and is configurable with respect to the options you mentioned. That may be the way to go for you.

Erlang fault-tolerant application: PA or CA of CAP?

I have already asked a question regarding a simple fault-tolerant soft real-time web application for a pizza delivery shop.
I have gotten really nice comments and answers there, but I disagree in that it is a true web service. Rather than a web service, it is more of a real-time system to accept orders from customers, control the dispatching of these orders and control the vehicles that deliver those orders in real time.
Moreover, unlike a 'true' web service this system is not intended to have many users - it is just a few dispatchers (telephone operators) and a few delivery drivers that will use it (as for now I have no requirement to provide direct access to the service to the actual customers; only the dispatchers and delivery drivers will have the direct access).
Hence this question is a bit more general.
I have found that in order to make a right choice for a NoSQL data storage option for this application first thing that I have to do is to make a choice between CA, PA and CP according to the CAP theorem.
Now, the Building Web Applications with Erlang book says that "while it [Mnesia] is not a SQL database, it is a CA database like a SQL database. It will not handle network partition". The same book says that the CouchDB database is a PA database.
Having that in mind, I think that the very first thing that I need to do with my application is to decide what the 'fault-tolerance' term means regarding to CAP.
The simple requirement that I have is to have the application available 24/7(R1). The other one is that there is no need to scale, the application will have a very modest amount of users (it is probably not possible to have thousands of dispatchers) (R2).
Now, does R1 require the application to provide Consistency, Availability and Partition Tolerance and with what priorities?
What type of data storage option will better handle the following issues:
Providing 24/7 availability for a dispatcher (a person who accepts phone calls from customers and who uses a CRM) to look up customer records and put orders into the system;
Looking up current ongoing served orders and their status (placed, baking, dispatched, delivering, delivered) in real time;
Keep track of all working vehicles' locations and their payloads in real time;
Recover any part of the system after system crash or network crash to continue providing 1,2 and 3;
To sum it up: What kind of Data Storage (CA, PA or CP) will suite the system described above better? What kind of Data Storage will better satisfy the R1 requirement?
For your 24/ requirement you are searching a database with (High) Availability because you want your requests to succeed everytime (even if they are only error results).
A netsplit would bringt your whole system down, when you have no partition tolerance
Consistency is nice to have, but you can only have 2 of 3.
Your best bet will be a PA solution. I highly recomment a solution which has been inspired by Amazon Dynamo. The best known dynamo implementations are riak and couchdb. Riak even allows you to change PA to some other form by tuning the read and write replicas.
First, don't confuse CAP "Availability" with "High Availability". They have nothing to do with each other. The A in CAP simply means "All DB nodes can answer queries". To get High Availability, you must be in multiple data centers, you must have robust documented procedures for maintenance, expansion, etc. None of that depends on your CAP choice.
Second, be realistic about your requirements. A stock-trading application might have a requirement for 100% uptime, because every second of downtime could loose millions of dollars. On the other hand, I'm guessing your pizza joint might loose tens of dollars for every minute it's down. So it doesn't make sense to spend millions trying to keep it up. Try to compute your actual costs.
Third, always evaluate your choice vs mainstream. You could just go CA (MySQL) and quickly fail-over to the slaves when problems happen. Be realistic about the costs (and risks) of building on new technology. If you really expect your system to run for 5 years without downtime, ask for proof that someone else has run that database for 5 years without downtime.
If you go "AP" and have remote people (drivers, etc.) then you'll need to write an app that stores their data on their phone and sends it in the background (with retries). Of course, you could do this regardless of weather your database was CA or AP.
If you want high uptimes, you can either:
Increase MTBF (Mean Time Between Failures) - Buy redundant power supplies, buy dual ethernet cards, etc..
Decrease MTTR (Mean Time To Recovery) - Just make sure when failure happens you can recover quickly. (Fail over to slave)
I've seen people spend tens of thousands of dollars on MTBF, only to be down for 8 hours while they restore their backup. It makes more sense to ensure MTTR is low before attacking MTBF.

Resources