Erlang fault-tolerant application: PA or CA of CAP? - erlang

I have already asked a question regarding a simple fault-tolerant soft real-time web application for a pizza delivery shop.
I have gotten really nice comments and answers there, but I disagree in that it is a true web service. Rather than a web service, it is more of a real-time system to accept orders from customers, control the dispatching of these orders and control the vehicles that deliver those orders in real time.
Moreover, unlike a 'true' web service this system is not intended to have many users - it is just a few dispatchers (telephone operators) and a few delivery drivers that will use it (as for now I have no requirement to provide direct access to the service to the actual customers; only the dispatchers and delivery drivers will have the direct access).
Hence this question is a bit more general.
I have found that in order to make a right choice for a NoSQL data storage option for this application first thing that I have to do is to make a choice between CA, PA and CP according to the CAP theorem.
Now, the Building Web Applications with Erlang book says that "while it [Mnesia] is not a SQL database, it is a CA database like a SQL database. It will not handle network partition". The same book says that the CouchDB database is a PA database.
Having that in mind, I think that the very first thing that I need to do with my application is to decide what the 'fault-tolerance' term means regarding to CAP.
The simple requirement that I have is to have the application available 24/7(R1). The other one is that there is no need to scale, the application will have a very modest amount of users (it is probably not possible to have thousands of dispatchers) (R2).
Now, does R1 require the application to provide Consistency, Availability and Partition Tolerance and with what priorities?
What type of data storage option will better handle the following issues:
Providing 24/7 availability for a dispatcher (a person who accepts phone calls from customers and who uses a CRM) to look up customer records and put orders into the system;
Looking up current ongoing served orders and their status (placed, baking, dispatched, delivering, delivered) in real time;
Keep track of all working vehicles' locations and their payloads in real time;
Recover any part of the system after system crash or network crash to continue providing 1,2 and 3;
To sum it up: What kind of Data Storage (CA, PA or CP) will suite the system described above better? What kind of Data Storage will better satisfy the R1 requirement?

For your 24/ requirement you are searching a database with (High) Availability because you want your requests to succeed everytime (even if they are only error results).
A netsplit would bringt your whole system down, when you have no partition tolerance
Consistency is nice to have, but you can only have 2 of 3.
Your best bet will be a PA solution. I highly recomment a solution which has been inspired by Amazon Dynamo. The best known dynamo implementations are riak and couchdb. Riak even allows you to change PA to some other form by tuning the read and write replicas.

First, don't confuse CAP "Availability" with "High Availability". They have nothing to do with each other. The A in CAP simply means "All DB nodes can answer queries". To get High Availability, you must be in multiple data centers, you must have robust documented procedures for maintenance, expansion, etc. None of that depends on your CAP choice.
Second, be realistic about your requirements. A stock-trading application might have a requirement for 100% uptime, because every second of downtime could loose millions of dollars. On the other hand, I'm guessing your pizza joint might loose tens of dollars for every minute it's down. So it doesn't make sense to spend millions trying to keep it up. Try to compute your actual costs.
Third, always evaluate your choice vs mainstream. You could just go CA (MySQL) and quickly fail-over to the slaves when problems happen. Be realistic about the costs (and risks) of building on new technology. If you really expect your system to run for 5 years without downtime, ask for proof that someone else has run that database for 5 years without downtime.
If you go "AP" and have remote people (drivers, etc.) then you'll need to write an app that stores their data on their phone and sends it in the background (with retries). Of course, you could do this regardless of weather your database was CA or AP.
If you want high uptimes, you can either:
Increase MTBF (Mean Time Between Failures) - Buy redundant power supplies, buy dual ethernet cards, etc..
Decrease MTTR (Mean Time To Recovery) - Just make sure when failure happens you can recover quickly. (Fail over to slave)
I've seen people spend tens of thousands of dollars on MTBF, only to be down for 8 hours while they restore their backup. It makes more sense to ensure MTTR is low before attacking MTBF.

Related

SOLR and VNodes and Tokens

Note: I have done a little reformatting and added some additional information.
Please take a look at this: Question_Answer
I want to ask - with DSE 5.0 and the upcoming changes that were mentioned at C* Summit this year for 5.1 and 5.2, will the same advice be useful?
Our use case is:
The platform MUST be available at all times. (Cassandra)
The data must be searchable. (SOLR / Lucene)
The platform MUST provide analytics / Data Warehousing / BI etc (Graph / Spark)
All of that is possible in a single product offering thanks to DSE! Thank you DataStax!
But our amount of data stored and our transaction count are VERY modest.
Our specification is for 100 concurrent sessions within the application - which of course doesn't even translate to 100 concurrent DB requests / operations.
For the most part our application resembles an everyday enterprise CRUD application.
While not ridiculous, AWS instances aren't exactly free.
Having a separate cluster for each workload (with enough replication for continuous availability), will be a cost issue for us.
While I understand, a proof of concept can offer some help - but without a real workload / real users - passing through the services / applications - in ways that only a "production" system and rogue users : can really provide an insight for. The best you can do is "loaded" functional testing.
In short, we're a little stuck here from a platform perspective.
We're, initially, thinking of having:
2 data centres for geographic isolation
2 racks per DC
2 nodes per Rack
RF of 3
CL of local_quorum
If we find we're hitting performance issues, we can scale out - add an extra rack or extra nodes to the initial 2 racks.
As for V-nodes or number of tokens, we have no idea.
The documentation for DSE Search says V-nodes adds 30% overhead, so it sounds like you shouldn't use V-nodes, but then in a table in the documentation it also says to use 16 or 32. How can it be both?
If we can successfully run all workloads on a single node (our requirements are genuinely minimal), do we run with V-nodes (16 or 32) or do we run a single token?
Lastly, is there another alternative?
Can you have Nodes with different workloads in the same data centre? Where individual nodes are set up with RAM / CPU requirements for a specific workload?
Assuming our 4 node per data centre (as a starting place only - we have no idea whether or not you can successfully run Search on a single node / or Spark on a single node)
Node 1: Just Cassandra
Node 2 : Cassandra and Search
Node 3 : Cassandra and Graph
Node 4 : Cassandra and Spark
If Search needs 64GB RAM - so be it... but the Cassandra only node could well work with just 8 or 16.
So we can cater, in terms of CPU and memory per workload type - but still only have a single DC. (We'll have 2 for redundancy - but effectively it is a single DC installation : mirrored)
Thanks in advance for your help.
Vnodes adds an additional overhead for the scatter-gather part of the search solution. In some benchmarks that's been as high as 30%. Some customers are willing to live with that overhead and want to use vnodes due to the benefits of dynamic scaling.
If you have or are planning a small cluster - and won't need to scale it on the fly - then I would definitely recommend sticking with single tokens. The hidden benefit of that approach, is that your repairs will be slightly faster also. This helps with Search as you are reading at the equivalent of CL.ONE.
It is possible to run all the features on the same DC (Search, Analytics and now Graph) but you will find that the overheads go up. You will need larger nodes with more memory and cpu resources to cope with the processing load. I'd probably start with 128 Gb of ram and go from there. I guess if your load is really light you might get away with less. As with everything benchmarking at the scale you're intending to run is key.
As an aside I'm not totally clear on your intentions re RF. You kind of imply 2 nodes and RF=3. I'm guessing it's just phrasing, but if not - it's worth noting you want at least as many nodes as the RF for best coverage!

distributed storage: why the redundant copy is 3 by default instead of 2?

In distributed storage, to avoid data disasters, we need multiple copies of data.
However, why the total copy quantity is preferred as 3 by default instead of 2?
Two copies will save nearly 50% storage requirements.
What's the main reason of choosing 3 copies?
When using two copies of data, and they differ which version do you choose? The third acts as a tie breaker.
As to why they would differ, if one computer were down for a bit—or even if they can't talk to each other—their data would differ unless the system stops accepting writes. With three computers, though, if one is down or separated from the others, the other two can still accept data without fear of the scenario in the first paragraph. (Unless you have correlated failures, which you should still plan for.)
Update. Generally you'll find that distributed algorithms use a Quorum-based system for ensuring writes. In most it's a simple majority, meaning that at least ceil(n/2) of the nodes must have the value before it is durably written. After that, you are guaranteed that nothing can un-write the value because you cannot get ceil(n/2) more nodes to oust the decision. In a two-node system ceil(n/2) = 2; so if one of the nodes goes down, you cannot accept a write anymore. But in a three node system, ceil(n/2) = 2 still, so one node can go down and the system can still accept writes.
Really it's a question of durability vs cost vs latency. The more nodes you throw at your system, the more likely you'll not lose data. One node is fairly ephemmeral; two nodes slightly less ephemeral. Three nodes is pretty good, and many systems stop there. But systems that need higher durability will have 5, 7, or 9 nodes required.
I work on one of the most reliable systems on the internet and we use 5 nodes in the quorum with up to 16 more nodes as hot backups. For us the cost is little compared to the required durability; we chose to use 5 nodes in the quorum for latency sake with the backups for a little boost in durability and to take some read pressure of the quorum.
Because cost increase is not that significant compared to significant improvement in redundancy.
Adding to Michael's answer in this question, three is chosen because it provides a very simple level of fault tolerance. This is called 't fault-tolerance' in the presence of Byzantine faults, where t is 1. That is at most 1 of those data copies can go stale/corrupt/wrong without bringing down the system.
t is usually chosen before hand as an SLA for the system in question, or via empirical evidence. Given a value of t one needs 2*t+1 copies to handle fault tolerance.

Naming statsd metrics for short lived streams

I am trying to model statistics to submit to statsd/graphite. However what I am monitoring is "session" centric. For example, I have a game that is played in real time. There are multiple instances of a game active on the servers. Each game has multiple (and variable number of) participants. Each instance of a game has a unique ID as does each player.
I want to track (and graph) each player's stats but then roll the metric up for the whole instance and then for all the instances of a game. For example there may be two instances of a game active at a given time. Lets say each has two players in the game
GameTitle.RealTime.VoiceErrors.game_instance_a.player_id_1 10
GameTitle.RealTime.VoiceErrors.game_instance_a.player_id_2 20
GameTitle.RealTime.VoiceErrors.game_instance_b.player_id_3 50
GameTitle.RealTime.VoiceErrors.game_instance_b.player_id_4 70
where game_instances and player_ids are 128 bit numbers
And I want to be able to see that the value of all voice errors for game_instance_a is 30
while all voice errors across the system is 150
Given this I have three questions
What guidance would you have on naming the metrics.
Is it kosher to have metrics that have "dynamic" identifiers as part of the name
What are they scale limits on this. If I had a 100K game instances
with say as many as 1000 players in a game, is this going to kill statsd/graphite?
Thanks!
What guidance would you give on naming the metrics?
Graphite recommends that "Volatile path components should be kept as deep into the hierarchy as possible". This essentially means that if you can push the parts of the metrics that are frequently unique to the end of the "bucket" without impacting your grouping queries you should try to do so.
Here is a great post on using Graphite that includes naming recommendations. And here is another one with additional info from Jason Dixon (an excellent source for Graphite stuff in general).
Is it kosher to have metrics that have "dynamic" identifiers as part of the name?
I usually try to avoid identifiers in the metric names unless they are very low in number (<100). Because Graphite will store a .wsp file for every metric name you'll have a difficult time re-sizing or adjusting the storage settings should you decide to change your configuration. Additionally, the Graphite UI will have a "folder" for every metric name so you can easily make the UI unusable.
In your case, I'd probably graph the total number of game instances, the total number of players, and the number of errors (by type), etc. Additionally, I might try to track players per instance (generally) and maybe errors per instance (again without knowing the actual instance. e.g. GameTitle.RealTime.PerInstance.VoiceErrors) if I had that capability (i.e. state stored per instance in my application).
Logstash, Elastic Search, Kibana
I'd suggest logging this error information with instance and player ids and using logstash to send your logs to elastic search and kibana. Then I'd watch Graphite for real time error and health anomaly detection and use Kibana (and Elastic Search underneath) to dig deeper.
What are the scale limits on this. If I had a 100K game instances with say as many as 1000 players in a game, is this going to kill statsd/graphite?
Statsd should have no problem with this, as it just acts as a -mostly- dumb aggregator. While it does maintain some state internally I don't anticipate a problem.
I don't think you'll have problems with the internal Graphite Whisper Storage itself, as it is just using files and folders. But, as I mentioned above, the Graphite Web UI will be unusable and I think you'll also run the risk of other manageability issues.
Summary
Keep the volatile (dynamic) metric buckets at the end of the name and avoid going above a couple hundred of these.

Is DataSnap Optimized for responding to more than 1k users at the same time?

We want to start a big multi-tier application. The server side application must respond to more than 1000 users at the same time. We want to create server application by 64 bit compiler and client side with 32 bit. In this case we don't know DataSnap can respond to all client without any problem or not?
In this case The Server computer is very powerful (multi-processor and more than 16GB of RAM) and DataBase Management system is FireBird 2.5.
You need a way to perform realistic load tests.
For the Firebird database, you can simulate concurrent users with the free Apache JMeter tool. It can run SQL statements and record their execution time statistics (average, min/max etc.). So you could for example create a thread group with twenty different SQL queries, and then run twenty threads which each will perform these queries sequentially.
JMeter allows to define time limits on the SQL query, and JMeter treats it as an error if the query exceeds this limit. Then you can try to find the maximum client count where the overall error rate is still less than (for example) five percent.
But you also need to know how high the expected database load will be, and you will also need to have a test database with a realistic size, not only a couple of records. Also, some database queries like reports might cause higher load - these should be included in the simulation too, as they can affect overall performance. In JMeter, you can create a second thread group, running in parallel with the first one, for these long-running statements with different settings (less simulated clients).
Testing the database will show if there is a bottleneck already in this area. For example, the test result could be that the database can serve twenty clients with a total average transaction rate of 20 TPS (transactions per second), which means one client executes one transaction per second. But this TPS value will decrease with higher user count.
Related question: Firebird usage in big projects which also has a link to http://www.firebirdsql.org/en/case-studies-catalog/
Regarding DataSnap client load simulation: this can be done with a scripted client, which runs a predefined set of statements / commands over the connection.
To run a high number of load test clients simultaneaously you could use a service like Amazon Elastic Compute Cloud (EC2), to launch clones of your test machine image, saving you hardware costs. But of course I would start with a small client machine which simply runs ten or twenty scripted clients.
As far as I know DataSnap is based on Indy. And Indy's connection handling model is not very scaleable - one thread per connection, which is very resource consuming. Even using Indy's thread pools is not an option I think... Also in Windows (32 bit) for example there is a limit of the maximum threads you can create (2000 IIRC). Anyway - using many threads is not good and hits performance of the server (for reference - Windows Internals book, Windows Performance Team blog etc.)
A scalable, robust and professional application server would use IO Completion ports (IOCP) for data processing. But I don't know if DataSnap can take advantage of it.
UPDATE:
On the CodeRage7 I asked similar scaleability questions. Here are the answers:
Q: Recently there was a question on StackOverflow about DataSnap's scaleability/performance. So can DS handle, for example, 2000 or more concurent user request at the network and application level?
A: the scalability is based on scalability of TCP/HTTP/HTTPs and # of connections allowed in your server operating system. Also based on memory and hardware you employ. There is no specific limit in DataSnap.
My comment: While this is true, Indy's connection handling model, i.e. one thread per connection, introduces bottleneck especially in 32 bit Windows (2000 threads max). In Win64 it should not be so much problem, but again - this kind of handling data flow leads to performance degradation.
Q: Does DataSnap support some kind of load balancing?
A: Not directly. You can do this in code in your DataSnap server(s).
My comment: I've found very good paper on implementing Failover/Load Balancing in DataSnap in Andreano Lanusse's blog
Q: Does DataSnap support IO Completion ports for better scaleability?
This my question was left unanswered.
Hope this helps!
UPDATE2:
I found very interesting post on DS Performance: DataSnap analysis based on Speed & Stability tests
UPDATE3:
DataSnap, Deployment, Performance, and More (Marco Cantu)
Monitoring and control of connections in DataSnap XE2 - translated in English
Monitoring and control of connections in DataSnap XE2 - original
When the specifications for a system are made, you need to be very precise when it's about multiple users.
For example: you create a website, and the client expects 15.000 unique users.
Then the client usually comes up with a requirement that the system needs to support 15.000 simultaneous users, which is very naive.
You'll need a more detailed specification than that.
Usually it's more sensible to say something like: in 99% of the requests, 99% of the users can get a response to their request within 5 seconds average.
In normal usage, you'll never see all users send a request within the same second. If at some point they all arrive within the same minute (also very unlikely), you'll have a lot fewer concurrent users.
Even for websites with tens of thousands of users, where most of them connect on a daily basis, the webserver is idle most the time, and once and a while it jumps to 5% or in extreme cases to 20%. If we really have to serve all of these users at once we'd be screwed, but that never happens, and it's not realistic to prepare a server for such loads.

What are the requirements for an application health monitoring system?

What, at a minimum, should an application health-monitoring system do for you (the developer) and/or your boss (the IT Manager) and/or the operations (on-call) staff?
What else should it do above the minimum requirements?
Is monitoring the 'infrastructure' applications (ms-exchange, apache, etc.) sufficient or do individual user applications, web sites, and databases also need to be monitored?
if the latter, what do you need to know about them?
ADDENDUM: thanks for the input, i was really looking for application-level monitoring not infrastructure monitoring, but it is good to know about both
Whether the application is running.
Unusual cpu/memory/network usage.
Report any unhandled exceptions.
Status of various modules (if applicable).
Status of external components (databases, webservices, fileservers, etc.)
Number of pending background tasks (if applicable).
Maybe track usage of the application and report statistics on most/less used functionalities so you know where optimizations are most beneficial.
The answer is 'it depends'. Why do you need to monitor? How large is your operations staff? Do you need reporting? What is the application environment? Who cares if the application fails? Who cares if an exception happens? Are any of the errors recoverable? I could ask questions like these for a long time.
Great question.
We've been looking for some application-level monitoring solution for our needs some time ago without any luck. Popular monitoring solution are mostly addressed to monitor infrastrcture and - in my opinion - they are too complicated for a requirements of most of small and mid-sized companies.
We required (mainly) following features:
alerts - we wanted to know about
incident as fast as possible
painless management - hosted service wouldbe
the best
visualizations - it's good to know what is going on and take some knowledge from the data
Because we didn't find suitable solution we started to write our own. Finally we've ended with up-and-running service called AlertGrid. (You can check it for free of course.)
The idea behind it is to provide an easy way to handle custom monitoring scenarios. Integration API is very simple (one function with two required parameters). At the momment we and others are using it for:
monitor scheduled tasks (cron jobs)
monitor entire application logic execution
alert on errors in applications
we are also working on examples of basic infrastructure monitoring using AlertGrid
This is such an open ended question, but I would start with physical measurements.
1. Are all the machines I think are hosting this site pingable?
2. Are all the machines which should be serving content actually serving some content? (Ideally this would be hit from an external network.)
3. Is each expected service on each machine running?
3a. Have those services run recently?
4. Does each machine have hard drive space left? (Don't forget the db)
5. Have these machines been backed up? When was the last time?
Once one lays out the physical monitoring of the systems, one can address those specific to a system?
1. Can an automated script log in? How long did it take?
2. How many users are live? Have there been a million fake accounts added?
...
These sorts of questions get more nebulous, and can be very system specific. They also usually can be derived reactively when responding to phsyical measurements. Hard drive fill up, maybe the web server logs got filled up because a bunch of agents created too many fake users. That kind of thing.
While plan A shouldn't necessarily be reactive, it is the way many a site setup a monitoring system.
Minimum: make sure it is running :)
However, some other stuff would be very useful. For example, the CPU load, RAM usage and (in multiuser systems) which user is running what. Also, for applications that access network, a list of network connections for each app. And (if you have access to client computer(s)) it would be cool to be able to see the 'window title' of the app - maybe check each 2-3 minutes if it changed and save it. Also, a list of files open by the application could be very useful, but it is not a must.
I think this is fairly simple - monitor so that you can be warned early enough before something goes wrong. That means monitor dependencies and the application itself.
It's really hard to provide specifics if you're not going to give details on the application you're monitoring, so I'd say use that as a general rule.
At a minimum you want to know that the system is healthy. This is subjective in what defines your system is healthy. Is it computers are up, the needed resources exist, the data is flowing through the system, the data is properly producing results, etc, etc.
In my project we do monitoring of most of this and then some. It really comes down to what is the highest level that you can use to analyze that everything is working. In our case we need to know down to the data output. If you just need to know down to the are these machines up it saves you on trying to show an inexperienced end user what is wrong.
There are also "off the shelf" tools that will do a lot of the hard work for you if you are just looking too hard into data results. I particularly liked Nagios when I was looking around but we needed more than it could easily show so I wrote our own monitoring system. Basically we also watch for "peculiarities" in the system, memory / cpu spikes, etc...
thanks everyone for the input, i was really looking for application-level monitoring not infrastructure monitoring, but it is good to know about both
the difference is:
infrastructure monitoring would be servers plus MS Exchange Server, Apache, IIS, and so forth
application monitoring would be user machines and the specific programs that they use to do their jobs, and/or servers plus the data-moving/backend applications that they run to keep the data flowing
sometimes it's hard to draw the line - an oversimplified definition might be "if your team wrote it, it's an application; if you bought it, it's infrastructure"
i think in practice it is best to monitor both
What you need to do is to break down the business process of the application and then have the software emit events at major business components. In addition, you'll need to create end to end synthetic transactions (eg. emulating end users clicking on a website). All that data would be fed into an monitoring tool. In the past, I've done JMX for applications of which flowed into Tivoli Monitoring's JMX Adapter and then I've done scripts that implement a "fake user" and then pipe in the results into Tivoli Monitoring's Script Adapter. Tivoli Monitoring takes the data and then creates application health and performance charts from that raw data.

Resources