Ruby On Rails/Merb as a frontend for a billions of records app - ruby-on-rails

I am looking for a backend solution for an application written in Ruby on Rails or Merb to handle data with several billions of records. I have a feeling that I'm supposed to go with a distributed model and at the moment I looked at
HBase with Hadoop
Couchdb
Problems with HBase solution as I see it -- ruby support is not very strong, and Couchdb did not reach 1.0 version yet.
Do you have suggestion what would you use for such a big amount of data?
Data will require rather fast imports sometimes of 30-40Mb at once, but imports will come in chunks. So ~95% of the time data will be read only.

Depending on your actual data usage, MySQL or Postgres should be able to handle a couple of billion records on the right hardware. If you have a particular high volume of requests, both of these databases can be replicated across multiple servers (and read replication is quite easy to setup (compared to multiple master/write replication).
The big advantage of using a RDBMS with Rails or Merb is you gain access to all of the excellent tool support for accessing these types of databases.
My advice is to actually profile your data in a couple of these systems and take it from there.

There's a number of different solutions people have used. In my experience it really depends more on your usage patterns related to that data and not the sheer number of rows per table.
For example, "How many inserts/updates per second are occurring." Questions like these will play into your decision of what back-end database solution you'll choose.
Take Google for example: There didn't really exist a storage/search solution that satisfied their needs, so they created their own based on a Map/Reduce model.

A word of warning about HBase and other projects of that nature (don't know anything about CouchDB -- I think it's not really a db at all, just a key-value store):
Hbase is not tuned for speed; it's tuned for scalability. If response speed is at all an issue, run some proofs of concept before you commit to this path.
Hbase does not support joins. If you are using ActiveRecord and have more than one relation.. well you can see where this is going.
The Hive project, also built on top of Hadoop, does support joins; so does Pig (but it's not really sql). Point 1 applies to both. They are meant for heavy data processing tasks, not the type of processing you are likely to be doing with Rails.
If you want scalability for a web app, basically the only strategy that works is partitioning your data and doing as much as possible to ensure the partitions are isolated (don't need to talk to each other). This is a little tricky with Rails, as it assumes by default that there is one central database. There may have been improvements on that front since I looked at the issue about a year and a half ago. If you can partition your data, you can scale horizontally fairly wide. A single MySQL machine can deal with a few million rows (PostgreSQL can probably scale to a larger number of rows but might work a little slower).
Another strategy that works is having a master-slave set up, where all writes are done by the master, and reads are shared among the slaves (and possibly the master). Obviously this has to be done fairly carefully! Assuming a high read/write ratio, this can scale pretty well.
If your organization has deep pockets, check out what Vertica, AsterData, and Greenplum have to offer.

The backend will depend on the data and how the data will be accessed.
But for the ORM, I'd most likely use DataMapper and write a custom DataObjects adapter to get to whatever backend you choose.

I'm not sure what CouchDB not being at 1.0 has to do with it. I'd recommend doing some testing with it (just generate a billion random documents) and see if it'll hold up. I'd say it will, despite not having a specific version number.
CouchDB will help you a lot when it comes to partitioning/sharding your data and like, seems like it might fit with your project -- especially if your data format might change in the future (adding or removing fields) since CouchDB databases have no schema.
There are plenty of optimizations in CouchDB for read-heavy apps as well and, based on my experience with it, is where it really shines.

Related

Cache solution for a news feed, based on objective information?

I need some suggestions of what works well for caching an updatable news feed.
Please, no "Fanboy" answers either please - not looking for subjective opinions of what the "best" system, just seeking some suggestions of technologies that will fit the requirements below. So please, share what you have used in the real world, even if you prefer some other solution.
I have a rails based news feed (Neo4j database), and while performance is good, I would like to cache it so that servers don't get bogged down serving live feeds.
REQUIREMENTS:
EASY FRAGMENT UPDATES: I'd like to easily update parts of a user's newsfeed the
cache based upon specific triggers, for example, when a user edits
their status update - I don't want to regenerate the user's entire
news feed in the cache, rather I just want to update that one
"fragment", or section if you will, of the particular user's feed. And I don't want to jump through hoops to try and do so.
DELETION: If someone deletes an activity, I just want to remove that activity
from their news feed before the system eventually refreshes the entire feed for that user.
EASY RETRIEVAL: I'd like to retrieve the cache in such a way that the rails
controller/models can easily read them and hand them off to views without
any modification of the views.
PERSISTENCE: If I need to reboot the cache, it should load up the
cache from disk. Which means it should save cached entries to disk.
SPEED: Given that it must be able to be update fragments of cached
news feeds, there is going to be a performance hit of some sort. But
I need speed..
What cache technologies provide such capabilities? Will Redis, MongoDB, Memcached fit these requirements? What other options are there? (CouchDB, Tokyo File cabinet, etc)..
In the spirit Stack Overflow, I'm not asking for subjective opinions on what you like better and why, I'm just asking for possible candidate systems that you may have actually used in production to accomplish caching and updating a cached news feed (or anything similar).
Since it is mainly an opinion-based topic, this answer will be subjective. But I will try anyway to remain factual.
The first point to notice is your requirements tend to be mutually exclusive. As we said in France, you want the butter, the money for the butter, and the wife of the farmer (ok, this is probably a lousy translation).
For example, to support easy fragment updates and proper deletion, you will need some kind of data structures in the cache. I have zero knowledge about Rails, but I guess it will have impact on the data access patterns, and the definitions of controllers/models. In other words, it will add complexity to data retrieval. You need speed, but at the same time, you also require persistency, and also non-trivial data access patterns. Well, you cannot get everything at the same time, you will have to make choices, and prioritize these requirements.
My second point is a cache is only useful when there is a significant difference in term of performance between the cache and the underlying storage engine. Since you already use a NoSQL engine which is rather efficient (Neo4j), you need to consider only engines which are truly designed for raw performance (i.e. low-latency stores): memcached, redis, couchbase, aerospike, to name well-established open-source products. If you feel a bit more adventurous, you can also consider other projects like tarantool or hyperdex.
There are a number of commercial products as well, but I'm not sure they provide a Ruby client (TIBCO ActiveSpaces, Gigaspaces, Red-Hat Infinispan, etc ...)
Other NoSQL engines (MongoDB, Cassandra, CouchDB, etc ...) have other interesting properties, but they will not beat these solutions at raw performance for a mixed r/w workload. Here, I'm only talking about raw performance (i.e. low latency at high throughput), not scalability.
Actually, memcached can be excluded because it does not support persistency. I would say you can probably implement what you want with Redis, Couchbase or Aerospike, but Aerospike 3 does not seem to have yet an officially supported Ruby client.
Supporting multiple data accesses paths (i.e. consistent indexing data structure) will be easier with Redis and Aerospike than Couchbase. High-availability will be easier with Couchbase or Aerospike than with Redis. Implementing a cache behavior will be easier with Redis and Couchbase than with Aerospike.
Some general advices:
make sure you really have a performance or a scalability issue with Neo4j before adding the complexity of an extra layer. Complexity is like toothpaste: once it is out of the tube, you cannot put it back.
data access patterns should be listed at design time, and must be backed by corresponding data structures in the chosen engine.
the hardware footprint must be considered as well. If you have only a couple of boxes, pick a lightweight solution like Redis.
with persistency, you need to consider also HA. What happens if the caching layer is lost? Actually, I would say that for a cache, HA may be more important than persistency.
Finally, you need also to define the exact cache semantic you want (update behavior, invalidation behavior, cache miss management, TTL policy if any, etc ...). The 3 NoSQL engines I have listed provide some tools to help the implementation of the various strategies, but none of them will support an off-the-shelf strategy. This will require some coding to implement it.

Neo4j sharding aspect

I was looking on the scalability of Neo4j, and read a document written by David Montag in January 2013.
Concerning the sharding aspect, he said the 1st release of 2014 would come with a first solution.
Does anyone know if it was done or its status if not?
Thanks!
Disclosure: I'm working as VP Product for Neo Technology, the sponsor of the Neo4j open source graph database.
Now that we've just released Neo4j 2.0 (actually 2.0.1 today!) we are embarking on a 2.1 release that is mostly oriented around (even more) performance & scalability. This will increase the upper limits of the graph to an effectively unlimited number of entities, and improve various other things.
Let me set some context first, and then answer your question.
As you probably saw from the paper, Neo4j's current horizontal-scaling architecture allows read scaling, with writes all going to master and fanning out. This gets you effectively unlimited read scaling, and into the tens of thousands of writes per second.
Practically speaking, there are production Neo4j customers (including Snap Interactive and Glassdoor) with around a billion people in their social graph... in all cases behind an active and heavily-hit web site, being handled by comparatively quite modest Neo4j clusters (no more than 5 instances). So that's one key feature: the Neo4j of today an incredible computational density, and so we regularly see fairly small clusters handling a substantially large production workload... with very fast response times.
More on the current architecture can be found here: www.neotechnology.com/neo4j-scales-for-the-enterprise/
And a list of customers (which includes companies like Wal-Mart and eBay) can be found here: neotechnology.com/customers/ One of the world's largest parcel delivery carriers uses Neo4j to route all of their packages, in real time, with peaks of 3000 routing operations per second, and zero downtime. (This arguably is the world's largest and most mission-critical use of a graph database and of a NOSQL database; though unfortunately I can't say who it is.)
So in one sense the tl;dr is that if you're not yet as big as Wal-Mart or eBay, then you're probably ok. That oversimplifies it only a bit. There is the 1% of cases where you have sustained transactional write workloads into the 100s of thousands per second. However even in those cases it's often not the right thing to load all of that data into the real-time graph. We usually advise people to do some aggregation or filtering, and bring only the more important things into the graph. Intuit gave a good talk about this. They filter a billion B2B transactions into a much smaller number of aggregate monthly transaction relationships with aggregated counts and currency amounts by direction.
Enter sharding... Sharding has gained a lot of popularity these days. This is largely thanks to the other three categories of NOSQL, where joins are an anti-pattern. Most queries involve reading or writing just a single piece of discrete data. Just as joining is an anti-pattern for key-value stores and document databases, sharding is an anti-pattern for graph databases. What I mean by that is... the very best performance will occur when all of your data is available in memory on a single instance, because hopping back and forth all over the network whenever you're reading and writing will slow things significantly down, unless you've been really really smart about how you distribute your data... and even then. Our approach has been twofold:
Do as many smart things as possible in order to support extremely high read & write volumes without having to resort to sharding. This gets you the best and most predictable latency and efficiency. In other words: if we can be good enough to support your requirement without sharding, that will always be the best approach. The link above describes some of these tricks, including the deployment pattern that lets you shard your data in memory without having to shard it on disk (a trick we call cache-sharding). There are other tricks along similar lines, and more coming down the pike...
Add a secondary architecture pattern into Neo4j that does support sharding. Why do this if sharding is best avoided? As more people find more uses for graphs, and data volumes continue to increase, we think eventually it will be an important and inevitable thing. This would allow you to run all of Facebook for example, in one Neo4j cluster (a pretty huge one)... not just the social part of the graph, which we can handle today. We've already done a lot of work on this, and have an architecture developed that we believe balances the many considerations. This is a multi-year effort, and while we could very easily release a version of Neo4j that shards naively (that would no doubt be really popular), we probably won't do that. We want to do it right, which amounts to rocket science.
TL;DR With 2018 is days away neo4j still does not support sharding as it is typically considered.
Details Neo4j still requires all data to fit on a single node. The node contents can be replicated within a cluster - but actual sharding is not part of the picture.
When neo4j talks of sharding they are referring to caching portions of the database in memory: different slices are cached on different replicated nodes. That differs from say mysql sharding in which each node contains only a portion of the total data.
Here is a summary of their "take" on scalability: their product term is "High Availability" https://neo4j.com/blog/neo4j-scalability-infographic/
. Note that High Availability should not be the same as Scalability: so they do not actually support the latter in the traditional understanding of the term.

Is using Redis right for this situation?

I'm planning on creating an app (Rails) that will have a very large collection of users - it'll start small but I would like it to be able to handle a million or more.
I want to build a system that will be able to handle 2500+ requests per second. Each request will require a write (for logging purposes) as well as a read from the enormous list of users, indexed by username (I was recommended to use MongoDB for this purpose) and the results of the read will be sent back to the user.
I am a little unclear about how mongo will handle both reads and writes, so I had this idea of using Mongo to sort of permanently store the records and then load them up into Redis every time the server starts up for even faster access so that Mongo doesn't have to deal with anything but the writes.
Does that sound reasonable or is that a huge misuse of Mongo and Redis?
The speed of delivery is of utmost importance.
It's possible, actually, to create the entire application using just Redis. What you'd want to do is research design patterns for Redis. A good place to start is this PDF by Karl Seguin called The Little Redis book.
For example, use Redis's hashes to save all users' information.
Further, if planned well you don't need to have another persistent storage such as Mongo or MySQL in conjunction with Redis as Redis is persistent itself. You just need to pick a good sharding/replication strategy that'll allow you to be flexible enough for future systemic changes.
I think the stack that you are asking about is certainly a very good solution and one that's pretty battle tested for high performance sites. Trello (created by same people who created this very site) uses a similar architecture as well as craigslist.
Trello Tech Stack Writeup
Craigslist also uses this
Redis is fast and has a great pub/sub mechanism in addition to normal invalidation type features that makes it a superior cache to most. Mongo is a db i'm very familiar with and think it's great for all sorts of data store purposes as well as being a solid enterprise db that scales well, protects data integrity and checks off a bunch of marks in the SLA enterprise jargon checklist
I think it's a great combination but really the question should be is do I even need this. For your load I think Mongo itself could handle this quite nicely (and give data integrity) and also if you really want you can run it on server with enough memory to make sure your dataset fits inside memory (denormalizing and good schema design is key). Foursquare runs exclusively on Mongo in memory.
So think if this is necessary but remember simple always wins. Redis/Mongo is super powerful but it will also take a lot more work to master two data stores and administer them.
Thanks,
Prasith
As others have mentioned, using a single service makes more sense to me. There's reason to keep the logging data in memory though. I'd try using something simple, a logfile if possible, or Scribe or Flume if you need to distribute the writes.

How to do some reporting with Rails (with a dedicated DB)

In a Rails app, I am wondering how to build a reporting solution. I heard that I should use a separated database for reporting purposes but knowing that I will need to store a huge amount of data, I have a lot of questions :
What kind of DBMS should I choose?
When should I store data in the reporting database?
Should the database schema of the production db and reporting db be identical?
I am storing basic data (information about users, about result of operations) and I will need for example to run a report to know how many user failed an operation during the previous month.
In now that it is a vague question, but any hint would be highly appreciated.
Thanks!
Work Backwards
Start from what the end-users want for reporting or how they want to/should visualize data. Once you have some concepts in mind, then start working backwards to how to achieve those goals. Starting with the assumption that it should be a replicated copy in an RBDMS excludes several reasonable possibilities.
Making a Real-time Interface
If users are looking to aggregate values (counts, averages, etc.) on the fly (per web request), it would be worthwhile looking into replicating the master down to a reporting database if the SQL performance is acceptable (and stays acceptable if you were to double the input data). SQL engines usually do a great job aggregation and scale pretty far. This would also give you the capability to join data results together and return complex results as the users request it.
Just remember, replication isn't easy or without it's own set of problems.
This'll start to show signs of weakness in the hundreds of millions of rows range with normalized data, in my experience. At some point, inserts fight with selects on the same table enough that both become exceptionally slow (remember, replication is still a stream of inserts). Alternatively, indexes become so large that storage I/O is required for rekeying, so overall table performance diminishes.
Batching
On the other hand, if reporting falls under the scheme of sending standardized reports out with little interaction, I wouldn't necessarily recommend backing to an RBDMS. In this case, results are combined, aggregated, joined, etc. once. Paying the overhead of RBDMS indexing and storage bloat isn't worth it.
Batch engines like Hadoop will scale horizontally (many smaller machines instead of a few huge machines) so processing larger volumes of data is economical.
Batch to RBDMS or K/V Store
This is also a useful path if a lot of computation is needed to make the records more meaningful to a reporting engine. Alternatively, records could be denormalized before storing them in the reporting storage engine. The denormalized or simple results would then be shipped to a key/value store or RBDMS to make reporting easier and achieve higher performance at the cost of latency, compute, and possibly storage.
Personal Advice
Don't over-design it to start with. The decisions you make on the initial implementation will probably all change at some point. However, design it with the current and near-term problems in mind. Also, benchmarks done by others are not terribly useful if your usage model isn't exactly the same as theirs; benchmark your usage model.
I would recommend to to use some pre-build reporting services than to manually write out if you need a large set of reports.
You might want to look at Tableau http://www.tableausoftware.com/ and other available.
Database .. Yes it should be a separate seems safer , plus reporting is generally for old and consolidated data.. you live data might be too large to perform analysis on.
Database type -- > have to choose based on the reporting services used , though I think mongo is not supported by any of the reporting services , mysql is preferred.
If there are only one or two reports you could just build them on rails

Riak vs Amazon SimpleDB

I am looking for an eventually consistent key value data store and i decided to choose between Amazon SimpleDB and Riak ,so can anyone share their valuable experiences comparing both .
Thanks in advance
Fedrick
Riak is a key-value store. The data values you store is opaque to the database, so you have no secondary indexes. But you do have the ability to run map-reduce if your data is JSON (or XML, I think). You can run map-reduce over all data, or just a subset ("seed keys"). It also has a "link walking" feature where documents can refer to other documents, which can be auto-fetched. They don't currently have an incremental map-reduce like CouchDB, which means any secondary queries (non-key) are quite expensive. They have plans to fix this.
SimpleDB is actually halfway between a docstore and a keystore: Each key->item supports multiple attributes, but it only goes one level deep. You can query on your key or your attribute values.
In production, Riak should be pretty "hands-off". If it's slow or getting full, just spin up a new server and tell it to join the cluster. (unlike CouchDB or MongoDB where you have to futz with multiple config files).
SimpleDB can take a pounding (tens of thousands of requests per second I've heard), but you are responsible for data scaling (i.e. don't violate their domain size limits or it will slow down).
I have used SimpleDB for about 6 months now. I am going into production with it. It works well, but I wish it were faster. I perform %like% queries for searching, and I can't seem to get it to dive through more than a few MB a second worth of values. But non %like% searches are much faster. I get the feeling that it could be sped up if someone at Amazon wrote a few algorithms in good old c, rather than Erlang, but then again I am a c coder.
Also the first few queries on a recently opened Domain will take longer, as the system gets it all read in.
Overall it worked for me, but if I want to scale higher I will have to go with something else.
Also, I think that almost all my use of it will be free - there is a generous allocation of space, etc.
Make sure you plan on the fact that SimpleDB currently has no 'read only' access modes, etc. Any user that can use it can edit it.
--Tom

Resources