Suitable data storage backend for Erlang application when data doesn't fit memory - erlang

I'm researching options for organizing data storage for an Erlang application. The data it is supposed to use is basically a huge collection of binary blobs indexed by short string ids. Each blob is under 10 KB, but there are many of them. I'd expect them to total up to 200 GB, so obviously they cannot fit into memory. The typical operation on this data is reading a blob by its id, updating a blob by its id, or adding a new one. At any given time of day only a subset of ids is in use, so storage access performance might benefit from an in-memory cache. Speaking of performance, it is quite critical: the target is around 500 reads and 500 updates per second on commodity hardware (say, an EC2 VM).
Any suggestions on what to use here? As I understand it, dets is out of the question as it is limited to 2G (or was it 4G?). Mnesia is probably out of the question too; my impression is that it was mainly designed for cases where the data fits in memory. I'm considering trying EDTK's Berkeley DB driver for the task. Would it work in the above scenario? Does anybody have experience using it in production under similar conditions?

tcerl came out of facing the same size limit. I'm not using Erlang these days but it sounds like what you're looking for.

Have you looked at what CouchDB is doing? It might not be quite what you are after as a drop-in product, but there is a lot of Erlang code in there for storing data. There is also some talk of providing a native Erlang interface instead of the REST API.

Is there any reason why you can't just use a file system, treating filename as your string id and file contents as a binary blob? You can choose one (filesystem) that fits your performance requirements, and you should get caching basically for free, provided by your OS.
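To make the filesystem idea concrete, here is a minimal sketch in Python (not Erlang, purely for illustration). The class name `FileBlobStore`, the two-level directory fan-out, and the choice of SHA-1 for file names are all my own assumptions, not something from the question. It hashes the id so any string is a safe file name, and writes through a temporary file so a reader never sees a half-written blob.

```python
import hashlib
import os
import tempfile
from pathlib import Path


class FileBlobStore:
    """Hypothetical store: one file per blob, named after a hash of its id."""

    def __init__(self, root):
        self.root = Path(root)

    def _path(self, blob_id: str) -> Path:
        # Hash the id so arbitrary strings become safe file names, and use
        # the first hex characters as two directory levels so that no single
        # directory ends up holding millions of files.
        digest = hashlib.sha1(blob_id.encode("utf-8")).hexdigest()
        return self.root / digest[:2] / digest[2:4] / digest

    def put(self, blob_id: str, data: bytes) -> None:
        path = self._path(blob_id)
        path.parent.mkdir(parents=True, exist_ok=True)
        # Write to a temporary file in the same directory, then rename it
        # over the target, so readers never observe a partial blob.
        fd, tmp = tempfile.mkstemp(dir=path.parent)
        try:
            with os.fdopen(fd, "wb") as f:
                f.write(data)
            os.replace(tmp, path)
        except BaseException:
            os.unlink(tmp)
            raise

    def get(self, blob_id: str):
        # Returns the blob, or None if the id has never been stored.
        try:
            return self._path(blob_id).read_bytes()
        except FileNotFoundError:
            return None
```

Reads of recently used blobs are then served from the OS page cache, which is the "caching for free" part; whether that meets the 500 reads plus 500 updates per second target would still need measuring on the actual filesystem.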

Mnesia can store data on disk just fine. There's also dets (disk based term storage), which is roughly analogous to Berkeley DB. It's in the standard library.

I would recommend Apache CouchDB.
It's a great fit for Erlang, and from the sound of it (you mention ID-based blobs and don't mention any relational requirements) you're looking for a document-oriented database.
Since the interface is REST, you can very simply add a commodity HTTP cache in front of it if you need caching.
The documentation for CouchDB is of a very high quality.
It also has built-in Map-Reduce :)


Long distance OSM routing, how to work with all that data?

I am trying to build my own routing system which uses OSMSharp and will eventually have a full website front end deployed to Azure. However, I think I have a serious problem if I want to find a route over a long distance (e.g. NY -> CA). It looks like the routers in OSMSharp just accept a Stream of OSM data, but even the binary format (.osm.pbf) will be roughly 10 GB of data, which seems like a huge performance concern.
Either I need to hold that huge file in memory (and who knows how much Azure is going to charge me for that, or how well OSMSharp and the CLR are going to handle it), or it needs to be broken up and stored in a DB for on-the-fly loading.
Can anyone give any insight into how this is usually handled? Am I way out of my league for a personal project? Maybe I should support just one US State?
Directly processing a pbf file will be very inefficient because it just contains raw data. This file format is not optimized for running queries on it. You need to pre-process this file, calculate a routing graph, drop uninteresting data and then store it in some kind of database or a similar efficient format.
For really long distances, consider using contraction hierarchies. They are used by many popular OSM routers, such as GraphHopper and OSRM.
It also helps to take a look at the various different online routers and offline routers for OSM in order to get some ideas.
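As a rough illustration of "pre-process into a routing graph, then query it", here is a small Python sketch. The edge triples are assumed to come out of whatever pre-processing step you run on the pbf extract; the function names are made up for this example. The search is plain Dijkstra, not contraction hierarchies, so it only shows the shape of the data, not the speed-up technique.

```python
import heapq
from collections import defaultdict


def build_graph(edges):
    """Turn (from_node, to_node, cost) triples into an adjacency list."""
    graph = defaultdict(list)
    for a, b, cost in edges:
        graph[a].append((b, cost))
        graph[b].append((a, cost))  # assumption: every road is two-way
    return graph


def shortest_path(graph, source, target):
    """Plain Dijkstra; returns (total_cost, [nodes]) or None if unreachable."""
    best = {source: 0.0}
    previous = {}
    queue = [(0.0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == target:
            # Walk the predecessor links back to the source.
            path = [node]
            while node in previous:
                node = previous[node]
                path.append(node)
            return cost, path[::-1]
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry, a cheaper route was already found
        for neighbour, weight in graph.get(node, ()):
            new_cost = cost + weight
            if new_cost < best.get(neighbour, float("inf")):
                best[neighbour] = new_cost
                previous[neighbour] = node
                heapq.heappush(queue, (new_cost, neighbour))
    return None
```

On a continent-sized graph this explores far too many nodes per query, which is exactly the problem contraction hierarchies are meant to address.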

How are videos stored on web servers these days?

I'm building a web app that needs to store some resources, including but not limited to articles, pictures and videos. My question is: how are videos (mp4/ogg) stored on a web server? Just as bare files, or as binaries in a relational or NoSQL db?
The BLOB question almost always comes down to "don't BLOB data". There are very few cases where it makes more sense to write a database connector for your data than to just keep it on disk.
The general trend is to use an established service that employs good design patterns, such as Paperclip for Ruby, and tailor it to your needs.
Using an external storage service is also a good idea. For example, Amazon S3 will store all of your data for pennies per gigabyte, and they'll do an excellent job of it.
If you do decide to cook up your own server that handles data internally, might I recommend DigitalOcean? I have been very happy with the SSD servers I have set up there (which are super fast).
For video you will almost certainly need a webserver that is capable of streaming the file. I think Nginx has this feature.
I think you need to elaborate a bit on the use case you wish to implement for this app. Only then can you get a precise answer.
To help with that, here are some questions to ponder:
1- You said you wanted to store videos; what are your requirements beyond storage?
2- Do you wish, for example, to offer third-party users access to these videos and keyword search?
3- If yes, what kind of information is available about the videos? What is the expected average size of these files?
Many database engines offer the possibility of storing big binary files, but that comes with an impact on performance. That's why most storage systems that deal with big files store the files themselves on disk and keep any related metadata (file name, last updated, associated keywords, etc.) in the database. That makes for a scalable system.
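A minimal sketch of that split, in Python with SQLite standing in for whatever database you use. The class name `VideoLibrary`, the table layout, and the space-separated keywords column are assumptions made up for this example: the bytes go to a file on disk, and only the metadata and the file path go into the database.

```python
import sqlite3
from pathlib import Path


class VideoLibrary:
    """Hypothetical example: files on disk, metadata in a database."""

    def __init__(self, media_dir, db_path=":memory:"):
        self.media_dir = Path(media_dir)
        self.media_dir.mkdir(parents=True, exist_ok=True)
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS videos ("
            " id INTEGER PRIMARY KEY,"
            " title TEXT NOT NULL,"
            " keywords TEXT NOT NULL,"
            " size_bytes INTEGER NOT NULL,"
            " path TEXT NOT NULL)"
        )

    def add(self, title, keywords, data: bytes, extension="mp4"):
        # Insert the metadata first to obtain an id, then name the file after it.
        cur = self.db.execute(
            "INSERT INTO videos (title, keywords, size_bytes, path)"
            " VALUES (?, ?, ?, '')",
            (title, " ".join(keywords), len(data)),
        )
        video_id = cur.lastrowid
        path = self.media_dir / f"{video_id}.{extension}"
        path.write_bytes(data)
        self.db.execute(
            "UPDATE videos SET path = ? WHERE id = ?", (str(path), video_id)
        )
        self.db.commit()
        return video_id

    def search(self, keyword):
        # Only the small metadata table is scanned; video bytes are never read.
        rows = self.db.execute(
            "SELECT id, title, path FROM videos WHERE keywords LIKE ? ORDER BY id",
            (f"%{keyword}%",),
        )
        return rows.fetchall()
```

The web server can then stream the file straight from `path` without the database being involved in the transfer at all.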
I'll edit this answer if you find it useful and have further related questions.
Unlimited file storage is difficult to set up without AWS S3. S3 is a cheap and scalable solution, but it gets expensive without proper caching, so we use an Nginx S3 proxy, which works well.

NoSQL (BigTable...) and TimeSeries Data

I work in an organization that collects/stores a lot of time series data (time=value,time=value...). Today we use a historian to collect and process this data. The main advantage of using a historian was to compress the data and be more efficient in terms of data storage. However, with technologies such as Big Data, NoSQL it seems the effort to compress data (because of storage $$) is fading and the trend is to store "lots" of data.
Has anyone experimented with replacing a time-series historian with a BigData solution? I'm aware of OpenTSDB; has anyone used this in a non-IT role?
Would a NoSQL database (Cassandra...) be a good fit for time-series data? If so, what might an implementation look like?
Is the importance on just collecting or storing or is speed or ease of analysis essential?
For most reasonable data sizes standard SQL will suffice.
Above that, and especially for analysis, you would preferably want an in-memory, column-oriented database. At the highest end this means kdb, which is used by many major banks (and is expensive). However, you ask specifically about open source, so I'd consider MonetDB or MySQL in memory, depending on your data size and access requirements.
Cassandra is one of the more appropriate choices from the NoSQL bunch, and people have already tried using it.
I found I was spending a lot of time hacking around at the smallest data level to get things to work, and creating a lot of verbose code. It was then going to spread my data over multiple servers and try to make up for the inefficient storage by using multiple machines. When I evaluated it, its time support and functions for manipulating time were poor, and I couldn't easily do much more than pull out ranges. For those reasons I moved on from Cassandra.
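On the "what might an implementation look like" part of the question: a layout often used for time series in wide-column stores is to partition by (series, time bucket) and keep the samples inside each partition ordered by timestamp, so that a range query only touches a few partitions. The sketch below models that idea in plain Python, in memory; it is not Cassandra client code, and the class name, the one-day bucket size, and the use of naive UTC datetimes are all assumptions for illustration.

```python
import bisect
from collections import defaultdict
from datetime import datetime, timedelta


class BucketedSeries:
    """In-memory model of a (series, day) partitioned time-series table."""

    def __init__(self):
        # (series name, date) -> list of (timestamp, value), kept sorted
        self.partitions = defaultdict(list)

    def insert(self, series: str, ts: datetime, value: float) -> None:
        bisect.insort(self.partitions[(series, ts.date())], (ts, value))

    def query(self, series: str, start: datetime, end: datetime):
        """Return samples with start <= timestamp < end, in time order."""
        out = []
        day = start.date()
        while day <= end.date():
            rows = self.partitions.get((series, day), [])
            # A 1-tuple sorts before any (timestamp, value) pair with the
            # same timestamp, so these give the half-open range [start, end).
            lo = bisect.bisect_left(rows, (start,))
            hi = bisect.bisect_left(rows, (end,))
            out.extend(rows[lo:hi])
            day += timedelta(days=1)
        return out
```

This covers exactly the "pull out ranges" case mentioned above; anything beyond that (aggregation, resampling, joins across series) would have to be done in application code on top.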

Riak vs Amazon SimpleDB

I am looking for an eventually consistent key-value data store, and I have narrowed the choice down to Amazon SimpleDB and Riak. Can anyone share their experiences comparing the two?
Thanks in advance
Riak is a key-value store. The data values you store are opaque to the database, so you have no secondary indexes. But you do have the ability to run map-reduce if your data is JSON (or XML, I think). You can run map-reduce over all data, or just a subset ("seed keys"). It also has a "link walking" feature where documents can refer to other documents, which can be auto-fetched. They don't currently have an incremental map-reduce like CouchDB, which means any secondary queries (non-key) are quite expensive. They have plans to fix this.
SimpleDB is actually halfway between a docstore and a keystore: Each key->item supports multiple attributes, but it only goes one level deep. You can query on your key or your attribute values.
In production, Riak should be pretty "hands-off". If it's slow or getting full, just spin up a new server and tell it to join the cluster (unlike CouchDB or MongoDB, where you have to futz with multiple config files).
SimpleDB can take a pounding (tens of thousands of requests per second I've heard), but you are responsible for data scaling (i.e. don't violate their domain size limits or it will slow down).
I have used SimpleDB for about 6 months now. I am going into production with it. It works well, but I wish it were faster. I perform %like% queries for searching, and I can't seem to get it to dive through more than a few MB a second worth of values. But non-%like% searches are much faster. I get the feeling that it could be sped up if someone at Amazon wrote a few algorithms in good old C, rather than Erlang, but then again I am a C coder.
Also the first few queries on a recently opened Domain will take longer, as the system gets it all read in.
Overall it worked for me, but if I want to scale higher I will have to go with something else.
Also, I think that almost all my use of it will be free - there is a generous allocation of space, etc.
Make sure you plan on the fact that SimpleDB currently has no 'read only' access modes, etc. Any user that can use it can edit it.

Ruby On Rails/Merb as a frontend for a billions of records app

I am looking for a backend solution for an application written in Ruby on Rails or Merb to handle data with several billions of records. I have a feeling that I'm supposed to go with a distributed model and at the moment I looked at
HBase with Hadoop
The problems as I see them: Ruby support for HBase is not very strong, and CouchDB has not reached version 1.0 yet.
Do you have suggestion what would you use for such a big amount of data?
Data will sometimes require rather fast imports of 30-40 MB at once, but imports will come in chunks. So ~95% of the time the data will be read-only.
Depending on your actual data usage, MySQL or Postgres should be able to handle a couple of billion records on the right hardware. If you have a particularly high volume of requests, both of these databases can be replicated across multiple servers (read replication is quite easy to set up compared to multi-master/write replication).
The big advantage of using a RDBMS with Rails or Merb is you gain access to all of the excellent tool support for accessing these types of databases.
My advice is to actually profile your data in a couple of these systems and take it from there.
There's a number of different solutions people have used. In my experience it really depends more on your usage patterns related to that data and not the sheer number of rows per table.
For example, "How many inserts/updates per second are occurring." Questions like these will play into your decision of what back-end database solution you'll choose.
Take Google for example: There didn't really exist a storage/search solution that satisfied their needs, so they created their own based on a Map/Reduce model.
A word of warning about HBase and other projects of that nature (don't know anything about CouchDB -- I think it's not really a db at all, just a key-value store):
HBase is not tuned for speed; it's tuned for scalability. If response speed is at all an issue, run some proofs of concept before you commit to this path.
HBase does not support joins. If you are using ActiveRecord and have more than one relation... well, you can see where this is going.
The Hive project, also built on top of Hadoop, does support joins; so does Pig (but it's not really SQL). The speed caveat above applies to both. They are meant for heavy data processing tasks, not the type of processing you are likely to be doing with Rails.
If you want scalability for a web app, basically the only strategy that works is partitioning your data and doing as much as possible to ensure the partitions are isolated (don't need to talk to each other). This is a little tricky with Rails, as it assumes by default that there is one central database. There may have been improvements on that front since I looked at the issue about a year and a half ago. If you can partition your data, you can scale horizontally fairly wide. A single MySQL machine can deal with a few million rows (PostgreSQL can probably scale to a larger number of rows but might work a little slower).
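To illustrate what "partitioning your data" means in practice, here is a minimal Python sketch (not Rails code). The function `shard_for` and the class `ShardedStore` are hypothetical names, and the dictionaries stand in for separate database servers. The point is that the shard is computed from the key alone, so a request for one record never has to talk to more than one partition.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard index, stable across processes and restarts."""
    # A cryptographic digest is used only for its even, repeatable spread;
    # Python's built-in hash() is randomized per process, so it is unsuitable.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


class ShardedStore:
    """Hypothetical stand-in for N independent database servers."""

    def __init__(self, shard_count: int):
        self.shards = [dict() for _ in range(shard_count)]

    def put(self, key: str, record) -> None:
        self.shards[shard_for(key, len(self.shards))][key] = record

    def get(self, key: str):
        return self.shards[shard_for(key, len(self.shards))].get(key)
```

Note that with plain modulo hashing, changing the number of shards moves most keys to a different shard, so the shard count is something to choose generously up front.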
Another strategy that works is having a master-slave set up, where all writes are done by the master, and reads are shared among the slaves (and possibly the master). Obviously this has to be done fairly carefully! Assuming a high read/write ratio, this can scale pretty well.
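The master-slave split can be sketched as a small router that decides, per statement, which server to use. This is a Python illustration with a made-up class name, and strings stand in for real connections; it is not how any particular Rails plugin does it. Writes always go to the master, and reads rotate over the replicas.

```python
import itertools


class ReplicatedRouter:
    """Hypothetical router: writes to the master, reads round-robin."""

    def __init__(self, master, replicas):
        self.master = master
        # With no replicas configured, reads fall back to the master.
        self._replicas = itertools.cycle(replicas or [master])

    def connection_for(self, sql: str):
        words = sql.split()
        verb = words[0].upper() if words else ""
        # Only plain SELECTs are treated as reads; everything else
        # (INSERT, UPDATE, DELETE, DDL, empty input) goes to the master.
        if verb == "SELECT":
            return next(self._replicas)
        return self.master
```

The "fairly carefully" part is replication lag: a read sent to a replica right after a write may not see that write yet, so reads that must be fresh should be routed to the master as well.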
If your organization has deep pockets, check out what Vertica, AsterData, and Greenplum have to offer.
The backend will depend on the data and how the data will be accessed.
But for the ORM, I'd most likely use DataMapper and write a custom DataObjects adapter to get to whatever backend you choose.
I'm not sure what CouchDB not being at 1.0 has to do with it. I'd recommend doing some testing with it (just generate a billion random documents) and see if it'll hold up. I'd say it will, despite not having a specific version number.
CouchDB will help you a lot when it comes to partitioning/sharding your data and the like, and it seems like it might fit your project, especially if your data format might change in the future (adding or removing fields), since CouchDB databases have no schema.
There are plenty of optimizations in CouchDB for read-heavy apps as well and, based on my experience with it, that is where it really shines.
