Suitability of Amazon SimpleDB for large temporal data sets eminating from thousands of separate devices - time-series

I'm trying to establish whether Amazon SimpleDB is suitable for a subset of data I have.
I have thousands of deployed autonomous sensor devices recording data.
Each sensor device essentially reports a couple of values four times an hour each day, over months and years. I need to keep all of this data for historic statistical analysis. Generally, it is write once, read many times. Server-based applications run regularly to query the data to infer other information.
The rows of data today, in SQL look something like this:
(id, device_id, utc_timestamp, value1, value2)
Our existing MySQL solution is not going to scale up much further, with tens of millions of rows. We query things like "tell me the sum of all the value1 yesterday" or "show me the average of value2 in the last 8 hours". We do this in SQL but can happily change to doing it in code. SimpleDBs "eventual consistency" appears fine for our puposes.
I'm reading up all I can and am about to start experimenting with our AWS account, but it's not clear to me how the various SimpleDB concepts (items, domains, attributes, etc.) relate to our domain.
Is SimpleDB an appropriate vehicle for this and what would a generalised approach be?
PS: We mostly use Python, but this shouldn't matter when considering this at a high level. I'm aware of the boto library at this point.
Edit:
Continuing to search on solutions for this I did come across Stack Overflow question What is the best open source solution for storing time series data? which was useful.

Just following up on this one many months later...
I did actually have the opportunity to speak to Amazon directly about this last summer, and eventually got access to the beta programme for what eventually became DynamoDB, but was not able to talk about it.
I would recommend it for this sort of scenario, where you need a primary key and what might be described as a secondary index/range - eg timestamps. This allows you much greater confidence in search, ie "show me all the data for device X between monday and friday"
We haven't actually moved to this yet for various reasons but do still plan to.
http://aws.amazon.com/dynamodb/

I my opinon, Amazon SimpleDb as well as Microsoft Azure Tables is a fine solution as long as your queries are quite simple. As soon as you trying to do stuff that's absolutely a non-issue on relational databases like aggregates you begin to run into trouble. So if you are going to do some heavy reporting stuff it might get messy.

It sounds like your problem may be best handled by a round-robin database (RRD). An RRD stores time variable data in such a way so that the file size never grows beyond its initial setting. It's extremely cool and very useful for generating graphs and time series information.

I agree with Oliver Weichhold that a cloud based database solution will handle the usecase you described. You can spread your data across multiple SimpleDB domains (like partitions) and stored your data in a way that most of your queries can be executed from a single domain without having to traverse the entire database. Defining your partition strategy will be key to the success of moving towards a cloud based DB. Data set partitioning is talked about here

Related

How to do some reporting with Rails (with a dedicated DB)

In a Rails app, I am wondering how to build a reporting solution. I heard that I should use a separated database for reporting purposes but knowing that I will need to store a huge amount of data, I have a lot of questions :
What kind of DBMS should I choose?
When should I store data in the reporting database?
Should the database schema of the production db and reporting db be identical?
I am storing basic data (information about users, about result of operations) and I will need for example to run a report to know how many user failed an operation during the previous month.
In now that it is a vague question, but any hint would be highly appreciated.
Thanks!
Work Backwards
Start from what the end-users want for reporting or how they want to/should visualize data. Once you have some concepts in mind, then start working backwards to how to achieve those goals. Starting with the assumption that it should be a replicated copy in an RBDMS excludes several reasonable possibilities.
Making a Real-time Interface
If users are looking to aggregate values (counts, averages, etc.) on the fly (per web request), it would be worthwhile looking into replicating the master down to a reporting database if the SQL performance is acceptable (and stays acceptable if you were to double the input data). SQL engines usually do a great job aggregation and scale pretty far. This would also give you the capability to join data results together and return complex results as the users request it.
Just remember, replication isn't easy or without it's own set of problems.
This'll start to show signs of weakness in the hundreds of millions of rows range with normalized data, in my experience. At some point, inserts fight with selects on the same table enough that both become exceptionally slow (remember, replication is still a stream of inserts). Alternatively, indexes become so large that storage I/O is required for rekeying, so overall table performance diminishes.
Batching
On the other hand, if reporting falls under the scheme of sending standardized reports out with little interaction, I wouldn't necessarily recommend backing to an RBDMS. In this case, results are combined, aggregated, joined, etc. once. Paying the overhead of RBDMS indexing and storage bloat isn't worth it.
Batch engines like Hadoop will scale horizontally (many smaller machines instead of a few huge machines) so processing larger volumes of data is economical.
Batch to RBDMS or K/V Store
This is also a useful path if a lot of computation is needed to make the records more meaningful to a reporting engine. Alternatively, records could be denormalized before storing them in the reporting storage engine. The denormalized or simple results would then be shipped to a key/value store or RBDMS to make reporting easier and achieve higher performance at the cost of latency, compute, and possibly storage.
Personal Advice
Don't over-design it to start with. The decisions you make on the initial implementation will probably all change at some point. However, design it with the current and near-term problems in mind. Also, benchmarks done by others are not terribly useful if your usage model isn't exactly the same as theirs; benchmark your usage model.
I would recommend to to use some pre-build reporting services than to manually write out if you need a large set of reports.
You might want to look at Tableau http://www.tableausoftware.com/ and other available.
Database .. Yes it should be a separate seems safer , plus reporting is generally for old and consolidated data.. you live data might be too large to perform analysis on.
Database type -- > have to choose based on the reporting services used , though I think mongo is not supported by any of the reporting services , mysql is preferred.
If there are only one or two reports you could just build them on rails

How to structure data model for an advertising tool in MongoDB

I am building an ad analytics tool which assumes a data structure like this:
Account
Campaign
Keyword
Conversion
I have a lot of information about individual conversion events, which can be tied back to the cost data of each campaign, keyword, ad group, etc. In SQL, you could consider each property a sort of foreign key (text-based) to the campaign, keyword or ad in a particular account, but that's inefficient and slow. It doesn't sound like a great idea to make campaign_id, keyword_id, etc. fields and populate them either, because I want the analytics to be available in near-real time.
What would be a good way to model this with MongoDB?
Assuming a very high volume of conversion events (millions per day or more), a storage engine alone (MongoDB or anything else) won't help you. What you need is the ability to run map-reduce jobs on the data in order to calculate the analytics. You can scale-out your cluster as necessary to achieve near-real time performance.
The free/open-source options that I can suggest are Hadoop (and probably HBase and Hive) or Riak.
There are other options - I'm only suggesting these two because I've personal experience with them in a high scale production environment. We're currently using Hadoop to power an analytics system processing billions of events per day.
If you're not into rolling your own and are able and willing to pay (a lot!) then look at GreenPlum and Vertica.
I'll be happy to share more information on potential solution designs - but I'll need more data on what you're trying to achieve - scale, use cases etc.
I'm not sure that MongoDB is really the right choice for something like this, since MongoDB is really more about storing less well (or more complex) documents rather than hierarchical records like this one. However, if you are going the MongoDB route, then you can just use the account, campaign and keyword tags directly. There is no substantive benefit to abstracting these into meaningless keys in MongoDB. You can index these fields directly in MongoDB.
I don't know what your volumes are going to be and what other factors are affecting your technology choices. However, assuming that your accounts, campaigns and keywords don't change that frequently, you could do this with plain old RDBMS (SQL or Oracle etc.) using lookup tables for these determinants where the foreign keys are meaningless integers. If you're doing live analytics you could adopt a star schema and keep all of the numeric FKs on the base fact table (Conversion) so that you aren't joining a chain of four tables to get the whole picture, instead you'd be doing three one-hop joins. This would allow you to summarize at any level with only a single join.

Achieving better DB performance

I have a website backed by a relational database comprised of the usual e-commerce related tables (Order, OrderItem, ShoppingCart, CreditCard, Payment, Customer, Address, etc...).
The stored proc. which returns order history is painfully slow due to the amount of data + the numerous joins which must occur, and depending on the search parameters it sometimes times out (despite the indexing that is in place).
The DB schema is pretty well normalized and I believe I can achieve better performance by moving toward something like a data warehouse. DW projects aren't trivial and then there's the issue of keeping the data in sync so I was wondering if anyone knows of a shortcut. Perhaps an out-of the box solution that will create the DW schema and keep the data in sync (via triggers perhaps). I've heard of Lucene but it seems geared more toward text searches and document management. Does anyone have other suggestions?
How big is your database?
There's not really any shortcuts, but dimensional modelling is really NOT that hard. You first determine a grain and then need to identify your facts and the dimensions associated with the facts. Then you divide the dimensions into tables which allow you to have the dimensions only grow slowly over time. The choice of dimensions is completely practical and based on the data behavior.
I recommend you have a look at Kimball's books.
For a database of a few GB, it's certainly possible to update a reporting database from scratch several times a day (no history, just repopulating from a 3NF for a different model of the same data). There are certain realtime data warehousing techniques which just apply changes continuously throughout the day.
So while DW projects might not be trivial, the denormalization techniques are very approachable and usable without necessarily building a complete time-invariant data warehouse.
Materialized Views are what you might use in Oracle. They give you the "keeping the data in sync" feature you are looking for combined with fast access of aggregate data. Since you didn't mention any specifics (platform, server specs, number of rows, number of hits/second, etc) of your platform, I can't really help much more than that.
Of course, we are assuming you've already checked that all your SQL is written properly and optimally, that your indexing is correct, that you are properly using caching in all levels of your app, that your DB server has enough RAM, fast hard drives, etc.
Also, have you considered denormalizing your schema, just enough to serve up your most common queries faster? that's better than implementing an entire data warehouse, which might not even be what you want anyway. Usually a data warehouse is for reporting purposes, not for serving interactive apps.

Riak vs Amazon SimpleDB

I am looking for an eventually consistent key value data store and i decided to choose between Amazon SimpleDB and Riak ,so can anyone share their valuable experiences comparing both .
Thanks in advance
Fedrick
Riak is a key-value store. The data values you store is opaque to the database, so you have no secondary indexes. But you do have the ability to run map-reduce if your data is JSON (or XML, I think). You can run map-reduce over all data, or just a subset ("seed keys"). It also has a "link walking" feature where documents can refer to other documents, which can be auto-fetched. They don't currently have an incremental map-reduce like CouchDB, which means any secondary queries (non-key) are quite expensive. They have plans to fix this.
SimpleDB is actually halfway between a docstore and a keystore: Each key->item supports multiple attributes, but it only goes one level deep. You can query on your key or your attribute values.
In production, Riak should be pretty "hands-off". If it's slow or getting full, just spin up a new server and tell it to join the cluster. (unlike CouchDB or MongoDB where you have to futz with multiple config files).
SimpleDB can take a pounding (tens of thousands of requests per second I've heard), but you are responsible for data scaling (i.e. don't violate their domain size limits or it will slow down).
I have used SimpleDB for about 6 months now. I am going into production with it. It works well, but I wish it were faster. I perform %like% queries for searching, and I can't seem to get it to dive through more than a few MB a second worth of values. But non %like% searches are much faster. I get the feeling that it could be sped up if someone at Amazon wrote a few algorithms in good old c, rather than Erlang, but then again I am a c coder.
Also the first few queries on a recently opened Domain will take longer, as the system gets it all read in.
Overall it worked for me, but if I want to scale higher I will have to go with something else.
Also, I think that almost all my use of it will be free - there is a generous allocation of space, etc.
Make sure you plan on the fact that SimpleDB currently has no 'read only' access modes, etc. Any user that can use it can edit it.
--Tom

Ruby On Rails/Merb as a frontend for a billions of records app

I am looking for a backend solution for an application written in Ruby on Rails or Merb to handle data with several billions of records. I have a feeling that I'm supposed to go with a distributed model and at the moment I looked at
HBase with Hadoop
Couchdb
Problems with HBase solution as I see it -- ruby support is not very strong, and Couchdb did not reach 1.0 version yet.
Do you have suggestion what would you use for such a big amount of data?
Data will require rather fast imports sometimes of 30-40Mb at once, but imports will come in chunks. So ~95% of the time data will be read only.
Depending on your actual data usage, MySQL or Postgres should be able to handle a couple of billion records on the right hardware. If you have a particular high volume of requests, both of these databases can be replicated across multiple servers (and read replication is quite easy to setup (compared to multiple master/write replication).
The big advantage of using a RDBMS with Rails or Merb is you gain access to all of the excellent tool support for accessing these types of databases.
My advice is to actually profile your data in a couple of these systems and take it from there.
There's a number of different solutions people have used. In my experience it really depends more on your usage patterns related to that data and not the sheer number of rows per table.
For example, "How many inserts/updates per second are occurring." Questions like these will play into your decision of what back-end database solution you'll choose.
Take Google for example: There didn't really exist a storage/search solution that satisfied their needs, so they created their own based on a Map/Reduce model.
A word of warning about HBase and other projects of that nature (don't know anything about CouchDB -- I think it's not really a db at all, just a key-value store):
Hbase is not tuned for speed; it's tuned for scalability. If response speed is at all an issue, run some proofs of concept before you commit to this path.
Hbase does not support joins. If you are using ActiveRecord and have more than one relation.. well you can see where this is going.
The Hive project, also built on top of Hadoop, does support joins; so does Pig (but it's not really sql). Point 1 applies to both. They are meant for heavy data processing tasks, not the type of processing you are likely to be doing with Rails.
If you want scalability for a web app, basically the only strategy that works is partitioning your data and doing as much as possible to ensure the partitions are isolated (don't need to talk to each other). This is a little tricky with Rails, as it assumes by default that there is one central database. There may have been improvements on that front since I looked at the issue about a year and a half ago. If you can partition your data, you can scale horizontally fairly wide. A single MySQL machine can deal with a few million rows (PostgreSQL can probably scale to a larger number of rows but might work a little slower).
Another strategy that works is having a master-slave set up, where all writes are done by the master, and reads are shared among the slaves (and possibly the master). Obviously this has to be done fairly carefully! Assuming a high read/write ratio, this can scale pretty well.
If your organization has deep pockets, check out what Vertica, AsterData, and Greenplum have to offer.
The backend will depend on the data and how the data will be accessed.
But for the ORM, I'd most likely use DataMapper and write a custom DataObjects adapter to get to whatever backend you choose.
I'm not sure what CouchDB not being at 1.0 has to do with it. I'd recommend doing some testing with it (just generate a billion random documents) and see if it'll hold up. I'd say it will, despite not having a specific version number.
CouchDB will help you a lot when it comes to partitioning/sharding your data and like, seems like it might fit with your project -- especially if your data format might change in the future (adding or removing fields) since CouchDB databases have no schema.
There are plenty of optimizations in CouchDB for read-heavy apps as well and, based on my experience with it, is where it really shines.

Resources