I'm currently evaluating rethinkdb to use it with a time-series database for my charting needs.
I'm looking for existing tutorials and snippets if they even exist for rethinkdb. (I already know that mongodb, redis, tempodb and cassandra has similar resources).
You have some ressources on RethinkDB website:
General introduction to the query language -- http://www.rethinkdb.com/docs/introduction-to-reql/
Cookbook (probably what you are looking for) -- http://www.rethinkdb.com/docs/cookbook/javascript/
API docs (there are some useful examples there) -- http://www.rethinkdb.com/api/javascript/
There are some ffull apps too -- http://www.rethinkdb.com/docs/examples/
I am not aware of specific snippet for time series though.
But you just have to create an index on your time serie data, and use orderBy({index: "time"}) (something like that).
Related
I have searched for it in many blogs, but it seems all the blogs present a biased view. I myself am having a little bias towards Prometheus now, However, i did not find any good article which explains a use case of Prometheus for sensor data.
In my case, we manufacture IoT devices and we have a lot of data coming in. Till now we have been using MongoDB for everything, but now I want to switch to a time-series database, but I am really confused, whether I can choose Prometheus or not.
I am comfortable writing my own metric converter which can convert my sensor data into Prometheus metrics format (If something doesn't exist already)
Don't feel bd, lots of folks start out trying MongoDB for IoT applications because Mongo claims it's great for IoT. Only problem is, it's terrible for IoT. :-)
What you need is a true Time Series Database (TSDB). If you want to be able to query your data with SQL, try out QuestDB. It's the fastest open source TSDB out there and it's small.
I think i found it. Its Victoria Metrics. Haven't seen something as amazing as VM. First thing, it supports both Prometheus and Influx DB Write protocol(not just these, it supports some other time series database protocols also) and supports query language similar to prometheus. It has Vm Agent whose multiple instances we can run easily. It has cluster support and performance-wise, nothing like it.
I want to move up from using plain Rails log files for my web applications, so I can analyze page views and usage patterns. I've heard CouchDB is sometimes used for this.
On the other hand, I know of people who just feed the plain text log files into Hadoop and reduce them into summary stats that they then insert into MySQL.
What are the pros and cons of each of these two methods of logging and analyzing log files?
I can only speak for CouchDB, but the main benefits of using a document database to store things like these are;
They are schema less so that you can alter the schema of your log entries and still perform queries on the various editions of the schema you might have.
The map/reduce algorithm is a very powerful way to do grouping queries.
REST interface makes it technology agnostic in terms of consuming the data.
Scaling is horizontal and "infinite".
I'am Starting a new App (Something like an AngelList + hacker news but local), and got thinking if I should join the wave and get a CouchDB or stay with the old School and proved to work relational DBs, or try Object oriented DBs.
After studying a bit of how CouchDB works I don't see where It can be useful, looks harder to make connection between the "Documents" and also I cannot find how they can do BI with this Kind of DB since Data Minning would result in a lake of info in a complete mess.
I think I am missing the key points of the paradigm What am I missing ? I bet most people and Amazon are right and I am wrong, but where ?
Specially the benefits with Ruby On Rails Object Mapping ?
I would like to have some BI and cross some info after I get a lot of users... so...
Take a look at this post: Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase comparison
I am building out some reporting stuff for our website (a decent sized site that gets several million pageviews a day), and am wondering if there are any good free/open source data warehousing systems out there.
Specifically, I am looking for only something to store the data--I plan to build a custom front end/UI to it so that it shows the information we care about. However, I don't want to have to build a customized database for this, and while I'm pretty sure an SQL database would not work here, I'm not sure what to use exactly. Any pointers to helpful articles would also be appreciated.
Edit: I should mention--one DB I have looked at briefly was MongoDB. It seems like it might work, but their "Use Cases" specifically mention data warehousing as "Less Well Suited": http://www.mongodb.org/display/DOCS/Use+Cases . Also, it doesn't seem to be specifically targeted towards data warehousing.
http://www.hypertable.org/ might be what you are looking for is (and I'm going by your descriptions above here) something to store large amounts of logged data with normalization. i.e. a visitor log.
Hypertable is based on google's bigTable project.
see http://code.google.com/p/hypertable/wiki/PerformanceTestAOLQueryLog for benchmarks
you lose the relational capabilities of SQL based dbs but you gain a lot in performance. you could easily use hypertable to store millions of rows per hour (hard drive space withstanding).
hope that helps
I may not understand the problem correctly -- however, if you find some time to (re)visit Kimball’s “The Data Warehouse Toolkit”, you will find that all it takes for a basic DW is a plain-vanilla SQL database, in other words you could build a decent DW with MySQL using MyISAM for the storage engine. The question is only in desired granularity of information – what you want to keep and for how long. If your reports are mostly periodic, and you implement a report storage or cache, than you don’t need to store pre-calculated aggregations (no need for cubes). In other words, Kimball star with cached reporting can provide decent performance in many cases.
You could also look at the community edition of “Pentaho BI Suite” (open source) to get a quick start with ETL, analytics and reporting -- and experiment a bit to evaluate the performance before diving into custom development.
Although this may not be what you were expecting, it may be worth considering.
Pentaho Mondrian
Open source
Uses standard relational database
MDX (think pivot table)
ETL ( via Kettle )
I use this.
In addition to Mike's answer of hypertable, you may want to take a look at Apache's Hadoop project:
http://hadoop.apache.org/
They provide a number of tools which may be useful for your application, including HBase, another implementation of the BigTable concept. I'd imagine for reporting, you might find their mapreduce implementation useful as well.
It all depends on the data and how you plan to access it. MonetDB is a column-oriented database engine from the most revolutionary team on database technologies. They just got VLDB's 10-year best paper award. The DB is open source and there are plenty of reviews online praising them.
Perhaps you should have a look at TPC and see which of their test problem datasets match best your case and work from there.
Also consider the need for concurrency, it adds a big overhead for any kind of approach and sometimes is not really required. For example, you can pre-digest some summary or index data and only have that protected for high concurrency. Profiling your data queries is the following step.
About SQL, I don't like it either but I don't think it's smart ruling out an engine just because of the front-end language.
I see a similar problem and thinking of using plain MyISAM with http://www.jitterbit.com/ as data access layer. Jitterbit (or another free tool alike) seems very nice for this sort of transformations.
Hope this helps a bit.
A lot of people just use Mysql or Postgres :)
I am looking for a backend solution for an application written in Ruby on Rails or Merb to handle data with several billions of records. I have a feeling that I'm supposed to go with a distributed model and at the moment I looked at
HBase with Hadoop
Couchdb
Problems with HBase solution as I see it -- ruby support is not very strong, and Couchdb did not reach 1.0 version yet.
Do you have suggestion what would you use for such a big amount of data?
Data will require rather fast imports sometimes of 30-40Mb at once, but imports will come in chunks. So ~95% of the time data will be read only.
Depending on your actual data usage, MySQL or Postgres should be able to handle a couple of billion records on the right hardware. If you have a particular high volume of requests, both of these databases can be replicated across multiple servers (and read replication is quite easy to setup (compared to multiple master/write replication).
The big advantage of using a RDBMS with Rails or Merb is you gain access to all of the excellent tool support for accessing these types of databases.
My advice is to actually profile your data in a couple of these systems and take it from there.
There's a number of different solutions people have used. In my experience it really depends more on your usage patterns related to that data and not the sheer number of rows per table.
For example, "How many inserts/updates per second are occurring." Questions like these will play into your decision of what back-end database solution you'll choose.
Take Google for example: There didn't really exist a storage/search solution that satisfied their needs, so they created their own based on a Map/Reduce model.
A word of warning about HBase and other projects of that nature (don't know anything about CouchDB -- I think it's not really a db at all, just a key-value store):
Hbase is not tuned for speed; it's tuned for scalability. If response speed is at all an issue, run some proofs of concept before you commit to this path.
Hbase does not support joins. If you are using ActiveRecord and have more than one relation.. well you can see where this is going.
The Hive project, also built on top of Hadoop, does support joins; so does Pig (but it's not really sql). Point 1 applies to both. They are meant for heavy data processing tasks, not the type of processing you are likely to be doing with Rails.
If you want scalability for a web app, basically the only strategy that works is partitioning your data and doing as much as possible to ensure the partitions are isolated (don't need to talk to each other). This is a little tricky with Rails, as it assumes by default that there is one central database. There may have been improvements on that front since I looked at the issue about a year and a half ago. If you can partition your data, you can scale horizontally fairly wide. A single MySQL machine can deal with a few million rows (PostgreSQL can probably scale to a larger number of rows but might work a little slower).
Another strategy that works is having a master-slave set up, where all writes are done by the master, and reads are shared among the slaves (and possibly the master). Obviously this has to be done fairly carefully! Assuming a high read/write ratio, this can scale pretty well.
If your organization has deep pockets, check out what Vertica, AsterData, and Greenplum have to offer.
The backend will depend on the data and how the data will be accessed.
But for the ORM, I'd most likely use DataMapper and write a custom DataObjects adapter to get to whatever backend you choose.
I'm not sure what CouchDB not being at 1.0 has to do with it. I'd recommend doing some testing with it (just generate a billion random documents) and see if it'll hold up. I'd say it will, despite not having a specific version number.
CouchDB will help you a lot when it comes to partitioning/sharding your data and like, seems like it might fit with your project -- especially if your data format might change in the future (adding or removing fields) since CouchDB databases have no schema.
There are plenty of optimizations in CouchDB for read-heavy apps as well and, based on my experience with it, is where it really shines.