Apache Kudu vs InfluxDB on time series data for fast analytics

How does Apache Kudu compare with InfluxDB for IoT sensor data that requires fast analytics (e.g. robotics)?
Kudu recently released v1.0. I have a few specific questions on how Kudu handles the following:
Sharding?
Data retention policies (keeping data for a specified number of data points, or time and aggregating/discarding data thereafter)?
Is there roll-up/aggregation functionality (e.g. converting 1-second interval data into 1-minute interval data)?
Is there support for continuous queries (i.e. materialised views on the data, e.g. a query over the last 60 seconds that is kept up to date on an ongoing basis)?
How is the data stored between disk and memory?
Can a regular time series be derived from an irregular one (converting irregular event data into regular time intervals)?
Also are there any other distinct strengths and/or weaknesses between Kudu and InfluxDB?

Kudu is a much lower-level datastore than InfluxDB. It's more like a distributed file system that provides a few database-like features than a full-fledged database. It currently relies on a query engine such as Impala to query data stored in Kudu.
Kudu is also fairly young. It would likely be possible to build a time series database with Kudu as the distributed store underneath it, but currently the closest implementation to that would be this proof-of-concept project.
As for the answers to your questions:
1) Kudu stores data in tablets and offers two ways of partitioning data: range partitioning and hash-based partitioning (there's a small sketch of both right after this list).
2) No, although if the data is structured with range partitioning, dropping a tablet should be an efficient operation (similar to how InfluxDB drops whole shards when deleting data).
3) Query engines that work with Kudu, such as Impala or Spark, are able to do this.
4) Impala does have some support for views.
5) Data is stored in a columnar format similar to Parquet. Kudu's big selling point, however, is that it allows the columnar data to be mutable, which is something that is very difficult with current Parquet files.
6) While I'm sure you could get Spark or Impala to do this, it's not a built-in feature (a rough sketch of that step follows at the end of this answer).
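For 1), here is roughly what defining both partitioning schemes could look like through the kudu-python client. This is a sketch only: the master address, table and column names are made up, and the client calls follow the kudu-python examples, so the exact API may differ between versions.

```python
import kudu
from kudu.client import Partitioning

# Hypothetical Kudu master address; adjust for your cluster.
client = kudu.connect(host='kudu-master.example.com', port=7051)

# A simple sensor-reading schema keyed by (sensor_id, ts).
builder = kudu.schema_builder()
builder.add_column('sensor_id').type(kudu.string).nullable(False)
builder.add_column('ts').type(kudu.unixtime_micros).nullable(False)
builder.add_column('value').type(kudu.double)
builder.set_primary_keys(['sensor_id', 'ts'])
schema = builder.build()

# Hash-partition on sensor_id to spread writes across tablets, and
# range-partition on ts so old time ranges can be dropped cheaply (see 2).
partitioning = Partitioning().add_hash_partitions(column_names=['sensor_id'],
                                                  num_buckets=4)
partitioning.set_range_partition_columns(['ts'])

client.create_table('sensor_readings', schema, partitioning)
```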
Kudu is still a new project and it is not really designed to compete with InfluxDB, but rather to provide a highly scalable and highly performant storage layer for a service like InfluxDB. The ability to append data to a Parquet-like data structure is really exciting, though, as it could eliminate the need for lambda architectures.
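To make 3) and 6) concrete: the bucketing happens in whatever processing layer sits on top of Kudu, not in Kudu itself. A minimal sketch with pandas, using made-up readings, of rolling irregular second-level events up into a regular 1-minute series:

```python
import pandas as pd

# Made-up, irregularly spaced sensor readings, as they might come back
# from a query engine (Impala, Spark, ...) sitting on top of Kudu.
readings = pd.DataFrame(
    {"value": [0.91, 0.94, 1.02, 0.99]},
    index=pd.to_datetime([
        "2016-10-01 12:00:00.2",
        "2016-10-01 12:00:00.9",
        "2016-10-01 12:00:31.5",
        "2016-10-01 12:01:02.0",
    ]),
)

# Roll the raw points up into regular 1-minute buckets: this is both the
# aggregation from question 3 and the regularisation from question 6.
per_minute = readings["value"].resample("1min").mean()
print(per_minute)
```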

Related

Time-series charts for large amounts of data

I have a couple of thousand time-series covering several years at second-granularity. I'd like to store the data in a suitable DB (i.e. one that scales well and can retain all data at original granularity, e.g. Druid, openTSDB or similar). The goal is to be able to view the data in a browser (e.g. by entering a time frame and ideally having zoom/pan functionality).
To limit the number of datapoints that my webserver needs to handle I'd like to have functionality which seems to be working out of the box for Graphite/Grafana (which, if I understand correctly, is not a good choice for long-term retention of data):
a time-series chart in Grafana will limit data by querying aggregations from Graphite (e.g. returning the mean value over 30m buckets when zoomed out, while showing all data when zoomed in).
Now the questions:
are there existing visualization tools for time-series DBs that provide this functionality?
are there existing charting frameworks that allow me to customize the data queried per zoom level?
Feedback on the choice of DB is also welcome (open-source preferred).
You can absolutely store multiple years of data in Graphite. The issue you'll have is that Graphite selects the aggregation level to read from by locating the highest-resolution archive that covers the requested interval, so you can't automatically take advantage of aggregation to get both efficient long-term graphs and the ability to drill down to the raw data for a time period in the past.
One way to get around this problem is to use carbon-aggregator to generate multiple output series with different intervals from your input series, so you have my.metric.raw, my.metric.10min, my.metric.1hr, etc. You'd combine that with a carbon storage schema that defines a matching interval and retention for each of the series, so each one is stored at its own resolution (1-second for .raw, 10-minute for .10min, and so on).
If you do that, then in Grafana you can use a template variable to choose which interval you want to graph from: you'd define a variable $aggregation with options raw, 10min, etc. and write your queries like my.metric.$aggregation.
That will give you the performance that you need with the ability to drill into the raw data.
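If you'd rather have the server pick the interval automatically per zoom level (your second question) instead of exposing a Grafana template variable, the selection logic is tiny. A sketch in Python, with made-up series suffixes matching the carbon-aggregator scheme above:

```python
from datetime import timedelta

# Pre-aggregated series suffixes and their resolutions, matching the
# hypothetical my.metric.raw / .10min / .1hr scheme described above.
RESOLUTIONS = [
    ("raw", timedelta(seconds=1)),
    ("10min", timedelta(minutes=10)),
    ("1hr", timedelta(hours=1)),
]

MAX_POINTS = 5000  # roughly how many points the browser chart should render


def pick_series(metric, window):
    """Return the finest series that keeps the chart under MAX_POINTS.

    `window` is the time span (a timedelta) of the current zoom level.
    """
    for suffix, resolution in RESOLUTIONS:
        if window / resolution <= MAX_POINTS:
            return "{}.{}".format(metric, suffix)
    # Very wide zoom levels fall back to the coarsest archive.
    return "{}.{}".format(metric, RESOLUTIONS[-1][0])


print(pick_series("my.metric", timedelta(hours=1)))   # my.metric.raw
print(pick_series("my.metric", timedelta(days=365)))  # my.metric.1hr
```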
That said, we generally find that while everyone thinks they want lots of historical data at high granularity, it's almost never actually used and is typically an unneeded expense. That may not be the case for you, but think carefully about the actual use-cases when designing the system.

NoSQL (BigTable...) and TimeSeries Data

I work in an organization that collects/stores a lot of time series data (time=value, time=value, ...). Today we use a historian to collect and process this data. The main advantage of using a historian was to compress the data and be more efficient in terms of data storage. However, with technologies such as Big Data and NoSQL, it seems the effort to compress data (because of storage $$) is fading and the trend is to store "lots" of data.
Has anyone experimented with replacing a time-series historian with a Big Data solution? I'm aware of OpenTSDB; has anyone used it in a non-IT role?
Would a NoSQL database (Cassandra...) be a good fit for time-series data? If so, what might an implementation look like?
Is the emphasis just on collecting and storing, or is speed or ease of analysis essential?
For most reasonable data sizes standard SQL will suffice.
Above that, and especially for analysis, you would preferably want an in-memory, column-oriented database. At the highest end this means kdb by kx.com, which is used by all major banks ($$, expensive). However, since you ask specifically about open source, I'd consider MonetDB or MySQL in-memory, depending on your data size and access requirements.
Cassandra is one of the more appropriate choices from the NoSQL bunch, and people have tried using it already:
http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra
http://synfin.net/sock_stream/technology/advanced-time-series-metric-data-with-cassandra
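The model those articles converge on is a wide partition per source and time bucket, with the timestamp as the clustering column, so a time slice is a single-partition range read. A rough sketch with the DataStax Python driver; the keyspace, table and tag names are invented:

```python
from datetime import datetime
from cassandra.cluster import Cluster

# Assumes a local node and an existing "metrics" keyspace.
session = Cluster(["127.0.0.1"]).connect("metrics")

# Partition by (tag, day) so partitions stay bounded; cluster by timestamp
# so a time slice is a contiguous read within one partition.
session.execute("""
    CREATE TABLE IF NOT EXISTS readings (
        tag   text,
        day   text,
        ts    timestamp,
        value double,
        PRIMARY KEY ((tag, day), ts)
    ) WITH CLUSTERING ORDER BY (ts DESC)
""")

insert = session.prepare(
    "INSERT INTO readings (tag, day, ts, value) VALUES (?, ?, ?, ?)")
session.execute(insert, ("pump-7/pressure", "2016-10-01",
                         datetime(2016, 10, 1, 5, 30), 4.2))

# "Give me pump-7's pressure between 00:00 and 06:00 on that day" hits
# exactly one partition.
rows = session.execute(
    "SELECT ts, value FROM readings "
    "WHERE tag = %s AND day = %s AND ts >= %s AND ts < %s",
    ("pump-7/pressure", "2016-10-01",
     datetime(2016, 10, 1, 0, 0), datetime(2016, 10, 1, 6, 0)))
for row in rows:
    print(row.ts, row.value)
```

The flip side is that you end up hand-rolling this kind of plumbing for every access pattern, which is exactly the pain described next.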
I found I was spending a lot of time hacking around at the lowest data level to get things to work and writing a lot of verbose code, which was then going to spread my data over multiple servers and try to make up for the inefficient storage by using multiple machines. When I evaluated it, its time support and functions for manipulating time were poor, and I couldn't do much more than just pull out ranges easily. For those reasons I moved on from Cassandra.

How to do some reporting with Rails (with a dedicated DB)

In a Rails app, I am wondering how to build a reporting solution. I heard that I should use a separate database for reporting purposes, but knowing that I will need to store a huge amount of data, I have a lot of questions:
What kind of DBMS should I choose?
When should I store data in the reporting database?
Should the database schema of the production db and reporting db be identical?
I am storing basic data (information about users, about the result of operations) and I will need, for example, to run a report to know how many users failed an operation during the previous month.
I know that it is a vague question, but any hint would be highly appreciated.
Thanks!
Work Backwards
Start from what the end-users want for reporting or how they want to/should visualize the data. Once you have some concepts in mind, start working backwards to how to achieve those goals. Starting with the assumption that it should be a replicated copy in an RDBMS excludes several reasonable possibilities.
Making a Real-time Interface
If users are looking to aggregate values (counts, averages, etc.) on the fly (per web request), it would be worthwhile looking into replicating the master down to a reporting database, if the SQL performance is acceptable (and stays acceptable if you were to double the input data). SQL engines usually do a great job at aggregation and scale pretty far. This would also give you the capability to join data results together and return complex results as the users request them.
Just remember, replication isn't easy or without its own set of problems.
This'll start to show signs of weakness in the hundreds of millions of rows range with normalized data, in my experience. At some point, inserts fight with selects on the same table enough that both become exceptionally slow (remember, replication is still a stream of inserts). Alternatively, indexes become so large that storage I/O is required for rekeying, so overall table performance diminishes.
Batching
On the other hand, if reporting falls under the scheme of sending out standardized reports with little interaction, I wouldn't necessarily recommend backing it with an RDBMS. In this case, results are combined, aggregated, joined, etc. once. Paying the overhead of RDBMS indexing and storage bloat isn't worth it.
Batch engines like Hadoop will scale horizontally (many smaller machines instead of a few huge machines) so processing larger volumes of data is economical.
Batch to RDBMS or K/V Store
This is also a useful path if a lot of computation is needed to make the records more meaningful to a reporting engine. Alternatively, records could be denormalized before storing them in the reporting storage engine. The denormalized or simple results would then be shipped to a key/value store or RDBMS to make reporting easier and achieve higher performance, at the cost of latency, compute, and possibly storage.
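For the question's own example (how many users failed an operation last month), the batch path boils down to: aggregate once in a scheduled job, then ship the denormalized result to the reporting store. A toy sketch of that shape in Python, with in-memory SQLite standing in for both databases; the table and column names are invented, and in a Rails app this would more naturally live in a rake task or background job:

```python
import sqlite3

prod = sqlite3.connect(":memory:")       # stands in for the production DB
reporting = sqlite3.connect(":memory:")  # stands in for the reporting DB

# Seed a fake production table so the sketch is self-contained.
prod.execute(
    "CREATE TABLE operations (user_id INTEGER, status TEXT, created_at TEXT)")
prod.executemany(
    "INSERT INTO operations VALUES (?, ?, ?)",
    [(1, "failed", "2016-09-03 10:00:00"),
     (2, "ok",     "2016-09-04 11:00:00"),
     (1, "failed", "2016-09-20 12:00:00"),
     (3, "failed", "2016-09-21 09:00:00")])

reporting.execute("""
    CREATE TABLE IF NOT EXISTS monthly_operation_failures (
        month        TEXT,
        user_id      INTEGER,
        failed_count INTEGER,
        PRIMARY KEY (month, user_id)
    )
""")

# Aggregate once, in the batch job, instead of at report-request time.
rows = prod.execute("""
    SELECT strftime('%Y-%m', created_at) AS month,
           user_id,
           COUNT(*) AS failed_count
    FROM operations
    WHERE status = 'failed'
    GROUP BY month, user_id
""").fetchall()

reporting.executemany(
    "INSERT OR REPLACE INTO monthly_operation_failures VALUES (?, ?, ?)", rows)
reporting.commit()

# "How many users failed an operation in 2016-09?" is now a cheap lookup.
print(reporting.execute(
    "SELECT COUNT(*) FROM monthly_operation_failures WHERE month = '2016-09'"
).fetchone()[0])
```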
Personal Advice
Don't over-design it to start with. The decisions you make on the initial implementation will probably all change at some point. However, design it with the current and near-term problems in mind. Also, benchmarks done by others are not terribly useful if your usage model isn't exactly the same as theirs; benchmark your usage model.
I would recommend using some pre-built reporting service rather than writing everything out manually if you need a large set of reports.
You might want to look at Tableau (http://www.tableausoftware.com/) and others that are available.
Database: yes, a separate one seems safer; plus, reporting is generally on old and consolidated data, and your live data might be too large to perform analysis on.
Database type: you have to choose based on the reporting services used, though I think Mongo is not supported by any of the reporting services; MySQL is preferred.
If there are only one or two reports, you could just build them in Rails.

BigData Vs Neo4J

I've been looking for a triple store for my project. In this project I want to store my data according to certain ontologies (OWL).
From my research I ended up with two technologies, Neo4J and BigData, that seem to fit well in this case.
I want to know if either of these two is more appropriate to use with RDF, RDFS, OWL and SPARQL queries.
Neo4j can be used to store data in entity-relationship-entity form. In the case of big data, you should not upload your whole dataset into Neo4j, because it will become very heavy and processing will be very slow. You should use a complementary database for storing the actual data, and store IDs and some parameters in Neo4j so that graph traversal can be used for a sort of graph analytics. Graph analytics is where Neo4j's power lies; otherwise you have to use a graph engine, e.g. GraphX (Spark).
You might want to try out the SPARQL plugin for Neo4j; see here for an HTTP-based test, and this Berlin Dataset Test for embedded usage.
Neo4J is a specific technology, while big data is more of a generic term. I think what you're asking about is OLAP vs. OLTP. As data gets bigger, there are differences between use cases for RDF-style graph databases, which are often used for OLAP (On-line Analytical Processing) style analytics. In short, OLAP is designed for analytics that look across a big data set, while OLTP is more aimed at INSERTs/DELETEs (on potentially big data).
OLAP-based traversals tend to process the entire graph, while OLTP based traversals tend to process smaller data sets by starting with one or a handful of vertices and traversing from there.
For example, let’s say you wanted to calculate the average age of friends of one particular user. Great use case for OLTP, since the query data set is small. However, if you wanted to calculate the average age of everyone on the database, OLAP is the preferred technology.
OLAP is optimal for deep analysis of a lot of data, while OLTP is better suited for fast-running queries and a lot of INSERTs. If you're trying to achieve an SLA where the analytics must complete within a certain timeframe, consider the type of analytics and which one is better suited. Or maybe you need both.
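To make the OLTP/OLAP split concrete, here are the two average-age queries from above expressed against Neo4j through its Python driver. The connection details and the Person/FRIEND schema are made up for illustration:

```python
from neo4j import GraphDatabase

# Placeholder connection details and a made-up schema: Person nodes with
# an `age` property, connected by FRIEND relationships.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

with driver.session() as session:
    # OLTP-style: start from one vertex and traverse outwards. The touched
    # data set stays small, so this can answer interactively.
    oltp = session.run(
        "MATCH (u:Person {name: $name})-[:FRIEND]-(f:Person) "
        "RETURN avg(f.age) AS avg_age",
        name="alice",
    ).single()

    # OLAP-style: the same aggregate over every node. This touches the whole
    # graph and is the kind of workload better handed to a batch/graph engine.
    olap = session.run("MATCH (p:Person) RETURN avg(p.age) AS avg_age").single()

    print(oltp["avg_age"], olap["avg_age"])

driver.close()
```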

Implementing large scale log file analytics

Can anyone point me to a reference or provide a high-level overview of how companies like Facebook, Yahoo, Google, et al. perform the large-scale (e.g. multi-TB range) log analysis that they do for operations and especially web analytics?
Focusing on web analytics in particular, I'm interested in two closely-related aspects: query performance and data storage.
I know that the general approach is to use map reduce to distribute each query over a cluster (e.g. using Hadoop). However, what's the most efficient storage format to use? This is log data, so we can assume each event has a time stamp, and that in general the data is structured and not sparse. Most web analytics queries involve analyzing slices of data between two arbitrary timestamps and retrieving aggregate statistics or anomalies in that data.
Would a column-oriented DB like BigTable (or HBase) be an efficient way to store, and more importantly, query such data? Does the fact that you're selecting a subset of rows (based on timestamp) work against the basic premise of this type of storage? Would it be better to store it as unstructured data, e.g. a reverse index?
Unfortunately there is no one size fits all answer.
I am currently using Cascading, Hadoop, S3, and Aster Data to process hundreds of gigs a day through a staged pipeline inside of AWS.
Aster Data is used for the queries and reporting since it provides a SQL interface to the massive data sets cleaned and parsed by Cascading processes on Hadoop. Using the Cascading JDBC interfaces, loading Aster Data is quite a trivial process.
Keep in mind that tools like HBase and Hypertable are key/value stores, so they don't do ad-hoc queries and joins; you need a MapReduce/Cascading app to perform the joins out of band, which is a very useful pattern.
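On the "subset of rows by timestamp" worry from the question: the usual trick in HBase-style stores is to encode the source and a sortable time bucket into the row key, so a time slice becomes one contiguous range scan rather than a full-table filter. A rough sketch with the happybase client; the Thrift host, table and column family names are made up, and the table is assumed to already exist:

```python
import happybase

connection = happybase.Connection("hbase-thrift.example.com")
table = connection.table("weblogs")  # assumed: one column family named "d"

# Row keys are "<host>|<sortable timestamp>", so all events for one host
# are stored contiguously in time order.
table.put(b"web01|20161001T120001", {b"d:status": b"200", b"d:bytes": b"5120"})
table.put(b"web01|20161001T120007", {b"d:status": b"500", b"d:bytes": b"312"})

# "All events for web01 between 12:00 and 13:00" is a single key-range scan.
errors = 0
for key, data in table.scan(row_start=b"web01|20161001T120000",
                            row_stop=b"web01|20161001T130000"):
    if data[b"d:status"].startswith(b"5"):
        errors += 1
print(errors)
```

Note the timestamp is deliberately not the key prefix (that would hot-spot writes onto one region), and anything that slices across all hosts still has to fan out or go through a MapReduce/Cascading job, which is the pattern described above.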
In full disclosure, I am a developer on the Cascading project.
http://www.asterdata.com/
http://www.cascading.org/
The book Hadoop: The Definitive Guide (O'Reilly) has a chapter that discusses how Hadoop is used at two real-world companies.
http://my.safaribooksonline.com/9780596521974/ch14
Have a look at the paper Interpreting the Data: Parallel Analysis with Sawzall by Google. This is a paper on the tool Google uses for log analysis.
