Analyzing sensor data stored in Cassandra and drawing graphs - ruby-on-rails

I'm collecting data from different sensors and writing it to a Cassandra database.
The sensor ID acts as the partition key and the timestamp of the sensor data as the clustering column. Additionally, the sensor's value is stored.
Each sensor collects roughly 30,000 to 60,000 values a day.
The simplest thing I want to do is draw a graph showing this data. This is not a problem for a few hours, but when showing a week or an even longer range, all the data has to be loaded into the backend (a Rails application) for further processing. This isn't really fast with my test dataset, and I don't think it will be faster in production.
So my question is how to speed this up. I thought about pre-processing the data directly in the database, but it seems that Cassandra isn't able to do such things.
For a graph with a width of 1000px there is no point in drawing tens of thousands of points, so it would make sense to fetch only relevant, pre-aggregated data from the database.
For example, when showing the data for a whole day in a graph with a width of 1000px, it would be enough to take 1000 average values (one average per 86.4 seconds: 60*60*24 / 1000).
Is this a good approach? Or are there other techniques to speed this up? How would I handle this in the database? Create a second table and store some average values? But the resolution of the graph may change...
Another approach would be drawing mean values by day, week, month and so on. Maybe a second table could do a good job for this!

Cassandra is all about letting you write and read your data quickly. Think of it as just a data store. It can't (really) do any processing on that data.
If you want to do operations on it, then you are going to need to put the data into something else. Storm is quite popular for building computation clusters that process data from Cassandra, but without knowing the scale you need to operate at, it may be overkill.
Another option which might suit you is to aggregate data on the way in, or perhaps in nightly jobs. This is how OLAP is often done with other technologies. This can work if you know in advance what you need to aggregate. You could build your sets into hourly, daily, whatever, then pull a smaller amount of data into Rails for graphing (and possibly aggregate it even further to exactly meet the desired graph requirements).
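A minimal sketch of the bucket averaging described in the question, written in Python for brevity (the same logic ports directly to Ruby). The function name `downsample_mean` and the `(unix_seconds, value)` input shape are assumptions for illustration, not anything Cassandra or Rails provides; the rows would come from whatever query you already run for the time range.

```python
from collections import defaultdict


def downsample_mean(points, start, end, buckets):
    """Average (timestamp, value) pairs into equal-width time buckets.

    points  -- iterable of (unix_seconds, value) pairs (assumed input shape)
    start   -- inclusive start of the range, in unix seconds
    end     -- exclusive end of the range, in unix seconds
    buckets -- number of buckets, e.g. the graph width in pixels
    Returns a list of (bucket_start, mean) for the non-empty buckets.
    """
    width = (end - start) / buckets  # one day at 1000 buckets -> 86.4 s
    sums = defaultdict(float)
    counts = defaultdict(int)
    for ts, value in points:
        if not (start <= ts < end):
            continue  # ignore points outside the requested range
        # Clamp to the last bucket to guard against float rounding.
        idx = min(int((ts - start) / width), buckets - 1)
        sums[idx] += value
        counts[idx] += 1
    return [(start + i * width, sums[i] / counts[i]) for i in sorted(counts)]
```

The same function could run in a nightly job to fill a coarser table (hourly or daily averages), and again at request time to reduce that table to the exact pixel width of the graph.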

For the purposes of storing, aggregating, and graphing your sensor data, you might consider RRDtool which does basically everything you describe. Its main limitation is it does not store raw data, but instead stores aggregated, interpolated values. (If you need the raw data, you can still use Cassandra for that.)

AndySavage is onto something here when it comes to precomputing aggregate values. This does require you to understand in advance the sorts of metrics you'd like to see from the sensor values generally.
You correctly identify the limitation of a graph in informing the viewer. Questions you need to ask really fall into areas such as:
When you aggregate are you interested in the mean, median, spread of the values?
What's the biggest aggregation that you're interested in?
What's the goal of the data visualisation - is it really necessary to be looking at a whole year of data?
Are outliers the important part of the dataset?
Each of these questions will lead you down a different path with visualisation and the application itself too.
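To make the difference between these choices concrete, here is a small Python sketch that summarizes each time bucket in several ways at once. The function name `summarize_buckets` and the fixed-width bucketing are assumptions for illustration; the point is only that a mean can hide an outlier that min/max or the median would reveal.

```python
from statistics import mean, median


def summarize_buckets(points, start, width):
    """Group (timestamp, value) pairs into fixed-width buckets and
    return min, max, mean and median for every non-empty bucket."""
    buckets = {}
    for ts, value in points:
        buckets.setdefault(int((ts - start) // width), []).append(value)
    return {
        idx: {
            "min": min(values),
            "max": max(values),
            "mean": mean(values),
            "median": median(values),
        }
        for idx, values in sorted(buckets.items())
    }
```

If outliers matter, storing min and max alongside the mean lets the graph draw an envelope instead of a single smoothed line.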
Once you know what you want to do, an ETL process harnessing some form of analytical processing will be needed. This is where the Hadoop world would be worth investigating.
Regarding your decision to use Cassandra as your timeseries historian, how is that working for you? I'm looking at technical solutions for a similar requirement at the moment and it's one of the options on the table.

Related

Storing any number series data in a time-series database

I would like to use the time-series database InfluxDB to store data points indexed by another number instead of time, so that I can take advantage of all its features for a series of data points against this number.
For example, I have a rocket doing multiple launches, with several sensors recording temperature, air pressure, fuel level, etc. I want to graph these data points against elevation, not time.
I realise I could store elevation itself against time, then for, say, a temperature reading use its time to work out the elevation and project the results, but that step would lose the performance characteristics of simply querying the data points indexed by elevation. Also, third-party tools that use the time-series database, e.g. Grafana, won't be able to simply get these data points against elevation instead of time to graph them without me putting something in between to marry the data up.
One idea I had was to use a fake time where meters = seconds and store against this. I would then need to make that a composite with something else to differentiate rocket launches, e.g. increment the year by 1 starting at year 0, so that I don't see every launch starting at the same elevation and can separate the "number series" from each other. I guess I would have that problem anyway, and the proper way to do that would be through tags.
What makes you believe that this approach would be more efficient than storing the elevation jointly with your other sensor data? Fetching data is pretty cheap, so the performance gain might be very small compared to the added complexity of your keys. Not to mention that you would still need to make the time part of your elevation-timestamp; otherwise you will end up with duplicate pseudo-timestamps and therefore incomplete data, as most time-series databases do not allow multiple values at the same timestamp for a given series.
I would encourage you to also have a look at other time-series databases which include elevation as part of their standard data model. Check out Warp 10 for that matter (standard disclaimer: I am the co-founder of SenX, maker of Warp 10).
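As a rough illustration of storing the elevation jointly with the other readings, the sketch below projects one field against elevation on the client side. It is plain Python and does not use any InfluxDB API: the row shape (dicts with `time`, `launch`, `elevation` and sensor fields) and the function name `series_by_elevation` are assumptions standing in for whatever your query returns, with `launch` playing the role of a tag.

```python
def series_by_elevation(rows, launch, field):
    """Return (elevation, value) pairs for one launch, sorted by elevation.

    rows   -- iterable of dicts, one per timestamped measurement (assumed shape)
    launch -- tag value identifying the launch to plot
    field  -- name of the sensor field to plot, e.g. "temperature"
    """
    pairs = [
        (row["elevation"], row[field])
        for row in rows
        if row["launch"] == launch and field in row
    ]
    return sorted(pairs)


rows = [
    {"time": 0, "launch": "L1", "elevation": 0, "temperature": 21.0},
    {"time": 1, "launch": "L1", "elevation": 120, "temperature": 19.5},
    {"time": 2, "launch": "L1", "elevation": 90, "temperature": 20.1},
    {"time": 0, "launch": "L2", "elevation": 0, "temperature": 18.0},
]
profile = series_by_elevation(rows, "L1", "temperature")
```

Because every row keeps its real timestamp, two readings at the same elevation do not overwrite each other, which is the problem the fake-time scheme runs into.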

RapidMiner - Time Series Segmentation

I am fairly new to RapidMiner. I have a historical financial data set (with attributes Date, Open, Close, High, Low, Volume Traded) from Yahoo Finance, and I am trying to find a way to segment it as in the image below:
I am also planning on performing this segmentation on more than one of such Data Sets and then comparing between each segmentation (i.e. Segment 1 for Data Set A against Segment 1 for Data Set B), so I would preferably require an equal number of segments each.
I am aware that certain extensions are available within the RapidMiner Marketplace, however I do not believe that any of them have what I am looking for. Your assistance is much appreciated.
Edit: I am currently trying to replicate the Voting-Based Outlier Mining for Multiple Time Series (V-BOMM) with multiple data sets. So far, I am able to perform the operation by recording and comparing common dates against each other.
However, I would like to enhance the process to compare Segments rather than simply dates. I have gone through the existing functionalities of RapidMiner, and thus far I don't believe any fit my requirements.
I have also considered Dynamic Time Warping, but I can't seem to find an available functionality in RapidMiner.
Ultimate question: Can someone guide me to functionalities that can help replicate the segmentation in the attached image such that the segments can be compared between Historic Data Sets in RapidMiner? Also, can someone guide me on how to implement Dynamic Time Warping using RapidMiner?
I would use the new version of the Time Series extension, using the windowing features to segment the time series into whatever parts you want. There is a nice explanation of the new tools in the blog section of the community.
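Outside RapidMiner, the requirement of an equal number of segments per data set can be sketched in a few lines of Python. This is not the extension's windowing operator, only an illustration of the idea under the assumption that segments are contiguous and as equal in length as possible; the function name `split_into_segments` is made up for this example.

```python
def split_into_segments(values, n_segments):
    """Split a sequence into n_segments contiguous parts whose lengths
    differ by at most one element."""
    if n_segments <= 0 or n_segments > len(values):
        raise ValueError("n_segments must be between 1 and len(values)")
    base, extra = divmod(len(values), n_segments)
    segments, start = [], 0
    for i in range(n_segments):
        # The first `extra` segments take one additional element.
        end = start + base + (1 if i < extra else 0)
        segments.append(values[start:end])
        start = end
    return segments
```

Applying the same `n_segments` to data set A and data set B yields segment lists of equal length, so segment 1 of A can be compared with segment 1 of B even when the two series have different numbers of rows.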

Neo4j: Cypher Performance testing/Benchmarking

I created a Neo4j 3 database that includes some test data, and also a small application that sends HTTP Cypher requests to Neo4j. These requests are always of the same type. Actually, it's a query template that differs only in some attributes. I am interested in the performance of these statements.
I know that I can use PROFILE to get some information in the browser. But I want to execute a set of statements, e.g. 10 example queries, several times and calculate the average performance. Is there an easy way or a tool to do this, or do I have to write e.g. a Python script that collects these values? It does not have to be a big application; I just want to see some general performance metrics.
I don't think there is an out-of-the-box tool for benchmarking Neo4j yet. So your best option is to implement your own solution - but you have to be careful if you want to get results that are (to some degree) representative:
Check the docs on performance.
Give the Neo4j JVM sufficient time to warmup. This means that you'll want to run a warmup phase with the queries and discard the execution times of them.
Instead of using a client-server architecture, you can also opt to use Neo4j in embedded mode, which will give you a better idea of the query performance (without the overhead of the driver and the serialization/deserialization process). However, in this case you have to implement the benchmark on the JVM (in Java or possibly Jython).
Run each query multiple times. Do not use the average as it is more sensitive to outlier values (you can get high values for a number of reasons, e.g. if the OS scheduler starts some job in the background during a particular query execution).
A good paper on the topic, How not to lie with statistics: the correct way to summarize benchmark results, argues that you should use the geometric mean.
It is also common practice in performance experiments in computer science papers to use the median value. I tend to use this option - e.g. this figure shows the execution times of two simple SPARQL queries on in-memory RDF engines (Jena and Sesame), for their first executions and the median values of 5 consecutive executions.
Note however, that Neo4j employs various caching mechanisms, so if you only run the same query multiple times, it will only need to compute the results on the first execution and following executions will use the cache - unless the database is updated between the query executions.
As a good approximation, you can design the benchmark to resemble your actual workload as closely as possible - in many cases, application-specific macrobenchmarks make more sense than microbenchmarks. So if each query will only be evaluated once by the application, it is perfectly acceptable to benchmark only the first evaluation.
(Bonus.) Another good read on the topic is The Benchmark Handbook: chapter 1 discusses the most important criteria for domain-specific benchmarks (relevance, portability, scalability and simplicity). These are probably not required for your benchmark, but they are nice to know.
I worked on a cross-technology benchmark considering relational, graph and semantic databases, including Neo4j. You might find some useful ideas or code snippets in the repository: https://github.com/FTSRG/trainbenchmark
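The warm-up, repeated runs, and median/geometric-mean advice above can be sketched as a small Python harness. It deliberately takes a `run_query` callable rather than calling any Neo4j driver, so that part is an assumption you would fill in with your own HTTP or driver call; the function name `benchmark` and its parameters are likewise made up for this example. It needs Python 3.8 or later for `statistics.geometric_mean`.

```python
import time
from statistics import geometric_mean, median


def benchmark(run_query, queries, warmup_rounds=3, measured_rounds=5):
    """Time each query several times after a discarded warm-up phase.

    run_query -- callable that executes one query string (user-supplied)
    queries   -- list of query strings
    Returns {query: {"median": seconds, "geomean": seconds}}.
    """
    # Warm-up phase: run everything, discard the timings.
    for _ in range(warmup_rounds):
        for query in queries:
            run_query(query)

    results = {}
    for query in queries:
        timings = []
        for _ in range(measured_rounds):
            started = time.perf_counter()
            run_query(query)
            timings.append(time.perf_counter() - started)
        results[query] = {
            "median": median(timings),
            "geomean": geometric_mean(timings),
        }
    return results
```

Note that repeating an identical query mostly measures the cached path, as described above; varying the template attributes between runs is one way to get closer to the real workload.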

Neo4j partition

Is there a way to physically separate Neo4j partitions?
Meaning the following query will go to node1:
Match (a:User:Facebook)
While this query will go to another node (maybe hosted on docker)
Match (b:User:Google)
This is the case:
I want to store the data of several clients in Neo4j, hopefully lots of them.
Now, I'm not sure what the best design for that is, but it has to fulfill a few conditions:
No mixed data should be returned from a Cypher query (it's really hard to make sure that no developer will forget the ":Partition1" (for example) in a Cypher query).
The performance of one client shouldn't affect another client. For example, if one client has lots of data and another client has a small amount of data, or if a "heavy" query of one client is currently running, I don't want "lite" queries of another client to suffer from slow performance.
In other words, storing everything under one node will, I think, run into scalability problems at some point in the future when I have more clients.
By the way, is it common to have a few clusters?
Also, what's the advantage of partitioning over creating a different label for each client? For example: Users_client_1, Users_client_2, etc.
Short answer: no, there isn't.
Neo4j has high availability (HA) clusters where you can make a copy of your entire graph on many machines, and then serve many requests against that copy quickly, but they don't partition a really huge graph so some of it is stored here, some other parts there, and then connected by one query mechanism.
More detailed answer: graph partitioning is a hard problem, subject to ongoing research. You can read more about it over at Wikipedia, but the gist is that when you create partitions, you're splitting your graph up into multiple different locations, and then needing to deal with the complication of relationships that cross partitions. Crossing partitions is an expensive operation, so the real question when partitioning is: how do you partition such that the need to cross partitions in a query comes up as infrequently as possible?
That's a really hard question, since it depends not only on the data model but on the access patterns, which may change.
Here's how bad the situation is (quote stolen):
Typically, graph partition problems fall under the category of NP-hard problems. Solutions to these problems are generally derived using heuristics and approximation algorithms.[3] However, uniform graph partitioning or a balanced graph partition problem can be shown to be NP-complete to approximate within any finite factor.[1] Even for special graph classes such as trees and grids, no reasonable approximation algorithms exist,[4] unless P=NP. Grids are a particularly interesting case since they model the graphs resulting from Finite Element Model (FEM) simulations. When not only the number of edges between the components is approximated, but also the sizes of the components, it can be shown that no reasonable fully polynomial algorithms exist for these graphs.
Not to leave you with too much doom and gloom: plenty of people have partitioned big graphs. Facebook and Twitter do it every day, so you can read about FlockDB on the Twitter side or avail yourself of relevant Facebook research. But to cut to the chase: it depends on your data, and most people who partition design a custom partitioning strategy; it's not something software does for them.
Finally, other architectures (such as Apache Giraph) can auto-partition in some senses; if you store a graph on top of Hadoop, and Hadoop already automagically scales across a cluster, then technically this is partitioning your graph for you, automagically. Cool, right? Well...cool until you realize that you still have to execute graph traversal operations all over the place, which may perform very poorly owing to the fact that all of those partitions have to be traversed, the performance situation you're usually trying to avoid by partitioning wisely in the first place.
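The cost being discussed, relationships that cross partitions, can be made concrete with a tiny Python sketch. It is purely illustrative: the function name `count_cut_edges`, the edge list, and the two assignments are made-up examples, not anything Neo4j exposes.

```python
def count_cut_edges(edges, partition_of):
    """Count relationships whose two endpoints live in different partitions.

    edges        -- iterable of (node_a, node_b) pairs
    partition_of -- dict mapping each node to its partition id
    """
    return sum(1 for a, b in edges if partition_of[a] != partition_of[b])


# A small chain a - b - c - d, partitioned two different ways.
edges = [("a", "b"), ("b", "c"), ("c", "d")]
grouped = {"a": 0, "b": 0, "c": 1, "d": 1}      # neighbours kept together
alternating = {"a": 0, "b": 1, "c": 0, "d": 1}  # neighbours split apart

cut_grouped = count_cut_edges(edges, grouped)
cut_alternating = count_cut_edges(edges, alternating)
```

The grouped assignment cuts one relationship while the alternating one cuts all three, which is why a good strategy depends on the data and access patterns. If the data of different clients is never connected, a per-client split cuts nothing, which is the easy case of running a separate database per client.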

what is the advantage of RDF and Triple Storage to Neo4j?

Neo4j is a really fast and scalable graph database, it seems that it can be used on business projects and it is free, too!
At the same time, there are no RDF triple stores that work well with large data or deliver a high-speed access. And what is more, free RDF triple stores perform even worse.
So what is the advantage of RDF and RDF triple stores to Neo4j?
The advantage of using a triple store for RDF rather than Neo4j is that that's what they're designed for. Neo4j is pretty good for many use cases, but in my experience its performance for loading and querying RDF is well below all dedicated RDF databases.
It's a fallacy that RDF databases don't scale or are not fast. Sure, they're not yet up to the performance and scale levels of relational databases, but those have a 50-year head start. Many triple stores scale into the billions of triples, provide 'standard' enterprise features, and provide great performance for many use cases.
If you're going to use RDF for a project, use a triple store; it's going to provide the best performance and set of features/APIs for working with RDF to build your application.
RDF and SPARQL are standards, so you have a choice of multiple implementations, and can migrate your data from one RDF store to another.
Additionally, version 1.1 of the SPARQL query language is quite sophisticated (more expressive than most SQL implementations) and can do all kinds of queries that would require a lot of code to be written in Neo4j.
If you are going for graph mining (e.g., graph traversal) over triples, Neo4j is a good choice. For large sets of triples, you might want to use its batchInserter, which is fairly fast.
So I think it's all about your use case. Both technologies can and do overlap.
In my mind, it's mostly about the use case. Do you want a full knowledge graph including all the ecosystems from the semantic web? Then go for the triple store.
If you need a general-purpose graph (e.g. to store big data as a graph), use the property graph model. My reasoning is that the underlying philosophy is very different, and this starts with how the data is stored, which has implications for your usage scenario.
Let's do some off-the-top-of-my-head bullet points here to compare. Take them with a grain of salt, please, as this is not a benchmark paper, just a five-minute write-up based on experience.
Property graph (neo4j):
Think of nodes/Edges as documents
Implemented on top of e.g. linked lists or key-value stores (deep searches, large data e.g. via Gremlin)
Support for OWL/RDF, but not natively (as I see it, it's on a meta layer)
Really great when it comes to having the data in the graph and doing ML (it stores it as linked lists, which gives you nice vectors, which is cool for ML out of the box)
Made for large data at scale.
Use Cases: (focus is on the data entities and not their classes)
Social Graphs and other scenarios where you need deep traversal
Large data graphs, where you have a lot of documents that need to be searched in a schema-free graph manner.
Analyzing customer funnels from click data etc. You want to move out of your relational schema because actually, you are in a graph use case...
Triple Store (E.g. rdf4j)
Think of data in maximum normal form as triples (no redundant data at all)
Triples are stored in context triples. Relies heavily on indexes.
Suited to broad searches and specific knowledge extractions. Deep searches are sometimes cumbersome.
Scale is impressive and can reach trillions of nodes with fast performance. But I would not recommend storing big data in the graph, e.g. time series or the like. The reason is the special way indexes are used, and in order to scale horizontally you may consider working with subgraphs ...
Support for the whole ecosystem, such as SPARQL, SHACL, SWRL, etc. This is a big plus in case
Use cases:
It's really about knowledge graphs. Do you need shape testing, rule evaluation, inference, and reasoning? Go for it because you have to focus on the ontology and class structure!
Also e.g. you have IoT and want to configure relations for logistics and smart factory while the telemetry is stored somewhere else and only referenced in the graph.
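To illustrate the "nodes as documents" versus "everything as triples" contrast drawn above, here is a conceptual Python sketch that flattens one property-graph node into subject-predicate-object triples. It says nothing about how Neo4j or rdf4j store data internally; the function name `node_to_triples`, the identifiers, and the use of `rdf:type` for labels are assumptions chosen for the example.

```python
def node_to_triples(node_id, labels, properties):
    """Flatten one property-graph node into (subject, predicate, object) triples."""
    triples = [(node_id, "rdf:type", label) for label in labels]
    triples += [(node_id, key, value) for key, value in properties.items()]
    return triples


# Property-graph view: one node carrying a label and a bag of properties.
node = {"id": "user:1", "labels": ["User"], "properties": {"name": "Ann", "age": 30}}

# Triple view: one statement per label and per property.
triples = node_to_triples(node["id"], node["labels"], node["properties"])
```

In the property-graph view the entity stays together as one record, while in the triple view each fact is an independent statement, which suits class-level reasoning and rules.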
I have heard rumors that it takes a whole day to load 10M triples into Neo4j (it is actually the slowest one because it's not built primarily for RDF).
Sesame and 4Store are the fastest ones, but Jena has a powerful API.
