Storing any number-series data in a time-series database

I would like to use the time-series database InfluxDB to store data points indexed by a number other than the time each data point is normally stored against, so I can take advantage of all the features for a series of data points indexed by that number.
For example, I have a rocket doing multiple launches, with several sensors recording temperature, air pressure, fuel level, etc., and I want to graph these data points against elevation rather than time.
I realise I could store elevation itself against time and then, from the time of, say, a temperature reading, work out the elevation and project the results, but that extra step would lose the performance characteristics of simply querying data points indexed by elevation. Also, third-party tools that use the time-series database, e.g. Grafana, won't be able to simply fetch these data points against elevation rather than time to graph them, without me putting something in between to marry the data up.
One idea I had was to use a fake time where metres = seconds and store against that. I would then need to make it a composite with something else to differentiate rocket launches, e.g. incrementing the year by 1 starting at year 0, so that launches don't all appear to start at the same elevation and the "number series" stay separated from each other. I guess I would have that separation problem anyway, and the proper way to handle it would be through tags.
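For illustration, here is a minimal sketch of that pseudo-timestamp idea in Python; the helper name and the one-year-per-launch offset are my own, and it starts at year 2000 rather than year 0 because Python's datetime (and many databases) cannot represent year 0:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical helper: map (launch number, elevation in metres) to a fake
# timestamp where 1 metre of elevation == 1 second, and each launch is
# pushed into its own "year" so the series never overlap.
def elevation_to_timestamp(launch_no: int, elevation_m: float) -> datetime:
    launch_epoch = datetime(2000 + launch_no, 1, 1, tzinfo=timezone.utc)
    return launch_epoch + timedelta(seconds=elevation_m)

# A reading taken at 1,250 m during launch 3:
print(elevation_to_timestamp(3, 1250.0))  # 2003-01-01 00:20:50+00:00
```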

What makes you believe that this approach would be more efficient than storing the elevation alongside your other sensor data? Fetching data is pretty cheap, so the performance gain might be very slight compared to the added complexity of your keys. Not to mention that you would still need to make the time part of your elevation-based timestamp, otherwise you will end up with duplicate pseudo-timestamps and therefore incomplete data, as most time-series databases do not allow multiple values at the same timestamp for a given series.
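As a sketch of that "store elevation alongside the other readings" suggestion, assuming the influxdb-client Python package and made-up bucket, measurement, and tag names:

```python
from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

# Hypothetical connection details.
client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org")
write_api = client.write_api(write_options=SYNCHRONOUS)

# One telemetry sample: elevation is just another field, the launch is a tag,
# and the real timestamp stays the timestamp.
point = (
    Point("rocket_telemetry")
    .tag("launch", "launch-03")
    .field("temperature_c", 21.5)
    .field("pressure_hpa", 942.0)
    .field("fuel_level", 0.87)
    .field("elevation_m", 1250.0)
)
write_api.write(bucket="rockets", record=point)
```

Elevation then remains an ordinary field, and a tool that can plot one field against another (for example an XY-style panel) can graph temperature against elevation_m, at the cost of that extra projection step.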
I would also encourage you to have a look at other time-series databases which include elevation as part of their standard data model. Check out Warp 10 for that matter (standard disclaimer: I am the co-founder of SenX, maker of Warp 10).

Related

Dynamic clustering for panel data

I have panel data consisting of time series for 120 months, 45 institutions, and approximately 8 variables for each one. I want to do a cluster analysis in order to detect stressed institutions based on dynamic clustering, for instance checking whether a stressed institution moves from one cluster to another, or whether its behavior changes so much that it no longer belongs to its original cluster.
The idea would be to use the information up to time t to cluster the institutions, so that each institution's cluster assignment can evolve with new information and all the information available up to that point from all the banks is used, giving time-varying clusters.
My first idea was to use statistical control techniques and anomaly detection for time series, such as the ones in the anomaly package, but this procedure only uses each bank's own information, not the other banks'. It might be that the whole system is stressed, so an anomaly detected in one bank might be due to the system rather than to that particular bank.
I also tried clustering in each period using hierarchical clustering, which did a decent job of classifying the institutions based on my knowledge of them. However, this procedure only uses the data at each point in time, not all the data available up to that point.
I had the idea of using clustering methods for panel data at each point in time, using the data up to that point, and cycling through each month to get dynamic clusters using the whole dataset. However, I don't know if this approach makes sense, or if there are better methods to do this kind of analysis.
Thank you very much!
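A minimal sketch of the expanding-window clustering idea described above, assuming a long-format pandas DataFrame with columns institution, month, and the eight variables (the feature columns, the per-institution summary statistic, and k are placeholders):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def expanding_window_clusters(df: pd.DataFrame, feature_cols, k: int = 4) -> pd.DataFrame:
    """For each month t, summarise every institution using all of its data up to t,
    then cluster the institutions on those summaries."""
    results = []
    for t in sorted(df["month"].unique()):
        window = df[df["month"] <= t]                            # all data up to month t
        summary = window.groupby("institution")[feature_cols].mean()
        features = StandardScaler().fit_transform(summary)
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
        results.append(pd.DataFrame({
            "month": t,
            "institution": summary.index,
            "cluster": labels,
        }))
    return pd.concat(results, ignore_index=True)
```

Note that k-means labels are arbitrary in each month, so cluster assignments would need to be matched across months (for example by centroid distance) before movements of an institution between clusters can be tracked.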

Point in polygon based search vs geo hash based search

I'm looking for some advice.
I'm developing a system with geographic triggers; these enable my device to perform certain actions depending on where it is. The triggers are contained within polygons that are stored in my database. I've explored multiple options to get this working; however, I'm not very familiar with geospatial systems.
One option would be to use the current location of the device and query the DB directly for all the polygons that contain that point, and thus all the triggers, since they are linked together. A potential problem with this approach, I think, is the number of polygons that might be stored and the frequency of the queries, since this system serves multiple devices simultaneously and each of them polls every few seconds.
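As a sketch of that first option, this is what the point-in-polygon check itself looks like with the shapely package; the trigger zone coordinates are made up, and in practice the containment test would normally run as a spatial query against a geo-indexed database rather than in application code:

```python
from shapely.geometry import Point, Polygon

# Hypothetical trigger zone, defined as (longitude, latitude) pairs.
trigger_zone = Polygon([
    (-0.120, 51.500),
    (-0.115, 51.500),
    (-0.115, 51.505),
    (-0.120, 51.505),
])

device_position = Point(-0.118, 51.502)

if trigger_zone.contains(device_position):
    print("fire the triggers attached to this polygon")
```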
Another option I'm exploring is to encode the polygons as an array of geohashes and then attach the trigger to each one of them.
Green marks the geohashes the trigger will be attached to; yellow marks areas that need to be recalculated at a higher precision. The idea is to encode the polygon in the most efficient way down to some precision X.
Another optimization I came up with is to only store the intersection of the polygons with roads, since these devices are only used in motor vehicles.
Doing this enables the device to work offline, performing its own encoding and lookup, with the potential disadvantage that the device will have to implement logic to stay up to date with triggers being added or removed (potentially every 24 hours).
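A minimal sketch of the offline lookup side of that geohash idea, assuming the pygeohash package and a locally synced mapping from geohash cell to trigger IDs (the precision, cells, and IDs below are placeholders):

```python
import pygeohash

# Synced periodically from the server: geohash cell -> IDs of triggers whose
# polygons cover that cell (produced by encoding the polygons ahead of time).
cell_to_triggers = {
    "gcpvj0": ["trigger-17"],
    "gcpvj1": ["trigger-17", "trigger-42"],
}

GEOHASH_PRECISION = 6  # roughly 1.2 km x 0.6 km cells; tune to the polygon sizes

def triggers_at(latitude: float, longitude: float) -> list:
    # Encode the device position at the same precision used for the polygons
    # and look the cell up locally, with no network round trip required.
    cell = pygeohash.encode(latitude, longitude, precision=GEOHASH_PRECISION)
    return cell_to_triggers.get(cell, [])

matching = triggers_at(51.5007, -0.1246)
```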
I'm looking for the most efficient way to implement this given some constraints such as:
Potentially unreliable networks (the device has LTE connectivity)
Limited processing power: the devices for now are based on a Raspberry Pi 3 Compute Module, but they also perform other tasks such as image processing
Limited storage, since they also store videos and images
A potentially large number of triggers/polygons
A potentially large number of devices
Any thoughts are greatly appreciated.

Best OCR approach on documents with different formats to find one specific piece of information

Unfortunately, because of confidential data, I can't give a more specific explanation.
The Problem
So I've got a few documents that in general contain the same information but have different formats. In most cases, the value I am looking for is near a keyword in the document. The OCR itself is taken care of by the Google Cloud Vision API, but what is the best approach to handling the different formats?
My idea
... was to train a classifier that detects which format I am dealing with and then picks the appropriate way of finding the target value, which I had implemented by hand beforehand. This is neither handy nor scalable. So I am looking for some algorithm that I can tell, e.g., where the target value is, what it looks like, etc.
What is the best ML-approach for this problem or what are your ideas?
As an example of the type of data: let's say I have receipts from 20 different supermarkets and I am looking to find the total cost, with the problem that every company's receipt looks different.
Recently I had to deal with a similar situation using Tesseract. Apart from the OCR tool itself, I didn't use any ML approach because, like you said, it wouldn't be scalable.
I don't think a classifier would pay off unless you have a huge number of different layouts, and even then you'd have to decide how to extract the data for each and every layout...
It depends a lot on the type of data you need to extract, but using your example, if you had to extract the total cost from all the different layouts, you could extract as many numbers as you can from each receipt and score them based on some factors, like:
Whether it's a cost ($ or another currency symbol)
Its distance to some common keywords like "Total", "Final", "Sum", etc.
Whether it's the highest value on that receipt
Any other factors you might think of; it all depends on the data you need to extract
Then you can take the candidate that scored the highest for each receipt as its final total cost.
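A minimal sketch of that scoring idea in Python, working on plain OCR'd text lines; the keywords, weights, and regex are placeholders chosen just to make the heuristic concrete:

```python
import re

KEYWORDS = ("total", "sum", "amount due", "final")

def best_total(ocr_lines):
    """Extract every number from the OCR'd lines, score each one, return the winner."""
    keyword_lines = [i for i, line in enumerate(ocr_lines)
                     if any(k in line.lower() for k in KEYWORDS)]

    candidates = []  # [score, value, line_no, has_currency_symbol]
    for i, line in enumerate(ocr_lines):
        for match in re.finditer(r"([$€£]?)\s*(\d+[.,]\d{2})", line):
            value = float(match.group(2).replace(",", "."))
            candidates.append([0.0, value, i, bool(match.group(1))])
    if not candidates:
        return None

    highest = max(c[1] for c in candidates)
    for c in candidates:
        _, value, line_no, has_symbol = c
        score = 1.0 if has_symbol else 0.0            # looks like a cost
        if value == highest:
            score += 1.0                              # highest value on the receipt
        if keyword_lines:                             # proximity to "Total", "Sum", ...
            score += 2.0 / (1 + min(abs(line_no - j) for j in keyword_lines))
        c[0] = score

    return max(candidates)[1]  # value of the best-scoring candidate

receipt = ["SuperMart", "Milk  1.99", "Bread 2.49", "TOTAL  $4.48"]
print(best_total(receipt))  # 4.48
```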

RapidMiner - Time Series Segmentation

I am fairly new to RapidMiner. I have a historical financial data set (with attributes Date, Open, Close, High, Low, Volume Traded) from Yahoo Finance and I am trying to find a way to segment it as in the image below:
I am also planning on performing this segmentation on more than one such data set and then comparing the segmentations (i.e. Segment 1 for Data Set A against Segment 1 for Data Set B), so I would preferably require an equal number of segments for each.
I am aware that certain extensions are available within the RapidMiner Marketplace; however, I do not believe that any of them have what I am looking for. Your assistance is much appreciated.
Edit: I am currently trying to replicate the Voting-Based Outlier Mining for Multiple Time Series (V-BOMM) with multiple data sets. So far, I am able to perform the operation by recording and comparing common dates against each other.
However, I would like to enhance the process to compare Segments rather than simply dates. I have gone through the existing functionalities of RapidMiner, and thus far I don't believe any fit my requirements.
I have also considered Dynamic Time Warping, but I can't seem to find an available functionality in RapidMiner.
Ultimate question: Can someone guide me to functionalities that can help replicate the segmentation in the attached image such that the segments can be compared between Historic Data Sets in RapidMiner? Also, can someone guide me on how to implement Dynamic Time Warping using RapidMiner?
I would use the new version of the Time Series extension, using the windowing features to segment the time series into whatever parts you want. There is a nice explanation of the new tools in the blog section of the community.
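RapidMiner's windowing is configured in the GUI rather than in code, but for readers who find it easier to see the idea spelled out, here is a rough Python sketch of the two pieces being discussed: splitting each series into an equal number of segments, and comparing matching segments with a plain Dynamic Time Warping distance (the segment count and the use of closing prices are assumptions):

```python
import numpy as np

def split_into_segments(series: np.ndarray, n_segments: int):
    """Split one time series into n_segments roughly equal-length pieces."""
    return np.array_split(series, n_segments)

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Plain O(len(a) * len(b)) dynamic time warping distance."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])

# Compare Segment 1 of Data Set A against Segment 1 of Data Set B, and so on.
close_a = np.random.rand(240)  # placeholder for Data Set A closing prices
close_b = np.random.rand(240)  # placeholder for Data Set B closing prices
segments_a = split_into_segments(close_a, n_segments=8)
segments_b = split_into_segments(close_b, n_segments=8)
distances = [dtw_distance(sa, sb) for sa, sb in zip(segments_a, segments_b)]
```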

Analyzing sensor data stored in Cassandra and drawing graphs

I'm collecting data from different sensors and writing it to a Cassandra database.
The sensor ID acts as the partition key, with the timestamp of the sensor reading as the clustering column. Additionally, the sensor's value is stored.
Each sensor collects somewhere around 30,000 to 60,000 values a day.
The simplest thing I want to do is draw a graph showing this data. This is not a problem for a few hours, but when showing a week or an even longer range, all the data has to be loaded into the backend (a Rails application) for further processing. This isn't really fast with my test dataset, and I don't think it will be faster in production.
So my question is how to speed this up. I thought about pre-processing the data directly in the database, but it seems that Cassandra isn't able to do such things.
For a graph with a width of 1000 px it isn't useful to draw tens of thousands of points, so it would be better to fetch only relevant, pre-aggregated data from the database.
For example, when showing the data for a whole day in a graph with a width of 1000 px, it would be enough to take 1000 average values (an average per bucket of 86.4 seconds: 60*60*24 / 1000).
Is this a good approach? Or are there other techniques to speed this up? How would I handle this with the database? Create a second table and store some average values? But the resolution of the graph may change...
Another approach would be drawing mean values by day, week, month, and so on. Maybe for this a second table could do a good job!
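A rough Python sketch of the downsampling described above, bucketing raw (timestamp, value) pairs into one average per pixel after they have been pulled out of Cassandra; the 1000-bucket figure comes from the question, and the data layout is assumed:

```python
from collections import defaultdict

def downsample(points, start, end, width_px=1000):
    """points: iterable of (unix_timestamp, value) pairs within [start, end).
    Returns one averaged value per horizontal pixel of the graph."""
    bucket_width = (end - start) / width_px   # e.g. 86.4 s for one day at 1000 px
    sums, counts = defaultdict(float), defaultdict(int)
    for ts, value in points:
        bucket = min(int((ts - start) / bucket_width), width_px - 1)
        sums[bucket] += value
        counts[bucket] += 1
    return {b: sums[b] / counts[b] for b in sorted(sums)}
```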
Cassandra is all about letting you write and read your data quickly. Think of it as just a data store. It can't (really) do any processing on that data.
If you want to do operations on it, then you are going to need to put the data into something else. Storm is quite popular for building computation clusters that process data from Cassandra, but without knowing exactly the scale you need to operate at, that may be overkill.
Another option which might suit you is to aggregate data on the way in, or perhaps in nightly jobs. This is how OLAP is often done with other technologies. This can work if you know in advance what you need to aggregate. You could build your sets into hourly, daily, whatever, then pull a smaller amount of data into Rails for graphing (and possibly aggregate it even further to exactly meet the desired graph requirements).
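As a sketch of that pre-aggregation idea, a nightly job could read yesterday's raw readings and write hourly averages into a second table; the keyspace, table, and column names below are made up, and this assumes the DataStax cassandra-driver package:

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("sensors")  # hypothetical keyspace

# Hypothetical rollup table:
#   CREATE TABLE readings_hourly (sensor_id text, hour timestamp, avg_value double,
#                                 PRIMARY KEY (sensor_id, hour));
day_start = (datetime.now(timezone.utc) - timedelta(days=1)).replace(
    hour=0, minute=0, second=0, microsecond=0)
rows = session.execute(
    "SELECT ts, value FROM readings WHERE sensor_id = %s AND ts >= %s AND ts < %s",
    ("sensor-1", day_start, day_start + timedelta(days=1)),
)

sums, counts = defaultdict(float), defaultdict(int)
for row in rows:
    hour = row.ts.replace(minute=0, second=0, microsecond=0)
    sums[hour] += row.value
    counts[hour] += 1

for hour in sums:
    session.execute(
        "INSERT INTO readings_hourly (sensor_id, hour, avg_value) VALUES (%s, %s, %s)",
        ("sensor-1", hour, sums[hour] / counts[hour]),
    )
```

The Rails side then reads readings_hourly (or a daily/weekly rollup) instead of the raw table when the requested range is long.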
For the purposes of storing, aggregating, and graphing your sensor data, you might consider RRDtool, which does basically everything you describe. Its main limitation is that it does not store raw data, but instead stores aggregated, interpolated values. (If you need the raw data, you can still use Cassandra for that.)
AndySavage is onto something here when it comes to precomputing aggregate values. This does require you to understand in advance the sorts of metrics you'd like to see from the sensor values generally.
You correctly identify the limitation of a graph in informing the viewer. Questions you need to ask really fall into areas such as:
When you aggregate, are you interested in the mean, the median, or the spread of the values?
What's the biggest aggregation that you're interested in?
What's the goal of the data visualisation - is it really necessary to be looking at a whole year of data?
Are outliers the important part of the dataset?
Each of these questions will lead you down a different path with visualisation and the application itself too.
Once you know what you want to do, an ETL process harnessing some form of analytical processing will be needed. This is where the Hadoop world would be worth investigating.
Regarding your decision to use Cassandra as your timeseries historian, how is that working for you? I'm looking at technical solutions for a similar requirement at the moment and it's one of the options on the table.
