Insert data with timestamps from the past to time series DB - time-series

Let's consider InfluxDB as an example of a TSDB. In outline, it looks like InfluxDB stores data in append-only files sorted by time. But it also claims that it's possible to insert data with arbitrary timestamps, not just append. In the IoT world it's quite a usual scenario to occasionally receive data from the past (for example, some devices were offline for some time and then came back online) and put this data into the time series DB to plot some charts. How can InfluxDB deal with such scenarios? Will it rewrite the append-only files completely?

This is how I understand it. InfluxDB creates a logical database (shard) for each block of time for which it has data. By default, the shard group duration is 1 week. Therefore, if you insert measurements with timestamps from e.g. 4 weeks ago, they will not affect shards from subsequent weeks.
Within each shard, incoming writes are first appended to a WAL (write ahead log) and also cached in memory. When the WAL and cache are sufficiently full, they are snapshotted to disk, converting them to level 0 TSM (time structured merge tree) files. These files are read-only and measurements are ordered firstly by series and then by time.
As TSM files grow, they are compacted together, increasing their level. Multiple level 0 snapshots are compacted to produce level 1 files. Less often, multiple level 1 files are compacted to produce level 2 files, and so on up to a maximum level 4. Compaction ensures that TSM files are optimised to (ideally) contain a minimum set of series, and to minimally overlap with other TSM files. This means that fewer TSM files need to be searched for any particular series/time lookup.
So knowing this, how would InfluxDB suffer under a workload of writes with random timestamps? If the timestamps are sparsely distributed and our shard group duration is short, i.e. most writes hit different shards, then we will end up with many shards. This means many almost-empty data files which is inefficient (this very issue is addressed in their FAQ). On the other hand, if the random timestamps are concentrated in one or two shards, their lower-level TSM files will likely significantly overlap in time, meaning all of them need to be searched even for queries over narrow time ranges. This will affect read performance on these kinds of queries.
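The effect of scattered timestamps on the number of shards can be illustrated with a small sketch. It assumes a simplified model in which each shard group covers a fixed window aligned to the Unix epoch; the function names and the alignment rule are illustrative assumptions, not InfluxDB's actual implementation.

```python
from datetime import datetime, timedelta, timezone

WEEK = timedelta(weeks=1)  # default shard group duration mentioned above
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def shard_group_start(ts, duration=WEEK):
    """Floor a timestamp to the start of its (hypothetical) shard group window."""
    windows = (ts - EPOCH) // duration  # whole windows elapsed since the epoch
    return EPOCH + windows * duration


def count_shard_groups(timestamps, duration=WEEK):
    """Number of distinct shard groups a batch of writes would touch."""
    return len({shard_group_start(ts, duration) for ts in timestamps})


now = datetime(2024, 1, 31, 12, 0, tzinfo=timezone.utc)
# Live data: eight points from the last two hours.
recent = [now - timedelta(minutes=15 * i) for i in range(8)]
# Late data from a device that was offline four weeks ago.
backfill = [now - timedelta(weeks=4, minutes=15 * i) for i in range(8)]
# Sparse data: one point per week over eight weeks.
scattered = [now - timedelta(weeks=i) for i in range(8)]

print(count_shard_groups(recent))
print(count_shard_groups(recent + backfill))
print(count_shard_groups(scattered))
```

In this model a concentrated backfill touches only one extra shard group, while the sparse batch touches one shard group per point.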
More information can be found in these resources:
The InfluxDB Storage Engine and the Time-Structured Merge Tree (TSM)
CMU Time Series Database Lectures #1 - Paul Dix (InfluxDB)

Related

DynamoDB Timeseries Table Design

Scenario:
I have a few weather stations that I'm collecting data for. The data comes in roughly every 15 minutes or so. Each data packet contains several measurements like pressure, temperature, humidity, etc.
The data would be queried in multiple ways:
display latest values for all measurements at a station
display a historical chart for a single measurement (for ex. temperature)
other?
Proposed Tables:
STATIONS: hash-key: station-id
Contains metadata information about the stations
STATION_X_MEASUREMENT_DATA: hash-key: measurement-type, range-key: timestamp
Where X is the station ID. Each record contains the measurement value for a specific measurement type and time. Each station will have its own data table so that the data can be removed by dropping a table when a station is no longer in service.
STATION_SUMMARY: hash-key: station_id
Contains the latest/current values for all measurement types for each station
Questions:
Should I have two separate tables (summary and individual measurements) or should I just query the latest measurements when I want to display the summary?
Should I store the measurement types as individual records or combined into a single records for a specific timestamp?
If I were to store all measurements in a combined record with timestamp as range key, would it be worth to use minutes or seconds as the partition key? I'm afraid that would make querying more complicated.
Is there anything else I should change/improve? Are there better alternatives?
Should I have two separate tables (summary and individual measurements)
or should I just query the latest measurements when I want to display
the summary?
I don't see how you can have one table. In the measurement data you will have an item per measurement, while in the summary table every item will have static information about stations. If you are going to add them into a single table, are you going to duplicate summary information?
Also, having two separate tables allows you to set different RCU/WCU for each table. I guess that the station summary is rarely written, so you can set a low WCU and a higher RCU, while measurement data is written often and may not be read so often. Again, your settings can reflect this.
Now, do you want to have separate tables for stations and station summaries? It depends on your data and access patterns, but it is a common pattern to split heavy, detailed information into one table and a compact representation (maybe a subset of fields) into a different table. It allows you to save a serious number of RCUs if you have requests like get-all-stations, since they probably don't require detailed info.
Should I store the measurement types as individual records or combined
into a single records for a specific timestamp?
The only difference that I see is that you can compress several measurements into a binary blob and store it in one item. This helps if your measurements have some repetitions (LZW algorithm?) or if the data does not change much from one measurement to the next (delta encoding?). In the latter case, instead of writing 202, 203, 202, you can write 202, 1, -1 or something like this.
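Delta encoding as mentioned above can be sketched in a few lines. This is a minimal illustration with hypothetical function names: it stores the first value as-is, followed by the difference between each value and its predecessor, so slowly changing readings turn into small numbers that compress well.

```python
def delta_encode(values):
    """Return the first value followed by successive differences."""
    if not values:
        return []
    deltas = [values[0]]
    for prev, cur in zip(values, values[1:]):
        deltas.append(cur - prev)
    return deltas


def delta_decode(deltas):
    """Rebuild the original values by accumulating the differences."""
    values = []
    total = 0
    for delta in deltas:
        total += delta
        values.append(total)
    return values


readings = [202, 203, 202]
encoded = delta_encode(readings)
print(encoded)
print(delta_decode(encoded) == readings)
```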
Keep in mind that an item is limited to 400KB so you can't jam a lot of data in one item.
Also keep in mind that for a single partition key you can't have more than 10GB of data, so you need to have a strategy for how you are going to handle that. Notice that this does not depend on number of items or size of individual items.
If you don't have a lot of data, you may be fine having just an item per measurement. If you have a lot of data and you need to decrease AWS cost, then you will probably be better off having compressed arrays of measurements.
If I were to store all measurements in a combined record with
timestamp as range key, would it be worth to use minutes or seconds as
the partition key? I'm afraid that would make querying more
complicated.
Hard to say. How many records do you have per second? Per minute? Maybe it makes sense to aggregate per hour to get better results from compression? Or maybe for a day? It depends on your data.
Is there anything else I should change/improve? Are there better alternatives?
You can have different tables for different time intervals. Newer data can have a high WCU/RCU config, while older data will have a low WCU (can you write in the past?) and a lower RCU. Old data can be transferred to S3. Also, you can use DynamoDB TTL to automatically remove old items if you need to.

Does bucketing two *large* tables in Hive *in the same way* help perform much more efficient joins?

Imagine the following situation I am planning:
Have two rather large tables stored in Hive, both containing different types of customer related information (say, although this is not exactly the case, a record of customer transactions in one and customer owned data in the other). Let's call the tables A and B.
Tables are large in the sense that neither of them fits completely in memory. (There are 10 million customers and there are a few kilobytes of info associated with each of them in each of the two tables.)
Be careful enough to bucket both tables in exactly the same way, by a field present in both tables (customer_id, which is a bigint), and using the same number of buckets 100.
I wonder whether this setup will, in any way, guarantee that a join (by customer_id) between both tables will be efficient, in the sense that very little shuffling of information between nodes will be required. I imagine this could be the case if, for instance, there were a guarantee that the physical files corresponding to the same bucket in both tables are physically stored on the same (sets of) nodes, i.e. if for every bucket i (in [0,99]) the file A/part_0_000i and the file B/part_0_000i were physically stored on the same nodes and the same held for their replicas.
Notes:
I am aware that partitioning and bucketing are different and that the first essentially determines the structure of subdirectories, whereas the second only determines which file each record goes to. This question is about bucketing only.
Also, by number 2, map-side joins are not an option here, since, as far as my understanding goes, they require loading one of the tables completely within each mapper and doing the join completely there.
Bucketing is used when the column you want to partition by has too many distinct values, or when there are no good candidate partitions.
A concrete example would be partitioning on customerID in sales data. You may have 20 thousand customers. Each partition would contain a small amount of data, which is inefficient, and having too many partitions is also inefficient. However, you can hash the customerID and distribute the rows into 50 buckets, for example. Then when you are merging on customerID the job only has to scan what is in a bucket rather than the entire sum of all your data.
With ideal bucketing, your buckets should contain some multiple of the file system block size. Remember also that too many buckets, or buckets that are built over variables not used as keys, can be detrimental for other queries.
I have used them when I need to execute large jobs repeatedly. My query times have been reduced significantly. I tend to use them only when my data is very big. And big is relative to cluster size and capacity.
One great thing about bucketing is that it helps ensure the bucketed partitions are of similar size. If you partition over State, for example, California will have huge partitions while other states are very small.
Bucketing is tactical and not appropriate for all use cases. Happy bucketing!
Yes, it will definitely help.
If both tables are bucketed and sorted the same way, they can be joined with a sort-merge join, which works in linear time, O(n). Otherwise the tables have to be sorted the same way first, which is usually O(n log n).
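The idea behind a bucketed sort-merge join can be sketched outside of Hive. The code below is only an illustration under stated assumptions: a plain modulo stands in for Hive's hash function, the bucket count is kept small, keys in the second table are assumed unique, and all function names are hypothetical. Matching buckets are joined pairwise, and each pair is merged in one linear pass because both sides are already sorted by key.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # the question uses 100; kept small for the example


def bucketize(rows, num_buckets=NUM_BUCKETS):
    """Assign (customer_id, payload) rows to buckets, each sorted by key."""
    buckets = defaultdict(list)
    for customer_id, payload in rows:
        buckets[customer_id % num_buckets].append((customer_id, payload))
    return {b: sorted(rs, key=lambda r: r[0]) for b, rs in buckets.items()}


def merge_join(left, right):
    """Join two key-sorted lists in one pass; keys in `right` must be unique."""
    out, j = [], 0
    for key, left_value in left:
        # Advance the right cursor until it reaches or passes the current key.
        while j < len(right) and right[j][0] < key:
            j += 1
        if j < len(right) and right[j][0] == key:
            out.append((key, left_value, right[j][1]))
    return out


def bucketed_join(a_rows, b_rows, num_buckets=NUM_BUCKETS):
    """Join bucket i of A only against bucket i of B."""
    a_buckets = bucketize(a_rows, num_buckets)
    b_buckets = bucketize(b_rows, num_buckets)
    result = []
    for bucket, left in a_buckets.items():
        result.extend(merge_join(left, b_buckets.get(bucket, [])))
    return result


transactions = [(7, "tx1"), (2, "tx2"), (7, "tx3"), (5, "tx4")]
customers = [(2, "alice"), (5, "bob"), (7, "carol"), (9, "dave")]
print(sorted(bucketed_join(transactions, customers)))
```

Because rows with the same customer_id always land in the same bucket number on both sides, no bucket ever needs to be compared with a differently numbered bucket of the other table.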

Time Series Databases - Metrics vs. tags

I'm new to TSDBs and I have a lot of temperature sensors to store in my database with one point per second. Is it better to use one unique metric per sensor, or only one metric (temperature for example) with distinct tags depending on the sensor?
I searched the Internet for the best practice, but I didn't find a good answer...
Thank you! :-)
Edit:
I will have 8 types of measurements (temperature, setpoint, energy, power,...) from 2500 sources
If you are storing your data in InfluxDB, I would recommend storing all the metrics in a single measurement and using tags to differentiate the sources, rather than creating a measurement per source. The reason being that you can trivially merge or decompose the metrics using tags within a measurement, but it is not possible in the newest InfluxDB to merge or join across measurements.
Ultimately the decision rests with both your choice of TSDB and the queries you care most about running.
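To illustrate why a single measurement with tags is easy to merge or decompose, here is a small in-memory sketch. It does not use any TSDB client; the point layout and function names are assumptions made for the example. The same set of points can be averaged per sensor (grouping by a tag) or across all sensors (ignoring the tag).

```python
from collections import defaultdict
from statistics import mean

# One measurement ("temperature"); sensors are told apart by a tag.
points = [
    {"measurement": "temperature", "tags": {"sensor": "s1"}, "value": 20.0},
    {"measurement": "temperature", "tags": {"sensor": "s1"}, "value": 22.0},
    {"measurement": "temperature", "tags": {"sensor": "s2"}, "value": 30.0},
]


def mean_by_tag(points, measurement, tag):
    """Decompose: one average per distinct value of the given tag."""
    groups = defaultdict(list)
    for point in points:
        if point["measurement"] == measurement:
            groups[point["tags"].get(tag)].append(point["value"])
    return {tag_value: mean(values) for tag_value, values in groups.items()}


def mean_all(points, measurement):
    """Merge: one average over every series in the measurement."""
    return mean(p["value"] for p in points if p["measurement"] == measurement)


print(mean_by_tag(points, "temperature", "sensor"))
print(mean_all(points, "temperature"))
```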
For comparison purposes, in Axibase Time-Series Database you can store temperature as a metric and sensor id as entity name. ATSD schema has a notion of entity which is the name of system for which the data is being collected. The advantage is more compact storage and the ability to define tags for entities themselves, for example sensor location, sensor type etc. This way you can filter and group results not just by sensor id but also by sensor tags.
To give you an example, in this blog article 0601911 stands for entity id - which is EPA station id. This station collects several environmental metrics and at the same time is described with multiple tags in the database: http://axibase.com/environmental-monitoring-using-big-data/.
The bottom line is that you don't have to stage a second database, typically a relational one, just to store extended information about sensors, servers etc. for advanced reporting.
UPDATE 1: Sample network command:
series e:sensor-001 d:2015-08-03T00:00:00Z m:temperature=42.2 m:humidity=72 m:precipitation=44.3
Tags that describe sensor-001 such as location, type, etc are stored separately, minimizing storage footprint and speeding up queries. If you're collecting energy/power metrics you often have to specify attributes to series such as Status because data may not come clean/verified. You can use series tags for this purpose.
series e:sensor-001 d:2015-08-03T00:00:00Z m:temperature=42.2 ... t:status=Provisional
You should use one metric per sensor. You probably won't be needing to aggregate values from different temperature sensors, but you will be needing to aggregate values of a given sensor (average over a minute for instance).
Metrics correspond to data coming from the same source, or at least data you will be likely to aggregate. You can create almost as many metrics as you want (up to 16 million metrics in OpenTSDB for instance).
Tags make distinctions between these pieces of data. For instance, you could tag data differently if they suddenly change a lot, in order to retrieve only relevant data if needed, without losing the rest of the data. Although for a temperature sensor getting data every second, the best would probably be to filter and only store data when the value changed...
Best practices are summed up here

DB Selection and Modeling Time Series Data with Ad-Hoc queries

I have to develop a system for tracking/monitoring performance in a cellular network.
The domain includes a set of hierarchical elements, and each one has an associated set of counters that are reported periodically (every 15 minutes). The system should collect these counter values (available as large XML files) and periodically aggregate them on two dimensions: Time (from 15 minutes to hour and from hour to day) and Hierarchy (lower level to higher level elements). The aggregation is most often a simple SUM but sometimes requires average/min/max etc. Of course, for the element dimension aggregation it needs to group by the hierarchy (group all children into one parent record). The user should be able to define and view KPIs (Key Performance Indicators) - that is, some calculations on the various counters. The KPI could be required for just one element, for several elements (producing a data series for each) or as an aggregation for several elements (resulting in one data series of aggregated data).
There will be about 10-15 users of the system with probably 20-30 queries an hour. The query response time should be a few seconds (up to 10-15 for very large reports including many elements and a long time period).
At a high level, this is the flow:
Parse and Input Counter Data - there is a set of XML files which contains a periodical update of counters data for the elements. The size of all files is about 4GB / 15 minutes (so roughly 400GB/day).
Hourly Aggregation - once an hour all the collected counters, for all the elements should be aggregated - every 4 records related to an element are aggregated into one hourly record which should be stored.
Daily Aggregation - once a day, all collected counters, for all elements, should be aggregated - every 24 records related to an element are aggregated into one daily record.
Element Aggregation - with each one of the time-dimension aggregation it is possibly required to aggregate along the hierarchy of the elements - all records of child elements are aggregated into one record for the parent element.
KPI Definitions - there should be some way for the user to define a KPI. The KPI is a definition of a calculation based on counters from the same granularity (Time dimension). The calculation could (and will) involve more than one element level (e.g. p1.counter1 + sum(c1.counter1) where p1 is a parent of one or more records in c1).
User Interaction - the user can select one or more elements and one or more counters/KPIs, the granularity to use, the time period to view and whether or not to aggregate the selected data.
In case of aggregation, the result is one data series that includes the "added up" values for all the selected elements for each relevant point in time. In "SQL":
SELECT p1.time, SUM(p1.counter1) / SUM(p1.counter2) * SUM(c1.counter1)
FROM p1_hour p1, c1_hour c1
WHERE p1.time > :minTime and p1.time < :maxTime AND p1.id in :id_list and join
GROUP BY p1.time
If there is no aggregation, the identifiers from p1 need to be kept so that there is a data series for each selected element:
SELECT p1.time, p1.id, SUM(p1.counter1) / SUM(p1.counter2) * SUM(c1.counter1)
FROM p1_hour p1, c1_hour c1
WHERE p1.time > :minTime and p1.time < :maxTime AND p1.id in :id_list and join
GROUP BY p1.time, p1.id
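The time and hierarchy aggregations described in the flow can be sketched in plain Python. This is only an illustration under assumptions: records are (timestamp, element_id, counters) tuples, only SUM is handled (average/min/max would need a different reducer), and the function names and sample data are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def sum_counters(records, key_fn):
    """SUM the counters of (timestamp, element_id, counters) records per group."""
    totals = defaultdict(lambda: defaultdict(int))
    for ts, element_id, counters in records:
        group = totals[key_fn(ts, element_id)]
        for name, value in counters.items():
            group[name] += value
    # Keep the same record shape so that aggregations can be chained.
    return [(ts, eid, dict(c)) for (ts, eid), c in sorted(totals.items())]


def hourly(records):
    """Time dimension: fold 15-minute records into one record per hour."""
    return sum_counters(
        records,
        lambda ts, eid: (ts.replace(minute=0, second=0, microsecond=0), eid),
    )


def to_parent(records, parent_of):
    """Hierarchy dimension: fold child records into their parent element."""
    return sum_counters(records, lambda ts, eid: (ts, parent_of[eid]))


parent_of = {"c1": "p1", "c2": "p1"}
quarter_hours = [
    (datetime(2024, 1, 1, 10, minute), child, {"counter1": value})
    for minute in (0, 15, 30, 45)
    for child, value in (("c1", 1), ("c2", 5))
]

child_hour = hourly(quarter_hours)
parent_hour = to_parent(child_hour, parent_of)
print(child_hour)
print(parent_hour)
```

Chaining the two steps mirrors the flow above: four 15-minute records per element become one hourly record, and the hourly child records are then folded into the parent.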
The system has to keep data for 10, 100 and 1000 days for 15-min, hour and daily records. Following is a size estimate considering integer only columns at 4 bytes for storage with 400 counters for elements of type P, 50 for elements of type C and 400 for type GP:
Based on the DDL (in reality, DBs optimize storage), I assume this adds up to 3.5-4 TB of data, plus probably about 20-30% extra which will be required for indexes. The child "tables" can get close to 2 billion records per table.
It is worth noting that from time to time I would like to add counters (maybe every 2-3 months) as the network evolves.
I once implemented a very similar system (though probably with less data) using Oracle. This time around I may not use a commercial DB and must resort to open source solutions. Also, with the increasing popularity of NoSQL and dedicated time-series DBs, maybe relational is not the way to go?
How would you approach such development? What are the products that could be used?
From a few days of research, I came up with the following
Use MySQL / PostgreSQL
InfluxDB (or a similar product)
Cassandra + Spark
Others?
How would each solution be used, and what would be the advantages/disadvantages of each approach? If you can, elaborate on or suggest the overall (hardware) architecture to support this kind of development.
Comments and suggestions are welcome - preferably from people with hands-on experience with similar projects.
Going with Open Source RDBMS:
Using MySQL or Postgres
The table structure would be (imaginary SQL):
CREATE TABLE LEVEL_GRANULARITY (
TIMESTAMP DATE,
PARENT_ID INT,
ELEMENT_ID INT,
COUNTER_1 INT,
...
COUNTER_N INT,
PRIMARY KEY (TIMESTAMP, PARENT_ID, ELEMENT_ID)
)
For example we will have P1_HOUR, GP_HOUR, P_DAY, GP_DAY etc.
The tables could be partitioned by date to improve query time and ease data management (whole partitions can be removed).
To facilitate fast load, use the loaders provided with the DB - these loaders are usually faster and insert data in bulk.
Aggregation could be done quite easily with a 'SELECT ... INTO ...' query (since the scope of the aggregation is limited, I don't think it will be a problem).
Queries are straightforward, as aggregation, grouping and joining are built in. I am not sure about the query performance considering how large the tables are.
Since the workload is write-intensive, I don't think clustering would help here.
Pros:
Simple configuration (assuming no clusters etc).
SQL query capabilities - flexible
Cons:
Query performance - will it work?
Management overhead
Rigid Schema
Scaling?
Using InfluxDB (or something like that):
I have not really used this DB and am writing based on playing around with it a little.
The model would be to create a time-series for every element in every level and granularity.
The data series name will include the identifiers of the element and the granularity.
For example P.P_ElementID.G.15MIN or P.P_ElementID.C.C1_ELEMENT_ID.G.60MIN
The data series will contain all the counters relevant for that level.
The input has to parse the XML and build the data series name before inserting the new data points.
InfluxDB has an SQL-like query language and allows specifying the calculation in an SQL-like manner. It also supports grouping. Grouping by element would be possible using a regular expression, e.g. SELECT counter1/counter2 FROM /^P\.P_ElementID\.C1\..*G\.15MIN/ to get all children of ElementID.
There is a notion of grouping by time; in general, it is made for this kind of data.
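The naming scheme and the regular-expression selection can be tried out without a database. The sketch below is an assumption-laden illustration: it follows the P.<parent>.C.<child>.G.<granularity> layout from the example names above, assumes identifiers contain no dots, and uses hypothetical helper names. It is not InfluxDB query syntax.

```python
import re


def series_name(parent_id, granularity, child_id=None):
    """Build a series name such as P.p42.G.15MIN or P.p42.C.c1.G.15MIN."""
    parts = ["P", parent_id]
    if child_id is not None:
        parts += ["C", child_id]
    parts += ["G", granularity]
    return ".".join(parts)


def children_pattern(parent_id, granularity):
    """Regex matching every child series of one parent at one granularity."""
    return re.compile(
        rf"^P\.{re.escape(parent_id)}\.C\..*\.G\.{re.escape(granularity)}$"
    )


all_series = [
    series_name("p42", "15MIN"),
    series_name("p42", "15MIN", "c1"),
    series_name("p42", "15MIN", "c2"),
    series_name("p42", "60MIN", "c1"),
    series_name("p43", "15MIN", "c1"),
]

pattern = children_pattern("p42", "15MIN")
print([name for name in all_series if pattern.match(name)])
```

Anchoring the pattern at both ends keeps series of other parents or granularities out of the result.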
Pros:
Should be fast
Support queries etc very similar to SQL
Support Deleting by Date (but have to do it on every series...)
Flexible Schema
Cons:
Currently, seems not to support clusters very easily
Clusters = more maintenance
Can it support millions of data-series (and still work fast)
Less common, less documented (currently)

Will Redis's sorted sets scale?

This may be more of a theoretical question but I'm looking for a pragmatic answer.
I plan to use Redis's Sorted Sets to store the ranking of a model in my database based on a calculated value. Currently my data set is small (250 members in the set). I'm wondering if the sorted sets would scale to say, 5,000 members or larger. Redis claims a 1GB maximum value and my values are the ID of my model so I'm not really concerned about the scalability of the value of the sorted set.
ZRANGE has a time complexity of O(log(N)+M). If I'm most frequently trying to get the top 5 ranked items from the set, log(N) of N set items might be a concern.
I also plan to use ZINTERSTORE which has a time complexity of O(N*K)+O(M*log(M)). I plan to use ZINTERSTORE frequently and retrieve the results using ZRANGE 0 -1
I guess my question is two fold.
Will Redis sorted sets scale to 5,000 members without issues? 10,000? 50,000?
Will ZRANGE and ZINTERSTORE (in conjunction with ZRANGE) begin to show performance issues when applied to a large set?
I have had no issues with hundreds of thousands of keys in sorted sets. Sure, getting the entire set takes longer the larger the set is, but that is expected, even just from an I/O standpoint.
One such instance was on a server with several DBs in use and several sorted sets with 50k to >150k keys in them. High writes were the norm, as these use a lot of zincrby commands coming by way of realtime webserver log analysis peaking at over 150M records per day. And I'd store a week at a time.
Given my experience, I'd say go for it and see; it will likely be fine unless your server hardware is really low end.
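To put the O(log(N)+M) cost of ZRANGE from the question in perspective, the terms can simply be evaluated for a few set sizes. This is a rough back-of-the-envelope sketch, not a benchmark: big-O hides constant factors, the base-2 logarithm is an assumption, and the function name is hypothetical.

```python
import math


def zrange_cost(n, m):
    """Rough step count for O(log(N) + M): N members in the set, M returned."""
    return math.log2(n) + m


# Cost of fetching the top 5 items as the set grows.
for n in (250, 5_000, 50_000, 1_000_000):
    print(n, round(zrange_cost(n, 5), 1))
```

The logarithmic term grows very slowly with N, so for a top-5 query the set size matters little; fetching the whole set with ZRANGE 0 -1 is dominated by M instead.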
In Redis, sorted sets have scaling limitations. A sorted set cannot be partitioned. As a result, if the size of a sorted set exceeds the size of the partition, there is nothing you can do (without modifying Redis).
Quote from article:
The partitioning granularity is the key, so it is not possible to shard a dataset with a single huge key like a very big sorted set[1].
Reference:
[1] http://redis.io/topics/partitioning
