DB Selection and Modeling Time Series Data with Ad-Hoc Queries

I have to develop a system for tracking/monitoring performance in a cellular network.
The domain includes a set of hierarchical elements, and each one has an associated set of counters that are reported periodically (every 15 minutes). The system should collect these counter values (available as large XML files) and periodically aggregate them on two dimensions: time (from 15 minutes to hour and from hour to day) and hierarchy (lower-level to higher-level elements). The aggregation is most often a simple SUM, but sometimes requires average/min/max etc. Of course, for the element-dimension aggregation it needs to group by the hierarchy (group all children into one parent record).
The user should be able to define and view KPIs (Key Performance Indicators), that is, calculations over the various counters. A KPI could be required for just one element, for several elements (producing a data-series for each) or as an aggregation over several elements (resulting in one data-series of aggregated data).
There will be about 10-15 users of the system, issuing probably 20-30 queries an hour. The query response time should be a few seconds (up to 10-15 seconds for very large reports covering many elements and a long time period).
At a high level, this is the flow:
Parse and Input Counter Data - there is a set of XML files which contain a periodic update of counter data for the elements. The size of all files is about 4GB / 15 minutes (so roughly 400GB/day).
Hourly Aggregation - once an hour, all the collected counters for all the elements should be aggregated - every 4 records related to an element are aggregated into one hourly record, which should be stored.
Daily Aggregation - once a day, all collected counters for all elements should be aggregated - every 24 records related to an element are aggregated into one daily record.
Element Aggregation - with each of the time-dimension aggregations it may also be required to aggregate along the hierarchy of the elements - all records of child elements are aggregated into one record for the parent element.
KPI Definitions - there should be some way for the user to define a KPI. A KPI is a definition of a calculation based on counters of the same granularity (time dimension). The calculation could (and will) involve more than one element level (e.g. p1.counter1 + sum(c1.counter1) where p1 is a parent of one or more records in c1).
User Interaction - the user can select one or more elements and one or more counters/KPIs, the granularity to use, the time period to view and whether or not to aggregate the selected data.
In case of aggregation, the result is one data-series that includes the "added up" values for all the selected elements for each relevant point in time. In "SQL":
SELECT p1.time, SUM(p1.counter1) / SUM(p1.counter2) * SUM(c1.counter1)
FROM p1_hour p1, c1_hour c1
WHERE p1.time > :minTime AND p1.time < :maxTime AND p1.id IN (:id_list) AND <p1-c1 join condition>
GROUP BY p1.time
In case there is no aggregation, we need to keep the identifiers from p1 and have a data-series for each selected element:
SELECT p1.time, p1.id, SUM(p1.counter1) / SUM(p1.counter2) * SUM(c1.counter1)
FROM p1_hour p1, c1_hour c1
WHERE p1.time > :minTime AND p1.time < :maxTime AND p1.id IN (:id_list) AND <p1-c1 join condition>
GROUP BY p1.time, p1.id
The system has to keep data for 10, 100 and 1000 days for the 15-min, hourly and daily records respectively. Following is a size estimate, assuming integer-only columns at 4 bytes of storage each, with 400 counters for elements of type P, 50 for elements of type C and 400 for type GP.
As it adds up, based on the DDL I estimate 3.5-4 TB of data (in reality, DBs optimize storage), plus probably about 20-30% extra which will be required for indexes. For the child "tables", it can get close to 2 billion records per table.
It is worth noting that from time to time I would like to add counters (maybe every 2-3 months) as the network evolves.
I once implemented a very similar system (though probably with less data) using Oracle. This time around I may not use a commercial DB and must revert to open-source solutions. Also, with the increased popularity of NoSQL and dedicated time-series DBs, maybe relational is not the way to go?
How would you approach such development? What are the products that could be used?
From a few days of research, I came up with the following options:
Use MySQL / Postgres
InfluxDB (or a similar product)
Cassandra + Spark
Others?
How would each solution be used, and what would be the advantages/disadvantages of each approach? If you can, elaborate on or suggest also the overall (hardware) architecture to support this kind of development.
Comments and suggestions are welcome - preferably from people with hands-on experience with similar projects.

Going with Open Source RDBMS:
Using MySQL or Postgres
The table structure would be (imaginary SQL):
CREATE TABLE LEVEL_GRANULARITY (
    TIMESTAMP DATE,
    PARENT_ID INT,
    ELEMENT_ID INT,
    COUNTER_1 INT,
    ...
    COUNTER_N INT,
    PRIMARY KEY (TIMESTAMP, PARENT_ID, ELEMENT_ID)
)
For example, we will have P1_HOUR, GP_HOUR, P_DAY, GP_DAY etc.
The tables could be partitioned by date to enhance query time and ease data management (whole partitions can be removed).
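As a rough illustration, this is what date-based partitioning could look like in PostgreSQL (11+ declarative partitioning); the table, column and partition names are only placeholders following the imaginary DDL above, not a tested design:

-- One partitioned table per level/granularity, partitioned by day.
CREATE TABLE P1_HOUR (
    TS          TIMESTAMP NOT NULL,
    PARENT_ID   INT       NOT NULL,
    ELEMENT_ID  INT       NOT NULL,
    COUNTER_1   BIGINT,
    COUNTER_2   BIGINT,
    PRIMARY KEY (TS, PARENT_ID, ELEMENT_ID)
) PARTITION BY RANGE (TS);

-- One partition per day; retention is handled by dropping whole partitions.
CREATE TABLE P1_HOUR_2024_01_01 PARTITION OF P1_HOUR
    FOR VALUES FROM ('2024-01-01') TO ('2024-01-02');

-- Expired data is removed instantly, without a large DELETE.
DROP TABLE P1_HOUR_2023_09_23;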
To facilitate fast loads, use the loaders provided with the DB - these loaders are usually faster and insert data in bulk.
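For instance (a sketch only; the file paths and table names are placeholders), the parsed XML could be flattened to CSV and bulk-loaded with PostgreSQL's COPY or MySQL's LOAD DATA INFILE:

-- PostgreSQL
COPY P1_15MIN FROM '/data/p1_20240101_0015.csv' WITH (FORMAT csv);

-- MySQL
LOAD DATA INFILE '/data/p1_20240101_0015.csv'
INTO TABLE P1_15MIN
FIELDS TERMINATED BY ',';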
Aggregation could be done quite easily with an INSERT INTO ... SELECT ... query (since the scope of each aggregation run is limited, I don't think it will be a problem).
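A minimal sketch of the hourly roll-up under those assumptions (placeholder table/column names, PostgreSQL's date_trunc; MySQL would need a different time-truncation expression):

INSERT INTO P1_HOUR (TS, PARENT_ID, ELEMENT_ID, COUNTER_1, COUNTER_2)
SELECT date_trunc('hour', TS), PARENT_ID, ELEMENT_ID,
       SUM(COUNTER_1), SUM(COUNTER_2)
FROM P1_15MIN
WHERE TS >= :hourStart AND TS < :hourEnd
GROUP BY date_trunc('hour', TS), PARENT_ID, ELEMENT_ID;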
Queries are straightforward, as aggregation, grouping and joining are built in. I am not sure about the query performance considering how large the tables are.
Since it is write-intensive, I don't think clustering would help here.
Pros:
Simple configuration (assuming no clusters etc).
SQL query capabilities - flexible
Cons:
Query performance - will it work?
Management overhead
Rigid Schema
Scaling?

Using InfluxDB (or something like that):
I have not used this DB; I am writing based on playing around with it a little.
The model would be to create a time-series for every element in every level and granularity.
The data series name will include the identifiers of the element and the granularity.
For example P.P_ElementID.G.15MIN or P.P_ElementID.C.C1_ELEMENT_ID.G.60MIN
The data series will contain all the counters relevant for that level.
The input has to parse the XML and build the data series name before inserting the new data points.
InfluxDB has an SQL-like query language and allows specifying the calculation in an SQL-like manner. It also supports grouping. Grouping by element would be possible by using a regular expression, e.g. SELECT counter1/counter2 FROM /^P\.P_ElementID\.C1\..*G\.15MIN/ to get all children of ElementID.
There is a notion of grouping by time; in general, it is made for this kind of data.
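For instance, a time roll-up over all children of one parent could look roughly like this (a sketch only, following the series-naming scheme above; I have not verified the exact syntax against a current InfluxDB version):

SELECT SUM(counter1) / SUM(counter2)
FROM /^P\.P_ElementID\.C1\..*G\.15MIN/
WHERE time > now() - 7d
GROUP BY time(1h)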
Pros:
Should be fast
Support queries etc very similar to SQL
Support Deleting by Date (but have to do it on every series...)
Flexible Schema
Cons:
Currently, seems not to support clustering very easily
Clusters = more maintenance
Can it support millions of data-series (and still work fast)?
Less common, less documented (currently)

Related

Combining additive and semi-additive facts in a single report

I'm working on a quarterly report. The report should look something like this:
col                | Calculation                                          | Source table
Start_Balance      | Sum at start of time period                          | Account_balance
Sell Transactions  | Sum of all sell values between the two time periods  | Transactions
Buy Transactions   | Sum of all buy values between the two time periods   | Transactions
End Balance        | Sum at the end of time period                        | Account_balance
so e.g.
Calculation        | sum
Start_Balance      | 1000
Sell Transactions  | 500
Buy Transactions   | 750
End Balance        | 1250
The problem here is that I'm working with a relational star schema, one of the facts is semi-additive and the other is additive, so they behave differently on the time dimension.
In my case I'm using Cognos analytics, but I think this problem goes for any BI tool. What would be best practice to deal with this issue? I'm certain I can come up with some sql query that combines these two tables into one table which the report reads from, but this doesn't seem like best practice, or is it? Another approach would be to create some measures in the BI tool, I'm not a big fan of this approach because it seems to be least sustainable approach, and I'm unfamiliar with it.
For Cognos you can stitch the tables.
The technique has to do with how Cognos aggregates.
Framework Manager joins are typically 1-to-n for describing the relationship: a star schema has the fact table in the middle, representing the n side, with all of the outer tables describing/grouping the data, representing the 1 side.
Fact tables (quantitative data, the stuff you want to sum) should be on the many side of the relationship.
Descriptive tables (qualitative data, the stuff you want to describe or group by) should be on the 1 side (instead of the many).
To stitch, we have multiple tables that we want to treat as facts.
Take the common tables that you would use for grouping, like the period (there are probably others, like company or customer).
Connect each of the fact tables with the common tables (aka dimensions) like this:
Account_balance N to 1 Company
Account_balance N to 1 Period
Account_balance N to 1 Customer
Transactions N to 1 Company
Transactions N to 1 Period
Transactions N to 1 Customer
This will cause Cognos to perform a full outer join with a coalesce (see the sketch below), allowing you to handle the fact tables even though they have different levels of granularity.
Remember that with an outer join you may have to handle nulls, and you may need to use a summary filter depending on your reporting needs.
You will want to include the common tables on your report, which might conflict with how you want the report to look.
An easy workaround is to add them to the layout and then set the property to box type "none", so the SQL behaves the way you want and the report looks the way you want.
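For illustration, the stitched query conceptually resembles the following hand-written SQL (not actual Cognos output; the period/amount column names are made up):

-- Each fact is aggregated separately to the shared grain, then
-- full-outer-joined on the conformed dimension key, which is coalesced.
SELECT COALESCE(ab.period_id, tr.period_id) AS period_id,
       ab.balance_total,
       tr.sell_total,
       tr.buy_total
FROM  (SELECT period_id, SUM(balance) AS balance_total
       FROM Account_balance GROUP BY period_id) ab
FULL OUTER JOIN
      (SELECT period_id,
              SUM(CASE WHEN trans_type = 'SELL' THEN amount END) AS sell_total,
              SUM(CASE WHEN trans_type = 'BUY'  THEN amount END) AS buy_total
       FROM Transactions GROUP BY period_id) tr
  ON ab.period_id = tr.period_id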
You'll probably need to set up determinants in the Framework Manager model. The following does a good job of explaining this:
https://www.ibm.com/docs/en/cognos-analytics/11.0.0?topic=concepts-multiple-fact-multiple-grain-queries

Does a data warehouse need to satisfy 2NF or another normal form?

I'm investigating data warehouses, and I have a question about star schemas.
It's in
Oracle® OLAP Application Developer's Guide
10g Release 1 (10.1)
3.2.1 Dimension Table: TIME_DIM
https://docs.oracle.com/cd/B13789_01/olap.101/b10333/global.htm#CHDCGABE
To represent the hierarchy MONTH -> QUARTER -> YEAR, we need some keys such as YEAR_ID and QUARTER_ID. But there are some things that I do not understand:
1) Why do we need the fields YEAR_DSC & QUARTER_DSC? I think that we can look up these values from the YEAR & QUARTER tables. And it breaks 2NF.
2) What is the normal form that a schema in data warehouse needs to satisfy? (1NF, 2NF, 3NF, or any.)
NFs (normal forms) don't matter for data warehouse base tables.
We normalize to reduce certain kinds of redundancy, so that when we update a database we don't have to say the same thing in multiple places, and so that we can't accidentally fail to say the same thing everywhere it needs to be said. That is not a problem in query results, because we are not updating them. The same is true for a data warehouse's base tables. (Which are also just queries on its original database's base tables.)
Data warehouses are usually optimized for reading speed, and that usually means some denormalization compared to the original database, to avoid recomputation at the expense of space. (Notice, though, that sometimes rereading something bigger can be slower than reading smaller parts and recomputing the big thing.) We probably don't want to drop normalized tables when moving to a data warehouse, because they answer simple queries and we don't want to slow those down by recomputing them. Other than those tradeoffs, there's no reason not to denormalize. Some particular warehouse design methods might have their own rules about which parts should be denormalized and by how much.
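To make the tradeoff concrete, a denormalized time dimension in the style of the Oracle example might carry the descriptions redundantly (an illustrative sketch only; the column list is not copied verbatim from the guide):

-- Star-schema style: one row per month repeats the quarter and year
-- descriptions (QUARTER_DSC, YEAR_DSC), so queries need no extra joins.
CREATE TABLE TIME_DIM (
    MONTH_ID    INT PRIMARY KEY,
    MONTH_DSC   VARCHAR(20),
    QUARTER_ID  INT,
    QUARTER_DSC VARCHAR(20),
    YEAR_ID     INT,
    YEAR_DSC    VARCHAR(20)
);
-- A normalized design would instead keep QUARTER_DSC in a QUARTER table
-- and YEAR_DSC in a YEAR table, looked up via QUARTER_ID / YEAR_ID.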
(Whatever our original database design NF is chosen to be, we should always first normalize to 5NF then consciously denormalize. We don't need to normalize or know constraints to update or query a database.)
Read some textbook basics on why we normalize & why we use data warehouses.

DynamoDB Timeseries Table Design

Scenario:
I have a few weather stations that I'm collecting data for. The data comes in roughly every 15 minutes or so. Each data packet contains several measurements like pressure, temperature, humidity, etc.
The data would be queried in multiple ways:
display latest values for all measurements at a station
display a historical chart for a single measurement (for ex. temperature)
other?
Proposed Tables:
STATIONS: hash-key: station-id
Contains metadata information about the stations
STATION_X_MEASUREMENT_DATA: hash-key: measurement-type, range-key: timestamp
Where X is the station ID. Each record contains the measurement value for a specific measurement type and time. Each station will have its own data table so that the data can be removed by dropping a table when a station is no longer in service.
STATION_SUMMARY: hash-key: station_id
Contains the latest/current values for all measurement types for each station
Questions:
Should I have two separate tables (summary and individual measurements) or should I just query the latest measurements when I want to display the summary?
Should I store the measurement types as individual records or combine them into a single record for a specific timestamp?
If I were to store all measurements in a combined record with timestamp as the range key, would it be worth using minutes or seconds as the partition key? I'm afraid that would make querying more complicated.
Is there anything else I should change/improve? Are there better alternatives?
Should I have two separate tables (summary and individual measurements) or should I just query the latest measurements when I want to display the summary?
I don't see how you can have one table. In the measurement data you will have an item per measurement, while in the summary table every item will have static information about stations. If you are going to put them into a single table, are you going to duplicate the summary information?
Also, having two separate tables allows you to set different RCU/WCU values for the tables. I guess that the station summary is rarely written, so you can set a low WCU and a higher RCU, while measurement data is often written and may not be read so often. Again, your settings can reflect this.
Now, do you want to have a separate table for stations and station summaries? It depends on your data and access patterns, but it is a common pattern to split heavy detailed information into a separate table and a compact representation (maybe a subset of fields) into a different table. It allows you to save a serious number of RCUs if you have requests like get-all-stations, since they probably don't require the detailed info.
Should I store the measurement types as individual records or combined into a single record for a specific timestamp?
The only difference that I see is that you can compress several measurements into a binary blob and store it as one item, if your measurements have some repetition (LZW algorithm?) or if the data does not change much from one measurement to the next (delta encoding?). In the latter case, instead of writing 202, 203, 202, you can write 202, +1, -1 or something like this.
Keep in mind that an item is limited to 400KB so you can't jam a lot of data in one item.
Also keep in mind that for a single partition key you can't have more than 10GB of data, so you need to have a strategy for how you are going to handle that. Notice that this does not depend on number of items or size of individual items.
If you don't have a lot of data, you may be fine having just an item per measurement. If you have a lot of data and you need to decrease AWS costs, then you will probably be better off with compressed arrays of measurements.
If I were to store all measurements in a combined record with timestamp as the range key, would it be worth using minutes or seconds as the partition key? I'm afraid that would make querying more complicated.
Hard to say. How many records do you have per second? Per minute? Maybe it makes sense to aggregate per hour to get better results from compression? Or maybe per day? It depends on your data.
Is there anything else I should change/improve? Are there better alternatives?
You can have different tables for different time intervals. Newer data can have a high WCU/RCU config, while older data will have a low WCU (can you write in the past?) and a lower RCU. Old data can be transferred to S3. Also, you can use DynamoDB TTL to automatically remove old items if you need to.

How does one perform a "range query"?

Google Cloud Dataflow supports what I would call a "full outer join" SQL-like statement through its "CoGroupByKey" method. However, is there any way to implement in Dataflow what would be a "range join" in SQL? For example, suppose I had a table called "people" in which there was a floating-point field called "age", and let's say I wanted all the pairs of people whose ages were within, say, five years of each other. I could write the following statement:
select p1.name, p1.age, p2.name, p2.age
from people p1, people p2
where p1.age between (p2.age - 5.0) and (p2.age + 5.0);
I couldn't determine whether there was a way to accomplish this in Dataflow. (Again, if I wanted a strict equality, I could use a CoGroupByKey, but in this case it's not a strict equality condition.)
For my particular use case, the "people" table is not too large - maybe 500,000 rows and approximately 50 MB of RAM required. So I could, I think, simply run an asList() method to create a single object that sits in a single computer's RAM, sort the people object by age, and then write some sort of routine that walks through the list from the lowest age to the highest age and, while walking through it, outputs those people whose ages are less than 10 years from each other. This would work, but it would be single-threaded etc. I was wondering if there was a "better" way of doing it using the Dataflow architecture. (And other developers may need to find a "Dataflow" way of doing this operation if the object that they are dealing with does not fit nicely into the memory of one single computer, e.g. a people table of maybe 1 billion rows.)
The trick to making this work efficiently at scale is to partition your data into sets of potential matches. In your case, you could assign each person to two different keys: age rounded up to a multiple of 5, and age rounded down to a multiple of 5. Then do a GroupByKey on these buckets, and emit all the pairs within each bucket that are actually close enough in age. You'll need to eliminate duplicates, since it's possible for two records to both end up in the same two buckets.
With this solution, the entire data does not need to fit in memory, just each subset of the data.
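For intuition, the same bucketing idea can be written in the question's SQL terms (a sketch only, using FLOOR(age/5) and the next bucket as the two keys - a slight variant of the rounding described above - and assuming names are unique):

-- Any pair within 5 years shares at least one bucket; the exact
-- condition is re-checked and duplicates are removed with DISTINCT.
WITH keyed AS (
    SELECT name, age, CAST(FLOOR(age / 5.0) AS INT)     AS bucket FROM people
    UNION ALL
    SELECT name, age, CAST(FLOOR(age / 5.0) AS INT) + 1 AS bucket FROM people
)
SELECT DISTINCT p1.name, p1.age, p2.name, p2.age
FROM keyed p1
JOIN keyed p2 ON p1.bucket = p2.bucket
WHERE p1.name < p2.name
  AND ABS(p1.age - p2.age) <= 5.0;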

Fact table design guidance for 100s of facts

I'm trying to create a data mart for a healthcare application. The facts in the data mart are basically going to be measurements and findings related to the heart, and we have hundreds of them - starting from 1000 and going up to as many as 20000 per exam type.
I'm wondering what my design choices for the fact tables are:
Grain: 1 row per patient per exam type.
Some of the choices that I can think of -
1) A big wide fact table with 1000 or more columns.
2) EAV based design - A separate Measure dimension table. This foreign key will go into the fact table and the measure value will be in fact table. So the grain of the fact table will be changed to 1 row per patient per exam type per measurement.
3) Create smaller multiple fact tables per exam type per some other criteria like subgroup. But the end user is going to query across subgroups for that exam type and fact-fact join is not recommended.
4) any other ideas?
Any inputs would be appreciated.
1. A big wide fact table with 1000 or more columns.
One very wide fact table gives the end user maximum flexibility if queries are executed directly against the data warehouse. However, some considerations should be taken into account, as you might hit some limits depending on the platform.
SQL Server 2014 limits are as per below:
Bytes per row: 8,060. Row-overflow storage might be a solution; however, it supports only a few column types that are typically not related to the nature of facts, i.e. varchar, nvarchar, varbinary, sql_variant. It is also not supported in In-Memory OLTP. https://technet.microsoft.com/en-us/library/ms186981(v=sql.105).aspx
Columns per non-wide table: 1,024. Wide tables and sparse columns are a solution, as the limit for columns per wide table is 30,000. However, the same bytes-per-row limit applies. https://technet.microsoft.com/en-us/library/cc280604(v=sql.120).aspx
Columns per SELECT/INSERT/UPDATE statement: 4,096
Non-clustered indexes per table: 999
https://technet.microsoft.com/en-us/library/ms143432(v=sql.120).aspx
2. EAV based design - A separate Measure dimension table. This foreign key will go into the fact table and the measure value will be in fact table. So the grain of the fact table will be changed to 1 row per patient per exam type per measurement.
According to Kimball, the EAV design is called Fact Normalization. It may make sense when the number of measurements is extremely large, but sparsely populated for a given fact, and no computations are made between facts.
Because facts are normalized:
Extensibility is very easy, i.e. it's easy to add new measurements without the need to amend the data structure.
It's good for extracting all measurements for one exam and presenting the measurements as rows on the screen.
It's hard to extract/aggregate/make computations between several measurements (e.g. average HDL to CHOL ratio) and present measurements/aggregates/computations as columns, i.e. it requires complex WHERE/PIVOTING or multi-joins. SQL makes it difficult to make computations between facts in different rows.
If the primary end-user platform is an OLAP cube, then Fact Normalization makes sense. Cubes allow computations across any dimension.
Data importing could be an issue if the data comes in a flat CSV-style format.
This question is also discussed here: Should I use EAV model?.
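A minimal sketch of what the normalized (EAV-style) fact might look like, with hypothetical table and column names:

-- One row per patient, exam and measurement; the measurement itself
-- becomes a dimension rather than a column.
CREATE TABLE MEASURE_DIM (
    MEASURE_ID   INT PRIMARY KEY,
    MEASURE_NAME VARCHAR(100),
    UNIT         VARCHAR(20)
);

CREATE TABLE EXAM_MEASUREMENT_FACT (
    PATIENT_ID    INT,
    EXAM_TYPE_ID  INT,
    EXAM_DATE_ID  INT,
    MEASURE_ID    INT,            -- FK to MEASURE_DIM
    MEASURE_VALUE DECIMAL(18,4),
    PRIMARY KEY (PATIENT_ID, EXAM_TYPE_ID, EXAM_DATE_ID, MEASURE_ID)
);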
3) Create smaller multiple fact tables per exam type per some other criteria like subgroup. But the end user is going to query across subgroups for that exam type and fact-fact join is not recommended.
In some scenarios multiple smaller fact tables make perfect sense. One of the reasons is if you hit some physical limits set by the platform, e.g. bytes per row.
The facts could be grouped either by subject area, e.g. measurement group/subgroup, or by frequency of usage. Each table could be placed on a separate filegroup and drive to maximize I/O.
Further, you could duplicate measurements across different fact tables to reduce the need for fact-table joins, i.e. put one measurement both in a specific measurement-subgroup fact table and in a frequently-used-measurements fact table.
However some considerations should be taken into account if there are some specific requirements for data loading. For example, if a record errors out in your ETL to one fact table, you might want to make sure that the corresponding records in the other fact tables are deleted and staged to your error table so you don't end up with any bogus information. This is especially true if end users have their own calculations in the front end tool.
If you use OLAP cubes, then each of the multiple fact tables simply becomes the source of a separate measure group.
In terms of fact-to-fact joins, you (the BI application) should never issue SQL that joins two fact tables together across the fact tables' foreign keys. Instead, the technique of drilling across two fact tables should be used, where the answer sets from two or more fact tables are created separately, and the results are sort-merged on the common row-header attribute values to produce the correct result (a sketch follows the link below).
More on this topic: http://www.kimballgroup.com/2003/04/the-soul-of-the-data-warehouse-part-two-drilling-across/
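In SQL terms, drilling across conceptually looks like separate aggregate queries merged only on the conformed dimension attributes (an illustrative sketch; the table and column names are hypothetical):

-- Each fact table is aggregated to the same grain on its own, then the
-- result sets are merged on the shared dimension key.
SELECT COALESCE(a.patient_id, b.patient_id) AS patient_id,
       a.avg_hdl,
       b.avg_chol
FROM  (SELECT patient_id, AVG(measure_value) AS avg_hdl
       FROM hdl_measurement_fact GROUP BY patient_id) a
FULL OUTER JOIN
      (SELECT patient_id, AVG(measure_value) AS avg_chol
       FROM chol_measurement_fact GROUP BY patient_id) b
  ON a.patient_id = b.patient_id;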
4) any other ideas?
SQL XML or some kind of NoSQL could be an option, but the same querying / aggregation / computation / presentation issues exist.
