I am using a time series database (InfluxDB) and I am trying to understand how to design a measurement (table).
My background is in relational databases, where it is common to join tables.
In my current project we are writing different sensor values (such as temperature and pressure) for many
vehicles to a measurement, along with associated identifiers, so that we know the specific details of
each value we measure.
Measurement: Sensor_Trans
Tags: time, vehicleId, sensorId
Fields: value (temperature or pressure)
Later, when I want to use these values, I need additional details about the specific values.
Note that I currently have 20+ unique tags for each sensor measurement, such as
unit of measure, size of vehicle, sensor description, etc.
For example: I want to know the engine pressure in kPa for all cars with four doors.
For example: I want to know the exhaust temperature in degrees C for truck 89.
I'd like to know what is considered best practice when designing time series measurements (tables).
1- Do I add more tags that provide the additional information directly to the measurement?
2- Do I keep the Vehicle and Sensor definitions in a relational table and join in code?
3- Other?
1- Do I add more tags that provide the additional information directly to the measurement? Yes, you can do that, but keep in mind that adding more tags also consumes more memory, because each distinct combination of tag values creates a separate series. Please refer to the system requirements at the following link:
https://docs.influxdata.com/influxdb/v1.7/guides/hardware_sizing/
2- Do I keep the Vehicle and Sensor definitions in a relational table and join in code? There is no need to if you implement the above. Alternatively, you could design relational database tables for your entire need instead of keeping two different databases.
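For comparison, here is a minimal Python sketch of what option 2 ("join in code") could look like. The dictionaries stand in for relational Vehicle and Sensor tables, and `points` stands in for rows returned by the time series database. All ids, attribute names and values are hypothetical, and no InfluxDB client API is used.

```python
# Hypothetical stand-ins for relational lookup tables, keyed by id.
VEHICLES = {
    "car-1": {"type": "car", "doors": 4},
    "car-2": {"type": "car", "doors": 2},
    "truck-89": {"type": "truck", "doors": 2},
}
SENSORS = {
    "s-eng-p": {"description": "engine pressure", "unit": "kPa"},
    "s-exh-t": {"description": "exhaust temperature", "unit": "degC"},
}

# Hypothetical rows as they might come back from the time series store.
points = [
    {"time": 1, "vehicleId": "car-1", "sensorId": "s-eng-p", "value": 101.3},
    {"time": 1, "vehicleId": "car-2", "sensorId": "s-eng-p", "value": 99.8},
    {"time": 2, "vehicleId": "truck-89", "sensorId": "s-exh-t", "value": 412.0},
]


def enrich(point):
    """Merge vehicle and sensor attributes into one flat record."""
    return {**point, **VEHICLES[point["vehicleId"]], **SENSORS[point["sensorId"]]}


def engine_pressure_four_door_cars(rows):
    """Engine pressure readings for all cars with four doors."""
    return [
        r
        for r in map(enrich, rows)
        if r["description"] == "engine pressure"
        and r["type"] == "car"
        and r["doors"] == 4
    ]


result = engine_pressure_four_door_cars(points)
```

The trade-off is that the measurement only carries vehicleId and sensorId, while every query that filters on descriptive attributes has to do this extra lookup step.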
I need to write a small ETL pipeline to move some data from a source database to a target database (a data warehouse) so that I can perform some analysis on it.
As part of this, I need to clean and conform the names of cities. Cities are entered manually by international users, so a single city can have multiple names (for example London or Londra).
My source database contains not only big cities but also small villages.
If I do not standardize the city names, our analysis could be nonsensical.
What are the best practices for standardizing city names in my target database? Do you have any ideas or suggestions I could pursue?
Thank you
The only reliable way to do this is to use commercial address validation software - preferably in your source system when the data is being created but it could be integrated into your data pipeline processes.
Assuming you can't afford/justify the use of commercial software, the only other solution is to create your own translation table i.e. a table that holds the values that are entered and what value you want them to be translated to.
While you can build this table from historic data, there will always be new values that are not in the table, so you would need a process to identify these, add the new records to your translation table and then fix the affected records. You would also need to accept that there would be uncleansed data in your warehouse for a period of time after each data load.
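A rough Python sketch of the translation-table approach described above, under the assumption that the table is a simple mapping from the entered value to the canonical name. The table contents and function names are hypothetical. Unmapped values are passed through and reported so they can be reviewed and added to the table later.

```python
# Hypothetical translation table: value as entered -> canonical city name.
CITY_TRANSLATIONS = {
    "london": "London",
    "londra": "London",
    "londres": "London",
}


def normalize_key(raw):
    """Collapse whitespace and lowercase so trivial variants share one entry."""
    return " ".join(raw.split()).lower()


def conform_cities(raw_values, translations):
    """Return (conformed values, set of unmapped raw values).

    Unmapped values are kept unchanged so the load can proceed; they are
    collected separately for someone to review and add to the table.
    """
    conformed, unmapped = [], set()
    for raw in raw_values:
        canonical = translations.get(normalize_key(raw))
        if canonical is None:
            unmapped.add(raw)
            conformed.append(raw)
        else:
            conformed.append(canonical)
    return conformed, unmapped


conformed, unmapped = conform_cities(
    ["London", " Londra ", "Lonndon"], CITY_TRANSLATIONS
)
```

Here the misspelled "Lonndon" ends up in `unmapped`, which corresponds to the period of uncleansed data mentioned above until the table is extended and the affected records are fixed.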
I am trying to use kdb+ to capture and aggregate a number of sensory streams collated from IoT sensors.
Each sensor has a unique identifier, a time component (.z.z) and a scalar value:
percepts:([]time:`datetime$(); id:`symbol$(); scalar:`float$())
However because the data is temporal in nature, it would seem logical to maintain separate perceptual/sensory streams in different columns, i.e.:
time id_1 id_2 ...
15 0.15 ...
16 ... 1.5
However, appending to a table seemingly only supports row operations in the insert fashion, i.e. percepts insert (.z.z; `id_1; 0.15)
Seeing as I would like to support a large and non-static number of sensors in this setup, it would seem like an anti-pattern to append rows in the aforementioned format and then transform the rows into columns based on their id. Would it be possible/necessary to create a table with a dynamic (growing) number of columns based upon new feature streams?
How would one most effectively implement logic that allows the insertion of columnar time series data averting the need to do a transform on row based data?
You can add data to a specific column. To do that, make the following changes:
Make the time column a key, either permanently or during the update operation.
Use upsert to add data, and pass the data in table format.
The update function below is specific to your example, but you can make it more generic. It takes a sensor name and the sensor data as input and handles three cases:
If the table is empty, it sets the table schema to the schema of the input dataset (which, according to your example, should be the time and sensor name columns) and makes time the primary key.
If the table has data but the column for the new sensor is missing, it first adds a column of null float values and then upserts the data.
If the column is already there, it just upserts the data.
q)t:() / table to store all sensors data
q)upd:{[s;tbl] `t set $[0=count t;`time xkey 0#tbl;not s in cols t;![t;();0b;enlist[s]!enlist count[t]#0Nf];t] upsert tbl}
q)upd[`id1;([]time:1#.z.z;id1:1#14.4)]
q)upd[`id2;([]time:1#.z.z;id2:1#2.3)]
q)t
time                    id1  id2
--------------------------------
2019.08.26T13:35:43.203 14.4
2019.08.26T13:35:46.861      2.3
Some points regarding your design:
If not every sensor sends data for each time entry, the table will have a lot of null values (similar to a sparse matrix), which wastes memory and will have some impact on queries as well.
In that case, you can consider other designs depending on your use case. For example, instead of storing each time entry, store data in time buckets. Another option is to group related sensors in separate tables instead of storing them all in one.
Another point to consider is that you will end up with a fat table if you keep adding sensors to it, and that has its own issues. It will also become a single bottleneck, which could be an issue in the future, and scaling it would be hard.
For small sensor sets, the current design is good but if you are planning to add many sensors in future then look into other design options.
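To illustrate the time-bucket idea mentioned above, here is a small sketch in Python (not q, and not tied to any kdb+ API). It assumes readings arrive in the row format from the question as (time, id, scalar), with time as plain epoch seconds, and averages them per sensor per bucket. The function name and sample values are hypothetical.

```python
from collections import defaultdict


def bucket_means(readings, bucket_seconds):
    """Average (time, sensor_id, value) rows per sensor per time bucket.

    Returns a dict {(bucket_start, sensor_id): mean_value}.
    """
    sums = defaultdict(lambda: [0.0, 0])  # key -> [running total, count]
    for t, sensor_id, value in readings:
        bucket_start = (t // bucket_seconds) * bucket_seconds
        acc = sums[(bucket_start, sensor_id)]
        acc[0] += value
        acc[1] += 1
    return {key: total / n for key, (total, n) in sums.items()}


readings = [
    (15, "id_1", 0.15),
    (16, "id_2", 1.5),
    (17, "id_1", 0.25),
    (75, "id_1", 0.4),
]
buckets = bucket_means(readings, 60)
```

Keeping the data keyed by (bucket, sensor) avoids the wide, mostly-null table, at the cost of losing the individual samples inside each bucket.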
I'm investigating data warehouses, and I have a question about star schemas.
It's in
Oracle® OLAP Application Developer's Guide
10g Release 1 (10.1)
3.2.1 Dimension Table: TIME_DIM
https://docs.oracle.com/cd/B13789_01/olap.101/b10333/global.htm#CHDCGABE
To represent the hierarchy MONTH -> QUARTER -> YEAR, we need some keys such as: YEAR_ID, QUARTER_ID. But there are some things that I do not understand:
1) Why do we need the fields YEAR_DSC & QUARTER_DSC? I think that we can look up these values from the YEAR & QUARTER tables. And it breaks 2NF.
2) What is the normal form that a schema in data warehouse needs to satisfy? (1NF, 2NF, 3NF, or any.)
NFs (normal forms) don't matter for data warehouse base tables.
We normalize to reduce certain kinds of redundancy, so that when we update a database we don't have to say the same thing in multiple places, and so that we can't accidentally fail to say it in one of the places where it would need to be said. That is not a problem in query results because we are not updating them. The same is true for a data warehouse's base tables. (Which are also just queries on its original database's base tables.)
Data warehouses are usually optimized for reading speed, and that usually means some denormalization compared to the original database to avoid recomputation at the expense of space. (Notice though that sometimes rereading something bigger can be slower than reading smaller parts and recomputing the big thing.) We probably don't want to drop normalized tables when moving to a data warehouse, because they answer simple queries and we don't want to slow down by recomputing them. Other than those tradeoffs, there's no reason not to denormalize. Some particular warehouse design methods might have their own rules about which parts should be denormalized and by how much.
(Whatever our original database design NF is chosen to be, we should always first normalize to 5NF then consciously denormalize. We don't need to normalize or know constraints to update or query a database.)
Read some textbook basics on why we normalize & why we use data warehouses.
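As a concrete illustration of "denormalize to avoid recomputation", the following Python sketch flattens normalized month, quarter and year lookups into one wide row per month, which is roughly the shape of a dimension table like TIME_DIM. YEAR_ID, QUARTER_ID, YEAR_DSC and QUARTER_DSC come from the question; MONTH_ID, MONTH_DSC and all the sample values are assumptions made for the example.

```python
# Normalized lookup tables (hypothetical ids and descriptions).
YEARS = {1: {"YEAR_DSC": "2004"}}
QUARTERS = {10: {"QUARTER_DSC": "Q1-04", "YEAR_ID": 1}}
MONTHS = {
    100: {"MONTH_DSC": "Jan-04", "QUARTER_ID": 10},
    101: {"MONTH_DSC": "Feb-04", "QUARTER_ID": 10},
}


def build_time_dim(months, quarters, years):
    """Flatten MONTH -> QUARTER -> YEAR into one wide row per month."""
    rows = []
    for month_id, month in sorted(months.items()):
        quarter = quarters[month["QUARTER_ID"]]
        year = years[quarter["YEAR_ID"]]
        rows.append(
            {
                "MONTH_ID": month_id,
                "MONTH_DSC": month["MONTH_DSC"],
                "QUARTER_ID": month["QUARTER_ID"],
                "QUARTER_DSC": quarter["QUARTER_DSC"],
                "YEAR_ID": quarter["YEAR_ID"],
                "YEAR_DSC": year["YEAR_DSC"],
            }
        )
    return rows


time_dim = build_time_dim(MONTHS, QUARTERS, YEARS)
```

The descriptions are repeated on every month row, which is exactly the redundancy normalization would remove, but a reader of the warehouse never has to redo the two lookups.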
Time-series data such as historical stock prices are usually stored in an RDBMS.
I am evaluating various options to use this data, possibly store it in doc store or triple store in MarkLogic, and build some use cases on this data and/or along with the other kind of data stored in the doc/triple store.
Essentially, I am looking for ways to
Store time series data such as historical stock prices in a MarkLogic database.
Ways to query this data (stored in ML or queried across the RDBMS), through XQuery for example.
Ways to query this data, along with the other data stored in the doc/triple store.
I would appreciate any recommendations in this regard.
Added some more info...
I am trying to figure out a neat way of capturing this data as triples. The idea is that it would be nice to link this data with other related data. For example, if the historical stock price we are trying to store is for HSBC listed on NYSE, then we can in some way define resources for HSBC and NYSE, capture the stock price as literals (perhaps), and then link the resource HSBC with, for example, the company information stored in dbpedia.
Essentially, I am talking about creating linked data, such that it is easy to query across data fetched from different sources and also if possible, try to use inferencing. For example, if I use this approach, it would be possible for me to run a query such as 'Get me the stock price of the companies headquartered in London, whose turnover is greater than $1billion'.
You have 2 alternatives. Either you have 1 big document for each series, or you have 1 document per price. The former is not recommended, as the latter lets you make better use of the index system, especially a range index on the timestamp.
I worked on a system using MarkLogic, which was essentially a system to store time series. We used 1 document per point in the series (as well as 1 document for the series itself, for its "metadata", i.e. all information common across all the points in the series). We also put all documents relating to 1 series in 1 collection. We used a naming scheme for the document URIs based on the timestamp and a unique ID per series, so we could easily guarantee the uniqueness of the document URIs.
An important point is to have the series point documents reference their series document (either explicitly or just by being in the same collection), instead of the other way around.
As for querying, it depends on your specific use cases, but typically you will use a search constraint on the collection to identify one (or several) series, and a range index on the timestamp to select a "slice" of points in the series. If you have use cases like selecting points based on their value (instead of their time), you can do it as efficiently as you do it based on the timestamp, by using a range index on the values themselves.
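To make the layout concrete, here is a plain Python sketch of the "1 document per point" scheme. It does not use any MarkLogic API: the URI pattern, collection naming, series id and prices are all hypothetical, and the in-memory filter only emulates what a collection constraint plus a range query on the timestamp would select.

```python
from datetime import datetime, timezone


def point_uri(series_id, ts):
    """Build a unique URI from the series id and the point's timestamp."""
    return f"/series/{series_id}/{ts.strftime('%Y%m%dT%H%M%SZ')}.json"


def make_point(series_id, ts, value):
    """One document per point; the collection links it back to its series."""
    return {
        "uri": point_uri(series_id, ts),
        "collection": f"series-{series_id}",
        "timestamp": ts,
        "value": value,
    }


def slice_points(docs, series_id, start, end):
    """Emulate a collection constraint plus a timestamp range [start, end)."""
    wanted = f"series-{series_id}"
    return [
        d
        for d in docs
        if d["collection"] == wanted and start <= d["timestamp"] < end
    ]


def utc(day):
    """Helper: midnight UTC on the given day of January 2020."""
    return datetime(2020, 1, day, tzinfo=timezone.utc)


# Made-up prices for a made-up series id.
docs = [make_point("hsbc-nyse", utc(d), 40.0 + d) for d in (1, 2, 3)]
selected = slice_points(docs, "hsbc-nyse", utc(2), utc(4))
```

Because the URI combines the series id with the timestamp, two points can only collide if the same series has two values for the same instant.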
I would recommend storing time-series data in a time-series database: https://en.wikipedia.org/wiki/Time_series_database
Update 1:
You can define HSBC as an entity, specify meta-data for the entity such as location or headcount, and then store quarterly revenue and traded tick prices as separate time-series. Then you can run queries that a) filter by a meta-data tag such as Location and b) filter by an aggregation, e.g. MAX(price). I would actually store headcount as a series as well. This way I can investigate correlations between different series for research and analytics.
I'm new to TSDBs and I have a lot of temperature sensors to store in my database, with one point per second. Is it better to use one unique metric per sensor, or only one metric (temperature, for example) with distinct tags per sensor?
I searched the Internet for the best practice, but I didn't find a good answer...
Thank you! :-)
Edit:
I will have 8 types of measurements (temperature, setpoint, energy, power,...) from 2500 sources
If you are storing your data in InfluxDB, I would recommend storing all the metrics in a single measurement and using tags to differentiate the sources, rather than creating a measurement per source. The reason is that you can trivially merge or decompose the metrics using tags within a measurement, but it is not possible in the newest InfluxDB to merge or join across measurements.
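As a sketch of what "a single measurement with tags per source" might look like on the wire, the following Python helper formats points as InfluxDB line protocol (`measurement,tag=value field=value timestamp`). The measurement name `readings`, the tag key `source` and the sample values are hypothetical, and the helper is deliberately simplified: it does not escape spaces, commas or equals signs and treats every field as a float.

```python
def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Format one point as InfluxDB line protocol.

    Simplified: assumes names and values contain no spaces, commas or
    equals signs (real line protocol requires escaping those) and that
    all field values are floats.
    """
    tag_part = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_part = ",".join(f"{k}={float(v)}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_part} {field_part} {timestamp_ns}"


# Same measurement for every source; the tag tells the sources apart.
ts = 1438560000000000000  # 2015-08-03T00:00:00Z in nanoseconds
line_a = to_line_protocol(
    "readings", {"source": "s0001"}, {"temperature": 21.5, "setpoint": 22.0}, ts
)
line_b = to_line_protocol(
    "readings", {"source": "s0002"}, {"temperature": 19.0, "setpoint": 22.0}, ts
)
```

With this layout, a query can group or filter by the `source` tag inside one measurement instead of having to combine 2500 separate measurements.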
Ultimately the decision rests with both your choice of TSDB and the queries you care most about running.
For comparison purposes, in Axibase Time-Series Database you can store temperature as a metric and the sensor id as the entity name. The ATSD schema has a notion of an entity, which is the name of the system for which the data is being collected. The advantage is more compact storage and the ability to define tags for the entities themselves, for example sensor location, sensor type etc. This way you can filter and group results not just by sensor id but also by sensor tags.
To give you an example, in this blog article 0601911 stands for entity id - which is EPA station id. This station collects several environmental metrics and at the same time is described with multiple tags in the database: http://axibase.com/environmental-monitoring-using-big-data/.
The bottom line is that you don't have to stage a second database, typically a relational one, just to store extended information about sensors, servers etc. for advanced reporting.
UPDATE 1: Sample network command:
series e:sensor-001 d:2015-08-03T00:00:00Z m:temperature=42.2 m:humidity=72 m:precipitation=44.3
Tags that describe sensor-001, such as location, type, etc., are stored separately, minimizing storage footprint and speeding up queries. If you're collecting energy/power metrics, you often have to attach attributes such as Status to a series, because data may not arrive clean/verified. You can use series tags for this purpose.
series e:sensor-001 d:2015-08-03T00:00:00Z m:temperature=42.2 ... t:status=Provisional
You should use one metric per sensor. You probably won't be needing to aggregate values from different temperature sensors, but you will be needing to aggregate values of a given sensor (average over a minute for instance).
Metrics correspond to data coming from the same source, or at least data you will be likely to aggregate. You can create almost as many metrics as you want (up to 16 million metrics in OpenTSDB for instance).
Tags make distinctions between these pieces of data. For instance, you could tag data differently if they suddenly change a lot, in order to retrieve only relevant data if needed, without losing the rest of the data. Although for a temperature sensor producing data every second, the best approach would probably be to filter and only store data when the value changes...
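The "only store data when the value changed" idea can be sketched in Python as a simple filter applied before writing to the database. The function name and the optional `min_delta` threshold are assumptions added for illustration and are not part of any TSDB API.

```python
def changed_only(samples, min_delta=0.0):
    """Keep a (time, value) sample only if it differs from the last kept
    value by more than min_delta. The first sample is always kept."""
    kept = []
    for t, value in samples:
        if not kept or abs(value - kept[-1][1]) > min_delta:
            kept.append((t, value))
    return kept


samples = [(0, 20.0), (1, 20.0), (2, 20.0), (3, 20.5), (4, 20.5), (5, 20.0)]
kept = changed_only(samples)
```

With one point per second from a slowly changing temperature sensor, most samples repeat the previous value, so a filter like this can cut the stored volume considerably. The reader then has to treat the last stored value as still valid until the next point.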
Best practices are summed up here