InfluxDB - Keep timestamp of record on downsampling

InfluxDB version used: 1.8.0
I have a time series database that stores, for example, temperatures from IoT sensors at different locations.
The sensors are queried roughly every other minute.
The maximum temperature per sensor for the last hour can be queried using:
select max(*) from temperatures where time >= now() - 1h group by location
name: temperatures
tags: location=collector
time max_temperature
---- ---------------
2020-06-24T17:41:34Z 34.8
name: temperatures
tags: location=outside
time max_temperature
---- ---------------
2020-06-24T17:43:34Z 23.4
I would now like to keep the maximum temperatures for every hour and for every day for a certain period of time.
So naturally I would use a retention policy and continuous queries.
Let's say I want to store the maximum temperature per hour for a month:
show RETENTION POLICIES on iotsensors
name duration shardGroupDuration replicaN default
---- -------- ------------------ -------- -------
lastmonth 744h0m0s 24h0m0s 1 false
The continuous query looks like this:
CREATE CONTINUOUS QUERY max_temperatures_per_hour ON iotsensors
BEGIN
SELECT max(temperature) INTO iotsensors.lastmonth.max_temperatures_per_hour FROM iotsensors.autogen.temperatures GROUP BY time(1h), location TZ('Europe/Berlin')
END
By the nature of the GROUP BY time(1h) clause, the exact time of the temperature reading is lost.
When the data is condensed to a whole day in a second step (FROM iotsensors.lastmonth.max_temperatures_per_hour GROUP BY time(1d)), the resolution becomes even coarser: each timestamp is set to midnight (00:00:00) of that day.
select max from iotmeasurements.last2years.max_temperatures_per_day where time >= now() - 4d group by location tz('Europe/Berlin')
name: max_temperatures_per_day
tags: location=collector
time max
---- ---
2020-06-21T00:00:00+02:00 80.9
2020-06-22T00:00:00+02:00 78.5
2020-06-23T00:00:00+02:00 101.2
name: min_max_temperatures_per_day
tags: location=outside
time max
---- ---
2020-06-21T00:00:00+02:00 21.8
2020-06-22T00:00:00+02:00 22.5
2020-06-23T00:00:00+02:00 22.8
I know that this is the expected and documented behaviour:
https://docs.influxdata.com/influxdb/v1.8/query_language/explore-data/#group-by-time-intervals
However, the information about when exactly the maximum value was recorded is valuable, and I'd like to keep it.
Is there any way to store the exact timestamp of the record when downsampling?
I'd prefer to keep the timestamp inside the time field like
tags: location=collector
time max
---- ---
2020-06-20T04:30:40Z 80.9
2020-06-21T04:22:00Z 78.5
2020-06-22T04:53:10Z 101.2
Alternatively, as a second-best solution, a timestamp field could be added to each downsampled record:
time max timestamp
---- --- ---------
2020-06-20T00:00:00+02:00 80.9 2020-06-20T04:30:40Z
2020-06-21T00:00:00+02:00 78.5 2020-06-21T04:22:00Z
2020-06-22T00:00:00+02:00 101.2 2020-06-22T04:53:10Z
For this I would need to be able to select the time into a separate field, wouldn't I?
But my attempts haven't been successful so far. One thing I tried was this:
SELECT max(temperature),time as timestamp FROM temperatures GROUP BY time(60m),"location"
I'd consider moving to InfluxDB 2.0 if that were a prerequisite for solving my problem.

So far I haven't found a solution that uses InfluxDB alone.
The original question was based on the misconception that there is always exactly one maximum value within the time frame used for downsampling.
Given a series of data points like this:
name: max_temperatures_per_day
tags: location=collector
time max
---- ---
2020-06-20T04:30:40Z 80.9
2020-06-21T04:22:00Z 78.5
2020-06-22T04:53:10Z 101.2
2020-06-22T05:33:10Z 73.3
2020-06-22T05:41:10Z 65.0
2020-06-22T05:53:10Z 48.2
2020-06-22T05:56:10Z 73.3
2020-06-22T10:30:10Z 54.3
2020-06-22T12:30:10Z 63.7
2020-06-22T18:03:10Z 101.2
2020-06-22T18:20:10Z 90.2
it is possible to identify exactly one point in time with the maximum value for the 04:00 hour (2020-06-22T04:53:10Z, 101.2), but not for the 05:00 hour, since the maximum value (73.3) occurred at 5:33 as well as at 5:56.
Downsampling the data to a resolution of one day (24h) makes it even worse, as the maximum value (101.2) occurred at 4:53 AM as well as at 6:03 PM on that day. Which of these possibly multiple points in time should be kept?
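To make the tie problem concrete, here is a minimal client-side sketch in Python (not InfluxQL) that groups points into buckets and returns the maximum together with every timestamp at which it occurred. The helper name max_with_times is hypothetical, and the buckets are cut in UTC for simplicity rather than in Europe/Berlin as in the queries above.

```python
from collections import defaultdict
from datetime import datetime, timezone


def parse(ts):
    """Parse an RFC3339 UTC timestamp such as 2020-06-22T04:53:10Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def max_with_times(points, bucket_format):
    """Group (timestamp, value) pairs by bucket.

    Returns {bucket: (max_value, [timestamps where the max occurred])}.
    The bucket key is the timestamp formatted with bucket_format (UTC).
    """
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[parse(ts).strftime(bucket_format)].append((ts, value))
    result = {}
    for bucket, rows in buckets.items():
        peak = max(value for _, value in rows)
        result[bucket] = (peak, [ts for ts, value in rows if value == peak])
    return result


points = [
    ("2020-06-22T04:53:10Z", 101.2),
    ("2020-06-22T05:33:10Z", 73.3),
    ("2020-06-22T05:41:10Z", 65.0),
    ("2020-06-22T05:53:10Z", 48.2),
    ("2020-06-22T05:56:10Z", 73.3),
    ("2020-06-22T10:30:10Z", 54.3),
    ("2020-06-22T12:30:10Z", 63.7),
    ("2020-06-22T18:03:10Z", 101.2),
    ("2020-06-22T18:20:10Z", 90.2),
]

hourly = max_with_times(points, "%Y-%m-%dT%H")  # one bucket per hour
daily = max_with_times(points, "%Y-%m-%d")      # one bucket per day
```

Any bucket whose timestamp list has more than one entry is a tie, so a downsampling job has to pick a rule (first, last, or keep all) for it.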
However, by using Kapacitor to carry out the continuous queries, the originally desired result can be achieved.
Starting from this article, https://docs.influxdata.com/kapacitor/v1.5/guides/continuous_queries/, it's possible to set up a query like this:
batch
    |query('SELECT * FROM "iotmeasurements"."autogen".temperatures')
        .period(1h)
        .every(1h)
        .groupBy('location')
        .align()
    |max('temperature')
        .as('max_temp')
        .usePointTimes()
    |influxDBOut()
        .database('iotmeasurements')
        .retentionPolicy('lastmonth')
        .measurement('max_temperatures')
        .precision('s')
This keeps the time of the point where the maximum value occurred first. In the example above, the data point at 5:33 AM would be kept and the same value at 5:56 AM would be skipped.
I'm not entirely sure whether usePointTimes() (https://docs.influxdata.com/kapacitor/v1.5/nodes/influx_q_l_node/#usepointtimes) is needed.
If losing the record of later occurrences of the maximum value within the downsampling time frame is acceptable, this might be a solution. It does require running a second service, though, which adds another possible point of failure.
Another disadvantage of using Kapacitor is that it does not seem to be possible to downsample data from the past.
One may run a GROUP BY time query like SELECT max(temperature) INTO ... FROM temperatures WHERE time >= now() - 1w GROUP BY time(1h),"location" outside a continuous query to downsample past measurement points inside InfluxDB itself.
There seems to be no way of doing so with Kapacitor TICKscripts.

Related

How to obtain time interval value reports from InfluxDB

Using InfluxDB: Is there any way to build a time-bucketed report of a field value representing a state that persists over time? Ideally in InfluxQL query language
More specifically as an example: Say a measurement contains points that report changes in the light bulb state (On / Off). They could be 0s and 1s as in the example below, or any other value. For example:
time light
---- -----
2022-03-18T00:00:00Z 1
2022-03-18T01:05:00Z 0
2022-03-18T01:55:00Z 0
2022-03-18T02:30:00Z 1
2022-03-18T04:06:00Z 0
The result should be a listing of intervals indicating if this light was on or off during each time interval (e.g. hours), or what percentage of that time it was on. For the given example, the result if grouping hourly should be:
Hour              Value
----              -----
2022-03-18 00:00  1.00
2022-03-18 01:00  0.17
2022-03-18 02:00  0.50
2022-03-18 03:00  1.00
2022-03-18 04:00  0.10
Note that:
for 1am bucket, even if the light starts and ends in On state, it was On for only 10 over 60 minutes, so the value is low (10/60)
and more importantly the bucket from 3am to 4am has value "1" as the light was On since the last period, even if there was no change in this time period. This rules out usage of simple aggregation (e.g. MEAN) over a GROUP BY TIME(), as there would not be any way to know if an empty/missing bucket corresponds to an On or Off state as it only depends on the last reported value before that time bucket.
Is there a way to implement it in pure InfluxQL, without retrieving potentially big data sets (points) and iterating through them in a client?
I assume that the raw data can be obtained with a query like this:
SELECT "light" FROM "test3" WHERE $timeFilter
Here "test3" is your measurement name and $timeFilter is the from/to time range.
In this case we need a subquery that fills the gaps in the data. Let's use a grouping (resolution) interval of 1s:
SELECT last("light") as "filled_light" FROM "test3" WHERE $timeFilter GROUP BY time(1s) fill(previous)
This query gives us 1/0 value every 1s. We will use it as a subquery.
NOTE: This approach does not take into account whether the light was on or off at the start of the $timeFilter range. It produces no data for the hours before the first value inside $timeFilter.
In the next step, apply the integral() function to the data returned by the subquery, like this:
SELECT integral("filled_light",1h) from (SELECT last("light") as "filled_light" FROM "test3" WHERE $timeFilter GROUP BY time(1s) fill(previous)) group by time(1h)
This is not a perfect way of getting it to work but I hope it resolves your problem.
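For comparison, the time-weighted calculation that the fill(previous) plus integral() query approximates can be sketched client-side in Python. This is only an illustration under assumptions: the function name on_fraction_per_bucket and its initial parameter (the state assumed before the first change) are hypothetical, and the state changes are taken to be already fetched and sorted by time.

```python
from datetime import datetime, timedelta


def on_fraction_per_bucket(changes, start, end, bucket=timedelta(hours=1), initial=0):
    """Return [(bucket_start, fraction_on)] for each bucket in [start, end).

    changes: list of (datetime, state) sorted by time, state is 0 or 1.
    The last reported state is carried forward into later buckets.
    """
    result = []
    state = initial
    i = 0
    t = start
    while t < end:
        bucket_end = min(t + bucket, end)
        on = timedelta(0)
        cursor = t
        # Apply every state change that falls before the end of this bucket.
        while i < len(changes) and changes[i][0] < bucket_end:
            change_time, new_state = changes[i]
            if change_time > cursor:
                if state:
                    on += change_time - cursor
                cursor = change_time
            state = new_state
            i += 1
        # The state after the last change holds until the bucket ends.
        if state:
            on += bucket_end - cursor
        result.append((t, on / (bucket_end - t)))
        t = bucket_end
    return result


changes = [
    (datetime(2022, 3, 18, 0, 0), 1),
    (datetime(2022, 3, 18, 1, 5), 0),
    (datetime(2022, 3, 18, 1, 55), 0),
    (datetime(2022, 3, 18, 2, 30), 1),
    (datetime(2022, 3, 18, 4, 6), 0),
]
fractions = on_fraction_per_bucket(
    changes, datetime(2022, 3, 18, 0, 0), datetime(2022, 3, 18, 5, 0)
)
for bucket_start, fraction in fractions:
    print(bucket_start.strftime("%Y-%m-%d %H:%M"), round(fraction, 2))
```

With the sample points exactly as listed in the question, the 03:00 bucket comes out as 1.0 because the On state is carried forward. The 01:00 bucket comes out as 5/60, since the listing has the light Off at 01:55.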

Showing hourly average (histogramm) in grafana

Given a timeseries of (electricity) marketdata with datapoints every hour, I want to show a Bar Graph with all time / time frame averages for every hour of the data, so that an analyst can easily compare actual prices to all time averages (which hour of the day is most/least expensive).
We have cratedb as backend, which is used in grafana just like a postgres source.
SELECT
extract(HOUR from start_timestamp) as "time",
avg(marketprice) as value
FROM doc.el_marketprices
GROUP BY 1
ORDER BY 1
So my data basically looks like this
time value
23.00 23.19
22.00 25.38
21.00 29.93
20.00 31.45
19.00 34.19
18.00 41.59
17.00 39.38
16.00 35.07
15.00 30.61
14.00 26.14
13.00 25.20
12.00 24.91
11.00 26.98
10.00 28.02
9.00 28.73
8.00 29.57
7.00 31.46
6.00 30.50
5.00 27.75
4.00 20.88
3.00 19.07
2.00 18.07
1.00 19.43
0 21.91
After hours of fiddling around with Bar Graphs, Histogram Mode, the Heatmap Panel and much more, I am just not able to draw a simple hours-of-the-day histogram with this in Grafana. I would very much appreciate any advice on how to use any panel to get this accomplished.
Your query doesn't return valid time series data for Grafana: the time field is not a valid timestamp. So don't extract only the hour; provide the full start_timestamp instead (I hope it has a timestamp data type and its value is in UTC).
Add a WHERE time condition using Grafana's macro $__timeFilter.
Use Grafana's macro $__timeGroupAlias for hourly grouping.
SELECT
$__timeGroupAlias(start_timestamp,1h,0),
avg(marketprice) as value
FROM doc.el_marketprices
WHERE $__timeFilter(start_timestamp)
GROUP BY 1
ORDER BY 1
This will give you data for a historical graph with hourly average values.
The required histogram may be tricky, but you can try to create a metric that holds the extracted hour, e.g.
SELECT
$__timeGroupAlias(start_timestamp,1h,0),
extract(HOUR from start_timestamp) as "metric",
avg(marketprice) as value
FROM doc.el_marketprices
WHERE $__timeFilter(start_timestamp)
GROUP BY 1, 2
ORDER BY 1
And then visualize it as a histogram. Remember that Grafana is designed for time series data, so you need a proper timestamp (not only extracted hours, although you can fake one if necessary); otherwise you will have a hard time visualizing non-time-series data in Grafana. This second query may not work properly, but it should at least give you the idea.

Moving average of values stored in an InfluxDB database

I'm looking to create a moving average over 1 year, 1 month and 1 day of a value stored in an InfluxDB database. I came across this functionality.
However, the values I have are not recorded in fixed time intervals. So while the function works on the example
name: h2o_feet
time water_level
---- -----------
2015-08-18T00:00:00Z 2.064
2015-08-18T00:06:00Z 2.116
2015-08-18T00:12:00Z 2.028
2015-08-18T00:18:00Z 2.126
2015-08-18T00:24:00Z 2.041
2015-08-18T00:30:00Z 2.051
It wouldn't work on mine (I believe), which would look like this:
name: h2o_feet
time water_level
---- -----------
2015-08-18T00:00:00Z 2.064
2015-08-18T00:01:00Z 2.116
2015-08-18T00:12:00Z 2.028
2015-08-18T00:12:30Z 2.126
2015-08-18T00:14:00Z 2.041
2015-08-18T00:30:00Z 2.051
Is there a way to do this using influxdb functions?

Anomaly detection in data transfer

I am working on an anomaly detection model and need help with identifying anomalies in data transfer. Example: an employee is connected using VPN and we have the following data usage:
EMPID  date       Bytes_sent  Bytes_recieved
A123   Timestamp  222222      3333333
A123   Timestamp  444444      6666666
A123   Timestamp  99999999    88888888888
I want to flag row 3 as abnormal since the employee has been sending or receiving within a range and then there is a sudden jump. I want to keep track of the bytes sent and received in the recent days - meaning how his behavior is changing over the recent few days.
One way is to keep additional metrics for each observation.
For Bytes_recieved:
An indicator of whether the observation is an outlier. This is decided by whether the observed Bytes_recieved lies outside the last observed average plus or minus a multiple of the last observed SD, as described below.
A running average over the last N non-outlying events.
The standard deviation over the last N non-outlying events.
N depends on how many observations you want to consider. You mentioned recent days, so you could set N = "recent" * average events per day.
E.g:
EMPID  date       Bytes_sent  Bytes_recieved  br-avg-last-N  br-sd-last-N  br-Outlier
A123   Timestamp  222222      3333333         3333333        2357022.368   FALSE
A123   Timestamp  444444      6666666         4999999.5      2356922.368   FALSE
A123   Timestamp  99999999    88888888888     N/A            N/A           TRUE
The Bytes_recieved outlier flag for row three is calculated as whether the observed Bytes_recieved is outside the range defined by:
(last Bytes_recieved Average-Last-N) - 2*(last Bytes_recieved SD-Last-N) and (last Bytes_recieved Average-Last-N) + 2*(last Bytes_recieved SD-Last-N)
4999999.5 + 2 * 2356922.368 = 9713844.236; 9,713,844.236 < 88,888,888,888 -> TRUE
Assuming roughly normally distributed values, a range of 2 standard deviations covers about 95% of observations, so only the extreme observations you would see ~5% of the time are flagged. You can modify the multiplier to your needs.
You can either do the same for Bytes_sent and have an 'Or' condition for the outlier decision, or calculate distance from a multi dimensional running average (here X is Bytes_sent and Y is Bytes_recieved) and mark outliers based on extreme distances. (You'll need to track a running SD or another spread metric per observation)
This way you could also easily add dimensions: time of day anomalies, extreme differences between Bytes_sent and Bytes_recieved etc.
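The running-average approach above can be sketched in Python roughly as follows. This is a minimal illustration, not a tuned detector: the function name flag_outliers and the defaults n=10 and k=2.0 are assumptions, and values are never flagged until at least two non-outlying observations exist, because a sample standard deviation needs two points.

```python
from collections import deque
from statistics import mean, stdev


def flag_outliers(values, n=10, k=2.0):
    """Return one bool per value.

    A value is flagged when it lies outside mean +/- k * SD of the last n
    non-outlying values seen before it. Flagged values are not added to the
    history, so a single spike does not distort later statistics.
    """
    history = deque(maxlen=n)
    flags = []
    for value in values:
        is_outlier = False
        if len(history) >= 2:  # stdev() needs at least two data points
            centre = mean(history)
            spread = stdev(history)
            is_outlier = abs(value - centre) > k * spread
        flags.append(is_outlier)
        if not is_outlier:
            history.append(value)
    return flags


# Bytes_recieved column from the example table
bytes_received = [3333333, 6666666, 88888888888]
flags = flag_outliers(bytes_received)
print(flags)
```

The same function can be run on Bytes_sent, and the two flag lists combined with an 'or' per row.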

How to find time intervals with no data points in InfluxDB

I have a bunch of IoT sensors that upload second by second data to InfluxDB. Since their network is unreliable, sometimes they do not report data.
I'm trying to figure out how to determine time periods in InfluxDB for which there is no data, and am encountering some wacky behavior with subqueries.
What I've tried so far:
Count the number of points each second, for example:
select count(power)
from energy
where time < '2017-05-14T00:05:10Z'
and time >= '2017-05-14T00:04:30Z'
group by time(1s);
This looks promising, as it returns a result for each second in the interval and the count of data points:
...
1494720297000000000 1
1494720298000000000 1
1494720299000000000 0
1494720300000000000 0
...
Now I want only the time periods where there are 0 points, however when I try this, only time ranges with non-zero numbers of points are reported:
select "points"
from
(select count(power) as "points"
from energy
where time < '2017-05-14T00:05:10Z'
and time >= '2017-05-14T00:04:30Z'
group by time(1s));
Returns:
...
1494720297000000000 1
1494720298000000000 1
No data after 1494720298000000000 is returned, even though the subquery does return rows.
Any help would be appreciated in crafting a query or approach to identify only the areas of time where there is no data.
Add fill(none) to the end of your query. Example:
select count(power) from energy where time < '2017-05-14T00:05:10Z' and time >= '2017-05-14T00:04:30Z' group by time(1s) fill(none)
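If the missing intervals cannot be obtained from the query directly, one fallback is to fetch only the timestamps of the points that do exist and look for jumps on the client. The sketch below assumes epoch timestamps in whole seconds (the output above is in nanoseconds) and one expected point per second; the function name find_gaps and the last sample timestamp are hypothetical.

```python
def find_gaps(timestamps, step=1):
    """Return (first_missing, last_missing) pairs, both inclusive.

    timestamps: sorted epoch seconds of the points that exist.
    step: expected spacing between consecutive points, in seconds.
    """
    gaps = []
    for prev, curr in zip(timestamps, timestamps[1:]):
        if curr - prev > step:
            gaps.append((prev + step, curr - step))
    return gaps


# 1494720299 and 1494720300 are missing; 1494720301 is a made-up next point.
seen = [1494720297, 1494720298, 1494720301]
gaps = find_gaps(seen)
print(gaps)
```

This only detects gaps between existing points, so a gap at the very start or end of the queried range has to be checked against the range boundaries separately.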
