InfluxDB average of distinct count over time

Using InfluxDB v0.9, say I have this simple query:
select count(distinct("id")) FROM "main" WHERE time > now() - 30m and time < now() GROUP BY time(1m)
Which gives results like:
08:00 5
08:01 10
08:02 5
08:03 10
08:04 5
Now I want a query that produces points with an average of those values over 5 minutes. So the points are now 5 minutes apart, instead of 1 minute, but are an average of the 1 minute values. So the above 5 points would be 1 point with a value of the result of (5+10+5+10+5)/5.
To be clear, the following does not produce the result I am after, since it is just a distinct count over each 5-minute window, and I'm after the average of the 1-minute counts:
select count(distinct("id")) FROM "main" WHERE time > now() - 30m and time < now() GROUP BY time(5m)
This doesn't work (gives errors):
select mean(distinct("id")) FROM "main" WHERE time > now() - 30m and time < now() GROUP BY time(5m)
Also doesn't work (gives error):
select mean(count(distinct("id"))) FROM "main" WHERE time > now() - 30m and time < now() GROUP BY time(5m)
In my actual usage "id" is a string (a field value, not a tag, because count distinct is not supported for tags in my version of InfluxDB).

To clarify a few points for readers: in InfluxQL, functions like COUNT() and DISTINCT() accept only fields, not tags. While COUNT() supports nesting DISTINCT(), most other nested functions are not yet supported. Nested queries, subqueries, and stored procedures are not supported either.
However, there is a way to address your need using continuous queries, which are a way to automate the processing of data and writing those results back to the database.
First take your original query and make it a continuous query (CQ).
CREATE CONTINUOUS QUERY count_foo ON my_database_name BEGIN
SELECT COUNT(DISTINCT("id")) AS "1m_count" INTO main_1m_count FROM "main" GROUP BY time(1m)
END
There are other options for the CQ, but that basic one will wake up every minute, calculate the COUNT(DISTINCT("id")) for the prior minute, and then store that result in a new measurement, main_1m_count.
Now, you can easily calculate your 5 minute mean COUNT from the pre-calculated 1 minute COUNT results in main_1m_count:
SELECT MEAN("1m_count") FROM main_1m_count WHERE time > now() - 30m GROUP BY time(5m)
(Note that by default, InfluxDB uses epoch 0 and now() as the lower and upper time range boundaries, so it is redundant to include and time < now() in the WHERE clause.)
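To make the two-stage idea concrete, here is a minimal Python sketch of the same arithmetic done client-side. It is only an illustration, not how InfluxDB computes it: the function name and the (epoch_seconds, id) input format are assumptions, and minutes with no points are simply skipped.

```python
from collections import defaultdict


def mean_of_minute_distinct_counts(points, bucket_minutes=5):
    """points: iterable of (epoch_seconds, id) pairs (hypothetical input format).

    Returns {bucket_start_seconds: mean of the per-minute distinct counts}.
    Minutes that contain no points are skipped in this sketch.
    """
    # Stage 1 (the role of the continuous query): distinct ids per minute.
    ids_per_minute = defaultdict(set)
    for ts, point_id in points:
        ids_per_minute[ts - ts % 60].add(point_id)

    # Stage 2 (the role of MEAN ... GROUP BY time(5m)): average those counts.
    bucket_seconds = bucket_minutes * 60
    counts_per_bucket = defaultdict(list)
    for minute_start, ids in ids_per_minute.items():
        bucket_start = minute_start - minute_start % bucket_seconds
        counts_per_bucket[bucket_start].append(len(ids))

    return {b: sum(c) / len(c) for b, c in sorted(counts_per_bucket.items())}


# Five minutes with 5, 10, 5, 10 and 5 distinct ids, as in the question.
points = []
for minute, n in enumerate([5, 10, 5, 10, 5]):
    points += [(minute * 60 + i, f"id-{i}") for i in range(n)]

result = mean_of_minute_distinct_counts(points)
print(result)  # {0: 7.0}
```

A plain distinct count over the same 5 minutes would give 10 here, which is why GROUP BY time(5m) on the raw data is not a substitute for averaging the 1-minute counts.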

Related

InfluxQL time calculations return no records

I'd like to query InfluxDB using InfluxQL and exclude any rows from 0 to 5 minutes after the hour.
Seems pretty easy to do using the time field (the number of nanoseconds since the epoch) and a little modulus math. But the problem is that any WHERE clause with even the simplest calculation on time returns zero records.
How can I get what I need if I can't perform calculations on time? How can I exclude any rows from 0 to 5 minutes after the hour?
# Returns 10 records
SELECT * FROM "telegraf"."autogen"."processes" WHERE time > 0 LIMIT 10
# Returns 0 records
SELECT * FROM "telegraf"."autogen"."processes" WHERE (time/1) > 0 LIMIT 10
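For reference, the modulus math described in the question can be sketched client-side in Python. This is only a workaround sketch, not an InfluxQL solution: it assumes rows arrive as (nanosecond_timestamp, value) pairs, that hours are counted in UTC, and that "0 to 5 minutes after the hour" means minutes 0 through 4, so a point at exactly hh:05:00 is kept. The function names are made up for illustration.

```python
NS_PER_MINUTE = 60 * 1_000_000_000


def minute_of_hour(ts_ns):
    """Minute within the hour (0-59) for a nanosecond epoch timestamp (UTC)."""
    return (ts_ns // NS_PER_MINUTE) % 60


def drop_first_minutes(rows, minutes=5):
    """Keep rows whose timestamp is not within the first `minutes` of an hour.

    rows: iterable of (ts_ns, value) pairs (hypothetical input format).
    """
    return [(ts, v) for ts, v in rows if minute_of_hour(ts) >= minutes]


rows = [
    (0, "a"),                                   # hh:00:00, dropped
    (4 * NS_PER_MINUTE + 59 * 10**9, "b"),      # hh:04:59, dropped
    (5 * NS_PER_MINUTE, "c"),                   # hh:05:00, kept
    (62 * NS_PER_MINUTE, "d"),                  # next hour, hh:02:00, dropped
    (90 * NS_PER_MINUTE, "e"),                  # hh:30:00, kept
]
kept = drop_first_minutes(rows)
print([v for _, v in kept])  # ['c', 'e']
```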

How to obtain time interval value reports from InfluxDB

Using InfluxDB: Is there any way to build a time-bucketed report of a field value representing a state that persists over time? Ideally in the InfluxQL query language.
More specifically as an example: Say a measurement contains points that report changes in the light bulb state (On / Off). They could be 0s and 1s as in the example below, or any other value. For example:
time light
---- -----
2022-03-18T00:00:00Z 1
2022-03-18T01:05:00Z 0
2022-03-18T01:55:00Z 0
2022-03-18T02:30:00Z 1
2022-03-18T04:06:00Z 0
The result should be a listing of intervals indicating if this light was on or off during each time interval (e.g. hours), or what percentage of that time it was on. For the given example, the result if grouping hourly should be:
Hour              Value
----              -----
2022-03-18 00:00  1.00
2022-03-18 01:00  0.17
2022-03-18 02:00  0.50
2022-03-18 03:00  1.00
2022-03-18 04:00  0.10
Note that:
For the 1am bucket, even though the light starts and ends in the On state, it was On for only 10 of the 60 minutes, so the value is low (10/60).
More importantly, the bucket from 3am to 4am has the value "1" because the light had been On since the previous period, even though nothing changed during this period. This rules out a simple aggregation (e.g. MEAN) over a GROUP BY time(), as there would be no way to know whether an empty/missing bucket corresponds to an On or Off state: that depends only on the last value reported before the bucket.
Is there a way to implement it in pure InfluxQL, without retrieving potentially big data sets (points) and iterating through them in a client?
I assume the raw data can be obtained with a query like this:
SELECT "light" FROM "test3" WHERE $timeFilter
Here "test3" is your measurement name and $timeFilter is the selected time range (from ... to ...).
In this case we need a subquery that fills in the data. Let's take 1s as the grouping (resolution) time:
SELECT last("light") as "filled_light" FROM "test3" WHERE $timeFilter GROUP BY time(1s) fill(previous)
This query gives us 1/0 value every 1s. We will use it as a subquery.
NOTE: Be aware that this approach does not know whether the light was on or off at the beginning of the $timeFilter period. It will not provide any data for hours before the first value that falls within $timeFilter.
In the next step, use the integral() function on the data from the subquery, like this:
SELECT integral("filled_light",1h) from (SELECT last("light") as "filled_light" FROM "test3" WHERE $timeFilter GROUP BY time(1s) fill(previous)) group by time(1h)
[Screenshots: the result shown as a chart and as a table]
This is not a perfect way of getting it to work but I hope it resolves your problem.
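For comparison, the same carry-forward idea can be computed directly from the change points. The Python sketch below is an illustration under assumptions, not the InfluxQL answer: the function name, the input format (time-sorted (datetime, state) pairs) and the `initial_state` parameter are made up, and the sample data is hypothetical rather than the data from the question.

```python
from datetime import datetime, timedelta, timezone


def on_fraction_per_hour(changes, start, end, initial_state=0):
    """changes: time-sorted (datetime, state) pairs with state 0 or 1.

    Returns [(hour_start, fraction_on)] for each whole hour in [start, end).
    `initial_state` is the state assumed before the first change point.
    """
    hour = timedelta(hours=1)
    result = []
    state = initial_state
    i = 0
    bucket_start = start
    while bucket_start < end:
        bucket_end = bucket_start + hour
        on_seconds = 0.0
        cursor = bucket_start
        # Apply every change that happens before this bucket ends.
        while i < len(changes) and changes[i][0] < bucket_end:
            change_time, new_state = changes[i]
            if change_time > cursor:
                on_seconds += state * (change_time - cursor).total_seconds()
                cursor = change_time
            state = new_state
            i += 1
        # The last known state persists until the end of the bucket.
        on_seconds += state * (bucket_end - cursor).total_seconds()
        result.append((bucket_start, on_seconds / hour.total_seconds()))
        bucket_start = bucket_end
    return result


# Hypothetical sample: On at 00:00, Off at 02:30, On again at 03:45.
t0 = datetime(2022, 3, 18, tzinfo=timezone.utc)
changes = [
    (t0, 1),
    (t0 + timedelta(hours=2, minutes=30), 0),
    (t0 + timedelta(hours=3, minutes=45), 1),
]
report = on_fraction_per_hour(changes, t0, t0 + timedelta(hours=4))
print([fraction for _, fraction in report])  # [1.0, 1.0, 0.5, 0.25]
```

The 01:00 bucket has no points at all, yet it reports 1.0 because the state set at 00:00 is carried forward, which is the behaviour fill(previous) is used for in the query above.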

Showing hourly average (histogram) in Grafana

Given a time series of (electricity) market data with data points every hour, I want to show a bar graph with all-time / time-frame averages for every hour of the day, so that an analyst can easily compare actual prices to all-time averages (which hour of the day is most/least expensive).
We have CrateDB as the backend, which is used in Grafana just like a PostgreSQL data source.
SELECT
extract(HOUR from start_timestamp) as "time",
avg(marketprice) as value
FROM doc.el_marketprices
GROUP BY 1
ORDER BY 1
So my data basically looks like this
time value
23.00 23.19
22.00 25.38
21.00 29.93
20.00 31.45
19.00 34.19
18.00 41.59
17.00 39.38
16.00 35.07
15.00 30.61
14.00 26.14
13.00 25.20
12.00 24.91
11.00 26.98
10.00 28.02
9.00 28.73
8.00 29.57
7.00 31.46
6.00 30.50
5.00 27.75
4.00 20.88
3.00 19.07
2.00 18.07
1.00 19.43
0 21.91
After hours of fiddling around with Bar Graphs, Histogram mode, the Heatmap panel and much more, I am just not able to draw a simple hours-of-the-day histogram with this in Grafana. I would very much appreciate any advice on how to use any panel to get this accomplished.
Your query doesn't return valid time series data for Grafana: the time field is not a valid timestamp, so don't extract only the hour, but provide the full start_timestamp (I hope it is a timestamp data type and the value is in UTC).
Add a WHERE time condition, using Grafana's macro $__timeFilter.
Use Grafana's macro $__timeGroupAlias for hourly grouping.
SELECT
$__timeGroupAlias(start_timestamp,1h,0),
avg(marketprice) as value
FROM doc.el_marketprices
WHERE $__timeFilter(start_timestamp)
GROUP BY 1
ORDER BY 1
This will give you data for a historical graph with hourly average values.
The required histogram may be tricky, but you can try to create a metric that holds the extracted hour, e.g.
SELECT
$__timeGroupAlias(start_timestamp,1h,0),
extract(HOUR from start_timestamp) as "metric",
avg(marketprice) as value
FROM doc.el_marketprices
WHERE $__timeFilter(start_timestamp)
GROUP BY 1, 2
ORDER BY 1
And then visualize it as a histogram. Remember that Grafana is designed for time series data, so you need a proper timestamp (not only extracted hours, although you could fake one); otherwise you will have a hard time visualizing non-time-series data in Grafana. This second query may not work properly, but it should at least give you the idea.
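As a cross-check for whatever the panel ends up showing, the hour-of-day averages from the original query (extract(HOUR ...) with avg and GROUP BY) can be reproduced in a few lines of Python. This is only a sketch: the function name and the (datetime, price) input format are assumptions, and the sample prices are made up.

```python
from collections import defaultdict
from datetime import datetime, timezone


def hour_of_day_averages(rows):
    """rows: iterable of (datetime, price) pairs. Returns {hour: mean price}."""
    totals = defaultdict(lambda: [0.0, 0])  # hour -> [sum, count]
    for ts, price in rows:
        bucket = totals[ts.hour]
        bucket[0] += price
        bucket[1] += 1
    return {hour: s / n for hour, (s, n) in sorted(totals.items())}


# Two hypothetical days with prices for hours 0 and 1.
rows = [
    (datetime(2021, 1, 1, 0, tzinfo=timezone.utc), 20.0),
    (datetime(2021, 1, 1, 1, tzinfo=timezone.utc), 30.0),
    (datetime(2021, 1, 2, 0, tzinfo=timezone.utc), 22.0),
    (datetime(2021, 1, 2, 1, tzinfo=timezone.utc), 34.0),
]
averages = hour_of_day_averages(rows)
print(averages)  # {0: 21.0, 1: 32.0}
```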

How to find time intervals with no data points in InfluxDB

I have a bunch of IoT sensors that upload second by second data to InfluxDB. Since their network is unreliable, sometimes they do not report data.
I'm trying to figure out how to determine time periods in InfluxDB for which there is no data, and am encountering some wacky behavior with subqueries.
What I've tried so far:
Count the number of points each second, for example:
select count(power)
from energy
where time < '2017-05-14T00:05:10Z'
and time >= '2017-05-14T00:04:30Z'
group by time(1s);
This looks promising, as it returns a result for each second in the interval and the count of data points:
...
1494720297000000000 1
1494720298000000000 1
1494720299000000000 0
1494720300000000000 0
...
Now I want only the time periods where there are 0 points. However, when I try this, only time ranges with non-zero numbers of points are reported:
select "points"
from
(select count(power) as "points"
from energy
where time < '2017-05-14T00:05:10Z'
and time >= '2017-05-14T00:04:30Z'
group by time(1s));
Returns:
...
1494720297000000000 1
1494720298000000000 1
No data after 1494720298000000000 is returned, even though the subquery does return rows.
Any help would be appreciated in crafting a query or approach to identify only the areas of time where there is no data.
Add fill(none) after your query.
Example:
select count(power) from energy where time < '2017-05-14T00:05:10Z' and time >= '2017-05-14T00:04:30Z' group by time(1s) fill(none)
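Separately from fill(), if the per-second counts from the first query in the question (which does include the zero rows) are fetched by a client, collapsing the zero rows into gap intervals can be sketched in Python. This is a client-side sketch under assumptions: the function name is made up, rows are (nanosecond_timestamp, count) pairs in time order, and every bucket in the range is present in the input.

```python
def empty_intervals(rows, step_ns=1_000_000_000):
    """rows: time-ordered (ts_ns, count) pairs, one per GROUP BY time() bucket.

    Returns a list of (gap_start_ns, gap_end_ns) tuples, gap_end exclusive.
    """
    gaps = []
    gap_start = None
    prev_ts = None
    for ts, count in rows:
        if count == 0:
            if gap_start is None:
                gap_start = ts  # first empty bucket of a new gap
        elif gap_start is not None:
            gaps.append((gap_start, ts))  # gap ends where data resumes
            gap_start = None
        prev_ts = ts
    if gap_start is not None:
        # The range ended while still inside a gap.
        gaps.append((gap_start, prev_ts + step_ns))
    return gaps


rows = [
    (1494720297000000000, 1),
    (1494720298000000000, 1),
    (1494720299000000000, 0),
    (1494720300000000000, 0),
    (1494720301000000000, 1),  # hypothetical row: data resumes here
]
gaps = empty_intervals(rows)
print(gaps)  # [(1494720299000000000, 1494720301000000000)]
```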

How to Group by last 20 days and do an aggregate function?

I can't seem to figure this one out. I'm trying to get the standard deviation of a column for the past 20 days. Here is what I have
Model.where('date < ?','2013-03-25')
.group('date')
.order('date DESC')
.limit(20)
.select('stddev_samp(percent_change) as stdev')
However all I'm getting is 20 entries of Nil. I was expecting 1 entry of the standard deviation.
After switching the stddev_samp to sum, I see that I'm getting nil because you can't have a standard deviation on 1 entry. I.e. It is not grouping the 20 as I expected, but calculating standard deviation on each date.
So my question is, how do I get stddev of the last 20 days? I know it's possible to simply choose select percent_change and then calculate the standard deviation in ruby, but I assume that the aggregate function stddev_samp should be usable in this case.
I am using Rails 3.2 and PostgreSQL 9.2.
I'm not a Ruby guy so I'll explain it in normal SQL:
What you're doing is:
SELECT stddev_samp(percent_change) as stdev
FROM tbl
WHERE date < '2013-03-25'
GROUP BY date
ORDER BY date DESC
LIMIT 20;
This calculates the deviation for each day separately, not across all of them, and when you try to get the deviation of only one element you get NULL.
Removing the GROUP BY would fix that, but it would also return the result for the whole table, not just the last 20 entries, so we need a subquery:
SELECT stddev_samp(percent_change) as stdev
FROM
(SELECT percent_change
FROM tbl
WHERE date < '2013-03-25'
ORDER BY date DESC
LIMIT 20) AS q
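The subquery logic (pick the latest rows first, then aggregate once over all of them) can be mirrored in a small Python sketch. It is only an illustration: the function name and the (date, percent_change) input format are assumptions, and `statistics.stdev` stands in for stddev_samp because both compute a sample standard deviation.

```python
import statistics


def stdev_of_latest(rows, n=20):
    """rows: iterable of (date, percent_change) pairs.

    Returns the sample standard deviation of the n most recent rows,
    or None when fewer than two values are available.
    """
    # Inner query: ORDER BY date DESC LIMIT n.
    latest = sorted(rows, key=lambda r: r[0], reverse=True)[:n]
    values = [pc for _, pc in latest]
    # Outer query: one aggregate over all selected values.
    # A sample standard deviation needs at least two values.
    return statistics.stdev(values) if len(values) >= 2 else None


# Hypothetical sample data; ISO date strings sort chronologically.
rows = [
    ("2013-03-20", 1.0),
    ("2013-03-21", 2.0),
    ("2013-03-22", 3.0),
    ("2013-03-23", 4.0),
]
print(stdev_of_latest(rows, n=3))   # 1.0
print(stdev_of_latest(rows[:1]))    # None
```

The second call shows the single-value case that grouping by date produces for every row, which is why the original query returned nothing but NULLs.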
To cover the last 20 days there is no need for GROUP BY, ORDER BY or sub-selects. Just get the records for the last 20 days and run the aggregate function on them.
Ruby:
Model.where('date >= ?', Date.today - 20.days).select('stddev_samp(percent_change) as stdev').first['stdev']
SQL:
select stddev_samp(percent_change) as stdev
from <table>
where date >= now() - interval '20 days';
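If you want to use the LAST 20 RECORDS, not the last 20 days, the limit has to be applied before the aggregate runs, so this case does need a sub-select:
Ruby:
Model.from("(#{Model.order('date desc').limit(20).to_sql}) AS q").select('stddev_samp(percent_change) as stdev').first['stdev']
SQL:
select stddev_samp(percent_change) as stdev
from (select percent_change
from <table>
order by date desc
limit 20) as q;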
You don't need the GROUP BY since you don't want one value for each date.
Also, your LIMIT might not work as intended if you have multiple values for a date or if a date is missing.
Try this:
SELECT stddev_samp(percent_change) as stdev
FROM
(SELECT percent_change
FROM tbl
WHERE date > now() - interval '20 days') AS q
