Is it possible to get a percentile on aggregated data in InfluxDB?

Is it possible to get a percentile on aggregated data in InfluxDB?
Say my data is:
db,label1=value1 measure1_count=20,measure1_mean=0.8 140000000000
db,label1=value1 measure1_count=8,measure1_mean=0.9 140000001000
db,label1=value1 measure1_count=15,measure1_mean=0.4 140000002000
Is it possible to compute a percentile on the above data in InfluxDB 1 or 2?

InfluxDB provides the MEDIAN() aggregate function for calculating the median.
select MEDIAN(Value) from ProcessData group by TagName
Note: MEDIAN() is nearly equivalent to PERCENTILE(field_key, 50), except MEDIAN() returns the average of the two middle field values if the field contains an even number of values.
https://docs.influxdata.com/influxdb/v1.8/query_language/functions/#:~:text=Note%3A%20MEDIAN()%20is%20nearly,an%20even%20number%20of%20values.
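To make the difference described in the note concrete, here is a minimal Python sketch. It assumes a nearest-rank percentile, meaning one that always returns a value present in the data; the exact rank rule InfluxDB applies may differ, and the function names and sample values are made up for illustration.

```python
import math


def median(values):
    """Median that averages the two middle values when the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def nearest_rank_percentile(values, p):
    """Nearest-rank percentile: always returns one of the input values.

    Assumption: rank = ceil(p / 100 * n). This is one common definition,
    not necessarily the rule InfluxDB applies.
    """
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


means = [0.8, 0.9, 0.4, 0.7]  # hypothetical field values (even count)
print(median(means))                       # average of the two middle values
print(nearest_rank_percentile(means, 50))  # an actual value from the list
```

Either function operates on the stored field values (for example, per-interval means), not on the raw samples that were aggregated to produce them.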

Related

Google Sheets: Average percentage using multiple conditions

I would like to get an average percentage out of my sample, but I need to apply several conditions. I tried to use AVERAGE and AVERAGEIF together with FILTER, but everything returns an error, and I think I'm "merging" the formulas incorrectly.
You can find my test sheet here.
The rules I need to apply:
Each row's score is in column N of the "Data" sheet, and the overall result should appear in column E of the "Calculation" sheet.
Because the real sample is huge, I need to filter the data on several conditions:
keep only items whose code/ID starts with 0: Data!A:A&"", "^0.+"
keep only items whose date matches the date in the Calculation sheet: Data!C:C=$B3
keep only items with the specific name: Data!B:B=$A3
Any idea how to get the average % out of items with specific filters?
UPDATE
Expected results: I want to see the overall average for a specific date, name, and ID. For example, if these filters matched the rows below, I would see only the final average percentage.
Test = 100%
Test = 0%
Test = 100%
Total Average %: 66.7%
Also, I think the best way would be to use AVERAGEIFS, but I'm getting the error "Array arguments to AVERAGEIFS are of different size".
=AVERAGEIFS(Data!N:N,Data!B:B=$A3,Data!C:C=$B3,Data!A:A&"", "^0.+")
=IFERROR(AVERAGEIFS(Data!N3:N,Data!B3:B,A3,Data!C3:C,B3,ARRAYFORMULA(if(LEN(Data!A3:A),REGEXMATCH(Data!A3:A,"^0.+"),"")),TRUE),"")
or
=IFERROR(AVERAGE(FILTER(Data!N3:N,Data!B3:B=A3,Data!C3:C=B3,REGEXMATCH(Data!A3:A,"^0.+"))),"")
or
=IFERROR(INDEX(QUERY({Data!A3:C,Data!N3:N},"select avg(Col4) where Col1 starts with '0' and Col2 = '"&A3&"' and Col3 = '"&B3&"'"),2,0),"")
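The logic shared by these formulas (filter the rows on three conditions, then average the remaining scores) can be sketched in Python. The rows, dates, and function name below are hypothetical stand-ins for the "Data" sheet, not taken from the linked test sheet.

```python
from statistics import mean

# Hypothetical rows standing in for the "Data" sheet:
# (ID from column A, name from column B, date from column C, score from column N)
rows = [
    ("0123", "Test", "2021-05-01", 1.0),
    ("0456", "Test", "2021-05-01", 0.0),
    ("0789", "Test", "2021-05-01", 1.0),
    ("1234", "Test", "2021-05-01", 0.0),   # ID does not start with 0
    ("0999", "Other", "2021-05-01", 0.5),  # different name
]


def average_score(rows, name, date):
    """Average the scores of rows matching name, date, and an ID starting with '0'."""
    scores = [
        score
        for row_id, row_name, row_date, score in rows
        # "^0.+" means a leading 0 followed by at least one more character
        if row_name == name and row_date == date
        and row_id.startswith("0") and len(row_id) > 1
    ]
    # None plays the role of the empty string returned by IFERROR(..., "")
    return mean(scores) if scores else None


result = average_score(rows, "Test", "2021-05-01")
print(f"{result:.1%}")
```

With three matching rows scoring 100%, 0%, and 100%, this reproduces the 66.7% from the expected results in the question.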

Is there a function to get the mean values of a column for every unique date in a date column?

(Screenshot: Jupyter notebook showing all columns in the dataset.)
I have an AQI (Air Quality Index) dataset with columns such as O3, SO2, PM2.5, etc., and a datetime column containing timestamps like 20-Sep-2017 - 01:00 and 20-Sep-2017 - 00:00. I want the mean of each column for every unique date; for example, O3 has several values on 20-Sep-2017, but I want only their mean for that date. I've tried regex and many other things but did not get the desired output.
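One way to approach this is to parse each timestamp, keep only its date part, and average the values that share a date. The sketch below uses only the Python standard library with made-up O3 readings; the timestamp format string is inferred from the two examples in the question.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Hypothetical sample rows: (timestamp string, O3 reading)
readings = [
    ("20-Sep-2017 - 00:00", 10.0),
    ("20-Sep-2017 - 01:00", 14.0),
    ("21-Sep-2017 - 00:00", 8.0),
]


def daily_means(readings, fmt="%d-%b-%Y - %H:%M"):
    """Group values by calendar date and return {date: mean value}."""
    by_date = defaultdict(list)
    for stamp, value in readings:
        day = datetime.strptime(stamp, fmt).date()
        by_date[day].append(value)
    return {day: mean(values) for day, values in by_date.items()}


for day, value in sorted(daily_means(readings).items()):
    print(day, value)
```

In pandas the same idea is usually expressed by parsing the column with pd.to_datetime and grouping on its .dt.date before calling mean().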

Druid Timeseries Row Count Aggregation

I am currently calculating the average for a single dimension in a Druid data source using a timeseries query via pydruid. This is based on an example in the documentation (https://github.com/druid-io/pydruid):
from pydruid.client import PyDruid
from pydruid.utils.aggregators import count, doublesum
from pydruid.utils.postaggregator import Field  # needed for Field() below
client = PyDruid()
client.timeseries(
    datasource='test_datasource',
    granularity='hour',
    intervals='2019-05-13T11:00:00.000/2019-05-23T17:00:00.000',
    aggregations={
        'sum': doublesum('dimension_name'),  # sum of the values
        'count': count('rows'),  # row count
    },
    post_aggregations={
        # average = sum / count
        'average': Field('sum') / Field('count'),
    },
)
My problem is that I don't know what count('rows') is doing. This seems to give the total row count for a datasource and is not filtered on the dimension. So I don't know whether the average will be incorrect if one row in the dimension in question has a null value.
I was wondering whether anyone knew how to calculate the average correctly?
Many thanks
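The concern can be illustrated without Druid. The sketch below uses plain Python and made-up rows; it only shows how dividing by the total row count differs from dividing by the number of rows that actually have a value, and it does not reproduce what pydruid or Druid do internally.

```python
# Hypothetical rows; None stands for a missing (null) value.
rows = [4.0, 6.0, None, 2.0]

# Nulls contribute nothing to the sum.
total = sum(v for v in rows if v is not None)

# A plain row count includes the null row.
all_rows = len(rows)

# Counting only rows that have a value excludes it.
non_null_rows = sum(1 for v in rows if v is not None)

average_over_all_rows = total / all_rows
average_over_values = total / non_null_rows

print(average_over_all_rows, average_over_values)
```

If the second number is the one wanted, the count has to be restricted to rows where the value is present. Druid's filtered aggregator is one place to look, though whether it fits this datasource is not verified here.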

InfluxDB: Starting cumulative_sum() from zero / aggregate grouping required for cumulative_sum and non_negative_difference

Using InfluxDB, I'm trying to produce an output that shows cumulative rainfall for a time period, starting from zero.
The rainfall sensor outputs a cumulative rainfall amount, but resets to zero on power-failure, restart etc.
My first query component uses non_negative_difference() to show the increments.
SELECT
  non_negative_difference(rain) AS nnd
FROM
  weather
WHERE
  $time_query
.... yields an increment per raw data point, for example:
2018-06-01T14:21:00.926Z 0
2018-06-01T14:22:02.959Z 0.30000000000000426
2018-06-01T14:23:04.992Z 0.3999999999999986
2018-06-01T14:24:07.024Z 0.10000000000000142
2018-06-01T14:25:09.059Z 0.19999999999999574
2018-06-01T14:26:11.094Z 0
2018-06-01T14:27:13.127Z 0.10000000000000142
2018-06-01T14:28:15.158Z 0.20000000000000284
2018-06-01T14:29:20.027Z 0.09999999999999432
2018-06-01T14:30:22.476Z 0.10000000000000142
2018-06-01T14:30:53.918Z 0.6000000000000014
2018-06-01T14:31:55.968Z 0.5
2018-06-01T14:32:58.007Z 0.5
2018-06-01T14:34:00.046Z 0.20000000000000284
2018-06-01T14:35:02.075Z 0.3999999999999986
2018-06-01T14:36:04.102Z 0.3999999999999986
2018-06-01T14:37:06.136Z 0.20000000000000284
2018-06-01T14:38:08.201Z 0
So far so good.
I'm now trying to stitch these readings back to cumulative total, starting from zero for the intended period.
I can use cumulative_sum() for this, for example:
SELECT
  cumulative_sum(nnd)
FROM
  (SELECT
    non_negative_difference(rain) AS nnd
  FROM
    weather
  WHERE
    $time_query)
which yields:
2018-06-01T14:21:00.926Z 0
2018-06-01T14:22:02.959Z 0.30000000000000426
2018-06-01T14:23:04.992Z 0.7000000000000028
2018-06-01T14:24:07.024Z 0.8000000000000043
2018-06-01T14:25:09.059Z 1
2018-06-01T14:26:11.094Z 1
2018-06-01T14:27:13.127Z 1.1000000000000014
2018-06-01T14:28:15.158Z 1.3000000000000043
2018-06-01T14:29:20.027Z 1.3999999999999986
2018-06-01T14:30:22.476Z 1.5
2018-06-01T14:30:53.918Z 2.1000000000000014
2018-06-01T14:31:55.968Z 2.6000000000000014
2018-06-01T14:32:58.007Z 3.1000000000000014
2018-06-01T14:34:00.046Z 3.3000000000000043
2018-06-01T14:35:02.075Z 3.700000000000003
2018-06-01T14:36:04.102Z 4.100000000000001
2018-06-01T14:37:06.136Z 4.300000000000004
2018-06-01T14:38:08.201Z 4.300000000000004
Looking good!
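The two steps so far (difference the resetting counter while discarding negative jumps, then re-accumulate) can be sketched in Python. The raw readings are hypothetical and the function name simply mirrors the InfluxQL one; this illustrates the idea rather than InfluxDB's implementation.

```python
from itertools import accumulate


def non_negative_difference(values):
    """Differences between consecutive values, dropping negative ones (counter resets)."""
    diffs = (b - a for a, b in zip(values, values[1:]))
    return [d for d in diffs if d >= 0]


# Hypothetical raw sensor readings; the counter resets to 0 after the third point.
rain = [10, 11, 13, 0, 2, 2, 5]

increments = non_negative_difference(rain)
totals = list(accumulate(increments))

print(increments)
print(totals)
```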
Now I'd like to group it up into more distinct time buckets, for nice graphing.
Let's try....
SELECT
  cumulative_sum(max(nnd))
FROM
  (SELECT
    non_negative_difference(rain) AS nnd
  FROM
    weather
  WHERE
    $time_query)
GROUP BY
  time(5m)
and I get an error: ERR: aggregate function required inside the call to non_negative_difference
But I cannot find a reasonable way of adding aggregates and groupings to non_negative_difference() that do not affect the accuracy of the differencing function itself.
The only thing I've been able to do is a dummy aggregate SUM() over time groups that are smaller than the sensor period. But this isn't robust enough for my liking (and I'm still not sure it is 100% correct).
Is it correct that I must have both queries as aggregate queries?
I was trying to do this very thing for my weather station. Instead of having the weather station calculate the cumulative value I wanted Grafana to do it. The solution that worked for me is the advanced syntax Yuri Lachin mentions in his comments.
With InfluxDB you can use CUMULATIVE_SUM(), but the basic syntax doesn't allow you to group by time (only by tag). The "advanced syntax", however, allows you to have a time series by nesting an aggregate function like MEAN() or SUM().
Here's the function I am using in Grafana to get a cumulative rainfall total for a selected time period:
SELECT CUMULATIVE_SUM(MEAN("rainfall")) FROM "weather" WHERE $timeFilter GROUP BY time(1h) fill(0)
The GROUP BY is, of course, flexible. I was interested in hourly rainfall so I grouped by 1h. You can group by the time period you find most interesting.
With this query, the rainfall starts from zero for the period you select in Grafana. In the Seattle area we had measurable rain (I know, shocker) on 8/6/2020 and 8/8/2020. If I set my date range to include both dates, the graph shows just under 0.2 mm total rainfall.
If I switch my graph to 8/8 and 8/9, the total is just under 1 mm.
Note: I was also interested in seeing the individual bucket tips so included those as bars on the second Y-axis.
For more detail see: https://docs.influxdata.com/influxdb/v1.8/query_language/functions/#advanced-syntax-7
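The bucketing idea can also be sketched in Python. This is only an illustration under stated assumptions: timestamps are in seconds, the per-point increments are summed within each bucket (rather than averaged, as MEAN() would do), empty buckets count as 0 in the spirit of fill(0), and all names and values are hypothetical.

```python
from itertools import accumulate


def bucketed_cumulative(points, bucket_seconds):
    """Sum increments per fixed-width time bucket, then accumulate across buckets.

    points: list of (timestamp_in_seconds, increment), assumed non-empty.
    Returns a list of (bucket_start, running_total).
    """
    sums = {}
    for ts, inc in points:
        start = ts - ts % bucket_seconds
        sums[start] = sums.get(start, 0) + inc
    first, last = min(sums), max(sums)
    starts = range(first, last + bucket_seconds, bucket_seconds)
    per_bucket = [sums.get(s, 0) for s in starts]  # empty buckets become 0
    return list(zip(starts, accumulate(per_bucket)))


# Hypothetical increments: (seconds since start, rainfall increment)
points = [(10, 1), (70, 2), (130, 1), (610, 4)]
print(bucketed_cumulative(points, 300))
```

Because the increments are summed, the final total does not depend on the bucket width; only the resolution of the curve changes.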

How to sum largest $n$ values in a range in Google Spreadsheet?

I have a list of values and I need to sum the largest 10 values (in a row). I found this but I can't figure it out/get it to work:
https://productforums.google.com/forum/#!topic/docs/A5jiMqkRLYE
Let's say you want to sum the 10 highest values in the range E2:P2.
Then try:
=sumif(E2:P2, ">="&large(E2:P2,10))
and see if that works.
EDIT: Maybe this is a better option? It sums only the 10 values returned by array_constrain. It will only work in the new Google Sheets, though.
=sum(array_constrain(sort(transpose($A3:$O3), 1, 0), 10 ,1))
Can you see if this works?
This works in old google sheets too:
=sum(query(sort(transpose($A3:$O3), 1, false), "select * limit 10"))
Transpose puts the data in a column, sort sorts the data in a descending order and then query selects first 10 numbers.
Unfortunately, replacing sort with "order by" in the query statement does not work, because you cannot reference a column in a range returned by transpose.
The sortn function seems to be just what you need.
From the documentation linked above, it "[r]eturns the first n items in a data set after performing a sort." The data set does not have to be sorted. It takes a bunch of optional parameters as it can sort on multiple columns.
SORTN(range, [n], [display_ties_mode], [sort_column1, is_ascending1], ...)
The interesting ones for your case are n, sort_column1, and is_ascending1. Specifically, your required formula would be
sum(sortn(transpose(A3:O3), 10, 0, 1, false))
Some notes:
This assumes your data is in A3:O3. You can replace it with your range.
transpose converts the data row to a data column as required by sortn.
10 is n, indicating the number of values that you require.
0 is the value for display_ties_mode. We are ignoring this value.
1 is the value of sort_column1, indicating that we want to sort by the first column (after transpose).
false tells sortn to sort descending and thus pick the largest values. The default is to pick the smallest.
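For comparison, the two strategies in this thread (take exactly the n largest values, or sum everything at or above the nth largest) can be sketched in Python. The row values and function names are hypothetical; the second function mimics the SUMIF/LARGE formula to show how it behaves when the cutoff value is tied.

```python
import heapq


def sum_largest(values, n):
    """Sum exactly the n largest values; ties at the cutoff are counted only as needed."""
    return sum(heapq.nlargest(n, values))


def sum_at_least_nth_largest(values, n):
    """Mimics SUMIF(range, ">=" & LARGE(range, n)): sums every value >= the nth largest."""
    threshold = sorted(values, reverse=True)[n - 1]
    return sum(v for v in values if v >= threshold)


row = [5, 9, 7, 7, 7, 1]  # hypothetical row with ties at the cutoff
print(sum_largest(row, 3))               # 9 + 7 + 7
print(sum_at_least_nth_largest(row, 3))  # 9 + 7 + 7 + 7, one tie too many
```

When several values tie with the nth largest, the threshold approach includes all of them, so the sort-and-take-n variants are the safer choice if exactly n values must be summed.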
