How to create a delayed sliding window in Azure Stream Analytics

I would like to calculate the rate of change between the two following values in my stream:
AVG(value) over a 1-minute sliding window
AVG(value) over the 1-minute sliding window one minute earlier
The only thing I can't find in the documentation is how to create a "delayed" sliding window, one that begins 2 minutes before and ends 1 minute before the current time, so I can make calculations such as the rate of change.

You can do it in two steps:
Compute one-minute averages with AVG.
Use LAG to fetch the previous window's average.
Something like the below:
WITH OneMinuteWindows AS
(
    SELECT
        AVG(Column1) AS AvgValue
    FROM
        InputEventHub
    GROUP BY
        TumblingWindow(mi, 1)
)
SELECT
    System.TimeStamp AS [TimeStamp],
    AvgValue AS [CurrentValue],
    LAG(AvgValue) OVER (LIMIT DURATION(mi, 2)) AS [PreviousValue]
FROM
    OneMinuteWindows
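From there, the rate of change the question asks for is a small extension of the final SELECT. A minimal sketch (untested; the CASE guard against division by zero is my addition, not part of the original answer):

SELECT
    System.TimeStamp AS [TimeStamp],
    CASE WHEN LAG(AvgValue) OVER (LIMIT DURATION(mi, 2)) <> 0
         THEN (AvgValue - LAG(AvgValue) OVER (LIMIT DURATION(mi, 2)))
              / LAG(AvgValue) OVER (LIMIT DURATION(mi, 2))
    END AS [RateOfChange]
FROM
    OneMinuteWindows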


How to use a LAMBDA function with REDUCE to repeat a row N number of times in Google Sheets

Hello. I've tried using text manipulation to achieve this, and while it works, I don't think it's an efficient way to do it, and there are limits on how many times it can be done.
I was trying to figure out how to get it done with REDUCE, but I'm having a hard time figuring it out.
This is the current table:

Unique ID   Some other Info   How many times to repeat
123         Some Info         2
456         Some Info         3
The result would be
Unique ID
123
123
456
456
456
Thank you.
Here's one way to do this:
=ArrayFormula(REDUCE("Unique ID",SEQUENCE(COUNTA(A2:A)),LAMBDA(a,c,{a;IF(SEQUENCE(INDEX(C2:C,c)),INDEX(A2:A,c))})))
Explanation
The LAMBDA inside REDUCE works by taking 3 parameters: an accumulator (a), a current value (c), and the expression to evaluate using them.
The accumulator (a) is initialized to the first argument of REDUCE, which is "Unique ID", and every time the inner LAMBDA is executed, the accumulator updates to the result of that execution.
The current value (c) is the iteration variable: it takes on, one by one, the values provided in the second argument of REDUCE, SEQUENCE(COUNTA(A2:A)) (1).
Let's assume (1) returns:
1
2
The main work happens here:
{a;IF(SEQUENCE(INDEX(C2:C,c)),INDEX(A2:A,c))} (2)
Before this piece of code is executed, a has a value of "Unique ID" and c has a value of 1.
When it executes for the first time, a and c are replaced with those initial values, so we get:
{"Unique ID";IF(SEQUENCE(INDEX(C2:C,1)),INDEX(A2:A,1))}
Now c becomes 2 and a becomes
{"Unique ID";IF(SEQUENCE(INDEX(C2:C,1)),INDEX(A2:A,1))}
So when (2) is executed for the second time, this is what we get:
{{"Unique ID";IF(SEQUENCE(INDEX(C2:C,1)),INDEX(A2:A,1))};
IF(SEQUENCE(INDEX(C2:C,2)),INDEX(A2:A,2))}
We have now gone through all the values of c, so the formula stops iterating, and that array is effectively what REDUCE returns.
The number of iterations REDUCE performs depends on the size of its second argument.
Let's see another example. Assume (1) returns:
1
2
3
First time c=1, a="Unique ID":
{"Unique ID";IF(SEQUENCE(INDEX(C2:C,1)),INDEX(A2:A,1))}
Second time c=2, a=PREVIOUSLY_RETURNED_ARRAY:
{{"Unique ID";IF(SEQUENCE(INDEX(C2:C,1)),INDEX(A2:A,1))};
IF(SEQUENCE(INDEX(C2:C,2)),INDEX(A2:A,2))}
Third and last time c=3, a=PREVIOUSLY_RETURNED_ARRAY:
{{{"Unique ID";IF(SEQUENCE(INDEX(C2:C,1)),INDEX(A2:A,1))};
IF(SEQUENCE(INDEX(C2:C,2)),INDEX(A2:A,2))};
IF(SEQUENCE(INDEX(C2:C,3)),INDEX(A2:A,3))}
And that's the array REDUCE returns.
Do you see a pattern?
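To make the pattern concrete for the two sample rows in the question: IF(SEQUENCE(2),123) resolves to 123 stacked twice, and IF(SEQUENCE(3),456) to 456 stacked three times, so the whole expression collapses to the equivalent of:
{"Unique ID"; 123; 123; 456; 456; 456}
which is exactly the result column shown above.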
A different approach could be:
=QUERY(FLATTEN(INDEX(SPLIT(REPT(A2:A3&"|",C2:C3),"|"))),"where Col1 is not null")
Here REPT builds each ID repeated the requested number of times into a delimited string (e.g. "123|123|"), SPLIT breaks those strings back into cells, FLATTEN stacks everything into a single column, and QUERY drops the empty cells left over where the rows had different lengths.

query order by 2nd column when score is equal

I'm using this formula to sort the player list by the highest points in column G:
=QUERY(A2:G;"select * where A is not null order by G desc";0)
Some of the players have equal total points, but not equal times. Points are earned over different rounds, based on what time they finished.
If the players have equal points, I want to sort by a second column (their total finishing time) in column H.
example:
Both players below finished 1st and 2nd. Their total times differ by 1 minute; Player 2 should be ordered first based on his total time.
Note that I can't directly order by "Total Time" due to the point system in the background.
Player   Round1   Round2   Points   Total Time
1        3min     1min     10       4min
2        1min     2min     10       3min
Found it!
=QUERY(A2:H;"select * where A is not null order by G desc, H asc";0)
(Note the range is extended to A2:H so that QUERY can see the Total Time column it sorts by.)
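With the sample rows above, both players tie on points (G), so the secondary key (H) breaks the tie. The ordered output would look like this (a sketch, assuming the sample columns line up with the ranges in the formula):

Player   Round1   Round2   Points   Total Time
2        1min     2min     10       3min
1        3min     1min     10       4min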

Create Chart Showing Trend for Binary Outcome Variables Vs Date in Google Sheets

I have a dataset with several hundred rows showing whether I completed tasks on a given day. For both task1 and task2, I would mark a 1 if I did and 0 if I did not. An example of 5 rows is below.
Date task1 task2
1/1/20 1 0
2/1/20 0 0
3/1/20 1 1
4/1/20 1 1
5/1/20 1 1
...
I'm looking to create a chart with the date on the x-axis and the two variables on the y-axis. Then, using two different colours (green for 1 and white for 0), I would see how often I completed each task. I would also like to label the different parts of the chart to show the total consecutive days that each task was completed (or not).
An image below gives an idea of what I want it to look like (note I have much more data than three observations per month).
Not yet a full answer, but I've made a start: I've identified the value of each run.
See a very rough sample sheet here:
https://docs.google.com/spreadsheets/d/1qq1XRLNFIGunbwqhy5o-zP6NZbaEVFfvLGIRhOquxo4/edit?usp=sharing
Column H has the values for each run (for Task1): positive for a run of "1"s, negative for a run of "0"s. I haven't figured out how to plot them both on the same line yet.
The conditional formatting with green highlighting flags each row that needs to be plotted, due to a change in the run value.
I'll see if I can generate a chart from this...
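For reference, a signed run counter like the one described for column H could be built with an ordinary fill-down formula. This is a sketch of my own rather than the formula from the linked sheet; it assumes task1 is in column B with a header in row 1, and the formula goes in H2 and is filled down:

=IF(B2=B1, H1 + IF(B2=1, 1, -1), IF(B2=1, 1, -1))

Each cell extends the current run by +1 (for a 1) or -1 (for a 0), and resets whenever the task value changes, so the last cell of each run holds the full signed run length.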

InfluxDB: Starting cumulative_sum() from zero / aggregate grouping required for cumulative_sum and non_negative_difference

Using InfluxDB, I'm trying to produce an output that shows cumulative rainfall for a time period, starting from zero.
The rainfall sensor outputs a cumulative rainfall amount, but resets to zero on power failure, restart, etc.
My first query component uses non_negative_difference() to show the increments.
SELECT
    non_negative_difference(rain) AS nnd
FROM
    weather
WHERE
    $time_query

...which yields an increment per raw data point, for example:
2018-06-01T14:21:00.926Z 0
2018-06-01T14:22:02.959Z 0.30000000000000426
2018-06-01T14:23:04.992Z 0.3999999999999986
2018-06-01T14:24:07.024Z 0.10000000000000142
2018-06-01T14:25:09.059Z 0.19999999999999574
2018-06-01T14:26:11.094Z 0
2018-06-01T14:27:13.127Z 0.10000000000000142
2018-06-01T14:28:15.158Z 0.20000000000000284
2018-06-01T14:29:20.027Z 0.09999999999999432
2018-06-01T14:30:22.476Z 0.10000000000000142
2018-06-01T14:30:53.918Z 0.6000000000000014
2018-06-01T14:31:55.968Z 0.5
2018-06-01T14:32:58.007Z 0.5
2018-06-01T14:34:00.046Z 0.20000000000000284
2018-06-01T14:35:02.075Z 0.3999999999999986
2018-06-01T14:36:04.102Z 0.3999999999999986
2018-06-01T14:37:06.136Z 0.20000000000000284
2018-06-01T14:38:08.201Z 0
So far so good.
I'm now trying to stitch these readings back to cumulative total, starting from zero for the intended period.
I can use cumulative_sum() for this, for example:
SELECT
    cumulative_sum(nnd)
FROM (
    SELECT
        non_negative_difference(rain) AS nnd
    FROM
        weather
    WHERE
        $time_query
)
which yields:
2018-06-01T14:21:00.926Z 0
2018-06-01T14:22:02.959Z 0.30000000000000426
2018-06-01T14:23:04.992Z 0.7000000000000028
2018-06-01T14:24:07.024Z 0.8000000000000043
2018-06-01T14:25:09.059Z 1
2018-06-01T14:26:11.094Z 1
2018-06-01T14:27:13.127Z 1.1000000000000014
2018-06-01T14:28:15.158Z 1.3000000000000043
2018-06-01T14:29:20.027Z 1.3999999999999986
2018-06-01T14:30:22.476Z 1.5
2018-06-01T14:30:53.918Z 2.1000000000000014
2018-06-01T14:31:55.968Z 2.6000000000000014
2018-06-01T14:32:58.007Z 3.1000000000000014
2018-06-01T14:34:00.046Z 3.3000000000000043
2018-06-01T14:35:02.075Z 3.700000000000003
2018-06-01T14:36:04.102Z 4.100000000000001
2018-06-01T14:37:06.136Z 4.300000000000004
2018-06-01T14:38:08.201Z 4.300000000000004
Looking good!
Now I'd like to group it up into more distinct time buckets, for nice graphing.
Let's try....
SELECT
    cumulative_sum(max(nnd))
FROM (
    SELECT
        non_negative_difference(rain) AS nnd
    FROM
        weather
    WHERE
        $time_query
)
GROUP BY
    time(5m)
and I get an error: ERR: aggregate function required inside the call to non_negative_difference
But I cannot find a reasonable way of adding aggregates and groupings to non_negative_difference() that does not affect the accuracy of the differencing itself.
The only thing I've been able to do is a dummy SUM() aggregate over time groups that are smaller than the sensor's reporting period. But this isn't robust enough for my liking (and I'm still not sure it is 100% correct).
Is it correct that I must make both queries aggregate queries?
I was trying to do this very thing for my weather station. Instead of having the weather station calculate the cumulative value, I wanted Grafana to do it. The solution that worked for me is the "advanced syntax" Yuri Lachin mentions in his comments.
With InfluxDB you can use CUMULATIVE_SUM(), but the basic syntax doesn't allow you to group by time (only by tag). The "advanced syntax", however, allows you to have a time series by nesting an aggregate function like MEAN() or SUM().
Here's the function I am using in Grafana to get a cumulative rainfall total for a selected time period:
SELECT CUMULATIVE_SUM(MEAN("rainfall")) FROM "weather" WHERE $timeFilter GROUP BY time(1h) fill(0)
The GROUP BY is, of course, flexible. I was interested in hourly rainfall so I grouped by 1h. You can group by the time period you find most interesting.
Using this query, the rainfall will start from zero for the period you select in Grafana. In the Seattle area we had measurable rain (I know, shocker) on 8/6/2020 and 8/8/2020. If I set my date range to include both dates, the graph shows just under .2mm total rainfall. If I switch my graph to 8/8 and 8/9, the total is just under 1mm.
Note: I was also interested in seeing the individual bucket tips, so I included those as bars on the second Y-axis.
For more detail see: https://docs.influxdata.com/influxdb/v1.8/query_language/functions/#advanced-syntax-7
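Applying the same advanced-syntax idea to the original reset-prone counter would mean nesting an aggregate inside NON_NEGATIVE_DIFFERENCE as well. An untested sketch (the 1m inner interval is an assumption and should be no shorter than the sensor's reporting period):

SELECT CUMULATIVE_SUM(SUM(nnd))
FROM (
    SELECT NON_NEGATIVE_DIFFERENCE(LAST(rain)) AS nnd
    FROM weather
    WHERE $timeFilter
    GROUP BY time(1m)
)
WHERE $timeFilter
GROUP BY time(5m) fill(0)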

How do I get consistent values with influxdb non_negative_derivative?

Using Grafana with InfluxDB, I am trying to show the per-second rate of some value that is a counter. If I use the non_negative_derivative(1s) function, the rate seems to change dramatically depending on the time width of the Grafana view. I'm using the last selector (but could also use max, which gives the same value since it is a counter).
Specifically, I'm using:
SELECT non_negative_derivative(last("my_counter"), 1s) FROM ...
According to the InfluxDB docs for non_negative_derivative:
InfluxDB calculates the difference between chronological field values and converts those results into the rate of change per unit.
So to me, that means that the value at a given point should not change that much when expanding the time view, since the value should be rate of change per unit (1s in my example query above).
In graphite, they have the specific perSecond function, which works much better:
perSecond(consolidateBy(my_counter, 'max'))
Any ideas on what I'm doing wrong with the influx query above?
If you want per second results that don't vary, you'll want to GROUP BY time(1s). This will give you accurate perSecond results.
Consider the following example:
Suppose that the value of the counter at each second changes like so
0s → 1s → 2s → 3s → 4s
1 → 2 → 5 → 8 → 11
Depending on how we group the sequence above, we'll see different results.
Consider the case where we group things into 2s buckets.
0s-2s → 2s-4s
(5-1)/2 → (11-5)/2
2 → 3
versus the 1s buckets
0s-1s → 1s-2s → 2s-3s → 3s-4s
(2-1)/1 → (5-2)/1 → (8-5)/1 → (11-8)/1
1 → 3 → 3 → 3
Addressing
So to me, that means that the value at a given point should not change that much when expanding the time view, since the value should be rate of change per unit (1s in my example query above).
The rate of change per unit is a normalizing factor, independent of the GROUP BY time unit. Interpreting our previous example when we change the derivative interval to 2s may offer some insight.
The exact equation is
Δy / (Δx / tu)
where tu is the derivative's unit argument and Δx is the width of the GROUP BY bucket.
Consider the case where we group things into 1s buckets with a derivative interval of 2s. The result we should see is
0s-1s → 1s-2s → 2s-3s → 3s-4s
2*(2-1)/1 → 2*(5-2)/1 → 2*(8-5)/1 → 2*(11-8)/1
2 → 6 → 6 → 6
This may seem a bit odd, but if you consider what this says it should make sense. When we specify a derivative interval of 2s what we're asking for is what the 2s rate of change is for the 1s GROUP BY bucket.
If we apply similar reasoning to the case of 2s buckets with a derivative interval of 2s, we get
0s-2s → 2s-4s
2*(5-1)/2 → 2*(11-5)/2
4 → 6
What we're asking for here is what the 2s rate of change is for the 2s GROUP BY bucket and in the first interval the 2s rate of change would be 4 and the second interval the 2s rate of change would be 6.
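Putting that advice into a query, a consistent per-second version of the original query might look like the sketch below (the measurement name is a hypothetical placeholder, since the original FROM clause was elided):

SELECT non_negative_derivative(last("my_counter"), 1s)
FROM "my_measurement"
WHERE $timeFilter
GROUP BY time(1s) fill(null)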
@Michael-Desa gives an excellent explanation.
I'd like to augment that answer with a solution to a pretty common metric our company is interested in: "What is the maximum operations-per-second value on a specific measurement field?"
I will use a real-life example from our company.
Scenario Background
We send a lot of data from an RDBMS to redis. When transferring that data, we keep track of 5 counters:
TipTrgUp -> Updates by a business trigger (stored procedure)
TipTrgRm -> Removes by a business trigger (stored procedure)
TipRprUp -> Updates by an unattended auto-repair batch process
TipRprRm -> Removes by an unattended auto-repair batch process
TipDmpUp -> Updates by a bulk-dump process
We made a metrics collector that sends the current state of these counters to InfluxDB, with an interval of 1 second (configurable).
Grafana graph 1: low resolution, no true max ops
Here is the Grafana query that is useful, but does not show the true max ops when zoomed out (we know it will reach around 500 ops on a normal business day, when no special dumps or maintenance are taking place; otherwise it goes into the thousands):
SELECT
non_negative_derivative(max(TipTrgUp),1s) AS "update/TipTrgUp"
,non_negative_derivative(max(TipTrgRm),1s) AS "remove/TipTrgRm"
,non_negative_derivative(max(TipRprUp),1s) AS "autorepair-up/TipRprUp"
,non_negative_derivative(max(TipRprRm),1s) AS "autorepair-rm/TipRprRm"
,non_negative_derivative(max(TipDmpUp),1s) AS "dump/TipDmpUp"
FROM "$rp"."redis_flux_-transid-d-s"
WHERE
host =~ /$server$/
AND $timeFilter
GROUP BY time($interval),* fill(null)
Sidenotes: $rp is the name of the retention policy, templated in Grafana. We use CQs to downsample to retention policies with a longer duration. Also note the 1s derivative parameter: it is needed, since the default unit is different when using GROUP BY. This is easily overlooked in the InfluxDB documentation.
The graph, viewed over 24 hours, looks like this:
If we simply use a resolution of 1s (as suggested by @Michael-Desa), an enormous amount of data is transferred from InfluxDB to the client. It works reasonably well (about 10 seconds), but that is too slow for us.
Grafana graph 2: low and high resolution, true max ops, slow performance
We can, however, use subqueries to add the true max ops to this graph, which is a slight improvement. A lot less data is transferred to the client, but the InfluxDB server has to do a lot of number crunching. Series B (with maxops prepended in the aliases):
SELECT
max(subTipTrgUp) AS maxopsTipTrgUp
,max(subTipTrgRm) AS maxopsTipTrgRm
,max(subTipRprUp) AS maxopsRprUp
,max(subTipRprRm) AS maxopsTipRprRm
,max(subTipDmpUp) AS maxopsTipDmpUp
FROM (
SELECT
non_negative_derivative(max(TipTrgUp),1s) AS subTipTrgUp
,non_negative_derivative(max(TipTrgRm),1s) AS subTipTrgRm
,non_negative_derivative(max(TipRprUp),1s) AS subTipRprUp
,non_negative_derivative(max(TipRprRm),1s) AS subTipRprRm
,non_negative_derivative(max(TipDmpUp),1s) AS subTipDmpUp
FROM "$rp"."redis_flux_-transid-d-s"
WHERE
host =~ /$server$/
AND $timeFilter
GROUP BY time(1s),* fill(null)
)
WHERE $timeFilter
GROUP BY time($interval),* fill(null)
Gives:
Grafana graph 3: low and high resolution, true max ops, high performance, pre-calculate by CQ
Our final solution for this kind of metric (but only when we need a live view; the subquery approach works fine for ad-hoc graphs) is to use a Continuous Query to pre-calculate the true max ops. We generate CQs like this:
CREATE CONTINUOUS QUERY "redis_flux_-transid-d-s.maxops.1s"
ON telegraf
BEGIN
SELECT
non_negative_derivative(max(TipTrgUp),1s) AS TipTrgUp
,non_negative_derivative(max(TipTrgRm),1s) AS TipTrgRm
,non_negative_derivative(max(TipRprUp),1s) AS TipRprUp
,non_negative_derivative(max(TipRprRm),1s) AS TipRprRm
,non_negative_derivative(max(TipDmpUp),1s) AS TipDmpUp
INTO telegraf.A."redis_flux_-transid-d-s.maxops"
FROM telegraf.A."redis_flux_-transid-d-s"
GROUP BY time(1s),*
END
From here on, it's trivial to use these maxops measurements in Grafana. When downsampling to an RP with longer retention, we again use max() as the selector function.
Series B (with .maxops appended in the aliases)
SELECT
max(TipTrgUp) AS "update/TipTrgUp.maxops"
,max(TipTrgRm) AS "remove/TipTrgRm.maxops"
,max(TipRprUp) as "autorepair-up/TipRprUp.maxops"
,max(TipRprRm) as "autorepair-rm/TipRprRm.maxops"
,max(TipDmpUp) as "dump/TipDmpUp.maxops"
FROM "$rp"."redis_flux_-transid-d-s.maxops"
WHERE
host =~ /$server$/
AND $timeFilter
GROUP BY time($interval),* fill(null)
Gives:
When zoomed in to 1s precision, you can see that the graphs become identical:
Hope this helps, TW
The problem here is that the $__interval width changes depending on the time frame you are viewing in Grafana.
The way to get consistent results, then, is to take a sample from each interval (mean(), median(), or max() all work equally well) and then transform it with derivative($__interval). That way your derivative changes to match your interval length as you zoom in or out.
So, your query might look like:
SELECT derivative(mean("mem.gc.count"), $__interval) FROM "influxdb"
WHERE $timeFilter GROUP BY time($__interval) fill(null)
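Adapted to the counter from the question, that pattern would become something like this sketch (again the measurement name is a hypothetical placeholder; non_negative_derivative is swapped in to suppress negative spikes on counter resets):

SELECT non_negative_derivative(last("my_counter"), $__interval)
FROM "my_measurement"
WHERE $timeFilter GROUP BY time($__interval) fill(null)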
