Optimising a query (Min, Max, First and Last) - influxdb

First post here and new to InfluxDB.
I have been given the following query, and I have no clue how I can optimize it.
A period of 24 hours with 1m windows takes around 2.4 seconds at the moment (is this the expected amount of time?).
I suspect one of the reasons is that there are 4 tables (querying the same set of data) and 3 joins.
I have looked into the map function to try and reduce it to one table but I can't seem to get it to work with the window.
bucketName = "${bucket}"
startTime = -${period}
interval = ${interval}
token = "${token}"

minPrice = from(bucket: bucketName)
    |> range(start: startTime, stop: now())
    |> filter(fn: (r) => r["_field"] == token)
    |> window(every: interval)
    |> min()
    |> duplicate(column: "_value", as: "low")
    |> keep(columns: ["low", "_start", "_stop"])

maxPrice = from(bucket: bucketName)
    |> range(start: startTime, stop: now())
    |> filter(fn: (r) => r["_field"] == token)
    |> window(every: interval)
    |> max()
    |> duplicate(column: "_value", as: "high")
    |> keep(columns: ["high", "_start", "_stop"])

openPrice = from(bucket: bucketName)
    |> range(start: startTime, stop: now())
    |> filter(fn: (r) => r["_field"] == token)
    |> window(every: interval)
    |> first()
    |> duplicate(column: "_value", as: "open")
    |> keep(columns: ["open", "_start", "_stop"])

closePrice = from(bucket: bucketName)
    |> range(start: startTime, stop: now())
    |> filter(fn: (r) => r["_field"] == token)
    |> window(every: interval)
    |> last()
    |> duplicate(column: "_value", as: "close")
    |> keep(columns: ["close", "_start", "_stop"])

highLowData = join(tables: {min: minPrice, max: maxPrice}, on: ["_start", "_stop"])
openCloseData = join(tables: {open: openPrice, close: closePrice}, on: ["_start", "_stop"])
join(tables: {highLow: highLowData, openClose: openCloseData}, on: ["_start", "_stop"])
I have managed to optimize it down to 0.7s by using a union rather than a join. However, now I'm faced with data that has empty fields: each output row only has a value in one of the low/high/open/close columns, with the others left null. The query is below:
startTime = -24h
breakDown = 1m
token = "tokenName"

all = from(bucket: "prices")
    |> range(start: startTime, stop: now())
    |> filter(fn: (r) => r["_field"] == token)
    |> window(every: breakDown)

lowPrice = all
    |> min()
    |> duplicate(column: "_value", as: "low")
    |> keep(columns: ["low", "_start", "_stop"])

highPrice = all
    |> max()
    |> duplicate(column: "_value", as: "high")
    |> keep(columns: ["high", "_start", "_stop"])

openPrice = all
    |> first()
    |> duplicate(column: "_value", as: "open")
    |> keep(columns: ["open", "_start", "_stop"])

closePrice = all
    |> last()
    |> duplicate(column: "_value", as: "close")
    |> keep(columns: ["close", "_start", "_stop"])

highLowData = union(tables: [lowPrice, highPrice])
openCloseData = union(tables: [openPrice, closePrice])

result = union(tables: [highLowData, openCloseData])
    |> yield(name: "Result")
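One way to avoid both the joins and the half-empty union columns is to compute all four values in a single reduce() per window. A sketch reusing the all stream from above (untested; it assumes the price field is a float, and count acts as a first-row sentinel):

ohlc = all
    |> reduce(
        identity: {open: 0.0, high: 0.0, low: 0.0, close: 0.0, count: 0},
        fn: (r, acc) => ({
            // the first row in each window sets the open
            open: if acc.count == 0 then r._value else acc.open,
            high: if acc.count == 0 or r._value > acc.high then r._value else acc.high,
            low: if acc.count == 0 or r._value < acc.low then r._value else acc.low,
            // every row overwrites close, so the last row wins
            close: r._value,
            count: acc.count + 1
        })
    )
    |> drop(columns: ["count"])
    |> duplicate(column: "_stop", as: "_time")
    |> window(every: inf)
    |> yield(name: "OHLC")

This yields one row per window with open/high/low/close side by side, replacing the unions (and their yield) rather than adding to them.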

Related

Flux left join on empty table

I'm looking to join two data streams together but receive the following error from Influx:
error preparing right side of join: cannot join on an empty table
I'm trying to build a query which compares the total sales by a store this month compared to last month. If the store has no sales this month then I don't want it to show. Below is a basic example of my current query.
import "join"
lastMonth = from(bucket: "my-bucket")
|> range(start: 2022-10-01, stop: 2022-11-01)
|> filter(fn: (r) => r._measurement == "transaction")
|> pivot(rowKey: ["_time"], columnKey: ["_field"], valueColumn: "_value")
|> group(columns: ["storeId"], mode: "by")
|> reduce(
fn: (r, accumulator) => ({
storeId: r.storeId,
amount: accumulator.amount + (r.totalAmount - r.refundAmount)
}),
identity: {
storeId: "",
amount: 0.0
}
)
from(bucket: "my-bucket")
|> range(start: 2022-11-01, stop: 2022-12-01)
|> filter(fn: (r) => r._measurement == "transaction")
|> pivot(rowKey: ["_time"], columnKey: ["_field"], valueColumn: "_value")
|> group(columns: ["storeId"], mode: "by")
|> reduce(
fn: (r, accumulator) => ({
storeId: r.storeId,
amount: accumulator.amount + (r.totalAmount - r.refundAmount)
}),
identity: {
storeId: "",
amount: 0.0
}
)
|> join.left(
right: lastMonth,
on: (l, r) => l.storeId == r.storeId,
as: (l, r) => ({
storeId: l.storeId,
thisMonthAmount: l.amount,
lastMonthAmount: r.amount
})
)
How can I achieve this in Flux without encountering this issue?
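One workaround sometimes suggested for this error is to guarantee that the right side of the join is never empty by unioning in a sentinel row; the left join then keeps all real left-side rows, and the sentinel simply never matches anything. A hedged sketch under the schema above (the "__none__" store id is a made-up placeholder, untested):

import "array"

// Hypothetical sentinel: one row that no real storeId will ever match.
sentinel = array.from(rows: [{storeId: "__none__", amount: 0.0}])
    |> group(columns: ["storeId"])

lastMonthSafe = union(tables: [lastMonth, sentinel])

// ...then use join.left(right: lastMonthSafe, ...) exactly as above.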

influx query: how to get historical average

I am a SQL native struggling with Flux syntax (philosophy?) once again. Here is what I am trying to do: plot values of a certain measurement as a ratio of their historical average (say, over the past month).
Here is as far as I have gotten:
from(bucket: "secret_bucket")
|> range(start: v.timeRangeStart, stop:v.timeRangeStop)
|> filter(fn: (r) => r._measurement == "pg_stat_statements_fw")
|> group(columns: ["query"])
|> aggregateWindow(every: v.windowPeriod, fn: sum)
|> timedMovingAverage(every: 1d, period: 30d)
I believe this produces an average over the past 30 days, for each day window. Now what I don't know how to do is divide the original data by these values in order to get the relative change, i.e. something like value(_time)/tma_value(_time).
Thanks to @Munun, I got the following code working. I made a few changes since my original post to make things work as I needed.
import "date"
t1 = from(bucket: "secret_bucket")
|> range(start: v.timeRangeStart, stop:v.timeRangeStop)
|> filter(fn: (r) => r._measurement == "pg_stat_statements_fw")
|> group(columns: ["query"])
|> aggregateWindow(every: 1h, fn: sum)
|> map(fn: (r) => ({r with window_value: float(v: r._value)}))
t2 = from(bucket: "secret_bucket")
|> range(start: date.sub(from: v.timeRangeStop, d: 45d), stop: v.timeRangeStop)
|> filter(fn: (r) => r._measurement == "pg_stat_statements_fw")
|> mean(column: "_value")
|> group()
|> map(fn: (r) => ({r with avg_value: r._value}))
join(tables: {t1: t1, t2: t2}, on: ["query"])
|> map(fn: (r) => ({r with _value: (r.window_value - r.avg_value)/ r.avg_value * 100.0 }))
|> keep(columns: ["_value", "_time", "query"])
Here are a few steps you could try:

1. Re-add _time after the aggregate function so that you have the same number of records as the original:
   |> duplicate(column: "_stop", as: "_time")
2. Calculate the ratio of the two data sources via join and map.

The final Flux could be:
t1 = from(bucket: "secret_bucket")
    |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
    |> filter(fn: (r) => r._measurement == "pg_stat_statements_fw")
    |> group(columns: ["query"])
    |> aggregateWindow(every: v.windowPeriod, fn: sum)
    |> timedMovingAverage(every: 1d, period: 30d)
    |> duplicate(column: "_stop", as: "_time")

t2 = from(bucket: "secret_bucket")
    |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
    |> filter(fn: (r) => r._measurement == "pg_stat_statements_fw")

join(tables: {t1: t1, t2: t2}, on: ["hereIsTheTagName"])
    |> map(fn: (r) => ({r with _value: r._value_t2 / r._value_t1 * 100.0}))

InfluxDB 2.0 - Flux query: How to sum a column and use the sum for further calculations

I am new to the Flux query language (with InfluxDB 2) and can't find a solution for the following problem:
I have data with changing true and false values.
I was able to calculate the time in seconds until the next change by using the events.duration function.
Now I want to calculate the total time and the time of all "false" events, and after that I want to calculate the percentage of all false events. I tried the following:
import "contrib/tomhollingworth/events"
total = from(bucket: "********")
|> range(start: v.timeRangeStart, stop: v.timeRangeStop)
|> filter(fn: (r) => r["_measurement"] == "********")
|> filter(fn: (r) => r["Server"] == "********")
|> filter(fn: (r) => r["_field"] == "********")
|> filter(fn: (r) => r["DataNode"] == "********")
|> events.duration(
unit: 1s,
columnName: "duration",
timeColumn: "_time",
stopColumn: "_stop"
)
|> sum(column: "duration")
|> yield(name: "total")
downtime = from(bucket: "********")
|> range(start: v.timeRangeStart, stop: v.timeRangeStop)
|> filter(fn: (r) => r["_measurement"] == "********")
|> filter(fn: (r) => r["Server"] == "********")
|> filter(fn: (r) => r["_field"] == "********")
|> filter(fn: (r) => r["DataNode"] == "********")
|> events.duration(
unit: 1s,
columnName: "duration",
timeColumn: "_time",
stopColumn: "_stop"
)
|> pivot(rowKey:["_time"], columnKey: ["_value"], valueColumn: "duration")
|> drop(columns: ["true"])
|> sum(column: "false")
|> yield(name: "downtime")
downtime_percentage = downtime.false / total.duration
With this I am getting the following error: error #44:23-44:31: expected {A with false:B} but found [C]
I also tried some variations but couldn't get it to work.
I guess I am getting some basic things wrong but I couldn't figure it out yet. Let me know if you need more information.
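For context: total and downtime are streams of tables rather than records, so member access like downtime.false cannot type-check; the value has to be extracted into an array first, e.g. with findColumn, which is exactly what the solution below does. A minimal sketch (the fn: (key) => true predicate simply takes the first table):

totalArr = total
    |> findColumn(fn: (key) => true, column: "duration")
falseArr = downtime
    |> findColumn(fn: (key) => true, column: "false")

// Arrays are indexable, so scalar arithmetic becomes possible:
downtime_percentage = float(v: falseArr[0]) / float(v: totalArr[0])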
I have found a way to solve my problem. Although I am sure that there is a more elegant solution, I document my way here, maybe it helps someone and we can improve it together.
import "contrib/tomhollingworth/events"
//Set time window in seconds (based on selected time)
time_window = int(v: v.timeRangeStart)/-1000000000
//Filter (IoT-)Data
data= from(bucket: "*******")
|> range(start: v.timeRangeStart, stop: v.timeRangeStop)
|> filter(fn: (r) => r["_measurement"] == "*******")
|> filter(fn: (r) => r["Server"] == "*******")
|> filter(fn: (r) => r["Equipment"] == "*******")
|> filter(fn: (r) => r["DataNode"] == "******")
//Use events.duration to calculate the duration in seconds of each true/false event.
|> events.duration(
unit: 1s,
columnName: "duration",
timeColumn: "_time",
stopColumn: "_stop"
)
//Sum up the event times via "sum()" and save them as an array variable via "findColumn()". This is the only way to access the value later (As far as I know. please let me know if you know other ways!).
total_array = data
|> sum(column: "duration")
|> findColumn(
fn: (key) => key._field == "*******",
column: "duration",
)
//Calculate "missing time" in seconds in the time window, because the first event in the time window is missing.
missing_time = time_window - total_array[0]
//Create an array with the first event to determine if it is true or false
first_value_in_window = data
|> first()
|> findColumn(
fn: (key) => key._field == "*******",
column: "_value",
)
//Calculate the downtime by creating columns with the true and false values via pivot. Then sum up the column with the false values
downtime = data
|> map(fn: (r) => ({ r with duration_percentage: float(v: r.duration)/float(v: time_window) }))
|> pivot(rowKey:["_time"], columnKey: ["_value"], valueColumn: "duration_percentage")
|> map( fn: (r) => ({r with
downtime: if exists r.false then
r.false
else
0.0
}))
|> sum(column: "downtime")
//Create an array with the downtime so that this value can be accessed later on
downtime_array = downtime
|> findColumn(
fn: (key) => key._field == "PLS_Antrieb_laeuft",
column: "downtime",
)
//If the first value in the considered time window is true, then the remaining time in the time window (missing_time) was downtime. Write this value in the column "false_percentage_before_window".
//The total downtime is calculated from the previously calculated sum(downtime_array) and, if applicable, the downtime of the remaining time in the time window if the first value is true (first_value_in_window[0])
data
|> map( fn: (r) => ({r with
false_percentage_before_window: if first_value_in_window[0] then
float(v: missing_time)/float(v: time_window)
else
0.0
}))
|> map(fn: (r) => ({ r with _value: (downtime_array[0] + r.false_percentage_before_window) * 100.00 }))
|> first()
|> keep(columns: ["_value"])
|> yield(name: "Total Downtime")
This solution assumes that the true/false events only occur alternately.

Influxdb Flux query with custom window aggregate function

Could you please help me with the InfluxDB 2 Flux query syntax to build a windowed query with a custom aggregate function?
I went through the online docs, but they seem to be lacking examples of how to get to the actual window content (first and last records) from within a custom aggregate function. They also don't immediately describe the expected signature of custom functions.
I'd like to build a query with a sliding window that would produce a difference between the first and the last value in the window. Something along these lines:
difference = (column, tables=<-) => ({ tables.last() - tables.first() })
from(bucket: "my-bucket")
|> range(start: v.timeRangeStart, stop: v.timeRangeStop)
|> filter(fn: (r) => r["_measurement"] == "simple")
|> filter(fn: (r) => r["_field"] == "value")
|> aggregateWindow(every: 1mo, fn: difference, column: "_value", timeSrc: "_stop", timeDst: "_time", createEmpty: true)
|> yield(name: "diff")
The syntax of the above example is obviously wrong, but hopefully you can understand what I'm trying to do.
Thank you!
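For reference, the shape aggregateWindow expects from a custom aggregate (and which the self-answer below uses) is a function that receives the window's table stream via tables=<- together with the column name, and collapses each table to a single row. A trivial sketch, not from the original post:

// Counts the rows in each window; aggregateWindow passes `column` in.
countRows = (column, tables=<-) => tables |> count(column: column)

from(bucket: "my-bucket")
    |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
    |> aggregateWindow(every: 1mo, fn: countRows)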
Came up with the following. It works at least syntactically:
from(bucket: "my-bucket")
|> range(start: v.timeRangeStart, stop: v.timeRangeStop)
|> filter(fn: (r) => r["_measurement"] == "simple")
|> filter(fn: (r) => r["_field"] == "value")
|> aggregateWindow(
every: 1mo,
fn: (column, tables=<-) => tables |> reduce(
identity: {first: -1.0, last: -1.0, diff: -1.0},
fn: (r, acc) => ({
first:
if acc.first < 0.0 then r._value
else acc.first,
last:
r._value,
diff:
if acc.first < 0.0 then 0.0
else (acc.last - acc.first)
})
)
|> drop(columns: ["first", "last"])
|> set(key: "_field", value: column)
|> rename(columns: {diff: "_value"})
)
|> yield(name: "diff")
The window is not really sliding, though.
The same logic with a sliding window:
from(bucket: "my-bucket")
|> range(start: v.timeRangeStart, stop: v.timeRangeStop)
|> filter(fn: (r) => r["_measurement"] == "simple")
|> filter(fn: (r) => r["_field"] == "value")
|> window(every: 1h, period: 1mo)
|> reduce(
identity: {first: -1.0, last: -1.0, diff: -1.0},
fn: (r, acc) => ({
first:
if acc.first < 0.0 then r._value
else acc.first,
last:
r._value,
diff:
if acc.first < 0.0 then 0.0
else (acc.last - acc.first)
})
)
|> duplicate(column: "_stop", as: "_time")
|> drop(columns: ["first", "last"])
|> rename(columns: {diff: "_value"})
|> window(every: inf)
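As an aside, the duplicate(column: "_stop", as: "_time") followed by window(every: inf) tail reproduces what aggregateWindow does internally: each window's single result row is stamped with the window's stop time, and the per-window tables are then merged back into one series.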

Counting boolean values in flux query

I have a bucket where one field is a boolean
I'd like to count the number of true and the number of false for each hour
from(bucket: "xxx")
|> range(start: v.timeRangeStart, stop: v.timeRangeStop)
|> window(every: 1h)
|> filter(fn: (r) => r["_measurement"] == "xxx")
|> filter(fn: (r) => r["_field"] == "myBoolField")
|> group(columns: ["_stop"])
Because this is issued from a cron that runs every minute (more or less), this will give something like:
table _start _stop _time _value otherfield1 otherfield2
0 2021-05-18T19:00:00 2021-05-18T20:00 2021-05-18T19:01 false xxx xxx
0 2021-05-18T19:00:00 2021-05-18T20:00 2021-05-18T19:02 true xxx xxx
0 2021-05-18T19:00:00 2021-05-18T20:00 2021-05-18T19:03 true xxx xxx
...
1 2021-05-18T20:00:00 2021-05-18T21:00 2021-05-18T20:01 false xxx xxx
1 2021-05-18T20:00:00 2021-05-18T21:00 2021-05-18T20:02 false xxx xxx
1 2021-05-18T20:00:00 2021-05-18T21:00 2021-05-18T20:03 false xxx xxx
...
Now, I'd like to count the total, the number of false and the number of true for each hour (so for each table) but without losing/dropping the other fields
So I'd like a structure like
table _stop _value nbFalse nbTrue otherfield1 otherfield2
0 2021-05-18T20:00 59 1 58 xxx xxx
1 2021-05-18T21:00 55 4 51 xxx xxx
I've tried many combinations of pivot, count, ... without success
From my understanding, the correct way to do this is:
1. drop _start and _time
2. duplicate _value into nbTrue and nbFalse
3. re-aggregate by _stop to keep only true in nbTrue and false in nbFalse
4. count the three columns _value, nbTrue and nbFalse
|> drop(columns: ["_start", "_time"])
|> duplicate(column: "_value", as: "nbTrue")
|> duplicate(column: "_value", as: "nbFalse")
but I am stuck at step 3...
Didn't test it, but I have something similar to this in mind:
from(bucket: "xxx")
|> range(start: v.timeRangeStart, stop: v.timeRangeStop)
|> filter(fn: (r) => r["_measurement"] == "xxx")
|> filter(fn: (r) => r["_field"] == "myBoolField")
|> aggregateWindow(
every: 1h,
fn: (column, tables=<-) => tables |> reduce(
identity: {true_count: 0.0},
fn: (r, accumulator) => ({
true_count:
if r._value == true then accumulator.true_count + 1.0
else accumulator.true_count + 0.0
})
)
)
I got this from the docs and adjusted it a bit, I think it should get you what you need.
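If you also want the false count and the total per hour in the same pass (the table shape asked for above), the reduce can simply carry three accumulators. A sketch along the same lines (untested, same placeholder names):

from(bucket: "xxx")
    |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
    |> filter(fn: (r) => r["_measurement"] == "xxx")
    |> filter(fn: (r) => r["_field"] == "myBoolField")
    |> aggregateWindow(
        every: 1h,
        fn: (column, tables=<-) => tables
            |> reduce(
                identity: {total: 0, nbTrue: 0, nbFalse: 0},
                fn: (r, acc) => ({
                    total: acc.total + 1,
                    nbTrue: if r._value then acc.nbTrue + 1 else acc.nbTrue,
                    nbFalse: if r._value then acc.nbFalse else acc.nbFalse + 1
                })
            )
    )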
The answer from @dskalec should work; I did not test it directly because, in the end, I needed to aggregate more than just the boolean field.
Here is my query; you can see that it uses the same aggregate+reduce (I just use a pivot beforehand to have more than one field to aggregate):
from(bucket: "rt")
|> range(start: v.timeRangeStart, stop: v.timeRangeStop)
|> filter(fn: (r) => r["_measurement"] == "xxx")
|> pivot(
rowKey:["_time"],
columnKey: ["_field"],
valueColumn: "_value"
)
|> window(every: 1h, createEmpty: true)
|> reduce(fn: (r, accumulator) => ({
nb: accumulator.nb + 1,
field1: accumulator.field1 + r["field1"], //field1 is an int
field2: if r["field2"] then accumulator.field2 + 1 else accumulator.field2, //field2 is a boolean
}),
identity: {nb: 0, field1: 0, field2: 0}
)
|> duplicate(column: "_stop", as: "_time")
|> drop(columns: ["_start", "_stop"])
