Optimizing Group By in Flux - influxdb

I have a measurement with a few million rows of data containing information about around 20 thousand websites.
show tag keys from site_info:
domain
proxy
http_response_code
show field keys from site_info:
responseTime
uuid
source
What I want to do is count all of the uuids for each website over a given time frame. I have tried writing a query like this one:
from(bucket: "telegraf/autogen")
    |> range($range)
    |> filter(fn: (r) =>
        r._measurement == "site_info" and
        r._field == "uuid")
    |> group(columns: ["domain"])
    |> count()
However, this query takes up to 45 minutes to run for a time range of just now()-6h (presumably because I am trying to group the data into 20k+ groups).
Any suggestions on how to optimize the query so that it does not take this long, without altering the data schema?

I think that, for the time being, Flux's InfluxDB datastore integration is simply not optimized yet. They announced that performance tuning should start in the beta phase.

Related

Flux Query which creates new field based on presence of data point

I am currently facing the following problem.
I have an InfluxDB instance which receives data in the form of timestamps and other fields. Now I want to evaluate at which times of the day data arrived in the DB.
Therefore I thought about creating a query which filters for the measurement I am interested in and then for the field "Time". This field contains a timestamp of the device which sent the data.
Now I want to do 2 things:
Fill in gaps between data points, so if the difference between two "Time" values is > 120000 (2 minutes), create a new row
Every row of the stream shall contain a field "_existed" which is true if the row was present before filling the gaps and false if not
My idea then looks like this:
import "date"
currentDay = now()
ago = date.sub(from:currentDay, d:2d)
from(bucket:"Testmachine")
|> range(start: ago, stop:currentDay)
|> filter(fn: (r) => r._measurement == "G/G0010757")
|> filter(fn: (r) => r._field == "Time")
|> map(fn: (r) => ({Number: exists(r._value), Time: r._time }))
But sadly I can't get it to work. The outcome I would expect would look like this:
Time: Unix_Timestamp
Value: Boolean if row existed
Can someone help me to find a solution or is there currently none?
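One possible approach, offered as a minimal sketch rather than a confirmed solution: bucket the data into fixed 2-minute windows with aggregateWindow(), count the points in each window, and derive the flag from that count. This approximates the requirement (it marks every empty 2-minute window instead of inserting rows only where the gap between two consecutive "Time" values exceeds 2 minutes). The bucket, measurement, and field names are taken from the question; the 2-minute window size and the output column names are assumptions.

```flux
import "date"

stop = now()
start = date.sub(from: stop, d: 2d)

from(bucket: "Testmachine")
    |> range(start: start, stop: stop)
    |> filter(fn: (r) => r._measurement == "G/G0010757" and r._field == "Time")
    // count() yields 0 for windows without data when createEmpty is true
    |> aggregateWindow(every: 2m, fn: count, createEmpty: true)
    // _existed is true only for windows that contained at least one point
    |> map(fn: (r) => ({Time: r._time, _existed: r._value > 0}))
```

Note that Time here is the window's stop time (the aggregateWindow() default), not the timestamp reported by the device.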

Is there a way to fill a result of sparse data with 0 value points with Flux?

I have points spread out every 5 min, and when the value is 0 the point is just omitted. I'd like to fill the omitted data with empty values.
I see with InfluxQL I could do:
group by time(5m) fill(0)
But I am using InfluxDB 2. I have tried this query:
from(bucket:"%v")
|> range(start: %d)
|> filter(fn: (r) => r._measurement == "volume" and r.id == "%v")
|> window(every: 5m, period: 5m, createEmpty: true)
|> fill(value: 0)
But it does not appear to be working.
Any help is appreciated.
It turns out this is a bug in InfluxDB related to https://github.com/influxdata/influxdb/issues/21857
Apparently the window function does not work either.
The fill() function only replaces nulls in the data, not data points that are missing based on time. At the moment there is no function available for this, although one has been requested.
What I've done for time periods where I need to fill in missing data is to generate a time series (with zero values) and join it with the time series that has the missing data.
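Since fill() only replaces nulls, the rows for the missing intervals have to exist first. A commonly documented way to get them is aggregateWindow() with createEmpty: true, which emits one row per window and leaves the value null where the window had no data. The following is a sketch under assumptions: the bucket name, the id value, and the 1h range are placeholders for the %v / %d parameters in the question, and the field is assumed to be a float. Whether it behaves correctly on a version affected by the issue linked above is not something this sketch can confirm.

```flux
from(bucket: "example-bucket")
    |> range(start: -1h)
    |> filter(fn: (r) => r._measurement == "volume" and r.id == "example-id")
    // one output row per 5m window; empty windows get a null _value
    |> aggregateWindow(every: 5m, fn: sum, createEmpty: true)
    // the fill value must match the column type (use 0 for an integer field)
    |> fill(value: 0.0)
```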

Why is this InfluxDB Flux query returning 2 tables?

Obviously I'm new to InfluxDB and the Flux query language, so I appreciate your patience! Happy to be redirected to documentation, but I haven't found anything genuinely useful to date.
I've configured Jenkins (2.277.3) to push build metrics to InfluxDB (Version 2.0.5 ('7c3ead)) using plugin (https://plugins.jenkins.io/influxdb/). No custom metrics at the moment. Data is being successfully sent.
I'd like to build a simple bar chart to show build times for a specific project. Each "bar" would be an individual build (with a distinct build number). Also:
X-axis, date/time of build
Y-axis, duration of build
(Ideally bars would be green/red to indicate success/anything else and would be labelled with job number. In time I'd like to add an overlay with average build time.)
I'm trying to create the query(ies) to support this view:
from(bucket: "db0")
|> range(start: -2d)
|> filter(fn: (r) => r["project_name"] == "Job2")
|> filter(fn: (r) => r._measurement == "jenkins_data" and r._field == "build_time" )
This results in 2 tables in the Table view, one for build SUCCESS and one for build FAILURE. Can someone explain to me why this is the case, and whether I'm missing some fundamental understanding of how to use the tool?
"Each Flux query returns a stream of tables, meaning your query can return multiple tables. The tables are created according to the grouping. If you change the grouping at the end of your query, you can merge these tables into one. The simplest example is to add |> group() at the end and see that you now get just one table."
Accepting @ditoslav's comment as the answer to my question.
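Applied to the query from the question, the suggestion could look like the sketch below. The sort() call is an addition, not part of the quoted comment: after merging tables with an empty group(), row order is not guaranteed, so sorting by _time keeps the builds in chronological order for a chart.

```flux
from(bucket: "db0")
    |> range(start: -2d)
    |> filter(fn: (r) => r["project_name"] == "Job2")
    |> filter(fn: (r) => r._measurement == "jenkins_data" and r._field == "build_time")
    // an empty group() clears the group key, merging all tables into one
    |> group()
    |> sort(columns: ["_time"])
```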

How do I get at values other than a filtered field in Flux?

Looking at "Mistake 3" in the best practices for InfluxDB 2.0:
https://www.influxdata.com/blog/data-layout-and-schema-design-best-practices-for-influxdb/
"Mistake 3: Making ids (such as eventid, orderid, or userid) a tag. This is another example that can cause unbounded cardinality if the tag values aren't scoped.
Solution 3: Instead, make these metrics a field."
This is all fine and dandy, but then how do I pull out results? It doesn't seem to be described anywhere. All examples show filtering by tag, not by a field. So if I am logging "SensorId, Temp, Humidity" and these are all fields, how do I get the Temp graph for SensorId 97?
In order to filter by field in Flux you'd write:
from(bucket: "TestBucket")
|> range(start: v.timeRangeStart, stop: v.timeRangeStop)
|> filter(fn: (r) => r["_measurement"] == "sensor" and r["_field"] == "sensorid" and r["_value"] == 97)
But now I'm stuck with just sensorid values. The temp and hum values have vanished. I am trying to wrap my head around Flux, but it is hard given that you write data as a record and this is what you have in your head when designing your solution. But then in Flux all of a sudden the columns seem to vanish as you try to narrow your result set.
Pivot is your friend in these cases. So this bit of code:
|> pivot(
rowKey:["_time"],
columnKey: ["_field"],
valueColumn: "_value"
)
transforms a result table to use the fields as column names and _value as value for those columns. I'm still left wondering if this is more efficient than using a tag for an id column. Or, more precisely, at what point one becomes more efficient than the other.
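Put together with the query from the question, a sketch could look like this. The field names sensorid and temp are assumptions based on the question's description, and the approach relies on all fields of one reading sharing the same timestamp, so that pivot() can line them up in one row. The key point is to filter on the measurement only, pivot, and then filter on the sensorid column.

```flux
from(bucket: "TestBucket")
    |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
    // keep all fields of the measurement; do not filter on _field here
    |> filter(fn: (r) => r["_measurement"] == "sensor")
    |> pivot(rowKey: ["_time"], columnKey: ["_field"], valueColumn: "_value")
    // after the pivot, each field is its own column
    |> filter(fn: (r) => r.sensorid == 97)
    |> keep(columns: ["_time", "temp"])
```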

What is the equivalent of SELECT first(column), last(column) in Flux Query Language?

What would be equivalent flux queries for
SELECT first(column) as first, last(column) as last FROM measurement ?
SELECT last(column) - first(column) as column FROM measurement ?
(I am referring to Flux, the new query language developed by InfluxData)
There are first() and last() functions, but I am unable to find an example that uses both in the same query.
This is the Flux documentation, for reference:
https://docs.influxdata.com/flux/v0.50/introduction/getting-started
https://v2.docs.influxdata.com/v2.0/query-data/get-started/
If you (or someone else who lands here) just wanted the difference between the min and max values you would want the built-in spread function.
The spread() function outputs the difference between the minimum and maximum values in a specified column.
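A minimal usage sketch, where the bucket, measurement, and field names are placeholders:

```flux
from(bucket: "example-bucket")
    |> range(start: -1d)
    |> filter(fn: (r) => r._measurement == "measurement" and r._field == "column")
    // one row per input table: max(_value) - min(_value)
    |> spread()
```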
However, you're asking for the difference between the first and last values in a stream, and there doesn't seem to be a built-in function for that (probably because most streams are expected to be dynamic in range). To achieve this, you could either write a custom aggregate function, as in a similar answer, or combine two queries and then take the difference:
data = from(bucket: "example-bucket") |> range(start: -1d) |> filter(fn: (r) => r._field == "field-you-need")
temp_earlier_number = data |> first() |> set(key: "_field", value: "delta")
temp_later_number = data |> last() |> set(key: "_field", value: "delta")
union(tables: [temp_later_number, temp_earlier_number])
|> difference()
What this does is create two tables with a field named delta and then union them, resulting in a table with two rows: one representing the first value and the other representing the last. Then we take the difference between those two rows. If you don't want negative numbers, just be sure to subtract in the correct order for your data (or use math.abs).
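For the first form of the question (first and last side by side as columns), a variation on the same idea is to label the two results differently and pivot them into one row. This is a sketch under assumptions: the bucket, measurement, and field names are placeholders, the variable names firstPoint and lastPoint are made up for illustration, and the value is assumed to be numeric.

```flux
data = from(bucket: "example-bucket")
    |> range(start: -1d)
    |> filter(fn: (r) => r._measurement == "measurement" and r._field == "column")

firstPoint = data |> first() |> set(key: "_field", value: "first")
lastPoint = data |> last() |> set(key: "_field", value: "last")

union(tables: [firstPoint, lastPoint])
    // both rows share the same _stop from range(), so they collapse into one row
    |> pivot(rowKey: ["_stop"], columnKey: ["_field"], valueColumn: "_value")
    // explicit subtraction order, so no sorting is needed
    |> map(fn: (r) => ({r with delta: r.last - r.first}))
```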
