InfluxDB: create a non-time-based window, then aggregate results

Imagine I've got 10,000 rows, and I want to split them into "chunks" of 100, then run each chunk through an aggregate function.
I know this is doable using windows, but sadly those are time-based, and I need index-based ones.
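One workaround, if nothing server-side fits, is to do the chunking client-side after fetching the rows. A minimal Java sketch (hypothetical; it assumes the values have already been queried into a List<Double> and uses the mean as the aggregate):

import java.util.ArrayList;
import java.util.List;

// Hypothetical client-side fallback: split already-fetched values into
// fixed-size chunks and aggregate each chunk (here: the mean).
public class ChunkAggregator {
    public static List<Double> chunkedMeans(List<Double> values, int chunkSize) {
        List<Double> means = new ArrayList<>();
        for (int start = 0; start < values.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, values.size());
            double sum = 0;
            for (double v : values.subList(start, end)) {
                sum += v;
            }
            means.add(sum / (end - start));
        }
        return means;
    }
}

For 10,000 rows, chunkedMeans(values, 100) yields 100 aggregated values; if the row count didn't divide evenly, the last chunk would simply be smaller.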

Related

Using GROUP BY tag across entire database

How does running a separate query for each value of a tag perform compared to running a single query for the same data, across the entire database, with GROUP BY "tag"? The first works for me. The second runs for a while but never returns anything (which could be the fault of either Influx or the software package acting as middle-man, I assume).
My InfluxDB database has data from 220 test events; each test event lasts about two hours, and each test has tens of thousands of parameters. Test number is a tag.
I want to compute the COUNT(), MIN(), MEAN(), and MAX() of 10 different parameters for each test. I know I can write a Python script that submits a separate InfluxQL query for each test number (WHERE "test" = xxx) and compiles the results from each query into a relatively small amount of data. This takes maybe 12 minutes.
Alternatively, I've tried running one single query (same SELECT and FROM clauses), but instead of WHERE "test" = xxx, I simply GROUP BY "test". This seems to run for at least 14 minutes but then disappears (per SHOW QUERIES) without ever responding to my influxdb.DataFrameClient.
Is there something particularly problematic about the second approach? It seems to be the more intuitive approach for the analyst, but I can't get it to work.
Thanks!

How to measure throughput with dynamic interval in Grafana

We are measuring throughput using Grafana and InfluxDB. Naturally, we would like to express throughput as the approximate number of requests that happen every second (rps).
The typical request is:
SELECT sum("count") / 10 FROM "http_requests" GROUP BY time(10s)
But then we lose the ability to use the dynamic $__interval, which is very useful when the graph's time range is large, such as a day or a week: whenever the interval changes, we have to change the divisor in the SELECT expression to match.
SELECT sum("count") / $__interval FROM "http_requests" GROUP BY time($__interval)
But this approach does not work; it returns an empty result.
How can we build a query that uses the dynamic $__interval to measure throughput?
The reason you get no results is that $__interval is not a number but a string such as 10s, 1m, etc., which InfluxDB interprets as a duration. So it is not possible to use it the way you are trying.
However, what you want to calculate is the mean which is available as a function in InfluxQL. The way to get the behavior that you want is with something like this.
SELECT mean("count") FROM "http_requests" GROUP BY time($__interval)
Edit: On second thought, that is not quite what you want.
You'd probably need to use derivative. I'll come back to you on that one later.
Edit 2: Do you think this answers your question: Calculating request per second using InfluxDB on Grafana?
Edit 3: Third edit's a charm.
We take your starting query and wrap it in another one, like so:
SELECT sum("rps") from (SELECT sum("count") / 10 as rps FROM "http_requests" GROUP BY time(10s)) GROUP BY time($__interval)

Is there a way to tell Google Cloud Dataflow that the data coming in is already ordered?

We have an input data source that is approximately 90 GB (it can be either CSV or XML; it doesn't matter) and that contains an already ordered list of data. For simplicity, you can think of it as having two columns: a time column and a string column. The hundreds of millions of rows in this file are already ordered by the time column, ascending.
In our Google Cloud Dataflow pipeline, we have modeled each row as an element of a PCollection, and we apply DoFn transformations to the string field (e.g. counting the number of uppercase characters in the string). This works fine.
However, we then need to apply functions that are calculated over a block of time (e.g. five minutes) with a one-minute overlap. So we are thinking about using a sliding windowing function (even though the data is bounded).
However, the calculation logic that needs to be applied over these five-minute windows assumes that the data is ordered (i.e. ascending) by the time field. My understanding is that even when using these windowing functions, one cannot assume that within each window the PCollection elements are ordered in any way, so one would need to manually iterate through every window and reorder them, right? However, this seems like a huge waste of computational power, since the incoming data is already ordered. So is there a way to tell Google Cloud Dataflow that the input data is ordered, so that it maintains that order even within the windows?
On a minor note, I had another question: my understanding is that if the data source is unbounded, an "overall aggregation" function would never execute, as that never really makes sense (there is no end to the incoming data); however, if one uses a windowing function on bounded data, there is a true end state, which corresponds to when all the data has been read from the CSV file. So, is there a way to tell Google Cloud Dataflow to run a final calculation once all the data has been read in, even though we are using a windowing function to divide the data up?
SlidingWindows sounds like the right solution for your problem. The ordering of the incoming data is not preserved across a GroupByKey, so informing Dataflow of it would not currently be useful. However, the batch Dataflow runner already sorts by timestamp in order to implement windowing efficiently, so for simple windowing like SlidingWindows your code will see the data in order.
If you want to do a final calculation after doing some windowed calculations on a bounded data set, you can re-window your data into the global window again, and do your final aggregation after that:
p.apply(Window.into(new GlobalWindows()));
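For illustration, here is a minimal Beam-style sketch of that shape. The names (rows, windowedCounts, peak) are hypothetical, it assumes a PCollection<String> with event timestamps already attached, and the final aggregate (a max over the per-window counts) is just an example:

import org.apache.beam.sdk.transforms.Combine;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Max;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.SlidingWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// Five-minute windows sliding every minute, then a per-window count.
PCollection<Long> windowedCounts = rows
    .apply(Window.<String>into(
        SlidingWindows.of(Duration.standardMinutes(5))
                      .every(Duration.standardMinutes(1))))
    .apply(Combine.globally(Count.<String>combineFn()).withoutDefaults());

// Re-window into the global window for one final pass once the bounded
// input is exhausted, e.g. the largest five-minute count seen anywhere.
PCollection<Long> peak = windowedCounts
    .apply(Window.<Long>into(new GlobalWindows()))
    .apply(Max.longsGlobally());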

Batching Elements in Google Cloud Dataflow Pipeline

I'm looking into grouping elements mid-pipeline into batches of a fixed size.
In pseudocode:
PCollection[String].apply(Grouped.size(10))
Basically, this converts a PCollection[String] into a PCollection[List[String]] where each list now contains 10 elements. Since this is a batch pipeline, if the element count doesn't divide evenly, the last batch would contain the leftover elements.
I have two ugly ideas involving windows with fake timestamps, or a GroupBy with keys based on a random index to distribute elements evenly, but both seem too complex a solution for such a simple problem.
This question is similar to a variety of questions on how to batch elements. Take a look at these to get you started:
Can datastore input in google dataflow pipeline be processed in a batch of N entries at a time?
Partition data coming from CSV so I can process larger batches rather than individual lines
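For reference, the usual non-windowed workaround those answers sketch is a DoFn that buffers elements and flushes in groups; a minimal, hypothetical Beam-style version (note that bundle boundaries are a runner decision, so each bundle's trailing batch can be smaller than the batch size):

import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.windowing.GlobalWindow;

// Hypothetical batching DoFn: buffers elements and emits them in groups
// of batchSize. Assumes a bounded input in the global window.
public class BatchFn extends DoFn<String, List<String>> {
  private final int batchSize;
  private transient List<String> buffer;

  public BatchFn(int batchSize) {
    this.batchSize = batchSize;
  }

  @StartBundle
  public void startBundle() {
    buffer = new ArrayList<>();
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    buffer.add(c.element());
    if (buffer.size() >= batchSize) {
      c.output(new ArrayList<>(buffer));
      buffer.clear();
    }
  }

  @FinishBundle
  public void finishBundle(FinishBundleContext c) {
    // Flush leftovers; this is where the smaller final batch appears.
    if (!buffer.isEmpty()) {
      c.output(new ArrayList<>(buffer),
               GlobalWindow.INSTANCE.maxTimestamp(),
               GlobalWindow.INSTANCE);
      buffer.clear();
    }
  }
}

Applied as input.apply(ParDo.of(new BatchFn(10))), this turns a PCollection<String> into a PCollection<List<String>> of (at most) 10-element batches.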

InfluxDB performance

For my use case, I need to capture 15 performance metrics for devices and save them to InfluxDB. Each device has a unique device id.
Metrics are written into InfluxDB in the following way (here I only show one as an example):
// influxdb-java client (InfluxDB 0.8 API): build one point for the
// "perfmetric1" series with explicit column names and values.
new Serie.Builder("perfmetric1")
    .columns("time", "value", "id", "type")
    .values(getTime(), getPerf1(), getId(), getType())
    .build()
Writing data is fast and easy, but I'm seeing bad performance when I run queries. I'm trying to get all 15 metric values for the last hour:
select value from perfmetric1, perfmetric2, ..., perfmetric15
where id='testdeviceid' and time > now() - 1h
Over that hour each metric has 120 data points, so in total it's 1,800 data points. The query takes about 5 seconds on a c4.4xlarge EC2 instance when it's otherwise idle.
I believe InfluxDB can do better. Is this a problem with my schema design, or is it something else? Would splitting the query into 15 parallel calls be faster?
As @valentin's answer says, you need to build an index on the id column for InfluxDB to perform these queries efficiently.
In 0.8 stable you can do this "indexing" using continuous fanout queries. For example, the following continuous query will expand your perfmetric1 series into multiple series of the form perfmetric1.id:
select * from perfmetric1 into perfmetric1.[id];
Later you would do:
select value from perfmetric1.testdeviceid, perfmetric2.testdeviceid, ..., perfmetric15.testdeviceid where time > now() - 1h
This query will take much less time to complete since InfluxDB won't have to perform a full scan of the timeseries to get the points for each testdeviceid.
Build an index on the id column. It seems that the engine does a full scan of the table to retrieve the data. If you split your query across 15 threads, the engine will perform 15 full scans and performance will be much worse.
