I'm trying to write the aggregated result of two measurements into a single measurement.
I found in the documentation that you can write multiple matching measurements with the :MEASUREMENT keyword in an INTO query, like this:
SELECT * INTO "copy_NOAA_water_database"."autogen".:MEASUREMENT FROM
"NOAA_water_database"."autogen"./.*/
What I'm trying to do is aggregate across multiple measurements and write the result to a single measurement:
SELECT mean("water_level") INTO
"copy_NOAA_water_database"."autogen"."water_agg" FROM
"NOAA_water_database"."autogen"./.*/ GROUP BY time(15m), *
The above query runs successfully, but I'm not sure whether InfluxDB considered points from all measurements of NOAA_water_database or only the last matching measurement.
Q: I'm not sure whether InfluxDB considered points from all measurements of NOAA_water_database or only the last matching measurement.
A: I suspect InfluxDB is not aggregating the data across your measurements.
I think it aggregates each measurement individually and writes each result to the measurement you specified. Because the mean operation can resolve to the same timestamp for each source measurement, measurement B's result can overwrite measurement A's result.
I arrived at this theory by running an experiment with the following dataset:
INSERT cpu,host=serverA value=10
INSERT cpu,host=serverA value=20
INSERT cpu2,host=serverA value=5
INSERT cpu2,host=serverA value=15
A SELECT statement similar to your query above returns:
select * FROM "historian"."autogen"./cpu.*/
name: cpu
time host value
---- ---- -----
1546511130857357196 serverA 10
1546511132744883738 serverA 20
name: cpu2
time host value
---- ---- -----
1546511156629403118 serverA 5
1546511157888695746 serverA 15
Then, instead of using mean, I used sum to test the behaviour of InfluxDB.
I also simplified the query by dropping the GROUP BY clause.
The sum gives me:
SELECT sum("value") INTO test_sum FROM "historian"."autogen"./.*/
name: result
time written
---- -------
0 2
> select * from test_sum;
name: test_sum
time sum
---- ---
0 20
Theory: if InfluxDB were aggregating the data from all measurements, the sum would be 50 (10 + 20 + 5 + 15), not 20. The only way to arrive at 20 is by summing 5 + 15, which is the data from the last measurement.
However, InfluxDB did report that 2 rows were written. My explanation is that InfluxDB did calculate the sum of the first measurement, but because the first and second results both have time 0, the second measurement's result overwrote the first.
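The overwrite theory can be illustrated with a toy model. This is only a sketch, not InfluxDB code: it assumes the query produces one result row per source measurement, and it relies on the fact that InfluxDB identifies a point by measurement, tag set and timestamp. The names `store` and `write_point` are made up for the illustration.

```python
# Toy model: a point is keyed by (measurement, tag set, timestamp);
# writing the same key again replaces the stored field values.
store = {}

def write_point(measurement, tags, time, fields):
    key = (measurement, tuple(sorted(tags.items())), time)
    store.setdefault(key, {}).update(fields)

# The experiment's dataset, one list of values per source measurement
source = {
    "cpu": [10, 20],
    "cpu2": [5, 15],
}

written = 0
for name, values in source.items():
    # Assumption: one aggregated row per source measurement,
    # each written at time 0 with no tags
    write_point("test_sum", {}, 0, {"sum": sum(values)})
    written += 1

final = store[("test_sum", (), 0)]
print(written, final)
```

Two rows are written, but only one point remains, and it holds the sum of the last measurement.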
Recommended solution:
The best tool for this job is actually InfluxDB's Kapacitor. It is a great tool because it is fast; however, it is also extremely hard to learn.
Alternatively, if your dataset isn't huge (which I suspect is the case, since you are grouping by 15m), you can write a script in your favourite programming language to read the data, compute the mean and write the result back to InfluxDB.
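A minimal sketch of the aggregation step of such a script, assuming the rows have already been read from each measurement as (epoch seconds, value) pairs. Reading from and writing back to InfluxDB is left out, the function name `mean_by_bucket` and the sample rows are hypothetical, and tags are ignored for brevity (the original query also groups by all tags).

```python
from collections import defaultdict

def mean_by_bucket(points, bucket_seconds=15 * 60):
    """Pool (epoch_seconds, value) pairs and average them per time bucket."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for ts, value in points:
        bucket = ts - ts % bucket_seconds  # start of the 15-minute window
        sums[bucket] += value
        counts[bucket] += 1
    return {b: sums[b] / counts[b] for b in sorted(sums)}

# Hypothetical rows already read from two measurements
measurement_a = [(0, 2.0), (600, 4.0), (900, 10.0)]
measurement_b = [(300, 6.0), (1200, 20.0)]

# Pooling the rows first is what makes this a mean across measurements
result = mean_by_bucket(measurement_a + measurement_b)
print(result)
```

Each key of `result` is a bucket start time, so every bucket yields exactly one point to write to the target measurement.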
Related
I write sensor data every second to an InfluxDB database. Displaying weekly, monthly or yearly summaries in Grafana is quite slow, since it needs to query many thousands of values.
To speed things up, I was thinking about using a cron job to run queries like
select mean(sensor1) into data_avg_1h from data where time > start and time <= end group by time(1h)
select mean(sensor1) into data_avg_1d from data where time > start and time <= end group by time(1d)
select mean(sensor1) into data_avg_1w from data where time > start and time <= end group by time(1w)
This would mean I need more storage, but queries would run much faster.
Is this a bodge job or acceptable, and is there a cleverer way to do something like this?
Yes. It is perfectly OK, and downsampling the data as you describe in the question is also the recommended approach.
However, instead of using a cron job, it is better to use the Continuous Query feature of InfluxDB to achieve the same result.
See the Downsampling and Continuous Queries documentation.
Be aware that if you store averages over short periods and later want the average over a longer period from this downsampled data, you must compute a weighted average. Otherwise you are calculating an average of averages, which may not equal the average computed from the original data.
This is because each downsampled average may be based on a different number of data points.
So when calculating the mean at regular intervals, also store the number of data points received in that interval. This lets you compute the weighted average later.
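A small sketch of the difference, assuming each downsampled interval is stored as a (mean, count) pair. The function name `combine` and the sample values are made up for illustration.

```python
def combine(buckets):
    """Weighted mean of (mean, count) pairs; None if there are no points."""
    total = sum(count for _, count in buckets)
    if total == 0:
        return None
    return sum(mean * count for mean, count in buckets) / total

# Hypothetical hourly buckets: one point averaging 10, three points averaging 20
hourly = [(10.0, 1), (20.0, 3)]

naive = sum(mean for mean, _ in hourly) / len(hourly)  # average of averages
weighted = combine(hourly)                             # (10*1 + 20*3) / 4

print(naive, weighted)
```

The naive result gives the single-point hour the same weight as the three-point hour, which is why it drifts from the mean of the original data.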
I have an InfluxDB measurement which includes a field that holds either 0 or 1.
How do I find the longest unbroken run of a given value?
Imagine that the field represents whether the sun is up or not, and I have a year's worth of data. I would like the query which finds the longest unbroken run of 1's, which would represent the longest day of the year and return me something like "23rd June 5am to 23rd June 9pm". (I'm in the northern hemisphere, and totally made those times up, but hopefully you get the idea.)
I don't think this can be done with InfluxQL. In many RDBMS, it's possible to do similar operations in a single SQL query using window functions and grouping.
I've experimented with a few approaches, but as of v1.3 I believe InfluxQL is just not expressive enough for this task. Limitations include:
No window functions (although some functions exhibit similar behaviour, e.g. DIFFERENCE, DERIVATIVE).
time cannot be manipulated like an ordinary tag or field value. For example, it's not possible to take the FIRST(time) of a group of measurements.
Can only GROUP BY time or tag, not by field value (or derived value from a subquery result). Additionally, when grouped by time, only group interval timestamps are returned by selector functions.
Can only ORDER BY time.
The best way to do this is therefore at the application level.
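As a rough sketch of what the application-level approach could look like, assume the points have been fetched as (time, value) pairs sorted by time. The function name `longest_run` is hypothetical, and the run's end is taken to be the timestamp of its last matching point.

```python
from itertools import groupby

def longest_run(points, target=1):
    """Return (start, end) times of the longest unbroken run of `target`."""
    best = None
    # groupby yields one group per stretch of consecutive equal values
    for value, group in groupby(points, key=lambda p: p[1]):
        if value != target:
            continue
        run = list(group)
        start, end = run[0][0], run[-1][0]
        if best is None or end - start > best[1] - best[0]:
            best = (start, end)
    return best

# Hypothetical data: times are plain integers here
points = [(0, 0), (1, 1), (2, 1), (3, 0), (4, 1), (5, 1), (6, 1), (7, 0)]
print(longest_run(points))
```

Runs are compared by duration rather than by number of points, so irregularly spaced data is handled as well.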
Edit: for the record, the closest I can get is to use ELAPSED to find the longest gap(s) between subsequent 0 values. This might work for you if your data model is a specific shape and data comes in at regular intervals:
SELECT TOP(elapsed, N) AS elapsed FROM (SELECT ELAPSED(field) FROM measurement WHERE field != 1)
Returns e.g. for N = 1:
time elapsed
---- -------
2000 500
However, there is no guarantee that there is a value of 1 in the gap. Use FIRST to retrieve the first measurement with field == 1 within the gap, or nothing if there are none:
SELECT FIRST(field) FROM measurement WHERE field = 1 AND time < 2000 and time > (2000 - 500)
Returns e.g.:
time first
---- -----
1000 1
Therefore the longest run of 1 values is from 1000 -> 2000.
In InfluxDB v1.3, I have a measurement with one field and a tag that can take two values.
I would like to compute (x where mytag=y) - (x where mytag=z), using the last value of each series when needed (something like an as-of join, see http://code.kx.com/wiki/Reference/aj). I would like to do this in one query, if possible.
If the above is not possible, is there a different schema (e.g. using separate measurements) where what I would like to do is feasible? If so, can you please elaborate on the structure and the query?
SELECT difference(mean(x))
FROM <measurement>
WHERE time > now() - 1h AND (mytag='y' OR mytag='z')
GROUP BY time(60s), mytag
Functions like difference require an aggregate query (group by time()) as well as an aggregation function for the values within the grouped window (mean above).
Difference then shows the differences between sequential aggregated values for the time period specified, additionally grouped by the two tag values specified.
These can be adjusted depending on your data.
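The query above reports differences between successive windows within each tag value. If the y-minus-z comparison from the question is computed client-side instead, an as-of join can be sketched as follows. This assumes each series has been fetched separately as (time, value) pairs sorted by time; the function name and sample data are hypothetical.

```python
from bisect import bisect_right

def asof_difference(left, right):
    """For each left point, subtract the latest right value at or before it."""
    right_times = [t for t, _ in right]
    out = []
    for t, v in left:
        i = bisect_right(right_times, t)  # number of right points with time <= t
        if i == 0:
            continue  # no right value is known yet at this time
        out.append((t, v - right[i - 1][1]))
    return out

# Hypothetical points for mytag=y and mytag=z
series_y = [(10, 5.0), (20, 7.0), (30, 9.0)]
series_z = [(5, 1.0), (25, 4.0)]

diff = asof_difference(series_y, series_z)
print(diff)
```

Left points that precede every right point are dropped rather than filled, which is one possible convention among several.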
I'm sending metrics in StatsD format to Telegraf, which forwards them to InfluxDB 0.9.
I'm measuring execution times (of some event) from multiple hosts. The measurement is called "execTime", and the tag is "host". Once Telegraf gets these numbers, it calculates mean/upper/lower/count, and stores them in separate measurements.
Sample data looks like this in influxdb:
TIME  FIELD           HOST  VALUE
t1    execTime.count  VM1   3
t1    execTime.mean   VM1   15
t1    execTime.count  VM2   6
t1    execTime.mean   VM2   22
(So at time t1, there were 3 events on VM1, with mean execution time 15ms, and on VM2 there were 6 events, and the mean execution time was 22ms)
Now I want to calculate the mean execution time across both hosts at time t1, which is (3*15 + 6*22)/(3+6) = 177/9, or about 19.67 ms.
But since the count and mean values are in two different series, I can't simply use "select mean(value) from execTime.mean".
Do I need to change my schema, or can I do this with the current setup?
What I need is essentially a new series, which is a combination of the execTime.count and execTime.mean across all hosts. Instead of calculating this on-the-fly, the best approach seems to be to actually create the series along with the others.
So now I have two timer stats being generated on each host for each event:
1. one event with actual hostname for the 'host' tag
2. second event with one tag "host=all"
I can use the first set of series to check mean execution times per host. And the second series gives me the mean time for all hosts combined.
It is possible to do mathematical operations on fields from two different series, provided both series belong to the same measurement. I suspect your schema is not optimized for your use case.
In my case, I need to capture 15 performance metrics for devices and save them to InfluxDB. Each device has a unique device id.
Metrics are written into InfluxDB in the following way. Here I show only one as an example:
new Serie.Builder("perfmetric1")
.columns("time", "value", "id", "type")
.values(getTime(), getPerf1(), getId(), getType())
.build()
Writing data is fast and easy, but I see bad performance when I run queries. I'm trying to get all 15 metric values for the last hour.
select value from perfmetric1, perfmetric2, ..., perfmetric15
where id='testdeviceid' and time > now() - 1h
For an hour, each metric has 120 data points, in total it's 1800 data points. The query takes about 5 seconds on a c4.4xlarge EC2 instance when it's idle.
I believe InfluxDB can do better. Is this a problem of my schema design, or is it something else? Would splitting the query into 15 parallel calls go faster?
As @valentin's answer says, you need to build an index for the id column for InfluxDB to perform these queries efficiently.
In 0.8 stable you can do this "indexing" using continuous fanout queries. For example, the following continuous query will expand your perfmetric1 series into multiple series of the form perfmetric1.id:
select * from perfmetric1 into perfmetric1.[id];
Later you would do:
select value from perfmetric1.testdeviceid, perfmetric2.testdeviceid, ..., perfmetric15.testdeviceid where time > now() - 1h
This query will take much less time to complete since InfluxDB won't have to perform a full scan of the timeseries to get the points for each testdeviceid.
Build an index on the id column. It seems that the engine uses a full table scan to retrieve the data. If you split your query into 15 threads, the engine will perform 15 full scans and performance will be much worse.