I have a table with metrics per box in QuestDB with the columns
Timestamp (designated timestamp)
Machine (Symbol)
CPU (double)
and I want to downsample the results to a 2-minute interval, taking the average per machine, so that the output has the same columns but with one data point per 2 minutes per machine. I have a feeling there should be a special SQL extension syntax for this, but I cannot make it work so far.
You can use SAMPLE BY for that:
SELECT Timestamp, Machine, AVG(CPU)
FROM tablename
SAMPLE BY 2m
This will automatically group the results by Machine and by 2-minute timestamp intervals.
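If you also want the 2-minute buckets aligned to calendar boundaries rather than to the first row in the result, SAMPLE BY accepts an ALIGN TO CALENDAR clause. A minimal sketch reusing the query above (alignment behaviour can differ between QuestDB versions, so verify on yours):
-- same aggregation, but buckets snap to calendar boundaries
SELECT Timestamp, Machine, AVG(CPU)
FROM tablename
SAMPLE BY 2m
ALIGN TO CALENDAR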
Related
I am working on a project on InfluxDB where I have time-series data. Here I have to find the peak hour and display the data during that peak hour.
The task is to:
Group the 24-hour data into 1-hour buckets and get the mean value of each.
Among these mean values, find the max value and its time.
Using this time, display the data for that 1 hour.
I have written the query to find the max value and its time period:
SELECT max(mean)
FROM (SELECT mean("value")
FROM "measurement"
WHERE ("floorNo"='1' AND time >= now() - 24h)
GROUP BY time(1h))
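For the last step, one way to display the raw data for the hour that was found is to feed the returned time back into a plain range query. A sketch, assuming purely for illustration that the peak hour came back as 2023-05-01T14:00:00Z:
SELECT "value"
FROM "measurement"
WHERE "floorNo" = '1'
AND time >= '2023-05-01T14:00:00Z'
AND time < '2023-05-01T15:00:00Z'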
I am working on InfluxDB and have time-series data which records the temperature of a location. Here I have to find the max temperature and its timestamp for each day.
The task is to:
Find the mean temperature on a 1-hour basis.
Then find the max temperature from the above means for each day.
I have written a query but I'm not getting the output as required.
SELECT MAX(mean)
FROM (SELECT mean("value")
FROM "temperature"
WHERE ("location" = 'L1')
GROUP BY time(1h))
GROUP BY time(1d)
I'm getting the output as:
time max
---- ---
2020-01-17T00:00:00Z 573.44
2020-01-16T00:00:00Z 674.44
Here I am getting the timestamp as 00:00:00Z. Is there a way to get the exact time? I.e. if the mean temperature is 573.44 at the 13:00 hour on 2020-01-17, the timestamp should be 2020-01-17T13:00:00Z.
Currently, no.
GROUP BY effectively removes per-entry timestamps. It's like all the data for the grouping interval (e.g. for the day) gets dropped into a bucket without its timestamps. The bucket has a single timestamp: the start of the interval.
The result only ever has the timestamp of the start of the GROUP BY interval.
The only way to do this is in the app, not in the backend. You might want to experiment: if you only need resolution down to, say, the nearest 5 minutes, do the query using GROUP BY time(5m). This lets you trade the amount of data retrieved by the client against backend processing time.
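For example, instead of asking the server for the daily maximum, the client could retrieve the per-interval means and pick the maximum (and its timestamp) itself. A sketch at 5-minute resolution over the last day (the time filter is an assumption, added only to bound the result set):
SELECT mean("value")
FROM "temperature"
WHERE "location" = 'L1' AND time >= now() - 1d
GROUP BY time(5m)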
If my interpretation is correct, according to the documentation provided here: InfluxDB Downsampling, when we downsample data using a Continuous Query running every 30 minutes, it runs only on the previous 30 minutes of data.
Relevant part of the document:
Use the CREATE CONTINUOUS QUERY statement to generate a CQ:
CREATE CONTINUOUS QUERY "cq_30m" ON "food_data" BEGIN
SELECT mean("website") AS "mean_website",mean("phone") AS "mean_phone"
INTO "a_year"."downsampled_orders"
FROM "orders"
GROUP BY time(30m)
END
That query creates a CQ called cq_30m in the database food_data.
cq_30m tells InfluxDB to calculate the 30-minute average of the two
fields website and phone in the measurement orders and in the DEFAULT
RP two_hours. It also tells InfluxDB to write those results to the
measurement downsampled_orders in the retention policy a_year with the
field keys mean_website and mean_phone. InfluxDB will run this query
every 30 minutes for the previous 30 minutes.
When I create a Continuous Query it actually runs on the entire dataset, and not on the previous 30 minutes. My question is: does this happen only the first time, after which it runs on the previous 30 minutes of data instead of the entire dataset?
I understand that the query itself uses GROUP BY time(30m), which means it'll return all data grouped together, but does this also hold true for the Continuous Query? If so, should I then include a filter in the Continuous Query to only process the last 30 minutes of data?
What you have described is expected functionality.
Schedule and coverage
Continuous queries operate on real-time data. They use the local server’s timestamp, the GROUP BY time() interval, and InfluxDB database’s preset time boundaries to determine when to execute and what time range to cover in the query.
CQs execute at the same interval as the cq_query’s GROUP BY time() interval, and they run at the start of the InfluxDB database’s preset time boundaries. If the GROUP BY time() interval is one hour, the CQ executes at the start of every hour.
When the CQ executes, it runs a single query for the time range between now() and now() minus the GROUP BY time() interval. If the GROUP BY time() interval is one hour and the current time is 17:00, the query’s time range is between 16:00 and 16:59.999999999.
So it should only process the last 30 minutes.
It's a good point about the first run.
I did manage to find a snippet from an old document:
Backfilling Data
In the event that the source time series already has data in it when you create a new downsampled continuous query, InfluxDB will go back in time and calculate the values for all intervals up to the present. The continuous query will then continue running in the background for all current and future intervals.
https://influxdbcom.readthedocs.io/en/latest/content/docs/v0.8/api/continuous_queries/#backfilling-data
That would explain the behaviour you have found.
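If you ever need explicit control over how far back each run looks (for example, to recompute a couple of recent intervals rather than relying on the default one-interval coverage), InfluxDB 1.x also has an advanced RESAMPLE clause for continuous queries. A sketch modelled on the documented example above; the EVERY/FOR values and the CQ name are only illustrative:
CREATE CONTINUOUS QUERY "cq_30m_resample" ON "food_data"
RESAMPLE EVERY 30m FOR 60m
BEGIN
  SELECT mean("website") AS "mean_website", mean("phone") AS "mean_phone"
  INTO "a_year"."downsampled_orders"
  FROM "orders"
  GROUP BY time(30m)
END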
I am storing Amps, Volts and Watts in InfluxDB in a measurement/table called "Power". The frequency of update is approximately every second. I can use the integral function to get power usage (in amp-hours or watt-hours) on an hourly basis. This is working very nicely, so I can get a graph of power used each hour over a 24-hour period. My SQL is below.
The issue is that if there is a gap in the data then I get a huge spike in the result when it returns. E.g. if data was missing from 3 pm to 5.45 pm, then the 5 pm result shows a huge spike. The reason, as far as I can see, is that there is a gap of close to 3 hours, so it just calculates the area under the graph and lumps it into the 5 pm value. Can I avoid that?
SELECT INTEGRAL(Watts) FROM Power WHERE time > now() - 24h GROUP BY time(1h)
I had a similar issue with Influx. It turns out that integral() doesn't support fill(), as noted by Yuri Lachin in the comments.
Since you're grouping by hours anyway, the average power (watts) for the hour is numerically equal to the energy consumption for the hour (watt-hours), so you can use the mean() value here and you should get the correct result.
The query I'm using is:
SELECT mean("load_power") AS "load"
FROM "power_readings"
WHERE $timeFilter
GROUP BY time(1h) fill(0)
For daily numbers I can go back to using integral(), because I rarely have gaps in the data that span multiple days, so no fill is needed.
Since you can use the fill() function in this query, you can decide which of the various fill options makes the most sense (see https://docs.influxdata.com/influxdb/v1.7/query_language/data_exploration/#group-by-time-intervals-and-fill).
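For the daily case, integral() also accepts an optional unit argument, so the daily energy can be reported directly in watt-hours. A sketch reusing the measurement and field names from the query above (the alias is made up for illustration):
SELECT integral("load_power", 1h) AS "load_wh"
FROM "power_readings"
WHERE $timeFilter
GROUP BY time(1d)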
You need to use fill() in the GROUP BY section of the query (see the InfluxDB docs for fill() usage).
In your case fill(none) or fill(0) should do the job.
In my case, I need to capture 15 performance metrics per device and save them to InfluxDB. Each device has a unique device id.
Metrics are written into InfluxDB in the following way (here I only show one as an example):
new Serie.Builder("perfmetric1")
.columns("time", "value", "id", "type")
.values(getTime(), getPerf1(), getId(), getType())
.build()
Writing data is fast and easy. But I saw bad performance when I ran a query. I'm trying to get all 15 metric values for the last hour:
select value from perfmetric1, perfmetric2, ..., perfmetric15
where id='testdeviceid' and time > now() - 1h
For an hour, each metric has 120 data points, in total it's 1800 data points. The query takes about 5 seconds on a c4.4xlarge EC2 instance when it's idle.
I believe InfluxDB can do better. Is this a problem with my schema design, or is it something else? Would splitting the query into 15 parallel calls be faster?
As valentin's answer says, you need to build an index on the id column for InfluxDB to perform these queries efficiently.
In 0.8 stable you can do this "indexing" using continuous fanout queries. For example, the following continuous query will expand your perfmetric1 series into multiple series of the form perfmetric1.id:
select * from perfmetric1 into perfmetric1.[id];
Later you would do:
select value from perfmetric1.testdeviceid, perfmetric2.testdeviceid, ..., perfmetric15.testdeviceid where time > now() - 1h
This query will take much less time to complete since InfluxDB won't have to perform a full scan of the timeseries to get the points for each testdeviceid.
Build an index on the id column. It seems that the engine uses a full scan of the table to retrieve the data. By splitting your query into 15 threads, the engine will do 15 full scans and the performance will be much worse.