InfluxDB TOP function poor performance - influxdb

I'm using InfluxDB and I'm trying to query values in it with the TOP() function.
Here is an example request:
SELECT TOP("duration", 2) AS "top_duration" FROM "range" WHERE "time" > '2017-11-23T15:23:32.243Z' AND "contract" = 'A0000544' AND "type" = 'PRESENCE' AND "room" = '3908' AND "endTime" < 80785557 AND "startTime" > 28630649
In the measurement, contract, type and room are tags; duration, startTime and endTime are fields.
I have around 37 866 326 points in range, but only 78 962 for contract 'A0000544' and 10 487 for room '3908'.
This request takes several seconds and I'm trying to reduce the processing time.
I tried creating another measurement to reduce my sample, keeping only the biggest "duration" values.
I kept only 4 066 728 points but the processing time was the same.
When I keep only the points for that contract in the measurement, the request takes around 300 ms.
I don't understand why there is such a big difference in execution time compared to an almost empty database, and on the other hand no difference with the filtered measurement.
Am I missing something? Are there any other possible optimisations?

That is just an assumption, but maybe filtering by field rather than by tags alone, plus having 3 fields in a single measurement, is a performance killer. Fields are not indexed, so filtering by fields requires a full table scan. Besides, multiple fields per data point mean more data to scan for every point.
I am not sure of the solution... Probably, InfluxDB was not designed for such a complex table schema.
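For reference, this is how the schema from the question splits into indexed and unindexed parts when writing a point with the influxdb Python client (a sketch only; the connection settings and field values are made up):

# Sketch only: shows which parts of the question's schema InfluxDB indexes.
# Connection settings and field values are assumptions.
from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="mydb")

client.write_points([{
    "measurement": "range",
    "tags": {                      # tags are indexed: cheap to filter on
        "contract": "A0000544",
        "type": "PRESENCE",
        "room": "3908",
    },
    "fields": {                    # fields are not indexed: filtering on
        "duration": 1234.0,        # startTime/endTime means scanning every
        "startTime": 28630650,     # point the tag filters leave
        "endTime": 80785550,
    },
    "time": "2017-11-23T15:23:32.243Z",
}])

Note that moving high-cardinality values like startTime into tags would blow up series cardinality, so the usual compromise is to keep them as fields and accept the scan, or to encode the constraint into the indexed time dimension where the data model allows it.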

Related

Slow query with 22 million points

I have 1 TB (text data).
I installed InfluxDB on a machine (240 GB RAM, 32 CPUs).
I only inserted around 22 million points into one measurement, with one tag and 110 fields.
When I run a query (select id from ts limit 1), it takes more than 20 seconds, and this is not good.
So can you please help me with what I should do to get good performance?
How many series do you have?
Maybe your problem comes from here:
https://docs.influxdata.com/influxdb/v1.2/concepts/schema_and_data_layout/#don-t-have-too-many-series
Tags containing highly variable information like UUIDs, hashes, and random strings will lead to a large number of series in the database, known colloquially as high series cardinality. High series cardinality is a primary driver of high memory usage for many database workloads
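To check this, you can count the series. A minimal sketch with the influxdb Python client, assuming a local server and that the single tag is called id (the question doesn't name it); on InfluxDB 1.4+ a plain SHOW SERIES CARDINALITY statement does the same job:

# Assumed connection settings; the tag key "id" is a guess based on the
# question's query.
from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="mydb")

series = list(client.query("SHOW SERIES").get_points())
print("series count:", len(series))

# With a single tag, cardinality is roughly the number of distinct values:
ids = list(client.query('SHOW TAG VALUES WITH KEY = "id"').get_points())
print("distinct id values:", len(ids))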

How to find longest run of a certain value in InfluxDB

I have an InfluxDB measurement which includes a field that holds either 0 or 1.
How do I find the longest unbroken run of a given value?
Imagine that the field represents whether the sun is up or not, and I have a year's worth of data. I would like the query which finds the longest unbroken run of 1's, which would represent the longest day of the year and return me something like "23rd June 5am to 23rd June 9pm". (I'm in the northern hemisphere, and totally made those times up, but hopefully you get the idea.)
I don't think this can be done with InfluxQL. In many RDBMS, it's possible to do similar operations in a single SQL query using window functions and grouping.
I've experimented with a few approaches, but as of v1.3 I believe InfluxQL is just not expressive enough for this task. Limitations include:
No window functions (although some functions exhibit similar behaviour, e.g. DIFFERENCE, DERIVATIVE).
time cannot be manipulated like an ordinary tag or field value. For example, it's not possible to take the FIRST(time) of a group of measurements.
Can only GROUP BY time or tag, not by field value (or derived value from a subquery result). Additionally, when grouped by time, only group interval timestamps are returned by selector functions.
Can only ORDER BY time.
The best way to do this is therefore at the application level (see the sketch at the end of this answer).
Edit: for the record, the closest I can get is to use ELAPSED to find the longest gap(s) between subsequent 0 values. This might work for you if your data model is a specific shape and data comes in at regular intervals:
SELECT TOP(elapsed, N) AS elapsed FROM (SELECT ELAPSED(field) FROM measurement WHERE field != 1)
Returns e.g. for N = 1:
time elapsed
---- -------
2000 500
However, there is no guarantee that there is a value of 1 in the gap. Use FIRST to retrieve the first measurement with field == 1 within the gap, or nothing if there are none:
SELECT FIRST(field) FROM measurement WHERE field = 1 AND time < 2000 and time > (2000 - 500)
Returns e.g.:
time first
---- -----
1000 1
Therefore the longest run of 1 values is from 1000 -> 2000.
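For completeness, a minimal application-level sketch of the approach suggested above. It assumes the points have already been fetched from InfluxDB (e.g. with a client library), sorted by time, as (timestamp, value) pairs; all names are illustrative:

# Sketch: find the longest unbroken run of a target value in a
# time-sorted list of (timestamp, value) pairs fetched from InfluxDB.
def longest_run(points, target=1):
    best = None        # (start, end) of the best run found so far
    run_start = None   # start of the current run, or None if not in a run
    prev_ts = None     # timestamp of the previous target value seen

    def better(start, end):
        return best is None or (end - start) > (best[1] - best[0])

    for ts, value in points:
        if value == target:
            if run_start is None:
                run_start = ts
            prev_ts = ts
        elif run_start is not None:
            if better(run_start, prev_ts):
                best = (run_start, prev_ts)
            run_start = None
    if run_start is not None and better(run_start, prev_ts):
        best = (run_start, prev_ts)
    return best

# Toy data echoing the example above: a run of 1s starting at t=1000,
# broken by the 0 at t=2000.
print(longest_run([(1000, 1), (1500, 1), (2000, 0)]))  # -> (1000, 1500)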

Performance issues when retrieving last value

I have a measurement that keeps track of sensor readings for a bunch of machines.
There are on the order of 50 different readings per machine, and there are up to 1000 machines. We have one reading every 30 seconds.
The way I store the readings is in a single measurement which has 2 tags, machine_id (entity_id in the queries below) and analysis_id, and a single value field.
One of the use cases I have is to retrieve the current value for each reading for a list of machines.
When this database gets to 100 million records or so, which with those numbers means less than 1 day of data, I can no longer retrieve the last values with a query, as it takes too long.
I tried the two following alternatives:
SELECT *
FROM analysisvalue
WHERE entity_id = '1' or entity_id = '2'
GROUP BY analysis_id, entity_id
ORDER BY time DESC
LIMIT 1
and:
SELECT last(*) AS value
FROM analysisvalue
WHERE entity_id = '1' or entity_id = '2'
GROUP BY analysis_id, entity_id
Both of them take a pretty long time to complete. At 100 million records it's something of the order of 1 second.
The use case of retrieving the latest values is a very frequent one. I need to be able to get the "current" state of machines almost instantly.
I can work around that in the app logic, by keeping track of the latest value in a separate place, but I was wondering what I could do with InfluxDB alone.
I was facing something similar and I worked around it by creating a continuous query.
https://docs.influxdata.com/influxdb/v0.8/api/continuous_queries/
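Those docs cover the old v0.8 syntax. Purely as an illustration, on a 1.x server the same idea could look like the sketch below: a continuous query keeps a small downsampled measurement of last values, which then stays cheap to query (database and measurement names are assumptions, and the 5m interval is an arbitrary trade-off against the 30-second reporting rate):

# Sketch of the continuous-query workaround on InfluxDB 1.x.
# Connection settings and all names are assumptions.
from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="mydb")

# Keep the last value per series per 5 minutes in a small side measurement.
client.query(
    'CREATE CONTINUOUS QUERY "cq_latest" ON "mydb" BEGIN '
    'SELECT last("value") AS "value" INTO "latest_analysisvalue" '
    'FROM "analysisvalue" GROUP BY time(5m), "entity_id", "analysis_id" '
    'END',
    method="POST",  # DDL statements must be POSTed on 1.x
)

# Reading the "current" state now scans roughly 10x fewer points:
current = client.query(
    'SELECT last("value") FROM "latest_analysisvalue" '
    "WHERE entity_id = '1' OR entity_id = '2' "
    'GROUP BY "entity_id", "analysis_id"'
)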

InfluxDB performance

For my case, I need to capture 15 performance metrics for devices and save them to InfluxDB. Each device has a unique device id.
Metrics are written into InfluxDB in the following way. Here I only show one as an example:
new Serie.Builder("perfmetric1")
.columns("time", "value", "id", "type")
.values(getTime(), getPerf1(), getId(), getType())
.build()
Writing data is fast and easy. But I saw bad performance when I run query. I'm trying to get all 15 metric values for the last one hour.
select value from perfmetric1, perfmetric2, ..., perfmetric15
where id='testdeviceid' and time > now() - 1h
For an hour, each metric has 120 data points, in total it's 1800 data points. The query takes about 5 seconds on a c4.4xlarge EC2 instance when it's idle.
I believe InfluxDB can do better. Is this a problem with my schema design, or is it something else? Would splitting the query into 15 parallel calls be faster?
As @valentin's answer says, you need to build an index for the id column for InfluxDB to perform these queries efficiently.
In 0.8 stable you can do this "indexing" using continuous fanout queries. For example, the following continuous query will expand your perfmetric1 series into multiple series of the form perfmetric1.id:
select * from perfmetric1 into perfmetric1.[id];
Later you would do:
select value from perfmetric1.testdeviceid, perfmetric2.testdeviceid, ..., perfmetric15.testdeviceid where time > now() - 1h
This query will take much less time to complete since InfluxDB won't have to perform a full scan of the timeseries to get the points for each testdeviceid.
Build an index on the id column. It seems that the engine does a full scan of the table to retrieve the data. By splitting your query into 15 threads, the engine will do 15 full scans and performance will be much worse.
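For readers on InfluxDB 0.9 and later, where the Serie.Builder API and fanout queries above no longer exist, the equivalent of this "index" is simply making id a tag, since tags are indexed. A hedged sketch with the influxdb Python client (connection settings and field values are made up):

# Sketch for InfluxDB 0.9+: store the device id as a tag so the query
# below can use the tag index instead of a full scan. Values are made up.
from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="mydb")

client.write_points([{
    "measurement": "perfmetric1",
    "tags": {"id": "testdeviceid", "type": "router"},  # indexed
    "fields": {"value": 42.0},                         # not indexed
}])

result = client.query(
    "SELECT value FROM perfmetric1 "
    "WHERE \"id\" = 'testdeviceid' AND time > now() - 1h"
)

On 1.x the 15 measurements can also be queried in one statement with a measurement regex, e.g. FROM /perfmetric\d+/, instead of 15 separate calls.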

find record closest to a given time in ruby on rails

Background
I have a RoR application which continuously records sensor data and shows it on a web site in real time. I have a table called sensors which holds a unique list of all sensors and stores the latest values.
I also have another table, histories, which stores every sensor value ever received for each sensor.
So the relation is "sensor has many histories"; the time_stamp column records the creation timestamp.
Not all sensors update at the same interval or frequency.
Problem
Now I want to take an input timestamp from the user, a date and time in the past, and show what the sensors were reading at that time. For example, say I want to see what all sensors looked like at 2 PM yesterday: once I have this timestamp from the user, how do I retrieve a sensor's value closest to the input timestamp from the histories table?
I am looking to add a method to the Sensor model which will take time_stamp as an argument and retrieve the value closest to the input time_stamp from the histories table.
What is the simplest way to write this ActiveRecord query?
Thanks
Shaunak
Just sort the histories according to the difference between the passed timestamp and the history timestamp (absolute value so it can go in either direction), and return the top result (that's the one with the smallest difference).
sensor.histories.order("ABS(time_stamp - #{params[:time_stamp].to_i})").first
Note that for this query I am assuming you are using MySQL (because I'm using a MySQL method ABS) and I am also assuming that the time_stamp field is stored as unix timestamp and the user input likewise. If the database storage or input is in a different format, you'll have to use different date arithmetic functions. See http://dev.mysql.com/doc/refman/5.5/en/date-and-time-functions.html for details. Or if you are not using MySQL, see the docs for the database you are using.
Also note that I am using .to_i to sanitize my data for the query. If your data is in a different format, you may need to sanitize it a different way.
To make this more efficient, limit it to timestamps within the maximum possible range. If sensors take data every 10 minutes or more frequently (never more than 10 minutes between readings), then a range of 10 minutes on each side will do. Something like below. Here, 600 = 10 (minutes) * 60 (seconds):
sensor.histories.where("time_stamp >= ? AND time_stamp <= ?", params[:time_stamp].to_i - 600, params[:time_stamp].to_i + 600).order("ABS(time_stamp - #{params[:time_stamp].to_i})").first
It is simple to convert this to a model method:
class Sensor < ActiveRecord::Base
def history_at(time_stamp)
self.histories.where("time_stamp >= ? AND time_stamp <= ?", time_stamp.to_i - 600, time_stamp.to_i + 600).order("ABS(time_stamp - #{time_stamp.to_i})").first
end
end
