I have a measurement that keeps track of sensor readings for a bunch of machines.
There are on the order of 50 different readings per machine, there are up to 1,000 machines, and we have one reading every 30 seconds.
I store the readings in a single measurement that has two tags, machine_id and analysis_id, and a single value.
One of the use cases I have is to retrieve the current value for each reading for a list of machines.
Once this database reaches around 100 million records (which, at those rates, means less than a day of data), I can no longer retrieve the last values with a query because it takes too long.
I tried the two following alternatives:
SELECT *
FROM analysisvalue
WHERE entity_id = '1' or entity_id = '2'
GROUP BY analysis_id, entity_id
ORDER BY time DESC
LIMIT 1
and:
SELECT last(*) AS value
FROM analysisvalue
WHERE entity_id = '1' or entity_id = '2'
GROUP BY analysis_id, entity_id
Both of them take a pretty long time to complete. At 100 million records it's on the order of 1 second.
The use case of retrieving the latest values is a very frequent one. I need to be able to get the "current" state of machines almost instantly.
I could work around this in the application logic by keeping track of the latest value in a separate place, but I was wondering what I could do with InfluxDB alone.
I was facing something similar and I worked around it by creating a continuous query.
https://docs.influxdata.com/influxdb/v0.8/api/continuous_queries/
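For illustration, here is a rough sketch of what such a continuous query could look like for the schema in the question. It uses the newer 1.x CREATE CONTINUOUS QUERY syntax rather than the v0.8 syntax in the link above, and the database and target measurement names (mydb, latest_analysisvalue) are made up; the idea is to keep a tiny measurement holding only the most recent value per series, so the frequent "current state" read never scans the full history:

import requests

INFLUX_URL = "http://localhost:8086/query"  # assumption: a local InfluxDB 1.x instance

# One-time setup: roll the newest point per (entity_id, analysis_id) pair into a
# small measurement once a minute. Statements that modify data must be POSTed.
cq = (
    'CREATE CONTINUOUS QUERY "cq_latest" ON "mydb" BEGIN '
    'SELECT last("value") AS "value" INTO "latest_analysisvalue" '
    'FROM "analysisvalue" GROUP BY time(1m), "entity_id", "analysis_id" END'
)
requests.post(INFLUX_URL, params={"db": "mydb", "q": cq})

# The frequent read now hits the much smaller measurement (values trail by up to a minute):
read = (
    'SELECT last("value") FROM "latest_analysisvalue" '
    "WHERE \"entity_id\" = '1' OR \"entity_id\" = '2' "
    'GROUP BY "entity_id", "analysis_id"'
)
print(requests.get(INFLUX_URL, params={"db": "mydb", "q": read}).json())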
API in question: https://api.slack.com/methods/team.accessLogs
The maximum page is 100 and the maximum number of records per page (count) is 1000, so in total 100,000 records could potentially be returned. Since there is no way to limit the starting date for the access log, the results will continue to grow as more unique user/IP/user-agent combinations are used, until the limit is reached, at which point it would no longer be possible to return all records. Is this correct?
Also, the documentation does not seem to specify how the results are ordered.
You have correctly noted that you can typically fetch at most 100,000 records.
But there is a way to limit the starting date: the before argument in the API lets you set the time before which you want the records.
https://api.slack.com/methods/team.accessLogs#arg_before
The records are fetched in reverse chronological order, i.e. latest record first, and by default the value of the before argument is 'now'.
After fetching the first 100,000 records, set the before argument to the "date_last" value from the last record.
(Keep in mind that the before argument is inclusive of the value provided, so the last record will be repeated; to avoid this you can reduce the "date_last" value by 1.)
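To make the paging concrete, here is a rough Python sketch of that loop. The endpoint and the logins / date_last / paging fields come from the method documentation linked above; the token placeholder and the stop conditions are assumptions, and the sketch moves the before cursor back after every page rather than after every 100 pages:

import requests

SLACK_TOKEN = "xoxp-your-admin-token"  # placeholder; needs a user token with admin access

def fetch_all_access_logs():
    logs = []
    before = None  # defaults to "now" on the first request
    while True:
        params = {"token": SLACK_TOKEN, "count": 1000, "page": 1}
        if before is not None:
            params["before"] = before
        resp = requests.get("https://slack.com/api/team.accessLogs", params=params).json()
        if not resp.get("ok") or not resp.get("logins"):
            break
        logs.extend(resp["logins"])
        # Results come back newest first, so the last entry is the oldest on this
        # page; subtract 1 because "before" is inclusive of the value provided.
        before = resp["logins"][-1]["date_last"] - 1
        if len(resp["logins"]) < 1000:
            break  # short page: we have reached the oldest records
    return logs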
I have an InfluxDB measurement which includes a field that holds either 0 or 1.
How do I find the longest unbroken run of a given value?
Imagine that the field represents whether the sun is up or not, and I have a year's worth of data. I would like the query which finds the longest unbroken run of 1's, which would represent the longest day of the year and return me something like "23rd June 5am to 23rd June 9pm". (I'm in the northern hemisphere, and totally made those times up, but hopefully you get the idea.)
I don't think this can be done with InfluxQL. In many RDBMS, it's possible to do similar operations in a single SQL query using window functions and grouping.
I've experimented in a few ways, but as of v1.3 I believe InfluxQL is just not expressive enough for this task. Limitations include:
No window functions (although some functions exhibit similar behaviour, e.g. DIFFERENCE, DERIVATIVE).
time cannot be manipulated like an ordinary tag or field value. For example, it's not possible to take the FIRST(time) of a group of measurements.
Can only GROUP BY time or tag, not by field value (or derived value from a subquery result). Additionally, when grouped by time, only group interval timestamps are returned by selector functions.
Can only ORDER BY time.
The best way to do this is therefore at the application level.
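For example, once the raw points are pulled out through the HTTP API or a client library, a single pass over them in application code does the job. A minimal sketch, assuming the points arrive as (timestamp, value) pairs already sorted by time:

def longest_run(points, target=1):
    # points: iterable of (timestamp, value) pairs sorted by time.
    # Returns (start, end) of the longest unbroken run of `target`, or None.
    best = None
    run_start = None
    run_end = None
    for ts, value in points:
        if value == target:
            if run_start is None:
                run_start = ts
            run_end = ts
        elif run_start is not None:
            if best is None or run_end - run_start > best[1] - best[0]:
                best = (run_start, run_end)
            run_start = None
    if run_start is not None and (best is None or run_end - run_start > best[1] - best[0]):
        best = (run_start, run_end)
    return best

# e.g. longest_run([(0, 0), (60, 1), (120, 1), (180, 0), (240, 1)]) == (60, 120)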
Edit: for the record, the closest I can get is to use ELAPSED to find the longest gap(s) between subsequent 0 values. This might work for you if your data model is a specific shape and data comes in at regular intervals:
SELECT TOP(elapsed, N) AS elapsed FROM (SELECT ELAPSED(field) FROM measurement WHERE field != 1)
Returns e.g. for N = 1:
time elapsed
---- -------
2000 500
However, there is no guarantee that there is a value of 1 in the gap. Use FIRST to retrieve the first measurement with field == 1 within the gap, or nothing if there are none:
SELECT FIRST(field) FROM measurement WHERE field = 1 AND time < 2000 AND time > (2000 - 500)
Returns e.g.:
time first
---- -----
1000 1
Therefore the longest run of 1 values is from 1000 -> 2000.
For my case, I need to capture 15 performance metrics for devices and save them to InfluxDB. Each device has a unique device id.
Metrics are written into InfluxDB in the following way (here I only show one as an example):
new Serie.Builder("perfmetric1")
.columns("time", "value", "id", "type")
.values(getTime(), getPerf1(), getId(), getType())
.build()
Writing data is fast and easy, but I saw bad performance when running a query. I'm trying to get all 15 metric values for the last hour:
select value from perfmetric1, perfmetric2, ..., perfmetric15
where id='testdeviceid' and time > now() - 1h
For an hour, each metric has 120 data points, so in total that's 1,800 data points. The query takes about 5 seconds on a c4.4xlarge EC2 instance when it's idle.
I believe InfluxDB can do better. Is this a problem with my schema design, or is it something else? Would splitting the query into 15 parallel calls be faster?
As @valentin's answer says, you need to build an index for the id column for InfluxDB to perform these queries efficiently.
In 0.8 stable you can do this "indexing" using continuous fanout queries. For example, the following continuous query will expand your perfmetric1 series into multiple series of the form perfmetric1.id:
select * from perfmetric1 into perfmetric1.[id];
Later you would do:
select value from perfmetric1.testdeviceid, perfmetric2.testdeviceid, ..., perfmetric15.testdeviceid where time > now() - 1h
This query will take much less time to complete since InfluxDB won't have to perform a full scan of the timeseries to get the points for each testdeviceid.
Build an index on the id column. It seems that the engine uses a full scan on the table to retrieve the data. By splitting your query into 15 threads, the engine will run 15 full scans and the performance will be much worse.
Background
I have a Ruby on Rails application that continuously records real-time sensor data and shows it on a web site. I have a table called sensors, which holds a unique list of all sensors and stores their latest values.
I also have another table, histories, which stores every sensor value ever received for each sensor.
So the relation is "sensor has many histories"; the time_stamp column records the creation timestamp.
Not all sensors update at the same interval or frequency.
Problem
Now I want to take an input timestamp from the user, a date and time in the past, and show what the sensors were reading at that time. For example, say I want to see what all the sensors looked like at 2 PM yesterday. Once I have this timestamp from the user, how do I retrieve each sensor's value closest to the input timestamp from the histories table?
I am looking to add a method to the Sensor model that takes a time_stamp as an argument and retrieves the value closest to the input time_stamp from the histories table.
What is the simplest way to write this ActiveRecord query?
Thanks
Shaunak
Just sort the histories according to the difference between the passed timestamp and the history timestamp (absolute value so it can go in either direction), and return the top result (that's the one with the smallest difference).
sensor.histories.order("ABS(time_stamp - #{params[:time_stamp].to_i})").first
Note that for this query I am assuming you are using MySQL (because I'm using the MySQL function ABS) and that the time_stamp field is stored as a unix timestamp, with the user input in the same format. If the database storage or input is in a different format, you'll have to use different date arithmetic functions. See http://dev.mysql.com/doc/refman/5.5/en/date-and-time-functions.html for details, or if you are not using MySQL, see the docs for the database you are using.
Also note that I am using .to_i to sanitize my data for the query. If your data is in a different format, you may need to sanitize it a different way.
To make this more efficient, limit it to timestamps within the maximum possible range. If sensors record data every 10 minutes or more frequently (i.e. there are never more than 10 minutes between readings), then a range of 10 minutes on each side will do. Something like below, where 600 = 10 (minutes) * 60 (seconds):
sensor.histories.where("time_stamp >= ? AND time_stamp <= ?", params[:time_stamp].to_i - 600, params[:time_stamp].to_i + 600).order("ABS(time_stamp - #{params[:time_stamp].to_i})").first
It is simple to convert this to a model method:
class Sensor < ActiveRecord::Base
def history_at(time_stamp)
self.histories.where("time_stamp >= ? AND time_stamp <= ?", time_stamp.to_i - 600, time_stamp.to_i + 600).order("ABS(time_stamp - #{time_stamp.to_i})").first
end
end
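With that in place you can call, for example, sensor.history_at(params[:time_stamp]) from a controller, or Sensor.find(some_id).history_at(1357000000) from the console (the arguments here are just made-up examples).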
Let's say I have an AWS SimpleDB domain with around 3 million items, and each item has an attribute "foo" whose value is some arbitrary integer (which is of course actually stored in SimpleDB as a string, but let's ignore the conversion to and from for now). I would like to increment the foo value for each item every 60 seconds until it reaches a maximum value (the max value is not the same for each item; each item's max is stored as another attribute-value pair on the item), then reset foo to zero: read, increment, evaluate, store.
Given the large number of items, and the hard 60 second time limit, is this approach feasible in SimpleDB? Anyone have an approach to make this work?
You can do it, but it is not feasible. You can only get between 100 and 300 PUTs per second for a single domain, while you can read upwards of 1,000 items per second, so writes will be the bottleneck.
To be on the conservative side, let's say 100 store operations per second per domain. Writing 3 million items per minute works out to 50,000 per second, so you'd need 500 domains to open up enough throughput. You only get 100 domains by default, so you'd have to ask for more.
Also it would be expensive. Writes with a small number of attributes are about $3 per million and reads are about $1.30 per million. That's about $13 / minute.
The only thing I can really suggest would be if there was a way to combine the 3 million items into a smaller number of items. If there were a way to put 50 "items" into each real item, you could do it with 10 domains at about $15.50 / hour. But I still wouldn't call that feasible, since you can get a cluster of 10 Extra Large High-CPU EC2 server instances for $6.80 / hour.
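The back-of-envelope arithmetic behind those figures, using the rates quoted above (a rough sketch, not exact AWS pricing):

items = 3_000_000
writes_per_domain_per_min = 100 * 60                        # ~100 conservative PUTs/sec per domain
domains_needed = items / writes_per_domain_per_min          # 500 domains
cost_per_minute = items / 1e6 * (3.00 + 1.30)               # ~$12.90 in writes plus reads

# Packing 50 logical counters into each real item:
packed_items = items // 50                                  # 60,000 items
packed_domains = packed_items / writes_per_domain_per_min   # 10 domains
cost_per_hour = packed_items * 60 / 1e6 * (3.00 + 1.30)     # ~$15.50 / hour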
Why not generate the value at read time from a trusted clock? I'm going to make up some names:
touch_time - Epoch value (seconds since 1970) when the item was initialized to zero.
max_age - Number of minutes after which the value wraps around to zero.
current_time - Epoch value of now.
So at any time, you can get the value you were proposing to store in an attribute by
(current_time - touch_time) % (max_age * 60)
This assumes max_age changes relatively infrequently, and that everyone trusts touch_time and current_time to within a minute; that's what NTP is for.
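A small sketch of that read-time calculation; the function and parameter names are made up, only the formula itself comes from the answer:

import time

def derived_foo(touch_time, max_age_minutes, now=None):
    # touch_time: epoch seconds when the item was last reset to zero.
    # max_age_minutes: the per-item maximum, stored as another attribute.
    if now is None:
        now = int(time.time())
    # Seconds elapsed in the current cycle; integer-divide by 60 if the counter
    # is meant to step once per minute rather than once per second.
    return (now - touch_time) % (max_age_minutes * 60)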