How to find longest run of a certain value in InfluxDB - influxdb

I have an InfluxDB measurement which includes a field that holds either 0 or 1.
How do I find the longest unbroken run of a given value?
Imagine that the field represents whether the sun is up or not, and I have a year's worth of data. I would like the query which finds the longest unbroken run of 1's, which would represent the longest day of the year and return me something like "23rd June 5am to 23rd June 9pm". (I'm in the northern hemisphere, and totally made those times up, but hopefully you get the idea.)

I don't think this can be done with InfluxQL. In many RDBMS, it's possible to do similar operations in a single SQL query using window functions and grouping.
I've experimented a few ways, but as of v1.3 I believe InfluxQL is just not expressive enough for this task. Limitations include:
No window functions (although some functions exhibit similar behaviour, e.g. DIFFERENCE, DERIVATIVE).
time cannot be manipulated like an ordinary tag or field value. For example, it's not possible to take the FIRST(time) of a group of measurements.
Can only GROUP BY time or tag, not by field value (or derived value from a subquery result). Additionally, when grouped by time, only group interval timestamps are returned by selector functions.
Can only ORDER BY time.
The best way to do this is therefore at the application level.
Edit: for the record, the closest I can get is to use ELAPSED to find the longest gap(s) between subsequent 0 values. This might work for you if your data model is a specific shape and data comes in at regular intervals:
SELECT TOP(elapsed, N) AS elapsed FROM (SELECT ELAPSED(field) FROM measurement WHERE field != 1)
Returns e.g. for N = 1:
time elapsed
---- -------
2000 500
However, there is no guarantee that there is a value of 1 in the gap. Use FIRST to retrieve the first measurement with field == 1 within the gap, or nothing if there are none:
SELECT FIRST(field) FROM measurement WHERE field = 1 AND time < 2000 and time > (2000 - 500)
Returns e.g.:
time first
---- -----
1000 1
Therefore the longest run of 1 values is from 1000 -> 2000.

Related

VLOOKUP with wildcard and find Nth occurance?

I'm setting up a Google Sheet that will calculate the most effective purchase size of specific agricultural inputs (fertilizer, chemical, etc). I set up the price data in its own tab with a separate row for each input name + size.
To keep it easy for the user I'd like to require only the input name, # of gallons per acre, and acres and then have a formula spit out the total cost and most effective purchase (bulk if > X gallons, X # of 250 gallon containers + X 55 drums, etc). How can I use the input name plus a wildcard to find the appropriate purchase size?
https://docs.google.com/spreadsheets/d/1bMOPuk2qhmVuJT7vE_ni3KFxfcgKvwTwkM4p50xQF_0/edit?usp=sharing
I tried:
=ArrayFormula(iferror(INDEX('Data (Current)'!H2:H,SMALL(IF($A2&"*"='Data (Current)'!A2:A,ROW('Data (Current)'!A2:A)-1),1))))
...but it returns blank so I'm guessing the reference $A2&"*" to the input name isn't working properly. When I replace it with a string found in the 'Data (Current)' tab then it works fine.
=ArrayFormula(iferror(INDEX('Data (Current)'!H2:H,SMALL(IF($A2&"*"='Data (Current)'!A2:A,ROW('Data (Current)'!A2:A)-1),1))))
I expected the output to be the smallest value (in this case I think it's 5). Then when I change the last number to 2 or 3 it will find the next smallest value, in this case, 55 or 250. Then I can use simple formulas to interact with that and finish the spreadsheet.
Unfortunately, the actual output is nothing, or "".
Sorry if this isn't what you're looking for, as I had some trouble understanding your question.
Presuming what you want is essentially this:
I want to buy Y quantity of item.
I can buy item at cheaper prices if I buy in higher quantities, although sometimes they have a minimum order quantity.
What is the most optimal combination of the options I have to minimize the price I pay?
I'm unsure if there's a simple solution for this within Google Sheets alone. This might be treading more into Apps Script territory.
However, that's not to say that it's not impossible. I've "brute-forced" the above solution above with an iterative-like approach, for the "Chelated Calcium" product: https://docs.google.com/spreadsheets/d/1YSBiSx0IMr4T0R11Dqb-tqOhH4AOTTAWeH2yQfT4X5w
First, list the data in a standardized manner. This includes giving each same product something easy to look it up by. For example, on the Data (Current) tab, I've added 3 columns:
Product Common Name - This is used so that all items of different quantities can be found easily, without needing wildcards.
Gallons - Much easier to parse the data if it it's explicitly laid out.
Minimum Order Gallons - This is your threshold for Bulk. I've set it at an arbitrary 20,000 gallons for Chelated Calcium.
The data here is ordered least-effective first. How you do this will be up to you. In this case, I sorted by the Retail Cost Per Ounce parameter from your sheet, highest first. This eliminates any guesswork about which of the options are most effective, since you can just traverse your options in order. Note: The way I've laid out the formulas will only work IFF the same products are directly next to each other. It won't work if there are other products between them.
On the Field Level Tool tab, standardize your inputs to the Gallons unit. I do this in Total Gallons Needed column (I multiply anything with a "GAL" with 1, and "QUART" with 0.25).
For each item, determine the row numbers where the product begins and ends. This is marked by columns L (Least Efficient Index) and M (Most Efficient Index). I got these results by using the MATCH function.
Set up the iterations, from 0 to N-1. On this sheet, I've set up N=5 iterations, which means that it can traverse 5 different options of the same product only. Since Chelated Calcium only has 4 different options (5 Gal, 30 Gal, 250 Gal, Bulk), 5 is more than enough for this product. If you have products with more options, you may want to have more iterations.
The iterations are on the right side of the Field Level Tool tab.
In your case, you might want to put it on a different tab since the place I put it makes the file look very messy.
In each iteration, I perform the following steps:
To Fulfill - How many gallons still need to be purchased by this iteration?
ThisIndex - What is the row number of this iteration? This is determined by Most Efficient Index - Iteration Number. Remember that since we sorted in order of ascending efficiency, this means that the iteration starts with the most efficient option it can find first. There is a check to make sure that it only outputs a value if it is between the range [Least Efficient Index, Most Efficient Index]. Otherwise, it will be blank to avoid miscalculations by intruding into another product in the Data (Current) tab.
Retail Price, Minimum Gals, Gallons per Order - Simple data extraction for easy usage in the iteration, using INDEX (and indirectly, MATCH by virtue of ThisIndex).
Order - This formula does a couple of things, outlined below:
It checks whether there still remains a valid choice of product at this iteration. It does this by checking whether ThisIndex still exists. If the product doesn't exist, then it will be nulled. This is accomplished by using the IF function.
It will determine if there is a minimum threshold that must be met to purchase this choice. You can see in the 0th iteration, for example, that there is a minimum quantity of 20,000 gallons. If To Fulfill quantity is greater than or equal to the threshold OR there is no threshold, then a purchase is quantified by this column. The mathematics are simply to divide the To Fulfill amount by the Gallons per Order amount to determine the number of orders of this particular product choice. If there is a threshold but the To Fulfill amount doesn't meet it, then this iteration is skipped with a 0 order value.
If the item is already on its least efficient choice (ThisIndex == Least Efficient Index), it will do a CEILING function to ensure that the order is fulfilled. If not, it will do a FLOOR function instead. This is because you cannot order 3.5 units of an item, so they have to be rounded either up or down.
Expenditure - This is simply Order multiplied by the Retail Price, or how much money you spend in this iteration.
Remaining - How much of the product is left unfulfilled at the end of this iteration, to be used as To Fulfill for the next iteration.
Note: If you see formulas that are of the form =IF(ThisIndex, [calculations_here],), that is simply a check to nullify that calculation if ThisIndex is invalid.
Copy the iterations as many times as you want to the right. Something nice to do is to force the iterations to do a CEILING on the very last one to ensure that you never under-buy.
Generate a user-readable string for the purchase suggestion. You can see this on the Suggested Purchase column.
Calculate the Gallons Bought with a simple SUMPRODUCT over all the iterations.
Calculate the total expenditure with a simple SUM over all the iterations.
I hope this is what you were looking for. Regardless, it's at least a fun exercise on how much you can abuse Sheets. ;)

InfluxDB: INTO with multiple measurement

I'm trying to write aggregated result from two measurements into a single measurement.
I found on documentation that you can write multiple matching measurements with :MEASUREMENT keyword in INTO query. Like
SELECT * INTO "copy_NOAA_water_database"."autogen".:MEASUREMENT FROM
"NOAA_water_database"."autogen"./.*/
What I'm trying to do is aggregate from multiple measurements and write result to a single measurement.
SELECT mean("water_level") INTO
"copy_NOAA_water_database"."autogen"."water_agg" FROM
"NOAA_water_database"."autogen"./.*/ GROUP BY time(15m), *
The above query runs successfully, but I'm not sure whether influx has considered points from all measurement of NOAA_water_database or just last appearing measurement is considered.
Q: I'm not sure whether influx has considered points from all measurement of NOAA_water_database or just last appearing measurement is considered.
A: I suspect influxdb is not aggregating the data from your measurements.
I think it is only aggregating the data from each measurement individually and then for each output write it to your specified measurement and since the resolved time of the mean operation can possibly be the same, measurement B's result can overwrite measurement A's result.
I derived this theory by doing an experiment using the following dataset;
INSERT cpu,host=serverA value=10
INSERT cpu,host=serverA value=20
INSERT cpu2,host=serverA value=5
INSERT cpu2,host=serverA value=15
Doing a SELECT statement similar to your query above returns;
select * FROM "historian"."autogen"./cpu.*/
name: cpu
time host value
---- ---- -----
1546511130857357196 serverA 10
1546511132744883738 serverA 20
name: cpu2
time host value
---- ---- -----
1546511156629403118 serverA 5
1546511157888695746 serverA 15
Then instead of using mean I do sum to find test the behaviour of influxdb.
I also simplified the query by dropping the groupBy operation.
Doing a sum gives me;
SELECT sum("value") INTO test_sum FROM "historian"."autogen"./.*/
name: result
time written
---- -------
0 2
> select * from test_sum;
name: test_sum
time sum
---- ---
0 20
Theory: if influx is aggregating the data from all measurements, the sum result would not be 20. It should be 50. The only way 20 can be derived is from by summing 5 + 15 which is the data from the last measurement.
But when we do the sum operation, influx did told us 2 rows were written. My theory to this is that, the influx did calculate the sum of the first measurement however as first and second summation's result time is both 0 therefore the 2nd measurement's result would have overwritten the first result's.
Recommended solution:
The best tool to do this job is actually influxdb's kapacitor. It is a great tool because it is fast however it is also extremely to learn.
Alternatively if your dataset isn't huge which I suspect it should be alright since you are grouping by 15m. You can write a script in your favourite programming language to read out the data, do the mean and then write the data back to influxdb.

Performance issues when retrieving last value

I have a measurement that keeps track of sensor readings for a bunch of machines.
There are something of the order of 50 different readings per machine, and there are up to 1000 machines. We have one reading every 30 seconds.
The way I store the reading is in a single measurement which has 2 tags, machine_id and analysis_id and a single value.
One of the use cases I have is to retrieve the current value for each reading for a list of machines.
When this database gets to 100 million records or something like that, which with those numbers means less than 1 day, I can no longer retrieve the last values with a query as it takes too long.
I tried the two following alternatives:
SELECT *
FROM analysisvalue
WHERE entity_id = '1' or entity_id = '2'
GROUP BY analysis_id, entity_id
ORDER BY time DESC
LIMIT 1
and:
SELECT last(*) AS value,
FROM analysisvalue
WHERE entity_id = '1' or entity_id = '2'
GROUP BY analysis_id, entity_id
both of then take a pretty long time to complete. At 100 million it's something of the order of 1 second.
The use case of retrieving the latest values is a very frequent one. I need to be able to get the "current" state of machines almost instantly.
I can work that out on the side of the app logic, by keeping track of the latest value in a separate place, but I was wondering what I could do with InfluxDB alone.
I was facing something similar and I worked around it by creating a continuous query.
https://docs.influxdata.com/influxdb/v0.8/api/continuous_queries/

InfluxDB and Grafana graph using midnight as 0 on Y-axis derivative

I am graphing with Grafana (2.6.0) and I have an InfluxDB (0.10.2) database with the following data in it:
> select * from "WattmeterMainskwh" where time > now() - 5m
name: WattmeterMainskwh
-----------------------
time value
1457579891000000000 15529.322
1457579956000000000 15529.411
1457580011000000000 15529.425
1457580072000000000 15529.460
1457580135000000000 15529.476
...etc...
This data collects my household kilowatt usage as measured by a kWH gauge that steadily increments the usage value across months or years. I cannot easily reset the counter, nor do I wish to do so.
My goal is to create a graph that shows my daily kWH use over 24 hour periods starting at midnight, or at a minimum showing relative kWH over the interval displayed. This type of graph would be useful in many other circumstances as well where I could imagine "errors across the day" or "visitors since opening time" or "BGP resets per calendar week" were useful but the collection counter was not reset to zero upon the reset or turn-over of the time interval. This kind of counting is actually quite common in my experience.
This graph works, but doesn't show me what I'm looking for:
SELECT derivative(mean("value")) FROM "WattmeterMainskwh" WHERE $timeFilter GROUP BY time($interval) fill(null)
That graph just shows the difference between one sample and the previous sample. What I want is a steadily increasing line starting from the left side of the graph and increasing towards the right side of the graph, with zero as the bottom of the Y axis, and the graph starting at zero at the farthest left X value.
This graph works too and shows me the correct curve, but it's off by fifteen thousand or so. So far, it's the closest to what I want but since this is an ever-increasing counter that can't be reset I need to subtract some from the Y axis. Ideally, I'd like to subtract whatever the value was at the previous midnight from each sample to get a relative number based on a day instead of an absolute based on all time.
SELECT sum("value") FROM "WattmeterMainskwh" WHERE $timeFilter GROUP BY time($interval) fill(null)
And here's the graph from that previous statement:
Graph that is off by 15k
This attempt didn't work - I apparently can't take a sum of a derivative group:
SELECT sum(derivative(mean("value"))) FROM "WattmeterMainskwh" WHERE $timeFilter GROUP BY time($interval) fill(null)
This doesn't work, either - I can't perform functions within "derivative":
SELECT derivative(sum("value")-first("value")) FROM "WattmeterMainskwh" WHERE $timeFilter GROUP BY time($interval) fill(null)
Of course, I could just create a new value that had calculations applied to it before I wrote it into InfluxDB, but that seems to me to be a data-redundant and sloppy way to solve this problem, as well as being quite inflexible if I want to look at other intervals on a whim. I'm hoping that there is some way to do this more elegantly within the combination of InfluxDB & Grafana, but I'm just not able to find it with the search terms I've used or the thinking I've put towards interpreting the documentation.
Is this type of graph even possible with InfluxDB/Grafana? As far as I can tell a continuous query is not a solution, and the lack of nested SELECTs makes even the hackish ways of doing this not obvious to me.
BONUS: It would be really great to have the graph show midnight every night as a "zero" location, instead of "zero" being the first point in the displayed interval, so looking at five days of normal data would show five distinct "waves" of increasing daily aggregate energy usage, with the wave Y value going back down to zero at 12:00:01 on each day. But I'll take whatever I can get.
Nested functions have only partial support. However, you can effectively nest functions by chaining Continuous Queries.
Use a CQ to calculate the derivative(mean(value)) and store that in a new measurement foo. Then for your graph you can query select sum(value) from foo.
(I know this answer is quite late, but it might help others. Oh, and please excuse me for all the Dutch in my graphs; I had to keep it in dutch for the highest possible WAF)
You could do what I do for my kWh calculations:
Which results in a simple query like this:
SELECT distinct("kwh_combined") FROM "smartmeter" WHERE $timeFilter GROUP BY time($__interval) fill(linear)
In order to get your total count.. or if you want it in a nice graph like this which shows the number of kWh's used per hour in the bars and the yellow line (I normally run in dark mode, excuse the yellow) which is my current WATT power draw:
This data (or at least your hourly usage in bars) can be retrieved by a query like this:
Which is this exact query (for B):
SELECT spread("kwh_combined") FROM "smartmeter" WHERE $timeFilter GROUP BY time(1h) fill(null)
... where the 'kwh_combined' is (still) my counter just counting up and up.
All this results in me being able to 'query' the InfluxDB for a certain time period, like "last 24 hours" to come up with a nice panel like this: (ignore the encircled prices, that was for a question I posted I just made 10 minutes ago, check my PS)
I hope this helps you or anyone else; it took me some figuring out, but I'm happy to give something back to the community :)
PS: Don't be as stupid as I was and hardcode your electrical and gas prices into your dashboard but store them with your measurements as they could change over time.
I had the same problem (same application even) and solved it here. In your case, the query should be roughly:
SELECT value-value_fill FROM
(SELECT first(value) as value_fill FROM WattmeterMainskwh WHERE time>now()-7d GROUP BY time(1d)),
(SELECT first(value) as value FROM WattmeterMainskwh WHERE time>now()-7d GROUP BY time(1h))
fill(previous)

find record closest to a given time in ruby on rails

Background
I have a ror application which is continuously recording and showing on a web site real time sensor data. Then I have a table called sensors which has unique list of all sensors and stores latest values.
I also have another table histories which dumps all the sensor values ever received for each sensor.
So the relation is "sensor has many histories" , the time_stamp col records the creation time stamp.
Not all sensors update at same interval or frequency.
Problem
Now I want to take a input time stamp from user, a date and time in past, and show what the sensors were showing at that time. For example say i want to see what all sensors looked like at 2 PM yesterday, once i have this time stamp from user, how do i retrieve one sensors value closest to input time stamp from the history table.
I am looking to add a method in Sensor model, which will take time_stamp as argument, and retrive the value closest to input time_stamp from the history table.
What is they simplest way to write this Active record query?
Thanks
Shaunak
Just sort the histories according to the difference between the passed timestamp and the history timestamp (absolute value so it can go in either direction), and return the top result (that's the one with the smallest difference).
sensor.histories.order("ABS(time_stamp - #{params[:time_stamp].to_i})").first
Note that for this query I am assuming you are using MySQL (because I'm using a MySQL method ABS) and I am also assuming that the time_stamp field is stored as unix timestamp and the user input likewise. If the database storage or input is in a different format, you'll have to use different date arithmetic functions. See http://dev.mysql.com/doc/refman/5.5/en/date-and-time-functions.html for details. Or if you are not using MySQL, see the docs for the database you are using.
Also note that I am using .to_i to sanitize my data for the query. If your data is in a different format, you may need to sanitize it a different way.
To make this more efficient, limit it to time_spans within the maximum possible range. If sensors take data every 10 minutes or more frequently (never less than 10 minutes apart between readings), then a range of greater than 10 minutes on each side will do. Something like below. Here, 600 = 10 (minutes) * 60 (seconds):
sensor.histories.where("time_stamp >= ? AND time_stamp <= ?", params[:time_stamp].to_i - 600, params[:time_stamp].to_i + 600).order("ABS(time_stamp - #{params[:time_stamp].to_i})").first
It is simple to convert this to a model method:
class Sensor < ActiveRecord::Base
def history_at(time_stamp)
self.histories.where("time_stamp >= ? AND time_stamp <= ?", time_stamp.to_i - 600, time_stamp.to_i + 600).order("ABS(time_stamp - #{time_stamp.to_i})").first
end
end

Resources