How can I calculate the area under a graph with InfluxDB?

Here is a sample of my data:
> SELECT time, value from task Limit 5;
name: task
time value
---- -----
1540149422155456967 0
1540149423155456967 1
1540151481498019507 1
1540151482498019507 0
1540151680870649288 0
I have a measurement that is a boolean, encoded as 1 or 0. I'd like to calculate the total area under the graph, but I'm not sure how to do this.
You can think of this value as indicating whether a water pump is on or off; the rate of water flow is constant whenever the pump is on. I want to calculate and graph the cumulative total of water that was output.
I've thought of cumulative_sum but that does not really return the result I intend:
> SELECT cumulative_sum("value") from task Limit 5;
name: task
time cumulative_sum
---- --------------
1540149422155456967 0
1540149423155456967 1
1540151481498019507 2
1540151482498019507 2
1540151680870649288 2
I've also considered the integral function, but the values it returns seem very odd:
> SELECT integral("value") from task GROUP BY time(1s) Limit 5;
name: task
time integral
---- --------
1540149422000000000 0.35662646729441955
1540149423000000000 0.9879165657055804
1540151481000000000 2057.874007792324
1540151482000000000 0.12401171467626151
1540151680000000000 0.008365803347453472
If there is a more appropriate way of encoding this data then I can also do that.
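To make the intended result concrete: if the pump state is treated as a step function (each sample holds until the next one arrives), the area under the graph is simply the total time spent in the on state, and the volume is that time multiplied by the flow rate. The sketch below does that arithmetic client-side in Python on the five sample points above. The function name `cumulative_on_seconds` and the step-hold assumption are illustrative choices, not something InfluxDB provides, and this is not an InfluxQL query.

```python
# Sample points from the question: (timestamp in nanoseconds, value 0 or 1).
POINTS = [
    (1540149422155456967, 0),
    (1540149423155456967, 1),
    (1540151481498019507, 1),
    (1540151482498019507, 0),
    (1540151680870649288, 0),
]


def cumulative_on_seconds(points):
    """Return (timestamp, cumulative seconds spent 'on') for each point.

    Assumption: the signal is a step function, so each value holds
    until the next sample arrives. Points must be sorted by time.
    """
    total_ns = 0
    result = [(points[0][0], 0.0)]
    for (t_prev, v_prev), (t_next, _) in zip(points, points[1:]):
        # Integer nanosecond arithmetic avoids float rounding on large timestamps.
        total_ns += v_prev * (t_next - t_prev)
        result.append((t_next, total_ns / 1e9))
    return result


if __name__ == "__main__":
    for ts, seconds_on in cumulative_on_seconds(POINTS):
        print(ts, seconds_on)
```

For these points the running total ends at roughly 2059.34 seconds. The first four integral() buckets shown above also add up to roughly 2059.34, which suggests the GROUP BY time(1s) clause was slicing the same area into per-second pieces rather than returning wrong values.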

Related

InfluxDB: How to create a continuous query to calculate delta values?

I'd like to calculate the delta values for a series of measurements stored in InfluxDB. The values are readings from an electricity meter taken every 5 minutes, and they increase over time. Here is a subset of the data to give you an idea (the commands shown below are executed in the InfluxDB CLI):
> SELECT "Haushaltstromzaehler - cnt" FROM "myhome_measurements" WHERE time >= '2018-02-02T10:00:00Z' AND time < '2018-02-02T11:00:00Z'
name: myhome_measurements
time Haushaltstromzaehler - cnt
---- --------------------------
2018-02-02T10:00:12.610811904Z 11725.638
2018-02-02T10:05:11.242021888Z 11725.673
2018-02-02T10:10:10.689827072Z 11725.707
2018-02-02T10:15:12.143326976Z 11725.736
2018-02-02T10:20:10.753357056Z 11725.768
2018-02-02T10:25:11.18448512Z 11725.803
2018-02-02T10:30:12.922032896Z 11725.837
2018-02-02T10:35:10.618788096Z 11725.867
2018-02-02T10:40:11.820355072Z 11725.9
2018-02-02T10:45:11.634203904Z 11725.928
2018-02-02T10:50:11.10436096Z 11725.95
2018-02-02T10:55:10.753853952Z 11725.973
Calculating the differences in the InfluxDB CLI is pretty straightforward with the difference() function. This gives me the electricity consumed within each 5-minute interval:
> SELECT difference("Haushaltstromzaehler - cnt") FROM "myhome_measurements" WHERE time >= '2018-02-02T10:00:00Z' AND time < '2018-02-02T11:00:00Z'
name: myhome_measurements
time difference
---- ----------
2018-02-02T10:05:11.242021888Z 0.03499999999985448
2018-02-02T10:10:10.689827072Z 0.033999999999650754
2018-02-02T10:15:12.143326976Z 0.02900000000045111
2018-02-02T10:20:10.753357056Z 0.0319999999992433
2018-02-02T10:25:11.18448512Z 0.03499999999985448
2018-02-02T10:30:12.922032896Z 0.033999999999650754
2018-02-02T10:35:10.618788096Z 0.030000000000654836
2018-02-02T10:40:11.820355072Z 0.03299999999944703
2018-02-02T10:45:11.634203904Z 0.028000000000247383
2018-02-02T10:50:11.10436096Z 0.02200000000084401
2018-02-02T10:55:10.753853952Z 0.02299999999922875
Where I struggle is getting this to work in a continuous query. Here is the command I used to set up the continuous query:
CREATE CONTINUOUS QUERY cq_Haushaltstromzaehler_cnt ON myhomedb
BEGIN
SELECT difference(sum("Haushaltstromzaehler - cnt")) AS "delta" INTO "Haushaltstromzaehler_delta" FROM "myhome_measurements" GROUP BY time(1h)
END
Looking in the InfluxDB log file I see that no data is written in the new 'delta' measurement from the continuous query execution:
...finished continuous query cq_Haushaltstromzaehler_cnt, 0 points(s) written...
After much troubleshooting and experimenting, I now understand why no data is generated. Setting up a continuous query requires a GROUP BY time() clause. This in turn requires an aggregate function inside the difference() function. The problem is that the aggregate function returns only one value for the time period specified by GROUP BY time(). Obviously, difference() cannot calculate a difference from just one value. Essentially, the continuous query executes a command like this:
> SELECT difference(sum("Haushaltstromzaehler - cnt")) FROM "myhome_measurements" WHERE time >= '2018-02-02T10:00:00Z' AND time < '2018-02-02T11:00:00Z' GROUP BY time(1h)
>
I'm now somewhat clueless as to how to make this work and appreciate any advice you might have.
Does it help to use the last() function? I have not tested this as a CQ yet.
Select difference(last(T1_Consumed)) AS T1_Delta, difference(last(T2_Consumed)) AS T2_Delta
from P1Data
where time >= 1551648871000000000 group by time(1h)
DIFFERENCE() calculates the delta from the aggregated value of the previous group, not within the current group.
So feel free to use a selector function there; since your counters seem to be cumulative, LAST() should work well.
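As a rough illustration of what difference(last(...)) with GROUP BY time(1h) computes, here is a small Python sketch that buckets cumulative readings by hour, keeps the last reading per bucket, and subtracts consecutive buckets. The readings and the `hourly_deltas` helper are hypothetical and only mimic the semantics described above; this is not how InfluxDB implements it.

```python
from datetime import datetime, timezone

# Hypothetical cumulative meter readings spanning three hourly buckets.
READINGS = [
    ("2018-02-02T10:05:00Z", 11725.673),
    ("2018-02-02T10:55:00Z", 11725.973),
    ("2018-02-02T11:30:00Z", 11726.200),
    ("2018-02-02T11:55:00Z", 11726.350),
    ("2018-02-02T12:55:00Z", 11726.750),
]


def hourly_deltas(readings):
    """Mimic difference(last(x)) ... GROUP BY time(1h).

    Assumes readings are sorted by time, so the last value written
    into a bucket is the latest reading in that hour.
    """
    last_per_hour = {}
    for stamp, value in readings:
        ts = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
        bucket = ts.replace(minute=0, second=0)
        last_per_hour[bucket] = value  # later readings overwrite earlier ones
    buckets = sorted(last_per_hour)
    # Each delta compares a bucket with the one before it.
    return [
        (cur, last_per_hour[cur] - last_per_hour[prev])
        for prev, cur in zip(buckets, buckets[1:])
    ]


if __name__ == "__main__":
    for bucket, delta in hourly_deltas(READINGS):
        print(bucket.isoformat(), round(delta, 3))
```

Note that a query window containing only one bucket yields an empty list here, which matches the "0 points written" symptom in the question: there is no previous group to subtract from.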

How to use a continuous query when down sampling in InfluxDB/InfluxQL

I have a table/series like this:
Message MessageValue
--------------- ---------------------
property1 10
property2 9
property3 7
property2 22
I want to downsample property2's mean value every 10 minutes. How would I do something like this?
CREATE CONTINUOUS QUERY "cq_10m" ON "DatabaseName" BEGIN SELECT mean(SELECT MessageValue WHERE Message =property2 ) AS "mean_Property2" INTO "RetentionPolicyName"."downsampled_orders" FROM "TableName" GROUP BY time(10m) END
It would look something like the query below. Your CQ will query the database every 10 minutes and calculate the mean of "MessageValue" across that time frame. The result is downsampled into the measurement mean_Property2.
CREATE CONTINUOUS QUERY "cq_10m" ON "dbName"
RESAMPLE EVERY 10m FOR 2h
BEGIN SELECT mean("MessageValue") AS mean_Property2 INTO
mean_Property2 FROM "retentionPolicyName"."measurementName" WHERE "Message"='property2'
GROUP BY time(10m) END
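To show what the WHERE filter plus GROUP BY time(10m) amounts to, here is a minimal Python sketch that filters points by message and averages them per 10-minute window. The sample points, timestamps, and the `downsample_mean` helper are made up for illustration; the real work would be done by the continuous query above.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical points: (seconds since epoch, Message, MessageValue).
POINTS = [
    (0, "property1", 10),
    (60, "property2", 9),
    (120, "property3", 7),
    (540, "property2", 22),
    (660, "property2", 5),
]

WINDOW_SECONDS = 600  # 10 minutes


def downsample_mean(points, message):
    """Return {window start: mean value} for points matching `message`."""
    buckets = defaultdict(list)
    for ts, msg, value in points:
        if msg == message:
            # Align each timestamp to the start of its 10-minute window.
            buckets[ts - ts % WINDOW_SECONDS].append(value)
    return {start: mean(values) for start, values in sorted(buckets.items())}


if __name__ == "__main__":
    print(downsample_mean(POINTS, "property2"))
```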

Influxdb querying values from 2 measurements and using SUM() for the total value

select SUM(value)
from /measurment1|measurment2/
where time > now() - 60m and host = 'hostname' limit 2;
name: measurment1
time sum
---- ---
1505749307008583382 4680247
name: measurment2
time sum
---- ---
1505749307008583382 3004489
But is it possible to get the value of SUM(measurment1+measurment2), so that I see only one output?
This is not possible in the Influx query language. It does not support functions across measurements.
If this is something you require, you may be interested in layering another API on top of InfluxDB that does this, like Graphite via InfluxGraph.
For the above, something like this.
/etc/graphite-api.yaml:
finders:
  - influxgraph.InfluxDBFinder
influxdb:
  db: <your database>
  templates:
    # Produces metric paths like 'measurement1.hostname.value'
    - measurement.host.field*
Start the graphite-api/influxgraph webapp.
A query /render?from=-60min&target=sum(*.hostname.value) then produces the sum of value on tag host='hostname' for all measurements.
{measurement1,measurement2}.hostname.value can be used instead to limit it to specific measurements.
NB: for InfluxDB performance, it is best to have multiple fields in the same measurement rather than the same field name in multiple measurements.
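If an extra API layer is more than the use case needs, another option is to keep the regex query from the question and add the per-measurement sums in client code. The sketch below only shows that final addition in Python, using the two sums from the output above; the `combined_total` helper is hypothetical, and fetching the results from InfluxDB is left out.

```python
# Per-measurement sums as returned by the query in the question.
SUMS = {
    "measurment1": 4680247,
    "measurment2": 3004489,
}


def combined_total(per_measurement_sums):
    """Add up the SUM(value) results of several measurements client-side."""
    return sum(per_measurement_sums.values())


if __name__ == "__main__":
    print(combined_total(SUMS))
```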

Using InfluxQL to count points (rows) with same value within an interval?

I'm trying to leverage my moderate SQL-knowledge for InfluxQL, but I'm missing something(s) about the nature of timeseries db.
Use case
I write a measurement from our issue tracker when an issue is updated:
issue_updated,project=facebook,ticket=fb1,assignee=coolman status="todo"
Problem
Given this returns rows of issues statuses:
SELECT status
FROM "issue_updated"
If this were SQL, I would use COUNT (and then add WHERE time > NOW() - 1Y GROUP BY time(5m)). However, the following gives me the error "Mixing aggregate and non-aggregate queries is not supported":
SELECT status, count(status) as 'Count'
FROM "issue_updated"
Can someone give some guidance here? ta
It sounds like what you're looking for is the ability to group by a field value, which isn't currently supported.
From what I can tell, if you modify your schema a bit, it should be possible to do what you're looking for. Instead of
issue_updated,project=facebook,ticket=fb1,assignee=coolman status="todo"
Do
issue_updated,project=facebook,ticket=fb1,assignee=coolman,status=todo value=1
then
SELECT count(value) FROM "issue_updated" WHERE time > now() - 52w GROUP BY status
name: issue_updated
tags: status=other
time count
---- -----
1449523659065350722 1
name: issue_updated
tags: status=todo
time count
---- -----
1449523659065350722 2
should work.
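The effect of GROUP BY status with count(value) can be pictured as a simple tally per tag value. Here is a minimal Python sketch of that idea; the three sample points and the `count_by_status` helper are hypothetical and chosen to reproduce the counts in the output above.

```python
from collections import Counter

# Hypothetical points following the revised schema: status is a tag, value is a field.
POINTS = [
    {"ticket": "fb1", "status": "todo", "value": 1},
    {"ticket": "fb2", "status": "todo", "value": 1},
    {"ticket": "fb3", "status": "other", "value": 1},
]


def count_by_status(points):
    """Count points per status tag, like count(value) ... GROUP BY status."""
    return Counter(point["status"] for point in points)


if __name__ == "__main__":
    for status, count in sorted(count_by_status(POINTS).items()):
        print(status, count)
```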

What's difference between item-based and content-based collaborative filtering?

I am puzzled about what the item-based recommendation is, as described in the book "Mahout in Action". There is the algorithm in the book:
for every item i that u has no preference for yet
  for every item j that u has a preference for
    compute a similarity s between i and j
    add u's preference for j, weighted by s, to a running average
return the top items, ranked by weighted average
How can I calculate the similarity between items? If using the content, isn't it a content-based recommendation?
Item-Based Collaborative Filtering
The original item-based recommendation is based entirely on user-item ratings (e.g., a user rated a movie with 3 stars, or a user "likes" a video). When you compute the similarity between items, you are not supposed to know anything other than all users' rating history. So the similarity between items is computed from the ratings, not from the metadata of the item content.
Let me give you an example. Suppose you have only access to some rating data like below:
user 1 likes: movie, cooking
user 2 likes: movie, biking, hiking
user 3 likes: biking, cooking
user 4 likes: hiking
Suppose now you want to make recommendations for user 4.
First you create an inverted index for the items, which gives:
movie: user 1, user 2
cooking: user 1, user 3
biking: user 2, user 3
hiking: user 2, user 4
Since this is a binary rating (like or not), we can use a similarity measure like Jaccard Similarity to compute item similarity.
similarity(movie, cooking) = |{user 1}| / |{user 1, user 2, user 3}| = 1/3
In the numerator, user 1 is the only user that movie and cooking have in common. In the denominator, the union of movie and cooking has 3 distinct users (user 1, 2 and 3). |.| here denotes the size of the set. So the similarity between movie and cooking is 1/3 in our case. You do the same thing for all possible item pairs (i, j).
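The Jaccard computation can be sketched in a few lines of Python using the inverted index from the example. The `ITEM_USERS` dictionary and the `jaccard` helper are names chosen here for illustration; users are represented by their numbers.

```python
# Inverted index from the example: item -> set of users who like it.
ITEM_USERS = {
    "movie": {1, 2},
    "cooking": {1, 3},
    "biking": {2, 3},
    "hiking": {2, 4},
}


def jaccard(item_a, item_b, index=ITEM_USERS):
    """Jaccard similarity: |intersection| / |union| of the two user sets."""
    users_a, users_b = index[item_a], index[item_b]
    union = users_a | users_b
    if not union:
        return 0.0  # neither item has any users
    return len(users_a & users_b) / len(union)


if __name__ == "__main__":
    print(jaccard("movie", "cooking"))
```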
Once the similarity has been computed for all pairs, suppose you need to make a recommendation for user 4.
Look at the similarity scores similarity(hiking, x), where x is any other item you might have.
If you need to make a recommendation for user 3, you can aggregate the similarity scores from each item in their list. For example:
score(movie) = Similarity(biking, movie) + Similarity(cooking, movie)
score(hiking) = Similarity(biking, hiking) + Similarity(cooking, hiking)
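The aggregation step can be sketched as follows: for each item the user has not liked yet, sum its Jaccard similarity to every item the user does like, then rank by that score. The `recommend` function and the data layout are illustrative assumptions built on the example data, not the Mahout implementation.

```python
# Inverted index from the example: item -> set of users who like it.
ITEM_USERS = {
    "movie": {1, 2},
    "cooking": {1, 3},
    "biking": {2, 3},
    "hiking": {2, 4},
}


def jaccard(users_a, users_b):
    """Jaccard similarity between two sets of users."""
    union = users_a | users_b
    return len(users_a & users_b) / len(union) if union else 0.0


def recommend(user, index=ITEM_USERS):
    """Rank items the user has not liked by summed similarity to liked items."""
    liked = {item for item, users in index.items() if user in users}
    scores = {
        candidate: sum(jaccard(index[candidate], index[item]) for item in liked)
        for candidate in index
        if candidate not in liked
    }
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for item, score in recommend(3):
        print(item, round(score, 3))
```

With the example data, user 3 (who likes biking and cooking) gets movie ranked ahead of hiking.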
Content-Based Recommendation
The point of the content-based approach is that we have to know the content of both the user and the item. Usually you construct a user profile and an item profile over a shared attribute space. For example, you can represent a movie by the movie stars in it and its genres (using a binary encoding, for example). For the user profile, you can do the same thing based on which movie stars, genres, etc. the user likes. Then the similarity between user and item can be computed using, e.g., cosine similarity.
Here is a concrete example:
Suppose this is our user profile (using binary encoding, where 0 means not-like and 1 means like), which contains each user's preferences over 5 movie stars and 5 movie genres:
Movie stars 0 - 4 Movie Genres
user 1: 0 0 0 1 1 1 1 1 0 0
user 2: 1 1 0 0 0 0 0 0 1 1
user 3: 0 0 0 1 1 1 1 1 1 0
Suppose this is our movie-profile:
Movie stars 0 - 4 Movie Genres
movie1: 0 0 0 0 1 1 1 0 0 0
movie2: 1 1 1 0 0 0 0 1 0 1
movie3: 0 0 1 0 1 1 0 1 0 1
To calculate how good a movie is to a user, we use cosine similarity:
similarity(user 1, movie1) = dot-product(user1, movie1) / (||user1|| x ||movie1||)
= (0x0+0x0+0x0+1x0+1x1+1x1+1x1+1x0+0x0+0x0) / (sqrt(5) x sqrt(3))
= 3 / (sqrt(5) x sqrt(3)) = 0.77460
Similarly:
similarity(user 2, movie2) = 3 / (sqrt(4) x sqrt(5)) = 0.67082
similarity(user 3, movie3) = 3 / (sqrt(6) x sqrt(5)) = 0.54772
If you want to give one recommendation for user i, just pick movie j that has the highest similarity(i, j).
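The cosine similarity arithmetic above can be checked with a short Python sketch using the same binary profiles. The `cosine` and `best_movie` helpers and the dictionary names are chosen here for illustration only.

```python
import math

# Binary profiles from the example: 5 movie stars followed by 5 genres.
USERS = {
    "user1": [0, 0, 0, 1, 1, 1, 1, 1, 0, 0],
    "user2": [1, 1, 0, 0, 0, 0, 0, 0, 1, 1],
    "user3": [0, 0, 0, 1, 1, 1, 1, 1, 1, 0],
}
MOVIES = {
    "movie1": [0, 0, 0, 0, 1, 1, 1, 0, 0, 0],
    "movie2": [1, 1, 1, 0, 0, 0, 0, 1, 0, 1],
    "movie3": [0, 0, 1, 0, 1, 1, 0, 1, 0, 1],
}


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_movie(user):
    """Pick the movie with the highest cosine similarity to the user profile."""
    return max(MOVIES, key=lambda movie: cosine(USERS[user], MOVIES[movie]))


if __name__ == "__main__":
    print(round(cosine(USERS["user1"], MOVIES["movie1"]), 5))
    print(best_movie("user1"))
```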
"Item-based" really means "item-similarity-based". You can put whatever similarity metric you like in here. Yes, if it's based on content, like a cosine similarity over term vectors, you could also call this "content-based".
