SUM(LAST()) on GROUP BY - influxdb

I have a series, disk, that contains a path (/mnt/disk1, /mnt/disk2, etc.) and the total space of a disk. It also includes free and used values. These values are updated at a specified interval. What I would like to do is query the sum of the last() total of each path. I would also like to do the same for free and for used, to get an aggregate of the total size, free space, and used space of all the disks on my server.
I have a query here that will get me the last(total) of all the disks, grouped by its path (for distinction):
select last(total) as total from disk where path =~ /(mnt\/disk).*/ group by path
Currently, this returns 5 series, each containing 1 row (the latest) and the value of its total. I then want to take the sum of those series, but I cannot just wrap the last(total) into a sum() function call. Is there a way to do this that I am missing?

Carrying on from my comment above about nested functions.
Building a toy example:
CREATE DATABASE FOO
USE FOO
Assuming your data is updated less often than once a minute [1]:
CREATE CONTINUOUS QUERY disk_sum_total ON FOO
BEGIN
SELECT sum("total") AS "total_1m" INTO disk_1m_total FROM "disk"
GROUP BY time(1m)
END
Then push some values in:
INSERT disk,path=/mnt/disk1 total=30
INSERT disk,path=/mnt/disk2 total=32
INSERT disk,path=/mnt/disk3 total=33
And wait more than a minute. Then:
INSERT disk,path=/mnt/disk1 total=41
INSERT disk,path=/mnt/disk2 total=42
INSERT disk,path=/mnt/disk3 total=43
And wait a minute+ again. Then:
SELECT * FROM disk_1m_total
name: disk_1m_total
-------------------
time total_1m
1476015300000000000 95
1476015420000000000 126
The two values are 30+32+33=95 and 41+42+43=126.
From there, it's trivial to query:
SELECT last(total_1m) FROM disk_1m_total
name: disk_1m_total
-------------------
time last
1476015420000000000 126
Hope that helps.
[1] Picking a CQ interval smaller than the update interval prevents minor timing jitter from causing a group's data to be summed twice by accident. There might be some "zero update" intervals, but no "double counting" intervals. I typically run the query twice as fast as the updates. If the CQ sees no data for a window, it writes no point for that window, so last() will still give the correct answer. For example, I left the CQ running overnight and pushed no new data in: last(total_1m) gives the same answer, not zero for "no new data".
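To make the footnote's reasoning concrete, here is a minimal Python sketch of what the CQ plus last() computes. It is only a simulation, not InfluxDB code: the timestamps (in seconds) and the helper name windowed_sums are made up for illustration, and the values are the ones inserted in the toy example.

```python
from collections import defaultdict

def windowed_sums(points, window_s=60):
    """Sum values per fixed time window (hypothetical helper).

    Windows that receive no points produce no entry at all, which
    mirrors the CQ writing nothing for an empty window.
    """
    sums = defaultdict(int)
    for ts, value in points:
        sums[ts - ts % window_s] += value  # key = start of the window
    return dict(sorted(sums.items()))

# Assumed timestamps: two bursts of writes, more than a minute apart.
points = [(0, 30), (1, 32), (2, 33), (130, 41), (131, 42), (132, 43)]

sums = windowed_sums(points)
last_total = sums[max(sums)]  # equivalent of last(total_1m)

print(sums)        # {0: 95, 120: 126}
print(last_total)  # 126
```

The window starting at 60 has no entry, so taking the latest entry still yields the most recent real sum rather than zero.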

Related

Find nodes with 3+ occurrences in a 10 minute period

I have a list of nodes with a startTime property. I need to determine if the list contains a clump of 3 or more nodes with a startTime within 10 minutes of each other. I don't need to get the nodes that are in the clump, I just need a boolean indicating the existence of such a clump.
I am at a loss, everything I have tried fails so badly that it is not worth posting them.
I feel that I am missing something easy.
This should be doable.
First you'll need to match the nodes, order them by startTime, and collect the start times into a list.
From there, you'll need to get the relevant pairings (each entry, and the entry 2 indices ahead for the end of the duration) that will comprise a group of 3, then see if the start times of that pair occur within 10 minutes of each other.
Assuming for the sake of example :Event nodes with a startTime property, you might use this query to get the results you want:
MATCH (e:Event)
WITH e
ORDER BY e.startTime ASC
WITH collect(e.startTime) as times
WITH times, range(0, size(times) - 3) as indices
RETURN any(index in indices WHERE times[index + 2] <= times[index] + duration({minutes:10}))
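The same sliding-pair check can be sketched outside the database. This is a minimal Python illustration under the assumption that the start times are available as datetime values; the function name has_clump and its parameters are hypothetical.

```python
from datetime import datetime, timedelta

def has_clump(start_times, size=3, window=timedelta(minutes=10)):
    """Return True if any `size` start times fall within `window` of each other."""
    times = sorted(start_times)
    # Compare each entry with the entry (size - 1) positions ahead of it.
    return any(
        times[i + size - 1] - times[i] <= window
        for i in range(len(times) - size + 1)
    )

# Made-up sample data for illustration only.
clumped = [
    datetime(2020, 1, 1, 12, 0),
    datetime(2020, 1, 1, 12, 4),
    datetime(2020, 1, 1, 12, 9),
    datetime(2020, 1, 1, 13, 0),
]
spread = [
    datetime(2020, 1, 1, 12, 0),
    datetime(2020, 1, 1, 12, 6),
    datetime(2020, 1, 1, 12, 12),
]

print(has_clump(clumped))  # True
print(has_clump(spread))   # False
```

Because the list is sorted, checking only the first and last entry of each run of three is enough: the middle entry necessarily lies between them.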

InfluxDB: How to create a continuous query to calculate delta values?

I'd like to calculate the delta values for a series of measurements stored in an InfluxDB. The values are readings from an electricity meter taken every 5 minutes. The values increase over time. Here is a subset of the data to give you an idea (commands shown below are executed in the InfluxDB CLI):
> SELECT "Haushaltstromzaehler - cnt" FROM "myhome_measurements" WHERE time >= '2018-02-02T10:00:00Z' AND time < '2018-02-02T11:00:00Z'
name: myhome_measurements
time Haushaltstromzaehler - cnt
---- --------------------------
2018-02-02T10:00:12.610811904Z 11725.638
2018-02-02T10:05:11.242021888Z 11725.673
2018-02-02T10:10:10.689827072Z 11725.707
2018-02-02T10:15:12.143326976Z 11725.736
2018-02-02T10:20:10.753357056Z 11725.768
2018-02-02T10:25:11.18448512Z 11725.803
2018-02-02T10:30:12.922032896Z 11725.837
2018-02-02T10:35:10.618788096Z 11725.867
2018-02-02T10:40:11.820355072Z 11725.9
2018-02-02T10:45:11.634203904Z 11725.928
2018-02-02T10:50:11.10436096Z 11725.95
2018-02-02T10:55:10.753853952Z 11725.973
Calculating the differences in the InfluxDB CLI is pretty straightforward with the difference() function. This gives me the electricity consumed within each 5-minute interval:
> SELECT difference("Haushaltstromzaehler - cnt") FROM "myhome_measurements" WHERE time >= '2018-02-02T10:00:00Z' AND time < '2018-02-02T11:00:00Z'
name: myhome_measurements
time difference
---- ----------
2018-02-02T10:05:11.242021888Z 0.03499999999985448
2018-02-02T10:10:10.689827072Z 0.033999999999650754
2018-02-02T10:15:12.143326976Z 0.02900000000045111
2018-02-02T10:20:10.753357056Z 0.0319999999992433
2018-02-02T10:25:11.18448512Z 0.03499999999985448
2018-02-02T10:30:12.922032896Z 0.033999999999650754
2018-02-02T10:35:10.618788096Z 0.030000000000654836
2018-02-02T10:40:11.820355072Z 0.03299999999944703
2018-02-02T10:45:11.634203904Z 0.028000000000247383
2018-02-02T10:50:11.10436096Z 0.02200000000084401
2018-02-02T10:55:10.753853952Z 0.02299999999922875
Where I struggle is getting this to work in a continuous query. Here is the command I used to setup the continuous query:
CREATE CONTINUOUS QUERY cq_Haushaltstromzaehler_cnt ON myhomedb
BEGIN
SELECT difference(sum("Haushaltstromzaehler - cnt")) AS "delta" INTO "Haushaltstromzaehler_delta" FROM "myhome_measurements" GROUP BY time(1h)
END
Looking in the InfluxDB log file I see that no data is written in the new 'delta' measurement from the continuous query execution:
...finished continuous query cq_Haushaltstromzaehler_cnt, 0 points(s) written...
After much troubleshooting and experimenting I now understand why no data is generated. A continuous query requires a GROUP BY time() clause, which in turn requires an aggregate function inside the difference() function. The problem is that the aggregate function returns only one value for the time period specified by GROUP BY time(), and difference() obviously cannot calculate a difference from just one value. Essentially, the continuous query executes a command like this:
> SELECT difference(sum("Haushaltstromzaehler - cnt")) FROM "myhome_measurements" WHERE time >= '2018-02-02T10:00:00Z' AND time < '2018-02-02T11:00:00Z' GROUP BY time(1h)
>
I'm now somewhat clueless as to how to make this work and appreciate any advice you might have.
Does using the last() function help? I have not tested this as a CQ yet.
Select difference(last(T1_Consumed)) AS T1_Delta, difference(last(T2_Consumed)) AS T2_Delta
from P1Data
where time >= 1551648871000000000 group by time(1h)
DIFFERENCE() calculates the delta between the aggregated value of the current group and that of the previous group, not between values within the current group.
So feel free to use a selector function there: since your counters seem to be cumulative, LAST() should work well.
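As an illustration of that semantics, the following Python sketch takes the last reading in each hourly group and then subtracts the previous group's value. It is a simulation under assumptions, not InfluxDB code: the readings are invented integers, timestamps are epoch seconds, and hourly_deltas is a hypothetical helper.

```python
def hourly_deltas(readings, window_s=3600):
    """Difference between the last cumulative reading of consecutive windows.

    readings: iterable of (epoch_seconds, cumulative_value) in any order.
    """
    last_per_window = {}
    for ts, value in sorted(readings):
        # Later readings overwrite earlier ones, so the last one wins.
        last_per_window[ts - ts % window_s] = value
    windows = sorted(last_per_window)
    return {
        cur: last_per_window[cur] - last_per_window[prev]
        for prev, cur in zip(windows, windows[1:])
    }

# Made-up cumulative meter readings, two per hour.
readings = [(0, 100), (1800, 104), (3600, 110), (5400, 115), (7200, 121)]

print(hourly_deltas(readings))  # {3600: 11, 7200: 6}
```

The first window has no predecessor and therefore yields no delta, which is also why a query restricted to a single group returns nothing.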

stack data and restructure without using var to cases or casestovar in SPSS

I have the following situation: a loop (stack data) with only 1 index variable and with multiple items corresponding to the statements, as in the picture below (sorry it is Excel, but is the same as in SPSS):
stack data - cases on multiple lines, but never filling for 1 respondent all the columns
I want to reach the following situation without using CASESTOVARS to restructure, because that creates a lot of empty variables. I remember that older versions had a command like UPDATE, which moved the cases up, to reach the following result:
reducing the cases per respondent
Like starting from this:
ID Index Q1_1 Q1_2 Q1_3 Q1_4 Q1_5 Q1_6
1 1 1 1
1 2 1 1
1 3 1 1
To reach this:
ID Q1_1 Q1_2 Q1_3 Q1_4 Q1_5 Q1_6
1 1 1 1 1 1 1
But without using casestovars. Is there any command in SPSS syntax for this?
Thank you very much, have a nice day!
Not entirely sure how variable your data structure is likely to be in reality but if as demo'ed where you have only a single response for each q1_1 to q1_6 per respondent ID, then the below would be sufficient:
dataset declare dsAgg.
aggregate outfile="dsAgg" /break=respid /q1_1 to q1_6=max(q1_1 to q1_6).
Also not sure of the significance of duplicate index values within the same respondent IDs, if this was intended or not.
The following syntax could do the job -
* first we'll recreate your example data.
data list list/respid index q1_1 to q1_6.
begin data
1,1,1,,,,,
1,2,,2,,,,
1,3,,,1,,,
1,4,,,,2,,
1,5,,,,,1,
1,6,,,,,,2
2,1,3,,,,,
2,1,,4,,,,
2,2,,,5,,,
2,2,,,,4,,
2,3,,,,,3,
2,3,,,,,,2
end data.
* now to work: first thing is to make sure the data from each ID are together.
sort cases by respid index.
* the loop will fill down the data to the last line of each ID.
do repeat qq=q1_1 to q1_6.
if respid=lag(respid) and missing(qq) qq=lag(qq).
end repeat.
* the following lines will help recognize the last line for each ID and select it.
compute lineNR=$casenum.
aggregate /outfile=* mode=ADDVARIABLES/break=respid/MXlineNR=max(lineNR).
select if lineNR=MXlineNR.
exe.
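For readers who want to check the expected outcome outside SPSS, the fill-down-and-keep-last-line logic can be sketched in Python. This is only an illustration: the rows reproduce the example data above with None for missing values, and the helper name collapse is hypothetical.

```python
def collapse(rows, n_items=6):
    """Merge all rows of a respondent into one, keeping the last non-missing value."""
    merged = {}
    for respid, _index, *answers in rows:
        current = merged.setdefault(respid, [None] * n_items)
        for i, answer in enumerate(answers):
            if answer is not None:
                current[i] = answer
    return merged

N = None  # shorthand for a missing value
rows = [
    (1, 1, 1, N, N, N, N, N),
    (1, 2, N, 2, N, N, N, N),
    (1, 3, N, N, 1, N, N, N),
    (1, 4, N, N, N, 2, N, N),
    (1, 5, N, N, N, N, 1, N),
    (1, 6, N, N, N, N, N, 2),
    (2, 1, 3, N, N, N, N, N),
    (2, 1, N, 4, N, N, N, N),
    (2, 2, N, N, 5, N, N, N),
    (2, 2, N, N, N, 4, N, N),
    (2, 3, N, N, N, N, 3, N),
    (2, 3, N, N, N, N, N, 2),
]

print(collapse(rows))
# {1: [1, 2, 1, 2, 1, 2], 2: [3, 4, 5, 4, 3, 2]}
```

If a respondent had two different non-missing values for the same item, this sketch (like the fill-down syntax) would keep the later one, whereas the max() aggregate would keep the larger one.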

Write Cypher query to display temperature values till it reaches set temperature

I have about 200,000 rows of 24 hour data as follows:
I can use the query to create a room node with time, roomtemp, and set temp as properties. Moreover, I can also, define the relationship of each room with its corresponding temperatures.
Now, I need to find:
all rows that show an update/increase/decrease from initial temperature till set temperature for all rooms. e.g. based on above data, I need:
Here I have discarded 5th row data as 16 was repetitive and showed no update(increase or decrease) in temp value. The temperature values continued till it reached set temperature '18'.
I can manually create the temperature states by giving its values one by one, but I am unsure how to MERGE the above requirement into the graph using Cypher.
Can I utilize any other programming language to obtain same results using Neo4j in conjunction?
Do I have to utilize in-graph time-tree for this scenario? Can I retrieve my results without creating a time tree?
Filter temperature by room and date (which can also be a date node)
Sort by time
Collect into a list
Filter by differences between two subsequent temperatures
Turn list into rows
Here is a query that does this:
MATCH (r:Room)<-[:TEMP]-(t:Temparature)
WHERE t.time STARTS WITH "2016-01-01"
AND t.temp < r.temp AND t.temp > {initial}
WITH t ORDER BY t.time ASC
WITH collect(t) AS temps
WITH [idx in range(0,size(temps)-2) WHERE temps[idx].temp <> temps[idx+1].temp | temps[idx] ] as filtered
UNWIND filtered as t
RETURN t;
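On the question of doing this in another language: the same filtering can be applied client-side after fetching the sorted readings. Below is a minimal Python sketch under assumptions; the sample readings are invented, temperature_changes is a hypothetical helper, and unlike the Cypher query above it keeps the first reading of each run of repeats and includes the reading that reaches the set temperature.

```python
def temperature_changes(readings, set_temp):
    """Keep readings that differ from the previous kept one, up to set_temp.

    readings: list of (time, temp) tuples already sorted by time.
    """
    changes = []
    for time, temp in readings:
        if changes and changes[-1][1] == temp:
            continue  # no update compared with the previous state
        changes.append((time, temp))
        if temp == set_temp:
            break  # set temperature reached, stop collecting
    return changes

# Made-up readings for one room.
readings = [
    ("00:00", 15),
    ("00:05", 16),
    ("00:10", 16),
    ("00:15", 17),
    ("00:20", 18),
    ("00:25", 18),
]

print(temperature_changes(readings, set_temp=18))
# [('00:00', 15), ('00:05', 16), ('00:15', 17), ('00:20', 18)]
```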

Neo4j: Best way to batch relate nodes using Cypher?

When I run a script that tries to batch merge all nodes of certain types, I am getting some weird performance results.
When merging 2 collections of nodes (~42k) and (~26k), the performance is nice and fast.
But when I merge (~42k) and (5), performance DRAMATICALLY degrades. I'm batching the parent nodes, so the (~42k) are split up into batches of 500. Why does performance drop when I'm essentially merging fewer nodes (the batch set is the same, but the source of the batch set is large and the target set is small)?
Relation Query:
MATCH (s:ContactPlayer)
WHERE has(s.ContactPrefixTypeId)
WITH collect(s) AS allP
WITH allP[7000..7500] as rangedP
FOREACH (parent in rangedP |
MERGE (child:ContactPrefixType
{ContactPrefixTypeId:parent.ContactPrefixTypeId}
)
MERGE (child)-[r:CONTACTPLAYER]->(parent)
SET r.ContactPlayerId = parent.ContactPlayerId ,
r.ContactPrefixTypeId = child.ContactPrefixTypeId )
Performance Results:
Process Starting
Starting to insert Contact items
[+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++]
Total time for 42149 Contact items: 19176.87ms
Average time per batch (500): 213.4ms
Longest batch time: 663ms
Starting to insert ContactPlayer items
[++++++++++++++++++++++++++++++++++++++++++++++++++++++++]
Total time for 27970 ContactPlayer items: 9419.2106ms
Average time per batch (500): 167.75ms
Longest batch time: 689ms
Starting to relate Contact to ContactPlayer
[++++++++++++++++++++++++++++++++++++++++++++++++++++++++]
Total time taken to relate Contact to ContactPlayer: 7907.4877ms
Average time per batch (500): 141.151517857143ms
Longest batch time: 883.0918ms for Batch number: 0
Starting to insert ContactPrefixType items
[+]
Total time for 5 ContactPrefixType items: 22.0737ms
Average time per batch (500): 22ms
Longest batch time: 22ms
Already inserted data for Contact.
Starting to relate ContactPrefixType to Contact
[+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++]
Total time taken to relate ContactPrefixType to Contact: 376540.8309ms
Average time per batch (500): 4429.78643647059ms
Longest batch time: 14263.1843ms for Batch number: 63
So far, the best I could come up with is the following (and it's a hack, specific to my environment):
If / Else condition:
If childrenNodes.count() < 200 -> assume they are type identifiers for the parent... i.e. ContactPrefixType
Else assume it is a matrix for relating multiple item types together (i.e. ContactAddress)
If childNodes < 200
MATCH (parent:{parentLabel}),
(child:{childLabel} {{childLabelIdProperty}:parent.{parentRelationProperty}})
CREATE child-[r:{relationshipLabel}]->parent
This takes about 3-5 seconds to complete per relationship type
Else
MATCH (child:{childLabel}),
(parent:{parentLabel} {{parentPropertyField}:child.{childLabelIdProperty}})
WITH collect(parent) as parentCollection, child
WITH parentCollection[{batchStart}..{batchEnd}] as coll, child
FOREACH (parent in coll |
CREATE child-[r:{relationshipLabel}]->parent )
I'm not sure this is the most efficient way of doing this, but after trying MANY different options, this seems to be the fastest.
Stats:
insert 225,018 nodes with 2,070,977 properties
create 464,606 relationships
Total: 331 seconds.
Because this is a straight import and I'm not dealing with updates yet, I assume that all the relationships are correct and don't need to worry about invalid data... however, I will try to set properties to the relationship type so as to be able to perform cleanup functions later (i.e. store the parent and child Id's in the relationship type as properties for later reference)
If anyone can improve on this, I would love it.
Can you pass the ids in as parameters rather than fetch them from the graph? The query could look like
MATCH (s:ContactPlayer {ContactPrefixTypeId:{cptid}})
MERGE (c:ContactPrefixType {ContactPrefixTypeId:{cptid}})
MERGE c-[:CONTACT_PLAYER]->s
If you use the REST API Cypher resource, I think the entity should look something like
{
"query":...,
"params": {
"cptid":id1
}
}
If you use the transactional endpoint, it should look something like this. You control transaction size by the number of statements in each call, and also by the number of calls before you commit. More here.
{
"statements":[
{
"statement":...,
"parameters": {
"cptid":id1
}
},
{
"statement":...,
"parameters": {
"cptid":id2
}
}
]
}
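A request body of this shape can be generated programmatically. The Python sketch below only builds the JSON payloads and does not send them; the ids, the per_request size, and the helper name build_payloads are assumptions for illustration, and the query text mirrors the one suggested above.

```python
import json

QUERY = (
    "MATCH (s:ContactPlayer {ContactPrefixTypeId:{cptid}}) "
    "MERGE (c:ContactPrefixType {ContactPrefixTypeId:{cptid}}) "
    "MERGE (c)-[:CONTACT_PLAYER]->(s)"
)

def build_payloads(ids, per_request=2):
    """Yield one JSON body per request, each holding up to per_request statements."""
    for start in range(0, len(ids), per_request):
        chunk = ids[start:start + per_request]
        yield json.dumps({
            "statements": [
                {"statement": QUERY, "parameters": {"cptid": cptid}}
                for cptid in chunk
            ]
        })

# Hypothetical ContactPrefixTypeId values.
payloads = list(build_payloads([1, 2, 3, 4, 5]))
print(len(payloads))  # 3
print(json.loads(payloads[0])["statements"][1]["parameters"])  # {'cptid': 2}
```

Raising per_request puts more statements in each call, which is the knob the answer refers to for controlling transaction size.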
