InfluxQL: Calculate periode of status change - influxdb

I have a table in my InfluxDB for a parking sensor which sends a state for vacant (1) and occupied (2).
Now I want to create a query that shows me the time period between status changes, so that I can report how long the parking slot was occupied and how long it was free.
The data is generated by the parking sensor and inserted via Node-RED into InfluxDB.
The data is located in InfluxDB version 1.8.10 on Ubuntu 20.04.
The table currently has the following structure:
name: parkinginfo
| time | statusnumbered |
| -------- | -------------- |
| 1646302488500839186 | 1 |
| 1646302488500203666 | 1 |
| 1646302488499932866 | 2 |
| 1646302488499826263 | 1 |
statusnumbered: 1 = vacant, 2 = occupied
Can someone help me create a query for this?
Thanks!

It is feasible to report the total duration for a given state in InfluxDB, but only in Flux, not (easily) in InfluxQL.
You could:
Enable Flux in v1.8 with the configuration change here
Sample Flux could be:
from(bucket: "yourDatabaseName/autogen") |> range(start: 2022-03-03T00:00:00Z, stop: 2022-03-03T23:59:59Z) |> filter(fn: (r) => r._measurement == "yourMeasurementName") |> stateDuration(fn: (r) => r._value == 1, column: "state", unit: 1m)
from(bucket: "yourDatabaseName/autogen") |> range(start: 2022-03-03T00:00:00Z, stop: 2022-03-03T23:59:59Z) |> filter(fn: (r) => r._measurement == "yourMeasurementName") |> stateDuration(fn: (r) => r._value == 2, column: "state", unit: 1m)
Again, it's still not possible to do this in InfluxQL, though the community has been asking for it for a while. See more details here.
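If what you are after is the length of each gap between consecutive status points rather than a running duration, Flux's elapsed() function may also be worth a look. A rough sketch, using the same placeholder bucket and measurement names as above; note this attributes each gap to the status reported at the end of the gap, so treat it as an approximation:
from(bucket: "yourDatabaseName/autogen")
  |> range(start: 2022-03-03T00:00:00Z, stop: 2022-03-03T23:59:59Z)
  |> filter(fn: (r) => r._measurement == "yourMeasurementName")
  |> elapsed(unit: 1s)            // seconds since the previous point, in an "elapsed" column
  |> group(columns: ["_value"])   // one group per status value (1 = vacant, 2 = occupied)
  |> sum(column: "elapsed")       // total seconds per status value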

Related

Calculating a new column in spark df based on another spark df without an explicit join column

I have df1 and df2 without a common join column. Now I need to add a new column to df1 from df2 when a condition based on df2's columns is met. I will try to explain myself better with an example:
df1:
+--------+----------+
|label | raw |
+--------+----------+
|0.0 |-1.1088619|
|0.0 |-1.3188809|
|0.0 |-1.3051535|
+--------+----------+
df2:
+--------------------+----------+----------+
| probs | minRaw| maxRaw|
+--------------------+----------+----------+
| 0.1|-1.3195256|-1.6195256|
| 0.2|-1.6195257|-1.7195256|
| 0.3|-1.7195257|-1.8195256|
| 0.4|-1.8195257|-1.9188809|
The expected output is a new column in df1 that gets df2.probs when the df1.raw value is between df2.minRaw and df2.maxRaw.
My first approach was to explode the minRaw-maxRaw range and then join the dataframes, but those columns are continuous. The second idea was a UDF like this:
def get_probabilities(raw):
    df = isotonic_prob_table.filter((F.col("min_raw") >= raw) & \
                                    (F.col("max_raw") <= raw)) \
                            .select("probs")
    df.show()
    #return df.select("probabilidad_bin").value()
    #return df.first()["probabilidad_bin"]
But it takes a long time on my large dataframe, and gives me these warnings:
23/02/13 22:02:20 WARN org.apache.spark.sql.execution.window.WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
23/02/13 22:02:20 WARN org.apache.spark.sql.execution.window.WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
[Stage 82:> (0 + 1) / 1][Stage 83:====> (4 + 3) / 15]23/02/13 22:04:36 WARN org.apache.spark.sql.catalyst.expressions.RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
23/02/13 22:04:36 WARN org.apache.spark.sql.catalyst.expressions.RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
If the value isn't between minRaw and maxRaw, the expected output is null, and df1 can have duplicates.
I have Spark version 2.4.7 and I'm not a PySpark expert. Thank you in advance for reading!
I think you can just join those dataframes with a between condition:
df1.join(df2, f.col('raw').between(f.col('maxRaw'), f.col('minRaw')), 'left').show(truncate=False)
+-----+-----+-----+----------+----------+
|label|raw |probs|minRaw |maxRaw |
+-----+-----+-----+----------+----------+
|0.0 |-1.1 |null |null |null |
|0.0 |-1.1 |null |null |null |
|0.0 |-1.32|0.1 |-1.3195256|-1.6195256|
|0.0 |-1.32|0.1 |-1.3195256|-1.6195256|
|0.0 |-1.73|0.3 |-1.7195257|-1.8195256|
|0.0 |-1.88|0.4 |-1.8195257|-1.9188809|
+-----+-----+-----+----------+----------+
Or use BETWEEN in a SQL expression:
df2.createOrReplaceTempView('df2')
df1.createOrReplaceTempView('df1')
%sql
SELECT minRaw,maxRaw,raw
FROM df1 JOIN df2 ON df1.raw BETWEEN df2.minRaw and df2.maxRaw
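If you're not in a notebook with the %sql magic, the same idea can be run through spark.sql() (a sketch; it assumes the temp views registered above, and keeps the BETWEEN bounds in the order used in the DataFrame example, since the raw values are negative and minRaw is numerically the larger bound):
# run the join as a SQL statement instead of the %sql magic
result = spark.sql("""
    SELECT d1.label, d1.raw, d2.probs, d2.minRaw, d2.maxRaw
    FROM df1 d1
    LEFT JOIN df2 d2
      ON d1.raw BETWEEN d2.maxRaw AND d2.minRaw
""")
result.show(truncate=False)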
You can perform a cross join between df1 and df2 and apply a filter so that you're only selecting rows where df1.raw is between df2.minRaw and df2.maxRaw; this should be more performant than a UDF.
Note: Since df1 can have duplicates, we want to deduplicate df1 before cross joining with df2 so that after we apply the filter we don't have any duplicate rows, but still have the minimum information we need. Then we can right join on df1 to ensure we have all of the original rows of df1.
I've also modified your df1 slightly to include duplicates for the purpose of demonstrating the result:
df1 = spark.createDataFrame(
[
(0.0,-1.10),
(0.0,-1.10),
(0.0,-1.32),
(0.0,-1.32),
(0.0,-1.73),
(0.0,-1.88)
],
['label','raw']
)
df2 = spark.createDataFrame(
[
(0.1, -1.3195256, -1.6195256),
(0.2, -1.6195257, -1.7195256),
(0.3, -1.7195257, -1.8195256),
(0.4, -1.8195257, -1.9188809)
],
['probs','minRaw','maxRaw']
)
This is the result when you cross join df1 and df2 and remove duplicates:
df1.drop_duplicates().crossJoin(df2).show()
+-----+-----+-----+----------+----------+
|label| raw|probs| minRaw| maxRaw|
+-----+-----+-----+----------+----------+
| 0.0| -1.1| 0.1|-1.3195256|-1.6195256|
| 0.0|-1.32| 0.1|-1.3195256|-1.6195256|
| 0.0|-1.73| 0.1|-1.3195256|-1.6195256|
| 0.0|-1.88| 0.1|-1.3195256|-1.6195256|
...
| 0.0| -1.1| 0.4|-1.8195257|-1.9188809|
| 0.0|-1.32| 0.4|-1.8195257|-1.9188809|
| 0.0|-1.73| 0.4|-1.8195257|-1.9188809|
| 0.0|-1.88| 0.4|-1.8195257|-1.9188809|
+-----+-----+-----+----------+----------+
Then we can apply the filter and right join with df1 to make sure all of the original rows exist:
df1.crossJoin(df2).filter(
    (F.col('raw') > F.col('maxRaw')) & (F.col('raw') < F.col('minRaw'))
).select(
    'label', 'raw', 'probs'
).join(
    df1, on=['label', 'raw'], how='right'
).show()
+-----+-----+-----+
|label| raw|probs|
+-----+-----+-----+
| 0.0| -1.1| null|
| 0.0| -1.1| null|
| 0.0|-1.32| 0.1|
| 0.0|-1.32| 0.1|
| 0.0|-1.73| 0.3|
| 0.0|-1.88| 0.4|
+-----+-----+-----+
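Since df2 is just a small lookup table, hinting Spark to broadcast it in the cross join may also help performance. A sketch on top of the answer above (same column names; the only addition is the broadcast hint):
from pyspark.sql import functions as F

# broadcast the small probability table so the cross join doesn't shuffle the large df1
df1.crossJoin(F.broadcast(df2)).filter(
    (F.col('raw') > F.col('maxRaw')) & (F.col('raw') < F.col('minRaw'))
).select(
    'label', 'raw', 'probs'
).join(
    df1, on=['label', 'raw'], how='right'
).show()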

InfluxDB 2.0 - How to merge and sum 2 separate buckets stream in one with Flux

Context
I work on an application with InfluxDB and I am facing a use case I don't know how to solve.
I am using influxDB 2 and flux for the queries.
Use case
To simplify my use case, I'll take an example with bees.
I want to measure the evolution of the bee population across time.
I measure 2 populations of bees in 2 separate buckets (bucket A for bee A and bucket B for bee B).
So I want to find a query that merges these 2 buckets to get the current population of bees at any time.
Using 2 buckets is mandatory in my use case. What I want is just a way to "combine" the result when I query.
The results should be sorted by timestamp.
Examples
Querying the Bee A and Bee B buckets returns the first two series below; the objective is to produce the Total series.
Example 1:
If the measurements always arrive as A, B, A, B, ..., the result must be:
Populations
Bee A: 10, 12, 5, 22
Bee B: 20, 16, 19, 36
Total: 10, 30, 32, 28, 21, 24, 41, 58
Example 2:
If the measurements arrive as A, B, A, A, A, B, ..., the result must be:
Populations
Bee A: 10, 12, 5, 22
Bee B: 20, 16, 19, 36
Total: 10, 30, 32, 25, 42, 38, 41, 58
Code
I tried using union but I couldn't get it to work, as it kept 2 separate tables instead of one.
bucket1 = from(bucket: "beeA")
|> range(start: 0)
|> filter(fn: (r) => r["_measurement"] == "population")
|> filter(fn: (r) => r["_field"] == "pop")
|> sort(columns: ["_time"])
bucket2 = from(bucket: "beeB")
|> range(start: 0)
|> filter(fn: (r) => r["_measurement"] == "population")
|> filter(fn: (r) => r["_field"] == "pop")
|> sort(columns: ["_time"])
union(tables: [bucket1, bucket2])
You can try removing |> sort(columns: ["_time"]).
Also check |> range(start: 0): you may want to give a different time range, and make sure the selected time range is one where you actually have data.
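If you want the combined running total in a single table, one option (a sketch, not tested against your data; it assumes both buckets use the same measurement and field names as in your query) is to tag each stream, union them, pivot by the tag, and carry the last known value of each series forward before summing:
a = from(bucket: "beeA")
  |> range(start: 0)
  |> filter(fn: (r) => r["_measurement"] == "population" and r["_field"] == "pop")
  |> set(key: "species", value: "A")

b = from(bucket: "beeB")
  |> range(start: 0)
  |> filter(fn: (r) => r["_measurement"] == "population" and r["_field"] == "pop")
  |> set(key: "species", value: "B")

union(tables: [a, b])
  |> group()                                   // collapse everything into one table
  |> pivot(rowKey: ["_time"], columnKey: ["species"], valueColumn: "_value")
  |> sort(columns: ["_time"])
  |> fill(column: "A", usePrevious: true)      // carry the last known population of A forward
  |> fill(column: "B", usePrevious: true)      // same for B
  |> map(fn: (r) => ({r with total: (if exists r.A then r.A else 0) + (if exists r.B then r.B else 0)}))
  // if the populations are stored as floats, use 0.0 instead of 0 above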

changing the value of bin in a record of aerospike db which is of map type in lua script

Say the Aerospike database has records like the one below.
Let the namespace be employee:
name age characteristics
sachin 25 MAP('{"weight":70, "height":25}')
Now I want to change the value of height, which is inside the map, for all the records in the employee namespace through a Lua script.
I have tried changing bins of normal data types as below, i.e. I tried to change the age like this:
function changeAgeOfEmployee(rec)
    if not aerospike:exists(rec) then
        error("Invalid Record. Returning")
        return
    else
        age = 30
        rec['age'] = age
        aerospike:update(rec)
    end
end
But I am not sure how to change a value inside the map in Lua. Can somebody please assist me with this?
Your MAP data type is basically a Lua table. A MAP in Lua can be written as:
local m = map {["weight"] = 70, ["height"] = 25}
To loop over all the key/value pairs you should use the pairs iterator like this:
for key, value in map.pairs(m) do
    m[key] = 30 -- this changes all the values of your MAP to 30
end
If you're going to modify a key of a map, or an index of a list, you should copy that bin into a local variable, then set it back on the record before updating it.
function changes(rec)
    rec['i'] = 99
    local m = rec['m']
    m['a'] = 66
    rec['m'] = m
    aerospike:update(rec)
end
In AQL
$ aql
Aerospike Query Client
Version 3.15.1.2
C Client Version 4.3.0
Copyright 2012-2017 Aerospike. All rights reserved.
aql> register module './test.lua'
OK, 1 module added.
aql> select * from test.demo where PK='88'
+----+-------+--------------------------------------+------------------------------------------+
| i | s | m | l |
+----+-------+--------------------------------------+------------------------------------------+
| 88 | "xyz" | MAP('{"a":2, "b":4, "c":8, "d":16}') | LIST('[2, 4, 8, 16, 32, NIL, 128, 256]') |
+----+-------+--------------------------------------+------------------------------------------+
1 row in set (0.002 secs)
aql> execute test.changes() on test.demo where PK='88'
+---------+
| changes |
+---------+
| |
+---------+
1 row in set (0.001 secs)
aql> select * from test.demo where PK='88'
+----+-------+---------------------------------------+------------------------------------------+
| i | s | m | l |
+----+-------+---------------------------------------+------------------------------------------+
| 99 | "xyz" | MAP('{"a":66, "b":4, "c":8, "d":16}') | LIST('[2, 4, 8, 16, 32, NIL, 128, 256]') |
+----+-------+---------------------------------------+------------------------------------------+
1 row in set (0.000 secs)
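Applied to the employee example from the question, the same pattern might look like this (a sketch; the bin name characteristics, the key height, and the new value 30 are assumed from the question):
function changeHeightOfEmployee(rec)
    if not aerospike:exists(rec) then
        error("Invalid Record. Returning")
        return
    end
    local c = rec['characteristics']   -- read the map bin into a local variable
    if c ~= nil then
        c['height'] = 30               -- change only the height key inside the map
        rec['characteristics'] = c     -- write the modified map back to the bin
        aerospike:update(rec)
    end
end
Registering the module and then running execute on the set without a PK filter (as in the AQL session above, just without the where clause) should apply it as a scan UDF across all records; verify this on test data first.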

Getting a subtree in Neo4J efficiently

I have a tree with about 300 nodes. For an arbitrary node, I need to draw a subtree from that node to the root (including all possible paths).
For example if I have this tree (edited):
      a
      |
  ---------
  |   |   |
  b   c   d
  |   |
  -----
    |
    e
    |
    f
and the e node is selected, I need to draw:
    a
    |
  -----
  |   |
  b   c
  |   |
  -----
    |
    e
I am using this Cypher query:
start n=node({nodeId}) optional match n-[r:DEPENDS*]->p return n,r,p
Although it works, depending on the depth of the searched node it is very slow (more than 10 seconds).
How can I achieve this efficiently?
Your query will compute all paths, while you are only interested in the path to the root. So get the root and the node, and the shortest path in between.
MATCH path=shortestPath((root)<-[:DEPENDS*]-(n))
WHERE id(root) = {rootId} and id(n) = {nodeId}
RETURN path
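If you really do need every path from the node up to the root (the full subtree, as in the question) rather than just one shortest path, a variation like the following might work, assuming the root is the only ancestor with no outgoing DEPENDS relationship (untested sketch):
MATCH path = (root)<-[:DEPENDS*]-(n)
WHERE id(n) = {nodeId} AND NOT (root)-[:DEPENDS]->()
RETURN path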

Make my Neo4j queries faster

I'm evaluating Neo4j for our application, and now am at a point where performance is an issue. I've created a lot of nodes and edges that I'm doing some queries against. The following is a detail of the nodes and edges data in this database:
I am trying to do a search that traverses the yellow arrows of this diagram. What I have so far is the following query:
MATCH (n:LABEL_TYPE_Project {id:'14'})
-[:RELATIONSHIP_scopes*1]->(m:LABEL_TYPE_PermissionNode)
-[:RELATIONSHIP_observedBy*1]->(o:LABEL_TYPE_Item)
WHERE m.id IN ['1', '2', '6', '12', '12064', '19614', '19742', '19863', '21453', '21454', '21457', '21657', '21658', '31123', '31127', '31130', '47691', '55603', '55650', '56026', '56028', '56029', '56050', '56052', '85383', '85406', '85615', '105665', '1035242', '1035243']
AND o.content =~ '.*some string.*'
RETURN o
LIMIT 20
(The variable paths above have been updated, see "Update 2")
The above query takes a barely-acceptable 1200ms. It only returns the requested 20 items. If I want a count of the same, this takes forever:
MATCH ... more of the same ...
RETURN count(o)
The above query takes many minutes. This is Neo4j 2.2.0-M03 Community running on CentOS. There are around 385,000 nodes, 170,000 of them of type Item.
I have created indices on all id fields (programmatically, via index().forNodes(...).add(...)), and also on the content field (with a CREATE INDEX ... statement).
Are there fundamental improvements yet to be made to my queries? Things I can try?
Much appreciated.
This question was moved over from Neo4j discussion group on Google per their suggestions.
Update 1
As requested:
:schema
Gives:
Indexes
ON :LABEL_TYPE_Item(id) ONLINE
ON :LABEL_TYPE_Item(active) ONLINE
ON :LABEL_TYPE_Item(content) ONLINE
ON :LABEL_TYPE_PermissionNode(id) ONLINE
ON :LABEL_TYPE_Project(id) ONLINE
No constraints
(This is updated, see "Update 2")
Update 2
I have made the following noteworthy improvements to the query:
Shame on me, I did have super-nodes for all TYPE_Projects (not by design, I just messed up the importing algorithm I was using) and I have removed them now
I had a lot of "strings" that could have been proper data types, such as integers and booleans, and I am now importing them as such (you'll see in the updated queries below that I removed a lot of quotes)
As pointed out, I had variable length paths and I fixed those
As pointed out, I should have had uniqueness indices instead of regular indices and I fixed that
As a consequence:
:schema
Now gives:
Indexes
ON :LABEL_TYPE_Item(active) ONLINE
ON :LABEL_TYPE_Item(content) ONLINE
ON :LABEL_TYPE_Item(id) ONLINE (for uniqueness constraint)
ON :LABEL_TYPE_PermissionNode(id) ONLINE (for uniqueness constraint)
ON :LABEL_TYPE_Project(id) ONLINE (for uniqueness constraint)
Constraints
ON (label_type_item:LABEL_TYPE_Item) ASSERT label_type_item.id IS UNIQUE
ON (label_type_project:LABEL_TYPE_Project) ASSERT label_type_project.id IS UNIQUE
ON (label_type_permissionnode:LABEL_TYPE_PermissionNode) ASSERT label_type_permissionnode.id IS UNIQUE
The query now looks like this:
MATCH (n:LABEL_TYPE_Project {id:14})
-[:RELATIONSHIP_scopes]->(m:LABEL_TYPE_PermissionNode)
-[:RELATIONSHIP_observedBy]->(o:LABEL_TYPE_Item)
WHERE m.id IN [1, 2, 6, 12, 12064, 19614, 19742, 19863, 21453, 21454, 21457, 21657, 21658, 31123, 31127, 31130, 47691, 55603, 55650, 56026, 56028, 56029, 56050, 56052, 85383, 85406, 85615, 105665, 1035242, 1035243]
AND o.content =~ '.*some string.*'
RETURN o
LIMIT 20
The above query now takes approx. 350ms.
I still want a count of the same:
MATCH ...
RETURN count(o)
The above query now takes approx. 1100ms. Although that's much better, and barely acceptable for this particular query, I've already found some more-complex queries that inherently take longer. So a further improvement on this query here would be great.
As requested here is the PROFILE for RETURN o query (for the improved query):
Compiler CYPHER 2.2
Planner COST
Projection
|
+Limit
|
+Filter(0)
|
+Expand(All)(0)
|
+Filter(1)
|
+Expand(All)(1)
|
+NodeUniqueIndexSeek
+---------------------+---------------+-------+--------+-------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Operator | EstimatedRows | Rows | DbHits | Identifiers | Other |
+---------------------+---------------+-------+--------+-------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Projection | 1900 | 20 | 0 | m, n, o | o |
| Limit | 1900 | 20 | 0 | m, n, o | { AUTOINT32} |
| Filter(0) | 1900 | 20 | 131925 | m, n, o | (hasLabel(o:LABEL_TYPE_Item) AND Property(o,content(23)) ~= /{ AUTOSTRING31}/) |
| Expand(All)(0) | 4993 | 43975 | 43993 | m, n, o | (m)-[:RELATIONSHIP_observedBy]->(o) |
| Filter(1) | 2 | 18 | 614 | m, n | (hasLabel(m:LABEL_TYPE_PermissionNode) AND any(-_-INNER-_- in Collection(List({ AUTOINT1}, { AUTOINT2}, { AUTOINT3}, { AUTOINT4}, { AUTOINT5}, { AUTOINT6}, { AUTOINT7}, { AUTOINT8}, { AUTOINT9}, { AUTOINT10}, { AUTOINT11}, { AUTOINT12}, { AUTOINT13}, { AUTOINT14}, { AUTOINT15}, { AUTOINT16}, { AUTOINT17}, { AUTOINT18}, { AUTOINT19}, { AUTOINT20}, { AUTOINT21}, { AUTOINT22}, { AUTOINT23}, { AUTOINT24}, { AUTOINT25}, { AUTOINT26}, { AUTOINT27}, { AUTOINT28}, { AUTOINT29}, { AUTOINT30})) where Property(m,id(0)) == -_-INNER-_-)) |
| Expand(All)(1) | 11 | 18 | 19 | m, n | (n)-[:RELATIONSHIP_scopes]->(m) |
| NodeUniqueIndexSeek | 1 | 1 | 1 | n | :LABEL_TYPE_Project(id) |
+---------------------+---------------+-------+--------+-------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
And here is the PROFILE for RETURN count(o) query (for the improved query):
Compiler CYPHER 2.2
Planner COST
Limit
|
+EagerAggregation
|
+Filter(0)
|
+Expand(All)(0)
|
+Filter(1)
|
+Expand(All)(1)
|
+NodeUniqueIndexSeek
+---------------------+---------------+--------+--------+-------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Operator | EstimatedRows | Rows | DbHits | Identifiers | Other |
+---------------------+---------------+--------+--------+-------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Limit | 44 | 1 | 0 | count(o) | { AUTOINT32} |
| EagerAggregation | 44 | 1 | 0 | count(o) | |
| Filter(0) | 1900 | 101 | 440565 | m, n, o | (hasLabel(o:LABEL_TYPE_Item) AND Property(o,content(23)) ~= /{ AUTOSTRING31}/) |
| Expand(All)(0) | 4993 | 146855 | 146881 | m, n, o | (m)-[:RELATIONSHIP_observedBy]->(o) |
| Filter(1) | 2 | 26 | 850 | m, n | (hasLabel(m:LABEL_TYPE_PermissionNode) AND any(-_-INNER-_- in Collection(List({ AUTOINT1}, { AUTOINT2}, { AUTOINT3}, { AUTOINT4}, { AUTOINT5}, { AUTOINT6}, { AUTOINT7}, { AUTOINT8}, { AUTOINT9}, { AUTOINT10}, { AUTOINT11}, { AUTOINT12}, { AUTOINT13}, { AUTOINT14}, { AUTOINT15}, { AUTOINT16}, { AUTOINT17}, { AUTOINT18}, { AUTOINT19}, { AUTOINT20}, { AUTOINT21}, { AUTOINT22}, { AUTOINT23}, { AUTOINT24}, { AUTOINT25}, { AUTOINT26}, { AUTOINT27}, { AUTOINT28}, { AUTOINT29}, { AUTOINT30})) where Property(m,id(0)) == -_-INNER-_-)) |
| Expand(All)(1) | 11 | 26 | 27 | m, n | (n)-[:RELATIONSHIP_scopes]->(m) |
| NodeUniqueIndexSeek | 1 | 1 | 1 | n | :LABEL_TYPE_Project(id) |
+---------------------+---------------+--------+--------+-------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Remaining suggestions:
Use MATCH ... WITH x MATCH ...->(x) syntax: this did not help me at all, so far
Use Lucene indexes: see results in "Update 3"
Use precomputation: this will not help me, since the queries are going to be rather variant
Update 3
I've been playing with full-text search, and indexed the content property as follows:
IndexManager indexManager = getGraphDb().index();
Map<String, String> customConfiguration = MapUtil.stringMap(IndexManager.PROVIDER, "lucene", "type", "fulltext");
Index<Node> index = indexManager.forNodes("INDEX_FULL_TEXT_content_Item", customConfiguration);
index.add(node, "content", value);
When I run the following query this takes approx. 1200ms:
START o=node:INDEX_FULL_TEXT_content_Item("content:*some string*")
MATCH (n:LABEL_TYPE_Project {id:14})
-[:RELATIONSHIP_scopes]->(m:LABEL_TYPE_PermissionNode)
-[:RELATIONSHIP_observedBy]->(o:LABEL_TYPE_Item)
WHERE m.id IN [1, 2, 6, 12, 12064, 19614, 19742, 19863, 21453, 21454, 21457, 21657, 21658, 31123, 31127, 31130, 47691, 55603, 55650, 56026, 56028, 56029, 56050, 56052, 85383, 85406, 85615, 105665, 1035242, 1035243]
RETURN count(o);
Here is the PROFILE for this query:
Compiler CYPHER 2.2
Planner COST
EagerAggregation
|
+Filter(0)
|
+Expand(All)(0)
|
+NodeHashJoin
|
+Filter(1)
| |
| +NodeByIndexQuery
|
+Expand(All)(1)
|
+NodeUniqueIndexSeek
+---------------------+---------------+--------+--------+-------------+------------------------------------------------------------------------+
| Operator | EstimatedRows | Rows | DbHits | Identifiers | Other |
+---------------------+---------------+--------+--------+-------------+------------------------------------------------------------------------+
| EagerAggregation | 50 | 1 | 0 | count(o) | |
| Filter(0) | 2533 | 166 | 498 | m, n, o | (Property(n,id(0)) == { AUTOINT0} AND hasLabel(n:LABEL_TYPE_Project)) |
| Expand(All)(0) | 32933 | 166 | 332 | m, n, o | (m)<-[:RELATIONSHIP_scopes]-(n) |
| NodeHashJoin | 32933 | 166 | 0 | m, o | o |
| Filter(1) | 1 | 553 | 553 | o | hasLabel(o:LABEL_TYPE_Item) |
| NodeByIndexQuery | 1 | 553 | 554 | o | Literal(content:*itzndby*); INDEX_FULL_TEXT_content_Item |
| Expand(All)(1) | 64914 | 146855 | 146881 | m, o | (m)-[:RELATIONSHIP_observedBy]->(o) |
| NodeUniqueIndexSeek | 27 | 26 | 30 | m | :LABEL_TYPE_PermissionNode(id) |
+---------------------+---------------+--------+--------+-------------+------------------------------------------------------------------------+
Things to think about/try: in general with query optimization, the #1 name of the game is to figure out ways to consider less data in the first place when answering the query. It's far less fruitful to consider the same data faster than it is to consider less data.
Lucene indexes on your content fields. My understanding is that regex you're doing isn't narrowing cypher's search path any, so it's basically having to look at every o:LABEL_TYPE_Item and run the regex against that field. Your regex is only looking for a substring though, so lucene may help cut down the number of nodes cypher has to consider before it can give you a result.
Your relationship paths are variable length, (-[:RELATIONSHIP_scopes*1]->) yet the image you give us suggests you only ever need one hop. On both relationship hops, depending on how your graph is structured (and how much data you have) you might be looking through way more information than you need to there. Consider those relationship hops and your data model carefully; can you replace with -[:RELATIONSHIP_scopes]-> instead? Note that you have a WHERE clause on m nodes, you may be traversing more of those than required.
Check the query plan (via PROFILE, google for docs). One trick I see a lot of people using is pushing the most restrictive part of their query to the top, in front of a WITH block. This reduces the number of "starting points".
What I mean is taking this query...
MATCH (foo)-[:stuff*]->(bar) // (bunch of other complex stuff)
WHERE bar.id = 5
RETURN foo
And turning it into this:
MATCH bar
WHERE bar.id = 5
WITH bar
MATCH (foo)-[:stuff*]->(bar)
RETURN foo;
(Check output via PROFILE, this trick can be used to force the query execution plan to do the most selective thing first, drastically reducing the amount of the graph that cypher considers/traverses...better performance)
Precompute; if you have a particular set of nodes that you use all the time (those with the IDs you identify) you can create a custom index node of your own. Let's call it (foo:SpecialIndex { label: "My Nifty Index" }). This is akin to a "view" in a relational database. You link the stuff you want to access quickly to foo. Then your query, instead of having that big WHERE id IN [blah blah] clause, it simply looks up foo:SpecialIndex, traverses to the hit points, then goes from there. This trick works well when the list of entry points in your list of IDs is large, rapidly growing, or both. This keeps all the same computation you'd do normally, but shifts some of it to be done ahead of time so you don't do it every time you run the query.
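A minimal sketch of that "SpecialIndex" idea (the node label, relationship type and ids here are made up for illustration):
// create the index node once
CREATE (idx:SpecialIndex { label: "My Nifty Index" });

// link the frequently-used permission nodes to it
MATCH (idx:SpecialIndex { label: "My Nifty Index" }), (m:LABEL_TYPE_PermissionNode)
WHERE m.id IN [1, 2, 6, 12]
CREATE (idx)-[:INDEXES]->(m);

// queries then start from the index node instead of the long IN list
MATCH (idx:SpecialIndex { label: "My Nifty Index" })-[:INDEXES]->(m:LABEL_TYPE_PermissionNode)
      -[:RELATIONSHIP_observedBy]->(o:LABEL_TYPE_Item)
WHERE o.content =~ '.*some string.*'
RETURN o
LIMIT 20;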
Got any supernodes in that graph? (A supernode is an extremely densely connected node, i.e. one with a million outbound relationships) -- don't do that. Try to arrange your data model such that you don't have supernodes, if at all possible.
JVM/Node Cache tweaks. Sometimes you can get an advantage by changing your node caching strategy, or available memory to do the caching. The idea here is that instead of hitting data on disk, if you warm your cache up then you get at least some of the I/O out of the way. This one can help in some cases, but it wouldn't be my first stop unless the way you've configured the JVM or neo4j is already somewhat memory-poor. This one probably also helps you a little less because it tries to make your current access pattern faster, rather than improving your actual access pattern.
can you share your output of :schema in the browser?
if you don't have it do:
create constraint on (p:LABEL_TYPE_Project) assert p.id is unique;
create constraint on (m:LABEL_TYPE_PermissionNode) assert m.id is unique;
The manual indexes you created only help for Item.content if you index it with FULLTEXT_CONFIG and then use START o=node:items("content:(some string)") MATCH ...
As you can always traverse relationships in both directions in Neo4j, you don't need the inverse relationships; they only hurt performance because queries then tend to check one more cycle.
You don't need variable length paths [*1] in your query, change it to:
MATCH (n:LABEL_TYPE_Project {id:'14'})-[:RELATIONSHIP_scopes]->
(m:LABEL_TYPE_PermissionNode)-[:RELATIONSHIP_observedBy]->(o:LABEL_TYPE_Item)
WHERE m.id in ['1', '2', ... '1035242', '1035243']
AND o.content =~ '.*itzndby.*' RETURN o LIMIT 20
For real queries use parameters, e.g. for the project id and permission ids:
MATCH (n:LABEL_TYPE_Project {id: {p_id}})-[:RELATIONSHIP_scopes]->(m:LABEL_TYPE_PermissionNode)-[:RELATIONSHIP_observedBy]->(o:LABEL_TYPE_Item)
WHERE m.id in {m_ids} AND o.content =~ '.*'+{item_content}+'.*'
RETURN o LIMIT 20
Remember that realistic query performance only shows up on a warmed-up system, so run the query at least twice.
You might also want to split up your query:
MATCH (n:LABEL_TYPE_Project {id: {p_id}})-[:RELATIONSHIP_scopes]->(m:LABEL_TYPE_PermissionNode)
WHERE m.id in {m_ids}
WITH distinct m
MATCH (m)-[:RELATIONSHIP_observedBy]->(o:LABEL_TYPE_Item)
WHERE o.content =~ '.*'+{item_content}+'.*'
RETURN o LIMIT 20
Also learn about PROFILE; you can prefix your query with it in the old webadmin: http://localhost:7474/webadmin/#/console/
If you use Neo4j 2.2-M03 there is built in support for query plan visualization with EXPLAIN and PROFILE prefixes.
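For example, just prefix the query (purely illustrative, reusing the parameterized query from above):
PROFILE
MATCH (n:LABEL_TYPE_Project {id: {p_id}})-[:RELATIONSHIP_scopes]->(m:LABEL_TYPE_PermissionNode)
WHERE m.id IN {m_ids}
RETURN count(m)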
