Cypher Neo4j query optimization - neo4j

I use the following Cypher query:
MATCH (v:Value)-[:CONTAINS]->(hv:HistoryValue)
WHERE v.id = {valueId}
OPTIONAL MATCH (hv)-[:CREATED_BY]->(u:User)
WHERE {fetchCreateUsers}
WITH u, hv
ORDER BY hv.createDate DESC
WITH count(hv) as count, ceil(toFloat(count(hv)) / {maxResults}) as step, COLLECT({userId: u.id, historyValueId: hv.id, historyValue: hv.originalValue, historyValueCreateDate: hv.createDate}) AS data
RETURN REDUCE(s = [], i IN RANGE(0, count - 1, CASE step WHEN 0 THEN 1 ELSE step END) | s + data[i]) AS result, step, count
This query works fine and does exactly what I need.
Right now I'm concerned about two possible issues inside of this query from the performance point of view and Cypher best practices.
First of all, as you may see - I two times use the same count(hv) function. Will it cause the problems during the execution from the performance point of view or Cypher and Neo4j are smart enough to optimize it? If no, please show how to fix it.
And the second place is the CASE statement inside range() function? The same question here - will this CASE statement be executed only once or every time for every iteration over my range? Please show how to fix it if needed.
UPDATED
I tried to do a separator WITH for count but the query doesn't return the results(returns empty result)
MATCH (v:Value)-[:CONTAINS]->(hv:HistoryValue)
WHERE v.id = {valueId}
OPTIONAL MATCH (hv)-[:CREATED_BY]->(u:User)
WHERE {fetchCreateUsers}
WITH u, hv ORDER BY hv.createDate DESC
WITH u, hv, count(hv) as count
WITH u, hv, count, ceil(toFloat(count) / {maxResults}) as step, COLLECT({userId: u.id, historyValueId: hv.id, historyValue: hv.originalValue, historyValueCreateDate: hv.createDate}) AS data
RETURN REDUCE(s = [], i IN RANGE(0, count - 1, CASE step WHEN 0 THEN 1 ELSE step END) | s + data[i]) AS result, step, count

1 MATCH (v:Value)-[:CONTAINS]->(hv:HistoryValue)
2 WHERE v.id = {valueId}
3 OPTIONAL MATCH (hv)-[:CREATED_BY]->(u:User)
4 WHERE {fetchCreateUsers}
5 WITH u, hv
6 ORDER BY hv.createDate DESC
7 WITH count(hv) as count, ceil(toFloat(count(hv)) / {maxResults}) as step, COLLECT({userId: u.id, historyValueId: hv.id, historyValue: hv.originalValue, historyValueCreateDate: hv.createDate}) AS data
8 RETURN REDUCE(s = [], i IN RANGE(0, count - 1, CASE step WHEN 0 THEN 1 ELSE step END) | s + data[i]) AS result, step, count
(1) You need to pass hv in line 5, because it's values are collected in line 7. That said, you can still do something like this:
5 WITH u, collect(hv) AS hvs, count(hv) as count
UNWIND hvs AS hv
However, this is not very elegant and probably not worth doing.
(2) You can calculate the CASE expression in line 7:
7 WITH count, data, step, CASE step WHEN 0 THEN 1 ELSE step END AS stepFlag
8 RETURN REDUCE(s = [], i IN RANGE(0, count - 1, stepFlag) | s + data[i]) AS result, step, count

Related

Neo4j - Counting rows with where condition

I'm trying to count the amount of rows that Neo4j will return but the count (or the query) is very slow.
Version 1 (70 sec):
MATCH (person:Person)-[:HAS_ORDER]->(order:Order)
WHERE order.timestamp >= 1632434400 AND size((order)<-[:HAS_ORDER]-(:OrderLine)-[:HAS_PRODUCT]->(:Product)) <= 20
WITH order
MATCH (order)<-[:HAS_ORDER]-(:OrderLine)-[:HAS_PRODUCT]->(product:Product)
RETURN COUNT(product);
Version 2 (68 sec.):
MATCH (person:Person)-[:HAS_ORDER]->(order:Order)
WITH size((order)<-[:HAS_ORDER]-(:OrderLine)-[:HAS_PRODUCT]->(:Product)) AS amount
WHERE order.timestamp >= 1632434400 AND amount <= 20
RETURN SUM(amount)
Using Neo4j 4.4 community with about 800000 orders and about 17000000 order lines.
Is there a more efficient way to count the rows?
These are the indexes:
CREATE INDEX idx_order_torder_id FOR (n:Order) ON (n.order_id);
CREATE INDEX idx_order_timestamp FOR (n:Order) ON (n.timestamp);
CREATE INDEX idx_person_person_id FOR (n:Person) ON (n.person_id);
CREATE INDEX idx_product_product_id FOR (n:Product) ON (n.product_id);
The amount of rows are equal to 4269011.
The EXPLAIN plan:
Please try below, hope it will give faster results
MATCH (person:Person)-[:HAS_ORDER]->(order:Order)
WHERE order.timestamp >= 1632434400
WITH order.order_id AS orderid
MATCH (o:Order { order_id: orderid })<-[:HAS_ORDER]-(:OrderLine)-[:HAS_PRODUCT]->(product:Product)
WITH COUNT(product) as productCount
WHERE productCount <= 20
RETURN productCount;
Because every order line has one product, i can skip the counting of the relation order lines to products:
MATCH (order:Order)
WHERE order.timestamp >= 1632434400
WITH order
MATCH (order)<-[:HAS_ORDER]-(orderLine:OrderLine)
WITH COUNT(orderLine) as productCount
WHERE productCount <= 20
RETURN SUM(productCount);
This query took 0m17.342s
But i managed to snoop some seconds with the following query:
MATCH (order:Order)
WHERE order.timestamp >= 1632434400
WITH order, size((order)<-[:HAS_ORDER]-(:OrderLine)) AS amount
WHERE amount <= 20
RETURN SUM(amount);
This query took 0m15.675s

Finding the "GAP" in node values ? or next?

Let say I have a nodes with values a multiples of 10. I want to find the first GAP in the values.
Here is how I would do it in numpy :
> np.where(np.diff([11,21,31,51,61,71,91]) > 10)[0][0] + 2
> 4 i.e. 41
How would I do this in Cypher... ?
match (n) where n.val % 10 = 1
with n.val
order by val ....???
I'm using RedisGraph.
PS>
if no GAP it should return the next value i.e. biggest + 10, if possible !
I'm not sure if this is the most performant solution, but you can accomplish this using a combination of collect() and list comprehensions:
MATCH (n) WHERE n.val % 10 = 1 WITH n.val AS val ORDER BY n.val // collect ordered vals
WITH collect(val) AS vals // combine vals into array
WITH vals, [idx IN range(0, size(vals) + 1) WHERE vals[idx + 1] - vals[idx] > 10] AS gaps // find first index with diff > 10
RETURN vals[gaps[0]] + 10 // return missing value
To additionally return the next-biggest value if no gaps are found, change the RETURN clause to use a CASE statement:
RETURN CASE size(gaps) WHEN 0 THEN vals[-1] + 10 ELSE vals[gaps[0]] + 10 END

Neo4j Cypher round value

I have the following Cypher:
MATCH (v:Value)-[:CONTAINS]->(hv:HistoryValue)
WHERE v.id = {valueId}
OPTIONAL MATCH (hv)-[:CREATED_BY]->(u:User)
WHERE {fetchCreateUsers}
WITH u, hv ORDER BY hv.createDate DESC
WITH count(hv) as count, count(hv) / {maxResults} as step, COLLECT({userId: u.id, historyValueId: hv.id, historyValue: hv.originalValue, historyValueCreateDate: hv.createDate}) AS data
RETURN REDUCE(s = [], i IN RANGE(0, count - 1, CASE step WHEN 0 THEN 1 ELSE step END) | s + data[i]) AS result, step, count
right now:
count(hv) = 260
{maxResults} = 100
The step variable equals 2 but I expect round(260/100) = 3
I tried the following round(count(hv) / {maxResults}) as step but step is still 2.
How to fix my query in order to get a proper round (3 as step variable in this particular case)?
Use toFloat() in one of the values:
return round(toFloat(260) / 100)
Output:
╒═══════════════════════════╕
│"round(toFloat(260) / 100)"│
╞═══════════════════════════╡
│3 │
└───────────────────────────┘
You're currently doing integer division. If you enter return 260/100 you'll get 2, and that's the value that gets rounded (though there's nothing to round, so you get 2 back).
You need to be working with floating point values. You can do this by having maxResults have an explicit decimal (100.0), or use toFloat() around either maxResults or the count. Both return 260/100.0 and return toFloat(260)/100 or return 260/toFloat(100) will result in 2.6. If you round() that you'll get your expected 3 value.

Cypher: analog of `sort -u` to merge 2 collections?

Suppose I have a node with a collection in a property, say
START x = node(17) SET x.c = [ 4, 6, 2, 3, 7, 9, 11 ];
and somewhere (i.e. from .csv file) I get another collection of values, say
c1 = [ 11, 4, 5, 8, 1, 9 ]
I'm treating my collections as just sets, order of elements does not matter. What I need is to merge x.c with c1 with come magic operation so that resulting x.c will contain only distinct elements from both. The following idea comes to mind (yet untested):
LOAD CSV FROM "file:///tmp/additives.csv" as row
START x=node(TOINT(row[0]))
MATCH c1 = [ elem IN SPLIT(row[1], ':') | TOINT(elem) ]
SET
x.c = [ newxc IN x.c + c1 WHERE (newx IN x.c AND newx IN c1) ];
This won't work, it will give an intersection but not a collection of distinct items.
More RTFM gives another idea: use REDUCE() ? but how?
How to extend Cypher with a new builtin function UNIQUE() which accept collection and return collection, cleaned form duplicates?
UPD. Seems that FILTER() function is something close but intersection again :(
x.c = FILTER( newxc IN x.c + c1 WHERE (newx IN x.c AND newx IN c1) )
WBR,
Andrii
How about something like this...
with [1,2,3] as a1
, [3,4,5] as a2
with a1 + a2 as all
unwind all as a
return collect(distinct a) as unique
Add two collections and return the collection of distinct elements.
dec 15, 2014 - here is an update to my answer...
I started with a node in the neo4j database...
//create a node in the DB with a collection of values on it
create (n:Node {name:"Node 01",values:[4,6,2,3,7,9,11]})
return n
I created a csv sample file with two columns...
Name,Coll
"Node 01","11,4,5,8,1,9"
I created a LOAD CSV statement...
LOAD CSV
WITH HEADERS FROM "file:///c:/Users/db/projects/coll-merge/load_csv_file.csv" as row
// find the matching node
MATCH (x:Node)
WHERE x.name = row.Name
// merge the collections
WITH x.values + split(row.Coll,',') AS combo, x
// process the individual values
UNWIND combo AS value
// use toInt as the values from the csv come in as string
// may be a better way around this but i am a little short on time
WITH toInt(value) AS value, x
// might as well sort 'em so they are all purdy
ORDER BY value
WITH collect(distinct value) AS values, x
SET x.values = values
You could use reduce like this:
with [1,2,3] as a, [3,4,5] as b
return reduce(r = [], x in a + b | case when x in r then r else r + [x] end)
Since Neo4j 3.0, with APOC Procedures you can easily solve this with apoc.coll.union(). In 3.1+ it's a function, and can be used like this:
...
WITH apoc.coll.union(list1, list2) as unionedList
...

Cypher: find a path which takes the maximum valued step each time

I am trying to write a cypher query that finds a path between nodes a and b such that each step has the maximum timestamp value out of all available alternatives that is less than 15.
Here is my query so far, it does everything except for select the maximum possible timestamp at each step. How do I express this condition?
MATCH path=(a:NODE)-[rs:PARENT*]->(b:NODE)
WHERE a.name = 'SOME_VALUE' and b.name = 'SOME_OTHER_VALUE' AND ALL (r IN rs
WHERE r.timestamp < 15)
RETURN path
This is just awful sudo code but I think it expresses what I am looking for
MATCH path=(a:NODE)-[rs:PARENT*]->(b:NODE)
WHERE a.name = 'SOME_VALUE' and b.name = 'SOME_OTHER_VALUE' AND ALL (r IN rs
WHERE r.timestamp < 15 AND r.timestamp = max(allPossibleRsForThisStep))
RETURN path
Can this kind of query be written in cypher?
It won't be fast in cypher, it's possible to compute all maximum values first and then do what you want to do by compare the max value in a list with the current value.
Something like this (not sure if it works)
WITH range(1,10) as max_vals // a list with 10 values (actual values are not that important)
MATCH (a:NODE)-[rs:PARENT*..10]->(b:NODE)
WHERE a.name = 'SOME_VALUE' and b.name = 'SOME_OTHER_VALUE'
WITH a,b,
map(idx in range(0,size(rs)) |
max_vals[idx] = case when max_vals[idx]<rs[idx].timestamp then rs[idx].timestamp else max_vals[idx] end ), max_vals
MATCH path=(a)-[rs:PARENT*..10]->(b)
AND ALL (idx in range(0,size(rs) WHERE rs[idx].timestamp < 15 AND rs[idx].timestamp = max_vals[idx])
RETURN path

Resources