Calculating total path cost in Cypher, taking relationship directionality into account - neo4j
Using a Cypher query on Neo4j, I need to run a BFS over a directed, cyclic graph and sort the target nodes within each depth level.
For the within-depth sorting, a custom "total path cost function" should be used, calculated from:
all relationship attributes r.followrank between start and end node
relationship directionality (followrank if the relationship points towards the end node, or 0 if not)
At any search depth level n, a node connected to a high-ranked node at level n-m, m>0, should be ranked higher than a node connected to a low-ranked node at level n-m. Reverse directionality should result in a rank of 0 (which means the node and its subtree are still part of the ranking).
I'm using neo4j community-1.9.M01. The approach I've taken so far is to extract an array of followranks for the shortest path to each end node.
I thought I had come up with a good first idea for this query, but it seems to break down at multiple points.
My query is:
START strt=node(7)
MATCH p=strt-[*1..]-tgt
WHERE not(tgt=strt)
RETURN ID(tgt), extract(r in rels(p): r.followrank*length(strt-[*0..]-()-[r]->() )) as rank, extract(n in nodes(p): ID(n));
which outputs
==> +-----------------------------------------------------------------+
==> | ID(tgt) | rank | extract(n in nodes(p): ID(n)) |
==> +-----------------------------------------------------------------+
==> | 14 | [1.0] | [7,14] |
==> | 15 | [1.0,1.0] | [7,14,15] |
==> | 11 | [1.0,1.0,1.0] | [7,14,15,11] |
==> | 8 | [1.0,1.0,1.0,1.0,0.0] | [7,14,15,11,7,8] |
==> | 9 | [1.0,1.0,1.0,1.0,0.0] | [7,14,15,11,7,9] |
==> | 10 | [1.0,1.0,1.0,1.0,0.0] | [7,14,15,11,7,10] |
==> | 12 | [1.0,1.0,1.0,0.0] | [7,14,15,11,12] |
==> | 8 | [0.0] | [7,8] |
==> | 9 | [0.0] | [7,9] |
==> | 10 | [0.0] | [7,10] |
==> | 11 | [1.0] | [7,11] |
==> | 15 | [1.0,1.0] | [7,11,15] |
==> | 14 | [1.0,1.0,1.0] | [7,11,15,14] |
==> | 8 | [1.0,1.0,1.0,1.0,0.0] | [7,11,15,14,7,8] |
==> | 9 | [1.0,1.0,1.0,1.0,0.0] | [7,11,15,14,7,9] |
==> | 10 | [1.0,1.0,1.0,1.0,0.0] | [7,11,15,14,7,10] |
==> | 12 | [1.0,0.0] | [7,11,12] |
==> +-----------------------------------------------------------------+
==> 17 rows
==> 38 ms
It looks similar to what I need, but the issues are
Nodes 8, 9, 10 and 11 all have the same relationship direction to node 7, yet 11 gets rank [1.0] while 8, 9 and 10 get [0.0]. The result of the inverse of ...*length(strt-[*0..]-()-[r]->() )... looks even stranger; see the queries right below.
I don't know how to normalize the results of the length() expression to 1.
Directionality:
START strt=node(7)
MATCH strt<-[r]-m
RETURN ID(m), r.followrank;
==> +----------------------+
==> | ID(m) | r.followrank |
==> +----------------------+
==> | 8 | 1 |
==> | 9 | 1 |
==> | 10 | 1 |
==> | 11 | 1 |
==> +----------------------+
==> 4 rows
==> 0 ms
START strt=node(7)
MATCH strt-[r]->m
RETURN ID(m), r.followrank;
==> +----------------------+
==> | ID(m) | r.followrank |
==> +----------------------+
==> | 14 | 1 |
==> +----------------------+
==> 1 row
==> 0 ms
Inverse query:
START strt=node(7)
MATCH p=strt-[*1..]-tgt
WHERE not(tgt=strt)
RETURN ID(tgt), extract(rr in rels(p): rr.followrank*length(strt-[*0..]-()<-[rr]-() )) as rank, extract(n in nodes(p): ID(n));
==> +-----------------------------------------------------------------+
==> | ID(tgt) | rank | extract(n in nodes(p): ID(n)) |
==> +-----------------------------------------------------------------+
==> | 14 | [1.0] | [7,14] |
==> | 15 | [1.0,1.0] | [7,14,15] |
==> | 11 | [1.0,1.0,1.0] | [7,14,15,11] |
==> | 8 | [1.0,1.0,1.0,1.0,3.0] | [7,14,15,11,7,8] |
==> | 9 | [1.0,1.0,1.0,1.0,3.0] | [7,14,15,11,7,9] |
==> | 10 | [1.0,1.0,1.0,1.0,3.0] | [7,14,15,11,7,10] |
==> | 12 | [1.0,1.0,1.0,2.0] | [7,14,15,11,12] |
==> | 8 | [3.0] | [7,8] |
==> | 9 | [3.0] | [7,9] |
==> | 10 | [3.0] | [7,10] |
==> | 11 | [1.0] | [7,11] |
==> | 15 | [1.0,1.0] | [7,11,15] |
==> | 14 | [1.0,1.0,1.0] | [7,11,15,14] |
==> | 8 | [1.0,1.0,1.0,1.0,3.0] | [7,11,15,14,7,8] |
==> | 9 | [1.0,1.0,1.0,1.0,3.0] | [7,11,15,14,7,9] |
==> | 10 | [1.0,1.0,1.0,1.0,3.0] | [7,11,15,14,7,10] |
==> | 12 | [1.0,2.0] | [7,11,12] |
==> +-----------------------------------------------------------------+
==> 17 rows
==> 30 ms
So my questions are:
What's going on with this query?
Is there a working approach?
As an additional detail: I know the min(length(path)) aggregator, but it doesn't work here, because I'm trying to extract information about the best hit. Any additional information I return about the best hit disaggregates the result again. I think that's a Cypher limitation.
Basically, you want a rank that only considers relationships that go "with the path flow". Unfortunately, to test "with path flow", you need to check the path index of each relationship's start and end nodes, and that can only be done with APOC right now.
// allshortestpaths to get all non-cyclic paths
MATCH path=allshortestpaths((a{id:"1"})-[*]-(b{id:"2"}))
// Find rank-worthy relationships: those whose start node comes before their end node in the path
WITH path, filter(rl in relationships(path) WHERE apoc.coll.indexOf(nodes(path), startnode(rl))<apoc.coll.indexOf(nodes(path), endnode(rl))) as comply
// Sum the followrank of the complying relationships and sort
RETURN path, REDUCE(rk = 0, rl in comply | rk+rl.followrank) as rank
ORDER BY rank DESC
(I can't test the APOC part. apoc.coll.indexOf works on a list, so nodes(path) is passed here rather than path itself.)
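To make the "with the path flow" idea concrete outside of Cypher, here is a minimal Python sketch. It is not the APOC query itself: instead of looking up node indexes, it compares each hop with the direction in which the relationship is stored, which gives the same answer on non-cyclic paths. The function name path_rank and the tuple layout are made up for illustration; the sample data reuses the relationships around node 7 from the question (7->14, 8->7, 11->7, all with followrank 1).

```python
from typing import List, Tuple

# (start_node, end_node, followrank), stored in the relationship's own direction.
Rel = Tuple[int, int, float]


def path_rank(path_nodes: List[int], path_rels: List[Rel]) -> float:
    """Sum followrank over the relationships that point along the traversal order.

    Hop i connects path_nodes[i] and path_nodes[i + 1]. A relationship counts
    only if it is stored as path_nodes[i] -> path_nodes[i + 1]; a relationship
    pointing the other way contributes 0.
    """
    total = 0.0
    for i, (start, end, followrank) in enumerate(path_rels):
        if start == path_nodes[i] and end == path_nodes[i + 1]:
            total += followrank
    return total


# Relationships taken from the directionality queries in the question.
print(path_rank([7, 14], [(7, 14, 1.0)]))  # relationship points away from 7
print(path_rank([7, 8], [(8, 7, 1.0)]))    # relationship points towards 7
print(path_rank([7, 11], [(11, 7, 1.0)]))  # same direction as 8, so same rank
```

With this rule, nodes 8 and 11 get the same rank when reached from 7, which is what the question expected from the original query.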
Related
Neo4j Cypher: How to optimize a NOT EXISTS Query when cardinality is high
The query below takes over 1 second and consumes about 7 MB when the cardinality between users and posts is about 8000 (one user views about 8000 posts). It is difficult to scale this because latency and memory consumption are high and grow linearly. Is there a way to model this differently and/or optimise the query?
Query
PROFILE MATCH (u:User)-[:CREATED]->(p:Post) WHERE NOT (:User{ID: 2})-[:VIEWED]->(p) RETURN p.ID
Plan
+-----------------------------------------------------------------------------------------------------------+
| Plan | Statement | Version | Planner | Runtime | Time | DbHits | Rows | Memory (Bytes) |
+-----------------------------------------------------------------------------------------------------------+
| "PROFILE" | "READ_ONLY" | "CYPHER 4.1" | "COST" | "INTERPRETED" | 1033 | 3721750 | 10 | 6696240 |
+-----------------------------------------------------------------------------------------------------------+
+------------------------------+-----------------------------------------------+----------------+------+---------+-----------+----------------+----------------+
| Operator | Details | Estimated Rows | Rows | DB Hits | Cache H/M | Memory (Bytes) | Ordered by |
+------------------------------+-----------------------------------------------+----------------+------+---------+-----------+----------------+----------------+
| +ProduceResults#neo4j | `p.ID` | 2158 | 10 | 0 | 0/0 | | |
| +Projection#neo4j | p.ID AS `p.ID` | 2158 | 10 | 10 | 0/0 | | |
| +Filter#neo4j | u:User | 2158 | 10 | 10 | 0/0 | | |
| +Expand(All)#neo4j | (p)<-[anon_15:CREATED]-(u) | 2158 | 10 | 20 | 0/0 | | |
| +AntiSemiApply#neo4j | | 2158 | 10 | 0 | 0/0 | | |
| | +Expand(Into)#neo4j | (anon_47)-[anon_61:VIEWED]->(p) | 233 | 0 | 3695819 | 0/0 | 6696240 | anon_47.ID ASC |
| | +NodeUniqueIndexSeek#neo4j | UNIQUE anon_47:User(ID) WHERE ID = $autoint_0 | 8630 | 8630 | 17260 | 0/0 | | anon_47.ID ASC |
| +NodeByLabelScan#neo4j | p:Post | 8630 | 8630 | 8631 | 0/0 | | |
+------------------------------+-----------------------------------------------+----------------+------+---------+-----------+----------------+----------------+
Yes, this can be improved. First, let's understand what the query is doing.
It starts with a NodeByLabelScan. That makes sense; there's no avoiding that. But then, for every node of the label (the following executes PER ROW!), it matches to user 2 and expands all :VIEWED relationships from user 2 to see if any of them is the post for that particular row.
Can you see why this is inefficient? There are 8630 post nodes according to the PROFILE plan, so user 2 is looked up by index 8630 times, and their :VIEWED relationships are expanded 8630 times, once per :Post node.
Instead, try this:
MATCH (:User{ID: 2})-[:VIEWED]->(viewedPost)
WITH collect(viewedPost) as viewedPosts
MATCH (:User)-[:CREATED]->(p:Post)
WHERE NOT p IN viewedPosts
RETURN p.ID
This changes things up a bit. First it matches user 2's viewed posts (the lookup and expansion are performed only once), then those viewed posts are collected. Then it does a label scan and filters so that only posts not in the collection of viewed posts remain.
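The difference between the two shapes can be sketched in plain Python. This is only an analogy for the reasoning above, not how the Cypher runtime is implemented: the function names and the dictionary layout are hypothetical, and the use of a Python set for the collected posts is my own choice for the sketch.

```python
from typing import Dict, List


def unviewed_per_row(posts: List[int], viewed: Dict[int, List[int]], user_id: int) -> List[int]:
    """Mimics the original plan: the user's viewed posts are re-expanded per post row."""
    result = []
    for post in posts:
        # One lookup of the user plus one scan over all viewed posts, for every post.
        viewed_by_user = viewed[user_id]
        if not any(v == post for v in viewed_by_user):
            result.append(post)
    return result


def unviewed_collect_once(posts: List[int], viewed: Dict[int, List[int]], user_id: int) -> List[int]:
    """Mimics the rewritten query: collect the viewed posts once, then filter."""
    viewed_posts = set(viewed[user_id])  # looked up and collected a single time
    return [post for post in posts if post not in viewed_posts]


# Hypothetical data: user 2 has viewed every post except 3 and 7.
all_posts = list(range(10))
viewed = {2: [p for p in all_posts if p not in (3, 7)]}
print(unviewed_per_row(all_posts, viewed, 2))
print(unviewed_collect_once(all_posts, viewed, 2))
```

Both functions return the same posts; the second one simply avoids repeating the lookup and expansion for every row.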
Neo4j feature or bug with search query?
I recently updated Neo4j from 2.1.7 to 2.2.5. I found that the query
Match (c:C) where id(c) = 111 with c Match (p:I{id: c.id}) return count(p)
worked fine in 2.1.7, but performs very poorly in 2.2.5 (it takes about 100 times longer). I have all the indexes that are needed.
I modified the query to
Match (c:C) where id(c) = 111 with c.id as c_id Match (p:I{id: c_id}) return count(p)
and after this change it works fine in 2.2.5.
These two queries have different profiles, but I'm not very experienced with profiling.
UPDATED
One more strange thing: if I use EXPLAIN instead of PROFILE, the query runs fast.
neo4j-sh (?)$ PROFILE Match (c:C) where id(c) = 10563822 with c Match (i:I{id: c.id}) return count(i);
==> +----------+
==> | count(i) |
==> +----------+
==> | 4551 |
==> +----------+
==> 1 row
==> 18257 ms
==>
==> Compiler CYPHER 2.2
==>
==> Planner COST
==>
==> EagerAggregation
==> |
==> +Filter(0)
==> |
==> +CartesianProduct
==> |
==> +Filter(1)
==> | |
==> | +NodeByIdSeek
==> |
==> +NodeByLabelScan
==>
==> +------------------+---------------+---------+---------+-------------+-------------------------+
==> | Operator | EstimatedRows | Rows | DbHits | Identifiers | Other |
==> +------------------+---------------+---------+---------+-------------+-------------------------+
==> | EagerAggregation | 26 | 1 | 0 | count(i) | |
==> | Filter(0) | 652 | 4551 | 2522988 | c, i | i.id == c.id |
==> | CartesianProduct | 6521 | 1261494 | 0 | c, i | |
==> | Filter(1) | 0 | 1 | 1 | c | c:C |
==> | NodeByIdSeek | 1 | 1 | 1 | c | |
==> | NodeByLabelScan | 1261494 | 1261494 | 1261495 | i | :I |
==> +------------------+---------------+---------+---------+-------------+-------------------------+
==>
==> Total database accesses: 3784485
sh (?)$ PROFILE Match (c:C) where id(c) = 10563822 with c.id as c_id Match (i:I{id: c_id}) return count(i);
==> +----------+
==> | count(i) |
==> +----------+
==> | 4551 |
==> +----------+
==> 1 row
==> 64 ms
==>
==> Compiler CYPHER 2.2
==>
==> Planner COST
==>
==> EagerAggregation
==> |
==> +Apply
==> |
==> +Projection
==> | |
==> | +Filter
==> | |
==> | +NodeByIdSeek
==> |
==> +NodeIndexSeek
==>
==> +------------------+---------------+------+--------+-------------+---------------------+
==> | Operator | EstimatedRows | Rows | DbHits | Identifiers | Other |
==> +------------------+---------------+------+--------+-------------+---------------------+
==> | EagerAggregation | 1 | 1 | 0 | count(i) | |
==> | Apply | 1 | 4551 | 0 | c, c_id, i | |
==> | Projection | 0 | 1 | 1 | c, c_id | c.id |
==> | Filter | 0 | 1 | 1 | c | c:C |
==> | NodeByIdSeek | 1 | 1 | 1 | c | |
==> | NodeIndexSeek | 1 | 4551 | 4552 | i | :I(id) |
==> +------------------+---------------+------+--------+-------------+---------------------+
==>
==> Total database accesses: 4555
I don't have enough knowledge of neo4j internals to know why your query is slower (the CartesianProduct step seems a red flag) in more recent versions, but here is a logically equivalent query that seems like it should be much faster:
START c = node(111) MATCH (p:I { id: c.id }) RETURN count(p)
Here is the profile:
+------------------+------+--------+----------------------------------------------------------+-----------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+------------------+------+--------+----------------------------------------------------------+-----------------------+
| ColumnFilter | 1 | 0 | count(p) | keep columns count(p) |
| EagerAggregation | 1 | 0 | INTERNAL_AGGREGATE51b25e53-027d-439b-9046-c1a2a6b0fe70 | |
| Filter | 0 | 0 | c, p | p.id == c.id |
| NodeById | 0 | 0 | c, p | Literal(List(111)) |
| NodeByLabel | 0 | 1 | p | :I |
+------------------+------+--------+----------------------------------------------------------+-----------------------+
NOTE: This should be considered a temporary workaround, as START has been deprecated, and I do not know how long this kind of usage will continue to be supported.
Cypher syntax clarification. Multiple MATCH clauses vs using a comma [duplicate]
Are these two Cypher statements identical?
//first
match (a)-[r]->(b),b-[r2]->c
//second
match (a)-[r]->(b) match b-[r2]->c
The 2 Cypher statements are NOT identical. We can show this by using the PROFILE command, which shows you how the Cypher engine would perform a query.
In the following examples, the queries all end with RETURN a, c, since you cannot have a bare MATCH clause.
As you can see, the first query has a NOT(r == r2) filter that the second query does not. This is because Cypher makes sure that the result of a single MATCH clause does not contain duplicate relationships.
First query
profile match (a)-[r]->(b),b-[r2]->c return a,c;
==> +-----------------------------------------------+
==> | a | c |
==> +-----------------------------------------------+
==> | Node[1]{name:"World"} | Node[0]{name:"World"} |
==> +-----------------------------------------------+
==> 1 row
==> 2 ms
==>
==> Compiler CYPHER 2.3
==>
==> Planner COST
==>
==> Runtime INTERPRETED
==>
==> Projection
==> |
==> +Filter
==> |
==> +Expand(All)(0)
==> |
==> +Expand(All)(1)
==> |
==> +AllNodesScan
==>
==> +----------------+---------------+------+--------+----------------+----------------+
==> | Operator | EstimatedRows | Rows | DbHits | Identifiers | Other |
==> +----------------+---------------+------+--------+----------------+----------------+
==> | Projection | 1 | 1 | 0 | a, b, c, r, r2 | a; c |
==> | Filter | 1 | 1 | 0 | a, b, c, r, r2 | NOT(r == r2) |
==> | Expand(All)(0) | 1 | 2 | 4 | a, b, c, r, r2 | (b)-[r2:]->(c) |
==> | Expand(All)(1) | 2 | 2 | 8 | a, b, r | (b)<-[r:]-(a) |
==> | AllNodesScan | 6 | 6 | 7 | b | |
==> +----------------+---------------+------+--------+----------------+----------------+
Second query
profile match (a)-[r]->(b) match b-[r2]->c return a,c;
==> +-----------------------------------------------+
==> | a | c |
==> +-----------------------------------------------+
==> | Node[1]{name:"World"} | Node[1]{name:"World"} |
==> | Node[1]{name:"World"} | Node[0]{name:"World"} |
==> +-----------------------------------------------+
==> 2 rows
==> 2 ms
==>
==> Compiler CYPHER 2.3
==>
==> Planner COST
==>
==> Runtime INTERPRETED
==>
==> Projection
==> |
==> +Expand(All)(0)
==> |
==> +Expand(All)(1)
==> |
==> +AllNodesScan
==>
==> +----------------+---------------+------+--------+----------------+----------------+
==> | Operator | EstimatedRows | Rows | DbHits | Identifiers | Other |
==> +----------------+---------------+------+--------+----------------+----------------+
==> | Projection | 1 | 2 | 0 | a, b, c, r, r2 | a; c |
==> | Expand(All)(0) | 1 | 2 | 4 | a, b, c, r, r2 | (b)-[r2:]->(c) |
==> | Expand(All)(1) | 2 | 2 | 8 | a, b, r | (b)<-[r:]-(a) |
==> | AllNodesScan | 6 | 6 | 7 | b | |
==> +----------------+---------------+------+--------+----------------+----------------+
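The effect of the NOT(r == r2) filter can be reproduced with a small Python sketch. The graph here is an assumption: the answer does not show the data, so I picked one graph that is consistent with both result sets (a self-loop on node 1 plus a relationship from node 1 to node 0). The function two_hop is hypothetical and only models the matching rule, not the planner.

```python
from itertools import product
from typing import Dict, List, Tuple

# Assumed graph: relationship id -> (start node, end node).
# Relationship 0 is a self-loop on node 1; relationship 1 goes from node 1 to node 0.
RELS: Dict[int, Tuple[int, int]] = {0: (1, 1), 1: (1, 0)}


def two_hop(enforce_unique_rels: bool) -> List[Tuple[int, int]]:
    """Return (a, c) for every a-[r]->b, b-[r2]->c combination.

    enforce_unique_rels=True models one MATCH clause with a comma, where a
    relationship may be bound only once; False models two separate MATCH clauses.
    """
    rows = []
    for (r, (a, b)), (r2, (b2, c)) in product(RELS.items(), repeat=2):
        if b != b2:
            continue  # the two patterns must share node b
        if enforce_unique_rels and r == r2:
            continue  # the NOT(r == r2) filter from the first profile
        rows.append((a, c))
    return rows


print(two_hop(enforce_unique_rels=True))   # single MATCH with a comma
print(two_hop(enforce_unique_rels=False))  # two MATCH clauses
```

Under this assumed graph the extra row in the second query comes from binding the self-loop to both r and r2, which a single MATCH clause does not allow.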
Cypher: getting relationships between the same nodes, but some relationships are lost
Data set:
neo4j-sh (?)$ START n = node(*) MATCH n-[r]-m RETURN n,r,m;
==> +---------------------------------------------+
==> | n | r | m |
==> +---------------------------------------------+
==> | Node[1]{} | (2)-[1:KNOWS]->(1) | Node[2]{} |
==> | Node[1]{} | (3)-[2:KNOWS]->(1) | Node[3]{} |
==> | Node[2]{} | (2)-[1:KNOWS]->(1) | Node[1]{} |
==> | Node[2]{} | (3)-[0:KNOWS]->(2) | Node[3]{} |
==> | Node[3]{} | (3)-[0:KNOWS]->(2) | Node[2]{} |
==> | Node[3]{} | (3)-[2:KNOWS]->(1) | Node[1]{} |
==> +---------------------------------------------+
==> 6 rows
==>
==> 0 ms
Cypher query:
neo4j-sh (0)$ start x=node(1,2,3),y=node(1,2,3) match x-[r]-y return id(x),id(y) order by id(x) desc;
==> +---------------+
==> | id(x) | id(y) |
==> +---------------+
==> | 1 | 2 |
==> | 1 | 3 |
==> | 2 | 1 |
==> | 3 | 1 |
==> +---------------+
==> 4 rows
In fact, nodes 2 and 3 are linked, so why is that pair not returned, and how can I get it returned? Thanks.
URL: http://console.neo4j.org/?id=qwdh4p
Query unique pairs of nodes when pair order is not important in Cypher
I am trying to compare users according to their common interests in this graph. I know why the following query produces duplicate pairs, but I can't think of a good way to avoid it in Cypher. Is there any way to do it without looping in Cypher?
neo4j-sh (?)$ start n=node(*) match p=n-[:LIKES]->item<-[:LIKES]-other where n <> other return n.name,other.name,collect(item.name) as common, count(*) as freq order by freq desc;
==> +-----------------------------------------------+
==> | n.name | other.name | common | freq |
==> +-----------------------------------------------+
==> | "u1" | "u2" | ["f1","f2","f3"] | 3 |
==> | "u2" | "u1" | ["f1","f2","f3"] | 3 |
==> | "u1" | "u3" | ["f1","f2"] | 2 |
==> | "u3" | "u2" | ["f1","f2"] | 2 |
==> | "u2" | "u3" | ["f1","f2"] | 2 |
==> | "u3" | "u1" | ["f1","f2"] | 2 |
==> | "u4" | "u3" | ["f1"] | 1 |
==> | "u4" | "u2" | ["f1"] | 1 |
==> | "u4" | "u1" | ["f1"] | 1 |
==> | "u2" | "u4" | ["f1"] | 1 |
==> | "u1" | "u4" | ["f1"] | 1 |
==> | "u3" | "u4" | ["f1"] | 1 |
==> +-----------------------------------------------+
In order to avoid having duplicates in the form of a--b and b--a, you can exclude one of the combinations in your WHERE clause with
WHERE ID(a) < ID(b)
making your above query
start n=node(*) match p=n-[:LIKES]->item<-[:LIKES]-other where ID(n) < ID(other) return n.name,other.name,collect(item.name) as common, count(*) as freq order by freq desc;
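Why the ID comparison removes exactly one row of each mirrored pair can be seen in a short Python sketch. The node IDs 1 to 4 for u1 to u4 are an assumption (the question does not show them), and the LIKES sets are reconstructed from the result table in the question.

```python
from itertools import permutations
from typing import Dict, List, Set, Tuple

# Assumed node IDs 1..4 for users u1..u4; interests taken from the question's output.
LIKES: Dict[int, Set[str]] = {
    1: {"f1", "f2", "f3"},
    2: {"f1", "f2", "f3"},
    3: {"f1", "f2"},
    4: {"f1"},
}


def common_interests(dedupe: bool) -> List[Tuple[int, int, List[str]]]:
    """List (n, other, common items) for every ordered pair sharing an item.

    dedupe=True applies the ID(n) < ID(other) condition, which keeps one
    ordering of each pair; dedupe=False applies only n <> other.
    """
    rows = []
    for n, other in permutations(LIKES, 2):  # permutations already ensures n != other
        if dedupe and not n < other:
            continue
        common = LIKES[n] & LIKES[other]
        if common:
            rows.append((n, other, sorted(common)))
    return rows


print(len(common_interests(dedupe=False)))  # every pair appears in both orders
print(len(common_interests(dedupe=True)))   # one row per unordered pair
```

Any strict ordering on a unique node property would work the same way; the internal ID is just the one that is always available.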
OK, I see that you use (*) as a start point, which means looping through the whole graph and using each node as a start point. So the rows are different, not duplicates as you say:
+-----------------------------------------------+
| n.name | other.name | common | freq |
+-----------------------------------------------+
| "u2" | "u1" | ["f1","f2","f3"] | 3 |
is not equal to:
+-----------------------------------------------+
| n.name | other.name | common | freq |
+-----------------------------------------------+
| "u1" | "u2" | ["f1","f2","f3"] | 3 |
So if you use an index and set a single start point, there won't be any duplicates:
start n=node:someIndex(name='C') match p=n-[:LIKES]->item<-[:LIKES]-other where n <> other return n.name,other.name,collect(item.name) as common, count(*) as freq order by freq desc;