What is the difference between these two Cypher queries? - neo4j
I'm a bit stumped.
In my database, I have a relationship like this:
(u:User)-[r1:LISTENS_TO]->(a:Artist)<-[r2:LISTENS_TO]-(u2:User)
I want to perform a query where for a given user, I find the common artists between that user and every other user.
To give an idea of size of my database, I have about 600 users, 47,546 artists, and 184,211 relationships between users and artists.
The first query I was trying was the following:
START me=node(553314), other=node:userLocations("withinDistance:[38.89037,-77.03196,80.467]")
OPTIONAL MATCH
pMutualArtists=(me:User)-[ar1:LISTENS_TO]->(a:Artist)<-[ar2:LISTENS_TO]-(other:User)
WHERE
other:User
WITH other, COUNT(DISTINCT pMutualArtists) AS mutualArtists
ORDER BY mutualArtists DESC
LIMIT 10
RETURN other.username, mutualArtists
This was taking around 20 seconds to return. The profile for this query is as follows:
+----------------------+-------+--------+------------------------+------------------------------------------------------------------------------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+----------------------+-------+--------+------------------------+------------------------------------------------------------------------------------------------+
| ColumnFilter(0) | 10 | 0 | | keep columns other.username, mutualArtists |
| Extract | 10 | 20 | | other.username |
| ColumnFilter(1) | 10 | 0 | | keep columns other, mutualArtists |
| Top | 10 | 0 | | { AUTOINT0}; Cached( INTERNAL_AGGREGATEb6facb18-1c5d-45a6-83bf-a75c25ba6baf of type Integer) |
| EagerAggregation | 563 | 0 | | other |
| OptionalMatch | 52806 | 0 | | |
| Eager(0) | 563 | 0 | | |
| NodeByIndexQuery(1) | 563 | 564 | other, other | Literal(withinDistance:[38.89037,-77.03196,80.467]); userLocations |
| NodeById(1) | 1 | 1 | me, me | Literal(List(553314)) |
| Eager(1) | 82 | 0 | | |
| ExtractPath | 82 | 0 | pMutualArtists | |
| Filter(0) | 82 | 82 | | (hasLabel(a:Artist(1)) AND NOT(ar1 == ar2)) |
| SimplePatternMatcher | 82 | 82 | a, me, ar2, ar1, other | |
| Filter(1) | 1 | 3 | | ((hasLabel(me:User(3)) AND hasLabel(other:User(3))) AND hasLabel(other:User(3))) |
| NodeByIndexQuery(1) | 563 | 564 | other, other | Literal(withinDistance:[38.89037,-77.03196,80.467]); userLocations |
| NodeById(1) | 1 | 1 | me, me | Literal(List(553314)) |
+----------------------+-------+--------+------------------------+------------------------------------------------------------------------------------------------+
I was frustrated. It didn't seem like this should take 20 seconds.
I came back to the problem later on, and tried debugging it from the start.
I started to break down the query, and I noticed I was getting much faster results. Without the Neo4j Spatial query, I was getting results in about 1.5 seconds.
I finally added things back, and ended up with the following query:
START u=node(553314), u2=node:userLocations("withinDistance:[38.89037,-77.03196,80.467]")
OPTIONAL MATCH
pMutualArtists=(u:User)-[ar1:LISTENS_TO]->(a:Artist)<-[ar2:LISTENS_TO]-(u2:User)
WHERE
u2:User
WITH u2, COUNT(DISTINCT pMutualArtists) AS mutualArtists
ORDER BY mutualArtists DESC
LIMIT 10
RETURN u2.username, mutualArtists
This query returns in 4240 ms. A 5X improvement! The profile for this query is as follows:
+----------------------+-------+--------+--------------------+------------------------------------------------------------------------------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+----------------------+-------+--------+--------------------+------------------------------------------------------------------------------------------------+
| ColumnFilter(0) | 10 | 0 | | keep columns u2.username, mutualArtists |
| Extract | 10 | 20 | | u2.username |
| ColumnFilter(1) | 10 | 0 | | keep columns u2, mutualArtists |
| Top | 10 | 0 | | { AUTOINT0}; Cached( INTERNAL_AGGREGATEbdf86ac1-8677-4d45-967f-c2dd594aba49 of type Integer) |
| EagerAggregation | 563 | 0 | | u2 |
| OptionalMatch | 52806 | 0 | | |
| Eager(0) | 563 | 0 | | |
| NodeByIndexQuery(1) | 563 | 564 | u2, u2 | Literal(withinDistance:[38.89037,-77.03196,80.467]); userLocations |
| NodeById(1) | 1 | 1 | u, u | Literal(List(553314)) |
| Eager(1) | 82 | 0 | | |
| ExtractPath | 82 | 0 | pMutualArtists | |
| Filter(0) | 82 | 82 | | (hasLabel(a:Artist(1)) AND NOT(ar1 == ar2)) |
| SimplePatternMatcher | 82 | 82 | a, u2, u, ar2, ar1 | |
| Filter(1) | 1 | 3 | | ((hasLabel(u:User(3)) AND hasLabel(u2:User(3))) AND hasLabel(u2:User(3))) |
| NodeByIndexQuery(1) | 563 | 564 | u2, u2 | Literal(withinDistance:[38.89037,-77.03196,80.467]); userLocations |
| NodeById(1) | 1 | 1 | u, u | Literal(List(553314)) |
+----------------------+-------+--------+--------------------+------------------------------------------------------------------------------------------------+
And, to prove that I ran them both in a row and got very different results:
neo4j-sh (?)$ START u=node(553314), u2=node:userLocations("withinDistance:[38.89037,-77.03196,80.467]")
>
> OPTIONAL MATCH
> pMutualArtists=(u:User)-[ar1:LISTENS_TO]->(a:Artist)<-[ar2:LISTENS_TO]-(u2:User)
> WHERE
> u2:User
>
> WITH u2, COUNT(DISTINCT pMutualArtists) AS mutualArtists
> ORDER BY mutualArtists DESC
> LIMIT 10
> RETURN u2.username, mutualArtists
> ;
+------------------------------+
| u2.username | mutualArtists |
+------------------------------+
| "573904765" | 644 |
| "28600291" | 601 |
| "1092510304" | 558 |
| "1367963461" | 521 |
| "1508790199" | 455 |
| "1335360028" | 447 |
| "18200866" | 444 |
| "1229430376" | 435 |
| "748318333" | 434 |
| "5612902" | 431 |
+------------------------------+
10 rows
4240 ms
neo4j-sh (?)$ START me=node(553314), other=node:userLocations("withinDistance:[38.89037,-77.03196,80.467]")
>
> OPTIONAL MATCH
> pMutualArtists=(me:User)-[ar1:LISTENS_TO]->(a:Artist)<-[ar2:LISTENS_TO]-(other:User)
> WHERE
> other:User
>
> WITH other, COUNT(DISTINCT pMutualArtists) AS mutualArtists
> ORDER BY mutualArtists DESC
> LIMIT 10
> RETURN other.username, mutualArtists;
+--------------------------------+
| other.username | mutualArtists |
+--------------------------------+
| "573904765" | 644 |
| "28600291" | 601 |
| "1092510304" | 558 |
| "1367963461" | 521 |
| "1508790199" | 455 |
| "1335360028" | 447 |
| "18200866" | 444 |
| "1229430376" | 435 |
| "748318333" | 434 |
| "5612902" | 431 |
+--------------------------------+
10 rows
20418 ms
Unless I have gone crazy, the only difference between these two queries is the names of the nodes (I've changed "me" to "u" and "other" to "u2").
Why does that cause a 5X improvement??!?!
If anyone has any insight into this, I would be eternally grateful.
Thanks,
-Adam
EDIT 8.1.14
Based on @ulkas's suggestion, I tried simplifying the query.
The results were:
START u=node(553314), u2=node:userLocations("withinDistance:[38.89037,-77.03196,80.467]")
OPTIONAL MATCH pMutualArtists=(u:User)-[ar1:LISTENS_TO]->(a:Artist)<-[ar2:LISTENS_TO]-(u2:User)
RETURN u2.username, COUNT(DISTINCT pMutualArtists) as mutualArtists
ORDER BY mutualArtists DESC
LIMIT 10
~4 seconds
START me=node(553314), other=node:userLocations("withinDistance:[38.89037,-77.03196,80.467]")
OPTIONAL MATCH pMutualArtists=(me:User)-[ar1:LISTENS_TO]->(a:Artist)<-[ar2:LISTENS_TO]-(other:User)
RETURN other.username, COUNT(DISTINCT pMutualArtists) as mutualArtists
ORDER BY mutualArtists DESC
LIMIT 10
~20 seconds
So bizarre. It seems as though literally the named nodes of "other" and "me" cause the query time to jump tremendously. I'm very confused.
Thanks,
-Adam
That sounds like you're seeing the effect of caching. Upon the first access the cache is not populated. Subsequent queries hitting the same graph will be much faster since the nodes/relationships are already available in the cache.
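One way to check the caching explanation is to run both queries several times in alternating order and compare the later rounds rather than the first. A minimal sketch in Python, assuming a hypothetical `run` callable that executes a Cypher string against the database (it is a placeholder, not part of any Neo4j driver API):

```python
import time
from typing import Callable, Dict, List


def time_alternating(
    queries: Dict[str, str],
    run: Callable[[str], object],
    rounds: int = 3,
) -> Dict[str, List[float]]:
    """Run every query once per round, in the same order each round,
    and record wall-clock seconds per query label.

    `run` is a hypothetical stand-in for whatever executes a query
    (shell, REST call, driver session); it is an assumption here.
    """
    timings: Dict[str, List[float]] = {label: [] for label in queries}
    for _ in range(rounds):
        for label, text in queries.items():
            started = time.perf_counter()
            run(text)
            timings[label].append(time.perf_counter() - started)
    return timings
```

If the gap is only a cache warm-up effect, the first round should be slow for whichever query runs first and the later rounds should be close to each other. If the gap persists across rounds, caching alone does not explain it.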
Using OPTIONAL MATCH followed by WHERE other:User makes no sense, since the end node other (u2) must be matched. Try running the queries without OPTIONAL MATCH, without the WHERE clause, and without the final WITH; simply:
START me=node(553314), other=node:userLocations("withinDistance:[38.89037,-77.03196,80.467]")
MATCH
pMutualArtists=(me:User)-[ar1:LISTENS_TO]->(a:Artist)<-[ar2:LISTENS_TO]-(other:User)
RETURN other.username, count(DISTINCT pMutualArtists) as mutualArtists
ORDER BY mutualArtists DESC
LIMIT 10
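One difference to keep in mind when dropping OPTIONAL MATCH: a plain MATCH omits users who share no artists, whereas OPTIONAL MATCH keeps them with a count of 0. A small Python sketch of that difference, using made-up listening data (the user and artist names are illustrative, not taken from the question):

```python
from typing import Dict, List, Set, Tuple


def mutual_counts(
    me: str,
    others: List[str],
    listens_to: Dict[str, Set[str]],
    optional: bool,
) -> List[Tuple[str, int]]:
    """Count artists shared between `me` and each candidate user.

    optional=True mimics OPTIONAL MATCH (users with no shared artists
    stay in the result with a count of 0); optional=False mimics a
    plain MATCH (those users are dropped).
    """
    mine = listens_to.get(me, set())
    rows = []
    for other in others:
        shared = len(mine & listens_to.get(other, set()))
        if shared or optional:
            rows.append((other, shared))
    # Equivalent of ORDER BY mutualArtists DESC LIMIT 10.
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:10]


# Hypothetical sample data.
listens_to = {
    "me": {"a1", "a2", "a3"},
    "u2": {"a1", "a2"},
    "u3": {"a9"},
}
with_optional = mutual_counts("me", ["u2", "u3"], listens_to, optional=True)
without_optional = mutual_counts("me", ["u2", "u3"], listens_to, optional=False)
```

With a LIMIT 10 on a descending count, the two forms only differ when fewer than ten nearby users share an artist with the starting user.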
Related
MYSQL joining the sum of matching fields
I record eftpos payments that are paid as a group at the end of each day, but am having trouble matching individual payments to the daily total.
Payments table:
| id | paymentjobno | paymentamount | paymentdate | paymenttype |
| 1 | 1000 | 10 | 01/01/2000 | 2 |
| 2 | 1001 | 15 | 01/01/2000 | 2 |
| 3 | 1002 | 18 | 01/01/2000 | 2 |
| 4 | 1003 | 10 | 01/01/2000 | 1 |
| 5 | 1004 | 127 | 02/01/2000 | 2 |
I want to return something like this so I can match it to $43 transactions on the following day and record payment ID numbers against the transaction:
| id | paymentjobno | paymentamount | paymentdate | paymenttype | daytotal |
| 1 | 1000 | 10 | 01/01/2000 | 2 | 43 |
| 2 | 1001 | 15 | 01/01/2000 | 2 | 43 |
| 3 | 1002 | 18 | 01/01/2000 | 2 | 43 |
Below is my current attempt, but I only get one returned row per day even if there are multiple payments, and the daytotal is the same for every returned result, which is also not the value I was expecting. What am I doing wrong?
SELECT id, paymentjobno, paymentamount, paymentdate, paymenttype, t.daytotal
FROM payments
LEFT JOIN (
SELECT SUM(paymentamount) AS daytotal
FROM payments
GROUP BY paymentdate) t ON id = payments.id
WHERE paymenttype = 2 AND paymentdate $dateclause
GROUP BY payments.paymentdate
Neo4j Cypher: How to optimize a NOT EXISTS Query when cardinality is high
The query below takes over 1 second and consumes about 7 MB when the cardinality between users and posts is about 8000 (one user views about 8000 posts). It is difficult to scale this due to high and linearly growing latencies and memory consumption. Is there a possibility to model this differently and/or optimise the query?
Query
PROFILE MATCH (u:User)-[:CREATED]->(p:Post) WHERE NOT (:User{ID: 2})-[:VIEWED]->(p) RETURN p.ID
Plan
| Plan | Statement | Version | Planner | Runtime | Time | DbHits | Rows | Memory (Bytes) |
| "PROFILE" | "READ_ONLY" | "CYPHER 4.1" | "COST" | "INTERPRETED" | 1033 | 3721750 | 10 | 6696240 |
| Operator | Details | Estimated Rows | Rows | DB Hits | Cache H/M | Memory (Bytes) | Ordered by |
| +ProduceResults#neo4j | `p.ID` | 2158 | 10 | 0 | 0/0 | | |
| +Projection#neo4j | p.ID AS `p.ID` | 2158 | 10 | 10 | 0/0 | | |
| +Filter#neo4j | u:User | 2158 | 10 | 10 | 0/0 | | |
| +Expand(All)#neo4j | (p)<-[anon_15:CREATED]-(u) | 2158 | 10 | 20 | 0/0 | | |
| +AntiSemiApply#neo4j | | 2158 | 10 | 0 | 0/0 | | |
| |\ |
| | +Expand(Into)#neo4j | (anon_47)-[anon_61:VIEWED]->(p) | 233 | 0 | 3695819 | 0/0 | 6696240 | anon_47.ID ASC |
| | +NodeUniqueIndexSeek#neo4j | UNIQUE anon_47:User(ID) WHERE ID = $autoint_0 | 8630 | 8630 | 17260 | 0/0 | | anon_47.ID ASC |
| +NodeByLabelScan#neo4j | p:Post | 8630 | 8630 | 8631 | 0/0 | | |
Yes, this can be improved. First, let's understand what this is doing.
It starts with a NodeByLabelScan. That makes sense, there's no avoiding that. But then, for every node of the label (the following executes PER ROW!), it matches to user 2, and expands all :VIEWED relationships from user 2 to see if any of them is the post for that particular row.
Can you see why this is inefficient? There are 8630 post nodes according to the PROFILE plan, so user 2 is looked up by index 8630 times, and their :VIEWED relationships are expanded 8630 times. Why 8630 times? Because this is happening per :Post node.
Instead, try this:
MATCH (:User{ID: 2})-[:VIEWED]->(viewedPost)
WITH collect(viewedPost) as viewedPosts
MATCH (:User)-[:CREATED]->(p:Post)
WHERE NOT p IN viewedPosts
RETURN p.ID
This changes things up a bit. First it matches to user 2's viewed posts (the lookup and expansion is performed only once), then those viewed posts are collected. Then it will do a label scan, and filter such that the post isn't in the collection of viewed posts.
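The same restructuring can be sketched outside Cypher. The Python below is only an analogy, with hypothetical data structures standing in for the graph: the slow variant repeats user 2's lookup and expansion for every post, while the fast variant collects the viewed posts once and then filters.

```python
from typing import Dict, List, Set


def unviewed_slow(posts: List[int], viewed: Dict[int, List[int]], user_id: int) -> List[int]:
    """Per-row version: re-expands the user's viewed posts for every post."""
    result = []
    for post in posts:
        # This lookup and expansion runs once per post, which is the
        # per-row cost described above.
        viewed_posts = list(viewed.get(user_id, []))
        if post not in viewed_posts:
            result.append(post)
    return result


def unviewed_fast(posts: List[int], viewed: Dict[int, List[int]], user_id: int) -> List[int]:
    """Collect-once version: expand the user's viewed posts a single time."""
    # A set is a Python convenience for fast membership tests; Cypher's
    # collect() produces a list.
    viewed_posts: Set[int] = set(viewed.get(user_id, []))
    return [post for post in posts if post not in viewed_posts]


# Hypothetical sample data: user 2 has viewed posts 1, 3 and 5.
posts = [1, 2, 3, 4, 5]
viewed = {2: [1, 3, 5]}
```

Both functions return the same posts; only the number of times the viewed-post expansion is repeated differs.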
joining three tables in psql and keeping results according to group membership
I am using psql and joined three tables A, B and C from table A. For example, the resulting table is as follows:
+----+------+------+------+
| pk | a_id | b_id | c_id |
+----+------+------+------+
| 1 | 5 | 12 | 16 |
| 2 | 5 | 7 | 8 |
| 3 | 5 | 6 | 21 |
| 4 | 8 | 12 | 16 |
| 5 | 8 | 3 | 9 |
| 6 | 9 | 11 | 32 |
| 7 | 9 | 8 | 2 |
+----+------+------+------+
I am trying to create c_id relations over a_id. In a_id there are three groups [5,8,9]. For example c_id=16 has a relation to a_id=[5,8], so c_id=[8,21,9,32] must be protected via a_id=[5,8]. The resulting table should look as follows:
+----+------+------+------+
| pk | a_id | b_id | c_id |
+----+------+------+------+
| 1 | 5 | 12 | 16 |
| 2 | 5 | 7 | 8 |
| 3 | 5 | 6 | 21 |
| 4 | 8 | 12 | 16 |
| 5 | 8 | 3 | 9 |
+----+------+------+------+
How can I write such a condition in the join statement?
After the join, you can write this query. I created your result table directly, and then I wrote a SQL query.
SELECT * FROM res WHERE a_id IN (SELECT DISTINCT a_id FROM res WHERE c_id=16)
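To try the query above without a PostgreSQL instance, here is a minimal sketch using Python's built-in sqlite3 module. The table name `res` follows the answer; loading the joined rows into an in-memory table and the added ORDER BY are assumptions made only for the demonstration.

```python
import sqlite3

# Rows of the joined table from the question: (pk, a_id, b_id, c_id).
rows = [
    (1, 5, 12, 16),
    (2, 5, 7, 8),
    (3, 5, 6, 21),
    (4, 8, 12, 16),
    (5, 8, 3, 9),
    (6, 9, 11, 32),
    (7, 9, 8, 2),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE res (pk INTEGER, a_id INTEGER, b_id INTEGER, c_id INTEGER)")
conn.executemany("INSERT INTO res VALUES (?, ?, ?, ?)", rows)

# Keep every row whose a_id group contains at least one row with c_id = 16.
# ORDER BY pk is added here only to make the output order deterministic.
kept = conn.execute(
    "SELECT * FROM res "
    "WHERE a_id IN (SELECT DISTINCT a_id FROM res WHERE c_id = 16) "
    "ORDER BY pk"
).fetchall()
conn.close()
```

The subquery picks the a_id groups 5 and 8, so the rows belonging to a_id 9 are filtered out.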
Query unique pair of nodes when pair orders is not important in cypher
I am trying to compare users according to their common interests in this graph. I know why the following query produces duplicate pairs but can't think of a good way in Cypher to avoid it. Is there any way to do it without looping in Cypher?
neo4j-sh (?)$ start n=node(*) match p=n-[:LIKES]->item<-[:LIKES]-other where n <> other return n.name,other.name,collect(item.name) as common, count(*) as freq order by freq desc;
==> +-----------------------------------------------+
==> | n.name | other.name | common | freq |
==> +-----------------------------------------------+
==> | "u1" | "u2" | ["f1","f2","f3"] | 3 |
==> | "u2" | "u1" | ["f1","f2","f3"] | 3 |
==> | "u1" | "u3" | ["f1","f2"] | 2 |
==> | "u3" | "u2" | ["f1","f2"] | 2 |
==> | "u2" | "u3" | ["f1","f2"] | 2 |
==> | "u3" | "u1" | ["f1","f2"] | 2 |
==> | "u4" | "u3" | ["f1"] | 1 |
==> | "u4" | "u2" | ["f1"] | 1 |
==> | "u4" | "u1" | ["f1"] | 1 |
==> | "u2" | "u4" | ["f1"] | 1 |
==> | "u1" | "u4" | ["f1"] | 1 |
==> | "u3" | "u4" | ["f1"] | 1 |
==> +-----------------------------------------------+
In order to avoid having duplicates in the form of a--b and b--a, you can exclude one of the combinations in your WHERE clause with
WHERE ID(a) < ID(b)
making your above query
start n=node(*) match p=n-[:LIKES]->item<-[:LIKES]-other where ID(n) < ID(other) return n.name,other.name,collect(item.name) as common, count(*) as freq order by freq desc;
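The effect of the ID(n) < ID(other) filter can be seen in a small Python sketch. The data mirrors the question's output; comparing user names in place of internal node IDs is an assumption made for illustration.

```python
from typing import Dict, List, Set, Tuple


def common_pairs(likes: Dict[str, Set[str]]) -> List[Tuple[str, str, List[str], int]]:
    """Return one row per unordered pair of users with shared items."""
    rows = []
    for n in likes:
        for other in likes:
            # Keeping only n < other drops both self-pairs and the
            # mirrored (other, n) duplicate of every pair.
            if not n < other:
                continue
            common = sorted(likes[n] & likes[other])
            if common:
                rows.append((n, other, common, len(common)))
    # Equivalent of "order by freq desc".
    rows.sort(key=lambda row: row[3], reverse=True)
    return rows


likes = {
    "u1": {"f1", "f2", "f3"},
    "u2": {"f1", "f2", "f3"},
    "u3": {"f1", "f2"},
    "u4": {"f1"},
}
pairs = common_pairs(likes)
```

The twelve rows of the original output collapse to six, one per unordered pair.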
OK, I see that you use (*) as a start point, which means looping through the whole graph and using each node as a start point. So the output rows are different, not duplicates as you say:
+-----------------------------------------------+
| n.name | other.name | common | freq |
+-----------------------------------------------+
| "u2" | "u1" | ["f1","f2","f3"] | 3 |
is not equal to:
+-----------------------------------------------+
| n.name | other.name | common | freq |
+-----------------------------------------------+
| "u1" | "u2" | ["f1","f2","f3"] | 3 |
So, if you use an index and set a start point, there won't be any duplicates.
start n=node:someIndex(name='C') match p=n-[:LIKES]->item<-[:LIKES]-other where n <> other return n.name,other.name,collect(item.name) as common, count(*) as freq order by freq desc;
calculating total path cost in cypher, taking relation directionality into account
Using a Cypher query on Neo4j, in a directed, cyclic graph I need a BFS query and a target node sorting per depth level.
For the within-depth sorting, a custom "total path cost function" should be used, calculated based on all relation attributes r.followrank between start and end node, and on relation directionality (followrank if it points towards the end node, or 0 if not).
At any search depth level n, a node connected to a high ranked node at level n-m, m>0 should be ranked higher than a node connected to a low ranked node at level n-m. Reverse directionality should result in a 0 rank (which means, the node and its subtree are still part of the ranking).
I'm using neo4j community-1.9.M01. The approach I've taken so far was to extract an array of followranks for the shortest path to each end node.
I thought I had come up with a great first idea for this query but it seems to break down at multiple points. My query is:
START strt=node(7) MATCH p=strt-[*1..]-tgt WHERE not(tgt=strt) RETURN ID(tgt), extract(r in rels(p): r.followrank*length(strt-[*0..]-()-[r]->() )) as rank, extract(n in nodes(p): ID(n));
which outputs
==> +-----------------------------------------------------------------+
==> | ID(tgt) | rank | extract(n in nodes(p): ID(n)) |
==> +-----------------------------------------------------------------+
==> | 14 | [1.0] | [7,14] |
==> | 15 | [1.0,1.0] | [7,14,15] |
==> | 11 | [1.0,1.0,1.0] | [7,14,15,11] |
==> | 8 | [1.0,1.0,1.0,1.0,0.0] | [7,14,15,11,7,8] |
==> | 9 | [1.0,1.0,1.0,1.0,0.0] | [7,14,15,11,7,9] |
==> | 10 | [1.0,1.0,1.0,1.0,0.0] | [7,14,15,11,7,10] |
==> | 12 | [1.0,1.0,1.0,0.0] | [7,14,15,11,12] |
==> | 8 | [0.0] | [7,8] |
==> | 9 | [0.0] | [7,9] |
==> | 10 | [0.0] | [7,10] |
==> | 11 | [1.0] | [7,11] |
==> | 15 | [1.0,1.0] | [7,11,15] |
==> | 14 | [1.0,1.0,1.0] | [7,11,15,14] |
==> | 8 | [1.0,1.0,1.0,1.0,0.0] | [7,11,15,14,7,8] |
==> | 9 | [1.0,1.0,1.0,1.0,0.0] | [7,11,15,14,7,9] |
==> | 10 | [1.0,1.0,1.0,1.0,0.0] | [7,11,15,14,7,10] |
==> | 12 | [1.0,0.0] | [7,11,12] |
==> +-----------------------------------------------------------------+
==> 17 rows
==> 38 ms
It looks similar to what I need, but the issues are:
nodes 8, 9, 10, 11 have the same relation direction to 7! The inverse query result ...*length(strt-[*0..]-()-[r]->() )... looks even stranger - see the queries right below.
I don't know how to normalize the results of the length() expression to 1.
Directionality:
START strt=node(7) MATCH strt<-[r]-m RETURN ID(m), r.followrank;
==> +----------------------+
==> | ID(m) | r.followrank |
==> +----------------------+
==> | 8 | 1 |
==> | 9 | 1 |
==> | 10 | 1 |
==> | 11 | 1 |
==> +----------------------+
==> 4 rows
==> 0 ms
START strt=node(7) MATCH strt-[r]->m RETURN ID(m), r.followrank;
==> +----------------------+
==> | ID(m) | r.followrank |
==> +----------------------+
==> | 14 | 1 |
==> +----------------------+
==> 1 row
==> 0 ms
Inverse query:
START strt=node(7) MATCH p=strt-[*1..]-tgt WHERE not(tgt=strt) RETURN ID(tgt), extract(rr in rels(p): rr.followrank*length(strt-[*0..]-()<-[rr]-() )) as rank, extract(n in nodes(p): ID(n));
==> +-----------------------------------------------------------------+
==> | ID(tgt) | rank | extract(n in nodes(p): ID(n)) |
==> +-----------------------------------------------------------------+
==> | 14 | [1.0] | [7,14] |
==> | 15 | [1.0,1.0] | [7,14,15] |
==> | 11 | [1.0,1.0,1.0] | [7,14,15,11] |
==> | 8 | [1.0,1.0,1.0,1.0,3.0] | [7,14,15,11,7,8] |
==> | 9 | [1.0,1.0,1.0,1.0,3.0] | [7,14,15,11,7,9] |
==> | 10 | [1.0,1.0,1.0,1.0,3.0] | [7,14,15,11,7,10] |
==> | 12 | [1.0,1.0,1.0,2.0] | [7,14,15,11,12] |
==> | 8 | [3.0] | [7,8] |
==> | 9 | [3.0] | [7,9] |
==> | 10 | [3.0] | [7,10] |
==> | 11 | [1.0] | [7,11] |
==> | 15 | [1.0,1.0] | [7,11,15] |
==> | 14 | [1.0,1.0,1.0] | [7,11,15,14] |
==> | 8 | [1.0,1.0,1.0,1.0,3.0] | [7,11,15,14,7,8] |
==> | 9 | [1.0,1.0,1.0,1.0,3.0] | [7,11,15,14,7,9] |
==> | 10 | [1.0,1.0,1.0,1.0,3.0] | [7,11,15,14,7,10] |
==> | 12 | [1.0,2.0] | [7,11,12] |
==> +-----------------------------------------------------------------+
==> 17 rows
==> 30 ms
So my questions are:
what's going on with this query?
is there a working approach?
For an additional detail, I know the min(length(path)) aggregator, but it doesn't work in this case where I'm trying to extract information about the best hit - the additional information I return about the best hit will disaggregate the result again - I think that's a Cypher limitation.
Basically, you want to do a rank only considering relationships that are "with the path flow". Unfortunately, to test "with path flow", you need to check the path-index of each relationship's start/end nodes, and that can only be done with APOC right now.
// allshortestpaths to get all non-cyclic paths
MATCH path=allshortestpaths((a{id:"1"})-[*]-(b{id:"2"}))
// Find rank worthy relationships
WITH path, filter(rl in relationships(path) WHERE apoc.coll.indexOf(path, startnode(rl))<apoc.coll.indexOf(path, endnode(rl))) as comply
// Filter results
RETURN path, REDUCE(rk = 0, rl in comply | rk+rl.followrank) as rank
ORDER BY rank DESC
(I can't test the APOC part, so you might have to pass NODES(path) instead of path to the APOC procedure)
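The ranking idea in this answer (count a relationship's followrank only when it points along the path) can be sketched in plain Python without APOC. The tuple layout for relationships and the sample values are assumptions for illustration, not data from the question's graph.

```python
from typing import List, Tuple

# (start node id, end node id, followrank): a hypothetical layout.
Relationship = Tuple[int, int, float]


def path_rank(nodes: List[int], rels: List[Relationship]) -> float:
    """Sum followrank over relationships that point along the path.

    Assumes a non-cyclic path, so each node id appears once in `nodes`
    and its position in the list is well defined.
    """
    position = {node: index for index, node in enumerate(nodes)}
    rank = 0.0
    for start, end, followrank in rels:
        # "With the flow" means the start node comes earlier in the path.
        if position[start] < position[end]:
            rank += followrank
    return rank


# Illustrative path 7, 11, 12: the first relationship is stored as
# 11 -> 7 (against the flow), the second as 11 -> 12 (with the flow).
example = path_rank([7, 11, 12], [(11, 7, 1.0), (11, 12, 1.0)])
```

Relationships pointing against the path contribute 0 instead of removing the path, which matches the requirement that such nodes stay in the ranking.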