Neo4j 3.2 Cypher low performance - neo4j

Without getting unnecessarily too specific, I'm facing the following issue with Cyper in Neo4j 3.2. Let's say we have a database with 3 entities: User, Comment, Like.
For whatever reason, I'm trying to run the following query:
MATCH (n:USER) WHERE n.name = "name"
WITH n
MATCH (o:USER)
WITH n, o, "2000" as number
MATCH (n)<-[:CREATED_BY]-(:COMMENT)-[:HAS]->(l:LIKE)-[:CREATED_BY]->(o)
RETURN n, o, number, count(l)
The query takes minutes to complete. However, if I simply remove the "2000" as number part, it completes within tens of milliseconds.
Does anybody have an explanation why?
EDIT:
Top image, with "2000" as number part; bottom, without it.

You're going to have to clean up your query, right now you're not using indexes (so the initial match with the specific name is slow), and then you perform a cartesian product against all :User nodes, then create strings for each row.
So first, create an index on :USER(name) so you can find your start node fast.
Then we'll have to clean up the rest of the match.
Try something like this instead:
MATCH (n:USER) WHERE n.name = "name"
WITH n, "2000" as number
MATCH (n)<-[:CREATED_BY]-(:COMMENT)-[:HAS]->(l:LIKE)-[:CREATED_BY]->(o:User)
RETURN n, o, number, count(l)
You should see a similar plan with this query as in the query without the "2000".
The reason for this is that although your plan has a cartesian product with your match to o, the planner was intelligent enough to realize there was an additional restriction for o in that it had to occur in the pattern in your last match, and its optimization for that situation let you avoid performing a cartesian product.
Introduction of a new variable number, however, prevented the planner from recognizing that this was basically the same situation, so the planner did not optimize out the cartesian product.
For now, try to be explicit about how you want the query to be performed, and try to avoid cartesian products in your queries.
In this particular case, it's important to realize that when you have MATCH (o:User) on the third line, that's not declaring that the type of o is a :User in the later match, it's instead saying that for every row in your results so far, perform a cartesian product against all :User nodes, and then for each of those user nodes, see which ones exist in the pattern provided. That is a lot of unnecessary work, compared to simply expanding the pattern provided and getting whatever :User nodes you find at the other end of the pattern.
EDIT
As far as getting both :LIKE and :DISLIKE node counts, maybe try something like this:
MATCH (n:USER) WHERE n.name = "name"
WITH n, "2000" as number
MATCH (n)<-[:CREATED_BY]-(:COMMENT)-[:HAS]->(likeDislike)-[:CREATED_BY]->(o:User)
WITH n, o, number, head(labels(likeDislike)) as type, count(likeDislike) as cnt
WITH n, o, number, CASE WHEN type = "LIKE" THEN cnt END as likeCount, CASE WHEN type = "DISLIKE" THEN cnt END as dislikeCount
RETURN n, o, number, sum(likeCount) as likeCount, sum(dislikeCount) as dislikeCount
Assuming you still need that number variable in there.

Related

Return multiple relationship counts for one MATCH statement

I want to do something like this:
MATCH (p:person)-[a:UPVOTED]->(t:topic),(p:person)-[b:DOWNVOTED]->(t:topic),(p:person)-[c:FLAGGED]->(t:topic) WHERE ID(t)=4 RETURN COUNT(a),COUNT(b),COUNT(c)
..but I get all 0 counts when I should get 2, 1, 1
A better solution is to use size which improve drastically the performance of the query :
MATCH (t:Topic)
WHERE id(t) = 4
RETURN size((t)<-[:DOWNVOTED]-(:Person)) as downvoted,
size((t)<-[:UPVOTED]-(:Person)) as upvoted,
size((t)<-[:FLAGGED]-(:Person)) as flagged
If you are sure that the other nodes on the relationships are always labelled with Person, you can remove them from the query and it will be a bit faster again
Let's start with refactoring the query a bit (hopefully the meaning of it isn't lost):
MATCH
(t:topic)
(p:person)-[upvote:UPVOTED]-(t),
(p:person)-[downvote:DOWNVOTED]->(t),
(p:person)-[flag:FLAGGED]->(t)
WHERE ID(t)=4
RETURN COUNT(upvote), COUNT(downvote), COUNT(flag)
Since t is your primary variable (since you are filtering on it), I've matched once with the label and then used just the variable throughout the rest of the matches. Seeing the query cleaned up like this, it seems to me that you're trying to count all upvotes/downvotes/flags for a topic, but you don't care who did those things. Currently, since you're using the same variable p Cypher is going to try to match the same person for all three lines. So you could have different variables:
(p1:person)-[upvote:UPVOTED]-(t),
(p2:person)-[downvote:DOWNVOTED]->(t),
(p3:person)-[flag:FLAGGED]->(t)
Or better, since you're not referencing the people anywhere else, you can just leave the variables out:
(:person)-[upvote:UPVOTED]-(t),
(:person)-[downvote:DOWNVOTED]->(t),
(:person)-[flag:FLAGGED]->(t)
And stylistically, I would also suggest starting your matches with the item that you're filtering on:
(t)<-[upvote:UPVOTED]-(:person)
(t)<-[downvote:DOWNVOTED]-(:person)
(t)<-[flag:FLAGGED]-(:person)
The next problem comes in because by making these a MATCH, you're saying that there NEEDS to be a match. Which means you'll never get cases with zeros. So you'll want OPTIONAL MATCH:
MATCH (t:topic)
WHERE ID(t)=4
OPTIONAL MATCH (t)<-[upvote:UPVOTED]-(:person)
OPTIONAL MATCH (t)<-[downvote:DOWNVOTED]-(:person)
OPTIONAL MATCH (t)<-[flag:FLAGGED]-(:person)
RETURN COUNT(upvote), COUNT(downvote), COUNT(flag)
Even then, though what you're saying is: "Find a topic and find all cases where there is 1 upvote, no downvote, no flag, 1 upvote, 1 downvote, no flag, etc... to all permutations). That means you'll want to COUNT one at a time:
MATCH (t:topic)
WHERE ID(t)=4
OPTIONAL MATCH (t)<-[r:UPVOTED]-(:person)
WITH t, COUNT(r) AS upvotes
OPTIONAL MATCH (t)<-[r:DOWNVOTED]-(:person)
WITH t, upvotes, COUNT(r) AS downvotes
OPTIONAL MATCH (t)<-[r:FLAGGED]-(:person)
RETURN upvotes, downvotes, COUNT(r) AS flags
A couple of miscellaneous items:
Be careful about using Neo IDs as a long-term reference because they can be recycled.
Use parameters whenever possible for performance / security (WHERE ID(t)={topic_id})
Also, labels are generally TitleCase. See The Zen of Cypher guide.
Check this query, i think it will help you.
MATCH (p:person)-[a:UPVOTED]->(t:topic),
(p)-[b:DOWNVOTED]->(t),(p)-[c:FLAGGED]->(t)
WHERE ID(t)=4
RETURN COUNT(a) as a_count,COUNT(b) as b_count,COUNT(c) as c_count;
Your current MATCH requires that the same person node (identified by p) have relationships of all 3 types with t. This is because an identifier is bound to a specific node (or relationship, or value), and (unless hidden by a WITH clause, which you do not have in your query) will reference that same node (or relationship, or value) throughout a query.
Based on your expected results, I am assuming that you are just trying to count the number of relationships of those 3 types between any person and t. If so, this is a performant way to do that:
MATCH (t:topic)
WHERE ID(t) = 4
MATCH (:person)-[r:UPVOTED|DOWNVOTED|FLAGGED]->(t)
RETURN REDUCE(s=[0,0,0], x IN COLLECT(r) |
CASE TYPE(x)
WHEN 'UPVOTED' THEN [s[0]+1, s[1], s[2]]
WHEN 'DOWNVOTED' THEN [s[0], s[1]+1, s[2]]
ELSE [s[0], s[1], s[2]+1]
END
) As res;
res is an array with the number of UPVOTED, DOWNVOTED, and FLAGGED relationships, respectively, between any person and t.
Another approach would be to use separate OPTIONAL MATCH statements for each relationship type, returning three COUNT(DISTINCT x) values. But the above query uses a single MATCH statement, greatly reducing the number of DB hits, which are generally expensive.

What is the difference between multiple MATCH clauses and a comma in a Cypher query?

In a Cypher query language for Neo4j, what is the difference between one MATCH clause immediately following another like this:
MATCH (d:Document{document_ID:2})
MATCH (d)--(s:Sentence)
RETURN d,s
Versus the comma-separated patterns in the same MATCH clause? E.g.:
MATCH (d:Document{document_ID:2}),(d)--(s:Sentence)
RETURN d,s
In this simple example the result is the same. But are there any "gotchas"?
There is a difference: comma separated matches are actually considered part of the same pattern. So for instance the guarantee that each relationship appears only once in resulting path is upheld here.
Separate MATCHes are separate operations whose paths don't form a single patterns and which don't have these guarantees.
I think it's better to explain providing an example when there's a difference.
Let's say we have the "Movie" database which is provided by official Neo4j tutorials.
And there're 10 :WROTE relationships in total between :Person and :Movie nodes
MATCH (:Person)-[r:WROTE]->(:Movie) RETURN count(r); // returns 10
1) Let's try the next query with two MATCH clauses:
MATCH (p:Person)-[:WROTE]->(m:Movie) MATCH (p2:Person)-[:WROTE]->(m2:Movie)
RETURN p.name, m.title, p2.name, m2.title;
Sure you will see 10*10 = 100 records in the result.
2) Let's try the query with one MATCH clause and two patterns:
MATCH (p:Person)-[:WROTE]->(m:Movie), (p2:Person)-[:WROTE]->(m2:Movie)
RETURN p.name, m.title, p2.name, m2.title;
Now you will see 90 records are returned.
That's because in this case records where p = p2 and m = m2 with the same relationship between them (:WROTE) are excluded.
For example, there IS a record in the first case (two MATCH clauses)
p.name m.title p2.name m2.title
"Aaron Sorkin" "A Few Good Men" "Aaron Sorkin" "A Few Good Men"
while there's NO such a record in the second case (one MATCH, two patterns)
There are no differences between these provided that the clauses are not linked to one another.
If you did this:
MATCH (a:Thing), (b:Thing) RETURN a, b;
That's the same as:
MATCH (a:Thing) MATCH (b:Thing) RETURN a, b;
Because (and only because) a and b are independent. If a and b were linked by a relationship, then the meaning of the query could change.
In a more generic way, "The same relationship cannot be returned more than once in the same result record." [see 1.5. Cypher Result Uniqueness in the Cypher manual]
Both MATCH-after-MATCH, and single MATCH with comma-separated pattern should logically return a Cartesian product. Except, for comma-separated pattern, we must exclude those records for which we already added the relationship(s).
In Andy's answer, this is why we excluded repetitions of the same movie in the second case: because the second expression from each single MATCH was using there the same :WROTE relationship as the first expression.
If a part of a query contains multiple disconnected patterns, this will build a cartesian product between all those parts. This may produce a large amount of data and slow down query processing. While occasionally intended, it may often be possible to reformulate the query that avoids the use of this cross product, perhaps by adding a relationship between the different parts or by using OPTIONAL MATCH (identifier is: (a)) .
IN short their is NO Difference in this both query but used it very carefully.
In a more generic way, "The same relationship cannot be returned more than once in the same result record." [see 1.5. Cypher Result Uniqueness in the Cypher manual]
How about this statement?
MATCH p1=(v:player)-[e1]->(n)
MATCH p2=(n:team)<-[e2]-(m)
WHERE e1=e2
RETURN e1,e2,p1,p2

Cypher query results unexpectedely affected by "WITH" statement

As part of a larger query, I am trying to select PRODUCTs that have a relationship to more than one SKU. I subsequently want to delete these relationships and perform other modifications not included here for the sake of simplicity.
I was surprised to find that while the following query returns a single node:
MATCH (p:PRODUCT)-[p_s_rel:PRODUCT_SKU]->(s:SKU)
WITH s, p, count(s) AS sCount
WHERE sCount> 1 AND id(s) IN [9220, 9201]
RETURN s
adding the relationships p_s_rel in the WITH clause changes the result and returns no nodes:
MATCH (p:PRODUCT)-[p_s_rel:PRODUCT_SKU]->(s:SKU)
WITH s, p, p_s_rel, count(s) AS sCount
WHERE sCount> 1 AND id(s) IN [9220, 9201]
RETURN s
Based on the documentation for WITH, I didn't expect this behavior. Is it not possible to specify relationship identifiers in the WITH clause?
If that's the case, how can I delete the p_s_rel relationships in my example?
Edit: This is on version 2.1.5
Are you sure that the first query is doing what you think that it is? In your query if sCount > 1 then I think that means you have more than one relationship between p and some SKU (i.e multiple relationships between the same nodes). If your WITH statement were WITH p, count(s) AS sCount then you would be matching a single Product with multiple SKUs.
By executing WITH s, p, count(s) AS sCount you are saying carry the currently matched SKU, the currently matched PRODUCT and the count of PRODUCT_SKU relationships between them, whereas WITH p, count(s) AS sCount would be taken to mean carry the currently matched PRODUCT and the count of all PRODUCT_SKU relationships that it has.
In your second query, by including p_s_rel in your WITH clause you will be propagating only a single row with each result - as there will only ever be one distinct p_s_rel to carry forward in each match (a relationship only has one start node, and one end node). This means that sCount will never be greater than 1 - hence your empty resultset. i.e you are saying carry the currently matched SKU, the currently matched PRODUCT, the current realtionship between those two nodes and the count of the end nodes on that relationship (1).
If you want to carry the relationships forward you could use COLLECT, or, as you are going to be restricting your resultset by SKU, maybe you would be better off matching those first:
MATCH (s1:SKU), (s2:SKU)
WHERE ID(s1) = 9220 AND ID(s2) = 9201
WITH s1, s2
MATCH (s1)<-[r1:PRODUCT_SKU]-(p:PRODUCT)-[r2:PRODUCT_SKU]->(s2)
DELETE r1, r2,
RETURN p

Neo4j: Transitive query and node ordering

I am using Neo4j to track relationships in OOP architecture. Let us assume that nodes represent classes and (u) -[:EXTENDS]-> (v) if class u extends class v (i.e. for each node there is at most one outgoing edge of type EXTENDS). I am trying to find out a chain of predecessor classes for a given class (n). I have used the following Cypher query:
start n=node(...)
match (n) -[:EXTENDS*]-> (m)
return m.className
I need to process nodes in such an order that the direct predecessor of class n comes first, its predecessor comes as second etc. It seems that the Neo4j engine returns the nodes in exactly this order (given the above query) - is this something I should rely on or could this behavior suddenly change in some of the future releases?
If I should not rely on this behavior, what Cypher query would allow me to get all predecessor nodes in given ordering? I was thinking about following query:
start n=node(...)
match p = (n) -[:EXTENDS*]-> (m {className: 'Object'})
return p
Which would work quite fine, but I would like to avoid specifying the root class (Object in this case).
It's unlikely to change anytime soon as this is really the nature of graph databases at work.
The query you've written will return ALL possible "paths" of nodes that match that pattern. But given that you've specified that there is at most one :EXTENDS edge from each such node, the order is implied with the direction you've included in the query.
In other words, what's returned won't start "skipping" nodes in a chain.
What it will do, though, is give you all "sub-paths" of a path. That is, assuming you specified you wanted the predecessors for node "a", for the following path...
(a)-[:EXTENDS]->(b)-[:EXTENDS]->(c)
...your query (omitting the property name) will return "a, b, c" and "a, b". If you only want ALL of its predecessors, and you can use Cypher 2.x, consider using the "path" way, something like:
MATCH p = (a)-[:EXTENDS*]->(b)
WITH p
ORDER BY length(p) DESC
LIMIT 1
RETURN extract(x in nodes(p) | p.className)
Also, as a best practice, given that you're looking at paths of indefinite length, you should likely limit the number of hops your query makes to something reasonable, e.g.
MATCH (n) -[:EXTENDS*0..10]-> (m)
Or some such.
HTH

In neo4j is there a way to get path between more than 2 random nodes whose direction of relation is not known

I have a scenario where I have more than 2 random nodes.
I need to get all possible paths connecting all three nodes. I do not know the direction of relation and the relationship type.
Example : I have in the graph database with three nodes person->Purchase->Product.
I need to get the path connecting these three nodes. But I do not know the order in which I need to query, for example if I give the query as person-Product-Purchase, it will return no rows as the order is incorrect.
So in this case how should I frame the query?
In a nutshell I need to find the path between more than two nodes where the match clause may be mentioned in what ever order the user knows.
You could list all of the nodes in multiple bound identifiers in the start, and then your match would find the ones that match, in any order. And you could do this for N items, if needed. For example, here is a query for 3 items:
start a=node:node_auto_index('name:(person product purchase)'),
b=node:node_auto_index('name:(person product purchase)'),
c=node:node_auto_index('name:(person product purchase)')
match p=a-->b-->c
return p;
http://console.neo4j.org/r/tbwu2d
I actually just made a blog post about how start works, which might help:
http://wes.skeweredrook.com/cypher-it-all-starts-with-the-start/
Wouldn't be acceptable to make several queries ? In your case you'd automatically generate 6 queries with all the possible combinations (factorial on the number of variables)
A possible solution would be to first get three sets of nodes (s,m,e). These sets may be the same as in the question (or contain partially or completely different nodes). The sets are important, because starting, middle and end node are not fixed.
Here is the code for the Matrix example with added nodes.
match (s) where s.name in ["Oracle", "Neo", "Cypher"]
match (m) where m.name in ["Oracle", "Neo", "Cypher"] and s <> m
match (e) where e.name in ["Oracle", "Neo", "Cypher"] and s <> e and m <> e
match rel=(s)-[r1*1..]-(m)-[r2*1..]-(e)
return s, r1, m, r2, e, rel;
The additional where clause makes sure the same node is not used twice in one result row.
The relations are matched with one or more edges (*1..) or hops between the nodes s and m or m and e respectively and disregarding the directions.
Note that cypher 3 syntax is used here.

Resources