Optional Match and Where -- How does this query work? - neo4j

Consider the following schema, where orange nodes are of type Person and brown nodes are of type Movie. (This is from the "movies" dataset that is shipped with Neo4j).
The query that I am trying to write goes as follows:
Find all reviewer pairs, one following the other, and return the names
of the two reviewers. If they have both reviewed the same movie,
return the title of the movie as well. Restrict the query so that the first letter of the name of both reviewers is 'J'.
Now, consider the following Cypher query:
MATCH (a:Person)-[:REVIEWED]->(:Movie),
(b:Person)-[:REVIEWED]->(:Movie),
(a:Person)-[:FOLLOWS]->(b:Person)
OPTIONAL MATCH (a:Person)-[:REVIEWED]->(m:Movie)<-[:REVIEWED]-(b:Person)
WHERE a.name STARTS WITH 'J'
AND b.name STARTS WITH 'J'
RETURN DISTINCT a.name, b.name, m.title
This returns the following (incorrect) results:
Why?
What I've gathered so far:
the WHERE applies to the (OPTIONAL) MATCH directly preceding it
the WHERE constraints are considered while looking for matches, not afterwards.
When an OPTIONAL MATCH does not apply fully, null is put for the missing parts of the pattern
I still don't understand why "Angela Scope" shows up in the results. If anything, the predicates should prevent it from ever showing up.
PS: I am aware that the following query returns the correct results
MATCH (a:Person)-[:REVIEWED]->(:Movie),
(b:Person)-[:REVIEWED]->(:Movie),
(a:Person)-[:FOLLOWS]->(b:Person)
WHERE a.name STARTS WITH 'J'
AND b.name STARTS WITH 'J'
OPTIONAL MATCH (a:Person)-[:REVIEWED]->(m:Movie)<-[:REVIEWED]-(b:Person)
RETURN DISTINCT a.name, b.name, m.title
However, I'd like to find out why these two queries return different results, and especially why the first one returns exactly this result.

Sure, you're almost at the answer already:
the WHERE applies to the (OPTIONAL) MATCH directly preceding it
This is important. You should not view the WHERE clause as independent, as it is associated with and modifies the preceding clause. So read it out like MATCH ... WHERE ... and OPTIONAL MATCH ... WHERE ... and WITH ... WHERE ... as a whole.
Remember that an OPTIONAL MATCH will never filter out rows. It keeps the existing rows and, for any newly introduced variables, tries to find matches for the given pattern that also pass its WHERE clause. If it doesn't find any, the newly introduced variables are set to null. And again: no filtering.
So for this snippet:
OPTIONAL MATCH (a:Person)-[:REVIEWED]->(m:Movie)<-[:REVIEWED]-(b:Person)
WHERE a.name STARTS WITH 'J'
AND b.name STARTS WITH 'J'
Angela Scope and Jessica Thompson have a follows relationship between them, and they have reviewed the same movie, The Replacements, but they fail the WHERE clause, since Angela's name doesn't start with a 'J'. Therefore the OPTIONAL MATCH didn't find anything, so the newly introduced variable m will come back as null. Nothing will be filtered.
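The behaviour described above can be sketched as a small Python model. This is only a toy simulation of the row semantics, not how Neo4j executes the query; the reviewer "James Thompson", the review data, and the helper names are assumptions made for illustration.

```python
# Rows produced by the first MATCH: (a, b) pairs where a follows b.
# "James Thompson" and all review data below are assumed for illustration.
rows = [
    {"a": "Angela Scope", "b": "Jessica Thompson"},
    {"a": "James Thompson", "b": "Jessica Thompson"},
]
reviewed = {
    "Angela Scope": {"The Replacements"},
    "Jessica Thompson": {"The Replacements"},
    "James Thompson": {"The Replacements"},
}


def both_start_with_j(row):
    return row["a"].startswith("J") and row["b"].startswith("J")


def optional_match_shared_movie(rows, where):
    """Model OPTIONAL MATCH ... WHERE: the predicate only limits which
    matches are found; a row without matches is kept with m set to None."""
    out = []
    for row in rows:
        shared = sorted(reviewed[row["a"]] & reviewed[row["b"]])
        matches = [m for m in shared if where(row)]
        if matches:
            out.extend({**row, "m": m} for m in matches)
        else:
            out.append({**row, "m": None})  # the row survives, m is null
    return out


# WHERE attached to the OPTIONAL MATCH: no row is dropped.
after_optional = optional_match_shared_movie(rows, both_start_with_j)

# WHERE attached to a WITH (or a MATCH): rows are actually filtered.
after_with_where = [r for r in after_optional if both_start_with_j(r)]

for r in after_optional:
    print(r["a"], "|", r["b"], "|", r["m"])
```

In this model the Angela Scope row is still present after the optional step, with m set to None, and it only disappears once the same predicate is applied as a real filter.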
In order to have a predicate filter your rows, the WHERE clause needs to be associated with a MATCH, or a WITH. So we could fix it as in the correct query you added later, or like this:
MATCH (a:Person)-[:REVIEWED]->(:Movie),
(b:Person)-[:REVIEWED]->(:Movie),
(a:Person)-[:FOLLOWS]->(b:Person)
OPTIONAL MATCH (a:Person)-[:REVIEWED]->(m:Movie)<-[:REVIEWED]-(b:Person)
WITH a, m, b
WHERE a.name STARTS WITH 'J'
AND b.name STARTS WITH 'J'
RETURN DISTINCT a.name, b.name, m.title
This is less efficient, though, since the filtering happens after we've done the OPTIONAL MATCH. It is better to filter earlier, so that we only execute the OPTIONAL MATCH on rows that have already been filtered.
Also note that you have an issue with duplicates here, caused by the patterns you match at the start: (a:Person)-[:REVIEWED]->(:Movie). While this does indeed find persons who are reviewers, you get a row per path that matches the pattern. Jessica Thompson, for example, has reviewed 2 movies, so there are two paths that match that pattern, which is why she shows up at least twice per other reviewer in your results (and it is multiplicative, depending on the number of movies the other reviewer has reviewed).
To fix this, instead of looking for all paths of a :Person reviewing a :Movie, look for a :Person where they have reviewed a movie:
MATCH (a:Person)
WHERE (a)-[:REVIEWED]->()
Because the pattern becomes a predicate, Cypher only has to find at least one :REVIEWED relationship from a :Person, and then it can stop looking, and you won't have those duplicate results.
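To make the difference in row counts concrete, here is a rough Python model of the two forms. The review data and the extra person "Paul Blythe" are assumptions for illustration; only the shape of the result matters.

```python
# (person, movie) pairs standing in for :REVIEWED relationships (assumed data).
reviewed = [
    ("Jessica Thompson", "The Replacements"),
    ("Jessica Thompson", "Jerry Maguire"),
    ("Angela Scope", "The Replacements"),
]
people = ["Jessica Thompson", "Angela Scope", "Paul Blythe"]

# MATCH (a:Person)-[:REVIEWED]->(:Movie)
# One row per matching path, so a person appears once per review.
path_rows = [person for person, _movie in reviewed]

# MATCH (a:Person) WHERE (a)-[:REVIEWED]->()
# One row per person; any() stops at the first relationship it finds.
predicate_rows = [
    p for p in people if any(person == p for person, _movie in reviewed)
]

print(path_rows)
print(predicate_rows)
```

The path form yields Jessica Thompson twice, while the predicate form yields each reviewer once and skips the person with no reviews.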

Related

Neo4J Cypher Exclude nodes that connect to a specific node

I'm trying to find all nodes that don't connect to a specific node. I have an app where students doing an assignment discover themes in a story, and then write explications. Then, other students do peer reviews of these explications. My data looks like this:
Assignment-hasTheme->Theme-hasChild->Theme
Annotation-theme->Theme
Explication-owner->User
Explication-annotation->Annotation
PeerReview-explication->Explication
As part of the application, when a user has to do a peer review, I have to find all the explications written by other users. It seems to me like this query should work:
MATCH
(u),
(a)-[:hasTheme]->(:Theme)
-[:hasChild*]->(:Theme)
<-[:theme]-(ann:Annotation)
<-[:annotation]-(e:Explication)
OPTIONAL MATCH
(e)<-[:explication]-(p:PeerReview)
WHERE id(a)=7 AND id(u)=4
AND (e)-[:owner]->(u)
RETURN e, count(e) AS explicationCount
ORDER BY explicationCount ASC
The problem is that it doesn't: I get all the explications that all users have written. That includes the explications the user wrote. Can anyone tell me how to exclude those?
The problem is that the WHERE clause is only associated with one other clause...the preceding MATCH, OPTIONAL MATCH, or WITH. In your query, it's associated with the OPTIONAL MATCH.
If you re-read your query knowing this, you can see that the first MATCH has no WHERE clause, so it's matching on all assignments and all users, finding all explications.
THEN it does the optional match to get :PeerReviews matching on the given assignment and user ids where the explication owner is the user with the given id. The WHERE is only affecting which :PeerReviews (variable p) are matched.
A couple of other things I can see: you're introducing a variable ann for the :Annotations matched in the pattern and a variable p for the :PeerReview, but you're not actually doing anything with them in the query. This also makes your OPTIONAL MATCH useless, since you're not returning or operating on the matched :PeerReviews.
My recommendation is to remove those variables and remove your OPTIONAL MATCH completely.
MATCH
(u),
(a)-[:hasTheme]->(:Theme)
-[:hasChild*]->(:Theme)
<-[:theme]-(:Annotation)
<-[:annotation]-(e:Explication)
WHERE id(a)=7 AND id(u)=4
AND (e)-[:owner]->(u)
RETURN e, count(e) AS explicationCount
ORDER BY explicationCount ASC
If you do want to add in the OPTIONAL MATCH and use the matched :PeerReview, ensure that it's below the WHERE affecting the MATCH, like so:
MATCH
(u),
(a)-[:hasTheme]->(:Theme)
-[:hasChild*]->(:Theme)
<-[:theme]-(:Annotation)
<-[:annotation]-(e:Explication)
WHERE id(a)=7 AND id(u)=4
AND (e)-[:owner]->(u)
OPTIONAL MATCH
(e)<-[:explication]-(p:PeerReview)
RETURN e, count(e) AS explicationCount, p
ORDER BY explicationCount ASC
EDIT
In response to the comments where the desired result is each :Explication and the count of all linked :PeerReviews, you would use this query:
MATCH
(u),
(a)-[:hasTheme]->(:Theme)
-[:hasChild*0..]->(:Theme)
<-[:theme]-(:Annotation)
<-[:annotation]-(e:Explication)
WHERE id(a)=7 AND id(u)=4
AND (e)-[:owner]->(u)
OPTIONAL MATCH
(e)<-[:explication]-(p:PeerReview)
RETURN e, count(p) as peerReviewCount
ORDER BY peerReviewCount ASC
EDIT
Updated the above query so it will find annotations on the parent theme as well instead of just its children.

Cypher Query not returning nonexistent relationships

I have a graph database where there are user and interest nodes which are connected by IS_INTERESTED relationship. I want to find interests which are not selected by a user. I wrote this query and it is not working
OPTIONAL MATCH (u:User{userId : 1})-[r:IS_INTERESTED] -(i:Interest)
WHERE r is NULL
Return i.name as interest
According to answers to similar questions on SO (like this one), the above query is supposed to work. However, in this case it returns null. But when running the following query, it works as expected:
MATCH (u:User{userId : 1}), (i:Interest)
WHERE NOT (u) -[:IS_INTERESTED] -(i)
return i.name as interest
The reason I don't want to run the above query is because Neo4j gives a warning:
This query builds a cartesian product between disconnected patterns.
If a part of a query contains multiple disconnected patterns, this
will build a cartesian product between all those parts. This may
produce a large amount of data and slow down query processing. While
occasionally intended, it may often be possible to reformulate the
query that avoids the use of this cross product, perhaps by adding a
relationship between the different parts or by using OPTIONAL MATCH
(identifier is: (i))
What am I doing wrong in the first query where I use OPTIONAL MATCH to find nonexistent relationships?
1) MATCH looks for the pattern as a whole, and if it cannot find it in its entirety, it does not return anything.
2) I think that this query will be effective:
// Take all user interests
MATCH (u:User{userId: 1})-[r:IS_INTERESTED]-(i:Interest)
WITH collect(i) as interests
// Check what interests are not included
MATCH (ni:Interest) WHERE NOT ni IN interests
RETURN ni.name
When your OPTIONAL MATCH query does not find a match, then both r AND i must be NULL. After all, since there is no relationship, there is no way to get the nodes that it points to.
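A small Python model may help to see why the first query returns a single null. It is a sketch under assumed data (the interest names and relationship ids are made up), and the helper optional_match is hypothetical; it only mimics the row semantics.

```python
# IS_INTERESTED relationships of user 1 as (relationship id, interest name).
# All values are assumed for illustration.
rels = [("r1", "Music"), ("r2", "Chess")]
all_interests = ["Music", "Chess", "Hiking"]


def optional_match(rels, where):
    """Model a leading OPTIONAL MATCH ... WHERE: the predicate is checked
    while matching; if nothing matches, one row of nulls is produced."""
    matches = [(r, i) for r, i in rels if where(r)]
    return matches or [(None, None)]


# OPTIONAL MATCH (u)-[r:IS_INTERESTED]-(i) WHERE r IS NULL
# Every real match has a non-null r, so no match passes the predicate.
rows = optional_match(rels, where=lambda r: r is None)
names = [i for _r, i in rows]

# What the question is actually after: interests with no relationship to u.
selected = {i for _r, i in rels}
not_selected = [i for i in all_interests if i not in selected]

print(names)
print(not_selected)
```

Because r and i come from the same match, a row where r is null can never carry a real interest in i, which is why the anti-join has to start from the full set of interests.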
A WHERE directly after the OPTIONAL MATCH is pulled into the evaluation.
If you want to post-filter you have to use a WITH in between.
MATCH (u:User{userId : 1})
OPTIONAL MATCH (u)-[r:IS_INTERESTED] -(i:Interest)
WITH r,i
WHERE r is NULL
Return i.name as interest

Return multiple relationship counts for one MATCH statement

I want to do something like this:
MATCH (p:person)-[a:UPVOTED]->(t:topic),(p:person)-[b:DOWNVOTED]->(t:topic),(p:person)-[c:FLAGGED]->(t:topic) WHERE ID(t)=4 RETURN COUNT(a),COUNT(b),COUNT(c)
..but I get all 0 counts when I should get 2, 1, 1
A better solution is to use size, which drastically improves the performance of the query:
MATCH (t:Topic)
WHERE id(t) = 4
RETURN size((t)<-[:DOWNVOTED]-(:Person)) as downvoted,
size((t)<-[:UPVOTED]-(:Person)) as upvoted,
size((t)<-[:FLAGGED]-(:Person)) as flagged
If you are sure that the other nodes on the relationships are always labelled with Person, you can remove the labels from the query and it will be a bit faster again.
Let's start with refactoring the query a bit (hopefully the meaning of it isn't lost):
MATCH
(t:topic),
(p:person)-[upvote:UPVOTED]->(t),
(p:person)-[downvote:DOWNVOTED]->(t),
(p:person)-[flag:FLAGGED]->(t)
WHERE ID(t)=4
RETURN COUNT(upvote), COUNT(downvote), COUNT(flag)
Since t is your primary variable (since you are filtering on it), I've matched once with the label and then used just the variable throughout the rest of the matches. Seeing the query cleaned up like this, it seems to me that you're trying to count all upvotes/downvotes/flags for a topic, but you don't care who did those things. Currently, since you're using the same variable p Cypher is going to try to match the same person for all three lines. So you could have different variables:
(p1:person)-[upvote:UPVOTED]->(t),
(p2:person)-[downvote:DOWNVOTED]->(t),
(p3:person)-[flag:FLAGGED]->(t)
Or better, since you're not referencing the people anywhere else, you can just leave the variables out:
(:person)-[upvote:UPVOTED]->(t),
(:person)-[downvote:DOWNVOTED]->(t),
(:person)-[flag:FLAGGED]->(t)
And stylistically, I would also suggest starting your matches with the item that you're filtering on:
(t)<-[upvote:UPVOTED]-(:person),
(t)<-[downvote:DOWNVOTED]-(:person),
(t)<-[flag:FLAGGED]-(:person)
The next problem comes in because by making these a MATCH, you're saying that there NEEDS to be a match. Which means you'll never get cases with zeros. So you'll want OPTIONAL MATCH:
MATCH (t:topic)
WHERE ID(t)=4
OPTIONAL MATCH (t)<-[upvote:UPVOTED]-(:person)
OPTIONAL MATCH (t)<-[downvote:DOWNVOTED]-(:person)
OPTIONAL MATCH (t)<-[flag:FLAGGED]-(:person)
RETURN COUNT(upvote), COUNT(downvote), COUNT(flag)
Even then, though, what you're saying is: "Find a topic and find all cases where there is 1 upvote, no downvote, no flag; 1 upvote, 1 downvote, no flag; etc." (through all permutations). That means you'll want to COUNT one at a time:
MATCH (t:topic)
WHERE ID(t)=4
OPTIONAL MATCH (t)<-[r:UPVOTED]-(:person)
WITH t, COUNT(r) AS upvotes
OPTIONAL MATCH (t)<-[r:DOWNVOTED]-(:person)
WITH t, upvotes, COUNT(r) AS downvotes
OPTIONAL MATCH (t)<-[r:FLAGGED]-(:person)
RETURN upvotes, downvotes, COUNT(r) AS flags
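The effect of chaining the OPTIONAL MATCH clauses versus counting one at a time can be sketched in Python. This is only a model of the row arithmetic, using the 2 upvotes, 1 downvote, and 1 flag that the question expects; the relationship ids and helper names are assumptions.

```python
from itertools import product

# Relationships pointing at the topic (ids assumed for illustration).
upvotes, downvotes, flags = ["u1", "u2"], ["d1"], ["f1"]


def optional(rels):
    """OPTIONAL MATCH yields a single null when nothing matches."""
    return rels if rels else [None]


def count(values):
    """COUNT(x) ignores nulls."""
    return sum(v is not None for v in values)


# Three chained OPTIONAL MATCH clauses: every combination becomes a row.
rows = list(product(optional(upvotes), optional(downvotes), optional(flags)))
chained = tuple(count(column) for column in zip(*rows))

# Aggregating after each OPTIONAL MATCH collapses the rows before the next one.
staged = (
    count(optional(upvotes)),
    count(optional(downvotes)),
    count(optional(flags)),
)

print(chained)
print(staged)
```

In the chained form each count is inflated by the number of rows the other patterns contribute, so with this data it reports 2 for all three columns, while the staged form reports 2, 1, 1.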
A couple of miscellaneous items:
Be careful about using Neo IDs as a long-term reference because they can be recycled.
Use parameters whenever possible for performance / security (WHERE ID(t)={topic_id})
Also, labels are generally TitleCase. See The Zen of Cypher guide.
Check this query, I think it will help you.
MATCH (p:person)-[a:UPVOTED]->(t:topic),
(p)-[b:DOWNVOTED]->(t),(p)-[c:FLAGGED]->(t)
WHERE ID(t)=4
RETURN COUNT(a) as a_count,COUNT(b) as b_count,COUNT(c) as c_count;
Your current MATCH requires that the same person node (identified by p) have relationships of all 3 types with t. This is because an identifier is bound to a specific node (or relationship, or value), and (unless hidden by a WITH clause, which you do not have in your query) will reference that same node (or relationship, or value) throughout a query.
Based on your expected results, I am assuming that you are just trying to count the number of relationships of those 3 types between any person and t. If so, this is a performant way to do that:
MATCH (t:topic)
WHERE ID(t) = 4
MATCH (:person)-[r:UPVOTED|DOWNVOTED|FLAGGED]->(t)
RETURN REDUCE(s=[0,0,0], x IN COLLECT(r) |
CASE TYPE(x)
WHEN 'UPVOTED' THEN [s[0]+1, s[1], s[2]]
WHEN 'DOWNVOTED' THEN [s[0], s[1]+1, s[2]]
ELSE [s[0], s[1], s[2]+1]
END
) As res;
res is an array with the number of UPVOTED, DOWNVOTED, and FLAGGED relationships, respectively, between any person and t.
Another approach would be to use separate OPTIONAL MATCH statements for each relationship type, returning three COUNT(DISTINCT x) values. But the above query uses a single MATCH statement, greatly reducing the number of DB hits, which are generally expensive.

What is the difference between multiple MATCH clauses and a comma in a Cypher query?

In a Cypher query language for Neo4j, what is the difference between one MATCH clause immediately following another like this:
MATCH (d:Document{document_ID:2})
MATCH (d)--(s:Sentence)
RETURN d,s
Versus the comma-separated patterns in the same MATCH clause? E.g.:
MATCH (d:Document{document_ID:2}),(d)--(s:Sentence)
RETURN d,s
In this simple example the result is the same. But are there any "gotchas"?
There is a difference: comma separated matches are actually considered part of the same pattern. So for instance the guarantee that each relationship appears only once in resulting path is upheld here.
Separate MATCHes are separate operations whose paths don't form a single pattern, so they don't carry this guarantee.
I think it's better to explain providing an example when there's a difference.
Let's say we have the "Movie" database which is provided by official Neo4j tutorials.
There are 10 :WROTE relationships in total between :Person and :Movie nodes:
MATCH (:Person)-[r:WROTE]->(:Movie) RETURN count(r); // returns 10
1) Let's try the next query with two MATCH clauses:
MATCH (p:Person)-[:WROTE]->(m:Movie) MATCH (p2:Person)-[:WROTE]->(m2:Movie)
RETURN p.name, m.title, p2.name, m2.title;
Sure enough, you will see 10*10 = 100 records in the result.
2) Let's try the query with one MATCH clause and two patterns:
MATCH (p:Person)-[:WROTE]->(m:Movie), (p2:Person)-[:WROTE]->(m2:Movie)
RETURN p.name, m.title, p2.name, m2.title;
Now you will see 90 records are returned.
That's because in this case records where p = p2 and m = m2 with the same relationship between them (:WROTE) are excluded.
For example, there IS a record in the first case (two MATCH clauses)
p.name m.title p2.name m2.title
"Aaron Sorkin" "A Few Good Men" "Aaron Sorkin" "A Few Good Men"
while there is no such record in the second case (one MATCH, two patterns).
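The 100 versus 90 arithmetic can be reproduced with a short Python model. It treats the 10 :WROTE relationships as opaque ids (an assumption for illustration) and applies the relationship-uniqueness rule only in the single-MATCH case.

```python
from itertools import product

# Ten :WROTE relationships, represented only by made-up ids.
wrote = [f"r{i}" for i in range(10)]

# MATCH ... MATCH ...: two independent patterns, a plain Cartesian product.
two_matches = list(product(wrote, wrote))

# MATCH ..., ...: one pattern, so a relationship may not be used twice
# within the same result row.
one_match = [(r1, r2) for r1, r2 in two_matches if r1 != r2]

print(len(two_matches), len(one_match))
```

The 10 rows that disappear are exactly those where both parts of the pattern would bind the same :WROTE relationship.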
There are no differences between these provided that the clauses are not linked to one another.
If you did this:
MATCH (a:Thing), (b:Thing) RETURN a, b;
That's the same as:
MATCH (a:Thing) MATCH (b:Thing) RETURN a, b;
Because (and only because) a and b are independent. If a and b were linked by a relationship, then the meaning of the query could change.
In a more generic way, "The same relationship cannot be returned more than once in the same result record." [see 1.5. Cypher Result Uniqueness in the Cypher manual]
Both MATCH-after-MATCH and a single MATCH with comma-separated patterns should logically return a Cartesian product. For comma-separated patterns, however, we must exclude those records that would reuse a relationship already matched in the same record.
In Andy's answer, this is why the repetitions of the same movie were excluded in the second case: within the single MATCH, the second pattern would have used the same :WROTE relationship as the first pattern.
If a part of a query contains multiple disconnected patterns, this will build a cartesian product between all those parts. This may produce a large amount of data and slow down query processing. While occasionally intended, it may often be possible to reformulate the query that avoids the use of this cross product, perhaps by adding a relationship between the different parts or by using OPTIONAL MATCH (identifier is: (a)) .
In short, there is no difference between these two queries, but use them carefully.
How about this statement?
MATCH p1=(v:player)-[e1]->(n)
MATCH p2=(n:team)<-[e2]-(m)
WHERE e1=e2
RETURN e1,e2,p1,p2

Difference between START n = node(*) and MATCH (n)

In Neo4j 2.0 this query:
MATCH (n) WHERE n.username = 'blevine'
OPTIONAL MATCH n-[:Person]->person
OPTIONAL MATCH n-[:UserLink]->role
RETURN n AS user,person,collect(role) AS roles
returns different results than this query:
START n = node(*) WHERE n.username = 'blevine'
OPTIONAL MATCH n-[:Person]->person
OPTIONAL MATCH n-[:UserLink]->role
RETURN n AS user,person,collect(role) AS roles
The first query works as expected returning a single Node for 'blevine' and the associated Nodes mentioned in the OPTIONAL MATCH clauses. The second query returns many more Nodes which do not even have a username property. I realize that start n = node(*) is not recommended and that START is not even required in 2.0. But the second form (with OPTIONAL MATCH replaced with question marks on the relationship type) worked prior to 2.0. In the second form, why is 'n' not being constrained to the single 'blevine' node by the first WHERE clause?
To run the second query as expected, you just need to add WITH n. The result has to be filtered first and then passed on to the OPTIONAL MATCH clauses, which is done using WITH:
START n = node(*) WHERE n.username = 'blevine'
WITH n
OPTIONAL MATCH n-[:Person]->person
OPTIONAL MATCH n-[:UserLink]->role
RETURN n AS user,person,collect(role) AS roles
From the documentation
WHERE defines the MATCH patterns in more detail. The predicates are part of the
pattern description, not a filter applied after the matching is done.
This means that WHERE should always be put together with the MATCH clause it belongs to.
When you do START n=node(*) WHERE n.name="xyz", you need to pass the result explicitly into your subsequent OPTIONAL MATCH clauses. But when you do MATCH (n) WHERE n.name="xyz", this tells the graph specifically which node to start from.
EDIT
Here is the thing. The documentation says OPTIONAL MATCH returns null if a pattern is not found, so in your first case it also includes all those results where the n.username property is null, or where n doesn't even have the relationship given in the OPTIONAL MATCH pattern. So when you do a WITH n, the graph is explicitly told to use only n.
Excerpt from the documentation:
OPTIONAL MATCH matches patterns against your graph database, just like MATCH does.
The difference is that if no matches are found, OPTIONAL MATCH will use NULLs for
missing parts of the pattern. OPTIONAL MATCH could be considered the Cypher
equivalent of the outer join in SQL.
Either the whole pattern is matched, or nothing is matched. Remember that
WHERE is part of the pattern description, and the predicates will be
considered while looking for matches, not after. This matters especially
in the case of multiple (OPTIONAL) MATCH clauses, where it is crucial to
put WHERE together with the MATCH it belongs to.
Also, a few more things to note about the behaviour of the WHERE clause.
Excerpts:
WHERE is not a clause in its own right; rather, it's part of MATCH,
OPTIONAL MATCH, START and WITH.
In the case of WITH and START, WHERE simply filters the results.
For MATCH and OPTIONAL MATCH on the other hand, WHERE adds constraints
to the patterns described. It should not be seen as a filter after the
matching is finished.

Resources