Clarification on multiple MATCH patterns in a Cypher query - neo4j

In the query below, does the 2nd match pattern john-[r?:HAS_SEEN]->(movie) run on the result of the first match john-[:IS_FRIEND_OF]->(user)-[:HAS_SEEN]->(movie)? I am trying to understand whether this is similar to the Unix pipe concept, i.e. the result of the 1st pattern is the input to the 2nd pattern.
start john=node(1)
match
john-[:IS_FRIEND_OF]->(user)-[:HAS_SEEN]->(movie),
john-[r?:HAS_SEEN]->(movie)
where r is null
return movie;

I don't think I would compare multiple MATCH clauses to the UNIX pipes concept. Using multiple, comma-separated matches is just a way of breaking out of the 1-dimensional constraint of writing relationships with a single sentence. For example, the following is completely valid:
MATCH a--b,
b--c,
c--d,
d--e,
a--c
At the very end I went back and referenced a and c even though they weren't used in the clause directly before. Again, this is just a way of drawing 2 dimensions' worth of relationships by only using 1-dimensional sentences. We're drawing a 2-dimensional picture with several 1-dimensional pieces.
On a side note, I WOULD compare the WITH clause to UNIX pipes -- I'd call them analogous. WITH will pipe out any results it finds into the next set of clauses you give it.

Yes, simply think of these two matches as one, i.e.
match (movie)<-[r?:HAS_SEEN]-john-[:IS_FRIEND_OF]->(user)-[:HAS_SEEN]->(movie)
or
match john-[:IS_FRIEND_OF]->(user)-[:HAS_SEEN]->(movie)<-[r?:HAS_SEEN]-john
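To make the "one pattern" reading concrete, here is a minimal Python sketch over a made-up toy graph (the data and the function name unseen_friend_movies are hypothetical, not from the question). Both patterns share the identifier movie, so every candidate has to satisfy them together; nothing is piped from one to the other. It collects results into a set for simplicity, whereas Cypher would return rows.

```python
# Toy graph as sets of (start, end) pairs; all names are hypothetical.
IS_FRIEND_OF = {("john", "sara"), ("john", "joe")}
HAS_SEEN = {("sara", "Matrix"), ("joe", "Alien"), ("john", "Alien")}


def unseen_friend_movies(person):
    """Movies seen by a friend of `person` for which the optional
    person-[r?:HAS_SEEN]->(movie) relationship is missing (r is null)."""
    result = set()
    for start, friend in IS_FRIEND_OF:
        if start != person:
            continue
        for viewer, movie in HAS_SEEN:
            # First pattern: person -> friend -> movie
            if viewer != friend:
                continue
            # Second pattern constrains the same `movie` binding
            if (person, movie) not in HAS_SEEN:
                result.add(movie)
    return result
```

With this data, john's friends have seen Matrix and Alien, but john has already seen Alien, so only Matrix is left.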

Related

Back-reference neo4j

Can I use some back-reference sort of mechanism in neo4j? I'm not interested in what matches the query, just that it is the same thing in many places. Something like:
MATCH (a:Event {diagnosis1:11})
MATCH (b:Event {diagnosis1:15})
MATCH (c:Event {diagnosis1:5})
MATCH (a)-[rel:Next {PatientID:*}]->(b)
MATCH (b)-[rel1:Next {PatientID:\{1}]->(c)
The idea is that I just require the attribute IDs from both edges to be the same, without specifying it. The whole purpose of it would be not generating all possible matchings, to then filter them, but only hop in the specific places.
I've asked something similar in a more specific way here.
Edit: I know WHERE clauses can be used for that, but they filter the query AFTER matching the edges and nodes. I want to do that DURING the matching!
Use a WHERE clause with simple references; there's no need for back-references:
MATCH (a)-[rel:Next]->(b)
MATCH (b)-[rel1:Next]->(c)
WHERE rel.PatientID = rel1.PatientID
Update
First of all, Cypher is a declarative query language: you express what you want, and the runtime takes care of executing and optimizing it any way it can. So it's not obvious that it would execute the query the way you think it will, or that using "back references" would magically solve the problem; it's just another way of writing the same thing.
So, your problem is that the match creates all the relationship pairs before filtering them. How about splitting the match in 2 phases using WITH?
MATCH (a:Event {diagnosis1:11})-[rel:Next]->(b:Event {diagnosis1:15})
WITH a, b, rel
MATCH (b)-[rel1:Next]->(c:Event {diagnosis1:5})
WHERE rel1.PatientID = rel.PatientID
That should only select the second relationships that match the first, but I'm not sure if it's an O(n^2) algorithm in Cypher's runtime.
Otherwise, if you drop to the Java API (which would mean either an extension or a procedure, depending on your version of Neo4j), you can probably implement it in O(n) by
scanning all the relationships between a and b, indexing them by PatientID in some multimap (see Guava, or use a Map<K, Collection<V>>); this is O(n)
then doing the same for all the relationships between b and c, still O(n)
then iterating over the keys of one multimap to get the values in both and match them, still O(n)
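The three steps can be sketched in a language-neutral way. The following is plain Python rather than the Neo4j Java API, and the function name and the (relationship_id, patient_id) tuple shape are assumptions made for illustration only. Note that the work is linear in the input plus the number of matching pairs, so it only stays O(n) if few relationships share the same PatientID.

```python
from collections import defaultdict


def match_by_patient_id(rels_ab, rels_bc):
    """Pair up relationships that carry the same PatientID.

    rels_ab / rels_bc: iterables of (relationship_id, patient_id) tuples
    (a hypothetical shape, standing in for real relationship objects).
    Returns a list of (rel_ab_id, rel_bc_id) pairs.
    """
    # Step 1: index the a->b relationships by PatientID (a simple multimap).
    ab_index = defaultdict(list)
    for rel_id, patient_id in rels_ab:
        ab_index[patient_id].append(rel_id)

    # Step 2: same for the b->c relationships.
    bc_index = defaultdict(list)
    for rel_id, patient_id in rels_bc:
        bc_index[patient_id].append(rel_id)

    # Step 3: walk the keys of one multimap and look them up in the other.
    pairs = []
    for patient_id, ab_ids in ab_index.items():
        for ab_id in ab_ids:
            for bc_id in bc_index.get(patient_id, []):
                pairs.append((ab_id, bc_id))
    return pairs
```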

What does a comma in a Cypher query do?

A co-worker coded something like this:
match (a)-[r]->(b), (c) set c.x=y
What does the comma do? Is it just shorthand for MATCH?
Since Cypher's ASCII-art syntax can only let you specify one linear chain of connections in a row, the comma is there, at least in part, to allow you to specify things that might branch off. For example:
MATCH (a)-->(b)<--(c), (b)-->(d)
That represents three nodes which are all connected to b (two incoming relationships and one outgoing relationship).
The comma can also be useful for separating lines if your match gets too long, like so:
MATCH
(a)-->(b)<--(c),
(c)-->(d)
Obviously that's not a very long line, but that's equivalent to:
MATCH
(a)-->(b)<--(c)-->(d)
But in general, any MATCH statement is specifying a pattern that you want to search for. All of the parts of that MATCH form the pattern. In your case you're actually looking for two unconnected patterns ((a)-[r]->(b) and (c)) and so Neo4j will find every combination of each instance of both patterns, which could potentially be very expensive. In Neo4j 2.3 you'd also probably get a warning about this being a query which would give you a cartesian product.
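As a rough illustration of why two unconnected patterns get expensive, here is a small Python model (the function name and data shapes are made up): every match of (a)-[r]->(b) is paired with every node that can bind c, so the row count is the product of the two.

```python
from itertools import product


def unconnected_pattern_rows(relationships, nodes):
    """Rows for MATCH (a)-[r]->(b), (c): each relationship (a, b)
    combined with each node c, i.e. a cartesian product."""
    return [(a, b, c) for (a, b), c in product(relationships, nodes)]
```

With 2 relationships and 3 nodes this already yields 6 rows, and a SET on c would then run once per row.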
If you specify multiple matches, however, you're asking to search for different patterns. So if you did:
MATCH (a)-[r]->(b)
MATCH (c)
Conceptually I think it's a bit different, but the result is the same. I know it's definitely different with OPTIONAL MATCH, though. If you did:
MATCH (a:Foo)
OPTIONAL MATCH (a)-->(b:Bar), (a)-->(c:Baz)
You would get a row for every Foo node, but b and c are bound only when that node is connected to both a Bar and a Baz node; otherwise both are null. Whereas if you do this:
MATCH (a:Foo)
OPTIONAL MATCH (a)-->(b:Bar)
OPTIONAL MATCH (a)-->(c:Baz)
You'll find every single Foo node, and you'll match zero or more connected Bar and Baz nodes independently.
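The difference between the two forms can be modelled in a few lines of Python. This is only a sketch under assumptions: the function names are made up, bars and bazs are hypothetical dicts mapping a Foo node to its connected Bar or Baz nodes, and no node carries both labels.

```python
def optional_match_combined(foo_nodes, bars, bazs):
    """OPTIONAL MATCH (a)-->(b:Bar), (a)-->(c:Baz):
    the pattern matches as a whole, or b and c are both null."""
    rows = []
    for a in foo_nodes:
        matches = [(a, b, c) for b in bars.get(a, []) for c in bazs.get(a, [])]
        rows.extend(matches if matches else [(a, None, None)])
    return rows


def optional_match_separate(foo_nodes, bars, bazs):
    """Two OPTIONAL MATCH clauses: b and c are resolved independently."""
    rows = []
    for a in foo_nodes:
        for b in bars.get(a, []) or [None]:
            for c in bazs.get(a, []) or [None]:
                rows.append((a, b, c))
    return rows
```

For a Foo node that has a Bar but no Baz, the combined form yields (a, None, None) while the separate form keeps the Bar and yields (a, bar, None).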
EDIT:
In the comments Stefan Armbruster made a good point that commas can also be used to assign subpatterns to individual identifiers. Such as in:
MATCH path1=(a)-[:REL1]->(b), path2=(b)<-[:REL2*..10]-(c)
Thanks Stefan!
EDIT2: Also see Mats' answer below
Brian does a good job of explaining how the comma can be used to construct larger subgraph patterns, but there's also a subtle yet important difference between using the comma and a second MATCH clause.
Consider the simple graph of two nodes with one relationship between them. The query
MATCH ()-->() MATCH ()-->() RETURN 1
will return one row with the number 1. Replace the second MATCH with a comma, however, and no rows will be returned at all:
MATCH ()-->(), ()-->() RETURN 1
This is because of the notion of relationship uniqueness. Inside each MATCH clause, each relationship will be traversed only once. That means that for my second query, the one relationship in the graph will be matched by the first pattern, and the second pattern will not be able to match anything, leading to the whole pattern not matching. My first query will match the one relationship once in each of the clauses, and thus create a row for the result.
Read more about this in the Neo4j manual: http://neo4j.com/docs/stable/cypherdoc-uniqueness.html
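A tiny Python model of the two queries may help. It is a sketch, not how Neo4j is implemented: relationships are represented by made-up identifiers, and uniqueness is modelled as "no relationship may be used twice within one clause".

```python
from itertools import product


def comma_pattern_rows(relationship_ids):
    """MATCH ()-->(), ()-->(): one clause, so the two patterns
    must bind two different relationships."""
    return [(r1, r2) for r1, r2 in product(relationship_ids, repeat=2) if r1 != r2]


def two_clause_rows(relationship_ids):
    """MATCH ()-->() MATCH ()-->(): uniqueness applies per clause,
    so the second clause may reuse the relationship of the first."""
    return list(product(relationship_ids, repeat=2))
```

With a single relationship in the graph, the first function returns no rows and the second returns exactly one.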

Is it possible to reduce/optimize this query for node degrees?

Given the following Cypher query that returns afferent (inbound) and efferent (outbound) connections, and the sum as the node degree:
START n = node(*)
RETURN n.name, length((n)-->()) AS efferent,
length((n)<--()) AS afferent,
length((n)-->()) + length((n)<--()) AS degree
Is it possible to reduce the query so that the two length() functions are not repeated in the summation in the degree column?
You can resolve and bind the two length computations separately from and before returning by using WITH. Then you can sum those bound values while returning.
START n = node(*)
WITH n, length((n)-->()) AS efferent, length((n)<--()) AS afferent
RETURN n.name, efferent, afferent, efferent + afferent AS degree
You may want to use MATCH (n) instead of START n = node(*) if your Neo4j version is >2.0, but that's not what you're asking about so I'll assume you know what you are doing.
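The same "compute once, bind, then reuse" idea looks like this outside Cypher. This is a minimal Python sketch, assuming the graph is given as a list of node names and a list of (start, end) pairs; the function name is made up.

```python
from collections import Counter


def node_degrees(nodes, relationships):
    """Return (name, efferent, afferent, degree) per node.
    Each count is computed once and reused for the sum,
    which is the role WITH plays in the query."""
    out_degree = Counter(start for start, _ in relationships)
    in_degree = Counter(end for _, end in relationships)
    rows = []
    for n in nodes:
        efferent, afferent = out_degree[n], in_degree[n]
        rows.append((n, efferent, afferent, efferent + afferent))
    return rows
```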
EDIT
In Neo4j 1.x START is how you began a query. From 2.x and on, while START is still around, MATCH is the preferred way. If you have Neo4j 2.x and don't know a particular reason why you should use START, then you should use MATCH. Here's a short explanation of why.
Your query is written to touch the entire graph. When that is the intention there is not a very big difference between START n = node(*) and MATCH (n). The execution plans do differ, but I'm not aware that the difference is very important.
If, however, you want to perform your computations only on part of the graph, and you add to your 'starting point pattern' to that effect, then there will be significant differences. If, for example, you want to perform your computation only on nodes with the :User label, then
START n = node(*)
WHERE n:User
will still pull up all nodes, and then apply a filter to discard those that don't have the label, whereas
MATCH (n)
WHERE n:User
will only pull up the nodes that have that label to begin with.
The general difference is this: WHERE is a dependent clause accompanying START, MATCH, OPTIONAL MATCH or WITH. When it accompanies START or WITH it does not work by modifying the operation but by filtering the results; when it accompanies MATCH and OPTIONAL MATCH it modifies (as often as it can) the operation and therefore doesn't have to filter the results. The difference is that between shouting "Everyone, if you are my child, don't go into the road" and "Kids, don't go into the road".
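To picture the two behaviours, here is a small Python sketch. It is an analogy only: the dict-based node store and label index are assumptions for illustration and say nothing about Neo4j's actual storage. Both functions return the matching node ids together with the number of nodes they had to look at.

```python
def scan_then_filter(all_nodes, label):
    """Like START n = node(*) WHERE n:User:
    visit every node, then discard the ones without the label."""
    visited = 0
    hits = []
    for node_id, labels in all_nodes.items():
        visited += 1
        if label in labels:
            hits.append(node_id)
    return hits, visited


def lookup_by_label(label_index, label):
    """Like MATCH (n) WHERE n:User when the predicate is folded
    into the match: only nodes carrying the label are visited."""
    hits = list(label_index.get(label, []))
    return hits, len(hits)
```

Both return the same nodes; the difference is how many nodes were touched to get there.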
There are cases when WHERE is not pulled into the MATCH clause. One example is
MATCH n
WHERE n:Male OR n:Female
In this case all nodes are pulled up and then filtered, just as if we had used START instead of MATCH.
Sometimes it is easy to know which patterns in the WHERE clause are able to be pulled in to modify the MATCH. This is the case for patterns that you can move into the MATCH clause yourself, by simply rearranging the query. The first MATCH example above could also be expressed
MATCH (n:User)
There is no way, however, to do this for the WHERE clause in the second MATCH example, WHERE n:Male OR n:Female.
That a WHERE pattern cannot be moved into the MATCH clause by reformulating the query is not a reliable indicator that the query planner is unable to make use of it in the match operation. Because Cypher is a declarative language, you ultimately have to trust the query planner to implement the instructions wisely; trust, but verify.
One other difference between START and MATCH pertains to indexing. If you use 'legacy indexing' then you need to use START to access these indices. The 'new' label indices (about two years old, I believe) have continuously been improved for features and efficiency, and we are running out of reasons to use the old indices. I think the only reason left may be full-text indexing, for which a configured legacy Lucene index is still necessary. In time this feature will also be added to the label indices. Possibly, at that point, the START clause will be removed from Cypher altogether, but that is just the author's speculation.

Cypher matched results differ with "with" and without "with"

First of all, sorry for the clumsy title; I couldn't think of a better way to put it.
The issue is that I get different results when querying Cypher with a single match---result structure and when splitting it into a match---with---match---result structure.
The match---result skips certain results.
My code:
match---result query
match (up:U)-[r1:COCS]->(op:O)-[r2:CCLS]->(jp:J)-[r3:PRE]->(n:J{id:"AC"})<-[j2o:CCLS]-(o:O)<-[o2u:COCS]-(u:U)
return up,type(r1), op, type(r2), jp, type(r3), n, type(j2o), o, type(o2u), u
Returns fewer results (there are results missing that match the path structure).
match---with---match---result query
match (up:U)-[r1:COCS]->(op:O)-[r2:CCLS]->(jp:J)-[r3:PRE]->(n:J{id:"AC"})
with up, r1, op, r2, jp, r3, n
match (n)<-[j2o:CCLS]-(o:O)<-[o2u:COCS]-(u:U)
return up,type(r1), op, type(r2), jp, type(r3), n, type(j2o), o, type(o2u), u
Returns the correct results
I do not understand why this is so. It makes no sense to me.
The way I understand how with works, both should return the same results. Can someone shed some light on this?
This is with Neo4j 2.1.6
Thank you.
I can think of an explanation for this seemingly anomalous behavior.
To quote from the neo4j manual:
While pattern matching, Neo4j makes sure to not include matches where
the same graph relationship is found multiple times in a single
pattern. In most use cases, this is a sensible thing to do.
In your first query, the following sub-pattern appears twice (once on either side of the (n) node):
(:U)-[:COCS]->(:O)
Since the first query consists of a single pattern, Cypher would prevent the same COCS relationship from showing up twice in the same result row. In your case, this prevented some rows from showing up in the results.
Your second query splits the original query so that the above sub-pattern no longer appears twice in a single pattern. Therefore, you got "complete" results.
So, the lesson here is: if you use a pattern that repeats a relationship subpattern, make sure that you really intend to filter out those rows in which the same relationship instance shows up multiple times.
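Here is a small Python model of that effect. It is a sketch under assumptions: the function name, the tuple shapes and the sample data are made up, and relationship ids are assumed to be unique across types. With enforce_uniqueness=True it mimics the single-pattern query. With False it mimics the split query, which is accurate here because each of the two MATCH clauses only contains relationships of different types.

```python
def mirrored_rows(cocs, ccls, pre, target, enforce_uniqueness):
    """Enumerate (up, op, jp, n, o, u) for the mirrored pattern.

    cocs: (rel_id, user, org), ccls: (rel_id, org, job),
    pre: (rel_id, job_from, job_to); target is the id of node n.
    """
    rows = []
    for r1, up, op in cocs:
        for r2, org_left, jp in ccls:
            if org_left != op:
                continue
            for r3, job_from, n in pre:
                if job_from != jp or n != target:
                    continue
                for j2o, o, job_right in ccls:
                    if job_right != n:
                        continue
                    for o2u, u, org_right in cocs:
                        if org_right != o:
                            continue
                        rels = [r1, r2, r3, j2o, o2u]
                        # Single pattern: no relationship may appear twice.
                        if enforce_uniqueness and len(set(rels)) != len(rels):
                            continue
                        rows.append((up, op, jp, n, o, u))
    return rows
```

If one user belongs to an organisation that has both the preceding job and the AC job, the only possible row reuses the same COCS relationship on both sides, so the single-pattern form drops it and the split form keeps it.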

neo4j: optional 'steps' in Cypher query

I am trying to find relations between nodes with optional but specific nodes/relationships in between (neo4j 2.0 M6).
In my data model, 'Gene' can be 'PARTOF' a 'Group'. I have 'INTERACT' relationships between 'Gene'-'Gene', 'Gene'-'Group' and 'Group'-'Group' (red lines in model image).
I want to boil this down to all 'INTERACT' relationships between 'Gene': both direct (Gene-INTERACT-Gene) and via one or two 'Group' (Gene-PARTOF-Group-INTERACT-Gene).
Of course this is easy with multiple Cypher queries:
// direct INTERACT
MATCH (g1:Gene)-[r:INTERACT]-(g2:Gene) RETURN g1, g2
// INTERACT via one Group
MATCH (g1:Gene)-[:PARTOF]-(gr:Group)-[r:INTERACT]-(g2:Gene) RETURN g1, g2
// INTERACT via two Group
MATCH (g1:Gene)-[:PARTOF]-(gr1:Group)-[r:INTERACT]-(gr2:Group)-[:PARTOF]-(g2:Gene)
RETURN g1, g2
But would it be possible to construct a single Cypher query that takes optional 'Group steps' in the path? So far I only used optional relationships and shortestPaths, but I have no idea if I can filter for one or two optional nodes in between two genes.
You can assign a depth between zero and one for each of the relationships you add to the path. Try something like
MATCH (g1:Gene)-[:PARTOF*0..1]-(gr1)-[:INTERACT]-(gr2)-[:PARTOF*0..1]-(g2:Gene)
RETURN g1,g2
and to see what the matched paths actually look like, just return the whole path
MATCH p=(g1:Gene)-[:PARTOF*0..1]-(gr1)-[:INTERACT]-(gr2)-[:PARTOF*0..1]-(g2:Gene)
RETURN p
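To see what the *0..1 steps buy you, here is a rough Python model of the pattern. The function names and data shapes are hypothetical, and it assumes genes are only PARTOF groups, INTERACT has no self-loops, and direction is ignored, as in the query. A zero-length PARTOF step leaves you on the gene itself, a one-length step moves you to one of its groups, and INTERACT has to connect whatever you end up on.

```python
def endpoints(gene, part_of):
    """Nodes reachable from `gene` via zero or one PARTOF step:
    the gene itself plus the groups it is part of."""
    return [gene] + sorted(part_of.get(gene, []))


def interacting_gene_pairs(genes, part_of, interact):
    """Gene pairs linked by INTERACT directly, via one group,
    or via two groups. `interact` is a list of undirected pairs."""
    symmetric = set(interact) | {(b, a) for a, b in interact}
    pairs = set()
    for g1 in genes:
        for g2 in genes:
            for left in endpoints(g1, part_of):
                for right in endpoints(g2, part_of):
                    if (left, right) in symmetric:
                        pairs.add((g1, g2))
    return pairs
```

Because the pattern is undirected, each pair shows up in both orders, just as it would in the Cypher result.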
There is a bit of a bother with declaring node labels for the optional parts of this pattern, however, so this query assumes that genes are not part of anything other than groups, and that they only interact with groups and other genes. If a gene can have [:PARTOF] to something else, then (gr1) will bind that something, and the query is no longer reliable. Simply adding a where clause like
WHERE gr1:Group AND gr2:Group
excludes the case where the optional parts are not matched, so that won't work (it'd be like your third query). I'm sure it can be solved, but if your actual model is not much more complex than what you describe in your question, this should work.
I took the liberty of interpreting your model in a console here, check it out to see if it does what you want.

Resources