Cypher query optimisation - Utilising known properties of nodes - neo4j

Setup:
Neo4j and Cypher version 2.2.0.
I'm querying Neo4j as an in-memory instance in Eclipse created TestGraphDatabaseFactory().newImpermanentDatabase();.
I'm using this approach as it seems faster than the embedded version and I assume it has the same functionality.
My graph database is randomly generated programmatically with varying numbers of nodes.
Background:
I generate cypher queries automatically. These queries are used to try and identify a single 'target' node. I can limit the possible matches of the queries by using known 'node' properties. I only use a 'name' property in this case. If there is a known name for a node, I can use it to find the node id and use this in the start clause. As well as known names, I also know (for some nodes) if there are names known not to belong to a node. I specify this in the where clause.
The sorts of queries that I am running look like this...
START
nvari = node(5)
MATCH
(target:C5)-[:IN_LOCATION]->(nvara:LOCATION),
(nvara:LOCATION)-[:CONNECTED]->(nvarb:LOCATION),
(nvara:LOCATION)-[:CONNECTED]->(nvarc:LOCATION),
(nvard:LOCATION)-[:CONNECTED]->(nvarc:LOCATION),
(nvard:LOCATION)-[:CONNECTED]->(nvare:LOCATION),
(nvare:LOCATION)-[:CONNECTED]->(nvarf:LOCATION),
(nvarg:LOCATION)-[:CONNECTED]->(nvarf:LOCATION),
(nvarg:LOCATION)-[:CONNECTED]->(nvarh:LOCATION),
(nvari:C4)-[:IN_LOCATION]->(nvarg:LOCATION),
(nvarj:C2)-[:IN_LOCATION]->(nvarg:LOCATION),
(nvare:LOCATION)-[:CONNECTED]->(nvark:LOCATION),
(nvarm:C3)-[:IN_LOCATION]->(nvarg:LOCATION),
WHERE
NOT(nvarj.Name IN ['nf']) AND NOT(nvarm.Name IN ['nb','nj'])
RETURN DISTINCT target
Another way to think about this (if it helps), is that this is an isomorphism testing problem where we have some information about how nodes in a query and target graph correspond to each other based on restrictions on labels.
Question:
With regards to optimisation:
Would it help to include relation variables in the match clause? I took them out because the node variables are sufficient to distinguish between relationships but this might slow it down?
Should I restructure the match clause to have match/where couples including the where clauses from my previous example first? My expectation is that they can limit possible bindings early on. For example...
START
nvari = node(5)
MATCH
(nvarj:C2)-[:IN_LOCATION]->(nvarg:LOCATION)
WHERE NOT(nvarj.Name IN ['nf'])
MATCH
(nvarm:C3)-[:IN_LOCATION]->(nvarg:LOCATION)
WHERE NOT(nvarm.Name IN ['nb','nj'])
MATCH
(target:C5)-[:IN_LOCATION]->(nvara:LOCATION),
(nvara:LOCATION)-[:CONNECTED]->(nvarb:LOCATION),
(nvara:LOCATION)-[:CONNECTED]->(nvarc:LOCATION),
(nvard:LOCATION)-[:CONNECTED]->(nvarc:LOCATION),
(nvard:LOCATION)-[:CONNECTED]->(nvare:LOCATION),
(nvare:LOCATION)-[:CONNECTED]->(nvarf:LOCATION),
(nvarg:LOCATION)-[:CONNECTED]->(nvarf:LOCATION),
(nvarg:LOCATION)-[:CONNECTED]->(nvarh:LOCATION),
(nvare:LOCATION)-[:CONNECTED]->(nvark:LOCATION)
RETURN DISTINCT target
On the side:
(Less important but still an interest) If I make each relationship in a match clause an optional match except for relationships containing the target node, would cypher essentially be finding a maximum common sub-graph between the query and the graph data base with the constraint that the MCS contains the target node?
Thanks a lot in advance! I hope I have made my requirements clear but I appreciate that this is not a typical use-case for Neo4j.

I think querying with node properties is almost always preferable to using relationship properties (if you had a choice), as that opens up the possibility that indexing can help speed up the query.
As an aside, I would avoid using the IN operator if the collection of possible values only has a single element. For example, this snippet: NOT(nvarj.Name IN ['nf']), should be (nvarj.Name <> 'nf'). The current versions of Cypher might not use an index for the IN operator.
Restructuring a query to eliminate undesirable bindings earlier is exactly what you should be doing.
First of all, you would need to keep using MATCH for at least the first relationship in your query (which binds target), or else your result would contain a lot of null rows -- not very useful.
But, thinking clearly about this, if all the other relationships were placed in separate OPTIONAl MATCH clauses, you'd be essentially saying that you want a match even if none of the optional matches succeeded. Therefore, the logical equivalent would be:
MATCH (target:C5)-[:IN_LOCATION]->(nvara:LOCATION)
RETURN DISTINCT target
I don't think this is a useful result.

Related

Cypher return multiple hops through pattern of relationships and nodes

I'm making a proof of concept access control system with neo4j at work, and I need some help with Cypher.
The data model is as follows:
(:User|Business)-[:can]->(:Permission)<-[:allows]-(:Business)
Now I want to get a path from a User or a Business to all the Business-nodes that you can reach trough the
-[:can]->(:Permission)<-[:allows]-
pattern. I have managed to write a MATCH that gets me halfway there:
MATCH
path =
(:User {userId: 'e96cca53-475c-4534-9fe1-06671909fa93'})-[:can|allows*]-(b:Business)
but this doesn't have any directions, and I can't figure out how to include the directions without reducing the returned matches to only the direct matches (i.e it doesn't continue after the first hit on a :Business node)
So what I'm wondering is:
Is there a way to match multiple of these hops in one query?
Should I model this entirely different?
Am I on the wrong path completely and the query should be completely
rewritten
Currently the syntax of variable-length expansions doesn't allow fine control for separate directions for certain types. There are improvements in the pipeline around this, but for the moment Cypher alone won't get you what you want.
We can use APOC Procedures for this, as fine control of the direction of types in the expansion, and sequences of relationships, are supported in the path expander procs.
First, though, you'll need to figure out how to address your user-or-business match, either by adding a common label to these nodes by which you can MATCH either type by property, or you can use a subquery with two UNIONed queries, one for :Business nodes, the other for :User nodes, that way you can still take advantage of an index on either, and get possible results in a single variable.
Once you've got that, you can use apoc.path.expandConfig(), passing some options to get what you want:
// assume you've matched to your `start` node already
CALL apoc.path.expandConfig(start, {relationshipFilter:'can>|<allows', labelFilter:'>Business'}) YIELD path
RETURN path
This one doesn't use sequences, but it does restrict the direction of expansion per relationship type. We are also setting the labelFilter such that :Business nodes are the end node of the path and not nodes of any other label.
You can specify the path as follows:
MATCH path = (:User {userId: $id})-[:can]->(:Permission)
<-[:allows]-(:Business))
RETURN path
This should return the results you're after.
I see a good solution has been provided via path expanding APOC procedures.
But I'll focus on your point #2: "Should I model this entirely differently?"
Well, not entirely but I think yes.
The really liberating part of working with Neo4j is that you can change the road you are driving over as easily as you can change your driving strategy: model vs query. And since you are at an early stage in your project, you can experiment with different models. There's a good opportunity to make just a semantic change to make an 'end run' around the problem.
The semantics of a relationship in Neo4j are expressed through
the mandatory TYPE you assign to the relationship, combined with
the direction you choose to point the mandatory arrow
The trick you solved with APOC was how to traverse a path of relationships that alternate between pointing forward and backward along the query's path. But before reaching for a power tool, why not just reverse the direction of either of your relationship types. You can change the model for allows from
<-[:allows]-
to
-[:is_allowed_by]->
and that buys you a lot. Now the directions of both relationships are the same and you can combine both relationships into a single relationship in the match pattern. And the path traversal can be expressed like this, short & sweet:
(u:User)-[:can|is_allowed_by*]->(c:Company)
That will literally go to all lengths to find every user-to-company path, branching included.

How to query Neo4j N levels deep with variable length relationship and filters on each level

I'm new(ish) to Neo4j and I'm attempting to build a tool that allows users on a UI to essentially specify a path of nodes they would like to query neo4j for. For each node in the path they can specify specific properties of the node and generally they don't care about the relationship types/properties. The relationships need to be variable in length because the typical use case for them is they have a start node and they want to know if it reaches some end node without caring about (all of) the intermediate nodes between the start and end.
Some restrictions the user has when building the path from the UI is that it can't have cycles, it can't have nodes who has more than one child with children and nodes can't have more than one incoming edge. This is only enforced from their perspective, not in the query itself.
The issue I'm having is being able to specify filtering on each level of the path without getting strange behavior.
I've tried a lot of variations of my Cypher query such as breaking up the path into multiple MATCH statements, tinkering with the relationships and anything else I could think of.
Here is a Gist of a sample Cypher dump
cypher-dump
This query gives me the path that I'm trying to get however it doesn't specify name or type on n_four.
MATCH path = (n_one)-[*0..]->(n_two)-[*0..]->(n_three)-[*0..]->(n_four)
WHERE n_one.type IN ["JCL_JOB"]
AND n_two.type IN ["JCL_PROC"]
AND n_three.name IN ["INPA", "OUTA", "PRGA"]
AND n_three.type IN ["RESOURCE_FILE", "COBOL_PROGRAM"]
RETURN path
This query is what I'd like to work however it leaves out the leafs at the third level which I am having trouble understanding.
MATCH path = (n_one)-[*0..]->(n_two)-[*0..]->(n_three)-[*0..]->(n_four)
WHERE n_one.type IN ["JCL_JOB"]
AND n_two.type IN ["JCL_PROC"]
AND n_three.name IN ["INPA", "OUTA", "PRGA"]
AND n_three.type IN ["RESOURCE_FILE", "COBOL_PROGRAM"]
AND n_four.name IN ["TAB1", "TAB2", "COPYA"]
AND n_four.type IN ["RESOURCE_TABLE", "COBOL_COPYBOOK"]
RETURN path
What I've noticed is that when I "... RETURN n_four" in my query it is including nodes that are at the third level as well.
This behavior is caused by your (probably inappropriate) use of [*0..] in your MATCH pattern.
FYI:
[*0..] matches 0 or more relationships. For instance, (a)-[*0..]->(b) would succeed even if a and b are the same node (and there is no relationship from that node back to itself).
The default lower bound is 1. So [*] is equivalent to [*..] and [*1..].
Your 2 queries use the same MATCH pattern, ending in ...->(n_three)-[*0..]->(n_four).
Your first query does not specify any WHERE tests for n_four, so the query is free to return paths in which n_three and n_four are the same node. This lack of specificity is why the query is able to return 2 extra nodes.
Your second query specifies WHERE tests for n_four that make it impossible for n_three and n_four to be the same node. The query is now more picky, and so those 2 extra nodes are no longer returned.
You should not use [*0..] unless you are sure you want to optionally match 0 relationships. It can also add unnecessary overhead. And, as you now know, it also makes the query a bit trickier to understand.

Adding relationship to existing nodes with Cypher doesn't work

I am working on Panama dataset using Neo4J graph database 1.1.5 web version. I identified Ion Sturza, former Prime Minister of Moldova on the database and want to make a map of his related network. I used following code to query using Cypher (creating a variable 'IonSturza'):
MATCH (IonSturza {name: "Ion Sturza"}) RETURN IonSturza
I identified that the entity 'CONSTANTIN LUTSENKO' linked differently to entities like 'Quade..' and 'Kinbo...' with a name in small letters as in this picture. I hence want to map a relationship 'SAME_COMPANY_AS' between the capslock and the uncapped version. I tried the following code based on this answer by #StefanArmbruster:
MATCH (a:Officer {name :"Constantin Lutsenko"}),(b:Officer{name :
"CONSTANTIN LUTSENKO"})
where (a:Officer{name :"Constantin Lutsenko"})-[:SHAREHOLDER_OF]->
(b:Entity{id:'284429'})
CREATE (a)-[:SAME_COMPANY_AS]->(b)
Instead of indexing, I used the 'where' statement to specify the uncapped version which is linked only to the entity bearing id '284429'.
My code however shows the cartesian product error message:
This query builds a cartesian product between disconnected patterns.If a part of a query contains multiple disconnected patterns, this will build a cartesian product between all those parts. This may produce a large amount of data and slow down query processing. While occasionally intended, it may often be possible to reformulate the query that avoids the use of this cross product, perhaps by adding a relationship between the different parts or by using OPTIONAL MATCH (identifier is: (b))<<
Also when I execute, there are no changes, no rows!! What am I missing here? Can someone please help me with inserting this relationship between the nodes. Thanks in advance!
The cartesian product warning will appear whenever you're matching on two or more disconnected patterns. In this case, however, it's fine, because you're looking up both of them by what is likely a unique name, s your result should be one node each.
If each separate part of that pattern returned multiple nodes, then you would have (rows of a) x (rows of b), a cartesian product between the two result sets.
So in this particular case, don't mind the warning.
As for why you're not seeing changes, note that you're reusing variables for different parts of the graph: you're using variable b for both the uppercase version of the officer, and for the :Entity in your WHERE. There is no node that matches to both.
Instead, use different variables for each, and include the :Entity in your match. Also, once you match to nodes and bind them to variables, you can reuse the variable names later in your query without having to repeat its labels or properties.
Try this:
MATCH (a:Officer {name :"Constantin Lutsenko"})-[:SHAREHOLDER_OF]->
(:Entity{id:'284429'}),(b:Officer{name : "CONSTANTIN LUTSENKO"})
CREATE (a)-[:SAME_COMPANY_AS]->(b)
Though I'm not quite sure of what you're trying to do...is an :Officer a company? That relationship type doesn't quite seem right.
I tried the answer by #InverseFalcon and thanks to it, by modifying the property identifier from 'id' to 'name' and using the property for both 'a' and 'b', 4 relationships were created by the following code:
MATCH (a:Officer {name :"Constantin Lutsenko"})-[:SHAREHOLDER_OF]->
(:Entity{name:'KINBOROUGH PORTFOLIO LTD.'}),(b:Officer{name : "CONSTANTIN
LUTSENKO"})-[:SHAREHOLDER_OF]->(:Entity{name:'Chandler Group Holdings Ltd'})
CREATE (a)-[:SAME_NAME_AS]->(b)
Thank you so much #InverseFalcon!

Back-reference neo4j

Can I use some back-reference sort of mechanism in neo4j? I'm not interested in what matches the query, just that it is the same thing in many places. Something like:
MATCH (a:Event {diagnosis1:11})
MATCH (b:Event {diagnosis1:15})
MATCH (c:Event {diagnosis1:5})
MATCH (a)-[rel:Next {PatientID:*}]->(b)
MATCH (b)-[rel1:Next {PatientID:\{1}]->(c)
The idea is that I just require the attribute IDs from both edges to be the same, without specifying it. The whole purpose of it would be not generating all possible matchings, to then filter them, but only hop in the specific places.
I've asked something similar in a more specific way here.
Edit: I know WHERE clauses can be used for that, but they filter the query AFTER matching the edges and nodes. I want to do that DURING the matching!
Use a WHERE clause with simple references, there's no need for back references:
MATCH (a)-[rel:Next]->(b)
MATCH (b)-[rel1:Next]->(c)
WHERE rel.PatientID = rel1.PatientID
Update
First of all, Cypher is a declarative query language: you express what you want, the runtime takes care of executing and optimizing it any way it can, so it's not that obvious that it would do it the way you think it will, or that using "back references" would magically solve the problem; it's just another way of writing the same thing.
So, your problem is that the match creates all the relationship pairs before filtering them. How about splitting the match in 2 phases using WITH?
MATCH (a:Event {diagnosis1:11})-[rel:Next]->(b:Event {diagnosis1:15})
WITH a, b, rel
MATCH (b)-[rel1:Next]->(c:Event {diagnosis1:5})
WHERE rel1.PatientID = rel.PatientID
That should only select the second relationships that match the first, but I'm not sure if it's an O(n^2) algorithm in Cypher's runtime.
Otherwise, if you drop to the Java API (which would mean either an extension or a procedure, depending on your version of Neo4j), you can probably implement in O(n) by
scanning all the relationships between a and b, indexing them by PatientID in some multimap (see Guava, or use a Map<K, Collection<V>>); this is O(n)
then doing the same for all the relationships between b and c, still O(n)
iterate on the keys of one multimap to get the values in both and match them, still O(n)

Cypher match path based on previous relationship

I am trying to impose restrictions on my path match pattern.
I would like to match the next relationship based on the type of the previous used relationship.
Here is an example of a simplified Database:
(A)-1-(B)-2-(C)-1-(E)-2-(F)
| |
3----(D)----3
Query 1:
start n=node(A), m=node(F)
match p=n-[r*]-m
return p
should result both paths
(A)-1-(B)-2-(C)-1-(E)-2-(F)
(A)-1-(B)-3-(D)-3-(E)-2-(F)
However, when running the query starting from node (F):
start m=node(F),n=node(A)
match p=m-[r*]-n
return p
The result should be only:
(F)-2-(E)-1-(C)-2-(B)-1-(A)
Path
(F)-2-(E)-3-(D)-3-(B)-1-(A)
should not be valid, since it violates these constrains:
Coming from a -1- type relationship you can proceed to either a
-2- or -3- relationship.
Coming from a -2- or -3- type relationship you can only proceed to
a -1- relationship.
These paths are valid:
()-1-()-2-()
()-1-()-3-()
()-2-()-1-()
()-3-()-1-()
These path are not valid:
()-3-()-2-()
()-2-()-3-()
First, upvote for the very detailed, specific, and well laid out question.
Unfortunately, I don't think it's possible to do what you want to do with Cypher, I think you need the Traversal API to do this. Cypher is a declarative language; that is, you tell it what you want, and it goes and gets it for you. Using the traversal API is an imperative approach to query; that is, you tell neo4j exactly how to traverse the graph.
Your query here imposes constraints about the order in which relationships get traversed, and what makes a valid path. Nothing wrong with that, but I believe that imposing constraints on the order of traversal implicitly means you're telling cypher which way to traverse, and you just can't do that with cypher because it's declarative. Another common example of the declarative vs. imperative thing is breadth-first vs. depth-first search. If you're looking for certain nodes, you can't tell cypher to traverse breadth-first vs. depth-first; you just tell it which nodes you want, and it goes and gets them.
Now, paths can be treated like collections in cypher via the relationships() function. And you can use the filter and reduce functions to work with individual relationships. But your query is harder in that you need code that says something like "If the first relationship in a path is a 1, then the next must be a 2 or a 3". This is exactly the sort of thing that you can do with Evaluators in the Traversal API. Check the interface and you can see how you could write your own method that would implement exactly the logic you're talking about via Evaluator#evaluate(Path path).
As a general note, because declarative query (cypher) hides traversal details from you, IMHO it's always better to use declarative query for ease, if you can specify what you want declaratively. But there are cases where you have to control the order of traversal, and for that you need traversal. (I have had cases where I need all nodes connected to something else, via breadth-first search only, to a maximum depth of 3, along complex relationship criteria -- I couldn't use cypher for that).
To give you a way forward, check the link I provided on the traversal framework. Perhaps you can describe your query as a TraversalDescription, and then hand it off to neo4j to run.

Resources