neo4j query to find indirect paths runs extremely slow - neo4j

MATCH (a:Author {id:'author_1'}),
(art:Article {id:'PMID:21473878'})
WITH a, art
MATCH r=((a)-[*2..4]-(art))
RETURN r
In a database with roughly 1.3 million nodes and 8 million relations this query runs forever. Is there anything I can do?
There are indexes on :Author and :Article id
===============

In this particular case, the query planner can sometimes use an inefficient approach when matching to patterns connecting two nodes you already know.
In this case, the planner takes one node as the start node, expands to all possible nodes from the given pattern, and then applies a filter on all of those nodes to see if it's the other node at the end of the match. This is unnecessary property access, especially with large numbers of matched nodes.
The better approach is for both your start and end node to be looked up via the index, then perform expansions from one of those nodes, and use a hash join to determine which of those end nodes is the same as the end node you're looking for. This approach only uses property access once when matching to the id of the end node in question (instead of for every single node found at the end of the expansion).
The trick right now is how to get Neo4j to use this approach in the planner. This may work:
MATCH (a:Author {id:'author_1'}),
(art:Article {id:'PMID:21473878'})
MATCH r=((a)-[*2..4]-(end))
WHERE end = art
RETURN r
At the least, I'd expect this to be about as fast as your approach using an OPTIONAL MATCH.

The first two matches are not necessary. Remove it and try:
MATCH r=(:Author {id:'author_1'})-[*2..4]-(:Article {id:'PMID:21473878'})
return r

Related

Adding a property filter to cypher query explodes memory, why?

I'm trying to write a query that explores a DAG-type graph (a bill of materials) for all construction paths leading down to a specific part number (second MATCH), among all the parts associated with a given product (first MATCH). There is a strange behavior I don't understand:
This query runs in a reasonable time using Neo4j community edition (~2 s):
WITH '12345' as snid, 'ABCDE' as pid
MATCH (m:Product {full_sn:snid})-[:uses]->(p:Part)
WITH snid, pid, collect(p) AS mparts
MATCH path=(anc:Part)-[:has*]->(child:Part)
WHERE ALL(node IN nodes(path) WHERE node IN mparts)
WITH snid, path, relationships(path)[-1] AS rel,
nodes(path)[-2] AS parent, nodes(path)[-1] AS child
RETURN stuff I want
However, to get the query I want, I must add a filter on the child using the part number pid in the second MATCH statement:
MATCH path=(anc:Part)-[:has*]->(child:Part {pn:pid})
And when I try to run the new query, neo4j browser compains that there is not enough memory. (Neo.TransientError.General.OutOfMemoryError). When I run it with EXPLAIN, the db hits are exploding into the 10s of billions, as if I'm asking it for a massive cartestian product: but all I have done is added a restriction on the child, so this should be reducing the search space, shouldn't it?
I also tried adding an index on :Part(pn). Now the profile shown by EXPLAIN looks very efficient, but I still have the same memory error.
If anyone can help me understand why this change between the two queries is causing problems, I'd greatly appreciate it!
Best wishes,
Ben
MATCH path=(anc:Part)-[:has*]->(child:Part)
The * is exploding to every downstream child node.
That's appropriate if that is what's desired. If you make this an optional match and limit to the collect items, this should restrict the return results.
OPTIONAL MATCH path=(anc:Part)-[:has*]->(child:Part)
This is conceptionally (& crudely) similar to an inner join in SQL.

Neo4j stop searching an undirected path when given node is encountered

I have the following test data in Neo4j:
merge (n1:device {name:"n1"})-[:phys {name:"phys"}]->(:interface {name:"n1a"})-[:cable {name:"cable"}]->(:interface {name:"n2a"})-[:phys {name:"phys"}]->(n2:device {name:"n2"})
merge (n1)-[:phys {name:"phys"}]->(:interface {name:"n1b"})-[:cable {name:"cable"}]->(:interface {name:"n2b"})-[:phys {name:"phys"}]->(n2)
merge (n1)-[:phys {name:"phys"}]->(:interface {name:"n1c"})-[:cable {name:"cable"}]->(:interface {name:"n2c"})-[:phys {name:"phys"}]->(n2)
merge (n1)-[:phys {name:"phys"}]->(:interface {name:"n1d"})-[:cable {name:"cable"}]->(:interface {name:"n2d"})
Giving:
While this example has exactly 3 relationships and 2 nodes on each of the 4 paths between each of n1 and n2, my real data could have many more, and also many more paths.
This is a undirected graph and in the real dataset, relationships on parts of each path are in either direction.
I know that every path starts at a :device and either just ends at a non :device or ends at a :device, and along the way there could be any number of relationships and other non :device nodes.
So I am looking to do:
match p=(:device {name:"n1"})-[*]-(:device) return (p)
and have it return the same, (I would be happy with double), number of records as:
match p=(:device {name:"n1"})-[*]->(:device) return (p)
So I am looking for a way to stop matching relationships and cease following the path when the first (:device) is encountered in the path.
From my limited understanding, I could easily achieve this by making every relationship bidirectional. However I have avoided that option to date as I have read it is bad practice.
Extra for experts :-)
Additionally, I would like a way to return any full paths that don't end at a :device (eg, the bottom one)
Thanks
This is a use case that is a little hard to do with just Cypher, as we don't have a way to specify "follow a variable-length path and stop when you reach another node of this type".
We can do something like this when we use LIMIT, but that becomes too restrictive when we don't know how many results there will be, or we need to do this for multiple starting nodes.
Because of this, there are some APOC path finder procedures that include more flexible options. One of these is a labelFilter option which lets you describe how to filter nodes with particular labels found during expansion (blacklisting, whitelisting, etc). One of these filters is called a termination filter (uses an / symbol before the appropriate label), which means to include the path to that node as a result, and stop expansion, which is exactly what you're looking for.
After you install APOC, you can use the apoc.path.expandConfig() procedure, starting from your start node, and supply the labelFilter config parameter to get this behavior:
MATCH (start:device {name:"n1"})
CALL apoc.path.expandConfig(start, {labelFilter:'/device'}) YIELD path
RETURN path

follow all relationships but specific ones

How can I tell cypher to NOT follow a certain relationship/edge?
E.g. I have a :NODE that is connected to another :NODE via a :BUDDY relationship. Additionally every :NODE is related to :STUFF in arbitrary depth by arbitrary edges which are NOT of type :BUDDY. I now want to add a shortcut relation from each :NODE to its :STUFF. However, I do not include :STUFF of its :BUDDIES.
(:NODE)-[:BUDDY]->(:NODE)
(:NODE)-[*]->(:STUFF)
My current query looks like this:
MATCH (n:Node)-[*]->(s:STUFF) WHERE NOT (n)-[:BUDDY]->()-[*]->(s) CREATE (n)-[:HAS]->(s)
However I have some issues with this query:
1) If I ever add a :BUDDY relationship not directly between :NODE but children of :NODE the query will use that relationship for matching. This might not be intended as I do not want to include buddies at all.
2) Explain tells me that neo4j does the match (:NODE)-[*]->(:STUFF) and then AntiSemiApply the pattern (n)-[:BUDDY]->(). As a result it matches the whole graph to then unmatch most of the found connections. This seems ineffective and the query runs slower than I like (However subjective this might sound).
One (bad) fix is to restrict the depth of (:NODE)-[*]->(:STUFF) via (:NODE)-[*..XX]->(:STUFF). However, I cannot guarantee that depth unless I use a ridiculous high number for worst case scenarios.
I'd actually just like to tell neo4j to just not follow a certain relationship. E.g. MATCH (n:NODE)-[ALLBUT(:BUDDY)*]->(s:STUFF) CREATE (n)-[:HAS]->(s). How can I achieve this without having to enumerate all allowed connections and connect them with a | (which is really fast - but I have to manually keep track of all possible relations)?
One option for this particular schema is to explicitly traverse past the point where the BUDDY relationship is a concern, and then do all the unbounded traversing you like from there. Then you only have to apply the filter to single-step relationships:
MATCH (n:Node) -[r]-> (leaf)
WHERE NOT type(r) = 'BUDDY'
WITH n, leaf
MATCH (leaf) -[*] -> (s:Stuff)
WITH n, COLLECT(DISTINCT leaf) AS leaves, COLLECT(DISTINCT s) AS stuff
RETURN n, [leaf IN leaves WHERE leaf:Stuff] + stuff AS stuffs
The other option is to install apoc and take a look at the path expander procedure, which allows you to 'blacklist' node labels (such as :Node) from your path query, which may work just as well depending on your graph. See here.
My final solution is generating a string from the set of
{relations}\{relations_id_do_not_want}
and use that for matching. As I am using an API and thus can do this generation automatically it is not as much as an inconvenience as I feared, but still an inconvenience. However, this was the only solution I was able to find since posting this question.
You could use a condition o nrelationship type:
MATCH (:Node)-[r:REL]->(:OtherNode)
WHERE NOT type(r) = 'your_unwanted_rel_type'
I have no clues about perf though

How do I match from multiple starting nodes to the closest n nodes of a particular type per row?

I'm looking for a way to do breadth-first-search or shortest path from some start node(s) to a kind of node (whether by label or by property), and to stop when I find the first match (or n matches, if I could have that as a parameter).
I'd like to know if a solution exists with Cypher itself, and if not, if there are existing procedures (from APOC or some other source) to do this, and if not, how I might implement this (using the traversal framework maybe?)
The shortestPath() algorithms, both from neo4j itself and the APOC libraries, work best when you know your start and end nodes, or if you want to match based upon all possible start or end nodes.
But when we don't know our end nodes specifically, and we want to simply find the first node matching some predicates or labels (or nodes, if we allow the number we want to find to be parameterized), those procedures don't seem to work well.
For example, let's say I have a social graph of :Persons with [:Knows] relationships between them, and a [:LivesIn] relationship between :Persons and :City. Lastly, :Persons are additionally labeled with their occupation (so a :Person who is a doctor is also labeled with :Doctor)
With this example graph, for a given :City, for all :Persons in that city, I want to find the shortest path of each :Person following [:Knows] relationships to a :Doctor. As output I want to see each :Person, their closest :Doctor, the number of [:Knows] hops to that :Doctor, ordered by the number of hops descending.
My query, if I was using the neo4j shortest path, could look something like this:
MATCH (c:City)<-[:LivesIn]-(p:Person)
WHERE c.name = "San Diego"
WITH p
MATCH path = shortestPath( (p)-[:Knows*]-(d:Doctor) )
...
At this point we have rows pairing every person to every single doctor in the graph, and the shortest paths between all of them. As far as what to do next, I might collect all paths so we are back to one row per :Person, then order all collections by path length ascending, then for each user take the head of the path collection, then output the :Person, their closest :Doctor, and the number of hops between them.
That's not efficient at all. I want the shortest path to the closest :Doctor, and to stop searching once that first :Doctor node is found.
If an easy solution exists, I'd additionally like to know if supports other options (finding based on predicates and properties, not just labels), and if the number of matches I want it to find can be parameterized (if I want to find the closest 2 doctors, for example).
Why not use LIMIT 1 in the return statement?
APOC has matured quite a bit, and now has answers to these kinds of requirements.
Using apoc.path.expandConfig() (and other path expander procedures) it's now possible to expand out to nodes with certain labels, and stop further expansion once a given limit is reached (such as requesting the n closest doctors per person).
Using apoc.cypher.run(), it's now possible to perform a Cypher query with a LIMIT, and that limit will apply to the results per-row instead of limiting all rows. This is good for more complicated cases where node labels and relationship types alone aren't enough to describe the traversal or the nodes needed, such as when property evaluation is needed.
Both approaches are demonstrated in this Neo4j knowledge base entry.

Is it the optimal way of expressing "go through all nodes" queries in Cypher?

I have a quite large social graph in which I execute global queries like this one:
match (n:User)-[r:LIKES]->(k:User)
where not (k:User)-[]->(n:User)
return count(r);
They take a lot of time and memory, so I am curious if they are expressed in optimal way. I have felling that when I execute such query Cypher is firstly matching everything that fits the expression (and that takes a lot of memory) and then starts to count things. I would rather like to go through every node, check the pattern and update the counter if necessary. This way such queries would not require a lot of memory. So how in fact such query is executed? If it is not optimal, is there a way to make it better (in Cypher)?
If you used the query just as you wrote it, you may not be getting what you think you are. Putting labels on node "variables" can cause them to be treated as fresh (partial) patterns instead of bound nodes. Is your query any faster if you use
MATCH (n:User)-[r:LIKES]->(k:User)
WHERE NOT (n)<--(k)
RETURN count(r)
Here's how this works (not considering internal optimizations, which I don't begin to understand).
For each User node, every outgoing LIKES relationship is followed. If the other end of the LIKES relationship is a User node, the two nodes and the relationship are bound to the names n, k, and r and passed to the WHERE clause. Every outgoing relationship on the bound k node is then tested to see if it connects to the bound n node. If no such relationship is found, the match is considered successful. The count() function in the RETURN clause counts the resulting collection of relationships that were passed from the match.
If you have a densely connected graph, and particularly if there are many other relationships between nodes other than LIKES relationship, this can be quite an extensive search.
As a further experiment, you might try changing the WHERE clause to read
WHERE NOT (k)-->(n)
and see if it makes any difference. I don't think it will, but I could be wrong.

Resources