Neo4j cypher - search for nodes with no path between them - neo4j

I'm trying to find a generic way to search for a node or set of nodes which does not have a link to a another node or set of nodes.
As an example, I was able to find all the nodes of a specific type (e.g. :Style) which are connected somehow to a specific set of nodes (e.g. :MetadataRoot), with the following:
match (root:MetadataRoot),
(n:Style),
p=shortestPath((root)-[*]-(n))
return p
Using this, I was able to subtract the set of all :Style nodes from the nodes returned by the above query, but that doesn't seem like the best way to go about this.

If you know the label of the start nodes you can use the EXISTS function :
MATCH (n:Style)
WHERE NOT EXISTS((n)-[]-())
RETURN n
If you know the end node :
MATCH (n:Style)
WHERE NOT EXISTS ((n)-[*]-(:MetadataRoot))
RETURN n
EDIT :
Not sure, but regarding the performance issues in your comment, a workaround could be something like this :
MATCH p=allShortestPaths((n:Style)-[*]-(:MetadataRoot))
WITH nodes(p) as nodesRelated
MATCH (s:Style) WHERE NOT s IN nodesRelated

This should be way faster and it should need less resources to execute:
MATCH (n:Style),
OPTIONAL MATCH p=shortestPath((:MetadataRoot)-[*0..40]-(n))
WITH n, p
WHERE p IS NULL
RETURN n ```

Related

match a branching path of variable length

I have a graph which looks like this:
Here is the link to the graph in the neo4j console:
http://console.neo4j.org/?id=av3001
Basically, you have two branching paths, of variable length. I want to match the two paths between orange node and yellow nodes. I want to return one row of data for each path, including all traversed nodes. I also want to be able to include different WHERE clauses on different intermediate nodes.
At the end, i need to have a table of data, like this:
a - b - c - d
neo - morpheus - null - leo
neo - morpheus - trinity - cypher
How could i do that?
I have tried using OPTIONAL MATCH, but i can't get the two rows separately.
I have tried using variable length path, which returns the two paths but doesn't allow me to access and filter intermediate nodes. Plus it returns a list, and not a table of data.
I've seen this question:
Cypher - matching two different possible paths and return both
It's on the same subject but the example is very complex, a more generic solution to this simpler problem is what i'm looking for.
You can define what your end node by using WHERE statement. So in your case end node has no outgoing relationship. Not sure why you expect a null on return as you said neo - morpheus - null - leo
MATCH p=(n:Person{name:"Neo"})-[*]->(end) where not (end)-->()
RETURN extract(x IN nodes(p) | x.name)
Edit:
may not the the best option as I am not sure how to do this programmatically. If I use UNWIND I get back only one row. So this is a dummy solution
MATCH p=(n{name:"Neo"})-[*]->(end) where not (end)-->()
with nodes(p) as list
return list[0].name,list[1].name,list[2].name,list[3].name
You can use Cypher to match a path like this MATCH p=(:a)-[*]->(:d) RETURN p, and p will be a list of nodes/relationships in the path in the order it was traversed. You can apply WHERE to filter the path just like with node matching, and apply any list functions you need to it.
I will add these examples too
// Where on path
MATCH p=(:a)-[*]-(:d) WHERE NONE(n in NODES(p) WHERE n.name="Trinity") WITH NODES(p) as p RETURN p[0], p[1], p[2], p[3]
// Spit path into columns
MATCH p=(:a)-[*]-(:d) WITH NODES(p) as p RETURN p[0], p[1], p[2], p[3]
// Match path, filter on label
MATCH p=(:a)-[*]-(:d) WITH NODES(p) as p RETURN FILTER(n in p WHERE "a" in LABELS(n)) as a, FILTER(n in p WHERE "b" in LABELS(n)) as b, FILTER(n in p WHERE "c" in LABELS(n)) as c, FILTER(n in p WHERE "d" in LABELS(n)) as d
Unfortunately, you HAVE to explicitly set some logic for each column. You can't make dynamic columns (that I know of). In your table example, what is the rule for which column gets 'null'? In the last example, I set each column to be the set of nodes of a label.
I.m.o. you're asking for extensive post-processing of the results of a simply query (give me all the paths starting from Neo). I say this because :
You state you need to be able to specify specific WHERE clauses for each path (but you don't specify which clauses for which path ... indicating this might be a dynamic thing ?)
You don't know the size of the longest path beforehand ... but you still want the result to be a same-size-for-all-results table. And would any null columns then always be just before the end node ? Why (for that makes no real sense other then convenience) ?
...
Therefore (and again i.m.o.) you need to process the results in a (Java or whatever you prefer) program. There you'll have full control over the resultset and be able to slice and dice as you wish. Cypher (exactly like SQL in fact) can only do so much and it seems that you're going beyond that.
Hope this helps,
Regards,
Tom
P.S. This may seem like an easy opt-out, but look at how simple your query is as compared to the constructs that have to be wrought trying to answer your logic. So ... separate the concerns.

Neo4j Cypher find two disjoint nodes

I'm using Neo4j to try to find any node that is not connected to a specific node "a". The query that I have so far is:
MATCH p = shortestPath((a:Node {id:"123"})-[*]-(b:Node))
WHERE p IS NULL
RETURN b.id as b
So it tries to find the shortest path between a and b. If it doesn't find a path, then it returns that node's id. However, this causes my query to run for a few minutes then crashes when it runs out of memory. I was wondering if this method would even work, and if there is a more efficient way? Any help would be greatly appreciated!
edit:
MATCH (a:Node {id:"123"})-[*]-(b:Node),
(c:Node)
WITH collect(b) as col, a, b, c
WHERE a <> b AND NOT c IN col
RETURN c.id
So col (collect(b)) contains every node connected to a, therefore if c is not in col then c is not connected to a?
For one, you're giving this MATCH an impossible predicate to fulfill, so it will never find the shortest path.
WHERE clauses are associated with MATCH, OPTIONAL MATCH, and WITH clauses, so your query is asking for the shortest path where the path doesn't exist. That will never return anything.
Also, the shortestPath will start at the node you DON'T want to be connected, so this has no way of finding the nodes that aren't connected to it.
Probably the easiest way to approach this is to MATCH to all nodes connected to your node in question, then MATCH to all :Nodes checking for those that aren't in the connected set. That means you won't have to do a shortestPath from every single node in the db, just a membership check in a collection.
You'll need APOC Procedures for this, as it has the fastest means of matching to nodes within a subgraph.
MATCH (a:Node {id:"123"})
CALL apoc.path.subgraphNodes(a, {}) YIELD node
WITH collect(node) as subgraph
MATCH (b:Node)
WHERE NOT b in subgraph
RETURN b.id as b
EDIT
Your edited query is likely to blow up, that's going to generate a huge result set (the query will build a result set of every node reachable from your start node by a unique path in a cartesian product with every :Node).
Instead, go step by step, collect the distinct nodes (because otherwise you'll get multiples of the same nodes that can be reached via different paths), and then only after you have your collection should you start your match for nodes that aren't in the list.
MATCH (:Node {id:"123"})-[*0..]-(b:Node)
WITH collect(DISTINCT b) as col
MATCH (a:Node)
WHERE NOT a IN col
RETURN a.id

Neo4J, Match node with "OR"

Helo,
i want to match a graph where a node can be typeX or typeY
my first thought was:
match (:typeX|typeY)-[]-(z) return z
But this doesn´t work :(
Is there any way without typing the query twice?
Like this:
match (:typeX)-[]-(z), (:typeY)-[]-(z) return z
Can someone help me?
thank you in advance :)
One way is
MATCH (n) WHERE labels(n) IN ['typeX','typeY']
WITH n
MATCH (n)-[]-(z)
RETURN z
However, if "either typeX or typeY" are queried frequently and share some common purpose in your domain, you could add another common label to them like "commonXY" and query using that label instead.
Unfortunately there's not a good efficient way to do this without sacrificing performance. All other current answers are forced to scan all nodes and then filter on their labels, which isn't performant with large numbers of nodes (PROFILE the queries). All the efficient means I know of are more verbose.
You can perform a UNION of the two queries to return nodes one hop from all :typeX and :typeY nodes.
match (:typeX)--(z)
return z
union
match (:typeY)--(z)
return z
This query will work even if n has multiple labels:
MATCH (n)
WHERE ANY(lab IN labels(n) WHERE lab IN ['typeX', 'typeY'])
MATCH (n)--(z)
RETURN z
there is a n predicate n:Label
MATCH (n)--(z)
WHERE n:typeX OR n:typeY
RETURN z

Optimizing Cypher Query Neo4j

I want to write a query in Cypher and run it on Neo4j.
The query is:
Given some start vertexes, walk edges and find all vertexes that is connected to any of start vertex.
(start)-[*]->(v)
for every edge E walked
if startVertex(E).someproperty != endVertex(E).someproperty, output E.
The graph may contain cycles.
For example, in the graph above, vertexes are grouped by "group" property. The query should return 7 rows representing the 7 orange colored edges in the graph.
If I write the algorithm by myself it would be a simple depth / breadth first search, and for every edge visited if the filter condition is true, output this edge. The complexity is O(V+E)
But I can't express this algorithm in Cypher since it's very different language.
Then i wrote this query:
find all reachable vertexes
(start)-[*]->(v), reachable = start + v.
find all edges starting from any of reachable. if an edge ends with any reachable vertex and passes the filter, output it.
match (reachable)-[]->(n) where n in reachable and reachable.someprop != n.someprop
so the Cypher code looks like this:
MATCH (n:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})
WITH n MATCH (n:Col)-[*]->(m:Col)
WITH collect(distinct n) + collect(distinct m) AS c1
UNWIND c1 AS rn
MATCH (rn:Col)-[]->(xn:Col) WHERE rn.schema<>xn.schema and xn in c1
RETURN rn,xn
The performance of this query is not good as I thought. There are index on :Col(schema)
I am running neo4j 2.3.0 docker image from dockerhub on my windows laptop. Actually it runs on a linux virtual machine on my laptop.
My sample data is a small dataset that contains 0.1M vertexes and 0.5M edges. For some starting nodes it takes 60 or more seconds to complete this query. Any advice for optimizing or rewriting the query? Thanks.
The following code block is the logic I want:
VertexQueue1 = (starting vertexes);
VisitedVertexSet = (empty);
EdgeSet1 = (empty);
While (VertexSet1 is not empty)
{
Vertex0 = VertexQueue1.pop();
VisitedVertexSet.add(Vertex0);
foreach (Edge0 starting from Vertex0)
{
Vertex1 = endingVertex(Edge0);
if (Vertex1.schema <> Vertex0.schema)
{
EdgeSet1.put(Edge0);
}
if (VisitedVertexSet.notContains(Vertex1)
and VertexQueue1.notContains(Vertex1))
{
VertexQueue1.push(Vertex1);
}
}
}
return EdgeSet1;
EDIT:
The profile result shows that expanding all paths has a high cost. Looking at the row number, it seems that Cypher exec engine returns all paths but I want distint edge list only.
LEFT one:
match (start:Col {table:"F_XXY_DSMK_ITRPNL_IDX_STAT_W"})
,(start)-[*0..]->(prev:Col)-->(node:Col)
where prev.schema<>node.schema
return distinct prev,node
RIGHT one:
MATCH (n:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})
WITH n MATCH (n:Col)-[*]->(m:Col)
WITH collect(distinct n) + collect(distinct m) AS c1
UNWIND c1 AS rn
MATCH (rn:Col)-[]->(xn:Col) WHERE rn.schema<>xn.schema and xn in c1
RETURN rn,xn
I think Cypher lets this be much easier than you're expecting it to be, if I'm understanding the query. Try this:
MATCH (start:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})-->(node:Col)
WHERE start.schema <> node.schema
RETURN start, node
Though I'm not sure why you're comparing the schema property on the nodes. Isn't the schema for the start node fixed by the value that you pass in?
I might not be understanding the query though. If you're looking for more than just the nodes connected to the start node, you could do:
MATCH
(start:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})
(start)-[*0..]->(prev:Col)-->(node:Col)
WHERE prev.schema <> node.schema
RETURN prev, node
That open-ended variable length relationship specification might be slow, though.
Also note that when Cypher is browsing a particular path it stops which it finds that it's looped back onto some node (EDIT relationship, not node) in the path matched so far, so cycles aren't really a problem.
Also, is the DWMDATA value that you're passing in interpolated? If so, you should think about using parameters for security / performance:
http://neo4j.com/docs/stable/cypher-parameters.html
EDIT:
Based on your comment I have a couple of thoughts. First limiting to DISTINCT path isn't going to help because every path that it finds is distinct. What you want is the distinct set of pairs, I think, which I think could be achieved by just adding DISTINCT to the query:
MATCH
(start:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})
(start)-[*0..]->(prev:Col)-->(node:Col)
WHERE prev.schema <> node.schema
RETURN DISTINT prev, node
Here is another way to go about it which may or may not be more efficient, but might at least give you an idea for how to go about things differently:
MATCH
path=(start:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})-->(node:Col)
WITH rels(path) AS rels
UNWIND rels AS rel
WITH DISTINCT rel
WITH startNode(rel) AS start_node, endNode(rel) AS end_node
WHERE start_node.schema <> end_node.schema
RETURN start_node, end_node
I can't say that this would be faster, but here's another way to try:
MATCH (start:Col)-[*]->(node:Col)
WHERE start.property IN {property_values}
WITH collect(ID(node)) AS node_ids
MATCH (:Col)-[r]->(node:Col)
WHERE ID(node) IN node_ids
WITH DISTINCT r
RETURN startNode(r) AS start_node, endNode(r) AS end_node
I suspect that the problem in all cases is with the open-ended variable length path. I've actually asked on the Slack group to try to get a better understanding of how it works. In the meantime, for all the queries that you try I would suggest prefixing them with the PROFILE keyword to get a report from Neo4j on what parts of the query are slow.
// this is very inefficient!
MATCH (start:Col)-[*]->(node:Col)
WHERE start.property IN {property_values}
WITH distinct node
MATCH (prev)-[r]->(node)
RETURN distinct prev, node;
you might be better off with this:
MATCH (start:Col)
WHERE start.property IN {property_values}
MATCH (node:Col)
WHERE shortestPath((start)-[*]->(node)) IS NOT NULL
MATCH (prev)-[r]->(node)
RETURN distinct prev, node;

Return nodes of a connected node

For a given node n, I want to get its related nodes and all nodes connected to the related nodes.
For example:
MATCH (n)-[:IN]->(x)
WHERE n.myid='myid'
RETURN n, x
How do I return all the x's connected nodes as well?
Assuming that you don't want type(r) to be IN,
Have you tried?
MATCH (n)-[:IN]->(x)-[]-(y)
WHERE n.myid='myid'
AND y<>n
RETURN n,x,collect(y)
I have it collecting y because you will otherwise get a bunch of "rows" for each, but that is totally up to you.
Try playing on:
http://console.neo4j.org
Console with above example
http://console.neo4j.org/r/yaczrx
Also, you may want to look into how deep you want that second lookup to go.
BTW: if you want to see the path in the console (you can see how the nodes are interconnected): http://console.neo4j.org/r/39mz9a
MATCH path =(n:Crew)-[:KNOWS]-m-[rr]-(x)
WHERE n.name='Neo' AND x<>n
RETURN n AS Initial_Node, m AS Linking_Node,
collect(x) AS Nodes_connected_to_Linking_Node, path
MATCH (n)-[:IN]->(x)--(y)
WHERE n.myid='myid'
RETURN n, x,y

Resources