Neo4j Cypher complex query optimization - neo4j

Now I have a graph with millions of nodes and millions of edge relationships. There is a directed relationship between nodes.
Now suppose the node has two states A and B. I want to find all state A nodes on the path that do not have state B.
As shown in the figure below, there are nodes A--K, and then three of them, E, G and J, are of type B, and the others are of type A.
picture link is https://i.stack.imgur.com/a0yOV.jpg
For node E, its upstream and downstream traversal is shown below, so nodes B, H, K do not meet the requirements.
For node G, its upstream and downstream traversal is shown below, so nodes B, D, K do not meet the requirements.
For node J, its upstream and downstream traversal is shown below, so nodes A, B, C, D, F do not meet the requirements.
So finally only node "I" is the node that meets the requirements.
picture link is https://i.stack.imgur.com/A2eqv.jpg
The case of the above example is a DAG, but the actual situation is that there may be cycle in the graph, including spin cycle (case 1), AB cycle (case 2), large loops (case 3), and complex cycle (case 4)
picture link is https://i.stack.imgur.com/NDpED.jpg
The Cypher query statement I can write
MATCH (n:A)
WHERE NOT exists((n)-[*]->(:B))
AND NOT exists((n)<-[*]-(:B))
RETURN n;
But this query statement is stuck in the case of millions of nodes and millions of edges with a limit 35,But in the end there are more than 30,000 nodes that meet the requirements.
Obviously my statement is taking up too much memory, querying out 30+ nodes has taken up almost all the available memory, how can I write a more efficient query?
Here is a example
CREATE (a:A{id:'a'})
CREATE (b:A{id:'b'})
CREATE (c:A{id:'c'})
CREATE (d:A{id:'d'})
CREATE (e:B{id:'e'})
CREATE (f:A{id:'f'})
CREATE (g:B{id:'g'})
CREATE (h:A{id:'h'})
CREATE (i:A{id:'i'})
CREATE (j:B{id:'j'})
CREATE (k:A{id:'k'})
MERGE (a)-[:REF]->(c)
MERGE (b)-[:REF]->(c)
MERGE (b)-[:REF]->(d)
MERGE (b)-[:REF]->(e)
MERGE (c)-[:REF]->(f)
MERGE (d)-[:REF]->(g)
MERGE (e)-[:REF]->(g)
MERGE (e)-[:REF]->(h)
MERGE (f)-[:REF]->(i)
MERGE (f)-[:REF]->(j)
MERGE (f)-[:REF]->(k)
MERGE (g)-[:REF]->(k)
MERGE (g)-[:REF]->(j)
use this code will get the result 'i'
MATCH (n:A)
WHERE NOT exists((n)-[*]->(:B))
AND NOT exists((n)<-[*]-(:B))
RETURN n;
But when there are 800,000 nodes (400,000 type A, 400,000 type B) and over 1.4 million edges in the graph, this code cannot run the result

Some thoughts:
I don’t think this global graph search can be solved with a single query. You will need some kind of process to optimise exploration and use the result up to a certain point in subsequent steps.
when you could assign node labels instead of properties to reflect
the state of a node, you could use apoc.path.expandConfig to just
explore paths until you hit a node with state B.
you don’t need to re-investigate state A nodes that you traverse before you hit a node with state B, because they will not meet the requirements.
Another approach could be this, given the fact that all nodes that are on the up or downstream paths from a B node, will not fulfil the requirements. Still assuming that you use labels to distinguish A and B nodes.
MATCH (b:B)
CALL apoc.path.spanningTree(b,
{relationshipFilter: "<",
labelFilter:"/B"
}
) YIELD path
UNWIND nodes(path) AS downStreamNode
WITH b,COLLECT(DISTINCT downStreamNode) AS downStreamNodes
CALL apoc.path.spanningTree(b,
{relationshipFilter: ">",
labelFilter:"/B"}
) YIELD path
UNWIND nodes(path) AS upStreamNode
WITH b,downStreamNodes+COLLECT(DISTINCT upStreamNode) AS upAndDownStreamNodes
RETURN apoc.coll.toSet(apoc.coll.flatten(COLLECT(upAndDownStreamNodes))) AS allNodesThatDoNotFulfillRequirements

Related

Shortest paths between two sets of nodes

I created a graph in Neo4j with 10 million nodes and 30 million relationships.
Each node is labeled as A (4 million nodes) , B (6 million nodes) or C (20 nodes).
Nodes in A lead to nodes in B. Nodes in B lead to other nodes in B, and to nodes in C.
For each node in A, I need to find the closest node (or nodes, if they are the same distance) in C, and add the ID of the C node as a value of a property in the A node.
Any help would be much appreciated.
So we're looking at a model like this (using :LEAD since you didn't specify a relationship type):
(:A)-[:LEAD]->(:B)
(:B)-[:LEAD]->(:B)
(:B)-[:LEAD]->(:C)
APOC Procedures offers the best solution for this one, but it's a two-parter since we first find the closest :C node using the path expander procedures, then rematch using that distance to get the full collection of :C nodes reachable at that distance.
You'll also want to make use of apoc.periodic.iterate() so you can batch this, though you may want to play around with the batchSize.
I'm making some assumptions in this query since you didn't provide much in the way of properties to use in the graph.
CALL apoc.periodic.iterate("MATCH (a:A) RETURN a",
"CALL apoc.path.spanningTree(a, {relationshipFilter:'LEAD>', labelFilter:'/C', limit:1}) YIELD path
WITH a, length(path) as length
CALL apoc.path.subgraphNodes(a, {relationshipFilter:'LEAD>', labelFilter:'/C', maxLevel:length}) YIELD node
WITH a, collect(node.id) as ids
SET a.cIDs = ids",
{batchSize:1000}) YIELD batches, total, errorMessages
RETURN batches, total, errorMessages

How to duplicate node tree in neo4j?

How can I create full node tree from existing node tree, Actually i have 1 node and i need to find all relation and nodes that has my existing top node. I need to create full node tree to another node.
Copy All Node Tree from A with Relation and create Duplicate same node and relation for B Node
Okay, this is a tricky one.
As others have mentioned apoc.refactor.cloneNodesWithRelationships() can help, but this will also result in relationships between cloned nodes and the originals, not just the clones, as this proc wasn't written with this kind of use case in mind.
So this requires us to do some Cypher acrobatics to find the relationships from the cloned nodes that don't go to the cloned nodes so we can delete them.
However at the same time we also have to identify the old root node so we can refactor relationships to the new root (this requires separate processing for incoming and outgoing relationships).
Here's a query that should do the trick, making assumptions about the nodes since you didn't provide any details about your graph:
MATCH (a:Root{name:'A'})
WITH a, id(a) as aId
CALL apoc.path.subgraphNodes(a, {}) YIELD node
WITH aId, collect(node) as nodes
CALL apoc.refactor.cloneNodesWithRelationships(nodes) YIELD input, output
WITH aId, collect({input:input, output:output}) as createdData, collect(output) as createdNodes
WITH createdNodes, [item in createdData WHERE item.input = aId | item.output][0] as aClone // clone of root A node
UNWIND createdNodes as created
OPTIONAL MATCH (created)-[r]-(other)
WHERE NOT other in createdNodes
DELETE r // get rid of relationships that aren't between cloned nodes
// now to refactor relationships from aClone to the new B root node
WITH DISTINCT aClone
MATCH (b:Root{name:'B'})
WITH aClone, b
MATCH (aClone)-[r]->()
CALL apoc.refactor.from(r, b) YIELD output
WITH DISTINCT b, aClone
MATCH (aClone)<-[r]-()
CALL apoc.refactor.to(r, b) YIELD output
WITH DISTINCT aClone
DETACH DELETE aClone

Neo4j Cypher find two disjoint nodes

I'm using Neo4j to try to find any node that is not connected to a specific node "a". The query that I have so far is:
MATCH p = shortestPath((a:Node {id:"123"})-[*]-(b:Node))
WHERE p IS NULL
RETURN b.id as b
So it tries to find the shortest path between a and b. If it doesn't find a path, then it returns that node's id. However, this causes my query to run for a few minutes then crashes when it runs out of memory. I was wondering if this method would even work, and if there is a more efficient way? Any help would be greatly appreciated!
edit:
MATCH (a:Node {id:"123"})-[*]-(b:Node),
(c:Node)
WITH collect(b) as col, a, b, c
WHERE a <> b AND NOT c IN col
RETURN c.id
So col (collect(b)) contains every node connected to a, therefore if c is not in col then c is not connected to a?
For one, you're giving this MATCH an impossible predicate to fulfill, so it will never find the shortest path.
WHERE clauses are associated with MATCH, OPTIONAL MATCH, and WITH clauses, so your query is asking for the shortest path where the path doesn't exist. That will never return anything.
Also, the shortestPath will start at the node you DON'T want to be connected, so this has no way of finding the nodes that aren't connected to it.
Probably the easiest way to approach this is to MATCH to all nodes connected to your node in question, then MATCH to all :Nodes checking for those that aren't in the connected set. That means you won't have to do a shortestPath from every single node in the db, just a membership check in a collection.
You'll need APOC Procedures for this, as it has the fastest means of matching to nodes within a subgraph.
MATCH (a:Node {id:"123"})
CALL apoc.path.subgraphNodes(a, {}) YIELD node
WITH collect(node) as subgraph
MATCH (b:Node)
WHERE NOT b in subgraph
RETURN b.id as b
EDIT
Your edited query is likely to blow up, that's going to generate a huge result set (the query will build a result set of every node reachable from your start node by a unique path in a cartesian product with every :Node).
Instead, go step by step, collect the distinct nodes (because otherwise you'll get multiples of the same nodes that can be reached via different paths), and then only after you have your collection should you start your match for nodes that aren't in the list.
MATCH (:Node {id:"123"})-[*0..]-(b:Node)
WITH collect(DISTINCT b) as col
MATCH (a:Node)
WHERE NOT a IN col
RETURN a.id

Is this the optimal cypher for returning every node of a subtree?

I have a graph of a tree structure (well no, more of a DAG because i can have multiple parents) and need to be able to write queries that return all results in a flat list, starting at a particular node(s) and down.
I've reduced one of my use cases to this simple example. In the ascii representation here, n's are my nodes and I've appended their id. p is a permission in my auth system, but all that is pertinent to the question is that it marks the spot from which I need to recurse downwards to collect nodes which should be returned by the query.
There can be multiple root nodes related to p's
The roots, such as n3 below, should be contained in the results, as well as the children
The relationship depth is unbounded.
Graph:
n1
^ ^
/ \
n2 n3<--p
^ ^
/ \
n4 n5
^
/
n6
If it's helpful, here's the cypher I ran to throw together this quick example:
CREATE path=(n1:n{id:1})<-[:HAS_PARENT]-(n2:n{id:2}),
(n1)<-[:HAS_PARENT]-(n3:n{id:3})<-[:HAS_PARENT]-(n4:n{id:4}),
(n3)<-[:HAS_PARENT]-(n5:n{id:5}),
(n4)<-[:HAS_PARENT]-(n6:n{id:6})
MATCH (n{id:3})
CREATE (:p)-[:IN]->(n)
Here is the current best query I have:
MATCH (n:n)<--(:p)
WITH collect (n) as parents, (n) as n
OPTIONAL MATCH (c)-[:HAS_PARENT*]->(n)
WITH collect(c) as children, (parents) as parents
UNWIND (parents+children) as tree
RETURN tree
This returns the correct set of results, and unlike some previous attempts I made which did not use any collect/unwind, the results come back as a single column of data as desired.
Is this the most optimal way of making this type of query? It is surprisingly more complex than I thought the simple scenario called for. I tried some queries where I combined the roots ("parents" in my query) with the "children" using a UNION clause, but I could not find a way to do so without repeating the query for the relationship with p. In my real world queries, that's a much more expensive operation which i've reduced down here for the example, so I cannot run it more than once.
This might suit your needs:
MATCH (c)-[:HAS_PARENT*0..]->(root:n)<--(:p)
RETURN root, COLLECT(c) AS tree
Each result row will contain a distinct root node and a collection if its tree nodes (including the root node).

Find all relations starting with a given node

In a graph where the following nodes
A,B,C,D
have a relationship with each nodes successor
(A->B)
and
(B->C)
etc.
How do i make a query that starts with A and gives me all nodes (and relationships) from that and outwards.
I do not know the end node (C).
All i know is to start from A, and traverse the whole connected graph (with conditions on relationship and node type)
I think, you need to use this pattern:
(n)-[*]->(m) - variable length path of any number of relationships from n to m. (see Refcard)
A sample query would be:
MATCH path = (a:A)-[*]->()
RETURN path
Have also a look at the path functions in the refcard to expand your cypher query (I don't know what exact conditions you'll need to apply).
To get all the nodes / relationships starting at a node:
MATCH (a:A {id: "id"})-[r*]-(b)
RETURN a, r, b
This will return all the graphs originating with node A / Label A where id = "id".
One caveat - if this graph is large the query will take a long time to run.

Resources