Fast search for unconnected nodes in big neo4j graph - neo4j

So, i've created a Neo4j graph database out of a relational database. The graph database has about 7 million nodes, and about 9 million relationships between the nodes.
I now want to find all nodes, that are not connected to nodes with a certain label (let's call them unconnected nodes). For example, i have nodes with the labels "Customer" and "Order" (let's call them top-level-nodes). I want to find all nodes that have no relationship from or to these top-level-nodes. The relationship doesn't have to be direct, the nodes can be connected via other nodes to the top-level-nodes.
I have a cypher query which would solve this problem:
MATCH (a) WHERE not ((a)-[*]-(:Customer)) AND not ((a)-[*]-(:Order)) RETURN a;
As you can imagine, the query will need a long time to execute, the performance is bad. Most likely because of the undirected relationship and because it doesn't matter via how many nodes the relationship can be made. However, the relationship directions don't matter, and i need to make sure that there is no path from any node to one of the top-level-nodes.
Is there any way to find the unconnected nodes faster ? Note that the database is really big, and there are more than 2 labels which mark top-level-nodes.

You could try this approach, which does involve more operations, but can be run in batches for better performance (see apoc.periodic.commit() in the APOC procedures library).
The idea is to first apply a label (say, :Unconnected) to all nodes in your graph (batch execute with apoc.periodic.commit), and then, taking batches of top level nodes with that label, matching to all nodes in the subgraphs extending from them and removing that label.
When you finally have run out of top level nodes with the :Unconnected label (meaning all top level nodes and their subgraphs no longer have this label) then the only nodes remaining in your graph with the :Unconnected label are not connected to your top level nodes.
Any approach to this kind of operation will likely be slow, but the advantage again is that you can process this in batches, and if you get interrupted, you can resume. Once your queries are done, all the relevant unconnected nodes are now labeled for further processing at your convenience.
Also, one last note, in Neo4j undirected relationships have no arrows in the syntax ()-[*]-().

MATCH (a)
WHERE
not (a:Customer OR a:Order)
AND shortestPath((a)-[*]-(:Customer)) IS NULL
AND shortestPath((a)-[*]-(:Order)) IS NULL
RETURN a;
If you could add rel-types it would be faster.
One further optimization could be to check the nodes of an :Customer path for an :Order node and vice versa. i.e.
NONE(n in nodes(path) WHERE n:Order)
In general, this might be rather a set operation, i.e.
expand around all order and customer nodes in parallel into two sets
and compute the overlap between the two sets.
Then remove the overlap from the total number of nodes.
I added an issue for apoc here to add such a function or procedure
https://github.com/neo4j-contrib/neo4j-apoc-procedures/issues/223

Related

Why so many db hit in neo4j?

There is total 1 Category node and 2 Template node in my case. I put an * in [*] to support more further scenarios. But why there are so many db hit in this cypher for current data?
It's probably the * in the relationship part of your query that's doing it.
While you've got only one Category node and two Template nodes, you've asked Neo4j to hop through any number of relationships to get from one to the other and not given it any help to narrow down the search besides specifying the starting node.
For example, if your Category was connected to 100,000 other nodes (of any label, not just Template) you've forced Neo4j to jump through every single one of them looking to see if there's a path to a Template node - and if those nodes have their own connections then they all need to be explored too, because the depth of the traversal isn't constrained.
If you know how Category and Template nodes can be connected in ways you're interested in (for example, if there's only every some specific set of relationships you want to traverse) then you'll radically improve the performance of the query. Equally, reducing the maximum length of the path will help.

Retrieve All Nodes That Can Be Reached By A Specific Node In A Directed Graph

Given a graph in Neo4j that is directed (but possible to have cycles), how can I retrieve all nodes that are reachable from a specific node with Cypher?
(Also: how long can I expect a query like this to take if my graph has 2 million nodes, and by extension 48 million nodes? A rough gauge will do eg. less than a minute, few minutes, an hour)
Cypher's uniqueness behavior is that relationships must be unique per path (each relationship can only be traversed once per path), but this isn't efficient for these kinds of use cases, where the goal is instead to find distinct nodes, so a node should only be visited once total (across all paths, not per path).
There are some path expander procedures in the APOC Procedures library that are directed at these use cases.
If you're trying to find all reachable nodes from a starting node, traversing relationships in either direction, you can use apoc.path.subgraphNodes() like so, using the movies graph as an example:
MATCH (n:Movie {title:"The Matrix"})
CALL apoc.path.subgraphNodes(n, {}) YIELD node
RETURN node
If you only wanted reachable nodes going a specific direction (let's say outgoing) then you can use a relationshipFilter to specify this. You can also add in the type too if that's important, but if we only wanted reachable via any outgoing relationship the query would look like:
MATCH (n:Movie {title:"The Matrix"})
CALL apoc.path.subgraphNodes(n, {relationshipFilter:'>'}) YIELD node
RETURN node
In either case these approaches should work better than with Cypher alone, especially in any moderately connected graph, as there will only ever be a single path considered for every reachable node (alternate paths to an already visited node will be pruned, cutting down on the possible paths to explore during traversal, which is efficient as we don't care about these alternate paths for this use case).
Have a look here, where an algorithm is used for community detection.
You can use something like
match (n:Movie {title:"The Matrix"})-[r*1..50]-(m) return distinct id(m)
but that is slow (tested on the Neo4j movie dataset with 60k nodes, above already runs more than 10 minutes. Probably memory usage will become an issue when you have a dataset consisting out of millions of nodes. Next to that, it also depends how your dataset is connected, e.g. nr of relationships.

Depth wise retrieval of nodes from neo4j

I have a science graph in neo4j which has names of some scientists as nodes and connected to nodes holding laws by relation has_discovered. The laws are then related to their application by relation has_application. I am new to cypher. I want to know what cql query will give me level 1 and level 2 nodes of the scientists nodes. Here level 1 will be the nodes holding laws and level 2 will be nodes holding their applications.
This query should probably take care of it, assuming your labels are :Scientist, :Law, and :Application.
MATCH (sci:Scientist)-[:has_discovered]->(law:Law)-[:has_application]->(app:Application)
RETURN sci, law, app
As long as your :has_discovered and :has_application relationships only connect those types of nodes, you can leave off the :Law and :Application labels (but you'll want to keep the :Scientist label so you begin your pattern match only at :Scientist nodes).
You can use COLLECT() as necessary to group results if you want.

Neo4j Cypher: Match and Delete the subgraph based on value of node property

Suppose I have 3 subgraphs in Neo4j and I would like to select and delete the whole subgraph if all the nodes in the subgraph matching the filtering criteria that is each node's property value <= 1. However if there is atleast one node within the subgraph that is not matching the criteria then the subgraph will not be deleted.
In this case the left subgraph will be deleted but the right subgraph and the middle one will stay. The right one will not be deleted even though it has some nodes with value 1 because there are also nodes with values greater than 1.
userids and values are the node properties.
I will be thankful if anyone can suggest me the cypher query that can be used to do that. Please note that the query will be on the whole graph, that is on all three subgraphs or more if there are anymore.
Thanks for the clarification, that's a tricky requirement, and it's not immediately clear to me what the best approach is that will scale well with large graphs, as most possibilities seem to be expensive full graph operations. We'll likely need to use a few steps to set up the graph for easier querying later. I'm also assuming you mean "disconnected subgraphs", otherwise this answer won't work.
One start might be to label nodes as :Alive or :Dead based upon the property value. It should help if all nodes are of the same label, and if there's an index on the value property for that label, as our match operations could take advantage of the index instead of having to do a full label scan and property comparison.
MATCH (a:MyNode)
WHERE a.value <= 1
SET a:Dead
And separately
MATCH (a:MyNode)
WHERE a.value > 1
SET a:Alive
Then your query to mark nodes to delete would be:
MATCH (a:Dead)
WHERE NOT (a)-[*]-(:Alive)
SET a:ToDelete
And if all looks good with the nodes you've marked for delete, you can run your delete operation, using apoc.periodic.commit() from APOC Procedures to batch the operation if necessary.
MATCH (a:ToDelete)
DETACH DELETE a
If operations on disconnected subgraphs are going to be common, I highly encourage using a special node connected to each subgraph you create (such as a single :Cluster node at the head of the subgraph) so you can begin such operations on :Cluster nodes, which would greatly speed up these kind of queries, since your query operations would be executed per cluster, instead of per :Dead node.

Neo4j labels and properties, and their differences

Say we have a Neo4j database with several 50,000 node subgraphs. Each subgraph has a root. I want to find all nodes in one subgraph.
One way would be to recursively walk the tree. It works but can be thousands of trips to the database.
One way is to add a subgraph identifier to each node:
MATCH(n {subgraph_id:{my_graph_id}}) return n
Another way would be to relate each node in a subgraph to the subgraph's root:
MATCH(n)-[]->(root:ROOT {id: {my_graph_id}}) return n
This feels more "graphy" if that matters. Seems expensive.
Or, I could add a label to each node. If {my_graph_id} was "BOBS_QA_COPY" then
MATCH(n:BOBS_QA_COPY) return n
would scoop up all the nodes in the subgraph.
My question is when is it appropriate to use a garden-variety property, add relationships, or set a label?
Setting a label to identify a particular subgraph makes me feel weird, like I am abusing the tool. I expect labels to say what something is, not which instance of something it is.
For example, if we were graphing car information, I could see having parts labeled "FORD EXPLORER". But I am less sure that it would make sense to have parts labeled "TONYS FORD EXPLORER". Now, I could see (USER id:"Tony") having a relationship to a FORD EXPLORER graph...
I may be having a bout of "SQL brain"...
Let's work this through, step by step.
If there are N non-root nodes, adding an extra N ROOT relationships makes the least sense. It is very expensive in storage, it will pollute the data model with relationships that don't need to be there and that can unnecessarily complicate queries that want to traverse paths, and it is not the fastest way to find all the nodes in a subgraph.
Adding a subgraph ID property to every node is also expensive in storage (but less so), and would require either: (a) scanning every node to find all the nodes with a specific ID (slow), or (b) using an index, say, :Node(subgraph_id) (faster). Approach (b), which is preferable, would also require that all the nodes have the same Node label.
But wait, if approach 2(b) already requires all nodes to be labelled, why don't we just use a different label for each subgroup? By doing that, we don't need the subgraph_id property at all, and we don't need an index either! And finding all the nodes with the same label is fast.
Thus, using a per-subgroup label would be the best option.

Resources