Neo4j Cypher: Finding the maximum and minimum node value in every disconnected subgraph and taking the difference

Suppose I have a graph as shown below. I would like to find the maximum and minimum node values in each subgraph, take the difference, and return it.
For instance, the right-most subgraph has 4 nodes; its maximum value is 3 and its minimum value is 1, so the difference to return is 2. This should happen for every disconnected subgraph in the whole graph database. I would prefer to handle all subgraphs with one query, so the work can be done in batch and the difference for each subgraph returned.
I would be thankful for some intuition.

The real problem will be finding those subgraphs, as Neo4j has no native support for detecting or tracking disconnected subgraphs; identifying them will require some intensive full-graph queries.
I've provided an approach to finding disconnected subgraphs and attaching a :Subgraph node to the node with the smallest id in the subgraph in this answer to a similar question.
Once the :Subgraph nodes are in place, you are free to batch queries on the subgraphs.
As noted in that answer, it does not provide an approach to keeping up with graph changes which end up affecting subgraphs (creating new subgraphs, merging subgraphs, dividing subgraphs).
EDIT
Once you have a :Subgraph node attached to each disconnected subgraph, you can perform operations on subgraphs easily.
You might use this query to calculate the difference:
MATCH (s:Subgraph)-[*]-(subgraphNode)
WITH DISTINCT s, subgraphNode
WITH s, MIN(subgraphNode.value) as minimum, MAX(subgraphNode.value) as maximum
WITH s, maximum - minimum as difference
...
If you need to batch that query, then you'll want to use APOC Procedures, probably apoc.periodic.iterate().
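For example, here is a minimal batching sketch with apoc.periodic.iterate(), assuming you want to store the result on each :Subgraph node as a hypothetical diff property:
CALL apoc.periodic.iterate(
  "MATCH (s:Subgraph) RETURN s",
  "MATCH (s)-[*]-(subgraphNode)
   WITH DISTINCT s, subgraphNode
   WITH s, MIN(subgraphNode.value) as minimum, MAX(subgraphNode.value) as maximum
   SET s.diff = maximum - minimum",
  {batchSize: 100})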
EDIT
After some testing, it seems that APOC's Path Expander functionality, using NODE_GLOBAL uniqueness, provides a more efficient means of finding all nodes within a subgraph.
I'll be altering my linked answer accordingly. Here's how this would work with the subgraph query:
MATCH (s:Subgraph)
CALL apoc.path.expandConfig(s,{minLevel:1, bfs:true, uniqueness:"NODE_GLOBAL"}) YIELD path
WITH s, last(nodes(path)) as subgraphNode
WITH s, MIN(subgraphNode.value) as minimum, MAX(subgraphNode.value) as maximum
WITH s, maximum - minimum as difference
...

Related

Cypher - unlimited path length and large path length queries hang

I am using Neo4j Community 4.0.4.
I have encountered this issue using the official Bolt driver for Python, but it is also completely reproducible in the Neo4j Browser (version 4.0.7).
I have a very simple graph for now, consisting of the following node and relationship types:
(:Document)-[:contains]->(:Block)
(:Block)<-[:prev]-(:Block)-[:next]->(:Block)
There are only 75 nodes in my entire test database for now - 1 Document node and 74 Block nodes.
Running the following Cypher statement brings the CPU to 100% and the memory utilization rises indefinitely, after which I have to kill the session:
match (d:Doc{name: 'doc name'})
optional match (d)-[*]-(n)
return d,n
I also got the Java heap size error at some point.
It only starts to work if I set a strict upper bound on the relationship or specify the direction, e.g.:
optional match (d)-[*..5]->(n)
For example, this already does not work (the answer takes forever so I have to kill the session):
optional match (d)-[*..5]-(n)
Considering that (a) I am doing a strictly local graph traversal that graph databases are supposed to be exceptionally good at, (b) clusters associated with different starting nodes are NOT connected and (c) my test data set is tiny, how can this be happening?
From the symptoms it appears that the engine simply does not keep track of which nodes and relationships were already visited when preparing the results ... or am I missing something?
UPDATE:
This was just answered via the Neo4j community forum by a Neo4j staff member:
https://community.neo4j.com/t/getting-paths-of-any-length-or-long-paths-does-not-work/18298
I wrongly assumed that Cypher would dynamically switch from path-uniqueness traversal to node-uniqueness traversal just because the operation following the match dealt only with nodes and not with relationships.
Poor assumption on my part - not only does Cypher not do it automatically, there is no way AT ALL in core Cypher to drop a path during traversal if all the nodes in the path were already visited.
The APOC-based solution was suggested:
match (d:Doc{name: 'doc name'})
CALL apoc.path.subgraphNodes(d, {}) YIELD node as n
return d, n
In my case I have disconnected sub-graphs that are tens of thousands of nodes each and are relatively dense. This came up when trying to delete a (:Doc) node and everything that's connected to it before re-loading a new version of the sub-graph into Neo4j:
detach delete d, n
I see this task of "removing the old version before re-loading" as a very common operational task for sub-graphs that many people may have in their use cases... Installing and managing additional libraries (like APOC or the Graph Data Science library) seems like overkill for something this simple... But it's either that or making the deletions more targeted.
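For reference, a hedged sketch of that APOC-based deletion, batched with apoc.periodic.iterate() so that sub-graphs of tens of thousands of nodes don't exhaust the heap (the batch size is an assumption):
CALL apoc.periodic.iterate(
  "MATCH (d:Doc {name: 'doc name'})
   CALL apoc.path.subgraphNodes(d, {}) YIELD node
   RETURN node",
  "DETACH DELETE node",
  {batchSize: 10000})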
A MATCH clause avoids traversing the same relationship twice within a single path, so that alone is not the issue. However, it can still travel between the same 2 nodes multiple times (as long as different relationships are used).
The main thing to consider is that variable-length relationship patterns have exponential time and memory complexity. If the nodes being traversed have an average of R relevant relationships, then the MATCH clause has to traverse about R^P possible paths of length P. For example, with R = 10, an unbounded pattern that reaches depth 6 already has on the order of 10^6 = 1,000,000 paths to enumerate. The higher P gets (especially with no upper bound), the worse it gets; but a high R also hurts.

Retrieve All Nodes That Can Be Reached By A Specific Node In A Directed Graph

Given a graph in Neo4j that is directed (but possible to have cycles), how can I retrieve all nodes that are reachable from a specific node with Cypher?
(Also: how long can I expect a query like this to take if my graph has 2 million nodes, and by extension 48 million nodes? A rough gauge will do, e.g. less than a minute, a few minutes, an hour.)
Cypher's uniqueness behavior is that relationships must be unique per path (each relationship can only be traversed once per path). This isn't efficient for these kinds of use cases, where the goal is instead to find distinct nodes, so each node should be visited only once in total (across all paths, not per path).
There are some path expander procedures in the APOC Procedures library that are directed at these use cases.
If you're trying to find all reachable nodes from a starting node, traversing relationships in either direction, you can use apoc.path.subgraphNodes() like so, using the movies graph as an example:
MATCH (n:Movie {title:"The Matrix"})
CALL apoc.path.subgraphNodes(n, {}) YIELD node
RETURN node
If you only want reachable nodes going in a specific direction (let's say outgoing), then you can use a relationshipFilter to specify this. You can also add the relationship type if that's important, but if we only want nodes reachable via any outgoing relationship, the query would look like:
MATCH (n:Movie {title:"The Matrix"})
CALL apoc.path.subgraphNodes(n, {relationshipFilter:'>'}) YIELD node
RETURN node
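For completeness, a sketch that also restricts the relationship type: in the movies graph, ACTED_IN relationships point into a :Movie, so an incoming filter yields the actors:
MATCH (n:Movie {title:"The Matrix"})
CALL apoc.path.subgraphNodes(n, {relationshipFilter:'<ACTED_IN'}) YIELD node
RETURN node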
In any of these cases the approach should work better than Cypher alone, especially in a moderately connected graph, since only a single path is ever considered for each reachable node: alternate paths to an already-visited node are pruned, cutting down the possible paths to explore during traversal, and we don't care about those alternate paths for this use case.
Have a look here, where an algorithm is used for community detection.
You can use something like
match (n:Movie {title:"The Matrix"})-[r*1..50]-(m) return distinct id(m)
but that is slow (tested on the Neo4j movie dataset with 60k nodes, the above already runs for more than 10 minutes). Memory usage will probably become an issue when you have a dataset consisting of millions of nodes. Beyond that, it also depends on how your dataset is connected, e.g. the number of relationships.

neo4j shortestPath algorithm

I have a question about the shortestPath algorithm in Neo4j.
If I have a graph with 10^6 nodes and each node has 1000 relationships, searching for the shortest path up to 4 levels must examine 1000*1000*1000*1000 = 10^12 nodes, which is more than the total number of nodes; the reason is that some nodes are repeated during the search. My question is whether Neo4j's shortestPath algorithm takes time proportional to touching 10^6 nodes or 10^12 nodes. In other words, does it mark nodes that were already searched so as not to search them again?
Thanks 
I don't believe that kind of pruning is used. In Cypher, the default uniqueness for traversals is RELATIONSHIP_PATH: within each path, relationships must be unique; they can't be reused.
You might try using either the shortestPath proc in the Graph Algorithms project or one of APOC Procedures' path expander procs instead.
With APOC path expanders, you can either set the uniqueness yourself to NODE_GLOBAL (which prevents processing of the same nodes multiple times during all expansions), or use one of the procs that already does this under the hood (subgraphNodes(), subgraphAll(), or spanningTree()).
The gotchas (at the moment) with APOC are that you can't currently supply the end nodes of the expansion (you'll have to expand out to nodes with certain defined labels and filter your results afterwards with a WHERE clause), and expansions only go in one direction (from the start node out) instead of bi-directionally (as with Cypher's shortestPath()), so you won't realize the efficiency improvements that can come from expanding from the other direction as well.
I currently have a PR on APOC to supply known end nodes of the expansion, so that should make it into the next APOC release (within the next week or so).
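To make that concrete, here is a hedged sketch of the expand-then-filter approach; the :Person start node, the :Target end label, and the level cap are placeholder assumptions:
MATCH (start:Person {name: 'start'})
CALL apoc.path.expandConfig(start, {
  uniqueness: 'NODE_GLOBAL',  // visit each node at most once across the whole expansion
  bfs: true,                  // breadth-first, so the first path found to a node is a shortest one
  maxLevel: 4
}) YIELD path
WITH path, last(nodes(path)) AS endNode
WHERE endNode:Target          // filter to the desired end nodes afterwards, per the gotcha above
RETURN path, endNode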

Neo4j Cypher: Match and Delete the subgraph based on value of node property

Suppose I have 3 subgraphs in Neo4j and I would like to select and delete a whole subgraph if all the nodes in it match the filtering criterion, namely that each node's property value is <= 1. However, if there is at least one node within the subgraph that does not match the criterion, then the subgraph will not be deleted.
In this case the left subgraph will be deleted, but the right subgraph and the middle one will stay. The right one will not be deleted even though it has some nodes with value 1, because it also has nodes with values greater than 1.
userids and values are the node properties.
I would be thankful if anyone could suggest a Cypher query that can do that. Please note that the query will be on the whole graph, that is on all three subgraphs, or more if there are any more.
Thanks for the clarification; that's a tricky requirement, and it's not immediately clear to me which approach will scale well with large graphs, as most possibilities seem to be expensive full-graph operations. We'll likely need a few steps to set up the graph for easier querying later. I'm also assuming you mean "disconnected subgraphs"; otherwise this answer won't work.
One start might be to label nodes as :Alive or :Dead based upon the property value. It helps if all nodes have the same label, and if there's an index on the value property for that label, so our match operations can take advantage of the index instead of having to do a full label scan and property comparison.
MATCH (a:MyNode)
WHERE a.value <= 1
SET a:Dead
And separately
MATCH (a:MyNode)
WHERE a.value > 1
SET a:Alive
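As an aside, the two passes could be combined into one statement with the conditional-FOREACH idiom (a sketch; running them separately as above may still be preferable so each pass can use the index on value):
MATCH (a:MyNode)
FOREACH (x IN CASE WHEN a.value <= 1 THEN [1] ELSE [] END | SET a:Dead)
FOREACH (x IN CASE WHEN a.value > 1 THEN [1] ELSE [] END | SET a:Alive)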
Then your query to mark nodes to delete would be:
MATCH (a:Dead)
WHERE NOT (a)-[*]-(:Alive)
SET a:ToDelete
And if all looks good with the nodes you've marked for delete, you can run your delete operation, using apoc.periodic.commit() from APOC Procedures to batch the operation if necessary.
MATCH (a:ToDelete)
DETACH DELETE a
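A hedged sketch of that batched deletion with apoc.periodic.commit(), which re-runs the statement until it deletes nothing (the batch size is an assumption):
CALL apoc.periodic.commit(
  "MATCH (a:ToDelete)
   WITH a LIMIT $limit
   DETACH DELETE a
   RETURN count(*)",
  {limit: 10000})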
If operations on disconnected subgraphs are going to be common, I highly encourage connecting a special node to each subgraph you create (such as a single :Cluster node at the head of the subgraph) so you can begin such operations from the :Cluster nodes. That would greatly speed up these kinds of queries, since the query operations would be executed per cluster instead of per :Dead node.

Neo4j and Cypher - How can I create/merge chained sequential node relationships (and even better time-series)?

To keep things simple, as part of the ETL on my time-series data, I added a sequence number property to each row, corresponding to 0..370365 (370,366 nodes, 5,555,490 properties - not that big). I later added a second property, naming the original "outeseq" and the second "ineseq", to see if an outright equivalence to base the relationship on might speed things up a bit.
I can get both of the following queries to run properly on up to ~30k nodes (LIMIT 30000), but past that it's just an endless wait. My JVM has 16g max (if it can even use it on a Windows box):
MATCH (a:BOOK),(b:BOOK)
WHERE a.outeseq=b.outeseq-1
MERGE (a)-[s:FORWARD_SEQ]->(b)
RETURN s;
or
MATCH (a:BOOK),(b:BOOK)
WHERE a.outeseq=b.ineseq
MERGE (a)-[s:FORWARD_SEQ]->(b)
RETURN s;
I also added these in hopes of speeding things up:
CREATE CONSTRAINT ON (a:BOOK)
ASSERT a.outeseq IS UNIQUE
CREATE CONSTRAINT ON (b:BOOK)
ASSERT b.ineseq IS UNIQUE
I can't get the relationships created for the entire data set! Help!
Alternatively, I can also get bits of the relationships built with parameters, but I haven't figured out how to parameterize the sequence over all of the node-to-node sequential relationships, at least not in a semantically general enough way to do this.
I profiled the query, but didn't see any reason for it to "blow up".
Another question: I would like each relationship to have a property representing the difference between the time-stamps of the two nodes, i.e. delta-t. Is there a way to take the difference between the values in two sequential nodes and assign it to the relationship?... for all of the relationships at the same time?
The last Q, if you have the time - I'd really like to use the raw data and just chain the directed relationships from one node's stamp to the next nearest node with the minimum delta, but I didn't attempt this for fear that it would cause a scan of all the nodes in order to build each relationship.
Before anyone suggests that I look to KDB or other db's for time series, let me say I have a very specific reason to want to use a DAG representation.
It seems like this should be so easy...it probably is and I'm blind. Thanks!
Creating Relationships
Since your queries work on 30k nodes, I'd suggest running them page by page over all the nodes. This seems feasible because outeseq and ineseq are unique and numeric, so you can sort nodes by those properties and run the query against one slice at a time.
MATCH (a:BOOK),(b:BOOK)
WHERE a.outeseq = b.outeseq-1
WITH a, b ORDER BY a.outeseq SKIP {offset} LIMIT 30000
MERGE (a)-[s:FORWARD_SEQ]->(b)
RETURN s;
You will need to run the query about 13 times, changing {offset} each time, to cover all the data. It would be nice to write a script in any language that has a Neo4j client.
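Alternatively, a hedged sketch using APOC's apoc.periodic.iterate() to handle the batching for you, assuming the unique constraint on outeseq is in place so the inner MATCH can hit the index:
CALL apoc.periodic.iterate(
  "MATCH (a:BOOK) RETURN a",
  "MATCH (b:BOOK {outeseq: a.outeseq + 1})
   MERGE (a)-[:FORWARD_SEQ]->(b)",
  {batchSize: 10000})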
Updating Relationship's Properties
You can assign the timestamp delta to relationships using a SET clause following the MATCH. Assuming the timestamp is a long:
MATCH (a:BOOK)-[s:FORWARD_SEQ]->(b:BOOK)
SET s.delta = abs(b.timestamp - a.timestamp);
Chaining Nodes With Minimal Delta
Once relationships have the delta property, the graph becomes a weighted graph, so we can apply this approach to calculate the shortest path using the deltas. Then we just save the length of the shortest path (the sum of the deltas) on a relationship between the first and the last node.
MATCH p=(a:BOOK)-[:FORWARD_SEQ*1..]->(b:BOOK)
WITH p AS shortestPath, a, b,
reduce(weight = 0, r IN relationships(p) | weight + r.delta) AS totalDelta
ORDER BY totalDelta ASC
LIMIT 1
MERGE (a)-[nearest:NEAREST {delta: totalDelta}]->(b)
RETURN nearest;
Disclaimer: queries above are not supposed to be totally working, they just hint possible approaches to the problem.
