neo4j shortestPath algorithm - neo4j

I have a question about the shortestPath algorithm in neo4j.
If I have a graph with 10^6 nodes and each node has 1000 relationships, a naive search for the shortest path up to 4 levels deep would touch 1000*1000*1000*1000 = 10^12 nodes, which is more than the total number of nodes, because some nodes are reached repeatedly during the search. My question is whether neo4j's shortestPath algorithm takes the time of touching 10^6 nodes or 10^12 nodes. In other words, does it mark nodes that have already been searched so that it does not search them again?
Thanks

I don't believe that kind of pruning is used. In Cypher, the default uniqueness for traversals is RELATIONSHIP_PATH: within each path, every relationship must be unique and can't be reused.
You might try using either the shortestPath proc in the Graph Algorithms project or one of APOC Procedures' path expander procs instead.
With APOC path expanders, you can either set the uniqueness yourself to NODE_GLOBAL (which prevents processing of the same nodes multiple times during all expansions), or use one of the procs that already does this under the hood (subgraphNodes(), subgraphAll(), or spanningTree()).
There are two gotchas (at the moment) with APOC. First, you can't currently supply the end nodes of the expansion, so you'll have to expand out to nodes with certain defined labels and filter your results afterwards with a WHERE clause. Second, expansions only go in one direction (from the start node out) instead of being bi-directional (as with Cypher's shortestPath()), so you won't realize any efficiency improvements that can come from expanding from the other direction as well.
I currently have a PR on APOC to supply known end nodes of the expansion, so that should make it into the next APOC release (within the next week or so).
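To make the NODE_GLOBAL idea concrete, here is a minimal Python sketch of a breadth-first expansion that marks nodes globally. It is not Neo4j's or APOC's implementation; the function names and the toy ring graph are hypothetical and exist only for illustration. It shows that with global marking the number of node expansions is bounded by the number of nodes, while the number of distinct paths grows as degree ** depth.

```python
from collections import deque

def bfs_node_global(adjacency, start, max_depth):
    """Breadth-first expansion that marks nodes globally,
    so each node is expanded at most once across all paths."""
    visited = {start}
    frontier = deque([(start, 0)])
    expansions = 0
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_depth:
            continue  # depth limit reached, do not expand further
        expansions += 1
        for neighbour in adjacency[node]:
            if neighbour not in visited:  # prune nodes already seen
                visited.add(neighbour)
                frontier.append((neighbour, depth + 1))
    return visited, expansions

def build_ring_graph(node_count, degree):
    # Hypothetical toy graph: node i links to the next `degree` nodes on a ring.
    return {i: [(i + k) % node_count for k in range(1, degree + 1)]
            for i in range(node_count)}

adjacency = build_ring_graph(node_count=1000, degree=10)
visited, expansions = bfs_node_global(adjacency, start=0, max_depth=4)
naive_paths = 10 ** 4  # degree ** depth: paths a search without marking would enumerate

print(len(visited), expansions, naive_paths)
```

The same gap, scaled up to 1000 relationships per node and 4 levels, is the 10^6 versus 10^12 difference asked about in the question.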

Related

Cypher - unlimited path length and large path length queries hang

I am using Neo4j Community 4.0.4.
I have encountered this issue using the official Bolt driver for Python, but it is also completely reproducible in the Neo4j browser (version 4.0.7).
I have a very simple graph for now, consisting of the following node and relationship types:
(:Document)-[:contains]->(:Block)
(:Block)<-[:prev]-(:Block)-[:next]->(:Block)
There are only 75 nodes in my entire test database for now - 1 Document node and 74 Block nodes.
Running the following Cypher statement brings the CPU to 100% and the memory utilization rises indefinitely, after which I have to kill the session:
match (d:Doc{name: 'doc name'})
optional match (d)-[*]-(n)
return d,n
I also got the Java heap size error at some point.
It only starts to work if I set a strict upper bound on the relationship or specify the direction, e.g.:
optional match (d)-[*..5]->(n)
For example, this already does not work (the answer takes forever so I have to kill the session):
optional match (d)-[*..5]-(n)
Considering that (a) I am doing a strictly local graph traversal that graph databases are supposed to be exceptionally good at, (b) clusters associated with different starting nodes are NOT connected and (c) my test data set is tiny, how can this be happening?
From the symptoms it appears that the engine simply does not keep track of which nodes and relationships were already visited when preparing the results ... or am I missing something?
UPDATE:
This was just answered via the Neo4j community forum by a Neo4j staff member:
https://community.neo4j.com/t/getting-paths-of-any-length-or-long-paths-does-not-work/18298
I wrongly assumed that Cypher would just dynamically switch from the path uniqueness traversal to the node uniqueness traversal just because the operation following the match dealt only with nodes and not with relationships.
Poor assumption on my part: not only does Cypher not do this automatically, there is no way AT ALL in core Cypher to drop a path during traversal if all the nodes in the path were already visited.
The APOC-based solution was suggested:
match (d:Doc{name: 'doc name'})
CALL apoc.path.subgraphNodes(d, {}) YIELD node as n
return d, n
In my case I have disconnected sub-graphs that are tens of thousands of nodes each and are relatively dense. This came up when trying to delete a (:Doc) node and everything that's connected to it before re-loading a new version of the sub-graph into Neo4j:
detach delete d, n
I see this task of "removing the old version before re-loading" as a very common operational task for sub-graphs that many people may have in their use cases... Installing and managing additional libraries (like APOC or the Graph Data Science library) seems like overkill for something this simple... But it's either that or making the deletions more targeted.
A MATCH clause avoids traversing the same relationship twice, so that would not be the issue. However, it can still travel between the same 2 nodes multiple times (as long as different relationships are used).
The main thing to consider is that variable-length relationship patterns have exponential (time and memory) complexity. If the nodes being traversed have an average of R relevant relationships, then the MATCH clause has to traverse about R**P possible paths of length P. The higher that P gets (especially with no upper bound), the worse it gets. But a high R also hurts.
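As a rough illustration of why a 75-node graph can still blow up, the Python sketch below counts relationship-unique paths on a toy chain where each consecutive pair of nodes is joined by two relationships (similar in shape to the :next/:prev pairs in the question). This is an assumed model of the data, not the asker's actual graph, and the helper names are hypothetical.

```python
def build_double_chain(node_count):
    """Chain of nodes where each consecutive pair is joined by two relationships.
    Returns {node: [(relationship_id, other_node), ...]}."""
    adjacency = {i: [] for i in range(node_count)}
    rel_id = 0
    for i in range(node_count - 1):
        for _ in range(2):  # two parallel relationships, e.g. :next and :prev
            adjacency[i].append((rel_id, i + 1))
            adjacency[i + 1].append((rel_id, i))
            rel_id += 1
    return adjacency

def count_relationship_unique_paths(adjacency, start):
    """Count every path of length >= 1 from start, ignoring direction,
    where no relationship is used twice within the same path."""
    def extend(node, used):
        total = 0
        for rel, other in adjacency[node]:
            if rel not in used:
                used.add(rel)
                total += 1 + extend(other, used)  # this path plus all its extensions
                used.remove(rel)
        return total
    return extend(start, set())

for n in range(2, 13):
    chain = build_double_chain(n)
    print(n, "nodes:", count_relationship_unique_paths(chain, 0), "paths")
```

In this toy shape the path count more than doubles with every node added to the chain, while the number of distinct reachable nodes only grows by one. That is the gap between path uniqueness and node uniqueness.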

Retrieve All Nodes That Can Be Reached By A Specific Node In A Directed Graph

Given a graph in Neo4j that is directed (but possible to have cycles), how can I retrieve all nodes that are reachable from a specific node with Cypher?
(Also: how long can I expect a query like this to take if my graph has 2 million nodes, and by extension 48 million nodes? A rough gauge will do eg. less than a minute, few minutes, an hour)
Cypher's uniqueness behavior is that relationships must be unique per path (each relationship can only be traversed once per path), but this isn't efficient for these kinds of use cases, where the goal is instead to find distinct nodes, so a node should only be visited once total (across all paths, not per path).
There are some path expander procedures in the APOC Procedures library that are directed at these use cases.
If you're trying to find all reachable nodes from a starting node, traversing relationships in either direction, you can use apoc.path.subgraphNodes() like so, using the movies graph as an example:
MATCH (n:Movie {title:"The Matrix"})
CALL apoc.path.subgraphNodes(n, {}) YIELD node
RETURN node
If you only wanted reachable nodes going a specific direction (let's say outgoing) then you can use a relationshipFilter to specify this. You can also add in the type too if that's important, but if we only wanted reachable via any outgoing relationship the query would look like:
MATCH (n:Movie {title:"The Matrix"})
CALL apoc.path.subgraphNodes(n, {relationshipFilter:'>'}) YIELD node
RETURN node
In either case these approaches should work better than with Cypher alone, especially in any moderately connected graph, as there will only ever be a single path considered for every reachable node (alternate paths to an already visited node will be pruned, cutting down on the possible paths to explore during traversal, which is efficient as we don't care about these alternate paths for this use case).
Have a look here, where an algorithm is used for community detection.
You can use something like
match (n:Movie {title:"The Matrix"})-[r*1..50]-(m) return distinct id(m)
but that is slow (tested on the Neo4j movie dataset with 60k nodes, the query above already runs for more than 10 minutes). Memory usage will probably become an issue when you have a dataset consisting of millions of nodes. Besides that, it also depends on how your dataset is connected, e.g. the number of relationships.

neo4j for fraud detection - efficient data structure

I'm trying to improve a fraud detection system for a commerce website. We deal with direct bank transactions, so fraud is a risk we need to manage. I recently learned of graph databases and can see how they apply to these problems. So, over the past couple of days I set up neo4j and parsed our data into it.
My intuition was to create a node for each order, and a node for each piece of data associated with it, and then connect them all together. Like this:
MATCH (w:Wallet),(i:Ip),(e:Email),(o:Order)
WHERE w.wallet="ex" AND i.ip="ex" AND e.email="ex" AND o.refcode="ex"
CREATE (w)-[:USED]->(o),(i)-[:USED]->(o),(e)-[:USED]->(o)
But this query runs very slowly as the database size increases (I assume because it needs to search the whole data set for the nodes I'm asking for). It also takes a long time to run a query like this:
START a=node(179)
MATCH (a)-[:USED*]-(d)
WHERE EXISTS(d.refcode)
RETURN distinct d
This is intended to extract all orders that are connected to a starting point. I'm very new to Cypher (<24 hours), and I'm finding it particularly difficult to search for solutions.
Are there any specific issues with the data structure or queries that I can address to improve performance? It ideally needs to complete this kind of thing within a few seconds, as I'd expect from a SQL database. At this time we have about 17,000 nodes.
Always a good idea to completely read through the developers manual.
For speeding up lookups of nodes by a property, you definitely need to create indexes or unique constraints (depending on if the property should be unique to a label/value).
Once you've created the indexes and constraints you need, they'll be used under the hood by your query to speed up your matches.
START is only used for legacy indexes, and for the latest Neo4j versions you should use MATCH instead. If you're matching based upon an internal id, you can use MATCH (n) WHERE id(n) = xxx.
Keep in mind that you should not persist node ids outside of Neo4j for lookup in future queries, as internal node ids can be reused as nodes are deleted and created, so an id that once referred to a node that was deleted may later end up pointing to a completely different node.
Using labels in your queries should help your performance. In the query you gave to find orders, Neo4j must inspect every end node in your path to see if the property exists. Property access tends to be expensive, especially when you're using a variable-length match, so it's better to restrict the nodes you want by label.
MATCH (a)-[:USED*]-(d:Order)
WHERE id(a) = 179
RETURN distinct d
On larger graphs, the variable-length match might start slowing down, so you may get more performance by installing APOC Procedures and using the Path Expander procedure to gather all subgraph nodes and filter down to just Order nodes.
MATCH (a)
WHERE id(a) = 179
CALL apoc.path.expandConfig(a, {bfs:true, uniqueness:"NODE_GLOBAL"}) YIELD path
WITH LAST(NODES(path)) as d
WHERE d:Order
RETURN d

Neo4j Cypher: Finding the maximum and minimum node value in every disconnected subgraph and take the difference

I have a graph as shown below. I would like to find the maximum value and the minimum value in a subgraph, take the difference, and return it.
For instance the right-most subgraph has 4 nodes. Maximum value is 3 and Minimum value is 1, I would like to take the difference and return, which for this case is 2. This should happen for every disconnected subgraph in the whole graph database. I will prefer to handle each subgraph using one query, that way it can be done in batch and difference for each subgraph can be returned.
I will be thankful to get some intuition.
The real problem will be finding those subgraphs, as Neo4j has no native support for disconnected subgraph detection or tracking, and will require some intensive full graph queries to identify them.
I've provided an approach to finding disconnected subgraphs and attaching a :Subgraph node to the node with the smallest id in the subgraph in this answer to a similar question.
Once the :Subgraph nodes are in place, you are free to batch queries on the subgraphs.
As noted in that answer, it does not provide an approach to keeping up with graph changes which end up affecting subgraphs (creating new subgraphs, merging subgraphs, dividing subgraphs).
EDIT
Once you have a :Subgraph node attached to each disconnected subgraph, you can perform operations on subgraphs easily.
You might use this query to calculate the difference:
MATCH (s:Subgraph)-[*]-(subgraphNode)
WITH DISTINCT s, subgraphNode
WITH s, MIN(subgraphNode.value) as minimum, MAX(subgraphNode.value) as maximum
WITH s, maximum - minimum as difference
...
If you need to batch that query, then you'll want to use APOC Procedures, probably apoc.periodic.iterate().
EDIT
After some testing, it seems like APOC's Path Expander functionality, using NODE_GLOBAL uniqueness, leads to a more efficient means to find all nodes within a subgraph.
I'll be altering my linked answer accordingly. Here's how this would work with the subgraph query:
MATCH (s:Subgraph)
CALL apoc.path.expandConfig(s,{minLevel:1, bfs:true, uniqueness:"NODE_GLOBAL"}) YIELD path
WITH s, last(nodes(path)) as subgraphNode
WITH s, MIN(subgraphNode.value) as minimum, MAX(subgraphNode.value) as maximum
WITH s, maximum - minimum as difference
...
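For intuition about what the queries above compute, here is a small Python sketch that does the same thing on an in-memory graph: it finds each disconnected subgraph with a breadth-first search that visits every node once, then takes max minus min of the node values. The function name and sample data are hypothetical; keying each result by the smallest node id mirrors where the :Subgraph node is attached in the approach described above.

```python
from collections import deque

def value_spread_per_component(values, edges):
    """values: {node_id: number}; edges: iterable of (u, v), treated as undirected.
    Returns {smallest node id in the component: max(value) - min(value)}."""
    adjacency = {node: set() for node in values}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen = set()
    spreads = {}
    for start in sorted(values):  # sorted, so `start` is the smallest id of its component
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        members = []
        while queue:
            node = queue.popleft()
            members.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:  # each node is visited once overall
                    seen.add(neighbour)
                    queue.append(neighbour)
        component_values = [values[m] for m in members]
        spreads[start] = max(component_values) - min(component_values)
    return spreads

# Made-up sample data: two disconnected subgraphs.
values = {1: 5, 2: 9, 3: 7, 4: 1, 5: 3, 6: 2, 7: 2}
edges = [(1, 2), (2, 3), (4, 5), (5, 6), (6, 7)]
spreads = value_spread_per_component(values, edges)
print(spreads)
```

The second subgraph here has 4 nodes with maximum 3 and minimum 1, so its difference is 2, matching the example in the question.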

Neo4j - get all articulation vertices

Using Neo4j, I would like to get all the articulation vertices (vertices/nodes that, when removed, split the graph into more connected components) from my graph.
Is there an easy way to do it (without completely re-implementing DFS)?
Alternatively, is there a possibility to do a traversal with the exclusion of a certain node? (and its relationships) (I have a fairly small number of nodes, using neo4j embedded so optimal O() is not critical)
You could exclude nodes by not continuing past them, e.g. with the Traversal Framework, see http://docs.neo4j.org/chunked/snapshot/tutorials-java-embedded-traversal.html#_new_traversal_framework. Also, you could implement your own RelationshipExpander that will not expand relationships to the node you want to avoid in a traversal, see http://components.neo4j.org/neo4j/1.5.M01/apidocs/org/neo4j/graphdb/RelationshipExpander.html
HTH
/peter
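The "traverse while excluding one node" idea can be sketched in plain Python as a brute-force check, which is acceptable when the graph is small and optimal complexity is not critical. This is not the Neo4j Traversal Framework API; the function names and the sample graph are hypothetical. A node is reported as an articulation vertex if excluding it increases the number of connected components.

```python
def reachable(adjacency, start, excluded):
    """Nodes reachable from start without passing through the excluded node."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbour in adjacency[node]:
            if neighbour != excluded and neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen

def count_components(adjacency, excluded=None):
    """Number of connected components, optionally ignoring one node."""
    seen = set()
    count = 0
    for node in adjacency:
        if node != excluded and node not in seen:
            seen |= reachable(adjacency, node, excluded)
            count += 1
    return count

def articulation_vertices(adjacency):
    """Brute force: one full traversal per candidate node."""
    baseline = count_components(adjacency)
    return sorted(n for n in adjacency
                  if count_components(adjacency, excluded=n) > baseline)

# Made-up undirected graph: triangle a-b-c with a tail c-d-e.
adjacency = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c", "e"],
    "e": ["d"],
}
cut_nodes = articulation_vertices(adjacency)
print(cut_nodes)
```

This runs one traversal per node, so it costs roughly nodes times (nodes + relationships); a single DFS with low-link values is faster but needs the fuller implementation the question hoped to avoid.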
