The docs of the Neo4j Graph Data Science library state:
There are multiple termination conditions supported for the traversal,
based on either reaching one of several target nodes, reaching a
maximum depth, exhausting a given budget of traversed relationship
cost, or just traversing the whole graph.
But in the algorithm-specific parameters I could not find any parameter for constraining the maximum cost of the traversal (or simply the number of relationships, if every cost is 1). The only parameters listed are startNodeId, targetNodes and maxDepth.
Any idea if this can actually be done, or if the docs are incorrect?
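For reference, here is roughly what a call looks like using only the documented parameters (the graph name and labels are placeholders, and I'm assuming the alpha-tier procedure name gds.alpha.bfs.stream; the YIELD column may differ by version):
MATCH (s:Place {name: 'A'}), (t:Place {name: 'Z'})
CALL gds.alpha.bfs.stream('myGraph', {
  startNodeId: id(s),    // documented
  targetNodes: [id(t)],  // documented
  maxDepth: 3            // documented; nothing like a maxCost option is listed
})
YIELD path
RETURN path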
Here is the list of procedures and functions for your reference. As you can see, Breadth First Search is still in the alpha tier, so no estimate function is available for it yet. Functions in the beta and production tiers have a companion *.estimate function; these give you an idea of how much memory will be used when you run the corresponding algorithm. An example of gds.nodeSimilarity.write.estimate can be found below:
CALL gds.nodeSimilarity.write.estimate('myGraph', {
  writeRelationshipType: 'SIMILAR',
  writeProperty: 'score'
})
YIELD nodeCount, relationshipCount, bytesMin, bytesMax, requiredMemory
nodeCount | relationshipCount | bytesMin | bytesMax | requiredMemory
9         | 9                 | 2592     | 2808     | "[2592 Bytes ... 2808 Bytes]"
We are trying to find a way to create a full distance matrix in a Neo4j database, where distance is defined as the length of the shortest path between any two nodes. Of course, there is the shortestPath method, but looping over all pairs of nodes and calculating their shortest paths gets very slow. We are explicitly not talking about allShortestPaths, because that returns all shortest paths between 2 specific nodes.
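For concreteness, the pairwise loop we have been running looks roughly like this (labels are placeholders); with more than 30k nodes that is on the order of 450 million pairs:
MATCH (a:Node), (b:Node) WHERE id(a) < id(b)
MATCH p = shortestPath((a)-[*]-(b))
RETURN id(a) AS source, id(b) AS target, length(p) AS distance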
Is there a specific method or approach that is fast for a large number of nodes (>30k)?
Thank you!
j.
There is no easier method; the full distance matrix will take a long time to build.
As you've described it, the full distance matrix must contain the shortest path between every pair of nodes, which means you will have to compute that information at some point. The cheapest general approach is to run a single-source shortest-path algorithm once from each node, so the overall complexity will be O(n) multiplied by the complexity of one run of that algorithm.
But you can cut down on the runtime with a dynamic programming solution.
You could certainly leverage some dynamic programming methods to cut down on the calculation time. For instance, if you are trying to find the shortest path between (A) and (C), and have already calculated the shortest path from (B) to (C), then if you happen to encounter (B) while pathfinding from (A), you do not need to recalculate the rest of the cost of that path; it is already known.
However, a dynamic programming solution of any reasonable complexity is almost certainly best done in a separate module for Neo4j, packaged as a plugin. If this is a one-time operation, or one that won't be run frequently, it might be easier to use the naive solution of calling shortestPath for each pair; but if you plan to run it fairly frequently on dynamic data, it might be worth authoring a custom plugin. It depends entirely on your needs.
No matter what, though, it will take some time to calculate. The dynamic programming solution will cut down on the time greatly (especially in a densely-connected graph), but it will still not be very fast.
What is the end game? Is this a one-time query that resets some property or creates new edges, or a recurring, frequent effort? If it's one-time, you might create edges between the two nodes at each step, building a transitive-closure environment. Each such edge would point between the two nodes and carry the distance as a property.
Thus, if the path is a>b>c>d, you would create the edges
a>b 1
a>c 2
a>d 3
b>c 1
b>d 2
c>d 1
The edges could be named distinctively to distinguish them from the original path edges. This could create circular paths, which may either negate this strategy or require a constraint; if you are dealing with directed acyclic graphs, it would work well. A sketch of the idea follows.
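A minimal Cypher sketch of that construction, assuming the original path edges are typed :NEXT and naming the new edges :REACHES (both names illustrative):
// Create one distance edge per connected pair, keeping the shortest
// hop count; on a DAG the variable-length match terminates cleanly.
MATCH p = (a)-[:NEXT*]->(b)
WITH a, b, min(length(p)) AS distance
MERGE (a)-[r:REACHES]->(b)
SET r.distance = distance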
I am struggling to find an efficient algorithm that will give me all possible paths between 2 nodes in a directed graph.
I found the RGL gem, the fastest so far in terms of calculations. I am able to find the shortest path using the gem's implementation of Dijkstra's shortest path algorithm.
I googled, and in spite of finding many solutions (Ruby and non-Ruby), I either couldn't convert the code or the code took forever to calculate (inefficient).
I am here primarily to ask whether someone can suggest a way to find all paths by using or tweaking the various algorithms from the RGL gem itself (if possible), or some other efficient approach.
The input directed graph can be an array of arrays:
[[1,2], [2,3], ..]
P.S.: Just to avoid negative votes/comments: unfortunately I don't have the inefficient code snippet to show, as I discarded it days ago and didn't save it anywhere for the record or to reproduce here.
The main problem is that the number of paths between two nodes grows exponentially with the overall number of nodes. Thus any algorithm finding all paths between two nodes will be very slow on larger graphs.
Example:
As an example, imagine an n x n grid of nodes, each connected to its 4 neighbors. Now you want to find all paths from the bottom-left node to the top-right node. Even when you only allow moves to the right (r) and moves up (u), each resulting path can be described by a string of length 2n with an equal number of r's and u's. This gives you "2n choose n" possible paths (ignoring other moves and cycles); already for n = 10 that is C(20, 10) = 184,756 paths.
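That said, if your graphs are small enough for enumeration to be feasible at all, a plain depth-first search is about the best you can do. Here is a minimal sketch in plain Ruby (no RGL; the method name is mine) that takes the array-of-arrays input from the question:
# Enumerate all simple paths (no repeated nodes) from source to
# target in a directed graph given as an array of [from, to] pairs.
def all_paths(edges, source, target)
  adj = Hash.new { |h, k| h[k] = [] }
  edges.each { |u, v| adj[u] << v }
  stack = [[source, [source]]]
  paths = []
  until stack.empty?
    node, path = stack.pop
    if node == target
      paths << path
    else
      # Skip nodes already on the path to avoid cycling forever.
      adj[node].each { |n| stack << [n, path + [n]] unless path.include?(n) }
    end
  end
  paths
end

all_paths([[1, 2], [2, 3], [1, 3]], 1, 3)  # => [[1, 3], [1, 2, 3]]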
In a general sense, is there a best practice to use when attempting to estimate how long setting relationships will take in Neo4j?
For example, I used the data import tool successfully, and here's what I've got in my 2.24GB database:
IMPORT DONE in 3m 8s 791ms. Imported:
7432663 nodes
0 relationships
119743432 properties
In preparation for setting relationships, I set some indices:
CREATE INDEX ON :Player(player_id);
CREATE INDEX ON :Matches(player_id);
Then I let it rip:
MATCH (p:Player),(m:Matches)
WHERE p.player_id = m.player_id
CREATE (p)-[r:HAD_MATCH]->(m)
Then I started to realize that I have no idea how to even estimate how long setting these relationships might take. Is there a 'back of the envelope' calculation for determining at least a ballpark figure for this kind of thing?
I understand that everyone's situation is different on all levels, including software, hardware, and desired schema. But any discussion would no doubt be useful and would deepen my (and any other reader's) understanding.
PS: FWIW, I'm running Ubuntu 14.04 with 16GB RAM and an Intel Core i7-3630QM CPU @ 2.40GHz.
The problem here is that you don't take transaction sizes into account. In your example, all :HAD_MATCH relationships are created in one single large transaction. A transaction is built up in memory first and then flushed to disk. If the transaction is too large to fit in your heap, you might see massive performance degradation due to garbage collection, or even OutOfMemoryErrors.
Typically you want to limit transaction sizes to e.g. 10k - 100k atomic operations.
Probably the easiest way to do transaction batching in this case is the rock_n_roll procedure from neo4j-apoc. It uses one Cypher statement to provide the data to be worked on, and a second one that runs for each result of the first, in batched mode. Note that APOC requires Neo4j 3.x:
CALL apoc.periodic.rock_n_roll(
"MATCH (p:Player),(m:Matches) WHERE p.player_id = m.player_id RETURN p,m",
"WITH {p} AS p, {m} AS m CREATE (p)-[:HAD_MATCH]->(m)",
20000)
There was a bug in 3.0.0 and 3.0.1 that caused this to perform rather badly, so the above is for Neo4j >= 3.0.2.
If you are on 3.0.0 / 3.0.1, use this as a workaround:
CALL apoc.periodic.rock_n_roll(
"MATCH (p:Player),(m:Matches) WHERE p.player_id = m.player_id RETURN p,m",
"CYPHER planner=rule WITH {p} AS p, {m} AS m CREATE (p)-[:HAD_MATCH]->(m)",
20000)
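If you are on a newer APOC release, rock_n_roll has since been deprecated in favor of apoc.periodic.iterate, which batches the same way; a sketch using the same two statements:
CALL apoc.periodic.iterate(
  "MATCH (p:Player),(m:Matches) WHERE p.player_id = m.player_id RETURN p,m",
  "CREATE (p)-[:HAD_MATCH]->(m)",
  {batchSize: 20000})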
I wonder why Neo4j has a capacity limit on nodes and relationships. The limit on nodes and relationships is 2^35, which is a "little" bit more than the "normal" 2^32 integer. Common SQL databases, for example MySQL, store their primary keys as int (2^32) or bigint (2^64). Can you explain the advantages of this decision? In my opinion this is a key decision point when choosing a database.
It is an artificial limit. They are going to remove it in the not-too-distant future, although I haven't heard any official ETA.
Often enough, you run into hardware limits on a single machine before you actually hit this limit.
The current option is to manually shard your graph across different machines. Not ideal for some use cases, but it works in others. In the future they'll have a way to shard data automatically; no ETA on that either.
Update:
I've learned a bit more about Neo4j storage internals. The reason the limits are exactly what they are is that the ID numbers are stored on disk as pointers in several places (node records, relationship records, etc.). To raise the limit they'd need an extra byte per node and per relationship record; the IDs are currently packed as tightly as they can be without using more bytes on disk. Learn more at this great blog post:
http://digitalstain.blogspot.com/2010/10/neo4j-internals-file-storage.html
Update 2:
I've heard that in 2.1 they'll be increasing these limits to about an order of magnitude higher than they currently are.
As of Neo4j 3.0, all of these constraints have been removed.
Dynamic pointer compression expands Neo4j’s available address space as needed, making it possible to store graphs of any size. That’s right: no more 34 billion node limits!
For more information visit http://neo4j.com/blog/neo4j-3-0-massive-scale-developer-productivity.
Is it possible to create a linked list on a GPU using CUDA?
I am trying to do this and I am encountering some difficulties.
If I can't allocate dynamic memory in a CUDA kernel, then how can I create a new node and add it to the linked list?
You really don't want to do this if you can help it. If you can't get away from linked lists, the best thing you can do is emulate them via arrays, using array indices rather than pointers for your links.
There are some valid use cases for linked lists on a GPU. Consider using a Skip List as an alternative, since it provides faster operations. There are examples of highly concurrent Skip List algorithms available via Google searches.
Check out this link http://www.cse.iitk.ac.in/users/mainakc/lockfree.html/ for CUDA code, a PDF, and a PPT presentation on a number of lock-free CUDA data structures.
Linked lists can be constructed in parallel using a reduction-style approach. This assumes that ALL members are known at construction time. Each thread starts by connecting 2 nodes; then half the threads connect the 2-node segments together, and so on, halving the number of active threads each iteration. This builds the list in log2(N) time.
Memory allocation is a constraint. Pre-allocate all the nodes in an array on the host. Then you can use array subscripts in place of pointers. That has the advantage that list traversal is valid on both the GPU and the host.
For concurrency you need CUDA atomic operations: atomic add/increment to count the nodes claimed from the node array, and Compare-And-Swap to set the links between nodes. A sketch follows.
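Here is a minimal CUDA sketch of that scheme, assuming a fixed-capacity pool; all the names (Node, push_values, etc.) are mine, and it links nodes in with a lock-free stack push rather than the reduction described above:
// Linked list over a pre-allocated node pool: array indices instead
// of pointers, so the same traversal code works on host and device.
#include <cstdio>
#include <cuda_runtime.h>

#define CAPACITY 1024
#define EMPTY (-1)

struct Node {
    int value;
    int next;   // index of the next node in the pool, EMPTY at the tail
};

// Each thread claims one node with atomicAdd, then links it in at the
// head with an atomicCAS retry loop (a lock-free stack push).
__global__ void push_values(Node *pool, int *pool_count, int *head,
                            const int *values, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    int idx = atomicAdd(pool_count, 1);   // claim a slot from the pool
    pool[idx].value = values[tid];

    int assumed, old = *head;
    do {
        assumed = old;
        pool[idx].next = assumed;         // point at the current head
        old = atomicCAS(head, assumed, idx);
    } while (old != assumed);             // retry if another thread won
}

int main()
{
    int h_values[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    Node *d_pool; int *d_count, *d_head, *d_values;
    int zero = 0, empty = EMPTY;

    cudaMalloc(&d_pool, CAPACITY * sizeof(Node));
    cudaMalloc(&d_count, sizeof(int));
    cudaMalloc(&d_head, sizeof(int));
    cudaMalloc(&d_values, sizeof(h_values));
    cudaMemcpy(d_count, &zero, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_head, &empty, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_values, h_values, sizeof(h_values), cudaMemcpyHostToDevice);

    push_values<<<1, 8>>>(d_pool, d_count, d_head, d_values, 8);

    // Traversal works on the host too, because links are indices.
    // (Element order depends on how the concurrent pushes raced.)
    Node h_pool[CAPACITY]; int h_head;
    cudaMemcpy(h_pool, d_pool, sizeof(h_pool), cudaMemcpyDeviceToHost);
    cudaMemcpy(&h_head, d_head, sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = h_head; i != EMPTY; i = h_pool[i].next)
        printf("%d ", h_pool[i].value);
    printf("\n");
    return 0;
}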
Again, carefully consider the use case and access patterns. Using one large linked list is very serial; using hundreds of small linked lists is more parallel. Expect memory accesses to be uncoalesced unless care is taken to allocate connected nodes in adjacent memory locations.
I agree with Paul: linked lists are a very 'serial' way of thinking. Forget what you've learned about serial operations and just do everything at once : )
Take a look at Thrust for the way of doing common operations.
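For instance, a bulk transform over a device_vector replaces what would otherwise be a serial walk over list nodes; a tiny sketch:
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/sequence.h>
#include <thrust/transform.h>

int main()
{
    thrust::device_vector<int> v(8);
    thrust::sequence(v.begin(), v.end());   // fill with 0..7 on the device
    // Negate every element in parallel, instead of visiting node by node.
    thrust::transform(v.begin(), v.end(), v.begin(), thrust::negate<int>());
    return 0;
}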