I'm seeing very poor performance when I execute the A* algorithm provided by Neo4j. I created a Maven-based test project; you can find it here: https://github.com/angeloimm/neo4jAstarTest
Basically these are my tests:
A* from node 1 to node 2: 1416 millis
A* from node 1 to node 300000: 3428 millis
A* from node 1 to node 525440: 4128 millis
I was wondering whether these times are the best I can get, or if I can improve on them.
In the configuration file you can see that I configured Neo4j with these settings:
nodestore_mapped_memory_size=250M
relationshipstore_mapped_memory_size=3G
nodestore_propertystore_mapped_memory_size=250M
strings_mapped_memory_size=500M
arrays_mapped_memory_size=50
cache_type=strong
The neo4j version is 2.0.3
Any tips would be really appreciated.
Thank you
Angelo
The reason for it being slow here is that the two nodes in question are not connected with each other; the end node is located in a distinct subgraph from the start node.
Maybe consider a different strategy for that kind of scenario: in a first run, check whether there are any paths at all between the two nodes, and only if that is true, apply aStar to find the shortest one.
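For example, a cheap reachability pre-check could look like this (a minimal sketch in Cypher 2.x syntax, using the node ids from your tests):

// Returns a row only if the two nodes are connected at all, ignoring
// direction; if it returns nothing, skip the A* call entirely.
START a=node(1), b=node(525440)
MATCH p = shortestPath((a)-[*]-(b))
RETURN length(p)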
Related
I run the Dijkstra source-target shortest path algorithm in Neo4j (community edition) for 7 different graphs. The sizes of these graphs are as follows: 6,301 nodes - 8,846 nodes - 10,876 nodes - 22,687 nodes - 26,518 nodes - 36,682 nodes - 62,586 nodes.
For all these graphs, the result (the path) is reported as streamed after 2 ms, while the queries complete after different amounts of time. Is it OK that this reported time is the same for all these graphs regardless of their size?
The same is happening when running the Yen algorithm.
If the time provided by the Neo4j browser is inaccurate, how can I measure the execution time accurately?
Update (tracking the execution time):
Thanks in advance.
The docs of the Neo4j Graph Data Science library state:
There are multiple termination conditions supported for the traversal,
based on either reaching one of several target nodes, reaching a
maximum depth, exhausting a given budget of traversed relationship
cost, or just traversing the whole graph.
But in the algorithm-specific parameters I could not find any parameter for constraining the maximum cost of the traversal (or simply the number of relationships, if each cost is 1). The only parameters listed are startNodeId, targetNodes and maxDepth.
Any idea if this actually can be done, or if the docs are incorrect?
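To make the question concrete: using only the parameters the docs do list, a depth-bounded call would look something like the sketch below ('myGraph', the Place label and name are placeholders of mine, not from the docs):

// Hedged sketch: depth-bounded BFS with the documented parameters only.
MATCH (source:Place {name: 'A'})
CALL gds.alpha.bfs.stream('myGraph', {
  startNodeId: id(source),
  maxDepth: 3
})
YIELD path
RETURN path

Nothing equivalent for a cost budget appears among those listed parameters, which is exactly what I am asking about.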
Here is the list of procedures and functions for your reference. As you can see, Breadth First Search is still in the Alpha stage and no estimate function is available for it yet. You can also see that functions in the Beta and Production stages have a corresponding *.estimate function. These estimate functions give you an idea of how much memory will be used when you run the corresponding data science functions. An example of gds.nodeSimilarity.write.estimate can be found below:
CALL gds.nodeSimilarity.write.estimate('myGraph', {
writeRelationshipType: 'SIMILAR',
writeProperty: 'score'})
YIELD nodeCount, relationshipCount, bytesMin, bytesMax, requiredMemory
nodeCount  relationshipCount  bytesMin  bytesMax  requiredMemory
9          9                  2592      2808      "[2592 Bytes ... 2808 Bytes]"
We are trying to find a way to create a full distance matrix in a Neo4j database, where that distance is defined as the length of the shortest path between any two nodes. Of course, there is the shortestPath method, but looping through all pairs of nodes and calculating their shortestPaths gets very slow. We are explicitly not talking about allShortestPaths, because that returns all shortest paths between 2 specific nodes.
Is there a specific method or approach that is fast for a large number of nodes (>30k)?
Thank you!
j.
There is no easier method; the full distance matrix will take a long time to build.
As you've described it, the full distance matrix must contain the shortest path between every pair of nodes, which means you will have to compute that information at some point. Iterating over each pair of nodes and running a shortest-path algorithm is the only way to do this, and the total complexity is the number of pairs, O(n²) for n nodes, multiplied by the complexity of the algorithm.
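For illustration, the naive pairwise approach can be written as a single Cypher query (a sketch assuming unweighted hop-count distance; it still performs one shortestPath computation per pair and will be slow for >30k nodes):

// Every unordered pair of nodes, one shortestPath computation each.
MATCH (a), (b)
WHERE id(a) < id(b)
MATCH p = shortestPath((a)-[*]-(b))
RETURN id(a) AS source, id(b) AS target, length(p) AS distance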
But you can cut down on the runtime with a dynamic programming solution.
You could certainly leverage some dynamic programming methods to cut down on the calculation time. For instance, if you are trying to find the shortest path between (A) and (C), and have already calculated the shortest path from (B) to (C), then if you happen to encounter (B) while pathfinding from (A), you do not need to recalculate the rest of the cost of that path; it is already known.
However, a dynamic programming solution of any reasonable complexity will almost certainly be best implemented as a separate module for Neo4j, deployed as a plugin. If what you are doing is a one-time operation or an operation that won't be run frequently, it might be easier to just use the naive solution of calling shortestPath between each pair; but if you plan to run it fairly frequently on dynamic data, it might be worth authoring a custom plugin. It totally depends on your needs.
No matter what, though, it will take some time to calculate. The dynamic programming solution will cut down on the time greatly (especially in a densely-connected graph), but it will still not be very fast.
What is the end game? Is this a one-time query that resets some property or creates new edges, or a recurring, frequent effort? If it's one-time, you might create an edge between the two nodes at each step, building a transitive-closure environment. Each such edge would point between the two nodes and have the distance as a property.
Thus, if the path is a>b>c>d, you would create the edges
a>b 1
a>c 2
a>d 3
b>c 1
b>d 2
c>d 1
The edges could be named distinctively to distinguish them from the original path edges. This could create circular paths, which may either negate this strategy or require a constraint. If you are dealing with directed acyclic graphs, it would work well; see the sketch below.
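A minimal Cypher sketch of this materialisation (assuming hop count as the distance; :DIST is a hypothetical relationship type chosen to keep the new edges distinct from the original path edges):

// Create one directed distance edge per reachable ordered pair.
MATCH (a), (b)
WHERE id(a) <> id(b)
MATCH p = shortestPath((a)-[*]->(b))
MERGE (a)-[d:DIST]->(b)
SET d.distance = length(p)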
I was using Neo4j 3.1.0 Enterprise Edition. The main logic in my graph is: there are "IP" nodes and "User" nodes, and both have UNIQUE constraints. Each time a user logs in, I add a relationship from the IP to the User.
Here is my insert Cypher:
MERGE (i:IP {ip:"1.2.3.4"})
MERGE (u:User {username:"xxx#gmail.com"})
MERGE (i) - [l:SUCC] -> (u)
SET i:ExpireNode, i.expire={expire}
SET u:ExpireNode, u.expire={expire}
SET l.expire={expire}, l.login={login}
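For reference, the UNIQUE constraints mentioned above would look like this in 3.x syntax (assuming they are on ip and username, as the MERGE keys suggest):

// Uniqueness constraints also create the indexes that make MERGE lookups fast.
CREATE CONSTRAINT ON (i:IP) ASSERT i.ip IS UNIQUE;
CREATE CONSTRAINT ON (u:User) ASSERT u.username IS UNIQUE;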
The insert is pretty fast at the beginning. But when the number of nodes grows to the millions, it becomes very slow, sometimes taking more than 1 second to insert nodes and relationships.
How can I optimize it? I was running Neo4j on a 12-core CPU with 64G of memory. The initial heap size is 16G and the page cache is 30G.
--------------------------------------------------------------
Tested the same Cypher in the web UI and it took ~10ms per command. But using the Java driver, it sometimes takes more than 1s. Below is my Java code:
try (Transaction tx = session.beginTransaction()) {
    for (Login login : loginList) {
        // Convert the login event into query parameters
        Value value = login2Operation(login);
        // One statement per login, all inside a single transaction
        tx.run(INSERT_COMMANDS_SUCC, value);
    }
    tx.success();
}
--------------------------------------------------------------
After some exploring, I found that the insert speed increased significantly when running in 5 threads. But the overall throughput was still too low, so I had to increase to 100 threads; then the time for each single insert grew to 1s. So I believe the problem is Neo4j's limited ability to handle parallel writes.
In neo4j.conf I added dbms.threads.worker_count=200, but it's not helping. Any ideas?
Thanks to @InverseFalcon's advice; the UNWIND operation helps a lot!
Get more details in Michael Hunger's tips and tricks.
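A sketch of the batched version (the logins parameter name and its row fields are assumptions; the idea is to pass the whole loginList as a single list-of-maps parameter instead of one tx.run call per login):

// One query processes the whole batch; {logins} is a list of maps.
UNWIND {logins} AS row
MERGE (i:IP {ip: row.ip})
MERGE (u:User {username: row.username})
MERGE (i)-[l:SUCC]->(u)
SET i:ExpireNode, i.expire = row.expire
SET u:ExpireNode, u.expire = row.expire
SET l.expire = row.expire, l.login = row.login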
I have a graph with ~89K nodes and ~1.2M relationships, and am trying to get the transitive closure of a single node via the following Cypher query:
START n=node(<id of a single node of interest>)
MATCH (n)-[*1..]->(m)
WHERE has(m.name)
RETURN DISTINCT m.name
Unfortunately, this query goes away and doesn't seem to come back (although to be fair I've only given it about an hour of execution time at this point).
Any suggestions on ways to optimise what I've got here, or better ways to achieve the requirement?
Notes:
Neo4J v2.0.0 (installed via Homebrew).
Mac OSX 10.8.5
Oracle Java 1.7.0_51
8GB physical RAM (neo4j JVM assigned whatever the default is)
Database is hosted on an SSD volume.
Query is submitted via the admin web UI's "Data browser".
"name" is an auto-indexed field.
CPU usage is fairly low - averaging around 20% of 8 cores.
I haven't gotten into the weeds of profiling the Neo4J server yet - my first attempt locked up VisualVM.
That's probably a combinatorial explosion of paths; care to try this?
START n=node(<id of a single node of interest>), m=node:node_auto_index("name:*")
MATCH shortestPath((n)-[*]->(m))
RETURN m.name
Without shortestPath it would look like the query below, but as you are only interested in the nodes reachable from n, the shortestPath version above should be good enough.
START n=node(<id of a single node of interest>), m=node:node_auto_index("name:*")
MATCH (n)-[*]->(m)
RETURN DISTINCT m.name
Try guery - https://code.google.com/p/gueryframework/ - this is a standalone library, but it has a Neo4j adapter. I.e., you will have to rewrite your queries in the guery format.
Better support for transitive closure was one of the main reasons for developing guery; we mainly use it in software analysis tools where we need reachability / pattern analysis (e.g., the antipattern queries in http://xplrarc.massey.ac.nz/ are computed using guery).
There is a brief discussion about this in the Neo4j Google group:
https://groups.google.com/forum/#!searchin/neo4j/jens/neo4j/n69ksEJxDtQ/29DNKyWKur4J
and an (older, not maintained) project with some benchmarking code:
https://code.google.com/p/graph-query-benchmarks/
Cheers, Jens