I run the Dijkstra source-target shortest path algorithm in Neo4j (community edition) for 7 different graphs. The sizes of these graphs are as follows: 6,301 nodes - 8,846 nodes - 10,876 nodes - 22,687 nodes - 26,518 nodes - 36,682 nodes - 62,586 nodes.
For all these graphs, the result (the path) is returned in 2 ms, though the queries complete after differing amounts of time. Is it normal that this 2 ms figure is the same for all these graphs, regardless of their sizes?
The same is happening when running the Yen algorithm.
If the time provided by the Neo4j browser is inaccurate, how can I measure the execution time accurately?
Update (tracking the execution time):
Thanks in advance.
Related
I'm trying to find the execution time of GDS algorithms using the community edition of Neo4j. Is there any way to find it other than query logging, since that facility is specific to the enterprise edition?
Update:
I did as suggested. Why is the result 0 for computeMillis and preProcessingMillis?
Update 2:
The following table indicates the time in ms required to run the Yen algorithm to retrieve one path for each topology. However, the time does not depend on the graph size. Why? Is it normal to have such results?
When you are executing the mutate or the write mode of the algorithm, you can YIELD the computeMillis field, which tells you the execution time of the algorithm. Note that some algorithms, like PageRank, have more properties available to be YIELD-ed:
preProcessingMillis - Milliseconds for preprocessing the graph.
computeMillis - Milliseconds for running the algorithm.
postProcessingMillis - Milliseconds for computing the centralityDistribution.
writeMillis - Milliseconds for writing result data back.
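As a sketch, a mutate-mode Dijkstra call that surfaces those timings could look like the following, assuming a projected in-memory graph named 'myGraph' and a placeholder :Node label with an id property (adjust to your schema):

```
// 'myGraph', :Node and the id property are assumptions for illustration
MATCH (source:Node {id: 1}), (target:Node {id: 2})
CALL gds.shortestPath.dijkstra.mutate('myGraph', {
    sourceNode: source,
    targetNode: target,
    mutateRelationshipType: 'PATH'
})
YIELD preProcessingMillis, computeMillis, mutateMillis
RETURN preProcessingMillis, computeMillis, mutateMillis
```

Note that these counters have millisecond granularity, so on graphs of only tens of thousands of nodes a sub-millisecond phase can legitimately round down to 0.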
I was using Neo4j 3.1.0 enterprise edition. The main logic in my graph is: there are "IP" nodes and "User" nodes, and both have UNIQUE constraints. Each time a user logs in, I add a relationship from the IP to the User.
Here is my insert Cypher:
MERGE (i:IP {ip: "1.2.3.4"})
MERGE (u:User {username: "xxx@gmail.com"})
MERGE (i)-[l:SUCC]->(u)
SET i:ExpireNode, i.expire = {expire}
SET u:ExpireNode, u.expire = {expire}
SET l.expire = {expire}, l.login = {login}
The insert is pretty fast at the beginning. But when the number of nodes grows to millions, it becomes very slow, sometimes taking more than 1 second to insert nodes and relationships.
How can I optimize it? I was running Neo4j with a 12-core CPU and 64 GB of memory. The initial heap size is 16 GB and the page cache is 30 GB.
--------------------------------------------------------------
I tested the same Cypher in the web UI, and it took about 10 ms per command. But using the Java driver, it sometimes takes more than 1 s. Below is my Java code:
try (Transaction tx = session.beginTransaction()) {
    for (Login login : loginList) {
        Value value = login2Operation(login);
        tx.run(INSERT_COMMANDS_SUCC, value);
    }
    tx.success();
}
--------------------------------------------------------------
After some exploring, I found that the insert throughput increased significantly when running in 5 threads. But the overall speed was still too slow, so I had to increase to 100 threads - and then the latency of each single insert grew to 1 s. So I believe the problem is Neo4j's limited parallelism.
In the neo4j.conf, I added dbms.threads.worker_count=200. But it's not helping. Any ideas?
Thanks to @InverseFalcon's advice - the UNWIND operation helps a lot! You can find more details in Michael Hunger's tips and tricks.
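A batched variant of the insert above could be sketched like this, assuming the rows are passed to a single tx.run call as one {logins} list parameter (the parameter name and row fields are illustrative):

```
// One statement per batch of logins instead of one per login.
// Relies on the UNIQUE constraints on :IP(ip) and :User(username)
// mentioned above to keep the MERGEs fast.
UNWIND {logins} AS row
MERGE (i:IP {ip: row.ip})
MERGE (u:User {username: row.username})
MERGE (i)-[l:SUCC]->(u)
SET i:ExpireNode, i.expire = row.expire
SET u:ExpireNode, u.expire = row.expire
SET l.expire = row.expire, l.login = row.login
```

Batches of roughly 10k-50k rows per transaction are a common starting point; this removes most of the per-statement round-trip and transaction overhead.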
We are evaluating Neo4J for our application, testing it against a small test database with a total of around 20K nodes, 150K properties, and 100K relationships. The branching factor is ~100 relationships/node. Server and version information is below [1]. The Cypher query is:
MATCH p = ()-[r1:RATES]-(m1:Movie)-[r2:RATES]-(u1:User)-[r3:RATES]-(m2:Movie)-[r4:RATES]-()
RETURN r1.id as i_id, m1.id, r2.id, u1.id, r3.id, m2.id, r4.id as t_id;
(The first and last empty nodes aren't important to us, but I didn't see how to start with relationships.)
I killed it after a couple of hours. Maybe I'm expecting too much by hoping Neo4J would avoid combinatorial explosion. I tried tweaking some server parameters but got no further.
My main question is whether what I'm trying to do (a nine-step path query) is reasonable for Neo4J, or, for that matter, any graph database. I realize nine steps is a very deep search, and one that touches every node in the database multiple times, but unfortunately that's what our research needs to do.
Looking forward to your thoughts.
[1] System info:
The Linux server has 32 processors and 64GB of memory.
Neo4j - Graph Database Kernel (neo4j-kernel), version: 2.1.2.
java version "1.7.0_60", Java(TM) SE Runtime Environment (build 1.7.0_60-b19), Java HotSpot(TM) 64-Bit Server VM (build 24.60-b09, mixed mode)
To answer your main question, Neo4j has no problem with a variable-length query, as long as it does not result in a combinatorial explosion of the search space (exponential time complexity as a result of your branching factor).
There is however an optimization that can be done to your Cypher query.
MATCH ()-[r1:RATES]->(m1:Movie),
(m1)<-[r2:RATES]-(u1:User),
(u1)-[r3:RATES]->(m2:Movie),
(m2)<-[r4:RATES]-()
RETURN r1.id as i_id, m1.id, r2.id, u1.id, r3.id, m2.id, r4.id as t_id;
That being said, Cypher has some current limitations with these kinds of queries. We call these queries "graph global operations". When you are running a query that touches the graph globally without a specific starting point, computation as well as reads and writes to disk can cause performance bottlenecks. When returning large payloads over HTTP REST, you'll also encounter data transfer limitations within your network.
To test the difference between query response times due to network data transfer constraints, compare the previous query to the following:
MATCH ()-[r1:RATES]->(m1:Movie),
(m1)<-[r2:RATES]-(u1:User),
(u1)-[r3:RATES]->(m2:Movie),
(m2)<-[r4:RATES]-()
RETURN count(*)
The difference between the queries in response time should be significant.
So what are your options?
Option 1:
Write a Neo4j unmanaged extension in Java that runs on-heap embedded in the JVM using Neo4j's Java API. Your Cypher query can be translated imperatively into a traversal description that operates on your graph in-memory. Seeing that you have 64GB of memory, your Java heap should be configured so that Neo4j has access to 70-85% of your available memory.
You can learn more about the Neo4j Java API here: http://docs.neo4j.org/chunked/stable/server-unmanaged-extensions.html
Option 2:
Tune the performance configurations of Neo4j to run your graph in-memory and optimize your Cypher queries to limit the amount of data transferred over the network. Performance will still be sub-optimal for graph global operations.
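For the 2.1-era configuration files, such tuning is done through the memory-mapped store files and the JVM heap. A sketch with illustrative sizes for a 64 GB machine (the values are assumptions, not recommendations):

```
# conf/neo4j.properties -- map the store files into memory
neostore.nodestore.db.mapped_memory=500M
neostore.relationshipstore.db.mapped_memory=4G
neostore.propertystore.db.mapped_memory=1G
neostore.propertystore.db.strings.mapped_memory=1G
cache_type=strong

# conf/neo4j-wrapper.conf -- JVM heap for the object cache
wrapper.java.initmemory=16384
wrapper.java.maxmemory=16384
```

The general idea is to size the mapped-memory settings so the store files fit in RAM, while leaving enough heap for the object cache and query execution.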
I'm seeing very poor performance when I execute the A* algorithm provided by Neo4j. I created a Maven-based test project; you can find it here: https://github.com/angeloimm/neo4jAstarTest
Basically these are my tests:
A* from node 1 to node 2: 1416 millis
A* from node 1 to node 300000: 3428 millis
A* from node 1 to node 525440: 4128 millis
I was wondering whether these times are the best I can get, or if I can improve them.
In the configuration file you can see that I configured Neo4j with these settings:
nodestore_mapped_memory_size=250M
relationshipstore_mapped_memory_size=3G
nodestore_propertystore_mapped_memory_size=250M
strings_mapped_memory_size=500M
arrays_mapped_memory_size=50
cache_type=strong
The neo4j version is 2.0.3
Any tips would be really appreciated.
Thank you
Angelo
The reason for the slowness here is that the two nodes in question are not connected to each other: the end node is located in a distinct subgraph from the start node.
Maybe consider a different strategy for that kind of scenario: in a first run, check whether there is any path at all between the two nodes, and only if that is true, apply A* to find the shortest one.
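A cheap pre-check could be sketched as follows (2.0-era syntax; {startId} and {endId} are placeholder parameters). If the MATCH returns no row, the nodes are in different components and the A* call can be skipped entirely:

```
// Returns one row iff the nodes are connected at all
START a = node({startId}), b = node({endId})
MATCH p = shortestPath((a)-[*]-(b))
RETURN length(p) AS hops
```

Unlike a plain variable-length match, shortestPath uses a bidirectional breadth-first search, so this existence check stays cheap even on larger graphs.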
I have a graph with ~89K nodes and ~1.2M relationships, and am trying to get the transitive closure of a single node via the following Cypher query:
start n=NODE(<id of a single node of interest>)
match (n)-[*1..]->(m)
where has(m.name)
return distinct m.name
Unfortunately, this query goes away and doesn't seem to come back (although to be fair I've only given it about an hour of execution time at this point).
Any suggestions on ways to optimise what I've got here, or better ways to achieve the requirement?
Notes:
Neo4J v2.0.0 (installed via Homebrew).
Mac OSX 10.8.5
Oracle Java 1.7.0_51
8GB physical RAM (neo4j JVM assigned whatever the default is)
Database is hosted on an SSD volume.
Query is submitted via the admin web UI's "Data browser".
"name" is an auto-indexed field.
CPU usage is fairly low - averaging around 20% of 8 cores.
I haven't gotten into the weeds of profiling the Neo4J server yet - my first attempt locked up VisualVM.
That's probably a combinatorial explosion of paths; care to try this?
start n=NODE(<id of a single node of interest>),m=node:node_auto_index("name:*")
match shortestPath((n)-[*]->(m))
return m.name
without shortest-path it would look like that, but as you are only interested in the reachable nodes from n the above should be good enough.
start n=NODE(<id of a single node of interest>),m=node:node_auto_index("name:*")
match (n)-[*]->(m)
return distinct m.name
Try Guery - https://code.google.com/p/gueryframework/ - this is a standalone library, but it has a Neo4j adapter. I.e., you will have to rewrite your queries in the Guery format.
Better support for transitive closure was one of the main reasons for developing Guery. We mainly use it in software analysis tools where we need reachability / pattern analysis (e.g., the antipattern queries in http://xplrarc.massey.ac.nz/ are computed using Guery).
There is a brief discussion about this in the neo4j google group:
https://groups.google.com/forum/#!searchin/neo4j/jens/neo4j/n69ksEJxDtQ/29DNKyWKur4J
and an (older, not maintained) project with some benchmarking code:
https://code.google.com/p/graph-query-benchmarks/
Cheers, Jens