I have a graph with ~89K nodes and ~1.2M relationships, and am trying to get the transitive closure of a single node via the following Cypher query:
start n=NODE(<id of a single node of interest>)
match (n)-[*1..]->(m)
where has(m.name)
return distinct m.name
Unfortunately, this query goes away and doesn't seem to come back (although to be fair I've only given it about an hour of execution time at this point).
Any suggestions on ways to optimise what I've got here, or better ways to achieve the requirement?
Notes:
Neo4J v2.0.0 (installed via Homebrew).
Mac OSX 10.8.5
Oracle Java 1.7.0_51
8GB physical RAM (neo4j JVM assigned whatever the default is)
Database is hosted on an SSD volume.
Query is submitted via the admin web UI's "Data browser".
"name" is an auto-indexed field.
CPU usage is fairly low - averaging around 20% of 8 cores.
I haven't gotten into the weeds of profiling the Neo4J server yet - my first attempt locked up VisualVM.
That's probably a combinatorial explosion of paths. Care to try this?
start n=NODE(<id of a single node of interest>),m=node:node_auto_index("name:*")
match shortestPath((n)-[*]->(m))
return m.name
Without shortestPath it would look like the following, but as you are only interested in the nodes reachable from n, the above should be good enough:
start n=NODE(<id of a single node of interest>),m=node:node_auto_index("name:*")
match (n)-[*]->(m)
return distinct m.name
Try guery - https://code.google.com/p/gueryframework/ - this is a standalone library, but it has a Neo4j adapter. That is, you will have to rewrite your queries in the guery format.
Better support for transitive closure was one of the main reasons for developing guery. We mainly use it in software analysis tools where we need reachability / pattern analysis (e.g., the antipattern queries in http://xplrarc.massey.ac.nz/ are computed using guery).
There is a brief discussion about this in the neo4j google group:
https://groups.google.com/forum/#!searchin/neo4j/jens/neo4j/n69ksEJxDtQ/29DNKyWKur4J
and an (older, not maintained) project with some benchmarking code:
https://code.google.com/p/graph-query-benchmarks/
Cheers, Jens
I recently upgraded my Neo4j to 3.1.3, and alongside that, got the most recent APOC plugin (3.1.3.6).
I had a bit of code that worked fine, and could create ~3 million relationships in about a minute and a half wall time. But now, it's been running for over 8 hours and shows no sign of stopping...
Because the code used to run without any problems, I'm guessing something must have changed between versions that has led to my code being borked.
Is it rock_n_roll that should be changed (maybe to apoc.periodic.commit with positional arguments or something)? Thanks for any insight.
Here's what I'm running:
CALL apoc.periodic.rock_n_roll(
"MATCH (c:ChessPlayer),(r:Record) WHERE c.ChessPlayer_ID = r.ChessPlayer RETURN c,r",
"CYPHER planner=rule WITH {c} AS c, {r} AS r CREATE (c)-[:HAD_RECORD]->(r)",
200000)
My understanding is that this call queries the Cartesian product of ChessPlayers and Records, filters it row by row, and then does the batch update on the final results. That single opening transaction eats a lot of memory, and I think it is what's killing you. So if you can break it up so that each transaction touches as few nodes as possible, it should perform massively better (especially if r.ChessPlayer is indexed, since then you don't need to load all of them):
CALL apoc.periodic.rock_n_roll(
"MATCH (c:ChessPlayer) WHERE NOT EXISTS((c)-[:HAD_RECORD]->()) RETURN c",
"MATCH (r:Record) WHERE c.ChessPlayer_ID = r.ChessPlayer WITH c,r CREATE UNIQUE (c)-[:HAD_RECORD]->(r)",
100000)
apoc.periodic.commit() would work on a similar principle. The smaller you can make each transaction (the fewer nodes touched), the faster the batch will run.
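For reference, a rough apoc.periodic.commit sketch along the same lines (the inner statement has to consume a {limit} parameter and return a count so the procedure knows when to stop; the 10000 batch size is just a guess):
CALL apoc.periodic.commit(
  "MATCH (c:ChessPlayer) WHERE NOT EXISTS((c)-[:HAD_RECORD]->())
   WITH c LIMIT {limit}
   MATCH (r:Record) WHERE c.ChessPlayer_ID = r.ChessPlayer
   CREATE (c)-[:HAD_RECORD]->(r)
   RETURN count(*)",
  {limit: 10000})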
In a general sense, is there a best practice to use when attempting to estimate how long the setting of relationships takes in Neo4j?
For example, I used the data import tool successfully, and here's what I've got in my 2.24GB database:
IMPORT DONE in 3m 8s 791ms. Imported:
7432663 nodes
0 relationships
119743432 properties
In preparation for setting relationships, I set some indices:
CREATE INDEX ON :ChessPlayer(player_id);
CREATE INDEX ON :Matches(player_id);
Then I let it rip:
MATCH (p:Player),(m:Matches)
WHERE p.player_id = m.player_id
CREATE (p)-[r:HAD_MATCH]->(m)
Then I started to realize that I have no idea how to even estimate how long setting these relationships might take. Is there a 'back of the envelope' calculation for determining at least a ballpark figure for this kind of thing?
I understand that everyone's situation is different on all levels, including software, hardware, and desired schema. But any discussion would no doubt be useful and would deepen my understanding (and that of anyone else who reads this).
PS: FWIW, I'm running Ubuntu 14.04 with 16GB RAM and an Intel Core i7-3630QM CPU @ 2.40GHz
The problem here is that you don't take transaction sizes into account. In your example all :HAD_MATCH relationships are created in one single large transaction. A transaction is built up in memory first and then flushed to disc. If the transaction is too large to fit in your heap, you might see massive performance degradation due to garbage collection, or even OutOfMemoryExceptions.
Typically you want to limit transaction sizes to e.g. 10k - 100k atomic operations.
Probably the easiest way to do transaction batching in this case is to use the rock_n_roll procedure from neo4j-apoc. It uses one Cypher statement to provide the data to be worked on, and a second one that runs for each of the results of the first, in batched mode. Note that apoc requires Neo4j 3.x:
CALL apoc.periodic.rock_n_roll(
"MATCH (p:Player),(m:Matches) WHERE p.player_id = m.player_id RETURN p,m",
"WITH {p} AS p, {m} AS m CREATE (p)-[:HAD_MATCH]->(m)",
20000)
There was a bug in 3.0.0 and 3.0.1 causing this to perform rather badly, so the above is for Neo4j >= 3.0.2.
If you are on 3.0.0 / 3.0.1, use this as a workaround:
CALL apoc.periodic.rock_n_roll(
"MATCH (p:Player),(m:Matches) WHERE p.player_id = m.player_id RETURN p,m",
"CYPHER planner=rule WITH {p} AS p, {m} AS m CREATE (p)-[:HAD_MATCH]->(m)",
20000)
I have imported nodes.tsv (350MB, 18M rows, 3 cols) and rels.tsv (5GB, 150M rows, 2 cols) using the batch-importer script.
These are my batch.properties file entries:
• neostore.nodestore.db.mapped_memory=250M
• neostore.relationshipstore.db.mapped_memory=1000M
• neostore.relationshipgroupstore.db.mapped_memory=10M
• neostore.propertystore.db.mapped_memory=500M
• neostore.propertystore.db.strings.mapped_memory=500M
• neostore.propertystore.db.arrays.mapped_memory=215M
• dump_configuration=true
I have turned on store upgrade and auto indexing in neo4j.properties as follows:
• allow_store_upgrade=true
• node_auto_indexing=true
• node_keys_indexable=name,title
• relationship_auto_indexing=true
• relationship_keys_indexable=sent_date,has_read
I'm using Neo4j 2.2 on a 64-bit Windows server with a 1TB SSD and 256GB of RAM.
What configuration should I use for the batch importer and the Neo4j server to get maximum query and data-loading performance?
This query, for example, is timing out in the browser:
MATCH ()-[r:BELONGS_TO]->() RETURN r
If you have that much RAM:
Your memory-mapping config is wrong for 2.2.
Use only this setting:
dbms.pagecache.memory=20G
and then give Neo4j a 24G heap in neo4j-wrapper.conf.
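For reference, the heap in neo4j-wrapper.conf is set like this (values are in MB; 24000 just matches the 24G suggested above):
wrapper.java.initmemory=24000
wrapper.java.maxmemory=24000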
Use Neo4j Enterprise, which scales much better.
Disable the auto-indexes; they are not used for what you're doing.
Your query doesn't make sense for any use-case:
MATCH ()-[r:BELONGS_TO]->() RETURN r
Sub-second graph queries always start at a set of concrete start-points (retrieved with an index lookup) and then traverse out from those starting points.
Global scan queries like yours will just pull all the data into memory and work over it inefficiently.
Especially when you return that much data, you can't expect sub-second performance; the data volume alone will kill it.
So figure out a label + property-values that you want to start from and then write the query that traverses out from those start points.
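For example (the label and property here are hypothetical, since your model isn't shown):
MATCH (u:User {name: 'alice'})-[:BELONGS_TO]->(g)
RETURN g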
If you want sub-second answers for something like this, you have to go down to the Java API and aggregate there, e.g. with a server extension:
// inside a server extension; in 2.x this must run inside a transaction
int counter = 0;
for (Relationship r : GlobalGraphOperations.at(db).getAllRelationships()) {
    if (r.isType(Types.BELONGS_TO)) counter++;
}
return counter;
With millions of nodes that query might be slow no matter what you do, though with the amount of memory you have available maybe it wouldn't be a big deal. This is a good guide for calculating memory settings:
http://neo4j.com/developer/guide-performance-tuning/
While you're playing, I would set the query timeout on the server so that your queries can't jam up the server and force you to restart it:
http://neo4j.com/docs/stable/server-configuration.html
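If I remember right, the relevant setting goes in conf/neo4j-server.properties (value in milliseconds; 20 seconds here is arbitrary):
# abort queries that run longer than 20 seconds
org.neo4j.server.webserver.limit.executiontime=20000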
You might try starting with LIMIT clauses on your queries so that you can get an idea of how the performance degrades as the LIMIT increases.
If you can find a way to limit your query based on node selects, that would also be helpful, especially if you can do it by a label or by a label/property combination (which you can index).
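For example, against your BELONGS_TO query (the label and property in the second variant are hypothetical):
MATCH ()-[r:BELONGS_TO]->() RETURN r LIMIT 1000
MATCH (p:Person {name: 'alice'})-[r:BELONGS_TO]->(g) RETURN r LIMIT 1000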
Lastly, I would try using EXPLAIN in the web console to get an idea for how your queries will be executed:
http://neo4j.com/docs/2.2.0/how-do-i-profile-a-query.html
Also you can use PROFILE, though that actually runs the query, so you'll need to be a bit more careful there. You can use LIMIT here too to play around and see how things work.
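For example, prefixing the same query (the LIMIT value is arbitrary):
EXPLAIN MATCH ()-[r:BELONGS_TO]->() RETURN r LIMIT 1000
PROFILE MATCH ()-[r:BELONGS_TO]->() RETURN r LIMIT 1000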
We are evaluating Neo4J for our application, testing it against a small test database with a total of around 20K nodes, 150K properties, and 100K relationships. The branching factor is ~100 relationships/node. Server and version information is below [1]. The Cypher query is:
MATCH p = ()-[r1:RATES]-(m1:Movie)-[r2:RATES]-(u1:User)-[r3:RATES]-(m2:Movie)-[r4:RATES]-()
RETURN r1.id as i_id, m1.id, r2.id, u1.id, r3.id, m2.id, r4.id as t_id;
(The first and last empty nodes aren't important to us, but I didn't see how to start with relationships.)
I killed it after a couple of hours. Maybe I'm expecting too much by hoping Neo4J would avoid combinatorial explosion. I tried tweaking some server parameters but got no further.
My main question is whether what I'm trying to do (a nine-step path query) is reasonable for Neo4J, or, for that matter, any graph database. I realize nine steps is a very deep search, and one that touches every node in the database multiple times, but unfortunately that's what our research needs to do.
Looking forward to your thoughts.
[1] System info:
The Linux server has 32 processors and 64GB of memory.
Neo4j - Graph Database Kernel (neo4j-kernel), version: 2.1.2.
java version "1.7.0_60", Java(TM) SE Runtime Environment (build 1.7.0_60-b19), Java HotSpot(TM) 64-Bit Server VM (build 24.60-b09, mixed mode)
To answer your main question: Neo4j has no problem with a variable-length query as long as it does not result in a combinatorial explosion of the search space (exponential time complexity as a result of your branching factor).
There is however an optimization that can be done to your Cypher query.
MATCH ()-[r1:RATES]->(m1:Movie),
(m1)<-[r2:RATES]-(u1:User),
(u1)-[r3:RATES]->(m2:Movie),
(m2)<-[r4:RATES]-()
RETURN r1.id as i_id, m1.id, r2.id, u1.id, r3.id, m2.id, r4.id as t_id;
That being said, Cypher has some current limitations with these kinds of queries. We call them "graph global operations". When you run a query that touches the graph globally without a specific starting point, computation as well as reads and writes to disc can become performance bottlenecks. And when returning large payloads over HTTP REST, you'll run into data-transfer limitations within your network.
To test the difference between query response times due to network data transfer constraints, compare the previous query to the following:
MATCH ()-[r1:RATES]->(m1:Movie),
(m1)<-[r2:RATES]-(u1:User),
(u1)-[r3:RATES]->(m2:Movie),
(m2)<-[r4:RATES]-()
RETURN count(*)
The difference in response time between the two queries should be significant.
So what are your options?
Option 1:
Write a Neo4j unmanaged extension in Java that runs on-heap embedded in the JVM using Neo4j's Java API. Your Cypher query can be translated imperatively into a traversal description that operates on your graph in-memory. Seeing that you have 64GB of memory, your Java heap should be configured so that Neo4j has access to 70-85% of your available memory.
You can learn more about the Neo4j Java API here: http://docs.neo4j.org/chunked/stable/server-unmanaged-extensions.html
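To make that concrete, here is a rough sketch of such an unmanaged extension (the class name, URL path, and the way the RATES pattern is unrolled into nested loops are my own assumptions, not code from the docs). It aggregates a count on-heap so only the aggregate crosses the wire:
package example;

import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.core.Context;
import javax.ws.rs.core.Response;

import org.neo4j.graphdb.DynamicLabel;
import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Label;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.Transaction;
import org.neo4j.tooling.GlobalGraphOperations;

@Path("/rates")
public class RatesPathResource {

    private static final RelationshipType RATES = DynamicRelationshipType.withName("RATES");
    private static final Label MOVIE = DynamicLabel.label("Movie");
    private static final Label USER = DynamicLabel.label("User");

    private final GraphDatabaseService db;

    public RatesPathResource(@Context GraphDatabaseService db) {
        this.db = db;
    }

    @GET
    @Path("/pathcount")
    public Response pathCount() {
        long paths = 0;
        try (Transaction tx = db.beginTx()) {
            // ()-[r1:RATES]-(m1:Movie)-[r2:RATES]-(u1:User)-[r3:RATES]-(m2:Movie)-[r4:RATES]-()
            // unrolled into nested loops; the distinctness checks mirror Cypher's
            // relationship uniqueness within a single MATCH pattern.
            for (Node u1 : GlobalGraphOperations.at(db).getAllNodesWithLabel(USER)) {
                for (Relationship r2 : u1.getRelationships(RATES)) {
                    Node m1 = r2.getOtherNode(u1);
                    if (!m1.hasLabel(MOVIE)) continue;
                    for (Relationship r3 : u1.getRelationships(RATES)) {
                        if (r3.equals(r2)) continue;
                        Node m2 = r3.getOtherNode(u1);
                        if (!m2.hasLabel(MOVIE)) continue;
                        for (Relationship r1 : m1.getRelationships(RATES)) {
                            if (r1.equals(r2) || r1.equals(r3)) continue;
                            for (Relationship r4 : m2.getRelationships(RATES)) {
                                if (r4.equals(r1) || r4.equals(r2) || r4.equals(r3)) continue;
                                paths++;
                            }
                        }
                    }
                }
            }
            tx.success();
        }
        return Response.ok(Long.toString(paths)).build();
    }
}
You would register it via org.neo4j.server.thirdparty_jaxrs_classes in conf/neo4j-server.properties and call it over HTTP, so only the aggregate leaves the server.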
Option 2:
Tune the performance configurations of Neo4j to run your graph in-memory and optimize your Cypher queries to limit the amount of data transferred over the network. Performance will still be sub-optimal for graph global operations.
I have written a variety of queries using cypher that take no less than 200ms per query. They're very straightforward, so I'm having trouble identifying where the bottleneck is.
Simple Match with Parameters, 2200ms:
Simple Distinct Match with Parameters, 200ms:
Pathing, 2500ms:
At first I thought the issue was a lack of resources, because I was running Neo4j and my application on the same box. While the performance monitor indicated that CPU and memory were largely free and available, I moved the Neo4j server to another local box and observed similar latency. Both servers are workstations with fairly new Xeon processors, 12GB of memory, and SSDs for data storage. All of the above leads me to believe that the latency isn't due to my hardware. OS is Windows 7.
The graph has less than 200 nodes and less than 200 relationships.
I've attached some queries that I send to neo4j along with the configuration for the server, database, and JVM. No plugins or extensions are loaded.
Pastebin Links:
Database Configuration
Server Configuration
JVM Configuration
[Expanding a bit on a comment I made earlier.]
@TFerrell: Your comments state that "all nodes have labels", and that you tried applying indexes. However, it is not clear whether you actually specified the labels in your slow Cypher queries. I noticed from your original question that neither of your slower queries actually specified a node label (which presumably should have been "Project").
If your Cypher query does not specify the label for a node, then the DB engine has to test every node, and it also cannot apply an index.
So, please try specifying the correct node label(s) in your slow queries.
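For example (the "Project" label is the one presumed above; Id is an assumed property name):
match (p {Id:{Id}}) return p          // no label: every node is scanned, no index can help
match (p:Project {Id:{Id}}) return p  // with the label, an index on :Project(Id) can be used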
Is this the first run or a subsequent run of these queries?
You probably don't have a label on your nodes, and no index or unique constraint.
So Neo4j has to scan the whole store for your node, pulling everything into memory, loading the properties and checking them.
Try this:
Run the following until count returns 0:
match (n) where not n:Entity set n:Entity return count(*);
add the constraint
create constraint on (e:Entity) assert e.Id is unique;
run your query again:
match (n:Entity {Id:{Id}}) return n
etc.
It seems there is something wrong with the automatic memory-mapping calculation when you are on Windows (where memory mapping is done on the heap).
I just looked at your messages.log and added up some numbers; it seems the mmio alone is enough to fill your Java heap (old gen), leaving no room for the database, caches, etc.
Please try to amend that by setting the mmio config in your conf/neo4j.properties to more sensible values than the auto-calculation.
For your small store, just uncommenting the values starting with #neostore. (i.e. removing the #) should work fine.
Otherwise use something like this (fitting a 3GB heap) for a larger graph (2M nodes, 10M rels, 20M props, 10M long strings):
neostore.nodestore.db.mapped_memory=25M
neostore.relationshipstore.db.mapped_memory=250M
neostore.propertystore.db.mapped_memory=250M
neostore.propertystore.db.strings.mapped_memory=250M
neostore.propertystore.db.arrays.mapped_memory=0M
Here are the added numbers:
auto mmio: 134217728 + 134217728 + 536870912 + 536870912 + 1073741824 = 2.3GB
stores sizes: 1073920 + 1073664 + 3221698 + 3221460 + 1073786 = 9MB
JVM max: 3.11 RAM : 13.98 SWAP: 27.97 GB
max heaps: Eden: 1.16, oldgen: 2.33
taken from:
neostore.propertystore.db.strings] brickCount=8 brickSize=134144b mappedMem=134217728b (storeSize=1073920b)
neostore.propertystore.db.arrays] brickCount=8 brickSize=134144b mappedMem=134217728b (storeSize=1073664b)
neostore.propertystore.db] brickCount=6 brickSize=536854b mappedMem=536870912b (storeSize=3221698b)
neostore.relationshipstore.db] brickCount=6 brickSize=536844b mappedMem=536870912b (storeSize=3221460b)
neostore.nodestore.db] brickCount=1 brickSize=1073730b mappedMem=1073741824b (storeSize=1073786b)