I have written a variety of Cypher queries, each of which takes no less than 200ms. They're very straightforward, so I'm having trouble identifying where the bottleneck is.
Simple Match with Parameters, 2200ms:
Simple Distinct Match with Parameters, 200ms:
Pathing, 2500ms:
At first I thought the issue was a lack of resources, because I was running Neo4j and my application on the same box. Although the performance monitor indicated that CPU and memory were largely free and available, I moved the Neo4j server to another local box and observed similar latency. Both servers are workstations with fairly new Xeon processors, 12GB of memory, and SSDs for data storage. All of the above leads me to believe that the latency isn't due to my hardware. The OS is Windows 7.
The graph has less than 200 nodes and less than 200 relationships.
I've attached some queries that I send to neo4j along with the configuration for the server, database, and JVM. No plugins or extensions are loaded.
Pastebin Links:
Database Configuration
Server Configuration
JVM Configuration
[Expanding a bit on a comment I made earlier.]
@TFerrell: Your comments state that "all nodes have labels" and that you tried applying indexes. However, it is not clear whether you actually specified the labels in your slow Cypher queries. I noticed from your original question that neither of your slower queries actually specified a node label (which presumably should have been "Project").
If your Cypher query does not specify the label for a node, then the DB engine has to test every node, and it also cannot apply an index.
So, please try specifying the correct node label(s) in your slow queries.
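For example, assuming the label really is Project and the lookup property is Id (adjust both to your actual schema), a labeled, parameterized lookup would look like this:
MATCH (p:Project {Id: {Id}})   // can use an index/constraint on :Project(Id), if one exists
RETURN p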
Is that the first run or a subsequent run of these queries?
You probably don't have a label on your nodes, and no index or unique constraint either.
So Neo4j has to scan the whole store for your node, pulling everything into memory, loading the properties, and checking each one.
Try this:
Run this until the count returns 0:
match (n) where not n:Entity set n:Entity return count(*);
Then add the constraint:
create constraint on (e:Entity) assert e.Id is unique;
Then run your query again:
match (n:Entity {Id:{Id}}) return n
etc.
It seems there is something wrong with the automatic memory mapping calculation when you are on Windows (where memory mapping is done on the heap).
I just looked at your messages.log and added up some numbers; it seems the mmio alone is enough to fill your Java heap space (old gen), leaving no room for the database, caches, etc.
Please try to fix that by setting the mmio config in your conf/neo4j.properties to more sensible values than the auto-calculation provides.
For your small store, just uncommenting the values starting with #neostore. (i.e., removing the #) should work fine.
Otherwise, something like this would fit a 3GB heap for a larger graph (2M nodes, 10M rels, 20M props, 10M long strings):
neostore.nodestore.db.mapped_memory=25M
neostore.relationshipstore.db.mapped_memory=250M
neostore.propertystore.db.mapped_memory=250M
neostore.propertystore.db.strings.mapped_memory=250M
neostore.propertystore.db.arrays.mapped_memory=0M
Here are the numbers, added up:
auto mmio: 134217728 + 134217728 + 536870912 + 536870912 + 1073741824 = ~2.3GB
store sizes: 1073920 + 1073664 + 3221698 + 3221460 + 1073786 = ~9MB
JVM max: 3.11GB, RAM: 13.98GB, swap: 27.97GB
max heaps: Eden: 1.16GB, old gen: 2.33GB
taken from:
neostore.propertystore.db.strings] brickCount=8 brickSize=134144b mappedMem=134217728b (storeSize=1073920b)
neostore.propertystore.db.arrays] brickCount=8 brickSize=134144b mappedMem=134217728b (storeSize=1073664b)
neostore.propertystore.db] brickCount=6 brickSize=536854b mappedMem=536870912b (storeSize=3221698b)
neostore.relationshipstore.db] brickCount=6 brickSize=536844b mappedMem=536870912b (storeSize=3221460b)
neostore.nodestore.db] brickCount=1 brickSize=1073730b mappedMem=1073741824b (storeSize=1073786b)
In a general sense, is there a best practice for estimating how long setting relationships will take in Neo4j?
For example, I used the data import tool successfully, and here's what I've got in my 2.24GB database:
IMPORT DONE in 3m 8s 791ms. Imported:
7432663 nodes
0 relationships
119743432 properties
In preparation for setting relationships, I set some indices:
CREATE INDEX ON :ChessPlayer(player_id);
CREATE INDEX ON :Matches(player_id);
Then I let it rip:
MATCH (p:Player),(m:Matches)
WHERE p.player_id = m.player_id
CREATE (p)-[r:HAD_MATCH]->(m)
Then I started to realize that I have no idea how to even estimate how long setting these relationships might take. Is there a 'back of the envelope' calculation for determining at least a ballpark figure for this kind of thing?
I understand that everyone's situation is different on all levels, including software, hardware, and desired schema. But any discussion would no doubt be useful and would deepen my understanding (and that of anyone else who reads this).
PS: FWIW, I'm running Ubuntu 14.04 with 16GB RAM and an Intel Core i7-3630QM CPU @ 2.40GHz.
The problem here is that you don't take transaction sizes into account. In your example, all :HAD_MATCH relationships are created in one single large transaction. A transaction is built up in memory first and then gets flushed to disk. If the transaction is too large to fit in your heap, you might see massive performance degradation due to garbage collection, or even OutOfMemoryExceptions.
Typically you want to limit transaction sizes to e.g. 10k - 100k atomic operations.
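If you'd rather stay in plain Cypher (instead of the APOC approach described next), one rough sketch is to repeatedly create a bounded batch of the missing relationships until the query reports 0 created. The NOT pattern predicate and the LIMIT value here are illustrative, not from the original question:
// Re-run until the returned count is 0; 50000 is an arbitrary batch size.
MATCH (p:Player), (m:Matches)
WHERE p.player_id = m.player_id
  AND NOT (p)-[:HAD_MATCH]->(m)
WITH p, m LIMIT 50000
CREATE (p)-[:HAD_MATCH]->(m)
RETURN count(*);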
Probably the easiest way to do transaction batching in this case is to use the rock_n_roll procedure from neo4j-apoc. It uses one Cypher statement to provide the data to be worked on, and a second one that runs for each result of the first, in batched mode. Note that APOC requires Neo4j 3.x:
CALL apoc.periodic.rock_n_roll(
"MATCH (p:Player),(m:Matches) WHERE p.player_id = m.player_id RETURN p,m",
"WITH {p} AS p, {m} AS m CREATE (p)-[:HAD_MATCH]->(m)",
20000)
There was a bug in 3.0.0 and 3.0.1 that caused this to perform rather badly, so the above is for Neo4j >= 3.0.2.
If you are on 3.0.0 / 3.0.1, use this as a workaround:
CALL apoc.periodic.rock_n_roll(
"MATCH (p:Player),(m:Matches) WHERE p.player_id = m.player_id RETURN p,m",
"CYPHER planner=rule WITH {p} AS p, {m} AS m CREATE (p)-[:HAD_MATCH]->(m)",
20000)
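For what it's worth, on more recent APOC releases rock_n_roll has been superseded by apoc.periodic.iterate; assuming that procedure is available in your APOC version, a roughly equivalent call would be:
CALL apoc.periodic.iterate(
  "MATCH (p:Player),(m:Matches) WHERE p.player_id = m.player_id RETURN p,m",
  "CREATE (p)-[:HAD_MATCH]->(m)",
  {batchSize: 20000});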
I have imported nodes.tsv (350MB, 18M rows, 3 cols) and rels.tsv (5GB, 150M rows, 2 cols) using the batch-importer script.
These are my batch.properties file entries:
• neostore.nodestore.db.mapped_memory=250M
• neostore.relationshipstore.db.mapped_memory=1000M
• neostore.relationshipgroupstore.db.mapped_memory=10M
• neostore.propertystore.db.mapped_memory=500M
• neostore.propertystore.db.strings.mapped_memory=500M
• neostore.propertystore.db.arrays.mapped_memory=215M
• dump_configuration=true
I have turned on auto store upgrade and auto indexing in neo4j.properties as follows:
• allow_store_upgrade=true
• node_auto_indexing=true
• node_keys_indexable=name,title
• relationship_auto_indexing=true
• relationship_keys_indexable=sent_date,has_read
I'm using Neo4j 2.2 on a 64-bit Windows server that has a 1TB SSD and 256GB of RAM.
What configuration should I use for the batch importer and the Neo4j server to get maximum query and data-loading performance?
This query, for example, is timing out in the browser:
MATCH ()-[r:BELONGS_TO]->() RETURN r
If you have that much RAM:
Your memory mapping config is wrong for 2.2.
Use only this setting:
dbms.pagecache.memory=20G
and then give Neo4j a 24G heap in neo4j-wrapper.conf.
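As a minimal sketch of that heap change, assuming the stock 2.2 neo4j-wrapper.conf layout (values are in MB, and 24576 MB is roughly 24G; adjust to taste):
# conf/neo4j-wrapper.conf
wrapper.java.initmemory=24576
wrapper.java.maxmemory=24576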
Use Neo4j Enterprise, which scales much better.
Disable the auto-indexes; they are not used for what you're doing.
Your query doesn't make sense for any use-case:
MATCH ()-[r:BELONGS_TO]->() RETURN r
Sub-second graph queries always start at a set of concrete start-points (retrieved with an index lookup) and then traverse out from those starting points.
Global scan queries like yours will just pull all data into memory and inefficiently work over it.
Especially if you return this much data, you can't assume sub-second performance; the data volume alone will kill it.
So figure out a label + property-values that you want to start from and then write the query that traverses out from those start points.
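As an illustration only (the label, property, and value here are made up, since the question doesn't say what the nodes represent), an anchored query would look something like:
// Start from one indexed node, then traverse out from it.
MATCH (a:Account {name: 'acme'})-[r:BELONGS_TO]->(target)
RETURN target;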
If you want sub-second performance for something like this, you have to go down to the Java API and aggregate there, e.g. with a server extension:
// Count BELONGS_TO relationships by streaming over all relationships in the store.
int counter = 0;
for (Relationship r : GlobalGraphOperations.at(db).getAllRelationships()) {
    if (r.isType(Types.BELONGS_TO)) counter++;
}
return counter;
With millions of nodes, that query might be slow no matter what you do, though with the amount of memory you have available it might not be a big deal. This is a good guide for calculating memory settings:
http://neo4j.com/developer/guide-performance-tuning/
While you're experimenting, I would set the query timeout on the server so that your queries can't jam up the server and force you to restart it:
http://neo4j.com/docs/stable/server-configuration.html
You might try starting with LIMIT clauses on your queries so that you can get an idea for how the performance degrades as the LIMIT increases.
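For example, starting from the original query:
MATCH ()-[r:BELONGS_TO]->()
RETURN r
LIMIT 1000;
then raise the LIMIT (10000, 100000, and so on) and watch how the runtime grows.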
If you can find a way to limit your query based on node selection, that would also be helpful, especially if you can do it by a label or by a label/property combination (which you can index).
Lastly, I would try using EXPLAIN in the web console to get an idea of how your queries will be executed:
http://neo4j.com/docs/2.2.0/how-do-i-profile-a-query.html
You can also use PROFILE, though that actually runs the query, so you'll need to be a bit more careful there. You can probably use LIMIT here too, to play around and see how things work.
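For instance, prefixing the query shows the plan without pulling any data:
EXPLAIN MATCH ()-[r:BELONGS_TO]->() RETURN r;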
I'm trying to query ~8 million nodes in a Neo4j database. I can do queries that hit the index for exact matches easily enough, but is there a performant way to do aggregations?
MATCH (r:Resident) RETURN r.forename, count(r.forename) ORDER BY count(r.forename)
This query just sits there until I eventually restart my server. I've read the performance guides, and I'm watching vm_stat; it seems to be quickly running out of free pages. I've tried tuning the memory / JVM heap settings to various things, but I'm not sure I completely know what I'm doing ;) I've got an 8GB MacBook Air with an SSD drive, in case that's helpful for suggesting settings. Also, here are my stats on my DB from webadmin:
10,236,226 nodes
56,280,161 properties
10,190,430 relationships
2 relationship types
14,535 MB database disk usage
I inserted 8M nodes with just one property and got this query down to ~20s without changing the default settings (after warming up the cache; the first run took 90s), which is comparable to other databases like Postgres (which I also tested).
Some things you could try (a rough config sketch follows this list):
raise the mmio settings for the appropriate store files (as per the file sizes in data/graph.db/) in conf/neo4j.properties (at the top)
increase the node cache size in neo4j.properties
increase the heap init/max in neo4j-wrapper.conf
make sure you have enough RAM left over
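Here is a minimal sketch of what those changes might look like on an 8GB machine. The numbers are purely illustrative; match the mmio values to the actual file sizes in data/graph.db/, within what your RAM allows:
# conf/neo4j.properties (at the top) -- illustrative values only
neostore.nodestore.db.mapped_memory=150M
neostore.relationshipstore.db.mapped_memory=350M
neostore.propertystore.db.mapped_memory=1000M
neostore.propertystore.db.strings.mapped_memory=500M
# conf/neo4j-wrapper.conf (values in MB) -- leave RAM for the OS and mmio
wrapper.java.initmemory=4096
wrapper.java.maxmemory=4096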
I'm using Cypher's LOAD CSV syntax in Neo4j 2.1.2. So far it's been a huge improvement over the more manual ETL process required in previous versions. But I'm running into some behavior in one case that's not what I'd expect, and I wonder if I'm missing something.
The Cypher query being used is this:
USING PERIODIC COMMIT 500
LOAD CSV FROM 'file:///Users/James/Desktop/import/dependency_sets_short.csv' AS row
MATCH (s:Sense {uid: toInt(row[4])})
MERGE (ds:DependencySet {label: row[2]}) ON CREATE SET ds.optional=(row[3] = 't')
CREATE (s)-[:has]->(ds)
Here are a couple of lines from the CSV:
227303,1,TO-PURPOSE-NOMINAL,t,73830
334471,1,AT-LOCATION,t,92048
334470,1,AT-TIME,t,92048
334469,1,ON-LOCATION,t,92048
227302,1,TO-PURPOSE-INFINITIVE,t,73830
116008,1,TO-LOCATION,t,68204
116007,1,IN-LOCATION,t,68204
227301,1,TO-LOCATION,t,73830
334468,1,ON-DATE,t,92048
116006,1,AT-LOCATION,t,68204
334467,1,WITH-ASSOCIATE,t,92048
Basically, I'm matching a Sense node (previously imported) based on its ID value, which is the fifth column. Then I'm doing a merge to either get a DependencySet node if it exists, or create it. Finally, I'm creating a has edge between the Sense node and the DependencySet node. So far so good; this all works as expected. What's confusing is the performance as the size of the CSV grows.
CSV Lines Time (msec)
------------------------------
500 480
1000 717
2000 1110
5000 1521
10000 2111
50000 4794
100000 5907
200000 12302
300000 35494
400000 Java heap space error
My expectation was that growth would be more or less linear, particularly as I'm committing every 500 lines as recommended by the manual, but it's actually closer to polynomial.
What's worse is that somewhere between 300k and 400k rows, it runs into a Java heap space error. Based on the trend from previous imports, I'd expect the import of 400k to take a bit over a minute. Instead, it churns away for about 5-7 minutes before running into the heap space error. It seems like I could split this file into 300,000-line chunks, but isn't that what "USING PERIODIC COMMIT" is supposed to do, more or less? I suppose I could give Neo4J more memory too, but again, it's not clear why I should have to in this scenario.
Also, to be clear, the lookups on both Sense.uid and DependencySet.label are indexed, so the lookup penalty for these should be pretty small. Here's a snippet from the schema:
Indexes
ON :DependencySet(label) ONLINE (for uniqueness constraint)
ON :Sense(uid) ONLINE (for uniqueness constraint)
Any explanations or thoughts on an alternative approach would be appreciated.
EDIT: The problem definitely seems to be in the MATCH and/or CREATE part of the query. If I remove lines 3 and 5 from the Cypher query it performs fine.
I assume that you've already created all the Sense-labeled nodes before running this LOAD CSV import. What I think is going on is that as you match nodes with the label Sense into memory and create relationships from the DependencySet to the Sense node via CREATE (s)-[:has]->(ds), you are increasing utilization of the available heap.
Another possibility is that the size of your relationship store in your memory-mapped settings needs to be increased. In your scenario it looks like the Sense nodes have a high degree of connectivity to other nodes in the graph. When this happens, the relationship store for those nodes requires more memory. Eventually, when you hit 400k rows, the heap is maxed out. Up until that point it has to do more garbage collection and more reads from disk.
Michael Hunger put together an excellent blog post on memory mapped settings for fast LOAD CSV performance. See here: http://jexp.de/blog/2014/06/load-csv-into-neo4j-quickly-and-successfully/
That should resolve your problem. I don't see anything wrong with your query.
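If you want something concrete to experiment with while tuning, one pattern commonly recommended for LOAD CSV imports (in the spirit of that blog post) is to split the import into two passes so each statement does only one kind of write. This is sketched against your existing file and columns, so treat it as a starting point rather than a drop-in fix:
// Pass 1: create/merge the DependencySet nodes only.
USING PERIODIC COMMIT 500
LOAD CSV FROM 'file:///Users/James/Desktop/import/dependency_sets_short.csv' AS row
MERGE (ds:DependencySet {label: row[2]})
  ON CREATE SET ds.optional = (row[3] = 't');
// Pass 2: match both endpoints and create the relationships.
USING PERIODIC COMMIT 500
LOAD CSV FROM 'file:///Users/James/Desktop/import/dependency_sets_short.csv' AS row
MATCH (s:Sense {uid: toInt(row[4])})
MATCH (ds:DependencySet {label: row[2]})
CREATE (s)-[:has]->(ds);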
I believe the line
MATCH (s:Sense {uid: toInt(row[4])})
is what drives the timing pattern. Somewhere around 200,000 on the x-axis of your graph, you no longer have all the Sense nodes in memory; some of them must be cached to disk. So the increase in time comes from repeatedly reloading data between the disk cache and memory (otherwise it would stay roughly linear if everything were kept in memory).
Maybe if you could post your server memory settings, we could dig deeper.
For the Java heap error, refer to Kenny's answer.
I wonder why Neo4j has a capacity limit on nodes and relationships. The limit on nodes and relationships is 2^35, which is a "little" bit more than the "normal" 2^32 integer. Common SQL databases, for example MySQL, store their primary keys as int (2^32) or bigint (2^64). Can you explain the advantages of this decision to me? In my opinion this is a key decision point when choosing a database.
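For reference, the arithmetic behind those figures:
2^32 = 4,294,967,296 (about 4.3 billion)
2^35 = 34,359,738,368 (about 34 billion, i.e. 8 times the 32-bit range)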
It is an artificial limit. They are going to remove it in the not-too-distant future, although I haven't heard any official ETA.
Often enough, you run into hardware limits on a single machine before you actually hit this limit.
The current option is to manually shard your graphs to different machines. Not ideal for some use cases, but it works in other cases. In the future they'll have a way to shard data automatically; no ETA on that either.
Update:
I've learned a bit more about Neo4j's storage internals. The reason the limits are exactly what they are is that the ID numbers are stored on disk as pointers in several places (node records, relationship records, etc.). To increase the limit by another power of 2, they'd need to add 1 byte per node and 1 byte per relationship; the format is currently packed as far as it will go without using more bytes on disk. Learn more in this great blog post:
http://digitalstain.blogspot.com/2010/10/neo4j-internals-file-storage.html
Update 2:
I've heard that in 2.1 they'll be increasing these limits to around an order of magnitude higher than they currently are.
As of Neo4j 3.0, all of these limits are removed.
Dynamic pointer compression expands Neo4j’s available address space as needed, making it possible to store graphs of any size. That’s right: no more 34 billion node limits!
For more information visit http://neo4j.com/blog/neo4j-3-0-massive-scale-developer-productivity.