I recently upgraded my Neo4j to 3.1.3, and alongside that, got the most recent APOC plugin (3.1.3.6).
I had a bit of code that worked fine, and could create ~3 million relationships in about a minute and a half wall time. But now, it's been running for over 8 hours and shows no sign of stopping...
Because the code used to run without any problems, I'm hoping something changed between versions that has led to my code being borked.
Is it rock_n_roll that should be changed (maybe to apoc.periodic.commit with positional arguments or something)? Thanks for any insight.
Here's what I'm running:
CALL apoc.periodic.rock_n_roll(
"MATCH (c:ChessPlayer),(r:Record) WHERE c.ChessPlayer_ID = r.ChessPlayer RETURN c,r",
"CYPHER planner=rule WITH {c} AS c, {r} AS r CREATE (c)-[:HAD_RECORD]->(r)",
200000)
My understanding is that this call queries the Cartesian product of ChessPlayers and Records, filters it row by row, and then runs the batch update on the final results. That first query eats a lot of memory, and I think that one opening transaction is what's killing you. So if you can break it up so that each transaction touches as few nodes as possible, it should perform massively better (especially if r.ChessPlayer is indexed, since then you don't need to load all of the Records up front):
CALL apoc.periodic.rock_n_roll(
"MATCH (c:ChessPlayer) WHERE NOT EXISTS((c)-[:HAD_RECORD]->()) RETURN c",
"MATCH (r:Record) WHERE c.ChessPlayer_ID = r.ChessPlayer WITH c,r CREATE UNIQUE (c)-[:HAD_RECORD]->(r)",
100000)
apoc.periodic.commit would work on a similar principle. The smaller you can make each transaction (the fewer nodes touched), the faster the batch will complete.
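For reference, here is a minimal sketch of the apoc.periodic.commit variant, run from the Python driver. The batch size and Bolt URI are assumptions, and it assumes every ChessPlayer has at least one matching Record so each batch makes progress:

from neo4j.v1 import GraphDatabase

# apoc.periodic.commit re-runs the inner statement, committing after each run,
# until it returns 0; each transaction therefore only touches `limit` players.
query = """
CALL apoc.periodic.commit(
  "MATCH (c:ChessPlayer) WHERE NOT EXISTS((c)-[:HAD_RECORD]->())
   WITH c LIMIT {limit}
   MATCH (r:Record) WHERE c.ChessPlayer_ID = r.ChessPlayer
   CREATE (c)-[:HAD_RECORD]->(r)
   RETURN count(*)",
  {limit: 50000})
"""

driver = GraphDatabase.driver('bolt://localhost:7687')  # URI/auth are assumptions
with driver.session() as session:
    session.run(query).consume()  # block until the whole batched run finishes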
I have been really struggling to achieve acceptable performance for my application with Neo4J 3.0.3. Here is some background:
I am trying to replace Apache Solr with Neo4j for an application to extend its capabilities, while maintaining or improving performance.
In Solr I have documents that essentially look like this:
{
"time": "2015-08-05T00:16:00Z",
"point": "45.8300018311,-129.759994507",
"sea_water_temperature": 18.49,
"sea_water_temperature_depth": 4,
"wind_speed": 6.48144,
"eastward_wind": 5.567876,
"northward_wind": -3.3178043,
"wind_depth": -15,
"sea_water_salinity": 32.19,
"sea_water_salinity_depth": 4,
"platform": 1,
"mission": 1,
"metadata": "KTDQ_20150805v20001_0016"
}
Since Solr is a key-value data store, my initial translation to Neo4J was going to be simple so I could get a feel for working with the API.
My method was essentially to have each Solr record equate to a Neo4J node, where every key-value would become a node-property.
Obviously a few tweaks were required: changing None to 'None' (Python), converting ISO times to epoch times (Neo4j doesn't support indexing datetimes), splitting point into lat/lon (for Neo4j spatial indexing), etc.
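A minimal sketch of those conversions, using the fields from the sample document above (the exact parsing is an assumption):

from datetime import datetime, timezone

doc = {"time": "2015-08-05T00:16:00Z", "point": "45.8300018311,-129.759994507"}

# ISO 8601 -> epoch seconds, since the datetime string itself can't be range-indexed
epoch = int(datetime.strptime(doc["time"], "%Y-%m-%dT%H:%M:%SZ")
            .replace(tzinfo=timezone.utc).timestamp())

# "lat,lon" string -> separate floats for the spatial index
lat, lon = (float(x) for x in doc["point"].split(","))

print(epoch, lat, lon)  # 1438733760 45.8300018311 -129.759994507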
My goal was to load up Neo4J using this model, regardless of how naive it might be.
Here is an example of a rest call I make when loading in a single record (using http://localhost:7474/db/data/cypher as my endpoint):
{
"query" :
"CREATE (r:record {lat : {lat}, SST : {SST}, meta : {meta}, lon : {lon}, time : {time}}) RETURN id(r);",
"params": {
"lat": 40.1021614075,
"SST": 6.521100044250488,
"meta": "KCEJ_20140418v20001_1430",
"lon": -70.8780212402,
"time": 1397883480
}
}
Note that I have actually removed quite a few parameters for testing neo4j.
Currently I have serious performance issues. Loading a document like this into Solr for me takes about 2 seconds. For Neo4J it takes:
~20 seconds using REST API
~45 seconds using BOLT
~70 seconds using py2neo
I have ~50,000,000 records I need to load. Doing this in Solr usually takes 24 hours, so Neo4J could take almost a month!!
I recorded these times without using a uniqueness constraint on my 'meta' attribute, and without adding each node into the spatial index. The time results in this scenario were extremely poor.
Running into this issue, I tried searching for performance tweaks online. The following things have not improved my situation:
-increasing the open file limit from 1024 to 40000
-using ext4, and tweaking it as documented here
-increasing the page cache size to 16 GB (my system has 32)
So far I have only addressed load times. After I had loaded about 50,000 nodes overnight, I attempted queries on my spatial index like so:
CALL spatial.withinDistance('my_layer', {lon: 34.0, lat: 20.0}, 1000)
as well as on my time index like so:
MATCH (r:record) WHERE r.time > {} AND r.time < {} RETURN r;
These simple queries would take literally several minutes just to return possibly a few nodes.
In Apache Solr, the spatial index is extremely fast and responds within 5 seconds (even with all 50000000 docs loaded).
At this point, I am concerned as to whether or not this performance lag is due to the nature of my data model, the configuration of my server, etc.
My goal was to extrapolate from this model, and move several measurement types to their own class of Node, and create relationships from my base record node to these.
Is it possible that I am abusing Neo4j, and need to recreate this model to use relationships and several different Node types? Should I expect to see dramatic improvements?
As a side note, I originally planned to use a triple store (specifically Parliament) to store this data, and after struggling to work with RDF, decided that Neo4J looked promising and much easier to get up and running. Would it be worthwhile to go back to RDF?
Any advice, tips, comments are welcome. Thank you in advance.
EDIT:
As suggested in the comments, I have changed the behavior of my loading script.
Previously I was using python in this manner:
from neo4j.v1 import GraphDatabase
driver = GraphDatabase.driver('bolt://localhost:7687')
session = driver.session()
for tuple in mydata:
    statement = build_statement(tuple)
    session.run(statement)
session.close()
With this approach, the actual .run() statements complete in virtually no time; the .close() call is where all the run time occurs.
My modified approach:
transaction = ''
for tuple in mydata:
    statement = build_statement(tuple)
    transaction += ('\n' + statement)
with session.begin_transaction() as tx:
    tx.run(transaction)
session.close()
I'm a bit confused because the behavior of this is pretty much the same: .close() still takes around 45 seconds, except now it doesn't commit. Since I am reusing the same identifier in each of my statements (CREATE (r:record {...}) .... CREATE (r:record {...}) ...), I get a CypherError about that. I don't really know how to avoid this problem at the moment, and furthermore the run time did not seem to improve at all (I would expect an error to make this terminate much faster).
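One way to avoid the identifier clash (a sketch, assuming each item in mydata is a (lat, SST, meta, lon, time) tuple) is to keep a single parameterized statement and call tx.run() once per record inside one transaction:

from neo4j.v1 import GraphDatabase

driver = GraphDatabase.driver('bolt://localhost:7687')  # Bolt URI/auth are assumptions
session = driver.session()

# One parameterized statement reused for every record; there is no identifier
# clash because each tx.run() call is a separate statement in the same transaction.
statement = ("CREATE (r:record {lat: {lat}, SST: {SST}, meta: {meta}, "
             "lon: {lon}, time: {time}})")

# Illustrative stand-in for mydata
mydata = [(40.1021614075, 6.521100044250488, 'KCEJ_20140418v20001_1430',
           -70.8780212402, 1397883480)]

with session.begin_transaction() as tx:
    for lat, sst, meta, lon, time in mydata:
        tx.run(statement,
               {'lat': lat, 'SST': sst, 'meta': meta, 'lon': lon, 'time': time})
    tx.success = True  # mark for commit (neo4j.v1 style); otherwise the tx rolls back

session.close()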
In a general sense, is there a best practice to use when attempting to estimate how long the setting of relationships takes in Neo4j?
For example, I used the data import tool successfully, and here's what I've got in my 2.24GB database:
IMPORT DONE in 3m 8s 791ms. Imported:
7432663 nodes
0 relationships
119743432 properties
In preparation for setting relationships, I set some indices:
CREATE INDEX ON :Player(player_id);
CREATE INDEX ON :Matches(player_id);
Then I let it rip:
MATCH (p:Player),(m:Matches)
WHERE p.player_id = m.player_id
CREATE (p)-[r:HAD_MATCH]->(m)
Then I started to realize that I have no idea how to even estimate how long setting these relationships might take. Is there a 'back of the envelope' calculation for getting at least a ballpark figure for this kind of thing?
I understand that everyone's situation is different on all levels, including software, hardware, and desired schema. But any discussion would no doubt be useful and would deepen my understanding (and that of anyone else who reads this).
PS: FWIW, I'm running Ubuntu 14.04 with 16GB RAM and an Intel Core i7-3630QM CPU @ 2.40GHz
The problem here is that you don't take transaction sizes into account. In your example, all :HAD_MATCH relationships are created in one single large transaction. A transaction is built up in memory first and then flushed to disk. If the transaction is too large to fit in your heap, you might see massive performance degradation due to garbage collection, or even OutOfMemoryExceptions.
Typically you want to limit transaction sizes to e.g. 10k - 100k atomic operations.
Probably the easiest way to do transaction batching in this case is the rock_n_roll procedure from neo4j-apoc. It uses one Cypher statement to provide the data to be worked on, and a second statement that runs over the results of the first in batched mode. Note that APOC requires Neo4j 3.x:
CALL apoc.periodic.rock_n_roll(
"MATCH (p:Player),(m:Matches) WHERE p.player_id = m.player_id RETURN p,m",
"WITH {p} AS p, {m} AS m CREATE (p)-[:HAD_MATCH]->(m)",
20000)
There was a bug in 3.0.0 and 3.0.1 that caused this to perform rather badly, so the above is for Neo4j >= 3.0.2.
If you are on 3.0.0 / 3.0.1, use this as a workaround:
CALL apoc.periodic.rock_n_roll(
"MATCH (p:Player),(m:Matches) WHERE p.player_id = m.player_id RETURN p,m",
"CYPHER planner=rule WITH {p} AS p, {m} AS m CREATE (p)-[:HAD_MATCH]->(m)",
20000)
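Coming back to the 'back of the envelope' question: one rough approach (a sketch, not part of the original answer) is to time a small LIMIT-ed batch and extrapolate. Note this only gives a lower bound, since per-row cost rises sharply once a single transaction no longer fits in the heap:

import time
from neo4j.v1 import GraphDatabase

driver = GraphDatabase.driver('bolt://localhost:7687')  # URI/auth are assumptions
SAMPLE = 10000  # trial batch size; small enough to commit quickly

with driver.session() as session:
    # Total relationships to create; assumes each Matches row joins exactly one Player.
    total = session.run("MATCH (m:Matches) RETURN count(*) AS c").single()['c']

    start = time.time()
    session.run(
        "MATCH (p:Player), (m:Matches) "
        "WHERE p.player_id = m.player_id AND NOT (p)-[:HAD_MATCH]->(m) "
        "WITH p, m LIMIT {n} CREATE (p)-[:HAD_MATCH]->(m)", {'n': SAMPLE}).consume()
    elapsed = time.time() - start

    print('estimated total: %.1f minutes' % (elapsed * total / SAMPLE / 60))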
I'm running into issues with deadlocks during concurrent merge operations (REST API). I have a function that processes text with some metadata, and for each item in the metadata dictionary, I'm performing a merge to add either a node or connect the text node with the metadata[n] node. Issues come up when the message rate is around 500-1000 per second.
In this particular function, there are 11 merges spread across 6 queries, which go something like this:
q1 = "MERGE (n:N { id: {id} }) ON CREATE SET ... ON MATCH SET "
"WITH . MERGE (...) "
"WITH ., . MERGE (.)-[:some_rel]->(.)"
params = {'the': 'params'}
cypher.execute(q1, params)
if some_condition:
q2 = "MATCH (n:N { id: {id} }) ... "
"WITH n, . MERGE (n)-[:some_rel]->(.)"
params = {'more': 'params'}
cypher.execute(q2, params)
if some_condition2:
q3
...
if some_condition_n:
qn
I'm running the above with Python, via Celery (for those not familiar with Celery, it's a distributed task queue). When the issue first came up, I was executing the above in a single transaction, and had a ton of failures due to deadlock exceptions. My initial thought was simply to implement a distributed blocking lock at the function level with Redis. This, however, causes a bottleneck in my app.
Next, I switched from a single Cypher transaction to a few atomic transactions, as in the above, and removed the lock. This takes care of the bottleneck and greatly reduces the number of deadlock exceptions, but they're still occurring, albeit at a reduced level.
Graph databases aren't really my thing, so I don't have a ton of experience with the ins and outs of Neo4j and Cypher. I have a secondary index in Redis of the UUIDs of existing nodes, so there is a pre-processing step prior to the merges to try to keep graph access down. Are there any obvious solutions that I should try? Maybe there's some way to queue the operations on the graph side, or maybe I'm overlooking some server optimizations? Any advice on where to look would be appreciated. Thanks!
Okay, after thinking about this some more, I realized that the way my queries were executed was inefficient and could do with some refactoring. Since all the queries are within the same general context, there is no reason to execute them all individually, or even to open a transaction and have them executed that way.
Instead, I changed the function to go through the conditionals, concatenate the query strings into one long string, and add the params I need to a single param dictionary. So now there's only one execution at the end, with one statement. This also removes some of the 'MATCH' statements.
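Roughly, the refactored shape looks like this (a sketch with placeholder labels and ids, since the real clauses are elided above; the original code used the REST API, so the Bolt driver here is an assumption):

from neo4j.v1 import GraphDatabase

driver = GraphDatabase.driver('bolt://localhost:7687')

# Illustrative inputs; the real values come from the message metadata
some_id, other_id = 'uuid-1', 'uuid-2'
some_condition = True

# Start from the base MERGE, then append clauses instead of issuing new queries.
query = "MERGE (n:N {id: {id}}) ON CREATE SET n.created = timestamp()"
params = {'id': some_id}

if some_condition:
    # reuse `n` from the first MERGE rather than MATCHing it again
    query += " WITH n MERGE (m:M {id: {m_id}}) MERGE (n)-[:some_rel]->(m)"
    params['m_id'] = other_id

with driver.session() as session:
    session.run(query, params)  # one round trip, one transaction per message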
This fix doesn't wholly fix the issue, though, as there are still some deadlock exceptions being thrown.
I think I found the issue, namely that there wasn't really an issue to begin with. That is:
The Enterprise version of Neo4j has an alternative lock manager than the Community version, meant to provide scalable locking on high-CPU-count machines.
The Enterprise Lock Manager uses a deadlock detection algorithm that does not require (much) synchronization, which gives it some very desirable scalability attributes. The drawback is that it may sometimes detect false-positives. This normally does not happen in production usage, but becomes evident in stress testing individual operations. These scenarios see much lower churn in CPU cache invalidation, which the enterprise lock manager needs to communicate across cores.
As a deadlock detection error is a safe-to-retry error and the user is expected to handle these in all application code, since there may be legitimate deadlocks at any time, this behavior is actually by design to gain scalability.
I simply caught the exception and retried after a few seconds, and now the deadlock errors are no longer a problem.
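A minimal retry wrapper along those lines (a sketch; the exact exception class depends on the client library, so the check below just looks for the DeadlockDetected code in the error text):

import time

def run_with_retry(execute, query, params, retries=5, delay=3):
    """Run a write query, retrying when Neo4j reports a deadlock.

    `execute` is whatever callable issues the query (REST client, Bolt
    session.run, ...); deadlock errors are transient and safe to retry.
    """
    for attempt in range(retries):
        try:
            return execute(query, params)
        except Exception as exc:  # ideally narrow this to the client's transient error class
            if 'DeadlockDetected' not in str(exc) or attempt == retries - 1:
                raise
            time.sleep(delay)  # back off a little before retrying

# usage, e.g.: run_with_retry(cypher.execute, q1, params)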
I'm using Cypher's LOAD CSV syntax in Neo4J 2.1.2. So far it's been a huge improvement over the more manual ETL process required in previous versions. But I'm running into some behavior in a single case that's not what I'd expect and I wonder if I'm missing something.
The cypher query being used is this:
USING PERIODIC COMMIT 500
LOAD CSV FROM 'file:///Users/James/Desktop/import/dependency_sets_short.csv' AS row
MATCH (s:Sense {uid: toInt(row[4])})
MERGE (ds:DependencySet {label: row[2]}) ON CREATE SET ds.optional=(row[3] = 't')
CREATE (s)-[:has]->(ds)
Here's a couple of lines of the CSV:
227303,1,TO-PURPOSE-NOMINAL,t,73830
334471,1,AT-LOCATION,t,92048
334470,1,AT-TIME,t,92048
334469,1,ON-LOCATION,t,92048
227302,1,TO-PURPOSE-INFINITIVE,t,73830
116008,1,TO-LOCATION,t,68204
116007,1,IN-LOCATION,t,68204
227301,1,TO-LOCATION,t,73830
334468,1,ON-DATE,t,92048
116006,1,AT-LOCATION,t,68204
334467,1,WITH-ASSOCIATE,t,92048
Basically, I'm matching a Sense node (previously imported) based on its ID value, which is the fifth column. Then I'm doing a merge to either get a DependencySet node if it exists, or create it. Finally, I'm creating a has edge between the Sense node and the DependencySet node. So far so good; this all works as expected. What's confusing is the performance as the size of the CSV grows.
CSV Lines Time (msec)
------------------------------
500 480
1000 717
2000 1110
5000 1521
10000 2111
50000 4794
100000 5907
200000 12302
300000 35494
400000 Java heap space error
My expectation is that growth would be more-or-less linear, particularly as I'm committing every 500 lines as recommended by the manual, but it's actually closer to polynomial (see the timings above).
What's worse is that somewhere between 300k and 400k rows, it runs into a Java heap space error. Based on the trend from previous imports, I'd expect the import of 400k to take a bit over a minute. Instead, it churns away for about 5-7 minutes before running into the heap space error. It seems like I could split this file into 300,000-line chunks, but isn't that what "USING PERIODIC COMMIT" is supposed to do, more or less? I suppose I could give Neo4J more memory too, but again, it's not clear why I should have to in this scenario.
Also, to be clear, the lookups on both Sense.uid and DependencySet.label are indexed, so the lookup penalty for these should be pretty small. Here's a snippet from the schema:
Indexes
ON :DependencySet(label) ONLINE (for uniqueness constraint)
ON :Sense(uid) ONLINE (for uniqueness constraint)
Any explanations or thoughts on an alternative approach would be appreciated.
EDIT: The problem definitely seems to be in the MATCH and/or CREATE part of the query. If I remove lines 3 and 5 from the Cypher query it performs fine.
I assume that you've already created all the Sense labeled nodes before running this LOAD CSV import. What I think is going on is that as you match nodes with the label Sense into memory and create relationships from the Sense node to the DependencySet via CREATE (s)-[:has]->(ds), you are increasing utilization of the available heap.
Another possibility is that the size of your relationship store in your memory mapped settings needs to be increased. In your scenario it looks like the Sense nodes have a high degree of connectivity to other nodes in the graph. When this happens, the relationship store for those nodes requires more memory. Eventually, when you hit 400k rows, the heap is maxed out; up until that point it has to do more and more garbage collection and reads from disk.
Michael Hunger put together an excellent blog post on memory mapped settings for fast LOAD CSV performance. See here: http://jexp.de/blog/2014/06/load-csv-into-neo4j-quickly-and-successfully/
That should resolve your problem. I don't see anything wrong with your query.
I believe the line
MATCH (s:Sense {uid: toInt(row[4])})
is what drives the time growth. Somewhere around 200,000 on the x-axis of your graph, you no longer have all the Sense nodes in memory; some of them must be cached to disk. So the increase in time is simply re-loading data from cache to memory and vice versa (otherwise it would still be linear if everything were kept in memory).
Maybe if you could post your server memory settings, we could dig deeper.
For the Java heap space error, refer to Kenny's answer.
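If splitting the file does turn out to be necessary (as the question speculates), here is a small Python sketch for chunking the CSV so each LOAD CSV run stays below the size that blew the heap (paths and chunk size are assumptions):

CHUNK = 300000  # rows per part; 300k imported cleanly, 400k ran out of heap
src = '/Users/James/Desktop/import/dependency_sets_short.csv'

with open(src) as f:
    part, rows = 0, []
    for line in f:
        rows.append(line)
        if len(rows) == CHUNK:
            with open('%s.part%d' % (src, part), 'w') as out:
                out.writelines(rows)
            part, rows = part + 1, []
    if rows:  # final partial chunk
        with open('%s.part%d' % (src, part), 'w') as out:
            out.writelines(rows)

# Each part file can then be run through the same LOAD CSV statement, one at a time.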
I have a graph with ~89K nodes and ~1.2M relationships, and am trying to get the transitive closure of a single node via the following Cypher query:
start n=NODE(<id of a single node of interest>)
match (n)-[*1..]->(m)
where has(m.name)
return distinct m.name
Unfortunately, this query goes away and doesn't seem to come back (although to be fair I've only given it about an hour of execution time at this point).
Any suggestions on ways to optimise what I've got here, or better ways to achieve the requirement?
Notes:
Neo4J v2.0.0 (installed via Homebrew).
Mac OSX 10.8.5
Oracle Java 1.7.0_51
8GB physical RAM (neo4j JVM assigned whatever the default is)
Database is hosted on an SSD volume.
Query is submitted via the admin web UI's "Data browser".
"name" is an auto-indexed field.
CPU usage is fairly low - averaging around 20% of 8 cores.
I haven't gotten into the weeds of profiling the Neo4J server yet - my first attempt locked up VisualVM.
That's probably a combinatorial explosion of paths; care to try this?
start n=NODE(<id of a single node of interest>),m=node:node_auto_index("name:*")
match shortestPath((n)-[*]->(m))
return m.name
Without shortestPath it would look like the query below, but as you are only interested in the nodes reachable from n, the shortestPath version above should be good enough.
start n=NODE(<id of a single node of interest>),m=node:node_auto_index("name:*")
match (n)-[*]->(m)
return distinct m.name
Try guery (https://code.google.com/p/gueryframework/) - this is a standalone library, but it has a Neo4j adapter. I.e., you will have to rewrite your queries in the guery format.
Better support for transitive closure was one of the main reasons for developing guery; we mainly use it in software analysis tools where we need reachability / pattern analysis (e.g., the antipattern queries in http://xplrarc.massey.ac.nz/ are computed using guery).
There is a brief discussion about this in the neo4j google group:
https://groups.google.com/forum/#!searchin/neo4j/jens/neo4j/n69ksEJxDtQ/29DNKyWKur4J
and an (older, not maintained) project with some benchmarking code:
https://code.google.com/p/graph-query-benchmarks/
Cheers, Jens