I have created a .NET application that uses a Neo4j graph database (with GrapheneDB as the provider). I am having performance issues when I save a new graph object. I am not keeping a history of the graph, so each time I save, I first delete the old one, including its nodes and relationships, and then save the new one. I have not indexed my nodes yet. I don't think this is the problem, because loading several of these graphs at a time is very fast.
My save method steps through each branch and merges the nodes and relationships. (I left the relationships out of each step below for brevity.) Once the full query has been built, it is executed in one shot.
merge the root node 37 and node 4
merge type1 node 12-17 with 4
merge type2 node 18-22 with 4
merge 2 with 37
merge 7-11 with 2
merge 5 with 37 (creates relationships)
merge 23-26 with 5
merge 6 with 37 (creates relationships)
merge 30-27 with 6
Nodes 2, 4, 5, 6 can have 100-200 leaf nodes. I have about 100 of these graphs in my database. This save can take the server 10 - 20 seconds on production and sometimes times out.
I have tried saving a different way, and it takes longer but doesn't time out as frequently. I create groups of nodes first. Each node stores the root id 37. Each group is created in a separate execution. After the nodes are created, I create the relationships by selecting the child nodes and the root node. This splits the query up into separate, smaller queries.
How can I improve the performance of this save? Loading 30 of these graphs takes 3-5 seconds. I should also note that the save got significantly less performant as more data was added.
Since you delete all the nodes (and their relationships) beforehand, you should not be using MERGE at all, as that requires a lot of scanning (without the relevant indexes) to determine whether each node already exists.
Try using CREATE instead (as long as the CREATEs avoid creating duplicates).
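To make the scanning cost concrete, here is a toy Python model. It is only a sketch and not how Neo4j is implemented: the list `store` and the functions `merge_without_index` and `create` are made up for illustration. It counts how many existing nodes an unindexed MERGE-style lookup would have to compare against, versus a plain CREATE-style append.

```python
# Toy model only (hypothetical names, not Neo4j internals).

def merge_without_index(store, key):
    """Scan all existing entries for a match; append the key if none is found.

    Returns the number of comparisons performed.
    """
    comparisons = 0
    for existing in store:
        comparisons += 1
        if existing == key:
            return comparisons
    store.append(key)
    return comparisons


def create(store, key):
    """Append without checking for an existing entry; no comparisons needed."""
    store.append(key)
    return 0


def total_comparisons(op, n):
    """Insert n distinct keys into an empty store and sum the comparisons."""
    store = []
    return sum(op(store, i) for i in range(n))


merge_cost = total_comparisons(merge_without_index, 200)
create_cost = total_comparisons(create, 200)
print(merge_cost, create_cost)  # prints: 19900 0
```

In this model the lookup cost grows with the amount of data already stored, which is consistent with a save that gets slower as more graphs are added.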
I have installed the APOC procedures and ran "CALL apoc.warmup.run()".
The result is as follows:
pageSize
8192
nodesPerPage nodesTotal nodesLoaded nodesTime
546 156255221 286182 21
relsPerPage relsTotal relsLoaded relsTime
240 167012639 695886 8
totalTime
30
It looks like the Neo4j server caches only some of the nodes and relationships.
But I want it to cache all the nodes and relationships in order to improve query performance.
First of all, for all data to be cached, you need a page cache large enough.
Then, the problem is not that Neo4j fails to cache all it can; it is more of a bug in the apoc.warmup.run procedure: it retrieves the number of nodes (resp. relationships) in the database and expects all of them to have ids between 1 and that number. However, that is not true if you've had some churn in the DB, such as creating nodes and then deleting some of them.
I believe that could be fixed by using another query instead:
MATCH (n) RETURN count(n) AS count, max(id(n)) AS maxId
as profiling it shows about the same number of DB hits as the number of nodes, and takes about 650 ms on my machine for 1.4 million nodes.
Update: I've opened an issue on the subject.
Update 2
While the issue with the ids is real, I missed the real reason why the procedure reports reading far fewer nodes: it only reads one node per page (assuming they're stored sequentially), since it's the pages that are cached. With the current values, that means trying to read one node every 546 nodes. It happens that 156255221 ÷ 546 ≈ 286181 (rounded down), and with node 0 that makes 286182 nodes loaded.
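The arithmetic can be checked with a short Python sketch. The function name `pages_touched` is made up here, and the one-record-per-page stride starting at id 0 is the assumption described above, not something read from the APOC source.

```python
def pages_touched(total_records, records_per_page):
    """Count the ids 0, records_per_page, 2 * records_per_page, ... below total_records."""
    return len(range(0, total_records, records_per_page))


# Values taken from the warmup output quoted in the question.
nodes_loaded = pages_touched(156255221, 546)
rels_loaded = pages_touched(167012639, 240)
print(nodes_loaded, rels_loaded)  # prints: 286182 695886
```

Both numbers match the nodesLoaded and relsLoaded values reported by the procedure, so the same reasoning applies to relationships as well.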
I'm loading relationships into my graph DB in Neo4j using the LOAD CSV operation. The nodes are already created. I have four different types of relationships to create from four different CSV files (file 1: 59 relationships, file 2: 905 relationships, file 3: 173,000 relationships, file 4: over 1 million relationships). The Cypher queries execute just fine. However, file 1 (59 relationships) takes 25 seconds to execute, file 2 took 6.98 minutes, and file 3 has been running for the past 2 hours. I'm not sure whether these execution times are normal, given Neo4j's ability to handle millions of relationships. A sample Cypher query I'm using is given below.
load csv with headers from
"file:/sample.csv"
as rels3
match (a:Index1 {Filename: rels3.Filename})
match (b:Index2 {Field_name: rels3.Field_name})
create (a)-[:relation1 {type: rels3.`relation1`}]->(b)
return a, b
'a' and 'b' are two indices I created for two of the preloaded node categories hoping to speed up lookup operation.
Additional information - Number of nodes (a category) - 1791
Number of nodes (b category) - 3341
Is there a faster way to load this, and does the LOAD CSV operation normally take this much time? Am I going wrong somewhere?
Create an index on Index1.Filename and Index2.Field_name:
CREATE INDEX ON :Index1(Filename);
CREATE INDEX ON :Index2(Field_name);
Verify these indexes are online:
:schema
Verify your query is using the indexes by adding PROFILE to the start of your query and looking at the execution plan to see if the indexes are being used.
What I like to do before running a query is run EXPLAIN first to see if there are any warnings. I have fixed many a query thanks to the warnings.
(Simply prepend EXPLAIN to your query.)
Also, perhaps you can drop the return statement. After your query finishes you can then run another to just see the nodes.
I create roughly 20M relationships in about 54 mins using a query very similar to yours.
Indices are important because that's how Neo4j finds the nodes.
I was doing a POC on a publicly available Twitter dataset for our project. I was able to create the Neo4j database for it using Michael Hunger's Batch Inserter utility, and it was relatively fast (it took just 2 h 53 min to finish). All in all there were:
15,203,731 Nodes, with 2 properties (name, url)
256,147,121 Relationships, with 1 property
Now I created a Cypher query to update the Twitter database. I added a new property (Age) on the Node and a new property on the Relationship (FollowedSince) in the CSVs. Now things start to look bad. The query to update the relationship (see below) takes forever to run.
USING PERIODIC COMMIT 100000
LOAD CSV WITH HEADERS FROM {csvfile} AS row FIELDTERMINATOR '\t'
MATCH (u1:USER {name:row.`name:string:user`}), (u2:USER {name:row.`name:string:user2`})
MERGE (u1)-[r:Follows]->(u2)
ON CREATE SET r.Property=row.Property, r.FollowedSince=row.FollowedSince
ON MATCH SET r.Property=row.Property, r.FollowedSince=row.FollowedSince;
I already pre-created the index by running
CREATE INDEX ON :USER(name);
My neo4j.properties:
allow_store_upgrade=true
dump_configuration=false
cache_type=none
use_memory_mapped_buffers=true
neostore.propertystore.db.index.keys.mapped_memory=260M
neostore.propertystore.db.index.mapped_memory=260M
neostore.nodestore.db.mapped_memory=768M
neostore.relationshipstore.db.mapped_memory=12G
neostore.propertystore.db.mapped_memory=2048M
neostore.propertystore.db.strings.mapped_memory=2048M
neostore.propertystore.db.arrays.mapped_memory=260M
node_auto_indexing=true
I'd like to know what I should do to speed up my Cypher query. As of this writing, more than an hour and a half has passed and my relationship update (10,000,747) still hasn't finished. The node update (15,203,731) that finished earlier clocked in at 34 minutes, which I think is way too long. (The Batch Inserter utility processed all the nodes in just 5 minutes!)
I did test my queries on a small dataset just to try it out first before tackling bigger dataset, and it did work.
My Neo4j lives on a server-grade machine, so hardware is not an issue here.
Any advice please? Thanks.
I'm importing nodes that are all part of one Merge and relationship creation statement, but Neo4j is crashing with StackOverflowExceptions or "ERROR (-v for expanded information):
Error unmarshaling return header; nested exception is:
java.net.SocketException: Software caused connection abort: recv failed"
I admit my approach may be faulty, but I have some (A) nodes with ~8000 relationships to nodes of type (B) and (B) nodes have ~ 7000 relationships to other (A) nodes.
I basically have a big MERGE statement that creates the (A) & (B) nodes with a CREATE UNIQUE that does all the relationship creating at the end. I store all this Cypher in a file and import it through the Neo4jShell.
Example:
MERGE (foo:A { id:'blah'})
MERGE (bar:B {id:'blah2'})
MERGE (bar2:B1 {id:'blah3'})
MERGE (bar3:B3 {id:'blah3'})
MERGE (foo2:A1 {id:'blah4'})
... // thousands more of these
CREATE UNIQUE foo-[:x]->bar, bar-[:y]->foo2, // hundreds more of these
Is there a better way to do this? I was trying to avoid creating all the MERGE statements and then matching each node again to create the relationships in another query. I get really slow import performance both ways. Splitting up each MERGE into its own transaction is slow (a 2-hour import for 60K nodes/relationships). The current approach crashes Neo4j.
The current one big merge/create unique approach works for the first big insert, but fails after that when the next big insert uses 5000 nodes and 8000 relationships. Here is the result for the first big merge:
Nodes created: 756
Relationships created: 933
Properties set: 5633
Labels added: 756
15101 ms
I'm using a Windows 7 machine with 8GB RAM. In my neo4j.wrapper I use:
wrapper.java.initmemory=512
wrapper.java.maxmemory=2048
There are 3 things that might help:
1. If you don't really need MERGE, use CREATE instead. CREATE is more efficient because it doesn't have to check whether the nodes or relationships already exist.
2. Make sure your indexes are correct.
3. You now have everything in 1 big transaction. You mention the alternative of putting every statement in its own transaction. Neither works for you. However, you could make transactions of, say, 100 statements each. This approach should be quicker than 1 statement per transaction and still use less memory than putting everything in 1 big transaction.
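As a rough illustration of the batching idea, here is a Python sketch that groups a list of Cypher statements into transactions of 100. The helper names `chunked` and `as_shell_script` and the sample CREATE statements are made up for this example. It assumes the shell you feed the file to accepts BEGIN and COMMIT lines, and that every statement is self-contained (identifiers such as foo or bar do not carry over from one statement to the next).

```python
def chunked(statements, size=100):
    """Yield consecutive lists of at most `size` statements."""
    for start in range(0, len(statements), size):
        yield statements[start:start + size]


def as_shell_script(statements, size=100):
    """Wrap each batch of statements in a BEGIN ... COMMIT pair."""
    lines = []
    for batch in chunked(statements, size):
        lines.append("BEGIN")
        lines.extend(batch)
        lines.append("COMMIT")
    return "\n".join(lines)


# Hypothetical, self-contained statements standing in for the real import.
statements = ["CREATE (:A {id:'n%d'});" % i for i in range(250)]
batch_sizes = [len(batch) for batch in chunked(statements)]
script = as_shell_script(statements)
print(batch_sizes)  # prints: [100, 100, 50]
```

The batch size of 100 is only the starting point suggested above; it is worth trying larger and smaller values to see what the available memory tolerates.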
I am trying to create a simple set of relationships/nodes where a club is LOCATED in a region which is PART_OF a location which BELONGS_TO a country. The script below, with 150 lines (only 2 shown), executes for a minute and creates 150 nodes, 150 labels, and 150 relationships.
merge (c:COUNTRY {name:'Fictus'})
merge(d1:club {name:'alpha'})
merge(l1:LOCATION {name:'shore'})
merge(r1:REGION {name:'north park'})
merge d1-[:LOCATED]->l1
merge l1-[:BELONGS_TO]->r1
merge r1-[:IS_PART_OF]->c
merge(d2:club {name:'beta'})
merge(l2:LOCATION {name:'shore'})
merge (r2:REGION {name:'north park'})
merge d2-[:LOCATED]->l2
merge l2-[:BELONGS_TO]->r2
merge r2-[:IS_PART_OF]->c
Two questions:
1. Isn't it supposed to create only 3 labels? Why does it say 150?
2. Evidently this is a bad way to create the objects. What is the right way to do it via a script?
Thanks!
Based on your query, there should only be a handful of distinct node labels (the snippet shown uses four: COUNTRY, club, LOCATION and REGION). I believe the reported creation of "150 labels" is misleading; it is probably just telling you that 150 nodes were each assigned a label.
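The difference between the two counts can be sketched in Python. This is a toy model of the statistics, not Neo4j's own bookkeeping: `merged` is a made-up list of the (label, name) pairs from the snippet in the question, and the set mimics MERGE creating each distinct pair only once.

```python
# (label, name) pairs from the MERGE lines shown in the question.
merged = [
    ("COUNTRY", "Fictus"),
    ("club", "alpha"), ("LOCATION", "shore"), ("REGION", "north park"),
    ("club", "beta"), ("LOCATION", "shore"), ("REGION", "north park"),
]

# MERGE creates a node only for a (label, name) pair that does not exist yet.
created_nodes = set(merged)

# Each created node gets one label assigned, so this count tracks nodes.
labels_added = len(created_nodes)

# The number of distinct labels is much smaller and does not grow with the data.
distinct_labels = {label for label, _ in created_nodes}

print(labels_added, len(distinct_labels))  # prints: 5 4
```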
If you are truly trying to start a new database, then you can use the REST API to perform batch operations.