I was doing a POC on a publicly-available Twitter dataset for our project. I was able to create the Neo4j database for it using Michael Hunger's Batch Inserter utility, and it was relatively fast (it took just 2 hours and 53 minutes to finish). All in all there were
15,203,731 Nodes, with 2 properties (name, url)
256,147,121 Relationships, with 1 property
Then I created a Cypher query to update the Twitter database: I added a new property (Age) on the nodes and a new property (FollowedSince) on the relationships in the CSVs. This is where things start to look bad. The query to update the relationships (see below) takes forever to run.
USING PERIODIC COMMIT 100000
LOAD CSV WITH HEADERS FROM {csvfile} AS row FIELDTERMINATOR '\t'
MATCH (u1:USER {name:row.`name:string:user`}), (u2:USER {name:row.`name:string:user2`})
MERGE (u1)-[r:Follows]->(u2)
ON CREATE SET r.Property=row.Property, r.FollowedSince=row.FollowedSince
ON MATCH SET r.Property=row.Property, r.FollowedSince=row.FollowedSince;
I already pre-created the index by running
CREATE INDEX ON :USER(name);
My neo4j.properties:
allow_store_upgrade=true
dump_configuration=false
cache_type=none
use_memory_mapped_buffers=true
neostore.propertystore.db.index.keys.mapped_memory=260M
neostore.propertystore.db.index.mapped_memory=260M
neostore.nodestore.db.mapped_memory=768M
neostore.relationshipstore.db.mapped_memory=12G
neostore.propertystore.db.mapped_memory=2048M
neostore.propertystore.db.strings.mapped_memory=2048M
neostore.propertystore.db.arrays.mapped_memory=260M
node_auto_indexing=true
I'd like to know what I should do to speed up my Cypher query. As of this writing, more than an hour and a half has passed and the relationship update (10,000,747 rows) still hasn't finished. The node update (15,203,731 rows), which finished earlier, clocked in at 34 minutes, which I think is way too long. (The Batch Inserter utility processed all the nodes in just 5 minutes!)
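For reference, the node update that clocked in at 34 minutes was a plain LOAD CSV + MATCH + SET statement along these lines (the Age column header shown here is illustrative, not the exact one from my CSV):

// shape of the node-property update; the Age header is an assumption
USING PERIODIC COMMIT 100000
LOAD CSV WITH HEADERS FROM {csvfile} AS row FIELDTERMINATOR '\t'
MATCH (u:USER {name:row.`name:string:user`})
SET u.Age = row.Age;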
I did test my queries on a small dataset first, just to try them out before tackling the bigger dataset, and they did work.
My Neo4j lives on a server-grade machine, so hardware is not an issue here.
Any advice please? Thanks.
Related
I'm creating nodes and relationships programmatically using the Neo4j Java driver, based on relationships specified in a CSV file.
The CSV file contains about 16 million rows, and there will be about 16*4 million relationships to create.
I'm using the pattern MATCH, MATCH, CREATE for this purpose:
Match (a:label), (b:label) where a.prop='1234' and b.prop='4567' create (a)-[:LINKS]->(b)
I just started the program, and functionally it ran well: I saw nodes and relationships being created properly in the Neo4j DB.
However, for the past four hours only 100,000 rows from the CSV have been processed and only 92,037 relationships created.
At this rate it will take about one month to finish processing the CSV and creating all the relationships.
I noticed that I was sending the MATCH...CREATE statements one by one via session.writeTransaction().
Is there any way to batch them up so as to speed up the creation time?
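One idea I'm considering is to collect the rows into a list parameter and UNWIND it inside a single transaction. A rough sketch of the kind of batched statement I mean, where $rows is an assumed parameter holding a list of maps:

// $rows is an assumed parameter, e.g. a list of maps like {from: '1234', to: '4567'}
UNWIND $rows AS row
MATCH (a:label {prop: row.from})
MATCH (b:label {prop: row.to})
CREATE (a)-[:LINKS]->(b)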
My import.csv creates many nodes, and the merge step produces a huge cartesian product and runs into a transaction timeout now that the data has grown so much. I've currently set the transaction timeout to 1 second because every other query is very quick and is not supposed to take longer than one second to finish.
Is there a way to split or execute this specific query in smaller chunks to prevent a timeout?
Upping or disabling the transaction timeout in the neo4j.conf is not an option because the neo4j service needs a restart for every change made in the config.
The query hitting the timeout from my import script:
MATCH (l:NameLabel)
MATCH (m:Movie {id: l.id,somevalue: l.somevalue})
MERGE (m)-[:LABEL {path: l.path}]->(l);
Node counts: 1,000 Movie, 2,500 Namelabel
You can try installing APOC Procedures and using the procedure apoc.periodic.commit.
call apoc.periodic.commit("
MATCH (l:Namelabel)
WHERE NOT (l)-[:LABEL]->(:Movie)
WITH l LIMIT {limit}
MATCH (m:Movie {id: l.id,somevalue: l.somevalue})
MERGE (m)-[:LABEL {path: l.path}]->(l)
RETURN count(*)
",{limit:1000})
The query above will be executed repeatedly in separate transactions until it returns 0.
You can change the value of {limit: 1000}.
Note: remember to install the APOC Procedures release that matches the version of Neo4j you are using. Take a look at the Version Compatibility Matrix.
The number of nodes and labels in your database suggests this is an indexing problem. Do you have constraints on both the Movie and Namelabel (which should be NameLabel, since it is a node label) nodes? The appropriate constraints should be in place and active.
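For example, uniqueness constraints along these lines (assuming id is the unique key on both labels, which is a guess based on the query above) would also create the backing indexes:

CREATE CONSTRAINT ON (m:Movie) ASSERT m.id IS UNIQUE;
CREATE CONSTRAINT ON (l:NameLabel) ASSERT l.id IS UNIQUE;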
Indexing and Performance
Make sure to have indexes and constraints declared and ONLINE for entities you want to MATCH or MERGE on.
Always MATCH and MERGE on a single label and the indexed primary-key property.
Prefix your load statements with USING PERIODIC COMMIT 10000.
If possible, separate node creation from relationship creation into different statements (see the sketch after this list).
If your import is slow or runs into memory issues, see Mark's blog post on Eager loading.
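A minimal sketch of that separation, with the file names and column headers being placeholders:

// first pass: nodes only
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "file:/movies.csv" AS row
MERGE (m:Movie {id: row.id});

// second pass: relationships only, matching on the indexed key
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "file:/namelabels.csv" AS row
MATCH (m:Movie {id: row.movieId})
MATCH (l:NameLabel {id: row.labelId})
MERGE (m)-[:LABEL {path: row.path}]->(l);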
If your Movie nodes have unique names then add a uniqueness constraint (CREATE CONSTRAINT ... IS UNIQUE). If one of the nodes is not unique but will be used in a relationship definition, then use the CREATE INDEX ON statement. With such a small dataset it may not be readily apparent how inefficient your queries are. Try the PROFILE command and see how many nodes are being searched. Your MERGE statement should only check a couple of nodes at each step.
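For example, profiling the import query from the question shows how many rows each operator touches:

PROFILE
MATCH (l:NameLabel)
MATCH (m:Movie {id: l.id, somevalue: l.somevalue})
MERGE (m)-[:LABEL {path: l.path}]->(l);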
I'm loading relationships into my graph DB in Neo4j using the LOAD CSV operation. The nodes are already created. I have four different types of relationships to create from four different CSV files (file 1 - 59 relationships, file 2 - 905 relationships, file 3 - 173,000 relationships, file 4 - over 1 million relationships). The Cypher queries execute just fine; however, file 1 (59 relationships) takes 25 seconds to execute, file 2 took 6.98 minutes, and file 3 has now been running for the past 2 hours. I'm not sure if these execution times are normal given Neo4j's capabilities to handle millions of relationships. A sample Cypher query I'm using is given below.
load csv with headers from
"file:/sample.csv"
as rels3
match (a:Index1 {Filename: rels3.Filename})
match (b:Index2 {Field_name: rels3.Field_name})
create (a)-[:relation1 {type: rels3.`relation1`}]->(b)
return a, b
'a' and 'b' are two indices I created for two of the preloaded node categories, hoping to speed up the lookup operation.
Additional information - Number of nodes (a category) - 1791
Number of nodes (b category) - 3341
Is there a faster way to load this, and does the LOAD CSV operation always take this much time? Am I going wrong somewhere?
Create an index on Index1.Filename and Index2.Field_name:
CREATE INDEX ON :Index1(Filename);
CREATE INDEX ON :Index2(Field_name);
Verify these indexes are online:
:schema
Verify your query is using the indexes by adding PROFILE to the start of your query and looking at the execution plan to see if the indexes are being used.
What I like to do before running a query is run EXPLAIN first to see if there are any warnings. I have fixed many a query thanks to the warnings.
(Simply prepend EXPLAIN to your query.)
Also, perhaps you can drop the RETURN statement. After your query finishes you can then run another query to just see the nodes.
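For example, against the query from the question, with the RETURN dropped and EXPLAIN prepended:

EXPLAIN
LOAD CSV WITH HEADERS FROM "file:/sample.csv" AS rels3
MATCH (a:Index1 {Filename: rels3.Filename})
MATCH (b:Index2 {Field_name: rels3.Field_name})
CREATE (a)-[:relation1 {type: rels3.`relation1`}]->(b)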
I created roughly 20M relationships in about 54 minutes using a query very similar to yours.
Indexes are important because that's how Neo4j finds the nodes.
I am trying to insert unique nodes and relationships into Neo4j.
What I am using:
Neo4j Community Edition running on Amazon EC2.[Amazon Linux m3.large]
Neo4j Java Rest Binding [ https://github.com/neo4j-contrib/java-rest-binding ]
Data Size and Type:
TSV files [multiple]. Each contains more than 8 million lines [each line represents a node or a relationship]. There are more than 10 files for nodes [= 2 million nodes] and another 2 million relationships.
I am using UniqueNodeFactory for inserting nodes, and I am inserting sequentially; I couldn't find any way to insert in batches while preserving unique nodes.
The problem is that it is taking a huge amount of time to insert the data. For example, it took almost a day to insert 0.3 million unique nodes. Is there any way to speed up the insertion?
Don't do that.
Java-REST-Binding was never made for that.
Use either
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "http://some.url" as line
CREATE (u:User {name:line.name})
You can also use MERGE (with constraints), create relationships, etc.
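A minimal sketch of that variant, with the constraint, file URLs, and column names assumed for illustration (FIELDTERMINATOR '\t' because your files are TSV):

// uniqueness constraint so MERGE on name is backed by an index
CREATE CONSTRAINT ON (u:User) ASSERT u.name IS UNIQUE;

// nodes first
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:/users.tsv" AS line FIELDTERMINATOR '\t'
MERGE (u:User {name: line.name});

// then relationships
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:/follows.tsv" AS line FIELDTERMINATOR '\t'
MATCH (a:User {name: line.from}), (b:User {name: line.to})
MERGE (a)-[:FOLLOWS]->(b);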
See my blog post for an example: http://jexp.de/blog/2014/06/using-load-csv-to-import-git-history-into-neo4j/
Or the Neo4j Manual: http://docs.neo4j.org/chunked/milestone/cypherdoc-importing-csv-files-with-cypher.html
The data comes into the system continuously at a rate of 300-500 TPS. I need to import it into Neo4j with the following scheme:
If N node does not exist, create it
If the relation N-[rel:rel_type]->X does not exist, create it
Increment rel.weight
It seems to be impossible to solve the problem using REST batch.
Separate Cypher queries are too slow because they generate many small transactions.
Gremlin works much faster: I collect parameters for the Gremlin script in an array and execute it as a batch. But even then I could hardly reach 300 TPS.
I should mention that, besides the writes, there will also be a flow of read queries at ~500 TPS:
START N=node(...) MATCH N-[rel:rel_type]->X return rel.weight,X.name;
The heap size is set to 5 Gb. Additional options:
-XX:MaxPermSize=1G -XX:+CMSClassUnloadingEnabled -XX:+UseParallelGC -XX:+UseNUMA
What is the optimal way and configuration for importing this kind of data?
To check whether or not the incoming node exists and already has the relationship to the other node, you can use the CREATE UNIQUE syntax.
START n=node:node_index(newNode={N})
CREATE UNIQUE n-[:REL_TYPE]->x ;
To automatically increment the weight of the relationship, I would assume something like this (but no warranty on this; there is probably a faster way of doing it):
START n=node:node_index(newNode={N})
CREATE UNIQUE n-[rel:REL_TYPE]->x
SET rel.weight = coalesce(rel.weight?,0) +1
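With MERGE (Neo4j 2.x and later) the whole scheme from the question can also be expressed in one parameterised statement. A rough sketch, with the label, property, and parameter names being assumptions:

// {name} and {xname} are assumed parameters identifying the two endpoints
MERGE (n:Node {name: {name}})
MERGE (x:Node {name: {xname}})
MERGE (n)-[rel:REL_TYPE]->(x)
ON CREATE SET rel.weight = 1
ON MATCH SET rel.weight = rel.weight + 1;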