I am trying to create 500,000 nodes in a graph database. I plan to add edges as per my requirements later. I have a text file with 500,000 lines representing the data to be stored in each node.
from bulbs.neo4jserver import Graph, Config, NEO4J_URI
config = Config(NEO4J_URI)
g = Graph(config)
def get_or_create_node(text, crsqid):
v = g.vertices.index.lookup(crsqid=crsqid)
if v==None:
v = g.vertices.create(crsqid=crsqid)
print text + " - node created"
v.text = text
v.save()
return v
I then loop over each line in the text file,
count = 1
with open('titles-sorted.txt') as f:
for line in f:
get_or_create_node(line, count)
count += 1
This is terribly slow. This gives me 5000 nodes in 10 minutes. Can this be improved? Thanks
I don't see any transaction code in there, establishing one, or signaling transaction success. You should look into that -- if you're doing one transaction for every single node creation, that's going to be slow. You should probably create one transaction, insert thousands of nodes, then commit the whole batch.
I'm not familiar with bulbs, so I can't tell you how to do that with this python framework, but here is a place to start: this page suggests you can use a coding style like this, with some python/neo bindings:
with db.transaction:
foo()
also, if you're trying to load mass amounts of data and you need performance, you should check this page for information on bulk importing. It's unlikely that doing it in your own script is going to be the most performant. You might instead consider using your script to generate cypher queries, which get piped to the neo4j-shell.
Finally a thing to consider is indexes. Looks like you're indexing on crsqid - if you get rid of that index, creates may go faster. I don't know how your IDs are distributed, but it might be better to break records up into batches to test if they exist, rather than using the get_or_create() pattern.
Batch loading 500k nodes individually via REST is not ideal. Use Michael's batch loader or the Gremlin shell -- see Marko's movie recommendation blog post for an example of how to do this from the Gremlin shell.
Related
I am importing the data around 12 million nodes and 13 million relationships.
First I used the csv import with periodic commit 50000 and divided the data into different chunks, but still its taking too much time.
Then I saw the batch insertion method. But for the batch insertion method I have to create new data sets in excel sheet.
Basically I am importing the data from SqlServer: first I save the data into csv, then import it into my neo4j.
Also, I am using the neo4j community version. I did change the properties for the like all i had found on stackoverflow. But still initially with preiodic commit 50K it goes faster but after 1 million it takes too much time.
Is there anyway to import this data directly from sql in short span of time, as neo4j is famous for its fast working with big data? Any suggestions or help?
Here is the LOAD CSV used (index on numbers(num)) :
USING PERIODIC COMMIT 50000 load csv with headers from "file:c:/Users/hasham munir/Desktop/Numbers/CRTest2/Numbers.csv"
AS csvLine fieldterminator ';'
Merge (Numbers:Number {num: csvLine.Numbers}) return * ;
USING PERIODIC COMMIT 50000 load csv with headers from "file:c:/Users/hasham munir/Desktop/Numbers/CRTest2/Level1.csv"
AS csvLine fieldterminator ';'
MERGE (TermNum:Number {num: csvLine.TermNum})
MERGE (OrigNum:Number {num: (csvLine.OrigNum)})
MERGE (OrigNum)-[r:CALLS ]->(TermNum) return * ;
How long is it taking?
To give you a reference, my db is about 4m nodes, 650,000 unique relationships, ~10m-15m properties (not as large, but should provide an idea). It takes me less than 10 minutes to load in the nodes file + set multiple labels, and then load in the relationships file + set the relationships (all via LOAD CSV). This is also being done on a suped up computer, but if yours is taking hours, I would make some tweaks.
My suggestions are as follows:
Are you intentionally returning the nodes after the MERGE? I can't imagine you are doing anything with it, but either way, consider removing the RETURN *. With RETURN *, you're returning all nodes, relationships, and paths found in the query and that's bound to slow things down. (http://neo4j.com/docs/stable/query-return.html#return-return-all-elements)
Is the "num" field meant to be unique? If so, consider adding the following constraints (NOTE: this will also create the index, so no need to create it separately). I think this might speed up the MERGE (I'm not sure on that), though see next point.
CREATE CONSTRAINT ON (Numbers:Number) ASSERT Numbers.num IS UNIQUE;
If the num field is unique AND this is a brand new database (i.e. you're starting from scratch when you run this script), then call CREATE to create the nodes, rather than MERGE (for the creation of the nodes only).
As was already mentioned by Christophe, you should definitely increase the heap size to around 4g.
Let us know how it goes!
EDIT 1
I have not been able to find much relevant information on memory/performance tuning for the Windows version. What I have found leaves me with a couple of questions, and is potentially outdated.
This is potentially outdated, but provides some background on some of the different settings and the differences between Windows and Linux.
http://blog.bruggen.com/2014/02/some-neo4j-import-tweaks-what-and-where.html
Those differences between Windows & Linux have themselves changed from one version to the next (as demonstrated with the following links)
Cypher MATCH query speed,
https://stackoverflow.com/a/29055966/4471711
Michael's response above seems to indicate that if you're NOT running a java application with Neo4j, you don't need to worry about the heap (-Xmx), however that doesn't seem right in my mind given the other information I saw, but perhaps all of that other info is prior to 2.2.
I have also been through this.
http://neo4j.com/docs/stable/configuration.html
So, what I have done is set both heap (-Xmx in the neo4j.vmoptions) and the pagecache to 32g.
Can you modify your heap settings to 4096MB.
Also, in the second LOAD CSV, are the numbers used for the two first MERGE already in the database ? If yes use MATCH instead.
I would also commit at a level of 10000.
I have a list of MATCH statements which are totally unrelated to each other. But if I execute them like
MATCH (a:Person),(b:InProceedings) WHERE a.identifier = 'person/joseph-valeri' and b.identifier = 'conference/edm2008/paper/209' CREATE (a)-[r:creator]->(b)
MATCH (a:Person),(b:InProceedings) WHERE a.identifier = 'person/nell-duke' and b.identifier = 'conference/edm2008/paper/209' CREATE (a)-[r:creator]->(b)
But if I execute them at once I get the following error:
WITH is required between CREATE and MATCH (line 2, column 1)
What changes should I incorporate?
(I am new to Neo4j)
Does this need to happen in a single transaction? In which case you should be matching your nodes up front before performing the create:
MATCH (jo:Person{identifier:'person/joseph-valeri'}), (nell:Person{identifier:'person/nell-duke'}), (b:InProceedings{identifier:'conference/edm2008/paper/209'})
CREATE (jo)-[:creator]->(b), (nell)-[:creator]->(b)
If it's just the two creators you could change the create to:
CREATE (jo)-[:creator]->(b)<-[:creator]-(nell)
If this isn't what you want to achieve then effectively what you have posted is two distinct Cypher statements that you are trying to run as one, and the parser is getting confused.
Post comment edit
Given that you said millions I think that you are going to find the transaction time on performing the import prohibitive and therefore you should investigate the CSV import syntax (and specifically pay attention to PERIODIC COMMIT) if you can write to CSV instead of to the big Cypher dump?
If for some reason that is not an option and you are starting from empty then build slowly - creating nodes first. These are going to need names to keep the speed up (but these names aren't persisted, just constant in your Cypher query):
CREATE (a:Person{identifier:'person/joseph-valeri'}),
(b:Person{identifier:'person/nell-duke'}),
(zzz:Person{identifier:'person/do-you-really-want-person-in-all-these-identifiers'}),
(inProca:InProceedings{identifier:'conference/edm2008/paper/209'}),
(inProcb:InProceedings{identifier:'conference/edm2009/paper/209'})
You will have kept track of a, b .. zzz in your Python script allowing you to build the CREATE statment up with:
(a)-[:creator]->(inProcA), (zzz)-[:creator]-(inProcB)
Now if all of your nodes already exist and you just want to build the relationships in now, then you have the choice of:
Performing individual MATCH and CREATEs for each new relationship, exceuting them each individually. This looks like what your original code was doing. You should move the conditions into the MATCH rather than the WHERE clause.
MATCHing a large set of nodes and CREATEing new realtionships. This is more akin to what my initial code was doing and will require your script to be smart in generating the queries.
MERGEing existing nodes into new relationships.
Whatever you do, you'e going to need to batch the writes within the transaction or you're going to run out of memory - you can advise Neo4J to do this by using the USING PERIODIC COMMIT 50000 syntax, here is a great blog post on it.
I'm loading a Neo4j database using Cypher commands piped directly into the neo4j-shell. Some experiments suggest that subgraph batches of about 1000 lines give the optimal throughput (about 3.2ms/line, 300 lines/sec (slow!), Neo4j 2.0.1). I use MATCH statements to bind existing nodes to the loading subgraph. Here's a chopped example:
begin
...
MATCH (domain75ea8a4da9d65189999d895f536acfa5:SubDomain { shorturl: "threeboysandanoldlady.blogspot.com" })
MATCH (domainf47c8afacb0346a5d7c4b8b0e968bb74:SubDomain { shorturl: "myweeview.com" })
MATCH (domainf431704fab917205a54b2477d00a3511:SubDomain { shorturl: "www.computershopper.com" })
CREATE
(article1641203:Article { id: "1641203", url: "http://www.coolsocial.net/sites/www/blackhawknetwork.com.html", type: 4, timestamp: 1342549270, datetime: "2012-07-17 18:21:10"}),
(article1641203)-[:PUBLISHED_IN]->(domaina9b3ed6f4bc801731351b913dfc3f35a),(author104675)-[:WROTE]->(article1641203),
....
commit
Using this (ridiculously slow) method, it takes several hours to load 200K nodes (~370K relationships) and, at that point, the loading slows down even more. I presume the asymptotic slowdown is due to the overhead of the MATCH statements. They make up 1/2 of the subgraph load statements by the time the graph hits 200K nodes. There's got to be a better way of doing this, it just doesn't scale.
I'm going to try rewriting the statements with parameters (refs: What is the most efficient way to insert nodes into a neo4j database using cypher AND http://jexp.de/blog/2013/05/on-importing-data-in-neo4j-blog-series/). I expect that to help, but it seems that I will still have problems making the subgraph connections. Would using MERGE or CREATE UNIQUE instead of the MATCH statements be the way to go? There must be best practice ways to do this that I'm missing. Any other speed-up ideas?
many thanks
Use MERGE, and do smaller transactions--I've found best results with batches of 50-100 (while doing index lookups). Bigger batches are better when doing CREATE only without MATCH. Also, I recommend using a driver to send your commands over the transactional API (with parameters) instead of via neo4j-shell--it tends to be a fair bit faster.
Alternatively (might not be applicable to all use cases), keep a local "index" of the node ids you've created. For only 200k items, this should be easy to fit in a normal map/dict of string->long. This will prevent you needing to tax the index on the db, and you can do only node-ID-based lookups and CREATE statements, and create the indexes later.
The load2neo plugin worked well for me. Installation was fast+painless and it has a very cypher-like command structure that easily supports uniqueness requirements. Works with neo4j 2.0 labels.
load2neo install + curl usage example:
http://nigelsmall.com/load2neo
load2neo Geoff syntax:
http://nigelsmall.com/geoff
It is much faster (>>10x) than using Cypher via neo4j-shell.
I wasn't able to get the parameters in Cypher through neo4j-shell working despite trying everything I could find via internet search.
this is more of a best-practices question. I am implementing a search back-end for highly structured data that, in essence, consists of ontologies, terms, and a complex set of mappings between them. Neo4j seemed like a natural fit and after some prototyping I've decided to go with py2neo as a way to communicate with neo4j, mostly because of nice support for batch operations. This is more of a best practices question than anything.
What I'm getting frustrated with is that I'm having trouble with introducing the types of higher-level abstraction that I would like to in my code - I'm stuck with either using the objects directly as a mini-orm, but then I'm making lots and lots of atomic rest calls, which kills performance (I have a fairly large data set).
What I've been doing is getting my query results, using get_properties on them to batch-hydrate my objects, which preforms great and which is why I went down this route in the first place, but this makes me pass tuples of (node, properties) around in my code, which gets the job done, but isn't pretty. at all.
So I guess what I'm asking is if there's a best practice somewhere for working with a fairly rich object graph in py2neo, getting the niceties of an ORM-like later while retaining performance (which in my case means doing as much as possible as batch queries)
I am not sure whether I understand what you want, but I had a similar issue. I wanted to make a lot of calls and create a lot of nodes, indexes and relationships.. (around 1.2 million) . Here is an example of adding nodes, relationships, indexes and labels in batches using py2neo
from py2neo import neo4j, node, rel
gdb = neo4j.GraphDatabaseService("<url_of_db>")
batch = neo4j.WriteBatch(gdb)
a = batch.create(node(name='Alice'))
b = batch.create(node(name='Bob'))
batch.set_labels(a,"Female")
batch.set_labels(b,"Male")
batch.add_indexed_node("Name","first_name","alice",a) #this will create an index 'Name' if it does not exist
batch.add_indexed_node("Name","first_name","bob",b)
batch.create(rel(a,"KNOWS",b)) #adding a relationship in batch
batch.submit() #this will now listen to the db and submit the batch records. Ideally around 2k-5k records should be sent
Since your asking for best practice, here is an issue I ran into:
When adding a lot of nodes (~1M) with py2neo in a batch, my program often gets slow or crashes when the neo4j server runs out of memory. As a workaround, I split the submit in multiple batches:
from py2neo import neo4j
def chunker(seq, size):
"""
Chunker gets a list and returns slices
of the input list with the given size.
"""
for pos in xrange(0, len(seq), size):
yield seq[pos:pos + size]
def submit(graph_db, list_of_elements, size):
"""
Batch submit lots of nodes.
"""
# chunk data
for chunk in chunker(list_of_elements, size):
batch = neo4j.WriteBatch(graph_db)
for element in chunk:
n = batch.create(element)
batch.add_labels(n, 'Label')
# submit batch for chunk
batch.submit()
batch.clear()
I tried this with different chunk sizes. For me, it's fastest with ~1000 nodes per batch. But I guess this depends on the RAM/CPU of your neo4j server.
The data comes to the system continuosly with rate 300-500 TPS. I need to import it to neo4j with the following scheme:
If N node does not exist, create it
If the relation N-[rel:rel_type]->X does not exist, create it
Increment rel.weight
It seems to be impossible to solve the problem using REST batch.
Different cypher queries are too long because they generate many small transactions.
Gremlin works much faster. I collect parameters for gremlin script in array and execute it as a batch. But even though I could hardly reach the speed of 300 TPS.
I should mention that besides there will be a flow of queries ~500 TPS:
START N=node(...) MATCH N-[rel:rel_type]->X return rel.weight,X.name;
The heap size is set to 5 Gb. Additional options:
-XX:MaxPermSize=1G -XX:+CMSClassUnloadingEnabled -XX:+UseParallelGC -XX:+UseNUMA
What is optimal way and configuration for importing such kind of data?
to check whether or not the incoming node exists and has the rels to other node, you can use create unique syntax.
START n=node:node_index(newNode={N})
CREATE UNIQUE n-[:REL_TYPE]->x ;
to automatically increment the weight of the relationship, i would assume something like this (but no warranty on this, there is probably a faster way of doing it):
START n=node:node_index(newNode={N})
CREATE UNIQUE n-[rel:REL_TYPE]->x
SET rel.weight = coalesce(rel.weight?,0) +1