Background of problem
I'm currently building an app that models various geographic features (roads, towns, highways, etc.) in a graph database. The geographic data is all in GeoJSON format.
There is no LOAD JSON function in the Cypher language, so loading JSON files requires passing the fully parsed JavaScript object as a parameter and using UNWIND to access its arrays and nested objects when creating nodes. (This guide helped me out a lot to get started: Loading JSON in neo4j.) Since GeoJSON is just a spec built on JSON conventions, the load-JSON method works great for reasonably sized files.
However, geographic data files can be massive. Some of the files I'm trying to import range from 100 features to 200,000 features.
The problem I'm running into is that with these very large files, the query will not MERGE any nodes in the database until it has completely processed the file. For large files, this often exceeds the 3600s timeout limit set in neo4j. So I end up waiting for an hour to find out that I have no new data in my database.
I know that for some data the current recommendation is to convert it to CSV and then use LOAD CSV, which is optimized for bulk loading. However, I don't believe it is easy to condense GeoJSON into CSV.
Primary Question
Is it possible to send the data from a very large JSON/GeoJSON file over in smaller batches so that neo4j will commit the data intermittently?
Current Approach
To import my data, I built a simple Express app that connects to my neo4j database via the bolt protocol (using official binary JS drivers). My GeoJSON files all have a well known text (WKT) property for each feature so that I can make use of neo4j-spatial.
Here's an example of the code I would use to import a set of road data:
session.run("WITH {json} as data UNWIND data.features as features MERGE (r:Road {wkt:features.properties.wkt})", {json: jsonObject})
.then(function (result) {
var records = [];
result.records.forEach((value) => {
records.push(value);
});
console.log("query completed");
session.close();
driver.close();
return records;
})
.catch((error) => {
console.log(error);
// Close out the session objects
session.close();
driver.close();
});
As you can see I'm passing in the entire parsed GeoJSON object as a parameter in my cypher query. Is there a better way to do this with very large files to avoid the timeout issue I'm experiencing?
This answer might be helpful here: https://stackoverflow.com/a/59617967/1967693
apoc.load.json (and apoc.load.jsonArray) will stream the values of the given JSON file or URL. This can then be used as the data source for batching via apoc.periodic.iterate.
CALL apoc.periodic.iterate(
"CALL apoc.load.json('https://dummyjson.com/products', '$.features') YIELD value AS features",
"UNWIND features as feature MERGE (r:Road {wkt:feature.properties.wkt})",
{batchSize:1000, parallel:true}
)
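If the GeoJSON is already parsed in the Node app rather than reachable via a URL or file, the same batching idea can be applied to the parameter itself. This is only a sketch; it assumes a Neo4j/APOC version that supports the $param syntax and the params config option of apoc.periodic.iterate:
// batched MERGE over the parsed GeoJSON passed in as a parameter
CALL apoc.periodic.iterate(
  "UNWIND $json.features AS feature RETURN feature",
  "MERGE (r:Road {wkt: feature.properties.wkt})",
  {batchSize: 1000, parallel: false, params: {json: $json}}
)
This can be sent with session.run() exactly as in the question, still passing {json: jsonObject} as the parameter. Each batch of 1000 features is committed in its own transaction, so progress is kept even if the overall run is interrupted; parallel is set to false here because concurrent MERGEs on the same label can deadlock.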
Related
I'm new to Neo4j and currently attempting to migrate existing data into a Neo4j database. I have written a small program to convert the current data (in a bespoke format) into a large CREATE Cypher query for initial population of the database. My first iteration somewhat retains the structure of the existing object model: objects become nodes, the node type (label) is the same as the object name in the current model, and the members become properties (the member name is the property name). This is done for all fundamental types (and strings), and any member objects are decomposed in the same way as in the original object model.
This has been fine in terms of performance, and 13000+ line CREATE Cypher queries have been generated which can be executed through the web frontend/client. However, I believe the model is not ideal for a graph database, since there can be many properties. Instead I would like to decompose these 'fundamental' members (those with fundamental types) into their own nodes, each relating to a more 'abstract' node that represents the higher-level object/class. Each member would then be a node with a single property (at first; it may grow), say { value:"42" }, or I could set the node type to the data type (i.e. Integer). If my understanding is correct, this would also allow me to create relationships between the 'members' (since they are nodes and not properties), giving greater freedom when expressing relationships between members of different objects rather than just relating the parent objects to each other.
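For instance, a sketch of the decomposed model described above (all names here are made up purely for illustration):
// one 'abstract' node per object, one node per fundamental member
CREATE (obj:MyClass { name: "exampleObject" })
CREATE (m:Integer { value: "42" })
CREATE (obj)-[:HAS_MEMBER { member: "someField" }]->(m)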
The problem is that this now generates 144000+ line Cypher queries (and this isn't a large dataset in comparison to others), which the Neo4j client seems to balk at. The code highlighting appears to work in the query input box of the client (i.e. it highlights correctly, which I assume implies it parsed the query and it is valid Cypher), but when I come to run the query I get the usual "browser not responding" and then a stack overflow (no pun intended) error. What's more, the Neo4j client doesn't exit elegantly and always requires me to force-end the task, and the database sits at 2.5-3GB of memory usage for what is effectively a small amount of data (144000 lines, approx 2/3 of which are relationships, so at most ~48000 nodes). Yet I read that I should be able to deal with millions of nodes and relationships in milliseconds?
I have tried it with Firefox and Chrome. I am using the Neo4j Community Edition on Windows 10. The SDK would initially be used with C# and C++. This research is in its initial stages, so I haven't used the SDK yet.
Is this a valid approach, i.e. to initially populate the database via a CREATE query?
Also, is my approach of decomposing the data into fundamental types a good one, or are there issues likely to arise from it?
That is a very large Cypher query!!!
You would do much better to populate your database using LOAD CSV FROM... and supplying a CSV file containing the data you want to load.
For a detailed explanation, have a look at:
https://neo4j.com/developer/guide-import-csv/
(This page also discusses the batch loader for really large datasets.)
Since you are generating code for the Cypher query, I wouldn't imagine you would have too much trouble generating a CSV file.
(As an indication of performance, I have been loading a 1 million record CSV today into Neo4j running on my laptop in under two minutes.)
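As a rough illustration (the file name, label, and column names below are hypothetical), the generated CSV could then be loaded with something like:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///objects.csv" AS row
CREATE (:MyObject { name: row.name, value: row.value })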
I am working on a Rails project whose database is the OrientDB graph database. I need to transfer data from Postgres to the OrientDB graph. I have written scripts in Ruby to fetch data from Postgres and load it into the graph structure by creating the relevant edges and nodes.
However, this process is very slow and is taking months to load a million records. The graph is fairly densely connected.
I wanted to use the built-in ETL configuration provided by OrientDB, but it seems relatively complex since I need to create multiple vertices from fields in the same table and then connect them. I referred to this documentation.
Can I write a custom ETL to load data into OrientDB at the same speed as the built-in ETL tool?
Also, are there any benchmarks for the speed of data loading into OrientDB?
If ETL doesn't fit your needs, you can write a custom importer using Java or any other JVM language of your choice.
If you only need to import the database once, the best way is to use plocal access (embedded) and then move the resulting database under the server.
With this approach you can achieve the best performance, because the network isn't involved.
The code should be something like this snippet:
// open the graph factory against the embedded (plocal) database
OrientGraphFactory fc = new OrientGraphFactory("plocal:./databases/import", "admin", "admin");

// read the source table from the relational database over JDBC
Connection conn = DriverManager.getConnection("jdbc....");
Statement stmt = conn.createStatement();
ResultSet resultSet = stmt.executeQuery("SELECT * from table ");

while (resultSet.next()) {
    // one transactional graph per row; shutdown() commits the transaction
    OrientGraph graph = fc.getTx();
    OrientVertex vertex1 = graph.addVertex("class:Class1", "name", resultSet.getString("name"));
    OrientVertex vertex2 = graph.addVertex("class:Class2", "city", resultSet.getString("city"));
    graph.addEdge(null, vertex1, vertex2, "class:edgeClass");
    graph.shutdown();
}

// close the JDBC resources and the graph factory
resultSet.close();
stmt.close();
conn.close();
fc.close();
It is more pseudocode than working code, but take it as a template for the operations needed to import a single query/table from the original RDBMS.
About performance, it is quite hard to give numbers; it depends on many factors: schema complexity, type of access (plocal, remote), lookups, and the connection speed of the data source.
A few more words about Teleporter: it will import the original database schema and data into OrientDB automatically. AFAIK you already have a working OrientDB schema, and Teleporter will certainly not create the same schema on OrientDB that you did.
There are several built-in input/output formats in Giraph, but all of them support only numerical ids and values.
So is there a way to process a property graph such that both vertices and edges can have multiple keys and values, or anything close? I'm specifically interested in whether an edge can have attributes like timeCreated or type.
Also, is there some convention of using only numerical ids and data for faster processing? Specifically, is the property graph from a graph database usually filtered down to only ids and values before batch processing with Giraph?
At least from Neo4j, you can use a CSV export of node and relationship ids to generate the data for Giraph.
You can use something like this:
http://neo4j.com/blog/export-csv-from-neo4j-curl-cypher-jq/
and you can use neo4j-import to import that CSV data, or LOAD CSV for more advanced structural updates.
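For example, the relationship list could be produced with a Cypher query along these lines (a sketch; labels and property names depend on your model) and saved as CSV:
// numeric node ids plus the relationship type for each edge
MATCH (n)-[r]->(m)
RETURN id(n) AS sourceId, id(m) AS targetId, type(r) AS relType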
I'm evaluating using Neo4j Community 2.1.3 to store a list of concepts and relationships between them. I'm trying to load my sample test data (CSV files) into Neo4j using Cypher from the web interface, as described in the online manual.
My data looks something like this:
concepts.csv
id,concept
1,tree
2,apple
3,grapes
4,fruit salad
5,motor vehicle
6,internal combustion engine
relationships.csv
sourceid,targetid
2,1
4,2
4,3
5,6
6,5
And so on... For my sample, I have ~17K concepts and ~16M relationships. Following the manual, I started Neo4J server, and entered this into Cypher:
LOAD CSV WITH HEADERS FROM "file:///data/concepts.csv" AS csvLine
CREATE (c:Concept { id: csvLine.id, concept: csvLine.concept })
This worked fine and loaded my concepts. Then I tried to load my relationships.
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///data/relationships.csv" AS csvLine
MATCH (c1:Concept { id: csvLine.sourceid }),(c2:Concept { id: csvLine.targetid })
CREATE (c1)-[:RELATED_TO]->(c2)
This would run for an hour or so, but always stopped with either:
"Unknown error" (no other info!), or
"Neo.TransientError.Transaction.DeadlockDetected" with a detailed message like
"LockClient[695] can't wait on resource RWLock[RELATIONSHIP(572801), hash=267423386] since => LockClient[695] <-[:HELD_BY]- RWLock[NODE(4145), hash=1224203266] <-[:WAITING_FOR]- LockClient[691] <-[:HELD_BY]- RWLock[RELATIONSHIP(572801), hash=267423386]"
It would stop after loading maybe 200-300K relationships. I've done a "sort | uniq" on the relationships.csv so I'm pretty sure there are no duplicates. I looked at the log files in data/log but found no error message.
Has anyone seen this before? BTW, I don't mind losing a small portion of the relationships, so I'll be happy if I can just turn off ACID transactions. I also want to avoid writing code (to use the Java API) at this stage. I just want to load up my data to try it out. Is there any way to do this?
My full data set will have millions of concepts and maybe hundreds of millions of relationships. Does anyone know if Neo4J can handle this amount of data?
Thank you.
You're doing it correctly.
Do you use the neo4j-shell or the browser?
Did you do: create index on :Concept(id);?
If you don't have an index, searching for the concepts will take much longer, as it has to scan all nodes with this label for that id value. You should also check whether it uses an index for both matches by prefixing your query with PROFILE.
Never seen that deadlock before despite importing millions of relationships.
Can you share the full stack trace? If you use shell, you might want to do export STACKTRACES=true
Can you use USING PERIODIC COMMIT 1000?
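Putting those suggestions together, the import would look roughly like this (create the index first, as a separate statement, and let it come online before loading):
// index the property used in both MATCH lookups
CREATE INDEX ON :Concept(id);

// then load the relationships with smaller commits
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///data/relationships.csv" AS csvLine
MATCH (c1:Concept { id: csvLine.sourceid }), (c2:Concept { id: csvLine.targetid })
CREATE (c1)-[:RELATED_TO]->(c2)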
This is more of a best-practices question. I am implementing a search back-end for highly structured data that, in essence, consists of ontologies, terms, and a complex set of mappings between them. Neo4j seemed like a natural fit, and after some prototyping I've decided to go with py2neo as a way to communicate with Neo4j, mostly because of its nice support for batch operations.
What I'm getting frustrated with is that I'm having trouble introducing the kinds of higher-level abstraction I would like in my code. If I use the objects directly as a mini-ORM, I end up making lots and lots of atomic REST calls, which kills performance (I have a fairly large data set).
What I've been doing instead is getting my query results and using get_properties on them to batch-hydrate my objects, which performs great and is why I went down this route in the first place. But this makes me pass tuples of (node, properties) around in my code, which gets the job done but isn't pretty at all.
So I guess what I'm asking is whether there's a best practice for working with a fairly rich object graph in py2neo: getting the niceties of an ORM-like layer while retaining performance (which in my case means doing as much as possible as batch queries).
I am not sure whether I understand what you want, but I had a similar issue. I wanted to make a lot of calls and create a lot of nodes, indexes and relationships (around 1.2 million). Here is an example of adding nodes, relationships, indexes and labels in batches using py2neo:
from py2neo import neo4j, node, rel
gdb = neo4j.GraphDatabaseService("<url_of_db>")
batch = neo4j.WriteBatch(gdb)
a = batch.create(node(name='Alice'))
b = batch.create(node(name='Bob'))
batch.set_labels(a,"Female")
batch.set_labels(b,"Male")
batch.add_indexed_node("Name","first_name","alice",a) #this will create an index 'Name' if it does not exist
batch.add_indexed_node("Name","first_name","bob",b)
batch.create(rel(a,"KNOWS",b)) #adding a relationship in batch
batch.submit()  # submit the batch to the server; ideally around 2k-5k records should be sent per batch
Since you're asking for best practice, here is an issue I ran into:
When adding a lot of nodes (~1M) with py2neo in a batch, my program often got slow or crashed when the Neo4j server ran out of memory. As a workaround, I split the submit into multiple batches:
from py2neo import neo4j

def chunker(seq, size):
    """
    Chunker gets a list and returns slices
    of the input list with the given size.
    """
    for pos in xrange(0, len(seq), size):
        yield seq[pos:pos + size]

def submit(graph_db, list_of_elements, size):
    """
    Batch submit lots of nodes.
    """
    # chunk data
    for chunk in chunker(list_of_elements, size):
        batch = neo4j.WriteBatch(graph_db)
        for element in chunk:
            n = batch.create(element)
            batch.add_labels(n, 'Label')
        # submit batch for chunk
        batch.submit()
        batch.clear()
I tried this with different chunk sizes. For me, it's fastest with ~1000 nodes per batch. But I guess this depends on the RAM/CPU of your neo4j server.