Batch import in Neo4j - neo4j

I am using the BatchInserter to insert data from a CSV file which works fine when the DB is completely empty (no files in the data directory). But using the BatchInserter I cannot get data into the database and it throws an exception mentioned below. This is with the DB service stopped. I tried several ways and failed. But I need to know if there is a way to import data from a CSV into an existing DB as the CSV is prone to change.
Exception in thread "main" java.lang.IllegalStateException: Misaligned file size 68 for DynamicArrayStore[fileName:neostore.nodestore.db.labels, blockSize:60], expected version length 25
at org.neo4j.kernel.impl.store.AbstractDynamicStore.verifyFileSizeAndTruncate(AbstractDynamicStore.java:265)
at org.neo4j.kernel.impl.store.CommonAbstractStore.loadStorage(CommonAbstractStore.java:217)
at org.neo4j.kernel.impl.store.CommonAbstractStore.<init>(CommonAbstractStore.java:118)
at org.neo4j.kernel.impl.store.AbstractDynamicStore.<init>(AbstractDynamicStore.java:92)
at org.neo4j.kernel.impl.store.DynamicArrayStore.<init>(DynamicArrayStore.java:64)
at org.neo4j.kernel.impl.store.StoreFactory.newNodeStore(StoreFactory.java:327)
at org.neo4j.kernel.impl.store.StoreFactory.newNodeStore(StoreFactory.java:316)
at org.neo4j.kernel.impl.store.StoreFactory.newNeoStore(StoreFactory.java:160)
at org.neo4j.unsafe.batchinsert.BatchInserterImpl.<init>(BatchInserterImpl.java:258)
at org.neo4j.unsafe.batchinsert.BatchInserters.inserter(BatchInserters.java:94)
at org.neo4j.unsafe.batchinsert.BatchInserters.inserter(BatchInserters.java:88)
at org.neo4j.unsafe.batchinsert.BatchInserters.inserter(BatchInserters.java:63)
at org.neo4j.unsafe.batchinsert.BatchInserters.inserter(BatchInserters.java:51)
at net.xalgo.neo4j.batchinserter.DrugDatabaseInserter.start(DrugDatabaseInserter.java:30)
at net.xalgo.neo4j.batchinserter.BatchInserterApp.main(BatchInserterApp.java:56)

The Batch Inserter only works with initial data import (when the database is empty). It avoids transactions and other checks to increase performance and therefore cannot be used with an existing database.
For importing data into an already existing Neo4j database you can use LOAD CSV Cypher.

Finally I had to remove BatchInserter and use spring data Neo4jTemplate with direct cypher in a non transactional way.

Related

Importing data from two files into neo4j using LOAD CSV

I am trying to import data into neo4j using LOAD CSV
The resource file contains names of all the nodes I need to create
resource1
resource2
resource3
In another file I have all the properties of that resource
resource1,name,xyz
resource1,year,1920
resource1,age,100
resource2,length,300
resource2,age,30
I Managed to load the nodes into neo4j but how do I import the second file so that I can add the data to that particular node as properties, I tried setting the key dynamically
USING PERIODIC COMMIT
LOAD CSV FROM 'file:///infobox.csv' AS line
MERGE (:Node{line[1]:line[2]})
neo4j doesn't allow setting the key dynamically?
How do I solve this?
Natively, Neo4j doesn't allow setting the key dynamically. But you can install APOC Procedures use apoc.create.setProperty to do this.
Try something like:
USING PERIODIC COMMIT
LOAD CSV FROM 'file:///infobox.csv' AS line
// match the node by resource1, resource2, etc
MATCH(node:Node{resource_id : line[0]})
CALL apoc.create.setProperty(node, line[1], line[2])
RETURN *
Note: Remember to install APOC procedures according the version of Neo4j you are using. Take a look in the Version Compatibility Matrix.

Elasticsearch can't sync 2GB database from neo4j with GprahAware

The neo4j2elasticsearch works on my machine when the database is only 250KB. But the the databse is around 2GB. It won't sync anymore. I'm wondering is because of these parameters in the config file:
#optional, size of the in-memory queue that queues up operations to be synchronised to Elasticsearch, defaults to 10000
com.graphaware.module.ES.queueSize=10000
#optional, size of the batch size to use during re-initialization, defaults to 1000
com.graphaware.module.ES.reindexBatchSize=2000
I'm wondering what is the unit of in-memory queue size 10000 and how to estimate what parameter to set based on my database size.
Here is the debug file :
neo4j debug.log failure loading
The database is re-initialized but there are only empty neo4j-index-relationship/neo4j-index-node index in the elasticsearch database
Just for information, here is the debug file for successful 250KB database loading:
neo4j debug.log successful loading
seems like Re-indexing nodes... step is missing in the 2GB database loading procedure.
The log doesn't have a line saying it will re-index, did you configure:
com.graphaware.module.ES.initializeUntil=
To a timestamp that warrants re-indexing on startup? Otherwise, it will only index new data. It is explained at the bottom of https://github.com/graphaware/neo4j-to-elasticsearch :
...in order to trigger (re-)indexing, i.e. sending every node that
should be indexed to Elasticsearch upon Neo4j restart, you have to
manually intervene...
So try to create a new node and see if the synchronization is working for new stuff to eliminate this situation (most common one).

Neo4j APOC import error

I have a data model that starts with a single record, this has a custom "recordId" that's a uuid, then it relates out to other nodes and they then in turn relate to each other. That starting node is what defines the data that "belongs" together, as in if we had separate databases inside neo4j. I need to export this data, into a backup data-set that can be re-imported into either the same or a new database with ease
After some help, I'm using APOC to do the export:
call apoc.export.cypher.query("MATCH (start:installations)
WHERE start.recordId = \"XXXXXXXX-XXX-XXX-XXXX-XXXXXXXXXXXXX\"
CALL apoc.path.subgraphAll(start, {}) YIELD nodes, relationships
RETURN nodes, relationships", "/var/lib/neo4j/data/test_export.cypher", {})
There are then 2 problems I'm having:
Problem 1 is the data that's exported has internal neo4j identifiers to generate the relationships. This is bad if we need to import into a new database and the UNIQUE IMPORT ID values already exist. I need to have this data generated with my own custom recordIds as the point of reference.
Problem 2 is that the import doesn't even work.
call apoc.cypher.runFile("/var/lib/neo4j/data/test_export.cypher") yield row, result
returns:
Failed to invoke procedure apoc.cypher.runFile: Caused by: java.lang.RuntimeException: Error accessing file /var/lib/neo4j/data/test_export.cypher
I'm hoping someone can help me figure out what may be going on, but I'm not sure what additional info is helpful. No one in the Neo4j slack channel has been able to help find a solution.
Thanks.
problem1:
The exported file does not contain any internal neo4j ids. It is not safe to use neo4j ids out of the database, since they are not globally unique. So you should not use them to transfer data from one database to another.
If you are about to use globally uniqe ids, you can use an external plugin like GraphAware UUID plugin. (disclaimer: I work for GraphAware)
problem2:
If you cannot access the file, then possible reasons:
apoc.import.file.enabled=true is not set in neo4j.conf
os level
permission is not set

Spark job-server release memory

I've set up a spark job-server (see https://github.com/spark-jobserver/spark-jobserver/tree/jobserver-0.6.2-spark-1.6.1) in standalone mode.
I've created a default context to use. Currently I have 2 kind of jobs on this context:
Synchronization with another server:
Dumps the data from the other server's db;
Perform some joins, reduce the data, generating a new DF;
Save the obtained DF in a parquet file;
Load this parquet file as a temp table and cache it;
Queries: perform sql queries on the cached table.
The only object that I persist is the final table that will be cached.
What I don't get is why when I perform the synchronization, all the assigned memory is used and never released, but, if I load the parquet file directly (doing a fresh start of the server, using the parquet file generated previously), only a fraction of the memory is used.
I'm missing something? There is a way to free up unused memory?
Thank you
You can free up memory by unpersisting cached table: yourTable.unpersist()

Failure on CSV import into Neo4j 2.2.0-RC01

I'm having some weird issues when using the batch load into Neo4j 2.2.0-RC1. I am trying to import 10 different node sets (for different labels) along with 12 relationship files. The data sets vary in size - some node types have ~200-300k records, some are small (50-100 records). For most node types I have a separate file with a header and separate file with data for each of the sets (the data is generated from the DB and I want to be able to regenerate the dump files without worrying about preparing the :ID columns, describing data types etc.)
I am re-running the import task a number of times (with options --processors 1 --stacktrace) and I keep getting different errors (not a single change in the actual dataset) which makes me think it might be something concurrency-related. Sometimes import simply hangs with a message like this:
Nodes
[>:36.75 MB/s------------------------|*PROPERTIES-----------------------------------------|NOD|] 0
In most cases, it crashes with an error like below, except the number of nodes that it manages to import fine differs from run to run.
[>:27.23 MB/s-------------|*PROPERTIES--------------------------|NO|v:19.62 MB/s---------------]100kImport error: Panic called, so exiting
java.lang.RuntimeException: Panic called, so exiting
at org.neo4j.unsafe.impl.batchimport.staging.StageExecution.stillExecuting(StageExecution.java:63)
at org.neo4j.unsafe.impl.batchimport.staging.ExecutionSupervisor.anyStillExecuting(ExecutionSupervisor.java:79)
at org.neo4j.unsafe.impl.batchimport.staging.ExecutionSupervisor.finishAwareSleep(ExecutionSupervisor.java:102)
at org.neo4j.unsafe.impl.batchimport.staging.ExecutionSupervisor.supervise(ExecutionSupervisor.java:64)
at org.neo4j.unsafe.impl.batchimport.staging.ExecutionSupervisors.superviseDynamicExecution(ExecutionSupervisors.java:65)
at org.neo4j.unsafe.impl.batchimport.ParallelBatchImporter.executeStages(ParallelBatchImporter.java:226)
at org.neo4j.unsafe.impl.batchimport.ParallelBatchImporter.doImport(ParallelBatchImporter.java:151)
at org.neo4j.tooling.ImportTool.main(ImportTool.java:263)
Caused by: java.lang.RuntimeException: Panic called, so exiting
at org.neo4j.unsafe.impl.batchimport.staging.AbstractStep.assertHealthy(AbstractStep.java:189)
at org.neo4j.unsafe.impl.batchimport.staging.ProducerStep.process(ProducerStep.java:77)
at org.neo4j.unsafe.impl.batchimport.staging.ProducerStep$1.run(ProducerStep.java:54)
Caused by: java.lang.IllegalStateException: Nodes for any specific group must be added in sequence before adding nodes for any other group
at org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.EncodingIdMapper.put(EncodingIdMapper.java:137)
at org.neo4j.unsafe.impl.batchimport.NodeEncoderStep.process(NodeEncoderStep.java:76)
at org.neo4j.unsafe.impl.batchimport.NodeEncoderStep.process(NodeEncoderStep.java:41)
at org.neo4j.unsafe.impl.batchimport.staging.ExecutorServiceStep$2.call(ExecutorServiceStep.java:96)
at org.neo4j.unsafe.impl.batchimport.staging.ExecutorServiceStep$2.call(ExecutorServiceStep.java:87)
at org.neo4j.unsafe.impl.batchimport.executor.DynamicTaskExecutor$Processor.run(DynamicTaskExecutor.java:217)
I managed to run it successfully once, which, again, seems to imply that some sort of timing issue is at play.
Unfortunately I cannot provide the datasets as they contain confidential data.
The weirdest thing of all is that if I split the load into 2 different sets (the datasets are almost separate subgraphs, they have only 2 relationships in common) then all works fine (so not likely to be data related), but even loading just nodes doesn't work if I put them all into a single command. And because it's not possible to force a load into an existing database, loading it in 2 steps is sadly not an option.
1) Is that a known issue and if so, any ETA on a fix / issue that I could follow?
2) If not, is there any troubleshooting I can do to get to the bottom of it? The messages.log file in the target DB directory contains VERY little output, it would be nice if I could get some more details on what's going wrong.
I've spotted the problem. Thanks for reporting/asking. The next release will include this fix. I see an additional set of integration tests for the import tool. I'll provide link to commit once it's in.

Resources