Elasticsearch can't sync 2GB database from neo4j with GraphAware - neo4j

The neo4j-to-elasticsearch sync works on my machine when the database is only 250KB, but now the database is around 2GB and it won't sync anymore. I'm wondering whether it is because of these parameters in the config file:
#optional, size of the in-memory queue that queues up operations to be synchronised to Elasticsearch, defaults to 10000
com.graphaware.module.ES.queueSize=10000
#optional, size of the batch size to use during re-initialization, defaults to 1000
com.graphaware.module.ES.reindexBatchSize=2000
I'm wondering what the unit of the in-memory queue size (10000) is, and how to estimate what value to set based on my database size.
Here is the debug file:
neo4j debug.log failure loading
The database is re-initialized, but there are only empty neo4j-index-relationship/neo4j-index-node indices in the Elasticsearch database.
Just for information, here is the debug file for the successful 250KB database loading:
neo4j debug.log successful loading
It seems like the "Re-indexing nodes..." step is missing from the 2GB database loading procedure.

The log doesn't have a line saying it will re-index. Did you configure:
com.graphaware.module.ES.initializeUntil=
to a timestamp that warrants re-indexing on startup? Otherwise, it will only index new data. This is explained at the bottom of https://github.com/graphaware/neo4j-to-elasticsearch :
...in order to trigger (re-)indexing, i.e. sending every node that
should be indexed to Elasticsearch upon Neo4j restart, you have to
manually intervene...
So try creating a new node and check whether synchronization works for new data, to rule out this (most common) situation.
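For example (a sketch only; the timestamp value and the test label/property are placeholders, and my reading of the README is that initializeUntil takes a millisecond timestamp that must still be in the future when Neo4j starts for re-indexing to be triggered):
#hypothetical: a System.currentTimeMillis()-style value slightly in the future at restart time
com.graphaware.module.ES.initializeUntil=1496400000000
Then, after a restart, create a throwaway node to verify that new data reaches Elasticsearch at all:
CREATE (n:SyncTest {name: 'sync-check'}) RETURN n;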

Related

Spark job-server release memory

I've set up a spark job-server (see https://github.com/spark-jobserver/spark-jobserver/tree/jobserver-0.6.2-spark-1.6.1) in standalone mode.
I've created a default context to use. Currently I have 2 kinds of jobs on this context:
Synchronization with another server:
Dumps the data from the other server's db;
Perform some joins, reduce the data, generating a new DF;
Save the obtained DF in a parquet file;
Load this parquet file as a temp table and cache it;
Queries: perform sql queries on the cached table.
The only object that I persist is the final table that will be cached.
What I don't get is why, when I perform the synchronization, all the assigned memory is used and never released; but if I load the parquet file directly (doing a fresh start of the server, using the parquet file generated previously), only a fraction of the memory is used.
Am I missing something? Is there a way to free up unused memory?
Thank you
You can free up memory by unpersisting the cached table: yourTable.unpersist()
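In the synchronization job itself you could drop the previous cached version before registering and caching the fresh one. A minimal sketch against the Spark 1.6 Java API; sqlContext, the table name my_table and the parquet path are assumptions:
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

// inside the sync job, after the new parquet file has been written
if (sqlContext.isCached("my_table")) {
    sqlContext.uncacheTable("my_table");      // release executor memory held by the old cache
}
DataFrame fresh = sqlContext.read().parquet("/data/sync/latest.parquet");
fresh.registerTempTable("my_table");          // replace the temp table registration
sqlContext.cacheTable("my_table");            // the new version is cached (lazily) on first query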

Batch import in Neo4j

I am using the BatchInserter to insert data from a CSV file, which works fine when the DB is completely empty (no files in the data directory). But with an existing database the BatchInserter cannot get data in and throws the exception below (this is with the DB service stopped). I tried several ways and failed. I need to know whether there is a way to import data from a CSV into an existing DB, as the CSV is prone to change.
Exception in thread "main" java.lang.IllegalStateException: Misaligned file size 68 for DynamicArrayStore[fileName:neostore.nodestore.db.labels, blockSize:60], expected version length 25
at org.neo4j.kernel.impl.store.AbstractDynamicStore.verifyFileSizeAndTruncate(AbstractDynamicStore.java:265)
at org.neo4j.kernel.impl.store.CommonAbstractStore.loadStorage(CommonAbstractStore.java:217)
at org.neo4j.kernel.impl.store.CommonAbstractStore.<init>(CommonAbstractStore.java:118)
at org.neo4j.kernel.impl.store.AbstractDynamicStore.<init>(AbstractDynamicStore.java:92)
at org.neo4j.kernel.impl.store.DynamicArrayStore.<init>(DynamicArrayStore.java:64)
at org.neo4j.kernel.impl.store.StoreFactory.newNodeStore(StoreFactory.java:327)
at org.neo4j.kernel.impl.store.StoreFactory.newNodeStore(StoreFactory.java:316)
at org.neo4j.kernel.impl.store.StoreFactory.newNeoStore(StoreFactory.java:160)
at org.neo4j.unsafe.batchinsert.BatchInserterImpl.<init>(BatchInserterImpl.java:258)
at org.neo4j.unsafe.batchinsert.BatchInserters.inserter(BatchInserters.java:94)
at org.neo4j.unsafe.batchinsert.BatchInserters.inserter(BatchInserters.java:88)
at org.neo4j.unsafe.batchinsert.BatchInserters.inserter(BatchInserters.java:63)
at org.neo4j.unsafe.batchinsert.BatchInserters.inserter(BatchInserters.java:51)
at net.xalgo.neo4j.batchinserter.DrugDatabaseInserter.start(DrugDatabaseInserter.java:30)
at net.xalgo.neo4j.batchinserter.BatchInserterApp.main(BatchInserterApp.java:56)
The BatchInserter only works for the initial data import (when the database is empty). It avoids transactions and other checks to increase performance and therefore cannot be used with an existing database.
For importing data into an already existing Neo4j database you can use Cypher's LOAD CSV.
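A sketch of what that could look like (file name, label and property names are placeholders; it runs against the live database, e.g. from neo4j-shell or the browser):
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///drugs.csv' AS row
MERGE (d:Drug {id: row.id})
SET d.name = row.name;
MERGE makes the statement safe to re-run when the CSV changes, since existing nodes are updated instead of duplicated.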
Finally, I had to remove the BatchInserter and use Spring Data's Neo4jTemplate with direct Cypher in a non-transactional way.

Failure on CSV import into Neo4j 2.2.0-RC01

I'm having some weird issues when using the batch load into Neo4j 2.2.0-RC1. I am trying to import 10 different node sets (for different labels) along with 12 relationship files. The data sets vary in size - some node types have ~200-300k records, some are small (50-100 records). For most node types I have a separate file with a header and a separate file with data for each of the sets (the data is generated from the DB and I want to be able to regenerate the dump files without worrying about preparing the :ID columns, describing data types, etc.).
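For reference, the invocation looks roughly like this (paths, labels and relationship types are placeholders; the "header-file,data-file" syntax is how the tool combines a separate header file with its data file):
bin/neo4j-import --into /data/graph.db --id-type string \
  --nodes:Customer "customers-header.csv,customers.csv" \
  --nodes:Product "products-header.csv,products.csv" \
  --relationships:BOUGHT "bought-header.csv,bought.csv" \
  --processors 1 --stacktrace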
I am re-running the import task a number of times (with the options --processors 1 --stacktrace) and I keep getting different errors (without a single change in the actual dataset), which makes me think it might be something concurrency-related. Sometimes the import simply hangs with a message like this:
Nodes
[>:36.75 MB/s------------------------|*PROPERTIES-----------------------------------------|NOD|] 0
In most cases it crashes with an error like the one below, except that the number of nodes it manages to import fine differs from run to run.
[>:27.23 MB/s-------------|*PROPERTIES--------------------------|NO|v:19.62 MB/s---------------]100k
Import error: Panic called, so exiting
java.lang.RuntimeException: Panic called, so exiting
at org.neo4j.unsafe.impl.batchimport.staging.StageExecution.stillExecuting(StageExecution.java:63)
at org.neo4j.unsafe.impl.batchimport.staging.ExecutionSupervisor.anyStillExecuting(ExecutionSupervisor.java:79)
at org.neo4j.unsafe.impl.batchimport.staging.ExecutionSupervisor.finishAwareSleep(ExecutionSupervisor.java:102)
at org.neo4j.unsafe.impl.batchimport.staging.ExecutionSupervisor.supervise(ExecutionSupervisor.java:64)
at org.neo4j.unsafe.impl.batchimport.staging.ExecutionSupervisors.superviseDynamicExecution(ExecutionSupervisors.java:65)
at org.neo4j.unsafe.impl.batchimport.ParallelBatchImporter.executeStages(ParallelBatchImporter.java:226)
at org.neo4j.unsafe.impl.batchimport.ParallelBatchImporter.doImport(ParallelBatchImporter.java:151)
at org.neo4j.tooling.ImportTool.main(ImportTool.java:263)
Caused by: java.lang.RuntimeException: Panic called, so exiting
at org.neo4j.unsafe.impl.batchimport.staging.AbstractStep.assertHealthy(AbstractStep.java:189)
at org.neo4j.unsafe.impl.batchimport.staging.ProducerStep.process(ProducerStep.java:77)
at org.neo4j.unsafe.impl.batchimport.staging.ProducerStep$1.run(ProducerStep.java:54)
Caused by: java.lang.IllegalStateException: Nodes for any specific group must be added in sequence before adding nodes for any other group
at org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.EncodingIdMapper.put(EncodingIdMapper.java:137)
at org.neo4j.unsafe.impl.batchimport.NodeEncoderStep.process(NodeEncoderStep.java:76)
at org.neo4j.unsafe.impl.batchimport.NodeEncoderStep.process(NodeEncoderStep.java:41)
at org.neo4j.unsafe.impl.batchimport.staging.ExecutorServiceStep$2.call(ExecutorServiceStep.java:96)
at org.neo4j.unsafe.impl.batchimport.staging.ExecutorServiceStep$2.call(ExecutorServiceStep.java:87)
at org.neo4j.unsafe.impl.batchimport.executor.DynamicTaskExecutor$Processor.run(DynamicTaskExecutor.java:217)
I managed to run it successfully once, which, again, seems to imply that some sort of timing issue is at play.
Unfortunately I cannot provide the datasets as they contain confidential data.
The weirdest thing of all is that if I split the load into 2 different sets (the datasets are almost separate subgraphs, they have only 2 relationships in common), then it all works fine (so it's not likely to be data related), but even loading just the nodes doesn't work if I put them all into a single command. And because it's not possible to force a load into an existing database, loading it in 2 steps is sadly not an option.
1) Is this a known issue, and if so, is there any ETA on a fix / an issue that I could follow?
2) If not, is there any troubleshooting I can do to get to the bottom of it? The messages.log file in the target DB directory contains VERY little output; it would be nice if I could get some more details on what's going wrong.
I've spotted the problem - thanks for reporting/asking. The next release will include the fix, along with an additional set of integration tests for the import tool. I'll provide a link to the commit once it's in.

neo4j-shell readonly mode can't work with index

I modified the demo EmbeddedNeo4jWithIndexing.java (creating nodes with an index).
I commented out all shutdown() calls.
After doing that, I opened the database with neo4j-shell -readonly, then ran
start n=node:nodes("*:*") return n;
and I get 0 rows.
If I open the database without -readonly and run the command above, it succeeds!
I am very confused about this.
I tried 1.8.2 and 1.9.
We disallow read-only access to running databases because Lucene ignores the read-only aspect and still does merges asynchronously behind the scenes, clashing with the merge threads of the live database.
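In this particular case my understanding is that the commented-out shutdown() is also part of the problem: the store and its Lucene index are only fully flushed on a clean shutdown, so a read-only shell (which cannot perform recovery) can see an empty index. A minimal sketch based on the standard embedded examples (the store path is a placeholder):
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

final GraphDatabaseService graphDb =
        new GraphDatabaseFactory().newEmbeddedDatabase("target/neo4j-store");
// ... create nodes and index entries as in EmbeddedNeo4jWithIndexing ...
Runtime.getRuntime().addShutdownHook(new Thread() {
    @Override
    public void run() {
        graphDb.shutdown(); // flushes the store and the Lucene index before the JVM exits
    }
});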

EmbeddedReadOnlyGraphDatabase complaining about locked database

Exception in thread "main" java.lang.IllegalStateException: Database locked.
at org.neo4j.kernel.InternalAbstractGraphDatabase.create(InternalAbstractGraphDatabase.java:289)
at org.neo4j.kernel.InternalAbstractGraphDatabase.run(InternalAbstractGraphDatabase.java:227)
at org.neo4j.kernel.EmbeddedReadOnlyGraphDatabase.<init>(EmbeddedReadOnlyGraphDatabase.java:81)
at org.neo4j.kernel.EmbeddedReadOnlyGraphDatabase.<init>(EmbeddedReadOnlyGraphDatabase.java:72)
at org.neo4j.kernel.EmbeddedReadOnlyGraphDatabase.<init>(EmbeddedReadOnlyGraphDatabase.java:54)
at QueryNodeReadOnly.main(QueryNodeReadOnly.java:55)
This is using the 1.8.2 version of Neo4j. I've written a program that opens the db in read-only mode, runs a query, and then sleeps for a while before exiting.
Here is the relevant code:
graphDb = new EmbeddedReadOnlyGraphDatabase( dbname); // Line 55 - the exception.
......
......
......
......
......
if(sleepVal > 0)
Thread.sleep(sleepVal);
I reckon I should not be getting this error. There are only 2 processes that open the db, both in read-only mode. In fact, it should work even if I open the db while another process has opened it to write to it.
We disallow two database instances accessing the same files on disk at the same time - even in read-only mode.
The reason is that while we do not allow you to modify the database in read-only mode, Lucene will still write to disk when servicing your read requests, and having two instances access those same index files leads to race conditions and index corruption.
Why do you want two instances accessing the same files at the same time anyway? What is your use case?
You can't make multiple connections to an embedded database. Maybe you should consider using the REST server.
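For example, the second process could talk to the Neo4j server over HTTP instead of opening the store files itself. A sketch against the /db/data/cypher endpoint that ships with 1.8/1.9 (host, port and query are placeholders):
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Scanner;

public class RestCypherQuery {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost:7474/db/data/cypher");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setRequestProperty("Accept", "application/json");
        conn.setDoOutput(true);
        String payload = "{\"query\": \"START n=node(*) RETURN count(n)\", \"params\": {}}";
        conn.getOutputStream().write(payload.getBytes("UTF-8"));
        InputStream in = conn.getInputStream();
        Scanner scanner = new Scanner(in, "UTF-8").useDelimiter("\\A");
        System.out.println(scanner.next()); // JSON result returned by the server
        scanner.close();
    }
}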

Resources