Failure on CSV import into Neo4j 2.2.0-RC01 - neo4j

I'm having some weird issues when using the batch load into Neo4j 2.2.0-RC1. I am trying to import 10 different node sets (for different labels) along with 12 relationship files. The data sets vary in size - some node types have ~200-300k records, some are small (50-100 records). For most node types I have a separate file with a header and separate file with data for each of the sets (the data is generated from the DB and I want to be able to regenerate the dump files without worrying about preparing the :ID columns, describing data types etc.)
I am re-running the import task a number of times (with options --processors 1 --stacktrace) and I keep getting different errors (not a single change in the actual dataset) which makes me think it might be something concurrency-related. Sometimes import simply hangs with a message like this:
Nodes
[>:36.75 MB/s------------------------|*PROPERTIES-----------------------------------------|NOD|] 0
In most cases, it crashes with an error like below, except the number of nodes that it manages to import fine differs from run to run.
[>:27.23 MB/s-------------|*PROPERTIES--------------------------|NO|v:19.62 MB/s---------------]100kImport error: Panic called, so exiting
java.lang.RuntimeException: Panic called, so exiting
at org.neo4j.unsafe.impl.batchimport.staging.StageExecution.stillExecuting(StageExecution.java:63)
at org.neo4j.unsafe.impl.batchimport.staging.ExecutionSupervisor.anyStillExecuting(ExecutionSupervisor.java:79)
at org.neo4j.unsafe.impl.batchimport.staging.ExecutionSupervisor.finishAwareSleep(ExecutionSupervisor.java:102)
at org.neo4j.unsafe.impl.batchimport.staging.ExecutionSupervisor.supervise(ExecutionSupervisor.java:64)
at org.neo4j.unsafe.impl.batchimport.staging.ExecutionSupervisors.superviseDynamicExecution(ExecutionSupervisors.java:65)
at org.neo4j.unsafe.impl.batchimport.ParallelBatchImporter.executeStages(ParallelBatchImporter.java:226)
at org.neo4j.unsafe.impl.batchimport.ParallelBatchImporter.doImport(ParallelBatchImporter.java:151)
at org.neo4j.tooling.ImportTool.main(ImportTool.java:263)
Caused by: java.lang.RuntimeException: Panic called, so exiting
at org.neo4j.unsafe.impl.batchimport.staging.AbstractStep.assertHealthy(AbstractStep.java:189)
at org.neo4j.unsafe.impl.batchimport.staging.ProducerStep.process(ProducerStep.java:77)
at org.neo4j.unsafe.impl.batchimport.staging.ProducerStep$1.run(ProducerStep.java:54)
Caused by: java.lang.IllegalStateException: Nodes for any specific group must be added in sequence before adding nodes for any other group
at org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.EncodingIdMapper.put(EncodingIdMapper.java:137)
at org.neo4j.unsafe.impl.batchimport.NodeEncoderStep.process(NodeEncoderStep.java:76)
at org.neo4j.unsafe.impl.batchimport.NodeEncoderStep.process(NodeEncoderStep.java:41)
at org.neo4j.unsafe.impl.batchimport.staging.ExecutorServiceStep$2.call(ExecutorServiceStep.java:96)
at org.neo4j.unsafe.impl.batchimport.staging.ExecutorServiceStep$2.call(ExecutorServiceStep.java:87)
at org.neo4j.unsafe.impl.batchimport.executor.DynamicTaskExecutor$Processor.run(DynamicTaskExecutor.java:217)
I managed to run it successfully once, which, again, seems to imply that some sort of timing issue is at play.
Unfortunately I cannot provide the datasets as they contain confidential data.
The weirdest thing of all is that if I split the load into 2 different sets (the datasets are almost separate subgraphs, they have only 2 relationships in common) then all works fine (so not likely to be data related), but even loading just nodes doesn't work if I put them all into a single command. And because it's not possible to force a load into an existing database, loading it in 2 steps is sadly not an option.
1) Is that a known issue and if so, any ETA on a fix / issue that I could follow?
2) If not, is there any troubleshooting I can do to get to the bottom of it? The messages.log file in the target DB directory contains VERY little output, it would be nice if I could get some more details on what's going wrong.

I've spotted the problem. Thanks for reporting/asking. The next release will include this fix. I also see the need for an additional set of integration tests for the import tool. I'll provide a link to the commit once it's in.
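For anyone who wants to see what the IllegalStateException in the trace is asserting, here is a minimal sketch of the rule it states: all nodes of one ID group must arrive as one contiguous run. This is not the importer's code. GroupOrderCheck and interleavedGroups are hypothetical names made up for illustration, and the input is simply the group name of each node in the order the nodes would be fed in.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Neo4j: it only mirrors the rule in the exception message.
class GroupOrderCheck {

    // Returns the groups that show up again after a different group has already started.
    static List<String> interleavedGroups(List<String> groupPerNode) {
        Set<String> finished = new HashSet<>();      // groups whose run has ended
        List<String> offenders = new ArrayList<>();
        String current = null;
        for (String group : groupPerNode) {
            if (group.equals(current)) {
                continue;                            // still inside the same run
            }
            if (current != null) {
                finished.add(current);               // the previous run is over
            }
            if (finished.contains(group) && !offenders.contains(group)) {
                offenders.add(group);                // this group resumed after another one
            }
            current = group;
        }
        return offenders;
    }

    public static void main(String[] args) {
        // Contiguous runs: nothing to report.
        System.out.println(interleavedGroups(List.of("person", "person", "company", "company")));
        // "person" resumes after "company" started: this ordering breaks the rule.
        System.out.println(interleavedGroups(List.of("person", "company", "person")));
    }
}
```

A data set that passes a check like this can still hit the error if the tool itself feeds batches out of order, which is what the varying results from run to run suggest.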

Related

Delete Bigtable row in Apache Beam 2.2.0

In Dataflow 1.x versions, we could use CloudBigtableIO.writeToTable(TABLE_ID) to create, update, and delete Bigtable rows. As long as a DoFn was configured to output a Mutation object, it could output either a Put or a Delete, and CloudBigtableIO.writeToTable() successfully created, updated, or deleted a row for the given RowID.
It seems that the new Beam 2.2.0 API uses the BigtableIO.write() function, which works with KV<RowID, Iterable<Mutation>>, where the Iterable contains a set of row-level operations. I have found how to use that to work on cell-level data, so it's OK to create new rows and create/delete columns, but how do we delete rows now, given an existing RowID?
Any help appreciated!
** Some further clarification:
From this document: https://cloud.google.com/bigtable/docs/dataflow-hbase I understand that changing the dependency ArtifactID from bigtable-hbase-dataflow to bigtable-hbase-beam should be compatible with Beam version 2.2.0, and the article suggests doing Bigtable writes (and hence deletes) in the old way by using CloudBigtableIO.writeToTable(). However, that requires imports from the com.google.cloud.bigtable.dataflow family of dependencies, which the Release Notes suggest is deprecated and shouldn't be used (and indeed it seems incompatible with the new Configuration classes etc.).
** Further Update:
It looks like my pom.xml didn't refresh properly after the change from bigtable-hbase-dataflow to bigtable-hbase-beam ArtifactID. Once the project got updated, I am able to import from the com.google.cloud.bigtable.beam.* branch, which seems to be working at least for the minimal test.
HOWEVER: It looks like now there are two different Mutation classes:
com.google.bigtable.v2.Mutation and
org.apache.hadoop.hbase.client.Mutation ?
And in order to get everything to work together, it has to be specified properly which Mutation is used for which operation?
Is there a better way to do this?
Unfortunately, Apache Beam 2.2.0 doesn't provide a native interface for deleting an entire row (including the row key) in Bigtable. The only full solution would be to continue using the CloudBigtableIO class as you already mentioned.
A different solution would be to just delete all the cells from the row. This way, you can fully move forward with using the BigtableIO class. However, this solution does NOT delete the row key itself, so the cost of storing the row key remains. If your application requires deleting many rows, this solution may not be ideal.
import com.google.bigtable.v2.Mutation;
import com.google.bigtable.v2.Mutation.DeleteFromRow;

// Mutation that deletes all cells from a row (the row key itself is not removed).
Mutation deleteAllCells = Mutation.newBuilder()
    .setDeleteFromRow(DeleteFromRow.getDefaultInstance())
    .build();
I would suggest that you should continue using CloudBigtableIO and bigtable-hbase-beam. It shouldn't be too different from CloudBigtableIO in bigtable-hbase-dataflow.
CloudBigtableIO uses the HBase org.apache.hadoop.hbase.client.Mutation class and translates those mutations into the Bigtable equivalents under the covers.

Neo4j APOC import error

I have a data model that starts with a single record, this has a custom "recordId" that's a uuid, then it relates out to other nodes and they then in turn relate to each other. That starting node is what defines the data that "belongs" together, as in if we had separate databases inside neo4j. I need to export this data, into a backup data-set that can be re-imported into either the same or a new database with ease
After some help, I'm using APOC to do the export:
call apoc.export.cypher.query("MATCH (start:installations)
WHERE start.recordId = \"XXXXXXXX-XXX-XXX-XXXX-XXXXXXXXXXXXX\"
CALL apoc.path.subgraphAll(start, {}) YIELD nodes, relationships
RETURN nodes, relationships", "/var/lib/neo4j/data/test_export.cypher", {})
There are then 2 problems I'm having:
Problem 1 is the data that's exported has internal neo4j identifiers to generate the relationships. This is bad if we need to import into a new database and the UNIQUE IMPORT ID values already exist. I need to have this data generated with my own custom recordIds as the point of reference.
Problem 2 is that the import doesn't even work.
call apoc.cypher.runFile("/var/lib/neo4j/data/test_export.cypher") yield row, result
returns:
Failed to invoke procedure apoc.cypher.runFile: Caused by: java.lang.RuntimeException: Error accessing file /var/lib/neo4j/data/test_export.cypher
I'm hoping someone can help me figure out what may be going on, but I'm not sure what additional info is helpful. No one in the Neo4j slack channel has been able to help find a solution.
Thanks.
problem1:
The exported file does not contain any internal neo4j ids. It is not safe to use neo4j ids outside the database, since they are not globally unique. So you should not use them to transfer data from one database to another.
If you want to use globally unique ids, you can use an external plugin like the GraphAware UUID plugin. (disclaimer: I work for GraphAware)
problem2:
If you cannot access the file, the possible reasons are:
apoc.import.file.enabled=true is not set in neo4j.conf
the OS-level permission is not set

Error when performing schema changes in DSE 5.0

I am trying to get my head around using graphs for the first time - and as you can imagine, I am having a fair bit of trial and error.
Consequently, I am doing a lot of:
create schema
find a mistake / modelling error
delete schema
rinse and repeat
All of which is completely fine, but for the fact that I seem to constantly be getting the following error:
Schema migration interrupted. The migration operation will continue in the background.
Now if I get this error when doing a schema.clear(), then, it actually doesn't continue in the background at all - it is lying!
I have to rerun the command and sometimes, several times to get the schema deleted.
And if that isn't annoying enough - I might end up with the following, too.
Script evaluation exceeded the configured threshold for the request: [149a3432-b1b3-45b7-8e68-d21c0325d877 - schema.clear()]
I have a single DC, two racks, with 2 nodes each - as a training cluster.
I am using DSE 5.0.1
I am using the GossipingPropertyFileSnitch snitch
(I also have the rack properties file for the above snitch type.)
And I also ensure that I have run:
:remote config timeout max
in the gremlin-console, too...
So I am not really sure how it can complain about timing out and since this is all on my local PC in Virtual machines - and is only being used by me - I don't understand how something is interrupting the command I just asked it to complete, either!
Thanks if anyone has any ideas!
-Gavin
With special thanks to Jeremy at DataStax - I have a solution for my time out issue;
I still don't understand why it complained in the first place - in that I was the only person using the cluster - on virtual machines on my own PC.... but nonetheless - I can successfully complete commands in the gremlin console.
The required change is in dse.yaml:
alter the following configuration item to a value higher than the default of 30 seconds. (I set it to 180 sec)
realtime_evaluation_timeout: 180 sec

Loading a .trig file with inference to Fuseki using the 'tdbloader' bulk loader?

I am currently writing some Java code extracting some data and writing them as Linked Data, using the TRIG syntax. I am now using Jena, and Fuseki to create a SPARQL endpoint to query and visualize this data.
The data is written so that each source dataset gives me a .trig file containing one named graph. So I want to load those files into Fuseki. Except that it doesn't seem to understand the TriG syntax...
If I remove the named graphs, and rename the files as .ttl, everything loads perfectly in the default graphs. But if I try to import trig files :
using Fuseki's webapp uploader, it either crashes ("Can't make new graphs") or adds nothing except the prefixes, as if the graphs other than the default ones could not be added (the logs say nothing helpful except the error code and description).
using Java code, the process is too slow. I used this technique : " Loading a .trig file into TDB? " but my trig files are pretty big, so this solution is not very good for me.
So I tried to use the bulk loader, the console command 'tdbloader'. This time everything seems fine, but in the webapp, there is still no data.
You can see the process going fine here : Quads are added just fine
But the result still keeps only the default graph and its original data : Nothing is added
So, I don't know what to do. The guys behind Jena and Fuseki suggested not to use the bulk loader in the Java code (rather than the command line tool), so that's one solution I guess I'd like to avoid.
Did I miss something obvious about how to load TRIG files to Fuseki? Thanks.
UPDATE :
As it seemed to be a problem in my configuration (see the comments of this post for a link to my config file; I cannot post more than 2 links), I tried to add some kind of specification for some named graphs I would like to see added to the dataset on Fuseki.
I added code to link (with ja:namedgraph) external graphs that I added via tdbloader. This seems to work. Great!
Now another problem: there's no inference, even though my config file specifies an inference model. I set queries to run with the named graphs merged as the default graph, but this does not seem to carry the OWL inference rules. So simple queries work, but 1/ I have to specify the graph I query (with "FROM") and 2/ there is no inference over my data.
The two methods are to use the tdb bulkloader offline or you can POST data into the dataset directly. (i.e. HTTP POST operations to http://localhost:3030/ds).
You can test whether your graphs are there with a query like
SELECT (count(*) AS ?C) { GRAPH ?g { ?s ?p ?o } }
The named graphs will show up when the Fuseki server is started unless your configuration of the SPARQL services only exports one graph.
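As a rough illustration of the POST route mentioned above, here is a minimal sketch that builds the HTTP request with the JDK's java.net.http client (Java 11+). The class name TrigUpload, the helper buildUpload and the tiny TriG document are made up for the example; the dataset URL is the one quoted above, and application/trig is the registered media type for TriG. It only builds the request, so nothing is sent until you pass it to an HttpClient.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of Jena or Fuseki.
class TrigUpload {

    // Builds (but does not send) a POST of TriG text to the dataset URL.
    static HttpRequest buildUpload(String datasetUrl, String trigText) {
        return HttpRequest.newBuilder(URI.create(datasetUrl))
                .header("Content-Type", "application/trig")
                .POST(HttpRequest.BodyPublishers.ofString(trigText, StandardCharsets.UTF_8))
                .build();
    }

    public static void main(String[] args) {
        // One named graph with a single triple, just to have something to post.
        String trig = "@prefix ex: <http://example.org/> .\n"
                + "ex:g1 { ex:s ex:p ex:o . }\n";
        HttpRequest request = buildUpload("http://localhost:3030/ds", trig);
        System.out.println(request.method() + " " + request.uri());
        // To send it for real (needs a running, writable Fuseki dataset):
        // HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```

For big files, HttpRequest.BodyPublishers.ofFile(path) would avoid loading the whole document into a String. Running the count query above afterwards should show whether the named graph arrived.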

java.lang.stackoverflowerror on small import

I am doing some work with neo4j based around working out who knows who and what they do, it is in the format of
Company-node
product-node
person-node
and the relationships between them as
company borders,by location
person works at company
company has product.
I have a spreadsheet that has all the information written down and a macro that takes the information and converts it into Cypher. The code comes to around 5000 lines.
When I try to import it I get an unknown error if I try to run it in the internet browser. If I run it in the shell it goes the whole way through and then gives the error
Error occurred in server thread; nested exception is:
java.lang.StackOverflowError
My heap size is set to 3 GB.
Anyone have any ideas on what the error is and how to fix it?
First of all, it has nothing to do with your heap size; it is related to the stack size. If you want to increase the stack size, use the -Xss parameter.
Also, the stack is used to hold intermediate variables and function calls; your import is somehow exceeding the stack size set in your configuration.
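To make the stack-versus-heap point concrete, here is a small sketch that recurses until the stack runs out on two threads with different stack sizes. StackDepthDemo and maxDepth are names invented for this example, and the two sizes are arbitrary. It relies on the standard Thread constructor that takes a stackSize argument, which the JVM is allowed to treat as a hint, so the exact depths vary by platform and JVM.

```java
// Hypothetical demo class; it only shows that recursion depth is bounded by stack size.
class StackDepthDemo {

    // Recurses on a fresh thread until its stack is exhausted; returns the depth reached.
    static int maxDepth(long stackSizeBytes) {
        int[] depth = {0};
        Runnable probe = () -> {
            try {
                descend(depth);
            } catch (StackOverflowError expected) {
                // The stack ran out; depth[0] holds how far the recursion got.
            }
        };
        Thread thread = new Thread(null, probe, "depth-probe", stackSizeBytes);
        thread.start();
        try {
            thread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the probe", e);
        }
        return depth[0];
    }

    // Unbounded recursion on purpose: each call adds one frame to the thread's stack.
    private static void descend(int[] depth) {
        depth[0]++;
        descend(depth);
    }

    public static void main(String[] args) {
        System.out.println("512 KB stack: depth " + maxDepth(512L * 1024));
        System.out.println("8 MB stack:   depth " + maxDepth(8L * 1024 * 1024));
    }
}
```

Changing the heap setting (-Xmx) should leave both numbers roughly where they are, while -Xss changes the default stack size that new threads get.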
