Neo4J OutOfMemory error when querying

I've recently started working with Neo4J and so far I haven't been able to find an answer to the problems I'm having, in particular with the server. I'm using version 1.8.1 and running the server as a service on Windows, not embedded. The graph I have has around 7m nodes and nearly 11m relationships.
With small queries, even lots of them, things run nicely. However, when I try to pull back more complex queries, potentially thousands of rows, things go sour. If I'm using the console, I get nothing, and then after a few minutes or more the word undefined appears (it's trying to do something in JavaScript, but I'm not sure what). If I'm using Neo4JClient in .NET, it times out (I'm working through a WCF service), and I suspect my problems are server-side.
Here is a sample cypher query that has caused me problems in the console:
start begin = node:idx(ID="1234")
MATCH begin-[r1?:RELATED_TO]-n1-[r2?:RELATED_TO]-n2-[r3?:RELATED_TO]-n3-[r4?:RELATED_TO]-n4
RETURN begin.Title?, r1.RelationType?, n1.Title?, r2.RelationType?, n2.Title?, r3.RelationType?, n3.Title?, r4.RelationType?, n4.Title?;
I've looked through the logs and I'm receiving the following severe error:
SEVERE: The exception contained within MappableContainerException could not be mapped to a response, re-throwing to the HTTP container
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Unknown Source)
at java.lang.AbstractStringBuilder.expandCapacity(Unknown Source)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(Unknown Source)
at java.lang.AbstractStringBuilder.append(Unknown Source)
at java.lang.StringBuffer.append(Unknown Source)
at java.io.StringWriter.write(Unknown Source)
at java.io.PrintWriter.newLine(Unknown Source)
at java.io.PrintWriter.println(Unknown Source)
at java.io.PrintWriter.println(Unknown Source)
at org.neo4j.cypher.PipeExecutionResult$$anonfun$dumpToString$1.apply(PipeExecutionResult.scala:96)
at org.neo4j.cypher.PipeExecutionResult$$anonfun$dumpToString$1.apply(PipeExecutionResult.scala:96)
at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
at scala.collection.immutable.List.foreach(List.scala:45)
at org.neo4j.cypher.PipeExecutionResult.dumpToString(PipeExecutionResult.scala:96)
at org.neo4j.cypher.PipeExecutionResult.dumpToString(PipeExecutionResult.scala:124)
at org.neo4j.cypher.javacompat.ExecutionResult.toString(ExecutionResult.java:90)
at org.neo4j.shell.kernel.apps.Start.exec(Start.java:72)
at org.neo4j.shell.kernel.apps.ReadOnlyGraphDatabaseApp.execute(ReadOnlyGraphDatabaseApp.java:32)
at org.neo4j.shell.impl.AbstractAppServer.interpretLine(AbstractAppServer.java:127)
at org.neo4j.shell.kernel.GraphDatabaseShellServer.interpretLine(GraphDatabaseShellServer.java:92)
at org.neo4j.shell.impl.AbstractClient.evaluate(AbstractClient.java:130)
at org.neo4j.shell.impl.AbstractClient.evaluate(AbstractClient.java:114)
at org.neo4j.server.webadmin.rest.ShellSession.evaluate(ShellSession.java:96)
at org.neo4j.server.webadmin.rest.ConsoleService.exec(ConsoleService.java:123)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$ResponseOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:205)
at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288)
Making an educated guess from the stack trace, am I simply pulling back too many records? It's running out of memory while expanding the StringBuffer.
I wondered whether GC could be playing a part, so I got hold of GCViewer. It doesn't seem to be garbage collection; I can add a screenshot from GCViewer if you think it would be useful.
I've allocated the JVM anywhere between the default value and 8 GB of memory. Here are some of my settings from my configuration files (I'll try to include only the relevant ones). Let me know if you need any more.
Neo4J.properties
# Default values for the low-level graph engine
use_memory_mapped_buffers=false
# Keep logical logs, helps debugging but uses more disk space, enabled for legacy reasons
keep_logical_logs=true
Neo4J-server.properties
# HTTP logging is disabled. HTTP logging can be enabled by setting this property to 'true'.
org.neo4j.server.http.log.enabled=false
Neo4J-Wrapper.conf (possibly inexpertly slotted together)
# Uncomment the following line to enable garbage collection logging
wrapper.java.additional.4=-Xloggc:data/log/neo4j-gc.log
# Setting a different Garbage Collector as recommended by Neo4J
wrapper.java.additional.5=-XX:+UseConcMarkSweepGC
# other beneficial settings that should boost performance
wrapper.java.additional.6=-d64
wrapper.java.additional.7=-server
wrapper.java.additional.8=-Xss1024k
# Initial Java Heap Size (in MB)
wrapper.java.initmemory=1024
# Maximum Java Heap Size (in MB)
wrapper.java.maxmemory=8000
Any help would be gratefully appreciated.

Your query is simply too complex. With a graph that large, you must allocate appropriate memory to be sure you won't hit your heap limit. You might want to play a little with the low-level I/O configuration (see the memory-mapped I/O examples in the Neo4j docs).
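For instance, a hedged sketch of the memory-mapped store settings in neo4j.properties (the property names are the standard 1.x ones; the values are only illustrative and not tuned for your graph):
# illustrative values only - size these against your store files and available RAM
neostore.nodestore.db.mapped_memory=250M
neostore.relationshipstore.db.mapped_memory=1G
neostore.propertystore.db.mapped_memory=500M
neostore.propertystore.db.strings.mapped_memory=500M
neostore.propertystore.db.arrays.mapped_memory=50M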
However, your query could be simplified to this:
start begin = node:idx(ID="1234")
MATCH p=begin-[r1:RELATED_TO*0..4]-n4
RETURN p

Craig, the problem is that you're using the Neo4j shell, which is just an ops tool: it collects all the data in memory before sending it back, and it was never meant to handle huge result sets.
You probably want to run your queries directly against the HTTP endpoint with streaming enabled (the X-Stream: true HTTP header); then that problem goes away.
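For illustration, a minimal Java sketch of calling the 1.8-era Cypher REST endpoint with streaming enabled; the local URL, port, and the use of the simplified query from the other answer are assumptions, not something specific to your setup:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class StreamingCypherQuery {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost:7474/db/data/cypher"); // assumes a default local server
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setRequestProperty("Accept", "application/json");
        conn.setRequestProperty("X-Stream", "true"); // ask the server to stream instead of buffering the result
        String body = "{\"query\": \"start begin = node:idx(ID='1234') "
                + "MATCH p=begin-[:RELATED_TO*0..4]-n4 RETURN p\", \"params\": {}}";
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // consume the streamed JSON as it arrives
            }
        }
    }
}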

Related

Is there a way to override the ExceptionResolver of JsonTemplateLayout of Log4j2

With Spring Boot + OpenTelemetry, the stack trace for every exception runs into hundreds of lines. Apart from the top few, most of the lines are often not useful for troubleshooting the issue. The default truncation can solve the large-event problem, but since it cuts off at an arbitrary point we may miss some key data.
Our requirement is to produce something of this sort in the JsonTemplateLayout context:
java.lang.Exception: top-level-exception
at MyClass.methodA(MyClass.java:25)
<15 more frames, maximum>
...truncated...
Caused by: java.lang.RuntimeException: root-cause-exception
at SomeOtherClass.methodB(SomeOtherClass.java:55)
<15 more frames, maximum>
...truncated...
This way we don't lose the cause chain to truncation, but at the same time we have no more than N frames at each level of the chain (N=16 in the example above).
If there were a way to pass in a custom resolver for the stack trace this could work, but I couldn't find one.
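To make the requirement concrete, here is a hedged plain-Java sketch of the truncation rule we're after (this is not a Log4j2 resolver, just the formatting logic; the class name and MAX_FRAMES constant are made up for illustration):
public class StackTraceTruncation {
    private static final int MAX_FRAMES = 16; // "N" from the example above

    // Render a throwable chain, keeping at most MAX_FRAMES frames per cause level.
    static String render(Throwable t) {
        StringBuilder sb = new StringBuilder();
        boolean first = true;
        for (Throwable cur = t; cur != null; cur = cur.getCause()) {
            sb.append(first ? "" : "Caused by: ").append(cur).append('\n');
            first = false;
            StackTraceElement[] frames = cur.getStackTrace();
            int keep = Math.min(MAX_FRAMES, frames.length);
            for (int i = 0; i < keep; i++) {
                sb.append("\tat ").append(frames[i]).append('\n');
            }
            if (frames.length > keep) {
                sb.append("\t...truncated...\n");
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(render(new Exception("top-level-exception",
                new RuntimeException("root-cause-exception"))));
    }
}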

Error using s3 connector with landoop docker container for kafka

When creating a sink connector with the following configuration
connector.class=io.confluent.connect.s3.S3SinkConnector
s3.region=us-west-2
topics.dir=topics
flush.size=3
schema.compatibility=NONE
topics=my_topic
tasks.max=1
s3.part.size=5242880
format.class=io.confluent.connect.s3.format.avro.AvroFormat
# added after comment
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
partitioner.class=io.confluent.connect.storage.partitioner.DefaultPartitioner
schema.generator.class=io.confluent.connect.storage.hive.schema.DefaultSchemaGenerator
storage.class=io.confluent.connect.s3.storage.S3Storage
s3.bucket.name=my-bucket
and running it, I get the following error:
org.apache.kafka.connect.errors.DataException: coyote-test-avro
at io.confluent.connect.avro.AvroConverter.toConnectData(AvroConverter.java:97)
at org.apache.kafka.connect.runtime.WorkerSinkTask.convertMessages(WorkerSinkTask.java:453)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:287)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:198)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:166)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:170)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:214)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.kafka.common.errors.SerializationException: Error retrieving Avro schema for id 91319
Caused by: io.confluent.kafka.schemaregistry.client.rest.exceptions.RestClientException: Schema not found; error code: 40403
at io.confluent.kafka.schemaregistry.client.rest.RestService.sendHttpRequest(RestService.java:192)
at io.confluent.kafka.schemaregistry.client.rest.RestService.httpRequest(RestService.java:218)
at io.confluent.kafka.schemaregistry.client.rest.RestService.getId(RestService.java:394)
at io.confluent.kafka.schemaregistry.client.rest.RestService.getId(RestService.java:387)
at io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient.getSchemaByIdFromRegistry(CachedSchemaRegistryClient.java:65)
at io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient.getBySubjectAndId(CachedSchemaRegistryClient.java:138)
at io.confluent.kafka.serializers.AbstractKafkaAvroDeserializer.deserialize(AbstractKafkaAvroDeserializer.java:122)
at io.confluent.kafka.serializers.AbstractKafkaAvroDeserializer.deserializeWithSchemaAndVersion(AbstractKafkaAvroDeserializer.java:194)
at io.confluent.connect.avro.AvroConverter$Deserializer.deserialize(AvroConverter.java:121)
at io.confluent.connect.avro.AvroConverter.toConnectData(AvroConverter.java:84)
at org.apache.kafka.connect.runtime.WorkerSinkTask.convertMessages(WorkerSinkTask.java:453)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:287)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:198)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:166)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:170)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:214)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
But my topic has a schema, as I can see using the UI provided by the docker container landoop/fast-data-dev. And even if I try to write raw data to S3 by changing the following configs
value.converter=org.apache.kafka.connect.converters.ByteArrayConverter
format.class=io.confluent.connect.s3.format.bytearray.ByteArrayFormat
storage.class=io.confluent.connect.s3.storage.S3Storage
schema.compatibility=NONE
and removing schema.generator.class, the same error appears, even though to my understanding this should not use an Avro schema.
To be able to write to S3 I set the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in my container, but it seems the problem occurs before that point anyway.
I imagine there might be a problem with the versions. As mentioned above, I use the container landoop/fast-data-dev in a docker-machine (it doesn't work in the native Docker for Mac), and the producer and consumer work perfectly.
This is the About section of the UI (screenshot not included).
I looked at the Connect logs but couldn't find any helpful information; if you can tell me what I should look for, I'll add the relevant lines (the full logs are too big).
Every single topic message must be encoded as Avro, as specified by the schema registry.
The converter looks at the 4 bytes that follow the leading magic byte of the raw Kafka data (for both keys and values), converts them to an integer (in your case, the ID in the error), and looks that ID up in the registry.
If the data isn't Avro, or is otherwise bad, you get either the error here or an error about an invalid magic byte.
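To check whether your keys and values actually carry that framing, here is a rough Java sketch (assuming a reasonably recent kafka-clients 2.x on the classpath, a broker at localhost:9092, and the topic name from your config; all of those are assumptions) that dumps the schema id of a few raw records:
import java.nio.ByteBuffer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class WireFormatCheck {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption: broker address
        props.put("group.id", "wire-format-check"); // throwaway consumer group
        props.put("auto.offset.reset", "earliest");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my_topic"));
            // A single short poll may come back empty while the group rebalances; poll a few times if needed.
            for (ConsumerRecord<byte[], byte[]> rec : consumer.poll(Duration.ofSeconds(10))) {
                inspect("key", rec.key());
                inspect("value", rec.value());
            }
        }
    }

    static void inspect(String part, byte[] data) {
        // Confluent framing: byte 0 is the magic byte (0x0), bytes 1-4 are the schema id (big-endian int).
        if (data == null || data.length < 5 || data[0] != 0x0) {
            System.out.println(part + ": not Confluent-Avro framed");
            return;
        }
        System.out.println(part + ": schema id " + ByteBuffer.wrap(data, 1, 4).getInt());
    }
}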
And this error isn't a Connect error. You can reproduce it using the Avro console consumer if you add the print-key property.
Assuming that is the case, one solution is to change the key serde to a byte-array deserializer so it skips the Avro lookups.
Otherwise, since you cannot delete the messages in Kafka, the only solution is to figure out why the producers are sending bad data, fix them, and then either move the Connect consumer group to the latest offset with valid data, wait for the invalid data to expire from the topic, or move to a new topic entirely.
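If it does turn out that only the keys are the problem, a possible (untested) tweak to the connector config, reusing the ByteArrayConverter already shown above for values, would be:
key.converter=org.apache.kafka.connect.converters.ByteArrayConverter
# key.converter.schema.registry.url is then no longer needed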

Solr post.jar crashes with "content is not allowed in prolog"

I'm trying to evaluate Solr, but I can't get website crawling to work with the recursive option on. I've searched all over for an answer but had no luck.
Environment: Windows Server 2012 r2, java version "1.8.0_171", solr-7.3.0.
When running the post.jar tool I get the following error:
java -Dauto=yes -Dc=testcore -Ddata=web -Drecursive=2 -Ddelay=10 -jar post.jar http://localhost/
SimplePostTool version 5.0.0
Posting web pages to Solr url http://localhost:8983/solr/testcore/update/extract
Entering auto mode. Indexing pages with content-types corresponding to file endings xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
Entering recursive mode, depth=2, delay=10s
Entering crawl at level 0 (1 links total, 1 new)
POSTed web resource http://localhost/ (depth: 0)
[Fatal Error] :1:1: Content is not allowed in prolog.
Exception in thread "main" java.lang.RuntimeException: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
at org.apache.solr.util.SimplePostTool$PageFetcher.getLinksFromWebPage(SimplePostTool.java:1252)
at org.apache.solr.util.SimplePostTool.webCrawl(SimplePostTool.java:616)
at org.apache.solr.util.SimplePostTool.postWebPages(SimplePostTool.java:563)
at org.apache.solr.util.SimplePostTool.doWebMode(SimplePostTool.java:365)
at org.apache.solr.util.SimplePostTool.execute(SimplePostTool.java:187)
at org.apache.solr.util.SimplePostTool.main(SimplePostTool.java:172)
Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown Source)
at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)
at org.apache.solr.util.SimplePostTool.makeDom(SimplePostTool.java:1061)
at org.apache.solr.util.SimplePostTool$PageFetcher.getLinksFromWebPage(SimplePostTool.java:1232)
... 5 more
I can index all the links (to files and to other pages) in http://localhost/ manually if I do them one by one with the recursive option off, so I don't think there are any files or links with special characters. Thank you all, your help is appreciated.
Remove the -Drecursive=2 option; it was causing the issue. Use the command below:
java -Dauto=yes -Dc=testcore -Ddata=web -Ddelay=10 -jar post.jar http://localhost/
I couldn't get the post.jar tool to work correctly. After trying and troubleshooting Nutch 1.8 I was finally able to have it crawl webpages and follow links automatically.
This is what I did: install Cygwin, extract Nutch into the cygwin/home folder, download Hadoop-0.20.20-core.jar and copy it to cygwin/home/apache-nutch-1.8/lib.
After doing this I was able to complete the Nutch tutorial here: https://wiki.apache.org/nutch/NutchTutorial
There were a few other minor hiccups along the way, but I don't really remember what those were (I need to work on better documentation...). Anyway, if anybody is trying this in an environment similar to mine, feel free to send me a message.
With Drupal I solved this with a six-line shell script: no need for Nutch, etc., and the R&D and environment headaches that it entails:
#!/bin/bash
x=0
while [ "$x" != "37142" ]
do
/opt/solr/bin/post -c drupal_dev https://www.[yoursite].com/node/$x
let "x+=1"
done
You could dynamically generate the highest node number using drush.
You could easily adapt this to use a list of URLs generated by using wget to crawl your site, or simply post them as wget crawls your site. I plan to do this if I get push-back from marketing about using /node/[nodeId] urls.
This particular shell script is slow enough that I didn't even need to throw in a delay.

How to implement a custom file parser in Google DataFlow for a Google Cloud Storage file

I have a custom file format in Google Cloud Storage and I want to read it from Google DataFlow.
I've implemented a Source and a Reader by subclassing FileBasedReader, but then I realized it didn't support reading from Google Cloud Storage (while FileBasedSink actually does...), so I'm not sure what the best way to solve that is here...
I tried to subclass TextIO, but I couldn't get anywhere with that, as it doesn't seem to be designed for subclassing.
Any good idea on how to deal with that?
Thanks.
Update to reflect the comments
File pattern used: gs://mybucket/my.json
Implemented the Source class from FileBasedSource:
MessageSource<T> extends FileBasedSource<T>
Implemented the Reader class (what I really care about here) from FileBasedReader:
MessageReader<T> extends FileBasedReader<T>
Process for reading is:
MySource source = // instantiate source
Pipeline p = Pipeline.create(options);
p.apply(TextIO.Read.from(options.getSource()).named("ReadFileData"))
.apply(ParDo.of(new DoFn<String, String>() {
And the getSource() comes from this command line parameter (verified correct):
--source=gs://${BUCKET_NAME}/my.json \
Am I missing anything?
2nd UPDATE
When running source.getEstimatedSizeBytes(options), it tells me no handler was found:
java.io.IOException: Unable to find handler for gs://mybucket/my.json
at com.google.cloud.dataflow.sdk.util.IOChannelUtils.getFactory(IOChannelUtils.java:186)
at com.google.cloud.dataflow.sdk.io.FileBasedSource.getEstimatedSizeBytes(FileBasedSource.java:182)
at com.etc.TrackingDataPipeline.main(TrackingDataPipeline.java:66)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:293)
at java.lang.Thread.run(Thread.java:745)
I thought the FileBasedSource was supposed to handle GCS?
From the stack trace you show in "2nd Update", it looks like you have called getEstimatedSizeBytes directly from your main() method. This is expected to lead to the error you see.
The standard URL scheme handlers are registered when a pipeline runner is constructed. In your example code, that would happen when you call Pipeline.create(options) (this calls PipelineRunner.fromOptions(options), where the standard handlers are registered).
If you want to have the standard URL schemes registered in a context other than running a pipeline, you can explicitly call IOChannelUtils.registerStandardIOFactories(). I should note that this is not a supported API, but reaching a bit "under the hood". As such, it may change at any time.
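For illustration, a rough sketch (against the 1.x Dataflow SDK) of estimating the size outside of a pipeline run; the class name and the MessageSource constructor are hypothetical, and registerStandardIOFactories() is the unsupported call mentioned above:
import com.google.cloud.dataflow.sdk.options.PipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.util.IOChannelUtils;

public class EstimateSourceSize {
    public static void main(String[] args) throws Exception {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        // Register the standard URL scheme handlers (gs:// among them) explicitly,
        // since no PipelineRunner is constructed here. Unsupported API; may change.
        IOChannelUtils.registerStandardIOFactories();
        // Hypothetical constructor for the FileBasedSource subclass described in the question.
        MessageSource<String> source = new MessageSource<>("gs://mybucket/my.json");
        System.out.println("Estimated size: " + source.getEstimatedSizeBytes(options) + " bytes");
    }
}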

can't debug "Unknown error" in Neo4j

I need to load ~29 million nodes from a CSV file (with USING PERIODIC COMMIT), but I'm getting "Unknown error" after the first ~75k nodes are loaded. I've tried changing the commit size (250, 500, and 1000), increasing the Java heap (-Xmx4096m), and using memory mapping, but nothing changes (except the number of nodes that get loaded: with commit size 500 I get "Unknown error" after 75,499 nodes, and with commit size 250 I get "Unknown error" after 75,749 nodes).
I'm doing it in the browser, using Neo4j 2.1.7 on a remote machine with 10GB of RAM and Windows Server 2012. Here's my code:
USING PERIODIC COMMIT 1000
LOAD CSV FROM "file:/C:/Users/thiago.marzagao/Desktop/CSVs/cnpj.csv" AS node
CREATE (:PessoaJuridica {id: node[0], razaoSocial: node[1], nomeFantasia: node[2], CNAE: node[3], porte: node[4], dataAbertura: node[5], situacao: node[6], dataSituacao: node[7], endereco: node[8], CEP: node[9], municipio: node[10], UF: node[11], tel: node[12], email: node[13]})
The really bad part is that the nioneo_logical.log files have some weird encoding that no text editor can figure out. All I see is eÿÿÿÿ414141, ÿÿÿÿÿÿÿÿ, etc. The messages file, in turn, ends with hundreds of garbage collection warnings, like these:
2015-02-05 17:16:54.596+0000 WARN [o.n.k.EmbeddedGraphDatabase]: GC Monitor: Application threads blocked for 304ms.
2015-02-05 17:16:55.033+0000 WARN [o.n.k.EmbeddedGraphDatabase]: GC Monitor: Application threads blocked for 238ms.
2015-02-05 17:16:55.471+0000 WARN [o.n.k.EmbeddedGraphDatabase]: GC Monitor: Application threads blocked for 231ms.
I've found somewhat related questions but not exactly what I'm looking for.
What am I missing?
The browser is the worst choice for running such an import, not least because of HTTP timeouts.
Enough RAM helps, as does a fast disk.
Try using bin/Neo4jShell.bat, which connects to the running server, and make sure the CSV file is available locally.
Those nioneo.*log files are logical logs (write-ahead logs for transactions).
The log files you're looking for are data/log/*.log and data/graph.db/messages.log.
Something else you can do is open the browser inspector, go to the Network/Requests tab, and re-run the query so that you can get the raw HTTP response. We just discussed that and will try to dump it directly to the JS console in the future.
