The logs for my Cloud Dataflow job contain multiple messages of the form:
[GC (Allocation Failure) [PSYoungGen: 72352K->160K(71168K)]
120584K->48392K(158720K), 0.0022739 secs] [Times: user=0.00 sys=0.00,
real=0.00 secs]
What does this mean? What should I do about it?
The "Allocation Failure" message is a completely normal part of Java memory management; it just means the young generation had no room left for a new allocation, so the JVM triggered a garbage collection. It does not mean the JVM has run out of memory. Unless there is already a reason to suspect memory problems, the Java GC log is generally not worth looking at for Dataflow pipelines.
Please see Java GC (Allocation Failure) for additional information.
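For readers who want to see what the numbers in such a line stand for, here is a minimal sketch that parses the message quoted in the question. It assumes the Java 8 ParallelGC log layout shown there (young generation before->after(capacity), then whole heap before->after(capacity), then the pause time); the function name `parse_gc_line` and the returned field names are made up for this illustration.

```python
import re

# Matches a Java 8 ParallelGC minor-collection line like the one in the question.
GC_LINE = re.compile(
    r"\[GC \((?P<cause>[^)]+)\) "
    r"\[PSYoungGen: (?P<young_before>\d+)K->(?P<young_after>\d+)K\((?P<young_cap>\d+)K\)\] "
    r"(?P<heap_before>\d+)K->(?P<heap_after>\d+)K\((?P<heap_cap>\d+)K\), "
    r"(?P<pause>[\d.]+) secs\]"
)


def parse_gc_line(line):
    """Return a dict describing one minor GC, or None if the line does not match."""
    match = GC_LINE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "cause": fields["cause"],
        # Memory released from the young generation by this collection.
        "young_freed_kb": int(fields["young_before"]) - int(fields["young_after"]),
        # Whole-heap usage after the collection, and the heap's current capacity.
        "heap_after_kb": int(fields["heap_after"]),
        "heap_capacity_kb": int(fields["heap_cap"]),
        "pause_ms": float(fields["pause"]) * 1000.0,
    }


# The log message from the question, joined onto one line.
sample = (
    "[GC (Allocation Failure) [PSYoungGen: 72352K->160K(71168K)] "
    "120584K->48392K(158720K), 0.0022739 secs] "
    "[Times: user=0.00 sys=0.00, real=0.00 secs]"
)
info = parse_gc_line(sample)
print(info)
```

In this example the collection paused the application for roughly 2 ms and left the heap well below its capacity, which is what a healthy minor GC looks like.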
I'm running Neo4j Community Edition 3.5.5 on an AWS instance with 8 GB of RAM.
For the first few months it ran fine and returned results in milliseconds, but lately it keeps stopping and restarting on its own. Sometimes it does not start for hours, even when we start it manually.
Can anyone please help me with this? I'm getting the logs below.
tail -100f /var/log/neo4j/neo4j.log
2019-07-29 13:17:52.570+0000 WARN The client is unauthorized due to authentication failure.
2019-09-04 05:33:52.328+0000 WARN The client is unauthorized due to authentication failure.
2019-10-17 15:18:14.652+0000 INFO Transaction with id 2683388 has been automatically rolled back due to transaction timeout.
nohup: ignoring input
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00000006e5400000, 3670016000, 0) failed; error='Cannot allocate memory' (errno=12)
There is insufficient memory for the Java Runtime Environment to continue.
Native memory allocation (mmap) failed to map 3670016000 bytes for committing reserved memory.
An error report file with more information is saved as:
/home/ubuntu/hs_err_pid8965.log
nohup: ignoring input
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00000006e5400000, 3670016000, 0) failed; error='Cannot allocate memory' (errno=12)
There is insufficient memory for the Java Runtime Environment to continue.
Native memory allocation (mmap) failed to map 3670016000 bytes for committing reserved memory.
An error report file with more information is saved as:
/home/ubuntu/hs_err_pid9050.log
nohup: ignoring input
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00000006e5400000, 3670016000, 0) failed; error='Cannot allocate memory' (errno=12)
2019-10-17 17:14:44.651+0000 INFO Transaction with id 2689294 has been automatically rolled back due to transaction timeout.
This can happen when you run a lot of MERGE operations without the proper indexes in place. Otherwise, try increasing the heap size in the config file.
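When tuning memory settings, it helps to know how much memory the JVM was actually asking for when it failed. The following is a rough sketch, not a diagnosis: it pulls the byte count out of the `os::commit_memory` warning in the question's log and converts it to MiB. The helper name `failed_commit_mib` is hypothetical.

```python
import re

# Captures the requested size (second argument) from the HotSpot warning line.
COMMIT_RE = re.compile(r"os::commit_memory\(0x[0-9a-fA-F]+, (\d+), \d+\) failed")


def failed_commit_mib(log_line):
    """Return the size of the failed allocation in MiB, or None if no match."""
    match = COMMIT_RE.search(log_line)
    if match is None:
        return None
    return int(match.group(1)) / (1024 * 1024)


# Warning line copied from the log in the question.
line = (
    "OpenJDK 64-Bit Server VM warning: INFO: "
    "os::commit_memory(0x00000006e5400000, 3670016000, 0) failed; "
    "error='Cannot allocate memory' (errno=12)"
)
mib = failed_commit_mib(line)
print(f"JVM tried to commit {mib:.0f} MiB")
```

Comparing that figure with the memory actually free on the 8 GB instance at startup shows whether the configured sizes fit on the machine at all.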
I am running a PySpark job (Python 3.5, Spark 2.1, Java 8) in yarn-client mode from an edge node with spark2-submit. The job succeeds, and the result DataFrame is written to HDFS and seems correct (we have not yet found any error in its data).
The issue is that I see a lot of ERROR messages (about 6,000), and I would like to understand what is wrong and whether it affects the final DataFrame.
All ERROR messages look like this one:
18/06/01 14:08:36 INFO codegen.CodeGenerator: Code generated in 45.712788 ms
18/06/01 14:08:37 INFO executor.Executor: Finished task 33.0 in stage 34.0 (TID 2312). 4600 bytes result sent to driver
18/06/01 14:08:37 INFO executor.Executor: Finished task 117.0 in stage 34.0 (TID 2316). 3801 bytes result sent to driver
18/06/01 14:08:40 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 2512
18/06/01 14:08:40 INFO executor.Executor: Running task 190.1 in stage 34.0 (TID 2512)
18/06/01 14:08:40 INFO storage.ShuffleBlockFetcherIterator: Getting 28 non-empty blocks out of 193 blocks
18/06/01 14:08:40 INFO storage.ShuffleBlockFetcherIterator: Started 5 remote fetches in 1 ms
18/06/01 14:08:40 INFO executor.Executor: Executor is trying to kill task 190.1 in stage 34.0 (TID 2512)
18/06/01 14:08:40 ERROR storage.DiskBlockObjectWriter: Uncaught exception while reverting partial writes to file /...../yarn/nm/usercache/../appcache/application_xxxx/blockmgr-xxxx/temp_shuffle_xxxxx
java.nio.channels.ClosedByInterruptException
at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
at sun.nio.ch.FileChannelImpl.truncate(FileChannelImpl.java:372)
at org.apache.spark.storage.DiskBlockObjectWriter.revertPartialWritesAndClose(DiskBlockObjectWriter.scala:212)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.stop(BypassMergeSortShuffleWriter.java:238)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
The ERRORs start after quite a lot of feature engineering (select, groupby, ...), and I see them when I add these lines:
from pyspark.sql import functions as func  # assumed import; the original snippet omits it
df = (df.groupby('x', 'y')
        .agg(func.sum('x').alias('x_sum'))
        .groupby('y')
        .agg(func.mean('y').alias('py_sum_avg')))
So I guess the data shuffle is triggered by the groupBy.
I first thought it was a memory issue, so I added much more memory and memory overhead for the driver and the executors, without real success (this is what other threads suggest). My code contains other groupBy calls as well, but the problem seems to appear at this stage.
I also read that it could be related to too many open files or a full disk, but the ERROR messages are a bit different in those two cases.
I am quite new to PySpark, so I am looking for advice on how to debug such an issue.
How can I find out why java.nio.channels.ClosedByInterruptException is raised? I guess this is what triggers ERROR storage.DiskBlockObjectWriter. Is this correct? Is it triggered by "Executor: Executor is trying to kill task 190"? If having some tasks killed is a standard process, why does it trigger ERRORs? Can I get some hints from the Spark UI (I see that some tasks were killed)? Can I get more information from the traceback?
How can I fix these issues? Any suggestions on how to debug such things? I am not sure where to look (memory, an issue in the PySpark code, the cluster setup, or my Spark parameters).
I am working on a Hadoop data lake with Cloudera CDH 5.8.
There is an issue with spark.speculation in Spark 2.1, which I am using.
The related upstream bug is SPARK-19293. The exception stack trace in my case is slightly different from the one in SPARK-19293. After adding
--conf spark.speculation=false
the ERRORs are gone in my test.
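With speculation enabled, Spark launches duplicate attempts of slow tasks and kills the attempt that loses, which fits the log above: "task 190.1" is attempt 1 (the second attempt) of task 190, and it is killed right before the ERROR. To check whether the ERRORs in an executor log really follow kill requests, something like the sketch below can help. It is only an illustration: the function name `classify_errors` and the `window` of 3 lines are assumptions, and real logs may interleave lines from other threads.

```python
import re

KILL_RE = re.compile(r"Executor is trying to kill task \S+ in stage \S+ \(TID \d+\)")
ERROR_RE = re.compile(r"ERROR storage\.DiskBlockObjectWriter: Uncaught exception")


def classify_errors(lines, window=3):
    """Return (after_kill, other): DiskBlockObjectWriter ERRORs that appear
    within `window` lines after a kill message, and all remaining ones."""
    after_kill = other = 0
    last_kill_index = None
    for index, line in enumerate(lines):
        if KILL_RE.search(line):
            last_kill_index = index
        elif ERROR_RE.search(line):
            if last_kill_index is not None and index - last_kill_index <= window:
                after_kill += 1
            else:
                other += 1
    return after_kill, other


# Lines taken from the question's log (file path abbreviated).
sample_log = [
    "18/06/01 14:08:40 INFO executor.Executor: Running task 190.1 in stage 34.0 (TID 2512)",
    "18/06/01 14:08:40 INFO executor.Executor: Executor is trying to kill task 190.1 in stage 34.0 (TID 2512)",
    "18/06/01 14:08:40 ERROR storage.DiskBlockObjectWriter: Uncaught exception while reverting partial writes to file temp_shuffle_xxxxx",
    "java.nio.channels.ClosedByInterruptException",
]
after_kill, other = classify_errors(sample_log)
print(f"ERRORs right after a kill: {after_kill}, other ERRORs: {other}")
```

If nearly all ERRORs fall into the first bucket, they come from interrupted task attempts rather than from failing writes, which is consistent with the result DataFrame being correct.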
I am running a batch pipeline with the Apache Beam 2.2 SDK on the Cloud Dataflow service. There are 751 text files that I parse with the TextIO.readAll() transform, deserialize, and write to a date-partitioned table in BigQuery.
The first thing I noticed is that autoscaling was not really kicking in: it left the pipeline at 15 workers, even though I was able to push throughput a lot higher by, for example, manually setting the number of workers to 250.
My pipeline fails with the following stack trace:
(abed94a6f5139e21): java.io.IOException: Failed to close some writers
at org.apache.beam.sdk.io.gcp.bigquery.WriteBundlesToFiles.finishBundle(WriteBundlesToFiles.java:248)
Suppressed: java.io.IOException: com.google.api.client.googleapis.json.GoogleJsonResponseException: 503 Service Unavailable
Service Unavailable
at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.waitForCompletionAndThrowIfUploadFailed(AbstractGoogleAsyncWriteChannel.java:431)
at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.close(AbstractGoogleAsyncWriteChannel.java:289)
at org.apache.beam.sdk.io.gcp.bigquery.TableRowWriter.close(TableRowWriter.java:81)
at org.apache.beam.sdk.io.gcp.bigquery.WriteBundlesToFiles.finishBundle(WriteBundlesToFiles.java:242)
at org.apache.beam.sdk.io.gcp.bigquery.WriteBundlesToFiles$DoFnInvoker.invokeFinishBundle(Unknown Source)
at org.apache.beam.runners.core.SimpleDoFnRunner.finishBundle(SimpleDoFnRunner.java:187)
at com.google.cloud.dataflow.worker.SimpleParDoFn.finishBundle(SimpleParDoFn.java:407)
at com.google.cloud.dataflow.worker.util.common.worker.ParDoOperation.finish(ParDoOperation.java:60)
at com.google.cloud.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:76)
at com.google.cloud.dataflow.worker.DataflowWorker.executeWork(DataflowWorker.java:330)
at com.google.cloud.dataflow.worker.DataflowWorker.doWork(DataflowWorker.java:302)
at com.google.cloud.dataflow.worker.DataflowWorker.getAndPerformWork(DataflowWorker.java:251)
at com.google.cloud.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.doWork(DataflowBatchWorkerHarness.java:135)
at com.google.cloud.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:115)
at com.google.cloud.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:102)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 503 Service Unavailable
Service Unavailable
at com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:146)
at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113)
at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:432)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel$UploadOperation.call(AbstractGoogleAsyncWriteChannel.java:357)
... 4 more
Should I try with even more workers or split the work across several pipelines?
Thanks to the comment by jkff, it worked flawlessly after setting --maxNumWorkers=250 (15 seems to be the standard maximum).
The error was a transient one that Dataflow retried several times, and in the end the pipeline ran successfully.
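For anyone unfamiliar with why a transient 503 need not fail a job, here is a generic retry-with-backoff sketch. It is not how Dataflow implements its retries; `TransientError`, `call_with_retries`, and `flaky_upload` are all hypothetical names used only to illustrate the pattern.

```python
import time


class TransientError(Exception):
    """Hypothetical stand-in for a retryable failure such as an HTTP 503."""


def call_with_retries(operation, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call `operation`, retrying on TransientError with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** (attempt - 1))


# Simulated upload that fails twice with a 503 and then succeeds.
attempts = []


def flaky_upload():
    attempts.append(1)
    if len(attempts) < 3:
        raise TransientError("503 Service Unavailable")
    return "uploaded"


delays = []  # record the delays instead of actually sleeping
result = call_with_retries(flaky_upload, sleep=delays.append)
print(result, len(attempts), delays)
```

The error still shows up in the logs on each failed attempt, but the work only fails for good once the attempts are exhausted.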
One of our pipelines failed this morning with an error we've never seen before. In addition, we had to manually delete the one VM that was spun up in order to cancel/stop the job.
Has anything changed in the Dataflow service that could cause this error?
0 [main] INFO com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner - PipelineOptions.filesToStage was not specified. Defaulting to files from the classpath: will stage 49 files. Enable logging at DEBUG level to see which files will be staged.
2243 [main] INFO com.<removed>.cdf.dfp.DFPDenormalizationCloudDataFlowJob - Successfully created cloud dataflow service pipeline
2282 [main] INFO com.<removed>.cdf.dfp.DFPDenormalizationCloudDataFlowJob - Last loaded table was found. It will be processed for denormalization: Clicks_06_2015
2282 [main] INFO com.<removed>.cdf.dfp.DFPDenormalizationCloudDataFlowJob - Last loaded table was found. It will be processed for denormalization: ActiveViews_06_2015
2282 [main] INFO com.<removed>.cdf.dfp.DFPDenormalizationCloudDataFlowJob - Last loaded table was found. It will be processed for denormalization: Impressions_06_2015
2435 [main] WARN com.google.cloud.dataflow.sdk.Pipeline - Transform <removed>:<removed>.advertisers2 does not have a stable unique name. In the future, this will prevent reloading streaming pipelines
2615 [main] WARN com.google.cloud.dataflow.sdk.Pipeline - Transform <removed>:<removed>.lineitems2 does not have a stable unique name. In the future, this will prevent reloading streaming pipelines
2616 [main] WARN com.google.cloud.dataflow.sdk.Pipeline - Transform <removed>:<removed>.creative2name2 does not have a stable unique name. In the future, this will prevent reloading streaming pipelines
2616 [main] WARN com.google.cloud.dataflow.sdk.Pipeline - Transform <removed>:<removed>.adunit2site2 does not have a stable unique name. In the future, this will prevent reloading streaming pipelines
3236 [main] INFO com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner - Executing pipeline on the Dataflow Service, which will have billing implications related to Google Compute Engine usage and other Google Cloud Services.
3241 [main] INFO com.google.cloud.dataflow.sdk.util.PackageUtil - Uploading 49 files from PipelineOptions.filesToStage to staging location to prepare for execution.
41834 [main] INFO com.google.cloud.dataflow.sdk.util.PackageUtil - Uploading PipelineOptions.filesToStage complete: 10 files newly uploaded, 39 files cached
Dataflow SDK version: 0.4.150602
51003 [main] INFO com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner - To access the Dataflow monitoring console, please navigate to https://console.developers.google.com/project/<removed>/dataflow/job/2015-06-11_16_39_02-17130055143605818331
Submitted job: 2015-06-11_16_39_02-17130055143605818331
51004 [main] INFO com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner - To cancel the job using the 'gcloud' tool, run:
> gcloud alpha dataflow jobs --project=<removed> cancel 2015-06-11_16_39_02-17130055143605818331
2015-06-11T23:39:02.506Z: Detail: (b056559940543e6a): Expanding GroupByKey operations into optimizable parts.
2015-06-11T23:39:02.509Z: Detail: (b056559940543d60): Annotating graph with Autotuner information.
2015-06-11T23:39:02.759Z: Detail: (b0565599405437a9): Fusing adjacent ParDo, Read, Write, and Flatten operations
2015-06-11T23:39:02.762Z: Detail: (b05655994054369f): Fusing consumer Impressions_06_2015-ParDoDFP-transform into Impressions_06_2015-BQ-Read
2015-06-11T23:39:02.764Z: Detail: (b056559940543595): Fusing consumer Impressions_06_2015-BQ-Write into Impressions_06_2015-ParDoDFP-transform
2015-06-11T23:39:02.766Z: Detail: (b05655994054348b): Fusing consumer ActiveViews_06_2015-ParDoDFP-transform into ActiveViews_06_2015-BQ-Read
2015-06-11T23:39:02.767Z: Detail: (b056559940543381): Fusing consumer ActiveViews_06_2015-BQ-Write into ActiveViews_06_2015-ParDoDFP-transform
2015-06-11T23:39:02.769Z: Detail: (b056559940543277): Fusing consumer Clicks_06_2015-ParDoDFP-transform into Clicks_06_2015-BQ-Read
2015-06-11T23:39:02.771Z: Detail: (b05655994054316d): Fusing consumer Clicks_06_2015-BQ-Write into Clicks_06_2015-ParDoDFP-transform
2015-06-11T23:39:02.818Z: Detail: (b056559940543987): Adding StepResource setup and teardown to workflow graph.
2015-06-11T23:39:18.614Z: Error: (5494fb7a460f58a8): Workflow failed. Causes: (20fbc2bb0e7cb0b1): One or more operations had an error: 'operation-1434065943092-518467f1f5b21-8d000d8a-d5cd5762': 'The resource 'projects/<removed>/zones/us-central1-a/disks/dfp-denormalization-job-1-06111639-3db5-harness-0' is not ready'.
2015-06-11T23:39:18.651Z: Detail: (4fb958a4957733a5): Cleaning up.
2015-06-11T23:40:36.126Z: Error: (d41cf136c17a5e79): Workflow failed. Causes: (20fbc2bb0e7cb0b1): One or more operations had an error: 'operation-1434065943092-518467f1f5b21-8d000d8a-d5cd5762': 'The resource 'projects/<removed>/zones/us-central1-a/disks/dfp-denormalization-job-1-06111639-3db5-harness-0' is not ready'.
2015-06-11T23:43:05.998Z: Warning: (c5964e114f42988b): Job 2015-06-11_16_39_02-17130055143605818331 is already finishing. Ignoring cancel request.
2015-06-11T23:48:04.715Z: Warning: (cf462c726cde3704): Job 2015-06-11_16_39_02-17130055143605818331 is already finishing. Ignoring cancel request.
2015-06-11T23:50:35.529Z: Warning: Internal Issue (4fb958a495773599): 65177287:8503
748739 [main] INFO com.google.cloud.dataflow.sdk.runners.BlockingDataflowPipelineRunner - Job finished with status FAILED
748740 [main] ERROR com.<removed>.cdf.dfp.DFPDenormalizationCloudDataFlowJob - Job "dfp-denormalization-job-1434066640362" failed. Job may be retried.
This was a temporary issue with the Google Compute Engine API that has since been resolved. When calling GCE on behalf of the user, Dataflow will attempt to work around any transient errors.
As I am new to Neo4j, I have been facing the following errors.
1. When I start Neo4j, it gives the following message:
WARNING: Max 1024 open files allowed, minimum of 40 000 recommended. See the Neo4j manual.
Using additional JVM arguments: -server -XX:+DisableExplicitGC -Dorg.neo4j.server.properties=conf/neo4j-server.properties -Djava.util.logging.config.file=conf/logging.properties -Dlog4j.configuration=file:conf/log4j.properties -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled
Note: I tried editing the file /etc/security/limits.conf and added
root soft nofile 40000
root hard nofile 40000
but that did not solve it.
2. The messages.log file has multiple records like the ones below:
2014-07-16 07:07:49.688+0000 WARN [o.n.k.EmbeddedGraphDatabase]: GC Monitor: Application threads blocked for an additional 111ms [total block time: 56.805s]
2014-07-16 07:09:02.778+0000 WARN [o.n.k.EmbeddedGraphDatabase]: GC Monitor: Application threads blocked for an additional 103ms [total block time: 56.908s]
The problem is that sometimes the server CPU suddenly goes high and takes a few hours to come down. Please give me an idea of how to proceed.
Thanks
Az
1) Best practice on Ubuntu is not to set this in `/etc/security/limits.conf` directly; instead, create a file `/etc/security/limits.d/neo4j.conf` containing:
* soft nofile 40000
* hard nofile 40000
2) This just tells you how much time is spent in GC. If a single pause gets too long, that is an indication to tweak the JVM settings. Pauses of around 100 ms are not really a concern in most cases. However, the "total block time" of almost one minute might require further investigation.
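Regarding point 1, a quick way to see whether a new open-files limit has taken effect is to print the limits of a process started by the same user that runs Neo4j. The sketch below uses Python's standard `resource` module (Unix only); the helper names are made up for this example, and the threshold of 40000 comes from the Neo4j warning in the question. Limits files are typically only picked up by new login sessions, so run it from a fresh session.

```python
import resource

RECOMMENDED_NOFILE = 40000  # minimum recommended by the Neo4j warning


def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def meets_recommendation(soft, recommended=RECOMMENDED_NOFILE):
    """True if the soft limit is unlimited or at least `recommended`."""
    # RLIM_INFINITY means "no limit", which satisfies any minimum.
    return soft == resource.RLIM_INFINITY or soft >= recommended


soft, hard = nofile_limits()
print(f"soft={soft} hard={hard} ok={meets_recommendation(soft)}")
```

If the soft limit still reads 1024, the new settings have not been applied to that session yet.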