Dataflow Batch Job fails with "Failed to close some writers" - google-cloud-dataflow

I am running a batch pipeline with the Apache Beam 2.2 SDK via the Cloud Dataflow service. There are 751 text files that I parse using the TextIO.readAll() transform, deserialize, and write to a date-partitioned table in BigQuery.
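Roughly, the pipeline looks like this (a simplified sketch; the file pattern, deserialization logic, and table name are placeholders, not the actual code):
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
Pipeline p = Pipeline.create(options);
p.apply("FilePatterns", Create.of("gs://my-bucket/input/*.txt"))   // placeholder pattern
 .apply("ReadAll", TextIO.readAll())
 .apply("Deserialize", ParDo.of(new DoFn<String, TableRow>() {
     @ProcessElement
     public void processElement(ProcessContext c) {
         // Placeholder deserialization; the real code parses each line into a TableRow.
         c.output(new TableRow().set("line", c.element()));
     }
 }))
 .apply("WriteToBigQuery", BigQueryIO.writeTableRows()
     .to("my-project:my_dataset.events")                           // date-partitioned table (placeholder)
     .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
p.run();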
The first thing I noticed is that autoscaling was not really kicking in and left the pipeline at 15 workers, even though I was able to push throughput much higher when, for example, manually setting the number of workers to 250.
My pipeline fails with the following stack trace:
(abed94a6f5139e21): java.io.IOException: Failed to close some writers
at org.apache.beam.sdk.io.gcp.bigquery.WriteBundlesToFiles.finishBundle(WriteBundlesToFiles.java:248)
Suppressed: java.io.IOException: com.google.api.client.googleapis.json.GoogleJsonResponseException: 503 Service Unavailable
Service Unavailable
at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.waitForCompletionAndThrowIfUploadFailed(AbstractGoogleAsyncWriteChannel.java:431)
at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.close(AbstractGoogleAsyncWriteChannel.java:289)
at org.apache.beam.sdk.io.gcp.bigquery.TableRowWriter.close(TableRowWriter.java:81)
at org.apache.beam.sdk.io.gcp.bigquery.WriteBundlesToFiles.finishBundle(WriteBundlesToFiles.java:242)
at org.apache.beam.sdk.io.gcp.bigquery.WriteBundlesToFiles$DoFnInvoker.invokeFinishBundle(Unknown Source)
at org.apache.beam.runners.core.SimpleDoFnRunner.finishBundle(SimpleDoFnRunner.java:187)
at com.google.cloud.dataflow.worker.SimpleParDoFn.finishBundle(SimpleParDoFn.java:407)
at com.google.cloud.dataflow.worker.util.common.worker.ParDoOperation.finish(ParDoOperation.java:60)
at com.google.cloud.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:76)
at com.google.cloud.dataflow.worker.DataflowWorker.executeWork(DataflowWorker.java:330)
at com.google.cloud.dataflow.worker.DataflowWorker.doWork(DataflowWorker.java:302)
at com.google.cloud.dataflow.worker.DataflowWorker.getAndPerformWork(DataflowWorker.java:251)
at com.google.cloud.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.doWork(DataflowBatchWorkerHarness.java:135)
at com.google.cloud.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:115)
at com.google.cloud.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:102)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 503 Service Unavailable
Service Unavailable
at com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:146)
at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113)
at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:432)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel$UploadOperation.call(AbstractGoogleAsyncWriteChannel.java:357)
... 4 more
Should I try with even more workers or split the work across several pipelines?

Thanks to the comment by jkff, it worked flawlessly after setting --maxNumWorkers=250 (15 seems to be the default maximum).
The error was transient; Dataflow retried it several times and, in the end, the pipeline ran successfully.
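For completeness, the same ceiling can be set programmatically instead of via the --maxNumWorkers flag; a minimal sketch using the Dataflow runner's options interface:
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
    .withValidation()
    .as(DataflowPipelineOptions.class);
// Raises the autoscaling ceiling; the service still chooses a worker count
// between numWorkers and maxNumWorkers.
options.setMaxNumWorkers(250);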

Related

Dataflow read GCS error

Does the following exception look familiar to anyone? The exact same pipeline and data worked last week, but it failed a couple of times with the same exception today. I didn't see any footprint of my code in the stack trace. I'm wondering what it might be related to... a GCS read quota, for example?
Also, since it ran fine on my direct runner, how can I debug these types of exceptions on Dataflow?
{
insertId: "7289985381136617647:828219:0:906922"
jsonPayload: {
exception: "java.io.IOException: Failed to advance reader of source: gs://fiona_dataflow/tmp/BigQueryExtractTemp/5c813875537d4c1a89b74a800bb37c50/000000000864.avro range [0, 808559590)
at com.google.cloud.dataflow.worker.WorkerCustomSources$BoundedReaderIterator.advance(WorkerCustomSources.java:605)
at com.google.cloud.dataflow.worker.util.common.worker.ReadOperation$SynchronizedReaderIterator.advance(ReadOperation.java:398)
at com.google.cloud.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:193)
at com.google.cloud.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:158)
at com.google.cloud.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:75)
at com.google.cloud.dataflow.worker.BatchDataflowWorker.executeWork(BatchDataflowWorker.java:383)
at com.google.cloud.dataflow.worker.BatchDataflowWorker.doWork(BatchDataflowWorker.java:355)
at com.google.cloud.dataflow.worker.BatchDataflowWorker.getAndPerformWork(BatchDataflowWorker.java:286)
at com.google.cloud.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.doWork(DataflowBatchWorkerHarness.java:134)
at com.google.cloud.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:114)
at com.google.cloud.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:101)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
at com.geotab.bigdata.streaming.mapserver.backfill.MapServerBatchBeamApplication.lambda$main$fd9fc9ef$1(MapServerBatchBeamApplication.java:82)
at org.apache.beam.sdk.io.gcp.bigquery.BigQuerySourceBase$1.apply(BigQuerySourceBase.java:211)
at org.apache.beam.sdk.io.gcp.bigquery.BigQuerySourceBase$1.apply(BigQuerySourceBase.java:205)
at org.apache.beam.sdk.io.AvroSource$AvroBlock.readNextRecord(AvroSource.java:579)
at org.apache.beam.sdk.io.BlockBasedSource$BlockBasedReader.readNextRecord(BlockBasedSource.java:223)
at org.apache.beam.sdk.io.FileBasedSource$FileBasedReader.advanceImpl(FileBasedSource.java:473)
at org.apache.beam.sdk.io.OffsetBasedSource$OffsetBasedReader.advance(OffsetBasedSource.java:267)
at com.google.cloud.dataflow.worker.WorkerCustomSources$BoundedReaderIterator.advance(WorkerCustomSources.java:602)
... 14 more
"
job: "2018-04-23_07_30_32-17662367668739576363"
logger: "com.google.cloud.dataflow.worker.WorkItemStatusClient"
message: "Uncaught exception occurred during work unit execution. This will be retried."
stage: "s19"
thread: "27"
work: "1213589185295287945"
worker: "mapserverbatchbeamapplica-04230730-s20x-harness-713d"

Akka 2.5 Distributed Data on Docker + Alpine Linux

After upgrading a service that uses Akka + Akka Cluster Sharding to the newly released Akka 2.5.0, we started encountering issues starting the system in Docker + Alpine Linux. From what I can infer, Akka Cluster Sharding is configured to use Akka Distributed Data (which is no longer experimental as of 2.5.0), which in turn uses LMDB (which requires GCC + glibc, not available in Alpine Linux).
My questions are as follows:
1) Is there any standard alternative supported by Akka instead of LMDB?
2) Is there any way to get LMDB to work in Alpine Linux?
Stack Trace:
[ERROR] [04/20/2017 13:42:19.014] [lotus-akka.actor.default-dispatcher-5] [akka://lotus/system/sharding/replicator/durableStore] Error relocating /tmp/lmdbjava-native-library-5972006786989102785.so: __fprintf_chk: symbol not found
akka.actor.ActorInitializationException: akka://lotus/system/sharding/replicator/durableStore: exception during creation
at akka.actor.ActorInitializationException$.apply(Actor.scala:191)
at akka.actor.ActorCell.create(ActorCell.scala:600)
at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:454)
at akka.actor.ActorCell.systemInvoke(ActorCell.scala:476)
at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:282)
at akka.dispatch.Mailbox.run(Mailbox.scala:223)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at akka.util.Reflect$.instantiate(Reflect.scala:65)
at akka.actor.ArgsReflectConstructor.produce(IndirectActorProducer.scala:96)
at akka.actor.Props.newActor(Props.scala:213)
at akka.actor.ActorCell.newActor(ActorCell.scala:555)
at akka.actor.ActorCell.create(ActorCell.scala:581)
... 7 more
Caused by: java.lang.UnsatisfiedLinkError: Error relocating /tmp/lmdbjava-native-library-5972006786989102785.so: __fprintf_chk: symbol not found
at jnr.ffi.provider.jffi.NativeLibrary.loadNativeLibraries(NativeLibrary.java:87)
at jnr.ffi.provider.jffi.NativeLibrary.getNativeLibraries(NativeLibrary.java:70)
at jnr.ffi.provider.jffi.NativeLibrary.getSymbolAddress(NativeLibrary.java:49)
at jnr.ffi.provider.jffi.NativeLibrary.findSymbolAddress(NativeLibrary.java:59)
at jnr.ffi.provider.jffi.AsmLibraryLoader.generateInterfaceImpl(AsmLibraryLoader.java:158)
at jnr.ffi.provider.jffi.AsmLibraryLoader.loadLibrary(AsmLibraryLoader.java:89)
at jnr.ffi.provider.jffi.NativeLibraryLoader.loadLibrary(NativeLibraryLoader.java:43)
at jnr.ffi.LibraryLoader.load(LibraryLoader.java:325)
at jnr.ffi.LibraryLoader.load(LibraryLoader.java:304)
at org.lmdbjava.Library.<clinit>(Library.java:95)
at org.lmdbjava.Env$Builder.open(Env.java:406)
at org.lmdbjava.Env$Builder.open(Env.java:430)
at akka.cluster.ddata.LmdbDurableStore.<init>(DurableStore.scala:131)
... 16 more
I finally managed to solve this problem. Cluster Sharding attempts to use durable storage by default (the default is LMDB). For cluster sharding without remember-entities, durable storage is not required.
Hence, the solution was to disable durable storage for cluster sharding by adding the following configuration:
akka.cluster.sharding.distributed-data.durable.keys = []
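The same override can also be applied programmatically when constructing the ActorSystem; a minimal sketch with the Typesafe Config API (using the "lotus" system name from the log above):
import akka.actor.ActorSystem;
import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;

// Disable LMDB-backed durable storage for sharding's Distributed Data replicator,
// falling back to application.conf for everything else.
Config config = ConfigFactory
    .parseString("akka.cluster.sharding.distributed-data.durable.keys = []")
    .withFallback(ConfigFactory.load());
ActorSystem system = ActorSystem.create("lotus", config);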

How to handle "an unhandled error caused by the Dataflow SDK" (corrupted gz as input)

Is there a way to deal with "an unhandled error caused by the Dataflow SDK"?
Specifically, we have a Dataflow job that takes a list of gz files (in GCS) as input, and produces some output.
Once in a while one of the gz files may be corrupted, and the job fails because of it.
We are wondering if there is a way to handle this -- specifically, we want the job to ignore such corrupted file(s) and proceed.
It is not clear whether we can catch the exception thrown for a corrupted gz file (it appears to be thrown inside the Dataflow SDK itself, causing the job to fail).
(For Google Dataflow team: Here is a specific dataflow job id: 2017-04-02_05_08_20-5491890758767473661.)
Update: Here's the stack trace we got from the logging UI.
(778029c78ed61ff2): java.io.IOException: Failed to advance reader of source: StaticValueProvider{value=gs://aaa.gz} range [0, 9223372036854775807)
at com.google.cloud.dataflow.sdk.runners.worker.WorkerCustomSources$BoundedReaderIterator.advance(WorkerCustomSources.java:544)
at com.google.cloud.dataflow.sdk.util.common.worker.ReadOperation$SynchronizedReaderIterator.advance(ReadOperation.java:425)
at com.google.cloud.dataflow.sdk.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:217)
at com.google.cloud.dataflow.sdk.util.common.worker.ReadOperation.start(ReadOperation.java:182)
at com.google.cloud.dataflow.sdk.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:69)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.executeWork(DataflowWorker.java:284)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.doWork(DataflowWorker.java:220)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.getAndPerformWork(DataflowWorker.java:170)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.doWork(DataflowWorkerHarness.java:192)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.call(DataflowWorkerHarness.java:172)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.call(DataflowWorkerHarness.java:159)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
at org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream.read(GzipCompressorInputStream.java:278)
at java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:385)
at com.google.cloud.dataflow.sdk.io.TextIO$TextSource$TextBasedReader.tryToEnsureNumberOfBytesInBuffer(TextIO.java:1077)
at com.google.cloud.dataflow.sdk.io.TextIO$TextSource$TextBasedReader.findSeparatorBounds(TextIO.java:1011)
at com.google.cloud.dataflow.sdk.io.TextIO$TextSource$TextBasedReader.readNextRecord(TextIO.java:1043)
at com.google.cloud.dataflow.sdk.io.CompressedSource$CompressedReader.readNextRecord(CompressedSource.java:482)
at com.google.cloud.dataflow.sdk.io.FileBasedSource$FileBasedReader.advanceImpl(FileBasedSource.java:536)
at com.google.cloud.dataflow.sdk.io.OffsetBasedSource$OffsetBasedReader.advance(OffsetBasedSource.java:287)
at com.google.cloud.dataflow.sdk.runners.worker.WorkerCustomSources$BoundedReaderIterator.advance(WorkerCustomSources.java:541)
... 14 more
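Since the EOFException above is thrown inside the SDK's own reader, it cannot be caught from user code with TextIO. One commonly suggested workaround is to move decompression into your own DoFn so the failure is catchable per file; a minimal sketch using the newer Beam FileIO API (the trace above is from the older Dataflow 1.x SDK, so treat this as the Beam 2.x equivalent; the file pattern is illustrative):
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.channels.Channels;
import java.nio.charset.StandardCharsets;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

p.apply(FileIO.match().filepattern("gs://my-bucket/input/*.gz"))   // illustrative pattern
 .apply(FileIO.readMatches())
 .apply(ParDo.of(new DoFn<FileIO.ReadableFile, String>() {
     @ProcessElement
     public void processElement(ProcessContext c) {
         FileIO.ReadableFile file = c.element();
         // open() decompresses according to the file's detected compression.
         try (BufferedReader reader = new BufferedReader(
                 Channels.newReader(file.open(), StandardCharsets.UTF_8.name()))) {
             String line;
             while ((line = reader.readLine()) != null) {
                 c.output(line);
             }
         } catch (IOException e) {
             // A truncated or corrupted gz surfaces here instead of failing the job;
             // log and skip the file. Note lines already emitted before the error stay emitted.
         }
     }
 }));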

Slave lost error in pyspark

I'm using Spark 1.6.
I'm running a simple df.show(2) call and got errors like:
An error occurred while calling o143.showString.
: org.apache.spark.SparkException: Job aborted due to stage
failure: Task 6 in stage 6.0 failed 4 times, most recent failure:
Lost task 6.3 in stage 6.0
ExecutorLostFailure (executor 2 exited caused by one of the
running tasks) Reason: Slave lost
When I persisted the DataFrame, I saw through the Spark UI that the shuffle write was very large; it took a long time and still returned errors.
Through some searching, I found this might be an out-of-memory problem.
Following this link (out of memory error Java), I did a repartition up to 1000 partitions, but it still didn't help much.
I set up the SparkConf as:
conf = (SparkConf().set("spark.driver.maxResultSize", "150g").set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"))
My server-side memory can be up to 200 GB.
Do you have any good ideas for handling this, or can you point me to related links? PySpark suggestions would be most helpful.
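For what it's worth, with ExecutorLostFailure ("Slave lost") the usual first levers are executor memory and YARN memory overhead rather than spark.driver.maxResultSize, since it typically means YARN killed an executor for exceeding its memory limit. A minimal sketch of those settings (shown in Java; values are illustrative guesses, and the same property keys work from PySpark's SparkConf):
import org.apache.spark.SparkConf;

// Give executors more heap and more off-heap headroom so YARN does not
// kill them mid-shuffle; tune the values to your cluster.
SparkConf conf = new SparkConf()
    .set("spark.executor.memory", "16g")                 // illustrative value
    .set("spark.yarn.executor.memoryOverhead", "4096")   // in MB; Spark 1.x property name
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");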
Here is the error log from YARN:
Application application_1477088172315_0118 failed 2 times due to
AM Container for appattempt_1477088172315_0118_000006 exited
with exitCode: 10
For more detailed output, check application tracking page: Then,
click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1477088172315_0118_06_000001
Exit code: 10
Stack trace: ExitCodeException exitCode=10:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:582)
at org.apache.hadoop.util.Shell.run(Shell.java:479)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:773)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 10
Failing this attempt. Failing the application.
Here is the error info from the notebook:
Py4JJavaError: An error occurred while calling o71.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 15.0 failed 4 times, most recent failure: Lost task 1.3 in stage 15.0 (): ExecutorLostFailure (executor 26 exited caused by one of the running tasks) Reason: Slave lost
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:212)
at org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165)
at org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086)
at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498)
at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505)
at org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375)
at org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374)
at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099)
at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374)
at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1456)
at org.apache.spark.sql.DataFrame.showString(DataFrame.scala:170)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
Thank you

Jenkins xvnc plugin, some display numbers stay allocated when a build is stopped abruptly (i.e. Jenkins restart) and cannot be used again

I am using Jenkins with the Xvnc plugin to run acceptance tests on Firefox on a CentOS slave. I have limited the display numbers to 2-4, since there will be at most 3 instances of testing that need a display. The tests and plugin worked fine until Jenkins had to be restarted a few times due to issues in other builds. The following error now occurs whenever the build tries to run:
FATAL: All available display numbers are allocated or blacklisted.
allocated: [2, 3, 4]
blacklisted: []
java.lang.RuntimeException: All available display numbers are allocated or blacklisted.
allocated: [2, 3, 4]
blacklisted: []
at hudson.plugins.xvnc.DisplayAllocator.doAllocate(DisplayAllocator.java:59)
at hudson.plugins.xvnc.DisplayAllocator.allocate(DisplayAllocator.java:49)
at hudson.plugins.xvnc.Xvnc.doSetUp(Xvnc.java:99)
at hudson.plugins.xvnc.Xvnc.setUp(Xvnc.java:89)
at jenkins.tasks.SimpleBuildWrapper.setUp(SimpleBuildWrapper.java:146)
at hudson.model.Build$BuildExecution.doRun(Build.java:156)
at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:537)
at hudson.model.Run.execute(Run.java:1741)
at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
at hudson.model.ResourceController.execute(ResourceController.java:98)
at hudson.model.Executor.run(Executor.java:381)
I checked a working build where I restarted Jenkins without manually stopping each job and found a potential cause:
Terminating xvnc.
FATAL: hudson.remoting.Channel$OrderlyShutdown
hudson.remoting.RequestAbortedException: hudson.remoting.Channel$OrderlyShutdown
at hudson.remoting.Request.abort(Request.java:296)
at hudson.remoting.Channel.terminate(Channel.java:815)
at hudson.remoting.Channel$CloseCommand.execute(Channel.java:1034)
at hudson.remoting.Channel$2.handle(Channel.java:484)
at hudson.remoting.AbstractByteArrayCommandTransport$1.handle(AbstractByteArrayCommandTransport.java:61)
at org.jenkinsci.remoting.nio.NioChannelHub$2.run(NioChannelHub.java:594)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
at ......remote call to jenkinstest.build.thoughtwire.com.test(Native Method)
at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1361)
at hudson.remoting.Request.call(Request.java:171)
at hudson.remoting.Channel.call(Channel.java:752)
at hudson.Launcher$RemoteLauncher.kill(Launcher.java:954)
at hudson.plugins.xvnc.Xvnc$DisposerImpl.tearDown(Xvnc.java:183)
at jenkins.tasks.SimpleBuildWrapper$EnvironmentWrapper.tearDown(SimpleBuildWrapper.java:175)
at hudson.model.Build$BuildExecution.doRun(Build.java:173)
at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:537)
at hudson.model.Run.execute(Run.java:1741)
at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
at hudson.model.ResourceController.execute(ResourceController.java:98)
at hudson.model.Executor.run(Executor.java:381)
Caused by: hudson.remoting.Channel$OrderlyShutdown
at hudson.remoting.Channel$CloseCommand.execute(Channel.java:1034)
at hudson.remoting.Channel$2.handle(Channel.java:484)
at hudson.remoting.AbstractByteArrayCommandTransport$1.handle(AbstractByteArrayCommandTransport.java:61)
at org.jenkinsci.remoting.nio.NioChannelHub$2.run(NioChannelHub.java:594)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: Command close created at
at hudson.remoting.Command.<init>(Command.java:56)
at hudson.remoting.Channel$CloseCommand.<init>(Channel.java:1028)
at hudson.remoting.Channel$CloseCommand.<init>(Channel.java:1026)
at hudson.remoting.Channel.close(Channel.java:1109)
at hudson.remoting.Channel.close(Channel.java:1092)
at hudson.remoting.Channel$CloseCommand.execute(Channel.java:1033)
at hudson.remoting.Channel$2.handle(Channel.java:484)
at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:60)
It seems like the job did not close properly and the Xvnc plugin did not get a chance to deallocate the displays. I made sure the processes and tests on the slave are properly terminated and nothing is running.
The core issue here is that display numbers 2, 3, and 4 are now permanently allocated and cannot be reused even though no builds are running. If the slave (TEST) is mirrored (TEST2), then TEST2 can use displays 2, 3, and 4, but TEST cannot. I have tried reinstalling the plugin, but the numbers stay allocated and linked to TEST.
Does anyone know of a way to clear the list of allocated display numbers?
Is this a bug with the plugin?
Is there a way to prevent display numbers from staying allocated if say Jenkins suddenly dies while jobs are running?
The allocated display numbers are saved in the hudson.plugins.xvnc.Xvnc.xml file on the Jenkins master (under the Jenkins home directory). To clear the numbers, you need to stop Jenkins, clean up <allocatedNumbers> in that XML file, and start the Jenkins server again.
It is important to edit the file after you stop the Jenkins server, since Jenkins will save the current numbers when it stops.
Here is a Groovy script I created to clean up the Xvnc display numbers without stopping Jenkins. Note that it may also clear the numbers of still-running jobs.
https://github.com/sdiepend/jenkins-monitoring/blob/master/cleanXvncDisplayNumbers.groovy
import jenkins.*
import jenkins.model.Jenkins

Jenkins jenkins = Jenkins.getActiveInstance()

xvncDescriptor = jenkins.getDescriptorByType(hudson.plugins.xvnc.Xvnc.DescriptorImpl.class)
xvncDescriptor.allocators.each {
    allocator = it.value
    // collect() is used so that numAlloc is an entirely new list, not just a reference
    // to the same list object; otherwise removing elements while iterating throws a
    // ConcurrentModificationException.
    numAlloc = allocator.allocatedNumbers.collect()
    numAlloc.each {
        allocator.allocatedNumbers.remove(it)
    }
}
