I have set up a simple standalone Spark cluster using Docker.
There are two Docker containers with Spark installed: one runs as the master and the other as a worker.
These two containers share a custom bridge network.
I have exposed the master container's web UI and submit ports and can successfully see the master web UI.
The worker container's web UI port is also exposed and can be reached from my browser.
The problem happens when I try to run PySpark.
From my machine (not inside any Docker container), I run the following Python script, but it never completes.
from pyspark import SparkContext, SparkConf
sc = SparkContext("spark://localhost:10077","test")
rdd = sc.parallelize([1,2,3])
for a in rdd.collect():
    print(a)
I checked the master and worker web UIs, and it seems that the worker node is continuously exiting and recreating an executor.
The executor's error looks like this:
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1894)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:61)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:424)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:413)
at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:102)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$9(CoarseGrainedExecutorBackend.scala:444)
at scala.runtime.java8.JFunction1$mcVI$sp.apply(JFunction1$mcVI$sp.java:23)
at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:985)
at scala.collection.immutable.Range.foreach(Range.scala:158)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:984)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$7(CoarseGrainedExecutorBackend.scala:442)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:62)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:61)
at java.base/java.security.AccessController.doPrivileged(Native Method)
at java.base/javax.security.auth.Subject.doAs(Unknown Source)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
... 4 more
Caused by: java.io.IOException: Failed to connect to devmachine/172.17.0.1:40314
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:288)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:218)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:230)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:204)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:202)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:198)
at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: devmachine/172.17.0.1:40314
Caused by: java.net.ConnectException: Connection refused
at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:710)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:658)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:584)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Unknown Source)
I think the main cause is that the worker cannot connect back to the driver (devmachine) on the given port.
However, I have configured the master and worker containers to be able to resolve the "devmachine" IP address and have confirmed that the resolution works (tested separately by running a simple Flask app on devmachine and making an HTTP request to it from a worker container; it worked).
From what I understand, I am submitting my PySpark job in "client" mode (instead of cluster mode), so the driver lives on my current machine (devmachine). Since in client mode the workers communicate with the driver, that would explain the error where the worker tries to connect back to devmachine.
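For reference, here is a rough sketch of the driver-side settings I believe are involved; spark.driver.host, spark.driver.port and spark.blockManager.port are standard Spark configuration keys, but the concrete values below are assumptions for my setup, not something I actually have working:

from pyspark import SparkConf, SparkContext

conf = (
    SparkConf()
    .setMaster("spark://localhost:10077")
    .setAppName("test")
    # hostname the executors should use to reach the driver (assumes "devmachine" resolves from the workers)
    .set("spark.driver.host", "devmachine")
    # pin the driver RPC and block manager ports instead of letting Spark pick random ones,
    # so they can be opened/forwarded explicitly (the port numbers are arbitrary examples)
    .set("spark.driver.port", "40300")
    .set("spark.blockManager.port", "40301")
)
sc = SparkContext(conf=conf)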
Any pointers on what I am doing wrong?
Update 22.07.21
Okay, I have eventually found a workaround to make this simple Spark job work.
Since the original problem seemed to be caused by some (still unknown) difficulty in network communication between the driver (host machine) and the worker Docker container, I thought that submitting the Spark job from another Docker container attached to the same bridge network as the master and worker containers might solve the problem.
I prepared a Docker image on top of the image used for the master/worker, in which I only installed python3, pip and pyspark, since the base image lacked these packages.
After that, I launched a Docker container from this new image and made sure it was connected to the custom bridge network used by the master and worker nodes.
I submitted the Python script (after changing the master URL inside the file to spark://master:7077) with the following command:
$ spark-submit t1.py
and it worked well!
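For completeness, the adjusted t1.py was essentially the original script with just the master URL changed (master is the Spark master's container name on the bridge network):

from pyspark import SparkContext

sc = SparkContext("spark://master:7077", "test")
rdd = sc.parallelize([1, 2, 3])
for a in rdd.collect():
    print(a)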
Here is the output of the spark-submit run:
22/07/21 07:25:49 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
22/07/21 07:25:49 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
22/07/21 07:25:49 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-ad4336e4-af0c-4100-852f-6b51f8803f94
22/07/21 07:25:49 INFO MemoryStore: MemoryStore started with capacity 434.4 MiB
22/07/21 07:25:49 INFO SparkEnv: Registering OutputCommitCoordinator
22/07/21 07:25:49 INFO Utils: Successfully started service 'SparkUI' on port 4040.
22/07/21 07:25:49 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://master:7077...
22/07/21 07:25:49 INFO TransportClientFactory: Successfully created connection to master/172.18.0.2:7077 after 27 ms (0 ms spent in bootstraps)
22/07/21 07:25:50 INFO StandaloneSchedulerBackend: Connected to Spark cluster with app ID app-20220721072549-0005
22/07/21 07:25:50 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20220721072549-0005/0 on worker-20220721060838-172.18.0.3-45161 (172.18.0.3:45161) with 8 core(s)
22/07/21 07:25:50 INFO StandaloneSchedulerBackend: Granted executor ID app-20220721072549-0005/0 on hostPort 172.18.0.3:45161 with 8 core(s), 1024.0 MiB RAM
22/07/21 07:25:50 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 37799.
22/07/21 07:25:50 INFO NettyBlockTransferService: Server created on 46dd0a966c6f:37799
22/07/21 07:25:50 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
22/07/21 07:25:50 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 46dd0a966c6f, 37799, None)
22/07/21 07:25:50 INFO BlockManagerMasterEndpoint: Registering block manager 46dd0a966c6f:37799 with 434.4 MiB RAM, BlockManagerId(driver, 46dd0a966c6f, 37799, None)
22/07/21 07:25:50 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 46dd0a966c6f, 37799, None)
22/07/21 07:25:50 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 46dd0a966c6f, 37799, None)
22/07/21 07:25:50 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20220721072549-0005/0 is now RUNNING
22/07/21 07:25:50 INFO StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
22/07/21 07:25:50 INFO SparkContext: Starting job: collect at /root/t1.py:10
22/07/21 07:25:50 INFO DAGScheduler: Got job 0 (collect at /root/t1.py:10) with 2 output partitions
22/07/21 07:25:50 INFO DAGScheduler: Final stage: ResultStage 0 (collect at /root/t1.py:10)
22/07/21 07:25:50 INFO DAGScheduler: Parents of final stage: List()
22/07/21 07:25:50 INFO DAGScheduler: Missing parents: List()
22/07/21 07:25:50 INFO DAGScheduler: Submitting ResultStage 0 (ParallelCollectionRDD[0] at readRDDFromFile at PythonRDD.scala:274), which has no missing parents
22/07/21 07:25:50 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 2.9 KiB, free 434.4 MiB)
22/07/21 07:25:50 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1757.0 B, free 434.4 MiB)
22/07/21 07:25:50 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 46dd0a966c6f:37799 (size: 1757.0 B, free: 434.4 MiB)
22/07/21 07:25:50 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1513
22/07/21 07:25:50 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (ParallelCollectionRDD[0] at readRDDFromFile at PythonRDD.scala:274) (first 15 tasks are for partitions Vector(0, 1))
22/07/21 07:25:50 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks resource profile 0
22/07/21 07:25:52 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (172.18.0.3:59260) with ID 0, ResourceProfileId 0
22/07/21 07:25:52 INFO BlockManagerMasterEndpoint: Registering block manager 172.18.0.3:33016 with 434.4 MiB RAM, BlockManagerId(0, 172.18.0.3, 33016, None)
22/07/21 07:25:52 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0) (172.18.0.3, executor 0, partition 0, PROCESS_LOCAL, 4464 bytes) taskResourceAssignments Map()
22/07/21 07:25:52 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1) (172.18.0.3, executor 0, partition 1, PROCESS_LOCAL, 4491 bytes) taskResourceAssignments Map()
22/07/21 07:25:52 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 172.18.0.3:33016 (size: 1757.0 B, free: 434.4 MiB)
22/07/21 07:25:52 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 456 ms on 172.18.0.3 (executor 0) (1/2)
22/07/21 07:25:52 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 446 ms on 172.18.0.3 (executor 0) (2/2)
22/07/21 07:25:52 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
22/07/21 07:25:52 INFO DAGScheduler: ResultStage 0 (collect at /root/t1.py:10) finished in 2.357 s
22/07/21 07:25:52 INFO DAGScheduler: Job 0 is finished. Cancelling potential speculative or zombie tasks for this job
22/07/21 07:25:52 INFO TaskSchedulerImpl: Killing all running tasks in stage 0: Stage finished
22/07/21 07:25:52 INFO DAGScheduler: Job 0 finished: collect at /root/t1.py:10, took 2.401307 s
1
2
3
22/07/21 07:25:53 INFO SparkContext: Invoking stop() from shutdown hook
22/07/21 07:25:53 INFO SparkUI: Stopped Spark web UI at http://46dd0a966c6f:4040
22/07/21 07:25:53 INFO StandaloneSchedulerBackend: Shutting down all executors
22/07/21 07:25:53 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
22/07/21 07:25:53 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
22/07/21 07:25:53 INFO MemoryStore: MemoryStore cleared
22/07/21 07:25:53 INFO BlockManager: BlockManager stopped
22/07/21 07:25:53 INFO BlockManagerMaster: BlockManagerMaster stopped
22/07/21 07:25:53 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
22/07/21 07:25:53 INFO SparkContext: Successfully stopped SparkContext
22/07/21 07:25:53 INFO ShutdownHookManager: Shutdown hook called
22/07/21 07:25:53 INFO ShutdownHookManager: Deleting directory /tmp/spark-aca43d84-54ab-45a6-87ec-25a644286af0
22/07/21 07:25:53 INFO ShutdownHookManager: Deleting directory /tmp/spark-aca43d84-54ab-45a6-87ec-25a644286af0/pyspark-c7c9f3aa-5810-405b-a375-28e8b3344f68
22/07/21 07:25:53 INFO ShutdownHookManager: Deleting directory /tmp/spark-bfcd5ac6-e629-4fdb-8621-fc76d17ffed3
And here is the worker node's output for this job, to compare with the original error log.
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1894)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:61)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:424)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:413)
at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:102)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$9(CoarseGrainedExecutorBackend.scala:444)
at scala.runtime.java8.JFunction1$mcVI$sp.apply(JFunction1$mcVI$sp.java:23)
at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:985)
at scala.collection.immutable.Range.foreach(Range.scala:158)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:984)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$7(CoarseGrainedExecutorBackend.scala:442)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:62)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:61)
at java.base/java.security.AccessController.doPrivileged(Native Method)
at java.base/javax.security.auth.Subject.doAs(Unknown Source)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
... 4 more
Caused by: java.io.IOException: Failed to connect to cl-aicrdev03/172.17.0.1:44731
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:288)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:218)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:230)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:204)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:202)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:198)
at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: cl-aicrdev03/172.17.0.1:44731
Caused by: java.net.ConnectException: Connection refused
at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:710)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:658)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:584)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Unknown Source)
It seems the executor had no problem connecting to the driver this time.
But that still begs the question: why does this simple spark-submit work when the driver is a Docker container sharing the same bridge network with the master and worker, while it doesn't work when the driver is outside of Docker, even though the worker container has (I think...) been confirmed to be able to reach the driver outside?
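One thing I still want to try is probing, from inside the worker container, the exact host and port the executor complains about (for example 172.17.0.1:40314 from the first error log; the driver picks a new random port on each run, so that value is only an example). A minimal sketch, assuming python3 is available inside the worker container:

import socket

# host/port copied from the executor error "Failed to connect to devmachine/172.17.0.1:40314"
host, port = "172.17.0.1", 40314

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.settimeout(5)
    try:
        s.connect((host, port))
        print("reachable")
    except OSError as e:
        print("not reachable:", e)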
Related
Our Environment:
Jenkins version - Jenkins 2.319.1
Jenkins Master image : jenkins/jenkins:2.319.1-lts-alpine
Jenkins worker image: jenkins/inbound-agent:4.11-1-alpine
Installed plugins:
Kubernetes - 1.30.6
Kubernetes Client API - 5.4.1
Kubernetes Credentials Plugin - 0.9.0
JAVA version on master: openjdk 11.0.13
JAVA version on Agent/worker : openjdk 11.0.14
Hi team,
We are facing an issue in Jenkins where the agent disconnects (or goes offline) from the master while a job is still running on the agent/worker. We are getting the error below (highlighted) and have tried the things listed further down, but the issue is still not fully resolved. Jenkins is deployed on EKS.
Error:
5334535:2022-11-02 14:07:54.573+0000 [id=140290] INFO hudson.slaves.NodeProvisioner#update: worker-7j4x4 provisioning successfully completed. We have now 2 computer(s)
5334695:2022-11-02 14:07:54.675+0000 [id=140291] INFO o.c.j.p.k.KubernetesLauncher#launch: Created Pod: kubernetes done-jenkins/worker-7j4x4
5334828:2022-11-02 14:07:56.619+0000 [id=140291] INFO o.c.j.p.k.KubernetesLauncher#launch: Pod is running: kubernetes done-jenkins/worker-7j4x4
5334964-2022-11-02 14:07:58.650+0000 [id=140309] INFO h.TcpSlaveAgentListener$ConnectionHandler#run: Accepted JNLP4-connect connection #97 from /100.122.254.111:42648
5335123-2022-11-02 14:09:19.733+0000 [id=140536] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Started DockerContainerWatchdog Asynchronous Periodic Work
5335275-2022-11-02 14:09:19.733+0000 [id=140536] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog has been triggered
5335409-2022-11-02 14:09:19.734+0000 [id=140536] INFO c.n.j.p.d.DockerContainerWatchdog$Statistics#writeStatisticsToLog: Watchdog Statistics: Number of overall executions: 2608, Executions with processing timeout: 0, Containers removed gracefully: 0, Containers removed with force: 0, Containers removal failed: 0, Nodes removed successfully: 0, Nodes removal failed: 0, Container removal average duration (gracefully): 0 ms, Container removal average duration (force): 0 ms, Average overall runtime of watchdog: 0 ms, Average runtime of container retrieval: 0 ms
5335965-2022-11-02 14:09:19.734+0000 [id=140536] INFO c.n.j.p.d.DockerContainerWatchdog#loadNodeMap: We currently have 1 nodes assigned to this Jenkins instance, which we will check
5336139-2022-11-02 14:09:19.734+0000 [id=140536] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog check has been completed
5336279-2022-11-02 14:09:19.734+0000 [id=140536] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Finished DockerContainerWatchdog Asynchronous Periodic Work. 1 ms
5336438-groovy.lang.MissingPropertyException: No such property: envVar for class: groovy.lang.Binding
5336532- at groovy.lang.Binding.getVariable(Binding.java:63)
5336585- at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SandboxInterceptor.onGetProperty(SandboxInterceptor.java:271)
–
5394279-2022-11-02 15:09:19.733+0000 [id=141899] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Started DockerContainerWatchdog Asynchronous Periodic Work
5394431-2022-11-02 15:09:19.734+0000 [id=141899] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog has been triggered
5394565-2022-11-02 15:09:19.734+0000 [id=141899] INFO c.n.j.p.d.DockerContainerWatchdog$Statistics#writeStatisticsToLog: Watchdog Statistics: Number of overall executions: 2620, Executions with processing timeout: 0, Containers removed gracefully: 0, Containers removed with force: 0, Containers removal failed: 0, Nodes removed successfully: 0, Nodes removal failed: 0, Container removal average duration (gracefully): 0 ms, Container removal average duration (force): 0 ms, Average overall runtime of watchdog: 0 ms, Average runtime of container retrieval: 0 ms
5395121-2022-11-02 15:09:19.734+0000 [id=141899] INFO c.n.j.p.d.DockerContainerWatchdog#loadNodeMap: We currently have 3 nodes assigned to this Jenkins instance, which we will check
5395295-2022-11-02 15:09:19.734+0000 [id=141899] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog check has been completed
5395435-2022-11-02 15:09:19.734+0000 [id=141899] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Finished DockerContainerWatchdog Asynchronous Periodic Work. 1 ms
5395594-2022-11-02 15:11:59.502+0000 [id=140320] INFO hudson.slaves.ChannelPinger$1#onDead: Ping failed. Terminating the channel JNLP4-connect connection from ip-100-122-254-111.eu-central-1.compute.internal/100.122.254.111:42648.
5395817-java.util.concurrent.TimeoutException: Ping started at 1667401679501 hasn't completed by 1667401919502
5395920- at hudson.remoting.PingThread.ping(PingThread.java:134)
5395977- at hudson.remoting.PingThread.run(PingThread.java:90)
5396032:2022-11-02 15:11:59.503+0000 [id=141914] INFO j.s.DefaultJnlpSlaveReceiver#channelClosed: Computer.threadPoolForRemoting 5049 for worker-7j4x4 terminated: java.nio.channels.ClosedChannelException
5396231-2022-11-02 15:12:35.579+0000 [id=141933] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Started Periodic background build discarder
5396368-2022-11-02 15:12:36.257+0000 [id=141933] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Finished Periodic background build discarder. 678 ms
5396514-2022-11-02 15:14:15.582+0000 [id=141422] INFO hudson.slaves.ChannelPinger$1#onDead: Ping failed. Terminating the channel JNLP4-connect connection from ip-100-122-237-38.eu-central-1.compute.internal/100.122.237.38:55038.
5396735-java.util.concurrent.TimeoutException: Ping started at 1667401815582 hasn't completed by 1667402055582
5396838- at hudson.remoting.PingThread.ping(PingThread.java:134)
5396895- at hudson.remoting.PingThread.run(PingThread.java:90)
5396950-2022-11-02 15:14:15.584+0000 [id=141915] INFO j.s.DefaultJnlpSlaveReceiver#channelClosed: Computer.threadPoolForRemoting 5050 for worker-fjf1p terminated: java.nio.channels.ClosedChannelException
5397149-2022-11-02 15:14:19.733+0000 [id=141950] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Started DockerContainerWatchdog Asynchronous Periodic Work
5397301-2022-11-02 15:14:19.733+0000 [id=141950] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog has been triggered
5397435-2022-11-02 15:14:19.734+0000 [id=141950] INFO c.n.j.p.d.DockerContainerWatchdog$Statistics#writeStatisticsToLog: Watchdog Statistics: Number of overall executions: 2621, Executions with processing timeout: 0, Containers removed gracefully: 0, Containers removed with force: 0, Containers removal failed: 0, Nodes removed successfully: 0, Nodes removal failed: 0, Container removal average duration (gracefully): 0 ms, Container removal average duration (force): 0 ms, Average overall runtime of watchdog: 0 ms, Average runtime of container retrieval: 0 ms
Any suggestions or resolutions, please?
We have tried the following things:
Increased idleMinutes to 180 from default
Verified that resources are sufficient per the Grafana dashboard
Changed podRetention to onFailure from Never
Changed podRetention to Always from Never
Increased readTimeout
Increased connectTimeout
Increased slaveConnectTimeoutStr
Disabled the ping thread from the UI by disabling the "Response Time" checkbox under preventive node monitoring
Increased activeDeadlineSeconds
Verified same java version on master and agent
Updated kubernetes and kubernetes API client plugins
The expectation is that the worker/agent should disconnect once a job has run successfully and terminate after the defined idleMinutes, but a few times it terminates while a job is still running on the agent.
I installed Docker, Docker Compose and then Jenkins on CentOS 8. Jenkins seems to be installed correctly. However, I see the message "Jenkins appears to be offline" and get the exception mentioned below. I changed the URL https://updates.jenkins.io/update-center.json in hudson.model.UpdateCenter.xml from https to http, but the exception still reappears and plugins are not getting upgraded.
Any help on this is appreciated. Thanks.
Exception:
$: docker logs -f jenkins
Running from: /usr/share/jenkins/jenkins.war
webroot: EnvVars.masterEnvVars.get("JENKINS_HOME")
2020-05-16 07:08:57.939+0000 [id=1] INFO org.eclipse.jetty.util.log.Log#initialized: Logging initialized #4453ms to org.eclipse.jetty.util.log.JavaUtilLog
2020-05-16 07:09:02.052+0000 [id=1] INFO winstone.Logger#logInternal: Beginning extraction from war file
2020-05-16 07:09:04.197+0000 [id=1] WARNING o.e.j.s.handler.ContextHandler#setContextPath: Empty contextPath
2020-05-16 07:09:04.990+0000 [id=1] INFO org.eclipse.jetty.server.Server#doStart: jetty-9.4.27.v20200227; built: 2020-02-27T18:37:21.340Z; git: a304fd9f351f337e7c0e2a7c28878dd536149c6c; jvm 1.8.0_242-b08
2020-05-16 07:09:12.128+0000 [id=1] INFO o.e.j.w.StandardDescriptorProcessor#visitServlet: NO JSP Support for /, did not find org.eclipse.jetty.jsp.JettyJspServlet
2020-05-16 07:09:12.439+0000 [id=1] INFO o.e.j.s.s.DefaultSessionIdManager#doStart: DefaultSessionIdManager workerName=node0
2020-05-16 07:09:12.439+0000 [id=1] INFO o.e.j.s.s.DefaultSessionIdManager#doStart: No SessionScavenger set, using defaults
2020-05-16 07:09:12.476+0000 [id=1] INFO o.e.j.server.session.HouseKeeper#startScavenging: node0 Scavenging every 600000ms
2020-05-16 07:09:14.143+0000 [id=1] INFO hudson.WebAppMain#contextInitialized: Jenkins home directory: /var/jenkins_home found at: EnvVars.masterEnvVars.get("JENKINS_HOME")
2020-05-16 07:09:14.794+0000 [id=1] INFO o.e.j.s.handler.ContextHandler#doStart: Started w.#2235eaab{Jenkins v2.237,/,file:///var/jenkins_home/war/,AVAILABLE}{/var/jenkins_home/war}
2020-05-16 07:09:14.871+0000 [id=1] INFO o.e.j.server.AbstractConnector#doStart: Started ServerConnector#5315b42e{HTTP/1.1, (http/1.1)}{0.0.0.0:8080}
2020-05-16 07:09:14.872+0000 [id=1] INFO org.eclipse.jetty.server.Server#doStart: Started #21390ms
2020-05-16 07:09:14.881+0000 [id=20] INFO winstone.Logger#logInternal: Winstone Servlet Engine running: controlPort=disabled
2020-05-16 07:09:18.067+0000 [id=26] INFO jenkins.InitReactorRunner$1#onAttained: Started initialization
2020-05-16 07:09:18.316+0000 [id=25] INFO jenkins.InitReactorRunner$1#onAttained: Listed all plugins
2020-05-16 07:09:22.668+0000 [id=26] INFO jenkins.InitReactorRunner$1#onAttained: Prepared all plugins
2020-05-16 07:09:22.694+0000 [id=26] INFO jenkins.InitReactorRunner$1#onAttained: Started all plugins
2020-05-16 07:09:22.881+0000 [id=25] INFO jenkins.InitReactorRunner$1#onAttained: Augmented all extensions
2020-05-16 07:09:24.037+0000 [id=25] INFO jenkins.InitReactorRunner$1#onAttained: System config loaded
2020-05-16 07:09:24.037+0000 [id=25] INFO jenkins.InitReactorRunner$1#onAttained: System config adapted
2020-05-16 07:09:24.037+0000 [id=25] INFO jenkins.InitReactorRunner$1#onAttained: Loaded all jobs
2020-05-16 07:09:24.038+0000 [id=26] INFO jenkins.InitReactorRunner$1#onAttained: Configuration for all jobs updated
2020-05-16 07:09:24.194+0000 [id=39] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$0: Started Download metadata
2020-05-16 07:09:24.255+0000 [id=39] INFO hudson.util.Retrier#start: Attempt #1 to do the action check updates server
2020-05-16 07:09:27.052+0000 [id=25] INFO o.s.c.s.AbstractApplicationContext#prepareRefresh: Refreshing org.springframework.web.context.support.StaticWebApplicationContext#6069d6b7: display name [Root WebApplicationContext]; startup date [Sat May 16 07:09:27 UTC 2020]; root of context hierarchy
2020-05-16 07:09:27.053+0000 [id=25] INFO o.s.c.s.AbstractApplicationContext#obtainFreshBeanFactory: Bean factory for application context [org.springframework.web.context.support.StaticWebApplicationContext#6069d6b7]: org.springframework.beans.factory.support.DefaultListableBeanFactory#6481ce76
2020-05-16 07:09:27.084+0000 [id=25] INFO o.s.b.f.s.DefaultListableBeanFactory#preInstantiateSingletons: Pre-instantiating singletons in org.springframework.beans.factory.support.DefaultListableBeanFactory#6481ce76: defining beans [authenticationManager]; root of factory hierarchy
2020-05-16 07:09:27.739+0000 [id=25] INFO o.s.c.s.AbstractApplicationContext#prepareRefresh: Refreshing org.springframework.web.context.support.StaticWebApplicationContext#3b4860bb: display name [Root WebApplicationContext]; startup date [Sat May 16 07:09:27 UTC 2020]; root of context hierarchy
2020-05-16 07:09:27.739+0000 [id=25] INFO o.s.c.s.AbstractApplicationContext#obtainFreshBeanFactory: Bean factory for application context [org.springframework.web.context.support.StaticWebApplicationContext#3b4860bb]: org.springframework.beans.factory.support.DefaultListableBeanFactory#5c405df3
2020-05-16 07:09:27.747+0000 [id=25] INFO o.s.b.f.s.DefaultListableBeanFactory#preInstantiateSingletons: Pre-instantiating singletons in org.springframework.beans.factory.support.DefaultListableBeanFactory#5c405df3: defining beans [filter,legacy]; root of factory hierarchy
2020-05-16 07:09:27.955+0000 [id=25] INFO jenkins.InitReactorRunner$1#onAttained: Completed initialization
2020-05-16 07:09:28.492+0000 [id=19] INFO hudson.WebAppMain$3#run: Jenkins is fully up and running
2020-05-16 07:09:44.531+0000 [id=39] INFO hudson.util.Retrier#start: The attempt #1 to do the action check updates server failed with an allowed exception:
java.net.UnknownHostException: updates.jenkins.io
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:607)
at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:463)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:558)
at sun.net.www.http.HttpClient.<init>(HttpClient.java:242)
at sun.net.www.http.HttpClient.New(HttpClient.java:339)
at sun.net.www.http.HttpClient.New(HttpClient.java:357)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1226)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1162)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1056)
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:990)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1570)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1498)
at hudson.model.DownloadService.loadJSON(DownloadService.java:114)
at hudson.model.UpdateSite.updateDirectlyNow(UpdateSite.java:212)
at hudson.model.UpdateSite.updateDirectlyNow(UpdateSite.java:207)
at hudson.PluginManager.checkUpdatesServer(PluginManager.java:1767)
at hudson.util.Retrier.start(Retrier.java:63)
at hudson.PluginManager.doCheckUpdatesServer(PluginManager.java:1738)
at jenkins.DailyCheck.execute(DailyCheck.java:93)
at hudson.model.AsyncPeriodicWork.lambda$doRun$0(AsyncPeriodicWork.java:100)
at java.lang.Thread.run(Thread.java:748)
2020-05-16 07:09:44.536+0000 [id=39] INFO hudson.util.Retrier#start: Calling the listener of the allowed exception 'updates.jenkins.io' at the attempt #1 to do the action check updates server
2020-05-16 07:09:44.544+0000 [id=39] INFO hudson.util.Retrier#start: Attempted the action check updates server for 1 time(s) with no success
2020-05-16 07:09:44.547+0000 [id=39] SEVERE hudson.PluginManager#doCheckUpdatesServer: Error checking update sites for 1 attempt(s). Last exception was: UnknownHostException: updates.jenkins.io
2020-05-16 07:09:44.566+0000 [id=39] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$0: Finished Download metadata. 20,358 ms
2020-05-16 07:10:31.488+0000 [id=56] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$0: Started Periodic background build discarder
2020-05-16 07:10:31.492+0000 [id=56] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$0: Finished Periodic background build discarder. 2 ms
I got the same exception in a Kubernetes environment.
My Setup:
-Docker CE
-Kubernetes
Deployment of Jenkins:
deployment.apps/jenkins created
persistentvolume/jenkins created
persistentvolumeclaim/jenkins-claim created
serviceaccount/jenkins created
role.rbac.authorization.k8s.io/jenkins created
rolebinding.rbac.authorization.k8s.io/jenkins created
service/jenkins created
In the pod's log, I could see this:
2020-06-19 05:04:12.590+0000 [id=39] INFO hudson.util.Retrier#start: The attempt #1 to do the action check updates server failed with an allowed exception:
java.net.UnknownHostException: updates.jenkins.io
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:607)
at sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:666)
at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:463)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:558)
at sun.net.www.protocol.https.HttpsClient.<init>(HttpsClient.java:264)
at sun.net.www.protocol.https.HttpsClient.New(HttpsClient.java:367)
at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(AbstractDelegateHttpsURLConnection.java:191)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1162)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1056)
at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:177)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1570)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1498)
at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:268)
at hudson.model.DownloadService.loadJSON(DownloadService.java:114)
at hudson.model.UpdateSite.updateDirectlyNow(UpdateSite.java:212)
at hudson.model.UpdateSite.updateDirectlyNow(UpdateSite.java:207)
at hudson.PluginManager.checkUpdatesServer(PluginManager.java:1767)
at hudson.util.Retrier.start(Retrier.java:63)
at hudson.PluginManager.doCheckUpdatesServer(PluginManager.java:1738)
at jenkins.DailyCheck.execute(DailyCheck.java:93)
at hudson.model.AsyncPeriodicWork.lambda$doRun$0(AsyncPeriodicWork.java:100)
at java.lang.Thread.run(Thread.java:748)
2020-06-19 05:04:12.591+0000 [id=39] INFO hudson.util.Retrier#start: Calling the listener of the allowed exception 'updates.jenkins.io' at the attempt #1 to do the action check updates server
2020-06-19 05:04:12.593+0000 [id=39] INFO hudson.util.Retrier#start: Attempted the action check updates server for 1 time(s) with no success
2020-06-19 05:04:12.594+0000 [id=39] SEVERE hudson.PluginManager#doCheckUpdatesServer: Error checking update sites for 1 attempt(s). Last exception was: UnknownHostException: updates.jenkins.io
2020-06-19 05:04:12.597+0000 [id=39] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$0: Finished Download metadata. 20,180 ms
After some time, if you fetch the logs again, you'll see the token for the first-time Jenkins setup.
Get the token; in my case, I had to do a port-forward to 8080.
Once you access Jenkins in the browser, fill in the token and see if Jenkins is offline.
If it is, go to Manage Jenkins, then the Advanced tab, scroll down and locate the updates.jenkins.io URL at the bottom.
Here you just need to click Submit without changing anything, then Apply.
Now check for updates in the plugins.
If this does not work, you can choose HTTP over HTTPS for the updates.jenkins.io URL in the same Advanced section, then submit and apply the changes. Again, check for plugin updates.
If the above two don't work, it is quite possible that you'll have to configure Jenkins's proxy settings so that the Jenkins instance can reach the internet.
The exception is simply saying that the Jenkins instance could not fetch the available plugins/updates from the updates.jenkins.io site because it cannot resolve the hostname.
The main idea here is to get the hostname to resolve.
There are a couple of issues related to this exception on git issue trackers as well; I guess you've just hit a common issue.
In my case I had to follow https://kubernetes.io/docs/tasks/debug-application-cluster/dns-debugging-resolution/ to properly debug DNS.
I guess you can check whether you have a network interface with nameserver 8.8.8.8, or whether /etc/resolv.conf contains the same.
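As a quick sanity check, here is a minimal sketch of resolving the update site from inside the container or pod (assuming python3 is available there; nslookup or dig would do the same job):

import socket

try:
    # resolve the Jenkins update site the same way the JVM would
    print(socket.gethostbyname("updates.jenkins.io"))
except socket.gaierror as e:
    print("DNS resolution failed:", e)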
After trying different things, these commands worked for me:
Disable and stop the firewall on the CentOS host:
sudo systemctl disable firewalld
sudo systemctl stop firewalld
Then restart Docker:
sudo service docker restart
I have a question regarding Flink. I am running an application in a local cluster with 1 TaskManager and 4 task slots.
After the application has been running for some time, I get a timeout error:
java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id feea6a6702a0cf960ae2847b5bd25665 timed out.
I have seen some posts on this topic but no answer to it. Could you help me find the root cause, or suggest possible troubleshooting steps?
I am using Flink version 1.5.3.
It seems that the Docker containers of the TaskManagers and the JobManager are stopped when this happens.
Let me add the error trace from the JobManager container logs:
2019-06-09 13:31:06,300 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job Socket Window NgsiEvent (ef3a860de48d54544d973754c6170d8b) switched from state FAILING to FAILED.
java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id 63dbab620797b84da023b33578478238 timed out.
at org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(JobMaster.java:1609)
at org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl$HeartbeatMonitor.run(HeartbeatManagerImpl.java:339)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at org.apache.flink.runtime.concurrent.akka.ActorSystemScheduledExecutorAdapter$ScheduledFutureTask.run(ActorSystemScheduledExecutorAdapter.java:154)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2019-06-09 13:31:06,308 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Could not restart the job Socket Window NgsiEvent (ef3a860de48d54544d973754c6170d8b) because the restart strategy prevented it.
java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id 63dbab620797b84da023b33578478238 timed out.
at org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(JobMaster.java:1609)
at org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl$HeartbeatMonitor.run(HeartbeatManagerImpl.java:339)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at org.apache.flink.runtime.concurrent.akka.ActorSystemScheduledExecutorAdapter$ScheduledFutureTask.run(ActorSystemScheduledExecutorAdapter.java:154)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2019-06-09 13:31:06,317 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Stopping checkpoint coordinator for job ef3a860de48d54544d973754c6170d8b.
2019-06-09 13:31:06,322 INFO org.apache.flink.runtime.checkpoint.StandaloneCompletedCheckpointStore - Shutting down
2019-06-09 13:31:06,331 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink#16363182f31f:36715] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink#16363182f31f:36715]] Caused by: [16363182f31f]
2019-06-09 13:31:06,351 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher - Job ef3a860de48d54544d973754c6170d8b reached globally terminal state FAILED.
2019-06-09 13:31:06,434 INFO org.apache.flink.runtime.jobmaster.JobMaster - Stopping the JobMaster for job Socket Window NgsiEvent(ef3a860de48d54544d973754c6170d8b).
2019-06-09 13:31:06,447 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Suspending SlotPool.
2019-06-09 13:31:06,448 INFO org.apache.flink.runtime.jobmaster.JobMaster - Close ResourceManager connection 883e842633b0fd9a2e53ab45778581fe: JobManager is shutting down..
2019-06-09 13:31:06,449 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcActor - The rpc endpoint org.apache.flink.runtime.jobmaster.slotpool.SlotPool has not been started yet. Discarding message org.apache.flink.runtime.rpc.messages.LocalRpcInvocation until processing is started.
2019-06-09 13:31:06,457 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Disconnect job manager 00000000000000000000000000000000#akka.tcp://flink#jobmanager:6123/user/jobmanager_2 for job ef3a860de48d54544d973754c6170d8b from the resource manager.
2019-06-09 13:31:06,459 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Stopping SlotPool.
2019-06-09 13:31:06,460 INFO org.apache.flink.runtime.jobmaster.JobManagerRunner - JobManagerRunner already shutdown.
2019-06-09 13:31:16,304 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink#16363182f31f:36715] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink#16363182f31f:36715]] Caused by: [16363182f31f: Name or service not known]
2019-06-09 13:31:26,320 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink#16363182f31f:36715] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink#16363182f31f:36715]] Caused by: [16363182f31f: Name or service not known]
2019-06-09 13:31:36,286 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink#16363182f31f:36715] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink#16363182f31f:36715]] Caused by: [16363182f31f]
Thanks in advance!
I have installed 3 instances of Neo4j version 1.9.4 on a Linux machine, in 3 different directories: neo4j01, neo4j02, neo4j03.
I have updated the configuration files neo4j.properties and neo4j-server.properties as mentioned in the link (http://docs.neo4j.org/chunked/milestone/ha-setup-tutorial.html).
When I start the Neo4j instances one after the other, they start successfully, but after some time 2 of the 3 Neo4j processes/instances disappear on their own. I noticed this via ps -aef | grep neo4j.
When I checked the console logs, I found the errors below:
2013-11-12 16:37:32.512+0000 INFO [Cluster] Checking store consistency with master
2013-11-12 16:37:33.174+0000 INFO [Cluster] Store is consistent
2013-11-12 16:37:33.176+0000 INFO [Cluster] Catching up with master
2013-11-12 16:37:33.276+0000 INFO [Cluster] Now consistent with master
2013-11-12 16:37:34.442+0000 INFO [Cluster] ServerId 2, successfully moved to slave for master ha://localhost.localdomain:6363?serverId=1
2013-11-12 16:37:34.689+0000 INFO [Cluster] Instance 1 is available as backup at backup://localhost.localdomain:6366
2013-11-12 16:37:34.798+0000 INFO [Cluster] Instance 2 (this server) is available as slave at ha://localhost.localdomain:6364?serverId=2
2013-11-12 16:37:35.036+0000 INFO [Cluster] Database available for write transactions
2013-11-12 16:37:35.360+0000 INFO [API] Successfully started database
2013-11-12 16:37:36.079+0000 INFO [API] Starting HTTP on port :7474 with 10 threads available
2013-11-12 16:37:40.596+0000 INFO [Cluster] Instance 3 has failed
2013-11-12 16:37:43.654+0000 INFO [API] Enabling HTTPS on port :7473
2013-11-12 16:38:01.081+0000 INFO [API] Mounted REST API at: /db/manage/
2013-11-12 16:38:01.158+0000 INFO [API] Mounted discovery module at [/]
2013-11-12 16:38:02.375+0000 INFO [API] Loaded server plugin "CypherPlugin"
2013-11-12 16:38:02.449+0000 INFO [API] Loaded server plugin "GremlinPlugin"
2013-11-12 16:38:02.462+0000 INFO [API] Mounted REST API at [/db/data/]
2013-11-12 16:38:02.534+0000 INFO [API] Mounted management API at [/db/manage/]
2013-11-12 16:38:03.568+0000 INFO [API] Mounted webadmin at [/webadmin]
2013-11-12 16:38:06.189+0000 INFO [API] Mounting static content at [/webadmin] from [webadmin-html]
2013-11-12 16:38:30.844+0000 DEBUG [API] Failed to start Neo Server on port [7474], reason [org.mortbay.util.MultiException[java.net.BindException: Address already in use, java.net.BindException: Address already in use]]
2013-11-12 16:38:30.880+0000 DEBUG [API] org.neo4j.server.ServerStartupException: Starting Neo4j Server failed: org.mortbay.util.MultiException[java.net.BindException: Address already in use, java.net.BindException: Address already in use]
at org.neo4j.server.AbstractNeoServer.start(AbstractNeoServer.java:211) ~[neo4j-server-1.9.4.jar:1.9.4]
at org.neo4j.server.Bootstrapper.start(Bootstrapper.java:86) [neo4j-server-1.9.4.jar:1.9.4]
at org.neo4j.server.Bootstrapper.main(Bootstrapper.java:49) [neo4j-server-1.9.4.jar:1.9.4]
Caused by: java.lang.RuntimeException: org.mortbay.util.MultiException[java.net.BindException: Address already in use, java.net.BindException: Address already in use]
at org.neo4j.server.web.Jetty6WebServer.startJetty(Jetty6WebServer.java:334) ~[neo4j-server-1.9.4.jar:1.9.4]
at org.neo4j.server.web.Jetty6WebServer.start(Jetty6WebServer.java:154) ~[neo4j-server-1.9.4.jar:1.9.4]
at org.neo4j.server.AbstractNeoServer.startWebServer(AbstractNeoServer.java:344) ~[neo4j-server-1.9.4.jar:1.9.4]
at org.neo4j.server.AbstractNeoServer.start(AbstractNeoServer.java:187) ~[neo4j-server-1.9.4.jar:1.9.4]
... 2 common frames omitted
Caused by: org.mortbay.util.MultiException: Multiple exceptions
at org.mortbay.jetty.Server.doStart(Server.java:188) ~[jetty-6.1.25.jar:6.1.25]
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50) ~[jetty-util-6.1.25.jar:6.1.25]
at org.neo4j.server.web.Jetty6WebServer.startJetty(Jetty6WebServer.java:330) ~[neo4j-server-1.9.4.jar:1.9.4]
... 5 common frames omitted
2013-11-12 16:38:30.894+0000 DEBUG [API] Failed to start Neo Server on port [7474]
Now only the neo4j01 process is running; the neo4j02 and neo4j03 processes have disappeared. But even though the neo4j01 process is up and running, I am unable to access the webadmin page at http://htname:7474/webadmin/#/info/org.neo4j/High%20Availability/.
Please, can someone shed some light on this?
You might want to take a look at https://github.com/neo-technology/neo4j-enterprise-local-qa. This contains a rakefile that automates a local setup of 3 instances. Clone the repo locally, and use
rake setup_cluster start_cluster
to bring a locally running cluster online. Shutdown can be done via
rake stop_cluster
Find the configs in machine[ABC]/conf/.
I have a VPS running CentOS with the following details:
[root@XXXXXXX ~]# uname -a
Linux xxxxxxxx2.6.32-042stab055.10 #1 SMP Thu May 10 15:38:32 MSD 2012 i686 i686 i386 GNU/Linux
I am trying to run the e-commerce system Shopizer on the Tomcat installed on it.
I had tried to build it on the VPS but did not succeed, so I built it elsewhere and copied the WAR files to the VPS's Tomcat.
The issue I am facing now, and faced during the build too, is a hang: the build hung when I ran the Ant scripts, and now the Tomcat server hangs when I launch it. The server details are as follows:
Server version: Apache Tomcat/6.0.35
Server built: Nov 28 2011 11:20:06
Server number: 6.0.35.0
OS Name: Linux
OS Version: 2.6.32-042stab055.10
Architecture: i386
JVM Version: 1.6.0_24-b24
JVM Vendor: Sun Microsystems Inc.
The server hangs at startup, and when I take a thread dump (both in this case and in the build case) I get the following:
"GC Daemon" daemon prio=10 tid=0xa09fd000 nid=0x36ef in Object.wait() [0xa08f7000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0xa6c35a88> (a sun.misc.GC$LatencyLock)
at sun.misc.GC$Daemon.run(GC.java:117)
- locked <0xa6c35a88> (a sun.misc.GC$LatencyLock)
"Low Memory Detector" daemon prio=10 tid=0xb7686000 nid=0x36ed runnable [0x00000000]
java.lang.Thread.State: RUNNABLE
"C1 CompilerThread0" daemon prio=10 tid=0xb7684000 nid=0x36ec runnable [0x00000000]
java.lang.Thread.State: RUNNABLE
"Signal Dispatcher" daemon prio=10 tid=0xb7682800 nid=0x36eb waiting on condition [0x00000000]
java.lang.Thread.State: RUNNABLE
"Finalizer" daemon prio=10 tid=0xb7673000 nid=0x36ea in Object.wait() [0xa0ffe000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0xa6ad0b58> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:133)
- locked <0xa6ad0b58> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:149)
at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:177)
"Reference Handler" daemon prio=10 tid=0xb7671800 nid=0x36e9 in Object.wait() [0xa1198000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0xa6ad0a58> (a java.lang.ref.Reference$Lock)
at java.lang.Object.wait(Object.java:502)
at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:133)
- locked <0xa6ad0a58> (a java.lang.ref.Reference$Lock)
"main" prio=10 tid=0xb7605c00 nid=0x36e7 runnable [0xb775d000]
java.lang.Thread.State: RUNNABLE
at java.lang.Byte$ByteCache.<clinit>(Byte.java:79)
at java.lang.Byte.valueOf(Byte.java:102)
- waiting on <0xa6ad0a58> (a java.lang.ref.Reference$Lock)
at java.lang.Object.wait(Object.java:502)
at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:133)
- locked <0xa6ad0a58> (a java.lang.ref.Reference$Lock)
"main" prio=10 tid=0xb7605c00 nid=0x36e7 runnable [0xb775d000]
java.lang.Thread.State: RUNNABLE
at java.lang.Byte$ByteCache.<clinit>(Byte.java:79)
at java.lang.Byte.valueOf(Byte.java:102)
at com.opensymphony.xwork2.conversion.impl.DefaultTypeConverter.<init>(DefaultTypeConverter.java:59)
at com.opensymphony.xwork2.conversion.impl.XWorkConverter.<init>(XWorkConverter.java:186)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:532)
at com.opensymphony.xwork2.inject.ContainerImpl$ConstructorInjector.construct(ContainerImpl.java:419)
at com.opensymphony.xwork2.inject.ContainerBuilder$5.create(ContainerBuilder.java:207)
at com.opensymphony.xwork2.inject.Scope$2$1.create(Scope.java:51)
- locked <0xa18cf408> (a com.opensymphony.xwork2.inject.ContainerImpl)
at com.opensymphony.xwork2.inject.ContainerImpl$ParameterInjector.inject(ContainerImpl.java:462)
at com.opensymphony.xwork2.inject.ContainerImpl.getParameters(ContainerImpl.java:477)
at com.opensymphony.xwork2.inject.ContainerImpl.access$000(ContainerImpl.java:34)
at com.opensymphony.xwork2.inject.ContainerImpl$MethodInjector.inject(ContainerImpl.java:293)
at com.opensymphony.xwork2.inject.ContainerImpl$ConstructorInjector.construct(ContainerImpl.java:431)
at com.opensymphony.xwork2.inject.ContainerBuilder$5.create(ContainerBuilder.java:207)
at com.opensymphony.xwork2.inject.Scope$2$1.create(Scope.java:51)
- locked <0xa18cf408> (a com.opensymphony.xwork2.inject.ContainerImpl)
at com.opensymphony.xwork2.inject.ContainerImpl$ParameterInjector.inject(ContainerImpl.java:462)
at com.opensymphony.xwork2.inject.ContainerImpl.getParameters(ContainerImpl.java:477)
at com.opensymphony.xwork2.inject.ContainerImpl.access$000(ContainerImpl.java:34)
at com.opensymphony.xwork2.inject.ContainerImpl$MethodInjector.inject(ContainerImpl.java:293)
at com.opensymphony.xwork2.inject.ContainerImpl$ConstructorInjector.construct(ContainerImpl.java:431)
at com.opensymphony.xwork2.inject.ContainerBuilder$5.create(ContainerBuilder.java:207)
at com.opensymphony.xwork2.inject.Scope$2$1.create(Scope.java:51)
- locked <0xa18cf408> (a com.opensymphony.xwork2.inject.ContainerImpl)
at com.opensymphony.xwork2.inject.ContainerImpl$ParameterInjector.inject(ContainerImpl.java:462)
at com.opensymphony.xwork2.inject.ContainerImpl.getParameters(ContainerImpl.java:477)
at com.opensymphony.xwork2.inject.ContainerImpl.access$000(ContainerImpl.java:34)
at com.opensymphony.xwork2.inject.ContainerImpl$MethodInjector.inject(ContainerImpl.java:293)
at com.opensymphony.xwork2.inject.ContainerImpl$ConstructorInjector.construct(ContainerImpl.java:431)
at com.opensymphony.xwork2.inject.ContainerBuilder$5.create(ContainerBuilder.java:207)
at com.opensymphony.xwork2.inject.Scope$2$1.create(Scope.java:51)
- locked <0xa18cf408> (a com.opensymphony.xwork2.inject.ContainerImpl)
at com.opensymphony.xwork2.inject.ContainerBuilder$3.create(ContainerBuilder.java:93)
at com.opensymphony.xwork2.inject.ContainerBuilder$7.call(ContainerBuilder.java:487)
at com.opensymphony.xwork2.inject.ContainerBuilder$7.call(ContainerBuilder.java:484)
at com.opensymphony.xwork2.inject.ContainerImpl.callInContext(ContainerImpl.java:574)
at com.opensymphony.xwork2.inject.ContainerBuilder.create(ContainerBuilder.java:484)
at com.opensymphony.xwork2.config.impl.DefaultConfiguration.createBootstrapContainer(DefaultConfiguration.java:252)
at com.opensymphony.xwork2.config.impl.DefaultConfiguration.reloadContainer(DefaultConfiguration.java:193)
- locked <0xa37c05b0> (a com.opensymphony.xwork2.config.impl.DefaultConfiguration)
....
....
..
"VM Thread" prio=10 tid=0xb766d800 nid=0x36e8 runnable
"VM Periodic Task Thread" prio=10 tid=0xb7688400 nid=0x36ee waiting on condition
JNI global references: 911
Heap
def new generation total 39424K, used 5740K [0xa1580000, 0xa4040000, 0xa6ad0000)
eden space 35072K, 12% used [0xa1580000, 0xa19a8e88, 0xa37c0000)
from space 4352K, 34% used [0xa37c0000, 0xa3932570, 0xa3c00000)
to space 4352K, 0% used [0xa3c00000, 0xa3c00000, 0xa4040000)
tenured generation total 87424K, used 1432K [0xa6ad0000, 0xac030000, 0xb1580000)
the space 87424K, 1% used [0xa6ad0000, 0xa6c36038, 0xa6c36200, 0xac030000)
compacting perm gen total 12288K, used 11825K [0xb1580000, 0xb2180000, 0xb5580000)
the space 12288K, 96% used [0xb1580000, 0xb210c588, 0xb210c600, 0xb2180000)
No shared spaces configured.
What can I do to resolve the issue, or is it a bug/issue with Java running on a VPS?
B) The JVM sometimes crashes on the same VPS with the following error:
[root@xxxxxx ~]# java
Error occurred during initialization of VM
Could not reserve enough space for object heap
Could not create the Java virtual machine.
[root@xxxxxx ~]# free
total used free shared buffers cached
Mem: 1155072 561320 593752 0 0 317124
-/+ buffers/cache: 244196 910876
Swap: 0 0 0
This is very strange to me; can anyone explain this behaviour?
I suggest that you try updating your operating system and your JVM.
I think this crash is due to a bug in your current configuration.