After deploying a Python Docker container and successfully executing a script, the container crashes and restarts in a loop after showing the following error message:
2017-06-19 13:22:49 [APP/PROC/WEB/0] OUT Exit status 0
2017-06-19 13:22:49 [CELL/0] OUT Exit status 0
2017-06-19 13:22:49 [CELL/0] OUT Destroying container
2017-06-19 13:22:49 [API/0] OUT Process has crashed with type: "web"
2017-06-19 13:22:49 [API/0] OUT App instance exited with guid 85e7922e-5a0c-4430-994a-324e5abc0c14 payload: {"instance"=>"", "index"=>0, "reason"=>"CRASHED", "exit_description"=>"2 error(s) occurred:\n\n* Codependent step exited\n* cancelled", "crash_count"=>1, "crash_timestamp"=>1497871369566402154, "version"=>"b9800e3a-b057-4cc5-b7e4-c01f9b3c6594"}
Running the same Docker image locally does not throw any errors. The Python script only executes a simple print statement, and I even implemented a handler for the SIGTERM signal that is sent to the container after execution.
In CF (Cloud Foundry), application processes are not supposed to finish. If your script only prints something, it exits with status 0 afterwards. The app container is then stopped, CF registers a "crash", and it restarts the application in accordance with the app lifecycle:
https://docs.cloudfoundry.org/devguide/deploy-apps/app-lifecycle.html
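A minimal sketch of what that means for the script from the question, assuming the goal is simply to keep the process alive after its work is done so the platform does not see an exit. The names run_then_idle, install_sigterm_handler, and stop_requested are made up for illustration. Depending on the configured health check, the platform may expect more than a running process (for example a listening port), which this sketch does not cover.

```python
import signal
import threading

# Set when the platform asks the process to stop (SIGTERM).
stop_requested = threading.Event()


def install_sigterm_handler():
    """Turn SIGTERM into a flag instead of an abrupt exit (call from the main thread)."""
    signal.signal(signal.SIGTERM, lambda signum, frame: stop_requested.set())


def run_then_idle(work, poll_seconds=60.0):
    """Run the one-off work, then block until a stop is requested."""
    work()
    # Event.wait returns True once the flag is set and False on timeout.
    while not stop_requested.wait(poll_seconds):
        pass
    return 0
```

As the container entry point, this would be used by calling install_sigterm_handler() and then sys.exit(run_then_idle(lambda: print("hello"))). The process then stays up after printing and only exits when it is told to stop.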
Our Environment:
Jenkins version - Jenkins 2.319.1
Jenkins Master image : jenkins/jenkins:2.319.1-lts-alpine
Jenkins worker image: jenkins/inbound-agent:4.11-1-alpine
Installed plugins:
Kubernetes - 1.30.6
Kubernetes Client API - 5.4.1
Kubernetes Credentials Plugin - 0.9.0
JAVA version on master: openjdk 11.0.13
JAVA version on Agent/worker : openjdk 11.0.14
Hi team,
We are facing an issue in Jenkins where the Jenkins agent disconnects (or goes offline) from the master while a job is still running on the agent/worker. We are getting the error below (highlighted) and have tried the things listed below, but the issue is still not fully resolved. Jenkins is deployed on EKS.
Error:
5334535:2022-11-02 14:07:54.573+0000 [id=140290] INFO hudson.slaves.NodeProvisioner#update: worker-7j4x4 provisioning successfully completed. We have now 2 computer(s)
5334695:2022-11-02 14:07:54.675+0000 [id=140291] INFO o.c.j.p.k.KubernetesLauncher#launch: Created Pod: kubernetes done-jenkins/worker-7j4x4
5334828:2022-11-02 14:07:56.619+0000 [id=140291] INFO o.c.j.p.k.KubernetesLauncher#launch: Pod is running: kubernetes done-jenkins/worker-7j4x4
5334964-2022-11-02 14:07:58.650+0000 [id=140309] INFO h.TcpSlaveAgentListener$ConnectionHandler#run: Accepted JNLP4-connect connection #97 from /100.122.254.111:42648
5335123-2022-11-02 14:09:19.733+0000 [id=140536] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Started DockerContainerWatchdog Asynchronous Periodic Work
5335275-2022-11-02 14:09:19.733+0000 [id=140536] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog has been triggered
5335409-2022-11-02 14:09:19.734+0000 [id=140536] INFO c.n.j.p.d.DockerContainerWatchdog$Statistics#writeStatisticsToLog: Watchdog Statistics: Number of overall executions: 2608, Executions with processing timeout: 0, Containers removed gracefully: 0, Containers removed with force: 0, Containers removal failed: 0, Nodes removed successfully: 0, Nodes removal failed: 0, Container removal average duration (gracefully): 0 ms, Container removal average duration (force): 0 ms, Average overall runtime of watchdog: 0 ms, Average runtime of container retrieval: 0 ms
5335965-2022-11-02 14:09:19.734+0000 [id=140536] INFO c.n.j.p.d.DockerContainerWatchdog#loadNodeMap: We currently have 1 nodes assigned to this Jenkins instance, which we will check
5336139-2022-11-02 14:09:19.734+0000 [id=140536] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog check has been completed
5336279-2022-11-02 14:09:19.734+0000 [id=140536] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Finished DockerContainerWatchdog Asynchronous Periodic Work. 1 ms
5336438-groovy.lang.MissingPropertyException: No such property: envVar for class: groovy.lang.Binding
5336532- at groovy.lang.Binding.getVariable(Binding.java:63)
5336585- at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SandboxInterceptor.onGetProperty(SandboxInterceptor.java:271)
–
5394279-2022-11-02 15:09:19.733+0000 [id=141899] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Started DockerContainerWatchdog Asynchronous Periodic Work
5394431-2022-11-02 15:09:19.734+0000 [id=141899] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog has been triggered
5394565-2022-11-02 15:09:19.734+0000 [id=141899] INFO c.n.j.p.d.DockerContainerWatchdog$Statistics#writeStatisticsToLog: Watchdog Statistics: Number of overall executions: 2620, Executions with processing timeout: 0, Containers removed gracefully: 0, Containers removed with force: 0, Containers removal failed: 0, Nodes removed successfully: 0, Nodes removal failed: 0, Container removal average duration (gracefully): 0 ms, Container removal average duration (force): 0 ms, Average overall runtime of watchdog: 0 ms, Average runtime of container retrieval: 0 ms
5395121-2022-11-02 15:09:19.734+0000 [id=141899] INFO c.n.j.p.d.DockerContainerWatchdog#loadNodeMap: We currently have 3 nodes assigned to this Jenkins instance, which we will check
5395295-2022-11-02 15:09:19.734+0000 [id=141899] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog check has been completed
5395435-2022-11-02 15:09:19.734+0000 [id=141899] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Finished DockerContainerWatchdog Asynchronous Periodic Work. 1 ms
5395594-2022-11-02 15:11:59.502+0000 [id=140320] INFO hudson.slaves.ChannelPinger$1#onDead: Ping failed. Terminating the channel JNLP4-connect connection from ip-100-122-254-111.eu-central-1.compute.internal/100.122.254.111:42648.
5395817-java.util.concurrent.TimeoutException: Ping started at 1667401679501 hasn't completed by 1667401919502
5395920- at hudson.remoting.PingThread.ping(PingThread.java:134)
5395977- at hudson.remoting.PingThread.run(PingThread.java:90)
5396032:2022-11-02 15:11:59.503+0000 [id=141914] INFO j.s.DefaultJnlpSlaveReceiver#channelClosed: Computer.threadPoolForRemoting 5049 for worker-7j4x4 terminated: java.nio.channels.ClosedChannelException
5396231-2022-11-02 15:12:35.579+0000 [id=141933] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Started Periodic background build discarder
5396368-2022-11-02 15:12:36.257+0000 [id=141933] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Finished Periodic background build discarder. 678 ms
5396514-2022-11-02 15:14:15.582+0000 [id=141422] INFO hudson.slaves.ChannelPinger$1#onDead: Ping failed. Terminating the channel JNLP4-connect connection from ip-100-122-237-38.eu-central-1.compute.internal/100.122.237.38:55038.
5396735-java.util.concurrent.TimeoutException: Ping started at 1667401815582 hasn't completed by 1667402055582
5396838- at hudson.remoting.PingThread.ping(PingThread.java:134)
5396895- at hudson.remoting.PingThread.run(PingThread.java:90)
5396950-2022-11-02 15:14:15.584+0000 [id=141915] INFO j.s.DefaultJnlpSlaveReceiver#channelClosed: Computer.threadPoolForRemoting 5050 for worker-fjf1p terminated: java.nio.channels.ClosedChannelException
5397149-2022-11-02 15:14:19.733+0000 [id=141950] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Started DockerContainerWatchdog Asynchronous Periodic Work
5397301-2022-11-02 15:14:19.733+0000 [id=141950] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog has been triggered
5397435-2022-11-02 15:14:19.734+0000 [id=141950] INFO c.n.j.p.d.DockerContainerWatchdog$Statistics#writeStatisticsToLog: Watchdog Statistics: Number of overall executions: 2621, Executions with processing timeout: 0, Containers removed gracefully: 0, Containers removed with force: 0, Containers removal failed: 0, Nodes removed successfully: 0, Nodes removal failed: 0, Container removal average duration (gracefully): 0 ms, Container removal average duration (force): 0 ms, Average overall runtime of watchdog: 0 ms, Average runtime of container retrieval: 0 ms
Any suggestions or resolutions, please?
Things we have tried:
Increased idleMinutes to 180 from default
Verified that resources are sufficient according to the Grafana dashboard
Changed podRetention to onFailure from Never
Changed podRetention to Always from Never
Increased readTimeout
Increased connectTimeout
Increased slaveConnectTimeoutStr
Disabled the ping thread from the UI by unchecking the "response time" checkbox under preventive node monitoring
Increased activeDeadlineSeconds
Verified same java version on master and agent
Updated kubernetes and kubernetes API client plugins
The expectation is that the worker/agent disconnects once the job has run successfully and terminates after the defined idleMinutes, but sometimes it terminates while the job is still running on the agent.
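As a side note on reading the log above: the two numbers in the TimeoutException are epoch milliseconds, and converting them shows how long the ping was outstanding before the channel was terminated. A small sketch of that arithmetic, with the message copied from the log and a helper name (ping_window) invented for illustration:

```python
import re
from datetime import datetime, timezone

MESSAGE = "Ping started at 1667401679501 hasn't completed by 1667401919502"


def ping_window(message):
    """Return (start, deadline, elapsed_ms) parsed from a PingThread timeout message."""
    start_ms, deadline_ms = (int(n) for n in re.findall(r"\d{13}", message))

    def to_utc(ms):
        # Epoch milliseconds to an aware UTC datetime.
        return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)

    return to_utc(start_ms), to_utc(deadline_ms), deadline_ms - start_ms


start, deadline, elapsed_ms = ping_window(MESSAGE)
print(start.strftime("%H:%M:%S"), deadline.strftime("%H:%M:%S"), elapsed_ms)
# prints: 15:07:59 15:11:59 240001
```

So the ping had been unanswered for about 240 seconds (four minutes) when the master closed the channel at 15:11:59, which suggests the agent stopped responding around 15:07:59 rather than at the moment the disconnect was logged.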
I'm trying to set up TensorFlow to use GPU acceleration with WSL 2 running Ubuntu 20.04. I'm following this tutorial and am running into the error seen here. However, when I follow the solution there and try to start Docker with sudo service docker start, I am told that docker is an unrecognized service. Since I can access the help menu and so on, I know Docker is installed. While I can get Docker to work with the desktop tool, it doesn't support CUDA, as mentioned in the SO post from earlier, so it's not very helpful. It's not really giving me error logs or anything, so please ask if you need more details.
Edit:
Considering the lack of details, here is a list of solutions I've tried to no avail: 1 2 3
Update: I used sudo dockerd to get the container started and tried running the nvidia benchmark container only to be met with
INFO[2020-07-18T21:04:05.875283800-04:00] shim containerd-shim started address=/containerd-shim/021834ef5e5600bdf62a6a9e26dff7ffc1c76dd4ec9dadb9c1fcafb6c88b6e1b.sock debug=false pid=1960
INFO[2020-07-18T21:04:05.899420200-04:00] shim reaped id=70316df254d6b2633c743acb51a26ac2d0520f6f8e2f69b69c4e0624eaac1736
ERRO[2020-07-18T21:04:05.909710600-04:00] stream copy error: reading from a closed fifo
ERRO[2020-07-18T21:04:05.909753500-04:00] stream copy error: reading from a closed fifo
ERRO[2020-07-18T21:04:06.001006700-04:00] 70316df254d6b2633c743acb51a26ac2d0520f6f8e2f69b69c4e0624eaac1736 cleanup: failed to delete container from containerd: no such container
ERRO[2020-07-18T21:04:06.001045100-04:00] Handler for POST /v1.40/containers/70316df254d6b2633c743acb51a26ac2d0520f6f8e2f69b69c4e0624eaac1736/start returned error: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver error: failed to process request\\\\n\\\"\"": unknown
docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver error: failed to process request\\\\n\\\"\"": unknown.
ERRO[0000] error waiting for container: context canceled
Update 2: After installing Windows Insider and making everything as up to date as possible, I encountered a different error.
Run "nbody -benchmark [-numbodies=<numBodies>]" to measure performance.
-fullscreen (run n-body simulation in fullscreen mode)
-fp64 (use double precision floating point values for simulation)
-hostmem (stores simulation data in host memory)
-benchmark (run benchmark to measure performance)
-numbodies=<N> (number of bodies (>= 1) to run in simulation)
-device=<d> (where d=0,1,2.... for the CUDA device to use)
-numdevices=<i> (where i=(number of CUDA devices > 0) to use for simulation)
-compare (compares simulation results running once on the default GPU and once on the CPU)
-cpu (run n-body simulation on the CPU)
-tipsy=<file.bin> (load a tipsy model file for simulation)
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
Error: only 0 Devices available, 1 requested. Exiting.
I have a GTX 970, so I'm not sure why it's not being detected. After running sudo lshw -C display, it was confirmed that my graphics card isn't being detected. I got:
*-display UNCLAIMED
description: 3D controller
product: Microsoft Corporation
vendor: Microsoft Corporation
physical id: 4
bus info: pci#941e:00:00.0
version: 00
width: 32 bits
clock: 33MHz
capabilities: bus_master cap_list
configuration: latency=0
I'm trying to run a bitcoin network on regtest with this version of bitcoin node so I can test out bitpay's insight-ui block explorer.
Running on regtest I get this repeating error
Assertion failed: (psocket), function Shutdown, file zmq/zmqpublishnotifier.cpp, line 92.
[2017-05-19T00:42:44.515Z] warn: Bitcoin process unexpectedly exited with code: null
[2017-05-19T00:42:44.515Z] warn: Restarting bitcoin child process in 5000ms
[2017-05-19T00:42:49.516Z] info: Using bitcoin config file: /Users/harshagoli/BTCT/bitcoin.conf
[2017-05-19T00:42:49.517Z] warn: Stopping existing spawned bitcoin process with pid: 12690
[2017-05-19T00:42:49.517Z] warn: Unclean bitcoin process shutdown, process not found with pid: 12690
[2017-05-19T00:42:49.517Z] info: Starting bitcoin process
Which eventually becomes
[2017-05-19T00:42:54.133Z] error: RPCError: Bitcoin JSON-RPC: Request Error: connect ECONNREFUSED 127.0.0.1:8332
at Bitcoin._wrapRPCError (/Users/harshagoli/mynode/node_modules/bitcore-node/lib/services/bitcoind.js:449:13)
at /Users/harshagoli/mynode/node_modules/bitcore-node/lib/services/bitcoind.js:781:28
at ClientRequest.<anonymous> (/Users/harshagoli/mynode/node_modules/bitcore-node/node_modules/bitcoind-rpc/lib/index.js:116:7)
at emitOne (events.js:77:13)
at ClientRequest.emit (events.js:169:7)
at Socket.socketErrorListener (_http_client.js:269:9)
at emitOne (events.js:77:13)
at Socket.emit (events.js:169:7)
at emitErrorNT (net.js:1269:8)
at nextTickCallbackWith2Args (node.js:458:9)
[2017-05-19T00:42:54.133Z] info: Beginning shutdown
[2017-05-19T00:42:54.133Z] info: Stopping insight-ui (not started)
[2017-05-19T00:42:54.134Z] info: Stopping insight-api (not started)
[2017-05-19T00:42:54.134Z] info: Stopping web (not started)
[2017-05-19T00:42:54.135Z] info: Stopping bitcoind
After that, I get this recurring error:
[2017-05-19T00:42:54.221Z] error: Error: Stopping while trying to spawn bitcoind.
at /Users/harshagoli/mynode/node_modules/bitcore-node/lib/services/bitcoind.js:905:25
at /Users/harshagoli/mynode/node_modules/bitcore-node/node_modules/async/lib/async.js:676:51
at /Users/harshagoli/mynode/node_modules/bitcore-node/node_modules/async/lib/async.js:726:13
at /Users/harshagoli/mynode/node_modules/bitcore-node/node_modules/async/lib/async.js:52:16
at /Users/harshagoli/mynode/node_modules/bitcore-node/node_modules/async/lib/async.js:264:21
at /Users/harshagoli/mynode/node_modules/bitcore-node/node_modules/async/lib/async.js:44:16
at /Users/harshagoli/mynode/node_modules/bitcore-node/node_modules/async/lib/async.js:723:17
at /Users/harshagoli/mynode/node_modules/bitcore-node/node_modules/async/lib/async.js:167:37
at /Users/harshagoli/mynode/node_modules/bitcore-node/node_modules/async/lib/async.js:652:25
at /Users/harshagoli/mynode/node_modules/bitcore-node/lib/services/bitcoind.js:887:16
Thoughts on how I can get this up and running with a block to look at so I can use the block explorer?
Okay, I figured it out. Some other bitcoind processes had zombied out and were still listening on the port this process was trying to use. I ran this command to kill the other processes:
killall -9 bitcoind
Also, to create more blocks on regtest (while in your node directory), use this command:
./node_modules/bitcore-node/bin/bitcoin-0.12.1/bin/bitcoin-cli -datadir=/Users/harshagoli/mynode/data -regtest generate 150
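Before resorting to killall, it can help to confirm that something really is listening on the RPC port from the error (127.0.0.1:8332). A minimal sketch using only the standard library; the helper name port_in_use is an illustration, and it only tells you that a listener exists, not which process owns it:

```python
import socket


def port_in_use(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port is accepted."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0
```

Calling port_in_use("127.0.0.1", 8332) while bitcore-node is stopped should return False. If it returns True, a leftover process is probably still holding the port.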
I use rabbitmq-erlang-client rabbitmq_2.7.0 in my production environment. Recently, I found errors like "unexpected_delivery_and_no_default_consumer" in my server app log:
2016-08-26 15:25:00.465 [error] CRASH REPORT Process with 0 neighbours exited with reason: {unexpected_delivery_and_no_default_consumer,{gen_server2,call,[,{consumer_call,{'basic.deliver',>,15804,false,>,>},{amqp_msg,{'P_basic',undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined},>}},infinity]}} in gen_server:terminate/7 line 826
rabbitmq consumer res:
rabbitmqctl list_consumers
Why is "amq.ctag-2I715Fu1Q2m9AHgrhlN1Og" not among the consumer tags?
PS:
1. I do not use the method 'basic.cancel'.
2. I configure the messages with "no_ack = true".
OpsCenter version: 5.1.0 and
DSE Version: 4.6.0
Creating a brand new cluster directly from OpsCenter gives us the following error. It randomly works with the same settings, but about 95% of the time it fails with the same error. OpsCenter is running on its own box but shares the same security groups as the cluster instances. For good measure, I have opened up all TCP ports to all IPs. The following is the stack trace of the error from opscenterd.log:
2015-03-19 10:06:12+0000 [] INFO: Starting provisioning process
2015-03-19 10:06:12+0000 [] INFO: Starting installation phase of cluster provisioning
2015-03-19 10:06:13+0000 [] WARN: HTTP request http://10.x.x.x:61621/alive? failed: Connection was refused by other side: 111: Connection refused.
2015-03-19 10:06:13+0000 [] INFO: Beginning install of OpsCenter agent to 54.x.x.x
2015-03-19 10:06:26+0000 [] WARN: HTTP request http://10.x.x.x:61621/alive? failed: Connection was refused by other side: 111: Connection refused.
2015-03-19 10:06:31+0000 [] INFO: Agent for ip 10.x.x.x is version None
2015-03-19 10:06:31+0000 [] INFO: Agent for ip 10.x.x.x is version u'5.1.0'
2015-03-19 10:07:23+0000 [] INFO: Successfully installed agent and dse on node 10.x.x.x
2015-03-19 10:07:23+0000 [] INFO: Beginning "stop" phase of cluster provisioning
2015-03-19 10:07:25+0000 [] WARN: Marking request '10.x.x.x: /ops/stop' (f6708fa2-b45f-42b4-b992-90a82b460ac7) as failed: /usr/sbin/service dse stop failed
exit status: 1
stdout:
log_daemon_msg is a shell function
Cassandra 2.0 and later require Java 7 or later.
2015-03-19 10:07:25+0000 [] ERROR: Failed to stop node 10.x.x.x: /usr/sbin/service dse stop failed
exit status: 1
stdout:
log_daemon_msg is a shell function
Cassandra 2.0 and later require Java 7 or later.
2015-03-19 10:07:25+0000 [] WARN: Marking request 'stop stage' (0b6fcb6b-96ba-404e-a484-b4b6b167b309) as failed: Failed to stop node 10.x.x.x: /usr/sbin/service dse stop failed
exit status: 1
stdout:
log_daemon_msg is a shell function
Cassandra 2.0 and later require Java 7 or later.
2015-03-19 10:07:25+0000 [] ERROR: Stop stage failed: Failed to stop node 10.x.x.x: /usr/sbin/service dse stop failed
exit status: 1
stdout:
log_daemon_msg is a shell function
Cassandra 2.0 and later require Java 7 or later.
2015-03-19 10:07:25+0000 [] WARN: Marking request 'provision' (daf1c15d-92e3-40b0-83ca-34d548ea835b) as failed: Stop stage failed: Failed to stop node 10.x.x.x: /usr/sbin/service dse stop failed
exit status: 1
stdout:
log_daemon_msg is a shell function
Cassandra 2.0 and later require Java 7 or later.
2015-03-19 10:07:25+0000 [] ERROR:
2015-03-19 10:07:25+0000 [] ERROR: Cluster provisioning failed: Exception: Stop stage failed: Failed to stop node 10.x.x.x: /usr/sbin/service dse stop failed
exit status: 1
stdout:
log_daemon_msg is a shell function
Cassandra 2.0 and later require Java 7 or later.
2015-03-19 10:07:25+0000 [] ERROR: Failed to provision cluster: Cluster provisioning failed: Exception: Stop stage failed: Failed to stop node 10.x.x.x: /usr/sbin/service dse stop failed
exit status: 1
stdout:
log_daemon_msg is a shell function
Cassandra 2.0 and later require Java 7 or later.
2015-03-19 10:07:25+0000 [] WARN: Marking request 28c021fd-d21a-4fed-bb5c-a4fe17d362e0 as failed: Cluster provisioning failed: Exception: Stop stage failed: Failed to stop node 10.x.x.x: /usr/sbin/service dse stop failed
exit status: 1
stdout:
log_daemon_msg is a shell function
Cassandra 2.0 and later require Java 7 or later.
2015-03-19 10:07:41+0000 [] WARN: Unable to find a matching cluster for node with IP [u'fe80:0:0:0:2000:aff:feeb:31c7%2', u'10.x.x.x', u'0:0:0:0:0:0:0:1%1', u'127.0.0.1']; the message was [u'5.1.0', u'/1947480708/conf']. This usually indicates that an OpsCenter agent is still running on an old node that was decommissioned or is part of a cluster that OpsCenter is no longer monitoring.
Appreciate any help!
Thanks in advance
Harsha
OpsCenter developer here. I make the OpsCenter provisioning features go zoom (or splat occasionally as you've seen). It is with sadness and shame that I must tell you that you're hitting a bug.
The Datastax AMI version 2.4 used by OpsCenter provisioning (https://github.com/riptano/ComboAMI/tree/2.4) does quite a bit of work at boot time via startup scripts. One of those tasks is to set up some gpg repository keys used to validate packages. Intermittently that process can fail, breaking package installs and leading to the series of errors that you're seeing. This failure is intermittent and has greatly increased in frequency recently. If you check /home/ubuntu/datastax-ami/ami.log you should see the gpg key failures that begin the rest of the failure chain.
Unfortunately, this error is pretty far down the technology stack and is difficult to manually work around. If you just need to provision a single cluster you can retry until you get a good run. Otherwise your best bet is to manually launch the instances and use local provisioning to deploy dse/dsc to their private IP addresses:
Launch instances using ami-ada2b6c4 (assuming you're in us-east-1)
Make sure to add the instances to the OpsCenterSecurity group.
Make sure you have the private half of the keypair you use (you'll need it during local provisioning)
On the instance data page, hit the advanced pulldown and add the following userdata as text "--raidonly --java7"
Do a local-provisioning run against the private-ip's
Not a super-simple workaround. I wish your experience with OpsCenter this time around was more awesome. The good news is I'm on this bug and it will be fixed in an upcoming point release.
Edit: No longer necessary to manually remove /etc/security/limits.d/cassandra.conf
If it's just complaining about Java, then install Java 7; DataStax prefers the Oracle JDK and JRE. You might already have Java 7 and another version on your nodes, with Java 7 not being the default version. To change this, run:
sudo update-java-alternatives -s java-7-oracle
This is a command you can script to run over ssh so you don't have to log in to each node.
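A rough sketch of scripting that over ssh, assuming key-based ssh access and passwordless sudo on each node. The host addresses, the ubuntu user, and the helper names are placeholders, not values taken from the cluster above:

```python
import subprocess

REMOTE_COMMAND = "sudo update-java-alternatives -s java-7-oracle"


def build_ssh_command(host, user="ubuntu"):
    """Build the argv list that runs the Java switch on one node."""
    return ["ssh", f"{user}@{host}", REMOTE_COMMAND]


def switch_java_on(hosts, user="ubuntu"):
    """Run the command on every host and return {host: exit_code}."""
    results = {}
    for host in hosts:
        completed = subprocess.run(build_ssh_command(host, user))
        results[host] = completed.returncode
    return results
```

For example, switch_java_on(["10.0.0.11", "10.0.0.12"]) (hypothetical addresses) returns a dictionary of exit codes, and any non-zero value points at a node that still needs attention.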