UWSGI count processes killed with signal 9 (indirectly counting invocations of OOM killer) - uwsgi

I'm running UWSGI on a server and trying to track when worker processes get OOMed without using dmesg since that requires root privileges. In this environment, if a child was killed with SIGKILL, it's a safe assumption that the OOM killer did that.
UWSGI reports in its logs what signal a child was killed with. This issue (https://github.com/unbit/uwsgi/issues/25) shows an example of logs where a child was reported to have exited with signal 9.
Example:
Oct 20 18:54:28 localhost app: DAMN ! worker 2 (pid: 16100) died, killed by signal 9 :( trying respawn ...
Here's the line of code in UWSGI that's responsible for this message:
if (WIFSIGNALED(waitpid_status)) {
    uwsgi_log("DAMN ! worker %d (pid: %d) died, killed by signal %d :( trying respawn ...\n", thewid, (int) diedpid, (int) WTERMSIG(waitpid_status));
}
https://github.com/unbit/uwsgi/blob/65a8d676f3e63a04b07fdcb4e1f92bb6502f024d/core/master.c#L1074
Is there a way to count the number of child processes killed with SIGKILL and surface it as a metric through UWSGI's metrics subsystem? I'm also wondering whether a child that exceeds the harakiri timeout is counted as being killed by a signal.
UWSGI does seem to keep a per-worker signal count (e.g. "signals": 0), but I'm not sure exactly what that field counts.
Example from same GitHub issue:
"pid": 11360,
"requests": 294,
"respawn_count": 38,
"rss": 226373632,
"running_time": 628263,
"signals": 0,
"status": "cheap",
"tx": 5178,
"vsz": 380694528
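
One approach that avoids dmesg is to count the death messages in the UWSGI log itself. The following is a minimal sketch, assuming the log lines keep the format produced by the uwsgi_log() call quoted above; the names DEATH_RE and count_worker_deaths are hypothetical, and the sketch does not feed the count into UWSGI's own metrics subsystem.

```python
import re
from collections import Counter

# Mirrors the format string of the uwsgi_log() call quoted above.
DEATH_RE = re.compile(
    r"worker (?P<wid>\d+) \(pid: (?P<pid>\d+)\) died, killed by signal (?P<sig>\d+)"
)


def count_worker_deaths(lines):
    """Return a Counter mapping signal number -> number of worker deaths."""
    counts = Counter()
    for line in lines:
        match = DEATH_RE.search(line)
        if match:
            counts[int(match.group("sig"))] += 1
    return counts


sample = [
    "Oct 20 18:54:28 localhost app: DAMN ! worker 2 (pid: 16100) died, "
    "killed by signal 9 :( trying respawn ...",
    "Oct 20 18:54:29 localhost app: some unrelated line",
]
deaths = count_worker_deaths(sample)
print(deaths[9])  # number of workers reported as killed by signal 9
```

The resulting count could then be pushed to whatever monitoring system is already in place. Whether a harakiri kill shows up under the same message is not something this sketch can tell.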

Related

In EKS, Worker pods going offline abruptly with 'hudson.slaves.ChannelPinger$1#onDead: Ping failed. Terminating the channel JNLP4-connect connection'

Our Environment:
Jenkins version - Jenkins 2.319.1
Jenkins Master image : jenkins/jenkins:2.319.1-lts-alpine
Jenkins worker image: jenkins/inbound-agent:4.11-1-alpine
Installed plugins:
Kubernetes - 1.30.6
Kubernetes Client API - 5.4.1
Kubernetes Credentials Plugin - 0.9.0
JAVA version on master: openjdk 11.0.13
JAVA version on Agent/worker : openjdk 11.0.14
Hi team,
We are facing an issue where a Jenkins agent disconnects (goes offline) from the master while a job is still running on the agent/worker. We get the error below (highlighted) and have tried the things listed further down, but the issue is still not fully resolved. Jenkins is deployed on EKS.
Error:
5334535:2022-11-02 14:07:54.573+0000 [id=140290] INFO hudson.slaves.NodeProvisioner#update: worker-7j4x4 provisioning successfully completed. We have now 2 computer(s)
5334695:2022-11-02 14:07:54.675+0000 [id=140291] INFO o.c.j.p.k.KubernetesLauncher#launch: Created Pod: kubernetes done-jenkins/worker-7j4x4
5334828:2022-11-02 14:07:56.619+0000 [id=140291] INFO o.c.j.p.k.KubernetesLauncher#launch: Pod is running: kubernetes done-jenkins/worker-7j4x4
5334964-2022-11-02 14:07:58.650+0000 [id=140309] INFO h.TcpSlaveAgentListener$ConnectionHandler#run: Accepted JNLP4-connect connection #97 from /100.122.254.111:42648
5335123-2022-11-02 14:09:19.733+0000 [id=140536] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Started DockerContainerWatchdog Asynchronous Periodic Work
5335275-2022-11-02 14:09:19.733+0000 [id=140536] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog has been triggered
5335409-2022-11-02 14:09:19.734+0000 [id=140536] INFO c.n.j.p.d.DockerContainerWatchdog$Statistics#writeStatisticsToLog: Watchdog Statistics: Number of overall executions: 2608, Executions with processing timeout: 0, Containers removed gracefully: 0, Containers removed with force: 0, Containers removal failed: 0, Nodes removed successfully: 0, Nodes removal failed: 0, Container removal average duration (gracefully): 0 ms, Container removal average duration (force): 0 ms, Average overall runtime of watchdog: 0 ms, Average runtime of container retrieval: 0 ms
5335965-2022-11-02 14:09:19.734+0000 [id=140536] INFO c.n.j.p.d.DockerContainerWatchdog#loadNodeMap: We currently have 1 nodes assigned to this Jenkins instance, which we will check
5336139-2022-11-02 14:09:19.734+0000 [id=140536] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog check has been completed
5336279-2022-11-02 14:09:19.734+0000 [id=140536] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Finished DockerContainerWatchdog Asynchronous Periodic Work. 1 ms
5336438-groovy.lang.MissingPropertyException: No such property: envVar for class: groovy.lang.Binding
5336532- at groovy.lang.Binding.getVariable(Binding.java:63)
5336585- at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SandboxInterceptor.onGetProperty(SandboxInterceptor.java:271)
–
5394279-2022-11-02 15:09:19.733+0000 [id=141899] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Started DockerContainerWatchdog Asynchronous Periodic Work
5394431-2022-11-02 15:09:19.734+0000 [id=141899] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog has been triggered
5394565-2022-11-02 15:09:19.734+0000 [id=141899] INFO c.n.j.p.d.DockerContainerWatchdog$Statistics#writeStatisticsToLog: Watchdog Statistics: Number of overall executions: 2620, Executions with processing timeout: 0, Containers removed gracefully: 0, Containers removed with force: 0, Containers removal failed: 0, Nodes removed successfully: 0, Nodes removal failed: 0, Container removal average duration (gracefully): 0 ms, Container removal average duration (force): 0 ms, Average overall runtime of watchdog: 0 ms, Average runtime of container retrieval: 0 ms
5395121-2022-11-02 15:09:19.734+0000 [id=141899] INFO c.n.j.p.d.DockerContainerWatchdog#loadNodeMap: We currently have 3 nodes assigned to this Jenkins instance, which we will check
5395295-2022-11-02 15:09:19.734+0000 [id=141899] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog check has been completed
5395435-2022-11-02 15:09:19.734+0000 [id=141899] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Finished DockerContainerWatchdog Asynchronous Periodic Work. 1 ms
5395594-2022-11-02 15:11:59.502+0000 [id=140320] INFO hudson.slaves.ChannelPinger$1#onDead: Ping failed. Terminating the channel JNLP4-connect connection from ip-100-122-254-111.eu-central-1.compute.internal/100.122.254.111:42648.
5395817-java.util.concurrent.TimeoutException: Ping started at 1667401679501 hasn't completed by 1667401919502
5395920- at hudson.remoting.PingThread.ping(PingThread.java:134)
5395977- at hudson.remoting.PingThread.run(PingThread.java:90)
5396032:2022-11-02 15:11:59.503+0000 [id=141914] INFO j.s.DefaultJnlpSlaveReceiver#channelClosed: Computer.threadPoolForRemoting 5049 for worker-7j4x4 terminated: java.nio.channels.ClosedChannelException
5396231-2022-11-02 15:12:35.579+0000 [id=141933] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Started Periodic background build discarder
5396368-2022-11-02 15:12:36.257+0000 [id=141933] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Finished Periodic background build discarder. 678 ms
5396514-2022-11-02 15:14:15.582+0000 [id=141422] INFO hudson.slaves.ChannelPinger$1#onDead: Ping failed. Terminating the channel JNLP4-connect connection from ip-100-122-237-38.eu-central-1.compute.internal/100.122.237.38:55038.
5396735-java.util.concurrent.TimeoutException: Ping started at 1667401815582 hasn't completed by 1667402055582
5396838- at hudson.remoting.PingThread.ping(PingThread.java:134)
5396895- at hudson.remoting.PingThread.run(PingThread.java:90)
5396950-2022-11-02 15:14:15.584+0000 [id=141915] INFO j.s.DefaultJnlpSlaveReceiver#channelClosed: Computer.threadPoolForRemoting 5050 for worker-fjf1p terminated: java.nio.channels.ClosedChannelException
5397149-2022-11-02 15:14:19.733+0000 [id=141950] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Started DockerContainerWatchdog Asynchronous Periodic Work
5397301-2022-11-02 15:14:19.733+0000 [id=141950] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog has been triggered
5397435-2022-11-02 15:14:19.734+0000 [id=141950] INFO c.n.j.p.d.DockerContainerWatchdog$Statistics#writeStatisticsToLog: Watchdog Statistics: Number of overall executions: 2621, Executions with processing timeout: 0, Containers removed gracefully: 0, Containers removed with force: 0, Containers removal failed: 0, Nodes removed successfully: 0, Nodes removal failed: 0, Container removal average duration (gracefully): 0 ms, Container removal average duration (force): 0 ms, Average overall runtime of watchdog: 0 ms, Average runtime of container retrieval: 0 ms
Any suggestions or resolutions would be appreciated.
We have tried the following:
Increased idleMinutes to 180 from the default
Verified on the Grafana dashboard that resources are sufficient
Changed podRetention from Never to onFailure
Changed podRetention from Never to Always
Increased readTimeout
Increased connectTimeout
Increased slaveConnectTimeoutStr
Disabled the ping thread from the UI by unchecking the "Response Time" checkbox under preventive node monitoring
Increased activeDeadlineSeconds
Verified the same Java version on master and agent
Updated the Kubernetes and Kubernetes Client API plugins
The expectation is that the worker/agent disconnects only after the job has run successfully and terminates after the configured idleMinutes, but sometimes it terminates while the job is still running on the agent.
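
The two TimeoutException lines in the log carry epoch timestamps in milliseconds, so the time each ping was left outstanding can be worked out directly. A small sketch of that arithmetic, with the values copied from the log above (the helper name elapsed_seconds is only illustrative):

```python
# (start_ms, deadline_ms) pairs copied from the two TimeoutException lines.
PINGS = [
    (1667401679501, 1667401919502),
    (1667401815582, 1667402055582),
]


def elapsed_seconds(start_ms, deadline_ms):
    """Time between ping start and the moment it was declared dead."""
    return (deadline_ms - start_ms) / 1000.0


durations = [elapsed_seconds(start, deadline) for start, deadline in PINGS]
for duration in durations:
    print(f"ping outstanding for about {duration:.0f} s")
```

Both pings were outstanding for roughly 240 seconds before the channel was closed. That suggests the disconnect is driven by a fixed ping timeout rather than by idleMinutes, although the log alone does not show why the agent stopped answering.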

uwsgi log format: seeing uwsgi req: N/M where N > M

I have a uwsgi process running a flask application. There is haproxy (running in mode http) sitting between the client and the application.
I am seeing an occasional haproxy termination state of "SD--" with Tc = 0, Tr = -1, and a returned HTTP code of -1. This means that haproxy encountered an explicit TCP disconnection from the uwsgi server.
Looking at the uwsgi logs, I found that the server was normally processing requests at the same time. But the affected request never reached the server.
The only strange thing about the uwsgi logs at that point in time is that the number of requests managed by the current uwsgi worker is greater than the total number of requests managed by the whole uwsgi app, like this:
[pid: 22759|app: 0|req: **47188**/**47178**] * POST * => generated 84 bytes in 970 msecs (HTTP/1.1 200) 2 headers in 71 bytes (3 switches on core 98)
I am wondering whether this is abnormal, and in what scenarios these counters can end up like this.
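
To see how often this happens, the log can be scanned for lines where the first counter exceeds the second. This is a minimal sketch, assuming the req: N/M layout shown in the line above and reading N as the per-worker count and M as the app-wide count, as the question does; REQ_RE and parse_req_counters are hypothetical names.

```python
import re

# Tolerates the ** emphasis markers present in the pasted log line.
REQ_RE = re.compile(r"req: \**(\d+)\**/\**(\d+)\**")


def parse_req_counters(line):
    """Return (worker_requests, total_requests), or None if absent."""
    match = REQ_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


line = (
    "[pid: 22759|app: 0|req: **47188**/**47178**] * POST * => generated 84 bytes "
    "in 970 msecs (HTTP/1.1 200) 2 headers in 71 bytes (3 switches on core 98)"
)
worker_count, total_count = parse_req_counters(line)
print(worker_count - total_count)  # how far the first counter is ahead
```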

Unused Passenger process stays alive and consumes server resources for a Rails 4 app

We have a Rails app that runs using Apache -> Passenger. At least once a week, the alerts that monitor server CPU and RAM get triggered on one or more of our app servers, and the root cause is that one or more of the Passenger processes are taking up a large chunk of the server's CPU and RAM without actually serving any requests.
For example, when I run "passenger-status" on the server that triggers these alerts, I see this:
Version : 5.3.1
Date : 2022-06-03 22:00:13 +0000
Instance: (Apache/2.4.51 (Amazon) OpenSSL/1.0.2k-fips Phusion_Passenger/5.3.1)
----------- General information -----------
Max pool size : 12
App groups : 1
Processes : 9
Requests in top-level queue : 0
----------- Application groups -----------
Requests in queue: 0
* PID: 16915 Sessions: 1 Processed: 3636 Uptime: 3h 2m 30s
CPU: 5% Memory : 1764M Last used: 0s ago
* PID: 11275 Sessions: 0 Processed: 34 Uptime: 55m 24s
CPU: 45% Memory : 5720M Last used: 35m 43s ago
...
See how the second process hasn't been used for more than 35 minutes but is taking up so much of the server's resources?
The only solution so far has been to manually kill the PID, which seems to resolve the issue, but is there a way to automate this check?
I also realize that the Passenger version is old and can be upgraded (which I will get done soon), but I have seen this issue in multiple versions prior to the current one, so I am not sure whether an upgrade by itself is guaranteed to resolve it.
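
One way to automate the check is to parse the passenger-status output and flag processes that have no sessions, have been idle for a long time, and hold a lot of memory. The following is only a sketch, assuming the output keeps the two-line-per-process layout shown above; the thresholds and the names find_idle_hogs and to_seconds are hypothetical, and the sketch only reports PIDs rather than killing anything.

```python
import re

PID_RE = re.compile(r"\* PID: (\d+)\s+Sessions: (\d+)")
USAGE_RE = re.compile(r"CPU: (\d+)%\s+Memory\s*: (\d+)M\s+Last used: (.+?) ago")
UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def to_seconds(text):
    """Convert a duration such as '35m 43s' to seconds."""
    parts = re.findall(r"(\d+)([dhms])", text)
    return sum(int(number) * UNIT_SECONDS[unit] for number, unit in parts)


def find_idle_hogs(status_text, min_idle_s=1800, min_memory_mb=4096):
    """Return PIDs with no sessions that are idle and memory-heavy."""
    hogs = []
    pid = sessions = None
    for line in status_text.splitlines():
        match = PID_RE.search(line)
        if match:
            pid, sessions = int(match.group(1)), int(match.group(2))
            continue
        match = USAGE_RE.search(line)
        if match and pid is not None:
            memory_mb = int(match.group(2))
            idle_s = to_seconds(match.group(3))
            if sessions == 0 and idle_s >= min_idle_s and memory_mb >= min_memory_mb:
                hogs.append(pid)
            pid = None  # the usage line closes this process entry
    return hogs


sample = """\
* PID: 16915 Sessions: 1 Processed: 3636 Uptime: 3h 2m 30s
CPU: 5% Memory : 1764M Last used: 0s ago
* PID: 11275 Sessions: 0 Processed: 34 Uptime: 55m 24s
CPU: 45% Memory : 5720M Last used: 35m 43s ago
"""
print(find_idle_hogs(sample))
```

A script like this could be run from cron against the real passenger-status output, with the flagged PIDs logged or alerted on before anything is terminated automatically.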

Docker container crashes after execution of python script

After deploying a Python Docker container and successfully executing a script the container crashes and restarts in a loop after showing the following error message:
2017-06-19 13:22:49 [APP/PROC/WEB/0] OUT Exit status 0
2017-06-19 13:22:49 [CELL/0] OUT Exit status 0
2017-06-19 13:22:49 [CELL/0] OUT Destroying container
2017-06-19 13:22:49 [API/0] OUT Process has crashed with type: "web"
2017-06-19 13:22:49 [API/0] OUT App instance exited with guid 85e7922e-5a0c-4430-994a-324e5abc0c14 payload: {"instance"=>"", "index"=>0, "reason"=>"CRASHED", "exit_description"=>"2 error(s) occurred:\n\n* Codependent step exited\n* cancelled", "crash_count"=>1, "crash_timestamp"=>1497871369566402154, "version"=>"b9800e3a-b057-4cc5-b7e4-c01f9b3c6594"}
Executing the same Docker image locally does not throw any errors. The Python script I execute does a simple print, and I even implemented a handler for the SIGTERM signal that is sent into the container after execution.
In CF, applications are not supposed to finish. If your script only prints something, it exits with status 0 afterwards. The app container is then stopped, CF registers a "crash", and it restarts the application in accordance with the app lifecycle:
https://docs.cloudfoundry.org/devguide/deploy-apps/app-lifecycle.html
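
If the script is meant to run as a CF app, it has to stay alive instead of returning after the print. A minimal sketch of a long-running loop that still shuts down cleanly on SIGTERM is shown here; the names run_forever, handle_sigterm and main, and the 60-second interval, are illustrative assumptions rather than anything CF requires.

```python
import signal
import threading

stop_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Ask the main loop to finish instead of dying mid-iteration."""
    stop_requested.set()


def run_forever(do_work, interval_s=60.0):
    """Call do_work repeatedly until a stop is requested; return the call count."""
    calls = 0
    while not stop_requested.is_set():
        do_work()
        calls += 1
        # Sleeps for interval_s, but wakes up early once a stop is requested.
        stop_requested.wait(interval_s)
    return calls


def main():
    """Hypothetical entry point to call from the container's start command."""
    signal.signal(signal.SIGTERM, handle_sigterm)
    run_forever(lambda: print("still alive"))
```

With this shape the process only exits when it is told to stop, so the platform no longer sees an exit right after the work is done.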

Failed to upgrade the iOS application while app is doing network operations using sockets

When I try to update the application using iTunes, I get an error pop-up: Unable to download application.
I run into this error only when my app is doing network operations using sockets.
In other scenarios, where the app is either not running or idle, the update works correctly.
From the console logs, I got the following error messages:
2013-04-18 10:11:39 AM GMT+07:00 backboardd <Warning>: pid_suspend failed for [7104]: Unknown error: -1, Unknown error: -1
2013-04-18 10:11:39 AM GMT+07:00 backboardd <Warning>: Could not set priority of [7104] to 4096, priority: No such process
2013-04-18 10:11:39 AM GMT+07:00 backboardd <Warning>: Application 'UIKitApplication:com.avaya.AVSIPiPhoneCFE[0xe6ed]' exited abnormally with signal 9: Killed: 9
Any idea why this would happen?
This question addresses a similar problem.
In short, iOS automatically restarts an app that crashes or exits abnormally, if it has a background execution flag set. It seems that this leads to iTunes being unable to overwrite the old binary with the new one, because it's still running.

Resources