We are facing application downtime while uploading files to Azure Storage via Arc.
There is no specific code error, but we see a timeout.
The issue resolves once the Azure web app is restarted.
It happens intermittently.
Since we could not find the root cause, we asked whether there is an issue on the Azure side.
The Microsoft team says the system health is OK but points towards accumulated zombie processes: EPMD and inet_gethost.
On searching, I understand that these are created by the Erlang runtime.
Is there some process we can use to kill these zombie processes from time to time?
Also, do they contribute to the application downtime?
Thanks
Is there some process we can use to kill these zombie processes from time to time?
If you're running a sensible init process, these zombie processes should be correctly reaped. This can often be a problem if you run Erlang as the top-level process inside a container, for instance. Can you give more detail about your environment?
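For illustration, here is a minimal sketch in Python of what a "sensible init" does: collect the exit status of any child so nothing lingers as a zombie. This is purely illustrative; in a container you would more likely run the Erlang VM under a real init such as tini (or docker run --init) than roll your own reaper.

    import os
    import signal
    import time

    def reap_children(signum, frame):
        # Collect exit statuses of any terminated children so they
        # don't linger in the process table as zombies.
        while True:
            try:
                pid, status = os.waitpid(-1, os.WNOHANG)
            except ChildProcessError:
                break  # no children at all
            if pid == 0:
                break  # children exist, but none have exited yet

    if __name__ == "__main__":
        # A PID-1-style process installs a SIGCHLD handler and then
        # supervises; orphaned grandchildren get reparented to it
        # and are reaped by the same handler.
        signal.signal(signal.SIGCHLD, reap_children)
        while True:
            time.sleep(60)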
Also, do they really contribute to the application downtime?
Depends on how many of them there are, but probably not, no.
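To put a number on "how many", zombies show up with state Z in /proc on Linux. A quick sketch (field positions per proc(5); Linux-only):

    import os

    def count_zombies():
        zombies = 0
        for entry in os.listdir("/proc"):
            if not entry.isdigit():
                continue
            try:
                with open(f"/proc/{entry}/stat") as f:
                    stat = f.read()
                # Field 3 of /proc/<pid>/stat is the state; 'Z' marks
                # a zombie. The comm field sits in parentheses and may
                # contain spaces, so split after the last ')'.
                state = stat.rsplit(")", 1)[1].split()[0]
            except (FileNotFoundError, ProcessLookupError, IndexError):
                continue  # the process vanished while we were reading
            if state == "Z":
                zombies += 1
        return zombies

    print(count_zombies(), "zombie processes")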
Related
Problem
I have an application that has been running on a Cloud Run instance for 5 months now.
The application has a startup time of about 3 minutes, and once startup is over it does not need much RAM.
Here are two snapshots of docker stats when I run the app locally:
When the app is idle:
When the app is receiving 10 requests per second (which is way over our use case for now):
There aren't any problems when I run the app locally; problems arise when I deploy it on Cloud Run. I keep receiving "OpenBLAS WARNING - could not determine the L2 cache size on this system, assuming 256k" messages, followed by a restart of the app. This is a problem because, as I said, the app takes up to 3 minutes to restart, during which requests take a long time to be handled.
I already fixed the cold start issue by using a minimum instance count of 1 AND a Google Cloud Scheduler job that queries the service every minute.
Examples
Here are examples of what I see in the logs.
In the second example, the warnings came again just after the application restarted, which caused a second restart in a row; this happens quite often.
Also note that these warnings/restarts do not necessarily happen when users are connected to the app; they can also happen when the only activity comes from the Google Cloud Scheduler.
I tried increasing the allocated resources to 4 CPUs and 4 GB of RAM (which is huge overkill), and yet the problem remains.
Update 02/21
As of 01/01/21 we stopped witnessing this behavior from our Cloud Run service (maybe due to an update, I don't know). I did contact GCP support, but they just told me to raise an issue on the OpenBLAS GitHub repo; since I can't reproduce the behavior, I did not do so. I'll leave the question open, as nothing I did really worked.
OpenBLAS performs high-performance compute optimizations and needs to know the CPU's capabilities to tune itself as well as possible.
However, when you run a container on Cloud Run, it runs inside the gVisor sandbox, which increases the security and isolation of all the containers running on the same serverless platform.
This sandbox intercepts low-level kernel calls and discards abnormal/dangerous ones. I guess this is why OpenBLAS can't determine the L2 cache size. In your local environment you don't have this sandbox, and the CPU info can be read directly.
Why does it restart? It could be a problem with OpenBLAS, or a problem with Cloud Run (a suspicious kernel call leading it to kill the instance and restart it).
I don't have an immediate solution because I don't know OpenBLAS. I had similar behavior with TensorFlow Serving, and TensorFlow offers a build compiled without any CPU optimizations: less efficient, but more portable and resilient to different environment constraints. If a similar build exists for OpenBLAS, it would be worth testing.
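One direction to try, assuming the app is Python/NumPy-based: OpenBLAS honors environment variables such as OPENBLAS_NUM_THREADS, and many builds also accept OPENBLAS_CORETYPE to skip runtime CPU detection entirely. They are read when the shared library loads, so they must be set before the first import. Whether this silences the L2 cache probe under gVisor is untested here; treat it as a sketch, with "Haswell" as a placeholder core type.

    import os

    # Must run before the first import of numpy/scipy, because
    # OpenBLAS reads these when the shared library is loaded.
    os.environ.setdefault("OPENBLAS_NUM_THREADS", "1")
    # Forces a specific kernel instead of runtime CPU detection;
    # "Haswell" is an assumed placeholder. Pick a core type that
    # matches (or is conservatively below) your actual CPUs.
    os.environ.setdefault("OPENBLAS_CORETYPE", "Haswell")

    import numpy as np  # deliberately imported after the env setup

    print(np.dot(np.ones((2, 2)), np.ones((2, 2))))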
I am totally new to web programming... I am now working on an already-implemented ASP.NET MVC application deployed in IIS. This app is bound to an application pool that has only one worker process. At the moment, I am trying to understand what happens if the worker process freezes/hangs due to an uncontrolled exception thrown by app code. Can someone explain this to me?
What we have observed is that when this happens, the application stops working correctly and we need to restart its application pool for the app to start working correctly again. After observing this behavior, I have a doubt. In the application pool's advanced configuration, under Process Model, the ping maximum response time (seconds) is set to 90. As far as I know, when the application pool pings the worker process and it does not respond because it is hung, the worker process should be terminated after 90 seconds. However, it does not seem to be terminating, because when this happens we need to restart the application pool for the app to work again. So why does the worker process not terminate in this case?
First off, you have "only" one worker process and should probably keep it that way. Web gardening often causes more issues than it solves, particularly with .NET apps. Second, you say it freezes/hangs due to an "uncontrolled" (unhandled?) exception thrown by app code. Why do you think this is the case? Do you have an error page or something else indicating it's an exception? The "ping" check verifies that the process is still doing work, but not necessarily that it is finishing requests. So from the perspective of WAS, IIS is still responding.
If you want to troubleshoot, you could capture a memory dump with DebugDiag and run some automated analysis on it: https://support.microsoft.com/en-us/help/919792/how-to-use-the-debug-diagnostics-tool-to-troubleshoot-a-process-that-h
We're running a Jenkins server with a few slaves that run the builds. Lately, more and more builds run at the same time.
I see the memory usage of the java.exe process on the Jenkins server just increasing, and not decreasing even after the jobs have finished.
Any idea why this happens?
We're running Jenkins ver. 1.501.
Is there maybe a way to make the Jenkins server service wait until the last job is finished, then restart automatically?
I can't seem to find a reference on this (still posting an answer because it's too long for a comment ;-) ), but this is what I've observed using the Oracle JVM:
If more memory than currently reserved is needed, the JVM reserves more. So far, so good. What it doesn't seem to do is release the memory once it's no longer needed. You can watch this behaviour by turning on the heap size indicator in Eclipse.
I'd say the same happens with Jenkins. A running Jenkins with only a few projects can easily jump past the 1 GB mark. If you have a lot of concurrent builds, Jenkins needs a lot of memory at some point. After the builds are done and the heap usage has decreased, the JVM keeps the memory reserved. It is practically "empty" but still claimed by the JVM, so it's unavailable to other processes.
Again: it's just an observation. I'd be happy if someone with deeper insight into Java memory management would back this up (or disprove it).
As for a practical solution, I'd say you're going to have to live with it to some extent. Jenkins IS very hungry for memory, and restarting it solves the problem only temporarily. At some point it should stop claiming more memory, because the "empty" reserved memory should be reused. If it doesn't, that really sounds like a memory leak and would be a bug.
Jenkins' [excessive] use of memory without bounds seems to be a common observation. The Jenkins Wiki gives some suggestions for "I'm getting OutOfMemoryErrors".
We have also found that the Monitoring Plugin is useful for keeping an eye on the memory usage and helping us know if we might need to restart Jenkins soon.
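Outside of Jenkins itself, you can also watch the JVM's resident memory from the OS side to see whether it ever plateaus. A small sketch using the third-party psutil package; the pid is a placeholder for your Jenkins java process:

    import time

    import psutil  # third-party: pip install psutil

    JENKINS_PID = 12345  # placeholder: pid of the Jenkins java process

    proc = psutil.Process(JENKINS_PID)
    while True:
        rss_mb = proc.memory_info().rss / (1024 * 1024)
        # RSS shows what the OS keeps resident for the JVM; it cannot
        # distinguish live heap from "empty" but still-reserved heap.
        print(f"{time.strftime('%H:%M:%S')}  RSS = {rss_mb:.0f} MiB")
        time.sleep(60)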
Is there maybe a way to make the Jenkins server service wait until the last job is finished, then restart automatically?
Check out the Restart Safely Plugin
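If installing a plugin is not an option, Jenkins also exposes a safe restart over HTTP: a POST to /safeRestart lets running builds finish before restarting. A sketch, with the URL, user, and API token as placeholders; depending on your Jenkins version and security settings you may additionally need a CSRF crumb:

    import requests

    JENKINS_URL = "http://jenkins.example.com:8080"  # placeholder
    USER = "admin"                                   # placeholder
    API_TOKEN = "your-api-token"                     # placeholder

    # POST /safeRestart tells Jenkins to let running builds finish,
    # hold new ones in the queue, and then restart itself.
    resp = requests.post(f"{JENKINS_URL}/safeRestart",
                         auth=(USER, API_TOKEN))
    print(resp.status_code)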
I am facing an issue where WAS threads are getting hung.
Configurations:
OS: AIX
WAS: 6.1.0.31
com.ibm.websphere.threadmonitor.interval: 180 seconds
com.ibm.websphere.threadmonitor.threshold: 10 minutes
com.ibm.websphere.threadmonitor.false.alarm.threshold: 100
The settings above are for hung thread detection.
Is there any way that I can clean up the hung threads?
Thanks in advance.
No, WAS doesn't provide a mechanism for that. What you see is a watchdog mechanism that merely provides notifications. You are supposed to actually fix the underlying problem that makes the threads hang in the first place. To get started with that, run
kill -3 <pid>
and read the stack traces. It is likely that after a few dumps you will start seeing a pattern, and then you have to read the source code of your applications to understand what really went wrong and how to fix it.
As far as I know, it is Java itself that does not allow you to kill a hung thread. The best thing is to avoid hanging threads by hunting down the cause. As already mentioned in the other answer, try to force the application server to create a thread dump (aka a Java core) and analyze its content. On Linux/UNIX systems, a
kill -3 <pid>
will do the job. You'll find free graphical tools on the internet for looking into these dumps; I typically use one called IBM Thread and Monitor Dump Analyzer for Java. The WebSphere Application Server log file will tell you the thread name to look for.
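Since a single dump rarely reveals the pattern, it helps to take several in a row and compare which threads sit in the same stack every time. A small sketch that automates the kill -3 step (SIGQUIT is the signal behind kill -3; the WAS JVM writes a javacore file for each one):

    import os
    import signal
    import sys
    import time

    # Usage: python take_dumps.py <pid>
    # Requests three thread dumps, 30 seconds apart, so the
    # resulting javacores can be compared for stuck threads.
    pid = int(sys.argv[1])
    for i in range(3):
        os.kill(pid, signal.SIGQUIT)  # same as: kill -3 <pid>
        print(f"dump {i + 1} requested from pid {pid}")
        if i < 2:
            time.sleep(30)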
There is also a tool to help interpret hung threads; it uses bytecode instrumentation. I have not used it in production, though (we never had that requirement; we go for a clean restart), but you can check it out:
http://www.ibm.com/developerworks/websphere/downloads/hungthread.html
We have a Rails app running on Passenger, and we background some tasks using a combination of RabbitMQ and Workling. The workling worker process is started using the script/workling_client command. There is always only one worker process started, and script/workling_client has a :multiple => false option, thus allowing only one instance. But sometimes, under mysterious circumstances that I haven't been able to track down, more worklings spawn. If I let the system run for some time, more and more worklings appear. I'm not sure whether these rogue worklings cause any problems, but it is still unsettling not to know why it is happening. We are using Monit to monitor the workling process, so if it dies, Monit spawns it again. But this still does not explain how there are suddenly more than one of them.
So my question is: does anyone know what could be causing this and how to make it stop? Is it possible that workling sometimes dies by itself without deleting its pid file? Could there be something wrong with the Daemons gem that workling_client is built upon?
Not an answer - I have the same problems running RabbitMQ + Workling.
I'm using God to monitor the single workling process as well (:multiple => false)...
I found that the multiple worklings were eating up huge amounts of memory and causing serious resource usage, so it's important that I find a solution for this.
You might find this message thread helpful: http://groups.google.com/group/rubyonrails-talk/browse_thread/thread/ed8edd0368066292/5b17d91cc85c3ada?show_docid=5b17d91cc85c3ada&pli=1
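For what it's worth, the "dies without deleting its pid file" theory from the question is easy to illustrate. A monitor that trusts the pid file can double-start a daemon in at least two ways, sketched below in Python for neutrality (the real code here is the Ruby Daemons gem; the path and names are made up):

    import os

    PIDFILE = "/tmp/workling.pid"  # made-up path, for illustration only

    def is_running(pidfile):
        """Best-effort check: is the pid recorded in pidfile alive?"""
        try:
            with open(pidfile) as f:
                pid = int(f.read().strip())
        except (FileNotFoundError, ValueError):
            return False  # no pid file, or garbage inside it
        try:
            # Signal 0 checks existence without sending anything.
            # Caveat 1: if the old pid has been recycled by an
            # unrelated process, this wrongly reports "running".
            # Caveat 2: if the daemon just died and the monitor polls
            # before the pid file is cleaned up, the monitor may start
            # a second instance anyway. Either way: rogue duplicates.
            os.kill(pid, 0)
        except ProcessLookupError:
            return False  # stale pid file: the process is gone
        return True

    if not is_running(PIDFILE):
        print("monitor would start a new worker here")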