WAS threads are getting hung - websphere-6.1

I am facing an issue where WAS threads are getting hung.
Configurations:
OS: AIX,
WAS: 6.1.0.31
com.ibm.websphere.threadmonitor.interval: 180 seconds
com.ibm.websphere.threadmonitor.threshold: 10 minutes
com.ibm.websphere.threadmonitor.false.alarm.threshold: 100
The above settings are for hung-thread detection.
Is there any way I can clean up the hung threads?
Thanks in advance.

No. WAS doesn't provide a mechanism for that. What you see is a watchdog that merely provides notifications. You are supposed to fix the underlying problem that causes the threads to hang in the first place. To get started, take a thread dump with
kill -3 <pid>
and read the stack traces. It is likely that after a few dumps you will start seeing a pattern, and then you have to read the source code of your applications to understand what really went wrong and how to fix it.
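Spotting the pattern across several dumps can be partly automated. The following is only a sketch, written in Python for brevity: it assumes dump lines shaped like those in IBM javacore files (a `3XMTHREADINFO` line naming the thread, followed by `4XESTACKTRACE` lines for its frames), and the helper names `top_frames` and `recurring_frames` are made up for this example. Adjust the two patterns if your dumps look different.

```python
import re
from collections import Counter

# Assumed line shapes, modelled on IBM javacore files; adapt to your dumps.
THREAD_RE = re.compile(r'^3XMTHREADINFO\s+"([^"]+)"')
FRAME_RE = re.compile(r'^4XESTACKTRACE\s+at\s+([^(\s]+)')


def top_frames(dump_text):
    """Map each thread name in one dump to its topmost stack frame."""
    frames = {}
    current = None
    for line in dump_text.splitlines():
        thread = THREAD_RE.match(line)
        if thread:
            current = thread.group(1)
            continue
        frame = FRAME_RE.match(line)
        # Only the first frame after a thread header is the top of its stack.
        if frame and current is not None and current not in frames:
            frames[current] = frame.group(1)
    return frames


def recurring_frames(dumps):
    """Count how often each (thread, top frame) pair appears across dumps."""
    counts = Counter()
    for dump in dumps:
        counts.update(top_frames(dump).items())
    return counts
```

A thread that shows the same top frame in every dump taken a few minutes apart is a good candidate for the hung thread that the WAS log reports.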

As far as I know, it is Java itself that does not allow you to kill a hung thread. The best approach is to avoid hanging threads by hunting down the cause. As already mentioned in the other answer, try to force the application server to create a thread dump (a javacore file) and analyze its content. On Linux/UNIX systems a
kill -3 <pid>
will do the job. You'll find free graphical tools on the Internet to look into these dumps. I typically use one called IBM Thread and Monitor Dump Analyzer for Java. The WebSphere Application Server log file will tell you the thread name to look for.

There is a tool for this. I have not used it in production though (we never had that requirement; we go for a clean restart). You can check it out anyway. It uses bytecode instrumentation.
http://www.ibm.com/developerworks/websphere/downloads/hungthread.html

Related

Application slow down due to zombie process?

We face application downtime while uploading files to Azure storage via Arc.
There is no specific code error, but we hit a timeout.
It is resolved once the Azure web app is restarted.
It happens intermittently.
Since we could not find the root cause, we asked whether there is an issue on the Azure side.
The Microsoft team says the system health is OK but points to accumulated zombie processes: EPMD and inet_gethost.
From searching, I understand that these are created by the Erlang runtime.
Please let me know if we have some process to kill these zombie processes from time to time?
Also, do they contribute to the application downtime?
Thanks
Please let me know if we have some process to kill these zombie processes from time to time?
If you're running a sensible init process, these zombie processes should be correctly reaped. This can often be a problem if you run Erlang as the top-level process inside a container, for instance. Can you give more detail of your environment?
Also, do they contribute to the application downtime?
Depends on how many of them there are, but probably not, no.
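To make "reaping" concrete: a zombie is a child process that has exited but whose parent has not yet collected its exit status. The sketch below, in Python and for Unix-like systems only, shows the kind of non-blocking loop a parent or init process can run to collect finished children. The function name `reap_children` is made up for this example, and it says nothing about how the Erlang runtime or Azure handle this internally.

```python
import os


def reap_children():
    """Collect exit statuses of finished child processes without blocking."""
    reaped = []
    while True:
        try:
            # -1 means "any child"; WNOHANG returns at once if none has exited.
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped.append(pid)
    return reaped
```

Once a child's status has been collected this way, its entry disappears from the process table. A process that never makes such a call for its own children is what lets zombies accumulate.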

Jenkins Server Java.exe memory is growing very fast

We're running a Jenkins server with a few slaves that run the builds. Lately, more and more builds are running at the same time.
I see the memory of the java.exe process on the Jenkins server just increasing, and not decreasing even when the jobs are finished.
Any idea why this happens?
We're running Jenkins ver. 1.501.
Is there a way to make the Jenkins server service wait until the last job is finished, then restart automatically?
I can't seem to find a reference on this (still posting an answer because it's too long for comments ;-) ) but this is what I've observed using the Oracle JVM:
If more memory than currently reserved is needed, the JVM reserves more. So far so good. What it doesn't seem to do is release the memory once it's not needed anymore. You can watch this behaviour by turning on the heap size indicator in Eclipse.
I'd say the same happens with Jenkins. A running Jenkins with only a few projects can already easily pass the 1 GB mark. If you have a lot of concurrent builds, Jenkins needs a lot of memory at some point. After the builds are done and the heap size has decreased, the JVM keeps the memory reserved. It is practically "empty" but still claimed by the JVM, so it's unavailable for other processes.
Again: It's just an observation. I'd be happy if someone with deeper insight on Java memory management would back this up (or disprove it)
As for a practical solution, I'd say you are going to have to live with it to some extent. Jenkins IS very hungry for memory. Restarting it solves the problem only temporarily. At least it should stop claiming memory at some point, because the "empty" reserved memory should be reused. If it doesn't, this really sounds like a memory leak and would be a bug.
Jenkins' [excessive] use of memory without bounds seems to be a common observation. The Jenkins Wiki gives some suggestions for "I'm getting OutOfMemoryErrors".
We have also found that the Monitoring Plugin is useful for keeping an eye on the memory usage and helping us know if we might need to restart Jenkins soon.
Is there a way to make the Jenkins server service wait until the last job is finished, then restart automatically?
Check out the Restart Safely Plugin

How do you ensure your Rails server is running

What is the common approach to make sure that a Rails server is auto-restarted after a serious crash or a process kill? How do you deal with hanging processes? I have nginx and thin running on my production server. Would you suggest putting something in between them, or using another server?
Firstly:
You should identify the cause of a process hang or kill. These are not normal behaviours and indicate a fault somewhere.
Look for:
Insufficient memory or high load before a crash - indicates a configuration problem.
Versions of nginx that are too new.
If you're virtualising, this can cause a number of subtle problems with Linux kernels that may cause segfaults. If you're using EC2, use Amazon Linux for your best chance. Ubuntu Server is too bleeding edge for this purpose.
In order to do the restarts, I suggest you use monit as this is quick, easy and reliable - it's the normal way to do this.
Lastly, I suggest you set up external monitoring as well using something like Pingdom, as even monit won't catch every type of fault, such as hardware failures.
If you only want to monitor an application, I always use Nagios with Centreon. You can set up email alerts for when your Rails server is down. You have to set up NRPE on every machine you want to monitor.
When an error is detected you can run a bash script to kill hanging processes and restart the server automatically. Personally, I never use that, because a crash means something went wrong. So I do it manually in order to check everything.
Have a look here: http://www.centreon.com/

daemon process killed automatically

Recently I ran the command below to start a daemon process that does its work once every three days:
RAILS_ENV=production lib/daemons/mailer_ctl start
It was working, but when I came back after a week I found that the process had been killed.
Is this normal or not?
Thanks in advance.
Nope. As far as I understand it, this daemon is supposed to run until you kill it. The idea is for it to work at regular intervals, right? So the daemon is supposed to wake up, do its work, then go back to sleep until needed. So if it was killed, that's not normal.
The question is why was it killed and what can you do about it. The first question is one of the toughest ones to answer when debugging detached processes. Unless your daemon leaves some clues in the logs, you may have trouble finding out when and why it terminated. If you look through your logs (and if you're lucky) there may be some clues -- I'd start right around the time when you suspect it last ran and look at your Rails production.log, any log file the daemon may create but also at the system logs.
Let's assume for a moment that you never can figure out what happened to this daemon. What to do about it becomes an interesting question. The first step is: Log as much as you can without making the logs too bulky or impacting performance. At a minimum log wakeup, processing, and completion events, as well as trapping signals and logging them. Best to log to somewhere other than the Rails production.log. You may also want to run the daemon at a shorter interval than 3 days until you are certain it is stable.
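The "trap signals and log them" advice can be sketched as follows. The daemon in the question is a Ruby script, so treat this Python version purely as an illustration of the idea. The logger name `mailer_daemon` and the function names are made up for the example. Note that SIGKILL cannot be trapped, so a process killed that way will leave no such log entry.

```python
import logging
import signal

log = logging.getLogger("mailer_daemon")  # hypothetical logger name

received = []  # names of the signals seen so far


def log_signal(signum, frame):
    """Record which signal arrived so the logs show why the daemon stopped."""
    name = signal.Signals(signum).name
    received.append(name)
    log.warning("received %s", name)


def install_signal_logging(signals=(signal.SIGTERM, signal.SIGHUP, signal.SIGINT)):
    """Route the given signals to log_signal (call this from the main thread)."""
    for sig in signals:
        signal.signal(sig, log_signal)
```

A real daemon would normally finish its current unit of work and exit after logging. This sketch only records the signal, which is the part that helps with the "why was it killed" question.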
Look into using a process monitoring tool like monit (http://mmonit.com/monit/) or god (http://god.rubyforge.org/). These tools can "watch" the status of daemons and if they are not running can automatically start them. You still need to figure out why they are being killed, but at least you have some safety net.
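What such a tool does can be reduced to a very small loop: start the process, wait for it to exit, and start it again if it failed. The sketch below shows only that core idea in Python. It is not how monit or god are implemented, and `keep_running` with its parameters is a made-up name for this example.

```python
import subprocess
import time


def keep_running(cmd, max_restarts=3, delay_seconds=1.0):
    """Run cmd, restarting it after a non-zero exit; return the exit codes seen."""
    exit_codes = []
    for attempt in range(max_restarts + 1):
        code = subprocess.run(cmd).returncode
        exit_codes.append(code)
        print(f"attempt {attempt + 1}: exit code {code}", flush=True)
        if code == 0 or attempt == max_restarts:
            break  # clean exit, or restart budget used up
        time.sleep(delay_seconds)  # brief pause so a crash loop does not spin
    return exit_codes
```

The returned exit codes are worth logging: a negative value from `subprocess.run` means the process was terminated by a signal, which is exactly the kind of clue the previous paragraphs ask for. The real tools add health checks, alerts, and resource limits on top of this.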

How is Erlang fault tolerant, or help in that regard?

I think I covered part of the answer in this reply to another thread.
Erlang is fault tolerant with the following things in mind:
Erlang assumes that errors WILL happen and things will break, so instead of guarding against errors, it gives you strong tools to minimize the impact of errors and recover from them as they happen.
Erlang encourages you to program for the success case and crash if anything goes wrong, without trying to recover partially broken data. The idea behind this is that partially incorrect data may propagate further in your system and may get written to the database, and thus presents a risk to your system. Better to get rid of it early and only keep fully correct data.
Process isolation in Erlang helps minimize the impact of partially wrong data when it appears and leads to a process crash. The system cleans up the crashed process and its memory but keeps working as a whole.
Supervision and restart strategies help keep your system fully functional if parts of it crash, by restarting vital parts of your system and bringing them back into service. If something goes very wrong such that restarts happen too often, the system is considered broken beyond repair and is shut down.
Caveat: I am an Erlang noob.
@Daniel's answer is essentially correct. I strongly suggest that you take the time to read Erlang creator Joe Armstrong's thesis (Making reliable distributed systems in the presence of software errors). The thesis provides a good explanation of the need for, and the solution to, developing robust distributed systems. I believe the paper will answer your question satisfactorily.
Erlang makes it easy to create many, small processes, and to monitor those processes. When one of those processes crashes, it may be possible to restart that part of the system without needing to bring the whole thing down.
You may have seen something like this in modern versions of Windows: the system can restart the graphics driver if it crashes; it doesn't kill the whole system.
To make it easier to write fault-tolerant applications, Erlang provides the concept of supervisor processes. These processes monitor a number of child processes, and know how to respond if a child dies. You might create a whole supervision tree, so that you have fine control over how different parts of the application behave. You can read more in the Erlang documentation.
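The supervision idea, including the "give up when restarts happen too often" rule mentioned in this thread, can be imitated in a few lines of Python. This is only a loose analogy under simplifying assumptions: a "child" here is a plain function call rather than an isolated process, and the class and parameter names are made up for the example. It is not Erlang's supervisor API.

```python
import time


class Supervisor:
    """Restart a failing task until it exceeds max_restarts within period seconds."""

    def __init__(self, max_restarts=3, period=5.0):
        self.max_restarts = max_restarts
        self.period = period
        self.restart_times = []

    def run(self, task):
        while True:
            try:
                return task()
            except Exception as exc:
                now = time.monotonic()
                # Keep only the restarts that fall inside the sliding window.
                self.restart_times = [
                    t for t in self.restart_times if now - t <= self.period
                ]
                if len(self.restart_times) >= self.max_restarts:
                    # Too many restarts in a short time: treat as beyond repair.
                    raise RuntimeError("restart limit reached, giving up") from exc
                self.restart_times.append(now)
```

With the defaults, a task that fails twice and then succeeds is simply restarted, while a task that keeps failing is abandoned on its fourth failure within five seconds. The caller, playing the role of a parent supervisor, then decides what to do next.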
