Can the OS kill the process randomly in Linux?

One of our processes went down on a Linux box. When I checked the logs, I could see it had shut down, which suggests a graceful shutdown. I checked CPU, memory, and process utilization; all were under threshold, and there were no abnormalities in memory utilization. Is there any way the OS could have killed the process at random?
Any suggestions?

The kernel can kill a process under extreme circumstances, e.g. memory starvation. But since that was not the case here, and you are sure the sysadmins did not kill the process either, the shutdown must have been initiated from within the process itself.
Linux will not kill your process except under extreme circumstances, although some other process running as root could send it such signals.
You can get a better idea from the kernel logs, and confirm whether or not the process was killed by the OS itself.
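For example, the kernel's OOM killer leaves distinctive messages in the kernel log. A minimal sketch in Python for spotting them (assumes dmesg is readable; on systemd machines journalctl -k shows the same messages):

    import subprocess

    # Dump the kernel ring buffer with human-readable timestamps.
    log = subprocess.run(["dmesg", "-T"], capture_output=True, text=True).stdout

    # The OOM killer logs lines like "Out of memory: Killed process <pid> (<name>)".
    for line in log.splitlines():
        if "out of memory" in line.lower() or "oom-killer" in line.lower():
            print(line)

If nothing matches there, the kernel almost certainly did not kill the process.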

Related

If I run out of memory, which process will be killed?

I have an important process which I don't want to be killed. This process is using about 10GB of RAM. I have 32GB available. I want to run another process which will take up 18.2GB of RAM. There should be some room left. What happens if I hit the full 32GB? Will the last program I called be killed? That wouldn't be so bad, but the important one cannot die.
If you have swap, it is likely that part of your programs' memory will simply be paged out to disk rather than anything being killed, though you can't control which pages go, so I recommend batch processing if that's an option. If memory (including any swap) is truly exhausted, the kernel's OOM killer picks a victim by heuristic score; it is not necessarily the last program you started.
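If the important process really must survive, one standard safeguard (a sketch, not from this thread) is to exempt it from the OOM killer via /proc. Assumes root privileges; the pid is a placeholder:

    import os

    PID = 1234  # placeholder: pid of the process that must not be killed

    # oom_score_adj ranges from -1000 ("never OOM-kill") to +1000 ("kill first").
    # Writing -1000 requires root (CAP_SYS_RESOURCE).
    with open(f"/proc/{PID}/oom_score_adj", "w") as f:
        f.write("-1000")

With that set, the OOM killer will pick other victims first, so the less important process would be killed before the protected one.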

How do I monitor and restart my application running in Docker based on memory usage?

I have an application running in Docker that leaks memory over time. I need to restart this application periodically when memory usage gets over a threshold. My application can respond to signals or by touching tmp/restart.txt (this is Rails)... as long as I can run a script or send a configurable signal when limits are triggered, I can safely shut down/restart my process.
I have looked into limiting memory utilization with Docker, but I don't see a way to trigger a custom action when a limit or reservation is hit. A SIGKILL would not be appropriate for my app... I need some time to clean up.
I am using runit as a minimal init system inside the container, and ECS for container orchestration. This feels like a problem best handled at the application or init level... killing a whole container rather than restarting the process seems heavy-handed.
I have used Monit for this in the past, but I don't like how Monit deals with pidfiles... too often Monit loses control of a process. I am trying out Inspeqtor which seems to fit the bill very well, but while it supports runit there are no packages that work with runit out of the box.
So my question is, if SIGKILL is inappropriate for my use case, what's the best way to monitor a process for memory usage and then perform a cleanup/restart action based on that usage crossing a threshold?
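One design that fits these constraints is a tiny watchdog running alongside the app under runit: poll the application's RSS and send a catchable signal once it crosses the threshold. A rough sketch with psutil; the pid-file path, limit, and choice of SIGTERM are assumptions to adapt:

    import os, signal, time
    import psutil  # third-party: pip install psutil

    PID_FILE = "/app/tmp/pids/server.pid"   # assumed Rails pid-file location
    LIMIT_BYTES = 1_500_000_000             # assumed threshold: ~1.5 GB RSS
    POLL_SECONDS = 30

    while True:
        try:
            pid = int(open(PID_FILE).read().strip())
            if psutil.Process(pid).memory_info().rss > LIMIT_BYTES:
                # Catchable signal, so the app gets time to clean up;
                # runit then brings the service back up after it exits.
                os.kill(pid, signal.SIGTERM)
        except (FileNotFoundError, ValueError, psutil.NoSuchProcess):
            pass  # app not up yet, or pid file stale
        time.sleep(POLL_SECONDS)

For the Rails case, touching tmp/restart.txt instead of signalling would work the same way.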

How to reliably clean up dask scheduler/worker

I'm starting up a dask cluster in an automated way by ssh-ing into a bunch of machines and running dask-worker. I noticed that I sometimes run into problems when processes from a previous experiment are still running. What's the best way to clean up after dask? killall dask-worker dask-scheduler doesn't seem to do the trick, possibly because dask somehow starts up new processes in their place.
If you start a worker with dask-worker, you will notice in ps that it starts more than one process, because there is a "nanny" responsible for restarting the worker in case it somehow crashes. Also, there may be "semaphore" processes around for communicating between the two, depending on which form of process spawning you are using.
The correct way to stop all of these is to send a SIGINT (i.e., a keyboard interrupt) to the parent process. A KILL signal might not give it the chance to stop and clean up its child process(es). If something (e.g., an ssh hangup) caused a more radical termination, or a session didn't send any stop signal at all, then you will probably have to grep the output of ps for dask-like processes and kill them all.
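A rough cleanup sketch along those lines, using psutil to signal nannies before their children so nothing gets respawned (matching on "dask" in the command line is an assumption; tighten it to your launch command):

    import signal
    import psutil  # third-party: pip install psutil

    # Find processes whose command line mentions dask.
    victims = [p for p in psutil.process_iter(["pid", "cmdline"])
               if any("dask" in arg for arg in (p.info["cmdline"] or []))]

    # Parents (nannies) first, so workers are not restarted mid-cleanup.
    victims.sort(key=lambda p: len(p.parents()))

    for p in victims:
        try:
            p.send_signal(signal.SIGINT)  # polite: lets the nanny clean up
        except psutil.NoSuchProcess:
            pass

    gone, alive = psutil.wait_procs(victims, timeout=10)
    for p in alive:
        p.kill()                          # escalate to SIGKILL for stragglers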

Performance impact with dask multiple processes no-nanny

I notice a 5-6x performance degradation using dask workers with processes only and --no-nanny, versus with a nanny. Is this expected behaviour?
I want to run dask without a nanny due to state in the worker. I appreciate that having state in workers is not desirable, but it's beyond my control (3rd party library).
Alternatively, if I run dask workers with a nanny, can I capture worker failures/restarts and reinitialise the worker?
A nanny process just starts a dask-worker process and then watches it, restarting it if it falls over. It should not affect performance at all. If you do not have a nanny, then you cannot capture worker failures or restarts; that is the role of the nanny.
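If you keep the nanny, one way to reinitialise per-worker state after a restart is a worker plugin: its setup() hook runs every time a worker starts, including when the nanny respawns it. A minimal sketch against dask.distributed; init_state() and the scheduler address are placeholders:

    from distributed import Client, WorkerPlugin

    def init_state():
        # Placeholder for the third-party library's initialisation.
        return object()

    class ReinitState(WorkerPlugin):
        # setup() runs on every worker (re)start.
        def setup(self, worker):
            worker.my_state = init_state()

    client = Client("tcp://scheduler-host:8786")  # placeholder address
    client.register_worker_plugin(ReinitState())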

Why did I have to kill -9 neo4j?

I asked nicely:
$ neo4j stop
Stopping Neo4j Server [74949]........................ lots of dots ......
waited a few minutes, asked nicely again:
$ kill 74949
$ ps x | grep neo
74949 ?? R 30:13.01 /Library/Java/Java...org.neo4j.server.Bootstrapper
still running. Done asking nicely.
$ kill -9 74949
Why did neo4j not respect the TERM signal? If it was waiting for something, how could I have found out what?
Normally, I would ask this kind of question on Server Fault, but the neo4j site points here.
Not necessarily in descending order of usefulness...
ps alx might have given a hint (process state), though with Java programs the problem is rarely the JVM itself dying or locking up; it is usually something running inside the JVM.
In top, 100% CPU usage may indicate an endless loop.
Java processes can end up in a state where all they still do is garbage-collect, in an almost always vain attempt to free up memory; enabling GC logging can help you detect this condition.
AFAIK neo4j is remotely monitorable via JMX (VisualVM or jconsole), and I've used these tools to determine which thread hung one of our Glassfish servers; a thread dump (sketched below) is the quickest way to see what a JVM is waiting on.
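On the "how could I have found out what it was waiting for" part: sending SIGQUIT to a HotSpot JVM makes it print a full thread dump to its stdout without terminating it, which usually shows the stuck thread. The equivalent of kill -3, with the pid from the question:

    import os, signal

    PID = 74949  # the hung neo4j JVM from the question

    # SIGQUIT asks a HotSpot JVM for a thread dump on its stdout/console;
    # the process keeps running afterwards.
    os.kill(PID, signal.SIGQUIT)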
If you execute kill [PID], you send SIGTERM, which asks the process to shut itself down gracefully; the process can catch, handle, or even ignore it. If you send signal number 9 (SIGKILL), the kernel terminates the process immediately, regardless of its current state; signal 9 cannot be caught or ignored.
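A quick way to see the difference: the sketch below installs a SIGTERM handler, and shows that the OS refuses to let a handler be installed for SIGKILL. Run it, then try kill PID versus kill -9 PID from another shell:

    import os, signal, time

    def on_term(signum, frame):
        print("caught SIGTERM, cleaning up before exit...")
        raise SystemExit(0)

    signal.signal(signal.SIGTERM, on_term)       # TERM can be caught

    try:
        signal.signal(signal.SIGKILL, on_term)   # KILL cannot be caught
    except OSError as e:
        print("cannot install a SIGKILL handler:", e)

    print("pid:", os.getpid())
    while True:
        time.sleep(1)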
