How to reliably clean up dask scheduler/worker - dask

I'm starting up a dask cluster in an automated way by ssh-ing into a bunch of machines and running dask-worker. I noticed that I sometimes run into problems when processes from a previous experiment are still running. What's the best way to clean up after dask? killall dask-worker dask-scheduler doesn't seem to do the trick, possibly because dask somehow starts up new processes in their place.

If you start a worker with dask-worker, you will notice in ps that it starts more than one process, because there is a "nanny" responsible for restarting the worker if it crashes. There may also be "semaphore" processes around for communication between the two, depending on which form of process spawning you are using.
The correct way to stop all of these is to send a SIGINT (i.e., a keyboard interrupt) to the parent process. A KILL signal might not give it the chance to stop and clean up its child process(es). If something (e.g., an ssh hangup) caused a more abrupt termination, or a session never sent a stop signal at all, then you will probably have to grep the output of ps for dask-like processes and kill them all.
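A minimal sketch of that "grep ps and kill" cleanup, written in Python with only the standard library. The process-name patterns and the five-second grace period are assumptions; from the shell, a plain pkill -f dask-worker followed by pkill -9 -f dask-worker accomplishes the same thing.

import os
import signal
import subprocess
import time

PATTERNS = ["dask-worker", "dask-scheduler"]  # assumed command-line patterns

def pids_matching(pattern):
    """Return PIDs whose full command line matches `pattern` (via pgrep -f)."""
    out = subprocess.run(["pgrep", "-f", pattern], capture_output=True, text=True)
    return [int(p) for p in out.stdout.split()]

def stop_dask_processes(grace_seconds=5):
    # First ask politely with SIGINT so the nanny can clean up its children.
    for pattern in PATTERNS:
        for pid in pids_matching(pattern):
            try:
                os.kill(pid, signal.SIGINT)
            except ProcessLookupError:
                pass
    time.sleep(grace_seconds)
    # Anything still alive after the grace period gets SIGKILL.
    for pattern in PATTERNS:
        for pid in pids_matching(pattern):
            try:
                os.kill(pid, signal.SIGKILL)
            except ProcessLookupError:
                pass

if __name__ == "__main__":
    stop_dask_processes()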

Related

How do I monitor and restart my application running in Docker based on memory usage?

I have an application running in Docker that leaks memory over time, so I need to restart it periodically when memory usage goes over a threshold. My application can respond to signals or to tmp/restart.txt being touched (this is Rails)... as long as I can run a script or send a configurable signal when limits are triggered, I can safely shut down/restart my process.
I have looked into limiting memory utilization with Docker, but I am not seeing a custom action when a limit or reservation is hit. A SIGKILL would not be appropriate for my app... I need some time to clean up.
I am using runit as a minimal init system inside the container, and ECS for container orchestration. This feels like a problem that is attended to at the application or init level... killing a container rather than restarting the process seems heavy.
I have used Monit for this in the past, but I don't like how Monit deals with pidfiles... too often Monit loses control of a process. I am trying out Inspeqtor which seems to fit the bill very well, but while it supports runit there are no packages that work with runit out of the box.
So my question is, if SIGKILL is inappropriate for my use case, what's the best way to monitor a process for memory usage and then perform a cleanup/restart action based on that usage crossing a threshold?
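Not an answer from the thread, but to make the asked-for behaviour concrete, here is a rough watchdog sketch in Python. The cgroup path, the threshold, the pidfile location, and the choice of SIGUSR2 are all assumptions; a real setup would read whatever memory metric and restart hook apply to the container.

import os
import signal
import time
from pathlib import Path

USAGE_FILE = Path("/sys/fs/cgroup/memory/memory.usage_in_bytes")  # cgroup v1 path (assumed)
LIMIT_BYTES = 1_500_000_000                                       # example threshold
PIDFILE = Path("tmp/pids/server.pid")                             # hypothetical Rails pidfile
CHECK_EVERY = 30                                                  # seconds between checks

def trigger_restart():
    # Option 1: send a configurable signal to the app process.
    pid = int(PIDFILE.read_text())
    os.kill(pid, signal.SIGUSR2)
    # Option 2 (Rails convention mentioned in the question):
    # Path("tmp/restart.txt").touch()

while True:
    if int(USAGE_FILE.read_text()) > LIMIT_BYTES:
        trigger_restart()
    time.sleep(CHECK_EVERY)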

Can the OS kill the process randomly in Linux?

One of our processes went down on a Linux box. When I checked the logs, I could see it had shut down, and it looked like a graceful shutdown. I checked CPU, memory, and process utilization; all were under threshold. There were no abnormalities in memory utilization. Is there any way the OS could have killed the process randomly?
Any suggestions?
The kernel can kill a process under extreme circumstances, i.e., memory starvation. But since that was not the case, and you are sure that the sysadmins did not kill the process either, the shutdown must have been initiated from within the process itself.
Linux will not kill your process unless there are extreme circumstances, although some other process running as root could have sent such a signal.
You should check the kernel logs to confirm whether or not the process was killed by the OS itself.
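One concrete check (my addition, not part of the answer above): if the OOM killer was responsible, the kernel log normally records it. A quick scan in Python, with the caveat that log locations and message wording vary across distributions and kernel versions, and dmesg may require root:

import subprocess

def oom_events():
    """Return kernel-log lines that look like OOM-killer activity."""
    dmesg = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
    return [line for line in dmesg.splitlines()
            if "out of memory" in line.lower() or "oom-kill" in line.lower()]

for line in oom_events():
    print(line)

The shell equivalent is dmesg | grep -iE 'out of memory|oom-kill'; on systemd machines, journalctl -k covers the same ground.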

Why did I have to kill -9 neo4j?

I asked nicely:
$ neo4j stop
Stopping Neo4j Server [74949]........................ lots of dots ......
waited a few minutes, asked nicely again:
$ kill 74949
$ ps x | grep neo
74949 ?? R 30:13.01 /Library/Java/Java...org.neo4j.server.Bootstrapper
still running. Done asking nicely.
$ kill -9 74949
Why did neo4j not respect the TERM signal? If it was waiting for something, how could I have found out what?
Normally, I would ask this kind of question on Server Fault, but the neo4j site points here.
Not necessarily in descending order of usefulness...
ps alx might have given a hint (process state); but with Java programs the issue often isn't the JVM itself dying or locking up, it's the code running inside the JVM.
In top, 100% CPU usage may indicate an endless loop.
Java processes can end up in a state where all they still do is garbage-collect in an almost always vain attempt to free up memory; enabling GC logging can help you detect this condition.
AFAIK Neo4j is remotely monitorable via JMX (VisualVM or jconsole), and I've used these tools before to determine which thread hung one of our Glassfish servers. (One quick way to get a thread dump is sketched below.)
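To answer the "how could I have found out what it was waiting for" part concretely (my addition, not from the list above): sending SIGQUIT (kill -3) to a HotSpot JVM makes it print a full thread dump to its stdout/log without terminating it, and jstack <pid> produces the same information if the JDK tools are installed.

import os
import signal

NEO4J_PID = 74949                   # the PID from the question, used as an example
os.kill(NEO4J_PID, signal.SIGQUIT)  # JVM writes a thread dump and keeps running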
If you execute kill [PID], you only send SIGTERM, which asks the process to shut itself down gracefully; the process can catch, delay, or ignore it. If you send signal number 9 (SIGKILL), the process ends immediately, regardless of its current state, because signal 9 cannot be caught or ignored.
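A small illustration of that difference (my own demo, not from the answer): a process can install a handler for SIGTERM and keep running, but it cannot install one for SIGKILL, which is why kill -9 always works.

import os
import signal
import time

def on_term(signum, frame):
    print("caught SIGTERM, choosing to keep running")  # a busy or hung JVM may likewise never act on it

signal.signal(signal.SIGTERM, on_term)    # plain `kill` is now survivable
# signal.signal(signal.SIGKILL, on_term)  # would raise OSError: SIGKILL cannot be caught

print("PID:", os.getpid())
while True:
    time.sleep(1)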

Erlang supervisor: restart a process, but if it fails several times, give up and send a message

I have several gen_server workers periodically requesting information from hardware sensors. Sensors may fail temporarily, which is normal. If a sensor fails, its worker terminates with an exception.
All workers are spawned from a supervisor with the simple_one_for_one strategy. I also have a controlling gen_server, which can start and stop workers and also receives 'DOWN' messages.
So now I have two problems:
If a worker is restarted by the supervisor, its state is lost, which is not acceptable to me. I need to recreate the worker with the same state.
If the worker fails several times within a period of time, something serious has happened with the sensor and it requires the operator's attention. Thus I need to give up restarting the worker and send a message to event handlers. But the default behaviour of the supervisor is to terminate after the restart limit is exhausted.
I see two solutions:
Set the type of the processes in the supervisor to temporary and control and restart them from the controlling gen_server. But this is exactly what a supervisor should do, so I'm reinventing the wheel.
Create a supervisor for each worker under the main supervisor. This solves my second problem exactly, but the state of the workers is lost after a restart, so I need some storage, like an ETS table, to hold the workers' states.
I am very new to Erlang, so I need some advice on my problem, as to which (if any) solution is the best. Thanks in advance.
If a worker is restarted by the supervisor, its state is lost, which is not acceptable to me. I need to recreate the worker with the same state.
If you need the process state to outlive the process itself, you need to store it elsewhere, for example in an ETS table.
If the worker fails several times within a period of time, something serious has happened with the sensor and it requires the operator's attention. Thus I need to give up restarting the worker and send a message to event handlers. But the default behaviour of the supervisor is to terminate after the restart limit is exhausted.
Correct. Generally speaking, the less logic you put into your supervisor, the better. Supervisors should just supervise child processes and that's it. But you could still monitor your supervisor and be notified whenever it gives up (just an idea). This way you avoid reinventing the wheel and still use the supervisor to manage the children.
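Purely as an illustration of the pattern being described (written in Python, not meant as Erlang guidance): restart the worker with externally stored state, count failures in a time window, and once the limit is exceeded stop restarting and alert an operator. In OTP terms, the counters correspond to the supervisor's restart intensity and period, and the external store to an ETS table.

import time

MAX_RESTARTS = 3        # analogue of the supervisor restart intensity
WINDOW_SECONDS = 60     # analogue of the supervisor restart period
saved_state = {}        # analogue of an ETS table keyed by worker id

def run_worker(worker_id):
    state = saved_state.get(worker_id, {})   # restore state from the previous run
    # ... poll the sensor here; a real worker loops and updates `state` ...
    saved_state[worker_id] = state           # persist before the next crash/restart

def notify_operator(worker_id):
    print(f"worker {worker_id} keeps failing; operator attention needed")

def supervise(worker_id):
    failures = []
    while True:
        try:
            run_worker(worker_id)
            return                            # worker exited normally
        except Exception:
            now = time.monotonic()
            failures = [t for t in failures if now - t < WINDOW_SECONDS] + [now]
            if len(failures) > MAX_RESTARTS:
                notify_operator(worker_id)    # give up and escalate instead of looping forever
                return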

daemon process killed automatically

Recently I ran the command below to start a daemon process that runs once every three days.
RAILS_ENV=production lib/daemons/mailer_ctl start, which was working, but when I came back after a week I found that the process had been killed.
Is this normal or not?
Thanks in advance.
Nope. As far as I understand it, this daemon is supposed to run until you kill it. The idea is for it to work at regular intervals, right? So the daemon is supposed to wake up, do its work, then go back to sleep until needed. If it was killed, that's not normal.
The question is why was it killed and what can you do about it. The first question is one of the toughest ones to answer when debugging detached processes. Unless your daemon leaves some clues in the logs, you may have trouble finding out when and why it terminated. If you look through your logs (and if you're lucky) there may be some clues -- I'd start right around the time when you suspect it last ran and look at your Rails production.log, any log file the daemon may create but also at the system logs.
Let's assume for a moment that you can never figure out what happened to this daemon. What to do about it becomes an interesting question. The first step is: log as much as you can without making the logs too bulky or hurting performance. At a minimum, log wakeup, processing, and completion events, and also trap signals and log them. It's best to log somewhere other than the Rails production.log. You may also want to run the daemon at a shorter interval than 3 days until you are confident it is stable.
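A sketch of that instrumentation (in Python for illustration; the daemon in the question is Ruby/Rails, and the log path here is hypothetical): log wakeup and completion events to a dedicated file and trap termination signals so the log shows when, and via which signal, the daemon was told to stop.

import logging
import signal
import time

logging.basicConfig(filename="log/mailer_daemon.log",         # separate from production.log
                    level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")

def log_signal(signum, frame):
    logging.warning("received %s, shutting down", signal.Signals(signum).name)
    raise SystemExit(0)

for sig in (signal.SIGTERM, signal.SIGINT, signal.SIGHUP):
    signal.signal(sig, log_signal)

while True:
    logging.info("waking up")
    # ... send the mail / do the periodic work here ...
    logging.info("work complete, sleeping")
    time.sleep(3 * 24 * 3600)                                  # the original three-day interval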
Look into using a process-monitoring tool like monit (http://mmonit.com/monit/) or god (http://god.rubyforge.org/). These tools can "watch" the status of daemons and automatically start them if they are not running. You still need to figure out why the daemon is being killed, but at least you have a safety net.
