Linux: short, simple command to send SIGTERM to a process and SIGKILL if it fails to exit in X seconds? - timeout

What should the Linux command look like to send a terminate signal to a process/PID and, if it fails to exit gracefully after 10 seconds, kill it?
My attempt is: "sudo timeout -vk 5 10 kill PIDhere" (-v verbose, -k kill after X seconds), but I am not sure whether it is correct, how to adjust the values, or whether there is a better command that would even work with part of the name shown in the process COMMAND column ("ps aux" output).

sudo timeout -vk 5 10 kill PIDhere
This will execute kill, and then attempt to terminate kill itself if it takes too long. That shouldn't happen, and presumably isn't what you want (if kill were actually hanging, killing it would not affect your actual process). timeout is useful for capping how long a process runs, not for bounding how long it takes to terminate after receiving a signal.
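For reference, timeout's usual pattern is to wrap the command itself; for example, the following (some_long_running_command is a placeholder) sends SIGTERM after 10 seconds and SIGKILL 5 seconds later if the command is still running:
$ timeout -k 5 10 some_long_running_command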
Instead, I'd suggest starting the process asynchronously (e.g. using & in a shell, but any language's subprocess library will have similar functionality) and then waiting for the process to terminate after you send it a signal. I describe doing this in Java in this answer. In the shell that might look like:
$ some_process &
# time passes, eventually we decide to terminate the process
$ kill %1
$ sleep 5s
$ kill -s SIGKILL %1 # will fail and do nothing if %1 has already finished
Or you could rely on wait, which will return early if the job terminates before the sleep completes:
$ some_process &
# time passes
$ kill %1
$ sleep 5s &
$ wait -n %1 %2 # returns once %1 or %2 (sleep) complete
$ kill -s SIGKILL %1 # if %2 completes first %1 is still running and will be killed
You can do the same as above with PIDs instead of job IDs; it's just a little more fiddly because you have to worry about PID reuse.
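For example, a rough PID-based sketch (capturing the PID with $! at start time; note there is still a small window in which the PID could in principle be reused before the SIGKILL):
$ some_process &
$ pid=$!
# time passes, eventually we decide to terminate the process
$ kill "$pid"
$ sleep 5s
$ kill -0 "$pid" 2>/dev/null && kill -s SIGKILL "$pid"  # only SIGKILL if it still exists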
whether there is a better command that would even work with part of the name
Does pkill do what you want?
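For instance, a minimal sketch matching on part of the command line as shown in ps aux output (-f matches against the full command line; partofname is a placeholder):
$ pkill -TERM -f partofname
$ sleep 10
$ pkill -KILL -f partofname  # does nothing if no matching process is still running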

Related

Spawning a process with {create_group=True} / set_pgid hangs when starting Docker

Given a Linux system, in Haskell GHCi 8.8.3, I can run a Docker command with:
System.Process> withCreateProcess (shell "docker run -it alpine sh -c \"echo hello\""){create_group=False} $ \_ _ _ pid -> waitForProcess pid
hello
ExitSuccess
However, when I switch to create_group=True the process hangs. The effect of create_group is to call set_pgid with 0 in the child, and pid in the parent. Why does that change cause a hang? Is this a bug in Docker? A bug in System.Process? Or an unfortunate but necessary interaction?
This isn't a bug in Haskell or a bug in Docker, but rather just the way that process groups work. Consider this C program:
#include <sys/types.h>
#include <stdio.h>
#include <unistd.h>
int main(void) {
    if(setpgid(0, 0)) {
        perror("setpgid");
        return 1;
    }
    execlp("docker", "docker", "run", "-it", "alpine", "echo", "hello", (char*)NULL);
    perror("execlp");
    return 1;
}
If you compile that and run ./a.out directly from your interactive shell, it will print "hello" as you'd expect. This is unsurprising, since the shell will have already put it in its own process group, so its setpgid is a no-op. If you run it with an intermediary program that forks a child to run it (sh -c ./a.out, \time ./a.out - note the backslash, strace ./a.out, etc.), then the setpgid will put it in a new process group, and it will hang like it does in Haskell.
The reason for the hang is explained in "Job Control Signals" in the glibc manual:
Macro: int SIGTTIN
A process cannot read from the user’s terminal while it is running as a background job. When any process in a background job tries to read from the terminal, all of the processes in the job are sent a SIGTTIN signal. The default action for this signal is to stop the process. For more information about how this interacts with the terminal driver, see Access to the Terminal.
Macro: int SIGTTOU
This is similar to SIGTTIN, but is generated when a process in a background job attempts to write to the terminal or set its modes. Again, the default action is to stop the process. SIGTTOU is only generated for an attempt to write to the terminal if the TOSTOP output mode is set; see Output Modes.
When you docker run -it something, Docker will attempt to read from stdin even if the command inside the container doesn't. Since you just created a new process group, and you didn't set it to be in the foreground, it counts as a background job. As such, Docker is getting stopped with SIGTTIN, which causes it to appear to hang.
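You can reproduce the stop with any background job that reads from the terminal; for example, in an interactive bash session (the job/PID numbers are illustrative and the exact wording varies by shell):
$ cat &
[1] 12345
$ jobs
[1]+  Stopped (tty input)     cat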
Here's a list of options to fix this:
1. Redirect the process's standard input to somewhere other than the TTY
2. Use signal or sigaction to make the process ignore the SIGTTIN signal
3. Use sigprocmask to block the process from receiving the SIGTTIN signal
4. Call tcsetpgrp(0, getpid()) to make your new process group be the foreground process group (note: this is the most complicated, since it will itself cause SIGTTOU, so you'd have to ignore that signal at least temporarily anyway)
Options 2 and 3 will also only work if the program doesn't actually need stdin, which is the case with Docker. When SIGTTIN doesn't stop the process, reads from stdin will still fail with EIO, so if there's actually data you want to read, then you need to go with option 4 (and remember to set it back once the child exits).
If you have TOSTOP set (which is not the default), then you'd have to repeat the fix for SIGTTOU or for standard output and standard error (except for option 4, which wouldn't need to be repeated at all).

Stopping dask-ssh created scheduler from the Client interface

I am running Dask on a SLURM-managed cluster.
dask-ssh --nprocs 2 --nthreads 1 --scheduler-port 8786 --log-directory `pwd` --hostfile hostfile.$JOBID &
sleep 10
# We need to tell dask Client (inside python) where the scheduler is running
scheduler="`hostname`:8786"
echo "Scheduler is running at ${scheduler}"
export ARL_DASK_SCHEDULER=${scheduler}
echo "About to execute $CMD"
eval $CMD
# Wait for dask-ssh to be shut down from the Python code
wait %1
I create a Client inside my python code and then when finished, I shut it down.
c=Client(scheduler_id)
...
c.shutdown()
My reading of the dask-ssh help is that the shutdown will shut down all workers and then the scheduler. But it does not stop the background dask-ssh, and so eventually the job times out.
I've tried this interactively in the shell. I cannot see how to stop the scheduler.
I would appreciate any help.
Thanks,
Tim
Recommendation with --scheduler-file
First, when setting up with SLURM you might consider using the --scheduler-file option, which allows you to coordinate the scheduler address using your NFS (which I assume you have, given that you're using SLURM). I recommend reading this doc section: http://distributed.readthedocs.io/en/latest/setup.html#using-a-shared-network-file-system-and-a-job-scheduler
dask-scheduler --scheduler-file /path/to/scheduler.json
dask-worker --scheduler-file /path/to/scheduler.json
dask-worker --scheduler-file /path/to/scheduler.json
>>> client = Client(scheduler_file='/path/to/scheduler.json')
Given this, it also becomes easier to use the sbatch or qsub command directly. Here is an example with SGE's qsub:
# Start a dask-scheduler somewhere and write connection information to file
qsub -b y /path/to/dask-scheduler --scheduler-file /path/to/scheduler.json
# Start 100 dask-worker processes in an array job pointing to the same file
qsub -b y -t 1-100 /path/to/dask-worker --scheduler-file /path/to/scheduler.json
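Since the question is about a SLURM cluster, a rough sbatch equivalent might look like the following (a sketch; --wrap and --array are standard sbatch options, but adjust for your site's conventions):
# Start a dask-scheduler somewhere and write connection information to file
sbatch --wrap "/path/to/dask-scheduler --scheduler-file /path/to/scheduler.json"
# Start 100 dask-worker processes in an array job pointing to the same file
sbatch --array=1-100 --wrap "/path/to/dask-worker --scheduler-file /path/to/scheduler.json"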
Client.shutdown
It looks like client.shutdown only shuts down the client. You're correct that this is inconsistent with the docstring. I've raised an issue here: https://github.com/dask/distributed/issues/1085 for tracking further developments.
In the meantime
These three commands should suffice to tear down the workers, close the scheduler, and stop the scheduler process:
client.loop.add_callback(client.scheduler.retire_workers, close_workers=True)
client.loop.add_callback(client.scheduler.terminate)
client.run_on_scheduler(lambda dask_scheduler: dask_scheduler.loop.stop())
What people usually do
Typically people start and stop clusters with whatever means they used to start them. With SLURM that might mean cancelling the job with scancel. Regardless, we should make the client-focused way more consistent.

Is it possible to install timeout in OpenWRT

I need to execute a command with a timeout in OpenWRT, but it seems that the timeout command is not installed by default, nor can it be installed using opkg. I know that I can do a workaround (using command &; sleep $DELAY; kill $!), but I would like to do this more properly, without the risk of kill trying to kill some other process in case the command finished before the timeout.
Yes you can install timeout on openWRT
$ opkg update
$ opkg install coreutils-timeout
$ timeout 2 sleep 10
This has been tested with AA; pretty sure it would also work with BB.
In short: it is not possible. I have to do it using sleep && kill.
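If you do stay with the sleep && kill workaround, a common sketch that avoids killing an unrelated process after the command has already finished is to cancel the watcher once the command exits (mycommand and $DELAY are placeholders; this narrows the race considerably but does not remove it entirely):
mycommand &
cmd_pid=$!
( sleep "$DELAY"; kill "$cmd_pid" 2>/dev/null ) &
watcher_pid=$!
wait "$cmd_pid"
kill "$watcher_pid" 2>/dev/null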
timeout is a shell command, so it executes in a subshell.
timeout 6 sleep 20 will work if executed directly in an interactive shell, but the same command won't work if initiated from a shell script.
So to run timeout in a shell script, use it like this:
out="$(timeout 6 sleep 20)"
OR
echo "$(timeout 10 sleep 20)"
This will run your timeout and your command in one subshell.

Script to HUP parent and child process

I need a shell script to send a HUP to the parent and child processes.
I am using FreeBSD with tcsh (#!/bin/sh?).
Somehow, I need to pipe the PID output from pgrep to kill -HUP in a loop in a shell script.
Ultimately I want to run this script as a cron job.
I just don't have the skills yet.
Thanks - Brad
(This isn't a complete answer, but I can't make comments without at least 50 reputation apparently).
First of all, /bin/sh on FreeBSD is a Bourne-compatible shell, not tcsh (which is /bin/tcsh). A start would be something like the following:
#!/bin/sh
for pid in $(pgrep <process name>); do kill -HUP $pid; done
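Since the question asks to HUP both the parent and its children, a sketch that also walks the direct children with pgrep -P could look like this (untested on FreeBSD; <process name> is still a placeholder):
#!/bin/sh
for pid in $(pgrep <process name>); do
    # signal the children first, then the parent
    for child in $(pgrep -P "$pid"); do kill -HUP "$child"; done
    kill -HUP "$pid"
done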
Without more details, I can't really say much more.

Script to start Erlang code

I am trying to build a script on Ubuntu to start some Erlang code of mine.
The script is something like:
#!/bin/sh
EBIN=$HOME/path_to_beams
ERL=/usr/local/bin/erl
export HEART_COMMAND="$EBIN/starting_script start"
case $1 in
    start)
        $ERL -sname mynode -pa $EBIN \
            -heart -detached -s my_module start_link
        ;;
    *)
        echo "Usage: $0 {start|stop|debug}"
        exit 1
        ;;
esac
exit 0
but I'm having a couple of problems.
First of all, the code can be executed only if the script is in the same directory as the beams. This seems strange to me; I double-checked the paths, so why doesn't the -pa flag work?
Second, the script (without the -pa) works fine, but if, instead of the main module (a gen_server), I try to start its supervisor (-s my_module_sup start_link), it doesn't work. This is strange, because if I start the supervisor from a normal shell everything works fine.
Third, the -heart flag should restart the script in case of failure, but if I kill the process with a normal Unix kill, the process is not restarted.
Can someone give me some hints?
Thanks in advance,
pdn
The first thing that comes to mind is that you're using erlexport instead of erl. Not sure why you're doing this (I've not heard of erlexport before). Try it with erl instead.
Your -heart flag won't have meaning if the Erlang node itself is killed because the process can't keep itself alive. You would need another process running that monitors the Erlang process and restarts it if killed.
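A very rough shell sketch of such a watchdog, reusing the node name and start script from the question (assumptions: the node was started with -sname mynode and the start script is the one shown above), might be:
#!/bin/sh
# restart the node if no beam process with our node name is found
while true; do
    if ! pgrep -f "sname mynode" > /dev/null; then
        "$HOME/path_to_beams/starting_script" start
    fi
    sleep 30
done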
