How to process SIGTERM in argo or kubeflow stage/node/component?
It's possible to catch SIGTERM if your Python script is launched as PID 1. But in an Argo/Kubeflow container, PID 1 is occupied by the Argo executor:
1 root 0:00 /var/run/argo/argoexec emissary -- bash -c set -eo pipefail; touch /tmp/9306d238a1214915a260b696e45390ad.step; sleep 1; echo "
P.S. I tried to use
container.set_lifecycle(V1Lifecycle(pre_stop=V1Handler(_exec=V1ExecAction([
"pkill", "-15", "python"
]))))
but this setting doesn't lead to correct SIGTERM forwarding: SIGTERM reaches the python3 process only immediately before the pod is killed, roughly 30 seconds after pod termination was initiated.
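For reference, the Python side of the handler is straightforward once the signal actually reaches the interpreter; the hard part in Argo/Kubeflow is getting it forwarded past the executor. A minimal sketch, where the self-delivered kill only simulates what the kubelet would send on pod deletion:

```python
import os
import signal
import time

shutting_down = False

def handle_sigterm(signum, frame):
    # Place for cleanup work: flush buffers, checkpoint state, ...
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, handle_sigterm)

# Simulate the kubelet's SIGTERM (normally sent to PID 1 on pod deletion):
os.kill(os.getpid(), signal.SIGTERM)

time.sleep(0.1)  # give the handler a chance to run
print(shutting_down)  # True
```

This only works if the Python process is the one the signal is delivered to, which is exactly what the emissary executor at PID 1 prevents.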

Related

Pulumi does not perform graceful shutdown of kubernetes pods

I'm using pulumi to manage kubernetes deployments. One of the deployments runs an image which intercepts SIGINT and SIGTERM signals to perform a graceful shutdown like so (this example is running in my IDE):
{"level":"info","msg":"trying to activate gcloud service account","time":"2021-06-17T12:19:25-05:00"}
{"level":"info","msg":"env var not found","time":"2021-06-17T12:19:25-05:00"}
{"Namespace":"default","TaskQueue":"main-task-queue","WorkerID":"37574#Paymahns-Air#","level":"error","msg":"Started Worker","time":"2021-06-17T12:19:25-05:00"}
{"Namespace":"default","Signal":"interrupt","TaskQueue":"main-task-queue","WorkerID":"37574#Paymahns-Air#","level":"error","msg":"Worker has been stopped.","time":"2021-06-17T12:19:27-05:00"}
{"Namespace":"default","TaskQueue":"main-task-queue","WorkerID":"37574#Paymahns-Air#","level":"error","msg":"Stopped Worker","time":"2021-06-17T12:19:27-05:00"}
Notice the "Signal":"interrupt" with a message of Worker has been stopped.
I find that when I alter the source code (which alters the docker image) and run pulumi up the pod doesn't gracefully terminate based on what's described in this blog post. Here's a screenshot of logs from GCP:
The highlighted log line in the image above is the first log line emitted by the app. Note that the shutdown messages aren't logged above the highlighted line which suggests to me that the pod isn't given a chance to perform a graceful shutdown.
Why might the pod not go through the graceful shutdown mechanisms that kubernetes offers? Could this be a bug with how pulumi performs updates to deployments?
EDIT: after doing more investigation I found that this problem happens because starting a docker container with go run /path/to/main.go actually ends up creating two processes, like so (after execing into the container):
root#worker-ffzpxpdm-78b9797dcd-xsfwr:/gadic# ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.3 0.3 2046200 30828 ? Ssl 18:04 0:12 go run src/cmd/worker/main.go --temporal-host temporal-server.temporal.svc.cluster.local --temporal-port 7233 --grpc-port 6789 --grpc-hos
root 3782 0.0 0.5 1640772 43232 ? Sl 18:06 0:00 /tmp/go-build2661472711/b001/exe/main --temporal-host temporal-server.temporal.svc.cluster.local --temporal-port 7233 --grpc-port 6789 --
root 3808 0.1 0.0 4244 3468 pts/0 Ss 19:07 0:00 /bin/bash
root 3817 0.0 0.0 5900 2792 pts/0 R+ 19:07 0:00 ps aux
If I run kill -TERM 1, the signal isn't forwarded to the underlying binary, /tmp/go-build2661472711/b001/exe/main, which means the graceful shutdown logic isn't executed. However, if I run kill -TERM 3782, the graceful shutdown logic is executed.
It seems the go run spawns a subprocess and this blog post suggests the signals are only forwarded to PID 1. On top of that, it's unfortunate that go run doesn't forward signals to the subprocess it spawns.
The solution I found is to add RUN go build -o worker /path/to/main.go in my dockerfile and then to start the docker container with ./worker --arg1 --arg2 instead of go run /path/to/main.go --arg1 --arg2.
Doing it this way ensures go doesn't spawn any subprocesses, which in turn ensures signals are handled properly within the docker container.
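The same wrapper problem can be reproduced in miniature: the parent (PID 1 in a container) receives the signal, and the child sees nothing unless the parent explicitly forwards it. A hedged Python sketch, where the sleeping child stands in for the go-built binary:

```python
import os
import signal
import subprocess
import sys

# Child process stands in for /tmp/go-build.../exe/main.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])

def forward(signum, frame):
    # Without this, a SIGTERM aimed at the parent never reaches the child.
    child.send_signal(signum)

signal.signal(signal.SIGTERM, forward)

# Simulate `kill -TERM 1` hitting the wrapper process.
os.kill(os.getpid(), signal.SIGTERM)

child.wait()
print(child.returncode)  # -15: child terminated by SIGTERM
```

go run does no such forwarding, which is why building the binary and running it directly (so it becomes PID 1 itself) is the cleaner fix.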

dumpcap stops logging after 1184 files?

Recently I came across a Linux application design. The intent of the application is to log Ethernet frames via the dumpcap <> API in Linux. But they implemented it as below:
1. Create a new process using fork()
2. Call dumpcap <> in execl() as shown below:
   a. execl("/bin/sh", "/bin/sh", "-c", dumpcap<>, NULL);
   b. sudo dumpcap -i "eth0" -B 1 -b filesize:5 -w "/mnt/Test_1561890567.pcapng" -t -q
3. Send a SIGTERM to kill the process
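The launch sequence described above can be sketched in Python; the sleeping child is a stand-in for the dumpcap command line, since running the real sudo dumpcap would need privileges and a capture interface:

```python
import os
import signal
import sys
import time

# Stand-in for the dumpcap command line from the question.
cmd = sys.executable + " -c 'import time; time.sleep(60)'"

pid = os.fork()
if pid == 0:
    # Step 2: the child hands control to a shell running the command,
    # mirroring execl("/bin/sh", "/bin/sh", "-c", dumpcap<>, NULL)
    os.execl("/bin/sh", "/bin/sh", "-c", cmd)

time.sleep(0.5)               # let the child start
os.kill(pid, signal.SIGTERM)  # step 3: SIGTERM to the forked child

_, status = os.waitpid(pid, 0)
print(os.WTERMSIG(status))  # 15: the forked process died by SIGTERM
```

One thing worth checking in this setup: with /bin/sh -c, the SIGTERM is delivered to whatever occupies the forked PID. Some shells exec their single command, others keep running as a wrapper, in which case the shell, not dumpcap, receives the signal.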
The problem we are facing now: whenever we run the command from the process, dumpcap stops logging after 1184 or 1185 files. The process and thread are still alive, and the command is still visible in top.

Container Optimized OS Graceful Shutdown of Celery

Running COS on GCE
Any ideas on how to get COS to do a graceful docker shutdown?
My innermost process is celery, which wants a SIGTERM to stop gracefully:
http://docs.celeryproject.org/en/latest/userguide/workers.html#stopping-the-worker
My entrypoint is something like
exec celery -A some_app worker -c some_concurrency
On COS I am running my docker container as a service, something like:
write_files:
- path: /etc/systemd/system/servicename.service
  permissions: 0644
  owner: root
  content: |
    [Unit]
    Description=Some service
    [Service]
    Environment="HOME=/home/some_home"
    RestartSec=10
    Restart=always
    ExecStartPre=/usr/share/google/dockercfg_update.sh
    ExecStart=/usr/bin/docker run -u 2000 --name=somename --restart always some_image param_1 param_2
    ExecStopPost=/usr/bin/docker stop servicename
    KillMode=process
    KillSignal=SIGTERM
But ultimately, when my COS instance is shut down, it just yanks the plug.
Do I need to add a shutdown script to do a docker stop? Do I need to do something more advanced?
What is the expected exit status of your container process when it receives SIGTERM?
Running systemctl stop <service> then systemctl status -l <service> should show the exit code of the main process. Example:
Main PID: 21799 (code=exited, status=143)
One possibility is that the process does receive SIGTERM and shuts down gracefully, but returns a non-zero exit code.
This would make systemd believe that it didn't shut down correctly. If that is the case, adding
SuccessExitStatus=143
to your systemd service should help. (Replace 143 with the actual exit code of your main process.)
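The 143 convention is easy to verify: a process killed by SIGTERM (signal 15) is reported by the shell as 128 + 15. A small illustration, assuming a POSIX sh is available:

```python
import subprocess

# The inner sh kills itself with SIGTERM; the outer sh reports its exit
# status, which follows the 128 + signal-number convention: 128 + 15 = 143.
out = subprocess.run(
    ["sh", "-c", "sh -c 'kill -TERM $$; sleep 5'; echo $?"],
    capture_output=True,
    text=True,
).stdout.strip()
print(out)  # 143
```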

Jenkins with Publish over SSH plugin, -1 exit status

I use Jenkins for builds and the Publish over SSH plugin to deploy my artifacts to a server. After deploying the files I stop the service by calling exec in the plugin:
sudo service myservice stop
and I receive answer from Publish over SSH:
SSH: EXEC: channel open
SSH: EXEC: STDOUT/STDERR from command [sudo service myservice stop]...
SSH: EXEC: connected
Stopping script myservice
SSH: EXEC: completed after 200 ms
SSH: Disconnecting configuration [172.29.19.2] ...
ERROR: Exception when publishing, exception message [Exec exit status not zero. Status [-1]]
Build step 'Send build artifacts over SSH' changed build result to UNSTABLE
Finished: UNSTABLE
The build is failed but the service is stopped.
My /etc/init.d/myservice
#! /bin/sh
# /etc/init.d/myservice
#
# Some things that run always
# touch /var/lock/myservice

# Carry out specific functions when asked to by the system
case "$1" in
  start)
    echo "Starting myservice"
    setsid /opt/myservice/bin/myservice --spring.config.location=/etc/ezd/application.properties --server.port=8082 >> /opt/myservice/app.log &
    ;;
  stop)
    echo "Stopping script myservice"
    pkill -f myservice
    #
    ;;
  *)
    echo "Usage: /etc/init.d/myservice {start|stop}"
    exit 1
    ;;
esac
exit 0
Please tell me why I get a -1 exit status.
Well, the script is called /etc/init.d/myservice, so it matches the myservice pattern given to pkill -f. And because the script is waiting for pkill to complete, it is still alive when the signal arrives, so it gets killed too and returns -1 for that reason (the killing signal is also recorded in the wait status, but the Jenkins slave daemon doesn't print it).
Either:
- come up with a more specific pattern for pkill,
- use a proper pid-file, or
- switch to systemd, which can reliably kill exactly the process it started.
In this day and age, I recommend the last option. Systemd is simply a lot more reliable than init scripts.
Yes, Jan Hudec is right. I call ps ax | grep myservice in Publish over SSH plugin:
83469 pts/5 Ss+ 0:00 bash -c ps ax | grep myservice service myservice stop
So pkill -f myservice will affect the process with PID 83469, which is the parent of pkill. As I understand it, this is the cause of the -1 status.
I changed pkill -f myservice to pkill -f "java.*myservice" and this solved my problem.
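For completeness, the pid-file route suggested above avoids pattern matching entirely: record the daemon's PID at start, and signal exactly that PID at stop. A minimal Python sketch, where the sleeping child stands in for the real service:

```python
import os
import signal
import subprocess
import sys
import tempfile

pidfile = os.path.join(tempfile.mkdtemp(), "myservice.pid")

# start: launch the service and record its PID
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
with open(pidfile, "w") as f:
    f.write(str(child.pid))

# stop: signal exactly the recorded process -- no pattern can accidentally
# match the init script (or anything else)
with open(pidfile) as f:
    os.kill(int(f.read()), signal.SIGTERM)

print(child.wait())  # -15: terminated by SIGTERM
```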

Foreman Cannot Start Nginx, But I Can Start it Manually. Why?

I am currently running Foreman on staging (Ubuntu) and once I get it working will switch to using upstart.
My Procfile.staging looks like this:
nginx: sudo service nginx start
unicorn: bundle exec unicorn -c ./config/unicorn.rb
redis: bundle exec redis-server
sidekiq: bundle exec sidekiq -v -C ./config/sidekiq.yml
I can successfully start nginx using:
$ sudo service nginx start
However when I run $ foreman start, whilst the other three processes start successfully, nginx does not:
11:15:46 nginx.1 | started with pid 15966
11:15:46 unicorn.1 | started with pid 15968
11:15:46 redis.1 | started with pid 15971
11:15:46 sidekiq.1 | started with pid 15974
11:15:46 nginx.1 | Starting nginx: nginx.
11:15:46 nginx.1 | exited with code 0
11:15:46 system | sending SIGTERM to all processes
SIGTERM received
11:15:46 unicorn.1 | terminated by SIGTERM
11:15:46 redis.1 | terminated by SIGTERM
11:15:46 sidekiq.1 | terminated by SIGTERM
So why isn't nginx starting when started by Foreman?
There is a problem in your Procfile.
The nginx command can't use sudo inside foreman, because it will always ask for a password and then fail. That's why nginx isn't starting and the logs are empty.
If you really need to use sudo inside a procfile you could use something like this:
sudo_app: echo "sudo_password" | sudo -S app_command
nginx: echo "sudo_password" | sudo -S service nginx start
which I really don't recommend. Another option is to run sudo foreman start.
For more information check out this issue on github, it is precisely what you want to solve.
Keep me posted if it works for you.
You should be able to add sudo access without a password for your local user to allow managing this service. This can be a big security hole, but if you whitelist which commands can be run you dramatically reduce the risk. I recommend adding a no-password sudoers entry for the service command and anything else you want to script:
/etc/sudoers:
your_user_name ALL = (ALL) NOPASSWD: /usr/sbin/service
Another option, if you're not comfortable with this, would be to run nginx directly in the foreground (so foreman can supervise it), not through the service manager:
nginx: /usr/sbin/nginx -g 'daemon off;' -c /path/to/nginx.conf
