I have a swarm with a few services, and the compose file defines a few volumes created with the vieux/sshfs driver, which the services use.
The containers spawned by the services execute a single script, after which the container finishes/exits and a new one is created in its place - basically the services are spawning new containers all the time.
Everything works smoothly, except that an exceptionally large number of zombie processes accumulates on the host machine. The zombies go away when the Docker daemon is restarted - it must be Docker that creates the zombies.
"ps aux | grep 'Z'" is
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 3040 0.0 0.0 0 0 ? Zs 14:13 0:00 [ssh] <defunct>
root 3042 0.0 0.0 0 0 ? Zs 14:13 0:00 [sshfs] <defunct>
root 3052 0.0 0.0 0 0 ? Zs 14:13 0:00 [ssh] <defunct>
root 3055 0.0 0.0 0 0 ? Zs 14:13 0:00 [sshfs] <defunct>
...
As far as I understand, the volumes are created only once, and the services just use the local copy of the volume - they don't create a new ssh connection and read straight from the remote machine - so this should not be creating another ssh connection process that could become a zombie.
I have trouble finding info on the topic, which makes me think that I am doing something fundamentally wrong. Please help.
I have just resolved the issue by enabling Tini for the services in the docker-compose file, as follows:
init: true
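For context, this is roughly where the flag sits in the compose file (a minimal sketch; the service name and image are illustrative, not my actual file):

version: "3.7"          # init: requires compose file format 3.7 or newer
services:
  worker:
    image: myorg/worker # illustrative
    init: true          # run Tini as PID 1 so it reaps zombie children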
A few zombies (<10) still pop up, but they get reaped within a second - no accumulation.
I still don't get what the zombies have to do with ssh. If anyone can answer that, I would be grateful.
PS: I checked a few days after enabling Tini. Some zombies have still accumulated (~300; before there were ~2000). The problem seems mitigated, but it is still there.
I recently read an article about this.
It said that declaring your volumes directly in docker-compose.yml may lead to issues with zombie sshfs processes.
To avoid this, I declare the volume as external and create the Docker volume manually:
docker volume create -d vieux/sshfs -o sshcmd="$USER_SSH@$IP:/mysupervolume" -o IdentityFile="/root/.ssh/$SSH_KEY" nameofmyvolume
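For reference, the pre-created volume would then be referenced from docker-compose.yml roughly like this (a sketch; the service name and mount point are illustrative, while nameofmyvolume matches the command above):

volumes:
  nameofmyvolume:
    external: true        # created out-of-band with "docker volume create"
services:
  myservice:
    image: myorg/myservice        # illustrative
    volumes:
      - nameofmyvolume:/data      # illustrative mount point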
Hope it will help someone.
Kr,
Related
[root@k8s001 ~]# docker exec -it f72edf025141 /bin/bash
root@b33f3b7c705d:/var/lib/ghost# ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 1012 4 ? Ss 02:45 0:00 /pause
root 8 0.0 0.0 10648 3400 ? Ss 02:57 0:00 nginx: master process nginx -g daemon off;
101 37 0.0 0.0 11088 1964 ? S 02:57 0:00 nginx: worker process
node 38 0.9 0.0 2006968 116572 ? Ssl 02:58 0:06 node current/index.js
root 108 0.0 0.0 3960 2076 pts/0 Ss 03:09 0:00 /bin/bash
root 439 0.0 0.0 7628 1400 pts/0 R+ 03:10 0:00 ps aux
This output comes from somewhere on the internet; it says the pause container is the parent process of the other containers in the pod, and that if you attach to the pod or the other containers and run ps aux, you will see that.
Is that correct? When I try it in my k8s cluster, it is different: PID 1 is not /pause.
...Is that correct? When I try it in my k8s cluster, it is different: PID 1 is not /pause.
This has changed: pause no longer holds PID 1, despite being the first container created by the container runtime to set up the pod (e.g. cgroups, namespaces, etc.). Pause is isolated (hidden) from the rest of the containers in the pod regardless of your ENTRYPOINT/CMD. See here for more background information.
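For reference, the case where /pause does show up as PID 1 inside the containers (as in the ps output above) is process namespace sharing, which has to be requested explicitly; a minimal hedged sketch:

apiVersion: v1
kind: Pod
metadata:
  name: share-pid-demo          # illustrative name
spec:
  shareProcessNamespace: true   # all containers see /pause as PID 1,
                                # and pause reaps zombies for all of them
  containers:
  - name: app
    image: nginx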
By default, Docker will run your entrypoint (or the command, if there is no entrypoint) as PID 1. However, that is not necessarily always the case, since, depending on how you start the container, Docker (or your orchestrator) can also run its custom init process as PID 1:
$ docker run -d --init --name test alpine sleep infinity
849efe38ecec439550738e981065ec4aff55ef5607f03b9fed975e2d3146b9b0
$ docker exec -ti test ps
PID USER TIME COMMAND
1 root 0:00 /sbin/docker-init -- sleep infinity
7 root 0:00 sleep infinity
8 root 0:00 ps
For more information on why you would want your entrypoint not to be PID 1, you can check this explanation from a tini developer:
Now, unlike other processes, PID 1 has a unique responsibility, which is to reap zombie processes.
Zombie processes are processes that:
Have exited.
Were not waited on by their parent process (wait is the syscall parent processes use to retrieve the exit code of their children).
Have lost their parent (i.e. their parent exited as well), which means they'll never be waited on by their parent.
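To make this concrete, here is a small demo of my own (the container name and timings are illustrative, not from the quoted post). The subshell exits immediately, orphaning sleep 1, which gets reparented to PID 1; PID 1 (sleep 60) never calls wait(), so the finished child lingers:

# Without --init: the orphaned "sleep 1" is reparented to PID 1 ("sleep 60"),
# which never wait()s, so after a second it turns into a zombie.
docker run -d --name zombie-demo alpine sh -c '(sleep 1 &); exec sleep 60'
sleep 3
# On the host, the dead child now shows up in state Z:
ps -eo pid,stat,comm | awk '$2 ~ /Z/'
docker rm -f zombie-demo
# With --init, docker-init is PID 1 instead; it adopts the orphan and reaps it.
docker run -d --init --rm --name zombie-demo alpine sh -c '(sleep 1 &); exec sleep 60'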
I'm using Pulumi to manage Kubernetes deployments. One of the deployments runs an image that intercepts SIGINT and SIGTERM signals to perform a graceful shutdown, like so (this example is running in my IDE):
{"level":"info","msg":"trying to activate gcloud service account","time":"2021-06-17T12:19:25-05:00"}
{"level":"info","msg":"env var not found","time":"2021-06-17T12:19:25-05:00"}
{"Namespace":"default","TaskQueue":"main-task-queue","WorkerID":"37574#Paymahns-Air#","level":"error","msg":"Started Worker","time":"2021-06-17T12:19:25-05:00"}
{"Namespace":"default","Signal":"interrupt","TaskQueue":"main-task-queue","WorkerID":"37574#Paymahns-Air#","level":"error","msg":"Worker has been stopped.","time":"2021-06-17T12:19:27-05:00"}
{"Namespace":"default","TaskQueue":"main-task-queue","WorkerID":"37574#Paymahns-Air#","level":"error","msg":"Stopped Worker","time":"2021-06-17T12:19:27-05:00"}
Notice the "Signal":"interrupt" with a message of Worker has been stopped.
I find that when I alter the source code (which alters the docker image) and run pulumi up, the pod doesn't gracefully terminate in the way described in this blog post. Here's a screenshot of logs from GCP:
The highlighted log line in the image above is the first log line emitted by the app. Note that the shutdown messages aren't logged above the highlighted line, which suggests to me that the pod isn't given a chance to perform a graceful shutdown.
Why might the pod not go through the graceful shutdown mechanisms that kubernetes offers? Could this be a bug with how pulumi performs updates to deployments?
EDIT: after doing more investigation, I found that this problem happens because starting a docker container with go run /path/to/main.go actually ends up creating two processes, like so (after execing into the container):
root@worker-ffzpxpdm-78b9797dcd-xsfwr:/gadic# ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.3 0.3 2046200 30828 ? Ssl 18:04 0:12 go run src/cmd/worker/main.go --temporal-host temporal-server.temporal.svc.cluster.local --temporal-port 7233 --grpc-port 6789 --grpc-hos
root 3782 0.0 0.5 1640772 43232 ? Sl 18:06 0:00 /tmp/go-build2661472711/b001/exe/main --temporal-host temporal-server.temporal.svc.cluster.local --temporal-port 7233 --grpc-port 6789 --
root 3808 0.1 0.0 4244 3468 pts/0 Ss 19:07 0:00 /bin/bash
root 3817 0.0 0.0 5900 2792 pts/0 R+ 19:07 0:00 ps aux
If I run kill -TERM 1, the signal isn't forwarded to the underlying binary, /tmp/go-build2661472711/b001/exe/main, which means the graceful shutdown of the application isn't executed. However, if I run kill -TERM 3782, the graceful shutdown logic is executed.
It seems go run spawns a subprocess, and this blog post suggests that signals are only delivered to PID 1. On top of that, it's unfortunate that go run doesn't forward signals to the subprocess it spawns.
The solution I found is to add RUN go build -o worker /path/to/main.go to my Dockerfile and then start the docker container with ./worker --arg1 --arg2 instead of go run /path/to/main.go --arg1 --arg2.
Doing it this way ensures go doesn't spawn any subprocesses, which in turn ensures signals are handled properly within the docker container.
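For illustration, a minimal multi-stage Dockerfile along those lines (a sketch, not my actual file; the package path mirrors the src/cmd/worker/main.go seen in the ps output above, and the base images and arguments are assumptions):

FROM golang:1.20 AS build
WORKDIR /src
COPY . .
RUN go build -o /worker ./src/cmd/worker

FROM debian:bookworm-slim
COPY --from=build /worker /worker
# Exec-form ENTRYPOINT: the compiled binary itself is PID 1, so SIGTERM
# reaches its signal handlers directly instead of stopping inside "go run".
ENTRYPOINT ["/worker"]
CMD ["--temporal-port", "7233"]   # illustrative arguments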
This is the first time such a thing has happened to me. I'm really scared.
I've been coding and testing a Django webapp on my laptop. The app is running on Docker, with docker-compose. Both the host and guest are Ubuntu 18.04. It consists of 3 images: Django+Gunicorn, Nginx and Postgres.
Nothing really fancy and it worked perfectly, until 5 minutes ago.
When I tried to refresh the page (accessible via 127.0.0.1) in Chrome Incognito, it got stuck loading. Same thing with curl. At the time, I was logged into the Django container (to run collectstatic whenever I needed it) and it was still running as usual.
I thought something was stuck somewhere so I tried to see if there's anything listening to the 80 port. Nothing really special:
tcp6 0 0 :::80 :::* LISTEN 10815/docker-proxy
So, wanting to get back to coding as fast as possible, I tried to (sudo) down and then kill the containers, to no avail:
ERROR: for xxxxxxxx_nginx_1 Cannot kill container: e94e64a75b1726ccd27623024a4223ffb3d77c6578b4d69f6240bea51e8e641b: Cannot kill container e94e64a75b1726ccd27623024a4223ffb3d77c6578b4d69f6240bea51e8e641b: unknown error after kill: docker-runc did not terminate sucessfully: container_linux.go:393: signaling init process caused "permission denied"
: unknown
No problem, I thought, and I just stopped the docker service:
sudo systemctl stop docker
I refreshed the 127.0.0.1 page expecting to see a This site can’t be reached page ... only to see the webapp loading!
I tried to see what containers were running so I could stop them, but docker ps returned this:
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
This confirms the Docker service was down; systemctl status confirmed just that. I also checked whether the server-side code was running. It was. I also tried changing some frontend code, and it loaded the new version.
Can someone tell me what's going on, and how to stop this 'zombie' app from running?
Thanks!
EDIT
I just had the idea to run ps aux | grep docker and here's what I found:
root 1661 0.5 0.9 670260 74136 ? Ssl 17:47 1:15 dockerd -G docker --exec-root=/var/snap/docker/384/run/docker --data-root=/var/snap/docker/common/var-lib-docker --pidfile=/var/snap/docker/384/run/docker.pid --config-file=/var/snap/docker/384/config/daemon.json --debug
root 2148 0.3 0.4 756640 34944 ? Ssl 17:47 0:47 docker-containerd --config /var/snap/docker/384/run/docker/containerd/containerd.toml
root 4105 0.0 0.0 7508 4112 ? Sl 17:48 0:01 docker-containerd-shim -namespace moby -workdir /var/snap/docker/common/var-lib-docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/7709ab085e470228c120eff4c9b36590348dac483a40d9b107cfb8d62146e060 -address /var/snap/docker/384/run/docker/containerd/docker-containerd.sock -containerd-binary /snap/docker/384/bin/docker-containerd -runtime-root /var/snap/docker/384/run/docker/runtime-runc -debug
root 10618 0.0 0.0 7508 4464 ? Sl 17:57 0:01 docker-containerd-shim -namespace moby -workdir /var/snap/docker/common/var-lib-docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/3a689a845ef012584e46d631c053ca0a00dbe34bb430f5e52a4de879c7efe966 -address /var/snap/docker/384/run/docker/containerd/docker-containerd.sock -containerd-binary /snap/docker/384/bin/docker-containerd -runtime-root /var/snap/docker/384/run/docker/runtime-runc -debug
root 10815 0.0 0.0 425952 2956 ? Sl 17:58 0:07 /snap/docker/384/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 80 -container-ip 172.20.0.4 -container-port 80
root 10822 0.0 0.0 9172 5032 ? Sl 17:58 0:01 docker-containerd-shim -namespace moby -workdir /var/snap/docker/common/var-lib-docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/e94e64a75b1726ccd27623024a4223ffb3d77c6578b4d69f6240bea51e8e641b -address /var/snap/docker/384/run/docker/containerd/docker-containerd.sock -containerd-binary /snap/docker/384/bin/docker-containerd -runtime-root /var/snap/docker/384/run/docker/runtime-runc -debug
ahmed 26359 0.0 0.0 21536 1048 pts/5 S+ 21:52 0:00 grep --color=auto docker
EDIT 2
After manually killing some of the processes above, the situation is back to normal. But still, I'd love to get an explanation if there's one.
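For the record, "manually killing" meant roughly this (the PIDs are taken from the ps output above; a sketch of what I did, not a recommended procedure):

# Kill the leftover containerd-shims and the docker-proxy holding port 80.
sudo kill 4105 10618 10815 10822
# If any of them ignore SIGTERM, escalate:
sudo kill -9 4105 10618 10815 10822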
I am trying to bypass the login page on RStudio as we are running it in a Docker container and this step is not necessary as we authenticate before we let users launch the container.
I am using the Rocker implementation of RStudio for Docker. We are running on Centos7.
I'm fairly new to SO, so please let me know what information would be helpful for answering the question.
I figured it out.
When you start rserver, add the flag --auth-none=1, so my final CMD in my Dockerfile looked like:
USER rstudio
CMD ["/usr/lib/rstudio-server/bin/rserver","--server-daemonize=0","--auth-none=1"]
I will caution, though: the first time I did it, I ran with sudo -E in front of the command and it logged into RStudio as ROOT! (This was also because I had altered /etc/rstudio/rserver.conf with the setting auth-minimum-user-id=0 while trying to get an error to go away, which it did.)
The above code switches to the user 'rstudio' before running the command, which logs you straight in as rstudio.
Hope that helps someone out there; I know I spent the better portion of my day finding a workaround!
To bypass the login page you need also to define the environment variable USER.
need to set system environmental variable USER=rstudio in order for --auth-none 1
-- https://github.com/rstudio/rstudio/issues/1663
Here is a Dockerfile snippet that permits running the RStudio server and logging in as the user rstudio.
ENV USER="rstudio"
CMD ["/usr/lib/rstudio-server/bin/rserver", "--server-daemonize", "0", "--auth-none", "1"]
When it's run, the login page is not displayed, and we can check that the server and the session are running as the rstudio user.
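Assuming the snippet above sits in a Dockerfile built into an image tagged rstudio (the tag is an assumption chosen to match the run command below):

# Build the image from the Dockerfile containing the snippet above
docker build -t rstudio .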
# Run the container
docker run --name rstudio --rm -p 8787:8787 -d rstudio
# Check processes
docker exec -it rstudio ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
rstudio+ 1 0.1 0.3 210792 13844 ? Ssl 21:10 0:00 /usr/lib/rstudi
rstudio 49 0.7 2.3 555096 82312 ? Sl 21:10 0:03 /usr/lib/rstudi
root 570 0.0 0.1 45836 3744 pts/0 Rs+ 21:18 0:00 ps aux
I hosted a web app on Jelastic (dogado) as a docker container (the official docker container link). After 2 weeks, I got this email:
Dear Jelastic customer, there was a process of the command
"/usr/local/tomcat/3333" which was sending massive packets to
different targets this morning. The symptoms look like the docker
instance has a security hole and was used in an DDoS attack or part of
a botnet.
The top command showed this process:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
334 root 20 0 104900 968 456 S 99.2 0.1 280:51.95 3333
root@node0815-somename:/# ls -al /proc/334
...
lrwxrwxrwx 1 root root 0 Jul 26 08:16 cwd -> /usr/local/tomcat
lrwxrwxrwx 1 root root 0 Jul 26 08:16 exe -> /usr/local/tomcat/3333
We have killed the process and changed the permissions of the file:
root@node0815-somename:/# kill 334
root@node0815-somename:/# chmod 000 /usr/local/tomcat/3333
Please investigate or use a more security-hardened docker template.
Has anyone encountered the same or a similar problem before? Is it possible that the container was hacked?
The people who provide the container gave me a hint...
I had removed only the ROOT war:
RUN rm -rf /usr/local/tomcat/webapps/ROOT
I had completely forgotten that Tomcat ships with example apps. So I had to delete the security holes:
RUN rm -rf /usr/local/tomcat/webapps/
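Put together, the relevant Dockerfile lines look roughly like this (a sketch; the base image tag and WAR name are illustrative, not my actual file):

FROM tomcat:9   # illustrative tag
# Drop every bundled webapp (ROOT, docs, examples, manager, host-manager).
RUN rm -rf /usr/local/tomcat/webapps/
# Deploy only your own application; COPY recreates the webapps directory.
COPY myapp.war /usr/local/tomcat/webapps/ROOT.war   # illustrative WAR name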
Do you use any protection tools? We can't exclude the scenario in which your container gets hacked if there is no protection.
We strongly recommend using iptables and Fail2Ban to protect your containers from attacks (you have root access to your Docker container via SSH, so you are able to install and configure these packages), especially if you have attached a public IP to your containers.
Also, you have access to all container logs (via Dashboard or SSH), so you are able to analyze logs and take preventive actions.
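As a starting point, something along these lines (a sketch assuming a Debian-based container, as the official Tomcat images are; not an official Jelastic procedure):

apt-get update && apt-get install -y iptables fail2ban
# Rate-limit new SSH connections and drop the excess:
iptables -A INPUT -p tcp --dport 22 -m conntrack --ctstate NEW -m limit --limit 4/min -j ACCEPT
iptables -A INPUT -p tcp --dport 22 -m conntrack --ctstate NEW -j DROP
# Fail2Ban's Debian packaging enables an sshd jail by default:
service fail2ban start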
Have a nice day.