How do I monitor and restart my application running in Docker based on memory usage? - docker

I have an application running in Docker that leaks memory over time. I need to restart this application periodically when memory usage goes over a threshold. My application can respond to signals, or to tmp/restart.txt being touched (this is Rails)... as long as I can run a script or send a configurable signal when limits are triggered, I can safely shut down and restart my process.
I have looked into limiting memory utilization with Docker, but I don't see a way to run a custom action when a limit or reservation is hit. A SIGKILL would not be appropriate for my app... I need some time to clean up.
I am using runit as a minimal init system inside the container, and ECS for container orchestration. This feels like a problem that should be addressed at the application or init level... killing a container rather than restarting the process seems heavy-handed.
I have used Monit for this in the past, but I don't like how Monit deals with pidfiles... too often Monit loses control of a process. I am trying out Inspeqtor which seems to fit the bill very well, but while it supports runit there are no packages that work with runit out of the box.
So my question is, if SIGKILL is inappropriate for my use case, what's the best way to monitor a process for memory usage and then perform a cleanup/restart action based on that usage crossing a threshold?
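For concreteness, here is a minimal sketch of the kind of watchdog I have in mind (the process pattern, threshold, and paths are assumptions, not anything I have running): poll the server's RSS and touch tmp/restart.txt, or send a configurable signal, once the threshold is crossed.

#!/bin/sh
# Minimal watchdog sketch; names, threshold, and paths are assumptions.
# Polls the Rails server's RSS and touches tmp/restart.txt when it crosses a threshold.

THRESHOLD_KB=512000              # assumed limit, ~500 MB
APP_DIR=/app                     # assumed application directory
PROCESS_PATTERN="puma"           # assumed server process; adjust to your setup

while true; do
  PID=$(pgrep -o -f "$PROCESS_PATTERN")       # oldest matching process (the master)
  if [ -n "$PID" ]; then
    RSS_KB=$(awk '/VmRSS/ {print $2}' "/proc/$PID/status")
    if [ -n "$RSS_KB" ] && [ "$RSS_KB" -gt "$THRESHOLD_KB" ]; then
      touch "$APP_DIR/tmp/restart.txt"        # or send a signal: kill -USR2 "$PID"
    fi
  fi
  sleep 60
done

Under runit, a script like this could itself be supervised as a second service alongside the app.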

Related

Can a docker container be run `nice`ly?

I have a docker image that hosts a web server and another that runs background tasks. Most of the time the web server is idle, and the background tasks should be allowed to use 100% of the CPUs, but any time the web server needs resources, it should have priority on the CPUs so it can respond quickly.
If everything were running on one Linux machine, I could use something like nice -n19 background-task to run the tasks, and they would allow the web server as much CPU as it needed.
Is there a way to run the whole container at a nice level? I know I can restrict the amount of CPU time available to each background task with cpu_quota, but this doesn't solve the problem. If the web server wants to use all 4 CPU cores to serve a client it should be allowed. If the web server is not busy, all 4 CPU cores should work on the background task.
If I change the command in the Dockerfile to:
nice -n19 background-task
Will this work between containers? Processes inside containers are all essentially normal processes running on the same kernel, so it seems like it should, but I'm not sure.
This seems like something fairly obvious to do. Am I missing something?
Docker processes are ordinary OS processes. Whether a process runs in a container or not is of no concern to the kernel's process scheduler, so nice/renice works for Docker processes the same way it does for any other process.
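For example (a sketch only; the image names are made up and background-task must exist on the image's PATH), you can either wrap the background command in nice, or let Docker weight the containers relative to one another with --cpu-shares, which only kicks in when the CPUs are contended:

# Option 1: wrap the background command in nice inside its container
# (containers share the kernel's scheduler, so nice values are compared across them)
docker run -d my-background-image nice -n 19 background-task

# Option 2: weight the containers against each other; shares only matter under contention
docker run -d --cpu-shares 1024 my-web-image
docker run -d --cpu-shares 64 my-background-image

Both are relative weightings: when the web container is idle, the background container still gets the full CPU.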

What happens to ECS containers that exceed soft memory limit when there is memory contention?

Say I have an instance with 2G memory, and a task/container with 0.5G soft memory limit, and 0.75G hard memory limit.
The instance is running 3 containers, each consuming 0.6G memory. Now a 4th container needs to be added. What happens to the 3 running containers? Is their memory allocation reduced? Are they migrated to another instance? And if there is no other instance, will the 4th container be placed at all?
I understand how soft and hard CPU limits work, since CPU is a dynamic resource (the application can handle spikes in free CPU). In the case of memory, however, you cannot really take memory away from a container that is already using it.
The 4th container will not be able to spawn, and you will get the error below.
(service sample) was unable to place a task because no container instance met all of its requirements. The closest matching (container-instance 05016874-f518-4b7a-a817-eb32a4d387f1) has insufficient memory available. For more information, see the Troubleshooting section of the Amazon ECS Developer Guide.
You need to add another ECS instance if you want to schedule the 4th container. The other 3 containers stay in a steady state; nothing like a reduction of their memory allocation happens in the cluster. If there is no instance with enough capacity, your service will remain in an unsteady state and keep producing the error above.
Ref: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_definition_parameters.html
Actually, memory can be reclaimed from running processes. For example, the kernel may evict memory that is backed by files (such as the code of the process itself); if that data is needed again, the kernel pages it back in. This is explained a little in this blog post: https://chrisdown.name/2018/01/02/in-defence-of-swap.html
If the task is scheduled on that node but the kernel fails to reclaim enough memory to avoid an out-of-memory situation, the kernel will kill one of the processes; Docker detects this and kills the container, and ECS notices in turn. I'm not sure whether ECS will try to reschedule the dead task on the same instance or a different one. It probably depends.
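To make the soft/hard distinction concrete (the container name and image below are placeholders, not from the question), the ECS limits correspond to Docker's --memory-reservation and --memory flags, and you can check after the fact whether a container was OOM-killed:

# ECS memoryReservation (soft) and memory (hard) map to these docker flags:
#   --memory-reservation  soft limit, only enforced when the host is under memory pressure
#   --memory              hard limit, exceeding it makes the container a target for the OOM killer
docker run -d --name sample-task --memory-reservation 512m --memory 768m sample-image

# Afterwards, check whether the kernel OOM-killed the container
docker inspect -f '{{.State.OOMKilled}}' sample-task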

How to reliably clean up dask scheduler/worker

I'm starting up a dask cluster in an automated way by ssh-ing into a bunch of machines and running dask-worker. I noticed that I sometimes run into problems when processes from a previous experiment are still running. What's the best way to clean up after dask? killall dask-worker dask-scheduler doesn't seem to do the trick, possibly because dask somehow starts up new processes in their place.
If you start a worker with dask-worker, you will notice in ps that it starts more than one process, because there is a "nanny" responsible for restarting the worker in case it crashes. There may also be "semaphore" processes around for communication between the two, depending on which form of process spawning you are using.
The correct way to stop all of these is to send a SIGINT (i.e., a keyboard interrupt) to the parent process. A KILL signal might not give it the chance to stop and clean up its child process(es). If something (e.g., an ssh hangup) caused a more abrupt termination, or a session didn't send any stop signal at all, then you will probably have to grep the output of ps for dask-like processes and kill them all.
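A rough sketch of that cleanup (adjust the patterns to however you launch dask): send SIGINT first so the nanny can stop its children cleanly, and only fall back to SIGKILL for stragglers:

# Ask the scheduler, the workers and their nannies to shut down cleanly (SIGINT, like Ctrl-C)
pkill -INT -f dask-scheduler
pkill -INT -f dask-worker

# Give them a moment to tear down their child processes
sleep 10

# Force-kill anything that is still around
pkill -9 -f dask-scheduler
pkill -9 -f dask-worker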

Can the OS kill the process randomly in Linux?

One of our processes went down on a Linux box. When I checked the logs, I could see that it had shut down gracefully. CPU, memory, and process utilization were all under their thresholds, and there were no abnormalities in memory usage. Is there any way the OS could have killed the process at random?
Any suggestions?
The kernel can kill a process under extreme circumstances, i.e. memory starvation. Since that was not the case here, and you are sure the sysadmins did not kill the process either, the shutdown must have been initiated from within the process itself.
Linux will not kill your process unless there are extreme circumstances, although some other process running as root could send it such signals.
You should check the kernel logs to confirm whether or not the process was killed by the OS itself.
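For example, something along these lines will show whether the kernel (e.g. the OOM killer) terminated the process; the exact log locations vary by distribution:

# Look for OOM-killer activity in the kernel ring buffer
dmesg -T | grep -iE 'out of memory|killed process'

# On systemd-based systems the kernel log is also available via journalctl
journalctl -k | grep -i oom

# Classic syslog locations (file names vary by distribution)
grep -i 'killed process' /var/log/syslog /var/log/kern.log 2>/dev/null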

Can i limit apache+passenger memory usage on server without swap space

I'm running a Rails application with Apache + Passenger on virtual servers that do not have any swap space configured.
The site gets a decent amount of traffic (200K+ daily requests), and sometimes the whole system runs out of memory, causing odd behaviour across the system.
Is there any way to configure Apache or Passenger not to run out of memory (e.g. gracefully restarting Passenger instances when they start using more than, say, 300MB of memory)?
The servers have 4GB of memory. Currently I'm using Passenger's PassengerMaxRequests option, but it does not seem to be the most solid solution here.
At the moment I also cannot switch to nginx, so that is not an option for freeing up some memory.
Any clever ideas I'm probably missing are welcome.
Edit: My temporary solution
I did not go with restarting Rails instances when they exceed a certain amount of memory usage. Engine Yard wrote a great blog post on the ActiveRecord memory bloat issue, which is our main suspect here. As I did not have much time to optimize the application, I set PassengerMaxRequests to 300 and added an extra 2GB of memory to the server. Things have been good since then. At first I was worried that constantly restarting Rails instances would make things slow, but it does not seem to have an impact I should worry about.
If by "limiting" you mean killing those processes, and if this is the only application on the server and it is running Linux, then you have two choices:
Set the maximum amount of memory one process can have:
# ulimit -m
unlimited
Or use cgroups for similar behavior:
http://en.wikipedia.org/wiki/Cgroups
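For instance, with the cgroup-tools (libcgroup) utilities you could confine the whole Apache/Passenger tree to a single memory cgroup; the group name and the 2G cap below are just assumptions:

# cgroup-tools / libcgroup, cgroups v1 (on cgroups v2 the knob is memory.max instead)
cgcreate -g memory:/passenger
cgset -r memory.limit_in_bytes=2G /passenger

# Start Apache inside that cgroup so every Passenger child inherits the limit
cgexec -g memory:/passenger apachectl start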
I would advise against restarting instances (if that is possible) that go over the "memory limit", because that may put your system into an infinite loop where a process repeatedly reaches the limit and restarts.
Maybe you could write a simple daemon that constantly watches the processes, and kills any that go over a certain amount of memory. Be sure to log any information about the process that did this so you can fix the issue whenever it comes up.
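A rough sketch of such a daemon (the process pattern, threshold, and log path below are assumptions): kill any Passenger worker whose RSS exceeds the threshold, log it, and let Passenger spawn a replacement:

#!/bin/sh
# Naive watchdog sketch: kill any Passenger worker whose RSS exceeds a threshold
# and log it; Passenger spawns a fresh instance to replace the killed one.
# The pattern, threshold, and log path are assumptions.

THRESHOLD_KB=307200                  # ~300 MB
PATTERN="Passenger RackApp"          # check `sudo passenger-memory-stats` for the real name
LOG=/var/log/passenger-watchdog.log

while true; do
  for PID in $(pgrep -f "$PATTERN"); do
    RSS_KB=$(awk '/VmRSS/ {print $2}' "/proc/$PID/status" 2>/dev/null)
    if [ -n "$RSS_KB" ] && [ "$RSS_KB" -gt "$THRESHOLD_KB" ]; then
      echo "$(date): killing PID $PID (RSS ${RSS_KB} kB)" >> "$LOG"
      kill -TERM "$PID"
    fi
  done
  sleep 30
done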
I'd also look into getting some real swap space on there... This seems like a bad hack.
I have a problem where Passenger processes end up going out of control and consuming too much memory. I wrote the following script, which has been helping to keep things under control until I find the real solution: http://www.codeotaku.com/blog/2012-06/passenger-memory-check/index. It might be helpful.
Passenger web instances don't contain important state (generally speaking), so killing them isn't normally a problem, and Passenger will restart them as and when required.
I don't have a canned solution, but you might want to use two commands that ship with Passenger to keep track of memory usage and the number of processes: passenger-status and sudo passenger-memory-stats; see the
Passenger users guide for Nginx or the
Passenger users guide for Apache.
