Memory is not getting released after use in Puma Docker container - ruby-on-rails

We are running Puma in cluster mode with 3 workers and 5 threads each. Once the container application is up, the Puma workers start consuming memory at a high rate; after some time the growth slows down, but memory utilization still increases step by step.
Once memory utilization hits 100%, the container is stopped and restarted.
This is a really big problem for us.
Attaching the memory utilization graph for the Rails Puma application.
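For reference, a minimal config/puma.rb matching the setup described above would look something like this. It is an illustrative sketch, not the asker's actual file; preload_app! and the fork hooks are standard cluster-mode settings assumed here:

# config/puma.rb - sketch of a 3-worker / 5-thread cluster-mode setup
workers 3                # three forked worker processes
threads 5, 5             # five threads (min, max) per worker

preload_app!             # load the app once in the master so workers share memory via copy-on-write

before_fork do
  # close database connections in the master before forking
  ActiveRecord::Base.connection_pool.disconnect! if defined?(ActiveRecord)
end

on_worker_boot do
  # each worker re-establishes its own connections
  ActiveRecord::Base.establish_connection if defined?(ActiveRecord)
end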

Related

Rails: PUMA cluster mode vs threaded only?

In my Rails app, hosted on the DigitalOcean App Platform, we're seeing high RAM usage and can accommodate only 2 Puma workers in a Pro instance (1 vCPU, 2 GB RAM). Should I prefer more small instances without any workers, since that would be more cost-effective?
I.e., if it were Heroku, should I use a 2x dyno with more Puma workers or multiple 1x dynos?
Is there any inherent advantage to using Puma in cluster mode versus threaded-only with more instances?
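For context, the two modes differ mainly by the workers directive in config/puma.rb. A minimal sketch (the numbers are illustrative, not a recommendation):

# config/puma.rb - threaded-only: one process, concurrency from threads alone
threads 5, 5

# config/puma.rb - cluster mode: forked workers, each with its own thread pool;
# workers can occupy multiple CPU cores despite the GVL, but each one adds a
# copy of the app's memory footprint (mitigated by preload_app! and copy-on-write)
workers 2
threads 5, 5
preload_app!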

Docker container memory usage not going down

I have a Flask web server running with docker-compose.
When the container first starts, it uses around 200 MB of memory, but after some use this climbs to 1 GB (per docker stats).
However, once it reaches high consumption, the container's memory usage does not decrease even when idle, and it eventually hits the limit, causing dead uWSGI workers and stopped processes.
Can someone explain what happens behind the curtain and how to make the container release unused memory?
Looks like this is a bug and it is being tracked:
https://github.com/kubernetes/kubernetes/issues/70179

Unicorn master memory increasing

I have a Ruby service running with Unicorn, which spawns 20 child workers. When I start the service, the Unicorn master starts with approximately 520 MB of memory and then spawns the 20 children. The service runs perfectly for 7-8 days, but the master's memory gradually keeps increasing, up to 1.3 GB. This eventually causes an OOM error when the master tries to fork a new child: forking now requires 1.3 GB of memory, and with the increased footprint that much memory is no longer available.
So my concerns here are:
1. As per my understanding, the Unicorn master doesn't serve any requests; its sole purpose is to spawn missing children. Why, then, is the master process's memory increasing?
2. I have unicorn-worker-killer configured with my app, and it is working absolutely fine (see the sketch after the configuration below). If there is a memory leak in my Ruby app, shouldn't unicorn-worker-killer already be handling it? Is this assumption wrong? Can there be a memory leak in the master process due to my Ruby app?
3. After killing a child worker, Unicorn creates a new child process with approximately the same memory as the master process. Is it true that all child workers the Unicorn master creates over time will start with the same memory as the master process?
Below is my service and machine configuration:
Ruby: 2.1.2
Unicorn: 4.6.3
Unicorn worker count: 20
OS: Debian GNU/Linux 8, Jessie
RAM: 24 GB
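For reference, unicorn-worker-killer is wired into the Rack stack in config.ru and monitors only the workers, never the master, which bears on point 2 above. A sketch using the gem's documented middleware (the thresholds are illustrative):

# config.ru - unicorn-worker-killer wiring (illustrative thresholds)
require 'unicorn/worker_killer'

# restart a worker after it has served 3072-4096 requests
# (the range is randomized so all workers don't restart at once)
use Unicorn::WorkerKiller::MaxRequests, 3072, 4096

# restart a worker once its memory crosses 240-260 MB
use Unicorn::WorkerKiller::Oom, (240 * 1024**2), (260 * 1024**2)

require ::File.expand_path('../config/environment', __FILE__)
run Rails.application

Because the killer runs inside each worker's request cycle, growth in the master process (e.g., from anything loaded before forking) is outside its reach.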

Why are there so many Sidekiq processes?

I ran htop on my production server to see what was eating my RAM. A lot of Sidekiq processes are running. Is this normal?
Press Shift-H. htop shows individual threads as processes by default; there is only one actual Sidekiq process.
You seem to have configured it to have 25 workers.
By default, one Sidekiq process creates 25 threads.
If that's crushing your machine with I/O, you can adjust it down:
sidekiq -c 10
https://github.com/mperham/sidekiq/wiki/Advanced-Options
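The same limit can also be set in config/sidekiq.yml instead of on the command line; a minimal sketch:

# config/sidekiq.yml
:concurrency: 10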
If you are not using JRuby, then it's likely these are all separate processes that consume memory.

Airbnb Airflow using all system resources

We've set up Airbnb/Apache Airflow for our ETL using the LocalExecutor, and as we've started building more complex DAGs, we've noticed that Airflow has started using up incredible amounts of system resources. This is surprising to us because we mostly use Airflow to orchestrate tasks that happen on other servers, so Airflow DAGs spend most of their time waiting for them to complete; there's no actual execution that happens locally.
The biggest issue is that Airflow seems to use up 100% of CPU at all times (on an AWS t2.medium), and uses over 2GB of memory with the default airflow.cfg settings.
If relevant, we're running Airflow via docker-compose, running the container twice: once as the scheduler and once as the webserver.
What are we doing wrong here? Is this normal?
EDIT:
Here is the output from htop, ordered by % memory used (since that seems to be the main issue now; I got CPU usage down):
I suppose in theory I could reduce the number of gunicorn workers (it's at the default of 4), but I'm not sure what all the /usr/bin/dockerd processes are. If Docker is complicating things I could remove it, but it's made deployment of changes really easy and I'd rather not remove it if possible.
I had also tried everything I could to get the CPU usage down, and Matthew Housley's advice regarding MIN_FILE_PROCESS_INTERVAL was what did the trick.
At least until Airflow 1.10 came around... then the CPU usage went through the roof again.
So here is everything I had to do to get Airflow to work well on a standard DigitalOcean droplet with 2 GB of RAM and 1 vCPU:
1. Scheduler File Processing
Prevent Airflow from reloading the DAGs all the time by setting:
AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL=60
2. Fix airflow 1.10 scheduler bug
The AIRFLOW-2895 bug in Airflow 1.10 causes high CPU load because the scheduler keeps looping without a break.
It's already fixed in master and will hopefully be included in Airflow 1.10.1, but it could take weeks or months until it's released. In the meantime, this patch solves the issue:
--- jobs.py.orig	2018-09-08 15:55:03.448834310 +0000
+++ jobs.py	2018-09-08 15:57:02.847751035 +0000
@@ -564,6 +564,7 @@
         self.num_runs = num_runs
         self.run_duration = run_duration
+        self._processor_poll_interval = 1.0
         self.do_pickle = do_pickle
         super(SchedulerJob, self).__init__(*args, **kwargs)
@@ -1724,6 +1725,8 @@
             loop_end_time = time.time()
             self.log.debug("Ran scheduling loop in %.2f seconds",
                            loop_end_time - loop_start_time)
+            self.log.debug("Sleeping for %.2f seconds", self._processor_poll_interval)
+            time.sleep(self._processor_poll_interval)
             # Exit early for a test mode
             if processor_manager.max_runs_reached():
Apply it with patch -d /usr/local/lib/python3.6/site-packages/airflow/ < af_1.10_high_cpu.patch;
3. RBAC webserver high CPU load
If you upgraded to use the new RBAC webserver UI, you may also notice that the webserver is using a lot of CPU persistently.
For some reason the RBAC interface uses a lot of CPU on startup. If you are running on a low-powered server, this can cause a very slow webserver startup and permanently high CPU usage.
I have documented this bug as AIRFLOW-3037. To solve it you can adjust the config:
AIRFLOW__WEBSERVER__WORKERS=2                      # 2 * NUM_CPU_CORES + 1
AIRFLOW__WEBSERVER__WORKER_REFRESH_INTERVAL=1800   # Restart workers every 30 min instead of every 30 s
AIRFLOW__WEBSERVER__WEB_SERVER_WORKER_TIMEOUT=300  # Kill workers if they don't start within 5 min instead of 2 min
With all of these tweaks, my Airflow uses only a few percent of CPU at idle on a DigitalOcean standard droplet with 1 vCPU and 2 GB of RAM.
I just ran into an issue like this. Airflow was consuming roughly a full vCPU in a t2.xlarge instance, with the vast majority of this coming from the scheduler container. Checking the scheduler logs, I could see that it was processing my single DAG more than once a second even though it only runs once a day.
I found that the MIN_FILE_PROCESS_INTERVAL was set to the default value of 0, so the scheduler was looping over the DAG. I changed the process interval to 65 seconds, and Airflow now uses less than 10 percent of a vCPU in a t2.medium instance.
Try changing the config below in airflow.cfg:
# After how much time new DAGs should be picked up from the filesystem
min_file_process_interval = 0
# How many seconds to wait between file-parsing loops to prevent the logs from being spammed
min_file_parsing_loop_time = 1
The key point is HOW the DAG files are processed.
I reduced the scheduler's CPU usage from 80%+ to 30% on an 8-core server by updating two config keys (see the airflow.cfg sketch below):
min_file_process_interval from 0 to 60;
max_threads from 1000 to 50.
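In airflow.cfg terms, that corresponds to the following (a sketch; in Airflow 1.x both keys live under the [scheduler] section):

[scheduler]
min_file_process_interval = 60
max_threads = 50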
For starters, you can use htop to monitor and debug your CPU usage.
I would suggest that you run the webserver and scheduler processes in the same Docker container, which would reduce the resources required to run two containers on an EC2 t2.medium. Airflow workers need resources for downloading data and reading it into memory, but the webserver and scheduler are pretty lightweight processes. Make sure that when you run the webserver, you control the number of workers running on the instance using the CLI (see the example after the usage below).
airflow webserver [-h] [-p PORT] [-w WORKERS]
[-k {sync,eventlet,gevent,tornado}]
[-t WORKER_TIMEOUT] [-hn HOSTNAME] [--pid [PID]] [-D]
[--stdout STDOUT] [--stderr STDERR]
[-A ACCESS_LOGFILE] [-E ERROR_LOGFILE] [-l LOG_FILE]
[-d]
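For example, to run the webserver on port 8080 with only two gunicorn workers (values illustrative):

airflow webserver -p 8080 -w 2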
I faced the same issue deploying Airflow on EKS. It was resolved by updating max_threads to 128 in the Airflow config.
max_threads: the scheduler will spawn multiple threads in parallel to schedule DAGs. This is controlled by max_threads, with a default value of 2. Users should increase this to a larger value (e.g., the number of CPUs where the scheduler runs, minus 1) in production.
From here
https://airflow.apache.org/docs/stable/faq.html
I tried to run Airflow on an AWS t2.micro instance (1 vCPU, 1 GB of memory, eligible for the free tier) and had the same issue: the worker consumed 100% of the CPU and all available memory.
The EC2 instance was totally stuck and unusable; Airflow, of course, wasn't working.
So I created a 4 GB swap file using the method described here. With the swap, no more issues: Airflow was fully functional.
Of course, with only one vCPU you cannot expect incredible performance, but it runs.
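The linked method isn't reproduced above; as an assumption, creating a 4 GB swap file on Linux generally follows these standard steps:

sudo fallocate -l 4G /swapfile                               # allocate a 4 GB file
sudo chmod 600 /swapfile                                     # swap files must not be world-readable
sudo mkswap /swapfile                                        # format it as swap space
sudo swapon /swapfile                                        # enable it immediately
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab   # persist across reboots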
