Airbnb Airflow using all system resources - docker

We've set up Airbnb/Apache Airflow for our ETL using LocalExecutor, and as we've started building more complex DAGs, we've noticed that Airflow has started using up incredible amounts of system resources. This is surprising to us because we mostly use Airflow to orchestrate tasks that happen on other servers, so Airflow DAGs spend most of their time waiting for them to complete--there's no actual execution that happens locally.
The biggest issue is that Airflow seems to use up 100% of CPU at all times (on an AWS t2.medium), and uses over 2GB of memory with the default airflow.cfg settings.
If relevant, we're running Airflow using docker-compose running the container twice; once as scheduler and once as webserver.
What are we doing wrong here? Is this normal?
EDIT:
Here is the output from htop, ordered by % Memory used (since that seems to be the main issue now, I got CPU down):
I suppose in theory I could reduce the number of gunicorn workers (it's at the default of 4), but I'm not sure what all the /usr/bin/dockerd processes are. If Docker is complicating things I could remove it, but it's made deployment of changes really easy and I'd rather not remove it if possible.

I have also tried everything I could to get the CPU usage down and Matthew Housley's advice regarding MIN_FILE_PROCESS_INTERVAL was what did the trick.
At least until airflow 1.10 came around... then the CPU usage went through the roof again.
So here is everything I had to do to get airflow to work well on a standard digital ocean droplet with 2gb of ram and 1 vcpu:
1. Scheduler File Processing
Prevent airflow from reloading the dags all the time and set:
AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL=60
2. Fix airflow 1.10 scheduler bug
The AIRFLOW-2895 bug in Airflow 1.10 causes high CPU load because the scheduler keeps looping without a break.
It's already fixed in master and will hopefully be included in Airflow 1.10.1, but it could take weeks or months until it's released. In the meantime, this patch solves the issue:
--- jobs.py.orig 2018-09-08 15:55:03.448834310 +0000
+++ jobs.py 2018-09-08 15:57:02.847751035 +0000
@@ -564,6 +564,7 @@
         self.num_runs = num_runs
         self.run_duration = run_duration
+        self._processor_poll_interval = 1.0
         self.do_pickle = do_pickle
         super(SchedulerJob, self).__init__(*args, **kwargs)
@@ -1724,6 +1725,8 @@
             loop_end_time = time.time()
             self.log.debug("Ran scheduling loop in %.2f seconds",
                            loop_end_time - loop_start_time)
+            self.log.debug("Sleeping for %.2f seconds", self._processor_poll_interval)
+            time.sleep(self._processor_poll_interval)
             # Exit early for a test mode
             if processor_manager.max_runs_reached():
Apply it with patch -d /usr/local/lib/python3.6/site-packages/airflow/ < af_1.10_high_cpu.patch;
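To illustrate why a loop without a pause keeps a core busy, here is a minimal Python sketch. It is not Airflow code: count_iterations is a made-up helper and the 0.2 s / 0.05 s values are arbitrary. It only mirrors the idea of the patch, which is to sleep for a poll interval at the end of each pass.

```python
import time


def count_iterations(duration, poll_interval):
    """Run an empty polling loop for `duration` seconds and count its passes.

    `poll_interval` is the sleep between passes; 0 means no sleep at all.
    """
    deadline = time.monotonic() + duration
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1  # stand-in for one scheduling pass
        if poll_interval > 0:
            time.sleep(poll_interval)  # same idea as the time.sleep() in the patch
    return iterations


if __name__ == "__main__":
    busy = count_iterations(duration=0.2, poll_interval=0)
    throttled = count_iterations(duration=0.2, poll_interval=0.05)
    print("passes without sleep:", busy)
    print("passes with sleep:   ", throttled)
```

The loop without a sleep typically runs many thousands of passes in the same wall-clock time, and all of them burn CPU even when there is nothing to schedule.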
3. RBAC webserver high CPU load
If you upgraded to use the new RBAC webserver UI, you may also notice that the webserver is using a lot of CPU persistently.
For some reason the RBAC interface uses a lot of CPU on startup. If you are running on a low powered server, this can cause a very slow webserver startup and permanently high CPU usage.
I have documented this bug as AIRFLOW-3037. To solve it you can adjust the config:
AIRFLOW__WEBSERVER__WORKERS=2 # 2 * NUM_CPU_CORES + 1
AIRFLOW__WEBSERVER__WORKER_REFRESH_INTERVAL=1800 # Restart workers every 30 min instead of every 30 seconds
AIRFLOW__WEBSERVER__WEB_SERVER_WORKER_TIMEOUT=300 # Kill workers if they don't start within 5 min instead of 2 min
With all of these tweaks my airflow is using only a few % of CPU during idle time on a digital ocean standard droplet with 1 vcpu and 2gb of ram.
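The variables in this answer follow Airflow's AIRFLOW__{SECTION}__{KEY} naming, so each one corresponds to a key in a section of airflow.cfg. As a rough sketch (env_to_cfg is a hypothetical helper, not part of Airflow), this is how the same settings would look in the file:

```python
import configparser
import io


def env_to_cfg(env):
    """Render AIRFLOW__SECTION__KEY variables as airflow.cfg-style text."""
    parser = configparser.ConfigParser(interpolation=None)
    for name, value in env.items():
        parts = name.split("__", 2)
        if len(parts) != 3 or parts[0] != "AIRFLOW":
            continue  # not an Airflow config override
        section, key = parts[1].lower(), parts[2].lower()
        if not parser.has_section(section):
            parser.add_section(section)
        parser.set(section, key, str(value))
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


if __name__ == "__main__":
    # Values taken from the answer above.
    print(env_to_cfg({
        "AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL": "60",
        "AIRFLOW__WEBSERVER__WORKERS": "2",
        "AIRFLOW__WEBSERVER__WORKER_REFRESH_INTERVAL": "1800",
        "AIRFLOW__WEBSERVER__WEB_SERVER_WORKER_TIMEOUT": "300",
    }))
```

Either form should work. The environment variables are just easier to set from docker-compose.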

I just ran into an issue like this. Airflow was consuming roughly a full vCPU in a t2.xlarge instance, with the vast majority of this coming from the scheduler container. Checking the scheduler logs, I could see that it was processing my single DAG more than once a second even though it only runs once a day.
I found that the MIN_FILE_PROCESS_INTERVAL was set to the default value of 0, so the scheduler was looping over the DAG. I changed the process interval to 65 seconds, and Airflow now uses less than 10 percent of a vCPU in a t2.medium instance.
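A back-of-the-envelope model of what the interval changes, in Python. This is a simplification, not the scheduler's real logic: it assumes one scheduler pass per second (roughly what the logs showed) and a single DAG file, and count_parses is a made-up name.

```python
def count_parses(total_seconds, loop_seconds, min_interval):
    """Count how often one DAG file is parsed over `total_seconds`.

    The loop wakes up every `loop_seconds`; the file is only parsed again
    once at least `min_interval` seconds have passed since the last parse.
    """
    parses = 0
    last_parse = None
    now = 0
    while now < total_seconds:
        if last_parse is None or now - last_parse >= min_interval:
            parses += 1
            last_parse = now
        now += loop_seconds
    return parses


if __name__ == "__main__":
    print(count_parses(3600, 1, 0))   # 3600 parses per hour with interval 0
    print(count_parses(3600, 1, 65))  # 56 parses per hour with interval 65
```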

Try to change the below config in airflow.cfg
# after how much time a new DAGs should be picked up from the filesystem
min_file_process_interval = 0
# How many seconds to wait between file-parsing loops to prevent the logs from being spammed.
min_file_parsing_loop_time = 1

The key point is how the DAG files are processed.
To reduce the scheduler's CPU usage from 80%+ to 30% on an 8-core server, I updated two config keys:
min_file_process_interval from 0 to 60.
max_threads from 1000 to 50.

For starters, you can use htop to monitor and debug your CPU usage.
I would suggest that you run the webserver and scheduler processes in the same Docker container, which would reduce the resources required compared with running two containers on an EC2 t2.medium. Airflow workers need resources for downloading data and reading it into memory, but the webserver and scheduler are pretty lightweight processes. Make sure that when you run the webserver you control the number of workers running on the instance using the CLI.
airflow webserver [-h] [-p PORT] [-w WORKERS]
[-k {sync,eventlet,gevent,tornado}]
[-t WORKER_TIMEOUT] [-hn HOSTNAME] [--pid [PID]] [-D]
[--stdout STDOUT] [--stderr STDERR]
[-A ACCESS_LOGFILE] [-E ERROR_LOGFILE] [-l LOG_FILE]
[-d]

I faced the same issue deploying Airflow on EKS. It was resolved by updating max_threads to 128 in the Airflow config.
max_threads: Scheduler will spawn multiple threads in parallel to schedule dags. This is controlled by max_threads with default value of 2. User should increase this value to a larger value (e.g numbers of cpus where scheduler runs - 1) in production.
From here
https://airflow.apache.org/docs/stable/faq.html

I tried to run Airflow on an AWS t2.micro instance (1 vCPU, 1 GB of memory, eligible for the free tier) and had the same issue: the worker consumed 100% of the CPU and all available memory.
The EC2 instance was totally stuck and unusable, and of course Airflow wasn't working.
So I created a 4 GB swap file using the method described here. With the swap, there were no more issues and Airflow was fully functional.
Of course, with only one vCPU you cannot expect incredible performance, but it runs.

Related

Parallel Docker Container Creation

I am using a Docker Setup that consists of 14 different containers. Every container gets a cpu_limit of 2 and a mem_limit of 2g.
To create and run these containers, I've written a Python script that uses the docker-py library. As of now, the containers are created sequentially, which takes approximately 2 minutes.
Now I'm thinking about parallelizing the process. So now instead of doing (it's pseudocode):
for container in containers_to_start:
    create_container(container)
I do
from multiprocessing.dummy import Pool as ThreadPool
pool = ThreadPool(4)
pool.map(create_container, containers_to_start)
And as a result the 14 containers are created 2x faster. BUT: the applications within the containers take significantly longer to boot. At the end of the day I don't really gain much; the time until every application is reachable is more or less the same, with or without multithreading.
But I don't really know why, because every container gets the same amount of CPU and memory resources, so I would expect the same boot time no matter how many containers are starting at the same time. Clearly this is not the case. Maybe I'm missing some knowledge here, any explanation would be greatly appreciated.
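For reference, a runnable version of the two variants from the question. The real docker-py call is replaced by a hypothetical create_container stub that only sleeps, and the container names and the 0.05 s delay are made up, so this measures the creation step only and says nothing about how long the applications inside take to boot.

```python
import time
from multiprocessing.dummy import Pool as ThreadPool


def create_container(name, delay=0.05):
    """Hypothetical stand-in for the real docker-py call; it only sleeps."""
    time.sleep(delay)
    return name


def start_sequential(names):
    """Create the containers one after another."""
    return [create_container(name) for name in names]


def start_parallel(names, workers=4):
    """Create the containers from a pool of `workers` threads."""
    with ThreadPool(workers) as pool:
        return pool.map(create_container, names)  # results keep input order


def timed(func, *args):
    """Return (result, elapsed_seconds) for one call."""
    start = time.monotonic()
    result = func(*args)
    return result, time.monotonic() - start


if __name__ == "__main__":
    containers_to_start = ["service-%02d" % i for i in range(14)]
    _, sequential_time = timed(start_sequential, containers_to_start)
    _, parallel_time = timed(start_parallel, containers_to_start)
    print("sequential: %.2fs, 4 threads: %.2fs" % (sequential_time, parallel_time))
```

Threads help here because the stub just waits, which is probably also true for the create calls themselves. Whatever runs inside the containers afterwards still has to share the same host.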
System Specs
CPU: Intel i7 @ 2.90 GHz
32GB RAM
I am using Windows 10 with Docker installed in WSL2 backend.

Containers: high cpu usage in %soft (soft IRQ) for network-intensive workloads

I'm trying to debug some performance issues on a RHEL8.3 server.
The server is actually a Kubernetes worker node and hosts several Redis containers (pods).
These containers are doing a lot of network I/O (iptraf-ng reports about 500 kPPS and 1.5 Gbps).
The server is a high-end Dell server with 104 CPUs and 10 Gbps NICs.
The issue I'm trying to debug is related to soft IRQs. In short: despite my attempts to set IRQ affinity of the NIC on a specific range of dedicated CPUs, the utility "mpstat" is still reporting a lot of CPU spent in "soft%" on all the CPUs where the "redis-server" process is running (even if redis-server has been moved using taskset to a non-overlapping range of dedicated CPU cores).
For more details consider the attached screenshot redis_server_and_mpstat:
the "redis-server" with PID 3592506 can run only on CPU 80 (taskset -pc 3592506 returns 80 only)
as can be seen from the "mpstat" output, it's running close to 100%, with 25-28% of the time spent in "%soft" time
In the attempt to address this problem, I've been using the Mellanox IRQ affinity script (https://github.com/Mellanox/mlnx-tools/blob/master/ofed_scripts/set_irq_affinity.sh) to "move" all IRQs related to the NICs on a separate set of CPUs (namely CPUs 1,3,5,7,9,11,13,15,17 that belong to NUMA1) for both NICs (eno1np0, eno2np1) that compose the "bond0" bonded interface used by the server, see the screenshot set_irq_affinity. Moreover the "irqbalance" daemon has been stopped and disabled.
The result is that mpstat is now reporting consistent CPU usage in "%soft" on CPUs 1,3,5,7,9,11,13,15,17, but at the same time the redis-server CPU still shows 25-28% in the "%soft" column (i.e. nothing has changed for redis-server).
This pattern is repeated for all instances of "redis-server" running on that server (there's more than 1), while other CPUs having no redis-server scheduled, are 100% idle.
Finally in a different environment based on RHEL7.9 (kernel 3.10.0) and a non-containerized deployment of Redis, I see that, when running the "set_irq_affinity.sh" script to move IRQs away from Redis CPUs, Redis %soft column goes down to zero.
Can you help me understand why, when running Redis in a Kubernetes container (with kernel 4.18.0), the redis-server process continues to spend a consistent amount of time in %soft handling, despite the NIC IRQs having affinity to different CPUs?
Is it possible that the time the redis-server process spends in "soft IRQ" handling is due to the veth virtual ethernet device created by the containerization technology (in this case the Kubernetes CNI is Flannel, using all default settings)?
Thanks
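One way to check which CPUs actually run the network softirqs is to diff two snapshots of /proc/softirqs, which has a header row of CPU names followed by one row of per-CPU counters per softirq type. A small Python sketch of that follows. The helper names and the sample numbers are made up for illustration; on the real server you would read /proc/softirqs twice, a few seconds apart.

```python
def parse_softirqs(text):
    """Parse /proc/softirqs content into {softirq_name: [count_per_cpu, ...]}."""
    lines = [line for line in text.splitlines() if line.strip()]
    counters = {}
    for line in lines[1:]:  # the first line is the CPU0 CPU1 ... header
        name, _, rest = line.partition(":")
        counters[name.strip()] = [int(field) for field in rest.split()]
    return counters


def busiest_cpus(before, after, name="NET_RX", top=3):
    """Return (cpu_index, delta) pairs with the largest growth for one softirq."""
    deltas = [b - a for a, b in zip(before[name], after[name])]
    ranked = sorted(enumerate(deltas), key=lambda pair: pair[1], reverse=True)
    return ranked[:top]


# Illustrative snapshots only, not measurements from the server in question.
SAMPLE_BEFORE = """\
                    CPU0       CPU1       CPU2       CPU3
      NET_TX:         10         20          5          0
      NET_RX:       1000       5000        200        100
"""
SAMPLE_AFTER = """\
                    CPU0       CPU1       CPU2       CPU3
      NET_TX:         10         25          5          0
      NET_RX:       1010       9000        200       3100
"""

if __name__ == "__main__":
    before = parse_softirqs(SAMPLE_BEFORE)
    after = parse_softirqs(SAMPLE_AFTER)
    print(busiest_cpus(before, after, "NET_RX", top=2))  # [(1, 4000), (3, 3000)]
```

If NET_RX keeps growing on the CPU that redis-server is pinned to, the softirq work is being done there regardless of where the hardware IRQs land.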

Kubernetes OOM pod killed because kernel memory grows too much

I am working on a Java service that basically creates files in a network file system to store data. It runs in a k8s cluster on Ubuntu 18.04 LTS.
When we began to limit the memory in Kubernetes (limits: memory: 3Gi), the pods began to be OOMKilled by Kubernetes.
At first we thought it was a memory leak in the Java process, but after analyzing it more deeply we noticed that the problem is the kernel memory.
We validated that by looking at the file /sys/fs/cgroup/memory/memory.kmem.usage_in_bytes
We isolated the case to only creating files (without Java) with the dd command, like this:
for i in {1..50000}; do dd if=/dev/urandom bs=4096 count=1 of=file$i; done
And with the dd command we saw the same thing happen (the kernel memory grew until OOM).
After k8s restarted the pod, describing the pod showed:
Last State:Terminated
Reason: OOMKilled
Exit Code: 143
Creating files causes the kernel memory to grow, and deleting those files causes it to decrease. But our service stores data, so it creates a lot of files continuously, until the pod is OOMKilled and restarted.
We tested limiting the kernel memory using a stand alone docker with the --kernel-memory parameter and it worked as expected. The kernel memory grew to the limit and did not rise anymore. But we did not find any way to do that in a kubernetes cluster.
Is there a way to limit the kernel memory in a K8S environment ?
Why does the creation of files cause the kernel memory to grow, and why is it not released?
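For anyone who wants to reproduce the check, here is a small Python sketch that reads the kernel memory counter mentioned above next to the total usage counter of the same cgroup. It assumes a cgroup v1 layout (memory.usage_in_bytes sitting next to memory.kmem.usage_in_bytes) and that kernel memory is also charged to the total counter. The function names are made up.

```python
from pathlib import Path


def read_cgroup_bytes(cgroup_dir, filename):
    """Read a single integer counter from a cgroup v1 memory file."""
    return int((Path(cgroup_dir) / filename).read_text().strip())


def kernel_memory_share(cgroup_dir="/sys/fs/cgroup/memory"):
    """Return (kmem_bytes, total_bytes, kmem / total) for one cgroup directory."""
    kmem = read_cgroup_bytes(cgroup_dir, "memory.kmem.usage_in_bytes")
    total = read_cgroup_bytes(cgroup_dir, "memory.usage_in_bytes")
    share = kmem / total if total else 0.0
    return kmem, total, share


if __name__ == "__main__":
    # Only meaningful inside a container or host that uses cgroup v1.
    if Path("/sys/fs/cgroup/memory/memory.kmem.usage_in_bytes").exists():
        kmem, total, share = kernel_memory_share()
        print("kernel: %d B, total: %d B, share: %.0f%%" % (kmem, total, share * 100))
    else:
        print("cgroup v1 memory files not found")
```

Running this in a loop while the dd command creates files should show the kernel counter climbing toward the limit.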
Thanks for all this info, it was very useful!
On my app, I solved this by creating a new side container that runs a cron job, every 5 minutes with the following command:
echo 3 > /proc/sys/vm/drop_caches
(note that you need the side container to run in privileged mode)
It works nicely and has the advantage of being predictable: every 5 minutes, your memory cache will be cleared.

Slow install / upgrade through Helm (for Kubernetes)

Our application consists of circa 20 modules. Each module contains a (Helm) chart with several deployments, services and jobs. Some of those jobs are defined as Helm pre-install and pre-upgrade hooks. Altogether there are probably about 120 yaml files, which eventually result in about 50 running pods.
During development we are running Docker for Windows version 2.0.0.0-beta-1-win75 with Docker 18.09.0-ce-beta1 and Kubernetes 1.10.3. To simplify management of our Kubernetes yaml files we use Helm 2.11.0. Docker for Windows is configured to use 2 CPU cores (of 4) and 8GB RAM (of 24GB).
When creating the application environment for the first time, it takes more than 20 minutes to become available. This seems far too slow; we are probably making an important mistake somewhere. We have tried to improve the (re)start time, but to no avail. Any help or insights to improve the situation would be greatly appreciated.
A simplified version of our startup script:
#!/bin/bash
# Start some infrastructure
helm upgrade --force --install modules/infrastructure/chart
# Start ~20 modules in parallel
helm upgrade --force --install modules/module01/chart &
[...]
helm upgrade --force --install modules/module20/chart &
await_modules()
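For readers who want to try the same pattern outside bash, here is a rough Python equivalent of "start everything in the background, then wait". helm_commands and run_parallel are hypothetical helpers. The helm arguments are copied from the simplified script above (so anything that script leaves out is left out here too), and the demo at the bottom runs harmless Python one-liners instead of helm.

```python
import subprocess
import sys


def helm_commands(chart_dirs):
    """Build one argument list per chart, mirroring the simplified script."""
    return [["helm", "upgrade", "--force", "--install", chart] for chart in chart_dirs]


def run_parallel(commands):
    """Start every command at once, then wait for all of them.

    Returns the exit codes in the same order as `commands`.
    """
    processes = [subprocess.Popen(command) for command in commands]
    return [process.wait() for process in processes]


if __name__ == "__main__":
    # Stand-in commands so the sketch runs without helm or a cluster.
    demo = [
        [sys.executable, "-c", "import sys; sys.exit(0)"],
        [sys.executable, "-c", "import sys; sys.exit(3)"],
    ]
    print(run_parallel(demo))
    # With helm available this would be, for example:
    # run_parallel(helm_commands(["modules/module01/chart", "modules/module20/chart"]))
```

Collecting the exit codes makes it easier to see which module failed than a bare wait in the shell.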
Executing the same startup script again later to 'restart' the application still takes about 5 minutes. As far as I know, unchanged objects are not modified at all by Kubernetes. Only the circa 40 hooks are run by Helm.
Running a single hook manually with docker run is fast (~3 seconds). Running that same hook through Helm and Kubernetes regularly takes 15 seconds or more.
Some things we have discovered and tried are listed below.
Linux staging environment
Our staging environment consists of Ubuntu with native Docker. Kubernetes is installed through minikube with --vm-driver none.
Contrary to our local development environment, the staging environment retrieves the application code through a (deprecated) gitRepo volume for almost every deployment and job. Understandably, this only seems to worsen the problem. Starting the environment for the first time takes over 25 minutes; restarting it takes about 20 minutes.
We tried replacing the gitRepo volume with a sidecar container that retrieves the application code as a TAR. Although we have not modified the whole application, initial tests indicate this is not noticeably faster than the gitRepo volume.
This situation can probably be improved with an alternative type of volume that enables sharing of code between deployments and jobs. We would rather not introduce more complexity, though, so we have not explored this avenue any further.
Docker run time
Executing a single empty alpine container through docker run alpine echo "test" takes roughly 2 seconds. This seems to be overhead of the setup on Windows. That same command takes less than 0.5 seconds on our Linux staging environment.
Docker volume sharing
Most of the containers - including the hooks - share code with the host through a hostPath. The command docker run -v <host path>:<container path> alpine echo "test" takes 3 seconds to run. Using volumes seems to increase the runtime by approximately 1 second.
Parallel or sequential
Sequential execution of the commands in the startup script does not improve startup time. Neither does it drastically worsen.
IO bound?
Windows taskmanager indicates that IO is at 100% when executing the startup script. Our hooks and application code are not IO intensive at all. So the IO load seems to originate from Docker, Kubernetes or Helm. We have tried to find the bottleneck, but were unable to pinpoint the cause.
Reducing IO through ramdisk
To test the premise of being IO bound further, we exchanged /var/lib/docker with a ramdisk in our Linux staging environment. Starting the application with this configuration was not significantly faster.
To compare Kubernetes with Docker, you need to consider that Kubernetes runs more or less the same Docker command as a final step. Before that happens, many other things take place:
authentication and authorization, creating objects in etcd, locating the correct nodes for the pods, scheduling them, provisioning storage, and more.
Helm itself also adds overhead to the process, depending on the size of the chart.
I recommend reading One year using Kubernetes in production: Lessons learned. The author explains what they achieved by switching to Kubernetes, as well as the differences in overhead:
Cost calculation
Looking at costs, there are two sides to the story. To run Kubernetes, an etcd cluster is required, as well as a master node. While these are not necessarily expensive components to run, this overhead can be relatively expensive when it comes to very small deployments. For these types of deployments, it’s probably best to use a hosted solution such as Google's Container Service.
For larger deployments, it’s easy to save a lot on server costs. The overhead of running etcd and a master node aren’t significant in these deployments. Kubernetes makes it very easy to run many containers on the same hosts, making maximum use of the available resources. This reduces the number of required servers, which directly saves you money. When running Kubernetes sounds great, but the ops side of running such a cluster seems less attractive, there are a number of hosted services to look at, including Cloud RTI, which is what my team is working on.

Docker Swarm CPU overload on deploy with Spring Boot containers

I have created a number of Spring Boot applications, which all work like magic in isolation or when started up one after the other manually.
My challenge is that I want to deploy a stack with all the services in a Docker Swarm.
Initially I didn't understand what was going on, as it seemed like all my containers were hanging.
Turns out running a single Spring Boot application spikes up my CPU utilization to max it out for a good couple of seconds (20s+ to start up).
Now the issue is that Docker Swarm is launching 10 of these containers simultaneously, my load average goes above 80 and the system grinds to a halt. The container HEALTHCHECKs start timing out and eventually Docker restarts the containers. This is an endless cycle that may or may not stabilize, and if it does stabilize it takes a minimum of 30 minutes. So much for micro services vs big fat Java EE applications :(
Is there any way to convince Docker to rollout the containers one by one? I'm sure this will help a lot.
There is a rolling update parameter - https://docs.docker.com/engine/swarm/swarm-tutorial/rolling-update/ - but it does not seem applicable to startup deployment.
Your help will be greatly appreciated.
I've also tried systemd (which isn't ideal for distributed micro services). It worked slightly better than Docker, but has the same issue when deploying all the applications at once.
Initially I wanted to try Kubernetes, but I've got enough on my plate and if I can get away with Docker Swarm, that would be awesome.
Thanks!
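To make the "one by one" idea concrete, here is a generic Python sketch of a staggered start. It is not a Swarm feature: start and is_healthy are hypothetical callables you would have to supply yourself (for example thin wrappers around whatever you use to launch a service and to query its health), and the timeout and poll values are arbitrary.

```python
import time


def start_one_by_one(services, start, is_healthy, timeout=120.0, poll=1.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Start services sequentially, waiting for each to become healthy.

    `start(name)` launches one service and `is_healthy(name)` reports whether
    it is up; both are supplied by the caller. Raises TimeoutError if a
    service is still unhealthy `timeout` seconds after it was started.
    """
    started = []
    for name in services:
        start(name)
        deadline = clock() + timeout
        while not is_healthy(name):
            if clock() >= deadline:
                raise TimeoutError("%s not healthy after %.0fs" % (name, timeout))
            sleep(poll)
        started.append(name)  # only now move on to the next service
    return started


if __name__ == "__main__":
    # Fake services that report healthy on the third check.
    checks = {}

    def fake_start(name):
        checks[name] = 0

    def fake_is_healthy(name):
        checks[name] += 1
        return checks[name] >= 3

    print(start_one_by_one(["svc-a", "svc-b"], fake_start, fake_is_healthy, poll=0.01))
```

This way only one JVM at a time goes through its CPU-heavy startup, at the cost of a longer total rollout.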

Resources