Airflow Gunicorn timeout

I have an external service that calls the Airflow trigger DAG API with quite a lot of requests at once, at a fixed time.
I am hitting Gunicorn timeouts despite tweaking the airflow.cfg settings: master_timeout, worker_timeout, and refresh_interval.
I am also using the gevent worker type for async handling.
Is there something I am missing? Here are the relevant settings:
# Number of seconds the webserver waits before killing gunicorn master that doesn't respond
web_server_master_timeout = 600
# Number of seconds the gunicorn webserver waits before timing out on a worker
web_server_worker_timeout = 480
# Number of workers to refresh at a time. When set to 0, worker refresh is
# disabled. When nonzero, airflow periodically refreshes webserver workers by
# bringing up new ones and killing old ones.
worker_refresh_batch_size = 1
# Number of seconds to wait before refreshing a batch of workers.
worker_refresh_interval = 30
# Number of workers to run the Gunicorn web server
workers = 5
# The worker class gunicorn should use. Choices include
# sync (default), eventlet, gevent
worker_class = gevent
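Since the trigger calls all arrive at once at a fixed time, one client-side mitigation is to pace them. Below is a minimal sketch (not from the original post), assuming the Airflow 2.x stable REST API endpoint /api/v1/dags/{dag_id}/dagRuns and basic auth; the host, DAG ids, credentials, and delay are placeholders.
import time
import requests

AIRFLOW_URL = "http://airflow-webserver:8080"   # placeholder host
DAG_IDS = ["dag_a", "dag_b", "dag_c"]           # placeholder DAG ids

session = requests.Session()
session.auth = ("user", "password")             # placeholder credentials

for dag_id in DAG_IDS:
    # Trigger one DAG run; the conf payload is empty here.
    resp = session.post(
        f"{AIRFLOW_URL}/api/v1/dags/{dag_id}/dagRuns",
        json={"conf": {}},
        timeout=30,
    )
    resp.raise_for_status()
    # Spread the calls out instead of firing them all at once,
    # so the webserver workers are not flooded at the fixed trigger time.
    time.sleep(0.5)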

Related

How to increase RPS in distributed locust load test

I cannot get past 1200 RPS, no matter whether I use 4 or 5 workers.
I tried starting Locust in three variations: one, four, and five worker processes (docker-compose up --scale worker_locust=num_of_workers). I use 3000 clients with a hatch rate of 100. The service I am loading is a dummy that always returns "yo" with HTTP 200, i.e., it does nothing but return a constant string. With one worker I get up to 600 RPS (and start to see some HTTP errors); with 4 workers I can get up to ~1200 RPS without a single HTTP error.
With 5 workers I get the same ~1200 RPS, but with lower CPU usage.
I suppose that if the CPU usage went down in the 5-worker case (with respect to the 4-worker case), then it is not the CPU that is bounding the RPS.
I am running this on a 6-core MacBook.
The locustfile.py I use posts essentially empty requests (just a few parameters):
from locust import HttpUser, task, between, constant

class QuickstartUser(HttpUser):
    wait_time = constant(1)  # seconds

    @task
    def add_empty_model(self):
        self.client.post(
            "/models",
            json={
                "grouping": {
                    "grouping": "a/b"
                },
                "container_image": "myrepo.com",
                "container_tag": "0.3.0",
                "prediction_type": "prediction_type",
                "model_state_base64": "bXkgc3RhdGU=",
                "model_config": {},
                "meta": {}
            }
        )
My docker-compose.yml:
services:
  myservice:
    build:
      context: ../
    ports:
      - "8000:8000"
  master_locust:
    image: locustio/locust
    ports:
      - "8089:8089"
    volumes:
      - ./:/mnt/locust
    command: -f /mnt/locust/locustfile.py --master
  worker_locust:
    image: locustio/locust
    volumes:
      - ./:/mnt/locust
    command: -f /mnt/locust/locustfile.py --worker --master-host master_locust
Can someone suggest the direction of getting towards the 2000 RPS?
You should check out the FAQ.
https://github.com/locustio/locust/wiki/FAQ#increase-my-request-raterps
It's probably your server not being able to handle more requests, at least from your one machine. There are other things you can do to make sure that's the case: try FastHttpUser, run on multiple machines, or just up the number of users. But if you can, check how the server is handling the load and see what you can optimize there.
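As an illustration of the FastHttpUser suggestion (not part of the original answer), here is a minimal sketch of the same locustfile using the faster HTTP client; the request body is the one from the question.
from locust import task, constant
from locust.contrib.fasthttp import FastHttpUser

class QuickstartFastUser(FastHttpUser):
    wait_time = constant(1)  # seconds

    @task
    def add_empty_model(self):
        # Same POST as the HttpUser version, but issued through the faster
        # HTTP client, which typically allows more RPS per worker CPU.
        self.client.post(
            "/models",
            json={
                "grouping": {"grouping": "a/b"},
                "container_image": "myrepo.com",
                "container_tag": "0.3.0",
                "prediction_type": "prediction_type",
                "model_state_base64": "bXkgc3RhdGU=",
                "model_config": {},
                "meta": {},
            },
        )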
You will need more workers to generate more RPS. As far as I know, a single worker has a limited local port range when creating TCP connections to the destination.
You can check this value on your Linux worker:
net.ipv4.ip_local_port_range
Try tweaking that value on each of your Linux workers, or simply create hundreds of new workers on another, more powerful machine (your 6-core MacBook is too small).
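For example (not from the original answer), the range can be inspected and widened with sysctl; the exact numbers below are only an illustration:
sysctl net.ipv4.ip_local_port_range
# typically prints: net.ipv4.ip_local_port_range = 32768 60999
sudo sysctl -w net.ipv4.ip_local_port_range="15000 64000"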
To create many workers you could try running Locust in Kubernetes with horizontal pod autoscaling for the worker deployment.
Here is a Helm chart to start playing around with a Locust k8s deployment:
https://github.com/deliveryhero/helm-charts/tree/master/stable/locust
You may want to check these values for it:
worker.hpa.enabled
worker.hpa.maxReplicas
worker.hpa.minReplicas
worker.hpa.targetCPUUtilizationPercentage
Simply set the maxReplicas value to get more workers when the load test starts, or scale the worker pods manually to your desired number with a kubectl command.
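For example, a hedged values override for that chart using the settings listed above; the replica counts and CPU target here are placeholder numbers, not recommendations:
worker:
  hpa:
    enabled: true
    minReplicas: 5
    maxReplicas: 100
    targetCPUUtilizationPercentage: 40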
I have managed to generate at least 8K RPS (a stable value for my app; it cannot serve more) with 1000 worker pods, using Locust parameters of 200K users and a spawn rate of 2000 per second.
You may have to scale out your server when you reach higher throughput, but with 1000 worker pods I think you can easily reach 15K-20K RPS.

How to check the number of workers running inside a Docker container of a Kubernetes pod?

I am using a Flask + uWSGI setup for a production service and have set the master flag in the uWSGI config to false. When running the service, I pass NUM_WORKERS of uWSGI as 2 to the Docker container. Based on this doc on uWSGI config, the master flag is necessary to re-spawn and pre-fork workers, so I wonder whether my service containers within the pods are actually using 2 workers.
I want to exec into a pod and see the number of uWSGI workers that are actually being used.
For reference, my uWSGI config:
[uwsgi]
socket = 0.0.0.0:9999
protocol = http
module = my_app.server.wsgi
callable = app
master = false
thunder-lock = true
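As a side note (not from the original post), since the quoted doc says the master process is what pre-forks and re-spawns workers, a hedged variant of the same config with the master enabled might look like this; the worker count mirrors the NUM_WORKERS=2 mentioned above:
[uwsgi]
socket = 0.0.0.0:9999
protocol = http
module = my_app.server.wsgi
callable = app
# enable the master process so workers are pre-forked and re-spawned
master = true
# assumption: mirrors NUM_WORKERS=2 passed to the container
processes = 2
thunder-lock = true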
Add a Prometheus exporter to your app and curl the /metrics endpoint manually:
https://github.com/timonwong/uwsgi_exporter
It has a worker-count metric: https://github.com/timonwong/uwsgi_exporter/blob/9f88775cc1a600e4038bb1eae5edfdf38f023dc4/exporter/exporter.go#L50
Going further, you can scrape this with Prometheus to monitor and alert automatically.
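For a quick manual check after exec-ing into the pod (not part of the original answer), a small sketch that pulls the exporter's metrics page and prints the worker-related series; the port 9117 is an assumption about how uwsgi_exporter is exposed in your pod:
import requests

# Assumed exporter address inside the pod; adjust to your deployment.
EXPORTER_URL = "http://localhost:9117/metrics"

metrics = requests.get(EXPORTER_URL, timeout=5).text
for line in metrics.splitlines():
    # Print only worker-related series, skipping HELP/TYPE comment lines.
    if "worker" in line and not line.startswith("#"):
        print(line)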

Flink TaskManagers in Docker Swarm don't recover

I'm running Flink v1.10 with 1 JobManager and 3 TaskManagers in Docker Swarm, without ZooKeeper. I have a job running that takes 12 slots, and I have 3 TMs with 20 slots each (60 total).
After some tests, everything went well except one.
The failing test is: if I cancel the job manually, a side-car retries the job, but the TaskManagers shown in the web console don't recover and their number keeps decreasing.
A more practical example: I have a job running, consuming 12 of 60 slots.
The web console shows me 48 free slots and 3 TMs.
I cancel the job manually, the side-car re-triggers the job, and the web console shows me 36 free slots and 2 TMs.
The job enters a failed state, and the slots keep decreasing until the console shows 0 free slots and 1 TM.
The workaround is to scale all 3 TMs down and up again, and everything goes back to normal.
Everything works fine with this configuration: the JobManager recovers if I remove it, and scaling the TMs up or down works, but if I cancel the job the TMs seem to lose their connection to the JM.
Any suggestions as to what I'm doing wrong?
Here is my flink-conf.yaml.
env.java.home: /usr/local/openjdk-8
env.log.dir: /opt/flink/
env.log.file: /var/log/flink.log
jobmanager.rpc.address: jobmanager1
jobmanager.rpc.port: 6123
jobmanager.heap.size: 2048m
#taskmanager.memory.process.size: 2048m
#env.java.opts.taskmanager: 2048m
taskmanager.memory.flink.size: 2048m
taskmanager.numberOfTaskSlots: 20
parallelism.default: 2
#==============================================================================
# High Availability
#==============================================================================
# The high-availability mode. Possible options are 'NONE' or 'zookeeper'.
#
high-availability: NONE
#high-availability.storageDir: file:///tmp/storageDir/flink_tmp/
#high-availability.zookeeper.quorum: zookeeper1:2181,zookeeper2:2181,zookeeper3:2181
#high-availability.zookeeper.quorum:
# ACL options are based on https://zookeeper.apache.org/doc/r3.1.2/zookeeperProgrammers.html#sc_BuiltinACLSchemes
# high-availability.zookeeper.client.acl: open
#==============================================================================
# Fault tolerance and checkpointing
#==============================================================================
# state.checkpoints.dir: hdfs://namenode-host:port/flink-checkpoints
# state.savepoints.dir: hdfs://namenode-host:port/flink-checkpoints
# state.backend.incremental: false
jobmanager.execution.failover-strategy: region
#==============================================================================
# Rest & web frontend
#==============================================================================
rest.port: 8080
rest.address: jobmanager1
# rest.bind-port: 8081
rest.bind-address: 0.0.0.0
#web.submit.enable: false
#==============================================================================
# Advanced
#==============================================================================
# io.tmp.dirs: /tmp
# classloader.resolve-order: child-first
# taskmanager.memory.network.fraction: 0.1
# taskmanager.memory.network.min: 64mb
# taskmanager.memory.network.max: 1gb
#==============================================================================
# Flink Cluster Security Configuration
#==============================================================================
# security.kerberos.login.use-ticket-cache: false
# security.kerberos.login.keytab: /mobi.me/flink/conf/smart3.keytab
# security.kerberos.login.principal: smart_user
# security.kerberos.login.contexts: Client,KafkaClient
#==============================================================================
# ZK Security Configuration
#==============================================================================
# zookeeper.sasl.login-context-name: Client
#==============================================================================
# HistoryServer
#==============================================================================
#jobmanager.archive.fs.dir: hdfs:///completed-jobs/
#historyserver.web.address: 0.0.0.0
#historyserver.web.port: 8082
#historyserver.archive.fs.dir: hdfs:///completed-jobs/
#historyserver.archive.fs.refresh-interval: 10000
blob.server.port: 6124
query.server.port: 6125
taskmanager.rpc.port: 6122
high-availability.jobmanager.port: 50010
zookeeper.sasl.disable: true
#recovery.mode: zookeeper
#recovery.zookeeper.quorum: zookeeper1:2181,zookeeper2:2181,zookeeper3:2181
#recovery.zookeeper.path.root: /
#recovery.zookeeper.path.namespace: /cluster_one
The solution was to increase the JVM metaspace size in flink-conf.yaml.
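In Flink 1.10's TaskManager memory model that is governed by taskmanager.memory.jvm-metaspace.size; a hedged example (the 512m figure is an illustration, not the value from the original setup):
# flink-conf.yaml
taskmanager.memory.jvm-metaspace.size: 512m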

uWSGI worker is free, but request handling has a significant delay

I would like to run a Django app under uWSGI behind nginx. I've launched 2 uWSGI workers, but I noticed an unfortunate behavior: when one worker is busy, the other worker starts handling a request only after 10-15 seconds of waiting.
Configuration is pretty simple.
uWSGI:
uwsgi --socket 127.0.0.1:3031 --wsgi-file wsgi.py --master --processes 2 --threads 1
nginx:
server {
    listen 8000;
    server_name example.org;

    location / {
        include uwsgi_params;
        uwsgi_pass 127.0.0.1:3031;
    }
}
and /etc/nginx/nginx.conf has the default values.
The test Django view:
import time

from django.http import HttpResponse

def test(request):
    print('Start!!!')
    time.sleep(9999)
    print('End')
    return HttpResponse()
And wsgi.py has the default Django contents.
So when I launch all this together and send 2 GET requests, I see only one "Start!!!" in the console, and the second "Start!!!" appears only after 10-15 seconds.
I see the same strange behaviour without nginx (with uwsgi --http), with multiple threads per worker, without the --master uwsgi option, without the Django app, and with a few uWSGI instances behind an nginx load balancer.
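One thing worth ruling out (not part of the original post) is the client serializing the two requests itself, since some browsers queue identical requests to the same URL. A small sketch that fires the two GETs truly concurrently; the URL path is a placeholder for wherever the view above is routed:
import threading
import requests

URL = "http://127.0.0.1:8000/test/"  # placeholder path for the view above

def hit(i):
    # Each thread opens its own connection, so the two requests are truly concurrent.
    # The view sleeps for a long time, so we only care about when "Start!!!" appears
    # in the server console; give up on the response after a short timeout.
    try:
        requests.get(URL, timeout=30)
    except requests.exceptions.RequestException:
        pass
    print(f"request {i} finished or timed out")

threads = [threading.Thread(target=hit, args=(i,)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()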
Additional info:
uwsgi version: 2.0.12
nginx version: 1.4.6
host OS: Ubuntu 14.04
Python version: 3.4
Django: 1.9
CPU: 4 cores

Why is monit not restarting the server

I have configured monit on an Ubuntu machine with the following configuration:
check process apache with pidfile /var/run/apache2/apache2.pid
  start program = "/etc/init.d/apache2 start" with timeout 60 seconds
  stop program = "/etc/init.d/apache2 stop"
  if cpu > 80% for 5 cycles then restart
  if children > 250 then restart
but it is not working. The server has gone offline on occasions and monit did not seem to do anything.
Any ideas why it is not restarting?
I don't know what you mean by "the server has become offline on occasions": it can mean that the node where Apache was running was shut down, or it can mean that http://localhost:80/ was not accessible.
If the latter is the case, then changing the configuration to
check process apache with pidfile /var/run/apache2/apache2.pid
  start program = "/etc/init.d/apache2 start" with timeout 60 seconds
  stop program = "/etc/init.d/apache2 stop"
  if failed host 127.0.0.1 port 80 then restart
  if cpu > 80% for 5 cycles then restart
  if children > 250 then restart
might work, since your current configuration will not restart Apache if its process is running but, for whatever reason, it is not reachable at http://localhost:80/.
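As a further refinement (not part of the original answer), monit's connection test can also speak HTTP instead of only checking that the port accepts connections; a hedged variant of the same rule:
if failed host 127.0.0.1 port 80 protocol http then restart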
