Why is gunicorn calling sys.exit(1)? - docker

Gunicorn is sometimes crashing my server, actually exiting the Python interpreter with sys.exit(1)! Why is this? Note that the failure is not always at the same point. In the two cases shown below, there is a different last codeline before gunicorn's exit. This code running here is openpyxl , which should not be causing interpreter shutdown!
Is the server running out of memory? Some other cause?
(This is Flask on Gunicorn on Docker in Google Container Engine.)
Case 1
File "/virtualenv_for_docker/lib/python3.7/site-packages/openpyxl/descriptors/base.py", line 166, in __set__
super(Bool, self).__set__(instance, value)
File "/virtualenv_for_docker/lib/python3.7/site-packages/gunicorn/workers/base.py", line 196, in handle_abort
sys.exit(1)
SystemExit: 1
Case 2
File "/virtualenv_for_docker/lib/python3.7/site-packages/openpyxl/descriptors/serialisable.py", line 164, in __eq__
def __eq__(self, other):
File "/virtualenv_for_docker/lib/python3.7/site-packages/gunicorn/workers/base.py", line 196, in handle_abort
sys.exit(1)
SystemExit: 1

As wrote #maxm the server is catching a SIGABRT, that call generally happens on timeout.
You should increase the timeout value or reduce the request processing time. Also you can setup the signal handler to try to log what happened in the worker after a timeout.

Related

Airflow DockerOperator unable to mount tmp directory correctly

I am trying to run a simple python script within a docker run command scheduled with Airflow.
I have followed the instructions here Airflow init.
My .env file:
AIRFLOW_UID=1000
AIRFLOW_GID=0
And the docker-compose.yaml is based on the default one docker-compose.yaml. I had to add - /var/run/docker.sock:/var/run/docker.sock as an additional volume to run docker inside of docker.
My dag is configured as followed:
""" this is an example dag """
from datetime import timedelta
from airflow import DAG
from airflow.operators.docker_operator import DockerOperator
from airflow.utils.dates import days_ago
from docker.types import Mount
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'email': ['info#foo.com'],
'email_on_failure': True,
'email_on_retry': False,
'retries': 10,
'retry_delay': timedelta(minutes=5),
}
with DAG(
'msg_europe_etl',
default_args=default_args,
description='Process MSG_EUROPE ETL',
schedule_interval=timedelta(minutes=15),
start_date=days_ago(0),
tags=['satellite_data'],
) as dag:
download_and_store = DockerOperator(
task_id='download_and_store',
image='satellite_image:latest',
auto_remove=True,
api_version='1.41',
mounts=[Mount(source='/home/archive_1/archive/satellite_data',
target='/app/data'),
Mount(source='/home/dlassahn/projects/forecast-system/meteoIntelligence-satellite',
target='/app')],
command="python3 src/scripts.py download_satellite_images "
"{{ (execution_date - macros.timedelta(hours=4)).strftime('%Y-%m-%d %H:%M') }} "
"'msg_europe' ",
)
download_and_store
The Airflow log:
[2021-08-03 17:23:58,691] {docker.py:231} INFO - Starting docker container from image satellite_image:latest
[2021-08-03 17:23:58,702] {taskinstance.py:1501} ERROR - Task failed with exception
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.6/site-packages/docker/api/client.py", line 268, in _raise_for_status
response.raise_for_status()
File "/home/airflow/.local/lib/python3.6/site-packages/requests/models.py", line 943, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: http+docker://localhost/v1.41/containers/create
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 1157, in _run_raw_task
self._prepare_and_execute_task_with_callbacks(context, task)
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 1331, in _prepare_and_execute_task_with_callbacks
result = self._execute_task(context, task_copy)
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 1361, in _execute_task
result = task_copy.execute(context=context)
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/providers/docker/operators/docker.py", line 319, in execute
return self._run_image()
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/providers/docker/operators/docker.py", line 258, in _run_image
tty=self.tty,
File "/home/airflow/.local/lib/python3.6/site-packages/docker/api/container.py", line 430, in create_container
return self.create_container_from_config(config, name)
File "/home/airflow/.local/lib/python3.6/site-packages/docker/api/container.py", line 441, in create_container_from_config
return self._result(res, True)
File "/home/airflow/.local/lib/python3.6/site-packages/docker/api/client.py", line 274, in _result
self._raise_for_status(response)
File "/home/airflow/.local/lib/python3.6/site-packages/docker/api/client.py", line 270, in _raise_for_status
raise create_api_error_from_http_exception(e)
File "/home/airflow/.local/lib/python3.6/site-packages/docker/errors.py", line 31, in create_api_error_from_http_exception
raise cls(e, response=response, explanation=explanation)
docker.errors.APIError: 400 Client Error for http+docker://localhost/v1.41/containers/create: Bad Request ("invalid mount config for type "bind": bind source path does not exist: /tmp/airflowtmp037k87u6")
Trying to set mount_tmp_dir=False yield to an Dag ImportError because of unknown Keyword Argument mount_tmp_dir. (this might be an issue for the Documentation)
Nevertheless I do not know how to configure the tmp directory correctly.
My Airflow Version: 2.1.2
There was a bug in Docker Provider 2.0.0 which prevented Docker Operator to run with Docker-In-Docker solution.
You need to upgrade to the latest Docker Provider 2.1.0
https://airflow.apache.org/docs/apache-airflow-providers-docker/stable/index.html#id1
You can do it by extending the image as described in https://airflow.apache.org/docs/docker-stack/build.html#extending-the-image with - for example - this docker file:
FROM apache/airflow
RUN pip install --no-cache-dir apache-airflow-providers-docker==2.1.0
The operator will work out-of-the-box in this case with "fallback" mode (and Warning message), but you can also disable the mount that causes the problem. More explanation from the https://airflow.apache.org/docs/apache-airflow-providers-docker/stable/_api/airflow/providers/docker/operators/docker/index.html
By default, a temporary directory is created on the host and mounted
into a container to allow storing files that together exceed the
default disk size of 10GB in a container. In this case The path to the
mounted directory can be accessed via the environment variable
AIRFLOW_TMP_DIR.
If the volume cannot be mounted, warning is printed and an attempt is
made to execute the docker command without the temporary folder
mounted. This is to make it works by default with remote docker engine
or when you run docker-in-docker solution and temporary directory is
not shared with the docker engine. Warning is printed in logs in this
case.
If you know you run DockerOperator with remote engine or via
docker-in-docker you should set mount_tmp_dir parameter to False. In
this case, you can still use mounts parameter to mount already
existing named volumes in your Docker Engine to achieve similar
capability where you can store files exceeding default disk size of
the container,
I had the same issue and all "recommended" ways of solving the issue here and setting up mount_dir params as descripted here just lead to other errors. The one solution that helped me was wrapping the invocated by docker code with the VPN (actually this hack was taken from another docker-powered DAG that used VPN and worked well).
So the final solution looks like:
#!/bin/bash
connect_to_vpn.sh &
sleep 10
python3 my_func.py
sleep 10
stop_vpn.sh
wait -n
exit $?
To connect to VPN I used openconnect. The took can be installed with apt install and supports anyconnect protocol (it was my crucial requirement).

Why does my Python Dataflow job gets stuck at the Write phase?

I wrote a Python Dataflow job which managed to process 300 files, unfortunately, when I try to run it on 400 files it gets stuck in the Write phase forever.
The logs aren't really helpful, but I think that the issue comes from the writing logic of the code, initially, I only wanted 1 output file, so I wrote:
| 'Write' >> beam.io.WriteToText(
known_args.output,
file_name_suffix=".json",
num_shards=1,
shard_name_template=""
))
Then, I removed, num_shards=1 and shard_name_template="" and I was able to process more files but it'd still get stuck.
Extra Information
the files to process are small, less than a 1MB
also, when removing the num_shards and shard_name_template fields, I noticed that the data got output a temporary folder in the target path, but the job never finishes
I have the following DEADLINE_EXCEEDED exception and I tried solving it by increasing --num_workers to 6 and --disk_size_gb to 30 but it doesn't work.
Error message from worker: Traceback (most recent call last): File "/usr/local/lib/python3.7/site-packages/dataflow_worker/batchworker.py", line 638, in do_work work_executor.execute() File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", line 179, in execute op.start() File "dataflow_worker/shuffle_operations.py", line 63, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start File "dataflow_worker/shuffle_operations.py", line 64, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start File "dataflow_worker/shuffle_operations.py", line 79, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start File "dataflow_worker/shuffle_operations.py", line 80, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start File "dataflow_worker/shuffle_operations.py", line 82, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start File "/usr/local/lib/python3.7/site-packages/dataflow_worker/shuffle.py", line 441, in __iter__ for entry in entries_iterator: File "/usr/local/lib/python3.7/site-packages/dataflow_worker/shuffle.py", line 282, in __next__ return next(self.iterator) File "/usr/local/lib/python3.7/site-packages/dataflow_worker/shuffle.py", line 240, in __iter__ chunk, next_position = self.reader.Read(start_position, end_position) File "third_party/windmill/shuffle/python/shuffle_client.pyx", line 133, in shuffle_client.PyShuffleReader.Read OSError: Shuffle read failed: b'DEADLINE_EXCEEDED: (g)RPC timed out when extract-fields-three-mont-10090801-dlaj-harness-fj4v talking to extract-fields-three-mont-10090801-dlaj-harness-1f7r:12346. Server unresponsive (ping error: Deadline Exceeded, {"created":"#1602260204.931126454","description":"Deadline Exceeded","file":"third_party/grpc/src/core/ext/filters/deadline/deadline_filter.cc","file_line":69,"grpc_status":4}). Typically one can self manage this issue, please read: https://cloud.google.com/dataflow/docs/guides/common-errors#tsg-rpc-timeout'
Can you please recommend ways to troubleshoot this type of issues?
After trying to pump resources, I managed to get it working by enabling the Dataflow shuffle service fixed the situation. Please see resource
Just add --experiments=shuffle_mode=service to your PipelineOptions.

Celery with redis: instance state changed (master -> replica?)

Am using celery for scheduled tasks and redis server for data backup within docker containers. My jobs are running correctly sometimes. But I am get following error randomly and celery beat task can no longer progress.
[2020-09-16 21:01:07,863: CRITICAL/MainProcess] Unrecoverable error: ResponseError('UNBLOCKED force unblock from blocking operation, instance sta
te changed (master -> replica?)',)
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/celery/worker/worker.py", line 205, in start
self.blueprint.start(self)
File "/usr/local/lib/python3.6/site-packages/celery/bootsteps.py", line 119, in start
step.start(parent)
File "/usr/local/lib/python3.6/site-packages/celery/bootsteps.py", line 369, in start
return self.obj.start()
File "/usr/local/lib/python3.6/site-packages/celery/worker/consumer/consumer.py", line 318, in start
blueprint.start(self)
File "/usr/local/lib/python3.6/site-packages/celery/bootsteps.py", line 119, in start
step.start(parent)
File "/usr/local/lib/python3.6/site-packages/celery/worker/consumer/consumer.py", line 599, in start
c.loop(*c.loop_args())
File "/usr/local/lib/python3.6/site-packages/celery/worker/loops.py", line 83, in asynloop
next(loop)
File "/usr/local/lib/python3.6/site-packages/kombu/asynchronous/hub.py", line 364, in create_loop
cb(*cbargs)
File "/usr/local/lib/python3.6/site-packages/kombu/transport/redis.py", line 1088, in on_readable
self.cycle.on_readable(fileno)
File "/usr/local/lib/python3.6/site-packages/kombu/transport/redis.py", line 359, in on_readable
chan.handlers[type]()
File "/usr/local/lib/python3.6/site-packages/kombu/transport/redis.py", line 739, in _brpop_read
**options)
File "/usr/local/lib/python3.6/site-packages/redis/client.py", line 892, in parse_response
response = connection.read_response()
File "/usr/local/lib/python3.6/site-packages/redis/connection.py", line 752, in read_response
raise response
redis.exceptions.ResponseError: UNBLOCKED force unblock from blocking operation, instance state changed (master -> replica?)
Any help is will be appreciated. Let me know in case you need more details
As I stated above the issue is happening randomly and perturb our app in production. So I decided to spend time on a solution. I came across many propositions such as hardware issues (Memory or CPU). But this one definitively solve the issue. I was not using authentication on redis server Those interesting on setting redis password easily in docker can refer to this Docker Tip. After setting a password to redis the url looks like REDIS_URL=redis://user:myPass#localhost:6379
You can try this answer: https://stackoverflow.com/a/74141982/1635525
TLDR Adding restart: unless-stopped to your docker-compose helps to recover from celery crashes including the ones caused by redis downtime/maintenance.

What is the root cause of distributed.scheduler.KilledWorker exception?

I'm trying to run a Dask job on a YARN cluster. This jobs reads and writes to HDFS using the hdfs3 library.
When I run it on a cluster without a Kerberos security layer, it runs fine.
But, on a cluster with a Kerberos security layer, I had to implement the solution here to avoid Kerberos related errors. Running the same job, led to the following error:
File "/fsstreamdevl/f6_development/acoustics/acoustics_analysis_dask/acoustics_analytics/task_runner/task_runner.py", line 123, in run
dask.compute(tasks)
File "/anaconda_env/projects/f6acoustics/dev/dask_yarn_test/lib/python3.7/site-packages/dask/base.py", line 446, in compute
results = schedule(dsk, keys, **kwargs)
File "/anaconda_env/projects/f6acoustics/dev/dask_yarn_test/lib/python3.7/site-packages/distributed/client.py", line 2568, in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
File "/anaconda_env/projects/f6acoustics/dev/dask_yarn_test/lib/python3.7/site-packages/distributed/client.py", line 1822, in gather
asynchronous=asynchronous,
File "/anaconda_env/projects/f6acoustics/dev/dask_yarn_test/lib/python3.7/site-packages/distributed/client.py", line 753, in sync
return sync(self.loop, func, *args, **kwargs)
File "/anaconda_env/projects/f6acoustics/dev/dask_yarn_test/lib/python3.7/site-packages/distributed/utils.py", line 331, in sync
six.reraise(*error[0])
File "/anaconda_env/projects/f6acoustics/dev/dask_yarn_test/lib/python3.7/site-packages/six.py", line 693, in reraise
raise value
File "/anaconda_env/projects/f6acoustics/dev/dask_yarn_test/lib/python3.7/site-packages/distributed/utils.py", line 316, in f
result[0] = yield future
File "/anaconda_env/projects/f6acoustics/dev/dask_yarn_test/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
value = future.result()
File "/anaconda_env/projects/f6acoustics/dev/dask_yarn_test/lib/python3.7/site-packages/tornado/gen.py", line 742, in run
yielded = self.gen.throw(*exc_info) # type: ignore
File "/anaconda_env/projects/f6acoustics/dev/dask_yarn_test/lib/python3.7/site-packages/distributed/client.py", line 1653, in _gather
six.reraise(type(exception), exception, traceback)
File "/anaconda_env/projects/f6acoustics/dev/dask_yarn_test/lib/python3.7/site-packages/six.py", line 693, in reraise
raise value
distributed.scheduler.KilledWorker: ('__call__-6af7aa29-2a09-45f3-a5e2-207c06562672', <Worker 'tcp://10.194.211.132:11927', memory: 0, processing: 1>)
Strangely enough, running the same solution on the former cluster without a Kerberos security layer, I get the same error.
Looking at the YARN application logs, I see the following traceback, but cannot tell what it means.
distributed.nanny - INFO - Closing Nanny at 'tcp://10.194.211.133:17659'
Traceback (most recent call last):
File "/opt/hadoop/data/05/hadoop/yarn/local/usercache/hdfsf6/appcache/application_1560931326013_171773/container_e47_1560931326013_171773_01_000003/environment/lib/python3.7/multiprocessing/queues.py", line 242, in _feed
send_bytes(obj)
File "/opt/hadoop/data/05/hadoop/yarn/local/usercache/hdfsf6/appcache/application_1560931326013_171773/container_e47_1560931326013_171773_01_000003/environment/lib/python3.7/multiprocessing/connection.py", line 200, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/opt/hadoop/data/05/hadoop/yarn/local/usercache/hdfsf6/appcache/application_1560931326013_171773/container_e47_1560931326013_171773_01_000003/environment/lib/python3.7/multiprocessing/connection.py", line 404, in _send_bytes
self._send(header + buf)
File "/opt/hadoop/data/05/hadoop/yarn/local/usercache/hdfsf6/appcache/application_1560931326013_171773/container_e47_1560931326013_171773_01_000003/environment/lib/python3.7/multiprocessing/connection.py", line 368, in _send
n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
End of LogType:dask.worker.log
I do not see any explicit messages in the logs about low memory. Would anyone know how to diagnose this issue?
hdfs3 is not actively maintained any more. You have two main choices for interacting with HDFS:
pyarrow's hdfs driver (via libhdfs jni library), which requires you to have java and hadoop requirements correctly set up and available to the session calling it
webhdfs such as in fsspec, which does not need java libraries, and can interact with kerberos if HTTP authentication is allowed on your system.

Dataflow jobs fail with: Shuffle close failed: FAILED_PRECONDITION: Precondition check failed

My Dataflow jobs fail with the following error:
INFO:root:2018-10-15T18:55:37.417Z: JOB_MESSAGE_ERROR: Workflow failed.
Causes: S17:fold2/Write/WriteImpl/WindowInto(WindowIntoFn)+write instances fold2/Write/WriteImpl/GroupByKey/Reify+write instances fold2/Write/WriteImpl/GroupByKey/Write failed.,
A work item was attempted 4 times without success.
Each time the worker eventually lost contact with the service. The work item was attempted on:
yuri-nine-gag-recommender-10151140-3kmq-harness-mdgd,
yuri-nine-gag-recommender-10151140-3kmq-harness-mdgd,
yuri-nine-gag-recommender-10151140-3kmq-harness-41dd,
yuri-nine-gag-recommender-10151140-3kmq-harness-mdgd
Digging into the logs shows only one error:
An exception was raised when trying to execute the workitem 6479210647275353150 :
Traceback (most recent call last): File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 642, in do_work work_executor.execute()
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", line 158, in execute op.finish()
File "dataflow_worker/shuffle_operations.py", line 144, in dataflow_worker.shuffle_operations.ShuffleWriteOperation.finish def finish(self):
File "dataflow_worker/shuffle_operations.py", line 145, in dataflow_worker.shuffle_operations.ShuffleWriteOperation.finish with self.scoped_finish_state:
File "dataflow_worker/shuffle_operations.py", line 147, in dataflow_worker.shuffle_operations.ShuffleWriteOperation.finish self.writer.__exit__(None, None, None)
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/shuffle.py", line 599, in __exit__ self.writer.Close()
File "third_party/windmill/shuffle/python/shuffle_client.pyx", line 202, in shuffle_client.PyShuffleWriter.Close IOError: Shuffle close failed: FAILED_PRECONDITION: Precondition check failed.
Any ideas?
I finally figured out the problem by removing various pieces for code, printing tons of logs and running the job again. It turned out that I had a regular expression that blew up for one particular entry. Unfortunately, Dataflow logs were not helpful at all.

Resources