Airflow DockerOperator volumes and mounts - docker

We have Airflow running (using Docker Compose) with several DAGs active. Last week we updated our Airflow to version 2.1.3.
This resulted in an error for a DAG where we use DockerOperator:
airflow.exceptions.AirflowException: Invalid arguments were passed to DockerOperator (task_id: t_docker). Invalid arguments were:
**kwargs: {'volumes':
I found this release note telling me that
The volumes parameter in airflow.providers.docker.operators.docker.DockerOperator and airflow.providers.docker.operators.docker_swarm.DockerSwarmOperator was replaced by the mounts parameter
So I changed our DAG from
t_docker = DockerOperator(
    task_id='t_docker',
    image='customimage:latest',
    container_name='custom_1',
    api_version='auto',
    auto_remove=True,
    volumes=['/home/airflow/scripts:/opt/airflow/scripts', '/home/airflow/data:/opt/airflow/data'],
    docker_url='unix://var/run/docker.sock',
    network_mode='bridge',
    dag=dag
)
to this
t_docker = DockerOperator(
    task_id='t_docker',
    image='customimage:latest',
    container_name='custom_1',
    api_version='auto',
    auto_remove=True,
    mounts=['/home/airflow/scripts:/opt/airflow/scripts', '/home/airflow/data:/opt/airflow/data'],
    docker_url='unix://var/run/docker.sock',
    network_mode='bridge',
    dag=dag
)
But now I get this error:
docker.errors.APIError: 500 Server Error for http+docker://localhost/v1.41/containers/create?name=custom_1: Internal Server Error ("json: cannot unmarshal string into Go struct field HostConfig.HostConfig.Mounts of type mount.Mount")
What am I doing wrong?

The change isn't only in the parameter name; it's also a change to the mount syntax.
You should replace
volumes=['/home/airflow/scripts:/opt/airflow/scripts','/home/airflow/data:/opt/airflow/data']
with:
mounts=[
    Mount(source="/home/airflow/scripts", target="/opt/airflow/scripts", type="bind"),
    Mount(source="/home/airflow/data", target="/opt/airflow/data", type="bind"),
]
So your code will be:
from docker.types import Mount

t_docker = DockerOperator(
    task_id='t_docker',
    image='customimage:latest',
    container_name='custom_1',
    api_version='auto',
    auto_remove=True,
    mounts=[
        Mount(source="/home/airflow/scripts", target="/opt/airflow/scripts", type="bind"),
        Mount(source="/home/airflow/data", target="/opt/airflow/data", type="bind"),
    ],
    docker_url='unix://var/run/docker.sock',
    network_mode='bridge',
    dag=dag
)
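If you use named Docker volumes instead of host paths, the same Mount class works with type="volume". A minimal sketch, assuming a named volume called airflow_scripts already exists in your Docker engine:
from docker.types import Mount

mounts=[
    # source is the volume name rather than a host path
    Mount(source="airflow_scripts", target="/opt/airflow/scripts", type="volume"),
]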

Related

Pyspark running in docker container cannot write file

I have a Docker container running PySpark, Hadoop and all the required dependencies. I am using spark-submit to query MinIO, and I want to write the output dataframe to a file. Reading the file works but writing does not. If I execute Python in that container and try to create a file at the same path, it works.
Am I missing some spark configuration?
This is the error I get:
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 1109, in save
File "/usr/local/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
File "/usr/local/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o38.save
: java.net.ConnectException: Call From 10d3463d04ce/10.0.1.132 to localhost:9000 failed on connection exception:
java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
Relevant code:
spark = SparkSession.builder.getOrCreate()
spark_context = spark.sparkContext
spark_context._jsc.hadoopConfiguration().set('fs.s3a.access.key', 'minio')
spark_context._jsc.hadoopConfiguration().set(
    'fs.s3a.secret.key', AWS_SECRET_ACCESS_KEY
)
spark_context._jsc.hadoopConfiguration().set('fs.s3a.path.style.access', 'true')
spark_context._jsc.hadoopConfiguration().set(
    'fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem'
)
spark_context._jsc.hadoopConfiguration().set('fs.s3a.endpoint', AWS_S3_ENDPOINT)
spark_context._jsc.hadoopConfiguration().set(
    'fs.s3a.connection.ssl.enabled', 'false'
)

df = spark.sql(query)
df.show()  # this works perfectly fine
df.coalesce(1).write.format('json').save(output_path)  # here I get the error
The solution was to prepend file:// to output_path.
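A minimal sketch of the fix; the concrete path is only an illustration, the point is the explicit file:// scheme so Spark writes to the container's local filesystem instead of the default filesystem on localhost:9000:
output_path = 'file:///app/output/result.json'  # illustrative local path with an explicit file:// scheme
df.coalesce(1).write.format('json').save(output_path)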

Invalid arguments were passed to DockerOperator (retrieve_output)

In trying to create the DockerOperator, I got this error:
Invalid arguments were passed to DockerOperator (task_id: t2). Invalid arguments were:
**kwargs: {'retrieve_output': True, 'retrieve_output_path': '/tmp/script.out'}
Here is my code:
from airflow.decorators import task, dag
from airflow.providers.docker.operators.docker import DockerOperator
from datetime import datetime

@dag(start_date=datetime(2023, 1, 1), schedule="@daily", catchup=False)
def docker_dag():
    @task()
    def t1():
        pass

    t2 = DockerOperator(
        task_id='t2',
        container_name="task_t2",
        image='stock_image:v1.0.2',
        command='python3 stock_data.py',
        docker_url="tcp://docker-proxy:2375",  # I have to use this on MacOS or I'll get a Permission Denied error
        network_mode='bridge',
        xcom_all=True,
        retrieve_output=True,
        retrieve_output_path="/tmp/script.out",
        auto_remove=True,
        mount_tmp_dir=False
    )

    t1() >> t2

dag = docker_dag()
Note: Here is the link to the documentation, which shows that my arguments do exist. So why would I be getting an invalid-argument error for just these two specific arguments?
retrieve_output and retrieve_output_path were added to the DockerOperator in the Docker provider version 2.2.0. :)
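A quick way to confirm which provider version is installed before relying on those arguments; a minimal sketch, assuming Python 3.8+ and run in the same environment Airflow uses:
from importlib.metadata import version

# retrieve_output / retrieve_output_path require apache-airflow-providers-docker >= 2.2.0
print(version("apache-airflow-providers-docker"))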

Airflow DockerOperator unable to mount tmp directory correctly

I am trying to run a simple python script within a docker run command scheduled with Airflow.
I have followed the instructions here Airflow init.
My .env file:
AIRFLOW_UID=1000
AIRFLOW_GID=0
And the docker-compose.yaml is based on the default docker-compose.yaml. I had to add - /var/run/docker.sock:/var/run/docker.sock as an additional volume to run Docker inside of Docker.
My DAG is configured as follows:
""" this is an example dag """
from datetime import timedelta

from airflow import DAG
from airflow.operators.docker_operator import DockerOperator
from airflow.utils.dates import days_ago
from docker.types import Mount

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email': ['info@foo.com'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 10,
    'retry_delay': timedelta(minutes=5),
}

with DAG(
    'msg_europe_etl',
    default_args=default_args,
    description='Process MSG_EUROPE ETL',
    schedule_interval=timedelta(minutes=15),
    start_date=days_ago(0),
    tags=['satellite_data'],
) as dag:
    download_and_store = DockerOperator(
        task_id='download_and_store',
        image='satellite_image:latest',
        auto_remove=True,
        api_version='1.41',
        mounts=[Mount(source='/home/archive_1/archive/satellite_data',
                      target='/app/data'),
                Mount(source='/home/dlassahn/projects/forecast-system/meteoIntelligence-satellite',
                      target='/app')],
        command="python3 src/scripts.py download_satellite_images "
                "{{ (execution_date - macros.timedelta(hours=4)).strftime('%Y-%m-%d %H:%M') }} "
                "'msg_europe' ",
    )

    download_and_store
The Airflow log:
[2021-08-03 17:23:58,691] {docker.py:231} INFO - Starting docker container from image satellite_image:latest
[2021-08-03 17:23:58,702] {taskinstance.py:1501} ERROR - Task failed with exception
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.6/site-packages/docker/api/client.py", line 268, in _raise_for_status
response.raise_for_status()
File "/home/airflow/.local/lib/python3.6/site-packages/requests/models.py", line 943, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: http+docker://localhost/v1.41/containers/create
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 1157, in _run_raw_task
self._prepare_and_execute_task_with_callbacks(context, task)
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 1331, in _prepare_and_execute_task_with_callbacks
result = self._execute_task(context, task_copy)
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 1361, in _execute_task
result = task_copy.execute(context=context)
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/providers/docker/operators/docker.py", line 319, in execute
return self._run_image()
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/providers/docker/operators/docker.py", line 258, in _run_image
tty=self.tty,
File "/home/airflow/.local/lib/python3.6/site-packages/docker/api/container.py", line 430, in create_container
return self.create_container_from_config(config, name)
File "/home/airflow/.local/lib/python3.6/site-packages/docker/api/container.py", line 441, in create_container_from_config
return self._result(res, True)
File "/home/airflow/.local/lib/python3.6/site-packages/docker/api/client.py", line 274, in _result
self._raise_for_status(response)
File "/home/airflow/.local/lib/python3.6/site-packages/docker/api/client.py", line 270, in _raise_for_status
raise create_api_error_from_http_exception(e)
File "/home/airflow/.local/lib/python3.6/site-packages/docker/errors.py", line 31, in create_api_error_from_http_exception
raise cls(e, response=response, explanation=explanation)
docker.errors.APIError: 400 Client Error for http+docker://localhost/v1.41/containers/create: Bad Request ("invalid mount config for type "bind": bind source path does not exist: /tmp/airflowtmp037k87u6")
Trying to set mount_tmp_dir=False leads to a DAG import error because of an unknown keyword argument mount_tmp_dir (this might be an issue with the documentation).
Nevertheless, I do not know how to configure the tmp directory correctly.
My Airflow Version: 2.1.2
There was a bug in Docker provider 2.0.0 which prevented the DockerOperator from running with a Docker-in-Docker solution.
You need to upgrade to the latest Docker provider, 2.1.0:
https://airflow.apache.org/docs/apache-airflow-providers-docker/stable/index.html#id1
You can do it by extending the image as described in https://airflow.apache.org/docs/docker-stack/build.html#extending-the-image with, for example, this Dockerfile:
FROM apache/airflow
RUN pip install --no-cache-dir apache-airflow-providers-docker==2.1.0
The operator will work out-of-the-box in this case in "fallback" mode (with a warning message), but you can also disable the mount that causes the problem. There is more explanation in https://airflow.apache.org/docs/apache-airflow-providers-docker/stable/_api/airflow/providers/docker/operators/docker/index.html:
By default, a temporary directory is created on the host and mounted into a container to allow storing files that together exceed the default disk size of 10GB in a container. In this case the path to the mounted directory can be accessed via the environment variable AIRFLOW_TMP_DIR.
If the volume cannot be mounted, a warning is printed and an attempt is made to execute the docker command without the temporary folder mounted. This is to make it work by default with a remote docker engine, or when you run a docker-in-docker solution and the temporary directory is not shared with the docker engine. A warning is printed in the logs in this case.
If you know you run the DockerOperator with a remote engine or via docker-in-docker, you should set the mount_tmp_dir parameter to False. In this case, you can still use the mounts parameter to mount already existing named volumes in your Docker Engine to achieve a similar capability, where you can store files exceeding the default disk size of the container.
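A minimal sketch of that last suggestion applied to the DAG above (the image and one bind mount are taken from the question, the rest is trimmed for illustration), assuming Docker provider 2.1.0+ as recommended:
from docker.types import Mount
from airflow.providers.docker.operators.docker import DockerOperator

# (the surrounding `with DAG(...)` context is omitted for brevity)
download_and_store = DockerOperator(
    task_id='download_and_store',
    image='satellite_image:latest',
    auto_remove=True,
    mount_tmp_dir=False,  # skip the host /tmp bind that the remote / docker-in-docker engine cannot see
    mounts=[
        Mount(source='/home/archive_1/archive/satellite_data',
              target='/app/data',
              type='bind'),
    ],
    command="python3 src/scripts.py download_satellite_images",
)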
I had the same issue, and all the "recommended" ways of solving it here, as well as setting up the mount dir params as described, just led to other errors. The one solution that helped me was wrapping the code invoked by Docker with a VPN script (this hack was actually taken from another Docker-powered DAG that used a VPN and worked well).
So the final solution looks like:
#!/bin/bash
connect_to_vpn.sh &
sleep 10
python3 my_func.py
sleep 10
stop_vpn.sh
wait -n
exit $?
To connect to the VPN I used openconnect. The tool can be installed with apt install and supports the AnyConnect protocol (which was my crucial requirement).
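A sketch of how such a wrapper might be wired into the task; the script path /app/run_with_vpn.sh, the task id, and the image name are assumptions (the wrapper and openconnect would be baked into the image):
run_via_vpn = DockerOperator(
    task_id='run_via_vpn',                # illustrative task id
    image='customimage:latest',           # image assumed to contain the wrapper and openconnect
    auto_remove=True,
    mount_tmp_dir=False,
    command="bash /app/run_with_vpn.sh",  # the wrapper starts the VPN, runs python3 my_func.py, then stops the VPN
)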

mounting directories using docker operator on airflow is not working

I'm trying to use the DockerOperator to automate the execution of some scripts using Airflow.
Airflow version: apache-airflow==1.10.12
What I want to do is to "copy" all my project's files (folders and files) to the container using this code.
The following file ml-intermediate.py is at ~/airflow/dags/ml-intermediate.py:
"""
Template to convert a Ploomber DAG to Airflow
"""
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.utils.dates import days_ago
from ploomber.spec import DAGSpec
from soopervisor.script.ScriptConfig import ScriptConfig
script_cfg = ScriptConfig.from_path('/home/letyndr/airflow/dags/ml-intermediate')
# Replace the project root to reflect the new location - or maybe just
# write a soopervisor.yaml, then we can we rid of this line
script_cfg.paths.project = '/home/letyndr/airflow/dags/ml-intermediate'
# TODO: use lazy_import from script_cfg
dag_ploomber = DAGSpec('/home/letyndr/airflow/dags/ml-intermediate/pipeline.yaml',
lazy_import=True).to_dag()
dag_ploomber.name = "ML Intermediate"
default_args = {
'start_date': days_ago(0),
}
dag_airflow = DAG(
dag_ploomber.name.replace(' ', '-'),
default_args=default_args,
description='Ploomber dag',
schedule_interval=None,
)
script_cfg.save_script()
from airflow.operators.docker_operator import DockerOperator
for task_name in dag_ploomber:
DockerOperator(task_id=task_name,
image="continuumio/miniconda3",
api_version="auto",
auto_remove=True,
# command="sh /home/letyndr/airflow/dags/ml-intermediate/script.sh",
command="sleep 600",
docker_url="unix://var/run/docker.sock",
volumes=[
"/home/letyndr/airflow/dags/ml-intermediate:/home/letyndr/airflow/dags/ml-intermediate:rw",
"/home/letyndr/airflow-data/ml-intermediate:/home/letyndr/airflow-data/ml-intermediate:rw"
],
working_dir=script_cfg.paths.project,
dag=dag_airflow,
container_name=task_name,
)
for task_name in dag_ploomber:
task_ploomber = dag_ploomber[task_name]
task_airflow = dag_airflow.get_task(task_name)
for upstream in task_ploomber.upstream:
task_airflow.set_upstream(dag_airflow.get_task(upstream))
dag = dag_airflow
When I execute this DAG using Airflow, I get an error saying that Docker does not find the /home/letyndr/airflow/dags/ml-intermediate/script.sh script. I changed the DockerOperator's command to sleep 600 so I could enter the container and check the files at the correct paths.
When I'm in the container I can go to the path /home/letyndr/airflow/dags/ml-intermediate/ for example, but I don't see the files that are supposed to be there.
I tried to replicate how Airflow uses the Docker SDK for Python by checking this part of the DockerOperator package, specifically the part where it creates the docker container: docker container creation
This is my replication of the docker implementation:
import docker

client = docker.APIClient()

# binds = {
#     "/home/letyndr/airflow/dags": {
#         "bind": "/home/letyndr/airflow/dags",
#         "mode": "rw"
#     },
#     "/home/letyndr/airflow-data/ml-intermediate": {
#         "bind": "/home/letyndr/airflow-data/ml-intermediate",
#         "mode": "rw"
#     }
# }

binds = ["/home/letyndr/airflow/dags:/home/letyndr/airflow/dags:rw",
         "/home/letyndr/airflow-data/ml-intermediate:/home/letyndr/airflow-data/ml-intermediate:rw"]

container = client.create_container(
    image="continuumio/miniconda3",
    command="sleep 600",
    volumes=["/home/letyndr/airflow/dags", "/home/letyndr/airflow-data/ml-intermediate"],
    host_config=client.create_host_config(binds=binds),
    working_dir="/home/letyndr/airflow/dags",
    name="simple_example",
)

client.start(container=container.get("Id"))
What I found was that mounting volumes only works if both host_config and volumes are set; the problem is that the Airflow implementation only sets host_config, not volumes. When I added the volumes parameter to the create_container call, it worked.
Do you know if I'm using the DockerOperator correctly, or is this an issue?
Try using the mounts argument instead of volumes. That's how volumes are defined in the Airflow documentation / source code.
So it should look something like this:
DockerOperator(task_id=task_name,
               image="continuumio/miniconda3",
               api_version="auto",
               auto_remove=True,
               # command="sh /home/letyndr/airflow/dags/ml-intermediate/script.sh",
               command="sleep 600",
               docker_url="unix://var/run/docker.sock",
               mounts=[
                   "/home/letyndr/airflow/dags/ml-intermediate:/home/letyndr/airflow/dags/ml-intermediate:rw",
                   "/home/letyndr/airflow-data/ml-intermediate:/home/letyndr/airflow-data/ml-intermediate:rw"
               ],
               working_dir=script_cfg.paths.project,
               dag=dag_airflow,
               container_name=task_name,
               )
These are some other optional arguments that might be helpful (a short sketch follows the list):
host_tmp_dir: Specify the location of the temporary directory on the host which will be mapped to tmp_dir. If not provided, it defaults to the standard system temp directory.
tmp_dir: Mount point inside the container for a temporary directory created on the host by the operator.
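A minimal sketch of those two arguments together; the host path and task id are assumptions, not something from the question:
DockerOperator(task_id="tmp_dir_example",                # illustrative task id
               image="continuumio/miniconda3",
               host_tmp_dir="/home/letyndr/airflow-tmp",  # assumed host directory mapped into the container
               tmp_dir="/tmp/airflow",                    # mount point inside the container
               dag=dag_airflow,
               )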
EDIT: After additional review I see that each mount item must be of type Mount from docker.types. The volumes argument was also renamed to mounts as part of the changelog for Airflow 2.1. Here is an example from the Airflow source code.
Revised, the code should look something akin to:
from docker.types import Mount
...
...
DockerOperator(task_id=task_name,
               image="continuumio/miniconda3",
               api_version="auto",
               auto_remove=True,
               # command="sh /home/letyndr/airflow/dags/ml-intermediate/script.sh",
               command="sleep 600",
               docker_url="unix://var/run/docker.sock",
               mounts=[
                   Mount(
                       source='/home/letyndr/airflow/dags/ml-intermediate',
                       target='/home/letyndr/airflow/dags/ml-intermediate',  # bind mounts are read-write by default
                       type='bind'
                   )
               ],
               working_dir=script_cfg.paths.project,
               dag=dag_airflow,
               container_name=task_name,
               )

Airflow docker operator Internal Server Error ("b'Mounts denied: EOF'") MacOS

I'm trying to use the DockerOperator in an Airflow pipeline. This is the code I'm using:
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
from airflow.operators.docker_operator import DockerOperator

default_args = {
    'owner'            : 'airflow',
    'description'      : 'Use of the DockerOperator',
    'depend_on_past'   : False,
    'start_date'       : datetime(2018, 1, 3),
    'email_on_failure' : False,
    'email_on_retry'   : False,
    'retries'          : 1,
    'retry_delay'      : timedelta(minutes=5)
}

with DAG('docker_dag', default_args=default_args, schedule_interval="5 * * * *", catchup=False) as dag:
    t1 = BashOperator(
        task_id='print_current_date',
        bash_command='date'
    )

    t2 = DockerOperator(
        task_id='docker_command',
        image='alpine:3.7',
        api_version='auto',
        auto_remove=True,
        command="/bin/sleep 30",
        docker_url="unix://var/run/docker.sock",
        network_mode="bridge"
    )

    t3 = BashOperator(
        task_id='print_hello',
        bash_command='echo "hello world"'
    )

    t1 >> t2 >> t3
The original source is: How to use the DockerOperator in Apache Airflow
I run my DAG using the Airflow UI, but in the docker_command task I get this error:
*** Reading local file: /Users/SoftwareDeveloper/airflow/logs/docker_dag/docker_command/2020-10-04T01:47:06.692609+00:00/2.log
[2020-10-03 20:52:21,857] {taskinstance.py:670} INFO - Dependencies all met for <TaskInstance: docker_dag.docker_command 2020-10-04T01:47:06.692609+00:00 [queued]>
[2020-10-03 20:52:21,876] {taskinstance.py:670} INFO - Dependencies all met for <TaskInstance: docker_dag.docker_command 2020-10-04T01:47:06.692609+00:00 [queued]>
[2020-10-03 20:52:21,877] {taskinstance.py:880} INFO -
--------------------------------------------------------------------------------
[2020-10-03 20:52:21,877] {taskinstance.py:881} INFO - Starting attempt 2 of 2
[2020-10-03 20:52:21,877] {taskinstance.py:882} INFO -
--------------------------------------------------------------------------------
[2020-10-03 20:52:21,886] {taskinstance.py:901} INFO - Executing <Task(DockerOperator): docker_command> on 2020-10-04T01:47:06.692609+00:00
[2020-10-03 20:52:21,889] {standard_task_runner.py:54} INFO - Started process 33165 to run task
[2020-10-03 20:52:21,929] {standard_task_runner.py:77} INFO - Running: ['airflow', 'run', 'docker_dag', 'docker_command', '2020-10-04T01:47:06.692609+00:00', '--job_id', '72', '--pool', 'default_pool', '--raw', '-sd', 'DAGS_FOLDER/docker_operator.py', '--cfg_path', '/var/folders/9m/6w_b7jmn11s9yn6f3jyfd7rr0000gs/T/tmp98jxpu6w']
[2020-10-03 20:52:21,932] {standard_task_runner.py:78} INFO - Job 72: Subtask docker_command
[2020-10-03 20:52:21,977] {logging_mixin.py:112} INFO - Running %s on host %s <TaskInstance: docker_dag.docker_command 2020-10-04T01:47:06.692609+00:00 [running]> 64.1.168.192.in-addr.arpa
[2020-10-03 20:52:22,126] {docker_operator.py:210} INFO - Starting docker container from image alpine:3.7
[2020-10-03 20:52:22,192] {taskinstance.py:1150} ERROR - 500 Server Error: Internal Server Error ("b'Mounts denied: EOF'")
Traceback (most recent call last):
File "/Users/SoftwareDeveloper/opt/anaconda3/envs/airflow/lib/python3.8/site-packages/docker/api/client.py", line 259, in _raise_for_status
response.raise_for_status()
File "/Users/SoftwareDeveloper/opt/anaconda3/envs/airflow/lib/python3.8/site-packages/requests/models.py", line 941, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: http+docker://localhost/v1.40/containers/49bf0d929c8d2524a778b5bdc255544c5a3e4915530bbea379b8e147d765a5c6/start
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/SoftwareDeveloper/opt/anaconda3/envs/airflow/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 984, in _run_raw_task
result = task_copy.execute(context=context)
File "/Users/SoftwareDeveloper/opt/anaconda3/envs/airflow/lib/python3.8/site-packages/airflow/operators/docker_operator.py", line 277, in execute
return self._run_image()
File "/Users/SoftwareDeveloper/opt/anaconda3/envs/airflow/lib/python3.8/site-packages/airflow/operators/docker_operator.py", line 233, in _run_image
self.cli.start(self.container['Id'])
File "/Users/SoftwareDeveloper/opt/anaconda3/envs/airflow/lib/python3.8/site-packages/docker/utils/decorators.py", line 19, in wrapped
return f(self, resource_id, *args, **kwargs)
File "/Users/SoftwareDeveloper/opt/anaconda3/envs/airflow/lib/python3.8/site-packages/docker/api/container.py", line 1108, in start
self._raise_for_status(res)
File "/Users/SoftwareDeveloper/opt/anaconda3/envs/airflow/lib/python3.8/site-packages/docker/api/client.py", line 261, in _raise_for_status
raise create_api_error_from_http_exception(e)
File "/Users/SoftwareDeveloper/opt/anaconda3/envs/airflow/lib/python3.8/site-packages/docker/errors.py", line 31, in create_api_error_from_http_exception
raise cls(e, response=response, explanation=explanation)
docker.errors.APIError: 500 Server Error: Internal Server Error ("b'Mounts denied: EOF'")
[2020-10-03 20:52:22,202] {taskinstance.py:1187} INFO - Marking task as FAILED. dag_id=docker_dag, task_id=docker_command, execution_date=20201004T014706, start_date=20201004T015221, end_date=20201004T015222
[2020-10-03 20:52:26,855] {local_task_job.py:102} INFO - Task exited with return code 1
Update: I did this exercise on a Mac, but I tried the same on Linux (Ubuntu 18.04) and everything works fine. So I'm afraid this is related to something in my macOS configuration or the permissions I have on the computer.
On the Mac I have macOS Catalina 10.15.5.
Do you have any idea why I'm getting that error?
