I'm trying to get Airflow up and running within a Docker container and used the image available here. I found that although the DAG gets into the running state (in the UI), the tasks within the DAG seem to wait indefinitely and never actually get triggered.
Since some of the steps in the documentation are optional, I followed these steps to get the example DAGs up and running within my container:
Pulled the image from dockerhub
docker pull puckel/docker-airflow
Started Airflow with default settings, which should use the SequentialExecutor
docker run -d -p 8080:8080 -e LOAD_EX=y puckel/docker-airflow
I'm relatively new to setting up Airflow and Docker, although I have worked with Airflow in the past. So it's possible that I am missing something very basic here, since no one else seems to be facing the same issue. Any help would be highly appreciated.
The SequentialExecutor is not a scheduler, so it only runs jobs that are triggered manually, from the UI or the run command. Certain kinds of tasks won't run under the SequentialExecutor; I think it's SubDagOperators that won't. Honestly, it should still be picking up dummy, bash, or python tasks, but you may save time by running the scheduler with the LocalExecutor and a database instead. Puckel has an example docker-compose file: https://github.com/puckel/docker-airflow
I'd like to ask what this PORTS setup means.
[screenshot of the container listing showing the PORTS column]
Sometimes I see the "5555/tcp, 8793/tcp" part and sometimes I don't. So what is the function of "5555/tcp, 8793/tcp", and how does it come to appear when I build my Docker container?
Second question: which Docker container should I exec into if I want to use the command "airflow tasks test"?
Thanks
I expect to understand Docker containers better, especially how to use the "airflow tasks test" command.
I am considering implementing Airflow and have no prior experience with it.
I have a VM with docker installed, and two containers running on it:
a container with a Python environment where cron jobs currently run
a container with an Airflow installation
Is it possible to use AirFlow to run a task in the python container? I am not sure, because:
If I use the BashOperator with a command like docker exec mycontainer python main.py, I assume it will mark the task as a success even if the Python script fails (it successfully ran the command, but its responsibility ends there).
I see there is a DockerOperator, but it seems to take an image and then create and run a new container, whereas I want to run a task in a container that is already running.
The closest answer I found uses Kubernetes here, which is overkill for my needs.
The BashOperator runs the bash command in:
the scheduler container if you use the LocalExecutor
one of the worker containers if you use the CeleryExecutor
a new separate pod if you use the KubernetesExecutor
The DockerOperator, on the other hand, is designed to create a new Docker container on a Docker server (local or remote), not to manage an existing container.
To run a task (command) on an existing container (or any other host), you can set up an SSH server within the Python docker container, then use the SSHOperator to run your command on that remote SSH server (the Python container in your case).
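For illustration, here is a minimal sketch of what that could look like as an Airflow DAG. It assumes Airflow 2.x with the apache-airflow-providers-ssh package installed; the DAG id, the connection id python_container_ssh and the script path are placeholders you would replace with your own.

from datetime import datetime

from airflow import DAG
# On Airflow 1.10 the import is airflow.contrib.operators.ssh_operator instead.
from airflow.providers.ssh.operators.ssh import SSHOperator

with DAG(
    dag_id="run_in_python_container",   # placeholder DAG id
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # "python_container_ssh" is an Airflow connection you create yourself,
    # pointing at the SSH server you exposed inside the python container.
    run_main = SSHOperator(
        task_id="run_main",
        ssh_conn_id="python_container_ssh",
        command="python /app/main.py",  # placeholder path inside the python container
    )

The SSHOperator raises an error when the remote command exits with a non-zero status, so a failing script should show up as a failed task.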
I have to execute some maprcli commands on a daily basis, and the maprcli command needs to be run as a special user. The maprcli command and the user are both on the local host.
To schedule these tasks I need to use Airflow, which itself runs in a Docker container. I am facing two problems here:
maprcli is not available in the Airflow Docker container
the user it should be executed as is not available in the container
The first problem can be solved with a volume mapping, but is there maybe a cleaner solution?
Is there any way to use the needed local/host user during the execution of a Python script inside the Airflow Docker container?
The permissions depend on the availability of a MapR ticket that is normally generated by maprlogin.
Making this work correctly is much easier in Kubernetes than in plain Docker containers because of the more advanced handling of tickets.
I used to run a long training process on a remote server with GPU capabilities. Now my work schedule has changed, so I can't keep my computer connected to a network until the process finishes. I found that nohup is the solution for me, but I don't know how to invoke the process correctly in my situation.
1. I use ssh to connect to the remote server.
2. I have to use Docker to access the GPU.
3. Then I start the process inside the Docker container.
If I start the process with nohup inside the container, I can't really leave the container, right? So do I use nohup at each step?
Edit:
I need the terminal output of the process at step 3, because I need that information to carry out the rest of the work. Consider that step 3 is training a neural network, so the training log tells me the accuracy of different models at different iterations. I use that information to do the testing.
Following @David Maze's suggestion, I did this (a slightly different approach, as I was not very familiar with Docker):
Logged in to the remote server.
Configured the Dockerfile to set the working directory inside the container:
...
WORKDIR /workspace
...
After building the Docker image, run the container with the mount option to mount the local project into the container's workdir. When running it, I used nohup. Since I don't need interactive mode, I omitted the -it flag.
nohup docker run --gpus all -v $(pwd)/path-to-project-root:/workspace/ docker-image:tag bash -c "command1; command2" > project.out 2>&1 &
To test this, I logged out of the server and checked the contents of project.out later. It contained the expected output.
Building on my question How to run DBT in airflow without copying our repo, I am currently running Airflow and syncing the DAGs via git. I am considering different options to include DBT within my workflow. One suggestion by louis_guitton is to Dockerize the DBT project and run it in Airflow via the DockerOperator.
I have no prior experience using the DockerOperator in Airflow, or with DBT in general. I am wondering if anyone has tried this or can provide some insights about their experience incorporating that workflow. My main questions are:
Should DBT as a whole project be run as one Docker container, or is it broken down? (for example: are tests run as a separate container from dbt tasks?)
Are logs and the UI from DBT accessible and/or still useful when run via the Docker Operator?
How would partial pipelines be run? (example: wanting to run only a part of the pipeline)
Judging by your questions, you would benefit from trying to dockerise dbt on its own, independently of Airflow. A lot of your questions would disappear. But here are my answers anyway.
Should DBT as a whole project be run as one Docker container, or is it broken down? (for example: are tests run as a separate container from dbt tasks?)
I suggest you build one Docker image for the entire project. The Docker image can be based on the Python image, since dbt is a Python CLI tool. You then use the CMD arguments of the Docker image to run any dbt command you would run outside Docker.
Please remember the syntax of docker run (which has nothing to do with dbt): you can specify any COMMAND you want to run at invocation time
$ docker run [OPTIONS] IMAGE[:TAG|@DIGEST] [COMMAND] [ARG...]
Also, the first hit on Google for "docker dbt" is this Dockerfile, which can get you started.
Are logs and the UI from DBT accessible and/or still useful when run via the Docker Operator?
Again, it's not a dbt question but rather a docker question or an airflow question.
Can you see the logs in the Airflow UI when using a DockerOperator? Yes, see this how-to blog post with screenshots.
Can you access logs from a Docker container? Yes: Docker containers emit logs to the stdout and stderr output streams (which you can see in Airflow, since Airflow picks them up). But logs are also stored in JSON files on the host machine under /var/lib/docker/containers/. If you have more advanced needs, you can pick up those logs with a tool (or a simple BashOperator or PythonOperator) and do what you need with them.
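If you do want to post-process those files yourself, a hypothetical callable for a PythonOperator could be as simple as the sketch below. It assumes Docker's default json-file logging driver (one JSON object per line with "log", "stream" and "time" keys), that the process has permission to read /var/lib/docker/containers/ on the host, and the container id prefix is a placeholder.

import glob
import json

def read_container_logs(container_id_prefix):
    """Yield (time, stream, message) tuples from Docker's json-file logs."""
    pattern = f"/var/lib/docker/containers/{container_id_prefix}*/*-json.log"
    for path in glob.glob(pattern):
        with open(path) as f:
            for line in f:
                entry = json.loads(line)
                yield entry["time"], entry["stream"], entry["log"]

# Example: print everything a given container wrote (the id prefix is a placeholder).
for ts, stream, msg in read_container_logs("3f4e"):
    print(ts, stream, msg, end="")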
How would partial pipelines be run? (example: wanting to run only a part of the pipeline)
See answer 1: you would run your docker dbt image with the command
$ docker run my-dbt-image dbt run -m stg_customers
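To tie this back to Airflow, here is a rough sketch of how the same commands could be wired up with the DockerOperator. It assumes Airflow 2.x with the apache-airflow-providers-docker package installed and a locally built image called my-dbt-image; the DAG and task ids are placeholders.

from datetime import datetime

from airflow import DAG
# On Airflow 1.10 the import is airflow.operators.docker_operator instead.
from airflow.providers.docker.operators.docker import DockerOperator

with DAG(
    dag_id="dbt_docker",                # placeholder DAG id
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Full run of the project.
    dbt_run = DockerOperator(
        task_id="dbt_run",
        image="my-dbt-image",
        command="dbt run",
        docker_url="unix://var/run/docker.sock",  # local Docker daemon
    )

    # Partial run: same image, only the dbt command changes.
    dbt_run_partial = DockerOperator(
        task_id="dbt_run_stg_customers",
        image="my-dbt-image",
        command="dbt run -m stg_customers",
        docker_url="unix://var/run/docker.sock",
    )

    dbt_run >> dbt_run_partial

Each dbt command then shows up as its own Airflow task, with the container's stdout in the task log, which also covers question 2.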