I added the Python file that builds an interactive dashboard on my local server to my DAG, but when it runs in the DAG the site can't be reached. Do I need to set something up on my container?
Problem: new DAGs not shown in dockerized Airflow, and no error when running airflow dags list-import-errors
Docker image: official Airflow image
DAGs path inside docker-compose.yaml (this is the default path):
volumes:
- ./dags:/opt/airflow/dags
I put the DAG file inside the dags folder in the main directory.
However, the DAGs still do not show up in either the webserver UI or in airflow dags list. Running airflow dags list-import-errors also yields no results.
When I open the docker terminal, I can see my DAGs inside the dags folder via the ls command. I also tried changing the owner to root with chown, but both of my DAGs still do not show up in the list.
Airflow itself runs successfully (via docker compose), since I can see the example DAGs, just not my own.
Any help will be appreciated. Thanks!
I would try a few things:
rename the files to something like gcp_dag.py and python_dag.py
ensure import airflow is present in each file
ensure you create a DAG object in each file (see the minimal example at the end of this answer)
add an empty __init__.py file to the dags folder
It would also be helpful to see the contents of at least one of those files.
do not name your Python DAG files test_*; that probably does not matter, but this prefix is usually a convention for unit tests
make sure the word DAG appears in your DAG files (or set https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#dag-discovery-safe-mode to False) - with safe mode on, files can be ignored if they do not contain "DAG" or "airflow"
Exec into your scheduler container and check whether it can see and read the DAGs and whether the env vars are properly set. Running airflow info in your scheduler container (provided it is run with the same env as the running scheduler) should show all the configuration
Check if your scheduler is running
look at the scheduler logs and see if it is scanning the right folders - enabling DEBUG logging might be helpful: https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#logging-level
see if the DAGs appear via the airflow dags subcommands
It might also be a problem with permissions. Check what user owns the DAG files or, better, try to view the files from inside the container, like this:
docker exec -it <container> bash
cat /opt/airflow/dags/test_gcp.py
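For reference, a minimal sketch of the kind of DAG file the scheduler should pick up; the dag_id, schedule and task below are placeholders, not the asker's actual code:
# gcp_dag.py - minimal sketch with a placeholder id, schedule and task
from datetime import datetime

from airflow import DAG  # having "airflow"/"DAG" in the file also satisfies safe-mode discovery
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="gcp_dag",                 # hypothetical id
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    BashOperator(task_id="placeholder", bash_command="echo hello")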
I am considering implementing Airflow and have no prior experience with it.
I have a VM with docker installed, and two containers running on it:
container with python environment where cronjobs currently run
container with an Airflow installation
Is it possible to use Airflow to run a task in the python container? I am not sure, because:
If I use the BashOperator with a command like docker exec mycontainer python main.py, I assume it will mark this task as success even if the python script fails (it successfully ran the command, but its responsibility ends there).
I see there is a DockerOperator, but it seems to take an image and then create and run a new container, whereas I want to run a task on a container that is already running.
The closest answer I found uses Kubernetes (here), which is overkill for my needs.
The BashOperator runs the bash command on:
the scheduler container if you use the LocalExecutor
one of the worker containers if you use the CeleryExecutor
a new separate pod if you use the KubernetesExecutor
The DockerOperator, on the other hand, is designed to create a new docker container on a docker server (local or remote), not to manage an existing container.
To run a task (command) on an existing container (or any other host), you can set up an SSH server within the python docker container, then use the SSHOperator to run your command on the remote SSH server (the python container in your case).
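A minimal sketch of that approach, assuming the SSH provider is installed and an Airflow connection (called python_container_ssh here, an assumed name) points at the sshd running inside the python container; the DAG id, schedule and script path are also placeholders:
from datetime import datetime

from airflow import DAG
from airflow.providers.ssh.operators.ssh import SSHOperator

with DAG(
    dag_id="run_in_python_container",    # hypothetical id
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    SSHOperator(
        task_id="run_main_py",
        ssh_conn_id="python_container_ssh",  # assumed connection to the python container's sshd
        command="python /app/main.py",       # assumed script path inside that container
    )
Since the SSHOperator fails when the remote command exits with a non-zero status, a failing main.py will mark the task as failed rather than success, which also addresses the docker exec concern above.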
I am using Airflow running in a docker container on Windows and trying to read a file from HDFS:
from pyspark.sql import SparkSession

sc = SparkSession.builder.appName('spark_app').getOrCreate()
csvDF = sc.read.csv("hdfs://host.docker.internal:9000/hadoop_files/example.csv")
It works fine in a local environment, but fails with the error in docker:
RuntimeError: Java gateway process exited before sending its port number
It seems like I need to add JAVA_HOME to my Dockerfile. I tried this, but it didn't work:
ENV JAVA_HOME E:/Java/jdk1.8.0_321
RUN export JAVA_HOME
There are two reasons for this problem:
You didn't install a Java JDK in your docker image
You installed the JDK, but its path is not known to your Spark session. To solve the problem, define the environment variable JAVA_HOME in the docker image and set it to the JDK path inside the image (not the host path E:/Java/jdk1.8.0_321). If you have the JDK on your host, you can mount it as a volume into the docker container and point the env var at that mounted path.
But as an Airflow user for several years, I don't recommend executing the jobs on Airflow itself; I prefer to only schedule them with Airflow and run them on an external server. In your case you can submit your job to a separate Spark cluster using the SparkSubmitOperator (or an HTTP request if you use Apache Livy), or execute it in a Kubernetes cluster (minikube for dev). That way you don't need to install the pyspark lib and a Java JDK in the Airflow docker image.
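A minimal sketch of the SparkSubmitOperator approach, assuming the Apache Spark provider is installed and a connection (spark_default here) points at the external cluster; the DAG id and application path are placeholders:
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="submit_spark_job",           # hypothetical id
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    SparkSubmitOperator(
        task_id="read_hdfs_csv",
        conn_id="spark_default",                   # assumed connection to the Spark cluster
        application="/opt/jobs/read_hdfs_csv.py",  # placeholder path to the PySpark script
    )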
I have installed Apache Airflow on my Windows 11 machine and many questions have already arisen.
1) How can I execute the 'airflow' command under Windows? It seems that Windows doesn't recognize this command, and configuring the 'PATH' settings doesn't help. For example:
> airflow dags list
Or is the better choice to download a Linux docker image, run its container, and install Airflow on it?
2) After installing Airflow, the DAGs home directory under Windows is:
C:\docker-airflow-master\dags
In docker-compose.yaml file under the 'volumes:' tag the dag folder is:
./dags:/usr/local/airflow/dags
Does it mean that './dags:/usr/local/airflow/dags' is the path mapping for the Airflow container, and that when I put my DAGs under 'C:\docker-airflow-master\dags' they are found via some matching between these directories? Where is this matching done?
3) Does anyone nowadays run Airflow on a local machine at all, or is it more reasonable to use a cloud environment for this (e.g. Google Cloud Composer)?
I'm trying to get Airflow up and running within a container and used the image available here. I found that although the DAG gets into the running state (on the UI), the tasks within the DAG seem to wait indefinitely and never actually get triggered.
Since some of the steps in the documentation are optional, I followed these steps to get the example DAGs up and running within my container:
Pulled the image from dockerhub
docker pull puckel/docker-airflow
Triggered Airflow with default settings, which should start it with the SequentialExecutor
docker run -d -p 8080:8080 -e LOAD_EX=y puckel/docker-airflow
I'm relatively new to setting up Airflow and Docker, although I have worked with Airflow in the past. So, it's possible that I am missing something very basic here, since no one else seems to be facing the same issue. Any help would be highly appreciated.
The SequentialExecutor is not a scheduler, so it only runs jobs that are triggered manually, from the UI or the run command. Certain kinds of tasks won't run with the SequentialExecutor; I think it's SubDagOperators that won't. Honestly it should be picking up dummy, bash, or python tasks, but you may save time figuring it out if you run the scheduler with the LocalExecutor and a database. Puckel has an example docker-compose file: https://github.com/puckel/docker-airflow