Is it possible to import the hooks that Airflow provides (Snowflake hook, AWS hook, etc.) in the code of a Python script run by a KubernetesPodOperator?
I may have the wrong idea of how to work with airflow.
Example: I have Airflow running with a Kubernetes cluster, and the tasks of the DAG run inside containers. One task takes some data from SQL and uploads it to S3. If I have to do this in a plain Python script inside the container, I need to code all of that stuff myself (with the possible errors and the time spent). If instead I can reuse the libraries (hooks) that Airflow has in my code, I save time and I'm also supported by a lot of developers.
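For reference, this is roughly what I'd like the script inside the container to look like (a sketch, assuming the image installs apache-airflow-providers-amazon and that a connection such as AIRFLOW_CONN_AWS_DEFAULT is injected as an environment variable; the paths and bucket name are placeholders):

```python
# Runs inside the container launched by a KubernetesPodOperator.
# The Airflow provider package is just a Python library, so it can be
# imported in a standalone script as long as it is installed in the image
# and a connection is supplied (e.g. via the AIRFLOW_CONN_AWS_DEFAULT
# environment variable).
def upload_to_s3(local_path, bucket, key):
    # Imported lazily so the script fails clearly if the provider is missing.
    from airflow.providers.amazon.aws.hooks.s3 import S3Hook

    hook = S3Hook(aws_conn_id="aws_default")
    hook.load_file(filename=local_path, bucket_name=bucket, key=key)
```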
Thank you
I have an image in Artifact Registry that does a unit of work:
it expects input files inside a certain directory (let's call it main_input)
runs them through a sequence of computations and writes the results to an output folder in Google Cloud Storage
The run time of each does not exceed 30 minutes, but I have thousands of such runs to perform.
Inside a single VM, I can create various containers from this image by mounting the container's main_input directory to the correct directories on the host, and running them.
However, I wonder if Cloud Run is a more scalable solution for this? or shall I look at other services/strategies?
Managing thousands of runs is not an easy task; you can use a scheduler like Airflow or Argo Workflows to run the tasks and restart them if needed.
For the container environment, I propose Kubernetes (GKE) over Cloud Run, for several reasons:
you have more permissions than with Cloud Run
you stay agnostic of GCP (your code can run on other platforms like AWS)
better management of app configurations and secrets, supported by all the CI/CD tools
less expensive: you can create scalable preemptible node pools to reduce cost and add lots of resources when needed
you can use the same cluster to run other applications for your company
you can use open source tools for log collection, monitoring, HTTP reverse proxying, ...
I am working on a Dockerized Python/Django project including a container for Celery workers, into which I have been integrating the off-the-shelf airflow docker containers.
I have Airflow successfully running Celery tasks in the pre-existing container, by instantiating a Celery app with the Redis broker and backend specified and making a remote call via send_task; however, none of the logging carried out by the Celery task makes it back to the Airflow logs.
Initially, as a proof of concept (I am completely new to Airflow), I had set it up to run the same code by exposing it to the Airflow containers and creating Airflow tasks to run it on the Airflow Celery worker container. This did result in all the logging being captured, but it's definitely not the way we want it architected, as it makes the Airflow containers very fat due to duplicating the dependencies of the Django project.
The documentation says "Most task handlers send logs upon completion of a task" but I wasn't able to find more detail that might give me a clue how to enable the same in my situation.
Is there any way to get these logs back to airflow when running the celery tasks remotely?
Instead of "returning the logs to Airflow", an easy-to-implement alternative (because Airflow natively supports it) is to activate remote logging. This way, all logs from all workers would end up e.g. on S3, and the webserver would automatically fetch them.
The following illustrates how to configure remote logging using an S3 backend. Other options (e.g. Google Cloud Storage, Elastic) can be implemented similarly.
Set remote_logging to True in airflow.cfg
Build an Airflow connection URI. This example from the official docs is particularly useful IMO. One should end up with something like:
aws://AKIAIOSFODNN7EXAMPLE:wJalrXUtnFEMI%2FK7MDENG%2FbPxRfiCYEXAMPLEKEY@/?endpoint_url=http%3A%2F%2Fs3%3A4566%2F
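Since the tricky part is percent-encoding the secret key and the extra parameters, here is a small sketch of building that URI with the standard library (credentials are the AWS documentation's example key pair, and the endpoint_url points at a local LocalStack-style S3):

```python
from urllib.parse import quote

access_key = "AKIAIOSFODNN7EXAMPLE"
secret_key = "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
endpoint_url = "http://s3:4566/"

# Percent-encode each part so slashes and colons survive inside the URI.
conn_uri = (
    f"aws://{quote(access_key, safe='')}:{quote(secret_key, safe='')}@/"
    f"?endpoint_url={quote(endpoint_url, safe='')}"
)
```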
It is also possible to create the connection through the webserver GUI, if needed.
Make the connection URI available to Airflow. One way of doing so is to make sure that the environment variable AIRFLOW_CONN_{YOUR_CONNECTION_NAME} is available. Example for connection name REMOTE_LOGS_S3:
export AIRFLOW_CONN_REMOTE_LOGS_S3=aws://AKIAIOSFODNN7EXAMPLE:wJalrXUtnFEMI%2FK7MDENG%2FbPxRfiCYEXAMPLEKEY@/?endpoint_url=http%3A%2F%2Fs3%3A4566%2F
Set remote_log_conn_id to the connection name (e.g. REMOTE_LOGS_S3) in airflow.cfg
Set remote_base_log_folder in airflow.cfg to the desired bucket/prefix. Example:
remote_base_log_folder = s3://my_bucket_name/my/prefix
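Putting the steps above together, the relevant airflow.cfg fragment would look like the following (a sketch using the names from this answer; in Airflow 2.x these keys live under the [logging] section, while in 1.10 they were under [core]):

```ini
[logging]
remote_logging = True
remote_log_conn_id = REMOTE_LOGS_S3
remote_base_log_folder = s3://my_bucket_name/my/prefix
```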
This related SO question goes into remote logging in more depth.
If debugging is needed, looking into any worker logs locally (i.e., inside the worker) should help.
I have tests that I run locally using a docker-compose environment.
I would like to implement these tests as part of our CI using Jenkins with Kubernetes on Google Cloud (following this setup).
I have been unsuccessful because docker-in-docker does not work.
It seems that right now there is no solution for this use-case. I have found other questions related to this issue: here and here.
I am looking for solutions that will let me run docker-compose. I have found solutions for running docker, but not for running docker-compose.
I am hoping someone else has had this use-case and found a solution.
Edit: Let me clarify my use-case:
When I detect a valid trigger (i.e. a push to the repo), I need to start a new job.
I need to set up an environment with multiple containers/instances (docker-compose).
The instances on this environment need access to code from git (mount volumes/create new images with the data).
I need to run tests in this environment.
I need to then retrieve results from these instances (JUnit test results for Jenkins to parse).
The problems I am having are with 2 and 3.
For 2, there is a problem running this in parallel (more than one job), since the Docker context is shared (docker-in-docker issues). If this runs on more than one node, I get clashes because of shared resources (ports, for example). My workaround is to limit it to one running instance and queue the rest (not ideal for CI).
For 3, there is a problem mounting volumes, since the Docker context is shared (docker-in-docker issues). I cannot mount the code that I check out in the job, because it is not present on the host responsible for running the Docker instances I trigger. My workaround is to build a new image from my template and copy the code into it, then use that for the test (this works, but it means I need docker cp tricks to get data back out, which is also not ideal).
I think the better way is to use pure Kubernetes resources and run the tests directly on Kubernetes, not via docker-compose.
You can convert your docker-compose files into Kubernetes resources using kompose utility.
Probably, you will need some adaptation of the conversion result, or maybe you should manually convert your docker-compose objects into Kubernetes objects. Possibly, you can just use Jobs with multiple containers instead of a combination of deployments + services.
Anyway, I definitely recommend using Kubernetes abstractions instead of running tools like docker-compose inside Kubernetes.
Moreover, you will still be able to run the tests locally, using Minikube to spawn a small all-in-one cluster right on your PC.
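As an illustrative sketch, a pair of docker-compose services could become a single Job with two containers and a shared volume for the JUnit output (all names and images here are placeholders, not from the question):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: integration-tests        # hypothetical name
spec:
  backoffLimit: 1
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: db
        image: postgres:13            # stand-in for a compose service
      - name: tests
        image: my-registry/tests:ci   # image built with the checked-out code
        volumeMounts:
        - name: results
          mountPath: /results         # JUnit XML written here for Jenkins
      volumes:
      - name: results
        emptyDir: {}
```

One caveat: a Job only completes when all of its containers exit, so a long-running service container like the database has to be shut down by the test container when it finishes (or run in a separate Pod/Service instead).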
Hi Stack Overflow community, I have a question about using Docker with AWS EC2. I am comfortable with EC2 but very new to Docker. I code in Python 3.6 and would like to automate the following process:
1: start an EC2 instance with Docker (Docker image stored in ECR)
2: run a one-off process and return results (let's call it "T") in a CSV format
3: store "T" in AWS S3
4: Shut down the EC2
The reason for using an EC2 instance is that the process is quite computationally intensive and is not feasible on my local computer. The reason for Docker is to ensure the development environment is the same across the team and the CI facility (currently CircleCI). I understand that interactions with AWS can mostly be done using Boto3.
I have been reading about AWS's own ECS and I have a feeling that it's geared more towards deploying a web-app with Docker rather than running a one-off process. However, when I searched around EC2 + Docker nothing else but ECS came up. I have also done the tutorial in AWS but it doesn't help much.
I have also considered running EC2 with a shell script (i.e. installing Docker, pulling the image, building the container, etc.), but it feels a bit hacky. Therefore my questions are:
1: Is ECS really the most appropriate solution in this scenario? (Or, in other words, is ECS designed for such operations?)
2: If so, are there any examples of people setting up and running a one-off process using ECS? (I find the setup really confusing, especially the terminology used.)
3: What are the other alternatives (if any)?
Thank you so much for the help!
Without knowing more about your process, I'd like to pose 2 alternatives for you.
Use Lambda
Depending on just how compute-intensive your process is, this may not be a viable option. However, if it is something that can be distributed, Lambda is awesome. You can find more information about the resource limitations here. On this route, you would simply write Python 3.6 code to perform your task and write "T" to S3.
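For illustration, a minimal handler along these lines might look like the following (the bucket name, key, and result rows are placeholders; boto3 is available in the Lambda runtime):

```python
import csv
import io


def build_csv(rows):
    """Serialize a list of dicts to CSV text (header from the first row)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def handler(event, context):
    # Stand-in for your actual computation producing "T".
    rows = [{"id": 1, "value": 42}]
    body = build_csv(rows)

    # Imported here so the sketch stays importable without boto3 installed.
    import boto3

    boto3.client("s3").put_object(
        Bucket="my-results-bucket",   # hypothetical bucket
        Key="results/T.csv",
        Body=body.encode("utf-8"),
    )
    return {"rows": len(rows)}
```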
Use Data Pipeline
With Data Pipeline, you can build a custom AMI (EC2) and use that as your image. You can then specify the size of the EC2 resource that you need to run this process. It sounds like your process would be pretty simple. You would need to define:
EC2resource
Specify AMI, Role, Security Group, Instance Type, etc.
ShellActivity
Bootstrap the EC2 instance as needed
Grab your code from S3, GitHub, etc.
Execute your code (Include in your code writing "T" to S3)
You can also schedule the pipeline to run at an interval/schedule or call it directly from boto3.
From this link I found that Google Cloud Dataflow uses Docker containers for its workers: Image for Google Cloud Dataflow instances
I see it's possible to find out the image name of the docker container.
But, is there a way I can get this docker container (ie from which repository do I go to get it?), modify it, and then indicate my Dataflow job to use this new docker container?
The reason I ask is that we need to install various C++, Fortran, and other library code in our Docker images so that the Dataflow jobs can call it, but these installations are very time-consuming, so we don't want to use the "resource" property option in df.
Update for May 2020
Custom containers are only supported within the Beam portability framework.
Pipelines launched within the portability framework currently must pass --experiments=beam_fn_api explicitly (as a user-provided flag) or implicitly (for example, all Python streaming pipelines pass it).
See the documentation here: https://cloud.google.com/dataflow/docs/guides/using-custom-containers?hl=en#docker
There will be more Dataflow-specific documentation once custom containers are fully supported by Dataflow runner. For support of custom containers in other Beam runners, see: http://beam.apache.org/documentation/runtime/environments.
The docker containers used for the Dataflow workers are currently private, and can't be modified or customized.
In fact, they are served from a private docker repository, so I don't think you're able to install them on your machine.
Update Jan 2021: Custom containers are now supported in Dataflow.
https://cloud.google.com/dataflow/docs/guides/using-custom-containers?hl=en#docker
You can generate a template from your job (see https://cloud.google.com/dataflow/docs/templates/creating-templates for details), then inspect the template file to find the workerHarnessContainerImage used.
I just created one for a job using the Python SDK and the image used in there is dataflow.gcr.io/v1beta3/python:2.0.0
Alternatively, you can run a job, then ssh into one of the instances and use docker ps to see all running docker containers. Use docker inspect [container_id] to see more details about volumes bound to the container etc.
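If you have the template file at hand, extracting the image is a one-liner; this sketch assumes the classic-template layout, where the image sits under environment.workerHarnessContainerImage:

```python
import json


def worker_image(template_text):
    """Pull the worker harness image out of a Dataflow template JSON string."""
    return json.loads(template_text)["environment"]["workerHarnessContainerImage"]
```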