I'd like to use Airflow with StatsD and DataDog to monitor whether a DAG takes, e.g., twice as long as its previous execution(s). So, I need some kind of real-time timer for a DAG (or operator).
I'm aware that Airflow supports some metrics.
However, to my understanding all of these metrics relate to finished tasks/DAGs, right? So it's not the solution I need, because I'd like to monitor running DAGs.
I also considered the execution_timeout/SLA features, but they are not suitable for this use case: I'd like to be notified that a DAG hangs, but I don't want to kill it.
There are a number of different ways you could handle this:
In the past I've configured a telemetry DAG which would collect the current state of all tasks/DAGs by querying the metadata tables. I'd collect these metrics and push them up to CloudWatch. This became problematic as these internal fields change often so we would run into issues when trying to upgrade to newer versions of Airflow.
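For illustration, here is a minimal sketch of the kind of check such a telemetry DAG might run, assuming Airflow's internal ORM models and boto3 (the CloudWatch namespace and metric name are placeholders, and, as noted, these internal models change between Airflow versions):

import boto3
from airflow import settings
from airflow.models import TaskInstance
from airflow.utils.state import State

def push_running_task_count():
    # Count currently running task instances straight from the metadata DB.
    session = settings.Session()
    running = session.query(TaskInstance).filter(TaskInstance.state == State.RUNNING).count()
    session.close()
    # Push the gauge to CloudWatch (namespace/metric name are placeholders).
    boto3.client("cloudwatch").put_metric_data(
        Namespace="Airflow/Telemetry",
        MetricData=[{"MetricName": "running_task_instances", "Value": running}],
    )

Wrapped in a PythonOperator on a short schedule, this gives a near-real-time view of running work without waiting for tasks to finish.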
There are also some well-maintained Prometheus exporters that some companies have open sourced. By setting these up you could poll the exposed export path as frequently as you wanted to (DataDog supports Prometheus).
These are just some of your options. Since the Airflow webserver is just a Flask app you can really expose metrics in whatever way you see fit.
As I understand it, you can monitor running tasks in DAGs using DataDog; refer to the DataDog integration with Airflow docs.
You can find the metrics in the DogStatsD docs. Also, looking at this page would be useful to understand what to monitor.
E.g., metrics such as the ones below:
airflow.operator_failures: monitors failed operators.
airflow.operator_successes: monitors successful operators.
airflow.dag_processing.processes: number of currently running DAG parsing processes.
airflow.scheduler.tasks.running: number of tasks running in executors (shown as tasks).
I would like to do some cloud processing on a very small cluster of machines (<5).
This processing should be based on 'jobs', where jobs are parameterized scripts that run in a certain docker environment.
As an example for what a job could be:
Run in docker image "my_machine_learning_docker"
Download some machine learning dataset from an internal server
Train some neural network on the dataset
Produce a result and upload it to a server again.
My use cases are not limited to machine learning however.
A job could also be:
Run in docker image "my_image_processing_docker"
Download a certain amount of images from some folder on a machine.
Run some image optimization algorithm on each of the images.
Upload the processed images to another server.
Now what I am looking for is some framework/tool that keeps track of the compute servers, receives my jobs and dispatches them to an available server. Advanced prioritization, load management or the like is not really required.
It should be possible to query the status of jobs and of the servers via an API (I want to do this from NodeJS).
Potentially, I could imagine this framework/tool dynamically spinning up these compute servers in AWS, Azure or something similar. That would not be a hard requirement though.
I would also like to host this solution myself. So I am not looking for a commercial solution for this.
Now I have done some research, and what I am trying to do has similarities with many, many existing projects, but I have not "quite" found what I am looking for.
Similar things I have found were (selection):
CI/CD solutions such as Jenkins/Gitlab CI. Very similar, but it seems to be tailored very much towards the CI/CD case, and I am not sure whether it is such a good idea to abuse a CI/CD solution for what I am trying to do.
Kubernetes: Appears to be able to do this somehow, but is said to be very complex. It also looks like overkill for what I am trying to do.
Nomad: Appears to be the best fit so far, but it has some proprietary vibes that I am not very much a fan of. Also it still feels a bit complex...
In general, there are many, many different projects and frameworks, and it is difficult to find out what the simplest solution is for what I am trying to do.
Can anyone suggest anything or point me in a direction?
Thank you
I would use Jenkins for this use case even if it appears to you as a "simple" one. You can start with the simplest pipeline, which can also deal with the increasing complexity of your jobs. Jenkins has an API and lots of plugins, and it can be run as a container for spinning up in a cloud environment.
It's possible you're looking for something like AWS Batch (https://aws.amazon.com/batch/) or Google Dataflow (https://cloud.google.com/dataflow). Out of the box they handle scaling, distribution, monitoring, etc.
But if you want to roll your own ....
Option 1: Queues
For your job distribution you are really just looking for a simple message queue that all of the workers listen on. In most messaging platforms, a queue supports deliver-once semantics. For example:
Active MQ: https://activemq.apache.org/how-does-a-queue-compare-to-a-topic
NATS: https://docs.nats.io/using-nats/developer/receiving/queues
Using queues for load distribution is a common pattern.
A queue-based solution works with both manual and automated load balancing: the more workers you spin up, the more instances you have consuming off the queue. The same messaging solution can be used to gather the results if you need to, using message reply semantics or a dedicated reply channel. You could use the result channel to post progress reports back, so your main application would know the status of each worker. Alternatively, workers could drop their status in a database. It probably depends on your preference for collecting results and how large the result sets would be. If they are large enough, you might even just drop results in an S3 bucket or some kind of filesystem.
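As a rough illustration of the worker side, here is a minimal sketch of a queue consumer using RabbitMQ via pika (just one example of such a queue, alongside the options above; the queue name, host and run_job function are placeholders). Every extra copy of this process adds capacity:

import json
import pika

def run_job(spec):
    # Placeholder: pull inputs, run the docker image named in the spec, upload results.
    print("running", spec)

def main():
    connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="jobs", durable=True)

    def on_message(ch, method, properties, body):
        run_job(json.loads(body))
        ch.basic_ack(delivery_tag=method.delivery_tag)  # ack only once the job finishes

    channel.basic_qos(prefetch_count=1)  # one job at a time per worker
    channel.basic_consume(queue="jobs", on_message_callback=on_message)
    channel.start_consuming()

if __name__ == "__main__":
    main()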
You could use something quite simple to manage the workers - Jenkins was already suggested and is definitely a solution I have seen used for running multiple instances across many servers, as you just need to install the Jenkins agent on each of the workers. This can work quite easily if you own or manage the physical servers it's running on. You could use TeamCity as well.
If you want something cloud hosted, it may depend on the technology you use. Kubernetes is probably overkill here, but it certainly could be used to spin up N nodes and increase/decrease the number of workers. To auto-scale you could publish a single metric - the queue depth - and trigger an increase in the number of workers based on how deep the queue is, weighing the cost of spinning up new nodes against the rate at which jobs are processed.
You could also look at some of the lightweight managed container solutions like fly.io or Heroku, which are both much easier to set up than K8s and would let you scale up easily.
Option 2: Web workers
Can you design your solution so that it can be run as a cloud function/web worker?
If so, you could set them up so that scaling is fully automated. You would hit the cloud function endpoint to request each job. The hosting engine would take care of the distribution and scaling of the workers. The results would be passed back in the body of the HTTP response ... a JSON blob.
Your workload may be too large for these solutions, but if it's actually fairly lightweight and quick, it could be a simple option.
I don't think these solutions would let you query the status of tasks easily.
If this option seems appealing there are quite a few choices:
https://workers.cloudflare.com/
https://cloud.google.com/functions
https://aws.amazon.com/lambda/
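If you go this route, a minimal sketch of such an HTTP-triggered worker using Google Cloud Functions' functions-framework might look like the following (the process_job function and the job parameters are placeholders):

import json
import functions_framework

def process_job(params):
    # Placeholder: run the actual processing for this job.
    return {"status": "done", "params": params}

@functions_framework.http
def handle_job(request):
    # The caller POSTs the job parameters as JSON and gets the result back as JSON.
    params = request.get_json(silent=True) or {}
    result = process_job(params)
    return json.dumps(result), 200, {"Content-Type": "application/json"}

Locally this can be run with functions-framework --target handle_job; once deployed, the platform handles scaling per request.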
Option 3: Google Cloud Tasks
This is a bit of a hybrid option. Essentially, GCP has a queue-based distribution workflow where the endpoint is a cloud function or some other supported worker, including Cloud Run, which uses Docker images. I've not actually used it myself, but maybe it fits the bill.
https://cloud.google.com/tasks
When I look at a problem like this, I think through the entirety of the data paths: the mapping between source image and target image, and any metadata or status information that needs to be collected. Additionally, failure conditions need to be handled, especially if a production service is going to be built.
I prefer running Python and PySpark with pandas UDFs to perform the orchestration and image processing.
S3FS lets me access S3. If using Azure or Google, Databricks' DBFS lets me seamlessly read and write to cloud storage without two extra file-copy steps.
Pyspark's binaryFile data source lets me list all of the input files to be processed. Spark lets me run this in batch or an incremental/streaming configuration. This design optimizes for end to end data flow and data reliability.
For a cluster manager I use Databricks, which lets me easily provision an auto-scaling cluster. The Databricks cluster manager lets users deploy docker containers or use cluster libraries or notebook scoped libraries.
The example below assumes the image is > 32MB and processes it out of band. If the image is in the KB range then dropping the content is not necessary and in-line processing can be faster (and simpler).
Pseudo code:
from typing import Iterator
import pandas as pd
from pyspark.sql.functions import col, pandas_udf

# List the input files with the binaryFile source; drop the bytes, since the
# image content is processed out of band via its path.
df = (spark.read
      .format("binaryFile")
      .option("pathGlobFilter", "*.png")
      .load("/path/to/data")
      .drop("content")
      )

def do_image_xform(path: str) -> str:
    # Do image transformation: read from the dbfs path, write to a dbfs path
    ...
    # return xform status
    return "success"

@pandas_udf("string")
def do_image_xform_udf(iterator: Iterator[pd.Series]) -> Iterator[pd.Series]:
    for paths in iterator:  # each batch is a pandas Series of paths
        yield paths.apply(do_image_xform)

df_status = df.withColumn("status", do_image_xform_udf(col("path")))
df_status.write.saveAsTable("status_table")  # triggers execution, saves status
This is a more general question about which types of payloads to host in a container. In our case we will use Service Fabric guest executables; for this post I will use the word Container to refer to both. The reason I do this is that they have similar properties, and I think more people may understand a container than an SF Guest Exe.
WebAPIs/services that need to scale are a good fit for containers, but this question is about what we call a "batch" job. This nomenclature comes out of the old .bat files, but in our case we are using a .NET Framework or Core .exe (console apps).
Currently, Windows Task Scheduler kicks off the batch running under a service account on a VM. We want the processing to happen at a certain time of day or day of the week and not before or after. There is not any real scaling here: there is one instance, which may or may not be multithreaded, and on average these jobs run between 2 and 15 minutes and then stop. Some run longer, some run shorter. I understand there are limitations to this approach, but this is the type of payload I'm discussing here.
As we modernize the Technology stack we are looking to use the Orchestrator as much as possible. As a technologist I've always tried to understand the different tools in our tool belts and not use a tool just because that's the one I used last, instead use the correct tool for the task.
We started out by not writing any more .NET console apps. Instead we put the business logic of these "batches" into WebAPIs, then had the task scheduler call the API when it needed to perform its action. If I put this into Service Fabric and host it, my concern is that the system resources are consumed for 23 hours and 45 minutes a day when they are not being used. That seems to be the opposite of what you would expect when using a container.
Now if I could spin up a Service Fabric Guest Exe/Container on demand and then destroy the instance of the app after it finishes, that could fit the need. Then I could have the benefits of the orchestrator without the detriment of having it consume resources all the time. I would hope to retire the batch server (VM), as its hardware usage is not optimized, and instead add resources to the cluster.
UPDATE
Looking at Vaclav's scalability doc, I think there might be a use case in here: https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-concepts-scalability He uses a "Workload Manager Service" combined with CreateServiceAsync to spin up an instance of the service on demand. I guess I would deploy the app to the image store but not create an instance of the app until needed. Then I need to figure out how to end it. Is it as simple as changing the infinite loop in Program.cs? The thing is, it doesn't look like there is a Program.cs in a Guest Executable.
This looks like a way to run a package until completion, which was released as part of 7.1. But how do we start a second execution of the service? I want to execute based on a request coming in.
https://learn.microsoft.com/en-us/azure/service-fabric/run-to-completion
Thoughts?
I'd like to get some clarification on whether Cloud Dataflow or Cloud Composer is the right tool for the job, as it wasn't clear to me from the Google documentation.
Currently, I'm using Cloud Dataflow to read a non-standard csv file -- do some basic processing -- and load it into BigQuery.
Let me give a very basic example:
# file.csv
type\x01date
house\x0112/27/1982
car\x0111/9/1889
From this file we detect the schema and create a BigQuery table, something like this:
`table`
type (STRING)
date (DATE)
And, we also format our data to insert (in python) into BigQuery:
DATA = [
("house", "1982-12-27"),
("car", "1889-9-11")
]
This is a vast simplification of what's going on, but this is how we're currently using Cloud Dataflow.
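For readers less familiar with Dataflow, roughly this kind of pipeline in the Beam Python SDK might look like the sketch below (the bucket, dataset and table names are placeholders, and the real pipeline detects the schema rather than hard-coding it):

import apache_beam as beam
from datetime import datetime

def parse_line(line):
    # Split on the \x01 delimiter and normalize the US-style date to ISO format.
    kind, date = line.split("\x01")
    return {"type": kind, "date": datetime.strptime(date, "%m/%d/%Y").strftime("%Y-%m-%d")}

with beam.Pipeline() as p:
    (p
     | "Read" >> beam.io.ReadFromText("gs://my-bucket/file.csv", skip_header_lines=1)
     | "Parse" >> beam.Map(parse_line)
     | "Write" >> beam.io.WriteToBigQuery(
           "my_project:my_dataset.table",
           schema="type:STRING,date:DATE",
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))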
My question then is, where does Cloud Composer come into the picture? What additional features could it provide on the above? In other words, why would it be used "on top of" Cloud Dataflow?
Cloud Composer (which is backed by Apache Airflow) is designed for task scheduling on a small scale.
Here is an example to help you understand:
Say you have a CSV file in GCS and, using your example, say you use Cloud Dataflow to process it and insert the formatted data into BigQuery. If this is a one-off thing, you have just finished it and it's perfect.
Now let's say your CSV file is overwritten at 01:00 UTC every day, and you want to run the same Dataflow job to process it every time it's overwritten. If you don't want to manually run the job at exactly 01:00 UTC regardless of weekends and holidays, you need something to periodically run the job for you (in our example, at 01:00 UTC every day). Cloud Composer can help you in this case. You provide a config to Cloud Composer, which includes what jobs to run (operators), when to run them (a job start time) and at what frequency (daily, weekly or even yearly).
That seems cool already; however, what if the CSV file is overwritten not at 01:00 UTC but at any time of day? How would you choose the daily running time? Cloud Composer provides sensors, which can monitor a condition (in this case, the CSV file's modification time), and it can guarantee that it kicks off a job only when the condition is satisfied.
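To make that concrete, a minimal, hypothetical DAG sketch of this pattern (the bucket, object, schedule and the command that launches the Dataflow pipeline are all placeholders):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.google.cloud.sensors.gcs import GCSObjectExistenceSensor

with DAG(
    dag_id="daily_csv_to_bq",
    start_date=datetime(2023, 1, 1),
    schedule_interval="0 1 * * *",  # 01:00 UTC every day
    catchup=False,
) as dag:
    wait_for_csv = GCSObjectExistenceSensor(
        task_id="wait_for_csv",
        bucket="my-input-bucket",    # placeholder
        object="exports/file.csv",   # placeholder
    )
    run_dataflow = BashOperator(
        task_id="run_dataflow",
        # assumes the Dataflow pipeline is packaged as a runnable script
        bash_command="python /pipelines/csv_to_bq.py --input gs://my-input-bucket/exports/file.csv",
    )
    wait_for_csv >> run_dataflow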
There are a lot more features that Cloud Composer/Apache Airflow provide, including having a DAG to run multiple jobs, failed task retry, failure notification and a nice dashboard. You can also learn more from their documentations.
For the basics of your described task, Cloud Dataflow is a good choice. Big data that can be processed in parallel is a good choice for Cloud Dataflow.
The real world of processing big data is usually messy. Data is usually somewhat to very dirty, arrives constantly or in big batches, and needs to be processed in time-sensitive ways. Usually it takes the coordination of more than one task/system to extract the desired data. Think of load, transform, merge, extract and store types of tasks. Big data processing is often glued together using shell scripts and/or Python programs. This makes automation, management, scheduling and control processes difficult.
Google Cloud Composer is a big step up from Cloud Dataflow. Cloud Composer is a cross-platform orchestration tool that supports AWS, Azure and GCP (and more) with management, scheduling and processing abilities.
Cloud Dataflow handles tasks. Cloud Composer manages entire processes coordinating tasks that may involve BigQuery, Dataflow, Dataproc, Storage, on-premises, etc.
My question then is, where does Cloud Composer come into the picture?
What additional features could it provide on the above? In other
words, why would it be used "on top of" Cloud Dataflow?
If you need / require more management, control, scheduling, etc. of your big data tasks, then Cloud Composer adds significant value. If you are just running a simple Cloud Dataflow task on demand once in a while, Cloud Composer might be overkill.
Cloud Composer (Apache Airflow) is designed for task scheduling.
Cloud Dataflow (Apache Beam) handles the tasks themselves.
For me, the Cloud Composer is a step up (a big one) from Dataflow. If I had one task, let's say to process my CSV file from Storage to BQ I would/could use Dataflow. But if I wanted to run the same job daily I would use Composer.
I have a RoR application with the prometheus-client gem installed, and a Telegraf daemon with the Prometheus input plugin working on the instance I want to monitor.
As far as I understand I need some kind of exporter middleware to collect metrics from Prometheus::Client.registry and expose them with /metrics HTTP endpoint.
What I don't really understand is how to pass all metrics from different environments (e.g. from a rake task and from the app's runtime code) into the same registry (an instance variable of Prometheus::Middleware::Exporter.new(registry)) of the same instance of the Prometheus::Middleware::Exporter middleware.
Also, will the urls = ["http://localhost:3000/metrics"] config of the Prometheus input plugin for Telegraf work on an EC2 instance, for example?
Thank you for any advice.
First of all, it's not recommended to use a catch-all exporter like Telegraf. You can read about some of the arguments in this blog post: https://www.robustperception.io/one-agent-to-rule-them-all/.
Then, if I understand your question correctly, it's not possible to use the same registry from multiple processes (like your Rails app and some rake task). Your Rails app will export its own metrics and you'll need to use a different approach for rake tasks.
As rake tasks are (usually) short-lived processes, they are not well suited to being pulled from. You have two options here. You can use the Pushgateway and the PGW support in client_ruby to push all relevant metrics at the end of the rake task execution (like how long it took, how many items were processed, whether there was any error, etc.). Alternatively, you can use the textfile collector in the node_exporter and write your metrics to disk at the end of your rake task execution. The node_exporter will then read that file and export the metrics when it gets scraped.
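Though the app here is Ruby, the textfile-collector idea is language-agnostic: at the end of the task you atomically write a file in the Prometheus text exposition format into the node_exporter's textfile directory. A Python sketch of the idea, with the directory and metric names as placeholders:

import os
import tempfile

def write_rake_metrics(duration_seconds, items_processed,
                       textfile_dir="/var/lib/node_exporter/textfile"):
    # Write atomically so the node_exporter never reads a half-written file.
    lines = [
        "# TYPE rake_task_duration_seconds gauge",
        f"rake_task_duration_seconds {duration_seconds}",
        "# TYPE rake_task_items_processed gauge",
        f"rake_task_items_processed {items_processed}",
        "",
    ]
    fd, tmp_path = tempfile.mkstemp(dir=textfile_dir)
    with os.fdopen(fd, "w") as f:
        f.write("\n".join(lines))
    os.replace(tmp_path, os.path.join(textfile_dir, "rake_task.prom"))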
I don't actively monitor stackoverflow, you'll get more help with these questions on the prometheus-users mailing list, see https://prometheus.io/community/.
Perhaps an easier way to go would be to set up a Telegraf client on the same host (with a Prometheus output and a statsd input) and then fire events from your application into Telegraf's input, in statsd format. Telegraf would then turn around and emit these metrics in Prometheus's format.
In this way you'll get both Telegraf's host-level metrics (free memory, disk usage, etc.) AND your application's metrics, all exported on the same port. It doesn't require any Ruby-specific code, just the ability to fire UDP messages from your app to a local port.
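The statsd line protocol is just plain text over UDP, so no special library is needed. A minimal Python sketch of what the app would fire at the local Telegraf statsd input (metric names and the port are placeholders; the same one-liner is easy to write in Ruby):

import socket

def send_statsd(metric, value, kind="c", host="127.0.0.1", port=8125):
    # e.g. "myapp.jobs.processed:1|c" increments a counter named myapp.jobs.processed
    payload = f"{metric}:{value}|{kind}".encode()
    socket.socket(socket.AF_INET, socket.SOCK_DGRAM).sendto(payload, (host, port))

send_statsd("myapp.jobs.processed", 1)           # counter
send_statsd("myapp.request_time_ms", 42, "ms")   # timer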
I'm running Celery, RabbitMQ and Gunicorn in Docker.
My question is this: I understand that Celery is designed for distributed processing. What I have seen no docs on at all is, assuming that I have several machines/nodes on the same LAN, how do they discover each other? Does RabbitMQ play a role? Do Celery instances somehow discover each other? Is there a list of suitable hosts somewhere? If so, how do I edit it?
Also, assuming I'm going to use only one node to handle the HTTP requests, do I still need to have gunicorn running on all nodes? I ask this because in the gunicorn start command, it has a setting for the number of workers. And, is this setting applicable only to that node, or as a max total for all connected nodes?
EDIT:
After the first answer, I started working on this. It seems that I need some sort of networking setup, either swarm or bridging etc. I should clarify that I'm using docker-compose to bring up the solution, and I see that a normal swarm setup doesn't work, and I have to use something slightly different if I go that route.
To be clear: I need a way in which I can add celery workers on separate hosts and have them be able to communicate with the "main" host so that I can increase the capacity of the system. If someone could provide a clear process for achieving this or a link to such, it'd be most helpful.
I hope I've expressed this clearly, please let me know if you need any further info.
Thanks!
I feel like @ffledgling didn't fully answer the question, so I am adding a note:
Here is a list of all events sent by the worker to the broker (in your case RabbitMQ): http://docs.celeryproject.org/en/latest/userguide/monitoring.html#event-reference
As you can see, there are a few worker-related messages/events:
worker-online
worker-heartbeat
worker-offline
All of them contain the worker's hostname. Therefore a successful handshake flow (not exactly a handshake, because the master doesn't respond with a message, but it works as a metaphor here) may look like this:
new worker online --> worker sends a worker-online message to the queue --> master receives it and starts to read logs from the worker host --> master schedules tasks --> ...
Beyond that, hostname is a standard body field in every event (both task- and worker-related); here is the documentation: http://docs.celeryproject.org/en/latest/internals/protocol.html?highlight=event%20reference#standard-body-fields
For example, if you look at the task-started event: it also contains the hostname, which is how the master knows who picked up the task and where to read the task's log from.
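For reference, a minimal sketch of consuming these events from the broker, based on Celery's real-time event receiver (the broker URL is a placeholder):

from celery import Celery

app = Celery(broker="amqp://guest@localhost//")  # placeholder broker URL

def on_worker_event(event):
    # worker-online / worker-heartbeat / worker-offline all carry the hostname
    print(event["type"], event["hostname"])

def main():
    with app.connection() as connection:
        recv = app.events.Receiver(connection, handlers={
            "worker-online": on_worker_event,
            "worker-heartbeat": on_worker_event,
            "worker-offline": on_worker_event,
        })
        recv.capture(limit=None, timeout=None, wakeup=True)

if __name__ == "__main__":
    main()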
I understand that Celery is designed for distributed processing. What
I have see no docs on at all is, assuming that I have several
machines/nodes on the same LAN, how do they discover each other? Does
RabbitMQ play a role? Do celery instances somehow discover each other?
Is there a list of suitable hosts somewhere? If so, how do I edit it?
Celery is a distributed task queue that works using a message brokering system such as RabbitMQ.
What essentially happens is that all Celery workers connect to a shared queue such as RabbitMQ. The master(s) dispatch work by pushing it onto the queue. Workers, which are connected to the queue as well, pull work off the queue and then attempt to execute it. Once a task is finished (successfully or otherwise), the worker pushes the results back onto the queue, which the master(s) can then query.
Given this architecture, you do not need to add a list of hosts; the workers "auto-detect" work. You simply need to start them up and ensure they can talk to the queue.
A slightly more detailed explanation from another SO answer.
Link to the architecture with a diagram.
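A minimal sketch of what this looks like in practice (the broker URL and the task are placeholders): every node runs the same app module and simply starts a worker pointed at the shared broker, with no host list anywhere.

# tasks.py -- shared by every node
from celery import Celery

app = Celery("tasks", broker="amqp://guest@rabbit-host//", backend="rpc://")

@app.task
def add(x, y):
    return x + y

# On each worker machine (any number, no host list required):
#   celery -A tasks worker --loglevel=info
#
# From the node handling HTTP requests:
#   result = add.delay(2, 3)
#   result.get(timeout=10)  # -> 5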
Also, assuming I'm going to use only one node to handle the HTTP
requests, do I still need to have gunicorn running on all nodes? I ask
this because in the gunicorn start command, it has a setting for the
number of workers. And, is this setting applicable only to that node,
or as a max total for all connected nodes?
No, you do not need gunicorn running on all the nodes, just the one you're using to serve HTTP requests via Python. Celery workers do not need gunicorn. The worker setting in gunicorn refers to the number of workers in the HTTP listener pool. This is separate, independent and unrelated to the set of workers that Celery uses.