airflow blocking threads machine learning end to end

I am running Airflow for end-to-end machine learning, and some machine learning tasks take hours to train. I do not want the DAGs and the corresponding training tasks to be the bottleneck. How would I allow multiple tasks to be scheduled? I'm assuming I'd have to change my executor from Sequential to Celery. Is there anything else I'd be missing?

Besides Celery, you can also use the LocalExecutor, which is simpler to set up and may fit your needs. Furthermore, you can create a pool for machine learning tasks while using the default pool for the rest of the tasks:
https://airflow.apache.org/docs/stable/concepts.html?highlight=pool#pools
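For reference, a minimal sketch of what this could look like, assuming executor = LocalExecutor is set in airflow.cfg (which requires a non-SQLite metadata database) and that a pool named "ml_training" has been created in the Admin UI; the DAG, task, and pool names here are placeholders:

from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def train_model():
    ...  # hypothetical long-running training step

dag = DAG("ml_pipeline", start_date=datetime(2020, 1, 1), schedule_interval="@daily")

train = PythonOperator(
    task_id="train_model",
    python_callable=train_model,
    pool="ml_training",  # long training tasks compete only for this pool's slots
    dag=dag,
)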

Related

Job-based cloud processing solution

I would like to do some cloud processing on a very small cluster of machines (<5).
This processing should be based on 'jobs', where jobs are parameterized scripts that run in a certain docker environment.
As an example for what a job could be:
Run in docker image "my_machine_learning_docker"
Download some machine learning dataset from an internal server
Train some neural network on the dataset
Produce a result and upload it to a server again.
My use cases are not limited to machine learning, however.
A job could also be:
Run in docker image "my_image_processing_docker"
Download a certain amount of images from some folder on a machine.
Run some image optimization algorithm on each of the images.
Upload the processed images to another server.
Now what I am looking for is some framework/tool that keeps track of the compute servers, receives my jobs, and dispatches them to an available server. Advanced prioritization, load management, or the like is not really required.
It should be possible to query the status of jobs and of the servers via an API (I want to do this from NodeJS).
Potentially, I could imagine this framework/tool dynamically spinning up these compute servers in AWS, Azure, or similar. That would not be a hard requirement though.
I would also like to host this solution myself. So I am not looking for a commercial solution for this.
Now I have done some research, and what I am trying to do has similarities with many, many existing projects, but I have not "quite" found what I am looking for.
Similar things I have found were (selection):
CI/CD solutions such as Jenkins/Gitlab CI. Very similar, but it seems to be tailored very much towards the CI/CD case, and I am not sure whether it is such a good idea to abuse a CI/CD solution for what I am trying to do.
Kubernetes: Appears to be able to do this somehow, but is said to be very complex. It also looks like overkill for what I am trying to do.
Nomad: Appears to be the best fit so far, but it has some proprietary vibes that I am not very much a fan of. Also it still feels a bit complex...
In general, there are many different projects and frameworks, and it is difficult to find out what the simplest solution is for what I am trying to do.
Can anyone suggest anything or point me in a direction?
Thank you
I would use Jenkins for this use case even if it appears to you as a "simple" one. You can start with the simplest pipeline, which can also deal with the increasing complexity of your jobs. Jenkins has an API and lots of plugins, and it can be run as a container for a spin-up in a cloud environment.
It's possible you're looking for something like AWS Batch flows: https://aws.amazon.com/batch/ or Google Dataflow: https://cloud.google.com/dataflow. Out of the box they do scaling, distribution, monitoring, etc.
But if you want to roll your own ....
Option A: Queues
For your job distribution you are really just looking for a simple message queue that all of the workers listen on. In most messaging platforms, a queue supports deliver-once semantics. For example:
Active MQ: https://activemq.apache.org/how-does-a-queue-compare-to-a-topic
NATS: https://docs.nats.io/using-nats/developer/receiving/queues
Using queues for load distribution is a common pattern.
A queue-based solution works with both manual and automated load balancing: the more workers you spin up, the more consumers you have pulling off the queue. The same messaging solution can be used to gather the results if you need to, using message reply semantics or a dedicated reply channel. You could use the result channel to post progress reports back, and your main application would then know the status of each worker. Alternatively, workers could drop their status in a database. It probably depends on your preference for collecting results and how large the result sets are. If they are large enough, you might even just drop results in an S3 bucket or some kind of filesystem.
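As a rough illustration of the queue-group pattern (not a complete solution), here is a worker sketch using the Python NATS client (nats-py); the "jobs" subject, the "workers" queue group, and run_job are all placeholders:

import asyncio
import nats

async def main():
    nc = await nats.connect("nats://localhost:4222")

    async def handle(msg):
        # run_job is a hypothetical function that runs the parameterized script,
        # e.g. by launching the right docker image with the given arguments.
        status = run_job(msg.data.decode())
        await msg.respond(status.encode())  # reply semantics: report status back to the submitter

    # Every worker subscribes with the same queue group, so each job
    # is delivered to exactly one of them.
    await nc.subscribe("jobs", queue="workers", cb=handle)
    while True:
        await asyncio.sleep(1)

asyncio.run(main())

The submitting side would publish job parameters to "jobs" (or use a request/reply call) and collect statuses from the replies.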
You could use something quite simple to manage the workers. Jenkins was already suggested and is definitely a solution I have seen used for running multiple instances across many servers, as you just need to install the Jenkins agent on each of the workers. This can work quite easily if you own or manage the physical servers it is running on. You could use TeamCity as well.
If you want something cloud hosted, it may depend on the technology you use. Kubernetes is probably overkill here, but it certainly could be used to spin up N nodes and increase/decrease the number of workers. To auto-scale, you could publish a single metric, the queue depth, and trigger an increase in the number of workers based on how deep the queue is and a threshold you work out from the cost of spinning up new nodes vs. the rate at which jobs are processed.
You could also look at some of the lightweight managed container solutions like fly.io or Heroku, which are both much easier to set up than K8s and would let you scale up easily.
Option 2: Web workers
Can you design your solution so that it can be run as a cloud function/web worker?
If so, you could set them up so that scaling is fully automated. You would hit the cloud function endpoint to request each job. The hosting engine would take care of the distribution and scaling of the workers. The results would be passed back in the body of the HTTP response as a JSON blob.
Your workload may be too large for these solutions, but if it is actually fairly lightweight and quick, this could be a simple option.
I don't think these solutions would let you query the status of tasks easily.
If this option seems appealing there are quite a few choices:
https://workers.cloudflare.com/
https://cloud.google.com/functions
https://aws.amazon.com/lambda/
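For a sense of scale, a cloud-function worker can be very small; here is a sketch using the AWS Lambda Python handler signature behind an HTTP trigger, where do_job stands in for your actual processing:

import json

def handler(event, context):
    params = json.loads(event["body"])  # job parameters posted by the caller
    result = do_job(params)             # do_job is a hypothetical processing function
    return {
        "statusCode": 200,
        "body": json.dumps(result),     # results passed back as a JSON blob
    }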
Option 3: Google Cloud Tasks
This is a bit of a hybrid option. Essentially, GCP has a queue distribution workflow where the endpoint is a cloud function or some other supported worker, including Cloud Run, which uses docker images. I've not actually used it myself, but maybe it fits the bill.
https://cloud.google.com/tasks
When I look at a problem like this, I think through the entirety of the data paths: the mapping between source images and target images, and any metadata or status information that needs to be collected. Additionally, failure conditions need to be handled, especially if a production service is going to be built.
I prefer running Python/PySpark with Pandas UDFs to perform the orchestration and image processing.
S3FS lets me access S3. If using Azure or Google, Databricks' DBFS lets me seamlessly read and write to cloud storage without two extra file-copy steps.
PySpark's binaryFile data source lets me list all of the input files to be processed. Spark lets me run this in a batch or an incremental/streaming configuration. This design optimizes for end-to-end data flow and data reliability.
For a cluster manager I use Databricks, which lets me easily provision an auto-scaling cluster. The Databricks cluster manager lets users deploy docker containers or use cluster libraries or notebook scoped libraries.
The example below assumes the image is > 32MB and processes it out of band. If the image is in the KB range then dropping the content is not necessary and in-line processing can be faster (and simpler).
Pseudo code:
from typing import Iterator

import pandas as pd
from pyspark.sql.functions import col, pandas_udf

# List the input files without reading their bytes (drop "content" to keep rows small).
df = (spark.read
      .format("binaryFile")
      .option("pathGlobFilter", "*.png")
      .load("/path/to/data")
      .drop("content")
      )

def do_image_xform(path: str) -> str:
    # Do the image transformation: read from the dbfs path, write to a dbfs path.
    ...
    # Return the xform status.
    return "success"

@pandas_udf("string")
def do_image_xform_udf(iterator: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # Each element is a pandas Series of file paths; transform each path in the batch.
    for paths in iterator:
        yield paths.apply(do_image_xform)

df_status = df.withColumn("status", do_image_xform_udf(col("path")))
df_status.write.saveAsTable("status_table")  # triggers execution, saves status

Dask workers get stuck in SLURM queue and won't start until the master hits the walltime

Lately, I've been trying to do some machine learning work with Dask on an HPC cluster which uses the SLURM scheduler. Importantly, on this cluster SLURM is configured to have a hard wall-time limit of 24h per job.
Initially, I ran my code with a single worker, but my job was running out of memory. I tried to increase the number of workers (and, therefore, the number of requested nodes), but the workers got stuck in the SLURM queue (with the reason given as "Priority"). Meanwhile, the master would run and eventually hit the wall-time, leaving the workers to die when they finally started.
Thinking that the issue might be that I was requesting too many SLURM jobs, I tried condensing the workers into a single, multi-node job using a workaround I found on GitHub. Nevertheless, these multi-node jobs ran into the same issue.
I then attempted to get in touch with the cluster's IT support team. Unfortunately, they are not too familiar with Dask and could only provide general pointers. Their primary suggestions were to either put the master job on hold until the workers were ready, or launch new masters every 24h until the workers could leave the queue. To help accomplish this, they cited the SLURM options --begin and --dependency. Much to my chagrin, I was unable to find a solution using either suggestion.
As such, I would like to ask if, in a Dask/SLURM environment, there is a way to force the master to not start until the workers are ready, or to launch a master that is capable of "inheriting" workers previously created by another master.
Thank you very much for any help you can provide.
I might be wrong on the below, but in my experience with SLURM, Dask itself won't be able to communicate with the SLURM scheduler. There is dask_jobqueue, which helps to create workers, so one option could be to launch the scheduler on a low-resource node (that presumably could be requested for longer).
There is a relatively new feature of heterogeneous jobs on SLURM (see https://slurm.schedmd.com/heterogeneous_jobs.html), and as I understand it, this will guarantee that your workers, scheduler, and client launch at the same time; perhaps this is something your IT team can help with, as it is specific to SLURM (rather than Dask). Unfortunately, this will work only for non-interactive workloads.
The answer to my problem turned out to be deceptively simple. Our SLURM configuration uses the backfill scheduler. Because my Dask workers were requesting the maximum possible --time (24 hours), the backfill scheduler wasn't working effectively. As soon as I lowered --time to the amount I believed was necessary for the workers to finish running the script, they left "queue hell"!
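For anyone hitting the same wall, a sketch of the kind of dask_jobqueue setup this implies; the cores, memory, queue, and walltime values below are assumptions you would tune to your own cluster:

from dask_jobqueue import SLURMCluster
from dask.distributed import Client

cluster = SLURMCluster(
    cores=8,
    memory="32GB",
    queue="normal",          # assumption: your SLURM partition
    walltime="04:00:00",     # well under the 24h hard limit so backfill can slot workers in
)
cluster.scale(4)             # request four worker jobs
client = Client(cluster)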

Any way of monitoring Airflow DAG's execution time?

I'd like to use Airflow with StatsD and DataDog to monitor whether a DAG takes, e.g., twice as long as its previous execution(s). So, I need some kind of real-time timer for a DAG (or operator).
I'm aware that Airflow supports some metrics.
However, to my understanding all of the metrics are related to finished tasks/DAGs, right? So it's not the solution, because I'd like to monitor running DAGs.
I also considered the execution_timeout/SLA features, but they are not suitable for this use case:
I'd like to be notified that some DAG hangs, but I don't want to kill it.
There are a number of different ways you could handle this:
In the past I've configured a telemetry DAG which would collect the current state of all tasks/DAGs by querying the metadata tables. I'd collect these metrics and push them up to CloudWatch. This became problematic because these internal fields change often, so we would run into issues when trying to upgrade to newer versions of Airflow.
There are also some well-maintained Prometheus exporters that some companies have open sourced. By setting these up you could poll the exposed export path as frequently as you wanted to (DataDog supports Prometheus).
These are just some of your options. Since the Airflow webserver is just a Flask app you can really expose metrics in whatever way you see fit.
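A hedged sketch of the telemetry-DAG approach mentioned above, querying the metadata models for running DAG runs and pushing their age as a gauge; the metric names and the StatsD endpoint are assumptions, and the internal models can change between Airflow versions:

from datetime import datetime, timezone

from airflow.models import DagRun
from airflow.utils.state import State
from statsd import StatsClient

def push_running_dag_ages():
    # Call this from a frequently scheduled PythonOperator (or a small cron job).
    client = StatsClient("localhost", 8125, prefix="airflow.custom")
    for run in DagRun.find(state=State.RUNNING):
        age = (datetime.now(timezone.utc) - run.start_date).total_seconds()
        client.gauge("dag_running_seconds.{}".format(run.dag_id), age)

DataDog can then alert when the gauge for a given DAG exceeds, say, twice its usual duration.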
As I understand it, you can monitor running tasks in DAGs using DataDog; refer to its Airflow integration docs.
You may refer to the metrics via the DogStatsD docs. Also, looking at this page would be useful to understand what to monitor.
E.g., metrics such as:
airflow.operator_failures: monitors failed operators.
airflow.operator_successes: monitors successful operators.
airflow.dag_processing.processes: number of currently running DAG parsing processes.
airflow.scheduler.tasks.running: number of tasks running in executors (shown as task).

How to batch schedule dask_jobqueue jobs in DASK instead of concurrent?

By my reading of Dask-Jobqueue (https://jobqueue.dask.org/en/latest/), and by testing on our SLURM cluster, it seems that when you set cluster.scale(n) and create client = Client(cluster), none of your jobs are able to start until all n of them are able to start.
Suppose you have 999 jobs to run and a cluster with 100 nodes or slots; worse yet, suppose other people share the cluster, and maybe some of them have long-running jobs. Admins sometimes need to do maintenance on some of the nodes, so they add and remove nodes. You never know how much parallelism you'll be able to get. You want the cluster scheduler to simply take 999 jobs (in SLURM, these would be submitted via sbatch), run them in any order on any available nodes, store results in a shared directory, and have a dependent job (in SLURM, that would be sbatch --dependency=) process the shared directory after all 999 jobs have completed. Is this possible with Dask somehow?
It seems a fundamental limitation of the architecture, that all the jobs are expected to run in parallel, and the user must specify the degree of parallelism.
Your understanding is not correct. Dask can run with fewer than the specified number of jobs, just as you've asked for. It will use whatever resources arrive.
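A sketch of what that looks like in practice with adaptive scaling; the resource numbers, process_one, and postprocess_results are placeholders:

from dask_jobqueue import SLURMCluster
from dask.distributed import Client, wait

cluster = SLURMCluster(cores=1, memory="4GB", walltime="02:00:00")
cluster.adapt(minimum=0, maximum=100)   # use however many workers the queue grants, up to 100
client = Client(cluster)

futures = client.map(process_one, range(999))  # process_one: your per-job function (hypothetical)
wait(futures)                                  # tasks run as workers become available, in any order
postprocess_results("/shared/results")         # the "dependent job" step, run once everything is done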

Redistribute dask tasks among the cluster

I am abusing Dask as a task scheduler for long-running tasks with map(..., pure=False). So I am not interested in the Dask graph; I just use Dask as a way to distribute unix commands.
Let's say I have 1000 tasks and they run for a week on a cluster of 30 workers. What I have seen is that if a worker goes down, its tasks get redistributed to the remaining workers.
Sometimes I can free resources from other simulations and I add new workers to the Dask cluster. However, those workers then have 0 tasks assigned and sit idle. They only get new tasks if one of the old workers goes down; then the tasks get redistributed.
So, my question: "Can I manually redistribute and shuffle the tasks on the dask cluster"?
The scheduler should already be balancing old tasks onto new workers. Information on our work-stealing policies lives here:
http://distributed.readthedocs.io/en/latest/work-stealing.html
So I'm surprised to hear that tasks don't automatically rebalance. If you're able to produce an MCVE to reproduce the problem, I'd enjoy taking a look.
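If it helps as a starting point, here is a rough sketch of such an MCVE: submit long tasks, add workers mid-run, then check how tasks are spread across workers; long_task and the sleep intervals are only illustrative:

import time
from dask.distributed import Client, LocalCluster

def long_task(i):
    time.sleep(60)   # stand-in for a long-running unix command
    return i

cluster = LocalCluster(n_workers=2, threads_per_worker=1)
client = Client(cluster)
futures = client.map(long_task, range(100), pure=False)

time.sleep(10)
cluster.scale(4)              # add workers while plenty of tasks are still queued
time.sleep(10)
print(client.processing())    # per-worker view of the tasks currently being processed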
