Use a Dask Cluster in a PythonScriptStep - dask

Is it possible to have a multi-node Dask cluster be the compute for a PythonScriptStep with AML Pipelines?
We have a PythonScriptStep that uses featuretools's, deep feature synthesis (dfs) (docs). ft.dfs() has a param, n_jobs which allows for parallelization. When we run on a single machine, the job takes three hours, and runs much faster on a Dask. How can I operationalize this within an Azure ML pipeline?

We've been working and recently released a dask_cloudprovider.AzureMLCluster that might be of interest to you: link to repo. You can install it via pip install dask-cloudprovider.
The AzureMLCluster instantiates Dask cluster on AzureML service with elasticity of scaling up to 100s of nodes should you require that. The only required parameter is the Workspace object, but you can pass your own ComputeTarget should you choose to.
An example of how to use it you can found here. In this example I use my custom GPU/RAPIDS docker image but you can use any images within the Environment class.

Related

Job-based cloud processing solution

I would like to do some cloud processing on a very small cluster of machines (<5).
This processing should be based on 'jobs', where jobs are parameterized scripts that run in a certain docker environment.
As an example for what a job could be:
Run in docker image "my_machine_learning_docker"
Download some machine learning dataset from an internal server
Train some neural network on the dataset
Produce a result and upload it to a server again.
My use cases are not limited to machine learning however.
A job could also be:
Run in docker image "my_image_processing_docker"
Download a certain amount of images from some folder on a machine.
Run some image optimization algorithm on each of the images.
Upload the processed images to another server.
Now what I am looking for is some framework/tool, that keeps track of the compute servers, that receives my jobs and dispatches them to an available server. Advanced priorization, load management or something is not really required.
It should be possible to query the status of jobs and of the servers via an API (I want to do this from NodeJS).
Potentially, I could imagine this framework/tool to dynamically spin up these compute servers in in AWS, Azure or something. That would not be a hard requirement though.
I would also like to host this solution myself. So I am not looking for a commercial solution for this.
Now I have done some research, and what I am trying to do has similarities with many, many existing projects, but I have not "quite" found what I am looking for.
Similar things I have found were (selection):
CI/CD solutions such as Jenkins/Gitlab CI. Very similar, but it seems to be tailored very much towards the CI/CD case, and I am not sure whether it is such a good idea to abuse a CI/CD solution for what I am trying to do.
Kubernetes: Appears to be able to do this somehow, but is said to be very complex. It also looks like overkill for what I am trying to do.
Nomad: Appears to be the best fit so far, but it has some proprietary vibes that I am not very much a fan of. Also it still feels a bit complex...
In general, there are many many different projects and frameworks, and it is difficult to find out what the simplest solution is for what I am trying to do.
Can anyone suggest anything or point me in a direction?
Thank you
I would use Jenkins for this use case even if it appears to you as a “simple” one. You can start with the simplest pipeline which can also deal with increasing complexity of your job. Jenkins has API, lots of plugins, it can be run as container for a spin up in a cloud environment.
Its possible you're looking for something like AWS Batch flows: https://aws.amazon.com/batch/ or google datalflow https://cloud.google.com/dataflow. Out of the box they do scaling, distribution monitoring etc.
But if you want to roll your own ....
Option A: Queues
For your job distribution you are really just looking for a simple message queue that all of the workers listen on. In most messaging platforms, a Queue supports deliver once semantics. For example
Active MQ: https://activemq.apache.org/how-does-a-queue-compare-to-a-topic
NATS: https://docs.nats.io/using-nats/developer/receiving/queues
Using queues for load distribution is a common pattern.
A queue based solution can use both with manual or atuomated load balancing as the more workers you spin up, the more instances of your workers you have consuming off the queue. The same messaging solution can be used to gather the results if you need to, using message reply semantics or a dedicated reply channel. You could use the resut channel to post progress reports back and then your main application would know the status of each worker. Alternatively they could drop status in database. It probably depends on your preference for collecting results and how large the result sets would be. If large enough, you might even just drop results in an S3 bucket or some kind of filesystem.
You could use something quote simple to mange the workers - Jenkins was already suggested is in defintely a solution I have seen used for running multiple instances accross many servers as you just need to install the jenkins agent on each of the workers. This can work quote easily if you own or manage the physical servers its running on. You could use TeamCity as well.
If you want something cloud hosted, it may depend on the technology you use. Kubernetties is probably overkill here, but certiabnly could be used to spin up N nodes and increase/decrease those number of workers. To auto scale you could publish out a single metric - the queue depth - and trigger an increase in the number of workers based on how deep the queue is and a metric you work out based on cost of spinning up new nodes vs. the rate at which they are processed.
You could also look at some of the lightweight managed container solutions like fly.io or Heroku which are both much easier to setup than K8s and would let you scale up easily.
Option 2: Web workers
Can you design your solution so that it can be run as a cloud function/web worker?
If so you could set them up so that scaling is fully automated. You would hit the cloud function end point to request each job. The hosting engine would take care of the distribution and scaling of the workers. The results would be passed back in the body of the HTTP response ... a json blob.
Your workload may be too large for these solutions, but if its actually fairly light weight quick it could be a simple option.
I don't think these solutions would let you query the status of tasks easily.
If this option seems appealing there are quite a few choices:
https://workers.cloudflare.com/
https://cloud.google.com/functions
https://aws.amazon.com/lambda/
Option 3: Google Cloud Tasks
This is a bit of a hybrid option. Essentially GCP has a queue distribution workflow where the end point is a cloud function or some other supported worker, including cloud run which uses docker images. I've not actually used it myself but maybe it fits the bill.
https://cloud.google.com/tasks
When I look at a problem like this, I think through the entirity of the data paths. The map between source image and target image and any metadata or status information that needs to be collected. Additionally, failure conditions need to be handled, especially if a production service is going to be built.
I prefer running Python, Pyspark with Pandas UDFs to perform the orchestration and image processing.
S3FS lets me access s3. If using Azure or Google, Databricks' DBFS lets me seamlessly read and write to cloud storage without 2 extra copy file steps.
Pyspark's binaryFile data source lets me list all of the input files to be processed. Spark lets me run this in batch or an incremental/streaming configuration. This design optimizes for end to end data flow and data reliability.
For a cluster manager I use Databricks, which lets me easily provision an auto-scaling cluster. The Databricks cluster manager lets users deploy docker containers or use cluster libraries or notebook scoped libraries.
The example below assumes the image is > 32MB and processes it out of band. If the image is in the KB range then dropping the content is not necessary and in-line processing can be faster (and simpler).
Pseudo code:
df = (spark.read
.format("binaryFile")
.option("pathGlobFilter", "*.png")
.load("/path/to/data")
.drop("content")
)
from typing import Iterator
def do_image_xform(path:str):
# Do image transformation, read from dbfs path, write to dbfs path
...
# return xform status
return "success"
#pandas_udf("string")
def do_image_xform_udf(iterator: Iterator[pd.Series]) -> Iterator[pd.Series]:
for path in iterator:
yield do_image_xform(path)
df_status = df.withColumn('status',do_image_xform_udf(col(path)))
df_status.saveAsTable("status_table") # triggers execution, saves status.

On Demand Container Serving With GCP

I have a Docker image that will execute different logic depending on the environment variables that it is run with. You can imagine that running it with VAR=A will produce slightly different logic compared to running it with VAR=B.
We have an application that is meant to allow users to kick off tasks that are defined within this Docker image. Depending on the user attributes, different environment variables will need to be passed into the Docker container when it is run. The task is meant to run each time an event is generated through user action and then the container should shut down/be removed.
I'm trying to determine if GCP has any container services that best match what I'm looking for. My understanding of some of the services is:
Cloud Functions - can work well for consuming events and taking specific actions each time an event is triggered, but it is not suited for containerized workloads.
Cloud Run - a serverless way of deploying containers. As I understand it, a deployment on cloud run spins up a "service", and the environment variables must be passed in as part of the service definition. Because we may have a large number of combinations of environment variables (many of which may need to be running at once), it seems that this would end up creating a large number of services, which feels potentially clunky. This approach seems better for deploying a single service with static environment variables that needs to be auto-scaled by GCP.
GKE - another container orchestration platform. This is what I'm considering at the moment. The idea is that we would define a single job definition that can vary according to environment variables that are passed into it. The problem is that these jobs would need to be kicked off dynamically via code. This is a fairly straightforward process with kubectl, but the Kubernetes REST API seems fairly underdeveloped (or at least not that well documented). And the lack of information online on how to start jobs on-demand through the Kubernetes API makes me question whether this is the best approach.
Are there any tools that I'm missing that would be useful in spinning up containers on-demand with dynamic sets of environment variables and removing them when done?

One-time (per worker) setup for python dataflow?

My dataflow job has to download some file from remote server. I want to save the file on worker machine so job doesn't have to keep downloading the same file.
I tried to do this with setup method, however it seems setup will be called for each thread, and multiple threads can call setup in parallel (I cannot find documentation around this, but based on my experience my job tries to write file data in parallel and hence causing malformed data).
Is there a way to perform one-time setup whenever worker machine is launched?
I also checked Apache Beam: DoFn.Setup equivalent in Python SDK but I believe it focuses around per-thread setup.
The Beam model doesn't include a specific callback for when a VM is created because the model doesn't guarantee the runtime environment. However, because you are using Dataflow that uses containers you have two options:
Modify the container image
Modify the setup.py
The first will give you direct control over the container image, and it works for all languages. The second only works for Python.

memory/cpu utilization issues when running dask on cluster (SGE)

I ported my code that uses pandas to dask (basically replaced pd.read_csv with dd.read_csv). The code essentially applies a series of filters/transformations on the (dask) dataframe. Similarly, I use dask bag/array instead of numpy array-like data. The code works great locally (i.e., on my workstation or on a virtual machine). However, when I run the code to a cluster (SGE), my jobs gets killed by the scheduler saying that memory/cpu usage have exceeded the allocated limit. It looks like dask is trying to consume all memory/threads available on the node as opposed to what has been allocated by the scheduler. There seem to be two approaches to fix this issue - (a) set memory/cpu limits for dask from within the code as soon as we load the library (just like we set matplotlib.use("Agg") as soon as we load matplotlib when we need to set the backend) and/or (b) have dask understand the memory limits set by the scheduler. Could you please help on how to go about this issue. My preference would be to set mem/cpu limits for dask from within the code. PS: I understand there is a way to spin up dask workers in a cluster environment and specify these limits, but I am interested in running my code that works great locally on the cluster with minimal additional changes. Thanks!

How to run multiple times a Docker container with different parameters in Kubernetes?

I have a Docker container that reads a variable which I provide to it during the execution. Then I though that I would like to run many of those and pass a different value as variable for each one of those. Then, I just created a simple text file which contains all the values I want to pass into (they are about 20k different ones) and I am using gnu-parallel to spawn multiple Dockers in parallel.
My question is how I could do something like that in a Kubernetes environment?
Sounds like you want you want to do can be achieved using kubernetes jobs.
I would advise against using gnu parallel on kubernetes unless you can fit all the jobs in one node. If this is the case I think it's ok, just set the cpu request in the job template.

Resources