Is there any way to get Dask diagnostic data (not the dashboard) for a Dask.distributed client?
Dask already provides a nice Bokeh dashboard, where it plots quite a lot of diagnostic information. However, what I want is not the plots but their values: something like the progress value and the CPU and memory usage, each with a timestamp. I would like to store these values in a database for my own monitoring purposes.
So far, I have tried the function dask.distributed.get_task_stream(). It provides information about the workers as a list, but I would like to get it in a streaming manner, exactly what the Task Stream plot shows in the dashboard.
Note: there is a package called dask.diagnostics from which you can import ProgressBar, Profiler, ResourceProfiler, etc. However, from my current understanding, they are only for the single-machine scheduler and not for the distributed scheduler. Am I right? Or can I use them in a distributed environment?
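For reference, this is roughly what I mean by using them with the single-machine scheduler (a minimal sketch; the record fields in the comments are from my reading of the docs, so treat them as assumptions):

import dask.array as da
from dask.diagnostics import Profiler, ResourceProfiler

x = da.random.random((2000, 2000), chunks=(500, 500))

# These context managers instrument the single-machine schedulers only
# (threads/processes/synchronous), not a distributed Client.
with Profiler() as prof, ResourceProfiler(dt=0.25) as rprof:
    x.mean().compute(scheduler="threads")

# prof.results: task records (key, task, start_time, end_time, worker_id)
# rprof.results: resource samples (time, mem, cpu)
print(prof.results[:3])
print(rprof.results[:3])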
In most cases we recommend the get_task_stream function that you've already found.
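For example, a minimal sketch of pulling the raw task-stream records from a client; the exact record fields here are from memory, so treat them as an assumption:

from dask.distributed import Client
import dask.array as da

client = Client()  # or point it at your existing scheduler

x = da.random.random((2000, 2000), chunks=(500, 500))
x.mean().compute()

# A list of dicts, one per finished task, with fields such as
# "key", "worker", "status" and "startstops" (timed compute/transfer intervals)
for record in client.get_task_stream()[:5]:
    print(record["key"], record["worker"], record["startstops"])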
If you want to trigger something on every transition you might consider Scheduler plugins. In particular, the task stream plugin that feeds the dashboard lives here:
https://github.com/dask/distributed/blob/master/distributed/diagnostics/task_stream.py
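As a rough sketch (not the exact plugin linked above), a custom SchedulerPlugin could record every transition and push it to your own store:

from distributed.diagnostics.plugin import SchedulerPlugin
import time

class TransitionLogger(SchedulerPlugin):
    """Record every task transition; swap the print for a database insert."""

    def transition(self, key, start, finish, *args, **kwargs):
        # e.g. insert (timestamp, key, start_state, finish_state) into your DB
        print(time.time(), key, start, "->", finish)

# Registration: in recent distributed versions this can be done from the client,
# e.g. client.register_scheduler_plugin(TransitionLogger()); otherwise attach it
# on the scheduler itself with scheduler.add_plugin(TransitionLogger()).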
I would like to do some cloud processing on a very small cluster of machines (<5).
This processing should be based on 'jobs', where jobs are parameterized scripts that run in a certain docker environment.
As an example for what a job could be:
Run in docker image "my_machine_learning_docker"
Download some machine learning dataset from an internal server
Train some neural network on the dataset
Produce a result and upload it to a server again.
My use cases are not limited to machine learning however.
A job could also be:
Run in docker image "my_image_processing_docker"
Download a certain amount of images from some folder on a machine.
Run some image optimization algorithm on each of the images.
Upload the processed images to another server.
Now what I am looking for is some framework/tool that keeps track of the compute servers, receives my jobs, and dispatches them to an available server. Advanced prioritization, load management or the like is not really required.
It should be possible to query the status of jobs and of the servers via an API (I want to do this from NodeJS).
Potentially, I could imagine this framework/tool dynamically spinning up these compute servers in AWS, Azure or the like. That would not be a hard requirement though.
I would also like to host this solution myself. So I am not looking for a commercial solution for this.
Now I have done some research, and what I am trying to do has similarities with many, many existing projects, but I have not "quite" found what I am looking for.
Similar things I have found were (selection):
CI/CD solutions such as Jenkins/Gitlab CI. Very similar, but it seems to be tailored very much towards the CI/CD case, and I am not sure whether it is such a good idea to abuse a CI/CD solution for what I am trying to do.
Kubernetes: Appears to be able to do this somehow, but is said to be very complex. It also looks like overkill for what I am trying to do.
Nomad: Appears to be the best fit so far, but it has some proprietary vibes that I am not very much a fan of. Also it still feels a bit complex...
In general, there are many many different projects and frameworks, and it is difficult to find out what the simplest solution is for what I am trying to do.
Can anyone suggest anything or point me in a direction?
Thank you
I would use Jenkins for this use case even if it appears to you as a “simple” one. You can start with the simplest pipeline, which can also deal with the increasing complexity of your jobs. Jenkins has an API and lots of plugins, and it can be run as a container for a quick spin-up in a cloud environment.
It's possible you're looking for something like AWS Batch: https://aws.amazon.com/batch/ or Google Dataflow: https://cloud.google.com/dataflow. Out of the box they do scaling, distribution, monitoring, etc.
But if you want to roll your own ....
Option 1: Queues
For your job distribution you are really just looking for a simple message queue that all of the workers listen on. In most messaging platforms, a queue supports deliver-once semantics. For example:
Active MQ: https://activemq.apache.org/how-does-a-queue-compare-to-a-topic
NATS: https://docs.nats.io/using-nats/developer/receiving/queues
Using queues for load distribution is a common pattern.
A queue-based solution works with both manual and automated load balancing, as the more workers you spin up, the more consumers you have reading off the queue. The same messaging solution can be used to gather the results if you need to, using message reply semantics or a dedicated reply channel. You could use the result channel to post progress reports back, so that your main application knows the status of each worker. Alternatively, workers could drop their status in a database. It probably depends on your preference for collecting results and how large the result sets would be. If they are large enough, you might even just drop results in an S3 bucket or some kind of filesystem.
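A minimal sketch of that pattern using NATS queue groups with the nats-py client; the subject name, payloads and the run_job placeholder are made up for illustration:

import asyncio
import nats  # nats-py client

def run_job(payload: bytes) -> bytes:
    # placeholder: start the docker container, run the script, collect status
    return b"success"

async def main():
    nc = await nats.connect("nats://localhost:4222")

    async def handle_job(msg):
        # msg.data carries the job parameters (e.g. a JSON blob)
        result = run_job(msg.data)
        if msg.reply:
            await msg.respond(result)  # reply channel carries status/result

    # Workers subscribing with the same queue group share the work: each
    # message is delivered to only one member of the group.
    await nc.subscribe("jobs", queue="workers", cb=handle_job)
    await asyncio.Event().wait()  # keep the worker alive

asyncio.run(main())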
You could use something quite simple to manage the workers. Jenkins was already suggested and is definitely a solution I have seen used for running multiple instances across many servers, as you just need to install the Jenkins agent on each of the workers. This can work quite easily if you own or manage the physical servers it runs on. You could use TeamCity as well.
If you want something cloud hosted, it may depend on the technology you use. Kubernetes is probably overkill here, but it certainly could be used to spin up N nodes and increase/decrease the number of workers. To auto-scale you could publish a single metric - the queue depth - and trigger an increase in the number of workers based on how deep the queue is and a threshold you work out from the cost of spinning up new nodes vs. the rate at which jobs are processed.
You could also look at some of the lightweight managed container solutions like fly.io or Heroku, which are both much easier to set up than K8s and would let you scale up easily.
Option 2: Web workers
Can you design your solution so that it can be run as a cloud function/web worker?
If so, you could set them up so that scaling is fully automated. You would hit the cloud function endpoint to request each job. The hosting engine would take care of the distribution and scaling of the workers. The results would be passed back in the body of the HTTP response ... a JSON blob.
Your workload may be too large for these solutions, but if it's actually fairly lightweight and quick, it could be a simple option.
I don't think these solutions would let you query the status of tasks easily.
If this option seems appealing there are quite a few choices:
https://workers.cloudflare.com/
https://cloud.google.com/functions
https://aws.amazon.com/lambda/
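A sketch of what one of those workers might look like as a Python HTTP cloud function (using the functions-framework style that GCP supports; the job logic itself is a placeholder):

import json

import functions_framework  # pip install functions-framework

@functions_framework.http
def run_job(request):
    """Receive job parameters as JSON, do the work, return the result as JSON."""
    params = request.get_json(silent=True) or {}
    # placeholder for the real work (download inputs, process, upload outputs)
    result = {"job_id": params.get("job_id"), "status": "success"}
    return (json.dumps(result), 200, {"Content-Type": "application/json"})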
Option 3: Google Cloud Tasks
This is a bit of a hybrid option. Essentially GCP has a queue distribution workflow where the endpoint is a cloud function or some other supported worker, including Cloud Run, which uses docker images. I've not actually used it myself, but maybe it fits the bill.
https://cloud.google.com/tasks
When I look at a problem like this, I think through the entirety of the data paths: the mapping between source image and target image, plus any metadata or status information that needs to be collected. Additionally, failure conditions need to be handled, especially if a production service is going to be built.
I prefer Python: PySpark with Pandas UDFs to perform the orchestration and image processing.
S3FS lets me access S3. If using Azure or Google, Databricks' DBFS lets me seamlessly read and write to cloud storage without two extra file-copy steps.
Pyspark's binaryFile data source lets me list all of the input files to be processed. Spark lets me run this in batch or an incremental/streaming configuration. This design optimizes for end to end data flow and data reliability.
For a cluster manager I use Databricks, which lets me easily provision an auto-scaling cluster. The Databricks cluster manager lets users deploy docker containers or use cluster libraries or notebook scoped libraries.
The example below assumes each image is > 32 MB and processes it out of band. If the images are in the KB range then dropping the content column is not necessary and in-line processing can be faster (and simpler).
Pseudo code:
from typing import Iterator

import pandas as pd
from pyspark.sql.functions import col, pandas_udf

# List the input files without reading their bytes (drop the "content" column)
df = (spark.read
      .format("binaryFile")
      .option("pathGlobFilter", "*.png")
      .load("/path/to/data")
      .drop("content")
     )

def do_image_xform(path: str) -> str:
    # Do the image transformation: read from the dbfs path, write to a dbfs path
    ...
    # return xform status
    return "success"

@pandas_udf("string")
def do_image_xform_udf(iterator: Iterator[pd.Series]) -> Iterator[pd.Series]:
    for paths in iterator:
        yield paths.apply(do_image_xform)

df_status = df.withColumn('status', do_image_xform_udf(col('path')))
df_status.write.saveAsTable("status_table")  # triggers execution, saves status
I have the following basic code which (I thought) should set up xarray to use a LocalCluster.
from dask.distributed import Client
client = Client("tcp://127.0.0.1:46573") # this points to a LocalCluster
import xarray as xr
ds = xr.open_mfdataset('*.nc', combine='by_coords') # Uses dask to defer actually loading data
I now launch a task, which also completes with no issues:
(ds.mean('time').mean('longitude')**10).compute()
I noticed that the tabs for the Task graph, Workers, or Task stream (among others) in the dask-labextension for my LocalCluster remain empty. Shouldn't there be some sort of progress displayed while the computation is running?
Which leads me to wonder: how do I tell xarray to explicitly use this cluster? Or is Client a singleton, such that xarray only ever has one instance to use anyway?
When you create a Dask Client it automatically registers itself as the default way to run Dask computations.
You can check to see if an object is a Dask collection with the dask.is_dask_collection function. As you say, I believe that xr.open_mfdataset uses Dask by default, but this would be a good way to check.
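A quick check along those lines, as a minimal sketch reusing your setup (is_dask_collection and the client's dashboard_link are the pieces I'd look at here):

import dask
import xarray as xr
from dask.distributed import Client

client = Client("tcp://127.0.0.1:46573")   # registers itself as the default
ds = xr.open_mfdataset('*.nc', combine='by_coords')

print(dask.is_dask_collection(ds))          # True if the dataset is dask-backed
print(client.dashboard_link)                # the dashboard this client feeds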
As to why you're not seeing anything on the dashboard, I unfortunately don't know enough about your situation to be able to help you there.
Context:
I'm using custom dask graphs to manage and distribute computations.
Problem:
Some tasks involve reading in files which are produced outside of dask and are not necessarily available at the time of calling dask.get(graph, result_key).
Question:
Having the I/O tasks wait for files is not an option, as this would block workers. Is there (or what would be) a good way to let dask wait for the files to become available and only then execute the I/O tasks?
Thanks a lot for any thoughts!
It sounds like you might want to use some of the more real-time features of Dask, described here.
You might consider making tasks that use secede and rejoin, or use async/await style programming and only launch tasks once your client process notices that the files exist.
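A rough sketch of the secede/rejoin approach, so a task can poll for a file without tying up a worker thread (the path and polling interval are arbitrary):

import os
import time
from dask.distributed import Client, secede, rejoin

def read_when_available(path, poll_seconds=5):
    # Step out of the worker's thread pool while waiting so other tasks can run
    secede()
    while not os.path.exists(path):
        time.sleep(poll_seconds)
    rejoin()  # re-enter the thread pool before doing the real I/O work
    with open(path) as f:
        return f.read()

client = Client()
future = client.submit(read_when_available, "/data/produced_elsewhere.txt")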
I am using Serilog inside my ASP.NET Core app for logging, and I need to write log events to the console pretty frequently (300-500 events per second). I run my app inside a Docker container and process the console logs using orchestrator tools.
So my question: should I use the Async wrapper for my Console sink, and will I get any benefit from that?
I read the documentation (https://github.com/serilog/serilog-sinks-async), but I can't tell whether it applies to the Console sink or not.
The Async sink takes the already-captured LogEvent items and shifts them from multiple foreground threads to a single background processor using a ConcurrentQueue producer/consumer collection. In general that's a good thing for stable throughput, especially at that rate of events.
Also, if you're sending to more than one sink, shifting the work to a background thread that will be scheduled as needed to focus on that workload (i.e., keeping the code paths that propagate to sinks in cache) can be good if you have enough cores available and/or the sinks block even momentarily.
Having said that, basing anything on this information alone would be premature optimization.
How well a Console sink can ingest without blocking when you don't put an Async wrapper in front always depends a lot on the environment - for example, hosting environments typically synthesize a stdout that buffers efficiently. When that works well, adding an Async wrapper in front of the Console sink merely prolongs object lifetimes without much benefit versus letting each thread submit to the Console sink directly.
So, it depends - in my experience, feeding everything to Async and doing all the processing there (e.g. writing to a buffered file, emitting every 0.5 s, perhaps to a sidecar process that forwards to your log store) can work well. The bottom line is that a good load-generator rig is a very useful thing for any high-throughput app. Once you have one, you can experiment - I've seen 30% throughput gains from reorganizing the exact same output and how it's scheduled (admittedly I also switched to Serilog during that transition - you're unlikely to see anything of that order).
Not sure whether this is the right place to ask, but I am currently trying to run a Dataflow job that will partition a data source into multiple chunks in multiple places. However, I feel that if I try to write to too many tables at once in one job, the Dataflow job is more likely to fail with an HTTP transport exception, and I assume there is some bound on how much I/O, in terms of sources and sinks, I can wrap into one job?
To avoid this scenario, the best solution I can think of is to split this one job into multiple Dataflow jobs, but that would mean processing the same data source multiple times (once per Dataflow job). That is okay for now, but ideally I want to avoid it later if my data source grows huge.
Therefore I am wondering: is there any rule of thumb for how many data sources and sinks I can group into a single job? And is there any better solution for my use case?
From the Dataflow service description of structuring user code:
The Dataflow service is fault-tolerant, and may retry your code multiple times in the case of worker issues. The Dataflow service may create backup copies of your code, and can have issues with manual side effects (such as if your code relies upon or creates temporary files with non-unique names).
In general, Dataflow should be relatively resilient. You can Partition your data based on the location you would like it written to. The writes to these output locations will be automatically divided into bundles, and any bundle which fails to get written will be retried.
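A rough sketch of that idea with the Beam Python SDK's Partition transform (the partitioning rule and output paths are made up for illustration; the original question may be using the Java SDK, but the shape is the same):

import apache_beam as beam

def by_destination(record, num_partitions):
    # Route each record to one of the output locations
    return hash(record["region"]) % num_partitions

with beam.Pipeline() as p:
    records = p | beam.Create([{"region": "us", "v": 1}, {"region": "eu", "v": 2}])
    parts = records | beam.Partition(by_destination, 2)
    for i, part in enumerate(parts):
        (part
         | f"format_{i}" >> beam.Map(str)
         | f"write_{i}" >> beam.io.WriteToText(f"/tmp/output/partition_{i}"))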
If the location you want to write to is not already supported you can look at writing a custom sink. The docs there describe how to do so in a way that is fault tolerant.
There is a bound on how many sources and sinks you can have in a single job. Do you have any details on how many you expect to use? If it exceeds the limit, there are also ways to use a single custom sink instead of several sinks, depending on your needs.
If you have more questions, feel free to comment. In addition to knowing more about what you're looking to do, it would help to know if you're planning on running this as a Batch or Streaming job.
Our solution to this was to write a custom GCS sink that supports partitions, though with the responses I got I'm unsure whether that was the right thing to do or not: Writing Output of a Dataflow Pipeline to a Partitioned Destination