Dask - diagnostics dashboard - custom info about task

I'm using Dask to schedule and run research batches.
These mostly produce side effects and are quite heavy (ranging from a few minutes to a couple of hours). There's no communication between the tasks.
In code it looks like this. First I pass in all the batches to process:
import os
from pathlib import Path
from typing import Iterator, List

from dask.distributed import Client, Future, LocalCluster, progress

# Batch, Logger and process_batch_repetition are project-specific and defined elsewhere.

def process_batches(batches: Iterator[Batch], log_dir: Path):
    cluster = LocalCluster(
        n_workers=os.cpu_count(),
        threads_per_worker=1
    )
    client = Client(cluster)
    futures = []
    for batch in batches:
        futures += process_batch(batch, client, log_dir)
    progress(futures)
Then I'm submitting repetitions from each batch as tasks:
def process_batch(batch: Batch, client: Client, log_dir: Path) -> List[Future]:
    batch_dir = log_dir.joinpath(batch.nice_hash)
    batch_futures = []
    num_workers = len(client.scheduler_info()['workers'])
    with Logger(batch_dir, clear_dir=True) as logger:
        logger.save_json(batch.as_dict, 'batch')
        for repetition in range(batch.n_repeats):
            cpu_index = repetition % num_workers
            future = client.submit(
                process_batch_repetition,
                batch,
                repetition,
                cpu_index,
                logger
            )
            batch_futures.append(future)
    return batch_futures
Is there any way to pass some custom info about a submitted task to the dashboard?
All I'm seeing are tasks named process_batch_repetition. Could I replace that with a custom string, so I can see which batch configurations are being processed at the moment?

Got an answer from Dask's BDFL mrocklin.
You can use the key= keyword to specify a key for the future. This should be unique per future. Dask will use the prefix of the key name to determine how it is rendered on the dashboard. See the docstring for dask.utils.key_split for examples on how a key prefix is generated from a key.
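For reference, here is how dask.utils.key_split derives a dashboard prefix from a key (examples adapted from its docstring; dash-separated trailing parts that look like indices or hashes are dropped):

from dask.utils import key_split

key_split('inc-1')        # -> 'inc'
key_split('inc-1-2-3')    # -> 'inc'
key_split('x-abcdefab')   # -> 'x'  (hex-looking suffixes are ignored)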
So you can use it like this:
future = client.submit(
    process_batch_repetition,
    batch,
    repetition,
    cpu_index,
    logger,
    key=f'{str(batch)}_repetition_{repetition}'
)
You just pass a unique string for the task. Some characters (e.g. spaces) are not allowed in keys, so expect errors if your key contains them.

Related

How to change dask job_name to SGECluster

I am using dask_jobqueue.SGECluster() and when I submit jobs to the grid they are all listed as dask-worker. I want to have different names for each submitted job.
Here is one example:
from dask.distributed import as_completed

futures = []
for i in range(1, 10):
    res = client.submit(slow_pow, i, 2)
    futures.append(res)

[future.result() for future in as_completed(futures)]
All 10 jobs appear with name dask-worker when checking their status with qsub.
I have tried adding client.adapt(job_name=f'job{i}') within the loop, but no success; the name is still dask-worker.
Any hints?
dask-worker is the generic name for the compute allocation requested from the cluster; it can be changed by providing cluster-specific arguments when the cluster is created. For example, for SLURMCluster this would be:
cluster = SLURMCluster(job_extra=['--job-name="func"'])
SGECluster might have a different syntax.
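For SGE specifically, a minimal sketch could look like the following. Note the assumptions: the keyword is job_extra in older dask_jobqueue releases (job_extra_directives in newer ones), -N is SGE's job-name directive, and the cores/memory values are placeholders.

from dask_jobqueue import SGECluster

cluster = SGECluster(
    cores=1,
    memory='2GB',
    job_extra=['-N myjob'],  # SGE directive that sets the job name shown by qstat
)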
The actual tasks submitted to the dask scheduler will have their names generated automatically by dask and can be viewed on the dashboard, by default at http://localhost:8787/status.
It's possible to specify a custom name for each task submitted to the scheduler by using the key kwarg:
fut = client.submit(myfunc, my_arg, key='custom_key')
Note that if you are submitting multiple futures, you will want them to have unique keys:
futs = [client.submit(myfunc, i, key=f'custom_key_{i}') for i in range(3)]

Access results of tasks launched by other tasks in Dask

My application requires me to launch tasks from within other tasks, like the following
from dask.distributed import get_client

def a():
    # ... some computation ...
    ...

def b():
    # ... some computation ...
    ...

def c():
    client = get_client()
    a_fut = client.submit(a)
    b_fut = client.submit(b)
    [a_res, b_res] = client.gather([a_fut, b_fut])
    return a_res + b_res

client = get_client()
res = client.submit(c)
However, I would like to have access to the intermediate results a and b (when calling c), but only c shows up in client.futures.
Is there a way to tell dask to keep the results for a and b?
I have tried to use the Future.add_done_callback method but it does not work for submit calls inside other submit calls.
Thank you
You probably want to look at Dask's coordination primitives like shared variables, queues, and pub/sub. https://docs.dask.org/en/latest/futures.html#coordination-primitives
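As a minimal sketch of one of those primitives, here is the a/b/c example from the question reworked to publish its intermediate results through a named distributed Queue (the queue name 'intermediates' and the trivial bodies of a and b are illustrative assumptions):

from dask.distributed import Client, Queue, get_client

def a():
    return 1

def b():
    return 2

def c():
    client = get_client()
    q = Queue('intermediates', client=client)    # named queue, visible to other clients
    a_res, b_res = client.gather([client.submit(a), client.submit(b)])
    q.put({'a': a_res, 'b': b_res})              # expose the intermediate results
    return a_res + b_res

client = Client()
total = client.submit(c).result()
intermediates = Queue('intermediates').get()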

Dask opportunistic caching in custom graphs

I have a custom DAG such as:
dag = {'load': (load, 'myfile.txt'),
       'heavy_comp': (heavy_comp, 'load'),
       'simple_comp_1': (sc_1, 'heavy_comp'),
       'simple_comp_2': (sc_2, 'heavy_comp'),
       'simple_comp_3': (sc_3, 'heavy_comp')}
And I'm looking to compute the keys simple_comp_1, simple_comp_2, and simple_comp_3, which I do as follows:
import dask
from dask.distributed import Client
from dask_yarn import YarnCluster
task_1 = dask.get(dag, 'simple_comp_1')
task_2 = dask.get(dag, 'simple_comp_2')
task_3 = dask.get(dag, 'simple_comp_3')
tasks = [task_1, task_2, task_3]
cluster = YarnCluster()
cluster.scale(3)
client = Client(cluster)
dask.compute(tasks)
cluster.shutdown()
It seems that, without caching, computing these 3 keys leads to heavy_comp being computed 3 times as well. And since this is a heavy computation, I tried to implement opportunistic caching from here as follows:
from dask.cache import Cache
cache = Cache(2e9)
cache.register()
However, when I tried to print the results of what was being cached I got nothing:
>>> cache.cache.data
[]
>>> cache.cache.heap.heap
{}
>>> cache.cache.nbytes
{}
I even tried increasing the cache size to 6 GB, but to no effect. Am I doing something wrong? How can I get Dask to cache the result of the key heavy_comp?
Expanding on MRocklin's answer, and to properly format the code from the comments below the question:
Computing the entire graph at once works as you would expect it to. heavy_comp would only be executed once, which is what you want. Consider the following code you provided in the comments, completed with stub function definitions:
def load(fn):
    print('load')
    return fn

def sc_1(i):
    print('sc_1')
    return i

def sc_2(i):
    print('sc_2')
    return i

def sc_3(i):
    print('sc_3')
    return i

def heavy_comp(i):
    print('heavy_comp')
    return i

def merge(*args):
    print('merge')
    return args

dag = {'load': (load, 'myfile.txt'),
       'heavy_comp': (heavy_comp, 'load'),
       'simple_comp_1': (sc_1, 'heavy_comp'),
       'simple_comp_2': (sc_2, 'heavy_comp'),
       'simple_comp_3': (sc_3, 'heavy_comp'),
       'merger_comp': (merge, 'sc_1', 'sc_2', 'sc_3')}

import dask

result = dask.get(dag, 'merger_comp')
print('result:', result)
It outputs:
load
heavy_comp
sc_1
sc_2
sc_3
merge
result: ('sc_1', 'sc_2', 'sc_3')
As you can see, "heavy_comp" is only printed once, showing that the function heavy_comp has only been executed once.
The opportunistic cache in the core Dask library only works for the single-machine scheduler, not the distributed scheduler.
However, if you just compute the entire graph at once, Dask will hold onto intermediate values intelligently. If there are values that you would like to hold onto regardless, you might also look at the persist function.
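As a hedged sketch of that advice (reusing the dag dictionary from the question), asking for all three keys in a single call lets the scheduler compute heavy_comp once and share it:

import dask
from dask.distributed import Client

# Single-machine scheduler: one call, heavy_comp runs once.
results = dask.get(dag, ['simple_comp_1', 'simple_comp_2', 'simple_comp_3'])

# Distributed scheduler: Client.get accepts the same (graph, keys) arguments.
client = Client()  # or Client(cluster) when using a YarnCluster
results = client.get(dag, ['simple_comp_1', 'simple_comp_2', 'simple_comp_3'])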

How to find the concurrent.future input arguments for a Dask distributed function call

I'm using Dask to distribute work to a cluster. I'm creating a cluster and calling .submit() to submit a function to the scheduler. It returns a Future object. I'm trying to figure out how to obtain the input arguments of that future once it has completed.
For example:
from dask.distributed import Client
from dask_yarn import YarnCluster

def somefunc(a, b, c, ..., n):
    # do something
    return

cluster = YarnCluster.from_specification(spec)
client = Client(cluster)

future = client.submit(somefunc, arg1, arg2, ..., argn)
# ^^^ how do I obtain the input arguments for this future object?
# `future.args` doesn't work
Futures don't hold onto their inputs. You can do this yourself though.
futures = {}
future = client.submit(func, *args)
futures[future] = args
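A short sketch of that bookkeeping pattern, with a hypothetical somefunc standing in for the real workload: keep a dict keyed by the future and look the arguments up as results arrive.

from dask.distributed import Client, as_completed

client = Client()

def somefunc(a, b):
    return a + b

futures = {}
for a, b in [(1, 2), (3, 4)]:
    fut = client.submit(somefunc, a, b)
    futures[fut] = (a, b)             # remember the inputs ourselves

for fut in as_completed(futures):     # iterates over the dict's keys (futures)
    print(futures[fut], '->', fut.result())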
A future only knows the key by which it is uniquely known on the scheduler. At the time of submission, if it has dependencies, these are transiently found and sent to the scheduler, but no copy is kept locally.
The pattern you are after sounds more like delayed, which keeps hold of its graph, and indeed client.compute(delayed_thing) returns a future.
from dask import delayed

d = delayed(somefunc)(a, b, c)
future = client.compute(d)
dict(d.dask)  # graph of things needed by d
You could communicate directly with the scheduler to find the dependencies of some key, which will in general also be keys, and so reverse-engineer the graph, but that does not sound like a great path, so I won't try to describe it here.

Is there something similar to JS 'Promise.all()' in Ruby?

Below is some code that should be optimized:
def statistics
  blogs = Blog.where(id: params[:ids])

  results = blogs.map do |blog|
    {
      id: blog.id,
      comment_count: blog.blog_comments.select("DISTINCT user_id").count
    }
  end

  render json: results.to_json
end
Each SQL query costs around 200 ms. If I have 10 blog posts, this function takes about 2 s because it runs synchronously. I could use GROUP BY to optimize the query, but I'm putting that aside for now because the task could just as well be a third-party request, and I am interested in how Ruby deals with async.
In JavaScript, when I want to dispatch multiple asynchronous pieces of work and wait for all of them to resolve, I can use Promise.all(). I wonder what the alternatives are in Ruby for solving this problem.
Do I need a thread for this case? And is it safe to do that in Ruby?
There are multiple ways to solve this in Ruby, including promises (enabled by gems).
JavaScript accomplishes asynchronous execution using an event loop and event-driven I/O. There are event libraries that accomplish the same thing in Ruby; one of the most popular is eventmachine.
As you mentioned, threads can also solve this problem. Thread safety is a big topic and is further complicated by the different threading models in different flavors of Ruby (MRI, JRuby, etc.). In summary I'll just say that of course threads can be used safely; there are just times when that is difficult. However, when used with blocking I/O (like an API call or a database request) threads can be very useful and fairly straightforward. A solution with threads might look something like this:
# run blocking IO requests simultaneously
thread_pool = [
  Thread.new { execute_sql_1 },
  Thread.new { execute_sql_2 },
  Thread.new { execute_sql_3 },
  # ...
]

# wait for the slowest one to finish
thread_pool.each(&:join)
You also have access to other concurrency models, like the actor model, async classes, promises, and others enabled by gems like concurrent-ruby.
Finally, Ruby concurrency can take the form of multiple processes communicating through built-in mechanisms (DRb, sockets, etc.) or through distributed message brokers (Redis, RabbitMQ, etc.).
Sure, just do the count in one database call:
blogs = Blog
  .select('blogs.id, COUNT(DISTINCT blog_comments.user_id) AS comment_count')
  .joins('LEFT JOIN blog_comments ON blog_comments.blog_id = blogs.id')
  .where(blogs: { id: params[:ids] })
  .group('blogs.id')

results = blogs.map do |blog|
  { id: blog.id, comment_count: blog.comment_count }
end

render json: results.to_json
You might need to adjust the statements depending on how your tables are named in the database, because I just guessed based on the names of your associations.
Okay, generalizing a bit:
You have a list of data and want to operate on it asynchronously. Assuming the operation is the same for every entry in your list, you can do this:
data = [1, 2, 3, 4]                            # Example data
operation = ->(data_entry) { data_entry * 2 }  # Our operation: multiply by two

results = data.map { |e| Thread.new(e, &operation) }.map { |t| t.value }
Taking it apart:
data = [1, 2, 3, 4]
This could be anything from database IDs to URIs. Using numbers for simplicity here.
operation = ->(data_entry) { data_entry * 2 }
Definition of a lambda that takes one argument and does some calculation on it. This could be an API call, an SQL query, or any other operation that takes some time to complete. Again, for simplicity, I'm just multiplying the number by 2.
results =
This array will contain the results of all the asynchronous operations.
data.map{ |e| Thread.new(e, &operation) }...
For every entry in the data set, spawn a thread that runs operation, passing the entry as its argument; this becomes the data_entry argument in the lambda.
...map{ |t| t.value }
Extract the value from each thread. This will wait for the thread to finish first, so by the end of this line all your data will be there.
Lambdas
Lambdas are really just glorified blocks that raise an error if you pass in the wrong number of arguments. The syntax ->(arguments) { code } is just syntactic sugar for lambda { |arguments| code }.
When a method accepts a block, like Thread.new { do_async_stuff_here }, you can also pass a lambda or Proc object prefixed with & and it will be treated the same way.
