How to find the concurrent.future input arguments for a Dask distributed function call - dask

I'm using Dask to distribute work to a cluster. I'm creating a cluster and calling .submit() to submit a function to the scheduler. It returns a Future object. I'm trying to figure out how to obtain the input arguments of that future once it has completed.
For example:
from dask.distributed import Client
from dask_yarn import YarnCluster
def somefunc(a, b, c, ..., n):
    # do something
    return
cluster = YarnCluster.from_specification(spec)
client = Client(cluster)
future = client.submit(somefunc, arg1, arg2, ..., argn)
# ^^^ how do I obtain the input arguments for this future object?
# `future.args` doesn't work

Futures don't hold onto their inputs. You can do this yourself though.
futures = {}
future = client.submit(func, *args)
futures[future] = args
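The bookkeeping above pairs naturally with as_completed. Here is a minimal sketch that uses the standard library's concurrent.futures.ThreadPoolExecutor as a stand-in for the Dask client, an assumption made so the snippet runs without a cluster. The body of somefunc and the argument tuples are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def somefunc(a, b, c):
    # Hypothetical stand-in for the real work.
    return a + b + c


arg_sets = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]

# Map each future to the arguments it was submitted with.
future_to_args = {}
results = {}
with ThreadPoolExecutor(max_workers=2) as executor:
    for args in arg_sets:
        future = executor.submit(somefunc, *args)
        future_to_args[future] = args

    # As each future finishes, look its inputs up in the dict.
    for future in as_completed(future_to_args):
        results[future_to_args[future]] = future.result()
```

The same shape should carry over to Dask: its futures are hashable (the answer above already uses them as dict keys), and dask.distributed provides its own as_completed.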

A future only knows the key by which it is uniquely known on the scheduler. At the time of submission, if it has dependencies, these are transiently found and sent to the scheduler, but no copy is kept locally.
The pattern you are after sounds more like delayed, which keeps hold of its graph, and indeed client.compute(delayed_thing) returns a future.
from dask import delayed
d = delayed(somefunc)(a, b, c)
future = client.compute(d)
dict(d.dask) # graph of things needed by d
You could communicate directly with the scheduler to find the dependencies of some key, which will in general also be keys, and so reverse-engineer the graph, but that does not sound like a great path, so I won't try to describe it here.

Related

str() is not usable anymore to get true value of a Text tfx.data_types.RuntimeParameter during pipeline execution

how to get string as true value of tfx.orchestration.data_types.RuntimeParameter during execution pipeline?
Hi,
I'm defining a runtime parameter like data_root = tfx.orchestration.data_types.RuntimeParameter(name='data-root', ptype=str) for a base path, from which I define many subfolders for various components like str(data_root)+'/model' for model serving path in tfx.components.Pusher().
It was working like a charm before I moved to tfx==1.12.0: str(data_root) now returns a JSON dump.
To overcome that, I tried to define a runtime parameter for the model path like model_root = tfx.orchestration.data_types.RuntimeParameter(name='model-root', ptype=str) and then feed it to the Pusher component the way I saw in many tutorials:
pusher = Pusher(model=trainer.outputs['model'],
                model_blessing=evaluator.outputs['blessing'],
                push_destination=tfx.proto.PushDestination(
                    filesystem=tfx.proto.PushDestination.Filesystem(base_directory=model_root)))
but I get a TypeError saying tfx.proto.PushDestination.Filesystem does not accept Runtime parameter.
It completely breaks the existing setup, as I receive those parameters from an external client for each Kubeflow run.
Thanks a lot for any help.
I was able to fix it.
First of all, the docstring is not clear about which parameters of Pusher can be a RuntimeParameter.
I finally went to the __init__ definition of the Pusher component and saw that only the push_destination parameter can be a RuntimeParameter:
def __init__(
    self,
    model: Optional[types.BaseChannel] = None,
    model_blessing: Optional[types.BaseChannel] = None,
    infra_blessing: Optional[types.BaseChannel] = None,
    push_destination: Optional[Union[pusher_pb2.PushDestination,
                                     data_types.RuntimeParameter]] = None,
    custom_config: Optional[Dict[str, Any]] = None,
    custom_executor_spec: Optional[executor_spec.ExecutorSpec] = None):
Then I defined the component accordingly, passing my RuntimeParameter directly:
model_root = tfx.orchestration.data_types.RuntimeParameter(name='model-serving-location', ptype=str)
pusher = Pusher(model=trainer.outputs['model'],
                model_blessing=evaluator.outputs['blessing'],
                push_destination=model_root)
Because push_destination is expected to be a tfx.proto.pusher_pb2.PushDestination proto message, the value supplied when a pipeline run is instantiated has to follow that message's schema, meaning it should look like:
{'type': 'model-serving-location': 'value': '{"filesystem": {"base_directory": "path/to/model/serving/for/the/run"}}'}
Regards
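Since the runtime value has to be a JSON string matching the PushDestination schema, it may be convenient to build it with json.dumps rather than by hand. This is a small sketch: the helper name push_destination_value is hypothetical, and how the resulting string is handed to the Kubeflow run is not shown here.

```python
import json


def push_destination_value(base_directory: str) -> str:
    # Serialize the PushDestination fields (filesystem.base_directory)
    # as a JSON string, matching the shape shown in the answer above.
    return json.dumps({"filesystem": {"base_directory": base_directory}})


value = push_destination_value("path/to/model/serving/for/the/run")
print(value)
```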

Dask - diagnostics dashboard - custom info about task

I'm using Dask to schedule and run research batches.
Those mostly produce side effects and are quite heavy (ranging from a few minutes to a couple of hours). There's no communication between the tasks.
In code it looks like this, first I'm passing all the batches to process:
def process_batches(batches: Iterator[Batch], log_dir: Path):
    cluster = LocalCluster(
        n_workers=os.cpu_count(),
        threads_per_worker=1
    )
    client = Client(cluster)
    futures = []
    for batch in batches:
        futures += process_batch(batch, client, log_dir)
    progress(futures)
Then I'm submitting repetitions from each batch as tasks:
def process_batch(batch: Batch, client: Client, log_dir: Path) -> List[Future]:
    batch_dir = log_dir.joinpath(batch.nice_hash)
    batch_futures = []
    num_workers = len(client.scheduler_info()['workers'])
    with Logger(batch_dir, clear_dir=True) as logger:
        logger.save_json(batch.as_dict, 'batch')
        for repetition in range(batch.n_repeats):
            cpu_index = repetition % num_workers
            future = client.submit(
                process_batch_repetition,
                batch,
                repetition,
                cpu_index,
                logger
            )
            batch_futures.append(future)
    return batch_futures
Is there any way to pass some custom info about the submitted task to the dashboard?
All I'm seeing are just tasks process_batch_repetition. Could I replace it with a custom string, so I can see what batch configurations are being processed at the moment?
Got an answer from Dask's BDFL mrocklin.
You can use the key= keyword to specify a key for the future. This should
be unique per future. Dask will use the prefix of the key name to
determine how it is rendered on the dashboard. See the docstring for
dask.utils.key_split for examples on how a key prefix is generated from a
key.
So you can use it like this:
future = client.submit(
    process_batch_repetition,
    batch,
    repetition,
    cpu_index,
    logger,
    key=f'{str(batch)}_repetition_{repetition}'
)
You just pass a unique string for each task. Some characters are not allowed (e.g. spaces), so expect some key errors.
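To avoid tripping over awkward characters, the key can be built by a small helper. The sketch below is hypothetical: make_task_key is not part of Dask, and it assumes that replacing whitespace and appending the repetition number after a hyphen gives a key that is both unique and grouped by its prefix. The dask.utils.key_split docstring is the place to check how the prefix is actually derived.

```python
import re


def make_task_key(label: str, repetition: int) -> str:
    # Collapse whitespace runs into underscores so the key has no spaces.
    safe_label = re.sub(r"\s+", "_", label.strip())
    # A numeric suffix after a hyphen keeps each key unique per repetition.
    return f"{safe_label}-{repetition}"


keys = [make_task_key("lr=0.1 batch=32", repetition) for repetition in range(3)]
print(keys)
```

The result would then be passed as key=make_task_key(str(batch), repetition) in the client.submit call above.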

Can Python gRPC do computation when sending messages out?

Suppose I need to send a large amount of data from the client to the server using Python gRPC, and I want to continue with the rest of the computation while the message is being sent, instead of blocking. Is there any way to implement this?
I will illustrate the question by an example using the modified code from the greeter_client.py
for i in range(5):
    res = computation()
    response = stub.SayHello(helloworld_pb2.HelloRequest(data=res))
I want the computation of the next iteration to proceed while the "res" of the previous iteration is being sent. To this end, I have tried async/await, which looks like this:
async with aio.insecure_channel('localhost:50051') as channel:
    stub = helloworld_pb2_grpc.GreeterStub(channel)
    for j in range(5):
        res = computation()
        response = await stub.SayHello(helloworld_pb2.HelloRequest(data=res))
But the running time is actually the same as for the version without async/await, so async/await does not seem to help. Is there anything wrong in my code, or are there other ways to do this?
Concurrency is different from parallelism. AsyncIO allows multiple coroutines to run on the same thread, but they are not actually executed at the same time. If the thread is given CPU-heavy work such as computation() in your snippet, it doesn't yield control back to the event loop, so no other coroutine makes progress.
Besides, in the snippet the RPC depends on the result of computation(), which means the work is serialized for each RPC. We can still gain some concurrency from AsyncIO by handing the coroutines over to the event loop with asyncio.gather():
async with aio.insecure_channel('localhost:50051') as channel:
    stub = helloworld_pb2_grpc.GreeterStub(channel)
    async def one_hello():
        res = computation()
        response = await stub.SayHello(helloworld_pb2.HelloRequest(data=res))
    await asyncio.gather(*(one_hello() for _ in range(5)))
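Because a blocking computation() never yields to the event loop, one possible way to get real overlap is to run it in a worker thread while the previous RPC is still in flight. The following is a sketch under stated assumptions rather than the method from the answer above: fake_send is a hypothetical stand-in for stub.SayHello, computation uses time.sleep to simulate blocking work, and asyncio.to_thread needs Python 3.9 or later.

```python
import asyncio
import time


def computation(i: int) -> int:
    # Hypothetical blocking work; time.sleep stands in for the real job.
    time.sleep(0.05)
    return i * i


async def fake_send(payload: int) -> str:
    # Stand-in for `await stub.SayHello(...)`; simulates network latency.
    await asyncio.sleep(0.05)
    return f"ack:{payload}"


async def pipeline(n: int) -> list:
    responses = []
    pending = None  # the send that is still in flight, if any
    for i in range(n):
        # Run the blocking computation in a worker thread so the event
        # loop stays free to make progress on the in-flight send.
        res = await asyncio.to_thread(computation, i)
        if pending is not None:
            responses.append(await pending)
        # Start sending this result; the next computation overlaps with it.
        pending = asyncio.create_task(fake_send(res))
    if pending is not None:
        responses.append(await pending)
    return responses


responses = asyncio.run(pipeline(5))
print(responses)
```

Pure-Python CPU-bound work holds the GIL, so a thread mainly helps when the computation releases it (for example inside NumPy or other native code). Otherwise a process pool via loop.run_in_executor may be the better fit.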

Access tasks results launched by other tasks in Dask

My application requires me to launch tasks from within other tasks, like the following
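from dask.distributed import get_client
def a():
    ...  # some computation
def b():
    ...  # some computation
def c():
    client = get_client()
    # Use distinct names so the futures do not shadow the functions a and b
    future_a = client.submit(a)
    future_b = client.submit(b)
    result_a, result_b = client.gather([future_a, future_b])
    return result_a + result_b
client = get_client()
res = client.submit(c)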
However, I would like to have access to the intermediate results a and b (when calling c), but only c shows up in client.futures.
Is there a way to tell dask to keep the results for a and b?
I have tried to use the Future.add_done_callback method but it does not work for submit calls inside other submit calls.
Thank you
You probably want to look at Dask's coordination primitives like shared variables, queues, and pub/sub. https://docs.dask.org/en/latest/futures.html#coordination-primitives

Dask opportunistic caching in custom graphs

I have a custom DAG such as:
dag = {'load': (load, 'myfile.txt'),
'heavy_comp': (heavy_comp, 'load'),
'simple_comp_1': (sc_1, 'heavy_comp'),
'simple_comp_2': (sc_2, 'heavy_comp'),
'simple_comp_3': (sc_3, 'heavy_comp')}
And I'm looking to compute the keys simple_comp_1, simple_comp_2, and simple_comp_3, which I perform as follows,
import dask
from dask.distributed import Client
from dask_yarn import YarnCluster
task_1 = dask.get(dag, 'simple_comp_1')
task_2 = dask.get(dag, 'simple_comp_2')
task_3 = dask.get(dag, 'simple_comp_3')
tasks = [task_1, task_2, task_3]
cluster = YarnCluster()
cluster.scale(3)
client = Client(cluster)
dask.compute(tasks)
cluster.shutdown()
It seems that, without caching, computing these 3 keys will also compute heavy_comp 3 times. Since this is a heavy computation, I tried to enable Dask's opportunistic caching as follows:
from dask.cache import Cache
cache = Cache(2e9)
cache.register()
However, when I tried to print the results of what was being cached I got nothing:
>>> cache.cache.data
[]
>>> cache.cache.heap.heap
{}
>>> cache.cache.nbytes
{}
I even tried increasing the cache size to 6GB, however to no effect. Am I doing something wrong? How can I get Dask to cache the result of the key heavy_comp?
Expanding on MRocklin's answer and to format code in the comments below the question.
Computing the entire graph at once works as you would expect it to. heavy_comp would only be executed once, which is what you want. Consider the following code you provided in the comments completed by empty function definitions:
def load(fn):
    print('load')
    return fn
def sc_1(i):
    print('sc_1')
    return i
def sc_2(i):
    print('sc_2')
    return i
def sc_3(i):
    print('sc_3')
    return i
def heavy_comp(i):
    print('heavy_comp')
    return i
def merge(*args):
    print('merge')
    return args
# merger_comp has to reference the graph keys (simple_comp_*), not the function names
dag = {'load': (load, 'myfile.txt'),
       'heavy_comp': (heavy_comp, 'load'),
       'simple_comp_1': (sc_1, 'heavy_comp'),
       'simple_comp_2': (sc_2, 'heavy_comp'),
       'simple_comp_3': (sc_3, 'heavy_comp'),
       'merger_comp': (merge, 'simple_comp_1', 'simple_comp_2', 'simple_comp_3')}
import dask
result = dask.get(dag, 'merger_comp')
print('result:', result)
It outputs:
load
heavy_comp
sc_1
sc_2
sc_3
merge
result: ('myfile.txt', 'myfile.txt', 'myfile.txt')
As you can see, "heavy_comp" is only printed once, showing that the function heavy_comp has only been executed once.
The opportunistic cache in the core Dask library only works for the single-machine scheduler, not the distributed scheduler.
However, if you just compute the entire graph at once Dask will hold onto intermediate values intelligently. If there are values that you would like to hold onto regardless you might also look at the persist function.
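To see why a single pass over the graph runs heavy_comp only once, here is a toy evaluator. It is a deliberately simplified illustration and not Dask's scheduler: toy_get, the calls list and the sc factory are all hypothetical names. The key idea is that one cache is shared by every key requested in the same call.

```python
calls = []  # records which functions actually ran


def load(fn):
    calls.append("load")
    return fn


def heavy_comp(x):
    calls.append("heavy_comp")
    return x.upper()


def sc(suffix):
    # Factory for the three cheap downstream computations.
    def run(x):
        calls.append(f"sc_{suffix}")
        return f"{x}:{suffix}"
    return run


dag = {
    "load": (load, "myfile.txt"),
    "heavy_comp": (heavy_comp, "load"),
    "simple_comp_1": (sc(1), "heavy_comp"),
    "simple_comp_2": (sc(2), "heavy_comp"),
    "simple_comp_3": (sc(3), "heavy_comp"),
}


def toy_get(graph, keys):
    cache = {}  # shared by all requested keys within this one call

    def evaluate(key):
        if key not in cache:
            func, *args = graph[key]
            # A string naming another key is a dependency; anything else is a literal.
            resolved = [
                evaluate(arg) if isinstance(arg, str) and arg in graph else arg
                for arg in args
            ]
            cache[key] = func(*resolved)
        return cache[key]

    return [evaluate(key) for key in keys]


results = toy_get(dag, ["simple_comp_1", "simple_comp_2", "simple_comp_3"])
print(results)
print(calls)
```

With Dask itself, the equivalent request is to pass a list of keys in one call, e.g. dask.get(dag, ['simple_comp_1', 'simple_comp_2', 'simple_comp_3']), instead of calling get once per key.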

Resources