Passing Futures as arguments in Dask

What is the best way to pass a Future to a Dask delayed function so that the Future stays intact? In other words, how can we ensure the function receives the actual Future and not the result it represents?

Generally the semantics are that dask.delayed functions get concrete results rather than dask-y ones. This is not easily supported today without some tricks.
That being said, I recommend the following trick:
Place your Future in a Variable and pass the Variable to the delayed function:
import dask
from dask.distributed import Client, Variable

client = Client()
v = Variable()

@dask.delayed
def f(v):
    # v is a Variable; .get() returns the Future and .result() its value
    return v.get().result()

future = client.scatter(123)
v.set(future)       # store the Future in the Variable
f(v).compute()
# 123
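For contrast, here is a minimal sketch (not part of the original answer) of the default semantics described above: if the Future is passed to the delayed function directly, the function sees the concrete value rather than the Future. It reuses the client and future from the snippet above.

@dask.delayed
def show_type(x):
    # by the time this runs, x has already been resolved to the concrete value
    return type(x).__name__

show_type(future).compute()
# 'int'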

Related

TFF: How define tff.simulation.ClientData.from_clients_and_fn Function?

In the federated learning context, one classmethod that should work is tff.simulation.ClientData.from_clients_and_fn. If I pass a list of client_ids and a function that returns the appropriate dataset when given a client id, I will have a fully functional ClientData.
I think the approach I may use for defining that function is to construct a Python dict which maps client IDs to tf.data.Dataset objects; the function then takes a client id, looks up the dataset in the dict, and returns it.
So I define the function as below, but I think it is wrong. What do you think?
list = ["0", "1", "2"]
tab = {"0": ds, "1": ds, "2": ds}

def create_tf_dataset_for_client_fn(id):
    return ds

source = tff.simulation.ClientData.from_clients_and_fn(list, create_tf_dataset_for_client_fn)
I suppose here that the 3 clients all have the same dataset, ds.
Creating a dict of (client_id, dataset) key-value pairs is a reasonable way to set up a tff.simulation.ClientData. Indeed, the code in the question will result in all clients having the same dataset, since ds is returned for every value of the parameter id. One thing to watch out for when pre-constructing a dict of datasets is that it may require loading the entire contents of the data into memory (which may fail for large datasets).
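As a minimal sketch of that dict-based approach (with hypothetical client IDs and toy in-memory datasets, not taken from the original answer):

import tensorflow as tf
import tensorflow_federated as tff

# Toy per-client datasets held in memory; real data would be much larger.
client_datasets = {
    'client_0': tf.data.Dataset.from_tensor_slices([1.0, 2.0, 3.0]),
    'client_1': tf.data.Dataset.from_tensor_slices([4.0, 5.0]),
    'client_2': tf.data.Dataset.from_tensor_slices([6.0]),
}

def create_tf_dataset_for_client_fn(client_id):
    # Look the dataset up in the pre-built dict.
    return client_datasets[client_id]

source = tff.simulation.ClientData.from_clients_and_fn(
    list(client_datasets.keys()), create_tf_dataset_for_client_fn)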
Alternatively, constructing the dataset on demand could reduce memory usage. One example might be to have a dict of (client_id, file path) key-value pairs. Something like:
dataset_paths = {
    'client_0': '/tmp/A.txt',
    'client_1': '/tmp/B.txt',
    'client_2': '/tmp/C.txt',
}

def create_tf_dataset_for_client_fn(id):
    path = dataset_paths.get(id)
    if path is None:
        raise ValueError(f'No dataset for client {id}')
    return tf.data.TextLineDataset(path)

source = tff.simulation.ClientData.from_clients_and_fn(
    dataset_paths.keys(), create_tf_dataset_for_client_fn)
This is similar to the approach used in tff.simulation.FilePerUserClientData. It may be useful to look at the code of that class as an example.
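As a quick usage check (not shown in the original answer, and assuming the /tmp files above actually exist), the resulting ClientData can be asked for a single client's dataset:

example_ds = source.create_tf_dataset_for_client('client_0')
for line in example_ds.take(2):
    print(line.numpy())  # first two lines of /tmp/A.txt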

dask.distributed not utilising the cluster

I'm not able to process this block using the distributed cluster.
import pandas as pd
from dask import dataframe as dd
import dask

df = pd.DataFrame({'reid_encod': [
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
]})
dask_df = dd.from_pandas(df, npartitions=3)
save_val = []

def add(dask_df):
    for _, outer_row in dask_df.iterrows():
        for _, inner_row in dask_df.iterrows():
            for base_encod in outer_row['reid_encod']:
                for compare_encod in inner_row['reid_encod']:
                    val = base_encod + compare_encod
                    save_val.append(val)
    return save_val

from dask.distributed import Client
client = Client(...)

dask_compute = dask.delayed(add)(dask_df)
dask_compute.compute()
Also, I have a few queries:
Does dask.delayed use the available clusters to do the computation?
Can I parallelize the for-loop iteration of this pandas DF using delayed, and use multiple computers present in the cluster to do the computations?
Does dask.distributed work on pandas dataframes?
Can we use dask.delayed in dask.distributed?
If the above programming approach is wrong, can you guide me on whether to choose delayed or dask DF for the above scenario?
For the record, some answers, although I wish to note my earlier general points about this question:
Does dask.delayed use the available clusters to do the computation?
If you have created a client connected to a distributed cluster, dask will use it for computation unless you specify otherwise.
Can I parallelize the for-loop iteration of this pandas DF using delayed, and use multiple computers present in the cluster to do the computations?
Yes, you can in general use delayed with pandas dataframes for parallelism if you wish. However, your dataframe only has one row, so it is not obvious in this case how; it depends on what you really want to achieve.
Does dask.distributed work on pandas dataframes?
Yes, you can do anything that Python can do with distributed, since it is just Python processes executing code. Whether it brings you the performance you are after is a separate question.
Can we use dask.delayed in dask.distributed?
Yes, distributed can execute anything that dask in general can, including delayed functions/objects.
If the above programming approach is wrong, can you guide me on whether to choose delayed or dask DF for the above scenario?
Not easily; it is not clear to me that this is a dataframe operation at all. It seems more like an array operation, but, again, I note that your function does not actually return anything useful at all.
In the tutorial: passing pandas dataframes to delayed; the same applies with the dataframe API.
The main problem with your code is sketched in this section of the best practices: don't pass Dask collections to delayed functions. This means you should use either the delayed API or the dataframe API. While you can convert between dataframes and delayed, simply passing one to the other like this is not recommended; see the sketch after the points below.
Furthermore,
you only have one row in your dataframe, so you only get one partition and no parallelism whatsoever. You can only slow things down like this.
this appears to be an everything-to-everything (N^2) operation, so if you had many rows (the normal case for Dask), it would presumably take extremely long, no matter how many cores you used
storing lists in a pandas row is not a great idea; perhaps you wanted to use an array?
the function doesn't return anything useful, so it's not at all clear what you are trying to achieve. Under the description of MVCE, you will see references to "expected outcome" and "what went wrong". To get more help, please be more precise.
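To make that recommendation concrete, here is a minimal sketch (using made-up NumPy chunks rather than the question's data) of the delayed-only pattern: plain in-memory objects are passed into delayed functions, and no Dask collection is passed in.

import numpy as np
import dask

# Hypothetical chunks of data; each is an ordinary NumPy array, not a Dask collection.
chunks = [np.arange(10), np.arange(10, 20), np.arange(20, 30)]

@dask.delayed
def pairwise_sum(a, b):
    # all-pairs sums between two chunks, returned as a flat array
    return (a[:, None] + b[None, :]).ravel()

tasks = [pairwise_sum(a, b) for a in chunks for b in chunks]
results = dask.compute(*tasks)  # runs on the distributed client if one is active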

How does Dask handle external or global variables in function definitions?

If I have a function that depends on some global or other constant like the following:
x = 123

def f(partition):
    return partition + x  # note that x is defined outside this function

df = df.map_partitions(f)
Does this work? Or do I need to include the external variable, x, explicitly somehow?
Single process
If you're on a single machine and not using dask.distributed, then this doesn't matter. The variable x is present and doesn't need to be moved around.
Distributed or multi-process
If we have to move the function between processes then we'll need to serialize that function into a bytestring. Dask uses the library cloudpickle to do this.
The cloudpickle library converts the Python function f into a bytes object in a way that captures the external variables in most settings. So one way to see if your function will work with Dask is to try to serialize it and then deserialize it on some other machine.
import cloudpickle
b = cloudpickle.dumps(f)
cloudpickle.loads(b) # you may want to try this on your other machine as well
How cloudpickle achieves this can be quite complex. You may want to look at their documentation.
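To tie this back to the original example, here is a minimal runnable sketch (with a made-up DataFrame) of the closed-over global being used inside map_partitions; f here is exactly the kind of function cloudpickle would serialize for a distributed run.

import pandas as pd
import dask.dataframe as dd

x = 123

def f(partition):
    return partition + x  # x is captured from the module's globals

df = dd.from_pandas(pd.DataFrame({'a': [1, 2, 3, 4]}), npartitions=2)
print(df.map_partitions(f).compute())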

Parallelizing in-memory tasks in dask using shared memory (no sending to other processes)?

I have a trivially parallelizable in-memory problem, but one that does not give great speedups with regular Python multiprocessing (only about 2x), due to the need to send lots of data back and forth between processes. I'm hoping dask can help.
My code basically looks like this:
delayed_results = []
for key, kdf in natsorted(scdf.groupby(grpby_key)):
    d1 = dd.from_pandas(kdf, npartitions=1)
    d2 = dd.from_pandas(other_dfs[key], npartitions=1)
    result = dask.delayed(function)(d1, d2, key=key, n_jobs=n_jobs, **kwargs)
    delayed_results.append(result)
outdfs = dask.compute(*delayed_results)
This is what my old joblib code looked like:
outdfs = Parallel(n_jobs=n_jobs)(
    delayed(function)(scdf, other_dfs[key], key=key, n_jobs=n_jobs, **kwargs)
    for key, scdf in natsorted(scdf.groupby(grpby_key))
)
However, the dask code is much, much slower and more memory-consuming than the joblib code, for both the threaded and multiprocessing schedulers. I was hoping that dask could be used to parallelize tasks without needing to send stuff to other processes. Is there a way to use multiple processes with dask by using shared memory?
By the way, the docs have a reference to http://distributed.readthedocs.io/en/latest/local-cluster.html where they explain that this scheduler
It handles data locality with more sophistication, and so can be more
efficient than the multiprocessing scheduler on workloads that require
multiple processes.
But they have no examples of its usage. What should I replace my dask.compute() call with in the code above to try the local cluster?
So you can just do the following:
from distributed import LocalCluster, Client
cluster = LocalCluster(n_workers=4)
client = Client(cluster)
<your code>
Distributed will by default register itself as the executor, and you can just use dask.compute as normal.
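If the goal is specifically to keep everything in shared memory, one further option (not mentioned in the original answer, and still subject to the GIL for pure-Python code) is to start the local cluster with processes=False, so that the workers run as threads inside the current process and no data is shipped between processes:

from dask.distributed import LocalCluster, Client

# Thread-based workers in the current process: data stays in shared memory,
# but pure-Python code will be limited by the GIL.
cluster = LocalCluster(processes=False, threads_per_worker=4)
client = Client(cluster)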

Iterate sequentially over a dask bag

I need to submit the elements of a very large dask.bag to a non-threadsafe store, i.e. I need something like:

for x in dbag:
    store.add(x)

I cannot use compute, since the bag is too large to fit in memory.
I need something more like distributed.as_completed, but one that works on bags, which distributed.as_completed does not.
I would probably continue to use normal compute, but add a lock:

def commit(x, lock=None):
    with lock:
        store.add(x)

b.map(commit, lock=my_lock)
where you might create a threading.Lock or a multiprocessing.Lock for my_lock, depending on the kind of processing you're doing.
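A fuller sketch of that pattern, with a toy stand-in for the store and a hypothetical bag, assuming the threaded scheduler so that the store actually lives in shared memory (with the multiprocessing or distributed schedulers the additions would happen in the workers, not in the calling process):

import threading
import dask.bag as db

class Store:
    # toy stand-in for the non-threadsafe store in the question
    def __init__(self):
        self.items = []
    def add(self, x):
        self.items.append(x)

store = Store()
my_lock = threading.Lock()  # or multiprocessing.Lock() for the process-based scheduler

def commit(x, lock=None):
    with lock:
        store.add(x)
    return x

b = db.from_sequence(range(100), npartitions=10)
b.map(commit, lock=my_lock).compute(scheduler='threads')
print(len(store.items))  # 100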
If you want to use as_completed you can convert your bag to futures and use as_completed on them.
from distributed.client import futures_of, as_completed

b = b.persist()
futures = futures_of(b)

for future in as_completed(futures):
    for x in future.result():
        store.add(x)
You can also convert to a dataframe, which I believe does iterate more sensibly:

df = b.to_dataframe(...)

for x in df.iteritems(...):
    ...
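Another option that is not in the original answer: a bag can be split into one delayed object per partition with to_delayed(), and the partitions computed one at a time so that only a single partition needs to fit in memory.

# Sketch: iterate partition by partition, sequentially, in the calling process.
for part in b.to_delayed():
    for x in part.compute():
        store.add(x)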
