Mismanagement of dask future results slow down the performances - dask

I'm looking for any suggestion on how to solve the bottleneck below described.
Within a dask distributed infrastructure I map some futures and gain results whenever they are ready. Once retrieved I've to invoke a time consuming, blocking "pandas" function and, unfortunately, this function can't be avoided.
The optimum would be to have something that let me create another process, detached from the for loop, that's able to ingest the flow of results. For other constraints, not present in the example, the output can't be serialized and sent to workers and must be processed on the master.
here a small mockup. Just grab the idea and not focus too much on the details of the code.
class pxldrl(object):
def __init__(self, df):
self.table = df
def simulation(list_param):
time.sleep(random.random())
val = sum(list_param)/4
if val < 0.5:
result = {'param_e': val}
else:
result = {'param_f': val}
return pxldrl(result)
def costly_function(result, output):
time.sleep(1)
# blocking pandas function
output = output.append(result.table, sort=False, ignore_index=True)
return output
def main():
client = Client(n_workers=4, threads_per_worker=1)
output = pd.DataFrame(columns=['param_e', 'param_f'])
input = pd.DataFrame(np.random.random(size=(100, 4)),
columns=['param_a', 'param_b', 'param_c', 'param_d'])
for i in range(2):
futures = client.map(simulation, input.values)
for future, result in as_completed(futures, with_results=True):
output = costly_function(result, output)

It sounds like you want to run costly_function in a separate thread. Perhaps you could using the threading or concurrent.futures module to run your entire routine on a separate thread?
If you wanted to get fancy, you could even use Dask again and create a second client that ran within this process:
local_client = Client(processes=False)
and use that. (although you'll have to be careful about mixing futures between clients, which won't work)

Related

Iterate on (or access directly) xarray chunks

I'm after a way to iterate on xarray chunks, so something similar to dask.array.blocks but that would give me access to xarray chunks with coordinates and dimensions.
For the record, I'm aware that xarray.map_blocks exists, but what I'm doing maps input chunks to output chunks of unknown shape, so I'd like to write something custom by looping directly on the xarray chunks.
I've tried to look into the xarray.map_blocks source code, since I guess something similar to what I need is in there, but I had a hard time understanding what's going on there.
EDIT:
My use case is that I would like, for each xarray chunk, to get an output xarray chunk of variable length along a new dimension (called foo below), and eventually concatenate them along foo.
This is a mocked scenario that should at least clarify what I'm after.
For now I've solved the problem constructing, from each dask chunk of the DataArray, an "xarray" chunk (but this looks quite convoluted), and then using client.map(fn_on_chunk, xarray_chunks).
n = 1000
x_raster = y_raster = np.arange(n)
time = np.arange(10)
vals_raster = np.arange(n*n*10).reshape(n, n, 10)
da_raster = xr.DataArray(vals_raster, coords={"y": y_raster, "x": x_raster, 'time':time})
da_raster = da_raster.chunk(dict(x=100, y=100))
def fn_on_chunk(da_chunk):
# Tried to replicate the fact that I can't know in advance
# the lenght of one dimension of the output
len_range = np.random.randint(10)
outs = []
for foo in range(len_range):
# Do some magic that finds needed coordinates
# on this particular chunk
x_chunk, y_chunk = fn_magic(foo)
out = da_chunk.sel(x=x_chunk, y=y_chunk)
out['foo'] = foo
outs.append(out)
return xr.concat(outs, dim='foo')

Load and merge many files from S3 using Dask

I have about 1m "result" files in S3 bucket which I want to process. Each result file should be merge with additional columns from an associated "context" file, which I have about 50k of (i.e. each context is associated with about 20 results)
Processing it serially is slow so I am using dask to parallelize some of the work.
In my serial code, I just load everything up-front and merge them, e.g.
contexts_map = {get_context_id(ctx_file): load_context(ctx_file) for ctx_file in ctx_files}
data = []
for result_file in result_files:
ctx_id, res_id = get_context_and_res_id(result_file)
ctx = contexts_map[ctx_id]
data.append(process_result(ctx))
df = pd.DataFrame(data)
Initially I thought to divide the data and process in batches using dask (i.e. run the above in parallel on several batches) but then I read about dask bag and dask dataframe from_delayed and thought to use it. What I have:
delayed_get_context = delayed(get_context)
# load the contexts
ctx_map = {}
for ctx_file in ctx_files:
ctx_id = get_context_id(ctx_file)
ctx_map[ctx_file] = delayed_get_context(ctx_item)
# process the contexts
delayed_get_context_stats = delayed(get_context_stats)
ctx_stat_map = {ctx_id: delayed_get_context_stats(ctx) for ctx_id, ctx in ctx_map}
# the main bag of result files to process
res_bag = db.from_sequence(res_items, npartitions=num_workers * 2)
# prepare a list of corresponding delayed per results
# the order in this list corresponds to order of res_bag
res_context_list = [
ctx_stat_map[get_context_and_res_id(item)[0]] for item in res_items
]
# then create a bag from that list
ctx_bag = db.from_sequence(res_context_list, npartitions=num_workers * 2)
# create delays for the results
delayed_extract = delayed(extract_stats)
# from what I understand, if one of the arguments is also a bug
# it is distributed in accordance to the "main" bag
results = res_bag.map(delayed_extract, ctx_stats=ctx_bag)
df = ddf.from_delayed(results)
df = df.compute()
df.to_csv("results.csv")
This create a computation graph similar to the following:
When I run this on a subset (as in the image above) it works ok. Running the code on 1m items, I don't see anything happen (maybe didn't wait enough for it to finish building the graph and moving things around?)
With that, does the code above makes sense? Should I have done it another way?
One of the things I am "afraid" of with the above implementation is that there's a lot of data movement.
I could potentially spend some time up-front to arrange context+results and then treat that as the "unit-of-work" and maybe get better results?
Any feedback here would be appreciated - is there a better approach?
And another question - what number of partitions I should use? I saw in the docs it will default to about 100, but is there some rule of thumb to use here?

How to share the same index among multiple dask arrays

I'm trying to build a dask-based ipython application, that holds a meta-class which consists of some sub-dask-arrays (which are all shaped (n_samples, dim_1, dim_2 ...)) and should be able to sector the sub-dask-arrays by its getitem operator.
In the getitem method, I call the da.Array.compute method (the code is still in it's very early state), so I would be able to iterate batches of the sub-arrays.
def MetaClass(object):
...
def __getitem__(self, inds):
new_m = MetaClass()
inds = inds.compute()
for name,var in vars(self).items():
if isinstance(var,da.Array):
try:
setattr(new_m, name, var[inds])
except Exception as e:
print(e)
else:
setattr(new_m, name, var)
return new_m
# Here I construct the meta-class to work with some directory.
m = MetaClass('/my/data/...')
# m.type is one of the sub-dask-arrays
m2 = m[m.type==2]
It works as expected, and I get the sliced arrays, but as a result I get a huge memory consumption, and I assume that in the background the mechanism of dask is copying the index for each sub-dask-array.
My question is, how do I achieve the same results, without using so much memory?
(I tried not to "compute" the "inds" in getitem, but then I get nan shaped arrays, which can not be iterated, which is a must for the application)
I have been thinking about three possible solutions that I'd be happy to be advised which of them is the "right" one for me. (or to get another solution which I haven't thought of):
To use a Dask DataFrame, which I'm not sure how to fit multidimensional-dask-arrays in (would really appreciate some help or even a link that explains how to deal with multidimensional arrays in dd).
To forget about the entire MetaClass, and to use one dask-array with a nasty dtype (something like [("type",int,(1,)),("images",np.uint8,(1000,1000))]), again, I'm not familiar with this and would really appreciate some help with that (tried to google it.. it's a bit complicated..)
To share the index as a global inside the calling function (getitem) with property and its get-function-mechanism (https://docs.python.org/2/library/functions.html#property). But the big downside here is that I lose the types of the arrays (big down for representation and everything that needs anything but the data itself).
Thanks in advance!!!
One can use the sub-arrays.map_blocks with a shared function that holds the indices in its memory.
Here is an example:
def bool_mask(arr, block_info=None):
from_ind,to_ind = block_info[0]["array-location"][0]
return arr[inds[from_ind:to_ind]]
def getitem(var):
original_chunks = var.chunks[0]
tmp_inds = np.cumsum([0]+list(original_chunks))
from_inds = tmp_inds[:-1]
to_inds = tmp_inds[1:]
new_chunks_0 = np.array(list(map(lambda f,t:inds[f:t].sum(),from_inds,to_inds)))
new_chunks = tuple([tuple(new_chunks_0.tolist())] + list(var.chunks[1:]))
return var.map_blocks(bool_mask,dtype=var.dtype,chunks=new_chunks)

Dask compute fails when using client, works when no client setup

I am trying to use the dask client to parallelize my compute. When I run df.compute() I get the correct output (though it is very slow), but when I run the same thing after setting up a client, I get the following error:
distributed.protocol.pickle - INFO - Failed to serialize <function part at 0x7fd5186ed730>. Exception: can't pickle _thread.RLock objects
Here is my code, in the first df.compute(), I get the expected result, in the second I do not.
#dask.delayed
def part(x):
lower, upper = x
q = "SELECT id,tfidf_vec,emb_vec FROM document_table"
lines=man.session.execute(q)
counter = lower
df = []
for line in lines:
df.append(line)
counter += 1
if counter == upper:
break
return pd.DataFrame(df)
parts = [part(x) for x in [[0,100000],[100000,200000]]]
df = dd.from_delayed(parts)
df.compute()
from dask.distributed import Client
client = Client('127.0.0.1:8786')
df.compute()
Your function contains a reference to man.session, which is part of the function closure. When you use the default scheduler, threads, the object can be shared between the threads that execute your code. When you use the distributed scheduler, the function must be serialised and sent to workers in difference process(es).
You should make a function which creates the session object on each invocation, as was suggested as an answer to your very similar question.

How to communicate a future or lazy collection among dask distributed workers

The setup is basically:
client side:
with Client(...):
dfs= [client.submit(load_data_lazily, i) for i in ...]
ddf = dd.from_delayed(dfs)
xxx = ... how to serialize the handle to ddf ...?
... or xxx could be a future instead of a dask collection ...
result = client.submit(do_something_with_lazy_remotely, xxx)
worker (worker-client) side:
def do_something_with_lazy_remotely(xxx):
with worker_client() as client:
ddf = ... how to deserialize from xxx and not materialize? ...
# slice to a reasonable size
df = client.compute(ddf.loc[...], sync=True)
# do some computation on df
client.submit(...)
I can get something working by using client.persist and client.publish_dataset. The problem is then I'm doing manual tracking to figure out when to unpublish_dataset, and with asynchronous behavior this gets difficult.
Is there a (more) natural way to do this right now?
EDIT: it seems like this may be the way to go here?

Resources