Access a single element in large published array with Dask - dask

Is there a faster way to only retrieve a single element in a large published array with Dask without retrieving the entire array?
In the example below client.get_dataset('array1')[0] takes roughly the same time as client.get_dataset('array1').
import distributed
client = distributed.Client()
data = [1]*10000000
payload = {'array1': data}
client.publish(**payload)
one_element = client.get_dataset('array1')[0]

Note that anything you publish goes to the scheduler, not to the workers, so this is somewhat inefficient. Publish was intended to be used with Dask collections like dask.array.
Client 1
import dask.array as da
x = da.ones(10000000, chunks=(100000,)) # 1e7 size array cut into 1e5 size chunks
x = x.persist() # persist array on the workers of the cluster
client.publish(x=x) # store the metadata of x on the scheduler
Client 2
x = client.get_dataset('x') # get the lazy collection x
x[0].compute() # this selection happens on the worker, only the result comes down

Related

Iterate on (or access directly) xarray chunks

I'm after a way to iterate on xarray chunks, so something similar to dask.array.blocks but that would give me access to xarray chunks with coordinates and dimensions.
For the record, I'm aware that xarray.map_blocks exists, but what I'm doing maps input chunks to output chunks of unknown shape, so I'd like to write something custom by looping directly on the xarray chunks.
I've tried to look into the xarray.map_blocks source code, since I guess something similar to what I need is in there, but I had a hard time understanding what's going on there.
EDIT:
My use case is that I would like, for each xarray chunk, to get an output xarray chunk of variable length along a new dimension (called foo below), and eventually concatenate them along foo.
This is a mocked scenario that should at least clarify what I'm after.
For now I've solved the problem constructing, from each dask chunk of the DataArray, an "xarray" chunk (but this looks quite convoluted), and then using client.map(fn_on_chunk, xarray_chunks).
n = 1000
x_raster = y_raster = np.arange(n)
time = np.arange(10)
vals_raster = np.arange(n*n*10).reshape(n, n, 10)
da_raster = xr.DataArray(vals_raster, coords={"y": y_raster, "x": x_raster, 'time':time})
da_raster = da_raster.chunk(dict(x=100, y=100))
def fn_on_chunk(da_chunk):
# Tried to replicate the fact that I can't know in advance
# the lenght of one dimension of the output
len_range = np.random.randint(10)
outs = []
for foo in range(len_range):
# Do some magic that finds needed coordinates
# on this particular chunk
x_chunk, y_chunk = fn_magic(foo)
out = da_chunk.sel(x=x_chunk, y=y_chunk)
out['foo'] = foo
outs.append(out)
return xr.concat(outs, dim='foo')

Load and merge many files from S3 using Dask

I have about 1m "result" files in S3 bucket which I want to process. Each result file should be merge with additional columns from an associated "context" file, which I have about 50k of (i.e. each context is associated with about 20 results)
Processing it serially is slow so I am using dask to parallelize some of the work.
In my serial code, I just load everything up-front and merge them, e.g.
contexts_map = {get_context_id(ctx_file): load_context(ctx_file) for ctx_file in ctx_files}
data = []
for result_file in result_files:
ctx_id, res_id = get_context_and_res_id(result_file)
ctx = contexts_map[ctx_id]
data.append(process_result(ctx))
df = pd.DataFrame(data)
Initially I thought to divide the data and process in batches using dask (i.e. run the above in parallel on several batches) but then I read about dask bag and dask dataframe from_delayed and thought to use it. What I have:
delayed_get_context = delayed(get_context)
# load the contexts
ctx_map = {}
for ctx_file in ctx_files:
ctx_id = get_context_id(ctx_file)
ctx_map[ctx_file] = delayed_get_context(ctx_item)
# process the contexts
delayed_get_context_stats = delayed(get_context_stats)
ctx_stat_map = {ctx_id: delayed_get_context_stats(ctx) for ctx_id, ctx in ctx_map}
# the main bag of result files to process
res_bag = db.from_sequence(res_items, npartitions=num_workers * 2)
# prepare a list of corresponding delayed per results
# the order in this list corresponds to order of res_bag
res_context_list = [
ctx_stat_map[get_context_and_res_id(item)[0]] for item in res_items
]
# then create a bag from that list
ctx_bag = db.from_sequence(res_context_list, npartitions=num_workers * 2)
# create delays for the results
delayed_extract = delayed(extract_stats)
# from what I understand, if one of the arguments is also a bug
# it is distributed in accordance to the "main" bag
results = res_bag.map(delayed_extract, ctx_stats=ctx_bag)
df = ddf.from_delayed(results)
df = df.compute()
df.to_csv("results.csv")
This create a computation graph similar to the following:
When I run this on a subset (as in the image above) it works ok. Running the code on 1m items, I don't see anything happen (maybe didn't wait enough for it to finish building the graph and moving things around?)
With that, does the code above makes sense? Should I have done it another way?
One of the things I am "afraid" of with the above implementation is that there's a lot of data movement.
I could potentially spend some time up-front to arrange context+results and then treat that as the "unit-of-work" and maybe get better results?
Any feedback here would be appreciated - is there a better approach?
And another question - what number of partitions I should use? I saw in the docs it will default to about 100, but is there some rule of thumb to use here?

Dask compute fails when using client, works when no client setup

I am trying to use the dask client to parallelize my compute. When I run df.compute() I get the correct output (though it is very slow), but when I run the same thing after setting up a client, I get the following error:
distributed.protocol.pickle - INFO - Failed to serialize <function part at 0x7fd5186ed730>. Exception: can't pickle _thread.RLock objects
Here is my code, in the first df.compute(), I get the expected result, in the second I do not.
#dask.delayed
def part(x):
lower, upper = x
q = "SELECT id,tfidf_vec,emb_vec FROM document_table"
lines=man.session.execute(q)
counter = lower
df = []
for line in lines:
df.append(line)
counter += 1
if counter == upper:
break
return pd.DataFrame(df)
parts = [part(x) for x in [[0,100000],[100000,200000]]]
df = dd.from_delayed(parts)
df.compute()
from dask.distributed import Client
client = Client('127.0.0.1:8786')
df.compute()
Your function contains a reference to man.session, which is part of the function closure. When you use the default scheduler, threads, the object can be shared between the threads that execute your code. When you use the distributed scheduler, the function must be serialised and sent to workers in difference process(es).
You should make a function which creates the session object on each invocation, as was suggested as an answer to your very similar question.

I have collection of futures which are result of persist on dask dataframe. How to do a delayed operation on them?

I have setup a scheduler and 4 worker nodes to do some processing on csv. size of the csv is just 300 mb.
df = dd.read_csv('/Downloads/tmpcrnin5ta',assume_missing=True)
df = df.groupby(['col_1','col_2']).agg('mean').reset_index()
df = client.persist(df)
def create_sep_futures(symbol,df):
symbol_df = copy.deepcopy(df[df['symbol' == symbol]])
return symbol_df
lazy_values = [delayed(create_sep_futures)(symbol, df) for symbol in st]
future = client.compute(lazy_values)
result = client.gather(future)
st list contains 1000 elements
when I do this, I get this error:
distributed.worker - WARNING - Compute Failed
Function: create_sep_futures
args: ('PHG', symbol col_3 col_2 \
0 A 1.451261e+09 23.512857
1 A 1.451866e+09 23.886857
2 A 1.452470e+09 25.080429
kwargs: {}
Exception: KeyError(False,)
My assumption is that workers should get full dataframe and query on it. But I think it just gets the block and tries to do it.
What is the workaround for it? Since dataframe chunks are already in workers memory. I don't want to move the dataframe to each worker.
Operations on dataframes, using the dataframe syntax and API, are lazy (delayed) by default, you need do nothing more.
First problem: your syntax is wrong df[df['symbol' == symbol]] => df[df['symbol'] == symbol]. That is the origin of the False key.
So the solution you are probably looking for:
future = client.compute(df[df['symbol'] == symbol])
If you do want to work on the chunks separately, you can look into df.map_partitions, which you use with a normal function and takes care of passing data or delayed/futures or df.to_delayed, which will give you a set of delayed objects which you can use with a delayed function.

Memory Blow Up when using Dask compute or persist with Dask Delayed

I am trying to process the data of several subjects all in one dataframe. There are >30 subjects and 14 computations per subject it is a large data set but any more than 5 blows up the memory on the scheduler node with out running any workers on the same node as the scheduler it has 128gb of memory? Any ideas how I can get around this or if im doing something wrong? code bellow.
def channel_select(chn,sub):
subject = pd.DataFrame(df.loc[df['sub'] == sub])
subject['s0'] = subject[chn]
val = []
for x in range(13):
for i in range(len(subject)):
val.append(subject['s0'].values[i-x])
name = 's' + str(x+1)
subject[name] = val
val = []
return subject
subs = df['sub'].unique()
subs = np.delete(subs, [34,33])
for s in subs:
for c in chn:
chn_del.append(delayed(channel_select)(c,subs[s]))
results = e.persist(pred)
I have the code shown to run all the subjects but anymore than 5 at a time and I run out of memory
You're telling the computer to keep almost 1,000 GB of memory.
But you knew that already (:
As Mary stated above, every call to channel_select created and stores the dataframe on the schedulers memory, with 30 subjects calling 14 time each and a 2gb dataframe...yeah you can do the math of how much memory that was trying grab.

Resources