Load and merge many files from S3 using Dask - dask

I have about 1m "result" files in S3 bucket which I want to process. Each result file should be merge with additional columns from an associated "context" file, which I have about 50k of (i.e. each context is associated with about 20 results)
Processing it serially is slow so I am using dask to parallelize some of the work.
In my serial code, I just load everything up-front and merge them, e.g.
contexts_map = {get_context_id(ctx_file): load_context(ctx_file) for ctx_file in ctx_files}
data = []
for result_file in result_files:
ctx_id, res_id = get_context_and_res_id(result_file)
ctx = contexts_map[ctx_id]
data.append(process_result(ctx))
df = pd.DataFrame(data)
Initially I thought to divide the data and process in batches using dask (i.e. run the above in parallel on several batches) but then I read about dask bag and dask dataframe from_delayed and thought to use it. What I have:
delayed_get_context = delayed(get_context)
# load the contexts
ctx_map = {}
for ctx_file in ctx_files:
ctx_id = get_context_id(ctx_file)
ctx_map[ctx_file] = delayed_get_context(ctx_item)
# process the contexts
delayed_get_context_stats = delayed(get_context_stats)
ctx_stat_map = {ctx_id: delayed_get_context_stats(ctx) for ctx_id, ctx in ctx_map}
# the main bag of result files to process
res_bag = db.from_sequence(res_items, npartitions=num_workers * 2)
# prepare a list of corresponding delayed per results
# the order in this list corresponds to order of res_bag
res_context_list = [
ctx_stat_map[get_context_and_res_id(item)[0]] for item in res_items
]
# then create a bag from that list
ctx_bag = db.from_sequence(res_context_list, npartitions=num_workers * 2)
# create delays for the results
delayed_extract = delayed(extract_stats)
# from what I understand, if one of the arguments is also a bug
# it is distributed in accordance to the "main" bag
results = res_bag.map(delayed_extract, ctx_stats=ctx_bag)
df = ddf.from_delayed(results)
df = df.compute()
df.to_csv("results.csv")
This create a computation graph similar to the following:
When I run this on a subset (as in the image above) it works ok. Running the code on 1m items, I don't see anything happen (maybe didn't wait enough for it to finish building the graph and moving things around?)
With that, does the code above makes sense? Should I have done it another way?
One of the things I am "afraid" of with the above implementation is that there's a lot of data movement.
I could potentially spend some time up-front to arrange context+results and then treat that as the "unit-of-work" and maybe get better results?
Any feedback here would be appreciated - is there a better approach?
And another question - what number of partitions I should use? I saw in the docs it will default to about 100, but is there some rule of thumb to use here?

Related

Iterate on (or access directly) xarray chunks

I'm after a way to iterate on xarray chunks, so something similar to dask.array.blocks but that would give me access to xarray chunks with coordinates and dimensions.
For the record, I'm aware that xarray.map_blocks exists, but what I'm doing maps input chunks to output chunks of unknown shape, so I'd like to write something custom by looping directly on the xarray chunks.
I've tried to look into the xarray.map_blocks source code, since I guess something similar to what I need is in there, but I had a hard time understanding what's going on there.
EDIT:
My use case is that I would like, for each xarray chunk, to get an output xarray chunk of variable length along a new dimension (called foo below), and eventually concatenate them along foo.
This is a mocked scenario that should at least clarify what I'm after.
For now I've solved the problem constructing, from each dask chunk of the DataArray, an "xarray" chunk (but this looks quite convoluted), and then using client.map(fn_on_chunk, xarray_chunks).
n = 1000
x_raster = y_raster = np.arange(n)
time = np.arange(10)
vals_raster = np.arange(n*n*10).reshape(n, n, 10)
da_raster = xr.DataArray(vals_raster, coords={"y": y_raster, "x": x_raster, 'time':time})
da_raster = da_raster.chunk(dict(x=100, y=100))
def fn_on_chunk(da_chunk):
# Tried to replicate the fact that I can't know in advance
# the lenght of one dimension of the output
len_range = np.random.randint(10)
outs = []
for foo in range(len_range):
# Do some magic that finds needed coordinates
# on this particular chunk
x_chunk, y_chunk = fn_magic(foo)
out = da_chunk.sel(x=x_chunk, y=y_chunk)
out['foo'] = foo
outs.append(out)
return xr.concat(outs, dim='foo')

How to apply multiple criteria on a parsed csv chunk to get multiple outputs?

New to dask , Any help is appreciated!
Basically i read csv file out of 540 csv files(out of RAM) & everytime i read a csv i apply 2 filter criteria to get 2 output files, though dask is doing its job , it is taking twice the time for the same chunk. how can i write an efficient code for this.
pricing_data = dd.read_csv(os.path.join('Selection Tool', 'prc_data','original','*.csv'),dtype={'BENCHMARK YIELD': 'object',
'BID YIELD': 'object','SPREAD': 'object'},parse_dates=['PRICING DATE'],assume_missing=True,low_memory=False)
pricing_data['Running_Month_ISIN'] = pricing_data['PRICING DATE'].apply(lambda x: x.strftime('%m%Y'), meta=('PRICING DATE', 'object')) + pricing_data['ISIN']
pricing_data['ISIN_PRICING_DATE'] = pricing_data['ISIN'] + pricing_data['PRICING DATE'].dt.strftime('%Y%m%d').astype(str) # master pricing data
pricing_data['PRICING_DATE_ISIN'] = pricing_data['PRICING DATE'].dt.strftime('%Y%m%d').astype(str) + pricing_data['ISIN']
prc_output2 = pricing_data[pricing_data.PRICING_DATE_ISIN.isin(matching_list_2)].compute()
prc_output1 = pricing_data[pricing_data.Running_Month_ISIN.isin(matching_list_1)].compute()
From the documentation:
Compute related results with shared computations in a single dask.compute() call
This allows Dask to compute the shared parts of the computation (like the dd.read_csv call above) only once, rather than once per compute call.
So in your case:
dask_prc_output2 = pricing_data[pricing_data.PRICING_DATE_ISIN.isin(matching_list_2)]
dask_prc_output1 = pricing_data[pricing_data.Running_Month_ISIN.isin(matching_list_1)]
prc_output1, prc_output2 = dask.compute(dask_prc_output1, dask_prc_output2)

Mismanagement of dask future results slow down the performances

I'm looking for any suggestion on how to solve the bottleneck below described.
Within a dask distributed infrastructure I map some futures and gain results whenever they are ready. Once retrieved I've to invoke a time consuming, blocking "pandas" function and, unfortunately, this function can't be avoided.
The optimum would be to have something that let me create another process, detached from the for loop, that's able to ingest the flow of results. For other constraints, not present in the example, the output can't be serialized and sent to workers and must be processed on the master.
here a small mockup. Just grab the idea and not focus too much on the details of the code.
class pxldrl(object):
def __init__(self, df):
self.table = df
def simulation(list_param):
time.sleep(random.random())
val = sum(list_param)/4
if val < 0.5:
result = {'param_e': val}
else:
result = {'param_f': val}
return pxldrl(result)
def costly_function(result, output):
time.sleep(1)
# blocking pandas function
output = output.append(result.table, sort=False, ignore_index=True)
return output
def main():
client = Client(n_workers=4, threads_per_worker=1)
output = pd.DataFrame(columns=['param_e', 'param_f'])
input = pd.DataFrame(np.random.random(size=(100, 4)),
columns=['param_a', 'param_b', 'param_c', 'param_d'])
for i in range(2):
futures = client.map(simulation, input.values)
for future, result in as_completed(futures, with_results=True):
output = costly_function(result, output)
It sounds like you want to run costly_function in a separate thread. Perhaps you could using the threading or concurrent.futures module to run your entire routine on a separate thread?
If you wanted to get fancy, you could even use Dask again and create a second client that ran within this process:
local_client = Client(processes=False)
and use that. (although you'll have to be careful about mixing futures between clients, which won't work)

I have collection of futures which are result of persist on dask dataframe. How to do a delayed operation on them?

I have setup a scheduler and 4 worker nodes to do some processing on csv. size of the csv is just 300 mb.
df = dd.read_csv('/Downloads/tmpcrnin5ta',assume_missing=True)
df = df.groupby(['col_1','col_2']).agg('mean').reset_index()
df = client.persist(df)
def create_sep_futures(symbol,df):
symbol_df = copy.deepcopy(df[df['symbol' == symbol]])
return symbol_df
lazy_values = [delayed(create_sep_futures)(symbol, df) for symbol in st]
future = client.compute(lazy_values)
result = client.gather(future)
st list contains 1000 elements
when I do this, I get this error:
distributed.worker - WARNING - Compute Failed
Function: create_sep_futures
args: ('PHG', symbol col_3 col_2 \
0 A 1.451261e+09 23.512857
1 A 1.451866e+09 23.886857
2 A 1.452470e+09 25.080429
kwargs: {}
Exception: KeyError(False,)
My assumption is that workers should get full dataframe and query on it. But I think it just gets the block and tries to do it.
What is the workaround for it? Since dataframe chunks are already in workers memory. I don't want to move the dataframe to each worker.
Operations on dataframes, using the dataframe syntax and API, are lazy (delayed) by default, you need do nothing more.
First problem: your syntax is wrong df[df['symbol' == symbol]] => df[df['symbol'] == symbol]. That is the origin of the False key.
So the solution you are probably looking for:
future = client.compute(df[df['symbol'] == symbol])
If you do want to work on the chunks separately, you can look into df.map_partitions, which you use with a normal function and takes care of passing data or delayed/futures or df.to_delayed, which will give you a set of delayed objects which you can use with a delayed function.

Access a single element in large published array with Dask

Is there a faster way to only retrieve a single element in a large published array with Dask without retrieving the entire array?
In the example below client.get_dataset('array1')[0] takes roughly the same time as client.get_dataset('array1').
import distributed
client = distributed.Client()
data = [1]*10000000
payload = {'array1': data}
client.publish(**payload)
one_element = client.get_dataset('array1')[0]
Note that anything you publish goes to the scheduler, not to the workers, so this is somewhat inefficient. Publish was intended to be used with Dask collections like dask.array.
Client 1
import dask.array as da
x = da.ones(10000000, chunks=(100000,)) # 1e7 size array cut into 1e5 size chunks
x = x.persist() # persist array on the workers of the cluster
client.publish(x=x) # store the metadata of x on the scheduler
Client 2
x = client.get_dataset('x') # get the lazy collection x
x[0].compute() # this selection happens on the worker, only the result comes down

Resources