I have a collection of futures that are the result of persist on a dask dataframe. How do I run a delayed operation on them? - dask

I have set up a scheduler and 4 worker nodes to do some processing on a CSV file. The CSV is only 300 MB.
import copy
import dask.dataframe as dd
from dask import delayed
df = dd.read_csv('/Downloads/tmpcrnin5ta',assume_missing=True)
df = df.groupby(['col_1','col_2']).agg('mean').reset_index()
df = client.persist(df)
def create_sep_futures(symbol,df):
    symbol_df = copy.deepcopy(df[df['symbol' == symbol]])
    return symbol_df
lazy_values = [delayed(create_sep_futures)(symbol, df) for symbol in st]
future = client.compute(lazy_values)
result = client.gather(future)
The list st contains 1000 elements.
When I do this, I get this error:
distributed.worker - WARNING - Compute Failed
Function: create_sep_futures
args: ('PHG', symbol col_3 col_2 \
0 A 1.451261e+09 23.512857
1 A 1.451866e+09 23.886857
2 A 1.452470e+09 25.080429
kwargs: {}
Exception: KeyError(False,)
My assumption is that each worker should get the full dataframe and query it. But I think it just gets a block and tries to run the query on that.
What is the workaround for this? The dataframe chunks are already in the workers' memory, so I don't want to move the whole dataframe to each worker.

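Operations on dataframes, using the dataframe syntax and API, are lazy (delayed) by default; you need to do nothing more.
First problem: your syntax is wrong. df[df['symbol' == symbol]] should be df[df['symbol'] == symbol]. The expression 'symbol' == symbol compares two strings and evaluates to False, so you end up indexing with df[False]. That is the origin of the False key.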
So the solution you are probably looking for:
future = client.compute(df[df['symbol'] == symbol])
If you do want to work on the chunks separately, there are two options. df.map_partitions takes a normal function and takes care of passing it the data (or the delayed objects/futures). df.to_delayed gives you a list of delayed objects, one per partition, which you can pass to a delayed function.

Related

Iterate on (or access directly) xarray chunks

I'm after a way to iterate on xarray chunks, so something similar to dask.array.blocks but that would give me access to xarray chunks with coordinates and dimensions.
For the record, I'm aware that xarray.map_blocks exists, but what I'm doing maps input chunks to output chunks of unknown shape, so I'd like to write something custom by looping directly on the xarray chunks.
I've tried to look into the xarray.map_blocks source code, since I guess something similar to what I need is in there, but I had a hard time understanding what's going on there.
EDIT:
My use case is that I would like, for each xarray chunk, to get an output xarray chunk of variable length along a new dimension (called foo below), and eventually concatenate them along foo.
This is a mocked scenario that should at least clarify what I'm after.
For now I've solved the problem by constructing an "xarray" chunk from each dask chunk of the DataArray (which looks quite convoluted), and then using client.map(fn_on_chunk, xarray_chunks).
import numpy as np
import xarray as xr
n = 1000
x_raster = y_raster = np.arange(n)
time = np.arange(10)
vals_raster = np.arange(n*n*10).reshape(n, n, 10)
# dims is passed explicitly: newer xarray versions do not infer it from a coords dict
da_raster = xr.DataArray(vals_raster, dims=("y", "x", "time"), coords={"y": y_raster, "x": x_raster, 'time':time})
da_raster = da_raster.chunk(dict(x=100, y=100))
def fn_on_chunk(da_chunk):
    # Tried to replicate the fact that I can't know in advance
    # the length of one dimension of the output
    len_range = np.random.randint(10)
    outs = []
    for foo in range(len_range):
        # Do some magic that finds needed coordinates
        # on this particular chunk
        x_chunk, y_chunk = fn_magic(foo)
        out = da_chunk.sel(x=x_chunk, y=y_chunk)
        out['foo'] = foo
        outs.append(out)
    return xr.concat(outs, dim='foo')
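One possible way to build the per-chunk "xarray" pieces described in the question is to turn the chunk sizes into index slices and call isel for each combination. This is only a sketch: iter_chunks is a hypothetical helper, not part of xarray, and the small array stands in for da_raster.

```python
import itertools

import numpy as np
import xarray as xr


def iter_chunks(da):
    """Yield one DataArray per dask chunk, with coordinates and dims attached."""
    # da.chunks holds the chunk sizes per dimension, e.g. ((2, 2), (3, 3)).
    slices_per_dim = []
    for sizes in da.chunks:
        stops = np.cumsum(sizes)
        starts = stops - np.asarray(sizes)
        slices_per_dim.append([slice(int(a), int(b)) for a, b in zip(starts, stops)])
    # Every combination of per-dimension slices addresses exactly one chunk.
    for combo in itertools.product(*slices_per_dim):
        yield da.isel(dict(zip(da.dims, combo)))


# Small stand-in for da_raster: 4 x 6 values split into 2 x 3 chunks.
da = xr.DataArray(
    np.arange(4 * 6).reshape(4, 6),
    dims=("y", "x"),
    coords={"y": np.arange(4), "x": np.arange(6)},
).chunk({"y": 2, "x": 3})

chunks = list(iter_chunks(da))
```

Each yielded piece is still lazy and keeps its own x and y coordinates, so it can be handed to a function such as fn_on_chunk and the outputs concatenated afterwards.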

Load and merge many files from S3 using Dask

I have about 1m "result" files in an S3 bucket which I want to process. Each result file should be merged with additional columns from an associated "context" file, of which I have about 50k (i.e. each context is associated with about 20 results).
Processing them serially is slow, so I am using dask to parallelize some of the work.
In my serial code, I just load everything up front and merge, e.g.
contexts_map = {get_context_id(ctx_file): load_context(ctx_file) for ctx_file in ctx_files}
data = []
for result_file in result_files:
    ctx_id, res_id = get_context_and_res_id(result_file)
    ctx = contexts_map[ctx_id]
    data.append(process_result(ctx))
df = pd.DataFrame(data)
Initially I thought to divide the data and process in batches using dask (i.e. run the above in parallel on several batches) but then I read about dask bag and dask dataframe from_delayed and thought to use it. What I have:
delayed_get_context = delayed(get_context)
# load the contexts
ctx_map = {}
for ctx_file in ctx_files:
    ctx_id = get_context_id(ctx_file)
    ctx_map[ctx_id] = delayed_get_context(ctx_file)
# process the contexts
delayed_get_context_stats = delayed(get_context_stats)
ctx_stat_map = {ctx_id: delayed_get_context_stats(ctx) for ctx_id, ctx in ctx_map.items()}
# the main bag of result files to process
res_bag = db.from_sequence(res_items, npartitions=num_workers * 2)
# prepare a list of corresponding delayed per results
# the order in this list corresponds to order of res_bag
res_context_list = [
    ctx_stat_map[get_context_and_res_id(item)[0]] for item in res_items
]
# then create a bag from that list
ctx_bag = db.from_sequence(res_context_list, npartitions=num_workers * 2)
# create delays for the results
delayed_extract = delayed(extract_stats)
# from what I understand, if one of the arguments is also a bag
# it is distributed in accordance to the "main" bag
results = res_bag.map(delayed_extract, ctx_stats=ctx_bag)
df = ddf.from_delayed(results)
df = df.compute()
df.to_csv("results.csv")
This creates a computation graph similar to the following (image not shown):
When I run this on a subset (as in the image above) it works OK. Running the code on 1m items, I don't see anything happen (maybe I didn't wait long enough for it to finish building the graph and moving things around?)
With that, does the code above make sense? Should I have done it another way?
One of the things I am "afraid" of with the above implementation is that there's a lot of data movement.
I could potentially spend some time up-front to arrange context+results and then treat that as the "unit-of-work" and maybe get better results?
Any feedback here would be appreciated - is there a better approach?
And another question: what number of partitions should I use? I saw in the docs that it defaults to about 100, but is there a rule of thumb to use here?
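The "unit-of-work" idea raised in the question (arrange each context together with its results up front) could be sketched roughly as follows. Everything here is illustrative: load_context, load_result, process_context_group and the "<context id>/<result id>" key layout are hypothetical stand-ins for the real S3 loaders, and the sketch runs on the default local scheduler.

```python
from collections import defaultdict

import dask
import pandas as pd


def load_context(ctx_id):
    """Hypothetical stand-in for reading one context file from S3."""
    return {"ctx_id": ctx_id, "weight": len(ctx_id)}


def load_result(res_id):
    """Hypothetical stand-in for reading one result file from S3."""
    return {"res_id": res_id, "score": len(res_id)}


def process_context_group(ctx_id, res_ids):
    """Load one context once, then merge its columns into each of its results."""
    ctx = load_context(ctx_id)
    rows = [{**load_result(res_id), **ctx} for res_id in res_ids]
    return pd.DataFrame(rows)


# Assumed key layout: "<context id>/<result id>".
result_keys = ["ctx1/r1", "ctx1/r22", "ctx2/r333"]

# Group result ids by context so each task is one self-contained unit of work.
groups = defaultdict(list)
for key in result_keys:
    ctx_id, res_id = key.split("/")
    groups[ctx_id].append(res_id)

# One delayed task per context: only small ids travel through the graph,
# and the context data is loaded inside the task that needs it.
tasks = [dask.delayed(process_context_group)(c, r) for c, r in groups.items()]
frames = dask.compute(*tasks)
df = pd.concat(frames, ignore_index=True)
```

With about 50k contexts this would give about 50k tasks instead of 1m, and no context object has to be shipped between tasks. The list of delayed frames could also be passed to dask.dataframe.from_delayed instead of being concatenated with pandas.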

pyarrow - identify the fragments written or filters used when writing a parquet dataset?

My use case is that I want to pass the file paths or filters to a task in Airflow as an xcom so that my next task can read the data which was just processed.
Task A writes a table to a partitioned dataset and a number of Parquet file fragments are generated --> Task B reads those fragments later as a dataset. I need to only read relevant data though, not the entire dataset which could have many millions of rows.
I have tested two approaches:
1. List modified files right after I finish writing to the dataset. This will provide me with a list of paths which I can call ds.dataset(paths) on during my next task. I can use partitioning.parse() on these paths or check the fragments to get a list of filters used (frag.partition_expression). A flaw with this is that I can have files being written in parallel to the same dataset.
2. Generate the filters used when writing the dataset by turning the table into a pandas dataframe, doing a groupby, and then constructing filters. I am not sure if there is a simpler approach to this. I can then use pq._filters_to_expression() on the results to create a usable filter. This is not ideal since I need to fix certain data types which do not get saved properly as an Airflow xcom (no pickling so everything has to be in json format). Also, if I want to partition on a dictionary column, I might need to tweak this function.
def create_filter_list(df, partition_columns):
    """Creates a list of pyarrow filters to be sent through an xcom and evaluated as an expression. Xcom disables pickling, so we need to save timestamp and date values as strings and convert downstream"""
    filter_list = []
    value_list = []
    partition_keys = [df[col] for col in partition_columns]
    for keys, _ in df[partition_columns].groupby(partition_keys):
        if len(partition_columns) == 1:
            if is_jsonable(keys):
                value_list.append(keys)
            elif keys is not None:
                value_list.append(str(keys))
        else:
            if not isinstance(keys, tuple):
                keys = (keys,)
            read_filter = []
            for name, val in zip(partition_columns, keys):
                if type(val) == np.int_:
                    read_filter.append((name, "==", int(val)))
                elif val is not None:
                    read_filter.append((name, "==", str(val)))
            filter_list.append(read_filter)
    if len(partition_columns) == 1:
        if len(value_list) > 0:
            filter_list = [(name, "in", value_list) for name in partition_columns]
    return filter_list
Any suggestions on which approach I should take, or if there is a better way to achieve my goal?
You can watch this issue (https://issues.apache.org/jira/browse/ARROW-10440), which I believe covers what you want. In the meantime, you could use basename_template as a workaround.
import glob
import os
import pyarrow as pa
import pyarrow.dataset as pads
class TrackingWriter:
    def __init__(self):
        self.counter = 0
        part_schema = pa.schema({'part': pa.int64()})
        self.partitioning = pads.HivePartitioning(part_schema)
    def next_counter(self):
        result = self.counter
        self.counter += 1
        return result
    def write_dataset(self, table, base_dir):
        counter = self.next_counter()
        pads.write_dataset(table, base_dir, format='parquet', partitioning=self.partitioning, basename_template=f'batch-{counter}-part-{{i}}')
        files_written = glob.glob(os.path.join(base_dir, '**', f'batch-{counter}-*'))
        return files_written
table_one = pa.table({'part': [0, 0, 1, 1], 'val': [1, 2, 3, 4]})
table_two = pa.table({'part': [0, 0, 1, 1], 'val': [5, 6, 7, 8]})
writer = TrackingWriter()
print(writer.write_dataset(table_one, '/tmp/mydataset'))
print(writer.write_dataset(table_two, '/tmp/mydataset'))
This is just a rough sketch. You'd probably also want code to run at startup to see what the next free value of counter is. Or you could use a uuid instead of a counter.
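The two follow-ups mentioned above (finding the next free counter at startup, or using a uuid instead) might look something like this. It is a sketch using only the standard library; next_free_counter and uuid_basename_template are hypothetical helper names, and it assumes files are named batch-<counter>-part-<i> as in the code above.

```python
import glob
import os
import re
import tempfile
import uuid

# Matches the 'batch-<counter>-part-<i>' names produced by the template above.
BATCH_NAME = re.compile(r"^batch-(\d+)-part-")


def next_free_counter(base_dir):
    """Return one more than the largest batch counter found under base_dir."""
    used = []
    pattern = os.path.join(base_dir, "**", "batch-*")
    for path in glob.glob(pattern, recursive=True):
        match = BATCH_NAME.match(os.path.basename(path))
        if match:
            used.append(int(match.group(1)))
    # An empty dataset starts at 0.
    return max(used, default=-1) + 1


def uuid_basename_template():
    """Alternative: a unique template per write, so no startup scan is needed."""
    return f"batch-{uuid.uuid4().hex}-part-{{i}}"


# Demonstration on a throwaway directory laid out like a Hive-partitioned dataset.
with tempfile.TemporaryDirectory() as base_dir:
    empty_counter = next_free_counter(base_dir)
    for part, name in [(0, "batch-0-part-0"), (1, "batch-0-part-0"), (1, "batch-3-part-0")]:
        os.makedirs(os.path.join(base_dir, f"part={part}"), exist_ok=True)
        open(os.path.join(base_dir, f"part={part}", name), "w").close()
    counter = next_free_counter(base_dir)

template = uuid_basename_template()
```

The uuid variant also sidesteps the parallel-writer concern from the question, since two writers cannot pick the same prefix, whereas a scanned counter would still need some coordination between concurrent writers.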
A suggestion (not sure if this is optimal for your use case or not):
The key problem is the need to correctly select a subset of the data, and this can be 'fixed' upstream. The function/script that updates the big dataframe can include a condition that saves a copy of the modified data that satisfies your requirements to a separate (temporary) path. That file would then be passed to the downstream tasks, which can delete it once it has been processed.

Dask compute fails when using client, works when no client setup

I am trying to use the dask client to parallelize my compute. When I run df.compute() I get the correct output (though it is very slow), but when I run the same thing after setting up a client, I get the following error:
distributed.protocol.pickle - INFO - Failed to serialize <function part at 0x7fd5186ed730>. Exception: can't pickle _thread.RLock objects
Here is my code. The first df.compute() gives the expected result; the second does not.
@dask.delayed
def part(x):
    lower, upper = x
    q = "SELECT id,tfidf_vec,emb_vec FROM document_table"
    lines=man.session.execute(q)
    counter = lower
    df = []
    for line in lines:
        df.append(line)
        counter += 1
        if counter == upper:
            break
    return pd.DataFrame(df)
parts = [part(x) for x in [[0,100000],[100000,200000]]]
df = dd.from_delayed(parts)
df.compute()
from dask.distributed import Client
client = Client('127.0.0.1:8786')
df.compute()
Your function contains a reference to man.session, which is part of the function closure. When you use the default scheduler (threads), the object can be shared between the threads that execute your code. When you use the distributed scheduler, the function must be serialised and sent to workers in different process(es), and the session evidently holds a lock that cannot be pickled.
You should make a function which creates the session object on each invocation, as was suggested as an answer to your very similar question.
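A small standard-library illustration of the point above. FakeSession and make_session are hypothetical stand-ins, assuming the real session holds a threading lock as the error message suggests; the sketch does not use dask or a real database.

```python
import pickle
import threading


class FakeSession:
    """Hypothetical stand-in for a database session that owns a lock."""

    def __init__(self):
        self._lock = threading.RLock()

    def execute(self, query):
        with self._lock:
            return [(i, query) for i in range(5)]


def make_session():
    """Hypothetical factory; the real one would open a database connection."""
    return FakeSession()


def part(bounds):
    lower, upper = bounds
    # The session is created inside the task, so it never has to be
    # serialised along with the function.
    session = make_session()
    rows = session.execute("SELECT ...")
    return rows[lower:upper]


# Trying to pickle an object that holds an RLock fails, which is what happens
# when the task function drags a shared session along with it.
try:
    pickle.dumps(FakeSession())
    session_is_picklable = True
except TypeError:
    session_is_picklable = False

rows = part((1, 3))
```

The trade-off is one connection per task call, so it may be worth keeping the number of tasks modest or caching the session per worker process.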

Memory Blow Up when using Dask compute or persist with Dask Delayed

I am trying to process the data of several subjects, all in one dataframe. There are more than 30 subjects and 14 computations per subject. It is a large data set, but running any more than 5 subjects blows up the memory on the scheduler node, which has 128 GB of memory and does not run any workers. Any ideas how I can get around this, or am I doing something wrong? Code below.
def channel_select(chn,sub):
    subject = pd.DataFrame(df.loc[df['sub'] == sub])
    subject['s0'] = subject[chn]
    val = []
    for x in range(13):
        for i in range(len(subject)):
            val.append(subject['s0'].values[i-x])
        name = 's' + str(x+1)
        subject[name] = val
        val = []
    return subject
subs = df['sub'].unique()
subs = np.delete(subs, [34,33])
for s in subs:
    for c in chn:
        chn_del.append(delayed(channel_select)(c,subs[s]))
results = e.persist(pred)
The code shown is meant to run all the subjects, but with any more than 5 at a time I run out of memory.
You're telling the computer to keep almost 1,000 GB of memory.
But you knew that already (:
As Mary stated above, every call to channel_select creates and stores the dataframe in the scheduler's memory. With 30 subjects, 14 calls each, and a 2 GB dataframe... yeah, you can do the math on how much memory that was trying to grab.
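The math, plus one possible way around it, as a sketch. It assumes each task can be given only its own subject's rows instead of the whole dataframe; the toy data and the column name ch1 are made up. The inner loop from the question is replaced with np.roll, which gives the same wrapped shift as values[i-x] when a subject has more than 12 rows.

```python
import numpy as np
import pandas as pd

# Figures quoted in the answer above.
n_subjects, calls_per_subject, dataframe_gb = 30, 14, 2
total_gb = n_subjects * calls_per_subject * dataframe_gb


def channel_select(subject, chn):
    """Takes only one subject's rows, not the full dataframe."""
    subject = subject.copy()
    s0 = subject[chn].to_numpy()
    subject["s0"] = s0
    for x in range(13):
        # np.roll(s0, x)[i] == s0[i - x], wrapping around like negative indexing.
        subject[f"s{x + 1}"] = np.roll(s0, x)
    return subject


# Toy data: 2 subjects with 20 rows each (assumed layout).
df = pd.DataFrame({"sub": np.repeat([1, 2], 20), "ch1": np.arange(40.0)})

# Split once up front so each task carries a small slice.
per_subject = {sub: group for sub, group in df.groupby("sub")}
out = channel_select(per_subject[1], "ch1")
```

Each slice could then be wrapped with delayed(channel_select)(slice, c), so what gets stored per task is one subject's rows rather than another copy of the 2 GB frame.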
