Delayed dask.dataframe.DataFrame.to_hdf computations crashing - dask

I'm using Dask to execute the following logic:
1. Read in a master delayed dd.DataFrame from multiple input files (one pd.DataFrame per file).
2. Perform multiple query calls on the master delayed DataFrame.
3. Use DataFrame.to_hdf to save all dataframes from the DataFrame.query calls.
If I use compute=False in my to_hdf calls and feed the list of Delayeds returned by each to_hdf call to dask.compute, then I get a crash/seg fault. (If I omit compute=False, everything runs fine.) Some googling gave me some information about locks; I tried adding a dask.distributed.Client with a dask.distributed.Lock fed to to_hdf, as well as a dask.utils.SerializableLock, but I couldn't solve the crash.
Here's the flow:
import uproot
import dask
import dask.dataframe as dd
from dask.delayed import delayed

def delayed_frame(files, tree_name):
    """create master delayed DataFrame from multiple files"""

    @delayed
    def single_frame(file_name, tree_name):
        """read external file, convert to pandas.DataFrame, return it"""
        tree = uproot.open(file_name).get(tree_name)
        return tree.pandas.df()  ## this is the pd.DataFrame

    return dd.from_delayed([single_frame(f, tree_name) for f in files])

def save_selected_frames(df, selections, prefix):
    """perform queries on a delayed DataFrame and save HDF5 output"""
    queries = {sel_name: df.query(sel_query)
               for sel_name, sel_query in selections.items()}
    computes = []
    for dfname, selected_df in queries.items():
        outname = f"{prefix}_{dfname}.h5"
        # compute=False returns a Delayed instead of writing immediately
        computes.append(selected_df.to_hdf(outname, f"/{prefix}", compute=False))
    dask.compute(*computes)

selections = {"s1": "(A == True) & (N > 1)",
              "s2": "(B == True) & (N > 2)",
              "s3": "(C == True) & (N > 3)"}

from glob import glob
df = delayed_frame(glob("/path/to/files/*.root"), "selected")
save_selected_frames(df, selections, "selected")
## expect output files:
## - selected_s1.h5
## - selected_s2.h5
## - selected_s3.h5

Maybe the HDF library that you're using isn't entirely threadsafe? If you don't mind losing parallelism then you could add scheduler="single-threaded" to the compute call.
You might want to consider using Parquet rather than HDF. It has fewer issues like this.
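A minimal sketch of the single-threaded suggestion, assuming only that dask is installed. The write_one function is a hypothetical stand-in for the to_hdf tasks; it does not touch HDF5 and only records the thread each task ran on, to show that nothing runs concurrently.

```python
import threading

import dask
from dask import delayed


@delayed
def write_one(name):
    # Hypothetical stand-in for one df.to_hdf(..., compute=False) task.
    # It only reports the thread it ran on; no file is written.
    return name, threading.current_thread().name


tasks = [write_one(n) for n in ("s1", "s2", "s3")]

# scheduler="single-threaded" runs the tasks one after another in the
# calling thread, so a non-thread-safe library is never entered concurrently.
results = dask.compute(*tasks, scheduler="single-threaded")

threads_used = {thread_name for _, thread_name in results}
print(threads_used)
```

In the question's code the same keyword would go on the existing call, i.e. dask.compute(*computes, scheduler="single-threaded").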

Related

Load and merge many files from S3 using Dask

I have about 1m "result" files in an S3 bucket which I want to process. Each result file should be merged with additional columns from an associated "context" file, of which I have about 50k (i.e. each context is associated with about 20 results).
Processing it serially is slow so I am using dask to parallelize some of the work.
In my serial code, I just load everything up-front and merge them, e.g.
contexts_map = {get_context_id(ctx_file): load_context(ctx_file) for ctx_file in ctx_files}
data = []
for result_file in result_files:
    ctx_id, res_id = get_context_and_res_id(result_file)
    ctx = contexts_map[ctx_id]
    data.append(process_result(ctx))
df = pd.DataFrame(data)
Initially I thought to divide the data and process in batches using dask (i.e. run the above in parallel on several batches) but then I read about dask bag and dask dataframe from_delayed and thought to use it. What I have:
import dask.bag as db
import dask.dataframe as ddf
from dask import delayed

delayed_get_context = delayed(get_context)
# load the contexts
ctx_map = {}
for ctx_file in ctx_files:
    ctx_id = get_context_id(ctx_file)
    ctx_map[ctx_id] = delayed_get_context(ctx_file)
# process the contexts
delayed_get_context_stats = delayed(get_context_stats)
ctx_stat_map = {ctx_id: delayed_get_context_stats(ctx) for ctx_id, ctx in ctx_map.items()}
# the main bag of result files to process
res_bag = db.from_sequence(res_items, npartitions=num_workers * 2)
# prepare a list of corresponding delayed per results
# the order in this list corresponds to order of res_bag
res_context_list = [
    ctx_stat_map[get_context_and_res_id(item)[0]] for item in res_items
]
# then create a bag from that list
ctx_bag = db.from_sequence(res_context_list, npartitions=num_workers * 2)
# create delays for the results
delayed_extract = delayed(extract_stats)
# from what I understand, if one of the arguments is also a bag
# it is distributed in accordance to the "main" bag
results = res_bag.map(delayed_extract, ctx_stats=ctx_bag)
df = ddf.from_delayed(results)
df = df.compute()
df.to_csv("results.csv")
This creates a computation graph similar to the following:
[Figure: computation graph for a subset of the data]
When I run this on a subset (as in the figure above) it works OK. Running the code on 1m items, I don't see anything happen (maybe I didn't wait long enough for it to finish building the graph and moving things around?).
With that, does the code above make sense? Should I have done it another way?
One of the things I am "afraid" of with the above implementation is that there's a lot of data movement.
I could potentially spend some time up-front to arrange context+results and then treat that as the "unit-of-work" and maybe get better results?
Any feedback here would be appreciated - is there a better approach?
And another question - how many partitions should I use? I saw in the docs that it defaults to about 100, but is there some rule of thumb to use here?
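The "unit-of-work" idea raised in the question can be sketched in plain Python as a grouping step that runs before any Dask code. The file naming scheme and this version of get_context_and_res_id are assumptions made purely for illustration; the real helpers are not shown in the question.

```python
from collections import defaultdict


def get_context_and_res_id(result_file):
    # Hypothetical naming scheme for illustration: "<ctx_id>/<res_id>.json".
    ctx_id, res_name = result_file.split("/")
    return ctx_id, res_name.rsplit(".", 1)[0]


def group_by_context(result_files):
    """Map each context id to the list of result files that belong to it."""
    groups = defaultdict(list)
    for result_file in result_files:
        ctx_id, _ = get_context_and_res_id(result_file)
        groups[ctx_id].append(result_file)
    return dict(groups)


result_files = ["c1/r1.json", "c1/r2.json", "c2/r3.json"]
units = group_by_context(result_files)
print(units)
```

Each (ctx_id, files) pair could then be handed to a single delayed call that loads the context once and processes its results, which would mean roughly 50k tasks instead of 1m and no context data shipped between tasks. Whether that is actually faster would need to be measured.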

pyarrow - identify the fragments written or filters used when writing a parquet dataset?

My use case is that I want to pass the file paths or filters to a task in Airflow as an xcom so that my next task can read the data which was just processed.
Task A writes a table to a partitioned dataset and a number of Parquet file fragments are generated --> Task B reads those fragments later as a dataset. I need to only read relevant data though, not the entire dataset which could have many millions of rows.
I have tested two approaches:
1. List the modified files right after I finish writing to the dataset. This gives me a list of paths which I can call ds.dataset(paths) on during my next task. I can use partitioning.parse() on these paths or check the fragments to get a list of the filters used (frag.partition_expression).
   A flaw with this is that files can be written in parallel to the same dataset.
2. Generate the filters used when writing the dataset by turning the table into a pandas dataframe, doing a groupby, and then constructing filters. I am not sure if there is a simpler approach to this. I can then use pq._filters_to_expression() on the results to create a usable filter.
   This is not ideal since I need to fix certain data types which do not get saved properly as an Airflow xcom (no pickling, so everything has to be in JSON format). Also, if I want to partition on a dictionary column, I might need to tweak this function.
import numpy as np

def create_filter_list(df, partition_columns):
    """Creates a list of pyarrow filters to be sent through an xcom and
    evaluated as an expression. Xcom disables pickling, so we need to save
    timestamp and date values as strings and convert downstream"""
    filter_list = []
    value_list = []
    partition_keys = [df[col] for col in partition_columns]
    for keys, _ in df[partition_columns].groupby(partition_keys):
        if len(partition_columns) == 1:
            # is_jsonable is a helper that is not shown in the question
            if is_jsonable(keys):
                value_list.append(keys)
            elif keys is not None:
                value_list.append(str(keys))
        else:
            if not isinstance(keys, tuple):
                keys = (keys,)
            read_filter = []
            for name, val in zip(partition_columns, keys):
                if type(val) == np.int_:
                    read_filter.append((name, "==", int(val)))
                elif val is not None:
                    read_filter.append((name, "==", str(val)))
            filter_list.append(read_filter)
    if len(partition_columns) == 1:
        if len(value_list) > 0:
            filter_list = [(name, "in", value_list) for name in partition_columns]
    return filter_list
Any suggestions on which approach I should take, or if there is a better way to achieve my goal?
You can watch this issue (https://issues.apache.org/jira/browse/ARROW-10440) which does what you want I believe. In the meantime, you could use basename_template as a workaround.
import glob
import os
import pyarrow as pa
import pyarrow.dataset as pads

class TrackingWriter:
    def __init__(self):
        self.counter = 0
        part_schema = pa.schema({'part': pa.int64()})
        self.partitioning = pads.HivePartitioning(part_schema)

    def next_counter(self):
        result = self.counter
        self.counter += 1
        return result

    def write_dataset(self, table, base_dir):
        counter = self.next_counter()
        # the batch number in the file name identifies files from this write
        pads.write_dataset(table, base_dir, format='parquet', partitioning=self.partitioning, basename_template=f'batch-{counter}-part-{{i}}')
        files_written = glob.glob(os.path.join(base_dir, '**', f'batch-{counter}-*'))
        return files_written

table_one = pa.table({'part': [0, 0, 1, 1], 'val': [1, 2, 3, 4]})
table_two = pa.table({'part': [0, 0, 1, 1], 'val': [5, 6, 7, 8]})
writer = TrackingWriter()
print(writer.write_dataset(table_one, '/tmp/mydataset'))
print(writer.write_dataset(table_two, '/tmp/mydataset'))
This is just a rough sketch. You'd probably also want code to run at startup to see what the next free value of counter is. Or you could use a uuid instead of a counter.
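A possible shape for the uuid variant is sketched below. The helper names (make_batch_id, basename_template_for, write_tracked) are hypothetical, and the existing_data_behavior argument assumes a pyarrow version that supports it; without it, recent versions may refuse to write into a directory that already contains data.

```python
import glob
import os
import uuid


def make_batch_id():
    # A random hex string is unique across processes, so parallel writers
    # do not need to coordinate a shared counter.
    return uuid.uuid4().hex


def basename_template_for(batch_id):
    # "{i}" is left literal here; pyarrow substitutes it per written file.
    return f"batch-{batch_id}-part-{{i}}"


def write_tracked(table, base_dir, partitioning):
    # pyarrow is imported lazily so the helpers above work without it.
    import pyarrow.dataset as pads

    batch_id = make_batch_id()
    pads.write_dataset(
        table,
        base_dir,
        format="parquet",
        partitioning=partitioning,
        basename_template=basename_template_for(batch_id),
        # Assumption: needed on recent pyarrow to append to an existing dataset.
        existing_data_behavior="overwrite_or_ignore",
    )
    # Same glob pattern as the counter version: one partition level deep.
    return glob.glob(os.path.join(base_dir, "**", f"batch-{batch_id}-*"))
```

The returned paths are plain strings, so they can be passed through an xcom as JSON.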
A suggestion (not sure if this is optimal for your use case or not):
The key problem is the need to correctly select a subset of the data; this can be 'fixed' upstream. The function/script that updates the big dataframe can contain a condition to save a temporary copy of the data that was modified and satisfies some requirements in a separate (temporary) path. This file would then be passed to the downstream tasks, which can delete the temporary file once it's processed.

How to limit number of lines per file written using FileIO

Is there a way to limit the number of lines in each written shard using TextIO or maybe FileIO?
Example:
Read rows from Big Query - Batch Job (Result is 19500 rows for example).
Make some transformations.
Write files to Google Cloud storage (19 files, each file is limited to 1000 records, one file has 500 records).
Cloud Function is triggered to make a POST request to an external API for each file in GCS.
Here is what I'm trying to do so far but doesn't work (Trying to limit 1000 rows per file):
BQ_DATA = (p
           | 'read_bq_view' >> beam.io.Read(
               beam.io.BigQuerySource(query=query,
                                      use_standard_sql=True))
           | beam.Map(json.dumps))
(BQ_DATA
 | beam.WindowInto(GlobalWindows(), Repeatedly(trigger=AfterCount(1000)),
                   accumulation_mode=AccumulationMode.DISCARDING)
 | WriteToFiles(path='fileio', destination="csv"))
Am I conceptually wrong or is there any other way to implement this?
You can implement the write to GCS step inside ParDo and limit the number of elements to include in a "batch" like this:
import apache_beam as beam
from apache_beam.io import filesystems

class WriteToGcsWithRowLimit(beam.DoFn):
    def __init__(self, row_size=1000):
        self.row_size = row_size
        self.rows = []

    def finish_bundle(self):
        # flush whatever is left when the bundle ends
        if len(self.rows) > 0:
            self._write_file()

    def process(self, element):
        self.rows.append(element)
        if len(self.rows) >= self.row_size:
            self._write_file()

    def _write_file(self):
        from time import time
        new_file = 'gs://bucket/file-{}.csv'.format(time())
        writer = filesystems.FileSystems.create(path=new_file)
        writer.write(self.rows)  # may need to format
        self.rows = []
        writer.close()

BQ_DATA | beam.ParDo(WriteToGcsWithRowLimit())
Note that finish_bundle flushes whatever is left in the buffer, so the last file of each bundle can hold fewer than 1000 rows (Edit 1 added this to handle the remainders). File names are built from a timestamp rather than a counter (Edit 2), since files named by a counter would be overwritten.
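For the "may need to format" comment in _write_file, one possible formatting step is sketched here. It assumes that each buffered row is already a string (for example the output of json.dumps) and that the writer expects bytes; format_rows is a hypothetical helper name.

```python
def format_rows(rows):
    """Join string rows into one newline-terminated UTF-8 payload."""
    if not rows:
        return b""
    return ("\n".join(rows) + "\n").encode("utf-8")


payload = format_rows(['{"id": 1}', '{"id": 2}'])
print(payload)
```

With such a helper, the write call would become writer.write(format_rows(self.rows)).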

Apache Beam TextIO.read and then combine into batches

After using TextIO.read to get a PCollection<String> of the individual lines, is it possible to then use some kind of combine transform to group them into batches (groups of 25, for example)? So the return type would end up looking something like: PCollection<String, List<String>>. It looks like it should be possible using some kind of CombineFn, but the API is still a little arcane to me.
The context here is I'm reading CSV files (potentially very very large), parsing + processing the lines and turning them into JSON, and then calling a REST API... but I don't want to hit the REST API for each line individually because the REST API supports multiple items at a time (up to 1000, so not the whole batch).
I guess you can do some simple batching like below (using stateful API). The state you want to maintain in BatchingFn is the current buffer of lines or self._lines. Sorry I did it in python (not familiar with Java API)
from apache_beam.transforms import DoFn
from apache_beam.transforms import ParDo

MY_BATCH_SIZE = 512

class BatchingFn(DoFn):
    def __init__(self, batch_size=100):
        self._batch_size = batch_size

    def start_bundle(self):
        # buffer for string of lines
        self._lines = []

    def process(self, element):
        # Input element is a string (representing a CSV line)
        self._lines.append(element)
        if len(self._lines) >= self._batch_size:
            self._flush_batch()

    def finish_bundle(self):
        # takes care of the unflushed buffer before finishing
        if self._lines:
            self._flush_batch()

    def _flush_batch(self):
        #### Do your REST API call here with self._lines
        # .....
        # Clear the buffer.
        self._lines = []

# pcoll is your PCollection of lines.
(pcoll | 'Call Rest API with batch data' >> ParDo(BatchingFn(MY_BATCH_SIZE)))
Regarding using Data-driven triggers, you can refer to Batch PCollection in Beam/Dataflow.
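To see which batch sizes the buffering logic above produces for a single bundle, the same steps can be mirrored in plain Python without Beam. The function batches_for_bundle is a hypothetical name used only for this illustration.

```python
def batches_for_bundle(lines, batch_size):
    """Yield the batches that one bundle would flush, in order."""
    buffer = []
    for line in lines:
        # corresponds to process(): buffer, then flush when full
        buffer.append(line)
        if len(buffer) >= batch_size:
            yield buffer
            buffer = []
    if buffer:
        # corresponds to finish_bundle(): flush the remainder
        yield buffer


lines = [f"row{i}" for i in range(60)]
sizes = [len(batch) for batch in batches_for_bundle(lines, 25)]
print(sizes)  # [25, 25, 10]
```

Bundle boundaries are chosen by the runner, so in a real pipeline a short batch can appear at the end of every bundle, not only at the end of the input.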

I have collection of futures which are result of persist on dask dataframe. How to do a delayed operation on them?

I have set up a scheduler and 4 worker nodes to do some processing on a CSV. The size of the CSV is just 300 MB.
df = dd.read_csv('/Downloads/tmpcrnin5ta',assume_missing=True)
df = df.groupby(['col_1','col_2']).agg('mean').reset_index()
df = client.persist(df)

def create_sep_futures(symbol,df):
    symbol_df = copy.deepcopy(df[df['symbol' == symbol]])
    return symbol_df

lazy_values = [delayed(create_sep_futures)(symbol, df) for symbol in st]
future = client.compute(lazy_values)
result = client.gather(future)
The st list contains 1000 elements.
When I do this, I get this error:
distributed.worker - WARNING - Compute Failed
Function: create_sep_futures
args: ('PHG', symbol col_3 col_2 \
0 A 1.451261e+09 23.512857
1 A 1.451866e+09 23.886857
2 A 1.452470e+09 25.080429
kwargs: {}
Exception: KeyError(False,)
My assumption is that workers should get full dataframe and query on it. But I think it just gets the block and tries to do it.
What is the workaround for it? Since dataframe chunks are already in workers memory. I don't want to move the dataframe to each worker.
Operations on dataframes, using the dataframe syntax and API, are lazy (delayed) by default; you need to do nothing more.
First problem: your syntax is wrong. df[df['symbol' == symbol]] should be df[df['symbol'] == symbol]. The expression 'symbol' == symbol evaluates to False, which is the origin of the False key in the error.
So the solution you are probably looking for:
future = client.compute(df[df['symbol'] == symbol])
If you do want to work on the chunks separately, you can look into df.map_partitions, which you use with a normal function and which takes care of passing data or delayed/futures, or df.to_delayed, which gives you a set of delayed objects that you can use with a delayed function.
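The origin of KeyError(False) can be reproduced with plain pandas, which uses the same indexing syntax. The small frame below is made-up data for illustration only.

```python
import pandas as pd

df = pd.DataFrame({"symbol": ["A", "A", "PHG"], "col_3": [1.0, 2.0, 3.0]})
symbol = "PHG"

# 'symbol' == symbol compares two strings and evaluates to False,
# so the lookup becomes df[False] and fails with a KeyError.
try:
    df[df["symbol" == symbol]]
    error = None
except KeyError as exc:
    error = exc

# Comparing the column instead builds a boolean mask.
selected = df[df["symbol"] == symbol]
print(repr(error), len(selected))
```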
