How to limit number of lines per file written using FileIO - google-cloud-dataflow

Is there a way to limit the number of lines in each written shard using TextIO, or maybe FileIO?
Example:
1. Read rows from BigQuery as a batch job (the result is 19,500 rows, for example).
2. Make some transformations.
3. Write files to Google Cloud Storage (19 files of 1,000 records each, plus one file with the remaining 500 records).
4. A Cloud Function is triggered to make a POST request to an external API for each file in GCS.
Here is what I have tried so far, but it doesn't work (trying to limit 1000 rows per file):
BQ_DATA = p | 'read_bq_view' >> beam.io.Read(
    beam.io.BigQuerySource(query=query, use_standard_sql=True)) | beam.Map(json.dumps)

(BQ_DATA
 | beam.WindowInto(GlobalWindows(), Repeatedly(trigger=AfterCount(1000)),
                   accumulation_mode=AccumulationMode.DISCARDING)
 | WriteToFiles(path='fileio', destination="csv"))
Am I conceptually wrong or is there any other way to implement this?

You can implement the write-to-GCS step inside a ParDo and limit the number of elements to include in a "batch", like this:
import apache_beam as beam
from apache_beam.io import filesystems

class WriteToGcsWithRowLimit(beam.DoFn):
    def __init__(self, row_size=1000):
        self.row_size = row_size
        self.rows = []

    def finish_bundle(self):
        # Flush whatever is left over at the end of the bundle.
        if len(self.rows) > 0:
            self._write_file()

    def process(self, element):
        self.rows.append(element)
        if len(self.rows) >= self.row_size:
            self._write_file()

    def _write_file(self):
        from time import time
        new_file = 'gs://bucket/file-{}.csv'.format(time())
        writer = filesystems.FileSystems.create(path=new_file)
        writer.write('\n'.join(self.rows).encode('utf-8'))  # format rows as needed
        self.rows = []
        writer.close()

BQ_DATA | beam.ParDo(WriteToGcsWithRowLimit())
Note that files are only written from process once they reach 1000 rows; any remainder is flushed in finish_bundle, so the last file of a bundle may contain fewer than 1000 rows. You can change the logic in process if you need different behavior.
(Edit 1: handle the remainders.)
(Edit 2: stop using counters, as files would be overwritten.)
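As a side note, not from the answer above: recent Beam Python SDKs also ship a built-in batching transform, beam.BatchElements, that can replace the hand-rolled buffering. A rough sketch, assuming BQ_DATA holds the JSON strings from the question and that the bucket/path is a placeholder:
import uuid

import apache_beam as beam
from apache_beam.io import filesystems

def write_batch_to_gcs(batch):
    # Each incoming batch is a list of up to 1000 elements; write it as one file.
    path = 'gs://bucket/file-{}.csv'.format(uuid.uuid4())  # placeholder bucket/path
    writer = filesystems.FileSystems.create(path)
    writer.write('\n'.join(batch).encode('utf-8'))
    writer.close()
    return path

file_paths = (
    BQ_DATA
    | 'BatchUpTo1000' >> beam.BatchElements(min_batch_size=1000, max_batch_size=1000)
    | 'WriteBatch' >> beam.Map(write_batch_to_gcs))
Note that BatchElements batches within a bundle, so batches smaller than 1000 elements can still occur at bundle boundaries.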

Related

Batch vs Streaming Performance in Google Cloud Dataflow

We have bounded data, around 3.5 million records, in BigQuery. This data needs to be processed using Dataflow (mostly some external API calls plus transformations).
From the documentation (https://cloud.google.com/dataflow/docs/resources/faq#beam-java-sdk)
I see that batch mode uses a single thread and streaming uses 300 threads per worker. For us, most operations are network bound because of the external API calls.
Considering this, which one would be more performant and cost-efficient: batch, by spinning up x workers, or streaming with x workers and 300 threads each?
If it is streaming, should I send the data that is currently in BigQuery to Pub/Sub? Is my understanding correct?
The batch vs. streaming decision usually comes from the source you are reading from (bounded vs. unbounded). When reading from BigQueryIO, the source is bounded.
There are ways to convert a BoundedSource to an UnboundedSource (see Using custom DataFlow unbounded source on DirectPipelineRunner), but I don't see it recommended anywhere, and I am not sure you would gain any benefit from it. Streaming has to keep track of checkpoints and watermarks, which can add overhead for your workers.
Here is an example of a DoFn that processes multiple items concurrently:
import queue
import threading

import apache_beam as beam

class MultiThreadedDoFn(beam.DoFn):
    def __init__(self, func, num_threads=10):
        self.func = func
        self.num_threads = num_threads

    def setup(self):
        self.done = False
        self.input_queue = queue.Queue(2)
        self.output_queue = queue.Queue()
        self.threads = [
            threading.Thread(target=self.work, daemon=True)
            for _ in range(self.num_threads)]
        for t in self.threads:
            t.start()

    def work(self):
        while not self.done:
            try:
                windowed_value = self.input_queue.get(timeout=0.1)
                self.output_queue.put(
                    windowed_value.with_value(self.func(windowed_value.value)))
            except queue.Empty:
                pass  # check self.done

    def start_bundle(self):
        self.pending = 0

    def process(self, element,
                timestamp=beam.DoFn.TimestampParam,
                window=beam.DoFn.WindowParam):
        self.pending += 1
        self.input_queue.put(
            beam.transforms.window.WindowedValue(
                element, timestamp, (window,)))
        try:
            while not self.output_queue.empty():
                yield self.output_queue.get(block=False)
                self.pending -= 1
        except queue.Empty:
            pass

    def finish_bundle(self):
        while self.pending > 0:
            yield self.output_queue.get()
            self.pending -= 1

    def teardown(self):
        self.done = True
        for t in self.threads:
            t.join()
It can be used as:
import logging
import time

def func(n):
    time.sleep(n / 10)
    return n + 1

with beam.Pipeline() as p:
    (p
     | beam.Create([1, 3, 5, 7] * 10 + [9])
     | beam.ParDo(MultiThreadedDoFn(func))
     | beam.Map(logging.error))

Splitting file processing by initial keys

Use Case
I have some terabytes of US property data to merge. It is spread across two distinct file formats and thousands of files. The source data is split geographically.
I can't find a way to branch a single pipeline into many independent processing flows.
This is especially difficult because the Dataframe API doesn't seem to support a PTransform on a collection of filenames.
Detailed Background
The distribution of files is like this:
StateData - 51 total files (US states + DC)
CountyData - ~2000 total files (county specific, grouped by state)
The ideal pipeline would split into thousands of independent processing steps and complete in minutes.
1 -> 51 (each US state + DC starts processing)
51 -> thousands (each US state then spawns a process that merges the counties, combining at the end for the whole state)
The directory structure is like this:
📂 state-data/
|- 📜 AL.zip
|- 📜 AK.zip
|- 📜 ...
|- 📜 WY.zip
📂 county-data/
|- 📂 AL/
|  |- 📜 COUNTY1.csv
|  |- 📜 COUNTY2.csv
|  |- 📜 ...
|  |- 📜 COUNTY68.csv
|- 📂 AK/
|  |- 📜 ...
|- 📂 .../
|- 📂 WY/
   |- 📜 ...
Sample Data
This is extremely abbreviated, but imagine something like this:
State Level Data - 51 of these (~200 cols wide)
uid     census_plot   flood_zone
abc121  ACVB-1249575  R50
abc122  ACVB-1249575  R50
abc123  ACVB-1249575  R51
abc124  ACVB-1249599  R51
abc125  ACVB-1249599  R50
...     ...           ...
County Level Data - thousands of these (~300 cols wide)
uid     county  subdivision     tax_id
abc121  04021   Roland Heights  3t4g
abc122  04021   Roland Heights  3g444
abc123  04021   Roland Heights  09udd
...     ...     ...             ...
So we join many county-level files to a single state-level file, and thus get an aggregated, more complete state-level data set.
Then we aggregate all the states, and we have a national-level data set.
Desired Outcome
I can successfully merge one state at a time (many county files to one state file). I built a pipeline to do that, but the pipeline starts with a single CountyData CSV and a single StateData CSV. The issue is getting to the point where I can load the CountyData and StateData.
In other words:
#
# I need to find a way to generalize this flow to
# dynamically created COUNTY and STATE variables.
#
import apache_beam as beam
import pandas as pd
from apache_beam.dataframe.convert import to_pcollection
from apache_beam.dataframe.io import read_csv

COUNTY = "county-data/AL/*.csv"
STATE = "state-data/AL.zip"

def key_by_uid(elem):
    return (elem.uid, elem)

with beam.Pipeline() as p:
    county_df = p | read_csv(COUNTY)
    county_rows_keyed = to_pcollection(county_df) | beam.Map(key_by_uid)

    state_df = pd.read_csv(STATE, compression="zip")
    state_rows_keys = to_pcollection(state_df, pipeline=p) | beam.Map(key_by_uid)

    merged = ({"state": state_rows_keys, "county": county_rows_keyed}
              | beam.CoGroupByKey()
              | beam.Map(merge_logic))
    merged | WriteToParquet()
1. Start with a list of states.
2. By state, generate filepatterns to the source data.
3. By state, load and merge the filenames.
4. Flatten the output from each state into a US data set.
5. Write to a Parquet file.
with beam.Pipeline(options=pipeline_options) as p:
    merged_data = (
        p
        | beam.Create(cx.STATES)
        | "PathsKeyedByState" >> tx.PathsKeyedByState()
        # ('AL', {'county-data': 'gs://data/county-data/AL/COUNTY*.csv',
        #         'state-data': 'gs://data/state-data/AL.zip'})
        | "MergeSourceDataByState" >> tx.MergeSourceDataByState()
        | "MergeAllStateData" >> beam.Flatten()
    )
    merged_data | "WriteParquet" >> tx.WriteParquet()
The issue I'm having is something like this:
- I have two filepatterns in a dictionary, per state. To access those I need to use a DoFn to get at the element.
- To communicate the way the data flows, I need access to the Pipeline so I can apply a PTransform to it, e.g. df = p | read_csv(...).
These appear to be incompatible needs.
Here's an alternative answer.
Read the state data one at a time and flatten them, e.g.
state_dataframe = None
for state in STATES:
    df = p | read_csv('/path/to/state')
    df['state'] = state
    if state_dataframe is None:
        state_dataframe = df
    else:
        state_dataframe = state_dataframe.append(df)
Similarly for county data. Now join them using dataframe operations.
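A minimal sketch of that county step and of the join, assuming the directory layout from the question; the unique per-state labels and the merge-on-uid call are my additions (Beam's deferred dataframes support merge, but verify it against your SDK version):
county_dataframe = None
for state in STATES:
    # Unique labels avoid duplicate-transform-name errors when looping.
    df = p | f'ReadCounty{state}' >> read_csv(f'county-data/{state}/*.csv')
    df['state'] = state
    if county_dataframe is None:
        county_dataframe = df
    else:
        county_dataframe = county_dataframe.append(df)

# Join the two deferred dataframes on uid using ordinary dataframe operations.
merged = state_dataframe.merge(county_dataframe, on='uid',
                               suffixes=('_state', '_county'))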
I'm not sure exactly what kind of merging you're doing here, but one way to structure this pipeline might be to have a DoFn that takes a county data filename as its input element (i.e. you'd have a PCollection of county data filenames), opens it up using "normal" Python (e.g. pandas), and then reads the relevant state data in as a side input to do the merge.
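A rough sketch of that side-input approach, assuming locally readable paths and the uid column from the sample data; names such as MergeCountyWithState are illustrative, not from the question:
import apache_beam as beam
import pandas as pd
from apache_beam.io import fileio

class MergeCountyWithState(beam.DoFn):
    def process(self, county_csv_path, state_rows_by_uid):
        # "Normal" Python/pandas read of one county file per element.
        county_df = pd.read_csv(county_csv_path)
        for row in county_df.to_dict('records'):
            state_row = state_rows_by_uid.get(row['uid'])
            if state_row is not None:
                yield {**state_row, **row}  # county columns enrich the state record

with beam.Pipeline() as p:
    # State rows keyed by uid, used as a dict side input (read on the driver,
    # as in the question's own snippet).
    state_records = pd.read_csv('state-data/AL.zip', compression='zip').to_dict('records')
    state_rows = (p
                  | 'StateRows' >> beam.Create(state_records)
                  | 'KeyByUid' >> beam.Map(lambda r: (r['uid'], r)))

    merged = (p
              | 'MatchCountyFiles' >> fileio.MatchFiles('county-data/AL/*.csv')
              | 'ToPath' >> beam.Map(lambda metadata: metadata.path)
              | 'Merge' >> beam.ParDo(MergeCountyWithState(),
                                      state_rows_by_uid=beam.pvalue.AsDict(state_rows)))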

pyarrow - identify the fragments written or filters used when writing a parquet dataset?

My use case is that I want to pass the file paths or filters to a task in Airflow as an xcom so that my next task can read the data which was just processed.
Task A writes a table to a partitioned dataset and a number of Parquet file fragments are generated --> Task B reads those fragments later as a dataset. I need to only read relevant data though, not the entire dataset which could have many millions of rows.
I have tested two approaches:
1. List modified files right after I finish writing to the dataset. This will provide me with a list of paths which I can call ds.dataset(paths) on during my next task. I can use partitioning.parse() on these paths, or check the fragments, to get a list of filters used (frag.partition_expression).
   A flaw with this is that I can have files being written in parallel to the same dataset.
2. I can generate the filters used when writing the dataset by turning the table into a pandas dataframe, doing a groupby, and then constructing filters. I am not sure if there is a simpler approach to this. I can then use pq._filters_to_expression() on the results to create a usable filter.
   This is not ideal since I need to fix certain data types which do not get saved properly as an Airflow xcom (no pickling, so everything has to be in JSON format). Also, if I want to partition on a dictionary column, I might need to tweak this function.
import numpy as np

def create_filter_list(df, partition_columns):
    """Creates a list of pyarrow filters to be sent through an xcom and evaluated as an expression.
    Xcom disables pickling, so we need to save timestamp and date values as strings and convert downstream."""
    filter_list = []
    value_list = []
    partition_keys = [df[col] for col in partition_columns]
    for keys, _ in df[partition_columns].groupby(partition_keys):
        if len(partition_columns) == 1:
            if is_jsonable(keys):  # is_jsonable is a user-defined helper (not shown)
                value_list.append(keys)
            elif keys is not None:
                value_list.append(str(keys))
        else:
            if not isinstance(keys, tuple):
                keys = (keys,)
            read_filter = []
            for name, val in zip(partition_columns, keys):
                if type(val) == np.int_:
                    read_filter.append((name, "==", int(val)))
                elif val is not None:
                    read_filter.append((name, "==", str(val)))
            filter_list.append(read_filter)
    if len(partition_columns) == 1:
        if len(value_list) > 0:
            filter_list = [(name, "in", value_list) for name in partition_columns]
    return filter_list
Any suggestions on which approach I should take, or if there is a better way to achieve my goal?
You can watch this issue (https://issues.apache.org/jira/browse/ARROW-10440), which I believe does what you want. In the meantime, you could use basename_template as a workaround.
import glob
import os

import pyarrow as pa
import pyarrow.dataset as pads

class TrackingWriter:
    def __init__(self):
        self.counter = 0
        part_schema = pa.schema({'part': pa.int64()})
        self.partitioning = pads.HivePartitioning(part_schema)

    def next_counter(self):
        result = self.counter
        self.counter += 1
        return result

    def write_dataset(self, table, base_dir):
        counter = self.next_counter()
        pads.write_dataset(table, base_dir, format='parquet',
                           partitioning=self.partitioning,
                           basename_template=f'batch-{counter}-part-{{i}}')
        files_written = glob.glob(os.path.join(base_dir, '**', f'batch-{counter}-*'))
        return files_written

table_one = pa.table({'part': [0, 0, 1, 1], 'val': [1, 2, 3, 4]})
table_two = pa.table({'part': [0, 0, 1, 1], 'val': [5, 6, 7, 8]})

writer = TrackingWriter()
print(writer.write_dataset(table_one, '/tmp/mydataset'))
print(writer.write_dataset(table_two, '/tmp/mydataset'))
This is just a rough sketch. You'd probably also want code to run at startup to see what the next free value of counter is. Or you could use a uuid instead of a counter.
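For example, a minimal sketch of the uuid variant (same idea, no counter state to recover at startup; write_dataset_tracked is an illustrative name):
import glob
import os
import uuid

import pyarrow.dataset as pads

def write_dataset_tracked(table, base_dir, partitioning):
    # A unique basename per call lets us glob exactly the files this write produced.
    batch_id = uuid.uuid4().hex
    pads.write_dataset(table, base_dir, format='parquet',
                       partitioning=partitioning,
                       basename_template=f'batch-{batch_id}-part-{{i}}')
    return glob.glob(os.path.join(base_dir, '**', f'batch-{batch_id}-*'),
                     recursive=True)

# e.g. reusing the partitioning from the TrackingWriter example above:
files = write_dataset_tracked(table_one, '/tmp/mydataset', writer.partitioning)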
A suggestion (not sure if this is optimal for your use case or not):
The key problem is the need to correctly select a subset of the data, and this can be 'fixed' upstream. The function/script that updates the big dataframe can contain a condition to save a temporary copy of the data that was modified (and satisfies some requirements) to a separate, temporary path. This path would then be passed to the downstream tasks, which can delete the temporary copy once it has been processed.
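A hypothetical sketch of that temporary-copy idea (the path layout and task names are placeholders; with Airflow's PythonOperator a returned value is pushed as an XCom automatically):
import os
import shutil
import uuid

import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

SCRATCH_ROOT = "/tmp/scratch"  # placeholder location for temporary copies

def upstream_task(modified_table: pa.Table) -> str:
    """Write only the modified rows to a unique temporary path and return it."""
    tmp_path = os.path.join(SCRATCH_ROOT, f"delta-{uuid.uuid4().hex}")
    pq.write_to_dataset(modified_table, tmp_path)
    return tmp_path  # becomes the XCom payload

def downstream_task(tmp_path: str) -> None:
    """Read just the data that was modified, then clean up the temporary copy."""
    table = ds.dataset(tmp_path, format="parquet").to_table()
    # ... process `table` ...
    shutil.rmtree(tmp_path)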

Delayed dask.dataframe.DataFrame.to_hdf computations crashing

I'm using Dask to execute the following logic:
- read in a master delayed dd.DataFrame from multiple input files (one pd.DataFrame per file)
- perform multiple query calls on the master delayed DataFrame
- use DataFrame.to_hdf to save all dataframes from the DataFrame.query calls
If I use compute=False in my to_hdf calls and feed the list of Delayeds returned by each to_hdf call to dask.compute, then I get a crash/seg fault. (If I omit compute=False, everything runs fine.) Some googling gave me some information about locks; I tried adding a dask.distributed.Client with a dask.distributed.Lock fed to to_hdf, as well as a dask.utils.SerializableLock, but I couldn't solve the crash.
Here's the flow:
import glob

import dask
import dask.dataframe as dd
import uproot
from dask.delayed import delayed

def delayed_frame(files, tree_name):
    """Create a master delayed DataFrame from multiple files."""
    @delayed
    def single_frame(file_name, tree_name):
        """Read an external file, convert to pandas.DataFrame, return it."""
        tree = uproot.open(file_name).get(tree_name)
        return tree.pandas.df()  # this is the pd.DataFrame
    return dd.from_delayed([single_frame(f, tree_name) for f in files])

def save_selected_frames(df, selections, prefix):
    """Perform queries on a delayed DataFrame and save HDF5 output."""
    queries = {sel_name: df.query(sel_query)
               for sel_name, sel_query in selections.items()}
    computes = []
    for dfname, df in queries.items():
        outname = f"{prefix}_{dfname}.h5"
        computes.append(df.to_hdf(outname, f"/{prefix}", compute=False))
    dask.compute(*computes)

selections = {"s1": "(A == True) & (N > 1)",
              "s2": "(B == True) & (N > 2)",
              "s3": "(C == True) & (N > 3)"}

df = delayed_frame(glob.glob("/path/to/files/*.root"), "selected")
save_selected_frames(df, selections, "selected")

## expect output files:
## - selected_s1.h5
## - selected_s2.h5
## - selected_s3.h5
Maybe the HDF library that you're using isn't entirely thread-safe? If you don't mind losing parallelism, you could add scheduler="single-threaded" to the compute call.
You might also want to consider using Parquet rather than HDF; it has fewer issues like this.
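For instance, a minimal sketch of both suggestions, reusing the computes/queries/prefix names from the question's save_selected_frames:
# Work around the thread-safety issue by computing the HDF writes on a single thread.
dask.compute(*computes, scheduler="single-threaded")

# Or switch the output format to Parquet (one directory of part files per selection).
computes = [df.to_parquet(f"{prefix}_{dfname}.parquet", compute=False)
            for dfname, df in queries.items()]
dask.compute(*computes)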

Apache Beam TextIO.read and then combine into batches

After using TextIO.read to get a PCollection<String> of the individual lines, is it possible to then use some kind of combine transform to group them into batches (groups of 25, for example)? So the return type would end up looking something like PCollection<String, List<String>>. It looks like it should be possible using some kind of CombineFn, but the API is still a little arcane to me.
The context here is that I'm reading CSV files (potentially very, very large), parsing + processing the lines and turning them into JSON, and then calling a REST API... but I don't want to hit the REST API for each line individually, because the REST API supports multiple items at a time (up to 1000, so not the whole batch).
I guess you can do some simple batching like below (using the stateful API). The state you want to maintain in BatchingFn is the current buffer of lines, i.e. self._lines. Sorry, I did it in Python (I'm not familiar with the Java API).
from apache_beam.transforms import DoFn
from apache_beam.transforms import ParDo

MY_BATCH_SIZE = 512

class BatchingFn(DoFn):
    def __init__(self, batch_size=100):
        self._batch_size = batch_size

    def start_bundle(self):
        # Buffer for the batched lines.
        self._lines = []

    def process(self, element):
        # Input element is a string (representing a CSV line).
        self._lines.append(element)
        if len(self._lines) >= self._batch_size:
            self._flush_batch()

    def finish_bundle(self):
        # Take care of the unflushed buffer before finishing.
        if self._lines:
            self._flush_batch()

    def _flush_batch(self):
        #### Do your REST API call here with self._lines
        # .....
        # Clear the buffer.
        self._lines = []

# pcoll is your PCollection of lines.
(pcoll | 'Call Rest API with batch data' >> ParDo(BatchingFn(MY_BATCH_SIZE)))
Regarding using Data-driven triggers, you can refer to Batch PCollection in Beam/Dataflow.
