Test a step that yields the same instance multiple times - google-cloud-dataflow

We have a step that splits up Pubsub messages on newline in Dataflow. We have a test that passes for the code, but it seems to fail in production. Looks like we get the same Pubsub message in multiple places in the pipeline at once (to the best of my knowledge at least).
Should we have written the first test in another way? Or is this just a hard lesson learned about what not to do in Apache Beam?
import apache_beam as beam
from apache_beam.io import PubsubMessage
from apache_beam.testing.test_pipeline import TestPipeline
from apache_beam.testing.util import assert_that, equal_to
import unittest
class SplitUpBatches(beam.DoFn):
    def process(self, msg):
        bodies = msg.data.split('\n')
        for body in bodies:
            msg.data = body.strip()
            yield msg


class TestSplitting(unittest.TestCase):
    body = """
first
second
third
""".strip()

    def test_incorrectly_passing(self):
        """Incorrectly passing"""
        msg = PubsubMessage(self.body, {})
        with TestPipeline() as p:
            assert_that(
                p
                | beam.Create([msg])
                | "split up batches" >> beam.ParDo(SplitUpBatches())
                | "map to data" >> beam.Map(lambda m: m.data),
                equal_to(['first', 'second', 'third']))

    def test_correctly_failing(self):
        """Failing, but not using a TestPipeline"""
        msg = PubsubMessage(self.body, {})
        messages = list(SplitUpBatches().process(msg))
        bodies = [m.data for m in messages]
        self.assertEqual(bodies, ['first', 'second', 'third'])
        # => AssertionError: ['third', 'third', 'third'] != ['first', 'second', 'third']

TL;DR: Yes, this is an example of what not to do in Beam: reusing (mutating) your element objects.
In fact, Beam discourages mutating inputs and outputs of your transforms, because Beam passes/buffers those objects in various ways that can be affected if you mutate them.
The recommendation here is to create a new PubsubMessage instance for each output.
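A minimal sketch of that fix for the SplitUpBatches DoFn above (it simply carries the original message's attributes over to each new output):

class SplitUpBatches(beam.DoFn):
    def process(self, msg):
        for body in msg.data.split('\n'):
            # Emit a fresh PubsubMessage per line instead of mutating the input element.
            yield PubsubMessage(body.strip(), msg.attributes)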
Detailed explanation
This happens due to the ways in which Beam serializes and passes data around.
You may know that Beam executes several steps together in single workers - what we call stages. Your pipeline does something like this:
read_data -> split_up_batches -> serialize all data -> perform assert
This intermediate serialize-data step is an implementation detail. The reason it exists is that for Beam's assert_that we gather all of the data of a single PCollection onto a single machine and perform the assert there (thus we need to serialize all elements and send them over to that machine). This is done with a GroupByKey operation.
When the DirectRunner receives the first yield of a PubsubMessage('first'), it serializes it and transfers it to a GroupByKey immediately - so you get the 'first', 'second', 'third' result - because serialization happens immediately.
When the DataflowRunner receives the first yield of a PubsubMessage('first'), it buffers it and later sends over a whole batch of elements. You get the 'third', 'third', 'third' result because serialization only happens once that buffer is transmitted - and by then your single PubsubMessage instance has been overwritten.
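The effect is easy to reproduce without Beam at all; this toy sketch (plain Python, nothing Beam-specific) shows what buffering references to a single mutated object does:

msg = {"data": None}              # stand-in for the single PubsubMessage instance
buffered = []
for body in ["first", "second", "third"]:
    msg["data"] = body            # mutate the same object on every "yield"
    buffered.append(msg)          # the buffer ends up holding three references to it
# "Serialization" only happens after the loop, once the buffer is flushed:
print([m["data"] for m in buffered])  # ['third', 'third', 'third']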

Related

Downloading a file in a DoFn

It's unclear whether it's safe to download files within a DoFn.
My DoFn will download a ~20MB file (an ML model) to apply to elements in my pipeline. According to the Beam docs, requirements include serializability and thread-compatibility.
An example (1, 2) is very similar to my DoFn. It demonstrates downloading from a GCP storage bucket (as I'm doing w/ DataflowRunner), but I'm not sure this approach is safe.
Should objects be downloaded to an in-memory bytes buffer instead of to disk, or is there another recommended approach for this use case? I haven't come across a best practice for this pattern yet.
Adding on to this answer.
If your model data is static, then you can use the code example below to pass your model as a side input.
import logging

import apache_beam as beam
import dill

# DoFn to open the model from a GCS location
class get_model(beam.DoFn):
    def process(self, element):
        from apache_beam.io.gcp import gcsio
        logging.info('reading model from GCS')
        gcs = gcsio.GcsIO()
        yield gcs.open(element)

# Pipeline to load the pickle file from the GCS bucket (p is an existing beam.Pipeline)
model_step = (p
              | 'start' >> beam.Create(['gs://somebucket/model'])
              | 'load_model' >> beam.ParDo(get_model())
              | 'unpickle_model' >> beam.Map(lambda bin: dill.load(bin)))

# DoFn to predict the results.
class predict(beam.DoFn):
    def process(self, element, model):
        (features, clients) = element
        result = model.predict_proba(features)[:, 1]
        return [(clients, result)]

# Main pipeline to get input and predict results.
_ = (p
     | 'get_input' >> #get input based on source and preprocess it.
     | 'predict_sk_model' >> beam.ParDo(predict(), beam.pvalue.AsSingleton(model_step))
     | 'write' >> #write output based on target.
In the case of a streaming pipeline, if you want to reload the model after a predefined interval, you can check the "Slowly-changing lookup cache" pattern here.
If it is a scikit-learn model, then you can look at hosting it in Cloud ML Engine and exposing it as a REST endpoint. You can then use something like BagState to optimize invocation of the model over the network. More details can be found at https://beam.apache.org/blog/2017/08/28/timely-processing.html

Delayed dask.dataframe.DataFrame.to_hdf computations crashing

I'm using Dask to execute the following logic:
read in a master delayed dd.DataFrame from multiple input files (one pd.DataFrame per file)
perform multiple query calls on the master delayed DataFrame
use DataFrame.to_hdf to save all dataframes from the DataFrame.query calls.
If I use compute=False in my to_hdf calls and feed the list of Delayeds returned by each to_hdf call to dask.compute then I get a crash/seg fault. (If I omit compute=False everything runs fine). Some googling gave me some information about locks; I tried adding a dask.distributed.Client with a dask.distributed.Lock fed to to_hdf, as well as a dask.utils.SerializableLock, but I couldn't solve the crash.
Here's the flow:
import uproot
import dask
import dask.dataframe as dd
from dask.delayed import delayed


def delayed_frame(files, tree_name):
    """create master delayed DataFrame from multiple files"""

    @delayed
    def single_frame(file_name, tree_name):
        """read external file, convert to pandas.DataFrame, return it"""
        tree = uproot.open(file_name).get(tree_name)
        return tree.pandas.df()  ## this is the pd.DataFrame

    return dd.from_delayed([single_frame(f, tree_name) for f in files])


def save_selected_frames(df, selections, prefix):
    """perform queries on a delayed DataFrame and save HDF5 output"""
    queries = {sel_name: df.query(sel_query)
               for sel_name, sel_query in selections.items()}
    computes = []
    for dfname, df in queries.items():
        outname = f"{prefix}_{dfname}.h5"
        computes.append(df.to_hdf(outname, f"/{prefix}", compute=False))
    dask.compute(*computes)


selections = {"s1": "(A == True) & (N > 1)",
              "s2": "(B == True) & (N > 2)",
              "s3": "(C == True) & (N > 3)"}

from glob import glob

df = delayed_frame(glob("/path/to/files/*.root"), "selected")
save_selected_frames(df, selections, "selected")
## expect output files:
## - selected_s1.h5
## - selected_s2.h5
## - selected_s3.h5
Maybe the HDF library that you're using isn't entirely threadsafe? If you don't mind losing parallelism then you could add scheduler="single-threaded" to the compute call.
You might want to consider using Parquet rather than HDF. It has fewer issues like this.
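A rough sketch of both suggestions, written as drop-in replacements for the dask.compute call inside the question's save_selected_frames (the Parquet variant assumes a pyarrow or fastparquet engine is installed):

# Option 1: keep to_hdf, but run the delayed writes on a single thread.
dask.compute(*computes, scheduler="single-threaded")

# Option 2: write Parquet instead of HDF5 (one dataset directory per query).
computes = []
for dfname, query_df in queries.items():
    outname = f"{prefix}_{dfname}.parquet"
    computes.append(query_df.to_parquet(outname, compute=False))
dask.compute(*computes)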

Dask compute fails when using client, works when no client setup

I am trying to use the dask client to parallelize my compute. When I run df.compute() I get the correct output (though it is very slow), but when I run the same thing after setting up a client, I get the following error:
distributed.protocol.pickle - INFO - Failed to serialize <function part at 0x7fd5186ed730>. Exception: can't pickle _thread.RLock objects
Here is my code. In the first df.compute() I get the expected result; in the second I do not.
import dask
import dask.dataframe as dd
import pandas as pd


@dask.delayed
def part(x):
    lower, upper = x
    q = "SELECT id,tfidf_vec,emb_vec FROM document_table"
    lines = man.session.execute(q)
    counter = lower
    df = []
    for line in lines:
        df.append(line)
        counter += 1
        if counter == upper:
            break
    return pd.DataFrame(df)


parts = [part(x) for x in [[0, 100000], [100000, 200000]]]
df = dd.from_delayed(parts)
df.compute()

from dask.distributed import Client

client = Client('127.0.0.1:8786')
df.compute()
Your function contains a reference to man.session, which is part of the function closure. When you use the default scheduler, threads, the object can be shared between the threads that execute your code. When you use the distributed scheduler, the function must be serialised and sent to workers in different process(es).
You should make a function which creates the session object on each invocation, as was suggested as an answer to your very similar question.
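A rough sketch of that suggestion, reusing the part function from the question; make_session() is a hypothetical helper standing in for however man.session is actually constructed:

import dask
import pandas as pd


@dask.delayed
def part(x):
    lower, upper = x
    # Create the session inside the task, so no unpicklable connection object
    # is captured in the closure that gets shipped to the distributed workers.
    session = make_session()  # hypothetical helper: build your own connection here
    q = "SELECT id,tfidf_vec,emb_vec FROM document_table"
    lines = session.execute(q)
    counter = lower
    rows = []
    for line in lines:
        rows.append(line)
        counter += 1
        if counter == upper:
            break
    return pd.DataFrame(rows)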

Apache Beam TextIO.read and then combine into batches

After using TextIO.read to get a PCollection<String> of the individual lines, is it possible to then use some kind of combine transform to group them into batches (groups of 25, for example)? So the return type would end up looking something like: PCollection<String, List<String>>. It looks like it should be possible using some kind of CombineFn, but the API is still a little arcane to me.
The context here is I'm reading CSV files (potentially very very large), parsing + processing the lines and turning them into JSON, and then calling a REST API... but I don't want to hit the REST API for each line individually because the REST API supports multiple items at a time (up to 1000, so not the whole batch).
I guess you can do some simple batching like below, buffering the lines inside the DoFn; the state you want to maintain in BatchingFn is the current buffer of lines, self._lines. Sorry, I did it in Python (I'm not familiar with the Java API).
from apache_beam.transforms import DoFn
from apache_beam.transforms import ParDo

MY_BATCH_SIZE = 512


class BatchingFn(DoFn):
    def __init__(self, batch_size=100):
        self._batch_size = batch_size

    def start_bundle(self):
        # buffer for the lines of the current bundle
        self._lines = []

    def process(self, element):
        # Input element is a string (representing a CSV line)
        self._lines.append(element)
        if len(self._lines) >= self._batch_size:
            self._flush_batch()

    def finish_bundle(self):
        # takes care of the unflushed buffer before finishing
        if self._lines:
            self._flush_batch()

    def _flush_batch(self):
        #### Do your REST API call here with self._lines
        # .....
        # Clear the buffer.
        self._lines = []


# pcoll is your PCollection of lines.
(pcoll | 'Call Rest API with batch data' >> ParDo(BatchingFn(MY_BATCH_SIZE)))
Regarding using Data-driven triggers, you can refer to Batch PCollection in Beam/Dataflow.

Check if PCollection is empty - Apache Beam

Is there any way to check if a PCollection is empty?
I haven't found anything relevant in the documentation of Dataflow and Apache Beam.
You didn't specify which SDK you're using, so I assumed Python. The code is easily portable to Java.
You can apply a global count of elements and then map the numeric value to a boolean with a simple comparison. You will then be able to side-input this value using the pvalue.AsSingleton function, like this:
import apache_beam as beam
from apache_beam import pvalue

is_empty_check = (your_pcollection
                  | "Count" >> beam.combiners.Count.Globally()
                  | "Is empty?" >> beam.Map(lambda n: n == 0)
                  )

another_pipeline_branch = (
    p
    | beam.Map(do_something, is_empty=pvalue.AsSingleton(is_empty_check))
)
Usage of the side input is the following:
def do_something(element, is_empty):
    if is_empty:
        ...  # yes, the PCollection is empty
    else:
        ...  # no, the PCollection has elements
There is no way to check the size of a PCollection without applying a PTransform to it (such as Count.globally() or a Combine with a CombineFn), because a PCollection is not like a typical Collection in the Java SDK.
It is an abstraction of a bounded or unbounded collection of data, where data is fed into the collection so that an operation (e.g. a PTransform) can be applied to it. It is also parallelized (as the P at the beginning of the class name suggests).
Therefore you need a mechanism to get the count of elements from each worker/node and combine them into a single value. Whether it is 0 or n cannot be known until the end of that transformation.
