What is the Apache Beam way to handle 'routing' - google-cloud-dataflow

Using Apache Beam I am doing computations - and if they succeed I'd like to write the output to one sink, and if there is a failure I'd like to write that to another sink.
Is there any way to handle metadata or content based routing in Apache Beam?
I've used Apache Camel extensively, and so in my mind based on the outcome of a previous transform, I should route a message to a different sink using a router (perhaps determined by a metadata flag I set on the message header). Is there an analogous capability with Apache Beam, or would I instead just have a sequential transform that inspects the PCollection and handles writing to sinks within the transform?
Ideally I'd like this logic (written verbosely for attempted clarity)
result = my_pcollections | 'compute_stuff' >> beam.Map(lambda (pcollection): my_compute_func(pcollection))
result | ([success_failure_router]
| 'sucess_sink' >> beam.io.WriteToText('/path/to/file')
| 'failure_sink' >> beam.io.WriteStringsToPubSub('mytopic'))
However.. I suspect the 'Beam' way of handling this is
result = my_pcollections | 'compute_stuff' >> beam.Map(lambda (pcollection): my_compute_func(pcollection))
result | 'write_results_appropriately' >> write_results_appropriately(result))
...
def write_results_appropriately(result):
if result == ..:
# success, write to file
else:
# failure, write to topic
Thanks,
Kevin

High-level:
I am not sure of specifics of the Python API in this case, but from high level it looks like this:
par-dos support multiple outputs;
outputs are identified by the tag you give them (e.g. "correct-elements", "invalid-elements");
in your main par-do you write to multiple outputs choosing the output using your criteria;
each output is represented by a separate PCollection;
then you get the separate PCollections representing the tagged outputs from your par-do;
then apply different sinks to each of the tagged PCollections;
In detail see the section
https://beam.apache.org/documentation/programming-guide/#additional-outputs

Related

Downloading a file in a DoFn

It's unclear whether it's safe to download files within a DoFn.
My DoFn will download a ~20MB file (an ML model) to apply to elements in my pipeline. According to the Beam docs, requirements include serializability and thread-compatibility.
An example (1, 2) is very similar to my DoFn. It demonstrates downloading from a GCP storage bucket (as I'm doing w/ DataflowRunner), but I'm not sure this approach is safe.
Should objects be downloaded to an in-memory bytes buffer instead of downloading to disk, or is there another best practice for this use case? I haven't come across a best practice approach to this pattern yet.
Adding on to this answer.
If your model data is static than you can use below code example to pass your model as side input.
#DoFn to open the model from GCS location
class get_model(beam.DoFn):
def process(self, element):
from apache_beam.io.gcp import gcsio
logging.info('reading model from GCS')
gcs = gcsio.GcsIO()
yield gcs.open(element)
#Pipeline to load pickle file from GCS bucket
model_step = (p
| 'start' >> beam.Create(['gs://somebucket/model'])
| 'load_model' >> beam.ParDo(get_model())
| 'unpickle_model' >> beam.Map(lambda bin: dill.load(bin)))
#DoFn to predict the results.
class predict(beam.DoFn):
def process(self, element, model):
(features, clients) = element
result = model.predict_proba(features)[:, 1]
return [(clients, result)]
#main pipeline to get input and predict results.
_ = (p
| 'get_input' >> #get input based on source and preprocess it.
| 'predict_sk_model' >> beam.ParDo(predict(), beam.pvalue.AsSingleton(model_step))
| 'write' >> #write output based on target.
In case of streaming pipeline if you want to load model again after predefined time, you can check "Slowly-changing lookup cache" pattern here.
If it is a scikit-learn model then you can look at hosting it in Cloud ML Engine and expose it as a REST endpoint. You can then use something like BagState to optimize invocation of models over the network. More details can be found in this link https://beam.apache.org/blog/2017/08/28/timely-processing.html

Google dataflow streaming pipeline is not distributing workload over several workers after windowing

I'm trying to set up a dataflow streaming pipeline in python. I have quite some experience with batch pipelines. Our basic architecture looks like this:
The first step is doing some basic processing and takes about 2 seconds per message to get to the windowing. We are using sliding windows of 3 seconds and 3 second interval (might change later so we have overlapping windows). As last step we have the SOG prediction that takes about 15ish seconds to process and which is clearly our bottleneck transform.
So, The issue we seem to face is that the workload is perfectly distributed over our workers before the windowing, but the most important transform is not distributed at all. All the windows are processed one at a time seemingly on 1 worker, while we have 50 available.
The logs show us that the sog prediction step has an output once every 15ish seconds which should not be the case if the windows would be processed over more workers, so this builds up huge latency over time which we don't want. With 1 minute of messages, we have a latency of 5 minutes for the last window. When distribution would work, this should only be around 15sec (the SOG prediction time). So at this point we are clueless..
Does anyone see if there is something wrong with our code or how to prevent/circumvent this?
It seems like this is something happening in the internals of google cloud dataflow. Does this also occur in java streaming pipelines?
In batch mode, Everything works fine. There, one could try to do a reshuffle to make sure no fusion etc occurs. But that is not possible after windowing in streaming.
args = parse_arguments(sys.argv if argv is None else argv)
pipeline_options = get_pipeline_options(project=args.project_id,
job_name='XX',
num_workers=args.workers,
max_num_workers=MAX_NUM_WORKERS,
disk_size_gb=DISK_SIZE_GB,
local=args.local,
streaming=args.streaming)
pipeline = beam.Pipeline(options=pipeline_options)
# Build pipeline
# pylint: disable=C0330
if args.streaming:
frames = (pipeline | 'ReadFromPubsub' >> beam.io.ReadFromPubSub(
subscription=SUBSCRIPTION_PATH,
with_attributes=True,
timestamp_attribute='timestamp'
))
frame_tpl = frames | 'CreateFrameTuples' >> beam.Map(
create_frame_tuples_fn)
crops = frame_tpl | 'MakeCrops' >> beam.Map(make_crops_fn, NR_CROPS)
bboxs = crops | 'bounding boxes tfserv' >> beam.Map(
pred_bbox_tfserv_fn, SERVER_URL)
sliding_windows = bboxs | 'Window' >> beam.WindowInto(
beam.window.SlidingWindows(
FEATURE_WINDOWS['goal']['window_size'],
FEATURE_WINDOWS['goal']['window_interval']),
trigger=AfterCount(30),
accumulation_mode=AccumulationMode.DISCARDING)
# GROUPBYKEY (per match)
group_per_match = sliding_windows | 'Group' >> beam.GroupByKey()
_ = group_per_match | 'LogPerMatch' >> beam.Map(lambda x: logging.info(
"window per match per timewindow: # %s, %s", str(len(x[1])), x[1][0][
'timestamp']))
sog = sliding_windows | 'Predict SOG' >> beam.Map(predict_sog_fn,
SERVER_URL_INCEPTION,
SERVER_URL_SOG )
pipeline.run().wait_until_finish()
In beam the unit of parallelism is the key--all the windows for a given key will be produced on the same machine. However, if you have 50+ keys they should get distributed among all workers.
You mentioned that you were unable to add a Reshuffle in streaming. This should be possible; if you're getting errors please file a bug at https://issues.apache.org/jira/projects/BEAM/issues . Does re-windowing into GlobalWindows make the issue with reshuffling go away?
It looks like you do not necessarily need GroupByKey because you are always grouping on the same key. Instead you could maybe use CombineGlobally to append all the elements inside the window in stead of the GroupByKey (with always the same key).
combined = values | beam.CombineGlobally(append_fn).without_defaults()
combined | beam.ParDo(PostProcessFn())
I am not sure how the load distribution works when using CombineGlobally but since it does not process key,value pairs I would expect another mechanism to do the load distribution.

Best way to prevent fusion in Google Dataflow?

From: https://cloud.google.com/dataflow/service/dataflow-service-desc#preventing-fusion
You can insert a GroupByKey and ungroup after your first ParDo. The Dataflow service never fuses ParDo operations across an aggregation.
This is what I came up with in python - is this reasonable / is there a simpler way?
def prevent_fuse(collection):
return (
collection
| beam.Map(lambda x: (x, 1))
| beam.GroupByKey()
| beam.FlatMap(lambda x: (x[0] for v in x[1]))
)
EDIT, in response to Ben Chambers' question
We want to prevent fusion because we have a collection which generates a much larger collection, and we need parallelization across the larger collection. If it fuses, I only get one worker across the larger collection.
Apache Beam SDK 2.3.0 adds the experimental Reshuffle transform, which is the Python alternative to the Reshuffle.viaRandomKey operation mentioned by #BenChambers. You can use it in place of your custom prevent_fuse code.
That should work. There are other ways, but they partly depend on what you are trying to do and why you want to prevent fusion. Keep in mind that fusion is an important optimization to improve the performance of your pipeline.
Could you elaborate on why you want to prevent fusion?
A small adjustment to my original proposal - if each item is too large, that will fail will fail. You need to force them into multiple items, so using a constant key doesn't work. So here, you can supply a key function which needs to differentiate the objects and be small, like a hash.
That said, still not sure this is the best way, or whether something simpler (beam.Partition?) would work. And would be good for Beam to supply an explicit primitive.
def prevent_fuse(collection, key=None):
"""
prevent a dataflow PCol fusing with the next PCol
supply a key function if the items are too big to use as keys
"""
key = key or (lambda x: x)
return (
collection
| beam.Map(lambda v: (key(v), v))
| beam.GroupByKey()
| beam.FlatMap(lambda kv: (v for v in kv[1]))
)

Check if PCollection is empty - Apache Beam

Is there any way to check if a PCollection is empty?
I haven't found anything relevant in the documentation of Dataflow and Apache Beam.
You didn't specify which SDK you're using, so I assumed Python. The code is easily portable to Java.
You can apply global counting of elements and then map numeric value to boolean by applying simple comparison. You will be able to side-input this value using pvalue.AsSingleton function, like this:
import apache_beam as beam
from apache_beam import pvalue
is_empty_check = (your_pcollection
| "Count" >> beam.combiners.Count.Globally()
| "Is empty?" >> beam.Map(lambda n: n == 0)
)
another_pipeline_branch = (
p
| beam.Map(do_something, is_empty=pvalue.AsSingleton(is_empty_check))
)
Usage of the side input is the following:
def do_something(element, is_empty):
if is_empty:
# yes
else:
# no
There is no way to check size of the PCollection without applying a PTransform on it (such as Count.globally() or Combine.combineFn()) because PCollection is not like a typical Collection in Java SDK or so.
It is an abstraction of bounded or unbounded collection of data where data is fed into the collection for an operation being applied on it (e.g. PTransform). Also it is parallelized (as the P at the beginning of the class suggest).
Therefore you need a mechanism to get counts of elements from each worker/node and combine them to get a value. Whether it is 0 or n can not be known until the end of that transformation.

Make WRITE_TRUNCATE and WRITE_APPEND dynamic in Apache Beam

I have an Apache Beam program that processes the file that comes in a GCS bucket and dumps the data in some BigQuery table. Depending on the file, I want to set the truncate or append operation. Can this be made dynamic or configurable?
Thank You.
I assume that when you say "depending on file", you have some information about the file (to recognize which when to use WRITE_TRUNCATE and WRITE_APPEND) in your pipeline.
Easiest thing to do will be to split the input you're passing into BigQuery into two PCollections (by filtering) and pass each of them into appropriate BigQuery sink (one with WRITE_TRUNCATE and one with WRITE_APPEND).
You didn't mention if you use Java or Python, the pseudocode below is for Python, but it could be easily ported to Java SDK
files = (pipeline
| 'Read files' >> beam.io.Read(Your_GCS_Source())
)
files_to_truncate = (files
| beam.Filter(lambda file: filter_for_files_to_truncate())
| beam.io.Write(beam.io.BigQuerySink(output_table, schema=output_schema, create_disposition=create_disposition, write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))
)
files_to_append = (files
| beam.Filter(lambda file: filter_for_files_to_append())
| beam.io.Write(beam.io.BigQuerySink(output_table, schema=output_schema, create_disposition=create_disposition, write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))
)

Resources