How do parallelize apache-beam (Dataflow) pipeline DAGs - google-cloud-dataflow

I am using apache-beam 2.5.0 python SDK
Attaching the code snippet, in a pipeline, I am taking i/p from pubsub topic parsing it and want to perform some operation on it, when I ran it with DataflowRunner it runs fine but it seems that "data processing fun1", "data processing fun2" "data processing fun3" are running in sequential, I need it to run in parallel.
I am new to dataflow.
Is there a way to parallelize it?
def run():
parser = argparse.ArgumentParser()
args, pipeline_args = parser.parse_known_args()
options = PipelineOptions(pipeline_args)
with beam.Pipeline(options=options) as p:
data = (p | "Read Pubsub Messages" >>
beam.io.ReadFromPubSub(topic=config.pub_sub_topic)
| "Parse messages " >> beam.Map(parse_pub_sub_message_with_bq_data)
)
data | "data processing fun1 " >> beam.ParDo(Fun1())
data | "data processing fun2" >> beam.ParDo(Fun2())
data | "data processing fun3" >> beam.ParDo(Fun3())
if __name__ == '__main__':
run()

Why do you need these functions to run at the same time?
Beam / Dataflow take your graph, and try to optimize things that can run in the same thread. This called fusion optimization, and it's mentioned in the Flume Java paper.
The point is that it will usually be more efficient to run those functions one by one on the same thread, rather than interchange data between multiple processing threads or VMs, to parallelize the processing.
If your funtions must run more or less in parallel, you can add a beam.Reshuffle transform before the downstream transforms:
data = (p
| beam.io.ReadFromPubSub(topic)
| beam.Map(parse_messages))
# After the data has been shuffled, it may be consumed by multiple workers
data | beam.Reshuffle() | beam.ParDo(Fun1())
data | beam.Reshuffle() | beam.ParDo(Fun2())
data | beam.Reshuffle() | beam.ParDo(Fun3())
Let me know if I can add some detail into this.

Related

Downloading a file in a DoFn

It's unclear whether it's safe to download files within a DoFn.
My DoFn will download a ~20MB file (an ML model) to apply to elements in my pipeline. According to the Beam docs, requirements include serializability and thread-compatibility.
An example (1, 2) is very similar to my DoFn. It demonstrates downloading from a GCP storage bucket (as I'm doing w/ DataflowRunner), but I'm not sure this approach is safe.
Should objects be downloaded to an in-memory bytes buffer instead of downloading to disk, or is there another best practice for this use case? I haven't come across a best practice approach to this pattern yet.
Adding on to this answer.
If your model data is static than you can use below code example to pass your model as side input.
#DoFn to open the model from GCS location
class get_model(beam.DoFn):
def process(self, element):
from apache_beam.io.gcp import gcsio
logging.info('reading model from GCS')
gcs = gcsio.GcsIO()
yield gcs.open(element)
#Pipeline to load pickle file from GCS bucket
model_step = (p
| 'start' >> beam.Create(['gs://somebucket/model'])
| 'load_model' >> beam.ParDo(get_model())
| 'unpickle_model' >> beam.Map(lambda bin: dill.load(bin)))
#DoFn to predict the results.
class predict(beam.DoFn):
def process(self, element, model):
(features, clients) = element
result = model.predict_proba(features)[:, 1]
return [(clients, result)]
#main pipeline to get input and predict results.
_ = (p
| 'get_input' >> #get input based on source and preprocess it.
| 'predict_sk_model' >> beam.ParDo(predict(), beam.pvalue.AsSingleton(model_step))
| 'write' >> #write output based on target.
In case of streaming pipeline if you want to load model again after predefined time, you can check "Slowly-changing lookup cache" pattern here.
If it is a scikit-learn model then you can look at hosting it in Cloud ML Engine and expose it as a REST endpoint. You can then use something like BagState to optimize invocation of models over the network. More details can be found in this link https://beam.apache.org/blog/2017/08/28/timely-processing.html

How to set coder for Google Dataflow Pipeline in Python?

I am creating a custom Dataflow job in Python to ingest data from PubSub to BigQuery. Table has many nested fields.
Where Can I set Coder in this pipeline?
avail_schema = parse_table_schema_from_json(bg_out_schema)
coder = TableRowJsonCoder(table_schema=avail_schema)
with beam.Pipeline(options=options) as p:
# Read the text from PubSub messages.
lines = (p | beam.io.ReadFromPubSub(subscription="projects/project_name/subscriptions/subscription_name")
| 'Map' >> beam.Map(coder))
# transformed = lines| 'Parse JSON to Dict' >> beam.Map(json.loads)
transformed | 'Write to BigQuery' >> beam.io.WriteToBigQuery("Project:DataSet.Table", schema=avail_schema, write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
Error: Map can be used only with callable objects. Received TableRowJsonCoder instead.
In the code above, the coder is applied to the message read from the PubSub which is text.
WriteToBigQuery works with both, dictionary and TableRow. json.load emits dict so you can simply use the output from it to write to BigQuery without apply any coder. Note, the field in dictionary has to match Table Schema.
To avoid coder issue I would suggest using following code.
avail_schema = parse_table_schema_from_json(bg_out_schema)
with beam.Pipeline(options=options) as p:
# Read the text from PubSub messages.
lines = (p | beam.io.ReadFromPubSub(subscription="projects/project_name/subscriptions/subscription_name"))
transformed = lines| 'Parse JSON to Dict' >> beam.Map(json.loads)
transformed | 'Write to BigQuery' >> beam.io.WriteToBigQuery("Project:DataSet.Table", schema=avail_schema, write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)

Google dataflow streaming pipeline is not distributing workload over several workers after windowing

I'm trying to set up a dataflow streaming pipeline in python. I have quite some experience with batch pipelines. Our basic architecture looks like this:
The first step is doing some basic processing and takes about 2 seconds per message to get to the windowing. We are using sliding windows of 3 seconds and 3 second interval (might change later so we have overlapping windows). As last step we have the SOG prediction that takes about 15ish seconds to process and which is clearly our bottleneck transform.
So, The issue we seem to face is that the workload is perfectly distributed over our workers before the windowing, but the most important transform is not distributed at all. All the windows are processed one at a time seemingly on 1 worker, while we have 50 available.
The logs show us that the sog prediction step has an output once every 15ish seconds which should not be the case if the windows would be processed over more workers, so this builds up huge latency over time which we don't want. With 1 minute of messages, we have a latency of 5 minutes for the last window. When distribution would work, this should only be around 15sec (the SOG prediction time). So at this point we are clueless..
Does anyone see if there is something wrong with our code or how to prevent/circumvent this?
It seems like this is something happening in the internals of google cloud dataflow. Does this also occur in java streaming pipelines?
In batch mode, Everything works fine. There, one could try to do a reshuffle to make sure no fusion etc occurs. But that is not possible after windowing in streaming.
args = parse_arguments(sys.argv if argv is None else argv)
pipeline_options = get_pipeline_options(project=args.project_id,
job_name='XX',
num_workers=args.workers,
max_num_workers=MAX_NUM_WORKERS,
disk_size_gb=DISK_SIZE_GB,
local=args.local,
streaming=args.streaming)
pipeline = beam.Pipeline(options=pipeline_options)
# Build pipeline
# pylint: disable=C0330
if args.streaming:
frames = (pipeline | 'ReadFromPubsub' >> beam.io.ReadFromPubSub(
subscription=SUBSCRIPTION_PATH,
with_attributes=True,
timestamp_attribute='timestamp'
))
frame_tpl = frames | 'CreateFrameTuples' >> beam.Map(
create_frame_tuples_fn)
crops = frame_tpl | 'MakeCrops' >> beam.Map(make_crops_fn, NR_CROPS)
bboxs = crops | 'bounding boxes tfserv' >> beam.Map(
pred_bbox_tfserv_fn, SERVER_URL)
sliding_windows = bboxs | 'Window' >> beam.WindowInto(
beam.window.SlidingWindows(
FEATURE_WINDOWS['goal']['window_size'],
FEATURE_WINDOWS['goal']['window_interval']),
trigger=AfterCount(30),
accumulation_mode=AccumulationMode.DISCARDING)
# GROUPBYKEY (per match)
group_per_match = sliding_windows | 'Group' >> beam.GroupByKey()
_ = group_per_match | 'LogPerMatch' >> beam.Map(lambda x: logging.info(
"window per match per timewindow: # %s, %s", str(len(x[1])), x[1][0][
'timestamp']))
sog = sliding_windows | 'Predict SOG' >> beam.Map(predict_sog_fn,
SERVER_URL_INCEPTION,
SERVER_URL_SOG )
pipeline.run().wait_until_finish()
In beam the unit of parallelism is the key--all the windows for a given key will be produced on the same machine. However, if you have 50+ keys they should get distributed among all workers.
You mentioned that you were unable to add a Reshuffle in streaming. This should be possible; if you're getting errors please file a bug at https://issues.apache.org/jira/projects/BEAM/issues . Does re-windowing into GlobalWindows make the issue with reshuffling go away?
It looks like you do not necessarily need GroupByKey because you are always grouping on the same key. Instead you could maybe use CombineGlobally to append all the elements inside the window in stead of the GroupByKey (with always the same key).
combined = values | beam.CombineGlobally(append_fn).without_defaults()
combined | beam.ParDo(PostProcessFn())
I am not sure how the load distribution works when using CombineGlobally but since it does not process key,value pairs I would expect another mechanism to do the load distribution.

What is the Apache Beam way to handle 'routing'

Using Apache Beam I am doing computations - and if they succeed I'd like to write the output to one sink, and if there is a failure I'd like to write that to another sink.
Is there any way to handle metadata or content based routing in Apache Beam?
I've used Apache Camel extensively, and so in my mind based on the outcome of a previous transform, I should route a message to a different sink using a router (perhaps determined by a metadata flag I set on the message header). Is there an analogous capability with Apache Beam, or would I instead just have a sequential transform that inspects the PCollection and handles writing to sinks within the transform?
Ideally I'd like this logic (written verbosely for attempted clarity)
result = my_pcollections | 'compute_stuff' >> beam.Map(lambda (pcollection): my_compute_func(pcollection))
result | ([success_failure_router]
| 'sucess_sink' >> beam.io.WriteToText('/path/to/file')
| 'failure_sink' >> beam.io.WriteStringsToPubSub('mytopic'))
However.. I suspect the 'Beam' way of handling this is
result = my_pcollections | 'compute_stuff' >> beam.Map(lambda (pcollection): my_compute_func(pcollection))
result | 'write_results_appropriately' >> write_results_appropriately(result))
...
def write_results_appropriately(result):
if result == ..:
# success, write to file
else:
# failure, write to topic
Thanks,
Kevin
High-level:
I am not sure of specifics of the Python API in this case, but from high level it looks like this:
par-dos support multiple outputs;
outputs are identified by the tag you give them (e.g. "correct-elements", "invalid-elements");
in your main par-do you write to multiple outputs choosing the output using your criteria;
each output is represented by a separate PCollection;
then you get the separate PCollections representing the tagged outputs from your par-do;
then apply different sinks to each of the tagged PCollections;
In detail see the section
https://beam.apache.org/documentation/programming-guide/#additional-outputs

Make WRITE_TRUNCATE and WRITE_APPEND dynamic in Apache Beam

I have an Apache Beam program that processes the file that comes in a GCS bucket and dumps the data in some BigQuery table. Depending on the file, I want to set the truncate or append operation. Can this be made dynamic or configurable?
Thank You.
I assume that when you say "depending on file", you have some information about the file (to recognize which when to use WRITE_TRUNCATE and WRITE_APPEND) in your pipeline.
Easiest thing to do will be to split the input you're passing into BigQuery into two PCollections (by filtering) and pass each of them into appropriate BigQuery sink (one with WRITE_TRUNCATE and one with WRITE_APPEND).
You didn't mention if you use Java or Python, the pseudocode below is for Python, but it could be easily ported to Java SDK
files = (pipeline
| 'Read files' >> beam.io.Read(Your_GCS_Source())
)
files_to_truncate = (files
| beam.Filter(lambda file: filter_for_files_to_truncate())
| beam.io.Write(beam.io.BigQuerySink(output_table, schema=output_schema, create_disposition=create_disposition, write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))
)
files_to_append = (files
| beam.Filter(lambda file: filter_for_files_to_append())
| beam.io.Write(beam.io.BigQuerySink(output_table, schema=output_schema, create_disposition=create_disposition, write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))
)

Resources