Add a dependency between two DoFns in Apache Beam - google-cloud-dataflow

Is there any way to create a dependency between two DoFns, so that the second DoFn runs only after the first DoFn has completed?
Just wondering how we can achieve this use case.

There may be a cleaner way to do this, but I've noticed that doing the following will achieve the effect you desire:
Route the output of the first DoFn to also go to a counter, and then pass the output of that counter into the ParDo of the second DoFn as a side input. Because the side input cannot be materialized until the count has seen every element produced by the first DoFn, the second DoFn will not start processing until the first one has finished.
class DoFn2(apache_beam.DoFn):
    def process(self, element, count_do_fn_1_output, *args, **kwargs):
        # ...

do_fn_1_output = do_fn_1_input | 'do fn 1' >> apache_beam.ParDo(DoFn1())
count_do_fn_1_output = (
    do_fn_1_output
    | 'count do_fn_1_output' >> apache_beam.combiners.Count.Globally())
do_fn_2_output = (
    do_fn_1_output
    | 'do fn 2' >> apache_beam.ParDo(
        DoFn2(),
        count_do_fn_1_output=apache_beam.pvalue.AsSingleton(count_do_fn_1_output)))
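For completeness, here is a minimal self-contained sketch of the same pattern; DoFn1, DoFn2 and the element values are hypothetical placeholders:
import apache_beam

class DoFn1(apache_beam.DoFn):
    def process(self, element):
        # Hypothetical first stage: pass elements through unchanged.
        yield element

class DoFn2(apache_beam.DoFn):
    def process(self, element, count_do_fn_1_output, *args, **kwargs):
        # The side input only becomes available once DoFn1 has finished and its
        # output has been counted, so this DoFn effectively waits for stage 1.
        yield (element, count_do_fn_1_output)

with apache_beam.Pipeline() as p:
    do_fn_1_input = p | apache_beam.Create([1, 2, 3])
    do_fn_1_output = do_fn_1_input | 'do fn 1' >> apache_beam.ParDo(DoFn1())
    count_do_fn_1_output = (
        do_fn_1_output
        | 'count do_fn_1_output' >> apache_beam.combiners.Count.Globally())
    do_fn_2_output = (
        do_fn_1_output
        | 'do fn 2' >> apache_beam.ParDo(
            DoFn2(),
            count_do_fn_1_output=apache_beam.pvalue.AsSingleton(count_do_fn_1_output)))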

For the Java SDK, I'd recommend taking a look at the Wait transform. As far as I can tell, this is an example similar to what you want to achieve.

Related

Test a step that yields the same instance multiple times

We have a step that splits up Pubsub messages on newlines in Dataflow. We have a test that passes for the code, but the code seems to fail in production. It looks like we get the same Pubsub message in multiple places in the pipeline at once (to the best of my knowledge, at least).
Should we have written the first test in another way? Or is this just a hard lesson learned about what not to do in Apache Beam?
import apache_beam as beam
from apache_beam.io import PubsubMessage
from apache_beam.testing.test_pipeline import TestPipeline
from apache_beam.testing.util import assert_that, equal_to
import unittest

class SplitUpBatches(beam.DoFn):
    def process(self, msg):
        bodies = msg.data.split('\n')
        for body in bodies:
            msg.data = body.strip()
            yield msg

class TestSplitting(unittest.TestCase):
    body = """
first
second
third
""".strip()

    def test_incorrectly_passing(self):
        """Incorrectly passing"""
        msg = PubsubMessage(self.body, {})
        with TestPipeline() as p:
            assert_that(
                p
                | beam.Create([msg])
                | "split up batches" >> beam.ParDo(SplitUpBatches())
                | "map to data" >> beam.Map(lambda m: m.data),
                equal_to(['first', 'second', 'third']))

    def test_correctly_failing(self):
        """Failing, but not using a TestPipeline"""
        msg = PubsubMessage(self.body, {})
        messages = list(SplitUpBatches().process(msg))
        bodies = [m.data for m in messages]
        self.assertEqual(bodies, ['first', 'second', 'third'])
        # => AssertionError: ['third', 'third', 'third'] != ['first', 'second', 'third']
TL;DR: Yes, this is an example of what not to do in Beam: reusing (mutating) your element objects.
In fact, Beam discourages mutating inputs and outputs of your transforms, because Beam passes/buffers those objects in various ways that can be affected if you mutate them.
The recommendation here is to create a new PubsubMessage instance for each output.
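For example, a minimal non-mutating rewrite of SplitUpBatches along those lines might look like this (a sketch; it assumes carrying over the original message's attributes is sufficient):
import apache_beam as beam
from apache_beam.io import PubsubMessage

class SplitUpBatches(beam.DoFn):
    def process(self, msg):
        for body in msg.data.split('\n'):
            # Emit a fresh PubsubMessage per line instead of mutating `msg`.
            yield PubsubMessage(body.strip(), msg.attributes)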
Detailed explanation
This happens due to the ways in which Beam serializes and passes data around.
You may know that Beam executes several steps together on a single worker - what we call stages. Your pipeline does something like this:
read_data -> split_up_batches -> serialize all data -> perform assert
This intermediate "serialize all data" step is an implementation detail. The reason is that for Beam's assert_that we gather all of the data of a single PCollection onto a single machine and perform the assert there (thus we need to serialize all elements and send them over to a single machine). We do this with a GroupByKey operation.
When the DirectRunner receives the first yield of a PubsubMessage('first'), it serializes it and transfers it to the GroupByKey immediately - so you get the 'first', 'second', 'third' result, because serialization happens immediately.
When the DataflowRunner receives the first yield of a PubsubMessage('first'), it buffers it and sends over a batch of elements. You get the 'third', 'third', 'third' result, because serialization only happens once the buffer is transmitted - and by then your original PubsubMessage instance has been overwritten.

Is it OK to use for loop for step order in Apache beam

I have several functions (or transforms): func1, func2, func3, ...
and a dict that holds the functions:
FUNCS = {
    '1': func1,
    '2': func2,
    ...
}
What I'm thinking about is to pass a parameter funcs that accepts a string of digits, and use a for loop to iterate through funcs and apply the functions.
For example:
say I pass funcs="1321", then the functions are executed as:
with beam.Pipeline() as p:
    lines = (
        p
        | "read file" >> beam.io.ReadFromText('gs://some/inputData.txt')
    )
    for f in funcs:  # 1321
        lines = lines | FUNCS[f](...)  # some other params
and the functions are executed in order: func1, func3, func2, func1.
Is there any difference compared with:
with ...:
    lines = ...
    lines = lines | func1 | func3 | func2 | func1
It's possible, I think; but is this even a good idea? Will there be any disadvantage with respect to Beam's parallelism?
The real question is:
Is the pipeline built first, THEN executed?
Will the for loop and the hard-coded steps above end up building the same pipeline? What effect does the for loop have on efficiency and on the final result?
I'm using a Flex Template on Google Dataflow, btw.
That's a smart question.
The short answer is: Yes, the pipeline is built first, and then executed.
The pipeline is only executed after you exit the with block.
Your for loop is completely fine.
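Here is a minimal sketch illustrating this; FUNCS and the element values are hypothetical placeholders, with each entry being a factory that returns a transform. The loop only builds the graph, and it produces the same pipeline as an equivalent hard-coded chain:
import apache_beam as beam

# Hypothetical stand-ins for func1/func2/func3: factories returning transforms.
FUNCS = {
    '1': lambda: beam.Map(lambda x: x + 1),
    '2': lambda: beam.Map(lambda x: x * 2),
    '3': lambda: beam.Map(lambda x: x - 3),
}

def build_and_run(funcs='1321'):
    with beam.Pipeline() as p:
        lines = p | beam.Create([1, 2, 3])
        # This loop only constructs the pipeline graph; nothing executes
        # until the with-block exits and the pipeline is submitted.
        for i, f in enumerate(funcs):
            lines = lines | f'step {i} (func {f})' >> FUNCS[f]()

build_and_run('1321')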

Google Dataflow Python Apache Beam Windowing delay issue

I have a simple pipeline that receives data from PubSub, prints it, and then every 10 seconds fires a window into a GroupByKey and prints that message again.
However, this window sometimes seems to be delayed. Is this a Google limitation, or is there something wrong with my code?
with beam.Pipeline(options=pipeline_options) as pipe:
    messages = (
        pipe
        | beam.io.ReadFromPubSub(subscription=known_args.input_subscription).with_output_types(bytes)
        | 'decode' >> beam.Map(lambda x: x.decode('utf-8'))
        | 'Ex' >> beam.ParDo(ExtractorAndPrinter())
        | beam.WindowInto(
            window.FixedWindows(10),
            allowed_lateness=0,
            accumulation_mode=AccumulationMode.DISCARDING,
            trigger=AfterProcessingTime(10))
        | 'group' >> beam.GroupByKey()
        | 'PRINTER' >> beam.ParDo(PrinterWorker()))
Edit with the most recent code below. I removed the triggers, but the problem persists:
class ExtractorAndCounter(beam.DoFn):
    def __init__(self):
        beam.DoFn.__init__(self)

    def process(self, element, *args, **kwargs):
        import logging
        logging.info(element)
        return [("Message", json.loads(element)["Message"])]

class PrinterWorker(beam.DoFn):
    def __init__(self):
        beam.DoFn.__init__(self)

    def process(self, element, *args, **kwargs):
        import logging
        logging.info(element)
        return [str(element)]

class DefineTimestamp(beam.DoFn):
    def process(self, element, *args, **kwargs):
        from datetime import datetime
        return [(str(datetime.now()), element)]

def run(argv=None, save_main_session=True):
    """Build and run the pipeline."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--output_topic',
        required=True,
        help=(
            'Output PubSub topic of the form '
            '"projects/<PROJECT>/topics/<TOPIC>".'))
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument(
        '--input_topic',
        help=(
            'Input PubSub topic of the form '
            '"projects/<PROJECT>/topics/<TOPIC>".'))
    group.add_argument(
        '--input_subscription',
        help=(
            'Input PubSub subscription of the form '
            '"projects/<PROJECT>/subscriptions/<SUBSCRIPTION>."'))
    known_args, pipeline_args = parser.parse_known_args(argv)
    pipeline_options = PipelineOptions(pipeline_args)
    pipeline_options.view_as(SetupOptions).save_main_session = save_main_session
    pipeline_options.view_as(StandardOptions).streaming = True

    with beam.Pipeline(options=pipeline_options) as pipe:
        messages = (
            pipe
            | beam.io.ReadFromPubSub(subscription=known_args.input_subscription).with_output_types(bytes)
            | 'decode' >> beam.Map(lambda x: x.decode('utf-8'))
            | 'Ex' >> beam.ParDo(ExtractorAndCounter())
            | beam.WindowInto(window.FixedWindows(10))
            | 'group' >> beam.GroupByKey()
            | 'PRINTER' >> beam.ParDo(PrinterWorker())
            | 'encode' >> beam.Map(lambda x: x.encode('utf-8'))
            | beam.io.WriteToPubSub(known_args.output_topic))

if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    run()
So what this basically asks the pipeline to do is to group elements into 10-second windows, fire each window 10 seconds after the first element was received for that window, and discard the rest of the data for that window. Was that your intention?
Assuming this was the case, note that triggering depends on the time elements were received by the system as well as the time the first element is received for each window. This is probably why you are seeing some variation in your results.
I think that if you need more consistent grouping of your elements, you should use event-time triggers instead of processing-time triggers, as sketched below.
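For example, a watermark-based (event-time) configuration for the same 10-second windows might look roughly like this; it is a sketch meant to replace the WindowInto step in your pipeline, not tested against your setup:
import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.transforms.trigger import AccumulationMode, AfterWatermark

# Fire each 10-second window when the watermark passes the end of the window
# (event time), rather than 10 seconds of processing time after the first
# element arrives.
window_step = beam.WindowInto(
    window.FixedWindows(10),
    trigger=AfterWatermark(),
    accumulation_mode=AccumulationMode.DISCARDING,
    allowed_lateness=0)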
All triggers are best-effort, which means they fire some time after the specified duration, 10 seconds in this case. Generally this happens close to the specified time, but a delay of a few seconds is possible.
Also, triggers are set per key and window, and the window is derived from event time.
It is possible that the first GBK print at 10:30:04 is due to a first element that arrived at 10:29:52, and that the second GBK print at 10:30:07 is due to a first element at 10:29:56.
So it would be good to print the window and event timestamp for each element and then correlate the data.
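If it helps with that correlation, here is a minimal sketch of a DoFn that logs each element's window and event timestamp (using Beam's WindowParam and TimestampParam); you could insert it right after the WindowInto step:
import logging

import apache_beam as beam

class WindowAndTimestampLogger(beam.DoFn):
    """Log each element together with its window and event timestamp."""
    def process(self, element,
                window=beam.DoFn.WindowParam,
                timestamp=beam.DoFn.TimestampParam):
        logging.info('element=%s window=%s event_timestamp=%s',
                     element, window, timestamp)
        yield element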

If my pipeline rekeys elements - how do I get values for one key in the same worker without GroupByKey?

Let's say I have a pipeline with a series of ParDo operations in which element keys change. How can I ensure that elements for the same key go to the same worker without having to do a GroupByKey with windowing?
input_pcoll = p | beam.ReadFromXYZ(...)
rekeyed_pcoll = (input_pcoll
                 | beam.FlatMap(some_operation)
                 | beam.Map(lambda x: (compute_new_key(x), x['value'])))
After this, I would like to have elements of the same key go to the same worker without having to run a GroupByKey that uses windowing or triggering.
There are two ways to accomplish this.
The first one is by doing a GroupByKey with a trigger that fires after every single element. Something like this:
keys_together_pcoll = (rekeyed_pcoll
                       | beam.WindowInto(window.GlobalWindows(),
                                         trigger=AfterCount(1),
                                         # accumulation_mode is required when a trigger is set
                                         accumulation_mode=AccumulationMode.DISCARDING)
                       | beam.GroupByKey()
                       | beam.FlatMap(lambda x: x[1]))

result_pcoll = (keys_together_pcoll
                | beam.ParDo(DoFnWithElementsInCorrespondingWorkers()))
Granted, this is a little awkward.
Another way to do this is to make your DoFn stateful. This will force the runner to shuffle the elements into their corresponding workers by key. Something like this:
from apache_beam.coders import VarIntCoder
from apache_beam.transforms.userstate import BagStateSpec

class DoFnWithElementsInCorrespondingWorkers(beam.DoFn):
    UNUSED_STATE = BagStateSpec('unused', VarIntCoder())

    def process(self,
                element,
                unused=beam.DoFn.StateParam(UNUSED_STATE)):
        ...  # My processing

result_pcoll = (rekeyed_pcoll
                | beam.ParDo(DoFnWithElementsInCorrespondingWorkers()))
Why does this happen?
Remember that in Beam (and Flink, and similar systems), state is organized by key, so if you insert a stateful DoFn, Beam will recognize that elements need to be shuffled into the correct workers according to their keys.

Check if PCollection is empty - Apache Beam

Is there any way to check if a PCollection is empty?
I haven't found anything relevant in the documentation of Dataflow and Apache Beam.
You didn't specify which SDK you're using, so I assumed Python. The code is easily portable to Java.
You can apply a global count of elements and then map the numeric value to a boolean with a simple comparison. You will then be able to side-input this value using the pvalue.AsSingleton function, like this:
import apache_beam as beam
from apache_beam import pvalue

is_empty_check = (your_pcollection
                  | "Count" >> beam.combiners.Count.Globally()
                  | "Is empty?" >> beam.Map(lambda n: n == 0)
                  )
another_pipeline_branch = (
    p
    | beam.Map(do_something, is_empty=pvalue.AsSingleton(is_empty_check))
)
Usage of the side input is as follows:
def do_something(element, is_empty):
    if is_empty:
        ...  # yes
    else:
        ...  # no
There is no way to check the size of a PCollection without applying a PTransform to it (such as Count.Globally() or a Combine transform), because a PCollection is not like a typical Collection in the Java SDK or similar.
It is an abstraction of a bounded or unbounded collection of data, where data is fed into the collection for an operation to be applied on it (e.g. a PTransform). It is also parallelized (as the P at the beginning of the class name suggests).
Therefore you need a mechanism to get the counts of elements from each worker/node and combine them into a single value. Whether it is 0 or n cannot be known until the end of that transformation.
