Apache Beam Google Cloud Dataflow side input lags far behind main input - google-cloud-dataflow

I have a Dataflow pipeline with a main input (PubSub) and a side input (PeriodicImpulse), and the two get merged together. The side input timestamps, however, do not advance for a very long time, and then sometimes jump forward erratically much later. Based on the documentation on side inputs and windowing, I expect the side input to lag behind the main input by at most the side input fire interval, regardless of whether the interval between main input elements is longer or shorter than the side input interval.
Here's a simplified test case that consists of a PubSub main input and PeriodicImpulse side input. The pipeline simply merges PubSub messages with side input timestamps.
from datetime import datetime
import logging

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import trigger
from apache_beam.transforms import window
from apache_beam.transforms.periodicsequence import PeriodicImpulse

RUNNER = "DataflowRunner"
PROJECT = "my_project"
REGION = "my_region"
TEMP_LOCATION = "gs://my_project/temp"
# Requires a PubSub topic to exist.
TOPIC = "projects/my_project/topics/my_topic"
SIDE_INTERVAL = 30  # in seconds.
PUBSUB_WINDOW = 30  # in seconds.


class Logger(beam.DoFn):
    def __init__(self, name):
        self._name = name

    def process(self, element, w=beam.DoFn.WindowParam,
                ts=beam.DoFn.TimestampParam):
        logging.info('%s: %s', self._name, element)
        yield element


def format_timestamp(ts):
    return datetime.utcfromtimestamp(float(ts)).strftime("%H:%M:%S")


class AddTimestamp(beam.DoFn):
    def process(self, element, publish_time=beam.DoFn.TimestampParam):
        yield (element, publish_time)


def merge(main_input, side_inputs):
    for s in side_inputs:
        msg_bytes, ts = main_input
        yield (f"Main={format_timestamp(ts)} {msg_bytes.decode('utf-8')}, "
               f"Side={format_timestamp(s)}")


def run(topic):
    pipeline_options = PipelineOptions(
        runner=RUNNER,
        streaming=True,
        save_main_session=True,  # Preserves python imports.
        project=PROJECT,
        region=REGION,
        temp_location=TEMP_LOCATION)
    with beam.Pipeline(options=pipeline_options) as pipeline:
        side_input = (
            pipeline
            | PeriodicImpulse(fire_interval=SIDE_INTERVAL,
                              apply_windowing=False)
            | "ApplyGlobalWindow" >> beam.WindowInto(
                window.GlobalWindows(),
                trigger=trigger.Repeatedly(trigger.AfterProcessingTime(1)),
                accumulation_mode=trigger.AccumulationMode.DISCARDING)
            | "SideLogger" >> beam.ParDo(Logger("side"))
        )
        main_input = (
            pipeline
            | "Read from Pub/Sub" >> beam.io.ReadFromPubSub(topic=topic)
            | "AddTimestamp" >> beam.ParDo(AddTimestamp())
            | "Window" >> beam.WindowInto(window.FixedWindows(PUBSUB_WINDOW))
            | "MainLogger" >> beam.ParDo(Logger("main"))
        )
        merged = (
            main_input
            | "Merge" >> beam.ParDo(
                merge, side_inputs=beam.pvalue.AsList(side_input))
            | "MergedLogger" >> beam.ParDo(Logger("merged"))
        )
Publish some test messages:
from google.cloud import pubsub_v1
import time

MESSAGE_INTERVAL = 10  # in seconds.

publisher = pubsub_v1.PublisherClient()
# project_id / topic_id: the project and topic the pipeline reads from
# (not defined in the original snippet).
topic_path = publisher.topic_path(project_id, topic_id)


def publish(message):
    """Publish string message."""
    data = message.encode("utf-8")
    future = publisher.publish(topic_path, data)
    print(future.result())


i = 0
while True:
    publish(f'Message ({i})')
    i += 1
    time.sleep(MESSAGE_INTERVAL)
For example, initially the main input interval is 10 seconds and the side input interval is 30 seconds. The very first main and side inputs have aligned timestamps, but the subsequent side input timestamps unexpectedly stop advancing (BAD):
...
merged: Main=22:45:15 Message (57), Side=22:32:51
merged: Main=22:45:25 Message (58), Side=22:32:51
merged: Main=22:45:36 Message (59), Side=22:32:51
merged: Main=22:45:46 Message (60), Side=22:32:51
merged: Main=22:45:56 Message (61), Side=22:32:51
merged: Main=22:46:06 Message (62), Side=22:32:51
merged: Main=22:46:16 Message (63), Side=22:32:51
....
If I slow the main input to a 60-second interval (side input interval still 30 seconds), the side input timestamp suddenly advances (BETTER), but then immediately starts lagging behind the main input again (BAD):
...
merged: Main=22:55:06 Message (6), Side=22:32:51
merged: Main=22:56:06 Message (7), Side=22:32:51
merged: Main=22:57:52 Message (0), Side=22:57:21 <--- jumps forward to align
merged: Main=22:58:52 Message (1), Side=22:57:21 <--- stuck again
merged: Main=22:59:52 Message (2), Side=22:59:21 <--- jumps forward
merged: Main=23:00:52 Message (3), Side=22:59:21 <--- stuck again
merged: Main=23:01:52 Message (4), Side=22:59:21
merged: Main=23:02:52 Message (5), Side=22:59:21
merged: Main=23:03:52 Message (6), Side=22:59:21
merged: Main=23:04:52 Message (7), Side=22:59:21
merged: Main=23:05:52 Message (8), Side=23:04:21
merged: Main=23:06:53 Message (9), Side=23:04:21
merged: Main=23:07:53 Message (10), Side=23:04:21
merged: Main=23:08:53 Message (11), Side=23:04:21
merged: Main=23:09:53 Message (12), Side=23:04:21
merged: Main=23:10:53 Message (13), Side=23:04:21
merged: Main=23:11:53 Message (14), Side=23:04:21
merged: Main=23:12:53 Message (15), Side=23:04:21
In the last merge, the side input is 8 minutes behind the main input, but I expect at most 30 seconds of lag given the 30-second side input interval. I tried sprinkling some Reshuffle() steps in various places, but that did not seem to change the behavior.
How can I get the side input timestamps to keep up with the main input timestamps?
Thanks!

Related

DataFlowRunner + Beam in streaming mode with a SideInput AsDict hangs

I have a simple graph that reads from a PubSub message (currently just a single string key), creates a very short window, generates 3 integers keyed by this key via a beam.ParDo, and applies a simple Map that creates a single "config" keyed by it.
Ultimately, there are 2 PCollections:
items: [('key', 0), ('key', 1), ...]
infos: [('key', 'the value is key')]
I want a final beam.Map over items that uses infos as a dictionary side input so I can look up the value in the dictionary.
Using the LocalRunner, the final print works with the side input.
On DataFlow the first two steps print, but the final Map with the side input is never called, presumably because it is somehow an unbounded window (despite the earlier window function).
I am using runner_v2, dataflow prime, and streaming engine.
p = beam.Pipeline(options=pipeline_options)

pubsub_message = (
    p
    | beam.io.gcp.pubsub.ReadFromPubSub(
        subscription='projects/myproject/testsubscription')
    | 'SourceWindow' >> beam.WindowInto(
        beam.transforms.window.FixedWindows(1e-6),
        trigger=beam.transforms.trigger.Repeatedly(
            beam.transforms.trigger.AfterCount(1)),
        accumulation_mode=beam.transforms.trigger.AccumulationMode.DISCARDING))


def _create_items(pubsub_key: bytes) -> Iterable[tuple[str, int]]:
    for i in range(3):
        yield pubsub_key.decode(), i


def _create_info(pubsub_key: bytes) -> tuple[str, str]:
    return pubsub_key.decode(), f'the value is {pubsub_key.decode()}'


items = pubsub_message | 'CreateItems' >> beam.ParDo(_create_items) | beam.Reshuffle()
info = pubsub_message | 'CreateInfo' >> beam.Map(_create_info)


def _print_item(keyed_item: tuple[str, int], info_dict: dict[str, str]) -> None:
    key, _ = keyed_item
    log(key + '::' + info_dict[key])


_ = items | 'MapWithSideInput' >> beam.Map(_print_item, info_dict=beam.pvalue.AsDict(info))
Here is the output in local runner:
Creating item 0
Creating item 1
Creating item 2
Creating info b'key'
key::the value is key
key::the value is key
key::the value is key
Here is the DataFlow graph:
I've tried various windowing functions over the AsDict, but I can never get it to be exactly the same window as my input.
Thoughts on what I might be doing wrong here?

Apache Beam blocked on unbounded side input

My question is very similar to another post: Apache Beam Cloud Dataflow Streaming Stuck Side Input.
However, I tried the resolution there (apply GlobalWindows() to the side input), and it did not seem to fix my problem.
I have a Dataflow pipeline (but I'm using DirectRunner for debug) with Python SDK where the main input are logs from PubSub and the side input is associated data from a mostly unchanging database. I would like to join the two such that each log is paired with side input data from the same approximate time. Excess side inputs without an associated log can be dropped.
The behavior I see is that the pipeline seems to be operating as a single thread. It processes the all side input elements first, then starts processing the main input elements. If the side input is bounded (non-streaming), this is fine, and the pipeline can merge inputs and run to completion. If the side input is unbounded (streaming), however, the main input is blocked indefinitely while apparently waiting for the side input processing to finish.
To illustrate the behavior, I made simplified test case below.
class Logger(apache_beam.DoFn):
    def __init__(self, name):
        self._name = name

    def process(self, element, w=apache_beam.DoFn.WindowParam,
                ts=apache_beam.DoFn.TimestampParam):
        logging.error('%s: %s', self._name, element)
        yield element


def cross_join(left, rights):
    for right in rights:
        yield (left, right)


def main():
    start = timestamp.Timestamp.now()
    # Bounded side inputs work OK.
    stop = start + 20
    # Unbounded side inputs appear to block execution of main input
    # processing.
    #stop = timestamp.MAX_TIMESTAMP
    side_interval = 5
    main_interval = 1

    side_input = (
        pipeline
        | PeriodicImpulse(
            start_timestamp=start,
            stop_timestamp=stop,
            fire_interval=side_interval,
            apply_windowing=True)
        | apache_beam.Map(lambda x: ('side', x))
        | apache_beam.ParDo(Logger('side_input'))
    )
    main_input = (
        pipeline
        | PeriodicImpulse(
            start_timestamp=start, stop_timestamp=stop,
            fire_interval=main_interval, apply_windowing=True)
        | apache_beam.Map(lambda x: ('main', x))
        | apache_beam.ParDo(Logger('main_input'))
        | 'CrossJoin' >> apache_beam.FlatMap(
            cross_join, rights=apache_beam.pvalue.AsIter(side_input))
        | 'CrossJoinLogger' >> apache_beam.ParDo(Logger('cross_join_output'))
    )
    pipeline.run()
Am I missing something that is preventing the main input from being processed in parallel with the side input?
The main input can advance only when the watermark has passed the end of the corresponding side input's window. See details in the programming guide. You likely need to window both the main and side inputs, and make sure PeriodicImpulse is correctly advancing the watermark.
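For illustration, here is a minimal sketch of that idea (my own, not from the original posts; the intervals, window size, and bounded stop time are arbitrary assumptions). Both inputs are windowed into the same fixed windows so the side input windows can close and the main input can advance:
import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.transforms.periodicsequence import PeriodicImpulse
from apache_beam.utils import timestamp

WINDOW_SIZE = 5  # seconds; the same fixed window is applied to both inputs.


def cross_join(left, rights):
    for right in rights:
        yield (left, right)


with beam.Pipeline() as pipeline:
    start = timestamp.Timestamp.now()
    stop = start + 60  # bounded, for a quick local test

    side_input = (
        pipeline
        | 'SideImpulse' >> PeriodicImpulse(
            start_timestamp=start, stop_timestamp=stop,
            fire_interval=5, apply_windowing=False)
        | 'SideWindow' >> beam.WindowInto(window.FixedWindows(WINDOW_SIZE)))

    main_input = (
        pipeline
        | 'MainImpulse' >> PeriodicImpulse(
            start_timestamp=start, stop_timestamp=stop,
            fire_interval=1, apply_windowing=False)
        | 'MainWindow' >> beam.WindowInto(window.FixedWindows(WINDOW_SIZE))
        | 'CrossJoin' >> beam.FlatMap(
            cross_join, rights=beam.pvalue.AsIter(side_input)))
Because both PCollections share the same fixed windows, each side input window only has to wait for the watermark to pass the end of that window before the corresponding main input elements can be processed.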
Using the example from stackoverflow.com/q/70561769 I was able to get the side input and main input working concurrently as expected for certain cases. The answer there was to apply GlobalWindows() to the side_input.
side_input = (
    pipeline
    | PeriodicImpulse(fire_interval=300, apply_windowing=False)
    | "ApplyGlobalWindow" >> WindowInto(
        window.GlobalWindows(),
        trigger=trigger.Repeatedly(trigger.AfterProcessingTime(5)),
        accumulation_mode=trigger.AccumulationMode.DISCARDING)
    | ...
)
Based on experimentation, my conclusion is that there are cases where PeriodicImpulse on the side input causes the main input to block, such as the following:
Case 1 (GOOD): GlobalWindow; main input = PubSub, side input = PeriodicImpulse
Case 2 (BAD): FixedWindow; main input = PubSub, side input = PeriodicImpulse
Case 3 (BAD): GlobalWindow / FixedWindow; main input = PeriodicImpulse, side input = PeriodicImpulse
Case 4 (GOOD): FixedWindow; main input = PubSub, side input = PubSub
My problem now is that the side input timestamps are not aligning with the main input properly (stackoverflow.com/q/72382440).

Test a step that yields the same instance multiple times

We have a step that splits up PubSub messages on newlines in Dataflow. We have a test that passes for the code, but it seems to fail in production. It looks like we get the same PubSub message in multiple places in the pipeline at once (to the best of my knowledge at least).
Should we have written the first test in another way? Or is this just a hard lesson learned about what not to do in Apache Beam?
import apache_beam as beam
from apache_beam.io import PubsubMessage
from apache_beam.testing.test_pipeline import TestPipeline
from apache_beam.testing.util import assert_that, equal_to
import unittest


class SplitUpBatches(beam.DoFn):
    def process(self, msg):
        bodies = msg.data.split('\n')
        for body in bodies:
            msg.data = body.strip()
            yield msg


class TestSplitting(unittest.TestCase):
    body = """
first
second
third
""".strip()

    def test_incorrectly_passing(self):
        """Incorrectly passing"""
        msg = PubsubMessage(self.body, {})
        with TestPipeline() as p:
            assert_that(
                p
                | beam.Create([msg])
                | "split up batches" >> beam.ParDo(SplitUpBatches())
                | "map to data" >> beam.Map(lambda m: m.data),
                equal_to(['first', 'second', 'third']))

    def test_correctly_failing(self):
        """Failing, but not using a TestPipeline"""
        msg = PubsubMessage(self.body, {})
        messages = list(SplitUpBatches().process(msg))
        bodies = [m.data for m in messages]
        self.assertEqual(bodies, ['first', 'second', 'third'])
        # => AssertionError: ['third', 'third', 'third'] != ['first', 'second', 'third']
TL;DR: Yes, this is an example of what not to do in Beam: reusing (mutating) your element objects.
In fact, Beam discourages mutating inputs and outputs of your transforms, because Beam passes/buffers those objects in various ways that can be affected if you mutate them.
The recommendation here is to create a new PubsubMessage instance for each output.
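For example, a minimal sketch of that recommendation (reusing the SplitUpBatches DoFn from the question) that yields a fresh PubsubMessage per line instead of mutating the input:
import apache_beam as beam
from apache_beam.io import PubsubMessage


class SplitUpBatches(beam.DoFn):
    def process(self, msg):
        for body in msg.data.split('\n'):
            # Build a new message per line (copying the attributes)
            # instead of mutating the incoming `msg`.
            yield PubsubMessage(body.strip(), dict(msg.attributes))
With this change, both the TestPipeline assertion and the direct call to process() should see ['first', 'second', 'third'], regardless of how the runner buffers and serializes elements.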
Detailed explanation
This happens due to the ways in which Beam serializes and passes data around.
You may know that Beam executes several steps together in a single worker - what we call stages. Your pipeline does something like this:
read_data -> split_up_batches -> serialize all data -> perform assert
This intermediate "serialize data" step is an implementation detail. The reason is that for Beam's assert_that we gather all the data of a single PCollection onto a single machine and perform the assert (thus we need to serialize all elements and send them over to that machine). We do this with a GroupByKey operation.
When the DirectRunner receives the first yield of a PubsubMessage('first'), it serializes it and transfers it to a GroupByKey immediately - so you get the 'first', 'second', 'third' result - because serialization happens immediately.
When the DataflowRunner receives the first yield of a PubsubMessage('first'), it buffers it, and sends over a batch of elements. You get the 'third', 'third', 'third' result, because serialization happens after a buffer is transmitted over - and your original PubsubMessage instance has been overwritten.

Google Dataflow Python Apache Beam Windowing delay issue

I have a simple pipeline that receives data from PubSub, prints it, and then every 10 seconds fires a window into a GroupByKey and prints that message again.
However, this window sometimes seems to be delayed. Is this a Google limitation, or is there something wrong with my code?
with beam.Pipeline(options=pipeline_options) as pipe:
    messages = (
        pipe
        | beam.io.ReadFromPubSub(
            subscription=known_args.input_subscription).with_output_types(bytes)
        | 'decode' >> beam.Map(lambda x: x.decode('utf-8'))
        | 'Ex' >> beam.ParDo(ExtractorAndPrinter())
        | beam.WindowInto(
            window.FixedWindows(10),
            allowed_lateness=0,
            accumulation_mode=AccumulationMode.DISCARDING,
            trigger=AfterProcessingTime(10))
        | 'group' >> beam.GroupByKey()
        | 'PRINTER' >> beam.ParDo(PrinterWorker()))
Edit with the most recent code. I removed the triggers, but the problem persists:
class ExtractorAndCounter(beam.DoFn):
    def __init__(self):
        beam.DoFn.__init__(self)

    def process(self, element, *args, **kwargs):
        import logging
        logging.info(element)
        return [("Message", json.loads(element)["Message"])]


class PrinterWorker(beam.DoFn):
    def __init__(self):
        beam.DoFn.__init__(self)

    def process(self, element, *args, **kwargs):
        import logging
        logging.info(element)
        return [str(element)]


class DefineTimestamp(beam.DoFn):
    def process(self, element, *args, **kwargs):
        from datetime import datetime
        return [(str(datetime.now()), element)]


def run(argv=None, save_main_session=True):
    """Build and run the pipeline."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--output_topic',
        required=True,
        help=(
            'Output PubSub topic of the form '
            '"projects/<PROJECT>/topics/<TOPIC>".'))
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument(
        '--input_topic',
        help=(
            'Input PubSub topic of the form '
            '"projects/<PROJECT>/topics/<TOPIC>".'))
    group.add_argument(
        '--input_subscription',
        help=(
            'Input PubSub subscription of the form '
            '"projects/<PROJECT>/subscriptions/<SUBSCRIPTION>."'))
    known_args, pipeline_args = parser.parse_known_args(argv)
    pipeline_options = PipelineOptions(pipeline_args)
    pipeline_options.view_as(SetupOptions).save_main_session = save_main_session
    pipeline_options.view_as(StandardOptions).streaming = True
    with beam.Pipeline(options=pipeline_options) as pipe:
        messages = (
            pipe
            | beam.io.ReadFromPubSub(
                subscription=known_args.input_subscription).with_output_types(bytes)
            | 'decode' >> beam.Map(lambda x: x.decode('utf-8'))
            | 'Ex' >> beam.ParDo(ExtractorAndCounter())
            | beam.WindowInto(window.FixedWindows(10))
            | 'group' >> beam.GroupByKey()
            | 'PRINTER' >> beam.ParDo(PrinterWorker())
            | 'encode' >> beam.Map(lambda x: x.encode('utf-8'))
            | beam.io.WriteToPubSub(known_args.output_topic))


if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    run()
So what this basically asks the pipeline to do is group elements into 10-second windows and fire each window after 10 seconds have passed since the first element was received for that window (and discard the rest of the data for that window). Was that your intention?
Assuming this was the case, note that triggering depends on the time elements were received by the system as well as the time the first element is received for each window. That is probably why you are seeing some variation in your results.
I think if you need more consistent grouping for your elements you should use event time triggers instead of processing time triggers.
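For illustration, here is a minimal sketch (mine, not from the answer) of the same 10-second fixed windows driven by the default event-time trigger (AfterWatermark) instead of AfterProcessingTime:
import apache_beam as beam
from apache_beam.transforms import trigger, window


def window_by_event_time(messages):
    """Group keyed elements into 10-second event-time windows that fire on the watermark."""
    return (
        messages
        | 'window' >> beam.WindowInto(
            window.FixedWindows(10),
            trigger=trigger.AfterWatermark(),
            accumulation_mode=trigger.AccumulationMode.DISCARDING,
            allowed_lateness=0)
        | 'group' >> beam.GroupByKey())
With AfterWatermark the pane fires when the watermark passes the end of the window, so the firing time tracks event time rather than when elements happened to arrive at the system.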
All the triggers are best effort, which means they will fire sometime after the specified duration, 10 seconds in this case. Generally this happens close to the specified time, but a delay of a few seconds is possible.
Also, the triggers are set for Key+Window. The window is derived from event time.
It is possible that the first GBK print at 10:30:04 is due to the first element, which was at 10:29:52.
The 2nd GBK print at 10:30:07 is due to the first element at 10:29:56.
So it would be good to print the window and event timestamp for each element and then correlate the data.
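For example, a small logging DoFn (a sketch, not from the answer) that prints the window and event timestamp alongside each element:
import logging

import apache_beam as beam


class WindowAndTimestampLogger(beam.DoFn):
    """Log each element together with its window and event timestamp."""

    def process(self, element,
                w=beam.DoFn.WindowParam,
                ts=beam.DoFn.TimestampParam):
        logging.info('element=%s window=%s timestamp=%s', element, w, ts)
        yield element
Placing it both before and after the GroupByKey makes it easy to correlate each fired pane with the window and the element timestamps that produced it.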

Shift in the columns of spool file

I am using a shell script to extract the data from the 'extr' table. The extr table is a very big table with 410 columns and 61047 rows of data; the size of one record is around 5 KB.
The script is as follows:
#!/usr/bin/ksh
sqlplus -s \/ << rbb
set pages 0
set head on
set feed off
set num 20
set linesize 32767
set colsep |
set trimspool on
spool extr.csv
select * from extr;
/
spool off
rbb
#-------- END ---------
One fine day the extr.csv file had 2 records with an incorrect number of columns (i.e. one record with more columns and the other with fewer). Upon investigation I found that two duplicate records were repeated in the file. The primary key of the records should ideally be unique in the file, but in this case 2 records were repeated. Also, the shift in the columns was abrupt.
Small example of the output file:
5001|A1A|AAB|190.00|105|A
5002|A2A|ABB|180.00|200|F
5003|A3A|AAB|153.33|205|R
5004|A4A|ABB|261.50|269|F
5005|A5A|AAB|243.00|258|G
5006|A6A|ABB|147.89|154|H
5003|A7A|AAB|249.67|AAB|153.33|205|R
5004|A8A|269|F
5009|A9A|AAB|368.00|358|S
5010|AAA|ABB|245.71|215|F
Here the primary key records for 5003 and 5004 have reappeared in place of 5007 and 5008. The duplicate records have also shifted the records of 5007 and 5008 by appending/cutting off their columns.
I need your help analysing why this happened: why were the 2 rows extracted multiple times, why were the other 2 rows missing from the file, and why were the records shifted?
Note: This script has been working fine for the last two years and has never failed except for this one time (mentioned above). It ran successfully during the next run. Recently we added one more program which accesses the extr table with a cursor (select only).
I reproduced a similar behaviour.
;-> cat input
5001|A1A|AAB|190.00|105|A
5002|A2A|ABB|180.00|200|F
5003|A3A|AAB|153.33|205|R
5004|A4A|ABB|261.50|269|F
5005|A5A|AAB|243.00|258|G
5006|A6A|ABB|147.89|154|H
5009|A9A|AAB|368.00|358|S
5010|AAA|ABB|245.71|215|F
See the input file as your database.
Now I write a script that accesses "the database" and shows some random freezes.
;-> cat writeout.sh
# Start this script twice
while IFS=\| read a b c d e f; do
    # Print the first fields without a trailing newline
    # (you could also use \c in ksh's echo; tr -d is used here instead).
    echo "$a|$b|$c|$d|" | tr -d "\n"
    (( sleeptime = RANDOM % 5 ))
    sleep ${sleeptime}
    echo "$e|$f"
done < input >> output
EDIT: Removed cat input | in script above, replaced by < input
Start this script twice in the background
;-> ./writeout.sh &
;-> ./writeout.sh &
Wait until both jobs are finished and see the result
;-> cat output
5001|A1A|AAB|190.00|105|A
5002|A2A|ABB|180.00|200|F
5003|A3A|AAB|153.33|5001|A1A|AAB|190.00|105|A
5002|A2A|ABB|180.00|205|R
5004|A4A|ABB|261.50|269|F
5005|A5A|AAB|243.00|200|F
5003|A3A|AAB|153.33|258|G
5006|A6A|ABB|147.89|154|H
5009|A9A|AAB|368.00|358|S
5010|AAA|ABB|245.71|205|R
5004|A4A|ABB|261.50|269|F
5005|A5A|AAB|243.00|258|G
5006|A6A|ABB|147.89|215|F
154|H
5009|A9A|AAB|368.00|358|S
5010|AAA|ABB|245.71|215|F
When I edit the last line of writeout.sh into done > output I do not see the problem, but that might be due to buffering and the small amount of data.
I still don't know exactly what happened in your case, but it really seems like 2 programs were writing simultaneously to the same file.
A job in TWS could have been restarted manually, 2 scripts in your master script might write to the same file, or something else.
Preventing this in the future can be done with some locking / checks (e.g. when the output file already exists, quit and return an error code to TWS).
