My question is very similar to another post: Apache Beam Cloud Dataflow Streaming Stuck Side Input.
However, I tried the resolution there (apply GlobalWindows() to the side input), and it did not seem to fix my problem.
I have a Dataflow pipeline (but I'm using DirectRunner for debug) with Python SDK where the main input are logs from PubSub and the side input is associated data from a mostly unchanging database. I would like to join the two such that each log is paired with side input data from the same approximate time. Excess side inputs without an associated log can be dropped.
The behavior I see is that the pipeline seems to be operating as a single thread. It processes the all side input elements first, then starts processing the main input elements. If the side input is bounded (non-streaming), this is fine, and the pipeline can merge inputs and run to completion. If the side input is unbounded (streaming), however, the main input is blocked indefinitely while apparently waiting for the side input processing to finish.
To illustrate the behavior, I made simplified test case below.
class Logger(apache_beam.DoFn):
def __init__(self, name):
self._name = name
def process(self, element, w=apache_beam.DoFn.WindowParam,
ts=apache_beam.DoFn.TimestampParam):
logging.error('%s: %s', self._name, element)
yield element
def cross_join(left, rights):
for right in rights:
yield (left, right)
def main():
start = timestamp.Timestamp.now()
# Bounded side inputs work OK.
stop = start + 20
# Unbounded side inputs appear to block execution of main input
# processing.
#stop = timestamp.MAX_TIMESTAMP
side_interval = 5
main_interval = 1
side_input = (
pipeline
| PeriodicImpulse(
start_timestamp=start,
stop_timestamp=stop,
fire_interval=side_interval,
apply_windowing=True)
| apache_beam.Map(lambda x: ('side', x))
| apache_beam.ParDo(Logger('side_input'))
)
main_input = (
pipeline
| PeriodicImpulse(
start_timestamp=start, stop_timestamp=stop,
fire_interval=main_interval, apply_windowing=True)
| apache_beam.Map(lambda x: ('main', x))
| apache_beam.ParDo(Logger('main_input'))
| 'CrossJoin' >> apache_beam.FlatMap(
cross_join, rights=apache_beam.pvalue.AsIter(side_input))
| 'CrossJoinLogger' >> apache_beam.ParDo(Logger('cross_join_output'))
)
pipeline.run()
Am missing something that is preventing main inputs from being processed in parallel with the side inputs?
The main input can advance only when the watermark has passed the corresponding side input's windowing. See details in the programming guide. You likely need to window both the main and side input, and make sure PeriodicImpulse is correctly advancing the watermark.
Using the example from stackoverflow.com/q/70561769 I was able to get the side input and main input working concurrently as expected for certain cases. The answer there was to apply GlobalWindows() to the side_input.
side_input = ( pipeline
| PeriodicImpulse(fire_interval=300, apply_windowing=False)
| "ApplyGlobalWindow" >> WindowInto(window.GlobalWindows(),
trigger=trigger.Repeatedly(trigger.AfterProcessingTime(5)),
accumulation_mode=trigger.AccumulationMode.DISCARDING)
| ...
)
Based on experimentation, my conclusion is there are cases when PeriodicImpulse on the side input causes the main input to block, such as below:
Case 1: GOOD
GlobalWindow
Main input = PubSub
Side input = PeriodicImpulse
Case 2: BAD
FixedWindow
Main input = PubSub
Side input = PeriodicImpulse
Case 3: BAD
GlobalWindow / FixedWindow
Main input = PeriodicImpulse
Side input = PeriodicImpulse
Case 4: GOOD
FixedWindow
Main input = PubSub
Side input = PubSub
My problem now is that the side input timestamps are not aligning with the main input properly stackoverflow.com/q/72382440.
Related
I have a tool that is using a Telegram chatbot to interact with its users.
Telegram limits the call rate, so I use a queue system that gets flushed at regular intervals.
So, the current code is very basic:
// flush the message queue
let flushMessageQueue() =
if not messageQueue.IsEmpty then <- messageQueue is a ConcurrentQueue
// get all the messages
let messages =
messageQueue
|> Seq.unfold(fun q ->
match q.TryDequeue () with
| true, m -> Some (m, q)
| _ -> None)
// put all the messages in a single string
let messagesString = String.Join("\n", messages)
// send the data
client.SendTextMessageAsync(chatId, messagesString, ParseMode.Markdown)
|> Async.AwaitTask
|> Async.RunSynchronously
|> ignore
this is called at regular interval, while the write is:
// broadcast message
let broadcastMessage message =
messageQueue.Enqueue(message)
printfn "%s" (message.Replace ("```", String.Empty))
But as messages became more complex, two problems came at once:
Part of the output is formatted text with simple markdown:
Some blocks of lines are wrapped between ``` sections
There are some ``` sections as well inside some lines
The text is UTF-8 and uses a bunch of symbols
Some example of text may be:
```
this is a group of lines
with one, or many many lines
```
and sometimes there are things ```like this``` as well
And... I found out that Telegram limits message size to 4kb as well
So, I thought of two things:
I can maintain a state with the open / close ``` and pull from a queue, wrap each line in triple back ticks based on the state and push into another queue that will be used to make the 4kb block.
I can keep taking messages from the re-formatted queue and aggregate them until I reach 4kb, or the end of the queue and loop around.
Is there an elegant way to do this in F#?
I remember seeing a snippet where a collection function was used to aggregate data until a certain size but it looked very inefficient as it was making a collection of line1, line1+line2, line1+line2+line3... and then picking the one with the right size.
I'm trying to use Apache Beam (via Scio) to run a continuous aggregation of the last 3 days of data (processing time) from a streaming source and output results from the earliest, active window every 5 minutes. Earliest meaning the window with the earliest start time, active meaning that the end of the window hasn't yet passed. Essentially I'm trying to get a 'rolling' aggregation by dropping the non-overlapping period between sliding windows.
A visualization of what I'm trying to accomplish with an example sliding window of size 3 days and period 1 day:
early firing - ^ no firing - x
|
** stop firing from this window once time passes this point
^ ^ ^ ^ ^ ^ ^ ^
| | | | | | | | ** stop firing from this window once time passes this point
w1: +====================+^ ^ ^
x x x x x x x | | |
w2: +====================+^ ^ ^
x x x x x x x | | |
w3: +====================+
time: ----d1-----d2-----d3-----d4-----d5-----d6-----d7---->
I've tried using sliding windows (size=3 days, period=5 min), but they produce a new window for every 3 days/5 min combination in the future and are emitting early results for every window. I tried using trigger = AfterWatermark.pastEndOfWindow(), but I need early results when the job first starts. I've tried comparing the pane data (isLast, timestamp, etc.) between windows but they seem identical.
My most recent attempt, which seems somewhat of a hack, included attaching window information to each key in a DoFn, re-windowing into a fixed window, and attempting to group and reduce to the oldest window from the attached data, but the final reduceByKey doesn't seem to output anything.
DoFn to attach window information
// ValueType is just a case class I'm using for objects
type DoFnT = DoFn[KV[String, ValueType], KV[String, (ValueType, Instant)]]
class Test extends DoFnT {
// Window.toString looks like the following:
// [2020-05-16T23:57:00.000Z..2020-05-17T00:02:00.000Z)
def parseWindow(window: String): Instant = {
Instant.parse(
window
.stripPrefix("[")
.stripSuffix(")")
.split("\\.\\.")(1))
}
#ProcessElement
def process(
context: DoFnT#ProcessContext,
window: BoundedWindow): Unit = {
context.output(
KV.of(
context.element().getKey,
(context.element().getValue, parseWindow(window.toString))
)
)
}
}
sc
.pubsubSubscription(...)
.keyBy(_.key)
.withSlidingWindows(
size = Duration.standardDays(3),
period = Duration.standardMinutes(5),
options = WindowOptions(
accumulationMode = DISCARDING_FIRED_PANES,
allowedLateness = Duration.ZERO,
trigger = Repeatedly.forever(
AfterWatermark.pastEndOfWindow()
.withEarlyFirings(
AfterProcessingTime
.pastFirstElementInPane()
.plusDelayOf(Duration.standardMinutes(1)))))))
.reduceByKey(ValueType.combineFunction())
.applyPerKeyDoFn(new Test())
.withFixedWindows(
duration = Duration.standardMinutes(5),
options = WindowOptions(
accumulationMode = DISCARDING_FIRED_PANES,
trigger = AfterWatermark.pastEndOfWindow(),
allowedLateness = Duration.ZERO))
.reduceByKey((x, y) => if (x._2.isBefore(y._2)) x else y)
.saveAsCustomOutput(
TextIO.write()...
)
Any suggestions?
First, regarding processing time: If you want to window according to processing time, you should set your event time to the processing time. This is perfectly fine - it means that the event you are processing is the event of ingesting the record, not the event that the record represents.
Now you can use sliding windows off-the-shelf to get the aggregation you want, grouped the way you want.
But you are correct that it is a bit of a headache to trigger the way you want. Triggers are not easily expressive enough to say "output the last 3 day aggregation but only begin when the window is 5 minutes from over" and even less able to express "for the first 3 day period from pipeline startup, output the whole time".
I believe a stateful ParDo(DoFn) will be your best choice. State is partitioned per key and window. Since you want to have interactions across 3 day aggregations you will need to run your DoFn in the global window and manage the partitioning of the aggregations yourself. You tagged your question google-cloud-dataflow and Dataflow does not support MapState so you will need to use a ValueState that holds a map of the active 3 day aggregations, starting new aggregations as needed and removing old ones when they are done. Separately, you can easily track the aggregation from which you want to periodically output, and have a timer callback that periodically emits the active aggregation. Something like the following pseudo-Java; you can translate to Scala and insert your own types:
DoFn<> {
#StateId("activePeriod") StateSpec<ValueState<Period>> activePeriod = StateSpecs.value();
#StateId("accumulators") StateSpec<ValueState<Map<Period, Accumulator>>> accumulators = StateSpecs.value();
#TimerId("nextPeriod") TimerSpec nextPeriod = TimerSpecs.timer(TimeDomain.EVENT_TIME);
#TimerId("output") TimerSpec outputTimer = TimerSpecs.timer(TimeDomain.EVENT_TIME);
#ProcessElement
public void process(
#Element element,
#TimerId("nextPeriod") Timer nextPeriod,
#TimerId("output") Timer output,
#StateId("activePeriod") ValueState<Period> activePeriod
#StateId("accumulators") ValueState<Map<Period, Accumulator>> accumulators) {
// Set nextPeriod if it isn't already running
// Set output if it isn't already running
// Set activePeriod if it isn't already set
// Add the element to the appropriate accumulator
}
#OnTimer("nextPeriod")
public void onNextPeriod(
#TimerId("nextPeriod") Timer nextPeriod,
#StateId("activePriod") ValueState<Period> activePeriod {
// Set activePeriod to the next one
// Clear the period we will never read again
// Reset the timer (there's a one-time change in this logic after the first window; add a flag for this)
}
#OnTimer("output")
public void onOutput(
#TimerId("output") Timer output,
#StateId("activePriod") ValueState<Period> activePeriod,
#StateId("accumulators") ValueState<MapState<Period, Accumulator>> {
// Output the current accumulator for the active period
// Reset the timer
}
}
I do have some reservations about this, because the outputs we are working so hard to suppress are not comparable to the outputs that are "replacing" them. I would be interesting in learning more about the use case. It is possible there is a more straightforward way to express the result you are interested in.
Let's say I have a pipeline, and I have a series of ParDo operations where element keys change. How can I ensure that elements for the same key some to the same worker without having to do a GroupByKey with windowing?
input_pcoll = p | beam.ReadFromXYZ(...)
rekeyed_pcoll = (input_pcoll
| beam.FlatMap(some_operation)
| beam.Map(lambda x: (compute_new_key(x), x['value'])))
After this, I would like to have elements of the same key go to the same worker without having to run a GroupByKey that uses windowing or triggering.
There are two ways to accomplish this.
The first one is by doing a GroupByKey, and having a trigger that triggers after every single element. Something like so:
keys_together_pcoll = (rekeyed_pcoll
| beam.WindowInto(window.GlobalWindows()
trigger=AfterCount(1))
| beam.GroupByKey()
| beam.FlatMap(lambda x: x[1]))
result_pcoll = (keys_together_pcoll
| beam.ParDo(DoFnWithElementsInCorrespondingWorkers()))
Granted, this is a little awkward.
Another way to do this is to make your DoFn stateful. This will force the runner to shuffle the elements into their corresponding workers by key. Something like this:
class DoFnWithElementsInCorrespondingWorkers(beam.DoFn):
UNUSED_STATE = BagStateSpec('unused', VarIntCoder())
def process(self,
element,
unused=beam.DoFn.StateParam(UNUSED_STATE)):
# .. My processing
result_pcoll = (rekeyed_pcoll
| beam.ParDo(DoFnWithElementsInCorrespondingWorkers()))
Why does this happen?
Remember that in Beam (and Flink, and similar systems), state is organized by key, so if you insert a stateful DoFn, Beam will recognize that elements need to be shuffled into the correct workers according to their keys.
I'm trying to set up a dataflow streaming pipeline in python. I have quite some experience with batch pipelines. Our basic architecture looks like this:
The first step is doing some basic processing and takes about 2 seconds per message to get to the windowing. We are using sliding windows of 3 seconds and 3 second interval (might change later so we have overlapping windows). As last step we have the SOG prediction that takes about 15ish seconds to process and which is clearly our bottleneck transform.
So, The issue we seem to face is that the workload is perfectly distributed over our workers before the windowing, but the most important transform is not distributed at all. All the windows are processed one at a time seemingly on 1 worker, while we have 50 available.
The logs show us that the sog prediction step has an output once every 15ish seconds which should not be the case if the windows would be processed over more workers, so this builds up huge latency over time which we don't want. With 1 minute of messages, we have a latency of 5 minutes for the last window. When distribution would work, this should only be around 15sec (the SOG prediction time). So at this point we are clueless..
Does anyone see if there is something wrong with our code or how to prevent/circumvent this?
It seems like this is something happening in the internals of google cloud dataflow. Does this also occur in java streaming pipelines?
In batch mode, Everything works fine. There, one could try to do a reshuffle to make sure no fusion etc occurs. But that is not possible after windowing in streaming.
args = parse_arguments(sys.argv if argv is None else argv)
pipeline_options = get_pipeline_options(project=args.project_id,
job_name='XX',
num_workers=args.workers,
max_num_workers=MAX_NUM_WORKERS,
disk_size_gb=DISK_SIZE_GB,
local=args.local,
streaming=args.streaming)
pipeline = beam.Pipeline(options=pipeline_options)
# Build pipeline
# pylint: disable=C0330
if args.streaming:
frames = (pipeline | 'ReadFromPubsub' >> beam.io.ReadFromPubSub(
subscription=SUBSCRIPTION_PATH,
with_attributes=True,
timestamp_attribute='timestamp'
))
frame_tpl = frames | 'CreateFrameTuples' >> beam.Map(
create_frame_tuples_fn)
crops = frame_tpl | 'MakeCrops' >> beam.Map(make_crops_fn, NR_CROPS)
bboxs = crops | 'bounding boxes tfserv' >> beam.Map(
pred_bbox_tfserv_fn, SERVER_URL)
sliding_windows = bboxs | 'Window' >> beam.WindowInto(
beam.window.SlidingWindows(
FEATURE_WINDOWS['goal']['window_size'],
FEATURE_WINDOWS['goal']['window_interval']),
trigger=AfterCount(30),
accumulation_mode=AccumulationMode.DISCARDING)
# GROUPBYKEY (per match)
group_per_match = sliding_windows | 'Group' >> beam.GroupByKey()
_ = group_per_match | 'LogPerMatch' >> beam.Map(lambda x: logging.info(
"window per match per timewindow: # %s, %s", str(len(x[1])), x[1][0][
'timestamp']))
sog = sliding_windows | 'Predict SOG' >> beam.Map(predict_sog_fn,
SERVER_URL_INCEPTION,
SERVER_URL_SOG )
pipeline.run().wait_until_finish()
In beam the unit of parallelism is the key--all the windows for a given key will be produced on the same machine. However, if you have 50+ keys they should get distributed among all workers.
You mentioned that you were unable to add a Reshuffle in streaming. This should be possible; if you're getting errors please file a bug at https://issues.apache.org/jira/projects/BEAM/issues . Does re-windowing into GlobalWindows make the issue with reshuffling go away?
It looks like you do not necessarily need GroupByKey because you are always grouping on the same key. Instead you could maybe use CombineGlobally to append all the elements inside the window in stead of the GroupByKey (with always the same key).
combined = values | beam.CombineGlobally(append_fn).without_defaults()
combined | beam.ParDo(PostProcessFn())
I am not sure how the load distribution works when using CombineGlobally but since it does not process key,value pairs I would expect another mechanism to do the load distribution.
Using Apache Beam I am doing computations - and if they succeed I'd like to write the output to one sink, and if there is a failure I'd like to write that to another sink.
Is there any way to handle metadata or content based routing in Apache Beam?
I've used Apache Camel extensively, and so in my mind based on the outcome of a previous transform, I should route a message to a different sink using a router (perhaps determined by a metadata flag I set on the message header). Is there an analogous capability with Apache Beam, or would I instead just have a sequential transform that inspects the PCollection and handles writing to sinks within the transform?
Ideally I'd like this logic (written verbosely for attempted clarity)
result = my_pcollections | 'compute_stuff' >> beam.Map(lambda (pcollection): my_compute_func(pcollection))
result | ([success_failure_router]
| 'sucess_sink' >> beam.io.WriteToText('/path/to/file')
| 'failure_sink' >> beam.io.WriteStringsToPubSub('mytopic'))
However.. I suspect the 'Beam' way of handling this is
result = my_pcollections | 'compute_stuff' >> beam.Map(lambda (pcollection): my_compute_func(pcollection))
result | 'write_results_appropriately' >> write_results_appropriately(result))
...
def write_results_appropriately(result):
if result == ..:
# success, write to file
else:
# failure, write to topic
Thanks,
Kevin
High-level:
I am not sure of specifics of the Python API in this case, but from high level it looks like this:
par-dos support multiple outputs;
outputs are identified by the tag you give them (e.g. "correct-elements", "invalid-elements");
in your main par-do you write to multiple outputs choosing the output using your criteria;
each output is represented by a separate PCollection;
then you get the separate PCollections representing the tagged outputs from your par-do;
then apply different sinks to each of the tagged PCollections;
In detail see the section
https://beam.apache.org/documentation/programming-guide/#additional-outputs