I have a Pub/Sub topic + subscription and want to consume and aggregate the unbounded data from the subscription in a Dataflow. I use a fixed window and write the aggregates to BigQuery.
Reading and writing (without windowing and aggregation) works fine. But when I pipe the data into a fixed window (to count the elements in each window) the window is never triggered. And thus the aggregates are not written.
Here is my word publisher (it uses kinglear.txt from the examples as input file):
public static class AddCurrentTimestampFn extends DoFn<String, String> {
@ProcessElement public void processElement(ProcessContext c) {
c.outputWithTimestamp(c.element(), new Instant(System.currentTimeMillis()));
}
}
public static class ExtractWordsFn extends DoFn<String, String> {
@ProcessElement public void processElement(ProcessContext c) {
String[] words = c.element().split("[^a-zA-Z']+");
for (String word:words){ if(!word.isEmpty()){ c.output(word); }}
}
}
// main:
Pipeline p = Pipeline.create(o); // 'o' are the pipeline options
p.apply("ReadLines", TextIO.Read.from(o.getInputFile()))
.apply("Lines2Words", ParDo.of(new ExtractWordsFn()))
.apply("AddTimestampFn", ParDo.of(new AddCurrentTimestampFn()))
.apply("WriteTopic", PubsubIO.Write.topic(o.getTopic()));
p.run();
Here is my windowed word counter:
Pipeline p = Pipeline.create(o); // 'o' are the pipeline options
BigQueryIO.Write.Bound tablePipe = BigQueryIO.Write.to(o.getTable(o))
.withSchema(o.getSchema())
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND);
Window.Bound<String> w = Window
.<String>into(FixedWindows.of(Duration.standardSeconds(1)));
p.apply("ReadTopic", PubsubIO.Read.subscription(o.getSubscription()))
.apply("FixedWindow", w)
.apply("CountWords", Count.<String>perElement())
.apply("CreateRows", ParDo.of(new WordCountToRowFn()))
.apply("WriteRows", tablePipe);
p.run();
The above subscriber will not work, since the window does not seem to trigger using the default trigger. However, if I manually define a trigger the code works and the counts are written to BigQuery.
Window.Bound<String> w = Window.<String>into(FixedWindows.of(Duration.standardSeconds(1)))
.triggering(AfterProcessingTime
.pastFirstElementInPane()
.plusDelayOf(Duration.standardSeconds(1)))
.withAllowedLateness(Duration.ZERO)
.discardingFiredPanes();
I would like to avoid specifying custom triggers if possible.
Questions:
Why does my solution not work with Dataflow's default trigger?
How do I have to change my publisher or subscriber to trigger windows using the default trigger?
How are you determining the trigger never fires?
Your PubsubIO.Write and PubsubIO.Read transforms should both specify a timestamp label using withTimestampLabel; otherwise the timestamps you've added will not be written to PubSub and the publish times will be used instead.
Either way, the input watermark of the pipeline will be derived from the timestamps of the elements waiting in PubSub. Once all inputs have been processed, it will stay back for a few minutes (in case there was a delay in the publisher) before advancing to real time.
What you are likely seeing is that all the elements are published in the same ~1 second window (since the input file is pretty small). These are all read and processed relatively quickly, but the 1-second window they are put in will not trigger until after the input watermark has advanced, indicating that all data in that 1-second window has been consumed.
This won't happen for several minutes, which may make it look like the trigger isn't working. The trigger you wrote fires after 1 second of processing time, which is much earlier, but it gives no guarantee that all the data for the window has been processed.
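To make the timing concrete, here is a minimal plain-Java simulation of the behavior described above. It is not Beam code: the class and method names are hypothetical, and it assumes the default trigger emits a fixed window once the watermark reaches the end of that window. The real runner's watermark logic is more involved.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical simulation of fixed windows with a watermark-driven trigger; not Beam code.
class DefaultTriggerSketch {
    // Start of the fixed window (of the given size, in ms) that contains the timestamp.
    static long windowStart(long timestampMs, long sizeMs) {
        return timestampMs - Math.floorMod(timestampMs, sizeMs);
    }

    // Assumption: the window fires once the watermark reaches the window end.
    static boolean fires(long windowStartMs, long sizeMs, long watermarkMs) {
        return watermarkMs >= windowStartMs + sizeMs;
    }

    // Counts elements per window and keeps only the windows that have fired.
    static Map<Long, Integer> firedCounts(List<Long> timestamps, long sizeMs, long watermarkMs) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long ts : timestamps) {
            counts.merge(windowStart(ts, sizeMs), 1, Integer::sum);
        }
        counts.keySet().removeIf(start -> !fires(start, sizeMs, watermarkMs));
        return counts;
    }

    public static void main(String[] args) {
        // All elements land in the same 1-second window.
        List<Long> timestamps = Arrays.asList(10_100L, 10_400L, 10_900L);
        // Watermark still held back: nothing is emitted yet.
        System.out.println(firedCounts(timestamps, 1_000L, 10_500L)); // {}
        // Watermark has reached the window end: the count is emitted.
        System.out.println(firedCounts(timestamps, 1_000L, 11_000L)); // {10000=3}
    }
}
```

In this sketch the elements are processed immediately, but the count only appears once the watermark argument moves past the window end, which matches the delay of several minutes seen in the job.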
Steps to get better behavior from the default trigger:
Use withTimestampLabel on both the write and read pubsub steps.
Have the publisher spread the timestamps out further (e.g., run for several minutes and spread the timestamps across that range).
I have an Apache Beam pipeline which runs on Google Cloud Dataflow. This is a streaming pipeline which receives input messages from Google Cloud PubSub; the messages are basically JSON arrays of elements to process.
Roughly speaking, the pipeline has these steps:
Deserializes the message into a PCollection<List<T>>.
Splits (or explodes) the array into a PCollection<T>.
A few processing steps: some elements will finish before other elements, and some elements are cached so they simply skip to the end without much processing at all.
Flatten all outputs and apply a GroupByKey (this is the problem step): it transforms the PCollection back into a PCollection<List<T>> but it doesn't wait for all the elements.
Serialize to publish a PubSub Message.
I cannot get the last GroupByKey to group all elements that were received together. The published message doesn't contain the elements that had to be processed and took longer than those which skipped to the end.
I think this would be straightforward to solve if I could write a custom data-driven trigger, or even if I could dynamically set the trigger AfterPane.elementCountAtLeast() from a customized WindowFn.
It doesn't seem that I can make a custom trigger. But is it possible to somehow dynamically set the trigger for each window?
--
Here is a simplified version of the pipeline I am working on.
I have simplified the input from an array of objects T into a simple array of Integer. I have simulated the keys (or IDs) for these integers. Normally they would be part of the objects.
I also simplified the slow processing step (which really is several steps) into a single step with an artificial delay.
(complete example gist https://gist.github.com/naringas/bfc25bcf8e7aca69f74de719d75525f2 )
PCollection<String> queue = pipeline
.apply("ReadQueue", PubsubIO.readStrings().fromTopic(topic))
.apply(Window
.<String>into(FixedWindows.of(Duration.standardSeconds(1)))
.withAllowedLateness(Duration.standardSeconds(3))
.triggering(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardSeconds(2)))
.discardingFiredPanes());
TupleTag<List<KV<Integer, Integer>>> tagDeserialized = new TupleTag<List<KV<Integer, Integer>>>() {};
TupleTag<Integer> tagDeserializeError = new TupleTag<Integer>() {};
PCollectionTuple imagesInputTuple = queue
.apply("DeserializeJSON", ParDo.of(new DeserializingFn()).withOutputTags(tagDeserialized, TupleTagList.of(tagDeserializeError)));
/*
This is where I think that I must adjust the custom window strategy, set the customized dynamic-trigger
*/
PCollection<KV<Integer, Integer>> images = imagesInputTuple.get(tagDeserialized)
/* I have tried many things
.apply(Window.<List<KV<Integer, Integer>>>into(new GlobalWindows()))
*/
.apply("Flatten into timestamp", ParDo.of(new DoFn<List<KV<Integer, Integer>>, KV<Integer, Integer>>() {
// Flatten and output into same ts
// like Flatten.Iterables() but I set the output window
@ProcessElement
public void processElement(@Element List<KV<Integer, Integer>> input, OutputReceiver<KV<Integer, Integer>> out, @Timestamp Instant ts, BoundedWindow w, PaneInfo p) {
Instant timestamp = w.maxTimestamp();
for (KV<Integer, Integer> el : input) {
out.outputWithTimestamp(el, timestamp);
}
}
}))
.apply(Window.<KV<Integer, Integer>>into(new GlobalWindows()));
TupleTag<KV<Integer, Integer>> tagProcess = new TupleTag<KV<Integer, Integer>>() {};
TupleTag<KV<Integer, Integer>> tagSkip = new TupleTag<KV<Integer, Integer>>() {};
PCollectionTuple preproc = images
.apply("PreProcessingStep", ParDo.of(new SkipOrNotDoFn()).withOutputTags(tagProcess, TupleTagList.of(tagSkip)));
TupleTag<KV<Integer, Integer>> tagProcessed = new TupleTag<KV<Integer, Integer>>() {};
TupleTag<KV<Integer, Integer>> tagError = new TupleTag<KV<Integer, Integer>>() {};
PCollectionTuple processed = preproc.get(tagProcess)
.apply("ProcessingStep", ParDo.of(new DummyDelasyDoFn()).withOutputTags(tagProcessed, TupleTagList.of(tagError)));
/* Here, at the "end"
the elements get grouped back
first: join into a PCollectionList and flatten it
second: GroupByKey which should but doesn't wait for all elements
lastly: serialize and publish (in this case just print out)
*/
PCollection end = PCollectionList.of(preproc.get(tagSkip)).and(processed.get(tagProcessed))
.apply("FlattenUpsert", Flatten.pCollections())
//
.apply("GroupByParentId", GroupByKey.create())
.apply("GroupedValues", Values.create())
.apply("PublishSerialize", ParDo.of(
new DoFn<Object, String>() {
@ProcessElement
public void processElement(ProcessContext pc) {
String output = GSON.toJson(pc.element());
LOG.info("DONE: {}", output);
pc.output(output);
}
}));
// "send the string to pubsub" goes here
I played around a little bit with stateful pipelines. Since you'd like to use data-driven triggers or AfterPane.elementCountAtLeast(), I assume you know the number of elements that make up the message (or, at least, that it does not change per key), so I defined NUM_ELEMENTS = 10 in my case.
The main idea of my approach is to keep track of the number of elements that I have seen so far for a particular key. Notice that I had to merge the PreProcessingStep and ProcessingStep into a single one for an accurate count. I understand this is just a simplified example so I don't know how that would translate to the real scenario.
In the stateful ParDo I defined two state variables, one BagState with all integers seen and a ValueState to count the number of errors:
// A state bag holding all elements seen for that key
@StateId("elements_seen")
private final StateSpec<BagState<Integer>> elementSpec =
StateSpecs.bag();
// A state cell holding error count
@StateId("errors")
private final StateSpec<ValueState<Integer>> errorSpec =
StateSpecs.value(VarIntCoder.of());
Then we process each element as usual but we don't output anything yet unless it's an error. In that case we update the error counter before emitting the element to the tagError side output:
errors.write(firstNonNull(errors.read(), 0) + 1);
is_error = true;
output.get(tagError).output(input);
We update the count and, for successfully processed or skipped elements (i.e. !is_error), write the new observed element into the BagState:
int count = firstNonNull(Iterables.size(state.read()), 0) + firstNonNull(errors.read(), 0);
if (!is_error) {
state.add(input.getValue());
count += 1;
}
Then, if the sum of successfully processed elements and errors is equal to NUM_ELEMENTS (we are simulating a data-driven trigger here), we flush all the items from the BagState:
if (count >= NUM_ELEMENTS) {
Iterable<Integer> all_elements = state.read();
Integer key = input.getKey();
for (Integer value : all_elements) {
output.get(tagProcessed).output(KV.of(key, value));
}
}
Note that here we can already group the values and emit just a single KV<Integer, Iterable<Integer>> instead. I just made a for loop instead to avoid changing other steps downstream.
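The counting logic can be tried out without a runner. The sketch below is a plain-Java stand-in for the per-key state, not the Beam code from the answer: the class name, the two maps that play the role of the BagState and ValueState, and the step that clears state after a flush are all assumptions added for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical plain-Java stand-in for the stateful ParDo; not Beam code.
class CountingBufferSketch {
    private final int expected;                                        // plays the role of NUM_ELEMENTS
    private final Map<Integer, List<Integer>> seen = new HashMap<>();  // stands in for the BagState
    private final Map<Integer, Integer> errors = new HashMap<>();      // stands in for the ValueState

    CountingBufferSketch(int expected) {
        this.expected = expected;
    }

    // Records one element. Returns the buffered values once the key is complete, else an empty list.
    List<Integer> accept(int key, int value, boolean isError) {
        List<Integer> buffer = seen.computeIfAbsent(key, k -> new ArrayList<>());
        if (isError) {
            errors.merge(key, 1, Integer::sum);
        } else {
            buffer.add(value);
        }
        int count = buffer.size() + errors.getOrDefault(key, 0);
        if (count < expected) {
            return Collections.emptyList();
        }
        // Complete: hand back the buffer and clear the state for this key (an added assumption).
        seen.remove(key);
        errors.remove(key);
        return buffer;
    }

    public static void main(String[] args) {
        CountingBufferSketch sketch = new CountingBufferSketch(10);
        List<Integer> flushed = Collections.emptyList();
        for (int value = 1; value <= 10; value++) {
            flushed = sketch.accept(0, value, value == 7); // element 7 simulates an error
        }
        System.out.println(flushed); // [1, 2, 3, 4, 5, 6, 8, 9, 10]
    }
}
```

Nothing is emitted for a key until successes plus errors reach the expected count, which is what makes this behave like a data-driven trigger.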
With this, I publish a message such as:
gcloud pubsub topics publish streamdemo --message "[1,2,3,4,5,6,7,8,9,10]"
And where before I got:
INFO: DONE: [4,8]
Now I get:
INFO: DONE: [1,2,3,4,5,6,8,9,10]
Element 7 is absent because it is the one that simulates an error.
Tested with DirectRunner and 2.16.0 SDK. Full code here.
Let me know if that works for your use case, keep in mind that I only did some minor tests.
I'm trying to understand Beam/Dataflow concepts better, so pretend I have the following streaming pipeline:
pipeline
.apply(PubsubIO.readStrings().fromSubscription("some-subscription"))
.apply(ParDo.of(new DoFn<String, String>() {
@ProcessElement
public void processElement(ProcessContext c) {
String message = c.element();
LOGGER.debug("Got message: {}", message);
c.output(message);
}
}));
How often will the unbounded source pull messages from the subscription? Is this configurable at all (potentially based on windows/triggers)?
Since no custom windowing/triggers have been defined, and there are no sinks (just a ParDo that logs + re-outputs the message), will my ParDo still be executed immediately as messages are received, and is that setup problematic in any way (not having any windows/triggers/sinks defined)?
It will pull messages from the subscription continuously - as soon as a message arrives, it will be processed immediately (modulo network and RPC latency).
Windowing and triggers do not affect this at all - they only affect how the data gets grouped at grouping operations (GroupByKey and Combine). If your pipeline doesn't have grouping operations, windowing and triggers are basically a no-op.
The Beam model does not have the concept of a sink - writing to various storage systems (e.g. writing files, writing to BigQuery etc) is implemented as regular Beam composite transforms, made of ParDo and GroupByKey like anything else. E.g. writing each element to its own file could be implemented by a ParDo whose @ProcessElement opens the file, writes the element to it and closes the file.
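As a rough illustration of the file-per-element idea, the following plain-Java sketch shows the per-element work such a ParDo might do. It is not a Beam transform: the class name, the file naming scheme, and the use of a temporary directory are assumptions, and a real DoFn would also need to cope with retries.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of per-element file writing; not a Beam transform.
class FilePerElementSketch {
    // Mirrors what a per-element processing method might do: open, write, close.
    static Path writeElement(Path directory, int index, String element) {
        Path file = directory.resolve("element-" + index + ".txt");
        try {
            // Files.write opens the file, writes the bytes, and closes it again.
            Files.write(file, element.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return file;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("sink-sketch");
        List<String> elements = Arrays.asList("alpha", "beta");
        for (int i = 0; i < elements.size(); i++) {
            System.out.println("wrote " + writeElement(dir, i, elements.get(i)));
        }
    }
}
```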
I'm using Apache Beam 2.1.0 to consume simple data (key,value) from Google PubSub and group it by key so that I can process batches of data.
With the default trigger, my code after "GroupByKey" never fires (I waited 30 minutes).
If I define a custom trigger, the code is executed, but I would like to understand why the default trigger never fires. I tried to define my own timestamp with "withTimestampLabel" but got the same issue. I also tried changing the window duration (1 second, 10 seconds, 30 seconds, etc.), with the same result.
I used command line for this test to insert data
gcloud beta pubsub topics publish test A,1
gcloud beta pubsub topics publish test A,2
gcloud beta pubsub topics publish test B,1
gcloud beta pubsub topics publish test B,2
The documentation says that we need one or the other, but not necessarily both:
If you are using unbounded PCollections, you must use either
non-global windowing OR an aggregation trigger in order to perform a
GroupByKey or CoGroupByKey
It looks to be similar to
Consuming unbounded data in windows with default trigger
Scio: groupByKey doesn't work when using Pub/Sub as collection source
My code
static class Compute extends DoFn<KV<String, Iterable<Integer>>, Void> {
@ProcessElement
public void processElement(ProcessContext c) {
// Code never fires
System.out.println("KEY:" + c.element().getKey());
System.out.println("NB:" + c.element().getValue().spliterator().getExactSizeIfKnown());
}
}
public static void main(String[] args) {
Pipeline p = Pipeline.create(PipelineOptionsFactory.create());
p.apply(PubsubIO.readStrings().fromSubscription("projects/" + args[0] + "/subscriptions/test"))
.apply(Window.into(FixedWindows.of(Duration.standardMinutes(1))))
.apply(
MapElements
.into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.integers()))
.via((String row) -> {
String[] parts = row.split(",");
System.out.println(Arrays.toString(parts)); // Code fires
return KV.of(parts[0], Integer.parseInt(parts[1]));
})
)
.apply(GroupByKey.create())
.apply(ParDo.of(new Compute()));
p.run();
}
We have a Dataflow streaming job which reads from PubSub, extracts some fields and writes to Bigtable. We are observing that Dataflow's throughput drops while it is autoscaling. For example, if the job is running with 2 workers and processing 100 messages/sec, during autoscaling the rate drops, sometimes to nearly 0, and then increases to 500 messages/sec. We see this every time Dataflow scales up. This causes higher system lag during autoscaling and bigger spikes of unacknowledged messages in Pub/Sub.
Is this the expected behavior of Dataflow autoscaling, or is there a way to maintain this 100 messages/sec while it autoscales and minimize the spikes of unacknowledged messages?
(Please Note: 100 messages/sec and 500 messages/sec are just example figures)
job ID: 2017-10-23_12_29_09-11538967430775949506
I am attaching the screen shots of pub/sub stackdriver and dataflow autoscaling.
There is a drop in the number of pull requests every time Dataflow autoscales. I could not take a screenshot with timestamps, but the drop in pull requests coincides with the Dataflow autoscaling.
===========EDIT========================
We are writing to GCS in parallel using below mentioned windowing.
inputCollection.apply("Windowing",
Window.<String>into(FixedWindows.of(ONE_MINUTE))
.triggering(AfterProcessingTime.pastFirstElementInPane()
.plusDelayOf(ONE_MINUTE))
.withAllowedLateness(ONE_HOUR)
.discardingFiredPanes()
)
//Writing to GCS
.apply(TextIO.write()
.withWindowedWrites()
.withNumShards(10)
.to(options.getOutputPath())
.withFilenamePolicy(
new WindowedFileNames(options.getOutputPath())));
WindowedFileNames.java
public class WindowedFileNames extends FilenamePolicy implements OrangeStreamConstants{
/**
*
*/
private static final long serialVersionUID = 1L;
private static Logger logger = LoggerFactory.getLogger(WindowedFileNames.class);
protected final String outputPath;
public WindowedFileNames(String outputPath) {
this.outputPath = outputPath;
}
@Override
public ResourceId windowedFilename(ResourceId outputDirectory, WindowedContext context, String extension) {
IntervalWindow intervalWindow = (IntervalWindow) context.getWindow();
DateTime date = intervalWindow.maxTimestamp().toDateTime(DateTimeZone.forID("America/New_York"));
String fileName = String.format(FOLDER_EXPR, outputPath, //"orangestreaming",
DAY_FORMAT.print(date), HOUR_MIN_FORMAT.print(date) + HYPHEN + context.getShardNumber());
logger.error(fileName+"::::: File name for the current minute");
return outputDirectory
.getCurrentDirectory()
.resolve(fileName, StandardResolveOptions.RESOLVE_FILE);
}
@Override
public ResourceId unwindowedFilename(ResourceId outputDirectory, Context context, String extension) {
return null;
}
}
What is actually happening is that your throughput is decreasing first, and that is the reason that workers are scaling up.
If you look at your pipeline around 1:30am, the series of events is like so:
Around 1:23am, throughput drops. This builds up the backlog.
Around 1:28am, the pipeline unblocks and starts making progress.
Due to the large backlog, the pipeline scales up to 30 workers.
Also, if you look at the autoscaling UI, the justification for going up to 30 workers is:
"Raised the number of workers to 30 so that the pipeline can catch up
with its backlog and keep up with its input rate."
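A back-of-the-envelope calculation shows why a short stall leads to a large scale-up. The sketch below is only arithmetic, not Dataflow's autoscaling algorithm: the class and method names are hypothetical, and the rates of 100 and 500 messages/sec are the illustrative figures from the question, not measurements from this job.

```java
// Hypothetical backlog arithmetic; not Dataflow's autoscaling algorithm.
class BacklogSketch {
    // Messages accumulated while the pipeline is stalled and input keeps arriving.
    static long backlogAfterStall(long inputPerSec, long stallSeconds) {
        return inputPerSec * stallSeconds;
    }

    // Seconds needed to drain the backlog while input continues at the same rate.
    static double catchUpSeconds(long backlog, long inputPerSec, long processedPerSec) {
        if (processedPerSec <= inputPerSec) {
            return Double.POSITIVE_INFINITY; // the pipeline never catches up
        }
        return (double) backlog / (processedPerSec - inputPerSec);
    }

    public static void main(String[] args) {
        // A stall of about five minutes (roughly 1:23am to 1:28am) at 100 messages/sec.
        long backlog = backlogAfterStall(100, 5 * 60);
        System.out.println(backlog);                            // 30000
        // Draining at 500 messages/sec while 100 messages/sec keep arriving.
        System.out.println(catchUpSeconds(backlog, 100, 500));  // 75.0
    }
}
```

Under these assumed numbers the pipeline has to run well above its input rate for a while, which is consistent with seeing a burst of throughput after the workers are added.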
Hope that helps!
Background
We have a pipeline that starts by receiving messages from PubSub, each with the name of a file. These files are exploded to line level, parsed to JSON object nodes and then sent to an external decoding service (which decodes some encoded data). Object nodes are eventually converted to Table Rows and written to Big Query.
It appeared that Dataflow was not acknowledging the PubSub messages until they arrived at the decoding service. The decoding service is slow, resulting in a backlog when many messages are sent at once. This means that lines associated with a PubSub message can take some time to arrive at the decoding service. As a result, PubSub was receiving no acknowledgement and resending the message. My first attempt to remedy this was adding an attribute to each PubSub message that is passed to the reader using withIdAttribute(). However, on testing, this only prevented duplicates that arrived close together.
My second attempt was to add a fusion breaker (example) after the PubSub read. This simply performs a needless GroupByKey and then ungroups, the idea being that the GroupByKey forces Dataflow to acknowledge the PubSub message.
The Problem
The fusion breaker discussed above works in that it prevents PubSub from resending messages, but I am finding that this GroupByKey outputs more elements than it receives: See image.
To try and diagnose this I have removed parts of the pipeline to get a simple pipeline that still exhibits this behavior. The behavior remains even when
PubSub is replaced by some dummy transforms that send out a fixed list of messages with a slight delay between each one.
The Writing transforms are removed.
All Side Inputs/Outputs are removed.
The behavior I have observed is:
Some number of the received messages pass straight through the GroupByKey.
After a certain point, messages are 'held' by the GroupByKey (presumably due to the backlog after the GroupByKey).
These messages eventually exit the GroupByKey (in groups of size one).
After a short delay (about 3 minutes), the same messages exit the GroupByKey again (still in groups of size one). This may happen several times (I suspect it is proportional to the time they spend waiting to enter the GroupByKey).
Example job id is 2017-10-11_03_50_42-6097948956276262224. I have not run the pipeline on any other runner.
The Fusion Breaker is below:
@Slf4j
public class FusionBreaker<T> extends PTransform<PCollection<T>, PCollection<T>> {
@Override
public PCollection<T> expand(PCollection<T> input) {
return group(window(input.apply(ParDo.of(new PassthroughLogger<>(PassthroughLogger.Level.Info, "Fusion break in")))))
.apply("Getting iterables after breaking fusion", Values.create())
.apply("Flattening iterables after breaking fusion", Flatten.iterables())
.apply(ParDo.of(new PassthroughLogger<>(PassthroughLogger.Level.Info, "Fusion break out")));
}
private PCollection<T> window(PCollection<T> input) {
return input.apply("Windowing before breaking fusion", Window.<T>into(new GlobalWindows())
.triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
.discardingFiredPanes());
}
private PCollection<KV<Integer, Iterable<T>>> group(PCollection<T> input) {
return input.apply("Keying with random number", ParDo.of(new RandomKeyFn<>()))
.apply("Grouping by key to break fusion", GroupByKey.create());
}
private static class RandomKeyFn<T> extends DoFn<T, KV<Integer, T>> {
private Random random;
@Setup
public void setup() {
random = new Random();
}
@ProcessElement
public void processElement(ProcessContext context) {
context.output(KV.of(random.nextInt(), context.element()));
}
}
}
The PassthroughLoggers simply log the elements passing through (I use these to confirm that elements are indeed repeated, rather than there being an issue with the counts).
I suspect this is something to do with windows/triggers, but my understanding is that elements should never be repeated when .discardingFiredPanes() is used - regardless of the windowing setup. I have also tried FixedWindows with no success.
First, the Reshuffle transform is equivalent to your Fusion Breaker, but has some additional performance improvements that should make it preferable.
Second, both counters and logging may see an element multiple times if it is retried. As described in the Beam Execution Model, an element at a step may be retried if anything that is fused into it is retried.
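The difference between what a logger sees and what the pipeline commits can be illustrated with a small plain-Java simulation. This is not Beam code and does not model the runner's actual retry mechanism: the class, its fields, and the failuresBeforeSuccess parameter are hypothetical. It assumes that side effects such as logging survive a failed attempt while outputs are only kept when the attempt succeeds.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of retried fused steps; not Beam code.
class RetrySketch {
    final List<String> log = new ArrayList<>();        // side effect: survives failed attempts
    final List<String> committed = new ArrayList<>();  // output: kept only when the attempt succeeds

    // Runs a "fused" pair of steps; the first failuresBeforeSuccess attempts fail downstream.
    void process(String element, int failuresBeforeSuccess) {
        for (int attempt = 0; ; attempt++) {
            log.add("Fusion break out: " + element);   // the logging step sees the element again
            if (attempt < failuresBeforeSuccess) {
                continue;                               // a fused downstream step failed: retry
            }
            committed.add(element);                     // success: the output is committed once
            return;
        }
    }

    public static void main(String[] args) {
        RetrySketch sketch = new RetrySketch();
        sketch.process("msg-1", 2);
        System.out.println(sketch.log.size());        // 3
        System.out.println(sketch.committed.size());  // 1
    }
}
```

Here the log shows the element three times even though only one copy is committed, which is why checking the final output of the pipeline is the reliable test for duplicates.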
Have you actually observed duplicates in what is written as the output of the pipeline?