Dataflow pipeline is dropping events during processing when using outputWithTimestamp - google-cloud-dataflow

I have a Cloud Dataflow pipeline in which I alter the original timestamp of each event in order to simulate real-world scenarios of events arriving late. However, it appears I'm dropping some percentage of my events on each run of the pipeline. Inside my DoFn I use the following code to change the timestamp:
Instant newTimestamp = originalTimestamp.minus(Duration.standardMinutes(RANDOM.nextInt(15)));
c.outputWithTimestamp(KV.of(Integer.toString(RANDOM.nextInt(100)), element), newTimestamp);

The problem is most likely that your DoFn outputs a timestamp that is earlier than the timestamp of the element it received, minus the allowed timestamp skew. The exception that is thrown in that case can be found here in the SDK code:
https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/sdk/src/main/java/com/google/cloud/dataflow/sdk/util/DoFnRunnerBase.java#L493
This behavior is documented with regard to using outputWithTimestamp here:
https://cloud.google.com/dataflow/java-sdk/JavaDoc/com/google/cloud/dataflow/sdk/transforms/DoFn.Context#outputWithTimestamp-OutputT-org.joda.time.Instant-
While you could override the getAllowedTimestampSkew function, it is also documented that doing so can cause unpredictable issues with the watermark calculations, so it should only be used without windowing/grouping.
https://cloud.google.com/dataflow/java-sdk/JavaDoc/com/google/cloud/dataflow/sdk/transforms/DoFn#getAllowedTimestampSkew--
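For illustration, here is a minimal sketch of what that override could look like with the Dataflow 1.x SDK, assuming the element type is String and the same random backdating as in the question; treat it as a sketch rather than a recommendation, given the watermark caveat above.

import java.util.Random;
import org.joda.time.Duration;
import org.joda.time.Instant;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.values.KV;

class BackdateFn extends DoFn<String, KV<String, String>> {
  private static final Random RANDOM = new Random();

  @Override
  public Duration getAllowedTimestampSkew() {
    // Allow outputs up to 15 minutes earlier than the input element's timestamp.
    return Duration.standardMinutes(15);
  }

  @Override
  public void processElement(ProcessContext c) {
    Instant newTimestamp = c.timestamp().minus(Duration.standardMinutes(RANDOM.nextInt(15)));
    c.outputWithTimestamp(KV.of(Integer.toString(RANDOM.nextInt(100)), c.element()), newTimestamp);
  }
}

Even with the skew raised, elements that end up behind the watermark can still be dropped by later windowing or grouping steps, which is why the documentation advises against combining this with windowing.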

Related

Is there a way to make a repeatedly-forever Apache Beam trigger only execute after the previous execution is completed?

I am using a global window with a repeatedly-forever after-processing-time trigger to process streaming data from Pub/Sub, as below:
PCollection<KV<String, SMMessage>> perMSISDNLatestEvents = messages
    .apply("Apply global window", Window.<SMMessage>into(new GlobalWindows())
        .triggering(Repeatedly.forever(
            AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardMinutes(1))))
        .discardingFiredPanes())
    .apply("Convert into kv of msisdn and SM message", ParDo.of(new SmartcareMessagetoKVFn()))
    .apply("Get per MSISDN latest event", Latest.perKey())
    .apply("Write into Redis", ParDo.of(new WriteRedisFn()));
Is there a way to make a repeatedly-forever Apache Beam trigger only execute after the previous execution is completed? I ask because the next trigger firing will need to read data from Redis that was written by the previous firing.
Thank You
The trigger here fires at the interval you provided. The trigger is not aware of any downstream processing, so it cannot depend on later steps of your pipeline.
Instead of depending on the trigger for consistency here, you could add a barrier (a DoFn) before the Write step that only lets elements through once you see the previous data in Redis, as sketched below.
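A very rough sketch of such a barrier, assuming a Jedis client and a hypothetical marker key ("last-pane-marker") that your Write step would set after each firing; the key name and connection details are illustrative, not part of the original answer.

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import redis.clients.jedis.Jedis;

class WaitForPreviousPaneFn extends DoFn<KV<String, SMMessage>, KV<String, SMMessage>> {
  private transient Jedis jedis;

  @Setup
  public void setup() {
    jedis = new Jedis("redis-host", 6379); // assumed connection details
  }

  @ProcessElement
  public void processElement(ProcessContext c) throws InterruptedException {
    // Block until the marker written by the previous firing is visible in Redis.
    while (!jedis.exists("last-pane-marker")) {
      Thread.sleep(1000);
    }
    c.output(c.element());
  }

  @Teardown
  public void teardown() {
    if (jedis != null) {
      jedis.close();
    }
  }
}

Note that blocking inside processElement stalls the whole bundle, so this trades throughput for the ordering guarantee.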
You could also try declaring the global window trigger explicitly, as in the example below:
Trigger subtrigger = AfterProcessingTime.pastFirstElementInPane();
Trigger maintrigger = Repeatedly.forever(subtrigger);
I think triggers would help in your case, since they let you control when results are emitted, so the repeatedly-forever firing would only happen once the inner trigger has finished first.
I found this documentation, which might guide you on the triggers you are trying to create.

Assigning to GenericRecord the timestamp from inner object

Processing streaming events and writing files into hourly buckets is a challenge because of windowing, since some events arriving in the incoming hour can belong to previous ones.
I've been digging into Apache Beam and its triggers, but I'm struggling to manage triggering based on the event timestamp, as follows:
Window.<GenericRecord>into(FixedWindows.of(Duration.standardMinutes(1)))
    .triggering(AfterProcessingTime
        .pastFirstElementInPane()
        .plusDelayOf(Duration.standardSeconds(1)))
    .withAllowedLateness(Duration.ZERO)
    .discardingFiredPanes())
This is what I've been doing so far: triggering one-minute windows regardless of the event timestamp. However, I would like to use the timestamp carried inside the object, so that the trigger fires only for the events that fall within the window.
Window.<GenericRecord>into(FixedWindows.of(Duration.standardMinutes(1)))
    .triggering(AfterWatermark
        .pastEndOfWindow())
    .withAllowedLateness(Duration.ZERO)
    .discardingFiredPanes())
The objects I'm dealing with carry a timestamp; however, it is a long field, not an Instant.
"{ \"name\": \"timestamp\", \"type\": \"long\", \"logicalType\": \"timestamp-micros\" },"
With my POJO class using that long field nothing is triggered, but if I swap it for an Instant and recreate the object accordingly, the following error is thrown whenever a PubSub message is read.
Caused by: java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to java.lang.Long
I've also been thinking of creating a kind of wrapper class around GenericRecord that contains a timestamp, but I would only need the GenericRecord part inside it once it's ready to be written with FileIO to .parquet.
What other ways do I have to use watermark triggers?
EDIT: After @Anton's comments, I've tried the following.
.apply("Apply timestamps", WithTimestamps.of(
(SerializableFunction<GenericRecord, Instant>) item -> new Instant(Long.valueOf(item.get("timestamp").toString())))
.withAllowedTimestampSkew(Duration.standardSeconds(30)))
Even though it has been deprecated, this seems to pass through the pipeline, but the output is still not written (the elements still seem to get discarded before the write, for some reason, by the previously shown trigger?).
I also tried the other approach mentioned, using outputWithTimestamp, but due to the delay it prints the following error:
Caused by: java.lang.IllegalArgumentException: Cannot output with timestamp 2019-06-12T18:59:58.609Z. Output timestamps must be no earlier than the timestamp of the current input (2019-06-12T18:59:59.848Z) minus the allowed skew (0 milliseconds). See the DoFn#getAllowedTimestampSkew() Javadoc for details on changing the allowed skew.
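For reference (not from the original thread), a minimal sketch of the skew override that the error message points to, applied while pulling the event time out of the Avro long field; the field name "timestamp" and the 30-second skew are assumptions, and getAllowedTimestampSkew is deprecated in Beam just like withAllowedTimestampSkew.

import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.transforms.DoFn;
import org.joda.time.Duration;
import org.joda.time.Instant;

class AssignEventTimeFn extends DoFn<GenericRecord, GenericRecord> {
  @Override
  public Duration getAllowedTimestampSkew() {
    // Permit outputs up to 30 seconds behind the incoming (Pub/Sub) timestamp.
    return Duration.standardSeconds(30);
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    long micros = (Long) c.element().get("timestamp"); // timestamp-micros logical type, stored as long
    c.outputWithTimestamp(c.element(), new Instant(micros / 1000L));
  }
}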

Can Dataflow sideInput be updated per window by reading a gcs bucket?

I'm currently creating a PCollectionView by reading filtering information from a GCS bucket and passing it as a side input to different stages of my pipeline in order to filter the output. If the file in the GCS bucket changes, I want the currently running pipeline to use the new filter info. Is there a way to update this PCollectionView on each new window of data if my filter changes? I thought I could do it in startBundle but I can't figure out how, or whether it's possible. Could you give an example if it is?
PCollectionView<Map<String, TagObject>> tagMapView =
    pipeline.apply(TextIO.Read.named("TagListTextRead")
            .from("gs://tag-list-bucket/tag-list.json"))
        .apply(ParDo.named("TagsToTagMap").of(new Tags.BuildTagListMapFn()))
        .apply("MakeTagMapView", View.asSingleton());

PCollection<String> windowedData =
    pipeline.apply(PubsubIO.Read.topic("myTopic"))
        .apply(Window.<String>into(
            SlidingWindows.of(Duration.standardMinutes(15))
                .every(Duration.standardSeconds(31))));

PCollection<MY_DATA> lineData = windowedData
    .apply(ParDo.named("ExtractJsonObject")
        .withSideInputs(tagMapView)
        .of(new ExtractJsonObjectFn()));
You probably want something like "use an at most 1-minute-old version of the filter as a side input" (since in theory the file can change frequently, unpredictably, and independently of your pipeline, so there's really no way to completely synchronize changes to the file with the behavior of the pipeline).
Here's a (granted, rather clumsy) solution I was able to come up with. It relies on the fact that side inputs are implicitly also keyed by window. In this solution we're going to create a side input windowed into 1-minute fixed windows, where each window will contain a single value of the tag map, derived from the filter file as-of some moment inside that window.
PCollection<Long> ticks = p
    // Produce 1 "tick" per second
    .apply(CountingInput.unbounded().withRate(1, Duration.standardSeconds(1)))
    // Window the ticks into 1-minute windows
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(1))))
    // Use an arbitrary per-window combiner to reduce to 1 element per window
    .apply(Count.globally());

// Produce a collection of tag maps, 1 per each 1-minute window
PCollectionView<TagMap> tagMapView = ticks
    .apply(MapElements.via((Long ignored) -> {
        ... manually read the json file as a TagMap ...
    }))
    .apply(View.asSingleton());
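To close the loop, here is a hedged sketch (not part of the original answer) of how the windowed view might be consumed inside the ExtractJsonObject step from the question; the anonymous DoFn stands in for ExtractJsonObjectFn, and the TagMap lookup is illustrative.

PCollection<MY_DATA> lineData = windowedData
    .apply(ParDo.named("ExtractJsonObject")
        .withSideInputs(tagMapView)
        .of(new DoFn<String, MY_DATA>() {
          @Override
          public void processElement(ProcessContext c) {
            // The runner maps each main-input window to the matching 1-minute
            // side-input window, so this sees a tag map at most about a minute old.
            TagMap tags = c.sideInput(tagMapView);
            // ... parse c.element(), filter or enrich it using tags, then c.output(...) ...
          }
        }));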
This pattern (joining against slowly changing external data as a side input) comes up repeatedly, and the solution I'm proposing here is far from perfect; I wish we had better support for this in the programming model. I've filed a BEAM JIRA issue to track this.

Google Dataflow trigger firing before elementCountAtLeast reached

I was planning on using Google Dataflow to coordinate human-in-the-loop form completion, checking for conflicts after 3 forms have been completed. I have set up Google PubSub as both the Dataflow source and sink, and simply want the trigger to fire and send output to the PubSub sink after three forms have been received for a given JobId.
This SO post looked similar to the problem I'm trying to solve; however, when I implement it, the trigger fires and sends output to the PubSub sink before the AfterPane.elementCountAtLeast count is reached.
I have tried it with both the GlobalWindow and SlidingWindows. Once I get the trigger to fire after elementCountAtLeast is reached, I plan to implement a GroupByKey on the jobId. However, before I move to that step I'd like to get elementCountAtLeast working in isolation.
Here is the code for reading from PubSub and the SlidingWindow:
PCollection<String> humanInTheLoopInput;
humanInTheLoopInput = pipeline
    .apply(PubsubIO.Read
        .named("ReadFromHumanInTheLoopSubscription")
        .subscription(options.getInputHumanInTheLoopRawSubscription()));

PCollection<String> windowedInput = humanInTheLoopInput
    .apply(Window
        .<String>into(SlidingWindows
            .of(Duration.standardSeconds(30))
            .every(Duration.standardSeconds(5)))
        .<String>triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(3)))
        .discardingFiredPanes()
        .withAllowedLateness(Duration.standardDays(10)));
Without a GroupByKey, nothing is triggered; windowing and triggering only affect grouping (and combining) operations.
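For example, a minimal sketch of adding the grouping step so the element-count trigger has something to fire on; ExtractJobIdFn is a hypothetical DoFn<String, KV<String, String>> that keys each form by its JobId.

PCollection<KV<String, Iterable<String>>> groupedForms = windowedInput
    .apply(ParDo.named("KeyByJobId").of(new ExtractJobIdFn()))
    .apply(GroupByKey.<String, String>create());
// Each pane emitted by the GroupByKey is the trigger firing once at least
// three elements have arrived for that key and window.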

Cloud Dataflow: java.lang.IllegalStateException: no evaluator registered for GroupedValues

I'm getting the following exception when running the pipeline locally. There is no exception when submitting for cloud execution.
Thanks,
Genady
INFO: Executing pipeline using the DirectPipelineRunner.
Exception in thread "main" java.lang.IllegalStateException: no evaluator registered for GroupedValues [GroupedValues]
at com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner$Evaluator.visitTransform(DirectPipelineRunner.java:606)
at com.google.cloud.dataflow.sdk.runners.TransformTreeNode.visit(TransformTreeNode.java:200)
at com.google.cloud.dataflow.sdk.runners.TransformTreeNode.visit(TransformTreeNode.java:196)
at com.google.cloud.dataflow.sdk.runners.TransformHierarchy.visit(TransformHierarchy.java:109)
at com.google.cloud.dataflow.sdk.Pipeline.traverseTopologically(Pipeline.java:204)
at com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner$Evaluator.run(DirectPipelineRunner.java:583)
at com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner.run(DirectPipelineRunner.java:327)
at com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner.run(DirectPipelineRunner.java:70)
at app.Main.main(Main.java:124)
The code outline is basically this:
PCollection<KV<MyKey, Iterable<MyValue>>> groupedByMyKey = ...
PCollection<KV<MyKey, MyAggregated>> aggregated = groupedByMyKey.apply(
    Combine.<MyKey, MyValue, MyAggregated>groupedValues(new Aggregator()));
The Aggregator class extends CombineFn<MyValue, List<MyValue>, MyAggregated>.
Can you share a code snippet that triggers this? GroupedValues is a PTransform that is often used within various combining transforms, so it might be from using something like Min, Max, etc.
The error means that the DirectPipelineRunner doesn't know how to evaluate a GroupedValues. However, that's unexpected, since that should have been expanded into a ParDo before execution.
I found the reason for this behaviour.
I was using a command-line argument to run it in remote mode (--runner=BlockingDataflowPipelineRunner) and then forced it to run locally with:
PipelineRunner<?> runner = DirectPipelineRunner.fromOptions(options);
runner.run(p);
After removing these lines and just using the --runner=DirectPipelineRunner argument, it worked as expected.
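For completeness, a minimal sketch of letting the --runner flag pick the runner (using the 1.x SDK's PipelineOptionsFactory) instead of constructing one by hand:

// Build options from the command line; --runner=DirectPipelineRunner or
// --runner=BlockingDataflowPipelineRunner both work without further code changes.
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
Pipeline p = Pipeline.create(options);
// ... build the pipeline as before ...
p.run();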
