I was planning on using Google Dataflow to coordinate human-in-the-loop form completion, checking for conflicts after 3 forms have been completed. I have set up Google Pub/Sub as both the Dataflow source and sink, and I simply want the trigger to fire and send output to the Pub/Sub sink after three forms have been received for a given JobId.
This SO post looked similar to the problem I am trying to solve; however, when I implement it, the trigger fires and sends output to the Pub/Sub sink before AfterPane.elementCountAtLeast is reached.
I have tried it with the GlobalWindow and with SlidingWindows. Once I get the trigger to fire after elementCountAtLeast is reached, I plan to implement a GroupByKey on the jobId. However, before moving to that step I'd like to get elementCountAtLeast working in isolation.
Here is the code for reading from Pub/Sub and applying the SlidingWindows:
PCollection<String> humanInTheLoopInput = pipeline
    .apply(PubsubIO.Read
        .named("ReadFromHumanInTheLoopSubscription")
        .subscription(options.getInputHumanInTheLoopRawSubscription()));

PCollection<String> windowedInput = humanInTheLoopInput
    .apply(Window.<String>into(SlidingWindows
            .of(Duration.standardSeconds(30))
            .every(Duration.standardSeconds(5)))
        .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(3)))
        .discardingFiredPanes()
        .withAllowedLateness(Duration.standardDays(10)));
Without a GroupByKey, nothing will be triggered: windowing and triggering only affect grouping (and combining) operations.
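As a rough sketch of the next step you describe (extractJobId is a hypothetical helper for pulling the JobId out of the form payload, not part of your code), keying the windowed input and grouping it is what gives the elementCountAtLeast trigger something to act on:

PCollection<KV<String, Iterable<String>>> formsByJobId = windowedInput
    .apply("KeyByJobId", ParDo.of(new DoFn<String, KV<String, String>>() {
        @Override
        public void processElement(ProcessContext c) {
            // extractJobId is a placeholder for your own JSON parsing of the form.
            c.output(KV.of(extractJobId(c.element()), c.element()));
        }
    }))
    .apply("GroupFormsByJobId", GroupByKey.<String, String>create());

Each fired pane then contains the buffered forms for that JobId, and a downstream DoFn can check for three completed forms and for conflicts.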
Related
I am using a global window with a repeated-forever, after-processing-time trigger to process streaming data from Pub/Sub, as below:
PCollection<KV<String, SMMessage>> perMSISDNLatestEvents = messages
    .apply("Apply global window", Window.<SMMessage>into(new GlobalWindows())
        .triggering(Repeatedly.forever(
            AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardMinutes(1))))
        .discardingFiredPanes())
    .apply("Convert into kv of msisdn and SM message", ParDo.of(new SmartcareMessagetoKVFn()))
    .apply("Get per MSISDN latest event", Latest.perKey())
    .apply("Write into Redis", ParDo.of(new WriteRedisFn()));
Is there a way to make a Repeatedly.forever Apache Beam trigger execute only after the previous execution has completed? The reason for my question is that the next trigger firing will need to read data from Redis that was written by the previous trigger execution.
Thank You
So the trigger here would fire at the interval you provided. The trigger is not aware of any downstream processing, so it is unable to depend on later steps of your pipeline.
Instead of depending on the trigger for consistency here, you could add a barrier (a DoFn) before the Write step that only lets elements through once you can see the previous data in Redis.
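For illustration only, here is a minimal sketch of such a barrier, assuming the Jedis client (your pipeline may use a different Redis client) and a hypothetical previousPaneKey helper that maps an element to whatever key the previous firing wrote; it simply polls Redis until that data is visible and then passes the element through:

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import redis.clients.jedis.Jedis;

class WaitForPreviousWriteFn extends DoFn<KV<String, SMMessage>, KV<String, SMMessage>> {
    private transient Jedis jedis;

    @Setup
    public void setup() {
        jedis = new Jedis("redis-host", 6379); // assumed host/port
    }

    @ProcessElement
    public void processElement(ProcessContext c) throws InterruptedException {
        // previousPaneKey is a hypothetical helper for locating the data written
        // by the previous trigger firing for this key.
        while (!jedis.exists(previousPaneKey(c.element()))) {
            Thread.sleep(1000); // back off before re-checking
        }
        c.output(c.element());
    }
}

You would place this between Latest.perKey() and the "Write into Redis" step.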
You could try explicitly declaring a global window trigger, as in the example below:
Trigger subtrigger = AfterProcessingTime.pastFirstElementInPane();
Trigger maintrigger = Repeatedly.forever(subtrigger);
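A minimal sketch of wiring these into the windowing transform from your snippet (names taken from the question; note that Beam also requires an allowed-lateness and accumulation-mode setting once an explicit trigger is supplied):

PCollection<SMMessage> triggered = messages
    .apply("Apply global window", Window.<SMMessage>into(new GlobalWindows())
        .triggering(maintrigger)
        .withAllowedLateness(Duration.ZERO)
        .discardingFiredPanes());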
I think triggers would help in your case, since Repeatedly.forever re-executes its subtrigger each time it finishes, so the repeated firing only starts again once the previous trigger has completed.
I found this documentation which might guide you on the triggers you are trying to create.
The requirement is to delete the data in Spanner tables before inserting the data from Pub/Sub messages. Since a MutationGroup does not guarantee the order of execution, I separated the delete mutations into their own set, so there are two sets of mutations: one for Delete and one for AddReplace.
PCollection<Data> dataJson = pipeLine
    .apply(PubsubIO.readStrings().fromSubscription(options.getInputSubscription()))
    .apply("ParsePubSubMessage", ParDo.of(new PubSubToDataFn()))
    .apply(Window.into(FixedWindows.of(Duration.standardSeconds(10))));

SpannerWriteResult deleteResult = dataJson
    .apply("DeleteDataMutation", MapElements.via(......))
    .apply("DeleteData", SpannerIO.write().withSpannerConfig(spannerConfig).grouped());

dataJson
    .apply("WaitOnDeleteMutation", Wait.on(deleteResult.getOutput()))
    .apply("AddReplaceMutation", MapElements.via(...))
    .apply("UpsertInfoToSpanner", SpannerIO.write().withSpannerConfig(spannerConfig).grouped());
This is a streaming Dataflow job, and I have tried multiple windowing strategies, but it never executes the "UpsertInfoToSpanner" step.
How can I fix this issue? Can someone suggest a path forward?
Update:
The requirement is to apply two mutation groups sequentially to the same input data, i.e. read JSON from the Pub/Sub message, delete the existing data from multiple tables with one mutation group, and then insert the data read from the same JSON Pub/Sub message.
Re-pasting the comment earlier for better visibility:
The Mutation operations within a single MutationGroup are guaranteed to be executed in order within a single transaction, so I don't see what the issue is here... The reason why Wait.on() never releases is that the output stream being waited on is in the global window, so it will never be closed in a streaming pipeline.
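One possible way around that (a sketch only, not something I have verified) is to move the Wait.on() signal out of the global window by re-windowing the write result before waiting on it, so that its windows can actually close:

PCollection<Void> deleteSignal = deleteResult.getOutput()
    .apply("WindowDeleteSignal", Window.<Void>into(FixedWindows.of(Duration.standardSeconds(10))));

dataJson
    .apply("WaitOnDeleteMutation", Wait.on(deleteSignal))
    .apply("AddReplaceMutation", MapElements.via(...))
    .apply("UpsertInfoToSpanner", SpannerIO.write().withSpannerConfig(spannerConfig).grouped());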
I have a ParDo that uses state and timers, with a periodically updating PCollectionView as a side input to that ParDo; Google Dataflow throws an exception that timers are not allowed in such a case. Is there another way to feed config data to the ParDo without a side input? Essentially, the side input was a map of config data read from Datastore about every 24 hours.
I am currently trying to see if I can create a ParDo before the one with state and timers to periodically update the config, but I don't see how we can access that map from within the next ParDo. Any suggestions?
Note: This pipeline is running in streaming mode with a global window, reading Pub/Sub messages as they arrive. Datastore is used to hold the data needed to decide when to output an element to a Pub/Sub topic.
Instead of using state and timers to update the side input, you can use a fixed window to periodically update your PCollectionView from your data source:
PCollectionView<Map<String,String>> sideInput = pipeline
    .apply(notifications)
    .apply(Window.<Long>into(FixedWindows.of(Duration.standardMinutes(refreshMinutes)))
        .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
        .withAllowedLateness(Duration.ZERO)
        .discardingFiredPanes())
    .apply( /* query data source */ )
    .apply(View.<Map<String,String>>asSingleton());
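The refreshed view can then be consumed like any other side input from a downstream ParDo. A sketch, where mainInput and the lookup logic are placeholders for your own pipeline:

mainInput.apply("ApplyConfig", ParDo.of(new DoFn<String, String>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
            Map<String, String> config = c.sideInput(sideInput); // latest refreshed config map
            c.output(config.getOrDefault(c.element(), c.element()));
        }
    }).withSideInputs(sideInput));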
I'm using Apache Beam on Dataflow through Python API to read data from Bigquery, process it, and dump it into Datastore sink.
Unfortunately, quite often the job just hangs indefinitely and I have to manually stop it. While the data gets written into Datastore and Redis, from the Dataflow graph I've noticed that it's only a couple of entries that get stuck and leave the job hanging.
As a result, when a job with fifteen 16-core machines is left running for 9 hours (normally, the job runs for 30 minutes), it leads to huge costs.
Maybe there is a way to set a timer that would stop a Dataflow job if it exceeds a time limit?
It would be great if you could create a customer support ticket so we could try to debug this with you.
Maybe there is a way to set a timer that would stop a Dataflow job if it exceeds a time limit?
Unfortunately the answer is no: Dataflow does not have an automatic way to cancel a job after a certain time. However, it is possible to do this using the APIs: you can call wait_until_finish() with a timeout and then cancel() the pipeline.
You would do this like so:
p = beam.Pipeline(options=pipeline_options)
p | ... # Define your pipeline code
pipeline_result = p.run()  # submits the job and returns without blocking
pipeline_result.wait_until_finish(duration=TIME_DURATION_IN_MS)
pipeline_result.cancel() # If the pipeline has not finished, you can cancel it
To sum up, with the help of ankitk's answer, this works for me (Python 2.7, SDK 2.14):
pipe = beam.Pipeline(options=pipeline_options)
... # main pipeline code
run = pipe.run()  # submits the job (returns without blocking)
run.wait_until_finish(duration=3600000)  # wait up to one hour (in ms) for the job to finish
run.cancel()  # cancels the job if it can still be cancelled
Thus, if the job finishes successfully within the duration passed to wait_until_finish(), then cancel() will just print a warning ("already closed"); otherwise it will cancel the running job.
P.S. If you try to print the state of the job,
state = run.wait_until_finish(duration=3600000)
logging.info(state)
it will be RUNNING for a job that did not finish within wait_until_finish(), and DONE for a finished job.
Note: this technique will not work when running Beam from within a Flex Template job...
The run.cancel() method doesn't work if you are writing a template, and I haven't seen any successful workaround for it...
I have a Cloud Dataflow pipeline in which I alter the original timestamp for the event in order to simulate real world scenarios of events arriving late. However, it appears I'm dropping some percentage of my events on each run of the pipeline. Inside my DoFn I use the following code to change the timestamp:
Instant newTimestamp = originalTimestamp.minus(Duration.standardMinutes(RANDOM.nextInt(15)));
c.outputWithTimestamp(KV.of(Integer.toString(RANDOM.nextInt(100)), element), newTimestamp);
The problem is most likely caused by your DoFn step outputting a timestamp that is earlier than the timestamp that was received by the processing step minus the allowed timestamp skew. The exception that would be thrown can be found here in the code:
https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/sdk/src/main/java/com/google/cloud/dataflow/sdk/util/DoFnRunnerBase.java#L493
This behavior is documented with regard to using outputWithTimestamp here:
https://cloud.google.com/dataflow/java-sdk/JavaDoc/com/google/cloud/dataflow/sdk/transforms/DoFn.Context#outputWithTimestamp-OutputT-org.joda.time.Instant-
While you could override the getAllowedTimestampSkew function, it is also documented that this might cause unpredictable issues with the watermark calculations, so it should only be used without windowing/grouping.
https://cloud.google.com/dataflow/java-sdk/JavaDoc/com/google/cloud/dataflow/sdk/transforms/DoFn#getAllowedTimestampSkew--
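For reference, a sketch of what overriding the skew could look like in the SDK from the question; the element type and 15-minute bound are assumptions taken from your snippet, and the caveat above about watermarks still applies:

static class BackdateFn extends DoFn<String, KV<String, String>> {
    private static final Random RANDOM = new Random();

    @Override
    public Duration getAllowedTimestampSkew() {
        // Permit output timestamps up to 15 minutes earlier than the input timestamp.
        return Duration.standardMinutes(15);
    }

    @Override
    public void processElement(ProcessContext c) {
        // Backdate by 0-14 minutes, which stays within the allowed skew above.
        Instant newTimestamp = c.timestamp().minus(Duration.standardMinutes(RANDOM.nextInt(15)));
        c.outputWithTimestamp(KV.of(Integer.toString(RANDOM.nextInt(100)), c.element()), newTimestamp);
    }
}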