Is there a way to make repeatedly forever apache beam trigger to only execute after the previous execution is completed? - google-cloud-dataflow

I am using a global window with a Repeatedly.forever trigger wrapping AfterProcessingTime to process streaming data from Pub/Sub, as shown below:
PCollection<KV<String, SMMessage>> perMSISDNLatestEvents = messages
    .apply("Apply global window", Window.<SMMessage>into(new GlobalWindows())
        .triggering(Repeatedly.forever(
            AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardMinutes(1))))
        .discardingFiredPanes())
    .apply("Convert into kv of msisdn and SM message", ParDo.of(new SmartcareMessagetoKVFn()))
    .apply("Get per MSISDN latest event", Latest.perKey())
    .apply("Write into Redis", ParDo.of(new WriteRedisFn()));
Is there a way to make an Apache Beam Repeatedly.forever trigger fire only after the previous firing has been fully processed? I ask because processing for the next firing needs to read data from Redis that was written during the previous firing.
Thank You

The trigger here fires at the interval you specified. It is not aware of any downstream processing, so it cannot depend on later steps of your pipeline.
Instead of relying on the trigger for consistency, you could add a barrier (a DoFn) before the Write step that only lets elements through once it sees the previous data in Redis.
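As a rough illustration of the barrier idea, here is a minimal polling helper. It is written in Python rather than the Java used in the question, and it is only a sketch: wait_until_visible and its parameters are hypothetical names, and is_visible stands in for whatever check you would run against Redis (for example, a key-existence lookup for the previous firing's data).

```python
import time
from typing import Callable


def wait_until_visible(
    is_visible: Callable[[], bool],
    timeout_s: float = 30.0,
    poll_interval_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll is_visible() until it returns True or timeout_s elapses.

    Returns True if the data became visible, False on timeout.
    is_visible is a hypothetical callback supplied by the caller.
    """
    deadline = clock() + timeout_s
    while True:
        if is_visible():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)
```

A barrier DoFn could call a helper like this at the start of its processing method and only emit the element once it returns True. What to do on a timeout (fail the bundle, emit anyway, send to a dead-letter output) is a design decision this sketch leaves open.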

You could try explicitly declaring a global window trigger, as in the example below:
Trigger subtrigger = AfterProcessingTime.pastFirstElementInPane();
Trigger maintrigger = Repeatedly.forever(subtrigger);
I think triggers would help in your case, since they let you define firing conditions that run when you or your code trigger them, so the repeated firing would only happen once a trigger has finished.
I found this documentation which might guide you on the triggers you are trying to create.

Related

Dart eventloop control equivalent

In Node.js we have process.nextTick(), setImmediate() and setTimeout(). I want to know what the equivalents are in Dart.
setImmediate() is designed to execute a script once the current poll phase completes.
setTimeout() schedules a script to be run after a minimum threshold in ms has elapsed.
Any time you call process.nextTick() in a given phase, all callbacks passed to process.nextTick() will be resolved before the event loop continues. This can create some bad situations because it allows you to "starve" your I/O by making recursive process.nextTick() calls, which prevents the event loop from reaching the poll phase.
The corresponding Dart operations are:
setTimeout(callback, durationMS): Timer(duration, callback) or Future.delayed(duration, callbackWithResult).
setImmediate(callback): Timer.run(callback), Timer(Duration.zero, callback) or Future(callbackWithResult).
nextTick(callback): scheduleMicrotask(callback) or Future.microtask(callbackWithResult).

How can I programmatically cancel a Dataflow job that has run for too long?

I'm using Apache Beam on Dataflow through the Python API to read data from BigQuery, process it, and dump it into a Datastore sink.
Unfortunately, quite often the job just hangs indefinitely and I have to manually stop it. While the data gets written into Datastore and Redis, from the Dataflow graph I've noticed that it's only a couple of entries that get stuck and leave the job hanging.
As a result, when a job with fifteen 16-core machines is left running for 9 hours (normally, the job runs for 30 minutes), it leads to huge costs.
Maybe there is a way to set a timer that would stop a Dataflow job if it exceeds a time limit?
It would be great if you could create a customer support ticket so that we can try to debug this with you.
Maybe there is a way to set a timer that would stop a Dataflow job if
it exceeds a time limit?
Unfortunately the answer is no, Dataflow does not have an automatic way to cancel a job after a certain time. However, it is possible to do this using the APIs. It is possible to wait_until_finish() with a timeout then cancel() the pipeline.
You would do this like so:
import apache_beam as beam

p = beam.Pipeline(options=pipeline_options)
p | ...  # Define your pipeline code
pipeline_result = p.run()  # Submits the job and returns a result object without waiting for completion
pipeline_result.wait_until_finish(duration=TIME_DURATION_IN_MS)  # Blocks for at most this many milliseconds
pipeline_result.cancel()  # If the pipeline has not finished, you can cancel it
To sum up, with the help of @ankitk's answer, this works for me (Python 2.7, SDK 2.14):
pipe = beam.Pipeline(options=pipeline_options)
...  # main pipeline code
run = pipe.run()  # submits the job and returns without waiting for it
run.wait_until_finish(duration=3600000)  # blocks for up to one hour (duration is in ms)
run.cancel()  # cancels the job if it can still be cancelled
Thus, if the job finishes successfully within the duration passed to wait_until_finish(), cancel() will just print an "already closed" warning; otherwise it will cancel the running job.
P.S. if you try to print the state of a job
state = run.wait_until_finish(duration=3600000)
logging.info(state)
it will be RUNNING for a job that did not finish within the wait_until_finish() duration, and DONE for a finished job.
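Building on the observation about the returned state, one could skip the cancel() call entirely when the job has already ended. The following is only a sketch: cancel_if_still_running is a hypothetical helper, and the set of terminal state names is an assumption modeled on Beam's pipeline states (the answers above only confirm RUNNING and DONE). It accepts any object exposing wait_until_finish(duration=...) and cancel(), so it can be tried without a real job.

```python
# Assumed terminal state names; only "DONE" is confirmed by the answers above.
TERMINAL_STATES = {"DONE", "FAILED", "CANCELLED", "UPDATED", "DRAINED"}


def cancel_if_still_running(result, timeout_ms):
    """Wait up to timeout_ms, then cancel the job unless it already ended.

    Returns True if cancel() was called, False otherwise.
    """
    state = result.wait_until_finish(duration=timeout_ms)
    if str(state) in TERMINAL_STATES:
        return False  # job ended on its own, nothing to cancel
    result.cancel()
    return True
```

With the snippet above, this would be called as cancel_if_still_running(run, 3600000) right after pipe.run().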
Note: this technique will not work when running Beam from within a Flex Template Job...
The run.cancel() method doesn't work if you are writing a template, and I haven't seen any successful workaround for it...

Defining "global" behavior in Gulp (measuring task duration)

I'm working on moving us from ant to gulp, and as part of the effort I want to write timing stats to Graphite. We're doing this in ant as well (no idea how, but that's beside the point). Rather than manually adding some plugin to every task we have (over 60), I'd prefer some sort of global behavior: before each task runs, a timer is started, and when the task signals completion we push some data to Graphite (over statsd).
Can someone point me in the right direction where to hook into gulp for this? I couldn't find anything particularly useful in the docs / recipes...
We're running gulp@4.
Instead of adding timing code to your numerous tasks, you could make use of the NPM gulp-duration package.
A snippet showing an example of its use is below:
var duration = require('gulp-duration')

function rebundle() {
  var uglifyTimer = duration('uglify time')
  var bundleTimer = duration('bundle time')
  return bundler.bundle()
    .pipe(source('bundle.js'))
    .pipe(bundleTimer)
    // start just before uglify receives its first file
    .once('data', uglifyTimer.start)
    .pipe(uglify())
    .pipe(uglifyTimer)
    .pipe(gulp.dest('example/'))
}
gulp-duration's duration function:
"Creates a new pass-through duration stream. When this stream is closed, it will log the amount of time since its creation to your terminal."
This then allows you to log the duration of the task.
Whilst this is not a global behaviour solution, at least you can specify the timing code in your gulp file, as opposed to having to modify all 60+ of your tasks.

Dataflow pipeline is dropping events during processing when using outputWithTimestamp

I have a Cloud Dataflow pipeline in which I alter the original timestamp for the event in order to simulate real world scenarios of events arriving late. However, it appears I'm dropping some percentage of my events on each run of the pipeline. Inside my DoFn I use the following code to change the timestamp:
Instant newTimestamp = originalTimestamp.minus(Duration.standardMinutes(RANDOM.nextInt(15)));
c.outputWithTimestamp(KV.of(Integer.toString(RANDOM.nextInt(100)), element), newTimestamp);
The problem is most likely caused by your DoFn step outputting a timestamp that is earlier than the timestamp that was received by the processing step minus the allowed timestamp skew. The exception that would be thrown can be found here in the code:
https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/sdk/src/main/java/com/google/cloud/dataflow/sdk/util/DoFnRunnerBase.java#L493
This behavior is documented with regard to using outputWithTimestamp here:
https://cloud.google.com/dataflow/java-sdk/JavaDoc/com/google/cloud/dataflow/sdk/transforms/DoFn.Context#outputWithTimestamp-OutputT-org.joda.time.Instant-
While you could override the getAllowedTimestampSkew function, it is also documented that this might cause unpredictable issues with the watermark calculations, so it should only be used without windowing/grouping.
https://cloud.google.com/dataflow/java-sdk/JavaDoc/com/google/cloud/dataflow/sdk/transforms/DoFn#getAllowedTimestampSkew--
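The rule described above can be sketched as a simple comparison. This is an illustration in plain Python, not the SDK's actual check: output_timestamp_allowed is a hypothetical name, and the default skew of zero is an assumption about the SDK default. It shows why moving a timestamp backwards by any non-zero amount fails the check unless the allowed skew covers it.

```python
from datetime import datetime, timedelta


def output_timestamp_allowed(input_ts, output_ts, allowed_skew=timedelta(0)):
    """Return True if output_ts is not earlier than input_ts - allowed_skew."""
    return output_ts >= input_ts - allowed_skew


original = datetime(2016, 1, 1, 12, 0, 0)
# Same idea as originalTimestamp.minus(Duration.standardMinutes(7)) in the question.
shifted = original - timedelta(minutes=7)

print(output_timestamp_allowed(original, original))  # True: timestamp unchanged
print(output_timestamp_allowed(original, shifted))   # False: 7 minutes back, zero skew
print(output_timestamp_allowed(original, shifted, timedelta(minutes=15)))  # True
```

In the question's code the shift is RANDOM.nextInt(15) minutes, so only the elements that happen to draw 0 would pass a zero-skew check.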

Google Dataflow trigger firing before elementCountAtLeast reached

I was planning on using Google Dataflow to coordinate human-in-the-loop form completion, checking for conflict after 3 forms have been completed. I have set up Google PubSub for both the Dataflow source and sink, and simply want the trigger to fire and send to the PubSub sink after three forms have been received for a given JobId.
This SO post looked similar to the problem I was trying to solve, however when I implement it, the trigger is firing and sending output to the PubSub sink before the AfterPane.elementCountAtLeast is reached.
I have tried it with the GlobalWindow and SlidingWindows. Once I get the trigger to fire after the elementCountAtLeast is reached, I was planning on implementing a GroupByKey for the jobId. However, before I moved to that step I'd like to get the elementCountAtLeast working in isolation.
Here is the code for reading from PubSub and the SlidingWindow:
PCollection<String> humanInTheLoopInput;
humanInTheLoopInput = pipeline
    .apply(PubsubIO.Read
        .named("ReadFromHumanInTheLoopSubscription")
        .subscription(options.getInputHumanInTheLoopRawSubscription()));
PCollection<String> windowedInput = humanInTheLoopInput
    .apply(Window
        .<String>into(SlidingWindows
            .of(Duration.standardSeconds(30))
            .every(Duration.standardSeconds(5)))
        .<String>triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(3)))
        .discardingFiredPanes()
        .withAllowedLateness(Duration.standardDays(10)));
Without a GroupByKey, nothing is being triggered. Both windowing and triggering only affect grouping (and combining) operations.
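To make that point concrete, here is a toy model in plain Python of what a count-based trigger does once a grouping step exists. It is not Beam code and it ignores windows entirely: fire_panes is a hypothetical function that buffers elements per key and emits a pane once a key has collected the required number, then discards the buffer (as with discardingFiredPanes). A real runner may fire with more than the minimum count; this sketch fires at exactly the threshold.

```python
from collections import defaultdict


def fire_panes(keyed_elements, count_at_least=3):
    """Toy per-key model of a repeated count trigger with discarded panes."""
    buffers = defaultdict(list)
    panes = []
    for key, value in keyed_elements:
        buffers[key].append(value)
        if len(buffers[key]) >= count_at_least:
            # Emit the pane for this key and discard its buffered elements.
            panes.append((key, buffers.pop(key)))
    return panes


events = [("job1", "a"), ("job2", "x"), ("job1", "b"), ("job1", "c"), ("job2", "y")]
print(fire_panes(events))  # [('job1', ['a', 'b', 'c'])]
```

The counting happens per key at the grouping step, which is why the trigger only has an observable effect once the elements are keyed by jobId and passed through a GroupByKey.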
