State garbage collection in Beam with GlobalWindow - google-cloud-dataflow

Apache Beam has recently introduced state cells, through StateSpec and the @StateId annotation, with partial support in Apache Flink and Google Cloud Dataflow.
I cannot find any documentation on what happens when this is used with a GlobalWindow. In particular, is there a way to have a "state garbage collection" mechanism to get rid of states for keys that have not been seen for a while, according to some configuration, while still maintaining a single all-time state for keys that are seen frequently enough?
Or, is the amount of state used in this case going to diverge, with no way to ever reclaim state corresponding to keys that have not been seen in a while?
I am also interested in whether a potential solution would be supported in either Apache Flink or Google Cloud Dataflow.
Flink and direct runners seem to have some code for "state GC" but I am not really sure what it does and whether it is relevant when using a global window.

State can be automatically garbage collected by a Beam runner at some point after a window expires - when the input watermark exceeds the end of the window by the allowed lateness, so all further input is droppable. The exact details depend on the runner.
As you correctly determined, the global window may never expire, in which case this automatic collection of state is never invoked. For bounded data, including drain scenarios, it actually will expire, but for a perpetual unbounded data source it will not.
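For contrast, here is a minimal sketch (not part of the original answer) of a non-global windowing configuration under which a runner can eventually collect per-key-and-window state; the 10-minute fixed windows, one-hour allowed lateness, and KV<String, Long> element type are arbitrary choices for illustration:
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// Once the input watermark passes the end of a 10-minute window plus the
// 1-hour allowed lateness, the runner may drop all state held by stateful
// DoFns downstream of this windowing for that key and window.
static PCollection<KV<String, Long>> windowForStateCollection(
    PCollection<KV<String, Long>> input) {
  return input.apply(
      Window.<KV<String, Long>>into(FixedWindows.of(Duration.standardMinutes(10)))
          .withAllowedLateness(Duration.standardHours(1))
          .discardingFiredPanes());
}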
If you are doing stateful processing on such data in the Global window you can use user-defined timers (used through @TimerId, @OnTimer, and TimerSpec - I haven't blogged about these yet) to clear state after some timeout of your choosing. If the state represents an aggregation of some sort, then you'll want a timer anyhow to make sure your data is not stranded in state.
Here is a quick example of their use:
new DoFn<Foo, Baz>() {
  private static final String MY_TIMER = "my-timer";
  private static final String MY_STATE = "my-state";

  @StateId(MY_STATE)
  private final StateSpec<ValueState<Bizzle>> myStateSpec =
      StateSpecs.value(Bizzle.coder());

  @TimerId(MY_TIMER)
  private final TimerSpec myTimerSpec =
      TimerSpecs.timer(TimeDomain.EVENT_TIME);

  @ProcessElement
  public void process(
      ProcessContext c,
      @StateId(MY_STATE) ValueState<Bizzle> bizzleState,
      @TimerId(MY_TIMER) Timer myTimer) {
    bizzleState.write(...);
    myTimer.setForNowPlus(...);
  }

  @OnTimer(MY_TIMER)
  public void onMyTimer(
      OnTimerContext context,
      @StateId(MY_STATE) ValueState<Bizzle> bizzleState) {
    context.output(... bizzleState.read() ...);
    bizzleState.clear();
  }
}

There is no automatic garbage collection of state if you use GlobalWindows. Only if you use some non-global window will state be garbage collected, after the watermark passes the end of the window plus the allowed lateness.
What you can do if you must work with GlobalWindows is to manually keep the last update timestamp as state. You would then periodically set a timer, in whose callback you check this timestamp against the current time and delete state if necessary. You would set this timer when encountering a key for the first time (which you can see from the absence of your timestamp state) and then re-set it in the @OnTimer method.
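A rough sketch of that pattern follows, reusing the hypothetical Bizzle value type from the example above and assuming a one-hour processing-time timeout; it uses the Timer.offset(...).setRelative() form of the timer API and is meant as an illustration rather than a drop-in implementation:
// Sketch only: the one-hour TTL, the Bizzle value type, and the use of the
// processing-time domain are assumptions, not part of the original answer.
new DoFn<KV<String, Bizzle>, Baz>() {
  private static final Duration TTL = Duration.standardHours(1);

  @StateId("last-seen")
  private final StateSpec<ValueState<Instant>> lastSeenSpec =
      StateSpecs.value(InstantCoder.of());

  @StateId("my-state")
  private final StateSpec<ValueState<Bizzle>> myStateSpec =
      StateSpecs.value(Bizzle.coder());

  @TimerId("gc-timer")
  private final TimerSpec gcTimerSpec = TimerSpecs.timer(TimeDomain.PROCESSING_TIME);

  @ProcessElement
  public void process(
      ProcessContext c,
      @StateId("last-seen") ValueState<Instant> lastSeen,
      @StateId("my-state") ValueState<Bizzle> myState,
      @TimerId("gc-timer") Timer gcTimer) {
    if (lastSeen.read() == null) {
      // First time we see this key: arm the GC timer.
      gcTimer.offset(TTL).setRelative();
    }
    lastSeen.write(Instant.now());
    myState.write(...);
  }

  @OnTimer("gc-timer")
  public void onGcTimer(
      OnTimerContext context,
      @StateId("last-seen") ValueState<Instant> lastSeen,
      @StateId("my-state") ValueState<Bizzle> myState,
      @TimerId("gc-timer") Timer gcTimer) {
    Instant last = lastSeen.read();
    if (last == null || last.plus(TTL).isBefore(Instant.now())) {
      // Key not seen within the TTL: drop its state.
      lastSeen.clear();
      myState.clear();
    } else {
      // Key was seen recently: check again one TTL from now.
      gcTimer.offset(TTL).setRelative();
    }
  }
}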

Related

How to solve Duplicate values exception when I create PCollectionView<Map<String,String>>

I'm setting up a slow-changing lookup map in my Apache Beam pipeline. It continuously updates the lookup map. For each key in the lookup map, I retrieve the latest value in the global window with accumulating mode.
But it always fails with this exception:
org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.lang.IllegalArgumentException: Duplicate values for mykey
Is anything wrong with this snippet code?
If I use .discardingFiredPanes() instead, I will lose information in the last emit.
pipeline
    .apply(GenerateSequence.from(0).withRate(1, Duration.standardMinutes(1L)))
    .apply(
        Window.<Long>into(new GlobalWindows())
            .triggering(Repeatedly.forever(
                AfterProcessingTime.pastFirstElementInPane()))
            .accumulatingFiredPanes())
    .apply(new ReadSlowChangingTable())
    .apply(Latest.perKey())
    .apply(View.asMap());
Example input per trigger firing:
t1 : KV<k1,v1> KV<k2,v2>
t2 : KV<k1,v1>
accumulatingFiredPanes => expected result at t2 => KV(k1,v1), KV(k2,v2), but fails due to the duplicate-values exception
discardingFiredPanes => expected result at t2 => KV(k1,v1); succeeds
Specifically with regard to the View.asMap and accumulating panes discussion in the comments:
If you would like to make use of the View.asMap side input (for example, when the source of the map elements is itself distributed – often because you are creating a side input from the output of a previous transform), there are some other factors that will need to be taken into consideration: View.asMap is itself an aggregation; it will inherit the triggering and accumulate its input. In this specific pattern, setting the pipeline to accumulatingFiredPanes mode before this transform will result in duplicate key errors, even if a transform such as Latest.perKey is used before the View.asMap transform.
Given that the read updates the whole map, View.asSingleton would, I think, be a better approach for this use case.
Some general notes around this pattern, which will hopefully be useful for others as well:
For this pattern we can use the GenerateSequence source transform to emit a value periodically, for example once a day. Pass this value into a global window via a data-driven trigger that activates on each element. In a DoFn, use this value as a signal to pull data from your bounded source, and create your side input for use in downstream transforms.
It's important to note that because this pattern uses a global-window side input triggering on processing time, matching to elements being processed in event time will be nondeterministic. For example, if the main pipeline is windowed on event time, the version of the side input view that those windows see will depend on the latest trigger that has fired in processing time rather than on event time.
It is also important to note that, in general, the side input should be something that fits into memory.
Java (SDK 2.9.0):
In the sample below the side input is updated at very short intervals so that the effects can easily be seen. The expectation is that the side input updates slowly, for example every few hours or once a day.
In the example code below we make use of a Map that we create in a DoFn and that becomes the View.asSingleton; this is the recommended approach for this pattern.
The sample below illustrates the pattern; please note the View.asSingleton is rebuilt on every counter update.
public static void main(String[] args) {
  // Create pipeline
  PipelineOptions options =
      PipelineOptionsFactory.fromArgs(args).withValidation().as(PipelineOptions.class);

  // Using View.asSingleton, this pipeline uses a dummy external service as illustration.
  // Run in debug mode to see the output.
  Pipeline p = Pipeline.create(options);

  // Create the slowly updating side input.
  PCollectionView<Map<String, String>> map = p
      .apply(GenerateSequence.from(0).withRate(1, Duration.standardSeconds(5L)))
      .apply(Window.<Long>into(new GlobalWindows())
          .triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane()))
          .discardingFiredPanes())
      .apply(ParDo.of(new DoFn<Long, Map<String, String>>() {
        @ProcessElement
        public void process(@Element Long input, OutputReceiver<Map<String, String>> o) {
          // Do any external reads needed here...
          // We will make use of our dummy external service.
          // Every time this triggers, the complete map will be replaced with that read from
          // the service.
          o.output(DummyExternalService.readDummyData());
        }
      }))
      .apply(View.asSingleton());

  // ---- Consume the slowly updating side input.
  // GenerateSequence is only used here to generate dummy data for this illustration.
  // You would use your real source, for example PubSubIO, KafkaIO, etc...
  p.apply(GenerateSequence.from(0).withRate(1, Duration.standardSeconds(1L)))
      .apply(Window.into(FixedWindows.of(Duration.standardSeconds(1))))
      .apply(Sum.longsGlobally().withoutDefaults())
      .apply(ParDo.of(new DoFn<Long, KV<Long, Long>>() {
        @ProcessElement
        public void process(ProcessContext c) {
          Map<String, String> keyMap = c.sideInput(map);
          c.outputWithTimestamp(KV.of(1L, c.element()), Instant.now());
          LOG.debug("Value is {} key A is {} and key B is {}",
              c.element(), keyMap.get("Key_A"), keyMap.get("Key_B"));
        }
      }).withSideInputs(map));

  p.run();
}
public static class DummyExternalService {
  public static Map<String, String> readDummyData() {
    Map<String, String> map = new HashMap<>();
    Instant now = Instant.now();
    DateTimeFormatter dtf = DateTimeFormat.forPattern("HH:mm:ss");
    map.put("Key_A", now.minus(Duration.standardSeconds(30)).toString(dtf));
    map.put("Key_B", now.minus(Duration.standardSeconds(30)).toString());
    return map;
  }
}

SideInputs kill dataflow performance

I'm using Dataflow to generate a large amount of data.
I've tested two versions of my pipeline: one with a side input (of varying sizes), and the other without.
When I run the pipeline without the side input, my job will finish in about 7 minutes. When I run my job with the side input, my job will never finish.
Here's what my DoFn looks like:
public class MyDoFn extends DoFn<String, String> {
  final PCollectionView<Map<String, Iterable<TreeMap<Long, Float>>>> pCollectionView;
  final List<CSVRecord> stuff;

  private Aggregator<Integer, Integer> dofnCounter =
      createAggregator("DoFn Counter", new Sum.SumIntegerFn());

  public MyDoFn(PCollectionView<Map<String, Iterable<TreeMap<Long, Float>>>> pcv, List<CSVRecord> m) {
    this.pCollectionView = pcv;
    this.stuff = m;
  }

  @Override
  public void processElement(ProcessContext processContext) throws Exception {
    Map<String, Iterable<TreeMap<Long, Float>>> pdata = processContext.sideInput(pCollectionView);
    processContext.output(AnotherClass.generateData(stuff, pdata));
    dofnCounter.addValue(1);
  }
}
And here's what my pipeline looks like:
final Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());

PCollection<KV<String, TreeMap<Long, Float>>> data;
data = p.apply(TextIO.Read.from("gs://where_the_files_are/*").named("Reading Data"))
    .apply(ParDo.named("Parsing data").of(new DoFn<String, KV<String, TreeMap<Long, Float>>>() {
      @Override
      public void processElement(ProcessContext processContext) throws Exception {
        // Parse some data
        processContext.output(KV.of(key, value));
      }
    }));

final PCollectionView<Map<String, Iterable<TreeMap<Long, Float>>>> pcv =
    data.apply(GroupByKey.<String, TreeMap<Long, Float>>create())
        .apply(View.<String, Iterable<TreeMap<Long, Float>>>asMap());

DoFn<String, String> dofn = new MyDoFn(pcv, localList);

p.apply(TextIO.Read.from("gs://some_text.txt").named("Sizing"))
    .apply(ParDo.named("Generating the Data").withSideInputs(pcv).of(dofn))
    .apply(TextIO.Write.named("Write_out").to(outputFile));

p.run();
We've spent about two days trying various methods of getting this to work, and we've narrowed it down to the inclusion of the side input. If processElement is modified to not read the side input, it is still very slow as long as the side input is attached. If we don't call .withSideInputs() at all, it's very fast again.
Just to clarify, we've tested this with side input sizes from 20 MB to 1.5 GB.
Very grateful for any insight.
Edit
Including a few job IDs:
2016-01-20_14_31_12-1354600113427960103
2016-01-21_08_04_33-1642110636871153093 (latest)
Please try out the Dataflow SDK 1.5.0+; it should address the known performance problems behind your issue.
Side inputs in the Dataflow SDK 1.5.0+ use a new distributed format when running batch pipelines. Note that streaming pipelines and pipelines using older versions of the Dataflow SDK are still subject to re-reading the side input if the view cannot be cached entirely in memory.
With the new format, we use an index to provide a block-based lookup and caching strategy. Thus, when looking into a list by index or into a map by key, only the block that contains that index or key will be loaded. Having a cache size greater than the working set size will aid performance, as frequently accessed indices/keys will not require re-reading the block they are contained in.
Side inputs in the Dataflow SDK can, indeed, introduce slowness if not used carefully. Most often, this happens when each worker has to re-read the entire side input per main input element.
You seem to be using a PCollectionView created via asMap. In this case, the entire side input PCollection must fit into memory of each worker. When needed, Dataflow SDK will copy this data on each worker to create such a map.
That said, the map on each worker may be created just once or multiple times, depending on several factors. If its size is small enough (usually less than 100 MB), it is likely that the map is read only once per worker and reused across elements and across bundles. However, if its size cannot fit into our cache (or something else evicts it from the cache), the entire map may be re-read again and again on each worker. This is, most often, the root-cause of the slowness.
The cache size is controllable via PipelineOptions as follows, but due to several important bugfixes, this should be used in version 1.3.0 and later only.
DataflowWorkerHarnessOptions opts = PipelineOptionsFactory.fromArgs(args).withValidation().create().cloneAs(DataflowWorkerHarnessOptions.class);
opts.setWorkerCacheMb(500);
Pipeline p = Pipeline.create(opts);
For the time being, the fix is to change the structure of the pipeline to avoid excessive re-reading. I cannot offer you specific advice there, as you haven't shared enough information about your use case. (Please post a separate question if needed.)
We are actively working on a related feature we refer to as distributed side inputs. This will allow a lookup into the side input PCollection without constructing the entire map on the worker. It should significantly help performance in this and related cases. We expect to release this very shortly.
I didn't see anything particularly suspicious about the two jobs you have quoted. They've been cancelled relatively quickly.
I'm manually setting the cache size when creating the pipeline in the following manner:
DataflowWorkerHarnessOptions opts = PipelineOptionsFactory.fromArgs(args).withValidation().create().cloneAs(DataflowWorkerHarnessOptions.class);
opts.setWorkerCacheMb(500);
Pipeline p = Pipeline.create(opts);
For a side input of ~25 MB, this speeds up the execution time considerably (job id 2016-01-25_08_42_52-657610385797048159) vs. creating a pipeline in the manner below (job id 2016-01-25_07_56_35-14864561652521586982):
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
However, when the side input is increased to ~400 MB, no increase in cache size improves performance. Theoretically, is all the memory indicated by the GCE machine type available for use by the worker? What would invalidate or evict something from the worker cache, forcing the re-read?

Is it possible to keep track of the state when processing logs by DataFlow?

Is it possible to have the Dataflow process maintain state? There are log-processing tools that allow for that by providing fast-access (proprietary / in-memory) files available to the real-time process for keeping track of state on the logs while processing them.
A use case example would be tracking the registration steps taken by users. The registration steps would come in different logs, and the data from those logs would be assembled by the real-time process into one final database record (for each registered user) that is written to a database.
Can my Dataflow code keep track of the many registration steps (streaming input) by users and, once a user's registration steps are completed, have the Dataflow process write the record to the database (one record per user)?
I don't know much about the Dataflow architecture. It must be using some (proprietary / in-memory NoSQL) data storage for keeping track of things it needs to track (e.g., when it tries to produce the top 100 customers). Is that fast-access data storage also available to Dataflow processes to use?
Thanks
As danielm said, state is not yet exposed. The good news is you may not need it for your use case.
If you have a PCollection<KV<UserId, LogEvent>> you can use a CombineFn and Combine.perKey to take all of the LogEvents for a specific UserId and combine them into a single output. The CombineFn tells Dataflow how to create an accumulator, update it by incorporating input elements, and then extract a final output. Transforms like Top actually use a CombineFn (with a Heap as the accumulator) rather than an actual state API.
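For illustration, here is a sketch of such a CombineFn; LogEvent, UserId, and UserRecord are placeholder types (UserRecord.fromEvents is assumed), and the accumulator is simply a list of events:
// Sketch only: assembles one UserRecord per user from all of that user's log events.
public static class AssembleRegistrationFn
    extends Combine.CombineFn<LogEvent, List<LogEvent>, UserRecord> {

  @Override
  public List<LogEvent> createAccumulator() {
    return new ArrayList<>();
  }

  @Override
  public List<LogEvent> addInput(List<LogEvent> accum, LogEvent input) {
    accum.add(input);
    return accum;
  }

  @Override
  public List<LogEvent> mergeAccumulators(Iterable<List<LogEvent>> accums) {
    List<LogEvent> merged = new ArrayList<>();
    for (List<LogEvent> a : accums) {
      merged.addAll(a);
    }
    return merged;
  }

  @Override
  public UserRecord extractOutput(List<LogEvent> accum) {
    // Assemble the final per-user record from all of that user's events.
    return UserRecord.fromEvents(accum);
  }
}

// Usage: one UserRecord per UserId, given PCollection<KV<UserId, LogEvent>> events.
PCollection<KV<UserId, UserRecord>> records =
    events.apply(Combine.<UserId, LogEvent, UserRecord>perKey(new AssembleRegistrationFn()));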
If your events are of different types, you can still do something like this. For instance, if you have two logs, you can do:
PCollection<KV<UserId, LogEvent1>> events1 = ...;
PCollection<KV<UserId, LogEvent2>> events2 = ...;

// Create tuple tags for the value types in each collection.
final TupleTag<LogEvent1> tag1 = new TupleTag<LogEvent1>();
final TupleTag<LogEvent2> tag2 = new TupleTag<LogEvent2>();

// Merge collection values into a CoGbkResult collection.
PCollection<KV<UserId, CoGbkResult>> coGbkResultCollection =
    KeyedPCollectionTuple.of(tag1, events1)
        .and(tag2, events2)
        .apply(CoGroupByKey.<UserId>create());

// Access results and do something.
PCollection<T> finalResultCollection =
    coGbkResultCollection.apply(ParDo.of(
        new DoFn<KV<UserId, CoGbkResult>, T>() {
          @Override
          public void processElement(ProcessContext c) {
            KV<UserId, CoGbkResult> e = c.element();
            // Get all LogEvent1 values.
            Iterable<LogEvent1> event1s = e.getValue().getAll(tag1);
            // There will only be one LogEvent2.
            LogEvent2 event2 = e.getValue().getOnly(tag2);
            ... Do Something to compute T ....
            c.output(...some T...);
          }
        }));
The above example was adapted from the docs on CoGroupByKey, which have more information.
Dataflow does not currently expose the underlying state mechanism that it uses. However, this is definitely on the radar for a future update.

Inject bridge-code in JavaFX WebView before page-load?

I want to load some content or page in a JavaFX WebView and offer a bridge object so that the content of the page can make calls into Java.
The basic concept of how to do this is described here: https://blogs.oracle.com/javafx/entry/communicating_between_javascript_and_javafx
Now my question is: when is a good time to inject the bridge object into the WebView so that it is available as soon as possible?
One option would be after page load, as described here: https://stackoverflow.com/a/17612361/1520422
But is there a way to inject it sooner (before the page content itself is initialized), so the bridge object is available DURING page load (and not only after page load)?
Since no one has answered, I'll tell you how I'm doing it, although it is ugly. This provides the ability for the page to function normally in non-Java environments but receive a Java object in Java environments.
I start by providing an onStatusChanged handler to the WebEngine. It listens for a magic value on window.status. If the magic value is received, the handler installs the Java object. (In my case the orchestration is a bit more involved: I execute a script that provides a client-side API for the page and then set another magic value on window.status to cause the Java object to be passed to an initialization method of the client-side API.)
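A minimal sketch of such a handler, assuming Java 8 and a placeholder bridge object named javaBridge (the magic string matches the page snippet below):
// Install the Java bridge as soon as the page signals readiness via window.status.
webEngine.setOnStatusChanged(event -> {
    if ("MY-MAGIC-VALUE".equals(event.getData())) {
        JSObject window = (JSObject) webEngine.executeScript("window");
        window.setMember("javaBridge", javaBridge); // javaBridge is a placeholder field
    }
});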
Then in my target page, I have the following code in the first script in the page:
window.status = "MY-MAGIC-VALUE";
window.status = "";
This code is essentially a no-op in a "normal" browser but triggers the initialization when running in the custom JavaFX embedding.
In Java 8, you can listen for the load worker's state changing from SCHEDULED to RUNNING and inject your objects at that point. The objects will then be present in the WebEngine before any JavaScript on the page runs. In Java 7 the state machine behaves quite differently, and I have no solution for Java 7.
webEngine.getLoadWorker().stateProperty().addListener(
    new ChangeListener<State>() {
      public void changed(ObservableValue<? extends State> ov,
                          State oldState, State newState) {
        // System.out.println("old: " + oldState + ", new: " + newState);
        if (newState == State.RUNNING && oldState == State.SCHEDULED) {
          JSObject window = (JSObject) webEngine.executeScript("window");
          window.setMember("foutput", foutput);
        }
      }
    });

Amazon SWF for long running business workflows

I'm evaluating Amazon SWF as an option to build a distributed workflow system. The main language will be Java, so the Flow framework is an obvious choice. There's just one thing that keeps puzzling me and I would get some other opinions before I can recommend it as a key component in our architecture:
The examples are all about tasks that produce a result after a deterministic, relatively short period of time (i.e. after some minutes). In our real-life business workflow, the matter looks different: here we have tasks that could take potentially weeks to complete. I already checked the pricing calculator: workflows that live for 30 days or so do not lead to a cost explosion, so it seems that possibility was already accounted for.
Did anybody use SWF for some scenario like this and can share any experience? Are there any recommendations, best practices how to design a workflow like this? Is Flow the right choice here?
It seems to me that activity implementations are expected to eventually return a value synchronously; for long-running transactions, however, we would rather use messages to send back worker results asynchronously.
Any helpful feedback is appreciated.
From the Amazon Simple Workflow Service point of view an activity execution is a pair of API calls: PollForActivityTask and RespondActivityTaskCompleted that share a task token. There is no requirement for those calls coming from the same thread, process or even host.
By default, the AWS Flow Framework executes an activity synchronously. Use the @ManualActivityCompletion annotation to indicate that an activity is not complete upon return of the activity method. Such an activity should be explicitly completed (or failed) using the provided ManualActivityCompletionClient.
Here is an example taken from the AWS Flow Framework Developer Guide:
@ManualActivityCompletion
public String getName() {
  ActivityExecutionContext executionContext = contextProvider.getActivityExecutionContext();
  String taskToken = executionContext.getTaskToken();
  sendEmail("abc@xyz.com",
      "Please provide a name for the greeting message and close task with token: " + taskToken);
  return "This will not be returned to the caller";
}

public class CompleteActivityTask {
  public void completeGetNameActivity(String taskToken) {
    AmazonSimpleWorkflow swfClient = new AmazonSimpleWorkflowClient(…); // pass in user credentials
    ManualActivityCompletionClientFactory manualCompletionClientFactory =
        new ManualActivityCompletionClientFactoryImpl(swfClient);
    ManualActivityCompletionClient manualCompletionClient =
        manualCompletionClientFactory.getClient(taskToken);
    String result = "Hello World!";
    manualCompletionClient.complete(result);
  }

  public void failGetNameActivity(String taskToken, Throwable failure) {
    AmazonSimpleWorkflow swfClient = new AmazonSimpleWorkflowClient(…); // pass in user credentials
    ManualActivityCompletionClientFactory manualCompletionClientFactory =
        new ManualActivityCompletionClientFactoryImpl(swfClient);
    ManualActivityCompletionClient manualCompletionClient =
        manualCompletionClientFactory.getClient(taskToken);
    manualCompletionClient.fail(failure);
  }
}
That an activity is implemented using @ManualActivityCompletion is an implementation detail. Workflow code calls it through the same interface and doesn't treat it any differently from an activity implemented synchronously.
