SideInputs kill dataflow performance - google-cloud-dataflow

I'm using dataflow to generate a large amount of data.
I've tested two versions of my pipeline: one with a side input (of varying sizes), and the other without.
When I run the pipeline without the side input, my job will finish in about 7 minutes. When I run my job with the side input, my job will never finish.
Here's what my DoFn looks like:
public class MyDoFn extends DoFn<String, String> {
    final PCollectionView<Map<String, Iterable<TreeMap<Long, Float>>>> pCollectionView;
    final List<CSVRecord> stuff;

    private Aggregator<Integer, Integer> dofnCounter =
        createAggregator("DoFn Counter", new Sum.SumIntegerFn());

    public MyDoFn(PCollectionView<Map<String, Iterable<TreeMap<Long, Float>>>> pcv, List<CSVRecord> m) {
        this.pCollectionView = pcv;
        this.stuff = m;
    }

    @Override
    public void processElement(ProcessContext processContext) throws Exception {
        Map<String, Iterable<TreeMap<Long, Float>>> pdata = processContext.sideInput(pCollectionView);
        processContext.output(AnotherClass.generateData(stuff, pdata));
        dofnCounter.addValue(1);
    }
}
And here's what my pipeline looks like:
final Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());

PCollection<KV<String, TreeMap<Long, Float>>> data;
data = p.apply(TextIO.Read.from("gs://where_the_files_are/*").named("Reading Data"))
    .apply(ParDo.named("Parsing data").of(new DoFn<String, KV<String, TreeMap<Long, Float>>>() {
        @Override
        public void processElement(ProcessContext processContext) throws Exception {
            // Parse some data
            processContext.output(KV.of(key, value));
        }
    }));

final PCollectionView<Map<String, Iterable<TreeMap<Long, Float>>>> pcv =
    data.apply(GroupByKey.<String, TreeMap<Long, Float>>create())
        .apply(View.<String, Iterable<TreeMap<Long, Float>>>asMap());

DoFn<String, String> dofn = new MyDoFn(pcv, localList);

p.apply(TextIO.Read.from("gs://some_text.txt").named("Sizing"))
    .apply(ParDo.named("Generating the Data").withSideInputs(pcv).of(dofn))
    .apply(TextIO.Write.named("Write_out").to(outputFile));

p.run();
We've spent about two days trying various methods of getting this to work, and we've narrowed the problem down to the inclusion of the side input. Even if processElement is modified to never read the side input, the job is still very slow as long as the side input is attached; if we drop the .withSideInputs() call, it's fast again.
Just to clarify, we've tested this with side-input sizes from 20 MB to 1.5 GB.
Very grateful for any insight.
Edit
Including a few job IDs:
2016-01-20_14_31_12-1354600113427960103
2016-01-21_08_04_33-1642110636871153093 (latest)

Please try out the Dataflow SDK 1.5.0+; it should address the known performance problems behind your issue.
Side inputs in the Dataflow SDK 1.5.0+ use a new distributed format when running batch pipelines. Note that streaming pipelines, and pipelines using older versions of the Dataflow SDK, are still subject to re-reading the side input if the view cannot be cached entirely in memory.
With the new format, we use an index to provide a block-based lookup and caching strategy. Thus, when looking into a list by index or into a map by key, only the block that contains that index or key will be loaded. Having a cache size greater than the working set size will aid performance, as frequently accessed indices/keys will not require re-reading the block they are contained in.

Side inputs in the Dataflow SDK can, indeed, introduce slowness if not used carefully. Most often, this happens when each worker has to re-read the entire side input per main input element.
You seem to be using a PCollectionView created via asMap. In this case, the entire side input PCollection must fit into memory of each worker. When needed, Dataflow SDK will copy this data on each worker to create such a map.
That said, the map on each worker may be created just once or multiple times, depending on several factors. If its size is small enough (usually less than 100 MB), it is likely that the map is read only once per worker and reused across elements and across bundles. However, if its size cannot fit into our cache (or something else evicts it from the cache), the entire map may be re-read again and again on each worker. This is, most often, the root-cause of the slowness.
The cache size is controllable via PipelineOptions as follows, but due to several important bugfixes, this should be used in version 1.3.0 and later only.
DataflowWorkerHarnessOptions opts = PipelineOptionsFactory.fromArgs(args).withValidation().create().cloneAs(DataflowWorkerHarnessOptions.class);
opts.setWorkerCacheMb(500);
Pipeline p = Pipeline.create(opts);
For the time being, the fix is to change the structure of the pipeline to avoid excessive re-reading. I cannot offer you specific advice there, as you haven't shared enough information about your use case. (Please post a separate question if needed.)
We are actively working on a related feature we refer to as distributed side inputs. This will allow a lookup into the side input PCollection without constructing the entire map on the worker. It should significantly help performance in this and related cases. We expect to release this very shortly.
I didn't see anything particularly suspicious about the two jobs you have quoted. They've been cancelled relatively quickly.

I'm manually setting the cache size when creating the pipeline in the following manner:
DataflowWorkerHarnessOptions opts = PipelineOptionsFactory.fromArgs(args).withValidation().create().cloneAs(DataflowWorkerHarnessOptions.class);
opts.setWorkerCacheMb(500);
Pipeline p = Pipeline.create(opts);
For a side input of ~25 MB, this speeds up the execution time considerably (job id 2016-01-25_08_42_52-657610385797048159) vs. creating a pipeline in the manner below (job id 2016-01-25_07_56_35-14864561652521586982):
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
However, when the side input is increased to ~400 MB, no increase in cache size improves performance. Theoretically, is all the memory indicated by the GCE machine type available for use by the worker? What would invalidate or evict something from the worker cache, forcing the re-read?

Related

Dataflow: DoFn to return KV after a Join

This is the first time I'm dealing with a performance-critical Dataflow job.
While trying to fix the performance issue I'm having, I started to introduce the following kind of pattern in order to try to minimise the number of method calls.
Code before
PCollection<KV<Long, TableRow>> result =
    Join.leftOuterJoin("Joining AlfaCollection with BetaCollection",
        alfaCollection, betaCollection, Beta.newBuilder().build())
    .apply(ParDo.of(ToTableRowParDo.builder().build()))
    .apply("Keying via my_field", WithKeys.of((TableRow tr) -> tr.get("my_field")));
Code after
PCollection<KV<Long, TableRow>> result =
    Join.leftOuterJoin("Joining AlfaCollection with BetaCollection",
        alfaCollection, betaCollection, Beta.newBuilder().build())
    .apply(ParDo.of(ToTableRowKVParDo.builder().build()));
where ToTableRowKVParDo is implemented like
@Builder
public class ToTableRowKVParDo extends DoFn<KV<Long, KV<Alfa, Beta>>, KV<Long, TableRow>> {
    @ProcessElement
    public void processElement(ProcessContext c) throws Exception {
        KV<Alfa, Beta> kv = c.element().getValue();
        Alfa a = kv.getKey();
        TableRow tr = new TableRow();
        tr.set(....);
        tr.set(....);
        c.output(KV.of(a.getKey(), tr));
    }
}
That said, my pipeline is still suffering from performance issues. My question for now is: is this considered a good approach/pattern or not? And which version of the code should be considered better than the other?
Dataflow will automatically fuse these steps together, and it is generally best practice to use distinct Beam operations for distinct logical operations (just as one would with functions). It is rare for method calls to be the performance issue; instead, I would look for ways to reduce the data that needs to be shuffled, for example by filtering rows or columns before the join where possible.
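To make the shuffle-reduction point concrete, here is a minimal hedged sketch (not from the original post): it assumes alfaCollection is a PCollection<KV<Long, Alfa>> as in the question, and isActive() is a hypothetical predicate standing in for whatever condition lets you drop rows before the join.
// Sketch only: drop rows that cannot contribute to the result *before* the join,
// so less data is shuffled between workers. isActive() is a hypothetical predicate on Alfa.
PCollection<KV<Long, Alfa>> filteredAlfa = alfaCollection.apply(
    "Filter Alfa rows before join",
    Filter.by((SerializableFunction<KV<Long, Alfa>, Boolean>) kv -> kv.getValue().isActive()));

PCollection<KV<Long, KV<Alfa, Beta>>> joined = Join.leftOuterJoin(
    "Joining filtered AlfaCollection with BetaCollection",
    filteredAlfa, betaCollection, Beta.newBuilder().build());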

How to solve Duplicate values exception when I create PCollectionView<Map<String,String>>

I'm setting up a slow-changing lookup map in my Apache Beam pipeline. It continuously updates the lookup map. For each key in the lookup map, I retrieve the latest value in the global window with accumulating mode.
But it always hits this exception:
org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.lang.IllegalArgumentException: Duplicate values for mykey
Is anything wrong with this code snippet?
If I use .discardingFiredPanes() instead, I will lose information in the last emit.
pipeline
    .apply(GenerateSequence.from(0).withRate(1, Duration.standardMinutes(1L)))
    .apply(
        Window.<Long>into(new GlobalWindows())
            .triggering(Repeatedly.forever(
                AfterProcessingTime.pastFirstElementInPane()))
            .accumulatingFiredPanes())
    .apply(new ReadSlowChangingTable())
    .apply(Latest.perKey())
    .apply(View.asMap());
Example Input Trigger:
t1 : KV<k1,v1>, KV<k2,v2>
t2 : KV<k1,v1>
accumulatingFiredPanes => expected result at t2 => KV(k1,v1), KV(k2,v2), but it fails with the duplicate-values exception
discardingFiredPanes => expected result at t2 => KV(k1,v1): succeeds
Specifically with regard to the View.asMap and accumulating panes discussion in the comments:
If you would like to make use of the View.asMap side input (for example, when the source of the map elements is itself distributed, often because you are creating a side input from the output of a previous transform), there are some other factors that will need to be taken into consideration: View.asMap is itself an aggregation; it will inherit the triggering and accumulate its input. In this specific pattern, setting the pipeline to accumulating mode before this transform will result in duplicate key errors even if a transform such as Latest.perKey is used before the View.asMap transform.
Given that the read updates the whole map, I think the use of View.asSingleton would be a better approach for this use case.
Some general notes around this pattern, which will hopefully be useful for others as well:
For this pattern we can use the GenerateSequence source transform to emit a value periodically, for example once a day. Pass this value into a global window via a data-driven trigger that activates on each element. In a DoFn, use each firing as a signal to pull data from your bounded source, and create your side input for use in downstream transforms.
It's important to note that because this pattern uses a global-window side input triggering on processing time, matching to elements being processed in event time will be nondeterministic. For example, if the main pipeline is windowed on event time, the version of the side input view that those windows see will depend on the latest trigger that has fired in processing time rather than on event time.
It is also important to note that, in general, the side input should be something that fits into memory.
Java (SDK 2.9.0):
In the sample below, the side input is updated at very short intervals so that the effects can be seen easily. The expectation is that in practice the side input updates slowly, for example every few hours or once a day.
In the example code below we make use of a Map that we create in a DoFn, which becomes the View.asSingleton; this is the recommended approach for this pattern.
The sample below illustrates the pattern; please note that the View.asSingleton is rebuilt on every counter update.
public static void main(String[] args) {

    // Create pipeline
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation()
        .as(PipelineOptions.class);

    // Using View.asSingleton, this pipeline uses a dummy external service as illustration.
    // Run in debug mode to see the output.
    Pipeline p = Pipeline.create(options);

    // Create the slowly updating side input.
    PCollectionView<Map<String, String>> map = p
        .apply(GenerateSequence.from(0).withRate(1, Duration.standardSeconds(5L)))
        .apply(Window.<Long>into(new GlobalWindows())
            .triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane()))
            .discardingFiredPanes())
        .apply(ParDo.of(new DoFn<Long, Map<String, String>>() {
            @ProcessElement
            public void process(@Element Long input,
                OutputReceiver<Map<String, String>> o) {
                // Do any external reads needed here...
                // We will make use of our dummy external service.
                // Every time this triggers, the complete map will be replaced with that read from
                // the service.
                o.output(DummyExternalService.readDummyData());
            }
        }))
        .apply(View.asSingleton());

    // ---- Consume the slowly updating side input.
    // GenerateSequence is only used here to generate dummy data for this illustration.
    // You would use your real source, for example PubSubIO, KafkaIO, etc.
    p.apply(GenerateSequence.from(0).withRate(1, Duration.standardSeconds(1L)))
        .apply(Window.into(FixedWindows.of(Duration.standardSeconds(1))))
        .apply(Sum.longsGlobally().withoutDefaults())
        .apply(ParDo.of(new DoFn<Long, KV<Long, Long>>() {
            @ProcessElement
            public void process(ProcessContext c) {
                Map<String, String> keyMap = c.sideInput(map);
                c.outputWithTimestamp(KV.of(1L, c.element()), Instant.now());
                LOG.debug("Value is {} key A is {} and key B is {}",
                    c.element(), keyMap.get("Key_A"), keyMap.get("Key_B"));
            }
        }).withSideInputs(map));

    p.run();
}
public static class DummyExternalService {

    public static Map<String, String> readDummyData() {
        Map<String, String> map = new HashMap<>();
        Instant now = Instant.now();
        DateTimeFormatter dtf = DateTimeFormat.forPattern("HH:mm:ss");
        map.put("Key_A", now.minus(Duration.standardSeconds(30)).toString(dtf));
        map.put("Key_B", now.minus(Duration.standardSeconds(30)).toString());
        return map;
    }
}

State garbage collection in Beam with GlobalWindow

Apache Beam has recently introduced state cells, through StateSpec and the #StateId annotation, with partial support in Apache Flink and Google Cloud Dataflow.
I cannot find any documentation on what happens when this is used with a GlobalWindow. In particular, is there a way to have a "state garbage collection" mechanism that gets rid of state for keys that have not been seen for a while, according to some configuration, while still maintaining a single all-time state for keys that are seen frequently enough?
Or, is the amount of state used in this case going to diverge, with no way to ever reclaim state corresponding to keys that have not been seen in a while?
I am also interested in whether a potential solution would be supported in either Apache Flink or Google Cloud Dataflow.
The Flink and Direct runners seem to have some code for "state GC", but I am not really sure what it does and whether it is relevant when using a global window.
State can be automatically garbage collected by a Beam runner at some point after a window expires - when the input watermark exceeds the end of the window by the allowed lateness, so all further input is droppable. The exact details depend on the runner.
As you correctly determined, the Global window may never expire. Then this automatic collection of state will not be invoked. For bounded data, including drain scenarios, it actually will expire, but for a perpetual unbounded data source it will not.
If you are doing stateful processing on such data in the Global window, you can use user-defined timers (used through @TimerId, @OnTimer, and TimerSpec; I haven't blogged about these yet) to clear state after some timeout of your choosing. If the state represents an aggregation of some sort, then you'll want a timer anyhow to make sure your data is not stranded in state.
Here is a quick example of their use:
new DoFn<Foo, Baz>() {

    private static final String MY_TIMER = "my-timer";
    private static final String MY_STATE = "my-state";

    @StateId(MY_STATE)
    private final StateSpec<ValueState<Bizzle>> bizzleStateSpec =
        StateSpecs.value(Bizzle.coder());

    @TimerId(MY_TIMER)
    private final TimerSpec myTimerSpec =
        TimerSpecs.timer(TimeDomain.EVENT_TIME);

    @ProcessElement
    public void process(
        ProcessContext c,
        @StateId(MY_STATE) ValueState<Bizzle> bizzleState,
        @TimerId(MY_TIMER) Timer myTimer) {
        bizzleState.write(...);
        myTimer.setForNowPlus(...);
    }

    @OnTimer(MY_TIMER)
    public void onMyTimer(
        OnTimerContext context,
        @StateId(MY_STATE) ValueState<Bizzle> bizzleState) {
        context.output(... bizzleState.read() ...);
        bizzleState.clear();
    }
}
There is not automatic garbage collection of state if you use GlobalWindows. Only if you use some non-global window will state be garbage collected after the watermark passes the end of a window plus the allowed lateness.
What you can do if you must work with GlobalWindows is to manually keep as state the last update timestamp. Then you would periodically set a timer where you check this timestamp against the current time and delete state if necessary. You would set this timer when encountering a key for the first time (which you can see from the absence of your timestamp state) and then re-set it in the @OnTimer method.
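A minimal sketch of this idea (not the answerer's exact code): keep the last-seen time in a state cell, arm a processing-time timer when a key is first seen, and in the timer callback either clear the state or re-arm the timer. Foo, Baz, and the one-hour timeout are hypothetical placeholders.
new DoFn<KV<String, Foo>, Baz>() {
    // How long a key may stay idle before its state is dropped (placeholder value).
    private static final Duration EXPIRY = Duration.standardHours(1);

    @StateId("lastSeen")
    private final StateSpec<ValueState<Long>> lastSeenSpec = StateSpecs.value(VarLongCoder.of());

    @TimerId("cleanup")
    private final TimerSpec cleanupSpec = TimerSpecs.timer(TimeDomain.PROCESSING_TIME);

    @ProcessElement
    public void process(
        ProcessContext c,
        @StateId("lastSeen") ValueState<Long> lastSeen,
        @TimerId("cleanup") Timer cleanup) {
        if (lastSeen.read() == null) {
            // First time we see this key: arm the cleanup timer.
            cleanup.offset(EXPIRY).setRelative();
        }
        lastSeen.write(System.currentTimeMillis());
        // ... update your real state and emit output here ...
    }

    @OnTimer("cleanup")
    public void onCleanup(
        @StateId("lastSeen") ValueState<Long> lastSeen,
        @TimerId("cleanup") Timer cleanup) {
        Long last = lastSeen.read();
        if (last == null || System.currentTimeMillis() - last >= EXPIRY.getMillis()) {
            // Key has been idle long enough: drop its state (clear any other @StateId cells here too).
            lastSeen.clear();
        } else {
            // Key was active recently: check again later.
            cleanup.offset(EXPIRY).setRelative();
        }
    }
}
An event-time timer would work just as well if "not seen for a while" is meant in event time rather than processing time.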

Is it possible to keep track of the state when processing logs by DataFlow?

Is it possible to have the Dataflow process maintain state? There are log-processing tools that allow for that by providing fast-access (proprietary / in-memory) files, available to the real-time process so it can keep track of its state on the logs while processing them.
A use case example would be tracking the registration steps taken by users. The registration steps would come in different logs, and the data from those logs would be assembled by the real-time process into one final database record (for each registered user) that is written to a database.
Can my Dataflow code keep track of the many registration steps (streaming input) per user, and once a user's registration steps are completed, have the Dataflow process write the record to the database (one record per user)?
I don't know much about the Dataflow architecture. It must be using some (proprietary / in-memory NoSQL) data storage for keeping track of things it needs to keep track of (e.g. when it tries to produce the top 100 customers). Is that fast-access data storage also available to the Dataflow processes to use?
Thanks
As danielm said, state is not yet exposed. The good news is you may not need it for your use case.
If you have a PCollection<KV<UserId, LogEvent>> you can use a CombineFn and Combine.perKey to take all of the LogEvents for a specific UserId and combine them into a single output. The CombineFn tells Dataflow how to create an accumulator, update it by incorporating input elements, and then extract a final output. Transforms like Top actually use a CombineFn (with a Heap as the accumulator) rather than an actual state API.
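As a rough sketch of that approach (UserId, LogEvent, and RegistrationRecord are hypothetical stand-ins for your own types, and in a real pipeline you may need to provide a coder for the list accumulator):
PCollection<KV<UserId, LogEvent>> events = ...;

PCollection<KV<UserId, RegistrationRecord>> records = events.apply(
    Combine.<UserId, LogEvent, RegistrationRecord>perKey(
        new Combine.CombineFn<LogEvent, List<LogEvent>, RegistrationRecord>() {
            @Override
            public List<LogEvent> createAccumulator() {
                // Start with an empty set of registration steps for the user.
                return new ArrayList<>();
            }

            @Override
            public List<LogEvent> addInput(List<LogEvent> accumulator, LogEvent input) {
                accumulator.add(input);
                return accumulator;
            }

            @Override
            public List<LogEvent> mergeAccumulators(Iterable<List<LogEvent>> accumulators) {
                List<LogEvent> merged = new ArrayList<>();
                for (List<LogEvent> acc : accumulators) {
                    merged.addAll(acc);
                }
                return merged;
            }

            @Override
            public RegistrationRecord extractOutput(List<LogEvent> accumulator) {
                // Assemble one final record per user from all of their registration steps.
                return RegistrationRecord.fromEvents(accumulator);
            }
        }));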
If your events are of different types, you can still do something like this. For instance, if you have two logs, you can do:
PCollection<KV<UserId, LogEvent1>> events1 = ...;
PCollection<KV<UserId, LogEvent2>> events2 = ...;

// Create tuple tags for the value types in each collection.
final TupleTag<LogEvent1> tag1 = new TupleTag<LogEvent1>();
final TupleTag<LogEvent2> tag2 = new TupleTag<LogEvent2>();

// Merge collection values into a CoGbkResult collection.
PCollection<KV<UserId, CoGbkResult>> coGbkResultCollection =
    KeyedPCollectionTuple.of(tag1, events1)
        .and(tag2, events2)
        .apply(CoGroupByKey.<UserId>create());

// Access results and do something.
PCollection<T> finalResultCollection =
    coGbkResultCollection.apply(ParDo.of(
        new DoFn<KV<UserId, CoGbkResult>, T>() {
            @Override
            public void processElement(ProcessContext c) {
                KV<UserId, CoGbkResult> e = c.element();
                // Get all LogEvent1 values.
                Iterable<LogEvent1> event1s = e.getValue().getAll(tag1);
                // There will only be one LogEvent2.
                LogEvent2 event2 = e.getValue().getOnly(tag2);
                ... Do Something to compute T ....
                c.output(...some T...);
            }
        }));
The above example was adapted from the docs on CoGroupByKey, which have more information.
Dataflow does not currently expose the underlying state mechanism that it uses. However, this is definitely on the radar for a future update.

Cloud Dataflow to BigQuery - too many sources

I have a job that, among other things, also inserts some of the data it reads from files into a BigQuery table for later manual analysis.
It fails with the following error:
job error: Too many sources provided: 10001. Limit is 10000., error: Too many sources provided: 10001. Limit is 10000.
What does it refer to as "source"? Is it a file or a pipeline step?
Thanks,
G
I'm guessing the error is coming from BigQuery and means that we are trying to upload too many files when we create your output table.
Could you provide some more details on the error / context (like a snippet of the commandline output (if using the BlockingDataflowPipelineRunner) so I can confirm? A jobId would also be helpful.
Is there something about your pipeline structure that is going to result in a large number of output files? That could either be a large amount of data or perhaps finely sharded input files without a subsequent GroupByKey operation (which would let us reshard the data into larger pieces).
The note in "In Google Cloud Dataflow BigQueryIO.Write occur Unknown Error (http code 500)" mitigates this issue:
Dataflow SDK for Java 1.x: as a workaround, you can enable this experiment: --experiments=enable_custom_bigquery_sink
In Dataflow SDK for Java 2.x, this behavior is default and no experiments are necessary.
Note that in both versions, temporary files in GCS may be left over if your job fails.
As a workaround on the pipeline side, you can force a GroupByKey so the data is resharded into larger pieces before writing, for example with a transform like this:
public static class ForceGroupBy<T> extends PTransform<PCollection<T>, PCollection<KV<T, Iterable<Void>>>> {
    private static final long serialVersionUID = 1L;

    @Override
    public PCollection<KV<T, Iterable<Void>>> apply(PCollection<T> input) {
        PCollection<KV<T, Void>> syntheticGroup = input.apply(
            ParDo.of(new DoFn<T, KV<T, Void>>() {
                private static final long serialVersionUID = 1L;

                @Override
                public void processElement(DoFn<T, KV<T, Void>>.ProcessContext c) throws Exception {
                    c.output(KV.of(c.element(), (Void) null));
                }
            }));
        return syntheticGroup.apply(GroupByKey.<T, Void>create());
    }
}
