Access sideinput inside ParDo - google-cloud-dataflow

I am new to Apache Beam and am investigating side inputs for one of our use cases. Below is the code.
PipelineOptions options =
    PipelineOptionsFactory.fromArgs().as(PipelineOptions.class);
Pipeline pipeline = Pipeline.create(options);
final List<String> sideInput = Arrays.asList("1", "2", "3", "4");
final List<String> input = Arrays.asList("a", "b", "c", "d");
PCollectionView<List<String>> sideinput =
    pipeline.apply("readInput", Create.of(sideInput)).apply(View.asList());
pipeline.apply("read", Create.of(input))
    .apply("process", ParDo.of(new DoFn<String, String>() {
      @ProcessElement
      public void process(ProcessContext pc) {
        System.out.println("processing element:" + pc.element());
        List<String> list = pc.sideInput(sideinput);
        for (String element : list) {
          System.out.print(element);
        }
        System.out.println("");
      }
    }).withSideInputs(sideinput));
pipeline.run();
I expect it to print all the side input elements after each element, e.g.:
processing element:d
1234
processing element:c
1234
processing element:a
1234
processing element:b
1234
However the results are different each time:
processing element:d
processing element:a
processing element:c
processing element:b
44441113312
2
32
32
Or
processing element:c
processing element:d
processing element:b
processing element:a
444422233211
31
31

That is expected: in a distributed environment there is no guarantee on the order in which input elements are processed, nor on how the output of concurrently running workers is interleaved. To get the output you expect, concatenate the main element and the side input elements into one string and write it out in a single call.
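A minimal sketch of the "write it out in one shot" idea, in plain Java without any Beam dependency so the string-building logic can be run in isolation. The class and method names (SingleWriteSketch, formatRecord) are hypothetical; inside the DoFn above, the call would be something like System.out.println(formatRecord(pc.element(), pc.sideInput(sideinput))).

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: builds the whole record as one string so that it can
// be emitted with a single println call instead of several print calls.
final class SingleWriteSketch {

    static String formatRecord(String element, List<String> sideInput) {
        StringBuilder sb = new StringBuilder("processing element:")
            .append(element)
            .append('\n');
        for (String s : sideInput) {
            sb.append(s);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // One call per record: the header and the side input values stay together.
        System.out.println(formatRecord("a", Arrays.asList("1", "2", "3", "4")));
    }
}
```

The order in which the records for a, b, c and d appear is still not guaranteed; only the two lines belonging to one record are kept together.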

Related

Iterate Keys with Values for Beam pipeline

After applying .apply(GroupByKey.create()) I get values of type PCollection<KV<Integer, Iterable>>. Can you suggest how to apply further transforms for each key?
Ex: PCollection<KV<1, Iterable>>
PCollection<KV<2, Iterable>>
The keys are dynamic values. I need to iterate over the values for each key present in the PCollection.
You should be able to use a DoFn / ParDo to iterate over such an Iterable.
I drafted a quick example to show how this can be done.
// Create sample rows
PCollection<TableRow> rows =
    pipeline
        .apply(
            Create.of(
                new TableRow().set("group", 1).set("name", "Dataflow"),
                new TableRow().set("group", 1).set("name", "Pub/Sub"),
                new TableRow().set("group", 2).set("name", "BigQuery"),
                new TableRow().set("group", 2).set("name", "Vertex")))
        .setCoder(TableRowJsonCoder.of());
// Convert into a KV of <group, name>
PCollection<KV<Integer, String>> keyValues =
    rows.apply(
        "Key",
        MapElements.into(
                TypeDescriptors.kvs(TypeDescriptors.integers(), TypeDescriptors.strings()))
            .via(row -> KV.of((Integer) row.get("group"), (String) row.get("name"))));
// Group by key
PCollection<KV<Integer, Iterable<String>>> groups =
    keyValues.apply("Group", GroupByKey.create());
// Iterate and print group + values
groups.apply(
    ParDo.of(
        new DoFn<KV<Integer, Iterable<String>>, Void>() {
          @ProcessElement
          public void processElement(@Element KV<Integer, Iterable<String>> kv) {
            StringBuilder sb = new StringBuilder();
            for (String name : kv.getValue()) {
              if (sb.length() > 0) {
                sb.append(", ");
              }
              sb.append(name);
            }
            System.out.println("Group " + kv.getKey() + " values: " + sb);
          }
        }));
pipeline.run();
This prints the following (the order of the lines is not guaranteed, because groups are processed concurrently):
Group 2 values: BigQuery, Vertex
Group 1 values: Dataflow, Pub/Sub

Join the collection using SideInput

I am trying to join two PCollections using a side input. In the ParDo function, when looking up a key in the side input collection, we may get multiple matching records as a collection. How should that collection be handled, and how can those values be returned to the output PCollection?
It would be great if someone could help with this case. Here is the code snippet that I tried.
PCollection<TableRow> pc1 = ...;
PCollection<Row> pc1Rows = pc1.apply(
ParDo.of(new fnConvertTableRowToRow())).setRowSchema(schemaPc1);
PCollection<KV<Integer, Row>> keyed_pc1Rows = pc1Rows.apply(
WithKeys.of(new SerializableFunction<Row, Integer>() {
public Integer apply(Row s) {
return Integer.parseInt(s.getValue("LOCATION_ID").toString());
}
}));
PCollection<TableRow> pc2 = ...;
PCollection<Row> pc2Rows = pc2.apply(
ParDo.of(new fnConvertTableRowToRow())).setRowSchema(schemaPc2);
PCollection<KV<Integer, Iterable<Row>>> keywordGroups = pc2Rows.apply(
new fnGroupKeyWords());
PCollectionView<Map<Integer, Iterable<Row>>> sideInputView =
keywordGroups.apply("Side Input",
View.<Integer, Iterable<Row>>asMap());
PCollection<Row> finalResultCollection = keyed_pc1Rows.apply("Process",
ParDo.of(new DoFn<KV<Integer,Row>, Row>() {
@ProcessElement
public void processElement(ProcessContext c) {
Integer key = Integer.parseInt(c.element().getKey().toString());
Row leftRow = c.element().getValue();
Map<Integer, Iterable<Row>> key2Rows = c.sideInput(sideInputView);
Iterable<Row> rightRowsIterable = key2Rows.get(key);
for (Iterator<Row> i = rightRowsIterable.iterator(); i.hasNext(); ) {
Row suit = (Row) i.next();
Row targetRow = Row.withSchema(schemaOutput)
.addValues(leftRow.getValues())
.addValues(suit.getValues())
.build();
c.output(targetRow);
}
}
}).withSideInputs(sideInputView));
public static class fnGroupKeyWords extends
PTransform<PCollection<Row>, PCollection<KV<Integer, Iterable<Row>>>> {
@Override
public PCollection<KV<Integer, Iterable<Row>>> expand(
PCollection<Row> rows) {
PCollection<KV<Integer, Row>> kvs = rows.apply(
ParDo.of(new TransferKeyValueFn()));
PCollection<KV<Integer, Iterable<Row>>> group = kvs.apply(
GroupByKey.<Integer, Row> create());
return group;
}
}
public static class TransferKeyValueFn extends
DoFn<Row, KV<Integer, Row>> {
@ProcessElement
public void processElement(ProcessContext c) throws ParseException {
Row tRow = c.element();
c.output(
KV.of(
Integer.parseInt(tRow.getValue("DW_LOCATION_ID").toString()),
tRow));
}
}
If you wish to join two PCollections on a common key, CoGroupByKey might make more sense. Please consider this approach instead of side inputs.
This blog post also has a great explanation.
I think the side input suggestion would perform well if you have a very small collection that fits into memory. You could use it as a side input with View.asMultimap. Then, in a ParDo processing the larger PCollection (after a GBK, to give you an iterable over all elements for the key), look up the key you are interested in from the side input. Here is an example test pipeline using a multimap PCollection.
However, if your collection is quite large, using Flatten to combine both PCollections would be a better approach, followed by a GroupByKey, which gives you an iterable of the elements under the same key. Each key's iterable is still processed sequentially, so I believe you will have performance issues unless you eliminate the hot key. Please see the explanation of using combiners to alleviate this.
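The side input lookup described above amounts to a one-to-many map lookup per main-input element, emitting one output per match. Here is a minimal plain-Java sketch of just that logic, with no Beam dependency; the class and method names (LookupJoinSketch, joinOne) and the string-based rows are hypothetical. In a pipeline, the map would come from c.sideInput(...) and each joined value would be passed to c.output(...).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a lookup join: one left row, zero or more right rows.
final class LookupJoinSketch {

    // Returns one joined record per right-side match; an unmatched key yields
    // an empty list instead of a NullPointerException.
    static List<String> joinOne(int key, String left, Map<Integer, List<String>> lookup) {
        List<String> out = new ArrayList<>();
        for (String right : lookup.getOrDefault(key, Collections.<String>emptyList())) {
            out.add(left + "|" + right);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> lookup = new HashMap<>();
        lookup.put(1, Arrays.asList("kw-a", "kw-b"));
        System.out.println(joinOne(1, "loc-1", lookup)); // two joined records
        System.out.println(joinOne(2, "loc-2", lookup)); // no match, empty list
    }
}
```

Note that the question's code iterates over key2Rows.get(key) directly, which would fail for a key that is missing from the side input; the default empty list above is one way to guard against that.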

Live monitoring using Apache Beam

I'd like to accomplish the following using Apache Beam:
Every 5 seconds, count the events that were read from Pub/Sub in the last minute.
The goal is to have a semi-realtime view on the rate data comes in. This can then be expanded towards more complex use cases afterwards.
After searching, I've not come across a way to solve this seemingly simple problem. Things that do not work:
global window + repeated triggers (triggers do not fire when there is no input)
sliding window + withoutDefaults (does not allow empty windows to be emitted apparently)
Any suggestion on how to solve this problem?
As already discussed, Beam does not emit data for empty windows. In addition to the reasons given by Rui Wang, there is the challenge of how later stages would handle those empty panes.
Still, the specific use case that you describe (monitoring a rolling count of the number of messages) should be possible with some work, even if the metric eventually drops to zero. One possibility is to publish a steady stream of dummy messages, which advance the watermark and fire the panes but are filtered out later in the pipeline. The problem with this approach is that the publishing source needs to be adapted, which might not always be convenient or possible. Another possibility is to generate this fake data as a second input and co-group it with the main stream. The advantage is that everything can be done in Dataflow without tweaking the source or the sink. To illustrate this, I provide an example.
The inputs are divided into two streams. For the dummy one, I used GenerateSequence to create a new element every 5 seconds. I then window the PCollection (the windowing strategy needs to be compatible with the one for the main stream, so I use the same one). Then I map each element to a key-value pair whose value is 0 (we could use other values, since we know which stream the element comes from, but I want to make it clear that dummy records are not counted).
PCollection<KV<String,Integer>> dummyStream = p
    .apply("Generate Sequence", GenerateSequence.from(0).withRate(1, Duration.standardSeconds(5)))
    .apply("Window Messages - Dummy", Window.<Long>into(
        ...
    .apply("Count Messages - Dummy", ParDo.of(new DoFn<Long, KV<String, Integer>>() {
      @ProcessElement
      public void processElement(ProcessContext c) throws Exception {
        c.output(KV.of("num_messages", 0));
      }
    }));
For the main stream, that reads from Pub/Sub, I map each record to value 1. Later on, I will add all the ones as in typical word count examples using map-reduce stages.
PCollection<KV<String,Integer>> mainStream = p
    .apply("Get Messages - Data", PubsubIO.readStrings().fromTopic(topic))
    .apply("Window Messages - Data", Window.<String>into(
        ...
    .apply("Count Messages - Data", ParDo.of(new DoFn<String, KV<String, Integer>>() {
      @ProcessElement
      public void processElement(ProcessContext c) throws Exception {
        c.output(KV.of("num_messages", 1));
      }
    }));
Then we need to join them using a CoGroupByKey (I used the same num_messages key to group the counts). This stage outputs results when at least one of the two inputs has elements, which resolves the main issue here (empty windows with no Pub/Sub messages).
final TupleTag<Integer> dummyTag = new TupleTag<>();
final TupleTag<Integer> dataTag = new TupleTag<>();
PCollection<KV<String, CoGbkResult>> coGbkResultCollection = KeyedPCollectionTuple.of(dummyTag, dummyStream)
.and(dataTag, mainStream).apply(CoGroupByKey.<String>create());
Finally, we add all the ones to obtain the total number of messages for the window. If there are no elements coming from dataTag then the sum will just default to 0.
public void processElement(ProcessContext c, BoundedWindow window) {
  // A primitive int avoids the deprecated new Integer(0) constructor.
  int totalSum = 0;
  Iterable<Integer> dataTagVal = c.element().getValue().getAll(dataTag);
  for (Integer val : dataTagVal) {
    totalSum += val;
  }
  LOG.info("Window: " + window + ", Number of messages: " + totalSum);
}
This should result in one log line per window, each giving the window and its number of messages.
Note that results from different windows can arrive out of order (this can happen anyway when writing to BigQuery), and I did not tune the window settings to optimize the example.
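To make the intended result concrete, here is a plain-Java sketch (no Beam dependency) of the rolling count that the pipeline is meant to produce: for each tick, count the events whose timestamp falls in the preceding window, and emit a value even when that count is zero. The names (RollingCountSketch, rollingCounts) and the use of plain seconds as timestamps are assumptions made for illustration only; this models the expected semantics, not how Beam computes them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of "every periodSec seconds, count events in the last windowSec seconds".
final class RollingCountSketch {

    // For each tick in [firstTick, lastTick], counts events with a timestamp in
    // the half-open interval [tick - windowSec, tick). Zero counts are kept.
    static List<Integer> rollingCounts(
            long[] eventSecs, long firstTick, long lastTick, long periodSec, long windowSec) {
        List<Integer> counts = new ArrayList<>();
        for (long tick = firstTick; tick <= lastTick; tick += periodSec) {
            int count = 0;
            for (long t : eventSecs) {
                if (t >= tick - windowSec && t < tick) {
                    count++;
                }
            }
            counts.add(count); // emitted even for an empty window
        }
        return counts;
    }

    public static void main(String[] args) {
        long[] events = {1, 2, 3, 58};
        // Ticks at 60, 90 and 120 seconds with a 60-second window.
        System.out.println(rollingCounts(events, 60, 120, 30, 60)); // prints [4, 1, 0]
    }
}
```

The trailing 0 is the case that plain windowing alone does not produce, and it is what the dummy stream in the example above is there to provide.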
Full code:
public class EmptyWindows {
private static final Logger LOG = LoggerFactory.getLogger(EmptyWindows.class);
public static interface MyOptions extends PipelineOptions {
@Description("Input topic")
String getInput();
void setInput(String s);
}
@SuppressWarnings("serial")
public static void main(String[] args) {
MyOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(MyOptions.class);
Pipeline p = Pipeline.create(options);
String topic = options.getInput();
PCollection<KV<String,Integer>> mainStream = p
.apply("Get Messages - Data", PubsubIO.readStrings().fromTopic(topic))
.apply("Window Messages - Data", Window.<String>into(
SlidingWindows.of(Duration.standardMinutes(1))
.every(Duration.standardSeconds(5)))
.triggering(AfterWatermark.pastEndOfWindow())
.withAllowedLateness(Duration.ZERO)
.accumulatingFiredPanes())
.apply("Count Messages - Data", ParDo.of(new DoFn<String, KV<String, Integer>>() {
@ProcessElement
public void processElement(ProcessContext c) throws Exception {
//LOG.info("New data element in main output");
c.output(KV.of("num_messages", 1));
}
}));
PCollection<KV<String,Integer>> dummyStream = p
.apply("Generate Sequence", GenerateSequence.from(0).withRate(1, Duration.standardSeconds(5)))
.apply("Window Messages - Dummy", Window.<Long>into(
SlidingWindows.of(Duration.standardMinutes(1))
.every(Duration.standardSeconds(5)))
.triggering(AfterWatermark.pastEndOfWindow())
.withAllowedLateness(Duration.ZERO)
.accumulatingFiredPanes())
.apply("Count Messages - Dummy", ParDo.of(new DoFn<Long, KV<String, Integer>>() {
@ProcessElement
public void processElement(ProcessContext c) throws Exception {
//LOG.info("New dummy element in main output");
c.output(KV.of("num_messages", 0));
}
}));
final TupleTag<Integer> dummyTag = new TupleTag<>();
final TupleTag<Integer> dataTag = new TupleTag<>();
PCollection<KV<String, CoGbkResult>> coGbkResultCollection = KeyedPCollectionTuple.of(dummyTag, dummyStream)
.and(dataTag, mainStream).apply(CoGroupByKey.<String>create());
coGbkResultCollection
.apply("Log results", ParDo.of(new DoFn<KV<String, CoGbkResult>, Void>() {
@ProcessElement
public void processElement(ProcessContext c, BoundedWindow window) {
// A primitive int avoids the deprecated new Integer(0) constructor.
int totalSum = 0;
Iterable<Integer> dataTagVal = c.element().getValue().getAll(dataTag);
for (Integer val : dataTagVal) {
totalSum += val;
}
LOG.info("Window: " + window + ", Number of messages: " + totalSum);
}
}));
p.run();
}
}
Another way to approach this problem is to use a stateful DoFn with a looping timer that fires at each 5-second tick. The looping timer generates the default data needed for live monitoring and ensures that each window has at least one event to process.
One issue with the approach described by https://stackoverflow.com/a/54543527/430128 is that, in a system with multiple keys, these "dummy" events need to be generated for every key.
See https://beam.apache.org/blog/looping-timers/. Option 1 and 2 in that article are an external heartbeat source and a generated source in the beam pipeline respectively. Option 3 is the looping timer.

Conditional skip in google cloud-dataflow java pipeline

I have the code:
maxHotelEntry.apply("Convert to string", ToString.elements()).apply("Write to file", TextIO.write().to( config.getString( "gcs.checkPointLocation")).withoutSharding());
This PCollection can be empty, or contain -2147483648 (Integer.MIN_VALUE) or a positive value such as 124526.
I do not want to write to checkPointLocation if it is empty or if its value is less than 0.
One option is writing to GCS inside a DoFn, but I do not know how to do this.
Add an additional step to filter out the elements you don't want to write:
maxHotelEntry
    .apply("Convert to string", ToString.elements())
    .apply("Filter out elements", ParDo
        .of(new DoFn<String, String>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            String element = c.element();
            // Replace with your own filter. This condition assumes each element
            // is the string form of an integer and keeps only non-negative values.
            if (Integer.parseInt(element) >= 0) {
              c.output(element);
            }
          }
        }))
    .apply("Write to file", TextIO.write().to( config.getString( "gcs.checkPointLocation")).withoutSharding());
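The filter condition itself can be kept in a small helper that is easy to test on its own. Below is a minimal plain-Java sketch, assuming each element is the string form of an integer as in the question; the names (CheckpointFilterSketch, shouldWrite) are hypothetical, and treating unparseable values as "do not write" is an assumption rather than something the question specifies.

```java
// Hypothetical predicate for the filter step: keep only non-negative integers.
final class CheckpointFilterSketch {

    static boolean shouldWrite(String element) {
        if (element == null || element.trim().isEmpty()) {
            return false; // nothing to write
        }
        try {
            // Long is used so that values outside the int range do not throw.
            return Long.parseLong(element.trim()) >= 0;
        } catch (NumberFormatException e) {
            return false; // assumption: skip values that are not integers
        }
    }

    public static void main(String[] args) {
        System.out.println(shouldWrite("124526"));      // prints true
        System.out.println(shouldWrite("-2147483648")); // prints false
        System.out.println(shouldWrite(""));            // prints false
    }
}
```

Such a predicate could be called from the if statement in the DoFn above, or passed to Beam's Filter.by transform as a serializable function.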

Stateful ParDo not working on Dataflow Runner

Based on the Javadocs and the blog post at https://beam.apache.org/blog/2017/02/13/stateful-processing.html, I tried a simple de-duplication example using the 2.0.0-beta-2 SDK. It reads a file from GCS (containing a list of JSON objects, each with a user_id field) and runs it through a pipeline as explained below.
The input data contains about 146K events, of which only 50 are unique. The entire input is about 50MB, which should be processable in considerably less time than the 2-minute fixed window. I placed a window there only to make sure the per-key-per-window semantics hold without using a GlobalWindow. I run the windowed data through 3 parallel stages to compare the results, each of which is explained below.
1. Just copies the contents into a new file on GCS. This ensures all the events were processed as expected, and I verified that the contents are exactly the same as the input.
2. Combine.perKey on the user_id, picking only the first element from the Iterable. This should essentially deduplicate the data, and it works as expected: the resulting file has the exact number of unique items from the original list of events, 50 elements.
3. Stateful ParDo which checks whether the key has been seen already and emits an output only when it has not. Ideally, the result should match the deduplicated data from [2], but all I am seeing is 3 unique events. These 3 events always point to the same 3 user_ids across the few runs I did.
Interestingly, when I switch from the DataflowRunner to the DirectRunner and run the whole process locally, the output from [3] matches [2], with only 50 unique elements as expected. So I suspect there may be an issue with the DataflowRunner for stateful ParDo.
public class StatefulParDoSample {
private static Logger logger = LoggerFactory.getLogger(StatefulParDoSample.class.getName());
static class StatefulDoFn extends DoFn<KV<String, String>, String> {
final Aggregator<Long, Long> processedElements = createAggregator("processed", Sum.ofLongs());
final Aggregator<Long, Long> skippedElements = createAggregator("skipped", Sum.ofLongs());
@StateId("keyTracker")
private final StateSpec<Object, ValueState<Integer>> keyTrackerSpec =
StateSpecs.value(VarIntCoder.of());
@ProcessElement
public void processElement(
ProcessContext context,
@StateId("keyTracker") ValueState<Integer> keyTracker) {
processedElements.addValue(1L);
final String userId = context.element().getKey();
int wasSeen = firstNonNull(keyTracker.read(), 0);
if (wasSeen == 0) {
keyTracker.write( 1);
context.output(context.element().getValue());
} else {
keyTracker.write(wasSeen + 1);
skippedElements.addValue(1L);
}
}
}
public static void main(String[] args) {
DataflowPipelineOptions pipelineOptions = PipelineOptionsFactory.create().as(DataflowPipelineOptions.class);
pipelineOptions.setRunner(DataflowRunner.class);
pipelineOptions.setProject("project-name");
pipelineOptions.setStagingLocation(GCS_STAGING_LOCATION);
pipelineOptions.setStreaming(false);
pipelineOptions.setAppName("deduper");
Pipeline p = Pipeline.create(pipelineOptions);
final ObjectMapper mapper = new ObjectMapper();
PCollection<KV<String, String>> keyedEvents =
p
.apply(TextIO.Read.from(GCS_SAMPLE_INPUT_FILE_PATH))
.apply(WithKeys.of(new SerializableFunction<String, String>() {
@Override
public String apply(String input) {
try {
Map<String, Object> eventJson =
mapper.readValue(input, Map.class);
return (String) eventJson.get("user_id");
} catch (Exception e) {
}
return "";
}
}))
.apply(
Window.into(
FixedWindows.of(Duration.standardMinutes(2))
)
);
keyedEvents
.apply(ParDo.of(new StatefulDoFn()))
.apply(TextIO.Write.to(GCS_SAMPLE_OUTPUT_FILE_PATH).withNumShards(1));
keyedEvents
.apply(Values.create())
.apply(TextIO.Write.to(GCS_SAMPLE_COPY_FILE_PATH).withNumShards(1));
keyedEvents
.apply(Combine.perKey(new SerializableFunction<Iterable<String>, String>() {
@Override
public String apply(Iterable<String> input) {
return !input.iterator().hasNext() ? "empty" : input.iterator().next();
}
}))
.apply(Values.create())
.apply(TextIO.Write.to(GCS_SAMPLE_COMBINE_FILE_PATH).withNumShards(1));
PipelineResult result = p.run();
result.waitUntilFinish();
}
}
This was a bug in the Dataflow service in batch mode, fixed in the upcoming 0.6.0 Beam release (or HEAD if you track the bleeding edge).
Thank you for bringing it to my attention! For reference, or if anything else comes up, this was tracked by BEAM-1611.

Resources