Iterate Keys with Values for Beam pipeline - google-cloud-dataflow

After applying .apply(GroupByKey.create()), I get a PCollection<KV<Integer, Iterable>>. Can you suggest how to apply further transforms to the values of each key?
Ex: PCollection<KV<1, Iterable>>
PCollection<KV<2, Iterable>>
The keys are dynamic values. I need to iterate over the values for each key present in the PCollection.

You should be able to use a ParDo with a DoFn to iterate over such an Iterable.
I drafted a quick example to show how this can be done.
// Create sample rows
PCollection<TableRow> rows =
pipeline
.apply(
Create.of(
new TableRow().set("group", 1).set("name", "Dataflow"),
new TableRow().set("group", 1).set("name", "Pub/Sub"),
new TableRow().set("group", 2).set("name", "BigQuery"),
new TableRow().set("group", 2).set("name", "Vertex")))
.setCoder(TableRowJsonCoder.of());
// Convert into a KV of <group, name>
PCollection<KV<Integer, String>> keyValues =
rows.apply(
"Key",
MapElements.into(
TypeDescriptors.kvs(TypeDescriptors.integers(), TypeDescriptors.strings()))
.via(row -> KV.of((Integer) row.get("group"), (String) row.get("name"))));
// Group by key
PCollection<KV<Integer, Iterable<String>>> groups =
keyValues.apply("Group", GroupByKey.create());
// Iterate and print group + values
groups.apply(
ParDo.of(
new DoFn<KV<Integer, Iterable<String>>, Void>() {
@ProcessElement
public void processElement(@Element KV<Integer, Iterable<String>> kv) {
StringBuilder sb = new StringBuilder();
for (String name : kv.getValue()) {
if (sb.length() > 0) {
sb.append(", ");
}
sb.append(name);
}
System.out.println("Group " + kv.getKey() + " values: " + sb);
}
}));
pipeline.run();
Prints the following (the order of the output lines is not guaranteed, because elements are processed concurrently):
Group 2 values: BigQuery, Vertex
Group 1 values: Dataflow, Pub/Sub

Related

Join the collection using SideInput

I am trying to join two PCollections using a side input. In the ParDo function, while mapping a value, the side input collection may return multiple matching records as a collection. In that case, how should I handle the collection, and how do I output those values to the PCollection?
It would be good if someone could help solve this case. Here is the code snippet that I tried.
PCollection<TableRow> pc1 = ...;
PCollection<Row> pc1Rows = pc1.apply(
ParDo.of(new fnConvertTableRowToRow())).setRowSchema(schemaPc1);
PCollection<KV<Integer, Row>> keyed_pc1Rows = pc1Rows.apply(
WithKeys.of(new SerializableFunction<Row, Integer>() {
public Integer apply(Row s) {
return Integer.parseInt(s.getValue("LOCATION_ID").toString());
}
}));
PCollection<TableRow> pc2 = ...;
PCollection<Row> pc2Rows = pc2.apply(
ParDo.of(new fnConvertTableRowToRow())).setRowSchema(schemaPc2);
PCollection<KV<Integer, Iterable<Row>>> keywordGroups = pc2Rows.apply(
new fnGroupKeyWords());
PCollectionView<Map<Integer, Iterable<Row>>> sideInputView =
keywordGroups.apply("Side Input",
View.<Integer, Iterable<Row>>asMap());
PCollection<Row> finalResultCollection = keyed_pc1Rows.apply("Process",
ParDo.of(new DoFn<KV<Integer,Row>, Row>() {
@ProcessElement
public void processElement(ProcessContext c) {
Integer key = Integer.parseInt(c.element().getKey().toString());
Row leftRow = c.element().getValue();
Map<Integer, Iterable<Row>> key2Rows = c.sideInput(sideInputView);
Iterable<Row> rightRowsIterable = key2Rows.get(key);
for (Iterator<Row> i = rightRowsIterable.iterator(); i.hasNext(); ) {
Row suit = (Row) i.next();
Row targetRow = Row.withSchema(schemaOutput)
.addValues(leftRow.getValues())
.addValues(suit.getValues())
.build();
c.output(targetRow);
}
}
}).withSideInputs(sideInputView));
public static class fnGroupKeyWords extends
PTransform<PCollection<Row>, PCollection<KV<Integer, Iterable<Row>>>> {
@Override
public PCollection<KV<Integer, Iterable<Row>>> expand(
PCollection<Row> rows) {
PCollection<KV<Integer, Row>> kvs = rows.apply(
ParDo.of(new TransferKeyValueFn()));
PCollection<KV<Integer, Iterable<Row>>> group = kvs.apply(
GroupByKey.<Integer, Row> create());
return group;
}
}
public static class TransferKeyValueFn extends
DoFn<Row, KV<Integer, Row>> {
@ProcessElement
public void processElement(ProcessContext c) throws ParseException {
Row tRow = c.element();
c.output(
KV.of(
Integer.parseInt(tRow.getValue("DW_LOCATION_ID").toString()),
tRow));
}
}
If you wish to join two PCollections on a common key, CoGroupByKey might make more sense. Please consider this approach instead of side inputs.
This blog post also has a great explanation.
I think the side input suggestion would perform well if you have a very small collection that fits into memory. You could use it as a side input with View.asMultimap. Then, in a ParDo processing the larger PCollection (after a GBK, to give you an iterable over all elements for the key), look up the key you are interested in from the side input. Here is an example test pipeline using a multimap PCollection.
However, if your collection is quite large, using Flatten to combine both PCollections and then applying a GroupByKey would be a better approach; this gives you an iterable of the elements under the same key. This will still be processed sequentially, though, and I believe you will have performance issues unless you eliminate the hot key. Please see the explanation of using combiners to alleviate this.
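To make the lookup step of the side input approach concrete, here is a minimal plain-Java sketch (no Beam dependency) of what the ParDo body does with a multimap: for each left element, fetch all right-hand values for its key and emit one joined output per match. The class and method names (SideInputJoinSketch, join) and the string-based rows are hypothetical, chosen only for illustration; in a real pipeline the map would come from c.sideInput(...) on a View.asMultimap() view. A key with no match returns null from Map.get, so the sketch falls back to an empty list rather than iterating over null.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class SideInputJoinSketch {

  // Joins each (key, leftValue) pair against a multimap of right-hand values.
  // Emits one output per matching right-hand value; unmatched keys emit nothing.
  static List<String> join(
      List<Map.Entry<Integer, String>> left, Map<Integer, List<String>> rightByKey) {
    List<String> out = new ArrayList<>();
    for (Map.Entry<Integer, String> l : left) {
      // getOrDefault avoids a NullPointerException when the key has no match.
      List<String> matches = rightByKey.getOrDefault(l.getKey(), Collections.emptyList());
      for (String r : matches) {
        out.add(l.getValue() + "," + r); // stands in for c.output(...)
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // Hypothetical sample data: key 1 has two matches, key 3 has none.
    List<Map.Entry<Integer, String>> left =
        List.of(Map.entry(1, "store-A"), Map.entry(3, "store-C"));
    Map<Integer, List<String>> right =
        Map.of(1, List.of("shoes", "hats"), 2, List.of("bags"));
    System.out.println(join(left, right)); // prints [store-A,shoes, store-A,hats]
  }
}
```

This models inner-join behavior; to keep unmatched left rows instead, the empty-list branch would emit the left value on its own.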

Google Dataflow write multiple line in BigQuery

I have a simple flow whose aim is to write two lines to one BigQuery table.
I use DynamicDestinations because later I will write to multiple tables; in this example it is the same table.
The problem is that I end up with only one line in my BigQuery table.
In the stack trace I see the following error on the second insert:
"
status: {
code: 6
message: "Already Exists: Job sampleprojet3:b9912b9b05794aec8f4292b2ae493612_eeb0082ade6f4a58a14753d1cc92ddbc_00001-0"
}
"
What does it mean?
Is it related to this limitation?
https://github.com/GoogleCloudPlatform/DataflowJavaSDK/issues/550
How can I do this?
I use Beam SDK 2.0.0; I have also tried 2.1.0 (same result).
This is how I launch it:
mvn compile exec:java -Dexec.mainClass=fr.gireve.dataflow.LogsFlowBug -Dexec.args="--runner=DataflowRunner --inputDir=gs://sampleprojet3.appspot.com/ --project=sampleprojet3 --stagingLocation=gs://dataflow-sampleprojet3/tmp" -Pdataflow-runner
Pipeline p = Pipeline.create(options);
final List<String> tableNameTableValue = Arrays.asList("table1:value1", "table1:value2", "table2:value1", "table2:value2");
p.apply(Create.of(tableNameTableValue)).setCoder(StringUtf8Coder.of())
.apply(BigQueryIO.<String>write()
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
.to(new DynamicDestinations<String, KV<String, String>>() {
@Override
public KV<String, String> getDestination(ValueInSingleWindow<String> element) {
final String[] split = element.getValue().split(":");
return KV.of(split[0], split[1]) ;
}
@Override
public Coder<KV<String, String>> getDestinationCoder() {
return KvCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of());
}
@Override
public TableDestination getTable(KV<String, String> row) {
String tableName = row.getKey();
String tableSpec = "sampleprojet3:testLoadJSON." + tableName;
return new TableDestination(tableSpec, "Table " + tableName);
}
@Override
public TableSchema getSchema(KV<String, String> row) {
List<TableFieldSchema> fields = new ArrayList<>();
fields.add(new TableFieldSchema().setName("myColumn").setType("STRING"));
TableSchema ts = new TableSchema();
ts.setFields(fields);
return ts;
}
})
.withFormatFunction(new SerializableFunction<String, TableRow>() {
public TableRow apply(String row) {
TableRow tr = new TableRow();
tr.set("myColumn", row);
return tr;
}
}));
p.run().waitUntilFinish();
Thanks
DynamicDestinations associates each element with a destination - i.e. where the element should go. Elements are routed to BigQuery tables according to their destinations: 1 destination = 1 BigQuery table with a schema: the destination should include just enough information to produce a TableDestination and a schema. Elements with the same destination go to the same table, elements with different destinations go to different tables.
Your code snippet uses DynamicDestinations with a destination type that contains both the element and the table, which is unnecessary, and of course, violates the constraint above: elements with a different destination end up going to the same table: e.g. KV("table1", "value1") and KV("table1", "value2") are different destinations but your getTable maps them to the same table table1.
You need to remove the element from your destination type. That will also lead to simpler code. As a side note, I think you don't need to override getDestinationCoder() - it can be inferred automatically.
Try this:
.to(new DynamicDestinations<String, String>() {
@Override
public String getDestination(ValueInSingleWindow<String> element) {
return element.getValue().split(":")[0];
}
@Override
public TableDestination getTable(String tableName) {
return new TableDestination(
"sampleprojet3:testLoadJSON." + tableName, "Table " + tableName);
}
@Override
public TableSchema getSchema(String tableName) {
List<TableFieldSchema> fields = Arrays.asList(
new TableFieldSchema().setName("myColumn").setType("STRING"));
return new TableSchema().setFields(fields);
}
})
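The constraint can be illustrated without BigQuery at all. The following plain-Java sketch (hypothetical class and method names, no Beam dependency) derives a destination from each element in the two ways discussed above and collects the distinct destinations. Keeping the value in the destination yields four destinations for two tables, while using only the table name yields exactly one destination per table.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class DestinationSketch {

  // Destination as in the question: table name plus value (the whole element).
  static Set<String> destinationsWithValue(List<String> elements) {
    Set<String> destinations = new LinkedHashSet<>();
    for (String element : elements) {
      String[] split = element.split(":");
      destinations.add(split[0] + "|" + split[1]);
    }
    return destinations;
  }

  // Destination as in the answer: table name only.
  static Set<String> destinationsByTable(List<String> elements) {
    Set<String> destinations = new LinkedHashSet<>();
    for (String element : elements) {
      destinations.add(element.split(":")[0]);
    }
    return destinations;
  }

  public static void main(String[] args) {
    List<String> elements =
        Arrays.asList("table1:value1", "table1:value2", "table2:value1", "table2:value2");
    // Four distinct destinations, but only two distinct tables behind them.
    System.out.println(destinationsWithValue(elements).size()); // prints 4
    // One destination per table, which is the constraint described above.
    System.out.println(destinationsByTable(elements).size()); // prints 2
  }
}
```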

Execute read operations in sequence - Apache Beam

I need to execute the operations below in the given sequence:
PCollection<String> read = p.apply("Read Lines",TextIO.read().from(options.getInputFile()))
.apply("Get fileName",ParDo.of(new DoFn<String,String>(){
ValueProvider<String> fileReceived = options.getfilename();
@ProcessElement
public void processElement(ProcessContext c)
{
fileName = fileReceived.get().toString();
LOG.info("File: "+fileName);
}
}));
PCollection<TableRow> rows = p.apply("Read from BigQuery",
BigQueryIO.read()
.fromQuery("SELECT table,schema FROM `DatasetID.TableID` WHERE file='" + fileName +"'")
.usingStandardSql());
How to accomplish this in Apache Beam/Dataflow?
It seems that you want to apply BigQueryIO.read().fromQuery() to a query that depends on a value available via a property of type ValueProvider<String> in your PipelineOptions, and the provider is not accessible at pipeline construction time - i.e. you are invoking your job via a template.
In that case, the proper solution is to use NestedValueProvider:
PCollection<TableRow> tableRows = p.apply(BigQueryIO.read().fromQuery(
NestedValueProvider.of(
options.getfilename(),
new SerializableFunction<String, String>() {
@Override
public String apply(String filename) {
// Use the function parameter (the resolved option value), not an outer variable.
return "SELECT table,schema FROM `DatasetID.TableID` WHERE file='" + filename + "'";
}
})));

Stateful ParDo not working on Dataflow Runner

Based on Javadocs and the blog post at https://beam.apache.org/blog/2017/02/13/stateful-processing.html, I tried using a simple de-duplication example using 2.0.0-beta-2 SDK which reads a file from GCS (containing a list of jsons each with a user_id field) and then running it through a pipeline as explained below.
The input data contains about 146K events of which only 50 events are unique. The entire input is about 50MB which should be processable in considerably less time than the 2 min Fixed window. I just placed a window there to make sure the per-key-per-window semantics hold without using a GlobalWindow. I run the windowed data through 3 parallel stages to compare the results, each of which are explained below.
1. Just copies the contents into a new file on GCS. This ensures all the events are processed as expected, and I verified that the contents are exactly the same as the input.
2. Combine.PerKey on the user_id, picking only the first element from the Iterable. This should essentially deduplicate the data, and it works as expected. The resulting file has the exact number of unique items from the original list of events: 50 elements.
3. A stateful ParDo which checks whether the key has been seen already and emits an output only when it has not. Ideally, the result from this should match the deduped data from [2], but all I am seeing is 3 unique events. These 3 unique events always point to the same 3 user_ids in the few runs I did.
Interestingly, when I just switch from the DataflowRunner to the DirectRunner running this whole process locally, I see that the output from [3] matches [2] having only 50 unique elements as expected. So, I am doubting if there are any issues with the DataflowRunner for the Stateful ParDo.
public class StatefulParDoSample {
private static Logger logger = LoggerFactory.getLogger(StatefulParDoSample.class.getName());
static class StatefulDoFn extends DoFn<KV<String, String>, String> {
final Aggregator<Long, Long> processedElements = createAggregator("processed", Sum.ofLongs());
final Aggregator<Long, Long> skippedElements = createAggregator("skipped", Sum.ofLongs());
@StateId("keyTracker")
private final StateSpec<Object, ValueState<Integer>> keyTrackerSpec =
StateSpecs.value(VarIntCoder.of());
@ProcessElement
public void processElement(
ProcessContext context,
@StateId("keyTracker") ValueState<Integer> keyTracker) {
processedElements.addValue(1l);
final String userId = context.element().getKey();
int wasSeen = firstNonNull(keyTracker.read(), 0);
if (wasSeen == 0) {
keyTracker.write( 1);
context.output(context.element().getValue());
} else {
keyTracker.write(wasSeen + 1);
skippedElements.addValue(1l);
}
}
}
public static void main(String[] args) {
DataflowPipelineOptions pipelineOptions = PipelineOptionsFactory.create().as(DataflowPipelineOptions.class);
pipelineOptions.setRunner(DataflowRunner.class);
pipelineOptions.setProject("project-name");
pipelineOptions.setStagingLocation(GCS_STAGING_LOCATION);
pipelineOptions.setStreaming(false);
pipelineOptions.setAppName("deduper");
Pipeline p = Pipeline.create(pipelineOptions);
final ObjectMapper mapper = new ObjectMapper();
PCollection<KV<String, String>> keyedEvents =
p
.apply(TextIO.Read.from(GCS_SAMPLE_INPUT_FILE_PATH))
.apply(WithKeys.of(new SerializableFunction<String, String>() {
@Override
public String apply(String input) {
try {
Map<String, Object> eventJson =
mapper.readValue(input, Map.class);
return (String) eventJson.get("user_id");
} catch (Exception e) {
}
return "";
}
}))
.apply(
Window.into(
FixedWindows.of(Duration.standardMinutes(2))
)
);
keyedEvents
.apply(ParDo.of(new StatefulDoFn()))
.apply(TextIO.Write.to(GCS_SAMPLE_OUTPUT_FILE_PATH).withNumShards(1));
keyedEvents
.apply(Values.create())
.apply(TextIO.Write.to(GCS_SAMPLE_COPY_FILE_PATH).withNumShards(1));
keyedEvents
.apply(Combine.perKey(new SerializableFunction<Iterable<String>, String>() {
@Override
public String apply(Iterable<String> input) {
return !input.iterator().hasNext() ? "empty" : input.iterator().next();
}
}))
.apply(Values.create())
.apply(TextIO.Write.to(GCS_SAMPLE_COMBINE_FILE_PATH).withNumShards(1));
PipelineResult result = p.run();
result.waitUntilFinish();
}
}
This was a bug in the Dataflow service in batch mode, fixed in the upcoming 0.6.0 Beam release (or HEAD if you track the bleeding edge).
Thank you for bringing it to my attention! For reference, or if anything else comes up, this was tracked by BEAM-1611.

DataflowAssert doesn't pass TableRow test

We don't know why DataflowAssert fails when running this simple test:
@Test
@Category(RunnableOnService.class)
public void testTableRow() throws Exception {
Pipeline p = TestPipeline.create();
PCollection<TableRow> pCollectionTable1 = p.apply("a",Create.of(TABLEROWS_ARRAY_1));
PCollection<TableRow> pCollectionTable2 = p.apply("b",Create.of(TABLEROWS_ARRAY_2));
PCollection<TableRow> joinedTables = Table.join(pCollectionTable1, pCollectionTable2);
DataflowAssert.that(joinedTables).containsInAnyOrder(TABLEROW_TEST);
p.run();
}
We are getting the following exception:
Sep 25, 2015 10:42:50 AM com.google.cloud.dataflow.sdk.testing.DataflowAssert$TwoSideInputAssert$CheckerDoFn processElement
SEVERE: DataflowAssert failed expectations.
java.lang.AssertionError:
Expected: iterable over [<{id=x}>] in any order
but: Not matched: <{id=x}>
at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
at org.junit.Assert.assertThat(Assert.java:865)
at org.junit.Assert.assertThat(Assert.java:832)
at ...
To simplify the DataflowAssert test, we hardcoded the output of Table.join to match the DataflowAssert expectation:
private static final TableRow TABLEROW_TEST = new TableRow()
.set("id", "x");
static PCollection<TableRow> join(PCollection<TableRow> pCollectionTable1,
PCollection<TableRow> pCollectionTable2) throws Exception {
final TupleTag<String> pCollectionTable1Tag = new TupleTag<String>();
final TupleTag<String> pCollectionTable2Tag = new TupleTag<String>();
PCollection<KV<String, String>> table1Data = pCollectionTable1
.apply(ParDo.of(new ExtractTable1DataFn()));
PCollection<KV<String, String>> table2Data = pCollectionTable2
.apply(ParDo.of(new ExtractTable2DataFn()));
PCollection<KV<String, CoGbkResult>> kvpCollection = KeyedPCollectionTuple
.of(pCollectionTable1Tag, table1Data).and(pCollectionTable2Tag, table2Data)
.apply(CoGroupByKey.<String> create());
PCollection<KV<String, String>> resultCollection = kvpCollection
.apply(ParDo.named("Process join")
.of(new DoFn<KV<String, CoGbkResult>, KV<String, String>>() {
private static final long serialVersionUID = 0;
@Override
public void processElement(ProcessContext c) {
// System.out.println(c);
KV<String, CoGbkResult> e = c.element();
String key = e.getKey();
String value = null;
for (String table1Value : c.element().getValue().getAll(pCollectionTable2Tag)) {
for (String table2Value : c.element().getValue().getAll(pCollectionTable2Tag)) {
value = table1Value + "," + table2Value;
}
}
c.output(KV.of(key, value));
}
}));
PCollection<TableRow> formattedResults = resultCollection.apply(
ParDo.named("Format join").of(new DoFn<KV<String, String>, TableRow>() {
private static final long serialVersionUID = 0;
public void processElement(ProcessContext c) {
TableRow row = new TableRow().set("id", "x");
c.output(row);
}
}));
return formattedResults;
}
Does anyone know what we are doing wrong?
I think the error message is telling you that the actual collection contains more copies of that element than the expectation.
Expected: iterable over [<{id=x}>] in any order
but: Not matched: <{id=x}>
This is hamcrest indicating that you wanted an iterable over a single element, but the actual collection had an item which wasn't matched. Since all of the items coming out of "format join" have the same value, it made this harder to read than it should have been.
Specifically, this is the message produced when I run the following test, which checks whether a collection with two copies of row contains exactly one copy of row:
@Category(RunnableOnService.class)
@Test
public void testTableRow() throws Exception {
Pipeline p = TestPipeline.create();
TableRow row = new TableRow().set("id", "x");
PCollection<TableRow> rows = p.apply(Create.<TableRow>of(row, row));
DataflowAssert.that(rows).containsInAnyOrder(row);
p.run();
}
In order to get that result with your code, I had to take advantage of the fact that you only iterate over entries in table2. Specifically:
// Use these as the input tables.
table1 = [("keyA", "A1a"), ("keyA", "A1b")]
table2 = [("keyA", "A2a"), ("keyA", "A2b"), ("keyB", "B2")]
// The CoGroupByKey returns
[("keyA", (["A1a", "A1b"], ["A2a", "A2b"])),
("keyB", ([], ["B2"]))]
// When run through "Process join" this produces.
// For details on why see the next section.
["A2b,A2b",
"B2,B2"]
// When run through "Format join" this becomes the following.
[{id=x}, {id=x}]
Note that the DoFn for "Process join" may not produce the expected results as commented below:
String key = e.getKey();
String value = null;
// NOTE: Both table1Value and table2Value iterate over pCollectionTable2Tag
for (String table1Value : c.element().getValue().getAll(pCollectionTable2Tag)) {
for (String table2Value : c.element().getValue().getAll(pCollectionTable2Tag)) {
// NOTE: this updates value, and doesn't output it. So for each
// key there will be a single output with the *last* value
// rather than one for each pair.
value = table1Value + "," + table2Value;
}
}
c.output(KV.of(key, value));
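For comparison, a corrected version of the loop would read table 1 in the outer loop and emit one output per pair. The plain-Java sketch below models just that logic, with lists standing in for the two getAll(...) iterables. The class and method names are hypothetical; in the actual DoFn, each out.add(...) would be a c.output(KV.of(key, ...)) call and the outer loop would read pCollectionTable1Tag.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class JoinPairsSketch {

  // Emits one "v1,v2" string for every pair of values that share a key.
  static List<String> joinPairs(List<String> table1Values, List<String> table2Values) {
    List<String> out = new ArrayList<>();
    for (String table1Value : table1Values) {   // outer loop: table 1
      for (String table2Value : table2Values) { // inner loop: table 2
        // Output inside the loop, so no pair is overwritten by a later one.
        out.add(table1Value + "," + table2Value);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // keyA from the example above: two values on each side, four pairs.
    System.out.println(
        joinPairs(Arrays.asList("A1a", "A1b"), Arrays.asList("A2a", "A2b")));
    // prints [A1a,A2a, A1a,A2b, A1b,A2a, A1b,A2b]

    // keyB has no table 1 values, so an inner join emits nothing for it.
    System.out.println(joinPairs(Collections.emptyList(), Arrays.asList("B2")));
    // prints []
  }
}
```

With this shape, a key present in only one table produces no output, which also removes the null value the original loop could emit.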

Resources