I am trying to move data from one table to another, using a side input to filter records while transforming the data. The side input is a KV collection loaded from another table.
When I run my pipeline I get a "java.lang.IllegalArgumentException: calling sideInput() with unknown view" error.
Here is the entire code that I tried:
{
PipelineOptionsFactory.register(OptionPipeline.class);
OptionPipeline options = PipelineOptionsFactory.fromArgs(args).withValidation().as(OptionPipeline.class);
Pipeline p = Pipeline.create(options);
PCollection<TableRow> sideInputData = p.apply("ReadSideInput",BigQueryIO.readTableRows().from(options.getOrgRegionMapping()));
PCollection<KV<String,String>> sideInputMap = sideInputData.apply(ParDo.of(new getSideInputDataFn()));
final PCollectionView<Map<String,String>> sideInputView = sideInputMap.apply(View.<String,String>asMap());
PCollection<TableRow> orgMaster = p.apply("ReadOrganization",BigQueryIO.readTableRows().from(options.getOrgCodeMaster()));
PCollection<TableRow> orgCode = orgMaster.apply(ParDo.of(new gnGetOrgMaster()));
@SuppressWarnings("serial")
PCollection<TableRow> finalResultCollection = orgCode.apply("Process", ParDo.of(new DoFn<TableRow, TableRow>()
{
@ProcessElement
public void processElement(ProcessContext c) {
TableRow outputRow = new TableRow();
TableRow orgCodeRow = c.element();
String orgCodefromMaster = (String) orgCodeRow.get("orgCode");
String region = c.sideInput(sideInputView).get(orgCodefromMaster);
outputRow.set("orgCode", orgCodefromMaster);
outputRow.set("orgName", orgCodeRow.get("orgName"));
outputRow.set("orgName", region);
DateFormat df = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSSSSS");
Date dateobj = new Date();
outputRow.set("updatedDate",df.format(dateobj));
c.output(outputRow);
}
}));
finalResultCollection.apply(BigQueryIO.writeTableRows()
.withSchema(schema)
.to(options.getOrgCodeTable())
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
p.run().waitUntilFinish();
}
@SuppressWarnings("serial")
static class getSideInputDataFn extends DoFn<TableRow,KV<String, String>>
{
@ProcessElement
public void processElement(ProcessContext c)
{
TableRow row = c.element();
c.output(KV.of((String) row.get("orgcode"), (String) row.get("region")));
}
}
It looks like the runner is complaining because the side input was never registered with the ParDo when the graph was defined, so c.sideInput(sideInputView) refers to a view the transform does not know about. To fix this, call .withSideInputs(...) on the result of ParDo.of(...), passing in the PCollectionView<T> you defined earlier:
@SuppressWarnings("serial")
PCollection<TableRow> finalResultCollection = orgCode.apply("Process", ParDo.of(new DoFn<TableRow, TableRow>()
{
@ProcessElement
public void processElement(ProcessContext c) {
TableRow outputRow = new TableRow();
TableRow orgCodeRow = c.element();
String orgCodefromMaster = (String) orgCodeRow.get("orgCode");
String region = c.sideInput(sideInputView).get(orgCodefromMaster);
outputRow.set("orgCode", orgCodefromMaster);
outputRow.set("orgName", orgCodeRow.get("orgName"));
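// Was "orgName", which overwrote the value set on the previous line; assuming the column is named "region".
outputRow.set("region", region);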
DateFormat df = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSSSSS");
Date dateobj = new Date();
outputRow.set("updatedDate",df.format(dateobj));
c.output(outputRow);
}
}).withSideInputs(sideInputView));
I didn't test this code but that's what stands out when I look at it.
Related
I'm working with Apache Beam, trying to enrich data (based on this), but it seems that Beam has changed since then, as GroupByKey does not work with unbounded sources (like PubSub) without windowing.
This is what I've got (overly simplified):
PCollection<String> input = pipeline.apply("Read pubsub",
PubsubIO.readStrings().fromTopic(options.getInputTopic()))
.apply("Log element", ParDo.of(new DoFn<String, String>() {
@ProcessElement
public void processElement(ProcessContext c) {
System.out.println(String.format("incoming %s", c.element()));
c.output(c.element());
}
}))
.apply(Window.into(FixedWindows.of(Duration.standardSeconds(5))));
PCollection<KV<String, String>> incomingData = input
.apply("Apply Random Key", MapElements
.via(new SimpleFunction<String, KV<String, String>>() {
public KV<String, String> apply(String json) {
JSONObject jsonObject = new JSONObject(json);
System.out.println(String.format("JSON: %s, %s", jsonObject.getString("id"), jsonObject.get("usageRules")));
return KV.of(jsonObject.getString("id"), json);
}
})
);
PCollection<KV<String,String>> enrichedData = incomingData
.apply("Search in db",
JdbcIO.<KV<String,String>, KV<String,String>>readAll()
.withDataSourceConfiguration(config)
.withQuery("SELECT * FROM myTable WHERE id = ?")
.withParameterSetter((element, preparedStatement) ->
preparedStatement.setString(1, element.getKey())
)
.withRowMapper(resultSet -> {
System.out.println(String.format("Result from db: %s", resultSet.getString("id")));
return KV.of(resultSet.getString("id"), resultSet.getString("id"));
})
.withCoder(KvCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of())));
GroupByKey.applicableTo(enrichedData);
TupleTag<String> CREATE_TAG = new TupleTag<>();
TupleTag<String> UPDATE_TAG = new TupleTag<>();
KeyedPCollectionTuple
.of(CREATE_TAG, incomingData)
.and(UPDATE_TAG, enrichedData)
.apply("Combine", CoGroupByKey.create())
.apply("Show data?", ParDo.of(new DoFn<KV<String, CoGbkResult>, String>() {
@ProcessElement
public void processElement(ProcessContext context) {
System.out.println("Print from CoGbkResult");
System.out.println(context.element().getKey());
System.out.println(context.element().getValue());
}
}));
At the moment, with windowing, reading the incoming data, transforming it into a JSONObject, and searching the DB all work fine. The problem is that any .apply after JdbcIO.readAll does not run at all: the line "Print from CoGbkResult" never gets printed.
I've tried modifying the window, adding other triggers, and trying to output a result immediately, but processing just stops at the RowMapper.
Thanks for your help
I am trying to use DynamicDestinations to write to a partitioned table in BigQuery where the partition name is mytable$yyyyMMdd. If I bypass DynamicDestinations and supply a hardcoded table name in .to(), it works; however, with DynamicDestinations I get the following exception:
java.lang.IllegalArgumentException: unable to serialize org.apache.beam.sdk.io.gcp.bigquery.PrepareWrite$1@6fff253c
at org.apache.beam.sdk.util.SerializableUtils.serializeToByteArray(SerializableUtils.java:53)
at org.apache.beam.sdk.util.SerializableUtils.clone(SerializableUtils.java:90)
at org.apache.beam.sdk.transforms.ParDo$SingleOutput.<init>(ParDo.java:591)
at org.apache.beam.sdk.transforms.ParDo.of(ParDo.java:435)
at org.apache.beam.sdk.io.gcp.bigquery.PrepareWrite.expand(PrepareWrite.java:51)
at org.apache.beam.sdk.io.gcp.bigquery.PrepareWrite.expand(PrepareWrite.java:36)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:514)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:473)
at org.apache.beam.sdk.values.PCollection.apply(PCollection.java:297)
at org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO$Write.expandTyped(BigQueryIO.java:987)
at org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO$Write.expand(BigQueryIO.java:972)
at org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO$Write.expand(BigQueryIO.java:659)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:514)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:454)
at org.apache.beam.sdk.values.PCollection.apply(PCollection.java:284)
at com.homedepot.payments.monitoring.eventprocessor.MetricsAggregator.main(MetricsAggregator.java:82)
Caused by: java.io.NotSerializableException: com.google.api.services.bigquery.model.TableReference
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184)
And here is the code:
PCollection<Event> rawEvents = pipeline
.apply("ReadFromPubSub",
PubsubIO.readProtos(EventOuterClass.Event.class)
.fromSubscription(OPTIONS.getSubscription())
)
.apply("Parse", ParDo.of(new ParseFn()))
.apply("ExtractAttributes", ParDo.of(new ExtractAttributesFn()));
EventTable table = new EventTable(OPTIONS.getProjectId(), OPTIONS.getMetricsDatasetId(), OPTIONS.getRawEventsTable());
rawEvents.apply(BigQueryIO.<Event>write()
.to(new DynamicDestinations<Event, String>() {
private static final long serialVersionUID = 1L;
@Override
public TableSchema getSchema(String destination) {
return table.schema();
}
@Override
public TableDestination getTable(String destination) {
return new TableDestination(table.reference(), null);
}
@Override
public String getDestination(ValueInSingleWindow<Event> element) {
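// Calling toString() on the formatter does not produce a date; assuming the partition should follow the element's timestamp.
String dayString = DateTimeFormat.forPattern("yyyyMMdd").withZone(DateTimeZone.UTC).print(element.getTimestamp());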
return table.reference().getTableId() + "$" + dayString;
}
})
.withFormatFunction(new SerializableFunction<Event, TableRow>() {
public TableRow apply(Event event) {
TableRow row = new TableRow();
Event evnt = (Event) event;
row.set(EventTable.Field.VERSION.getName(), evnt.getVersion());
row.set(EventTable.Field.TIMESTAMP.getName(), evnt.getTimestamp() / 1000);
row.set(EventTable.Field.EVENT_TYPE_ID.getName(), evnt.getEventTypeId());
row.set(EventTable.Field.EVENT_ID.getName(), evnt.getId());
row.set(EventTable.Field.LOCATION.getName(), evnt.getLocation());
row.set(EventTable.Field.SERVICE.getName(), evnt.getService());
row.set(EventTable.Field.HOST.getName(), evnt.getHost());
row.set(EventTable.Field.BODY.getName(), evnt.getBody());
return row;
}
})
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
);
Any pointers in the correct direction would be greatly appreciated.
Thanks!
From inspecting the exception message and the code above, it seems that the EventTable field used within your anonymous DynamicDestinations class contains a TableReference field which is not serializable.
One workaround would be to convert the anonymous DynamicDestinations to a static inner class and define a constructor which stores only the serializable pieces of the EventTable needed to implement the interface.
For example:
private static class EventDestinations extends DynamicDestinations<Event, String> {
private final TableSchema schema;
private final TableDestination destination;
private final String tableId;
private EventDestinations(EventTable table) {
this.schema = table.schema();
this.destination = new TableDestination(table.reference(), null);
this.tableId = table.reference().getTableId();
}
// ..
}
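The difference between the two shapes can be reproduced without Beam. The sketch below is only an illustration using plain java.io serialization: FakeTableReference, Destinations, and the other names are hypothetical stand-ins, not Beam or BigQuery classes. It shows an anonymous class failing to serialize because it captures a non-serializable object, while a static nested class that copies out only a String serializes cleanly.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

final class SerializationSketch {
    // Hypothetical stand-in for a non-serializable model class such as TableReference.
    static final class FakeTableReference {
        final String tableId;
        FakeTableReference(String tableId) { this.tableId = tableId; }
    }

    // Hypothetical stand-in for a serializable callback that must be shipped to workers.
    interface Destinations extends Serializable {
        String tableId();
    }

    // Anonymous class: it captures 'ref', so 'ref' becomes a hidden field of the instance.
    static Destinations capturing(final FakeTableReference ref) {
        return new Destinations() {
            private static final long serialVersionUID = 1L;
            @Override public String tableId() { return ref.tableId; }
        };
    }

    // Static nested class: keeps only the serializable String it needs.
    static final class CopyingDestinations implements Destinations {
        private static final long serialVersionUID = 1L;
        private final String tableId;
        CopyingDestinations(FakeTableReference ref) { this.tableId = ref.tableId; }
        @Override public String tableId() { return tableId; }
    }

    // Returns true if plain Java serialization accepts the object.
    static boolean isSerializable(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The same check is worth applying to every field kept in the static class, since any non-serializable field would fail in the same way.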
Looks like you're trying to fill a specific partition based on the event. Why not use:
SerializableFunction<ValueInSingleWindow<Event>, TableDestination>?
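Whichever hook is used, the per-element work is building the partition decorator mytable$yyyyMMdd from the element's timestamp. A minimal sketch of just that string-building step, using java.time rather than the Joda-Time classes in the question; PartitionNames and partitionFor are hypothetical names and the Beam wiring is left out:

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

final class PartitionNames {
    // yyyyMMdd in UTC, matching the mytable$yyyyMMdd convention from the question.
    private static final DateTimeFormatter DAY =
        DateTimeFormatter.ofPattern("yyyyMMdd").withZone(ZoneOffset.UTC);

    // Builds "<tableId>$<yyyyMMdd>" for the UTC day that contains the given instant.
    static String partitionFor(String tableId, Instant timestamp) {
        return tableId + "$" + DAY.format(timestamp);
    }
}
```

Because the helper takes the timestamp as an argument, the partition follows the event rather than the wall-clock time at which the row happens to be written.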
I am trying to read an array of JSON objects posted to a topic that my pipeline is subscribed to and persist them to BigQuery. The problem is that only the first object is persisted. Can someone please tell me what I am doing wrong here?
/** A DoFn that converts a table row from JSON into a BigQuery table row. */
static class FormatAsTableRowFn extends DoFn<TableRow, TableRow> {
private static final long serialVersionUID = 0;
static TableSchema getSchema() {
return new TableSchema().setFields(new ArrayList<TableFieldSchema>() {
// Compose the list of TableFieldSchema from tableSchema.
{
add(new TableFieldSchema().setName("PillBoxID").setType("STRING").setMode("NULLABLE"));
add(new TableFieldSchema().setName("Period").setType("STRING").setMode("NULLABLE"));
add(new TableFieldSchema().setName("Time").setType("TIMESTAMP").setMode("NULLABLE"));
add(new TableFieldSchema().setName("IsTaken").setType("STRING").setMode("NULLABLE"));
}
});
}
@Override
public void processElement(ProcessContext c) {
TableRow jsonRow = c.element();
// Setup a date formatter to parse the date appropriately
SimpleDateFormat ft = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");
try {
TableRow bQueryRow = new TableRow()
.set("PillBoxID", (String) jsonRow.get("PillBoxID"))
.set("Period", (String) jsonRow.get("Period"))
.set("Time",ft.format(ft.parse((String) jsonRow.get("Time"))))
.set("IsTaken", (String) jsonRow.get("IsTaken"));
LOG.error("Inside try" + bQueryRow.getF());
c.output(bQueryRow);
} catch (ParseException pe) {
LOG.error("ParseException");
LOG.error(pe.getMessage());
}
}
}
and my pipeline code is as shown below:
bigQueryPipeLine
.apply(PubsubIO.Read.topic(options.getPubsubTopic()).withCoder(TableRowJsonCoder.of()))
.apply(ParDo.of(new FormatAsTableRowFn()))
.apply(BigQueryIO.Write.to(tableSpec)
.withSchema(FormatAsTableRowFn.getSchema()));
It is possible to process multiple records if you format the input JSON to include an array of items.
Example input:
{
"items":
[
{"PillBoxID":"ID5", "Period":"Morning", "Time":"2016-03-14T11:11:11", "IsTaken":"true"},
{"PillBoxID":"ID6", "Period":"Afternoon", "Time":"2016-03-14T15:11:11", "IsTaken":"false"}
]
}
The rough example processElement() code below emits both items through c.output() for later storage in BigQuery.
@Override
public void processElement(ProcessContext c) throws DatastoreException, IOException{
TableRow jsonRowObj = c.element();
LOG.info("Original input:" + c.element().toPrettyString());
ArrayList<Map> jsonRows = (ArrayList<Map>)jsonRowObj.get("items");
Iterator<Map> iterator = jsonRows.iterator();
while(iterator.hasNext()) {
Map jsonRow = (Map)iterator.next();
// Setup a date formatter to parse the date appropriately
SimpleDateFormat ft = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");
try {
LOG.info("child input JSON: "+jsonRow.toString());
TableRow bQueryRow = new TableRow()
.set("PillBoxID", (String) jsonRow.get("PillBoxID"))
.set("Period", (String) jsonRow.get("Period"))
.set("Time",ft.format(ft.parse((String) jsonRow.get("Time"))))
.set("IsTaken", (boolean) Boolean.parseBoolean((String)jsonRow.get("IsTaken")));
c.output(bQueryRow);
} catch (ParseException pe) {
LOG.error("");
LOG.error("ParseException " +pe.getMessage());
}
}
}
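The core of that loop, one output per entry under "items", can be sketched without the Dataflow SDK. This is only an illustration over plain Map and List values; ItemFlattener and flatten are hypothetical names, and in the pipeline each returned row would be passed to c.output():

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ItemFlattener {
    // Returns one row per entry under "items"; a missing or non-list "items" yields no rows.
    static List<Map<String, Object>> flatten(Map<String, Object> message) {
        List<Map<String, Object>> rows = new ArrayList<>();
        Object items = message.get("items");
        if (!(items instanceof List)) {
            return rows;
        }
        for (Object item : (List<?>) items) {
            if (!(item instanceof Map)) {
                continue; // skip entries that are not JSON objects
            }
            Map<String, Object> row = new LinkedHashMap<>();
            for (Map.Entry<?, ?> e : ((Map<?, ?>) item).entrySet()) {
                row.put(String.valueOf(e.getKey()), e.getValue());
            }
            rows.add(row);
        }
        return rows;
    }
}
```

Checking the type of "items" before casting avoids a ClassCastException when a message arrives in the old single-object shape.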
I have the following kind of sample data.
s.n., time, user, time_span, user_level
1, 2016-01-04T1:26:13, Hari, 8, admin
2, 2016-01-04T11:6:13, Gita, 2, admin
3, 2016-01-04T11:26:13, Gita, 0, user
Now I need to find average_time_span/user, average_time_span/user_level and total_time_span/user.
I'm able to compute each of the above values separately, but not all of them at once. As I'm new to Dataflow, please suggest an appropriate way to do so.
static class ExtractUserAndUserLevelFn extends DoFn<String, KV<String, Long>> {
@Override
public void processElement(ProcessContext c) {
String[] words = c.element().split(",");
if (words.length == 5) {
Instant timestamp = Instant.parse(words[1].trim());
KV<String, Long> userTime = KV.of(words[2].trim(), Long.valueOf(words[3].trim()));
KV<String, Long> userLevelTime = KV.of(words[4].trim(), Long.valueOf(words[3].trim()));
c.outputWithTimestamp(userTime, timestamp);
c.outputWithTimestamp(userLevelTime, timestamp);
}
}
}
public static void main(String[] args) {
TestOptions options = PipelineOptionsFactory.fromArgs(args).withValidation()
.as(TestOptions.class);
Pipeline p = Pipeline.create(options);
p.apply(TextIO.Read.named("ReadLines").from(options.getInputFile()))
.apply(ParDo.of(new ExtractUserAndUserLevelFn()))
.apply(Window.<KV<String, Long>>into(
FixedWindows.of(Duration.standardSeconds(options.getMyWindowSize()))))
.apply(GroupByKey.<String, Long>create())
.apply(ParDo.of(new DoFn<KV<String, Iterable<Long>>, KV<String, Long>>() {
public void processElement(ProcessContext c) {
String key = c.element().getKey();
Iterable<Long> docsWithThatUrl = c.element().getValue();
Long sum = 0L;
for (Long item : docsWithThatUrl)
sum += item;
KV<String, Long> userTime = KV.of(key, sum);
c.output(userTime);
}
}))
.apply(MapElements.via(new FormatAsTextFn()))
.apply(TextIO.Write.named("WriteCounts").to(options.getOutput()).
withNumShards(options.getShardsNumber()));
p.run();
}
One approach would be to first parse the lines into one PCollection that contains a record per line, and then from that collection create two PCollections of key-value pairs. Let's say you define a record representing a line like this:
static class Record implements Serializable {
final String user;
final String role;
final long duration;
Record(String user, String role, long duration) {
this.user = user;
this.role = role;
this.duration = duration;
}
}
Now, create a LineToRecordFn that creates Records from the input lines, so that you can do:
PCollection<Record> records = p.apply(TextIO.Read.named("ReadLines")
.from(options.getInputFile()))
.apply(ParDo.of(new LineToRecordFn()));
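The parsing inside LineToRecordFn is not shown above. One way it could look, as a plain-Java sketch: LineParser and ParsedLine are hypothetical names, the column order is taken from the sample data in the question (s.n., time, user, time_span, user_level), and malformed lines are dropped by returning an empty Optional.

```java
import java.util.Optional;

final class LineParser {
    // Hypothetical holder mirroring the Record fields used above.
    static final class ParsedLine {
        final String user;
        final String role;
        final long duration;
        ParsedLine(String user, String role, long duration) {
            this.user = user;
            this.role = role;
            this.duration = duration;
        }
    }

    // Parses "s.n., time, user, time_span, user_level"; returns empty for headers or bad rows.
    static Optional<ParsedLine> parse(String line) {
        String[] cols = line.split(",");
        if (cols.length != 5) {
            return Optional.empty();
        }
        try {
            long duration = Long.parseLong(cols[3].trim());
            return Optional.of(new ParsedLine(cols[2].trim(), cols[4].trim(), duration));
        } catch (NumberFormatException e) {
            return Optional.empty(); // e.g. the header row, where time_span is not a number
        }
    }
}
```

Inside a DoFn, a present result would be emitted with c.output(...) and an empty one skipped.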
You can window here, if you want. Whether you window or not, you can then create your keyed-by-role and keyed-by-user PCollections:
PCollection<KV<String,Long>> role_duration = records.apply(MapElements.via(
new SimpleFunction<Record,KV<String,Long>>() {
@Override
public KV<String,Long> apply(Record r) {
return KV.of(r.role,r.duration);
}
}));
PCollection<KV<String,Long>> user_duration = records.apply(MapElements.via(
new SimpleFunction<Record,KV<String,Long>>() {
@Override
public KV<String,Long> apply(Record r) {
return KV.of(r.user, r.duration);
}
}));
Now, you can get the means and sum in just a few lines:
PCollection<KV<String,Double>> mean_by_user = user_duration.apply(
Mean.<String,Long>perKey());
PCollection<KV<String,Double>> mean_by_role = role_duration.apply(
Mean.<String,Long>perKey());
// The question asks for the total per user, so the sum is taken over user_duration.
PCollection<KV<String,Long>> sum_by_user = user_duration.apply(
Sum.<String>longsPerKey());
Note that Dataflow optimizes the pipeline before running your job. So, while it might look like you're doing two passes over the records PCollection, that may not be true.
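For checking what these transforms should produce on the sample rows, the same kind of aggregation can be computed locally with Java streams. This is a plain-Java sketch, not Dataflow code; Aggregates and Row are hypothetical names, and the three rows are the sample data from the question.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class Aggregates {
    // Hypothetical in-memory row: user, user_level, time_span.
    static final class Row {
        final String user;
        final String role;
        final long duration;
        Row(String user, String role, long duration) {
            this.user = user;
            this.role = role;
            this.duration = duration;
        }
    }

    // The three sample rows from the question.
    static List<Row> sample() {
        return Arrays.asList(
            new Row("Hari", "admin", 8),
            new Row("Gita", "admin", 2),
            new Row("Gita", "user", 0));
    }

    // average_time_span per user
    static Map<String, Double> meanByUser(List<Row> rows) {
        return rows.stream().collect(Collectors.groupingBy(
            (Row r) -> r.user, Collectors.averagingLong((Row r) -> r.duration)));
    }

    // average_time_span per user_level
    static Map<String, Double> meanByRole(List<Row> rows) {
        return rows.stream().collect(Collectors.groupingBy(
            (Row r) -> r.role, Collectors.averagingLong((Row r) -> r.duration)));
    }

    // total_time_span per user
    static Map<String, Long> sumByUser(List<Row> rows) {
        return rows.stream().collect(Collectors.groupingBy(
            (Row r) -> r.user, Collectors.summingLong((Row r) -> r.duration)));
    }
}
```

On the sample rows this gives a mean of 8.0 for Hari and 1.0 for Gita, a mean of 5.0 for admin and 0.0 for user, and totals of 8 for Hari and 2 for Gita.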
The Mean and Sum transforms look like they would work well for this use case. Basic usage looks like this:
PCollection<KV<String, Double>> meanPerKey =
input.apply(Mean.<String, Integer>perKey());
PCollection<KV<String, Integer>> sumPerKey = input
.apply(Sum.<String>integersPerKey());
We don't know why when running this simple test, DataflowAssert fails:
@Test
@Category(RunnableOnService.class)
public void testTableRow() throws Exception {
Pipeline p = TestPipeline.create();
PCollection<TableRow> pCollectionTable1 = p.apply("a",Create.of(TABLEROWS_ARRAY_1));
PCollection<TableRow> pCollectionTable2 = p.apply("b",Create.of(TABLEROWS_ARRAY_2));
PCollection<TableRow> joinedTables = Table.join(pCollectionTable1, pCollectionTable2);
DataflowAssert.that(joinedTables).containsInAnyOrder(TABLEROW_TEST);
p.run();
}
We are getting the following exception:
Sep 25, 2015 10:42:50 AM com.google.cloud.dataflow.sdk.testing.DataflowAssert$TwoSideInputAssert$CheckerDoFn processElement
SEVERE: DataflowAssert failed expectations.
java.lang.AssertionError:
Expected: iterable over [<{id=x}>] in any order
but: Not matched: <{id=x}>
at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
at org.junit.Assert.assertThat(Assert.java:865)
at org.junit.Assert.assertThat(Assert.java:832)
at ...
In order to simplify the DataflowAssert test, we hardcoded the output of Table.join to match the DataflowAssert expectation, having:
private static final TableRow TABLEROW_TEST = new TableRow()
.set("id", "x");
static PCollection<TableRow> join(PCollection<TableRow> pCollectionTable1,
PCollection<TableRow> pCollectionTable2) throws Exception {
final TupleTag<String> pCollectionTable1Tag = new TupleTag<String>();
final TupleTag<String> pCollectionTable2Tag = new TupleTag<String>();
PCollection<KV<String, String>> table1Data = pCollectionTable1
.apply(ParDo.of(new ExtractTable1DataFn()));
PCollection<KV<String, String>> table2Data = pCollectionTable2
.apply(ParDo.of(new ExtractTable2DataFn()));
PCollection<KV<String, CoGbkResult>> kvpCollection = KeyedPCollectionTuple
.of(pCollectionTable1Tag, table1Data).and(pCollectionTable2Tag, table2Data)
.apply(CoGroupByKey.<String> create());
PCollection<KV<String, String>> resultCollection = kvpCollection
.apply(ParDo.named("Process join")
.of(new DoFn<KV<String, CoGbkResult>, KV<String, String>>() {
private static final long serialVersionUID = 0;
@Override
public void processElement(ProcessContext c) {
// System.out.println(c);
KV<String, CoGbkResult> e = c.element();
String key = e.getKey();
String value = null;
for (String table1Value : c.element().getValue().getAll(pCollectionTable2Tag)) {
for (String table2Value : c.element().getValue().getAll(pCollectionTable2Tag)) {
value = table1Value + "," + table2Value;
}
}
c.output(KV.of(key, value));
}
}));
PCollection<TableRow> formattedResults = resultCollection.apply(
ParDo.named("Format join").of(new DoFn<KV<String, String>, TableRow>() {
private static final long serialVersionUID = 0;
public void processElement(ProcessContext c) {
TableRow row = new TableRow().set("id", "x");
c.output(row);
}
}));
return formattedResults;
}
Does anyone know what we are doing wrong?
I think the error message is telling you that the actual collection contains more copies of that element than the expectation.
Expected: iterable over [<{id=x}>] in any order
but: Not matched: <{id=x}>
This is hamcrest indicating that you wanted an iterable over a single element, but the actual collection had an item which wasn't matched. Since all of the items coming out of "format join" have the same value, it made this harder to read than it should have been.
Specifically, this is the message produced when I run the following test, which checks whether a collection with two copies of row contains exactly one copy of row:
@Category(RunnableOnService.class)
@Test
public void testTableRow() throws Exception {
Pipeline p = TestPipeline.create();
TableRow row = new TableRow().set("id", "x");
PCollection<TableRow> rows = p.apply(Create.<TableRow>of(row, row));
DataflowAssert.that(rows).containsInAnyOrder(row);
p.run();
}
In order to get that result with your code, I had to take advantage of the fact that you only iterate over entries in table2. Specifically:
// Use these as the input tables.
table1 = [("keyA", "A1a"), ("keyA", "A1b")]
table2 = [("keyA", "A2a"), ("keyA", "A2b"), ("keyB", "B2")]
// The CoGroupByKey returns
[("keyA", (["A1a", "A1b"], ["A2a", "A2b"])),
("keyB", ([], ["B2"]))]
// When run through "Process join" this produces.
// For details on why see the next section.
["A2b,A2b",
"B2,B2"]
// When run through "Format join" this becomes the following.
[{id=x}, {id=x}]
Note that the DoFn for "Process join" may not produce the expected results as commented below:
String key = e.getKey();
String value = null;
// NOTE: Both table1Value and table2Value iterate over pCollectionTable2Tag
for (String table1Value : c.element().getValue().getAll(pCollectionTable2Tag)) {
for (String table2Value : c.element().getValue().getAll(pCollectionTable2Tag)) {
// NOTE: this updates value, and doesn't output it. So for each
// key there will be a single output with the *last* value
// rather than one for each pair.
value = table1Value + "," + table2Value;
}
}
c.output(KV.of(key, value));
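A version of the inner loops that addresses both notes, iterating table 1 against table 2 and emitting one value per pair, can be sketched in plain Java. JoinSketch and pairs are hypothetical names; in the DoFn the two inputs would come from getAll(pCollectionTable1Tag) and getAll(pCollectionTable2Tag), and each pair would go to c.output():

```java
import java.util.ArrayList;
import java.util.List;

final class JoinSketch {
    // Returns one "left,right" string per (table1Value, table2Value) pair.
    // If either side is empty the result is empty, i.e. inner-join behaviour.
    static List<String> pairs(Iterable<String> table1Values, Iterable<String> table2Values) {
        List<String> out = new ArrayList<>();
        for (String left : table1Values) {
            for (String right : table2Values) {
                out.add(left + "," + right);
            }
        }
        return out;
    }
}
```

With the example tables above this gives four values for keyA and none for keyB, so the expectation passed to containsInAnyOrder would need to list every pair.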