JdbcIO.readAll not firing results - google-cloud-dataflow

I'm working with Apache Beam, trying to enrich data (based on this), but it seems Beam has changed since then, as GroupByKey no longer works with unbounded sources (like PubSub) without windowing.
This is what I've got (overly simplified):
PCollection<String> input = pipeline.apply("Read pubsub",
PubsubIO.readStrings().fromTopic(options.getInputTopic()))
.apply("Log element", ParDo.of(new DoFn<String, String>() {
@ProcessElement
public void processElement(ProcessContext c) {
System.out.println(String.format("incoming %s", c.element()));
c.output(c.element());
}
}))
.apply(Window.into(FixedWindows.of(Duration.standardSeconds(5))));
PCollection<KV<String, String>> incomingData = input
.apply("Apply Random Key", MapElements
.via(new SimpleFunction<String, KV<String, String>>() {
public KV<String, String> apply(String json) {
JSONObject jsonObject = new JSONObject(json);
System.out.println(String.format("JSON: %s, %s", jsonObject.getString("id"), jsonObject.get("usageRules")));
return KV.of(jsonObject.getString("id"), json);
}
})
);
PCollection<KV<String,String>> enrichedData = incomingData
.apply("Search in db",
JdbcIO.<KV<String,String>, KV<String,String>>readAll()
.withDataSourceConfiguration(config)
.withQuery("SELECT * FROM myTable WHERE id = ?")
.withParameterSetter((element, preparedStatement) ->
preparedStatement.setString(1, element.getKey())
)
.withRowMapper(resultSet -> {
System.out.println(String.format("Result from db: %s", resultSet.getString("id")));
return KV.of(resultSet.getString("id"), resultSet.getString("id"));
})
.withCoder(KvCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of())));
GroupByKey.applicableTo(enrichedData);
TupleTag<String> CREATE_TAG = new TupleTag<>();
TupleTag<String> UPDATE_TAG = new TupleTag<>();
KeyedPCollectionTuple
.of(CREATE_TAG, incomingData)
.and(UPDATE_TAG, enrichedData)
.apply("Combine", CoGroupByKey.create())
.apply("Show data?", ParDo.of(new DoFn<KV<String, CoGbkResult>, String>() {
@ProcessElement
public void processElement(ProcessContext context) {
System.out.println("Print from CoGbkResult");
System.out.println(context.element().getKey());
System.out.println(context.element().getValue());
}
}));
At the moment, with windowing, getting the incoming data, transforming it into a JSONObject, and searching in the DB all work fine; the problem is that any .apply done after the JdbcIO.readAll does not run at all. The line "Print from CoGbkResult" is never printed.
I've tried modifying the window, adding other triggers, and outputting a result immediately, but it just stops at the RowMapper.
Thanks for your help

Related

Enable Streaming Data transformation using Cloud Dataflow

I am trying to implement a streaming data transformation from one CloudSQL table to another CloudSQL table using windowing (unbounded PCollections).
My Dataflow job completes successfully, but it does not keep running when new data arrives in the first table. Have I missed anything at the code level to make it run in streaming mode?
Run Command:
--project=<your project ID> --stagingLocation=gs://<your staging bucket>
--runner=DataflowRunner --streaming=true
Code Snippet:
PCollection<TableRow> tblRows = p.apply(JdbcIO.<TableRow>read()
.withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
"org.postgresql.Driver", connectionString)
.withUsername(<UserName>).withPassword(<PWD>))
.withQuery("select id,order_number from public.tableRead")
.withCoder(TableRowJsonCoder.of())
.withRowMapper(new JdbcIO.RowMapper<TableRow>() {
public TableRow mapRow(ResultSet resultSet) throws Exception
{
TableRow result = new TableRow();
result.set("id", resultSet.getString("id"));
result.set("order_number", resultSet.getString("order_number"));
return result;
}
})
);
PCollection<TableRow> tblWindow = tblRows.apply("window 1s", Window.into(FixedWindows.of(Duration.standardMinutes(5))));
PCollection<KV<Integer,TableRow>> keyedTblWindow= tblRows.apply("Process", ParDo.of(new DoFn<TableRow, KV<Integer,TableRow>>()
{
@ProcessElement
public void processElement(ProcessContext c) {
TableRow leftRow = c.element();
c.output(KV.of(Integer.parseInt(leftRow.get("id").toString()), leftRow) );
}}));
PCollection<KV<Integer, Iterable<TableRow>>> groupedWindow = keyedTblWindow.apply(GroupByKey.<Integer, TableRow> create());
groupedWindow.apply(JdbcIO.<KV<Integer, Iterable<TableRow>>>write()
.withDataSourceConfiguration( DataSourceConfiguration.create("org.postgresql.Driver", connectionString)
.withUsername(<UserName>).withPassword(<PWD>))
.withStatement("insert into streaming_testing(id,order_number) values(?, ?)")
.withPreparedStatementSetter(new JdbcIO.PreparedStatementSetter<KV<Integer, Iterable<TableRow>>>() {
public void setParameters(KV<Integer, Iterable<TableRow>> element, PreparedStatement query)
throws SQLException {
Iterable<TableRow> rightRowsIterable = element.getValue();
for (Iterator<TableRow> i = rightRowsIterable.iterator(); i.hasNext(); ) {
TableRow mRow = (TableRow) i.next();
query.setInt(1, Integer.parseInt(mRow.get("id").toString()));
query.setInt(2, Integer.parseInt(mRow.get("order_number").toString()));
}
}
})
);
JdbcIO just works on a snapshot of the data produced at the time of execution and does not poll for changes.
I suppose you have to build a custom ParDo to detect changes in the database.
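If periodically re-reading the table is good enough, one option (an untested sketch, not a verified pattern) is to drive JdbcIO.readAll() from a GenerateSequence tick so the query is re-executed on a schedule. The 5-minute interval below is an arbitrary illustration, and this simply re-reads the whole table on every tick rather than truly detecting changes:
PCollection<TableRow> polledRows = p
    .apply("Tick every 5 minutes", GenerateSequence.from(0)
        .withRate(1, Duration.standardMinutes(5)))
    .apply("Query on each tick", JdbcIO.<Long, TableRow>readAll()
        .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
            "org.postgresql.Driver", connectionString)
            .withUsername(<UserName>).withPassword(<PWD>))
        .withQuery("select id,order_number from public.tableRead")
        .withParameterSetter((tick, preparedStatement) -> {
            // No query parameters; each tick just re-executes the same query.
        })
        .withRowMapper(new JdbcIO.RowMapper<TableRow>() {
            public TableRow mapRow(ResultSet resultSet) throws Exception {
                TableRow result = new TableRow();
                result.set("id", resultSet.getString("id"));
                result.set("order_number", resultSet.getString("order_number"));
                return result;
            }
        })
        .withCoder(TableRowJsonCoder.of()));
Filtering on a last-modified column (if one exists) or deduplicating downstream would still be up to you.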

Beam: No events with SideInput in a streaming pipeline with DataflowRunner

I've tested side inputs in a streaming pipeline with DirectRunner and DataflowRunner with this code:
public class Testsideinput {
private static final Logger LOG = LoggerFactory.getLogger(Testsideinput.class);
static class RefreshCache extends DoFn<Long, String> {
private static final long serialVersionUID = 1;
private static final Random RANDOM = new Random();
@ProcessElement
public void processElement(ProcessContext c) {
c.output("A"+c.element());
c.output("B"+c.element());
c.output("C"+c.element());
c.output("D"+c.element());
c.output("E"+c.element());
c.output("F"+c.element());
}
}
public static void main(String[] args) {
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
Pipeline pipeline = Pipeline.create(options);
final PCollectionView<List<String>> sideInput2 =
pipeline.apply("TextIO", TextIO.read().from("<Put your gs://>))
.apply("viewTags", View.asList());
final PCollectionView<List<String>> sideInput =
pipeline.apply("GenerateSequence",
GenerateSequence
.from(0)
.withRate(1, Duration.standardSeconds(1)))
.apply("Window GenerateSequence",
Window.into(FixedWindows.of(Duration.standardSeconds(5))))
.apply("Counts", Combine.globally(Sum.ofLongs()).withoutDefaults())
.apply("RefreshCache", ParDo.of(new RefreshCache()))
.apply("viewTags", View.asList());
final PubsubIO.Read<PubsubMessage> pubsubRead =
PubsubIO.readMessages()
.withIdAttribute("id")
.withTimestampAttribute("ts")
.fromTopic("<put your topic>");
// PCollection<KV<String,Long>> taxi =;
PCollection<String> taxi =
pipeline.apply("Read from", pubsubRead)
.apply("Window Fixed",
Window.into(FixedWindows.of(Duration.standardSeconds(15))))
.apply(MapElements.via(new PubSubToTableRow()))
.apply("key rides by rideid",
MapElements
.into(TypeDescriptors
.kvs(TypeDescriptors.strings(),
TypeDescriptor.of(TableRow.class)))
.via(ride -> KV.of(ride.get("ride_id").toString(), ride)))
.apply("Count Per Element", Count.perKey())
.apply(
ParDo.of(new DoFn<KV<String,Long>, String>() {
@ProcessElement
public void processElement(
@Element KV<String,Long> value,
OutputReceiver<String> out, ProcessContext c) {
// In our DoFn, access the side input.
List<String> sideinput = c.sideInput(sideInput);
List<String> sideinput2 = c.sideInput(sideInput2);
LOG.info("sideinput" + sideinput.toString());
LOG.info("sideinput2 " + sideinput2.toString());
LOG.info("value " + value);
out.output("test");
}
}).withSideInputs(sideInput,sideInput2));
pipeline.run();
}
}
I get all the values of my side inputs (list and map) on DirectRunner, but I get no values with DataflowRunner (there is no output from the View.CreatePCollectionView/ParDo(StreamingPCollectionViewWriter) step).
Do you have an idea how to solve this?

java.lang.IllegalArgumentException: calling sideInput() with unknown view

I tried to move the data from one table to another table. I used a side input for filtering the records while transforming the data. The side input is also a KV collection and is loaded from another table.
When I run my pipeline I get a "java.lang.IllegalArgumentException: calling sideInput() with unknown view" error.
Here is the entire code that I tried:
{
PipelineOptionsFactory.register(OptionPipeline.class);
OptionPipeline options = PipelineOptionsFactory.fromArgs(args).withValidation().as(OptionPipeline.class);
Pipeline p = Pipeline.create(options);
PCollection<TableRow> sideInputData = p.apply("ReadSideInput",BigQueryIO.readTableRows().from(options.getOrgRegionMapping()));
PCollection<KV<String,String>> sideInputMap = sideInputData.apply(ParDo.of(new getSideInputDataFn()));
final PCollectionView<Map<String,String>> sideInputView = sideInputMap.apply(View.<String,String>asMap());
PCollection<TableRow> orgMaster = p.apply("ReadOrganization",BigQueryIO.readTableRows().from(options.getOrgCodeMaster()));
PCollection<TableRow> orgCode = orgMaster.apply(ParDo.of(new gnGetOrgMaster()));
@SuppressWarnings("serial")
PCollection<TableRow> finalResultCollection = orgCode.apply("Process", ParDo.of(new DoFn<TableRow, TableRow>()
{
@ProcessElement
public void processElement(ProcessContext c) {
TableRow outputRow = new TableRow();
TableRow orgCodeRow = c.element();
String orgCodefromMaster = (String) orgCodeRow.get("orgCode");
String region = c.sideInput(sideInputView).get(orgCodefromMaster);
outputRow.set("orgCode", orgCodefromMaster);
outputRow.set("orgName", orgCodeRow.get("orgName"));
outputRow.set("orgName", region);
DateFormat df = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSSSSS");
Date dateobj = new Date();
outputRow.set("updatedDate",df.format(dateobj));
c.output(outputRow);
}
}));
finalResultCollection.apply(BigQueryIO.writeTableRows()
.withSchema(schema)
.to(options.getOrgCodeTable())
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
p.run().waitUntilFinish();
}
@SuppressWarnings("serial")
static class getSideInputDataFn extends DoFn<TableRow,KV<String, String>>
{
@ProcessElement
public void processElement(ProcessContext c)
{
TableRow row = c.element();
c.output(KV.of((String) row.get("orgcode"), (String) row.get("region")));
}
}
It looks like the runner is complaining because you never told it about the side input when defining the graph. In this case, call .withSideInputs after the ParDo.of call, passing in a reference to the PCollectionView<T> you defined earlier.
@SuppressWarnings("serial")
PCollection<TableRow> finalResultCollection = orgCode.apply("Process", ParDo.of(new DoFn<TableRow, TableRow>()
{
@ProcessElement
public void processElement(ProcessContext c) {
TableRow outputRow = new TableRow();
TableRow orgCodeRow = c.element();
String orgCodefromMaster = (String) orgCodeRow.get("orgCode");
String region = c.sideInput(sideInputView).get(orgCodefromMaster);
outputRow.set("orgCode", orgCodefromMaster);
outputRow.set("orgName", orgCodeRow.get("orgName"));
outputRow.set("orgName", region);
DateFormat df = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSSSSS");
Date dateobj = new Date();
outputRow.set("updatedDate",df.format(dateobj));
c.output(outputRow);
}
}).withSideInputs(sideInputView));
I didn't test this code but that's what stands out when I look at it.

Unit test hangs forever if `DoFn` resets event timers

I'm unit testing (with TestStream and PAssert) a DoFn that resets event timers. The test hangs forever if the DoFn resets timers, and this behavior seems specific to event-time domain timers.
Is this a bug in Beam's testing facilities or expected timer behavior?
Here is a toy example that reproduces this behavior with the Beam 2.3 SDK.
static class KeyElements extends DoFn<String, KV<String, String>> {
@ProcessElement
public void processElement(ProcessContext context) {
final String[] parts = context.element().split(":");
if (parts.length == 2) {
context.output(KV.of(parts[0], parts[1]));
}
}
}
static class TimerDoFn extends DoFn<KV<String, String>, KV<String, String>> {
#TimerId("expiry")
private final TimerSpec timerSpec = TimerSpecs.timer(TimeDomain.EVENT_TIME);
#ProcessElement
public void processElement(ProcessContext context, #TimerId("expiry") Timer timer) {
timer.set(context.timestamp().plus(Duration.standardHours(1)));
final KV<String, String> e = context.element();
context.output(KV.of(e.getKey(), e.getValue() + "_output"));
}
#OnTimer("expiry")
public void onExpiry(OnTimerContext context) {
// do nothing
}
}
@Rule
public TestPipeline p = TestPipeline.create();
@Test
public void testTimerDoFn() {
TestStream<String> stream = TestStream
.create(StringUtf8Coder.of())
.addElements(
TimestampedValue.of("a:0", new Instant(0)),
TimestampedValue.of("a:1", new Instant(1)),
TimestampedValue.of("a:2", new Instant(2)),
TimestampedValue.of("a:3", new Instant(3)))
.advanceWatermarkToInfinity();
PCollection<KV<String, String>> result = p
.apply(stream)
.apply(ParDo.of(new KeyElements()))
.apply(ParDo.of(new TimerDoFn()));
PAssert.that(result).containsInAnyOrder(
KV.of("a", "0_output"),
KV.of("a", "1_output"),
KV.of("a", "2_output"),
KV.of("a", "3_output"));
p.run();
}
The above test hangs if the input elements are a:1, b:2, c:3, d:4.

DataflowAssert doesn't pass TableRow test

We don't know why DataflowAssert fails when running this simple test:
@Test
@Category(RunnableOnService.class)
public void testTableRow() throws Exception {
Pipeline p = TestPipeline.create();
PCollection<TableRow> pCollectionTable1 = p.apply("a",Create.of(TABLEROWS_ARRAY_1));
PCollection<TableRow> pCollectionTable2 = p.apply("b",Create.of(TABLEROWS_ARRAY_2));
PCollection<TableRow> joinedTables = Table.join(pCollectionTable1, pCollectionTable2);
DataflowAssert.that(joinedTables).containsInAnyOrder(TABLEROW_TEST);
p.run();
}
We are getting the following exception:
Sep 25, 2015 10:42:50 AM com.google.cloud.dataflow.sdk.testing.DataflowAssert$TwoSideInputAssert$CheckerDoFn processElement
SEVERE: DataflowAssert failed expectations.
java.lang.AssertionError:
Expected: iterable over [<{id=x}>] in any order
but: Not matched: <{id=x}>
at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
at org.junit.Assert.assertThat(Assert.java:865)
at org.junit.Assert.assertThat(Assert.java:832)
at ...
In order to simplify the DataflowAssert test, we hardcoded the output of Table.join to match DataflowAssert, having:
private static final TableRow TABLEROW_TEST = new TableRow()
.set("id", "x");
static PCollection<TableRow> join(PCollection<TableRow> pCollectionTable1,
PCollection<TableRow> pCollectionTable2) throws Exception {
final TupleTag<String> pCollectionTable1Tag = new TupleTag<String>();
final TupleTag<String> pCollectionTable2Tag = new TupleTag<String>();
PCollection<KV<String, String>> table1Data = pCollectionTable1
.apply(ParDo.of(new ExtractTable1DataFn()));
PCollection<KV<String, String>> table2Data = pCollectionTable2
.apply(ParDo.of(new ExtractTable2DataFn()));
PCollection<KV<String, CoGbkResult>> kvpCollection = KeyedPCollectionTuple
.of(pCollectionTable1Tag, table1Data).and(pCollectionTable2Tag, table2Data)
.apply(CoGroupByKey.<String> create());
PCollection<KV<String, String>> resultCollection = kvpCollection
.apply(ParDo.named("Process join")
.of(new DoFn<KV<String, CoGbkResult>, KV<String, String>>() {
private static final long serialVersionUID = 0;
@Override
public void processElement(ProcessContext c) {
// System.out.println(c);
KV<String, CoGbkResult> e = c.element();
String key = e.getKey();
String value = null;
for (String table1Value : c.element().getValue().getAll(pCollectionTable2Tag)) {
for (String table2Value : c.element().getValue().getAll(pCollectionTable2Tag)) {
value = table1Value + "," + table2Value;
}
}
c.output(KV.of(key, value));
}
}));
PCollection<TableRow> formattedResults = resultCollection.apply(
ParDo.named("Format join").of(new DoFn<KV<String, String>, TableRow>() {
private static final long serialVersionUID = 0;
public void processElement(ProcessContext c) {
TableRow row = new TableRow().set("id", "x");
c.output(row);
}
}));
return formattedResults;
}
Does anyone know what we are doing wrong?
I think the error message is telling you that the actual collection contains more copies of that element than the expectation.
Expected: iterable over [<{id=x}>] in any order
but: Not matched: <{id=x}>
This is hamcrest indicating that you wanted an iterable over a single element, but the actual collection had an item which wasn't matched. Since all of the items coming out of "format join" have the same value, it made this harder to read than it should have been.
Specifically, this is the message produced when I run the following test, which checks whether a collection containing two copies of row contains exactly one copy of row:
@Category(RunnableOnService.class)
@Test
public void testTableRow() throws Exception {
Pipeline p = TestPipeline.create();
TableRow row = new TableRow().set("id", "x");
PCollection<TableRow> rows = p.apply(Create.<TableRow>of(row, row));
DataflowAssert.that(rows).containsInAnyOrder(row);
p.run();
}
In order to get that result with your code, I had to take advantage of the fact that you only iterate over entries in table2. Specifically:
// Use these as the input tables.
table1 = [("keyA", "A1a"), ("keyA", "A1b]
table2 = [("keyA", "A2a"), ("keyA", "A2b"), ("keyB", "B2")]
// The CoGroupByKey returns
[("keyA", (["A1a", "A1b"], ["A2a", "A2b"])),
("keyB", ([], ["B2"]))]
// When run through "Process join" this produces.
// For details on why see the next section.
["A2b,A2b",
"B2,B2"]
// When run through "Format join" this becomes the following.
[{id=x}, {id=x}]
Note that the DoFn for "Process join" may not produce the expected results as commented below:
String key = e.getKey();
String value = null;
// NOTE: Both table1Value and table2Value iterate over pCollectionTable2Tag
for (String table1Value : c.element().getValue().getAll(pCollectionTable2Tag)) {
for (String table2Value : c.element().getValue().getAll(pCollectionTable2Tag)) {
// NOTE: this updates value, and doesn't output it. So for each
// key there will be a single output with the *last* value
// rather than one for each pair.
value = table1Value + "," + table2Value;
}
}
c.output(KV.of(key, value));
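For reference, a corrected "Process join" body (an untested sketch) would iterate table1's tag in the outer loop and table2's tag in the inner loop, and output one result per pair instead of overwriting a single value:
@Override
public void processElement(ProcessContext c) {
    KV<String, CoGbkResult> e = c.element();
    String key = e.getKey();
    // Iterate each table's own tag and emit one output per (table1, table2) pair.
    for (String table1Value : e.getValue().getAll(pCollectionTable1Tag)) {
        for (String table2Value : e.getValue().getAll(pCollectionTable2Tag)) {
            c.output(KV.of(key, table1Value + "," + table2Value));
        }
    }
}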
