We don't know why when running this simple test, DataflowAssert fails:
#Test
#Category(RunnableOnService.class)
public void testTableRow() throws Exception {
Pipeline p = TestPipeline.create();
PCollection<TableRow> pCollectionTable1 = p.apply("a",Create.of(TABLEROWS_ARRAY_1));
PCollection<TableRow> pCollectionTable2 = p.apply("b",Create.of(TABLEROWS_ARRAY_2));
PCollection<TableRow> joinedTables = Table.join(pCollectionTable1, pCollectionTable2);
DataflowAssert.that(joinedTables).containsInAnyOrder(TABLEROW_TEST);
p.run();
}
We are getting the following exception:
Sep 25, 2015 10:42:50 AM com.google.cloud.dataflow.sdk.testing.DataflowAssert$TwoSideInputAssert$CheckerDoFn processElement
SEVERE: DataflowAssert failed expectations.
java.lang.AssertionError:
Expected: iterable over [<{id=x}>] in any order
but: Not matched: <{id=x}>
at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
at org.junit.Assert.assertThat(Assert.java:865)
at org.junit.Assert.assertThat(Assert.java:832)
at ...
In order to simplify the DataflowAssert test we hardcoded the output of Table.join to match DataflowAssert,having:
private static final TableRow TABLEROW_TEST = new TableRow()
.set("id", "x");
static PCollection<TableRow> join(PCollection<TableRow> pCollectionTable1,
PCollection<TableRow> pCollectionTable2) throws Exception {
final TupleTag<String> pCollectionTable1Tag = new TupleTag<String>();
final TupleTag<String> pCollectionTable2Tag = new TupleTag<String>();
PCollection<KV<String, String>> table1Data = pCollectionTable1
.apply(ParDo.of(new ExtractTable1DataFn()));
PCollection<KV<String, String>> table2Data = pCollectionTable2
.apply(ParDo.of(new ExtractTable2DataFn()));
PCollection<KV<String, CoGbkResult>> kvpCollection = KeyedPCollectionTuple
.of(pCollectionTable1Tag, table1Data).and(pCollectionTable2Tag, table2Data)
.apply(CoGroupByKey.<String> create());
PCollection<KV<String, String>> resultCollection = kvpCollection
.apply(ParDo.named("Process join")
.of(new DoFn<KV<String, CoGbkResult>, KV<String, String>>() {
private static final long serialVersionUID = 0;
#Override
public void processElement(ProcessContext c) {
// System.out.println(c);
KV<String, CoGbkResult> e = c.element();
String key = e.getKey();
String value = null;
for (String table1Value : c.element().getValue().getAll(pCollectionTable2Tag)) {
for (String table2Value : c.element().getValue().getAll(pCollectionTable2Tag)) {
value = table1Value + "," + table2Value;
}
}
c.output(KV.of(key, value));
}
}));
PCollection<TableRow> formattedResults = resultCollection.apply(
ParDo.named("Format join").of(new DoFn<KV<String, String>, TableRow>() {
private static final long serialVersionUID = 0;
public void processElement(ProcessContext c) {
TableRow row = new TableRow().set("id", "x");
c.output(row);
}
}));
return formattedResults;
}
Does anyone know what we are doing wrong?
I think the error message is telling you that the actual collection contains more copies of that element than the expectation.
Expected: iterable over [<{id=x}>] in any order
but: Not matched: <{id=x}>
This is hamcrest indicating that you wanted an iterable over a single element, but the actual collection had an item which wasn't matched. Since all of the items coming out of "format join" have the same value, it made this harder to read than it should have been.
Specifically, this is the message produced when I run the following test, which checks to see if the collection with two copies of row is the contains exactly one copy of row:
#Category(RunnableOnService.class)
#Test
public void testTableRow() throws Exception {
Pipeline p = TestPipeline.create();
TableRow row = new TableRow().set("id", "x");
PCollection<TableRow> rows = p.apply(Create.<TableRow>of(row, row));
DataflowAssert.that(rows).containsInAnyOrder(row);
p.run();
}
In order to get that result with your code, I had to take advantage of the fact that you only iterate over entries in table2. Specifically:
// Use these as the input tables.
table1 = [("keyA", "A1a"), ("keyA", "A1b]
table2 = [("keyA", "A2a"), ("keyA", "A2b"), ("keyB", "B2")]
// The CoGroupByKey returns
[("keyA", (["A1a", "A1b"], ["A2a", "A2b"])),
("keyB", ([], ["B2"]))]
// When run through "Process join" this produces.
// For details on why see the next section.
["A2b,A2b",
"B2,B2"]
// When run through "Format join" this becomes the following.
[{id=x}, {id=x}]
Note that the DoFn for "Process join" may not produce the expected results as commented below:
String key = e.getKey();
String value = null;
// NOTE: Both table1Value and table2Value iterate over pCollectionTable2Tag
for (String table1Value : c.element().getValue().getAll(pCollectionTable2Tag)) {
for (String table2Value : c.element().getValue().getAll(pCollectionTable2Tag)) {
// NOTE: this updates value, and doesn't output it. So for each
// key there will be a single output with the *last* value
// rather than one for each pair.
value = table1Value + "," + table2Value;
}
}
c.output(KV.of(key, value));
Related
I'm working with Apache Beam, trying to enrich data (based on this), but it seems that Beam has changed in a while, as GroupByKey does not work with unbounded sources (like PubSub) without windowing.
This is what I've got (overly simplified):
PCollection<String> input = pipeline.apply("Read pubsub",
PubsubIO.readStrings().fromTopic(options.getInputTopic()))
.apply("Log element", ParDo.of(new DoFn<String, String>() {
#ProcessElement
public void processElement(ProcessContext c) {
System.out.println(String.format("incomig %s", c.element()));
c.output(c.element());
}
}))
.apply(Window.into(FixedWindows.of(Duration.standardSeconds(5))));
PCollection<KV<String, String>> incomingData = input
.apply("Apply Random Key", MapElements
.via(new SimpleFunction<String, KV<String, String>>() {
public KV<String, String> apply(String json) {
JSONObject jsonObject = new JSONObject(json);
System.out.println(String.format("JSON: %s, %s", jsonObject.getString("id"), jsonObject.get("usageRules")));
return KV.of(jsonObject.getString("id"), json);
}
})
);
PCollection<KV<String,String>> enrichedData = incomingData
.apply("Search in db",
JdbcIO.<KV<String,String>, KV<String,String>>readAll()
.withDataSourceConfiguration(config)
.withQuery("SELECT * FROM myTable WHERE id = ?")
.withParameterSetter((element, preparedStatement) ->
preparedStatement.setString(1, element.getKey())
)
.withRowMapper(resultSet -> {
System.out.println(String.format("Result from db: %s", resultSet.getString("id")));
return KV.of(resultSet.getString("id"), resultSet.getString("id"));
})
.withCoder(KvCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of())));
GroupByKey.applicableTo(enrichedData);
TupleTag<String> CREATE_TAG = new TupleTag<>();
TupleTag<String> UPDATE_TAG = new TupleTag<>();
KeyedPCollectionTuple
.of(CREATE_TAG, incomingData)
.and(UPDATE_TAG, enrichedData)
.apply("Combine", CoGroupByKey.create())
.apply("Show data?", ParDo.of(new DoFn<KV<String, CoGbkResult>, String>() {
#ProcessElement
public void processElement(ProcessContext context) {
System.out.println("Print from CoGbkResult");
System.out.println(context.element().getKey());
System.out.println(context.element().getValue());
}
}));
At the moment, with windowing, getting incoming data, transforming it into JSONObject and searching in BD is working fine, the problem is that any .apply done after the JdbcIO.readAll is not working at all. The line "Print from CoGbkResult" just doesn't get printed at all.
I've tried modifying the window, adding other triggers, trying just to output a result just immediately, but it just stop at the RowMapper.
Thanks for your help
I tried to move the data from one table to another table. Used SideInput for filtering the records while transform the data. SideInput also type of KV collection and its loaded the data from another table.
When run my pipeline got "java.lang.IllegalArgumentException: calling sideInput() with unknown view" error.
Here is the entire code that I tried:
{
PipelineOptionsFactory.register(OptionPipeline.class);
OptionPipeline options = PipelineOptionsFactory.fromArgs(args).withValidation().as(OptionPipeline.class);
Pipeline p = Pipeline.create(options);
PCollection<TableRow> sideInputData = p.apply("ReadSideInput",BigQueryIO.readTableRows().from(options.getOrgRegionMapping()));
PCollection<KV<String,String>> sideInputMap = sideInputData.apply(ParDo.of(new getSideInputDataFn()));
final PCollectionView<Map<String,String>> sideInputView = sideInputMap.apply(View.<String,String>asMap());
PCollection<TableRow> orgMaster = p.apply("ReadOrganization",BigQueryIO.readTableRows().from(options.getOrgCodeMaster()));
PCollection<TableRow> orgCode = orgMaster.apply(ParDo.of(new gnGetOrgMaster()));
#SuppressWarnings("serial")
PCollection<TableRow> finalResultCollection = orgCode.apply("Process", ParDo.of(new DoFn<TableRow, TableRow>()
{
#ProcessElement
public void processElement(ProcessContext c) {
TableRow outputRow = new TableRow();
TableRow orgCodeRow = c.element();
String orgCodefromMaster = (String) orgCodeRow.get("orgCode");
String region = c.sideInput(sideInputView).get(orgCodefromMaster);
outputRow.set("orgCode", orgCodefromMaster);
outputRow.set("orgName", orgCodeRow.get("orgName"));
outputRow.set("orgName", region);
DateFormat df = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSSSSS");
Date dateobj = new Date();
outputRow.set("updatedDate",df.format(dateobj));
c.output(outputRow);
}
}));
finalResultCollection.apply(BigQueryIO.writeTableRows()
.withSchema(schema)
.to(options.getOrgCodeTable())
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
p.run().waitUntilFinish();
}
#SuppressWarnings("serial")
static class getSideInputDataFn extends DoFn<TableRow,KV<String, String>>
{
#ProcessElement
public void processElement(ProcessContext c)
{
TableRow row = c.element();
c.output(KV.of((String) row.get("orgcode"), (String) row.get("region")));
}
}
It looks like the runner is complaining because you never told it about the side input when defining the graph. In this case you call .withSideInputs after the ParDo.of call passing in the reference to the PCollectionView<T> you defined earlier.
#SuppressWarnings("serial")
PCollection<TableRow> finalResultCollection = orgCode.apply("Process", ParDo.of(new DoFn<TableRow, TableRow>()
{
#ProcessElement
public void processElement(ProcessContext c) {
TableRow outputRow = new TableRow();
TableRow orgCodeRow = c.element();
String orgCodefromMaster = (String) orgCodeRow.get("orgCode");
String region = c.sideInput(sideInputView).get(orgCodefromMaster);
outputRow.set("orgCode", orgCodefromMaster);
outputRow.set("orgName", orgCodeRow.get("orgName"));
outputRow.set("orgName", region);
DateFormat df = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSSSSS");
Date dateobj = new Date();
outputRow.set("updatedDate",df.format(dateobj));
c.output(outputRow);
}
}).withSideInputs(sideInputView));
I didn't test this code but that's what stands out when I look at it.
I have a simple flow which aim is to write two lines in one BigQuery Table.
I use a DynamicDestinations because after that I will write on mutliple Table, on that example it's the same table...
The problem is that I only have 1 line in my BigQuery table at the end.
It stacktrace I see the following error on the second insert
"
status: {
code: 6
message: "Already Exists: Job sampleprojet3:b9912b9b05794aec8f4292b2ae493612_eeb0082ade6f4a58a14753d1cc92ddbc_00001-0"
}
"
What does it means ?
Is it related to this limitation ?
https://github.com/GoogleCloudPlatform/DataflowJavaSDK/issues/550
How can I do the job ?
I use BeamSDK 2.0.0, I have try with 2.1.0 (same result)
The way I launch :
mvn compile exec:java -Dexec.mainClass=fr.gireve.dataflow.LogsFlowBug -Dexec.args="--runner=DataflowRunner --inputDir=gs://sampleprojet3.appspot.com/ --project=sampleprojet3 --stagingLocation=gs://dataflow-sampleprojet3/tmp" -Pdataflow-runner
Pipeline p = Pipeline.create(options);
final List<String> tableNameTableValue = Arrays.asList("table1:value1", "table1:value2", "table2:value1", "table2:value2");
p.apply(Create.of(tableNameTableValue)).setCoder(StringUtf8Coder.of())
.apply(BigQueryIO.<String>write()
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
.to(new DynamicDestinations<String, KV<String, String>>() {
#Override
public KV<String, String> getDestination(ValueInSingleWindow<String> element) {
final String[] split = element.getValue().split(":");
return KV.of(split[0], split[1]) ;
}
#Override
public Coder<KV<String, String>> getDestinationCoder() {
return KvCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of());
}
#Override
public TableDestination getTable(KV<String, String> row) {
String tableName = row.getKey();
String tableSpec = "sampleprojet3:testLoadJSON." + tableName;
return new TableDestination(tableSpec, "Table " + tableName);
}
#Override
public TableSchema getSchema(KV<String, String> row) {
List<TableFieldSchema> fields = new ArrayList<>();
fields.add(new TableFieldSchema().setName("myColumn").setType("STRING"));
TableSchema ts = new TableSchema();
ts.setFields(fields);
return ts;
}
})
.withFormatFunction(new SerializableFunction<String, TableRow>() {
public TableRow apply(String row) {
TableRow tr = new TableRow();
tr.set("myColumn", row);
return tr;
}
}));
p.run().waitUntilFinish();
Thanks
DynamicDestinations associates each element with a destination - i.e. where the element should go. Elements are routed to BigQuery tables according to their destinations: 1 destination = 1 BigQuery table with a schema: the destination should include just enough information to produce a TableDestination and a schema. Elements with the same destination go to the same table, elements with different destinations go to different tables.
Your code snippet uses DynamicDestinations with a destination type that contains both the element and the table, which is unnecessary, and of course, violates the constraint above: elements with a different destination end up going to the same table: e.g. KV("table1", "value1") and KV("table1", "value2") are different destinations but your getTable maps them to the same table table1.
You need to remove the element from your destination type. That will also lead to simpler code. As a side note, I think you don't need to override getDestinationCoder() - it can be inferred automatically.
Try this:
.to(new DynamicDestinations<String, String>() {
#Override
public String getDestination(ValueInSingleWindow<String> element) {
return element.getValue().split(":")[0];
}
#Override
public TableDestination getTable(String tableName) {
return new TableDestination(
"sampleprojet3:testLoadJSON." + tableName, "Table " + tableName);
}
#Override
public TableSchema getSchema(String tableName) {
List<TableFieldSchema> fields = Arrays.asList(
new TableFieldSchema().setName("myColumn").setType("STRING"));
return new TableSchema().setFields(fields);
}
})
I have a PCollection [String] say "X" that I need to dump in a BigQuery table.
The table destination and the schema for it is in a PCollection[TableRow] say "Y".
How to accomplish this in the simplest manner?
I tried extracting the table and schema from "Y" and saving it in static global variables (tableName and schema respectively). But somehow oddly the BigQueryIO.writeTableRows() always gets the value of the variable tableName as null. But it gets the schema. I tried logging the values of those variables and I can see the values are there for both.
Here is my pipeline code:
static String tableName;
static TableSchema schema;
PCollection<String> read = p.apply("Read from input file",
TextIO.read().from(options.getInputFile()));
PCollection<TableRow> tableRows = p.apply(
BigQueryIO.read().fromQuery(NestedValueProvider.of(
options.getfilename(),
new SerializableFunction<String, String>() {
#Override
public String apply(String filename) {
return "SELECT table,schema FROM `BigqueryTest.configuration` WHERE file='" + filename +"'";
}
})).usingStandardSql().withoutValidation());
final PCollectionView<List<String>> dataView = read.apply(View.asList());
tableRows.apply("Convert data read from file to TableRow",
ParDo.of(new DoFn<TableRow,TableRow>(){
#ProcessElement
public void processElement(ProcessContext c) {
tableName = c.element().get("table").toString();
String[] schemas = c.element().get("schema").toString().split(",");
List<TableFieldSchema> fields = new ArrayList<>();
for(int i=0;i<schemas.length;i++) {
fields.add(new TableFieldSchema()
.setName(schemas[i].split(":")[0]).setType(schemas[i].split(":")[1]));
}
schema = new TableSchema().setFields(fields);
//My code to convert data to TableRow format.
}}).withSideInputs(dataView));
tableRows.apply("write to BigQuery",
BigQueryIO.writeTableRows()
.withSchema(schema)
.to("ProjectID:DatasetID."+tableName)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));
Everything works fine. Only BigQueryIO.write operation fails and I get the error TableId is null.
I also tried using SerializableFunction and returning the value from there but i still get null.
Here is the code that I tried for it:
tableRows.apply("write to BigQuery",
BigQueryIO.writeTableRows()
.withSchema(schema)
.to(new GetTable(tableName))
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));
public static class GetTable implements SerializableFunction<String,String> {
String table;
public GetTable() {
this.table = tableName;
}
#Override
public String apply(String arg0) {
return "ProjectId:DatasetId."+table;
}
}
I also tried using DynamicDestinations but I get an error saying schema is not provided. Honestly I'm new to the concept of DynamicDestinations and I'm not sure that I'm doing it correctly.
Here is the code that I tried for it:
tableRows2.apply(BigQueryIO.writeTableRows()
.to(new DynamicDestinations<TableRow, TableRow>() {
private static final long serialVersionUID = 1L;
#Override
public TableDestination getTable(TableRow dest) {
List<TableRow> list = sideInput(bqDataView); //bqDataView contains table and schema
String table = list.get(0).get("table").toString();
String tableSpec = "ProjectId:DatasetId."+table;
String tableDescription = "";
return new TableDestination(tableSpec, tableDescription);
}
public String getSideInputs(PCollectionView<List<TableRow>> bqDataView) {
return null;
}
#Override
public TableSchema getSchema(TableRow destination) {
return schema; //schema is getting added from the global variable
}
#Override
public TableRow getDestination(ValueInSingleWindow<TableRow> element) {
return null;
}
}.getSideInputs(bqDataView)));
Please let me know what I'm doing wrong and which path I should take.
Thank You.
Part of the reason your having trouble is because of the two stages of pipeline execution. First the pipeline is constructed on your machine. This is when all of the applications of PTransforms occur. In your first example, this is when the following lines are executed:
BigQueryIO.writeTableRows()
.withSchema(schema)
.to("ProjectID:DatasetID."+tableName)
The code within a ParDo however runs when your pipeline executes, and it does so on many machines. So the following code runs much later than the pipeline construction:
#ProcessElement
public void processElement(ProcessContext c) {
tableName = c.element().get("table").toString();
...
schema = new TableSchema().setFields(fields);
...
}
This means that neither the tableName nor the schema fields will be set at when the BigQueryIO sink is created.
Your idea to use DynamicDestinations is correct, but you need to move the code to actually generate the schema the destination into that class, rather than relying on global variables that aren't available on all of the machines.
I have following type of sample data.
s.n., time, user, time_span, user_level
1, 2016-01-04T1:26:13, Hari, 8, admin
2, 2016-01-04T11:6:13, Gita, 2, admin
3, 2016-01-04T11:26:13, Gita, 0, user
Now I need to find average_time_span/user, average_time_span/user_level and total_time_span/user.
I'm able to find each of above mention value but couldn't able to find all of those at once. As I'm new to DataFlow, please suggest me appropriate method to do so.
static class ExtractUserAndUserLevelFn extends DoFn<String, KV<String, Long>> {
#Override
public void processElement(ProcessContext c) {
String[] words = c.element().split(",");
if (words.length == 5) {
Instant timestamp = Instant.parse(words[1].trim());
KV<String, Long> userTime = KV.of(words[2].trim(), Long.valueOf(words[3].trim()));
KV<String, Long> userLevelTime = KV.of(words[4].trim(), Long.valueOf(words[3].trim()));
c.outputWithTimestamp(userTime, timestamp);
c.outputWithTimestamp(userLevelTime, timestamp);
}
}
}
public static void main(String[] args) {
TestOptions options = PipelineOptionsFactory.fromArgs(args).withValidation()
.as(TestOptions.class);
Pipeline p = Pipeline.create(options);
p.apply(TextIO.Read.named("ReadLines").from(options.getInputFile()))
.apply(ParDo.of(new ExtractUserAndUserLevelFn()))
.apply(Window.<KV<String, Long>>into(
FixedWindows.of(Duration.standardSeconds(options.getMyWindowSize()))))
.apply(GroupByKey.<String, Long>create())
.apply(ParDo.of(new DoFn<KV<String, Iterable<Long>>, KV<String, Long>>() {
public void processElement(ProcessContext c) {
String key = c.element().getKey();
Iterable<Long> docsWithThatUrl = c.element().getValue();
Long sum = 0L;
for (Long item : docsWithThatUrl)
sum += item;
KV<String, Long> userTime = KV.of(key, sum);
c.output(userTime);
}
}))
.apply(MapElements.via(new FormatAsTextFn()))
.apply(TextIO.Write.named("WriteCounts").to(options.getOutput()).
withNumShards(options.getShardsNumber()));
p.run();
}
One approach would be to first parse the lines into one PCollection that contains a record per line, and the from that collection create two PCollection of key-value pairs. Let's say you define a record representing a line like this:
static class Record implements Serializable {
final String user;
final String role;
final long duration;
// need a constructor here
}
Now, create a LineToRecordFn that create Records from the input lines, so that you can do:
PCollection<Record> records = p.apply(TextIO.Read.named("ReadLines")
.from(options.getInputFile()))
.apply(ParDo.of(new LineToRecordFn()));
You can window here, if you want. Whether you window or not, you can then create your keyed-by-role and keyed-by-user PCollections:
PCollection<KV<String,Long>> role_duration = records.apply(MapElements.via(
new SimpleFunction<Record,KV<String,Long>>() {
#Override
public KV<String,Long> apply(Record r) {
return KV.of(r.role,r.duration);
}
}));
PCollection<KV<String,Long>> user_duration = records.apply(MapElements.via(
new SimpleFunction<Record,KV<String,Long>>() {
#Override
public KV<String,Long> apply(Record r) {
return KV.of(r.user, r.duration);
}
}));
Now, you can get the means and sum in just a few lines:
PCollection<KV<String,Double>> mean_by_user = user_duration.apply(
Mean.<String,Long>perKey());
PCollection<KV<String,Double>> mean_by_role = role_duration.apply(
Mean.<String,Long>perKey());
PCollection<KV<String,Long>> sum_by_role = role_duration.apply(
Sum.<String>longsPerKey());
Note that dataflow does some optimization before running your job. So, while it might look like you're doing two passes over the records PCollection, that may not be true.
The Mean and Sum transforms look like they would work well for this use case. Basic usage looks like this:
PCollection<KV<String, Double>> meanPerKey =
input.apply(Mean.<String, Integer>perKey());
PCollection<KV<String, Integer>> sumPerKey = input
.apply(Sum.<String>integersPerKey());