Getting "unable to serialize DoFnWithExecutionInformation" error while building TableRow - google-cloud-dataflow

I am trying to convert a PCollection of Strings into a PCollection of BigQuery TableRows.
My Apache Beam version is 2.41 and I am on Java 11. I have tried multiple ways but could not fix this error.
The TableSchema is loaded from an Avro file and provided to the transform as a ValueProvider.
Please help me fix this.
Code:
public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.create();
    options.setRunner(DirectRunner.class);
    options.setTempLocation("data/temp/");

    Pipeline p = Pipeline.create(options);

    BeamShemaUtil beamShemaUtil = new BeamShemaUtil("data/ship_data_schema.avsc");
    TableSchema tableSchema = beamShemaUtil.convertBQTableSchema();
    ValueProvider<TableSchema> ts = ValueProvider.StaticValueProvider.of(tableSchema);

    PCollection<String> pc1 = p.apply(TextIO.read().from("data/ship_data.csv"));
    PCollection<TableRow> pc2 = pc1.apply(MapElements.via(new ConvertStringToTableRow(ts)));

    PipelineResult result = p.run();
    result.waitUntilFinish();
}
The SimpleFunction class:
public static class ConvertStringToTableRow extends SimpleFunction<String, TableRow> {

    ValueProvider<TableSchema> tableSchema;

    public ConvertStringToTableRow(ValueProvider<TableSchema> tableSchema) {
        this.tableSchema = tableSchema;
    }

    public TableRow buildTableRow(TableSchema sc, String[] arr) {
        List<TableFieldSchema> fieldSchemaList = sc.getFields();
        List<String> data = Arrays.stream(arr).collect(Collectors.toList());
        TableRow row = new TableRow();
        TableCell record = new TableCell();
        List<TableCell> tc = new ArrayList<TableCell>();
        for (int i = 0; i < fieldSchemaList.size(); i++) {
            TableFieldSchema sc2 = fieldSchemaList.get(i);
            String fieldName = sc2.getName();
            String fieldType = sc2.getType();
            String fieldValue = data.get(i);
            if (fieldValue.isEmpty()) {
                record.set(fieldName, null);
                tc.add(record);
            } else {
                switch (fieldType) {
                    case "STRING":
                        record.set(fieldName, fieldValue);
                        tc.add(record);
                        break;
                    case "BYTES":
                        record.set(fieldName, fieldValue.getBytes());
                        tc.add(record);
                        break;
                    case "INT64":
                        record.set(fieldName, Integer.valueOf(fieldValue));
                        tc.add(record);
                        break;
                    case "INTEGER":
                        record.set(fieldName, Integer.valueOf(fieldValue));
                        tc.add(record);
                        break;
                    case "FLOAT64":
                        record.set(fieldName, Float.valueOf(fieldValue));
                        tc.add(record);
                        break;
                    case "FLOAT":
                        record.set(fieldName, Float.valueOf(fieldValue));
                        tc.add(record);
                        break;
                    case "BOOL":
                    case "BOOLEAN":
                    case "NUMERIC":
                        record.set(fieldName, Integer.valueOf(fieldValue));
                        tc.add(record);
                        break;
                    case "TIMESTAMP":
                    case "TIME":
                    case "DATE":
                    case "DATETIME":
                    case "STRUCT":
                    case "RECORD":
                    default:
                        // row.set(fieldName,fieldValue);
                        // throw new UnsupportedOperationException("Unsupported BQ Data Type");
                        break;
                }
            }
        }
        return row.setF(tc);
    }

    @Override
    public TableRow apply(String element) {
        String[] arr = element.split(",");
        // BeamShemaUtil beamShemaUtil = new BeamShemaUtil("data/ship_data_schema.avsc");
        // TableSchema tableSchema = beamShemaUtil.convertBQTableSchema();
        TableRow row = buildTableRow(tableSchema.get(), arr);
        return row;
    }
}
Error Messages:
Exception in thread "main" java.lang.IllegalArgumentException: unable to serialize DoFnWithExecutionInformation{doFn=org.apache.beam.sdk.transforms.MapElements$1@270a620, mainOutputTag=Tag<output>, sideInputMapping={}, schemaInformation=DoFnSchemaInformation{elementConverters=[], fieldAccessDescriptor=*}}
at org.apache.beam.sdk.util.SerializableUtils.serializeToByteArray(SerializableUtils.java:59)
at org.apache.beam.repackaged.direct_java.runners.core.construction.ParDoTranslation.translateDoFn(ParDoTranslation.java:737)
at org.apache.beam.repackaged.direct_java.runners.core.construction.ParDoTranslation$1.translateDoFn(ParDoTranslation.java:268)
at org.apache.beam.repackaged.direct_java.runners.core.construction.ParDoTranslation.payloadForParDoLike(ParDoTranslation.java:877)
at org.apache.beam.repackaged.direct_java.runners.core.construction.ParDoTranslation.translateParDo(ParDoTranslation.java:264)
at org.apache.beam.repackaged.direct_java.runners.core.construction.ParDoTranslation.translateParDo(ParDoTranslation.java:225)
at org.apache.beam.repackaged.direct_java.runners.core.construction.ParDoTranslation$ParDoTranslator.translate(ParDoTranslation.java:191)
at org.apache.beam.repackaged.direct_java.runners.core.construction.PTransformTranslation.toProto(PTransformTranslation.java:248)
at org.apache.beam.repackaged.direct_java.runners.core.construction.ParDoTranslation.getParDoPayload(ParDoTranslation.java:788)
at org.apache.beam.repackaged.direct_java.runners.core.construction.ParDoTranslation.isSplittable(ParDoTranslation.java:803)
at org.apache.beam.repackaged.direct_java.runners.core.construction.PTransformMatchers$6.matches(PTransformMatchers.java:274)
at org.apache.beam.sdk.Pipeline$2.visitPrimitiveTransform(Pipeline.java:290)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:593)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:585)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:585)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:585)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.access$500(TransformHierarchy.java:240)
at org.apache.beam.sdk.runners.TransformHierarchy.visit(TransformHierarchy.java:214)
at org.apache.beam.sdk.Pipeline.traverseTopologically(Pipeline.java:469)
at org.apache.beam.sdk.Pipeline.replace(Pipeline.java:268)
at org.apache.beam.sdk.Pipeline.replaceAll(Pipeline.java:218)
at org.apache.beam.runners.direct.DirectRunner.performRewrites(DirectRunner.java:254)
at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:175)
at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:67)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:323)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:309)
at BuildWriteBQTableRowExample01.main(BuildWriteBQTableRowExample01.java:50)
Caused by: java.io.NotSerializableException: com.google.api.services.bigquery.model.TableSchema
at java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1185)
at java.base/java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1553)
at java.base/java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1510)
at java.base/java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1433)
at java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1179)
at java.base/java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1553)
at java.base/java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1510)
at java.base/java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1433)
at java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1179)
at java.base/java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1379)
at java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1175)
at java.base/java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1553)
at java.base/java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1510)
at java.base/java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1433)
at java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1179)
at java.base/java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1553)
at java.base/java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1510)
at java.base/java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1433)
at java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1179)
at java.base/java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1553)
at java.base/java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1510)
at java.base/java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1433)
at java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1179)
at java.base/java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1553)
at java.base/java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1510)
at java.base/java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1433)
at java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1179)
at java.base/java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1553)
at java.base/java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1510)
at java.base/java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1433)
at java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1179)
at java.base/java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:349)
at org.apache.beam.sdk.util.SerializableUtils.serializeToByteArray(SerializableUtils.java:55)
... 26 more
Process finished with exit code 1

TableSchema is not Serializable, so the runner can't serialize the instance you wrapped in the StaticValueProvider. This is similar to the issue seen here: Read specific record(s) from Dynamo using Apache Beam DynamoDBIO
Please check https://beam.apache.org/documentation/programming-guide/#user-code-serializability for more information.
In your specific scenario, my recommendation would be to create the TableSchema within the ValueProvider itself instead of relying on serialization of a pre-built instance.
While I haven't tested this against your code, I believe something along these lines is sufficient. ValueProvider is not a functional interface, so the deferred construction is wrapped in a NestedValueProvider:
PCollection<String> pc1 = p.apply(TextIO.read().from("data/ship_data.csv"));
PCollection<TableRow> pc2 = pc1.apply(MapElements.via(
    new ConvertStringToTableRow(
        ValueProvider.NestedValueProvider.of(
            ValueProvider.StaticValueProvider.of("data/ship_data_schema.avsc"),
            path -> new BeamShemaUtil(path).convertBQTableSchema()))));
PipelineResult result = p.run();
result.waitUntilFinish();
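This defers building the non-serializable TableSchema until get() is called on a worker; what gets serialized with the function is only the schema path and the (serializable) conversion lambda.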

I propose a solution; it's not perfect, but I hope it can help.
You can use your own structure for the table schema, converting each TableFieldSchema into a custom object that implements Serializable, for example:
public class MyTableSchemaFields implements Serializable {
    private String fieldName;
    private String fieldType;

    // Constructor
    // ...

    // Getters and setters
    // ...
}
public List<MyTableSchemaFields> toMyTableSchemaFields(final List<TableFieldSchema> schemaFields) {
    return schemaFields.stream()
            .map(this::toMyTableSchemaField)
            .collect(Collectors.toList());
}

public MyTableSchemaFields toMyTableSchemaField(final TableFieldSchema schemaField) {
    MyTableSchemaFields field = new MyTableSchemaFields();
    field.setFieldName(schemaField.getName());
    field.setFieldType(schemaField.getType());
    return field;
}
Then, in the rest of your program, use MyTableSchemaFields instead of TableFieldSchema:
public static class ConvertStringToTableRow implements SerializableFunction<String, TableRow> {

    List<MyTableSchemaFields> schemaFields;

    public ConvertStringToTableRow(List<MyTableSchemaFields> schemaFields) {
        this.schemaFields = schemaFields;
    }

    public TableRow buildTableRow(List<MyTableSchemaFields> schemaFields, String[] arr) {
...........
For the class ConvertStringToTableRow I used a SerializableFunction in my example instead of SimpleFunction.
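To make the wiring concrete, here is a minimal, untested sketch of how these pieces could fit into the original pipeline. It reuses BeamShemaUtil and pc1 from the question and assumes the converter methods above are reachable from where the pipeline is built (for instance, made static):
// Build the serializable schema representation once, before pipeline construction,
// so the function only captures serializable state.
List<MyTableSchemaFields> schemaFields = toMyTableSchemaFields(
        new BeamShemaUtil("data/ship_data_schema.avsc").convertBQTableSchema().getFields());

// A SerializableFunction is applied through MapElements.into(...).via(...)
PCollection<TableRow> pc2 = pc1.apply(
        MapElements.into(TypeDescriptor.of(TableRow.class))
                .via(new ConvertStringToTableRow(schemaFields)));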

Related

Need to insert rows in clickhouseIO from apache beam(dataflow)

I am reading from a Pub/Sub topic, which is running fine; now I need to insert the data into a table on ClickHouse.
I am still learning, so please bear with me.
PipelineOptions options = PipelineOptionsFactory.create();
//PubSubToDatabasesPipelineOptions options;
Pipeline p = Pipeline.create(options);
PCollection<String> inputFromPubSub = p.apply(namePrefix + "ReadFromPubSub",
PubsubIO.readStrings().fromSubscription("projects/*********/subscriptions/crypto_bitcoin.dataflow.bigquery.transactions").withIdAttribute(PUBSUB_ID_ATTRIBUTE));
PCollection<TransactionSmall> res = inputFromPubSub.apply(namePrefix + "ReadFromPubSub", ParDo.of(new DoFn<String, TransactionSmall>() {
@ProcessElement
public void processElement(ProcessContext c) {
String item = c.element();
//System.out.print(item);
Transaction transaction = JsonUtils.parseJson(item, Transaction.class);
//System.out.print(transaction);
c.output(new TransactionSmall(new Date(),transaction.getHash(), 123));
}}));
res.apply(ClickHouseIO.<TransactionSmall>write("jdbc:clickhouse://**.**.**.**:8123/litecoin?password=*****", "****"));
p.run().waitUntilFinish();
My TransactionSmall.java
import java.io.Serializable;
import java.util.Date;
public class TransactionSmall implements Serializable {
private Date created_dt;
private String hash;
private int number;
public TransactionSmall(Date created_dt, String hash, int number) {
this.created_dt = created_dt;
this.hash = hash;
this.number = number;
}
}
My table definition
clickhouse.us-east1-b.c.staging-btc-etl.internal :) CREATE TABLE litecoin.saurabh_blocks_small (`created_date` Date DEFAULT today(), `hash` String, `number` In) ENGINE = MergeTree(created_date, (hash, number), 8192)
CREATE TABLE litecoin.saurabh_blocks_small
(
`created_date` Date,
`hash` String,
`number` In
)
ENGINE = MergeTree(created_date, (hash, number), 8192)
I am getting an error like:
java.lang.IllegalArgumentException: Type of @Element must match the DoFn type saurabhReadFromPubSub2/ParMultiDo(Anonymous).output [PCollection]
at org.apache.beam.sdk.transforms.ParDo.getDoFnSchemaInformation (ParDo.java:577)
at org.apache.beam.repackaged.direct_java.runners.core.construction.ParDoTranslation.translateParDo (ParDoTranslation.java:185)
at org.apache.beam.repackaged.direct_java.runners.core.construction.ParDoTranslation$ParDoTranslator.translate (ParDoTranslation.java:124)
at org.apache.beam.repackaged.direct_java.runners.core.construction.PTransformTranslation.toProto (PTransformTranslation.java:155)
at org.apache.beam.repackaged.direct_java.runners.core.construction.ParDoTranslation.getParDoPayload (ParDoTranslation.java:650)
at org.apache.beam.repackaged.direct_java.runners.core.construction.ParDoTranslation.isSplittable (ParDoTranslation.java:665)
at org.apache.beam.repackaged.direct_java.runners.core.construction.PTransformMatchers$6.matches (PTransformMatchers.java:269)
at org.apache.beam.sdk.Pipeline$2.visitPrimitiveTransform (Pipeline.java:282)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit (TransformHierarchy.java:665)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit (TransformHierarchy.java:657)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit (TransformHierarchy.java:657)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit (TransformHierarchy.java:657)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.access$600 (TransformHierarchy.java:317)
at org.apache.beam.sdk.runners.TransformHierarchy.visit (TransformHierarchy.java:251)
at org.apache.beam.sdk.Pipeline.traverseTopologically (Pipeline.java:460)
at org.apache.beam.sdk.Pipeline.replace (Pipeline.java:260)
at org.apache.beam.sdk.Pipeline.replaceAll (Pipeline.java:210)
at org.apache.beam.runners.direct.DirectRunner.run (DirectRunner.java:170)
at org.apache.beam.runners.direct.DirectRunner.run (DirectRunner.java:67)
at org.apache.beam.sdk.Pipeline.run (Pipeline.java:315)
at org.apache.beam.sdk.Pipeline.run (Pipeline.java:301)
at io.blockchainetl.bitcoin.Trail.main (Trail.java:74)
at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke (Method.java:498)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:282)
at java.lang.Thread.run (Thread.java:748)
What would be the best and cleanest way to achieve this without explicitly creating objects?
Thanks
This is likely happening because Beam relies on the coder specification for a PCollection when it infers the schema for it. It seems to be having trouble inferring the input schema for your ClickhouseIO transform.
You can compel Beam to have a schema by specifying a coder with schema inference, such as AvroCoder. You'd do:
@DefaultCoder(AvroCoder.class)
public class TransactionSmall implements Serializable {
private Date created_dt;
private String hash;
private int number;
public TransactionSmall(Date created_dt, String hash, int number) {
this.created_dt = created_dt;
this.hash = hash;
this.number = number;
}
}
Or you can also set the coder for the PCollection on your pipeline:
PCollection<TransactionSmall> res = inputFromPubSub.apply(namePrefix + "ReadFromPubSub", ParDo.of(new DoFn<String, TransactionSmall>() {
@ProcessElement
public void processElement(ProcessContext c) {
String item = c.element();
Transaction transaction = JsonUtils.parseJson(item, Transaction.class);
c.output(new TransactionSmall(new Date(),transaction.getHash(), 123));
}}))
.setCoder(AvroCoder.of(TransactionSmall.class));
res.apply(ClickHouseIO.<TransactionSmall>write("jdbc:clickhouse://**.**.**.**:8123/litecoin?password=*****", "****"));

Dataflow DynamicDestinations unable to serialize org.apache.beam.sdk.io.gcp.bigquery.PrepareWrite

I am trying to use DynamicDestinations to write to a partitioned table in BigQuery where the partition name is mytable$yyyyMMdd. If I bypass DynamicDestinations and supply a hardcoded table name in .to(), it works; however, with DynamicDestinations I get the following exception:
java.lang.IllegalArgumentException: unable to serialize org.apache.beam.sdk.io.gcp.bigquery.PrepareWrite$1@6fff253c
at org.apache.beam.sdk.util.SerializableUtils.serializeToByteArray(SerializableUtils.java:53)
at org.apache.beam.sdk.util.SerializableUtils.clone(SerializableUtils.java:90)
at org.apache.beam.sdk.transforms.ParDo$SingleOutput.<init>(ParDo.java:591)
at org.apache.beam.sdk.transforms.ParDo.of(ParDo.java:435)
at org.apache.beam.sdk.io.gcp.bigquery.PrepareWrite.expand(PrepareWrite.java:51)
at org.apache.beam.sdk.io.gcp.bigquery.PrepareWrite.expand(PrepareWrite.java:36)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:514)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:473)
at org.apache.beam.sdk.values.PCollection.apply(PCollection.java:297)
at org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO$Write.expandTyped(BigQueryIO.java:987)
at org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO$Write.expand(BigQueryIO.java:972)
at org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO$Write.expand(BigQueryIO.java:659)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:514)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:454)
at org.apache.beam.sdk.values.PCollection.apply(PCollection.java:284)
at com.homedepot.payments.monitoring.eventprocessor.MetricsAggregator.main(MetricsAggregator.java:82)
Caused by: java.io.NotSerializableException: com.google.api.services.bigquery.model.TableReference
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184)
And here is the code:
PCollection<Event> rawEvents = pipeline
.apply("ReadFromPubSub",
PubsubIO.readProtos(EventOuterClass.Event.class)
.fromSubscription(OPTIONS.getSubscription())
)
.apply("Parse", ParDo.of(new ParseFn()))
.apply("ExtractAttributes", ParDo.of(new ExtractAttributesFn()));
EventTable table = new EventTable(OPTIONS.getProjectId(), OPTIONS.getMetricsDatasetId(), OPTIONS.getRawEventsTable());
rawEvents.apply(BigQueryIO.<Event>write()
.to(new DynamicDestinations<Event, String>() {
private static final long serialVersionUID = 1L;
@Override
public TableSchema getSchema(String destination) {
return table.schema();
}
@Override
public TableDestination getTable(String destination) {
return new TableDestination(table.reference(), null);
}
@Override
public String getDestination(ValueInSingleWindow<Event> element) {
String dayString = DateTimeFormat.forPattern("yyyyMMdd").withZone(DateTimeZone.UTC).toString();
return table.reference().getTableId() + "$" + dayString;
}
})
.withFormatFunction(new SerializableFunction<Event, TableRow>() {
public TableRow apply(Event event) {
TableRow row = new TableRow();
Event evnt = (Event) event;
row.set(EventTable.Field.VERSION.getName(), evnt.getVersion());
row.set(EventTable.Field.TIMESTAMP.getName(), evnt.getTimestamp() / 1000);
row.set(EventTable.Field.EVENT_TYPE_ID.getName(), evnt.getEventTypeId());
row.set(EventTable.Field.EVENT_ID.getName(), evnt.getId());
row.set(EventTable.Field.LOCATION.getName(), evnt.getLocation());
row.set(EventTable.Field.SERVICE.getName(), evnt.getService());
row.set(EventTable.Field.HOST.getName(), evnt.getHost());
row.set(EventTable.Field.BODY.getName(), evnt.getBody());
return row;
}
})
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
);
Any pointers in the correct direction would be greatly appreciated.
Thanks!
From inspecting the exception message and the code above, it seems that the EventTable field used within your anonymous DynamicDestinations class contains a TableReference field which is not serializable.
One workaround would be to convert the anonymous DynamicDestinations to a static inner class and define a constructor which stores only the serializable pieces of the EventTable needed to implement the interface.
For example:
private static class EventDestinations extends DynamicDestinations<Event, String> {
private final TableSchema schema;
private final TableDestination destination;
private final String tableId;
private EventDestinations(EventTable table) {
this.schema = table.schema();
this.destination = new TableDestination(table.reference(), null);
this.tableId = table.reference().getTableId();
}
// ..
}
Looks like you're trying to fill a specific partition based on the event. Why not use:
SerializableFunction<ValueInSingleWindow<Event>, TableDestination>?
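For illustration only, a rough, untested sketch of that alternative; the table spec string, rawEventsSchema, and formatFunction below are placeholders standing in for the question's EventTable and format function:
rawEvents.apply(BigQueryIO.<Event>write()
    .to(new SerializableFunction<ValueInSingleWindow<Event>, TableDestination>() {
        @Override
        public TableDestination apply(ValueInSingleWindow<Event> input) {
            // "table$yyyyMMdd" targets a single ingestion-time partition,
            // derived here from the element's timestamp
            String day = DateTimeFormat.forPattern("yyyyMMdd")
                .withZone(DateTimeZone.UTC)
                .print(input.getTimestamp());
            return new TableDestination("my-project:my_dataset.raw_events$" + day, null);
        }
    })
    .withSchema(rawEventsSchema)       // a plain TableSchema, used with CREATE_IF_NEEDED
    .withFormatFunction(formatFunction)
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
Because the anonymous function captures no non-serializable state, it avoids the NotSerializableException raised by the original anonymous DynamicDestinations.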

DataflowAssert doesn't pass TableRow test

We don't know why DataflowAssert fails when running this simple test:
@Test
@Category(RunnableOnService.class)
public void testTableRow() throws Exception {
Pipeline p = TestPipeline.create();
PCollection<TableRow> pCollectionTable1 = p.apply("a",Create.of(TABLEROWS_ARRAY_1));
PCollection<TableRow> pCollectionTable2 = p.apply("b",Create.of(TABLEROWS_ARRAY_2));
PCollection<TableRow> joinedTables = Table.join(pCollectionTable1, pCollectionTable2);
DataflowAssert.that(joinedTables).containsInAnyOrder(TABLEROW_TEST);
p.run();
}
We are getting the following exception:
Sep 25, 2015 10:42:50 AM com.google.cloud.dataflow.sdk.testing.DataflowAssert$TwoSideInputAssert$CheckerDoFn processElement
SEVERE: DataflowAssert failed expectations.
java.lang.AssertionError:
Expected: iterable over [<{id=x}>] in any order
but: Not matched: <{id=x}>
at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
at org.junit.Assert.assertThat(Assert.java:865)
at org.junit.Assert.assertThat(Assert.java:832)
at ...
In order to simplify the DataflowAssert test, we hardcoded the output of Table.join to match DataflowAssert, having:
private static final TableRow TABLEROW_TEST = new TableRow()
.set("id", "x");
static PCollection<TableRow> join(PCollection<TableRow> pCollectionTable1,
PCollection<TableRow> pCollectionTable2) throws Exception {
final TupleTag<String> pCollectionTable1Tag = new TupleTag<String>();
final TupleTag<String> pCollectionTable2Tag = new TupleTag<String>();
PCollection<KV<String, String>> table1Data = pCollectionTable1
.apply(ParDo.of(new ExtractTable1DataFn()));
PCollection<KV<String, String>> table2Data = pCollectionTable2
.apply(ParDo.of(new ExtractTable2DataFn()));
PCollection<KV<String, CoGbkResult>> kvpCollection = KeyedPCollectionTuple
.of(pCollectionTable1Tag, table1Data).and(pCollectionTable2Tag, table2Data)
.apply(CoGroupByKey.<String> create());
PCollection<KV<String, String>> resultCollection = kvpCollection
.apply(ParDo.named("Process join")
.of(new DoFn<KV<String, CoGbkResult>, KV<String, String>>() {
private static final long serialVersionUID = 0;
@Override
public void processElement(ProcessContext c) {
// System.out.println(c);
KV<String, CoGbkResult> e = c.element();
String key = e.getKey();
String value = null;
for (String table1Value : c.element().getValue().getAll(pCollectionTable2Tag)) {
for (String table2Value : c.element().getValue().getAll(pCollectionTable2Tag)) {
value = table1Value + "," + table2Value;
}
}
c.output(KV.of(key, value));
}
}));
PCollection<TableRow> formattedResults = resultCollection.apply(
ParDo.named("Format join").of(new DoFn<KV<String, String>, TableRow>() {
private static final long serialVersionUID = 0;
public void processElement(ProcessContext c) {
TableRow row = new TableRow().set("id", "x");
c.output(row);
}
}));
return formattedResults;
}
Does anyone know what we are doing wrong?
I think the error message is telling you that the actual collection contains more copies of that element than the expectation.
Expected: iterable over [<{id=x}>] in any order
but: Not matched: <{id=x}>
This is hamcrest indicating that you wanted an iterable over a single element, but the actual collection had an item which wasn't matched. Since all of the items coming out of "format join" have the same value, it made this harder to read than it should have been.
Specifically, this is the message produced when I run the following test, which checks whether a collection containing two copies of row contains exactly one copy of row:
@Category(RunnableOnService.class)
@Test
public void testTableRow() throws Exception {
Pipeline p = TestPipeline.create();
TableRow row = new TableRow().set("id", "x");
PCollection<TableRow> rows = p.apply(Create.<TableRow>of(row, row));
DataflowAssert.that(rows).containsInAnyOrder(row);
p.run();
}
In order to get that result with your code, I had to take advantage of the fact that you only iterate over entries in table2. Specifically:
// Use these as the input tables.
table1 = [("keyA", "A1a"), ("keyA", "A1b]
table2 = [("keyA", "A2a"), ("keyA", "A2b"), ("keyB", "B2")]
// The CoGroupByKey returns
[("keyA", (["A1a", "A1b"], ["A2a", "A2b"])),
("keyB", ([], ["B2"]))]
// When run through "Process join" this produces.
// For details on why see the next section.
["A2b,A2b",
"B2,B2"]
// When run through "Format join" this becomes the following.
[{id=x}, {id=x}]
Note that the DoFn for "Process join" may not produce the expected results as commented below:
String key = e.getKey();
String value = null;
// NOTE: Both table1Value and table2Value iterate over pCollectionTable2Tag
for (String table1Value : c.element().getValue().getAll(pCollectionTable2Tag)) {
for (String table2Value : c.element().getValue().getAll(pCollectionTable2Tag)) {
// NOTE: this updates value, and doesn't output it. So for each
// key there will be a single output with the *last* value
// rather than one for each pair.
value = table1Value + "," + table2Value;
}
}
c.output(KV.of(key, value));
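Not part of the original answer, but for completeness, a sketch of what "Process join" could look like if the intent is to emit one joined output per pair of values, reading table1 values from pCollectionTable1Tag and table2 values from pCollectionTable2Tag:
.of(new DoFn<KV<String, CoGbkResult>, KV<String, String>>() {
    private static final long serialVersionUID = 0;
    @Override
    public void processElement(ProcessContext c) {
        KV<String, CoGbkResult> e = c.element();
        String key = e.getKey();
        // Output every (table1Value, table2Value) pair instead of
        // overwriting a single local variable and emitting it once.
        for (String table1Value : e.getValue().getAll(pCollectionTable1Tag)) {
            for (String table2Value : e.getValue().getAll(pCollectionTable2Tag)) {
                c.output(KV.of(key, table1Value + "," + table2Value));
            }
        }
    }
})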

How to use a custom Coder in a PCollection<KV<String, B>>?

I'm trying to use a custom Coder so that I can do some transforms, but I'm having trouble getting the PCollection to use my custom coder, and I suspect (???) it's because it's wrapped in a KV. Specifically:
Pipeline p = Pipeline.create ...
p.getCoderRegistry().registerCoder(MyClass.class, MyClassCoder.class);
...
PCollection<String> input = ...
PCollection<KV<String, MyClass>> t = input.apply(new ToKVTransform());
When I try to run something like this, I get a java.lang.ClassCastException and a stacktrace that includes a SerializableCoder instead of MyClassCoder like I would expect.
[error] at com.google.cloud.dataflow.sdk.coders.SerializableCoder.decode(SerializableCoder.java:133)
[error] at com.google.cloud.dataflow.sdk.coders.SerializableCoder.decode(SerializableCoder.java:50)
[error] at com.google.cloud.dataflow.sdk.coders.KvCoder.decode(KvCoder.java:95)
[error] at com.google.cloud.dataflow.sdk.coders.KvCoder.decode(KvCoder.java:42)
I see that the answer to another, somewhat related question (Using TextIO.Write with a complicated PCollection type in Google Cloud Dataflow) says to map everything to strings, and use that to pass stuff around PCollections. Is that really the recommended way??
(Note: the actual code is in Scala, but I'm pretty sure it's not a Scala <=> Java issue so I've translated it into Java here.)
Update to include Scala code and more background:
So this is the actual exception itself (should have included this at the beginning):
java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.HashMap$SerializationProxy to field com.example.schema.Schema.keyTypes of type scala.collection.immutable.Map in instance of com.example.schema.Schema
Where com.example.schema.Schema is:
case class Schema(id: String, keyTypes: Map[String, Type])
And lastly, the SchemaCoder is:
class SchemaCoder extends com.google.cloud.dataflow.sdk.coders.CustomCoder[Schema] {
def decode(inputStream: InputStream, context: Context): Schema = {
val ois = new ObjectInputStream(inputStream)
val id: String = ois.readObject().asInstanceOf[String]
val javaMap: java.util.Map[String, Type] = ois.readObject().asInstanceOf[java.util.Map[String, Type]]
ois.close()
Schema(id, javaMap.asScala.toMap)
}
def encode(schema: Schema, outputStream: OutputStream, context: Context): Unit = {
val baos = new ByteArrayOutputStream()
val oos = new ObjectOutputStream(baos)
oos.writeObject(schema.id)
val javaMap: java.util.Map[String, Type] = schema.keyTypes.asJava
oos.writeObject(javaMap)
oos.close()
val encoded = new String(Base64.encodeBase64(baos.toByteArray()))
outputStream.write(encoded.getBytes())
}
}
====
Edit2: And here's what ToKVTransform actually looks like:
class SchemaExtractorTransform extends PTransform[PCollection[String], PCollection[Schema]] {
class InferSchemaFromStringWithKeyFn extends DoFn[String, KV[String, Schema]] {
override def processElement(c: DoFn[String, KV[String, Schema]]#ProcessContext): Unit = {
val line = c.element()
inferSchemaFromString(line)
}
}
class GetFirstFn extends DoFn[KV[String, java.lang.Iterable[Schema]], Schema] {
override def processElement(c: DoFn[KV[String, java.lang.Iterable[Schema]], Schema]#ProcessContext): Unit = {
val idAndSchemas: KV[String, java.lang.Iterable[Schema]] = c.element()
val it: java.util.Iterator[Schema] = idAndSchemas.getValue().iterator()
c.output(it.next())
}
}
override def apply(inputLines: PCollection[String]): PCollection[Schema] = {
val schemasWithKey: PCollection[KV[String, Schema]] = inputLines.apply(
ParDo.named("InferSchemas").of(new InferSchemaFromStringWithKeyFn())
)
val keyed: PCollection[KV[String, java.lang.Iterable[Schema]]] = schemasWithKey.apply(
GroupByKey.create()
)
val schemasOnly: PCollection[Schema] = keyed.apply(
ParDo.named("GetFirst").of(new GetFirstFn())
)
schemasOnly
}
}
This problem doesn't reproduce in Java; Scala is doing something differently with types that breaks Dataflow coder inference. To work around this, you can call setCoder on a PCollection to set its Coder explicitly, such as
schemasWithKey.setCoder(KvCoder.of(StringUtf8Coder.of(), SchemaCoder.of()));
Here's the Java version of your code, just to make sure that it's doing approximately the same thing:
public static class SchemaExtractorTransform
extends PTransform<PCollection<String>, PCollection<Schema>> {
class InferSchemaFromStringWithKeyFn extends DoFn<String, KV<String, Schema>> {
public void processElement(ProcessContext c) {
c.output(KV.of(c.element(), new Schema()));
}
}
class GetFirstFn extends DoFn<KV<String, java.lang.Iterable<Schema>>, Schema> {
private static final long serialVersionUID = 0;
public void processElement(ProcessContext c) {
c.output(c.element().getValue().iterator().next());
}
}
public PCollection<Schema> apply(PCollection<String> inputLines) {
PCollection<KV<String, Schema>> schemasWithKey = inputLines.apply(
ParDo.named("InferSchemas").of(new InferSchemaFromStringWithKeyFn()));
PCollection<KV<String, java.lang.Iterable<Schema>>> keyed =
schemasWithKey.apply(GroupByKey.<String, Schema>create());
PCollection<Schema> schemasOnly =
keyed.apply(ParDo.named("GetFirst").of(new GetFirstFn()));
return schemasOnly;
}
}

Accessing a Service from within an XNA Content Pipeline Extension

I need to allow my content pipeline extension to use a pattern similar to a factory. I start with a dictionary type:
public delegate T Mapper<T>(MapFactory<T> mf, XElement d);
public class MapFactory<T>
{
Dictionary<string, Mapper<T>> map = new Dictionary<string, Mapper<T>>();
public void Add(string s, Mapper<T> m)
{
map.Add(s, m);
}
public T Get(XElement xe)
{
if (xe == null) throw new ArgumentNullException(
"Invalid document");
var key = xe.Name.ToString();
if (!map.ContainsKey(key)) throw new ArgumentException(
key + " is not a valid key.");
return map[key](this, xe);
}
public IEnumerable<T> GetAll(XElement xe)
{
if (xe == null) throw new ArgumentNullException(
"Invalid document");
foreach (var e in xe.Elements())
{
var val = e.Name.ToString();
if (map.ContainsKey(val))
yield return map[val](this, e);
}
}
}
Here is one type of object I want to store:
public partial class TestContent
{
// Test type
public string title;
// Once test if true
public bool once;
// Parameters
public Dictionary<string, object> args;
public TestContent()
{
title = string.Empty;
args = new Dictionary<string, object>();
}
public TestContent(XElement xe)
{
title = xe.Name.ToString();
args = new Dictionary<string, object>();
xe.ParseAttribute("once", once);
}
}
XElement.ParseAttribute is an extension method that works as one might expect. It returns a boolean that is true if successful.
The issue is that I have many different types of tests, each of which populates the object in a way unique to the specific test. The element name is the key to MapFactory's dictionary. This type of test, while atypical, illustrates my problem.
public class LogicTest : TestBase
{
string opkey;
List<TestBase> items;
public override bool Test(BehaviorArgs args)
{
if (items == null) return false;
if (items.Count == 0) return false;
bool result = items[0].Test(args);
for (int i = 1; i < items.Count; i++)
{
bool other = items[i].Test(args);
switch (opkey)
{
case "And":
result &= other;
if (!result) return false;
break;
case "Or":
result |= other;
if (result) return true;
break;
case "Xor":
result ^= other;
break;
case "Nand":
result = !(result & other);
break;
case "Nor":
result = !(result | other);
break;
default:
result = false;
break;
}
}
return result;
}
public static TestContent Build(MapFactory<TestContent> mf, XElement xe)
{
var result = new TestContent(xe);
string key = "Or";
xe.GetAttribute("op", key);
result.args.Add("key", key);
var names = mf.GetAll(xe).ToList();
if (names.Count() < 2) throw new ArgumentException(
"LogicTest requires at least two entries.");
result.args.Add("items", names);
return result;
}
}
My actual code is more involved as the factory has two dictionaries, one that turns an XElement into a content type to write and another used by the reader to create the actual game objects.
I need to build these factories in code because they map strings to delegates. I have a service that contains several of these factories. The mission is to make these factory classes available to a content processor. Neither the processor itself nor the context it uses as a parameter have any known hooks to attach an IServiceProvider or equivalent.
Any ideas?
I needed to create a data structure essentially on demand without access to the underlying classes as they came from a third party, in this case XNA Game Studio. There is only one way to do this I know of... statically.
public class TestMap : Dictionary<string, string>
{
private static readonly TestMap map = new TestMap();
private TestMap()
{
Add("Logic", "LogicProcessor");
Add("Sequence", "SequenceProcessor");
Add("Key", "KeyProcessor");
Add("KeyVector", "KeyVectorProcessor");
Add("Mouse", "MouseProcessor");
Add("Pad", "PadProcessor");
Add("PadVector", "PadVectorProcessor");
}
public static TestMap Map
{
get { return map; }
}
public IEnumerable<TestContent> Collect(XElement xe, ContentProcessorContext cpc)
{
foreach(var e in xe.Elements().Where(e => ContainsKey(e.Name.ToString())))
{
yield return cpc.Convert<XElement, TestContent>(
e, this[e.Name.ToString()]);
}
}
}
I took this a step further and created content processors for each type of TestBase:
/// <summary>
/// Turns an imported XElement into a TestContent used for a LogicTest
/// </summary>
[ContentProcessor(DisplayName = "LogicProcessor")]
public class LogicProcessor : ContentProcessor<XElement, TestContent>
{
public override TestContent Process(XElement input, ContentProcessorContext context)
{
var result = new TestContent(input);
string key = "Or";
input.GetAttribute("op", key);
result.args.Add("key", key);
var items = TestMap.Map.Collect(input, context);
if (items.Count() < 2) throw new ArgumentNullException(
"LogicProcessor requires at least two items.");
result.args.Add("items", items);
return result;
}
}
Any attempt to reference or access the class such as calling TestMap.Collect will generate the underlying static class if needed. I basically moved the code from LogicTest.Build to the processor. I also carry out any needed validation in the processor.
When I get to reading these classes I will have the ContentService to help.
