Cannot write multibyte string to Spanner properly from Dataflow Pipeline - google-cloud-dataflow

I want to write multibyte strings (e.g. Japanese) to Spanner from a Dataflow pipeline.
But it does not work.
Below is the code I tried.
(edited: I rewrote it to be closer to the actual code)
ParDo.of(new DoFn<TableRow, Mutation>() {
  @ProcessElement
  public void processElement(ProcessContext c) throws IOException {
    TableRow row = c.element();
    Mutation.WriteBuilder mutationWriteBuilder = Mutation.newInsertOrUpdateBuilder("testtable");
    for (Entry<String, Object> entry : row.entrySet()) {
      String columnName = entry.getKey();
      Object value = entry.getValue();
      Charset utf8 = StandardCharsets.UTF_8;
      // Encoding to UTF-8 bytes and decoding back is effectively a no-op for a valid Java String.
      String str = new String(value.toString().getBytes(utf8), utf8);
      mutationWriteBuilder.set(columnName).to(str);
    }
    Mutation mutation = mutationWriteBuilder.build();
    c.output(mutation);
  }
})
This pipeline will succeed, but the value actually written is a garbled string like '�'.
Am I doing something wrong?
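(For context, the write step is not shown here; presumably the Mutation elements produced by this ParDo are handed to SpannerIO downstream, roughly as in the sketch below, where the PCollection name and the instance/database IDs are placeholders.)

// Hypothetical downstream write of the mutations built above.
mutations.apply("WriteToSpanner", SpannerIO.write()
    .withInstanceId("my-instance")
    .withDatabaseId("my-database"));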

Related

Need to insert rows in clickhouseIO from apache beam(dataflow)

I am reading from a Pub/Sub topic, which is running fine. Now I need to insert the data into a table on ClickHouse.
I am still learning, so please excuse any rough edges.
PipelineOptions options = PipelineOptionsFactory.create();
//PubSubToDatabasesPipelineOptions options;
Pipeline p = Pipeline.create(options);

PCollection<String> inputFromPubSub = p.apply(namePrefix + "ReadFromPubSub",
    PubsubIO.readStrings().fromSubscription("projects/*********/subscriptions/crypto_bitcoin.dataflow.bigquery.transactions").withIdAttribute(PUBSUB_ID_ATTRIBUTE));

PCollection<TransactionSmall> res = inputFromPubSub.apply(namePrefix + "ReadFromPubSub", ParDo.of(new DoFn<String, TransactionSmall>() {
  @ProcessElement
  public void processElement(ProcessContext c) {
    String item = c.element();
    //System.out.print(item);
    Transaction transaction = JsonUtils.parseJson(item, Transaction.class);
    //System.out.print(transaction);
    c.output(new TransactionSmall(new Date(), transaction.getHash(), 123));
  }}));

res.apply(ClickHouseIO.<TransactionSmall>write("jdbc:clickhouse://**.**.**.**:8123/litecoin?password=*****", "****"));

p.run().waitUntilFinish();
My TransactionSmall.java
import java.io.Serializable;
import java.util.Date;

public class TransactionSmall implements Serializable {
  private Date created_dt;
  private String hash;
  private int number;

  public TransactionSmall(Date created_dt, String hash, int number) {
    this.created_dt = created_dt;
    this.hash = hash;
    this.number = number;
  }
}
My table definition
clickhouse.us-east1-b.c.staging-btc-etl.internal :) CREATE TABLE litecoin.saurabh_blocks_small (`created_date` Date DEFAULT today(), `hash` String, `number` In) ENGINE = MergeTree(created_date, (hash, number), 8192)
CREATE TABLE litecoin.saurabh_blocks_small
(
`created_date` Date,
`hash` String,
`number` In
)
ENGINE = MergeTree(created_date, (hash, number), 8192)
I am getting an error like:
java.lang.IllegalArgumentException: Type of @Element must match the DoFn typesaurabhReadFromPubSub2/ParMultiDo(Anonymous).output [PCollection]
at org.apache.beam.sdk.transforms.ParDo.getDoFnSchemaInformation (ParDo.java:577)
at org.apache.beam.repackaged.direct_java.runners.core.construction.ParDoTranslation.translateParDo (ParDoTranslation.java:185)
at org.apache.beam.repackaged.direct_java.runners.core.construction.ParDoTranslation$ParDoTranslator.translate (ParDoTranslation.java:124)
at org.apache.beam.repackaged.direct_java.runners.core.construction.PTransformTranslation.toProto (PTransformTranslation.java:155)
at org.apache.beam.repackaged.direct_java.runners.core.construction.ParDoTranslation.getParDoPayload (ParDoTranslation.java:650)
at org.apache.beam.repackaged.direct_java.runners.core.construction.ParDoTranslation.isSplittable (ParDoTranslation.java:665)
at org.apache.beam.repackaged.direct_java.runners.core.construction.PTransformMatchers$6.matches (PTransformMatchers.java:269)
at org.apache.beam.sdk.Pipeline$2.visitPrimitiveTransform (Pipeline.java:282)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit (TransformHierarchy.java:665)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit (TransformHierarchy.java:657)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit (TransformHierarchy.java:657)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit (TransformHierarchy.java:657)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.access$600 (TransformHierarchy.java:317)
at org.apache.beam.sdk.runners.TransformHierarchy.visit (TransformHierarchy.java:251)
at org.apache.beam.sdk.Pipeline.traverseTopologically (Pipeline.java:460)
at org.apache.beam.sdk.Pipeline.replace (Pipeline.java:260)
at org.apache.beam.sdk.Pipeline.replaceAll (Pipeline.java:210)
at org.apache.beam.runners.direct.DirectRunner.run (DirectRunner.java:170)
at org.apache.beam.runners.direct.DirectRunner.run (DirectRunner.java:67)
at org.apache.beam.sdk.Pipeline.run (Pipeline.java:315)
at org.apache.beam.sdk.Pipeline.run (Pipeline.java:301)
at io.blockchainetl.bitcoin.Trail.main (Trail.java:74)
at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke (Method.java:498)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:282)
at java.lang.Thread.run (Thread.java:748)
What would be the best and cleanest way to achieve this without explicitly creating objects?
Thanks
This is likely happening because Beam relies on the coder specification for a PCollection when it infers the schema for it. It seems to be having trouble inferring the input schema for your ClickhouseIO transform.
You can compel Beam to have a schema by specifying a coder with schema inference, such as AvroCoder. You'd do:
@DefaultCoder(AvroCoder.class)
public class TransactionSmall implements Serializable {
  private Date created_dt;
  private String hash;
  private int number;

  public TransactionSmall(Date created_dt, String hash, int number) {
    this.created_dt = created_dt;
    this.hash = hash;
    this.number = number;
  }
}
Alternatively, you can set the coder for the PCollection directly in your pipeline:
PCollection<TransactionSmall> res = inputFromPubSub.apply(namePrefix + "ReadFromPubSub", ParDo.of(new DoFn<String, TransactionSmall>() {
  @ProcessElement
  public void processElement(ProcessContext c) {
    String item = c.element();
    Transaction transaction = JsonUtils.parseJson(item, Transaction.class);
    c.output(new TransactionSmall(new Date(), transaction.getHash(), 123));
  }}))
  .setCoder(AvroCoder.of(TransactionSmall.class));

res.apply(ClickHouseIO.<TransactionSmall>write("jdbc:clickhouse://**.**.**.**:8123/litecoin?password=*****", "****"));

java.lang.IllegalArgumentException: calling sideInput() with unknown view

I tried to move data from one table to another table, using a side input to filter the records while transforming the data. The side input is also a KV collection, and it loads its data from another table.
When I run my pipeline I get a "java.lang.IllegalArgumentException: calling sideInput() with unknown view" error.
Here is the entire code that I tried:
{
  PipelineOptionsFactory.register(OptionPipeline.class);
  OptionPipeline options = PipelineOptionsFactory.fromArgs(args).withValidation().as(OptionPipeline.class);
  Pipeline p = Pipeline.create(options);

  PCollection<TableRow> sideInputData = p.apply("ReadSideInput", BigQueryIO.readTableRows().from(options.getOrgRegionMapping()));
  PCollection<KV<String,String>> sideInputMap = sideInputData.apply(ParDo.of(new getSideInputDataFn()));
  final PCollectionView<Map<String,String>> sideInputView = sideInputMap.apply(View.<String,String>asMap());

  PCollection<TableRow> orgMaster = p.apply("ReadOrganization", BigQueryIO.readTableRows().from(options.getOrgCodeMaster()));
  PCollection<TableRow> orgCode = orgMaster.apply(ParDo.of(new gnGetOrgMaster()));

  @SuppressWarnings("serial")
  PCollection<TableRow> finalResultCollection = orgCode.apply("Process", ParDo.of(new DoFn<TableRow, TableRow>()
  {
    @ProcessElement
    public void processElement(ProcessContext c) {
      TableRow outputRow = new TableRow();
      TableRow orgCodeRow = c.element();
      String orgCodefromMaster = (String) orgCodeRow.get("orgCode");
      String region = c.sideInput(sideInputView).get(orgCodefromMaster);
      outputRow.set("orgCode", orgCodefromMaster);
      outputRow.set("orgName", orgCodeRow.get("orgName"));
      outputRow.set("orgName", region);
      DateFormat df = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSSSSS");
      Date dateobj = new Date();
      outputRow.set("updatedDate", df.format(dateobj));
      c.output(outputRow);
    }
  }));

  finalResultCollection.apply(BigQueryIO.writeTableRows()
      .withSchema(schema)
      .to(options.getOrgCodeTable())
      .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
      .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

  p.run().waitUntilFinish();
}

@SuppressWarnings("serial")
static class getSideInputDataFn extends DoFn<TableRow, KV<String, String>>
{
  @ProcessElement
  public void processElement(ProcessContext c)
  {
    TableRow row = c.element();
    c.output(KV.of((String) row.get("orgcode"), (String) row.get("region")));
  }
}
It looks like the runner is complaining because you never told it about the side input when defining the graph. In this case you call .withSideInputs after the ParDo.of call, passing in the reference to the PCollectionView<T> you defined earlier.
@SuppressWarnings("serial")
PCollection<TableRow> finalResultCollection = orgCode.apply("Process", ParDo.of(new DoFn<TableRow, TableRow>()
{
  @ProcessElement
  public void processElement(ProcessContext c) {
    TableRow outputRow = new TableRow();
    TableRow orgCodeRow = c.element();
    String orgCodefromMaster = (String) orgCodeRow.get("orgCode");
    String region = c.sideInput(sideInputView).get(orgCodefromMaster);
    outputRow.set("orgCode", orgCodefromMaster);
    outputRow.set("orgName", orgCodeRow.get("orgName"));
    outputRow.set("orgName", region);
    DateFormat df = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSSSSS");
    Date dateobj = new Date();
    outputRow.set("updatedDate", df.format(dateobj));
    c.output(outputRow);
  }
}).withSideInputs(sideInputView));
I didn't test this code but that's what stands out when I look at it.

DynamicDestinations in Apache Beam

I have a PCollection<String>, say "X", that I need to dump into a BigQuery table.
The table destination and the schema for it are in a PCollection<TableRow>, say "Y".
How can I accomplish this in the simplest manner?
I tried extracting the table and schema from "Y" and saving them in static global variables (tableName and schema respectively). But oddly, BigQueryIO.writeTableRows() always sees the tableName variable as null, although it does get the schema. I tried logging the values of those variables and I can see the values are there for both.
Here is my pipeline code:
static String tableName;
static TableSchema schema;

PCollection<String> read = p.apply("Read from input file",
    TextIO.read().from(options.getInputFile()));

PCollection<TableRow> tableRows = p.apply(
    BigQueryIO.read().fromQuery(NestedValueProvider.of(
        options.getfilename(),
        new SerializableFunction<String, String>() {
          @Override
          public String apply(String filename) {
            return "SELECT table,schema FROM `BigqueryTest.configuration` WHERE file='" + filename + "'";
          }
        })).usingStandardSql().withoutValidation());

final PCollectionView<List<String>> dataView = read.apply(View.asList());

tableRows.apply("Convert data read from file to TableRow",
    ParDo.of(new DoFn<TableRow, TableRow>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        tableName = c.element().get("table").toString();
        String[] schemas = c.element().get("schema").toString().split(",");
        List<TableFieldSchema> fields = new ArrayList<>();
        for (int i = 0; i < schemas.length; i++) {
          fields.add(new TableFieldSchema()
              .setName(schemas[i].split(":")[0]).setType(schemas[i].split(":")[1]));
        }
        schema = new TableSchema().setFields(fields);
        //My code to convert data to TableRow format.
      }
    }).withSideInputs(dataView));

tableRows.apply("write to BigQuery",
    BigQueryIO.writeTableRows()
        .withSchema(schema)
        .to("ProjectID:DatasetID." + tableName)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));
Everything works fine; only the BigQueryIO.write operation fails, and I get the error that the TableId is null.
I also tried using a SerializableFunction and returning the value from there, but I still get null.
Here is the code that I tried for it:
tableRows.apply("write to BigQuery",
    BigQueryIO.writeTableRows()
        .withSchema(schema)
        .to(new GetTable(tableName))
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));

public static class GetTable implements SerializableFunction<String, String> {
  String table;

  public GetTable() {
    this.table = tableName;
  }

  @Override
  public String apply(String arg0) {
    return "ProjectId:DatasetId." + table;
  }
}
I also tried using DynamicDestinations, but I get an error saying the schema is not provided. Honestly, I'm new to the concept of DynamicDestinations and I'm not sure that I'm doing it correctly.
Here is the code that I tried for it:
tableRows2.apply(BigQueryIO.writeTableRows()
    .to(new DynamicDestinations<TableRow, TableRow>() {
      private static final long serialVersionUID = 1L;

      @Override
      public TableDestination getTable(TableRow dest) {
        List<TableRow> list = sideInput(bqDataView); //bqDataView contains table and schema
        String table = list.get(0).get("table").toString();
        String tableSpec = "ProjectId:DatasetId." + table;
        String tableDescription = "";
        return new TableDestination(tableSpec, tableDescription);
      }

      public String getSideInputs(PCollectionView<List<TableRow>> bqDataView) {
        return null;
      }

      @Override
      public TableSchema getSchema(TableRow destination) {
        return schema; //schema is getting added from the global variable
      }

      @Override
      public TableRow getDestination(ValueInSingleWindow<TableRow> element) {
        return null;
      }
    }.getSideInputs(bqDataView)));
Please let me know what I'm doing wrong and which path I should take.
Thank You.
Part of the reason you're having trouble is the two stages of pipeline execution. First, the pipeline is constructed on your machine; this is when all of the applications of PTransforms occur. In your first example, this is when the following lines are executed:
BigQueryIO.writeTableRows()
.withSchema(schema)
.to("ProjectID:DatasetID."+tableName)
The code within a ParDo, however, runs when your pipeline executes, and it does so on many machines. So the following code runs much later than pipeline construction:
@ProcessElement
public void processElement(ProcessContext c) {
  tableName = c.element().get("table").toString();
  ...
  schema = new TableSchema().setFields(fields);
  ...
}
This means that neither the tableName nor the schema fields will be set when the BigQueryIO sink is created.
Your idea to use DynamicDestinations is correct, but you need to move the code that actually generates the schema and the destination into that class, rather than relying on global variables that aren't available on all of the machines.
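An untested sketch of what that could look like, reusing the bqDataView side input and the schema-parsing logic from the question; the destination type here is simply the table name as a String:
tableRows.apply(BigQueryIO.writeTableRows()
    .to(new DynamicDestinations<TableRow, String>() {
      @Override
      public List<PCollectionView<?>> getSideInputs() {
        // Register the side input so it can be read on the workers.
        return Collections.<PCollectionView<?>>singletonList(bqDataView);
      }

      @Override
      public String getDestination(ValueInSingleWindow<TableRow> element) {
        // Resolve the table name at execution time from the side input,
        // not from a global variable set in another DoFn.
        return sideInput(bqDataView).get(0).get("table").toString();
      }

      @Override
      public TableDestination getTable(String table) {
        return new TableDestination("ProjectId:DatasetId." + table, "");
      }

      @Override
      public TableSchema getSchema(String table) {
        // Build the schema here, reusing the "name:type,..." parsing from the question.
        String[] schemas = sideInput(bqDataView).get(0).get("schema").toString().split(",");
        List<TableFieldSchema> fields = new ArrayList<>();
        for (String s : schemas) {
          fields.add(new TableFieldSchema().setName(s.split(":")[0]).setType(s.split(":")[1]));
        }
        return new TableSchema().setFields(fields);
      }
    })
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));
The key point is that everything needed to name the table and build the schema is resolved inside the DynamicDestinations callbacks, on the workers, rather than at pipeline construction time.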

Execute read operations in sequence - Apache Beam

I need to execute the operations below in sequence, as given:
PCollection<String> read = p.apply("Read Lines", TextIO.read().from(options.getInputFile()))
    .apply("Get fileName", ParDo.of(new DoFn<String, String>() {
      ValueProvider<String> fileReceived = options.getfilename();

      @ProcessElement
      public void processElement(ProcessContext c)
      {
        fileName = fileReceived.get().toString();
        LOG.info("File: " + fileName);
      }
    }));

PCollection<TableRow> rows = p.apply("Read from BigQuery",
    BigQueryIO.read()
        .fromQuery("SELECT table,schema FROM `DatasetID.TableID` WHERE file='" + fileName + "'")
        .usingStandardSql());
How can I accomplish this in Apache Beam/Dataflow?
It seems that you want to apply BigQueryIO.read().fromQuery() to a query that depends on a value available via a property of type ValueProvider<String> in your PipelineOptions, and the provider is not accessible at pipeline construction time - i.e. you are invoking your job via a template.
In that case, the proper solution is to use NestedValueProvider:
PCollection<TableRow> tableRows = p.apply(BigQueryIO.read().fromQuery(
    NestedValueProvider.of(
        options.getfilename(),
        new SerializableFunction<String, String>() {
          @Override
          public String apply(String filename) {
            return "SELECT table,schema FROM `DatasetID.TableID` WHERE file='" + filename + "'";
          }
        })));
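For this to work when the job is launched from a template, the option itself must be declared as a ValueProvider on your PipelineOptions interface; a minimal sketch, with an illustrative interface name:
public interface MyOptions extends PipelineOptions {
  @Description("File name used to build the BigQuery query")
  ValueProvider<String> getfilename();
  void setfilename(ValueProvider<String> value);
}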

DataflowAssert doesn't pass TableRow test

We don't know why DataflowAssert fails when running this simple test:
@Test
@Category(RunnableOnService.class)
public void testTableRow() throws Exception {
  Pipeline p = TestPipeline.create();
  PCollection<TableRow> pCollectionTable1 = p.apply("a", Create.of(TABLEROWS_ARRAY_1));
  PCollection<TableRow> pCollectionTable2 = p.apply("b", Create.of(TABLEROWS_ARRAY_2));
  PCollection<TableRow> joinedTables = Table.join(pCollectionTable1, pCollectionTable2);
  DataflowAssert.that(joinedTables).containsInAnyOrder(TABLEROW_TEST);
  p.run();
}
We are getting the following exception:
Sep 25, 2015 10:42:50 AM com.google.cloud.dataflow.sdk.testing.DataflowAssert$TwoSideInputAssert$CheckerDoFn processElement
SEVERE: DataflowAssert failed expectations.
java.lang.AssertionError:
Expected: iterable over [<{id=x}>] in any order
but: Not matched: <{id=x}>
at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
at org.junit.Assert.assertThat(Assert.java:865)
at org.junit.Assert.assertThat(Assert.java:832)
at ...
In order to simplify the DataflowAssert test, we hardcoded the output of Table.join to match what DataflowAssert expects, having:
private static final TableRow TABLEROW_TEST = new TableRow()
    .set("id", "x");

static PCollection<TableRow> join(PCollection<TableRow> pCollectionTable1,
    PCollection<TableRow> pCollectionTable2) throws Exception {
  final TupleTag<String> pCollectionTable1Tag = new TupleTag<String>();
  final TupleTag<String> pCollectionTable2Tag = new TupleTag<String>();

  PCollection<KV<String, String>> table1Data = pCollectionTable1
      .apply(ParDo.of(new ExtractTable1DataFn()));
  PCollection<KV<String, String>> table2Data = pCollectionTable2
      .apply(ParDo.of(new ExtractTable2DataFn()));

  PCollection<KV<String, CoGbkResult>> kvpCollection = KeyedPCollectionTuple
      .of(pCollectionTable1Tag, table1Data).and(pCollectionTable2Tag, table2Data)
      .apply(CoGroupByKey.<String> create());

  PCollection<KV<String, String>> resultCollection = kvpCollection
      .apply(ParDo.named("Process join")
          .of(new DoFn<KV<String, CoGbkResult>, KV<String, String>>() {
            private static final long serialVersionUID = 0;

            @Override
            public void processElement(ProcessContext c) {
              // System.out.println(c);
              KV<String, CoGbkResult> e = c.element();
              String key = e.getKey();
              String value = null;
              for (String table1Value : c.element().getValue().getAll(pCollectionTable2Tag)) {
                for (String table2Value : c.element().getValue().getAll(pCollectionTable2Tag)) {
                  value = table1Value + "," + table2Value;
                }
              }
              c.output(KV.of(key, value));
            }
          }));

  PCollection<TableRow> formattedResults = resultCollection.apply(
      ParDo.named("Format join").of(new DoFn<KV<String, String>, TableRow>() {
        private static final long serialVersionUID = 0;

        public void processElement(ProcessContext c) {
          TableRow row = new TableRow().set("id", "x");
          c.output(row);
        }
      }));
  return formattedResults;
}
Does anyone know what we are doing wrong?
I think the error message is telling you that the actual collection contains more copies of that element than the expectation.
Expected: iterable over [<{id=x}>] in any order
but: Not matched: <{id=x}>
This is hamcrest indicating that you wanted an iterable over a single element, but the actual collection had an item which wasn't matched. Since all of the items coming out of "format join" have the same value, it made this harder to read than it should have been.
Specifically, this is the message produced when I run the following test, which checks whether a collection containing two copies of row contains exactly one copy of row:
@Category(RunnableOnService.class)
@Test
public void testTableRow() throws Exception {
  Pipeline p = TestPipeline.create();
  TableRow row = new TableRow().set("id", "x");
  PCollection<TableRow> rows = p.apply(Create.<TableRow>of(row, row));
  DataflowAssert.that(rows).containsInAnyOrder(row);
  p.run();
}
In order to get that result with your code, I had to take advantage of the fact that you only iterate over entries in table2. Specifically:
// Use these as the input tables.
table1 = [("keyA", "A1a"), ("keyA", "A1b")]
table2 = [("keyA", "A2a"), ("keyA", "A2b"), ("keyB", "B2")]

// The CoGroupByKey returns
[("keyA", (["A1a", "A1b"], ["A2a", "A2b"])),
 ("keyB", ([], ["B2"]))]

// When run through "Process join" this produces.
// For details on why see the next section.
["A2b,A2b",
 "B2,B2"]

// When run through "Format join" this becomes the following.
[{id=x}, {id=x}]
Note that the DoFn for "Process join" may not produce the expected results as commented below:
String key = e.getKey();
String value = null;
// NOTE: Both table1Value and table2Value iterate over pCollectionTable2Tag
for (String table1Value : c.element().getValue().getAll(pCollectionTable2Tag)) {
  for (String table2Value : c.element().getValue().getAll(pCollectionTable2Tag)) {
    // NOTE: this updates value, and doesn't output it. So for each
    // key there will be a single output with the *last* value
    // rather than one for each pair.
    value = table1Value + "," + table2Value;
  }
}
c.output(KV.of(key, value));
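A minimal sketch of the intended loop body, iterating each variable over its own tag and emitting one output per pair:
KV<String, CoGbkResult> e = c.element();
String key = e.getKey();
// Iterate table1 values over pCollectionTable1Tag and table2 values over
// pCollectionTable2Tag, outputting once per (table1Value, table2Value) pair.
for (String table1Value : e.getValue().getAll(pCollectionTable1Tag)) {
  for (String table2Value : e.getValue().getAll(pCollectionTable2Tag)) {
    c.output(KV.of(key, table1Value + "," + table2Value));
  }
}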