Using MySQL as input source and writing into Google BigQuery - google-cloud-dataflow

I have an Apache Beam task that reads from a MySQL source using JDBC and it's supposed to write the data as it is to a BigQuery table. No transformation is performed at this point, that will come later on, for the moment I just want the database output to be directly written into BigQuery.
This is the main method trying to perform this operation:
public static void main(String[] args) {
Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
Pipeline p = Pipeline.create(options);
// Build the table schema for the output table.
List<TableFieldSchema> fields = new ArrayList<>();
fields.add(new TableFieldSchema().setName("phone").setType("STRING"));
fields.add(new TableFieldSchema().setName("url").setType("STRING"));
TableSchema schema = new TableSchema().setFields(fields);
p.apply(JdbcIO.<KV<String, String>>read()
.withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
"com.mysql.jdbc.Driver", "jdbc:mysql://host:3306/db_name")
.withUsername("user")
.withPassword("pass"))
.withQuery("SELECT phone_number, identity_profile_image FROM scraper_caller_identities LIMIT 100")
.withRowMapper(new JdbcIO.RowMapper<KV<String, String>>() {
public KV<String, String> mapRow(ResultSet resultSet) throws Exception {
return KV.of(resultSet.getString(1), resultSet.getString(2));
}
})
.apply(BigQueryIO.Write
.to(options.getOutput())
.withSchema(schema)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)));
p.run();
}
But when I execute the template using maven, I get the following error:
Test.java:[184,6] cannot find symbol symbol: method
apply(com.google.cloud.dataflow.sdk.io.BigQueryIO.Write.Bound)
location: class org.apache.beam.sdk.io.jdbc.JdbcIO.Read<com.google.cloud.dataflow.sdk.values.KV<java.lang.String,java.lang.String>>
It seems that I'm not passing BigQueryIO.Write the expected data collection and that's what I am struggling with at the moment.
How can I make the data coming from MySQL meets BigQuery's expectations in this case?

I think that you need to provide a PCollection<TableRow> to BigQueryIO.Write instead of the PCollection<KV<String,String>> type that the RowMapper is outputting.
Also, please use the correct column name and value pairs when setting the TableRow.
Note: I think that your KVs are the phone and url values (e.g. {"555-555-1234": "http://www.url.com"}), not the column name and value pairs (e.g. {"phone": "555-555-1234", "url": "http://www.url.com"})
See the example here:
https://beam.apache.org/documentation/sdks/javadoc/0.5.0/
Would you please give this a try and let me know if it works for you? Hope this helps.

Related

Unable to query Cosmos DB using numeric partition key

I am trying to use a numeric field as a partition key but I am unable to run stored procedures on them. I am not sure if I am doing something wrong or if it is not possible.
I have two collections with two different partition keys.
A sample document from collection 1
{
"id":"1",
"group": "a"
}
A sample document from collection 2
{
"id":"1",
"group": 1
}
The difference is that the group field in second collection does not have double quotes around it.
Both the collections have group as the partition key and I am trying to execute the sample stored procedure provided in the Azure Portal for Cosmos DB.
While i am able to run the first collection successfully, the second collection produces an error of "no docs found".
I am importing the data from sql server database and the int fields are imported without double quotes. I searched quite a lot but could not find documentation relating to numeric partition key without double quotes.
Is it possible to do partition keys from numeric fields?? Any help appreciated as this field is the best fit for my collection for partitioning and would be really useful.
Please don't worry about the numeric partition key which is supported in the cosmos db stored procedure definitely. You did everything rightly.You met the unexpected results because azure portal identify the partition key input as String type even though you filled Number type.
You could test it with cosmos db sdk or rest api. For example, java sdk sample as below:
import com.microsoft.azure.documentdb.*;
public class ExecuteSPTest {
static private String YOUR_COSMOS_DB_ENDPOINT = "https://***.documents.azure.com:443/";
static private String YOUR_COSMOS_DB_MASTER_KEY = "***";
public static void main(String[] args) throws DocumentClientException {
DocumentClient client = new DocumentClient(
YOUR_COSMOS_DB_ENDPOINT,
YOUR_COSMOS_DB_MASTER_KEY,
new ConnectionPolicy(),
ConsistencyLevel.Session);
RequestOptions options = new RequestOptions();
PartitionKey partitionKey = new PartitionKey(1);
options.setPartitionKey(partitionKey);
StoredProcedureResponse queryResults = client.executeStoredProcedure(
"/dbs/db/colls/group/sprocs/sample",
options,
null);
String document = queryResults.getResponseAsString();
System.out.println(document);
}
}
Output:

Import CSV file from GCS to BigQuery

I'm trying to figure out how to load a CSV file from GCS into BigQuery. Pipeline below:
// Create the pipeline
Pipeline p = Pipeline.create(options);
// Create the PCollection from csv
PCollection<String> lines = p.apply(TextIO.read().from("gs://impression_tst_data/incoming_data.csv"));
// Transform into TableRow
PCollection<TableRow> row = lines.apply(ParDo.of(new StringToRowConverter()));
// Write table to BigQuery
row.apply(BigQueryIO.<TableRow>writeTableRows()
.to(“project_id:dataset.table”)
.withSchema(getSchema())
.withWriteDisposition(WriteDisposition.WRITE_APPEND)
.withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED));
Here is the StringToRowConverter class I'm using in the ParDo to create a TableRow PCollection:
// StringToRowConverter
static class StringToRowConverter extends DoFn<String, TableRow> {
#ProcessElement
public void processElement(ProcessContext c) {
c.output(new TableRow().set("string_field", c.element()));
}
}
Looking at the staging files it looks like this creates TableRows of JSON that lump the csv into a single column named "string_field". If I don't define string_field in my schema the job fails. When I do define string_field, it writes the each row of the CSV into the column and leaves all my other columns defined in the schema empty. I know this is expected behavior.
So my question: How do I take this JSON output and write it into the schema? Sample output and schema below...
"string_field": "6/26/17 21:28,Dave Smith,1 Learning Drive,867-5309,etc"}
Schema:
static TableSchema getSchema() {
return new TableSchema().setFields(new ArrayList<TableFieldSchema>() {
// Compose the list of TableFieldSchema from tableSchema.
{
add(new TableFieldSchema().setName("Event_Time").setType("TIMESTAMP"));
add(new TableFieldSchema().setName("Name").setType("STRING"));
add(new TableFieldSchema().setName("Address").setType("STRING"));
add(new TableFieldSchema().setName("Phone").setType("STRING"));
add(new TableFieldSchema().setName("etc").setType("STRING"));
}
});
}
Is there a better way of doing this than using the StringToRowConverter?
I need to use a ParDo to create a TableRow PCollection before I can write it out to BQ. However, I'm unable to find a solid example of how to take in a CSV PCollection, transform to TableRow and write it out.
Yes, I am a noob trying to learn here. I'm hoping somebody can help me with a snippet or point me in the right direction on the easiest way to accomplish this. Thanks in advance.
The code in your StringToRowConverter DoFn should parse the string and produce a TableRow with multiple fields. Since each row is comma separated, this would likely involve splitting the string on commas, and then using your knowledge of the column order to do something like:
String inputLine = c.element();
// May need to make the line parsing more robust, depending on your
// files. Look at how to parse rows of a CSV using Java.
String[] split = inputLine.split(',');
// Also, you may need to handle errors such as not enough columns, etc.
TableRow output = new TableRow();
output.set("Event_Time", split[0]); // may want to parse the string
output.set("Name", split[1]);
...
c.output(output);

Dataflow output parameterized type to avro file

I have a pipeline that successfully outputs an Avro file as follows:
#DefaultCoder(AvroCoder.class)
class MyOutput_T_S {
T foo;
S bar;
Boolean baz;
public MyOutput_T_S() {}
}
#DefaultCoder(AvroCoder.class)
class T {
String id;
public T() {}
}
#DefaultCoder(AvroCoder.class)
class S {
String id;
public S() {}
}
...
PCollection<MyOutput_T_S> output = input.apply(myTransform);
output.apply(AvroIO.Write.to("/out").withSchema(MyOutput_T_S.class));
How can I reproduce this exact behavior except with a parameterized output MyOutput<T, S> (where T and S are both Avro code-able using reflection).
The main issue is that Avro reflection doesn't work for parameterized types. So based on these responses:
Setting Custom Coders & Handling Parameterized types
Using Avrocoder for Custom Types with Generics
1) I think I need to write a custom CoderFactory but, I am having difficulty figuring out exactly how this works (I'm having trouble finding examples). Oddly enough, a completely naive coder factory appears to let me run the pipeline and inspect proper output using DataflowAssert:
cr.RegisterCoder(MyOutput.class, new CoderFactory() {
#Override
public Coder<?> create(List<? excents Coder<?>> componentCoders) {
Schema schema = new Schema.Parser().parse("{\"type\":\"record\,"
+ "\"name\":\"MyOutput\","
+ "\"namespace\":\"mypackage"\","
+ "\"fields\":[]}"
return AvroCoder.of(MyOutput.class, schema);
}
#Override
public List<Object> getInstanceComponents(Object value) {
MyOutput<Object, Object> myOutput = (MyOutput<Object, Object>) value;
List components = new ArrayList();
return components;
}
While I can successfully assert against the output now, I expect this will not cut it for writing to a file. I haven't figured out how I'm supposed to use the provided componentCoders to generate the correct schema and if I try to just shove the schema of T or S into fields I get:
java.lang.IllegalArgumentException: Unable to get field id from class null
2) Assuming I figure out how to encode MyOutput. What do I pass to AvroIO.Write.withSchema? If I pass either MyOutput.class or the schema I get type mismatch errors.
I think there are two questions (correct me if I am wrong):
How do I enable the coder registry to provide coders for various parameterizations of MyOutput<T, S>?
How do I values of MyOutput<T, S> to a file using AvroIO.Write.
The first question is to be solved by registering a CoderFactory as in the linked question you found.
Your naive coder is probably allowing you to run the pipeline without issues because serialization is being optimized away. Certainly an Avro schema with no fields will result in those fields being dropped in a serialization+deserialization round trip.
But assuming you fill in the schema with the fields, your approach to CoderFactory#create looks right. I don't know the exact cause of the message java.lang.IllegalArgumentException: Unable to get field id from class null, but the call to AvroCoder.of(MyOutput.class, schema) should work, for an appropriately assembled schema. If there is an issue with this, more details (such as the rest of the stack track) would be helpful.
However, your override of CoderFactory#getInstanceComponents should return a list of values, one per type parameter of MyOutput. Like so:
#Override
public List<Object> getInstanceComponents(Object value) {
MyOutput<Object, Object> myOutput = (MyOutput<Object, Object>) value;
return ImmutableList.of(myOutput.foo, myOutput.bar);
}
The second question can be answered using some of the same support code as the first, but otherwise is independent. AvroIO.Write.withSchema always explicitly uses the provided schema. It does use AvroCoder under the hood, but this is actually an implementation detail. Providing a compatible schema is all that is necessary - such a schema will have to be composed for each value of T and S for which you want to output MyOutput<T, S>.

How to add column name as header when using dataflow to export data to csv

I am exporting some data to csv by Dataflow, but beyond data I want to add each column names as the first line on the output file such as
col_name1, col_name2, col_name3, col_name4 ...
data1.1, data1.2, data1.3, data1.4 ...
data2.1 ...
Is there anyway to do with current API?(searched around TextIO.Write but didn't find anything seems relevant...) or is there anyway I could sort of "insert" column name at the head of to-be-exported PCollection and enforce the the data to be written in order...?
There is no built-in way to do that using TextIO.Write. PCollections are unordered so it isn't possible ot add an eleemnt to the front. You could write a custom BoundedSink which does this.
Custom sink APIs are now available if you want to be the brave one to craft a CSV sink. Current workaround which builds up the output as a single string and outputs it all at finish bundle:
PCollection<String> output = data.apply(ParDo.of(new DoFn<String, String>() {
private static final long serialVersionUID = 0;
String new_line = System.getProperty("line.separator");
String csv_header = "id, stuff1, stuff2, stuff3" + new_line;
StringBuilder csv_body = new StringBuilder().append(csv_header);
#Override
public void processElement(ProcessContext c) {
csv_body.append(c.element()).append(newline);
}
#Override
public void finishBundle(Context c) throws Exception {
c.output(csv_body);
}
})).apply(TextIO.Write.named("WriteData").to(options.getOutput()));
This will only work if your BIG output string fits in memory
As of Dataflow SDK version 1.7.0, you have withHeader function in TextIO.Write .
So you can do this:
TextIO.Write.named("WriteToText")
.to("/path/to/the/file")
.withHeader("col_name1,col_name2,col_name3,col_name4")
.withSuffix(".csv"));
A new line character is automatically added to the end of the header.

Dynamic table name when writing to BQ from dataflow pipelines

As a followup question to the following question and answer:
https://stackoverflow.com/questions/31156774/about-key-grouping-with-groupbykey
I'd like to confirm with google dataflow engineering team (#jkff) if the 3rd option proposed by Eugene is at all possible with google dataflow:
"have a ParDo that takes these keys and creates the BigQuery tables, and another ParDo that takes the data and streams writes to the tables"
My understanding is that ParDo/DoFn will process each element, how could we specify a table name (function of the keys passed in from side inputs) when writing out from processElement of a ParDo/DoFn?
Thanks.
Updated with a DoFn, which is not working obviously since c.element().value is not a pcollection.
PCollection<KV<String, Iterable<String>>> output = ...;
public class DynamicOutput2Fn extends DoFn<KV<String, Iterable<String>>, Integer> {
private final PCollectionView<List<String>> keysAsSideinputs;
public DynamicOutput2Fn(PCollectionView<List<String>> keysAsSideinputs) {
this.keysAsSideinputs = keysAsSideinputs;
}
#Override
public void processElement(ProcessContext c) {
List<String> keys = c.sideInput(keysAsSideinputs);
String key = c.element().getKey();
//the below is not working!!! How could we write the value out to a sink, be it gcs file or bq table???
c.element().getValue().apply(Pardo.of(new FormatLineFn()))
.apply(TextIO.Write.to(key));
c.output(1);
}
}
The BigQueryIO.Write transform does not support this. The closest thing you can do is to use per-window tables, and encode whatever information you need to select the table in the window objects by using a custom WindowFn.
If you don't want to do that, you can make BigQuery API calls directly from your DoFn. With this, you can set the table name to anything you want, as computed by your code. This could be looked up from a side input, or computed directly from the element the DoFn is currently processing. To avoid making too many small calls to BigQuery, you can batch up the requests using finishBundle();
You can see how the Dataflow runner does the streaming import here:
https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/sdk/src/main/java/com/google/cloud/dataflow/sdk/util/BigQueryTableInserter.java

Resources