Example to read and write parquet file using ParquetIO through Apache Beam - google-cloud-dataflow

Has anybody tried reading/writing Parquet files using Apache Beam? Support was only added recently, in version 2.5.0, hence there is not much documentation.
I am trying to read a JSON input file and would like to write it out in Parquet format.
Thanks in advance.

Add the following dependency, as ParquetIO lives in a separate module:
<dependency>
    <groupId>org.apache.beam</groupId>
    <artifactId>beam-sdks-java-io-parquet</artifactId>
    <version>2.6.0</version>
</dependency>
// Here is code to read and write...
PCollection<JsonObject> input = ...; // your data
PCollection<GenericRecord> pgr = input.apply("parse json", ParDo.of(new DoFn<JsonObject, GenericRecord>() {
    @ProcessElement
    public void processElement(ProcessContext context) {
        JsonObject json = context.element();
        GenericRecord record = ...; // convert the JSON to a GenericRecord using your Avro schema
        context.output(record);
    }
}));
pgr.apply(FileIO.<GenericRecord>write().via(ParquetIO.sink(schema)).to("path/to/save"));

PCollection<GenericRecord> data = pipeline.apply(
    ParquetIO.read(schema).from("path/to/read"));

You will need to use ParquetIO.Sink. It implements FileIO.Sink, so you pass it to FileIO.write() via ParquetIO.sink(schema) as shown above.
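For the placeholder above (turning a JsonObject into a GenericRecord), a minimal sketch using Avro's GenericData.Record could look like this; the two field names and the use of Gson's JsonObject are assumptions for illustration, not part of the original answer:
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import com.google.gson.JsonObject;

// Hypothetical Avro schema with two string fields, "name" and "city".
static final Schema SCHEMA = new Schema.Parser().parse(
    "{\"type\":\"record\",\"name\":\"Person\",\"fields\":["
    + "{\"name\":\"name\",\"type\":\"string\"},"
    + "{\"name\":\"city\",\"type\":\"string\"}]}");

static GenericRecord toGenericRecord(JsonObject json) {
    GenericRecord record = new GenericData.Record(SCHEMA);
    // Copy each JSON property into the matching Avro field.
    record.put("name", json.get("name").getAsString());
    record.put("city", json.get("city").getAsString());
    return record;
}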

Related

Import CSV file from GCS to BigQuery

I'm trying to figure out how to load a CSV file from GCS into BigQuery. Pipeline below:
// Create the pipeline
Pipeline p = Pipeline.create(options);
// Create the PCollection from csv
PCollection<String> lines = p.apply(TextIO.read().from("gs://impression_tst_data/incoming_data.csv"));
// Transform into TableRow
PCollection<TableRow> row = lines.apply(ParDo.of(new StringToRowConverter()));
// Write table to BigQuery
row.apply(BigQueryIO.<TableRow>writeTableRows()
    .to("project_id:dataset.table")
    .withSchema(getSchema())
    .withWriteDisposition(WriteDisposition.WRITE_APPEND)
    .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED));
Here is the StringToRowConverter class I'm using in the ParDo to create a TableRow PCollection:
// StringToRowConverter
static class StringToRowConverter extends DoFn<String, TableRow> {
    @ProcessElement
    public void processElement(ProcessContext c) {
        c.output(new TableRow().set("string_field", c.element()));
    }
}
Looking at the staging files, it looks like this creates TableRows of JSON that lump the CSV into a single column named "string_field". If I don't define string_field in my schema the job fails. When I do define string_field, it writes each row of the CSV into that column and leaves all the other columns defined in my schema empty. I know this is expected behavior.
So my question: How do I take this JSON output and write it into the schema? Sample output and schema below...
"string_field": "6/26/17 21:28,Dave Smith,1 Learning Drive,867-5309,etc"}
Schema:
static TableSchema getSchema() {
    return new TableSchema().setFields(new ArrayList<TableFieldSchema>() {
        // Compose the list of TableFieldSchema from tableSchema.
        {
            add(new TableFieldSchema().setName("Event_Time").setType("TIMESTAMP"));
            add(new TableFieldSchema().setName("Name").setType("STRING"));
            add(new TableFieldSchema().setName("Address").setType("STRING"));
            add(new TableFieldSchema().setName("Phone").setType("STRING"));
            add(new TableFieldSchema().setName("etc").setType("STRING"));
        }
    });
}
Is there a better way of doing this than using the StringToRowConverter?
I need to use a ParDo to create a TableRow PCollection before I can write it out to BQ. However, I'm unable to find a solid example of how to take in a CSV PCollection, transform it into TableRows, and write it out.
Yes, I am a noob trying to learn here. I'm hoping somebody can help me with a snippet or point me in the right direction on the easiest way to accomplish this. Thanks in advance.
The code in your StringToRowConverter DoFn should parse the string and produce a TableRow with multiple fields. Since each row is comma separated, this would likely involve splitting the string on commas, and then using your knowledge of the column order to do something like:
String inputLine = c.element();
// May need to make the line parsing more robust, depending on your
// files. Look at how to parse rows of a CSV using Java.
String[] split = inputLine.split(",");
// Also, you may need to handle errors such as not enough columns, etc.
TableRow output = new TableRow();
output.set("Event_Time", split[0]); // may want to parse the string into a timestamp
output.set("Name", split[1]);
...
c.output(output);
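Putting that together, a complete StringToRowConverter along these lines should fill the schema above. This is a minimal sketch that assumes well-formed lines with exactly the five columns from your schema and no quoted commas:
static class StringToRowConverter extends DoFn<String, TableRow> {
    @ProcessElement
    public void processElement(ProcessContext c) {
        // Naive split; use a real CSV parser if your fields can contain commas.
        String[] split = c.element().split(",");
        TableRow row = new TableRow()
            .set("Event_Time", split[0])
            .set("Name", split[1])
            .set("Address", split[2])
            .set("Phone", split[3])
            .set("etc", split[4]);
        c.output(row);
    }
}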

How to deal with json data read from a database with RPC package

I am trying to implement an API using the rpc Dart package.
In all the examples I found, responses are built manually (i.e. new Response()..message = "hello").
In my case, I read JSON data from MongoDB and want to return it with minimal transformation (basically picking only the external properties).
The fromRequest method of the schema can be used to do this:
class QueryResult {
  // my props
}

@ApiMethod(path: "myMongoQuery")
Future<List<QueryResult>> myMongoQuery() async {
  var schema = _server.apiMap["/query/v1"].schemaMap["QueryResult"];
  var results = await coll.find();
  return results.map(schema.fromRequest).toList();
}
The problem in my code is the first line (_server.apiMap["/query/v1"].schemaMap["QueryResult"]); it's a pure hack to retrieve the schema of my method.
I tried to use mirrors to retrieve the schema in an elegant/generic way but did not succeed.
Can anyone help me with this?
Cheers,
Nicolas

Using MySQL as input source and writing into Google BigQuery

I have an Apache Beam task that reads from a MySQL source using JDBC, and it's supposed to write the data as-is to a BigQuery table. No transformation is performed at this point; that will come later. For the moment I just want the database output to be written directly into BigQuery.
This is the main method trying to perform this operation:
public static void main(String[] args) {
Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
Pipeline p = Pipeline.create(options);
// Build the table schema for the output table.
List<TableFieldSchema> fields = new ArrayList<>();
fields.add(new TableFieldSchema().setName("phone").setType("STRING"));
fields.add(new TableFieldSchema().setName("url").setType("STRING"));
TableSchema schema = new TableSchema().setFields(fields);
p.apply(JdbcIO.<KV<String, String>>read()
.withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
"com.mysql.jdbc.Driver", "jdbc:mysql://host:3306/db_name")
.withUsername("user")
.withPassword("pass"))
.withQuery("SELECT phone_number, identity_profile_image FROM scraper_caller_identities LIMIT 100")
.withRowMapper(new JdbcIO.RowMapper<KV<String, String>>() {
public KV<String, String> mapRow(ResultSet resultSet) throws Exception {
return KV.of(resultSet.getString(1), resultSet.getString(2));
}
})
.apply(BigQueryIO.Write
.to(options.getOutput())
.withSchema(schema)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)));
p.run();
}
But when I execute the template using maven, I get the following error:
Test.java:[184,6] cannot find symbol symbol: method
apply(com.google.cloud.dataflow.sdk.io.BigQueryIO.Write.Bound)
location: class org.apache.beam.sdk.io.jdbc.JdbcIO.Read<com.google.cloud.dataflow.sdk.values.KV<java.lang.String,java.lang.String>>
It seems that I'm not passing BigQueryIO.Write the expected data collection and that's what I am struggling with at the moment.
How can I make the data coming from MySQL meet BigQuery's expectations in this case?
I think that you need to provide a PCollection<TableRow> to BigQueryIO.Write instead of the PCollection<KV<String,String>> type that the RowMapper is outputting.
Also, please use the correct column name and value pairs when setting the TableRow.
Note: I think that your KVs are the phone and url values (e.g. {"555-555-1234": "http://www.url.com"}), not the column name and value pairs (e.g. {"phone": "555-555-1234", "url": "http://www.url.com"})
See the example here:
https://beam.apache.org/documentation/sdks/javadoc/0.5.0/
Would you please give this a try and let me know if it works for you? Hope this helps.
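A minimal sketch of that change, assuming the Apache Beam JdbcIO/BigQueryIO classes rather than the old com.google.cloud.dataflow SDK ones, and using the phone/url column names from your schema:
PCollection<TableRow> rows = p.apply(JdbcIO.<TableRow>read()
    .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
        "com.mysql.jdbc.Driver", "jdbc:mysql://host:3306/db_name")
        .withUsername("user")
        .withPassword("pass"))
    .withQuery("SELECT phone_number, identity_profile_image FROM scraper_caller_identities LIMIT 100")
    .withCoder(TableRowJsonCoder.of()) // coder for the TableRow elements
    .withRowMapper(new JdbcIO.RowMapper<TableRow>() {
        public TableRow mapRow(ResultSet resultSet) throws Exception {
            // Emit column-name/value pairs, not just the raw values.
            return new TableRow()
                .set("phone", resultSet.getString(1))
                .set("url", resultSet.getString(2));
        }
    }));

rows.apply(BigQueryIO.writeTableRows()
    .to(options.getOutput())
    .withSchema(schema)
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));
Note how the JdbcIO read is closed off before the BigQuery write is applied, so BigQueryIO receives a PCollection<TableRow> instead of being chained onto JdbcIO.Read.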

How to add column name as header when using dataflow to export data to csv

I am exporting some data to CSV with Dataflow, but in addition to the data I want to add the column names as the first line of the output file, such as
col_name1, col_name2, col_name3, col_name4 ...
data1.1, data1.2, data1.3, data1.4 ...
data2.1 ...
Is there any way to do this with the current API? (I searched around TextIO.Write but didn't find anything that seems relevant.) Or is there any way I could sort of "insert" the column names at the head of the to-be-exported PCollection and enforce that the data be written in order?
There is no built-in way to do that using TextIO.Write. PCollections are unordered, so it isn't possible to add an element to the front. You could write a custom BoundedSink which does this.
Custom sink APIs are now available if you want to be the brave one to craft a CSV sink. A current workaround is to build up the output as a single string and emit it all in finishBundle:
PCollection<String> output = data.apply(ParDo.of(new DoFn<String, String>() {
    private static final long serialVersionUID = 0;
    String new_line = System.getProperty("line.separator");
    String csv_header = "id, stuff1, stuff2, stuff3" + new_line;
    StringBuilder csv_body = new StringBuilder().append(csv_header);

    @Override
    public void processElement(ProcessContext c) {
        csv_body.append(c.element()).append(new_line);
    }

    @Override
    public void finishBundle(Context c) throws Exception {
        c.output(csv_body.toString());
    }
}));
output.apply(TextIO.Write.named("WriteData").to(options.getOutput()));
This will only work if your BIG output string fits in memory.
As of Dataflow SDK version 1.7.0, there is a withHeader function in TextIO.Write.
So you can do this:
TextIO.Write.named("WriteToText")
    .to("/path/to/the/file")
    .withHeader("col_name1,col_name2,col_name3,col_name4")
    .withSuffix(".csv");
A new line character is automatically added to the end of the header.
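If you are on the newer Apache Beam 2.x SDK (as used in the ParquetIO example at the top), the equivalent should be TextIO.write(); a minimal sketch, assuming data is your PCollection<String> of CSV rows:
// 'data' is assumed to be a PCollection<String>, one CSV row per element.
data.apply("WriteToText", TextIO.write()
    .to("/path/to/the/file")
    .withHeader("col_name1,col_name2,col_name3,col_name4")
    .withSuffix(".csv"));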

Read RDF:foaf file with Apache Jena

I have a problem reading an RDF file that uses FOAF tags. I would like to read it with Apache Jena. Below is a snippet of the RDF file.
<rdf:RDF xmlns="http://test.example.com/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:foaf="http://xmlns.com/foaf/0.1/">
<foaf:Person rdf:about="http://test.example.com/MainPerson.rdf">
<foaf:firstName>John</foaf:firstName>
<foaf:lastName>Doe</foaf:lastName>
<foaf:nick>Doe</foaf:nick>
<foaf:gender>Male</foaf:gender>
<foaf:based_near>Honolulu</foaf:based_near>
<foaf:birthday>08-14-1990</foaf:birthday>
<foaf:mbox>john@example.com</foaf:mbox>
<foaf:homepage rdf:resource="http://www.example.com"/>
<foaf:img rdf:resource="http://weknowmemes.com/wp-content/uploads/2013/09/wat-meme.jpg"/>
<foaf:made>
Article: Developing applications in Java
</foaf:made>
<foaf:age>24</foaf:age>
<foaf:interest>
Java, Java EE (web tier), PrimeFaces, MySQL, PHP, OpenCart, Joomla, Prestashop, CSS3, HTML5
</foaf:interest>
<foaf:pastProject rdf:resource="http://www.supercombe.si"/>
<foaf:status>Student</foaf:status>
<foaf:geekcode>M+, L++</foaf:geekcode>
<foaf:knows>
<foaf:Person>
<rdfs:seeAlso rdf:resource="http://test.example.com/Person.rdf"/>
</foaf:Person>
</foaf:knows>
<foaf:knows>
<foaf:Person>
<rdfs:seeAlso rdf:resource="http://test.example.com/Person2.rdf"/>
</foaf:Person>
</foaf:knows>
<foaf:knows>
<foaf:Person>
<rdfs:seeAlso rdf:resource="http://test.example.com/Person3.rdf"/>
</foaf:Person>
</foaf:knows>
</foaf:Person>
</rdf:RDF>
I just don't understand how to read this data with Apache Jena into a regular POJO object. Any help will be appreciated (I couldn't find a tutorial on the web for this kind of parsing).
I don't know if I understood your problem, but if you need to read an RDF file into a POJO object, you have a lot of choices. For example, you can read your RDF file into a Jena model, and then create POJO objects using the methods provided by the framework to get the values of your properties.
This is a code example that extracts the foaf:firstName from your file:
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Property;
import com.hp.hpl.jena.rdf.model.Resource;
import com.hp.hpl.jena.util.FileManager;

public class Test {

    // First, create a Jena model and use FileManager to read the file
    public static Model model = ModelFactory.createDefaultModel();

    public static void main(String[] args) {
        // Use FileManager to read the file and add it to the Jena model
        FileManager.get().readModel(model, "test.rdf");

        // Apply methods like getResource, getProperty, listStatements, listLiteralStatements...
        // to your model to extract the information you want
        Resource person = model.getResource("http://test.example.com/MainPerson.rdf");
        Property firstName = model.createProperty("http://xmlns.com/foaf/0.1/firstName");
        String firstNameValue = person.getProperty(firstName).getString();
        System.out.println(firstNameValue);
    }
}
You can use those methods in the setters of your POJO class. You can find a very good introduction here
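Building on that, a minimal sketch of a POJO populated through those Jena calls (the Person class and its setters are hypothetical; only firstName and lastName come from your file):
class Person {
    private String firstName;
    private String lastName;

    public void setFirstName(String firstName) { this.firstName = firstName; }
    public void setLastName(String lastName) { this.lastName = lastName; }
}

// Fill the POJO from the Jena model.
Person pojo = new Person();
Resource resource = model.getResource("http://test.example.com/MainPerson.rdf");
pojo.setFirstName(resource.getProperty(
    model.createProperty("http://xmlns.com/foaf/0.1/firstName")).getString());
pojo.setLastName(resource.getProperty(
    model.createProperty("http://xmlns.com/foaf/0.1/lastName")).getString());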
