Apache Beam Coder for GenericRecord - google-cloud-dataflow

I am building a pipeline that reads Avro generic records. To pass GenericRecord between stages I need to register an AvroCoder. The documentation says that if I use a generic record, the schema argument can be arbitrary: https://beam.apache.org/releases/javadoc/2.2.0/org/apache/beam/sdk/coders/AvroCoder.html#of-java.lang.Class-org.apache.avro.Schema-
However, when I pass an empty schema to the method AvroCoder.of(Class, Schema), it throws an exception at runtime. Is there a way to create an AvroCoder for GenericRecord that does not require a schema? In my case, each GenericRecord has an embedded schema.
The exception and stacktrace:
Exception in thread "main" java.lang.NullPointerException
at org.apache.beam.sdk.coders.AvroCoder$AvroDeterminismChecker.checkIndexedRecord(AvroCoder.java:562)
at org.apache.beam.sdk.coders.AvroCoder$AvroDeterminismChecker.recurse(AvroCoder.java:430)
at org.apache.beam.sdk.coders.AvroCoder$AvroDeterminismChecker.check(AvroCoder.java:409)
at org.apache.beam.sdk.coders.AvroCoder.<init>(AvroCoder.java:260)
at org.apache.beam.sdk.coders.AvroCoder.of(AvroCoder.java:141)

I had a similar case and solved it with a custom coder. The simplest (but inefficient) solution is to encode the schema along with each record. If your schemas are not too volatile, you can benefit from caching.
public class GenericRecordCoder extends AtomicCoder<GenericRecord> {
    public static GenericRecordCoder of() {
        return new GenericRecordCoder();
    }

    private static final ConcurrentHashMap<String, AvroCoder<GenericRecord>> avroCoders = new ConcurrentHashMap<>();

    @Override
    public void encode(GenericRecord value, OutputStream outStream) throws IOException {
        String schemaString = value.getSchema().toString();
        String schemaHash = getHash(schemaString);
        StringUtf8Coder.of().encode(schemaString, outStream);
        StringUtf8Coder.of().encode(schemaHash, outStream);
        AvroCoder<GenericRecord> coder = avroCoders.computeIfAbsent(schemaHash,
                s -> AvroCoder.of(value.getSchema()));
        coder.encode(value, outStream);
    }

    @Override
    public GenericRecord decode(InputStream inStream) throws IOException {
        String schemaString = StringUtf8Coder.of().decode(inStream);
        String schemaHash = StringUtf8Coder.of().decode(inStream);
        AvroCoder<GenericRecord> coder = avroCoders.computeIfAbsent(schemaHash,
                s -> AvroCoder.of(new Schema.Parser().parse(schemaString)));
        return coder.decode(inStream);
    }
}
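The getHash helper is not shown above; any stable digest of the schema string works, since it only serves as a cache key. A minimal sketch using the JDK's MessageDigest:
// Stable hex digest of the schema JSON, used as the coder-cache key.
private static String getHash(String schemaString) {
    try {
        byte[] digest = java.security.MessageDigest.getInstance("MD5")
                .digest(schemaString.getBytes(java.nio.charset.StandardCharsets.UTF_8));
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    } catch (java.security.NoSuchAlgorithmException e) {
        throw new IllegalStateException("MD5 not available", e);
    }
}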
While this solves the task, in practice I did it slightly differently, using an external schema registry (you can build one on top of Datastore, for example). In that case you don't need to serialize/deserialize the schema with every record. The code looks like:
public class GenericRecordCoder extends AtomicCoder<GenericRecord> {
    public static GenericRecordCoder of() {
        return new GenericRecordCoder();
    }

    private static final ConcurrentHashMap<String, AvroCoder<GenericRecord>> avroCoders = new ConcurrentHashMap<>();

    @Override
    public void encode(GenericRecord value, OutputStream outStream) throws IOException {
        SchemaRegistry.registerIfAbsent(value.getSchema());
        String schemaName = value.getSchema().getFullName();
        StringUtf8Coder.of().encode(schemaName, outStream);
        AvroCoder<GenericRecord> coder = avroCoders.computeIfAbsent(schemaName,
                s -> AvroCoder.of(value.getSchema()));
        coder.encode(value, outStream);
    }

    @Override
    public GenericRecord decode(InputStream inStream) throws IOException {
        String schemaName = StringUtf8Coder.of().decode(inStream);
        AvroCoder<GenericRecord> coder = avroCoders.computeIfAbsent(schemaName,
                s -> AvroCoder.of(SchemaRegistry.get(schemaName)));
        return coder.decode(inStream);
    }
}
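SchemaRegistry is not a Beam class; it is my own abstraction (backed by Datastore in my case). A minimal in-memory stand-in with the same two calls, just to make the example self-contained:
import org.apache.avro.Schema;
import java.util.concurrent.ConcurrentHashMap;

// In-memory stand-in for the external schema registry. A real implementation
// must persist schemas somewhere all workers can read (e.g. Datastore),
// since a static map is not shared across worker JVMs.
public class SchemaRegistry {
    private static final ConcurrentHashMap<String, Schema> schemas = new ConcurrentHashMap<>();

    public static void registerIfAbsent(Schema schema) {
        schemas.putIfAbsent(schema.getFullName(), schema);
    }

    public static Schema get(String fullName) {
        return schemas.get(fullName);
    }
}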
The usage is pretty straightforward:
PCollection<GenericRecord> inputCollection = pipeline
        .apply(AvroIO
                .parseGenericRecords(t -> t)
                .withCoder(GenericRecordCoder.of())
                .from(...));

After reviewing the code for AvroCoder, I do not think the documentation is correct there. Your AvroCoder instance will need a way to figure out the schema for your Avro records, and likely the only way to do that is by providing one.
So, I'd recommend calling AvroCoder.of(GenericRecord.class, schema), where schema is the correct schema for the GenericRecord objects in your PCollection.
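If you can obtain the schema ahead of time, a minimal sketch might look like this (schemaJson and the file pattern are placeholders, not from the question):
// Parse the schema all GenericRecords in the collection conform to,
// then hand it to AvroCoder explicitly.
Schema schema = new Schema.Parser().parse(schemaJson);
PCollection<GenericRecord> records = pipeline
        .apply(AvroIO.parseGenericRecords(r -> r)
                .from("gs://my-bucket/input/*.avro") // placeholder path
                .withCoder(AvroCoder.of(GenericRecord.class, schema)));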

Related

OAuth2Authentication object deserialization (RedisTokenStore)

I'm trying to rewrite some legacy code which used org.springframework.security.oauth2.provider.token.store.InMemoryTokenStore to store the access tokens. I'm currently trying to use the RedisTokenStore instead of the previously used InMemoryTokenStore. The token gets generated and stored in Redis fine (standalone Redis configuration); however, deserialization of OAuth2Authentication fails with the following error:
Could not read JSON: Cannot construct instance of `org.springframework.security.oauth2.provider.OAuth2Authentication` (no Creators, like default constructor, exist): cannot deserialize from Object value (no delegate- or property-based Creator)
Since there's no default constructor for this class, the deserialization and mapping to the actual object while looking up from Redis fails.
RedisTokenStore redisTokenStore = new RedisTokenStore(jedisConnectionFactory);
redisTokenStore.setSerializationStrategy(new StandardStringSerializationStrategy() {
    @Override
    protected <T> T deserializeInternal(byte[] bytes, Class<T> aClass) {
        return Utilities.parse(new String(bytes, StandardCharsets.UTF_8), aClass);
    }

    @Override
    protected byte[] serializeInternal(Object o) {
        return Objects.requireNonNull(Utilities.convert(o)).getBytes();
    }
});
this.tokenStore = redisTokenStore;
public static <T> T parse(String json, Class<T> clazz) {
    try {
        return OBJECT_MAPPER.readValue(json, clazz);
    } catch (IOException e) {
        log.error("Jackson2Json failed: " + e.getMessage());
    }
    return null;
}

public static String convert(Object data) {
    try {
        return OBJECT_MAPPER.writeValueAsString(data);
    } catch (JsonProcessingException e) {
        log.error("Conversion failed: " + e.getMessage());
    }
    return null;
}
How is the OAuth2Authentication object reconstructed when the token is looked up from Redis? Since it does not define a default constructor, a Jackson-based serializer and object mapper won't be able to deserialize it.
Again, serialization works fine (OAuth2Authentication implements the Serializable interface) and the token gets stored in Redis without issue. It only fails when /oauth/check_token is called.
What am I missing, and how is this problem dealt with when storing access tokens in Redis?
I solved the issue by writing a custom deserializer. It looks like this:
import com.fasterxml.jackson.core.JacksonException;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.databind.DeserializationContext;
import com.fasterxml.jackson.databind.JsonDeserializer;
import com.fasterxml.jackson.databind.module.SimpleModule;
import org.springframework.security.oauth2.core.AuthorizationGrantType;

import java.io.IOException;

public class AuthorizationGrantTypeCustomDeserializer extends JsonDeserializer<AuthorizationGrantType> {
    @Override
    public AuthorizationGrantType deserialize(JsonParser p, DeserializationContext ctxt) throws IOException, JacksonException {
        Root root = p.readValueAs(Root.class);
        return root != null ? new AuthorizationGrantType(root.value) : new AuthorizationGrantType("");
    }

    private static class Root {
        public String value;
    }

    public static SimpleModule generateModule() {
        SimpleModule authGrantModule = new SimpleModule();
        authGrantModule.addDeserializer(AuthorizationGrantType.class, new AuthorizationGrantTypeCustomDeserializer());
        return authGrantModule;
    }
}
Then I registered the deserializer in the ObjectMapper that is later used by the Jackson API:
ObjectMapper mapper = new ObjectMapper()
.registerModule(AuthorizationGrantTypeCustomDeserializer.generateModule());
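To tie this back to the Utilities class above, the shared mapper can be declared with the module registered (a sketch; OBJECT_MAPPER is the field used by parse and convert):
// Shared mapper with the custom deserializer registered, so token lookups
// can reconstruct AuthorizationGrantType values.
private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper()
        .registerModule(AuthorizationGrantTypeCustomDeserializer.generateModule());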

DynamicDestinations in Apache Beam

I have a PCollection<String>, say "X", that I need to dump into a BigQuery table.
The table destination and the schema for it are in a PCollection<TableRow>, say "Y".
How do I accomplish this in the simplest manner?
I tried extracting the table and schema from "Y" and saving them in static global variables (tableName and schema respectively). But oddly, BigQueryIO.writeTableRows() always sees the variable tableName as null, even though it does get the schema. I logged the values of both variables and they are set.
Here is my pipeline code:
static String tableName;
static TableSchema schema;

PCollection<String> read = p.apply("Read from input file",
        TextIO.read().from(options.getInputFile()));

PCollection<TableRow> tableRows = p.apply(
        BigQueryIO.read().fromQuery(NestedValueProvider.of(
                options.getfilename(),
                new SerializableFunction<String, String>() {
                    @Override
                    public String apply(String filename) {
                        return "SELECT table,schema FROM `BigqueryTest.configuration` WHERE file='" + filename + "'";
                    }
                })).usingStandardSql().withoutValidation());

final PCollectionView<List<String>> dataView = read.apply(View.asList());

tableRows.apply("Convert data read from file to TableRow",
        ParDo.of(new DoFn<TableRow, TableRow>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                tableName = c.element().get("table").toString();
                String[] schemas = c.element().get("schema").toString().split(",");
                List<TableFieldSchema> fields = new ArrayList<>();
                for (int i = 0; i < schemas.length; i++) {
                    fields.add(new TableFieldSchema()
                            .setName(schemas[i].split(":")[0]).setType(schemas[i].split(":")[1]));
                }
                schema = new TableSchema().setFields(fields);
                // My code to convert data to TableRow format.
            }
        }).withSideInputs(dataView));

tableRows.apply("write to BigQuery",
        BigQueryIO.writeTableRows()
                .withSchema(schema)
                .to("ProjectID:DatasetID." + tableName)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));
Everything works fine; only the BigQueryIO.write operation fails, with the error TableId is null.
I also tried using a SerializableFunction and returning the value from there, but I still get null.
Here is the code that I tried for it:
tableRows.apply("write to BigQuery",
BigQueryIO.writeTableRows()
.withSchema(schema)
.to(new GetTable(tableName))
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));
public static class GetTable implements SerializableFunction<String,String> {
String table;
public GetTable() {
this.table = tableName;
}
#Override
public String apply(String arg0) {
return "ProjectId:DatasetId."+table;
}
}
I also tried using DynamicDestinations, but I get an error saying the schema is not provided. Honestly, I'm new to the concept of DynamicDestinations and I'm not sure I'm doing it correctly.
Here is the code that I tried for it:
tableRows2.apply(BigQueryIO.writeTableRows()
        .to(new DynamicDestinations<TableRow, TableRow>() {
            private static final long serialVersionUID = 1L;

            @Override
            public TableDestination getTable(TableRow dest) {
                List<TableRow> list = sideInput(bqDataView); // bqDataView contains table and schema
                String table = list.get(0).get("table").toString();
                String tableSpec = "ProjectId:DatasetId." + table;
                String tableDescription = "";
                return new TableDestination(tableSpec, tableDescription);
            }

            public String getSideInputs(PCollectionView<List<TableRow>> bqDataView) {
                return null;
            }

            @Override
            public TableSchema getSchema(TableRow destination) {
                return schema; // schema is getting added from the global variable
            }

            @Override
            public TableRow getDestination(ValueInSingleWindow<TableRow> element) {
                return null;
            }
        }.getSideInputs(bqDataView)));
Please let me know what I'm doing wrong and which path I should take.
Thank you.
Part of the reason you're having trouble is the two stages of pipeline execution. First, the pipeline is constructed on your machine; this is when all of the applications of PTransforms occur. In your first example, this is when the following lines are executed:
BigQueryIO.writeTableRows()
        .withSchema(schema)
        .to("ProjectID:DatasetID." + tableName)
The code within a ParDo however runs when your pipeline executes, and it does so on many machines. So the following code runs much later than the pipeline construction:
@ProcessElement
public void processElement(ProcessContext c) {
    tableName = c.element().get("table").toString();
    ...
    schema = new TableSchema().setFields(fields);
    ...
}
This means that neither the tableName nor the schema field will be set when the BigQueryIO sink is created.
Your idea to use DynamicDestinations is correct, but you need to move the code that actually generates the schema and the destination into that class, rather than relying on global variables that aren't available on all of the machines.
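As a rough sketch (buildSchemaFor is a hypothetical stand-in for your schema-building code; the destination is derived from each element rather than from a static field):
tableRows.apply(BigQueryIO.writeTableRows()
        .to(new DynamicDestinations<TableRow, String>() {
            @Override
            public String getDestination(ValueInSingleWindow<TableRow> element) {
                // Runs on the workers, once per element.
                return element.getValue().get("table").toString();
            }

            @Override
            public TableDestination getTable(String tableName) {
                return new TableDestination("ProjectId:DatasetId." + tableName, "");
            }

            @Override
            public TableSchema getSchema(String tableName) {
                // Build the schema here (possibly from a side input declared
                // via getSideInputs()), not from a driver-side global.
                return buildSchemaFor(tableName); // hypothetical helper
            }
        })
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));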

Element value based writing to Google Cloud Storage using Dataflow

I'm trying to build a Dataflow process to help archive data by storing it in Google Cloud Storage. I have a PubSub stream of Event data which contains the client_id and some metadata. This process should archive all incoming events, so it needs to be a streaming pipeline.
I'd like to handle archiving by putting each Event I receive inside a bucket that looks like gs://archive/client_id/eventdata.json. Is it possible within Dataflow/Apache Beam to assign the file name differently for each Event in the PCollection?
EDIT:
So my code currently looks like:
public static class PerWindowFiles extends FileBasedSink.FilenamePolicy {
    private String customerId;

    public PerWindowFiles(String customerId) {
        this.customerId = customerId;
    }

    @Override
    public ResourceId windowedFilename(ResourceId outputDirectory, WindowedContext context, String extension) {
        String filename = bucket + "/" + customerId;
        return outputDirectory.resolve(filename, ResolveOptions.StandardResolveOptions.RESOLVE_FILE);
    }

    @Override
    public ResourceId unwindowedFilename(
            ResourceId outputDirectory, Context context, String extension) {
        throw new UnsupportedOperationException("Unsupported.");
    }
}
public static void main(String[] args) throws IOException {
    DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
            .withValidation()
            .as(DataflowPipelineOptions.class);
    options.setRunner(DataflowRunner.class);
    options.setStreaming(true);
    Pipeline p = Pipeline.create(options);

    PCollection<Event> set = p.apply(PubsubIO.readStrings()
            .fromTopic("topic"))
            .apply(new ConvertToEvent());

    PCollection<KV<String, Event>> events = labelEvents(set);
    PCollection<KV<String, EventGroup>> sessions = groupEvents(events);

    String customers = System.getProperty("CUSTOMERS");
    JSONArray custList = new JSONArray(customers);
    for (Object cust : custList) {
        if (cust instanceof String) {
            String customerId = (String) cust;
            PCollection<KV<String, EventGroup>> custCol = sessions.apply(new FilterByCustomer(customerId));
            stringifyEvents(custCol)
                    .apply(TextIO.write()
                            .to("gs://archive/")
                            .withFilenamePolicy(new PerWindowFiles(customerId))
                            .withWindowedWrites()
                            .withNumShards(3));
        } else {
            LOG.info("Failed to create TextIO: customerId was not String");
        }
    }
    p.run()
            .waitUntilFinish();
}
This code is ugly because I need to redeploy every time a new client appears in order to save their data. I would prefer to assign customer data to an appropriate bucket dynamically.
"Dynamic destinations" - choosing the file name based on the elements being written - will be a new feature available in Beam 2.1.0, which has not yet been released.

ValueProvider Issue

I am trying to get the value of a property that is passed from a Cloud Function to a Dataflow template. I am getting errors because the value being passed is a wrapper, and calling its .get() method during pipeline construction fails with this error:
An exception occurred while executing the Java class. null: InvocationTargetException: Not called from a runtime context.
public interface MyOptions extends DataflowPipelineOptions {
    ...
    @Description("schema of csv file")
    ValueProvider<String> getHeader();

    void setHeader(ValueProvider<String> header);
    ...
}
public static void main(String[] args) throws IOException {
    ...
    List<String> sideInputColumns = Arrays.asList(options.getHeader().get().split(","));
    ...
    // ultimately use the headers as side inputs
    PCollection<String> input = p.apply(Create.of(sideInputColumns));
    final PCollectionView<List<String>> finalColumnView = input.apply(View.asList());
}
How do I extract the value from the ValueProvider type?
The value of a ValueProvider is not available during pipeline construction. As such, you need to organize your pipeline so that it always has the same structure, and serializes the ValueProvider. At runtime, the individual transforms within your pipeline can inspect the value to determine how to operate.
Based on your example, you may need to do something like the following. It creates a single element, and then uses a DoFn that is evaluated at runtime to expand the headers:
public static class HeaderDoFn extends DoFn<String, String> {
    private final ValueProvider<String> header;

    public HeaderDoFn(ValueProvider<String> header) {
        this.header = header;
    }

    @ProcessElement
    public void processElement(ProcessContext c) {
        // Ignore input element -- there should be exactly one
        for (String column : this.header.get().split(",")) {
            c.output(column);
        }
    }
}
public static void main(String[] args) throws IOException {
    PCollection<String> input = p
            .apply(Create.of("one")) // create a single element
            .apply(ParDo.of(new HeaderDoFn(options.getHeader())));

    // Note that the order of this list is not guaranteed.
    final PCollectionView<List<String>> finalColumnView =
            input.apply(View.asList());
}
Another option would be to use a NestedValueProvider to create a ValueProvider<List<String>> from the option, and pass that ValueProvider<List<String>> to the necessary DoFns rather than using a side input.
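That could look roughly like this (a sketch reusing the getHeader() option from the question):
// Wrap the String option so the comma-split happens at runtime,
// when .get() is called inside a DoFn.
ValueProvider<List<String>> headerColumns = NestedValueProvider.of(
        options.getHeader(),
        header -> Arrays.asList(header.split(",")));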

Sharing BigTable Connection object among DataFlow DoFn sub-classes

I am setting up a Java Pipeline in DataFlow to read a .csv file and to create a bunch of BigTable rows based on the content of the file. I see in the BigTable documentation the note that connecting to BigTable is an 'expensive' operation and that it's a good idea to do it only once and to share the connection among the functions that need it.
However, if I declare the Connection object as a public static variable in the main class and first connect to BigTable in the main function, I get a NullPointerException when I subsequently try to reference the connection in instances of DoFn subclasses' processElement() function as part of my DataFlow pipeline.
Conversely, if I declare the Connection as a static variable in the actual DoFn class, then the operation works successfully.
What is the best-practice or optimal way to do this?
I'm concerned that if I implement the second option at scale, I will be wasting a lot of time and resources. If I keep the variable as static in the DoFn class, is it enough to ensure that the APIs don't try to re-establish the connection every time?
I realize there is a special BigTable I/O call to sync DataFlow pipeline objects with BigTable, but I think I need to write one on my own to build in some special logic into the DoFn processElement() function...
This is what the "working" code looks like:
class DigitizeBT extends DoFn<String, String> {
    private static Connection m_locConn;

    @Override
    public void processElement(ProcessContext c) {
        try {
            m_locConn = BigtableConfiguration.connect("projectID", "instanceID");
            Table tbl = m_locConn.getTable(TableName.valueOf("TableName"));
            Put put = new Put(Bytes.toBytes(rowKey));
            put.addColumn(
                    Bytes.toBytes("CF1"),
                    Bytes.toBytes("SomeName"),
                    Bytes.toBytes("SomeValue"));
            tbl.put(put);
        } catch (IOException e) {
            e.printStackTrace();
            System.exit(1);
        }
    }
}
This is what the updated code looks like, FYI:
public void SmallKVJob() {
    CloudBigtableScanConfiguration config = new CloudBigtableScanConfiguration.Builder()
            .withProjectId(DEF.ID_PROJ)
            .withInstanceId(DEF.ID_INST)
            .withTableId(DEF.ID_TBL_UNITS)
            .build();

    DataflowPipelineOptions options = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
    options.setProject(DEF.ID_PROJ);
    options.setStagingLocation(DEF.ID_STG_LOC);
    // options.setNumWorkers(3);
    // options.setMaxNumWorkers(5);
    // options.setRunner(BlockingDataflowPipelineRunner.class);
    options.setRunner(DirectPipelineRunner.class);

    Pipeline p = Pipeline.create(options);
    p.apply(TextIO.Read.from(DEF.ID_BAL))
            .apply(ParDo.of(new DoFn1()))
            .apply(ParDo.of(new DoFn2()))
            .apply(ParDo.of(new DoFn3(config)));

    m_log.info("starting to run the job");
    p.run();
    m_log.info("finished running the job");
}
class DoFn1 extends DoFn<String, KV<String, Integer>> {
    @Override
    public void processElement(ProcessContext c) {
        c.output(KV.of(c.element().split("\\,")[0], Integer.valueOf(c.element().split("\\,")[1])));
    }
}

class DoFn2 extends DoFn<KV<String, Integer>, KV<String, Integer>> {
    @Override
    public void processElement(ProcessContext c) {
        int max = c.element().getValue();
        String name = c.element().getKey();
        for (int i = 0; i < max; i++)
            c.output(KV.of(name, 1));
    }
}
class DoFn3 extends AbstractCloudBigtableTableDoFn<KV<String, Integer>, String> {
    public DoFn3(CloudBigtableConfiguration config) {
        super(config);
    }

    @Override
    public void processElement(ProcessContext c) {
        try {
            Integer max = c.element().getValue();
            for (int i = 0; i < max; i++) {
                String owner = c.element().getKey();
                String rnd = UUID.randomUUID().toString();
                Put p = new Put(Bytes.toBytes(owner + "*" + rnd));
                p.addColumn(Bytes.toBytes(DEF.ID_CF1), Bytes.toBytes("Owner"), Bytes.toBytes(owner));
                getConnection().getTable(TableName.valueOf(DEF.ID_TBL_UNITS)).put(p);
                c.output("Success");
            }
        } catch (IOException e) {
            c.output(e.toString());
            e.printStackTrace();
        }
    }
}
The input .csv file looks something like this:
Mary,3000
John,5000
Peter,2000
So, for each row in the .csv file, I have to put x rows into BigTable, where x is the value in the second cell of that row...
We built AbstractCloudBigtableTableDoFn (see its source and docs) for this purpose. Extend that class instead of DoFn, and call getConnection() instead of creating a Connection yourself.
10,000 small rows should take a second or two of actual work.
EDIT: As per the comments, BufferedMutator should be used instead of Table.put() for best throughput.
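A rough sketch of that change inside a subclass of AbstractCloudBigtableTableDoFn (using the plain HBase BufferedMutator API; lifecycle and flushing are simplified here):
// Batches mutations client-side instead of one round trip per Put.
BufferedMutator mutator =
        getConnection().getBufferedMutator(TableName.valueOf(DEF.ID_TBL_UNITS));
Put p = new Put(Bytes.toBytes(owner + "*" + UUID.randomUUID()));
p.addColumn(Bytes.toBytes(DEF.ID_CF1), Bytes.toBytes("Owner"), Bytes.toBytes(owner));
mutator.mutate(p);
// Flush any buffered mutations before the bundle completes
// (for example from a finishBundle method).
mutator.flush();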
