Apache Beam and Avro: create a Dataflow pipeline without a schema - google-cloud-dataflow

I am building a dataflow pipeline with Apache beam. Below is the pseudo code:
PCollection<GenericRecord> rows = pipeline
    .apply("Read Json from PubSub", <some reader>)
    .apply("Convert Json to pojo", ParDo.of(new JsonToPojo()))
    .apply("Convert pojo to GenericRecord", ParDo.of(new PojoToGenericRecord()))
    .setCoder(AvroCoder.of(GenericRecord.class, schema));
I am trying to get rid of setting the coder in the pipeline, as the schema won't be known at pipeline-creation time (it will be present in the message).
I commented out the line that sets the coder and got an exception saying that a default coder is not configured. I then used the one-argument version of the of method and got the following exception:
Not a Specific class: interface org.apache.avro.generic.GenericRecord
at org.apache.avro.specific.SpecificData.createSchema(SpecificData.java:285)
at org.apache.avro.reflect.ReflectData.createSchema(ReflectData.java:594)
at org.apache.avro.specific.SpecificData$2.load(SpecificData.java:218)
at org.apache.avro.specific.SpecificData$2.load(SpecificData.java:215)
at avro.shaded.com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3568)
at avro.shaded.com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2350)
at avro.shaded.com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2313)
at avro.shaded.com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2228)
... 9 more
Is there any way for us to supply the coder at runtime, without knowing the schema beforehand?

This is possible. I recommend the following approach:
Do not use an intermediate collection of type GenericRecord. Keep it as a collection of your POJOs.
Write a transform that extracts the schema of your data and makes it available as a PCollectionView<however you want to represent the schema> (see the sketch after this list).
When writing to BigQuery, write your PCollection<YourPojo> via write().to(DynamicDestinations), and when writing to Avro, use FileIO.write() or writeDynamic() in combination with AvroIO.sinkViaGenericRecords(). Both of these can take a dynamically computed schema from a side input (the one you computed above).
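A minimal sketch of the second step, assuming the incoming message carries its Avro schema as a JSON string and the POJO exposes it through a hypothetical getSchemaJson() accessor (names here are illustrative, not from the question):

// pojos is the PCollection<MyPojo> produced by the JsonToPojo step.
PCollectionView<String> schemaView = pojos
    .apply("ExtractSchemaJson", ParDo.of(new DoFn<MyPojo, String>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        // Assumes each message carries its Avro schema as JSON.
        c.output(c.element().getSchemaJson());
      }
    }))
    // In a streaming pipeline, apply a window/trigger before creating the view.
    .apply("TakeOne", Sample.any(1))
    .apply("AsSingleton", View.asSingleton());

The write step can then read schemaView as a side input and parse it with new org.apache.avro.Schema.Parser().parse(json) when building the Avro sink or the BigQuery destination.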

Related

Dataflow: Consuming runtime parameters in template

Trying to create a template for dataflow job.
Is there any way to generate a template with runtime parameters?
So far the template only picks up the parameter values that were supplied when it was created; when I pass different values for the variables at launch time, it does not pick up the runtime values.
If any additional details are needed, will provide the same.
You can use value providers in your pipeline options to have runtime arguments in a pipeline.
But I'm afraid the places where you can use these parameters are quite limited (mostly inside a DoFn).
This behaviour is expected from a Dataflow template, as it is a representation of the pipeline rather than the code itself.
Please bear in mind that you cannot create a Dataflow template with dynamic processing steps based on the values passed in.
The steps are hard-coded into the template and cannot be changed unless the code that generates the template is executed again.
A parameter needs to be wrapped inside of a ValueProvider object in order for the template pipeline to access the runtime value of that parameter. All of the example templates provided here demonstrate how ValueProvider can be used to parameterize a template pipeline.
Take a look at the WordCount pipeline as an example.
As you can see, the pipeline uses a ValueProvider (instead of a simple String) to read the path of the file on which a WordCount needs to be executed:
@Description("Path of the file to read from")
ValueProvider<String> getInputFile();
void setInputFile(ValueProvider<String> value);
Since the value of inputFile is unknown until runtime (when the template is actually executed with valid inputs), the transform using the ValueProvider will defer reading the value of the parameter until runtime (for example, inside a DoFn).
The native TextIO.Read Beam transform supports reading from a ValueProvider in addition to reading from a String.
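As an illustration, a minimal sketch of how such an option is wired into a pipeline (the interface name WordCountOptions is assumed here, not taken from the linked example); the ValueProvider is handed to TextIO unresolved and is only read when the template actually runs:

WordCountOptions options = PipelineOptionsFactory.fromArgs(args)
    .withValidation()
    .as(WordCountOptions.class);

Pipeline p = Pipeline.create(options);

// No .get() happens at template-creation time; TextIO accepts the
// ValueProvider directly and resolves the path at job launch.
p.apply("ReadLines", TextIO.read().from(options.getInputFile()));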

Read a pickle from another pipeline in Beam?

I'm running batch pipelines in Google Cloud Dataflow. I need to read objects in one pipeline that another pipeline has previously written. The easiest way to write and read the objects is pickle / dill.
The writing works well, writing a number of files, each with a pickled object. When I download the file manually, I can unpickle the file. Code for writing: beam.io.WriteToText('gs://{}', coder=coders.DillCoder())
But the reading breaks every time, with one of the errors below. Code for reading: beam.io.ReadFromText('gs://{}*', coder=coders.DillCoder())
Either...
File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 266, in load
obj = pik.load()
File "/usr/lib/python2.7/pickle.py", line 858, in load
dispatch[key](self)
KeyError: '\x90'
...or...
File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 423, in find_class
return StockUnpickler.find_class(self, module, name)
File "/usr/lib/python2.7/pickle.py", line 1124, in find_class
__import__(module)
ImportError: No module named measur
(the class of the object lives in a path containing measure, though I'm not sure why the last character is missing there)
I've tried using the default coder, and a BytesCoder, and pickling & unpickling as a custom task in the pipeline.
My working hypothesis is that the reader is splitting the file by line, and so treating a single pickle (which contains newlines) as multiple objects. If so, is there a way of avoiding that?
I could attempt to build a reader myself, but I'm hesitant since this seems like a well-solved problem (e.g. Beam already has a format to move objects from one pipeline stage to another).
Tangentially related: How to read blob (pickle) files from GCS in a Google Cloud DataFlow job?
Thank you!
ReadFromText is designed to read newline-separated records in text files, hence it is not suitable for your use case. Implementing FileBasedSource is not a good solution either, since it's designed for reading large files with multiple records (and usually splits those files into shards for parallel processing). So, in your case, the current best solution for the Python SDK is to implement a source yourself. This can be as simple as a ParDo that reads files and produces a PCollection of records. If your ParDo produces a large number of records, consider adding an apache_beam.transforms.util.Reshuffle step after it, which will allow runners to parallelize the following steps better. For the Java SDK we have FileIO, which already provides transforms to make this a bit easier.
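For the Java-SDK route mentioned in the last sentence, a rough sketch (the bucket path and element type are illustrative): match the files, read each one in full, and leave decoding the bytes to a downstream step; Reshuffle then lets the runner redistribute the records.

PCollection<KV<String, byte[]>> fileContents = pipeline
    .apply(FileIO.match().filepattern("gs://my-bucket/pickles/*"))
    .apply(FileIO.readMatches())
    .apply(ParDo.of(new DoFn<FileIO.ReadableFile, KV<String, byte[]>>() {
      @ProcessElement
      public void processElement(ProcessContext c) throws IOException {
        FileIO.ReadableFile file = c.element();
        // One whole file per element; unpickling the bytes would happen
        // here or in a later DoFn.
        c.output(KV.of(file.getMetadata().resourceId().toString(),
                       file.readFullyAsBytes()));
      }
    }))
    .apply(Reshuffle.viaRandomKey());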
Encoding as string_escape escapes the newlines, so the only newlines that Beam sees are those between pickles:
class DillMultiCoder(DillCoder):
    """
    Coder that allows multi-line pickles to be read.

    After an object is pickled, the bytes are encoded with `string_escape`,
    so newline characters (`\n`) aren't in the string.

    Previously, the presence of newline characters confused the Dataflow
    reader, as it couldn't discriminate between a new object and a newline
    within a pickle string.
    """

    def _create_impl(self):
        return coder_impl.CallbackCoderImpl(
            maybe_dill_multi_dumps, maybe_dill_multi_loads)


def maybe_dill_multi_dumps(o):
    # in Py3 this needs to be `unicode_escape`
    return maybe_dill_dumps(o).encode('string_escape')


def maybe_dill_multi_loads(o):
    # in Py3 this needs to be `unicode_escape`
    return maybe_dill_loads(o.decode('string_escape'))
For large pickles, I also needed to set the buffer size much higher, to 8MB: with the previous buffer size (8kB), a 120MB file spun for 2 days of CPU time:
class ReadFromTextPickle(ReadFromText):
    """
    Same as ReadFromText, but with a really big buffer. With the standard 8KB
    buffer, large files can be read in a loop and never finish.

    Also uses DillMultiCoder by default.
    """

    def __init__(
            self,
            file_pattern=None,
            min_bundle_size=0,
            compression_type=CompressionTypes.AUTO,
            strip_trailing_newlines=True,
            coder=DillMultiCoder(),
            validate=True,
            skip_header_lines=0,
            **kwargs):
        # needs commenting out, not sure why
        # super(ReadFromTextPickle, self).__init__(**kwargs)
        self._source = _TextSource(
            file_pattern,
            min_bundle_size,
            compression_type,
            strip_trailing_newlines=strip_trailing_newlines,
            coder=coder,
            validate=validate,
            skip_header_lines=skip_header_lines,
            buffer_size=8000000)
Another approach would be to implement a PickleFileSource inheriting from FileBasedSource and call pickle.load on the file; each call would yield a new object. But there's a bunch of complexity around offset_range_tracker that looked like more lift than strictly necessary.

Dataflow/Beam Templates, Productionization, Initialization, and ValueProviders

I have an Apache Beam job running on Google Cloud Dataflow, and as part of its initialization it needs to run some basic sanity/availability checks on services, pub/sub subscriptions, GCS blobs, etc. It's a streaming pipeline intended to run ad infinitum that processes hundreds of thousands of pub/sub messages.
Currently it needs a whole heap of required, variable parameters: which Google Cloud project it needs to run in, which bucket and directory prefix it's going to be storing files in, which pub/sub subscriptions it needs to read from, and so on. It does some work with these parameters before pipeline.run is called: validation, string splitting, and the like. In its current form, in order to start a job we've been passing these parameters to a PipelineOptionsFactory and issuing a new compile every single time, but it seems like there should be a better way. I've set up the parameters to be ValueProvider objects, but because they're being called outside of pipeline.run, Maven complains at compile time that ValueProvider.get() is being called outside of a runtime context (which, yes, it is).
I've tried using NestedValueProviders as in the Google "Creating Templates" document, but my IDE complains if I try to use NestedValueProvider.of to return a string as shown in the document. The only way I've been able to get NestedValueProviders to compile is as follows:
NestedValueProvider<String, String> pid = NestedValueProvider.of(
    pipelineOptions.getDataflowProjectId(),
    (SerializableFunction<String, String>) s -> s
);
(String pid = NestedValueProvider.of(...) results in the following error: "incompatible types: no instance(s) of type variable(s) T,X exist so that org.apache.beam.sdk.options.ValueProvider.NestedValueProvider conforms to java.lang.String")
I have the following in my pipelineOptions:
ValueProvider<String> getDataflowProjectId();
void setDataflowProjectId(ValueProvider<String> value);
Because of the volume of messages we're going to be processing, adding these checks at the front of the pipeline for every message that comes through isn't really practical; we'll hit daily account administrative limits on some of these calls pretty quickly.
Are templates the right approach for what I want to do? How do I go about actually productionizing this? Should (can?) I compile with maven into a jar, then just run the jar on a local dev/qa/prod box with my parameters and just not bother with ValueProviders at all? Or is it possible to provide a default to a ValueProvider and override it as part of the options passed to the template?
Any advice on how to proceed would be most appreciated. Thanks!
The way templates are currently implemented, there is no point at which you can perform initialization/validation that happens after template creation but before pipeline start.
All of the existing validation executes during template creation. If the validation detects that the values aren't available (because they are ValueProviders), the validation is skipped.
In some cases it is possible to approximate validation by adding runtime checks, either as part of the initial splitting of a custom source or as part of the @Setup method of a DoFn. In the latter case, the @Setup method will run once for each instance of the DoFn that is created. If the pipeline is batch, after 4 failures for a specific instance the pipeline will fail.
Another option for productionizing pipelines is to build the JAR that runs the pipeline, and have a production process that runs that JAR to initiate the pipeline.
Regarding the compile error you received: NestedValueProvider returns a ValueProvider, so it isn't possible to get a String out of it directly. You could, however, put the validation code into the SerializableFunction that is run within the NestedValueProvider.
Although I believe this will currently re-run the validation every time the value is accessed, it wouldn't be unreasonable to have the NestedValueProvider cache the translated value.
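A rough sketch of both suggestions, assuming the ValueProvider<String> option shown in the question; checkSubscriptionExists is a hypothetical helper, not a real API:

// Validation inside the NestedValueProvider's translation function: it runs
// when the value is first read at runtime, never at template creation.
ValueProvider<String> projectId = NestedValueProvider.of(
    pipelineOptions.getDataflowProjectId(),
    (SerializableFunction<String, String>) id -> {
      if (id == null || id.isEmpty()) {
        throw new IllegalArgumentException("dataflowProjectId must be set");
      }
      return id;
    });

// Validation in @Setup: runs once per DoFn instance on the workers, so a
// handful of checks per job rather than one per message.
class SanityCheckFn extends DoFn<PubsubMessage, PubsubMessage> {
  private final ValueProvider<String> subscription;

  SanityCheckFn(ValueProvider<String> subscription) {
    this.subscription = subscription;
  }

  @Setup
  public void setup() {
    // ValueProvider.get() is safe here because @Setup only runs after the
    // template has been launched with concrete values.
    checkSubscriptionExists(subscription.get()); // hypothetical helper
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    c.output(c.element());
  }
}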

Jenkins Dynamic Combination Matrix

I've been using the option Restrict matrix execution to a subset from the Parameterized Trigger Plugin to pass a combination filter to a rather large matrix project where all test execution happens. As the number of tests grows, so does the combination filter (which is built up dynamically), and I seem to have hit a cap. The triggered job gets this error message:
FATAL: Invalid method Code length 69871 in class file Script1
java.lang.ClassFormatError: Invalid method Code length 69871 in class file Script1
After reading about this problem, it seems to be a JVM constraint; the JVM documentation states:
The value of the code_length item must be less than 65536.
I get the impression that this is not something I can (or even should) tinker with in Jenkins.
My second idea for getting around this problem was to create the combination filter and then pass it as a String parameter to the downstream matrix project, then use the Combination Filter option and expand the variable to achieve the same result.
Unfortunately, I get this exception when trying to save my matrix project with a String parameter as the combination filter:
javax.servlet.ServletException: groovy.lang.MissingPropertyException: No such property: $COMBINATION_FILTER for class: groovy.lang.Binding
I guess this is because the variable needs to be available in the configuration when saving but I want to inject it when starting the Matrix Project.
I am running out of ideas to solve this problem. Any ideas?
You could try the Matrix Groovy Execution Strategy, which is like a super combination filter.
If I can quote myself:
A plugin to decide the execution order and valid combinations of matrix projects.
This uses a user-defined Groovy script to arrange the order, which will then be executed.
Disclaimer: I built this plugin

Different coders for the same class in dataflow job

I'm trying to use different coders for the same class for two different scenarios:
Reading from JSON input files - using data = TextIO.Read.from(options.getInput()).withCoder(new Coder1())
Elsewhere in the job I want the class to be persisted using SerializableCoder, via data.setCoder(SerializableCoder.of(MyClass.class))
It works locally, but fails when run in the cloud with
Caused by: java.io.StreamCorruptedException: invalid stream header: 7B227365.
Is this a supported scenario? The reason to do this in the first place is to avoid reading/writing the JSON format, and on the other hand to make reading from input files more efficient (UTF-8 parsing is part of the JSON reader, so it can read from the InputStream directly)
Clarifications:
Coder1 is my coder.
The other coder is a SerializableCoder.of(MyClass.class)
How does the system choose which coder to use? The two formats are binary-incompatible, and it looks like, due to some optimization, the second coder is being used on data in a format that can only be read by the first coder.
Yes, using two different coders like that should work. (With the caveat that the coder in #2 will only be used if the system chooses to persist 'data' instead of optimizing it into surrounding computations.)
Are you using your own Coders or ones provided by the Dataflow SDK? A quick caveat on TextIO: because it uses newlines to encode element boundaries, you'll get into trouble if you use a coder that produces encoded values containing something that can be mistaken for a newline. You really should only use textual encodings within TextIO. We're hoping to make that clearer in the future.
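One way to stay within that caveat, sketched under the assumption that Coder1 exists only to parse the JSON lines (parseJson is a hypothetical helper): let TextIO use its default string coding and do the JSON parsing in a ParDo, so the binary SerializableCoder never touches the text path.

PCollection<MyClass> data = pipeline
    .apply(TextIO.Read.from(options.getInput()))    // plain strings, default StringUtf8Coder
    .apply(ParDo.of(new DoFn<String, MyClass>() {
      @Override
      public void processElement(ProcessContext c) {
        c.output(parseJson(c.element()));           // hypothetical JSON parser
      }
    }))
    .setCoder(SerializableCoder.of(MyClass.class)); // binary coder only downstream of TextIO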
