Dataflow: Consuming runtime parameters in template

I am trying to create a template for a Dataflow job.
Is there any way to generate a template with runtime parameters?
So far the template only uses whatever parameter values were supplied at the time the template was created; when I try passing different values for those variables at launch time, it does not pick up the runtime values.
If any additional details are needed, I will provide them.

You can use value providers in your pipeline options to pass runtime arguments to a pipeline.
However, the places where these parameters can be consumed are limited (mostly inside a DoFn).
This behaviour is expected from a Dataflow template, because the template is a representation of the pipeline graph rather than of the code itself.
Bear in mind that you cannot create a Dataflow template whose processing steps change dynamically based on the values passed in.
The steps are hard-coded into the template and cannot be changed unless the code that generates the template is executed again.

A parameter needs to be wrapped inside of a ValueProvider object in order for the template pipeline to access the runtime value of that parameter. All of the example templates provided here demonstrate how ValueProvider can be used to parameterize a template pipeline.
Take a look at the WordCount pipeline as an example.
As you can see, the pipeline uses a ValueProvider (instead of a simple String) to read the path of the file on which a WordCount needs to be executed:
#Description("Path of the file to read from")
ValueProvider<String> getInputFile();
void setInputFile(ValueProvider<String> value);
Since the value of inputFile is unknown until runtime (when the template is actually executed with valid inputs), the transform using the ValueProvider defers reading the value of the parameter until runtime (e.g. inside a DoFn).
The native TextIO.Read Beam transform provides support for reading from a ValueProvider in addition to reading from a plain String.
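For illustration, a rough sketch of how that ValueProvider flows straight into TextIO without ever calling get() at construction time (assuming an options interface, here called WordCountOptions, that declares getInputFile() as above):
WordCountOptions options =
    PipelineOptionsFactory.fromArgs(args).withValidation().as(WordCountOptions.class);
Pipeline p = Pipeline.create(options);
// TextIO accepts the ValueProvider directly; the actual path is resolved when the template runs.
p.apply("ReadLines", TextIO.read().from(options.getInputFile()));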

Related

setting sdk_container_image in flex template

In order to provide hermetic builds and runtime, we currently build custom flex template and worker images. As part of deployment, we build the flex template and we can specify the custom flex template image but not the custom worker image - we have to pass it to Dataflow separately when invoking the flex template. This creates hidden dependencies that we need to track as part of the release process elsewhere (and seems to go against the design goal of flex templates being self-contained). Is there a way to bake at least a default for the worker image (sdk_container_image) into the template?
In general, if you want to have a default for a template parameter, you can just use the @Default annotation on it and change the default value for every template version you build. In this particular case, though, the sdkContainerImage parameter is declared in DataflowPipelineOptions in the Java SDK (or WorkerOptions in the Python SDK), so you can't control its default from user code; I'd try to set the parameter value programmatically in the template code instead.
I think something like this should work, but I haven't tested it:
DataflowPipelineOptions options =
    PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);

if (options.getSdkContainerImage() == null || options.getSdkContainerImage().isEmpty()) {
  // Set the default if not already set by the template runner.
  options.setSdkContainerImage("...");
}

Pipeline pipeline = Pipeline.create(options);
// ...
This is for Java, but you can do a similar thing with Python SDK.
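For completeness, a sketch of the @Default approach from the first paragraph, which only works for parameters you declare in your own options interface (the names below are made up):
public interface MyTemplateOptions extends DataflowPipelineOptions {
  @Description("Tag of the worker image this template version should default to")
  @Default.String("gcr.io/my-project/my-worker:1.2.3") // hypothetical image reference
  String getWorkerImageTag();
  void setWorkerImageTag(String value);
}
Every time you rebuild the template you can bump the @Default.String value, and anyone launching the template can still override it by passing the parameter explicitly.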

Apache Beam and avro : Create a dataflow pipeline without schema

I am building a Dataflow pipeline with Apache Beam. Below is the pseudo code:
PCollection<GenericRecord> rows = pipeline
    .apply("Read Json from PubSub", <some reader>)
    .apply("Convert Json to pojo", ParDo.of(new JsonToPojo()))
    .apply("Convert pojo to GenericRecord", ParDo.of(new PojoToGenericRecord()))
    .setCoder(AvroCoder.of(GenericRecord.class, schema));
I am trying to get rid of setting the coder in the pipeline, as the schema won't be known at pipeline creation time (it will be present in the message).
I commented out the line that sets the coder and got an exception saying that a default coder is not configured. When I used the one-argument version of the of method instead, I got the following exception:
Not a Specific class: interface org.apache.avro.generic.GenericRecord
at org.apache.avro.specific.SpecificData.createSchema(SpecificData.java:285)
at org.apache.avro.reflect.ReflectData.createSchema(ReflectData.java:594)
at org.apache.avro.specific.SpecificData$2.load(SpecificData.java:218)
at org.apache.avro.specific.SpecificData$2.load(SpecificData.java:215)
at avro.shaded.com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3568)
at avro.shaded.com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2350)
at avro.shaded.com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2313)
at avro.shaded.com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2228)
... 9 more
Is there any way for us to supply the coder at runtime, without knowing the schema beforehand?
This is possible. I recommend the following approach:
Do not use an intermediate collection of type GenericRecord. Keep it as a collection of your POJOs.
Write some transform that extracts the schema of your data and makes it available as a PCollectionView<however you want to represent the schema> (a sketch follows after this list).
When writing to BigQuery, write your PCollection<YourPojo> via write().to(DynamicDestinations), and when writing to Avro, use FileIO.write() or writeDynamic() in combination with AvroIO.sinkViaGenericRecords(). Both of these can take a dynamically computed schema from a side input (that you computed above).
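As a rough, untested sketch of the second step (the POJO type and the way it exposes its schema are assumptions), the schema could be pulled from one element and turned into a singleton side input:
// Extract the Avro schema as a JSON string (which is serializable) from one element
// and expose it to later transforms as a side input.
PCollectionView<String> schemaView = pojos
    .apply("ExtractSchema", ParDo.of(new DoFn<MyPojo, String>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        c.output(c.element().getAvroSchema().toString()); // assumes the POJO can report its schema
      }
    }))
    .apply(Sample.any(1)) // one representative schema is enough
    .apply(View.asSingleton());
The writing transforms in the third step can then read schemaView as a side input when computing their dynamic destinations.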

Dataflow/Beam Templates, Productionization, Initialization, and ValueProviders

I have an Apache Beam job running on Google Cloud Dataflow, and as part of its initialization it needs to run some basic sanity/availability checks on services, pub/sub subscriptions, GCS blobs, etc. It's a streaming pipeline intended to run ad infinitum that processes hundreds of thousands of pub/sub messages.
Currently it needs a whole heap of required, variable parameters: which Google Cloud project it needs to run in, which bucket and directory prefix it's going to be storing files in, which pub/sub subscriptions it needs to read from, and so on. It does some work with these parameters before pipeline.run is called - validation, string splitting, and the like. In its current form, in order to start a job we've been passing these parameters to a PipelineOptionsFactory and issuing a new compile every single time, but it seems like there should be a better way. I've set up the parameters to be ValueProvider objects, but because they're being used outside of pipeline.run, Maven complains at compile time that ValueProvider.get() is being called outside of a runtime context (which, yes, it is).
I've tried using NestedValueProviders as in the Google "Creating Templates" document, but my IDE complains if I try to use NestedValueProvider.of to return a string as shown in the document. The only way I've been able to get NestedValueProviders to compile is as follows:
NestedValueProvider<String, String> pid = NestedValueProvider.of(
    pipelineOptions.getDataflowProjectId(),
    (SerializableFunction<String, String>) s -> s
);
(String pid = NestedValueProvider.of(...) results in the following error: "incompatible types: no instance(s) of type variable(s) T,X exist so that org.apache.beam.sdk.options.ValueProvider.NestedValueProvider conforms to java.lang.String")
I have the following in my pipelineOptions:
ValueProvider<String> getDataflowProjectId();
void setDataflowProjectId(ValueProvider<String> value);
Because of the volume of messages we're going to be processing, adding these checks at the front of the pipeline for every message that comes through isn't really practical; we'll hit daily account administrative limits on some of these calls pretty quickly.
Are templates the right approach for what I want to do? How do I go about actually productionizing this? Should (can?) I compile with Maven into a JAR, then just run the JAR on a local dev/qa/prod box with my parameters and not bother with ValueProviders at all? Or is it possible to provide a default to a ValueProvider and override it as part of the options passed to the template?
Any advice on how to proceed would be most appreciated. Thanks!
The way templates are currently implemented, there is no point at which you can perform "post-template-creation" but "pre-pipeline-start" initialization or validation.
All of the existing validation executes during template creation. If the validation detects that the values aren't available (because they are ValueProviders), that validation is skipped.
In some cases it is possible to approximate validation by adding runtime checks, either as part of the initial splitting of a custom source or as part of the @Setup method of a DoFn. In the latter case, the @Setup method will run once for each instance of the DoFn that is created. If the pipeline is batch, the pipeline will fail after 4 failures for a specific instance.
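A rough sketch of the @Setup approach (the DoFn and the check are hypothetical; it assumes the subscription name is handed to the DoFn as a ValueProvider):
static class ValidatingDoFn extends DoFn<PubsubMessage, PubsubMessage> {
  private final ValueProvider<String> subscription;

  ValidatingDoFn(ValueProvider<String> subscription) {
    this.subscription = subscription;
  }

  @Setup
  public void setup() {
    // Runs once per DoFn instance at execution time, so ValueProvider.get() is legal here.
    String sub = subscription.get();
    if (sub == null || sub.isEmpty()) {
      throw new IllegalArgumentException("A Pub/Sub subscription must be supplied at runtime");
    }
    // ...further availability checks against the subscription could go here...
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    c.output(c.element());
  }
}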
Another option for productionizing pipelines is to build the JAR that runs the pipeline, and have a production process that runs that JAR to initiate the pipeline.
Regarding the compile error you received -- the NestedValueProvider returns a ValueProvider -- it isn't possible to get a String out of that. You could, however, put the validation code into the SerializableFunction that is run within the NestedValueProvider.
Although I believe this will currently re-run the validation every time the value is accessed, it wouldn't be unreasonable to have the NestedValueProvider cache the translated value.
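For example, the snippet from the question could be adapted so that the validation lives inside the SerializableFunction (the check itself is just illustrative):
NestedValueProvider<String, String> pid = NestedValueProvider.of(
    pipelineOptions.getDataflowProjectId(),
    (SerializableFunction<String, String>) projectId -> {
      // Runs when the value is first accessed at execution time.
      if (projectId == null || projectId.isEmpty()) {
        throw new IllegalArgumentException("dataflowProjectId must be set when launching the template");
      }
      return projectId;
    });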

How to pass values for options arguments when running a Dataflow app in Eclipse

I am trying to test my first Dataflow app by running it in Eclipse.
When I try to pass 4 values for the arguments on the "Arguments" tab of the Run Configuration, as follows:
projects/poc/subscriptions/poc-TestApp1 poc myDataSet my_logs
I get the error:
Argument 'projects/poc/subscriptions/poc-TestApp1' does not begin
with '--'
adding -- to all arguments produced a different error.
Based on your question, it seems that you have custom argument parsing code in your program (I suppose you're extracting your arguments as args[0], args[1] etc. in your main() function?), but still use PipelineOptionsFactory.fromArgs(args) to configure the options for Dataflow itself.
Dataflow does not support this mixed way of specifying command-line arguments - you need to define your own PipelineOptions to represent your configuration parameters, and specify them prefixed with --.
Please see here for details, in particular here for creating custom options.
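As a rough sketch of what that could look like for the values in the question (the option names below are made up):
public interface MyOptions extends DataflowPipelineOptions {
  @Description("Pub/Sub subscription to read from")
  String getSubscription();
  void setSubscription(String value);

  @Description("BigQuery dataset")
  String getDataset();
  void setDataset(String value);

  @Description("BigQuery table for logs")
  String getLogsTable();
  void setLogsTable(String value);
}

// In main():
MyOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(MyOptions.class);
The Eclipse run configuration arguments would then look something like --subscription=projects/poc/subscriptions/poc-TestApp1 --dataset=myDataSet --logsTable=my_logs, and the poc value might simply map to the built-in --project option that DataflowPipelineOptions already provides.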

Can a workflow step access environment variables provided by an EnvironmentContributingAction?

A custom plugin we wrote for an older version of Jenkins uses an EnvironmentContributingAction to provide environment variables to the execution so they could be used in future build steps and passed as parameters to downstream jobs.
While attempting to convert our build to workflow, I'm having trouble accessing these variables:
node {
    // this step queries an API and puts the results in
    // environment variables called FE1|BE1_INTERNAL_ADDRESS
    step([$class: 'SomeClass', parameter: foo])

    // this ends up echoing 'null and null'
    echo "${env.FE1_INTERNAL_ADDRESS} and ${env.BE1_INTERNAL_ADDRESS}"
}
Is there a way to access the environment variable that was injected? Do I have to convert this functionality to a build wrapper instead?
EnvironmentContributingAction is currently limited to AbstractBuilds, which WorkflowRuns are not, so pending JENKINS-29537 which I just filed, your plugin would need to be modified somehow. Options include:
Have the builder add a plain Action instead, then register an EnvironmentContributor whose buildEnvironmentFor(Run, …) checks for its presence using Run.getAction(Class).
Switch to a SimpleBuildWrapper which defines the environment variables within a scope, then invoke it from Workflow using the wrap step.
Depend on workflow-step-api and define a custom Workflow Step with comparable functionality but directly returning a List<String> or whatever makes sense in your context. (code sample)
Since PR-2975 has been merged, you can use the new method:
void buildEnvVars(@Nonnull Run<?, ?> run, @Nonnull EnvVars env, @CheckForNull Node node)
It will be used by the old type of builds as well.
