Dataflow/Beam Templates, Productionization, Initialization, and ValueProviders - google-cloud-dataflow

I have an Apache Beam job running on Google Cloud Dataflow, and as part of its initialization it needs to run some basic sanity/availability checks on services, pub/sub subscriptions, GCS blobs, etc. It's a streaming pipeline intended to run ad infinitum that processes hundreds of thousands of pub/sub messages.
Currently it needs a whole heap of required, variable parameters: which Google Cloud project it needs to run in, which bucket and directory prefix it's going to be storing files in, which pub/sub subscriptions it needs to read from, and so on. It does some work with these parameters before pipeline.run is called - validation, string splitting, and the like. In its current form, in order to start a job we've been passing these parameters to a PipelineOptionsFactory and issuing a new compile every single time, but it seems like there should be a better way. I've set up the parameters to be ValueProvider objects, but because they're being called outside of pipeline.run, Maven complains at compile time that ValueProvider.get() is being called outside of a runtime context (which, yes, it is).
I've tried using NestedValueProviders as in the Google "Creating Templates" document, but my IDE complains if I try to use NestedValueProvider.of to return a string as shown in the document. The only way I've been able to get NestedValueProviders to compile is as follows:
NestedValueProvider<String, String> pid = NestedValueProvider.of(
    pipelineOptions.getDataflowProjectId(),
    (SerializableFunction<String, String>) s -> s
);
(String pid = NestedValueProvider.of(...) results in the following error: "incompatible types: no instance(s) of type variable(s) T,X exist so that org.apache.beam.sdk.options.ValueProvider.NestedValueProvider conforms to java.lang.String")
I have the following in my pipelineOptions:
ValueProvider<String> getDataflowProjectId();
void setDataflowProjectId(ValueProvider<String> value);
Because of the volume of messages we're going to be processing, adding these checks at the front of the pipeline for every message that comes through isn't really practical; we'll hit daily account administrative limits on some of these calls pretty quickly.
Are templates the right approach for what I want to do? How do I go about actually productionizing this? Should (can?) I compile with maven into a jar, then just run the jar on a local dev/qa/prod box with my parameters and just not bother with ValueProviders at all? Or is it possible to provide a default to a ValueProvider and override it as part of the options passed to the template?
Any advice on how to proceed would be most appreciated. Thanks!

The way templates are currently implemented, there is no point at which to perform "post-template-creation" but "pre-pipeline-start" initialization/validation.
All of the existing validation executes during template creation. If the validation detects that the values aren't available (because they are ValueProviders), the validation is skipped.
In some cases it is possible to approximate validation by adding runtime checks, either as part of the initial splitting of a custom source or as part of the @Setup method of a DoFn. In the latter case, the @Setup method will run once for each instance of the DoFn that is created. If the pipeline is a batch pipeline, it will fail after 4 failures for a specific instance.
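As a rough sketch of the @Setup approach (the DoFn name and the subscription check are illustrative, not taken from the original pipeline):

import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.transforms.DoFn;

// Hypothetical DoFn that validates a resource once per DoFn instance,
// before any elements are processed.
class ValidatingDoFn extends DoFn<String, String> {
    private final ValueProvider<String> subscription;

    ValidatingDoFn(ValueProvider<String> subscription) {
        this.subscription = subscription;
    }

    @Setup
    public void setup() {
        // ValueProvider.get() is safe here because @Setup runs on the workers at execution time.
        String sub = subscription.get();
        if (sub == null || sub.isEmpty()) {
            throw new IllegalStateException("Subscription is not configured");
        }
        // ... additionally check that the subscription/bucket/blob actually exists ...
    }

    @ProcessElement
    public void processElement(ProcessContext c) {
        c.output(c.element());
    }
}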
Another option for productionizing pipelines is to build the JAR that runs the pipeline, and have a production process that runs that JAR to initiate the pipeline.
Regarding the compile error you received -- the NestedValueProvider returns a ValueProvider -- it isn't possible to get a String out of that. You could, however, put the validation code into the SerializableFunction that is run within the NestedValueProvider.
Although I believe this will currently re-run the validation every time the value is accessed, it wouldn't be unreasonable to have the NestedValueProvider cache the translated value.
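A minimal sketch of that idea, assuming the getDataflowProjectId() option from the question (the validation itself is illustrative):

import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.options.ValueProvider.NestedValueProvider;
import org.apache.beam.sdk.transforms.SerializableFunction;

// The translation function only runs when get() is called at execution time,
// so the check is deferred until the real value is available.
ValueProvider<String> projectId = NestedValueProvider.of(
    pipelineOptions.getDataflowProjectId(),
    (SerializableFunction<String, String>) id -> {
        if (id == null || id.isEmpty()) {
            throw new IllegalArgumentException("dataflowProjectId must be set");
        }
        return id;
    });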

Related

Skip Jenkins stage if its last execution was successful

I am looking for a way to speed up our build pipeline.
The biggest impact would be to do certain things only if they have not been done.
So basically, I have already parameterized some optional stages, which works fine, but there are some things I'd like to skip if there was a previous execution that was successful.
I have searched the docs, especially the section on when, but there was nothing that did the job.
So the question is: Is there something I can do in a declarative pipeline to always skip a stage, except when
It has never been run successfully before
The last time it ran was not successful (e.g. if I allow its execution to be forced by passing a parameter)
I thought about using the build number (e.g. only run it on the first execution of the pipeline), but this doesn't cut it for two reasons:
I'm using milestones to prevent multiple executions of the pipe for a given branch/PR
If the first run fails, it would never be tried again
Oh, and I also thought about putting all of the logic in post -> regression and forcing the stage to fail on the first pipeline run, but this doesn't seem to be a good idea either.
One option, although not ideal at large scale, is to store the different states as global environment variables (Manage Jenkins -> Configure System -> Global properties -> Environment variables).
These parameters are available to all jobs and can store values in a 'server-wide' scope.
You can then set or update them from code using the following Groovy method:
@NonCPS
def updateGlobalEnvVariable(String name, String value) {
    def globalNodeProperties = Jenkins.getInstance().getGlobalNodeProperties()
    def envVarsNodePropertyList = globalNodeProperties.getAll(hudson.slaves.EnvironmentVariablesNodeProperty.class)
    if (envVarsNodePropertyList == null || envVarsNodePropertyList.size() == 0) {
        // No global environment variables defined yet - add the property first
        globalNodeProperties.add(new hudson.slaves.EnvironmentVariablesNodeProperty())
        // Re-read the list so the newly added property is picked up below
        envVarsNodePropertyList = globalNodeProperties.getAll(hudson.slaves.EnvironmentVariablesNodeProperty.class)
    }
    // Create the variable if it does not exist, or update its value if it does
    envVarsNodePropertyList.get(0).getEnvVars().put(name, value)
}
This function will create the parameter if it does not exist, or update its value if it already exists. This function is also best placed in a Shared Library, from which it will be available to all pipelines.
A nice advantage of this technique is that you can always 'reset' the different stages from the configuration page. However, if you need to store state for many stages, it can clutter the configuration page with a bit too much information.
Maybe you could use a custom shared workspace and create/store some state file (e.g. lastexecfailed.state) that can be used by your next executions, then try to locate the file in the shared workspace at the beginning of each execution.

How to get taskid that "waitForQualityGate()" is going to check for?

I apologize for duplicating my post from sonarsource, but sometimes this gets a different audience.
We’re using SonarQube 7.9.2.
Our Jenkins builds use the pipeline steps “withSonarQubeEnv” and “waitForQualityGate”, and in between we use “mvn sonar:sonar” to run the scan. At the end of the latter, it prints the task id it’s going to be waiting for in “waitForQualityGate”. It also shows that task id in the results of that step.
What WebApi call(s) can I perform in between “mvn sonar:sonar” and “waitForQualityGate” that will let me store into a variable the task id that is going to be polled for? I know the project key at that point. I’ve inspected all of the environment variables in scope at that point.
I know how to find the WebApi documentation, and I’ve scanned through what I think are the relevant operations, but I can’t figure out which operation I need for this particular “task”.
When you run mvn sonar:sonar, a report-task.txt file is created in the .scannerwork folder of the workspace.
You will get the ceTaskUrl and ceTaskId in report-task.txt. Now, you can use that ceTaskUrl to get the analysisId.
You can use the API below to get the quality gate status from the analysisId:
https://localhost:9000/sonarqube/api/qualitygates/project_status?analysisId=$ANALYSIS_ID
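As a rough sketch of that flow in Java (this assumes report-task.txt also contains a serverUrl key and that the server allows anonymous access; otherwise add an authentication header):

import java.io.FileReader;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QualityGateCheck {
    public static void main(String[] args) throws Exception {
        // report-task.txt is a plain key=value file written by the scanner.
        Properties reportTask = new Properties();
        try (FileReader reader = new FileReader(".scannerwork/report-task.txt")) {
            reportTask.load(reader);
        }
        String ceTaskUrl = reportTask.getProperty("ceTaskUrl");

        // The CE task response contains the analysisId once the background task has finished.
        HttpClient client = HttpClient.newHttpClient();
        String taskJson = client.send(
                HttpRequest.newBuilder(URI.create(ceTaskUrl)).GET().build(),
                HttpResponse.BodyHandlers.ofString()).body();
        Matcher m = Pattern.compile("\"analysisId\"\\s*:\\s*\"([^\"]+)\"").matcher(taskJson);
        if (!m.find()) {
            throw new IllegalStateException("Analysis not finished yet: " + taskJson);
        }
        String analysisId = m.group(1);

        // Ask for the quality gate status of that analysis.
        String gateUrl = reportTask.getProperty("serverUrl")
                + "/api/qualitygates/project_status?analysisId=" + analysisId;
        String gateJson = client.send(
                HttpRequest.newBuilder(URI.create(gateUrl)).GET().build(),
                HttpResponse.BodyHandlers.ofString()).body();
        System.out.println(gateJson);
    }
}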

Dataflow: Consuming runtime parameters in template

Trying to create a template for a Dataflow job.
Is there any way to generate a template with runtime parameters?
So far the template only uses whatever parameter values were supplied at the time the template was created; when I try passing different values for those variables at runtime, it does not pick up the runtime values.
If any additional details are needed, will provide the same.
You can use value providers in your pipeline options to have runtime arguments in a pipeline.
But be aware that the places where you can use these parameters are limited (mostly inside a DoFn).
This behaviour is expected from a Dataflow template, as it is a representation of the pipeline rather than of the code itself.
Please bear in mind that you cannot create a Dataflow template with dynamic processing steps based on the values passed.
The steps are hard-coded into the template and cannot be changed unless the code to generate the template is executed again.
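For example, a sketch of reading a runtime parameter inside a DoFn (the option and DoFn names are illustrative):

import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.transforms.DoFn;

// Hypothetical DoFn that prefixes each element with a runtime-supplied value.
class PrefixDoFn extends DoFn<String, String> {
    private final ValueProvider<String> prefix;

    PrefixDoFn(ValueProvider<String> prefix) {
        this.prefix = prefix;
    }

    @ProcessElement
    public void processElement(ProcessContext c) {
        // get() is only called at execution time, so the value passed when the
        // template is launched is the one that is used.
        c.output(prefix.get() + c.element());
    }
}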
A parameter needs to be wrapped inside of a ValueProvider object in order for the template pipeline to access the runtime value of that parameter. All of the example templates provided here demonstrate how ValueProvider can be used to parameterize a template pipeline.
Take a look at the WordCount pipeline as an example.
As you can see, the pipeline uses a ValueProvider (instead of a simple String) to read the path of the file on which a WordCount needs to be executed:
@Description("Path of the file to read from")
ValueProvider<String> getInputFile();
void setInputFile(ValueProvider<String> value);
Since the value of inputFile is unknown until runtime (when the template is actually executed with valid inputs), the transform using the ValueProvider will defer reading the value of the parameter until runtime (e.g. inside a DoFn).
The native TextIO.Read Beam transform supports reading from a ValueProvider in addition to reading from a plain String.
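A minimal sketch of wiring that option into the read (WordCountOptions here stands for whichever options interface declares the getInputFile() shown above):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

// WordCountOptions is assumed to be the options interface containing getInputFile().
WordCountOptions options =
    PipelineOptionsFactory.fromArgs(args).withValidation().as(WordCountOptions.class);
Pipeline p = Pipeline.create(options);

// The path is a ValueProvider, so it is resolved only when the template is launched.
p.apply("ReadLines", TextIO.read().from(options.getInputFile()));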

Perform action after Dataflow pipeline has processed all data

Is it possible to perform an action once a batch Dataflow job has finished processing all data? Specifically, I'd like to move the text file that the pipeline just processed to a different GCS bucket. I'm not sure where to place that in my pipeline to ensure it executes once after the data processing has completed.
I don't see why you need to do this post pipeline execution. You could use side outputs to write the file to multiple buckets, and save yourself the copy after the pipeline finishes.
If that's not going to work for you (for whatever reason), then you can simply run your pipeline in blocking execution mode i.e. use pipeline.run().waitUntilFinish(), and then just write the rest of your code (which does the copy) after that.
[..]
// do some stuff before the pipeline runs
Pipeline pipeline = ...
pipeline.run().waitUntilFinish();
// do something after the pipeline finishes here
[..]
A little trick I got from reading the source code of Apache Beam's PassThroughThenCleanup.java.
Right after your reader, create a side input that 'combines' the entire collection (in the source code, it is the View.asIterable() PTransform) and connect its output to a DoFn. This DoFn will be called only after the reader has finished reading ALL elements.
P.S. The code literally names the operation cleanupSignalView, which I found really clever.
Note that you can achieve the same effect using Combine.globally() (Java) or beam.CombineGlobally() (Python). For more info check out section 4.2.4.3 here
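A rough sketch of that pattern (the paths and names are illustrative; the actual file move is left as a comment):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;

Pipeline p = Pipeline.create();
PCollection<String> lines = p.apply(TextIO.read().from("gs://my-bucket/input.txt"));

// Materialize the whole collection as a side input; it only becomes available
// once the upstream read/processing has produced all of its elements.
final PCollectionView<Iterable<String>> cleanupSignal = lines.apply(View.asIterable());

// A single-element "trigger" collection drives the cleanup step.
p.apply(Create.of("cleanup"))
 .apply(ParDo.of(new DoFn<String, Void>() {
     @ProcessElement
     public void processElement(ProcessContext c) {
         c.sideInput(cleanupSignal); // not ready until upstream has finished
         // ... move/copy the processed file to another bucket here ...
     }
 }).withSideInputs(cleanupSignal));

p.run();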
I think two options can help you here:
1) Use TextIO to write to the bucket or folder you want, specifying the exact GCS path (e.g. gs://sandbox/other-bucket)
2) Use Object Change Notifications in combination with Cloud Functions. You can find a good primer on doing this here and the SDK for GCS in JS here. What you will do in this option is basically setting up a trigger when something drops in a certain bucket, and move it to another one using your self-written Cloud Function.

Can a workflow step access environment variables provided by an EnvironmentContributingAction?

A custom plugin we wrote for an older version of Jenkins uses an EnvironmentContributingAction to provide environment variables to the execution so they could be used in future build steps and passed as parameters to downstream jobs.
While attempting to convert our build to workflow, I'm having trouble accessing these variables:
node {
    // this step queries an API and puts the results in
    // environment variables called FE1|BE1_INTERNAL_ADDRESS
    step([$class: 'SomeClass', parameter: foo])

    // this ends up echoing 'null and null'
    echo "${env.FE1_INTERNAL_ADDRESS} and ${env.BE1_INTERNAL_ADDRESS}"
}
Is there a way to access the environment variable that was injected? Do I have to convert this functionality to a build wrapper instead?
EnvironmentContributingAction is currently limited to AbstractBuilds, which WorkflowRuns are not, so pending JENKINS-29537 which I just filed, your plugin would need to be modified somehow. Options include:
Have the builder add a plain Action instead, then register an EnvironmentContributor whose buildEnvironmentFor(Run, …) checks for its presence using Run.getAction(Class) (see the sketch after this list).
Switch to a SimpleBuildWrapper which defines the environment variables within a scope, then invoke it from Workflow using the wrap step.
Depend on workflow-step-api and define a custom Workflow Step with comparable functionality but directly returning a List<String> or whatever makes sense in your context. (code sample)
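A rough sketch of the first option (the action and variable names are illustrative, not an existing API of your plugin):

import hudson.EnvVars;
import hudson.Extension;
import hudson.model.EnvironmentContributor;
import hudson.model.InvisibleAction;
import hudson.model.Run;
import hudson.model.TaskListener;

// Plain action the builder attaches to the run to carry the discovered addresses.
class AddressesAction extends InvisibleAction {
    final String fe1Address;
    final String be1Address;

    AddressesAction(String fe1Address, String be1Address) {
        this.fe1Address = fe1Address;
        this.be1Address = be1Address;
    }
}

// Contributes environment variables for any Run (including a WorkflowRun)
// that carries the action above.
@Extension
public class AddressesEnvironmentContributor extends EnvironmentContributor {
    @Override
    public void buildEnvironmentFor(Run r, EnvVars envs, TaskListener listener) {
        AddressesAction action = r.getAction(AddressesAction.class);
        if (action != null) {
            envs.put("FE1_INTERNAL_ADDRESS", action.fe1Address);
            envs.put("BE1_INTERNAL_ADDRESS", action.be1Address);
        }
    }
}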
Since PR-2975 has been merged, you are able to use the new interface:
void buildEnvVars(@Nonnull Run<?, ?> run, @Nonnull EnvVars env, @CheckForNull Node node)
It will be used by old type of builds as well.
