Perform action after Dataflow pipeline has processed all data - google-cloud-dataflow

Is it possible to perform an action once a batch Dataflow job has finished processing all data? Specifically, I'd like to move the text file that the pipeline just processed to a different GCS bucket. I'm not sure where to place that in my pipeline to ensure it executes once after the data processing has completed.

I don't see why you need to do this post pipeline execution. You could use side outputs to write the file to multiple buckets, and save yourself the copy after the pipeline finishes.
If that's not going to work for you (for whatever reason), then you can simply run your pipeline in blocking execution mode i.e. use pipeline.run().waitUntilFinish(), and then just write the rest of your code (which does the copy) after that.
[..]
// do some stuff before the pipeline runs
Pipeline pipeline = ...
pipeline.run().waitUntilFinish();
// do something after the pipeline finishes here
[..]

A little trick I got from reading the source code of apache beam's PassThroughThenCleanup.java.
Right after your reader, create a side input that 'combine' the entire collection (in the source code, it is the View.asIterable() PTransform) and connect its output to a DoFn. This DoFn will be called only after the reader has finished reading ALL elements.
P.S. The code literally name the operation, cleanupSignalView which I found really clever
Note that you can achieve the same effect using Combine.globally() (java) or beam.CombineGlobally() (python). For more info check out section 4.2.4.3 here

I think two options can help you here:
1) Use TextIO to write to the bucket or folder you want, specifying the exact GCS path (for e.g. gs://sandbox/other-bucket)
2) Use Object Change Notifications in combination with Cloud Functions. You can find a good primer on doing this here and the SDK for GCS in JS here. What you will do in this option is basically setting up a trigger when something drops in a certain bucket, and move it to another one using your self-written Cloud Function.

Related

what is the entry point of the workers in apache beam? (what methods are called?)

Apache beam has a ton of great documentation but I don't see the code that is run to create a pipeline vs. what code the worker runs. I guess I see this code will be run once but will it also be run by every worker that starts up..
public static void main(String[] args) {
// Create the pipeline.
PipelineOptions options =
PipelineOptionsFactory.fromArgs(args).create();
Pipeline p = Pipeline.create(options);
// Create the PCollection 'lines' by applying a 'Read' transform.
PCollection<String> lines = p.apply(
"ReadMyFile", TextIO.read().from("gs://some/inputData.txt"));
}
This is a great question and is at the heart of what Apache Beam is.
tl;dr There is no user-defined entry-point that is called when a worker starts up.
Long answer
When you code using the Apache Beam SDK (using applies, etc.) what you are really doing is creating a graph in the background containing all the applied transformations, see the documentation here. Thus, once p.run() is called, the graph is sent to the worker to be executed. The transformations on the graph are then broken down into component pieces and executed in order.
As for the code you wrote in your question, that will only be run once. That code is run exactly once when you execute the jar. The transformations in the graph, however, are run for each element in your data (or more, or less depending on your graph).
If you are curious about the implementation of how your transforms and user-defined functions (ParDos) are executed, then the entry-point lives in the Apache Beam SDK here.

is there any way I can avoid reading old files from old folder with Apache Beam's TextIo watchForNewFiles(Duration, condition)?

Use Case: During dataflow job start up we should provide initial file name to read data and later on it should watch for new files in that directory and it should consider all remaining old files as already read.
Issues:
Approach 1:
PCollection<String> readfile = pipeline.apply(TextIO.read().from("gs://folder-Name/*").
watchForNewFiles(Duration.standardSeconds(10),
Watch.Growth.afterTimeSinceNewOutput(Duration.standardSeconds(30))));
If we are using like this its considering old files as new files for this dataflow job and reading all those files in that folder
Approach 2:
PCollection<String> readfile = pipeline.apply(TextIO.read().from("gs://folder-Name/file-name").
watchForNewFiles(Duration.standardSeconds(10),
Watch.Growth.afterTimeSinceNewOutput(Duration.standardSeconds(30))));
Its reading only this particular file and not able to read upcoming new files
can anyone please suggest the approach to achieve my use case?
The watchForNewFiles() function will always read all files matching the filepattern, both existing and new. In your second approach, the file pattern is only one file, so you just get that.
However, you can use the lower-level building block transforms in FileIO to accomplish what you need. The following code will just read files written after the pipeline starts:
PCollection<String> lines = p
.apply(FileIO.match().filepattern("gs://folder-Name/*")
.continuously(Duration.standardSeconds(30), afterTimeSinceNewOutput(Duration.standardHours(1)))
.setCoder(MetadataCoderV2.of())
.apply(Filter.by(metadata -> metadata.lastModifiedMillis() > PIPELINE_START))
.apply(FileIO.readMatches())
.apply(apply(TextIO.readFiles()))
You can change the details of the Filter transform to whatever precise condition you need. To also include specific older files, you can read those with a standard TextIO.read().from(...) and then use Flatten to combine that PCollection with the continuous set. Like this:
PCollection allLines =
PCollectionList.of(lines).and(p.apply(TextIO.read().from("gs://folder-Name/file-name)
.apply(Flatten.pCollections())
Maybe you need to clarify your Use Case, do you provide a file name to read ? or a file pattern ? What is the number of files expected ? Should you really use a Dataflow streaming pipeline ? Doesn't a Cloud Function answer your need ? What is your issue ? Files get read again when you restart your pipeline ?
You can, as suggested by danielm use FileIO to fetch and filter on file metadata in order to know which file was added after the pipeline began.
If you provide a file pattern, then all file will be read once by the pipeline. There's no way to keep a State between pipelines if you not code it yourself, so when you restart the pipeline you will read again all the file matching the pattern.
If you want to avoid that, you can manually move old files to another path between stopping the old pipeline and starting a new one.
You could also consider is consuming GCS notification on file creation with PubsubIO and use this event to know which file to treat in your pipeline.
A good practice though is to have multiple folders that reflects the status of the files:
input
processing
failed
succeed
This way you know the state of each file. You can put files to treat in the input folder, and inside your pipeline move the file to its corresponding state folder.

Automated ansible output parsing in jenkins pipeline

How do you automatically parse ansible warnings and errors in your jenkins pipeline jobs?
I greatly enjoy the power of leveraging in ansible in jenkins when it works. Upon a failure, the hunt to locate the actual error can be challenging.
I use WarningsNG which supports custom parsers (and allows their programmatic generation)
Do you know of any plugins or addons that already transform these logs into the kind charts similar to WarningsNG?
I figured I'd ask as I go off into deep regex land and make my own.
One good way to achieve this seems to be the following:
select an existing structured output ansible callback plugin (json, junit and yaml are all viable) . I selected junit as I can play with the format to get a really nice view into the playbook with errors reported in a very obvious way.
fork that GPL file (yes, so be careful with that license) to augment with the following:
store output as file
implement the missing callback methods (the three mentioned above do not implement the v2...item callbacks.
forward events to the default or debug callback to ensure operators see something when they execute the plan
add a secrets cleaner - if you use jenkins credentials-binding-plugin it will hide secrets from the console, it will not not hide secrets within stored files. You'll need to handle that in your playbook or via some groovy code (if groovy, try{...} finally { clean } seems a good pattern)
Snippet - forewarding to default callback
from ansible.plugins.callback.default import CallbackModule as CallbackModule_default
...
class CallbackModule(CallbackBase):
CALLBACK_VERSION = 2.0
CALLBACK_TYPE = 'stdout'
CALLBACK_NAME = 'json'
def __init__(self, display=None):
super(CallbackModule, self).__init__(display)
self.default_callback = CallbackModule_default()
...
def v2_on_file_diff(self, result):
self.default_callback.v2_on_file_diff(result)
... do whatever you'd want to ensure the content appears in the json file

Dataflow/Beam Templates, Productionization, Initialization, and ValueProviders

I have an Apache Beam job running on Google Cloud Dataflow, and as part of its initialization it needs to run some basic sanity/availability checks on services, pub/sub subscriptions, GCS blobs, etc. It's a streaming pipeline intended to run ad infinitum that processes hundreds of thousands of pub/sub messages.
Currently it needs a whole heap of required, variable parameters: which Google Cloud project it needs to run in, which bucket and directory prefix it's going to be storing files in, which pub/sub subscriptions it needs to read from, and so on. It does some work with these parameters before pipeline.run is called - validation, string splitting, and the like. In its current form in order to start a job we've been passing these parameters to to a PipelineOptionsFactory and issuing a new compile every single time, but it seems like there should be a better way. I've set up the parameters to be ValueProvider objects, but because they're being called outside of pipeline.run, Maven complains at compile time that ValueProvider.get() is being called outside of a runtime context (which, yes, it is.)
I've tried using NestedValueProviders as in the Google "Creating Templates" document, but my IDE complains if I try to use NestedValueProvider.of to return a string as shown in the document. The only way I've been able to get NestedValueProviders to compile is as follows:
NestedValueProvider<String, String> pid = NestedValueProvider.of(
pipelineOptions.getDataflowProjectId(),
(SerializableFunction<String, String>) s -> s
);
(String pid = NestedValueProvider.of(...) results in the following error: "incompatible types: no instance(s) of type variable(s) T,X exist so that org.apache.beam.sdk.options.ValueProvider.NestedValueProvider conforms to java.lang.String")
I have the following in my pipelineOptions:
ValueProvider<String> getDataflowProjectId();
void setDataflowProjectId(ValueProvider<String> value);
Because of the volume of messages we're going to be processing, adding these checks at the front of the pipeline for every message that comes through isn't really practical; we'll hit daily account administrative limits on some of these calls pretty quickly.
Are templates the right approach for what I want to do? How do I go about actually productionizing this? Should (can?) I compile with maven into a jar, then just run the jar on a local dev/qa/prod box with my parameters and just not bother with ValueProviders at all? Or is it possible to provide a default to a ValueProvider and override it as part of the options passed to the template?
Any advice on how to proceed would be most appreciated. Thanks!
The way templates are currently implemented there is no point to perform "post-template creation" but "pre-pipeline start" initialization/validation.
All of the existing validation executes during template creation. If the validation detects that there the values aren't available (due to being a ValueProvider) the validation is skipped.
In some cases it is possible to approximate validation by adding runtime checks either as part of initial splitting of a custom source or part of the #Setup method of a DoFn. In the latter case, the #Setup method will run once for each instance of the DoFn that is created. If the pipeline is Batch, after 4 failures for a specific instance it will fail the pipeline.
Another option for productionizing pipelines is to build the JAR that runs the pipeline, and have a production process that runs that JAR to initiate the pipeline.
Regarding the compile error you received -- the NestedValueProvider returns a ValueProvider -- it isn't possible to get a String out of that. You could, however, put the validation code into the SerializableFunction that is run within the NestedValueProvider.
Although I believe this will currently re-run the validation everytime the value is accessed, it wouldn't be unreasonable to have the NestedValueProvider cache the translated value.

No suitable ClassLoader found for grab when using #GrabConfig

I'm attempting to write a global function script that uses groovy.sql.SQL.
When adding the annotation #GrabConfig(systemClassLoader=true) I get an exception when using the global function in Jenkinsfile.
Here is the exception:
hudson.remoting.ProxyException: org.codehaus.groovy.control.MultipleCompilationErrorsException: startup failed:
General error during conversion: No suitable ClassLoader found for grab
Here is my code:
#GrabResolver(name='nexus', root='http://internal.repo.com')
#GrabConfig(systemClassLoader=true)
#Grab('com.microsoft.sqlserver:sqljdbc4:4.0')
import groovy.sql.Sql
import com.microsoft.sqlserver.jdbc.SQLServerDriver
def call(name) {
echo "Hello world, ${name}"
Sql.newInstance("jdbc:sqlserver://ipaddress/dbname", "username","password", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
// sql.execute "select count(*) from TableName"
}
Ensure that the "Use groovy sandbox" checkbox is unticked (it's below the pipeline script text box).
As explained here, Pipeline "scripts" are not simple Groovy scripts, they are heavily transformed before running, some parts on master, some parts on slaves, with their state (variable values) serialized and passed to the next step. As such, not every Groovy feature is supported.
I'm not sure about #Grab support. It is discussed in JENKINS-26192 (which is declared as resolved, so maybe it works now).
Extract from a very interesting comment:
If you need to perform some complex or expensive tasks with
unrestricted Groovy physically running on a slave, it may be simplest
and most effective to simply write that code in a *.groovy file in
your workspace (for example, in an SCM checkout) and then use tool and
sh/bat to run Groovy as an external process; or even put this stuff
into a Gradle script, Groovy Maven plugin execution, etc. The workflow
script itself should be limited to simple and extremely lightweight
logical operations focused on orchestrating the overall flow of
control and interacting with other Jenkins features—slave allocation,
user input, and the like.
In short, if you can move that custom part that needs SQL to an external script and execute that in a separate process (called from your Pipeline script), that should work. But doing this in the Pipeline script itself is more complicated.

Resources