Is it possible to run two or more dependent templates in sequence using a Cloud Function? My function triggers the first template the moment any file is dropped in a particular bucket.
What I need to achieve is to run the second template only after the first template has executed successfully.
I would like to keep the execution order like this:
Template-1 -> (if successfully executed) -> Template-2 -> ... likewise, I can have 'n' templates to execute in sequence.
Any leads on how to achieve this using a Cloud Function?
Is it possible to create a custom exit status in Spring Cloud Data Flow?
Let's say I have the following:
I saw examples for FAILED and UNKNOWN, so I've created two custom conditions, Worked and Generated.
Assuming this approach is possible - how do I pass those strings from inside the task? Or does it need to be passed from somewhere else?
If not - then why can I write any string I want in the "Properties for TRANSITION" modal?
Other than providing the UI option to wire the exit-code to map to a particular downstream step, there's nothing dynamically influenced by SCDF. In other words, SCDF doesn't interfere with whatever is happening internally in each Task application.
The custom transitions require the desired exit-codes to be returned/handled within the Task application itself.
In your example above, if the Timestamp task's business logic returned "Worked" as the exit-code, then the transition would result in executing the Bar application. Likewise, if the exit-code is "Generated", you'd see Foo running.
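To make that concrete, here is a minimal sketch (mine, not part of the original answer) of one way to surface such a status from inside a Spring Cloud Task: a TaskExecutionListener that sets the execution's exit message, which, as I understand it, is what the composed-task transition matches against. The reportWasGenerated() helper is hypothetical.
import org.springframework.cloud.task.listener.TaskExecutionListener;
import org.springframework.cloud.task.repository.TaskExecution;

public class CustomExitStatusListener implements TaskExecutionListener {

    @Override
    public void onTaskStartup(TaskExecution taskExecution) {
    }

    @Override
    public void onTaskEnd(TaskExecution taskExecution) {
        // "Worked" / "Generated" are the custom statuses wired up in the
        // "Properties for TRANSITION" modal; the check below is hypothetical.
        taskExecution.setExitMessage(reportWasGenerated() ? "Generated" : "Worked");
    }

    @Override
    public void onTaskFailed(TaskExecution taskExecution, Throwable throwable) {
    }

    private boolean reportWasGenerated() {
        return true; // placeholder for real business logic
    }
}
The listener would be registered as a bean inside the Task application itself, in line with the point above that SCDF doesn't set these values for you.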
So I have just created a Geb script that tests the creation of a report. Let's call this Script A.
I have other test cases I need to run that are dependent on the previous report being created, but I still want Script A to be a standalone test. We will call the subsequent script Script B.
Furthermore, Script A generates a pair of numbers that will be needed in subsequent scripts (to verify the data got recorded accurately).
Is there a way I can set up Geb such that Script B executes `Script A` and is able to pull those two numbers from `Script A` to be used in `Script B`?
In summary, there will be a number of scripts that are dependent on the actions of Script A (which is itself a test). I want to be able to modularize Script A so that it can be executed from other scripts. What would be the best way to do this?
For reuse, and to avoid repeating yourself, I would put the report creation into a separate method in a new class such as ReportGenerator. This would generate the report given a set of parameters (if required) and return the report figures for use in whatever test you like.
You could then call that in any spec you want, with no reliance on other specs.
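A rough sketch of that shape (shown in Java for consistency with the rest of this page; in a Geb/Spock suite it would typically be a Groovy class, and every name here is illustrative):
import java.util.List;

// Encapsulates report creation so any spec can invoke it directly,
// with no reliance on another spec having run first.
public class ReportGenerator {

    // Creates the report and returns the pair of numbers it produced,
    // so callers can verify the data was recorded accurately.
    public List<Long> createReport(String reportName) {
        // ... drive the browser/UI here to create the report ...
        long firstFigure = 0L;  // placeholder for the first generated number
        long secondFigure = 0L; // placeholder for the second generated number
        return List.of(firstFigure, secondFigure);
    }
}
Script A's spec then shrinks to calling createReport and asserting on the returned figures, and Script B calls the same method to obtain its own inputs.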
Is it possible, for a given Build Pipeline job (which has downstream jobs, either in the build or post-build action "Trigger build on other projects"), to get a tree listing showing which pipeline job build #N called which child jobs, in calling order (sequential or parallel), with each child's build # for that pipeline run?
For example, if my pipeline job has this view:
then,
I'm expecting to get a listing of the top run similar to this (in simple text format):
vac-3.0-src:52 called: vac-3.0-unit-test-main:37
vac-3.0-unit-test-main:37 called: vac-3.0-unit-testA:36
vac-3.0-unit-test-main:37 called: vac-3.0-unit-testB:36
vac-3.0-unit-test-main:37 called: vac-3.0-unit-testC:35
vac-3.0-unit-test-main:37 called: vac-3.0-unit-testD:35
vac-3.0-unit-test-main:37 called: vac-3.0-unit-testReporting:35
vac-3.0-unit-testReporting:35 called: vac-3.0-integration-test-main:28
vac-3.0-integration-test-main:28 called: vac-3.0-integration-testA:27
vac-3.0-integration-test-main:28 called: vac-3.0-integration-testB:27
vac-3.0-integration-testB:27 called: vac-3.0-acceptance-test:25
vac-3.0-acceptance-test:25 called: vac-3.0-configure-something:24
vac-3.0-configure-something:24 called: vac-3.0-perform-someaction:23
vac-3.0-perform-someaction:23 called: vac-3.0-preview-step:22
vac-3.0-preview-step:22 called: vac-3.0-deb-delivery-job:27
vac-3.0-preview-step:22 called: vac-3.0-rpm-el6:23
vac-3.0-preview-step:22 called: vac-3.0-vagrant-provision:20
vac-3.0-preview-step:22 called: vac-3.0-vagrant-run:21
vac-3.0-vagrant-run:21 called: vac-3.0-demo:10
OR this information can be presented in a more robust, structured manner, i.e. a JSON blob in which a parent job has a structure listing all the jobs it called (parallel/sequential) in the order of the pipeline run.
I tried the main job's URL (via curl) with the Jenkins API, i.e. /api/xml or /api/json?pretty=true&depth=10 or more, but it doesn't give me the information I'm looking for (related to a given pipeline run).
This information is visually available in the pipeline view (as per the image), and some information about subprojects is available on the dashboard of a given Jenkins job (which was part of the pipeline), but the order is not there.
I'd appreciate it if you have tried to solve this and have any solution for getting this data. The reason for this effort is to find metrics horizontally for a given pipeline run, rather than vertically for each individual job in the pipeline (I already have the vertical / per-job metrics for total time, build #, result, etc.); what I'm trying to work out is how to relate each individual job's metrics to a given pipeline run.
If the above example image is too big, we can refer to this smaller run snapshot instead:
I see one possible solution; I'm not sure if it's helpful, but it's surely an attempt.
Algorithm steps:
1) Maintain a direct parent-child file (i.e. JobA:JobB, JobA:JobC, JobC:JobD, ...); for each JobX, this file tells you its direct sub-child/downstream jobs. It can easily be generated via a Jenkins Groovy script. PS: You can add more columns to this file, i.e. JobA:JobB:Build:Sequential or JobA:JobB:Test:Parallel, to get even better horizontal metrics, for calculating turnaround time per step (build, test, deploy, etc.) and recording whether a parent job called the child job in sequence or in parallel with two or more other jobs, and calculate the metrics accordingly.
2) Inside "Build Pipeline View" Configure (Settings), set the no. of jobs to be displayed as 1. PS: You can set this to whatever 5, 10, or more if you want to capture a given pipeline build# of that main pipeline job.
For testing purpose, I'm showing only 1 pipeline build run.
3) In Linux, use curl to get the "View Source" HTML of the build-pipeline-view's NAME (PS: this is NOT the main pipeline job),
i.e. not jobA or xxvt-main in this case, but the view-name URL (which shows the whole pipeline). Let's assume the view (created via the Build Pipeline View plugin) is named "MyPipelineView",
ex: curl -s http://my-jenkins-server:8080/view/MyPipelineView/ > /tmp/9.txt
This will give you the HTML content.
Store this information in a temporary file; let's assume I stored it in /tmp/9.txt.
4) Run the following command to get the jobs' build #s (as per the second, smaller pipeline image in my post):
grep -o "\"extId\":\"[a-zA-Z0-9_-][a-zA-Z0-9_-]*#[0-9][0-9]*\"" /tmp/9.txt
This will give you output like the following (use sed/cut to clean it up further):
"extId":"xxvt_main#157"
"extId":"xxvt_splunk_run_collect_operation#29"
"extId":"xxvt_splunk_run_process_operation#29"
"extId":"xxvt_splunk_update_date_restart_splunk#29"
"extId":"xxvt_splunk_get_jenkins_data#38"
"extId":"xxvt_splunk_get_clearquest_dr_data#47"
5) Now that you have the above output for a given pipeline run, use the parent-child (direct relationship) file (which we generated in bullet 1) to create the final build pipeline tree file, i.e.:
xxvt_main#157 called: xxvt_splunk_get_jenkins_data#38
xxvt_main#157 called: xxvt_splunk_get_clearquest_dr_data#47
xxvt_main#157 called: xxvt_splunk_run_collect_operation#29
xxvt_splunk_run_collect_operation#29 called: xxvt_splunk_run_process_operation#29
xxvt_splunk_run_process_operation#29 called: xxvt_splunk_update_date_restart_splunk#29
6) Knowing each run-related job name and its build #, we can use Jenkins's api/json?pretty=true&depth=1 (or 2 or 3, carefully) to fetch the fields we want for metrics, and finally come up with a .csv file in whatever format you like, which will have the metrics for a given pipeline run - HORIZONTALLY.
If you are working with the Jenkinsfile DSL etc.:
I achieved this by dynamically creating the stages, running them in parallel, and also getting the Jenkinsfile UI to show separate columns. This assumes the parallel steps are independent of each other (otherwise don't use parallel); you can nest them as deep as you want (depending on the for loop).
Jenkinsfile Pipeline DSL: How to Show Multi-Columns in Jobs dashboard GUI - For all Dynamically created stages - When within PIPELINE section - see here for more.
I have an Apache Beam job running on Google Cloud Dataflow, and as part of its initialization it needs to run some basic sanity/availability checks on services, pub/sub subscriptions, GCS blobs, etc. It's a streaming pipeline intended to run ad infinitum that processes hundreds of thousands of pub/sub messages.
Currently it needs a whole heap of required, variable parameters: which Google Cloud project it needs to run in, which bucket and directory prefix it's going to be storing files in, which pub/sub subscriptions it needs to read from, and so on. It does some work with these parameters before pipeline.run is called - validation, string splitting, and the like. In its current form, in order to start a job we've been passing these parameters to a PipelineOptionsFactory and issuing a new compile every single time, but it seems like there should be a better way. I've set up the parameters to be ValueProvider objects, but because they're being called outside of pipeline.run, Maven complains at compile time that ValueProvider.get() is being called outside of a runtime context (which, yes, it is.)
I've tried using NestedValueProviders as in the Google "Creating Templates" document, but my IDE complains if I try to use NestedValueProvider.of to return a string as shown in the document. The only way I've been able to get NestedValueProviders to compile is as follows:
NestedValueProvider<String, String> pid = NestedValueProvider.of(
pipelineOptions.getDataflowProjectId(),
(SerializableFunction<String, String>) s -> s
);
(String pid = NestedValueProvider.of(...) results in the following error: "incompatible types: no instance(s) of type variable(s) T,X exist so that org.apache.beam.sdk.options.ValueProvider.NestedValueProvider conforms to java.lang.String")
I have the following in my pipelineOptions:
ValueProvider<String> getDataflowProjectId();
void setDataflowProjectId(ValueProvider<String> value);
Because of the volume of messages we're going to be processing, adding these checks at the front of the pipeline for every message that comes through isn't really practical; we'll hit daily account administrative limits on some of these calls pretty quickly.
Are templates the right approach for what I want to do? How do I go about actually productionizing this? Should (can?) I compile with Maven into a JAR, then just run the JAR on a local dev/qa/prod box with my parameters and not bother with ValueProviders at all? Or is it possible to provide a default to a ValueProvider and override it as part of the options passed to the template?
Any advice on how to proceed would be most appreciated. Thanks!
The way templates are currently implemented, there is no point at which you can perform initialization/validation that is "post-template creation" but "pre-pipeline start".
All of the existing validation executes during template creation. If the validation detects that the values aren't available (due to being a ValueProvider), the validation is skipped.
In some cases it is possible to approximate validation by adding runtime checks, either as part of the initial splitting of a custom source or as part of the @Setup method of a DoFn. In the latter case, the @Setup method will run once for each instance of the DoFn that is created. If the pipeline is batch, after 4 failures for a specific instance it will fail the pipeline.
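For instance, a runtime check in @Setup might look like the following (a minimal sketch, not the asker's code; subscriptionExists is a hypothetical helper standing in for a real availability check):
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.transforms.DoFn;

public class ValidatingDoFn extends DoFn<String, String> {

    private final ValueProvider<String> subscription;

    public ValidatingDoFn(ValueProvider<String> subscription) {
        this.subscription = subscription;
    }

    @Setup
    public void setup() {
        // ValueProvider.get() is legal here: @Setup runs at execution
        // time, once per DoFn instance rather than once per element.
        String sub = subscription.get();
        if (!subscriptionExists(sub)) { // hypothetical availability check
            throw new IllegalStateException("Subscription unavailable: " + sub);
        }
    }

    @ProcessElement
    public void processElement(ProcessContext c) {
        c.output(c.element());
    }

    private boolean subscriptionExists(String sub) {
        return true; // placeholder for a real service call
    }
}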
Another option for productionizing pipelines is to build the JAR that runs the pipeline, and have a production process that runs that JAR to initiate the pipeline.
Regarding the compile error you received -- the NestedValueProvider returns a ValueProvider -- it isn't possible to get a String out of that. You could, however, put the validation code into the SerializableFunction that is run within the NestedValueProvider.
Although I believe this will currently re-run the validation everytime the value is accessed, it wouldn't be unreasonable to have the NestedValueProvider cache the translated value.
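Concretely, that suggestion applied to the getDataflowProjectId() option from the question might look like this (a sketch; the validation only runs when get() is called, i.e. at execution time):
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.options.ValueProvider.NestedValueProvider;
import org.apache.beam.sdk.transforms.SerializableFunction;

ValueProvider<String> pid = NestedValueProvider.of(
    pipelineOptions.getDataflowProjectId(),
    (SerializableFunction<String, String>) s -> {
        // Runs inside the template's runtime context, not at compile time.
        if (s == null || s.isEmpty()) {
            throw new IllegalArgumentException("dataflowProjectId must be set");
        }
        return s;
    });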
Is it possible to perform an action once a batch Dataflow job has finished processing all data? Specifically, I'd like to move the text file that the pipeline just processed to a different GCS bucket. I'm not sure where to place that in my pipeline to ensure it executes once after the data processing has completed.
I don't see why you need to do this post-pipeline execution. You could use side outputs to write the file to multiple buckets, and save yourself the copy after the pipeline finishes.
If that's not going to work for you (for whatever reason), then you can simply run your pipeline in blocking execution mode, i.e. use pipeline.run().waitUntilFinish(), and then just write the rest of your code (which does the copy) after that:
[..]
// do some stuff before the pipeline runs
Pipeline pipeline = ...
pipeline.run().waitUntilFinish();
// do something after the pipeline finishes here
[..]
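For the copy step itself, here is a hedged sketch using the google-cloud-storage Java client (bucket and object names below are placeholders):
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

Storage storage = StorageOptions.getDefaultInstance().getService();
BlobId source = BlobId.of("input-bucket", "data/input.txt");   // placeholder
BlobId target = BlobId.of("archive-bucket", "data/input.txt"); // placeholder
// GCS has no atomic move: copy to the destination, then delete the source.
storage.copy(Storage.CopyRequest.of(source, target)).getResult();
storage.delete(source);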
A little trick I got from reading the source code of Apache Beam's PassThroughThenCleanup.java:
Right after your reader, create a side input that 'combines' the entire collection (in the source code, it is the View.asIterable() PTransform) and connect its output to a DoFn. This DoFn will be called only after the reader has finished reading ALL elements.
P.S. The code literally names the operation cleanupSignalView, which I found really clever.
Note that you can achieve the same effect using Combine.globally() (Java) or beam.CombineGlobally() (Python). For more info check out section 4.2.4.3 here.
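A rough reconstruction of the pattern (my sketch, not the PassThroughThenCleanup source itself; the GCS path is a placeholder, and this applies to a bounded/batch read):
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;

Pipeline pipeline = Pipeline.create();
PCollection<String> lines =
    pipeline.apply(TextIO.read().from("gs://my-bucket/input.txt")); // placeholder

// The iterable view is only ready once the reader has emitted everything,
// so the DoFn below cannot run before reading has completed.
final PCollectionView<Iterable<String>> cleanupSignal =
    lines.apply(View.asIterable());

pipeline
    .apply(Create.of("cleanup"))
    .apply(ParDo.of(new DoFn<String, Void>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
            c.sideInput(cleanupSignal); // forces the wait on the read
            // ... move/copy the processed file here ...
        }
    }).withSideInputs(cleanupSignal));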
I think two options can help you here:
1) Use TextIO to write to the bucket or folder you want, specifying the exact GCS path (e.g. gs://sandbox/other-bucket); see the sketch after this list.
2) Use Object Change Notifications in combination with Cloud Functions. You can find a good primer on doing this here, and the SDK for GCS in JS here. What you would do in this option is basically set up a trigger for when something drops into a certain bucket, and move it to another bucket using your self-written Cloud Function.
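For option 1, a minimal sketch (assuming the Beam 2.x Java API and an existing PCollection<String> named output; the path is a placeholder):
import org.apache.beam.sdk.io.TextIO;

// Write results straight to the destination bucket, so no post-run
// copy is needed.
output.apply(TextIO.write().to("gs://sandbox/other-bucket/results")
    .withSuffix(".txt"));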