Run function / pipeline after a pipeline completes on Google DataFlow - google-cloud-dataflow

I'm trying to run a function (or a pipeline) after a Beam pipeline completes, on Google DataFlow.
Currently I've built a hack that runs the function by piping the results of the previous pipeline into a throwaway step:
_ = existing_pipeline | "do next task" >> beam.Map(func)
...where func is:
def func(_):
    # do some work, and ignore `_`
But is there a better way?

Assuming you want the function to run on your machine and not in the Cloud, you should do something like this:
result = existing_pipeline.run()
result.wait_until_finish()
# do some work
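For example, a minimal sketch of that pattern (archive_results here is just a placeholder for whatever post-processing you need) could look like this:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def archive_results():
    # Placeholder follow-up work; this runs on the machine that submitted
    # the job, not on the Dataflow workers.
    print("pipeline finished, running follow-up step")

options = PipelineOptions()  # e.g. --runner=DataflowRunner plus project/staging flags
pipeline = beam.Pipeline(options=options)
# ... build your transforms on `pipeline` here ...

result = pipeline.run()      # submits the job
result.wait_until_finish()   # blocks until the job reaches a terminal state
archive_results()            # only reached after the job has completed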

Related

Is there a Jenkins env var for Replay?

Does Jenkins provide a variable when a replay is run? If so, what is it? I see in the log that it writes "Replayed", but I am not looking to scrape the console output.
You can check the cause that triggered the job via rawBuild:
def replayClassName = "org.jenkinsci.plugins.workflow.cps.replay.ReplayCause"
def isReplay = currentBuild.rawBuild.getCauses().any{ cause -> cause.toString().contains(replayClassName) }
*referred from
How to know inside jenkinsfile / script that current build is a replay?

Extract Jenkins pipeline from job when it's only defined in the Jenkins GUI

I have many Jenkins jobs whose pipeline script is written directly in the Jenkins GUI rather than stored in SCM, as is best practice.
I have been tasked with grabbing these scripts and, firstly, creating a backup of them.
My preference is to script the collection using Python, something like the below:
from utils.args import parse_arguments
from jenkinsapi.jenkins import Jenkins

args = parse_arguments()
url = "http://jmaster:8080/"
master = Jenkins(url, username=args.username, password=args.password)
for job in master.get_jobs():
    print(job[0])
    if job[1]._data["_class"] == "org.jenkinsci.plugins.workflow.job.WorkflowJob":
        print("doing work")
However, this is where I get stuck, as I cannot see where the pipeline script is exposed. Is it even exposed as a JSON parameter that I have access to?
I've tried looking at the jenkinsapi data structures, with no luck.
I have also tried using the REST API directly in a browser, but I couldn't find the right part.
Does anyone know if this is possible, or am I just chasing a dream?
Have you tried grabbing the jobs' config.xml files (e.g. from http://jmaster:8080/job/myjob/config.xml)?
There it looks like this:
<definition class="org.jenkinsci.plugins.workflow.cps.CpsFlowDefinition" plugin="workflow-cps#2.59">
<script>node { echo 'Hello World' }</script>
<sandbox>false</sandbox>
</definition>
Or maybe you can get the CpsFlowDefinition with groovy in your original code...
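Building on the Python snippet in the question, a rough sketch that fetches each job's config.xml over the REST API and pulls out the inline script (the URL and credentials are placeholders, and the usual /job/<name>/config.xml endpoint is assumed) might look like this:
import xml.etree.ElementTree as ET
import requests

JENKINS_URL = "http://jmaster:8080"

def get_pipeline_script(job_name, username, password):
    # Fetch the job's config.xml and return the inline pipeline script, if any.
    resp = requests.get(
        "{}/job/{}/config.xml".format(JENKINS_URL, job_name),
        auth=(username, password),
    )
    resp.raise_for_status()
    root = ET.fromstring(resp.text)
    # A CpsFlowDefinition stores the inline pipeline in a <script> element
    script = root.find(".//script")
    return script.text if script is not None else None
Jobs whose pipeline comes from SCM (CpsScmFlowDefinition) have no inline <script> element, so the function would return None for those.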

How can I programmatically cancel a Dataflow job that has run for too long?

I'm using Apache Beam on Dataflow through Python API to read data from Bigquery, process it, and dump it into Datastore sink.
Unfortunately, quite often the job just hangs indefinitely and I have to stop it manually. While the data does get written to Datastore and Redis, I can see from the Dataflow graph that only a couple of entries get stuck, which leaves the job hanging.
As a result, when a job with fifteen 16-core machines is left running for 9 hours (normally, the job runs for 30 minutes), it leads to huge costs.
Maybe there is a way to set a timer that would stop a Dataflow job if it exceeds a time limit?
It would be great if you could create a customer support ticket so that we could try to debug this with you.
Maybe there is a way to set a timer that would stop a Dataflow job if it exceeds a time limit?
Unfortunately the answer is no, Dataflow does not have an automatic way to cancel a job after a certain time. However, it is possible to do this using the APIs: call wait_until_finish() with a timeout and then cancel() the pipeline.
You would do this like so:
p = beam.Pipeline(options=pipeline_options)
p | ...  # define your pipeline code
pipeline_result = p.run()  # submits the job (non-blocking on Dataflow)
pipeline_result.wait_until_finish(duration=TIME_DURATION_IN_MS)
pipeline_result.cancel()  # if the pipeline has not finished, cancel it
To sum up, with the help of @ankitk's answer, this works for me (Python 2.7, SDK 2.14):
pipe = beam.Pipeline(options=pipeline_options)
...  # main pipeline code
run = pipe.run()  # submits the job
run.wait_until_finish(duration=3600000)  # blocks for up to one hour (duration is in ms)
run.cancel()  # cancels the job if it can still be cancelled
Thus, if the job finishes successfully within the duration passed to wait_until_finish(), cancel() will just print a warning ("already closed"); otherwise it will close a running job.
P.S. If you try to print the state of the job
state = run.wait_until_finish(duration=3600000)
logging.info(state)
it will be RUNNING for a job that didn't finish within wait_until_finish(), and DONE for a finished job.
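Putting this together, a small sketch that only cancels when the job is still running after the timeout (assuming Beam's PipelineState enum and the pipeline_options from above) could look like this:
import logging
import apache_beam as beam
from apache_beam.runners.runner import PipelineState

pipe = beam.Pipeline(options=pipeline_options)
# ... main pipeline code ...

result = pipe.run()
state = result.wait_until_finish(duration=3600000)  # wait for up to one hour (ms)
logging.info("Job state after the timeout: %s", state)

if state not in (PipelineState.DONE, PipelineState.CANCELLED, PipelineState.FAILED):
    result.cancel()  # the job is still running past the deadline, so cancel it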
Note: this technique will not work when running Beam from within a Flex Template Job...
The run.cancel() method doesn't work if you are writing a template, and I haven't seen any successful workaround for it...

How retrieve job config via groovy web console?

I have Jenkins with several hundred jobs, and I need to find the job that runs a specified Gradle task. I see the following solution:
1. retrieve all jobs (Jenkins.instance.projects)
2. iterate over them
3. get the XML config and check for the presence of a substring
The question is how to retrieve the XML representation of a hudson.model.FreeStyleProject. Or maybe this data is stored as a map, in which case the question is what it is called and how to get it.
I did something similar before but using a shell script. Not sure if that approach is useful for you but just in case:
cd /var/lib/jenkins/jobs/
grep WORD */config.xml
Here is also a Groovy script that lists every FreeStyleProject name and its Gradle tasks:
def builderFilter = { builder -> builder.class == hudson.plugins.gradle.Gradle.class }
jenkins.model.Jenkins.instance.getAllItems(hudson.model.FreeStyleProject.class).each{ job ->
    job.getBuilders().findAll(builderFilter).each{ gradleStep ->
        gradleStep.each { gradleItem ->
            println(job.getDisplayName() + ' ' + gradleItem.getTasks())
        }
    }
}
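If you'd rather stay outside the script console, the same substring check can be done over Jenkins' REST API; here is a rough Python sketch (the URL, credentials, and task name are placeholders):
import requests

JENKINS_URL = "http://jenkins.example.com:8080"  # placeholder URL
AUTH = ("user", "api_token")                     # placeholder credentials
NEEDLE = "myGradleTask"                          # the Gradle task you are looking for

jobs = requests.get(JENKINS_URL + "/api/json?tree=jobs[name]", auth=AUTH).json()["jobs"]
for job in jobs:
    config = requests.get(
        "{}/job/{}/config.xml".format(JENKINS_URL, job["name"]), auth=AUTH
    ).text
    if NEEDLE in config:
        print(job["name"])  # this job's config mentions the task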

Use BlockingDataflowPipelineRunner and post-processing code for Dataflow template

I'd like to run some code after my pipeline finishes all processing, so I'm using BlockingDataflowPipelineRunner and placing code after pipeline.run() in main.
This works properly when I run the job from the command line using BlockingDataflowPipelineRunner. The code under pipeline.run() runs after the pipeline has finished processing.
However, it does not work when I try to run the job as a template. I deployed the job as a template (with TemplatingDataflowPipelineRunner), and then tried to run the template in a Cloud Function like this:
dataflow.projects.templates.create({
  projectId: 'PROJECT ID HERE',
  resource: {
    parameters: {
      runner: 'BlockingDataflowPipelineRunner'
    },
    jobName: `JOB NAME HERE`,
    gcsPath: 'GCS TEMPLATE PATH HERE'
  }
}, function(err, response) {
  if (err) {
    // etc
  }
  callback();
});
The runner parameter does not seem to take effect: I can put gibberish under runner and the job still runs.
The code I had under pipeline.run() does not run when each job runs -- it runs only when I deploy the template.
Is it expected that the code under pipeline.run() in main would not run each time the job runs? Is there a solution for executing code after the pipeline is finished?
(For context, the code after pipeline.run() moves a file from one Cloud Storage bucket to another. It's archiving the file that was just processed by the job.)
Yes, this is expected behavior. A template represents the pipeline itself and allows (re-)executing the pipeline by launching the template. Since the template doesn't include any of the code from the main() method, it doesn't allow doing anything after the pipeline execution.
Similarly, the dataflow.projects.templates.create API is just the API to launch the template.
The way the blocking runner accomplished this was to get the job ID from the created pipeline and periodically poll it to observe when it has completed. For your use case, you'll need to do the same (see the sketch after this list):
1. Execute dataflow.projects.templates.create(...) to create the Dataflow job. This should return the job ID.
2. Periodically (every 5-10 seconds, for instance) poll dataflow.projects.jobs.get(...) to retrieve the job with the given ID and check what state it is in.
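For illustration, a rough polling loop (shown here in Python with google-api-python-client; the same two calls are available from the Node.js client used in your Cloud Function, and the project/job IDs are placeholders) might look like this:
import time
from googleapiclient.discovery import build

# Placeholder values; in the Cloud Function case the job ID comes back from
# the templates.create response.
PROJECT_ID = "my-project"
JOB_ID = "2017-01-01_00_00_00-1234567890"

dataflow = build("dataflow", "v1b3")
TERMINAL_STATES = {"JOB_STATE_DONE", "JOB_STATE_FAILED", "JOB_STATE_CANCELLED"}

while True:
    job = dataflow.projects().jobs().get(projectId=PROJECT_ID, jobId=JOB_ID).execute()
    if job["currentState"] in TERMINAL_STATES:
        break
    time.sleep(10)  # poll roughly every 10 seconds

# The job has reached a terminal state; run the post-processing here,
# e.g. move the processed file to the archive Cloud Storage bucket.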
