Google Cloud Dataflow - Apache Beam - Pipeline Shutdown Hook

Is there some kind of "hook" where I can place code that is executed when an Apache Beam pipeline shuts down (for whatever reason: crash, cancel)?
I need to delete a subscription on a Pub/Sub topic each time the Dataflow job is stopped.

Apache Beam is not naturally suited for this sort of flow. For this you may want to look at an orchestration engine, such as Apache Airflow.
With Airflow you should be able to schedule any sort of script to run after a Beam pipeline finishes/fails/is cancelled, etc. Take a look at it!

There are some examples of waiting for pipelines to finish, and indeed of managing Pub/Sub topics/subscriptions, in the ExampleUtils class in the examples folder in the apache/beam repository here. See if there is anything you can use in the waitUntilFinish and tearDown methods.
This is Java code; I'm not sure if that is the language you use.
(In the long run, @Pablo's suggestion to separate this further from the pipeline code may be best. It perhaps depends on your exact goal here.)
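The wait-then-tear-down idea from the answer above can be sketched in a few lines. This is a minimal illustration in Python rather than the Java of ExampleUtils, and it assumes the launcher process blocks until the job reaches a terminal state. The names run_with_cleanup, fake_pipeline and delete_subscription are hypothetical stand-ins, not Beam or Pub/Sub APIs. The cleanup only runs if the launcher itself stays alive, which is one reason an external orchestrator may be more robust.

```python
from typing import Callable


def run_with_cleanup(run_pipeline: Callable[[], str],
                     cleanup: Callable[[], None]) -> str:
    """Run a blocking pipeline call and always invoke cleanup afterwards."""
    try:
        # run_pipeline is assumed to block until the job is done, failed or cancelled
        return run_pipeline()
    finally:
        # Executed whether run_pipeline returned normally or raised
        cleanup()


if __name__ == "__main__":
    deleted = []

    def fake_pipeline() -> str:
        # Hypothetical stand-in for a blocking call such as
        # pipeline.run().wait_until_finish()
        raise RuntimeError("simulated pipeline crash")

    def delete_subscription() -> None:
        # Hypothetical stand-in for a real Pub/Sub subscription delete call
        deleted.append("my-subscription")

    try:
        run_with_cleanup(fake_pipeline, delete_subscription)
    except RuntimeError as err:
        print(f"pipeline failed: {err}")
    print(f"cleaned up: {deleted}")
```

The try/finally keeps the cleanup next to the launch code, so it does not depend on any hook inside the pipeline itself.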

Related

One-time (per worker) setup for python dataflow?

My Dataflow job has to download a file from a remote server. I want to save the file on the worker machine so the job doesn't have to keep downloading the same file.
I tried to do this in the setup method. However, it seems that setup is called for each thread, and multiple threads can call setup in parallel. (I cannot find documentation on this, but in my experience the job writes the file data in parallel, which produces malformed data.)
Is there a way to perform one-time setup whenever a worker machine is launched?
I also checked "Apache Beam: DoFn.Setup equivalent in Python SDK", but I believe it focuses on per-thread setup.
The Beam model doesn't include a specific callback for when a VM is created, because the model doesn't guarantee the runtime environment. However, because you are using Dataflow, which runs workers in containers, you have two options:
Modify the container image
Modify the setup.py
The first will give you direct control over the container image, and it works for all languages. The second only works for Python.
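Separately from the two options above, the malformed-file symptom described in the question can be avoided inside the setup method itself. The sketch below is an assumption-laden illustration, not part of either option: it guards the download with a process-wide lock and writes through a temporary file that is renamed into place, so parallel setup calls never see a half-written file. The function name ensure_file and the fetch callback are hypothetical.

```python
import os
import tempfile
import threading
from typing import Callable

# One lock per worker process, shared by all threads that call setup
_download_lock = threading.Lock()


def ensure_file(path: str, fetch: Callable[[], bytes]) -> str:
    """Create `path` at most once per process, writing it atomically."""
    if os.path.exists(path):
        return path
    with _download_lock:
        # Re-check inside the lock: another thread may have finished meanwhile
        if os.path.exists(path):
            return path
        directory = os.path.dirname(path) or "."
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        try:
            with os.fdopen(fd, "wb") as tmp:
                tmp.write(fetch())
            # os.replace is atomic on one filesystem, so readers never
            # observe a partially written file
            os.replace(tmp_path, path)
        except BaseException:
            os.unlink(tmp_path)
            raise
    return path


if __name__ == "__main__":
    target = os.path.join(tempfile.mkdtemp(), "reference.bin")
    # The lambda is a hypothetical stand-in for the real remote download
    print(ensure_file(target, lambda: b"example payload"))
```

The lock only coordinates threads within one process. If several worker processes share a VM, each may still download once, but the atomic rename keeps the file intact.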

Spring Cloud Data Flow Java DSL for Task / Batch

I am wondering if there is any Java DSL support for registering Spring Batch applications as SCDF tasks. Until now I was only able to find Java DSL support for streams.
Any link in this direction would be very helpful. Also, how can I automatically deploy Spring Batch jobs as SCDF tasks in a production environment without any manual intervention?
Thank you.
There's Java DSL support for Tasks. This was recently introduced in 2.4 as part of SCDF's integration test suite, but it is not promoted as part of the SCDF REST client that we ship as a library; see: spring-cloud/spring-cloud-dataflow#3949 / spring-io/dataflow.spring.io#242. Feel free to contribute the migration if you have cycles.
That said, we have end-to-end integration tests that leverage this capability [see: DataFlowIT.java#L771-L906], which we run internally on each commit. You could certainly follow this as a pattern to automate the creation and launch of the task/batch jobs.

Jenkinsfile - Mutual exclusivity across multiple pipelines

I'm looking for a way to make multiple declaratively written Jenkinsfiles run exclusively and block each other. They consume test instances that are terminated after they run, which causes problems when PRs are being tested as they come in.
I cannot find an option to make the BuildBlocker plugin do this. The Jenkinsfiles I found that use this plugin do not run with our plugin/Jenkins versions, and the [$class: <some.java.expression>] strings exported from the syntax generator don't seem to work here anyway.
I cannot find a way to hold these locks across all the steps involved in the pipeline.
I could hack a file lock, but this won't help me with multi-node builds.
This plugin could perhaps help you, as it lets you lock resources you've declared previously, so that if a resource is currently locked, any other job that requires the same resource will wait until it is released.
https://plugins.jenkins.io/lockable-resources/
Since you say you want declarative, you should probably wait for the currently-in-review Jira issue "Allow locking multiple stages in declarative pipeline" to be completed. You can also vote for it and watch it.
And if you can't wait, this is your opportunity to learn golang (or whatever language you want to learn) by implementing a microservice that holds these locks that you call from your pipeline scripts. :D
The Job DSL plugin is for configuring Jenkins execution policies, including blocking on another job and calling pipeline code.
https://jenkinsci.github.io/job-dsl-plugin/#method/javaposse.jobdsl.dsl.jobs.FreeStyleJob.blockOn describes the method, which the blocker plugin also uses.
This is the usage tutorial: https://github.com/jenkinsci/job-dsl-plugin/wiki/Tutorial---Using-the-Jenkins-Job-DSL, and this is the API: https://github.com/jenkinsci/job-dsl-plugin/wiki/Job-DSL-Commands.
Taken from https://www.digitalocean.com/community/tutorials/how-to-automate-jenkins-job-configuration-using-job-dsl:
It should be possible to use https://github.com/jenkinsci/job-dsl-plugin/wiki/Dynamic-DSL, but I found no good usage example yet.

Schedule Mail batch by Rails in Cloud Foundry

I want to send a batch of emails at a specific time, like cron.
I think the whenever gem (https://github.com/javan/whenever) does not fit the Cloud Foundry environment, because Cloud Foundry can't use crontab.
Please let me know what options are available to me.
There's a node.js app here that you could use to schedule a specific rake task.
I haven't worked with Cloud Foundry, so I'm not sure if it'll serve your needs, but you can also try some of the batch job processing tools Rails has available: Delayed Job and Sidekiq. Those store data for recurring jobs either in your database (Delayed Job) or in a separate Redis database (Sidekiq), and both require keeping extra processes up and running, so review them carefully, along with the changes you'd need in your deployment process, before using either one. There's also Resque, and here's a tutorial on using it with Rails for scheduling tasks.
There are multiple solutions here, but the short answer is that whatever you end up doing needs to implement its own scheduler. This is because there is no cron service available to your application when it runs on CF, so there is nothing to trigger or schedule your actions. Any project or solution that depends on cron will not work when deployed to CF. Any project that implements its own scheduler should work fine.
Some specific things I've seen people do successfully:
Use a web service that sends HTTP requests to your app at predefined intervals. The requests trigger your action. It's the service's responsibility to let you define when to trigger and to send the HTTP requests. I'm intentionally avoiding mentioning any specific services, but you can find them by searching for "cron http service" or something like that.
Import a library that has cron-like functionality. I'm not familiar with Ruby, so I don't know the landscape there. @mlabarca has mentioned a couple that you might try out. Again, check that they implement the scheduling functionality themselves and do not depend on cron. I'm more familiar with Java, where you have Quartz and Spring, which has some scheduling functionality too.
Implement a "clock" process or scheduler. This would generally be a second app that you deploy on CF. It would be lightweight and probably not have a web interface. It could be as simple as: do something, sleep, and loop forever repeating those two steps. It really depends on your needs. You could even get fancy and implement something like the first option above, where you send some sort of request to your other apps to trigger the actual events.
There are probably other solutions as well; those are just some examples to get you started.
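The "clock" process described above (do something, sleep, repeat) can be sketched as follows. This is only an illustration written in Python; the same shape would apply to a Ruby process. The names run_clock and send_mail_batch are hypothetical, and the max_runs and sleep parameters exist only so the loop can be stopped and tested.

```python
import time
from typing import Callable, Optional


def run_clock(job: Callable[[], None],
              interval_seconds: float,
              max_runs: Optional[int] = None,
              sleep: Callable[[float], None] = time.sleep) -> int:
    """Run `job`, sleep, and repeat; loops forever when max_runs is None."""
    runs = 0
    while max_runs is None or runs < max_runs:
        try:
            job()
        except Exception as err:
            # A failing run should not stop the clock process
            print(f"job failed: {err}")
        runs += 1
        sleep(interval_seconds)
    return runs


if __name__ == "__main__":
    def send_mail_batch() -> None:
        # Hypothetical stand-in for triggering the real mail batch,
        # for example by sending a request to the main app
        print("sending mail batch")

    # Bounded here for demonstration; a real clock process would omit max_runs
    run_clock(send_mail_batch, interval_seconds=0.1, max_runs=3)
```

A fixed interval is the simplest choice. Running at a specific wall-clock time would need the loop to compute how long to sleep until the next scheduled moment.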
It's probably also worth mentioning that the Cloud Controller v3 API will have first-class features to run tasks. In this case, the "task" is some job that runs in a finite amount of time and exits (like a batch job). This is opposed to the standard "app" that, when run on CF, should continue executing forever (i.e. if it exits, it's because of a crash). That said, I do not believe it will include a scheduler, so you'd still need something to trigger the task.

Does spring-cloud-dataflow provide support for scheduling applications defined as tasks?

I have been looking at using projects built with spring-cloud-task within spring-cloud-dataflow. Having looked at the example projects and the documentation, the indication seems to be that tasks are launched manually through the dashboard or the shell. Does spring-cloud-dataflow provide any way of scheduling task definitions so that they can run, for example, on a cron schedule? That is, can you create a spring-cloud-task app which itself has no knowledge of a schedule, but deploy it to the dataflow server and configure the scheduling there?
Among the posts and blogs I have looked at I noticed the following:
https://spring.io/blog/2016/01/27/introducing-spring-cloud-task
Some of the Q&A afterwards hints at this being a possibility, with the reference to triggers, but I think this was discussed before it was released.
Any advice would be greatly appreciated, many thanks.
There are a few ways you could launch Tasks in Spring Cloud Data Flow. The following options are available today.
Launch it using TriggerTask; with this you could either choose to launch it with fixedDelay or via a cron expression - example here.
Launch it via an event in streaming pipeline. Imagine a use-case where you would want to create a "thumbnail" as and when there's a new image (event) in s3-bucket or in a file-system directory; the "thumbnail" operation could be a task in this case - example here.
Lastly, in the upcoming releases, we will port over "scheduler" functionality from Spring XD to Spring Cloud Data Flow.
Yes, Spring Cloud Data Flow does provide a scheduling option. To enable it, you need to add the following argument when starting the server:
--spring.cloud.dataflow.features.schedules-enabled=true
