Can Airflow run streaming GCP Dataflow jobs? - google-cloud-dataflow

I am looking for orchestration software for streaming GCP Dataflow jobs - something that can provide alerting, status, job launching, etc., akin to what this does on Kubernetes. The answer here suggests Airflow, as it has some hooks into GCP; this would be nice because we have some other infrastructure that runs on Airflow. However, I am not sure whether it would be able to handle streaming jobs: my understanding is that Airflow is designed for tasks that will complete, which is not the case for a streaming job. Is Airflow appropriate for this, or is there different software I should use?

It's probably late, but answering for people who visit this topic in the future.
Yes, you can definitely run a Dataflow streaming job from Airflow. Use Airflow version 1.9 or above.
Links:
https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/hooks/gcp_dataflow_hook.py
https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/operators/dataflow_operator.py
You don't need to put in any extra effort to run a streaming job. The Dataflow operators above run both batch and streaming jobs, and the Airflow task is marked successful as soon as the Dataflow streaming job starts running (i.e. the job is in the running state).
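For illustration only, here is a minimal, untested sketch of such a DAG using the DataflowTemplateOperator from the contrib module linked above to launch a Google-provided streaming template (Pub/Sub to BigQuery). The project, bucket, topic and table names are placeholders, and depending on your Airflow release you may need to use DataFlowJavaOperator or DataFlowPythonOperator from the same module instead:

# Minimal sketch (not a drop-in solution): launching a streaming Dataflow job
# from Airflow using the contrib Dataflow operators linked above.
# Project, bucket, topic and table names below are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.dataflow_operator import DataflowTemplateOperator

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2018, 1, 1),
}

with DAG('dataflow_streaming_example',
         default_args=default_args,
         schedule_interval=None) as dag:

    launch_streaming_job = DataflowTemplateOperator(
        task_id='launch_streaming_job',
        # Google-provided streaming template: Pub/Sub -> BigQuery
        template='gs://dataflow-templates/latest/PubSub_to_BigQuery',
        dataflow_default_options={
            'project': 'my-gcp-project',           # placeholder
            'tempLocation': 'gs://my-bucket/tmp',  # placeholder
        },
        parameters={
            'inputTopic': 'projects/my-gcp-project/topics/my-topic',  # placeholder
            'outputTableSpec': 'my-gcp-project:my_dataset.my_table',  # placeholder
        },
    )
    # The Airflow task is marked successful once the Dataflow job reaches the
    # running state; the streaming job itself keeps running on GCP.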

Related

How to handle flink management and k8s management

I'm considering deploying Flink with K8s. I'm a newbie on Flink and have a simple question:
Say I use K8s to manage the Docker containers and deploy the TaskManagers into those containers.
As I understand it, a container can be restarted by K8s when it fails, and a Task can be restarted by Flink when it fails.
If a Task is running in a container and the container suddenly fails for some reason, then from Flink's view a Task has failed and should be restarted, while from K8s' view a container has failed and should be restarted. In this case, should we worry about a conflict between the two kinds of restart?
I think you want to read up on the official kubernetes setup guide here: https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/deployment/kubernetes.html
It describes 3 ways of getting it to work:
Session Cluster: This involves spinning up the two deployments from the appendix and requires you to submit your Flink job manually, or via a script, at the beginning. It is very similar to the local standalone cluster you use while developing, except that it now lives in your Kubernetes cluster.
Job Cluster: By deploying Flink as a k8s job, you would be able to eliminate the job submission step.
Helm chart: By the look of it, the project has not updated this for 2 years, so your mileage may vary.
I have had success with a Session Cluster, but I would eventually like to try the "proper" way, which by the looks of it is to deploy Flink as a Kubernetes job using the second method.
Depending on your Flink source and the kind of failure, your Flink job will fail differently, but you shouldn't worry about a "conflict". Either Kubernetes is going to restart the container, or Flink is going to handle the errors it can handle. After a certain number of retries the job is cancelled, depending on how you configured the restart strategy; see Configuration for more details. If the container exited with a non-zero code, Kubernetes will try to restart it. However, it may or may not resubmit the job, depending on whether you deployed the job in a Job Cluster or whether you had an initialization script in the image you used. In a Session Cluster, this can be problematic depending on whether the job submission is done through the task manager or the job manager. If the job was submitted through the task manager, then you need to cancel the existing failed job so that the resubmitted job can start.
Note: if you did go with the Session Cluster and use a filesystem-based state backend (i.e. not the RocksDB state backend) for checkpoints, you will need to figure out a way for the job manager and the task managers to share a checkpoint directory.
If the task managers use a checkpoint directory that is inaccessible to the job manager, their persistence layer will build up and eventually cause some kind of out-of-disk-space error. This may not be a problem if you decide to go with RocksDB and enable incremental checkpoints.
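As a rough illustration only (the values below are placeholders, not recommendations), both the retry behaviour and the shared checkpoint location mentioned above are configured in flink-conf.yaml, along these lines:

# Restart strategy: how many times Flink retries a failed job before cancelling it
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s

# Filesystem state backend with a checkpoint directory that both the job manager
# and the task managers can reach (e.g. a shared volume, or an object store/HDFS URI)
state.backend: filesystem
state.checkpoints.dir: file:///shared/flink/checkpoints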

How to scale down OpenShift/Kubernetes pods automatically on a schedule?

I have a requirement to scale down OpenShift pods at the end of each business day automatically.
How might I schedule this automatically?
OpenShift, like Kubernetes, is an API-driven application. Essentially all application functionality is exposed over the control-plane API running on the master hosts.
You can use any orchestration tool that is capable of making API calls to perform this activity. Information on calling the OpenShift API directly can be found in the official documentation in the REST API Reference Overview section.
Many orchestration tools have plugins that allow you to interact with the OpenShift/Kubernetes API more natively than crafting raw network calls. In the case of Jenkins, for example, there is the OpenShift Pipeline Jenkins plugin, which allows you to perform OpenShift activities directly from Jenkins pipelines. In the case of Ansible, there is the k8s module.
If you combine this with Jenkins' capability to run jobs on a schedule, you have something that meets your requirements.
For something much simpler you could just schedule Ansible or bash scripts on a server via cron to execute the appropriate API commands against the OpenShift API.
Executing these commands from within OpenShift would also be possible via the CronJob object.
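As an untested sketch of the scripted approach (the namespace and deployment names are placeholders, and it assumes the workloads are plain Kubernetes Deployments rather than OpenShift DeploymentConfigs), you could run something like this from cron or from within a CronJob, using the official Python Kubernetes client:

# Minimal sketch: scale a set of Deployments down to zero replicas.
# Intended to be run on a schedule (cron on a server, or a Kubernetes/OpenShift CronJob).
# Namespace and deployment names are placeholders.
from kubernetes import client, config

def scale_down(namespace, deployments):
    # Use config.load_incluster_config() instead when running inside a CronJob pod.
    config.load_kube_config()
    apps = client.AppsV1Api()
    for name in deployments:
        apps.patch_namespaced_deployment_scale(
            name=name,
            namespace=namespace,
            body={"spec": {"replicas": 0}},
        )
        print(f"Scaled {namespace}/{name} to 0 replicas")

if __name__ == "__main__":
    scale_down("my-project", ["frontend", "backend"])  # placeholders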

Starting up dependent services in Jenkins

Our test suite relies on a number of subsidiary services being present - database, message queue, redis, and so on. I would like to set up a Jenkins build that spins up all the correct services (docker containers, most likely) and then runs the correct tests, followed by some other steps.
Can someone point me to a good example for doing such a thing? I've seen a plug-in for mongo, and some general guides on spinning up agents, but their relationship to what I'm trying to do is unclear.
One possibility is to use the Jenkins Kubernetes plugin together with the Jenkins Kubernetes Pipeline plugin: they allow you to launch Docker slaves automatically, with container group support through podTemplate and containerTemplate.

Cloud dataflow workflow management through Luigi or Airflow

I want my dependent Cloud Dataflow jobs to be managed through a workflow scheduler. Has anyone done this before? I have gone through the documentation for both Airflow and Luigi, but I really need some working samples.
Please point me to some relevant examples or links that can help me explore and implement workflow management of Dataflow jobs.

How to support disconnections or reboots for Jenkins slaves on Windows?

I have many long running jobs that take almost a day to complete. Splitting is not possible. If the network fails then all progress is lost.
How can a slave survive disconnections?
EDIT 1
I have around 300 slaves running in Windows tied to one single Jenkins instance.
Slaves are connected using the manual method java -jar slave.jar -jnlpUrl <serverUrl> <slaveName>. I cannot run them as a regular Windows service because some tests manipulate GUI elements and require a real interactive session; otherwise the tests fail.
EDIT 2
According to the Jenkins Cookbook, I should be using the Cygwin + OpenSSH approach instead of a custom script with the JNLP connector. Could this improve stability?
Jenkins was not originally designed for builds to survive server or slave restarts. There is a CloudBees Long-Running Build plugin that supports long-running builds, but unfortunately it is available only to enterprise users and is still in beta.
I didn't find any free alternative and would suggest that you try to improve your network stability and split your long-running jobs. At the very least, you can divide your tests into logical groups (test suites).
Jenkins now has a workflow plugin. It claims to handle server restarts and loss of connectivity with slaves.
From the link
A key feature of a workflow execution is that it's suspendable. That is, while the workflow is running your script, you can shut down Jenkins or lose a connectivity to a slave. When it comes back, Jenkins will still remember what it was doing, and your workflow script resumes execution as if it was never interrupted. A technique known as the "continuation-passing style" execution plays a key role in achieving this.
(not tested at all)
Edit: copied from @Jesse Glick's comments:
Workflow is open source and available for anyone running Jenkins 1.580.1 or later. CloudBees Jenkins Enterprise does include a checkpoint feature, but this is not necessary simply to have a build survive slave disconnections and Jenkins restarts: that is automatic.
