Cloud dataflow workflow management through Luigi or Airflow - google-cloud-dataflow

I want my dependent Cloud Dataflow jobs to be managed through a workflow scheduler. Has anyone done this before? I have gone through the documentation for Airflow and Luigi, but I really need some working samples.
Please help me with relevant examples or links that I can use to explore and implement workflow management of Dataflow jobs.
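As a starting point, the core of what any workflow scheduler does for dependent jobs can be sketched with the Python standard library alone. This is not Airflow or Luigi code. The job names and the launch_job placeholder are hypothetical, and a real version would submit each Dataflow job and wait for it to finish before moving on.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def launch_job(name):
    """Hypothetical placeholder: a real version would submit the
    Dataflow job and block until it finishes."""
    return f"launched {name}"


def run_in_dependency_order(dependencies, launch=launch_job):
    """Launch every job only after all the jobs it depends on."""
    order = list(TopologicalSorter(dependencies).static_order())
    return [launch(name) for name in order]


# Hypothetical job names: each key maps to the set of jobs it depends on.
jobs = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

results = run_in_dependency_order(jobs)
print(results)
```

Airflow and Luigi add scheduling, retries, and status tracking on top of this ordering step, which is why they are usually preferred over a hand-rolled loop.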

Related

What is the usage of rundeck, spinnaker and jenkins

Could anyone help me understand the relationship between Rundeck, Spinnaker and Jenkins?
I have seen Jenkins used for builds and Spinnaker for pipelines. How is Rundeck integrated with them, and what is each one used for?
I got some information from the image at the link below about how Spinnaker and Jenkins are related, but Rundeck is also integrated.
Ref: https://www.opsmx.com/what-is-spinnaker/
Rundeck is focused on automation and operations, while Jenkins is focused on CI/CD. You can design workflows that integrate Jenkins pipelines into automated Rundeck jobs; take a look at this.
Spinnaker: a multi-cloud continuous delivery platform for releasing software changes.
Rundeck: Enable anyone to safely execute self-service operations tasks. Spinnaker looks closer to Jenkins. With Rundeck you can automate everything, not only CI/CD pipelines.
In fact, you can integrate Rundeck with any other solution to deliver automated tasks and save time avoiding interruptions. This is a good example of that.

What are the differences between airflow and Kubeflow pipeline?

"Machine learning platform" is one of the buzzwords in business, referring to tools that speed up the development of ML or deep learning.
A common component is a workflow orchestrator or workflow scheduler that helps users build DAGs and schedule and track experiments, jobs, and runs.
Many machine learning platforms have a workflow orchestrator, such as Kubeflow Pipelines, FBLearner Flow, and Flyte.
My question is: what are the main differences between Airflow and Kubeflow Pipelines or other ML platform workflow orchestrators?
Also, Airflow supports APIs in different languages and has a large community. Can we use Airflow to build our ML workflow?
You can definitely use Airflow to orchestrate Machine Learning tasks, but you probably want to execute ML tasks remotely with operators.
For example, Dailymotion uses the KubernetesPodOperator to scale Airflow for ML tasks.
If you don't have the resources to set up a Kubernetes cluster yourself, you can use an ML platform like Valohai that has an Airflow operator.
When doing ML in production, ideally you also want to version control your models to keep track of the data, code, parameters and metrics of each execution.
You can find more details in the article "Scaling Apache Airflow for Machine Learning Workflows".
My question is what are the main differences between airflow and
Kubeflow pipeline or other ML platform workflow orchestrator?
Airflow pipelines run in the Airflow server (with the risk of bringing it down if a task is too resource intensive), while Kubeflow pipelines run in dedicated Kubernetes pods. Also, Airflow pipelines are defined as Python scripts, while Kubeflow tasks are defined as Docker containers.
And airflow supports different language API and has large community,
can we use airflow to build our ML workflow ?
Yes, you can. For example, you could use an Airflow DAG to launch a training job in a Kubernetes pod that runs a Docker container, emulating Kubeflow's behaviour. What you will miss is some ML-specific features from Kubeflow, like model tracking or experimentation.
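To make the "training job in a Kubernetes pod" idea concrete, here is a minimal sketch of the kind of pod manifest such a task would ask Kubernetes to run, written as a plain Python dict. It deliberately avoids Airflow's operator API, and the pod name, image name, and arguments are hypothetical.

```python
def training_pod_manifest(name, image, args):
    """Build a minimal Kubernetes Pod manifest for a one-off training run."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # A training job should run once and exit, not be restarted.
            "restartPolicy": "Never",
            "containers": [
                {"name": "trainer", "image": image, "args": list(args)},
            ],
        },
    }


# Hypothetical image and arguments, for illustration only.
manifest = training_pod_manifest(
    "train-model",
    "example.registry/trainer:latest",
    ["--epochs", "10"],
)
print(manifest["spec"]["containers"][0]["image"])
```

An Airflow task would submit something equivalent to this and then wait for the pod to finish, which keeps the heavy work off the Airflow server itself.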

Is BitBucket cloud version of source code repo along with Bamboo for CI/CD?

I'm new to Bamboo and am currently learning and using it as a standalone server in my company. There I can see advanced options like creating build plans and separate deployment projects for different environments, as well as integration with notifications and triggers.
I wanted to do a lot of research and learning by myself at home, so I was looking for a cloud-based version of Bamboo that I can use straight away to perform similar tasks like creating build plans. I do not see any cloud version of Bamboo, but I can see Bitbucket (cloud-based). What I know is that it is a source code repository like GitHub and GitLab and that it has built-in CI/CD integration.
Q1. Is Bitbucket a cloud version of a source code repository plus Bamboo?
Q2. If not, is there a cloud version of Bamboo with the same options like build plans, deployment projects, etc.?
Q3. Also, is there any bot, like SlackBot or DeployBot, that I can use to invoke or trigger a Bamboo build plan with a chat command? I'm familiar with Slack but not with DeployBot. I can get Bamboo build notifications in my Slack channel, but not the other way around.
I'm learning and doing research and development, so I need experts in the DevOps field to clarify my doubts and show me the right path.
Please advise, as I'm looking to set up Bamboo with a bot that triggers my build plans.
Thank you
I'm getting hands-on experience with Bamboo at my company, learning as much as I can and playing around with it.
Bamboo Cloud was discontinued in January 2017. Bitbucket Cloud can still notify your Bamboo instance via webhook, assuming you configure Bamboo and your firewall and the webhook properly, or you can use Bitbucket Pipelines for the all-in-one approach.
You can also use Bitbucket Server if you'd prefer to keep everything behind the firewall.

"The Dataflow appears to be stuck" on Cloud Dataflow with Apache Beam 2.1.1 after switching to Firebase Firestore as a pipeline source

I am struggling with this, and initially thought it could be the result of switching the pipeline data source from Cloud Datastore to Firebase Firestore, which required a new project. But I've since found the same error in separate pipelines. All pipelines run successfully on the local DirectRunner and the permissions appear to be the same as the old project.
It looks like none of the VMs are booting and the pipeline never scales above 0 workers. "The Dataflow appears to be stuck" is the only error message I could find, and there is nothing in Stackdriver. I tried every dependency management variation I could find in the docs, but that doesn't seem to be the problem.
My last Dataflow job-id is 2017-10-11_11_12_01-15165703816317931044.
I tried elevating the access roles of all service accounts and still had no luck.
Without any logging information, it's hard to pinpoint. But this can happen if you have changed the permissions or roles of the Dataflow service account or the Compute Engine service account so that the service account does not have enough permissions to get the images for the Dataflow workers.

Can Airflow run streaming GCP Dataflow jobs?

I am looking for orchestration software for streaming GCP Dataflow jobs - something that can provide alerting, status, job launching etc. akin to what this does on Kubernetes. The answer here suggests Airflow as they have some hooks into GCP - this would be nice because we have some other infrastructure that runs on Airflow. However I am not sure if this would be able to handle streaming jobs - my understanding is that Airflow is designed for tasks that will complete, which is not the case for a streaming job. Is Airflow appropriate for this? Or is there different software I should use?
It's probably late, but I'm answering for people who visit this topic in the future.
Yes, you can definitely run a Dataflow streaming job from Airflow. Use Airflow version 1.9 or above.
Links:
https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/hooks/gcp_dataflow_hook.py
https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/operators/dataflow_operator.py
You don't need to make any extra effort to run a streaming job. The Dataflow operators above run both batch and streaming jobs. They mark the Airflow task as successful as soon as the Dataflow streaming job starts running (i.e. the job is in the running state).
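The behaviour described above can be illustrated with a small polling sketch. This is not the Airflow hook's actual code: wait_for_job and its get_state callback are hypothetical, and only the JOB_STATE_* names are taken from the Dataflow API. The idea is that a streaming job counts as successful once it reaches the running state, while a batch job has to reach the done state.

```python
import time

# Job state names as reported by the Dataflow API.
FAILED_STATES = {"JOB_STATE_FAILED", "JOB_STATE_CANCELLED"}


def wait_for_job(get_state, streaming, poll_seconds=0.0, max_polls=100):
    """Poll a job until it succeeds or fails.

    get_state is a hypothetical callback returning the current job state.
    Streaming jobs never finish on their own, so RUNNING counts as success
    for them; batch jobs must reach DONE.
    """
    success_state = "JOB_STATE_RUNNING" if streaming else "JOB_STATE_DONE"
    for _ in range(max_polls):
        state = get_state()
        if state == success_state:
            return True
        if state in FAILED_STATES:
            return False
        time.sleep(poll_seconds)
    raise TimeoutError("job did not reach a final state in time")


# Simulated state sequences instead of real API calls.
streaming_states = iter(["JOB_STATE_PENDING", "JOB_STATE_RUNNING"])
batch_states = iter(["JOB_STATE_PENDING", "JOB_STATE_RUNNING", "JOB_STATE_DONE"])
failed_states = iter(["JOB_STATE_PENDING", "JOB_STATE_FAILED"])

streaming_ok = wait_for_job(lambda: next(streaming_states), streaming=True)
batch_ok = wait_for_job(lambda: next(batch_states), streaming=False)
failed_ok = wait_for_job(lambda: next(failed_states), streaming=False)
print(streaming_ok, batch_ok, failed_ok)
```

One consequence of this design is that the Airflow task only tells you the streaming job started. Alerting on a job that fails later needs a separate check.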
