Can Apache Beam Pipeline be used for batch orchestration? - google-cloud-dataflow

I am a newbie in the Apache Beam environment.
I am trying to fit an Apache Beam pipeline for batch orchestration.
My definition of a batch is as follows:
Batch ==> a set of jobs,
Job ==> can have one or more sub-jobs.
There can be dependencies between jobs/sub-jobs.
Can an Apache Beam pipeline be mapped onto my custom batch?

Apache Beam is a unified model for developing both batch and streaming pipelines, which can be run on Dataflow. You can create and deploy your pipeline using Dataflow. Beam pipelines are portable, so you can use any of the available runners according to your requirements.
Cloud Composer can be used for the batch orchestration you describe. Cloud Composer is built on Apache Airflow. Apache Beam and Apache Airflow can be used together, since Apache Airflow can trigger Beam jobs. Since you have custom jobs running, you can configure Beam and Airflow together for batch orchestration.
Airflow performs orchestration and pipeline dependency management, while Beam is used to build data pipelines that are executed by data processing systems.
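The batch/job/sub-job structure described in the question maps naturally onto a DAG, which is exactly what an orchestrator like Airflow schedules. As a minimal stdlib sketch (the job names here are hypothetical), resolving such dependencies into an execution order looks like:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical batch: each job maps to the set of jobs it depends on.
batch = {
    "extract": set(),
    "transform": {"extract"},        # sub-job depending on "extract"
    "load": {"transform"},
    "report": {"load", "transform"},
}

# static_order() yields the jobs in an order that respects every dependency.
execution_order = list(TopologicalSorter(batch).static_order())
print(execution_order)
```

Each Beam pipeline would then be one node in this graph, and the orchestrator (Composer/Airflow) runs them in a dependency-respecting order.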

I believe Composer might be better suited for what you're trying to build. From there, you can launch Dataflow jobs from your environment using Airflow operators (for example, if you're using Python, you can use the DataflowCreatePythonJobOperator).

Related

Flex Template Python dependency management without outbound network connectivity

We are running Dataflow Python Flex Templates in a VPC without outbound network connectivity and without an artifact repository. Hence, for Dataflow Python jobs, we provide dependencies as described here: https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#local-or-nonpypi.
What is the best practice for providing dependencies to Python Flex Templates when the build process that builds the Docker image has access to PyPI, but the Dataflow workers don't?

What are the differences between airflow and Kubeflow pipeline?

Machine learning platforms are one of the buzzwords in business, used to boost the development of ML and deep learning.
A common component is a workflow orchestrator or scheduler that helps users build DAGs and schedule and track experiments, jobs, and runs.
Many machine learning platforms include a workflow orchestrator, such as Kubeflow Pipelines, FBLearner Flow, and Flyte.
My question is: what are the main differences between Airflow and Kubeflow Pipelines (or other ML platform workflow orchestrators)?
Also, Airflow supports APIs in different languages and has a large community; can we use Airflow to build our ML workflow?
You can definitely use Airflow to orchestrate Machine Learning tasks, but you probably want to execute ML tasks remotely with operators.
For example, Dailymotion uses the KubernetesPodOperator to scale Airflow for ML tasks.
If you don't have the resources to set up a Kubernetes cluster yourself, you can use an ML platform like Valohai that has an Airflow operator.
When doing ML in production, you ideally also want to version control your models to keep track of the data, code, parameters, and metrics of each execution.
You can find more details in this article on Scaling Apache Airflow for Machine Learning Workflows.
My question is: what are the main differences between Airflow and Kubeflow Pipelines (or other ML platform workflow orchestrators)?
Airflow pipelines run on the Airflow server (with the risk of bringing it down if a task is too resource-intensive), while Kubeflow pipelines run in dedicated Kubernetes pods. Also, Airflow pipelines are defined as Python scripts, while Kubeflow tasks are defined as Docker containers.
Also, Airflow supports APIs in different languages and has a large community; can we use Airflow to build our ML workflow?
Yes, you can. For example, you could use an Airflow DAG to launch a training job in a Kubernetes pod running a Docker container, emulating Kubeflow's behaviour. What you will miss are some ML-specific features of Kubeflow, like model tracking and experimentation.

How to scale down OpenShift/Kubernetes pods automatically on a schedule?

I have a requirement to scale down OpenShift pods at the end of each business day automatically.
How might I schedule this automatically?
OpenShift, like Kubernetes, is an API-driven application. Essentially all application functionality is exposed over the control-plane API running on the master hosts.
You can use any orchestration tool that is capable of making API calls to perform this activity. Information on calling the OpenShift API directly can be found in the official documentation in the REST API Reference Overview section.
Many orchestration tools have plugins that allow you to interact with the OpenShift/Kubernetes API more natively than making raw network calls. In the case of Jenkins, for example, there is the OpenShift Pipeline Jenkins plugin, which allows you to perform OpenShift activities directly from Jenkins pipelines. In the case of Ansible, there is the k8s module.
If you combine this with Jenkins' capability to run jobs on a schedule, you have something that meets your requirements.
For something much simpler, you could just schedule Ansible or bash scripts on a server via cron to execute the appropriate commands against the OpenShift API.
Executing these commands from within OpenShift is also possible via the CronJob object.
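The API call itself is small. As a stdlib-only sketch of what a cron-driven script could send (the cluster URL, token, namespace, and deployment name are hypothetical, and OpenShift DeploymentConfigs use a different API group than the plain Kubernetes Deployment shown here), you PATCH the scale subresource:

```python
import json
import urllib.request


def build_scale_request(api_server, token, namespace, name, replicas):
    """Build a PATCH request against the Kubernetes scale subresource."""
    url = (f"{api_server}/apis/apps/v1/namespaces/"
           f"{namespace}/deployments/{name}/scale")
    body = json.dumps({"spec": {"replicas": replicas}}).encode()
    req = urllib.request.Request(url, data=body, method="PATCH")
    req.add_header("Authorization", f"Bearer {token}")
    req.add_header("Content-Type", "application/merge-patch+json")
    return req


# Scale the hypothetical "web" deployment down to zero at end of day;
# a cron entry would run this script on the desired schedule.
req = build_scale_request(
    "https://openshift.example.com:6443", "TOKEN", "myproject", "web", 0
)
# urllib.request.urlopen(req) would send the call (needs a reachable cluster).
```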

Can Airflow run streaming GCP Dataflow jobs?

I am looking for orchestration software for streaming GCP Dataflow jobs - something that can provide alerting, status, job launching etc., akin to what this does on Kubernetes. The answer here suggests Airflow, as it has some hooks into GCP - this would be nice because we have some other infrastructure that runs on Airflow. However, I am not sure if it would be able to handle streaming jobs - my understanding is that Airflow is designed for tasks that will complete, which is not the case for a streaming job. Is Airflow appropriate for this, or is there different software I should use?
It's probably late, but answering for people who visit this topic in the future.
Yes, you can definitely run a Dataflow streaming job from Airflow. Use Airflow version 1.9 or above.
Links:
https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/hooks/gcp_dataflow_hook.py
https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/operators/dataflow_operator.py
You don't need to put in extra effort to run a streaming job. The above Dataflow operators run both batch and streaming jobs. They mark the Airflow task successful as soon as the Dataflow streaming job starts running (i.e. the job is in a running state).
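The behaviour described above - the task succeeds once the streaming job reaches the running state, rather than waiting for a completion that never comes - can be sketched as a simple polling loop. The `get_state` callable and the simulated state sequence are hypothetical stand-ins for the Dataflow API; only the `JOB_STATE_*` names match Dataflow's real job states:

```python
import time


def wait_until_running(get_state, poll_interval=1.0, timeout=60.0):
    """Poll a job's state and return once it is RUNNING.

    For a streaming job this is the point at which the orchestrator
    considers the task successful: the job never finishes on its own.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        state = get_state()
        if state == "JOB_STATE_RUNNING":
            return state
        if state in ("JOB_STATE_FAILED", "JOB_STATE_CANCELLED"):
            raise RuntimeError(f"job ended in state {state}")
        time.sleep(poll_interval)
    raise TimeoutError("job did not reach RUNNING in time")


# Simulated Dataflow state transitions for illustration.
states = iter(["JOB_STATE_PENDING", "JOB_STATE_PENDING", "JOB_STATE_RUNNING"])
result = wait_until_running(lambda: next(states), poll_interval=0.01)
```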

Are there Dataflow Log Appenders?

Is it possible to register logback appenders in Dataflow?
With Beam, I'm able to define an appender for the DirectRunner, but when I deploy to Dataflow, it no longer seems to work. Is this just my logback.xml getting lost or is it because the runner has its own separate root logger?
The Dataflow runner is in charge of orchestrating and parallelizing your pipeline to run in a distributed environment. As part of that, it manages logging using SLF4J.
If you can get logback to work on top of one of the supported libraries, or simply use one of them directly, you should be able to get your job's log messages into Cloud Logging.
The doc is for the Dataflow SDK, but it should apply equally to Beam: https://cloud.google.com/dataflow/pipelines/logging
