What are the differences between Airflow and Kubeflow Pipelines? - machine-learning

Machine learning platforms are one of the buzzwords in business, aimed at speeding up the development of ML and deep learning.
A common component is a workflow orchestrator or workflow scheduler that helps users build DAGs and schedule and track experiments, jobs, and runs.
Many machine learning platforms include a workflow orchestrator, like Kubeflow Pipelines, FBLearner Flow, and Flyte.
My question is: what are the main differences between Airflow and Kubeflow Pipelines (or other ML platform workflow orchestrators)?
And since Airflow supports APIs in different languages and has a large community, can we use Airflow to build our ML workflows?

You can definitely use Airflow to orchestrate Machine Learning tasks, but you probably want to execute ML tasks remotely with operators.
For example, Dailymotion uses the KubernetesPodOperator to scale Airflow for ML tasks.
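For illustration, here is a minimal sketch of that pattern (not Dailymotion's actual DAG): an Airflow DAG that runs a training step in its own Kubernetes pod. The image, namespace and command are placeholder assumptions, and the import path assumes the cncf-kubernetes provider package.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

with DAG(
    dag_id="ml_training_example",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # The heavy lifting happens inside the pod, not on the Airflow workers.
    train_model = KubernetesPodOperator(
        task_id="train_model",
        name="train-model",
        namespace="ml-jobs",                    # hypothetical namespace
        image="my-registry/ml-trainer:latest",  # hypothetical training image
        cmds=["python", "train.py"],
        arguments=["--epochs", "10"],
        get_logs=True,
        is_delete_operator_pod=True,            # clean up the pod when finished
    )
```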
If you don't have the resources to set up a Kubernetes cluster yourself, you can use an ML platform like Valohai that has an Airflow operator.
When doing ML in production, ideally you also want to version control your models to keep track of the data, code, parameters and metrics of each execution.
You can find more details in this article on Scaling Apache Airflow for Machine Learning Workflows.

My question is: what are the main differences between Airflow and Kubeflow Pipelines or other ML platform workflow orchestrators?
Airflow pipelines run on the Airflow server (with the risk of bringing it down if the task is too resource intensive), while Kubeflow pipelines run in dedicated Kubernetes pods. Also, Airflow pipelines are defined as Python scripts, while Kubeflow tasks are defined as Docker containers.
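To make that contrast concrete, here is a minimal sketch using the v1 Kubeflow Pipelines SDK, where each step is just a Docker container and the pipeline wires them together; the image names are placeholders.

```python
import kfp
from kfp import dsl


@dsl.pipeline(name="train-pipeline", description="Each step runs as its own container.")
def train_pipeline():
    # Every step is a container image; the DSL only describes how they connect.
    preprocess = dsl.ContainerOp(
        name="preprocess",
        image="my-registry/preprocess:latest",  # hypothetical image
        command=["python", "preprocess.py"],
    )
    train = dsl.ContainerOp(
        name="train",
        image="my-registry/train:latest",       # hypothetical image
        command=["python", "train.py"],
    )
    train.after(preprocess)                     # explicit step dependency


if __name__ == "__main__":
    # Compile to a workflow spec that the Kubeflow Pipelines service can run.
    kfp.compiler.Compiler().compile(train_pipeline, "train_pipeline.yaml")
```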
And since Airflow supports APIs in different languages and has a large community, can we use Airflow to build our ML workflows?
Yes, you can. For example, you could use an Airflow DAG to launch a training job in a Kubernetes pod running a Docker container, emulating Kubeflow's behaviour. What you will miss are some ML-specific features from Kubeflow, like model tracking or experimentation.

Related

Does Docker Compose Conflict with Turborepo Pipelines?

I have a turbo monorepo that I want to perform end-to-end tests on in a CI environment. All of my applications are containerized, and some external services are hosted in containers during development via Docker Compose.
I was having trouble working out how Docker was supposed to fit into Turborepo during development, and I realized they accomplish a lot of the same things:
Compose allows services to define their dependencies, and so long as tests are run during the build phase, the results can be cached. Multi-stage builds and service profiles / multiple compose files can be configured to represent more complicated and interdependent tasks.
This seems to be able to accomplish the same thing as Turborepo pipelines, with the bonus that everything is containerized during development. However, turbo pipelines are in my opinion much more user friendly for this use case, although they cannot orchestrate several containerized applications.
So my question is: does the pipeline feature of Turborepo conflict with Docker for development? If I want to containerize my applications during development, should I forgo using pipelines completely? Or is there a more preferable setup, for example where each containerized application has “up” and “down” scripts for starting their containers that turbo leverages?

Can Apache Beam Pipeline be used for batch orchestration?

I am a newbie in the Apache Beam environment.
I am trying to fit an Apache Beam pipeline for batch orchestration.
My definition of a batch is as follows:
Batch ==> a set of jobs
Job ==> can have one or more sub-jobs
There can be dependencies between jobs/sub-jobs.
Can an Apache Beam pipeline be mapped to my custom batch?
Apache Beam is a unified model for developing both batch and streaming pipelines, which can be run on Dataflow. You can create and deploy your pipeline using Dataflow. Beam pipelines are portable, so you can use any of the available runners according to your requirements.
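As a small illustration of that portability, here is a minimal sketch of a batch Beam pipeline in Python; the input/output paths are placeholders, and the runner is selected at launch time (DirectRunner locally, DataflowRunner on GCP).

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Swap "DirectRunner" for "DataflowRunner" (plus project/region/temp_location
# options) to run the same pipeline on Dataflow.
options = PipelineOptions(runner="DirectRunner")

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("input.txt")         # hypothetical input
        | "LineLengths" >> beam.Map(lambda line: str(len(line)))
        | "Write" >> beam.io.WriteToText("line_lengths")      # hypothetical output
    )
```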
Cloud Composer can be used for batch orchestration as per your requirement. Cloud Composer is built on Apache Airflow. Apache Beam and Apache Airflow can be used together, since Airflow can be used to trigger Beam jobs. Since you have custom jobs running, you can configure Beam and Airflow for batch orchestration.
Airflow is meant to perform orchestration and pipeline dependency management, while Beam is used to build data pipelines that are executed on data processing systems.
I believe Composer might be better suited for what you're trying to make. From there, you can launch Dataflow jobs from your environment using Airflow operators (for example, if you're using Python, you can use the DataflowCreatePythonJobOperator).
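For example, a minimal sketch of a Composer/Airflow DAG that launches a Beam job on Dataflow with that operator; the bucket paths, project and region are placeholder assumptions.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataflow import (
    DataflowCreatePythonJobOperator,
)

with DAG(
    dag_id="beam_batch_orchestration",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Airflow handles the orchestration/dependencies; Dataflow executes the Beam job.
    run_beam_job = DataflowCreatePythonJobOperator(
        task_id="run_beam_job",
        py_file="gs://my-bucket/pipelines/my_beam_job.py",  # hypothetical GCS path
        job_name="beam-batch-job",
        location="us-central1",                             # hypothetical region
        dataflow_default_options={
            "project": "my-gcp-project",                    # hypothetical project
            "temp_location": "gs://my-bucket/tmp/",
        },
    )

    # Dependencies between jobs/sub-jobs become dependencies between Airflow tasks,
    # e.g. run_beam_job >> next_job.
```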

How to scale down OpenShift/Kubernetes pods automatically on a schedule?

I have a requirement to scale down OpenShift pods at the end of each business day automatically.
How might I schedule this automatically?
OpenShift, like Kubernetes, is an API-driven application. Essentially all application functionality is exposed over the control-plane API running on the master hosts.
You can use any orchestration tool that is capable of making API calls to perform this activity. Information on calling the OpenShift API directly can be found in the official documentation in the REST API Reference Overview section.
Many orchestration tools have plugins that allow you to interact with the OpenShift/Kubernetes API more natively than running network calls directly. In the case of Jenkins, for example, there is the OpenShift Pipeline Jenkins plugin that allows you to perform OpenShift activities directly from Jenkins pipelines. In the case of Ansible, there is the k8s module.
If you combine this with Jenkins's capability to run jobs on a schedule, you have something that meets your requirements.
For something much simpler, you could just schedule Ansible or bash scripts on a server via cron to execute the appropriate calls against the OpenShift API.
Executing these commands from within OpenShift would also be possible via the CronJob object.
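As a concrete sketch of the cron/CronJob approach, assuming your workloads are plain Kubernetes Deployments (OpenShift DeploymentConfigs would need a different API call) and that the Python kubernetes client has credentials with permission to scale them; the deployment and namespace names are placeholders.

```python
from kubernetes import client, config


def scale_deployment(name: str, namespace: str, replicas: int) -> None:
    """Patch the deployment's scale subresource to the desired replica count."""
    config.load_kube_config()  # use config.load_incluster_config() inside a pod
    apps = client.AppsV1Api()
    apps.patch_namespaced_deployment_scale(
        name=name,
        namespace=namespace,
        body={"spec": {"replicas": replicas}},
    )


if __name__ == "__main__":
    # Run this from cron at the end of the business day, or bake it into an image
    # and schedule it with a CronJob object inside the cluster.
    scale_deployment(name="web", namespace="my-project", replicas=0)
```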

How to isolate CI pipeline per-branch environments in Kubernetes?

We are developing a CI/CD pipeline leveraging Docker/Kubernetes in AWS. This topic is touched on in Kubernetes CI/CD pipeline.
We want to create (and destroy) a new environment for each SCM branch, from the moment a Git pull request is opened until it is merged.
We will have a Kubernetes cluster available for that.
During prototyping, the dev team came up with Kubernetes namespaces. It looks quite suitable: for each branch, we create a namespace ns-<issue-id>.
But that idea was dismissed by the DevOps prototyper without much explanation, just stating that "we are not doing that because it's complicated due to RBAC". And it's quite hard to get some detailed reasons.
However, for CI/CD purposes, we need no RBAC - everything can run with unlimited privileges and no quotas; we just need a separate network for each environment.
Is using namespaces for such purposes a good idea? I am still not sure after reading Kubernetes docs on namespaces.
If not, is there a better way? Ideally, we would like to avoid using Helm, as it adds a level of complexity we probably don't need.
We're working on an open source project called Jenkins X, which is a proposed sub-project of the Jenkins Foundation aimed at automating CI/CD on Kubernetes using Jenkins and GitOps for promotion.
When you submit a Pull Request we automatically create a Preview Environment which is exactly what you describe - a temporary environment which is used to deploy the pull request for validation, testing & approval before the pull request is approved.
We now use Preview Environments all the time for many reasons and are big fans of them! Each Preview Environment is in a separate namespace so you get all the usual RBAC features from Kubernetes with them.
If you're interested here's a demo of how to automate CI/CD with multiple environments on Kubernetes using GitOps for promotion between environments and Preview Environments on Pull Requests - using Spring Boot and nodejs apps (but we support many languages + frameworks).
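If you wire this up yourself instead of adopting a tool like Jenkins X, the core per-branch operation is just creating and deleting a namespace such as ns-<issue-id>. Here is a minimal sketch with the Python kubernetes client (this is not how Jenkins X implements it, and the issue id is a placeholder).

```python
from kubernetes import client, config


def create_preview_namespace(issue_id: str) -> None:
    """Create the ns-<issue-id> namespace when a pull request is opened."""
    config.load_kube_config()
    core = client.CoreV1Api()
    namespace = client.V1Namespace(
        metadata=client.V1ObjectMeta(name=f"ns-{issue_id}", labels={"preview": "true"})
    )
    core.create_namespace(body=namespace)


def delete_preview_namespace(issue_id: str) -> None:
    """Tear the environment down once the pull request is merged or closed."""
    config.load_kube_config()
    core = client.CoreV1Api()
    core.delete_namespace(name=f"ns-{issue_id}")


if __name__ == "__main__":
    create_preview_namespace("1234")  # hypothetical issue id
    delete_preview_namespace("1234")
```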

Can Airflow run streaming GCP Dataflow jobs?

I am looking for orchestration software for streaming GCP Dataflow jobs - something that can provide alerting, status, job launching etc. akin to what this does on Kubernetes. The answer here suggests Airflow, as it has some hooks into GCP - this would be nice because we have some other infrastructure that runs on Airflow. However, I am not sure if it would be able to handle streaming jobs - my understanding is that Airflow is designed for tasks that will complete, which is not the case for a streaming job. Is Airflow appropriate for this? Or is there different software I should use?
It's probably late, but I'm answering for people who visit this topic in the future.
Yes, you can definitely run a Dataflow streaming job from Airflow. Use Airflow version 1.9 or above.
Links:
https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/hooks/gcp_dataflow_hook.py
https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/operators/dataflow_operator.py
You don't need to put in any extra effort to run a streaming job. The above Dataflow operators run both batch and streaming jobs. They mark the Airflow task successful as soon as the Dataflow streaming job starts running (i.e. the job is in the running state).
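A minimal sketch, assuming Airflow 1.9+ with the contrib Dataflow operator from the links above (in current Airflow releases the equivalent operators live in the Google provider package); the paths and project are placeholders. Whether the job is batch or streaming is determined by the Beam pipeline itself (e.g. an unbounded Pub/Sub source); the operator just launches it on Dataflow and marks the task successful once the job is running.

```python
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.dataflow_operator import DataFlowPythonOperator

with DAG(
    dag_id="dataflow_streaming_launch",
    start_date=datetime(2018, 1, 1),
    schedule_interval=None,  # launch on demand; the streaming job keeps running
) as dag:
    launch_streaming_job = DataFlowPythonOperator(
        task_id="launch_streaming_job",
        py_file="gs://my-bucket/pipelines/streaming_job.py",  # hypothetical path
        dataflow_default_options={
            "project": "my-gcp-project",                      # hypothetical project
            "temp_location": "gs://my-bucket/tmp/",
        },
    )
```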
