Equivalent of TFX Standard Components in Kubeflow

I have an existing TFX pipeline here that I want to rewrite using the KubeFlow Pipelines SDK.
The existing pipeline uses many of the TFX Standard Components, such as ExampleValidator. Looking through the Kubeflow SDK, I see a kfp.components package but no prebuilt components like the ones TFX provides.
Does the KubeFlow SDK have an equivalent to the TFX Standard Components?

You don’t have to rewrite the components. There is no mapping of TFX components in KFP, because they are not competing tools.
With TFX you create the components, and then you use an orchestrator to run them. Kubeflow Pipelines is one of those orchestrators.
The tfx.orchestration.pipeline will wrap your tfx components and create your pipeline.
There are two schedulers behind Kubeflow Pipelines: Argo (used by GCP) and Tekton (used by OpenShift). There are examples of TFX with Kubeflow Pipelines using Tekton, and of TFX with Kubeflow Pipelines using Argo, in the respective repositories.

Actually, Kubeflow does have a notion of reusable components, which is referenced in the docs. They can be Python-based, YAML-based, and so on. However, there are no 'standard' ones like TFX has. You can see a number of them in the examples repo and create your own reusable ones.
You can treat TFX components and Kubeflow components somewhat interchangeably, though, since TFX components get compiled into the Kubeflow representation by the orchestrator logic. Simply use the KubeflowDagRunner with your TFX pipeline. However, I might be missing something: what is your motivation to rewrite in Kubeflow?
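A minimal sketch of that approach, untested here: keep the existing TFX components unchanged and only swap the runner. The pipeline name, root, and component variables (example_gen, example_validator, ...) are placeholders for whatever the existing pipeline already defines.

```python
# Sketch: reuse the existing TFX components and only change the runner.
from tfx.orchestration import pipeline
from tfx.orchestration.kubeflow import kubeflow_dag_runner

tfx_pipeline = pipeline.Pipeline(
    pipeline_name="my-tfx-pipeline",
    pipeline_root="gs://my-bucket/pipeline-root",  # placeholder path
    components=[example_gen, statistics_gen, schema_gen, example_validator],
)

# Compiles the TFX DAG into a KFP workflow archive
# (my-tfx-pipeline.tar.gz) that can be uploaded to Kubeflow Pipelines.
kubeflow_dag_runner.KubeflowDagRunner().run(tfx_pipeline)
```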


Is it possible to mix kubeflow components with tensorflow extended components?

It looks like Kubeflow has deprecated all of their TFX components. I currently have some custom Kubeflow components that help launch some of my data pipelines, and I was hoping I could use some TFX components in the same Kubeflow pipeline. Is there a recommended approach to mixing Kubeflow and TFX components?
I saw an older PR from Kubeflow deprecating their TFX components:
https://github.com/kubeflow/pipelines/issues/3853
It states:
These components were created to allow the users to use TFX components
in their KFP pipelines, to be able to mix KFP and TFX components. If
your pipeline uses only TFX components, please use the official TFX
SDK.
But I actually do need to mix KFP and TFX components, is there a way to do this?
The short answer is no; the long answer is that you could, if you hack it, but the experience wouldn't be great.
When you look at an example TFX pipeline, it has its own Python DSL. As a user, you define the pipeline components the way you want them to run, and at the very end you can change the target runner (Airflow, Beam, or KFP). TFX will compile its intermediate representation before submitting it to the runner of your choice.
The question then is how you can mix that with other tools. TFX compiles to an Argo workflow DAG, similar to what happens if you use the KFP SDK or Couler. When you use the KubeflowDagRunner, you can find the output Argo YAML for the pipeline. If you repeat the same compilation process with your KFP-native pipeline, you'll have two Argo YAMLs you can merge together for the specific workload you want.
If you are using MLMD, you may need to do some input/output manipulation to make it all work.
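The merge step described above can be modeled in plain Python. This is a hypothetical sketch that combines the spec.templates lists of two compiled workflows under a new entrypoint; real compiled workflows carry much more metadata, and the dicts below are minimal stand-ins.

```python
# Hypothetical sketch: merge two compiled Argo workflows (e.g. one from
# TFX's KubeflowDagRunner, one from the KFP SDK) into a single workflow.

def merge_workflows(wf_a, wf_b, name="merged-pipeline"):
    """Merge two Argo workflow dicts under a single new entrypoint."""
    entry = {
        "name": "merged-entrypoint",
        "steps": [  # each inner list is one sequential stage
            [{"name": "run-a", "template": wf_a["spec"]["entrypoint"]}],
            [{"name": "run-b", "template": wf_b["spec"]["entrypoint"]}],
        ],
    }
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        "metadata": {"generateName": name + "-"},
        "spec": {
            "entrypoint": "merged-entrypoint",
            "templates": wf_a["spec"]["templates"]
                         + wf_b["spec"]["templates"]
                         + [entry],
        },
    }

# Minimal stand-ins for the two compiled pipelines:
tfx_wf = {"spec": {"entrypoint": "tfx-dag", "templates": [{"name": "tfx-dag"}]}}
kfp_wf = {"spec": {"entrypoint": "kfp-dag", "templates": [{"name": "kfp-dag"}]}}
merged = merge_workflows(tfx_wf, kfp_wf)
```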

What is difference between Jenkins Shared Libraries and Jenkins pipeline templates

I am trying to understand the exact difference between Jenkins Shared Libraries and Jenkins pipeline templates.
Shared libraries, as I understand them, are used for keeping common code and making it accessible to multiple pipelines.
What I can't understand is how Jenkins pipeline templates differ from that. Also, what is the use of templates created with the Jenkins Templating Engine? Are they somehow similar to shared libraries?
Maintainer of the Jenkins Templating Engine here.
Shared Libraries
Focused on reusing pipeline code. A Jenkinsfile is still required for each individual application and those individual pipelines have to import the libraries.
Jenkins Templating Engine
A pipeline development framework that enables tool-agnostic pipeline templates.
Rather than creating individual Jenkinsfiles, you can create a centralized set of pipeline templates.
These templates invoke steps such as:
build()
unit_test()
deploy_to dev
This common template can be applied across teams, regardless of the technology they're using.
The build, unit_test, and deploy_to steps would come from libraries.
There may be multiple libraries that implement the build step, such as npm, gradle, maven, etc.
Rather than have each team define an entire pipeline, they can now just declare the tools that should be used to "hydrate" the template via a pipeline configuration file:
libraries{
npm // contributes the build step
}
Feel free to check out this CDF Webinar: Pipeline Templating with the Jenkins Templating Engine.
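As a toy illustration (plain Python rather than Groovy/JTE), the idea of "hydrating" one shared template with a team-chosen library could be modeled like this; the step and library names are made up for illustration.

```python
# Toy model of template hydration: a centralized template invokes
# abstract steps, and each team's pipeline configuration names the
# library that implements them.

TEMPLATE_STEPS = ["build", "unit_test"]  # the shared template

LIBRARIES = {  # hypothetical step libraries
    "npm":    {"build": lambda: "npm run build",
               "unit_test": lambda: "npm test"},
    "gradle": {"build": lambda: "gradle build",
               "unit_test": lambda: "gradle test"},
}

def run_template(config):
    """Resolve each template step against the configured library."""
    steps = LIBRARIES[config["libraries"][0]]
    return [steps[name]() for name in TEMPLATE_STEPS]

# Two teams share one template but declare different tooling:
print(run_template({"libraries": ["npm"]}))     # ['npm run build', 'npm test']
print(run_template({"libraries": ["gradle"]}))  # ['gradle build', 'gradle test']
```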

What are the differences between airflow and Kubeflow pipeline?

Machine learning platforms are one of the buzzwords in business, used to accelerate the development of ML and deep learning models.
A common part of these platforms is a workflow orchestrator or scheduler that helps users build DAGs and schedule and track experiments, jobs, and runs.
Many machine learning platforms include a workflow orchestrator, e.g. Kubeflow Pipelines, FBLearner Flow, and Flyte.
My question is: what are the main differences between Airflow and Kubeflow Pipelines (or other ML platform workflow orchestrators)?
Also, Airflow supports APIs in different languages and has a large community; can we use Airflow to build our ML workflow?
You can definitely use Airflow to orchestrate machine learning tasks, but you probably want to execute the ML tasks remotely with operators.
For example, Dailymotion uses the KubernetesPodOperator to scale Airflow for ML tasks.
If you don't have the resources to set up a Kubernetes cluster yourself, you can use an ML platform like Valohai that has an Airflow operator.
When doing ML in production, you ideally also want to version-control your models to keep track of the data, code, parameters, and metrics of each execution.
You can find more details in this article on Scaling Apache Airflow for Machine Learning Workflows.
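A minimal sketch of that pattern, untested here: an Airflow DAG that pushes the heavy ML work into a Kubernetes pod via the KubernetesPodOperator. The image, namespace, and dag_id are placeholders, and the operator's import path varies between Airflow versions.

```python
# Sketch: run the resource-intensive training step in its own pod
# instead of on the Airflow workers.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import (
    KubernetesPodOperator,
)

with DAG(dag_id="train_model",
         start_date=datetime(2024, 1, 1),
         schedule=None) as dag:
    train = KubernetesPodOperator(
        task_id="train",
        name="train-pod",
        namespace="ml-jobs",
        image="my-registry/trainer:latest",  # hypothetical training image
        cmds=["python", "train.py"],
    )
```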
My question is: what are the main differences between Airflow and Kubeflow Pipelines (or other ML platform workflow orchestrators)?
Airflow pipelines run on the Airflow server (with the risk of bringing it down if a task is too resource-intensive), while Kubeflow pipelines run in dedicated Kubernetes pods. Also, Airflow pipelines are defined as Python scripts, while Kubeflow tasks are defined as Docker containers.
Also, Airflow supports APIs in different languages and has a large community; can we use Airflow to build our ML workflow?
Yes, you can. You could, for example, use an Airflow DAG to launch a training job in a Kubernetes pod running a Docker container that emulates Kubeflow's behaviour; what you will miss are some of Kubeflow's ML-specific features, like model tracking and experimentation.

How to isolate CI pipeline per-branch environments in Kubernetes?

We are developing a CI/CD pipeline leveraging Docker/Kubernetes in AWS. This topic is touched on in Kubernetes CI/CD pipeline.
We want to create (and destroy) a new environment for each SCM branch, from the moment a Git pull request is opened until it is merged.
We will have a Kubernetes cluster available for that.
During prototyping by the dev team, we came up to Kubernetes namespaces. It looks quite suitable: For each branch, we create a namespace ns-<issue-id>.
But that idea was dismissed by our DevOps prototyper without much explanation, just the statement that "we are not doing that because it's complicated due to RBAC", and it's quite hard to get detailed reasons.
However, for CI/CD purposes we need no RBAC: everything can run with unlimited privileges and no quotas; we just need a separate network for each environment.
Is using namespaces for such purposes a good idea? I am still not sure after reading Kubernetes docs on namespaces.
If not, is there a better way? Ideally, we would like to avoid using Helm, as it adds a level of complexity we probably don't need.
We're working on an open source project called Jenkins X, a proposed subproject of the Jenkins Foundation, aimed at automating CI/CD on Kubernetes using Jenkins and GitOps for promotion.
When you submit a pull request, we automatically create a Preview Environment, which is exactly what you describe: a temporary environment used to deploy the pull request for validation and testing before it is approved.
We now use Preview Environments all the time for many reasons and are big fans of them! Each Preview Environment is in a separate namespace so you get all the usual RBAC features from Kubernetes with them.
If you're interested, here's a demo of how to automate CI/CD with multiple environments on Kubernetes, using GitOps for promotion between environments and Preview Environments on pull requests, with Spring Boot and Node.js apps (though we support many languages and frameworks).
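One small, concrete piece of the per-branch scheme from the question is deriving a valid namespace name from a branch or issue id: Kubernetes namespace names must be DNS-1123 labels (lowercase alphanumerics and '-', at most 63 characters). A sketch; the exact sanitization rules below are an assumption, not from the question.

```python
import re

def branch_namespace(branch: str, prefix: str = "ns") -> str:
    """Turn a branch/issue identifier into a legal namespace name
    (lowercase alphanumerics and '-', trimmed to 63 characters)."""
    label = re.sub(r"[^a-z0-9-]+", "-", branch.lower()).strip("-")
    return f"{prefix}-{label}"[:63].rstrip("-")

print(branch_namespace("feature/ISSUE-1234_fix"))  # ns-feature-issue-1234-fix
print(branch_namespace("PR-42"))                   # ns-pr-42
```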

Aggregation of jenkins pipelines

The Jenkins pipeline plugin is awesome.
But is it also possible to aggregate the pipelines of dependent projects, e.g. micro-services?
If you have separate jobs that run pipelines, you can just call build [job name] to invoke subsequent pipelines.
You can use the approach that #ebnius describes, where you have small pipeline jobs and a parent pipeline that orchestrates the complete workflow by calling the different pipelines.
Or you can use the Shared Libraries plugin (https://jenkins.io/doc/book/pipeline/shared-libraries/), where you define, for example, one step per Groovy file and keep the entire structure modularized.
