What is the best option for building Kubeflow components?

I have been reading about Kubeflow, and there are two ways to create components:
Container-based
Function-based
But there isn't an explanation of why I should use one or the other. For example, to use a container-based component I need to build and push a Docker image and load the YAML specification into the pipeline, whereas with a function-based component I only need to import the function.
And in order to apply CI/CD with the latest version: if the components are container-based, I can keep a repo with all the YAML files and load them with load_by_url, but if they are functions, I can keep a repo with all of them and load them as a package too.
So what do you think is the best approach, container-based or function-based?
Thanks.

The short answer is "it depends", but a more nuanced answer is that it depends on what you want to do with the component.
As background, when a KFP pipeline is compiled, it's actually a series of YAML definitions that are launched by Argo Workflows. All of these need to be container-based to run on Kubernetes, even if the container itself only runs Python.
Turning a function into a Python container op is a quick way to get started with Kubeflow Pipelines. It was designed to model Airflow's Python-native DSL. It will take your Python function and run it within a defined Python container. You're right that it's easier to encapsulate all your work within the same Git folder. This setup is great for teams that are just getting started with KFP and don't mind some boilerplate to get going quickly.
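For reference, a minimal sketch of that quick-start flow with the KFP v1 SDK; the function, pipeline name, and base image here are placeholders, not anything prescribed by Kubeflow:

from kfp import dsl, compiler
from kfp.components import create_component_from_func

def add(a: float, b: float) -> float:
    # Runs inside the container image that KFP generates for this function.
    return a + b

# Wrap the function; no Dockerfile or image push is needed for this path.
add_op = create_component_from_func(add, base_image='python:3.9')

@dsl.pipeline(name='quick-start')
def quick_start_pipeline(a: float = 1.0, b: float = 2.0):
    add_task = add_op(a=a, b=b)

compiler.Compiler().compile(quick_start_pipeline, 'quick_start_pipeline.yaml')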
Components really become powerful when your team needs to share work, or when you have an enterprise ML platform that creates template logic for how to run specific jobs in a pipeline. Components can be versioned and built separately and then used on any of your clusters in the same way (the underlying container should be stored in Docker Hub or in ECR, if you're on AWS). Inputs and outputs prescribe how a run will execute using the component. You can imagine a team at Uber might use a KFP pipeline to pull data on the number of drivers in a certain zone. The inputs to the component could be a geo-coordinate box and the time of day for which to load the data. The component saves the data to S3, which is then loaded into your model for training. Without the component, there would be quite a bit of boilerplate code that would need to be copied across multiple pipelines and users.
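To make that concrete, here is a sketch of how such a published, versioned component might be consumed; the URL, component name, input names, and output name are all hypothetical, not a real component:

from kfp import dsl
from kfp.components import load_component_from_url

# Hypothetical shared component published by a platform team (placeholder URL).
load_driver_data_op = load_component_from_url(
    'https://raw.githubusercontent.com/your-org/components/v1.2.0/load_driver_data/component.yaml')

@dsl.pipeline(name='driver-count-training')
def driver_pipeline(geo_box: str = '37.70,-122.51,37.81,-122.35', time_of_day: str = '08:00-10:00'):
    # The component's inputs/outputs are declared in its component.yaml; here it is
    # assumed to write its result to S3 and expose the S3 URI as an output named 'data_uri'.
    load_task = load_driver_data_op(geo_box=geo_box, time_of_day=time_of_day)
    # load_task.outputs['data_uri'] could then be passed on to a training component.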
I'm a former PM at AWS for SageMaker and open-source ML integrations, and I'm sharing from my experience looking at enterprise setups.

But there isn't an explanation of why I should use one or the other. For example, to use a container-based component I need to build and push a Docker image and load the YAML specification into the pipeline, whereas with a function-based component I only need to import the function.
There are some misconceptions here.
There is only one kind of component under the hood: the container-based component (there are also graph components, but they're irrelevant here).
However, most of our users like Python and do not like building containers. This is why I developed a feature called "Lightweight Python components" which generates the ComponentSpec/component.yaml from a Python function's source code. The generated component basically runs python3 -u -c '<your function>; <command-line parsing>' arg1 arg2 ....
There is a misconception that "function-based components are different from component.yaml files".
No, it's the same format. You're supposed to save the generated component into a file for sharing: create_component_from_func(my_func, output_component_file='component.yaml'). After your code stabilizes, you should upload the code and the component.yaml to GitHub or another place and use load_component_from_url to load that component.yaml in pipelines.
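A minimal sketch of that save-then-share flow with the KFP v1 SDK; the function, base image, and URL are placeholders:

from kfp.components import create_component_from_func, load_component_from_url

def normalize(text: str) -> str:
    # Keep the function self-contained; any imports it needs must live inside the body.
    return text.strip().lower()

# Generate component.yaml so it can be published (e.g. pushed to GitHub) together with this file.
create_component_from_func(
    normalize,
    output_component_file='component.yaml',
    base_image='python:3.9')

# Later, pipelines load the published file by URL (placeholder URL):
normalize_op = load_component_from_url(
    'https://raw.githubusercontent.com/your-org/components/main/normalize/component.yaml')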
Check the component.yaml files in the KFP repo. More than half of them are Lightweight components - they're generated from Python functions.
component.yaml files are intended for sharing components. They're declarative, portable, indexable, safe, language-agnostic, etc. You should always publish component.yaml files. If a component.yaml is generated from a Python function, then it's good practice to put the component.py alongside it so that the component can easily be regenerated when making changes.
The decision of whether to create a component using the Lightweight Python component feature is very simple:
Is your code a self-contained Python function (not a CLI program yet)? Do you want to avoid building, pushing and maintaining containers? If yes, then the Lightweight Python component feature (create_component_from_func) can help you and generate the component.yaml for you.
Otherwise, write component.yaml yourself.

What is the best practice for using Kubernetes for local development?

I am using Docker containers and have docker-compose files for both the local development and production environments. I want to try Google Cloud Platform for my new app, specifically Google Kubernetes Engine. My setup is Docker for Mac with Kubernetes on the local machine.
It is super important for developers to be able to change code and see the changes live during local development.
Use cases:
A backend developer makes changes to a basic Flask API (or whatever you use) and should see the changes in the reloaded app immediately.
A frontend developer makes changes to the HTML layout and should see the changes on the web page immediately.
At the moment I am using docker-compose files to mount the source code into local containers. But Kubernetes does not support relative paths for mounting source code.
Ideally I should be able to set the field
Deployment.spec.templates.spec.containers.volumes.hostPath
to a relative path to my repo. For example, in our team developers clone the repo to these folders:
/User/BACKEND_developer/code/project_repo
/User/FRONTEND_developer/code/project_repo
Obviously you can't commit and build the image after every little change to the source code.
So what is the best practice for local development with Kubernetes? Do I need some additional tools to modify the .yaml files for every developer?
@tgogos is right.
The best way to achieve your goal is to use Skaffold.
It will rebuild the container whenever it sees changes in the source code.
Skaffold has a pluggable architecture that allows you to choose the tools in the developer workflow that work best for you.
A very promising approach for dynamic languages is the hybrid approach recently introduced by Skaffold, which lets you take advantage of the usual auto-reload mechanisms. You can define two sets of files:
Changing a file in the first set triggers the full rebuild+push+deploy mechanism.
Changing a file in the second set only syncs the file between the local machine and the container.
Such a hybrid approach is well suited to a large class of technology stacks, like Node.js, React, Angular and Python, where you can use the native hot-reload mechanism for source code changes and trigger the full rebuild only when it's needed (for example, adding a dependency). This helps a lot in keeping the latency low.
I spoke about this in my recent talk at All Day DevOps. Here there's an example based on Node.js.

How to display configuration differences between two Jenkins builds?

I want to display non-code differences between the current build and the latest known successful build on Jenkins.
By non-code differences I mean things like:
Environment variables, including Jenkins parameters (set), maybe with some filter
Versions of system tool packages (rpm -qa | sort)
Versions of Python packages installed (pip freeze)
While I know how to save and archive these files as part of the build, the part that is not clear is how to generate the diff/change report of the differences found between the current build and the last successful build.
Please note that I am looking for a pipeline-compatible solution, and ideally I would prefer to make this report easily accessible in the Jenkins UI, like we currently have with SCM changelogs.
Or to rephrase this: how do I create a build manifest and diff it against the last known successful one? If anyone knows a standard manifest format that can easily be used to combine all this information, that would be great.
You always ask the most baller questions, nice work. :)
We always try to push as many things into code as possible because of the same sort of lack of traceability you're describing with non-code configuration. We start by using Jenkinsfiles, so we capture a lot of the build configuration there (in a way that still shows changes in source control). For system tool packages, we get that into the app by using Docker and by inheriting from a specific tag of the Docker base image. So even if we want to change system packages or even the Python version, for example, that would manifest as an update of the FROM line in the app's Dockerfile. Even environment variables can be micromanaged by Docker, to address your other example. There's more detail about how we try to sidestep your question at https://jenkins.io/blog/2017/07/13/speaker-blog-rosetta-stone/.
There will always be things that are hard to capture as code, and builds will therefore still fail and be hard to debug occasionally, so I hope someone pipes up with a clean solution to your question.
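For the diff-report part of the question, here is a minimal Python sketch. It assumes each build archives a pip freeze manifest as pip-manifest.txt and that the last successful build's copy has been fetched into a last-successful/ folder; both names are assumptions, not Jenkins conventions:

import difflib
from pathlib import Path

def diff_manifests(previous: str, current: str) -> str:
    # Produce a unified diff between two archived manifest files.
    old = Path(previous).read_text().splitlines(keepends=True)
    new = Path(current).read_text().splitlines(keepends=True)
    return ''.join(difflib.unified_diff(old, new, fromfile=previous, tofile=current))

report = diff_manifests('last-successful/pip-manifest.txt', 'pip-manifest.txt')
print(report or 'No dependency changes since the last successful build.')

The same function works for the rpm -qa | sort output or a dump of filtered environment variables, and the resulting text can be archived as a build artifact so it shows up next to the other build results.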

Dockerfile vs Docker image

I'm working on creating some Docker images to be used for testing on dev machines. I plan to build one for our main app as well as one for each of our external dependencies (Postgres, Elasticsearch, etc.). For the main app, I'm struggling with the decision of writing a Dockerfile or building an image to be hosted.
On one hand, a Dockerfile is easy to share and modify over time. On the other hand, I expect that advanced configuration (customizing application property files) will be much easier to do in vim before simply committing a new image.
I understand that I can get to the same result either way, but I'm looking for the PROS, CONS, and gotchas of either direction.
As a side note, I plan on wrapping this all together using Fig. My initial impression of this tool has been very positive.
Thanks!
Using a Dockerfile:
You have an 'audit log' that describes how the image is built. For me this is fundamental if it is going to be used in a production pipeline where more people are working and maintainability should be a priority.
You can automate the building process of your image, which makes it easy to keep the container up to date with system updates or to have it take part in a continuous delivery pipeline.
It is a cleaner way of creating the layers of your container (each Dockerfile command is a different layer).
Changing a container and committing the changes is great for testing purposes and for quickly developing a proof of concept. But if you plan to use the resulting image for some time, I would definitely use Dockerfiles.
Apart from this, if modifying a file with bash tools (awk, sed...) turns out to be very tedious, you can add any file you wish from outside during the building process.
I totally agree with Javier, but you need to understand that an image created with a Dockerfile can be different from an image built with the same version of the Dockerfile one day later.
Maybe in your build process you automatically retrieve the latest updates of an app or the OS, etc.
And in that case, if you need to reproduce a crash or whatever, you can't rely on the Dockerfile.

With a Composer repository, how can one determine reverse package requirements?

For a given version of a package, I need to find the packages which require it.
I can reinvent the wheel by crafting some kind of parser to go against our Satis repo's packages.json, but surely there's an easier way that's already present in the Composer API?
The use case for this is a build pipeline I am constructing on our Jenkins CI server. It responds to commits to our top-level master Composer project, which is the moving version against which we need to retrieve and assemble (via composer require) each package in our Satis repo that depends on it, applying fuzzy version matching.
I don't have internal knowledge of the Composer API - my approach is to use the published CLI interface to get the job done and to avoid fancy stuff, because the Composer project is still heavily a work in progress. That being said:
You cannot compile a list of "this package A is being used by all these packages" in the general case, because that would mean you'd have to scan ALL packages existing in the world. That job would never end.
However, Satis will compile a list of packages detected to use a certain package, which is rendered in the HTML template just for information. So for a reasonably small world it is possible to detect the dependencies and create a reverse dependency map. But I don't think this is a reliable feature within Composer, because for the usual use case Composer only evaluates forward dependencies, never reverse ones. There is no use case within Composer for creating these relations, so chances are high that Satis doesn't expose them in the regular case either.
You can, however, try to output something more machine-readable than the rendered HTML to get the "packages using this" info - it is passed as an array variable into the template. It shouldn't be too difficult to output something else.
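If parsing packages.json directly turns out to be the path of least resistance after all, here is a minimal sketch in Python of building the reverse-requirements map. It assumes the classic Composer repository layout that Satis generates, with a top-level "packages" map of package names to version entries, and the package name queried at the end is just a placeholder:

import json
from collections import defaultdict

with open('packages.json') as fh:
    repo = json.load(fh)

# Map each required package to the set of "name version" pairs that require it.
reverse = defaultdict(set)
for name, versions in repo.get('packages', {}).items():
    for version, metadata in versions.items():
        for required in metadata.get('require', {}):
            reverse[required].add(f'{name} {version}')

print(sorted(reverse.get('vendor/some-package', set())))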
