Mounting an existing volume from a Kubeflow Pipeline: kfp.VolumeOp creates a new volume instead of creating a PVC for an existing volume - kubeflow

I am new to KubeFlow and trying to port/adapt an existing solution to run in KubeFlow pipelines. The issue I am solving now is that the existing solution shares data via a mounted volume. I know this is not best practice for exchanging data between components in KubeFlow; however, this is a temporary proof of concept and I have no other choice.
I am facing issues with accessing an existing volume from the pipeline. I am basically running the code from the KubeFlow documentation here, but pointing it at an existing K8S volume:
def volume_op_dag():
    vop = dsl.VolumeOp(
        name="shared-cache",
        resource_name="shared-cache",
        size="5Gi",
        modes=dsl.VOLUME_MODE_RWO
    )
The volume shared-cache already exists in the cluster.
However, when I run the pipeline, a new volume is created instead.
What am I doing wrong? I obviously don't want to create a new volume every time I run the pipeline but instead mount an existing one.
Edit: Adding KubeFlow versions:
kfp (1.8.13)
kfp-pipeline-spec (0.1.16)
kfp-server-api (1.8.3)

Have a look at the function kfp.onprem.mount_pvc. You can find values for the arguments pvc_name and volume_name via the console command
kubectl -n <your-namespace> get pvc
The way you use it is by writing the component as if the volume is already mounted and following the example from the doc when binding it in the pipeline:
train = train_op(...)
train.apply(mount_pvc('claim-name', 'pipeline', '/mnt/pipeline'))
Also note that both the volume and the pipeline must be in the same namespace.
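If it helps, here is a minimal sketch of a whole pipeline using kfp.onprem.mount_pvc with the PVC from the question; the component name, image, and mount path are just placeholders, not a definitive implementation:

from kfp import dsl
from kfp.onprem import mount_pvc

@dsl.pipeline(name="shared-cache-pipeline")
def shared_cache_pipeline():
    # Write the step as if /mnt/pipeline were already mounted.
    step = dsl.ContainerOp(
        name="use-cache",                          # placeholder component
        image="busybox",                           # placeholder image
        command=["sh", "-c", "ls /mnt/pipeline"],
    )
    # Bind the existing PVC instead of creating a new one with VolumeOp.
    step.apply(mount_pvc(pvc_name="shared-cache",
                         volume_name="shared-cache",
                         volume_mount_path="/mnt/pipeline"))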

Related

rbind usage on local volume mounting

I have a directory that is configured to be managed by an automounter (as described here). I need to use this directory (and all directories that are mounted inside) in multiple pods as a Local Persistent Volume.
I am able to trigger the automounter within the containers, but there are some use cases where this directory is not empty when the container starts up. This makes sub-directories appear empty and unable to trigger the automounter (within the container).
I did some investigation and discovered that when using Local PVs, there is a mount -o bind command between the source directory and some internal directory managed by the kubelet (this is the line in the source code).
What I actually need is rbind (recursive bind - here is a good explanation).
Using rbind also requires changes to the part that unmounts the volume (recursive unmounting is needed).
I don't want to patch the kubelet and recompile it... yet.
So my question is: are there any official methods to provide Kubernetes with a custom mounter/unmounter?
Meanwhile, I did find a solution for this use-case.
Based on the Kubernetes docs, there is something called Out-of-Tree Volume Plugins:
The out-of-tree volume plugins include the Container Storage Interface (CSI) and FlexVolume. They enable storage vendors to create custom storage plugins without adding them to the Kubernetes repository.
Even though CSI is the encouraged approach, I chose FlexVolume to implement my custom driver. Here is the detailed documentation.
The driver is actually a Python script that supports three actions: init, mount, and unmount (mount --rbind is used to mount the directory managed by the automounter, and it is unmounted like this). It is deployed using a DaemonSet (docs here).
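As an illustration only, a stripped-down sketch of such a driver could look like the following; the "source" option name, the unmount flags, and the error handling are assumptions, not the exact script I deployed:

#!/usr/bin/env python3
# Minimal FlexVolume driver sketch: init / mount / unmount only.
import json
import subprocess
import sys

def reply(status, **extra):
    # FlexVolume drivers answer the kubelet with a JSON status on stdout.
    print(json.dumps(dict(status=status, **extra)))
    sys.exit(0 if status == "Success" else 1)

def main():
    action = sys.argv[1] if len(sys.argv) > 1 else ""
    if action == "init":
        # No attach/detach support is needed for a local bind mount.
        reply("Success", capabilities={"attach": False})
    elif action == "mount":
        mount_dir, options = sys.argv[2], json.loads(sys.argv[3])
        source = options["source"]  # assumed option carrying the automounted path
        # Recursive bind so sub-mounts created by the automounter stay visible.
        subprocess.check_call(["mount", "--rbind", source, mount_dir])
        reply("Success")
    elif action == "unmount":
        # Recursive unmount to undo the rbind.
        subprocess.check_call(["umount", "-R", sys.argv[2]])
        reply("Success")
    else:
        reply("Not supported", message="Unknown action: %s" % action)

if __name__ == "__main__":
    main()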
And this is it!

CI testing with docker-compose on Jenkins with Kubernetes

I have tests that I run locally using a docker-compose environment.
I would like to implement these tests as part of our CI using Jenkins with Kubernetes on Google Cloud (following this setup).
I have been unsuccessful because docker-in-docker does not work.
It seems that right now there is no solution for this use case. I have found other questions related to this issue: here and here.
I am looking for solutions that will let me run docker-compose. I have found solutions for running docker, but not for running docker-compose.
I am hoping someone else has had this use-case and found a solution.
Edit: Let me clarify my use-case:
1. When I detect a valid trigger (i.e. a push to the repo) I need to start a new job.
2. I need to set up an environment with multiple dockers/instances (docker-compose).
3. The instances in this environment need access to code from git (mount volumes / create new images with the data).
4. I need to run tests in this environment.
5. I then need to retrieve results from these instances (JUnit test results for Jenkins to parse).
The problems I am having are with steps 2 and 3.
For step 2 there is a problem running this in parallel (more than one job) since the Docker context is shared (docker-in-docker issues). If this runs on more than one node, I get clashes because of shared resources (ports, for example). My workaround is to limit it to one running instance and queue the rest (not ideal for CI).
For step 3 there is a problem mounting volumes since the Docker context is shared (docker-in-docker issues). I cannot mount the code that I check out in the job because it is not present on the host responsible for running the Docker instances that I trigger. My workaround is to build a new image from my template, copy the code into the new image, and use that for the test (this works, but means I need docker cp tricks to get data back out, which is also not ideal).
I think the better way is to use pure Kubernetes resources to run the tests directly on Kubernetes, not via docker-compose.
You can convert your docker-compose files into Kubernetes resources using the kompose utility.
You will probably need to adapt the conversion result, or you may have to convert your docker-compose objects into Kubernetes objects manually. Possibly, you can just use Jobs with multiple containers instead of a combination of Deployments and Services.
Anyway, I definitely recommend using Kubernetes abstractions instead of running tools like docker-compose inside Kubernetes.
Moreover, you will still be able to run tests locally using Minikube to spawn a small all-in-one cluster right on your PC.

Create environment variables for Kubernetes main container in Kubernetes Init container

I use a Kubernetes init container to provision the application's database. After this is done, I want to provide the DB's credentials to the main container via environment variables.
How can this be achieved?
I don't want to create a Kubernetes Secret inside the Init container, since I don't want to save the credentials there!
I see several ways to achieve what you want:
From my perspective, the best way is to use a Kubernetes Secret; @Nebril has already suggested that in the comments. You can generate it from the init container and remove it with a PreStop hook, for example. But you don't want to go that way.
You can use a shared volume used by both the init container and your main container. The init container generates an environment-variables file db_cred.env in the volume, which you can mount, for example, at the /env path. After that, you can load it by modifying the command of your container in the Pod spec and adding source /env/db_cred.env before the main script that starts your application. @user2612030 already gave you that idea (see the sketch after this list).
Another alternative is Vault by HashiCorp; you can use it as storage for all your credentials.
You can use a custom solution to read and write directly to etcd from Kubernetes apps. Here is a library example: k8s-kv.
But anyway, the best and most proper way to store credentials in Kubernetes is Secrets. It is more secure and easier than almost any other approach.
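As an illustration of the shared-volume approach above, if the main container happens to run a Python application, it could also parse the file itself instead of relying on source; the path and the KEY=VALUE format are assumptions:

import os

def load_env_file(path="/env/db_cred.env"):
    # Parse simple KEY=VALUE lines written by the init container (assumed format).
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#"):
                key, _, value = line.partition("=")
                os.environ[key] = value

load_env_file()
db_password = os.environ["DB_PASSWORD"]  # hypothetical variable name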

Docker stack to Kubernetes

I am quite familiar with Docker, but I have zero experience on Kubernetes.
I have a Docker stack (multi-container) software that I can deploy in a Docker Swarm cluster. I was wondering if Kubernetes has something similar? I don't need replicas, auto scaling and so on... I just need a group of containers working together, with their dependencies and networks defined in a single text file.
I have searched and found a tool called kompose that translates the Docker stack file to Kubernetes syntax... However, it looks like the output is a list of *.yaml files, instead of a single file.
So, I came to the conclusion that Kubernetes does not have this exact functionality. Am I missing something?
You can copy the content of the generated files into one file and separate them with ---.
For instance, if you've got 3 Kubernetes files: service.yml, deployment.yml and configmap.yml, your file should look something like:
# content of service.yml
....
---
# content of deployment.yml
....
---
# content of configmap.yml
....
You would use the same kubectl commands to create, read, update, and delete resources using this combined spec file.
A single 'Docker Stack' YAML definition is equivalent to a collection of Kubernetes Deployments and Services. Services in a Docker Stack definition can also reach one another via a default overlay network automatically created by Docker at deploy time. To simulate this in Kubernetes, you would need to define multiple Deployments/Services within the same file so that they could be created and deleted as a single 'stack'.

Using custom docker containers in Dataflow

From this link I found that Google Cloud Dataflow uses Docker containers for its workers: Image for Google Cloud Dataflow instances
I see it's possible to find out the image name of the docker container.
But is there a way I can get this Docker container (i.e. from which repository do I get it?), modify it, and then point my Dataflow job at this new Docker container?
The reason I ask is that we need to install various C++, Fortran, and other library code on our Docker images so that the Dataflow jobs can call them, but these installations are very time consuming, so we don't want to use the "resource" property option in df.
Update for May 2020
Custom containers are only supported within the Beam portability framework.
Pipelines launched within the portability framework currently must pass --experiments=beam_fn_api explicitly (as a user-provided flag) or implicitly (for example, all Python streaming pipelines pass it).
See the documentation here: https://cloud.google.com/dataflow/docs/guides/using-custom-containers?hl=en#docker
There will be more Dataflow-specific documentation once custom containers are fully supported by the Dataflow runner. For support of custom containers in other Beam runners, see: http://beam.apache.org/documentation/runtime/environments.
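For reference, a hedged sketch of launching a Python pipeline with a custom worker image under the portability framework might look roughly like this; the project, region, bucket, and image names are placeholders, and depending on your Beam SDK version the flag is --worker_harness_container_image or --sdk_container_image:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-project",                # placeholder
    "--region=us-central1",                # placeholder
    "--temp_location=gs://my-bucket/tmp",  # placeholder
    "--experiments=beam_fn_api",           # enables the portability framework
    "--worker_harness_container_image=gcr.io/my-project/custom-beam-python:latest",  # placeholder image
])

with beam.Pipeline(options=options) as p:
    # Tiny smoke-test pipeline; replace with the real job.
    _ = p | beam.Create(["smoke-test"]) | beam.Map(print)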
The Docker containers used for the Dataflow workers are currently private and can't be modified or customized.
In fact, they are served from a private Docker repository, so I don't think you can pull and install them on your machine.
Update Jan 2021: Custom containers are now supported in Dataflow.
https://cloud.google.com/dataflow/docs/guides/using-custom-containers?hl=en#docker
You can generate a template from your job (see https://cloud.google.com/dataflow/docs/templates/creating-templates for details), then inspect the template file to find the workerHarnessContainerImage used.
I just created one for a job using the Python SDK, and the image used there is dataflow.gcr.io/v1beta3/python:2.0.0.
Alternatively, you can run a job, then SSH into one of the instances and use docker ps to see all running Docker containers. Use docker inspect [container_id] to see more details about volumes bound to the container, etc.
