How to pass data or files between Kubeflow containerized components in Python

I'm exploring Kubeflow as an option to deploy and connect various components of a typical ML pipeline. I'm using Docker containers as Kubeflow components, and so far I've been unable to successfully use the ContainerOp file_outputs argument to pass results between components.
Based on my understanding of the feature, creating and saving to a file that's declared as one of the file_outputs of a component should cause it to persist and be accessible for reading by the following component.
This is how I attempted to declare this in my pipeline python code:
import kfp.dsl as dsl
import kfp.gcp as gcp

@dsl.pipeline(name='kubeflow demo')
def pipeline(project_id='kubeflow-demo-254012'):
    data_collector = dsl.ContainerOp(
        name='data collector',
        image='eu.gcr.io/kubeflow-demo-254012/data-collector',
        arguments=["--project_id", project_id],
        file_outputs={"output": '/output.txt'}
    )
    data_preprocessor = dsl.ContainerOp(
        name='data preprocessor',
        image='eu.gcr.io/kubeflow-demo-254012/data-preprocessor',
        arguments=["--project_id", project_id]
    )
    data_preprocessor.after(data_collector)
    # TODO: add other components

if __name__ == '__main__':
    import kfp.compiler as compiler
    compiler.Compiler().compile(pipeline, __file__ + '.tar.gz')
In the Python code for the data-collector.py component I fetch the dataset and then write it to /output.txt. I'm able to read the file within the same component, but not inside data-preprocessor.py, where I get a FileNotFoundError.
Is the use of file_outputs invalid for container-based Kubeflow components, or am I using it incorrectly in my code? If it's not an option in my case, is it possible to programmatically create Kubernetes volumes inside the pipeline declaration Python code and use them instead of file_outputs?

Files created in one Kubeflow pipeline component are local to that component's container. To reference a file in subsequent steps, you need to pass it explicitly, for example:
data_preprocessor = dsl.ContainerOp(
    name='data preprocessor',
    image='eu.gcr.io/kubeflow-demo-254012/data-preprocessor',
    arguments=[
        "--fetched_dataset", data_collector.outputs['output'],
        "--project_id", project_id,
    ]
)
Note: data_collector.outputs['output'] will contain the actual string contents of the file /output.txt (not a path to the file). If you want it to contain a path instead, you'll need to write the dataset to shared storage (such as S3 or a mounted PVC volume) and write the path/link to that shared storage to /output.txt. data_preprocessor can then read the dataset from that path.
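For example, a minimal sketch of that pattern inside data-collector.py, assuming the dataset is uploaded to a GCS bucket (the bucket name, object path, and the google-cloud-storage dependency are assumptions, not part of the original pipeline):

# data-collector.py (sketch): upload the dataset to shared storage and write
# only its URI to /output.txt, which file_outputs then exposes downstream.
from google.cloud import storage  # assumed dependency of the collector image

def save_dataset(local_path: str, project_id: str) -> str:
    client = storage.Client(project=project_id)
    bucket = client.bucket("kubeflow-demo-datasets")  # hypothetical bucket
    blob = bucket.blob("raw/dataset.csv")              # hypothetical object path
    blob.upload_from_filename(local_path)
    return f"gs://{bucket.name}/{blob.name}"

if __name__ == "__main__":
    gcs_uri = save_dataset("/tmp/dataset.csv", "kubeflow-demo-254012")
    with open("/output.txt", "w") as f:
        f.write(gcs_uri)  # downstream step receives this URI as an argument

data_preprocessor then receives the gs:// URI via --fetched_dataset and downloads the dataset itself.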

There are three main steps:
1. Save an output.txt file that includes the data/parameter/anything you want to pass to the next component.
Note: it should be at the root level, i.e. /output.txt.
2. Pass file_outputs={'output': '/output.txt'} to the ContainerOp, as shown in the example below.
3. Inside a container_op that you write inside dsl.pipeline, pass comp1.output as the argument of the component that needs the output from the earlier component (here comp1 is the first component, which produces the output and stores it in /output.txt).
import kfp
from kfp import dsl

def SendMsg(send_msg: str = 'akash'):
    return dsl.ContainerOp(
        name='Print msg',
        image='docker.io/akashdesarda/comp1:latest',
        command=['python', 'msg.py'],
        arguments=['--msg', send_msg],
        file_outputs={
            'output': '/output.txt',
        }
    )

def GetMsg(get_msg: str):
    return dsl.ContainerOp(
        name='Read msg from 1st component',
        image='docker.io/akashdesarda/comp2:latest',
        command=['python', 'msg.py'],
        arguments=['--msg', get_msg]
    )

@dsl.pipeline(
    name='Pass parameter',
    description='Passing para')
def passing_parameter(send_msg):
    comp1 = SendMsg(send_msg)
    comp2 = GetMsg(comp1.output)

if __name__ == '__main__':
    import kfp.compiler as compiler
    compiler.Compiler().compile(passing_parameter, __file__ + '.tar.gz')
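For reference, a hypothetical msg.py for the comp1 image could look like the sketch below; only the --msg flag and the /output.txt path come from the example above, the rest is an assumption:

# msg.py for comp1 (sketch): read --msg and write the result to /output.txt,
# the path declared in file_outputs above.
import argparse

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--msg", type=str, required=True)
    args = parser.parse_args()

    result = f"hello {args.msg}"  # placeholder for real work
    with open("/output.txt", "w") as f:
        f.write(result)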

You don't have to write the data to shared storage; you can use kfp.dsl.InputArgumentPath to pass an output file from an earlier step to the input of a container op.
@kfp.dsl.pipeline(
    name='Build Model Server Pipeline',
    description='Build a kserve model server pipeline.'
)
def build_model_server_pipeline(s3_src_path):
    download_s3_files_task = download_archive_step(s3_src_path)
    tarball_path = "/tmp/artifact.tar"
    artifact_tarball = kfp.dsl.InputArgumentPath(
        download_s3_files_task.outputs['output_tarball'],
        path=tarball_path)
    build_container = kfp.dsl.ContainerOp(
        name='build_container',
        image='python:3.8',
        command=['sh', '-c'],
        arguments=['ls -l ' + tarball_path + ';'],
        artifact_argument_paths=[artifact_tarball],
    )
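download_archive_step is not defined in the answer; one possible way to define such a step (a sketch, assuming the KFP v1 SDK and an aws-cli base image, neither of which is confirmed by the original post) is to load it from a component spec whose output_tarball output is declared with outputPath:

import kfp.components

# Hypothetical component definition for download_archive_step (not from the
# original answer): it copies the S3 object to the path KFP provides for the
# 'output_tarball' output, which InputArgumentPath can then mount downstream.
download_archive_step = kfp.components.load_component_from_text("""
name: Download archive
inputs:
- {name: s3_src_path, type: String}
outputs:
- {name: output_tarball}
implementation:
  container:
    image: amazon/aws-cli
    command:
    - sh
    - -c
    - |
      mkdir -p "$(dirname "$1")" && aws s3 cp "$0" "$1"
    - {inputValue: s3_src_path}
    - {outputPath: output_tarball}
""")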

Related

How to get files generated inside a KFP component's container as an output and save it in the local filesystem? [duplicate]


Conditionally create a Bazel rule based on --config

I'm working on a problem in which I only want to create a particular rule if a certain Bazel config has been specified (via '--config'). We have been using Bazel since 0.11 and have a bunch of build infrastructure that works around former limitations in Bazel. I am incrementally porting us up to newer versions. One of the features that was missing was compiler transitions, and so we rolled our own using configs and some external scripts.
My first attempt at solving my problem looks like this:
load("#rules_cc//cc:defs.bzl", "cc_library")
# use this with a select to pick targets to include/exclude based on config
# see __build_if_role for an example
def noop_impl(ctx):
pass
noop = rule(
implementation = noop_impl,
attrs = {
"deps": attr.label_list(),
},
)
def __sanitize(config):
if len(config) > 2 and config[:2] == "//":
config = config[2:]
return config.replace(":", "_").replace("/", "_")
def build_if_config(**kwargs):
config = kwargs['config']
kwargs.pop('config')
name = kwargs['name'] + '_' + __sanitize(config)
binary_target_name = kwargs['name']
kwargs['name'] = binary_target_name
cc_library(**kwargs)
noop(
name = name,
deps = select({
config: [ binary_target_name ],
"//conditions:default": [],
})
)
This almost gets me there, but the problem is that if I want to build a library as an output, then it becomes an intermediate dependency, and therefore gets deleted or never built.
For example, if I do this:
build_if_config(
name="some_lib",
srcs=[ "foo.c" ],
config="//:my_config",
)
and then I run
bazel build --config my_config //:some_lib
Then libsome_lib.a does not make it to bazel-out, although if I define it using cc_library, then it does.
Is there a way that I can just create the appropriate rule directly in the macro instead of creating a noop rule and using a select? Or another mechanism?
Thanks in advance for your help!
As I noted in my comment, I was misunderstanding how Bazel figures out its dependencies. The "create a file" section of the Rules Tutorial explains some of the details, and I followed along there for part of my solution.
Basically, the problem was not that the built files were not sticking around, it was that they were never getting built. Bazel did not know to look in the deps attribute and build those things: I had to create an action that uses the deps, and then declare the result by returning a (list of) DefaultInfo.
Below is my new noop_impl function:
def noop_impl(ctx):
    if len(ctx.attr.deps) == 0:
        return None

    # ctx.attr has the attributes of this rule
    dep = ctx.attr.deps[0]

    # DefaultInfo is apparently some sort of globally available
    # class that can be used to index Target objects
    infile = dep[DefaultInfo].files.to_list()[0]

    outfile = ctx.actions.declare_file('lib' + ctx.label.name + '.a')
    ctx.actions.run_shell(
        inputs = [infile],
        outputs = [outfile],
        command = "cp %s %s" % (infile.path, outfile.path),
    )

    # we can also instantiate a DefaultInfo to indicate what output
    # we provide
    return [DefaultInfo(files = depset([outfile]))]
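For completeness, the //:my_config label used in the select() must point at something selectable; one common pattern (a sketch, not taken from the question) is a config_setting backed by a --define that the --config alias sets in .bazelrc:

# .bazelrc (sketch): make `bazel build --config my_config` set a define that
# select() can observe.
#   build:my_config --define=build_role=my_config

# BUILD (sketch): the label passed as config="//:my_config" in build_if_config.
config_setting(
    name = "my_config",
    define_values = {"build_role": "my_config"},
)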

CDK generating empty targets for CfnCrawler

I'm using the CDK Python API to define a Glue crawler; however, the CDK-generated template contains an empty 'Targets' block in the Crawler resource.
I've not been able to find an example to emulate. I've tried varying the definition of the targets object, but the object definition seems to be ignored by CDK.
from aws_cdk import cdk

BUCKET = 'poc-1-bucket43879c71-5uabw2rni0cp'

class PocStack(cdk.Stack):
    def __init__(self, app: cdk.App, id: str, **kwargs) -> None:
        super().__init__(app, id)

        from aws_cdk import (
            aws_iam as iam,
            aws_glue as glue,
            cdk
        )

        glue_role = iam.Role(
            self, 'glue_role',
            assumed_by=iam.ServicePrincipal('glue.amazonaws.com'),
            managed_policy_arns=['arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole']
        )

        glue_crawler = glue.CfnCrawler(
            self, 'glue_crawler',
            database_name='db',
            role=glue_role.role_arn,
            targets={"S3Targets": [{"Path": f'{BUCKET}/path/'}]},
        )
I expect the generated template to contain a valid 'targets' block with a single S3Target. However, cdk synth outputs a template with empty Targets in the AWS::Glue::Crawler resource:
gluecrawler:
  Type: AWS::Glue::Crawler
  Properties:
    DatabaseName: db
    Role:
      Fn::GetAtt:
        - glueroleFCCAEB57
        - Arn
    Targets: {}
Resolved, thanks to a clever colleague!
Changing "S3Targets" to "s3Targets", and "Path" to "path" resolved the issue. See below.
Hi Bob,
When I use typescript, the following works for me:
new glue.CfnCrawler(this, 'glue_crawler', {
  databaseName: 'db',
  role: glue_role.roleArn,
  targets: {
    s3Targets: [{ path: "path" }]
  }
});
When I used Python, the following appears to work too:
glue_crawler = glue.CfnCrawler(
    self, 'glue_crawler',
    database_name='db',
    role=glue_role.role_arn,
    targets={
        "s3Targets": [{"path": f'{BUCKET}/path/'}]
    },
)
In TypeScript, TargetsProperty is an interface with s3Targets as a property, and within s3Targets, path is a property as well. I guess that during the JSII transformation we are forced to use the same names in Python instead of the original CloudFormation resource names.
A more general way to approach this problem is to dig inside the CDK library in two steps:
1. Locate the generated module:
from aws_cdk import aws_glue
print(aws_glue.__file__)
# .env/lib/python3.8/site-packages/aws_cdk/aws_glue/__init__.py
2. Go to that file and see how the mapping/types are defined. As of 16 Aug 2020, you find:
@jsii.data_type(
    jsii_type="@aws-cdk/aws-glue.CfnCrawler.TargetsProperty",
    jsii_struct_bases=[],
    name_mapping={
        "catalog_targets": "catalogTargets",
        "dynamo_db_targets": "dynamoDbTargets",
        "jdbc_targets": "jdbcTargets",
        "s3_targets": "s3Targets",
    },
)
I found that the lowerCamelCase keys always work, while the pythonic snake_case keys do not.
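An alternative that sidesteps the key casing entirely (a sketch, assuming your CDK version exposes the generated CfnCrawler.TargetsProperty and CfnCrawler.S3TargetProperty structs) is to use the typed property classes instead of raw dicts:

# Sketch: the typed L1 structs take snake_case keyword arguments and perform
# the camelCase name mapping for you, so the Targets block is not silently empty.
glue_crawler = glue.CfnCrawler(
    self, 'glue_crawler',
    database_name='db',
    role=glue_role.role_arn,
    targets=glue.CfnCrawler.TargetsProperty(
        s3_targets=[glue.CfnCrawler.S3TargetProperty(path=f'{BUCKET}/path/')]
    ),
)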

Dataflow template not using the runtime parameters

I am using a Dataflow template to run Cloud Dataflow.
I am providing some default values and calling the template. Dataflow shows the pipeline options correctly in the pipeline summary, but it is not picking up the runtime values.
class Mypipeoptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument(
            '--preprocess_indir',
            help='GCS path of the data to be preprocessed',
            required=False,
            default='gs://default/dataset/'
        )
        parser.add_value_provider_argument(
            '--output_dir_train',
            help='GCS path of the preprocessed train data',
            required=False,
            default='gs://default/train/'
        )
        parser.add_value_provider_argument(
            '--output_dir_test',
            help='GCS path of the preprocessed test data',
            required=False,
            default='gs://default/test/'
        )
        parser.add_value_provider_argument(
            '--output_dir_validate',
            help='GCS path of the preprocessed validate data',
            required=False,
            default='gs://default/validate/'
        )
Then I check whether the values are accessible:
p = beam.Pipeline(options=args)

if args.preprocess_indir.is_accessible():
    input_dir = args.preprocess_indir
else:
    input_dir = getValObj(args.preprocess_indir)

if args.output_dir_train.is_accessible():
    output_train = args.output_dir_train
else:
    output_train = getValObj(args.output_dir_train)

if args.output_dir_test.is_accessible():
    output_test = args.output_dir_test
else:
    output_test = getValObj(args.output_dir_test)

if args.output_dir_validate.is_accessible():
    output_validate = args.output_dir_validate
else:
    output_validate = getValObj(args.output_dir_validate)
Now when I call the template, I can see the values I want being passed in as the (Mypipeoptions) pipeline option parameters, but they are not used in the actual run; instead the default options are used.
I think I found the solution: I was assigning the runtime parameters to variables and then passing those variables to the input or output.
When I passed the runtime parameters directly to the source or sink, it worked, like the one below:
'Write train dataset to destination' >> beam.io.tfrecordio.WriteToTFRecord(
    file_path_prefix=args.output_dir_train
)
I believe the part I missed is that the pipeline graph is built when the template is created, and only runtime parameters (ValueProviders) can be plugged in at run time; other computations have already been done while building the graph.
Please correct me if I am wrong.
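If a runtime value is needed somewhere other than a source or sink, the usual pattern (a sketch, not from the original post) is to hand the ValueProvider to a DoFn and call .get() only inside process(), which runs once the template has its runtime values:

import apache_beam as beam

class PrefixWithDir(beam.DoFn):
    """Hypothetical DoFn showing deferred resolution of a ValueProvider."""

    def __init__(self, dir_value_provider):
        # Store the ValueProvider itself at graph-construction time.
        self.dir_vp = dir_value_provider

    def process(self, element):
        # .get() is only valid at execution time, when the runtime value exists.
        yield f"{self.dir_vp.get()}/{element}"

# usage inside the pipeline, e.g.:
# lines | beam.ParDo(PrefixWithDir(args.preprocess_indir))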

How can I build custom rules using the output of workspace_status_command?

The bazel build flag --workspace_status_command supports calling a script to retrieve e.g. repository metadata, this is also known as build stamping and available in rules like java_binary.
I'd like to create a custom rule using this metadata.
I want to use this for a common support function. It should receive the git version and some other attributes and create a version.go output file usable as a dependency.
So I started a journey looking at rules in various bazel repositories.
Rules like rules_docker support stamping with stamp in container_image and let you reference the status output in attributes.
rules_go supports it in the x_defs attribute of go_binary.
This would be ideal for my purpose and I dug in...
It looks like I can get what I want with ctx.actions.expand_template using the entries in ctx.info_file or ctx.version_file as a dictionary for substitutions. But I didn't figure out how to get a dictionary of those files. And those two files seem to be "unofficial", they are not part of the ctx documentation.
Building on what I found out already: How do I get a dict based on the status command output?
If that's not possible, what is the shortest/simplest way to access workspace_status_command output from custom rules?
I've been exactly where you are, and I followed the path you've started exploring. I generate a JSON description that also includes information collected from git to package with the result, and I ended up doing something like this:
def _build_mft_impl(ctx):
    args = ctx.actions.args()
    args.add('-f')
    args.add(ctx.info_file)
    args.add('-i')
    args.add(ctx.files.src)
    args.add('-o')
    args.add(ctx.outputs.out)
    ctx.actions.run(
        outputs = [ctx.outputs.out],
        inputs = ctx.files.src + [ctx.info_file],
        arguments = [args],
        progress_message = "Generating manifest: " + ctx.label.name,
        executable = ctx.executable._expand_template,
    )

def _get_mft_outputs(src):
    return {"out": src.name[:-len(".tmpl")]}

build_manifest = rule(
    implementation = _build_mft_impl,
    attrs = {
        "src": attr.label(mandatory=True,
                          allow_single_file=[".json.tmpl", ".json_tmpl"]),
        "_expand_template": attr.label(default=Label("//:expand_template"),
                                       executable=True,
                                       cfg="host"),
    },
    outputs = _get_mft_outputs,
)
//:expand_template is a label in my case pointing to a py_binary performing the transformation itself. I'd be happy to learn about a better (more native, fewer hops) way of doing this, but (for now) I went with it: it works. A few comments on the approach and your concerns:
AFAIK you cannot read in the file and perform operations on it in Skylark itself...
...speaking of which, it's probably not a bad thing to keep the transformation (tool) and the build description (Bazel) separate anyways.
It could be debated what constitutes the official documentation, but while ctx.info_file may not appear in the reference manual, it is documented in the source tree. :) This is the case for other areas as well (and I hope that is not because those interfaces are considered not committed to yet).
For the sake of completeness, in src/main/java/com/google/devtools/build/lib/skylarkbuildapi/SkylarkRuleContextApi.java there is:
@SkylarkCallable(
    name = "info_file",
    structField = true,
    documented = false,
    doc =
        "Returns the file that is used to hold the non-volatile workspace status for the "
            + "current build request.")
public FileApi getStableWorkspaceStatus() throws InterruptedException, EvalException;
EDIT: a few extra details, as asked for in the comment.
In my workspace_status.sh I would have for instance the following line:
echo STABLE_GIT_REF $(git log -1 --pretty=format:%H)
In my .json.tmpl file I would then have:
"ref": "${STABLE_GIT_REF}",
I've opted for shell-like notation for the text to be replaced, since it's intuitive for many users as well as easy to match.
As for the replacement, the relevant portion of the actual code (CLI handling left out) would be:
import re

def get_map(val_file):
    """
    Return dictionary of key/value pairs from ``val_file``.
    """
    value_map = {}
    for line in val_file:
        (key, value) = line.split(' ', 1)
        value_map.update(((key, value.rstrip('\n')),))
    return value_map

def expand_template(val_file, in_file, out_file):
    """
    Read each line from ``in_file`` and write it to ``out_file`` replacing all
    ${KEY} references with values from ``val_file``.
    """
    def _substitute_variable(mobj):
        return value_map[mobj.group('var')]

    re_pat = re.compile(r'\${(?P<var>[^} ]+)}')
    value_map = get_map(val_file)
    for line in in_file:
        out_file.write(re_pat.subn(_substitute_variable, line)[0])
EDIT2: This is how I expose the Python script to the rest of Bazel:
py_binary(
    name = "expand_template",
    main = "expand_template.py",
    srcs = ["expand_template.py"],
    visibility = ["//visibility:public"],
)
Building on Ondrej's answer, I now use something like this (adapted in the SO editor, might contain small errors):
tools/bazel.rc:
build --workspace_status_command=tools/workspace_status.sh
tools/workspace_status.sh:
echo STABLE_GIT_REV $(git rev-parse HEAD)
version.bzl:
_VERSION_TEMPLATE_SH = """
set -e -u -o pipefail
while read line; do
export "${line% *}"="${line#* }"
done <"$INFILE" \
&& cat <<EOF >"$OUTFILE"
{ "ref": "${STABLE_GIT_REF}"
, "service": "${SERVICE_NAME}"
}
EOF
"""
def _commit_info_impl(ctx):
    ctx.actions.run_shell(
        outputs = [ctx.outputs.outfile],
        inputs = [ctx.info_file],
        progress_message = "Generating version file: " + ctx.label.name,
        command = _VERSION_TEMPLATE_SH,
        env = {
            'INFILE': ctx.info_file.path,
            'OUTFILE': ctx.outputs.outfile.path,
            'SERVICE_NAME': ctx.attr.service,
        },
    )

commit_info = rule(
    implementation = _commit_info_impl,
    attrs = {
        'service': attr.string(
            mandatory = True,
            doc = 'name of versioned service',
        ),
    },
    outputs = {
        'outfile': 'manifest.json',
    },
)
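A minimal BUILD usage of this rule might look like the following sketch; the load path and the service name are made up:

# BUILD (sketch): produces manifest.json for this package from the stable
# workspace status plus the declared service name.
load("//:version.bzl", "commit_info")

commit_info(
    name = "commit_info",
    service = "my-service",
)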
