Dataflow template not using the runtime parameters - google-cloud-dataflow

I am using a Dataflow template to run Cloud Dataflow.
I provide some default values and call the template. Dataflow shows the pipeline options correctly in the pipeline summary, but it is not picking up the runtime values.
class Mypipeoptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument(
            '--preprocess_indir',
            help='GCS path of the data to be preprocessed',
            required=False,
            default='gs://default/dataset/'
        )
        parser.add_value_provider_argument(
            '--output_dir_train',
            help='GCS path of the preprocessed train data',
            required=False,
            default='gs://default/train/'
        )
        parser.add_value_provider_argument(
            '--output_dir_test',
            help='GCS path of the preprocessed test data',
            required=False,
            default='gs://default/test/'
        )
        parser.add_value_provider_argument(
            '--output_dir_validate',
            help='GCS path of the preprocessed validate data',
            required=False,
            default='gs://default/validate/'
        )
Then I check whether the values are accessible:
p = beam.Pipeline(options=args)

if args.preprocess_indir.is_accessible():
    input_dir = args.preprocess_indir
else:
    input_dir = getValObj(args.preprocess_indir)

if args.output_dir_train.is_accessible():
    output_train = args.output_dir_train
else:
    output_train = getValObj(args.output_dir_train)

if args.output_dir_test.is_accessible():
    output_test = args.output_dir_test
else:
    output_test = getValObj(args.output_dir_test)

if args.output_dir_validate.is_accessible():
    output_validate = args.output_dir_validate
else:
    output_validate = getValObj(args.output_dir_validate)
Now, when I call the template, I can see the values I want being passed as the (Mypipeoptions) pipeline option parameters, but they are not used in the actual run; the run uses the default options instead.

I think I found the solution. I was assigning the runtime parameters to variables and then passing those variables to the input or output.
When I passed the runtime parameters directly to the source or sink, it worked, like the one below:
'Write train dataset to destination' >> beam.io.tfrecordio.WriteToTFRecord(
    file_path_prefix=args.output_dir_train
)
I believe the part I missed is that when the template is created, Beam builds the pipeline graph, and only runtime parameters (ValueProviders) can be plugged in at run time; any other computation has already happened while the graph was being built.
Please correct me if I am wrong
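For reference, here is a minimal sketch of that pattern under the same assumptions: the ValueProvider is handed to the sink untouched, and any value needed inside a transform is resolved with .get() inside a DoFn at run time (the DoFn and the Create source are illustrative, not part of the original pipeline):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class TagWithInputDir(beam.DoFn):
    """Illustrative DoFn: resolves a ValueProvider at run time, not at graph-build time."""
    def __init__(self, preprocess_indir):
        # Store the ValueProvider itself; resolving it here would freeze the default value.
        self.preprocess_indir = preprocess_indir

    def process(self, element):
        # .get() is only valid at run time, when the template parameter has its real value.
        yield ('%s,%s' % (self.preprocess_indir.get(), element)).encode('utf-8')

args = PipelineOptions().view_as(Mypipeoptions)
with beam.Pipeline(options=args) as p:
    (p
     | 'Create placeholder rows' >> beam.Create(['row-1', 'row-2'])
     | 'Tag rows with input dir' >> beam.ParDo(TagWithInputDir(args.preprocess_indir))
     | 'Write train dataset to destination' >> beam.io.tfrecordio.WriteToTFRecord(
           file_path_prefix=args.output_dir_train))  # ValueProvider passed straight to the sink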

Related

str() is no longer usable to get the true value of a Text tfx.data_types.RuntimeParameter during pipeline execution

How do I get the string (true) value of a tfx.orchestration.data_types.RuntimeParameter during pipeline execution?
Hi,
I'm defining a runtime parameter like data_root = tfx.orchestration.data_types.RuntimeParameter(name='data-root', ptype=str) for a base path, from which I derive many subfolders for various components, e.g. str(data_root) + '/model' for the model serving path in tfx.components.Pusher().
It was working like a charm before I moved to tfx==1.12.0: str(data_root) now returns a JSON dump instead of the value.
To work around that, I tried to define a runtime parameter for the model path, like model_root = tfx.orchestration.data_types.RuntimeParameter(name='model-root', ptype=str), and then feed the Pusher component the way I saw in many tutorials:
pusher = Pusher(
    model=trainer.outputs['model'],
    model_blessing=evaluator.outputs['blessing'],
    push_destination=tfx.proto.PushDestination(
        filesystem=tfx.proto.PushDestination.Filesystem(base_directory=model_root)))
but I get a TypeError saying tfx.proto.PushDestination.Filesystem does not accept a RuntimeParameter.
This completely breaks the existing setup, as I receive those parameters from an external client for each Kubeflow run.
Thanks a lot for any help.
I was able to fix it.
First of all, the docstring is not clear about which parameters of Pusher can be a RuntimeParameter and which cannot.
I finally went to the __init__ definition of the Pusher component and saw that only the push_destination parameter accepts a RuntimeParameter:
def __init__(
    self,
    model: Optional[types.BaseChannel] = None,
    model_blessing: Optional[types.BaseChannel] = None,
    infra_blessing: Optional[types.BaseChannel] = None,
    push_destination: Optional[Union[pusher_pb2.PushDestination,
                                     data_types.RuntimeParameter]] = None,
    custom_config: Optional[Dict[str, Any]] = None,
    custom_executor_spec: Optional[executor_spec.ExecutorSpec] = None):
Then I defined the component accordingly, using my RuntimeParameter directly as push_destination:
model_root = tfx.orchestration.data_types.RuntimeParameter(
    name='model-serving-location', ptype=str)

pusher = Pusher(
    model=trainer.outputs['model'],
    model_blessing=evaluator.outputs['blessing'],
    push_destination=model_root)
Since the push_destination parameter is expected to be a tfx.proto.pusher_pb2.PushDestination proto message, you then have to respect the associated schema when instantiating and running a pipeline execution, meaning the runtime value for the 'model-serving-location' parameter should look like:
'{"filesystem": {"base_directory": "path/to/model/serving/for/the/run"}}'
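For completeness, here is a minimal sketch of how such a run could be submitted with the KFP SDK, assuming the pipeline was compiled to pipeline.json and that 'model-serving-location' is the RuntimeParameter name used above; the client host, file name, and run name are placeholders:

import json
import kfp

client = kfp.Client(host='http://localhost:8080')  # placeholder host

run = client.create_run_from_pipeline_package(
    pipeline_file='pipeline.json',  # compiled pipeline (placeholder name)
    arguments={
        # The value must follow the PushDestination proto schema as a JSON string.
        'model-serving-location': json.dumps(
            {'filesystem': {'base_directory': 'path/to/model/serving/for/the/run'}}),
    },
    run_name='pusher-runtime-param-demo',
)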
Regards

Accessing the return value of a Lambda Step in Sagemaker pipeline

I've added a Lambda Step as the first step in my Sagemaker Pipeline. It processes some data and creates 2 files as part of the output like so:
from sagemaker.workflow.lambda_step import LambdaStep, Lambda, LambdaOutput, LambdaOutputTypeEnum

# lamb_preprocess = LambdaStep(func_arn="")
output_param_1 = LambdaOutput(output_name="status", output_type=LambdaOutputTypeEnum.Integer)
output_param_2 = LambdaOutput(output_name="file_name_a_c_drop", output_type=LambdaOutputTypeEnum.String)
output_param_3 = LambdaOutput(output_name="file_name_q_c_drop", output_type=LambdaOutputTypeEnum.String)

step_lambda = LambdaStep(
    name="ProcessingLambda",
    lambda_func=Lambda(
        function_arn="arn:aws:lambda:us-east-1:xxxxxxxx:function:xxxxx"
    ),
    inputs={
        "input_data": input_data,
        "input_file": trigger_file,
        "input_bucket": trigger_bucket
    },
    outputs=[
        output_param_1, output_param_2, output_param_3
    ]
)
In my next step, I want to trigger a Processing Job, for which I need to pass in the above Lambda function's outputs as its inputs. I'm trying to do it like so:
inputs = [
    ProcessingInput(source=step_lambda.properties.Outputs["file_name_q_c_drop"], destination="/opt/ml/processing/input"),
    ProcessingInput(source=step_lambda.properties.Outputs["file_name_a_c_drop"], destination="/opt/ml/processing/input"),
]
However, when the processing step is being created, I get a validation error saying
Object of type Properties is not JSON serializable
I followed the data dependency docs here: https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_model_building_pipeline.html#lambdastep and tried accessing step_lambda.OutputParameters["file_name_a_c_drop"] too, but it errored out saying 'LambdaStep' object has no attribute 'OutputParameters'.
How do I properly access the return value of a LambdaStep in a SageMaker pipeline?
You can access the output as follows - step_lambda.OutputParameters["output1"]. You don't need to add .properties
To access a LambdaStep output in another step you can do this:
step_lambda.properties.Outputs["file_name_a_c_drop"]
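For context, a minimal sketch of wiring that reference into the downstream processing step, assuming a processor (e.g. a ScriptProcessor) has already been configured; the step name, destinations, and script name are illustrative:

from sagemaker.processing import ProcessingInput
from sagemaker.workflow.steps import ProcessingStep

# step_lambda is the LambdaStep defined above; its Outputs are resolved at pipeline run time.
step_process = ProcessingStep(
    name="PreprocessData",
    processor=script_processor,  # assumed to be defined elsewhere
    inputs=[
        ProcessingInput(
            source=step_lambda.properties.Outputs["file_name_a_c_drop"],
            destination="/opt/ml/processing/input/a_c",
        ),
        ProcessingInput(
            source=step_lambda.properties.Outputs["file_name_q_c_drop"],
            destination="/opt/ml/processing/input/q_c",
        ),
    ],
    code="preprocess.py",  # illustrative script name
)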
Try this:
step_lambda.properties.ProcessingOutputConfig.Outputs["file_name_q_c_drop"].S3Output.S3Uri

How to get files generated inside a KFP component's container as an output and save it in the local filesystem? [duplicate]

I'm exploring Kubeflow as an option to deploy and connect various components of a typical ML pipeline. I'm using docker containers as Kubeflow components and so far I've been unable to successfully use ContainerOp.file_outputs object to pass results between components.
Based on my understanding of the feature, creating and saving to a file that's declared as one of the file_outputs of a component should cause it to persist and be accessible for reading by the following component.
This is how I attempted to declare this in my pipeline python code:
import kfp.dsl as dsl
import kfp.gcp as gcp

@dsl.pipeline(name='kubeflow demo')
def pipeline(project_id='kubeflow-demo-254012'):
    data_collector = dsl.ContainerOp(
        name='data collector',
        image='eu.gcr.io/kubeflow-demo-254012/data-collector',
        arguments=["--project_id", project_id],
        file_outputs={"output": '/output.txt'}
    )
    data_preprocessor = dsl.ContainerOp(
        name='data preprocessor',
        image='eu.gcr.io/kubeflow-demo-254012/data-preprocessor',
        arguments=["--project_id", project_id]
    )
    data_preprocessor.after(data_collector)
    # TODO: add other components

if __name__ == '__main__':
    import kfp.compiler as compiler
    compiler.Compiler().compile(pipeline, __file__ + '.tar.gz')
In the Python code for the data-collector.py component, I fetch the dataset and then write it to /output.txt. I'm able to read from the file within the same component, but not inside data-preprocessor.py, where I get a FileNotFoundError.
Is the use of file_outputs invalid for container-based Kubeflow components, or am I using it incorrectly in my code? If it's not an option in my case, is it possible to programmatically create Kubernetes volumes inside the pipeline declaration Python code and use them instead of file_outputs?
Files created in one Kubeflow pipeline component are local to that component's container. To reference them in subsequent steps, you would need to pass the output as:
data_preprocessor = dsl.ContainerOp(
    name='data preprocessor',
    image='eu.gcr.io/kubeflow-demo-254012/data-preprocessor',
    arguments=["--fetched_dataset", data_collector.outputs['output'],
               "--project_id", project_id]
)
Note: data_collector.outputs['output'] will contain the actual string contents of the file /output.txt (not a path to the file). If you want it to contain the path of the file, you'll need to write the dataset to shared storage (like S3 or a mounted PVC volume) and write the path/link to that shared storage to /output.txt. data_preprocessor can then read the dataset from that path.
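On the second part of the question: it is also possible to create a Kubernetes volume from within the pipeline definition and mount it into both components, so the dataset itself never travels through file_outputs. A minimal sketch using kfp's v1 DSL; the PVC name, mount path, and the --output_dir/--input_dir flags are illustrative assumptions, not part of the original components:

import kfp.dsl as dsl

@dsl.pipeline(name='kubeflow demo with shared volume')
def pipeline_with_volume(project_id='kubeflow-demo-254012'):
    # Create a PVC when the pipeline runs.
    vop = dsl.VolumeOp(
        name='create-shared-volume',
        resource_name='shared-data',  # illustrative PVC name
        size='1Gi',
        modes=dsl.VOLUME_MODE_RWO,
    )

    data_collector = dsl.ContainerOp(
        name='data collector',
        image='eu.gcr.io/kubeflow-demo-254012/data-collector',
        arguments=["--project_id", project_id, "--output_dir", "/mnt/shared"],
    ).add_pvolumes({"/mnt/shared": vop.volume})

    # Mounting the volume produced by the first step also makes the execution order explicit.
    data_preprocessor = dsl.ContainerOp(
        name='data preprocessor',
        image='eu.gcr.io/kubeflow-demo-254012/data-preprocessor',
        arguments=["--project_id", project_id, "--input_dir", "/mnt/shared"],
    ).add_pvolumes({"/mnt/shared": data_collector.pvolume})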
There are three main steps:
1. Save an output.txt file containing the data/parameter/anything you want to pass to the next component. Note: it should be at the root level, i.e. /output.txt.
2. Pass file_outputs={'output': '/output.txt'} to the ContainerOp, as shown in the example below.
3. Inside the dsl.pipeline function, pass comp1.output to the corresponding argument of the component that needs the output from the earlier component (here comp1 is the first component, which produces the output and stores it in /output.txt).
import kfp
from kfp import dsl

def SendMsg(send_msg: str = 'akash'):
    return dsl.ContainerOp(
        name='Print msg',
        image='docker.io/akashdesarda/comp1:latest',
        command=['python', 'msg.py'],
        arguments=['--msg', send_msg],
        file_outputs={
            'output': '/output.txt',
        }
    )

def GetMsg(get_msg: str):
    return dsl.ContainerOp(
        name='Read msg from 1st component',
        image='docker.io/akashdesarda/comp2:latest',
        command=['python', 'msg.py'],
        arguments=['--msg', get_msg]
    )

@dsl.pipeline(
    name='Pass parameter',
    description='Passing para')
def passing_parameter(send_msg):
    comp1 = SendMsg(send_msg)
    comp2 = GetMsg(comp1.output)

if __name__ == '__main__':
    import kfp.compiler as compiler
    compiler.Compiler().compile(passing_parameter, __file__ + '.tar.gz')
You don't have to write the data to shared storage; you can use kfp.dsl.InputArgumentPath to pass an output from a Python function to the input of a container op.
@kfp.dsl.pipeline(
    name='Build Model Server Pipeline',
    description='Build a kserve model server pipeline.'
)
def build_model_server_pipeline(s3_src_path):
    download_s3_files_task = download_archive_step(s3_src_path)
    tarball_path = "/tmp/artifact.tar"
    artifact_tarball = kfp.dsl.InputArgumentPath(
        download_s3_files_task.outputs['output_tarball'], path=tarball_path)
    build_container = kfp.dsl.ContainerOp(
        name='build_container',
        image='python:3.8',
        command=['sh', '-c'],
        arguments=[
            'ls -l ' + tarball_path + ';'
        ],
        artifact_argument_paths=[artifact_tarball],
    )

Kedro - how to pass nested parameters directly to node

Kedro recommends storing parameters in conf/base/parameters.yml. Let's assume it looks like this:
step_size: 1
model_params:
    learning_rate: 0.01
    test_data_ratio: 0.2
    num_train_steps: 10000
And now imagine I have some data_engineering pipeline whose nodes.py has a function that looks something like this:
def some_pipeline_step(num_train_steps):
    """
    Takes the parameter `num_train_steps` as argument.
    """
    pass
How would I go about passing that nested parameter straight to this function in data_engineering/pipeline.py? I unsuccessfully tried:
from kedro.pipeline import Pipeline, node

from .nodes import split_data

def create_pipeline(**kwargs):
    return Pipeline(
        [
            node(
                some_pipeline_step,
                ["params:model_params.num_train_steps"],
                dict(
                    train_x="train_x",
                    train_y="train_y",
                ),
            )
        ]
    )
I know that I could just pass all parameters into the function by using ['parameters'], or pass all model_params parameters with ['params:model_params'], but that seems inelegant and I feel like there must be a better way. Would appreciate any input!
(Disclaimer: I'm part of the Kedro team)
Thank you for your question. The current version of Kedro, unfortunately, does not support nested parameters. The interim solution would be to use top-level keys inside the node (as you already pointed out) or to decorate your node function with some sort of parameter filter, which is not elegant either.
Probably the most viable solution would be to customise your ProjectContext class (in src/<package_name>/run.py) by overriding the _get_feed_dict method as follows:
class ProjectContext(KedroContext):
    # ...

    def _get_feed_dict(self) -> Dict[str, Any]:
        """Get parameters and return the feed dictionary."""
        params = self.params
        feed_dict = {"parameters": params}

        def _add_param_to_feed_dict(param_name, param_value):
            """This recursively adds parameter paths to the `feed_dict`,
            whenever `param_value` is a dictionary itself, so that users can
            specify specific nested parameters in their node inputs.

            Example:
                >>> param_name = "a"
                >>> param_value = {"b": 1}
                >>> _add_param_to_feed_dict(param_name, param_value)
                >>> assert feed_dict["params:a"] == {"b": 1}
                >>> assert feed_dict["params:a.b"] == 1
            """
            key = "params:{}".format(param_name)
            feed_dict[key] = param_value
            if isinstance(param_value, dict):
                for key, val in param_value.items():
                    _add_param_to_feed_dict("{}.{}".format(param_name, key), val)

        for param_name, param_value in params.items():
            _add_param_to_feed_dict(param_name, param_value)

        return feed_dict
Please also note that this issue has already been addressed on develop and will become available in the next release. The fix uses the approach from the snippet above.
As mentioned by Dmitry, Kedro 0.16.0 introduced nested parameter values inside node inputs, which can be accessed via the . operator:
node(func, "params:a.b", None)
whereas Kedro 0.17.6 enabled overriding nested parameters with --params in the CLI, e.g.
kedro run --params="model.model_tuning.booster:gbtree"
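Applied to the original example, and assuming Kedro 0.16.0 or later, the pipeline definition can then address the nested key directly; this is just a sketch, with the node name chosen for illustration:

from kedro.pipeline import Pipeline, node

from .nodes import some_pipeline_step

def create_pipeline(**kwargs):
    return Pipeline(
        [
            node(
                some_pipeline_step,
                # Resolves to params["model_params"]["num_train_steps"] at run time.
                inputs="params:model_params.num_train_steps",
                outputs=None,
                name="train_step_with_nested_param",  # illustrative node name
            )
        ]
    )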

