Kedro - how to pass nested parameters directly to node - machine-learning

Kedro recommends storing parameters in conf/base/parameters.yml. Let's assume it looks like this:
step_size: 1
model_params:
  learning_rate: 0.01
  test_data_ratio: 0.2
  num_train_steps: 10000
And now imagine I have some data_engineering pipeline whose nodes.py has a function that looks something like this:
def some_pipeline_step(num_train_steps):
    """
    Takes the parameter `num_train_steps` as argument.
    """
    pass
How would I go about passing that nested parameter straight to this function in data_engineering/pipeline.py? I unsuccessfully tried:
from kedro.pipeline import Pipeline, node

from .nodes import some_pipeline_step


def create_pipeline(**kwargs):
    return Pipeline(
        [
            node(
                some_pipeline_step,
                ["params:model_params.num_train_steps"],
                dict(
                    train_x="train_x",
                    train_y="train_y",
                ),
            )
        ]
    )
I know that I could just pass all parameters into the function by using ['parameters'], or pass all of model_params with ['params:model_params'], but that seems inelegant and I feel like there must be a better way. Would appreciate any input!

(Disclaimer: I'm part of the Kedro team)
Thank you for your question. The current version of Kedro, unfortunately, does not support nested parameters. The interim solution is to use top-level keys inside the node (as you already pointed out) or to decorate your node function with some sort of parameter filter, which is not elegant either.
Probably the most viable solution is to customise your ProjectContext class (in src/<package_name>/run.py) by overriding the _get_feed_dict method as follows:
from typing import Any, Dict  # needed for the type annotation below


class ProjectContext(KedroContext):
    # ...

    def _get_feed_dict(self) -> Dict[str, Any]:
        """Get parameters and return the feed dictionary."""
        params = self.params
        feed_dict = {"parameters": params}

        def _add_param_to_feed_dict(param_name, param_value):
            """This recursively adds parameter paths to the `feed_dict`,
            whenever `param_value` is a dictionary itself, so that users can
            specify specific nested parameters in their node inputs.

            Example:
                >>> param_name = "a"
                >>> param_value = {"b": 1}
                >>> _add_param_to_feed_dict(param_name, param_value)
                >>> assert feed_dict["params:a"] == {"b": 1}
                >>> assert feed_dict["params:a.b"] == 1
            """
            key = "params:{}".format(param_name)
            feed_dict[key] = param_value

            if isinstance(param_value, dict):
                for key, val in param_value.items():
                    _add_param_to_feed_dict("{}.{}".format(param_name, key), val)

        for param_name, param_value in params.items():
            _add_param_to_feed_dict(param_name, param_value)

        return feed_dict
Please also note that this issue has already been addressed on develop and will become available in the next release. The fix uses the approach from the snippet above.

As mentioned by Dmitry, Kedro 0.16.0 introduced nested parameter values inside node inputs, which can be accessed via the . operator:
node(func, "params:a.b", None)
whereas Kedro 0.17.6 enabled overriding nested parameters with --params on the CLI, e.g.
kedro run --params="model.model_tuning.booster:gbtree"
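Putting the two together, here is a minimal sketch (assuming Kedro >= 0.16.0) of how the pipeline from the question could reference the nested key directly; since some_pipeline_step only consumes the parameter and returns nothing, outputs is set to None:
from kedro.pipeline import Pipeline, node

from .nodes import some_pipeline_step


def create_pipeline(**kwargs):
    return Pipeline(
        [
            node(
                some_pipeline_step,
                # The dot syntax resolves the nested key from parameters.yml.
                inputs="params:model_params.num_train_steps",
                outputs=None,
            )
        ]
    )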

Related

str() is no longer usable to get the true value of a Text tfx.data_types.RuntimeParameter during pipeline execution

How do I get the string as the true value of a tfx.orchestration.data_types.RuntimeParameter during pipeline execution?
Hi,
I'm defining a runtime parameter like data_root = tfx.orchestration.data_types.RuntimeParameter(name='data-root', ptype=str) for a base path, from which I derive many subfolders for various components, e.g. str(data_root)+'/model' for the model serving path in tfx.components.Pusher().
It was working like a charm before I moved to tfx==1.12.0: str(data_root) now produces a JSON dump instead of the value.
To overcome that, I tried to define a runtime parameter for the model path, model_root = tfx.orchestration.data_types.RuntimeParameter(name='model-root', ptype=str), and then feed it to the Pusher component the way I saw in many tutorials:
pusher = Pusher(
    model=trainer.outputs['model'],
    model_blessing=evaluator.outputs['blessing'],
    push_destination=tfx.proto.PushDestination(
        filesystem=tfx.proto.PushDestination.Filesystem(base_directory=model_root)
    )
)
but I get a TypeError saying that tfx.proto.PushDestination.Filesystem does not accept a RuntimeParameter.
This completely breaks the existing setup, as I receive those parameters from an external client for each Kubeflow run.
Thanks a lot for any help.
I was able to fix it.
First of all, the docstring does not make clear which parameters of Pusher can be a RuntimeParameter and which cannot.
I finally went to the __init__ definition of the Pusher component and saw that only the push_destination parameter can be a RuntimeParameter:
def __init__(
    self,
    model: Optional[types.BaseChannel] = None,
    model_blessing: Optional[types.BaseChannel] = None,
    infra_blessing: Optional[types.BaseChannel] = None,
    push_destination: Optional[Union[pusher_pb2.PushDestination,
                                     data_types.RuntimeParameter]] = None,
    custom_config: Optional[Dict[str, Any]] = None,
    custom_executor_spec: Optional[executor_spec.ExecutorSpec] = None):
Then I defined the component accordingly, using my RuntimeParameter:
model_root = tfx.orchestration.data_types.RuntimeParameter(name='model-serving-location', ptype=str)

pusher = Pusher(
    model=trainer.outputs['model'],
    model_blessing=evaluator.outputs['blessing'],
    push_destination=model_root
)
As the push_destination parameter is expected to be the message proto tfx.proto.pusher_pb2.PushDestination, you then have to respect the associated schema when instantiating and running a pipeline execution, meaning the value should look like:
{'name': 'model-serving-location', 'value': '{"filesystem": {"base_directory": "path/to/model/serving/for/the/run"}}'}
Regards
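As a convenience, here is a small hedged sketch of building that value string programmatically instead of hand-writing the JSON (the serving path is a placeholder):
import json

# Hypothetical serving directory for this run; replace with your own path.
serving_dir = "path/to/model/serving/for/the/run"

# The runtime value for 'model-serving-location' is the JSON form of the
# PushDestination proto: a 'filesystem' message with 'base_directory' set.
push_destination_value = json.dumps({"filesystem": {"base_directory": serving_dir}})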

Accessing the return value of a Lambda Step in Sagemaker pipeline

I've added a Lambda Step as the first step in my Sagemaker Pipeline. It processes some data and creates 2 files as part of the output like so:
from sagemaker.workflow.lambda_step import LambdaStep, Lambda, LambdaOutput, LambdaOutputTypeEnum

# lamb_preprocess = LambdaStep(func_arn="")

output_param_1 = LambdaOutput(output_name="status", output_type=LambdaOutputTypeEnum.Integer)
output_param_2 = LambdaOutput(output_name="file_name_a_c_drop", output_type=LambdaOutputTypeEnum.String)
output_param_3 = LambdaOutput(output_name="file_name_q_c_drop", output_type=LambdaOutputTypeEnum.String)

step_lambda = LambdaStep(
    name="ProcessingLambda",
    lambda_func=Lambda(
        function_arn="arn:aws:lambda:us-east-1:xxxxxxxx:function:xxxxx"
    ),
    inputs={
        "input_data": input_data,
        "input_file": trigger_file,
        "input_bucket": trigger_bucket
    },
    outputs=[
        output_param_1, output_param_2, output_param_3
    ]
)
In my next step, I want to trigger a Processing Job, for which I need to pass in the above Lambda function's outputs as its inputs. I'm trying to do it like so:
inputs = [
    ProcessingInput(source=step_lambda.properties.Outputs["file_name_q_c_drop"], destination="/opt/ml/processing/input"),
    ProcessingInput(source=step_lambda.properties.Outputs["file_name_a_c_drop"], destination="/opt/ml/processing/input"),
]
However, when the processing step is being created, I get a validation error saying
Object of type Properties is not JSON serializable
I followed the data dependency docs here: https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_model_building_pipeline.html#lambdastep and tried accessing step_lambda.OutputParameters["file_name_a_c_drop"] too but it errored out saying 'LambdaStep' object has no attribute 'OutputParameters'
How do I properly access the return value of a LambdaStep in a Sagemaker pipeline ?
You can access the output as follows - step_lambda.OutputParameters["output1"]. You don't need to add .properties
To access a LambdaStep output in another step you can do this:
step_lambda.properties.Outputs["file_name_a_c_drop"]
Try this:
step_lambda.properties.ProcessingOutputConfig.Outputs["file_name_q_c_drop"].S3Output.S3Uri
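For completeness, a minimal hedged sketch of the Lambda handler side: the keys of the dictionary it returns are what the LambdaOutput names above (status, file_name_a_c_drop, file_name_q_c_drop) refer to; the S3 URIs here are placeholders:
def lambda_handler(event, context):
    # event carries the LambdaStep inputs: input_data, input_file, input_bucket.
    # ... preprocessing logic would go here ...
    return {
        "status": 200,
        "file_name_a_c_drop": "s3://my-bucket/prefix/a_c_drop.csv",  # placeholder
        "file_name_q_c_drop": "s3://my-bucket/prefix/q_c_drop.csv",  # placeholder
    }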

How to execute a Groovy shared library with optional parameters

I'm developing a shared library to call from my Jenkinsfile. The library has a function with optional parameters, and I want to be able to call this function with any subset of parameters by specifying only the values I need. I've been googling a lot but couldn't find a good answer, so maybe somebody here can help me.
Example:
The function looks this way:
def doRequest(def moduleName=env.MODULE_NAME, def environment=env.ENVIRONMENT, def repoName=env.REPO_NAME) {
    <some code goes here>
}
If I execute it from my Jenkinsfile this way:
script {
    sendDeploymentStatistics.doRequest service_name
}
the function assigns the "service_name" value to moduleName, but how do I specify the "repoName" parameter?
In Python you would do it somehow like:
function_name(moduleName=service_name, repoName=repo_name)
but in Groovy + Jenkinsfile I can't find the right way.
Can anybody please help me to find out the right syntax?
Thank you!
Groovy has the concept of Default Parameters. If you change the order of the parameters, such that the environment comes last:
def doRequest(def moduleName=env.MODULE_NAME, def repoName=env.REPO_NAME, def environment=env.ENVIRONMENT) {
    <some code goes here>
}
Then your call to function_name will take the default value for environment:
function_name(moduleName=service_name, repoName=repo_name)
Groovy however also has some sort of support for Named Parameters. It is not as nice as Python but you can get it to work as follows:
env = [MODULE_NAME: 'foo', ENVIRONMENT: 'bar', REPO_NAME: 'baz']

def doRequest(Map args = [:]) {
    defaultMap = [moduleName: env.MODULE_NAME, environment: env.ENVIRONMENT, repoName: env.REPO_NAME]
    args = defaultMap << args
    return "${args.moduleName} ${args.environment} ${args.repoName}"
}

assert 'foo bar baz' == doRequest()
assert 'foo bar qux' == doRequest(repoName: 'qux')
assert '1 2 3' == doRequest(repoName: '3', moduleName: '1', environment: '2')
For Named Parameters you need a parameter of type Map (with a default value of the empty map); Groovy will then map the named arguments you pass at the call site to entries in that Map.
To get default values, you create a map of defaults and merge that defaultMap with the passed-in arguments.

Dataflow template not using the runtime parameters

I am using a Dataflow template to run Cloud Dataflow.
I provide some default values and call the template. Dataflow shows the pipeline options correctly in the pipeline summary, but it does not pick up the runtime values.
class Mypipeoptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument(
            '--preprocess_indir',
            help='GCS path of the data to be preprocessed',
            required=False,
            default='gs://default/dataset/'
        )
        parser.add_value_provider_argument(
            '--output_dir_train',
            help='GCS path of the preprocessed train data',
            required=False,
            default='gs://default/train/'
        )
        parser.add_value_provider_argument(
            '--output_dir_test',
            help='GCS path of the preprocessed test data',
            required=False,
            default='gs://default/test/'
        )
        parser.add_value_provider_argument(
            '--output_dir_validate',
            help='GCS path of the preprocessed validate data',
            required=False,
            default='gs://default/validate/'
        )
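For context, here is a small hedged sketch of how such custom options are typically obtained before building the pipeline, so that an args object like the one used below carries the ValueProvider fields (this surrounding code is not shown in the question and is only an assumption):
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Parse the command line (or template defaults) and view the result
# as the custom options class defined above.
args = PipelineOptions().view_as(Mypipeoptions)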
Then I check that the values are accessible:
p = beam.Pipeline(options=args)

if args.preprocess_indir.is_accessible():
    input_dir = args.preprocess_indir
else:
    input_dir = getValObj(args.preprocess_indir)

if args.output_dir_train.is_accessible():
    output_train = args.output_dir_train
else:
    output_train = getValObj(args.output_dir_train)

if args.output_dir_test.is_accessible():
    output_test = args.output_dir_test
else:
    output_test = getValObj(args.output_dir_test)

if args.output_dir_validate.is_accessible():
    output_validate = args.output_dir_validate
else:
    output_validate = getValObj(args.output_dir_validate)
Now, when calling the template, I can see the values I want being passed as (Mypipeoptions) pipeline option parameters, but they are not used in the actual run; the given default options are used instead.
I think I found the solution: I was assigning the runtime parameters to variables and then passing those variables to the input or output.
When I passed the runtime parameters directly to the source or sink, it worked, like the example below:
'Write train dataset to destination' >> beam.io.tfrecordio.WriteToTFRecord(
    file_path_prefix=args.output_dir_train
)
I believe the part I missed is that the graph is built when the template is created, and only runtime parameters (ValueProviders) can be plugged in at run time; everything else is already computed while the graph is being built.
Please correct me if I am wrong.
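The same principle applies to transforms that are not ValueProvider-aware: keep the ValueProvider itself on the object and call .get() only inside process(), i.e. at run time. A minimal hedged sketch (the DoFn and element handling are placeholders, not part of the original question):
import apache_beam as beam

class AddInputDirFn(beam.DoFn):
    """Hypothetical DoFn that resolves a ValueProvider at run time."""

    def __init__(self, input_dir_vp):
        # Store the ValueProvider itself; do not call .get() here,
        # because __init__ runs while the template graph is being built.
        self._input_dir_vp = input_dir_vp

    def process(self, element):
        # .get() is only safe at run time, once a value has been supplied.
        yield (self._input_dir_vp.get(), element)

# Usage sketch: pass the ValueProvider from the custom options straight in.
# records | beam.ParDo(AddInputDirFn(args.preprocess_indir))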

How to find the concurrent.future input arguments for a Dask distributed function call

I'm using Dask to distribute work to a cluster. I create a cluster and call .submit() to submit a function to the scheduler. It returns a Future object. I'm trying to figure out how to obtain the input arguments of that Future once it has completed.
For example:
from dask.distributed import Client
from dask_yarn import YarnCluster
def somefunc(a, b, c, ..., n):
    # do something
    return
cluster = YarnCluster.from_specification(spec)
client = Client(cluster)
future = client.submit(somefunc, arg1, arg2, ..., argn)
# ^^^ how do I obtain the input arguments for this future object?
# `future.args` doesn't work
Futures don't hold onto their inputs. You can do this yourself though.
futures = {}
future = client.submit(func, *args)
futures[future] = args
A future only knows the key by which it is uniquely known on the scheduler. At the time of submission, if it has dependencies, these are transiently found and sent to the scheduler, but no copy is kept locally.
The pattern you are after sounds more like delayed, which keeps hold of its graph, and indeed client.compute(delayed_thing) returns a future.
d = delayed(somefunc)(a, b, c)
future = client.compute(d)
dict(d.dask) # graph of things needed by d
You could communicate directly with the scheduler to find the dependencies of some key, which will in general also be keys, and so reverse-engineer the graph, but that does not sound like a great path, so I won't try to describe it here.
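Building on the bookkeeping idea from the first answer, here is a hedged sketch of recovering the submitted arguments once the futures complete (somefunc and the argument tuples are placeholders):
from dask.distributed import as_completed

# Map each future back to the arguments it was submitted with.
futures = {}
for args in [(1, 2, 3), (4, 5, 6)]:
    future = client.submit(somefunc, *args)
    futures[future] = args

for future in as_completed(futures):
    result = future.result()
    submitted_args = futures[future]  # the original inputs for this future
    print(submitted_args, "->", result)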
