Accessing the return value of a Lambda Step in Sagemaker pipeline - machine-learning

I've added a Lambda Step as the first step in my Sagemaker Pipeline. It processes some data and creates 2 files as part of the output like so:
from sagemaker.workflow.lambda_step import LambdaStep, Lambda, LambdaOutput, LambdaOutputTypeEnum
# lamb_preprocess = LambdaStep(func_arn="")
output_param_1 = LambdaOutput(output_name="status", output_type=LambdaOutputTypeEnum.Integer)
output_param_2 = LambdaOutput(output_name="file_name_a_c_drop", output_type=LambdaOutputTypeEnum.String)
output_param_3 = LambdaOutput(output_name="file_name_q_c_drop", output_type=LambdaOutputTypeEnum.String)
step_lambda = LambdaStep(
    name="ProcessingLambda",
    lambda_func=Lambda(
        function_arn="arn:aws:lambda:us-east-1:xxxxxxxx:function:xxxxx"
    ),
    inputs={
        "input_data": input_data,
        "input_file": trigger_file,
        "input_bucket": trigger_bucket
    },
    outputs=[
        output_param_1, output_param_2, output_param_3
    ]
)
In my next step, I want to trigger a Processing Job, and I need to pass the above Lambda function's outputs as its inputs. I'm trying to do it like so:
inputs = [
    ProcessingInput(source=step_lambda.properties.Outputs["file_name_q_c_drop"], destination="/opt/ml/processing/input"),
    ProcessingInput(source=step_lambda.properties.Outputs["file_name_a_c_drop"], destination="/opt/ml/processing/input"),
]
However, when the processing step is being created, I get a validation error saying:
Object of type Properties is not JSON serializable
I followed the data dependency docs here: https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_model_building_pipeline.html#lambdastep and also tried accessing step_lambda.OutputParameters["file_name_a_c_drop"], but that failed with 'LambdaStep' object has no attribute 'OutputParameters'.
How do I properly access the return value of a LambdaStep in a Sagemaker pipeline?

You can access the output as follows - step_lambda.OutputParameters["output1"]. You don't need to add .properties

To access a LambdaStep output in another step you can do this:
step_lambda.properties.Outputs["file_name_a_c_drop"]
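For those output references to resolve at run time, the Lambda function itself has to return a dictionary whose keys match the output_name values declared on the LambdaOutput objects. The handler below is only a sketch under that assumption: the bucket name and object keys are made-up placeholders, and returning full S3 URIs is a guess based on the outputs being used as ProcessingInput sources.

```python
def lambda_handler(event, context):
    """Hypothetical handler for the LambdaStep in the question.

    The returned keys mirror the three LambdaOutput names; the values
    are placeholders, not the asker's real file locations.
    """
    bucket = event["input_bucket"]  # one of the inputs passed by the LambdaStep
    # Real code would write the two files here and return where they ended up.
    return {
        "status": 200,
        "file_name_a_c_drop": f"s3://{bucket}/a_c_drop.csv",
        "file_name_q_c_drop": f"s3://{bucket}/q_c_drop.csv",
    }


# Local check of the return shape, using a placeholder bucket name.
result = lambda_handler({"input_bucket": "example-bucket"}, None)
print(sorted(result))
```

If a key is missing from the returned dictionary, the matching step output would presumably have nothing to resolve to, so it is worth checking the handler's return shape locally before wiring it into the pipeline.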

Try this
steplambda.properties.ProcessingOutputConfig.Outputs["file_name_q_c_drop"].S3Output.S3Uri

Related

str() is not usable anymore to get true value of a Text tfx.data_types.RuntimeParameter during pipeline execution

how to get string as true value of tfx.orchestration.data_types.RuntimeParameter during execution pipeline?
Hi,
I'm defining a runtime parameter like data_root = tfx.orchestration.data_types.RuntimeParameter(name='data-root', ptype=str) for a base path, from which I define many subfolders for various components like str(data_root)+'/model' for model serving path in tfx.components.Pusher().
It worked like a charm until I moved to tfx==1.12.0; str(data_root) now returns a JSON dump.
To work around that, I tried defining a runtime parameter for the model path, like model_root = tfx.orchestration.data_types.RuntimeParameter(name='model-root', ptype=str), and then feeding it to the Pusher component the way I saw in many tutorials:
pusher = Pusher(model=trainer.outputs['model'],
                model_blessing=evaluator.outputs['blessing'],
                push_destination=tfx.proto.PushDestination(
                    filesystem=tfx.proto.PushDestination.Filesystem(base_directory=model_root)))
but I get a TypeError saying tfx.proto.PushDestination.Filesystem does not accept a RuntimeParameter.
This completely breaks the existing setup, as I receive those parameters from an external client for each Kubeflow run.
Thanks a lot for any help.
I was able to fix it.
First of all, the docstring is not clear about which parameters of Pusher can be a RuntimeParameter.
I finally looked at the __init__ definition of the Pusher component and saw that only the push_destination parameter can be a RuntimeParameter:
def __init__(
    self,
    model: Optional[types.BaseChannel] = None,
    model_blessing: Optional[types.BaseChannel] = None,
    infra_blessing: Optional[types.BaseChannel] = None,
    push_destination: Optional[Union[pusher_pb2.PushDestination,
                                     data_types.RuntimeParameter]] = None,
    custom_config: Optional[Dict[str, Any]] = None,
    custom_executor_spec: Optional[executor_spec.ExecutorSpec] = None):
Then I defined the component accordingly, using my RuntimeParameter:
model_root = tfx.orchestration.data_types.RuntimeParameter(name='model-serving-location', ptype=str)
pusher = Pusher(model=trainer.outputs['model'],
                model_blessing=evaluator.outputs['blessing'],
                push_destination=model_root)
As the push_destination parameter is supposed to be a tfx.proto.pusher_pb2.PushDestination proto message, you then have to respect the associated schema when instantiating and running a pipeline execution, meaning the value should look like:
{'type': 'model-serving-location', 'value': '{"filesystem": {"base_directory": "path/to/model/serving/for/the/run"}}'}
Regards
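The value in that answer is a JSON string, so building it by hand invites quoting mistakes. A minimal sketch of generating it with the standard library follows; the helper name push_destination_value is hypothetical, and the layout simply mirrors the filesystem/base_directory structure shown in the answer.

```python
import json


def push_destination_value(base_directory: str) -> str:
    """Return the JSON string for a filesystem push destination.

    Hypothetical helper: it only reproduces the structure quoted in the
    answer and does not validate it against the PushDestination proto.
    """
    return json.dumps({"filesystem": {"base_directory": base_directory}})


# The path is the placeholder used in the answer, not a real location.
value = push_destination_value("path/to/model/serving/for/the/run")
print(value)
```

The resulting string is what would be supplied as the runtime parameter's value when the run is created.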

Kedro - how to pass nested parameters directly to node

Kedro recommends storing parameters in conf/base/parameters.yml. Let's assume it looks like this:
step_size: 1
model_params:
    learning_rate: 0.01
    test_data_ratio: 0.2
    num_train_steps: 10000
And now imagine I have some data_engineering pipeline whose nodes.py has a function that looks something like this:
def some_pipeline_step(num_train_steps):
    """
    Takes the parameter `num_train_steps` as argument.
    """
    pass
How would I go about passing that nested parameter straight to this function in data_engineering/pipeline.py? I unsuccessfully tried:
from kedro.pipeline import Pipeline, node
from .nodes import some_pipeline_step
def create_pipeline(**kwargs):
    return Pipeline(
        [
            node(
                some_pipeline_step,
                ["params:model_params.num_train_steps"],
                dict(
                    train_x="train_x",
                    train_y="train_y",
                ),
            )
        ]
    )
I know that I could just pass all parameters into the function by using ['parameters'], or pass all of model_params with ['params:model_params'], but that seems inelegant and I feel like there must be a better way. Would appreciate any input!
(Disclaimer: I'm part of the Kedro team)
Thank you for your question. The current version of Kedro, unfortunately, does not support nested parameters. The interim solution would be to use top-level keys inside the node (as you already pointed out) or to decorate your node function with some sort of parameter filter, which is not elegant either.
Probably the most viable solution would be to customise your ProjectContext class (in src/<package_name>/run.py) by overriding the _get_feed_dict method as follows:
class ProjectContext(KedroContext):
    # ...
    def _get_feed_dict(self) -> Dict[str, Any]:
        """Get parameters and return the feed dictionary."""
        params = self.params
        feed_dict = {"parameters": params}
        def _add_param_to_feed_dict(param_name, param_value):
            """This recursively adds parameter paths to the `feed_dict`,
            whenever `param_value` is a dictionary itself, so that users can
            specify specific nested parameters in their node inputs.
            Example:
                >>> param_name = "a"
                >>> param_value = {"b": 1}
                >>> _add_param_to_feed_dict(param_name, param_value)
                >>> assert feed_dict["params:a"] == {"b": 1}
                >>> assert feed_dict["params:a.b"] == 1
            """
            key = "params:{}".format(param_name)
            feed_dict[key] = param_value
            if isinstance(param_value, dict):
                for key, val in param_value.items():
                    _add_param_to_feed_dict("{}.{}".format(param_name, key), val)
        for param_name, param_value in params.items():
            _add_param_to_feed_dict(param_name, param_value)
        return feed_dict
Please also note that this issue has already been addressed on develop and will become available in the next release. The fix uses the approach from the snippet above.
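The recursive flattening used in that override can be tried outside Kedro. The following is a standalone sketch, not Kedro's own code: build_feed_dict is a hypothetical free function, and the parameters are the ones from the question's parameters.yml.

```python
from typing import Any, Dict


def build_feed_dict(params: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten nested parameters into "params:<dotted.path>" entries."""
    feed_dict: Dict[str, Any] = {"parameters": params}

    def add(name: str, value: Any) -> None:
        # Register the value itself, then recurse into nested dictionaries
        # so every level is addressable by its dotted path.
        feed_dict[f"params:{name}"] = value
        if isinstance(value, dict):
            for child_name, child_value in value.items():
                add(f"{name}.{child_name}", child_value)

    for name, value in params.items():
        add(name, value)
    return feed_dict


params = {
    "step_size": 1,
    "model_params": {
        "learning_rate": 0.01,
        "test_data_ratio": 0.2,
        "num_train_steps": 10000,
    },
}
feed = build_feed_dict(params)
print(feed["params:model_params.num_train_steps"])
```

Both the whole dictionary ("params:model_params") and each leaf ("params:model_params.num_train_steps") end up as keys, which is what lets a node input name a single nested value.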
As mentioned by Dmitry, kedro 0.16.0 introduced nested parameter values inside the node inputs which can be accessed via . operator:
node(func, "params:a.b", None)
whereas kedro 0.17.6 enabled overriding nested parameters with params in CLI, e.g.
kedro run --params="model.model_tuning.booster:gbtree"

Dataflow template not using the runtime parameters

I am using a Dataflow template to run Cloud Dataflow.
I provide some default values and call the template. Dataflow shows the pipeline options correctly in the pipeline summary, but the run does not pick up the runtime values.
class Mypipeoptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument(
            '--preprocess_indir',
            help='GCS path of the data to be preprocessed',
            required=False,
            default='gs://default/dataset/'
        )
        parser.add_value_provider_argument(
            '--output_dir_train',
            help='GCS path of the preprocessed train data',
            required=False,
            default='gs://default/train/'
        )
        parser.add_value_provider_argument(
            '--output_dir_test',
            help='GCS path of the preprocessed test data',
            required=False,
            default='gs://default/test/'
        )
        parser.add_value_provider_argument(
            '--output_dir_validate',
            help='GCS path of the preprocessed validate data',
            required=False,
            default='gs://default/validate/'
        )
Then I check whether the values are accessible:
p = beam.Pipeline(options=args)
if args.preprocess_indir.is_accessible():
    input_dir = args.preprocess_indir
else:
    input_dir = getValObj(args.preprocess_indir)
if args.output_dir_train.is_accessible():
    output_train = args.output_dir_train
else:
    output_train = getValObj(args.output_dir_train)
if args.output_dir_test.is_accessible():
    output_test = args.output_dir_test
else:
    output_test = getValObj(args.output_dir_test)
if args.output_dir_validate.is_accessible():
    output_validate = args.output_dir_validate
else:
    output_validate = getValObj(args.output_dir_validate)
Now when calling the template, I can see the values I want being passed as (Mypipeoptions) pipeline option parameters, but they are not used in the actual run; the default options are used instead.
I think I found the solution. I was assigning the runtime parameters to variables and then passing those to the input or output.
When I passed the runtime parameters directly to the source or sink, it worked, like the one below:
'Write train dataset to destination' >> beam.io.tfrecordio.WriteToTFRecord(
    file_path_prefix=args.output_dir_train
)
I believe the part I missed is that the graph is built when the template is created, and only the runtime parameters can be plugged in when it runs. Any other computation has already been done while building the graph.
Please correct me if I am wrong.
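The build-time versus run-time distinction described in that answer can be illustrated without Beam. The sketch below uses a toy DeferredOption class as a stand-in for a value provider; it is not Beam's API, and the gs://runtime/train/ path is a made-up runtime value.

```python
class DeferredOption:
    """Toy stand-in for a runtime-resolved option (not Beam's ValueProvider)."""

    def __init__(self, default):
        self.default = default
        self.runtime_value = None  # filled in only when the job is launched

    def get(self):
        if self.runtime_value is not None:
            return self.runtime_value
        return self.default


def build_sink_eager(option):
    # Reads the option while the "template" is being built,
    # so whatever is known at that moment is frozen into the sink.
    prefix = option.get()
    return lambda: prefix


def build_sink_deferred(option):
    # Keeps the option object and reads it only when the sink runs.
    return lambda: option.get()


option = DeferredOption("gs://default/train/")
eager_sink = build_sink_eager(option)        # graph construction time
deferred_sink = build_sink_deferred(option)  # graph construction time

option.runtime_value = "gs://runtime/train/"  # supplied when the template is run

eager_result = eager_sink()
deferred_result = deferred_sink()
print(eager_result, deferred_result)
```

The eager sink keeps the default because it read the option before the runtime value existed, which matches the behaviour reported in the question; the deferred sink sees the runtime value because it holds on to the option object itself.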

Jenkins Groovy Pipeline org.jenkinsci.plugins.scriptsecurity.sandbox.RejectedAccessException: No such field found: field groovy.util.Node

I am retrieving an XML file from a remote host and parsing it using XmlParser. The content of the file is as follows:
<?xml version="1.0" encoding="utf-8"?><Metrics> <Safety> <score>81.00</score> <Percentrules>98.00</Percentrules> </Safety> </Metrics>
I am able to retrieve the score value in the following way when I execute the script outside the Groovy sandbox.
def report = readFile(file: 'Qualitycheck.xml')
def metrics = new XmlParser().parseText(report)
println metrics
double score = Double.parseDouble(metrics.Safety.score[0].value()[0])
However, when I execute the script using SCM I get the following:
org.jenkinsci.plugins.scriptsecurity.sandbox.RejectedAccessException: No such field found: field groovy.util.Node
The issue persists even though I have installed the Permissive-Script-Security-Plugin and enabled it using the -Dpermissive-script-security.enabled=no_security JVM option. Is there something different about this method? No other method is causing issues. Why?
Edit
I decided to use XmlSlurper() and retrieved the value 81.00. However, the result was of type groovy.util.slurpersupport.NodeChildren:
def metrics2 = new XmlSlurper().parseText(report)
def score = metrics2.Safety.score
print score
print score.getClass()
=> 81.0098.00
=> groovy.util.slurpersupport.NodeChildren
How do I use XmlSlurper to extract the value 81.00 and cast it as double? Will that be a good alternative?
There seem to be some issues in the script sandbox with Node and NodeList field access. You can work around it like the following; it's not nice, but at least it works.
node() {
    def xml = readFile "${env.WORKSPACE}/Qualitycheck.xml"
    def rootNode = new XmlParser().parseText(xml)
    print Double.parseDouble(rootNode.value()[0].value()[0].value()[0])
    // Use the next line if the position isn't fixed; it can return an array
    // if there's more than one element with the structure "Safety.score", and [0] at the end takes the first.
    print Double.parseDouble(rootNode.find{it.name() == "Safety"}.value().find{it.name() == "score"}.value()[0])
}
You also need to approve the following signatures in the In-process Script Approval section of the Manage Jenkins menu.
method groovy.util.Node name
method groovy.util.Node value
method groovy.util.XmlParser parseText java.lang.String
new groovy.util.XmlParser
staticMethod java.lang.Double parseDouble java.lang.String
staticMethod org.codehaus.groovy.runtime.DefaultGroovyMethods find java.lang.Object groovy.lang.Closure

Jenkins How To Run Same Job for the Same Count the Parameters are defined?

I have written a bash script that checks telnet connectivity to several IPs and ports. The input data is in a CSV file; the script reads each row and checks whether the IP can be reached via telnet.
However, I need to run it from Jenkins, and I am wondering whether there is a way to define my parameter in the Jenkins job with different combinations of values,
say for example:
PARAM_KEY : VAL_1
PARAM_KEY : VAL_2
PARAM_KEY : VAL_3
and so on, so that I can use PARAM_KEY in the script and the Jenkins job gets executed once for each parameter defined, i.e. 3 times in the above case.
Can anyone guide me on this requirement?
If you mean to run one job and iterate over the IPs inside it, you can parse the CSV file inside a pipeline, or pass it as a parameter and then split it:
// example of pipeline code
node ('slave80') {
    csvString = "1.1.1.1,2.2.2.2,3.3.3.3" // can be sent as parameter
    def ips = csvString.split(',')
    ips.each { ip ->
        sh """
        ./bash_script ${ip}
        """
    }
}
