Dataflow Pipeline workers stall when passing extra arguments in PipelineOptions - google-cloud-dataflow

I have a Dataflow job defined in Apache Beam that works fine normally but breaks when I attempt to include all of my custom command-line options in the PipelineOptions that I pass to beam.Pipeline(options=pipeline_options). It fails after the graph is constructed but before the first step starts: the worker becomes unresponsive after starting up and the job eventually times out with no useful logs.
I would like to pass my custom options because only the options passed directly to the pipeline show up on the right-hand side of the Dataflow console UI, and it's very handy to be able to see them.
Full broken example is here. The old version that works looked more or less like this:
def run():
    parser = argparse.ArgumentParser()
    # Many parser.add_argument lines
    known_args, pipeline_args = parser.parse_known_args()
    pipeline_options = PipelineOptions(pipeline_args)
    pipeline_options.view_as(SetupOptions).save_main_session = True
    with beam.Pipeline(options=pipeline_options) as p:
        # Pipeline definition
The code that doesn't work looks like this:
class CustomOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # Same lines of parser.add_argument
def run():
    pipeline_options = CustomOptions()
    pipeline_options.view_as(SetupOptions).save_main_session = True
    with beam.Pipeline(options=pipeline_options) as p:
        # Same pipeline definition
Here are the extra keys that I end up passing into the PipelineOptions object.
api_key
dataset_id
date_column
date_grouping_frequency
input_bigquery_sql
input_mode
org_id
output
output_executable_path # This one isn't really me, it just ends up in there
Setting aside that the argparse/PipelineOptions API seems to be based entirely on side effects, I can't make sense of why this would lead to the job failing to start. My best guess is that one of the options I'm passing through is overwriting or having some unintended side effect on the worker, but I've done this sort of thing before, so I know it's possible in general to pass options through like this and have the pipeline work.
Can someone spot an issue that might cause the first worker to become unresponsive? Something about the way I'm passing options in seems to be the problem.

I tested with your arguments on Beam 2.41.0 and Python 3.8.12:
"api_key": "test",
"dataset_id": "test",
"date_column": "test",
"date_grouping_frequency": "test",
"input_bigquery_sql": "test",
"input_mode": "test",
"org_id": "test",
"output": "test",
"output_executable_path": "test"
In the Beam options:
class CustomOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_argument("--api_key", help="Api key", required=True)
        parser.add_argument("--dataset_id", help="dataset ID", required=True)
        parser.add_argument("--date_column", help="date_column", required=True)
        parser.add_argument("--date_grouping_frequency", help="date_grouping_frequency", required=True)
        parser.add_argument("--input_bigquery_sql", help="input_bigquery_sql", required=True)
        parser.add_argument("--input_mode", help="input_mode", required=True)
        parser.add_argument("--org_id", help="org_id", required=True)
        parser.add_argument("--output", help="output", required=True)
        parser.add_argument("--output_executable_path", help="output_executable_path", required=True)
In the Beam pipeline:
def run():
    custom_pipeline_options = PipelineOptions().view_as(CustomOptions)
    pipeline_options = PipelineOptions()
    with beam.Pipeline(options=pipeline_options) as p:
        # Get your custom option arguments
        custom_pipeline_options.api_key
        custom_pipeline_options.dataset_id
        ......
When the argument output_executable_path is part of the options, I get the following error:
[2022-11-18, 22:51:38 UTC]
{beam.py:127} WARNING - argparse.ArgumentError: argument --output_executable_path: conflicting option string: --output_executable_path
There is a conflict with an argument used internally by Beam.
When I remove the argument output_executable_path from the options, the Dataflow job runs without issue.
Can you test without this argument, please?
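If you still want the remaining custom flags to show up in the Dataflow UI, one option (a minimal sketch, not the poster's full pipeline) is to keep CustomOptions but simply not re-register the flag Beam already defines, and read it back through the built-in options if you need it:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions


class CustomOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # Only arguments that Beam does not already define internally;
        # --output_executable_path is deliberately left out.
        parser.add_argument("--api_key", required=True)
        parser.add_argument("--dataset_id", required=True)


def run():
    pipeline_options = PipelineOptions()
    pipeline_options.view_as(SetupOptions).save_main_session = True
    custom_options = pipeline_options.view_as(CustomOptions)

    with beam.Pipeline(options=pipeline_options) as p:
        _ = custom_options.api_key  # custom values remain accessible here
        # If --output_executable_path was set, Beam's own definition exposes it:
        _ = pipeline_options.get_all_options().get("output_executable_path")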

Related

str() is no longer usable to get the true value of a Text tfx.data_types.RuntimeParameter during pipeline execution

How to get the string as the true value of tfx.orchestration.data_types.RuntimeParameter during pipeline execution?
Hi,
I'm defining a runtime parameter like data_root = tfx.orchestration.data_types.RuntimeParameter(name='data-root', ptype=str) for a base path, from which I define many subfolders for various components, like str(data_root)+'/model' for the model serving path in tfx.components.Pusher().
It was working like a charm before I moved to tfx==1.12.0: str(data_root) now returns a JSON dump.
To overcome that, I tried to define a runtime parameter for the model path, like model_root = tfx.orchestration.data_types.RuntimeParameter(name='model-root', ptype=str), and then feed the Pusher component the way I saw in many tutorials:
pusher = Pusher(model=trainer.outputs['model'],
                model_blessing=evaluator.outputs['blessing'],
                push_destination=tfx.proto.PushDestination(
                    filesystem=tfx.proto.PushDestination.Filesystem(base_directory=model_root)))
but I get a TypeError saying tfx.proto.PushDestination.Filesystem does not accept a RuntimeParameter.
It completely breaks the existing setup, as I receive those parameters from an external client for each Kubeflow run.
Thanks a lot for any help.
I was able to fix it.
First of all, the docstring is not clear about which parameters of Pusher can be a RuntimeParameter and which cannot.
I finally went to the __init__ definition of the Pusher component and saw that only the push_destination parameter can be a RuntimeParameter:
def __init__(
    self,
    model: Optional[types.BaseChannel] = None,
    model_blessing: Optional[types.BaseChannel] = None,
    infra_blessing: Optional[types.BaseChannel] = None,
    push_destination: Optional[Union[pusher_pb2.PushDestination,
                                     data_types.RuntimeParameter]] = None,
    custom_config: Optional[Dict[str, Any]] = None,
    custom_executor_spec: Optional[executor_spec.ExecutorSpec] = None):
Then I defined the component accordingly, using my RuntimeParameter:
model_root = tfx.orchestration.data_types.RuntimeParameter(name='model-serving-location', ptype=str)
pusher = Pusher(model=trainer.outputs['model'],
                model_blessing=evaluator.outputs['blessing'],
                push_destination=model_root)
As the push_destination parameter is expected to be a tfx.proto.pusher_pb2.PushDestination message proto, you then have to respect the associated schema when instantiating and running a pipeline execution, meaning the value supplied for the model-serving-location parameter should look like:
'{"filesystem": {"base_directory": "path/to/model/serving/for/the/run"}}'
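For example, when triggering a Kubeflow run from Python, one way to pass that value (a sketch assuming the kfp v1 SDK client; the host, file name, and path are placeholders) is:
import json
import kfp

# Serialized PushDestination proto, following the schema above.
push_destination_value = json.dumps(
    {"filesystem": {"base_directory": "gs://my-bucket/serving/run-001"}}  # placeholder path
)

client = kfp.Client(host="https://my-kfp-endpoint")  # placeholder endpoint
client.create_run_from_pipeline_package(
    "pipeline.json",  # the compiled pipeline package produced by your runner
    arguments={"model-serving-location": push_destination_value},
)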
Regards

Kedro - how to pass nested parameters directly to node

kedro recommends storing parameters in conf/base/parameters.yml. Let's assume it looks like this:
step_size: 1
model_params:
    learning_rate: 0.01
    test_data_ratio: 0.2
    num_train_steps: 10000
And now imagine I have some data_engineering pipeline whose nodes.py has a function that looks something like this:
def some_pipeline_step(num_train_steps):
    """
    Takes the parameter `num_train_steps` as argument.
    """
    pass
How would I go about passing that nested parameter straight to this function in data_engineering/pipeline.py? I unsuccessfully tried:
from kedro.pipeline import Pipeline, node
from .nodes import split_data

def create_pipeline(**kwargs):
    return Pipeline(
        [
            node(
                some_pipeline_step,
                ["params:model_params.num_train_steps"],
                dict(
                    train_x="train_x",
                    train_y="train_y",
                ),
            )
        ]
    )
I know that I could just pass all parameters into the function by using ['parameters'], or pass all model_params parameters with ['params:model_params'], but that seems inelegant and I feel like there must be a better way. Would appreciate any input!
(Disclaimer: I'm part of the Kedro team)
Thank you for your question. The current version of Kedro, unfortunately, does not support nested parameters. The interim solution would be to use top-level keys inside the node (as you already pointed out) or to decorate your node function with some sort of parameter filter, which is not elegant either.
Probably the most viable solution would be to customise your ProjectContext class (in src/<package_name>/run.py) by overriding the _get_feed_dict method as follows:
class ProjectContext(KedroContext):
    # ...
    def _get_feed_dict(self) -> Dict[str, Any]:
        """Get parameters and return the feed dictionary."""
        params = self.params
        feed_dict = {"parameters": params}

        def _add_param_to_feed_dict(param_name, param_value):
            """This recursively adds parameter paths to the `feed_dict`,
            whenever `param_value` is a dictionary itself, so that users can
            specify specific nested parameters in their node inputs.

            Example:
                >>> param_name = "a"
                >>> param_value = {"b": 1}
                >>> _add_param_to_feed_dict(param_name, param_value)
                >>> assert feed_dict["params:a"] == {"b": 1}
                >>> assert feed_dict["params:a.b"] == 1
            """
            key = "params:{}".format(param_name)
            feed_dict[key] = param_value
            if isinstance(param_value, dict):
                for key, val in param_value.items():
                    _add_param_to_feed_dict("{}.{}".format(param_name, key), val)

        for param_name, param_value in params.items():
            _add_param_to_feed_dict(param_name, param_value)
        return feed_dict
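With that override in place, the nested key can be referenced directly as a node input. A sketch of the pipeline from the question (assuming some_pipeline_step is the node function and returns nothing):
from kedro.pipeline import Pipeline, node
from .nodes import some_pipeline_step  # the question's node function


def create_pipeline(**kwargs):
    return Pipeline(
        [
            node(
                some_pipeline_step,
                "params:model_params.num_train_steps",  # resolved via the custom feed dict
                None,  # some_pipeline_step returns nothing
            )
        ]
    )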
Please also note that this issue has already been addressed on develop and will become available in the next release. The fix uses the approach from the snippet above.
As mentioned by Dmitry, Kedro 0.16.0 introduced nested parameter values inside node inputs, which can be accessed via the . operator:
node(func, "params:a.b", None)
whereas Kedro 0.17.6 enabled overriding nested parameters with --params in the CLI, e.g.
kedro run --params="model.model_tuning.booster:gbtree"
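Applied to the parameters.yml from the question, the equivalent run-time override would be (the value 20000 is just an example):
kedro run --params="model_params.num_train_steps:20000"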

Dataflow template not using the runtime parameters

I am using a Dataflow template to run Cloud Dataflow.
I am providing some default values and calling the template. Dataflow shows the pipeline options correctly in the Dataflow pipeline summary, but it is not picking up the runtime values.
class Mypipeoptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument(
            '--preprocess_indir',
            help='GCS path of the data to be preprocessed',
            required=False,
            default='gs://default/dataset/'
        )
        parser.add_value_provider_argument(
            '--output_dir_train',
            help='GCS path of the preprocessed train data',
            required=False,
            default='gs://default/train/'
        )
        parser.add_value_provider_argument(
            '--output_dir_test',
            help='GCS path of the preprocessed test data',
            required=False,
            default='gs://default/test/'
        )
        parser.add_value_provider_argument(
            '--output_dir_validate',
            help='GCS path of the preprocessed validate data',
            required=False,
            default='gs://default/validate/'
        )
Then I check that the values are accessible:
p = beam.Pipeline(options=args)
if args.preprocess_indir.is_accessible():
    input_dir = args.preprocess_indir
else:
    input_dir = getValObj(args.preprocess_indir)
if args.output_dir_train.is_accessible():
    output_train = args.output_dir_train
else:
    output_train = getValObj(args.output_dir_train)
if args.output_dir_test.is_accessible():
    output_test = args.output_dir_test
else:
    output_test = getValObj(args.output_dir_test)
if args.output_dir_validate.is_accessible():
    output_validate = args.output_dir_validate
else:
    output_validate = getValObj(args.output_dir_validate)
Now, when calling the template, I can see the values I want being passed as (Mypipeoptions) pipeline option parameters, but they are not used in the actual run; the default values are used instead.
I think I found the solution: I was assigning the runtime parameters to variables and then passing those variables to the input or output.
When I passed the runtime parameters directly to the source or sink, it worked, like the example below:
'Write train dataset to destination' >> beam.io.tfrecordio.WriteToTFRecord(
    file_path_prefix=args.output_dir_train
)
I believe the part I missed is that the graph is built when the template is created, and only runtime parameters can be plugged in at run time; other computations are already fixed when the graph is built.
Please correct me if I am wrong.
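That matches the documented ValueProvider behaviour. A minimal sketch (not the poster's full template) of the two ways a runtime value can be consumed so that the template actually sees it rather than the default:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class Mypipeoptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument(
            '--output_dir_train', default='gs://default/train/')


class TagWithOutputDir(beam.DoFn):
    def __init__(self, output_dir_vp):
        self.output_dir_vp = output_dir_vp  # store the ValueProvider, do not resolve it yet

    def process(self, element):
        # .get() is only legal at run time, once the template parameters exist.
        yield '%s -> %s' % (element, self.output_dir_vp.get())


def run():
    options = PipelineOptions()
    args = options.view_as(Mypipeoptions)
    with beam.Pipeline(options=options) as p:
        _ = (p
             | 'Create' >> beam.Create(['a', 'b'])
             | 'Tag' >> beam.ParDo(TagWithOutputDir(args.output_dir_train))
             # Sinks such as WriteToText/WriteToTFRecord accept a ValueProvider
             # directly, so the runtime value is honoured here as well.
             | 'Write' >> beam.io.WriteToText(args.output_dir_train))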

How to find the concurrent.future input arguments for a Dask distributed function call

I'm using Dask to distribute work to a cluster. I create a cluster and call .submit() to submit a function to the scheduler, which returns a Future object. I'm trying to figure out how to obtain the input arguments of that Future once it has completed.
For example:
from dask.distributed import Client
from dask_yarn import YarnCluster
def somefunc(a, b, c, ..., n):
    # do something
    return
cluster = YarnCluster.from_specification(spec)
client = Client(cluster)
future = client.submit(somefunc, arg1, arg2, ..., argn)
# ^^^ how do I obtain the input arguments for this future object?
# `future.args` doesn't work
Futures don't hold onto their inputs. You can do this yourself though.
futures = {}
future = client.submit(func, *args)
futures[future] = args
A future only knows the key by which it is uniquely known on the scheduler. At the time of submission, if it has dependencies, these are transiently found and sent to the scheduler, but no copy is kept locally.
The pattern you are after sounds more like delayed, which keeps hold of its graph; indeed, client.compute(delayed_thing) returns a future.
d = delayed(somefunc)(a, b, c)
future = client.compute(d)
dict(d.dask) # graph of things needed by d
You could communicate directly with the scheduler to find the dependencies of some key, which will in general also be keys, and so reverse-engineer the graph, but that does not sound like a great path, so I won't try to describe it here.
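Putting the first suggestion together, a minimal local sketch of tracking the arguments yourself and recovering them as each future completes (LocalCluster stands in for the YarnCluster in the question):
from dask.distributed import Client, LocalCluster, as_completed


def somefunc(a, b):
    return a + b


if __name__ == '__main__':
    client = Client(LocalCluster(n_workers=2))

    args_by_future = {}
    for args in [(1, 2), (3, 4), (5, 6)]:
        future = client.submit(somefunc, *args)
        args_by_future[future] = args  # remember the inputs, keyed by the future

    for future in as_completed(args_by_future):
        print(args_by_future[future], '->', future.result())

    client.close()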

Need help to read a json file using groovy and Jenkins

I am facing some issues reading a JSON file.
I am using the Jenkins Active Choice Parameter plugin to read values from a JSON file via a Groovy script. This is how my JSON file looks:
{
"smoke": "Test1.js",
"default": "Test2.js"
}
I want my Groovy script to print out smoke and default. Below is what my Groovy code looks like:
import groovy.json.JsonSlurper
def inputFile = new File(".\TestSuitesJ.json")
def InputJSON = new JsonSlurper().parseText(inputFile)
InputJson.each
{
return[
key
]
}
The above code is not working for me. Can someone please suggest a better Groovy way?
For anyone in a similar situation trying to import a JSON file at runtime: I used the Active Choice parameter to solve my problem. There is an option to write a Groovy script in the Active Choice Parameter plugin of Jenkins. There I wrote the code below to import a JSON file and achieve the desired results.
import groovy.json.JsonSlurper
def inputFile = new File('.//TestSuitesJ.json')
def inputJSON = new JsonSlurper().parse(inputFile)
def keys = inputJSON.keySet() as List
Thanks @sensei for helping me learn Groovy.
You really should read the Groovy dev kit page, and this in particular.
Because in your case parseText() returns a LazyMap instance, the it variable you're getting in your each closure represents a Map.Entry instance, so you could println it.key to get what you want.
A more Groovy way would be:
inputJson.each { k, v ->
println k
}
In which case Groovy passes your closure the key (k) and the value (v) for every element in the map.
