Let's suppose I have a Step Function with a Map State. The Map State is a Batch Job, associated with a Docker container. I want pass input parameters to containers, and receive output for other SF's states.
I believe it could be a Lambda Function, iterating thru the input as array, and pass each element as environment variables set to containers. But how could the lambda working with foreach + environment variables look like? How can I catch Docker container output (I believe it could be S3 file/directory)?
Also is there any alternative to a Lambda Function at all?
Handling the iterator:
If you have a predefined input array that you want to iterate on with the map state, then you can just pass that as the Map InputPath and ItemsPath, but in some cases you may need to setup a lambda that that will go and create that list for you.
Your ItemsPath might looks something like:
"list": [
{
"input": "<my_cool_input parameters>"
},
{
"input": "<my_cool_input parameters>"
}...
]
Handling the output:
As far as I know currently there is no way to get an output from batch compute back to the state machine directly. So you will need to take an indirect approach.
One way could be to write the output from your docker container to some temporary location such as dynamodb or s3. Then you would need a step in your step function to read the output from dynamodb (you can do that directly, no lambda needed, if you write to s3 then you will need a lambda to read the output).
It would seem that this approach would also be needed to capture raised exceptions from a docker container - I'm all ears if anyone has a better approach.
Related
Let's say I have a rule like this.
foo(
name = "helloworld",
myarray = [
":bar",
"//path/to:qux",
],
)
In this case, myarray is static.
However, I want it to be given by cli, like
bazel run //:helloworld --myarray=":bar,//path/to:qux,:baz,:another"
How is this possible?
Thanks
To get exactly what you're asking for, Bazel would need to support LABEL_LIST in Starlark-defined command line flags, which are documented here:
https://docs.bazel.build/versions/2.1.0/skylark/lib/config.html
and here: https://docs.bazel.build/versions/2.1.0/skylark/config.html
Unfortunately that's not implemented at the moment.
If you don't actually need a list of labels (i.e., to create dependencies between targets), then maybe STRING_LIST will work for you.
If you do need a list of labels, and the different possible values are known, then you can use --define, config_setting(), and select():
https://docs.bazel.build/versions/2.1.0/configurable-attributes.html
The question is, what are you really after. Passing variable, array into the bazel build/run isn't really possible, well not as such and not (mostly) without (very likely unwanted) side effects. Aren't you perhaps really just looking into passing arguments directly to what is being run by the run? I.e. pass it to the executable itself, not bazel?
There are few ways you could sneak stuff in (you'd also in most cases need to come up with a syntax to pass data on CLI and unpack the array in a rule), but many come with relatively substantial price.
You can define your array in a bzl file and load it from where the rule uses it. You can then dump the bzl content rewriting your build/run configuration (also making it obvious, traceable) and load the bits from the rule (only affecting the rule loading and using the variable). E.g, BUILD file:
load(":myarray.bzl", "myarray")
foo(
name = "helloworld",
myarray = myarray,
],
)
And you can then call your build:
$ echo 'myarray=[":bar", "//path/to:qux", ":baz", ":another"]' > myarray.bzl
$ bazel run //:helloworld
Which you can of course put in a single wrapper script. If this really needs to be a bazel array, this one is probably the cleanest way to do that.
--workspace_status_command: you can collection information about your environment, add either or both of the resulting files (depending on whether the inputs are meant to invalidate the rule results or not, you could use volatile or stable status files) as a dependency of your rule and process the incoming file in the what is being executed by the rule (at which point one would wonder why not pass it to as its command line arguments directly). If using stable status file, also each other rule depending on it is invalidated by any change.
You can do similar thing by using --action_env. From within the executable/tool/script underpinning the rule, you can directly access defined environmental variable. However, this also means environment of each rule is affected (not just the one you're targeting); and again, why would it parse the information from environment and not accept arguments on the command line.
There is also --define, but you would not really get direct access it's value as much as you could select() a choice out of possible options.
Alright! Following is the scenario with respective queries:
1) I am using a bash script to generate JSON object for status of custom processes.
2) Providing the bash inside zabbix_agentd.conf file:
UserParameter=service.check[*],/usr/lib/zabbix/externalscripts/service_check.bash
I want to provide the process names as parameters to the bash file here in UserParameter, how do I do that?
3) Restarting the zabbix-agent and checking with zabbix-get yields an empty JSON (because we have not given any process names):
{
"data":[
]
}
4) If I provide a process name into UserParameter as:
UserParameter=service.check[*],/usr/lib/zabbix/externalscripts/service_check.bash apache2 ntp cron
It yields the following:
{
"data":[
which I know is wrong, since I need to pass the process names in a different way. I tried passing them inside the bash script and even then it generates an invalid json as above.
5) The JSON generated will be taken care by the Zabbix discovery rule of type "Zabbix agent", where it will create different items out of process names. Following is the JSON that my script should send:
{"data":[{"{#NAME}":"apache2","{#STATUS}":"RUNNING","{#VALUE}":"1"},{"{#NAME}":"ntp","{#STATUS}":"RUNNING","{#VALUE}":"1"},{"{#NAME}":"cron","{#STATUS}":"STOPPED","{#VALUE}":"0"}]}
I could have used zabbix-sender for the same, but it would need me to run the sender for every key-value that I need to send. Also, this way I have to be concerned with manipulating data at one place only, and the rest will be taken care of.
Hope this is clear enough and explains my situation.
I have an Apache Beam job running on Google Cloud Dataflow, and as part of its initialization it needs to run some basic sanity/availability checks on services, pub/sub subscriptions, GCS blobs, etc. It's a streaming pipeline intended to run ad infinitum that processes hundreds of thousands of pub/sub messages.
Currently it needs a whole heap of required, variable parameters: which Google Cloud project it needs to run in, which bucket and directory prefix it's going to be storing files in, which pub/sub subscriptions it needs to read from, and so on. It does some work with these parameters before pipeline.run is called - validation, string splitting, and the like. In its current form in order to start a job we've been passing these parameters to to a PipelineOptionsFactory and issuing a new compile every single time, but it seems like there should be a better way. I've set up the parameters to be ValueProvider objects, but because they're being called outside of pipeline.run, Maven complains at compile time that ValueProvider.get() is being called outside of a runtime context (which, yes, it is.)
I've tried using NestedValueProviders as in the Google "Creating Templates" document, but my IDE complains if I try to use NestedValueProvider.of to return a string as shown in the document. The only way I've been able to get NestedValueProviders to compile is as follows:
NestedValueProvider<String, String> pid = NestedValueProvider.of(
pipelineOptions.getDataflowProjectId(),
(SerializableFunction<String, String>) s -> s
);
(String pid = NestedValueProvider.of(...) results in the following error: "incompatible types: no instance(s) of type variable(s) T,X exist so that org.apache.beam.sdk.options.ValueProvider.NestedValueProvider conforms to java.lang.String")
I have the following in my pipelineOptions:
ValueProvider<String> getDataflowProjectId();
void setDataflowProjectId(ValueProvider<String> value);
Because of the volume of messages we're going to be processing, adding these checks at the front of the pipeline for every message that comes through isn't really practical; we'll hit daily account administrative limits on some of these calls pretty quickly.
Are templates the right approach for what I want to do? How do I go about actually productionizing this? Should (can?) I compile with maven into a jar, then just run the jar on a local dev/qa/prod box with my parameters and just not bother with ValueProviders at all? Or is it possible to provide a default to a ValueProvider and override it as part of the options passed to the template?
Any advice on how to proceed would be most appreciated. Thanks!
The way templates are currently implemented there is no point to perform "post-template creation" but "pre-pipeline start" initialization/validation.
All of the existing validation executes during template creation. If the validation detects that there the values aren't available (due to being a ValueProvider) the validation is skipped.
In some cases it is possible to approximate validation by adding runtime checks either as part of initial splitting of a custom source or part of the #Setup method of a DoFn. In the latter case, the #Setup method will run once for each instance of the DoFn that is created. If the pipeline is Batch, after 4 failures for a specific instance it will fail the pipeline.
Another option for productionizing pipelines is to build the JAR that runs the pipeline, and have a production process that runs that JAR to initiate the pipeline.
Regarding the compile error you received -- the NestedValueProvider returns a ValueProvider -- it isn't possible to get a String out of that. You could, however, put the validation code into the SerializableFunction that is run within the NestedValueProvider.
Although I believe this will currently re-run the validation everytime the value is accessed, it wouldn't be unreasonable to have the NestedValueProvider cache the translated value.
Is it possible to perform an action once a batch Dataflow job has finished processing all data? Specifically, I'd like to move the text file that the pipeline just processed to a different GCS bucket. I'm not sure where to place that in my pipeline to ensure it executes once after the data processing has completed.
I don't see why you need to do this post pipeline execution. You could use side outputs to write the file to multiple buckets, and save yourself the copy after the pipeline finishes.
If that's not going to work for you (for whatever reason), then you can simply run your pipeline in blocking execution mode i.e. use pipeline.run().waitUntilFinish(), and then just write the rest of your code (which does the copy) after that.
[..]
// do some stuff before the pipeline runs
Pipeline pipeline = ...
pipeline.run().waitUntilFinish();
// do something after the pipeline finishes here
[..]
A little trick I got from reading the source code of apache beam's PassThroughThenCleanup.java.
Right after your reader, create a side input that 'combine' the entire collection (in the source code, it is the View.asIterable() PTransform) and connect its output to a DoFn. This DoFn will be called only after the reader has finished reading ALL elements.
P.S. The code literally name the operation, cleanupSignalView which I found really clever
Note that you can achieve the same effect using Combine.globally() (java) or beam.CombineGlobally() (python). For more info check out section 4.2.4.3 here
I think two options can help you here:
1) Use TextIO to write to the bucket or folder you want, specifying the exact GCS path (for e.g. gs://sandbox/other-bucket)
2) Use Object Change Notifications in combination with Cloud Functions. You can find a good primer on doing this here and the SDK for GCS in JS here. What you will do in this option is basically setting up a trigger when something drops in a certain bucket, and move it to another one using your self-written Cloud Function.
I just want to call some specific function in my Lua script.
A simple script:
msg("hello")
function showamsgbox()
msg("123")
end
I just want to let my C app call showamsgbox() only but not to run msg("hello") beacuse it will show a msgbox when i load this script! So how to do that to keep this situation away?
PS:it is just example.sometimes i want to let users make thier own plugins in my program.but I do not want them write something outside the functions(i want to use functions to decide what to do.for example function OnLoad() means it will be run when i load it ).If there is something outside functions i cannot control them!
You can't. The script defines two variables when run: a and geta. Recall that function geta()...end is the same as geta=function()...end.
The a = 9 will be called when the script is initially evaluated in a lua_State.
If you reuse that lua_State instance, you can retrieve the function and invoke it without re-initializing a.
It seems that you want to sandbox scripts. Just give them a suitable, separate environment before running them. It may be an empty one or it may contain references to the functions you want them to use. They can write at will in their environment and it will not affect yours. Then just get the value of OnLoad or whatever user function you want to call and call it.