Testing @Default value for Beam PipelineOptions ValueProvider - google-cloud-dataflow

I would like a Dataflow template with a default value for one of the PipelineOptions parameters.
Inspired by examples online I use a ValueProvider placeholder for deferred parameter setting in my PipelineOptions "sub"-interface:
@Default.String("MyDefaultValue")
ValueProvider<String> getMyValue();
void setMyValue(ValueProvider<String> value);
If I specify the parameter at runtime, the template works for launching a real GCP Dataflow job. However, when I try to test without the parameter before doing this for real:
@Rule
public TestPipeline pipeline = TestPipeline.create();
...
@Test
public void test() {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(new String[] {...}).withValidation().create();
    ...
    pipeline.run(options);
}
Then, when my TestPipeline executes a DoFn processElement method where the parameter is needed, I get:
IllegalStateException: Value only available at runtime, but accessed from a non-runtime context:
RuntimeValueProvider{propertyName=myValue, default=MyDefaultValue}
...
More specifically, it fails here in org.apache.beam.sdk.options.ValueProvider:
@Override
public T get() {
    PipelineOptions options = optionsMap.get(optionsId);
    if (options == null) {
        throw new IllegalStateException(...
One could presumably be forgiven for thinking runtime is when the pipeline is running.
Anyway, does anybody know how I would unit test the parameter defaulting, assuming the top code snippet is how it should be set up and that this is supported? Thank you.

I had the same problem when I was generating a Dataflow template from Eclipse; my Dataflow template receives a parameter from a Cloud Composer DAG.
I got the solution from the Google Cloud documentation:
https://cloud.google.com/dataflow/docs/guides/templates/creating-templates#using-valueprovider-in-your-functions
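In case it helps, here is a minimal sketch of one way to unit test around the runtime-only lookup (assuming the sub-interface from the question is named MyOptions): explicitly set the option to a StaticValueProvider, which is resolvable outside a runtime context. It does not exercise Beam's own @Default resolution, but it lets the DoFn call get() during a test run.
@Rule
public final transient TestPipeline pipeline = TestPipeline.create();

@Test
public void testMyValueDefault() {
    // MyOptions is an assumed name for the PipelineOptions sub-interface from the question.
    MyOptions options = pipeline.getOptions().as(MyOptions.class);
    // A StaticValueProvider is resolvable at construction time, so getMyValue().get()
    // inside the DoFn does not throw "Value only available at runtime".
    options.setMyValue(ValueProvider.StaticValueProvider.of("MyDefaultValue"));
    // ... apply the transforms under test to "pipeline" ...
    pipeline.run().waitUntilFinish();
}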

You can also use Flex Templates and avoid all the hassle with ValueProviders.
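For instance, with a Flex Template the option can be declared with a plain type, so the @Default annotation resolves like for any other option. A hypothetical sketch mirroring the parameter from the question:
public interface MyOptions extends PipelineOptions {
    // Plain String instead of ValueProvider<String>; Flex Templates pass parameters
    // as ordinary pipeline options at launch time. Name and default are assumptions.
    @Description("Example parameter from the question.")
    @Default.String("MyDefaultValue")
    String getMyValue();
    void setMyValue(String value);
}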

Related

Unapprovable RejectedAccessException when using Tuple in Jenkinsfile

I tried to use Tuple in a Jenkinsfile.
The line I wrote is def tupleTest = new Tuple('test', 'test2').
However, Jenkins did not accept this line and kept writing the following error to the console output:
No such constructor found: new groovy.lang.Tuple java.lang.String java.lang.String. Administrators can decide whether to approve or reject this signature.
...
org.jenkinsci.plugins.scriptsecurity.sandbox.RejectedAccessException: No such constructor found: new groovy.lang.Tuple java.lang.Integer java.lang.String
...
When I visited the "Script Approval" configuration I could not see any scripts that pend approval.
Following this link, I tried to install and enable the "Permissive Security" plugin, but it did not help either; the error was the same.
I even tried to manually add the problematic signature to the scriptApproval.xml file. After I added it, I was able to see it in the list of approved signatures, but the error still remained.
Is there something I am doing wrong?
I had the same issue trying to use a tuple on Jenkins, so I found out that I can simply use a list literal instead:
def tuple = ["test1", "test2"]
which is equivalent to
def (a, b) = ["test1", "test2"]
So now, instead of returning a tuple, I am returning a list in my method
def myMethod(...) {
    ...
    return ["test 1", "test 2"]
}
...
def (a, b) = myMethod(...)
This is more or less a problem caused by the groovy.lang.Tuple constructor plus the Jenkins Groovy sandbox mode. If you take a look at the constructor of this class, you will see something like this:
package groovy.lang;

import java.util.AbstractList;
import java.util.List;

public class Tuple extends AbstractList {
    private final Object[] contents;
    private int hashCode;

    public Tuple(Object[] contents) {
        if (contents == null) throw new NullPointerException();
        this.contents = contents;
    }
    //....
}
Groovy sandbox mode (enabled by default for all Jenkins pipelines) ensures that every invocation passes the script approval check. It's not foolproof, and when it sees new Tuple('a','b') it thinks the user is looking for a constructor that matches exactly two parameters of type String. Because no such constructor exists, it throws this exception. However, there are two simple workarounds to this problem.
Use groovy.lang.Tuple2 instead
If your tuple is a pair, then use groovy.lang.Tuple2 instead. The good news about this class is that it provides a constructor that supports two generic types, so it will work in your case.
Use exact Object[] constructor
Alternatively, you can use the exact constructor, e.g.:
def tuple = new Tuple(["test","test2"] as Object[])
Both options require script approval before you can use them (however, in this case both constructors appear in the in-process script approval page).

Expected getter for property [tempLocation] to be marked with @Default on all

I am trying to execute a Dataflow pipeline that writes to BigQuery. I understand that in order to do so, I need to specify a GCS temp location.
So I defined options:
private interface Options extends PipelineOptions {
    @Description("GCS temp location to store temp files.")
    @Default.String(GCS_TEMP_LOCATION)
    @Validation.Required
    String getTempLocation();
    void setTempLocation(String value);

    @Description("BigQuery table to write to, specified as "
        + "<project_id>:<dataset_id>.<table_id>. The dataset must already exist.")
    @Default.String(BIGQUERY_OUTPUT_TABLE)
    @Validation.Required
    String getOutput();
    void setOutput(String value);
}
And I try to pass this to the Pipeline.create() function:
public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class));
    ...
}
But I am getting the following error. I don't understand why, because I annotate with @Default:
Exception in thread "main" java.lang.IllegalArgumentException: Expected getter for property [tempLocation] to be marked with @Default on all [my.gcp.dataflow.StarterPipeline$Options, org.apache.beam.sdk.options.PipelineOptions], found only on [my.gcp.dataflow.StarterPipeline$Options]
Is the above snippet your code or a copy from the SDK?
You don't define a new options class for this. You actually want to call withCustomGcsTempLocation on BigQueryIO.Write [1].
Also, I think BQ should determine a temp location on its own if you do not provide one. Have you tried without setting this? Did you get an error?
[1] https://github.com/apache/beam/blob/a17478c2ee11b1d7a8eba58da5ce385d73c6dbbc/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L1402
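For reference, a minimal sketch of wiring that in (the PCollection name, table spec, and bucket are placeholders, not from the original post):
rows.apply("WriteToBigQuery",
    BigQueryIO.writeTableRows()
        .to("my-project:my_dataset.my_table")  // placeholder table spec
        .withCustomGcsTempLocation(
            ValueProvider.StaticValueProvider.of("gs://my-bucket/bq-temp"))  // placeholder bucket
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));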
Most users simply set the staging directory. To set the staging directory, you want to do something like:
DataflowPipelineOptions options = PipelineOptionsFactory.create()
    .as(DataflowPipelineOptions.class);
options.setRunner(BlockingDataflowPipelineRunner.class);
options.setStagingLocation("gs://SET-YOUR-BUCKET-NAME-HERE");
However if you want to set gcpTemporaryDirectory, you can do that as well:
GcpOptions options = PipelineOptionsFactory.as(GcpOptions.class);
options.setGcpTempLocation("gs://SET-YOUR-BUCKET-NAME-HERE/temp");
Basically you have to do .as(X.class) to get to the X options. Then once you have that object you can just set any options that are part of X. You can find many additional examples online.

How to send messages from Google Dataflow (Apache Beam) on the Flink runner to Kafka

I’m trying to write a proof-of-concept which takes messages from Kafka, transforms them using Beam on Flink, then pushes the results onto a different Kafka topic.
I’ve used the KafkaWindowedWordCountExample as a starting point, and that’s doing the first part of what I want to do, but it outputs to text files as opposed to Kafka. FlinkKafkaProducer08 looks promising, but I can’t figure out how to plug it into the pipeline. I was thinking that it would be wrapped with an UnboundedFlinkSink, or some such, but that doesn’t seem to exist.
Any advice or thoughts on what I’m trying to do?
I’m running the latest incubator-beam (as of last night from Github), Flink 1.0.0 in cluster mode and Kafka 0.9.0.1, all on Google Compute Engine (Debian Jessie).
There is currently no UnboundedSink class in Beam. Most unbounded sinks are implemented using a ParDo.
You may wish to check out the KafkaIO connector. This is a Kafka reader that works in all Beam runners and implements the parallel reading, checkpointing, and other UnboundedSource APIs. The pull request that added it also includes a crude sink in the TopHashtags example pipeline, writing to Kafka in a ParDo:
class KafkaWriter extends DoFn<String, Void> {

    private final String topic;
    private final Map<String, Object> config;
    private transient KafkaProducer<String, String> producer = null;

    public KafkaWriter(Options options) {
        this.topic = options.getOutputTopic();
        this.config = ImmutableMap.<String, Object>of(
            "bootstrap.servers", options.getBootstrapServers(),
            "key.serializer", StringSerializer.class.getName(),
            "value.serializer", StringSerializer.class.getName());
    }

    @Override
    public void startBundle(Context c) throws Exception {
        if (producer == null) { // in Beam, startBundle might be called multiple times.
            producer = new KafkaProducer<String, String>(config);
        }
    }

    @Override
    public void finishBundle(Context c) throws Exception {
        producer.close();
    }

    @Override
    public void processElement(ProcessContext ctx) throws Exception {
        producer.send(new ProducerRecord<String, String>(topic, ctx.element()));
    }
}
Of course, we would like to add sink support in KafkaIO as well. It would effectively be the same as the KafkaWriter above, but much simpler to use.
A sink transform for writing to Kafka was added to Apache Beam / Dataflow in 2016. See the JavaDoc for KafkaIO in Apache Beam for a usage example.
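For example, a rough sketch of the KafkaIO sink (the broker addresses, the topic, and the upstream PCollection<KV<String, String>> named results are placeholders):
results.apply("WriteToKafka",
    KafkaIO.<String, String>write()
        .withBootstrapServers("broker-1:9092,broker-2:9092")  // placeholder brokers
        .withTopic("output-topic")                            // placeholder topic
        .withKeySerializer(StringSerializer.class)
        .withValueSerializer(StringSerializer.class));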

Cloud Dataflow to BigQuery - too many sources

I have a job that, among other things, also inserts some of the data it reads from files into a BigQuery table for later manual analysis.
It fails with the following error:
job error: Too many sources provided: 10001. Limit is 10000., error: Too many sources provided: 10001. Limit is 10000.
What does it refer to as "source"? Is it a file or a pipeline step?
Thanks,
G
I'm guessing the error is coming from BigQuery and means that we are trying to upload too many files when we create your output table.
Could you provide some more details on the error / context (like a snippet of the commandline output (if using the BlockingDataflowPipelineRunner) so I can confirm? A jobId would also be helpful.
Is there something about your pipeline structure that is going to result in a large number of output files? That could either be a large amount of data or perhaps finely sharded input files without a subsequent GroupByKey operation (which would let us reshard the data into larger pieces).
The note in "In Google Cloud Dataflow BigQueryIO.Write occur Unknown Error (http code 500)" mitigates this issue:
Dataflow SDK for Java 1.x: as a workaround, you can enable this experiment with --experiments=enable_custom_bigquery_sink
In Dataflow SDK for Java 2.x, this behavior is default and no experiments are necessary.
Note that in both versions, temporary files in GCS may be left over if your job fails.
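A minimal sketch of passing that experiment flag when constructing the options in SDK 1.x (the project, bucket, and runner values are placeholders for your job's existing arguments):
String[] args = new String[] {
    "--project=my-project",                      // placeholder project
    "--stagingLocation=gs://my-bucket/staging",  // placeholder bucket
    "--runner=BlockingDataflowPipelineRunner",
    "--experiments=enable_custom_bigquery_sink"  // the relevant workaround flag
};
DataflowPipelineOptions options =
    PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
Pipeline p = Pipeline.create(options);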
Another approach is to force a reshard with a synthetic GroupByKey before writing, for example:
public static class ForceGroupBy<T> extends PTransform<PCollection<T>, PCollection<KV<T, Iterable<Void>>>> {
    private static final long serialVersionUID = 1L;

    @Override
    public PCollection<KV<T, Iterable<Void>>> apply(PCollection<T> input) {
        PCollection<KV<T, Void>> syntheticGroup = input.apply(
            ParDo.of(new DoFn<T, KV<T, Void>>() {
                private static final long serialVersionUID = 1L;

                @Override
                public void processElement(DoFn<T, KV<T, Void>>.ProcessContext c) throws Exception {
                    c.output(KV.of(c.element(), (Void) null));
                }
            }));
        return syntheticGroup.apply(GroupByKey.<T, Void>create());
    }
}

Quartz.Net Job Creation on the fly

I created a simple job using Quartz.Net 1.0.3:
public class SimpleTestJob : IJob
{
    public virtual void Execute(JobExecutionContext context)
    {
        System.Diagnostics.EventLog.WriteEntry("QuartzTest", "This is a test run");
    }
}
Then I tried to dynamically add the job above to the Quartz server.
First I obtained a Type using reflection:
string jobType = "Scheduler.Quartz.Jobs.SimpleTestJob,Scheduler.Quartz,Version=1.0.0.0,Culture=neutral,PublicKeyToken=null";
var schedType = Type.GetType(jobType, false, true);
It's working. Then I try to create a JobDetail object:
JobDetail job = new JobDetail(jobName, groupName, schedType.GetType());
But I am receiving an error from the Quartz.Net framework:
"Job class must implement the Job interface."
Please help
Try removing the virtual keyword and you might also want to try using the typeof operator where you have schedType.GetType(). I'm not sure what the type of schedType ends up being given it is defined as var.
I am using Quartz 1.0.3; it's compiled with .NET 3.5.
But schedType.GetType() returned a type with runtime version 4.
Really, I do not need to use the GetType function at all, because I already have the type, which I obtained before:
var schedType = Type.GetType(jobType, false, true);
So my fix was
JobDetail job = new JobDetail(jobName, groupName, schedType);
