Following the Dataflow documentation, I can name each step of a Google Cloud Dataflow pipeline using ParDo.named:
PCollection<Integer> wordLengths = words.apply(
    ParDo
        .named("ComputeWordLengths")        // the transform name
        .of(new DoFn<String, Integer>() {
            @Override
            public void processElement(ProcessContext c) {
                c.output(c.element().length());
            }
        }));
If I use MapElements instead, however, the example in the documentation does not name the step:
PCollection<Integer> wordLengths = words.apply(
    MapElements.via((String word) -> word.length())
        .withOutputType(new TypeDescriptor<Integer>() {}));
How can I name this MapElements step?
I have several MapElements steps and I'm getting warnings like this:
Mar 01, 2016 1:36:39 PM com.google.cloud.dataflow.sdk.Pipeline applyInternal
WARNING: Transform MapElements2 does not have a stable unique name. This will prevent updating of pipelines.
You can specify the name when you apply it. For instance:
words.apply("name", MapElements.via(...))
// instead of
words.apply(MapElements.via(...))
See the JavaDoc on the named apply method for more details.
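Applied to the MapElements example from the question, that would look something like the following sketch, which just reuses the question's lambda and output type:
PCollection<Integer> wordLengths = words.apply("ComputeWordLengths",
    MapElements.via((String word) -> word.length())
        .withOutputType(new TypeDescriptor<Integer>() {}));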
I created a streaming Apache Beam pipeline that reads files from GCS folders and inserts them into BigQuery. It works perfectly, but it re-processes all the files when I stop and restart the job, so all the data gets duplicated.
My idea is to move the files out of the scanned directory into another one once they are processed, but I don't know how to do that technically with Apache Beam.
Thank you
public static PipelineResult run(Options options) {
    // Create the pipeline.
    Pipeline pipeline = Pipeline.create(options);

    /*
     * Steps:
     *  1) Read from the text source.
     *  2) Write each text record to Pub/Sub.
     */
    LOG.info("Running pipeline");
    LOG.info("Input : " + options.getInputFilePattern());
    LOG.info("Output : " + options.getOutputTopic());

    PCollection<String> collection = pipeline
        .apply("Read Text Data", TextIO.read()
            .from(options.getInputFilePattern())
            .watchForNewFiles(Duration.standardSeconds(60), Watch.Growth.<String>never()))
        .apply("Write logs", ParDo.of(new DoFn<String, String>() {
            @ProcessElement
            public void processElement(ProcessContext c) throws Exception {
                LOG.info(c.element());
                c.output(c.element());
            }
        }));

    collection.apply("Write to PubSub", PubsubIO.writeStrings().to(options.getOutputTopic()));

    return pipeline.run();
}
A couple of tips:
You are normally not expected to stop and rerun a streaming pipeline. Streaming pipelines are meant to run forever, and to be updated occasionally if you want to change their logic.
Nonetheless, it is possible to use FileIO to match a number of files and move them after they have been processed.
You would write a DoFn class, say ReadWholeFileThenMoveToAnotherBucketDoFn, which reads the whole file and then moves it to a new bucket (a rough sketch of such a DoFn is shown after the pipeline below):
Pipeline pipeline = Pipeline.create(options);

PCollection<MatchResult.Metadata> matches = pipeline
    .apply("Read Text Data", FileIO.match()
        .filepattern(options.getInputFilePattern())
        .continuously(Duration.standardSeconds(60),
            Watch.Growth.<String>never()));

matches.apply(FileIO.readMatches())
    .apply(ParDo.of(new ReadWholeFileThenMoveToAnotherBucketDoFn()))
    .apply("Write logs", ParDo.of(new DoFn<String, String>() {
        @ProcessElement
        public void processElement(ProcessContext c) throws Exception {
            LOG.info(c.element());
            c.output(c.element());
        }
    }));
....
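The answer does not spell that DoFn out, but a minimal sketch could look like the following. The destination bucket name is a placeholder, and error handling is omitted:
import java.io.IOException;
import java.util.Collections;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.DoFn.ProcessElement;

class ReadWholeFileThenMoveToAnotherBucketDoFn extends DoFn<FileIO.ReadableFile, String> {
    @ProcessElement
    public void processElement(ProcessContext c) throws IOException {
        FileIO.ReadableFile file = c.element();

        // Emit the whole file contents downstream.
        c.output(file.readFullyAsUTF8String());

        // "Move" the file: copy it to another bucket, then delete the original.
        // The destination bucket here is a placeholder.
        ResourceId source = file.getMetadata().resourceId();
        ResourceId destination = FileSystems.matchNewResource(
            "gs://my-processed-bucket/" + source.getFilename(), false /* isDirectory */);
        FileSystems.copy(
            Collections.singletonList(source),
            Collections.singletonList(destination));
        FileSystems.delete(Collections.singletonList(source));
    }
}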
I am setting up a Jenkins Pipeline which calls an external library containing an XML comparison function written in Groovy that uses XMLUnit.
The function looks as follows:
import java.util.List
import org.custommonkey.xmlunit.*

// Gives you a list of all the differences.
@NonCPS
void call(String xmlControl, String xmlTest) throws Exception {
    String myControlXML = xmlControl
    String myTestXML = xmlTest
    DetailedDiff myDiff = new DetailedDiff(compareXML(myControlXML, myTestXML));
    List allDifferences = myDiff.getAllDifferences();
    assertEquals(myDiff.toString(), 0, allDifferences.size());
}
However, when running the pipeline in Jenkins it throws a java.io.NotSerializableException.
Checking Stack Overflow, it seemed like adding the @NonCPS annotation might help, but sadly it did not make a difference.
What else could I try to resolve the java.io.NotSerializableException?
I'm trying, with Apache Beam 2.1.0, to consume simple (key, value) data from Google Pub/Sub and group it by key so I can process batches of data.
With the default trigger, my code after "GroupByKey" never fires (I waited 30 minutes).
If I define a custom trigger, the code is executed, but I would like to understand why the default trigger never fires. I tried to define my own timestamp with "withTimestampLabel", but I had the same issue. I also tried to change the duration of the windows (1 second, 10 seconds, 30 seconds, etc.), with the same result.
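(For reference, by a "custom trigger" I mean something along these lines; this is only a sketch using classes from org.apache.beam.sdk.transforms.windowing, not my exact code:)
.apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
    .triggering(Repeatedly.forever(
        AfterProcessingTime.pastFirstElementInPane()
            .plusDelayOf(Duration.standardSeconds(10))))
    .withAllowedLateness(Duration.ZERO)
    .discardingFiredPanes())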
I used the command line to insert data for this test:
gcloud beta pubsub topics publish test A,1
gcloud beta pubsub topics publish test A,2
gcloud beta pubsub topics publish test B,1
gcloud beta pubsub topics publish test B,2
The documentation says that we can do one or the other, but not necessarily both:
If you are using unbounded PCollections, you must use either non-global windowing OR an aggregation trigger in order to perform a GroupByKey or CoGroupByKey
It looks similar to:
Consuming unbounded data in windows with default trigger
Scio: groupByKey doesn't work when using Pub/Sub as collection source
My code
static class Compute extends DoFn<KV<String, Iterable<Integer>>, Void> {
    @ProcessElement
    public void processElement(ProcessContext c) {
        // Code never fires
        System.out.println("KEY:" + c.element().getKey());
        System.out.println("NB:" + c.element().getValue().spliterator().getExactSizeIfKnown());
    }
}

public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.create());

    p.apply(PubsubIO.readStrings().fromSubscription("projects/" + args[0] + "/subscriptions/test"))
        .apply(Window.into(FixedWindows.of(Duration.standardMinutes(1))))
        .apply(
            MapElements
                .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.integers()))
                .via((String row) -> {
                    String[] parts = row.split(",");
                    System.out.println(Arrays.toString(parts)); // Code fires
                    return KV.of(parts[0], Integer.parseInt(parts[1]));
                })
        )
        .apply(GroupByKey.create())
        .apply(ParDo.of(new Compute()));

    p.run();
}
In the interest of providing a minimal example of my problem, I'm trying to implement a simple Beam job that takes a String as a side input and applies it to a PCollection read from a CSV file in Cloud Storage. The result is then output to a .txt file in Cloud Storage.
So far, I have tried: experimenting with PipelineResult.waitUntilFinish (as in p.run().waitUntilFinish()), altering the placement of the two p.run() calls, and simplifying as much as possible by just using a string as my side input, always with the same result. Searching on Stack Overflow and Google just led me to the PR on the Beam repo which implemented the error message.
SideInputTest.java:
public class SideInputTest {
    public static void main(String[] arg) throws IOException {
        // Build a pipeline to read in string
        DataflowPipelineOptions options1 = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
        options1.setRunner(DataflowRunner.class);
        Pipeline p = Pipeline.create(options1);

        // Build really simple side input
        PCollectionView<String> sideInputView = p.apply(Create.of("foo"))
            .apply(View.<String>asSingleton());

        // Run p
        p.run();

        // Build main pipeline to read csv data
        DataflowPipelineOptions options2 = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
        options2.setProject(PROJECT_NAME);
        options2.setStagingLocation(STAGING_LOCATION);
        options2.setRunner(DataflowRunner.class);
        Pipeline p2 = Pipeline.create(options2);

        p2.apply(TextIO.Read.from(INPUT_DATA))
            .apply(ParDo.withSideInputs(sideInputView).of(new DoFn<String, String>() {
                @ProcessElement
                public void processElement(ProcessContext c) {
                    String[] rowData = c.element().split(",");
                    String sideInput = c.sideInput(sideInputView);
                    c.output(rowData[0] + sideInput);
                }
            }))
            .apply(TextIO.Write
                .to(OUTPUT_DATA));

        p2.run();
    }
}
Full stack trace:
Caused by: java.lang.NullPointerException: Unknown producer for value SingletonPCollectionView{tag=Tag<org.apache.beam.sdk.util.PCollectionViews$SimplePCollectionView.<init>:435#3d93cb799b3970be>} while translating step ParDo(Anonymous)
at org.apache.beam.runners.dataflow.repackaged.com.google.common.base.Preconditions.checkNotNull(Preconditions.java:1079)
at org.apache.beam.runners.dataflow.DataflowPipelineTranslator$Translator.getProducer(DataflowPipelineTranslator.java:508)
at org.apache.beam.runners.dataflow.DataflowPipelineTranslator.translateSideInputs(DataflowPipelineTranslator.java:926)
at org.apache.beam.runners.dataflow.DataflowPipelineTranslator.translateInputs(DataflowPipelineTranslator.java:913)
at org.apache.beam.runners.dataflow.DataflowPipelineTranslator.access$1100(DataflowPipelineTranslator.java:112)
at org.apache.beam.runners.dataflow.DataflowPipelineTranslator$7.translateSingleHelper(DataflowPipelineTranslator.java:863)
at org.apache.beam.runners.dataflow.DataflowPipelineTranslator$7.translate(DataflowPipelineTranslator.java:856)
at org.apache.beam.runners.dataflow.DataflowPipelineTranslator$7.translate(DataflowPipelineTranslator.java:853)
at org.apache.beam.runners.dataflow.DataflowPipelineTranslator$Translator.visitPrimitiveTransform(DataflowPipelineTranslator.java:415)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:486)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:481)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.access$400(TransformHierarchy.java:231)
at org.apache.beam.sdk.runners.TransformHierarchy.visit(TransformHierarchy.java:206)
at org.apache.beam.sdk.Pipeline.traverseTopologically(Pipeline.java:321)
at org.apache.beam.runners.dataflow.DataflowPipelineTranslator$Translator.translate(DataflowPipelineTranslator.java:365)
at org.apache.beam.runners.dataflow.DataflowPipelineTranslator.translate(DataflowPipelineTranslator.java:154)
at org.apache.beam.runners.dataflow.DataflowRunner.run(DataflowRunner.java:514)
at org.apache.beam.runners.dataflow.DataflowRunner.run(DataflowRunner.java:151)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:210)
at com.xpw.SideInputTest.main(SideInputTest.java:63)
Currently using the org.apache.beam packages, version 0.6.0.
This code is taking a PCollectionView created in one pipeline (p.apply(Create.of("foo")).apply(View.<String>asSingleton());) and using it in another pipeline (p2).
PCollections and PCollectionViews belong to a particular pipeline, and reusing them in a different pipeline is not supported.
You can create an analogous PCollectionView in p2.
I'm also confused as to what your pipeline p is trying to accomplish: the only transform it has is creating the view, so there's no data being processed in it. I think you should get rid of p entirely and just use p2.
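A sketch of that single-pipeline version, keeping the placeholders (PROJECT_NAME, STAGING_LOCATION, INPUT_DATA, OUTPUT_DATA) from the question:
DataflowPipelineOptions options = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
options.setProject(PROJECT_NAME);
options.setStagingLocation(STAGING_LOCATION);
options.setRunner(DataflowRunner.class);
Pipeline p2 = Pipeline.create(options);

// Build the side input view in the same pipeline that consumes it.
final PCollectionView<String> sideInputView =
    p2.apply(Create.of("foo")).apply(View.<String>asSingleton());

p2.apply(TextIO.Read.from(INPUT_DATA))
    .apply(ParDo.withSideInputs(sideInputView).of(new DoFn<String, String>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
            String[] rowData = c.element().split(",");
            c.output(rowData[0] + c.sideInput(sideInputView));
        }
    }))
    .apply(TextIO.Write.to(OUTPUT_DATA));

p2.run();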
Update:
From the bottom of the Automatically Generated DSL wiki entry ... The generated DSL is only supported when running in Jenkins,....
Since slackNotifier is generated DSL, it doesn't appear that there is a way to test this in our particular infrastructure. We're going to write a function which generates the config using the configure block (roughly as sketched below).
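A rough sketch of that configure-block fallback; the element names and the jenkins.plugins.slack.SlackNotifier class name are guesses based on the Slack plugin's config.xml, so check the XML of a real job created with the plugin:
job('seed-dsl') {
    // ... other configuration ...
    publishers {
        mailer('', false, true)
    }
    // Build the Slack notifier config directly, bypassing the generated DSL
    configure { project ->
        project / 'publishers' << 'jenkins.plugins.slack.SlackNotifier' {
            room('')
            notifyAborted(true)
            notifyFailure(true)
            notifyNotBuilt(true)
            notifyUnstable(true)
            notifyBackToNormal(true)
            notifySuccess(false)
            notifyRepeatedFailure(false)
            startNotification(false)
            includeTestSummary(false)
            includeCustomMessage(false)
            commitInfoChoice('NONE')
        }
    }
}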
I have a seed job definition which fails gradle test, even though it seems to work fine when we use it in Jenkins.
Job Definition Excerpt
//package master

// GitURL
def gitUrl = 'https://github.com/team/myapp'
def slackRoom = null

job('seed-dsl') {
    description('This seed is updated from the seed-dsl-updater job')
    properties {
        // Set github project URL
        githubProjectUrl(gitUrl)
    }
    ...
    // publishers is another name for post build steps
    publishers {
        mailer('', false, true)
        slackNotifier {
            room(slackRoom)
            notifyAborted(true)
            notifyFailure(true)
            notifyNotBuilt(true)
            notifyUnstable(true)
            notifyBackToNormal(true)
            notifySuccess(false)
            notifyRepeatedFailure(false)
            startNotification(false)
            includeTestSummary(false)
            includeCustomMessage(false)
            customMessage(null)
            buildServerUrl(null)
            sendAs(null)
            commitInfoChoice('NONE')
            teamDomain(null)
            authToken(null)
        }
    }
}
The gradle test command works fine when I comment out the slackNotifier declaration, but fails with the following error when it's enabled:
Test output excerpt
Caused by:
javaposse.jobdsl.dsl.DslScriptException: (script, line 79) No signature of method: javaposse.jobdsl.dsl.helpers.publisher.PublisherContext.slackNotifier() is applicable for argument types: (script$_run_closure1$_closure9$_closure14) values: [script$_run_closure1$_closure9$_closure14@d2392a1]
Possible solutions: stashNotifier(), stashNotifier(groovy.lang.Closure)
at javaposse.jobdsl.dsl.DslScriptLoader.runScriptEngine(DslScriptLoader.groovy:135)
at javaposse.jobdsl.dsl.DslScriptLoader.runScriptsWithClassLoader_closure1(DslScriptLoader.groovy:78)
According to the migration doc, slackNotifier has been supported since 1.47. In my gradle.build, I'm using 1.48. I see the same errors with plugin version 1.50.
gradle.build excerpt
ext {
    jobDslVersion = '1.48'
    ...
}
...
// Job DSL plugin including plugin dependencies
testCompile "org.jenkins-ci.plugins:job-dsl:${jobDslVersion}"
testCompile "org.jenkins-ci.plugins:job-dsl:${jobDslVersion}@jar"
...
The gradle.build also includes the following, as suggested by the testing docs (https://github.com/jenkinsci/job-dsl-plugin/wiki/Testing-DSL-Scripts):
testPlugins 'org.jenkins-ci.plugins:slack:2.0.1'
What do I need to do to be able to successfully test my job definitions? Is this a bug, or have I missed something else?
removed incorrect reply
EDIT
I see I missed the point.
The new approach is to reuse the @DataBoundConstructor exposed by plugins, so nothing needs to be written to support a new plugin, assuming it has a DataBoundConstructor.
Your SlackNotifier has one - note that the DSL lowercases the first letter of the class name for you:
@DataBoundConstructor
public SlackNotifier(
        final String teamDomain,
        final String authToken,
        final String room,
        final String buildServerUrl,
        final String sendAs,
        final boolean startNotification,
        final boolean notifyAborted,
        final boolean notifyFailure,
        final boolean notifyNotBuilt,
        final boolean notifySuccess,
        final boolean notifyUnstable,
        final boolean notifyBackToNormal,
        final boolean notifyRepeatedFailure,
        final boolean includeTestSummary,
        CommitInfoChoice commitInfoChoice,
        boolean includeCustomMessage,
        String customMessage) {
    ...
}
Unfortunately, there is an embedded type in the parameter list, CommitInfoChoice, which does not have a DataBoundConstructor and is an enum too.
public enum CommitInfoChoice {
    NONE("nothing about commits", false, false),
    AUTHORS("commit list with authors only", true, false),
    AUTHORS_AND_TITLES("commit list with authors and titles", true, true);
    ...
}
I'll go out on a limb and say that it won't work out of the box until the nested enum gets a data-bound constructor and a descriptor, sorry.
I don't have the plugin, but you can look at the XML of a real job created with the plugin and see what goes into this section. I suspect it is a nested structure.
You can try the job dsl google group - link to a post about the generic approach
We ran into this as well. The solution for us was to add the Slack plugin, at the version we were using on Jenkins, to our list of plugins in Gradle.
To be more specific, in our build.gradle file, under dependencies, we added the following code to get our plugins included and hence allow the auto-generated DSL to work.
You can see this described, along with an example of a different plugin next to testPlugins, here:
https://github.com/jenkinsci/job-dsl-plugin/wiki/Testing-DSL-Scripts
Like the following:
dependencies {
    ...
    // plugins to install in test instance
    testPlugins 'org.jenkins-ci.plugins:ghprb:1.31.4'
    testPlugins 'com.coravy.hudson.plugins.github:github:1.19.0'
}
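In this case, that means adding a testPlugins line for the Slack plugin itself, at the same version installed on the Jenkins master (the coordinate below is the one already shown in the question):
dependencies {
    ...
    // the Slack plugin, at the same version as the Jenkins installation
    testPlugins 'org.jenkins-ci.plugins:slack:2.0.1'
}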