We've created a fairly simple pipeline for Pub/Sub event processing. The Pub/Sub message payload itself is tab-separated CSV data.
After the message is read, the payload data is truncated when it is inflated back into the event object. Using the direct runner and running locally, the pipeline works end to end.
It is only when running on the Google Cloud Dataflow runner that we see this message data truncated.
// Create the pipeline
Pipeline pipeline = Pipeline.create(options);

LOG.info("Reading from subscription: " + options.getInputSubscription());

// Step #1: Read from a Pub/Sub subscription.
PCollection<PubsubMessage> pubsubMessages = pipeline.apply(
    "ReadPubSubSubscription",
    PubsubIO.readMessagesWithMessageId()
        .fromSubscription(options.getInputSubscription())
);

// Step #2: Transform the PubsubMessages into Snowplow events.
PCollection<Event> rawEvents = pubsubMessages.apply(
    "ConvertMessageToEvent",
    ParDo.of(new PubsubMessageEventFn())
);
// other pipeline functions.....
Here is the conversion function, where every Pub/Sub message falls into the error case. Note that Event.parse() is actually a Scala library, but I don't see how that could affect this, since the message data itself is what has been truncated between the two stages of the pipeline.
Perhaps there is an encoding issue?
public static class PubsubMessageEventFn extends DoFn<PubsubMessage, Event> {
    @ProcessElement
    public void processElement(ProcessContext context) {
        PubsubMessage message = context.element();
        Validated<ParsingError, Event> event = Event.parse(new String(message.getPayload()));
        Either<ParsingError, Event> condition = event.toEither();
        if (condition.isLeft()) {
            ParsingError err = condition.left().get();
            LOG.error("Event parsing error: " + err.toString() + " for message: " + new String(message.getPayload()));
        } else {
            Event e = condition.right().get();
            context.output(e);
        }
    }
}
Here is a sample of the data that is emitted in the log message:
Event parsing error: FieldNumberMismatch(5) for message: 4f6ec25-67a7-4edf-972a-29e80320f67f web 2020-04-14 21:26:40.034 2020-04-14 21:26:39.884 2020-04-1
Note that the Pub/Sub implementation used by the DirectRunner is different from the implementation used by the Dataflow runner, as documented here: https://cloud.google.com/dataflow/docs/concepts/streaming-with-cloud-pubsub#integration-features.
I believe the issue is related to encoding, because message.getPayload() returns bytes and the code might need to be modified to new String(message.getPayload(), StandardCharsets.UTF_8), as in the line below:
Validated<ParsingError, Event> event = Event.parse(new String(message.getPayload(), StandardCharsets.UTF_8));
Using readMessagesWithAttributesAndMessageId instead of readMessagesWithMessageId is the workaround according to this Jira issue: https://issues.apache.org/jira/browse/BEAM-9483.
It does not appear to have been fixed yet.
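For reference, applying that workaround only changes the read method in Step #1 of the pipeline above; a minimal sketch, assuming the rest of the pipeline stays the same:

PCollection<PubsubMessage> pubsubMessages = pipeline.apply(
    "ReadPubSubSubscription",
    PubsubIO.readMessagesWithAttributesAndMessageId()  // workaround for BEAM-9483
        .fromSubscription(options.getInputSubscription())
);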
Related
I created a streaming Apache Beam pipeline that reads files from GCS folders and inserts them into BigQuery. It works perfectly, but it re-processes all the files when I stop and rerun the job, so all the data gets replicated again.
My idea is to move files from the scanned directory to another one, but I don't know how to do that technically with Apache Beam.
Thank you
public static PipelineResult run(Options options) {
    // Create the pipeline.
    Pipeline pipeline = Pipeline.create(options);

    /*
     * Steps:
     *  1) Read from the text source.
     *  2) Write each text record to Pub/Sub
     */
    LOG.info("Running pipeline");
    LOG.info("Input : " + options.getInputFilePattern());
    LOG.info("Output : " + options.getOutputTopic());

    PCollection<String> collection = pipeline
        .apply("Read Text Data", TextIO.read()
            .from(options.getInputFilePattern())
            .watchForNewFiles(Duration.standardSeconds(60), Watch.Growth.<String>never()))
        .apply("Write logs", ParDo.of(new DoFn<String, String>() {
            @ProcessElement
            public void processElement(ProcessContext c) throws Exception {
                LOG.info(c.element());
                c.output(c.element());
            }
        }));

    collection.apply("Write to PubSub", PubsubIO.writeStrings().to(options.getOutputTopic()));

    return pipeline.run();
}
A couple of tips:
You are normally not expected to stop and rerun a streaming pipeline. Streaming pipelines are meant to run indefinitely and to be updated occasionally when you want to change the logic.
Nonetheless, it is possible to use FileIO to match a number of files and move them after they have been processed.
You would write a DoFn class, say ReadWholeFileThenMoveToAnotherBucketDoFn, which reads the whole file and then moves it to a new bucket (see the sketch after the pipeline snippet below).
Pipeline pipeline = Pipeline.create(options);

PCollection<MatchResult.Metadata> matches = pipeline
    .apply("Read Text Data", FileIO.match()
        .filepattern(options.getInputFilePattern())
        .continuously(Duration.standardSeconds(60),
            Watch.Growth.<String>never()));

matches.apply(FileIO.readMatches())
    .apply(ParDo.of(new ReadWholeFileThenMoveToAnotherBucketDoFn()))
    .apply("Write logs", ParDo.of(new DoFn<String, String>() {
        @ProcessElement
        public void processElement(ProcessContext c) throws Exception {
            LOG.info(c.element());
            c.output(c.element());
        }
    }));
....
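Here is a minimal, untested sketch of what such a DoFn could look like. The class name matches the one mentioned above, but the destinationPrefix parameter is hypothetical, and the "move" is done with Beam's FileSystems.rename (it needs imports for FileSystems, ResourceId, ResolveOptions.StandardResolveOptions and java.util.Collections):

static class ReadWholeFileThenMoveToAnotherBucketDoFn extends DoFn<FileIO.ReadableFile, String> {
    private final String destinationPrefix;  // e.g. "gs://processed-bucket/" (hypothetical)

    ReadWholeFileThenMoveToAnotherBucketDoFn(String destinationPrefix) {
        this.destinationPrefix = destinationPrefix;
    }

    @ProcessElement
    public void processElement(ProcessContext c) throws IOException {
        FileIO.ReadableFile file = c.element();

        // Read the whole file and emit its contents downstream.
        c.output(file.readFullyAsUTF8String());

        // Move the file to the destination bucket, keeping its original file name.
        ResourceId source = file.getMetadata().resourceId();
        ResourceId destination = FileSystems
            .matchNewResource(destinationPrefix, /* isDirectory= */ true)
            .resolve(source.getFilename(), StandardResolveOptions.RESOLVE_FILE);
        FileSystems.rename(
            Collections.singletonList(source),
            Collections.singletonList(destination));
    }
}

Keep in mind that a side effect like this can be retried if a bundle fails, so the move is not exactly-once; ideally you would only move files after their output has been durably committed downstream.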
I'm trying to understand Beam/Dataflow concepts better, so pretend I have the following streaming pipeline:
pipeline
    .apply(PubsubIO.readStrings().fromSubscription("some-subscription"))
    .apply(ParDo.of(new DoFn<String, String>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
            String message = c.element();
            LOGGER.debug("Got message: {}", message);
            c.output(message);
        }
    }));
How often will the unbounded source pull messages from the subscription? Is this configurable at all (potentially based on windows/triggers)?
Since no custom windowing/triggers have been defined, and there are no sinks (just a ParDo that logs and re-outputs the message), will my ParDo still be executed immediately as messages are received? Is that setup problematic in any way (not having any windows/triggers/sinks defined)?
It will pull messages from the subscription continuously - as soon as a message arrives, it will be processed immediately (modulo network and RPC latency).
Windowing and triggers do not affect this at all - they only affect how the data gets grouped at grouping operations (GroupByKey and Combine). If your pipeline doesn't have grouping operations, windowing and triggers are basically a no-op.
The Beam model does not have the concept of a sink - writing to various storage systems (e.g. writing files, writing to BigQuery etc) is implemented as regular Beam composite transforms, made of ParDo and GroupByKey like anything else. E.g. writing each element to its own file could be implemented by a ParDo whose #ProcessElement opens the file, writes the element to it and closes the file.
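To make that last point concrete, here is a minimal, illustrative sketch (not from the Beam codebase) of a "write each element to its own file" ParDo, assuming a local output directory and a UUID-based file name, both hypothetical; it only needs java.nio.file, java.nio.charset and java.util.UUID:

static class WriteElementToOwnFileFn extends DoFn<String, Void> {
    private final String outputDir;  // hypothetical output directory

    WriteElementToOwnFileFn(String outputDir) {
        this.outputDir = outputDir;
    }

    @ProcessElement
    public void processElement(ProcessContext c) throws IOException {
        // "Opens the file, writes the element to it and closes the file" in a single call.
        Path target = Paths.get(outputDir, UUID.randomUUID() + ".txt");
        Files.write(target, c.element().getBytes(StandardCharsets.UTF_8));
    }
}

Production-grade writes such as TextIO.write() build on the same idea, layering sharding, temporary files and finalization on top of ParDo and GroupByKey.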
I'm trying, with Apache Beam 2.1.0, to consume simple (key, value) data from Google Pub/Sub and group by key so that I can process batches of data.
With the default trigger, my code after "GroupByKey" never fires (I waited 30 minutes).
If I define a custom trigger, the code is executed, but I would like to understand why the default trigger never fires. I tried to define my own timestamp with "withTimestampLabel", but the issue was the same. I also tried changing the duration of the windows (1 second, 10 seconds, 30 seconds, etc.), with the same result.
I used the command line to insert data for this test:
gcloud beta pubsub topics publish test A,1
gcloud beta pubsub topics publish test A,2
gcloud beta pubsub topics publish test B,1
gcloud beta pubsub topics publish test B,2
The documentation says that we can use one or the other, but not necessarily both:
If you are using unbounded PCollections, you must use either
non-global windowing OR an aggregation trigger in order to perform a
GroupByKey or CoGroupByKey
It looks similar to:
Consuming unbounded data in windows with default trigger
Scio: groupByKey doesn't work when using Pub/Sub as collection source
My code:
static class Compute extends DoFn<KV<String, Iterable<Integer>>, Void> {
    @ProcessElement
    public void processElement(ProcessContext c) {
        // Code never fires
        System.out.println("KEY:" + c.element().getKey());
        System.out.println("NB:" + c.element().getValue().spliterator().getExactSizeIfKnown());
    }
}

public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.create());

    p.apply(PubsubIO.readStrings().fromSubscription("projects/" + args[0] + "/subscriptions/test"))
        .apply(Window.into(FixedWindows.of(Duration.standardMinutes(1))))
        .apply(
            MapElements
                .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.integers()))
                .via((String row) -> {
                    String[] parts = row.split(",");
                    System.out.println(Arrays.toString(parts)); // Code fires
                    return KV.of(parts[0], Integer.parseInt(parts[1]));
                })
        )
        .apply(GroupByKey.create())
        .apply(ParDo.of(new Compute()));

    p.run();
}
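For context, the kind of custom trigger that does make the downstream code execute (as mentioned above) looks something like the following, in place of the plain Window.into line. This is only an illustrative sketch, not the exact trigger used; the 10-second processing-time delay is arbitrary:

.apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
    .triggering(Repeatedly.forever(
        AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardSeconds(10))))
    .withAllowedLateness(Duration.ZERO)
    .discardingFiredPanes())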
In the interest of providing a minimal example of my problem, I'm trying to implement a simple Beam job that takes in a String as a side input and applies it to a PCollection that is read from a CSV file in Cloud Storage. The result is then output to a .txt file in Cloud Storage.
So far I have tried: experimenting with PipelineResult.waitUntilFinish (as in p.run().waitUntilFinish()), altering the placement of the two p.run() calls, and simplifying as much as possible by just using a string as my side input, always with the same result. Searching on Stack Overflow and Google just led me to the PR on the Beam repo that implemented the error message.
SideInputTest.java:
public class SideInputTest {
    public static void main(String[] arg) throws IOException {
        // Build a pipeline to read in string
        DataflowPipelineOptions options1 = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
        options1.setRunner(DataflowRunner.class);
        Pipeline p = Pipeline.create(options1);

        // Build really simple side input
        PCollectionView<String> sideInputView = p.apply(Create.of("foo"))
            .apply(View.<String>asSingleton());

        // Run p
        p.run();

        // Build main pipeline to read csv data
        DataflowPipelineOptions options2 = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
        options2.setProject(PROJECT_NAME);
        options2.setStagingLocation(STAGING_LOCATION);
        options2.setRunner(DataflowRunner.class);
        Pipeline p2 = Pipeline.create(options2);

        p2.apply(TextIO.Read.from(INPUT_DATA))
            .apply(ParDo.withSideInputs(sideInputView).of(new DoFn<String, String>() {
                @ProcessElement
                public void processElement(ProcessContext c) {
                    String[] rowData = c.element().split(",");
                    String sideInput = c.sideInput(sideInputView);
                    c.output(rowData[0] + sideInput);
                }
            }))
            .apply(TextIO.Write
                .to(OUTPUT_DATA));

        p2.run();
    }
}
Full stack trace:
Caused by: java.lang.NullPointerException: Unknown producer for value SingletonPCollectionView{tag=Tag<org.apache.beam.sdk.util.PCollectionViews$SimplePCollectionView.<init>:435#3d93cb799b3970be>} while translating step ParDo(Anonymous)
at org.apache.beam.runners.dataflow.repackaged.com.google.common.base.Preconditions.checkNotNull(Preconditions.java:1079)
at org.apache.beam.runners.dataflow.DataflowPipelineTranslator$Translator.getProducer(DataflowPipelineTranslator.java:508)
at org.apache.beam.runners.dataflow.DataflowPipelineTranslator.translateSideInputs(DataflowPipelineTranslator.java:926)
at org.apache.beam.runners.dataflow.DataflowPipelineTranslator.translateInputs(DataflowPipelineTranslator.java:913)
at org.apache.beam.runners.dataflow.DataflowPipelineTranslator.access$1100(DataflowPipelineTranslator.java:112)
at org.apache.beam.runners.dataflow.DataflowPipelineTranslator$7.translateSingleHelper(DataflowPipelineTranslator.java:863)
at org.apache.beam.runners.dataflow.DataflowPipelineTranslator$7.translate(DataflowPipelineTranslator.java:856)
at org.apache.beam.runners.dataflow.DataflowPipelineTranslator$7.translate(DataflowPipelineTranslator.java:853)
at org.apache.beam.runners.dataflow.DataflowPipelineTranslator$Translator.visitPrimitiveTransform(DataflowPipelineTranslator.java:415)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:486)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:481)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.access$400(TransformHierarchy.java:231)
at org.apache.beam.sdk.runners.TransformHierarchy.visit(TransformHierarchy.java:206)
at org.apache.beam.sdk.Pipeline.traverseTopologically(Pipeline.java:321)
at org.apache.beam.runners.dataflow.DataflowPipelineTranslator$Translator.translate(DataflowPipelineTranslator.java:365)
at org.apache.beam.runners.dataflow.DataflowPipelineTranslator.translate(DataflowPipelineTranslator.java:154)
at org.apache.beam.runners.dataflow.DataflowRunner.run(DataflowRunner.java:514)
at org.apache.beam.runners.dataflow.DataflowRunner.run(DataflowRunner.java:151)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:210)
at com.xpw.SideInputTest.main(SideInputTest.java:63)
Currently using org.apache.beam packages at version 0.6.0.
This code is taking a PCollectionView created in one pipeline (p.apply(Create.of("foo")).apply(View.<String>asSingleton());) and using it in another pipeline (p2).
PCollection's and PCollectionView's belong to a particular pipeline and reuse of them in a different pipeline is not supported.
You can create an analogous PCollectionView in p2.
I'm also confused about what your pipeline p is trying to accomplish: the only transform it has is creating the view, so there's no data being processed in it. I think you should get rid of p entirely and just use p2.
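As an illustration, here is a minimal sketch based on the code in the question (keeping the same constant "foo" side input and the 0.6.0-era API), with the view built on p2 so that the view and the main input belong to one pipeline:

Pipeline p2 = Pipeline.create(options2);

// Build the side input view on p2 itself.
PCollectionView<String> sideInputView = p2.apply(Create.of("foo"))
    .apply(View.<String>asSingleton());

p2.apply(TextIO.Read.from(INPUT_DATA))
    .apply(ParDo.withSideInputs(sideInputView).of(new DoFn<String, String>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
            String[] rowData = c.element().split(",");
            c.output(rowData[0] + c.sideInput(sideInputView));
        }
    }))
    .apply(TextIO.Write.to(OUTPUT_DATA));

p2.run();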
I am wondering if anyone knows what URL is required (as a GET or POST) to get the status (result) of the last Jenkins build when the build number is not known by the client making the request. I just want to be able to detect whether the result was RED or GREEN/BLUE.
I have this code sample, but I need to adjust it so that it works for Jenkins for this purpose (as stated above):
public class Main {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost/jenkins/api/xml");
        Document dom = new SAXReader().read(url);

        for (Element job : (List<Element>) dom.getRootElement().elements("job")) {
            System.out.println(String.format("Name:%s\tStatus:%s",
                job.elementText("name"), job.elementText("color")));
        }
    }
}
Once I figure out the answer, I will share a full example of how I used it. I want to create a job that collects information on a test suite of 20+ jobs and reports on all of them in an email.
You can use the symbolic descriptor lastBuild:
http://localhost/jenkins/job/<jobName>/lastBuild/api/xml
The result element contains a string describing the outcome of the build.
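A minimal sketch adapting the dom4j sample from the question, assuming a job named my-job (hypothetical) and that the API is readable without authentication:

public class Main {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost/jenkins/job/my-job/lastBuild/api/xml");
        Document dom = new SAXReader().read(url);

        // <result> is SUCCESS for green/blue builds and FAILURE for red ones;
        // it can also be UNSTABLE, ABORTED, or absent while a build is still running.
        String result = dom.getRootElement().elementText("result");
        System.out.println("Last build result: " + result);
    }
}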