Assigning to GenericRecord the timestamp from inner object - google-cloud-dataflow

Processing streaming events and writing files in hourly buckets is challenging because of windowing: some events from the incoming hour can end up in the previous hour's bucket, and so on.
I've been digging around Apache Beam and its triggers, but I'm struggling to get triggering to work with the element timestamp, as follows...
Window.<GenericRecord>into(FixedWindows.of(Duration.standardMinutes(1)))
.triggering(AfterProcessingTime
.pastFirstElementInPane()
.plusDelayOf(Duration.standardSeconds(1)))
.withAllowedLateness(Duration.ZERO)
.discardingFiredPanes())
This is what I've been doing so far: firing 1-minute windows regardless of the elements' timestamps. However, I would like to use the timestamp contained in the object, so that each window fires only with the elements whose timestamps fall within it.
Window.<GenericRecord>into(FixedWindows.of(Duration.standardMinutes(1)))
.triggering(AfterWatermark
.pastEndOfWindow())
.withAllowedLateness(Duration.ZERO)
.discardingFiredPanes())
The objects I'm dealing with have a timestamp field; however, it is a long field, not an Instant.
"{ \"name\": \"timestamp\", \"type\": \"long\", \"logicalType\": \"timestamp-micros\" },"
With that long field in my POJO class, nothing is triggered; but if I change it to an Instant and recreate the object accordingly, the following error is thrown whenever a Pub/Sub message is read.
Caused by: java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to java.lang.Long
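For reference, a field declared as timestamp-micros counts microseconds since the epoch, whereas Beam's Instant (from Joda-Time) is built from epoch milliseconds, so a conversion may be needed before the value can serve as an element timestamp. Below is a minimal sketch of that conversion. It uses java.time only so that it runs without Beam on the classpath, and MicrosTimestamps and its methods are hypothetical names, not part of Beam or Avro.

```java
import java.time.Instant;

// Hypothetical helper for illustration; not part of Beam or Avro.
final class MicrosTimestamps {
    private MicrosTimestamps() {}

    /** Converts epoch microseconds to epoch milliseconds, rounding toward negative infinity. */
    static long microsToMillis(long epochMicros) {
        return Math.floorDiv(epochMicros, 1_000L);
    }

    /** Converts epoch microseconds to a java.time.Instant, keeping microsecond precision. */
    static Instant microsToInstant(long epochMicros) {
        long seconds = Math.floorDiv(epochMicros, 1_000_000L);
        long microsOfSecond = Math.floorMod(epochMicros, 1_000_000L);
        return Instant.ofEpochSecond(seconds, microsOfSecond * 1_000L);
    }

    public static void main(String[] args) {
        // Sample value chosen to match a timestamp that appears in this question.
        long epochMicros = 1_560_365_998_609_000L;
        System.out.println(microsToMillis(epochMicros));
        System.out.println(microsToInstant(epochMicros));
    }
}
```

In a pipeline, the millisecond value would be what gets passed to the Joda-Time Instant constructor, assuming the field really holds microseconds.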
I've also been thinking about creating a kind of wrapper class around GenericRecord that contains a timestamp, but I would need to use only the inner GenericRecord once it's ready to be written with FileIO to .parquet.
What other ways do I have to use watermark triggers?
EDIT: After @Anton's comments, I've tried the following.
.apply("Apply timestamps", WithTimestamps.of(
(SerializableFunction<GenericRecord, Instant>) item -> new Instant(Long.valueOf(item.get("timestamp").toString())))
.withAllowedTimestampSkew(Duration.standardSeconds(30)))
Even though it has been deprecated, this seems to pass through the pipeline, but still nothing is written (the elements are still being discarded before writing for some reason, perhaps by the previously shown trigger?).
I also tried the other approach mentioned, using outputWithTimestamp, but due to the delay it prints the following error...
Caused by: java.lang.IllegalArgumentException: Cannot output with timestamp 2019-06-12T18:59:58.609Z. Output timestamps must be no earlier than the timestamp of the current input (2019-06-12T18:59:59.848Z) minus the allowed skew (0 milliseconds). See the DoFn#getAllowedTimestampSkew() Javadoc for details on changing the allowed skew.
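The rule stated in that error message can be sketched as a simple comparison: the output timestamp must not be earlier than the input timestamp minus the allowed skew. The class below is a hypothetical stand-in written with java.time, not Beam's actual implementation; it only reproduces the arithmetic using the two timestamps from the message above.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical stand-in for the check described in the error message; not Beam's code.
final class TimestampSkewCheck {
    private TimestampSkewCheck() {}

    /** Returns true if outputTimestamp is no earlier than inputTimestamp minus allowedSkew. */
    static boolean isAllowed(Instant inputTimestamp, Instant outputTimestamp, Duration allowedSkew) {
        Instant earliestAllowed = inputTimestamp.minus(allowedSkew);
        return !outputTimestamp.isBefore(earliestAllowed);
    }

    public static void main(String[] args) {
        Instant input = Instant.parse("2019-06-12T18:59:59.848Z");
        Instant output = Instant.parse("2019-06-12T18:59:58.609Z");
        // The output is about 1.2 s older than the input, so zero skew rejects it.
        System.out.println(isAllowed(input, output, Duration.ZERO));
        // A 30-second skew, as in the WithTimestamps attempt, would accept it.
        System.out.println(isAllowed(input, output, Duration.ofSeconds(30)));
    }
}
```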

Related

Gcp Dataflow processes invalid data

We have an API acting as a proxy between clients and Google Pub/Sub: it receives a JSON body and publishes it to the topic. The message is then processed by Dataflow, which stores it in BigQuery. We also use a transform UDF to, for instance, convert a field value to upper case; it parses the JSON sent and produces a new one.
The problem is the following. The number of bytes sent to the destination table is much smaller than the number sent to the dead-letter table, and 99% of the error messages say that the JSON sent is invalid. And that's true: the payloadstring column contains distorted JSONs, which can be truncated, concatenated with other ones, or both. I've added logs on the API side to see where the messages get corrupted, but neither the JSON bodies received by the API nor those it sends are invalid.
How can I debug this problem? Is there any chance that Pub/Sub or Dataflow corrupts the messages? If so, what can I do to fix it?
UPD. By the way, we use a Google-provided template called "pubsub topic to bigquery"
UPD2. API is written in Go, and the way we send the message is simply by calling
res := p.topic.Publish(ctx, &pubsub.Message{Data: msg})
The res variable is then used for error logging. p here is a custom struct.
The message we send is a JSON with 15 fields, and to be concise I'll mock both it and the UDF.
Message:
{"MessageName":"Name","MessageTimestamp":123123123,...}
UDF:
function transform(inJson) {
  var obj;
  try {
    obj = JSON.parse(inJson);
  } catch (error) {
    throw 'parse JSON error: ' + error;
  }
  if (Object.keys(obj).length !== 15) {
    throw "Message is invalid";
  }
  // Check the same field that the error message names.
  if (!(obj.hasOwnProperty('MessageName') && typeof obj.MessageName === 'string' && obj.MessageName.length > 0)) {
    throw "MessageName is absent or invalid";
  }
  /*
  other fields check
  */
  obj.MessageName = obj.MessageName.toUpperCase();
  /*
  other fields transform
  */
  return JSON.stringify(obj);
}
UPD3:
Besides being corrupted, I've noticed that every single message is duplicated at least once, and the duplicates are often truncated.
The problem started several days ago, when there was a massive increase in the number of messages. The volume is now back to normal, but the error is still there. The problem had been seen before, but much more rarely.
The behavior you describe suggests that the data is corrupt before it gets to Pubsub or Dataflow.
I have performed a test, sending JSON messages containing 15 fields. Your UDF as well as the Dataflow template work fine, since I was able to insert the data into BigQuery.
Based on that, it seems your messages are already corrupted before getting to Pub/Sub. I suggest you check your messages once they arrive at Pub/Sub and see whether they have the correct format.
Please note that the message schema must match the BigQuery table schema.

Google dataflow: AvroIO read from file in google storage passed as runtime parameter

I want to read Avro files in my Dataflow pipeline using Java SDK 2.
I have scheduled my Dataflow job using a Cloud Function, which is triggered when files are uploaded to the bucket.
Following is the code for options:
ValueProvider <String> getInputFile();
void setInputFile(ValueProvider<String> value);
I am trying to read this input file using following code:
PCollection<user> records = p.apply(
AvroIO.read(user.class)
.from(String.valueOf(options.getInputFile())));
I get following error while running the pipeline:
java.lang.IllegalArgumentException: Unable to find any files matching RuntimeValueProvider{propertyName=inputFile, default=gs://test_bucket/user.avro, value=null}
The same code works fine with TextIO.
How can we read the Avro file whose upload triggers the Cloud Function, which in turn triggers the Dataflow pipeline?
Please try ...from(options.getInputFile())) without converting it to a string.
For simplicity, you could even define your option as simple string:
String getInputFile();
void setInputFile(String value);
You need to use simply from(options.getInputFile()): AvroIO explicitly supports reading from a ValueProvider.
Currently the code takes options.getInputFile(), which is a ValueProvider, and calls Java's toString() on it (via String.valueOf), which yields the human-readable debug string "RuntimeValueProvider{propertyName=inputFile, default=gs://test_bucket/user.avro, value=null}". That string is then passed to AvroIO as the filename to read, and of course it is not a valid filename; that's why the code currently doesn't work.
Also note that the whole point of ValueProvider is that it is a placeholder for a value that is not known while constructing the pipeline and will be supplied later (potentially the pipeline will be executed several times, with different values supplied), so extracting the value of a ValueProvider at pipeline construction time is impossible by design, because there is no value. At runtime, though (e.g. in a DoFn), you can extract the value by calling .get() on it.
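To illustrate the point, here is a minimal sketch of a deferred value holder. DeferredValue is a hypothetical stand-in, not Beam's ValueProvider; it only shows why String.valueOf(...) on such an object yields a debug string while get() works only once a value has been supplied.

```java
// Hypothetical stand-in for a runtime-supplied option; not Beam's ValueProvider.
final class DeferredValue<T> {
    private final String propertyName;
    private T value; // stays null until the value is supplied "at runtime"

    DeferredValue(String propertyName) {
        this.propertyName = propertyName;
    }

    /** Simulates the runner supplying the value when the job actually executes. */
    void supply(T newValue) {
        this.value = newValue;
    }

    /** Returns the value, or fails if it has not been supplied yet. */
    T get() {
        if (value == null) {
            throw new IllegalStateException("Value of " + propertyName + " is only available at runtime");
        }
        return value;
    }

    /** Debug representation only; this is what String.valueOf(...) returns. */
    @Override
    public String toString() {
        return "DeferredValue{propertyName=" + propertyName + ", value=" + value + "}";
    }

    public static void main(String[] args) {
        DeferredValue<String> inputFile = new DeferredValue<>("inputFile");
        // At "construction time" only the debug string is available, not a filename.
        System.out.println(String.valueOf(inputFile));
        // Once the value is supplied, get() returns the real filename.
        inputFile.supply("gs://test_bucket/user.avro");
        System.out.println(inputFile.get());
    }
}
```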

Dataflow pipeline is dropping events during processing when using outputWithTimestamp

I have a Cloud Dataflow pipeline in which I alter the original timestamp for the event in order to simulate real world scenarios of events arriving late. However, it appears I'm dropping some percentage of my events on each run of the pipeline. Inside my DoFn I use the following code to change the timestamp:
Instant newTimestamp = originalTimestamp.minus(Duration.standardMinutes(RANDOM.nextInt(15)));
c.outputWithTimestamp(KV.of(Integer.toString(RANDOM.nextInt(100)), element), newTimestamp);
The problem is most likely caused by your DoFn step outputting a timestamp that is earlier than the timestamp that was received by the processing step minus the allowed timestamp skew. The exception that would be thrown can be found here in the code:
https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/sdk/src/main/java/com/google/cloud/dataflow/sdk/util/DoFnRunnerBase.java#L493
This behavior is documented with regard to using outputWithTimestamp here:
https://cloud.google.com/dataflow/java-sdk/JavaDoc/com/google/cloud/dataflow/sdk/transforms/DoFn.Context#outputWithTimestamp-OutputT-org.joda.time.Instant-
While you could override the getAllowedTimestampSkew function, it is also documented that this might cause unpredictable issues with the watermark calculations, so it should only be used without windowing/grouping.
https://cloud.google.com/dataflow/java-sdk/JavaDoc/com/google/cloud/dataflow/sdk/transforms/DoFn#getAllowedTimestampSkew--

Can Dataflow sideInput be updated per window by reading a gcs bucket?

I’m currently creating a PCollectionView by reading filtering information from a GCS bucket and passing it as a side input to different stages of my pipeline in order to filter the output. If the file in the GCS bucket changes, I want the currently running pipeline to use this new filter info. Is there a way to update this PCollectionView on each new window of data if my filter changes? I thought I could do it in a startBundle but I can’t figure out how, or whether it’s possible. Could you give an example if it is possible?
PCollectionView<Map<String, TagObject>>
tagMapView =
pipeline.apply(TextIO.Read.named("TagListTextRead")
.from("gs://tag-list-bucket/tag-list.json"))
.apply(ParDo.named("TagsToTagMap").of(new Tags.BuildTagListMapFn()))
.apply("MakeTagMapView", View.asSingleton());
PCollection<String>
windowedData =
pipeline.apply(PubsubIO.Read.topic("myTopic"))
.apply(Window.<String>into(
SlidingWindows.of(Duration.standardMinutes(15))
.every(Duration.standardSeconds(31))));
PCollection<MY_DATA>
lineData = windowedData
.apply(ParDo.named("ExtractJsonObject")
.withSideInputs(tagMapView)
.of(new ExtractJsonObjectFn()));
You probably want something like "use an at most 1-minute-old version of the filter as a side input" (since in theory the file can change frequently, unpredictably, and independently of your pipeline, there's really no way to completely synchronize changes of the file with the behavior of the pipeline).
Here's a (granted, rather clumsy) solution I was able to come up with. It relies on the fact that side inputs are implicitly also keyed by window. In this solution we're going to create a side input windowed into 1-minute fixed windows, where each window will contain a single value of the tag map, derived from the filter file as-of some moment inside that window.
PCollection<Long> ticks = p
// Produce 1 "tick" per second
.apply(CountingInput.unbounded().withRate(1, Duration.standardSeconds(1)))
// Window the ticks into 1-minute windows
.apply(Window.into(FixedWindows.of(Duration.standardMinutes(1))))
// Use an arbitrary per-window combiner to reduce to 1 element per window
.apply(Count.globally());
// Produce a collection of tag maps, 1 per each 1-minute window
PCollectionView<TagMap> tagMapView = ticks
.apply(MapElements.via((Long ignored) -> {
... manually read the json file as a TagMap ...
}))
.apply(View.asSingleton());
This pattern (joining against slowly changing external data as a side input) comes up repeatedly, and the solution I'm proposing here is far from perfect. I wish we had better support for this in the programming model. I've filed a BEAM JIRA issue to track this.
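The "keyed by window" idea can be sketched with plain arithmetic: every timestamp belongs to the 1-minute fixed window that starts at the timestamp rounded down to the minute, so all lookups that resolve to the same minute see the same tag map. The class below is a simplified, hypothetical stand-in for fixed windowing with zero offset, written with java.time. How the runner maps a sliding main-input window onto a side-input window is not shown here.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical, simplified stand-in for fixed-window assignment (zero offset).
final class FixedWindowSketch {
    private FixedWindowSketch() {}

    /** Returns the start of the fixed window of the given size that contains the timestamp. */
    static Instant windowStart(Instant timestamp, Duration size) {
        long sizeMillis = size.toMillis();
        long millis = timestamp.toEpochMilli();
        return Instant.ofEpochMilli(millis - Math.floorMod(millis, sizeMillis));
    }

    public static void main(String[] args) {
        Duration oneMinute = Duration.ofMinutes(1);
        // Two timestamps in the same minute share a window, and thus one version of the filter.
        System.out.println(windowStart(Instant.parse("2019-06-12T18:59:58.609Z"), oneMinute));
        System.out.println(windowStart(Instant.parse("2019-06-12T18:59:59.848Z"), oneMinute));
        // A timestamp just past the boundary falls into the next window.
        System.out.println(windowStart(Instant.parse("2019-06-12T19:00:00.001Z"), oneMinute));
    }
}
```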

How to implement a custom file parser in Google DataFlow for a Google Cloud Storage file

I have a custom file format in Google Cloud Storage and I want to read it from Google DataFlow.
I've implemented a Source and a Reader by subclassing FileBasedReader, but then I realized it didn't support reading from Google Cloud Storage (while FileBasedSink actually does...), so I'm not sure what the best way to solve this is...
I tried to subclass TextIO, but I couldn't get anywhere with that as it doesn't seem to be designed to be subclassed.
Any good idea on how to deal with that?
Thanks.
Update to reflect on the comments
File pattern used: gs://mybucket/my.json
Implemented the Source class from FileBasedSource:
MessageSource<T> extends FileBasedSource<T>
Implemented the Reader class (what I really care about here) from FileBasedReader:
MessageReader<T> extends FileBasedReader<T>
Process for reading is:
MySource source = // instantiate source
Pipeline p = Pipeline.create(options);
p.apply(TextIO.Read.from(options.getSource()).named("ReadFileData"))
.apply(ParDo.of(new DoFn<String, String>() {
And the getSource() comes from this command line parameter (verified correct):
--source=gs://${BUCKET_NAME}/my.json \
Am I missing anything?
2nd UPDATE
When running source.getEstimatedSizeBytes(options), it tells me that no handler was found:
java.io.IOException: Unable to find handler for gs://mybucket/my.json
at com.google.cloud.dataflow.sdk.util.IOChannelUtils.getFactory(IOChannelUtils.java:186)
at com.google.cloud.dataflow.sdk.io.FileBasedSource.getEstimatedSizeBytes(FileBasedSource.java:182)
at com.etc.TrackingDataPipeline.main(TrackingDataPipeline.java:66)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:293)
at java.lang.Thread.run(Thread.java:745)
I thought the FileBasedSource was supposed to handle GCS?
From the stack trace you show in "2nd Update", it looks like you have called getEstimatedSizeBytes directly from your main() method. This is expected to lead to the error you see.
The standard URL scheme handlers are registered when a pipeline runner is constructed. In your example code, that would happen when you call Pipeline.create(options) (this calls PipelineRunner.fromOptions(options), where the standard handlers are registered).
If you want to have the standard URL schemes registered in a context other than running a pipeline, you can explicitly call IOChannelUtils.registerStandardIOFactories(). I should note that this is not a supported API, but reaching a bit "under the hood". As such, it may change at any time.
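The ordering problem can be pictured with a small scheme-to-handler registry: a lookup for gs:// fails until the standard handlers have been registered. The sketch below is a hypothetical stand-in, not the SDK's IOChannelUtils; the class, method, and handler names are made up for illustration, and the error text simply mirrors the one in the stack trace.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a URL-scheme handler registry; not the SDK's IOChannelUtils.
final class SchemeRegistrySketch {
    private final Map<String, String> handlers = new HashMap<>();

    /** Simulates what happens as a side effect of constructing the pipeline runner. */
    void registerStandardHandlers() {
        handlers.put("file", "local-file-handler");
        handlers.put("gs", "gcs-handler");
    }

    /** Looks up the handler for a file spec, failing if its scheme was never registered. */
    String getHandler(String spec) throws IOException {
        int separator = spec.indexOf("://");
        String scheme = separator < 0 ? "file" : spec.substring(0, separator);
        String handler = handlers.get(scheme);
        if (handler == null) {
            throw new IOException("Unable to find handler for " + spec);
        }
        return handler;
    }

    public static void main(String[] args) throws IOException {
        SchemeRegistrySketch registry = new SchemeRegistrySketch();
        try {
            registry.getHandler("gs://mybucket/my.json"); // too early: nothing registered yet
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
        registry.registerStandardHandlers();
        System.out.println(registry.getHandler("gs://mybucket/my.json"));
    }
}
```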
