DataFlow ProcessContext output exception ClosedChannelException - google-cloud-dataflow

I'm parsing XML from a stream and dispatching POJOs to ProcessContext.output.
It's throwing the following ClosedChannelException.
Any idea what's going on?
com.google.cloud.dataflow.sdk.util.UserCodeException: java.nio.channels.ClosedChannelException
at com.google.cloud.dataflow.sdk.util.DoFnRunner.invokeProcessElement(DoFnRunner.java:193)
at com.google.cloud.dataflow.sdk.util.DoFnRunner.processElement(DoFnRunner.java:171)
at com.google.cloud.dataflow.sdk.runners.worker.ParDoFnBase.processElement(ParDoFnBase.java:193)
at com.google.cloud.dataflow.sdk.util.common.worker.ParDoOperation.process(ParDoOperation.java:52)
at com.google.cloud.dataflow.sdk.util.common.worker.OutputReceiver.process(OutputReceiver.java:52)
at com.google.cloud.dataflow.sdk.runners.worker.ParDoFnBase$1.output(ParDoFnBase.java:157)
at com.google.cloud.dataflow.sdk.util.DoFnRunner$DoFnContext.outputWindowedValue(DoFnRunner.java:329)
at com.google.cloud.dataflow.sdk.util.DoFnRunner$DoFnProcessContext.output(DoFnRunner.java:483)
at com.myproj.dataflow.MyDocDispatcher.onMyDoc(MyDocDispatcher.java:24)

One likely cause is that the DoFn doing the XML processing produces the POJO lazily. When you pass that POJO to ProcessContext#output(), the optimizer may hand it directly to other DoFns later in the pipeline.
In that case, if a downstream DoFn has side-effects on the POJO it receives, the immutability requirements are violated: mutating the POJO obtained from ProcessContext#element() also mutates the element the upstream DoFn emitted.
If this is the problem, the easiest fix is to clone the POJO before passing it to output().
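For example, here is a minimal sketch in the style of the Dataflow 1.x SDK shown in the stack trace; MyDoc and its copy constructor are hypothetical stand-ins for the POJO in the question:

import com.google.cloud.dataflow.sdk.transforms.DoFn;

// Hypothetical POJO standing in for the lazily parsed document.
class MyDoc {
    String id;
    MyDoc() {}
    MyDoc(MyDoc other) { this.id = other.id; } // defensive copy
}

// Emit a copy so the value handed to output() is fully materialized
// and cannot be mutated through the lazily produced original.
class DispatchFn extends DoFn<MyDoc, MyDoc> {
    @Override
    public void processElement(ProcessContext c) {
        c.output(new MyDoc(c.element()));
    }
}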

Related

Apache Beam Coder for org.json.JSONObject

I am writing a data pipeline in Apache Beam that reads from Pub/Sub, deserializes the messages into JSONObjects, and passes them on to other pipeline stages. The issue is, when I try to submit my code I get the following error:
An exception occured while executing the Java class. Unable to return a default Coder for Convert to JSON and obfuscate PII data/ParMultiDo(JSONifyAndObfuscate).output [PCollection]. Correct one of the following root causes:
[ERROR] No Coder has been manually specified; you may do so using .setCoder().
[ERROR] Inferring a Coder from the CoderRegistry failed: Unable to provide a Coder for org.json.JSONObject.
[ERROR] Building a Coder using a registered CoderProvider failed.
[ERROR] See suppressed exceptions for detailed failures.
[ERROR] Using the default output Coder from the producing PTransform failed: PTransform.getOutputCoder called.
Basically, the error says Beam cannot find a Coder for org.json.JSONObject. I have no idea where to get such a coder or how to build one. Any ideas?
Thanks!
The best starting point for understanding coders is in the Beam Programming Guide: Data Encoding and Type Safety. The short version is that Coders are used to specify how different types of data are encoded to and from byte strings at certain points in a Beam pipeline (usually at stage boundaries). Unfortunately there is no coder for JSONObjects by default, so you have two options here:
Avoid creating JSONObjects in PCollections. Instead of passing JSONObjects throughout your pipeline, you could extract desired data from the JSON and either pass it around as basic data types, or have your own class encapsulating the data you need. Java's basic data types all have default coders assigned, and coders can easily be generated for classes that are just structs of those types. As a side benefit, this is how Beam pipelines are expected to be built, so it's likely to work more optimally if you stick with basic data and well-known coders when possible.
If JSONObjects are necessary, you'll want to create a custom coder for them. The programming guide contains info for how to set a custom coder as a default coder. For the implementation itself, the easiest way with JSONObject is to encode it to a JSON string with JSONObject.toString and then decode it from the string with JSONObject's string constructor. For details on how to do this, check out the programming guide above and take a look at the Coder documentation.
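If you go the custom-coder route, here is a minimal sketch built on the toString/string-constructor round trip described above (the class name is illustrative):

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.beam.sdk.coders.CustomCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.json.JSONObject;

// Round-trips a JSONObject through its JSON string form, delegating the
// byte-level work to the built-in StringUtf8Coder.
public class JsonObjectCoder extends CustomCoder<JSONObject> {
    private static final StringUtf8Coder STRING_CODER = StringUtf8Coder.of();

    public static JsonObjectCoder of() {
        return new JsonObjectCoder();
    }

    @Override
    public void encode(JSONObject value, OutputStream outStream) throws IOException {
        // Serialize to a JSON string and let the string coder handle the bytes.
        STRING_CODER.encode(value.toString(), outStream);
    }

    @Override
    public JSONObject decode(InputStream inStream) throws IOException {
        // Rebuild the JSONObject from its string form.
        return new JSONObject(STRING_CODER.decode(inStream));
    }
}

You can then attach it explicitly with .setCoder(JsonObjectCoder.of()) on the affected PCollection, or register it as the default coder for JSONObject as the programming guide describes.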

Test Dataflow with DirectRunner and got lots of verifyUnmodifiedThrowingCheckedExceptions

I was testing my Dataflow pipeline using DirectRunner on my Mac and got lots of "WARNING" messages like the one below. Does anyone know how to get rid of them? There are so many that I can't even see my debug messages.
Thanks
Apr 05, 2018 2:14:48 PM org.apache.beam.sdk.util.MutationDetectors$CodedValueMutationDetector verifyUnmodifiedThrowingCheckedExceptions
WARNING: Coder of type class org.apache.beam.sdk.coders.SerializableCoder has a #structuralValue method which does not return true when the encoding of the elements is equal.
Element com.apigee.analytics.platform.core.service.schema.EventRow#4a590d0b
It may help to ensure that all serialized values have proper equals() implementations, since SerializableCoder expects them:
The structural value of the object is the object itself. The SerializableCoder should be only used for objects with a proper Object#equals implementation.
You can also implement your own Coder for your POJOs. SerializableCoder does not guarantee a deterministic encoding, according to the docs:
SerializableCoder does not guarantee a deterministic encoding, as Java
serialization may produce different binary encodings for two equivalent
objects.
This article explains custom coders in detail.
I had this same problem. I was using SerializableCoder for a class implementing Serializable, and my tests were failing because PAssert's containsInAnyOrder() was not using MyClass.equals() to evaluate object equality. The signature of my equals() method was:
public boolean equals(MyClass other) {...}
All I had to do to fix it was to define equals in terms of Object:
public boolean equals(Object other) {...}
This made the warnings go away, and made the tests pass.
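For reference, a minimal sketch of the corrected override (MyClass and its single field are illustrative):

import java.io.Serializable;
import java.util.Objects;

class MyClass implements Serializable {
    String name;

    @Override
    public boolean equals(Object other) { // Object, not MyClass
        if (this == other) return true;
        if (!(other instanceof MyClass)) return false;
        return Objects.equals(name, ((MyClass) other).name);
    }

    @Override
    public int hashCode() {
        return Objects.hash(name);
    }
}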
Just add Lombok's @EqualsAndHashCode (https://projectlombok.org/features/EqualsAndHashCode):

@EqualsAndHashCode
public class YourRecord implements Serializable {
    // fields...
}

Unable to run multiple Pipelines in desired order by creating template in Apache Beam

I have two separate pipelines, say 'P1' and 'P2'. As per my requirement, I need to run P2 only after P1 has completely finished its execution, and I need to drive this entire operation through a single template.
Basically, a template gets created the moment the code reaches a run() call, say p1.run().
So as far as I can see, I would need to handle the two pipelines with two different templates, but that would not satisfy my strict, order-based execution requirement.
Another way I could think of is calling p1.run() inside a ParDo of p2 and making p2's run() wait until p1's run() finishes. I tried this, but got stuck at the IllegalArgumentException given below.
java.io.NotSerializableException: PipelineOptions objects are not serializable and should not be embedded into transforms (did you capture a PipelineOptions object in a field or in an anonymous class?). Instead, if you're using a DoFn, access PipelineOptions at runtime via ProcessContext/StartBundleContext/FinishBundleContext.getPipelineOptions(), or pre-extract necessary fields from PipelineOptions at pipeline construction time.
Is it not possible at all to call a pipeline's run() inside a transform, say a ParDo, of another pipeline?
If so, how can I satisfy my requirement of running two different pipelines in sequence from a single template?
A template can contain only a single pipeline. In order to sequence the execution of two separate pipelines each of which is a template, you'll need to schedule them externally, e.g. via some workflow management system (such as what Anuj mentioned, or Airflow, or something else - you might draw some inspiration from this post for example).
We are aware of the need for better sequencing primitives in Beam within a single pipeline, but do not have a concrete design yet.
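For completeness, outside of templates a single ordinary main() can sequence two pipelines by blocking on the first result with waitUntilFinish(); a minimal sketch, where buildP1/buildP2 are hypothetical methods that attach each pipeline's transforms:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class SequentialMain {
    public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();

        Pipeline p1 = Pipeline.create(options);
        buildP1(p1);
        p1.run().waitUntilFinish(); // block until P1 completes

        Pipeline p2 = Pipeline.create(options);
        buildP2(p2);
        p2.run().waitUntilFinish();
    }

    private static void buildP1(Pipeline p) { /* attach P1's transforms */ }
    private static void buildP2(Pipeline p) { /* attach P2's transforms */ }
}

Template creation, however, captures a single pipeline graph at construction time, which is exactly why this pattern cannot be packaged as one template.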

How to force an actor to fail after a timeout in akka

I'm using a master-slave architecture in Akka with Java. The master receives messages representing job commands, which it routes to the slaves. Those jobs involve calling a closed-source third-party library that sometimes fails by hanging, blocking execution without ever throwing an exception. Akka does not recognize this as a failure and keeps delivering messages to the actor's mailbox, but since the first call blocks indefinitely, the rest of the commands in the mailbox never get executed.
My goal is to emulate this type of failure with a timeout and an exception, so that the whole incident is forwarded to Akka's built-in failure-handling (supervision) strategy. So my question is: can I somehow configure an actor to throw an exception when a received message's execution has not completed within a timeout?
If not, what are some alternatives for handling such a case without performing any blocking operations? I was thinking of encapsulating the execution in a Future and calling it from inside an actor that blocks on that Future with a timeout. It works, but as many have suggested, blocking is not a good solution in Akka.
There is no need to block two threads where one suffices: just have one actor which coordinates how many calls to that (horribly unreliable) API are permitted, and launch them in Futures (as cmbaxter recommends, you should NOT use the same ExecutionContext the actor is running on; I'd use a dedicated one). Each such Future should then be combined with a timeout Future using firstCompletedOf:
import akka.pattern.after
import context.system.scheduler
import scala.concurrent.Future
import scala.concurrent.duration._
import java.util.concurrent.TimeoutException

implicit val ec = myDedicatedDangerousActivityThreadPool

val myDangerousFuture = ???
// Fails the race after one second unless myDangerousFuture completes first.
val timeout = after(1.second, scheduler)(Future.failed(new TimeoutException))
val combined = Future.firstCompletedOf(Seq(myDangerousFuture, timeout))
Then you pipe that back to your actor in some suitable fashion, e.g. mapping its result value into a message type or whatever you need, and keep track of how many are outstanding. I would recommend wrapping myDangerousFuture in a Circuit Breaker for improved responsiveness in the failure case.
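Since the question is on the Java side, here is a JDK-only sketch of the same race (Java 9+); callUnreliableApi() is a hypothetical wrapper around the flaky library:

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class DangerousCall {
    // Dedicated pool so a hung call cannot starve the actor dispatcher.
    private static final ExecutorService DANGEROUS_POOL = Executors.newFixedThreadPool(4);

    static CompletableFuture<String> callWithTimeout() {
        return CompletableFuture
            .supplyAsync(DangerousCall::callUnreliableApi, DANGEROUS_POOL)
            .orTimeout(1, TimeUnit.SECONDS); // completes exceptionally with TimeoutException
    }

    private static String callUnreliableApi() {
        return "result"; // stand-in for the real third-party call
    }
}

The resulting future (successful or failed) can then be delivered back to the actor as a message.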

Parsing variable length descriptors from a byte stream and acting on their type

I'm reading from a byte stream that contains a series of variable-length descriptors, which I'm representing as various structs/classes in my code. Each descriptor has a fixed-length header in common with all the other descriptors, which is used to identify its type.
Is there an appropriate model or pattern I can use to best parse and represent each descriptor, and then perform an appropriate action depending on its type?
I've written lots of these types of parsers.
I recommend that you read the fixed-length header and then dispatch to the correct constructor for your structures using a simple switch-case, passing the fixed header and the stream to that constructor so that it can consume the variable part of the stream.
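A minimal sketch of that shape, assuming a hypothetical header layout of one type byte plus a two-byte payload length, and hypothetical descriptor types:

import java.io.DataInputStream;
import java.io.IOException;

// Hypothetical descriptor hierarchy; the real types come from your format.
abstract class Descriptor {
    final byte[] payload;
    Descriptor(byte[] payload) { this.payload = payload; }
}

class AudioDescriptor extends Descriptor {
    AudioDescriptor(byte[] payload) { super(payload); }
}

class VideoDescriptor extends Descriptor {
    VideoDescriptor(byte[] payload) { super(payload); }
}

final class DescriptorReader {
    static Descriptor read(DataInputStream in) throws IOException {
        // Fixed-length header shared by all descriptors: type tag + length.
        int type = in.readUnsignedByte();
        int length = in.readUnsignedShort();
        byte[] payload = new byte[length];
        in.readFully(payload);
        // Dispatch on the type tag to the matching constructor.
        switch (type) {
            case 0x01: return new AudioDescriptor(payload);
            case 0x02: return new VideoDescriptor(payload);
            default:   throw new IOException("Unknown descriptor type " + type);
        }
    }
}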
This is a common problem in file parsing. Commonly, you read the known part of the descriptor (which luckily is fixed-length in this case, but isn't always) and branch there. I generally use a strategy pattern here, since I expect the system to be broadly flexible - but a straight switch or factory may work as well.
The other question is: do you control and trust the downstream code? Meaning: the factory / strategy implementation? If you do, then you can just give them the stream and the number of bytes you expect them to consume (perhaps putting some debug assertions in place, to verify that they do read exactly the right amount).
If you can't trust the factory/strategy implementation (perhaps you allow the user-code to use custom deserializers), then I would construct a wrapper on top of the stream (example: SubStream from protobuf-net), that only allows the expected number of bytes to be consumed (reporting EOF afterwards), and doesn't allow seek/etc operations outside of this block. I would also have runtime checks (even in release builds) that enough data has been consumed - but in this case I would probably just read past any unread data - i.e. if we expected the downstream code to consume 20 bytes, but it only read 12, then skip the next 8 and read our next descriptor.
To expand on that, one strategy design here might look something like:
interface ISerializer {
    object Deserialize(Stream source, int bytes);
    void Serialize(Stream destination, object value);
}
You might build a dictionary (or just a list, if the number is small) of such serializers keyed by the expected markers, resolve your serializer, and then invoke the Deserialize method. If you don't recognise the marker, then (one of):
skip the given number of bytes
throw an error
store the extra bytes in a buffer somewhere (allowing for round-trip of unexpected data)
As a side-note to the above - this approach (strategy) is useful if the system is determined at runtime, either via reflection or via a runtime DSL (etc). If the system is entirely predictable at compile-time (because it doesn't change, or because you are using code-generation), then a straight switch approach may be more appropriate - and you probably don't need any extra interfaces, since you can inject the appropriate code directly.
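To illustrate the registry lookup (in Java terms; all names are hypothetical, and an unknown marker here takes the first option from the list above):

import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;

// Strategy interface mirroring the ISerializer above, Java-flavoured.
interface Deserializer {
    Object deserialize(InputStream source, int bytes) throws IOException;
}

final class DeserializerRegistry {
    private final Map<Integer, Deserializer> byMarker = new HashMap<>();

    void register(int marker, Deserializer d) {
        byMarker.put(marker, d);
    }

    Object dispatch(int marker, InputStream source, int bytes) throws IOException {
        Deserializer d = byMarker.get(marker);
        if (d == null) {
            // Unknown marker: skip its payload; a real implementation
            // would loop until all bytes are actually skipped.
            source.skip(bytes);
            return null;
        }
        return d.deserialize(source, bytes);
    }
}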
One key thing to remember: if you're reading from the stream and do not detect a valid header/message, throw away only the first byte before trying again. Many times I've seen a whole packet or message get thrown away instead, which can result in valid data being lost.
This sounds like it might be a job for the Factory Method or perhaps Abstract Factory. Based on the header you choose which factory method to call, and that returns an object of the relevant type.
Whether this is better than simply adding constructors to a switch statement depends on the complexity and the uniformity of the objects you're creating.
I would suggest:
fifo = Fifo.new
while (fd is readable) {
    read everything off the fd and stick it into fifo
    if (the front of the fifo has a valid header and
        the fifo is big enough for the payload) {
        dispatch constructor, remove bytes from fifo
    }
}
With this method:
you can do some error checking for bad payloads, and potentially throw bad data away
data is not waiting on the fd's read buffer (can be an issue for large payloads)
If you'd like it to be nice OO, you can use the visitor pattern in an object hierarchy. Here is how I've done it (for identifying packets captured off the network, pretty much the same thing you might need):
huge object hierarchy, with one parent class
each class has a static constructor that registers with its parent, so the parent knows about its direct children (this was C++; I think this step is not needed in languages with good reflection support)
each class also had a static constructor method that received the remaining part of the byte stream and, based on that, decided whether it was its responsibility to handle that data or not
When a packet came in, I simply passed it to the static constructor method of the root parent class (called Packet), which in turn asked all of its children whether it was their responsibility to handle that packet, and this went on recursively until one class at the bottom of the hierarchy returned the instantiated object.
Each of the static "constructor" methods cut its own header from the byte stream and passed only the payload down to its children.
The upside of this approach is that you can add new types anywhere in the object hierarchy WITHOUT needing to see/change ANY other class. It worked remarkably well for packets; the hierarchy went like this:
Packet
  EthernetPacket
    IPPacket
      UDPPacket, TCPPacket, ICMPPacket
      ...
I hope you can see the idea.
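A compact sketch of that recursive dispatch (Java-flavoured, with illustrative names; real nodes would inspect actual header fields):

import java.util.ArrayList;
import java.util.List;

// Each node knows its registered children; parsing strips this node's
// header and then asks each child to claim the remaining payload.
abstract class Node {
    private final List<Node> children = new ArrayList<>();

    void register(Node child) { children.add(child); }

    // Inspect the data (e.g. a type field) to decide ownership.
    abstract boolean claims(byte[] data, int offset);

    // Number of header bytes this node consumes before recursing.
    abstract int headerLength();

    Node parse(byte[] data, int offset) {
        int payload = offset + headerLength(); // cut our own header
        for (Node child : children) {
            if (child.claims(data, payload)) {
                return child.parse(data, payload); // recurse downward
            }
        }
        return this; // no child claimed it: this is the most specific match
    }
}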
