Apache Beam Coder for org.json.JSONObject - google-cloud-dataflow

I am writing a data pipeline in Apache Beam that reads from Pub/Sub, deserializes the messages into JSONObjects, and passes them to other pipeline stages. The issue is that when I try to submit my code I get the following error:
An exception occured while executing the Java class. Unable to return a default Coder for Convert to JSON and obfuscate PII data/ParMultiDo(JSONifyAndObfuscate).output [PCollection]. Correct one of the following root causes:
[ERROR] No Coder has been manually specified; you may do so using .setCoder().
[ERROR] Inferring a Coder from the CoderRegistry failed: Unable to provide a Coder for org.json.JSONObject.
[ERROR] Building a Coder using a registered CoderProvider failed.
[ERROR] See suppressed exceptions for detailed failures.
[ERROR] Using the default output Coder from the producing PTransform failed: PTransform.getOutputCoder called.
Basically, the error says Beam cannot find a Coder for org.json.JSONObject objects. I have no idea where to get such a coder or how to build one. Any ideas?
Thanks!

The best starting point for understanding coders is in the Beam Programming Guide: Data Encoding and Type Safety. The short version is that Coders are used to specify how different types of data are encoded to and from byte strings at certain points in a Beam pipeline (usually at stage boundaries). Unfortunately there is no coder for JSONObjects by default, so you have two options here:
1. Avoid creating JSONObjects in PCollections. Instead of passing JSONObjects throughout your pipeline, you could extract the data you need from the JSON and either pass it around as basic data types or define your own class encapsulating it. Java's basic data types all have default coders assigned, and coders can easily be generated for classes that are just structs of those types. As a side benefit, this is how Beam pipelines are expected to be built, so the pipeline is likely to perform better if you stick with basic data and well-known coders when possible.
2. If JSONObjects are necessary, you'll want to create a custom coder for them. The programming guide explains how to set a custom coder as a default coder. For the implementation itself, the easiest way with JSONObject is to encode it to a JSON string with JSONObject.toString and then decode it from the string with JSONObject's string constructor. For details on how to do this, check out the programming guide above and take a look at the Coder documentation.
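A minimal sketch of the encode-to-string, decode-from-string idea from the second option. To stay runnable without the Beam or org.json jars, it does not implement Beam's Coder; StringFormCodec is a hypothetical stand-in that takes the two conversions as functions. In a real pipeline the same encode and decode bodies would presumably go into a subclass of Beam's CustomCoder, with JSONObject.toString and the JSONObject string constructor as the conversions.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

// Hypothetical helper, not part of Beam: encodes a value via its string form.
final class StringFormCodec<T> {
    private final Function<T, String> toText;
    private final Function<String, T> fromText;

    StringFormCodec(Function<T, String> toText, Function<String, T> fromText) {
        this.toText = toText;
        this.fromText = fromText;
    }

    // Write a length prefix followed by the UTF-8 bytes of the string form.
    void encode(T value, OutputStream out) throws IOException {
        byte[] bytes = toText.apply(value).getBytes(StandardCharsets.UTF_8);
        DataOutputStream data = new DataOutputStream(out);
        data.writeInt(bytes.length);
        data.write(bytes);
        data.flush();
    }

    // Read the length prefix, then exactly that many bytes, and rebuild the value.
    T decode(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        byte[] bytes = new byte[data.readInt()];
        data.readFully(bytes);
        return fromText.apply(new String(bytes, StandardCharsets.UTF_8));
    }

    // Convenience for trying the codec out: encode, then decode the result.
    T roundTrip(T value) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            encode(value, out);
            return decode(new ByteArrayInputStream(out.toByteArray()));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The length prefix is there because several values may be written to the same stream, so the decoder has to know where each one ends. Whatever form the coder takes, it still has to be attached to the PCollection, for example with .setCoder() as the error message suggests.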

Related

How to define content transformers in Siesta?

I'm just integrating Siesta and I love it, it solves a lot of issues we have when using frameworks like RestKit.
What I can't get my head around is how to use the content transformers? I've looked at the docs and examples and I can't quite understand how it works, I'm also fairly new to Swift.
Looking at this example taken from another SO reply:
private let SwiftyJSONTransformer = ResponseContentTransformer(skipWhenEntityMatchesOutputType: false) {
    JSON($0.content as AnyObject)
}
I can't quite understand what's going on here: there is no return value, so I don't understand how the content is being transformed. This might be due to my lack of deep Swift knowledge.
I understand how NSValueTransformer objects work in Obj-C, but I can't work out how to map a response, be it JSON or just a simple response body like a single string, number, or boolean value, to an object or type using Siesta.
We have some API responses that return just a single BOOL value in the response body while most of the other API responses are complex JSON object graphs.
How would I go about mapping these responses to more primitive types and/or more complex objects?
Thanks.
Some of your confusion is basic Swift stuff. Where a closure uses $0 and contains only a single statement, the input types are inferred and the return is implicit. Thus the code in your question is equivalent to:
ResponseContentTransformer(skipWhenEntityMatchesOutputType: false) {
    (content: AnyObject, entity: Entity) in
    return JSON(content)
}
(Using $0.content instead of just $0 is a workaround for a maybe-bug-maybe-feature in Swift where $0 becomes a tuple of all arguments instead of just the first one. Don’t worry too much about it; $0.content is just a magic incantation you can use in your Siesta transformers.)
The other half of your confusion is Siesta itself. The general approach is as follows:
1. Configure a generic transformer that turns the raw NSData into a decoded but unstructured type such as String or Dictionary.
   You’ll usually configure this based on content type.
   Siesta includes parsing for strings, JSON, and images unless you turn it off with useDefaultTransformers: false.
2. Optionally configure a second transformer that turns the unstructured type into a model.
   You’ll usually configure this based on API path.
   Siesta doesn’t include any of this by default; it’s all per app.
For responses that are just a bare boolean, you’d probably do only #1 — depending on exactly what kind of response the server is sending, and depending on how you know it's just a boolean.
I recommend looking at the example project included with Siesta, which gives a good example of how all this plays out. You’ll see examples of both transformers that conditionally operate on the content based on its type (#1) and model-specific transformers (#2) in that code.

DataFlow ProcessContext output exception ClosedChannelException

I'm parsing XML from a stream and dispatching POJOs to ProcessContext.output.
It's throwing the following ClosedChannelException.
Any idea what's going on?
com.google.cloud.dataflow.sdk.util.UserCodeException: java.nio.channels.ClosedChannelException
at com.google.cloud.dataflow.sdk.util.DoFnRunner.invokeProcessElement(DoFnRunner.java:193)
at com.google.cloud.dataflow.sdk.util.DoFnRunner.processElement(DoFnRunner.java:171)
at com.google.cloud.dataflow.sdk.runners.worker.ParDoFnBase.processElement(ParDoFnBase.java:193)
at com.google.cloud.dataflow.sdk.util.common.worker.ParDoOperation.process(ParDoOperation.java:52)
at com.google.cloud.dataflow.sdk.util.common.worker.OutputReceiver.process(OutputReceiver.java:52)
at com.google.cloud.dataflow.sdk.runners.worker.ParDoFnBase$1.output(ParDoFnBase.java:157)
at com.google.cloud.dataflow.sdk.util.DoFnRunner$DoFnContext.outputWindowedValue(DoFnRunner.java:329)
at com.google.cloud.dataflow.sdk.util.DoFnRunner$DoFnProcessContext.output(DoFnRunner.java:483)
at com.myproj.dataflow.MyDocDispatcher.onMyDoc(MyDocDispatcher.java:24)
One likely cause of this is that your DoFn that does the XML processing to produce a POJO actually lazily produces the POJO. When you pass that POJO to ProcessContext#output() it may be directly passed to other DoFns later in the pipeline, based on the optimizer.
In this case, if the downstream DoFn interacting with the POJO has some side-effects on the POJO that was received, it violates the immutability requirements, since interacting with the POJO received from the ProcessContext#element() modifies it.
If this is the problem, the easiest fix is cloning the POJO before passing it to output().
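The cloning fix can be sketched as follows. MyDoc and its fields are hypothetical stand-ins for the POJO in the question; the only point is that the copy reads every value eagerly and shares no mutable state with the object the parser produced.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical POJO standing in for the parsed XML document.
final class MyDoc {
    private final String id;
    private final List<String> fields;

    MyDoc(String id, List<String> fields) {
        this.id = id;
        // Copy the list so later changes to the caller's list are not visible here.
        this.fields = new ArrayList<>(fields);
    }

    // Copy constructor: reads every value eagerly and shares no mutable state.
    MyDoc(MyDoc other) {
        this(other.id, other.fields);
    }

    String getId() {
        return id;
    }

    List<String> getFields() {
        return fields;
    }
}
```

Inside the DoFn, the call would then look like c.output(new MyDoc(doc)) instead of c.output(doc), assuming c is the ProcessContext and doc is the object handed over by the parser.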

What does @ do in Dart programs?

I've just spent 20 hours learning the basics of the Dart language, but I keep finding the @ prefix in open-source Dart programs such as here, and most programs seem to use it. I wonder what the @ directive does in those programs...
For your information, the official documentation says the following:
Metadata
Use metadata to give additional information about your code. A metadata annotation begins with the character @, followed by either a reference to a compile-time constant (such as deprecated) or a call to a constant constructor.
Three annotations are available to all Dart code: @deprecated, @override, and @proxy. For examples of using @override and @proxy, see the section called “Extending a Class”. Here’s an example of using the @deprecated annotation:
However, what "additional information" does the @ directive add to the code? If you create an instance by writing the following constructor
@todo('seth', 'make this do something')
, instead of the following constructor, which is the default:
todo('seth', 'make this do something')
, what is the benefit I can get from the first constructor?
I understand that using built-in metadata such as @deprecated and @override gives me the advantage of being warned when running the app, but what do I gain from a custom annotation such as @todo, or from the annotations in the linked sample code on GitHub?
Annotations can be accessed through the dart:mirrors library. You can use custom annotations whenever you want to provide additional information about a class, method, etc. For instance @MirrorsUsed is used to provide the dart2js compiler with extra information to optimize the size of the generated JavaScript.
Annotations are generally more useful to framework or library authors than application authors. For instance, if you were creating a REST server framework in Dart, you could use annotations to turn methods into web resources. For example it might look something like the following (assuming you have created the @GET annotation):
@GET('/users/')
List<User> getUsers() {
  // ...
}
You could then have your framework scan your code at server startup using mirrors to find all methods with the @GET annotation and bind the method to the URL specified in the annotation.
You can do some 'reasoning' about code.
You can query for fields/methods/classes/libraries/... that have a specific annotation.
You acquire those code parts using reflection. In Dart reflection is done by the 'dart:mirrors' library.
You can find a code example here: How to retrieve metadata in Dartlang?
An example where annotations are regularly used is serialization or database persistence, where you add metadata to the class that the serialization/persistence framework can use as configuration settings to know how to process a field or method.
For example you add an @Entity() annotation to indicate that this class should be persisted.
On each field that should be persisted you add another annotation like @Column().
Many persistence frameworks generate the database table automatically from this metadata.
For this they need more information, so you add an @Id() on the field that should be used as primary key and @Column(name: 'first_name', type: 'varchar', length: 32) to define parameters for the database table and columns.
This is just an example. The limit is your imagination.

Can we get closure information when using reflection on Grails?

So we have a task to produce documentation for a REST API we are building. We are using Grails and it's all nicely defined in UrlMappings, but that doesn't generate documentation the way we want.
We thought of using reflection to extract the mappings into some format that we can manipulate through code.
There's also the possibility of parsing the UrlMappings as a text file and producing some data format that we can manipulate.
For the reflection bit, can we extract info from Closures - such as properties, methods, etc - in Groovy?
Any other ideas would be appreciated.

How do I create an F# Type Provider for a generated assembly?

I'm having problems using any types created in an assembly for an F# Generative Type Provider. I created a YouTube video that demonstrates this.
The error messages I get are:
The module/namespace 'tutorial' from compilation unit 'Addressbook1' did not contain the namespace, module or type 'Person'
A reference to the type 'tutorial.Person' in assembly 'Addressbook1' was found, but the type could not be found in that assembly
I don't understand because the type is definitely in the assembly. For troubleshooting this, the assembly is a very basic C# dll. The code in the video is available via git:
git url: https://code.google.com/p/froto/
git branch: help
Any troubleshooting ideas would be appreciated. I'm hoping to make more progress on an F# Type Provider for .proto files, but I'm stuck on this.
I've taken a quick look at your code - as I mentioned in a comment I think you would be much better served by using the ProvidedTypes API that is defined by the F# 3.0 Sample Pack and documented (a bit) on MSDN.
Basically, the raw type provider API has a lot of assumptions baked in which will be hard for you to maintain by hand. I think that the specific problem you have is that the compiler expects to see a type named tutorial.Person in your assembly (since it's the return type of a method on tutorial.AddressbookProto, which you are exposing as a generated type), but it isn't ever embedded into your assembly.
However, this is really only one of several problems - as you've probably realized, you will see additional errors if the type that you're defining is called anything other than tutorial.AddressbookProto. That's because you're using a concrete type as the return from ApplyStaticArguments, but you would typically want to use a synthetic System.Type instance that accurately reflects the namespace and type name that the user used (e.g. in the ProvidedTypes API the ProvidedTypeDefinition class inherits from System.Type and handles this bookkeeping).

Resources