Different coders for the same class in dataflow job - google-cloud-dataflow

I'm trying to use different coders for the same class for two different scenarios:
Reading from JSON input files - using data = TextIO.Read.from(options.getInput()).withCoder(new Coder1())
Elsewhere in the job I want the class to be persisted using SerializableCoder using data.setCoder(SerializableCoder.of(MyClass.class)
It works locally, but fails when run in the cloud with
Caused by: java.io.StreamCorruptedException: invalid stream header: 7B227365.
Is it a supported scenario? The reason to do this in the first place is to avoid read/write of JSON format, and on the other hand make reading from input files more efficient (UTF-8 parsing is part of the JSON reader, so it can read from InputStream directly)
Clarifications:
Coder1 is my coder.
The other coder is a SerializableCoder.of(MyClass.class)
How does the system choose which coder to use? The two formats are binary incompatible, and it looks like due to some optimization, the second coder is used for data format which can only be read by the first coder.

Yes, using two different coders like that should work. (With the caveat that the coder in #2 will only be used if the system choses to persist 'data' instead of optimizing it into surround computations.)
Are you using your own Coders or ones provided by the Dataflow SDK? Quick caveat on TextIO -- because it uses newlines to encode element boundaries, you'll get into trouble if you use a coder that produces encoded values containing something that can be mistaken for a newline. You really should only use textual encodings within TextIO. We're hoping to make that clearer in the future.

Related

IBM Integration Bus and xsd:anyType

I'm working with IIB v9 mxsd message definitions. I'd like to define one of the XML elements to be of type xsd:anyType. However, in the list of types I can choose from, only anySimpleType and anyUri are possible (besides all other types like string, integer, etc.).
How can I get around this limitation?
The XMLNSC parser supports the entire XML Schema specification, including xs:any and xs:anyType. In IIBv9 you should create a Library and import your xsds into it. Link your Application to the Library and the XMLNSC parser will find and use the model. You do not need to specify the name of the Library in the node properties; the XSD model will be automatically available to the entire application.
You do not need to use a message set at all in IIBv9 and later versions.
The mxsd file format is used only by the MRM (not DFDL) parser.
You shouldn't use an MXSD to model your XML data, use a normal XSD.
MXSD is for modelling data for the DFDL parser, but you should use the XMLNSC parser for XML messages and define them in XSDs, in which you can use anyType.
As far as I know DFDL doesn't support anyType.

Does Ruby on Rails have read stream for files?

Does rails have a way to implement read streams like Node js for file reading?
i.e.
fs.createReadStream(__dirname + '/data.txt');
as apposed to
fs.readFile(__dirname + '/data.txt');
Where I see ruby has
file = File.new("data.txt")
I am unsure of the equivalent in ruby/rails for creating a stream and would like to know if this is possible. The reasons I ask is for memory management as a stream will be delivered piece by piece as apposed to one whole file.
If you want to read a file in Ruby piece-by-piece, there are a host of methods available to you.
IO#each_line/IO::foreach, also implemented in File to iterate over each line of the file. Neither reads the whole file into memory; instead, both simply read up until the next newline, return, and pause reading, barring a possible buffer.
IO#read/IO::read takes a length parameter, which allows you to specify for it to read up to length bytes from the file. This will only read that many, and not the whole thing.
IO::binread does the same as IO::read, but will open the file in binary mode.
IO#readpartial appears to be very similar or identical to IO#read, but is also worth looking at.
IO#getc and IO#gets both read from the file until they reach the end of what they'll return, as far as I can tell.
There are several more that I'm looking for right now.

Processing multiline events from a text file in Dataflow

I am attempting to build a dataflow pipeline to process a text file which contains events that span multiple lines. The dataflow SDK TextIO class assumes each line is a new event.
My plan is to create a new TextReader and register it with the DataPipelineRunner. This new reader will know how to aggregate the multiple lines into a single line.
I am pretty sure that this approach will work but I am wondering if this is the right way to do it or if there is a simpler solution?
The text I am trying to parse is:
==============> len:45 pktype:4 mtype:2
SYMBOL: USOCSTIA151632.00
OPEN_INT: 212
PR_OPEN_INTEREST: 212
TIME_STAMP: 04/10/2015 06:30:17:420 val:1428661817
The result should be the last 4 lines concatenated together and the first line dropped.
Best regards,
Peter
Note that TextReader is an internal implementation detail class, so subclassing it would be highly discouraged and challenging to do properly.
The recommended way to define a new file-based format like yours is to subclass FileBasedSource using the user-defined source API.
In your case, I would recommend to base your class on the LineIO example from documentation, and wrap the LineReader defined there into your own class which would use LineReader as a helper for reading individual lines, but:
In startReading() it would skip until the line starting with "====>"
In readNextRecord() it would read lines until the next "====>" and bundle them into a single record.
Please make sure to carefully read the documentation to FileBasedSource and FileBasedReader: the parallelization mechanism relies on the consistency properties described there, which your format has to satisfy, for ensuring that records are not duplicated or omitted on the boundaries between adjacent processing shards. XmlSource tests are a good example of how to unit-test these properties.
Please tell us how it goes and report back with any problems or questions - we are very interested in feedback on this API.

How to process a GCS filepattern, full file at a time?

I need to process a (GCS) bucket of files, where each file is compressed and contains a single multi-line JSON record. Also, the name of the file being processed is significant and I need to know it within my transform.
Starting with examples in the docs, TextIO looks pretty close, but it looks like its designed to process each file line-by-line and does not allow me to read the entire file at once. Also, I don't see any way to get the filename that's being processed?
PCollectionTuple results = p.apply(TextIO.Read
.from("gs://bucket/a/*.gz")
.withCompressionType(TextIO.CompressionType.GZIP)
.withCoder(MyJsonCoder.of()))
Looks like I need to write a custom IO reader, or some such? Any tips for best place to start?
You are correct that right now none of the existing classes do exactly what you want. There are 2 reasonable approaches:
Match the filepattern yourself (using IOChannelUtils and IOChannelFactory) and wrap the resulting files into a PCollection<String> where the String will be a filename, using Create.of(filenames). Then apply a ParDo with a function which reads the given filename.
Write your own subclass of Source (there's also FileBasedSource, but it's not quite right for your use case). It would be configured by the filepattern, and splitIntoBundles would match the filepattern and expand into individual sources each corresponding to one file.
I would recommend the first approach because it seems like less code and your use case does not require the full power of Source.

Tool to manage string literals used for parsing JSON

We parse a lot of JSON in our app - without the back-end, it would be a pretty useless app. I know this goes for a bunch of other apps out there as well. In order to parse JSON, we need a list of keys to get to the data. I'd like to know what is considered 'best practice' or at least 'damn good practice' for managing these paths/string literals. Is there a tool out there that helps manage such keys and reduces duplication?
Hard-coding them is definitely not an option although to be frank, if our back-end programmers change the key, in concept, a simple find/replace in XCode (or whatever IDE you're using) would suffice. It's ugly and unclean and I just feel dirty putting string literals all over my code though.
What I'm currently doing now is putting them all into my PCH file, which means I end up with:
#define kBookmarksSearchResultsIDFieldName #"business.id"
#define kBookmarksSearchResultsNameFieldName #"business.name"
#define kBookmarksSearchResultsThumbnailURLFieldName #"business.display_image.images.small_mobile.source"
#define kBookmarksBusinessCategoryArrayFieldName #"business.categories"
This gets unwieldy real fast though since now I have around a thousand lines of these things in my PCH file.
The other option I'm considering is breaking these up into separate .h files - but then if two components of my app end up using the same key (for example, a business object is embedded into the JSON for a bookmark, or for a review of that business) then I have to import the .h that contains the JSON paths for the business object. So in this case I'm still importing all of the same data, it's just the file organization that's cleaner.
My objectives are:
Easy management of string literals used for parsing JSON
Reduce the amount of duplication needed
Easy changing/replacement of JSON paths if/when needed
Is option 3 that I listed above (separate .h files) my best option? What do you guys use, and am I missing an easy tool out there (and no, JSONModel isn't an option because of the way it requires your JSON keys to match your ivar/property names - our back-end supports a number of platforms so we can't change the JSON keys just for iOS).
Look into using a library such as RestKit which allows you to map a JSON document to a set of Objective-C classes. This means you can read the document in and get an array of objects you can manipulate by properties instead of having to keep track of key names. It's much easier, and Xcode will autocomplete your property names as you work with the classes.
It takes some setup, but you only have to do it once. :)
Just to update this answer - there's a very cool library called Mantle - not perfect, there are some issues with typecasting but still a very solid effort.

Resources