Processing multiline events from a text file in Dataflow

I am attempting to build a Dataflow pipeline to process a text file which contains events that span multiple lines. The Dataflow SDK's TextIO class assumes that each line is a separate event.
My plan is to create a new TextReader and register it with the DataPipelineRunner. This new reader will know how to aggregate the multiple lines into a single line.
I am pretty sure that this approach will work, but I am wondering whether this is the right way to do it or whether there is a simpler solution?
The text I am trying to parse is:
==============> len:45 pktype:4 mtype:2
SYMBOL: USOCSTIA151632.00
OPEN_INT: 212
PR_OPEN_INTEREST: 212
TIME_STAMP: 04/10/2015 06:30:17:420 val:1428661817
The result should be the last 4 lines concatenated together and the first line dropped.
Best regards,
Peter

Note that TextReader is an internal implementation detail class, so subclassing it would be highly discouraged and challenging to do properly.
The recommended way to define a new file-based format like yours is to subclass FileBasedSource using the user-defined source API.
In your case, I would recommend basing your class on the LineIO example from the documentation, and wrapping the LineReader defined there in your own class, which would use LineReader as a helper for reading individual lines, but (a rough sketch of the aggregation logic follows after these two bullets):
In startReading() it would skip until the line starting with "====>"
In readNextRecord() it would read lines until the next "====>" and bundle them into a single record.
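For illustration only, here is a rough, untested sketch of just that aggregation logic, using a plain BufferedReader in place of the LineReader helper. The class and method names mirror the startReading()/readNextRecord() steps described above but are otherwise made up; this is not a complete FileBasedReader.

import java.io.BufferedReader;
import java.io.IOException;

class MultilineRecordReader {
  private static final String DELIMITER = "==============>";
  private final BufferedReader in;
  private String current;     // the most recently assembled record
  private String pendingLine; // the delimiter line that starts the next event

  MultilineRecordReader(BufferedReader in) {
    this.in = in;
  }

  // Mirrors startReading(): skip forward until the first delimiter line.
  void startReading() throws IOException {
    String line;
    while ((line = in.readLine()) != null) {
      if (line.startsWith(DELIMITER)) {
        pendingLine = line;
        return;
      }
    }
  }

  // Mirrors readNextRecord(): bundle the lines between two delimiters into one record.
  boolean readNextRecord() throws IOException {
    if (pendingLine == null) {
      return false; // no more events in this file
    }
    pendingLine = null;
    StringBuilder record = new StringBuilder();
    String line;
    while ((line = in.readLine()) != null) {
      if (line.startsWith(DELIMITER)) {
        pendingLine = line; // this delimiter starts the next event
        break;
      }
      if (record.length() > 0) {
        record.append(' '); // separator used when concatenating the body lines; adjust to taste
      }
      record.append(line);
    }
    current = record.toString(); // the "====>" line itself is dropped
    return true;
  }

  String getCurrent() {
    return current;
  }
}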
Please make sure to carefully read the documentation for FileBasedSource and FileBasedReader: the parallelization mechanism relies on the consistency properties described there, which your format has to satisfy to ensure that records are not duplicated or omitted at the boundaries between adjacent processing shards. The XmlSource tests are a good example of how to unit-test these properties.
Please tell us how it goes and report back with any problems or questions - we are very interested in feedback on this API.

Related

Deletion of module columns/attribute not visible on a given View

I have been given the task of creating a DXL script. The first problem is that I have never used DXL before, even though I have many years' experience with DOORS itself. I have been surfing the Net to seek guidance on my particular problem. I also have a few specimen DXL scripts for reference.
My new client requires that for each View of a given Module, of which there are many Views, new "reduced" Modules are to be produced reflecting each View.
By "reduced", I mean that these new Modules are to contain nothing that isn't actually needed for that View, i.e. Columns, Attributes etc. These new Modules will only have the single View.
So, the way forward as I see it is to take copies of the single master Module, one for each View, rename those copies to reflect a given Master Module/Required View, select the required View in each copy Module and then delete everything that is not needed by that View, i.e. available Columns, Attributes etc.
This would be simple if I had the required DXL knowledge, which I am endeavouring to pick up as fast as I can.
If at all possible, this script has to be generic and be able to work upon any of the master Module copies to produce the associated "reduced" Module reflecting a particular View.
The client aims to use the script periodically for View archiving (I know, that's the way they want it).
Clarification
Some clarification of what I believe is required, given the following text from my original question:
If at all possible, this script has to be generic and be able to work upon any of the master Module copies to produce the associated "reduced" Module reflecting a particular View.
So, say there are ten Views of the master Module. Outside of the DXL script, I would copy the master Module ten times, renaming each copy to reflect each of the ten Views. Unless you know differently, each of those ten copies will carry the same "Absolute Number" values as the master Module, so no problem there?
So, starting with the first of the copied Modules, each named to reflect the View it will eventually represent, its current View would be set to whichever of the ten available Views matches its title.
The single generic DXL script would then be run against that first copy Module, the aim being to delete everything not actually needed for that View, i.e. Attributes, Columns etc. Would some kind of purging command be required in the script for any of the aforementioned deleted items?
The single generic DXL script would then delete ALL Views from that copy Module. The log produced when running the script also needs capturing, but I'm not sure whether this should be done from within the script, if that is possible, or as a separate manual task outside of the script.
The aforementioned (indented) process would then be repeated, using the same generic script, against the remaining nine copied Modules. The intention is to leave us with ten copy Modules, each one reflecting one of the ten possible Views, with each one containing only the Attributes, Columns etc. required for that View.
Creating a mirror of a module with this approach is not so easy IMO. Think e.g. about "Absolute Number". If the original module contains the numbers 15 (level 1), 2000 (level 2), 1 (level 1), you will have to create 2000 objects, purge 1997 of them and move them to the correct place.
There is a "duplicate" tool at https://www.ibm.com/developerworks/community/forums/html/topic?id=43862118-113d-4eac-b3f1-21d3b73959d1 which tries to do this, but as stated there, this script is said not to work correctly in all situations.
So, I would rather use the approach "string clipCopy(Item i); string clipPaste(Folder folderRef)". It should be faster and less error prone. But: all Out-Links will also be copied with this method, so you will probably have to delete these after the copy, or else the link target module(s) will have lots of In-Links.
The problem is still not easy to solve, as every view might have DXL columns that rely on some attribute or other, and it might contain DXL attributes which in turn might rely on something else. I doubt that there is a way to analyze DXL code "on the fly" and find out which columns may be deleted.
Perhaps a totally different approach would be feasible: open each view and create an export to Excel; this way you will get rid of any dynamic dependencies. Then re-import the Excel sheet into a new DOORS module. You will still have the "Absolute Number" problem, but perhaps you can make a deal that you will have a pseudo attribute "Original Absolute Number" and disregard the "new" "Absolute Number".
Quite a big task for a DXL beginner....
Update: On second thought, perhaps you might want to combine these approaches:
agree with your employer that you will use an alternative attribute for Absolute Number
use a loop like Russel suggested; when creating objects, remember that objects might have to be created "below" or "after" their predecessor or sibling
for DXL attributes, do not copy the DXL code but the actual current value of the object
for DXL columns, create pseudo attributes and create a new view that uses these pseudo attributes instead of the original values
Copying the entire module and then deleting everything not in that view seems worse than just copying the things you need from each particular view.
I would take the following as the outline of your program:
for view in main module do {
    for column in view do {
        Find attribute for each column and store (possibly in a skip list?)
        Store name of column
    }
    create new module
    create needed types / attributes in new module
    create new view in new module
    for object in main module {
        create object in new module
        for attribute in main module {
            check if attribute is in new module {
                copy info from old object to new
            }
        }
    }
}
Each of these "for X in Y" loops should be in the DXL reference manual in some form or another.
If you need more help, let me know!

What is the recommended way to make & load a library?

I want to make a small "library" to be used by my future Maxima scripts, but I am not quite sure how to proceed (I use wxMaxima). Maxima's documentation covers the save(), load() and loadfile() functions, yet does not provide examples. Therefore, I am not sure whether I am using the proper/best way or not. My current solution, which is based on this post, stores my library in the *.lisp format.
As a simple example, let's say that my library defines the cosSin(x) function. I open a new session and define this function as
(%i0) cosSin(x) := cos(x) * sin(x);
I then save it to a lisp file located in the /tmp/ directory.
(%i1) save("/tmp/lib.lisp");
I then open a new instance of maxima and load the library
(%i0) loadfile("/tmp/lib.lisp");
The cosSin(x) is now defined and can be called
(%i1) cosSin(%pi/4)
(%o1) 1/2
However, I noticed that a substantial number of the libraries shipped with maxima are of *.mac format: the /usr/share/maxima/5.37.2/share/ directory contains 428 *.mac files and 516 *.lisp files. Is it a better format? How would I generate such files?
More generally, what are the different ways a library can be saved and loaded? What is the recommended approach?
Usually people put the functions they need in a file named something.mac, and then load("something.mac"); loads the functions into Maxima.
A file can contain any number of functions. A file can load other files, so if you have somethingA.mac and somethingB.mac, then you can have another file that just says load("somethingA.mac"); load("somethingB.mac");.
One can also create Lisp files and load them too, but it is not required to write functions in Lisp.
Unless you are specifically interested in writing Lisp functions, my advice is to write your functions in the Maxima language and put them in a file, using an ordinary text editor. Also, I recommend that you don't use save to save the functions to a file as Lisp code; just type the functions into a file, as Maxima code, with a plain text editor.
Take a look at the files in share to get a feeling for how other people have gone about it. I am looking right now at share/contrib/ggf.mac and I see it has a lengthy comment header describing its purpose -- such comments are always a good idea.
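For example, a hypothetical mylib.mac (the file and function names here are just illustrative) would be an ordinary text file of Maxima definitions with such a header:

/* mylib.mac -- my personal helper functions */
cosSin(x) := cos(x) * sin(x);

Loading and using it then looks like this, assuming the file is in the current directory or somewhere on file_search_maxima:

(%i1) load("mylib.mac");
(%i2) cosSin(%pi/4);
(%o2) 1/2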
For beginners, like me:
Menu: Edit > Configure > Startup commands
Copy all the functions you have verified into the first box (this will write your wxmaxima-init.mac to the location indicated below).
Restart wxMaxima.
Now you can access the functions without any load() command.

ChromeWorker to write a huge file

In my extension, I need to write a huge file (say around 20 gigs) to the disk. Currently I am doing it in the main thread, but file creation is a very expensive operation. I was about to move the whole file creation process to a ChromeWorker, but based on https://developer.mozilla.org/en-US/docs/Web/API/Web_Workers_API/Functions_and_classes_available_to_workers I cannot access nsIFile from a ChromeWorker.
So my questions are:
1. Is it possible to access Cc, Ci, and Cu from within a ChromeWorker?
2. If not what would be the most efficient way to create and fill large files in Firefox. Note that I need to write the file based on segments and offsets (Ci.nsISeekableStream).
It's not possible to access nsIFile from a ChromeWorker. But nsIFile is a horribly synchronous option anyway.
Go with OS.File: https://developer.mozilla.org/en-US/docs/Mozilla/JavaScript_code_modules/OSFile.jsm
On that page go to the link for usage on workers: https://developer.mozilla.org/docs/Mozilla/JavaScript_code_modules/OSFile.jsm/OS.File_for_workers
On the main thread, OS.File functions return promises.
In a worker they are synchronous. Wrap your OS.File calls in the worker with a try-catch: when an error occurs (like OS.File.remove with the ignoreAbsent option set to false), the catch will receive the OS.File.Error object.
Great move to a ChromeWorker, by the way! I'm a huge fan of ChromeWorkers. I wrote a simple example of a JSM using a ChromeWorker here: https://github.com/Noitidart/jpm-chromeworker
For segments, you'll have to OS.File.open the file and then call .setPosition() on the returned file object; from that position you can read a certain number of bytes, write, or whatever. It's awesome stuff. OS.File is the new and recommended way to do file operations. It has been around a while now, since about Firefox 29 or earlier.
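As a rough, untested illustration (the worker file name and the { path, offset, bytes } message shape are made up; only the OS.File calls themselves come from the documented API), offset-based writes in a worker could look something like this:

// huge_file_worker.js
importScripts("resource://gre/modules/osfile.jsm");

self.onmessage = function (event) {
  let { path, offset, bytes } = event.data; // bytes is assumed to be a Uint8Array
  try {
    // In a worker, OS.File calls are synchronous and return plain values.
    let file = OS.File.open(path, { write: true }); // add create/truncate options as needed
    file.setPosition(offset, OS.File.POS_START);    // seek to the segment's offset
    file.write(bytes);                              // write the segment
    file.close();
    self.postMessage({ ok: true, offset: offset });
  } catch (ex) {
    // On failure the exception should be an OS.File.Error.
    self.postMessage({ ok: false, error: ex.toString() });
  }
};

On the main thread you would then create the worker with new ChromeWorker("huge_file_worker.js") and postMessage the segments to it, transferring each Uint8Array's buffer to avoid copying that much data between threads.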

How to process a GCS filepattern, full file at a time?

I need to process a (GCS) bucket of files, where each file is compressed and contains a single multi-line JSON record. Also, the name of the file being processed is significant and I need to know it within my transform.
Starting with the examples in the docs, TextIO looks pretty close, but it looks like it's designed to process each file line by line and does not allow me to read an entire file at once. Also, I don't see any way to get the name of the file being processed.
PCollectionTuple results = p.apply(TextIO.Read
.from("gs://bucket/a/*.gz")
.withCompressionType(TextIO.CompressionType.GZIP)
.withCoder(MyJsonCoder.of()));
Looks like I need to write a custom IO reader, or some such? Any tips for best place to start?
You are correct that right now none of the existing classes do exactly what you want. There are two reasonable approaches:
Match the filepattern yourself (using IOChannelUtils and IOChannelFactory) and wrap the resulting files into a PCollection<String> where the String will be a filename, using Create.of(filenames). Then apply a ParDo with a function which reads the given filename.
Write your own subclass of Source (there's also FileBasedSource, but it's not quite right for your use case). It would be configured by the filepattern, and splitIntoBundles would match the filepattern and expand into individual sources each corresponding to one file.
I would recommend the first approach because it seems like less code and your use case does not require the full power of Source; a rough sketch of it follows below.
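Here is a rough, untested sketch of the first approach, using the Dataflow 1.x SDK classes as I remember them (double-check the exact signatures). Your MyJsonCoder is left out; the DoFn simply emits (filename, contents) pairs so the JSON can be parsed downstream.

import com.google.cloud.dataflow.sdk.coders.StringUtf8Coder;
import com.google.cloud.dataflow.sdk.transforms.Create;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.util.IOChannelUtils;
import com.google.cloud.dataflow.sdk.values.KV;
import com.google.cloud.dataflow.sdk.values.PCollection;
import com.google.common.io.ByteStreams;
import java.io.InputStream;
import java.nio.channels.Channels;
import java.nio.charset.StandardCharsets;
import java.util.Collection;
import java.util.zip.GZIPInputStream;

// Inside your pipeline construction (e.g. in main()):
String pattern = "gs://bucket/a/*.gz";

// Expand the filepattern into concrete file names up front.
Collection<String> files = IOChannelUtils.getFactory(pattern).match(pattern);

PCollection<KV<String, String>> contents = p
    .apply(Create.of(files).withCoder(StringUtf8Coder.of()))
    .apply(ParDo.of(new DoFn<String, KV<String, String>>() {
      @Override
      public void processElement(ProcessContext c) throws Exception {
        String filename = c.element();
        // Open the file, decompress it and read it whole;
        // the filename travels alongside the contents in the KV.
        try (InputStream in = new GZIPInputStream(Channels.newInputStream(
                 IOChannelUtils.getFactory(filename).open(filename)))) {
          String json = new String(ByteStreams.toByteArray(in), StandardCharsets.UTF_8);
          c.output(KV.of(filename, json));
        }
      }
    }));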

Different coders for the same class in dataflow job

I'm trying to use different coders for the same class for two different scenarios:
Reading from JSON input files - using data = TextIO.Read.from(options.getInput()).withCoder(new Coder1())
Elsewhere in the job I want the class to be persisted using SerializableCoder, using data.setCoder(SerializableCoder.of(MyClass.class))
It works locally, but fails when run in the cloud with
Caused by: java.io.StreamCorruptedException: invalid stream header: 7B227365.
Is it a supported scenario? The reason to do this in the first place is to avoid reading/writing the JSON format, and on the other hand to make reading from the input files more efficient (UTF-8 parsing is part of the JSON reader, so it can read from an InputStream directly).
Clarifications:
Coder1 is my coder.
The other coder is a SerializableCoder.of(MyClass.class)
How does the system choose which coder to use? The two formats are binary-incompatible, and it looks like, due to some optimization, the second coder is used for data that can only be read by the first coder.
Yes, using two different coders like that should work. (With the caveat that the coder in #2 will only be used if the system chooses to persist 'data' instead of optimizing it into surrounding computations.)
Are you using your own Coders or ones provided by the Dataflow SDK? Quick caveat on TextIO -- because it uses newlines to encode element boundaries, you'll get into trouble if you use a coder that produces encoded values containing something that can be mistaken for a newline. You really should only use textual encodings within TextIO. We're hoping to make that clearer in the future.
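To make the intended setup concrete, this fragment mirrors the code from the question (Coder1 and MyClass are the question's own classes) and marks where each coder applies:

// Coder1 decodes each line read from the JSON input files.
PCollection<MyClass> data = p.apply(
    TextIO.Read.from(options.getInput()).withCoder(new Coder1()));

// The SerializableCoder is used only at points where the service actually
// materializes 'data', e.g. between stages that are not fused together.
data.setCoder(SerializableCoder.of(MyClass.class));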
