Best way to process a GCS file within Dataflow?

I have a PCollection of matched GCS filenames, each of which contains a single compressed JSON blob. What's the best way to read the entire file, decompress it (Gzip format), and JSON decode it?
TextIO is really close, but it reads data line by line.
The GCS API offers an example of how to read an entire file, but it doesn't handle decompression, and following it would lead me to reimplement a lot of core functionality.
Are there any existing APIs and/or examples that can give me a head start? Seems like this would be a pretty common use case.

This isn't natively supported in Dataflow. To read a whole JSON blob out of a file, you could implement FileBasedSource:
https://cloud.google.com/dataflow/java-sdk/JavaDoc/com/google/cloud/dataflow/sdk/io/FileBasedSource
If that's enough to get started, we can continue to update this answer with more information.
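In the meantime, if each file holds a single blob and splitting within a file isn't needed, a plain ParDo over the PCollection of matched filenames may be enough. Below is a minimal sketch, assuming the elements are full gs:// URIs; it uses the SDK's GcsUtil to open each object, with error handling omitted and the final JSON decoding (e.g. with Gson) left as a comment:

    import com.google.cloud.dataflow.sdk.options.GcsOptions;
    import com.google.cloud.dataflow.sdk.transforms.DoFn;
    import com.google.cloud.dataflow.sdk.util.gcsfs.GcsPath;

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.channels.Channels;
    import java.util.zip.GZIPInputStream;

    // Takes each matched "gs://bucket/object" filename, reads the whole object,
    // gunzips it, and emits the decompressed JSON text as a single element.
    class ReadGzippedJsonFn extends DoFn<String, String> {
      @Override
      public void processElement(ProcessContext c) throws Exception {
        GcsPath path = GcsPath.fromUri(c.element());
        GcsOptions options = c.getPipelineOptions().as(GcsOptions.class);
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
            new GZIPInputStream(
                Channels.newInputStream(options.getGcsUtil().open(path))),
            "UTF-8"))) {
          StringBuilder json = new StringBuilder();
          char[] buf = new char[8192];
          int n;
          while ((n = reader.read(buf)) != -1) {
            json.append(buf, 0, n);
          }
          // JSON-decode downstream, e.g. new Gson().fromJson(json.toString(), ...).
          c.output(json.toString());
        }
      }
    }

Unlike a FileBasedSource, a ParDo like this cannot split within a file, but that is fine here since each file is exactly one blob.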

Related

Google Spreadsheet: Parsing .PDF from Google Drive

I'm still getting into Google Spreadsheets; I recently understood how to format a .txt file so that =ImportData works properly, thanks to Tanaike's assistance, and now I'm tackling a slightly more challenging task.
Goal:
Automatically extracting specific data from .pdf files hosted inside a Google Drive folder and arranging the information into specific cells
Challenges:
Being able to decode the blobs of information, as just the raw data obtained with =ImportData is useless
Truly learning how to use google-apps-script for something useful (that's on my own)
Triggering a single, one-time extraction of the information, rather than the constant online refreshing that =ImportData does
[Second Priority] Stop Depending on an add-on (Drive Direct Links) to get the URL of the files
To my understanding, I'll need to do some parsing. I know .pdf parsing is not always straightforward, but all the files will come from the same place and have the exact same format, so understanding how to do it once should be enough.
I already know how to automatically get the real/permanent link to the files, and how to arrange the extracted information into separate cells using =Index, =Extract and others.
Hope I'm being clear enough. Thanks a lot in advance.
Best regards,
Lucas.-

How to store json file locally for quick access

Is there any way I can download a JSON file from a web server and store it in a local folder for easy access? This is for users with poor internet connections, so the data would be downloaded once and the user wouldn't have to suffer every time.
I found similar questions here1 and here2, but they were asked about Objective-C, and I'm looking for something in Swift. Thanks.
Yes, you can certainly do this. After you've read the remote JSON, it will be a Data object.1
Build a URL to a path in your app's caches directory and then use the Data method write(to:options:) to write that data into your file.
On read, check to see if the file exists in the caches directory before triggering a network read. Note that you need to be sure that the filenames you use are consistent and unique. (The same filename must always fetch the same unique data.)
1 Note that Mohammad has a good point. There are better ways of persisting your data than saving the raw JSON. Core Data is a pretty complex framework with a steep learning curve, but there are other options as well. You might look at conforming to the Codable protocol, which would let you serialize/deserialize your data objects in a variety of formats including JSON, property lists, and (shudder) XML.
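A minimal sketch of that read-through-cache pattern (shown in Java purely for illustration; cacheDir and the file name are hypothetical stand-ins for your app's caches directory and a stable, unique name, and in Swift you'd use FileManager plus Data's write(to:options:) as described above):

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Path;

    // Read-through cache: return the local copy if present, otherwise
    // fetch from the network and save it for next time.
    final class JsonCache {
      private final Path cacheDir; // stand-in for the app's caches directory

      JsonCache(Path cacheDir) {
        this.cacheDir = cacheDir;
      }

      byte[] load(String fileName, URL remote) throws IOException {
        Path cached = cacheDir.resolve(fileName); // name must be stable and unique
        if (Files.exists(cached)) {
          return Files.readAllBytes(cached); // cache hit: no network needed
        }
        byte[] data;
        try (InputStream in = remote.openStream()) { // cache miss: download
          data = in.readAllBytes();
        }
        Files.write(cached, data); // persist for the next run
        return data;
      }
    }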
Yes, you can create a .json file and store it in the Documents folder. First see how to create a .json file, and then see how to store a file in the Documents folder.

Google Cloud Dataflow (Apache Beam) - how to process gzipped csv files with a header?

I have CSV files (gzip-compressed) in GCS. I want to read these files and send their data to BigQuery.
The header info can change (although I know all the columns in advance), so just dropping the header is not enough: somehow I need to read the first line and attach the column info to every remaining line.
How can I do this?
My first thought was that I'd have to implement a custom source, as in this post:
Reading CSV header with Dataflow
But with this solution, I'm not sure how I can decompress the gzip first. Can I somehow use withCompressionType, as with TextIO?
(I found a compression_type parameter in a Python class, but I'm using Java and could not find a similar one in the Java FileBasedSource class.)
Also, I feel this is a bit of overkill, because it makes the file unsplittable (although in my case that's okay).
Or I could use GoogleCloudStorage directly, read the file and its first line up front in my main() function, and then proceed to the pipeline.
But that is also cumbersome, so I want to confirm: is there a best practice (the Dataflow way) for reading a CSV file while making use of its header?
If I understand correctly what you are trying to accomplish, SideInput (doc, example) is likely the answer here. It will allow the header to be available for processing alongside every line of the file.
The general idea is to emit the header as a separate PCollectionView and use it as a SideInput to your per-line processing. You can achieve this in a single pass over your file using SideOutput (doc).
If I am reading your question correctly, it sounds like your header contents vary from file to file. If so, you can use View.asMap to keep a map of headers from each file. Unfortunately, keeping track of the current filename being read is not natively supported at the moment, but there are workarounds discussed in this post.
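As a minimal sketch against the Dataflow Java SDK, where p is your Pipeline (the bucket path and the "id," header predicate are hypothetical placeholders, and this assumes all matched files share one header; per-file headers would need the View.asMap approach above):

    import com.google.api.services.bigquery.model.TableRow;
    import com.google.cloud.dataflow.sdk.io.TextIO;
    import com.google.cloud.dataflow.sdk.transforms.DoFn;
    import com.google.cloud.dataflow.sdk.transforms.Filter;
    import com.google.cloud.dataflow.sdk.transforms.ParDo;
    import com.google.cloud.dataflow.sdk.transforms.SerializableFunction;
    import com.google.cloud.dataflow.sdk.transforms.View;
    import com.google.cloud.dataflow.sdk.values.PCollection;
    import com.google.cloud.dataflow.sdk.values.PCollectionView;

    // TextIO handles the gzip decompression; no custom source needed.
    PCollection<String> lines = p.apply(TextIO.Read
        .from("gs://my-bucket/data/*.csv.gz")
        .withCompressionType(TextIO.CompressionType.GZIP));

    // Pick out the header line by its shape; "id," stands in for a column
    // name known in advance.
    final PCollectionView<String> headerView = lines
        .apply(Filter.byPredicate(new SerializableFunction<String, Boolean>() {
          @Override
          public Boolean apply(String line) {
            return line.startsWith("id,");
          }
        }))
        .apply(View.<String>asSingleton());

    // Zip every data line with the header taken from the side input.
    PCollection<TableRow> rows = lines.apply(ParDo
        .withSideInputs(headerView)
        .of(new DoFn<String, TableRow>() {
          @Override
          public void processElement(ProcessContext c) {
            String header = c.sideInput(headerView);
            String line = c.element();
            if (line.equals(header)) {
              return; // drop the header row itself
            }
            String[] names = header.split(",");
            String[] values = line.split(","); // naive split; no quoted fields
            TableRow row = new TableRow();
            for (int i = 0; i < names.length && i < values.length; i++) {
              row.set(names[i], values[i]);
            }
            c.output(row);
          }
        }));

The rows collection can then go straight to BigQueryIO. Because TextIO does the decompression, the gzip question goes away entirely; the trade-off is that gzipped files are read unsplittably anyway, which you said is acceptable in your case.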

efficient and flexible binary data parsing

I have an external device that spits out UDP packets of binary data, and software running on an embedded system that needs to read this data stream, parse it, and do something useful with it. The binary data gets logged to a file as well. I would like to write a parser that can take its input directly from either the UDP stream or a file, parse the data into a specific format, and then direct the output either to a file (e.g. a MATLAB .dat file) or to another process that will do some real-time processing. Are there any resources that would help me with this, and what is the best way to go about it? I think it might make sense to use C++ streams, but I'm not familiar with creating custom output streams. Does this seem like a good approach, or is there a better way to go about it?
Thanks.
The beauty of binary data is that it is generally in a very fixed format.
A typical method of parsing it is to declare a structure that maps onto the received packets, and then to just use type-casts to read the fields as structure elements.
The advantage is that this requires no parsing at all.
You have to be careful about structure packing rules and endianness to make the structure map in exactly the same way. The C offsetof and sizeof macros are useful for emitting debug info to check that your structure really does map to what you think it does.
Packing rules can typically be altered either by directives (such as #pragmas) or by command-line options. Endianness you are stuck with: if it is different from what your embedded system uses, declare all the fields as bytes, or use something like the ntoh family of functions to do the byte swapping.
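The same fixed-layout idea can also be expressed without casts in languages that expose an explicit byte order. A minimal Java sketch over a hypothetical packet layout (u32 sequence, u16 sensor id, f32 reading), where ByteBuffer's explicit byte order replaces the C packing and endianness concerns:

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    // Hypothetical fixed packet layout: u32 sequence, u16 sensor id, f32 reading.
    final class SensorPacket {
      final long sequence; // unsigned 32-bit field, widened to long
      final int sensorId;  // unsigned 16-bit field, widened to int
      final float reading;

      private SensorPacket(long sequence, int sensorId, float reading) {
        this.sequence = sequence;
        this.sensorId = sensorId;
        this.reading = reading;
      }

      static SensorPacket parse(byte[] raw) {
        // Network byte order made explicit; no #pragma pack equivalent needed.
        ByteBuffer buf = ByteBuffer.wrap(raw).order(ByteOrder.BIG_ENDIAN);
        long sequence = buf.getInt() & 0xFFFFFFFFL; // mask to read as unsigned
        int sensorId = buf.getShort() & 0xFFFF;
        float reading = buf.getFloat();
        return new SensorPacket(sequence, sensorId, reading);
      }
    }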
The New Jersey Machine Code Toolkit is a scheme for decoding arbitrary binary patterns. It was originally designed for decoding instruction sets, but it ought to work just fine for decoding message formats. You provide a description of the binary format, and it synthesizes code to access the fields of that format (when valid). Thus you can refer to message fields using generated function calls rather than thinking about where the field is or how it is encoded.

What are the differences or advantages of using a binary file vs XML with TClientDataSet?

Are there any differences or advantages to using a binary file or an XML file with TClientDataSet?
Binary will be smaller and faster.
XML will be more portable and human readable.
The Binary file will be a little smaller.
The main advantage of the XML format is that you can pass it around via http(s) protocols.
Binary is smaller and faster, but only readable by TClientDataSets.
XML is larger and slower (both are not that bad, i.e. not by orders of magnitude bigger or slower).
XML is readable by people (not recommended in general, but it is doable), and software.
Therefore it is more portable (as Nick wrote).
TClientDataSets can load and save their own style of XML, or you can use the Delphi XML Mapper tool to read and write any kind of XML.
XSLT can for instance be used to transform those XML files into any kind of text, including other XML, HTML, CSV, fixed columns, etc.
In contrast to what Tim indicates, both binary and XML can be transferred over HTTP and HTTPS. However, sending XML is often appreciated because it is easier to trace.
Without having tested it: I guess the binary format would be quite a lot faster when reading and writing. You'd better do your own benchmarks for that, though.
Another advantage of binary might be that it cannot be easily edited, which prevents people from mucking up the data outside the application.
When using Delphi 2009, we have noticed that if the file has an extension of .XML, it will not save in binary format over an existing dfXMLUTF8-format file, even with LoadFromFile/SaveToFile. Changing the file extension to something else (.DAT, for example) allows saving the file as dfBinary. Our experience is that the binary file, in addition to being somewhat more difficult for the end user to manipulate (a plus!), is approximately 50% smaller than the dfXMLUTF8-format file.
