google cloud dataflow read data from compressed data - google-cloud-dataflow

I'm trying to use Google Cloud Dataflow to read data from GCS and load it into BigQuery tables; however, the files in GCS are compressed (gzip). Is there any class that can be used to read data from compressed/gzipped files?

Reading from compressed text sources is now supported in Dataflow (as of this commit). Specifically, files compressed with gzip and bzip2 can be read from by specifying the compression type:
TextIO.Read.from(myFileName).withCompressionType(TextIO.CompressionType.GZIP)
However, if the file has a .gz or .bz2 extension, you don't have to do anything: the default compression type is AUTO, which examines file extensions to determine the correct compression type for a file. This even works with globs, where the files that result from the glob may be a mix of .gz, .bz2, and uncompressed.
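For the original question of loading gzipped GCS files into BigQuery, a minimal end-to-end sketch in the Dataflow Java SDK (1.x style) could look like the following. The bucket path, table name, schema, and the assumed two-column "name,score" line format are all placeholders for illustration:

import java.util.Arrays;
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;

public class GzipGcsToBigQuery {

  // Hypothetical parser: assumes each line looks like "name,score".
  static class LineToRowFn extends DoFn<String, TableRow> {
    @Override
    public void processElement(ProcessContext c) {
      String[] parts = c.element().split(",", 2);
      c.output(new TableRow().set("name", parts[0]).set("score", parts[1]));
    }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    TableSchema schema = new TableSchema().setFields(Arrays.asList(
        new TableFieldSchema().setName("name").setType("STRING"),
        new TableFieldSchema().setName("score").setType("STRING")));

    p.apply(TextIO.Read.from("gs://my-bucket/input/*.gz")            // placeholder path
            .withCompressionType(TextIO.CompressionType.GZIP))        // AUTO also works for .gz files
     .apply(ParDo.of(new LineToRowFn()))
     .apply(BigQueryIO.Write.to("my-project:my_dataset.my_table")     // placeholder table
            .withSchema(schema)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    p.run();
  }
}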

Related

how to handle the data encoding issue while copying the data from CSV file to parquet using Azure copy activity?

I have a CSV file that I want to convert to Parquet. The CSV file contains the value Querý in one column.
I am using a Copy activity in Azure Data Factory to convert it to Parquet, but I get the value as Queryý. I can't find any encoding option in the sink. I have seen some documentation, but everything talks about the encoding of the CSV file. Could someone help with this?
There is no way to set the encoding of Parquet in Azure Data Factory.
I created a pipeline to test this and it works fine.
Here are some suggestions to help you troubleshoot:
Make sure the encoding of your CSV file is correct (one way to re-encode it is sketched after this list).
Make sure the schema of your Parquet file is correct.
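On the first point, one way to rule out a source-encoding problem before the copy runs (this happens outside Data Factory itself) is to re-encode the CSV to UTF-8. A minimal sketch, where the file names and the assumed Windows-1252 source encoding are placeholders:

import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class ReencodeCsv {
  public static void main(String[] args) throws IOException {
    // Assumed source encoding; adjust to however the CSV was actually exported.
    Charset source = Charset.forName("windows-1252");
    List<String> lines = Files.readAllLines(Paths.get("input.csv"), source);
    // Write the same lines back out as UTF-8 so characters such as ý survive the copy.
    Files.write(Paths.get("input-utf8.csv"), lines, StandardCharsets.UTF_8);
  }
}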

Create a Zip File in Memory ios swift?

We store images in an encrypted format in a local path. Once all the documents are captured and the user taps the submit button, we decrypt all the images using RNCryptor (https://github.com/RNCryptor/RNCryptor) and save them as a zip using https://github.com/marmelroy/Zip.
But we need to keep the decrypted data in memory instead of on disk.
How would I zip a file so I could send it without writing to the hard drive, doing it purely in memory?
Update
Another alternative is the ZIPFoundation library on Github (MIT/Thomas Zoechling). It appears to be Swift compatible and is apparently "effortless." BTW - I learned about this library while reading an interesting blog article where the author (Max Desiatov) walks through how he unzips in memory using the library (see the section - Unzipping an archive in memory and parsing the contents).
Original
Have you taken a close look at the Single-Step Compression article? There is a section that talks about writing the compressed data to a file (but the data has already been compressed in memory at that point). Once you have the compressed data, I guess you can do with it as you will...
Article Steps
Create the Source Data
Create the Destination Buffer
Select a Compression Algorithm
Compress the Data
Write the Encoded Data to a File
Read the Encoded Data from a File
Decompress the Data

Is it possible to read non-text files into a google dataflow pipeline?

I would like to read pdf files into the pipeline. However, I haven't found any apache beam example regarding file formats other than plain text or xml.
There is no pre-existing PDF reader available in Dataflow or Apache Beam libraries. However, you could use the example of this reader for TensorFlow records as a model to write your own using the PDF parsing library of your choice.
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TFRecordIO.java
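If you don't want to build a full custom source modeled on TFRecordIO, a simpler (though less scalable, since each file is read fully into memory) sketch is to match the files with Beam's FileIO and parse each one in a DoFn. The PDF-parsing library here is Apache PDFBox, which is an assumption, as is the bucket path:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class ReadPdfs {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply(FileIO.match().filepattern("gs://my-bucket/pdfs/*.pdf"))  // placeholder pattern
     .apply(FileIO.readMatches())
     .apply(ParDo.of(new DoFn<FileIO.ReadableFile, String>() {
        @ProcessElement
        public void processElement(ProcessContext c) throws Exception {
          // Pull the whole PDF into memory and extract its text with PDFBox (2.x API).
          byte[] contents = c.element().readFullyAsBytes();
          try (PDDocument doc = PDDocument.load(contents)) {
            c.output(new PDFTextStripper().getText(doc));
          }
        }
     }));

    p.run();
  }
}

The output here is one string of extracted text per PDF; downstream transforms can then process it however they need.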

Reading from compressed files in Dataflow

Is there a way (or any kind of hack) to read input data from compressed files?
My input consists of a few hundred files, which are produced gzip-compressed, and decompressing them beforehand is somewhat tedious.
Reading from compressed text sources is now supported in Dataflow (as of this commit). Specifically, files compressed with gzip and bzip2 can be read from by specifying the compression type:
TextIO.Read.from(myFileName).withCompressionType(TextIO.CompressionType.GZIP)
However, if the file has a .gz or .bz2 extension, you don't have to do anything: the default compression type is AUTO, which examines file extensions to determine the correct compression type for a file. This even works with globs, where the files that result from the glob may be a mix of .gz, .bz2, and uncompressed.
The slower performance with my workaround was most likely because Dataflow was putting most of the files in the same split, so they weren't being processed in parallel. You can try the following to speed things up (a sketch follows the list).
Create a PCollection for each file by applying the Create transform multiple times (each time to a single file).
Use the Flatten transform to create a single PCollection containing all the files from PCollections representing individual files.
Apply your pipeline to this PCollection.
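A rough sketch of that Create + Flatten pattern in the Dataflow Java SDK (1.x style) is below. The file paths are placeholders, and ReadAndDecompressFn is a hypothetical stand-in for however your existing workaround opens and gunzips each file:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.GZIPInputStream;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.Create;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.Flatten;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.values.PCollection;
import com.google.cloud.dataflow.sdk.values.PCollectionList;

public class CreateFlattenWorkaround {

  // Hypothetical DoFn: takes a file path, gunzips it, and emits one line at a time.
  static class ReadAndDecompressFn extends DoFn<String, String> {
    @Override
    public void processElement(ProcessContext c) throws Exception {
      try (BufferedReader reader = new BufferedReader(new InputStreamReader(
          new GZIPInputStream(new FileInputStream(c.element())), StandardCharsets.UTF_8))) {
        String line;
        while ((line = reader.readLine()) != null) {
          c.output(line);
        }
      }
    }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    List<String> files = Arrays.asList("/data/part-0001.gz", "/data/part-0002.gz");  // placeholders

    // Step 1: one single-element PCollection per file.
    List<PCollection<String>> perFile = new ArrayList<>();
    for (String file : files) {
      perFile.add(p.apply(Create.of(file)));
    }

    // Step 2: flatten them into a single PCollection of file names.
    // Step 3: apply the rest of the pipeline, starting with the per-file read.
    PCollection<String> lines = PCollectionList.of(perFile)
        .apply(Flatten.<String>pCollections())
        .apply(ParDo.of(new ReadAndDecompressFn()));

    // ... continue with the rest of your pipeline on `lines` ...
    p.run();
  }
}

The point of the per-file Create followed by Flatten, per the advice above, is that the per-file reads are no longer packed into a single split and can be handed to different workers.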
I also found that for files that reside in the cloud store, setting the content type and content encoding appears to "just work" without the need for a workaround.
Specifically - I run
gsutil -m setmeta -h "Content-Encoding:gzip" -h "Content-Type:text/plain" <path>
I just noticed that specifying the compression type is now available in the latest version of the SDK (v0.3.150210). I've tested it, and was able to load my GZ files directly from GCS to BQ without any problems.

File format of spool files with .tmp extensions?

In many Windows setups, when you print directly to a printer, two files are typically created in the Windows spool directory "C:\Windows\System32\spool\PRINTERS". A spool file "80021.SPL" and a shadow file "80021.SHD" are examples of these files. The spool file contains the meat and potatoes of the drawing instructions so the printer can print the page. The data in this spool file comes in a smorgasbord of different formats depending on the language technology and the print driver used.
However, when you are printing to a printer that's on a print server, a single ".TMP" file is created instead and gets transmitted to the print server. I think it's fair to assume that this is just the .SHD and .SPL files combined into a single transport file to get it to the server. However, it's unreadable; I'm not sure if it's zipped, encrypted, or what, but I can't decipher it.
When printing PDFs you can typically see plain-text PostScript instructions in the spool file (.SPL) just by opening it in a text editor. You can even send that spool file (.SPL) to a PostScript viewer like GhostScript and have it draw the pages on screen. But when the job is all packaged up in a .TMP file, it's basically just a binary pile of bits. Does anyone know how to uncompress the data from these transport .TMP spool files?
I believe the file you have will be an EMF file padded with a proprietary MS structure at the beginning. The easiest way to find out whether you are dealing with an EMF structure is to look for the ANSI characters ' EMF' in the .TMP file you have.
Assuming you do find these characters, it is just a matter of removing the proprietary structure data from the beginning of the file and then treating it as a standard EMF file. Fortunately, all EMF files have a standard header format, so it should be reasonably easy to determine where the EMF file starts.
There is a good description of EMF file headers here
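A hedged sketch of that search in Java, assuming the standard ENHMETAHEADER layout, in which the ' EMF' signature (ENHMETA_SIGNATURE, 0x464D4520, stored little-endian) sits 40 bytes into the header:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

public class ExtractEmf {
  // " EMF" as it appears on disk (the little-endian bytes of 0x464D4520).
  private static final byte[] SIGNATURE = {0x20, 0x45, 0x4D, 0x46};
  // dSignature comes after iType + nSize + rclBounds + rclFrame = 4 + 4 + 16 + 16 bytes.
  private static final int SIGNATURE_OFFSET = 40;

  public static void main(String[] args) throws IOException {
    byte[] data = Files.readAllBytes(Paths.get(args[0]));  // e.g. the .TMP spool file
    for (int i = SIGNATURE_OFFSET; i <= data.length - SIGNATURE.length; i++) {
      if (data[i] == SIGNATURE[0] && data[i + 1] == SIGNATURE[1]
          && data[i + 2] == SIGNATURE[2] && data[i + 3] == SIGNATURE[3]) {
        // Back up to the start of the EMF header and write everything from there out.
        int start = i - SIGNATURE_OFFSET;
        Files.write(Paths.get(args[0] + ".emf"), Arrays.copyOfRange(data, start, data.length));
        System.out.println("EMF signature found; header starts at offset " + start);
        return;
      }
    }
    System.out.println("No EMF signature found.");
  }
}

If the signature is found, the header's nBytes field gives the exact length of the metafile, should you need to trim any trailing bytes that the proprietary wrapper appends.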
