Reading from compressed files in Dataflow - google-cloud-dataflow

Is there a way (or any kind of hack) to read input data from compressed files?
My input consists of a few hundred files that are produced gzip-compressed, and decompressing them before each run is somewhat tedious.

Reading from compressed text sources is now supported in Dataflow (as of this commit). Specifically, files compressed with gzip and bzip2 can be read by specifying the compression type:
TextIO.Read.from(myFileName).withCompressionType(TextIO.CompressionType.GZIP)
However, if the file has a .gz or .bz2 extension, you don't have to do anything: the default compression type is AUTO, which examines file extensions to determine the correct compression type for each file. This even works with globs, where the files matching the glob may be a mix of .gz, .bz2, and uncompressed files.
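For concreteness, a minimal sketch with the Dataflow Java SDK, assuming a PipelineOptions object named options and made-up GCS paths:

Pipeline p = Pipeline.create(options);
// Explicit compression type, e.g. for a file without a telltale extension:
p.apply(TextIO.Read.from("gs://my-bucket/logs/part-0001")
    .withCompressionType(TextIO.CompressionType.GZIP));
// AUTO (the default): the extension decides per file, so one glob can
// cover a mix of .gz, .bz2, and uncompressed files:
p.apply(TextIO.Read.from("gs://my-bucket/logs/*"));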

The slower performance with my workaround was most likely because Dataflow was putting most of the files in the same split, so they weren't being processed in parallel. You can try the following to speed things up (see the sketch after this list).
Create a PCollection for each file by applying the Create transform multiple times (each time to a single file).
Use the Flatten transform to merge the per-file PCollections into a single PCollection containing all the files.
Apply your pipeline to this PCollection.
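A hedged sketch of that fan-out, where ReadGzipFn is a hypothetical user-supplied DoFn<String, String> that opens the named file and emits its decompressed lines:

Pipeline p = Pipeline.create(options);
PCollectionList<String> parts = PCollectionList.empty(p);
for (String file : inputFiles) {
  PCollection<String> lines = p
      .apply(Create.of(file))              // one-element PCollection holding the file name
      .apply(ParDo.of(new ReadGzipFn()));  // hypothetical DoFn: file name -> decompressed lines
  parts = parts.and(lines);
}
PCollection<String> allLines = parts.apply(Flatten.<String>pCollections());
// ...apply the rest of the pipeline to allLines.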

I also found that for files residing in Cloud Storage, setting the content type and content encoding appears to "just work" without the need for a workaround.
Specifically, I run:
gsutil -m setmeta -h "Content-Encoding:gzip" -h "Content-Type:text/plain" <path>

I just noticed that specifying the compression type is now available in the latest version of the SDK (v0.3.150210). I've tested it and was able to load my GZ files directly from GCS to BQ without any problems.

Related

How to describe logical paths that include looking inside zip and tar.gz files?

I want to build a list of all the files on a large disk, including files that are inside container formats like tar and zip files.
So I'd like to know whether there already exists a notation for describing what you might call "logical" filesystem paths, that is, paths that include looking inside zip and tar.gz files.
For example, if I had a directory named a.dir containing the file b.zip, and that file was a compressed version of b.txt, then I'd imagine a notation for the location of b.txt that would look something like
a.dir/b.zip/b.txt
However, I am not sure if that would always work, and it doesn't really tell you that b.zip is a zip file.
I am looking for a simple syntax that will identify common compression/container formats (zip, tar, tar.gz, tar.bz2, etc.), handle nested compressed files, and handle compressed files with absolute or relative paths. Paths must uniquely identify one file (there can be no ambiguity), and the syntax should work across a range of filesystems. Identifying symlinks would be a bonus.
I am NOT looking for a syntax that unix commands would understand or that programs would be able to open directly. The syntax does not need to explain how to access those files. (However, these are the obvious next steps.)
Thanks.
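One notation that already exists, at least for a single zip level, is Java's jar: URL scheme, which marks the container boundary explicitly with !/ instead of a bare slash. A small illustration using the question's example names (java.net.URL, java.io.InputStream):

URL url = new URL("jar:file:/a.dir/b.zip!/b.txt");
try (InputStream in = url.openStream()) {
    // reads b.txt out of b.zip; nested containers are not supported by this scheme
}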

google cloud dataflow: read data from compressed files

I'm trying to use Google Cloud Dataflow to read data from GCS and load it into BigQuery tables; however, the files in GCS are compressed (gzip). Is there any class that can be used to read data from compressed/gzipped files?
Reading from compressed text sources is now supported in Dataflow (as of this commit). Specifically, files compressed with gzip and bzip2 can be read by specifying the compression type:
TextIO.Read.from(myFileName).withCompressionType(TextIO.CompressionType.GZIP)
However, if the file has a .gz or .bz2 extension, you don't have to do anything: the default compression type is AUTO, which examines file extensions to determine the correct compression type for each file. This even works with globs, where the files matching the glob may be a mix of .gz, .bz2, and uncompressed files.
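A rough sketch of such a pipeline with the Dataflow Java SDK, assuming one STRING column per input line; the bucket, project, dataset, and table names are invented:

Pipeline p = Pipeline.create(options);
p.apply(TextIO.Read.from("gs://my-bucket/input/*.gz"))   // AUTO recognizes the .gz extension
 .apply(ParDo.of(new DoFn<String, TableRow>() {
   @Override
   public void processElement(ProcessContext c) {
     c.output(new TableRow().set("line", c.element()));  // one row per text line
   }
 }))
 .apply(BigQueryIO.Write.to("my-project:my_dataset.my_table")
     .withSchema(new TableSchema().setFields(Arrays.asList(
         new TableFieldSchema().setName("line").setType("STRING")))));
p.run();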

Download Directory and Contents

Is it possible to persuade the Stream result to download an entire directory and its contents? And if so, how? I have no problem getting it to download individual files, but I need to download a series of files that must be in a specific directory structure.
I don't think so.
The Stream result lets you download ONE piece of content, with its MIME type, its name, etc.
That makes it impossible to handle many files with different names and content types.
What you can do is:
Render the list of files in a JSP (as anchor tags, for example), each targeting the Action that will download that single file;
Call multiple Actions via scripting, opening a new page (target="_blank") for every file you have (dangerous, annoying, almost useless...);
Create a zip on the server side with Java, containing all your files and directories, then output the zip with the Stream result.
I think you should consider the third option, sketched below.
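A minimal sketch of that third option with plain java.nio and java.util.zip; the Struts wiring (exposing the result via the InputStream property your Stream result points at) is omitted:

import java.io.*;
import java.nio.file.*;
import java.util.stream.Stream;
import java.util.zip.*;

static void zipDirectory(Path dir, OutputStream out) throws IOException {
  try (ZipOutputStream zip = new ZipOutputStream(out);
       Stream<Path> files = Files.walk(dir)) {
    files.filter(Files::isRegularFile).forEach(file -> {
      try {
        // Relative entry names preserve the directory structure inside the zip.
        zip.putNextEntry(new ZipEntry(dir.relativize(file).toString()));
        Files.copy(file, zip);
        zip.closeEntry();
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    });
  }
}

For large trees, stream the zip straight to the response (or a piped stream) rather than buffering the whole archive in memory.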

File format of spool files with .tmp extensions?

In many Windows setups, when you print directly to a printer, two files are typically created in the Windows spool directory "C:\Windows\System32\spool\PRINTERS": a spool file (e.g. "80021.SPL") and a shadow file (e.g. "80021.SHD"). The spool file contains the meat and potatoes of the drawing instructions so the printer can print the page. The data in this spool file comes in a smorgasbord of different formats depending on the language technology and the print driver used.
However, when you are printing to a printer that's on a print server, a single ".TMP" file is created instead and gets transmitted to the print server. I think it's fair to assume that this is just the .SHD and .SPL files combined into a single transport file to get the job to the server. However, it's unreadable; I'm not sure if it's zipped, encrypted, or what, but I can't decipher it.
When printing PDFs you can typically see plain-text PostScript instructions in the spool file (.SPL) just by opening it in a text editor. You can even send that spool file (.SPL) to a PostScript viewer like Ghostscript and have it draw the pages on screen. But when the job is all packaged up in a .TMP file, it's basically just a binary pile of bits. Does anyone know how to uncompress the data from these transport .TMP spool files?
I believe the file you have will be an EMF file padded with a proprietary MS structure at the beginning. The easiest way to find out whether you are dealing with an EMF structure is to look for the ANSI characters ' EMF' in the .tmp file you have.
Assuming you do find those characters, it is just a matter of removing the proprietary structure data from the beginning of the file and then treating it as a standard EMF file. Fortunately, all EMF files have a standard header format, so it should be reasonably straightforward to determine where the EMF file starts.
There is a good description of EMF file headers here.
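A hedged sketch of that check in Java: the dSignature field of a standard ENHMETAHEADER (' EMF', bytes 0x20 0x45 0x4D 0x46) sits 40 bytes into the header, so once the signature is found, everything from 40 bytes before it should be the EMF stream. File names are invented:

byte[] data = Files.readAllBytes(Paths.get("80021.tmp"));
for (int i = 40; i + 4 <= data.length; i++) {
  if (data[i] == ' ' && data[i + 1] == 'E' && data[i + 2] == 'M' && data[i + 3] == 'F') {
    // dSignature is at offset 40 of the header, so back up 40 bytes
    // to the start of the EMF stream and save the rest of the file.
    Files.write(Paths.get("80021.emf"), Arrays.copyOfRange(data, i - 40, data.length));
    break;
  }
}

(Uses java.nio.file.Files, java.nio.file.Paths, and java.util.Arrays.)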

how to separate files from an uncompressed zlib stream

I have buffered a zlib stream in a std::vector.
I have programmatically uncompressed this stream into a new std::vector named "UncompressedZlibStream".
I know that my stream contains multiple files.
I would like to know how to "cut" (separate) the individual files in the stream.
I think zlib uses a separator, but which character or sequence?
Does anyone have any information about this?
Thanks a lot,
best regards,
CrashOverHead
Zlib itself is only a compression library, and it is typically used to compress a single file. Putting multiple files into zlib requires that you use a container format like tar and then compress the result. Zlib-compressed tar files are pretty common in the Unix world. You may want to have a look at LibTar. If it's anything else, it's likely proprietary and you're pretty much on your own for how to dice the stream.
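To make the tar layering concrete, here is a sketch in Java (the asker's code is C++/zlib, so this is illustrative only), using java.util.zip.GZIPInputStream plus Apache Commons Compress for the tar layer:

try (TarArchiveInputStream tar = new TarArchiveInputStream(
        new GZIPInputStream(new FileInputStream("bundle.tar.gz")))) {
  TarArchiveEntry entry;
  while ((entry = tar.getNextTarEntry()) != null) {
    // Names and boundaries come from the tar layer; zlib/gzip itself
    // has no notion of multiple files and no separator character.
    System.out.println(entry.getName() + " (" + entry.getSize() + " bytes)");
  }
}

(Uses org.apache.commons.compress.archivers.tar.TarArchiveInputStream and TarArchiveEntry.)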
