Any recommendations for a workflow/serverless solution to decompress the Avro files before ingestion?
You can specify the "apacheavro" format, which supports Avro with Snappy compression.
I have a CSV file that I want to convert to Parquet; the CSV file contains the value Querý in one column.
I am using the copy activity in Azure Data Factory to convert to Parquet, but I get the value as Queryý. I can't find any encoding option in the sink. I have seen some documentation, but it all talks about the CSV file's encoding. Could someone help with this?
There is no way to set the encoding of Parquet in Azure Data Factory.
I created a pipeline to test this, and it works fine.
Here is some advice for troubleshooting:
Make sure the encoding of your CSV file is correct (see the sketch after this list).
Make sure your Parquet schema is correct.
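The doubled character in Queryý is typical of mixed encodings: a file written in one charset being read in another. If the source CSV is not UTF-8, re-encoding it before the copy activity usually fixes this. Here is a minimal Java sketch; the file names and the assumed source charset (windows-1252) are hypothetical:

import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ReencodeCsv {
    public static void main(String[] args) throws IOException {
        Path in = Paths.get("input.csv");       // hypothetical source file
        Path out = Paths.get("input-utf8.csv"); // re-encoded copy for ADF
        byte[] raw = Files.readAllBytes(in);
        // Assumption: the source was written as windows-1252.
        String text = new String(raw, Charset.forName("windows-1252"));
        Files.write(out, text.getBytes(StandardCharsets.UTF_8));
    }
}

Point the ADF source dataset at the UTF-8 copy and run the copy activity again.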
I would like to read PDF files into the pipeline. However, I haven't found any Apache Beam example covering file formats other than plain text or XML.
There is no pre-existing PDF reader available in Dataflow or the Apache Beam libraries. However, you could use this reader for TensorFlow records as a model for writing your own with the PDF parsing library of your choice:
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TFRecordIO.java
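If each PDF can be processed as a whole file, a plain ParDo is often simpler than writing a full custom source. Below is a minimal sketch using Beam's FileIO together with Apache PDFBox (2.x); the library choice and the bucket path are my assumptions, not part of the original answer:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfToText {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    p.apply(FileIO.match().filepattern("gs://my-bucket/pdfs/*.pdf")) // hypothetical path
     .apply(FileIO.readMatches())
     .apply(ParDo.of(new DoFn<FileIO.ReadableFile, KV<String, String>>() {
       @ProcessElement
       public void process(ProcessContext c) throws Exception {
         FileIO.ReadableFile f = c.element();
         // PDFBox 2.x API: load the whole file, then extract its text.
         try (PDDocument doc = PDDocument.load(f.readFullyAsBytes())) {
           c.output(KV.of(f.getMetadata().resourceId().toString(),
                          new PDFTextStripper().getText(doc)));
         }
       }
     }));
    p.run().waitUntilFinish();
  }
}

A custom FileBasedSource along the lines of TFRecordIO is worth the effort when a single file must be split across workers; for whole-file formats like PDF, one file per element is usually enough.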
I'm trying to use Google Cloud Dataflow to read data from GCS and load it into BigQuery tables; however, the files in GCS are compressed (gzip). Is there any class that can be used to read data from compressed/gzipped files?
Reading from compressed text sources is now supported in Dataflow (as of this commit). Specifically, files compressed with gzip and bzip2 can be read by specifying the compression type:
TextIO.Read.from(myFileName).withCompressionType(TextIO.CompressionType.GZIP)
However, if the file has a .gz or .bz2 extension, you don't have to do anything: the default compression type is AUTO, which examines file extensions to determine the correct compression type for a file. This even works with globs, where the files matched by the glob may be a mix of .gz, .bz2, and uncompressed.
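For readers on a current Apache Beam SDK, the same switch now lives on TextIO.read(). A minimal sketch; the file patterns are assumptions:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.Compression;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class ReadGzipped {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    p.apply(TextIO.read()
        .from("gs://my-bucket/input/*.gz")        // hypothetical glob
        .withCompression(Compression.GZIP))       // Compression.AUTO is the default
     .apply(TextIO.write().to("gs://my-bucket/output/lines")); // write the lines back out
    p.run().waitUntilFinish();
  }
}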
Is there a way (or any kind of hack) to read input data from compressed files?
My input consists of a few hundred files, which are produced compressed with gzip, and decompressing them beforehand is somewhat tedious.
The slower performance with my workaround was most likely because Dataflow was putting most of the files in the same split, so they weren't being processed in parallel. You can try the following to speed things up (see the sketch after this list):
Create a PCollection for each file by applying the Create transform multiple times, each time to a single file.
Use the Flatten transform to create a single PCollection from the per-file PCollections.
Apply your pipeline to this combined PCollection.
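Here is a minimal sketch of that fan-out, written against the current Beam API; it uses one TextIO read per file in place of the Create transform, and the file names are assumptions:

import java.util.Arrays;
import java.util.List;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;

public class PerFileFlatten {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    List<String> files = Arrays.asList(           // hypothetical file names
        "gs://my-bucket/a.gz", "gs://my-bucket/b.gz");
    PCollectionList<String> parts = PCollectionList.empty(p);
    for (String f : files) {
      // One read per file, so no single split holds most of the input.
      parts = parts.and(p.apply("Read " + f, TextIO.read().from(f)));
    }
    PCollection<String> all = parts.apply(Flatten.pCollections());
    all.apply(TextIO.write().to("gs://my-bucket/out/merged")); // the rest of the pipeline goes here
    p.run().waitUntilFinish();
  }
}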
I also found that for files residing in Cloud Storage, setting the content type and content encoding appears to "just work" without the need for a workaround: with Content-Encoding: gzip set, GCS serves the object with decompressive transcoding, so readers see plain text.
Specifically, I run:
gsutil -m setmeta -h "Content-Encoding:gzip" -h "Content-Type:text/plain" <path>
I just noticed that specifying the compression type is now available in the latest version of the SDK (v0.3.150210). I've tested it, and was able to load my GZ files directly from GCS to BQ without any problems.
I have buffered a zlib stream in a std::vector.
I have programmatically uncompressed this stream into a new std::vector named "UncompressedZlibStream".
I know that my stream contains multiple files.
I would like to know how to "cut" (separate) the files in the stream.
Does zlib use a separator? If so, which character or sequence?
Does anyone have any information about this?
Thanks a lot,
best regards,
CrashOverHead
Zlib itself is only a compression library, and it is typically used to compress a single stream. Putting multiple files into a zlib stream requires a container format like tar, with the result compressed afterwards. Zlib-compressed tar files are pretty common in the Unix world. You may want to have a look at libtar. If it's anything else, it's likely proprietary, and you're on your own for how to dice the stream.
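To make that concrete: there is no zlib-level separator; the file boundaries come from the container's headers. A minimal Java sketch, assuming the buffered data is a zlib-compressed tar archive and using Apache Commons Compress (both assumptions on my part):

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.zip.InflaterInputStream;
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;

public class SplitZlibTar {
    // data: the raw zlib stream, e.g. the bytes buffered in the vector.
    static void listEntries(byte[] data) throws IOException {
        // InflaterInputStream understands the zlib wrapper around deflate.
        try (TarArchiveInputStream tar = new TarArchiveInputStream(
                new InflaterInputStream(new ByteArrayInputStream(data)))) {
            TarArchiveEntry entry;
            while ((entry = tar.getNextTarEntry()) != null) {
                // Each tar header carries the entry's name and size;
                // that header, not a zlib character, delimits the files.
                System.out.println(entry.getName() + " (" + entry.getSize() + " bytes)");
                // Read the entry's contents from 'tar' here if needed.
            }
        }
    }
}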