Does binary encoding of AVRO compress data?

In one of our projects we use Kafka with AVRO to transfer data across applications. Data is added to an AVRO object and the object is binary encoded to write to Kafka. We use binary encoding because it is generally described as a more compact representation than other formats.
The data is usually a JSON string, and when it is saved in a file it uses up to 10 MB of disk. However, when the file is compressed (.zip), it uses only a few KB. We are concerned about storing such data in Kafka, so we are trying to compress it before writing to a Kafka topic.
When we measure the length of the binary-encoded message (i.e. the length of the byte array), it is proportional to the length of the data string, so I assume binary encoding is not reducing the size.
Could someone tell me whether binary encoding compresses data? If not, how can I apply compression?
Thanks!

Does binary encoding compress data?
Yes and no, it depends on your data.
According to the Avro binary encoding spec, yes, to a degree: the schema is stored only once per .avro file, regardless of how many records the file contains, which saves space compared with repeating JSON key names for every record. Avro serialization also shaves a few bytes off int and long values by using variable-length zig-zag coding (which only helps for small values). Beyond that, Avro does not "compress" the data.
No, because in some extreme cases Avro-serialized data can be bigger than the raw data, e.g. an .avro file holding a single record with a single string field. There, the schema overhead can outweigh the saving from not storing the key name.
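To make the zig-zag point concrete, here is a minimal Python sketch of variable-length zig-zag encoding (illustrative only, not taken from the Avro library): small values fit in one byte, but the encoded size still grows with the magnitude, so it is a compact encoding rather than real compression.

def zigzag_varint(n):
    # zig-zag maps small magnitudes (positive or negative) to small codes
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z > 0x7F:                     # 7 payload bits per byte, MSB = "more bytes follow"
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)

print(len(zigzag_varint(3)))        # 1 byte
print(len(zigzag_varint(-150)))     # 2 bytes
print(len(zigzag_varint(10**12)))   # 6 bytes -- grows with the value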
If not, how can I apply compression?
According to the Avro codecs documentation, Avro has a built-in compression codec (deflate) and optional ones (e.g. snappy). You just need to add one line while writing object container files:
dataFileWriter.setCodec(CodecFactory.deflateCodec(6)); // use deflate at compression level 6
or
dataFileWriter.setCodec(CodecFactory.snappyCodec()); // use the snappy codec
To use snappy you need to add the snappy-java library to your dependencies.
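If you happen to write these container files from Python instead of Java, a rough equivalent would look like this (assuming the fastavro library; the schema and file name below are made up for illustration):

import fastavro

schema = {
    "type": "record",
    "name": "Message",
    "fields": [{"name": "payload", "type": "string"}],
}
records = [{"payload": "some json string"}]

with open("messages.avro", "wb") as out:
    fastavro.writer(out, schema, records, codec="deflate")  # or codec="snappy"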

If you plan to store your data in Kafka, also consider the Kafka producer's compression support:
producerProps.put("compression.type", "snappy"); // "compression.codec" in older 0.8.x clients
Compression is completely transparent to the consumer side; all consumed messages are automatically decompressed.
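For example, with the Python kafka-python client (a sketch assuming that client; broker address and topic name are placeholders), producer-side compression is a single constructor argument:

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",   # placeholder broker
    compression_type="snappy",            # whole batches are compressed before sending
)
producer.send("my-topic", b"binary-encoded avro payload")
producer.flush()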

Related

Defining a dask array from already blocked data on gcs

I want to create a 3d dask array from data that I have that is already chunked. My data consists of 216 blocks containing 1024x1024x1024 uint8 voxels each, each stored as a compressed hdf5 file with one key called data. Compressed, my data is only a few megabytes per block, but decompressed, it takes 1GB per block. Furthermore, my data is currently stored in Google Cloud storage (gcs), although I could potentially mirror it locally inside a container.
I thought the easiest way would be to use zarr, following these instructions (https://pangeo.io/data.html). Would xarray have to decompress my data before saving to zarr format? Would it have to shuffle data and try to communicate across blocks? Is there a lower level way of assembling a zarr from hdf5 blocks?
There are a few questions there, so I will try to be brief and hope that some edits can flesh out details I may have omitted.
You do not need to do anything in order to view your data as a single dask array, since you can reference the individual chunks as arrays (see here) and then use the stack/concatenate functions to build up into a single array. That does mean opening every file in the client in order to read the metadata, though.
Similarly, xarray has some functions for reading sets of files, where you should be able to assume consistency of dtype and dimensionality - please see their docs.
As far as zarr is concerned, you could use dask to create the set of files for you on GCS or not, and choose to use the same chunking scheme as the input - then there will be no shuffling. Since zarr is very simple to set up and understand, you could even create the zarr dataset yourself and write the chunks one-by-one without having to create the dask array up front from the zarr files. That would normally be via the zarr API, and writing a chunk of data does not require any change to the metadata file, so can be done in parallel. In theory, you could simply copy a block in, if you understood the low-level data representation (e.g., int64 in C-array layout); however, I don't know how likely it is that the exact same compression mechanism will be available in both the original hdf and zarr (see here).
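As a rough sketch of both routes (assuming the 216 blocks form a 6 x 6 x 6 grid, each HDF5 file has a dataset named data, and the file/store names below are invented; reading straight from GCS would additionally need gcsfs/fsspec or a local mirror):

import h5py
import zarr
import dask.array as da

def block(path):
    f = h5py.File(path, "r")                          # left open; dask reads lazily
    return da.from_array(f["data"], chunks=(1024, 1024, 1024))

# Route 1: assemble one dask array from the blocks, then write zarr with the
# same chunking so no data has to be shuffled between blocks.
grid = [[[block(f"block_{x}_{y}_{z}.h5") for z in range(6)]
         for y in range(6)]
        for x in range(6)]
full = da.block(grid)
full.to_zarr("output.zarr")

# Route 2 (alternative): create the zarr store up front and write chunks one by
# one, possibly in parallel; a chunk-aligned assignment touches exactly one chunk.
z = zarr.open("output_direct.zarr", mode="w", shape=(6144, 6144, 6144),
              chunks=(1024, 1024, 1024), dtype="uint8")
with h5py.File("block_0_0_0.h5", "r") as f:
    z[0:1024, 0:1024, 0:1024] = f["data"][:]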

Reading file as stream of strings in Dart: how many events will be emitted?

The standard way to open a file in Dart as a stream is to use file.openRead(), which returns a Stream<List<int>>.
The next standard step is to transform this stream with the utf8.decoder StreamTransformer, which returns a Stream<String>.
I noticed that, with the files I've tried, the resulting stream only emits a single event with the whole file content represented as one string. But I feel this should not be the general case, since otherwise the API wouldn't need to return a stream of strings; a Future<String> would suffice.
Could you explain how I can observe the stream emitting more than one event? Does it depend on the file size / disk IO rate / some buffer size?
It depends on the file size and the buffer size, and on how the file operations are implemented.
If you read a large file, you will very likely get multiple events of a limited size. The UTF-8 decoder decodes chunks eagerly, so you should get roughly the same number of chunks after decoding. It might carry a few bytes across chunk boundaries, but the rest of the bytes are decoded as soon as possible.
Checking on my local machine, the buffer size seems to be 65536 bytes. Reading a file larger than that gives me multiple chunks.

How to store large compressed CSV on S3 for use with Dask

I have a large dataset (~1 terabyte of data) spread across several csv files that I want to store (compressed) on S3. I have had issues reading compressed files into dask because they are too large, and so my initial solution was to split each csv into manageable sizes. These files are then read in the following way:
ddf = dd.read_csv('s3://bucket-name/*.xz', encoding = "ISO-8859-1",
compression='xz', blocksize=None, parse_dates=[6])
Before I ingest the full dataset - is this the correct approach, or is there a better way to accomplish what I need?
This seems sensible to me.
The only challenge that arises here is due to compression. If a compression format doesn't support random access, then Dask can't break up large files into multiple smaller pieces. This can also be true for formats that do support random access, like xz, but were not written with it enabled for that particular file.
Breaking up the file manually into many small files and using blocksize=None as you've done above is a good solution in this case.

Zip stream implementation with C in embedded devices

I have to embed a large text file in the limited internal memory of an MCU. The MCU will use the content of the text file for some purposes later.
The memory limitation does not allow me to embed the file content directly in my code (suppose I use a character array to store the file content), but if I compress the content of the file (using a lightweight algorithm like zip or gzip) then everything would be OK.
Suppose that the MCU uses a getBytes(i, len) function to read the content of my array (where i is the index of the first required byte and len is the length of the data to be read).
The problem is that once I compress the content and store it on the device (in my character array), I can no longer use the getBytes function to get the target data. If I could write a wrapper on top of getBytes that maps the compressed content to the requested content, my problem would be solved.
I have no processing limitation on the MCU, but the amount of memory is limited, and as far as I know access to the content of a zip-compressed file is sequential, so I don't know whether it is possible to do this in an acceptable manner using C or C++ in such an environment.
It is definitely possible to do this in a simple and efficient manner.
However, it's better to use piecewise compression (at the expense of compression ratio) instead of compressing/decompressing the entire file at once; otherwise you would need to hold the entire decompressed file in RAM.
With a small piece, the compression ratio of a strong algorithm will not be much different from a relatively weak one. So I recommend using a simple compression algorithm.
Disk compression algorithms are best suited for such purposes as they are designed to compress/decompress blocks.
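As a host-side sketch of that piecewise idea (Python with zlib here, purely to illustrate the layout; on the MCU the same block-index scheme would be read back in C with a small deflate-style decompressor, and the block size is an arbitrary assumption):

import zlib

BLOCK = 4096  # uncompressed bytes per block; tune to your RAM budget

def pack(data):
    # compress block by block; the index records (offset, size) of each compressed block
    index, blob = [], bytearray()
    for off in range(0, len(data), BLOCK):
        comp = zlib.compress(data[off:off + BLOCK])
        index.append((len(blob), len(comp)))
        blob += comp
    return index, bytes(blob)

def get_bytes(index, blob, i, length):
    # random access: decompress only the block(s) covering [i, i + length)
    first, last = i // BLOCK, (i + length - 1) // BLOCK
    out = bytearray()
    for b in range(first, last + 1):
        off, size = index[b]
        out += zlib.decompress(blob[off:off + size])
    start = i - first * BLOCK
    return bytes(out[start:start + length])

data = b"some large text file content " * 1000
index, blob = pack(data)
assert get_bytes(index, blob, 100, 50) == data[100:150]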

Encoding image data as UTF-8 string with gzip compression

I am trying to store image data from a file into a PostgreSQL database as a base64 string that is compressed by gzip to save space. I am using the following code to encode the image:
@file = File.open("#{Rails.root.to_s}/public/" << @ad_object.image_url).read
@base64 = Base64.encode64(@file)
@compressed = ActiveSupport::Gzip.compress(@base64)
@compressed.force_encoding('UTF-8')
@ad_object.imageData = @compressed
When I try to save the object, I get the following error:
ActiveRecord::StatementInvalid (PG::Error: ERROR: invalid byte sequence for encoding "UTF8": 0x8b
In the Rails console, any gzip compression outputs the data with ASCII-8BIT encoding. I have tried setting my internal and external encodings to UTF-8 but the results have not changed. How can I get this compressed data into a UTF-8 string?
This doesn't make much sense for a number of reasons.
gzip output is binary. There's absolutely no point in base64-encoding something and then gzipping it, since the result is binary anyway and base64 exists only for transmitting data over non-8-bit-clean protocols. Just gzip the file directly.
Most image data is already compressed with a codec like PNG or JPEG that is far more efficient at compressing image data than gzip is; gzipping it will usually make the image slightly bigger. Gzip will never be as efficient for image data as the lossless PNG format, so if your image data is uncompressed, compress it as PNG instead of gzipping it.
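A quick Python sanity check of those two points (random bytes stand in for already-compressed, high-entropy image data; sizes are approximate):

import base64, gzip, os

raw = os.urandom(100_000)                 # stands in for JPEG/PNG bytes
print(len(gzip.compress(raw)))            # roughly 100 KB: gzip gains nothing here
b64 = base64.b64encode(raw)
print(len(b64))                           # ~133 KB: base64 inflates the data by a third
print(len(gzip.compress(b64)))            # back to roughly 100 KB at best: wasted work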
When representing binary data there isn't really a text encoding concern, because it isn't text. It won't be valid utf-8, and trying to tell the system it is will just cause further problems.
Do away entirely with the base64 encoding and gzip steps. As mu is too short says, just use the Rails binary field and let Rails handle the encoding and sending of the binary data.
Just use bytea fields in the database and store the PNG or JPEG images directly. These are hex-encoded on the wire for transmission, which takes 2x the space of the binary, but they're stored on disk in binary form. PostgreSQL automatically compresses bytea fields on disk if they benefit from compression, but most image data won't.
To minimize the size of the image, choose an appropriate compression format like PNG for lossless compression or JPEG for photographs. Downsample the image as much as you can before compression, and use the strongest compression that produces acceptable quality (for lossy codecs like JPEG). Do not attempt to further compress the image with gzip/LZMA/etc, it'll achieve nothing.
You'll still have the data double in size when transmitted as hex escapes over the wire. Solving that requires either the use of the PostgreSQL binary protocol (difficult and complicated) or a binary-clean side-band to transmit the image data. If the Pg gem supports SSL compression you can use it to compress the protocol traffic, which will reduce the cost of the hex escaping considerably.
If keeping the size down to the absolute minimum is necessary, I would not use the PostgreSQL wire protocol to send the images to the device. It's designed for performance and reliability more than for absolute minimum size.
