Finding the size of a deflated chunk of data - parsing

Let's suppose I have a chunk of deflated data (as in the compressed data right after the first 2 bytes and before the ADLER32 check in a zlib compressed file)
How can I know the length of that chunk? How can I find where it ends?

You would need to get that information from some metadata external to the zlib block. (The same is true of the uncompressed size.) Or you could decompress and see where you end up.
Compressed blocks in deflate format are self-delimiting, so the decoder will terminate at the correct point (unless the datastream has been corrupted).
Most file storage and data transmission formats which include compressed data make this metadata available. But since it is not necessary for decompression, it is not stored in the compressed stream.

The only way to find where it ends is to decompress it; deflate streams are self-terminating.
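To make that concrete, here is a short Python sketch (my own illustration, not from the answers above) using zlib's raw-deflate decompressor: once the decoder hits the end-of-stream marker, everything it did not consume is left in unused_data, so the chunk length falls out of a single decompression pass. The helper name is made up for the example.

import zlib

def deflate_chunk_length(blob):
    # blob starts right after the 2-byte zlib header; wbits=-15 selects raw deflate
    d = zlib.decompressobj(wbits=-15)
    d.decompress(blob)                     # the decoder stops at the final deflate block
    assert d.eof                           # the stream terminated cleanly
    # whatever the decoder did not consume (the ADLER-32 and anything after it)
    return len(blob) - len(d.unused_data)

full = zlib.compress(b"example payload " * 1000)
body = full[2:]                            # drop the 2-byte zlib header
print(deflate_chunk_length(body))          # length of the deflate data alone
print(len(body) - 4)                       # same number: only the ADLER-32 follows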

Related

Reading file as stream of strings in Dart: how many events will be emitted?

The standard way to open a file in Dart as a stream is to use file.openRead(), which returns a Stream<List<int>>.
The next standard step is to transform this stream with the utf8.decoder StreamTransformer, which returns a Stream<String>.
I noticed that with the files I've tried, the resulting stream only emits a single event with the whole file content represented as one string. But I feel like this should not be the general case, since otherwise the API wouldn't need to return a stream of strings; a Future<String> would suffice.
Could you explain how I can observe the behavior where this stream emits more than one event? Does it depend on the file size / disk IO rate / some buffer size?
It depends on the file size and the buffer size, and on how the file operations are implemented.
If you read a large file, you will very likely get multiple events of a limited size. The UTF-8 decoder decodes chunks eagerly, so you should get roughly the same number of chunks after decoding. It might carry a few bytes across chunk boundaries, but the rest of the bytes are decoded as soon as possible.
Checking on my local machine, the buffer size seems to be 65536 bytes. Reading a file larger than that gives me multiple chunks.
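The same mechanism can be sketched outside Dart; the Python snippet below (an analogy, not the Dart API, with an assumed file name and buffer size) shows a buffered read handing back at most one buffer's worth of bytes per call, and an incremental UTF-8 decoder flushing each chunk as it arrives.

import codecs

BUF = 64 * 1024                            # assumed buffer size, like the 65536 above
decoder = codecs.getincrementaldecoder("utf-8")()

with open("big.txt", "rb") as f:           # hypothetical file larger than BUF
    while True:
        chunk = f.read(BUF)
        if not chunk:
            break
        text = decoder.decode(chunk)       # decoded eagerly, chunk by chunk
        print(len(chunk), len(text))       # a few bytes may carry over at chunk boundaries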

How are objects stored in NSData? Will the data size (in bytes) be the same as before storing it as NSData?

To expand my question: I have an NSString that contains a URL to a video file that weighs 2 MB. I use dataWithContentsOfURL to store it in the form of NSData. I added a breakpoint and checked the size of the data. It was way too high (more than 12 MB); I just took the byte count and did the math on it (bytes/1024/1024).
Then I save the data as a video file with UISaveVideoAtPathToSavedPhotosAlbum, and I used attributesOfItemAtPath to get the FileSize attribute of that saved file. It shows up correctly (2 MB). All I want to know is how the objects are encoded, and why there is this drastic change in size (in bytes) when loading the URL's contents into a data object.
An NSData is not "encoded" in any way; the bytes stored in one are the raw bytes. There is some overhead in the structure, but it is small.
Reading in a 2MB file will take a little more than 2MB, the overhead is small enough that you might not notice it in the memory usage. You can read much larger files than this and see no significant memory usage growth above the file size.
Your increase in memory usage is due to something else. Accidental multiple copies of the NSData? Maybe you are unintentionally uncompressing compressed video? Etc. You'll have to hunt for the cause.

Stream interface for data compression in Common Lisp

In the chipz decompression library there is an extremely useful function make-decompressing-stream, which provides an interface (using Gray streams behind the scenes) to transparently decompress data read from the provided stream. This allows me to write a single function read-tag (which reads a single "tag" from a stream of structured binary data, much like Common Lisp's read function reads a single Lisp "form" from a stream) that works on both compressed and uncompressed data, e.g.:
;; For uncompressed data:
(read-tag in-stream)
;; For compressed data:
(read-tag (chipz:make-decompressing-stream 'chipz:zlib in-stream))
As far as I can tell, the API of the associated compression library, salza2, doesn't provide an (out-of-the-box) equivalent interface for performing the reverse task. How could I implement such an interface myself? Let's call it make-compressing-stream. It will be used with my own complementary write-tag function, and provide the same benefits as for reading:
;; For uncompressed data:
(write-tag out-stream current-tag)
;; For compressed data:
(write-tag (make-compressing-stream 'salza2:zlib-compressor out-stream)
           current-tag)
In salza2's documentation (linked above), in the overview, it says: "Salza2 provides an interface for creating a compressor object. This object acts as a sink for octets (either individual or vectors of octets), and is a source for octets in a compressed data format. The compressed octet data is provided to a user-defined callback that can write it to a stream, copy it to another vector, etc." For my current purposes, I only require compression in zlib and gzip formats, for which standard compressors are provided.
So here's how I think it could be done: Firstly, convert my "tag" object to an octet vector, secondly compress it using salza2:compress-octet-vector, and thirdly, provide a callback function that writes the compressed data directly to a file. From reading around, I think the first step could be achieved using flexi-streams:with-output-to-sequence - see here - but I'm really not sure about the callback function, despite looking at salza2's source. But here's the thing: a single tag can contain an arbitrary number of arbitrarily nested tags, and the "leaf" tags of this structure can each carry a sizeable payload; in other words, a single tag can be quite a lot of data.
So the tag->uncompressed-octets->compressed-octets->file conversion would ideally need to be performed in chunks, and this raises a question that I don't know how to answer, namely: compression formats - AIUI - tend to store in their headers a checksum of their payload data; if I compress the data one chunk at a time and append each compressed chunk to an output file, surely there will be a header and checksum for each chunk, as opposed to a single header and checksum for the entire tag's data, which is what I want? How can I solve this problem? Or is it already handled by salza2?
Thanks for any help, sorry for rambling :)
From what I understand, chipz can't directly decompress multiple concatenated chunks from a single file.
(defun bytes (&rest elements)
  (make-array (length elements)
              :element-type '(unsigned-byte 8)
              :initial-contents elements))

(defun compress (chunk &optional mode)
  (with-open-file (output #P"/tmp/compressed"
                          :direction :output
                          :if-exists mode
                          :if-does-not-exist :create
                          :element-type '(unsigned-byte 8))
    (salza2:with-compressor (c 'salza2:gzip-compressor
                               :callback (salza2:make-stream-output-callback output))
      (salza2:compress-octet-vector chunk c))))

(compress (bytes 10 20 30) :supersede)
(compress (bytes 40 50 60) :append)
Now, /tmp/compressed contains two consecutive chunks of compressed data.
Calling decompress reads the first chunk only:
(chipz:decompress nil 'chipz:gzip #P"/tmp/compressed")
=> #(10 20 30)
Looking at the source of chipz, the stream is read using an internal buffer, which means the bytes that follow the first chunk are probably already read but not decompressed. That explains why, when using two consecutive decompress calls on the same stream, the second one fails with a premature end-of-stream error:
(with-open-file (input #P"/tmp/compressed"
                       :element-type '(unsigned-byte 8))
  (list
   #1=(multiple-value-list (ignore-errors (chipz:decompress nil 'chipz:gzip input)))
   #1#))
=> ((#(10 20 30))
    (NIL #<CHIPZ:PREMATURE-END-OF-STREAM {10155E2163}>))
I don't know how large the data is supposed to be, but if it ever becomes a problem, you might need to change the decompression algorithm so that when it reaches the done state (see inflate.lisp), enough information is returned to process the remaining bytes as a new chunk. Or you could compress into different files and use an archive format like TAR (see https://github.com/froydnj/archive).

Does binary encoding of AVRO compress data?

In one of our projects we are using Kafka with Avro to transfer data across applications. Data is added to an Avro object, and the object is binary encoded before being written to Kafka. We use binary encoding because it is generally described as more compact than other formats.
The data is usually a JSON string, and when it is saved in a file it uses up to 10 MB of disk. However, when the file is compressed (.zip), it uses only a few KB. We are concerned about storing such data in Kafka, so we are trying to compress it before writing to a Kafka topic.
When the length of the binary-encoded message (i.e. the length of the byte array) is measured, it is proportional to the length of the data string, so I assume binary encoding is not reducing the size at all.
Could someone tell me if binary encoding compresses data? If not, how can I apply compression?
Thanks!
Does binary encoding compress data?
Yes and no, it depends on your data.
Yes: according to the Avro binary encoding, the schema is stored only once for each .avro file, regardless of how many records are in that file, so it saves space by not storing the JSON key names many times. Avro serialization also does a bit of compression by storing int and long values using variable-length zig-zag coding (a saving only for small values). Beyond that, Avro does not "compress" data.
No: in some extreme cases Avro-serialized data can be bigger than the raw data, e.g. an .avro file with a single Record containing only one string field, where the schema overhead can defeat the saving from not needing to store the key name.
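To make the zig-zag point concrete, here is a small Python sketch of the int/long encoding Avro uses (my own illustration, not Avro library code): the value is zig-zag mapped and then written as a base-128 varint, which is why small values fit in a single byte while larger ones take more.

def zigzag(n):
    # map a signed value so small magnitudes become small unsigned values (64-bit long)
    return (n << 1) ^ (n >> 63)

def varint(u):
    # base-128 varint: 7 data bits per byte, high bit set means "more bytes follow"
    out = bytearray()
    while True:
        byte = u & 0x7F
        u >>= 7
        if u:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

for n in (-3, -1, 0, 1, 150, 100000):
    print(n, varint(zigzag(n)).hex())      # one byte for the small values, more for the large ones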
If not, how can I apply compression?
According to the Avro codecs documentation, Avro has a built-in deflate codec and optional ones such as Snappy. Just add one line while writing your object container files:
dataFileWriter.setCodec(CodecFactory.deflateCodec(6)); // using deflate
or
dataFileWriter.setCodec(CodecFactory.snappyCodec()); // using the Snappy codec
To use Snappy you need to include the snappy-java library in your dependencies.
If you plan to store your data in Kafka, also consider using the Kafka producer's compression support, e.g. by setting the producer property:
props.put("compression.codec", "snappy"); // "compression.type" with the newer Java producer
Compression is completely transparent to the consumer side; all consumed messages are automatically decompressed.
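For comparison, the same setting with the Python kafka-python client would look roughly like this (the client choice, broker address, and topic name are assumptions, not part of the original answer):

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",    # hypothetical broker
    compression_type="snappy",             # batches are compressed before they are sent
)
producer.send("avro-events", b"binary-encoded Avro record")
producer.flush()
# consumers receive the messages already decompressed; nothing changes on that side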

Encoding image data as UTF-8 string with gzip compression

I am trying to store image data from a file into a PostgreSQL database as a base64 string that is compressed by gzip to save space. I am using the following code to encode the image:
@file = File.open("#{Rails.root.to_s}/public/" << @ad_object.image_url).read
@base64 = Base64.encode64(@file)
@compressed = ActiveSupport::Gzip.compress(@base64)
@compressed.force_encoding('UTF-8')
@ad_object.imageData = @compressed
When I try to save the object, I get the following error:
ActiveRecord::StatementInvalid (PG::Error: ERROR: invalid byte sequence for encoding "UTF8": 0x8b
In the Rails console, any gzip compression outputs the data with ASCII-8BIT encoding. I have tried to set my internal and external encodings to UTF-8, but the results have not changed. How can I get this compressed data into a UTF-8 string?
This doesn't make much sense for a number of reasons.
gzip is a binary encoding. There's absolutely no point base64-encoding something then gzipping it, since the output is binary and base64 is only for transmitting over non-8bit-clean protocols. Just gzip the file directly.
Most image data is already compressed with a codec like PNG or JPEG that is much more efficient at compressing image data than gzip is. Gzipping it will usually make the image slightly bigger. Gzip will never be as efficient for image data as the lossless PNG format, so if your image data is uncompressed, compress it as PNG instead of gzipping it.
When representing binary data there isn't really a text encoding concern, because it isn't text. It won't be valid utf-8, and trying to tell the system it is will just cause further problems.
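A quick way to convince yourself of the first two points is to measure them. The Python sketch below uses random bytes as a stand-in for already-compressed JPEG/PNG data (an assumption made for the demo); neither result typically comes out meaningfully smaller than the original, showing that gzip gains nothing here and that the base64 detour only adds work.

import base64, gzip, os

raw = os.urandom(512 * 1024)               # stand-in for already-compressed image bytes
direct = gzip.compress(raw)
detour = gzip.compress(base64.b64encode(raw))

print(len(raw), len(direct), len(detour))  # neither result is meaningfully smaller than len(raw)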
Do away entirely with the base64 encoding and gzip steps. As mu is too short says, just use the Rails binary field and let Rails handle the encoding and sending of the binary data.
Just use bytea fields in the database and store the PNG or JPEG images directly. These are hex-encoded on the wire for transmission, which takes 2x the space of the binary, but they're stored on disk in binary form. PostgreSQL automatically compresses bytea fields on disk if they benefit from compression, but most image data won't.
To minimize the size of the image, choose an appropriate compression format like PNG for lossless compression or JPEG for photographs. Downsample the image as much as you can before compression, and use the strongest compression that produces acceptable quality (for lossy codecs like JPEG). Do not attempt to further compress the image with gzip/LZMA/etc, it'll achieve nothing.
You'll still have the data double in size when transmitted as hex escapes over the wire. Solving that requires either the use of the PostgreSQL binary protocol (difficult and complicated) or a binary-clean side-band to transmit the image data. If the Pg gem supports SSL compression you can use it to compress the protocol traffic, which will reduce the cost of the hex escaping considerably.
If keeping the size down to the absolute minimum is necessary, I would not use the PostgreSQL wire protocol to send the images to the device. It's designed for performance and reliability more than for absolutely minimal size.
