Stream interface for data compression in Common Lisp

In the chipz decompression library there is an extremely useful function, make-decompressing-stream, which provides an interface (using Gray streams behind the scenes) to transparently decompress data read from the provided stream. This allows me to write a single function read-tag (which reads a single "tag" from a stream of structured binary data, much like Common Lisp's read function reads a single Lisp "form" from a stream) that works on both compressed and uncompressed data, e.g.:
;; For uncompressed data:
(read-tag in-stream)
;; For compressed data:
(read-tag (chipz:make-decompressing-stream 'chipz:zlib in-stream))
As far as I can tell, the API of the associated compression library, salza2, doesn't provide an (out-of-the-box) equivalent interface for performing the reverse task. How could I implement such an interface myself? Let's call it make-compressing-stream. It will be used with my own complementary write-tag function, and provide the same benefits as for reading:
;; For uncompressed data:
(write-tag out-stream current-tag)
;; For compressed data:
(write-tag (make-compressing-stream 'salza2:zlib-compressor out-stream)
           current-tag)
In salza2's documentation (linked above), in the overview, it says: "Salza2 provides an interface for creating a compressor object. This object acts as a sink for octets (either individual or vectors of octets), and is a source for octets in a compressed data format. The compressed octet data is provided to a user-defined callback that can write it to a stream, copy it to another vector, etc." For my current purposes, I only require compression in zlib and gzip formats, for which standard compressors are provided.
So here's how I think it could be done: first, convert my "tag" object to an octet vector; second, compress it using salza2:compress-octet-vector; and third, provide a callback function that writes the compressed data directly to a file. From reading around, I think the first step could be achieved using flexi-streams:with-output-to-sequence - see here - but I'm really not sure about the callback function, despite looking at salza2's source. But here's the thing: a single tag can contain an arbitrary number of arbitrarily nested tags, and the "leaf" tags of this structure can each carry a sizeable payload; in other words, a single tag can be quite a lot of data.
So the tag->uncompressed-octets->compressed-octets->file conversion would ideally need to be performed in chunks, and this raises a question that I don't know how to answer, namely: compression formats - AIUI - store a checksum of their payload data in their headers or trailers; if I compress the data one chunk at a time and append each compressed chunk to an output file, surely there will be a header and checksum for each chunk, as opposed to a single header and checksum for the entire tag's data, which is what I want? How can I solve this problem? Or is it already handled by salza2?
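For reference, here's a rough, untested sketch of the kind of thing I imagine, built on trivial-gray-streams; the class name compressing-stream and the idea of finishing the compression when the stream is closed are my own guesses, not anything salza2 provides out of the box:
;;; Untested sketch: a Gray stream that feeds everything written to it into a
;;; salza2 compressor whose callback writes the compressed octets to the
;;; underlying stream. Assumes trivial-gray-streams and salza2 are loaded,
;;; and that sequences written to the stream are octet vectors.
(defclass compressing-stream
    (trivial-gray-streams:fundamental-binary-output-stream)
  ((compressor :initarg :compressor :reader compressor)))

(defun make-compressing-stream (compressor-type output-stream)
  (make-instance
   'compressing-stream
   :compressor (make-instance
                compressor-type
                :callback (salza2:make-stream-output-callback output-stream))))

(defmethod trivial-gray-streams:stream-write-byte
    ((stream compressing-stream) byte)
  (salza2:compress-octet byte (compressor stream))
  byte)

(defmethod trivial-gray-streams:stream-write-sequence
    ((stream compressing-stream) sequence start end &key)
  (salza2:compress-octet-vector sequence (compressor stream)
                                :start start
                                :end (or end (length sequence)))
  sequence)

;; CLOSE is a generic function in Gray-streams-capable implementations;
;; finishing the compression here flushes any pending output and writes the
;; trailer (and its checksum) exactly once.
(defmethod close ((stream compressing-stream) &key abort)
  (declare (ignore abort))
  (salza2:finish-compression (compressor stream))
  (call-next-method))
Usage would then be (write-tag (make-compressing-stream 'salza2:zlib-compressor out-stream) current-tag), followed by a close on the compressing stream once all tags have been written - but I may well be missing something.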
Thanks for any help, sorry for rambling :)

From what I understand, you can't directly decompress multiple independently compressed chunks from a single file with chipz.
(defun bytes (&rest elements)
  (make-array (length elements)
              :element-type '(unsigned-byte 8)
              :initial-contents elements))
(defun compress (chunk &optional mode)
  (with-open-file (output #P"/tmp/compressed"
                          :direction :output
                          :if-exists mode
                          :if-does-not-exist :create
                          :element-type '(unsigned-byte 8))
    (salza2:with-compressor (c 'salza2:gzip-compressor
                               :callback (salza2:make-stream-output-callback output))
      (salza2:compress-octet-vector chunk c))))
(compress (bytes 10 20 30) :supersede)
(compress (bytes 40 50 60) :append)
Now, /tmp/compressed contains two consecutive chunks of compressed data.
Calling decompress reads the first chunk only:
(chipz:decompress nil 'chipz:gzip #P"/tmp/compressed")
=> #(10 20 30)
Looking at the source of chipz, the stream is read using an internal buffer, which means the bytes that follow the first chunk have probably already been read but not decompressed. That explains why, when using two consecutive decompress calls on the same stream, the second one signals a premature end-of-stream error.
(with-open-file (input #P"/tmp/compressed"
                       :element-type '(unsigned-byte 8))
  (list
   #1=(multiple-value-list
       (ignore-errors (chipz:decompress nil 'chipz:gzip input)))
   #1#))
=> ((#(10 20 30))
    (NIL #<CHIPZ:PREMATURE-END-OF-STREAM {10155E2163}>))
I don't know how large the data is supposed to be, but if it ever becomes a problem, you might need to change the decompression code so that, when it reaches the done state (see inflate.lisp), enough information is returned to process the remaining bytes as a new chunk. Alternatively, compress each chunk into a separate file and bundle them with an archive format like TAR (see https://github.com/froydnj/archive).
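If what you ultimately want is a single header and a single checksum for the whole file, a sketch of another option (reusing the bytes helper above) is to keep one compressor open and feed every chunk to it, so that the file contains one continuous gzip stream rather than several independent ones:
(with-open-file (output #P"/tmp/compressed"
                        :direction :output
                        :if-exists :supersede
                        :if-does-not-exist :create
                        :element-type '(unsigned-byte 8))
  (salza2:with-compressor (c 'salza2:gzip-compressor
                             :callback (salza2:make-stream-output-callback output))
    ;; Several COMPRESS-OCTET-VECTOR calls on the same compressor extend one
    ;; gzip member; the trailer and checksum are written once, when
    ;; WITH-COMPRESSOR calls FINISH-COMPRESSION on exit.
    (salza2:compress-octet-vector (bytes 10 20 30) c)
    (salza2:compress-octet-vector (bytes 40 50 60) c)))

(chipz:decompress nil 'chipz:gzip #P"/tmp/compressed")
=> #(10 20 30 40 50 60)
The trade-off is that you can no longer append new chunks to the file once the compressor has been finished.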

Related

Are there any good DEFLATE-like compression algorithms that are additive and immutable?

I need to add more items onto a stack-like structure and compress them additively, so that adding new data yields the expected compression gain, while new chunks are still stored as compressed data without altering any of the past data chunks.
So in other words, it must preserve the additive property over successive compressed chunks. If f() is the 'adding another chunk' function, I need the following to hold for all chunks 'x':
Sure. Deflate does this, if you just keep it running and terminate each chunk at a byte boundary, e.g. with Z_SYNC_FLUSH, so that it can be written to file up to and including that chunk.
So long as your chunks are large enough, you will get the same compression gain.

Reading file as stream of strings in Dart: how many events will be emitted?

Standard way to open a file in Dart as a stream is to use file.openRead() which returns a Stream<List<int>>.
The next standard step is to transform this stream with the utf8.decoder StreamTransformer, which returns a Stream<String>.
I noticed that with the files I've tried, this resulting stream only emits a single event with the whole file content represented as one string. But I feel like this should not be the general case, since otherwise the API wouldn't need to return a stream of strings; a Future<String> would suffice.
Could you explain how I can observe the behavior where this stream emits more than one event? Is this dependent on the file size / disk IO rate / some buffer size?
It depends on the file size and buffer size, and on how the file operations are implemented.
If you read a large file, you will very likely get multiple events of a limited size. The UTF-8 decoder decodes chunks eagerly, so you should get roughly the same number of chunks after decoding. It might carry a few bytes across chunk boundaries, but the rest of the bytes are decoded as soon as possible.
Checking on my local machine, the buffer size seems to be 65536 bytes. Reading a file larger than that gives me multiple chunks.

OCaml Marshal very large data structure

I would like to send a very large (~8GB) data structure over the network, so I use the Marshal module to transform it into Bytes.
My problem is that the memory doubles, because we need to store both representations (the initial data and the marshaled data).
Is there a simple way to marshal into a Stream instead? This would avoid having the full marshaled representation of the initial data structure in memory.
I thought of marshaling to an out_channel opened on a pipe from a second thread, and reading from the pipe in the main thread into a Stream, but I guess there might be a simpler solution.
Thanks!
Edit: Answer to a comment:
In the toplevel :
let a = Array.make (1024*1024*1024) 0. ;; (* Takes 8GB of RAM *)
let data = Marshal.to_bytes a [Marshal.Closures] ;; (* Takes an extra 8GB *)
It's not possible. You would have to modify the Marshal module to stream the data as it marshals something and to reconstruct the data in place without buffering it all first.
In the short run it might be simpler to implement your own specialized marshal function specific to your data. For an 8GiB array you might want to switch to using Bigarray so you can send/recv the data without having to copy it.
Note: an 8GiB array will use 16GiB if the GC ever copies it, at least temporarily.
From what I understand, MPI only allows sending data packets of a known size, not a stream of data. You could implement a custom stream type that splits an incoming flow of data into packets of a constant, small size (on close, you flush whatever remains in the buffer).
Also, you can only marshal arbitrarily long data to a channel, because otherwise it takes up too much space.
And then you need a way to connect the channel to the stream, which AFAIK is not easily possible. Maybe you could start another OCaml process: that process would convert the flow of bytes (you can wrap a custom stream over Stream.of_channel) and send it through MPI, while the main process marshals data to the process's input channel.

Zip stream implementation with C in embedded devices

I have to embed a large text file in the limited space of internal memory of a MCU. This MCU will use the content of the text file for some purposes later.
The memory limitation does not allow me to embed the file content directly in my code (suppose that I use a character array to store the file content), but if I compress the content of the file (using a lightweight algorithm like zip or gzip), then everything would be OK.
Suppose that the MCU uses a getBytes(i, len) function to read the content of my array (where i is the index of the beginning of the required bytes and len is the length of the data to be read).
The problem is that once I compress the content and store it on the device (in my character array), I can't use the getBytes function anymore to get the target data, so if I can write a wrapper on top of the getBytes function that maps compressed content to the requested content, my problem will be solved.
I have no processing limitation on the MCU, but the amount of memory is limited, and as far as I know access to the content of a zip-compressed file is sequential, so I don't know whether it is possible to do this in an acceptable manner using C or C++ in such an environment.
It is definitely possible to do this in a simple and efficient manner.
However, it's better to use piecewise compression (at the expense of compression ratio) instead of compressing/decompressing the entire file at once; otherwise you need to store the entire decompressed file in RAM.
With small pieces, the compression ratio of a strong algorithm will not be much different from that of a relatively weak one, so I recommend using a simple compression algorithm.
Disk compression algorithms are best suited for such purposes as they are designed to compress/decompress blocks.

Writing a lexer for chunked data

I have an embedded application which communicates with a RESTful server over HTTP. Some services involve sending some data to the client, which is interpreted using a very simple lexer I wrote with flex.
Now I'm in the process of adding a gzip compression layer to reduce bandwidth consumption but I'm not satisfied with the current architecture because of the memory requirements: first I receive the whole data in a buffer, then I decompress the whole buffer into a new buffer and then I feed the whole data to flex.
I can save some memory between the first and second steps by feeding chunked data from the HTTP client to the zlib routines. But I'm wondering whether it's possible to do the same between the zlib chunked output and the flex input.
Currently I use only yy_scan_bytes and yylex to analyze the input. Does flex have any feature to feed multiple chunks of data to yylex? I've read the documentation about multiple input buffers but to no avail.
YY_INPUT seems to be the correct answer:
The nature of how [the scanner] gets its input can be controlled by defining the YY_INPUT macro. The calling sequence for YY_INPUT() is YY_INPUT(buf,result,max_size). Its action is to place up to max_size characters in the character array buf and return in the integer variable result either the number of characters read or the constant YY_NULL (0 on Unix systems) to indicate `EOF'.
