Writing a lexer for chunked data - flex-lexer

I have an embedded application which communicates with a RESTful server over HTTP. Some services involve sending some data to the client which is interpreted using a very simple lexer I wrote using flex.
Now I'm adding a gzip compression layer to reduce bandwidth consumption, but I'm not satisfied with the current architecture because of its memory requirements: first I receive the whole response into a buffer, then I decompress that whole buffer into a new buffer, and finally I feed the whole decompressed buffer to flex.
I can save some memory between the first and second steps by feeding chunked data from the HTTP client to the zlib routines. But I'm wondering whether it's possible to do the same between the zlib chunked output and the flex input.
Currently I use only yy_scan_bytes and yylex to analyze the input. Does flex have any feature to feed multiple chunks of data to yylex? I've read the documentation about multiple input buffers but to no avail.

YY_INPUT seems to be the correct answer. From the flex manual:
The nature of how [the scanner] gets its input can be controlled by defining the
YY_INPUT macro. The calling sequence for YY_INPUT() is
YY_INPUT(buf,result,max_size). Its action is to place up to max_size
characters in the character array buf and return in the integer
variable result either the number of characters read or the constant
YY_NULL (0 on Unix systems) to indicate `EOF'.
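
A rough sketch of how that could be wired up, with zlib decompressing on demand: YY_INPUT makes yylex() pull decompressed bytes from inflate(), which in turn pulls compressed chunks from the HTTP client only as needed. Here http_read() and the surrounding glue are hypothetical stand-ins for your HTTP layer; the zlib calls (inflateInit2/inflate) and the YY_INPUT/YY_NULL macros are the real APIs.

#include <stddef.h>
#include <zlib.h>

/* Hypothetical: returns the next compressed chunk of the HTTP body,
   0 once the body is finished. */
static size_t http_read(unsigned char *buf, size_t max);

static z_stream g_zs;              /* set up once with inflateInit2(&g_zs, 16 + MAX_WBITS)
                                      so gzip-wrapped data is accepted */
static unsigned char g_in[4096];   /* holding area for one compressed chunk */
static int g_http_done, g_zend;

/* Fill buf with up to max_size decompressed bytes; 0 means end of input. */
static size_t zinput(char *buf, size_t max_size)
{
    if (g_zend)
        return 0;
    g_zs.next_out  = (Bytef *)buf;
    g_zs.avail_out = (uInt)max_size;
    while (g_zs.avail_out == max_size) {       /* until at least one byte is produced */
        if (g_zs.avail_in == 0 && !g_http_done) {
            size_t n = http_read(g_in, sizeof g_in);
            if (n == 0) g_http_done = 1;
            g_zs.next_in  = g_in;
            g_zs.avail_in = (uInt)n;
        }
        int rc = inflate(&g_zs, Z_NO_FLUSH);
        if (rc == Z_STREAM_END) { g_zend = 1; break; }
        if (rc == Z_BUF_ERROR && g_http_done) break;   /* truncated input: stop */
        if (rc != Z_OK && rc != Z_BUF_ERROR) break;    /* hard error: report EOF to flex */
    }
    return max_size - g_zs.avail_out;
}

/* In the definitions section of the .l file: make the scanner read from
   zinput() instead of yyin. */
#define YY_INPUT(buf, result, max_size)                 \
    do {                                                \
        size_t n_ = zinput((buf), (size_t)(max_size));  \
        (result) = (n_ == 0) ? YY_NULL : (int)n_;       \
    } while (0)

With this in place you call yylex() once and it consumes the whole logical input, so only one compressed chunk and one small decompressed window live in memory at a time.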

Related

Reading file as stream of strings in Dart: how many events will be emitted?

The standard way to open a file in Dart as a stream is to use file.openRead(), which returns a Stream<List<int>>.
The next standard step is to transform this stream with the utf8.decoder StreamTransformer, which returns a Stream<String>.
I noticed that, with the files I've tried, the resulting stream only emits a single event containing the whole file content as one string. But I feel like this should not be the general case, since otherwise the API wouldn't need to return a stream of strings; a Future<String> would suffice.
Could you explain how I can observe this stream emitting more than one event? Does it depend on the file size, the disk IO rate, or some buffer size?
It depends on the file size and the buffer size, and on how the file operations are implemented.
If you read a large file, you will very likely get multiple events of a limited size. The UTF-8 decoder decodes chunks eagerly, so you should get roughly the same number of chunks after decoding. It might carry a few bytes across chunk boundaries, but the rest of the bytes are decoded as soon as possible.
Checking on my local machine, the buffer size seems to be 65536 bytes. Reading a file larger than that gives me multiple chunks.

OCaml Marshal very large data structure

I would like to send a very large (~8GB) data structure through the network, so I use the Marshal module to transform it into Bytes.
My problem is that the memory doubles, because we need to store both representations (initial data and Marshaled data).
Is there a simple way to marshal into a Stream instead? This would avoid having to hold the full marshalled representation of the initial data structure.
I thought of marshaling to an out_channel backed by a pipe written from a second thread, and reading from the pipe in the main thread into a Stream, but I guess there might be a simpler solution.
Thanks!
Edit: Answer to a comment:
In the toplevel:
let a = Array.make (1024*1024*1024) 0. ;; (* Takes 8GB of RAM *)
let data = Marshal.to_bytes a [Marshal.Closures] ;; (* Takes an extra 8GB *)
It's not possible. You would have to modify the Marshal module to stream the data as it marshals it, and to reconstruct the data in place on the other end without buffering it all first.
In the short run it might be simpler to implement your own specialized marshal function specific to your data. For an 8GiB array you might want to switch to using Bigarray so you can send/recv the data without having to copy it.
Note: an 8GiB array will use 16GiB if the GC ever copies it, at least temporarily.
From what I understand, MPI only allows sending data packets of a known size, not a stream of data. You could implement a custom stream type that splits an incoming flow of data into packets of constant, small size (on close, you flush whatever remains in the buffer).
Also, you can only marshal arbitrarily long data to a channel, because otherwise it takes up too much space.
And then you need a way to connect the channel to the stream, which AFAIK is not easily possible. Maybe you could start another OCaml process: that process would convert the flow of bytes (you can wrap a custom stream over Stream.of_channel) and send it through MPI, while the main process marshals data to that process's input channel.

How to convert hexadecimal data (stored in a string variable) to an integer value

Edit (abstract)
I tried to interpret Char/String data as Byte, 4 bytes at a time. This was because I could only get TComport/TDatapacket to interpret streamed data as String, not as any other data type. I still don't know how to get the Read method and OnRxBuf event handler to work with TComport.
Problem Summary
I'm trying to get data from a mass spectrometer (MS) using some Delphi code. The instrument is connected with a serial cable and follows the RS232 protocol. I am able to send commands and process the text-based outputs from the MS without problems, but I am having trouble with interpreting the data buffer.
Background
From the user manual of this instrument:
"With the exception of the ion current values, the output of the RGA are ASCII character strings terminated by a linefeed + carriage return terminator. Ion signals are represented as integers in units of 10^-16 Amps, and transmitted directly in hex format (four byte integers, 2's complement format, Least Significant Byte first) for maximum data throughput."
I'm not sure whether (1) hex data can be stored properly in a string variable. I'm also not sure how to (2) implement 2's complement in Delphi and (3) handle the Least Significant Byte first ordering.
Following #David Heffernan's advice, I went and revised my data types. Attempting to harvest binary data from characters doesn't work, because not all values from 0-255 can be properly represented. You lose data along the way, basically, especially if your data is represented 4 bytes at a time.
The solution for me was to use the Async Professional component instead of Denjan's Comport lib. It handles data streams better and has a built-in log that I could use to figure out how to interpret streamed responses from the instrument. It's also better documented. So, if you're new to serial communications (like I am), rather give that a go.
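For what it's worth, the byte interpretation the manual describes (four bytes, two's complement, least significant byte first) amounts to assembling a signed 32-bit integer. A minimal sketch in C rather than Delphi, just to show the arithmetic (decode_ion_current is a made-up name):

#include <stdint.h>

/* Build a signed 32-bit ion-current value (units of 1e-16 A) from four
   raw bytes received least-significant-byte first. int32_t is already
   two's complement, so reinterpreting the unsigned value is enough. */
int32_t decode_ion_current(const uint8_t b[4])
{
    uint32_t u = (uint32_t)b[0]
               | ((uint32_t)b[1] << 8)
               | ((uint32_t)b[2] << 16)
               | ((uint32_t)b[3] << 24);
    return (int32_t)u;
}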

Stream interface for data compression in Common Lisp

In the chipz decompression library there is an extremely useful function make-decompressing-stream, which provides an interface (using Gray streams behind the scenes) to transparently decompress data read from the provided stream. This allows me to write a single function read-tag (which reads a single "tag" from a stream of structured binary data, much like Common Lisp's read function reads a single Lisp "form" from a stream) that works on both compressed and uncompressed data, eg:
;; For uncompressed data:
(read-tag in-stream)
;; For compressed data:
(read-tag (chipz:make-decompressing-stream 'chipz:zlib in-stream))
As far as I can tell, the API of the associated compression library, salza2, doesn't provide an (out-of-the-box) equivalent interface for performing the reverse task. How could I implement such an interface myself? Let's call it make-compressing-stream. It will be used with my own complementary write-tag function, and provide the same benefits as for reading:
;; For uncompressed data:
(write-tag out-stream current-tag)
;; For compressed data:
(write-tag (make-compressing-stream 'salza2:zlib-compressor out-stream)
           current-tag)
In salza2's documentation (linked above), in the overview, it says: "Salza2 provides an interface for creating a compressor object. This object acts as a sink for octets (either individual or vectors of octets), and is a source for octets in a compressed data format. The compressed octet data is provided to a user-defined callback that can write it to a stream, copy it to another vector, etc." For my current purposes, I only require compression in zlib and gzip formats, for which standard compressors are provided.
So here's how I think it could be done: Firstly, convert my "tag" object to an octet vector, secondly compress it using salza2:compress-octet-vector, and thirdly, provide a callback function that writes the compressed data directly to a file. From reading around, I think the first step could be achieved using flexi-streams:with-output-to-sequence - see here - but I'm really not sure about the callback function, despite looking at salza2's source. But here's the thing: a single tag can contain an arbitrary number of arbitrarily nested tags, and the "leaf" tags of this structure can each carry a sizeable payload; in other words, a single tag can be quite a lot of data.
So the tag->uncompressed-octets->compressed-octets->file conversion would ideally need to be performed in chunks, and this raises a question that I don't know how to answer, namely: compression formats - AIUI - tend to store in their headers a checksum of their payload data; if I compress the data one chunk at a time and append each compressed chunk to an output file, surely there will be a header and checksum for each chunk, as opposed to a single header and checksum for the entire tag's data, which is what I want? How can I solve this problem? Or is it already handled by salza2?
Thanks for any help, sorry for rambling :)
From what I understand, chipz can't directly decompress multiple consecutive compressed chunks from a single file.
(defun bytes (&rest elements)
  (make-array (length elements)
              :element-type '(unsigned-byte 8)
              :initial-contents elements))

(defun compress (chunk &optional mode)
  (with-open-file (output #P"/tmp/compressed"
                          :direction :output
                          :if-exists mode
                          :if-does-not-exist :create
                          :element-type '(unsigned-byte 8))
    (salza2:with-compressor (c 'salza2:gzip-compressor
                               :callback (salza2:make-stream-output-callback output))
      (salza2:compress-octet-vector chunk c))))

(compress (bytes 10 20 30) :supersede)
(compress (bytes 40 50 60) :append)
Now, /tmp/compressed contains two consecutive chunks of compressed data.
Calling decompress reads the first chunk only:
(chipz:decompress nil 'chipz:gzip #P"/tmp/compressed")
=> #(10 20 30)
Looking at the source of chipz, the stream is read using an internal buffer, which means the bytes that follow the first chunk are probably already read but not decompressed. That explains why, when using two consecutive decompress calls on the same stream, the second one errors with EOF.
(with-open-file (input #P"/tmp/compressed"
                       :element-type '(unsigned-byte 8))
  (list
   #1=(multiple-value-list
       (ignore-errors (chipz:decompress nil 'chipz:gzip input)))
   #1#))
=> ((#(10 20 30))
(NIL #<CHIPZ:PREMATURE-END-OF-STREAM {10155E2163}>))
I don't know how large the data is supposed to be, but if it ever becomes a problem, you might need to change the decompression algorithm so that when we are in the done state (see inflate.lisp), enough data is returned to process the remaining bytes as a new chunk. Or, you compress into different files and use an archive format like TAR (see https://github.com/froydnj/archive).
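As a point of comparison outside Lisp: as far as I know, zlib's gz* interface reads concatenated gzip members transparently, so the two chunks written above would come back as a single stream of six bytes. A small C sketch against the same /tmp/compressed file, assuming that behaviour holds for your zlib version:

#include <stdio.h>
#include <zlib.h>

int main(void)
{
    unsigned char buf[4096];
    gzFile gz = gzopen("/tmp/compressed", "rb");
    if (!gz) return 1;

    int n;
    while ((n = gzread(gz, buf, sizeof buf)) > 0) {
        /* With concatenated members this should print
           10 20 30 40 50 60 in a single pass. */
        for (int i = 0; i < n; i++)
            printf("%d ", buf[i]);
    }
    printf("\n");
    gzclose(gz);
    return 0;
}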

UTF8 Encoding and Network Streams

A client and server communicate with each other via TCP. The server and client send each other UTF-8 encoded messages.
When encoding UTF-8, the amount of bytes per character is variable. It could take one or more bytes to represent a single character.
Let's say that I am reading a UTF-8 encoded message from the network stream and it is a huge message. In my case it was about 145k bytes. Creating a buffer of this size to read from the network stream could lead to an OutOfMemoryException, since the byte array needs that amount of contiguous memory.
It would be best, then, to read from the network stream in a while loop until the entire message is read, reading the pieces into a smaller buffer (probably 4kb) and then decoding the string and concatenating.
What I am wondering is what happens when the very last byte of the read buffer is actually one of the bytes of a character that is represented by multiple bytes. When I decode the read buffer, that last byte and the beginning bytes of the next read would produce either invalid or wrong characters. The quickest way to solve this in my mind would be to encode using a non-variable encoding (like UTF-16) and then make the buffer size a multiple of the bytes per character (a multiple of 2 for UTF-16, a multiple of 4 for UTF-32).
But UTF-8 seems to be a common encoding, which leads me to believe this is a solved problem. Is there another way to address my concern other than changing the encoding? Perhaps using a linked-list type object to store the bytes would be the way to handle this, since it would not use contiguous memory.
It is a solved problem. Woot woot!
http://mikehadlow.blogspot.com/2012/07/reading-utf-8-characters-from-infinite.html
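The general technique (which is what stateful decoders do under the hood, and, if I recall the linked post correctly, what it describes for .NET) is to hold back the trailing bytes of a character that got cut off by the buffer boundary and prepend them to the next read. A rough C sketch of just that boundary check, assuming the stream is valid UTF-8 (utf8_incomplete_tail is a made-up helper):

#include <stddef.h>

/* Return how many bytes at the end of buf[0..len) are the start of an
   incomplete UTF-8 sequence (0 if the buffer ends exactly on a character
   boundary). Carry that many bytes over and prepend them to the next
   network read before decoding. */
static size_t utf8_incomplete_tail(const unsigned char *buf, size_t len)
{
    size_t i = len, back = 0;
    /* Step back over at most 3 continuation bytes (10xxxxxx). */
    while (i > 0 && back < 3 && (buf[i - 1] & 0xC0) == 0x80) {
        i--;
        back++;
    }
    if (i == 0)
        return 0;                              /* nothing but continuation bytes */
    unsigned char lead = buf[i - 1];
    size_t need;
    if ((lead & 0x80) == 0x00) need = 1;       /* ASCII */
    else if ((lead & 0xE0) == 0xC0) need = 2;
    else if ((lead & 0xF0) == 0xE0) need = 3;
    else if ((lead & 0xF8) == 0xF0) need = 4;
    else return 0;                             /* invalid lead byte: let the decoder complain */
    size_t have = back + 1;
    return (have < need) ? have : 0;           /* partial sequence -> carry it over */
}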
