Lua: How to gzip a string (gzip, not zlib) in memory?

Given a string, how can I compress it in memory with gzip? I'm using Lua.
It sounds like an easy problem, but the list of libraries is huge. So far, every library I tried was either dead or produced only zlib-compressed strings. In my use case I need gzip compression, since that is what the receiver expects.
As a test, if you dump the compressed string to a file, zcat should be able to decompress it.
I'm using OpenResty, so any Lua library should be fine.
(The only solution I have gotten working so far is to dump the string to a file, call os.execute("gzip /tmp/example.txt"), and read the result back. Unfortunately, that is not a practical solution.)

It turns out that zlib is not far away from gzip. The difference is that gzip adds its own header (and checksum trailer) around the compressed data.
To get this header, you can use lua-zlib like this:
local zlib = require "zlib"

-- input: string
-- output: string compressed with gzip
function compress(str)
  local level = 5
  local windowSize = 15 + 16
  return zlib.deflate(level, windowSize)(str, "finish")
end
Explanation:
The second parameter of deflate is the window size. Adding 16 to the base window size of 15 tells zlib to write a gzip header and trailer instead of a zlib one. If you omit the parameter, you get a zlib-compressed string.
level is the compression level (1 = best speed, 9 = best compression)
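To verify the result in memory rather than by piping a file through zcat, you can decompress with the same library. This is a minimal round-trip sketch, assuming lua-zlib's inflate, whose default window size auto-detects zlib and gzip headers:

-- Round-trip check: compress, then inflate. The first value returned by
-- the inflate stream function is the decompressed string.
local gzipped = compress("hello, world")
local restored = zlib.inflate()(gzipped)
assert(restored == "hello, world")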
Here is the documentation of deflate (source: lua-zlib documentation):
function stream = zlib.deflate([ int compression_level ], [ int window_size ])

    If no compression_level is provided uses Z_DEFAULT_COMPRESSION (6),
    compression level is a number from 1-9 where zlib.BEST_SPEED is 1
    and zlib.BEST_COMPRESSION is 9.

    Returns a "stream" function that compresses (or deflates) all
    strings passed in. Specifically, use it as such:

    string deflated, bool eof, int bytes_in, int bytes_out =
        stream(string input [, 'sync' | 'full' | 'finish'])

    Takes input and deflates and returns a portion of it,
    optionally forcing a flush.

    A 'sync' flush will force all pending output to be flushed to
    the return value and the output is aligned on a byte boundary,
    so that the decompressor can get all input data available so
    far. Flushing may degrade compression for some compression
    algorithms and so it should be used only when necessary.

    A 'full' flush will flush all output as with 'sync', and the
    compression state is reset so that decompression can restart
    from this point if previous compressed data has been damaged
    or if random access is desired. Using Z_FULL_FLUSH too often
    can seriously degrade the compression.

    A 'finish' flush will force all pending output to be processed
    and results in the stream becoming unusable. Any future
    attempts to print anything other than the empty string will
    result in an error that begins with IllegalState.

    The eof result is true if 'finish' was specified, otherwise
    it is false.

    The bytes_in is how many bytes of input have been passed to
    stream, and bytes_out is the number of bytes returned in
    deflated string chunks.
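For example, the stream function can be fed in chunks and flushed once at the end. A small sketch (the chunked input here is made up):

-- One stream object compresses many chunks; 'finish' is passed only once,
-- together with the last piece of input.
local stream = zlib.deflate(5, 15 + 16)
local parts = {}
parts[#parts + 1] = stream("first chunk, ")
parts[#parts + 1] = stream("second chunk", "finish")
local gzipped = table.concat(parts)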

Related

Usocket unsigned byte 8 doesn't receive data while character element type does

I've run into some truly puzzling behavior with the USocket library. Consider the following snippet:
(defvar server-socket (usocket:socket-listen "localhost" 43593
                                              :element-type '(unsigned-byte 8)))
(defvar client-connection (usocket:socket-accept server-socket))

;; In a separate terminal, type "telnet localhost 43593",
;; then type some text and hit enter.
(listen (usocket:socket-stream client-connection))
;; => NIL
Why is this happening? When I leave out :element-type '(unsigned-byte 8) from the arguments to usocket:socket-listen, it works just fine. I could understand if arbitrary bytes couldn't be represented as characters (UTF-8 encoding, for example, has invalid byte sequences), but the inverse, characters that can't be represented by bytes, makes no sense, especially in a network context.
(I'm running clisp-2.49 on Lubuntu 15.10, USocket 0.6.3.2, in case that helps).
It turns out the issue was in the precise wording used by the documentation for listen in the HyperSpec (http://www.lispworks.com/documentation/HyperSpec/Body/f_listen.htm):
Returns true if there is a character immediately available from input-stream; otherwise, returns false. On a non-interactive input-stream, listen returns true except when at end of file[1]. If an end of file is encountered, listen returns false. listen is intended to be used when input-stream obtains characters from an interactive device such as a keyboard.
Since the socket-stream doesn't produce characters when it's told to produce (unsigned-byte 8)s, listen returns NIL for the stream regardless of whether it has data ready to be read.
As far as I know, the standard offers no alternative to listen for non-character streams. Use usocket's wait-for-input instead, with :timeout set to 0 for polling (http://quickdocs.org/usocket/api).

Why is "no code allowed to be all ones" in libjpeg's Huffman decoding?

I'm trying to satisfy myself that METEOSAT images I'm getting from their FTP server are actually valid images. My doubt arises because all the tools I've used so far complain about "Bogus Huffman table definition" - yet when I simply comment out that error message, the image appears quite plausible (a greyscale segment of the Earth's disc).
From https://github.com/libjpeg-turbo/libjpeg-turbo/blob/jpeg-8d/jdhuff.c#L379:
while (huffsize[p]) {
  while (((int) huffsize[p]) == si) {
    huffcode[p++] = code;
    code++;
  }
  /* code is now 1 more than the last code used for codelength si; but
   * it must still fit in si bits, since no code is allowed to be all ones.
   */
  if (((INT32) code) >= (((INT32) 1) << si))
    ERREXIT(cinfo, JERR_BAD_HUFF_TABLE);
  code <<= 1;
  si++;
}
If I simply comment out the check, or add a check that huffsize[p] is nonzero (as in the containing loop's controlling expression), then djpeg manages to convert the image to a BMP that I can view with few problems.
Why does the comment claim that all-ones codes are not allowed?
It claims that because they are not allowed. That doesn't mean that there can't be images out there that don't comply with the standard.
The reason they are not allowed is this (from the standard):
Making entropy-coded segments an integer number of bytes is performed
as follows: for Huffman coding, 1-bits are used, if necessary, to pad
the end of the compressed data to complete the final byte of a
segment.
If an all-ones code were allowed, the last byte of compressed data could be ambiguous: the 1-bits used for padding could themselves decode as another coded symbol.
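Here is a toy illustration of that ambiguity (in Lua, like the main question on this page; the two-symbol code table is made up and far simpler than a real JPEG table). If the all-ones code "11" were assigned to a symbol, the padding bits would decode as phantom symbols:

-- Toy prefix code: "11" is the all-ones code of length 2, which JPEG forbids.
local codes = { ["0"] = "A", ["10"] = "B", ["11"] = "C" }

local function decode(bits)
  local out, cur = {}, ""
  for b in bits:gmatch("[01]") do
    cur = cur .. b
    if codes[cur] then
      out[#out + 1] = codes[cur]
      cur = ""
    end
  end
  return table.concat(out)
end

-- The real payload is "B", "A" (bits "100"); five 1-bits pad out the byte.
-- The decoder cannot tell padding from data and emits two phantom "C"s.
print(decode("100" .. "11111"))  --> BACC

With "11" unassigned, the trailing run of 1-bits can never complete a code, so the decoder knows it is looking at padding.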

Why does Stata store Boolean values as float?

I have a variable that looks like this " 88.0*" or " 79.5 " where the asterisk is a flag for something. To extract this flag I run
gen newvar = regexm(oldvar,"\*$")
This works fine, but my new variable is a float, which seems inefficient.
Stata offers storage in byte format, so why doesn't the regexm command (which indicates 0/1 whether a match was found) default to that? For that matter, why doesn't generate (abbreviated gen above) compress the right-hand side by default, or at least as an option?
You can specify the storage type after the gen:
clear
set more off
input ///
str5(var1 var2)
"88.0*" "79.5 "
end
list
gen byte newvar = regexm(var1,"\*$")
list
describe
Note that Stata has no boolean type. A 0 is false, a 1 is true. The syntax for generate is (from help generate):
generate [type] newvar[:lblname] =exp [if] [in]
The type appears between square brackets, which means it is optional.
See also help compress to reduce memory used by the data.

How to serve Array[Byte] from spray.io

I am using the following path in my spray-can server (using spray 1.2):
path("my"/"path"){
get{
complete{
val buf:Array[Byte] = functionReturningArrayofByte()
println(buf.length)
buf
}
}
}
The length of the buffer (and what is printed by the code above) is 2,263,503 bytes. However, when accessing my/path from a web browser, it downloads a file that is 10,528,063 bytes long.
I thought spray set the content type to application/octet-stream, and the content length, automatically when completing with an Array[Byte]. I don't see what I'm doing wrong.
EDIT
I've run a small test and saw that the array of bytes is output as a String. For example, if the array held the two bytes 0xFF and 0x01, the output would not be the two raw bytes but the string [ 255, 1 ]. I just don't know how to make it output the raw content instead of a string representation of it.
Wrapping buf in HttpData solves the problem; the bytes are then sent as binary entity data rather than being rendered to a string:
path("my"/"path"){
get{
complete{
val buf:Array[Byte] = functionReturningArrayofByte()
HttpData(buf)
}
}
}

TCP/IP Client / Server commands data

I have a client/server architecture (C# .NET 4.0) that sends command packets of data as byte arrays. There is a variable number of parameters in any command, and each parameter is of variable length. Because of this I use delimiters for the end of a parameter and for the command as a whole. The operand is always 2 bytes and both types of delimiter are 1 byte. The last parameter_delimiter is redundant, as command_delimiter provides the same functionality.
The command structure is as follows:

FIELD                 SIZE (BYTES)
operand               2
parameter1            x
parameter_delimiter   1
parameter2            x
parameter_delimiter   1
...
parameterN            x
command_delimiter     1
Parameters are sourced from many different types, i.e. ints, strings, etc., all encoded into byte arrays.
The problem I have is that parameters, once encoded into byte arrays, sometimes contain bytes with the same value as a delimiter. For example, command_delimiter = 255, and a parameter may contain that byte value.
There are 3 ways I can think of fixing this:
1) Encode the parameters differently so that they can never contain a delimiter value (255 and 254): modulus arithmetic, perhaps? This means parameters become larger, i.e. an Int16 will take more than 2 bytes, etc. (A byte-stuffing sketch of this idea follows this list.)
2) Do not use delimiters at all; put count and length values at the start of the command structure.
3) Use something else.
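For what it's worth, option 1 is usually done by byte stuffing rather than by re-encoding values. A sketch (in Lua, like the main question on this page; the escape byte 0xFD and the substitution bytes are made up):

-- Byte stuffing: reserve 0xFD as an escape byte so that 0xFE and 0xFF
-- (the two delimiters) can never appear inside an encoded parameter.
local function stuff(s)
  return (s:gsub("[\253\254\255]", {
    ["\253"] = "\253\001",
    ["\254"] = "\253\002",
    ["\255"] = "\253\003",
  }))
end

-- Reverses stuff(); the payload comes back byte-for-byte identical.
local function unstuff(s)
  return (s:gsub("\253([\001\002\003])", {
    ["\001"] = "\253",
    ["\002"] = "\254",
    ["\003"] = "\255",
  }))
end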
To my knowledge, the way TCP/IP buffers work is that SOME SORT of delimiter has to be used to separate 'commands' or 'bundles of data', since a buffer may contain multiple commands, or a command may span multiple buffers. So some form of framing seems unavoidable.
BinaryReader / BinaryWriter seems like an obvious candidate; the only issue is that the byte array may contain multiple commands (with parameters inside), so the byte array would still have to be chopped up in order to feed it into the BinaryReader.
Suggestions?
Thanks.
The standard way to do this is to put the length of the message in the (fixed-size) first few bytes of the message. So you could use the first 4 bytes to denote the length of the message, then read that many bytes for its content. The next 4 bytes would be the length of the next message. A length of 0 could indicate the end of messages. Alternatively, you could use a header with a message count.
Also, remember that TCP is a byte stream, so don't expect a complete message to be available every time you read data from a socket. You could receive an arbitrary number of bytes at every read.
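A minimal sketch of that framing (in Lua, like the main question on this page; read stands in for an assumed helper that blocks until exactly n bytes have arrived, the way LuaSocket's sock:receive(n) does):

-- Length-prefix framing: a 4-byte big-endian length replaces in-band
-- delimiters entirely (string.pack/unpack require Lua 5.3+).
local function frame(payload)
  return string.pack(">I4", #payload) .. payload
end

local function read_frame(read)
  local len = string.unpack(">I4", read(4))  -- fixed-size header first
  return read(len)                           -- then exactly len bytes of body
end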
