Usocket unsigned byte 8 doesn't receive data while character element type does - network-programming

I've run into some truly puzzling behavior with the USocket library. Consider the following snippet:
(defvar server-socket (usocket:socket-listen "localhost" 43593
:element-type
'(unsigned-byte 8)))
(defvar client-connection (usocket:socket-accept server-socket))
;in a separate terminal, type "telnet localhost 43593".
;then type some text and hit enter.
(listen (usocket:socket-stream client-connection))
=> NIL
Why is this happening? When I leave out :element-type '(unsigned-byte 8) from the arguments to usocket:socket-listen, it works just fine. I could understand if any arbitrary bytes couldn't be represented as characters (utf-8 encoding for example has invalid sequences of bytes), but the inverse - characters that can't be represented by bytes - makes no sense, especially in a network context.
(I'm running clisp-2.49 on Lubuntu 15.10, USocket 0.6.3.2, in case that helps).

Turns out the issue was in the precise wording used by the documentation for listen in the hyperspec (http://www.lispworks.com/documentation/HyperSpec/Body/f_listen.htm).
Returns true if there is a character immediately available from input-stream; otherwise, returns false. On a non-interactive input-stream, listen returns true except when at end of file[1]. If an end of file is encountered, listen returns false. listen is intended to be used when input-stream obtains characters from an interactive device such as a keyboard.
Since the socket-stream doesn't produce characters if it's told to produce '(unsigned-byte 8)'s, listen will return NIL for the stream regardless of whether it has data ready to be read.
As far as I know, there is no alternative to listen for non-character types in the standard. Use usocket's wait-for-input instead, with :timeout set to 0 for polling (http://quickdocs.org/usocket/api).

Related

Lua: How to gzip a string (gzip, not zlib) in memory?

Given a string, how can I compress it in memory with gzip? I'm using Lua.
It sounds like an easy problem, but there is a huge list of libraries. So far, all that I tried were either dead or I could produce only zlib compressed strings. In my use case, I need gzip compression, as it is expected by the receiver.
As a test, if you dump the compressed string to a file, zcat should be able to decompress it.
I'm using OpenResty, so any Lua library should be fine.
(The only solution that I got working so far is to dump the string in a file, call os.execute("gzip /tmp/example.txt") and read it back. Unfortunately, that is not a practical solution.)
It turns out that zlib is not far away from gzip. The difference is that gzip has an additional header.
To get this header, you can use lua-zlib like this:
local zlib = require "zlib"
-- input: string
-- output: string compressed with gzip
function compress(str)
local level = 5
local windowSize = 15+16
return zlib.deflate(level, windowSize)(str, "finish")
end
Explanation:
The second parameter of deflate is the window size. It makes sure that a gzip header is written. If you omit the parameter, you get a zlib compressed string.
level is the gzip compression level (1=worst to 9=best)
Here is the documentation of the deflate (source: lua-zlib documentation):
function stream = zlib.deflate([ int compression_level ], [ int window_size ])
If no compression_level is provided uses Z_DEFAULT_COMPRESSION (6),
compression level is a number from 1-9 where zlib.BEST_SPEED is 1
and zlib.BEST_COMPRESSION is 9.
Returns a "stream" function that compresses (or deflates) all
strings passed in. Specifically, use it as such:
string deflated, bool eof, int bytes_in, int bytes_out =
stream(string input [, 'sync' | 'full' | 'finish'])
Takes input and deflates and returns a portion of it,
optionally forcing a flush.
A 'sync' flush will force all pending output to be flushed to
the return value and the output is aligned on a byte boundary,
so that the decompressor can get all input data available so
far. Flushing may degrade compression for some compression
algorithms and so it should be used only when necessary.
A 'full' flush will flush all output as with 'sync', and the
compression state is reset so that decompression can restart
from this point if previous compressed data has been damaged
or if random access is desired. Using Z_FULL_FLUSH too often
can seriously degrade the compression.
A 'finish' flush will force all pending output to be processed
and results in the stream become unusable. Any future
attempts to print anything other than the empty string will
result in an error that begins with IllegalState.
The eof result is true if 'finish' was specified, otherwise
it is false.
The bytes_in is how many bytes of input have been passed to
stream, and bytes_out is the number of bytes returned in
deflated string chunks.

What does 16#4000000 mean in Erlang?

I'm reading ejabberd source, specifically ejabberd_http.erl.
In the code below,
...
case (State#state.sockmod):recv(State#state.socket,
min(Len, 16#4000000), 300000)
of
{ok, Data} ->
recv_data(State, Len - byte_size(Data), <<Acc/binary, Data/binary>>);
...
What does 16#4000000 mean?
I've tested this in the Erlang shell.
%%erlang shell
...
7>16#4000000.
67108864
8>is_integer(16#4000000).
true
I know it's just an integer value.
Is there any advantage to writing 16#4000000 instead of 67108864?
In Erlang, the number before the # is the integer base. In your example, 16#4000000 means the hexadecimal representation of 67108864. In other languages it is often represented as 0x4000000.
One reason for using the hex representation is because each digit represents 4 bits, for example 16#F is 16 (in decimal), or 1111 in binary. When working with binary processing, using base 16 makes it easier to handle and understand for the human reader.

Word similar to "parse-word", but for pushing integers onto the stack

I would like to implement a DSL for setting port numbers on a socket object.
I would like the DSL to follow this API for setting the host port number:
host: 8080
If this were a string operation (such as host: localhost) I could use parse-word. That's less than ideal though, since Forth is very good at parsing numbers, and re-inventing the wheel is a bad thing.
Are there any standard words in Forth that take the first item on the input string, parse it to a number and push it on the stack?
>NUMBER is an ANS word (in CORE) that turns strings into numbers, but it's cumbersome to use. Your Forth probably has a more flexible variant. Your Forth probably also supports syntax like #16 $10 %10000, which all evaluate to 16, regardless of BASE. So one way to do this:
: parse-num ( "number" -- n | d | r ) parse-word evaluate ;
Or with >NUMBER, and only returning a single-cell number:
: parse-num ( "number" -- n )
0. parse-word >number ( d c-addr u )
abort" not a number" drop
abort" double received where single-cell number expected" ;
Which aborts if the returned string isn't the empty string that would result if the entire output of PARSE-WORD was converted into a number, or if the high bits of the double aren't 0, which would be the case if a number not representable by a cell were entered. (NB. >NUMBER doesn't handle double-number syntax either. It will stop parsing 1. at the dot. It doesn't even handle negative numbers.)

Can I get a list of all currently-registered atoms?

My project has blown through the max 1M atoms, we've cranked up the limit, but I need to apply some sanity to the code that people are submitting with regard to list_to_atom and its friends. I'd like to start by getting a list of all the registered atoms so I can see where the largest offenders are. Is there any way to do this. I'll have to be creative about how I do it so I don't end up trying to dump 1-2M lines in a live console.
You can get hold of all atoms by using an undocumented feature of the external term format.
TL;DR: Paste the following line into the Erlang shell of your running node. Read on for explanation and a non-terse version of the code.
(fun F(N)->try binary_to_term(<<131,75,N:24>>) of A->[A]++F(N+1) catch error:badarg->[]end end)(0).
Elixir version by Ivar Vong:
for i <- 0..:erlang.system_info(:atom_count)-1, do: :erlang.binary_to_term(<<131,75,i::24>>)
An Erlang term encoded in the external term format starts with the byte 131, then a byte identifying the type, and then the actual data. I found that EEP-43 mentions all the possible types, including ATOM_INTERNAL_REF3 with type byte 75, which isn't mentioned in the official documentation of the external term format.
For ATOM_INTERNAL_REF3, the data is an index into the atom table, encoded as a 24-bit integer. We can easily create such a binary: <<131,75,N:24>>
For example, in my Erlang VM, false seems to be the zeroth atom in the atom table:
> binary_to_term(<<131,75,0:24>>).
false
There's no simple way to find the number of atoms currently in the atom table*, but we can keep increasing the number until we get a badarg error.
So this little module gives you a list of all atoms:
-module(all_atoms).
-export([all_atoms/0]).
atom_by_number(N) ->
binary_to_term(<<131,75,N:24>>).
all_atoms() ->
atoms_starting_at(0).
atoms_starting_at(N) ->
try atom_by_number(N) of
Atom ->
[Atom] ++ atoms_starting_at(N + 1)
catch
error:badarg ->
[]
end.
The output looks like:
> all_atoms:all_atoms().
[false,true,'_',nonode#nohost,'$end_of_table','','fun',
infinity,timeout,normal,call,return,throw,error,exit,
undefined,nocatch,undefined_function,undefined_lambda,
'DOWN','UP','EXIT',aborted,abs_path,absoluteURI,ac,accessor,
active,all|...]
> length(v(-1)).
9821
* In Erlang/OTP 20.0, you can call erlang:system_info(atom_count):
> length(all_atoms:all_atoms()) == erlang:system_info(atom_count).
true
I'm not sure if there's a way to do it on a live system, but if you can run it in a test environment you should be able to get a list via crash dump. The atom table is near the end of the crash dump format. You can create a crash dump via erlang:halt/1, but that will bring down the whole runtime system.
I dare say that if you use more than 1M atoms, then you are doing something wrong. Atoms are intended to be static as soon as the application runs or at least upper bounded by some small number, 3000 or so for a medium sized application.
Be very careful when an enemy can generate atoms in your vm. especially calls like list_to_atom/1 is somewhat dangerous.
EDITED (wrong answer..)
You can adjust number of atoms with +t
http://www.erlang.org/doc/efficiency_guide/advanced.html
..but I know very few use cases when it is necessary.
You can track atom stats with erlang:memory()

TCP/IP Client / Server commands data

I have a Client/Server architecture (C# .Net 4.0) that send's command packets of data as byte arrays. There is a variable number of parameters in any command, and each paramater is of variable length. Because of this I use delimiters for the end of a parameter and the command as a whole. The operand is always 2 bytes and both types of delimiter are 1 byte. The last parameter_delmiter is redundant as command_delmiter provides the same functionality.
The command structure is as follow:
FIELD SIZE(BYTES)
operand 2
parameter1 x
parameter_delmiter 1
parameter2 x
parameter_delmiter 1
parameterN x
.............
.............
command_delmiter 1
Parameters are sourced from many different types, ie, ints, strings etc all encoded into byte arrays.
The problem I have is that sometimes parameters when encoded into byte arrays contain bytes that are the same value as a delimiter. For example command_delmiter=255.. and a paramater may have that byte inside of it.
There is 3 ways I can think of fixing this:
1) Encode the parameters differently so that they can never be the same value as a delimiter (255 and 254) Modulus?. This will mean that paramaters will become larger, ie Int16 will be more than 2 bytes etc.
2) Do not use delimiters at all, use count and length values at the start of the command structure.
3) Use something else.
To my knowledge, the way TCP/IP buffers work is that SOME SORT of delimiter has to be used to seperate 'commands' or 'bundles of data' as a buffer may contain multiple commands, or a command may span multiple buffers.. So this
BinaryReader / Writer seems like an obvious candidate, the only issue is that the byte array may contain multiple commands ( with parameters inside). So the byte array would still have to be chopped up in order to feel into the BinaryReader.
Suggestions?
Thanks.
The standard way to do this is to have the length of the message in the (fixed) first few bytes of a message. So you could have the first 4 bytes to denote the length of a message, read those many bytes for the content of the message. The next 4 bytes would be the length of the next message. A length of 0 could indicate end of messages. Or you could use a header with a message count.
Also, remember TCP is a byte stream, so don't expect a complete message to be available every time you read data from a socket. You could receive an arbitrary number of bytes at ever read.

Resources