Erlang: unmarshalling variable length data fields in binary stream - parsing

I'm creating an Erlang application that needs to parse a binary TCP stream from a 3rd party program.
One of the types of packets I can receive has data formatted like this:
N_terms *[Flags ( 8 bits ), Type ( 8 bits ), [ optional data ] ].
The problem is that which optional data is present depends on the combination of flags and type. In addition, some types carry extra optional data of their own.
If I were to write a parser in an imperative language, I'd simply read in the 2 fields and then have a series of if( ... ) statements where I would read a value and increment my position in the stream. In Erlang, my initial naive assumption is that I would have 2^N function clauses to match byte syntax on the stream, where N is total number of flags + all types with additional optional data.
As it stands, at a minimum I have 3 flags and 1 type that has optional data that I must implement, which would mean I'd have 16 different function clauses to match on the stream.
There must be a better, idiomatic way to do this in Erlang - what am I missing?
Edit:
I should clarify I do not know the number of terms in advance.

One solution is to take
<<Flag:8/integer, Type:8/integer, Rest/binary>>
and then write a function decode(Flag, Type) which returns a description of what Rest will contain. Now, that description can then be passed to a decoder for Rest which can then use the description given to parse it correctly. A simple solution is to make the description into a list and whenever you decode something off of the stream, you use that description list to check that it is valid. That is, the description acts like your if.. construction.
As for moving the pointer, that is easy. If you have a Binary and decode it with
<<Take:N/binary, Next/binary>> = Binary,
then Next is, under the hood, the advanced pointer you are looking for. So you can just break your binary into pieces like that to obtain the next part to work on.
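To make the description idea concrete, here is a minimal sketch. The flag masks, the field names and widths, and the length-prefixed payload for type 1 are all invented for illustration; the real protocol will differ. The point is that each flag contributes one entry to the description list, so the number of clauses grows with the number of field kinds rather than with 2^N.

```erlang
-module(term_decoder).
-export([decode_term/1]).

%% Hypothetical layout: every set flag bit adds one fixed-width field,
%% and type 1 adds a payload prefixed by a one-byte length.
-define(FLAG_A, 2#001).
-define(FLAG_B, 2#010).
-define(FLAG_C, 2#100).

%% Decode one term and return the unconsumed remainder of the stream.
decode_term(<<Flags:8, Type:8, Rest/binary>>) ->
    decode_fields(describe(Flags, Type), Rest, []);
decode_term(_) ->
    {error, truncated}.

%% Build the description: a list of {Name, Bits} or {Name, prefixed}.
describe(Flags, Type) ->
    Table = [{?FLAG_A, {a, 16}}, {?FLAG_B, {b, 32}}, {?FLAG_C, {c, 8}}],
    [Field || {Mask, Field} <- Table, Flags band Mask =/= 0]
        ++ type_fields(Type).

type_fields(1) -> [{payload, prefixed}];
type_fields(_) -> [].

%% Walk the description, taking one field off the binary per entry.
decode_fields([], Rest, Acc) ->
    {ok, lists:reverse(Acc), Rest};
decode_fields([{Name, prefixed} | Fields],
              <<Len:8, Data:Len/binary, Rest/binary>>, Acc) ->
    decode_fields(Fields, Rest, [{Name, Data} | Acc]);
decode_fields([{Name, Bits} | Fields], Bin, Acc) when is_integer(Bits) ->
    case Bin of
        <<Value:Bits, Rest/binary>> ->
            decode_fields(Fields, Rest, [{Name, Value} | Acc]);
        _ ->
            {error, truncated}
    end;
decode_fields(_, _, _) ->
    {error, truncated}.
```

Since decode_term/1 hands back the remainder, it can be called in a loop until the binary is empty, which also covers the case where the number of terms is not known in advance.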

I would parse it something like:
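%% extract_optional/3 is not shown: it must consume whatever optional
%% fields Flag and Type imply and return {Optional, RemainingBinary}.
parse_term(Acc,0,<<>>) -> {ok,lists:reverse(Acc)};  % terms back in stream order
parse_term(_,0,_) -> {error,garbage};
parse_term(Acc,N,<<Flag:8/integer,Type:8/integer,Rest/binary>>) ->
    {Optional,Rest1} = extract_optional(Flag,Type,Rest),
    parse_term([{Flag,Type,Optional}|Acc],N-1,Rest1).
parse_stream(<<NTerms/integer,Rest/binary>>)->
    parse_term([],NTerms,Rest).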


How to force nom to parse the whole input string?

I am working with nom version 6.1.2 and I am trying to parse Strings like
A 2 1 2.
At the moment I would be happy to at least distinguish input that meets the requirements from input that doesn't. (After that I would like to change the output to a tuple with the "A" as the first value and a vector of the u16 numbers as the second.)
The string always has to start with a capital A, followed by at least one space and then a number. After that there can be as many additional spaces and numbers as you want. The only requirement is that the string ends with a number and not with a space. All numbers will be within the range of u16. I already wrote the following function:
extern crate nom;
use nom::sequence::{preceded, pair};
use nom::character::streaming::{char, space1};
use nom::combinator::recognize;
use nom::multi::many1;
use nom::character::complete::digit1;
use nom::IResult;
pub fn parse_and(line: &str) -> IResult<&str, &str> {
    preceded(
        char('A'),
        recognize(
            many1(
                pair(
                    space1,
                    digit1
                )
            )
        )
    )(line)
}
I also want to mention that there are answers to this kind of problem which use CompleteStr, but that isn't an option anymore because it was removed some time ago.
People explained that the reason for this behavior is that nom doesn't know where the string slice ends, so for the example input above I get parse_and: Err(Incomplete(Size(1))).
It seems that one of the use declarations caused the problem. The documentation (in a paragraph too far down for me to have noticed it) says:
"
Streaming / Complete
Some of nom's modules have streaming or complete submodules. They hold different variants of the same combinators.
A streaming parser assumes that we might not have all of the input data. This can happen with some network protocol or large file parsers, where the input buffer can be full and need to be resized or refilled.
A complete parser assumes that we already have all of the input data. This will be the common case with small files that can be read entirely to memory.
"
Therefore, the solution to my problem is to write use nom::character::complete::{char, space1}; instead of use nom::character::streaming::{char, space1}; (the 3rd line of code, not counting empty lines). That worked for me :)

Calculation of Internet Checksum of two 16-bit streams

I want to calculate the Internet checksum of two bit streams of 16 bits each. Do I need to break these strings into segments, or can I sum them directly?
Here are the strings:
String 1 = 1010001101011111
String 2 = 1100011010000110
Short answer
No. You don't need to split them.
Somewhat longer answer
Not sure exactly what you mean by "internet" checksum (a hash or checksum is just the result of a mathematical operation, and has no direct relation or dependence on the internet), but anyway:
The checksum of any value should not depend on the length of the input. In theory, your input strings could be of any length at all.
You can test this with a basic online checksum generator such as this one, for instance. That appears to generate a whole slew of checksums using lots of different algorithms. The names of the algorithms appear on the left in the list.
If you want to do this in code, a good starting point might be to search for examples using one of them in whatever language / environment you are working in.
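If "Internet checksum" here means the one's-complement checksum described in RFC 1071 (the one used in IP, TCP and UDP headers), then each of the two strings is already a single 16-bit word, so they can be summed as they are. A minimal sketch in Erlang under that assumption; the module and function names are illustrative, and the input is assumed to hold a whole number of 16-bit words:

```erlang
-module(inet_checksum).
-export([checksum/1]).

%% Sum the 16-bit words, fold any carry back into the low 16 bits
%% (one's-complement addition), then invert the result.
checksum(Bin) when is_binary(Bin), byte_size(Bin) rem 2 =:= 0 ->
    Sum = lists:sum([Word || <<Word:16>> <= Bin]),
    16#FFFF bxor fold(Sum).

%% End-around carry: repeat until the sum fits in 16 bits.
fold(Sum) when Sum > 16#FFFF ->
    fold((Sum band 16#FFFF) + (Sum bsr 16));
fold(Sum) ->
    Sum.
```

For the two strings in the question, 1010001101011111 + 1100011010000110 = 1 0110100111100101; folding the carry gives 0110100111100110, and inverting gives 1001011000011001 (16#9619), which is what checksum(<<2#1010001101011111:16, 2#1100011010000110:16>>) returns.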

How to check whether input is a string in Erlang?

I would like to write a function to check if the input is a string or not like this:
is_string(Input) ->
    case check_if_string(Input) of
        true -> {ok, Input};
        false -> error
    end.
But I found it is tricky to check whether the input is a string in Erlang.
The string definition in Erlang is here: http://erlang.org/doc/man/string.html.
Any suggestions?
Thanks in advance.
In Erlang a string can be actually quite a few things, so there are a few ways to do this depending on exactly what you mean by "a string". It is worth bearing in mind that every sort of string in Erlang is a list of character or lexeme values of some sort.
Encodings are not simple things, particularly when Unicode is involved. Characters can be almost arbitrarily high values, lexemes are globbed together in deep lists of integers, and Erlang iolist()s (which are super useful) are deep lists of mixed integer and binary values that get automatically flattened and converted during certain operations. If you are dealing with anything other than flat lists of printable ASCII values then I strongly recommend you read these:
Unicode module docs
String module docs
IO Library module docs
So... this is not a very simple question.
What to do about all the confusion?
Quick answer that always works: Consider the origin of the data.
You should know what kind of data you are dealing with, whether it is coming over a socket or from a file, or especially if you are generating it yourself. On the edges of your system you may need some help purifying data, though, because network clients send all sorts of random trash from time to time.
Some helper functions for the most common cases live in the io_lib module:
io_lib:char_list/1: Returns true if the input is a flat list of characters in the unicode range.
io_lib:deep_char_list/1: Returns true if the input is a deep list of legal chars.
io_lib:deep_latin1_char_list/1: Returns true if the input is a deep list of Latin-1 characters (code points 0 to 255, which include the basic printable ASCII values from 32 to 126).
io_lib:latin1_char_list/1: Returns true if the input is a flat list of Latin-1 characters (90% of the time this is what you're looking for)
io_lib:printable_latin1_list/1: Returns true if the input is a list of printable Latin-1 (If the above isn't what you wanted, 9% of the time this is the one you want)
io_lib:printable_list/1: Returns true if the input is a flat list of printable chars.
io_lib:printable_unicode_list/1: Returns true if the input is a flat list of printable unicode chars (for that 1% of the time that this is your problem -- except that for some of us, myself included here in Japan, this covers 99% of my input checking cases).
For more particular cases you can either use a regex from the re module or write your own recursive function that zips through a string for those special cases where a regex either doesn't fit, is impossible, or could make you vulnerable to regex attacks.
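As a sketch of how one of these helpers slots into the function from the question, assuming "string" means a flat list of printable Unicode characters (swap in another io_lib check if your definition differs; the module name is illustrative):

```erlang
-module(string_check).
-export([is_string/1]).

%% Accept only flat lists of printable Unicode characters.
%% Note that the empty list counts as a (empty) string here.
is_string(Input) when is_list(Input) ->
    case io_lib:printable_unicode_list(Input) of
        true -> {ok, Input};
        false -> error
    end;
is_string(_) ->
    error.
```

With this definition, binaries such as <<"abc">> and lists containing control codes such as [0, 1] are rejected.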
In Erlang, a string can be represented by a list or by a binary.
If the string is a list, you can use the following function to check it:
%% True for a flat list of integers in the range 0..255.
is_string([C|T]) when is_integer(C), C >= 0, C =< 255 ->
    is_string(T);
is_string([]) ->
    true;
is_string(_) ->
    false.
If the string is a binary, the built-in function is_binary(Term) can be used.

How to create a string outside of Erlang that represents a DICT Term?

I want to construct a string in Java that represents a DICT term and pass it to an Erlang process, where it will be converted back into an Erlang term (string-to-term).
I can achieve this easily for ORDDICTs, since they are structured as a simple sorted list of key/value tuples such as: [ {field1 , "value1"} , {field2 , "value2"} ]
DICTs, however, are compiled into a specific term whose structure I would like to reverse-engineer. I am aware this structure can change in new releases, but the benefits for performance and ease of integration with Java would outweigh this. Unfortunately Erlang's JInterface is based on simple data structures. An efficient DICT type would be of great use.
A simple dict gets defined as follows:
D1 = dict:store("field1","AAA",dict:new()).
{dict,1,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
{{[],[],[],[],[],[],[],[],
[["field1",65,65,65]],
[],[],[],[],[],[],[]}}}
As can be seen above, there are some values whose meaning I do not understand (the numbers 1,16,16,8,80,48 and a set of empty lists, which likely represent something as well).
Adding two other rows (key-value pairs) causes the data to look like:
D3 = dict:store("field3","CCC",D2).
{dict,3,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
{{[],[],
[["field3",67,67,67]],
[],[],[],[],[],
[["field1",65,65,65]],
[],[],[],[],
[["field2",66,66,66]],
[],[]}}}
From the above I can notice that:
the first number (3) represents the number of items in the DICT.
the second number (16) shows the number of list slots in the first tuple of lists.
the third number (16) shows the number of list slots in the second tuple of lists, which is where the values ended up being placed (in the middle).
the fourth number (8) appears to be the number of slots in the second row of tuples from where the values are placed (a sort of index pointer).
the remaining numbers (80 and 48)... no idea...
adding a key "field0" places it not at the end but just after "field1"'s data. This hints at the indexing approach.
So the question is: is there a way (an algorithm) to reliably create a DICT string directly from outside of Erlang?
The complete specification of how dict is implemented can be found in the dict.erl source code.
But I'm not sure replicating dict.erl's implementation in Java is worthwhile. This would only make sense if you want a fast dict like data structure that you need to pass often between Java and Erlang code. It might make more sense to use a Key-Value store both from Erlang and Java without passing it directly around. Depending on your application this could be e.g. riak or maybe even connect your different language worlds with RabbitMQ. Both examples are implemented in Erlang and are easily accessible from both worlds.
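If the goal is only to end up with a dict on the Erlang side, one way to avoid depending on the internal representation is to send the orddict-style list as a string and build the dict on receipt with dict:from_list/1. This is a sketch under that assumption, not a way to construct the internal term in Java; the module and function names are illustrative.

```erlang
-module(dict_from_string).
-export([parse/1]).

%% Turn a string such as "[{field1,\"value1\"}]." into a dict.
%% The string must be a list of {Key, Value} tuples followed by a full stop.
parse(String) ->
    {ok, Tokens, _EndLocation} = erl_scan:string(String),
    {ok, Pairs} = erl_parse:parse_term(Tokens),
    dict:from_list(Pairs).
```

For example, dict:fetch(field2, dict_from_string:parse("[{field1,\"value1\"},{field2,\"value2\"}].")) gives "value2". The hashing is then done by whatever dict implementation the receiving node runs, so the string format stays stable across releases.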

When to Define "unit" in the TypeSpecifierList for Erlang Bins

I've started learning Erlang and recently wrapped up the section on bit syntax. I feel I have a firm understanding of how they can be constructed and matched but failed to come up with an example of when I would want to change the default values of "unit" inside the TypeSpecifierList.
Can anyone share a situation when this would prove useful?
Thanks for your time.
Sometimes, just for convenience: you've got a parameter from somewhere (e.g., from a file header) specifying a count of units of a given size, such as N words of 24-bit audio data, and instead of doing some multiplication, you just say:
<<Audio:N/binary-unit:24, Rest/binary>> = Data
to extract that data (as a chunk) from the rest of the file contents. After parsing the rest of the file, you could pass that chunk to some other function that splits it up into samples.
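A minimal sketch of both steps, assuming the samples are signed, big-endian 24-bit integers (the answer does not say, so treat that as an assumption; the module and function names are illustrative):

```erlang
-module(audio_units).
-export([split/2]).

%% Take N 24-bit words off the front of Data as one chunk,
%% then break the chunk into individual samples.
split(N, Data) ->
    <<Audio:N/binary-unit:24, Rest/binary>> = Data,
    %% Assumed sample format: signed, big-endian, 24 bits each.
    Samples = [S || <<S:24/signed>> <= Audio],
    {Samples, Rest}.
```

Here audio_units:split(2, <<0,0,1, 255,255,255, 9>>) returns {[1, -1], <<9>>}: two samples of three bytes each, with the trailing byte left over.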
