How does the Erlang compiler handle pattern matching? What does it output? - erlang

I just asked a question about how the Erlang compiler implements pattern matching, and I got some great responses, one of which is the compiled bytecode (obtained with a parameter passed to the c() directive):
{function, match, 1, 2}.
{label,1}.
{func_info,{atom,match},{atom,match},1}.
{label,2}.
{test,is_tuple,{f,3},[{x,0}]}.
{test,test_arity,{f,3},[{x,0},2]}.
{get_tuple_element,{x,0},0,{x,1}}.
{test,is_eq_exact,{f,3},[{x,1},{atom,a}]}.
return.
{label,3}.
{badmatch,{x,0}}
Its all just plain Erlang tuples. I was expecting some cryptic binary thingy, guess not. I am asking this on impulse here (I could look at the compiler source but asking questions always ends up better with extra insight), how is this output translated in the binary level?
Say {test,is_tuple,{f,3},[{x,0}]} for example. I am assuming this is one instruction, called 'test'... anyway, so this output would essentially be the AST of the bytecode level language, from which the binary encoding is just a 1-1 translation?
This is all so exciting, I had no idea that I can this easily see what the Erlang compiler break things into.

ok so I dug into the compiler source code to find the answer, and to my surprise the asm file produced with the 'S' parameter to the compile:file() function is actually consulted in as is (file:consult()) and then the tuples are checked one by one for further action(line 661 - beam_consult_asm(St) -> - compile.erl). further on then there's a generated mapping table in there (compile folder of the erlang source) that shows what the serial number of each bytecode label is, and Im guessing this is used to generate the actual binary signature of the bytecode.
great stuff. but you just gotta love the consult() function, you can almost have a lispy type syntax for a random language and avoid the need for a parser/lexer fully and just consult source code into the compiler and do stuff with it... code as data data as code...

The compiler has a so-called pattern match compiler which will take a pattern and compile it down to what is essentially a series of branches, switches and such. The code for Erlang is in v3_kernel.erl in the compiler. It uses Simon Peyton Jones, "The Implementation of Functional
Programming Languages", available online at
http://research.microsoft.com/en-us/um/people/simonpj/papers/slpj-book-1987/
Another worthy paper is the one by Peter Sestoft,
http://www.itu.dk/~sestoft/papers/match.ps.gz
which derives a pattern match compiler by inspecting partial evaluation of a simpler system. It may be an easier read, especially if you know ML.
The basic idea is that if you have, say:
% 1
f(a, b) ->
% 2
f(a, c) ->
% 3
f(b, b) ->
% 4
f(b, c) ->
Suppose now we have a call f(X, Y). Say X = a. Then only 1 and 2 are applicable. So we check Y = b and then Y = c. If on the other hand X /= a then we know that we can skip 1 and 2 and begin testing 3 and 4. The key is that if something does not match it tells us something about where the match can continue as well as when we do match. It is a set of constraints which we can solve by testing.
Pattern match compilers seek to optimize the number of tests so there are as few as possible before we have conclusion. Statically typed language have some advantages here since they may know that:
-type foo() :: a | b | c.
and then if we have
-spec f(foo() -> any().
f(a) ->
f(b) ->
f(c) ->
and we did not match f(a), f(b) then f(c) must match. Erlang has to check and then fail if it doesn't match.

Related

Is it appropriate for a parser DCG to not be deterministic?

I am writing a parser for a query engine. My parser DCG query is not deterministic.
I will be using the parser in a relational manner, to both check and synthesize queries.
Is it appropriate for a parser DCG to not be deterministic?
In code:
If I want to be able to use query/2 both ways, does it require that
?- phrase(query, [q,u,e,r,y]).
true;
false.
or should I be able to obtain
?- phrase(query, [q,u,e,r,y]).
true.
nevertheless, given that the first snippet would require me to use it as such
?- bagof(X, phrase(query, [q,u,e,r,y]), [true]).
true.
when using it to check a formula?
The first question to ask yourself, is your grammar deterministic, or in the terminology of grammars, unambiguous. This is not asking if your DCG is deterministic, but if the grammar is unambiguous. That can be answered with basic parsing concepts, no use of DCG is needed to answer that question. In other words, is there only one way to parse a valid input. The standard book for this is "Compilers : principles, techniques, & tools" (WorldCat)
Now you are actually asking about three different uses for parsing.
A recognizer.
A parser.
A generator.
If your grammar is unambiguous then
For a recognizer the answer should only be true for valid input that can be parsed and false for invalid input.
For the parser it should be deterministic as there is only one way to parse the input. The difference between a parser and an recognizer is that a recognizer only returns true or false and a parser will return something more, typically an abstract syntax tree.
For the generator, it should be semi-deterministic so that it can generate multiple results.
Can all of this be done with one, DCG, yes. The three different ways are dependent upon how you use the input and output of the DCG.
Here is an example with a very simple grammar.
The grammar is just an infix binary expression with one operator and two possible operands. The operator is (+) and the operands are either (1) or (2).
expr(expr(Operand_1,Operator,Operand_2)) -->
operand(Operand_1),
operator(Operator),
operand(Operand_2).
operand(operand(1)) --> "1".
operand(operand(2)) --> "2".
operator(operator(+)) --> "+".
recognizer(Input) :-
string_codes(Input,Codes),
DCG = expr(_),
phrase(DCG,Codes,[]).
parser(Input,Ast) :-
string_codes(Input,Codes),
DCG = expr(Ast),
phrase(DCG,Codes,[]).
generator(Generated) :-
DCG = expr(_),
phrase(DCG,Codes,[]),
string_codes(Generated,Codes).
:- begin_tests(expr).
recognizer_test_case_success("1+1").
recognizer_test_case_success("1+2").
recognizer_test_case_success("2+1").
recognizer_test_case_success("2+2").
test(recognizer,[ forall(recognizer_test_case_success(Input)) ] ) :-
recognizer(Input).
recognizer_test_case_fail("2+3").
test(recognizer,[ forall(recognizer_test_case_fail(Input)), fail ] ) :-
recognizer(Input).
parser_test_case_success("1+1",expr(operand(1),operator(+),operand(1))).
parser_test_case_success("1+2",expr(operand(1),operator(+),operand(2))).
parser_test_case_success("2+1",expr(operand(2),operator(+),operand(1))).
parser_test_case_success("2+2",expr(operand(2),operator(+),operand(2))).
test(parser,[ forall(parser_test_case_success(Input,Expected_ast)) ] ) :-
parser(Input,Ast),
assertion( Ast == Expected_ast).
parser_test_case_fail("2+3").
test(parser,[ forall(parser_test_case_fail(Input)), fail ] ) :-
parser(Input,_).
test(generator,all(Generated == ["1+1","1+2","2+1","2+2"]) ) :-
generator(Generated).
:- end_tests(expr).
The grammar is unambiguous and has only 4 valid strings which are all unique.
The recognizer is deterministic and only returns true or false.
The parser is deterministic and returns a unique AST.
The generator is semi-deterministic and returns all 4 valid unique strings.
Example run of the test cases.
?- run_tests.
% PL-Unit: expr ........... done
% All 11 tests passed
true.
To expand a little on the comment by Daniel
As Daniel notes
1 + 2 + 3
can be parsed as
(1 + 2) + 3
or
1 + (2 + 3)
So 1+2+3 is an example as you said is specified by a recursive DCG and as I noted a common way out of the problem is to use parenthesizes to start a new context. What is meant by starting a new context is that it is like getting a new clean slate to start over again. If you are creating an AST, you just put the new context, items in between the parenthesizes, as a new subtree at the current node.
With regards to write_canonical/1, this is also helpful but be aware of left and right associativity of operators. See Associative property
e.g.
+ is left associative
?- write_canonical(1+2+3).
+(+(1,2),3)
true.
^ is right associative
?- write_canonical(2^3^4).
^(2,^(3,4))
true.
i.e.
2^3^4 = 2^(3^4) = 2^81 = 2417851639229258349412352
2^3^4 != (2^3)^4 = 8^4 = 4096
The point of this added info is to warn you that grammar design is full of hidden pitfalls and if you have not had a rigorous class in it and done some of it you could easily create a grammar that looks great and works great and then years latter is found to have a serious problem. While Python was not ambiguous AFAIK, it did have grammar issues, it had enough issues that when Python 3 was created, many of the issues were fixed. So Python 3 is not backward compatible with Python 2 (differences). Yes they have made changes and libraries to make it easier to use Python 2 code with Python 3, but the point is that the grammar could have used a bit more analysis when designed.
The only reason why code should be non-deterministic is that your question has multiple answers. In that case, you'd of course want your query to have multiple solutions. Even then, however, you'd like it to not leave a choice point after the last solution, if at all possible.
Here is what I mean:
"What is the smaller of two numbers?"
min_a(A, B, B) :- B < A.
min_a(A, B, A) :- A =< B.
So now you ask, "what is the smaller of 1 and 2" and the answer you expect is "1":
?- min_a(1, 2, Min).
Min = 1.
?- min_a(2, 1, Min).
Min = 1 ; % crap...
false.
?- min_a(2, 1, 2).
false.
?- min_a(2, 1, 1).
true ; % crap...
false.
So that's not bad code but I think it's still crap. This is why, for the smaller of two numbers, you'd use something like the min() function in SWI-Prolog.
Similarly, say you want to ask, "What are the even numbers between 1 and 10"; you write the query:
?- between(1, 10, X), X rem 2 =:= 0.
X = 2 ;
X = 4 ;
X = 6 ;
X = 8 ;
X = 10.
... and that's fine, but if you then ask for the numbers that are multiple of 3, you get:
?- between(1, 10, X), X rem 3 =:= 0.
X = 3 ;
X = 6 ;
X = 9 ;
false. % crap...
The "low-hanging fruit" are the cases where you as a programmer would see that there cannot be non-determinism, but for some reason your Prolog is not able to deduce that from the code you wrote. In most cases, you can do something about it.
On to your actual question. If you can, write your code so that there is non-determinism only if there are multiple answers to the question you'll be asking. When you use a DCG for both parsing and generating, this sometimes means you end up with two code paths. It feels clumsy but it is easier to write, to read, to understand, and probably to make efficient. As a word of caution, take a look at this question. I can't know that for sure, but the problems that OP is running into are almost certainly caused by unnecessary non-determinism. What probably happens with larger inputs is that a lot of choice points are left behind, there is a lot of memory that cannot be reclaimed, a lot of processing time going into book keeping, huge solution trees being traversed only to get (as expected) no solutions.... you get the point.
For examples of what I mean, you can take a look at the implementation of library(dcg/basics) in SWI-Prolog. Pay attention to several things:
The documentation is very explicit about what is deterministic, what isn't, and how non-determinism is supposed to be useful to the client code;
The use of cuts, where necessary, to get rid of choice points that are useless;
The implementation of number//1 (towards the bottom) that can "generate extract a number".
(Hint: use the primitives in this library when you write your own parser!)
I hope you find this unnecessarily long answer useful.

Erlang flatten function time complexity

I need a help with following:
flatten ([]) -> [];
flatten([H|T]) -> H ++ flatten(T).
Input List contain other lists with a different length
For example:
flatten([[1,2,3],[4,7],[9,9,9,9,9,9]]).
What is the time complexity of this function?
And why?
I got it to O(n) where n is a number of elements in the Input list.
For example:
flatten([[1,2,3],[4,7],[9,9,9,9,9,9]]) n=3
flatten([[1,2,3],[4,7],[9,9,9,9,9,9],[3,2,4],[1,4,6]]) n=5
Thanks for help.
First of all I'm not sure your code will work, at least not in the way standard library works. You could compare your function with lists:flatten/1 and maybe improve on your implementation. Try lists such as [a, [b, c]] and [[a], [b, [c]], [d]] as input and verify if you return what you expected.
Regarding complexity it is little tricky due to ++ operator and functional (immutable) nature of the language. All lists in Erlang are linked lists (not arrays like in C++), and you can not just add something to end of one without modifying it; before it was pointing to end of list, now you would like it to link to something else. And again, since it is not mutable language you have to make copy of whole list left of ++ operator, which increases complexity of this operator.
You could say that complexity of A ++ B is length(A), and it makes complexity of your function little bit greater. It would go something like length(FirstElement) + (lenght(FirstElement) + length(SecondElement)) + .... up to (without) last, which after some math magic could be simplified to (n -1) * 1/2 * k * k where n is number of elements, and k is average length of element. Or O(n^3).
If you are new to this it might seem little bit odd, but with some practice you can get hang of it. I would recommend going through few resources:
Good explanation of lists and how they are created
Documentation on list handling with DO and DO NOT parts
Short description of ++ operator myths and best practices
Chapter about recursion and tail-recursion with examples using ++ operator

Erlang: variable is unbound

Why is the following saying variable unbound?
9> {<<A:Length/binary, Rest/binary>>, Length} = {<<1,2,3,4,5>>, 3}.
* 1: variable 'Length' is unbound
It's pretty clear that Length should be 3.
I am trying to have a function with similar pattern matching, ie.:
parse(<<Body:Length/binary, Rest/binary>>, Length) ->
But if fails with the same reason. How can I achieve the pattern matching I want?
What I am really trying to achieve is parse in incoming tcp stream packets as LTV(Length, Type, Value).
At some point after I parse the the Length and the Type, I want to ready only up to Length number of bytes as the value, as the rest will probably be for the next LTV.
So my parse_value function is like this:
parse_value(Value0, Left, Callback = {Module, Function},
{length, Length, type, Type, value, Value1}) when byte_size(Value0) >= Left ->
<<Value2:Left/binary, Rest/binary>> = Value0,
Module:Function({length, Length, type, Type, value, lists:reverse([Value2 | Value1])}),
if
Rest =:= <<>> ->
{?MODULE, parse, {}};
true ->
parse(Rest, Callback, {})
end;
parse_value(Value0, Left, _, {length, Length, type, Type, value, Value1}) ->
{?MODULE, parse_value, Left - byte_size(Value0), {length, Length, type, Type, value, [Value0 | Value1]}}.
If I could do the pattern matching, I could break it up to something more pleasant to the eye.
The rules for pattern matching are that if a variable X occurs in two subpatterns, as in {X, X}, or {X, [X]}, or similar, then they have to have the same value in both positions, but the matching of each subpattern is still done in the same input environment - bindings from one side do not carry over to the other. The equality check is conceptually done afterwards, as if you had matched on {X, X2} and added a guard X =:= X2. This means that your Length field in the tuple cannot be used as input to the binary pattern, not even if you make it the leftmost element.
However, within a binary pattern, variables bound in a field can be used in other fields following it, left-to-right. Therefore, the following works (using a leading 32-bit size field in the binary):
1> <<Length:32, A:Length/binary, Rest/binary>> = <<0,0,0,3,1,2,3,4,5>>.
<<0,0,0,3,1,2,3,4,5>>
2> A.
<<1,2,3>>
3> Rest.
<<4,5>>
I've run into this before. There is some weirdness between what is happening inside binary syntax and what happens during unification (matching). I suspect that it is just that binary syntax and matching occur at different times in the VM somewhere (we don't know which Length is failing to get assigned -- maybe binary matching is always first in evaluation, so Length is still meaningless). I was once going to dig in and find out, but then I realized that I never really needed to solve this problem -- which might be why it was never "solved".
Fortunately, this won't stop you with whatever you are doing.
Unfortunately, we can't really help further unless you explain the context in which you think this kind of a match is a good idea (you are having an X-Y problem).
In binary parsing you can always force the situation to be one of the following:
Have a fixed-sized header at the beginning of the binary message that tells you the next size element you need (and from there that can continue as a chain of associations endlessly)
Inspect the binary once on entry to determine the size you are looking for, pull that one value, and then begin the real parsing task
Have a set of fields, all of predetermined sizes that conform to some a binary schema standard
Convert the binary to a list and iterate through it with any arbitrary amount of look-ahead and backtracking you might need
Quick Solution
Without knowing anything else about your general problem, a typical solution would look like:
parse(Length, Bin) ->
<<Body:Length/binary, Rest/binary>> = Bin,
ok = do_something(Body),
do_other_stuff(Rest).
But I smell something funky here.
Having things like this in your code is almost always a sign that a more fundamental aspect of the code structure is not in agreement with the data that you are handling.
But deadlines.
Erlang is all about practical code that satisfies your goals in the real world. With that in mind, I suggest that you do something like the above for now, and then return to this problem domain and rethink it. Then refactor it. This will gain you three benefits:
Something will work right away.
You will later learn something fundamental about parsing in general.
Your code will almost certainly run faster if it fits your data better.
Example
Here is an example in the shell:
1> Parse =
1> fun
1> (Length, Bin) when Length =< byte_size(Bin) ->
1> <<Body:Length/binary, Rest/binary>> = Bin,
1> ok = io:format("Chopped off ~p bytes: ~p~n", [Length, Body]),
1> Rest;
1> (Length, Bin) ->
1> ok = io:format("Binary shorter than ~p~n", [Length]),
1> Bin
1> end.
#Fun<erl_eval.12.87737649>
2> Parse(3, <<1,2,3,4,5>>).
Chopped off 3 bytes: <<1,2,3>>
<<4,5>>
3> Parse(8, <<1,2,3,4,5>>).
Binary shorter than 8
<<1,2,3,4,5>>
Note that this version is a little safer, in that we avoid a crash in the case that Length is longer than the binary. This is yet another good reason why maybe we can't do that match in the function head.
Try with below code:
{<<A:Length/binary, Rest/binary>>, _} = {_, Length} = {<<1,2,3,4,5>>, 3}.
This question is mentioned a bit in EEP-52:
Any variables used in the expression must have been previously bound, or become bound in the same binary pattern as the expression. That is, the following example is illegal:
illegal_example2(N, <<X:N,T/binary>>) ->
{X,T}.
And explained a bit more in the following e-mail: http://erlang.org/pipermail/eeps/2020-January/000636.html
Illegal. With one exception, matching is not done in a left-to-right
order, but all variables in the pattern will be bound at the same
time. That means that the variables must be bound before the match
starts. For maps, that means that the variables referenced in key
expressions must be bound before the case (or receive) that matches
the map. In a function head, all map keys must be literals.
The exception to this general rule is that within a binary pattern,
the segments are matched from left to right, and a variable bound in a
previous segment can be used in the size expression for a segment
later in the binary pattern.
Also one of the members of OTP team mentioned that they made a prototype that can do that, but it was never finished http://erlang.org/pipermail/erlang-questions/2020-May/099538.html
We actually tried to make your example legal. The transformation of
the code that we did was not to rewrite to guards, but to match
arguments or parts of argument in the right order so that variables
that input variables would be bound before being used. (We would do a
topological sort to find the correct order.) For your example, the
transformation would look similar to this:
legal_example(Key, Map) ->
case Map of
#{Key := Value} -> Value;
_ -> error(function_clause, [Key, Map])
end.
In the prototype implementation, the compiler could compile the
following example:
convoluted(Ref,
#{ node(Ref) := NodeId, Loop := universal_answer},
[{NodeId, Size} | T],
<<Int:(Size*8+length(T)),Loop>>) when is_reference(Ref) ->
Int.
Things started to fall apart when variables are repeated. Repeated
variables in patterns already have a meaning in Erlang (they should be
the same), so it become tricky to understand to distinguish between
variables being bound or variables being used a binary size or map
key. Here is an example that the prototype couldn't handle:
foo(#{K := K}, K) -> ok.
A human can see that it should be transformed similar to this:
foo(Map, K) -> case Map of
{K := V} when K =:= V -> ok end.
Here are few other examples that should work but the prototype would
refuse to compile (often emitting an incomprehensible error message):
bin2(<<Sz:8,X:Sz>>, <<Y:Sz>>) -> {X,Y}.
repeated_vars(#{K := #{K := K}}, K) -> K.
match_map_bs(#{K1 := {bin,<<Int:Sz>>}, K2 := <<Sz:8>>}, {K1,K2}) ->
Int.
Another problem was when example was correctly rejected, the error
message would be confusing.
Because much more work would clearly be needed, we have shelved the
idea for now. Personally, I am not sure that the idea is sound in the
first place. But I am sure of one thing: the implementation would be
very complicated.
UPD: latest news from 2020-05-14

Creating a valid function declaration from a complex tuple/list structure

Is there a generic way, given a complex object in Erlang, to come up with a valid function declaration for it besides eyeballing it? I'm maintaining some code previously written by someone who was a big fan of giant structures, and it's proving to be error prone doing it manually.
I don't need to iterate the whole thing, just grab the top level, per se.
For example, I'm working on this right now -
[[["SIP",47,"2",46,"0"],32,"407",32,"Proxy Authentication Required","\r\n"],
[{'Via',
[{'via-parm',
{'sent-protocol',"SIP","2.0","UDP"},
{'sent-by',"172.20.10.5","5060"},
[{'via-branch',"z9hG4bKb561e4f03a40c4439ba375b2ac3c9f91.0"}]}]},
{'Via',
[{'via-parm',
{'sent-protocol',"SIP","2.0","UDP"},
{'sent-by',"172.20.10.15","5060"},
[{'via-branch',"12dee0b2f48309f40b7857b9c73be9ac"}]}]},
{'From',
{'from-spec',
{'name-addr',
[[]],
{'SIP-URI',
[{userinfo,{user,"003018CFE4EF"},[]}],
{hostport,"172.20.10.11",[]},
{'uri-parameters',[]},
[]}},
[{tag,"b7226ffa86c46af7bf6e32969ad16940"}]}},
{'To',
{'name-addr',
[[]],
{'SIP-URI',
[{userinfo,{user,"3966"},[]}],
{hostport,"172.20.10.11",[]},
{'uri-parameters',[]},
[]}},
[{tag,"a830c764"}]},
{'Call-ID',"90df0e4968c9a4545a009b1adf268605#172.20.10.15"},
{'CSeq',1358286,"SUBSCRIBE"},
["date",'HCOLON',
["Mon",44,32,["13",32,"Jun",32,"2011"],32,["17",58,"03",58,"55"],32,"GMT"]],
{'Contact',
[[{'name-addr',
[[]],
{'SIP-URI',
[{userinfo,{user,"3ComCallProcessor"},[]}],
{hostport,"172.20.10.11",[]},
{'uri-parameters',[]},
[]}},
[]],
[]]},
["expires",'HCOLON',3600],
["user-agent",'HCOLON',
["3Com",[]],
[['LWS',["VCX",[]]],
['LWS',["7210",[]]],
['LWS',["IP",[]]],
['LWS',["CallProcessor",[['SLASH',"v10.0.8"]]]]]],
["proxy-authenticate",'HCOLON',
["Digest",'LWS',
["realm",'EQUAL',['SWS',34,"3Com",34]],
[['COMMA',["domain",'EQUAL',['SWS',34,"3Com",34]]],
['COMMA',
["nonce",'EQUAL',
['SWS',34,"btbvbsbzbBbAbwbybvbxbCbtbzbubqbubsbqbtbsbqbtbxbCbxbsbybs",
34]]],
['COMMA',["stale",'EQUAL',"FALSE"]],
['COMMA',["algorithm",'EQUAL',"MD5"]]]]],
{'Content-Length',0}],
"\r\n",
["\n"]]
Maybe https://github.com/etrepum/kvc
I noticed your clarifying comment. I'd prefer to add a comment myself, but don't have enough karma. Anyway, the trick I use for that is to experiment in the shell. I'll iterate a pattern against a sample data structure until I've found the simplest form. You can use the _ match-all variable. I use an erlang shell inside an emacs shell window.
First, bind a sample to a variable:
A = [{a,b},[{c,d}, {e,f}]].
Now set the original structure against the variable:
[{a,b},[{c,d},{e,f}]] = A.
If you hit enter, you'll see they match. Hit alt-p (forget what emacs calls alt, but it's alt on my keyboard) to bring back the previous line. Replace some tuple or list item with an underscore:
[_,[{c,d},{e,f}]].
Hit enter to make sure you did it right and they still match. This example is trivial, but for deeply nested, multiline structures it's trickier, so it's handy to be able to just quickly match to test. Sometimes you'll want to try to guess at whole huge swaths, like using an underscore to match a tuple list inside a tuple that's the third element of a list. If you place it right, you can match the whole thing at once, but it's easy to misread it.
Anyway, repeat to explore the essential shape of the structure and place real variables where you want to pull out values:
[_, [_, _]] = A.
[_, _] = A.
[_, MyTupleList] = A. %% let's grab this tuple list
[{MyAtom,b}, [{c,d}, MyTuple]] = A. %% or maybe we want this atom and tuple
That's how I efficiently dissect and pattern match complex data structures.
However, I don't know what you're doing. I'd be inclined to have a wrapper function that uses KVC to pull out exactly what you need and then distributes to helper functions from there for each type of structure.
If I understand you correctly you want to pattern match some large datastructures of unknown formatting.
Example:
Input: {a, b} {a,b,c,d} {a,[],{},{b,c}}
function({A, B}) -> do_something;
function({A, B, C, D}) when is_atom(B) -> do_something_else;
function({A, B, C, D}) when is_list(B) -> more_doing.
The generic answer is of course that it is undecidable from just data to know how to categorize that data.
First you should probably be aware of iolists. They are created by functions such as io_lib:format/2 and in many other places in the code.
One example is that
[["SIP",47,"2",46,"0"],32,"407",32,"Proxy Authentication Required","\r\n"]
will print as
SIP/2.0 407 Proxy Authentication Required
So, I'd start with flattening all those lists, using a function such as
flatten_io(List) when is_list(List) ->
Flat = lists:map(fun flatten_io/1, List),
maybe_flatten(Flat);
flatten_io(Tuple) when is_tuple(Tuple) ->
list_to_tuple([flatten_io(Element) || Element <- tuple_to_list(Tuple)];
flatten_io(Other) -> Other.
maybe_flatten(L) when is_list(L) ->
case lists:all(fun(Ch) when Ch > 0 andalso Ch < 256 -> true;
(List) when is_list(List) ->
lists:all(fun(X) -> X > 0 andalso X < 256 end, List);
(_) -> false
end, L) of
true -> lists:flatten(L);
false -> L
end.
(Caveat: completely untested and quite inefficient. Will also crash for inproper lists, but you shouldn't have those in your data structures anyway.)
On second thought, I can't help you. Any data structure that uses the atom 'COMMA' for a comma in a string should be taken out and shot.
You should be able to flatten those things as well and start to get a view of what you are looking at.
I know that this is not a complete answer. Hope it helps.
Its hard to recommend something for handling this.
Transforming all the structures in a more sane and also more minimal format looks like its worth it. This depends mainly on the similarities in these structures.
Rather than having a special function for each of the 100 there must be some automatic reformatting that can be done, maybe even put the parts in records.
Once you have records its much easier to write functions for it since you don't need to know the actual number of elements in the record. More important: your code won't break when the number of elements changes.
To summarize: make a barrier between your code and the insanity of these structures by somehow sanitizing them by the most generic code possible. It will be probably a mix of generic reformatting with structure speicific stuff.
As an example already visible in this struct: the 'name-addr' tuples look like they have a uniform structure. So you can recurse over your structures (over all elements of tuples and lists) and match for "things" that have a common structure like 'name-addr' and replace these with nice records.
In order to help you eyeballing you can write yourself helper functions along this example:
eyeball(List) when is_list(List) ->
io:format("List with length ~b\n", [length(List)]);
eyeball(Tuple) when is_tuple(Tuple) ->
io:format("Tuple with ~b elements\n", [tuple_size(Tuple)]).
So you would get output like this:
2> eyeball({a,b,c}).
Tuple with 3 elements
ok
3> eyeball([a,b,c]).
List with length 3
ok
expansion of this in a useful tool for your use is left as an exercise. You could handle multiple levels by recursing over the elements and indenting the output.
Use pattern matching and functions that work on lists to extract only what you need.
Look at http://www.erlang.org/doc/man/lists.html:
keyfind, keyreplace, L = [H|T], ...

How to generate integer ranges in Erlang?

From the other languages I program in, I'm used to having ranges. In Python, if I want all numbers one up to 100, I write range(1, 101). Similarly, in Haskell I'd write [1..100] and in Scala I'd write 1 to 100.
I can't find something similar in Erlang, either in the syntax or the library. I know that this would be fairly simple to implement myself, but I wanted to make sure it doesn't exist elsewhere first (particularly since a standard library or language implementation would be loads more efficient).
Is there a way to do ranges either in the Erlang language or standard library? Or is there some idiom that I'm missing? I just want to know if I should implement it myself.
I'm also open to the possibility that I shouldn't want to use a range in Erlang (I wouldn't want to be coding Python or Haskell in Erlang). Also, if I do need to implement this myself, if you have any good suggestions for improving performance, I'd love to hear them :)
From http://www.erlang.org/doc/man/lists.html it looks like lists:seq(1, 100) does what you want. You can also do things like lists:seq(1, 100, 2) to get all of the odd numbers in that range instead.
You can use list:seq(From, TO) that's say #bitilly, and also you can use list comprehensions to add more functionality, for example:
1> [X || X <- lists:seq(1,100), X rem 2 == 0].
[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,
44,46,48,50,52,54,56,58|...]
There is a difference between range in Ruby and list:seq in Erlang. Ruby's range doesn't create list and rely on next method, so (1..HugeInteger).each { ... } will not eat up memory. Erlang lists:seq will create list (or I believe it will). So when range is used for side effects, it does make a difference.
P.S. Not just for side effects:
(1..HugeInteger).inject(0) { |s, v| s + v % 1000000 == 0 ? 1 : 0 }
will work the same way as each, not creating a list. Erlang way for this is to create a recursive function. In fact, it is a concealed loop anyway.
Example of lazy stream in Erlang. Although it is not Erlang specific, I guess it can be done in any language with lambdas. New lambda gets created every time stream is advanced so it might put some strain on garbage collector.
range(From, To, _) when From > To ->
done;
range(From, To, Step) ->
{From, fun() -> range(From + Step, To, Step) end}.
list(done) ->
[];
list({Value, Iterator}) ->
[Value | list(Iterator())].
% ----- usage example ------
list_odd_numbers(From, To) ->
list(range(From bor 1, To, 2)).

Resources