Erlang: What is most-wrong with this trie implementation?

Erlang: What is most-wrong with this trie implementation? - erlang

Over the holidays, my family loves to play Boggle. Problem is, I'm terrible at Boggle. So I did what any good programmer would do: wrote a program to play for me.
At the core of the algorithm is a simple prefix trie, where each node is a dict of references to the next letters.
This is the trie:add implementation:
add([], Trie) ->
dict:store(stop, true, Trie);
add([Ch|Rest], Trie) ->
% setdefault(Key, Default, Dict) ->
% case dict:find(Key, Dict) of
% { ok, Val } -> { Dict, Val }
% error -> { dict:new(), Default }
% end.
{ NewTrie, SubTrie } = setdefault(Ch, dict:new(), Trie),
NewSubTrie = add(Rest, SubTrie),
dict:store(Ch, NewSubTrie, NewTrie).
And you can see the rest, along with an example of how it's used (at the bottom), here:
http://gist.github.com/263513
Now, this being my first serious program in Erlang, I know there are probably a bunch of things wrong with it… But my immediate concern is that it uses 800 megabytes of RAM.
So, what am I doing most-wrong? And how might I make it a bit less-wrong?

You could implement this functionality by simply storing the words in an ets table:
% create table; add words
> ets:new(words, [named_table, set]).
> ets:insert(words, [{"zed"}]).
> ets:insert(words, [{"zebra"}]).
% check if word exists
> ets:lookup(words, "zed").
[{"zed"}]
% check if "ze" has a continuation among the words
78> ets:match(words, {"ze" ++ '$1'}).
[["d"],["bra"]]
If trie is a must, but you can live with a non-functional approach, then you can try digraphs, as Paul already suggested.
If you want to stay functional, you might save some bytes of memory by using structures using less memory, for example proplists, or records, such as -record(node, {a,b,....,x,y,z}).

I don't remember how much memory a dict takes, but let's estimate. You have 2.5e6 characters and 2e5 words. If your trie had no sharing at all, that would take 2.7e6 associations in the dicts (one for each character and each 'stop' symbol). A simple purely-functional dict representation would maybe 4 words per association -- it could be less, but I'm trying to get an upper bound. On a 64-bit machine, that'd take 8*4*2.7 million bytes, or 86 megabytes. That's only a tenth of your 800M, so something's surely wrong here.
Update: dict.erl represents dicts with a hashtable; this implies lots of overhead when you have a lot of very small dicts, as you do. I'd try changing your code to use the proplists module, which ought to match my calculations above.

An alternative way to solve the problem is going through the word list and see if the word can be constructed from the dice. That way you need very little RAM, and it might be more fun to code. (optimizing and concurrency)

Look into DAWGs. They're much more compact than tries.

I don't know about your algorithm, but if you're storing that much data, maybe you should look into using Erlang's built-in digraph library to represent your trie, instead of so many dicts.
http://www.erlang.org/doc/man/digraph.html

If all words are in English, and the case doesn't matter, all characters can be encoded by numbers from 1 to 26 (and in fact, in Erlang they are numbers from 97 to 122), reserving 0 for stop. So you can use the array module as well.

Related

Erlang flatten function time complexity

I need a help with following:
flatten ([]) -> [];
flatten([H|T]) -> H ++ flatten(T).
Input List contain other lists with a different length
For example:
flatten([[1,2,3],[4,7],[9,9,9,9,9,9]]).
What is the time complexity of this function?
And why?
I got it to O(n) where n is a number of elements in the Input list.
For example:
flatten([[1,2,3],[4,7],[9,9,9,9,9,9]]) n=3
flatten([[1,2,3],[4,7],[9,9,9,9,9,9],[3,2,4],[1,4,6]]) n=5
Thanks for help.

First of all I'm not sure your code will work, at least not in the way standard library works. You could compare your function with lists:flatten/1 and maybe improve on your implementation. Try lists such as [a, [b, c]] and [[a], [b, [c]], [d]] as input and verify if you return what you expected.
Regarding complexity it is little tricky due to ++ operator and functional (immutable) nature of the language. All lists in Erlang are linked lists (not arrays like in C++), and you can not just add something to end of one without modifying it; before it was pointing to end of list, now you would like it to link to something else. And again, since it is not mutable language you have to make copy of whole list left of ++ operator, which increases complexity of this operator.
You could say that complexity of A ++ B is length(A), and it makes complexity of your function little bit greater. It would go something like length(FirstElement) + (lenght(FirstElement) + length(SecondElement)) + .... up to (without) last, which after some math magic could be simplified to (n -1) * 1/2 * k * k where n is number of elements, and k is average length of element. Or O(n^3).
If you are new to this it might seem little bit odd, but with some practice you can get hang of it. I would recommend going through few resources:
Good explanation of lists and how they are created
Documentation on list handling with DO and DO NOT parts
Short description of ++ operator myths and best practices
Chapter about recursion and tail-recursion with examples using ++ operator

Sieve of Erastosthenes highest prime factor in erlang

I have edited the program so that it works(with small numbers) however I do not understand how to implement an accumulator as suggested. The reason why is because P changes throughout the process, therefore I do not know in with which granularity I should break up the mother list. The Sieve of Erastosthenes is only efficient for generating smaller primes, so maybe I should have picked a different algorithm to use. Can anybody recommend a decent algorithm for calculating the highest prime factor of 600851475143? Please do not give me code I would prefer a Wikipedia article of something of that nature.
-module(sieve).
-export([find/2,mark/2,primes/1]).
primes(N) -> [2|lists:reverse(primes(lists:seq(2,N),2,[]))].
primes(_,bound_reached,[_|T]) -> T;
primes(L,P,Primes) -> NewList = mark(L,P),
NewP = find(NewList,P),
primes(NewList,NewP,[NewP|Primes]).
find([],_) -> bound_reached;
find([H|_],P) when H > P -> H;
find([_|T],P) -> find(T,P).
mark(L,P) -> lists:reverse(mark(L,P,2,[])).
mark([],_,_,NewList) -> NewList;
mark([_|T],P,Counter,NewList) when Counter rem P =:= 0 -> mark(T,P,Counter+1,[P|NewList]);
mark([H|T],P,Counter,NewList) -> mark(T,P,Counter+1,[H|NewList]).
I found writing this very difficult and I know there are a few things about it that are not very elegant, such as the way I have 2 hardcoded as a prime number. So I would appreciate any C&C and also advice about how to attack these kinds of problems. I look at other implementations and I have absoulutely no idea how the authors think in this way but its something I would like to master.
I have worked out that I can forget the list up until the most recent prime number found, however I have no idea how I am supposed to produce an end bound (subtle humour). I think there is probably something I can use like lists:seq(P,something) and the Counter would be able to handle that as I use modulo rather than resetting it to 0 each time. Ive only done AS level maths so I have no idea what this is.
I cant even do that can I? because I will have to remove multiples of 2 from the entirety of the list. Im thinking that this algorithm will not work unless I cache data to the harddrive, so I'm back to looking for a better algorithm.
I'm now considering writing an algorithm that just uses a counter and keeps a list of primes which are numbers that do not divide evenly with the previously generated prime numbers is this a good way to do it?
This is my new algorithm that I wrote I think it should work but I get the following error "sieve2.erl:7: call to local/imported function is_prime/2 is illegal in guard" I think this is just an aspect of erlang that I do not understand. However I've no idea how I could find the material to read about it. [Im purposely not using higher order functions etc as I have only read upto the bit on recursion in learnyousomeerlang.org]
-module(sieve2).
-export([primes/1]).
primes(N) -> primes(2,N,[2]).
primes(Counter,Max,Primes) when Counter =:= Max -> Primes;
primes(Counter,Max,Primes) when is_prime(Counter,Primes) -> primes(Counter+1,Max,[Counter|Primes]);
primes(Counter,Max,Primes) -> primes(Counter+1,Max,Primes).
is_prime(X, []) -> true;
is_prime(X,[H|T]) when X rem H =:= 0 -> false;
is_prime(X,[H|T]) -> prime(X,T).
The 2nd algorithm does not crash but runs too slowly, I'm thinking that I should reimplement the 1st but this time forget the numbers up until the most recently discovered prime, does anybody know what I could use as an end bound? After looking at other solutions it seems people sometimes just set an arbitrary limit i.e 2 million (this is something I do not really want to do. Others used "lazy" implementations which is what I think I am doing.

This:
lists:seq(2,N div 2)
allocates a list, and as the efficiency guide says, a list requires at least two words of memory per element. (A word is 4 or 8 bytes, depending on whether you have a 32-bit or 64-bit Erlang virtual machine.) So if N is 600851475143, this would require 48 terabytes of memory if I count correctly. (Unlike Haskell, Erlang doesn't do lazy evaluation.)
So you'd need to implement this using an accumulator, similar to what you did with Counter in the mark function. For the stop condition of the recursive function, you wouldn't check for the list being empty, but for the accumulator reaching the max value.

By the way you don't need to test all numbers up to N/2. It is enough to test up to sqrt(N).
Here I wrote a version that takes 20 seconds to find the answer on my machine. It uses kind of lazy list of primes and folding through them. It was fun because I solved some project-euler problems using Haskell quite a long ago and to use the same approach on Erlang was a bit of strange.

On your update3:
primes(Counter,Max,Primes) when Counter =:= Max -> Primes;
primes(Counter,Max,Primes) when is_prime(Counter,Primes) -> primes(Counter+1,Max,[Counter|Primes]);
primes(Counter,Max,Primes) -> primes(Counter+1,Max,Primes).
You cannot use your own defined functions as guard clauses as in Haskell. You have to rewrite it to use it in a case statement:
primes(Counter,Max,Primes) when Counter =:= Max ->
Primes;
primes(Counter,Max,Primes) ->
case is_prime(Counter,Primes) of
true ->
primes(Counter+1,Max,[Counter|Primes]);
_ ->
primes(Counter+1,Max,Primes)
end.

Can I get a list of all currently-registered atoms?

My project has blown through the max 1M atoms, we've cranked up the limit, but I need to apply some sanity to the code that people are submitting with regard to list_to_atom and its friends. I'd like to start by getting a list of all the registered atoms so I can see where the largest offenders are. Is there any way to do this. I'll have to be creative about how I do it so I don't end up trying to dump 1-2M lines in a live console.

You can get hold of all atoms by using an undocumented feature of the external term format.
TL;DR: Paste the following line into the Erlang shell of your running node. Read on for explanation and a non-terse version of the code.
(fun F(N)->try binary_to_term(<<131,75,N:24>>) of A->[A]++F(N+1) catch error:badarg->[]end end)(0).
Elixir version by Ivar Vong:
for i <- 0..:erlang.system_info(:atom_count)-1, do: :erlang.binary_to_term(<<131,75,i::24>>)
An Erlang term encoded in the external term format starts with the byte 131, then a byte identifying the type, and then the actual data. I found that EEP-43 mentions all the possible types, including ATOM_INTERNAL_REF3 with type byte 75, which isn't mentioned in the official documentation of the external term format.
For ATOM_INTERNAL_REF3, the data is an index into the atom table, encoded as a 24-bit integer. We can easily create such a binary: <<131,75,N:24>>
For example, in my Erlang VM, false seems to be the zeroth atom in the atom table:
> binary_to_term(<<131,75,0:24>>).
false
There's no simple way to find the number of atoms currently in the atom table*, but we can keep increasing the number until we get a badarg error.
So this little module gives you a list of all atoms:
-module(all_atoms).
-export([all_atoms/0]).
atom_by_number(N) ->
binary_to_term(<<131,75,N:24>>).
all_atoms() ->
atoms_starting_at(0).
atoms_starting_at(N) ->
try atom_by_number(N) of
Atom ->
[Atom] ++ atoms_starting_at(N + 1)
catch
error:badarg ->
[]
end.
The output looks like:
> all_atoms:all_atoms().
[false,true,'_',nonode#nohost,'$end_of_table','','fun',
infinity,timeout,normal,call,return,throw,error,exit,
undefined,nocatch,undefined_function,undefined_lambda,
'DOWN','UP','EXIT',aborted,abs_path,absoluteURI,ac,accessor,
active,all|...]
> length(v(-1)).
9821
* In Erlang/OTP 20.0, you can call erlang:system_info(atom_count):
> length(all_atoms:all_atoms()) == erlang:system_info(atom_count).
true

I'm not sure if there's a way to do it on a live system, but if you can run it in a test environment you should be able to get a list via crash dump. The atom table is near the end of the crash dump format. You can create a crash dump via erlang:halt/1, but that will bring down the whole runtime system.

I dare say that if you use more than 1M atoms, then you are doing something wrong. Atoms are intended to be static as soon as the application runs or at least upper bounded by some small number, 3000 or so for a medium sized application.
Be very careful when an enemy can generate atoms in your vm. especially calls like list_to_atom/1 is somewhat dangerous.

EDITED (wrong answer..)
You can adjust number of atoms with +t
http://www.erlang.org/doc/efficiency_guide/advanced.html
..but I know very few use cases when it is necessary.
You can track atom stats with erlang:memory()

Creating a valid function declaration from a complex tuple/list structure

Is there a generic way, given a complex object in Erlang, to come up with a valid function declaration for it besides eyeballing it? I'm maintaining some code previously written by someone who was a big fan of giant structures, and it's proving to be error prone doing it manually.
I don't need to iterate the whole thing, just grab the top level, per se.
For example, I'm working on this right now -
[[["SIP",47,"2",46,"0"],32,"407",32,"Proxy Authentication Required","\r\n"],
[{'Via',
[{'via-parm',
{'sent-protocol',"SIP","2.0","UDP"},
{'sent-by',"172.20.10.5","5060"},
[{'via-branch',"z9hG4bKb561e4f03a40c4439ba375b2ac3c9f91.0"}]}]},
{'Via',
[{'via-parm',
{'sent-protocol',"SIP","2.0","UDP"},
{'sent-by',"172.20.10.15","5060"},
[{'via-branch',"12dee0b2f48309f40b7857b9c73be9ac"}]}]},
{'From',
{'from-spec',
{'name-addr',
[[]],
{'SIP-URI',
[{userinfo,{user,"003018CFE4EF"},[]}],
{hostport,"172.20.10.11",[]},
{'uri-parameters',[]},
[]}},
[{tag,"b7226ffa86c46af7bf6e32969ad16940"}]}},
{'To',
{'name-addr',
[[]],
{'SIP-URI',
[{userinfo,{user,"3966"},[]}],
{hostport,"172.20.10.11",[]},
{'uri-parameters',[]},
[]}},
[{tag,"a830c764"}]},
{'Call-ID',"90df0e4968c9a4545a009b1adf268605#172.20.10.15"},
{'CSeq',1358286,"SUBSCRIBE"},
["date",'HCOLON',
["Mon",44,32,["13",32,"Jun",32,"2011"],32,["17",58,"03",58,"55"],32,"GMT"]],
{'Contact',
[[{'name-addr',
[[]],
{'SIP-URI',
[{userinfo,{user,"3ComCallProcessor"},[]}],
{hostport,"172.20.10.11",[]},
{'uri-parameters',[]},
[]}},
[]],
[]]},
["expires",'HCOLON',3600],
["user-agent",'HCOLON',
["3Com",[]],
[['LWS',["VCX",[]]],
['LWS',["7210",[]]],
['LWS',["IP",[]]],
['LWS',["CallProcessor",[['SLASH',"v10.0.8"]]]]]],
["proxy-authenticate",'HCOLON',
["Digest",'LWS',
["realm",'EQUAL',['SWS',34,"3Com",34]],
[['COMMA',["domain",'EQUAL',['SWS',34,"3Com",34]]],
['COMMA',
["nonce",'EQUAL',
['SWS',34,"btbvbsbzbBbAbwbybvbxbCbtbzbubqbubsbqbtbsbqbtbxbCbxbsbybs",
34]]],
['COMMA',["stale",'EQUAL',"FALSE"]],
['COMMA',["algorithm",'EQUAL',"MD5"]]]]],
{'Content-Length',0}],
"\r\n",
["\n"]]

Maybe https://github.com/etrepum/kvc

I noticed your clarifying comment. I'd prefer to add a comment myself, but don't have enough karma. Anyway, the trick I use for that is to experiment in the shell. I'll iterate a pattern against a sample data structure until I've found the simplest form. You can use the _ match-all variable. I use an erlang shell inside an emacs shell window.
First, bind a sample to a variable:
A = [{a,b},[{c,d}, {e,f}]].
Now set the original structure against the variable:
[{a,b},[{c,d},{e,f}]] = A.
If you hit enter, you'll see they match. Hit alt-p (forget what emacs calls alt, but it's alt on my keyboard) to bring back the previous line. Replace some tuple or list item with an underscore:
[_,[{c,d},{e,f}]].
Hit enter to make sure you did it right and they still match. This example is trivial, but for deeply nested, multiline structures it's trickier, so it's handy to be able to just quickly match to test. Sometimes you'll want to try to guess at whole huge swaths, like using an underscore to match a tuple list inside a tuple that's the third element of a list. If you place it right, you can match the whole thing at once, but it's easy to misread it.
Anyway, repeat to explore the essential shape of the structure and place real variables where you want to pull out values:
[_, [_, _]] = A.
[_, _] = A.
[_, MyTupleList] = A. %% let's grab this tuple list
[{MyAtom,b}, [{c,d}, MyTuple]] = A. %% or maybe we want this atom and tuple
That's how I efficiently dissect and pattern match complex data structures.
However, I don't know what you're doing. I'd be inclined to have a wrapper function that uses KVC to pull out exactly what you need and then distributes to helper functions from there for each type of structure.

If I understand you correctly you want to pattern match some large datastructures of unknown formatting.
Example:
Input: {a, b} {a,b,c,d} {a,[],{},{b,c}}
function({A, B}) -> do_something;
function({A, B, C, D}) when is_atom(B) -> do_something_else;
function({A, B, C, D}) when is_list(B) -> more_doing.
The generic answer is of course that it is undecidable from just data to know how to categorize that data.
First you should probably be aware of iolists. They are created by functions such as io_lib:format/2 and in many other places in the code.
One example is that
[["SIP",47,"2",46,"0"],32,"407",32,"Proxy Authentication Required","\r\n"]
will print as
SIP/2.0 407 Proxy Authentication Required
So, I'd start with flattening all those lists, using a function such as
flatten_io(List) when is_list(List) ->
Flat = lists:map(fun flatten_io/1, List),
maybe_flatten(Flat);
flatten_io(Tuple) when is_tuple(Tuple) ->
list_to_tuple([flatten_io(Element) || Element <- tuple_to_list(Tuple)];
flatten_io(Other) -> Other.
maybe_flatten(L) when is_list(L) ->
case lists:all(fun(Ch) when Ch > 0 andalso Ch < 256 -> true;
(List) when is_list(List) ->
lists:all(fun(X) -> X > 0 andalso X < 256 end, List);
(_) -> false
end, L) of
true -> lists:flatten(L);
false -> L
end.
(Caveat: completely untested and quite inefficient. Will also crash for inproper lists, but you shouldn't have those in your data structures anyway.)
On second thought, I can't help you. Any data structure that uses the atom 'COMMA' for a comma in a string should be taken out and shot.
You should be able to flatten those things as well and start to get a view of what you are looking at.
I know that this is not a complete answer. Hope it helps.

Its hard to recommend something for handling this.
Transforming all the structures in a more sane and also more minimal format looks like its worth it. This depends mainly on the similarities in these structures.
Rather than having a special function for each of the 100 there must be some automatic reformatting that can be done, maybe even put the parts in records.
Once you have records its much easier to write functions for it since you don't need to know the actual number of elements in the record. More important: your code won't break when the number of elements changes.
To summarize: make a barrier between your code and the insanity of these structures by somehow sanitizing them by the most generic code possible. It will be probably a mix of generic reformatting with structure speicific stuff.
As an example already visible in this struct: the 'name-addr' tuples look like they have a uniform structure. So you can recurse over your structures (over all elements of tuples and lists) and match for "things" that have a common structure like 'name-addr' and replace these with nice records.
In order to help you eyeballing you can write yourself helper functions along this example:
eyeball(List) when is_list(List) ->
io:format("List with length ~b\n", [length(List)]);
eyeball(Tuple) when is_tuple(Tuple) ->
io:format("Tuple with ~b elements\n", [tuple_size(Tuple)]).
So you would get output like this:
2> eyeball({a,b,c}).
Tuple with 3 elements
ok
3> eyeball([a,b,c]).
List with length 3
ok
expansion of this in a useful tool for your use is left as an exercise. You could handle multiple levels by recursing over the elements and indenting the output.

Use pattern matching and functions that work on lists to extract only what you need.
Look at http://www.erlang.org/doc/man/lists.html:
keyfind, keyreplace, L = [H|T], ...

How to generate integer ranges in Erlang?

From the other languages I program in, I'm used to having ranges. In Python, if I want all numbers one up to 100, I write range(1, 101). Similarly, in Haskell I'd write [1..100] and in Scala I'd write 1 to 100.
I can't find something similar in Erlang, either in the syntax or the library. I know that this would be fairly simple to implement myself, but I wanted to make sure it doesn't exist elsewhere first (particularly since a standard library or language implementation would be loads more efficient).
Is there a way to do ranges either in the Erlang language or standard library? Or is there some idiom that I'm missing? I just want to know if I should implement it myself.
I'm also open to the possibility that I shouldn't want to use a range in Erlang (I wouldn't want to be coding Python or Haskell in Erlang). Also, if I do need to implement this myself, if you have any good suggestions for improving performance, I'd love to hear them :)

From http://www.erlang.org/doc/man/lists.html it looks like lists:seq(1, 100) does what you want. You can also do things like lists:seq(1, 100, 2) to get all of the odd numbers in that range instead.

You can use list:seq(From, TO) that's say #bitilly, and also you can use list comprehensions to add more functionality, for example:
1> [X || X <- lists:seq(1,100), X rem 2 == 0].
[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,
44,46,48,50,52,54,56,58|...]

There is a difference between range in Ruby and list:seq in Erlang. Ruby's range doesn't create list and rely on next method, so (1..HugeInteger).each { ... } will not eat up memory. Erlang lists:seq will create list (or I believe it will). So when range is used for side effects, it does make a difference.
P.S. Not just for side effects:
(1..HugeInteger).inject(0) { |s, v| s + v % 1000000 == 0 ? 1 : 0 }
will work the same way as each, not creating a list. Erlang way for this is to create a recursive function. In fact, it is a concealed loop anyway.

Example of lazy stream in Erlang. Although it is not Erlang specific, I guess it can be done in any language with lambdas. New lambda gets created every time stream is advanced so it might put some strain on garbage collector.
range(From, To, _) when From > To ->
done;
range(From, To, Step) ->
{From, fun() -> range(From + Step, To, Step) end}.
list(done) ->
[];
list({Value, Iterator}) ->
[Value | list(Iterator())].
% ----- usage example ------
list_odd_numbers(From, To) ->
list(range(From bor 1, To, 2)).

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart

Erlang: What is most-wrong with this trie implementation? - erlang

An alternative way to solve the problem is going through the word list and see if the word can be constructed from the dice. That way you need very little RAM, and it might be more fun to code. (optimizing and concurrency)

Look into DAWGs. They're much more compact than tries.

I don't know about your algorithm, but if you're storing that much data, maybe you should look into using Erlang's built-in digraph library to represent your trie, instead of so many dicts. http://www.erlang.org/doc/man/digraph.html

If all words are in English, and the case doesn't matter, all characters can be encoded by numbers from 1 to 26 (and in fact, in Erlang they are numbers from 97 to 122), reserving 0 for stop. So you can use the array module as well.

Related

Erlang flatten function time complexity

Sieve of Erastosthenes highest prime factor in erlang

Can I get a list of all currently-registered atoms?

Creating a valid function declaration from a complex tuple/list structure

How to generate integer ranges in Erlang?

Categories

Resources