Suppose I wanted to do something like:
dict
.values()
.map(fun scrub/1)
.flatMap(fun split/1)
.groupBy(fun keyFun/1, fun count/1)
.to_dict()
What is the most elegant way to achieve this in Erlang?
There is no direct, easy way of doing that. All the attempts I have seen look even worse than straightforward composition. If you look at the majority of open source projects in Erlang, you will find that they use ordinary composition. Re-using your example:
to_dict(
  groupBy(fun keyFun/1, fun count/1,
    flatMap(fun split/1,
      map(fun scrub/1,
        values(dict))))).
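If you really want the left-to-right reading order, one option is a small pipeline helper that folds a value through a list of funs. A minimal sketch (pipe/2 is not a standard library function, and the helpers in the commented usage are the hypothetical ones from the question):
pipe(Value, Funs) ->
    %% Thread a value through a list of single-argument funs, left to right.
    lists:foldl(fun (F, Acc) -> F(Acc) end, Value, Funs).

%% Possible usage:
%% pipe(Dict,
%%      [fun maps:values/1,
%%       fun (Vs) -> lists:map(fun scrub/1, Vs) end,
%%       fun (Vs) -> lists:flatmap(fun split/1, Vs) end,
%%       fun (Vs) -> groupBy(fun keyFun/1, fun count/1, Vs) end,
%%       fun maps:from_list/1]).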
This isn't a construct that's natural in Erlang. If you have a couple of functions, regular composition is what I'd use:
lists:flatten(lists:map(fun (A) ->
                            do_stuff(A)
                        end,
                        generate_list())).
For a longer series of operations, intermediary variables:
Dict = #{hello => world, ...},
Values = maps:values(Dict),
ScrubbedValues = lists:map(fun scrub/1, Values),
SplitValues = lists:flatten(lists:map(fun split/1, ScrubbedValues)),
GroupedValues = basil_lists:group_by(fun keyFun/1, fun count/1, SplitValues),
Dict2 = maps:from_list(GroupedValues).
That's how it'd look if you wanted all of those operations grouped in one shot together.
However, I'd more likely write this in a different way:
-spec remap_values(map()) -> map().
remap_values(Map) ->
    map_values(maps:values(Map)).

-spec map_values(list()) -> map().
map_values(Values) ->
    map_values(Values, [], []).

-spec map_values(list(), list(), list()) -> map().
map_values([], OutList, OutGroup) ->
    %% Base case: transform into a map
    Grouped = lists:zip(OutGroup, OutList),
    lists:foldl(fun ({Group, Element}, Acc = #{Group := Existing}) ->
                        Acc#{Group => [Element | Existing]};
                    ({Group, Element}, Acc) ->
                        Acc#{Group => [Element]}
                end,
                #{},
                Grouped);
map_values([First|Rest], OutList, OutGroup) ->
    %% Recursive case: process the first element and categorize the result
    Processed = split(scrub(First)),
    Categories = lists:map(fun categorize/1, Processed),
    map_values(Rest, OutList ++ Processed, OutGroup ++ Categories).
The actual correct implementation depends a lot on how the code's going to be run -- what I've written here is pretty simple, but might not perform well on large amounts of data. If you're actually looking to process an endless stream of data, you'll need to write that yourself (though you may find gen_server to be a very useful framework for doing so).
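To give a rough idea of what that could look like, here is a minimal sketch of a gen_server that accumulates grouped values one at a time; the module name grouper and its API are made up for illustration, and it assumes the caller has already scrubbed and split each value into a Key/Value pair:
-module(grouper).
-behaviour(gen_server).

-export([start_link/0, add/3, groups/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

start_link() ->
    gen_server:start_link(?MODULE, [], []).

%% Feed one already-processed Key/Value pair into the server.
add(Pid, Key, Value) ->
    gen_server:cast(Pid, {add, Key, Value}).

%% Fetch the map of groups accumulated so far.
groups(Pid) ->
    gen_server:call(Pid, groups).

init([]) ->
    {ok, #{}}.

handle_cast({add, Key, Value}, Groups) ->
    {noreply, maps:update_with(Key, fun (Vs) -> [Value | Vs] end, [Value], Groups)}.

handle_call(groups, _From, Groups) ->
    {reply, Groups, Groups}.

handle_info(_Msg, Groups) ->
    {noreply, Groups}.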
I am learning Erlang from a Ruby background and having some difficulty grasping the thought process. The problem I am trying to solve is the following:
I need to make the same request to an API; each response contains a unique ID which I need to pass into the next request, until no ID is returned. From each response I need to extract certain data and use it for other things as well.
First get the iterator:
ShardIteratorResponse = kinetic:get_shard_iterator(GetShardIteratorPayload).
{ok,[{<<"ShardIterator">>,
<<"AAAAAAAAAAGU+v0fDvpmu/02z5Q5OJZhPo/tU7fjftFF/H9M7J9niRJB8MIZiB9E1ntZGL90dIj3TW6MUWMUX67NEj4GO89D"...>>}]}
Parse out the ShardIterator...
{_, [{_, ShardIterator}]} = ShardIteratorResponse.
Make the request to Kinesis for the stream's records...
GetRecordsPayload = [{<<"ShardIterator">>, <<ShardIterator/binary>>}].
[{<<"ShardIterator">>,
<<"AAAAAAAAAAGU+v0fDvpmu/02z5Q5OJZhPo/tU7fjftFF/H9M7J9niRJB8MIZiB9E1ntZGL90dIj3TW6MUWMUX67NEj4GO89DETABlwVV"...>>}]
RecordsResponse = kinetic:get_records(GetRecordsPayload).
{ok,[{<<"NextShardIterator">>,
<<"AAAAAAAAAAFy3dnTJYkWr3gq0CGo3hkj1t47ccUS10f5nADQXWkBZaJvVgTMcY+nZ9p4AZCdUYVmr3dmygWjcMdugHLQEg6x"...>>},
{<<"Records">>,
[{[{<<"Data">>,<<"Zmlyc3QgcmVjb3JkISEh">>},
{<<"PartitionKey">>,<<"BlanePartitionKey">>},
{<<"SequenceNumber">>,
<<"49545722516689138064543799042897648239478878787235479554">>}]}]}]}
What I am struggling with is how to write a loop that keeps hitting the Kinesis endpoint for that stream until there are no more shard iterators, i.e. until I have all the records, since I can't re-assign variables as I would in Ruby.
WARNING: My code might be buggy, but it's "close". I've never run it, and I have no idea what the last iterator looks like.
I see you are trying to do your job entirely in the shell. That is possible, but hard. You can use a named fun and recursion (easier since release 17.0), for example:
F = fun (ShardIteratorPayload) ->
        {_, [{_, ShardIterator}]} = kinetic:get_shard_iterator(ShardIteratorPayload),
        FunLoop =
            fun Loop(<<>>, Accumulator) -> % not sure what the last iterator looks like
                    lists:reverse(Accumulator);
                Loop(ShardIterator, Accumulator) ->
                    {ok, [{_, NextShardIterator}, {<<"Records">>, Records}]} =
                        kinetic:get_records([{<<"ShardIterator">>, <<ShardIterator/binary>>}]),
                    Loop(NextShardIterator, [Records | Accumulator])
            end,
        FunLoop(ShardIterator, [])
    end.
AllRecords = F(GetShardIteratorPayload).
But it's too complicated to type into the shell...
It's much easier to put the code in modules.
A common pattern in Erlang is to spawn another process (or processes) to fetch your data. To keep it simple, you can spawn another process by calling spawn or spawn_link, but don't bother with links for now and just use spawn/3.
Let's compile a simple consumer module:
-module(kinetic_simple_consumer).

-export([start/1]).

start(GetShardIteratorPayload) ->
    Pid = spawn(kinetic_simple_fetcher, start, [self(), GetShardIteratorPayload]),
    consumer_loop(Pid).

consumer_loop(FetcherPid) ->
    receive
        {FetcherPid, finished} ->
            ok;
        {FetcherPid, {records, Records}} ->
            consume(Records),
            consumer_loop(FetcherPid);
        UnexpectedMsg ->
            io:format("DROPPING:~n~p~n", [UnexpectedMsg]),
            consumer_loop(FetcherPid)
    end.

consume(Records) ->
    io:format("RECEIVED:~n~p~n", [Records]).
And the fetcher:
-module(kinetic_simple_fetcher).

-export([start/2]).

start(ConsumerPid, GetShardIteratorPayload) ->
    {ok, [ShardIterator]} = kinetic:get_shard_iterator(GetShardIteratorPayload),
    fetcher_loop(ConsumerPid, ShardIterator).

fetcher_loop(ConsumerPid, {_, <<>>}) -> % not sure what the last iterator looks like
    ConsumerPid ! {self(), finished};
fetcher_loop(ConsumerPid, ShardIterator) ->
    {ok, [NextShardIterator, {<<"Records">>, Records}]} =
        kinetic:get_records(shard_iterator(ShardIterator)),
    ConsumerPid ! {self(), {records, Records}},
    fetcher_loop(ConsumerPid, NextShardIterator).

shard_iterator({_, ShardIterator}) ->
    [{<<"ShardIterator">>, <<ShardIterator/binary>>}].
As you can see, both processes can do their jobs concurrently.
Try from your shell:
kinetic_simple_consumer:start(GetShardIteratorPayload).
Now you will see that your shell process becomes the consumer, and you will get your shell back after the fetcher sends {ItsPid, finished}.
Next time instead of
kinetic_simple_consumer:start(GetShardIteratorPayload).
run:
spawn(kinetic_simple_consumer, start, [GetShardIteratorPayload]).
You should play with spawning processes - it's Erlang's main strength.
In Erlang, you write loops using tail-recursive functions. I don't know the kinetic API, so for simplicity I'll just assume that kinetic:next_iterator/1 returns {ok, NextIterator}, or {error, Reason} when there are no more shards.
loop({error, _Reason}) ->
    ok;
loop({ok, Iterator}) ->
    do_something_with(Iterator),
    Result = kinetic:next_iterator(Iterator),
    loop(Result).
You are replacing the loop with recursion. The first clause deals with the case where there are no more shards left (always start recursion with the end condition). The second clause deals with the case where we got an iterator: we do something with it and fetch the next one.
The recursive call is the last instruction in the function body, which is called tail recursion. Erlang optimizes such calls - they don't grow the call stack, so they can run indefinitely in constant memory (you will not get anything like "Stack level too deep").
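As a standalone illustration (unrelated to the kinetic API), compare a body-recursive sum with a tail-recursive one:
%% Body-recursive: each call has to wait for the result of the next one,
%% so the stack grows with the length of the list.
sum([]) -> 0;
sum([H | T]) -> H + sum(T).

%% Tail-recursive: the recursive call is the last thing the function does,
%% so it runs in constant stack space; the running total travels in the accumulator.
sum_tail(L) -> sum_tail(L, 0).

sum_tail([], Acc) -> Acc;
sum_tail([H | T], Acc) -> sum_tail(T, Acc + H).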
I have created a digraph term on process A and I want to pass this digraph to a process on another node. Whenever I use this digraph on the other process I am getting errors such as:
** {badarg,
[{ets,insert,[598105,{"EPqzYxiM9UV0pplPTRg8vX28h",[]}],[]},
{digraph,do_add_vertex,2,[{file,"digraph.erl"},{line,377}]},
Because a digraph is based on ETS, it appears that this is quite a bit more complicated, making a digraph pretty much standalone on the process where it was created. I have found this entry that describes a similar problem: ETS on a different process
I know I can create the digraph in a server and then connect to it through OTP messages, but I cannot do that in my architecture. All nodes can communicate using a specific approach designed to pass the state along as terms.
It appears to me that sending digraphs across different nodes that cannot directly communicate with each other is not possible. Overall, it appears that a digraph cannot be directly serialized. I am thinking that I can "unwind" the digraph into a list of vertices and edges and then transmit and recreate it on the other process (not efficient, performant or elegant). Any thoughts on a better way to serialize it? Is there a way to serialize the digraph state out of the ETS store?
Any thoughts?
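A rough sketch of that "unwind" idea, using only the public digraph API (note that this sketch does not preserve options such as acyclic or protected that were passed to digraph:new/1):
unwind(G) ->
    %% Collect every vertex and edge together with its label.
    Vs = [digraph:vertex(G, V) || V <- digraph:vertices(G)],
    Es = [digraph:edge(G, E) || E <- digraph:edges(G)],
    {Vs, Es}.

rebuild({Vs, Es}) ->
    %% Recreate a fresh digraph from the plain lists.
    G = digraph:new(),
    [digraph:add_vertex(G, V, Label) || {V, Label} <- Vs],
    [digraph:add_edge(G, E, V1, V2, Label) || {E, V1, V2, Label} <- Es],
    G.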
You can serialize/deserialize a digraph like this:
serialize({digraph, V, E, N, B}) ->
    {ets:tab2list(V),
     ets:tab2list(E),
     ets:tab2list(N),
     B}.

deserialize({VL, EL, NL, B}) ->
    DG = {digraph, V, E, N, B} = case B of
                                     true -> digraph:new();
                                     false -> digraph:new([acyclic])
                                 end,
    ets:delete_all_objects(V),
    ets:delete_all_objects(E),
    ets:delete_all_objects(N),
    ets:insert(V, VL),
    ets:insert(E, EL),
    ets:insert(N, NL),
    DG.
And this is the code I used to test it:
passer() ->
    G = digraph:new(),
    V1 = digraph:add_vertex(G),
    V2 = digraph:add_vertex(G),
    V3 = digraph:add_vertex(G),
    digraph:add_edge(G, V1, V2, "edge1"),
    digraph:add_edge(G, V1, V3, "edge2"),
    Pid = spawn(fun receiver/0),
    Pid ! serialize(G).

receiver() ->
    receive
        SG = {_VL, _EL, _NL, _B} ->
            G = deserialize(SG),
            io:format("Edges: ~p~n", [digraph:edges(G)]),
            io:format("Vertices: ~p~n", [digraph:vertices(G)])
    end.
Pretty ugly solution, but it works. I think this is the only way to pass a digraph between nodes, since ETS tables cannot be shared between nodes.
Edit: remove unnecessary loops
I have a solution, but it relies on the structure of the value returned by digraph:new(), so I am not sure that it will be compatible with future versions.
D = digraph:new(),
...
%some code modifying D
...
{digraph, Vertices, Edges, Neighbours, Cyclic} = D, % get the table ids of the 3 tables holding D's values
% It would be preferable to use the record definition from the digraph module:
%   -record(digraph, {vtab = notable :: ets:tab(),
%                     etab = notable :: ets:tab(),
%                     ntab = notable :: ets:tab(),
%                     cyclic = true :: boolean()}).
LV = ets:tab2list(Vertices),
LE = ets:tab2list(Edges),
LN = ets:tab2list(Neighbours),
...
% then serialize and send all these variables to the target node, ideally in one single tuple like
% {my_digraph_data,LV,LE,LN,Cyclic}, or using refs to avoid mixing up messages,
% and on reception on the remote node:
receive
    {my_digraph_data, LV, LE, LN, Cyclic} ->
        % digraph:new/1 expects a list of options, not a boolean
        Dcopy = case Cyclic of
                    true  -> digraph:new();
                    false -> digraph:new([acyclic])
                end,
        {digraph, Vertices, Edges, Neighbours, _Cyclic} = Dcopy,
        ets:insert(Vertices, LV),
        ets:insert(Edges, LE),
        ets:insert(Neighbours, LN),
        Dcopy;
    ...
and that's it.
Note: if you are testing this in the same shell, make sure to change the names of Vertices, Edges, and Neighbours in the spawned process's receive expression, otherwise it will fail with a badmatch (as they have already been bound when matching D).
I wish to manipulate data on a very low level.
Therefore I have a function that receives a virtual memory address as an integer and "does stuff" with this memory address. I interfaced this function from C, so it has the type (CUInt -> a).
The memory I want to link is a Word8 in a file. Sadly, I have no idea how to access the pointer value to that Word8.
To be clear, I do not need the value of the Word8; I need the virtual memory address itself, i.e. the value of the pointer to it.
For the sake of a simple example, say you want to add an offset to the pointer.
Front matter:
module Main where
import Control.Monad (forM_)
import Data.Char (chr)
import Data.Word (Word8)
import Foreign.ForeignPtr (ForeignPtr, withForeignPtr)
import Foreign.Ptr (Ptr, plusPtr)
import Foreign.Storable (peek)
import System.IO.MMap (Mode(ReadOnly), mmapFileForeignPtr)
Yes, you wrote that you don't want the value of the Word8, but I've retrieved it with peek to demonstrate that the pointer is valid. You might be tempted to return the Ptr from inside withForeignPtr, but the documentation warns against that:
Note that it is not safe to return the pointer from the action and use it after the action completes. All uses of the pointer should be inside the withForeignPtr bracket. The reason for this unsafeness is the same as for unsafeForeignPtrToPtr below: the finalizer may run earlier than expected, because the compiler can only track usage of the ForeignPtr object, not a Ptr object made from it.
The code is straightforward:
doStuff :: ForeignPtr Word8 -> Int -> IO ()
doStuff fp i =
  withForeignPtr fp $ \p -> do
    let addr = p `plusPtr` i
    val <- peek addr :: IO Word8
    print (addr, val, chr $ fromIntegral val)
To approximate “a Word8 in a File” from your question, the main program memory-maps a file and uses that buffer to do stuff with memory addresses.
main :: IO ()
main = do
  (p,offset,size) <- mmapFileForeignPtr path mode range
  forM_ [0 .. size-1] $ \i -> do
    doStuff p (offset + i)
  where
    path = "/tmp/input.dat"
    mode = ReadOnly
    range = Nothing
    -- range = Just (4,3)
Output:
(0x00007f1b40edd000,71,'G')
(0x00007f1b40edd001,117,'u')
(0x00007f1b40edd002,116,'t')
(0x00007f1b40edd003,101,'e')
(0x00007f1b40edd004,110,'n')
(0x00007f1b40edd005,32,' ')
(0x00007f1b40edd006,77,'M')
(0x00007f1b40edd007,111,'o')
(0x00007f1b40edd008,114,'r')
(0x00007f1b40edd009,103,'g')
(0x00007f1b40edd00a,101,'e')
(0x00007f1b40edd00b,110,'n')
(0x00007f1b40edd00c,33,'!')
(0x00007f1b40edd00d,10,'\n')
You are probably looking for ptrToIntPtr and probably fromIntegral to make it a CUInt.
Note that a CUInt cannot represent a pointer on all platforms, though.
As suggested in answers to a previous question, I tried using Erlang proplists to implement a prefix trie.
The code seems to work decently well... But, for some reason, it doesn't play well with the interactive shell. When I try to run it, the shell hangs:
> Trie = trie:from_dict(). % Creates a trie from a dictionary
% ... the trie is printed ...
% Then nothing happens
I see the new trie printed to the screen (ie, the call to trie:from_dict() has returned), then the shell just hangs. No new > prompt comes up and ^g doesn't do anything (but ^c will eventually kill it off).
With a very small dictionary (the first 50 lines of /usr/share/dict/words), the hang only lasts a second or two (and the trie is built almost instantly)... But it seems to grow exponentially with the size of the dictionary (100 words takes 5 or 10 seconds, I haven't had the patience to try larger wordlists). Also, as the shell is hanging, I notice that the beam.smp process starts eating up a lot of memory (somewhere between 1 and 2 gigs).
So, is there anything obvious that could be causing this shell hang and incredible memory usage?
Some various comments:
I have a hunch that the garbage collector is at fault, but I don't know how to profile or create an experiment to test that.
I've tried profiling with eprof and nothing obvious showed up.
Here is my "add string to trie" function:
add([], Trie) ->
    [ stop | Trie ];
add([Ch|Rest], Trie) ->
    SubTrie = proplists:get_value(Ch, Trie, []),
    NewSubTrie = add(Rest, SubTrie),
    NewTrie = [ { Ch, NewSubTrie } | Trie ],
    % Arbitrarily decide to compress key/value list once it gets
    % more than 60 pairs.
    if length(NewTrie) > 60 ->
           proplists:compact(NewTrie);
       true ->
           NewTrie
    end.
The problem is (amongst others? -- see my comment) that you are always adding a new {Ch, NewSubTrie} tuple to your proplist Trie, whether or not Ch already existed.
Instead of
NewTrie = [ { Ch, NewSubTrie } | Trie ]
you need something like:
NewTrie = lists:keystore(Ch, 1, Trie, {Ch, NewSubTrie})
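To see the difference on a repeated key, here is a small shell sketch (sub1 and sub2 just stand in for sub-tries):
1> Trie = [{$a, sub1}].
[{97,sub1}]
2> [{$a, sub2} | Trie].                     % cons: the old pair is shadowed but still kept
[{97,sub2},{97,sub1}]
3> lists:keystore($a, 1, Trie, {$a, sub2}). % keystore: the pair is replaced in place
[{97,sub2}]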
You're not really building a trie here. Your end result is effectively a randomly ordered proplist of proplists that requires full scans at each level when walking the list. A trie typically gets its ordering implicitly from position in the array (or list).
Here's an implementation that uses tuples as the storage mechanism. Calling set only rebuilds the root and direct path tuples.
(Note: you would probably have to make the pair a triple (adding a size) to make delete work with any efficiency.)
I believe Erlang tuples are really just arrays (I think I read that somewhere), so lookup should be super fast, and modify is probably straightforward. Maybe this is faster with the array module, but I haven't really played with it enough to know.
This version also stores an arbitrary value, so you can do things like:
1> c(trie).
{ok,trie}
2> trie:get("ab",trie:set("aa",bar,trie:new("ab",foo))).
foo
3> trie:get("abc",trie:set("aa",bar,trie:new("ab",foo))).
undefined
4>
Code (entire module). Note: it assumes lowercase, non-empty string keys.
-module(trie).
-compile(export_all).

%% 26 pairs, to avoid the cost of calculating a new level at runtime
-define(NEW, {
    {undefined,nodepth},{undefined,nodepth},{undefined,nodepth},{undefined,nodepth},
    {undefined,nodepth},{undefined,nodepth},{undefined,nodepth},{undefined,nodepth},
    {undefined,nodepth},{undefined,nodepth},{undefined,nodepth},{undefined,nodepth},
    {undefined,nodepth},{undefined,nodepth},{undefined,nodepth},{undefined,nodepth},
    {undefined,nodepth},{undefined,nodepth},{undefined,nodepth},{undefined,nodepth},
    {undefined,nodepth},{undefined,nodepth},{undefined,nodepth},{undefined,nodepth},
    {undefined,nodepth},{undefined,nodepth}
}).

-define(POS(Ch), Ch - $a + 1).

new(Key, V) -> set(Key, V, ?NEW).

set([H], V, Trie) ->
    Pos = ?POS(H),
    {_, SubTrie} = element(Pos, Trie),
    setelement(Pos, Trie, {V, SubTrie});
set([H|T], V, Trie) ->
    Pos = ?POS(H),
    {SubKey, SubTrie} = element(Pos, Trie),
    case SubTrie of
        nodepth -> setelement(Pos, Trie, {SubKey, set(T, V, ?NEW)});
        SubTrie -> setelement(Pos, Trie, {SubKey, set(T, V, SubTrie)})
    end.

get([H], Trie) ->
    {Val, _} = element(?POS(H), Trie),
    Val;
get([H|T], Trie) ->
    case element(?POS(H), Trie) of
        {_, nodepth} -> undefined;
        {_, SubTrie} -> get(T, SubTrie)
    end.