Haskell: Lazily read and parse a binary file with the binary package

I'm trying to read in a binary file and parse it lazily using the 'binary' package. The package documentation gives an example of how to do this without forcing all the input for a scenario very similar to mine:
example2 :: BL.ByteString -> [Trade]
example2 input
  | BL.null input = []
  | otherwise =
      let (trade, rest, _) = runGetState getTrade input 0
      in  trade : example2 rest
However, this uses the deprecated runGetState function, which itself points you towards the runGetIncremental function.
The problem is that runGetIncremental seems to force the remaining input to be a strict bytestring, which in turn forces the whole file into memory. Indeed, I see memory usage of around 6 GB when I try to run this. Even the implementation of runGetState is now based on runGetIncremental, and it simply converts the strict bytestring back into a lazy one using chunk.
Can I get the behaviour as described in the tutorial, or is this now unsupported by binary? If the latter, what's the best way to do this? I have a little experience using conduit, but it's not clear to me how I could use it here.

You can do this using pipes-binary and pipes-bytestring. Here's a helper function for your benefit:
import Control.Monad (void)
import Data.Binary
import Pipes
import Pipes.Binary (decodeMany)
import Pipes.ByteString (fromHandle)
import qualified Pipes.Prelude as P
import System.IO
decodeHandle :: (Binary a) => Handle -> Producer a IO ()
decodeHandle handle = void $ decodeMany (fromHandle handle) >-> P.map snd
The void and map snd are there because decodeMany actually returns more information (like byte offsets and parsing errors). If you actually want that information, then just remove them.
Here's an example of how you might use decodeHandle, using a quick skeleton for Trade I threw together:
data Trade = Trade

instance Binary Trade where
    get   = return Trade
    put _ = return ()

instance Show Trade where show _ = "Trade"

main = withFile "inFile.txt" ReadMode $ \handle -> runEffect $
    for (decodeHandle handle) $ \trade -> do
        lift $ print (trade :: Trade)
        -- do more with the parsed trade
You can use for to loop over the decoded trades and handle them, or if you prefer you can use pipe composition:
main = withFile "inFile.txt" ReadMode $ \handle -> runEffect $
    decodeHandle handle >-> P.print
This will be lazy and only decode as many trades as you actually need. So if you insert a take in between the decoder and the printer, it will only read as much input as necessary to process the requested number of trades:
main = withFile "inFile.txt" ReadMode $ \handle -> runEffect $
    for (decodeHandle handle >-> P.take 4) $ \trade -> do
        ... -- This will only process the first 4 trades

-- or using purely pipe composition:
main = withFile "inFile.txt" ReadMode $ \handle -> runEffect $
    decodeHandle handle >-> P.take 4 >-> P.print


What is the Erlang way to do stream manipulations?

Suppose I wanted to do something like:
dict
.values()
.map(fun scrub/1)
.flatMap(fun split/1)
.groupBy(fun keyFun/1, fun count/1)
.to_dict()
What is the most elegant way to achieve this in Erlang?
There is no direct, easy way of doing that. All the attempts I have seen looked even worse than straightforward composition. If you look at the majority of open-source projects in Erlang, you will find that they use plain nested composition. Re-using your example:
to_dict(
    groupBy(fun keyFun/1, fun count/1,
        flatMap(fun split/1,
            map(fun scrub/1,
                values(dict))))).
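Such attempts usually boil down to folding a list of funs over an initial value; a minimal sketch (my own helper, not a standard library function):

pipe(Initial, Funs) ->
    lists:foldl(fun(F, Acc) -> F(Acc) end, Initial, Funs).

%% e.g. (assuming scrub/1, split/1 and a group_by/3 helper exist):
%% pipe(maps:values(Dict),
%%      [fun(Vs) -> lists:map(fun scrub/1, Vs) end,
%%       fun(Vs) -> lists:flatmap(fun split/1, Vs) end,
%%       fun(Vs) -> group_by(fun keyFun/1, fun count/1, Vs) end,
%%       fun maps:from_list/1]).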
This isn't a construct that's natural in Erlang. If you have a couple functions, regular composition is what I'd use:
lists:flatten(lists:map(fun (A) ->
                            do_stuff(A)
                        end,
                        generate_list())).
For a longer series of operations, intermediate variables:
Dict = #{hello => world, ...},
Values = maps:values(Dict),
ScrubbedValues = lists:map(fun scrub/1, Values),
SplitValues = lists:flatten(lists:map(fun split/1, ScrubbedValues)),
GroupedValues = basil_lists:group_by(fun keyFun/1, fun count/1, SplitValues),
Dict2 = maps:from_list(GroupedValues).
That's how it'd look if you wanted all of those operations grouped in one shot together.
However, I'd more likely write this in a different way:
-spec remap_values(map()) -> map().
remap_values(Map) ->
    map_values(maps:values(Map)).

-spec map_values(list()) -> map().
map_values(Values) ->
    map_values(Values, [], []).

-spec map_values(list(), list(), list()) -> map().
map_values([], OutList, OutGroup) ->
    %% Base case: transform into a map
    Grouped = lists:zip(OutGroup, OutList),
    lists:foldl(fun ({Group, Element}, Acc) ->
                        Existing = maps:get(Group, Acc, []),
                        Acc#{Group => [Element | Existing]}
                end,
                #{},
                Grouped);
map_values([First|Rest], OutList, OutGroup) ->
    %% Recursive case: process the first element and categorize the result
    Processed = split(scrub(First)),
    Categories = lists:map(fun categorize/1, Processed),
    map_values(Rest, OutList ++ Processed, OutGroup ++ Categories).
The actual correct implementation depends a lot on how the code is going to be run; what I've written here is pretty simple, but might not perform well on large amounts of data. If you're actually looking to process an endless stream of data, you'll need to write that yourself (though you may find the gen_server behaviour a very useful framework for doing so).
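If you do end up writing your own stream, one common trick is to represent it as a nullary fun that yields the next element together with the rest of the stream; a minimal sketch (my own illustration, not part of the original answer):

%% An infinite stream of integers starting at N.
stream_from(N) -> fun() -> {N, stream_from(N + 1)} end.

%% Force the first K elements of a stream into a list.
stream_take(0, _Stream) -> [];
stream_take(K, Stream) ->
    {Value, Rest} = Stream(),
    [Value | stream_take(K - 1, Rest)].

%% stream_take(3, stream_from(10)).  %% => [10,11,12]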

Extract values from list of tuples continuously in Erlang

I am learning Erlang from a Ruby background and having some difficulty grasping the thought process. The problem I am trying to solve is the following:
I need to make the same request to an API; each time, I receive a unique ID in the response which I need to pass into the next request, until no ID is returned. From each response I also need to extract certain data and use it for other things.
First get the iterator:
ShardIteratorResponse = kinetic:get_shard_iterator(GetShardIteratorPayload).
{ok,[{<<"ShardIterator">>,
<<"AAAAAAAAAAGU+v0fDvpmu/02z5Q5OJZhPo/tU7fjftFF/H9M7J9niRJB8MIZiB9E1ntZGL90dIj3TW6MUWMUX67NEj4GO89D"...>>}]}
Parse out the shard iterator...
{_, [{_, ShardIterator}]} = ShardIteratorResponse.
Make the request to Kinesis for the stream's records...
GetRecordsPayload = [{<<"ShardIterator">>, <<ShardIterator/binary>>}].
[{<<"ShardIterator">>,
<<"AAAAAAAAAAGU+v0fDvpmu/02z5Q5OJZhPo/tU7fjftFF/H9M7J9niRJB8MIZiB9E1ntZGL90dIj3TW6MUWMUX67NEj4GO89DETABlwVV"...>>}]
RecordsResponse = kinetic:get_records(GetRecordsPayload).
{ok,[{<<"NextShardIterator">>,
<<"AAAAAAAAAAFy3dnTJYkWr3gq0CGo3hkj1t47ccUS10f5nADQXWkBZaJvVgTMcY+nZ9p4AZCdUYVmr3dmygWjcMdugHLQEg6x"...>>},
{<<"Records">>,
[{[{<<"Data">>,<<"Zmlyc3QgcmVjb3JkISEh">>},
{<<"PartitionKey">>,<<"BlanePartitionKey">>},
{<<"SequenceNumber">>,
<<"49545722516689138064543799042897648239478878787235479554">>}]}]}]}
What I am struggling with is how to write a loop that keeps hitting the Kinesis endpoint for that stream until there are no more shard iterators, i.e. until I have all the records, since I can't re-assign variables as I would in Ruby.
WARNING: My code might be buggy, but it's "close". I've never run it and I don't know what the last iterator looks like.
I see you are trying to do the whole job in the shell. It's possible, but hard. You can use a named fun and recursion (easier since release 17.0), for example:
F = fun (ShardIteratorPayload) ->
        {_, [{_, ShardIterator}]} = kinetic:get_shard_iterator(ShardIteratorPayload),
        FunLoop =
            fun Loop(<<>>, Accumulator) -> % no clue what the last iterator looks like
                    lists:reverse(Accumulator);
                Loop(Iterator, Accumulator) ->
                    {ok, [{_, NextShardIterator}, {<<"Records">>, Records}]} =
                        kinetic:get_records([{<<"ShardIterator">>, <<Iterator/binary>>}]),
                    Loop(NextShardIterator, [Records | Accumulator])
            end,
        FunLoop(ShardIterator, [])
    end.
AllRecords = F(GetShardIteratorPayload).
But it's too complicated to type into the shell...
It's much easier to put it in modules.
A common pattern in Erlang is to spawn another process (or processes) to fetch your data. To keep it simple, you can spawn a process by calling spawn or spawn_link, but don't bother with links for now and just use spawn/3.
Let's compile a simple consumer module:
-module(kinetic_simple_consumer).

-export([start/1]).

start(GetShardIteratorPayload) ->
    Pid = spawn(kinetic_simple_fetcher, start, [self(), GetShardIteratorPayload]),
    consumer_loop(Pid).

consumer_loop(FetcherPid) ->
    receive
        {FetcherPid, finished} ->
            ok;
        {FetcherPid, {records, Records}} ->
            consume(Records),
            consumer_loop(FetcherPid);
        UnexpectedMsg ->
            io:format("DROPPING:~n~p~n", [UnexpectedMsg]),
            consumer_loop(FetcherPid)
    end.

consume(Records) ->
    io:format("RECEIVED:~n~p~n", [Records]).
And the fetcher:
-module(kinetic_simple_fetcher).

-export([start/2]).

start(ConsumerPid, GetShardIteratorPayload) ->
    {ok, [ShardIterator]} = kinetic:get_shard_iterator(GetShardIteratorPayload),
    fetcher_loop(ConsumerPid, ShardIterator).

fetcher_loop(ConsumerPid, {_, <<>>}) -> % no clue what the last iterator looks like
    ConsumerPid ! {self(), finished};
fetcher_loop(ConsumerPid, ShardIterator) ->
    {ok, [NextShardIterator, {<<"Records">>, Records}]} =
        kinetic:get_records(shard_iterator(ShardIterator)),
    ConsumerPid ! {self(), {records, Records}},
    fetcher_loop(ConsumerPid, NextShardIterator).

shard_iterator({_, ShardIterator}) ->
    [{<<"ShardIterator">>, <<ShardIterator/binary>>}].
As you can see, both processes can do their jobs concurrently.
Try from your shell:
kinetic_simple_consumer:start(GetShardIteratorPayload).
Now you'll see that your shell process becomes the consumer, and you will get your shell back after the fetcher sends {ItsPid, finished}.
Next time instead of
kinetic_simple_consumer:start(GetShardIteratorPayload).
run:
spawn(kinetic_simple_consumer, start, [GetShardIteratorPayload]).
You should play with spawning processes - it's Erlang's main strength.
In Erlang, you can write a loop using tail-recursive functions. I don't know the kinetic API, so for simplicity I'll just assume that kinetic:next_iterator/1 returns {ok, NextIterator}, or {error, Reason} when there are no more shards.
loop({error, _Reason}) ->
    ok;
loop({ok, Iterator}) ->
    do_something_with(Iterator),
    Result = kinetic:next_iterator(Iterator),
    loop(Result).
You are replacing the loop with recursion. The first clause deals with the case where there are no more shards left (always start a recursive function with the end condition). The second clause deals with the case where we got an iterator: we do something with it and then fetch the next one.
The recursive call is the last instruction in the function body, which is called tail recursion. Erlang optimizes such calls: they don't grow the call stack, so they can run indefinitely in constant memory (you will not get anything like "stack level too deep").
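For instance, sticking with the same assumed kinetic:next_iterator/1 API, a tail-recursive variant that also counts the iterators it visits could look like this (sketch only):

%% Sketch only: relies on the same assumed kinetic:next_iterator/1 API as above.
count_iterators(Result) ->
    count_iterators(Result, 0).

count_iterators({error, _Reason}, Count) ->
    Count;                                   % end condition comes first
count_iterators({ok, Iterator}, Count) ->
    do_something_with(Iterator),
    % the recursive call is the last expression, so the stack does not grow
    count_iterators(kinetic:next_iterator(Iterator), Count + 1).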

How to pass a digraph to a different process and node?

I have created a digraph term on process A and I want to pass this digraph to a process on another node. Whenever I use this digraph on the other process I am getting errors such as:
** {badarg,
[{ets,insert,[598105,{"EPqzYxiM9UV0pplPTRg8vX28h",[]}],[]},
{digraph,do_add_vertex,2,[{file,"digraph.erl"},{line,377}]},
Because a digraph is based on ETS, it appears that this is rather more complicated, making a digraph pretty much local to the process that created it. I have found this entry that describes a similar problem: ETS on a different process
I know I could create the digraph in a server and then access it through OTP messages, but I cannot do that in my architecture. All nodes communicate using a specific approach designed to pass state along as terms.
It appears to me that sending a digraph across nodes that cannot directly communicate with each other is not possible; overall, a digraph cannot be directly serialized. I am thinking that I could "unwind" the digraph into a list of vertices and edges and then transmit and recreate it in the other process (not efficient, performant or elegant; roughly sketched below). Any thoughts on a better way to serialize it? Is there a way to serialize the digraph state out of the ETS store?
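For what it's worth, the "unwind" approach I have in mind would be something like this, using only the documented digraph functions (an untested sketch):

%% Untested sketch: flatten the digraph into plain terms, send, and rebuild.
unwind(G) ->
    Vertices = [digraph:vertex(G, V) || V <- digraph:vertices(G)], % [{Vertex, Label}]
    Edges    = [digraph:edge(G, E)   || E <- digraph:edges(G)],    % [{Edge, V1, V2, Label}]
    {Vertices, Edges}.

rewind({Vertices, Edges}) ->
    G = digraph:new(), % default options; the cyclic/acyclic flag would have to travel too
    [digraph:add_vertex(G, V, Label) || {V, Label} <- Vertices],
    [digraph:add_edge(G, E, V1, V2, Label) || {E, V1, V2, Label} <- Edges],
    G.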
Any thoughts?
You can serialize/deserialize a digraph like this:
serialize({digraph, V, E, N, B}) ->
    {ets:tab2list(V),
     ets:tab2list(E),
     ets:tab2list(N),
     B}.

deserialize({VL, EL, NL, B}) ->
    DG = {digraph, V, E, N, B} =
        case B of
            true  -> digraph:new();
            false -> digraph:new([acyclic])
        end,
    ets:delete_all_objects(V),
    ets:delete_all_objects(E),
    ets:delete_all_objects(N),
    ets:insert(V, VL),
    ets:insert(E, EL),
    ets:insert(N, NL),
    DG.
And this is the code I used to test it:
passer() ->
    G = digraph:new(),
    V1 = digraph:add_vertex(G),
    V2 = digraph:add_vertex(G),
    V3 = digraph:add_vertex(G),
    digraph:add_edge(G, V1, V2, "edge1"),
    digraph:add_edge(G, V1, V3, "edge2"),
    Pid = spawn(fun receiver/0),
    Pid ! serialize(G).

receiver() ->
    receive
        SG = {_VL, _EL, _NL, _B} ->
            G = deserialize(SG),
            io:format("Edges: ~p~n", [digraph:edges(G)]),
            io:format("Vertices: ~p~n", [digraph:vertices(G)])
    end.
A pretty ugly solution, but it works. I think this is the only way to pass a digraph between nodes, since ETS tables cannot be shared between nodes.
Edit: removed unnecessary loops.
I have a solution, but it relies on the structure of the value returned by digraph:new(), so I am not sure that it will be compatible with future versions.
D = digraph:new(),
...
% some code modifying D
...
% get the table ids of the 3 ETS tables holding D's values
{digraph, Vertices, Edges, Neighbours, Cyclic} = D,
% It would be preferable to use the record definition from the digraph module:
% -record(digraph, {vtab = notable :: ets:tab(),
%                   etab = notable :: ets:tab(),
%                   ntab = notable :: ets:tab(),
%                   cyclic = true :: boolean()}).
LV = ets:tab2list(Vertices),
LE = ets:tab2list(Edges),
LN = ets:tab2list(Neighbours),
...
% then serialize and send all these variables to the target node, ideally in one
% single tuple like {my_digraph_data,LV,LE,LN,Cyclic}, or using refs to avoid
% mixing up messages; on reception on the remote node:
receive
    {my_digraph_data, LV, LE, LN, Cyclic} ->
        % digraph:new/1 expects a list of options, so map the boolean back
        Dcopy = case Cyclic of
                    true  -> digraph:new([cyclic]);
                    false -> digraph:new([acyclic])
                end,
        {digraph, Vertices, Edges, Neighbours, _Cyclic} = Dcopy,
        ets:insert(Vertices, LV),
        ets:insert(Edges, LE),
        ets:insert(Neighbours, LN),
        Dcopy;
    ...
and that's it.
Note: if you are testing this in the same shell, make sure to change the names of Vertices, Edges, and Neighbours in the spawned process's receive expression; otherwise it will fail with a badmatch (as they have already been bound when matching D).

How to get a pointer value in Haskell?

I wish to manipulate data on a very low level.
Therefore I have a function that receives a virtual memory address as an integer and "does stuff" with this memory address. I interfaced this function from C, so it has the type (CUInt -> a).
The memory I want to link to is a Word8 in a file. Sadly, I have no idea how to get the pointer value to that Word8.
To be clear, I do not need the value of the Word8; I need the virtual memory address, i.e. the value of the pointer to it.
For the sake of a simple example, say you want to add an offset to the pointer.
Front matter:
module Main where
import Control.Monad (forM_)
import Data.Char (chr)
import Data.Word (Word8)
import Foreign.ForeignPtr (ForeignPtr, withForeignPtr)
import Foreign.Ptr (Ptr, plusPtr)
import Foreign.Storable (peek)
import System.IO.MMap (Mode(ReadOnly), mmapFileForeignPtr)
Yes, you wrote that you don't want the value of the Word8, but I've retrieved it with peek to demonstrate that the pointer is valid. You might be tempted to return the Ptr from inside withForeignPtr, but the documentation warns against that:
Note that it is not safe to return the pointer from the action and use it after the action completes. All uses of the pointer should be inside the withForeignPtr bracket. The reason for this unsafeness is the same as for unsafeForeignPtrToPtr below: the finalizer may run earlier than expected, because the compiler can only track usage of the ForeignPtr object, not a Ptr object made from it.
The code is straightforward:
doStuff :: ForeignPtr Word8 -> Int -> IO ()
doStuff fp i =
    withForeignPtr fp $ \p -> do
        let addr = p `plusPtr` i
        val <- peek addr :: IO Word8
        print (addr, val, chr $ fromIntegral val)
To approximate “a Word8 in a File” from your question, the main program memory-maps a file and uses that buffer to do stuff with memory addresses.
main :: IO ()
main = do
    (p, offset, size) <- mmapFileForeignPtr path mode range
    forM_ [0 .. size-1] $ \i -> do
        doStuff p (offset + i)
  where
    path  = "/tmp/input.dat"
    mode  = ReadOnly
    range = Nothing
    -- range = Just (4,3)
Output:
(0x00007f1b40edd000,71,'G')
(0x00007f1b40edd001,117,'u')
(0x00007f1b40edd002,116,'t')
(0x00007f1b40edd003,101,'e')
(0x00007f1b40edd004,110,'n')
(0x00007f1b40edd005,32,' ')
(0x00007f1b40edd006,77,'M')
(0x00007f1b40edd007,111,'o')
(0x00007f1b40edd008,114,'r')
(0x00007f1b40edd009,103,'g')
(0x00007f1b40edd00a,101,'e')
(0x00007f1b40edd00b,110,'n')
(0x00007f1b40edd00c,33,'!')
(0x00007f1b40edd00d,10,'\n')
You are probably looking for ptrToIntPtr, and then fromIntegral to make it a CUInt.
Note that a CUInt cannot represent a pointer on all platforms, though.

Erlang: erl shell hangs after building a large data structure

As suggested in answers to a previous question, I tried using Erlang proplists to implement a prefix trie.
The code seems to work decently well... But, for some reason, it doesn't play well with the interactive shell. When I try to run it, the shell hangs:
> Trie = trie:from_dict(). % Creates a trie from a dictionary
% ... the trie is printed ...
% Then nothing happens
I see the new trie printed to the screen (i.e., the call to trie:from_dict() has returned), then the shell just hangs. No new > prompt comes up, and ^g doesn't do anything (though ^c will eventually kill it off).
With a very small dictionary (the first 50 lines of /usr/share/dict/words), the hang only lasts a second or two (and the trie is built almost instantly)... But it seems to grow exponentially with the size of the dictionary (100 words takes 5 or 10 seconds, I haven't had the patience to try larger wordlists). Also, as the shell is hanging, I notice that the beam.smp process starts eating up a lot of memory (somewhere between 1 and 2 gigs).
So, is there anything obvious that could be causing this shell hang and incredible memory usage?
A few comments:
I have a hunch that the garbage collector is at fault, but I don't know how to profile or create an experiment to test that.
I've tried profiling with eprof and nothing obvious showed up.
Here is my "add string to trie" function:
add([], Trie) ->
    [ stop | Trie ];
add([Ch|Rest], Trie) ->
    SubTrie = proplists:get_value(Ch, Trie, []),
    NewSubTrie = add(Rest, SubTrie),
    NewTrie = [ { Ch, NewSubTrie } | Trie ],
    % Arbitrarily decide to compress key/value list once it gets
    % more than 60 pairs.
    if
        length(NewTrie) > 60 ->
            proplists:compact(NewTrie);
        true ->
            NewTrie
    end.
The problem (amongst others? see my comment) is that you are always adding a new {Ch, NewSubTrie} tuple to your proplist Trie, whether or not Ch already existed.
Instead of
NewTrie = [ { Ch, NewSubTrie } | Trie ]
you need something like:
NewTrie = lists:keystore(Ch, 1, Trie, {Ch, NewSubTrie})
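Folded into the add/2 function above, the fix would look roughly like this (untested sketch; the compaction branch is left out for brevity):

add([], Trie) ->
    [ stop | Trie ];
add([Ch|Rest], Trie) ->
    SubTrie = proplists:get_value(Ch, Trie, []),
    NewSubTrie = add(Rest, SubTrie),
    % replace any existing {Ch, _} entry instead of prepending a duplicate
    lists:keystore(Ch, 1, Trie, {Ch, NewSubTrie}).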
You're not really building a trie here. Your end result is effectively a randomly ordered proplist of proplists that requires a full scan at each level when walking the list. Tries typically imply ordering based on position in an array (or list).
Here's an implementation that uses tuples as the storage mechanism. Calling set only rebuilds the root and the tuples on the direct path.
(Note: you would probably have to make the pair a triple (adding a size) to make delete work with any efficiency.)
I believe Erlang tuples are really just arrays (I thought I read that somewhere), so lookup should be very fast, and modification is fairly straightforward. Maybe this would be faster with the array module, but I haven't played with it enough to know.
This version also stores an arbitrary value, so you can do things like:
1> c(trie).
{ok,trie}
2> trie:get("ab",trie:set("aa",bar,trie:new("ab",foo))).
foo
3> trie:get("abc",trie:set("aa",bar,trie:new("ab",foo))).
undefined
4>
Code (entire module). Note 2: it assumes lower-case, non-empty string keys.
-module(trie).
-compile(export_all).
-define(NEW,{ %% 26 pairs, to avoid cost of calculating a new level at runtime
{undefined,nodepth},{undefined,nodepth},{undefined,nodepth},{undefined,nodepth},
{undefined,nodepth},{undefined,nodepth},{undefined,nodepth},{undefined,nodepth},
{undefined,nodepth},{undefined,nodepth},{undefined,nodepth},{undefined,nodepth},
{undefined,nodepth},{undefined,nodepth},{undefined,nodepth},{undefined,nodepth},
{undefined,nodepth},{undefined,nodepth},{undefined,nodepth},{undefined,nodepth},
{undefined,nodepth},{undefined,nodepth},{undefined,nodepth},{undefined,nodepth},
{undefined,nodepth},{undefined,nodepth}
}
).
-define(POS(Ch), Ch - $a + 1).
new(Key,V) -> set(Key,V,?NEW).
set([H],V,Trie) ->
    Pos = ?POS(H),
    {_,SubTrie} = element(Pos,Trie),
    setelement(Pos,Trie,{V,SubTrie});
set([H|T],V,Trie) ->
    Pos = ?POS(H),
    {SubKey,SubTrie} = element(Pos,Trie),
    case SubTrie of
        nodepth -> setelement(Pos,Trie,{SubKey,set(T,V,?NEW)});
        SubTrie -> setelement(Pos,Trie,{SubKey,set(T,V,SubTrie)})
    end.

get([H],Trie) ->
    {Val,_} = element(?POS(H),Trie),
    Val;
get([H|T],Trie) ->
    case element(?POS(H),Trie) of
        {_,nodepth} -> undefined;
        {_,SubTrie} -> get(T,SubTrie)
    end.
