I'm trying to wrap the Data.Binary.Put monad in another one so that later I can ask it questions like "how many bytes is it going to write" or "what is the current position in the file". But even very trivial wraps like:
data Writer1M a = Writer1M { write :: P.PutM a }
or
data Writer2M a = Writer2M { write :: (a, P.Put) }
create a huge space leak and the program usually crashes (after taking up 4GB of RAM). Here is what I've tried so far:
-- This works well and consumes almost no memory.
type Writer = P.Put
writer :: P.Put -> Writer
writer put = put
writeToFile :: String -> Writer -> IO ()
writeToFile path writer = BL.writeFile path (P.runPut writer)
-- This one will cause memory leak.
data Writer1M a = Writer1M { write :: P.PutM a }
instance Monad Writer1M where
  return a = Writer1M $ return a
  ma >>= f = Writer1M $ (write ma) >>= \a -> write $ f a
type WriterM = Writer1M
type Writer = WriterM ()
writer :: P.Put -> Writer
writer put = Writer1M $ put
writeToFile :: String -> Writer -> IO ()
writeToFile path writer = BL.writeFile path (P.runPut $ write writer)
-- This one will crash as well with exactly the
-- same memory footprint as Writer1M
data Writer2M a = Writer2M { write :: (a, P.Put) }
instance Monad Writer2M where
  return a = Writer2M $ (a, return ())
  ma >>= f = Writer2M $ (b, p >> p')
    where (a,p)  = write ma
          (b,p') = write $ f a
type WriterM = Writer2M
type Writer = WriterM ()
writer :: P.Put -> Writer
writer put = Writer2M $ ((), put)
writeToFile :: String -> Writer -> IO ()
writeToFile path writer = BL.writeFile path (P.runPut $ snd $ write writer)
I'm new to Haskell and this makes no sense to me, but the wrapper monads seem very trivial, so I'm guessing there is something obvious I'm missing.
Thanks for looking.
UPDATE:
Here is a sample code that demonstrates the problem: http://hpaste.org/43400/why_wrapping_the_databinaryp
UPDATE2:
There is also a second part to this question here.
After poking around for a bit, I found that the problem seems to be the usage of binary's (>>=) to implement (>>). The following addition to the Writer1M monad implementation solves the problem:
m >> k = Writer1M $ write m >> write k
Whereas this version still leaks memory:
m >> k = Writer1M $ write m >>= const (write k)
Looking at binary's source, (>>) seems to discard the result of the first monad explicitly. Not sure how exactly this prevents the leak, though. My best theory is that GHC otherwise holds onto the PairS object, and the "a" reference leaks because it never gets looked at.
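Putting that together, the full Writer1M instance with the explicit (>>) reads roughly as follows (just the instance from above plus the fix, nothing new):

instance Monad Writer1M where
  return a = Writer1M $ return a
  ma >>= f = Writer1M $ (write ma) >>= \a -> write $ f a
  m  >>  k = Writer1M $ write m >> write k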
Did you try to make the monad stricter? E.g., make the constructors of your datatype strict, or replace them with a newtype.
I don't know what the exact problem is here, but this is the usual source of leaks.
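For example, a minimal sketch of that suggestion (untested; the two variants are alternatives, not meant to be combined):

-- replace the data declaration with a newtype:
newtype Writer1M a = Writer1M { write :: P.PutM a }

-- or, keeping data, make the field strict instead:
-- data Writer1M a = Writer1M { write :: !(P.PutM a) }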
PS: And try to remove unnecessary lambdas, for instance:
ma >>= f = Writer1M $ write ma >>= write . f
Based on my previous question whose setting I refined gradually (How to create monadic behaviour in reactive-banana):
Suppose there is eKey, an event fired whenever a key is pressed, b of type Buffer which is modified appropriately whenever a supported event occurs in eKey, and finally there are IO actions to be taken on some of the events. These IO actions depend on the state of b (for simplicity assume they print the current state of b to the console).
Currently, I have this to select the action that happens on an event:
getKeyAction :: KeyVal -> Maybe (IO Buffer -> IO Buffer)
getKeyAction 65288 = Just . liftM $ backspace
getKeyAction 65293 = Just $ \mb -> do
  b <- mb
  print $ "PRESSED ENTER: " ++ toString b
  return emptyBuffer
getKeyAction 65360 = Just . liftM $ leftMost
getKeyAction 65361 = Just . liftM $ left
...
getKeyAction _ = Nothing
and then I do:
let
  eBufferActions = filterJust $ getKeyAction <$> eKey
  bBuffer = accumB (return emptyBuffer) eBufferActions -- model `b`
eBuffer <- changes bBuffer
reactimate' $ fmap displayBuffer <$> eBuffer
for some displayBuffer :: IO Buffer -> IO ().
It does not seem to work as intended. The state of bBuffer seems to be evaluated on every event anew (effectively running all IO actions collected thus far every time an event occurs) which makes sense when I think about it in retrospect.
How should I refactor my code to do this correctly? (i.e. IO actions see current state of the buffer, nothing except the buffer gets accumulated)
If I could construct an Event that bears the value of bBuffer on occasion of an appropriate eKey event, then I could simply map a dedicated IO action over it and reactimate. What do you think? How to do that? Would <# achieve what I am trying to do? But how would I postpone the change to b associated with the current key press until after I take a snapshot of b with <# and map the Buffer -> IO () over it?
OK, so I believe this solves my problem, but I am not sure it is the right thing to do. So please comment.
I factor out those actions that do some non-trivial IO (other than return)
I filter the eKey event into two: eBuffer and eConfirm
eBuffer collects all modifying events (including clearing the buffer on confirmation)
eConfirm collects all confirmation events
I tag eConfirm with values of bBuffer, which captures the evolution of the buffer
Finally, I reactimate separately the IO and the changes of the buffer
The code fragment:
getKeyAction :: KeyVal -> Maybe (Buffer -> Buffer)
getKeyAction 65288 = Just backspace
-- omit action for ENTER
...
getConfirm :: KeyVal -> Maybe (Buffer -> Buffer)
getConfirm 65293 = Just (const mkBuffer) -- Clear buffer on ENTER
getConfirm _ = Nothing
Then in the description of the network:
let
  eBuffer = filterJust $ getKeyAction <$> eKey
  eConfirm = filterJust $ getConfirm <$> eKey
  bBuffer = accumB mkBuffer $ unions [ eBuffer, eConfirm ]
  eEval = bBuffer <# eConfirm
eBufferChanges <- changes bBuffer
reactimate $ evalBuffer <$> eEval
reactimate' $ fmap displayBuffer <$> eBufferChanges
for evalBuffer :: Buffer -> IO () and displayBuffer :: Buffer -> IO ().
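For completeness, a plausible pair of definitions with those types (hypothetical, just mirroring the print action from the question):

-- evaluate the confirmed buffer (here: just report it)
evalBuffer :: Buffer -> IO ()
evalBuffer b = print $ "PRESSED ENTER: " ++ toString b

-- display the buffer whenever it changes
displayBuffer :: Buffer -> IO ()
displayBuffer b = print $ "CURRENT BUFFER: " ++ toString b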
I've got a 279MB file that contains ~10 million key/value pairs, with ~500,000 unique keys. It's grouped by key (each key only needs to be written once), so all the values for a given key are together.
What I want to do is transpose the association, create a file where the pairs are grouped by value, and all the keys for a given value are stored together.
Currently, I'm using Parsec to read in the pairs as a list of tuples (K,[V]) (using lazy IO so I can process it as a stream while Parsec is processing the input file), where:
newtype K = K Text deriving (Show, Eq, Ord, Hashable)
newtype V = V Text deriving (Show, Eq, Ord, Hashable)
tupleParser :: Parser (K,[V])
tupleParser = ...
data ErrList e a = Cons a (ErrList e a) | End | Err e
parseAllFromFile :: Parser a -> FilePath -> IO (ErrList ParseError a)
parseAllFromFile parser inputFile = do
  contents <- readFile inputFile
  let Right initialState = parse getParserState inputFile contents
  return $ loop initialState
  where loop state = case unconsume $ runParsecT parser' state of
          Error err            -> Err err
          Ok Nothing _ _       -> End
          Ok (Just a) state' _ -> a `Cons` loop state'
        unconsume v = runIdentity $ case runIdentity v of
          Consumed ma -> ma
          Empty ma    -> ma
        parser' = (Just <$> parser) <|> (const Nothing <$> eof)
I've tried to insert the tuples into a Data.HashMap.Map V [K] to transpose the association:
transpose :: ErrList ParseError (K,[V]) -> Either ParseError [(V,[K])]
transpose = transpose' M.empty
  where transpose' _ (Err e)          = Left e
        transpose' m End              = Right $ assocs m
        transpose' m (Cons (k,vs) xs) = transpose' (L.foldl' (include k) m vs) xs
        include k m v = M.insertWith (const (k:)) v [k] m
But when I tried it, I got the error:
memory allocation failed (requested 2097152 bytes)
I could think of a couple things I'm doing wrong:
1. 2MB seems a bit low (considerably less than the 2GB RAM my machine has installed), so maybe I need to tell GHC it's ok to use more?
2. My problems could be algorithmic/data structure related. Maybe I'm using the wrong tools for the job?
3. My attempt to use lazy IO could be coming back to bite me.
I'm leaning toward (1) for now, but I'm not sure by any means.
Is there the possibility that the data will increase? If yes, then I'd suggest not reading the whole file into memory and processing the data in another way.
One simple possibility is to use a relational database for that. This'd be fairly easy - just load your data in, create a proper index and get it sorted as you need. The database will do all the work for you. I'd definitely recommend this.
Another option would be to create your own file-based mechanism. For example:
Choose some limit l that is reasonable to load into memory.
Create n = d `div` l files, where d is the total amount of your data. (Hopefully this will not exceed your file descriptor limit. You could also close and reopen files after each operation, but this will make the process very slow.)
Process the input sequentially and place each pair (k,v) into file number hash v `mod` n. This ensures that the pairs with the same value v will end up in the same file.
Process each file separately.
Merge them together.
It is essentially a hash table with file buckets. This solution assumes that each value has roughly the same number of keys (otherwise some files could get exceptionally large).
You could also implement an external sort which would allow you to sort basically any amount of data.
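A rough Haskell sketch of the bucketing step described above (distribute and bucketFor are made-up names for illustration; it assumes the Hashable instance on V from the question, a plain list of pairs, and one line per pair in each bucket file):

import Control.Monad (forM_)
import Data.Hashable (hash)
import System.IO
import qualified Data.Text as T
import qualified Data.Text.IO as T

-- Which of the n bucket files a value belongs to.
bucketFor :: Int -> V -> Int
bucketFor n v = hash v `mod` n

-- Write every (key, value) pair into its bucket file, one pair per line.
distribute :: Int -> [(K, V)] -> IO ()
distribute n pairs = do
  handles <- mapM (\i -> openFile ("bucket-" ++ show i) WriteMode) [0 .. n - 1]
  forM_ pairs $ \(K k, v@(V txt)) ->
    T.hPutStrLn (handles !! bucketFor n v) (k `T.append` T.pack " " `T.append` txt)
  mapM_ hClose handles

Each bucket file can then be transposed independently with the in-memory HashMap approach, and the results concatenated.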
To allow for files that are larger than available memory, it's a good idea to process them one bite-sized chunk at a time.
Here is a solid algorithm to copy file A to a new file B:
Create file B and lock it to your machine
Begin loop
If there isn't a next line in file A then exit loop
Read in the next line of file A
Apply processing to the line
Check if File B contains the line already
If File B does not contain the line already then append the line to file B
Goto beginning of loop
Unlock file B
It can also be worthwhile making a copy of file A into a temp folder and locking it while you work with it so that other people on the network aren't prevented from changing the original, but you have a snapshot of the file as it was at the point the procedure was begun.
I intend to revisit this answer in the future and add code.
I've googled and read, and I'm trying to find a "correct" way to do it, but every question I read on SO seems to have completely different answers.
Here is the gist of my problem. files has the type of a seq of triples (a: string, b: string, c: Int64). Being new to F#, I'm still not fluent in expressing type signatures (or, for that matter, understanding them). a is a filename, b is an internal identifier, and c is a value representing the length (size) of the file. baseconfig is a string from earlier in the code.
ignore(files
       |> Seq.filter( fun(x,y,z) -> y = baseconfig) // used to filter only files we want
       |> Seq.fold( fun f n ->
              if( (fun (_,_,z) -> z) n > 50L*1024L*1024L) then
                  zipfilex.Add((fun (z:string, _, _) -> z) n)
                  printfn("Adding 50mb to zip")
                  zipfilex.CommitUpdate()
                  zipfilex.BeginUpdate()
                  ("","",0L)
              else
                  zipfilex.Add((fun (z, _, _) -> z) n)
                  ("", "", (fun (_, _, z:Int64) -> z) n + (fun (_, _, z:Int64) -> z) f)
          ) ("","",0L)
      )
What this chunk of code is supposed to do is iterate through each file in files, add it to a zip archive (but not really, it just goes on a list to be committed later), and when the files exceed 50MB, commit the currently pending files to the zip archive. Adding a file is cheap, committing is expensive, so I try to mitigate the cost by batching it.
So far the code kinda works... except for the ObjectDisposedException I got when it approached 150MB of committed files. But I'm not sure this is the right way to do such an operation. It feels like I'm using Seq.fold in an unconventional way, and yet I don't know of a better way to do it.
Bonus question: Is there a better way to snipe values out of tuples? fst and snd only work for 2-element tuples, and I realize you can define your own functions instead of inlining them like I did, but it seems there should be a better way.
Update: In my previous attempts at fold, I couldn't understand why I couldn't just use an Int64 as the accumulator. Turns out I was missing some critical parentheses. A slightly simpler version is below. It also eliminates all the crazy tuple extraction.
ignore(foundoldfiles
       |> Seq.filter( fun (x,y,z) -> y = baseconfig)
       |> Seq.fold( fun (a) (f,g,j) ->
              zipfilex.Add( f)
              if( a > 50L*1024L*1024L) then
                  printfn("Adding 50mb to zip")
                  zipfilex.CommitUpdate()
                  zipfilex.BeginUpdate()
                  0L
              else
                  a + j
          ) 0L
      )
Update 2: I'm going to have to go with an imperative solution; F# is somehow re-entering this block of code after the zip file is closed in the statement that follows it, which explains the ObjectDisposedException. No idea how that works or why.
As an alternative to the "dirty" imperative style, you can extend the Seq module with a general and reusable function for chunking. The function is a bit like fold, but it takes a lambda that returns option<'State>. If it returns None, then a new chunk is started and otherwise the element is added to the previous chunk. Then you can write an elegant solution:
files
|> Seq.filter(fun (x, y, z) -> y = baseconfig)
|> Seq.chunkBy(fun (x, y, z) sum ->
       if sum + z > 50L*1024L*1024L then None
       else Some(sum + z)) 0L
|> Seq.iter(fun files ->
       zipfilex.BeginUpdate()
       for f, _, _ in files do zipfilex.Add(f)
       zipfilex.CommitUpdate())
The implementation of the chunkBy function is a bit longer - it needs to use IEnumerator directly & it can be expressed using recursion:
module Seq =
  let chunkBy f initst (files:seq<_>) =
    let en = files.GetEnumerator()
    let rec loop chunk st = seq {
      if not (en.MoveNext()) then
        if chunk <> [] then yield chunk
      else
        match f en.Current st with
        | Some(nst) -> yield! loop (en.Current::chunk) nst
        | None ->
            yield chunk
            yield! loop [en.Current] initst }
    loop [] initst
I don't think your problem benefits from the use of fold. It's most useful when building immutable structures. My opinion, in this case, is that it makes what you're trying to do less clear. The imperative solution works nicely:
let mutable a = 0L
for (f, g, j) in foundoldfiles do
    if g = baseconfig then
        zipfilex.Add(f)
        if a > 50L * 1024L * 1024L then
            printfn "Adding 50mb to zip"
            zipfilex.CommitUpdate()
            zipfilex.BeginUpdate()
            a <- 0L
        else
            a <- a + j
Here's my take:
let inline zip a b = a, b
foundoldfiles
|> Seq.filter (fun (_, internalid, _) -> internalid = baseconfig)
|> zip 0L
||> Seq.fold (fun acc (filename, _, filesize) ->
        zipfilex.Add filename
        let acc = acc + filesize
        if acc > 50L*1024L*1024L then
            printfn "Adding 50mb to zip"
            zipfilex.CommitUpdate ()
            zipfilex.BeginUpdate ()
            0L
        else acc)
|> ignore
Some notes:
The zip helper function makes for a clean pipeline through the entire function without any overhead, and in more complex scenarios helps with type inference, since the state gets shifted from the right to the left side of the fold function (though that doesn't matter or help in this particular case)
The use of _ to locally discard elements of the tuple that you don't need makes the code easier to read
The approach of pipelining into ignore rather than wrapping the entire expression with extra parentheses makes the code easier to read
Wrapping the arguments of unary functions in parentheses looks bizarre; you can't use parentheses for non-unary curried functions, so using them for unary functions is inconsistent. My policy is to reserve parentheses for constructor calls and tupled-function calls
EDIT: P.S. if( a > 50L*1024L*1024L) then is incorrect logic -- the if needs to take into account the accumulator plus the current filesize. E.g., if the first file was >= 50MB then the if wouldn't trigger.
If you're not fond of mutable variables and imperative loops, you could always rewrite this as a functional loop:
let rec loop acc = function
    | (file, id, size) :: files ->
        if id = baseconfig then
            zipfilex.Add file
            if acc > 50L*1024L*1024L then
                printfn "Adding 50mb to zip"
                zipfilex.CommitUpdate()
                zipfilex.BeginUpdate()
                loop 0L files
            else
                loop (acc + size) files
        else
            loop acc files
    | [] -> ()
loop 0L foundoldfiles
The advantage of this is it explicitly states the three different ways that the inductive case can proceed and how the accumulator is transformed in each case (so you're less likely to get this wrong - witness the bug in Daniel's for loop version).
You could even move the baseconfig check into a when clause:
let rec loop acc = function
    | (file, id, size) :: files when id = baseconfig ->
        zipfilex.Add file
        if acc > 50L*1024L*1024L then
            printfn "Adding 50mb to zip"
            zipfilex.CommitUpdate()
            zipfilex.BeginUpdate()
            loop 0L files
        else
            loop (acc + size) files
    | _ :: files -> loop acc files
    | [] -> ()
loop 0L foundoldfiles
Is this function tail-recursive?
let rec rec_algo1 step J =
    if step = dSs then J
    else
        let a = Array.init (Array2D.length1 M) (fun i -> minby1J i M J)
        let argmin = a |> Array.minBy snd |> fst
        rec_algo1 (step+1) (argmin::J)
In general, is there a way to formally check it?
Thanks.
This function is tail-recursive; I can tell by eyeballing it.
In general it is not always easy to tell. Perhaps the most reliable/pragmatic thing is just to check it on a large input (and make sure you are compiling in 'Release' mode, as 'Debug' mode turns off tail calls for better debugging).
Yes, you can formally prove that a function is tail-recursive. Every expression reduction has a tail-position, and if all recursions are in tail-positions then the function is tail-recursive. It's possible for a function to be tail-recursive in one place, but not in another.
In the expression let pat = exprA in exprB only exprB is in tail-position. That is, while you can go evaluate exprA, you still have to come back to evaluate exprB with exprA in mind. For every expression in the language, there's a reduction rule that tells you where the tail position is. In ExprA; ExprB it's ExprB again. In if ExprA then ExprB else ExprC it's both ExprB and ExprC and so on.
The compiler, of course, knows this as it goes. However, the many expression forms available in F# and the many internal transformations the compiler carries out, e.g. during pattern-match compilation or for computation expressions like seq{} or async{}, can make it non-obvious which expressions are in tail position.
Practically speaking, with some practice it's easy to determine this for small functions: just look at the nested expressions and check that the slots which are NOT in tail position contain no recursive calls.
You asked how we can formally check this so I'll have a stab. We first have to define what it means for a function to be tail-recursive. A recursive function definition of the form
let rec f x_1 ... x_n = e
is tail recursive if all calls of f inside e are tail calls - ie. occur in a tail context. A tail context C is defined inductively as a term with a hole []:
C ::= []
| e
| let p = e in C
| e; C
| match e with p_1 -> C | ... | p_n -> C
| if e then C else C
where e is an F# expression, x is a variable and p is a pattern. We ought to expand this to mutually recursive function definitions but I'll leave that as an exercise.
Let's now apply this to your example. The only call to rec_algo1 in the body of the function is in this context:
if step = dSs then J
else
    let a = Array.init (Array2D.length1 M) (fun i -> minby1J i M J)
    let argmin = a |> Array.minBy snd |> fst
    []
And since this is a tail context, the function is tail-recursive. This is how functional programmers eyeball it - scan the body of the definition for recursive calls and then verify that each occurs in a tail context. A more intuitive definition of a tail call is when nothing else is done with the result of the call apart from returning it.
As part of the 4th exercise here
I would like to use a ReadS-style function such as readHex with a Parsec Parser.
To do this I have written a function:
liftReadsToParse :: Parser String -> (String -> [(a, String)]) -> Parser a
liftReadsToParse p f = p >>= \s -> if null (f s) then fail "No parse" else (return . fst . head ) (f s)
Which can be used, for example in GHCI, like this:
*Main Numeric> parse (liftReadsToParse (many1 hexDigit) readHex) "" "a1"
Right 161
Can anyone suggest any improvement to this approach with regard to:
Will the term (f s) be memoised, or evaluated twice in the case of a null (f s) returning False?
Handling multiple successful parses (i.e. when length (f s) is greater than one); I do not know how Parsec deals with this.
Handling the remainder of the parse, i.e. (snd . head) (f s).
This is a nice idea. A more natural approach that would make
your ReadS parser fit in better with Parsec would be to
leave off the Parser String at the beginning of the type:
liftReadS :: ReadS a -> String -> Parser a
liftReadS reader = maybe (unexpected "no parse") (return . fst) .
                   listToMaybe . filter (null . snd) . reader
This "combinator" style is very idiomatic Haskell - once you
get used to it, it makes function definitions much easier
to read and understand.
You would then use liftReadS like this in the simple case:
> parse (many1 hexDigit >>= liftReadS readHex) "" "a1"
(Note that listToMaybe is in the Data.Maybe module.)
In more complex cases, liftReadS is easy to use inside any
Parsec do block.
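For instance, something along these lines (a made-up parser purely for illustration, assuming the same imports as in the question):

-- parse two comma-separated hex numbers, e.g. "a1,ff"
hexPair :: Parser (Integer, Integer)
hexPair = do
  a <- many1 hexDigit >>= liftReadS readHex
  _ <- char ','
  b <- many1 hexDigit >>= liftReadS readHex
  return (a, b)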
Regarding some of your other questions:
The function reader is applied only once now, so there is nothing to "memoize".
It is common and accepted practice to ignore all except the first parse in a ReadS parser in most cases, so you're fine.
To answer the first part of your question: no, (f s) will not be memoised; you would have to do that manually:
liftReadsToParse p f = p >>= \s -> let fs = f s in if null fs then fail "No parse"
                                                   else (return . fst . head) fs
But I'd use pattern matching instead:
liftReadsToParse p f = p >>= \s -> case f s of
  []              -> fail "No parse"
  (answer, _) : _ -> return answer
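If you also want to handle the remainder of the parse (point 3 in the question), a small variant that only accepts parses consuming the entire string, in the same spirit as liftReadS above, might look like this (a sketch):

liftReadsToParse' :: Parser String -> (String -> [(a, String)]) -> Parser a
liftReadsToParse' p f = p >>= \s -> case [a | (a, "") <- f s] of
  []      -> fail "No parse"
  (a : _) -> return a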