I've got a 279MB file that contains ~10 million key/value pairs, with ~500,000 unique keys. It's grouped by key (each key only needs to be written once), so all the values for a given key are together.
What I want to do is transpose the association, create a file where the pairs are grouped by value, and all the keys for a given value are stored together.
Currently, I'm using Parsec to read in the pairs as a list of tuples (K,[V]) (using lazy IO so I can process it as a stream while Parsec is processing the input file), where:
newtype K = K Text deriving (Show, Eq, Ord, Hashable)
newtype V = V Text deriving (Show, Eq, Ord, Hashable)
tupleParser :: Parser (K,[V])
tupleParser = ...
data ErrList e a = Cons a (ErrList e a) | End | Err e
parseAllFromFile :: Parser a -> FilePath -> IO (ErrList ParseError a)
parseAllFromFile parser inputFile = do
    contents <- readFile inputFile
    let Right initialState = parse getParserState inputFile contents
    return $ loop initialState
  where loop state = case unconsume $ runParsecT parser' state of
                        Error err -> Err err
                        Ok Nothing _ _ -> End
                        Ok (Just a) state' _ -> a `Cons` loop state'
        unconsume v = runIdentity $ case runIdentity v of
                                      Consumed ma -> ma
                                      Empty ma -> ma
        parser' = (Just <$> parser) <|> (const Nothing <$> eof)
I've tried to insert the tuples into a Data.HashMap.Map V [K] to transpose the association:
transpose :: ErrList ParseError (K,[V]) -> Either ParseError [(V,[K])]
transpose = transpose' M.empty
  where transpose' _ (Err e) = Left e
        transpose' m End = Right $ assocs m
        transpose' m (Cons (k,vs) xs) = transpose' (L.foldl' (include k) m vs) xs
        include k m v = M.insertWith (const (k:)) v [k] m
But when I tried it, I got the error:
memory allocation failed (requested 2097152 bytes)
I could think of a couple things I'm doing wrong:
1. 2MB seems a bit low (considerably less than the 2GB RAM my machine has installed), so maybe I need to tell GHC it's ok to use more?
2. My problems could be algorithmic/data structure related. Maybe I'm using the wrong tools for the job?
3. My attempt to use lazy IO could be coming back to bite me.
I'm leaning toward (1) for now, but I'm not sure by any means.
Is there a possibility that the data will grow? If so, I'd suggest not reading the whole file into memory and processing the data in another way.
One simple possibility is to use a relational database for that. This'd be fairly easy - just load your data in, create a proper index and get it sorted as you need. The database will do all the work for you. I'd definitely recommend this.
Another option would be to create your own file-based mechanism. For example:
Choose some limit l that is reasonable to load into memory.
Create n = d `div` l files, where d is the total amount of your data. (Hopefully this will not exceed your file descriptor limit. You could also close and reopen files after each operation, but this will make the process very slow.)
Process the input sequentially and place each pair (k,v) into file number hash v `mod` n. This ensures that the pairs with the same value v will end up in the same file.
Process each file separately.
Merge them together.
It is essentially a hash table with file buckets. This solution assumes that each value has roughly the same number of keys (otherwise some files could get exceptionally large).
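The file-bucket steps above can be sketched roughly as follows. This is a minimal illustration in Python rather than Haskell, and it rests on assumptions that are not in the original: pairs arrive as (key, value) strings with no tabs or newlines in them, the names bucket_of and transpose_pairs are hypothetical, and zlib.crc32 stands in for whatever hash function you prefer.

```python
import os
import tempfile
import zlib
from collections import defaultdict

def bucket_of(value, n_buckets):
    # crc32 is stable across runs, unlike Python's built-in hash() for str.
    return zlib.crc32(value.encode("utf-8")) % n_buckets

def transpose_pairs(pairs, n_buckets, workdir):
    """Group (key, value) pairs by value using n_buckets temporary files."""
    paths = [os.path.join(workdir, f"bucket{i}.tsv") for i in range(n_buckets)]
    files = [open(p, "w", encoding="utf-8") for p in paths]
    try:
        # Pass 1: route each pair to the bucket file chosen by its value.
        for key, value in pairs:
            files[bucket_of(value, n_buckets)].write(f"{value}\t{key}\n")
    finally:
        for f in files:
            f.close()
    # Pass 2: each bucket is assumed small enough to group in memory.
    for path in paths:
        groups = defaultdict(list)
        with open(path, encoding="utf-8") as f:
            for line in f:
                value, key = line.rstrip("\n").split("\t")
                groups[value].append(key)
        yield from groups.items()

pairs = [("k1", "a"), ("k1", "b"), ("k2", "a"), ("k3", "c"), ("k3", "b")]
with tempfile.TemporaryDirectory() as workdir:
    result = dict(transpose_pairs(pairs, n_buckets=3, workdir=workdir))
```

Because every occurrence of a value lands in the same bucket, the "merge" step is just concatenating the per-bucket results.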
You could also implement an external sort which would allow you to sort basically any amount of data.
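As a rough idea of what an external sort involves, here is a small Python sketch (not Haskell, and not tuned in any way). It assumes the input has already been rewritten as newline-terminated "value<TAB>key" lines, so that sorting the lines groups them by value; external_sort and chunk_size are hypothetical names.

```python
import heapq
import itertools
import os
import tempfile

def external_sort(lines, chunk_size, workdir):
    """Sort newline-terminated lines using sorted runs stored on disk."""
    run_paths = []
    it = iter(lines)
    while True:
        # Read at most chunk_size lines, sort them in memory, write one run.
        chunk = list(itertools.islice(it, chunk_size))
        if not chunk:
            break
        chunk.sort()
        path = os.path.join(workdir, f"run{len(run_paths)}.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.writelines(chunk)
        run_paths.append(path)
    # Merge the sorted runs lazily; only one line per run is held in memory.
    runs = [open(p, encoding="utf-8") for p in run_paths]
    try:
        yield from heapq.merge(*runs)
    finally:
        for f in runs:
            f.close()

lines = ["b\tk1\n", "a\tk1\n", "c\tk3\n", "a\tk2\n", "b\tk3\n"]
with tempfile.TemporaryDirectory() as workdir:
    sorted_lines = list(external_sort(lines, chunk_size=2, workdir=workdir))
```

After sorting, all lines for a given value are adjacent, so the grouped output can be produced in a single sequential pass.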
To allow for files that are larger than available memory, it's a good idea to process them in bite-sized chunks at a time.
Here is a solid algorithm to copy file A to a new file B:
Create file B and lock it to your machine
Begin loop
If there isn't a next line in file A then exit loop
Read in the next line of file A
Apply processing to the line
Check if File B contains the line already
If File B does not contain the line already then append the line to file B
Goto beginning of loop
Unlock file B
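A minimal Python sketch of the loop above, under two loudly stated simplifications: file locking is left out entirely, and "check if file B contains the line already" is done with an in-memory set of lines written so far instead of re-reading file B. The set grows with the number of distinct lines, so this only helps while the distinct lines fit in memory. The name copy_unique_lines and the process parameter are hypothetical.

```python
import io

def copy_unique_lines(src, dst, process=lambda line: line):
    """Copy processed lines from src to dst, skipping lines already written."""
    seen = set()
    for line in src:           # reads one line at a time
        out = process(line)
        if out not in seen:    # stands in for "file B does not contain the line"
            seen.add(out)
            dst.write(out)

# In-memory files keep the example self-contained; real code would use open().
src = io.StringIO("a\nB\na\nb\n")
dst = io.StringIO()
copy_unique_lines(src, dst, process=str.lower)
output = dst.getvalue()
```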
It can also be worthwhile making a copy of file A into a temp folder and locking it while you work with it so that other people on the network aren't prevented from changing the original, but you have a snapshot of the file as it was at the point the procedure was begun.
I intend to revisit this answer in the future and add code.
Consider the binary tree algebraic datatype
type btree = Empty | Node of btree * int * btree
and a new datatype defined as follows:
type finding = NotFound | Found of int
Here's my code so far:
let s = Node (Node(Empty, 5, Node(Empty, 2, Empty)), 3, Node (Empty, 6, Empty))
(*
(3)
/ \
(5) (6)
/ \ | \
() (2) () ()
/ \
() ()
*)
(* size: btree -> int *)
let rec size t =
    match t with
    Empty -> false
    | Node (t1, m, t2) -> if (m != Empty) then sum+1 || (size t1) || (size t2)
let num = occurs s
printfn "There are %i nodes in the tree" num
This probably isn't close, I took a function that would find if an integer existed in a tree and tried changing the code for what I was trying to do.
I am very new to using F# and would appreciate any help. I am trying to count all non empty nodes in the tree. For example the tree I'm using should print the value 4.
I did not run the compiler on your code, but I believe it does not even compile.
However your idea to use a pattern match in a recursive function is good.
As rmunn commented, you want to determine the number of nodes in each case:
An empty tree has no nodes, hence the result is zero.
A non-empty tree has its root node plus the nodes in its left and right subtrees.
So something along the lines of the following should work
let rec size t =
    match t with
    | Empty -> 0
    | Node (t1, _, t2) -> 1 + (size t1) + (size t2)
The most important detail here is that you do not need a global variable sum to store any intermediate values. The whole idea of a recursive function is that those intermediate values are the results of recursive calls.
As a remark, your tree in the comment should look like this, I believe.
(*
          (3)
         /   \
      (5)     (6)
     /   \    |  \
   ()    (2)  ()  ()
        /   \
       ()    ()
*)
Edit: I misread the misaligned () as leaves of an empty tree, where in fact they are leaves of the subtree (2). So it was just an ASCII art issue :-)
Friedrich already posted a simple version of the size function that will work for most trees. However, the solution is not "tail-recursive", so it can cause a stack overflow for large trees. In functional programming languages like F#, recursion is often the preferred technique for things like counting and other aggregate functions. However, recursive functions generally consume a stack frame for each recursive call. This means that for large structures, the call stack can be exhausted before the function completes. In order to avoid this problem, compilers can optimize functions that are considered "tail-recursive" so that they use only one stack frame regardless of how many times they recurse. Unfortunately, this optimization cannot be applied to just any recursive algorithm. It requires that the recursive call be the last thing that the function does, thereby ensuring that the compiler does not have to worry about jumping back into the function after the call, allowing it to overwrite the stack frame instead of adding another one.
In order to change the size function to be tail-recursive, we need some way to avoid having to call it twice in the case of a non-empty node, so that the call can be the last step of the function, instead of the addition between the two calls in Friedrich's solution. This can be accomplished using a couple different techniques, generally either using an accumulator or using Continuation Passing Style. The simpler solution is often to use an accumulator to keep track of the total size instead of having it be the return value, while Continuation Passing Style is a more general solution that can handle more complex recursive algorithms.
In order to make an accumulator pattern work for a tree where we have to sum both the left and right sub-trees, we need some way to make one tail-call at the end of the function, while still making sure that both sub-trees are evaluated. A simple way to do that is to also accumulate the right sub-trees in addition to the total count, so we can make subsequent tail-calls to evaluate those trees while evaluating the left sub-trees first. That solution might look something like this:
let size t =
    let rec size acc ts = function
        | Empty ->
            match ts with
            | [] -> acc
            | head :: tail -> head |> size acc tail
        | Node (t1, _, t2) ->
            t1 |> size (acc + 1) (t2 :: ts)
    t |> size 0 []
This adds the acc parameter and the ts parameter to represent the total count and the remaining unevaluated sub-trees. When we hit a populated node, we evaluate the left sub-tree while adding the right sub-tree to our list of trees to evaluate later. When we hit an empty node, we start evaluating any ts we've accumulated, until we have no further populated nodes or unevaluated sub-trees. This isn't the best possible solution for computing the tree size, and most real solutions would use Continuation Passing Style to make it tail-recursive, but that should make a good exercise as you get more familiar with the language.
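To make the control flow of the accumulator version easier to follow, here is the same idea written as an explicit loop in Python. It is only an illustration of the technique, not a translation the F# compiler would produce: Empty is modelled as None, the Node class is a hypothetical stand-in for the btree type, and the list pending plays the role of ts.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    left: Optional["Node"]
    value: int
    right: Optional["Node"]

def size(tree):
    """Count non-empty nodes without recursion."""
    acc = 0          # running count, like the acc parameter
    pending = []     # right sub-trees still to visit, like ts
    while True:
        if tree is None:
            if not pending:
                return acc
            tree = pending.pop()
        else:
            acc += 1
            pending.append(tree.right)
            tree = tree.left

# The tree from the question: 3 at the root, 5 and 6 below it, 2 under 5.
s = Node(Node(None, 5, Node(None, 2, None)), 3, Node(None, 6, None))
count = size(s)
```

Each iteration of the loop corresponds to one tail call in the F# version, which is why that version needs only a single stack frame.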
I am doing data processing with F#. First I get all the files in a directory, then process each file to generate some data structure. Finally, I store the processed data in SQLite. I know that if I use a Seq to store the file names and pipe it forward into Seq.map, each file will be processed lazily. But what if there are so many files that holding all of their contents in memory is impossible? In an imperative programming language, I could read one file, process it, store it, release the intermediate data, and move on to the next file. Of course F# can do imperative programming, but I want to know whether there is a way to do this in a functional programming style.
files
|> Seq.map readFile
|> Seq.map processContent
|> Seq.map storeProcessResult
The code above shows what I mean. files contains a sequence of file names; I read the content of each file, process it into some structure, and finally store the result in the database. I know that because of the lazy behaviour, the files will be read and processed one by one. But when will the final data be released?
Obviously only you know what happens inside your readFile, processContent and storeProcessResult functions. As @FuleSnabel says in his comment, you can map and then use fold (recursion) to process the file.
Here's a simple test you can perform to see the difference in memory consumption: create a List of lists with 10 million elements and sum the list, then create a Seq of lists with 10 million elements, and sum the list. I'm using 64-bit FSI.
This will use about 1GB of memory:
let z = [for i in 1..3 -> List.init 10000000 (fun _ -> 1)]
let w = z |> List.map (fun x -> System.GC.Collect();List.sum x)
This will only use a few MB of memory, much less than even one list with 10 million 1s in it:
let x = seq {for i in 1..3 -> List.init 10000000 (fun _ -> 1 ) }
let y = x |> Seq.map (fun x -> System.GC.Collect(); List.sum x)
This is just one (and probably the easy) part of the workflow. If you're opening files, you have to make sure to close them as well, hence my suggestion of use above. However, I do recognize that accessing the filesystem and processing large amounts of data in a lazy sequence might cause some problems; in that case you can always profile it and see where the bottleneck is.
By the way, you don't need to call the GC directly in the code, I just did so the intermediate results don't pollute the memory count in the test.
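The one-at-a-time behaviour of a lazy pipeline can also be seen in a small Python sketch using generator expressions, which behave much like Seq.map here. Everything in it is made up for illustration: FAKE_FILES replaces the filesystem, and read_file, process_content and store_result are hypothetical stand-ins for the three functions in the question.

```python
FAKE_FILES = {"a.txt": "one two", "b.txt": "three four five"}
events = []   # records the order in which the stages actually run
stored = []   # stands in for the database

def read_file(name):
    events.append(f"read {name}")
    return FAKE_FILES[name]

def process_content(text):
    return len(text.split())

def store_result(count):
    events.append(f"store {count}")
    stored.append(count)

names = ["a.txt", "b.txt"]
contents = (read_file(n) for n in names)          # lazy, nothing read yet
results = (process_content(c) for c in contents)  # still lazy
for r in results:                                 # drives the pipeline
    store_result(r)
```

The events list shows each file being read, processed and stored before the next one is read, so once a result has been stored nothing in the pipeline still refers to that file's content and it can be garbage collected.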
Is it possible to use one of the parsing libraries (e.g. Parsec) for parsing something different than a String? And how would I do this?
For the sake of simplicity, let's assume the input is a list of ints [Int]. The task could be
drop leading zeros
parse the rest into the pattern (S+L+)*, where S is a number less than 10, and L is a number greater than or equal to 10.
return a list of tuples (Int,Int), where fst is the product of the S and snd is the product of the L integers
It would be great if someone could show how to write such a parser (or something similar).
Yes, as user5402 points out, Parsec can parse any instance of Stream, including arbitrary lists. As there are no predefined token parsers (as there are for text), you have to roll your own (myToken below), using e.g. tokenPrim.
The only thing I find a bit awkward is the handling of "source positions". SourcePos is an abstract type (rather than a type class) and forces me to use its "filename/line/column" format, which feels a bit unnatural here.
Anyway, here is the code (without the skipping of leading zeroes, for brevity)
import Text.Parsec
myToken :: (Show a) => (a -> Bool) -> Parsec [a] () a
myToken test = tokenPrim show incPos $ justIf test where
  incPos pos _ _ = incSourceColumn pos 1
  justIf test x = if (test x) then Just x else Nothing
small = myToken (< 10)
large = myToken (>= 10)
smallLargePattern = do
  smallints <- many1 small
  largeints <- many1 large
  let prod = foldl1 (*)
  return (prod smallints, prod largeints)
myIntListParser :: Parsec [Int] () [(Int,Int)]
myIntListParser = many smallLargePattern
testMe :: [Int] -> [(Int, Int)]
testMe xs = case parse myIntListParser "your list" xs of
  Left err -> error $ show err
  Right result -> result
Trying it all out:
*Main> testMe [1,2,55,33,3,5,99]
[(2,1815),(15,99)]
*Main> testMe [1,2,55,33,3,5,99,1]
*** Exception: "your list" (line 1, column 9):
unexpected end of input
Note the awkward line/column format in the error message
Of course one could write a function sanitiseSourcePos :: SourcePos -> MyListPosition
There is very likely a way to get Parsec to use [a] as the stream type, but the idea behind parser combinators is actually very simple, and it's not very difficult to roll your own library.
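To give a feel for how little is needed, here is a minimal sketch of hand-rolled combinators over a list of ints, applied to the (S+L+)* task from the question. It is written in Python rather than Haskell, and none of it is a Parsec API: a parser here is just a function from (tokens, pos) to either (value, new_pos) or None, and satisfy, many, many1, small_large and parse_all are hypothetical names. Unlike Parsec, these combinators backtrack freely, so the reported error positions differ.

```python
from math import prod

def satisfy(pred):
    """Parser that consumes one token if pred holds for it."""
    def parser(tokens, pos):
        if pos < len(tokens) and pred(tokens[pos]):
            return tokens[pos], pos + 1
        return None
    return parser

def many(p):
    """Apply p zero or more times, collecting the results."""
    def parser(tokens, pos):
        results = []
        step = p(tokens, pos)
        while step is not None:
            value, pos = step
            results.append(value)
            step = p(tokens, pos)
        return results, pos
    return parser

def many1(p):
    """Apply p one or more times; fail if it never succeeds."""
    def parser(tokens, pos):
        results, new_pos = many(p)(tokens, pos)
        return (results, new_pos) if results else None
    return parser

def small_large(tokens, pos):
    """One S+L+ group, returned as (product of S, product of L)."""
    first = many1(satisfy(lambda n: n < 10))(tokens, pos)
    if first is None:
        return None
    smalls, pos = first
    second = many1(satisfy(lambda n: n >= 10))(tokens, pos)
    if second is None:
        return None
    larges, pos = second
    return (prod(smalls), prod(larges)), pos

def parse_all(tokens):
    _, pos = many(satisfy(lambda n: n == 0))(tokens, 0)  # drop leading zeros
    result, pos = many(small_large)(tokens, pos)
    if pos != len(tokens):
        raise ValueError(f"unexpected input at index {pos}")
    return result

pairs = parse_all([0, 0, 1, 2, 55, 33, 3, 5, 99])
```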
A very accessible resource I would recommend is Monadic Parsing in Haskell by Graham Hutton and Erik Meijer.
Indeed, right now Erik Meijer is teaching an intro Haskell/functional programming course on edx.org and Lecture 7 is all about functional parsers. As he states in the intro to the lecture:
"... No one can follow the path towards mastering functional programming without writing their own parser combinator library. We start by explaining what parsers are and how they can naturally be viewed as side-effecting functions. Next we define a number of basic parsers and higher-order functions for combining parsers. ..."
This question is related to both Parsec and uu-parsinglib. When we write parser combinators, they process a stream of characters from the input. Is it somehow possible to parse a character and put it back (or put another character back) into the input stream?
I want, for example, to parse the input "test + 5": parse the t, e, s, t and, after recognizing the test pattern, put for example a v character back into the character stream, so that as parsing continues we are matching against v + 5.
I do not want to use this in any particular case for now - I want to deeply learn the possibilities.
I'm not sure if it's possible with these parsers directly, but in general you can accomplish it by combining parsers with some streaming that allows injecting leftovers.
For example, using attoparsec-conduit you can turn a parser into a conduit using
sinkParser :: (AttoparsecInput a, MonadThrow m)
           => Parser a b -> Consumer a m b
where Consumer is a special kind of conduit that doesn't produce any output, only receives input and returns a final value.
Since conduits support leftovers, you can create a helper method that converts a parser that optionally returns a value to be pushed into the stream into a conduit:
import Data.Attoparsec.Types
import Data.Conduit
import Data.Conduit.Attoparsec
import Data.Functor
reinject :: (AttoparsecInput a, MonadThrow m)
         => Parser a (Maybe a, b) -> Consumer a m b
reinject p = do
    (lo, r) <- sinkParser p
    maybe (return ()) leftover lo
    return r
Then you convert standard parsers to conduits using sinkParser and these special parsers using reinject, and then combine conduits instead of parsers.
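The leftover idea itself is independent of conduit. As a language-neutral illustration (Python, not Haskell, and not the conduit API), here is a tiny stream wrapper that lets a consumer push an item back so that the next consumer sees it first. PushbackStream and expect are hypothetical names.

```python
class PushbackStream:
    """Iterator over items that also accepts items pushed back onto its front."""
    def __init__(self, iterable):
        self._it = iter(iterable)
        self._pushed = []

    def __iter__(self):
        return self

    def __next__(self):
        if self._pushed:
            return self._pushed.pop()   # most recently pushed item comes first
        return next(self._it)

    def push_back(self, item):
        self._pushed.append(item)

def expect(stream, word):
    """Consume word from the stream, failing on the first mismatch."""
    for ch in word:
        got = next(stream)
        if got != ch:
            raise ValueError(f"expected {ch!r}, got {got!r}")

stream = PushbackStream("test + 5")
expect(stream, "test")     # first "parser" recognizes the test pattern
stream.push_back("v")      # ...and injects a v for whoever reads next
rest = "".join(stream)     # what the next parser would see
```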
I think the simplest way to achieve this is to build a multi-layered parser. Think of a lexer + parser combination. This is a clean approach to this problem.
You have to separate the two kinds of parsing. The search-and-replace parsing goes to the first parser and the build-the-AST parsing to the second. Or you can create an intermediate token representation.
import Text.Parsec
import Text.Parsec.String
parserLvl1 :: Parser String
parserLvl1 = many (try (string "test" >> return 'v') <|> anyChar)
parserLvl2 :: Parser Plus
parserLvl2 = do text1 <- many (noneOf "+")
                char '+'
                text2 <- many (noneOf "+")
                return $ Plus text1 text2
data Plus = Plus String String
          deriving Show
wholeParse :: String -> Either ParseError Plus
wholeParse source = do res1 <- parse parserLvl1 "lvl1" source
                       res2 <- parse parserLvl2 "lvl2" res1
                       return res2
Now you can parse your example. wholeParse "test+5" results in Right (Plus "v" "5").
Possible variations:
Create a class and an instance for combining wrapped parser stages. (Possibly carrying parser state.)
Create an intermediate representation, a stream of tokens
This is easily done in uu-parsinglib using the pSwitch function. But the question is why you want to do so? Because the v is missing from the input? In that case uu-parsinglib will perform error correction automatically so you do not need something like this. Otherwise you can write
pSwitch :: (st1 -> (st2, st2 -> st1)) -> P st2 a -> P st1 a
pInsert_v = pSwitch (\st1 -> (prepend v st1, id)) (pSucceed ())
It depends on your actual state type how the v is actually added, so you will have to define the function prepend yourself. I do not know e.g. how such an insertion would influence the current position in the file etc.
Doaitse Swierstra
I have a csv file with the following structure :
The first line is a header row
The remaining lines are data lines, each with the same number of commas, so we can think of the data in terms of columns
I have written a little script to go through each line of the file and return a sequence of tuples containing the column header and the length of the longest string of data in that column:
let getColumnInfo (fileName:string) =
    let delimiter = ','
    let readLinesIntoColumns (sr:StreamReader) = seq {
        while not sr.EndOfStream do
            yield sr.ReadLine().Split(delimiter) |> Seq.map (fun c -> c.Length )
    }
    use sr = new StreamReader(fileName)
    let headers = sr.ReadLine().Split(delimiter)
    let columnSizes =
        let initial = Seq.map ( fun h -> 0 ) headers
        let toMaxColLengths (accumulator:seq<int>) (line:seq<int>) =
            let chooseBigger a b = if a > b then a else b
            Seq.map2 chooseBigger accumulator line
        readLinesIntoColumns sr |> Seq.fold toMaxColLengths initial
    Seq.zip headers columnSizes;
This works fine on a small file. However, when it tries to process a large file (> 75 MB), it crashes fsi with a StackOverflow exception. If I remove the line
Seq.map2 chooseBigger accumulator line
the program completes.
Now, my question is this : why is F# using up the stack? My understanding of sequences in F# is that the entire sequence is not held in memory, only the elements that are being processed. Therefore I expected that the lines that had already been processed would not remain on the stack. Where is my misunderstanding?
I think this is a good question. Here's a simpler repro:
let test n =
    [for i in 1 .. n -> Seq.empty]
    |> List.fold (Seq.map2 max) Seq.empty
    |> Seq.iter ignore
test creates a sequence of empty sequences, calculates the max by rows, and then iterates over the resulting (empty) sequence. You'll find that with a high value of n this will cause a stack overflow, even though there aren't any values to iterate over at all!
It's a bit tricky to explain why, but here's a stab at it. The problem is that as you fold over the sequences, Seq.map2 is returning a new sequence which defers its work until it's enumerated. Thus, when you try to iterate through the resulting sequence, you end up calling back into a chain of computations n layers deep.
As Daniel explains, you can escape this by evaluating the resulting sequence eagerly (e.g. by converting it to a list).
EDIT
Here's an attempt to further explain what's going wrong. When you call Seq.map2 max s1 s2, neither s1 nor s2 is actually enumerated; you get a new sequence which, when enumerated, will enumerate both of them and compare the yielded values. Thus, if we do something like the following:
let s0 = Seq.empty
let s1 = Seq.map2 max Seq.empty s0
let s2 = Seq.map2 max Seq.empty s1
let s3 = Seq.map2 max Seq.empty s2
let s4 = Seq.map2 max Seq.empty s3
let s5 = Seq.map2 max Seq.empty s4
...
Then the call to Seq.map2 always returns immediately and uses constant stack space. However, enumerating s5 requires enumerating s4, which requires enumerating s3, etc. This means that enumerating s99999 will build up a huge call stack that looks sort of like:
...
(s99996's enumerator).MoveNext()
(s99997's enumerator).MoveNext()
(s99998's enumerator).MoveNext()
(s99999's enumerator).MoveNext()
and we'll get a stack overflow.
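The same effect can be reproduced outside F#. The following Python sketch is only an analogy: generators play the role of lazy sequences, and Python raises a RecursionError where .NET would overflow the stack. Folding with a lazy step builds one nested generator per row, while folding with an eager step materializes the accumulator each time. The names lazy_max, eager_max and fold are hypothetical.

```python
import sys

def lazy_max(a, b):
    # Generator: no work happens until the result is iterated.
    for x, y in zip(a, b):
        yield max(x, y)

def eager_max(a, b):
    # List comprehension: the work happens immediately.
    return [max(x, y) for x, y in zip(a, b)]

def fold(step, rows, initial):
    acc = initial
    for row in rows:
        acc = step(acc, row)
    return acc

# More rows than the interpreter allows nested Python frames.
depth = sys.getrecursionlimit() + 1000
rows = [[i, depth - i] for i in range(depth)]

# Eager: the accumulator is always a plain two-element list.
eager_result = fold(eager_max, rows, [0, 0])

# Lazy: the fold itself returns instantly, but iterating the result has to
# call through one generator per row.
try:
    lazy_result = list(fold(lazy_max, rows, [0, 0]))
    lazy_overflowed = False
except RecursionError:
    lazy_overflowed = True
```

Forcing the accumulator at each step, as eager_max does, is the same remedy as converting the result of Seq.map2 to a list.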
Your code contains so many sequences that it's hard to reason about. My guess is that's what's tripping you up. You can make this much simpler and more efficient (eagerness is not all bad):
let getColumnInfo (fileName:string) =
    let delimiter = ','
    use sr = new StreamReader(fileName)
    match sr.ReadLine() with
    | null | "" -> Array.empty
    | hdr ->
        let cols = hdr.Split(delimiter)
        let counts = Array.zeroCreate cols.Length
        while not sr.EndOfStream do
            sr.ReadLine().Split(delimiter)
            |> Array.iteri (fun i fld ->
                counts.[i] <- max counts.[i] fld.Length)
        Array.zip cols counts
This assumes all lines are non-empty and have the same number of columns.
You can fix your function by changing this line to:
Seq.map2 chooseBigger accumulator line |> Seq.toList |> seq
why is F# using up the stack? My understanding of sequences in F# is that the entire sequence is not held in memory, only the elements that are being processed. Therefore I expected that the lines that had already been processed would not remain on the stack. Where is my misunderstanding?
The lines themselves are not eating up your stack space. The problem is you've accidentally written a function that builds up a huge unevaluated computation (tree of thunks) that stack overflows when it is evaluated because it makes non-tail calls O(n) deep. This tends to happen whenever you build sequences from other sequences and don't force the evaluation of anything.