Why does Seq give stack overflow when iterating through large csv file - f#

I have a csv file with the following structure:
The first line is a header row.
The remaining lines are data lines, each with the same number of commas, so we can think of the data in terms of columns.
I have written a little script to go through each line of the file and return a sequence of tuples containing the column header and the length of the largest string of data in that column:
open System.IO
let getColumnInfo (fileName:string) =
    let delimiter = ','
    let readLinesIntoColumns (sr:StreamReader) = seq {
        while not sr.EndOfStream do
            yield sr.ReadLine().Split(delimiter) |> Seq.map (fun c -> c.Length )
    }
    use sr = new StreamReader(fileName)
    let headers = sr.ReadLine().Split(delimiter)
    let columnSizes =
        let initial = Seq.map ( fun h -> 0 ) headers
        let toMaxColLengths (accumulator:seq<int>) (line:seq<int>) =
            let chooseBigger a b = if a > b then a else b
            Seq.map2 chooseBigger accumulator line
        readLinesIntoColumns sr |> Seq.fold toMaxColLengths initial
    Seq.zip headers columnSizes
This works fine on a small file. However, when it tries to process a large file (> 75 MB), it crashes fsi with a StackOverflowException. If I remove the line
Seq.map2 chooseBigger accumulator line
the program completes.
Now, my question is this : why is F# using up the stack? My understanding of sequences in F# is that the entire sequence is not held in memory, only the elements that are being processed. Therefore I expected that the lines that had already been processed would not remain on the stack. Where is my misunderstanding?

I think this is a good question. Here's a simpler repro:
let test n =
    [for i in 1 .. n -> Seq.empty]
    |> List.fold (Seq.map2 max) Seq.empty
    |> Seq.iter ignore
test creates a list of empty sequences, calculates the max by rows, and then iterates over the resulting (empty) sequence. You'll find that with a high value of n this will cause a stack overflow, even though there aren't any values to iterate over at all!
It's a bit tricky to explain why, but here's a stab at it. The problem is that as you fold over the sequences, Seq.map2 is returning a new sequence which defers its work until it's enumerated. Thus, when you try to iterate through the resulting sequence, you end up calling back into a chain of computations n layers deep.
As Daniel explains, you can escape this by evaluating the resulting sequence eagerly (e.g. by converting it to a list).
EDIT
Here's an attempt to further explain what's going wrong. When you call Seq.map2 max s1 s2, neither s1 nor s2 is actually enumerated; you get a new sequence which, when enumerated, will enumerate both of them and compare the yielded values. Thus, if we do something like the following:
let s0 = Seq.empty
let s1 = Seq.map2 max Seq.empty s0
let s2 = Seq.map2 max Seq.empty s1
let s3 = Seq.map2 max Seq.empty s2
let s4 = Seq.map2 max Seq.empty s3
let s5 = Seq.map2 max Seq.empty s4
...
Then the call to Seq.map2 always returns immediately and uses constant stack space. However, enumerating s5 requires enumerating s4, which requires enumerating s3, etc. This means that enumerating s99999 will build up a huge call stack that looks sort of like:
...
(s99996's enumerator).MoveNext()
(s99997's enumerator).MoveNext()
(s99998's enumerator).MoveNext()
(s99999's enumerator).MoveNext()
and we'll get a stack overflow.
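To make the contrast concrete, here is a minimal sketch of the same kind of fold where every step is forced immediately, so no chain of pending enumerators builds up. The name columnMax and the sample rows are made up for illustration, and the sketch assumes every row has exactly width entries.

```fsharp
// Hypothetical helper: column-wise maximum over many rows.
// Array.map2 is eager, so each fold step produces a finished array
// instead of a deferred sequence that wraps the previous one.
let columnMax (rows: seq<int[]>) (width: int) : int[] =
    rows
    |> Seq.fold (fun acc row -> Array.map2 max acc row) (Array.zeroCreate width)

// 100,000 rows of three columns each (sample data, not from the question).
let rows = Seq.init 100000 (fun i -> [| i % 7; i % 11; 3 |])
let result = columnMax rows 3
printfn "%A" result   // [|6; 10; 3|]
```

Because the accumulator is always a fully evaluated array, enumerating the final result does not have to call back through one enumerator per input row.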

Your code contains so many sequences that it's hard to reason about. My guess is that's what's tripping you up. You can make this much simpler and more efficient (eagerness is not all bad):
let getColumnInfo (fileName:string) =
    let delimiter = ','
    use sr = new StreamReader(fileName)
    match sr.ReadLine() with
    | null | "" -> Array.empty
    | hdr ->
        let cols = hdr.Split(delimiter)
        let counts = Array.zeroCreate cols.Length
        while not sr.EndOfStream do
            sr.ReadLine().Split(delimiter)
            |> Array.iteri (fun i fld ->
                counts.[i] <- max counts.[i] fld.Length)
        Array.zip cols counts
This assumes all lines are non-empty and have the same number of columns.
You can fix your function by changing this line to:
Seq.map2 chooseBigger accumulator line |> Seq.toList |> seq

why is F# using up the stack? My understanding of sequences in F# is that the entire sequence is not held in memory, only the elements that are being processed. Therefore I expected that the lines that had already been processed would not remain on the stack. Where is my misunderstanding?
The lines themselves are not eating up your stack space. The problem is you've accidentally written a function that builds up a huge unevaluated computation (tree of thunks) that stack overflows when it is evaluated because it makes non-tail calls O(n) deep. This tends to happen whenever you build sequences from other sequences and don't force the evaluation of anything.
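The deferral itself is easy to observe on a small scale. The following sketch (the counter and names are illustrative only) shows that Seq.map does no work when it is called, only when its result is enumerated:

```fsharp
// Counts how many times the mapping function actually runs.
let calls = ref 0

let doubled =
    [1; 2; 3]
    |> Seq.map (fun x ->
        calls.Value <- calls.Value + 1
        x * 2)

let callsBefore = calls.Value      // 0: building the sequence ran nothing
let forced = Seq.toList doubled    // enumeration runs the mapping now
let callsAfter = calls.Value       // 3: one call per element

printfn "before=%d after=%d forced=%A" callsBefore callsAfter forced
// prints: before=0 after=3 forced=[2; 4; 6]
```

Stacking thousands of such deferred steps on top of each other is what produces the deep chain of calls at enumeration time.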

Related

comparing length of sublists in a list F#

I'm new to F#, and currently working on a problem where I'm trying to compare the length of sublists inside a list, and returning a boolean.
The program is also supposed to return "false" in case any of the sublists are empty. However as I've been progressing I haven't been able to solve my current problem, even though I somehow see what is wrong (this linked to my experience in the F# language thus far). Hopefully someone can lend me a hand, so I can quickly move on to my next project.
My program so far is as follows:
let projectOne (initList: int list list) =
    let mutable lst = initList.[0].Length
    let mutable lst1 = ""
    let n = initList.Length
    for i=1 to n-1 do
        if lst = 0 || initList.[i].Length = 0 then
            lst1 <- "false"
        elif lst <> initList.[i].Length then
            lst1 <- "false"
        elif lst = initList.[i].Length then
            lst1 <- "true"
    lst1
printfn "The sublists are of same length: %A" (projectOne [[1;2;3;4];[4;5;6];[6;7;8;9];[5;6;7;8]])
The way I see it, I am comparing [0] with [i], with i incrementing on each iteration of the loop. This causes a problem: for the printed example, the last iteration compares [0] with [3], and since those two sublists are of equal size my function returns "true". That is obviously wrong, since [1] is shorter than the rest, so the result should be "false".
I've tried to solve this by mutating the value of lst on each iteration, but this causes a problem if, for instance, [2] and [3] are the same length but [0] and [1] are not: it again returns "true" even though the output should be "false" (like [[1;2;3];[3;4;5];[6;7];[8;9]]).
I can't seem to wrap my head around what I've missed. Since I can't break out of a loop in F# (at least not in the traditional way, as in Python), I need to run all my iterations, but I want each iteration to compare with the average of all the previous sublists' lengths (if that makes sense).
What am I missing? :-) I have thought of using a List.fold operator to solve the problem, but I'm not sure how to implement this, given that the program also needs to check for empty lists.
I should say, however, that I am trying to solve the problem using a method appropriate to my level of experience so far. I am sure that several very compact solutions using the pipeline operator |> are available, but I am not yet capable of using them, so I am looking for a simpler, perhaps beginner's, solution.
Thanks in advance.
A more functional way to think about this would be
If all the sublists are empty, they are all the same length
Otherwise, if any of the sublists are empty, they are not of the same length
Otherwise, the lists are all the same length if their tails are all the same length
For example:
let rec projectOne initList =
    if List.forall List.isEmpty initList then
        true
    else if List.exists List.isEmpty initList then
        false
    else
        projectOne (List.map List.tail initList)
Here is another take:
let projectOne lst =
    match lst with
    | h::t -> t |> List.forall(fun (l:_ list) -> l.Length = h.Length)
    | _ -> true
A simple fix of your code could be:
let projectOne (initList: int list list) =
    let length = initList.[0].Length
    let mutable result = length <> 0
    let n = initList.Length
    for i=1 to n-1 do
        result <- result && initList.[i].Length = length
    result
It still operates on a mutable variable, which is undesirable in functional programming, and it is inefficient in that it visits all lists even after a wrong length has been found.
Another - more functional - solution could be:
let haveEqualLength lists =
    let length = lists |> List.head |> List.length
    length <> 0 && lists |> List.tail |> List.tryFind (fun l -> l.Length <> length) = None
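Since the question mentions List.fold, here is one way it could look, sketched under the assumption that an empty outer list should count as "all the same length". The name sameNonEmptyLength is made up for this example.

```fsharp
// Hypothetical fold-based variant: true only if every sublist is non-empty
// and has the same length as the first one.
let sameNonEmptyLength (lists: int list list) : bool =
    match lists with
    | [] -> true   // assumption: no sublists means nothing disagrees
    | first :: rest ->
        let len = List.length first
        // The accumulator stays false once any sublist disagrees.
        len <> 0
        && rest |> List.fold (fun ok l -> ok && List.length l = len) true

printfn "%b" (sameNonEmptyLength [[1;2;3;4];[4;5;6];[6;7;8;9];[5;6;7;8]])  // false
printfn "%b" (sameNonEmptyLength [[1;2;3];[3;4;5];[6;7];[8;9]])            // false
printfn "%b" (sameNonEmptyLength [[1;2];[3;4];[5;6]])                      // true
```

Like the loop version, the fold still visits every sublist; List.forall stops at the first mismatch and may be preferable for long lists.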

How do I print out the entire Fibonacci sequence up to a user inputted value in F#?

So I have a program that, currently, finds the fibonacci equivalent of a user inputted value, e.g. 6 would be 5 (or 13 depending on whether or not you start with a 0). Personally I prefer the sequence starting with 0.
open System
let rec fib (n1 : bigint) (n2 : bigint) c =
    if c = 1 then
        n2
    else
        fib n2 (n1+n2) (c-1);;
let GetFib n =
    (fib 1I 1I n);;
let input = Console.ReadLine()
Console.WriteLine(GetFib (Int32.Parse input))
The problem is that ALL it does is find the equivalent number in the sequence. I am trying to get it to print out all the values up to that user inputted value, e.g. 6 would print out 0,1,1,2,3,5. If anyone could help me figure out how to print out the whole sequence, that would be very helpful. Also if anyone can look at my code and tell me how to make it start at 0 when printing out the whole sequence, that would also be very much appreciated.
Thank you in advance for any help.
Take a look at the link s952163 gave you in the comments - that shows ways of generating a Fibonacci sequence using Seq expressions and also explains why these are useful.
The following prints every Fibonacci number up to and including the specified value (the input is an upper bound on the values, not a count of terms):
let fibsTo n =
    Seq.unfold (fun (m,n) -> Some (m, (n,n+m))) (0I,1I)
    |> Seq.takeWhile (fun x -> x <= n)
let input = Console.ReadLine()
(fibsTo (Numerics.BigInteger.Parse input)) |> Seq.iter (printfn "%A")
Note the use of printfn rather than Console.WriteLine; the former is more idiomatic.
Also, you may want to validate the input here: non-numeric text makes Parse throw, and a negative bound produces no output at all.
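If what is wanted is the first n terms rather than all terms below a bound, one possible variation is to cut the infinite sequence with Seq.truncate. This is only a sketch: firstFibs is a made-up name, and count is assumed to be non-negative.

```fsharp
// Hypothetical variant: the first `count` Fibonacci numbers, starting at 0.
let firstFibs (count: int) : seq<bigint> =
    Seq.unfold (fun (a, b) -> Some (a, (b, a + b))) (0I, 1I)
    |> Seq.truncate count

let firstSix = firstFibs 6 |> Seq.toList
printfn "%A" firstSix   // [0; 1; 1; 2; 3; 5]
```

The unfold state is the pair of the current and next numbers, so the sequence stays lazy and only as many terms as requested are computed.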

How to evaluate only a part of a lazy sequence?

Lazy evaluation is a great boon for stuff like processing huge files that will not fit in main memory at one go. However, suppose there are some elements in the sequence that I want evaluated immediately, while the rest can be lazily computed - is there any way to specify that?
Specific problem: (in case that helps to answer the question)
Specifically, I am using a series of IEnumerables as iterators for multiple sequences - these sequences are data read from files opened using BinaryReader streams (each sequence is responsible for the reading in of data from one of the files). The MoveNext() on these is to be called in a specific order. Eg. iter0 then iter1 then iter5 then iter3 .... and so on. This order is specified in another sequence index = {0,1,5,3,....}. However sequences being lazy, the evaluation is naturally done only when required. Hence, the file reads (for the sequences right at the beginning that read from files on disk) happens as the IEnumerables for a sequence are moving. This is causing an illegal file access - a file that is being read by one process is accessed again (as per the error msg).
True, the illegal file access could be for other reasons, and after having tried my best to debug other causes a partially lazy evaluation might be worth trying out.
While I agree with Tomas' comment: you shouldn't need this if file sharing is handled properly, here's one way to eagerly evaluate the first N elements:
let cacheFirst n (items: seq<_>) =
    seq {
        use e = items.GetEnumerator()
        let i = ref 0
        yield!
            [
                while !i < n && e.MoveNext() do
                    yield e.Current
                    incr i
            ]
        while e.MoveNext() do
            yield e.Current
    }
Example
let items = Seq.initInfinite (fun i -> printfn "%d" i; i)
items
|> Seq.take 10
|> cacheFirst 5
|> Seq.take 3
|> Seq.toList
Output
0
1
2
3
4
val it : int list = [0; 1; 2]
Daniel's solution is sound, but I don't think we need another operator, just Seq.cache for most cases.
First cache your sequence:
let items = Seq.initInfinite (fun i -> printfn "%d" i; i) |> Seq.cache
Eager evaluation followed by lazy access from the beginning:
let eager = items |> Seq.take 5 |> Seq.toList
let cached = items |> Seq.take 3 |> Seq.toList
This will evaluate the first 5 elements once (during eager) but make them cached for secondary access.
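A small sketch that records which elements get generated can confirm this behaviour. The generated list is an illustrative stand-in for the printfn side effect used above.

```fsharp
// Records every index the generator function is asked to produce.
let generated = ResizeArray<int>()

let items =
    Seq.initInfinite (fun i -> generated.Add i; i)
    |> Seq.cache

let eager = items |> Seq.take 5 |> Seq.toList    // forces elements 0..4
let cached = items |> Seq.take 3 |> Seq.toList   // served from the cache

printfn "eager=%A cached=%A generated=%A" eager cached (List.ofSeq generated)
```

With Seq.cache in place, the second pass should not add anything to generated, so it holds each of the first five indices exactly once.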

How to efficiently find out if a sequence has at least n items?

Naively using Seq.length may not be good enough, as it will never return on an infinite sequence.
Getting fancier with something like ss |> Seq.truncate n |> Seq.length will work, but behind the scenes it would involve traversing the relevant chunk of the argument sequence twice via IEnumerator's MoveNext().
The best approach I was able to come up with so far is:
let hasAtLeast n (ss: seq<_>) =
    let mutable result = true
    use e = ss.GetEnumerator()
    for _ in 1 .. n do result <- e.MoveNext()
    result
This involves only a single traversal of the sequence (more accurately, performing e.MoveNext() n times) and correctly handles the boundary cases of empty and infinite sequences. I could add a few small improvements, like explicit handling of lists, arrays, and ICollections, or cutting the traversal short, but I wonder whether a more effective approach exists that I may be missing?
Thank you for your help.
EDIT: Having on hand 5 implementation variants of the hasAtLeast function (2 of my own, 2 suggested by Daniel and one suggested by Ankur), I arranged a marathon between them. The results, a tie across all implementations, show that Guvante is right: the simplest composition of existing algorithms is best, and there is no point in overengineering here.
Factoring in readability as well, I'd use either my own pure F#-based
let hasAtLeast n (ss: seq<_>) =
    Seq.length (Seq.truncate n ss) >= n
or the fully equivalent LINQ-based one suggested by Ankur, which capitalizes on .NET integration
let hasAtLeast n (ss: seq<_>) =
    ss.Take(n).Count() >= n
Here's a short, functional solution:
let hasAtLeast n items =
    items
    |> Seq.mapi (fun i x -> (i + 1), x)
    |> Seq.exists (fun (i, _) -> i = n)
Example:
let items = Seq.initInfinite id
items |> hasAtLeast 10000
And here's an optimally efficient one:
let hasAtLeast n (items:seq<_>) =
    use e = items.GetEnumerator()
    let rec loop n =
        if n = 0 then true
        elif e.MoveNext() then loop (n - 1)
        else false
    loop n
Functional programming breaks up work loads into small chunks that do very generic tasks that do one simple thing. Determining if there are at least n items in a sequence is not a simple task.
You already found both the solutions to this "problem", composition of existing algorithms, which works for the majority of cases, and creating your own algorithm to solve the issue.
However I have to wonder whether your first solution wouldn't work. MoveNext() is only called n times on the original method for certain, Current is never called, and even if MoveNext() is called on some wrapper class the performance implications are likely tiny unless n is huge.
EDIT:
I was curious, so I wrote a simple program to time the two methods. The truncate method was quicker both for a simple infinite sequence and for one whose elements each involved a Sleep(1). It looks like I was right that your correction was overengineering.
I think clarification is needed to explain what is happening in those methods. Seq.truncate takes a sequence and returns a sequence. Other than saving the value of n it doesn't do anything until enumeration. During enumeration it counts and stops after n values. Seq.length takes an enumeration and counts, returning the count when it ends. So the enumeration is only enumerated once, and the amount of overhead is a couple of method calls and two counters.
Using Linq this would be as simple as:
open System.Linq
let hasAtLeast n (ss: seq<_>) =
    ss.Take(n).Count() >= n
Note that F#'s Seq.take (unlike LINQ's Take) throws if there are not enough elements.
Example usage, showing that it traverses the sequence only once and only as far as the required elements:
seq { for i = 0 to 5 do
        printfn "Generating %d" i
        yield i }
|> hasAtLeast 4 |> printfn "%A"
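The special-casing of collections mentioned in the question could be sketched as follows. hasAtLeastFast is a hypothetical name, and whether the shortcut is worth it depends on how often the input already knows its own size.

```fsharp
open System.Collections.Generic

// Hypothetical variant: consult Count when the input is a sized collection,
// otherwise fall back to the truncate-and-count composition.
let hasAtLeastFast (n: int) (items: seq<'T>) : bool =
    match items with
    | :? ICollection<'T> as c -> c.Count >= n        // arrays, ResizeArray, ...
    | _ -> Seq.length (Seq.truncate n items) >= n    // lazy or infinite sequences

printfn "%b" (hasAtLeastFast 3 [| 1; 2; 3 |])             // true
printfn "%b" (hasAtLeastFast 4 (ResizeArray [1; 2; 3]))   // false
printfn "%b" (hasAtLeastFast 10 (Seq.initInfinite id))    // true
```

The fallback branch enumerates at most n elements, so it remains safe on infinite sequences.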

f# iterating over two arrays, using function from a c# library

I have a list of words and a list of associated part of speech tags. I want to iterate over both, simultaneously (matched index) using each indexed tuple as input to a .NET function. Is this the best way (it works, but doesn't feel natural to me):
let taggingModel = SeqLabeler.loadModel(lthPath +
                                        "models\penn_00_18_split_dict.model");
let lemmatizer = new Lemmatizer(lthPath + "v_n_a.txt")
let input = "the rain in spain falls on the plain"
let words = Preprocessor.tokenizeSentence( input )
let tags = SeqLabeler.tagSentence( taggingModel, words )
let lemmas = Array.map2 (fun x y -> lemmatizer.lookup(x,y)) words tags
Your code looks quite good to me - most of it deals with some loading and initialization, so there isn't much you could do to simplify that part. As an alternative to Array.map2, you could use Seq.zip combined with Seq.map - the zip function combines two sequences into a single one that contains pairs of elements with matching indices:
let lemmas = Seq.zip words tags
             |> Seq.map (fun (x, y) -> lemmatizer.lookup (x, y))
Since the lookup function takes a tuple as its argument, you can pass it to Seq.map directly:
// standard syntax using the pipelining operator
let lemmas = Seq.zip words tags |> Seq.map lemmatizer.lookup
// .. an alternative syntax doing exactly the same thing
let lemmas = (words, tags) ||> Seq.zip |> Seq.map lemmatizer.lookup
The ||> operator used in the second version takes a tuple containing two values and passes them to the function on the right side as two arguments, meaning that (a, b) ||> f means f a b. The |> operator takes only a single value on the left, so (a, b) |> f would mean f (a, b) (which would work if the function f expected tuple instead of two, space separated, parameters).
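The difference between the two operators can be seen with two tiny functions (add and addPair are made up for this illustration):

```fsharp
let add a b = a + b          // curried: two space-separated parameters
let addPair (a, b) = a + b   // takes a single tuple

let viaDoublePipe = (1, 2) ||> add       // same as: add 1 2
let viaSinglePipe = (1, 2) |> addPair    // same as: addPair (1, 2)

printfn "%d %d" viaDoublePipe viaSinglePipe   // prints "3 3"
```

Swapping the operators here would not type-check, which is a quick way to tell which shape a given function expects.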
If you need lemmas to be an array at the end, you'll need to add Array.ofSeq to the end of the processing pipeline (all Seq functions work with sequences, which correspond to IEnumerable<T>)
One more alternative is to use sequence expressions (you can use [| .. |] to construct an array directly if that's what you need):
let lemmas = [| for wt in Seq.zip words tags do // wt is tuple (string * string)
                  yield lemmatizer.lookup wt |]
Whether to use sequence expressions or not - that's just a personal preference. The first option seems to be more succinct in this case, but sequence expressions may be more readable for people less familiar with things like partial function application (in the shorter version using Seq.map)
