I want to generate large xml files for testing purpose but the code I ended up with is really slow, the time grows exponentially with the number of rows I write to the file. Th example below shows that it takes milliseconds to write 100 rows, but it takes over 20 seconds to write 1000 rows (on my machine). I really can't figured out what is making this slow, since I think that writing 1000 rows shouldn't take that long. Also, writing 200 rows takes about 4 times as long as writing 100 rows which is not good. To run the code you might want to change the path for the StreamWriter.
open System.IO
open System.Diagnostics
let xmlSeq = Seq.initInfinite (fun index -> sprintf "<author><name>name%d</name><age>%d</age><books><book>book%d</book></books></author>" index index index)
let createFile (seq: string seq) numberToTake fileName =
use streamWriter = new StreamWriter("C:\\tmp\\FSharpXmlTest\\FSharpXmlTest\\" + fileName, false)
streamWriter.WriteLine("<startTag>")
let rec internalWriter (seq: string seq) (sw:StreamWriter) i (endTag:string) =
match i with
| 0 -> (sw.WriteLine(Seq.head seq);
sw.WriteLine(endTag))
| _ -> (sw.WriteLine(Seq.head seq);
internalWriter (Seq.skip 1 seq) sw (i-1) endTag)
internalWriter seq streamWriter numberToTake "</startTag>"
let funcTimer fn =
let stopWatch = Stopwatch.StartNew()
printfn "Timing started"
fn()
stopWatch.Stop()
printfn "Time elased: %A" stopWatch.Elapsed
(funcTimer (fun () -> createFile xmlSeq 100 "file100.xml"))
(funcTimer (fun () -> createFile xmlSeq 1000 "file1000.xml"))
You observed a quadratic behaviour O(n^2) on manipulating sequences. When you call Seq.skip, a brand new sequence will be created, so you implicitly traverse the remaining part. More detailed explanation could be found at https://stackoverflow.com/a/1306267.
In this example, you don't need to decompose sequences. Replacing your inner function by:
let internalWriter (seq: string seq) (sw:StreamWriter) i (endTag:string) =
for node in Seq.take i seq do
sw.WriteLine(node)
sw.WriteLine(endTag)
I can write 10000 rows in fraction of a second.
You can refactor further by remove this inner function and copy its body to the parent function.
As the link above mentioned, if you ever need decomposing sequences, LazyList should be better to use.
pad in his answer has pointed to the cause of slowdown. Another idiomatic approach might be instead of infinite sequence generating sequence of needed length with Seq.unfold, which makes the code really trivial:
let xmlSeq n = Seq.unfold (fun i ->
if i = 0 then None
else Some((sprintf "<author><name>name%d</name><age>%d</age><books><book>book%d</book></books></author>" i i i), i - 1)) n
let createFile seqLen fileName =
use streamWriter = new StreamWriter("C:\\tmp\\FSharpXmlTest\\" + fileName, false)
streamWriter.WriteLine("<startTag>")
seqLen |> xmlSeq |> Seq.iter streamWriter.WriteLine
streamWriter.WriteLine("</startTag>")
(funcTimer (fun () -> createFile 10000 "file10000.xml"))
Generating 10000 elements takes around 500ms on my laptop.
I came up with the following solution:
namespace FSharpBasics
module Program2 =
open System
open System.IO
open System.Diagnostics
let seqTest count : seq<string> =
let template = "<author>\
<name>Name {0}</name>\
<age>{0}</age>\
<books>\
<book>Book {0}</book>\
</books>\
</author>"
let row (i: int) =
String.Format (template, i)
seq {
yield "<authors>"
for x in [ 1..count ] do
yield row x
yield "</authors>"
}
[<EntryPoint>]
let main argv =
printfn "File will be written now"
let stopwatch = Stopwatch.StartNew()
File.WriteAllLines (#".\test.xml", seqTest 10000) |> ignore
stopwatch.Stop()
printf "Ended, took %f seconds" stopwatch.Elapsed.TotalSeconds
System.Console.ReadKey() |> ignore
0
It takes less than 90 milliseconds on my laptop to create a well-formed test.xml file with 10,000 authors.
Related
I'm reading a file and I want to do something with the first line, and something else with all the other lines
let lines = System.IO.File.ReadLines "filename.txt" |> Seq.map (fun r -> r.Trim())
let head = Seq.head lines
let tail = Seq.tail lines
```
Problem: the call to tail fails because the TextReader is closed.
What it means is that the Seq is evaluated twice: once to get the head once to get the tail.
How can I get the firstLine and the lastLines, while keeping a Seq and without reevaluating the Seq ?
the signature could be, for example :
let fn: ('a -> Seq<'a> -> b) -> Seq<'a> -> b
The easiest thing to do is probably just using Seq.cache to wrap your lines sequence:
let lines =
System.IO.File.ReadLines "filename.txt"
|> Seq.map (fun r -> r.Trim())
|> Seq.cache
Of note from the documentation:
This result sequence will have the same elements as the input sequence. The result can be enumerated multiple times. The input sequence is enumerated at most once and only as far as is necessary. Caching a sequence is typically useful when repeatedly evaluating items in the original sequence is computationally expensive or if iterating the sequence causes side-effects that the user does not want to be repeated multiple times.
I generally use a seq expression in which the Stream is scoped inside the expression. That will allow you to enumerate the sequence fully before the stream is disposed. I usually use a function like this:
let readLines file =
seq {
use stream = File.OpenText file
while not stream.EndOfStream do
yield stream.ReadLine().Trim()
}
Then you should be able to call Seq.head and get the first line in the fail, and Seq.last to get the last line in the file. I think this will technically create two different enumerators though. If you want to only read the file exactly one time, then materializing the sequence to a list or using a function like Seq.cache will be your best option.
I had an important use case for this, where I am using Seq.unfold to read a large number of blocks with REST reads, and sequentially processing each block, with further REST reads.
The reading of the sequence had to be both "lazy" but also cached to avoid duplicate re-evaluation (with every Seq.tail operation).
Hence finding this question and the accepted answer (Seq.cache). Thanks!
I experimented with Seq.cache and discovered that it worked as claimed (ie, lazy and avoid re-evaluation), but with one noteworthy condition - the first five elements of the sequence are always read first (and retained with 'cache'), so experiments on five or smaller numbers won't show lazy evaluation. However, after five, lazy evaluation kicks in for each element.
This code can be used to experiment. Try it for 5, and see no lazy evaluation, and then 10, and see each element after 5 being 'lazy' read, as required. Also remove Seq.cache to see the problem we are addressing (re-evaluation)
// Get a Sequence of numbers.
let getNums n = seq { for i in 1..n do printfn "Yield { %d }" i; yield i}
// Unfold a sequence of numbers
let unfoldNums (nums : int seq) =
nums
|> Seq.unfold
(fun (nums : int seq) ->
printfn "unfold: nums = { %A }" nums
if Seq.isEmpty nums then
printfn "Done"
None
else
let num = Seq.head nums // Value to yield
let tl = Seq.tail nums // Next State. CAUSES RE-EVALUTION!
printfn "Yield: < %d >, tl = { %A }" num tl
Some (num,tl))
// Get n numbers as a sequence, then unfold them as a sequence
// Observe that with 'Seq.cache' input is not re-evaluated unnecessarily,
// and also that lazy evaulation kicks in for n > 5
let experiment n =
getNums n
|> Seq.cache
// Without cache, Seq.tail causes the sequence to be re-evaluated
|> unfoldNums
|> Seq.iter (fun x -> printfn "Process: %d" x)
Why is the following function returning a sequence of incorrect length when the repl variable is set to false?
open MathNet.Numerics.Distributions
open MathNet.Numerics.LinearAlgebra
let sample (data : seq<float>) (size : int) (repl : bool) =
let n = data |> Seq.length
// without replacement
let rec generateIndex idx =
let m = size - Seq.length(idx)
match m > 0 with
| true ->
let newIdx = DiscreteUniform.Samples(0, n-1) |> Seq.take m
let idx = (Seq.append idx newIdx) |> Seq.distinct
generateIndex idx
| false ->
idx
let sample =
match repl with
| true ->
DiscreteUniform.Samples(0, n-1)
|> Seq.take size
|> Seq.map (fun index -> Seq.item index data)
| false ->
generateIndex (seq [])
|> Seq.map (fun index -> Seq.item index data)
sample
Running the function...
let requested = 1000
let dat = Normal.Samples(0., 1.) |> Seq.take 10000
let resultlen = sample dat requested false |> Seq.length
printfn "requested -> %A\nreturned -> %A" requested resultlen
Resulting lengths are wrong.
>
requested -> 1000
returned -> 998
>
requested -> 1000
returned -> 1001
>
requested -> 1000
returned -> 997
Any idea what mistake I'm making?
First, there's a comment I want to make about coding style. Then I'll get to the explanation of why your sequences are coming back with different lengths.
In the comments, I mentioned replacing match (bool) with true -> ... | false -> ... with a simple if ... then ... else expression, but there's another coding style that you're using that I think could be improved. You wrote:
let sample (various_parameters) = // This is a function
// Other code ...
let sample = some_calculation // This is a variable
sample // Return the variable
While F# allows you to reuse names like that, and the name inside the function will "shadow" the name outside the function, it's generally a bad idea for the reused name to have a totally different type than the original name. In other words, this can be a good idea:
let f (a : float option) =
let a = match a with
| None -> 0.0
| Some value -> value
// Now proceed, knowing that `a` has a real value even if had been None before
Or, because the above is exactly what F# gives you defaultArg for:
let f (a : float option) =
let a = defaultArg a 0.0
// This does exactly the same thing as the previous snippet
Here, we are making the name a inside our function refer to a different type than the parameter named a: the parameter was a float option, and the a inside our function is a float. But they're sort of the "same" type -- that is, there's very little mental difference between "The caller may have specified a floating-point value or they may not" and "Now I definitely have a floating-point value". But there's a very large mental gap between "The name sample is a function that takes three parameters" and "The name sample is a sequence of floats". I strongly recommend using a name like result for the value you're going to return from your function, rather than re-using the function name.
Also, this seems unnecessarily verbose:
let result =
match repl with
| true ->
DiscreteUniform.Samples(0, n-1)
|> Seq.take size
|> Seq.map (fun index -> Seq.item index data)
| false ->
generateIndex (seq [])
|> Seq.map (fun index -> Seq.item index data)
result
Anytime I find myself writing "let result = (something) ; result" at the end of my function, I usually just want to replace that whole code block with just the (something). I.e., the above snippet could just become:
match repl with
| true ->
DiscreteUniform.Samples(0, n-1)
|> Seq.take size
|> Seq.map (fun index -> Seq.item index data)
| false ->
generateIndex (seq [])
|> Seq.map (fun index -> Seq.item index data)
Which in turn can be replaced with an if...then...else expression:
if repl then
DiscreteUniform.Samples(0, n-1)
|> Seq.take size
|> Seq.map (fun index -> Seq.item index data)
else
generateIndex (seq [])
|> Seq.map (fun index -> Seq.item index data)
And that's the last expression in your code. In other words, I would probably rewrite your function as follows (changing ONLY the style, and making no changes to the logic):
open MathNet.Numerics.Distributions
open MathNet.Numerics.LinearAlgebra
let sample (data : seq<float>) (size : int) (repl : bool) =
let n = data |> Seq.length
// without replacement
let rec generateIndex idx =
let m = size - Seq.length(idx)
if m > 0 then
let newIdx = DiscreteUniform.Samples(0, n-1) |> Seq.take m
let idx = (Seq.append idx newIdx) |> Seq.distinct
generateIndex idx
else
idx
if repl then
DiscreteUniform.Samples(0, n-1)
|> Seq.take size
|> Seq.map (fun index -> Seq.item index data)
else
generateIndex (seq [])
|> Seq.map (fun index -> Seq.item index data)
If I can figure out why your sequences have the wrong length, I'll update this answer with that information as well.
UPDATE: Okay, I think I see what's happening in your generateIndex function that's giving you unexpected results. There are two things tripping you up: one is sequence laziness, and the other is randomness.
I copied your generateIndex function into VS Code and added some printfn statements to look at what's going on. First, the code I ran, and then the results:
let rec generateIndex n size idx =
let m = size - Seq.length(idx)
printfn "m = %d" m
match m > 0 with
| true ->
let newIdx = DiscreteUniform.Samples(0, n-1) |> Seq.take m
printfn "Generating newIdx as %A" (List.ofSeq newIdx)
let idx = (Seq.append idx newIdx) |> Seq.distinct
printfn "Now idx is %A" (List.ofSeq idx)
generateIndex n size idx
| false ->
printfn "Done, returning %A" (List.ofSeq idx)
idx
All those List.ofSeq idx calls are so that F# Interactive would print more than four items of the seq when I print it out (by default, if you try to print a seq with %A, it will only print out four values and then print an ellipsis if there are more values available in the seq). Also, I turned n and size into parameters (that I don't change between calls) so that I could test it easily. I then called it as generateIndex 100 5 (seq []) and got the following result:
m = 5
Generating newIdx as [74; 76; 97; 78; 31]
Now idx is [68; 28; 65; 58; 82]
m = 0
Done, returning [37; 58; 24; 48; 49]
val it : seq<int> = seq [12; 69; 97; 38; ...]
See how the numbers keep changing? That was my first clue that something was up. See, seqs are lazy. They don't evaluate their contents until they have to. You shouldn't think of a seq as a list of numbers. Instead, think of it as a generator that will, when asked for numbers, produce them according to some rule. In your case, the rule is "Choose random integers between 0 and n-1, then take m of those numbers". And the other thing about seqs is that they do not cache their contents (although there's a Seq.cache function available that will cache their contents). Therefore, if you have a seq based on a random number generator, its results will be different each time, as you can see in my output. When I printed out newIdx, it printed out as [74; 76; 97; 78; 31], but when I appended it to an empty seq, the result printed out as [68; 28; 65; 58; 82].
Why this difference? Because Seq.append does not force evaluation. It simply creates a new seq whose rule is "take all items from the first seq, then when that one exhausts, take all items from the second seq. And when that one exhausts, end." And Seq.distinct does not force evaluation either; it simply creates a new seq whose rule is "take the items from the seq handed to you, and start handing them out when asked. But memorize them as you go, and if you've handed one of them out before, don't hand it out again." So what you are passing around between your calls to generateIdx is an object that, when evaluated, will pick a set of random numbers between 0 and n-1 (in my simple case, between 0 and 100) and then reduce that set down to a distinct set of numbers.
Now, here's the thing. Every time you evaluate that seq, it will start from the beginning: first calling DiscreteUniform.Samples(0, n-1) to generate an infinite stream of random numbers, then selecting m numbers from that stream, then throwing out any duplicates. (I'm ignoring the Seq.append for now, because it would create unnecessary mental complexity and it isn't really part of the bug anyway). Now, at the start of each go-round of your function, you check the length of the sequence, which does cause it to be evaluated. That means that it selects (in the case of my sample code) 5 random numbers between 0 and 99, then makes sure that they're all distinct. If they are all distinct, then m = 0 and the function will exit, returning... not the list of numbers, but the seq object. And when that seq object is evaluated, it will start over from the beginning, choosing a different set of 5 random numbers and then throwing out any duplicates. Therefore, there's still a chance that at least one of that set of 5 numbers will end up being a duplicate, because the sequence whose length was tested (which we know contained no duplicates, otherwise m would have been greater than 0) was not the sequence that was returned. The sequence that was returned has a 1.0 * 0.99 * 0.98 * 0.97 * 0.96 chance of not containing any duplicates, which comes to about 0.9035. So there's a just-under-10% chance that even though you checked Seq.length and it was 5, the length of the returned seq ends up being 4 after all -- because it was choosing a different set of random numbers than the one you checked.
To prove this, I ran the function again, this time only picking 4 numbers so that the result would be completely shown at the F# Interactive prompt. And my run of generateIndex 100 4 (seq []) produced the following output:
m = 4
Generating newIdx as [36; 63; 97; 31]
Now idx is [39; 93; 53; 94]
m = 0
Done, returning [47; 94; 34]
val it : seq<int> = seq [48; 24; 14; 68]
Notice how when I printed "Done, returning (value of idx)", it had only 3 values? Even though it eventually returned 4 values (because it picked a different selection of random numbers for the actual result, and that selection had no duplicates), that demonstrated the problem.
By the way, there's one other problem with your function, which is that it's far slower than it needs to be. The function Seq.item, in some circumstances, has to run through the sequence from the beginning in order to pick the nth item of the sequence. It would be far better to store your data in an array at the start of your function (let arrData = data |> Array.ofSeq), then replace
|> Seq.map (fun index -> Seq.item index data)
with
|> Seq.map (fun index -> arrData.[index])
Array lookups are done in constant time, so that takes your sample function down from O(N^2) to O(N).
TL;DR: Use Seq.distinct before you take m values from it and the bug will go away. You can just replace your entire generateIdx function with a simple DiscreteUniform.Samples(0, n-1) |> Seq.distinct |> Seq.take size. (And use an array for your data lookups so that your function will run faster). In other words, here's the final almost-final version of how I would rewrite your code:
let sample (data : seq<float>) (size : int) (repl : bool) =
let arrData = data |> Array.ofSeq
let n = arrData |> Array.length
if repl then
DiscreteUniform.Samples(0, n-1)
|> Seq.take size
|> Seq.map (fun index -> arrData.[index])
else
DiscreteUniform.Samples(0, n-1)
|> Seq.distinct
|> Seq.take size
|> Seq.map (fun index -> arrData.[index])
That's it! Simple, easy to understand, and (as far as I can tell) bug-free.
Edit: ... but not completely DRY, because there's still a bit of repeated code in that "final" version. (Credit to CaringDev for pointing it out in the comments below). The Seq.take size |> Seq.map is repeated in both branches of the if expression, so there's a way to simplify that expression. We could do this:
let randomIndices =
if repl then
DiscreteUniform.Samples(0, n-1)
else
DiscreteUniform.Samples(0, n-1) |> Seq.distinct
randomIndices
|> Seq.take size
|> Seq.map (fun index -> arrData.[index])
So here's a truly-final version of my suggestion:
let sample (data : seq<float>) (size : int) (repl : bool) =
let arrData = data |> Array.ofSeq
let n = arrData |> Array.length
let randomIndices =
if repl then
DiscreteUniform.Samples(0, n-1)
else
DiscreteUniform.Samples(0, n-1) |> Seq.distinct
randomIndices
|> Seq.take size
|> Seq.map (fun index -> arrData.[index])
IE,
What am I doing wrong here? Does it have to to with lists, sequences and arrays and the way the limitations work?
So here is the setup: I'm trying to generate some primes. I see that there are a billion text files of a billion primes. The question isn't why...the question is how are the guys using python calculating all of the primes below 1,000,000 in milliseconds on this post...and what am I doing wrong with the following F# code?
let sieve_primes2 top_number =
let numbers = [ for i in 2 .. top_number do yield i ]
let sieve (n:int list) =
match n with
| [x] -> x,[]
| hd :: tl -> hd, List.choose(fun x -> if x%hd = 0 then None else Some(x)) tl
| _ -> failwith "Pernicious list error."
let rec sieve_prime (p:int list) (n:int list) =
match (sieve n) with
| i,[] -> i::p
| i,n' -> sieve_prime (i::p) n'
sieve_prime [1;0] numbers
With the timer on in FSI, I get 4.33 seconds worth of CPU for 100000... after that, it all just blows up.
Your sieve function is slow because you tried to filter out composite numbers up to top_number. With Sieve of Eratosthenes, you only need to do so until sqrt(top_number) and remaining numbers are inherently prime. Suppose we havetop_number = 1,000,000, your function does 78498 rounds of filtering (the number of primes until 1,000,000) while the original sieve only does so 168 times (the number of primes until 1,000).
You can avoid generating even numbers except 2 which cannot be prime from the beginning. Moreover, sieve and sieve_prime can be merged into a recursive function. And you could use lightweight List.filter instead of List.choose.
Incorporating above suggestions:
let sieve_primes top_number =
let numbers = [ yield 2
for i in 3..2..top_number -> i ]
let rec sieve ns =
match ns with
| [] -> []
| x::xs when x*x > top_number -> ns
| x::xs -> x::sieve (List.filter(fun y -> y%x <> 0) xs)
sieve numbers
In my machine, the updated version is very fast and it completes within 0.6s for top_number = 1,000,000.
Based on my code here: stackoverflow.com/a/8371684/124259
Gets the first 1 million primes in 22 milliseconds in fsi - a significant part is probably compiling the code at this point.
#time "on"
let limit = 1000000
//returns an array of all the primes up to limit
let table =
let table = Array.create limit true //use bools in the table to save on memory
let tlimit = int (sqrt (float limit)) //max test no for table, ints should be fine
let mutable curfactor = 1;
while curfactor < tlimit-2 do
curfactor <- curfactor+2
if table.[curfactor] then //simple optimisation
let mutable v = curfactor*2
while v < limit do
table.[v] <- false
v <- v + curfactor
let out = Array.create (100000) 0 //this needs to be greater than pi(limit)
let mutable idx = 1
out.[0]<-2
let mutable curx=1
while curx < limit-2 do
curx <- curx + 2
if table.[curx] then
out.[idx]<-curx
idx <- idx+1
out
There have been several good answers both as to general trial division algorithm using lists (#pad) and in choice of an array for a sieving data structure using the Sieve of Eratosthenes (SoE) (#John Palmer and #Jon Harrop). However, #pad's list algorithm isn't particularly fast and will "blow up" for larger sieving ranges and #John Palmer's array solution is somewhat more complex, uses more memory than necessary, and uses external mutable state so is not different than if the program were written in an imperative language such as C#.
EDIT_ADD: I've edited the below code (old code with line comments) modifying the sequence expression to avoid some function calls so as to reflect more of an "iterator style" and while it saved 20% of the speed it still doesn't come close to that of a true C# iterator which is about the same speed as the "roll your own enumerator" final F# code. I've modified the timing information below accordingly. END_EDIT
The following true SoE program only uses 64 KBytes of memory to sieve primes up to a million (due to only considering odd numbers and using the packed bit BitArray) and still is almost as fast as #John Palmer's program at about 40 milliseconds to sieve to one million on a i7 2700K (3.5 GHz), with only a few lines of code:
open System.Collections
let primesSoE top_number=
let BFLMT = int((top_number-3u)/2u) in let buf = BitArray(BFLMT+1,true)
let SQRTLMT = (int(sqrt (double top_number))-3)/2
let rec cullp i p = if i <= BFLMT then (buf.[i] <- false; cullp (i+p) p)
for i = 0 to SQRTLMT do if buf.[i] then let p = i+i+3 in cullp (p*(i+1)+i) p
seq { for i = -1 to BFLMT do if i<0 then yield 2u
elif buf.[i] then yield uint32(3+i+i) }
// seq { yield 2u; yield! seq { 0..BFLMT } |> Seq.filter (fun i->buf.[i])
// |> Seq.map (fun i->uint32 (i+i+3)) }
primesSOE 1000000u |> Seq.length;;
Almost all of the elapsed time is spent in the last two lines enumerating the found primes due to the inefficiency of the sequence run time library as well as the cost of enumerating itself at about 28 clock cycles per function call and return with about 16 function calls per iteration. This could be reduced to only a few function calls per iteration by rolling our own iterator, but the code is not as concise; note that in the following code there is no mutable state exposed other than the contents of the sieving array and the reference variable necessary for the iterator implementation using object expressions:
open System
open System.Collections
open System.Collections.Generic
let primesSoE top_number=
let BFLMT = int((top_number-3u)/2u) in let buf = BitArray(BFLMT+1,true)
let SQRTLMT = (int(sqrt (double top_number))-3)/2
let rec cullp i p = if i <= BFLMT then (buf.[i] <- false; cullp (i+p) p)
for i = 0 to SQRTLMT do if buf.[i] then let p = i+i+3 in cullp (p*(i+1)+i) p
let nmrtr() =
let i = ref -2
let rec nxti() = i:=!i+1;if !i<=BFLMT && not buf.[!i] then nxti() else !i<=BFLMT
let inline curr() = if !i<0 then (if !i= -1 then 2u else failwith "Enumeration not started!!!")
else let v = uint32 !i in v+v+3u
{ new IEnumerator<_> with
member this.Current = curr()
interface IEnumerator with
member this.Current = box (curr())
member this.MoveNext() = if !i< -1 then i:=!i+1;true else nxti()
member this.Reset() = failwith "IEnumerator.Reset() not implemented!!!"
interface IDisposable with
member this.Dispose() = () }
{ new IEnumerable<_> with
member this.GetEnumerator() = nmrtr()
interface IEnumerable with
member this.GetEnumerator() = nmrtr() :> IEnumerator }
primesSOE 1000000u |> Seq.length;;
The above code takes about 8.5 milliseconds to sieve the primes to a million on the same machine due to greatly reducing the number of function calls per iteration to about three from about 16. This is about the same speed as C# code written in the same style. It's too bad that F#'s iterator style as I used in the first example doesn't automatically generate the IEnumerable boiler plate code as C# iterators do, but I guess that is the intention of sequences - just that they are so damned inefficient as to speed performance due to being implemented as sequence computation expressions.
Now, less than half of the time is expended in enumerating the prime results for a much better use of CPU time.
What am I doing wrong here?
You've implemented a different algorithm that goes through each possible value and uses % to determine if it needs to be removed. What you're supposed to be doing is stepping through with a fixed increment removing multiples. That would be asymptotically.
You cannot step through lists efficiently because they don't support random access so use arrays.
I am trying to build a list from a sequence by recursively appending the first element of the sequence to the list:
open System
let s = seq[for i in 2..4350 -> i,2*i]
let rec copy s res =
if (s|>Seq.isEmpty) then
res
else
let (a,b) = s |> Seq.head
Console.WriteLine(string a)
let newS = s |> Seq.skip(1)|> Seq.cache
let newRes = List.append res ([(a,b)])
copy newS newRes
copy s ([])
Two problems:
. getting a Stack overflow which means my tail recusive ploy sucks
and
. why is the code 100x faster when I put |> Seq.cache here let newS = s |> Seq.skip(1)|> Seq.cache.
(Note this is just a little exercise, I understand you can do Seq.toList etc.. )
Thanks a lot
One way that works is ( the two points still remain a bit weird to me ):
let toList (s:seq<_>) =
let rec copyRev res (enum:Collections.Generic.IEnumerator<_*_>) =
let somethingLeft = enum.MoveNext()
if not(somethingLeft) then
res
else
let curr = enum.Current
Console.WriteLine(string curr)
let newRes = curr::res
copyRev newRes enum
let enumerator = s.GetEnumerator()
(copyRev ([]) (enumerator)) |>List.rev
You say it's just an exercise, but it's useful to point to my answer to
While or Tail Recursion in F#, what to use when?
and reiterate that you should favor more applicative/declarative constructs when possible. E.g.
let rec copy2 s = [
for tuple in s do
System.Console.WriteLine(string(fst tuple))
yield tuple
]
is a nice and performant way to express your particular function.
That said, I'd feel remiss if I didn't also say "never create a list that big". For huge data, you want either array or seq.
In my short experience with F# it is not a good idea to use Seq.skip 1 like you would with lists with tail. Seq.skip creates a new IEnumerable/sequence and not just skips n. Therefore your function will be A LOT slower than List.toSeq. You should properly do it imperative with
s.GetEnumerator()
and iterates through the sequence and hold a list which you cons every single element.
In this question
Take N elements from sequence with N different indexes in F#
I started to do something similar to what you do but found out it is very slow. See my method for inspiration for how to do it.
Addition: I have written this:
let seqToList (xs : seq<'a>) =
let e = xs.GetEnumerator()
let mutable res = []
while e.MoveNext() do
res <- e.Current :: res
List.rev res
And found out that the build in method actually does something very similar (including the reverse part). It do, however, checks whether the sequence you have supplied is in fact a list or an array.
You will be able to make the code entirely functional: (which I also did now - could'nt resist ;-))
let seqToList (xs : seq<'a>) =
Seq.fold (fun state t -> t :: state) [] xs |> List.rev
Your function is properly tail recursive, so the recursive calls themselves are not what is overflowing the stack. Instead, the problem is that Seq.skip is poorly behaved when used recursively, as others have pointed out. For instance, this code overflows the stack on my machine:
let mutable s = seq { 1 .. 20001 }
for i in 1 .. 20000 do
s <- Seq.skip 1 s
let v = Seq.head s
Perhaps you can see the vague connection to your own code, which also eventually takes the head of a sequence which results from repeatedly applying Seq.skip 1 to your initial sequence.
Try the following code.
Warning: Before running this code you will need to enable tail call generation in Visual Studio. This can be done through the Build tab on the project properties page. If this is not enabled the code will StackOverflow processing the continuation.
open System
open System.Collections.Generic
let s = seq[for i in 2..1000000 -> i,2*i]
let rec copy (s : (int * int) seq) =
use e = s.GetEnumerator()
let rec inner cont =
if e.MoveNext() then
let (a,b) = e.Current
printfn "%d" b
inner (fun l -> cont (b :: l))
else cont []
inner (fun x -> x)
let res = copy s
printfn "Done"
I have this "learning code" I wrote for the morris seq in f# that suffers from stack overflow that I don't know how to avoid. "morris" returns an infinite sequence of "see and say" sequences (i.e., {{1}, {1,1}, {2,1}, {1,2,1,1}, {1,1,1,2,2,1}, {3,1,2,2,1,1},...}).
let printList l =
Seq.iter (fun n -> printf "%i" n) l
printfn ""
let rec morris s =
let next str = seq {
let cnt = ref 1 // Stack overflow is below when enumerating
for cur in [|0|] |> Seq.append str |> Seq.windowed 2 do
if cur.[0] <> cur.[1] then
yield!( [!cnt ; cur.[0]] )
cnt := 0
incr cnt
}
seq {
yield s
yield! morris (next s) // tail recursion, no stack overflow
}
// "main"
// Print the nth iteration
let _ = [1] |> morris |> Seq.nth 3125 |> printList
You can pick off the nth iteration using Seq.nth but you can only get so far before you hit a stack overflow. The one bit of recursion I have is tail recursion and it in essence builds a linked set of enumerators. That's not where the problem is. It's when "enum" is called on the say the 4000th sequence. Note that's with F# 1.9.6.16, the previous version topped out above 14000). It's because the way the linked sequences are resolved. The sequences are lazy and so the "recursion" is lazy. That is, seq n calls seq n-1 which calls seq n-2 and so forth to get the first item (the very first # is the worst case).
I understand that [|0|] |> Seq.append str |> Seq.windowed 2, is making my problem worse and I could triple the # I could generate if I eliminated that. Practically speaking the code works well enough. The 3125th iteration of morris would be over 10^359 characters in length.
The problem I'm really trying to solve is how to retain the lazy eval and have a no limit based on stack size for the iteration I can pick off. I'm looking for the proper F# idiom to make the limit based on memory size.
Update Oct '10
After learning F# a bit better, a tiny bit of Haskell, thinking & investigating this problem for over year, I finally can answer my own question. But as always with difficult problems, the problem starts with it being the wrong question. The problem isn't sequences of sequences - it's really because of a recursively defined sequence. My functional programming skills are a little better now and so it's easier to see what's going on with the version below, which still gets a stackoverflow
let next str =
Seq.append str [0]
|> Seq.pairwise
|> Seq.scan (fun (n,_) (c,v) ->
if (c = v) then (n+1,Seq.empty)
else (1,Seq.ofList [n;c]) ) (1,Seq.empty)
|> Seq.collect snd
let morris = Seq.unfold(fun sq -> Some(sq,next sq))
That basicially creates a really long chain of Seq processing function calls to generate the sequnces. The Seq module that comes with F# is what can't follow the chain without using the stack. There's an optimization it uses for append and recursively defined sequences, but that optimization only works if the recursion is implementing an append.
So this will work
let rec ints n = seq { yield n; yield! ints (n+1) }
printf "%A" (ints 0 |> Seq.nth 100000);;
And this one will get a stackoverflow.
let rec ints n = seq { yield n; yield! (ints (n+1)|> Seq.map id) }
printf "%A" (ints 0 |> Seq.nth 100000);;
To prove the F# libary was the issue, I wrote my own Seq module that implemented append, pairwise, scan and collect using continutions and now I can begin generating and printing out the 50,000 seq without a problem (it'll never finish since it's over 10^5697 digits long).
Some additional notes:
Continuations were the idiom I was looking for, but in this case, they had to go into the F# library, not my code. I learned about continuations in F# from Tomas Petricek's Real-World Functional Programming book.
The lazy list answer that I accepted held the other idiom; lazy evaluation. In my rewritten library, I also had to leverage the lazy type to avoid stackoverflow.
The lazy list version sorta of works by luck (maybe by design but that's beyond my current ability to determine) - the active-pattern matching it uses while it's constructing and iterating causes the lists to calculate values before the required recursion gets too deep, so it's lazy, but not so lazy it needs continuations to avoid stackoverflow. For example, by the time the 2nd sequence needs a digit from the 1st sequence, it's already been calculated. In other words, the LL version is not strictly JIT lazy for sequence generation, only list management.
You should definitely check out
http://research.microsoft.com/en-us/um/cambridge/projects/fsharp/manual/FSharp.PowerPack/Microsoft.FSharp.Collections.LazyList.html
but I will try to post a more comprehensive answer later.
UPDATE
Ok, a solution is below. It represents the Morris sequence as a LazyList of LazyLists of int, since I presume you want it to be lazy in 'both directions'.
The F# LazyList (in the FSharp.PowerPack.dll) has three useful properties:
it is lazy (evaluation of the nth element will not happen until it is first demanded)
it does not recompute (re-evaluation of the nth element on the same object instance will not recompute it - it caches each element after it's first computed)
you can 'forget' prefixes (as you 'tail' into the list, the no-longer-referenced prefix is available for garbage collection)
The first property is common with seq (IEnumerable), but the other two are unique to LazyList and very useful for computational problems such as the one posed in this question.
Without further ado, the code:
// print a lazy list up to some max depth
let rec PrintList n ll =
match n with
| 0 -> printfn ""
| _ -> match ll with
| LazyList.Nil -> printfn ""
| LazyList.Cons(x,xs) ->
printf "%d" x
PrintList (n-1) xs
// NextMorris : LazyList<int> -> LazyList<int>
let rec NextMorris (LazyList.Cons(cur,rest)) =
let count = ref 1
let ll = ref rest
while LazyList.nonempty !ll && (LazyList.hd !ll) = cur do
ll := LazyList.tl !ll
incr count
LazyList.cons !count
(LazyList.consf cur (fun() ->
if LazyList.nonempty !ll then
NextMorris !ll
else
LazyList.empty()))
// Morris : LazyList<int> -> LazyList<LazyList<int>>
let Morris s =
let rec MakeMorris ll =
LazyList.consf ll (fun () ->
let next = NextMorris ll
MakeMorris next
)
MakeMorris s
// "main"
// Print the nth iteration, up to a certain depth
[1] |> LazyList.of_list |> Morris |> Seq.nth 3125 |> PrintList 10
[1] |> LazyList.of_list |> Morris |> Seq.nth 3126 |> PrintList 10
[1] |> LazyList.of_list |> Morris |> Seq.nth 100000 |> PrintList 35
[1] |> LazyList.of_list |> Morris |> Seq.nth 100001 |> PrintList 35
UPDATE2
If you just want to count, that's fine too:
let LLLength ll =
let rec Loop ll acc =
match ll with
| LazyList.Cons(_,rest) -> Loop rest (acc+1N)
| _ -> acc
Loop ll 0N
let Main() =
// don't do line below, it leaks
//let hundredth = [1] |> LazyList.of_list |> Morris |> Seq.nth 100
// if we only want to count length, make sure we throw away the only
// copy as we traverse it to count
[1] |> LazyList.of_list |> Morris |> Seq.nth 100
|> LLLength |> printfn "%A"
Main()
The memory usage stays flat (under 16M on my box)... hasn't finished running yet, but I computed the 55th length fast, even on my slow box, so I think this should work just fine. Note also that I used 'bignum's for the length, since I think this will overflow an 'int'.
I believe there are two main problems here:
Laziness is very inefficient so you can expect a lazy functional implementation to run orders of magnitude slower. For example, the Haskell implementation described here is 2,400× slower than the F# I give below. If you want a workaround, your best bet is probably to amortize the computations by bunching them together into eager batches where the batches are produced on-demand.
The Seq.append function is actually calling into C# code from IEnumerable and, consequently, its tail call doesn't get eliminated and you leak a bit more stack space every time you go through it. This shows up when you come to enumerate over the sequence.
The following is over 80× faster than your implementation at computing the length of the 50th subsequence but perhaps it is not lazy enough for you:
let next (xs: ResizeArray<_>) =
let ys = ResizeArray()
let add n x =
if n > 0 then
ys.Add n
ys.Add x
let mutable n = 0
let mutable x = 0
for i=0 to xs.Count-1 do
let x' = xs.[i]
if x=x' then
n <- n + 1
else
add n x
n <- 1
x <- x'
add n x
ys
let morris =
Seq.unfold (fun xs -> Some(xs, next xs)) (ResizeArray [1])
The core of this function is a fold over a ResizeArray that could be factored out and used functionally without too much performance degradation if you used a struct as the accumulator.
Just save the previous element that you looked for.
let morris2 data = seq {
let cnt = ref 0
let prev = ref (data |> Seq.nth 0)
for cur in data do
if cur <> !prev then
yield! [!cnt; !prev]
cnt := 1
prev := cur
else
cnt := !cnt + 1
yield! [!cnt; !prev]
}
let rec morrisSeq2 cur = seq {
yield cur
yield! morrisSeq2 (morris2 cur)
}