Streaming: VM freezes when writing a large file

I am building a pipeline to process, aggregate, and transform data from CSV files, then write the results back to another CSV file. I load rows from a 19-column CSV and, with some mathematical operations (map/reduce style), write 30 columns back to another CSV.
It was going fine until I uploaded a 25 MB file with 250,000 rows, at which point I decided to stream all the operations instead of processing eagerly. But now that I am converting function by function to streams, I have run into a problem I don't understand: with only 5 fields created so far, when I try to write to the file, the program just freezes and stops writing after a few thousand lines.
I am streaming every single function, so as far as I understand there should not be any locks, and the first few thousand writes work fine, so I wonder what is happening. In the Erlang observer I can only see resource usage drop to near zero, and the program never writes to the file again.
This is my stream function (previously I just loaded everything from the file), followed by my write function:
def process(stream, field_longs_lats, team_settings) do
  main_stream =
    stream
    # Remove rows that don't have a timestamp
    |> Stream.filter(fn [time | _tl] -> time != "-" end)
    # Filter out rows with duplicated timestamps
    |> Stream.uniq_by(fn [time | _tl] -> time end)
    |> Stream.map(&Transform.apply_row_tranformations/1)

  cumulative_milli =
    main_stream
    |> Stream.map(fn [_time, milli | _tl] -> milli end)
    |> Statistics.cumulative_sum()

  speeds =
    main_stream
    |> Stream.map(fn [_time, _milli, _lat, _long, pace | _tl] ->
      pace
    end)
    |> Stream.map(&Statistics.get_speed/1)

  cals = Motion.calories_per_timestep(cumulative_milli, cumulative_milli)

  long_stream =
    main_stream
    |> Stream.map(fn [_time, _milli, lat | _tl] -> lat end)

  lat_stream =
    main_stream
    |> Stream.map(fn [_time, _milli, _lat, long | _tl] -> long end)

  x_y_tuples =
    RelativeCoordinates.relative_coordinates(long_stream, lat_stream, field_longs_lats)

  x = Stream.map(x_y_tuples, fn {x, _y} -> x end)
  y = Stream.map(x_y_tuples, fn {_x, y} -> y end)

  [x, y, cals, long_stream, lat_stream]
end
And the write functions:
def write_to_file(keyword_list, file_name) do
  file = File.open!(file_name, [:write, :utf8])
  IO.write(file, V4.empty_v4_headers() <> "\n")

  keyword_list
  |> Stream.zip()
  |> Stream.each(&write_tuple_row(&1, file))
  |> Stream.run()

  File.close(file)
end

@spec write_tuple_row(tuple(), pid()) :: :ok
def write_tuple_row(tuple, file) do
  IO.inspect("writing #{inspect(tuple)}")

  row_content =
    Tuple.to_list(tuple)
    |> Enum.map_join(",", fn value -> Transformations.to_string(value) end)

  IO.write(file, row_content <> "\n")
end


Sequence of incorrect length generated by function

Why is the following function returning a sequence of incorrect length when the repl variable is set to false?
open MathNet.Numerics.Distributions
open MathNet.Numerics.LinearAlgebra

let sample (data : seq<float>) (size : int) (repl : bool) =
    let n = data |> Seq.length
    // without replacement
    let rec generateIndex idx =
        let m = size - Seq.length(idx)
        match m > 0 with
        | true ->
            let newIdx = DiscreteUniform.Samples(0, n-1) |> Seq.take m
            let idx = (Seq.append idx newIdx) |> Seq.distinct
            generateIndex idx
        | false ->
            idx
    let sample =
        match repl with
        | true ->
            DiscreteUniform.Samples(0, n-1)
            |> Seq.take size
            |> Seq.map (fun index -> Seq.item index data)
        | false ->
            generateIndex (seq [])
            |> Seq.map (fun index -> Seq.item index data)
    sample
Running the function...
let requested = 1000
let dat = Normal.Samples(0., 1.) |> Seq.take 10000
let resultlen = sample dat requested false |> Seq.length
printfn "requested -> %A\nreturned -> %A" requested resultlen
Resulting lengths are wrong.
>
requested -> 1000
returned -> 998
>
requested -> 1000
returned -> 1001
>
requested -> 1000
returned -> 997
Any idea what mistake I'm making?
First, there's a comment I want to make about coding style. Then I'll get to the explanation of why your sequences are coming back with different lengths.
In the comments, I mentioned replacing match (bool) with true -> ... | false -> ... with a simple if ... then ... else expression, but there's another coding style that you're using that I think could be improved. You wrote:
let sample (various_parameters) =   // This is a function
    // Other code ...
    let sample = some_calculation   // This is a variable
    sample                          // Return the variable
While F# allows you to reuse names like that, and the name inside the function will "shadow" the name outside the function, it's generally a bad idea for the reused name to have a totally different type than the original name. In other words, this can be a good idea:
let f (a : float option) =
    let a = match a with
            | None -> 0.0
            | Some value -> value
    // Now proceed, knowing that `a` has a real value even if it had been None before
Or, because the above is exactly what F# gives you defaultArg for:
let f (a : float option) =
    let a = defaultArg a 0.0
    // This does exactly the same thing as the previous snippet
Here, we are making the name a inside our function refer to a different type than the parameter named a: the parameter was a float option, and the a inside our function is a float. But they're sort of the "same" type -- that is, there's very little mental difference between "The caller may have specified a floating-point value or they may not" and "Now I definitely have a floating-point value". But there's a very large mental gap between "The name sample is a function that takes three parameters" and "The name sample is a sequence of floats". I strongly recommend using a name like result for the value you're going to return from your function, rather than re-using the function name.
Also, this seems unnecessarily verbose:
let result =
    match repl with
    | true ->
        DiscreteUniform.Samples(0, n-1)
        |> Seq.take size
        |> Seq.map (fun index -> Seq.item index data)
    | false ->
        generateIndex (seq [])
        |> Seq.map (fun index -> Seq.item index data)
result
Anytime I find myself writing "let result = (something) ; result" at the end of my function, I usually just want to replace that whole code block with just the (something). I.e., the above snippet could just become:
match repl with
| true ->
    DiscreteUniform.Samples(0, n-1)
    |> Seq.take size
    |> Seq.map (fun index -> Seq.item index data)
| false ->
    generateIndex (seq [])
    |> Seq.map (fun index -> Seq.item index data)
Which in turn can be replaced with an if...then...else expression:
if repl then
    DiscreteUniform.Samples(0, n-1)
    |> Seq.take size
    |> Seq.map (fun index -> Seq.item index data)
else
    generateIndex (seq [])
    |> Seq.map (fun index -> Seq.item index data)
And that's the last expression in your code. In other words, I would probably rewrite your function as follows (changing ONLY the style, and making no changes to the logic):
open MathNet.Numerics.Distributions
open MathNet.Numerics.LinearAlgebra

let sample (data : seq<float>) (size : int) (repl : bool) =
    let n = data |> Seq.length
    // without replacement
    let rec generateIndex idx =
        let m = size - Seq.length(idx)
        if m > 0 then
            let newIdx = DiscreteUniform.Samples(0, n-1) |> Seq.take m
            let idx = (Seq.append idx newIdx) |> Seq.distinct
            generateIndex idx
        else
            idx
    if repl then
        DiscreteUniform.Samples(0, n-1)
        |> Seq.take size
        |> Seq.map (fun index -> Seq.item index data)
    else
        generateIndex (seq [])
        |> Seq.map (fun index -> Seq.item index data)
If I can figure out why your sequences have the wrong length, I'll update this answer with that information as well.
UPDATE: Okay, I think I see what's happening in your generateIndex function that's giving you unexpected results. There are two things tripping you up: one is sequence laziness, and the other is randomness.
I copied your generateIndex function into VS Code and added some printfn statements to look at what's going on. First, the code I ran, and then the results:
let rec generateIndex n size idx =
    let m = size - Seq.length(idx)
    printfn "m = %d" m
    match m > 0 with
    | true ->
        let newIdx = DiscreteUniform.Samples(0, n-1) |> Seq.take m
        printfn "Generating newIdx as %A" (List.ofSeq newIdx)
        let idx = (Seq.append idx newIdx) |> Seq.distinct
        printfn "Now idx is %A" (List.ofSeq idx)
        generateIndex n size idx
    | false ->
        printfn "Done, returning %A" (List.ofSeq idx)
        idx
All those List.ofSeq idx calls are so that F# Interactive would print more than four items of the seq when I print it out (by default, if you try to print a seq with %A, it will only print out four values and then print an ellipsis if there are more values available in the seq). Also, I turned n and size into parameters (that I don't change between calls) so that I could test it easily. I then called it as generateIndex 100 5 (seq []) and got the following result:
m = 5
Generating newIdx as [74; 76; 97; 78; 31]
Now idx is [68; 28; 65; 58; 82]
m = 0
Done, returning [37; 58; 24; 48; 49]
val it : seq<int> = seq [12; 69; 97; 38; ...]
See how the numbers keep changing? That was my first clue that something was up. See, seqs are lazy. They don't evaluate their contents until they have to. You shouldn't think of a seq as a list of numbers. Instead, think of it as a generator that will, when asked for numbers, produce them according to some rule. In your case, the rule is "Choose random integers between 0 and n-1, then take m of those numbers". And the other thing about seqs is that they do not cache their contents (although there's a Seq.cache function available that will cache their contents). Therefore, if you have a seq based on a random number generator, its results will be different each time, as you can see in my output. When I printed out newIdx, it printed out as [74; 76; 97; 78; 31], but when I appended it to an empty seq, the result printed out as [68; 28; 65; 58; 82].
Why this difference? Because Seq.append does not force evaluation. It simply creates a new seq whose rule is "take all items from the first seq, then when that one exhausts, take all items from the second seq. And when that one exhausts, end." And Seq.distinct does not force evaluation either; it simply creates a new seq whose rule is "take the items from the seq handed to you, and start handing them out when asked. But memorize them as you go, and if you've handed one of them out before, don't hand it out again." So what you are passing around between your calls to generateIdx is an object that, when evaluated, will pick a set of random numbers between 0 and n-1 (in my simple case, between 0 and 100) and then reduce that set down to a distinct set of numbers.
Now, here's the thing. Every time you evaluate that seq, it will start from the beginning: first calling DiscreteUniform.Samples(0, n-1) to generate an infinite stream of random numbers, then selecting m numbers from that stream, then throwing out any duplicates. (I'm ignoring the Seq.append for now, because it would create unnecessary mental complexity and it isn't really part of the bug anyway). Now, at the start of each go-round of your function, you check the length of the sequence, which does cause it to be evaluated. That means that it selects (in the case of my sample code) 5 random numbers between 0 and 99, then makes sure that they're all distinct. If they are all distinct, then m = 0 and the function will exit, returning... not the list of numbers, but the seq object. And when that seq object is evaluated, it will start over from the beginning, choosing a different set of 5 random numbers and then throwing out any duplicates. Therefore, there's still a chance that at least one of that set of 5 numbers will end up being a duplicate, because the sequence whose length was tested (which we know contained no duplicates, otherwise m would have been greater than 0) was not the sequence that was returned. The sequence that was returned has a 1.0 * 0.99 * 0.98 * 0.97 * 0.96 chance of not containing any duplicates, which comes to about 0.9035. So there's a just-under-10% chance that even though you checked Seq.length and it was 5, the length of the returned seq ends up being 4 after all -- because it was choosing a different set of random numbers than the one you checked.
To prove this, I ran the function again, this time only picking 4 numbers so that the result would be completely shown at the F# Interactive prompt. And my run of generateIndex 100 4 (seq []) produced the following output:
m = 4
Generating newIdx as [36; 63; 97; 31]
Now idx is [39; 93; 53; 94]
m = 0
Done, returning [47; 94; 34]
val it : seq<int> = seq [48; 24; 14; 68]
Notice how when I printed "Done, returning (value of idx)", it had only 3 values? Even though it eventually returned 4 values (because it picked a different selection of random numbers for the actual result, and that selection had no duplicates), that demonstrated the problem.
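The re-evaluation effect is easy to reproduce in isolation. Here is a minimal sketch of the same pitfall using System.Random instead of MathNet (purely so the snippet is self-contained): a random-backed seq yields different contents on every evaluation, and Seq.cache is what freezes the first evaluation.

let rng = System.Random()

// The rule is "produce 3 random numbers"; nothing is stored.
let xs = Seq.init 3 (fun _ -> rng.Next(0, 100))
printfn "%A" (List.ofSeq xs)   // e.g. [42; 7; 91]
printfn "%A" (List.ofSeq xs)   // almost certainly different: the rule re-runs

// Seq.cache remembers the values from the first evaluation.
let cached = Seq.cache xs
printfn "%A" (List.ofSeq cached)
printfn "%A" (List.ofSeq cached)   // identical to the line above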
By the way, there's one other problem with your function, which is that it's far slower than it needs to be. The function Seq.item, in some circumstances, has to run through the sequence from the beginning in order to pick the nth item of the sequence. It would be far better to store your data in an array at the start of your function (let arrData = data |> Array.ofSeq), then replace
|> Seq.map (fun index -> Seq.item index data)
with
|> Seq.map (fun index -> arrData.[index])
Array lookups are done in constant time, so that takes your sample function down from O(N^2) to O(N).
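If you want to see the difference for yourself, here is a rough, machine-dependent timing sketch (the sizes and lookup counts are arbitrary assumptions of mine, not taken from the question):

let bigSeq = Seq.init 100000 float
let bigArr = bigSeq |> Array.ofSeq

let sw = System.Diagnostics.Stopwatch.StartNew()
for i in 0 .. 99 do bigSeq |> Seq.item (i * 1000) |> ignore   // walks from the start every time
printfn "Seq.item lookups: %dms" sw.ElapsedMilliseconds
sw.Restart()
for i in 0 .. 99 do bigArr.[i * 1000] |> ignore               // constant-time indexing
printfn "Array lookups: %dms" sw.ElapsedMilliseconds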
TL;DR: Apply Seq.distinct before you take your values and the bug will go away. You can replace your entire generateIdx function with a simple DiscreteUniform.Samples(0, n-1) |> Seq.distinct |> Seq.take size. (And use an array for your data lookups so that your function will run faster.) In other words, here's the almost-final version of how I would rewrite your code:
let sample (data : seq<float>) (size : int) (repl : bool) =
    let arrData = data |> Array.ofSeq
    let n = arrData |> Array.length
    if repl then
        DiscreteUniform.Samples(0, n-1)
        |> Seq.take size
        |> Seq.map (fun index -> arrData.[index])
    else
        DiscreteUniform.Samples(0, n-1)
        |> Seq.distinct
        |> Seq.take size
        |> Seq.map (fun index -> arrData.[index])
That's it! Simple, easy to understand, and (as far as I can tell) bug-free.
Edit: ... but not completely DRY, because there's still a bit of repeated code in that "final" version. (Credit to CaringDev for pointing it out in the comments below). The Seq.take size |> Seq.map is repeated in both branches of the if expression, so there's a way to simplify that expression. We could do this:
let randomIndices =
    if repl then
        DiscreteUniform.Samples(0, n-1)
    else
        DiscreteUniform.Samples(0, n-1) |> Seq.distinct
randomIndices
|> Seq.take size
|> Seq.map (fun index -> arrData.[index])
So here's a truly-final version of my suggestion:
let sample (data : seq<float>) (size : int) (repl : bool) =
    let arrData = data |> Array.ofSeq
    let n = arrData |> Array.length
    let randomIndices =
        if repl then
            DiscreteUniform.Samples(0, n-1)
        else
            DiscreteUniform.Samples(0, n-1) |> Seq.distinct
    randomIndices
    |> Seq.take size
    |> Seq.map (fun index -> arrData.[index])
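A quick usage sketch of that final version (the printed values are random, of course, so the example outputs in the comments are illustrative only):

let data = [ 1.0 .. 20.0 ]
sample data 5 false |> List.ofSeq |> printfn "%A"   // e.g. [12.0; 3.0; 19.0; 7.0; 14.0]
sample data 5 true  |> List.ofSeq |> printfn "%A"   // with replacement, may contain repeats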

Erlang: list comprehension without duplicates

I am doing something horrible but I don't know how to make it better.
I am forming all pairwise sums of the elements of a list called SomeList, but I don't want to see duplicates (I guess I want "all possible pairwise sums"):
sets:to_list(sets:from_list([A+B || A <- SomeList, B <- SomeList]))
SomeList does NOT contain duplicates.
This works, but is horribly inefficient, because the original list before the set conversion is GIGANTIC.
Is there a better way to do this?
You could simply use lists:usort/1:
lists:usort([X+Y || X <- L, Y <- L]).
If the chance of duplicates is very high, you can generate the sums using two loops and store them in an ETS set (or using a map; I didn't check the performance of both):
7> Inloop = fun Inloop(_,[],_) -> ok; Inloop(Store,[H|T],X) -> ets:insert(Store,{X+H}), Inloop(Store,T,X) end.
#Fun<erl_eval.42.54118792>
8> Outloop = fun Outloop(Store,[],_) -> ok; Outloop(Store,[H|T],List) -> Inloop(Store,List,H), Outloop(Store,T,List) end.
#Fun<erl_eval.42.54118792>
9> Makesum = fun(L) -> S = ets:new(temp,[set]), Outloop(S,L,L), R =ets:foldl(fun({X},Acc) -> [X|Acc] end,[],S), ets:delete(S), R end.
#Fun<erl_eval.6.54118792>
10> Makesum(lists:seq(1,10)).
[15,13,8,11,20,14,16,12,7,3,10,9,19,18,4,17,6,2,5]
11> lists:sort(Makesum(lists:seq(1,10))).
[2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
12>
This module will let you compare execution times when using a list comprehension, sets, or ETS. You can of course add more functions to the comparison:
-module(pairwise).
-export([start/2]).

start(Type, X) ->
    L = lists:seq(1, X),
    timer:tc(fun do/2, [Type, L]).

do(compr, L) ->
    sets:to_list(sets:from_list([A+B || A <- L, B <- L]));
do(set, L) ->
    F = fun(Sum, Set) -> sets:add_element(Sum, Set) end,
    R = fun(Set) -> sets:to_list(Set) end,
    do(L, L, sets:new(), {F, R});
do(ets, L) ->
    F = fun(Sum, Tab) -> ets:insert(Tab, {Sum}), Tab end,
    R = fun(Tab) ->
            Fun = fun({X}, Acc) -> [X|Acc] end,
            Res = ets:foldl(Fun, [], Tab),
            ets:delete(Tab),
            Res
        end,
    do(L, L, ets:new(?MODULE, []), {F, R}).

do([A|AT], [B|BT], S, {F, _} = Funs) -> do([A|AT], BT, F(A+B, S), Funs);
do([_AT], [], S, {_, R}) -> R(S);
do([_A|AT], [], S, Funs) -> do(AT, AT, S, Funs).
Results:
36> {_, Res1} = pairwise:start(compr, 20).
{282,
[16,32,3,19,35,6,22,38,9,25,12,28,15,31,2,18,34,5,21,37,8,
24,40,11,27,14,30|...]}
37> {_, Res2} = pairwise:start(set, 20).
{155,
[16,32,3,19,35,6,22,38,9,25,12,28,15,31,2,18,34,5,21,37,8,
24,40,11,27,14,30|...]}
38> {_, Res3} = pairwise:start(ets, 20).
{96,
[15,25,13,8,21,24,40,11,26,20,14,28,23,16,12,39,34,36,7,32,
35,3,33,10,9,19,18|...]}
39> R1=lists:usort(Res1), R2=lists:usort(Res2), R3=lists:usort(Res3).
[2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,
24,25,26,27,28,29,30|...]
40> R1 = R2 = R3.
[2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,
24,25,26,27,28,29,30|...]
The last line verifies that all three functions return the same elements, just ordered differently.
The first number in each result tuple is the execution time in microseconds, as returned by timer:tc(fun do/2, [Type, L]). In this example it is 282 for the list comprehension, 155 for sets, and 96 for ETS.
An effective way is to use foldl instead of a list comprehension, because in this case you need state at each step:
sets:to_list(
    lists:foldl(fun(A, S1) ->
                    lists:foldl(fun(B, S2) ->
                                    sets:add_element(A+B, S2)
                                end, S1, SomeListA)
                end, sets:new(), SomeListB)).
This solution keeps it relatively fast and makes use of as much pre-written library code as possible.
Note that I use lists:zip/2 here rather than numeric +, only to illustrate that this approach works for any kind of non-repeating permutation of a unique list. You may only care about arithmetic, but if you want more, this can do it.
-export([permute_unique/1]).

permute_unique([]) ->
    [];
permute_unique([A|Ab]) ->
    lists:zip(lists:duplicate(length(Ab)+1, A), [A|Ab])
    ++
    permute_unique(Ab).

% To sum integers, replace the lists:zip... line with
%   [B+C || {B,C} <- lists:zip(lists:duplicate(length(Ab)+1, A), [A|Ab])]
% to perform normal arithmetic and yield a numeric value for each element.
I am not sure what you consider gigantic - you will end up with N*(N+1)/2 total elements in the permuted list for a unique list of N original elements, so this gets big really fast.
I did some basic performance testing of this, using an Intel (Haswell) Core i7 @ 4 GHz with 32 GB of memory, running Erlang/OTP 17 64-bit.
5001 elements in the list took between 2 and 5 seconds according to timer:tc/1.
10001 elements in the list took between 15 and 17 seconds, and required about 9GB of memory. This generates a list of 50,015,001 elements.
15001 elements in the list took between 21 and 25 seconds, and required about 19GB of memory.
20001 elements in the list took 49 seconds in one run, and peaked at about 30GB of memory, with about 200 million elements in the result. That is the limit of what I can test.

http download to disk with fsharp.data.dll and async workflows stalls

The following .fsx file is supposed to download and save to disk the binary tablebase files that are posted as links on an HTML page, using FSharp.Data.dll.
What happens is that the whole thing stalls after a while, way before it is done, without even throwing an exception or anything like that.
I am pretty sure I am somehow mis-handling the CopyToAsync() thingy in my async workflow. As this is supposed to run while I go for a nap, it would be nice if someone could tell me how it is supposed to be done correctly. (In more general terms: how do I handle a System.Threading.Tasks.Task thingy in an async workflow thingy?)
#r @"E:\R\playground\DataTypeProviderStuff\packages\FSharp.Data.2.2.3\lib\net40\FSharp.Data.dll"
open FSharp.Data
open Microsoft.FSharp.Control.CommonExtensions

let document = HtmlDocument.Load("http://www.olympuschess.com/egtb/gaviota/")
let links =
    document.Descendants ["a"]
    |> Seq.choose (fun x -> x.TryGetAttribute("href") |> Option.map (fun a -> a.Value()))
    |> Seq.filter (fun v -> v.EndsWith(".cp4"))
    |> List.ofSeq

let targetFolder = @"E:\temp\tablebases\"
let downloadUrls =
    links |> List.map (fun name -> "http://www.olympuschess.com/egtb/gaviota/" + name, targetFolder + name)

let awaitTask = Async.AwaitIAsyncResult >> Async.Ignore

let fetchAndSave (s, t) =
    async {
        printfn "Starting with %s..." s
        let! result = Http.AsyncRequestStream(s)
        use fileStream = new System.IO.FileStream(t, System.IO.FileMode.Create)
        do! awaitTask (result.ResponseStream.CopyToAsync(fileStream))
        printfn "Done with %s." s
    }

let makeBatches n jobs =
    let rec collect i jl acc =
        match i, jl with
        | 0, _ -> acc, jl
        | _, [] -> acc, jl
        | _, x::xs -> collect (i-1) xs (acc @ [x])
    let rec loop remaining acc =
        match remaining with
        | [] -> acc
        | x::xs ->
            let r, rest = collect n remaining []
            loop rest (acc @ [r])
    loop jobs []

let download () =
    downloadUrls
    |> List.map fetchAndSave
    |> makeBatches 2
    |> List.iter (fun l -> l |> Async.Parallel |> Async.RunSynchronously |> ignore)
    |> ignore

download()
Note: I updated the code so it creates batches of 2 downloads at a time; only the first batch works. I also added the awaitTask helper from the first answer, as this seems to be the right way to do it.
News: What is also funny: if I interrupt the stalled script and then #load it again into the same instance of fsi.exe, it stalls right away. I am starting to think it is a bug in the library I am using, or something like that.
Thanks in advance!
Here fetchAndSave has been modified to handle the Task returned from CopyToAsync asynchronously. In your version you are waiting on the Task synchronously. Your script will appear to lock up as you are using Async.RunSynchronously to run the whole workflow. However the files do download as expected in the background.
let awaitTask = Async.AwaitIAsyncResult >> Async.Ignore

let fetchAndSave (s, t) = async {
    let! result = Http.AsyncRequestStream(s)
    use fileStream = new System.IO.FileStream(t, System.IO.FileMode.Create)
    do! awaitTask (result.ResponseStream.CopyToAsync(fileStream))
}
Of course you also need to call
do download()
on the last line of your script to kick things into motion.
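As a side note, on more recent FSharp.Core versions Async.AwaitTask also has an overload for the non-generic Task that CopyToAsync returns, so the awaitTask helper can be dropped entirely. A sketch, under the assumption that such a runtime is available:

let fetchAndSave (s, t) = async {
    let! result = Http.AsyncRequestStream(s)
    use fileStream = new System.IO.FileStream(t, System.IO.FileMode.Create)
    // CopyToAsync returns a plain Task; Async.AwaitTask turns it into Async<unit>.
    do! result.ResponseStream.CopyToAsync(fileStream) |> Async.AwaitTask
}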

How do I write effectively to a file in F#?

I want to generate large XML files for testing purposes, but the code I ended up with is really slow; the time grows exponentially with the number of rows I write to the file. The example below shows that it takes milliseconds to write 100 rows, but over 20 seconds to write 1000 rows (on my machine). I really can't figure out what is making this slow, since I think that writing 1000 rows shouldn't take that long. Also, writing 200 rows takes about 4 times as long as writing 100 rows, which is not good. To run the code you might want to change the path for the StreamWriter.
open System.IO
open System.Diagnostics

let xmlSeq =
    Seq.initInfinite (fun index ->
        sprintf "<author><name>name%d</name><age>%d</age><books><book>book%d</book></books></author>" index index index)

let createFile (seq: string seq) numberToTake fileName =
    use streamWriter = new StreamWriter("C:\\tmp\\FSharpXmlTest\\FSharpXmlTest\\" + fileName, false)
    streamWriter.WriteLine("<startTag>")
    let rec internalWriter (seq: string seq) (sw: StreamWriter) i (endTag: string) =
        match i with
        | 0 ->
            sw.WriteLine(Seq.head seq)
            sw.WriteLine(endTag)
        | _ ->
            sw.WriteLine(Seq.head seq)
            internalWriter (Seq.skip 1 seq) sw (i-1) endTag
    internalWriter seq streamWriter numberToTake "</startTag>"

let funcTimer fn =
    let stopWatch = Stopwatch.StartNew()
    printfn "Timing started"
    fn()
    stopWatch.Stop()
    printfn "Time elapsed: %A" stopWatch.Elapsed

(funcTimer (fun () -> createFile xmlSeq 100 "file100.xml"))
(funcTimer (fun () -> createFile xmlSeq 1000 "file1000.xml"))
You observed quadratic behaviour, O(n^2), from manipulating sequences. Every time you call Seq.skip, a brand-new sequence is created, so you implicitly traverse the remaining part again. A more detailed explanation can be found at https://stackoverflow.com/a/1306267.
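To make the re-traversal visible, here is a small self-contained sketch with a side-effecting sequence (the printfn exists only to log which elements are produced):

let logged =
    Seq.init 5 (fun i ->
        printfn "yielding %d" i
        i)

// Seq.skip does not jump anywhere: enumerating its result still produces
// (and discards) the skipped prefix, so the generator runs from index 0.
logged |> Seq.skip 3 |> Seq.head |> ignore
// prints: yielding 0, yielding 1, yielding 2, yielding 3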
In this example, you don't need to decompose sequences at all. Replace your inner function with:
let internalWriter (seq: string seq) (sw: StreamWriter) i (endTag: string) =
    for node in Seq.take i seq do
        sw.WriteLine(node)
    sw.WriteLine(endTag)
With this I can write 10000 rows in a fraction of a second.
You can refactor further by removing this inner function and copying its body into the parent function.
As the link above mentions, if you ever need to decompose sequences, LazyList is a better fit.
pad's answer has pointed to the cause of the slowdown. Another idiomatic approach is, instead of an infinite sequence, to generate a sequence of the needed length with Seq.unfold, which makes the code really trivial:
let xmlSeq n =
    Seq.unfold (fun i ->
        if i = 0 then None
        else Some((sprintf "<author><name>name%d</name><age>%d</age><books><book>book%d</book></books></author>" i i i), i - 1)) n

let createFile seqLen fileName =
    use streamWriter = new StreamWriter("C:\\tmp\\FSharpXmlTest\\" + fileName, false)
    streamWriter.WriteLine("<startTag>")
    seqLen |> xmlSeq |> Seq.iter streamWriter.WriteLine
    streamWriter.WriteLine("</startTag>")

(funcTimer (fun () -> createFile 10000 "file10000.xml"))
Generating 10000 elements takes around 500ms on my laptop.
I came up with the following solution:
namespace FSharpBasics

module Program2 =
    open System
    open System.IO
    open System.Diagnostics

    let seqTest count : seq<string> =
        let template = "<author>\
                        <name>Name {0}</name>\
                        <age>{0}</age>\
                        <books>\
                        <book>Book {0}</book>\
                        </books>\
                        </author>"
        let row (i: int) =
            String.Format (template, i)
        seq {
            yield "<authors>"
            for x in [ 1..count ] do
                yield row x
            yield "</authors>"
        }

    [<EntryPoint>]
    let main argv =
        printfn "File will be written now"
        let stopwatch = Stopwatch.StartNew()
        File.WriteAllLines (@".\test.xml", seqTest 10000) |> ignore
        stopwatch.Stop()
        printf "Ended, took %f seconds" stopwatch.Elapsed.TotalSeconds
        System.Console.ReadKey() |> ignore
        0
It takes less than 90 milliseconds on my laptop to create a well-formed test.xml file with 10,000 authors.

Complex Continuation in F#

All of the continuation tutorials I can find are about fixed-length continuations (i.e., the data structure has a known number of items as it is being traversed).
I am implementing depth-first-search Negamax (http://en.wikipedia.org/wiki/Negamax), and while the code works, I would like to rewrite it using continuations.
The code I have is as follows:
let naiveDFS driver depth game side =
    List.map (fun x ->
        // negamax depth-1 childnode, opposite side
        (x, -(snd (driver (depth-1) (update game x) -side))))
        (game.AvailableMoves.Force())
    |> List.maxBy snd

let onPlay game =
    match game.Turn with
    | Black -> -1
    | White -> 1

/// naive depth first search using depth limiter
let DepthFirstSearch (depth: int) (eval: Evaluator<_>) (game: GameState) : (Move * Score) =
    let myTurn = onPlay game
    let rec searcher depth game side =
        match depth with
        // terminal node
        | x when x = 0 || (isTerminal game) ->
            let movescore = (eval ((), game)) |> fst
            (((-1,-1), (-1,-1)), (movescore * side))
        // the max of the child moves; each child move gets mapped to
        // its associated score
        | _ -> naiveDFS searcher depth game side
Here update updates a game state with a given move, eval evaluates the game state and returns an incrementer (currently unused) for incremental evaluation, and isTerminal determines whether or not the position is an end position.
The problem is that I have to sign up an unknown number of actions (every remaining List.map iteration) to the continuation, and I can't conceive of an efficient way of doing this.
Since this is an exponential algorithm, I am obviously looking to keep this as efficient as possible (although my brain hurts trying to figure this out, so I want an answer more than an efficient one).
Thanks
I think you'll need to implement a continuation-based version of List.map to do this.
A standard implementation of map (using the accumulator argument) looks like this:
let map' f l =
    let rec loop acc l =
        match l with
        | [] -> acc |> List.rev
        | x::xs -> loop ((f x)::acc) xs
    loop [] l
If you add a continuation as an argument and transform the code to return via a continuation, you'll get (the interesting case is the x::xs branch in the loop function, where we first call f using tail-call with some continuation as an argument):
let contMap f l cont =
    let rec loop acc l cont =
        match l with
        | [] -> acc |> List.rev |> cont
        | x::xs -> f x (fun x' -> loop (x'::acc) xs cont)
    loop [] l cont
Then you can turn normal List.map into a continuation based version like this:
// Original version
let r = List.map (fun x -> x*2) [ 1 .. 3 ]
// Continuation-based version
contMap (fun x c -> c(x*2)) [ 1 .. 3 ] (fun r -> ... )
I'm not sure this will give you any notable performance improvement. Continuations are mainly needed when you have very deep recursion that doesn't fit on the stack; if it fits on the stack, it will probably run faster without them.
Also, the rewriting to explicit continuation style makes the program a bit ugly. You can improve that by using a computation expression for working with continuations. Brian has a blog post on this very topic.
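For reference, the computation-expression approach boils down to something like this minimal builder (my own sketch of the idea; the blog post develops it much further):

// A continuation-style value is a function that takes a callback for its result.
type ContBuilder() =
    member this.Return(x) = fun k -> k x
    member this.Bind(m, f) = fun k -> m (fun x -> f x k)

let cont = ContBuilder()

let doubled x = cont { return x * 2 }

// let! hides the explicit callback plumbing:
let fortyTwo = cont {
    let! a = doubled 10
    let! b = doubled 11
    return a + b }

fortyTwo (printfn "%d")   // the final callback receives 42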
