Is F# Interactive supposed to be much slower than compiled?

I've been using F# for nearly six months and was so sure that F# Interactive should have the same performance as compiled code that, when I finally bothered to benchmark it, I was convinced I had found some kind of compiler bug. Now it occurs to me that I should have checked here first before opening an issue.
For me it is roughly 3x slower and the optimization switch does not seem to be doing anything at all.
Is this supposed to be standard behavior? If so, I really got trolled by the #time directive. I have the timings for how long it takes to sum 100M elements on this Reddit thread.
Update:
Thanks to FuleSnabel, I uncovered some things.
I tried running the example script from both fsianycpu.exe (which is the default F# Interactive) and fsi.exe, and I am getting different timings for the two runs: 134 ms for the former and 78 ms for the latter. Those two timings also correspond to the timings from the unoptimized and optimized binaries, respectively.
What makes the matter even more confusing is that the first project I used to compile the code is part of the game library (in script form) I am making, and it refuses to compile the optimized binary, silently falling back to the unoptimized one. I had to start a fresh project to get it to compile properly. It is a wonder the other test compiled properly.
So basically, something funky is going on here, and I should look into making fsi.exe the default interpreter instead of fsianycpu.exe.
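For anyone who wants to reproduce this: both hosts accept the same switches, so you can run a script (here hypothetically named Test.fsx) against each one with optimizations explicitly enabled and compare:

fsianycpu.exe --optimize+ --exec Test.fsx
fsi.exe --optimize+ --exec Test.fsx

-O is the short form of --optimize+ and is what the answer below passes to fsi.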

I tried the example code from the pastebin and I don't see the behavior you describe. This is the result from my performance run:
.\bin\Release\ConsoleApplication3.exe
Total iterations: 300000000, Outer: 10000, Inner: 30000
reduce sequence of list, result 450015000, time 2836 ms
reduce array, result 450015000, time 594 ms
for loop array, result 450015000, time 180 ms
reduce list, result 450015000, time 593 ms
fsi -O --exec .\Interactive.fsx
Total iterations: 300000000, Outer: 10000, Inner: 30000
reduce sequence of list, result 450015000, time 2617 ms
reduce array, result 450015000, time 589 ms
for loop array, result 450015000, time 168 ms
reduce list, result 450015000, time 603 ms
It's expected that Seq.reduce would be the slowest and the for loop the fastest, and that reduce on list/array would be roughly similar (this assumes locality of list elements, which isn't guaranteed).
I rewrote your code to allow for longer runs without running out of memory and to improve cache locality of the data. With short runs, the uncertainty of the measurements makes it hard to compare the data.
Program.fs:
module fs

let stopWatch =
    let sw = new System.Diagnostics.Stopwatch()
    sw.Start ()
    sw

let total = 300000000
let outer = 10000
let inner = total / outer

let timeIt (name : string) (a : unit -> 'T) : unit =
    let t = stopWatch.ElapsedMilliseconds
    let v = a ()
    for i = 2 to outer do
        a () |> ignore
    let d = stopWatch.ElapsedMilliseconds - t
    printfn "%s, result %A, time %d ms" name v d

[<EntryPoint>]
let sumTest(args) =
    let numsList = [1..inner]
    let numsArray = [|1..inner|]
    printfn "Total iterations: %d, Outer: %d, Inner: %d" total outer inner
    let sumsSeqReduce () = Seq.reduce (+) numsList
    timeIt "reduce sequence of list" sumsSeqReduce
    let sumsArray () = Array.reduce (+) numsArray
    timeIt "reduce array" sumsArray
    let sumsLoop () =
        let mutable total = 0
        for i in 0 .. inner - 1 do
            total <- total + numsArray.[i]
        total
    timeIt "for loop array" sumsLoop
    let sumsListReduce () = List.reduce (+) numsList
    timeIt "reduce list" sumsListReduce
    0
Interactive.fsx:
#load "Program.fs"
fs.sumTest [||]
P.S. I am running on Windows with Visual Studio 2015. 32-bit vs. 64-bit seemed to make only a marginal difference.

Related

How do I profile an F# script?

I am learning F# by automating a few of my tasks with F# scripts, which I run with "fsi/fsharpi --exec" from the command line. I am using .NET Core for my work. One of the things I was looking for is how to profile my F# scripts. I am primarily looking to:
See the overall time consumed by my entire script. I tried doing this with stopwatch-style functionality and it works well. Is there anything that can show the time for my various top-level function calls? Or timings/counts for function calls?
See the overall memory consumption of my script.
Find hot spots in my scripts.
Overall, I am trying to understand the performance bottlenecks of my scripts.
On a side note, is there a way to compile F# scripts to an exe?
I recommend using BenchmarkDotNet for any benchmarking tasks (well, micro-benchmarks). Since it's a statistical tool, it accounts for many things that hand-rolled benchmarking will not. And just by applying a few attributes you can get a nifty report.
Create a .NET Core console app, add the BenchmarkDotNet package, create a benchmark, and run it to see the results. Here's an example that tests two trivial parsing functions, using one as the baseline for comparison, and instructs BenchmarkDotNet to capture memory usage stats when running the benchmark:
open System
open BenchmarkDotNet.Attributes
open BenchmarkDotNet.Running

module Parsing =
    /// "123,456" --> (123, 456)
    let getNums (str: string) (delim: char) =
        let idx = str.IndexOf(delim)
        let first = Int32.Parse(str.Substring(0, idx))
        let second = Int32.Parse(str.Substring(idx + 1))
        first, second

    /// "123,456" --> struct (123, 456)
    let getNumsFaster (str: string) (delim: char) =
        let sp = str.AsSpan()
        let idx = sp.IndexOf(delim)
        let first = Int32.Parse(sp.Slice(0, idx))
        let second = Int32.Parse(sp.Slice(idx + 1))
        struct (first, second)

[<MemoryDiagnoser>]
type ParsingBench() =
    let str = "123,456"
    let delim = ','

    [<Benchmark(Baseline = true)>]
    member __.GetNums() =
        Parsing.getNums str delim |> ignore

    [<Benchmark>]
    member __.GetNumsFaster() =
        Parsing.getNumsFaster str delim |> ignore

[<EntryPoint>]
let main _ =
    let summary = BenchmarkRunner.Run<ParsingBench>()
    printfn "%A" summary
    0 // return an integer exit code
In this case, the results will show that the getNumsFaster function allocates 0 bytes and runs about 33% faster.
Once you've found something that consistently performs better and allocates less, you can transfer that over to a script or some other environment where the code will actually execute.
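Note that BenchmarkDotNet will, by default, refuse to run benchmarks against a non-optimized (Debug) build, so make sure to launch the console app in Release configuration:

dotnet run -c Release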
As for hot spots, your best tool is to actually run the script under a profiler like PerfView and look at CPU time and allocations caused by the script while it's executing. There's no simple answer here: interpreting profiling results correctly is challenging and time-consuming work.
There's no way to compile an F# script to an executable for .NET Core. It's possible only on Windows/.NET Framework, but this is legacy behavior that is considered deprecated. It's recommended that you convert code in your script to an application if you'd like it to run as an executable.
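A minimal sketch of that conversion (project and file names here are just placeholders): create a console project with dotnet new console -lang "F#", then move the script body into Program.fs under an entry point, turning any #r directives into PackageReference entries in the .fsproj:

// Program.fs -- body moved out of the old .fsx file
module Program

// Code that used to sit at the top level of the script goes here,
// wrapped so it runs from the entry point.
let runScriptBody () =
    printfn "former script logic runs here"

[<EntryPoint>]
let main argv =
    runScriptBody ()
    0 // exit code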

How can I return a specific number of values from a sequence in a recursive random walk function?

I'm trying to plot the first 100 values of this random walk function using F#/Xamarin. Whenever I call the function, my application freezes. I tried to get the count of this.RW, but that also freezes the application. Help appreciated!
open MathNet.Numerics.Distributions

module SimulatingAndAnalyzingAssetPrices =
    type RandomWalk(price : float) =
        let sample = Normal(0.0, 1.0).Sample()

        // Generate random walk from 'value' recursively
        let rec randomWalk price =
            seq {
                yield price
                yield! randomWalk (price + sample)
            }

        let rw = randomWalk 10.0
        member this.RW = rw
Note: This is from tryfsharp.org's finance section. My intent is to port it to iOS using Xamarin.
Edit:
Tried using the following code, but still getting an infinite sequence:
let rw = randomWalk 10.0
let rwTake = Seq.take 100 rw
member this.RwTake = rwTake
Edit 2: Also tried
let rw = randomWalk 10.0 |> Seq.take 100
member this.RwTake = rw
Your randomWalk function produces an infinite sequence, so you should never consume it without using Seq.take to limit the number of items to, say, the 100 you're trying to graph. If you iterate it directly, the app freezes because you'll never find the end of an infinite sequence. But if you graph this.RW |> Seq.take 100, your application should no longer freeze up.
And getting the count of an infinite sequence is also going to freeze, because it will keep going through the sequence trying to reach the end so it can return a count, but there's no end! You can never take the count of an infinite sequence.
Update: There's also Seq.truncate. The difference between Seq.take and Seq.truncate lies in what happens if there are fewer items than expected. (This won't be the case for your randomWalk function, so you can use either take or truncate and you'll be fine, but in other situations you might need to know the difference.) If a sequence has only 5 items but you try to Seq.take 10 from it, it will throw an exception. Seq.truncate, on the other hand, will return just 5 items.
So Seq.take 10 will either return exactly 10 items, no more and no fewer, or else throw an exception. Seq.truncate 10 will return up to 10 items, but might return fewer, and will never throw an exception.
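Here's a quick illustration you can paste into F# Interactive (a minimal sketch, not from the thread):

let short = Seq.ofList [1; 2; 3; 4; 5]      // only five items
short |> Seq.truncate 10 |> Seq.toList      // [1; 2; 3; 4; 5] -- stops quietly
short |> Seq.take 10 |> Seq.toList          // throws InvalidOperationException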
For more details, see Does F# have an equivalent to Haskell's take?

System.OutOfMemoryException for a tail recursive function

I have isolated the problematic code to this function (that uses ASP.NET's Membership class):
let dbctx = DBSchema.GetDataContext()

let rec h1 (is2_ : int) (ie2_ : int) : unit =
    match is2_ >= ie2_ with
    | true ->
        let st2 = query {
            for row in dbctx.Tbl_Students do
            where (row.Id = is2_)
            head }
        let l2 =
            Membership.FindUsersByEmail (st2.Email_address)
            |> Seq.cast<_>
            |> Seq.length
        match l2 >= 1 with
        | true ->
            ()
        | false ->
            Membership.CreateUser (st2.Email_address, password, st2.Email_address)
            |> ignore
        h1 (is2_ - 1) ie2_
    | false ->
        ()
I am getting a System.OutOfMemoryException after exactly 5626 iterations of h1, yet my system's memory consumption is only at 20 percent. (I have a very powerful machine with 16 GB of RAM.)
Why should the above function overflow the stack? Is it not written tail recursively?
Thanks in advance for your help.
I don't think this is a tail-recursion issue -- if it were, you'd be getting a StackOverflowException instead of an OutOfMemoryException. Note that even though you have 16 GB of memory in your machine, the process your program executes in may be limited to a smaller amount of memory. IIRC, it's 3 GB for some combinations of .NET framework version and OS version -- this would explain why the process crashes when you reach ~20% memory usage (20% of 16 GB = 3.2 GB).
I don't know how much it'll help, but you can simplify your code to avoid creating some unnecessary sequences:
let dbctx = DBSchema.GetDataContext()

let rec h1 (is2_ : int) (ie2_ : int) : unit =
    if is2_ >= ie2_ then
        let st2 = query {
            for row in dbctx.Tbl_Students do
            where (row.Id = is2_)
            head }
        let existingUsers = Membership.FindUsersByEmail st2.Email_address
        if existingUsers.Count < 1 then
            Membership.CreateUser (st2.Email_address, password, st2.Email_address)
            |> ignore
        h1 (is2_ - 1) ie2_
EDIT: Here's a link to a previous question with details about the CLR memory limitations for some versions of the .NET framework and OS: Is there a memory limit for a single .NET process
OutOfMemoryException usually doesn't have anything to do with the amount of RAM you have. You most likely get it at ~3 GB because your code runs as a 32-bit process. But switching to 64-bit will only solve your issue if you actually need that much memory and the exception is not caused by some bug.
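If you want to confirm which situation you're in, the bitness of the running process is easy to check (a small diagnostic sketch, not part of the original answers):

// A 32-bit process hits an address-space ceiling of a few GB no matter
// how much physical RAM the machine has; a 64-bit process does not.
printfn "64-bit process: %b (IntPtr.Size = %d bytes)"
    System.Environment.Is64BitProcess
    System.IntPtr.Size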

Is the C++ AMP library useful from F#?

I'm experimenting with the C++ AMP library in F# as a way of using the GPU to do work in parallel. However, the results I'm getting don't seem intuitive.
In C++, I made a library with one function that squares all the numbers in an array, using AMP:
extern "C" __declspec ( dllexport ) void _stdcall square_array(double* arr, int n)
{
// Create a view over the data on the CPU
array_view<double,1> dataView(n, &arr[0]);
// Run code on the GPU
parallel_for_each(dataView.extent, [=] (index<1> idx) restrict(amp)
{
dataView[idx] = dataView[idx] * dataView[idx];
});
// Copy data from GPU to CPU
dataView.synchronize();
}
(Code adapted from Igor Ostrovsky's blog on MSDN.)
I then wrote the following F# to compare the Task Parallel Library (TPL) to AMP:
open System
open System.Diagnostics
open System.Runtime.InteropServices
open System.Threading.Tasks

// Print the time needed to run the given function
let time f =
    let s = new Stopwatch()
    s.Start()
    f ()
    s.Stop()
    printfn "elapsed: %d" s.ElapsedTicks

module CInterop =
    [<DllImport("CPlus", CallingConvention = CallingConvention.StdCall)>]
    extern void square_array(float[] array, int length)

let options = new ParallelOptions()
let size = 1000.0
let arr = [|1.0 .. size|]

// Square the number at the given index of the array
let sq i =
    do arr.[i] <- arr.[i] * arr.[i]
    ()

// Square every number in the array using TPL
time (fun () -> Parallel.For(0, arr.Length - 1, options, new Action<int>(sq)) |> ignore)

let arr2 = [|1.0 .. size|]

// Square every number in the array using AMP
time (fun () -> CInterop.square_array(arr2, arr2.Length))
If I set the array size to a trivial number like 10, it takes the TPL ~22K ticks to finish, and AMP ~10K ticks. That's what I expect. As I understand it, a GPU (hence AMP) should be better suited to this situation, where the work is broken into very small pieces, than the TPL.
However, if I increase the array size to 1000, the TPL now takes ~30K ticks and AMP takes ~70K ticks. And it just gets worse from there. For an array of size 1 million, AMP takes nearly 1000x as long as the TPL.
Since I expect the GPU (i.e. AMP) to be better at this kind of task, I'm wondering what I'm missing here.
My graphics card is a GeForce 550 Ti with 1 GB, not a slouch as far as I know. I know there's overhead in using P/Invoke to call into the AMP code, but I expect that to be a flat cost that is amortized over larger array sizes. I believe the array is passed by reference (though I could be wrong), so I don't expect any cost associated with copying it.
Thank you to everyone for your advice.
Transferring data back and forth between the GPU and CPU takes time. You are most likely measuring your PCI Express bus bandwidth here. Squaring 1M floats is a piece of cake for a GPU.
It's also not a good idea to use the Stopwatch class to measure performance for AMP, because GPU calls can happen asynchronously. In your case it is OK, but if you measure the compute part only (the parallel_for_each), this won't work. I think you can use D3D11 performance counters for that.
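If you still want a rough wall-clock comparison from F#, one common mitigation is to make one warm-up call first, so the one-time costs (JIT of the P/Invoke stub, GPU initialization) don't land inside the measured region, and then average over many iterations. A sketch, reusing the CInterop.square_array import and arr2 array from the question:

// Call once unmeasured, then average the cost of many measured calls.
let timeAvg name iterations (f : unit -> unit) =
    f () // warm-up: pays JIT and GPU initialization costs up front
    let sw = System.Diagnostics.Stopwatch.StartNew()
    for _ in 1 .. iterations do
        f ()
    sw.Stop()
    printfn "%s: %.3f ms/call" name (float sw.ElapsedMilliseconds / float iterations)

timeAvg "AMP square_array" 100 (fun () -> CInterop.square_array(arr2, arr2.Length))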

300 milliseconds just for a variable declaration in F# Interactive?

I am playing with the F# Interactive console and comparing the runtimes of some numeric operations. In this code, the total runtime seems to double merely by repeating the declaration of one variable.
In VS 2010 I do:
open System.Diagnostics
let mutable a=1
Then I select this below and run it with Alt+Enter
let stopWatch = Stopwatch.StartNew()
for i = 1 to 200100100 do
    a <- a + 1
stopWatch.Stop()
printfn "stopWatch.Elapsed: %f" stopWatch.Elapsed.TotalMilliseconds
It takes more or less 320 ms.
Now I select this and hit Alt+Enter:
let mutable a = 1
let stopWatch = Stopwatch.StartNew()
for i = 1 to 200100100 do
    a <- a + 1
stopWatch.Stop()
printfn "stopWatch.Elapsed: %f" stopWatch.Elapsed.TotalMilliseconds
almost double: 620 ms
The same block, but with the declaration included at the top, takes almost twice as long. Shouldn't it be the same, since I declare the variable before Stopwatch.StartNew()? Does this have to do with the Interactive console?
I have the same results using the #time directive.
I'm not convinced any of the answers so far is quite right. I think a is represented the same way in both cases. And by that I mean an individual type is emitted dynamically, wrapping that mutable value (certainly on the heap!). It needs to do this since, in both cases, a is a top-level binding that can be accessed by subsequent interactions.
So all that said, my theory is this: in the first case, the dynamic type emitted by FSI is loaded at the end of the interaction, in order to output its default value. In the second case, however, the type wrapping a is not loaded until it is first accessed, in the loop, after the Stopwatch was started.
Daniel is right - if you define a variable in a separate FSI interaction, then it is represented differently.
To get a correct comparison, you need to declare the variable as local in both cases. The easiest way to do that is to nest the code under do (which turns everything under do into a local scope) and then evaluate the entire do block at once. See the result of the following two examples:
// A version with variable declaration excluded
do
    let mutable a = 1
    let stopWatch = Stopwatch.StartNew()
    for i = 1 to 200100100 do
        a <- a + 1
    stopWatch.Stop()
    printfn "stopWatch.Elapsed: %f" stopWatch.Elapsed.TotalMilliseconds

// A version with variable declaration included
do
    let stopWatch = Stopwatch.StartNew()
    let mutable a = 1
    for i = 1 to 200100100 do
        a <- a + 1
    stopWatch.Stop()
    printfn "stopWatch.Elapsed: %f" stopWatch.Elapsed.TotalMilliseconds
The difference I get when I run them is not measurable.
