FSharp.Data CsvProvider performance - f#

I have a csv file with 6 columns and 678,552 rows. Unfortunately I cannot share any data sample but the types are straightforward: int64, int64, date, date, string, string and there are no missing values.
Time to load this data in a dataframe in R using read.table: ~ 3 seconds.
Time to load this data using CsvFile.Load in F#: ~ 3 seconds.
Time to load this data in a Deedle dataframe in F#: ~ 7 seconds.
Adding inferTypes=false and providing a schema to Deedle's Frame.ReadCsv reduces the time to ~ 3 seconds.
Time to load this data using CsvProvider in F#: ~ 5 minutes.
And this 5 minutes is even after I define the types in the Schema parameter, presumably eliminating the time F# would use to infer them.
I understand that the type provider needs to do a lot more than R or CsvFile.Load in order to parse the data into the correct data type but I am surprised by the x100 speed penalty. Even more confusing is the time Deedle takes to load the data since it also needs to infer types and cast appropriately, organize in Series, etc. I would actually have expected Deedle to take longer than CsvProvider.
In this issue the bad performance of CsvProvider was caused by a large number of columns which is not my case.
I am wondering if I am doing something wrong or if there is any way to speed things up a bit.
Just to clarify: creating the provider is almost instantaneous. It is when I force the generated sequence to be realized by Seq.length df.Rows that it takes ~ 5 minutes for the fsharpi prompt to return.
I'm on a linux system, F# v4.1 on mono v4.6.1.
Here is the code for the CsvProvider
let [<Literal>] SEP = "|"
let [<Literal>] CULTURE = "sv-SE"
let [<Literal>] DATAFILE = dataroot + "all_diagnoses.csv"
type DiagnosesProvider = CsvProvider<DATAFILE, Separators=SEP, Culture=CULTURE>
let diagnoses = DiagnosesProvider()
EDIT1:
I added the time Deedle takes to load the data into a frame.
EDIT2:
Added the time Deedle takes if inferTypes=false and a schema is provided.
Also, supplying CacheRows=false in the CsvProvider as suggested in the comments has no perceptible effect in the time to load.
EDIT3:
Ok, we are getting somewhere. For some peculiar reason it seems that Culture is the culprit. If I omit this argument, CsvProvider loads the data in ~ 7 seconds. I am unsure what could be causing this. My system's locale is en_US. The data however come from an SQL Server in swedish locale where decimal digits are separated by ',' instead of '.'. This particular dataset does not have any decimals, so I can omit Culture altogether. Another set however has 2 decimal columns and more than 1,000,000 rows. My next task is to test this on a Windows system which I don't have available at the moment.
EDIT4:
Problem seems solved but I still don't understand what causes it. If I change the culture "globally" by doing:
System.Globalization.CultureInfo.DefaultThreadCurrentCulture <- System.Globalization.CultureInfo("sv-SE")
System.Threading.Thread.CurrentThread.CurrentCulture <- System.Globalization.CultureInfo("sv-SE")
and then remove the Culture="sv-SE" argument from the CsvProvider the load time is reduced to ~ 6 seconds and the decimals are parsed correctly. I'm leaving this open in case anyone can give an explanation for this behavior.

I am trying to reproduce the problem you are seeing. Since you can't share the data, I tried generating some test data. However, on my machine (.NET 4.6.2, F# 4.1) it doesn't take minutes; it takes seconds.
Perhaps you can try to see how my sample application performs in your setup and we can work from that?
open System
open System.Diagnostics
open System.IO
let clock =
    let sw = Stopwatch ()
    sw.Start ()
    fun () ->
        sw.ElapsedMilliseconds
let time a =
    let before = clock ()
    let v = a ()
    let after = clock ()
    after - before, v
let generateDataSet () =
    let random = Random 19740531
    let firstDate = DateTime(1970, 1, 1)
    let randomInt () = random.Next () |> int64 |> (+) 10000000000L |> string
    let randomDate () = (firstDate + (random.Next () |> float |> TimeSpan.FromSeconds)).ToString("s")
    let randomString () =
        let inline valid ch =
            match ch with
            | '"'
            | '\\' -> ' '
            | _ -> ch
        let c = random.Next () % 16
        let g i =
            if i = 0 || i = c + 1 then '"'
            else 32 + random.Next() % (127 - 32) |> char |> valid
        Array.init (c + 2) g |> String
    let columns =
        [|
            "Id" , randomInt
            "ForeignId" , randomInt
            "BirthDate" , randomDate
            "OtherDate" , randomDate
            "FirstName" , randomString
            "LastName" , randomString
        |]
    use sw = new StreamWriter ("perf.csv")
    let headers = columns |> Array.map fst |> String.concat ";"
    sw.WriteLine headers
    for i = 0 to 700000 do
        let values = columns |> Array.map (fun (_, f) -> f ()) |> String.concat ";"
        sw.WriteLine values
open FSharp.Data
[<Literal>]
let sample = """Id;ForeignId;BirthDate;OtherDate;FirstName;LastName
11795679844;10287417237;2028-09-14T20:33:17;1993-07-21T17:03:25;",xS# %aY)N*})Z";"ZP~;"
11127366946;11466785219;2028-02-22T08:39:57;2026-01-24T05:07:53;"H-/QA(";"g8}J?k~"
"""
type PerfFile = CsvProvider<sample, ";">
let readDataWithTp () =
    use streamReader = new StreamReader ("perf.csv")
    let csvFile = PerfFile.Load streamReader
    let length = csvFile.Rows |> Seq.length
    printfn "%A" length
[<EntryPoint>]
let main argv =
    Environment.CurrentDirectory <- AppDomain.CurrentDomain.BaseDirectory
    printfn "Generating dataset..."
    let ms, _ = time generateDataSet
    printfn " took %d ms" ms
    printfn "Reading dataset..."
    let ms, _ = time readDataWithTp
    printfn " took %d ms" ms
    0
The performance numbers (.NET462 on my desktop):
Generating dataset...
took 2162 ms
Reading dataset...
took 6156 ms
The performance numbers (Mono 4.6.2 on my Macbook Pro):
Generating dataset...
took 4432 ms
Reading dataset...
took 8304 ms
Update
It turns out that explicitly specifying Culture to CsvProvider degrades performance a lot. It can be any culture, not just sv-SE. But why?
If one checks the code the provider generates for the fast and the slow cases, one notices a difference:
Fast
internal sealed class csvFile@78
{
internal System.Tuple<long, long, System.DateTime, System.DateTime, string, string> Invoke(object arg1, string[] arg2)
{
Microsoft.FSharp.Core.FSharpOption<string> fSharpOption = TextConversions.AsString(arg2[0]);
long arg_C9_0 = TextRuntime.GetNonOptionalValue<long>("Id", TextRuntime.ConvertInteger64("", fSharpOption), fSharpOption);
fSharpOption = TextConversions.AsString(arg2[1]);
long arg_C9_1 = TextRuntime.GetNonOptionalValue<long>("ForeignId", TextRuntime.ConvertInteger64("", fSharpOption), fSharpOption);
fSharpOption = TextConversions.AsString(arg2[2]);
System.DateTime arg_C9_2 = TextRuntime.GetNonOptionalValue<System.DateTime>("BirthDate", TextRuntime.ConvertDateTime("", fSharpOption), fSharpOption);
fSharpOption = TextConversions.AsString(arg2[3]);
System.DateTime arg_C9_3 = TextRuntime.GetNonOptionalValue<System.DateTime>("OtherDate", TextRuntime.ConvertDateTime("", fSharpOption), fSharpOption);
fSharpOption = TextConversions.AsString(arg2[4]);
string arg_C9_4 = TextRuntime.GetNonOptionalValue<string>("FirstName", TextRuntime.ConvertString(fSharpOption), fSharpOption);
fSharpOption = TextConversions.AsString(arg2[5]);
return new System.Tuple<long, long, System.DateTime, System.DateTime, string, string>(arg_C9_0, arg_C9_1, arg_C9_2, arg_C9_3, arg_C9_4, TextRuntime.GetNonOptionalValue<string>("LastName", TextRuntime.ConvertString(fSharpOption), fSharpOption));
}
}
Slow
internal sealed class csvFile@78
{
internal System.Tuple<long, long, System.DateTime, System.DateTime, string, string> Invoke(object arg1, string[] arg2)
{
Microsoft.FSharp.Core.FSharpOption<string> fSharpOption = TextConversions.AsString(arg2[0]);
long arg_C9_0 = TextRuntime.GetNonOptionalValue<long>("Id", TextRuntime.ConvertInteger64("sv-SE", fSharpOption), fSharpOption);
fSharpOption = TextConversions.AsString(arg2[1]);
long arg_C9_1 = TextRuntime.GetNonOptionalValue<long>("ForeignId", TextRuntime.ConvertInteger64("sv-SE", fSharpOption), fSharpOption);
fSharpOption = TextConversions.AsString(arg2[2]);
System.DateTime arg_C9_2 = TextRuntime.GetNonOptionalValue<System.DateTime>("BirthDate", TextRuntime.ConvertDateTime("sv-SE", fSharpOption), fSharpOption);
fSharpOption = TextConversions.AsString(arg2[3]);
System.DateTime arg_C9_3 = TextRuntime.GetNonOptionalValue<System.DateTime>("OtherDate", TextRuntime.ConvertDateTime("sv-SE", fSharpOption), fSharpOption);
fSharpOption = TextConversions.AsString(arg2[4]);
string arg_C9_4 = TextRuntime.GetNonOptionalValue<string>("FirstName", TextRuntime.ConvertString(fSharpOption), fSharpOption);
fSharpOption = TextConversions.AsString(arg2[5]);
return new System.Tuple<long, long, System.DateTime, System.DateTime, string, string>(arg_C9_0, arg_C9_1, arg_C9_2, arg_C9_3, arg_C9_4, TextRuntime.GetNonOptionalValue<string>("LastName", TextRuntime.ConvertString(fSharpOption), fSharpOption));
}
}
More specifically, this is the difference:
// Fast
TextRuntime.ConvertDateTime("", fSharpOption), fSharpOption)
// Slow
TextRuntime.ConvertDateTime("sv-SE", fSharpOption), fSharpOption)
When we specify a culture, it is passed to ConvertDateTime, which forwards it to GetCulture:
static member GetCulture(cultureStr) =
    if String.IsNullOrWhiteSpace cultureStr
    then CultureInfo.InvariantCulture
    else CultureInfo cultureStr
This means that for the default case we use the CultureInfo.InvariantCulture but for any other case for each field and row we are creating a CultureInfo object. Caching could be done but it's not. The creation process itself doesn't seem to take too much time but something happens when we are parsing with a new CultureInfo object each time.
Parsing DateTime in FSharp.Data essentially is this
let dateTimeStyles = DateTimeStyles.AllowWhiteSpaces ||| DateTimeStyles.RoundtripKind
match DateTime.TryParse(text, cultureInfo, dateTimeStyles) with
So let's make a performance test where we use a cached CultureInfo object and another one where we create one each time.
open System
open System.Diagnostics
open System.Globalization
let clock =
    let sw = Stopwatch ()
    sw.Start ()
    fun () ->
        sw.ElapsedMilliseconds
let time a =
    let before = clock ()
    let v = a ()
    let after = clock ()
    after - before, v
let perfTest c cf () =
    let dateTimeStyles = DateTimeStyles.AllowWhiteSpaces ||| DateTimeStyles.RoundtripKind
    let text = DateTime.Now.ToString ("", cf ())
    for i = 1 to c do
        let culture = cf ()
        DateTime.TryParse(text, culture, dateTimeStyles) |> ignore
[<EntryPoint>]
let main argv =
    Environment.CurrentDirectory <- AppDomain.CurrentDomain.BaseDirectory
    let ct = "sv-SE"
    let cct = CultureInfo ct
    let count = 10000
    printfn "Using cached CultureInfo object..."
    let ms, _ = time (perfTest count (fun () -> cct))
    printfn " took %d ms" ms
    printfn "Using fresh CultureInfo object..."
    let ms, _ = time (perfTest count (fun () -> CultureInfo ct))
    printfn " took %d ms" ms
    0
Performance numbers on .NET 4.6.2 F#4.1:
Using cached CultureInfo object...
took 16 ms
Using fresh CultureInfo object...
took 5328 ms
So it seems caching the CultureInfo object in FSharp.Data should improve CsvProvider performance significantly when culture is specified.
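The caching idea can be sketched as follows. This is only an illustration of memoizing the lookup, not the code FSharp.Data actually ships; the names cultureCache and getCultureCached are made up for this example.

```fsharp
open System
open System.Collections.Concurrent
open System.Globalization

// Hypothetical cache: one CultureInfo instance per culture name.
let cultureCache = ConcurrentDictionary<string, CultureInfo>()

// Hypothetical replacement for GetCulture that reuses CultureInfo objects
// instead of constructing a new one for every field of every row.
let getCultureCached (cultureStr : string) : CultureInfo =
    if String.IsNullOrWhiteSpace cultureStr then
        CultureInfo.InvariantCulture
    else
        cultureCache.GetOrAdd(cultureStr, Func<string, CultureInfo>(fun name -> CultureInfo name))
```

With this shape, repeated calls with the same culture name return the same object, which is the situation the "cached" measurement above corresponds to.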

The problem was caused by CsvProvider not memoizing the explicitly set Culture. The problem was addressed by this pull request.

Related

F#: How to enumerate through multiple files correctly?

I have a bunch of files several MiB in size which are very simple:
They have a size of multiples of 8
They only contain doubles in little endian, so can be read with BinaryReader's ReadDouble() method
When lexicographically sorted, they contain all values in the sequence they need to be.
I can't keep everything in memory as a float list or float array so I need a float seq that goes through the necessary files when actually being accessed. The portion that goes through the sequence actually does it in imperative style using GetEnumerator() because I don't want any resource leaks and want to close all files correctly.
My first functional approach was:
let readFile file =
    let rec readReader (maybeReader : BinaryReader option) =
        match maybeReader with
        | None ->
            let openFile() =
                printfn "Opening the file"
                new BinaryReader(new FileStream(file, FileMode.Open, FileAccess.Read, FileShare.Read))
                |> Some
                |> readReader
            seq { yield! openFile() }
        | Some reader when reader.BaseStream.Position >= reader.BaseStream.Length ->
            printfn "Closing the file"
            reader.Dispose()
            Seq.empty
        | Some reader ->
            reader.BaseStream.Position |> printfn "Reading from position %d"
            let bytesToRead = Math.Min(1048576L, reader.BaseStream.Length - reader.BaseStream.Position) |> int
            let bytes = reader.ReadBytes bytesToRead
            let doubles = Array.zeroCreate<float> (bytesToRead / 8)
            Buffer.BlockCopy(bytes, 0, doubles, 0, bytesToRead)
            seq {
                yield! doubles
                yield! readReader maybeReader
            }
    readReader None
And then, when I have a string list containing all the files, I can say something like:
let values = files |> Seq.collect readFile
use ve = values.GetEnumerator()
// Do stuff that only gets partial data from one file
However, this only closes the files when the reader reaches its end (which is clear when looking at the function). So as a second approach I implemented the file enumerating imperatively:
type FileEnumerator(file : string) =
    let reader = new BinaryReader(new FileStream(file, FileMode.Open, FileAccess.Read, FileShare.Read))
    let mutable _current : float = Double.NaN
    do file |> printfn "Enumerator active for %s"
    interface IDisposable with
        member this.Dispose() =
            reader.Dispose()
            file |> printfn "Enumerator disposed for %s"
    interface IEnumerator with
        member this.Current = _current :> obj
        member this.Reset() = reader.BaseStream.Position <- 0L
        member this.MoveNext() =
            let stream = reader.BaseStream
            if stream.Position >= stream.Length then false
            else
                _current <- reader.ReadDouble()
                true
    interface IEnumerator<float> with
        member this.Current = _current
type FileEnumerable(file : string) =
    interface IEnumerable with
        member this.GetEnumerator() = new FileEnumerator(file) :> IEnumerator
    interface IEnumerable<float> with
        member this.GetEnumerator() = new FileEnumerator(file) :> IEnumerator<float>
let readFile' file = new FileEnumerable(file) :> float seq
Now, when I say
let values = files |> Seq.collect readFile'
use ve = values.GetEnumerator()
// do stuff with the enumerator
disposing the enumerator correctly bubbles through to my imperative enumerator.
While this is a feasible solution for what I want to achieve (I could make it faster by reading it blockwise like the first functional approach but for brevity I didn't do it here) I wonder if there is a truly functional approach for this avoiding the mutable state in the enumerator.
I don't quite get what you mean when you say that using GetEnumerator() will prevent resource leaks and allow you to close all files correctly. Below is my attempt at this (ignoring the block copy part for demonstration purposes); I think it results in the files being properly closed.
let eof (br : BinaryReader) =
    br.BaseStream.Position = br.BaseStream.Length
let readFileAsFloats filePath =
    seq {
        use file = new FileStream(filePath, FileMode.Open, FileAccess.Read, FileShare.Read)
        use reader = new BinaryReader(file)
        while not (eof reader) do
            yield reader.ReadDouble()
    }
let readFilesAsFloats filePaths =
    filePaths |> Seq.collect readFileAsFloats
let floats = readFilesAsFloats ["D:\\floatFile1.txt"; "D:\\floatFile2.txt"]
Is that what you had in mind?
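To check that a use binding inside a sequence expression is released when the enumerator is disposed early, a small sketch without any files may help. The tracking object below is a stand-in for the FileStream/BinaryReader pair; the names disposed, tracked and readFirst are invented for this illustration.

```fsharp
open System

// Flag set by the stand-in resource when it is disposed (illustrative only).
let disposed = ref false

let tracked =
    seq {
        // Stand-in for the stream and reader opened with `use` above.
        use _resource =
            { new IDisposable with
                member this.Dispose() = disposed.Value <- true }
        yield 1.0
        yield 2.0
        yield 3.0
    }

// Pull a single element, then dispose the enumerator before reaching the end.
let readFirst () =
    let e = tracked.GetEnumerator()
    let moved = e.MoveNext()
    let first = e.Current
    e.Dispose()
    moved, first

let result = readFirst ()
```

After readFirst returns, disposed.Value is true even though only one of the three elements was consumed, so partial enumeration followed by Dispose should not leak the resource.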

How can I pass a parameter to Sql.execReaderF in FsSql?

I am trying out the samples for FsSql and I seem to be stuck on how to properly use the Sql.execReaderF function. The example code uses an int parameter but I have a string. The following code blocks show my attempts. Does FsSql only support int for this function maybe?
Setup code:
module FsSqlTests
open System
open System.Data
open System.Data.SqlClient
open NUnit.Framework
open Swensen.Unquote
let openConn() =
    let conn = new SqlConnection(@"Data Source=MYSERVER;Initial Catalog=MYDB;Integrated Security=True")
    conn.Open()
    conn :> IDbConnection
let connMgr = Sql.withNewConnection openConn
let P = Sql.Parameter.make
let execReader sql = Sql.execReader connMgr sql
let execReaderf sql = Sql.execReaderF connMgr sql
Using Sql.execReader (Test case passes using this one)
let selectSummaryByeFolderName eFolderName =
    execReader "select summary from ework.V_DQ_Iccm_Activity_By_Team WHERE efoldername = @eFolderName"
        [P("@eFolderName", eFolderName)]
Using Sql.execReaderF (Test case fails using this one)
let selectSummaryByeFolderName =
    execReaderf "select summary from ework.V_DQ_Iccm_Activity_By_Team WHERE efoldername = '%s'"
Calling code in the test case:
[<TestCase>]
let ``Gets CM summary given eFolderName``() =
    let c = selectSummaryByeFolderName "CM008671"
    let r =
        c
        |> Seq.ofDataReader
        |> Seq.map(fun dr ->
            let s =
                match dr?summary with
                | None -> "No Summary"
                | Some x -> x
            s)
        |> Seq.length
    test <@ r > 0 @>
How can I modify my call to execReaderF to make it pass the parameter and run correctly?
UPDATE:
I tried it out with an integer parameter and it works fine. It seems the function may only support integers.
let selectSummaryByCallPriority =
    execReaderf "select top 10 summary from ework.V_DQ_Iccm_Activity_By_Team WHERE callpriority = %d"
I had a look at the implementation to try and verify this but it's over my head. Anyway the Sql.execReader function works fine for other datatypes so I can just switch to that function for my string parameters.

F# Read Fixed Width Text File

Hi I'm looking to find the best way to read in a fixed width text file using F#. The file will be plain text, from one to a couple of thousand lines long and around 1000 characters wide. Each line contains around 50 fields, each with varying lengths. My initial thoughts were to have something like the following
type MyRecord = {
    Name : string
    Address : string
    Postcode : string
    Tel : string
}
let format = [
    (0,10)
    (10,40)
    (50,7)
    (57,20)
]
and read each line one by one, assigning each field by the format tuple (where the first item is the start character and the second is the number of characters wide).
Any pointers would be appreciated.
The hardest part is probably to split a single line according to the column format. It can be done something like this:
let splitLine format (line : string) =
    format |> List.map (fun (index, length) -> line.Substring(index, length))
This function has the type (int * int) list -> string -> string list. In other words, format is an (int * int) list. This corresponds exactly to your format list. The line argument is a string, and the function returns a string list.
You can map a list of lines like this:
let result = lines |> List.map (splitLine format)
You can also use Seq.map or Array.map, depending on how lines is defined. Such a result will be a string list list, and you can now map over such a list to produce a MyRecord list.
You can use File.ReadLines to get a lazily evaluated sequence of strings from a file.
Please note that the above is only an outline of a possible solution. I left out boundary checks, error handling, and such. The above code may contain off-by-one errors.
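One way to add the missing boundary checks and the final mapping step is sketched below. It is a minimal illustration, not a complete solution: trySplitLine, Contact, toContact and contactFormat are hypothetical names, and a two-field record is used to keep the example short.

```fsharp
// Hypothetical variant of splitLine that returns None when the line
// is too short for any of the requested fields.
let trySplitLine (format : (int * int) list) (line : string) =
    let fields =
        format
        |> List.map (fun (index, length) ->
            if index >= 0 && length >= 0 && index + length <= line.Length
            then Some (line.Substring(index, length).Trim())
            else None)
    if List.forall Option.isSome fields
    then Some (List.choose id fields)
    else None

// Small illustrative record (the real one would have around 50 fields).
type Contact = { Name : string; Postcode : string }

// Map the split fields onto the record; None if the field count is wrong.
let toContact fields =
    match fields with
    | [ name; postcode ] -> Some { Name = name; Postcode = postcode }
    | _ -> None

let contactFormat = [ (0, 10); (10, 7) ]

// Build a padded sample line: a 10-character name column, then a 7-character postcode.
let sampleLine = "Pat".PadRight(10) + "EC1A1BB"

let parsed = trySplitLine contactFormat sampleLine |> Option.bind toContact
```

Returning an option keeps malformed lines from throwing; a caller can use Seq.choose over File.ReadLines to keep only the lines that parse.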
Here's a solution with a focus on custom validation and error handling for each field. This might be overkill for a data file consisting of just numeric data!
First, for these kinds of things, I like to use the parser in Microsoft.VisualBasic.dll as it's already available without using NuGet.
For each row, we can return the array of fields, and the line number (for error reporting)
#r "Microsoft.VisualBasic.dll"
// for each row, return the line number and the fields
let parserReadAllFields fieldWidths textReader =
    let parser = new Microsoft.VisualBasic.FileIO.TextFieldParser(reader=textReader)
    parser.SetFieldWidths fieldWidths
    parser.TextFieldType <- Microsoft.VisualBasic.FileIO.FieldType.FixedWidth
    seq {
        while not parser.EndOfData do
            yield parser.LineNumber, parser.ReadFields()
    }
Next, we need a little error handling library (see http://fsharpforfunandprofit.com/rop/ for more)
type Result<'a> =
    | Success of 'a
    | Failure of string list
module Result =
    let succeedR x =
        Success x
    let failR err =
        Failure [err]
    let mapR f xR =
        match xR with
        | Success a -> Success (f a)
        | Failure errs -> Failure errs
    let applyR fR xR =
        match fR,xR with
        | Success f,Success x -> Success (f x)
        | Failure errs,Success _ -> Failure errs
        | Success _,Failure errs -> Failure errs
        | Failure errs1, Failure errs2 -> Failure (errs1 @ errs2)
Then define your domain model. In this case, it is the record type with a field for each field in the file.
type MyRecord =
    {id:int; name:string; description:string}
And then you can define your domain-specific parsing code. For each field I have created a validation function (validateId, validateName, etc).
Fields that don't need validation can pass through the raw data (validateDescription).
In fieldsToRecord the various fields are combined using applicative style (<!> and <*>).
For more on this, see http://fsharpforfunandprofit.com/posts/elevated-world-3/#validation.
Finally, readRecords maps each input row to a record Result and keeps only the successful ones. The failed ones are written to a log in handleResult.
module MyFileParser =
    open Result
    let createRecord id name description =
        {id=id; name=name; description=description}
    let validateId (lineNo:int64) (fields:string[]) =
        let rawId = fields.[0]
        match System.Int32.TryParse(rawId) with
        | true, id -> succeedR id
        | false, _ -> failR (sprintf "[%i] Can't parse id '%s'" lineNo rawId)
    let validateName (lineNo:int64) (fields:string[]) =
        let rawName = fields.[1]
        if System.String.IsNullOrWhiteSpace rawName then
            failR (sprintf "[%i] Name cannot be blank" lineNo )
        else
            succeedR rawName
    let validateDescription (lineNo:int64) (fields:string[]) =
        let rawDescription = fields.[2]
        succeedR rawDescription // no validation
    let fieldsToRecord (lineNo,fields) =
        let (<!>) = mapR
        let (<*>) = applyR
        let validatedId = validateId lineNo fields
        let validatedName = validateName lineNo fields
        let validatedDescription = validateDescription lineNo fields
        createRecord <!> validatedId <*> validatedName <*> validatedDescription
    /// print any errors and only return good results
    let handleResult result =
        match result with
        | Success record -> Some record
        | Failure errs -> printfn "ERRORS %A" errs; None
    /// return a sequence of records
    let readRecords parserOutput =
        parserOutput
        |> Seq.map fieldsToRecord
        |> Seq.choose handleResult
Here's an example of the parsing in practice:
// Set up some sample text
let text = """01name1description1
02name2description2
xxname3badid-------
yy badidandname
"""
// create a low-level parser
let textReader = new System.IO.StringReader(text)
let fieldWidths = [| 2; 5; 11 |]
let parserOutput = parserReadAllFields fieldWidths textReader
// convert to records in my domain
let records =
    parserOutput
    |> MyFileParser.readRecords
    |> Seq.iter (printfn "RECORD %A") // print each record
The output will look like:
RECORD {id = 1;
name = "name1";
description = "description";}
RECORD {id = 2;
name = "name2";
description = "description";}
ERRORS ["[3] Can't parse id 'xx'"]
ERRORS ["[4] Can't parse id 'yy'"; "[4] Name cannot be blank"]
By no means is this the most efficient way to parse a file (I think there are some CSV parsing libraries available on NuGet that can do validation while parsing) but it does show how you can have complete control over validation and error handling if you need it.
A record of 50 fields is a bit unwieldy, therefore alternate approaches which allow dynamic generation of the data structure may be preferable (eg. System.Data.DataRow).
If it has to be a record anyway, you could spare at least the manual assignment to each record field and populate it with the help of Reflection instead. This trick relies on the field order as they are defined. I am assuming that every column of fixed width represents a record field, so that start indices are implied.
open Microsoft.FSharp.Reflection
type MyRecord = {
    Name : string
    Address : string
    City : string
    Postcode : string
    Tel : string } with
    static member CreateFromFixedWidth format (line : string) =
        let fields =
            format
            |> List.fold (fun (index, acc) length ->
                let str = line.[index .. index + length - 1].Trim()
                index + length, box str :: acc )
                (0, [])
            |> snd
            |> List.rev
            |> List.toArray
        FSharpValue.MakeRecord(
            typeof<MyRecord>,
            fields ) :?> MyRecord
Example data:
"Postman Pat     " +
"Farringdon Road " +
"London          " +
"EC1A 1BB" +
"+44 20 7946 0813"
|> MyRecord.CreateFromFixedWidth [16; 16; 16; 8; 16]
// val it : MyRecord = {Name = "Postman Pat";
// Address = "Farringdon Road";
// City = "London";
// Postcode = "EC1A 1BB";
// Tel = "+44 20 7946 0813";}

How do I make this function correctly asynchronous?

I'm trying to get F# async working, and I just can't figure out what I'm doing wrong. Here's my sorta synchronous code that runs:
open System.Net
open System.Runtime.Serialization
open System.Threading.Tasks
[<DataContract>]
type Person = {
    [<field: DataMember(Name = "name")>]
    Name : string
    [<field: DataMember(Name = "phone")>]
    Phone : int
}
let url = "http://localhost:5000/app/plugins/anon/CCure"
let js = Json.DataContractJsonSerializer(typeof<Person>)
let main x =
    let client = new WebClient()
    let url = url + "/" + x
    let reader = client.OpenRead(url)
    let person = js.ReadObject(reader) :?> Person
    printfn "Name: %s, Phone number: %d" person.Name person.Phone
printfn "starting x"
let x = Task.Factory.StartNew(fun () -> main "x")
printfn "starting y"
let y = Task.Factory.StartNew(fun () -> main "y")
Task.WaitAll(x, y)
I was thinking that this would work to run it asynchronously, but it doesn't:
open System.Net
open System.Runtime.Serialization
open System.Threading.Tasks
[<DataContract>]
type Person = {
    [<field: DataMember(Name = "name")>]
    Name : string
    [<field: DataMember(Name = "phone")>]
    Phone : int
}
let url = "http://localhost:5000/app/plugins/anon/CCure"
let js = Json.DataContractJsonSerializer(typeof<Person>)
let main x = async {
    let client = new WebClient()
    let url = url + "/" + x
    let! reader = client.OpenReadAsync(url)
    let person = js.ReadObject(reader) :?> Person
    printfn "Name: %s, Phone number: %d" person.Name person.Phone }
printfn "starting x"
let x = Task.Factory.StartNew(fun () -> main "x")
printfn "starting y"
let y = Task.Factory.StartNew(fun () -> main "y")
Task.WaitAll(x, y)
$ fsharpc -r System.Runtime.Serialization foo.fs && ./foo.exe
F# Compiler for F# 3.1 (Open Source Edition)
Freely distributed under the Apache 2.0 Open Source License
/home/frew/code/foo.fs(19,18): error FS0001: This expression was
expected to have type
Async<'a> but here has type
unit
/home/frew/code/foo.fs(20,17): error FS0041: A unique overload for
method 'ReadObject' could not be determined based on type information
prior to this program point. A type annotation may be needed.
Candidates: XmlObjectSerializer.ReadObject(reader:
System.Xml.XmlDictionaryReader) : obj,
XmlObjectSerializer.ReadObject(reader: System.Xml.XmlReader) : obj,
XmlObjectSerializer.ReadObject(stream: System.IO.Stream) : obj
/home/frew/code/foo.fs(20,17): error FS0008: This runtime coercion or
type test from type
'a to
Person involves an indeterminate type based on information prior to this program point. Runtime type tests are not allowed on
some types. Further type annotations are needed.
What am I missing here?
OpenReadAsync is part of the .NET BCL and therefore wasn't designed with F# async in mind. You'll notice it returns unit, rather than Async<Stream>, so it won't work with let!.
The API is designed to be used with events (i.e. you have to wire up client.OpenReadCompleted).
You have a couple of options here.
There are some nice helper methods in FSharp.Core that can help you to convert the API into a more F#-friendly one (see Async.AwaitEvent).
Use AsyncDownloadString, an extension method for WebClient that can be found in Microsoft.FSharp.Control.WebExtensions. This is easier, so I've done it below, although it does mean holding the whole stream in memory as a string, so if you have a huge amount of JSON this may not be the best idea.
It's also more idiomatic F# to use async instead of tasks for running things in parallel.
open System.Net
open System.Runtime.Serialization
open System.Threading.Tasks
open Microsoft.FSharp.Control.WebExtensions
open System.Runtime.Serialization.Json
[<DataContract>]
type Person = {
    [<field: DataMember(Name = "name")>]
    Name : string
    [<field: DataMember(Name = "phone")>]
    Phone : int
}
let url = "http://localhost:5000/app/plugins/anon/CCure"
let js = Json.DataContractJsonSerializer(typeof<Person>)
let main x = async {
    printfn "Starting %s" x
    let client = new WebClient()
    let url = url + "/" + x
    let! json = client.AsyncDownloadString(System.Uri(url))
    let bytes = System.Text.Encoding.UTF8.GetBytes(json)
    let st = new System.IO.MemoryStream(bytes)
    let person = js.ReadObject(st) :?> Person
    printfn "Name: %s, Phone number: %d" person.Name person.Phone }
let x = main "x"
let y = main "y"
[x;y] |> Async.Parallel |> Async.RunSynchronously |> ignore<unit[]>
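If keeping the response as a stream matters, a third route (not one of the two options listed above) might be to go through the task-based WebClient.OpenReadTaskAsync and bridge it with Async.AwaitTask. The sketch below assumes .NET 4.5 or later, where that method is available; openReadAsync and runAll are hypothetical helper names.

```fsharp
open System.IO
open System.Net

// Assumption: the task-based OpenReadTaskAsync is an acceptable substitute
// for the event-based OpenReadAsync. Async.AwaitTask turns the resulting
// Task<Stream> into an Async<Stream> that works with let!.
let openReadAsync (client : WebClient) (url : string) : Async<Stream> =
    client.OpenReadTaskAsync(url) |> Async.AwaitTask

// Run independent workflows in parallel and collect the results
// in the same order as the input list.
let runAll (jobs : Async<'a> list) : 'a [] =
    jobs |> Async.Parallel |> Async.RunSynchronously
```

Inside main, `let! reader = openReadAsync client url` would then give a Stream that can be passed straight to js.ReadObject.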

How to make immutable F# more performant?

I'm wanting to write a big chunk of C# code using immutable F#. It's a device monitor, and the current implementation works by constantly getting data from a serial port and updating member variables based on new data. I'd like to transfer that to F# and get the benefits of immutable records, but my first shot at a proof-of-concept implementation is really slow.
open System
open System.Diagnostics
type DeviceStatus = { RPM : int;
                      Pressure : int;
                      Temperature : int }
// I'm assuming my actual implementation, using serial data, would be something like
// "let rec UpdateStatusWithSerialReadings (status:DeviceStatus) (serialInput:string[])".
// where serialInput is whatever the device streamed out since the previous check: something like
// ["RPM=90","Pres=50","Temp=85","RPM=40","Pres=23", etc.]
// The device streams out different parameters at different intervals, so I can't just wait for them all to arrive and aggregate them all at once.
// I'm just doing a POC here, so want to eliminate noise from parsing etc.
// So this just updates the status's RPM i times and returns the result.
let rec UpdateStatusITimes (status:DeviceStatus) (i:int) =
    match i with
    | 0 -> status
    | _ -> UpdateStatusITimes {status with RPM = 90} (i - 1)
let initStatus = { RPM = 80 ; Pressure = 100 ; Temperature = 70 }
let stopwatch = new Stopwatch()
stopwatch.Start()
let endStatus = UpdateStatusITimes initStatus 100000000
stopwatch.Stop()
printfn "endStatus.RPM = %A" endStatus.RPM
printfn "stopwatch.ElapsedMilliseconds = %A" stopwatch.ElapsedMilliseconds
Console.ReadLine() |> ignore
This runs in about 1400 ms on my machine, whereas the equivalent C# code (with mutable member variables) runs in around 310 ms. Is there any way to speed this up without losing the immutability? I was hoping that the F# compiler would notice that initStatus and all the intermediate status variables were never reused, and thus simply mutate those records behind the scene, but I guess not.
In the F# community, imperative code and mutable data aren't frowned upon as long as they're not part of your public interface. I.e., using mutable data is fine as long as you encapsulate it and isolate it from the rest of your code. To that end, I suggest something like:
type DeviceStatus =
    { RPM : int
      Pressure : int
      Temperature : int }
// one of the rare scenarios in which I prefer explicit classes,
// to avoid writing out all the get/set properties for each field
[<Sealed>]
type private DeviceStatusFacade =
    val mutable RPM : int
    val mutable Pressure : int
    val mutable Temperature : int
    new(s) =
        { RPM = s.RPM; Pressure = s.Pressure; Temperature = s.Temperature }
    member x.ToDeviceStatus () =
        { RPM = x.RPM; Pressure = x.Pressure; Temperature = x.Temperature }
let UpdateStatusITimes status i =
    let facade = DeviceStatusFacade(status)
    let rec impl i =
        if i > 0 then
            facade.RPM <- 90
            impl (i - 1)
    impl i
    facade.ToDeviceStatus ()
let initStatus = { RPM = 80; Pressure = 100; Temperature = 70 }
let stopwatch = System.Diagnostics.Stopwatch.StartNew ()
let endStatus = UpdateStatusITimes initStatus 100000000
stopwatch.Stop ()
printfn "endStatus.RPM = %d" endStatus.RPM
printfn "stopwatch.ElapsedMilliseconds = %d" stopwatch.ElapsedMilliseconds
stdin.ReadLine () |> ignore
This way, the public interface is unaffected – UpdateStatusITimes still takes and returns an intrinsically immutable DeviceStatus – but internally UpdateStatusITimes uses a mutable class to eliminate allocation overhead.
EDIT: (In response to comment) This is the style of class I would normally prefer, using a primary constructor and lets + properties rather than vals:
[<Sealed>]
type private DeviceStatusFacade(status) =
    let mutable rpm = status.RPM
    let mutable pressure = status.Pressure
    let mutable temp = status.Temperature
    member x.RPM with get () = rpm and set n = rpm <- n
    member x.Pressure with get () = pressure and set n = pressure <- n
    member x.Temperature with get () = temp and set n = temp <- n
    member x.ToDeviceStatus () =
        { RPM = rpm; Pressure = pressure; Temperature = temp }
But for simple facade classes where each property will be a blind getter/setter, I find this a bit tedious.
F# 3+ allows for the following instead, but I still don't find it to be an improvement, personally (unless one dogmatically avoids fields):
[<Sealed>]
type private DeviceStatusFacade(status) =
    member val RPM = status.RPM with get, set
    member val Pressure = status.Pressure with get, set
    member val Temperature = status.Temperature with get, set
    member x.ToDeviceStatus () =
        { RPM = x.RPM; Pressure = x.Pressure; Temperature = x.Temperature }
This won't answer your question, but it's probably worth stepping back and considering the big picture:
What do you perceive as the advantage of immutable data structures for this use case? F# supports mutable data structures, too.
You claim that the F# is "really slow" - but it's only 4.5 times slower than the C# code, and is making more than 70 million updates per second... Is this likely to be unacceptable performance for your actual application? Do you have a specific performance target in mind? Is there reason to believe that this type of code will be the bottleneck in your application?
Design is always about tradeoffs. You may find that for recording many changes in a short period of time, immutable data structures have an unacceptable performance penalty given your needs. On the other hand, if you have requirements such as keeping track of multiple older versions of a data structure at once, then the benefits of immutable data structures may make them attractive despite the performance penalty.
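To make the "multiple older versions" point concrete, here is a minimal sketch. It reuses the record shape from the question; the names v1, v2 and history are only illustrative.

```fsharp
type DeviceStatus =
    { RPM : int
      Pressure : int
      Temperature : int }

let v1 = { RPM = 80; Pressure = 100; Temperature = 70 }

// Copy-and-update allocates a new record; v1 itself is left untouched.
let v2 = { v1 with RPM = 90 }

// Keeping history is just a matter of holding on to the old values.
let history = [ v2; v1 ]
printfn "%A" (history |> List.map (fun s -> s.RPM))
```

With a mutable class, the same history would need an explicit copy at every step, which is the allocation cost the immutable version already pays.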
I suspect the performance problem you are seeing is due to the block memory zeroing involved when cloning the record (plus a negligible time for allocating it and subsequently garbage collecting it) in every iteration of the loop. You could rewrite your example using a struct:
[<Struct>]
type DeviceStatus =
    val RPM : int
    val Pressure : int
    val Temperature : int
    new(rpm:int, pres:int, temp:int) = { RPM = rpm; Pressure = pres; Temperature = temp }
let rec UpdateStatusITimes (status:DeviceStatus) (i:int) =
    match i with
    | 0 -> status
    | _ -> UpdateStatusITimes (DeviceStatus(90, status.Pressure, status.Temperature)) (i - 1)
let initStatus = DeviceStatus(80, 100, 70)
The performance will now be close to that of using global mutable variables, or of redefining UpdateStatusITimes status i as UpdateStatusITimes rpm pres temp i. This only works if your struct is no more than 16 bytes long; otherwise it will be copied in the same sluggish manner as the record.
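As a rough sketch of the "separate arguments" alternative mentioned above: the names DeviceStatusStruct and updateFieldsITimes are hypothetical, and the [<Struct>] attribute on a record assumes F# 4.1 or later. No timings are claimed for it here.

```fsharp
// Assumes F# 4.1+ for [<Struct>] on a record type.
[<Struct>]
type DeviceStatusStruct =
    { RPM : int
      Pressure : int
      Temperature : int }

// Hypothetical helper: the three fields travel as plain arguments,
// and a status value is only built once, when the recursion ends.
let rec updateFieldsITimes rpm pressure temp i =
    match i with
    | 0 -> { RPM = rpm; Pressure = pressure; Temperature = temp }
    | _ -> updateFieldsITimes 90 pressure temp (i - 1)

let endStatus = updateFieldsITimes 80 100 70 1000
printfn "endStatus.RPM = %d" endStatus.RPM
```

The recursive call is in tail position, so it should compile down to a loop in the same way as the original.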
If, as you've hinted at in your comments, you intend to use this as part of a shared-memory multi-threaded design then you will need mutability at some point. Your choices are a) a shared mutable variable for each parameter b) one shared mutable variable containing a struct or c) a shared facade object containing mutable fields (like in ildjarn's answer). I would go for the last one since it is nicely encapsulated and scales beyond four int fields.
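A minimal sketch of what option (c) might look like once several threads are involved. SharedDeviceStatus, SetRPM and Snapshot are hypothetical names, and the plain lock is an assumption on my part rather than something the question requires; it is only one of several ways to guard the fields.

```fsharp
type DeviceStatus =
    { RPM : int
      Pressure : int
      Temperature : int }

// Hypothetical thread-safe variant of the facade idea: mutable fields
// guarded by a lock, with immutable snapshots handed out to callers.
[<Sealed>]
type SharedDeviceStatus(initial : DeviceStatus) =
    let gate = obj ()
    let mutable rpm = initial.RPM
    let mutable pressure = initial.Pressure
    let mutable temp = initial.Temperature
    member this.SetRPM (value : int) =
        lock gate (fun () -> rpm <- value)
    member this.Snapshot () =
        lock gate (fun () ->
            { RPM = rpm; Pressure = pressure; Temperature = temp })

let shared = SharedDeviceStatus({ RPM = 80; Pressure = 100; Temperature = 70 })

// Four concurrent writers, then one consistent read.
[ 1 .. 4 ]
|> List.map (fun _ -> async { shared.SetRPM 90 })
|> Async.Parallel
|> Async.RunSynchronously
|> ignore

printfn "RPM = %d" (shared.Snapshot ()).RPM
```

Callers only ever see immutable DeviceStatus snapshots, so the mutation stays encapsulated in one place.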
Using a tuple as follows is 15× faster than your original solution:
type DeviceStatus = int * int * int
let rec UpdateStatusITimes (rpm, pressure, temp) (i:int) =
    match i with
    | 0 -> rpm, pressure, temp
    | _ -> UpdateStatusITimes (90,pressure,temp) (i - 1)
while true do
    let initStatus = 80, 100, 70
    let stopwatch = new Stopwatch()
    stopwatch.Start()
    let rpm,_,_ as endStatus = UpdateStatusITimes initStatus 100000000
    stopwatch.Stop()
    printfn "endStatus.RPM = %A" rpm
    printfn "Took %fs" stopwatch.Elapsed.TotalSeconds
BTW, you should use stopwatch.Elapsed.TotalSeconds when timing.
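One way to package that advice is a small timing helper. This is only a sketch; the name time is hypothetical. Elapsed.TotalSeconds is a float with a fractional part, whereas ElapsedMilliseconds is a whole number of milliseconds.

```fsharp
open System.Diagnostics

// Hypothetical helper: runs f once and returns its result together with
// the elapsed wall-clock time in seconds.
let time (f : unit -> 'a) =
    let stopwatch = Stopwatch.StartNew ()
    let result = f ()
    stopwatch.Stop ()
    result, stopwatch.Elapsed.TotalSeconds

let total, seconds = time (fun () -> Seq.sum (seq { 1 .. 1000 }))
printfn "total = %d, took %fs" total seconds
```

For a benchmark like the one in the question, you would pass the call to UpdateStatusITimes as the function and repeat the measurement a few times, as the while loop above does.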

Resources