Speed up FsCheck Arbitrary generation - F#

I'm writing some generators and an Arbitrary, but it is too slow (see the GC numbers below). I think I have an error in my code, but I can't figure out where. Or is my approach (map2 over a fold) just "weird"?
Generators:
type Generators () =
    static let notAllowed = Array.append [| '0'..'9' |] [| '\n'; '\r'; '['; ']'; '/'; |]
    static let containsInvalidValues (s : string) = s.IndexOfAny(notAllowed) <> -1
    static member positiveIntsGen() = Arb.generate<PositiveInt> |> Gen.map int
    static member separatorStringGen() =
        Arb.generate<NonEmptyString>
        |> Gen.suchThat (fun s -> s.Get.Length < 5 && not (s.Get |> containsInvalidValues))
Arbitrary:
let manyNumbersNewLineCustomDelimiterStrInput =
    Gen.map2 (fun (ints : int[]) (nes : NonEmptyString) ->
        Array.fold (fun acc num ->
            if num % 2 = 0 then acc + "," + num.ToString()
            else if num % 3 = 0 then acc + "\n" + num.ToString()
            else acc + "\n" + num.ToString()) ("//[" + nes.Get + "]\n") ints)
        (Generators.array12OfIntsGen())
        (Generators.separatorStringGen())
    |> Arb.fromGen
The configuration has MaxTest = 500 and it takes ~5 minutes to complete.
Output (using #time):
StrCalcTest.get_When pass an string that starts with "//[" and contains "]\n" use the multicharacter value between them as separator-Ok, passed 500 tests.
Real: 00:07:03.467, CPU: 00:07:03.296, GC gen0: 75844, gen1: 71968, gen2: 4

Without actually testing anything, my guess would be that the problematic part is this:
Arb.generate<NonEmptyString>
|> Gen.suchThat (fun s -> s.Get.Length < 5 && not (s.Get |> containsInvalidValues))
This means that you generate strings and then filter out all those that do not satisfy the conditions. But if the conditions are too restrictive, FsCheck might need to generate a very large number of strings before it actually finds some that pass the filter.
If you can express the rule so that everything you generate is already a valid string, then I think it should be faster.
Could you, for example, generate a number n (for the string length) followed by n values of type char (that satisfy your conditions) and then append them to form the separator string? (I think FsCheck's gen { .. } computation might be a nice way of writing that.)
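For illustration, here is a minimal sketch of such a generator, assuming the valid separator characters are plain letters (adjust allowedChars to your real rules). Because it only ever builds strings from characters that are already valid, nothing has to be discarded:
open FsCheck

let separatorGen =
    // Assumed set of valid characters: anything outside the notAllowed array above would do.
    let allowedChars = Array.append [| 'a' .. 'z' |] [| 'A' .. 'Z' |]
    gen {
        let! n = Gen.choose (1, 4)                                   // length 1..4, i.e. < 5
        let! chars = Gen.arrayOfLength n (Gen.elements allowedChars)
        return System.String(chars)
    }
You could then plug this into Gen.map2 in place of Generators.separatorStringGen(), mapping over the string directly instead of a NonEmptyString.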

Related

Convert integer to base 4 in F#

I want to convert a given integer into a base-4 string. For example, in Scala,
var str: String = Integer.toString(10, 4)
gives the output "22", i.e. 2*(4^1) + 2*(4^0) = 10.
I'm having difficulty doing this in F#. Any help is appreciated.
let intToDigits baseN value : int list =
    let rec loop num digits =
        let q = num / baseN
        let r = num % baseN
        if q = 0 then
            r :: digits
        else
            loop q (r :: digits)
    loop value []

254
|> intToDigits 16
|> List.fold (fun acc x -> acc + x.ToString("X")) ""
|> printfn "%s"

254
|> intToDigits 4
|> List.fold (fun acc x -> acc + x.ToString()) ""
|> printfn "%s"
This outputs FE for the base-16 conversion and 3332 for the base-4 conversion. Note that ToString("X") works for up to base 16, so it could be used for the base-4 conversion too.
I adapted this int based solution from Stuart Lang's bigint example (which uses bigint.DivRem rather than integer operators).
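Building on that, here is a small sketch (the name intToBaseString is mine) that wraps the fold into a helper returning the string directly; since ToString("X") covers the digits 0..15, it works for any base from 2 to 16:
// Sketch only: combines intToDigits with the fold into one function (bases 2..16).
let intToBaseString baseN (value : int) =
    value
    |> intToDigits baseN
    |> List.fold (fun acc (d : int) -> acc + d.ToString("X")) ""

intToBaseString 4 10 |> printfn "%s"   // prints "22", matching the Scala example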

Any suggestions on how to optimize this string -> double parser further, in F#?

As part of testing, I have to load massive CSV files (200+ MB each), and I'm trying to cut down the load time since it is adding up.
As the CSV files contain only floating-point numbers, I tried to see if I could speed up the parsing.
One important point: I do NOT need precision beyond 5 digits after the decimal point. The code I post here parses everything anyway, to give a fair comparison with .NET's parser, but any shortcut that preserves that precision is fine.
The code I came up with:
open System
open System.Diagnostics

[<EntryPoint>]
let main _ =
    let r = Random()
    let numbers =
        seq {
            for _ in 0 .. 10000000 do
                let n = (1000. * r.NextDouble()) - 500.
                yield (n, sprintf "%.8f" n)
        }
        |> Seq.toList
    // nAccumulator collects every digit (integer and fractional) as an int64.
    // nSign starts at +1 (flipped if '-' is seen); once '.' is seen, signMultiplier becomes 0.1,
    // so every further digit also scales nSign down by 10, which restores the magnitude at the end.
    let parseString (number: string) : double =
        let len = number.Length
        let rec parse (index: int) (nSign: double) (signMultiplier: double) (nAccumulator: int64) : float =
            if index < len then
                match number.[index] with
                | x when x >= '0' && x <= '9' -> parse (index + 1) (signMultiplier * nSign) signMultiplier (nAccumulator * 10L + int64 x - int64 '0')
                | x when x = '.' -> parse (index + 1) nSign 0.1 nAccumulator
                | x when x = '-' -> parse (index + 1) -nSign signMultiplier nAccumulator
                | _ -> parse (index + 1) nSign signMultiplier nAccumulator
            else
                double nAccumulator * nSign
        parse 0 +1. 1. 0L
    let benchmark name (func: string -> double) =
        let allowedError = 0.00001
        printf "checking %s - " name
        let sw = Stopwatch()
        sw.Start()
        numbers
        |> List.iter (fun (num, str) ->
            let parsed = func str
            let delta = num - parsed
            if abs delta > allowedError then
                failwithf "(%f, %s) not matching %f" num str parsed
        )
        sw.Stop()
        printfn "%i ms" sw.ElapsedMilliseconds
    benchmark ".net parser" (fun x -> double x)
    benchmark "parser1" (fun x -> parseString x)
    0
Is there anything obvious I missed to speed this up?
Additionally, there is something I do not understand: I tried to make the test numbers an array instead of a list, and suddenly the processing took longer, while I'd expect the opposite. If anyone has some insight into why that happens, I'd be happy to know. You can just change the toList to toArray and replace List with Array in the benchmark loop to test this.
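A minimal sketch of that array variant, assuming the same numbers list and parsers defined above (benchmarkArr is just an illustrative name):
// Same check as benchmark, but over an array with Array.iter instead of List.iter.
let numbersArr = numbers |> List.toArray

let benchmarkArr name (func: string -> double) =
    let allowedError = 0.00001
    printf "checking %s - " name
    let sw = Stopwatch.StartNew()
    numbersArr
    |> Array.iter (fun (num, str) ->
        let parsed = func str
        if abs (num - parsed) > allowedError then
            failwithf "(%f, %s) not matching %f" num str parsed)
    sw.Stop()
    printfn "%i ms" sw.ElapsedMilliseconds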
Edit:
to load the data, here is the code I use:
open System.IO

// load a csv file
let loadCSV (stream: Stream) =
    seq {
        use s = new StreamReader (stream)
        while not s.EndOfStream do
            yield s.ReadLine ()
    }
    |> Seq.map (fun line -> line.Split(",") |> Array.toList)
    |> Seq.toList

F# Parallel.ForEach invalid method overload

Creating a Parallel.ForEach expression of this form:
let low = max 1 (k-m)
let high = min (k-1) n
let rangesize = (high+1-low)/(PROCS*3)
Parallel.ForEach(Partitioner.Create(low, high+1, rangesize), (fun j ->
    let i = k - j
    if x.[i-1] = y.[j-1] then
        a.[i] <- b.[i-1] + 1
    else
        a.[i] <- max c.[i] c.[i-1]
)) |> ignore
This causes me to receive the error: No overloads match for method 'ForEach'. However, I am using the Parallel.ForEach<TSource>(Partitioner<TSource>, Action<TSource>) overload, and it seems right to me. Am I missing something?
Edited: I am trying to obtain the same results as the code below (that does not use a Partitioner):
let low = max 1 (k-m)
let high = min (k-1) n
let rangesize = (high+1-low)/(PROCS*3)
let A = [| low .. high |]
Parallel.ForEach(A, fun (j:int) ->
    let i = k - j
    if x.[i-1] = y.[j-1] then
        a.[i] <- b.[i-1] + 1
    else
        a.[i] <- max c.[i] c.[i-1]
) |> ignore
Are you sure that you have opened all the necessary namespaces, that all the values you are using (low, high and PROCS) are defined, and that your code does not accidentally redefine some of the names you're using (like Partitioner)?
I created a very simple F# script with this code and it seems to be working fine (I refactored the code to create a partitioner called p, but that does not affect the behavior):
open System.Threading.Tasks
open System.Collections.Concurrent

let PROCS = 10
let low, high = 0, 100
let p = Partitioner.Create(low, high+1, high+1-low/(PROCS*3))

Parallel.ForEach(p, (fun j ->
    printfn "%A" j // Print the desired range (using %A as it is a tuple)
)) |> ignore
It is important that the value j is actually a pair of type int * int, so if the body uses it in the wrong way (e.g. as an int), you will get the error. In that case, you can add a type annotation to j and you will get a more useful error elsewhere:
Parallel.ForEach(p, (fun (j:int * int) ->
    printfn "%d" j // Error here, because `j` is used as an int, but it is a pair!
)) |> ignore
This means that if you want to perform something for all j values in the original range, you need to write something like this:
Parallel.ForEach(p, (fun (loJ, hiJ) ->
    for j in loJ .. hiJ - 1 do // Iterate over all js in this partition
        printfn "%d" j // process the current j
)) |> ignore
As an aside, I guess that the last argument to Partitioner.Create should actually be (high+1-low)/(PROCS*3) - you probably want to divide the total number of steps, not just the low value.
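In code, that corrected call would look like this (just a sketch reusing the bindings above; p2 is an illustrative name):
// Parenthesised numerator: divide the whole range, not just `low`
let p2 = Partitioner.Create(low, high + 1, (high + 1 - low) / (PROCS * 3))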

Project Euler #14 attempt fails with StackOverflowException

I recently started solving Project Euler problems in Scala; however, when I got to problem 14, I got a StackOverflowError, so I rewrote my solution in F#, since (I am told) the F# compiler, unlike Scala's (which produces Java bytecode), translates recursive calls into loops.
My question therefore is: how is it possible that the code below throws a StackOverflowException after reaching some number above 113000? The recursion doesn't have to be a tail recursion in order to be translated/optimized, does it?
I have tried several rewrites of my code, but without success. I really don't want to write the code in imperative style using loops, and I don't think I could make the len function tail-recursive, even if that is the problem preventing it from being optimized.
module Problem14 =
    open System.Collections.Generic

    let lenMap = Dictionary<int,int>()
    let next n =
        if n % 2 = 0 then n/2
        else 3*n+1
    let memoize(num:int, lng:int):int =
        lenMap.[num]<-lng
        lng
    let rec len(num:int):int =
        match num with
        | 1 -> 1
        | _ when lenMap.ContainsKey(num) -> lenMap.[num]
        | _ -> memoize(num, 1+(len (next num)))
    let cand = seq{ for i in 999999 .. -1 .. 1 -> i}
    let tuples = cand |> Seq.map(fun n -> (n, len(n)))
    let Result = tuples |> Seq.maxBy(fun n -> snd n) |> fst
NOTE: I am aware that the code above is very far from optimal and that several lines could be a lot simpler, but I am not very proficient in F# and have not yet bothered to look up ways to simplify it and make it more elegant.
Thank you.
Your current code runs without error and gets the correct result if I change all the int to int64 and append an L after every numeric literal (e.g. -1L). If the actual problem is that you're overflowing a 32-bit integer, I'm not sure why you get a StackOverflowException.
module Problem14 =
    let lenMap = System.Collections.Generic.Dictionary<_,_>()
    let next n =
        if n % 2L = 0L then n/2L
        else 3L*n+1L
    let memoize(num, lng) =
        lenMap.[num]<-lng
        lng
    let rec len num =
        match num with
        | 1L -> 1L
        | _ when lenMap.ContainsKey(num) -> lenMap.[num]
        | _ -> memoize(num, 1L+(len (next num)))
    let cand = seq{ for i in 999999L .. -1L .. 1L -> i}
    let tuples = cand |> Seq.map(fun n -> (n, len(n)))
    let Result = tuples |> Seq.maxBy(fun n -> snd n) |> fst

F#: How do I split up a sequence into a sequence of sequences

Background:
I have a sequence of contiguous, time-stamped data. The data sequence has gaps in it where the data is not contiguous. I want to create a method to split the sequence up into a sequence of sequences so that each subsequence contains contiguous data (i.e. split the input sequence at the gaps).
Constraints:
The return value must be a sequence of sequences to ensure that elements are only produced as needed (cannot use list/array/caching)
The solution must NOT be O(n^2), probably ruling out a Seq.take - Seq.skip pattern (cf. Brian's post)
Bonus points for a functionally idiomatic approach (since I want to become more proficient at functional programming), but it's not a requirement.
Method signature
let groupContiguousDataPoints (timeBetweenContiguousDataPoints : TimeSpan) (dataPointsWithHoles : seq<DateTime * float>) : (seq<seq< DateTime * float >>)= ...
On the face of it the problem looked trivial to me, but even employing Seq.pairwise, IEnumerator<_>, sequence comprehensions and yield statements, the solution eludes me. I am sure that this is because I still lack experience with combining F# idioms, or possibly because there are some language constructs that I have not yet been exposed to.
// Test data
let numbers = {1.0..1000.0}
let baseTime = DateTime.Now
let contiguousTimeStamps = seq { for n in numbers ->baseTime.AddMinutes(n)}
let dataWithOccationalHoles =
    Seq.zip contiguousTimeStamps numbers
    |> Seq.filter (fun (dateTime, num) -> num % 77.0 <> 0.0) // Has a gap in the data every 77 items
let timeBetweenContiguousValues = (new TimeSpan(0,1,0))

dataWithOccationalHoles
|> groupContiguousDataPoints timeBetweenContiguousValues
|> Seq.iteri (fun i sequence ->
    printfn "Group %d has %d data-points: Head: %f" i (Seq.length sequence) (snd (Seq.hd sequence)))
I think this does what you want
dataWithOccationalHoles
|> Seq.pairwise
|> Seq.map(fun ((time1,elem1),(time2,elem2)) -> if time2-time1 = timeBetweenContiguousValues then 0, ((time1,elem1),(time2,elem2)) else 1, ((time1,elem1),(time2,elem2)) )
|> Seq.scan(fun (indexres,(t1,e1),(t2,e2)) (index,((time1,elem1),(time2,elem2))) -> (index+indexres,(time1,elem1),(time2,elem2)) ) (0,(baseTime,-1.0),(baseTime,-1.0))
|> Seq.map( fun (index,(time1,elem1),(time2,elem2)) -> index,(time2,elem2) )
|> Seq.filter( fun (_,(_,elem)) -> elem <> -1.0)
|> PSeq.groupBy(fst)
|> Seq.map(snd>>Seq.map(snd))
Thanks for asking this cool question
I translated Alexey's Haskell to F#, but it's not pretty in F#, and still one element too eager.
I expect there is a better way, but I'll have to try again later.
let N = 20
let data = // produce some arbitrary data with holes
    seq {
        for x in 1..N do
            if x % 4 <> 0 && x % 7 <> 0 then
                printfn "producing %d" x
                yield x
    }

let rec GroupBy comp (input:LazyList<'a>) : LazyList<LazyList<'a>> =
    LazyList.delayed (fun () ->
        match input with
        | LazyList.Nil -> LazyList.cons (LazyList.empty()) (LazyList.empty())
        | LazyList.Cons(x,LazyList.Nil) ->
            LazyList.cons (LazyList.cons x (LazyList.empty())) (LazyList.empty())
        | LazyList.Cons(x,(LazyList.Cons(y,_) as xs)) ->
            let groups = GroupBy comp xs
            if comp x y then
                LazyList.consf
                    (LazyList.consf x (fun () ->
                        let (LazyList.Cons(firstGroup,_)) = groups
                        firstGroup))
                    (fun () ->
                        let (LazyList.Cons(_,otherGroups)) = groups
                        otherGroups)
            else
                LazyList.cons (LazyList.cons x (LazyList.empty())) groups)

let result = data |> LazyList.of_seq |> GroupBy (fun x y -> y = x + 1)

printfn "Consuming..."
for group in result do
    printfn "about to do a group"
    for x in group do
        printfn " %d" x
You seem to want a function with the signature
('a -> bool) -> seq<'a> -> seq<seq<'a>>
i.e. given a function and a sequence, break up the input sequence into a sequence of sequences based on the result of the function.
Caching the values into a collection that implements IEnumerable would likely be simplest (albeit not exactly purist: it avoids iterating the input multiple times, but it loses much of the laziness of the input):
let groupBy (f: 'a -> bool) (input: seq<'a>) =
    seq {
        let cache = ref (new System.Collections.Generic.List<'a>())
        for e in input do
            (!cache).Add(e)
            if not (f e) then
                yield (!cache :> seq<'a>)
                cache := new System.Collections.Generic.List<'a>()
        if (!cache).Count > 0 then
            yield (!cache :> seq<'a>)
    }
An alternative implementation could pass the cache collection (as seq<'a>) to the function so that it can see multiple elements when choosing the break points.
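A rough sketch of that alternative (the name groupByGroup and its exact shape are my own, not a reference implementation): the predicate receives the group collected so far plus the candidate element and decides whether a new group should start:
// Sketch only: shouldBreak sees the whole current group and the next element.
let groupByGroup (shouldBreak: 'a list -> 'a -> bool) (input: seq<'a>) : seq<'a list> =
    seq {
        let cache = System.Collections.Generic.List<'a>()
        for e in input do
            if cache.Count > 0 && shouldBreak (List.ofSeq cache) e then
                yield List.ofSeq cache
                cache.Clear()
            cache.Add(e)
        if cache.Count > 0 then
            yield List.ofSeq cache
    }
Like the version above, this trades laziness for simplicity: each group is copied into an F# list before it is yielded.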
A Haskell solution, because I don't know F# syntax well, but it should be easy enough to translate:
type TimeStamp = Integer -- ticks
type TimeSpan = Integer -- difference between TimeStamps
groupContiguousDataPoints :: TimeSpan -> [(TimeStamp, a)] -> [[(TimeStamp, a)]]
There is a function groupBy :: (a -> a -> Bool) -> [a] -> [[a]] in Data.List:
The group function takes a list and returns a list of lists such that the concatenation of the result is equal to the argument. Moreover, each sublist in the result contains only equal elements. For example,
group "Mississippi" = ["M","i","ss","i","ss","i","pp","i"]
It is a special case of groupBy, which allows the programmer to supply their own equality test.
It isn't quite what we want, because it compares each element in the list with the first element of the current group, and we need to compare consecutive elements. If we had such a function groupBy1, we could write groupContiguousDataPoints easily:
groupContiguousDataPoints maxTimeDiff list = groupBy1 (\(t1, _) (t2, _) -> t2 - t1 <= maxTimeDiff) list
So let's write it!
groupBy1 :: (a -> a -> Bool) -> [a] -> [[a]]
groupBy1 _ [] = [[]]
groupBy1 _ [x] = [[x]]
groupBy1 comp (x : xs@(y : _))
  | comp x y = (x : firstGroup) : otherGroups
  | otherwise = [x] : groups
  where groups@(firstGroup : otherGroups) = groupBy1 comp xs
UPDATE: it looks like F# doesn't let you pattern match on seq, so it isn't too easy to translate after all. However, this thread on HubFS shows a way to pattern match sequences by converting them to LazyList when needed.
UPDATE2: Haskell lists are lazy and generated as needed, so they correspond to F#'s LazyList (not to seq, because the generated data is cached (and garbage collected, of course, if you no longer hold a reference to it)).
(EDIT: This suffers from a similar problem to Brian's solution, in that iterating the outer sequence without iterating over each inner sequence will mess things up badly!)
Here's a solution that nests sequence expressions. The imperative nature of .NET's IEnumerable<T> is pretty apparent here, which makes it a bit harder to write idiomatic F# code for this problem, but hopefully it's still clear what's going on.
let groupBy cmp (sq:seq<_>) =
    let en = sq.GetEnumerator()
    let rec partitions (first:option<_>) =
        seq {
            match first with
            | Some first' -> //'
                (* The following value is always overwritten;
                   it represents the first element of the next subsequence to output, if any *)
                let next = ref None
                (* This function generates a subsequence to output,
                   setting next appropriately as it goes *)
                let rec iter item =
                    seq {
                        yield item
                        if (en.MoveNext()) then
                            let curr = en.Current
                            if (cmp item curr) then
                                yield! iter curr
                            else // consumed one too many - pass it on as the start of the next sequence
                                next := Some curr
                        else
                            next := None
                    }
                yield iter first' (* ' generate the first sequence *)
                yield! partitions !next (* recursively generate all remaining sequences *)
            | None -> () // return an empty sequence if there are no more values
        }
    let first = if en.MoveNext() then Some en.Current else None
    partitions first

let groupContiguousDataPoints (time:TimeSpan) : (seq<DateTime*_> -> _) =
    groupBy (fun (t,_) (t',_) -> t' - t <= time)
Okay, trying again. Achieving the optimal amount of laziness turns out to be a bit difficult in F#... On the bright side, this is somewhat more functional than my last attempt, in that it doesn't use any ref cells.
let groupBy cmp (sq:seq<_>) =
    let en = sq.GetEnumerator()
    let next() = if en.MoveNext() then Some en.Current else None
    (* this function returns a pair containing the first sequence and a lazy option indicating the first element in the next sequence (if any) *)
    let rec seqStartingWith start =
        match next() with
        | Some y when cmp start y ->
            let rest_next = lazy seqStartingWith y // delay evaluation until forced - stores the rest of this sequence and the start of the next one as a pair
            seq { yield start; yield! fst (Lazy.force rest_next) },
            lazy Lazy.force (snd (Lazy.force rest_next))
        | next -> seq { yield start }, lazy next
    let rec iter start =
        seq {
            match (Lazy.force start) with
            | None -> ()
            | Some start ->
                let (first,next) = seqStartingWith start
                yield first
                yield! iter next
        }
    Seq.cache (iter (lazy next()))
Below is some code that does what I think you want. It is not idiomatic F#.
(It may be similar to Brian's answer, though I can't tell because I'm not familiar with the LazyList semantics.)
But it doesn't exactly match your test specification: Seq.length enumerates its entire input. Your "test code" calls Seq.length and then calls Seq.hd. That will generate an enumerator twice, and since there is no caching, things get messed up. I'm not sure if there is any clean way to allow multiple enumerators without caching. Frankly, seq<seq<'a>> may not be the best data structure for this problem.
Anyway, here's the code:
type State<'a> = Unstarted | InnerOkay of 'a | NeedNewInner of 'a | Finished

// f() = true means the neighbors should be kept together
// f() = false means they should be split
let split_up (f : 'a -> 'a -> bool) (input : seq<'a>) =
    // simple unfold that assumes f captured a mutable variable
    let iter f =
        Seq.unfold (fun _ ->
            match f() with
            | Some(x) -> Some(x,())
            | None -> None) ()
    seq {
        let state = ref (Unstarted)
        use ie = input.GetEnumerator()
        let innerMoveNext() =
            match !state with
            | Unstarted ->
                if ie.MoveNext()
                then let cur = ie.Current
                     state := InnerOkay(cur); Some(cur)
                else state := Finished; None
            | InnerOkay(last) ->
                if ie.MoveNext()
                then let cur = ie.Current
                     if f last cur
                     then state := InnerOkay(cur); Some(cur)
                     else state := NeedNewInner(cur); None
                else state := Finished; None
            | NeedNewInner(last) -> state := InnerOkay(last); Some(last)
            | Finished -> None
        let outerMoveNext() =
            match !state with
            | Unstarted | NeedNewInner(_) -> Some(iter innerMoveNext)
            | InnerOkay(_) -> failwith "Move to next inner seq when current is active: undefined behavior."
            | Finished -> None
        yield! iter outerMoveNext }

open System

let groupContigs (contigTime : TimeSpan) (holey : seq<DateTime * int>) =
    split_up (fun (t1,_) (t2,_) -> (t2 - t1) <= contigTime) holey

// Test data
let numbers = {1 .. 15}
let contiguousTimeStamps =
    let baseTime = DateTime.Now
    seq { for n in numbers -> baseTime.AddMinutes(float n)}

let holeyData =
    Seq.zip contiguousTimeStamps numbers
    |> Seq.filter (fun (dateTime, num) -> num % 7 <> 0)

let grouped_data = groupContigs (new TimeSpan(0,1,0)) holeyData

printfn "Consuming..."
for group in grouped_data do
    printfn "about to do a group"
    for x in group do
        printfn " %A" x
Ok, here's an answer I'm not unhappy with.
(EDIT: I am unhappy - it's wrong! No time to try to fix right now though.)
It uses a bit of imperative state, but it is not too difficult to follow (provided you recall that '!' is the F# dereference operator, and not 'not'). It is as lazy as possible, and takes a seq as input and returns a seq of seqs as output.
let N = 20
let data = // produce some arbitrary data with holes
    seq {
        for x in 1..N do
            if x % 4 <> 0 && x % 7 <> 0 then
                printfn "producing %d" x
                yield x
    }

let rec GroupBy comp (input:seq<_>) = seq {
    let doneWithThisGroup = ref false
    let areMore = ref true
    use e = input.GetEnumerator()
    let Next() = areMore := e.MoveNext(); !areMore
    // deal with length 0 or 1, seed 'prev'
    if not(e.MoveNext()) then () else
    let prev = ref e.Current
    while !areMore do
        yield seq {
            while not(!doneWithThisGroup) do
                if Next() then
                    let next = e.Current
                    doneWithThisGroup := not(comp !prev next)
                    yield !prev
                    prev := next
                else
                    // end of list, yield final value
                    yield !prev
                    doneWithThisGroup := true }
        doneWithThisGroup := false }

let result = data |> GroupBy (fun x y -> y = x + 1)

printfn "Consuming..."
for group in result do
    printfn "about to do a group"
    for x in group do
        printfn " %d" x
