Aggregation function - f# vs c# performance - f#

I have a function that I use a lot, so its performance needs to be as good as possible. It takes data from Excel and then sums, averages or counts over parts of the data, based on whether each row falls within a certain period and whether it is a peak hour (Mon-Fri, 8-20).
The data is usually around 30,000 rows and 2 columns (hourly date, value). One important feature of the data is that the date column is chronologically ordered.
I have three implementations. The first is C# with extension methods (dead slow, and I'm not going to show it unless somebody is interested).
Then I have this F# implementation:
let ispeak dts =
    let newdts = DateTime.FromOADate dts
    match newdts.DayOfWeek, newdts.Hour with
    | DayOfWeek.Saturday, _ | DayOfWeek.Sunday, _ -> false
    | _, h when h >= 8 && h < 20 -> true
    | _ -> false
let internal isbetween a std edd =
    match a with
    | r when r >= std && r < edd+1. -> true
    | _ -> false
[<ExcelFunction(Name="aggrF")>]
let aggrF (data:float[]) (data2:float[]) std edd pob sac =
    let newd =
        [0 .. (Array.length data) - 1]
        |> List.map (fun i -> (data.[i], data2.[i]))
        |> Seq.filter (fun (date, _) ->
            let dateInRange = isbetween date std edd
            match pob with
            | "Peak" -> ispeak date && dateInRange
            | "Offpeak" -> not(ispeak date) && dateInRange
            | _ -> dateInRange)
    match sac with
    | 0 -> newd |> Seq.averageBy (fun (_, value) -> value)
    | 2 -> newd |> Seq.sumBy (fun (_, value) -> 1.0)
    | _ -> newd |> Seq.sumBy (fun (_, value) -> value)
I see two issues with this:
I need to prepare the data, because both date and value are double[].
I do not use the knowledge that the dates are chronological, hence I do unnecessary iterations.
Here is what I would call a brute-force imperative C# version:
public static bool ispeak(double dats)
{
    var dts = System.DateTime.FromOADate(dats);
    if (dts.DayOfWeek != DayOfWeek.Sunday & dts.DayOfWeek != DayOfWeek.Saturday & dts.Hour > 7 & dts.Hour < 20)
        return true;
    else
        return false;
}
[ExcelFunction(Description = "Aggregates HFC/EG into average or sum over period, start date inclusive, end date exclusive")]
public static double aggrI(double[] dts, double[] vals, double std, double edd, string pob, double sumavg)
{
    double accsum = 0;
    int acccounter = 0;
    int indicator = 0;
    bool peakbool = pob.Equals("Peak", StringComparison.OrdinalIgnoreCase);
    bool offpeakbool = pob.Equals("Offpeak", StringComparison.OrdinalIgnoreCase);
    bool basebool = pob.Equals("Base", StringComparison.OrdinalIgnoreCase);
    for (int i = 0; i < vals.Length; ++i)
    {
        if (dts[i] >= std && dts[i] < edd + 1)
        {
            indicator = 1;
            if (peakbool && ispeak(dts[i]))
            {
                accsum += vals[i];
                ++acccounter;
            }
            else if (offpeakbool && (!ispeak(dts[i])))
            {
                accsum += vals[i];
                ++acccounter;
            }
            else if (basebool)
            {
                accsum += vals[i];
                ++acccounter;
            }
        }
        else if (indicator == 1)
        {
            break;
        }
    }
    if (sumavg == 0)
    {
        return accsum / acccounter;
    }
    else if (sumavg == 2)
    {
        return acccounter;
    }
    else
    {
        return accsum;
    }
}
This is much faster (I'm guessing mainly because the loop exits once the period has ended) but obviously less succinct.
My questions:
Is there a way to stop iterations in the f# Seq module for sorted series?
Is there another way to speed up the f# version?
Can somebody think of an even better way of doing this?
Thanks a lot!
Update: Speed comparison
I set up a test array with hourly dates from 1/1/13 to 31/12/15 (roughly 30,000 rows) and corresponding values. I made 150 calls spread out over the date array and repeated this 100 times, for 15,000 function calls in total:
My C# implementation above (with the string comparisons outside of the loop): 1.36 secs
Matthew's recursive F#: 1.55 secs
Tomas's array F#: 1 min 40 secs
My original F#: 2 min 20 secs
Obviously this is specific to my machine, but it gives an idea, and people asked for it.
I also think one should keep in mind that this doesn't mean recursion or for loops are always faster than Array.map etc. It is just that in this case the array version does a lot of unnecessary iterations, as it doesn't have the early exit that the C# and the F# recursion versions have.

Using Array instead of List and Seq makes this about 3-4 times faster. You do not need to generate a list of indices and then map over that to lookup items in the two arrays - instead you can use Array.zip to combine the two arrays into a single one and then use Array.filter.
In general, if you want performance, then using array as your data structure will make sense (unless you have a long pipeline of things). Functions like Array.zip and Array.map can calculate the entire array size, allocate it and then do efficient imperative operation (while still looking functional from the outside).
let aggrF (data:float[]) (data2:float[]) std edd pob sac =
    let newd =
        Array.zip data data2
        |> Array.filter (fun (date, _) ->
            let dateInRange = isbetween date std edd
            match pob with
            | "Peak" -> ispeak date && dateInRange
            | "Offpeak" -> not(ispeak date) && dateInRange
            | _ -> dateInRange)
    match sac with
    | 0 -> newd |> Array.averageBy (fun (_, value) -> value)
    | 2 -> newd |> Array.sumBy (fun (_, value) -> 1.0)
    | _ -> newd |> Array.sumBy (fun (_, value) -> value)
I also changed isbetween - it can be simplified into just an expression and you can mark it inline, but that does not add that much:
let inline isbetween r std edd = r >= std && r < edd+1.
Just for completeness, I tested this with the following code (using F# Interactive):
#time
let d1 = Array.init 1000000 float
let d2 = Array.init 1000000 float
aggrF d1 d2 0.0 1000000.0 "Test" 0
The original version took about 600ms and the new version using arrays takes between 160ms and 200ms. The version by Matthew takes about 520ms.
Aside, I spent the last two months at BlueMountain Capital working on a time series/data frame library for F# that would make this a lot simpler. It is work in progress and also the name of the library will change, but you can find it in BlueMountain GitHub. The code would look something like this (it uses the fact that the time series is ordered and uses slicing to get the relevant part before filtering):
let ts = Series(times, values)
ts.[std .. edd] |> Series.filter (fun k _ -> not (ispeak k)) |> Series.mean
Currently, this will not be as fast as direct array operations, but I'll look into that :-).

An immediate way to speed it up would be to combine these:
[0 .. (Array.length data) - 1]
|> List.map (fun i -> (data.[i], data2.[i]))
|> Seq.filter (fun (date, _) ->
into a single list comprehension, and also as the other matthew said, do a single string comparison:
let aggrF (data:float[]) (data2:float[]) std edd pob sac =
    let isValidTime =
        match pob with
        | "Peak" -> (fun x -> ispeak x)
        | "Offpeak" -> (fun x -> not(ispeak x))
        | _ -> (fun _ -> true)
    let data =
        [ for i in 0 .. (Array.length data) - 1 do
            let (date, value) = (data.[i], data2.[i])
            if isbetween date std edd && isValidTime date then
                yield (date, value)
            else
                () ]
    match sac with
    | 0 -> data |> Seq.averageBy (fun (_, value) -> value)
    | 2 -> float data.Length // converted so that every branch returns a float
    | _ -> data |> Seq.sumBy (fun (_, value) -> value)
Or use a tail recursive function:
let aggrF (data:float[]) (data2:float[]) std edd pob sac =
    let isValidTime =
        match pob with
        | "Peak" -> (fun x -> ispeak x)
        | "Offpeak" -> (fun x -> not(ispeak x))
        | _ -> (fun _ -> true)
    let endDate = edd + 1.0
    let rec aggr i sum count =
        if i >= (Array.length data) || data.[i] >= endDate then
            match sac with
            | 0 -> sum / float(count)
            | 2 -> float(count)
            | _ -> float(sum)
        else if data.[i] >= std && isValidTime data.[i] then
            aggr (i + 1) (sum + data2.[i]) (count + 1)
        else
            aggr (i + 1) sum count
    aggr 0 0.0 0
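Because the dates are sorted, the same early exit can also be written with the Seq module: Seq.skipWhile drops the rows before the period and Seq.takeWhile stops enumerating at the first row past it. The following is only a sketch under that sortedness assumption; the name aggrLazy is made up for illustration, ispeak is restated so the snippet stands alone, and no timings were taken for it.

```fsharp
open System

// Peak hours: Monday to Friday, 08:00 up to (not including) 20:00.
let ispeak (oaDate: float) =
    let d = DateTime.FromOADate oaDate
    d.DayOfWeek <> DayOfWeek.Saturday
    && d.DayOfWeek <> DayOfWeek.Sunday
    && d.Hour >= 8 && d.Hour < 20

// Hypothetical name; takes the same arguments as aggrF.
// Assumes `dates` is sorted in ascending order.
let aggrLazy (dates: float[]) (values: float[]) std edd pob sac =
    let keep =
        match pob with
        | "Peak" -> ispeak
        | "Offpeak" -> ispeak >> not
        | _ -> fun _ -> true
    let selected =
        Seq.zip dates values
        |> Seq.skipWhile (fun (d, _) -> d < std)        // skip rows before the period
        |> Seq.takeWhile (fun (d, _) -> d < edd + 1.0)  // stop at the first row past it
        |> Seq.filter (fun (d, _) -> keep d)
        |> Seq.map snd
        |> Seq.toArray
    match sac with
    | 0 -> Array.average selected   // throws on an empty selection, like Seq.averageBy
    | 2 -> float selected.Length
    | _ -> Array.sum selected
```

The rows before the period are still visited one by one, so for windows far into the array a binary search for the start index would likely do less work.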

Related

Finding an index of a max value of a list in F#

I'm trying to write a function that takes a list for example
let list = [5;23;29;1]
let x = max list // This will return 2 because 29 will be the max value and it's "indexed" at position 2
I'm not sure about how to go about writing the max function
Since my list will only contain four elements I currently have some code like this
let list = (1, newMap1 |> getScore) :: (2, newMap2 |> getScore) :: (3, newMap3 |> getScore) :: (4, newMap4 |> getScore) :: []
I consider this a terrible approach but I'm still stuck on how to return (x, _) after I find the max of (_, y). I'm very confident with imperative approaches but I'm stumped on how to do this functionally
There are a couple of ways to do this. At the low level, you can write a recursive function to iterate and pattern match over a list. This is a good exercise if you are learning F#.
Similarly, you can implement this using the fold function. Here, the idea is that we keep some state, consisting of the "best value" and the index of the best value. At each step, we either keep the original information, or update it:
let _, maxValue, maxIndex =
    list |> List.fold (fun (index, maxSoFar, maxIndex) v ->
        if v > maxSoFar then (index+1, v, index+1)
        else (index+1, maxSoFar, maxIndex)) (-1, System.Int32.MinValue, -1)
Finally, the shortest option I can think of is to use mapi and maxBy functions:
list
|> Seq.mapi (fun i v -> i, v)
|> Seq.maxBy snd
Here's an answer only using pattern matching and recursion.
let list = [5;23;29;1]
let rec findIndexOfMaxValue (maxValue:int) indexOfMaxValue currentIndex aList =
    match aList with
    | [] -> indexOfMaxValue
    | head::tail ->
        match head with
        | head when head > maxValue -> findIndexOfMaxValue head currentIndex (currentIndex + 1) tail
        | _ -> findIndexOfMaxValue maxValue indexOfMaxValue (currentIndex + 1) tail
[<EntryPoint>]
let main argv =
    let indexOfMaxValue = findIndexOfMaxValue 0 0 0 list
    printfn "The index of the maximum value is %A." indexOfMaxValue
    //The index of the maximum value is 2.
    0
Out of interest, I made a timing script comparing my algorithm with the other ones provided:
open System.Diagnostics
let n = 5000
let random = System.Random 543252
let randomlists =
    [for i in [1..n] -> [ for i in [1..n] -> random.Next (0, n*n)]]
let stopWatch =
    let sw = Stopwatch ()
    sw.Start ()
    sw
let timeIt (name : string) (a : int list -> 'T) : unit =
    let t = stopWatch.ElapsedMilliseconds
    let v = a (randomlists.[0])
    for i = 1 to (n - 1) do
        a randomlists.[i] |> ignore
    let d = stopWatch.ElapsedMilliseconds - t
    printfn "%s, elapsed %d ms, result %A" name d v
let rec findIndexOfMaxValue (maxValue:int) indexOfMaxValue currentIndex aList =
    match aList with
    | [] -> indexOfMaxValue
    | head::tail ->
        match head with
        | head when head > maxValue -> findIndexOfMaxValue head currentIndex (currentIndex + 1) tail
        | _ -> findIndexOfMaxValue maxValue indexOfMaxValue (currentIndex + 1) tail
let findIndexOfMaxValueFoldAlg list =
    let _, maxValue, maxIndex =
        list |> List.fold (fun (index, maxSoFar, maxIndex) v ->
            if v > maxSoFar then (index+1, v, index+1)
            else (index+1, maxSoFar, maxIndex)) (-1, System.Int32.MinValue, -1)
    maxIndex
let findIndexOfMaxValueSimpleSeq list =
    list
    |> Seq.mapi (fun i v -> i, v)
    |> Seq.maxBy snd
    |> fst
let findIndexOfMaxValueSimpleList list =
    list
    |> List.mapi (fun i x -> i, x)
    |> List.maxBy snd
    |> fst
[<EntryPoint>]
let main argv =
    timeIt "recursiveOnly" (findIndexOfMaxValue 0 0 0)
    timeIt "foldAlgorithm" findIndexOfMaxValueFoldAlg // call was missing from the listing, although its result is reported below
    timeIt "simpleSeq" findIndexOfMaxValueSimpleSeq
    timeIt "simpleList" findIndexOfMaxValueSimpleList
    0
The results I get are:
recursiveOnly, elapsed 356ms, result 3562
foldAlgorithm, elapsed 1602ms, result 3562
simpleSeq, elapsed 4504ms, result 3562
simpleList, elapsed 4395ms, result 3562
I have these functions in my helper library:
module List =
    let maxIndexBy projection list =
        list
        |> List.mapi (fun i x -> i, projection x)
        |> List.maxBy snd
        |> fst
    let maxIndex list = maxIndexBy id list
Returns the index of the max element, optionally using a given projection function. You can write the same functions for the Seq and Array modules easily by replacing the "List" part and renaming the arguments.
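As a rough illustration of that renaming, the Seq and Array variants could look like the sketch below. It simply mirrors the List helpers; the parameter name source is an arbitrary choice, and like the originals these throw on empty input because maxBy does.

```fsharp
module Seq =
    // Index of the element whose projection is largest.
    let maxIndexBy projection (source: seq<'T>) =
        source
        |> Seq.mapi (fun i x -> i, projection x)
        |> Seq.maxBy snd
        |> fst
    let maxIndex source = maxIndexBy id source

module Array =
    // Same idea for arrays.
    let maxIndexBy projection (source: 'T[]) =
        source
        |> Array.mapi (fun i x -> i, projection x)
        |> Array.maxBy snd
        |> fst
    let maxIndex source = maxIndexBy id source
```

With these in scope, Seq.maxIndex [5;23;29;1] and Array.maxIndexBy abs [| -7; 3; 2 |] can be called alongside the built-in Seq and Array functions.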

Get elements between two elements in an F# collection

I'd like to take a List or Array, and given two elements in the collection, get all elements between them. But I want to do this in a circular fashion, such that given a list [1;2;3;4;5;6] and if I ask for the elements that lie between 4 then 2, I get back [5;6;1]
Being used to imperative programming I can easily do this with loops, but I imagine there may be a nicer idiomatic approach to it in F#.
Edit
Here is an approach I came up with, having found the Array.indexed function
let elementsBetween (first:int) (second:int) (elements: array<'T>) =
    let diff = second - first
    elements
    |> Array.indexed
    |> Array.filter (fun (index,element) ->
        if diff = 0 then false
        else if diff > 0 then index > first && index < second
        else if diff < 0 then index > first || index < second
        else false)
This approach will only work with arrays obviously but this seems pretty good. I have a feeling I could clean it up by replacing the if/then/else with pattern matching but am not sure how to do that cleanly.
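One possible way to express the same index test with pattern matching is to pick the predicate once, based on the sign of the difference, and then filter. This is only a sketch: unlike the version above it also drops the indices from the result, which is an assumption about what the caller wants.

```fsharp
let elementsBetween (first: int) (second: int) (elements: 'T[]) =
    // Choose the index test once, based on how `second` compares to `first`.
    let inRange =
        match compare second first with
        | 0 -> fun _ -> false                               // same index: nothing in between
        | c when c > 0 -> fun i -> i > first && i < second  // plain range
        | _ -> fun i -> i > first || i < second             // range wraps around the end
    elements
    |> Array.indexed
    |> Array.filter (fun (i, _) -> inRange i)
    |> Array.map snd
```

Note that Array.filter keeps the array's own order, so in the wrap-around case the elements from the front of the array come out before the ones from the back, not in circular order.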
You should take a look at MSDN, Collections.Seq Module for example.
Let's try to be clever:
let elementsBetween a e1 e2 =
    let aa = a |> Seq.append a
    let i1 = aa |> Seq.findIndex (fun e -> e = e1)
    let i2 = aa |> Seq.skip i1 |> Seq.findIndex (fun e -> e = e2)
    aa |> Seq.skip(i1+1) |> Seq.take(i2-1)
I am not on my normal computer with an f# compiler, so I haven't tested it yet. It should look something like this
[Edit] Thank you @FoggyFinder for showing me https://dotnetfiddle.net/. I have now tested the code below with it.
[Edit] This should find the circular range in a single pass.
let x = [1;2;3;4;5]
let findCircRange l first second =
    let rec findUpTo (l':int list) f (s:int) : (int list * int list) =
        match l' with
        | i::tail ->
            if i = s then tail, (f [])
            else findUpTo tail (fun acc -> f (i::acc)) s
        // In case we are passed an empty list.
        | _ -> [], (f [])
    let remainder, upToStart = findUpTo l id first
    // concatenate the list after start with the list before start.
    let newBuffer = remainder@upToStart
    snd <| findUpTo newBuffer id second
let values = findCircRange x 4 2
printf "%A" values
findUpTo takes a list (l'), a function for creating a remainder list (f) and a value to look for (s). We recurse through it (tail recursion) to find the list up to the given value and the list after the given value. Wrap the buffer around by appending the end to the remainder. Pass it to the findUpTo again to find up to the end. Return the buffer up to the end.
We pass a function for accumulating found items. This technique allows us to append to the end of the list as the function calls unwind.
Of course, there is no error checking here. We are assuming that start and end do actually exist. That will be left to an exercise for the reader.
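The accumulator-function technique described above can be seen in isolation in the following sketch. The name splitAtValue is made up for illustration; the function walks the list once, building up a continuation k that, when finally applied to the empty list, produces the skipped prefix in its original order.

```fsharp
// Splits a list at the first occurrence of `target`.
// Returns (elements before target, elements after target).
let splitAtValue target list =
    let rec loop remaining k =
        match remaining with
        | [] -> k [], []                              // target not found: everything counts as "before"
        | x :: tail when x = target -> k [], tail     // found: build the prefix, return the rest
        | x :: tail -> loop tail (fun acc -> k (x :: acc))
    loop list id
```

Each step wraps the previous continuation instead of appending to a list, so no List.rev or @ is needed to restore the order, and the recursive call stays in tail position.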
Here is a variation using your idea of diff with list and list slicing (<some list>.[x .. y]):
let between (first : int) (second : int) (l : 'a list) : 'a list =
    if first < 0 then
        failwith "first cannot be less than zero"
    if second < 0 then
        failwith "second cannot be less than zero"
    if first > (l.Length * 2) then
        failwith "first cannot be greater than length of list times 2"
    if second > (l.Length * 2) then
        failwith "second cannot be greater than length of list times 2"
    let diff = second - first
    match diff with
    | 0 -> []
    | _ when diff > 0 && (abs diff) < l.Length -> l.[(first + 1) .. (second - 1)]
    | _ when diff > 0 -> (l@l).[(first + 1) .. (second - 1)]
    | _ when diff < 0 && (abs diff) < l.Length -> l.[(second + 1) .. (second + first - 1)]
    | _ when diff < 0 -> (l@l).[(second + 1) .. (second + first - 1)]

Using F# match to extract two days out of the week

Learning to use F#, and I'm trying to get familiar with the match expression. I expect the below code to pick two consecutive days out of the week, the current day and the day after. It only picks out the current day. What am I missing here?
DayOfWeek array:
let days = [|DayOfWeek.Sunday, true;
             DayOfWeek.Monday, false;
             DayOfWeek.Tuesday, true;
             DayOfWeek.Wednesday, true;
             DayOfWeek.Thursday, true;
             DayOfWeek.Friday, true;
             DayOfWeek.Saturday, true;|]
Match expression:
let curDate = DateTime.Now
let validDates =
    [
        for i in days do
            match i with
            | day, true ->
                match day with
                | x when int x = int curDate.DayOfWeek ||
                         int x > int curDate.DayOfWeek
                         && int x - int curDate.DayOfWeek = 1 ->
                    yield x
                | _ -> ()
            | _ -> ()
    ]
Your solution seems extremely convoluted to me, and like others have mentioned it only works if the underlying int value of tomorrow's DayOfWeek is one greater than today's. As you know, the week is a cycle so that logic won't always hold true. I don't want to spoonfeed, but there is a much easier solution:
let today = DateTime.Now.DayOfWeek
let days = [|DayOfWeek.Sunday, true;
             DayOfWeek.Monday, false;
             DayOfWeek.Tuesday, true;
             DayOfWeek.Wednesday, true;
             DayOfWeek.Thursday, true;
             DayOfWeek.Friday, true;
             DayOfWeek.Saturday, true;|]
let today_and_tomorrow =
    let idx_today = Array.findIndex (fun (day, _) -> day = today) days
    // parenthesised: % binds tighter than +, so the wrap-around needs (idx_today + 1) % length
    days.[idx_today], days.[(idx_today + 1) % days.Length]
For me, the difficulty is with using pattern matching.
Here is how I would do it without, allowing you to take any number of days, not just two.
open System
let next count days day =
    seq { while true do yield! days } // make the days array infinite
    |> Seq.skipWhile (fun (d, _) -> d <> day) // skip until we find our day
    |> Seq.filter (fun (_, incl) -> incl) // get rid of 'false' days
    |> Seq.take count // take the next 'count' of days
    |> Seq.map (fun (d, _) -> d) // we only care about the day now, so a simple map gets rid of the boolean
Using your array of days, I get the following:
DayOfWeek.Sunday
|> next 2 days
val it : seq<DayOfWeek> = seq [Sunday; Tuesday]
and
DayOfWeek.Thursday
|> next 3 days
val it : seq<DayOfWeek> = seq [Thursday; Friday; Saturday]
and
DayOfWeek.Sunday
|> next 10000 days
|> Seq.iter (printfn "%A")
Well, I'm not going to print what this one does, you'll just have to use your imagination. :)
I hope that helps!
Edit I made it handle an infinite number of days.
I think you can write this a lot easier by using the Enum-caps of F#/.net:
open System;;
let weekdayAfter (day : DateTime) : DayOfWeek =
    int day.DayOfWeek
    |> (fun i -> (i+1) % 7)
    |> Microsoft.FSharp.Core.LanguagePrimitives.EnumOfValue<_, _>
let today_and_tomorrow =
    let today = DateTime.Today
    (today.DayOfWeek, weekdayAfter today)
And if you really want to use pattern-matching then why not go with the readable/obvious solution:
let dayAfter (day : DateTime) =
    match day.DayOfWeek with
    | DayOfWeek.Sunday -> DayOfWeek.Monday
    | DayOfWeek.Monday -> DayOfWeek.Tuesday
    | DayOfWeek.Tuesday -> DayOfWeek.Wednesday
    | DayOfWeek.Wednesday -> DayOfWeek.Thursday
    | DayOfWeek.Thursday -> DayOfWeek.Friday
    | DayOfWeek.Friday -> DayOfWeek.Saturday
    | DayOfWeek.Saturday -> DayOfWeek.Sunday
    | _ -> failwith "should never happen"
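For comparison, the modular-arithmetic idea from the enum-based version can also be written with the built-in enum function, which converts an int to an enum value. This is a minimal sketch; the names nextDay and todayAndTomorrow are illustrative and not taken from the answers above.

```fsharp
open System

// Hypothetical helper: the day after `day`, wrapping Saturday (6) back to Sunday (0).
let nextDay (day: DayOfWeek) : DayOfWeek =
    enum<DayOfWeek> ((int day + 1) % 7)

// Today's weekday paired with tomorrow's.
let todayAndTomorrow =
    let today = DateTime.Today.DayOfWeek
    today, nextDay today
```

The parentheses around int day + 1 matter, because % binds more tightly than +.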

F#: How do i split up a sequence into a sequence of sequences

Background:
I have a sequence of contiguous, time-stamped data. The data sequence has gaps in it where the data is not contiguous. I want to create a method to split the sequence up into a sequence of sequences so that each subsequence contains contiguous data (split the input sequence at the gaps).
Constraints:
The return value must be a sequence of sequences to ensure that elements are only produced as needed (cannot use list/array/caching)
The solution must NOT be O(n^2), probably ruling out a Seq.take - Seq.skip pattern (cf. Brian's post)
Bonus points for a functionally idiomatic approach (since I want to become more proficient at functional programming), but it's not a requirement.
Method signature
let groupContiguousDataPoints (timeBetweenContiguousDataPoints : TimeSpan) (dataPointsWithHoles : seq<DateTime * float>) : (seq<seq< DateTime * float >>)= ...
On the face of it the problem looked trivial to me, but even employing Seq.pairwise, IEnumerator<_>, sequence comprehensions and yield statements, the solution eludes me. I am sure that this is because I still lack experience with combining F#-idioms, or possibly because there are some language-constructs that I have not yet been exposed to.
// Test data
let numbers = {1.0..1000.0}
let baseTime = DateTime.Now
let contiguousTimeStamps = seq { for n in numbers ->baseTime.AddMinutes(n)}
let dataWithOccationalHoles = Seq.zip contiguousTimeStamps numbers |> Seq.filter (fun (dateTime, num) -> num % 77.0 <> 0.0) // Has a gap in the data every 77 items
let timeBetweenContiguousValues = (new TimeSpan(0,1,0))
dataWithOccationalHoles |> groupContiguousDataPoints timeBetweenContiguousValues |> Seq.iteri (fun i sequence -> printfn "Group %d has %d data-points: Head: %f" i (Seq.length sequence) (snd(Seq.hd sequence)))
I think this does what you want
dataWithOccationalHoles
|> Seq.pairwise
|> Seq.map(fun ((time1,elem1),(time2,elem2)) -> if time2-time1 = timeBetweenContiguousValues then 0, ((time1,elem1),(time2,elem2)) else 1, ((time1,elem1),(time2,elem2)) )
|> Seq.scan(fun (indexres,(t1,e1),(t2,e2)) (index,((time1,elem1),(time2,elem2))) -> (index+indexres,(time1,elem1),(time2,elem2)) ) (0,(baseTime,-1.0),(baseTime,-1.0))
|> Seq.map( fun (index,(time1,elem1),(time2,elem2)) -> index,(time2,elem2) )
|> Seq.filter( fun (_,(_,elem)) -> elem <> -1.0)
|> PSeq.groupBy(fst)
|> Seq.map(snd>>Seq.map(snd))
Thanks for asking this cool question
I translated Alexey's Haskell to F#, but it's not pretty in F#, and still one element too eager.
I expect there is a better way, but I'll have to try again later.
let N = 20
let data = // produce some arbitrary data with holes
    seq {
        for x in 1..N do
            if x % 4 <> 0 && x % 7 <> 0 then
                printfn "producing %d" x
                yield x
    }
let rec GroupBy comp (input:LazyList<'a>) : LazyList<LazyList<'a>> =
    LazyList.delayed (fun () ->
        match input with
        | LazyList.Nil -> LazyList.cons (LazyList.empty()) (LazyList.empty())
        | LazyList.Cons(x,LazyList.Nil) ->
            LazyList.cons (LazyList.cons x (LazyList.empty())) (LazyList.empty())
        | LazyList.Cons(x,(LazyList.Cons(y,_) as xs)) ->
            let groups = GroupBy comp xs
            if comp x y then
                LazyList.consf
                    (LazyList.consf x (fun () ->
                        let (LazyList.Cons(firstGroup,_)) = groups
                        firstGroup))
                    (fun () ->
                        let (LazyList.Cons(_,otherGroups)) = groups
                        otherGroups)
            else
                LazyList.cons (LazyList.cons x (LazyList.empty())) groups)
let result = data |> LazyList.of_seq |> GroupBy (fun x y -> y = x + 1)
printfn "Consuming..."
for group in result do
    printfn "about to do a group"
    for x in group do
        printfn " %d" x
You seem to want a function that has signature
('a -> bool) -> seq<'a> -> seq<seq<'a>>
I.e. a function and a sequence, then break up the input sequence into a sequence of sequences based on the result of the function.
Caching the values into a collection that implements IEnumerable would likely be simplest. It is not exactly purist and it loses much of the laziness of the input, but it avoids iterating the input multiple times:
let groupBy (f: 'a -> bool) (input: seq<'a>) =
    seq {
        // elements of the group currently being built
        let cache = ref (new System.Collections.Generic.List<'a>())
        for e in input do
            (!cache).Add(e)
            if not (f e) then
                yield !cache
                cache := new System.Collections.Generic.List<'a>()
        // emit the trailing group, if any
        if (!cache).Count > 0 then
            yield !cache
    }
An alternative implementation could pass the cache collection (as seq<'a>) to the function so it can see multiple elements to choose the break points.
A Haskell solution, because I don't know F# syntax well, but it should be easy enough to translate:
type TimeStamp = Integer -- ticks
type TimeSpan = Integer -- difference between TimeStamps
groupContiguousDataPoints :: TimeSpan -> [(TimeStamp, a)] -> [[(TimeStamp, a)]]
There is a function groupBy :: (a -> a -> Bool) -> [a] -> [[a]] in Data.List:
The group function takes a list and returns a list of lists such that the concatenation of the result is equal to the argument. Moreover, each sublist in the result contains only equal elements. For example,
group "Mississippi" = ["M","i","ss","i","ss","i","pp","i"]
It is a special case of groupBy, which allows the programmer to supply their own equality test.
It isn't quite what we want, because it compares each element in the list with the first element of the current group, and we need to compare consecutive elements. If we had such a function groupBy1, we could write groupContiguousDataPoints easily:
groupContiguousDataPoints maxTimeDiff list = groupBy1 (\(t1, _) (t2, _) -> t2 - t1 <= maxTimeDiff) list
So let's write it!
groupBy1 :: (a -> a -> Bool) -> [a] -> [[a]]
groupBy1 _ [] = [[]]
groupBy1 _ [x] = [[x]]
groupBy1 comp (x : xs@(y : _))
  | comp x y = (x : firstGroup) : otherGroups
  | otherwise = [x] : groups
  where groups@(firstGroup : otherGroups) = groupBy1 comp xs
UPDATE: it looks like F# doesn't let you pattern match on seq, so it isn't too easy to translate after all. However, this thread on HubFS shows a way to pattern match sequences by converting them to LazyList when needed.
UPDATE2: Haskell lists are lazy and generated as needed, so they correspond to F#'s LazyList (not to seq, because the generated data is cached (and garbage collected, of course, if you no longer hold a reference to it)).
(EDIT: This suffers from a similar problem to Brian's solution, in that iterating the outer sequence without iterating over each inner sequence will mess things up badly!)
Here's a solution that nests sequence expressions. The imperative nature of .NET's IEnumerable<T> is pretty apparent here, which makes it a bit harder to write idiomatic F# code for this problem, but hopefully it's still clear what's going on.
let groupBy cmp (sq:seq<_>) =
    let en = sq.GetEnumerator()
    let rec partitions (first:option<_>) =
        seq {
            match first with
            | Some first' -> //'
                (* The following value is always overwritten;
                   it represents the first element of the next subsequence to output, if any *)
                let next = ref None
                (* This function generates a subsequence to output,
                   setting next appropriately as it goes *)
                let rec iter item =
                    seq {
                        yield item
                        if (en.MoveNext()) then
                            let curr = en.Current
                            if (cmp item curr) then
                                yield! iter curr
                            else // consumed one too many - pass it on as the start of the next sequence
                                next := Some curr
                        else
                            next := None
                    }
                yield iter first' (* ' generate the first sequence *)
                yield! partitions !next (* recursively generate all remaining sequences *)
            | None -> () // return an empty sequence if there are no more values
        }
    let first = if en.MoveNext() then Some en.Current else None
    partitions first
let groupContiguousDataPoints (time:TimeSpan) : (seq<DateTime*_> -> _) =
    groupBy (fun (t,_) (t',_) -> t' - t <= time)
Okay, trying again. Achieving the optimal amount of laziness turns out to be a bit difficult in F#... On the bright side, this is somewhat more functional than my last attempt, in that it doesn't use any ref cells.
let groupBy cmp (sq:seq<_>) =
    let en = sq.GetEnumerator()
    let next() = if en.MoveNext() then Some en.Current else None
    (* this function returns a pair containing the first sequence and a lazy option indicating the first element in the next sequence (if any) *)
    let rec seqStartingWith start =
        match next() with
        | Some y when cmp start y ->
            let rest_next = lazy seqStartingWith y // delay evaluation until forced - stores the rest of this sequence and the start of the next one as a pair
            seq { yield start; yield! fst (Lazy.force rest_next) },
            lazy Lazy.force (snd (Lazy.force rest_next))
        | next -> seq { yield start }, lazy next
    let rec iter start =
        seq {
            match (Lazy.force start) with
            | None -> ()
            | Some start ->
                let (first,next) = seqStartingWith start
                yield first
                yield! iter next
        }
    Seq.cache (iter (lazy next()))
Below is some code that does what I think you want. It is not idiomatic F#.
(It may be similar to Brian's answer, though I can't tell because I'm not familiar with the LazyList semantics.)
But it doesn't exactly match your test specification: Seq.length enumerates its entire input. Your "test code" calls Seq.length and then calls Seq.hd. That will generate an enumerator twice, and since there is no caching, things get messed up. I'm not sure if there is any clean way to allow multiple enumerators without caching. Frankly, seq<seq<'a>> may not be the best data structure for this problem.
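The double-enumeration point can be illustrated with a tiny stand-alone sketch, unrelated to the grouping code itself. The counter produced is a made-up helper that records how many elements the source has generated across all enumerations.

```fsharp
// Counts how many elements the source has produced across all enumerations.
let produced = ref 0

let source =
    seq {
        for x in 1 .. 3 do
            produced.Value <- produced.Value + 1
            yield x
    }

let length = Seq.length source   // first enumeration: walks every element
let first = Seq.head source      // second enumeration: starts again from the beginning
printfn "length = %d, first = %d, produced = %d" length first produced.Value
```

Because a seq is re-run for each enumerator, the counter ends up larger than the number of elements. With a stateful inner sequence that shares one underlying enumerator, the second pass does not restart cleanly, which is the problem described above.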
Anyway, here's the code:
type State<'a> = Unstarted | InnerOkay of 'a | NeedNewInner of 'a | Finished
// f() = true means the neighbors should be kept together
// f() = false means they should be split
let split_up (f : 'a -> 'a -> bool) (input : seq<'a>) =
    // simple unfold that assumes f captured a mutable variable
    let iter f = Seq.unfold (fun _ ->
        match f() with
        | Some(x) -> Some(x,())
        | None -> None) ()
    seq {
        let state = ref (Unstarted)
        use ie = input.GetEnumerator()
        let innerMoveNext() =
            match !state with
            | Unstarted ->
                if ie.MoveNext()
                then let cur = ie.Current
                     state := InnerOkay(cur); Some(cur)
                else state := Finished; None
            | InnerOkay(last) ->
                if ie.MoveNext()
                then let cur = ie.Current
                     if f last cur
                     then state := InnerOkay(cur); Some(cur)
                     else state := NeedNewInner(cur); None
                else state := Finished; None
            | NeedNewInner(last) -> state := InnerOkay(last); Some(last)
            | Finished -> None
        let outerMoveNext() =
            match !state with
            | Unstarted | NeedNewInner(_) -> Some(iter innerMoveNext)
            | InnerOkay(_) -> failwith "Move to next inner seq when current is active: undefined behavior."
            | Finished -> None
        yield! iter outerMoveNext }
open System
let groupContigs (contigTime : TimeSpan) (holey : seq<DateTime * int>) =
    split_up (fun (t1,_) (t2,_) -> (t2 - t1) <= contigTime) holey
// Test data
let numbers = {1 .. 15}
let contiguousTimeStamps =
    let baseTime = DateTime.Now
    seq { for n in numbers -> baseTime.AddMinutes(float n)}
let holeyData =
    Seq.zip contiguousTimeStamps numbers
    |> Seq.filter (fun (dateTime, num) -> num % 7 <> 0)
let grouped_data = groupContigs (new TimeSpan(0,1,0)) holeyData
printfn "Consuming..."
for group in grouped_data do
    printfn "about to do a group"
    for x in group do
        printfn " %A" x
Ok, here's an answer I'm not unhappy with.
(EDIT: I am unhappy - it's wrong! No time to try to fix right now though.)
It uses a bit of imperative state, but it is not too difficult to follow (provided you recall that '!' is the F# dereference operator, and not 'not'). It is as lazy as possible, and takes a seq as input and returns a seq of seqs as output.
let N = 20
let data = // produce some arbitrary data with holes
    seq {
        for x in 1..N do
            if x % 4 <> 0 && x % 7 <> 0 then
                printfn "producing %d" x
                yield x
    }
let rec GroupBy comp (input:seq<_>) = seq {
    let doneWithThisGroup = ref false
    let areMore = ref true
    use e = input.GetEnumerator()
    let Next() = areMore := e.MoveNext(); !areMore
    // deal with length 0 or 1, seed 'prev'
    if not(e.MoveNext()) then () else
    let prev = ref e.Current
    while !areMore do
        yield seq {
            while not(!doneWithThisGroup) do
                if Next() then
                    let next = e.Current
                    doneWithThisGroup := not(comp !prev next)
                    yield !prev
                    prev := next
                else
                    // end of list, yield final value
                    yield !prev
                    doneWithThisGroup := true }
        doneWithThisGroup := false }
let result = data |> GroupBy (fun x y -> y = x + 1)
printfn "Consuming..."
for group in result do
    printfn "about to do a group"
    for x in group do
        printfn " %d" x

Help Needed Creating a Binary Tree Given Truth Table

First, in order to provide full disclosure, I want to point out that this is related to homework in a Machine Learning class. This question is not the homework question and instead is something I need to figure out in order to complete the bigger problem of creating an ID3 Decision Tree Algorithm.
I need to generate a tree similar to the following when given a truth table
let learnedTree = Node(0,"A0", Node(2,"A2", Leaf(0), Leaf(1)), Node(1,"A1", Node(2,"A2", Leaf(0), Leaf(1)), Leaf(0)))
learnedTree is of type BinaryTree which I've defined as follows:
type BinaryTree =
    | Leaf of int
    | Node of int * string * BinaryTree * BinaryTree
ID3 algorithms take into account various equations to determine where to split the tree, and I've got all that figured out, I'm just having trouble creating the learned tree from my truth table. For example if I have the following table
A1 | A2 | A3 | Class
1    0    0    1
0    1    0    1
0    0    0    0
1    0    1    0
0    0    0    0
1    1    0    1
0    1    1    0
And I decide to split on attribute A1 I would end up with the following:
    (A1 = 1)          A1          (A1 = 0)
A2 | A3 | Class              A2 | A3 | Class
0    0    1                  1    0    1
0    1    0                  0    0    0
1    0    1                  0    0    0
                             0    1    1
Then I would split the left side and split the right side, and continue the recursive pattern until the leaf nodes are pure and I end up with a tree similar to the following based on the splitting.
let learnedTree = Node(0,"A0", Node(2,"A2", Leaf(0), Leaf(1)), Node(1,"A1", Node(2,"A2", Leaf(0), Leaf(1)), Leaf(0)))
Here is what I've kind of "hacked" together thus far, but I think I might be way off:
let rec createTree (listToSplit : list<list<float>>) index =
    let leftSideSplit =
        listToSplit |> List.choose (fun x -> if x.Item(index) = 1. then Some(x) else None)
    let rightSideSplit =
        listToSplit |> List.choose (fun x -> if x.Item(index) = 0. then Some(x) else None)
    if leftSideSplit.Length > 0 then
        let pureCheck = isListPure leftSideSplit
        if pureCheck = 0 then
            printfn "%s" "Pure left node class 0"
            createTree leftSideSplit (index + 1)
        else if pureCheck = 1 then
            printfn "%s" "Pure left node class 1"
            createTree leftSideSplit (index + 1)
        else
            printfn "%s - %A" "Recursing Left" leftSideSplit
            createTree leftSideSplit (index + 1)
    else printfn "%s" "Pure left node class 0"
Should I be using pattern matching instead? Any tips/ideas/help? Thanks a bunch!
Edit: I've since posted an implementation of ID3 on my blog at:
http://blogs.msdn.com/chrsmith
Hey Jim, I've been wanting to write a blog post implementing ID3 in F# for a while - thanks for giving me an excuse. While this code doesn't implement the algorithm fully (or correctly), it should be sufficient for getting you started.
In general you have the right approach - representing each branch as a discriminated union case is good. And like Brian said, List.partition is definitely a handy function. The trick to making this work correctly is all in determining the optimal attribute/value pair to split on - and to do that you'll need to calculate information gain via entropy, etc.
type Attribute = string
type Value = string
type Record =
    {
        Weather : string
        Temperature : string
        PlayTennis : bool
    }
    override this.ToString() =
        sprintf
            "{Weather = %s, Temp = %s, PlayTennis = %b}"
            this.Weather
            this.Temperature
            this.PlayTennis
type Decision = Attribute * Value
type DecisionTreeNode =
    | Branch of Decision * DecisionTreeNode * DecisionTreeNode
    | Leaf of Record list
// ------------------------------------
// Splits a record list into an optimal split and the left / right branches.
// (This is where you use the entropy function to maximize information gain.)
// Record list -> Decision * Record list * Record list
let bestSplit data =
    // Just group by weather, then by temperature
    let uniqueWeathers =
        List.fold
            (fun acc item -> Set.add item.Weather acc)
            Set.empty
            data
    let uniqueTemperatures =
        List.fold
            (fun acc item -> Set.add item.Temperature acc)
            Set.empty
            data
    if uniqueWeathers.Count = 1 then
        let bestSplit = ("Temperature", uniqueTemperatures.MinimumElement)
        let left, right =
            List.partition
                (fun item -> item.Temperature = uniqueTemperatures.MinimumElement)
                data
        (bestSplit, left, right)
    else
        let bestSplit = ("Weather", uniqueWeathers.MinimumElement)
        let left, right =
            List.partition
                (fun item -> item.Weather = uniqueWeathers.MinimumElement)
                data
        (bestSplit, left, right)
let rec determineBranch data =
    if List.length data < 4 then
        Leaf(data)
    else
        // Use the entropy function to break the dataset on
        // the category / value that best splits the data
        let bestDecision, leftBranch, rightBranch = bestSplit data
        Branch(
            bestDecision,
            determineBranch leftBranch,
            determineBranch rightBranch)
// ------------------------------------
let rec printID3Result indent branch =
    let padding = new System.String(' ', indent)
    match branch with
    | Leaf(data) ->
        data |> List.iter (fun item -> printfn "%s%s" padding <| item.ToString())
    | Branch(decision, lhs, rhs) ->
        printfn "%sBranch predicate [%A]" padding decision
        printfn "%sWhere predicate is true:" padding
        printID3Result (indent + 4) lhs
        printfn "%sWhere predicate is false:" padding
        printID3Result (indent + 4) rhs
// ------------------------------------
let dataset =
    [
        { Weather = "windy"; Temperature = "hot"; PlayTennis = false }
        { Weather = "windy"; Temperature = "cool"; PlayTennis = false }
        { Weather = "nice"; Temperature = "cool"; PlayTennis = true }
        { Weather = "nice"; Temperature = "cold"; PlayTennis = true }
        { Weather = "humid"; Temperature = "hot"; PlayTennis = false }
    ]
printfn "Given input list:"
dataset |> List.iter (printfn "%A")
printfn "ID3 split resulted in:"
let id3Result = determineBranch dataset
printID3Result 0 id3Result
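The bestSplit above is only a placeholder for the entropy-based choice. As a rough sketch of what that scoring could look like (not part of the code above; entropy, informationGain and the sample pairs are hypothetical names and data, and labels are assumed to be boolean), one might write:

```fsharp
// Hypothetical helpers, not part of the answer's code.

// Shannon entropy (in bits) of a list of boolean class labels.
let entropy (labels : bool list) =
    let total = float (List.length labels)
    if total = 0.0 then 0.0
    else
        labels
        |> List.countBy id
        |> List.sumBy (fun (_, count) ->
            let p = float count / total
            -p * System.Math.Log(p, 2.0))

// Entropy before the split minus the size-weighted entropy of both sides.
let informationGain (label : 'a -> bool) (predicate : 'a -> bool) (data : 'a list) =
    if List.isEmpty data then 0.0
    else
        let left, right = List.partition predicate data
        let weighted (part : 'a list) =
            let fraction = float (List.length part) / float (List.length data)
            fraction * entropy (List.map label part)
        entropy (List.map label data) - weighted left - weighted right

// Made-up sample: (weather, playTennis) pairs.
let sample =
    [ ("windy", false); ("windy", false); ("nice", true); ("nice", true); ("humid", false) ]

let gainNice = informationGain snd (fun (w, _) -> w = "nice") sample
let gainWindy = informationGain snd (fun (w, _) -> w = "windy") sample
```

Here splitting on "nice" leaves both sides pure, so it scores higher than splitting on "windy". A real bestSplit would presumably evaluate every attribute/value pair this way and keep the one with the largest gain.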
You can use List.partition instead of your two List.choose calls.
http://research.microsoft.com/en-us/um/cambridge/projects/fsharp/manual/FSharp.Core/Microsoft.FSharp.Collections.List.html
(or now http://msdn.microsoft.com/en-us/library/ee353738(VS.100).aspx )
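For illustration, a minimal sketch of that replacement, assuming the rows are float lists with binary attribute values as in the question (splitOn and the sample rows are hypothetical names and data):

```fsharp
// Rows are attribute values followed by the class label, as in the question.
let rows =
    [ [1.; 0.; 0.; 1.]
      [0.; 1.; 0.; 1.]
      [0.; 0.; 0.; 0.]
      [1.; 0.; 1.; 0.] ]

// One pass yields both sides: rows where the predicate holds, then the rest.
let splitOn index (data : float list list) =
    data |> List.partition (fun row -> List.item index row = 1.)

let leftSide, rightSide = splitOn 0 rows
// leftSide  = [[1.; 0.; 0.; 1.]; [1.; 0.; 1.; 0.]]
// rightSide = [[0.; 1.; 0.; 1.]; [0.; 0.; 0.; 0.]]
```

Note that the second list is simply "everything else", which matches the original `= 0.` filter only when every attribute value is 0 or 1.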
It isn't clear to me that pattern matching will buy you much here; the input type (list of lists) and processing (partitioning and 'pureness' check) don't really lend themselves to that.
And of course when you finally get to the 'end' (a pure list) you need to create a tree, and then presumably this function will create a Leaf when the input only has one 'side' and it's 'pure', but create a Node out of the left-side and right-side results for every other input. Maybe. I didn't quite grok the algorithm completely.
Hopefully that will help steer you a little bit. It may be useful to draw up a few smaller sample inputs and outputs to help work out the various cases of the function body.
Thanks Brian & Chris! I was actually able to figure this out and I ended up with the following. This calculates the information gain for determining the best place to split. I'm sure there are probably better ways for me to arrive at this solution especially around the chosen data structures, but this is a start. I plan to refine things later.
#light
open System
let trainList =
    [
        [1.;0.;0.;1.;];
        [0.;1.;0.;1.;];
        [0.;0.;0.;0.;];
        [1.;0.;1.;0.;];
        [0.;0.;0.;0.;];
        [1.;1.;0.;1.;];
        [0.;1.;1.;0.;];
        [1.;0.;0.;1.;];
        [0.;0.;0.;0.;];
        [1.;0.;0.;1.;];
    ]
type BinaryTree =
    | Leaf of int
    | Node of int * string * BinaryTree * BinaryTree
let entropyList nums =
    let sumOfnums =
        nums
        |> Seq.sum
    nums
    |> Seq.map (fun x -> if x=0.00 then x else (-((x/sumOfnums) * Math.Log(x/sumOfnums, 2.))))
    |> Seq.sum
let entropyBinaryList (dataListOfLists:list<list<float>>) =
    let classList =
        dataListOfLists
        |> List.map (fun x -> x.Item(x.Length - 1))
    let ListOfNo =
        classList
        |> List.choose (fun x -> if x = 0. then Some(x) else None)
    let ListOfYes =
        classList
        |> List.choose (fun x -> if x = 1. then Some(x) else None)
    let numberOfYes : float = float ListOfYes.Length
    let numberOfNo : float = float ListOfNo.Length
    let ListOfNumYesAndSumNo = [numberOfYes; numberOfNo]
    entropyList ListOfNumYesAndSumNo
let conditionalEntropy (dataListOfLists:list<list<float>>) attributeNumber =
    let NoAttributeList =
        dataListOfLists
        |> List.choose (fun x -> if x.Item(attributeNumber) = 0. then Some(x) else None)
    let YesAttributeList =
        dataListOfLists
        |> List.choose (fun x -> if x.Item(attributeNumber) = 1. then Some(x) else None)
    let numberOfYes : float = float YesAttributeList.Length
    let numberOfNo : float = float NoAttributeList.Length
    let noConditionalEntropy = (entropyBinaryList NoAttributeList) * (numberOfNo/(numberOfNo + numberOfYes))
    let yesConditionalEntropy = (entropyBinaryList YesAttributeList) * (numberOfYes/(numberOfNo + numberOfYes))
    [noConditionalEntropy; yesConditionalEntropy]
let findBestSplitIndex(listOfInstances : list<list<float>>) =
    let IGList =
        [0..(listOfInstances.Item(0).Length - 2)]
        |> List.mapi (fun i x -> (i, (entropyBinaryList listOfInstances) - (List.sum (conditionalEntropy listOfInstances x))))
    IGList
    |> List.maxBy snd
    |> fst
let isListPure (listToCheck : list<list<float>>) =
    let splitList = listToCheck |> List.choose (fun x -> if x.Item(x.Length - 1) = 1. then Some(x) else None)
    if splitList.Length = listToCheck.Length then 1
    else if splitList.Length = 0 then 0
    else -1
let rec createTree (listToSplit : list<list<float>>) =
    let pureCheck = isListPure listToSplit
    if pureCheck = 0 then
        printfn "%s" "Pure - Leaf(0)"
    else if pureCheck = 1 then
        printfn "%s" "Pure - Leaf(1)"
    else
        printfn "%A - is not pure" listToSplit
        if listToSplit.Length > 1 then // There are attributes we can split on
            // Choose best place to split list
            let splitIndex = findBestSplitIndex(listToSplit)
            printfn "splitting at index %A" splitIndex
            let leftSideSplit =
                listToSplit |> List.choose (fun x -> if x.Item(splitIndex) = 1. then Some(x) else None)
            let rightSideSplit =
                listToSplit |> List.choose (fun x -> if x.Item(splitIndex) = 0. then Some(x) else None)
            createTree leftSideSplit
            createTree rightSideSplit
        else
            printfn "%s" "Not Pure, but can't split choose based on heuristics - Leaf(0 or 1)"
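The createTree above only prints its decisions. A possible next refinement, sketched here under several assumptions, is to return a BinaryTree value instead. The names buildTree, classEntropy, majorityClass and classify are hypothetical; attributes are assumed to be 0. or 1.; the first subtree holds the rows where the attribute is 1 (as in the split diagram in the question); nodes are named "A<index>" like the sample learnedTree; and rows that cannot be separated any further fall back to the majority class.

```fsharp
type BinaryTree =
    | Leaf of int
    | Node of int * string * BinaryTree * BinaryTree

// Entropy (in bits) of the class column, i.e. the last element of each row.
let classEntropy (rows : float list list) =
    let total = float rows.Length
    if total = 0.0 then 0.0
    else
        rows
        |> List.countBy List.last
        |> List.sumBy (fun (_, n) ->
            let p = float n / total
            -p * System.Math.Log(p, 2.0))

// Assumption: when no attribute separates the rows, use the majority class.
let majorityClass (rows : float list list) =
    let ones = rows |> List.filter (fun r -> List.last r = 1.) |> List.length
    if 2 * ones >= rows.Length then 1 else 0

let rec buildTree (rows : float list list) =
    match rows |> List.map List.last |> List.distinct with
    | [] -> Leaf 0                 // assumption: empty input defaults to class 0
    | [c] -> Leaf (int c)          // pure: every row has the same class
    | _ ->
        let attributeCount = rows.Head.Length - 1
        // Keep only splits that put at least one row on each side.
        let candidates =
            [ for i in 0 .. attributeCount - 1 do
                let ones, zeros = rows |> List.partition (fun r -> List.item i r = 1.)
                if not ones.IsEmpty && not zeros.IsEmpty then
                    let weight (part : float list list) = float part.Length / float rows.Length
                    let remaining = weight ones * classEntropy ones + weight zeros * classEntropy zeros
                    yield (remaining, i, ones, zeros) ]
        match candidates with
        | [] -> Leaf (majorityClass rows)
        | _ ->
            // Lowest remaining entropy is the same as highest information gain.
            let _, i, ones, zeros = candidates |> List.minBy (fun (h, _, _, _) -> h)
            Node (i, sprintf "A%d" i, buildTree ones, buildTree zeros)

// Walks the tree using the same "attribute = 1 goes first" convention.
let rec classify tree (row : float list) =
    match tree with
    | Leaf c -> c
    | Node (i, _, ones, zeros) ->
        if List.item i row = 1. then classify ones row else classify zeros row

// The seven-row truth table from the question.
let table =
    [ [1.; 0.; 0.; 1.]; [0.; 1.; 0.; 1.]; [0.; 0.; 0.; 0.]; [1.; 0.; 1.; 0.]
      [0.; 0.; 0.; 0.]; [1.; 1.; 0.; 1.]; [0.; 1.; 1.; 0.] ]
let tree = buildTree table
```

Because every split leaves both sides non-empty, each recursive call works on a strictly smaller list, so the recursion terminates. For the table above, classify tree row returns each training row's own class.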

Resources