Deedle moving window stats calcuation with a dynamic condition and boundary.atending - f#

I am using a dynamic moving window to calculation simple stats on a series ordered on the date key. I want to be able to set the boundary at the end of the window. for example a timeseries with monthly moving average, the monthly is decided by a
(fun d1 d2 -> d1.addMonths(1) <= d2)
however the deedle series function
windowWhileInto cond f series
always uses the begin as the boundary. Therefore, it always creates produce a n datapoints series from the first data instance for the next n data points (n is decided by the fun above). i would like to have a n datapoints series from the nth data and look backwards into the past.
I also tried to use Series.Rev first to reverse the series but deedle think that series although in a reversed order is no longer ordered.
Is what i am looking for possible?

If you look at the list of aggregation functions in the docs, you'll find a function aggregate that is a generalization of all the windowing & chunking functions and also takes a key selector.
This means that you can do something like this:
ts |> Series.aggregateInto
(WindowWhile(fun d1 d2 -> d1.AddMonths(1) >= d2)) // Aggregation to perform
(fun seg -> seg.Data.LastKey()) // Key selector (use last)
(fun ds -> OptionalValue(ds.Data)) // Value selector
The function takes 3 parameters including key selector and a function that gets "data segment" (which has the window together with a flag whether it is complete or incomplete - e.g. at the end of windowing).
Sadly, this does not quite work here, because it will create a series with duplicate keys (and those are not supported by Deedle). The windows at the end of the chunk will all end with the same date and so you'll get duplicate keys (it actually runs, but you cannot do much with the series).
An ugly workaround is to remember the last chunk's end and return missing values once the end starts repeating:
let lastKey = ref None
let r =
ts |> Series.aggregateInto
(WindowWhile(fun d1 d2 -> d1.AddMonths(1) >= d2)) (fun seg -> seg.Data.LastKey())
(fun ds ->
match lastKey.Value, ds.Data.LastKey() with
| Some lk, clk when lk = clk -> OptionalValue.Missing
| _, clk -> lastKey := Some clk; OptionalValue(ds.Data))
|> Series.dropMissing
EDIT: I logged a GitHub issue for this.

Related

Applying function repeatedly to generate List

I currently have this f# function
let collatz' n =
match n with
| n when n <= 0 -> failwith "collatz' :n is zero or less"
| n when even n = true -> n / 2
| n when even n = false -> 3 * n + 1
Any tips for solving the following problem in F#?
As said in the comments, you need to give a bit more information for any really specific advice, but based on what you have I'll add the following.
The function you have declared satisfies the definition of the Collatz function i.e. even numbers -> n/2 ,and
odd number -> 3n + 1.
So really you only need applyN, let's break it down into its pieces
( `a -> `a) -> `a -> int -> `a list
applyN f n N
That definition is showing you exactly what the function expects.
lets look at f through to N
f -> a function that takes some value of type 'a (in your case likely int) and produces a new value of type 'a.
This corresponds to the function you have already written collatz`
n -> is your seed value. I don't think elaboration is required.
N -> This looks like a maximum amount of steps to go through. In the example posted, if N was larger, you would see a loop [ 1 ;4; 2; 1; 4... ]
and if it was smaller it would stop sooner.
So that is what the function takes and need to do, so how can we achieve this?
I would suggest making use of scan.
The scan function is much like fold, but it returns each interim state in a list.
Another option would be making use of Seq.unfold and then only taking the first few values.
Now, I could continue and give some source code, but I think you should try yourself for now.

How to count number of non-empty nodes in binary tree in F#

Consider the binary tree algebraic datatype
type btree = Empty | Node of btree * int * btree
and a new datatype deļ¬ned as follows:
type finding = NotFound | Found of int
Heres my code so far:
let s = Node (Node(Empty, 5, Node(Empty, 2, Empty)), 3, Node (Empty, 6, Empty))
(*
(3)
/ \
(5) (6)
/ \ | \
() (2) () ()
/ \
() ()
*)
(* size: btree -> int *)
let rec size t =
match t with
Empty -> false
| Node (t1, m, t2) -> if (m != Empty) then sum+1 || (size t1) || (size t2)
let num = occurs s
printfn "There are %i nodes in the tree" num
This probably isn't close, I took a function that would find if an integer existed in a tree and tried changing the code for what I was trying to do.
I am very new to using F# and would appreciate any help. I am trying to count all non empty nodes in the tree. For example the tree I'm using should print the value 4.
I did not run the compiler on your code, but I believe this does even compile.
However your idea to use a pattern match in a recursive function is good.
As rmunn commented, you want to determine the number of nodes in each case:
An empty tree has no nodes, hence the result is zero.
A non-empty tree, has at least the root node plus the count of its left and right subtrees.
So something along the lines of the following should work
let rec size t =
match t with
| Empty -> 0
| Node (t1, _, t2) -> 1 + (size t1) + (size t2)
The most important detail here is, that you do not need a global variable sum to store any intermediate values. The whole idea of a recursive function is that those intermediate values are the results of recursive calls.
As a remark, your tree in the comment should look like this, I believe.
(*
(3)
/ \
(5) (6)
/ \ | \
() (2) () ()
/ \
() ()
*)
Edit: I misread the misaligned () as leaves of an empty tree, where in fact they are leaves of the subtree (2). So it was just an ASCII art issue :-)
Friedrich already posted a simple version of the size function that will work for most trees. However, the solution is not "tail-recursive", so it can cause a Stack Overflow for large trees. In functional programming languages like F#, recursion is often the preferred technique for things like counting and other aggregate functions. However, recursive functions generally consume a stack frame for each recursive call. This means that for large structures, the call stack can be exhausted before the function completes. In order to avoid this problem, compilers can optimize functions that are considered "tail-recursive" so that they use only one stack frame regardless of how many times they recurse. Unfortunately, this optimization cannot just be implemented for any recursive algorithm. It requires that the recursive call be the last thing that the function does, thereby ensuring that the compiler does not have to worry about jumping back into the function after the call, allowing it to overwrite the stack frame instead of adding another one.
In order to change the size function to be tail-recursive, we need some way to avoid having to call it twice in the case of a non-empty node, so that the call can be the last step of the function, instead of the addition between the two calls in Friedrich's solution. This can be accomplished using a couple different techniques, generally either using an accumulator or using Continuation Passing Style. The simpler solution is often to use an accumulator to keep track of the total size instead of having it be the return value, while Continuation Passing Style is a more general solution that can handle more complex recursive algorithms.
In order to make an accumulator pattern work for a tree where we have to sum both the left and right sub-trees, we need some way to make one tail-call at the end of the function, while still making sure that both sub-trees are evaluated. A simple way to do that is to also accumulate the right sub-trees in addition to the total count, so we can make subsequent tail-calls to evaluate those trees while evaluating the left sub-trees first. That solution might look something like this:
let size t =
let rec size acc ts = function
| Empty ->
match ts with
| [] -> acc
| head :: tail -> head |> size acc tail
| Node (t1, _, t2) ->
t1 |> size (acc + 1) (t2 :: ts)
t |> size 0 []
This adds the acc parameter and the ts parameter to represent the total count and remaining unevaluated sub-trees. When we hit a populated node, we evaluate the left sub-tree while adding the right sub-tree to our list of trees to evaluate later. When we hit the an empty node, we start evaluating any ts we've accumulated, until we have no further populated nodes or unevaluated sub-trees. This isn't the best possible solution for computing the tree-size, and most real solutions would use Continuation Passing Style to make it tail-recusive, but that should make a good exercise as you get more familiar with the language.

Custom Operator for Lag or Standard Deviation

What is the proper way to extend the available operators when using RX?
I'd like to build out some operations that I think would be useful.
The first operation is simply the standard deviation of a series.
The second operation is the nth lagged value i.e. if we are lagging 2 and our series is A B C D E F when F is pushed the lag would be D when A is pushed the lag would be null/empty when B is pushed the lag would be null/empty when C is pushed the Lag would be A
Would it make sense to base these types of operators off of the built-ins from rx.codeplex.com or is there an easier way?
In idiomatic Rx, arbitrary delays can be composed by Zip.
let lag (count : int) o =
let someo = Observable.map Some o
let delayed = Observable.Repeat(None, count).Concat(someo)
Observable.Zip(someo, delayed, (fun c d -> d))
As for a rolling buffer, the most efficient way is to simply use a Queue/ResizeArray of fixed size.
let rollingBuffer (size : int) o =
Observable.Create(fun (observer : IObserver<_>) ->
let buffer = new Queue<_>(size)
o |> Observable.subscribe(fun v ->
buffer.Enqueue(v)
if buffer.Count = size then
observer.OnNext(buffer.ToArray())
buffer.Dequeue() |> ignore
)
)
For numbers |> rollingBuffer 3 |> log:
seq [0L; 1L; 2L]
seq [1L; 2L; 3L]
seq [2L; 3L; 4L]
seq [3L; 4L; 5L]
...
For pairing adjacent values, you can just use Observable.pairwise
let delta (a, b) = b - a
let deltaStream = numbers |> Observable.pairwise |> Observable.map(delta)
Observable.Scan is more concise if you want to apply a rolling calculation .
Some of these are easier than others (as usual). For a 'lag' by count (rather than time) you just create a sliding window by using Observable.Buffer equivalent to the size of 'lag', then take the first element of the result list.
So far lag = 3, the function is:
obs.Buffer(3,1).Select(l => l.[0])
This is pretty straightforward to turn into an extension function. I don't know if it is efficient in that it reuses the same list, but in most cases that shouldn't matter. I know you want F#, the translation is straightforward.
For running aggregates, you can usually use Observable.Scan to get a 'running' value. This is calculated based on all values seen so far (and is pretty straightforward to implement) - ie all you have to implement each subsequent element is the previous aggregate and the new element.
If for whatever reason you need a running aggregate based on a sliding window, then we get into more difficult domain. Here you first need an operation that can give you a sliding window - this is covered by Buffer above. However, then you need to know which values have been removed from this window, and which have been added.
As such, I recommend a new Observable function that maintains an internal window based on existing window + new value, and returns new window + removed value + added value. You can write this using Observable.Scan (I recommend an internal Queue for efficient implementation). It should take a function for determining which values to remove given a new value (this way it can be parameterised for sliding by time or by count).
At that point, Observable.Scan can again be used to take the old aggregate + window + removed values + added value and give a new aggregate.
Hope this helps, I do realise it's a lot of words. If you can confirm the requirement, I can help out with the actual extension method for that specific use case.
For lag, you could do something like
module Observable =
let lag n obs =
let buf = System.Collections.Generic.Queue()
obs |> Observable.map (fun x ->
buf.Enqueue(x)
if buf.Count > n then Some(buf.Dequeue())
else None)
This:
Observable.Range(1, 9)
|> Observable.lag 2
|> Observable.subscribe (printfn "%A")
|> ignore
prints:
<null>
<null>
Some 1
Some 2
Some 3
Some 4
Some 5
Some 6
Some 7

Apply several aggregate functions with one enumeration

Let's assume I have a series of functions that work on a sequence, and I want to use them together in the following fashion:
let meanAndStandardDeviation data =
let m = mean data
let sd = standardDeviation data
(m, sd)
The code above is going to enumerate the sequence twice. I am interested in a function that will give the same result but enumerate the sequence only once. This function will be something like this:
magicFunction (mean, standardDeviation) data
where the input is a tuple of functions and a sequence and the ouput is the same with the function above.
Is this possible if the functions mean and stadardDeviation are black boxes and I cannot change their implementation?
If I wrote mean and standardDeviation myself, is there a way to make them work together? Maybe somehow making them keep yielding the input to the next function and hand over the result when they are done?
The only way to do this using just a single iteration when the functions are black boxes is to use the Seq.cache function (which evaluates the sequence once and stores the results in memory) or to convert the sequence to other in-memory representation.
When a function takes seq<T> as an argument, you don't even have a guarantee that it will evaluate it just once - and usual implementations of standard deviation would first calculate the average and then iterate over the sequence again to calculate the squares of errors.
I'm not sure if you can calculate standard deviation with just a single pass. However, it is possible to do that if the functions are expressed using fold. For example, calculating maximum and average using two passes looks like this:
let maxv = Seq.fold max Int32.MinValue input
let minv = Seq.fold min Int32.MaxValue input
You can do that using a single pass like this:
Seq.fold (fun (s1, s2) v ->
(max s1 v, min s2 v)) (Int32.MinValue, Int32.MaxValue) input
The lambda function is a bit ugly, but you can define a combinator to compose two functions:
let par f g (i, j) v = (f i v, g j v)
Seq.fold (par max min) (Int32.MinValue, Int32.MaxValue) input
This approach works for functions that can be defined using fold, which means that they consist of some initial value (Int32.MinValue in the first example) and then some function that is used to update the initial (previous) state when it gets the next value (and then possibly some post-processing of the result). In general, it should be possible to rewrite single-pass functions in this style, but I'm not sure if this can be done for standard deviation. It can be definitely done for mean:
let (count, sum) = Seq.fold (fun (count, sum) v ->
(count + 1.0, sum + v)) (0.0, 0.0) input
let mean = sum / count
What we're talking about here is a function with the following signature:
(seq<'a> -> 'b) * (seq<'a> -> 'c) -> seq<'a> -> ('b * 'c)
There is no straightforward way that I can think of that will achieve the above using a single iteration of the sequence if that is the signature of the functions. Well, no way that is more efficient than:
let magicFunc (f1:seq<'a>->'b, f2:seq<'a>->'c) (s:seq<'a>) =
let cached = s |> Seq.cache
(f1 cached, f2 cached)
That ensures a single iteration of the sequence itself (perhaps there are side effects, or it's slow), but does so by essentially caching the results. The cache is still iterated another time. Is there anything wrong with that? What are you trying to achieve?

Pairwise Sequence Processing to compare db tables

Consider the following Use case:
I want to iterate through 2 db tables in parallel and find differences and gaps/missing records in either table. Assume that 1) pk of table is an Int ID field; 2) the tables are read in ID order; 3) records may be missing from either table (with corresponding sequence gaps).
I'd like to do this in a single pass over each db - using lazy reads. (My initial version of this program uses sequence objects and the data reader - unfortunately makes multiple passes over each db).
I've thought of using pairwise sequence processing and use Seq.skip within the iterations to try and keep the table processing in sync. However apparently this is very slow as I Seq.skip has a high overhead (creating new sequences under the hood) so this could be a problem with a large table (say 200k recs).
I imagine this is a common design pattern (compare concurrent data streams from different sources) and am interested in feedback/comments/links to similar projects.
Anyone care to comment?
Here's my (completely untested) take, doing a single pass over both tables:
let findDifferences readerA readerB =
let idsA, idsB =
let getIds (reader:System.Data.Common.DbDataReader) =
reader |> LazyList.unfold (fun reader ->
if reader.Read ()
then Some (reader.GetInt32 0, reader)
else None)
getIds readerA, getIds readerB
let onlyInA, onlyInB = ResizeArray<_>(), ResizeArray<_>()
let rec impl a b =
let inline handleOnlyInA idA as' = onlyInA.Add idA; impl as' b
let inline handleOnlyInB idB bs' = onlyInB.Add idB; impl a bs'
match a, b with
| LazyList.Cons (idA, as'), LazyList.Cons (idB, bs') ->
if idA < idB then handleOnlyInA idA as'
elif idA > idB then handleOnlyInB idB bs'
else impl as' bs'
| LazyList.Nil, LazyList.Nil -> () // termination condition
| LazyList.Cons (idA, as'), _ -> handleOnlyInA idA as'
| _, LazyList.Cons (idB, bs') -> handleOnlyInB idB bs'
impl idsA idsB
onlyInA.ToArray (), onlyInB.ToArray ()
This takes two DataReaders (one for each table) and returns two int[]s which indicate the IDs that were only present in their respective table. The code assumes that the ID field is of type int and is at ordinal index 0.
Also note that this code uses LazyList from the F# PowerPack, so you'll need to get that if you don't already have it. If you're targeting .NET 4.0 then I strongly recommend getting the .NET 4.0 binaries which I've built and hosted here, as the binaries from the F# PowerPack site only target .NET 2.0 and sometimes don't play nice with VS2010 SP1 (see this thread for more info: Problem with F# Powerpack. Method not found error).
When you use sequences, any lazy function adds some overhead on the sequence. Calling Seq.skip thousands of times on the same sequence will clearly be slow.
You can use Seq.zip or Seq.map2 to process two sequences at a time:
> Seq.map2 (+) [1..3] [10..12];;
val it : seq<int> = seq [11; 13; 15]
If the Seq module is not enough, you might need to write your own function.
I'm not sure if I understand what you try to do, but this sample function might help you:
let fct (s1: seq<_>) (s2: seq<_>) =
use e1 = s1.GetEnumerator()
use e2 = s2.GetEnumerator()
let rec walk () =
// do some stuff with the element of both sequences
printfn "%d %d" e1.Current e2.Current
if cond1 then // move in both sequences
if e1.MoveNext() && e2.MoveNext() then walk ()
else () // end of a sequence
elif cond2 then // move to the next element of s1
if e1.MoveNext() then walk()
else () // end of s1
elif cond3 then // move to the next element of s2
if e2.MoveNext() then walk ()
else () // end of s2
// we need at least one element in each sequence
if e1.MoveNext() && e2.MoveNext() then walk()
Edit :
The previous function was meant to extend functionality of the Seq module, and you'll probably want to make it a high-order function. As ildjarn said, using LazyList can lead to cleaner code:
let rec merge (l1: LazyList<_>) (l2: LazyList<_>) =
match l1, l2 with
| LazyList.Cons(h1, t1), LazyList.Cons(h2, t2) ->
if h1 <= h2 then LazyList.cons h1 (merge t1 l2)
else LazyList.cons h2 (merge l1 t2)
| LazyList.Nil, l2 -> l2
| _ -> l1
merge (LazyList.ofSeq [1; 4; 5; 7]) (LazyList.ofSeq [1; 2; 3; 6; 8; 9])
But I still think you should separate the iteration of your data, from the processing. Writing a high-order function to iterate is a good idea (at the end, it's not annoying if the iterator function code uses mutable enumerators).

Resources