A nested dictionary contains the following data: {names, {dates, prices}}.
Its structure is defined like this:
type Dict<'T, 'U> = System.Collections.Generic.Dictionary<'T, 'U>
let masterDict = Dict<string, Dict<DateTime, float>> ()
The raw data looks like:
> masterDict.Keys |> printfn "%A"
seq ["Corn Future"; "Wheat Future"]
> masterDict.["Corn Future"] |> printfn "%A"
seq [[2009-09-01, 316.69]; [2009-09-02, 316.09]; [2009-09-03, 316.33]; ...]
> masterDict.["Wheat Future"] |> printfn "%A"
seq [[2009-09-01, 214.4]; [2009-09-02, 223.86]; [2009-09-03, 234.11]; [2009-09-04, 224.62]; ...]
I'm trying to full outer join the data above into a Deedle frame like so:
Corn Future Wheat Future
2009-09-01 316.69 214.4
2009-09-02 316.09 223.86
2009-09-03 316.33 234.11
2009-09-04 NaN 224.62 // in case a point is not available
The mechanics of Deedle are still alien to me. Any help would be appreciated.
There are some extension methods in the Deedle library (mainly to make it friendly to C# too) that work with KeyValuePair rather than tuples (which are the default in F#).
So you should be able to simplify the answer that Foggy Finder posted a little (assuming you have open Deedle at the top):
let frame =
    masterDict
    |> Seq.map (fun kv -> kv.Key, kv.Value.ToSeries())
    |> Frame.ofColumns
frame.Format() |> printfn "%s"
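For a self-contained test, a small masterDict can be built like this, reusing the Dict alias from the question and the sample values quoted above. Dates present in only one dictionary (such as 2009-09-04 for corn) come out as <missing> in the resulting frame, which is Deedle's counterpart of the NaN cells in the desired output:
open System
open Deedle

let corn = Dict<DateTime, float> ()
corn.[DateTime(2009, 9, 1)] <- 316.69
corn.[DateTime(2009, 9, 2)] <- 316.09
corn.[DateTime(2009, 9, 3)] <- 316.33

let wheat = Dict<DateTime, float> ()
wheat.[DateTime(2009, 9, 1)] <- 214.4
wheat.[DateTime(2009, 9, 2)] <- 223.86
wheat.[DateTime(2009, 9, 3)] <- 234.11
wheat.[DateTime(2009, 9, 4)] <- 224.62

let masterDict = Dict<string, Dict<DateTime, float>> ()
masterDict.["Corn Future"] <- corn
masterDict.["Wheat Future"] <- wheat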
Not sure what you are going to join, but you can just transform:
let frame =
    masterDict
    |> Seq.map (fun kv ->
        kv.Key,
        kv.Value
        |> Seq.map (fun nkv -> nkv.Key, nkv.Value)
        |> Series.ofObservations)
    |> Frame.ofColumns
frame.Format() |> printfn "%s"
Then you get:
Corn Future Wheat Future
01.09.2009 0:00:00 -> 316,69 214,4
02.09.2009 0:00:00 -> 316,09 223,86
03.09.2009 0:00:00 -> 316,33 234,11
04.09.2009 0:00:00 -> <missing> 224,62
I am working on a data-intensive app and I am not sure whether I should use Series/DataFrame. It seems very interesting, but it also looks much slower than the equivalent done with a List... though I may not be using the Series properly when I filter.
Please let me know what you think.
Thanks
open System
open Deedle

type TSPoint<'a> =
    {
        Date : System.DateTime
        Value : 'a
    }

type TimeSerie<'a> = TSPoint<'a> list
let sd = System.DateTime(1950, 2, 1)
let tsd = [1..100000] |> List.map (fun x -> sd.AddDays(float x))

// creating a List of TSPoint
let tsList = tsd |> List.map (fun x -> { Date = x; Value = 1. })

// creating the same as a series
let tsSeries = Series(tsd, [1..100000] |> List.map (fun _ -> 1.))

// function to "randomise" the list of dates
let shuffleG xs = xs |> List.sortBy (fun _ -> Guid.NewGuid())

// new date list to search within our tsList and tsSeries
let d = tsd |> shuffleG |> List.take 1000

// Filter
d |> List.map (fun x -> tsList |> List.filter (fun y -> y.Date = x))
d |> List.map (fun x -> tsSeries |> Series.filter (fun key _ -> key = x))
Here is what I get:
List -> Real: 00:00:04.780, CPU: 00:00:04.508, GC gen0: 917, gen1: 2, gen2: 1
Series -> Real: 00:00:54.386, CPU: 00:00:49.311, GC gen0: 944, gen1: 7, gen2: 3
In general, Deedle series and data frames do have some extra overhead over hand-crafted code that uses the most efficient data structure for a given problem. The overhead is small for some operations and larger for others, so it depends on what you want to do and how you use Deedle.
If you use Deedle the way it was intended to be used, you will get good performance; if you run a large number of operations that are not particularly efficient, you may get bad performance.
In your particular case, you are calling Series.filter 1000 times, and creating a new series on each call (which is what happens behind the scenes here) does have some overhead.
However, what your code really does is use Series.filter to find a value with a specific key. Deedle provides a key-based lookup operation for exactly this (and it is one of the things it has been optimized for).
If you rewrite the code as follows, you will get much better performance with Deedle than with a list:
d |> List.map (fun x -> tsSeries.[x])
// 0.001 seconds
d |> List.map (fun x -> (tsSeries |> Series.filter (fun key _ -> key = x)))
// 3.46 seconds
d |> List.map (fun x -> (tsList |> List.filter (fun y -> y.Date = x)))
// 40.5 seconds
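For reference, timings in the format quoted above (Real, CPU, GC generations) come from F# Interactive's #time directive. A minimal sketch of how to reproduce them, assuming the definitions from the question are already loaded (absolute numbers will of course vary by machine):
#time "on"
d |> List.map (fun x -> tsSeries.[x]) |> ignore
d |> List.map (fun x -> tsSeries |> Series.filter (fun key _ -> key = x)) |> ignore
d |> List.map (fun x -> tsList |> List.filter (fun y -> y.Date = x)) |> ignore
#time "off"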
In some languages after one goes through a lazy sequence it becomes exhausted. That is not the case with F#:
let mySeq = seq [1..5]
mySeq |> Seq.iter (fun x -> printfn "%A" <| x)
mySeq |> Seq.iter (fun x -> printfn "%A" <| x)
1
2
3
4
5
1
2
3
4
5
However, it looks like one can go only once through the rows of a CSV provider:
open FSharp.Data
[<Literal>]
let foldr = __SOURCE_DIRECTORY__ + @"\data\"
[<Literal>]
let csvPath = foldr + @"AssetInfoFS.csv"
type AssetsInfo = CsvProvider<Sample=csvPath,
HasHeaders=true,
ResolutionFolder=csvPath,
AssumeMissingValues=false,
CacheRows=false>
let assetInfo = AssetsInfo.Load(csvPath)
assetInfo.Rows |> Seq.iter (fun x -> printfn "%A" <| x) // Works fine 1st time
assetInfo.Rows |> Seq.iter (fun x -> printfn "%A" <| x) // 2nd time exception
Why does that happen?
The CSV Type Provider is built on top of the CSV Parser (see the CSV Parser documentation). The parser works in streaming mode, most likely by calling a method like File.ReadLines, which throws an exception if its enumerator is enumerated a second time. The CSV Parser also has a Cache method. Try setting CacheRows=true (or leaving it out of the declaration, since its default value is true) to avoid this issue:
CsvProvider<Sample=csvPath,
HasHeaders=true,
ResolutionFolder=csvPath,
AssumeMissingValues=false,
CacheRows=true>
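Alternatively, if you want to keep CacheRows=false (for example, to limit memory use while loading a large file), you can materialize the rows yourself once and then iterate the resulting list as often as you like. A minimal sketch, assuming the same assetInfo value as above:
// Force a single pass over the file and keep the rows in memory;
// the resulting list can be enumerated any number of times.
let cachedRows = assetInfo.Rows |> Seq.toList
cachedRows |> Seq.iter (fun x -> printfn "%A" x)
cachedRows |> Seq.iter (fun x -> printfn "%A" x) // no exception this time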
The enumerator over the underlying file is forward-only: after the first loop it is positioned at the end of the data, so there is nothing left to yield.
If you want to go through the rows again, you have to start over, either by reloading the data or by caching the rows.
I am currently learning functional programming and F#, and I want to loop over a list only up to index n-2. For example:
Given a list of doubles, find the pairwise average,
e.g. pairwiseAverage [1.0; 2.0; 3.0; 4.0; 5.0] will give [1.5; 2.5; 3.5; 4.5]
After doing some experimenting and searching, I have a few ways to do it:
Method 1:
let pairwiseAverage (data: List<double>) =
    [ for j in 0 .. data.Length - 2 do
          yield (data.[j] + data.[j+1]) / 2.0 ]
Method 2:
let pairwiseAverage (data: List<double>) =
    let averageWithNone acc next =
        match acc with
        | (_, None) -> ([], Some next)
        | (result, Some prev) -> (((prev + next) / 2.0) :: result, Some next)
    let resultTuple = List.fold averageWithNone ([], None) data
    match resultTuple with
    | (x, _) -> List.rev x
Method 3:
let pairwiseAverage (data: List<double>) =
    // Get elements from 1 .. n-1
    let after = List.tail data
    // Get elements from 0 .. n-2
    let before =
        data
        |> List.rev
        |> List.tail
        |> List.rev
    List.map2 (fun x y -> (x + y) / 2.0) before after
I would just like to know if there are other ways to approach this problem. Thank you.
Using only built-ins:
list |> Seq.windowed 2 |> Seq.map Array.average
Seq.windowed n gives you sliding windows of n elements each.
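For instance, applied to the list from the question (a quick FSI check, with the output shown as comments):
[1.0; 2.0; 3.0; 4.0; 5.0] |> Seq.windowed 2 |> Seq.toList
// [[|1.0; 2.0|]; [|2.0; 3.0|]; [|3.0; 4.0|]; [|4.0; 5.0|]]
[1.0; 2.0; 3.0; 4.0; 5.0] |> Seq.windowed 2 |> Seq.map Array.average |> Seq.toList
// [1.5; 2.5; 3.5; 4.5]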
Another simple way is to use Seq.pairwise, with something like:
list |> Seq.pairwise |> Seq.map (fun (a,b) -> (a+b)/2.0)
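On the same example list, Seq.pairwise produces the adjacent pairs that the map then averages:
[1.0; 2.0; 3.0; 4.0; 5.0] |> Seq.pairwise |> Seq.toList
// [(1.0, 2.0); (2.0, 3.0); (3.0, 4.0); (4.0, 5.0)]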
The approaches suggested above are appropriate for short windows, like the one in the question, but pairwise cannot be used for windows longer than 2. The answer by hlo generalizes to wider windows and is a clean and fast approach as long as the window is not too large. For very wide windows the code below runs faster, as it only adds one number and subtracts another from the value obtained for the previous window. Note that Seq.map2 (and Seq.map) automatically deal with sequences of different lengths.
let movingAverage (n: int) (xs: float list) =
    let init = xs |> Seq.take n |> Seq.sum
    let additions = Seq.map2 (fun x y -> x - y) (Seq.skip n xs) xs
    Seq.fold (fun m x -> (List.head m + x) :: m) [init] additions
    |> List.rev
    |> List.map (fun (x: float) -> x / float n)
let xs = [1.0 .. 1000000.0]
movingAverage 1000 xs
// Real: 00:00:00.265, CPU: 00:00:00.265, GC gen0: 10, gen1: 10, gen2: 0
For comparison, the function above performs the same calculation about 60 times faster than the windowed equivalent:
let windowedAverage (n: int) (xs: float list) =
    xs
    |> Seq.windowed n
    |> Seq.map Array.average
    |> Seq.toList
windowedAverage 1000 xs
// Real: 00:00:15.634, CPU: 00:00:15.500, GC gen0: 74, gen1: 74, gen2: 71
I tried to eliminate List.rev using foldBack but did not succeed.
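One way to avoid the reversal is to build the running sums front-to-back with Seq.scan instead of a fold; a minimal sketch (the name movingAverageScan is made up for illustration, and it has not been benchmarked here):
let movingAverageScan (n: int) (xs: float list) =
    // Seq.scan emits the running sums in order, so no List.rev is needed.
    let init = xs |> Seq.take n |> Seq.sum
    let additions = Seq.map2 (-) (Seq.skip n xs) xs
    additions
    |> Seq.scan (+) init
    |> Seq.map (fun s -> s / float n)
    |> Seq.toList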
A point-free approach:
let pairwiseAverage = List.pairwise >> List.map ((<||) (+) >> (*) 0.5)
Usually not a better way, but another way regardless... ;-]
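For readers unfamiliar with the operators involved: (<||) applies a curried function to a tuple, so (<||) (+) sums the pair produced by List.pairwise, and (*) 0.5 then halves the result. An equivalent, more explicit form (the name is made up for comparison):
let pairwiseAverageExplicit =
    List.pairwise >> List.map (fun (a, b) -> (a + b) * 0.5)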
Suppose I have the following data:
var1,var2,var3
0.942856823,0.568425866,0.325885379
1.227681099,1.335672206,0.925331054
1.952671045,1.829479996,1.512280854
2.45428731,1.990174152,1.534456808
2.987783477,2.78975186,1.725095748
3.651682331,2.966399127,1.972274564
3.768010479,3.211381506,1.993080807
4.509429614,3.642983433,2.541071547
4.81498729,3.888415006,3.218031802
Here is the code:
open System.IO
open MathNet.Numerics.LinearAlgebra
open MathNet.Numerics.Statistics

let rows =
    [| for line in File.ReadAllLines("Z:\\mypath.csv") |> Seq.skip 1 do
           yield line.Split(',') |> Array.map float |]
let data = DenseMatrix.ofRowArrays rows
let data_logdiff =
    DenseMatrix.init (data.RowCount - 1) data.ColumnCount
        (fun j i -> if j = 0 then 0. else data.At(j, i) / data.At(j-1, i) |> log)
let alpha = vector [ for i in data_logdiff.EnumerateColumns() -> i |> Statistics.Mean ]
let sigsq (values: Vector<float>) (avg: float) =
    let sqr x = x * x
    let result = values |> (fun i -> sqr (i - avg))
    result
sigsq (data_logdiff.Column(i), alpha.[0]) |> printfn "%A"
Error: The type ''a * 'b' is not compatible with the type 'Vector<float>'
This is all for a broadcast operation between a matrix and a vector. All these acrobatics to do a simple mean((y-alpha).^2) in MATLAB.
You have a mistake in your code, and the F# compiler complains about it, albeit in a somewhat obscure way. You define your function:
let sigsq (values:Vector<float>) (avg: float) =
This is a function that takes two arguments. (Actually it's a function taking one argument, returning another function taking one argument.) But you call it like this:
sigsq (data_logdiff.Column(i), alpha.[0]) |> printfn "%A"
You tuple the arguments, and for F# functions (a,b) is one argument, which is a tuple. You should call your function like this:
sigsq (data_logdiff.Column(0)) (alpha.[0])
or
sigsq <| data_logdiff.Column(0) <| alpha.[0]
and my favorite one:
data_logdiff.Column(0) |> sigsq <| alpha.[0]
I replaced the (i) with 0 in your code. You can map through the columns if you want to loop:
data_logdiff.EnumerateColumnsIndexed() |> Seq.map (fun (i,col) -> sigsq col alpha.[i])
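To make the curried-versus-tupled distinction concrete, here is a minimal sketch (the function names are made up for the example):
let curried (a: float) (b: float) = a + b    // float -> float -> float
let tupled (a: float, b: float) = a + b      // (float * float) -> float

curried 1.0 2.0        // two separate arguments
tupled (1.0, 2.0)      // one argument, which happens to be a tuple
// curried (1.0, 2.0)  // does not compile: a tuple is not a float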
I have the following variable:
data:seq<(DateTime*float)>
and I want to do something like the following F# code but using Deedle:
data
|> Seq.groupBy (fun (k, v) -> k.Year)
|> Seq.map (fun (k, v) ->
    let vals = v |> Seq.pairwise
    let first = seq { yield v |> Seq.head }
    let diffs = vals |> Seq.map (fun ((t0, v0), (t1, v1)) -> (t1, v1 - v0))
    (k, diffs |> Seq.append first))
|> Seq.collect snd
This works fine using F# sequences but I want to do it using Deedle series. I know I can do something like:
(data: Series<DateTime, float>) |> Series.groupBy (fun k v -> k.Year)...
But then I need to take the within-group year diffs, except for the head value, which should just be the value itself, and then flatten the results into one series... I am a bit confused by the Deedle syntax.
Thanks!
I think the following might be doing what you need:
ts
|> Series.groupInto
    (fun k _ -> k.Month)
    (fun m s ->
        let first = series [ fst s.KeyRange => s.[fst s.KeyRange] ]
        Series.merge first (Series.diff 1 s))
|> Series.values
|> Series.mergeAll
The groupInto function lets you specify a function that is called on each of the groups.
For each group, we create a series with the differences using Series.diff and prepend a series containing the first value using Series.merge.
At the end, we take all the nested series and flatten them using Series.mergeAll.
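As a small, self-contained illustration of how this might be used (the sample dates and values are made up; the grouping here is by month, as in the snippet above):
open System
open Deedle

let ts =
    series [
        DateTime(2017, 1, 1) => 1.0
        DateTime(2017, 1, 2) => 3.0
        DateTime(2017, 1, 3) => 6.0
        DateTime(2017, 2, 1) => 10.0
        DateTime(2017, 2, 2) => 15.0 ]

let result =
    ts
    |> Series.groupInto
        (fun k _ -> k.Month)
        (fun m s ->
            let first = series [ fst s.KeyRange => s.[fst s.KeyRange] ]
            Series.merge first (Series.diff 1 s))
    |> Series.values
    |> Series.mergeAll

// Expected contents: each month starts with its own first value,
// followed by the day-to-day differences within that month:
// 2017-01-01 -> 1.0, 2017-01-02 -> 2.0, 2017-01-03 -> 3.0,
// 2017-02-01 -> 10.0, 2017-02-02 -> 5.0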