Deedle Row based calculation - f#

I am trying to use Deedle to do some row-based calculations. However, most of the examples are column-based. For example, I have this simple structure:
let tt = Series.ofObservations[ 1=>10.0; 3=>20.0;5=> 30.0 ]
let tt2 = Series.ofObservations[1=> 10.0; 3=> Double.NaN; 6=>30.0 ]
let f1 = frame ["cola" => tt; "colb"=>tt2]
val f1 : Frame<int,string> =
cola colb
1 -> 10 10
3 -> 20 <missing>
5 -> 30 <missing>
6 -> <missing> 30
I want to calculate the mean of cola and colb for each row. If I do
f1.Rows |> Series.mapValues(fun r -> (r.GetAs<float>("cola") + r.GetAs<float>("colb") )/2.0)
val it : Series<int,float> =
1 -> 10
3 -> <missing>
5 -> <missing>
6 -> <missing>
I know I can pattern match on each column to handle the mean, but this is not practical if there are lots of columns.
Each row returned by f1.Rows is an ObjectSeries. Can it be converted into a float Series so that Stats.mean can be applied to the row?
thanks
casbby
Update:
I think I might have found one of the ways to do this (reference: https://github.com/BlueMountainCapital/Deedle/issues/100):
folding operation:
f1.Rows |> Series.mapValues(fun v -> v.As<float>() |> Series.foldValues (fun acc elem -> elem + acc) 0.0 )
mean (it properly skips the missing values):
f1.Rows |> Series.mapValues(fun v -> v.As<float>() |> Stats.mean )
count:
f1.Rows |> Series.mapValues(fun v -> v.As<float>() |> Stats.count )
If there is a different way, please let me know. Hopefully this can be useful to newcomers like myself.

Your approach of using f1.Rows, casting each row to a numerical series and then applying Stats functions is exactly what I was going to suggest as an answer, so I think that approach makes perfect sense.
Another option that I can think of is to turn the frame into a de-normalized representation and then group the rows by the column name (so all the data ends up as rows, grouped by the column they came from):
let byCol =
    f1
    |> Frame.stack
    |> Frame.groupRowsByString "Column";;
This gives you:
          Row Column Value
cola 0 -> 1   cola   10
     2 -> 3   cola   20
     3 -> 5   cola   30
colb 1 -> 1   colb   10
     4 -> 6   colb   30
Now you can use functions working with hierarchical indices to do the calculations. For example, to compute mean of Value for the two groups, you can write:
byCol?Value |> Stats.levelMean fst
I'm not sure which approach I'd recommend at the moment. It probably depends on the other operations that you need to do with the data, but it's good to keep the alternative in mind.

Related

Drop duplicates except for the first occurrence with Deedle

I have a table with one key with duplicate values. I would like to drop/reduce all duplicate keys but preserve the first row of each duplicate.
let data = "A;B\na;1\nb;\nb;2\nc;3"
let bytes = System.Text.Encoding.UTF8.GetBytes data
let stream = new MemoryStream( bytes )
let df =
    Frame.ReadCsv(
        stream = stream,
        separators = ";",
        hasHeaders = true
    )
df.Print()
A B
0 -> a 1
1 -> b <missing>
2 -> b 2
3 -> c 3
The result should be
A B
0 -> a 1
1 -> b <missing>
2 -> c 3
I have tried applyLevel but I only get the value not the first entry:
let df1 =
    df
    |> Frame.groupRowsByString "A"
    |> Frame.applyLevel fst (fun s -> s |> Series.firstValue)
df1.Print()
A B
a -> a 1
b -> b 2 <- wrong
c -> c 3
This is essentially a duplicate of a previous SO question. The short answer is:
let df1 =
    df
    |> Frame.groupRowsByString "A"
    |> Frame.nest                       // convert to a series of frames
    |> Series.mapValues (Frame.take 1)  // take the first row from each frame
    |> Frame.unnest                     // convert back to a single frame
    |> Frame.mapRowKeys snd
df1.Print()
The output is:
A B
0 -> a 1
1 -> b <missing>
3 -> c 3
I've added a call to Frame.mapRowKeys at the end to match your desired output as closely as possible. Note that the actual output differs slightly from your expected output, because row 3 -> c 3 has original index 3 instead of 2. I think this is more correct, but you can renumber the rows if necessary.
The referenced question has more details.
Using Frame.nest/Frame.unnest is a reasonable solution, but I have noticed that it is a little bit slow.
My solution records the first row key seen for each group key in a Map and then keeps only the rows that match it:
let dropDuplicates (df:Frame<_,_>) =
    let selectedMap =
        df.RowKeys
        |> Seq.fold (fun (m:Map<'A,'B>) (a,b) ->
            if m.ContainsKey a then m else m |> Map.add a b) Map.empty
    df
    |> Frame.filterRows(fun (a,b) _ ->
        match selectedMap.TryFind a with
        | Some entry -> entry = b
        | _ -> false)
let df1 =
    df
    |> Frame.groupRowsByString "A"
    |> dropDuplicates
df1.Print()
       A B
a 0 -> a 1
b 1 -> b <missing>
c 3 -> c 3

Append columns with typed access to rows with Deedle

Let's say I want to:
1. Read persons from CSV with name, weight and height, with typed access to whole rows. The weight is optional. E.g.
Name Weight Height
0 -> Joe 80 1.88
1 -> Ann <missing> 1.66
2. Apply a given transformation, e.g. calculate BMI.
3. Save the result in a new column and get a new Frame without changing the old one.
4. Read the result of 3 and apply another transformation, e.g. round.
Result:
Name Weight Height BMI BMI_rounded
0 -> Joe 80 1.88 22.6346763241 22.6
1 -> Ann <missing> 1.66 <missing> <missing>
It is not clear to me how to make an appendColumn function.
I have tried:
#r "nuget: Deedle"
open Deedle
open System.IO
type IPerson =
    abstract Name : string
    abstract Weight : OptionalValue<int>
    abstract Height: float
type IPersonWithBmi =
    // the inherit declaration has to come before the abstract members
    inherit IPerson
    abstract Bmi : OptionalValue<float>
let data = "Name;Weight;Height\nJoe;80;1.88\nAnn;;1.66"
//https://stackoverflow.com/questions/44344061/converting-a-string-to-a-stream/44344794
let bytes = System.Text.Encoding.UTF8.GetBytes data
let stream = new MemoryStream( bytes )
let df:Frame<int,string> = Frame.ReadCsv(
    stream = stream,
    separators = ";",
    hasHeaders = true
    )
let getBMI (h:float) (w:int):Option<float> = if h>0.0 then Some ((float w)/(h*h)) else None
df.Print()
let rows = df.GetRowsAs<IPerson>()
rows |> Series.mapValues (fun (row:IPerson) -> row.Weight |> OptionalValue.asOption |> Option.bind (getBMI row.Height) |> OptionalValue.ofOption)
//Pseudocode
let df2 = appendColumn df "BMI" rows
let round (f:float):int = int (System.Math.Round(f, 0))
let rows2 = df.GetRowsAs<IPersonWithBmi>()
rows2 |> Series.mapValues (fun (row:IPersonWithBmi) -> row.Bmi |> OptionalValue.map round )
//Pseudocode
let df3 = appendColumn df2 "BMI_rounded" rows2
This is what Frame.Join does. So if you have a BMI series like this:
let bmi =
    df
    |> Frame.mapRowValues (fun row ->
        row.TryGetAs<int>("Weight")
        |> OptionalValue.asOption
        |> Option.bind (getBMI <| row.GetAs<float>("Height"))
        |> OptionalValue.ofOption)
You can join it to your frame like this:
let df2 = df.Join("BMI", bmi)
df2.Print()
Output is:
Name Weight Height BMI
0 -> Joe 80 1.88 22.634676324128566
1 -> Ann <missing> 1.66 <missing>
The rounded BMI column could be created the same way.
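As a minimal sketch of that last step, the rounded column can be derived from the BMI column and joined in the same way. The small `df2` built here is only a stand-in for the frame above (it holds just the BMI column, with NaN standing for the missing value), and rounding to one decimal is an assumption chosen to match the 22.6 in the desired result.

```fsharp
#r "nuget: Deedle"
open System
open Deedle

// Stand-in for df2 above: only the BMI column; NaN becomes a missing value.
let bmi = Series.ofObservations [ 0 => 22.634676324128566; 1 => nan ]
let df2 = frame [ "BMI" => bmi ]

// Series.mapValues only touches present values, so missing BMIs stay missing.
let bmiRounded =
    df2.GetColumn<float>("BMI")
    |> Series.mapValues (fun v -> Math.Round(v, 1))

// Join returns a new frame; df2 itself is left unchanged.
let df3 = df2.Join("BMI_rounded", bmiRounded)
df3.Print()
```

Because Join builds a new frame each time, the same pattern can be chained for further derived columns.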

Deedle - Distinct by column

I had a situation the other day where a particular column of my Frame had some duplicate values.
I wanted to remove any rows where said column had a duplicate value.
I managed to hack a solution using a filter function, and while it was good enough for the exploratory data analysis at hand, it was way more painful than it should have been.
Despite searching high and low, I could not find any ideas for an elegant solution.
I also noticed that Series doesn't offer a DistinctBy() or similar either.
How do you do a "DistinctBy" operation for a specific column or columns?
One way to do it is using nest and unnest, something like this:
let noDuplicates: Frame<(int*string), string> =
    df1
    |> Frame.groupRowsBy "Tomas"
    |> Frame.nest
    |> Series.mapValues (Frame.take 1)
    |> Frame.unnest
Let's explain each step. Imagine you have this dataframe:
// Create from individual observations (row * column * value)
let df1 =
    [ ("Monday", "Tomas", 1); ("Tuesday", "Adam", 2)
      ("Tuesday", "Tomas", 4); ("Wednesday", "Tomas", -5)
      ("Thursday", "Tomas", 4); ("Thursday", "Adam", 5) ]
    |> Frame.ofValues
Tomas Adam
Monday -> 1 <missing>
Tuesday -> 4 2
Wednesday -> -5 <missing>
Thursday -> 4 5
And you want to remove rows containing duplicate values in the "Tomas" column.
First, group by this column.
let df2 : Frame<(int * string), string> = df1 |> Frame.groupRowsBy "Tomas"
                Tomas Adam
1  Monday    -> 1     <missing>
4  Tuesday   -> 4     2
4  Thursday  -> 4     5
-5 Wednesday -> -5    <missing>
Now you have a frame with a two-level index, which you can turn into a series of data frames.
let df3 = df2 |> Frame.nest
Tomas Adam
Monday -> 1 <missing>
Tomas Adam
Tuesday -> 4 2
Thursday -> 4 5
Tomas Adam
Wednesday -> -5 <missing>
Take the first row of each frame.
let df4 = df3 |> Series.mapValues (fun fr -> fr |> Frame.take 1)
Tomas Adam
Monday -> 1 <missing>
Tomas Adam
Tuesday -> 4 2
Tomas Adam
Wednesday -> -5 <missing>
It remains to perform the backwards conversion: from a series of data frames into a frame with a two-level index.
let df5 = df4 |> Frame.unnest
                Tomas Adam
-5 Wednesday -> -5    <missing>
1  Monday    -> 1     <missing>
4  Tuesday   -> 4     2
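A possible alternative, sketched here under the assumption that the "Tomas" column holds int values, is to collect the row keys of the first occurrence of each value and filter on them. This keeps the original single-level row index and the original row order; rows whose "Tomas" value is missing are dropped, since Series.observations skips them. The names `firstKeys` and `distinctByTomas` are made up for illustration.

```fsharp
#r "nuget: Deedle"
open Deedle

let df1 =
    [ ("Monday", "Tomas", 1); ("Tuesday", "Adam", 2)
      ("Tuesday", "Tomas", 4); ("Wednesday", "Tomas", -5)
      ("Thursday", "Tomas", 4); ("Thursday", "Adam", 5) ]
    |> Frame.ofValues

// Row keys of the first occurrence of each distinct value in "Tomas".
let firstKeys =
    df1.GetColumn<int>("Tomas")
    |> Series.observations   // (rowKey, value) pairs, missing values skipped
    |> Seq.distinctBy snd    // keeps the first element for each distinct value
    |> Seq.map fst
    |> Set.ofSeq

// Keep only the rows whose key was recorded above.
let distinctByTomas =
    df1 |> Frame.filterRows (fun key _ -> firstKeys.Contains key)

distinctByTomas.Print()
```

Whether this is faster than nest/unnest on a given data set would need to be measured; the sketch only shows the shape of the approach.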

Immutable or not? Deedle frame filtering

This question might look a little trivial, but it does happen in our process because the data is not clean. I have a data frame that looks like this:
let tt = Series.ofObservations[ 1=>10.0; 3=>20.0;5=> 30.0; 6=> 40.0; ]
let tt2 = Series.ofObservations[1=> Double.NaN; 3=> 5.5; 6=>Double.NaN ]
let tt3 = Series.ofObservations[1=> "aaa"; 3=> "bb"; 6=>"ccc" ]
let f1 = frame ["cola" => tt; "colb"=>tt2;]
f1.AddColumn("colc", tt3)
f1.Print();;
cola colb colc
1 -> 10 <missing> aaa
3 -> 20 5.5 bb
5 -> 30 <missing> <missing>
6 -> 40 <missing> ccc
I need to filter out every row before the first row with a value in colb:
cola colb colc
3 -> 20 5.5 bb
5 -> 30 <missing> <missing>
6 -> 40 <missing> ccc
The only solution I can come up with uses a mutable flag, which breaks the integrity of functional programming. Maybe this filtering of the missing head can be hidden in a library, but it still makes me wonder whether I am doing it the right way.
let flag = ref false
let filteredF1 = f1 |> Frame.filterRows(fun k v ->
    match !flag, v.TryGetAs<float>("colb") with
    | false, OptionalValue.Missing -> flag := false
    | false, _ -> flag := true
    | true, _ -> ()
    !flag)
This is not really a Deedle problem; it has more to do with how to achieve this with immutability. Something easily achievable in Python and VBA seems to be very hard to do in F#.
In statistical calculations, situations like this arise when multiple series have different starting times. After the starting point, retaining the data points that contain missing values is important, because a missing value means something.
Any advice is appreciated.
cassby
Here is my preferred way:
// find first index having non-null value in column b
let idx =
    f1?colb
    |> Series.observationsAll
    |> Seq.skipWhile (function | (_, None) -> true | _ -> false)
    |> Seq.head
    |> fst;;
// slice frame
f1.Rows.[idx .. ];;
If you wrap your code into a function (I modified it a little, but have not tested it at all!!)
let dropTil1stNonMissingB frame =
    let flag = ref false
    // the row needs a type annotation so that TryGetAs can be resolved
    let kernel k (v: ObjectSeries<string>) =
        flag := !flag || v.TryGetAs<float>("colb").HasValue
        !flag
    Frame.filterRows kernel frame
then your code just looks purely functional:
let filteredF1 = f1 |> dropTil1stNonMissingB
As long as the use of the reference is restricted to a narrow scope, it should be acceptable. Immutability is not the final goal of functional programming; it is only a guiding principle for writing good code.
In fact, the Deedle developers should have provided their own version of Seq.fold for Frame:
http://msdn.microsoft.com/en-us/library/ee353471.aspx
Then you could have used it with (new Frame([],[]), false) as the initial 'State. Roughly speaking, you should be able to translate any loops in C, Python or whatever imperative language to fold (aka fold_left or foldl), though it isn't necessarily the way to go.
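To illustrate the fold idea without relying on any Deedle API, here is a minimal sketch over a plain F# list. The `rows` list is a hypothetical stand-in for the (key, colb) pairs of the frame, with None for a missing value, and `step` is an illustrative name; the state carried through the fold is the pair (rows kept so far, whether a value has been seen yet).

```fsharp
// Hypothetical stand-in for the frame: (row key, colb) with None for missing.
let rows = [ 1, None; 3, Some 5.5; 5, None; 6, None ]

// One fold step: once a value has been seen, every later row is kept.
let step (acc, seen) (key, colb) =
    let seen = seen || Option.isSome colb
    (if seen then (key, colb) :: acc else acc), seen

// The accumulator is built in reverse, so it is reversed at the end.
let kept =
    rows
    |> List.fold step ([], false)
    |> fst
    |> List.rev

printfn "%A" kept
```

The flag still exists, but it lives in the fold state instead of a mutable reference.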
You might as well define it as an extension method of Frame.
type Frame with
    member frame.DropTil1stNonMissingB =
        ...
let filteredF1 = f1.DropTil1stNonMissingB
http://fsharpforfunandprofit.com/posts/type-extensions/

Can I sort a Deedle frame?

From what I can tell a Deedle frame is only sorted by the index. Is there any way to apply a custom sorting function or sort by a given series (and define ascending/descending order)?
Sticking to a "standard" frame of type Frame<int,string> (row index of integers and column names of strings) it is easy to implement a function capable of reordering the frame based on any single column contents in ascending or descending order:
let reorder isAscending sortColumnName (frame:Frame<int,string>) =
    let result =
        frame
        |> Frame.indexRows sortColumnName
        |> Frame.orderRows
        |> Frame.indexRowsOrdinally
    if isAscending then result
    else
        result
        |> Frame.mapRowKeys ((-) 0)
        |> Frame.orderRows
        |> Frame.indexRowsOrdinally
A smoke test over peopleList sample frame:
Name Age Countries
0 -> Joe 51 [UK; US; UK]
1 -> Tomas 28 [CZ; UK; US; CZ]
2 -> Eve 2 [FR]
3 -> Suzanne 15 [US]
reorder false "Name" peopleList returns the frame where Name is sorted in descending order
Name Age Countries
0 -> Tomas 28 [CZ; UK; US; CZ]
1 -> Suzanne 15 [US]
2 -> Joe 51 [UK; US; UK]
3 -> Eve 2 [FR]
while reorder true "Age" peopleList returns the frame where Age is sorted in ascending order
Name Age Countries
0 -> Eve 2 [FR]
1 -> Suzanne 15 [US]
2 -> Tomas 28 [CZ; UK; US; CZ]
3 -> Joe 51 [UK; US; UK]
Nevertheless, the requirement that the column to be sorted contain no duplicate values might be considered a showstopper for this approach to ordering a Deedle frame.
You can sort a Deedle frame based on the values in a named column, like so:
myFrame |> Frame.sortRowsBy "columnName" (fun v -> -v) // descending
myFrame |> Frame.sortRowsBy "columnName" (fun v -> v) // ascending
Deedle 1.0 has additional sorting features for rows and columns:
Frame.sortRows
Frame.sortRowsWith
Frame.sortRowsBy
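A small sketch of two of these functions on made-up data follows. The `Person` record and the `people` frame are hypothetical, loosely modelled on the peopleList sample above, and the lambda is annotated with `int` on the assumption that the "Age" column holds integers.

```fsharp
#r "nuget: Deedle"
open Deedle

// Hypothetical sample data for illustration only.
type Person = { Name: string; Age: int }

let people =
    [ { Name = "Joe"; Age = 51 }
      { Name = "Tomas"; Age = 28 }
      { Name = "Eve"; Age = 2 }
      { Name = "Suzanne"; Age = 15 } ]
    |> Frame.ofRecords

// Ascending by the values in the "Age" column.
let youngestFirst = people |> Frame.sortRows "Age"

// Descending: sort by a projection of the column value (here, its negation).
let oldestFirst = people |> Frame.sortRowsBy "Age" (fun (age: int) -> -age)

youngestFirst.Print()
oldestFirst.Print()
```

Negating the value only works for numeric columns; for other types a custom comparison via Frame.sortRowsWith is probably the better fit.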
