Can I sort a Deedle frame? - f#

From what I can tell a Deedle frame is only sorted by the index. Is there any way to apply a custom sorting function or sort by a given series (and define ascending/descending order)?

Sticking to a "standard" frame of type Frame<int,string> (row index of integers and column names of strings) it is easy to implement a function capable of reordering the frame based on any single column contents in ascending or descending order:
let reorder isAscending sortColumnName (frame:Frame<int,string>) =
    let result =
        frame
        |> Frame.indexRows sortColumnName
        |> Frame.orderRows
        |> Frame.indexRowsOrdinally
    if isAscending then result
    else
        result
        |> Frame.mapRowKeys ((-) 0)
        |> Frame.orderRows
        |> Frame.indexRowsOrdinally
A smoke test over the peopleList sample frame:
     Name    Age Countries
0 -> Joe     51  [UK; US; UK]
1 -> Tomas   28  [CZ; UK; US; CZ]
2 -> Eve     2   [FR]
3 -> Suzanne 15  [US]
reorder false "Name" peopleList returns the frame with Name sorted in descending order:
     Name    Age Countries
0 -> Tomas   28  [CZ; UK; US; CZ]
1 -> Suzanne 15  [US]
2 -> Joe     51  [UK; US; UK]
3 -> Eve     2   [FR]
while reorder true "Age" peopleList returns the frame with Age sorted in ascending order:
     Name    Age Countries
0 -> Eve     2   [FR]
1 -> Suzanne 15  [US]
2 -> Tomas   28  [CZ; UK; US; CZ]
3 -> Joe     51  [UK; US; UK]
Note, however, that this approach uses the sort column as the row index, so that column must not contain duplicate values. That requirement may well be a showstopper for this way of ordering a Deedle frame.

You can sort a Deedle frame based on the values in a named column, like so:
myFrame |> Frame.sortRowsBy "columnName" (fun v -> -v)  // descending
myFrame |> Frame.sortRowsBy "columnName" (fun v -> v)   // ascending
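As a minimal sketch of the Frame.sortRowsBy approach, assuming Deedle 1.0 or later referenced from NuGet in an F# script: the Person record and its sample values are hypothetical and only loosely mirror the peopleList frame. Because the sort goes through the column values rather than the row index, a duplicated Age value should not be a problem here.

```fsharp
#r "nuget: Deedle"
open Deedle

// Hypothetical sample data; note the duplicate Age value (28)
type Person = { Name: string; Age: int }

let people =
    [ { Name = "Joe"; Age = 51 }
      { Name = "Tomas"; Age = 28 }
      { Name = "Eve"; Age = 2 }
      { Name = "Suzanne"; Age = 28 } ]
    |> Frame.ofRecords

// Ascending: project each value to itself
let ascending = people |> Frame.sortRowsBy "Age" (fun (v: int) -> v)

// Descending: negate the numeric value before it is compared
let descending = people |> Frame.sortRowsBy "Age" (fun (v: int) -> -v)

// Read the Age column back in the new row order
let agesAscending = ascending.GetColumn<int>("Age") |> Series.values |> List.ofSeq
let agesDescending = descending.GetColumn<int>("Age") |> Series.values |> List.ofSeq
```

The negation trick only works for numeric columns; for other types a custom comparison (for example via Frame.sortRowsWith) would be needed.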

Deedle 1.0 adds further sorting functions for rows and columns, including:
Frame.sortRows
Frame.sortRowsWith
Frame.sortRowsBy

Related

What is the equivalent to pandas dataframe info() in Deedle?

Python's pandas library provides info() for getting a summary of a data frame.
For example:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Name           30 non-null     object
 1   PhoneNumber    30 non-null     object
 2   City           30 non-null     object
 3   Address        30 non-null     object
 4   PostalCode     30 non-null     object
 5   BirthDate      30 non-null     object
 6   Income         26 non-null     float64
 7   CreditLimit    30 non-null     object
 8   MaritalStatus  24 non-null     object
dtypes: float64(1), object(8)
memory usage: 2.2+ KB
Is there an equivalent in Deedle's data frame? Something that can get an overview for missing values and the inferred types.
There isn't a single function to do this - it would be a nice addition to the library if you wanted to consider sending a pull-request.
The following gets all the information you would need:
// Prints column names and types, with data preview
df.Print(true)
// Print key range of rows (or key sequence if it is not ordered)
if df.RowIndex.IsOrdered then printfn "%A" df.RowIndex.KeyRange
else printfn "%A" df.RowIndex.Keys
// Get access to the data of the frame so that we can inspect the columns
let dt = df.GetFrameData()
for n, (ty, vec) in Seq.zip dt.ColumnKeys dt.Columns do
    // Print name, type of column
    printf "%A %A" n ty
    // Query the internal data storage to see if it uses
    // array of optional values (may have nulls) or not
    match vec.Data with
    | Vectors.VectorData.DenseList _ -> printfn " (no nulls)"
    | _ -> printfn " (nulls)"
Based on Thomas's suggestion (thank you!) I modified it slightly to produce an output similar to pandas:
let info (df: Deedle.Frame<'a,'b>) =
    let dt = df.GetFrameData()
    let countOptionalValues d =
        d
        |> Seq.filter (
            function
            | OptionalValue.Present _ -> true
            | _ -> false
        )
        |> Seq.length
    Seq.zip dt.ColumnKeys dt.Columns
    |> Seq.map (fun (col, (ty, vec)) ->
        {|
            Column = col
            ``Non-Null Count`` =
                match vec.Data with
                | Vectors.VectorData.DenseList d -> $"%i{d |> Seq.length} non-null"
                | Vectors.VectorData.SparseList d -> $"%i{d |> countOptionalValues} non-null"
                | Vectors.VectorData.Sequence d -> $"%i{d |> countOptionalValues} non-null"
            Dtype = ty
        |}
    )
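If only the non-missing counts are needed, a shorter route may be to go through the column series instead of the internal vector storage. This is a sketch under the assumption that Deedle is referenced from NuGet and that the ValueCount property of a series reports its number of present values; nonMissingCounts and the sample frame are hypothetical names used for illustration only.

```fsharp
#r "nuget: Deedle"
open Deedle

// Count the present (non-missing) values of every column
let nonMissingCounts (df: Frame<'R, 'C>) =
    df.Columns
    |> Series.mapValues (fun (col: ObjectSeries<'R>) -> col.ValueCount)

// Hypothetical sample: "sparse" has no value for row key 2
let sample =
    frame [ "full" => series [ 1 => 1.0; 2 => 2.0; 3 => 3.0 ]
            "sparse" => series [ 1 => 10.0; 3 => 30.0 ] ]

let counts = nonMissingCounts sample
let fullCount = Series.get "full" counts
let sparseCount = Series.get "sparse" counts
```

This does not report column types, so it complements rather than replaces the info function above.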

A straightforward functional way to rename columns of a Deedle data frame

Is there a concise functional way to rename columns of a Deedle data frame f?
f.RenameColumns(...) is usable, but mutates the data frame it is applied to, so it's a bit of a pain to make the renaming operation idempotent. I have something like f.RenameColumns (fun c -> ( if c.IndexOf( "_" ) < 0 then c else c.Substring( 0, c.IndexOf( "_" ) ) ) + "_renamed"), which is ugly.
What would be nice is something that creates a new frame from the input frame, like this: Frame( f |> Frame.cols |> Series.keys |> Seq.map someRenamingFunction, f |> Frame.cols |> Series.values ) but this gets tripped up by the second part -- the type of f |> Frame.cols |> Series.values is not what is required by the Frame constructor.
How can I concisely transform f |> Frame.cols |> Series.values so that its result is edible by the Frame constructor?
You can pass the renaming function directly to RenameColumns:
df.RenameColumns someRenamingFunction
You can also use the function Frame.mapColKeys.
Builds a new data frame whose columns are the results of applying the
specified function on the columns of the input data frame. The
function is called with the column key and object series that
represents the column data.
Source
Example:
type Record = {Name:string; ID:int ; Amount:int}
let data =
    [|
        {Name = "Joe"; ID = 51; Amount = 50};
        {Name = "Tomas"; ID = 52; Amount = 100};
        {Name = "Eve"; ID = 65; Amount = 20};
    |]
let df = Frame.ofRecords data
let someRenamingFunction s =
    sprintf "%s(%i)" s s.Length
df.Format() |> printfn "%s"
let ndf = df |> Frame.mapColKeys someRenamingFunction
ndf.Format() |> printfn "%s"
df.RenameColumns someRenamingFunction
df.Format() |> printfn "%s"
Print:
     Name  ID Amount
0 -> Joe   51 50
1 -> Tomas 52 100
2 -> Eve   65 20
     Name(4) ID(2) Amount(6)
0 -> Joe     51    50
1 -> Tomas   52    100
2 -> Eve     65    20
     Name(4) ID(2) Amount(6)
0 -> Joe     51    50
1 -> Tomas   52    100
2 -> Eve     65    20
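The idempotency concern from the question can be handled in the renaming function itself, independently of Deedle. The following is a sketch only: renameOnce is a hypothetical helper that reproduces the question's rule (strip everything from the first underscore, then append "_renamed"), and the Deedle call is left as a comment because it assumes a frame f with string column keys.

```fsharp
// Strip any existing "_suffix" and append "_renamed".
// Applying the function twice yields the same name as applying it once.
let renameOnce (name: string) =
    let i = name.IndexOf('_')
    let stem = if i < 0 then name else name.Substring(0, i)
    stem + "_renamed"

let once = renameOnce "Amount"
let twice = renameOnce (renameOnce "Amount")

// With Deedle referenced, a new frame could be built without mutating f:
// let renamed = f |> Frame.mapColKeys renameOnce
```

Keeping the rule in a plain function makes it easy to check on its own before it is handed to Frame.mapColKeys.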

deedle aggregate/group based on running numbers in a column of Frame

Say I have a Frame that looks like this:
      Name    ID Amount
0  -> Joe     51 50
1  -> Tomas   52 100
2  -> Eve     65 20
3  -> Suzanne 67 10
4  -> Suassss 69 10
5  -> Suzanne 70 10
6  -> Suzanne 78 1
7  -> Suzanne 79 10
8  -> Suzanne 80 12
9  -> Suzanne 85 10
10 -> Suzanne 87 10
...
What I would like to achieve is to group or aggregate based on the ID column, such that whenever a sequence of running numbers is encountered, those rows are grouped together; otherwise, the row itself is a group.
I believe a recursive function is your friend here.
Feed a list of tuples
let data = [("Joe", 51, 50);
            ("Tomas", 52, 100);
            ("Eve", 65, 20);
            ("Suzanne", 67, 10)]
to a function
let groupBySequencialId list =
    // result: finished groups, acc: the group being built, lastId: previous ID
    let rec group result acc data lastId =
        match data with
        | [] -> acc :: result
        | (name, id, amount) :: tail ->
            if lastId + 1 = id then
                group result ((name, id, amount) :: acc) tail id
            else
                group (acc :: result) ([(name, id, amount)]) tail id
    group [] [] list 0
and you'll get the result you are looking for.
This should get the job done, save for three caveats:
You need to parse your string into the required tuples.
There's an empty list in the result set, because the first recursion won't match and appends the empty accumulator to the result set.
The list will come out reversed.
Also note that this is a highly specialized function.
If I were you, I'd try to make this more general if you ever plan on reusing it.
Have fun.
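The last two caveats can be avoided with a fold that only ever extends or starts a non-empty group and reverses once at the end. This is one possible sketch, not the answer's own code: groupByRunningId is a hypothetical name, and it still assumes the rows have already been turned into (name, id, amount) tuples.

```fsharp
// Groups consecutive rows whose IDs form a run (each ID = previous ID + 1).
let groupByRunningId (rows: (string * int * int) list) =
    let folder (groups: (string * int * int) list list) row =
        let (_, id, _) = row
        match groups with
        | ((_, lastId, _) :: _ as current) :: rest when lastId + 1 = id ->
            (row :: current) :: rest   // extend the current run
        | _ ->
            [ row ] :: groups          // start a new run
    rows
    |> List.fold folder []
    |> List.map List.rev               // restore row order within each group
    |> List.rev                        // restore the order of the groups

let sample =
    [ ("Joe", 51, 50); ("Tomas", 52, 100); ("Eve", 65, 20); ("Suzanne", 67, 10) ]

let grouped = groupByRunningId sample
// [[("Joe", 51, 50); ("Tomas", 52, 100)]; [("Eve", 65, 20)]; [("Suzanne", 67, 10)]]
```

Because a new group is only created when a row arrives, no empty list ends up in the result, and an empty input simply yields an empty list.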

Deedle Row based calculation

I am trying to use Deedle to do some row-based calculations; however, most of the examples are column-based. For example, I have this simple structure:
let tt = Series.ofObservations [ 1 => 10.0; 3 => 20.0; 5 => 30.0 ]
let tt2 = Series.ofObservations [ 1 => 10.0; 3 => Double.NaN; 6 => 30.0 ]
let f1 = frame [ "cola" => tt; "colb" => tt2 ]
val f1 : Frame<int,string> =
     cola      colb
1 -> 10        10
3 -> 20        <missing>
5 -> 30        <missing>
6 -> <missing> 30
I want to calculate the mean of cola and colb. if I do
f1.Rows |> Series.mapValues(fun r -> (r.GetAs<float>("cola") + r.GetAs<float>("colb") )/2.0)
val it : Series<int,float> =
1 -> 10
3 -> <missing>
5 -> <missing>
6 -> <missing>
I know I can pattern-match on each column to handle the mean, but this is not practical when there are lots of columns.
Each row returned by f1.Rows is an ObjectSeries. Can it be converted into a float Series so that Stats.mean can be applied to the row?
thanks
casbby
Update:
I think I might have found one way to do this (reference: https://github.com/BlueMountainCapital/Deedle/issues/100):
Folding operation:
f1.Rows |> Series.mapValues(fun v -> v.As<float>() |> Series.foldValues (fun acc elem -> elem + acc) 0.0 )
Mean (it properly skips the missing values):
f1.Rows |> Series.mapValues(fun v -> v.As<float>() |> Stats.mean )
Count:
f1.Rows |> Series.mapValues(fun v -> v.As<float>() |> Stats.count )
If there is a different way, please let me know. Hopefully this can be useful to newcomers like myself.
Your approach of using f1.Rows, casting each row to a numerical series and then applying Stats functions is exactly what I was going to suggest as an answer, so I think that approach makes perfect sense.
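For reference, the row-wise mean can be reproduced as a self-contained script. This is a sketch that assumes Deedle is referenced from NuGet; the frame is the one from the question, and rowMeans is a name introduced here for illustration.

```fsharp
#r "nuget: Deedle"
open System
open Deedle

let tt = series [ 1 => 10.0; 3 => 20.0; 5 => 30.0 ]
let tt2 = series [ 1 => 10.0; 3 => Double.NaN; 6 => 30.0 ]
let f1 = frame [ "cola" => tt; "colb" => tt2 ]

// Cast every row to a float series and average the values that are present
let rowMeans =
    f1.Rows
    |> Series.mapValues (fun row -> row.As<float>() |> Stats.mean)

let meanAtKey1 = Series.get 1 rowMeans
let meanAtKey3 = Series.get 3 rowMeans
let meanAtKey6 = Series.get 6 rowMeans
```

Rows with a single present value, such as the rows with keys 3 and 6, get that value as their mean instead of becoming missing.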
Another option that I can think of is to turn the frame into a de-normalized representation and then group the rows by the cola and colb values (so, you'll have all data as rows, but grouped by the other attribute):
let byCol =
    f1
    |> Frame.stack
    |> Frame.groupRowsByString "Column";;
This gives you:
          Row Column Value
cola 0 -> 1   cola   10
     2 -> 3   cola   20
     3 -> 5   cola   30
colb 1 -> 1   colb   10
     4 -> 6   colb   30
Now you can use functions working with hierarchical indices to do the calculations. For example, to compute mean of Value for the two groups, you can write:
byCol?Value |> Stats.levelMean fst
I'm not sure which approach I'd recommend at the moment; it probably depends on the other operations that you need to do with the data. But it's good to keep the alternative in mind.

How to join frames using F#'s Deedle where one of the frame has a composite key?

Say I have two frames, firstFrame (Frame<(int * int),string>) and secondFrame (Frame<int,string>). I'd like to find a way to join the frames such that the values from the first part of the composite key from firstFrame match the values from the key in secondFrame.
The following is an example of the frames that I'm working with:
val firstFrame : Deedle.Frame<(int * int),string> =
       Premia
1 1 -> 125
2 1 -> 135
3 1 -> 169
1 2 -> 231
2 2 -> 876
3 2 -> 24
val secondFrame : Deedle.Frame<int,string> =
     year month
1 -> 2014 Apr
2 -> 2014 May
3 -> 2014 Jun
Code used to generate the sample above:
#I @"w:\dev\packages\Deedle.0.9.12"
#load "Deedle.fsx"
open Deedle
open System
let periodMembers = [(1,1); (2,1); (3,1); (1,2); (2,2); (3,2)]
let premia = [125; 135; 169; 231; 876; 24]
let firstSeries = Series(periodMembers, premia)
let firstFrame = Frame.ofColumns ["Premia" => firstSeries]
let projectedYears = series [1 => 2014; 2 => 2014; 3 => 2014]
let projectedMonths = series [1 => "Apr"; 2 => "May"; 3 => "Jun"]
let secondFrame = Frame(["year"; "month"], [projectedYears; projectedMonths])
Unfortunately, the issue is still open. I think Tomas' solution does not work well with missing values and it changes the order of rows.
My solution:
// 1. Make the key available
let k1 =
    firstFrame
    |> Frame.mapRows (fun (k,_) _ -> k)
let first =
    firstFrame
    |> Frame.addCol "A" k1
// 2. Create a combined column via lookup
let combined =
    first.Columns.["A"].As<int>()
    |> Series.mapAll (fun _ vOpt ->
        vOpt
        |> Option.bind (fun v ->
            secondFrame.TryGetRow<string> v |> OptionalValue.asOption)
    )
// 3. Split into single columns and append
let result =
    secondFrame.ColumnKeys
    |> Seq.fold (fun acc key ->
        let col =
            combined
            |> Series.mapAll (fun _ sOpt ->
                sOpt
                |> Option.bind (fun s ->
                    s.TryGet key |> OptionalValue.asOption
                )
            )
        acc
        |> Frame.addCol key col) first
result.Print()
       Premia A year month
1 1 -> 125    1 2014 Apr
2 1 -> 135    2 2014 May
3 1 -> 169    3 2014 Jun
1 2 -> 231    1 2014 Apr
2 2 -> 876    2 2014 May
3 2 -> 24     3 2014 Jun
Great question! This is not as easy as it should be (and it is probably related to another question about joining frames that we recorded as an issue). I think there should be a nicer way to do this and I'll add a link to this question to the issue.
That said, you can use the fact that joining can align keys that do not match exactly to do this. You can first add zero as the second element of the key in the second frame:
> let m = secondFrame |> Frame.mapRowKeys (fun k -> k, 0);;
val m : Frame<(int * int),string> =
       year month
1 0 -> 2014 Apr
2 0 -> 2014 May
3 0 -> 2014 Jun
Now, the key in the second frame is always smaller than the matching keys in the first frame (assuming the numbers are positive). So, for example, we want to align a value in the second frame with key (1, 0) to the values in the first frame with keys (1, 1), (1, 2), and so on. You can use Lookup.NearestSmaller to tell Deedle that you want to find a value with the nearest smaller key (which will be (1, 0) for any key (1, k)).
To use this, you first need to sort the first frame, but then it works nicely:
> firstFrame.SortByRowKey().Join(m, JoinKind.Left, Lookup.NearestSmaller);;
val it : Frame<(int * int),string> =
       Premia year month
1 1 -> 125    2014 Apr
  2 -> 231    2014 Apr
2 1 -> 135    2014 May
  2 -> 876    2014 May
3 1 -> 169    2014 Jun
  2 -> 24     2014 Jun
This is not particularly obvious, but it does the trick. Although, I hope we can come up with a nicer way!
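Another possibility, when the second frame is small, is to re-key each of its columns by the composite keys of the first frame and add the results as new columns. The following is a sketch only, assuming Deedle is referenced from NuGet: lookupByPeriod and joined are hypothetical names, and Series.get throws if a period has no entry in the lookup series, so missing keys are not handled here.

```fsharp
#r "nuget: Deedle"
open Deedle

let periodMembers = [ (1, 1); (2, 1); (3, 1); (1, 2); (2, 2); (3, 2) ]
let premia = [ 125; 135; 169; 231; 876; 24 ]
let firstFrame = Frame.ofColumns [ "Premia" => Series(periodMembers, premia) ]

let projectedYears = series [ 1 => 2014; 2 => 2014; 3 => 2014 ]
let projectedMonths = series [ 1 => "Apr"; 2 => "May"; 3 => "Jun" ]

// Build a series keyed by firstFrame's composite keys; the first key
// element selects the value from the lookup series.
let lookupByPeriod (lookup: Series<int, 'V>) =
    firstFrame.RowKeys
    |> Seq.map (fun (period, m) -> (period, m) => Series.get period lookup)
    |> series

let joined =
    firstFrame
    |> Frame.addCol "year" (lookupByPeriod projectedYears)
    |> Frame.addCol "month" (lookupByPeriod projectedMonths)

let monthFor12 = joined.GetColumn<string>("month") |> Series.get (1, 2)
let yearFor31 = joined.GetColumn<int>("year") |> Series.get (3, 1)
```

This keeps the original row order of firstFrame and does not require sorting it first, at the cost of one pass over the row keys per added column.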
