Deedle - how to select rows by slicing? - F#

I want to select some rows based on value comparisons across multiple columns.
Say the data frame (df) looks like this:
  one  two  three  four
a   0    1      2     3
b   4    5      6     7
c   8    9     10    11
d  12   13     14    15
I want to get the rows where col["two"] >= 5 and col["four"] <= 11.
In Python with pandas this can be done simply:
df[(df["two"] >= 5) & (df["four"] <= 11)]
How can I do this in F# with Deedle?
Thanks!

You can use the Frame.filterRows function:
df |> Frame.filterRows (fun k row ->
  row?two >= 5.0 && row?four <= 11.0)
Having something like the pandas slicing syntax would be nice, but we have not found an entirely smooth way of doing that yet, so the filter function is the current way of doing it. The good thing is that this is consistent with other filter operations, such as those for lists and sequences.
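As a side note, if you do not need the row key, Deedle also provides Frame.filterRowValues, which passes just the row (a small sketch using the same predicate):
df |> Frame.filterRowValues (fun row ->
  row?two >= 5.0 && row?four <= 11.0)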
Here, I'm using row?two, which is a simplified notation for getting numeric (floating-point) values from a data frame; you could use row.GetAs<T>("two") for values of non-numeric types.
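For example, filtering on a string-typed column might look like this (just a sketch - the "Name" column is hypothetical and not part of the sample frame below):
df |> Frame.filterRows (fun k row ->
  row.GetAs<string>("Name").StartsWith("a"))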
Just for a reference, here is my sample data set:
let df =
  Frame.ofRows
    [ 'a' => Series.ofValues [ 0; 1; 2; 3 ]
      'b' => Series.ofValues [ 4; 5; 6; 7 ]
      'c' => Series.ofValues [ 8; 9; 10; 11 ]
      'd' => Series.ofValues [ 12; 13; 14; 15 ] ]
  |> Frame.indexColsWith ["one"; "two"; "three"; "four"]
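For reference, running the filter against this sample keeps exactly rows 'b' and 'c', the only rows satisfying both conditions:
df
|> Frame.filterRows (fun _ row -> row?two >= 5.0 && row?four <= 11.0)
|> Frame.countRows  // evaluates to 2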


Constraint propagation with modulo

Say I have two variables a and b. I would like to define the following relation/constraint between them:
if a = 1, then b % 12 = 1 or b % 12 = 0
if a = 2, then b % 12 = 0
Some solutions are
a = 1, b = 1
a = 1, b = 12
a = 2, b = 12
I'm currently modelling this in a straightforward way (and adding an extra condition on top):
rhs = Or(
    And(a == 1, Or(b % 12 == 1, b % 12 == 0)),
    And(a == 2, b % 12 == 0)
)
lhs = b > 10
solver.add(Implies(lhs, rhs))
However, this becomes very slow as I increase the number of variables and constraints.
Is there a better way to model this? Maybe a function? But I would like to allow search to run "in both directions", i.e. given a value of b, we should be able to identify a value of a, and vice versa.
Based on your comment, it looks like the use of integers is unnecessarily complicating your constraints.
If you need to stick to "numbers" for some other reason, then I'd recommend asserting 0 <= b and b < 12 globally (to represent all 12 notes), which can help the solver reduce the search space. Division and modulus are always hard for SMT solvers, though, and perhaps you do not need them at all. In fact, I'd recommend not using numbers to represent notes in the first place. Instead, use an enumeration:
Note, (A, B, C) = EnumSort('Note', ('A', 'B', 'C'))
(I've only written the first three above; you can add the remaining 9.)
This very clearly communicates to the solver that you're dealing with a finite collection of distinct items. You should also consider representing the octave as an enum type, or at least restricting it to a small range covering the first 6-7 octaves, which I assume are the ones you're interested in.
You can read more about enums in z3 here: https://ericpony.github.io/z3py-tutorial/advanced-examples.htm (Scroll down to the part that talks about "enumerations.")
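If you drive Z3 from .NET rather than Python, the same enumeration idea can be expressed through the Microsoft.Z3 bindings. The following F# sketch is only illustrative - the 3-note enum and the particular constraint mirror the shape of the modulo relation above, not your actual system:
open Microsoft.Z3

let ctx = new Context()
let note = ctx.MkEnumSort("Note", [| "A"; "B"; "C" |])  // add the remaining 9 notes
let a = ctx.MkIntConst("a")
let b = ctx.MkConst("b", note)

let solver = ctx.MkSolver()
// a = 1 allows two notes; a = 2 allows one - same shape as the modulo relation
solver.Add(
  ctx.MkOr(
    ctx.MkAnd(ctx.MkEq(a, ctx.MkInt(1)),
              ctx.MkOr(ctx.MkEq(b, note.Consts.[0]), ctx.MkEq(b, note.Consts.[1]))),
    ctx.MkAnd(ctx.MkEq(a, ctx.MkInt(2)),
              ctx.MkEq(b, note.Consts.[0]))))
printfn "%A" (solver.Check())  // sat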
You haven't told us how octaves/notes constrain each other in your system, but it should be easy to capture that using regular functions that take octaves and return possible notes. You should also post actual code that people can run, so they can see where the bottlenecks are.

Why do Maps have slower lookup property than Records in Erlang?

I'm reading Programming Erlang, in Chapter 5 of the book it says:
Records are just tuples in disguise, so they have the same storage and performance characteristics as tuples. Maps use more storage than tuples and have slower lookup properties.
In languages I've learned before, this is not the case. Maps are usually implemented as a Hash Table, so the lookup time complexity is O(1); Records (Tuples with names) are usually implemented as an immutable List, and the lookup time complexity is O(N).
What's different in the implementation of these data structures in Erlang?
There's no real practical performance difference between record lookup and map lookup for small numbers of fields. For large numbers of fields, though, there is: record information is known at compile time while map keys need not be, so maps use a different lookup mechanism than records.
But records and maps are not intended as interchangeable replacements, so comparing them for use cases involving anything more than a small number of fields is pointless IMO. If you know the fields you need at compile time, use records; if you don't, use maps or another similar mechanism. Because of this, the following focuses only on the differences in performance of looking up one record field and one map key.
Let's look at the assembler for two functions, one that accesses a record field and one that accesses a map key. Here are the functions:
-record(foo, {f}).

r(#foo{f=X}) ->
    X.

m(#{f := X}) ->
    X.
Both use pattern matching to extract a value from the given type instance.
Here's the assembly for r/1:
{function, r, 1, 2}.
  {label,1}.
    {line,[{location,"f2.erl",6}]}.
    {func_info,{atom,f2},{atom,r},1}.
  {label,2}.
    {test,is_tuple,{f,1},[{x,0}]}.
    {test,test_arity,{f,1},[{x,0},2]}.
    {get_tuple_element,{x,0},0,{x,1}}.
    {get_tuple_element,{x,0},1,{x,2}}.
    {test,is_eq_exact,{f,1},[{x,1},{atom,foo}]}.
    {move,{x,2},{x,0}}.
    return.
The interesting part here starts under {label,2}. The code verifies that the argument is a tuple, then verifies the tuple's arity, and extracts two elements from it. After verifying that the first element of the tuple is equal to the atom foo, it returns the value of the second element, which is record field f.
Now let's look at the assembly of the m/1 function:
{function, m, 1, 4}.
  {label,3}.
    {line,[{location,"f2.erl",9}]}.
    {func_info,{atom,f2},{atom,m},1}.
  {label,4}.
    {test,is_map,{f,3},[{x,0}]}.
    {get_map_elements,{f,3},{x,0},{list,[{atom,f},{x,0}]}}.
    return.
This code verifies that the argument is a map, then extracts the value associated with map key f.
The costs of each function come down to the costs of the assembly instructions. The record function has more instructions, but they're likely less expensive than the instructions in the map function, because all the record information is known at compile time. This is especially true as the key count of the map increases, since the get_map_elements call then potentially has to wade through more map data to find what it's looking for.
We can write functions to call these accessors numerous times, then time the new functions. Here are two sets of recursive functions that call the accessors N times:
call_r(N) ->
    call_r(#foo{f=1}, N).

call_r(_, 0) ->
    ok;
call_r(F, N) ->
    1 = r(F),
    call_r(F, N-1).

call_m(N) ->
    call_m(#{f => 1}, N).

call_m(_, 0) ->
    ok;
call_m(M, N) ->
    1 = m(M),
    call_m(M, N-1).
We can call these with timer:tc/3 to check the execution time for each function. Let's time a loop of ten million calls, repeat that 50 times, and take the average execution time. First, the record function:
1> lists:sum([element(1,timer:tc(f2,call_r,[10000000])) || _ <- lists:seq(1,50)])/50.
237559.02
This means calling the function ten million times took an average of 238ms. Now, the map function:
2> lists:sum([element(1,timer:tc(f2,call_m,[10000000])) || _ <- lists:seq(1,50)])/50.
235871.94
Calling the map function ten million times averaged 236ms. Your mileage will vary of course, as did mine; running each multiple times sometimes resulted in the record function being faster and sometimes the map function being faster, but neither was ever faster by a large margin. I'd encourage you to do your own measurements, but it seems that there's virtually no performance difference between the two, at least for accessing a single field via pattern matching. As the number of fields increases, though, the difference in performance becomes more apparent: for 10 fields, maps are slower by about 0.5%, and for 50 fields, maps are slower by about 50%. But as I stated up front, I see this as being immaterial, since if you're trying to use records and maps interchangeably you're doing it wrong.
UPDATE: based on the conversation in the comments I clarified the answer to discuss performance differences as the number of fields/keys increases, and to point out that records and maps are not meant to be interchangeable.
UPDATE: for Erlang/OTP 24 the Erlang Efficiency Guide was augmented with a chapter on maps that's worth reading for detailed answers to this question.
I got different results when I repeated the test with Erlang/OTP 22 [erts-10.6].
The disassembled code is different for r/1, and the record lookup is 1.5+ times faster:
{function, r, 1, 2}.
  {label,1}.
    {line,[{location,"f2.erl",9}]}.
    {func_info,{atom,f2},{atom,r},1}.
  {label,2}.
    {test,is_tagged_tuple,{f,1},[{x,0},2,{atom,foo}]}.
    {get_tuple_element,{x,0},1,{x,0}}.
    return.

{function, m, 1, 4}.
  {label,3}.
    {line,[{location,"f2.erl",12}]}.
    {func_info,{atom,f2},{atom,m},1}.
  {label,4}.
    {test,is_map,{f,3},[{x,0}]}.
    {get_map_elements,{f,3},{x,0},{list,[{atom,f},{x,1}]}}.
    {move,{x,1},{x,0}}.
    return.
9> lists:sum([element(1,timer:tc(f2,call_r,[10000000])) || _ <- lists:seq(1,50)])/50.
234309.04
10> lists:sum([element(1,timer:tc(f2,call_m,[10000000])) || _ <- lists:seq(1,50)])/50.
341411.9
After I declared -compile({inline, [r/1, m/1]})., the timings were:
13> lists:sum([element(1,timer:tc(f2,call_r,[10000000])) || _ <- lists:seq(1,50)])/50.
199978.9
14> lists:sum([element(1,timer:tc(f2,call_m,[10000000])) || _ <- lists:seq(1,50)])/50.
356002.48
I compared a record with 10 elements to a map of the same size. In this case records proved to be more than 2 times faster.
-module(f22).
-compile({inline, [r/1, m/1]}).
-export([call_r/1, call_r/2, call_m/1, call_m/2]).

-define(I, '2').
-define(V, 2).

-record(foo, {'1', '2', '3', '4', '5', '6', '7', '8', '9', '0'}).

r(#foo{?I = X}) ->
    X.

m(#{?I := X}) ->
    X.

call_r(N) ->
    call_r(#foo{'1' = 1, '2' = 2, '3' = 3, '4' = 4, '5' = 5,
                '6' = 6, '7' = 7, '8' = 8, '9' = 9, '0' = 0}, N).

call_r(_, 0) ->
    ok;
call_r(F, N) ->
    ?V = r(F),
    call_r(F, N-1).

call_m(N) ->
    call_m(#{'1' => 1, '2' => 2, '3' => 3, '4' => 4, '5' => 5,
             '6' => 6, '7' => 7, '8' => 8, '9' => 9, '0' => 0}, N).

call_m(_, 0) ->
    ok;
call_m(F, N) ->
    ?V = m(F),
    call_m(F, N-1).

% lists:sum([element(1,timer:tc(f22,call_r,[10000000])) || _ <- lists:seq(1,50)])/50.
% 229777.3
% lists:sum([element(1,timer:tc(f22,call_m,[10000000])) || _ <- lists:seq(1,50)])/50.
% 395897.68
%
% After declaring -compile({inline, [r/1, m/1]}):
% lists:sum([element(1,timer:tc(f22,call_r,[10000000])) || _ <- lists:seq(1,50)])/50.
% 130859.98
% lists:sum([element(1,timer:tc(f22,call_m,[10000000])) || _ <- lists:seq(1,50)])/50.
% 306490.6
% 306490.6 / 130859.98 .
% 2.34212629407401

Return multiple columns / a dataframe in Deedle based on row-wise mapping

I want to look at each row in a frame and construct multiple columns for a new frame based on values in that row.
The final result should be a frame that has the columns of the original frame plus the new columns.
I have a solution but I wonder if there is a better one. I think the best way to explain the desired behavior is with an example. I'm using Deedle's titanic data set:
#r #"F:\aolney\research_projects\braintrust\code\QualtricsToR\packages\Deedle.1.2.3\lib\net40\Deedle.dll";;
#r #"F:\aolney\research_projects\braintrust\code\QualtricsToR\packages\FSharp.Charting.0.90.12\lib\net40\FSharp.Charting.dll";;
#r #"F:\aolney\research_projects\braintrust\code\QualtricsToR\packages\FSharp.Data.2.2.2\lib\net40\FSharp.Data.dll";;
open System
open FSharp.Data
open Deedle
open FSharp.Charting;;
#load #"F:\aolney\research_projects\braintrust\code\QualtricsToR\packages\FSharp.Charting.0.90.12\FSharp.Charting.fsx";;
#load #"F:\aolney\research_projects\braintrust\code\QualtricsToR\packages\Deedle.1.2.3\Deedle.fsx";;
let titanic = Frame.ReadCsv(#"C:\Users\aolne_000\Downloads\titanic.csv");;
This is what that frame looks like:
val titanic : Frame<int,string> =
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 -> 1 False 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.25 S
1 -> 2 True 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0 PC 17599 71.2833 C85 C
My approach grabs each row, uses some selection logic, and then returns a new row value as a dictionary. Then I use Deedle's expansion operation to convert the values in this dictionary to new columns.
titanic?test <-
  titanic |> Frame.mapRowValues (fun x ->
    if x.GetAs<int>("Pclass") > 1 then dict ["A", 1; "B", 2]
    else dict ["A", 2; "B", 1]);;

titanic |> Frame.expandCols ["test"];;
This gives the following new frame:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked test.A test.B
0 -> 1 False 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.25 S 1 2
1 -> 2 True 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0 PC 17599 71.2833 C85 C 2 1
Note the last two columns are test.A and test.B. Effectively this approach creates a new frame (with columns A and B) and then joins it to the existing frame.
This is fine for my use case, but it is probably confusing for others to read. It also forces a prefix, e.g. "test", onto the final columns, which isn't desirable.
Is there a way to append the new values to the end of the row series, represented in the code above by x?
I find your approach quite elegant and clever. Because the new series shares the index with the original frame, it is also going to be pretty fast. So, I think your solution may actually be better than the alternative option (but I have not measured this).
Anyway, the other option would be to return new rows from your Frame.mapRowValues call - so for each row, we return the original row together with the additional columns.
titanic
|> Frame.mapRowValues (fun x ->
    let add =
      if x.GetAs<int>("Pclass") > 1 then series ["A", box 1; "B", box 2]
      else series ["A", box 2; "B", box 1]
    Series.merge x add)
|> Frame.ofRows
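If you prefer to keep the original expandCols approach but want to drop the forced prefix, one option is to rename the expanded columns afterwards. A small sketch, assuming the "test." prefix shown in the output above:
titanic
|> Frame.expandCols ["test"]
|> Frame.mapColKeys (fun k ->
    if k.StartsWith("test.") then k.Substring(5) else k)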

How to do a left join on a non unique column/index in Deedle

I am trying to do a left join between two data frames in Deedle. Examples of the two data frames are below:
let workOrders =
  Frame.ofColumns [
    "workOrderCode" =?> series [ (20050, 20050); (20051, 20051); (20060, 20060) ]
    "workOrderDescription" =?> series [ (20050, "Door Repair"); (20051, "Lift Replacement"); (20060, "Window Cleaning") ] ]

// This fails at runtime due to the duplicate work order codes
let workOrderScores =
  Frame.ofColumns [
    "workOrderCode" => series [ (20050, 20050); (20050, 20050); (20051, 20051) ]
    "runTime" => series [ (20050, 20100112); (20050, 20100130); (20051, 20100215) ]
    "score" => series [ (20050, 100); (20050, 120); (20051, 80) ] ]

Frame.join JoinKind.Outer workOrders workOrderScores
The problem is that Deedle will not let me create a data frame with a non-unique index, and I get the following error: System.ArgumentException: Duplicate key '20050'. Duplicate keys are not allowed in the index.
Interestingly, in Python/Pandas I can do the following, which works perfectly. How can I reproduce this result in Deedle? I am thinking that I might have to flatten the second data frame to remove the duplicates, then join, and then unpivot/unstack it?
workOrders = pd.DataFrame(
    {'workOrderCode': [20050, 20051, 20060],
     'workOrderDescription': ['Door Repair', 'Lift Replacement', 'Window Cleaning']})
workOrderScores = pd.DataFrame(
    {'workOrderCode': [20050, 20050, 20051],
     'runTime': [20100112, 20100130, 20100215],
     'score': [100, 120, 80]})
pd.merge(workOrders, workOrderScores, on='workOrderCode', how='left')
# Result:
#    workOrderCode workOrderDescription   runTime  score
# 0          20050          Door Repair  20100112    100
# 1          20050          Door Repair  20100130    120
# 2          20051     Lift Replacement  20100215     80
# 3          20060      Window Cleaning       NaN    NaN
This is a great question - I have to admit, there is currently no elegant way to do this with Deedle. Could you please submit an issue on GitHub so we keep track of this and add a solution?
As you say, Deedle currently does not let you have duplicate values in the keys - although your Pandas solution does not use duplicate keys either; you simply use the fact that Pandas lets you specify the column to use when joining (and I think this would be a great addition to Deedle).
Here is one way to do what you want - though not very nice. Using pivoting would be another option (there is a nice pivot table function in the latest source code - not yet on NuGet).
I used groupRowsByInt and nest to turn your data frames into series grouped by the workOrderCode (each item now contains a frame with all rows that have the same work order code):
let workOrders =
  Frame.ofColumns [
    "workOrderCode" =?> Series.ofValues [ 20050; 20051; 20060 ]
    "workOrderDescription" =?> Series.ofValues [ "Door Repair"; "Lift Replacement"; "Window Cleaning" ] ]
  |> Frame.groupRowsByInt "workOrderCode"
  |> Frame.nest

let workOrderScores =
  Frame.ofColumns [
    "workOrderCode" => Series.ofValues [ 20050; 20050; 20051 ]
    "runTime" => Series.ofValues [ 20100112; 20100130; 20100215 ]
    "score" => Series.ofValues [ 100; 120; 80 ] ]
  |> Frame.groupRowsByInt "workOrderCode"
  |> Frame.nest
Now we can join the two series (because their work order codes are the keys). However, then you get one or two data frames for each joined order code and there is quite a lot of work needed to outer join the rows of the two frames:
// Join the two series to align frames with the same work order code
Series.zip workOrders workOrderScores
|> Series.map (fun _ (orders, scores) ->
    match orders, scores with
    | OptionalValue.Present s1, OptionalValue.Present s2 ->
        // There is a frame with some rows with the specified code in both
        // work orders and work order scores - we return a cross product of their rows
        [ for r1 in s1.Rows.Values do
            for r2 in s2.Rows.Values do
              // Drop workOrderCode from one series (they are the same in both)
              // and append the two rows & return that as the result
              yield Series.append r1 (Series.filter (fun k _ -> k <> "workOrderCode") r2) ]
        |> Frame.ofRowsOrdinal
    // If the left or right value is missing, we just return the columns
    // that are available (others will be filled with NaN)
    | OptionalValue.Present s, _
    | _, OptionalValue.Present s -> s)
|> Frame.unnest
|> Frame.indexRowsOrdinally
This might be slow (especially in the NuGet version). If you work with more data, please try building the latest version of Deedle from source (and if that does not help, please submit an issue - we should look into this!)

F# Multidimensional Array Types

What's the difference between 'a[,,] and 'a[][][]? They both represent 3-d arrays.
The jagged form makes me write array3d.[x].[y].[z] instead of array3d.[x, y, z].
Why can't I do the following?
> let array2d : int[,] = Array2D.zeroCreate 10 10;;
> let array1d = array2d.[0];;
error FS0001: This expression was expected to have type
    'a []
but here has type
    int [,]
The difference is that 'a[][] represents an array of arrays (of possibly different lengths), while 'a[,] represents a rectangular 2D array. The first type is also called a jagged array and the second a multidimensional array. The difference is the same as in C#, so you may want to look at the C# documentation for jagged arrays and multidimensional arrays. There is also excellent documentation in the F# WikiBook.
To demonstrate this using a picture, a value of type 'a[][] can look like this:
0 1 2 3 4
5 6
7 8 9 0 1
While a value of type 'a[,] will always be a rectangle and may look, for example, like this:
0 1 2 3
4 5 6 7
8 9 0 1
To get a single "line" of a multidimensional array, you can use the slice notation:
let row = array2d.[0,*];;
See https://learn.microsoft.com/en-us/dotnet/fsharp/language-reference/arrays#array-slicing-and-multidimensional-arrays
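To make the difference concrete, here is a small sketch constructing and indexing both kinds of arrays:
// Jagged: an array of arrays - rows may have different lengths
let jagged : int[][] = [| [| 0; 1; 2; 3; 4 |]; [| 5; 6 |]; [| 7; 8; 9; 0; 1 |] |]
// Rectangular: a true 2D array, always m x n
let rect : int[,] = Array2D.init 3 4 (fun i j -> i * 4 + j)

let x = jagged.[1].[0]  // jagged access: one index per level
let y = rect.[1, 2]     // rectangular access: comma-separated indices
let row = rect.[0, *]   // slicing a row of the 2D array yields an int[]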
As of F# 3.1 (2013) things are simpler:
As of F# 3.1, you can decompose a multidimensional array into subarrays of the same or lower dimension. For example, you can obtain a vector from a matrix by specifying a single row or column.
// Get row 3 from a matrix as a vector:
matrix.[3, *]
// Get column 3 from a matrix as a vector:
matrix.[*, 3]
See https://learn.microsoft.com/en-us/dotnet/fsharp/language-reference/arrays#array-slicing-and-multidimensional-arrays
