I have a Deedle series with election data like:
"Party A", 304
"Party B", 25
"Party C", 570
....
"Party Y", 2
"Party Z", 258
I'd like to create a new series like this:
"Party C", 570
"Party A", 304
"Party Z", 258
"Others", 145
So I want to take the top 3 as they are and sum all others as a new row. What is the best way to do this?
I don't think we have anything in Deedle that would make this a one-liner (how disappointing...). So the best I could think of is to get the keys for the top 3 parties and then use Series.groupInto with a key selector that returns either the party name (for the top 3) or returns "Other" (for the other parties):
// Sample data set with a bunch of parties
let election =
  [ "Party A", 304
    "Party B", 25
    "Party C", 570
    "Party Y", 2
    "Party Z", 258 ]
  |> series
// Sort the data by -1 times the value (descending)
let byVotes = election |> Series.sortBy (~-)
// Create a set with top 3 keys (for efficient lookup)
let top3 = byVotes |> Series.take 3 |> Series.keys |> set
// Group the series using key selector that tries to find the party in top3
// and using an aggregation function that sums the values (for one or multiple values)
byVotes |> Series.groupInto
  (fun k v -> if top3.Contains(k) then k else "Other")
  (fun k s -> s |> Series.mapValues float |> Stats.sum)
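For comparison, the same "top N plus the rest" idea can be sketched on a plain F# list, without Deedle. This is only an illustration: the helper name topNWithOthers and the "Others" label are assumptions made for the example, not part of any library.

```fsharp
// Hypothetical helper (not part of Deedle): keep the n largest entries
// and sum everything else under a single "Others" entry.
let topNWithOthers n (data : (string * int) list) =
    let sorted = data |> List.sortByDescending snd
    let top = sorted |> List.truncate n
    let rest = sorted |> List.skip (List.length top)
    match rest with
    | [] -> top
    | _ -> top @ [ ("Others", List.sumBy snd rest) ]

// Same sample data as in the Deedle answer
let election =
    [ "Party A", 304
      "Party B", 25
      "Party C", 570
      "Party Y", 2
      "Party Z", 258 ]

let summary = topNWithOthers 3 election
```

With the five sample parties, the "Others" entry is the sum of the two smallest values (25 + 2).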
Related
If I have a dataset that contains [City, Dealership, Total Cars Sold]. How would I get the top dealer in each city and the number of cars they sold?
The results should look like
City1 Dealership A 2000
City2 Dealership X 1000
etc.
I'm sure it's possible, but I'm not having any luck, and it might be because I'm approaching the problem the wrong way.
Currently I'm grouping by Dealership and City, which creates a Frame<(string*string*int), int> and that gets me
City1 Dealership A 1 -> 2000
City1 Dealership B 2 -> 1000
City2 Dealership X 3 -> 1000
City2 Dealership Y 4 -> 500
etc.
But I'm stumped when I then try to get the dealership that does the most deals.
Thanks.
I adapted Tomas's answer and output the type as Series<string, (string * int)>
let data = series [
    ("City1", "Dealership A") => 2000
    ("City1", "Dealership B") => 1000
    ("City2", "Dealership X") => 1000
    ("City2", "Dealership Y") => 500 ]
data
|> Series.groupBy (fun k _ -> fst k)
|> Series.mapValues (fun sr ->
    let sorted = sr |> Series.sortBy(fun x -> -x)
    let key = sorted |> Series.firstKey |> snd
    let value = sorted |> Series.firstValue
    key, value )
The output looks like
City1 -> (Dealership A, 2000)
City2 -> (Dealership X, 1000)
EDITED
I assume you have a csv file like this
City,Dealership,TotalCarsSold
City1,Dealership A,2000
City1,Dealership B,1000
City2,Dealership X,1000
City2,Dealership Y,500
This is how I would do it: read the file as a Frame indexed by (City, Dealership), get the TotalCarsSold column as a Series, and apply the same code as above to get the result.
let df =
    Frame.ReadCsv("C:/Temp/dealership.csv")
    |> Frame.indexRowsUsing(fun r -> r.GetAs<string>("City"), r.GetAs<string>("Dealership"))
df?TotalCarsSold
|> Series.groupBy (fun k _ -> fst k)
|> Series.mapValues (fun sr ->
    let sorted = sr |> Series.sortBy(fun x -> -x)
    let key = sorted |> Series.firstKey |> snd
    let value = sorted |> Series.firstValue
    key, value )
You can do this using the Series.applyLevel function. It takes a series together with a key selector and then it applies a given aggregation to all rows that have the given key. In your case, the key selector just needs to project the city from the composed key of the series, so that the aggregation runs once per city. Given your sample data:
let data = series [
    ("City1", "Dealership A") => 2000
    ("City1", "Dealership B") => 1000
    ("City2", "Dealership X") => 1000
    ("City2", "Dealership Y") => 500 ]
You can get the highest number of cars sold in each city by using:
data
|> Series.applyLevel (fun (c, d) -> c) Stats.max
Note that Stats.max returns option (which is None for empty series). You can get a series with just numbers using:
data
|> Series.applyLevel (fun (c, d) -> c) (Stats.max >> Option.get)
I am using FSharp.Data to transform HTML table data, i.e.
type RawResults = HtmlProvider<url>
let results = RawResults.Load(url).Tables
for row in results.Table1.Rows do
    printfn " %A " row
Example output:
("Model: Generic", "Submit Date: July 22, 2016")
("Gene: Sequencing Failed", "Exectime: 5 hrs. 21 min.")
~~~ hundreds of more rows ~~~~
I am trying to split those "two column"-based elements into a single column sequence to eventually get to a dictionary result.
Desired dictionary key:value result:
["Model", Generic]
["Submit Date", July 22, 2016]
["Gene", "Sequencing Failed"]
~~~~
How can you iter (or split?) the two columns (Column1 & Column2) to pipe both of those individual columns to produce a dictionary result?
let summaryDict =
    results.Table1.Rows
    |> Seq.skip 1
    |> Seq.iter (fun x -> x.Column1 ......
    |> ....
Use the built-in string API to split over the :. I usually prefer to wrap String.Split in curried form:
let split (separator : string) (s : string) = s.Split (separator.ToCharArray ())
Additionally, while not required, when working with two-element tuples, I often find it useful to define a helper module with functions related to this particular data structure. You can put various functions in such a module (e.g. curry, uncurry, swap, etcetera), but in this case, a single function is all you need:
module Tuple2 =
    let mapBoth f g (x, y) = f x, g y
With these building blocks, you can easily split each tuple element over :, as shown in this FSI session:
> [
    ("Model: Generic", "Submit Date: July 22, 2016")
    ("Gene: Sequencing Failed", "Exectime: 5 hrs. 21 min.") ]
  |> List.map (Tuple2.mapBoth (split ":") (split ":"));;
val it : (string [] * string []) list =
  [([|"Model"; " Generic"|], [|"Submit Date"; " July 22, 2016"|]);
   ([|"Gene"; " Sequencing Failed"|], [|"Exectime"; " 5 hrs. 21 min."|])]
At this point, you still need to strip leading whitespace, as well as convert the arrays into your desired format, but I trust you can take it from here (otherwise, please ask).
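One possible way to finish the job is sketched below. It assumes the rows are already available as plain string tuples (as in the FSI session), that every cell has the form "key: value", and that keys are unique. The helper name splitPair is made up for this example. It splits on the first colon only, so a value that itself contains a colon is kept whole.

```fsharp
// Hypothetical helper: split "key: value" on the first ':' and trim both parts.
// Cells without a ':' are skipped by returning None.
let splitPair (s : string) =
    match s.Split([| ':' |], 2) with
    | [| key; value |] -> Some (key.Trim(), value.Trim())
    | _ -> None

// Sample rows copied from the question's output
let rows =
    [ ("Model: Generic", "Submit Date: July 22, 2016")
      ("Gene: Sequencing Failed", "Exectime: 5 hrs. 21 min.") ]

// Flatten the two columns into one sequence of cells, then build the dictionary
let summaryDict =
    rows
    |> Seq.collect (fun (a, b) -> [ a; b ])
    |> Seq.choose splitPair
    |> dict
```

Note that dict throws on duplicate keys, so if the table can repeat a key, a different aggregation (for example Seq.groupBy on the key) would be needed.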
Is there a concise functional way to rename columns of a Deedle data frame f?
f.RenameColumns(...) is usable, but mutates the data frame it is applied to, so it's a bit of a pain to make the renaming operation idempotent. I have something like f.RenameColumns (fun c -> ( if c.IndexOf( "_" ) < 0 then c else c.Substring( 0, c.IndexOf( "_" ) ) ) + "_renamed"), which is ugly.
What would be nice is something that creates a new frame from the input frame, like this: Frame( f |> Frame.cols |> Series.keys |> Seq.map someRenamingFunction, f |> Frame.cols |> Series.values ) but this gets tripped up by the second part -- the type of f |> Frame.cols |> Series.values is not what is required by the Frame constructor.
How can I concisely transform f |> Frame.cols |> Series.values so that its result is edible by the Frame constructor?
You can pass your own renaming function to RenameColumns:
df.RenameColumns someRenamingFunction
You can also use the function Frame.mapColKeys. It builds a new data frame whose column keys are the results of applying the specified function to the column keys of the input data frame, leaving the input frame unchanged.
Example:
type Record = {Name:string; ID:int; Amount:int}
let data =
    [|
        {Name = "Joe"; ID = 51; Amount = 50}
        {Name = "Tomas"; ID = 52; Amount = 100}
        {Name = "Eve"; ID = 65; Amount = 20}
    |]
let df = Frame.ofRecords data
let someRenamingFunction s =
    sprintf "%s(%i)" s s.Length
// Original frame
df.Format() |> printfn "%s"
// Non-mutating: mapColKeys returns a new frame with renamed columns
let ndf = df |> Frame.mapColKeys someRenamingFunction
ndf.Format() |> printfn "%s"
// Mutating: RenameColumns changes the column keys of df itself
df.RenameColumns someRenamingFunction
df.Format() |> printfn "%s"
Print:
     Name  ID Amount
0 -> Joe   51 50
1 -> Tomas 52 100
2 -> Eve   65 20

     Name(4) ID(2) Amount(6)
0 -> Joe     51    50
1 -> Tomas   52    100
2 -> Eve     65    20

     Name(4) ID(2) Amount(6)
0 -> Joe     51    50
1 -> Tomas   52    100
2 -> Eve     65    20
I have two lists of records with the following types:
type AverageTempType = {Date: System.DateTime; Year: int64; Month: int64; AverageTemp: float}
type DailyTempType = {Date: System.DateTime; Year: int64; Month: int64; Day: int64; DailyTemp: float}
I want to get a new list which is made up of the DailyTempType "joined" with the AverageTempType. Ultimately though for each daily record I want the Daily Temp - Average temp for the matching month.
I think I can do this with loops as per below and massage this into a reasonable output:
let MatchLoop =
    for i in DailyData do
        for j in AverageData do
            if (i.Year = j.Year && i.Month = j.Month)
            then printfn "%A %A %A %A %A" i.Year i.Month i.Day i.DailyTemp j.AverageTemp
            else printfn "NOMATCH"
I have also tried to do this with pattern matching, but I can't quite get there (I'm not sure how to define the list correctly in the input type and then iterate to get a result; I'm also not sure whether this approach even makes sense):
let MatchPattern (x:DailyTempType) (y:AverageTempType) =
    match (x,y) with
    | (x,y) when (x.Year = y.Year && x.Month = y.Month) ->
        printfn "match"
    | (_,_) -> printfn "nomatch"
I have looked into Deedle, which I think can do this relatively easily, but I am keen to understand how to do it at a lower level.
What you can do is to create a map of the monthly average data. You can think of a map as a read-only dictionary:
let averageDataMap =
    averageData
    |> Seq.map (fun x -> ((x.Year, x.Month), x))
    |> Map.ofSeq
This particular map is a Map<(int64 * int64), AverageTempType>, which, in plainer words, means that the keys into the map are tuples of year and month, and the value associated with each key is an AverageTempType record.
This enables you to find all the matching month data, based on the daily data:
let matches =
    dailyData
    |> Seq.map (fun x -> (x, averageDataMap |> Map.tryFind (x.Year, x.Month)))
Here, matches has the data type seq<DailyTempType * AverageTempType option>. Again, in plainer words, this is a sequence of tuples, where the first element of each tuple is the original daily observation, and the second element is the corresponding monthly average, if a match was found, or None if no matching monthly average was found.
If you want to print the values as in the OP, you can do this:
matches
|> Seq.map snd
|> Seq.map (function | Some _ -> "Match" | None -> "No match")
|> Seq.iter (printfn "%s")
This expression starts with the matches; then pulls out the second element of each tuple; then again maps a Some value to the string "Match", and a None value to the string "No match"; and finally prints each string.
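The question ultimately asks for the daily temperature minus the average of the matching month. A minimal sketch of that last step, using the same map-based lookup, could look like this. The sample values and the name differences are invented for illustration, and days without a matching monthly average are simply dropped, which may or may not be what you want.

```fsharp
open System

type AverageTempType = {Date: DateTime; Year: int64; Month: int64; AverageTemp: float}
type DailyTempType = {Date: DateTime; Year: int64; Month: int64; Day: int64; DailyTemp: float}

// Made-up sample data, for illustration only
let averageData : AverageTempType list =
    [ { AverageTempType.Date = DateTime(2016, 7, 1); Year = 2016L; Month = 7L; AverageTemp = 20.0 } ]

let dailyData : DailyTempType list =
    [ { DailyTempType.Date = DateTime(2016, 7, 22); Year = 2016L; Month = 7L; Day = 22L; DailyTemp = 23.5 }
      { DailyTempType.Date = DateTime(2016, 8, 1); Year = 2016L; Month = 8L; Day = 1L; DailyTemp = 25.0 } ]

// Monthly averages keyed by (Year, Month), as in the answer
let averageDataMap =
    averageData
    |> Seq.map (fun x -> ((x.Year, x.Month), x))
    |> Map.ofSeq

// For each day with a matching month, pair the date with DailyTemp - AverageTemp;
// days with no matching monthly average are skipped
let differences =
    dailyData
    |> List.choose (fun d ->
        averageDataMap
        |> Map.tryFind (d.Year, d.Month)
        |> Option.map (fun avg -> d.Date, d.DailyTemp - avg.AverageTemp))
```

Here only the July observation has a monthly average, so the August row does not appear in differences.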
I would first convert the AverageTempType seq to a Map (reducing the cost of the join):
// Key on the (Year, Month) tuple; a sum such as Year + Month would collide across months
let toMap (avg:AverageTempType seq) = avg |> Seq.groupBy (fun a -> (a.Year, a.Month)) |> Map.ofSeq
Then you can join and return an option, so the consuming code can do whatever it wants (print, store, error, etc.):
let join (avg:AverageTempType seq) (dly:DailyTempType seq) =
    let avgMap = toMap avg
    dly |> Seq.map (fun d -> d.Year, d.Month, d.Day, d.DailyTemp, Map.tryFind (d.Year, d.Month) avgMap)
From what I can tell a Deedle frame is only sorted by the index. Is there any way to apply a custom sorting function or sort by a given series (and define ascending/descending order)?
Sticking to a "standard" frame of type Frame<int,string> (row index of integers and column names of strings) it is easy to implement a function capable of reordering the frame based on any single column contents in ascending or descending order:
let reorder isAscending sortColumnName (frame:Frame<int,string>) =
    let result = frame |> Frame.indexRows sortColumnName
                       |> Frame.orderRows |> Frame.indexRowsOrdinally
    if isAscending then result else result |> Frame.mapRowKeys ((-) 0)
                                           |> Frame.orderRows
                                           |> Frame.indexRowsOrdinally
A smoke test over peopleList sample frame:
     Name    Age Countries
0 -> Joe     51  [UK; US; UK]
1 -> Tomas   28  [CZ; UK; US; CZ]
2 -> Eve     2   [FR]
3 -> Suzanne 15  [US]
reorder false "Name" peopleList returns the frame where Name is sorted in descending order
     Name    Age Countries
0 -> Tomas   28  [CZ; UK; US; CZ]
1 -> Suzanne 15  [US]
2 -> Joe     51  [UK; US; UK]
3 -> Eve     2   [FR]
while reorder true "Age" peopleList returns the frame where Age is sorted in ascending order
     Name    Age Countries
0 -> Eve     2   [FR]
1 -> Suzanne 15  [US]
2 -> Tomas   28  [CZ; UK; US; CZ]
3 -> Joe     51  [UK; US; UK]
Note, however, that this approach requires the column being sorted to contain no duplicate values (it is used as the row index), which may be a showstopper for this way of ordering a Deedle frame.
You can sort a Deedle frame based on the values in a named column, like so:
myFrame |> Frame.sortRowsBy "columnName" (fun v -> -v) // descending
myFrame |> Frame.sortRowsBy "columnName" (fun v -> v)  // ascending
Deedle 1.0 has additional sorting features for rows and columns:
Frame.sortRows
Frame.sortRowsWith
Frame.sortRowsBy