How to group by values in a row in F#?

I have uploaded a CSV file and have it split up into rows within a sequence.
If there are multiple instances of one value in, let's say, Row 1, how do I average the values in Row 2 by the values in Row 1, so that I end up with only one instance of each value in Row 1?
(This is just an example, and Row 1 and Row 2 are theoretical.)
Be aware that I am working with a sequence.
Example data and the ideal result are below:
What is Given:
Row 1 --- Row 2 (Dollars)
2010 --- 50000.198
2010 --- 45151.451
2011 --- 75641.372
2011 --- 91652.710
2012 --- 11281.450
2012 --- 70046.154
2012 --- 97778.054
2013 --- 555574.501
2013 --- 78921.215
What I Want:
Row 1 --- Row 2
2010 --- 47575.825
2011 --- 83647.041
2012 --- 59701.886
2013 --- 317247.858

It sounds like you've already parsed the CSV file and pulled values into a sequence. For this example, let's assume you pulled it into a list of tuples with the year as the first element and the cost as the second, equivalent to this:
let costByYear =
    [
        (2010, 50000.198)
        (2010, 45151.451)
        (2011, 75641.372)
        (2011, 91652.710)
        (2012, 11281.450)
        (2012, 70046.154)
        (2012, 97778.054)
        (2013, 555574.501)
        (2013, 78921.215)
    ]
You could use a few Seq functions to group by the year (Seq.groupBy) and then average the cost (Seq.average):
let avgCostPerYear =
    let avg (year, costs) = (year, Seq.average <| Seq.map snd costs)
    Seq.groupBy fst >> Seq.map avg
Running this:
printfn "%A" (avgCostPerYear costByYear)
yields:
seq
[(2010, 47575.8245); (2011, 83647.041); (2012, 59701.886); (2013, 317247.858)]
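As a side note (my addition, not part of the original answer), the same grouping can be written a little more compactly with Seq.averageBy, which averages a projection of each element directly:

// equivalent sketch using Seq.averageBy instead of Seq.map + Seq.average
let avgCostPerYear' input =
    input
    |> Seq.groupBy fst
    |> Seq.map (fun (year, costs) -> year, costs |> Seq.averageBy snd)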

Related

Error when trying to calculate mean and SD of environmental dataset with loop from .nc data

I was trying to calculate the mean and SD per month of a variable from an environmental dataset (a .nc file of daily sea surface temperature over 2 years), and the loop I used gives me the following error:
Error in h(simpleError(msg, call)) :
  error in evaluating the argument 'x' in selecting a method for function 'mean': recursive indexing failed at level 2
I have no idea where my error could be, but if you are curious, I was using the following .nc dataset, just SST for 2018-2019, from copernicus sstdata.
Here is the script I have used so far, including the packages I'm using:
# Load required libraries (install them via the Packages tab, if necessary)
library(raster)
library(ncdf4)

# Open the .nc file with the environmental data
ENV = nc_open("SST.nc")
ENV

# create an index of the month for every (daily) capture from 2018 to 2019 (in this dataset)
m_index = c()
for (y in 2018:2019) {
  # if leap year (does not apply to this data, but included in case a larger year range is used)
  if (y%%4==0) { m_index = c(m_index, rep(1:12, times = c(31,29,31,30,31,30,31,31,30,31,30,31))) }
  # if non-leap year
  else { m_index = c(m_index, rep(1:12, times = c(31,28,31,30,31,30,31,31,30,31,30,31))) }
}
length(m_index) # expected length (730)
table(m_index)  # expected number of records assigned to each of the twelve months

# compute the monthly mean and standard deviation.
# We first create two empty raster stacks...
SST_MM = stack()  # this stack will contain the twelve average SST layers (one per month)
SST_MSD = stack() # this stack will contain the twelve SST st. dev. layers (one per month)

# We run the following loop (this can take a while)
for (m in 1:12) { # for every month
  print(m) # print the current month to track the progress of the loop...
  sstMean = mean(ENV[[which(m_index==m)]], na.rm=T)   # mean SST across all records of the current month
  sstSd = calc(ENV[[which(m_index==m)]], sd, na.rm=T) # st. dev. of SST across all records of the current month
  # add the monthly layers to the stacks
  SST_MM = stack(SST_MM, sstMean)
  SST_MSD = stack(SST_MSD, sstSd)
}
And, as mentioned, running the loop stops at the first month with the error:
[1] 1
Error in h(simpleError(msg, call)) :
  error in evaluating the argument 'x' in selecting a method for function 'mean': recursive indexing failed at level 2
It seems that you are making things too complicated. (The error itself occurs because nc_open returns an ncdf4 file handle, which is a plain R list, not a raster object, so ENV[[which(m_index==m)]] attempts recursive list indexing instead of selecting layers.) I think the easiest way to do this is with terra::tapp, like this:
library(terra)
x <- rast("SST.nc")
xmn <- tapp(x, "yearmonths", mean)
xsd <- tapp(x, "yearmonths", sd)
or more manually:
library(terra)
x <- rast("SST.nc")
y <- format(time(x),"%Y")
m <- format(time(x),"%m")
ym <- paste0(y, "_", m)
r <- tapp(x, ym, mean)

Spark streaming join weird results

I'm trying to observe how Spark Streaming uses the RDDs inside a DStream to join two DStreams, but I'm seeing strange results, which is confusing.
In my code, I collect data from a socket stream and split it into two paired DStreams by some logic. In order to have some batches collected for the join, I created a window to collect the last three batches. However, the results of the join are baffling. Please help me understand.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.util.Random

object Join extends App {
  val conf = new SparkConf().setMaster("local[4]").setAppName("KBN Streaming")
  val sc = new SparkContext(conf)
  sc.setLogLevel("ERROR")

  val BATCH_INTERVAL_SEC = 10
  val ssc = new StreamingContext(sc, Seconds(BATCH_INTERVAL_SEC))
  val lines = ssc.socketTextStream("localhost", 8091)
  //println(s"lines.slideDuration : ${lines.slideDuration}")
  //lines.print()

  val ds = lines.map(x => x)

  val randNums = List(1, 2, 3, 4, 5, 6)

  // short lines get a random key
  val less = ds.filter(x => x.length <= 2)
  val lessPairs = less.map(x => (Random.nextInt(randNums.size), x))
  lessPairs.print

  // longer lines get a random key
  val greater = ds.filter(x => x.length > 2)
  val greaterPairs = greater.map(x => (Random.nextInt(randNums.size), x))
  greaterPairs.print

  val join = lessPairs.join(greaterPairs).window(Seconds(30), Seconds(30))
  join.print

  ssc.start
  ssc.awaitTermination
}
Test Results:
-------------------------------------------
Time: 1473344240000 ms
-------------------------------------------
(1,b)
(4,s)

-------------------------------------------
Time: 1473344240000 ms
-------------------------------------------
(5,333)

-------------------------------------------
Time: 1473344250000 ms
-------------------------------------------
(2,x)

-------------------------------------------
Time: 1473344250000 ms
-------------------------------------------
(4,the)

-------------------------------------------
Time: 1473344260000 ms
-------------------------------------------
(2,a)
(0,b)

-------------------------------------------
Time: 1473344260000 ms
-------------------------------------------
(2,ten)
(1,one)
(3,two)

-------------------------------------------
Time: 1473344260000 ms
-------------------------------------------
(4,(b,two))
When join is called, the two RDDs are recomputed, so (because the keys come from Random.nextInt) they contain different values from those shown when first printed. We therefore need to cache both RDDs when they are first computed, e.g. by calling cache() on lessPairs and greaterPairs, so that the same values are reused when join is called later instead of both RDDs being recomputed. I tried this on multiple examples and it works fine. I was missing a basic core concept of Spark.
Excerpt from the "Learning Spark" book:
Persistence (Caching)
As discussed earlier, Spark RDDs are lazily evaluated, and sometimes we may wish to use the same RDD multiple times. If we do this naively, Spark will recompute the RDD and all of its dependencies each time we call an action on the RDD.

Return multiple columns / a dataframe in Deedle based on row-wise mapping

I want to look at each row in a frame and construct multiple columns for a new frame based on values in that row.
The final result should be a frame that has the columns of the original frame plus the new columns.
I have a solution but I wonder if there is a better one. I think the best way to explain the desired behavior is with an example. I'm using Deedle's titanic data set:
#r @"F:\aolney\research_projects\braintrust\code\QualtricsToR\packages\Deedle.1.2.3\lib\net40\Deedle.dll";;
#r @"F:\aolney\research_projects\braintrust\code\QualtricsToR\packages\FSharp.Charting.0.90.12\lib\net40\FSharp.Charting.dll";;
#r @"F:\aolney\research_projects\braintrust\code\QualtricsToR\packages\FSharp.Data.2.2.2\lib\net40\FSharp.Data.dll";;
open System
open FSharp.Data
open Deedle
open FSharp.Charting;;
#load @"F:\aolney\research_projects\braintrust\code\QualtricsToR\packages\FSharp.Charting.0.90.12\FSharp.Charting.fsx";;
#load @"F:\aolney\research_projects\braintrust\code\QualtricsToR\packages\Deedle.1.2.3\Deedle.fsx";;

let titanic = Frame.ReadCsv(@"C:\Users\aolne_000\Downloads\titanic.csv");;
This is what that frame looks like:
val titanic : Frame<int,string> =
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 -> 1 False 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.25 S
1 -> 2 True 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0 PC 17599 71.2833 C85 C
My approach grabs each row, uses some selection logic, and then returns a new row value as a dictionary. Then I use Deedle's expansion operation to convert the values in this dictionary to new columns.
titanic?test <- titanic |> Frame.mapRowValues( fun x -> if x.GetAs<int>("Pclass") > 1 then dict ["A", 1; "B", 2] else dict ["A", 2 ; "B", 1] );;
titanic |> Frame.expandCols ["test"];;
This gives the following new frame:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked test.A test.B
0 -> 1 False 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.25 S 1 2
1 -> 2 True 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0 PC 17599 71.2833 C85 C 2 1
Note the last two columns are test.A and test.B. Effectively, this approach creates a new frame (with columns A and B) and then joins that frame to the existing one.
This is fine for my use case, but it is probably confusing for others to read. It also forces a prefix, e.g. "test.", onto the final columns, which isn't desirable.
Is there a way to append the new values to the end of the row series, represented in the code above by x?
I find your approach quite elegant and clever. Because the new series shares the index with the original frame, it is also going to be pretty fast. So, I think your solution may actually be better than the alternative option (but I have not measured this).
Anyway, the other option would be to return new rows from your Frame.mapRowValues call - so for each row, we return the original row together with the additional columns.
titanic
|> Frame.mapRowValues (fun x ->
    let add =
        if x.GetAs<int>("Pclass") > 1 then series ["A", box 1; "B", box 2]
        else series ["A", box 2; "B", box 1]
    Series.merge x add)
|> Frame.ofRows
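As a quick sanity check (my addition, assuming the Deedle 1.2 API used above), you can bind the pipeline to a name and print its column keys; the new columns should come out as plain A and B, with no test. prefix:

let expanded =
    titanic
    |> Frame.mapRowValues (fun x ->
        let add =
            if x.GetAs<int>("Pclass") > 1 then series ["A", box 1; "B", box 2]
            else series ["A", box 2; "B", box 1]
        Series.merge x add)
    |> Frame.ofRows

// expected to list the original columns followed by "A" and "B"
printfn "%A" (expanded.ColumnKeys |> List.ofSeq)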

How to do a left join on a non-unique column/index in Deedle

I am trying to do a left join between two data frames in Deedle. Examples of the two data frames are below:
let workOrders =
    Frame.ofColumns [
        "workOrderCode" =?> series [ (20050,20050); (20051,20051); (20060,20060) ]
        "workOrderDescription" =?> series [ (20050,"Door Repair"); (20051,"Lift Replacement"); (20060,"Window Cleaning") ] ]

// This does not compile due to the duplicate Work Order Codes
let workOrderScores =
    Frame.ofColumns [
        "workOrderCode" => series [ (20050,20050); (20050,20050); (20051,20051) ]
        "runTime" => series [ (20050,20100112); (20050,20100130); (20051,20100215) ]
        "score" => series [ (20050,100); (20050,120); (20051,80) ] ]

Frame.join JoinKind.Outer workOrders workOrderScores
The problem is that Deedle will not let me create a data frame with a non-unique index, and I get the following error: System.ArgumentException: Duplicate key '20050'. Duplicate keys are not allowed in the index.
Interestingly, in Python/Pandas I can do the following, which works perfectly. How can I reproduce this result in Deedle? I am thinking that I might have to flatten the second data frame to remove the duplicates, then join, and then unpivot/unstack it?
workOrders = pd.DataFrame(
    {'workOrderCode': [20050, 20051, 20060],
     'workOrderDescription': ['Door Repair', 'Lift Replacement', 'Window Cleaning']})
workOrderScores = pd.DataFrame(
    {'workOrderCode': [20050, 20050, 20051],
     'runTime': [20100112, 20100130, 20100215],
     'score': [100, 120, 80]})
pd.merge(workOrders, workOrderScores, on='workOrderCode', how='left')
# Result:
#    workOrderCode workOrderDescription   runTime  score
# 0          20050          Door Repair  20100112    100
# 1          20050          Door Repair  20100130    120
# 2          20051     Lift Replacement  20100215     80
# 3          20060      Window Cleaning       NaN    NaN
This is a great question - I have to admit, there is currently no elegant way to do this with Deedle. Could you please submit an issue on GitHub to make sure we keep track of this and add some solution?
As you say, Deedle does not currently let you have duplicate values in the keys - although your Pandas solution also does not use duplicate keys; you simply use the fact that Pandas lets you specify the column to use when joining (and I think this would be a great addition to Deedle).
Here is one way to do what you wanted - but it is not very nice. I think using pivoting would be another option (there is a nice pivot table function in the latest source code - not yet on NuGet).
I used Frame.groupRowsByInt and Frame.nest to turn your data frames into series grouped by workOrderCode (each item now contains a frame with all rows that share the same work order code):
let workOrders =
    Frame.ofColumns [
        "workOrderCode" =?> Series.ofValues [ 20050; 20051; 20060 ]
        "workOrderDescription" =?> Series.ofValues [ "Door Repair"; "Lift Replacement"; "Window Cleaning" ] ]
    |> Frame.groupRowsByInt "workOrderCode"
    |> Frame.nest

let workOrderScores =
    Frame.ofColumns [
        "workOrderCode" => Series.ofValues [ 20050; 20050; 20051 ]
        "runTime" => Series.ofValues [ 20100112; 20100130; 20100215 ]
        "score" => Series.ofValues [ 100; 120; 80 ] ]
    |> Frame.groupRowsByInt "workOrderCode"
    |> Frame.nest
Now we can join the two series (because their work order codes are the keys). However, you then get one or two data frames for each joined order code, and quite a lot of work is needed to outer-join the rows of the two frames:
// Join the two series to align frames with the same work order code
Series.zip workOrders workOrderScores
|> Series.map (fun _ (orders, scores) ->
    match orders, scores with
    | OptionalValue.Present s1, OptionalValue.Present s2 ->
        // There is a frame with some rows with the specified code in both
        // work orders and work order scores - we return a cross product of their rows
        [ for r1 in s1.Rows.Values do
            for r2 in s2.Rows.Values do
              // Drop workOrderCode from one series (they are the same in both)
              // and append the two rows & return that as the result
              yield Series.append r1 (Series.filter (fun k _ -> k <> "workOrderCode") r2) ]
        |> Frame.ofRowsOrdinal
    // If the left or right value is missing, we just return the columns
    // that are available (others will be filled with NaN)
    | OptionalValue.Present s, _
    | _, OptionalValue.Present s -> s)
|> Frame.unnest
|> Frame.indexRowsOrdinally
This might be slow (especially in the NuGet version). If you work with more data, please try building the latest version of Deedle from source (and if that does not help, please submit an issue - we should look into this!).
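As a rough alternative (my addition, not from the original answer), if you only need the rows that actually appear in workOrderScores, a lookup-based sketch along these lines may be simpler. It assumes both frames are built with ordinal row keys, as in the snippets above but without the nesting step, and it does not produce the unmatched workOrders-only row (e.g. 20060):

// Hypothetical sketch: index the (unique) work orders by code, then map each
// score row's code to its description via a plain series lookup.
let orders =
    Frame.ofColumns [
        "workOrderCode" =?> Series.ofValues [ 20050; 20051; 20060 ]
        "workOrderDescription" =?> Series.ofValues [ "Door Repair"; "Lift Replacement"; "Window Cleaning" ] ]

let scores =
    Frame.ofColumns [
        "workOrderCode" => Series.ofValues [ 20050; 20050; 20051 ]
        "runTime" => Series.ofValues [ 20100112; 20100130; 20100215 ]
        "score" => Series.ofValues [ 100; 120; 80 ] ]

// unique code -> description
let descByCode : Series<int, string> =
    orders
    |> Frame.indexRowsInt "workOrderCode"
    |> Frame.getCol "workOrderDescription"

// add the description to every score row by looking up its code
scores?workOrderDescription <-
    scores |> Frame.mapRowValues (fun row -> descByCode.[row.GetAs<int>("workOrderCode")])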

Deedle - how to select rows by slicing?

I want to select rows based on value comparisons across multiple columns.
Say the data frame (df) looks like this:
one two three four
a 0 1 2 3
b 4 5 6 7
c 8 9 10 11
d 12 13 14 15
I want to get the rows where col["two"] >= 5 and col["four"] <=11.
This can be done simply in Python with pandas like this:
df[(df["two"] >= 5) & (df["four"] <= 11)]
How can I do this in F# with Deedle?
Thanks!
You can use the Frame.filterRows function:
df |> Frame.filterRows (fun k row ->
    row?two >= 5.0 && row?four <= 11.0)
Having something like the pandas slicing syntax would be nice, but we have not found an entirely smooth way of doing that yet, so using a filter function is the current way of doing it (the good thing is that this is consistent with other filter operations - for lists and sequences).
Here, I'm using row?two which is a simplified notation for getting numeric (floating point) values from a data frame, but you could use GetAs<T>("two") for values of non-numeric types.
Just for a reference, here is my sample data set:
let df =
    Frame.ofRows
        [ 'a' => Series.ofValues [ 0; 1; 2; 3 ]
          'b' => Series.ofValues [ 4; 5; 6; 7 ]
          'c' => Series.ofValues [ 8; 9; 10; 11 ]
          'd' => Series.ofValues [ 12; 13; 14; 15 ] ]
    |> Frame.indexColsWith ["one"; "two"; "three"; "four"]
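For a quick sanity check (my addition), applying the filter to this sample frame should keep only rows 'b' and 'c', since those are the only rows with two >= 5 and four <= 11:

let filtered =
    df |> Frame.filterRows (fun _ row ->
        row?two >= 5.0 && row?four <= 11.0)

// expected row keys: 'b' and 'c'
printfn "%A" (filtered.RowKeys |> List.ofSeq)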
