Cross join with Deedle - F#

I'm trying to learn some F# and Deedle by analyzing my electricity costs.
Suppose I have two frames, one containing my electricity usage:
let consumptionsByYear =
    [ (2019, "Total", 500); (2019, "Day", 200); (2019, "Night", 300);
      (2020, "Total", 600); (2020, "Day", 250); (2020, "Night", 350) ]
    |> Frame.ofValues

         Total  Day  Night
2019 ->  500    200  300
2020 ->  600    250  350
The other contains two plans with different pricing structure (either a flat fee or fee varying based on the time of the day):
let prices =
    [ ("Plan A", "Base fee", 50); ("Plan A", "Fixed price", 3); ("Plan A", "Day price", 0); ("Plan A", "Night price", 0);
      ("Plan B", "Base fee", 40); ("Plan B", "Fixed price", 0); ("Plan B", "Day price", 5); ("Plan B", "Night price", 2) ]
    |> Frame.ofValues

          Base fee  Fixed price  Day price  Night price
Plan A -> 50        3            0          0
Plan B -> 40        0            5          2
Previously I have solved this in SQL using a cross join and in Excel using nested joins. To replicate that approach I found Frame.mapRows, but constructing the expected output with it seems very tedious:
let costs =
    consumptionsByYear
    |> Frame.mapRows (fun _year cols ->
        [ "Total price" => (prices?``Base fee``
                            + (prices?``Fixed price`` |> Series.mapValues ((*) (cols.GetAs<float>("Total"))))
                            + (prices?``Day price``   |> Series.mapValues ((*) (cols.GetAs<float>("Day"))))
                            + (prices?``Night price`` |> Series.mapValues ((*) (cols.GetAs<float>("Night"))))) ]
        |> Frame.ofColumns)
    |> Frame.unnest

                Total price
2019 Plan A ->  1550
     Plan B ->  1640
2020 Plan A ->  1850
     Plan B ->  1990
Is there a better way or even small improvements?

I'm not a Deedle expert, but I think this is basically:
- a dot product of two matrices (consumptionsByYear and the periodic day/night prices),
- followed by the addition of the constant base prices.
In other words:
 consumptionsByYear        periodicPrices               basePrices
--------------------     ----------------------     ----------------------------
|      Day   Night |     |       Plan A Plan B|     |           Plan A  Plan B |
| 2019 -> 200  300 |  *  | Day   ->  3     5  |  +  | Base fee ->  50     40   |
| 2020 -> 250  350 |     | Night ->  3     2  |     ----------------------------
--------------------     ----------------------
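As a quick sanity check against the expected output: for 2019, Plan A works out to 200 * 3 + 300 * 3 + 50 = 1550 and Plan B to 200 * 5 + 300 * 2 + 40 = 1640, matching the costs frame above.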
With that approach in mind, here's how I would do it:
open Deedle
open Deedle.Math
let consumptionsByYear =
    [ (2019, "Day", 200); (2019, "Night", 300)
      (2020, "Day", 250); (2020, "Night", 350) ]
    |> Frame.ofValues

let basePrices =
    [ ("Plan A", "Base fee", 50)
      ("Plan B", "Base fee", 40) ]
    |> Frame.ofValues
    |> Frame.transpose

let periodicPrices =
    [ ("Plan A", "Day", 3); ("Plan A", "Night", 3)
      ("Plan B", "Day", 5); ("Plan B", "Night", 2) ]
    |> Frame.ofValues
    |> Frame.transpose

// repeat the base prices for each year
let basePricesExpanded =
    let row = basePrices.Rows.["Base fee"]
    consumptionsByYear
    |> Frame.mapRowValues (fun _ -> row)
    |> Frame.ofRows

let result =
    Matrix.dot(consumptionsByYear, periodicPrices) + basePricesExpanded

result.Print()

Output is:

         Plan A  Plan B
2019 ->  1550    1640
2020 ->  1850    1990
A few changes I made for simplicity:

consumptionsByYear
- I mapped the years from integers to strings in order to make the matrices compatible.
- I removed the Total column, since it can be derived from the other two.

prices
- I broke this into two separate frames, one for the periodic prices and another for the base prices, and then transposed them to enable matrix multiplication.
- I changed Day price to Day and Night price to Night to make the matrices compatible.
- I got rid of the Fixed price column, since it can be folded into the Day and Night columns (e.g. Plan A's fixed price of 3 becomes Day = 3 and Night = 3).
Update: As of Deedle 2.4.2, it is no longer necessary to map the years to strings. I've modified my solution accordingly.

Related

What is the way to find the price by the quantity in Rails?

Rails version: 7.0
PostgreSQL version: 14
What is the way to find the price by the quantity in the products table?
products table
min_quantity | max_quantity | price
1 | 4 | 200
5 | 9 | 185
10 | 24 | 175
25 | 34 | 150
35 | 999 | 100
1000 | null | 60
Expected result
3 ===> 200
50 ===> 100
2500 ===> 60
You can achieve that with a where condition checking that the given value is between min_quantity and max_quantity:
with products(min_quantity, max_quantity, price) as (
  values
      (1, 4, 200)
    , (5, 9, 185)
    , (10, 24, 175)
    , (25, 34, 150)
    , (35, 999, 100)
    , (1000, null, 60)
)
select price
from products
where
  case
    when max_quantity is null then 3 >= min_quantity
    else 3 between min_quantity and max_quantity
  end
-- 200
But since max_quantity can be null (when min_quantity is 1000), you need a way to handle that, which is why the case expression compares the input only against min_quantity in that branch.
If the same applies for min_quantity and it can hold null values, then another branch in the case expression might suffice.
As Rails doesn't have specific support for these situations, you're best off using raw SQL in your where:
where(<<~SQL, input: input)
  case
    when max_quantity is null then :input >= min_quantity
    else :input between min_quantity and max_quantity
  end
SQL

How to merge based on a subset of string in a column?

This is an extension of my previous post.
I have the following dataframes (df1 and df2) that I'm trying to merge:
year <- c("2002", "2002", "1999", "1999", "1997", "2002")
state <- c("TN", "TN", "AL", "AL", "CA", "TN")
name <- c("Molly Homes, Jane Doe", "Sally", "David", "Laura", "John", "Kate")
df1 <- data.frame(year, state, name)
year <- c("2002", "1999")
state <- c("TN", "AL")
versus <- c("Homes (v. Vista)", "#laura v. dAvid")
df2 <- data.frame(year, state, versus)
And df4 is my ideal output:
year <- c("2002", "2002", "1999", "1999", "1997", "2002")
state <- c("TN", "TN", "AL", "AL", "CA", "TN")
name <- c("Molly Homes, Jane Doe", "Sally", "David", "Laura", "John", "Kate")
versus <- c("Homes (v. Vista)", "# george v. SALLY", "#laura v. dAvid", "#laura v. dAvid", NA, NA)
df4 <- data.frame(year, state, name, versus)
The kind responders on the last post suggested this (and a variation):
library(dplyr)
df3 <- left_join(df1, df2, by = c("year", "state")) %>%
  rowwise() %>%
  mutate(versus := if_else(grepl(name, versus, ignore.case = T), versus, as.character(NA)))
The problem with the above code is that it doesn't match substrings. Ideally, I'd like the grepl(x, y) check to work in both directions: if x is in y and/or y is in x, then it's TRUE and the value in the versus column is kept.
fuzzyjoin is meant for regex searches like this :-)
library(dplyr)
# library(tidyr) # unnest
# library(fuzzyjoin) # fuzzy_*_join
df1 %>%
  mutate(
    rn = row_number(),
    ptn = strsplit(name, "[ ,]+")
  ) %>%
  tidyr::unnest(ptn) %>%
  fuzzyjoin::fuzzy_left_join(df2,
    by = c("year" = "year", "state" = "state", "ptn" = "versus"),
    match_fun = list(`==`, `==`, function(...) Vectorize(grepl)(..., ignore.case = TRUE))
  ) %>%
  group_by(rn, year = year.x, state = state.x, name) %>%
  summarize(versus = na.omit(versus)[1], .groups = "drop") %>%
  select(-rn)
# # A tibble: 6 x 4
# year state name versus
# <chr> <chr> <chr> <chr>
# 1 2002 TN Molly Homes, Jane Doe Homes (v. Vista)
# 2 2002 TN Sally NA
# 3 1999 AL David #laura v. dAvid
# 4 1999 AL Laura #laura v. dAvid
# 5 1997 CA John NA
# 6 2002 TN Kate NA
We need a way to retrieve the series of whole words, and check if any of them appear (case-insensitive) within the versus column. Here is one simple way to do this:
Create a function f(n, v), which takes strings n and v, extracts the whole words (wrds) from n, and then counts how many of them are found in v. It returns TRUE if this count exceeds 0.
f <- function(n, v) {
  wrds = stringr::str_extract_all(n, "\\b\\w*\\b")[[1]]
  sum(sapply(wrds[which(nchar(wrds) > 1)], grepl, x = v, ignore.case = T)) > 0
}
Left join the original frames, and apply f() by row, retaining versus if one or more whole words from name are found in versus, else setting it to NA:
left_join(df1, df2, by = c("year", "state")) %>%
  rowwise() %>%
  mutate(versus := if_else(f(name, versus), versus, NA_character_))
Output:
1 2002 TN Molly Homes, Jane Doe Homes (v. Vista)
2 2002 TN Sally NA
3 1999 AL David #laura v. dAvid
4 1999 AL Laura #laura v. dAvid
5 1997 CA John NA
6 2002 TN Kate NA
Input:
df1 = structure(list(year = c("2002", "2002", "1999", "1999", "1997",
"2002"), state = c("TN", "TN", "AL", "AL", "CA", "TN"), name = c("Molly Homes, Jane Doe",
"Sally", "David", "Laura", "John", "Kate")), class = "data.frame", row.names = c(NA,
-6L))
df2 = structure(list(year = c("2002", "1999"), state = c("TN", "AL"
), versus = c("Homes (v. Vista)", "#laura v. dAvid")), class = "data.frame", row.names = c(NA,
-2L))

Group by start and end date or join multiple columns in Power Query

I have an employees table with mutations to their contracts
EmpID  Start       End         Function  Hours  SalesPercentage
1      01-01-2020  31-12-2020  FO Desk   40     1
1      01-01-2020  31-01-2021  FO Desk   32     1
1      01-02-2021              FO Desk   32     0.50
2      01-01-2021  31-01-2021  BO        32     0
2      01-02-2021              BO/FO     32     0.25
For dynamic calculation of the number of employees and their sales percentages, I need to turn this into a table with an entry per month:
Year  Month  EmpID  Hours  SalesPercentage
2020  1      1      40     1
2020  2      1      40     1
..
2020  12     1      40     1
2021  1      1      32     1
2021  1      2      32     0
2021  2      1      32     0.50
2021  2      2      32     0.25
I have a simple Year Month table that I would like to append the mutation data to, but joining on multiple columns is not possible as far as I can tell. Is there a way around this?
Try the query below. It generates a list of all year/month combinations for each row, then expands that list and removes the extra columns.
let
    Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
    #"Changed Type" = Table.TransformColumnTypes(Source, {{"Start", type date}, {"End", type date}}),
    #"Added Custom" = Table.AddColumn(
        #"Changed Type",
        "newcol",
        each
            let
                begin = Date.StartOfMonth([Start]),
                End2 = if [End] = null then [Start] else [End]
            in
                List.Accumulate(
                    {0..(Date.Year(End2) - Date.Year([Start])) * 12 + (Date.Month(End2) - Date.Month([Start]))},
                    {},
                    (s, c) => s & {Date.AddMonths(begin, c)}
                )
    ),
    #"Expanded newcol" = Table.ExpandListColumn(#"Added Custom", "newcol"),
    #"Added Custom2" = Table.AddColumn(#"Expanded newcol", "Year", each Date.Year([newcol])),
    #"Added Custom3" = Table.AddColumn(#"Added Custom2", "Month", each Date.Month([newcol])),
    #"Removed Columns" = Table.RemoveColumns(#"Added Custom3", {"Start", "End", "Function", "newcol"})
in
    #"Removed Columns"

Deedle missing values after grouping

I have two frames, each of which contains some IDs and zero to many measures for each ID. I want to get the average measure per ID for each frame and combine to a larger frame.
The problem is that when an ID does not appear in one of the two frames, after grouping it results in a missing value in the combined frame. Here is an example. Notice ID "Chris" does not appear in frame A.
let aF = frame [ "AID" =?> Series.ofValues [ "Andrew"; "Andrew"; "Andrew"]; "AMES" =?> Series.ofValues [ 2; 4; 3]]
let bF = frame [ "BID" =?> Series.ofValues [ "Andrew"; "Chris"; "Andrew"]; "BMES" =?> Series.ofValues [ 1; 6; 7]]
let groupF = frame [ "AG" => (aF |> Frame.groupRowsByString "AID" |> Frame.getCol "AMES") ; "BG" => (bF |> Frame.groupRowsByString "BID" |> Frame.getCol "BMES") ]
let groupFMean = groupF |> Frame.getNumericCols |> Series.mapValues (Stats.levelMean fst) |> Frame.ofColumns |> Frame.fillMissingWith 0
groupFMean.SaveCsv( "tgroupFMean.csv", includeRowKeys=true, keyNames=["Id"] )
The resulting table looks like this:
Id      AG  BG
Andrew  3   4
Chris       6
And the blank cell is "". I've tried variations with fillMissingWith 0 (at series and frame level) without success.
The answer is not very obvious - the problem is that fillMissingWith only touches columns that have the same type as the value you are using to fill the data - so for example, fillMissingWith "Unknown" would only fill missing values in columns that are string.
In your case, Frame.fillMissingWith 0 is only applied to columns of type int and there are no such columns. If you use Frame.fillMissingWith 0.0, things work as expected!
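For reference, here is the last step of the pipeline from the question with only the fill value changed to a float literal (a minimal sketch of the fix described above):

let groupFMean =
    groupF
    |> Frame.getNumericCols
    |> Series.mapValues (Stats.levelMean fst)
    |> Frame.ofColumns
    |> Frame.fillMissingWith 0.0   // 0.0 (float) matches the float columns produced by Stats.levelMean, so Chris's AG becomes 0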
PS: If you have any thoughts on how this could be done better, please let us know. I'm really not sure what the right behavior is here!

Get elements by two hour range in ruby / rails [duplicate]

I'm having some difficulty with a MySQL query I want to write.
SELECT a.timestamp, name, count(b.name)
FROM time a, id b
WHERE a.user = b.user
AND a.id = b.id
AND b.name = 'John'
AND a.timestamp BETWEEN '2010-11-16 10:30:00' AND '2010-11-16 11:00:00'
GROUP BY a.timestamp
This is my current output:
timestamp name count(b.name)
------------------- ---- -------------
2010-11-16 10:32:22 John 2
2010-11-16 10:35:12 John 7
2010-11-16 10:36:34 John 1
2010-11-16 10:37:45 John 2
2010-11-16 10:48:26 John 8
2010-11-16 10:55:00 John 9
2010-11-16 10:58:08 John 2
How do I group them into 5-minute intervals?
I want my output to be like:
timestamp name count(b.name)
------------------- ---- -------------
2010-11-16 10:30:00 John 2
2010-11-16 10:35:00 John 10
2010-11-16 10:40:00 John 0
2010-11-16 10:45:00 John 8
2010-11-16 10:50:00 John 0
2010-11-16 10:55:00 John 11
This works with every interval.
PostgreSQL
SELECT
    TIMESTAMP WITH TIME ZONE 'epoch' +
    INTERVAL '1 second' * round(extract('epoch' from timestamp) / 300) * 300 as timestamp,
    name,
    count(b.name)
FROM time a, id b
WHERE …
GROUP BY
    round(extract('epoch' from timestamp) / 300), name
MySQL
SELECT
    timestamp, -- not sure about that
    name,
    count(b.name)
FROM time a, id b
WHERE …
GROUP BY
    UNIX_TIMESTAMP(timestamp) DIV 300, name
I came across the same issue.
I found that it is easy to group by any minute interval:
just divide the epoch by the interval length in seconds and then either round or floor the result to get rid of the remainder. So if you want a 5-minute interval, you would use 300 seconds.
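For example (in UTC), 2010-11-16 10:32:22 is epoch 1289903542; floor(1289903542 / 300) * 300 = 1289903400, which is exactly 2010-11-16 10:30:00, the start of its 5-minute bucket.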
SELECT COUNT(*) cnt,
to_timestamp(floor((extract('epoch' from timestamp_column) / 300 )) * 300)
AT TIME ZONE 'UTC' as interval_alias
FROM TABLE_NAME GROUP BY interval_alias
interval_alias cnt
------------------- ----
2010-11-16 10:30:00 2
2010-11-16 10:35:00 10
2010-11-16 10:45:00 8
2010-11-16 10:55:00 11
This will return the data correctly grouped by the selected minute interval; however, it will not return intervals that don't contain any data. In order to get those empty intervals, we can use the generate_series function.
SELECT generate_series(MIN(date_trunc('hour',timestamp_column)),
max(date_trunc('minute',timestamp_column)),'5m') as interval_alias FROM
TABLE_NAME
Result:
interval_alias
-------------------
2010-11-16 10:30:00
2010-11-16 10:35:00
2010-11-16 10:40:00
2010-11-16 10:45:00
2010-11-16 10:50:00
2010-11-16 10:55:00
Now, to get the result including intervals with zero occurrences, we just outer join both result sets.
SELECT series.minute as interval, coalesce(cnt.amnt,0) as count from
(
SELECT count(*) amnt,
to_timestamp(floor((extract('epoch' from timestamp_column) / 300 )) * 300)
AT TIME ZONE 'UTC' as interval_alias
from TABLE_NAME group by interval_alias
) cnt
RIGHT JOIN
(
SELECT generate_series(min(date_trunc('hour',timestamp_column)),
max(date_trunc('minute',timestamp_column)),'5m') as minute from TABLE_NAME
) series
on series.minute = cnt.interval_alias
The end result will include all 5-minute intervals in the series, even those that have no values.
interval count
------------------- ----
2010-11-16 10:30:00 2
2010-11-16 10:35:00 10
2010-11-16 10:40:00 0
2010-11-16 10:45:00 8
2010-11-16 10:50:00 0
2010-11-16 10:55:00 11
The interval can be easily changed by adjusting the last parameter of generate_series. In our case we use '5m' but it could be any interval we want.
You should rather use GROUP BY UNIX_TIMESTAMP(time_stamp) DIV 300 instead of round(... / 300), because with the rounding I found that some records were counted into two grouped result sets.
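To see the difference: with floor (DIV), every timestamp from 10:30:00 through 10:34:59 lands in the 10:30:00 bucket, but with round, a timestamp such as 10:33:00 (epoch 1289903580; 1289903580 / 300 = 4299678.6, which rounds up to 4299679) is pushed into the 10:35:00 bucket, so a single real 5-minute window gets split across two groups.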
For Postgres, I found it easier and more accurate to use the date_trunc function, like:
select name, sum(count), date_trunc('minute',timestamp) as timestamp
FROM table
WHERE xxx
GROUP BY name,date_trunc('minute',timestamp)
ORDER BY timestamp
You can provide various resolutions like 'minute', 'hour', 'day', etc. to date_trunc.
The query will be something like:
SELECT
DATE_FORMAT(
MIN(timestamp),
'%d/%m/%Y %H:%i:00'
) AS tmstamp,
name,
COUNT(id) AS cnt
FROM
table
GROUP BY ROUND(UNIX_TIMESTAMP(timestamp) / 300), name
Not sure if you still need it.
SELECT FROM_UNIXTIME(FLOOR((UNIX_TIMESTAMP(timestamp)) / 300) * 300) AS t,
       timestamp,
       count(1) AS c
FROM users
GROUP BY t
ORDER BY t;
2016-10-29 19:35:00 | 2016-10-29 19:35:50 | 4 |
2016-10-29 19:40:00 | 2016-10-29 19:40:37 | 5 |
2016-10-29 19:45:00 | 2016-10-29 19:45:09 | 6 |
2016-10-29 19:50:00 | 2016-10-29 19:51:14 | 4 |
2016-10-29 19:55:00 | 2016-10-29 19:56:17 | 1 |
You're probably going to have to break up your timestamp into ymd:HM and use DIV 5 to split the minutes up into 5-minute bins -- something like
select year(a.timestamp),
       month(a.timestamp),
       hour(a.timestamp),
       minute(a.timestamp) DIV 5,
       name,
       count(b.name)
FROM time a, id b
WHERE a.user = b.user AND a.id = b.id AND b.name = 'John'
  AND a.timestamp BETWEEN '2010-11-16 10:30:00' AND '2010-11-16 11:00:00'
GROUP BY year(a.timestamp),
         month(a.timestamp),
         hour(a.timestamp),
         minute(a.timestamp) DIV 5
...and then futz with the output in client code so it appears the way you like. Or you can build up the whole date string using the SQL concat operator instead of getting separate columns, if you like.
select concat(year(a.timestamp), "-", month(a.timestamp), "-" ,day(a.timestamp),
" " , lpad(hour(a.timestamp),2,'0'), ":",
lpad((minute(a.timestamp) DIV 5) * 5, 2, '0'))
...and then group on that
How about this one:
select
from_unixtime(unix_timestamp(timestamp) - unix_timestamp(timestamp) mod 300) as ts,
sum(value)
from group_interval
group by ts
order by ts
;
I found out that with MySQL probably the correct query is the following:
SELECT SUBSTRING( FROM_UNIXTIME( CEILING( timestamp /300 ) *300,
'%Y-%m-%d %H:%i:%S' ) , 1, 19 ) AS ts_CEILING,
SUM(value)
FROM group_interval
GROUP BY SUBSTRING( FROM_UNIXTIME( CEILING( timestamp /300 ) *300,
'%Y-%m-%d %H:%i:%S' ) , 1, 19 )
ORDER BY SUBSTRING( FROM_UNIXTIME( CEILING( timestamp /300 ) *300,
'%Y-%m-%d %H:%i:%S' ) , 1, 19 ) DESC
Let me know what you think.
select
  CONCAT(CAST(CREATEDATE AS DATE),' ',datepart(hour,createdate),':',ROUND(CAST((CAST((CAST(DATEPART(MINUTE,CREATEDATE) AS DECIMAL (18,4)))/5 AS INT)) AS DECIMAL (18,4))/12*60,2)) AS '5MINDATE',
  count(something)
from TABLE
group by CONCAT(CAST(CREATEDATE AS DATE),' ',datepart(hour,createdate),':',ROUND(CAST((CAST((CAST(DATEPART(MINUTE,CREATEDATE) AS DECIMAL (18,4)))/5 AS INT)) AS DECIMAL (18,4))/12*60,2))
This will do exactly what you want. Replace:
- dt - your datetime column
- c - the call field
- astro_transit1 - your table
- 300 - the number of seconds for each time bucket
SELECT
FROM_UNIXTIME(300 * ROUND(UNIX_TIMESTAMP(r.dt) / 300)) AS 5datetime,
(SELECT
r.c
FROM
astro_transit1 ra
WHERE
ra.dt = r.dt
ORDER BY ra.dt DESC
LIMIT 1) AS first_val
FROM
astro_transit1 r
GROUP BY UNIX_TIMESTAMP(r.dt) DIV 300
LIMIT 0 , 30
Based on @boecko's answer for MySQL, I used a CTE (Common Table Expression) to speed up the query execution time, so this:
SELECT
    `timestamp`,
    `name`,
    count(b.`name`)
FROM `time` a, `id` b
WHERE …
GROUP BY
    UNIX_TIMESTAMP(`timestamp`) DIV 300, name
becomes:
WITH cte AS (
    SELECT
        `timestamp`,
        `name`,
        count(b.`name`),
        UNIX_TIMESTAMP(`timestamp`) DIV 300 AS `intervals`
    FROM `time` a, `id` b
    WHERE …
)
SELECT * FROM cte GROUP BY `intervals`
On a large amount of data, this sped things up by more than 10x!
As timestamp and time are reserved words in MySQL, don't forget to use backticks (`...`) around each table and column name!
Hope it will help some of you.
