Group data by time frames

I have a file named example.csv with this data:
day,number,price
2010-01-01 00:01:00,1,0.4
2010-01-01 00:02:00,2,1.2
2010-01-01 00:03:00,3,2.5
2010-01-01 00:04:00,4,9.1
2010-01-01 00:05:00,5,3.4
2010-01-01 00:06:00,6,6.9
2010-01-01 00:07:00,7,8.9
2010-01-01 00:08:00,8,9.1
2010-01-01 00:09:00,9,4.2
2010-01-01 00:10:00,10,11.2
2010-01-01 00:11:00,11,53.12
2010-01-01 00:12:00,12,45.21
2010-01-01 00:12:00,13,1.1
2010-01-01 00:13:00,14,3.43
2010-01-01 00:14:00,15,21.42
Load the file content:
example = read.csv(file="path/example.csv", sep=",")
Build the data frame:
DD <- structure(list(day = structure(c(example$day), class = c("POSIXct", "POSIXt"), tzone = ""),
number = c(example$number), price = c(example$price)), .Names = c("day", "number", "price"), row.names = c(NA,
-15L), class = "data.frame")
After:
ddx <- xts(DD[,c('number','price')], order.by = DD[,'day'])
And:
period.apply(ddx, endpoints(ddx, on = 'minutes',k=3), sum)
And after the last period.apply it gives back this:
[,1]
1970-01-01 02:00:14 301.18
Why does it give back 1970?

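Once you have your datetime column as POSIXct or some other datetime class, you can use period.apply from xts. To get 3-minute intervals, use endpoints(dd.xts, on = 'minutes', k = 3). The 1970 date most likely appears because the day column is read from the CSV as text (a factor in older R versions), so wrapping it in structure(..., class = "POSIXct") reinterprets the underlying integer codes as seconds since 1970-01-01 instead of parsing the timestamps. Converting with as.POSIXct(example$day) first should avoid that.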
# with a data frame
DD <- structure(list(Dates = structure(c(1034644620, 1034644800, 1034644920,
1034734860, 1034734920), class = c("POSIXct", "POSIXt"), tzone = ""),
Price = c(0.6, 1.4, 4.1, 1.6, 7.7), Price.2 = c(5, 2.4, 9.1,
1.4, 3.7)), .Names = c("Dates", "Price", "Price.2"), row.names = c(NA,
-5L), class = "data.frame")
DD
# Dates Price Price.2
# 1 2002-10-15 11:17:00 0.6 5.0
# 2 2002-10-15 11:20:00 1.4 2.4
# 3 2002-10-15 11:22:00 4.1 9.1
# 4 2002-10-16 12:21:00 1.6 1.4
# 5 2002-10-16 12:22:00 7.7 3.7
# coerce to xts
ddx <- xts(DD[,c('Price','Price.2')], order.by = DD[,'Dates'])
period.apply(ddx, endpoints(ddx, on = 'minutes',k=3), sum)
## [,1]
## 2002-10-15 11:17:00 5.6
## 2002-10-15 11:20:00 3.8
## 2002-10-15 11:22:00 13.2
## 2002-10-16 12:22:00 14.4

Related

Calculate how many n.months between two dates

I want to calculate the distance between two dates in months:
Q: given start_date + n.months >= end_date, what is the smallest n?
start_date = Date.parse('2021-01-31')
end_date = Date.parse('2021-02-28')
## start_date + 1.months = 2021-02-28
## The answer is 1 month, which is clearer for a human reader
## I don't want an answer like 28 days or 0.93 months (at 30 days per month)
First I tried treating every month as 30.days, but the answer is biased for certain dates. Months have different lengths, and the end of February is always a problem.
Then I tried gems such as date_diff and time_difference, but none of them can do this; most return 28 days rather than 1 month.
As a simple approach, I can loop to find n, like:
start_date = Date.parse('2021-01-31')
end_date = Date.parse('2021-02-28')
def calc_diff(start_date, end_date)
  n = 0
  while start_date + n.months < end_date
    n += 1
  end
  n
end
Is there a better way to find the number of months n between two dates without using a loop?
Thank you.
My understanding of the question is consistent with the examples below. I have computed the difference between Date objects date1 and date2, where date2 >= date1.
require 'date'
def months_between(date1, date2)
  # Parenthesize the ternary so that it only adds the partial-month adjustment
  12 * (date2.year - date1.year) + date2.mon - date1.mon + (date2.day > date1.day ? 1 : 0)
end
months_between Date.new(2020, 1, 22), Date.new(2020, 3, 21) #=> 2
months_between Date.new(2020, 1, 22), Date.new(2020, 3, 22) #=> 2
months_between Date.new(2020, 1, 22), Date.new(2020, 3, 23) #=> 3
months_between Date.new(2020, 1, 22), Date.new(2021, 3, 21) #=> 14
months_between Date.new(2020, 1, 22), Date.new(2021, 3, 22) #=> 14
months_between Date.new(2020, 1, 22), Date.new(2021, 3, 23) #=> 15
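# find minimum n so that `start_date + n.months >= end_date`
# (Integer#months comes from ActiveSupport; the yday-based estimate assumes both dates are in the same year)
def calc_diff(start_date, end_date)
  diff = (end_date.yday - start_date.yday) / 30
  return diff if start_date + diff.months >= end_date
  diff + 1
end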
calc_diff(Date.parse('2021-01-31'), Date.parse('2021-02-28')) # 1
calc_diff(Date.parse('2021-01-31'), Date.parse('2021-04-30')) # 3
calc_diff(Date.parse('2021-01-31'), Date.parse('2021-05-31')) # 4
calc_diff(Date.parse('2021-02-01'), Date.parse('2021-06-01')) # 4
calc_diff(Date.parse('2021-02-01'), Date.parse('2021-06-02')) # 5
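A loop-free variant that needs only Ruby's standard library can lean on Date#>>, which shifts a date by whole months and clamps to the last valid day of the target month. The sketch below is an illustration, not any answerer's method: the name months_until is hypothetical, and it assumes Date#>> clamps the same way start_date + n.months does under ActiveSupport.

```ruby
require 'date'

# Smallest n such that (start_date >> n) >= end_date.
# months_until is a hypothetical helper name used for illustration.
def months_until(start_date, end_date)
  return 0 if end_date <= start_date

  # Calendar-month gap, ignoring the day of the month.
  gap = (end_date.year - start_date.year) * 12 + (end_date.month - start_date.month)

  # start_date >> gap lands in end_date's month; if it still falls short,
  # one more month is needed.
  (start_date >> gap) >= end_date ? gap : gap + 1
end

months_until(Date.new(2021, 1, 31), Date.new(2021, 2, 28)) #=> 1
months_until(Date.new(2021, 1, 31), Date.new(2021, 4, 30)) #=> 3
months_until(Date.new(2021, 2, 1),  Date.new(2021, 6, 2))  #=> 5
```

Because the month gap is computed from year and month rather than from yday, this should also work when the two dates fall in different years.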
Thanks for @Cary's and @Lam's answers.
Here is my answer for finding n.
# find the number of whole months n between start_date and target_date
def calc_diff(start_date, target_date)
  months_diff = (target_date.year * 12 + target_date.month) - (start_date.year * 12 + start_date.month)
  ## End-of-month dates need special handling, e.g.
  ## start date: 2020-01-31 ; end date: 2020-06-30
  ## => n must be 5
  ## February needs a further special case (see test case 15)
  if start_date.day > target_date.day && !((start_date == start_date.end_of_month || target_date.month == 2) && (target_date == target_date.end_of_month))
    months_diff = months_diff - 1
  end
  puts months_diff # prints the number of whole months n
  # target_date then falls inside
  # (start_date + months_diff.months) <= target_date < (start_date + (months_diff + 1).months)
  (start_date + months_diff.months)..(start_date + (months_diff + 1).months)
end
The Test Cases:
## test case 1
## 6/15 - 7/15 => n = 5
calc_diff(Date.parse('2020-01-15'), Date.parse('2020-06-19'))
## test case 2
## 7/15 - 8/15 => n = 6
calc_diff(Date.parse('2020-01-15'), Date.parse('2020-07-15'))
## test case 3
## 5/15 - 6/15 => n = 4
calc_diff(Date.parse('2020-01-15'), Date.parse('2020-06-01'))
## test case 4 (special case)
## 6/30 - 7/31 => n = 5
calc_diff(Date.parse('2020-01-31'), Date.parse('2020-06-30'))
## test case 5
## 7/30 - 8/30 => n = 3
calc_diff(Date.parse('2020-04-30'), Date.parse('2020-07-31'))
## test case 6
## 6/30 - 7/30 => n = 2
calc_diff(Date.parse('2020-04-30'), Date.parse('2020-06-30'))
## test case 7
## 5/31 - 6/30 => n = 4
calc_diff(Date.parse('2020-01-31'), Date.parse('2020-05-31'))
## test case 8
## 2/29 - 3/31 => n = 1
calc_diff(Date.parse('2020-01-31'), Date.parse('2020-02-29'))
## test case 9
## 6/29 - 7/29 => n = 4
calc_diff(Date.parse('2020-02-29'), Date.parse('2020-06-30'))
## test case 10
## 7/29 - 8/29 => n = 5
calc_diff(Date.parse('2020-02-29'), Date.parse('2020-07-31'))
## test case 11
## 1/31 - 2/29 => n = 0
calc_diff(Date.parse('2020-01-31'), Date.parse('2020-02-28'))
## test case 12
## 2/29 - 3/31 => n = 1
calc_diff(Date.parse('2020-01-31'), Date.parse('2020-03-01'))
## test case 13
## 1/17 - 2/17 => n = 0
calc_diff(Date.parse('2020-01-17'), Date.parse('2020-01-17'))
## test case 14
## 1/17 - 2/17 => n = 0
calc_diff(Date.parse('2020-01-17'), Date.parse('2020-01-18'))
## test case 15 (special case)
## 1/30 - 2/29 => n = 1
calc_diff(Date.parse('2019-12-30'), Date.parse('2020-02-28'))
## test case 16
## 2/29 - 3/30 => n = 2
calc_diff(Date.parse('2019-12-30'), Date.parse('2020-02-29'))
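The end-of-month special cases above can also be delegated to Date#>> from the standard library, which already clamps to the last day of a shorter month. The following is a minimal sketch under that assumption, not the author's method: whole_months_between is a hypothetical name, it returns only n (not the range), and it assumes target_date >= start_date.

```ruby
require 'date'

# Largest n such that (start_date >> n) <= target_date.
# whole_months_between is a hypothetical helper; assumes target_date >= start_date.
def whole_months_between(start_date, target_date)
  gap = (target_date.year - start_date.year) * 12 + (target_date.month - start_date.month)

  # If shifting by the full calendar gap overshoots the target,
  # the last month is only partially elapsed.
  (start_date >> gap) > target_date ? gap - 1 : gap
end

whole_months_between(Date.new(2020, 1, 31), Date.new(2020, 6, 30))   #=> 5 (test case 4)
whole_months_between(Date.new(2020, 1, 31), Date.new(2020, 2, 28))   #=> 0 (test case 11)
whole_months_between(Date.new(2019, 12, 30), Date.new(2020, 2, 28))  #=> 1 (test case 15)
```

The range from the answer above can then be rebuilt as (start_date >> n)..(start_date >> (n + 1)).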

Forecasting using multiple seasonal STL and ARIMA

I am attempting to forecast half-hourly electricity data. My method is to decompose the electricity consumption data using 'mstl' from the 'forecast' package by Rob Hyndman and then forecast the seasonally adjusted data using ARIMA.
df <- IntervalData %>% select(CONSUMPTION_MW)
length_test_set = 17520
h = 17520
# create msts object with daily, weekly and monthly seasonality
data_msts <- msts(df, seasonal.periods=c(48,48*7,365/12*48))
train_msts = msts(df[1:(nrow(df)-length_test_set),],seasonal.periods=c(48,48*7,365/12*48))
test_msts = msts(df[((nrow(df)-length_test_set)+1):(nrow(df)),],seasonal.periods=c(48,48*7,365/12*48))
fit_mstl = mstl(train_msts, iterate = 4, s.window = 19, robust = TRUE)
fcast_arima=forecast(fit_mstl,method='arima',h=h)
How do I specify the order of my ARIMA model, e.g. ARIMA(2,1,6)?
You will need to write your own forecast function like this (using fake data so it can be reproduced).
library(forecast)
df <- data.frame(y=rnorm(50000))
length_test_set <- 17520
h <- 17520
# create msts object with daily, weekly and monthly seasonality
data_msts <- msts(df, seasonal.periods = c(48, 48*7, 365/12*48))
train_msts <- msts(df[1:(nrow(df) - length_test_set), ], seasonal.periods = c(48, 48 * 7, 365 / 12 * 48))
test_msts <- msts(df[((nrow(df) - length_test_set) + 1):(nrow(df)), ], seasonal.periods = c(48, 48 * 7, 365 / 12 * 48))
fit_mstl <- mstl(train_msts, iterate = 4, s.window = 19, robust = TRUE)
# Function to fit specific ARIMA model and return forecasts
arima_forecast <- function(x, h, level, order, ...) {
  fit <- Arima(x, order = order, seasonal = c(0, 0, 0), ...)
  return(forecast(fit, h = h, level = level))
}
# Example using an ARIMA(3,0,0) model
fcast_arima <- forecast(fit_mstl, forecastfunction=arima_forecast, h = h, order=c(3,0,0))
Created on 2020-07-25 by the reprex package (v0.3.0)

How do I operate on groups returned by Dask's group by?

I have the following table.
value category
0 2 A
1 20 B
2 4 A
3 40 B
I want to add a mean column that contains the mean of the values for each category.
value category mean
0 2 A 3.0
1 20 B 30.0
2 4 A 3.0
3 40 B 30.0
I can do this in pandas like so
import pandas as pd
p = pd.DataFrame({"value":[2, 20, 4, 40], "category": ["A", "B", "A", "B"]})
groups = []
for _, group in p.groupby("category"):
    group.loc[:,"mean"] = group.loc[:,"value"].mean()
    groups.append(group)
pd.concat(groups).sort_index()
How do I do the same thing in Dask?
I can't use the pandas functions as-is because you can't enumerate over a groupby object in Dask. This
import dask.dataframe as dd
d = dd.from_pandas(p, chunksize=100)
list(d.groupby("category"))
raises KeyError: 'Column not found: 0'.
I can use an apply function to calculate the mean in Dask.
import dask.dataframe as dd
d = dd.from_pandas(p, chunksize=100)
q = d.groupby(["category"]).apply(lambda group: group["value"].mean(), meta="object")
q.compute()
returns
category
A 3.0
B 30.0
dtype: float64
But I can't figure out how to fold these back into the rows of the original table.
I would use a merge to achieve this operation:
import dask.dataframe as dd
import pandas as pd
df = pd.DataFrame({
    'value': [2, 20, 4, 40],
    'category': ['A', 'B', 'A', 'B']
})
ddf = dd.from_pandas(df, npartitions=1)
# Lazy-compute mean per category
mean_by_category = (ddf
    .groupby('category')
    .agg({'value': 'mean'})
    .rename(columns={'value': 'mean'})
).persist()
mean_by_category.head()
# Assign 'mean' value to each corresponding category
ddf = ddf.merge(mean_by_category, left_on='category', right_index=True)
ddf.head()
Which should then output:
category value mean
0 A 2 3.0
2 A 4 3.0
1 B 20 30.0
3 B 40 30.0

How to refer to previous targets in drake?

I would like to use the wildcard to generate a bunch of targets, and then have another set of targets that refers to those original targets. I think this example represents my idea:
plan <- drake_plan(
  sub_task = runif(1000, min = mean__, max = 50),
  full_task = sub_task * 2
)
step <- 1:4
full_plan <- evaluate_plan(
  plan,
  rules = list(
    mean__ = step
  )
)
So what I get now is 5 targets: 4 sub_tasks and a single full_task. What I'm looking for is 8 targets: the 4 sub_tasks (which are good), and 4 more that are based on those 4 sub_tasks.
This question comes up regularly, and I like how you phrased it.
More about the problem
For onlookers, I will print out the plan and the graph of the current (problematic) workflow.
library(drake)
plan <- drake_plan(
  sub_task = runif(1000, min = mean__, max = 50),
  full_task = sub_task * 2
)
step <- 1:4
full_plan <- evaluate_plan(
  plan,
  rules = list(
    mean__ = step
  )
)
full_plan
#> # A tibble: 5 x 2
#> target command
#> <chr> <chr>
#> 1 sub_task_1 runif(1000, min = 1, max = 50)
#> 2 sub_task_2 runif(1000, min = 2, max = 50)
#> 3 sub_task_3 runif(1000, min = 3, max = 50)
#> 4 sub_task_4 runif(1000, min = 4, max = 50)
#> 5 full_task sub_task * 2
config <- drake_config(full_plan)
vis_drake_graph(config)
Created on 2018-12-18 by the reprex package (v0.2.1)
Solution
As you say, we want full_task_* targets that depend on their corresponding sub_task_* targets. To accomplish this, we need to use the mean__ wildcard in the full_task_* commands as well. Wildcards are an early-days interface based on text replacement, so they do not need to be independent variable names in their own right.
library(drake)
plan <- drake_plan(
  sub_task = runif(1000, min = mean__, max = 50),
  full_task = sub_task_mean__ * 2
)
step <- 1:4
full_plan <- evaluate_plan(
  plan,
  rules = list(
    mean__ = step
  )
)
full_plan
#> # A tibble: 8 x 2
#> target command
#> <chr> <chr>
#> 1 sub_task_1 runif(1000, min = 1, max = 50)
#> 2 sub_task_2 runif(1000, min = 2, max = 50)
#> 3 sub_task_3 runif(1000, min = 3, max = 50)
#> 4 sub_task_4 runif(1000, min = 4, max = 50)
#> 5 full_task_1 sub_task_1 * 2
#> 6 full_task_2 sub_task_2 * 2
#> 7 full_task_3 sub_task_3 * 2
#> 8 full_task_4 sub_task_4 * 2
config <- drake_config(full_plan)
vis_drake_graph(config)
Created on 2018-12-18 by the reprex package (v0.2.1)

How do I convert a string into a Time object?

I searched for my problem and got a lot of solutions, but unfortunately none satisfy my need.
My problem is, I have two or more strings, and I want to convert those strings into times, and add them:
time1 = "10min 43s"
time2 = "32min 30s"
The output will be: 43min 13s
My attempted solution is:
time1 = "10min 43s"
d1=DateTime.strptime(time1, '%M ')
# Sat, 02 Nov 2013 00:10:00 +0000
time2 = "32min 30s"
d2=DateTime.strptime(time2, '%M ')
# Sat, 02 Nov 2013 00:32:00 +0000
Then I can't progress.
There are many ways to do this. Here's another:
time1 = "10min 43s"
time2 = "32min 30s"
def get_mins_and_secs(time_str)
  time_str.scan(/\d+/).map(&:to_i)
  #=> [10, 43] for time_str = time1, [32, 30] for time_str = time2
end
min, sec = get_mins_and_secs(time1)
min2, sec2 = get_mins_and_secs(time2)
min += min2
sec += sec2
if sec > 59
  min += 1
  sec -= 60
end
puts "#{min}min #{sec}sec"
Let's consider what's happening here. Firstly, you need to extract the minutes and seconds from the time strings. I made a method to do that:
def get_mins_and_secs(time_str)
  time_str.scan(/\d+/).map(&:to_i)
  #=> [10, 43] for time_str = time1, [32, 30] for time_str = time2
end
For time_str = "10min 43s", we apply the String#scan method to extract the two numbers as strings:
"10min 43s".scan(/\d+/) # => ["10", "43"]
Array#map is then used to convert these two strings to integers:
["10", "43"].map {|e| e.to_i} # => [10, 43]
This can be written more succinctly as
["10", "43"].map(&:to_i) # => [10, 43]
By chaining map to scan we obtain
"10min 43s".scan(/\d+/).map(&:to_i) # => [10, 43]
The array [10, 43] is returned and received (deconstructed) by the variables min and sec:
min, sec = get_mins_and_secs(time_str)
The rest is straightforward.
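For readers who want the carry step spelled out, it can be expressed with Integer#divmod instead of the if block. This is a small sketch of the same idea, not part of the answer above; add_min_sec is a hypothetical name, and it assumes each string contains exactly two numbers (minutes, then seconds).

```ruby
# Adds two "Mmin Ss" strings; add_min_sec is a hypothetical helper name.
# Assumes each input holds exactly two integers: minutes, then seconds.
def add_min_sec(a, b)
  min1, sec1 = a.scan(/\d+/).map(&:to_i)
  min2, sec2 = b.scan(/\d+/).map(&:to_i)

  # divmod(60) splits the summed seconds into whole minutes to carry
  # and the seconds that remain.
  carry, sec = (sec1 + sec2).divmod(60)
  "#{min1 + min2 + carry}min #{sec}s"
end

add_min_sec("10min 43s", "32min 30s") #=> "43min 13s"
```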
Here's a simple solution assuming that the format stays the same:
time1 = "10min 43s"
time2 = "32min 30s"
strings = [time1, time2]
total_time = strings.inject(0) do |sum, entry|
  minutes, seconds = entry.split(' ')
  minutes = minutes.gsub("min", "").to_i.send(:minutes)
  seconds = seconds.gsub("s", "").to_i.send(:seconds)
  sum + minutes + seconds
end
puts "#{total_time/60}min #{total_time%60}s"
Something like the following should do the trick:
# extract all the integers from the string (Integer#minutes and #seconds come from ActiveSupport)
def to_seconds(time_string)
  min, sec = time_string.scan(/\d+/).map(&:to_i)
  min.minutes + sec.seconds
end
# Divide the seconds by 60 to get minutes and format the output.
def to_time_str(seconds)
  minutes = seconds / 60
  seconds = seconds % 60
  format("%02dmin %02dsec", minutes, seconds)
end
time_in_seconds1 = to_seconds("10min 43s")
time_in_seconds2 = to_seconds("32min 30s")
to_time_str(time_in_seconds1 + time_in_seconds2)
My solution takes any number of time strings and returns the sum in the same format:
def add_times(*times)
  digits = /\d+/
  total_time = times.inject(0){|sum, entry|
    m, s = entry.scan(digits).map(&:to_i)
    sum + m*60 + s
  }.divmod(60)
  times.first.gsub(digits){total_time.shift}
end
p add_times("10min 43s", "32min 55s", "1min 2s") #=> "44min 40s"
p add_times("10:43", "32:55") #=> "43:38"
