lubridate failing to parse with hms() but not with ymd_hms() - parsing

My aim is to get the ymd() and hms() as id and time, respectively, to analyze data by day and by hour afterwards. I have done this before, and it still works with other datasets, so I think it is a data format problem.
I have tried several things, including converting the timestamps to POSIXct (with as.POSIXct()) or to a date-time (with as_datetime()), but all of them result in the same parsing warning, and every vector generated by mutate is NA.
I've also tried converting to POSIXct specifying tz = "America/Sao_Paulo" before using lubridate, but it does not seem to work either.
Here's my code:
dat.all %>% str()
'data.frame': 585 obs. of 2 variables:
$ feed_datetime : chr "2019-04-10 14:14:00" "2019-04-10 14:14:00" "2019-04-10 14:14:00" "2019-04-11 07:41:00" ...
$ def_datetime : chr "2019-04-11 07:14:00" "2019-04-11 08:24:00" "2019-04-11 08:40:00" "2019-04-11 08:40:00" ...
# no NA's:
nrow(dat.all[is.na(dat.all$def_datetime), ])
[1] 0
# using lubridate
dat.all.m <- dat.all %>%
  mutate(
    # convert to POSIXct (it works)
    def_datetime = as.POSIXct(def_datetime, tz = "America/Sao_Paulo"),
    feed_datetime = as.POSIXct(feed_datetime, tz = "America/Sao_Paulo"),
    # keep same format but with lubridate (it works)
    feed_datetime = lubridate::ymd_hms(feed_datetime),
    def_datetime = lubridate::ymd_hms(def_datetime),
    # use dmy or ymd to get dates as id (generate NAs)
    day = lubridate::dmy(def_datetime, tz = "America/Sao_Paulo")
  )
Warning message:
Problem while computing `day = lubridate::dmy(def_datetime, tz = "America/Sao_Paulo")`.
ℹ All formats failed to parse. No formats found.
Following @Henrik's comments on this post:
x <- dat.all$def_datetime[1]
x
[1] "2019-04-11 07:14:00"
format <- guess_formats(x, c("Ymd HMS", "mdY", "BdY", "Bdy", "bdY", "bdy", "mdy", "dby"), locale = "English")
format
YOmdHMS YmdHMS
"%Y-%Om-%d %H:%M:%S" "%Y-%m-%d %H:%M:%S"
strptime(x, format)
[1] "2019-04-11 07:14:00 -03" "2019-04-11 07:14:00 -03"
Maybe this is related to specifying tz? Why is this not a problem for other datasets with the same data format?
Thanks.
Update
I was able to get the date (1, "yyyy/mm/dd" format) and the hour (2, "HH:mm:ss" format) with the following lines, but I still didn't manage to gather the date ("yyyy/mm/dd" format) with lubridate::ymd():
dat.all %>%
mutate(
# (1)
day = lubridate::as_date(def_datetime),
# (2)
gut_transit_time_hms = hms::as_hms(gut_transit_time)
)


How can I convert a DataFrame column of strings (in dd/mm/yyyy format) to datetime dtype?
The easiest way is to use to_datetime:
df['col'] = pd.to_datetime(df['col'])
It also offers a dayfirst argument for European times (but beware this isn't strict).
Here it is in action:
In [11]: pd.to_datetime(pd.Series(['05/23/2005']))
Out[11]:
0 2005-05-23 00:00:00
dtype: datetime64[ns]
You can pass a specific format:
In [12]: pd.to_datetime(pd.Series(['05/23/2005']), format="%m/%d/%Y")
Out[12]:
0 2005-05-23
dtype: datetime64[ns]
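To illustrate the difference between the dayfirst hint and an explicit format, here is a minimal sketch. The sample strings are made up for illustration; the only calls used are pd.to_datetime with its dayfirst and format arguments.

```python
import pandas as pd

# dayfirst=True is a hint: an ambiguous string is read as day/month/year.
ambiguous = pd.to_datetime("10/11/2012", dayfirst=True)
print(ambiguous.day, ambiguous.month)  # day 10, month 11

# An explicit format is strict: a string that cannot match raises an error.
try:
    pd.to_datetime("05/23/2005", format="%d/%m/%Y")
    strict_failed = False
except ValueError:
    strict_failed = True
print(strict_failed)
```

If the column is known to be day-first throughout, passing the explicit format is probably the safer choice, since a bad value fails loudly instead of being reinterpreted.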
If your date column is a string of the format '2017-01-01'
you can use pandas astype to convert it to datetime.
df['date'] = df['date'].astype('datetime64[ns]')
or use datetime64[D] if you want Day precision and not nanoseconds
print(type(df['date'].iloc[0]))
yields
<class 'pandas._libs.tslib.Timestamp'>
the same as when you use pandas.to_datetime
You can try it with formats other than '%Y-%m-%d', but at least this one works.
You can use the following if you want to specify tricky formats:
df['date_col'] = pd.to_datetime(df['date_col'], format='%d/%m/%Y')
More details on format here:
Python 2 https://docs.python.org/2/library/datetime.html#strftime-strptime-behavior
Python 3 https://docs.python.org/3.7/library/datetime.html#strftime-strptime-behavior
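If you have a mixture of formats in your date, don't forget to set infer_datetime_format=True to make life easier. (In pandas 2.0 and later this argument is deprecated; format='mixed' infers the format of each element individually.)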
df['date'] = pd.to_datetime(df['date'], infer_datetime_format=True)
Source: pd.to_datetime
or if you want a customized approach:
from datetime import datetime

def autoconvert_datetime(value):
    formats = ['%m/%d/%Y', '%m-%d-%y']  # formats to try
    result_format = '%d-%m-%Y'  # output format
    for dt_format in formats:
        try:
            dt_obj = datetime.strptime(value, dt_format)
            return dt_obj.strftime(result_format)
        except Exception:  # raised when the format doesn't match
            pass
    return value  # let it be if it doesn't match

df['date'] = df['date'].apply(autoconvert_datetime)
Try this solution:
Change '2022-12-31 00:00:00' to '2022-12-31 00:00:01'
Then run this code: pandas.to_datetime(pandas.Series(['2022-12-31 00:00:01']))
Output: 2022-12-31 00:00:01
Multiple datetime columns
If you want to convert multiple string columns to datetime, then using apply() would be useful.
df[['date1', 'date2']] = df[['date1', 'date2']].apply(pd.to_datetime)
You can pass parameters to to_datetime as kwargs.
df[['start_date', 'end_date']] = df[['start_date', 'end_date']].apply(pd.to_datetime, format="%m/%d/%Y")
Use format= to speed up
If the column contains a time component and you know the format of the datetime/time, then passing the format explicitly can significantly speed up the conversion. There's barely any difference if the column is only a date, though. In my project, for a column with 5 million rows, the difference was huge: ~2.5 min vs 6 s.
It turns out explicitly specifying the format is about 25x faster. The following runtime plot shows that there's a huge gap in performance depending on whether you passed format or not.
The code used to produce the plot:
import random

import pandas as pd
import perfplot

# value ranges for month, day, year, hour and minute
mdYHM = range(1, 13), range(1, 29), range(2000, 2024), range(24), range(60)
perfplot.show(
    kernels=[lambda x: pd.to_datetime(x), lambda x: pd.to_datetime(x, format='%m/%d/%Y %H:%M')],
    labels=['pd.to_datetime(x)', "pd.to_datetime(x, format='%m/%d/%Y %H:%M')"],
    n_range=[2**k for k in range(19)],
    setup=lambda n: pd.Series([f"{m}/{d}/{Y} {H}:{M}"
                               for m,d,Y,H,M in zip(*[random.choices(e, k=n) for e in mdYHM])]),
    equality_check=pd.Series.equals,
    xlabel='len(df)'
)

How to iterate over dates (days/ hours/ months) inside the data pipeline using beam on cloud dataflow?

Greetings folks!
I am trying to load data from GCS to BigQuery using Cloud Dataflow.
The data inside the bucket is stored in the following structure:
"bucket_name/user_id/date/date_hour_user_id.csv"
for example "my_bucket/user_1262/2021-01-02/2021-01-02_18_user_id.csv".
Say I have 5 users, for example ["user_1262", "user_1263", "user_1264", "user_1265", "user_1266"],
and I want to load one hour of data into BigQuery for all clients, for example hour = "18" over a range of one week. I want to iterate over all
clients to get the files with the prefix 18. I have written the code below, but the iteration affects the data
pipeline: each time it moves from one client to the next, the code runs a new pipeline.
import argparse
import datetime as dt

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run(argv=None):
    mydate = ['2021-01-02 00:00:00', '2021-01-02 23:00:00']
    fmt = '%Y-%m-%d %H:%M:%S'
    hour = dt.timedelta(hours=1)
    day = dt.timedelta(days=1)
    start_time, end_time = [dt.datetime.strptime(d, fmt) for d in mydate]
    currdate = start_time
    cols = ['cols0', 'cols1']
    parser = argparse.ArgumentParser(description="User Input Data .")
    args, beam_args = parser.parse_known_args(argv)
    while currdate <= end_time:
        str_date = currdate.strftime('%Y-%m-%d')
        str_hour = '%02d' % (int(currdate.strftime('%H')))
        print("********WE ARE PROCESSING FILE ON DATE ---> %s HOUR --> %s" % (str_date, str_hour))
        user_list = ["user_1262", "user_1263", "user_1264", "user_1265", "user_1266"]
        for user_id in user_list:
            file_path_user = "gs://user_id/%s/%s/%s_%s_%s.csv" % (user_id, str_date, str_date, str_hour, user_id)
            # a new pipeline is created (and run) for every user and date
            with beam.Pipeline(options=PipelineOptions(beam_args)) as p:
                input_data = p | 'ReadUserfile' >> beam.io.ReadFromText(file_path_user, columns=cols)
                decode = input_data | 'decodeData' >> beam.ParDo(de_code())
                clean_data = decode | 'clean_dt' >> beam.Filter(clea_data)
                # writetobq....
        currdate += day


run()
You can continue to generate a list of input files in your pipeline creation script. However instead of creating a new pipeline for each input file, you can put them into a list. Then make your pipeline begin with a Create transform reading that list, followed by a textio.ReadAllFromText transform. This will create a PCollection out of your list of files, and then begin reading from that list of files.
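A minimal sketch of that approach, under some assumptions: the bucket name my_bucket, the helper names build_file_list and run, and the file naming (taken from the question's code) are illustrative only, and the decode, filter and BigQuery steps are left out.

```python
import datetime as dt


def build_file_list(users, start, end, hour, bucket="my_bucket"):
    """Return one GCS path per user and per day for the given hour."""
    paths = []
    day = start
    while day <= end:
        str_date = day.strftime("%Y-%m-%d")
        for user_id in users:
            paths.append(
                "gs://%s/%s/%s/%s_%s_%s.csv"
                % (bucket, user_id, str_date, str_date, hour, user_id)
            )
        day += dt.timedelta(days=1)
    return paths


def run(files, beam_args=None):
    # apache_beam is imported here so the path helper stays usable without it.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # A single pipeline reads every file in the list.
    with beam.Pipeline(options=PipelineOptions(beam_args)) as p:
        lines = (
            p
            | "ListFiles" >> beam.Create(files)
            | "ReadAllFiles" >> beam.io.ReadAllFromText()
        )
        # The decode, filter and BigQuery write steps would be applied to `lines` here.
```

For one week of hour "18" you would call something like run(build_file_list(user_list, dt.date(2021, 1, 2), dt.date(2021, 1, 8), "18")), so the loop over users and dates only builds strings and the pipeline itself is created once.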

Rails absolute time from reference + french "ago" string

I need to reimport some data that was exported using the "ago" stringification helper, in French.
I have a reference Time/DateTime date at which the import was done, and from there I need to subtract this "time ago" difference to find the absolute time.
I need to code the parse_relative_time method below
Some sample input/output of what I'm trying to achieve
IMPORT_DATE = Time.parse('Sat, 11 Jun 2016 15:15:19 CEST +02:00')
sample_ago_day = 'Il y a 5j' # Note: 'Il y a 5j' = "5d ago"
parse_relative_time(from: IMPORT_DATE, ago: sample_ago_day)
# => Should output something like Mon, 6 Jun 2016 (DateTime object)
sample_ago_month = 'Il y a 1 mois' # Note: 'Il y a 1 mois' = "1 month ago"
parse_relative_time(from: IMPORT_DATE, ago: sample_ago_month)
# => 11 May 2016 (it's no big deal if it's the 10th, 11th or 12th because months differ in length, I just need something approximate)
EDIT
Range of values
"il y a xj" -> x belongs to (1..31)
"il y a y mois" -> y belongs to (2..10) and "un" (for one)
Let's divide the problem into 2 sub-tasks:
Parse the 'ago' string
Since there is no built-in way in Ruby to reverse an 'ago' string, let's use regular expressions to extract the data as a duration:
def parse_ago(value)
  # If the current value matches 'il y a Xj'
  if match = /^il y a (.*?)j$/i.match(value)
    # Convert the matched data to an integer
    value = match[1].to_i
    # Validate the numeric value (between 1 and 31)
    raise 'Invalid days value!' unless (1..31).include? value
    # Convert to seconds with `days` rails helper
    value.days
  # If the current value matches 'il y a X mois'
  elsif match = /^il y a (.*?) mois$/i.match(value)
    # If the matched value is 'un', then use 1. Otherwise, use the matched value
    value = match[1] == 'un' ? 1 : match[1].to_i
    # Validate the numeric value (between 1 and 10)
    raise 'Invalid months value!' unless (1..10).include? value
    # Convert to seconds with `months` rails helper
    value.months
  # Otherwise, something is wrong (or not implemented)
  else
    raise "Invalid 'ago' value!"
  end
end
Subtract from current time
This is pretty straightforward: once we have the duration from the 'ago' string, just call the ago method on it. An example of this method in Ruby on Rails:
5.months.ago # "Tue, 12 Jan 2016 15:21:59 UTC +00:00"
The thing is, you are subtracting from IMPORT_DATE, not from the current time. In your case, pass IMPORT_DATE to ago as the reference time:
parse_ago('Il y a 5j').ago(IMPORT_DATE)
Hope this helps!

Forecasting in R using forecast package

I'm trying to forecast hourly data for 30 days for a process.
I have used the following code:
# The packages required for projection are loaded
library("forecast")
library("zoo")
# Data Preparation steps
# There is an assumption that we have all the data for all
# the 24 hours of the month of May
time_index <- seq(from = as.POSIXct("2014-05-01 07:00"),
to = as.POSIXct("2014-05-31 18:00"), by = "hour")
value <- round(runif(n = length(time_index),100,500))
# Using zoo function , we merge data with the date and hour
# to create an extensible time series object
eventdata <- zoo(value, order.by = time_index)
# As forecast package requires all the objects to be time series objects,
# the below command is used
eventdata <- ts(value, order.by = time_index)
# For forecasting the values for the next 30 days, the below command is used
z<-hw(t,h=30)
plot(z)
I feel the output of this code is not right.
The forecast line looks wrong and the dates are not projected correctly on the chart.
I'm not sure whether the fault lies in the data preparation or whether the output is as expected. Any ideas?

Lua ISO 8601 datetime parsing pattern

I'm trying to parse a full ISO8601 datetime from JSON data in Lua.
I'm having trouble with the match pattern.
So far, this is what I have:
-- Example datetime string 2011-10-25T00:29:55.503-04:00
local datetime = "2011-10-25T00:29:55.503-04:00"
local pattern = "(%d+)%-(%d+)%-(%d+)T(%d+):(%d+):(%d+)%.(%d+)"
local xyear, xmonth, xday, xhour, xminute,
xseconds, xmillies, xoffset = datetime:match(pattern)
local convertedTimestamp = os.time({year = xyear, month = xmonth,
day = xday, hour = xhour, min = xminute, sec = xseconds})
I'm stuck at how to deal with the timezone on the pattern because there is no logical or that will handle the - or + or none.
Although I know lua doesn't support the timezone in the os.time function, at least I would know how it needed to be adjusted.
I've considered stripping off everything after the "." (milliseconds and timezone), but then I really wouldn't have a valid datetime. Milliseconds are not all that important and I wouldn't mind losing them, but the timezone changes things.
Note: Somebody may have some much better code for doing this and I'm not married to it, I just need to get something useful out of the datetime string :)
The full ISO 8601 format can't be done with a single pattern match. There is too much variation.
Some examples from the wikipedia page:
There is a "compressed" format that doesn't separate numbers: YYYYMMDD vs YYYY-MM-DD
The day can be omitted: YYYY-MM-DD and YYYY-MM are both valid dates
The ordinal date is also valid: YYYY-DDD, where DDD is the day of the year (1-365/6)
When representing the time, the minutes and seconds can be omitted: hh:mm:ss, hh:mm and hh are all valid times
Moreover, time also has a compressed version: hhmmss, hhmm
And on top of that, time accepts fractions, using either the dot or the comma to denote a fraction of the lowest time element present. 14:30,5, 1430,5, 14:30.5, or 1430.5 all represent 14 hours and 30.5 minutes (that is, 14 hours, 30 minutes and 30 seconds).
Finally, the timezone section is optional. When present, it can be either the letter Z, ±hh:mm, ±hh or ±hhmm.
So, there are lots of possible exceptions to take into account, if you are going to parse according to the full spec. In that case, your initial code might look like this:
function parseDateTime(str)
  local Y,M,D = parseDate(str)
  local h,m,s = parseTime(str)
  local oh,om = parseOffset(str)
  -- subtract the offset to get UTC: 00:29-04:00 is 04:29 UTC
  return os.time({year=Y, month=M, day=D, hour=(h-oh), min=(m-om), sec=s})
end
And then you would have to create parseDate, parseTime and parseOffset. The latter should return the time offsets from UTC, while the first two would have to take into account things like compressed formats, time fractions, comma or dot separators, and the like.
parseDate will likely use the "^" character at the beginning of its pattern matches, since the date has to be at the beginning of the string. parseTime's patterns will likely start with "T". And parseOffset's will end with "$", since the time offsets, when they exist, are at the end.
A "full ISO" parseOffset function might look similar to this:
function parseOffset(str)
  if str:sub(-1)=="Z" then return 0,0 end -- ends with Z, Zulu time
  -- matches ±hh:mm, ±hhmm or ±hh; else returns nils
  local sign, oh, om = str:match("([-+])(%d%d):?(%d?%d?)$")
  sign, oh = sign or "+", oh or "00"
  -- for ±hh the minutes capture is an empty string, not nil
  if om == nil or om == "" then om = "00" end
  return tonumber(sign .. oh), tonumber(sign .. om)
end
By the way, I'm assuming that your computer is working in UTC time. If that's not the case, you will have to include an additional offset on your hours/minutes to account for that.
function parseDateTime(str)
  local Y,M,D = parseDate(str)
  local h,m,s = parseTime(str)
  local oh,om = parseOffset(str)
  local loh,lom = getLocalUTCOffset()
  -- os.time expects local time: convert to UTC (minus the string's offset), then add the local offset
  return os.time({year=Y, month=M, day=D, hour=(h-oh+loh), min=(m-om+lom), sec=s})
end
To get your local offset you might want to look at http://lua-users.org/wiki/TimeZone .
I hope this helps. Regards!
There is also the luadate package, which supports iso8601. (You probably want the patched version)
Here is a simple parseDate function for ISO dates. Note that I'm using "now" as a fallback. This may or may not work for you. YMMV 😉.
--[[
Parse date given in any of supported forms.
Note! For unrecognised format will return now.
@param str ISO date. Formats:
Y-m-d
Y-m -- this will assume January
Y -- this will assume 1st January
]]
function parseDate(str)
  local y, m, d = str:match("(%d%d%d%d)-?(%d?%d?)-?(%d?%d?)$")
  -- fallback to now
  if y == nil then
    return os.time()
  end
  -- defaults
  if m == '' then
    m = 1
  end
  if d == '' then
    d = 1
  end
  -- create time
  return os.time{year=y, month=m, day=d, hour=0}
end
--[[
--Tests:
print( os.date( "%Y-%m-%d", parseDate("2019-12-28") ) )
print( os.date( "%Y-%m-%d", parseDate("2019-12") ) )
print( os.date( "%Y-%m-%d", parseDate("2019") ) )
]]
