Forecasting in R using the forecast package

I'm trying to forecast hourly data for 30 days for a process.
I have used the following code:
# The packages required for projection are loaded
library("forecast")
library("zoo")
# Data Preparation steps
# There is an assumption that we have all the data for all
# the 24 hours of the month of May
time_index <- seq(from = as.POSIXct("2014-05-01 07:00"),
to = as.POSIXct("2014-05-31 18:00"), by = "hour")
value <- round(runif(n = length(time_index),100,500))
# Using the zoo function, we merge the values with the date and hour
# to create an extensible time series object
eventdata <- zoo(value, order.by = time_index)
# As the forecast package requires time series objects,
# the command below is used
eventdata <- ts(value, order.by = time_index)
# For forecasting the values for the next 30 days, the below command is used
z<-hw(t,h=30)
plot(z)
I don't think the output of this code is right: the forecasted line looks wrong, and the dates are not projected correctly on the chart.
I'm not sure whether the fault lies in the data preparation or whether the output is as expected. Any ideas?
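A minimal sketch of a working version, assuming the intent is an hourly series with daily seasonality: ts() has no order.by argument (that call errors out), and hw(t, h = 30) both references an undefined object t and asks for 30 steps, which here means 30 hours rather than 30 days. Note also that ts() keeps no POSIXct index, so the plot axis shows period numbers rather than dates, which would explain the axis problem described:

library(forecast)

time_index <- seq(from = as.POSIXct("2014-05-01 07:00"),
                  to = as.POSIXct("2014-05-31 18:00"), by = "hour")
value <- round(runif(n = length(time_index), 100, 500))

# build an hourly ts with daily seasonality (24 observations per period)
eventdata <- ts(value, frequency = 24)

# forecast 30 days ahead: h counts steps (hours here), so 30 * 24
z <- hw(eventdata, h = 30 * 24)
plot(z)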

Related

Error when trying to calculate mean and SD of environmental dataset with loop from .nc data

I was trying to calculate the mean and SD per month of a variable from an environmental dataset (a .nc file of daily sea surface temperature over two years), and the loop I used gives me the following error:
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'x' in selecting a method for function 'mean': recursive indexing failed at level 2
I have no idea where my error could be, but in case you are curious, I was using the following .nc dataset (SST only, 2018-2019) from Copernicus: sstdata
Here is the script I have used so far and the packages I'm using:
# Load required libraries (install the required libraries using the Packages tab, if necessary)
library(raster)
library(ncdf4)
# Open the .nc file with the environmental data
ENV = nc_open("SST.nc")
ENV
# create an index of the month for every (daily) capture from 2018 to 2019 (in this dataset)
m_index = c()
for (y in 2018:2019) {
  # if leap year (does not apply to this data, but in case a larger year range is used)
  if (y%%4==0) { m_index = c(m_index, rep(1:12, times = c(31,29,31,30,31,30,31,31,30,31,30,31))) }
  # if non-leap year
  else { m_index = c(m_index, rep(1:12, times = c(31,28,31,30,31,30,31,31,30,31,30,31))) }
}
length(m_index) # expected length (730)
table(m_index) # expected number of records assigned to each of the twelve months
# Compute the monthly mean and standard deviation.
# We first create two empty raster stacks...
SST_MM = stack() # this stack will contain the twelve average SST layers (one per month)
SST_MSD = stack() # this stack will contain the twelve SST st. dev. layers (one per month)
# We run the following loop (this can take a while)
for (m in 1:12) { # for every month
  print(m) # print current month to track the progress of the loop...
  sstMean = mean(ENV[[which(m_index==m)]], na.rm=T) # calculate the mean SST for all the records of the current month
  sstSd = calc(ENV[[which(m_index==m)]], sd, na.rm=T) # calculate the st. dev. of SST for all the records of the current month
  # add the monthly records to the stacks
  SST_MM = stack(SST_MM, sstMean)
  SST_MSD = stack(SST_MSD, sstSd)
}
And, as mentioned, here is the output of the loop, including the error:
> SST_MM = stack() # this stack will contain the twelve average SST (one per month)
> SST_MSD = stack() # this stack will contain the twelve SST st. dev. (one per month)
> for (m in 1:12) { # for every month
+
+ print(m) # print current month to track the progress of the loop...
+
+ sstMean = mean(ENV[[which(m_index==m)]], na.rm=T) # calculate the mean SST for all the records of the current month
+ sstSd = calc(ENV[[which(m_index==m)]], sd, na.rm=T) # calculate the st. dev. of SST for all the records of the current month
+
+ # add the monthly records to the stacks
+
+ SST_MM = stack(SST_MM, sstMean)
+ SST_MSD = stack(SST_MSD, sstSd)
+
+ }
[1] 1
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'x' in selecting a method for function 'mean': recursive indexing failed at level 2
It seems you are making things too complicated. I think the easiest way to do this is with terra::tapp, like this:
library(terra)
x <- rast("SST.nc")
xmn <- tapp(x, "yearmonths", mean)
xsd <- tapp(x, "yearmonths", sd)
or more manually:
library(terra)
x <- rast("SST.nc")
y <- format(time(x),"%Y")
m <- format(time(x),"%m")
ym <- paste0(y, "_", m)
r <- tapp(x, ym, mean)
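As for the error itself: nc_open() returns an ncdf4 metadata object, not a raster, so ENV[[which(m_index==m)]] performs list indexing on that metadata, which is what "recursive indexing failed" complains about. If you prefer to keep the question's raster-based loop, a minimal sketch of the fix (assuming the file's SST variable loads as one layer per daily record):

library(raster)
library(ncdf4) # needed by raster to read .nc files
# read the .nc file as a RasterStack with one layer per day (730 layers)
ENV <- stack("SST.nc")
# the [[ ]] subsetting from the loop now selects raster layers, as intended
sstMean <- mean(ENV[[which(m_index == 1)]], na.rm = TRUE)
sstSd   <- calc(ENV[[which(m_index == 1)]], sd, na.rm = TRUE)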

lubridate failing to parse with hms() but not with ymd_hms()

My aim is to get ymd() and hms() as id and time, respectively, to analyze the data by day and by hour afterwards. I have done this before, and it still works with other datasets, so I think it is a data-format problem.
I have tried multiple things, including converting the timestamps to POSIXct (with as.POSIXct()) or to a date-time (with as_datetime()), but all of those result in the same parsing warning, with all vectors generated by mutate being NA.
I've also tried converting to POSIXct specifying tz = "America/Sao_Paulo" before using lubridate, but that does not seem to work either.
Here's my code:
dat.all %>% str()
'data.frame': 585 obs. of 2 variables:
$ feed_datetime : chr "2019-04-10 14:14:00" "2019-04-10 14:14:00" "2019-04-10 14:14:00" "2019-04-11 07:41:00" ...
$ def_datetime : chr "2019-04-11 07:14:00" "2019-04-11 08:24:00" "2019-04-11 08:40:00" "2019-04-11 08:40:00" ...
# no NA's:
nrow(dat.all[is.na(dat.all$def_datetime), ])
[1] 0
# using lubridate
dat.all.m <- dat.all %>%
  mutate(
    # convert to POSIXct (it works)
    def_datetime = as.POSIXct(def_datetime, tz = "America/Sao_Paulo"),
    feed_datetime = as.POSIXct(feed_datetime, tz = "America/Sao_Paulo"),
    # keep the same format but with lubridate (it works)
    feed_datetime = lubridate::ymd_hms(feed_datetime),
    def_datetime = lubridate::ymd_hms(def_datetime),
    # use dmy or ymd to get dates as id (generates NAs)
    day = lubridate::dmy(def_datetime, tz = "America/Sao_Paulo")
  )
Warning message:
Problem while computing `day = lubridate::dmy(def_datetime, tz = "America/Sao_Paulo")`.
ℹ All formats failed to parse. No formats found.
Following @Henrik's comments on this post:
x <- dat.all$def_datetime[1]
x
[1] "2019-04-11 07:14:00"
format <- guess_formats(x, c("Ymd HMS", "mdY", "BdY", "Bdy", "bdY", "bdy", "mdy", "dby"), locale = "English")
format
YOmdHMS YmdHMS
"%Y-%Om-%d %H:%M:%S" "%Y-%m-%d %H:%M:%S"
strptime(x, format)
[1] "2019-04-11 07:14:00 -03" "2019-04-11 07:14:00 -03"
Maybe this is related to specifying tz? Why is this not a problem for other datasets with the same data format?
Thanks.
Update
I was able to get the date ((1), "yyyy/mm/dd" format) and the hour ((2), "HH:mm:ss" format) with the following lines, but I still haven't managed to get the date (in "yyyy/mm/dd" format) with lubridate::ymd():
dat.all %>%
  mutate(
    # (1)
    day = lubridate::as_date(def_datetime),
    # (2)
    gut_transit_time_hms = hms::as_hms(gut_transit_time)
  )
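A likely explanation, sketched under the assumption that def_datetime always carries a time component: lubridate::ymd() and dmy() parse date-only strings, so a value like "2019-04-11 07:14:00" fails to parse at all (and dmy() additionally expects day-month-year order, which this data is not in). Parsing the full timestamp first and then extracting the date avoids the NAs:

library(lubridate)
x <- "2019-04-11 07:14:00"
ymd(x)  # NA with a parse warning: the string carries a time part
dmy(x)  # NA as well: wrong component order on top of the time part
as_date(ymd_hms(x, tz = "America/Sao_Paulo"))  # "2019-04-11"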

Time series in R: How to convert raw data from int type to time type

For subject 1 in the training data, I am trying to plot nine time series corresponding to nine different features. The data is supposed to be a time series, but R is not reading it as such. The first two columns, as you can see, are not time, but the rest should be. How do I do this in R or R Markdown (I think it should be the same)?
I tried plotting it:
ggplot(train_ds, aes(x = Activity, y = TimeBodyAccelerometer-mean-X) +
theme_minimal() +
geom_point()
)
but I get this error:
Error in ggplot():
! mapping should be created with aes().
✖ You've supplied a object
Backtrace:
ggplot2::ggplot(...)
ggplot2:::ggplot.default(...)
Error in ggplot(train_ds, aes(x = Activity, y = TimeBodyAccelerometer - :
✖ You've supplied a object
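The error appears to come from the parentheses rather than the data: the closing ) after geom_point() means theme_minimal() and geom_point() sit inside the ggplot() call, so the mapping argument that ggplot() receives is not a plain aes() object. A minimal corrected call, assuming the column names shown in the question (the non-syntactic name needs backticks):

library(ggplot2)
ggplot(train_ds, aes(x = Activity, y = `TimeBodyAccelerometer-mean-X`)) +
  theme_minimal() +
  geom_point()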

Getting error "no method or default for coercing “patchwork” to “dgCMatrix”" in scRNA analysis using Seurat, normalization step

I have a scRNA dataset with 10 healthy controls and 17 patients, and I am doing a comparative analysis. I did the following:
Created 10 Seurat objects for the 10 healthy controls and merged them into one (healthy)
Created 17 Seurat objects for the 17 patients and merged them into one (patients)
Created a list of the two objects: data <- list(healthy, patients)
Normalized the data:
data <- lapply(data, function(x) {
  x <- NormalizeData(x)
  x <- FindVariableFeatures(x, selection.method = "vst", nfeatures = 2000)
})
I am getting the following error:
Error in as(object = data, Class = "dgCMatrix") : no method or default for coercing “patchwork” to “dgCMatrix”
Please help
After some trial and error, I was able to reproduce the same error by running this line of code before your lapply call:
data <- list(p1 + p2, p2)
where p1 and p2 are ggplot objects (with patchwork loaded, adding two ggplots produces a patchwork object, which matches the class named in the error).
It looks to me like the objects in your data list are not Seurat objects.
You should check for mistakes in the code you used to generate your list of Seurat objects.
I hope this helps :)
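A quick, hedged check to confirm the diagnosis before re-running the normalization (reusing the question's data name):

# each element should report "Seurat"; seeing "patchwork" confirms the mix-up
sapply(data, function(x) class(x)[1])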

How to iterate over dates (days/hours/months) inside the data pipeline using Beam on Cloud Dataflow?

Greetings folks!
I am trying to load data from GCS to BigQuery using Cloud Dataflow.
The data inside the bucket is stored with the following structure:
"bucket_name/user_id/date/date_hour_user_id.csv"
for example "my_bucket/user_1262/2021-01-02/2021-01-02_18_user_id.csv"
Say I have 5 users, for example ["user_1262", "user_1263", "user_1264", "user_1265", "user_1266"],
and I want to load one hour of data (for example hour = "18") into BigQuery for all clients over a range of one week, so I iterate over all clients to get the files with the 18 prefix. I created the code below, but the iteration breaks the data pipeline: on each pass from one client to the next, the code runs a new pipeline.
import argparse
import datetime as dt

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run(argv=None):
    mydate = ['2021-01-02 00:00:00', '2021-01-02 23:00:00']
    fmt = '%Y-%m-%d %H:%M:%S'
    hour = dt.timedelta(hours=1)
    day = dt.timedelta(days=1)
    start_time, end_time = [dt.datetime.strptime(d, fmt) for d in mydate]
    currdate = start_time
    cols = ['cols0', 'cols1']
    parser = argparse.ArgumentParser(description="User Input Data.")
    args, beam_args = parser.parse_known_args(argv)
    while currdate <= end_time:
        str_date = currdate.strftime('%Y-%m-%d')
        str_hour = '%02d' % (int(currdate.strftime('%H')))
        print("********WE ARE PROCESSING FILE ON DATE ---> %s HOUR --> %s" % (str_date, str_hour))
        user_list = ["user_1262", "user_1263", "user_1264", "user_1265", "user_1266"]
        for user_id in user_list:
            file_path_user = "gs://user_id/%s/%s/%s_%s_%s.csv" % (user_id, str_date, str_date, str_hour, user_id)
            # a new pipeline is created (and run) for every single file
            with beam.Pipeline(options=PipelineOptions(beam_args)) as p:
                input_data = p | 'ReadUserfile' >> beam.io.ReadFromText(file_path_user)
                decode = input_data | 'decodeData' >> beam.ParDo(de_code())
                clean_data = decode | 'clean_dt' >> beam.Filter(clea_data)
                # writetobq....
        currdate += day

run()
You can continue to generate the list of input files in your pipeline-creation script. However, instead of creating a new pipeline for each input file, put the file paths into a list. Then make your pipeline begin with a Create transform over that list, followed by a textio.ReadAllFromText transform. This turns your list of files into a PCollection and reads from every file in it, all within a single pipeline, as in the sketch below.
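A minimal sketch of that approach, reusing the question's naming (my_bucket, the fixed hour prefix, and build_file_list are illustrative; the question's decode/clean/write steps are left as placeholders where its own transforms would go):

import datetime as dt

import apache_beam as beam
from apache_beam.io.textio import ReadAllFromText
from apache_beam.options.pipeline_options import PipelineOptions

def build_file_list(start_time, end_time, str_hour, user_list):
    # collect every per-user file path for the fixed hour, one day at a time
    files = []
    currdate = start_time
    while currdate <= end_time:
        str_date = currdate.strftime('%Y-%m-%d')
        for user_id in user_list:
            files.append("gs://my_bucket/%s/%s/%s_%s_%s.csv"
                         % (user_id, str_date, str_date, str_hour, user_id))
        currdate += dt.timedelta(days=1)
    return files

def run(beam_args, file_list):
    # one pipeline for all files: Create -> ReadAllFromText -> transforms
    with beam.Pipeline(options=PipelineOptions(beam_args)) as p:
        (p
         | 'FileList' >> beam.Create(file_list)  # PCollection of file paths
         | 'ReadAll' >> ReadAllFromText()        # one read over all files
         # the question's decodeData / clean_dt / BigQuery-write steps
         # would follow here, unchanged, inside this single pipeline
        )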
