How to iterate over dates (days/hours/months) inside a data pipeline using Beam on Cloud Dataflow?

Greetings folks,
I am trying to load data from GCS into BigQuery using Cloud Dataflow.
The data inside the bucket is stored with the following structure:
"bucket_name/user_id/date/date_hour_user_id.csv"
For example: "my_bucket/user_1262/2021-01-02/2021-01-02_18_user_id.csv"
Say I have 5 users, for example ["user_1262", "user_1263", "user_1264", "user_1265", "user_1266"], and I want to load one hour of data into BigQuery (for example hour = "18") for all clients over a range of one week. I want to iterate over all the clients to pick up the file with the hour prefix 18. I wrote the code below, but the iteration affects the pipeline: each time it moves from one client to the next, the code runs a new pipeline.
import argparse
import datetime as dt

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run(argv=None):
    mydate = ['2021-01-02 00:00:00', '2021-01-02 23:00:00']
    fmt = '%Y-%m-%d %H:%M:%S'
    hour = dt.timedelta(hours=1)
    day = dt.timedelta(days=1)
    start_time, end_time = [dt.datetime.strptime(d, fmt) for d in mydate]
    currdate = start_time
    cols = ['cols0', 'cols1']
    parser = argparse.ArgumentParser(description="User Input Data.")
    args, beam_args = parser.parse_known_args(argv)
    while currdate <= end_time:
        str_date = currdate.strftime('%Y-%m-%d')
        str_hour = '%02d' % int(currdate.strftime('%H'))
        print("********WE ARE PROCESSING FILE ON DATE ---> %s HOUR --> %s" % (str_date, str_hour))
        user_list = ["user_1262", "user_1263", "user_1264", "user_1265", "user_1266"]
        for user_id in user_list:
            file_path_user = "gs://user_id/%s/%s/%s_%s_%s.csv" % (user_id, str_date, str_date, str_hour, user_id)
            # a new pipeline is created and run for every user/date (this is the problem)
            with beam.Pipeline(options=PipelineOptions(beam_args)) as p:
                input_data = p | 'ReadUserfile' >> beam.io.ReadFromText(file_path_user, columns=cols)
                decode = input_data | 'decodeData' >> beam.ParDo(de_code())
                clean_data = decode | 'clean_dt' >> beam.Filter(clea_data)
                writetobq....
        currdate += day

run()

You can continue to generate the list of input files in your pipeline creation script. However, instead of creating a new pipeline for each input file, put all of the paths into a single list. Then make your pipeline begin with a Create transform over that list, followed by a textio.ReadAllFromText transform. This creates a PCollection of file paths and then reads every file in that list.
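A minimal sketch of that approach, reusing the date/hour/user loop from the question only to build the list of file paths (the bucket name and the placement of the asker's de_code / clea_data / BigQuery steps are assumptions carried over from the question):

import datetime as dt

import apache_beam as beam
from apache_beam.io.textio import ReadAllFromText
from apache_beam.options.pipeline_options import PipelineOptions


def run(beam_args=None):
    user_list = ["user_1262", "user_1263", "user_1264", "user_1265", "user_1266"]
    start = dt.datetime(2021, 1, 2, 18)  # hour 18 of the first day
    file_paths = []
    for day_offset in range(7):  # one week
        currdate = start + dt.timedelta(days=day_offset)
        str_date = currdate.strftime('%Y-%m-%d')
        str_hour = currdate.strftime('%H')
        for user_id in user_list:
            file_paths.append(
                "gs://my_bucket/%s/%s/%s_%s_%s.csv"
                % (user_id, str_date, str_date, str_hour, user_id))

    # one pipeline for all files: Create the list of paths, then read them all
    with beam.Pipeline(options=PipelineOptions(beam_args)) as p:
        _ = (
            p
            | 'FilePaths' >> beam.Create(file_paths)
            | 'ReadAll' >> ReadAllFromText()
            # | 'decodeData' >> beam.ParDo(de_code())   # asker's existing steps
            # | 'clean_dt' >> beam.Filter(clea_data)
            # | ... write to BigQuery
        )

This way a single Dataflow job reads every per-user, per-hour file instead of launching one job per file.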

Related

Error when trying to calculate mean and SD of environmental dataset with loop from .nc data

I was trying to calculate the monthly mean and SD of a variable from an environmental dataset (a .nc file of daily sea surface temperature over 2 years), and the loop I used gives me the following error:
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'x' in selecting a method for function 'mean': recursive indexing failed at level 2
I have no idea where my error could be, but in case you are curious, I was using the following .nc dataset (SST only, for 2018-2019) from Copernicus sstdata.
Here is the script I used so far and the packages I'm using:
# Load the required libraries (install them using the Packages tab, if necessary)
library(raster)
library(ncdf4)

# Open the .nc file with the environmental data
ENV = nc_open("SST.nc")
ENV

# Create an index of the month for every (daily) capture from 2018 to 2019 (in this dataset)
m_index = c()
for (y in 2018:2019) {
  # leap year (does not apply to this data, but in case a larger year range is used)
  if (y %% 4 == 0) { m_index = c(m_index, rep(1:12, times = c(31,29,31,30,31,30,31,31,30,31,30,31))) }
  # non-leap year
  else { m_index = c(m_index, rep(1:12, times = c(31,28,31,30,31,30,31,31,30,31,30,31))) }
}
length(m_index) # expected length (730)
table(m_index)  # expected number of records assigned to each of the twelve months

# Compute the monthly mean and standard deviation.
# We first create two empty raster stacks...
SST_MM = stack()  # this stack will contain the twelve average SST layers (one per month)
SST_MSD = stack() # this stack will contain the twelve SST st. dev. layers (one per month)

# We run the following loop (this can take a while)
for (m in 1:12) { # for every month
  print(m) # print the current month to track the progress of the loop
  sstMean = mean(ENV[[which(m_index == m)]], na.rm = T) # mean SST over all records of the current month
  sstSd = calc(ENV[[which(m_index == m)]], sd, na.rm = T) # st. dev. of SST over all records of the current month
  # add the monthly layers to the stacks
  SST_MM = stack(SST_MM, sstMean)
  SST_MSD = stack(SST_MSD, sstSd)
}
And as mentioned, the output of the loop including the error:
SST_MM = stack() # this stack will contain the twelve average SST (one per month)
> SST_MSD = stack() # this stack will contain the twelve SST st. dev. (one per month)
> for (m in 1:12) { # for every month
+
+ print(m) # print current month to track the progress of the loop...
+
+ sstMean = mean(ENV[[which(m_index==m)]], na.rm=T) # calculate the mean SST for all the records of the current month
+ sstSd = calc(ENV[[which(m_index==m)]], sd, na.rm=T) # calculate the st. dev. of SST for all the records of the current month
+
+ # add the monthly records to the stacks
+
+ SST_MM = stack(SST_MM, sstMean)
+ SST_MSD = stack(SST_MSD, sstSd)
+
+ }
[1] 1
Error in h(simpleError(msg, call)) :
  error in evaluating the argument 'x' in selecting a method for function 'mean': recursive indexing failed at level 2
It seems you are making things more complicated than they need to be. The error itself most likely occurs because ENV is the ncdf4 object returned by nc_open(), not a raster object, so ENV[[which(m_index==m)]] attempts recursive list indexing instead of selecting raster layers. I think the easiest way to do this is with terra::tapp, like this:
library(terra)
x <- rast("SST.nc")
xmn <- tapp(x, "yearmonths", mean)
xsd <- tapp(x, "yearmonths", sd)
or more manually:
library(terra)
x <- rast("SST.nc")
y <- format(time(x),"%Y")
m <- format(time(x),"%m")
ym <- paste0(y, "_", m)
r <- tapp(x, ym, mean)

lubridate failing to parse with hms() but not with ymd_hms()

My aim is to get ymd() and hms() as id and time columns, respectively, so I can analyze the data by day and by hour afterwards. I have done this before, and it still works with other datasets, so I think it is a data format problem.
I have tried several things, including converting the timestamps to POSIXct (with as.POSIXct()) or to a date-time (with as_datetime()), but all of them result in the same parsing warning, with all the vectors generated by mutate being NA.
I've also tried converting to POSIXct specifying tz = "America/Sao_Paulo" before using lubridate, but it does not seem to work either.
Here's my code:
dat.all %>% str()
'data.frame': 585 obs. of 2 variables:
 $ feed_datetime : chr "2019-04-10 14:14:00" "2019-04-10 14:14:00" "2019-04-10 14:14:00" "2019-04-11 07:41:00" ...
 $ def_datetime  : chr "2019-04-11 07:14:00" "2019-04-11 08:24:00" "2019-04-11 08:40:00" "2019-04-11 08:40:00" ...

# no NA's:
nrow(dat.all[is.na(dat.all$def_datetime), ])
[1] 0

# using lubridate
dat.all.m <- dat.all %>%
  mutate(
    # convert to POSIXct (it works)
    def_datetime = as.POSIXct(def_datetime, tz = "America/Sao_Paulo"),
    feed_datetime = as.POSIXct(feed_datetime, tz = "America/Sao_Paulo"),
    # keep the same format but with lubridate (it works)
    feed_datetime = lubridate::ymd_hms(feed_datetime),
    def_datetime = lubridate::ymd_hms(def_datetime),
    # use dmy or ymd to get dates as id (generates NAs)
    day = lubridate::dmy(def_datetime, tz = "America/Sao_Paulo")
  )
Warning message:
Problem while computing `day = lubridate::dmy(def_datetime, tz = "America/Sao_Paulo")`.
ℹ All formats failed to parse. No formats found.
Following Henrik's comments on this post:
x <- dat.all$def_datetime[1]
x
[1] "2019-04-11 07:14:00"
format <- guess_formats(x, c("Ymd HMS", "mdY", "BdY", "Bdy", "bdY", "bdy", "mdy", "dby"), locale = "English")
format
YOmdHMS YmdHMS
"%Y-%Om-%d %H:%M:%S" "%Y-%m-%d %H:%M:%S"
strptime(x, format)
[1] "2019-04-11 07:14:00 -03" "2019-04-11 07:14:00 -03"
Maybe this is related to specifying tz? Why is this not a problem for other datasets with the same data format?
Thanks.
Update
I was able to get the date ((1), "yyyy/mm/dd" format) and the hour ((2), "HH:mm:ss" format) with the following lines, but I still haven't managed to extract the date ("yyyy/mm/dd" format) with lubridate::ymd():
dat.all %>%
  mutate(
    # (1)
    day = lubridate::as_date(def_datetime),
    # (2)
    gut_transit_time_hms = hms::as_hms(gut_transit_time)
  )
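Side note: dmy() expects day-month-year input, while def_datetime is in year-month-day order, which is most likely why dmy() produces NAs. A minimal sketch of one way to derive both a date and a time-of-day column with ymd_hms(), assuming the same dat.all data frame as above (the new column names are just illustrative):

library(dplyr)
library(lubridate)

dat.all.m <- dat.all %>%
  mutate(
    # parse the full year-month-day hour:minute:second timestamps once
    def_datetime  = ymd_hms(def_datetime,  tz = "America/Sao_Paulo"),
    feed_datetime = ymd_hms(feed_datetime, tz = "America/Sao_Paulo"),
    # date part, to group records by day
    day  = as_date(def_datetime),
    # time-of-day part, to group records by hour
    time = hms::as_hms(def_datetime)
  )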

DataFlowRunner + Beam in streaming mode with a SideInput AsDict hangs

I have a simple graph that reads a Pub/Sub message (currently just a single string key), creates a very short window, generates 3 integers keyed by this key via a beam.ParDo, and uses a simple Map to create a single "config" entry with this as the key.
Ultimately, there are 2 PCollections:
items: [('key', 0), ('key', 1), ...]
infos: [('key', 'the value is key')]
I want a final beam.Map over items that uses infos as a dictionary side input so I can look up the value in the dictionary.
Using the LocalRunner, the final print works with the side input.
On Dataflow the first two steps print, but the final Map with the side input is never called, presumably because it somehow ends up in an unbounded window (despite the earlier window function).
I am using runner_v2, dataflow prime, and streaming engine.
from typing import Iterable

import apache_beam as beam

p = beam.Pipeline(options=pipeline_options)

pubsub_message = (
    p
    | beam.io.gcp.pubsub.ReadFromPubSub(
        subscription='projects/myproject/testsubscription')
    | 'SourceWindow' >> beam.WindowInto(
        beam.transforms.window.FixedWindows(1e-6),
        trigger=beam.transforms.trigger.Repeatedly(beam.transforms.trigger.AfterCount(1)),
        accumulation_mode=beam.transforms.trigger.AccumulationMode.DISCARDING))


def _create_items(pubsub_key: bytes) -> Iterable[tuple[str, int]]:
    for i in range(3):
        yield pubsub_key.decode(), i


def _create_info(pubsub_key: bytes) -> tuple[str, str]:
    return pubsub_key.decode(), f'the value is {pubsub_key.decode()}'


items = pubsub_message | 'CreateItems' >> beam.ParDo(_create_items) | beam.Reshuffle()
info = pubsub_message | 'CreateInfo' >> beam.Map(_create_info)


def _print_item(keyed_item: tuple[str, int], info_dict: dict[str, str]) -> None:
    key, _ = keyed_item
    log(key + '::' + info_dict[key])


_ = items | 'MapWithSideInput' >> beam.Map(_print_item, info_dict=beam.pvalue.AsDict(info))
Here is the output in local runner:
Creating item 0
Creating item 1
Creating item 2
Creating info b'key'
key::the value is key
key::the value is key
key::the value is key
Here is the Dataflow graph (screenshot omitted).
I've tried various windowing functions over the AsDict, but I can never get it to be exactly the same window as my input.
Thoughts on what I might be doing wrong here?

How to limit number of lines per file written using FileIO

Is there a way to limit the number of lines in each written shard using TextIO, or maybe FileIO?
Example:
1. Read rows from BigQuery - batch job (the result is 19500 rows, for example).
2. Make some transformations.
3. Write files to Google Cloud Storage (19 files, each file limited to 1000 records, one file with 500 records).
4. A Cloud Function is triggered to make a POST request to an external API for each file in GCS.
Here is what I have tried so far, but it doesn't work (trying to limit each file to 1000 rows):
BQ_DATA = (p | 'read_bq_view' >> beam.io.Read(
                beam.io.BigQuerySource(query=query, use_standard_sql=True))
             | beam.Map(json.dumps))

(BQ_DATA | beam.WindowInto(GlobalWindows(),
                           trigger=Repeatedly(AfterCount(1000)),
                           accumulation_mode=AccumulationMode.DISCARDING)
         | WriteToFiles(path='fileio', destination="csv"))
Am I conceptually wrong or is there any other way to implement this?
You can implement the write-to-GCS step inside a ParDo and limit the number of elements to include in each "batch", like this:
import apache_beam as beam
from apache_beam.io import filesystems


class WriteToGcsWithRowLimit(beam.DoFn):
    def __init__(self, row_size=1000):
        self.row_size = row_size
        self.rows = []

    def process(self, element):
        self.rows.append(element)
        if len(self.rows) >= self.row_size:
            self._write_file()

    def finish_bundle(self):
        # flush any remaining rows at the end of the bundle
        if len(self.rows) > 0:
            self._write_file()

    def _write_file(self):
        from time import time
        new_file = 'gs://bucket/file-{}.csv'.format(time())
        writer = filesystems.FileSystems.create(path=new_file)
        writer.write('\n'.join(self.rows).encode('utf-8'))  # may need different formatting
        self.rows = []
        writer.close()


BQ_DATA | beam.ParDo(WriteToGcsWithRowLimit())
Note that this will not create any files with fewer than 1000 rows, but you can change the logic in process to do that.
(Edit 1 to handle the remainders)
(Edit 2 to stop using counters, as files would be overwritten)
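An alternative that avoids keeping state on the DoFn is to batch the elements first and write one file per batch. This is an untested sketch: the output path is a placeholder, and BatchElements caps batches at 1000 elements but may emit smaller ones:

import uuid

import apache_beam as beam
from apache_beam.io import filesystems
from apache_beam.transforms.util import BatchElements


def _write_batch_to_gcs(batch):
    # one output file per batch of at most 1000 rows; the path is a placeholder
    path = 'gs://bucket/file-{}.csv'.format(uuid.uuid4())
    writer = filesystems.FileSystems.create(path)
    writer.write('\n'.join(batch).encode('utf-8'))
    writer.close()


_ = (BQ_DATA
     | 'Batch' >> BatchElements(min_batch_size=1000, max_batch_size=1000)
     | 'WriteBatch' >> beam.Map(_write_batch_to_gcs))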

Forecasting in R using forecast package

I'm trying to forecast hourly data for 30 days for a process.
I have used the following code:
# The packages required for projection are loaded
library("forecast")
library("zoo")

# Data preparation steps
# There is an assumption that we have all the data for all
# the 24 hours of the month of May
time_index <- seq(from = as.POSIXct("2014-05-01 07:00"),
                  to = as.POSIXct("2014-05-31 18:00"), by = "hour")
value <- round(runif(n = length(time_index), 100, 500))

# Using the zoo function, we merge the data with the date and hour
# to create an extensible time series object
eventdata <- zoo(value, order.by = time_index)

# As the forecast package requires all the objects to be time series objects,
# the command below is used
eventdata <- ts(value, order.by = time_index)

# To forecast the values for the next 30 days, the command below is used
z <- hw(t, h = 30)
plot(z)
I feel the output of this code is not right. The forecast line looks wrong, and the dates are not projected correctly on the chart. I'm not sure whether the fault lies in the data preparation or whether the output is as expected. Any ideas?
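For reference, a minimal sketch of one common way to set this up, assuming the intent was to fit Holt-Winters on the simulated hourly series itself (not on t) with daily seasonality, and that "30 days" means 30 * 24 hourly steps:

library(forecast)

# hourly series with daily seasonality (frequency = 24)
time_index <- seq(from = as.POSIXct("2014-05-01 07:00"),
                  to = as.POSIXct("2014-05-31 18:00"), by = "hour")
value <- round(runif(n = length(time_index), 100, 500))
eventdata <- ts(value, frequency = 24)

# h counts steps of the series (hours), so 30 days ahead = 30 * 24 steps
z <- hw(eventdata, h = 30 * 24)
plot(z)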
