Time Series with Gaps in R

I am trying to set up a time series in R. However, my data are daily trading data, and trading takes place 5 days a week, so there are gaps in the series. I want to set up the time series without filling the gaps. I tried the ts() function, but it only works when there are no gaps.
ncw2 = ts(ncw1, start = c("2020-01-23", 1), freq=365)

You may try the "zoo" package; it allows you to build time series with gaps (an irregular date index).
library(zoo)
df <- data.frame(
  date = c("2003-01-02", "2003-01-05", "2003-01-19"),
  values = c(3, 8, 1)
)
# use "=" (not "<-") inside data.frame() so the columns get their names directly
# note the date format: "%m" is month, whereas "%M" would be minutes
df.ts <- zoo(df[, -1], order.by = as.Date(df[, 1], "%Y-%m-%d"))
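Relating this back to the daily-trading case: a minimal sketch with made-up trading dates and prices (my own example values, not the asker's ncw1 data), showing that the weekend simply stays absent from the index instead of being filled:
library(zoo)
# hypothetical trading days: Thu, Fri, then Mon and Tue (no weekend rows at all)
trade_dates <- as.Date(c("2020-01-23", "2020-01-24", "2020-01-27", "2020-01-28"))
prices <- c(101.2, 100.8, 102.5, 103.1)
ncw2 <- zoo(prices, order.by = trade_dates)
ncw2                # printed with its irregular date index
diff(index(ncw2))   # differences of 1, 3, 1 days: the gap is kept, not padded with NAs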

Related

Stata timeseries rolling forecast

I'm new to Stata and have a question about its command language. I want to use my ARIMA model to forecast, i.e. use x[t], x[t-1], ... to produce an estimate xhat[t+1], and then roll forward one time step to make the next forecast, rebuilding the model every N time steps.
I can duplicate code, something like the following, for T, T+1, T+2, etc.:
arima x if t<=T, arima(2,0,2)
predict xhat
to produce a series of xhats to compare with the in-sample x observations. There must be a more natural way to do this in the command language. Any suggestions or pointers would be very much appreciated.
Posting a working solution provided by Stata tech support:
webuse dfex
tsset month
generate int id = _n

capture program drop forecarima
program forecarima, rclass
    syntax [if]
    tempvar yhat
    arima unemp `if', arima(1,1,0)
    local T = e(tmax)
    local T1 = `T' + 1
    summarize id if month == `T1'
    local h = r(max)
    predict `yhat', y dynamic(`T')
    return scalar y = unemp[`h']
    return scalar yhat = `yhat'[`h']
end

rolling unemp = r(y) unemp_hat = r(yhat), window(400) recursive ///
    saving(results, replace): forecarima

use results, clear
browse
This provides output with both the prediction and the observed value available. The dates are off by one step, but that is easier left to post-processing.

if (freq) x$counts else x$density length > 1 and only the first element will be used

For my thesis I have to calculate the number of workers at risk of substitution by machines. I have calculated the probability of substitution (X) and the number of employees at risk (Y) for each occupation category. I have a dataset like this:
         X    Y
1   0.1300    0
2   0.1000    0
3   0.0841 1513
4   0.0221  287
5   0.1175 3641
...
700 0.9875 4000
I tried to plot a histogram with this command:
hist(dataset1$X,dataset1$Y,xlim=c(0,1),ylim=c(0,30000),breaks=100,main="Distribution",xlab="Probability",ylab="Number of employee")
But I get this error:
Warning message:
In if (freq) x$counts else x$density :
  length > 1 and only the first element will be used
Can someone tell me what the problem is and what the right command would be?
Thank you!
It is worth pointing out that the message displayed is a warning, not an error, so it should not prevent the results from being plotted. However, it does indicate that there are some issues with the data.
Without the full dataset it is not 100% obvious what the problem is. I believe it is caused by the data not being in the correct format, with two potential issues: firstly, some rows have a Y value of 0, and these won't appear on the histogram; secondly, the observations appear to be inconsistently spaced.
Histograms are best built from one of two kinds of input:
A dataframe that has already been aggregated into consistently sized bins.
A list of the individual X values in the data.
I prefer the second technique. As originally shown here, the expandRows() function from the splitstackshape package can be used to repeat each row of the dataframe according to its number of observations:
set.seed(123)
# simulated stand-in for the real data; Y must be a whole-number count of employees
dataset1 <- data.frame(X = runif(900, 0, 1), Y = round(runif(900, 0, 1000)))
library(splitstackshape)
# repeat each row Y times, so every employee at risk contributes one X value
dataset2 <- expandRows(dataset1, "Y")
hist(dataset2$X, xlim = c(0, 1))
# for the first (pre-aggregated) approach instead: assign each X to a 1%-wide bin
dataset1$bins <- cut(dataset1$X, breaks = seq(0, 1, 0.01), labels = FALSE)
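And to round out the first approach (a small sketch of my own, reusing the dataset1 and bins columns created above): sum the Y counts within each bin and draw the result as a bar chart.
# total number of employees at risk in each 1%-wide probability bin
bin_totals <- tapply(dataset1$Y, dataset1$bins, sum)
barplot(bin_totals,
        main = "Distribution",
        xlab = "Probability bin (1 = 0-0.01, ..., 100 = 0.99-1)",
        ylab = "Number of employees")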

Predictors of different size for time series prediction using LSTM with Keras

I would like to predict the values of a time series X using another time series Y and the past values of X. In detail, I would like to predict X at time t (Xt) using (Xt-p, ..., Xt-1) and (Yt-p, ..., Yt-1, Yt), with p the size of the "look back".
So, my problem is that I do not have the same length for my 2 predictors.
Let's use an example to be clearer.
If I use a timestep of 2, I would have for one observation :
[(Xt-p,Yt-p),...,(Xt-1,Yt-1),(??,Yt)] as input and Xt as output. I do not know what to use instead of the ??
I understand that mathematically speaking I need to have the same length for my predictors, so I am looking for a value to replace the missing value.
I really do not know if there is a good solution here, or whether I can do something about it, so any help would be greatly appreciated.
Cheers!
PS: you could see my problem as if I wanted to predict the number of ice creams sold one day in advance in a city using the weather forecast for the next day. X would be the number of ice creams and Y could be the temperature.
You could e.g. do the following:
from keras.layers import Input, LSTM, Dense, concatenate
from keras.models import Model

# two inputs of different lengths: X windows of p steps, Y windows of p + 1 steps
input_x = Input(shape=input_shape_x)   # e.g. (p, 1)
input_y = Input(shape=input_shape_y)   # e.g. (p + 1, 1)
lstm_for_x = LSTM(50, return_sequences=False)(input_x)
lstm_for_y = LSTM(50, return_sequences=False)(input_y)
# merged = merge([lstm_for_x, lstm_for_y], mode="concat")  # for keras < 2.0
merged = concatenate([lstm_for_x, lstm_for_y])             # for keras >= 2.0
output = Dense(1)(merged)
model = Model([input_x, input_y], output)
model.compile(optimizer="adam", loss="mse")  # example settings; pick loss/optimizer to suit
model.fit([X, Y], X_next)
Here X is an array of X sequences, Y is an array of Y sequences (each one step longer), and X_next holds the corresponding next value of X to be predicted.

Arrange nPlot() by y-axis values

I am using a multiBarHorizontalChart with nPlot() to show variance from a mean rate. I have "negative change" bars highlighted in red and "positive change" bars in green, via grouping on a "posneg" variable. When I group by this variable, however, the years on the y axis are no longer ordered. Any idea how I could maintain the order of the years while still grouping by this variable? Personally, I think the color difference makes the graph a lot easier to interpret. Here's a reproducible example, using the data hosted on Socrata:
install.packages("RSocrata")
library(RSocrata)
library(rCharts)   # provides nPlot()
library(dplyr)     # provides %>%, mutate() and arrange() used below

url <- "https://opendata.socrata.com/dataset/Preliminary-Data-Data-Visulaization-Project-8-12-1/4xgc-ygke"
dfRatePer100 <- read.socrata(url)
dfRatePer100 <- subset(dfRatePer100, select = c(1, 3), Year != "NA")
colnames(dfRatePer100) <- c("Year", "Dollar.Rate")
dfRatePer100$Dollar.Rate <- as.numeric(dfRatePer100$Dollar.Rate, 3)
dfRatePer100$mean <- mean(dfRatePer100$Dollar.Rate)
dfRatePer100 <- dfRatePer100 %>%
  mutate(rateVariance = Dollar.Rate - mean) %>%
  arrange(desc(Year))
dfRatePer100$PosNeg <- ifelse(dfRatePer100$rateVariance > 0, "Positive rate change from mean",
                       ifelse(dfRatePer100$rateVariance < 0, "Negative rate change from mean",
                              "No change from mean"))

ratePer100 <- nPlot(rateVariance ~ Year, group = "PosNeg", data = dfRatePer100,
                    type = 'multiBarHorizontalChart')
ratePer100$chart(showLegend = TRUE)
ratePer100$chart(showControls = FALSE)
ratePer100$chart(color = c('green', 'red'))
ratePer100$yAxis(axisLabel = 'Variance from mean rate (in dollars)')
ratePer100$yAxis(tickFormat = "#! function(d) {return d3.format('.2f')(d)} !#")
ratePer100$set(width = 600)
ratePer100
I appreciate any help! Thanks.
Not an answer but a suggestion: looking at the source code, nvd3's multiBarHorizontalChart groups by the groups first and then sorts by value, so I don't think this is possible. taucharts might be a good option if rCharts is not a requirement.
library(rCharts)

df <- data.frame(
  year = as.character(2000:2012)
  ,value = runif(13, -1, 1)
)
df$group <- ifelse(df$value > 0, "positive", "negative")

np <- nPlot(
  value ~ year,
  group = "group",
  data = df,
  type = 'multiBarHorizontalChart'
)
np$chart(color = c('green', 'red'))
np

library(taucharts)
tauchart(df) %>%
  tau_bar("value", "year", "group", horizontal = TRUE) %>%
  tau_legend()
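Mapping the same taucharts pattern onto the question's own columns would look roughly like this (a sketch of mine, assuming the dfRatePer100 data frame built in the question is available in the session):
library(taucharts)
# horizontal bars of the variance for each year, colored by the PosNeg grouping
tauchart(dfRatePer100) %>%
  tau_bar("rateVariance", "Year", "PosNeg", horizontal = TRUE) %>%
  tau_legend()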

Identifying outlying datapoints from residuals (GeoLight package)

I am analysing some data collected from a geolocator placed on a migratory bird. In a nutshell, my data are sunrise and sunset times, which are then used to determine position on the globe.
I am using the GeoLight package (http://cran.r-project.org/web/packages/GeoLight/GeoLight.pdf) to identify outlying data. Specifically, I am using the loessFilter function, which applies a local polynomial regression and identifies residuals that are greater than 3 interquartile ranges (specified by k in the code when applying the function).
My problem is: the function returns graphs in which outlying datapoints are identified in red, but there seems to be an issue with the code itself regarding the returned TRUE/FALSE vector stating which points are outliers: all points come back TRUE, even when outliers are identified on the plot.
I have therefore modified the function code to state which residuals are outliers.
However, when I then remove those rows from the original dataset and re-run the function, the points have not been removed. There is therefore some discrepancy between the residual indices and the rows of the original data: i.e. if the output states that residual 78 is an outlying point, removing row 78 from the original data does not remove that outlying datapoint.
I would very much appreciate some help with removing the outlying datapoints identified using the function. It seems like a very easy fix but I can't seem to figure it out.
Code for full function and data below
Thanks
Emma
log2$tFirst<-as.POSIXlt(log2$tFirst)
log2$tSecond<-as.POSIXlt(log2$tSecond)
CODE TO GET OUTLYING RESIDUALS
i.get.outliers <- function(residuals, k = 3) {
  # x is a vector of residuals
  # k is how many interquartile ranges beyond the quartiles a point may lie
  # before it is called an outlier; 3 looks like a good preset for k
  x <- residuals
  QR <- quantile(x, probs = c(0.25, 0.75))
  IQR <- QR[2] - QR[1]
  Lower.Band <- QR[1] - (k * IQR)
  Upper.Band <- QR[2] + (k * IQR)
  delete <- which(x < Lower.Band | x > Upper.Band)
  return(as.vector(delete))
}
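As a quick sanity check of this criterion (a made-up example of mine, not part of the original post): a residual sitting far outside k interquartile ranges should have its position returned.
set.seed(1)
fake_resid <- c(rnorm(10), 50)     # ten ordinary residuals plus one extreme value
i.get.outliers(fake_resid, k = 3)  # returns 11, the position of the extreme residual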
LOESS FILTER FUNCTION CODE
loessFilter <- function(tFirst, tSecond, type, k=3, plot=TRUE){
  tw <- data.frame(datetime=as.POSIXct(c(tFirst,tSecond),"UTC"),type=c(type,ifelse(type==1,2,1)))
  tw <- tw[!duplicated(tw$datetime),]
  tw <- tw[order(tw[,1]),]
  hours <- as.numeric(format(tw[,1],"%H"))+as.numeric(format(tw[,1],"%M"))/60
  for(t in 1:2){
    cor <- rep(NA, 24)
    for(i in 0:23){
      cor[i+1] <- max(abs((c(hours[tw$type==t][1],hours[tw$type==t])+i)%%24 -
                          (c(hours[tw$type==t],hours[tw$type==t][length(hours)])+i)%%24),na.rm=T)
    }
    hours[tw$type==t] <- (hours[tw$type==t] + (which.min(round(cor,2)))-1)%%24
  }
  dawn <- data.frame(id=1:sum(tw$type==1),
                     datetime=tw$datetime[tw$type==1],
                     type=tw$type[tw$type==1],
                     hours = hours[tw$type==1], filter=FALSE)
  dusk <- data.frame(id=1:sum(tw$type==2),
                     datetime=tw$datetime[tw$type==2],
                     type=tw$type[tw$type==2],
                     hours = hours[tw$type==2], filter=FALSE)
  for(d in seq(30,k,length=5)){
    predict.dawn <- predict(loess(dawn$hours[!dawn$filter]~as.numeric(dawn$datetime[!dawn$filter]),span=0.1))
    predict.dusk <- predict(loess(dusk$hours[!dusk$filter]~as.numeric(dusk$datetime[!dusk$filter]),span=0.1))
    del.dawn <- i.get.outliers(as.vector(residuals(loess(dawn$hours[!dawn$filter]~
                                 as.numeric(dawn$datetime[!dawn$filter]),span=0.1))),k=d)
    del.dusk <- i.get.outliers(as.vector(residuals(loess(dusk$hours[!dusk$filter]~
                                 as.numeric(dusk$datetime[!dusk$filter]),span=0.1))),k=d)
    if(length(del.dawn)>0) dawn$filter[!dawn$filter][del.dawn] <- TRUE
    if(length(del.dusk)>0) dusk$filter[!dusk$filter][del.dusk] <- TRUE
  }
  if(plot){
    par(mfrow=c(2,1),mar=c(3,3,0.5,3),oma=c(2,2,0,0))
    plot(dawn$datetime[dawn$type==1],dawn$hours[dawn$type==1],pch="+",cex=0.6,xlab="",ylab="",yaxt="n")
    lines(dawn$datetime[!dawn$filter], predict(loess(dawn$hours[!dawn$filter]~as.numeric(dawn$datetime[!dawn$filter]),span=0.1)), type="l")
    points(dawn$datetime[dawn$filter],dawn$hours[dawn$filter],col="red",pch="+",cex=1)
    axis(2,labels=F)
    mtext("Sunrise",4,line=1.2)
    plot(dusk$datetime[dusk$type==2],dusk$hours[dusk$type==2],pch="+",cex=0.6,xlab="",ylab="",yaxt="n")
    lines(dusk$datetime[!dusk$filter], predict(loess(dusk$hours[!dusk$filter]~as.numeric(dusk$datetime[!dusk$filter]),span=0.1)), type="l")
    points(dusk$datetime[dusk$filter],dusk$hours[dusk$filter],col="red",pch="+",cex=1)
    axis(2,labels=F)
    legend("bottomleft",c("Outside filter","Inside filter"),pch=c("+","+"),col=c("black","red"),
           bty="n",cex=0.8)
    mtext("Sunset",4,line=1.2)
    mtext("Time",1,outer=T)
    mtext("Sunrise/Sunset hours (rescaled)",2,outer=T)
  }
  all <- rbind(subset(dusk,filter),subset(dawn,filter))
  filter <- rep(FALSE,length(tFirst))
  filter[tFirst%in%all$datetime | tSecond%in%all$datetime] <- TRUE
  # original code:
  # return(!filter)
  # altered code: return the outlying points instead
  return(c("delete dawn", del.dawn, "delete dusk", del.dusk))
}
APPLY FUNCTION
loessFilter(log2$tFirst, log2$tSecond, type=1, k=4, plot=TRUE)
Remove the values (need to remove both the sunrise and sunset rows):
log2b<-log2[-c(77,78,124,125),]
length(log2$tFirst)
length(log2b$tFirst)
Repeat the function to see if the values have gone:
loessFilter(log2b$tFirst, log2b$tSecond, type=1, k=4, plot=TRUE)
outliers still there!!
HERE ARE THE DATA:
http://www.4shared.com/file/jxVuTsVHce/002_geolight.html
A bit too long to post the full data here and the example won't work with a dummy dataset :)
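One way to make the removal line up with what the plots flag, offered as my own suggestion rather than anything from the post: instead of deleting rows by residual index, have the modified function return the datetimes of the flagged dawn/dusk rows (the all$datetime values it already computes) and match those back against tFirst/tSecond, mirroring the matching the function does internally.
# assuming loessFilter has been altered to end with: return(all$datetime)
bad_times <- loessFilter(log2$tFirst, log2$tSecond, type = 1, k = 4, plot = TRUE)
# a row is an outlier if either of its twilight times was flagged
drop <- log2$tFirst %in% bad_times | log2$tSecond %in% bad_times
log2b <- log2[!drop, ]
# re-run on the reduced data to check whether the flagged points have been removed
loessFilter(log2b$tFirst, log2b$tSecond, type = 1, k = 4, plot = TRUE)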
