I need to run an R script on a group of data within a Ruby on Rails (RoR) app. I tried using yth_filter (see here), but the response time was 15+ seconds. Does anyone know of a conversion of sorts, or a better way to implement this R script?
## Filter macro series
# Filtering parameters
# A user will need to enter these
# values into interactive fields
# to adjust the trend line.
# h = number of periods forecasted
# into the future.
# p = number of previous values
# selected as independent
# variables.
rm(list=ls())
library(xts)            # for xts() and lag.xts()
library(neverhpfilter)  # assumed source of yth_filter()
# Get the file
y <- read.csv('D:/Papers/cato/cmfa_filtering/NGDP.csv', header=TRUE, row.names=1)
y <- xts(y, order.by = as.Date(rownames(y), format="%m/%d/%Y"))
# Parameters
h <- 8
p <- 4
# Initialize
lag <- h
# Generate the lags
for (i in 1:p) {
  # Create the lagged variable
  assign(paste0('y', i), lag.xts(y, k = lag))
  # Tick ahead
  lag <- lag + 1
}
# Do OLS
ols <- lm(y~y1+y2+y3+y4)
# Get the trend and cycle components
out.trend <- fitted(ols)
out.cycle <- resid(ols)
# Compare to yth_filter
out.pack <- yth_filter(y, h=h, p=p, output=c("x", "trend"), family = gaussian)
compare <- cbind.xts(out.pack,out.trend)
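For reference, the loop above only builds the h-th through (h+p-1)-th lags of y; since lag.xts() accepts a vector of lags, the same regression can be written without the assign() loop. A sketch (untested):
# sketch only: same OLS as above without the assign()/loop pattern
X <- lag.xts(y, k = h:(h + p - 1))       # columns are y lagged by h, ..., h+p-1
ols2 <- lm(as.numeric(y) ~ coredata(X))  # lm() drops the leading NA rows
trend2 <- fitted(ols2)
cycle2 <- resid(ols2)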
I am building a multivariate model for direct time series forecasting, where the goal is to make 4- and 8-step-ahead forecasts using random forest and SVR.
The results look very similar to my 1-step-ahead forecast, and I am wondering whether my code is sensible or not.
Here is an example for some 4-step-ahead forecasts using random forest in conjunction with the predict function.
As far as I understand, the difference between the 1-step-ahead and the 4-step-ahead direct forecast is that, instead of the first row after the training window, we feed the fourth row to the predict function. In the following example that means using:
test <- mydata_2diff[(i+4), ]
instead of
test <- mydata_2diff[(i+1), ]
My code looks as follows:
library(randomForest) # randomForest()
library(tictoc)       # tic()/toc()
train_end <- 112   # End of the training set
j <- 1             # Loop counter
k_max <- 10        # Number of RF estimations
pred_rf_4Q_dir <- matrix(0, (nrow(mydata_2diff) - train_end - 3), k_max) # Prediction matrix
{
  tic()
  for (i in train_end:(nrow(mydata_2diff) - 4)) {
    train <- mydata_2diff[1:i, ]     # Training data
    test  <- mydata_2diff[(i + 4), ] # Test data
    for (k in 1:k_max) {
      rf_RPI <- randomForest(RPI ~ RGDP + CPI + STI + LTI + UE + SER + SPI + ARH,
                             data = train, ntree = 500, importance = TRUE)
      pred_rf <- predict(rf_RPI, newdata = test, predict.all = TRUE)
      pred_rf_4Q_dir[j, k] <- pred_rf[["aggregate"]]
    }
    j <- j + 1
  }
  toc()
}
Is this approach correct or not?
I am grateful for any feedback.
I am unable to reproduce the only example I can find of using h2o with iml (https://www.r-bloggers.com/2018/08/iml-and-h2o-machine-learning-model-interpretability-and-feature-explanation/) as detailed here (Error when extracting variable importance with FeatureImp$new and H2O). Can anyone point to a workaround or other examples of using iml with h2o?
Reproducible example:
library(rsample) # data splitting
library(ggplot2) # allows extension of visualizations
library(dplyr) # basic data transformation
library(h2o) # machine learning modeling
library(iml) # ML interpretation
library(modeldata) #attrition data
# initialize h2o session
h2o.no_progress()
h2o.init()
# classification data
data("attrition", package = "modeldata")
df <- attrition %>%
  mutate_if(is.ordered, factor, ordered = FALSE) %>%
  mutate(Attrition = recode(Attrition, "Yes" = "1", "No" = "0") %>% factor(levels = c("1", "0")))
# convert to h2o object
df.h2o <- as.h2o(df)
# create train, validation, and test splits
set.seed(123)
splits <- h2o.splitFrame(df.h2o, ratios = c(.7, .15),
                         destination_frames = c("train", "valid", "test"))
names(splits) <- c("train","valid","test")
# variable names for response & features
y <- "Attrition"
x <- setdiff(names(df), y)
# elastic net model
glm <- h2o.glm(
  x = x,
  y = y,
  training_frame = splits$train,
  validation_frame = splits$valid,
  family = "binomial",
  seed = 123
)
# 1. create a data frame with just the features
features <- as.data.frame(splits$valid) %>% select(-Attrition)
# 2. Create a vector with the actual responses
response <- as.numeric(as.vector(splits$valid$Attrition))
# 3. Create custom predict function that returns the predicted values as a
# vector (probability of purchasing in our example)
pred <- function(model, newdata) {
  results <- as.data.frame(h2o.predict(model, as.h2o(newdata)))
  return(results[[3L]])
}
# create predictor object to pass to explainer functions
predictor.glm <- Predictor$new(
  model = glm,
  data = features,
  y = response,
  predict.fun = pred,
  class = "classification"
)
imp.glm <- FeatureImp$new(predictor.glm, loss = "mse")
Error obtained:
Error in `[.data.frame`(prediction, , self$class, drop = FALSE): undefined columns selected
traceback()
1. FeatureImp$new(predictor.glm, loss = "mse")
2. .subset2(public_bind_env, "initialize")(...)
3. private$run.prediction(private$sampler$X)
4. self$predictor$predict(data.frame(dataDesign))
5. prediction[, self$class, drop = FALSE]
6. `[.data.frame`(prediction, , self$class, drop = FALSE)
7. stop("undefined columns selected")
In the iml package documentation, it says that the class argument is "The class column to be returned". When you set class = "classification", it looks for a column called "classification", which is not found. At least on GitHub, it looks like the iml package has gone through a fair amount of development since that blog post, so I imagine some functionality may no longer be backwards compatible.
After reading through the package documentation, I think you might want to try something like:
predictor.glm <- Predictor$new(
  model = glm,
  data = features,
  y = "Attrition",
  predict.function = pred,
  type = "prob"
)
# check ability to predict first
check <- predictor.glm$predict(features)
print(check)
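If that check returns sensible probabilities, the feature-importance call from the question should then be worth retrying; a sketch (untested against the current iml release):
imp.glm <- FeatureImp$new(predictor.glm, loss = "ce") # "ce" is a classification loss; "mse" from the question also runs on probabilities
plot(imp.glm)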
Even better might be to leverage H2O's extensive functionality around machine learning interpretability.
h2o.varimp(glm) will give the user the variable importance for each feature
h2o.varimp_plot(glm, 10) will render a graphic showing the relative importance of each feature.
h2o.explain(glm, as.h2o(features)) is a wrapper for the explainability interface and will by default provide the confusion matrix (in this case) as well as variable importance, and partial dependency plots for each feature.
For certain algorithms (e.g., tree-based methods), h2o.shap_explain_row_plot() and h2o.shap_summary_plot() will provide the shap contributions.
The h2o-3 docs might be useful for exploring these further; a consolidated sketch of the calls above follows.
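For example (a sketch; the GBM is hypothetical and included only because the shap functions require a tree-based model):
h2o.varimp(glm)                      # variable importance table
h2o.varimp_plot(glm, 10)             # plot the 10 most important features
h2o.explain(glm, as.h2o(features))   # wrapper for the explainability interface
# shap plots need a tree-based model, e.g.:
# gbm <- h2o.gbm(x = x, y = y, training_frame = splits$train, seed = 123)
# h2o.shap_summary_plot(gbm, as.h2o(features))
# h2o.shap_explain_row_plot(gbm, as.h2o(features), row_index = 1)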
This question is very similar to preprocess within cross-validation in caret; however, in a project that I'm working on I would only like to do PCA on three predictors out of 19. Here is the example from preprocess within cross-validation in caret, and I'll use this data (PimaIndiansDiabetes) for ease (this is not my project data, but the concept should be the same). I would then like to do the preProcess only on a subset of variables, i.e. PimaIndiansDiabetes[, c(4,5,6)]. Is there a way to do this?
library(caret)
library(mlbench)
data(PimaIndiansDiabetes)
control <- trainControl(method = "cv", number = 5)
p <- preProcess(PimaIndiansDiabetes[, c(4, 5, 6)], # only do these columns!
                method = c("center", "scale", "pca"))
p
grid <- expand.grid(mtry = c(1, 2, 3))
model <- train(diabetes ~ ., data = PimaIndiansDiabetes, method = "rf",
               preProcess = p,
               trControl = control,
               tuneGrid = grid)
But I get this error:
Error: pre-processing methods are limited to: BoxCox, YeoJohnson, expoTrans, invHyperbolicSine, center, scale, range, knnImpute, bagImpute, medianImpute, pca, ica, spatialSign, ignore, keep, remove, zv, nzv, conditionalX, corr
The reason I'm trying to do this is so I can reduce the three variables to a single PC1 and use that for prediction. In my project all three variables are correlated above 90%, but I would like to incorporate them since other studies have used them as well. Thanks. I'm trying to avoid data leakage!
As far as I know this is not possible with caret.
This might be possible using recipes, since caret::train() can accept a recipe directly; a rough, untested sketch of that route is below. I do not use recipes myself, but I do use mlr3, so afterwards I will show how to do it with that package.
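A minimal recipes sketch (untested; triceps, insulin and mass are columns 4-6 of PimaIndiansDiabetes, and num_comp = 1 keeps only PC1):
library(caret)
library(recipes)
library(mlbench)
data(PimaIndiansDiabetes)
rec <- recipe(diabetes ~ ., data = PimaIndiansDiabetes) %>%
  step_center(triceps, insulin, mass) %>%
  step_scale(triceps, insulin, mass) %>%
  step_pca(triceps, insulin, mass, num_comp = 1)
control <- trainControl(method = "cv", number = 5)
grid <- expand.grid(mtry = c(1, 2, 3))
model <- train(rec, data = PimaIndiansDiabetes, method = "rf",
               trControl = control, tuneGrid = grid)
Now the mlr3 approach: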
library(mlr3)
library(mlr3pipelines)
library(mlr3learners)
library(paradox)
library(mlr3tuning)
library(mlbench)
create a task from the data:
data("PimaIndiansDiabetes")
pima_tsk <- TaskClassif$new(id = "Pima",
                            backend = PimaIndiansDiabetes,
                            target = "diabetes")
define a pre process selector named "slct1":
pos1 <- po("select", id = "slct1")
and define the selector function within it:
pos1$param_set$values$selector <- selector_name(colnames(PimaIndiansDiabetes[, 4:6]))
now define what should happen to the selected features: scaling -> pca with 1st PC selected (param_vals = list(rank. = 1))
pos1 %>>%
po("scale", id = "scale1") %>>%
po("pca", id = "pca1", param_vals = list(rank. = 1)) -> pr1
now define an invert selector:
pos2 <- po("select", id = "slct2")
pos2$param_set$values$selector <- selector_invert(pos1$param_set$values$selector)
define the learner:
rf_lrn <- po("learner", lrn("classif.ranger")) # ranger is a fast implementation of random forest
combine them:
gunion(list(pr1, pos2)) %>>%
po("featureunion") %>>%
rf_lrn -> graph
check if it looks ok:
graph$plot(html = TRUE)
convert graph to a learner:
glrn <- GraphLearner$new(graph)
define parameters you want tuned:
ps <- ParamSet$new(list(
  ParamInt$new("classif.ranger.mtry", lower = 1, upper = 6),
  ParamInt$new("classif.ranger.num.trees", lower = 100, upper = 1000)))
define resampling:
cv10 <- rsmp("cv", folds = 10)
define tuning:
instance <- TuningInstance$new(
  task = pima_tsk,
  learner = glrn,
  resampling = cv10,
  measures = msr("classif.ce"),
  param_set = ps,
  terminator = term("evals", n_evals = 20)
)
set.seed(1)
tuner <- TunerRandomSearch$new()
tuner$tune(instance)
instance$result
For additional details on how to tune the number of PC components to keep, check this answer: R caret: How do I apply separate pca to different dataframes before training?
If you find this interesting, check out the mlr3book.
Also,
cor(PimaIndiansDiabetes[, 4:6])
triceps insulin mass
triceps 1.0000000 0.4367826 0.3925732
insulin 0.4367826 1.0000000 0.1978591
mass 0.3925732 0.1978591 1.0000000
does not produce what you mention in the question.
I am modeling an LSTM model that contains multiple features and one target value. It is a regression problem.
I suspect that my data preparation for the LSTM is erroneous, mainly because the model learns nothing but the average of the target value.
The following code I wrote is for preparing the data for the LSTM:
# df is a pandas data frame that contains the feature columns (f1 to f5) and the target value named 'target'
# all columns of the df are time series data (including the 'target')
# seq_length is the sequence length
import numpy as np

def prepare_data_multiple_feature(df):
    X = []
    y = []
    for x in range(len(df)):
        start_id = x
        end_id = x + seq_length
        one_data_point = []
        if end_id + 1 <= len(df):
            # prepare X
            for col in ['f1', 'f2', 'f3', 'f4', 'f5']:
                one_data_point.append(np.array(df[col].values[start_id:end_id]))
            X.append(np.array(one_data_point))
            # prepare y
            y.append(np.array(df['target'].values[end_id]))
    assert len(y) == len(X)
    return X, y
Then, I reshape the data as follows:
X, y = prepare_data_multiple_feature(df)
X = X.reshape((len(X), seq_length, 5)) #5 is the number of features, i.e., f1 to f5
Is my data preparation method and data reshaping correct?
As @isp-zax mentioned, please provide a reprex so we can reproduce the outcome and see where the problem lies.
As an aside, you could use for col in df.columns instead of listing all the column names, and (minor optimisation) the outer loop should run for x in range(len(df) - seq_length); otherwise you execute the last seq_length - 1 iterations without actually processing any data. Also, a slice like df[col].values[a:b] does not include the element at index b, so if you want to include the window ending on the last row inside your X, end_id can be equal to len(df), i.e. you could execute your inner condition (prepare and append) for if end_id <= len(df):
Apart from that, I think it would be simpler to read if you sliced the dataframe across columns and rows at the same time, without using one_data_point, i.e. to select seq_length rows without the (last) target column, simply do:
df.values[start_id:end_id, :-1]
I'm using bigmemory and I want to calculate w.
My v has length 478000 and my k has length 240500, so the product of the two is a very large matrix. When I run the code with the double loop it is still running and has not finished, and I don't know whether it will ever give me a result.
I tried to calculate it without the for loop, but I got an error.
Please help me correct my code or make it faster.
library(bigmemory)
v <- read.big.matrix('v.dat', type = 'double')
k <- read.big.matrix('k.dat', type = 'double')
m <- length(v)
n <- length(k)
w <- filebacked.big.matrix(m, n, type = "double",
                           backingfile = "w.bin", descriptorfile = "w.desc")
start <- Sys.time()
w <- as.big.matrix(2*cos(0.001*v[]%*%t(k[]))-2)
Sys.time() - start
# for(i in 1:m)
# {
# for(j in 1:n)
# {
# w[i,j]=2*cos(dt*v[i]*k[j])-2
# }
# }
Thanks.
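For illustration, here is a block-wise sketch of the same computation (w[i,j] = 2*cos(dt*v[i]*k[j]) - 2) that avoids both the element-wise double loop and holding the full 478000 x 240500 result in RAM at once; the block width is a placeholder to tune to available memory, and I have not tested it at this scale:
vv <- as.numeric(v[])   # the two input vectors are small enough to keep in RAM
kk <- as.numeric(k[])
dt <- 0.001
block <- 500L           # placeholder block width
for (j0 in seq(1L, n, by = block)) {
  j1 <- min(j0 + block - 1L, n)
  w[, j0:j1] <- 2 * cos(dt * outer(vv, kk[j0:j1])) - 2
}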