Select prior probability of inclusion in CausalImpact or bsts?

In the CausalImpact package, the supplied covariates are independently selected with some prior probability M/J, where M is the expected model size and J is the number of covariates. However, on page 11 of the paper, the authors say they obtain these values by "asking about the expected model size M." I checked the documentation for CausalImpact but was unable to find any more information. Where is this done in the package? Is there a parameter I can set in a function call to specify my desired M?

You are right, this is not directly possible through CausalImpact's own arguments, but it can be done. CausalImpact uses bsts behind the scenes, and that package lets you set the parameter. So you define your model with bsts first, set the parameter there, and then pass the fitted model to your CausalImpact call like this (modified example from the CausalImpact manual):
post.period <- c(71, 100)
post.period.response <- y[post.period[1] : post.period[2]]
y[post.period[1] : post.period[2]] <- NA

ss <- AddLocalLevel(list(), y)
# expected.model.size is forwarded to the spike-and-slab prior and sets the
# prior inclusion probability of each covariate to roughly M/J
bsts.model <- bsts(y ~ x1, ss, niter = 1000, expected.model.size = 4)

impact <- CausalImpact(bsts.model = bsts.model,
                       post.period.response = post.period.response)
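To see what the prior actually does, you can inspect the posterior inclusion probabilities of the regression coefficients afterwards. A quick sketch (it assumes the fitted bsts.model from above):

# fraction of MCMC draws in which each coefficient was non-zero
colMeans(bsts.model$coefficients != 0)
# built-in bar plot of the inclusion probabilities
plot(bsts.model, "coefficients")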

Related

PCA within cross validation; however, only with a subset of variables

This question is very similar to "preprocess within cross-validation in caret"; however, in the project I'm working on I would only like to do PCA on three predictors out of 19. Here is the example from that question, and I'll use this data (PimaIndiansDiabetes) for ease (this is not my project data, but the concept should be the same). I would then like to run preProcess only on a subset of variables, i.e. PimaIndiansDiabetes[, c(4,5,6)]. Is there a way to do this?
library(caret)
library(mlbench)
data(PimaIndiansDiabetes)

control <- trainControl(method = "cv",
                        number = 5)

p <- preProcess(PimaIndiansDiabetes[, c(4, 5, 6)],  # only do these columns!
                method = c("center", "scale", "pca"))
p

grid <- expand.grid(mtry = c(1, 2, 3))
model <- train(diabetes ~ ., data = PimaIndiansDiabetes, method = "rf",
               preProcess = p,
               trControl = control,
               tuneGrid = grid)
But I get this error:
Error: pre-processing methods are limited to: BoxCox, YeoJohnson, expoTrans, invHyperbolicSine, center, scale, range, knnImpute, bagImpute, medianImpute, pca, ica, spatialSign, ignore, keep, remove, zv, nzv, conditionalX, corr
The reason I'm trying to do this is so I can reduce the three variables to a single first principal component and use that for prediction. In my project all three variables are correlated above 90%, but I would like to incorporate them since other studies have used them as well. Thanks. Trying to avoid data leakage!
As far as I know this is not possible with caret.
It might be possible using recipes (see the sketch at the end of this answer). However, I do not use recipes, but I do use mlr3, so I will show how to do it with that package:
library(mlr3)
library(mlr3pipelines)
library(mlr3learners)
library(paradox)
library(mlr3tuning)
library(mlbench)
create a task from the data:
data("PimaIndiansDiabetes")
pima_tsk <- TaskClassif$new(id = "Pima",
backend = PimaIndiansDiabetes,
target = "diabetes")
define a pre process selector named "slct1":
pos1 <- po("select", id = "slct1")
and define the selector function within it:
pos1$param_set$values$selector <- selector_name(colnames(PimaIndiansDiabetes[, 4:6]))
now define what should happen to the selected features: scaling -> pca with 1st PC selected (param_vals = list(rank. = 1))
pos1 %>>%
  po("scale", id = "scale1") %>>%
  po("pca", id = "pca1", param_vals = list(rank. = 1)) -> pr1
now define an invert selector:
pos2 <- po("select", id = "slct2")
pos2$param_set$values$selector <- selector_invert(pos1$param_set$values$selector)
define the learner:
rf_lrn <- po("learner", lrn("classif.ranger")) #ranger is a faster version of rf
combine them:
gunion(list(pr1, pos2)) %>>%
  po("featureunion") %>>%
  rf_lrn -> graph
check if it looks ok:
graph$plot(html = TRUE)
convert graph to a learner:
glrn <- GraphLearner$new(graph)
define parameters you want tuned:
ps <- ParamSet$new(list(
  ParamInt$new("classif.ranger.mtry", lower = 1, upper = 6),
  ParamInt$new("classif.ranger.num.trees", lower = 100, upper = 1000)))
define resampling:
cv10 <- rsmp("cv", folds = 10)
define tuning:
instance <- TuningInstance$new(
  task = pima_tsk,
  learner = glrn,
  resampling = cv10,
  measures = msr("classif.ce"),
  param_set = ps,
  terminator = term("evals", n_evals = 20)
)
set.seed(1)
tuner <- TunerRandomSearch$new()
tuner$tune(instance)
instance$result
For additional details on how to tune the number of PC components to keep, check this answer: "R caret: How do I apply separate pca to different dataframes before training?"
If you find this interesting, check out the mlr3book.
Also
cor(PimaIndiansDiabetes[, 4:6])
triceps insulin mass
triceps 1.0000000 0.4367826 0.3925732
insulin 0.4367826 1.0000000 0.1978591
mass 0.3925732 0.1978591 1.0000000
does not show pairwise correlations above 0.9, as mentioned in the question.
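As a footnote to the recipes remark above, here is a rough sketch of how the same idea might look with a recipe passed to caret::train (untested; it assumes a caret version that accepts recipe objects, and it names the three columns explicitly):

library(caret)
library(recipes)
library(mlbench)
data(PimaIndiansDiabetes)

# center/scale + PCA only on triceps, insulin and mass (columns 4:6),
# keeping a single component; the remaining predictors are left untouched;
# the steps are re-estimated inside each resample, so there is no leakage
rec <- recipe(diabetes ~ ., data = PimaIndiansDiabetes) %>%
  step_center(triceps, insulin, mass) %>%
  step_scale(triceps, insulin, mass) %>%
  step_pca(triceps, insulin, mass, num_comp = 1)

control <- trainControl(method = "cv", number = 5)
model <- train(rec, data = PimaIndiansDiabetes, method = "rf",
               trControl = control,
               tuneGrid = expand.grid(mtry = c(1, 2, 3)))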

Getting a specific random forest variable importance measure from mlr package's resample function

I am using mlr package's resample() function to subsample a random forest model 4000 times (the code snippet below).
As you can see, to create random forest models within resample() I'm using randomForest package.
I want to get random forest model's importance results (mean decrease in accuracy over all classes) for each of the subsample iterations. What I can get right now as the importance measure is the mean decrease in Gini index.
As I can see from the source code of mlr, the getFeatureImportanceLearner.classif.randomForest() function (line 69) in makeRLearner.classif.randomForest uses the randomForest::importance() function (line 83) to get the importance value from the resulting randomForest object. But as you can see from the source code (line 73), it uses 2L as the default value. I want it to use 1L (line 75), i.e. the mean decrease in accuracy.
How can I pass the value of 1L to the resample() function (the "extract = getFeatureImportance" line in the code below) so that getFeatureImportanceLearner.classif.randomForest() receives that value and sets ctrl$type = 1L (line 73)?
rf_task <- makeClassifTask(id = 'task',
                           data = data[, -1], target = 'target_var',
                           positive = 'positive_var')
rf_learner <- makeLearner('classif.randomForest', id = 'random forest',
                          par.vals = list(ntree = 1000, importance = TRUE),
                          predict.type = 'prob')
base_subsample_instance <- makeResampleInstance(rf_boot_desc, rf_task)
rf_subsample_result <- resample(rf_learner, rf_task,
                                base_subsample_instance,
                                extract = getFeatureImportance,
                                measures = list(acc, auc, tpr, tnr,
                                                ppv, npv, f1, brier))
My solution: I downloaded the source code of the mlr package, changed line 73 of the source file to 1L (https://github.com/mlr-org/mlr/blob/v2.15.0/R/RLearner_classif_randomForest.R), installed the package from the command line, and used it. Not an optimal solution, but a solution.
You provide a lot of specifics that do not actually relate to your question, at least as I understood it.
So I wrote a simple MWE that includes the answer.
The idea is that you write a short wrapper around getFeatureImportance so that you can pass your own arguments. Fans of purrr can do that with purrr::partial(getFeatureImportance, type = 2) (see the sketch at the end of this answer), but here I wrote myExtractor manually.
library(mlr)

rf_learner <- makeLearner('classif.randomForest', id = 'random forest',
                          par.vals = list(ntree = 100, importance = TRUE),
                          predict.type = 'prob')
measures = list(acc, auc, tpr, tnr,
                ppv, npv, f1, brier)

# wrapper around getFeatureImportance that fixes the type argument
# (type = 2 is mean decrease in Gini; use type = 1 for mean decrease in accuracy)
myExtractor = function(.model, ...) {
  getFeatureImportance(.model, type = 2, ...)
}

res = resample(rf_learner, sonar.task, cv10,
               measures = measures, extract = myExtractor)

# first feature importance result:
res$extract[[1]]
# all values in a matrix:
sapply(res$extract, function(x) x$res)
If you want to build a bootstrapped learner, maybe you should also have a look at makeBaggingWrapper instead of solving this problem through resample.
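For completeness, the purrr::partial variant mentioned above would look roughly like this (a sketch; it uses type = 1, i.e. mean decrease in accuracy, which is what the question asks for):

library(purrr)
# pre-fill the type argument; resample() then calls the wrapper with the
# fitted model, exactly like myExtractor above
accExtractor <- partial(getFeatureImportance, type = 1)
res <- resample(rf_learner, sonar.task, cv10,
                measures = measures, extract = accExtractor)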

How tf.gradients works in TensorFlow

Given a linear model like the following, I would like to get the gradient vector with respect to W and b.
# tf Graph Input
X = tf.placeholder("float")
Y = tf.placeholder("float")
# Set model weights
W = tf.Variable(rng.randn(), name="weight")
b = tf.Variable(rng.randn(), name="bias")
# Construct a linear model
pred = tf.add(tf.mul(X, W), b)
# Mean squared error
cost = tf.reduce_sum(tf.pow(pred-Y, 2))/(2*n_samples)
However, if I try something like this, where cost is a function cost(x, y, w, b) and I only want the gradients with respect to w and b:
grads = tf.gradients(cost, tf.all_variables())
My placeholders will also be included (X and Y).
Even if I do get a gradient over [x, y, w, b], how do I know which element in the gradient list belongs to which parameter, since it is just a list with no names indicating which parameter each derivative was taken with respect to?
In this question I'm using parts of this code and I build on this question.
Quoting the docs for tf.gradients
Constructs symbolic partial derivatives of sum of ys w.r.t. x in xs.
So, this should work:
dc_dw, dc_db = tf.gradients(cost, [W, b])
Here, tf.gradients() returns the gradient of cost wrt each tensor in the second argument as a list in the same order.
Read tf.gradients for more information.

Glm with caret package producing "missing values in resampled performance measures"

I obtained the following code from this Stack Overflow question: "caret train() predicts very different then predict.glm()".
The following code is producing an error.
I am using caret 6.0-52.
library(car); library(caret); library(e1071)

# data import and preparation
data(Chile)
chile <- na.omit(Chile)  # remove NAs
chile <- chile[chile$vote == "Y" | chile$vote == "N", ]  # only "Y" and "N" required
chile$vote <- factor(chile$vote)      # required to remove unwanted levels
chile$income <- factor(chile$income)  # treat income as a factor

tc <- trainControl("cv", 2, savePredictions = TRUE, classProbs = TRUE,
                   summaryFunction = twoClassSummary)  # "cv" = cross-validation, 2-fold

fit <- train(chile$vote ~ chile$sex +
                          chile$education +
                          chile$statusquo,
             data = chile,
             method = "glm",
             family = binomial,
             metric = "ROC",
             trControl = tc)
Running this code produces the following error.
Something is wrong; all the ROC metric values are missing:
ROC Sens Spec
Min. : NA Min. :0.9354 Min. :0.9187
1st Qu.: NA 1st Qu.:0.9354 1st Qu.:0.9187
Median : NA Median :0.9354 Median :0.9187
Mean :NaN Mean :0.9354 Mean :0.9187
3rd Qu.: NA 3rd Qu.:0.9354 3rd Qu.:0.9187
Max. : NA Max. :0.9354 Max. :0.9187
NA's :1
Error in train.default(x, y, weights = w, ...) : Stopping
In addition: Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
Would anyone know what the issue is, or can anyone reproduce / not reproduce this error? I've seen other answers to this error message saying it has to do with not having representation of both classes in each cross-validation fold, but that isn't the issue here, as the number of folds is set to 2.
Looks like I needed to install and load the pROC package.
install.packages("pROC")
library(pROC)
You should install using
install.packages("caret", dependencies = c("Imports", "Depends", "Suggests"))
That gets most of the default packages. If there are specific modeling packages that are missing, the code usually prompts you to install them.
I know I'm late to the party, but I think you need to set classProbs = TRUE in train control.
You are using logistic regression when using the parameters method = "glm", family = binomial.
In this case, you must make sure that the target variable (chile$vote) has only 2 factor levels, because logistic regression only performs binary classification.
If the target has more than two labels, then you must set family = "multinomial".
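As a quick sanity check (not from the thread, just a sketch), you can confirm the target really is binary after the filtering in the question:

levels(chile$vote)   # should be "N" "Y" after factor()
nlevels(chile$vote)  # should be 2
table(chile$vote)    # class counts per level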

Identifying outlying datapoints from residuals (GeoLight package)

I am analysing some data collected from a geolocator placed on a migratory bird. In a nutshell, my data are sunrise and sunset times, which are then used to determine position on the globe.
I am using the package GeoLight (http://cran.r-project.org/web/packages/GeoLight/GeoLight.pdf) to identify outlying data - specifically, I am using the loessFilter function, which applies a local polynomial (loess) regression and identifies residuals that are greater than k interquartile ranges (k is specified when applying the function; 3 by default).
My problem is: the function returns plots in which outlying data points are marked in red. However, there seems to be an issue with the TRUE/FALSE vector the function returns to indicate which points are outliers - all points are returned as TRUE, even when outliers are identified.
I have therefore modified the function code to state which residuals are outliers.
However, when I then remove those rows from the original dataset and re-run the function, the points have not been removed. So there is some discrepancy between the residual indices and the rows in the original data: i.e. if the output states that residual 78 is an outlying point, removing row 78 from the original data does not remove the outlying data point.
I would very much appreciate some help with removing the outlying datapoints identified using the function. It seems like a very easy fix but I can't seem to figure it out.
Code for full function and data below
Thanks
Emma
log2$tFirst<-as.POSIXlt(log2$tFirst)
log2$tSecond<-as.POSIXlt(log2$tSecond)
CODE TO GET OUTLYING RESIDUALS
i.get.outliers <- function(residuals, k = 3) {
  x <- residuals
  # x is a vector of residuals
  # k is a measure of how many interquartile ranges to take before saying that point is an outlier
  # it looks like 3 is a good preset for k
  QR <- quantile(x, probs = c(0.25, 0.75))
  IQR <- QR[2] - QR[1]
  Lower.band <- QR[1] - (k * IQR)
  Upper.Band <- QR[2] + (k * IQR)
  delete <- which(x < Lower.band | x > Upper.Band)
  return(as.vector(delete))
}
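For illustration (values made up, not from the geolocator data), the helper flags the indices of residuals that fall outside k interquartile ranges:

set.seed(1)
res <- c(rnorm(20), 8)      # 20 well-behaved residuals plus one obvious outlier
i.get.outliers(res, k = 3)  # should return 21, the index of the injected outlier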
LOESS FILTER FUNCTION CODE
loessFilter <- function(tFirst, tSecond, type, k = 3, plot = TRUE){
  tw <- data.frame(datetime = as.POSIXct(c(tFirst, tSecond), "UTC"),
                   type = c(type, ifelse(type == 1, 2, 1)))
  tw <- tw[!duplicated(tw$datetime), ]
  tw <- tw[order(tw[,1]), ]
  hours <- as.numeric(format(tw[,1], "%H")) + as.numeric(format(tw[,1], "%M"))/60
  for(t in 1:2){
    cor <- rep(NA, 24)
    for(i in 0:23){
      cor[i+1] <- max(abs((c(hours[tw$type==t][1], hours[tw$type==t]) + i)%%24 -
                          (c(hours[tw$type==t], hours[tw$type==t][length(hours)]) + i)%%24), na.rm = T)
    }
    hours[tw$type==t] <- (hours[tw$type==t] + (which.min(round(cor, 2))) - 1)%%24
  }
  dawn <- data.frame(id = 1:sum(tw$type==1),
                     datetime = tw$datetime[tw$type==1],
                     type = tw$type[tw$type==1],
                     hours = hours[tw$type==1], filter = FALSE)
  dusk <- data.frame(id = 1:sum(tw$type==2),
                     datetime = tw$datetime[tw$type==2],
                     type = tw$type[tw$type==2],
                     hours = hours[tw$type==2], filter = FALSE)
  for(d in seq(30, k, length = 5)){
    predict.dawn <- predict(loess(dawn$hours[!dawn$filter] ~ as.numeric(dawn$datetime[!dawn$filter]), span = 0.1))
    predict.dusk <- predict(loess(dusk$hours[!dusk$filter] ~ as.numeric(dusk$datetime[!dusk$filter]), span = 0.1))
    del.dawn <- i.get.outliers(as.vector(residuals(loess(dawn$hours[!dawn$filter] ~
                               as.numeric(dawn$datetime[!dawn$filter]), span = 0.1))), k = d)
    del.dusk <- i.get.outliers(as.vector(residuals(loess(dusk$hours[!dusk$filter] ~
                               as.numeric(dusk$datetime[!dusk$filter]), span = 0.1))), k = d)
    if(length(del.dawn) > 0) dawn$filter[!dawn$filter][del.dawn] <- TRUE
    if(length(del.dusk) > 0) dusk$filter[!dusk$filter][del.dusk] <- TRUE
  }
  if(plot){
    par(mfrow = c(2,1), mar = c(3,3,0.5,3), oma = c(2,2,0,0))
    plot(dawn$datetime[dawn$type==1], dawn$hours[dawn$type==1], pch = "+", cex = 0.6, xlab = "", ylab = "", yaxt = "n")
    lines(dawn$datetime[!dawn$filter], predict(loess(dawn$hours[!dawn$filter] ~ as.numeric(dawn$datetime[!dawn$filter]), span = 0.1)), type = "l")
    points(dawn$datetime[dawn$filter], dawn$hours[dawn$filter], col = "red", pch = "+", cex = 1)
    axis(2, labels = F)
    mtext("Sunrise", 4, line = 1.2)
    plot(dusk$datetime[dusk$type==2], dusk$hours[dusk$type==2], pch = "+", cex = 0.6, xlab = "", ylab = "", yaxt = "n")
    lines(dusk$datetime[!dusk$filter], predict(loess(dusk$hours[!dusk$filter] ~ as.numeric(dusk$datetime[!dusk$filter]), span = 0.1)), type = "l")
    points(dusk$datetime[dusk$filter], dusk$hours[dusk$filter], col = "red", pch = "+", cex = 1)
    axis(2, labels = F)
    legend("bottomleft", c("Outside filter", "Inside filter"), pch = c("+", "+"), col = c("black", "red"),
           bty = "n", cex = 0.8)
    mtext("Sunset", 4, line = 1.2)
    mtext("Time", 1, outer = T)
    mtext("Sunrise/Sunset hours (rescaled)", 2, outer = T)
  }
  all <- rbind(subset(dusk, filter), subset(dawn, filter))
  filter <- rep(FALSE, length(tFirst))
  filter[tFirst %in% all$datetime | tSecond %in% all$datetime] <- TRUE
  # original code:
  # return(!filter)
  # altered code to return outliers: return(del.dusk)
  # replaced with the code below to print the outlying points
  return(c("delete dawn", del.dawn, "delete dusk", del.dusk))
}
APPLY FUNCTION
loessFilter(log2$tFirst, log2$tSecond, type=1, k=4, plot=TRUE)
remove the values - need to remove both sunrise and sunset curves
log2b<-log2[-c(77,78,124,125),]
length(log2$tFirst)
length(log2b$tFirst)
repeat function to see if the values have gone
loessFilter(log2b$tFirst, log2b$tSecond, type=1, k=4, plot=TRUE)
outliers still there!!
HERE ARE THE DATA:
http://www.4shared.com/file/jxVuTsVHce/002_geolight.html
A bit too long to post the full data here and the example won't work with a dummy dataset :)
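One possible way around the index mismatch (a sketch, not an answer from the thread): the indices in del.dawn / del.dusk refer to the dawn/dusk subsets built inside loessFilter, not to rows of log2, so removing row 78 of log2 removes a different record. If the modified function is changed to return the datetimes of the flagged points instead, e.g. return(list(dawn = dawn$datetime[dawn$filter], dusk = dusk$datetime[dusk$filter])), the original rows can be matched by time:

flagged <- loessFilter(log2$tFirst, log2$tSecond, type = 1, k = 4, plot = FALSE)
bad <- c(flagged$dawn, flagged$dusk)
# match on datetime, mirroring the filter logic inside loessFilter itself
drop <- as.POSIXct(log2$tFirst) %in% bad | as.POSIXct(log2$tSecond) %in% bad
log2_clean <- log2[!drop, ]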
