Jags/Bugs one step ahead prediction - mcmc

imagine a simple growth model.
How do I get the one step ahead predictions ??
# Priors and constraints
N.est[1] ~ dunif(0, 10) # Prior for initial population size
mean.lambda ~ dunif(0, 10) #
mean.psi ~ dunif(0, 10) #
sigma.proc ~ dunif(0, 10)# Prior for sd of state process
sigma2.proc <- pow(sigma.proc, 2)
tau.proc <- pow(sigma.proc, -2)
sigma.obs ~ dunif(0, 10) # Prior for sd of observation process
sigma2.obs <- pow(sigma.obs, 2)
tau.obs <- pow(sigma.obs, -2)
sigma.psi ~ dunif(0, 10) # Prior for sd of observation process
psi2.psi <- pow(sigma.psi, 2)
tau.psi <- pow(sigma.psi, -2)
# Likelihood
# State process
for (t in 1:(T-1)){
lambda[t] ~ dnorm(mean.lambda, tau.proc)
psi[t] ~ dnorm(mean.psi, tau.psi)
N.est[t+1] <- 10/ ( 1+ exp(- psi[t] *(N.est[t] * lambda[t]) ) ) #N.est[t] * lambda[t] + psi[t]
}
# Observation process
for (t in 1:T) {
y[t] ~ dnorm(N.est[t], tau.obs)
}
This is from a book, this example. How do I get the one step ahead predictions, because I am not interested in the smoothed output.

What you want to do is to simply set those units you want to predict in the response vector y to NA and then generate a replicate sample of y;
y.pred[t] ~ dnorm(N.est[t], tau.obs)
To check the residuals between y and y.pred
res <- y - y.pred
Using JAGS in R with the rjags package

Related

Clustered resampling for inner layer of Caret recursive feature elimination

I have data where IDs are contained within clusters.
I would like to perform recursive feature elimination using Caret's rfe function which performs the following procedure:
Clustered resampling for the outer layer (line 2.1) is straightforward, using the index parameter.
However, within each outer resample, I would like to tune tuning parameters using cluster-based cross-validation (inner resampling) (line 2.9). Model tuning in the inner layer is possible by specifying a tuneGrid in rfe and having an appropriate trControl. It is this trControl that I would like to change to allow clustered resampling.
The outer resampling is specified in the rfeControl parameter of rfe.
The inner resampling is specified by trControl of rfe which is passed to train.
The trouble I am having is that I can't seem to specify any inner indices, because after the outer resampling, those indices are no longer valid or no longer present in the outer-resampled data.
I am looking for a way to tell train to take an outer resample (which will be missing a cluster against which to validate), and to tune the model using inner resampling by based on folds of the remaining clusters.
The MWE is as minimal as possible:
library(caret)
library(tidyverse)
library(parallel)
library(doParallel)
range01 <- function(x){(x-min(x))/(max(x)-min(x))}
### Create some random data, 10 features, with some influence over a binomial outcome
set.seed(42)
id <- 1:1000
cluster <- rep(1:10, each = 100)
dat <- data.frame(id, cluster, replicate(10,rnorm(n = 1000, mean = runif(1, 0,100)+cluster, sd = runif(1, 0,20))))
dat <- dat %>% mutate(temp = rowSums(across(X1:X10)), prob = range01(temp), outcome = rbinom(n = nrow(dat), size = 1, prob = prob))
dat$outcome <- as.factor(dat$outcome)
levels(dat$outcome) <- c("control", "case")
dat$outcome <- factor(dat$outcome, levels=rev(levels(dat$outcome)))
### Manual outer folds-based cluster ###
for(i in 1:10) {
assign(paste0("index", i), which(dat$cluster!=i))
}
unit_indices <- list(index1, index2, index3, index4, index5, index6, index7, index8, index9, index10)
### Inner resampling method (THIS IS WHAT I'D LIKE TO CHANGE) ###
cv5 <- trainControl(classProbs = TRUE, method = "cv", number = 5, allowParallel = F) ## Is there a way to have inner cluster-based resampling WITHIN the outer cluster-based resampling?
caret_rfe_functions <- list(summary = twoClassSummary,
fit = function (x, y, first, last, ...) {
train(x, y, ...)
},
pred = caretFuncs$pred,
rank = function(object, x, y) {
vimp <- varImp(object)$importance
vimp <- vimp[order(vimp$Overall,decreasing = TRUE),,drop = FALSE]
vimp$var <- rownames(vimp)
vimp
},
selectSize = function (x, metric = "ROC", tol = 1, maximize = TRUE)
{
if (!maximize) {
best <- min(x[, metric])
perf <- (x[, metric] - best)/best * 100
flag <- perf <= tol
}
else {
best <- max(x[, metric])
perf <- (best - x[, metric])/best * 100
flag <- perf <= tol
}
min(x[flag, "Variables"])
},
selectVar = caretFuncs$selectVar)
caret_rfe_ctrl <- rfeControl(
functions = caret_rfe_functions,
saveDetails = TRUE,
index = unit_indices,
indexOut = NULL,
returnResamp = "all",
allowParallel = T, ### change this if you don't want to / can't go parallel
verbose = TRUE
)
#### Feature selection ####
set.seed(42)
cl <- makePSOCKcluster(10) ### for parallel processing if available
registerDoParallel(cl)
rfe_profile_nnet <- rfe(
form = outcome ~
X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10,
data = dat,
sizes = seq(2,10,1),
rfeControl = caret_rfe_ctrl,
## pass options to train()
method = "nnet",
preProc = c("center", "scale"),
metric = "ROC",
tuneGrid = expand.grid(size = c(1:5), decay = 5),
trControl = cv5) ### I would like to change this to allow inner cluster-based resampling
stopCluster(cl)
rfe_profile_nnet
plot(rfe_profile_nnet)
Presumably the inner cluster-based resampling would be achieved by specifying a new trainControl containing some dynamic inner index based on the outer resample that is selected at the time:
inner_cluster_tune <- trainControl(classProbs = TRUE,
index = {insert magic here}, ### This is the important bit
returnResamp = "all",
summaryFunction = twoClassSummary,
allowParallel = F) ### especially if the outer resample is parallelised
If you try with the original cluster indices e.g.
inner_cluster_tune <- trainControl(classProbs = TRUE,
index = unit_indices,
returnResamp = "all",
summaryFunction = twoClassSummary,
allowParallel = F)
There are various warnings about missing data in the resamples, and things like 24: In [<-.data.frame(*tmp*, , object$method$center, value = structure(list( ... : provided 81 variables to replace 9 variables.
All help greatly appreciated.
As a postscript question , you can see which parameters were used within your rfe like so:
> rfe_profile_nnet$fit
Neural Network
1000 samples
8 predictor
2 classes: 'case', 'control'
Pre-processing: centered (8), scaled (8)
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 800, 800, 800, 800, 800
Resampling results across tuning parameters:
size Accuracy Kappa
1 0.616 0.1605071
2 0.616 0.1686937
3 0.620 0.1820503
4 0.618 0.1788491
5 0.618 0.1788063
Tuning parameter 'decay' was held constant at a value of 5
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were size = 3 and decay = 5.
But does anyone know if this refers to one, or all of the outer resamples? Presumably the same tuning parameters won't necessarily be chosen across all outer resamples

Highcharts/HighcharteR - draw a polygon with rounded corners

library(highcharter)
highchart() %>%
hc_add_series(name='Polygon',type='polygon',data=list(c(1,4),c(2,4), c(3,3), c(2,3)),
borderRadius = 10, lineColor = "red", lineWidth = 3)][1]][1]
Hello everybody. I use a polygon to display some data. I would prefer to have the borders to be round, but the borderRadius attribute does not work for the polygon.
Does anyone have an idea how to archieve a rounded look of my polygon? Documentation did not help in this case :-(. This is made the the R Highcharter package, but I would also be totally fine with an example in die native JS Library.
Thank you!
This works somewhat:
spline.poly <- function(xy, vertices, k=3, ...) {
# Assert: xy is an n by 2 matrix with n >= k.
# Wrap k vertices around each end.
n <- dim(xy)[1]
if (k >= 1) {
data <- rbind(xy[(n-k+1):n,], xy, xy[1:k, ])
} else {
data <- xy
}
# Spline the x and y coordinates.
data.spline <- spline(1:(n+2*k), data[,1], n=vertices, ...)
x <- data.spline$x
x1 <- data.spline$y
x2 <- spline(1:(n+2*k), data[,2], n=vertices, ...)$y
# Retain only the middle part.
cbind(x1, x2)[k < x & x <= n+k, ]
}
X <- matrix(c(resultdf$yAxis, resultdf$xAxis), ncol=2)
hpts <- chull(X) # Creates indices of a convex hull from a matrix
hpts <- c(hpts, hpts[1]) # connect last and first dot
hpts <- data.frame(X[hpts, ])
hpts <- data.frame(spline.poly(as.matrix(data.frame(hpts$X1, hpts$X2)), 500)) %>%
setNames(c("yAxis", "xAxis"))
the spline.poly function creates a lot of new points which connect to a more rounded shape :-)

Convert raster to a matrix

I can read an image, raster, limit the values from 10-100. What I can't do is convert the limitation to a matrix where I could sum all values.
library(raster)
DEM <- raster("img.JPG")
image(DEM, zlim=c(10,100))
I'd like to convert the result of
image(DEM, zlim=c(10,60))
into a matrix where I can perform calculations.
image(DEM, zlim=c(10,60)) would result in
Target is to only sum the Red Circle.
library(raster)
d <- raster("img.JPG")
dd <- reclassify(d, rbind(c(-Inf, 10, NA), c(60, Inf, NA)))
# or: d[d< 10 | d > 60] <- NA
plot(dd)
hist(dd)
Or, if you really want a matrix
m <- matrix(d)
m[m<10] <- NA

R - RW metropolis using gibbs fails

I want to sample from the posterior, where LambdaA and LambdaB are exponential rates of A and B. Also, y is the observations of the r.v.'s.
The posterior is given by
and for numerical reasons, i am taking the log of this function.
Data:
n<-100
y<- c(rexp(n))
Logarithm of posterior:
dmix<-function(LambdaA,LambdaB,w){
ifelse( LambdaA<=0|LambdaB<=0|w<0|w>1 ,0,log(w*LambdaA*LambdaB*exp(-2*(LambdaA+LambdaB))*prod(w*LambdaA*exp(-
LambdaA*y) + (1-w)*LambdaB*exp(-LambdaB*y)) ))}
U-values
U.lambdaB <- runif(1)
U.lambdaA<- runif(1)
U.w<- runif(1)
Count steps
REJLambdaB <- 1
REJw <- 1
REJLambdaA<-1
Initial points
LambdaB <- LambdaA<- w<- numeric(n)
LambdaA[1]<-0.5
LambdaB[1] <- 0.5
w[1] <- 0.5
Random walk MH algorithm, updating each component at a time:
for (t in 2:n){
LambdaBprop<- rnorm(1,LB[t-1],0.5)
wprop<- rnorm(1,w[t-1],0.5)
LambdaAprop<- rnorm(1,LB[t-1],0.5)
logalpha1 = dmix(LambdaAprop,LambdaB[t-1],w[t-1])-dmix(LambdaA[t-1],LambdaB[t-
1],w[t-1])
logalpha2 = dmix(LambdaA[t-1],LambdaBprop,w[t-1])-dmix(LA[t-1],LB[t-1],w[t-
1])
if (!is.null(log(U.lambdaB) > logalpha2))
{LambdaB[t] <- LambdaBprop} ## accepted
else{LambdaB[t] <- LambdaB[t-1] ##rejected
REJLambdaB<-REJLambdaB+1}
if (!is.null(log(U.lambdaA) > logalpha1))
{LambdaA[t]<-LambdaAprop}
else {LambdaA[t]<-LambdaA[t-1]
REJLambdaA<-REJLambdaA+1}
if (w[t]<0|w[t]>1)
{w[t]<-w[t-1]}
else {w[t]<-wprop
REJw<-REJw+1}
}
Ultimately, I am having problems with my posterior since I keep getting either infinity or 0's when evaluating logalpha's. Note that i am looking to compare
log($\alpha(x'|x))$ with log(U). Any help to get this code to work ?
If you really think that a random walk means
lambdB[t]<- lambdB[t-1] + runif(1)
w[t]<- w[t-1] + runif(1)
lambdA[t] <- lambdB[t-1] + runif(1)
you should reconsider and invest into reading the bases of Markov chain theory and Markov chain Monte Carlo: At each iteration you add a Uniform U(0,1) variate to the current value. Therefore you always propose to increase the current value. Do you think this could ever produce an ergodic Markov chain?
There is also a mistake in dmix: since you work with the logarithm, remember that log(0)=-oo. And the quantities logalpha1 and logalpha2 are not updated correctly. And many more programming errors, like the incorrect use of !is.null... Anyway here is a corrected R code that works:
n<-100
y<- c(rexp(n))
#Logarithm of posterior:
dmix<-function(LambdaA,LambdaB,w){
ifelse( (LambdaA<=0)|(LambdaB<=0)|(w<0)|(w>1) ,
-1e50,log(w*LambdaA*LambdaB)-2*(LambdaA+LambdaB)+sum(log(w*LambdaA*exp(-
LambdaA*y) + (1-w)*LambdaB*exp(-LambdaB*y))) )}
#Count steps
REJLambdaB <- 1
REJw <- 1
REJLambdaA<-1
#Initial points
N <- 1e4
LambdaB <- LambdaA <- w<- numeric(N)
LambdaA[1] <- LambdaB[1] <- w[1] <- 0.5
U.lambdaB <- runif(N)
U.lambdaA<- runif(N)
U.w <- runif(N)
for (t in 2:N){
LambdaBprop=rnorm(1,LambdaB[t-1],0.5)
LambdaAprop=rnorm(1,LambdaA[t-1],0.5)
wprop=rnorm(1,w[t-1],0.05)
logalpha2 = dmix(LambdaA[t-1],LambdaBprop,w[t-1])-dmix(LambdaA[t-1],LambdaB[t-1],w[t-1])
if ((log(U.lambdaB[t]) < logalpha2))
{LambdaB[t] <- LambdaBprop} ## accepted
else{LambdaB[t] <- LambdaB[t-1] ##rejected
REJLambdaB<-REJLambdaB+1}
logalpha1 = dmix(LambdaAprop,LambdaB[t],w[t-1])-dmix(LambdaA[t-1],LambdaB[t],w[t-1])
if ((log(U.lambdaA[t]) < logalpha1))
{LambdaA[t]<-LambdaAprop}
else {LambdaA[t]<-LambdaA[t-1]
REJLambdaA<-REJLambdaA+1}
logw = dmix(LambdaA[t],LambdaB[t],wprop)-dmix(LambdaA[t],LambdaB[t],w[t-1])
if (w[t]<0|w[t]>1|(log(U.w[t])>logw))
{w[t]<-w[t-1]}
else {w[t]<-wprop
REJw<-REJw+1}
}
As shown by the outcome
the posterior produces a symmetric outcome in the Lambda's.

How to explain high AUC-ROC with mediocre precision and recall in unbalanced data?

I have some machine learning results that I am trying to make sense of. The task is to predict/label "Irish" vs. "non-Irish". Python 2.7's output:
1= ir
0= non-ir
Class count:
0 4090942
1 940852
Name: ethnicity_scan, dtype: int64
Accuracy: 0.874921350119
Classification report:
precision recall f1-score support
0 0.89 0.96 0.93 2045610
1 0.74 0.51 0.60 470287
avg / total 0.87 0.87 0.87 2515897
Confusion matrix:
[[1961422 84188]
[ 230497 239790]]
AUC-ir= 0.901238104773
As you can see, the precision and recall are mediocre, but the AUC-ROC is higher (~0.90). And I am trying to figure out why, which I suspect is due to data imbalance (about 1:5). Based on the confusion matrix, and using Irish as the target (+), I calculated the TPR=0.51 and FPR=0.04. If I am considering non-Irish as (+), then TPR=0.96 and FPR=0.49. So how can I get a 0.9 AUC while the TPR can be only 0.5 at FPR=0.04?
Codes:
try:
for i in mass[k]:
df = df_temp # reset df before each loop
#$$
#$$
if 1==1:
###if i == singleEthnic:
count+=1
ethnicity_tar = str(i) # fr, en, ir, sc, others, ab, rus, ch, it, jp
# fn, metis, inuit; algonquian, iroquoian, athapaskan, wakashan, siouan, salish, tsimshian, kootenay
############################################
############################################
def ethnicity_target(row):
try:
if row[ethnicity_var] == ethnicity_tar:
return 1
else:
return 0
except: return None
df['ethnicity_scan'] = df.apply(ethnicity_target, axis=1)
print '1=', ethnicity_tar
print '0=', 'non-'+ethnicity_tar
# Random sampling a smaller dataframe for debugging
rows = df.sample(n=subsample_size, random_state=seed) # Seed gives fixed randomness
df = DataFrame(rows)
print 'Class count:'
print df['ethnicity_scan'].value_counts()
# Assign X and y variables
X = df.raw_name.values
X2 = df.name.values
X3 = df.gender.values
X4 = df.location.values
y = df.ethnicity_scan.values
# Feature extraction functions
def feature_full_name(nameString):
try:
full_name = nameString
if len(full_name) > 1: # not accept name with only 1 character
return full_name
else: return '?'
except: return '?'
def feature_full_last_name(nameString):
try:
last_name = nameString.rsplit(None, 1)[-1]
if len(last_name) > 1: # not accept name with only 1 character
return last_name
else: return '?'
except: return '?'
def feature_full_first_name(nameString):
try:
first_name = nameString.rsplit(' ', 1)[0]
if len(first_name) > 1: # not accept name with only 1 character
return first_name
else: return '?'
except: return '?'
# Transform format of X variables, and spit out a numpy array for all features
my_dict = [{'last-name': feature_full_last_name(i)} for i in X]
my_dict5 = [{'first-name': feature_full_first_name(i)} for i in X]
all_dict = []
for i in range(0, len(my_dict)):
temp_dict = dict(
my_dict[i].items() + my_dict5[i].items()
)
all_dict.append(temp_dict)
newX = dv.fit_transform(all_dict)
# Separate the training and testing data sets
X_train, X_test, y_train, y_test = cross_validation.train_test_split(newX, y, test_size=testTrainSplit)
# Fitting X and y into model, using training data
classifierUsed2.fit(X_train, y_train)
# Making predictions using trained data
y_train_predictions = classifierUsed2.predict(X_train)
y_test_predictions = classifierUsed2.predict(X_test)
Inserted codes for resampling:
try:
for i in mass[k]:
df = df_temp # reset df before each loop
#$$
#$$
if 1==1:
###if i == singleEthnic:
count+=1
ethnicity_tar = str(i) # fr, en, ir, sc, others, ab, rus, ch, it, jp
# fn, metis, inuit; algonquian, iroquoian, athapaskan, wakashan, siouan, salish, tsimshian, kootenay
############################################
############################################
def ethnicity_target(row):
try:
if row[ethnicity_var] == ethnicity_tar:
return 1
else:
return 0
except: return None
df['ethnicity_scan'] = df.apply(ethnicity_target, axis=1)
print '1=', ethnicity_tar
print '0=', 'non-'+ethnicity_tar
# Resampled
df_resampled = df.append(df[df.ethnicity_scan==0].sample(len(df)*5, replace=True))
# Random sampling a smaller dataframe for debugging
rows = df_resampled.sample(n=subsample_size, random_state=seed) # Seed gives fixed randomness
df = DataFrame(rows)
print 'Class count:'
print df['ethnicity_scan'].value_counts()
# Assign X and y variables
X = df.raw_name.values
X2 = df.name.values
X3 = df.gender.values
X4 = df.location.values
y = df.ethnicity_scan.values
# Feature extraction functions
def feature_full_name(nameString):
try:
full_name = nameString
if len(full_name) > 1: # not accept name with only 1 character
return full_name
else: return '?'
except: return '?'
def feature_full_last_name(nameString):
try:
last_name = nameString.rsplit(None, 1)[-1]
if len(last_name) > 1: # not accept name with only 1 character
return last_name
else: return '?'
except: return '?'
def feature_full_first_name(nameString):
try:
first_name = nameString.rsplit(' ', 1)[0]
if len(first_name) > 1: # not accept name with only 1 character
return first_name
else: return '?'
except: return '?'
# Transform format of X variables, and spit out a numpy array for all features
my_dict = [{'last-name': feature_full_last_name(i)} for i in X]
my_dict5 = [{'first-name': feature_full_first_name(i)} for i in X]
all_dict = []
for i in range(0, len(my_dict)):
temp_dict = dict(
my_dict[i].items() + my_dict5[i].items()
)
all_dict.append(temp_dict)
newX = dv.fit_transform(all_dict)
# Separate the training and testing data sets
X_train, X_test, y_train, y_test = cross_validation.train_test_split(newX, y, test_size=testTrainSplit)
# Fitting X and y into model, using training data
classifierUsed2.fit(X_train, y_train)
# Making predictions using trained data
y_train_predictions = classifierUsed2.predict(X_train)
y_test_predictions = classifierUsed2.predict(X_test)
Your model outputs a probability P (between 0 and 1) for each row in the test set that it scores. The summary stats (precision, recall, etc) are for a single value of P as a prediction threshold, probably P=0.5, unless you've changed this in your code. However the ROC contains more information, the idea is that you probably won't want to use this default value as your prediction threshold, so the ROC is plotted by calculating the ratio of true positives to false positives, across every prediction threshold betwen 0 and 1.
If you've undersampled your non-Irish people in the data, then you're correct that the AUC and precision will be overestimated; if your dataset is only 5000 rows, then you will have no problem running your model on a larger training set; just rebalance your dataset (by bootstrap sampling to increase your non-Irish people) until your accurately reflect your sample population.

Resources