How to create stratified folds for repeatedcv in caret? - machine-learning

The way to create stratified folds for cv in caret is like this
library(caret)
library(data.table)
train_dat <- data.table(group = c(rep("group1",10), rep("group2",5)), x1 = rnorm(15), x2 = rnorm(15), label = factor(c(rep("treatment",15), rep("control",15))))
folds <- createFolds(train_dat[, group], k = 5)
fitCtrl <- trainControl(method = "cv", index = folds, classProbs = T, summaryFunction = twoClassSummary)
train(label~., data = train_dat[, !c("group"), with = F], trControl = fitCtrl, method = "xgbTree", metric = "ROC")
To balance group1 and group2, the creation of fold indexes is based on "group" variable.
However, is there any way to createFolds for repeatedcv in caret? So, I can have a balanced split for repeatedcv. Should I combined several createFolds and run trainControl?
trControl = trainControl(method = "cv", index = many_repeated_folds)
Thanks!

createMultiFolds is probably what you are interested in.

Related

mlr3 Multiple Measures AutoFSelector

I wanted to inquire about how to modify my code so that I could get multiple performance measures as an output.
My code is the following:
ARMSS<-read.csv("Index ARMSS Proteomics Final.csv", row.names=1)
set.seed(123, "L'Ecuyer")
task = as_task_regr(ARMSS, target = "Index.ARMSS")
learner = lrn("regr.ranger", importance = "impurity")
set_threads(learner, n = 8)
resampling_inner = rsmp("cv", folds = 7)
measure = msrs(c("regr.rmse","regr.srho"))
terminator = trm("none")
at = AutoFSelector$new(
learner = learner,
resampling = resampling_inner,
measure = measure,
terminator = terminator,
fselect = fs("rfe", n_features = 1, feature_fraction = 0.5, recursive = FALSE),
store_models = TRUE)
I then receive the following error:
Error in UseMethod("as_measure") :
no applicable method for 'as_measure' applied to an object of class "list"
The result of multi-objective optimization is a Pareto front i.e. there are multiple best solutions. The AutoFselector needs one solution to fit the final model. Therefore, the AutoFselector only works with one measure.

Apply Models from Nested Resample to Permuted Dataset

I have generated a nested resampling object with the following code:
data<-read.csv("Data.csv", row.names=1)
data$factor<-as.factor(data$factor)
set.seed(123, "L'Ecuyer")
task = as_task_classif(data, target = "factor")
learner = lrn("classif.ranger", importance = "impurity", num.trees=10000)
measure = msr("classif.fbeta", beta=1)
terminator = trm("none")
resampling_inner = rsmp("repeated_cv", folds = 10, repeats = 10)
at = AutoFSelector$new(
learner = learner,
resampling = resampling_inner,
measure = measure,
terminator = terminator,
fselect = fs("rfe", n_features = 1, feature_fraction = 0.5, recursive = FALSE),
store_models = TRUE)
resampling_outer = rsmp("repeated_cv", folds = 10, repeats = 10)
rr = resample(task, at, resampling_outer)
I have a .csv file with the factor variable permuted/randomized and would like to apply the models of the nested resampling paradigm to this dataset so I can demonstrated differences in the model performance between the real dataset and the permuted/randomized dataset. I am interested in this to validate predictive performance because when sample sizes are small (which is common in biological contexts) prediction accuracy by chance alone can approach 70% or higher based on this paper (https://pubmed.ncbi.nlm.nih.gov/25596422/).
How would I do this using the resample object (rr)?
I think I figured out how to do it (do let me know if I went wrong somewhere):
data<-read.csv("Data.csv", row.names=1)
data$factor<-as.factor(data$factor)
permuted<-read.csv("Data.csv", row.names=1)
permuted$factor<-as.factor(permuted$factor)
set.seed(123, "L'Ecuyer")
task1 = as_task_classif(data, target = "factor")
task2 = as_task_classif(permuted, target = "factor")
task_list = list(task1, task2)
learner = lrn("classif.ranger", importance = "impurity", num.trees=10000)
measure = msr("classif.fbeta", beta=1)
terminator = trm("none")
resampling_inner = rsmp("repeated_cv", folds = 10, repeats = 10)
at = AutoFSelector$new(
learner = learner,
resampling = resampling_inner,
measure = measure,
terminator = terminator,
fselect = fs("rfe", n_features = 1, feature_fraction = 0.5, recursive = FALSE),
store_models = TRUE)
resampling_outer = rsmp("repeated_cv", folds = 10, repeats = 10)
design = benchmark_grid(task=task_list, learner=at, resampling=resampling_outer)
bmr = benchmark(design, store_models = TRUE)
Am I right in assuming that you have two tasks t1 and t2, where the task t2 is permuted and you wanted to compare the performance of a learner on these two tasks?
The way to go then is to use the benchmark() function instead of the resample function. You would have to create two different tasks (one permuted and one not permuted).
You might find the section Resampling and Benchmarking in our book helpful.

Clustered resampling for inner layer of Caret recursive feature elimination

I have data where IDs are contained within clusters.
I would like to perform recursive feature elimination using Caret's rfe function which performs the following procedure:
Clustered resampling for the outer layer (line 2.1) is straightforward, using the index parameter.
However, within each outer resample, I would like to tune tuning parameters using cluster-based cross-validation (inner resampling) (line 2.9). Model tuning in the inner layer is possible by specifying a tuneGrid in rfe and having an appropriate trControl. It is this trControl that I would like to change to allow clustered resampling.
The outer resampling is specified in the rfeControl parameter of rfe.
The inner resampling is specified by trControl of rfe which is passed to train.
The trouble I am having is that I can't seem to specify any inner indices, because after the outer resampling, those indices are no longer valid or no longer present in the outer-resampled data.
I am looking for a way to tell train to take an outer resample (which will be missing a cluster against which to validate), and to tune the model using inner resampling by based on folds of the remaining clusters.
The MWE is as minimal as possible:
library(caret)
library(tidyverse)
library(parallel)
library(doParallel)
range01 <- function(x){(x-min(x))/(max(x)-min(x))}
### Create some random data, 10 features, with some influence over a binomial outcome
set.seed(42)
id <- 1:1000
cluster <- rep(1:10, each = 100)
dat <- data.frame(id, cluster, replicate(10,rnorm(n = 1000, mean = runif(1, 0,100)+cluster, sd = runif(1, 0,20))))
dat <- dat %>% mutate(temp = rowSums(across(X1:X10)), prob = range01(temp), outcome = rbinom(n = nrow(dat), size = 1, prob = prob))
dat$outcome <- as.factor(dat$outcome)
levels(dat$outcome) <- c("control", "case")
dat$outcome <- factor(dat$outcome, levels=rev(levels(dat$outcome)))
### Manual outer folds-based cluster ###
for(i in 1:10) {
assign(paste0("index", i), which(dat$cluster!=i))
}
unit_indices <- list(index1, index2, index3, index4, index5, index6, index7, index8, index9, index10)
### Inner resampling method (THIS IS WHAT I'D LIKE TO CHANGE) ###
cv5 <- trainControl(classProbs = TRUE, method = "cv", number = 5, allowParallel = F) ## Is there a way to have inner cluster-based resampling WITHIN the outer cluster-based resampling?
caret_rfe_functions <- list(summary = twoClassSummary,
fit = function (x, y, first, last, ...) {
train(x, y, ...)
},
pred = caretFuncs$pred,
rank = function(object, x, y) {
vimp <- varImp(object)$importance
vimp <- vimp[order(vimp$Overall,decreasing = TRUE),,drop = FALSE]
vimp$var <- rownames(vimp)
vimp
},
selectSize = function (x, metric = "ROC", tol = 1, maximize = TRUE)
{
if (!maximize) {
best <- min(x[, metric])
perf <- (x[, metric] - best)/best * 100
flag <- perf <= tol
}
else {
best <- max(x[, metric])
perf <- (best - x[, metric])/best * 100
flag <- perf <= tol
}
min(x[flag, "Variables"])
},
selectVar = caretFuncs$selectVar)
caret_rfe_ctrl <- rfeControl(
functions = caret_rfe_functions,
saveDetails = TRUE,
index = unit_indices,
indexOut = NULL,
returnResamp = "all",
allowParallel = T, ### change this if you don't want to / can't go parallel
verbose = TRUE
)
#### Feature selection ####
set.seed(42)
cl <- makePSOCKcluster(10) ### for parallel processing if available
registerDoParallel(cl)
rfe_profile_nnet <- rfe(
form = outcome ~
X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10,
data = dat,
sizes = seq(2,10,1),
rfeControl = caret_rfe_ctrl,
## pass options to train()
method = "nnet",
preProc = c("center", "scale"),
metric = "ROC",
tuneGrid = expand.grid(size = c(1:5), decay = 5),
trControl = cv5) ### I would like to change this to allow inner cluster-based resampling
stopCluster(cl)
rfe_profile_nnet
plot(rfe_profile_nnet)
Presumably the inner cluster-based resampling would be achieved by specifying a new trainControl containing some dynamic inner index based on the outer resample that is selected at the time:
inner_cluster_tune <- trainControl(classProbs = TRUE,
index = {insert magic here}, ### This is the important bit
returnResamp = "all",
summaryFunction = twoClassSummary,
allowParallel = F) ### especially if the outer resample is parallelised
If you try with the original cluster indices e.g.
inner_cluster_tune <- trainControl(classProbs = TRUE,
index = unit_indices,
returnResamp = "all",
summaryFunction = twoClassSummary,
allowParallel = F)
There are various warnings about missing data in the resamples, and things like 24: In [<-.data.frame(*tmp*, , object$method$center, value = structure(list( ... : provided 81 variables to replace 9 variables.
All help greatly appreciated.
As a postscript question , you can see which parameters were used within your rfe like so:
> rfe_profile_nnet$fit
Neural Network
1000 samples
8 predictor
2 classes: 'case', 'control'
Pre-processing: centered (8), scaled (8)
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 800, 800, 800, 800, 800
Resampling results across tuning parameters:
size Accuracy Kappa
1 0.616 0.1605071
2 0.616 0.1686937
3 0.620 0.1820503
4 0.618 0.1788491
5 0.618 0.1788063
Tuning parameter 'decay' was held constant at a value of 5
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were size = 3 and decay = 5.
But does anyone know if this refers to one, or all of the outer resamples? Presumably the same tuning parameters won't necessarily be chosen across all outer resamples

About pytorch reduction mean

I want use L1loss and BCELoss with reduction='mean' in vae reconstruction loss
but it produce same result for all different input i.e. result for landmark
so i use reduction='sum' it produce correct result that different output for different input.
how can i use mean reduction??
L1Loss = nn.L1Loss(reduction='mean').to(device)
BCELoss = nn.BCELoss(reduction='mean').to(device)
kld_criterion = KLDLoss(reduction='mean').to(device)
in training
rec_m, (rec_f, mean_f, logvar_f), (rec_l, mean_l, logvar_l) = model(origin)
lm_loss = CELoss(rec_l, lm)
f_loss = L1Loss(rec_f, f)
m_loss = CELoss(rec_m, m)
lm_kld_loss = kld_criterion(mean_l, logvar_l)
f_kld_loss = kld_criterion(mean_f, logvar_f)
loss = 4000*(f_loss + m_loss) + 30 * (lm_kld_loss + f_kld_loss) + 2000 * lm_loss
and model code
class VAE_NET(nn.Module):
def __init__(self, nc=3, ndf=32, nef=32, nz=128, isize=128, device=torch.device("cuda:0"), is_train=True):
super(VAE_NET, self).__init__()
self.nz = nz
# Encoder
self.l_encoder = Encoder(nc=nc, nef=nef, nz=nz, isize=isize, device=device)
self.f_encoder = Encoder(nc=nc, nef=nef, nz=nz, isize=isize, device=device)
# Decoder
self.l_decoder = Decoder(nc=nc, ndf=ndf, nz=nz, isize=isize)
self.m_decoder = Decoder(nc = nc, ndf = ndf, nz = nz * 2, isize = isize)
self.f_decoder = Decoder(nc = nc, ndf = ndf, nz = nz * 2, isize = isize)
if is_train == False:
for param in self.encoder.parameters():
param.requires_grad = False
for param in self.decoder.parameters():
param.requires_grad = False
def forward(self, x):
latent_l, mean_l, logvar_l = self.l_encoder(x)
latent_f, mean_f, logvar_f = self.f_encoder(x)
concat_latent = torch.cat((latent_l, latent_f), 1)
rec_l = self.l_decoder(latent_l)
rec_m = self.m_decoder(concat_latent)
rec_f = self.f_decoder(concat_latent)
return rec_m, (rec_f, mean_f, latent_f), (rec_l, mean_l, latent_l)
l is for face landmark
m is for face mask
f is for face part
reduction='sum' and reduction='mean' differs only by a scalar multiple. There is nothing wrong with your implementation from what I see. If your model only produces correct results with reduction='sum', it is likely that your learning rate is too low (and sum makes up for that difference by amplifying the gradient).

Tuning parameters in caret error despite assigning grids and as.factor

Any help appreciated. Been at this for weeks. :(
install.packages("klaR", dependencies=TRUE)
library(klaR)
install.packages("caret", dependencies=TRUE)
library(caret)
install.packages("e1071", dependencies=TRUE)
library(e1071)
install.packages("gmodels", dependencies=TRUE)
library(gmodels)
install.packages("gbm", dependencies=TRUE)
library(gbm)
install.packages("foreach", dependencies=TRUE)
library(foreach)
Load Grading Data
grading <- read.csv("~/PA_DataFinal/GradingData160315.csv")
create stratified sample # 1%
dfstrat <- stratified(grading, "FailPass", .01)
save(dfstrat, file = "c:/Users/gillisn/Documents/PA_DataFinal/RResults/GradingRResults/iteration 1/dfstrat.rda")
split data into train and test #75:25. FailPass is the responseVble
set.seed(1)
inTrainingSet <- createDataPartition(dfstrat$FailPass, p = .75, list = FALSE)
trainSet <- dfstrat[inTrainingSet,]
testSet <- dfstrat[-inTrainingSet, ]
set predictors and labels
There are 48 labels and its the last one that want to train on.
Take all the predictors 1-47
x,y is training data
x <- trainSet[,-48]
y <- as.factor(trainSet$FailPass)
i,j is test data
i <- testSet[,-48,]
j <- as.factor(testSet$FailPass)
Set Training control parameters
Bootstrapping itself around in 25 times.
bootControl <- trainControl(number = 25)
The grid is for the decision tree
gbmGrid <- expand.grid(.interaction.depth = (1:5) * 2, .n.trees = (1:10)*25, .shrinkage = .1)
nbGrid <- expand.grid(.fL=0, .usekernel=FALSE)
svmGrid >- expandGrid(.sigma=, .c=)
set.seed(2)
Train the models
naive bayes
nbFit <- train(x,y,method='nb',tuneGrid="nbGrid")
svm
svmFit <- train(x, y,method = "svmRadial", tuneLength = 10,trControl = bootControl, scaled = FALSE)
gbm
gbmFit <- train(x, y,method = "gbm", trControl = bootControl, verbose = FALSE, bag.fraction = 0.5, tuneGrid = gbmGrid)
predict the models on training data
models <- list(svm = svmFit, nb = nbFit, gbm = gbmFit)
predict(models)

Resources