Machine learning using R and the randomForestSRC package

I'm trying to use "surv.randomForestSRC" as the learner for survival analysis with mlr in R.
My code and results are below. "newHCC" is the survival data of HCC patients together with multiple numeric parameters.
> newHCC$status = (newHCC$status == 1)
> surv.task = makeSurvTask(data = newHCC, target = c("time", "status"))
> surv.task
Supervised task: newHCC
Type: surv
Target: time,status
Events: 61
Observations: 127
Features:
numerics factors ordered
30 0 0
Missings: FALSE
Has weights: FALSE
Has blocking: FALSE
> lrn = makeLearner("surv.randomForestSRC")
> rdesc = makeResampleDesc(method = "RepCV", folds=10, reps=10)
> r = resample(learner = lrn, task = surv.task, resampling = rdesc)
[Resample] repeated cross-validation iter 1: cindex.test.mean=0.485
[Resample] repeated cross-validation iter 2: cindex.test.mean=0.556
[Resample] repeated cross-validation iter 3: cindex.test.mean=0.825
[Resample] repeated cross-validation iter 4: cindex.test.mean=0.81
...
[Resample] repeated cross-validation iter 100: cindex.test.mean=0.683
[Resample] Aggr. Result: cindex.test.mean=0.688
I have several questions:
1. How can I check the parameters that were used, such as ntree and mtry?
2. Is there a good way to tune them?
3. How can I see the predicted individual risk, like what predict() from the randomForestSRC package itself returns?
Many thanks in advance.

For questions 1 and 2, you can try the following:
surv_param <- makeParamSet(
  makeIntegerParam("ntree", lower = 50, upper = 100),
  makeIntegerParam("mtry", lower = 1, upper = 6),
  makeIntegerParam("nodesize", lower = 10, upper = 50),
  makeIntegerParam("nsplit", lower = 3, upper = 50)
)
rancontrol <- makeTuneControlRandom(maxit = 10L)   # random search, 10 iterations
surv_tune <- tuneParams(learner = lrn, resampling = rdesc, task = surv.task,
                        par.set = surv_param, control = rancontrol)
surv.tree <- setHyperPars(lrn, par.vals = surv_tune$x)  # apply the tuned values
surv <- mlr::train(surv.tree, surv.task)
getLearnerModel(surv)     # the underlying rfsrc model, including the ntree and mtry used
model <- predict(surv, surv.task)
For question 3: at the moment you cannot predict individual risk with surv.randomForestSRC in mlr; the only available predict type is "response".
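A possible workaround, offered only as a sketch and assuming the objects surv and newHCC defined above: drop down to the wrapped rfsrc model via getLearnerModel() and use randomForestSRC's own predict method, which exposes per-patient risk.

library(randomForestSRC)
rfsrc_model <- getLearnerModel(surv)       # the underlying rfsrc fit
head(rfsrc_model$predicted.oob)            # OOB ensemble mortality (risk) per patient
pred <- predict(rfsrc_model, newdata = newHCC)   # predict.rfsrc() on (new) data
head(pred$predicted)                       # per-patient ensemble mortality
## pred$survival and pred$chf hold the estimated survival and cumulative
## hazard curves at the event times in pred$time.interest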

Related

mlr3 Multiple Measures AutoFSelector

I wanted to inquire about how to modify my code so that I could get multiple performance measures as an output.
My code is the following:
ARMSS<-read.csv("Index ARMSS Proteomics Final.csv", row.names=1)
set.seed(123, "L'Ecuyer")
task = as_task_regr(ARMSS, target = "Index.ARMSS")
learner = lrn("regr.ranger", importance = "impurity")
set_threads(learner, n = 8)
resampling_inner = rsmp("cv", folds = 7)
measure = msrs(c("regr.rmse","regr.srho"))
terminator = trm("none")
at = AutoFSelector$new(
learner = learner,
resampling = resampling_inner,
measure = measure,
terminator = terminator,
fselect = fs("rfe", n_features = 1, feature_fraction = 0.5, recursive = FALSE),
store_models = TRUE)
I then receive the following error:
Error in UseMethod("as_measure") :
no applicable method for 'as_measure' applied to an object of class "list"
The result of multi-objective optimization is a Pareto front, i.e. there are multiple best solutions. The AutoFSelector needs a single solution to fit the final model, so it only works with one measure.
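If you want to optimize on one measure but still report several, a minimal sketch is to score the resample result with additional measures afterwards. This reuses the task and at objects from the question, with at built on the single measure msr("regr.rmse"); the outer resampling below is only illustrative.

## Sketch: optimize the feature selection on one measure, report several afterwards.
resampling_outer = rsmp("cv", folds = 3)
rr = resample(task, at, resampling_outer, store_models = TRUE)
rr$score(msrs(c("regr.rmse", "regr.srho")))      # per-fold scores
rr$aggregate(msrs(c("regr.rmse", "regr.srho")))  # aggregated over folds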

RFE Termination Using RMSE with AutoFSelector

To mimic how caret performs RFE and select features that produce the lowest RMSE, it was suggested to use the archive.
I am using AutoFSelector and nested resampling with the following code:
ARMSS<-read.csv("Index ARMSS Proteomics Final.csv", row.names=1)
set.seed(123, "L'Ecuyer")
task = as_task_regr(ARMSS, target = "Index.ARMSS")
learner = lrn("regr.ranger", importance = "impurity")
set_threads(learner, n = 8)
resampling_inner = rsmp("cv", folds = 7)
measure = msr("regr.rmse")
terminator = trm("none")
at = AutoFSelector$new(
learner = learner,
resampling = resampling_inner,
measure = measure,
terminator = terminator,
fselect = fs("rfe", n_features = 1, feature_fraction = 0.5, recursive = FALSE),
store_models = TRUE)
resampling_outer = rsmp("repeated_cv", folds = 10, repeats = 10)
rr = resample(task, at, resampling_outer, store_models = TRUE)
Should I use the extract_inner_fselect_archives() command to identify, for each iteration, the smallest RMSE and the features that were selected, and then rerun the code above with the n_features argument changed? How do I reconcile differences across iterations in the number of features and/or the features selected?
Nested resampling is a statistical procedure to estimate the predictive performance of the model trained on the full dataset; it is not a procedure to select optimal hyperparameters. Nested resampling produces many hyperparameter configurations, which should not be used to construct a final model.
— mlr3book, Chapter 4: Optimization
The same is true for feature selection: you don't select a feature set with nested resampling, you estimate the performance of the final model.
"it was suggested to use the archive"
Without nested resampling, you just call instance$result or at$fselect_result to get the feature subset with the lowest RMSE.
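For instance, a minimal sketch reusing task and at from the question (with the single measure msr("regr.rmse")):

at$train(task)        # one RFE run using the inner 7-fold CV
at$fselect_result     # the feature subset with the lowest inner regr.rmse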

mlr3 RFE Termination Metric

This may be a naive question, but I would like to use recursive feature elimination with a random forest model and want to know whether I can terminate based on the feature set that gives the smallest RMSE (like this figure from caret).
I looked at the documentation and, if I am not mistaken, it seems to default to terminating at half of the features.
Thanks for your help @be-marc, and my apologies for my naivety as this is all new to me. I was trying to implement your suggestion with the code I was already running (see below), but was not sure where to find the archive, since I wasn't using the fselect command but rather AutoFSelector and nested resampling:
ARMSS<-read.csv("Index ARMSS Proteomics Final.csv", row.names=1)
set.seed(123, "L'Ecuyer")
task = as_task_regr(ARMSS, target = "Index.ARMSS")
learner = lrn("regr.ranger", importance = "impurity")
set_threads(learner, n = 8)
resampling_inner = rsmp("cv", folds = 7)
measure = msr("regr.rmse")
terminator = trm("none")
at = AutoFSelector$new(
learner = learner,
resampling = resampling_inner,
measure = measure,
terminator = terminator,
fselect = fs("rfe", n_features = 1, feature_fraction = 0.5, recursive = FALSE),
store_models = TRUE)
resampling_outer = rsmp("repeated_cv", folds = 10, repeats = 10)
rr = resample(task, at, resampling_outer, store_models = TRUE)
Should I use the extract_inner_fselect_archives() command and then, for each iteration, identify the smallest RMSE and the features selected? How do I reconcile differences across iterations in the number of features and/or the features selected?
"if I could terminate based on the feature set that gives the smallest RMSE"
That makes no sense. You can terminate when one feature is left and then look at the archive to find the feature set with the lowest RMSE. You can reproduce the same run as caret with feature_fraction = 0.5 and n_features = 1:
instance = fselect(
  method = fs("rfe", n_features = 1, feature_fraction = 0.5),
  task = tsk("mtcars"),
  learner = lrn("regr.rpart"),
  resampling = rsmp("holdout"),
  measure = msr("regr.rmse"),
  store_models = TRUE
)
as.data.table(instance$archive)
instance$archive$best()

Apply Models from Nested Resample to Permuted Dataset

I have generated a nested resampling object with the following code:
data<-read.csv("Data.csv", row.names=1)
data$factor<-as.factor(data$factor)
set.seed(123, "L'Ecuyer")
task = as_task_classif(data, target = "factor")
learner = lrn("classif.ranger", importance = "impurity", num.trees=10000)
measure = msr("classif.fbeta", beta=1)
terminator = trm("none")
resampling_inner = rsmp("repeated_cv", folds = 10, repeats = 10)
at = AutoFSelector$new(
learner = learner,
resampling = resampling_inner,
measure = measure,
terminator = terminator,
fselect = fs("rfe", n_features = 1, feature_fraction = 0.5, recursive = FALSE),
store_models = TRUE)
resampling_outer = rsmp("repeated_cv", folds = 10, repeats = 10)
rr = resample(task, at, resampling_outer)
I have a .csv file with the factor variable permuted/randomized, and I would like to apply the models from the nested resampling paradigm to this dataset so I can demonstrate the difference in model performance between the real dataset and the permuted/randomized one. I am interested in this as a validation of predictive performance, because when sample sizes are small (which is common in biological contexts) prediction accuracy by chance alone can approach 70% or higher, based on this paper (https://pubmed.ncbi.nlm.nih.gov/25596422/).
How would I do this using the resample object (rr)?
I think I figured out how to do it (do let me know if I went wrong somewhere):
data<-read.csv("Data.csv", row.names=1)
data$factor<-as.factor(data$factor)
permuted<-read.csv("Data.csv", row.names=1)
permuted$factor<-as.factor(permuted$factor)
set.seed(123, "L'Ecuyer")
task1 = as_task_classif(data, target = "factor")
task2 = as_task_classif(permuted, target = "factor")
task_list = list(task1, task2)
learner = lrn("classif.ranger", importance = "impurity", num.trees=10000)
measure = msr("classif.fbeta", beta=1)
terminator = trm("none")
resampling_inner = rsmp("repeated_cv", folds = 10, repeats = 10)
at = AutoFSelector$new(
learner = learner,
resampling = resampling_inner,
measure = measure,
terminator = terminator,
fselect = fs("rfe", n_features = 1, feature_fraction = 0.5, recursive = FALSE),
store_models = TRUE)
resampling_outer = rsmp("repeated_cv", folds = 10, repeats = 10)
design = benchmark_grid(task=task_list, learner=at, resampling=resampling_outer)
bmr = benchmark(design, store_models = TRUE)
Am I right in assuming that you have two tasks, t1 and t2, where task t2 is permuted, and you want to compare the performance of a learner on these two tasks?
The way to go then is to use the benchmark() function instead of resample(). You have to create two different tasks (one permuted and one not permuted).
You might find the section Resampling and Benchmarking in our book helpful.
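To then compare the real and the permuted task, a short sketch reusing bmr and measure from the code above (autoplot() needs the mlr3viz package):

bmr$aggregate(measure)              # one row per task/learner combination
library(mlr3viz)                    # optional visual comparison
autoplot(bmr, measure = measure)    # boxplots of the outer-fold scores per task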

Clustered resampling for inner layer of Caret recursive feature elimination

I have data where IDs are contained within clusters.
I would like to perform recursive feature elimination using Caret's rfe function, which performs the resampling-wrapped RFE procedure described in the caret feature-selection documentation (the line numbers below refer to that algorithm).
Clustered resampling for the outer layer (line 2.1) is straightforward, using the index parameter.
However, within each outer resample, I would like to tune the model's parameters using cluster-based cross-validation (inner resampling) (line 2.9). Model tuning in the inner layer is possible by specifying a tuneGrid in rfe and supplying an appropriate trControl. It is this trControl that I would like to change to allow clustered resampling.
The outer resampling is specified in the rfeControl parameter of rfe.
The inner resampling is specified by trControl of rfe which is passed to train.
The trouble I am having is that I can't seem to specify any inner indices, because after the outer resampling, those indices are no longer valid or no longer present in the outer-resampled data.
I am looking for a way to tell train to take an outer resample (which will be missing a cluster against which to validate) and to tune the model using inner resampling based on folds of the remaining clusters.
The MWE is as minimal as possible:
library(caret)
library(tidyverse)
library(parallel)
library(doParallel)
range01 <- function(x){(x-min(x))/(max(x)-min(x))}
### Create some random data, 10 features, with some influence over a binomial outcome
set.seed(42)
id <- 1:1000
cluster <- rep(1:10, each = 100)
dat <- data.frame(id, cluster, replicate(10,rnorm(n = 1000, mean = runif(1, 0,100)+cluster, sd = runif(1, 0,20))))
dat <- dat %>% mutate(temp = rowSums(across(X1:X10)), prob = range01(temp), outcome = rbinom(n = nrow(dat), size = 1, prob = prob))
dat$outcome <- as.factor(dat$outcome)
levels(dat$outcome) <- c("control", "case")
dat$outcome <- factor(dat$outcome, levels=rev(levels(dat$outcome)))
### Manual outer folds-based cluster ###
for(i in 1:10) {
assign(paste0("index", i), which(dat$cluster!=i))
}
unit_indices <- list(index1, index2, index3, index4, index5, index6, index7, index8, index9, index10)
### Inner resampling method (THIS IS WHAT I'D LIKE TO CHANGE) ###
cv5 <- trainControl(classProbs = TRUE, method = "cv", number = 5, allowParallel = F) ## Is there a way to have inner cluster-based resampling WITHIN the outer cluster-based resampling?
caret_rfe_functions <- list(summary = twoClassSummary,
fit = function (x, y, first, last, ...) {
train(x, y, ...)
},
pred = caretFuncs$pred,
rank = function(object, x, y) {
vimp <- varImp(object)$importance
vimp <- vimp[order(vimp$Overall,decreasing = TRUE),,drop = FALSE]
vimp$var <- rownames(vimp)
vimp
},
selectSize = function (x, metric = "ROC", tol = 1, maximize = TRUE)
{
if (!maximize) {
best <- min(x[, metric])
perf <- (x[, metric] - best)/best * 100
flag <- perf <= tol
}
else {
best <- max(x[, metric])
perf <- (best - x[, metric])/best * 100
flag <- perf <= tol
}
min(x[flag, "Variables"])
},
selectVar = caretFuncs$selectVar)
caret_rfe_ctrl <- rfeControl(
functions = caret_rfe_functions,
saveDetails = TRUE,
index = unit_indices,
indexOut = NULL,
returnResamp = "all",
allowParallel = T, ### change this if you don't want to / can't go parallel
verbose = TRUE
)
#### Feature selection ####
set.seed(42)
cl <- makePSOCKcluster(10) ### for parallel processing if available
registerDoParallel(cl)
rfe_profile_nnet <- rfe(
form = outcome ~
X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10,
data = dat,
sizes = seq(2,10,1),
rfeControl = caret_rfe_ctrl,
## pass options to train()
method = "nnet",
preProc = c("center", "scale"),
metric = "ROC",
tuneGrid = expand.grid(size = c(1:5), decay = 5),
trControl = cv5) ### I would like to change this to allow inner cluster-based resampling
stopCluster(cl)
rfe_profile_nnet
plot(rfe_profile_nnet)
Presumably the inner cluster-based resampling would be achieved by specifying a new trainControl containing some dynamic inner index based on the outer resample that is selected at the time:
inner_cluster_tune <- trainControl(classProbs = TRUE,
index = {insert magic here}, ### This is the important bit
returnResamp = "all",
summaryFunction = twoClassSummary,
allowParallel = F) ### especially if the outer resample is parallelised
If you try with the original cluster indices, e.g.
inner_cluster_tune <- trainControl(classProbs = TRUE,
index = unit_indices,
returnResamp = "all",
summaryFunction = twoClassSummary,
allowParallel = F)
There are various warnings about missing data in the resamples, and things like:
24: In `[<-.data.frame`(`*tmp*`, , object$method$center, value = structure(list( ... :
  provided 81 variables to replace 9 variables
All help greatly appreciated.
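One idea, offered only as a sketch and not verified end to end: rebuild cluster-based inner folds inside a custom fit() function using caret::groupKFold(), looking the clusters up in dat via the row names of the x that fit() receives. This assumes those row names survive caret's internal subsetting, and it requires dropping trControl = cv5 from the rfe() call so that train() does not receive two trControl arguments.

## Sketch only: cluster-aware inner resampling constructed inside fit().
## Assumes rownames(x) still index rows of `dat`; drop `trControl = cv5`
## from the rfe() call if you use this.
caret_rfe_functions$fit <- function(x, y, first, last, ...) {
  rows   <- as.integer(rownames(x))        # rows of dat in this outer resample
  groups <- dat$cluster[rows]              # their cluster labels
  inner_index <- caret::groupKFold(groups, k = length(unique(groups)))
  inner_ctrl <- trainControl(classProbs = TRUE,
                             index = inner_index,
                             summaryFunction = twoClassSummary,
                             allowParallel = FALSE)
  train(x, y, trControl = inner_ctrl, ...)
}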
As a postscript question: you can see which tuning parameters were used within your rfe like so:
> rfe_profile_nnet$fit
Neural Network
1000 samples
8 predictor
2 classes: 'case', 'control'
Pre-processing: centered (8), scaled (8)
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 800, 800, 800, 800, 800
Resampling results across tuning parameters:
size Accuracy Kappa
1 0.616 0.1605071
2 0.616 0.1686937
3 0.620 0.1820503
4 0.618 0.1788491
5 0.618 0.1788063
Tuning parameter 'decay' was held constant at a value of 5
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were size = 3 and decay = 5.
But does anyone know whether this refers to one, or to all, of the outer resamples? Presumably the same tuning parameters won't necessarily be chosen across all outer resamples.
