RFE Termination Using RMSE with AutoFSelector - mlr3

To mimic how caret performs RFE and select features that produce the lowest RMSE, it was suggested to use the archive.
I am using AutoFSelector and nested resampling with the following code:
ARMSS<-read.csv("Index ARMSS Proteomics Final.csv", row.names=1)
set.seed(123, "L'Ecuyer")
task = as_task_regr(ARMSS, target = "Index.ARMSS")
learner = lrn("regr.ranger", importance = "impurity")
set_threads(learner, n = 8)
resampling_inner = rsmp("cv", folds = 7)
measure = msr("regr.rmse")
terminator = trm("none")
at = AutoFSelector$new(
  learner = learner,
  resampling = resampling_inner,
  measure = measure,
  terminator = terminator,
  fselect = fs("rfe", n_features = 1, feature_fraction = 0.5, recursive = FALSE),
  store_models = TRUE)
resampling_outer = rsmp("repeated_cv", folds = 10, repeats = 10)
rr = resample(task, at, resampling_outer, store_models = TRUE)
Should I use the extract_inner_fselect_archives() command to identify, for each iteration, the smallest RMSE and the features that were selected, then rerun the code above with the n_features argument changed? How do I reconcile differences across iterations in the number of features and/or the features selected?

Nested resampling is a statistical procedure to estimate the predictive performance of the model trained on the full dataset, it is not a procedure to select optimal hyperparameters. Nested resampling produces many hyperparameter configurations which should not be used to construct a final model.
mlr3book Chapter 4 - Optimization.
The same is true for feature selection. You don't select a feature set with nested resampling. You estimate the performance of the final model.
Regarding "it was suggested to use the archive": without nested resampling, you just call instance$result or at$fselect_result to get the feature subset with the lowest RMSE.
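A minimal sketch of that non-nested workflow follows. It assumes a recent mlr3fselect release, where the constructor argument is named fselector (the code in this thread uses the older fselect = name), and it uses the built-in mtcars task with regr.rpart as stand-ins for the poster's data and ranger learner:

```r
library(mlr3)
library(mlr3fselect)

# Stand-in task and learner (assumptions, not the poster's setup)
task = tsk("mtcars")

afs = AutoFSelector$new(
  fselector = fs("rfe", n_features = 1, feature_fraction = 0.5),
  learner = lrn("regr.rpart"),
  resampling = rsmp("cv", folds = 3),
  measure = msr("regr.rmse"),
  terminator = trm("none"),   # RFE stops on its own at n_features
  store_models = TRUE
)

# Train once on the full task: no outer resampling loop
afs$train(task)

# Feature subset with the lowest cross-validated RMSE
afs$fselect_result
afs$fselect_result$features[[1]]

# Every subset size that RFE evaluated, with its RMSE
as.data.table(afs$archive)
```

The outer resampling loop is only needed when an unbiased performance estimate of this whole procedure is wanted; the feature set itself comes from the single training run above.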

Related

mlr3 Multiple Measures AutoFSelector

I wanted to inquire about how to modify my code so that I could get multiple performance measures as an output.
My code is the following:
ARMSS<-read.csv("Index ARMSS Proteomics Final.csv", row.names=1)
set.seed(123, "L'Ecuyer")
task = as_task_regr(ARMSS, target = "Index.ARMSS")
learner = lrn("regr.ranger", importance = "impurity")
set_threads(learner, n = 8)
resampling_inner = rsmp("cv", folds = 7)
measure = msrs(c("regr.rmse","regr.srho"))
terminator = trm("none")
at = AutoFSelector$new(
  learner = learner,
  resampling = resampling_inner,
  measure = measure,
  terminator = terminator,
  fselect = fs("rfe", n_features = 1, feature_fraction = 0.5, recursive = FALSE),
  store_models = TRUE)
I then receive the following error:
Error in UseMethod("as_measure") :
no applicable method for 'as_measure' applied to an object of class "list"
The result of multi-objective optimization is a Pareto front, i.e., there are multiple best solutions. The AutoFSelector needs a single solution to fit the final model, so it only works with one measure.
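One possible workaround, which is not part of the answer above, is to let a single measure drive the feature selection and then score the outer resampling result with several measures. The sketch below assumes a recent mlr3fselect release (argument name fselector) and uses mtcars with regr.rpart as stand-ins:

```r
library(mlr3)
library(mlr3fselect)

afs = AutoFSelector$new(
  fselector = fs("rfe", n_features = 1, feature_fraction = 0.5),
  learner = lrn("regr.rpart"),
  resampling = rsmp("cv", folds = 3),
  measure = msr("regr.rmse"),   # exactly one measure for the selection
  terminator = trm("none"),
  store_models = TRUE
)

# Outer loop: estimate the performance of the whole procedure
rr = resample(tsk("mtcars"), afs, rsmp("cv", folds = 3))

# Report as many measures as needed on the outer predictions
scores = rr$aggregate(msrs(c("regr.rmse", "regr.srho")))
scores
```

Only RMSE influences which features are kept here; Spearman's rho is reported afterwards and plays no role in the selection.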

mlr3 RFE Termination Metric

This may be a naive question but I would like to use recursive feature elimination with a random forest model and wanted to see if I could terminate based on the feature set that gives the smallest RMSE (like this figure from caret)?
I looked at the documentation and it seems that it defaults to terminating at half of the features chosen if I am not mistaken?
Thanks for your help @be-marc, and my apologies for my naivety as this is all new to me. I was trying to implement your suggestion with the code I was already running (see below), but was not sure where to find the archive, since I wasn't using the fselect command but rather AutoFSelector and nested resampling:
ARMSS<-read.csv("Index ARMSS Proteomics Final.csv", row.names=1)
set.seed(123, "L'Ecuyer")
task = as_task_regr(ARMSS, target = "Index.ARMSS")
learner = lrn("regr.ranger", importance = "impurity")
set_threads(learner, n = 8)
resampling_inner = rsmp("cv", folds = 7)
measure = msr("regr.rmse")
terminator = trm("none")
at = AutoFSelector$new(
  learner = learner,
  resampling = resampling_inner,
  measure = measure,
  terminator = terminator,
  fselect = fs("rfe", n_features = 1, feature_fraction = 0.5, recursive = FALSE),
  store_models = TRUE)
resampling_outer = rsmp("repeated_cv", folds = 10, repeats = 10)
rr = resample(task, at, resampling_outer, store_models = TRUE)
Should I use the extract_inner_fselect_archives() command then identify under each iteration the smallest RMSE and the features selected? How do I reconcile differences across iterations in the number of features and/or the features selected?
Regarding "if I could terminate based on the feature set that gives the smallest RMSE": that makes no sense as a termination criterion. You can terminate when one feature is left and then look at the archive to find the feature set with the lowest RMSE. You can achieve the same run as caret with feature_fraction = 0.5 and n_features = 1.
instance = fselect(
  method = fs("rfe", n_features = 1, feature_fraction = 0.5),
  task = tsk("mtcars"),
  learner = lrn("regr.rpart"),
  resampling = rsmp("holdout"),
  measure = msr("regr.rmse"),
  store_models = TRUE
)
as.data.table(instance$archive)
instance$archive$best()
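As a follow-up sketch, the archive can be searched for the lowest RMSE and a final model fitted on the winning subset. This assumes a recent mlr3fselect release, where the first argument of fselect() is named fselector and the measure argument is measures (the answer above uses the older method = and measure = names); the final-model step is an illustration, not something the answer prescribes:

```r
library(mlr3)
library(mlr3fselect)

task = tsk("mtcars")
instance = fselect(
  fselector = fs("rfe", n_features = 1, feature_fraction = 0.5),
  task = task,
  learner = lrn("regr.rpart"),
  resampling = rsmp("holdout"),
  measures = msr("regr.rmse"),
  store_models = TRUE
)

# One row per evaluated feature subset; pick the row with the lowest RMSE
archive = as.data.table(instance$archive)
best_row = archive[which.min(archive$regr.rmse), ]
best_row$regr.rmse

# The same subset, as reported by the instance itself
instance$result_feature_set

# Restrict the task to the selected features and fit the final model
task$select(instance$result_feature_set)
final = lrn("regr.rpart")$train(task)
```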

Apply Models from Nested Resample to Permuted Dataset

I have generated a nested resampling object with the following code:
data<-read.csv("Data.csv", row.names=1)
data$factor<-as.factor(data$factor)
set.seed(123, "L'Ecuyer")
task = as_task_classif(data, target = "factor")
learner = lrn("classif.ranger", importance = "impurity", num.trees=10000)
measure = msr("classif.fbeta", beta=1)
terminator = trm("none")
resampling_inner = rsmp("repeated_cv", folds = 10, repeats = 10)
at = AutoFSelector$new(
  learner = learner,
  resampling = resampling_inner,
  measure = measure,
  terminator = terminator,
  fselect = fs("rfe", n_features = 1, feature_fraction = 0.5, recursive = FALSE),
  store_models = TRUE)
resampling_outer = rsmp("repeated_cv", folds = 10, repeats = 10)
rr = resample(task, at, resampling_outer)
I have a .csv file with the factor variable permuted/randomized and would like to apply the models of the nested resampling paradigm to this dataset so I can demonstrate differences in model performance between the real dataset and the permuted/randomized dataset. I am interested in this to validate predictive performance because, when sample sizes are small (which is common in biological contexts), prediction accuracy by chance alone can approach 70% or higher based on this paper (https://pubmed.ncbi.nlm.nih.gov/25596422/).
How would I do this using the resample object (rr)?
I think I figured out how to do it (do let me know if I went wrong somewhere):
data<-read.csv("Data.csv", row.names=1)
data$factor<-as.factor(data$factor)
permuted<-read.csv("Data.csv", row.names=1)  # NOTE: this reads the same file as `data`; it should point to the permuted .csv instead
permuted$factor<-as.factor(permuted$factor)
set.seed(123, "L'Ecuyer")
task1 = as_task_classif(data, target = "factor")
task2 = as_task_classif(permuted, target = "factor")
task_list = list(task1, task2)
learner = lrn("classif.ranger", importance = "impurity", num.trees=10000)
measure = msr("classif.fbeta", beta=1)
terminator = trm("none")
resampling_inner = rsmp("repeated_cv", folds = 10, repeats = 10)
at = AutoFSelector$new(
  learner = learner,
  resampling = resampling_inner,
  measure = measure,
  terminator = terminator,
  fselect = fs("rfe", n_features = 1, feature_fraction = 0.5, recursive = FALSE),
  store_models = TRUE)
resampling_outer = rsmp("repeated_cv", folds = 10, repeats = 10)
design = benchmark_grid(tasks = task_list, learners = at, resamplings = resampling_outer)
bmr = benchmark(design, store_models = TRUE)
Am I right in assuming that you have two tasks t1 and t2, where the task t2 is permuted and you wanted to compare the performance of a learner on these two tasks?
The way to go then is to use the benchmark() function instead of the resample function. You would have to create two different tasks (one permuted and one not permuted).
You might find the section Resampling and Benchmarking in our book helpful.
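A small sketch of that two-task comparison is below. It uses the built-in sonar task and a plain classif.rpart learner as stand-ins (assumptions; the poster would pass the AutoFSelector and their own data), and it creates the permuted task by shuffling the target column in R rather than reading a second file:

```r
library(mlr3)

set.seed(123)

# Real task (stand-in for the poster's data)
task_real = tsk("sonar")

# Permuted task: same features, target labels shuffled
dat = as.data.frame(task_real$data())
dat$Class = sample(dat$Class)
task_perm = as_task_classif(dat, target = "Class", id = "sonar_permuted")

# Same learner and resampling scheme on both tasks
design = benchmark_grid(
  tasks = list(task_real, task_perm),
  learners = lrn("classif.rpart"),
  resamplings = rsmp("cv", folds = 5)
)
bmr = benchmark(design)

# One aggregated score per task; the permuted task gives the chance-level baseline
res = bmr$aggregate(msr("classif.ce"))
res
```

A single permutation gives only one draw from the chance distribution; repeating the shuffle with several seeds would give a fuller picture of that baseline.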

mlr3 tuning: Error cannot allocate vector of size n Mb

I always hit the memory limit when I try to tune a model with mlr3, and I get the following error: "Error: cannot allocate vector of size n Mb". This happens even when I remove all unneeded objects and try to keep memory use as low as possible.
The tuning works with a smaller resampling (changing the resampling method from cv to holdout). Is there a way to get around this?
See the relevant code below:
inner_cv = rsmp("cv", folds = 10)
preproc = po("imputemedian", affect_columns = selector_type("numeric")) %>>%
  po("imputemode", affect_columns = selector_type("factor")) %>>%
  po("scale") %>>%
  po("encode", method = "one-hot")
para_mod = AutoTuner$new(
  learner = as_learner(preproc %>>%
    lrn("surv.parametric", type = "aft", dist = "lognormal")),
  resampling = inner_cv,
  measure = msr("surv.cindex"),
  terminator = trm("evals", n_evals = 200),
  tuner = tnr("irace"))
future::plan("multisession")
para_mod$train(task)

MLR3 Survival Analysis: how to simultaneously perform feature selection & hyperparameter tuning together and get selected_features?

I am trying to fit coxph and parametric models and perform feature selection and hyperparameter tuning at the same time. In the code below I can use either auto_fselector or auto_tuner inside resample, but not both. How do I do that? Do I need three levels of nested resampling (inner for feature selection, middle for tuning, and outer for performance evaluation)? In mlr this was easily done by applying a feature selection wrapper and then a tuning wrapper, but I am not sure how it is best done in mlr3.
I also want to get the selected features at the end. It seems learner$selected_features() does not work for survival models.
task = tsk("rats")
learner = lrn("surv.coxph")
outer_cv = rsmp("cv", folds = 10)$instantiate(task)
inner_cv = rsmp("cv", folds = 10)$instantiate(task)
Feat_select = auto_fselector(
  method = "random_search",
  learner = learner,
  resampling = inner_cv,
  measure = msr("x"),
  term_evals = 200)
model_tune = auto_tuner(
  method = "irace",
  learner = learner,
  resampling = inner_cv,
  measure = msr("x"),
  search_space = ps())
model_res = resample(task, model_tune , outer_cv, store_models = TRUE)
task = tsk("rats")
learner2 = as_learner(po("encode") %>>% lrn("surv.cv_glmnet"))
learner2$selected_features()
Error: attempt to apply non-function
learner3 = mlr3extralearners::lrn("surv.rsfsrc")
learner3$selected_features()
Error: attempt to apply non-function
You can nest AutoTuner and AutoFSelector in mlr3:
library(mlr3tuning)
library(mlr3fselect)
task = tsk("pima")
at = auto_tuner(
  method = "random_search",
  learner = lrn("classif.rpart", cp = to_tune(0.01, 0.1)),
  resampling = rsmp("cv", folds = 3),
  measure = msr("classif.ce"),
  term_evals = 5
)
afs = auto_fselector(
  method = "random_search",
  learner = at,
  resampling = rsmp("cv", folds = 3),
  measure = msr("classif.ce"),
  term_evals = 5
)
rr = resample(task, afs, resampling = rsmp("cv", folds = 3), store_models = TRUE)
extract_inner_fselect_results(rr)
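To get the selected features of a final model rather than of the outer folds, one option is to train the nested object once on the full task and read its fselect_result. The sketch below assumes recent mlr3tuning and mlr3fselect releases, where the first arguments are named tuner and fselector (the answer above uses the older method = name), and reuses the pima/rpart stand-ins from the answer rather than a survival learner:

```r
library(mlr3)
library(mlr3tuning)
library(mlr3fselect)

task = tsk("pima")

# Inner layer: tune cp by random search
at = auto_tuner(
  tuner = tnr("random_search"),
  learner = lrn("classif.rpart", cp = to_tune(0.01, 0.1)),
  resampling = rsmp("cv", folds = 3),
  measure = msr("classif.ce"),
  term_evals = 5
)

# Outer layer: feature selection wrapped around the tuned learner
afs = auto_fselector(
  fselector = fs("random_search"),
  learner = at,
  resampling = rsmp("cv", folds = 3),
  measure = msr("classif.ce"),
  term_evals = 5
)

# Train on the full task, then read the chosen feature subset
afs$train(task)
selected = afs$fselect_result$features[[1]]
selected
```

The resample() call in the answer estimates how well this procedure generalizes; the feature set that would ship with a final model comes from the single training run shown here.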

Resources