rsmp: How to create folds with custom sizes? (mlr3)

Here I create 3 folds of the same size:
task <- tsk("iris")
folds <- rsmp("cv", folds = 3)$instantiate(task)
Is it possible to create n folds, each with a different size?

You can use ResamplingCustomCV. You pass a factor whose levels define the folds, so the fold sizes follow the level frequencies.
library(mlr3)
# Create a task with 10 observations
task = tsk("penguins")
task$filter(1:10)
# Instantiate the resampling: each factor level defines the test set of one fold,
# so the test sets here have sizes 2, 3, 3 and 2
custom_cv = rsmp("custom_cv")
f = factor(c("a", "a", "b", "b", "b", "c", "c", "c", "d", "d"))
custom_cv$instantiate(task, f = f)
custom_cv$iters # 4 folds
# Individual sets
custom_cv$train_set(1)
custom_cv$test_set(1)
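For completeness, a quick sketch of plugging the instantiated resampling into resample(); the learner choice (classif.rpart) is just an illustrative assumption:
# resample() accepts an already instantiated resampling
learner = lrn("classif.rpart")
rr = resample(task, learner, custom_cv)
rr$aggregate(msr("classif.ce"))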

Related

Clustered resampling for inner layer of Caret recursive feature elimination

I have data where IDs are contained within clusters.
I would like to perform recursive feature elimination using Caret's rfe function, which follows the RFE algorithm described in the caret documentation (the line numbers 2.1 and 2.9 below refer to that algorithm's pseudocode).
Clustered resampling for the outer layer (line 2.1) is straightforward, using the index parameter.
However, within each outer resample, I would like to tune the tuning parameters using cluster-based cross-validation (inner resampling) (line 2.9). Model tuning in the inner layer is possible by specifying a tuneGrid in rfe and having an appropriate trControl. It is this trControl that I would like to change to allow clustered resampling.
The outer resampling is specified in the rfeControl parameter of rfe.
The inner resampling is specified by trControl of rfe which is passed to train.
The trouble I am having is that I can't seem to specify any inner indices, because after the outer resampling, those indices are no longer valid or no longer present in the outer-resampled data.
I am looking for a way to tell train to take an outer resample (which will be missing a cluster against which to validate), and to tune the model using inner resampling based on folds of the remaining clusters.
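As an aside, the outer, leave-one-cluster-out indices can also be built with caret's groupKFold() helper instead of the manual loop used in the MWE below; a minimal sketch, assuming the dat data frame with its cluster column as constructed in the MWE:
unit_indices <- groupKFold(dat$cluster, k = length(unique(dat$cluster)))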
The MWE is as minimal as possible:
library(caret)
library(tidyverse)
library(parallel)
library(doParallel)
range01 <- function(x){(x-min(x))/(max(x)-min(x))}
### Create some random data, 10 features, with some influence over a binomial outcome
set.seed(42)
id <- 1:1000
cluster <- rep(1:10, each = 100)
dat <- data.frame(id, cluster, replicate(10,rnorm(n = 1000, mean = runif(1, 0,100)+cluster, sd = runif(1, 0,20))))
dat <- dat %>% mutate(temp = rowSums(across(X1:X10)), prob = range01(temp), outcome = rbinom(n = nrow(dat), size = 1, prob = prob))
dat$outcome <- as.factor(dat$outcome)
levels(dat$outcome) <- c("control", "case")
dat$outcome <- factor(dat$outcome, levels=rev(levels(dat$outcome)))
### Manual outer folds-based cluster ###
for(i in 1:10) {
assign(paste0("index", i), which(dat$cluster!=i))
}
unit_indices <- list(index1, index2, index3, index4, index5, index6, index7, index8, index9, index10)
### Inner resampling method (THIS IS WHAT I'D LIKE TO CHANGE) ###
cv5 <- trainControl(classProbs = TRUE, method = "cv", number = 5, allowParallel = F) ## Is there a way to have inner cluster-based resampling WITHIN the outer cluster-based resampling?
caret_rfe_functions <- list(
  summary = twoClassSummary,
  fit = function(x, y, first, last, ...) {
    train(x, y, ...)
  },
  pred = caretFuncs$pred,
  rank = function(object, x, y) {
    vimp <- varImp(object)$importance
    vimp <- vimp[order(vimp$Overall, decreasing = TRUE), , drop = FALSE]
    vimp$var <- rownames(vimp)
    vimp
  },
  selectSize = function(x, metric = "ROC", tol = 1, maximize = TRUE) {
    if (!maximize) {
      best <- min(x[, metric])
      perf <- (x[, metric] - best) / best * 100
      flag <- perf <= tol
    } else {
      best <- max(x[, metric])
      perf <- (best - x[, metric]) / best * 100
      flag <- perf <= tol
    }
    min(x[flag, "Variables"])
  },
  selectVar = caretFuncs$selectVar)
caret_rfe_ctrl <- rfeControl(
functions = caret_rfe_functions,
saveDetails = TRUE,
index = unit_indices,
indexOut = NULL,
returnResamp = "all",
allowParallel = T, ### change this if you don't want to / can't go parallel
verbose = TRUE
)
#### Feature selection ####
set.seed(42)
cl <- makePSOCKcluster(10) ### for parallel processing if available
registerDoParallel(cl)
rfe_profile_nnet <- rfe(
form = outcome ~
X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10,
data = dat,
sizes = seq(2,10,1),
rfeControl = caret_rfe_ctrl,
## pass options to train()
method = "nnet",
preProc = c("center", "scale"),
metric = "ROC",
tuneGrid = expand.grid(size = c(1:5), decay = 5),
trControl = cv5) ### I would like to change this to allow inner cluster-based resampling
stopCluster(cl)
rfe_profile_nnet
plot(rfe_profile_nnet)
Presumably the inner cluster-based resampling would be achieved by specifying a new trainControl containing some dynamic inner index based on the outer resample that is selected at the time:
inner_cluster_tune <- trainControl(classProbs = TRUE,
index = {insert magic here}, ### This is the important bit
returnResamp = "all",
summaryFunction = twoClassSummary,
allowParallel = F) ### especially if the outer resample is parallelised
If you try with the original cluster indices, e.g.
inner_cluster_tune <- trainControl(classProbs = TRUE,
index = unit_indices,
returnResamp = "all",
summaryFunction = twoClassSummary,
allowParallel = F)
There are various warnings about missing data in the resamples, and things like 24: In [<-.data.frame(*tmp*, , object$method$center, value = structure(list( ... : provided 81 variables to replace 9 variables.
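For what it's worth, here is a sketch of what valid inner indices would have to look like for a single outer fold: the positions have to be relative to the outer training subset, groupKFold() is caret's group-based fold helper, and the object names follow the MWE above (the fold-specific names are illustrative only). This still leaves the core problem that rfe() accepts just one static trControl:
## outer fold 1: rows used for training (all clusters except cluster 1)
outer_rows <- unit_indices[[1]]
## inner, cluster-based folds built on that subset; the indices are positions
## within dat[outer_rows, ], not row numbers of the full data
inner_index_fold1 <- groupKFold(dat$cluster[outer_rows],
                                k = length(unique(dat$cluster[outer_rows])))
inner_ctrl_fold1 <- trainControl(classProbs = TRUE,
                                 index = inner_index_fold1,
                                 returnResamp = "all",
                                 summaryFunction = twoClassSummary,
                                 allowParallel = FALSE)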
All help greatly appreciated.
As a postscript, you can see which parameters were used within your rfe like so:
> rfe_profile_nnet$fit
Neural Network
1000 samples
8 predictor
2 classes: 'case', 'control'
Pre-processing: centered (8), scaled (8)
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 800, 800, 800, 800, 800
Resampling results across tuning parameters:
size Accuracy Kappa
1 0.616 0.1605071
2 0.616 0.1686937
3 0.620 0.1820503
4 0.618 0.1788491
5 0.618 0.1788063
Tuning parameter 'decay' was held constant at a value of 5
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were size = 3 and decay = 5.
But does anyone know if this refers to one, or all, of the outer resamples? Presumably the same tuning parameters won't necessarily be chosen across all outer resamples.

Crosstalk links broken by second Leaflet addCircleMarkers call

I am working with multivariate data, linking Leaflet and d3scatter plots via crosstalk. It works well for one variable. If I try to include a second variable in Leaflet through a second addCircleMarkers call plus addLayersControl, the SharedData links break: the filtering no longer works and neither does the brushing. Thanks in advance.
A MWE is attached:
library("crosstalk")
library("d3scatter")
library("leaflet")
Long <- c(117.4, 117.5, 117.6)
Lat <- c(-33.7, -33.8, -33.9)
var1 <- c(21, 22, 23)
var2 <- c(31, 32, 33)
species <- c(8, 9, 10)
df1 <- data.frame(Long, Lat, var1, var2, species)
sdf1 <- SharedData$new(df1)
col_1 <- c("yellow", "black", "orange")
col_2 <- c("red", "green", "blue")
l <- leaflet(sdf1) %>%
  setView(117.5, -33.8, 10) %>%
  addCircleMarkers(radius = 1, color = col_1, group = "1") %>%
  # addCircleMarkers(radius = 1, color = col_2, group = "2") %>%
  # PROBLEM - adding the second "addCircleMarkers" enables the overlayGroups but
  # it breaks the link between the plots and breaks the filter
  addLayersControl(overlayGroups = c("1", "2"))
m <- list(l, filter_checkbox("unique_id_for_species", "Animal Species", sdf1, ~species))
n <- list(d3scatter(sdf1, ~var2, ~var1, color = ~species, x_lim = c(30,40), y_lim = c(20,25), width="70%", height=200),
d3scatter(sdf1, ~var1, ~var2, color = ~species, y_lim = c(30,40), x_lim = c(20,25), width="70%", height=200))
bscols(m, n)

Get index of a value in a table which already has a key (pick a random key/value pair)

I'm looking to pick a random key/value pair from a table, but using math.random() doesn't work.
-- initialises randomization
math.randomseed(os.time() + 30) -- sets a random seed based on the time
math.random(); math.random(); math.random(); -- discards the first few values (warm-up)
local phrases = {
  ["a"] = 3,
  ["b"] = 7,
  ["d"] = 4,
  ["f"] = 8,
  ["p"] = 5
}
local phrase = phrases[math.random(1,5)]
phrase always ends up nil. Is there a method of getting an index to use with math.random(), or an alternate method I could use?
math.random(1, 5) returns numbers from 1 to 5. Your keys are strings.
You could create a table (as in: array - integer keys), get a random key from there and then access the phrases:
local phrases = {
  ["a"] = 3,
  ["b"] = 7,
  ["d"] = 4,
  ["f"] = 8,
  ["p"] = 5
}
local keys = {}
for k in pairs(phrases) do
  table.insert(keys, k)
end
local random_key = keys[math.random(1,5)] -- One of: "a", "b", "d", "f", "p"
local phrase = phrases[random_key] -- One of: 3, 7, 4, 8, 5

Tuning parameters in caret error despite assigning grids and as.factor

Any help appreciated. Been at this for weeks. :(
install.packages("klaR", dependencies=TRUE)
library(klaR)
install.packages("caret", dependencies=TRUE)
library(caret)
install.packages("e1071", dependencies=TRUE)
library(e1071)
install.packages("gmodels", dependencies=TRUE)
library(gmodels)
install.packages("gbm", dependencies=TRUE)
library(gbm)
install.packages("foreach", dependencies=TRUE)
library(foreach)
Load the grading data:
grading <- read.csv("~/PA_DataFinal/GradingData160315.csv")
Create a stratified sample (1%):
dfstrat <- stratified(grading, "FailPass", .01)
save(dfstrat, file = "c:/Users/gillisn/Documents/PA_DataFinal/RResults/GradingRResults/iteration 1/dfstrat.rda")
Split the data into train and test (75:25). FailPass is the response variable:
set.seed(1)
inTrainingSet <- createDataPartition(dfstrat$FailPass, p = .75, list = FALSE)
trainSet <- dfstrat[inTrainingSet,]
testSet <- dfstrat[-inTrainingSet, ]
Set the predictors and labels. There are 48 columns and it's the last one I want to train on, so take columns 1-47 as the predictors. x, y is the training data:
x <- trainSet[,-48]
y <- as.factor(trainSet$FailPass)
i, j is the test data:
i <- testSet[,-48,]
j <- as.factor(testSet$FailPass)
Set the training control parameters: bootstrapping, 25 repetitions:
bootControl <- trainControl(number = 25)
The tuning grids (gbmGrid for the boosted trees, nbGrid for naive Bayes, svmGrid for the SVM):
gbmGrid <- expand.grid(.interaction.depth = (1:5) * 2, .n.trees = (1:10)*25, .shrinkage = .1)
nbGrid <- expand.grid(.fL=0, .usekernel=FALSE)
svmGrid >- expandGrid(.sigma=, .c=)
set.seed(2)
Train the models. Naive Bayes:
nbFit <- train(x,y,method='nb',tuneGrid="nbGrid")
SVM:
svmFit <- train(x, y,method = "svmRadial", tuneLength = 10,trControl = bootControl, scaled = FALSE)
GBM:
gbmFit <- train(x, y,method = "gbm", trControl = bootControl, verbose = FALSE, bag.fraction = 0.5, tuneGrid = gbmGrid)
Predict the models on the training data:
models <- list(svm = svmFit, nb = nbFit, gbm = gbmFit)
predict(models)

How to determine if one array contains all elements of another array

Given:
a1 = [5, 1, 6, 14, 2, 8]
I would like to determine if it contains all elements of:
a2 = [2, 6, 15]
In this case the result is false.
Are there any built-in Ruby/Rails methods to identify such array inclusion?
One way to implement this is:
a2.index{ |x| !a1.include?(x) }.nil?
Is there a better, more readable, way?
a = [5, 1, 6, 14, 2, 8]
b = [2, 6, 15]
a - b
# => [5, 1, 14, 8]
b - a
# => [15]
(b - a).empty?
# => false
b - a lists the elements of b that are missing from a, so (b - a).empty? is true exactly when a contains every element of b (ignoring duplicates).
Perhaps this is easier to read:
a2.all? { |e| a1.include?(e) }
You can also use array intersection:
(a1 & a2).size == a2.size
Note that size is used here just for speed; you can also do (slower):
(a2 & a1) == a2
But I guess the first is more readable. These 3 are plain Ruby (not Rails).
This can be achieved by doing
(a2 & a1) == a2
This creates the intersection of both arrays, returning all elements from a2 which are also in a1. If the result is the same as a2, you can be sure you have all elements included in a1.
This approach only works if all elements in a2 are different from each other in the first place. If there are doubles, this approach fails. The one from Tempos still works then, so I wholeheartedly recommend his approach (also it's probably faster).
If there are no duplicate elements, or you don't care about them, then you can use the Set class:
require 'set'
a1 = Set.new [5, 1, 6, 14, 2, 8]
a2 = Set.new [2, 6, 15]
a2.subset?(a1)
=> false
Behind the scenes this uses
all? { |o| set.include?(o) }
You can monkey-patch the Array class:
class Array
  def contains_all?(ary)
    ary.uniq.all? { |x| count(x) >= ary.count(x) }
  end
end
Test:
irb(main):131:0> %w[a b c c].contains_all? %w[a b c]
=> true
irb(main):132:0> %w[a b c c].contains_all? %w[a b c c]
=> true
irb(main):133:0> %w[a b c c].contains_all? %w[a b c c c]
=> false
irb(main):134:0> %w[a b c c].contains_all? %w[a]
=> true
irb(main):135:0> %w[a b c c].contains_all? %w[x]
=> false
irb(main):136:0> %w[a b c c].contains_all? %w[]
=> true
irb(main):137:0> %w[a b c d].contains_all? %w[d c h]
=> false
irb(main):138:0> %w[a b c d].contains_all? %w[d b c]
=> true
Of course the method can be written as a stand-alone method, e.g.
def contains_all?(a, b)
  b.uniq.all? { |x| a.count(x) >= b.count(x) }
end
and you can invoke it like
contains_all?(%w[a b c c], %w[c c c])
Indeed, after profiling, the following version is much faster, and the code is shorter.
def contains_all?(a, b)
  b.all? { |x| a.count(x) >= b.count(x) }
end
Most answers based on (a1 - a2) or (a1 & a2) would not work if there are duplicate elements in either array. I arrived here looking for a way to see if all letters of a word (split into an array) were part of a set of letters (for Scrabble, for example). None of these answers worked, but this one does:
def contains_all?(word, letters)
  # every letter of the word must be available at least as many times in letters
  word.chars.all? { |letter| word.count(letter) <= letters.count(letter) }
end
Depending on how big your arrays are you might consider an efficient algorithm O(n log n)
def equal_a(a1, a2)
  return false if a1.length != a2.length
  a1sorted = a1.sort
  a2sorted = a2.sort
  0.upto(a1.length - 1) do |i|
    return false if a1sorted[i] != a2sorted[i]
  end
  true
end
Sorting costs O(n log n) and checking each pair costs O(n) thus this algorithm is O(n log n). The other algorithms cannot be faster (asymptotically) using unsorted arrays.
I was directed to this post when trying to find whether one array ["a", "b", "c"] contained another array ["a", "b"], where in my case identical ordering was an additional requirement to the question.
Here is my solution (I believe it's O(n) complexity), to anyone who has that extra requirement:
def array_includes_array(array_to_inspect, array_to_search_for)
  inspect_length = array_to_inspect.length
  search_length = array_to_search_for.length
  return true if search_length == 0
  return false if search_length > inspect_length
  # walk the array, growing a buffer while it matches the start of the
  # target in order; reset the buffer on the first mismatch
  buffer = []
  for i in 0...inspect_length
    buffer.push(array_to_inspect[i])
    buffer_last_index = buffer.length - 1
    if buffer[buffer_last_index] != array_to_search_for[buffer_last_index]
      buffer.clear
      next
    end
    return true if buffer.length == search_length
  end
  false
end
This produces the test results:
puts "1: #{array_includes_array(["a", "b", "c"], ["b", "c"])}" # true
puts "2: #{array_includes_array(["a", "b", "c"], ["a", "b"])}" # true
puts "3: #{array_includes_array(["a", "b", "c"], ["b", "b"])}" # false
puts "4: #{array_includes_array(["a", "b", "c"], ["c", "b", "a"])}" # false
puts "5: #{array_includes_array(["a", "b", "c"], [])}" # true
puts "6: #{array_includes_array([], ["a"])}" # false
puts "7: #{array_includes_array([], [])}" # true
