PCA within cross-validation, but only on a subset of variables

This question is very similar to preprocess within cross-validation in caret; however, in a project I'm working on I would like to do PCA on only three of my 19 predictors. Here is the example from preprocess within cross-validation in caret, and I'll use the PimaIndiansDiabetes data for ease (this is not my project data, but the concept should be the same). I would then like to run preProcess only on a subset of variables, i.e. PimaIndiansDiabetes[, c(4, 5, 6)]. Is there a way to do this?
library(caret)
library(mlbench)

data(PimaIndiansDiabetes)

control <- trainControl(method = "cv",
                        number = 5)

p <- preProcess(PimaIndiansDiabetes[, c(4, 5, 6)],  # only do these columns!
                method = c("center", "scale", "pca"))
p

grid <- expand.grid(mtry = c(1, 2, 3))

model <- train(diabetes ~ ., data = PimaIndiansDiabetes, method = "rf",
               preProcess = p,
               trControl = control,
               tuneGrid = grid)
But I get this error:
Error: pre-processing methods are limited to: BoxCox, YeoJohnson, expoTrans, invHyperbolicSine, center, scale, range, knnImpute, bagImpute, medianImpute, pca, ica, spatialSign, ignore, keep, remove, zv, nzv, conditionalX, corr
The reason I'm trying to do this is so I can reduce the three variables to a single PC1 and use that for prediction. In my project all three variables are correlated above 90%, but I would still like to incorporate them since other studies have used them as well. I'm trying to avoid data leakage. Thanks.

As far as I know, this is not possible with caret.
This might be possible using recipes; a rough sketch of that route is shown just below. However, I do not use recipes myself. I do use mlr3, so after the recipes sketch I will show how to do it with that package.
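A rough, untested sketch of the recipes route (my assumption of how it could look; it reuses control and grid from the question, and triceps, insulin and mass are columns 4 to 6 of PimaIndiansDiabetes). caret can take a recipe as the first argument to train(), in which case the steps are estimated within each resample:

library(recipes)

rec <- recipe(diabetes ~ ., data = PimaIndiansDiabetes) %>%
  step_center(triceps, insulin, mass) %>%
  step_scale(triceps, insulin, mass) %>%
  step_pca(triceps, insulin, mass, num_comp = 1)  # keep only PC1 of these three columns

# sketch only: pass the recipe instead of a formula; the other columns are left untouched
model_rec <- train(rec, data = PimaIndiansDiabetes,
                   method = "rf",
                   trControl = control,
                   tuneGrid = grid)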
library(mlr3)
library(mlr3pipelines)
library(mlr3learners)
library(paradox)
library(mlr3tuning)
library(mlbench)
create a task from the data:
data("PimaIndiansDiabetes")
pima_tsk <- TaskClassif$new(id = "Pima",
                            backend = PimaIndiansDiabetes,
                            target = "diabetes")
define a pre-processing selector named "slct1":
pos1 <- po("select", id = "slct1")
and define the selector function within it:
pos1$param_set$values$selector <- selector_name(colnames(PimaIndiansDiabetes[, 4:6]))
now define what should happen to the selected features: scaling -> PCA, keeping only the 1st PC (param_vals = list(rank. = 1)):
pos1 %>>%
  po("scale", id = "scale1") %>>%
  po("pca", id = "pca1", param_vals = list(rank. = 1)) -> pr1
now define an inverted selector for the remaining features:
pos2 <- po("select", id = "slct2")
pos2$param_set$values$selector <- selector_invert(pos1$param_set$values$selector)
define the learner:
rf_lrn <- po("learner", lrn("classif.ranger")) #ranger is a faster version of rf
combine them:
gunion(list(pr1, pos2)) %>>%
  po("featureunion") %>>%
  rf_lrn -> graph
check if it looks ok:
graph$plot(html = TRUE)
convert graph to a learner:
glrn <- GraphLearner$new(graph)
define parameters you want tuned:
ps <- ParamSet$new(list(
  ParamInt$new("classif.ranger.mtry", lower = 1, upper = 6),
  ParamInt$new("classif.ranger.num.trees", lower = 100, upper = 1000)))
define resampling:
cv10 <- rsmp("cv", folds = 10)
define tuning:
instance <- TuningInstance$new(
  task = pima_tsk,
  learner = glrn,
  resampling = cv10,
  measures = msr("classif.ce"),
  param_set = ps,
  terminator = term("evals", n_evals = 20)
)
set.seed(1)
tuner <- TunerRandomSearch$new()
tuner$tune(instance)
instance$result
For additional details on how to tune the number of PC components to keep, check this answer: R caret: How do I apply separate pca to different dataframes before training?
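For example, the rank of the PCA step can be exposed to the tuner as well. This is an untested sketch; the parameter id pca1.rank. is my assumption based on the pipeop id "pca1" used above:

ps <- ParamSet$new(list(
  ParamInt$new("classif.ranger.mtry", lower = 1, upper = 6),
  ParamInt$new("classif.ranger.num.trees", lower = 100, upper = 1000),
  ParamInt$new("pca1.rank.", lower = 1, upper = 3)  # also tune how many PCs to keep
))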
If you find this interesting, check out the mlr3book.
Also, note that
cor(PimaIndiansDiabetes[, 4:6])
          triceps   insulin      mass
triceps 1.0000000 0.4367826 0.3925732
insulin 0.4367826 1.0000000 0.1978591
mass    0.3925732 0.1978591 1.0000000
does not show the >90% correlations you mention in the question.

Related

Grid of mtry values while training random forests with ranger

I am working with a subset of the 'Ames Housing' dataset, which originally has 17 features. Using the 'recipes' package, I have engineered the original set of features and created dummy variables for the nominal predictors with the following code. That has resulted in 35 features in the 'baked_train' dataset below.
blueprint <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_nzv(Street, Utilities, Pool_Area, Screen_Porch, Misc_Val) %>%
  step_impute_knn(Gr_Liv_Area) %>%
  step_integer(Overall_Qual) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_other(Neighborhood, threshold = 0.01, other = "other") %>%
  step_dummy(all_nominal_predictors(), one_hot = FALSE)

prepare <- prep(blueprint, training = ames_train)

baked_train <- bake(prepare, new_data = ames_train)
baked_test  <- bake(prepare, new_data = ames_test)
Now, I am trying to train random forests with the 'ranger' package using the following code.
cv_specs <- trainControl(method = "repeatedcv", number = 5, repeats = 5)

param_grid_rf <- expand.grid(mtry = seq(1, 35, 1),
                             splitrule = "variance",
                             min.node.size = 2)

rf_cv <- train(blueprint,
               data = ames_train,
               method = "ranger",
               trControl = cv_specs,
               tuneGrid = param_grid_rf,
               metric = "RMSE")
I have set the grid of 'mtry' values based on the number of features in the 'baked_train' data. It is my understanding that 'caret' will apply the blueprint within each resample of 'ames_train', creating a baked version at each CV step.
The text Hands-On Machine Learning with R by Boehmke & Greenwell says in section 3.8.3:
Consequently, the goal is to develop our blueprint, then within each resample iteration we want to apply prep() and bake() to our resample training and validation data. Luckily, the caret package simplifies this process. We only need to specify the blueprint and caret will automatically prepare and bake within each resample.
However, when I run the code above I get an error:
mtry can not be larger than number of variables in data. Ranger will EXIT now.
I get the same error when I specify 'tuneLength = 20' instead of the 'tuneGrid'. However, the code works fine when the grid of 'mtry' values is specified to run from 1 to 17 (the number of features in the original training data 'ames_train').
When I specify a grid of 'mtry' values from 1 to 17, info about the final model after CV is shown below. Notice that it mentions Number of independent variables: 35, which corresponds to the 'baked_train' data, even though specifying a grid from 1 to 35 throws an error.
Type: Regression
Number of trees: 500
Sample size: 618
Number of independent variables: 35
Mtry: 15
Target node size: 2
Variable importance mode: impurity
Splitrule: variance
OOB prediction error (MSE): 995351989
R squared (OOB): 0.8412147
What am I missing here? Specifically, why do I have to specify the number of features in 'ames_train' instead of 'baked_train' when essentially 'caret' is supposed to create a baked version before fitting and evaluating the model for each resample?
Thanks.

Overcoming compatibility issues with using iml from h2o models

I am unable to reproduce the only example I can find of using h2o with iml (https://www.r-bloggers.com/2018/08/iml-and-h2o-machine-learning-model-interpretability-and-feature-explanation/) as detailed here (Error when extracting variable importance with FeatureImp$new and H2O). Can anyone point to a workaround or other examples of using iml with h2o?
Reproducible example:
library(rsample)   # data splitting
library(ggplot2)   # allows extension of visualizations
library(dplyr)     # basic data transformation
library(h2o)       # machine learning modeling
library(iml)       # ML interpretation
library(modeldata) # attrition data
# initialize h2o session
h2o.no_progress()
h2o.init()
# classification data
data("attrition", package = "modeldata")
df <- attrition %>%
  mutate_if(is.ordered, factor, ordered = FALSE) %>%
  mutate(Attrition = recode(Attrition, "Yes" = "1", "No" = "0") %>% factor(levels = c("1", "0")))
# convert to h2o object
df.h2o <- as.h2o(df)
# create train, validation, and test splits
set.seed(123)
splits <- h2o.splitFrame(df.h2o, ratios = c(.7, .15),
                         destination_frames = c("train", "valid", "test"))
names(splits) <- c("train", "valid", "test")
# variable names for response & features
y <- "Attrition"
x <- setdiff(names(df), y)
# elastic net model
glm <- h2o.glm(
  x = x,
  y = y,
  training_frame = splits$train,
  validation_frame = splits$valid,
  family = "binomial",
  seed = 123
)
# 1. create a data frame with just the features
features <- as.data.frame(splits$valid) %>% select(-Attrition)
# 2. Create a vector with the actual responses
response <- as.numeric(as.vector(splits$valid$Attrition))
# 3. Create custom predict function that returns the predicted values as a
#    vector (probability of attrition in our example)
pred <- function(model, newdata) {
  results <- as.data.frame(h2o.predict(model, as.h2o(newdata)))
  return(results[[3L]])
}
# create predictor object to pass to explainer functions
predictor.glm <- Predictor$new(
  model = glm,
  data = features,
  y = response,
  predict.fun = pred,
  class = "classification"
)
imp.glm <- FeatureImp$new(predictor.glm, loss = "mse")
Error obtained:
Error in `[.data.frame`(prediction, , self$class, drop = FALSE): undefined columns
selected
traceback()
1. FeatureImp$new(predictor.glm, loss = "mse")
2. .subset2(public_bind_env, "initialize")(...)
3. private$run.prediction(private$sampler$X)
4. self$predictor$predict(data.frame(dataDesign))
5. prediction[, self$class, drop = FALSE]
6. `[.data.frame`(prediction, , self$class, drop = FALSE)
7. stop("undefined columns selected")
In the iml package documentation, the class argument is described as "the class column to be returned". When you set class = "classification", it looks for a column called "classification", which is not found. At least on GitHub, the iml package appears to have gone through a fair amount of development since that blog post, so I imagine some functionality is no longer backwards compatible.
After reading through the package documentation, I think you might want to try something like:
predictor.glm <- Predictor$new(
  model = glm,
  data = features,
  y = "Attrition",
  predict.function = pred,
  type = "prob"
)
# check ability to predict first
check <- predictor.glm$predict(features)
print(check)
Even better might be to leverage H2O's extensive functionality around machine learning interpretability.
h2o.varimp(glm) will give the user the variable importance for each feature.
h2o.varimp_plot(glm, 10) will render a graphic showing the relative importance of each feature.
h2o.explain(glm, as.h2o(features)) is a wrapper for the explainability interface and will by default provide the confusion matrix (in this case) as well as variable importance and partial dependence plots for each feature.
For certain algorithms (e.g., tree-based methods), h2o.shap_explain_row_plot() and h2o.shap_summary_plot() will provide the SHAP contributions.
The h2o-3 docs might be useful here for exploring further.
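For concreteness, a minimal usage sketch of the calls listed above (reusing the glm model and features data frame from earlier in this example):

h2o.varimp(glm)                     # variable importance table
h2o.varimp_plot(glm, 10)            # plot the 10 most important features
h2o.explain(glm, as.h2o(features))  # wrapper: confusion matrix, importance, partial dependence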

Getting a specific random forest variable importance measure from mlr package's resample function

I am using mlr package's resample() function to subsample a random forest model 4000 times (the code snippet below).
As you can see, to create the random forest models within resample() I'm using the randomForest package.
I want to get random forest model's importance results (mean decrease in accuracy over all classes) for each of the subsample iterations. What I can get right now as the importance measure is the mean decrease in Gini index.
What I can see from the source code of mlr, getFeatureImportanceLearner.classif.randomForest() function (line 69) in makeRLearner.classif.randomForest uses randomForest::importance() function (line 83) to get importance value from the resulting object of randomForest class. But as you can see from the source code (line 73) it uses 2L as the default value. I want it to use 1L (line 75) as the value (mean decrease in accuracy).
How can I pass the value 1L to the resample() function (the "extract = getFeatureImportance" line in the code below) so that getFeatureImportanceLearner.classif.randomForest() gets that value and sets ctrl$type = 1L (line 75)?
rf_task <- makeClassifTask(id = 'task',
                           data = data[, -1], target = 'target_var',
                           positive = 'positive_var')

rf_learner <- makeLearner('classif.randomForest', id = 'random forest',
                          par.vals = list(ntree = 1000, importance = TRUE),
                          predict.type = 'prob')

base_subsample_instance <- makeResampleInstance(rf_boot_desc, rf_task)

rf_subsample_result <- resample(rf_learner, rf_task,
                                base_subsample_instance,
                                extract = getFeatureImportance,
                                measures = list(acc, auc, tpr, tnr,
                                                ppv, npv, f1, brier))
My solution: I downloaded the source code of the mlr package, changed line 73 of the source file to 1L (https://github.com/mlr-org/mlr/blob/v2.15.0/R/RLearner_classif_randomForest.R), installed the package from the command line, and used it. Not an optimal solution, but a solution.
You provide a lot of specifics that do not actually relate to your question, at least as I understood it.
So I wrote a simple MWE that includes the answer.
The idea is that you have to write a short wrapper for getFeatureImportance so that you can pass your own arguments. Fans of purrr can do that with purrr::partial(getFeatureImportance, type = 2) but here I wrote myExtractor manually.
library(mlr)

rf_learner <- makeLearner('classif.randomForest', id = 'random forest',
                          par.vals = list(ntree = 100, importance = TRUE),
                          predict.type = 'prob')

measures = list(acc, auc, tpr, tnr,
                ppv, npv, f1, brier)

myExtractor = function(.model, ...) {
  getFeatureImportance(.model, type = 2, ...)
}

res = resample(rf_learner, sonar.task, cv10,
               measures = measures, extract = myExtractor)

# first feature importance result:
res$extract[[1]]

# all values in a matrix:
sapply(res$extract, function(x) x$res)
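In your case, the same wrapper with type = 1 should return the mean decrease in accuracy instead of the Gini-based measure (a sketch only; it relies on importance = TRUE being set on the learner as above, and the name accExtractor is just illustrative):

# sketch: extractor for mean decrease in accuracy (type = 1)
accExtractor = function(.model, ...) {
  getFeatureImportance(.model, type = 1, ...)
}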
If you want a bootstrapped learner, you might also have a look at makeBaggingWrapper instead of solving this problem through resample().
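A minimal, untested sketch of that alternative (the argument values here are illustrative only):

# sketch: bag the learner itself rather than subsampling via resample()
bagged_lrn <- makeBaggingWrapper(rf_learner, bw.iters = 100, bw.replace = TRUE)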

How tf.gradients work in TensorFlow

Given a linear model like the following, I would like to get the gradient vector with respect to W and b.
# tf Graph Input
X = tf.placeholder("float")
Y = tf.placeholder("float")
# Set model weights
W = tf.Variable(rng.randn(), name="weight")
b = tf.Variable(rng.randn(), name="bias")
# Construct a linear model
pred = tf.add(tf.mul(X, W), b)
# Mean squared error
cost = tf.reduce_sum(tf.pow(pred-Y, 2))/(2*n_samples)
However, if I try something like this, where cost is a function cost(x, y, w, b) and I only want the gradients with respect to w and b:
grads = tf.gradients(cost, tf.all_variables())
my placeholders (X and Y) will also be included.
Even if I do get gradients for [x, y, w, b], how do I know which element of the gradient belongs to which parameter, since it is just a list with no names indicating which parameter each derivative was taken with respect to?
In this question I'm using parts of this code and I build on this question.
Quoting the docs for tf.gradients:
Constructs symbolic partial derivatives of sum of ys w.r.t. x in xs.
So, this should work:
dc_dw, dc_db = tf.gradients(cost, [W, b])
Here, tf.gradients() returns the gradient of cost wrt each tensor in the second argument as a list in the same order.
Read tf.gradients for more information.

Using partialPlot after fitting a Random Forest model in caret

After I fit a randomForest using the train() function, I'm having problems invoking partialPlot() and plotmo(). Here's some reproducible code:
library(AER)
library(caret)

data(Mortgage)

fitControl <- trainControl(method = "repeatedcv",
                           number = 5,
                           repeats = 10,
                           allowParallel = TRUE)

library(doMC)
registerDoMC(cores = 10)

Final.rfModel <- train(form = networth ~ ., data = Mortgage, method = "rf",
                       metric = 'RMSE', trControl = fitControl,
                       tuneLength = 10, importance = TRUE)

#### partial plots fail
partialPlot(Final.rfModel$finalModel, Mortgage, "liquid")

library(plotmo)
plotmo(Final.rfModel$finalModel)
There is some inconsistency between how some functions (including randomForest and train) handle dummy variables. Most functions in R that use the formula method will convert factor predictors to dummy variables because their models require numerical representations of the data. The exceptions to this are tree- and rule-based models (that can split on categorical predictors), naive Bayes, and a few others.
So randomForest will not create dummy variables when you use randomForest(y ~ ., data = dat), but train (and most other functions) will when using a call like train(y ~ ., data = dat).
The error occurs because rate, married and a few other predictors are factors. The dummy variables created by train don't have the same names so partialPlot can't find them.
Using the non-formula method with train will pass the factor predictors to randomForest and everything will work.
TL;DR
Use the non-formula method with train in this case:
Final.rfModel <- train(x = Mortgage[, names(Mortgage) != "networth"],
                       y = Mortgage$networth,
                       method = "rf",
                       metric = 'RMSE',
                       trControl = fitControl,
                       tuneLength = 10,
                       importance = TRUE)
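As a quick check (a sketch reusing the objects above), the partialPlot() call from the question should then work, since the predictor names in the final model now match the columns of Mortgage:

# sketch: with factors passed through unchanged, partialPlot() can locate "liquid"
partialPlot(Final.rfModel$finalModel, pred.data = Mortgage, x.var = "liquid")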
Max
