Overcoming compatibility issues with using iml from h2o models - machine-learning

I am unable to reproduce the only example I can find of using h2o with iml (https://www.r-bloggers.com/2018/08/iml-and-h2o-machine-learning-model-interpretability-and-feature-explanation/) as detailed here (Error when extracting variable importance with FeatureImp$new and H2O). Can anyone point to a workaround or other examples of using iml with h2o?
Reproducible example:
library(rsample) # data splitting
library(ggplot2) # allows extension of visualizations
library(dplyr) # basic data transformation
library(h2o) # machine learning modeling
library(iml) # ML interpretation
library(modeldata) # attrition data
# initialize h2o session
h2o.no_progress()
h2o.init()
# classification data
data("attrition", package = "modeldata")
# attrition is loaded from modeldata above; it is no longer shipped with rsample
df <- attrition %>%
  mutate_if(is.ordered, factor, ordered = FALSE) %>%
  mutate(Attrition = recode(Attrition, "Yes" = "1", "No" = "0") %>% factor(levels = c("1", "0")))
# convert to h2o object
df.h2o <- as.h2o(df)
# create train, validation, and test splits
set.seed(123)
splits <- h2o.splitFrame(df.h2o, ratios = c(.7, .15), destination_frames =
c("train","valid","test"))
names(splits) <- c("train","valid","test")
# variable names for response & features
y <- "Attrition"
x <- setdiff(names(df), y)
# elastic net model
glm <- h2o.glm(
x = x,
y = y,
training_frame = splits$train,
validation_frame = splits$valid,
family = "binomial",
seed = 123
)
# 1. create a data frame with just the features
features <- as.data.frame(splits$valid) %>% select(-Attrition)
# 2. Create a vector with the actual responses
response <- as.numeric(as.vector(splits$valid$Attrition))
# 3. Create a custom predict function that returns the predicted values as a
# vector (probability of attrition in our example)
pred <- function(model, newdata) {
  results <- as.data.frame(h2o.predict(model, as.h2o(newdata)))
  return(results[[3L]])
}
# create predictor object to pass to explainer functions
predictor.glm <- Predictor$new(
model = glm,
data = features,
y = response,
predict.fun = pred,
class = "classification"
)
imp.glm <- FeatureImp$new(predictor.glm, loss = "mse")
Error obtained:
Error in `[.data.frame`(prediction, , self$class, drop = FALSE): undefined columns
selected
traceback()
1. FeatureImp$new(predictor.glm, loss = "mse")
2. .subset2(public_bind_env, "initialize")(...)
3. private$run.prediction(private$sampler$X)
4. self$predictor$predict(data.frame(dataDesign))
5. prediction[, self$class, drop = FALSE]
6. `[.data.frame`(prediction, , self$class, drop = FALSE)
7. stop("undefined columns selected")

In the iml package documentation, it says that the class argument is "The class column to be returned.". When you set class = "classification", it's looking for a column called "classification" which is not found. At least on GitHub, it looks like the iml package has gone through a fair amount of development since that blog post, so I imagine some functionality may not be backwards compatible anymore.
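To see why a label such as "classification" triggers this particular error, here is a minimal base R sketch that does not use iml or h2o at all. The data frame and its column names (p0, p1) are made-up stand-ins for whatever a custom predict function returns; the subsetting call mirrors step 5 of the traceback above.

```r
# Stand-in for the data frame of predictions (column names are assumptions).
prediction <- data.frame(p0 = c(0.8, 0.3), p1 = c(0.2, 0.7))

# Selecting a column that exists works as expected.
ok <- prediction[, "p1", drop = FALSE]

# Selecting by the label "classification" fails because no such column exists.
msg <- tryCatch(
  prediction[, "classification", drop = FALSE],
  error = function(e) conditionMessage(e)
)
print(msg)
```

So the class argument should either be left unset or name a column that the predict function actually returns.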
After reading through the package documentation, I think you might want to try something like:
predictor.glm <- Predictor$new(
model = glm,
data = features,
y = "Attrition",
predict.function = pred,
type = "prob"
)
# check ability to predict first
check <- predictor.glm$predict(features)
print(check)
Even better might be to leverage H2O's extensive functionality around machine learning interpretability.
h2o.varimp(glm) will give the user the variable importance for each feature
h2o.varimp_plot(glm, 10) will render a graphic showing the relative importance of each feature.
h2o.explain(glm, as.h2o(features)) is a wrapper for the explainability interface and will by default provide the confusion matrix (in this case) as well as variable importance, and partial dependency plots for each feature.
For certain algorithms (e.g., tree-based methods), h2o.shap_explain_row_plot() and h2o.shap_summary_plot() will provide the shap contributions.
The h2o-3 docs might be useful for exploring these options further.

Related

What is the input type of GNN?

From reading articles and papers, I understand that GNNs are used for:
Node-level prediction
Link prediction
Graph-level prediction
However, I am confused about a GNN's input type. I have two questions:
What is the input type of a GNN: graphs or numerical data?
If a GNN takes graphs as input, how is the graph generated?
Second edit:
From reading another paper, I found that a GNN takes graphs as input.
Now I have only one question: how is the graph generated from the input?
Reference:
Jie Zhou, Ganqu Cui, Shengding Hu, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, Maosong Sun. "Graph neural networks: A review of methods and applications." AI Open.
The input of a GNN consists of several objects with different dimensions, depending on the type of graph neural network: for example, a node feature (properties) matrix of shape [n_nodes, n_node_features] and an adjacency matrix of shape [n_nodes, n_nodes].
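As a rough illustration of those two shapes, here is a small sketch that builds both matrices from an edge list. It is written in base R only to stay library-free; the graph, the feature values and the variable names are all made up, and the same construction carries over directly to NumPy arrays.

```r
# Toy undirected graph with 4 nodes and 3 edges (all values are made up).
edges <- rbind(c(1, 2), c(2, 3), c(3, 4))
n_nodes <- 4

# Node feature matrix: one row per node, one column per feature.
x <- matrix(c(0.1, 1.0,
              0.4, 0.0,
              0.7, 1.0,
              0.9, 0.0),
            nrow = n_nodes, byrow = TRUE)

# Adjacency matrix: a[i, j] = 1 when nodes i and j share an edge.
a <- matrix(0, nrow = n_nodes, ncol = n_nodes)
a[edges] <- 1
a[edges[, 2:1]] <- 1  # mirror each edge because the graph is undirected

dim(x)  # n_nodes x n_node_features
dim(a)  # n_nodes x n_nodes
```

In other words, the "graph" handed to the network is usually just this pair of numeric arrays, and how the edge list is obtained depends on the application (bonds in a molecule, links in a citation network, and so on).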
Spektral is a nice library with good examples of different types of GNN. The examples of how to load the data are also provided.
Here is an example of a GNN model created using TensorFlow and Spektral:
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.models import Model
from spektral.layers import GINConv, GlobalAvgPool


class GIN0(Model):
    def __init__(self, channels, n_layers):
        super().__init__()
        # first graph convolution, followed by n_layers - 1 further ones
        self.conv1 = GINConv(channels, epsilon=0, mlp_hidden=[channels, channels])
        self.convs = []
        for _ in range(1, n_layers):
            self.convs.append(
                GINConv(channels, epsilon=0, mlp_hidden=[channels, channels])
            )
        self.pool = GlobalAvgPool()
        self.dense1 = Dense(channels, activation="relu")
        self.dropout = Dropout(0.5)
        self.dense2 = Dense(channels, activation="relu")

    def call(self, inputs):
        # x: node features, a: adjacency matrix, i: graph index of each node
        x, a, i = inputs
        x = self.conv1([x, a])
        for conv in self.convs:
            x = conv([x, a])
        x = self.pool([x, i])
        x = self.dense1(x)
        x = self.dropout(x)
        return self.dense2(x)
You can also look at this question for a more complete example of GNN application.

grid of mtry values while training random forests with ranger

I am working with a subset of the 'Ames Housing' dataset that originally has 17 features. Using the 'recipes' package, I have engineered the original set of features and created dummy variables for nominal predictors with the following code. That has resulted in 35 features in the 'baked_train' dataset below.
blueprint <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_nzv(Street, Utilities, Pool_Area, Screen_Porch, Misc_Val) %>%
  step_impute_knn(Gr_Liv_Area) %>%
  step_integer(Overall_Qual) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_other(Neighborhood, threshold = 0.01, other = "other") %>%
  step_dummy(all_nominal_predictors(), one_hot = FALSE)
prepare <- prep(blueprint, data = ames_train)
baked_train <- bake(prepare, new_data = ames_train)
baked_test <- bake(prepare, new_data = ames_test)
Now, I am trying to train random forests with the 'ranger' package using the following code.
cv_specs <- trainControl(method = "repeatedcv", number = 5, repeats = 5)
param_grid_rf <- expand.grid(mtry = seq(1, 35, 1),
splitrule = "variance",
min.node.size = 2)
rf_cv <- train(blueprint,
data = ames_train,
method = "ranger",
trControl = cv_specs,
tuneGrid = param_grid_rf,
metric = "RMSE")
I have set the grid of 'mtry' values based on the number of features in the 'baked_train' data. It is my understanding that 'caret' will apply the blueprint within each resample of 'ames_train' creating a baked version at each CV step.
The text Hands-On Machine Learning with R by Boehmke & Greenwell says in Section 3.8.3:
Consequently, the goal is to develop our blueprint, then within each resample iteration we want to apply prep() and bake() to our resample training and validation data. Luckily, the caret package simplifies this process. We only need to specify the blueprint and caret will automatically prepare and bake within each resample.
However, when I run the code above I get an error,
mtry can not be larger than number of variables in data. Ranger will EXIT now.
I get the same error when I specify 'tuneLength = 20' instead of 'tuneGrid'. However, the code works fine when the grid of 'mtry' values runs from 1 to 17 (the number of features in the original training data 'ames_train').
When I specify a grid of 'mtry' values from 1 to 17, info about the final model after CV is shown below. Notice that it mentions Number of independent variables: 35 which corresponds to the 'baked_train' data, although specifying a grid from 1 to 35 throws an error.
Type: Regression
Number of trees: 500
Sample size: 618
Number of independent variables: 35
Mtry: 15
Target node size: 2
Variable importance mode: impurity
Splitrule: variance
OOB prediction error (MSE): 995351989
R squared (OOB): 0.8412147
What am I missing here? Specifically, why do I have to specify the number of features in 'ames_train' instead of 'baked_train' when essentially 'caret' is supposed to create a baked version before fitting and evaluating the model for each resample?
Thanks.

PCA within cross validation; however, only with a subset of variables

This question is very similar to "preprocess within cross-validation in caret"; however, in the project I'm working on I would like to apply PCA to only three of my 19 predictors. Below is the example from that question, using the PimaIndiansDiabetes data for ease (this is not my project data, but the concept should be the same). I would like preProcess to act only on a subset of variables, i.e. PimaIndiansDiabetes[, c(4,5,6)]. Is there a way to do this?
library(caret)
library(mlbench)
data(PimaIndiansDiabetes)
control <- trainControl(method="cv",
number=5)
p <- preProcess(PimaIndiansDiabetes[, c(4,5,6)], #only do these columns!
method = c("center", "scale", "pca"))
p
grid=expand.grid(mtry=c(1,2,3))
model <- train(diabetes~., data=PimaIndiansDiabetes, method="rf",
preProcess= p,
trControl=control,
tuneGrid=grid)
But I get this error:
Error: pre-processing methods are limited to: BoxCox, YeoJohnson, expoTrans, invHyperbolicSine, center, scale, range, knnImpute, bagImpute, medianImpute, pca, ica, spatialSign, ignore, keep, remove, zv, nzv, conditionalX, corr
The reason I'm trying to do this is so that I can reduce three variables to a single principal component (PC1) and use it for prediction. In my project, all three variables are correlated above 90%, but I would like to include them because other studies have used them as well. I am also trying to avoid data leakage. Thanks.
As far as I know this is not possible with caret.
This might be possible using recipes. However I do not use recipes but I do use mlr3 so I will show how to do it with this package:
library(mlr3)
library(mlr3pipelines)
library(mlr3learners)
library(paradox)
library(mlr3tuning)
library(mlbench)
create a task from the data:
data("PimaIndiansDiabetes")
pima_tsk <- TaskClassif$new(id = "Pima",
backend = PimaIndiansDiabetes,
target = "diabetes")
define a pre process selector named "slct1":
pos1 <- po("select", id = "slct1")
and define the selector function within it:
pos1$param_set$values$selector <- selector_name(colnames(PimaIndiansDiabetes[, 4:6]))
now define what should happen to the selected features: scaling -> pca with 1st PC selected (param_vals = list(rank. = 1))
pos1 %>>%
po("scale", id = "scale1") %>>%
po("pca", id = "pca1", param_vals = list(rank. = 1)) -> pr1
now define an invert selector:
pos2 <- po("select", id = "slct2")
pos2$param_set$values$selector <- selector_invert(pos1$param_set$values$selector)
define the learner:
rf_lrn <- po("learner", lrn("classif.ranger")) #ranger is a faster version of rf
combine them:
gunion(list(pr1, pos2)) %>>%
po("featureunion") %>>%
rf_lrn -> graph
check if it looks ok:
graph$plot(html = TRUE)
convert graph to a learner:
glrn <- GraphLearner$new(graph)
define parameters you want tuned:
ps <- ParamSet$new(list(
ParamInt$new("classif.ranger.mtry", lower = 1, upper = 6),
ParamInt$new("classif.ranger.num.trees", lower = 100, upper = 1000)))
define resampling:
cv10 <- rsmp("cv", folds = 10)
define tuning:
instance <- TuningInstance$new(
task = pima_tsk,
learner = glrn,
resampling = cv10,
measures = msr("classif.ce"),
param_set = ps,
terminator = term("evals", n_evals = 20)
)
set.seed(1)
tuner <- TunerRandomSearch$new()
tuner$tune(instance)
instance$result
For additional details on how to tune the number of PC components to keep check this answer: R caret: How do I apply separate pca to different dataframes before training?
If you find this interesting check out the mlr3book
Also
cor(PimaIndiansDiabetes[, 4:6])
          triceps   insulin      mass
triceps 1.0000000 0.4367826 0.3925732
insulin 0.4367826 1.0000000 0.1978591
mass    0.3925732 0.1978591 1.0000000
does not produce what you mention in the question.
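For completeness, the recipes route mentioned at the start of this answer might look roughly like the sketch below. It is untested against any particular package version and assumes that train() accepts a recipe as its first argument and re-estimates it inside each resample, which is how caret's recipe interface is usually described. The object names rec, baked_names and model are just illustrative.

```r
library(caret)
library(recipes)
library(mlbench)

data(PimaIndiansDiabetes)

# Columns 4:6 of the data are triceps, insulin and mass.
rec <- recipe(diabetes ~ ., data = PimaIndiansDiabetes) %>%
  step_normalize(triceps, insulin, mass) %>%       # center and scale only these three
  step_pca(triceps, insulin, mass, num_comp = 1)   # replace them with one component

# Inspect the processed column names once, outside of resampling.
baked_names <- names(bake(prep(rec), new_data = NULL))
print(baked_names)

set.seed(1)
model <- train(rec,
               data = PimaIndiansDiabetes,
               method = "rf",
               trControl = trainControl(method = "cv", number = 5),
               tuneGrid = expand.grid(mtry = c(1, 2, 3)))
```

The other five predictors pass through untouched, so the forest sees six predictors in total: the five originals plus the single component.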

Getting a specific random forest variable importance measure from mlr package's resample function

I am using mlr package's resample() function to subsample a random forest model 4000 times (the code snippet below).
As you can see, to create random forest models within resample() I'm using randomForest package.
I want to get random forest model's importance results (mean decrease in accuracy over all classes) for each of the subsample iterations. What I can get right now as the importance measure is the mean decrease in Gini index.
From the mlr source code, the getFeatureImportanceLearner.classif.randomForest() function (line 69) in makeRLearner.classif.randomForest uses randomForest::importance() (line 83) to get the importance values from the resulting randomForest object. But as the source code shows (line 73), it uses 2L as the default type. I want it to use 1L (line 75), i.e. mean decrease in accuracy.
How can I pass the value 1L through resample() (the "extract = getFeatureImportance" line in the code below) so that getFeatureImportanceLearner.classif.randomForest() receives it instead of falling back to the default ctrl$type = 2L (line 73)?
rf_task <- makeClassifTask(id = 'task',
data = data[, -1], target = 'target_var',
positive = 'positive_var')
rf_learner <- makeLearner('classif.randomForest', id = 'random forest',
par.vals = list(ntree = 1000, importance = TRUE),
predict.type = 'prob')
base_subsample_instance <- makeResampleInstance(rf_boot_desc, rf_task)
rf_subsample_result <- resample(rf_learner, rf_task,
base_subsample_instance,
extract = getFeatureImportance,
measures = list(acc, auc, tpr, tnr,
ppv, npv, f1, brier))
My solution: Downloaded source code of the mlr package. Changed the source file line 73 to 1L (https://github.com/mlr-org/mlr/blob/v2.15.0/R/RLearner_classif_randomForest.R). Installed the package from command line and used it. Not an optimal solution but a solution.
You provide a lot of specifics that do not actually relate to your question, at least how I understood it.
So I wrote a simple MWE that includes the answer.
The idea is that you have to write a short wrapper for getFeatureImportance so that you can pass your own arguments. Fans of purrr can do that with purrr::partial(getFeatureImportance, type = 2) but here I wrote myExtractor manually.
library(mlr)
rf_learner <- makeLearner('classif.randomForest', id = 'random forest',
par.vals = list(ntree = 100, importance = TRUE),
predict.type = 'prob')
measures = list(acc, auc, tpr, tnr,
ppv, npv, f1, brier)
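# wrapper that forwards a fixed `type` to getFeatureImportance;
# in randomForest::importance(), type = 1 is mean decrease in accuracy
# and type = 2 is mean decrease in node impurity (Gini), so set type = 1
# to get the measure asked for in the question
myExtractor = function(.model, ...) {
  getFeatureImportance(.model, type = 2, ...)
}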
res = resample(rf_learner, sonar.task, cv10,
measures = measures, extract = myExtractor)
# first feature importance result:
res$extract[[1]]
# all values in a matrix:
sapply(res$extract, function(x) x$res)
If you want a bootstrapped learner, you should perhaps also have a look at makeBaggingWrapper instead of solving this problem through resample.

Using partialPlot after fitting a Random Forest model in caret

After I fit a randomForest using the train() function, I'm having problems invoking partialPlot() and plotmo(). Here's some reproducible code:
library(AER)
library(caret)
data(Mortgage)
fitControl <- trainControl(method = "repeatedcv"
,number = 5
,repeats = 10
,allowParallel = TRUE)
library(doMC)
registerDoMC(cores=10)
Final.rfModel <- train(form=networth ~ ., data=Mortgage, method = "rf", metric='RMSE', trControl = fitControl, tuneLength=10, importance = TRUE)
#### partial plots fail
partialPlot(Final.rfModel$finalModel, Mortgage, "liquid")
library(plotmo)
plotmo(Final.rfModel$finalModel)
There is some inconsistency between how some functions (including randomForest and train) handle dummy variables. Most functions in R that use the formula method will convert factor predictors to dummy variables because their models require numerical representations of the data. The exceptions to this are tree- and rule-based models (that can split on categorical predictors), naive Bayes, and a few others.
So randomForest will not create dummy variables when you use randomForest(y ~ ., data = dat) but train (and most others) will using a call like train(y ~ ., data = dat).
The error occurs because rate, married and a few other predictors are factors. The dummy variables created by train don't have the same names so partialPlot can't find them.
Using the non-formula method with train will pass the factor predictors to randomForest and everything will work.
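The renaming can be seen without fitting any model. The sketch below uses base R's model.matrix(), which expands factors in roughly the way a formula interface does. The toy data frame is made up and only borrows two column names from the question.

```r
# Toy data: one factor predictor and one numeric predictor (values are made up).
dat <- data.frame(
  married = factor(c("yes", "no", "yes", "no")),
  liquid  = c(1.2, 3.4, 0.5, 2.2),
  y       = c(10, 20, 15, 30)
)

# Expanding the formula replaces the factor "married" with a dummy column
# named after the factor plus its non-reference level.
expanded <- colnames(model.matrix(y ~ ., data = dat))
print(expanded)

# The original factor name is no longer among the predictor names.
"married" %in% expanded
```

A model fitted on the expanded columns therefore expects "marriedyes", whereas the original data frame passed to partialPlot only contains "married".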
TL;DR
Use the non-formula method with train in this case:
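# pass predictors and outcome separately so factors are not converted to dummies
Final.rfModel <- train(x = Mortgage[, names(Mortgage) != "networth"],
                       y = Mortgage$networth,
                       method = "rf",
                       metric = 'RMSE',
                       trControl = fitControl,
                       tuneLength = 10,
                       importance = TRUE)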
Max
