I am unable to reproduce the only example I can find of using h2o with iml (https://www.r-bloggers.com/2018/08/iml-and-h2o-machine-learning-model-interpretability-and-feature-explanation/) as detailed here (Error when extracting variable importance with FeatureImp$new and H2O). Can anyone point to a workaround or other examples of using iml with h2o?
Reproducible example:
library(rsample) # data splitting
library(ggplot2) # allows extension of visualizations
library(dplyr) # basic data transformation
library(h2o) # machine learning modeling
library(iml) # ML interpretation
library(modeldata) # attrition data
# initialize h2o session
h2o.no_progress()
h2o.init()
# classification data
data("attrition", package = "modeldata")
df <- attrition %>%
  mutate_if(is.ordered, factor, ordered = FALSE) %>%
  mutate(Attrition = recode(Attrition, "Yes" = "1", "No" = "0") %>% factor(levels = c("1", "0")))
# convert to h2o object
df.h2o <- as.h2o(df)
# create train, validation, and test splits
set.seed(123)
splits <- h2o.splitFrame(df.h2o, ratios = c(.7, .15),
                         destination_frames = c("train", "valid", "test"))
names(splits) <- c("train", "valid", "test")
# variable names for response & features
y <- "Attrition"
x <- setdiff(names(df), y)
# elastic net model
glm <- h2o.glm(
  x = x,
  y = y,
  training_frame = splits$train,
  validation_frame = splits$valid,
  family = "binomial",
  seed = 123
)
# 1. create a data frame with just the features
features <- as.data.frame(splits$valid) %>% select(-Attrition)
# 2. Create a vector with the actual responses
response <- as.numeric(as.vector(splits$valid$Attrition))
# 3. Create custom predict function that returns the predicted values as a
#    vector (the probability of attrition in our example)
pred <- function(model, newdata) {
  results <- as.data.frame(h2o.predict(model, as.h2o(newdata)))
  return(results[[3L]])
}
# create predictor object to pass to explainer functions
predictor.glm <- Predictor$new(
  model = glm,
  data = features,
  y = response,
  predict.fun = pred,
  class = "classification"
)
imp.glm <- FeatureImp$new(predictor.glm, loss = "mse")
Error obtained:
Error in `[.data.frame`(prediction, , self$class, drop = FALSE): undefined columns selected
traceback()
1. FeatureImp$new(predictor.glm, loss = "mse")
2. .subset2(public_bind_env, "initialize")(...)
3. private$run.prediction(private$sampler$X)
4. self$predictor$predict(data.frame(dataDesign))
5. prediction[, self$class, drop = FALSE]
6. `[.data.frame`(prediction, , self$class, drop = FALSE)
7. stop("undefined columns selected")
In the iml package documentation, the class argument is described as "The class column to be returned." When you set class = "classification", iml looks for a column called "classification", which is not found. At least on GitHub, it looks like the iml package has gone through a fair amount of development since that blog post, so I imagine some functionality may no longer be backwards compatible.
After reading through the package documentation, I think you might want to try something like:
predictor.glm <- Predictor$new(
  model = glm,
  data = features,
  y = response,            # the response vector from step 2 (features no longer contains Attrition)
  predict.function = pred,
  type = "prob"
)
# check ability to predict first
check <- predictor.glm$predict(features)
print(check)
Even better might be to leverage H2O's extensive built-in functionality around machine learning interpretability:
h2o.varimp(glm) gives the variable importance for each feature.
h2o.varimp_plot(glm, 10) renders a graphic showing the relative importance of each feature.
h2o.explain(glm, as.h2o(features)) is a wrapper for the explainability interface and will by default provide the confusion matrix (in this case) as well as variable importance and partial dependence plots for each feature.
For certain algorithms (e.g., tree-based methods), h2o.shap_explain_row_plot() and h2o.shap_summary_plot() will provide the SHAP contributions.
The h2o-3 docs might be useful for exploring these further.
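Putting those together, a quick sketch using the glm model and features data frame defined in the question:
# built-in H2O interpretability on the fitted model
h2o.varimp(glm)                    # variable importance table
h2o.varimp_plot(glm, 10)           # plot the 10 most important features
h2o.explain(glm, as.h2o(features)) # default explainability report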
I am exploring model variable selection within imputed data.
One technique is to stack imputations in long format (where n observations in M imputed datasets creates a dataset n x M long), and use weighted regression to reduce the contribution of each observation proportionally to the number of imputations. If we treated the stacked dataset as one large dataset, the standard errors would be too small.
I am trying to use the weights argument in svyglm to account for the stacked data, resulting in SEs that you would expect with n observations, rather than n x M observations.
To illustrate:
library(mice)
### create data
set.seed(42)
n <- 50
id <- 1:n
var1 <- rbinom(n,1,0.4)
var2 <- runif(n,30,80)
var3 <- rnorm(n, mean = 12, sd = 5)
var4 <- rnorm(n, mean = 100, sd = 20)
prob <- (((var1*var2)+var3)-min((var1*var2)+var3)) / (max((var1*var2)+var3)-min((var1*var2)+var3))
outcome <- rbinom(n, 1, prob = prob)
data <- data.frame(id, var1, var2, var3, var4, outcome)
### Add missingness
data_miss <- ampute(data)
patt <- data_miss$patterns
patt <- patt[2:5,]
data_miss <- ampute(data, patterns = patt)
data_miss <- data_miss$amp
## create 5 imputed datasets
nimp <- 5
imp <- mice(data_miss, m = nimp)
## Stack data
data_long <- complete(imp, action = "long")
## Generate model in stacked data (SEs will be too small)
modlong <- glm(outcome ~ var1 + var2 + var3 + var4, family = "binomial", data = data_long)
summary(modlong)
The long data gives overly small SEs, as we've increased the size of our dataset by 5x:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.906417 0.965090 -3.012 0.0026 **
var1 2.221053 0.311167 7.138 9.48e-13 ***
var2 -0.002543 0.010468 -0.243 0.8081
var3 0.076955 0.032265 2.385 0.0171 *
var4 0.006595 0.008031 0.821 0.4115
Add weights
data_long$weight <- 1/nimp
library(survey)
des <- svydesign(ids = ~1, data = data_long, weights = ~weight)
mod_svy <- svyglm(formula = outcome ~ var1 + var2 + var3 + var4, family = quasibinomial(), design = des)
summary(mod_svy)
The weighted regression gives similar SEs to the unweighted model:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.906417 1.036691 -2.804 0.00546 **
var1 2.221053 0.310906 7.144 1.03e-11 ***
var2 -0.002543 0.010547 -0.241 0.80967
var3 0.076955 0.030955 2.486 0.01358 *
var4 0.006595 0.008581 0.769 0.44288
Adding rescale = FALSE (which apparently stops the weights from being rescaled to sum to the sample size) doesn't change anything:
mod_svy <- svyglm(formula = outcome ~ var1 + var2 + var3 + var4, family = quasibinomial(), design = des, rescale = FALSE)
summary(mod_svy)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.906417 1.036688 -2.804 0.00546 **
var1 2.221053 0.310905 7.144 1.03e-11 ***
var2 -0.002543 0.010547 -0.241 0.80967
var3 0.076955 0.030955 2.486 0.01358 *
var4 0.006595 0.008581 0.769 0.44288
I would have expected SEs similar to those obtained when running the model on a single imputed dataset:
## Assess SEs in single imputation
mod_singleimp <- glm(outcome ~ var1 + var2 + var3 + var4, family = "binomial", data = complete(imp,1))
summary(mod_singleimp)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.679589 2.116806 -1.266 0.20556
var1 2.476193 0.761195 3.253 0.00114 **
var2 0.014823 0.025350 0.585 0.55874
var3 0.048940 0.072752 0.673 0.50114
var4 -0.004551 0.017986 -0.253 0.80026
All assistance is greatly appreciated, as are pointers to other ways of achieving the same goal.
Alternative options
The psfmi package allows for stepwise selection in multiply imputed datasets and pooling of models. However, it is computationally intensive and slow with large datasets, particularly if the process needs to be bootstrapped (e.g. during internal validation), hence the requirement for a less intensive stacking approach.
Sorry, no, this isn't going to work.
To handle stacked imputation data with weights you need frequency weights, so that a weight of 1/10 means you have 1/10 of an observation. With svydesign you specify sampling weights, so that a weight of 1/10 means your observation represents 10 observations in the population. These will (and should) give different standard errors. Pretending you have frequency weights when you actually have imputations is a clever hack to avoid having software that understands what it's doing, which is fine but isn't compatible with survey, which understands what it's doing and is doing something different.
Currently, if you want to use svyglm with multiple imputations you need to compute the standard errors separately -- most conveniently with Rubin's rules using mitools::MIcombine, which is set up to work with the survey package (see the help for with.svyimputationList and withPV).
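For illustration, a minimal sketch of the fit-then-pool route with the mice object from the question (assuming imp and nimp from above):
library(mitools)
# fit the same model separately in each completed dataset
fits <- lapply(seq_len(nimp), function(m)
  glm(outcome ~ var1 + var2 + var3 + var4,
      family = "binomial", data = complete(imp, m)))
# pool the estimates and standard errors with Rubin's rules
betas <- MIextract(fits, fun = coef)
vars <- MIextract(fits, fun = vcov)
summary(MIcombine(betas, vars))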
It might be worth putting in a feature request to the mitools or survey developers (with citations to examples) to allow for stacked analysis of imputations, but this isn't just a matter of adjusting the weights.
This question is very similar to preprocess within cross-validation in caret; however, in the project I'm working on I would only like to do PCA on three of my 19 predictors. Here is the example from preprocess within cross-validation in caret, and I'll use the PimaIndiansDiabetes data for ease (this is not my project data, but the concept should be the same). I would like to run preProcess on only a subset of variables, i.e. PimaIndiansDiabetes[, c(4,5,6)]. Is there a way to do this?
library(caret)
library(mlbench)
data(PimaIndiansDiabetes)
control <- trainControl(method = "cv",
                        number = 5)
p <- preProcess(PimaIndiansDiabetes[, c(4,5,6)], # only do these columns!
                method = c("center", "scale", "pca"))
p
grid <- expand.grid(mtry = c(1,2,3))
model <- train(diabetes ~ ., data = PimaIndiansDiabetes, method = "rf",
               preProcess = p,
               trControl = control,
               tuneGrid = grid)
But I get this error:
Error: pre-processing methods are limited to: BoxCox, YeoJohnson, expoTrans, invHyperbolicSine, center, scale, range, knnImpute, bagImpute, medianImpute, pca, ica, spatialSign, ignore, keep, remove, zv, nzv, conditionalX, corr
The reason I'm trying to do this is to reduce the three variables to a single first principal component (PC1) and use that for prediction. In my project all three variables are correlated above 90%, but I would like to incorporate them since other studies have used them as well. Thanks. Trying to avoid data leakage!
As far as I know this is not possible with caret's preProcess/train interface directly.
It might be possible using recipes, since caret can train on a recipe and will then re-estimate the preprocessing within each resample. I do not use recipes myself, so the sketch below is an untested guess:
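library(caret)
library(recipes)
library(mlbench)
data(PimaIndiansDiabetes)
# center/scale/PCA only the three chosen columns; all others pass through untouched
rec <- recipe(diabetes ~ ., data = PimaIndiansDiabetes) %>%
  step_center(triceps, insulin, mass) %>%
  step_scale(triceps, insulin, mass) %>%
  step_pca(triceps, insulin, mass, num_comp = 1)
model <- train(rec, data = PimaIndiansDiabetes, method = "rf",
               trControl = trainControl(method = "cv", number = 5),
               tuneGrid = expand.grid(mtry = c(1, 2, 3)))
I do use mlr3, so here is how to do it with that package: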
library(mlr3)
library(mlr3pipelines)
library(mlr3learners)
library(paradox)
library(mlr3tuning)
library(mlbench)
create a task from the data:
data("PimaIndiansDiabetes")
pima_tsk <- TaskClassif$new(id = "Pima",
                            backend = PimaIndiansDiabetes,
                            target = "diabetes")
define a pre process selector named "slct1":
pos1 <- po("select", id = "slct1")
and define the selector function within it:
pos1$param_set$values$selector <- selector_name(colnames(PimaIndiansDiabetes[, 4:6]))
now define what should happen to the selected features: scaling -> pca with 1st PC selected (param_vals = list(rank. = 1))
pos1 %>>%
  po("scale", id = "scale1") %>>%
  po("pca", id = "pca1", param_vals = list(rank. = 1)) -> pr1
now define an invert selector:
pos2 <- po("select", id = "slct2")
pos2$param_set$values$selector <- selector_invert(pos1$param_set$values$selector)
define the learner:
rf_lrn <- po("learner", lrn("classif.ranger")) # ranger is a fast implementation of random forest
combine them:
gunion(list(pr1, pos2)) %>>%
  po("featureunion") %>>%
  rf_lrn -> graph
check if it looks ok:
graph$plot(html = TRUE)
convert graph to a learner:
glrn <- GraphLearner$new(graph)
define parameters you want tuned:
ps <- ParamSet$new(list(
  ParamInt$new("classif.ranger.mtry", lower = 1, upper = 6),
  ParamInt$new("classif.ranger.num.trees", lower = 100, upper = 1000)))
define resampling:
cv10 <- rsmp("cv", folds = 10)
define tuning:
instance <- TuningInstance$new(
  task = pima_tsk,
  learner = glrn,
  resampling = cv10,
  measures = msr("classif.ce"),
  param_set = ps,
  terminator = term("evals", n_evals = 20)
)
set.seed(1)
tuner <- TunerRandomSearch$new()
tuner$tune(instance)
instance$result
For additional details on how to tune the number of PC components to keep, check this answer: R caret: How do I apply separate pca to different dataframes before training?
If you find this interesting, check out the mlr3book.
Also, note that
cor(PimaIndiansDiabetes[, 4:6])
triceps insulin mass
triceps 1.0000000 0.4367826 0.3925732
insulin 0.4367826 1.0000000 0.1978591
mass 0.3925732 0.1978591 1.0000000
does not show the >90% correlations you mention in the question.
I have a dataset consisting of student marks in 2 subjects and the result of whether the student is admitted to college or not. I need to perform a logistic regression on the data and find the optimum parameter θ that minimizes the loss, then predict the results for the test data. I am not trying to build any complex non-linear network here.
I have the loss function for logistic regression defined like this, which works fine:
predict(X) = sigmoid(X*θ)
loss(X,y) = (1 / length(y)) * sum(-y .* log.(predict(X)) .- (1 - y) .* log.(1 - predict(X)))
I need to minimize this loss function and find the optimum θ. I want to do it with Flux.jl or any other library which makes it even easier.
I tried using Flux.jl after reading the examples, but I am not able to minimize the cost.
My code snippet:
function update!(ps, η = .1)
    for w in ps
        w.data .-= w.grad .* η
        print(w.data)
        w.grad .= 0
    end
end
for i = 1:400
    back!(L)
    update!((θ, b))
    @show L
end
You can use either GLM.jl (simpler) or Flux.jl (more involved but more powerful in general).
In the code I generate the data so that you can check if the result is correct. Additionally, I use a binary response variable; if your target variable has a different encoding, you might need to change the code a bit.
Here is the code to run (you can tweak the parameters to increase the convergence speed - I chose ones that are safe):
using GLM, DataFrames, Flux.Tracker
srand(1)
n = 10000
df = DataFrame(s1=rand(n), s2=rand(n))
df[:y] = rand(n) .< 1 ./ (1 .+ exp.(-(1 .+ 2 .* df[1] .+ 0.5 .* df[2])))
model = glm(@formula(y~s1+s2), df, Binomial(), LogitLink())
x = Matrix(df[1:2])
y = df[3]
W = param(rand(2,1))
b = param(rand(1))
predict(x) = 1.0 ./ (1.0+exp.(-x*W .- b))
loss(x,y) = -sum(log.(predict(x[y,:]))) - sum(log.(1 - predict(x[.!y,:])))
function update!(ps, η = .0001)
    for w in ps
        w.data .-= w.grad .* η
        w.grad .= 0
    end
end
i = 1
while true
    back!(loss(x,y))
    max(maximum(abs.(W.grad)), abs(b.grad[1])) > 0.001 || break
    update!((W, b))
    i += 1
end
And here are the results:
julia> model # GLM result
StatsModels.DataFrameRegressionModel{GLM.GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Distributions.Binomial{Float64},GLM.LogitLink},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}
Formula: y ~ 1 + s1 + s2
Coefficients:
Estimate Std.Error z value Pr(>|z|)
(Intercept) 0.910347 0.0789283 11.5338 <1e-30
s1 2.18707 0.123487 17.7109 <1e-69
s2 0.556293 0.115052 4.83513 <1e-5
julia> (b, W, i) # Flux result with number of iterations needed to converge
(param([0.910362]), param([2.18705; 0.556278]), 1946)
Thanks for this helpful example. However, it does not seem to run with my setup (Julia 1.1, Flux 0.7.1), since the 1+ and 1- operations in the predict and loss functions are not broadcast over the TrackedArray objects. Fortunately the fix is simple (note the dots!):
predict(x) = 1.0 ./ (1.0 .+ exp.(-x*W .- b))
loss(x,y) = -sum(log.(predict(x[y,:]))) - sum(log.(1 .- predict(x[.!y,:])))
Every book and example shows only binary classification (two classes), where a new vector can belong to either class.
Here the problem is that I have 4 classes (c1, c2, c3, c4), and I have training data for all 4 classes.
For a new vector the output should be something like:
c1 80% (the winner)
c2 10%
c3 6%
c4 4%
How do I do this? I'm planning to use libsvm (because it is the most popular), but I don't know much about it. If any of you have used it previously, please tell me the specific commands I'm supposed to use.
LibSVM uses the one-against-one approach for multi-class learning problems. From the FAQ:
Q: What method does libsvm use for multi-class SVM ? Why don't you use the "1-against-the rest" method ?
It is one-against-one. We chose it after doing the following comparison: C.-W. Hsu and C.-J. Lin. A comparison of methods for multi-class support vector machines, IEEE Transactions on Neural Networks, 13(2002), 415-425.
"1-against-the rest" is a good method whose performance is comparable to "1-against-1." We do the latter simply because its training time is shorter.
Commonly used methods are one-vs-rest and one-vs-one.
In the first method you train n classifiers (one per class), and the class whose classifier produces the highest score wins; a sketch of this follows below.
In the second method you train n(n-1)/2 pairwise classifiers, and the resulting class is obtained by majority vote.
AFAIR, libsvm supports both strategies of multiclass classification.
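To illustrate the one-vs-rest scoring, a minimal sketch in R with e1071 (iris as stand-in data; the predicted probability of the "this class" side serves as the score):
library(e1071)
data(iris)
classes <- levels(iris$Species)
# one binary SVM per class: this class vs. everything else
scores <- sapply(classes, function(cl) {
  y <- factor(iris$Species == cl)
  m <- svm(iris[, 1:4], y, probability = TRUE)
  attr(predict(m, iris[, 1:4], probability = TRUE), "probabilities")[, "TRUE"]
})
# the class whose classifier gives the highest score wins
pred <- classes[max.col(scores)]
table(pred, iris$Species)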
You can always reduce a multi-class classification problem to a binary problem by choosing random partitions of the set of classes, recursively. This is not necessarily any less effective or efficient than learning all at once, since the sub-learning problems involve fewer classes and therefore require fewer examples. (It may take at most a constant factor longer, e.g. twice as long.) It may also lead to more accurate learning.
I'm not necessarily recommending this, but it is one answer to your question, and it is a general technique that can be applied to any binary learning algorithm; a hypothetical sketch follows.
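A rough sketch of the idea in R (hypothetical helper functions train_tree/predict_tree, with e1071::svm as the base binary learner):
library(e1071)
# recursively split the classes into two random groups and train a binary SVM at each node
train_tree <- function(x, y) {
  cls <- unique(y)
  if (length(cls) == 1) return(list(leaf = cls))
  grp1 <- sample(cls, ceiling(length(cls) / 2)) # random partition of the classes
  in1 <- y %in% grp1
  list(model = svm(x, factor(in1)),
       left  = train_tree(x[in1, , drop = FALSE], y[in1]),
       right = train_tree(x[!in1, , drop = FALSE], y[!in1]))
}
predict_tree <- function(node, xrow) {
  if (!is.null(node$leaf)) return(node$leaf)
  go_left <- predict(node$model, xrow) == "TRUE"
  predict_tree(if (go_left) node$left else node$right, xrow)
}
tree <- train_tree(iris[, 1:4], as.character(iris$Species))
predict_tree(tree, iris[1, 1:4])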
Use the SVM Multiclass library; you can find it at the SVM page by Thorsten Joachims.
It does not have a specific switch (command) for multi-class prediction; it automatically handles multi-class prediction if your training dataset contains more than two classes.
Nothing special is needed compared with binary prediction. See the following example of 3-class prediction based on SVM:
install.packages("e1071")
library("e1071")
data(iris)
attach(iris)
## classification mode
# default with factor response:
model <- svm(Species ~ ., data = iris)
# alternatively the traditional interface:
x <- subset(iris, select = -Species)
y <- Species
model <- svm(x, y)
print(model)
summary(model)
# test with train data
pred <- predict(model, x)
# (same as:)
pred <- fitted(model)
# Check accuracy:
table(pred, y)
# compute decision values and probabilities:
pred <- predict(model, x, decision.values = TRUE)
attr(pred, "decision.values")[1:4,]
# visualize (classes by color, SV by crosses):
plot(cmdscale(dist(iris[,-5])),
col = as.integer(iris[,5]),
pch = c("o","+")[1:150 %in% model$index + 1])
data=load('E:\dataset\scene_categories\all_dataset.mat');
meas = data.all_dataset;
species = data.dataset_label;
[g gn] = grp2idx(species); %# nominal class to numeric
%# split training/testing sets
[trainIdx testIdx] = crossvalind('HoldOut', species, 1/10);
%# 1-vs-1 pairwise models
num_labels = length(gn);
clear gn;
num_classifiers = num_labels*(num_labels-1)/2;
pairwise = zeros(num_classifiers ,2);
row_end = 0;
for i=1:num_labels - 1
    row_start = row_end + 1;
    row_end = row_start + num_labels - i - 1;
    pairwise(row_start:row_end, 1) = i;
    count = 0;
    for j = i+1 : num_labels
        pairwise(row_start + count, 2) = j;
        count = count + 1;
    end
end
clear row_start row_end count i j num_labels num_classifiers;
svmModel = cell(size(pairwise,1),1); %# store binary-classifers
predTest = zeros(sum(testIdx),numel(svmModel)); %# store binary predictions
%# classify using one-against-one approach, SVM with RBF kernel
for k=1:numel(svmModel)
    %# get only training instances belonging to this pair
    idx = trainIdx & any( bsxfun(@eq, g, pairwise(k,:)) , 2 );
    %# train
    svmModel{k} = svmtrain(meas(idx,:), g(idx), ...
        'Autoscale',true, 'Showplot',false, 'Method','QP', ...
        'BoxConstraint',2e-1, 'Kernel_Function','rbf', 'RBF_Sigma',1);
    %# test
    predTest(:,k) = svmclassify(svmModel{k}, meas(testIdx,:));
end
pred = mode(predTest,2); %# voting: clasify as the class receiving most votes
%# performance
cmat = confusionmat(g(testIdx),pred);
acc = 100*sum(diag(cmat))./sum(cmat(:));
fprintf('SVM (1-against-1):\naccuracy = %.2f%%\n', acc);
fprintf('Confusion Matrix:\n'), disp(cmat)
For multi-class classification using structural SVMs, it is NOT one-vs-one and NOT one-vs-rest.
Instead, learn a single two-class classifier over a joint feature vector Φ(x, y), where x is the data and y is the correct label associated with it.
The training gap is the difference between the value for the correct class and the value of the nearest other class.
At inference, choose the y that maximizes the score:
y = argmax_{y'} W · Φ(x, y')   [W is the weight vector and Φ(x, y) is the joint feature vector]
Please visit this link:
https://nlp.stanford.edu/IR-book/html/htmledition/multiclass-svms-1.html#:~:text=It%20is%20also%20a%20simple,the%20label%20of%20structural%20SVMs%20.