Getting an error while converting a tibble to an h2o hex file

I am running the h2o package in RStudio and am getting an error while converting a tibble into an h2o frame.
Below is my code:
# Augment the time series signature
PO_Data_aug = PO_Data %>%
  tk_augment_timeseries_signature()
PO_Data_aug

# Split into training, validation and test sets
train_tbl = PO_Data_aug %>% filter(Date <= '2017-12-29')
valid_tbl = PO_Data_aug %>% filter(Date > '2017-12-29' & Date <= '2018-03-31')
test_tbl  = PO_Data_aug %>% filter(Date > '2018-03-31')
str(train_tbl)
train_tbl$month.lbl <- as.character(train_tbl$month.lbl)

h2o.init() # Fire up h2o

## Convert to h2o frames ("hex")
train_h2o = as.h2o(train_tbl)
valid_h2o = as.h2o(valid_tbl)
test_h2o  = as.h2o(test_tbl)
ERROR: Unexpected HTTP Status code: 412 Precondition Failed (url = http://localhost:54321/3/Parse)
ERROR MESSAGE:
Provided column type ordered is unknown. Cannot proceed with parse due to invalid argument.
Kindly suggest.

This is actually a bug in H2O -- it has nothing to do with tibbles specifically. H2O has no support for the "ordered" column type that data.frames and tibbles can contain. We will fix this (ticket here).
The work-around right now is to manually convert your "ordered" columns into un-ordered "factor" columns:
library(tibble)

tb <- tibble(x = ordered(c(1, 2, 3)), y = 1:3) # x is an "ordered" factor
tb$x <- factor(tb$x, ordered = FALSE)          # drop the ordering
hf <- as.h2o(tb)                               # now parses cleanly
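If several columns are ordered factors, you can drop the ordering for all of them in one pass. A minimal sketch using dplyr, applied to the train_tbl from the question:

library(dplyr)

# convert every ordered factor to a plain, un-ordered factor
train_tbl <- train_tbl %>%
  mutate_if(is.ordered, factor, ordered = FALSE)

train_h2o <- as.h2o(train_tbl) # should now parse without the 412 error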

as.h2o() expects an R data.frame. You could pass a plain data.frame instead of your tibble, or, as Tom mentioned in the comments, use one of the file formats that H2O supports.
train_h2o = as.h2o(as.data.frame(train_tbl))
valid_h2o = as.h2o(as.data.frame(valid_tbl))
test_h2o  = as.h2o(as.data.frame(test_tbl))

Related

Leave one out cross validation RStudio randomForest package error

I am creating an LOOCV loop using the randomForest package. I have adapted the following code from this link (https://stats.stackexchange.com/questions/459293/loocv-in-caret-package-randomforest-example-not-unique-results); however, I have been unable to get it to run successfully.
Here is the code I am running, applied to the iris dataset.
irisdata <- iris[1:150, ]
predictionsiris <- 1:150
for (k in 1:150) {
  set.seed(123)
  predictioniris[k] <- predict(randomForest(Petal.Width ~ Sepal.Length,
                                            data = irisdata[-k],
                                            ntree = 10),
                               newdata = irisdata[k, , drop = FALSE])[2]
}
What I would expect to happen is for it to run the random forest model on all but one row and then use that one row to test the model.
However, when I run this code, I get the following error:
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'object' in selecting a method for function 'predict': object 'Sepal.Length' not found
Any suggestions? I have been messing around with LOOCV code for the past two days, including the code on this page (Compute Random Forest with a leave one ID out cross validation in R), and running the following:
iris %>%
  mutate(ID = 1:516)
loocv <- NULL
for (i in iris$ID) {
  test[[i]] <- slice(iris, i)
  train[[i]] <- slice(iris, i + 1:516)
  rf <- randomForest(Sepal.Length ~ ., data = train, ntree = 10, importance = TRUE)
  loocv[[i]] <- predict(rf, newdata = test)
}
but I have had no success. Any help would be appreciated.
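For reference, the immediate cause of the reported error is the subsetting: irisdata[-k] drops the k-th column of the data frame (so for k = 1 the Sepal.Length column disappears), whereas dropping the k-th row requires irisdata[-k, ]. A minimal corrected sketch of the first loop:

library(randomForest)

irisdata <- iris
predictions <- numeric(nrow(irisdata))
for (k in seq_len(nrow(irisdata))) {
  set.seed(123)
  fit <- randomForest(Petal.Width ~ Sepal.Length,
                      data = irisdata[-k, ], # note the comma: drop row k, not column k
                      ntree = 10)
  predictions[k] <- predict(fit, newdata = irisdata[k, , drop = FALSE])
}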

How to visualise and compare distributions across several conditions with histograms?

I have percentage data for six error types in four different experiments. How can I visualise the distribution of each error type across the four conditions? Thank you!
This is what my data look like: (screenshot of the data in the original post)
I want six histograms, one per error type, showing its distribution across the four conditions on the same panel using the facet_grid function, so I can compare the differences and changes in the distributions. I tried the following, but it did not work.
p <- ggplot(data = error_percentage_sep, aes(x = factor(error_type))) +
  geom_histogram(binwidth = 100)
p + facet_wrap(~ color)
p

error_percentage_sep %>% # this is the dataset I have
  mutate(error_percentage_sep = factor(error_type)) %>%
  group_by(error_type) %>%
  summarise(mean_percent = round(mean(percentage), 2))

ggplot(error_percentage_sep, aes(x = error_type, y = mean_percent)) +
  geom_bar(stat = "identity")
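A sketch of one way to get the comparison described above, assuming the data has columns error_type, condition (the four experiments), and percentage -- the condition column name is an assumption, not confirmed by the question:

library(dplyr)
library(ggplot2)

# mean percentage per error type and condition (assumed column names)
error_summary <- error_percentage_sep %>%
  group_by(error_type, condition) %>%
  summarise(mean_percent = round(mean(percentage), 2), .groups = "drop")

# one panel per error type; bars compare the four conditions
ggplot(error_summary, aes(x = condition, y = mean_percent)) +
  geom_col() +
  facet_wrap(~ error_type)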

grid of mtry values while training random forests with ranger

I am working with a subset of the 'Ames Housing' dataset, which originally has 17 features. Using the 'recipes' package, I engineered the original set of features and created dummy variables for nominal predictors with the following code. That resulted in 35 features in the 'baked_train' dataset below.
blueprint <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_nzv(Street, Utilities, Pool_Area, Screen_Porch, Misc_Val) %>%
  step_impute_knn(Gr_Liv_Area) %>%
  step_integer(Overall_Qual) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_other(Neighborhood, threshold = 0.01, other = "other") %>%
  step_dummy(all_nominal_predictors(), one_hot = FALSE)

prepare <- prep(blueprint, training = ames_train)
baked_train <- bake(prepare, new_data = ames_train)
baked_test <- bake(prepare, new_data = ames_test)
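As a quick sanity check on the feature count after baking (a one-line sketch):

ncol(baked_train) - 1 # 35 predictors after dummy encoding, excluding Sale_Price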
Now, I am trying to train random forests with the 'ranger' package using the following code.
cv_specs <- trainControl(method = "repeatedcv", number = 5, repeats = 5)

param_grid_rf <- expand.grid(mtry = seq(1, 35, 1),
                             splitrule = "variance",
                             min.node.size = 2)

rf_cv <- train(blueprint,
               data = ames_train,
               method = "ranger",
               trControl = cv_specs,
               tuneGrid = param_grid_rf,
               metric = "RMSE")
I set the grid of 'mtry' values based on the number of features in the 'baked_train' data. My understanding is that 'caret' will apply the blueprint within each resample of 'ames_train', creating a baked version at each CV step.
The text Hands-On Machine Learning with R by Boehmke & Greenwell says in section 3.8.3:
Consequently, the goal is to develop our blueprint, then within each resample iteration we want to apply prep() and bake() to our resample training and validation data. Luckily, the caret package simplifies this process. We only need to specify the blueprint and caret will automatically prepare and bake within each resample.
However, when I run the code above I get an error:
mtry can not be larger than number of variables in data. Ranger will EXIT now.
I get the same error when I specify 'tuneLength = 20' instead of the 'tuneGrid'. However, the code works fine when the grid of 'mtry' values runs from 1 to 17 (the number of features in the original training data 'ames_train').
When I specify a grid of 'mtry' values from 1 to 17, the info about the final model after CV is shown below. Notice that it reports Number of independent variables: 35, which corresponds to the 'baked_train' data, even though specifying a grid from 1 to 35 throws an error.
Type: Regression
Number of trees: 500
Sample size: 618
Number of independent variables: 35
Mtry: 15
Target node size: 2
Variable importance mode: impurity
Splitrule: variance
OOB prediction error (MSE): 995351989
R squared (OOB): 0.8412147
What am I missing here? Specifically, why do I have to specify the number of features in 'ames_train' instead of 'baked_train' when essentially 'caret' is supposed to create a baked version before fitting and evaluating the model for each resample?
Thanks.

Overcoming compatibility issues when using iml with h2o models

I am unable to reproduce the only example I can find of using h2o with iml (https://www.r-bloggers.com/2018/08/iml-and-h2o-machine-learning-model-interpretability-and-feature-explanation/) as detailed here (Error when extracting variable importance with FeatureImp$new and H2O). Can anyone point to a workaround or other examples of using iml with h2o?
Reproducible example:
library(rsample)   # data splitting
library(ggplot2)   # allows extension of visualizations
library(dplyr)     # basic data transformation
library(h2o)       # machine learning modeling
library(iml)       # ML interpretation
library(modeldata) # attrition data

# initialize h2o session
h2o.no_progress()
h2o.init()

# classification data
data("attrition", package = "modeldata")
df <- attrition %>%
  mutate_if(is.ordered, factor, ordered = FALSE) %>%
  mutate(Attrition = recode(Attrition, "Yes" = "1", "No" = "0") %>%
           factor(levels = c("1", "0")))

# convert to h2o object
df.h2o <- as.h2o(df)

# create train, validation, and test splits
set.seed(123)
splits <- h2o.splitFrame(df.h2o, ratios = c(.7, .15),
                         destination_frames = c("train", "valid", "test"))
names(splits) <- c("train", "valid", "test")

# variable names for response & features
y <- "Attrition"
x <- setdiff(names(df), y)
# elastic net model
glm <- h2o.glm(
  x = x,
  y = y,
  training_frame = splits$train,
  validation_frame = splits$valid,
  family = "binomial",
  seed = 123
)
# 1. create a data frame with just the features
features <- as.data.frame(splits$valid) %>% select(-Attrition)

# 2. create a vector with the actual responses
response <- as.numeric(as.vector(splits$valid$Attrition))

# 3. create a custom predict function that returns the predicted values as a
#    vector (probability of attrition in our example)
pred <- function(model, newdata) {
  results <- as.data.frame(h2o.predict(model, as.h2o(newdata)))
  return(results[[3L]])
}
# create predictor object to pass to explainer functions
predictor.glm <- Predictor$new(
  model = glm,
  data = features,
  y = response,
  predict.fun = pred,
  class = "classification"
)

imp.glm <- FeatureImp$new(predictor.glm, loss = "mse")
Error obtained:
Error in `[.data.frame`(prediction, , self$class, drop = FALSE): undefined columns selected
traceback()
1. FeatureImp$new(predictor.glm, loss = "mse")
2. .subset2(public_bind_env, "initialize")(...)
3. private$run.prediction(private$sampler$X)
4. self$predictor$predict(data.frame(dataDesign))
5. prediction[, self$class, drop = FALSE]
6. `[.data.frame`(prediction, , self$class, drop = FALSE)
7. stop("undefined columns selected")
In the iml package documentation, the class argument is described as "The class column to be returned." When you set class = "classification", iml looks for a column called "classification", which does not exist. At least on GitHub, the iml package has gone through a fair amount of development since that blog post, so some functionality may no longer be backwards compatible.
After reading through the package documentation, I think you might want to try something like:
predictor.glm <- Predictor$new(
  model = glm,
  data = features,
  y = "Attrition",
  predict.function = pred,
  type = "prob"
)
# check ability to predict first
check <- predictor.glm$predict(features)
print(check)
Even better might be to leverage H2O's extensive functionality around machine learning interpretability.
h2o.varimp(glm) will give the user the variable importance for each feature
h2o.varimp_plot(glm, 10) will render a graphic showing the relative importance of each feature.
h2o.explain(glm, as.h2o(features)) is a wrapper for the explainability interface and will by default provide the confusion matrix (in this case) as well as variable importance, and partial dependency plots for each feature.
For certain algorithms (e.g., tree-based methods), h2o.shap_explain_row_plot() and h2o.shap_summary_plot() will provide the shap contributions.
The h2o-3 docs might be useful for exploring these further.
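Putting those together, a short usage sketch with the glm model and features frame from the question:

h2o.varimp(glm)                    # variable-importance table
h2o.varimp_plot(glm, 10)           # plot the 10 most important features
h2o.explain(glm, as.h2o(features)) # wrapper: confusion matrix, importance, PD plots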

How to solve "not all divisions are known" error?

I'm trying to filter a Dask dataframe based on a groupby result.
df = df.set_index('ngram')
sizes = df.groupby('ngram').size()
df = df[sizes > 15]
However, df.head(15) throws the error ValueError: Not all divisions are known, can't align partitions. Please use `set_index` to set the index. The divisions on sizes are not known:
>>> df.known_divisions
True
>>> sizes.known_divisions
False
A workaround is to call sizes.compute() or .to_csv(...) and then read the result back into Dask with dd.from_pandas or dd.read_csv; after that, sizes.known_divisions returns True. That's a notable inconvenience.
How else can this be solved? Am I using Dask wrong?
Note: there's an unanswered duplicate here.
In the common case, like this one, the indexing series is in fact much smaller than the source dataframe you want to apply it to. In that case it makes sense to materialise it and use simple indexing, like this:
import numpy as np
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'ngram': np.random.choice([1, 2, 3], size=1000),
                   'other': np.random.randn(1000)})  # fake data
d = dd.from_pandas(df, npartitions=3)
sizes = d.groupby('ngram').size().compute()  # small enough to materialise
d = d.set_index('ngram')  # also sorts the divisions
ngrams = sizes[sizes > 300].index.tolist()  # a list of good ngrams
d.loc[ngrams].compute()
