Visualise the validation set ROC Curve - random-forest

And visualize the validation set ROC curve:
lr_auc <-
lr_res %>%
collect_predictions(parameters = lr_best) %>%
roc_curve(ProdTaken, .pred_ProdTaken) %>%
mutate(model = "Logistic Regression")
autoplot(lr_auc)
ProdTaken is a column in my data set, and I created a validation set. But the console says the column .pred_ProdTaken does not exist. Why?

Related

grid of mtry values while training random forests with ranger

I am working with a subset of the 'Ames Housing' dataset and have originally 17 features. Using the 'recipes' package, I have engineered the original set of features and created dummy variables for nominal predictors with the following code. That has resulted in 35 features in the 'baked_train' dataset below.
blueprint <- recipe(Sale_Price ~ ., data = ames_train) %>%
step_nzv(Street, Utilities, Pool_Area, Screen_Porch, Misc_Val) %>%
step_impute_knn(Gr_Liv_Area) %>%
step_integer(Overall_Qual) %>%
step_normalize(all_numeric_predictors()) %>%
step_other(Neighborhood, threshold = 0.01, other = "other") %>%
step_dummy(all_nominal_predictors(), one_hot = FALSE)
prepare <- prep(blueprint, data = ames_train)
baked_train <- bake(prepare, new_data = ames_train)
baked_test <- bake(prepare, new_data = ames_test)
Now, I am trying to train random forests with the 'ranger' package using the following code.
cv_specs <- trainControl(method = "repeatedcv", number = 5, repeats = 5)
param_grid_rf <- expand.grid(mtry = seq(1, 35, 1),
splitrule = "variance",
min.node.size = 2)
rf_cv <- train(blueprint,
data = ames_train,
method = "ranger",
trControl = cv_specs,
tuneGrid = param_grid_rf,
metric = "RMSE")
I have set the grid of 'mtry' values based on the number of features in the 'baked_train' data. It is my understanding that 'caret' will apply the blueprint within each resample of 'ames_train' creating a baked version at each CV step.
The text Hands-On Machine Learning with R by Boehmke & Greenwell says in section 3.8.3:
Consequently, the goal is to develop our blueprint, then within each resample iteration we want to apply prep() and bake() to our resample training and validation data. Luckily, the caret package simplifies this process. We only need to specify the blueprint and caret will automatically prepare and bake within each resample.
However, when I run the code above I get an error,
mtry can not be larger than number of variables in data. Ranger will EXIT now.
I get the same error when I specify 'tuneLength = 20' instead of 'tuneGrid'. However, the code works fine when the grid of 'mtry' values runs from 1 to 17 (the number of features in the original training data 'ames_train').
When I specify a grid of 'mtry' values from 1 to 17, info about the final model after CV is shown below. Notice that it mentions Number of independent variables: 35 which corresponds to the 'baked_train' data, although specifying a grid from 1 to 35 throws an error.
Type: Regression
Number of trees: 500
Sample size: 618
Number of independent variables: 35
Mtry: 15
Target node size: 2
Variable importance mode: impurity
Splitrule: variance
OOB prediction error (MSE): 995351989
R squared (OOB): 0.8412147
What am I missing here? Specifically, why do I have to specify the number of features in 'ames_train' instead of 'baked_train' when essentially 'caret' is supposed to create a baked version before fitting and evaluating the model for each resample?
Thanks.

Overcoming compatibility issues with using iml from h2o models

I am unable to reproduce the only example I can find of using h2o with iml (https://www.r-bloggers.com/2018/08/iml-and-h2o-machine-learning-model-interpretability-and-feature-explanation/) as detailed here (Error when extracting variable importance with FeatureImp$new and H2O). Can anyone point to a workaround or other examples of using iml with h2o?
Reproducible example:
library(rsample) # data splitting
library(ggplot2) # allows extension of visualizations
library(dplyr) # basic data transformation
library(h2o) # machine learning modeling
library(iml) # ML interpretation
library(modeldata) # attrition data
# initialize h2o session
h2o.no_progress()
h2o.init()
# classification data
data("attrition", package = "modeldata")
df <- attrition %>% # loaded from modeldata above; the data set no longer ships with rsample
mutate_if(is.ordered, factor, ordered = FALSE) %>%
mutate(Attrition = recode(Attrition, "Yes" = "1", "No" = "0") %>% factor(levels = c("1", "0")))
# convert to h2o object
df.h2o <- as.h2o(df)
# create train, validation, and test splits
set.seed(123)
splits <- h2o.splitFrame(df.h2o, ratios = c(.7, .15), destination_frames =
c("train","valid","test"))
names(splits) <- c("train","valid","test")
# variable names for response & features
y <- "Attrition"
x <- setdiff(names(df), y)
# elastic net model
glm <- h2o.glm(
x = x,
y = y,
training_frame = splits$train,
validation_frame = splits$valid,
family = "binomial",
seed = 123
)
# 1. create a data frame with just the features
features <- as.data.frame(splits$valid) %>% select(-Attrition)
# 2. Create a vector with the actual responses
response <- as.numeric(as.vector(splits$valid$Attrition))
# 3. Create custom predict function that returns the predicted values as a
# vector (probability of purchasing in our example)
pred <- function(model, newdata) {
results <- as.data.frame(h2o.predict(model, as.h2o(newdata)))
return(results[[3L]])
}
# create predictor object to pass to explainer functions
predictor.glm <- Predictor$new(
model = glm,
data = features,
y = response,
predict.fun = pred,
class = "classification"
)
imp.glm <- FeatureImp$new(predictor.glm, loss = "mse")
Error obtained:
Error in `[.data.frame`(prediction, , self$class, drop = FALSE): undefined columns
selected
traceback()
1. FeatureImp$new(predictor.glm, loss = "mse")
2. .subset2(public_bind_env, "initialize")(...)
3. private$run.prediction(private$sampler$X)
4. self$predictor$predict(data.frame(dataDesign))
5. prediction[, self$class, drop = FALSE]
6. `[.data.frame`(prediction, , self$class, drop = FALSE)
7. stop("undefined columns selected")
In the iml package documentation, it says that the class argument is "The class column to be returned.". When you set class = "classification", it's looking for a column called "classification" which is not found. At least on GitHub, it looks like the iml package has gone through a fair amount of development since that blog post, so I imagine some functionality may not be backwards compatible anymore.
After reading through the package documentation, I think you might want to try something like:
predictor.glm <- Predictor$new(
model = glm,
data = features,
y = "Attrition",
predict.function = pred,
type = "prob"
)
# check ability to predict first
check <- predictor.glm$predict(features)
print(check)
Even better might be to leverage H2O's extensive functionality around machine learning interpretability:
h2o.varimp(glm) gives the variable importance for each feature.
h2o.varimp_plot(glm, 10) renders a plot showing the relative importance of each feature.
h2o.explain(glm, as.h2o(features)) is a wrapper for the explainability interface and will by default provide the confusion matrix (in this case) as well as variable importance and partial dependence plots for each feature.
For certain algorithms (e.g., tree-based methods), h2o.shap_explain_row_plot() and h2o.shap_summary_plot() will provide the SHAP contributions.
The h2o-3 docs might be useful for exploring further.

How to use a voting regressor to predict a particular instance?

I created a voting regressor from some regressors like
voting_regressor = VotingRegressor(estimators=[('xg',xgbregressor),('gb',gradient_boosting_regressor),('et',extra_trees_regressor),('rf',random_forest_regressor)])
voting_regressor.fit(X_train, y_train)
The regressor predicts well on the test set
y_pred = voting_regressor.predict(X_test)
but when I try to predict for a particular instance
voting_regressor.predict(X_test.iloc[0].values.reshape(1,-1))
it shows the following error
ValueError: feature_names mismatch: ['yearpublished', 'minplayers', 'maxplayers', 'playingtime', 'minplaytime', 'maxplaytime', 'minage', 'users_rated', 'total_owners', 'total_traders', 'total_wanters', 'total_wishers', 'total_comments', 'total_weights', 'average_weight'] ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12', 'f13', 'f14']
expected users_rated, total_wishers, yearpublished, maxplayers, maxplaytime, total_owners, total_weights, average_weight, minplaytime, total_wanters, total_traders, playingtime, minage, total_comments, minplayers in input data
training data did not have the following fields: f9, f3, f13, f0, f8, f4, f14, f5, f2, f6, f12, f11, f7, f10, f1
X_test.iloc[0] returns a pandas.Series rather than a pandas.DataFrame, and calling .values.reshape(1, -1) on it produces a bare NumPy array. The column names are lost along the way, but the error shows they are required: the model expects the training column names and receives the placeholder names f0 to f14 instead.
If you want a DataFrame with a single example, wrap the index in another list like this:
voting_regressor.predict(X_test.iloc[[0]])
This way the column names are preserved.
You can select several examples the same way, simply by using [0, 1, 2, 3].
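The difference between the two indexing forms can be checked without fitting any model. The sketch below uses a small made-up frame (the values are hypothetical; only the column names are taken from the error message) and shows that iloc[0] followed by .values.reshape(1, -1) yields a plain NumPy array, while iloc[[0]] yields a one-row DataFrame that still carries the column names.

```python
import pandas as pd

# Hypothetical stand-in for X_test; the values are made up for illustration.
X_test = pd.DataFrame(
    {"yearpublished": [2010, 2015], "minplayers": [2, 1], "maxplayers": [4, 6]}
)

as_array = X_test.iloc[0].values.reshape(1, -1)  # NumPy array, column names lost
as_frame = X_test.iloc[[0]]                      # one-row DataFrame, names kept
several = X_test.iloc[[0, 1]]                    # several rows at once

print(type(as_array).__name__, as_array.shape)
print(type(as_frame).__name__, list(as_frame.columns))
print(several.shape)
```

Passing as_frame (or several) to predict keeps the feature names aligned with the ones seen during fit, which is what the estimator in the error message is checking.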

TensorFlow - Classification with thousands of labels

I'm very new to TensorFlow. I've been trying to use TensorFlow to create a function where I give it a vector with 6 features and get back a label.
I have a training data set in the form of 6 features and 1 label. The label is in the first column:
309,3,0,2,4,0,6
309,12,0,2,4,0,6
309,0,4,17,2,0,6
318,0,660,414,58,3,12
311,0,0,414,58,0,2
298,0,53,355,5,0,2
60,16,14,381,30,4,2
312,0,8,8,13,0,3
...
I have the index for the labels, which is a list of thousands and thousands of names:
309,Joe
318,Joey
311,Bruce
...
How do I create a model and train it using TensorFlow to be able to predict the label, given a vector without the first column?
--
This is what I tried:
from __future__ import print_function
import tflearn
name_count = sum(1 for line in open('../../names.csv')) # this comes out to 24260
# Load CSV file, indicate that the first column represents labels
from tflearn.data_utils import load_csv
data, labels = load_csv('../../data.csv', target_column=0,
categorical_labels=True, n_classes=name_count)
# Build neural network
net = tflearn.input_data(shape=[None, 6])
net = tflearn.fully_connected(net, 32)
net = tflearn.fully_connected(net, 32)
net = tflearn.fully_connected(net, 2, activation='softmax')
net = tflearn.regression(net)
# Define model
model = tflearn.DNN(net)
# Start training (apply gradient descent algorithm)
model.fit(data, labels, n_epoch=10, batch_size=16, show_metric=True)
# Predict
pred = model.predict([[218,5,124,26,0,3]]) # 326
print("Name:", pred[0][1])
It's based on https://github.com/tflearn/tflearn/blob/master/tutorials/intro/quickstart.md
I get the error:
ValueError: Cannot feed value of shape (16, 24260) for Tensor u'TargetsData/Y:0', which has shape '(?, 2)'
24260 is the number of lines in names.csv
Thank you!
net = tflearn.fully_connected(net, 2, activation='softmax')
looks to be saying you have 2 output classes, but in reality you have 24260. 16 is the size of your minibatch, so you have 16 rows of 24260 columns (one of these 24260 will be a 1, the others will be all 0s).
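The shape clash can be reproduced in plain Python, with no TensorFlow dependency. In this sketch, one_hot is a hypothetical helper written only for illustration; it mimics what categorical_labels=True does to the label column. Each target row has as many entries as there are classes, so the final layer would need the same number of units (presumably name_count instead of 2 in the question's code).

```python
def one_hot(label, n_classes):
    """Return a list of n_classes zeros with a single 1 at index `label`."""
    row = [0] * n_classes
    row[label] = 1
    return row

n_classes = 24260  # number of lines in names.csv, as reported in the question
batch_labels = [309, 309, 309, 318, 311, 298, 60, 312] * 2  # one minibatch of 16

targets = [one_hot(label, n_classes) for label in batch_labels]
target_shape = (len(targets), len(targets[0]))
print(target_shape)  # (16, 24260)

output_units = 2  # size of the final layer in the question
print(target_shape[1] == output_units)  # False: this is the mismatch
```

Once the last layer has one unit per class, the predicted label is the index of the largest output value rather than a fixed position such as pred[0][1].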

different clustering labels

I am trying to cluster new data that was not seen during training and appears only in the testing data. The training file has five classes, whereas the testing data has seven (5 + 2), where the two extra ones are new classes. I want to run k-means to find the proper cluster for each newly added class, or create a new cluster for it if it is not close to any existing cluster.
This is a part of my code:
import numpy as np
import pandas as pd
print("Reading training data...")
#mydata = pd.read_csv('.\KDDTrain.csv', header=0)
mydata = pd.read_csv('.\PTraining.csv', header=0)
# select all but the last column as data
X_train = mydata.iloc[1:, :-1]  # .ix was removed in pandas 1.0
X_train = np.array(X_train)
n_samples, n_features = np.shape(X_train)
# print np.shape(X_train)
# select last column as target/class
y_train = mydata.iloc[1:, n_features]
y_train = np.array(y_train)
# encode target labels with numeric values from 0 to no of classes
# print "Encoding class labels..."
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
label_encoder.fit(y_train)
# print list(label_encoder.classes_)
# print 'total no of classes in dataset=' + str(len(label_encoder.classes_))
y_train = label_encoder.transform(y_train)
# n_samples, n_features = data.shape
n_digits = len(np.unique(y_train))
print("Training data statistics")
print("n_attack_catagories: %d, \t n_samples %d, \t n_features %d"
% (n_digits, n_samples, n_features))
sample_size = 300
# Read test data
mytestdata = pd.read_csv('.\KDDTest+.csv', header=0)
print("Reading test data...")
# select all but the last column as data
X_test = mytestdata.iloc[1:, :-1]
X_test = np.array(X_test)
# print np.shape(X_test)
# select last column as target/class
y_test = mytestdata.iloc[1:, n_features]
# print "actual labels"
# print y_test
y_test = label_encoder.transform(y_test)
# print "Encoded labels"
# print y_test
y_test = np.array(y_test)
n_samples_test, n_features_test = np.shape(X_test)
n_digits_test = len(np.unique(y_test))
print("Test data statistics")
print("n_attack_catagories: %d, \t n_samples %d, \t n_features %d"
% (n_digits_test, n_samples_test, n_features_test))
print(79 * '_')
and it gives this error:
File "C:/Users/aalsham4/PycharmProjects/clusteringtask/clustering.py", line 87, in <module>
y_test = label_encoder.transform(y_test)
File "C:\Users\aalsham4\AppData\Local\Continuum\Miniconda3\lib\site-packages\sklearn\preprocessing\label.py", line 153, in transform
raise ValueError("y contains new labels: %s" % str(diff))
ValueError: y contains new labels: ['calss6' 'class7' ]
Now, I'm not sure whether this is the correct way to cluster labeled classes.
Any suggestions?
As @Anony-Mousse already said, this is not a k-means problem. k-means finds the "natural" groupings, given the number of classes you want. Once you assign those labels, further updates are no longer a k-means problem.
You can use a variety of statistical analysis heuristics to decide whether a new class is "sufficiently close" to an existing class. This usually uses measures of mean and deviation (which you already have for the k-means classes), density, and anything else you find pertinent to your problem.
I suggest that you research spectral clustering algorithms and try them on the entire data set; those are better suited to finding gaps, reacting to density, etc. (depending on the algorithm you choose for this application).
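As a rough illustration of the "sufficiently close" idea, here is a minimal sketch of one possible heuristic, using only the Python standard library. The function names, the made-up 2-D data, and the threshold k are all assumptions for this example: a point joins its nearest cluster only if its distance to that centroid is within the mean plus k standard deviations of the cluster's own member distances; otherwise it is flagged as a candidate for a new cluster.

```python
import math
import statistics

def centroid(points):
    """Component-wise mean of a list of equal-length points."""
    return [statistics.fmean(col) for col in zip(*points)]

def assign_or_flag(point, clusters, k=3.0):
    """Return the nearest cluster's name, or None if the point lies farther
    from that centroid than mean + k * stdev of the member distances."""
    best_name, best_dist, best_limit = None, math.inf, 0.0
    for name, members in clusters.items():
        c = centroid(members)
        dists = [math.dist(m, c) for m in members]
        d = math.dist(point, c)
        if d < best_dist:
            limit = statistics.fmean(dists) + k * statistics.pstdev(dists)
            best_name, best_dist, best_limit = name, d, limit
    return best_name if best_dist <= best_limit else None

# Made-up 2-D data for illustration only.
clusters = {
    "class1": [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)],
    "class2": [(10.0, 10.0), (11.0, 10.0), (10.0, 11.0), (11.0, 11.0)],
}
print(assign_or_flag((0.6, 0.4), clusters))  # class1
print(assign_or_flag((5.0, 5.0), clusters))  # None -> candidate for a new cluster
```

The threshold is a design choice: a smaller k creates new clusters more readily, while a larger k absorbs more outliers into existing ones. Density-based measures could replace the distance test where clusters are not roughly spherical.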
