Glm with caret package producing "missing values in resampled performance measures" - r-caret

I obtained the following code from this Stack Overflow question. caret train() predicts very different then predict.glm()
The following code is producing an error.
I am using caret 6.0-52.
library(car); library(caret); library(e1071)
#data import and preparation
data(Chile)
chile <- na.omit(Chile) #remove "na's"
chile <- chile[chile$vote == "Y" | chile$vote == "N" , ] #only "Y" and "N" required
chile$vote <- factor(chile$vote) #required to remove unwanted levels
chile$income <- factor(chile$income) # treat income as a factor
tc <- trainControl("cv", 2, savePredictions=T, classProbs=TRUE,
summaryFunction=twoClassSummary) #"cv" = cross-validation, 10-fold
fit <- train(chile$vote ~ chile$sex +
chile$education +
chile$statusquo ,
data = chile ,
method = "glm" ,
family = binomial ,
metric = "ROC",
trControl = tc)
Running this code produces the following error.
Something is wrong; all the ROC metric values are missing:
ROC Sens Spec
Min. : NA Min. :0.9354 Min. :0.9187
1st Qu.: NA 1st Qu.:0.9354 1st Qu.:0.9187
Median : NA Median :0.9354 Median :0.9187
Mean :NaN Mean :0.9354 Mean :0.9187
3rd Qu.: NA 3rd Qu.:0.9354 3rd Qu.:0.9187
Max. : NA Max. :0.9354 Max. :0.9187
NA's :1
Error in train.default(x, y, weights = w, ...) : Stopping
In addition: Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
Would anyone know what the issue is or can reproduce/ not reproduce this error. I've seen other answers to this error message that says this has to do with not having representation of classes in each cross validation fold but this isn't the issue as the number of folds is set to 2.

Looks like I needed to install and load the pROC package.
install.packages("pROC")
library(pROC)

You should install using
install.packages("caret", dependencies = c("Imports", "Depends", "Suggests"))
That gets most of the default packages. If there are specific modeling packages that are missing, the code usually prompts you to install them.

I know I'm late to the party, but I think you need to set classProbs = TRUE in train control.

You are using logistic regression when using the parameters method = "glm", family = binomial.
In this case, you must make sure that the target variable (chile$vote) has only 2 factor levels, because logistic regression only performs binary classification.
If the target has more than two labels, then you must set family = "multinomial"

Related

MCMC for multi coefficients with normal gaussian distribution

I have a linear model as follows: Acceleration= (C1V)+ C2(X- D) where D= alpha-(betaV)+(gammaM).
note that the values for V,X,M are given in dataset.
My goal is to run 350 times MCMC for each following coefficients: C1,C2,alpha,beta,gamma.
1- I have mean and standard deviation for C1,C2,alpha,beta,gamma.
2- All coefficients (C1,C2,alpha,beta,gamma) are normal distribution.
I have tried two methods to find MCMC for each of coefficients, one is by pymc3(which I'm not sure if I have done correctly), another is defining likelihood function based on the method mentioned in following link by Jonny Homfmeister (in my case, I changed the distribution from binomial to normal gaussian):
https://towardsdatascience.com/bayesian-inference-and-markov-chain-monte-carlo-sampling-in-python-bada1beabca7
The problem is that, after running MCMC for C1,C2,alpha,beta and gamma, and using the mean of posterior (output of MCMC) in my main model, I see that the absolute error has increased! it means the coefficients have not optimized by MCMC and my method does not work properly!
I do appreciate it if someone help me with correct algorithm for MCMC for Normal distribution.
####First method: pymc3##########
import pymc3 as pm
import scipy.stats as st
import arviz as az
for row in range(350):
X_c1 = st.norm(loc=-0.06, scale=0.47).rvs(size=100)
with pm.Model() as model:
prior = pm.Normal('c1', mu=-0.06, sd=0.47) ###### prior #####weights
obs = pm.Normal('obs', mu=prior, sd=0.47, observed=X_c1) #######likelihood
step = pm.Metropolis()
trace_c1 = pm.sample(draws=30, chains=2, step=step, return_inferencedata=True)
###calculate the mean of output (posterior distribution)#################3
mean_c1= az.summary(trace_c1, var_names=["c1"], round_to=2).iloc[0][['mean']]
mean_c1 = mean_c1.to_numpy()
Acceleration= (mean_C1*V)+ C2*(X- D) ######apply model
######second method in the given link##########
## Define the Likelihood P(x|p) - normal distribution
def likelihood(p):
return scipy.stats.norm.cdf(C1, loc=-0.06, scale=0.47)
def prior(p):
return scipy.stats.norm.pdf(p)
def acceptance_ratio(p, p_new):
# Return R, using the functions we created before
return min(1, ((likelihood(p_new) / likelihood(p)) * (prior(p_new) / prior(p))))
p = np.random.normal(C1, 0.47) # Initialzie a value of p
#### Define model parameters
n_samples = 790 ################# I HAVE NO IDEA HOW TO CHOOSE THIS VALUE????###########
burn_in = 99
lag = 2
##### Create the MCMC loop
for i in range(n_samples):
p_new = np.random.random_sample() ###Propose a new value of p randomly from a normal distribution
R = acceptance_ratio(p, p_new) #### Compute acceptance probability
u= np.random.uniform(0, 1) ####Draw random sample to compare R to
if u < R: ##### If R is greater than u, accept the new value of p
p = p_new
if i > burn_in and i%lag == 0: #### Record values after burn in - how often determined by lag
results_1.append(p)

grid of mtry values while training random forests with ranger

I am working with a subset of the 'Ames Housing' dataset and have originally 17 features. Using the 'recipes' package, I have engineered the original set of features and created dummy variables for nominal predictors with the following code. That has resulted in 35 features in the 'baked_train' dataset below.
blueprint <- recipe(Sale_Price ~ ., data = _train) %>%
step_nzv(Street, Utilities, Pool_Area, Screen_Porch, Misc_Val) %>%
step_impute_knn(Gr_Liv_Area) %>%
step_integer(Overall_Qual) %>%
step_normalize(all_numeric_predictors()) %>%
step_other(Neighborhood, threshold = 0.01, other = "other") %>%
step_dummy(all_nominal_predictors(), one_hot = FALSE)
prepare <- prep(blueprint, data = ames_train)
baked_train <- bake(prepare, new_data = ames_train)
baked_test <- bake(prepare, new_data = ames_test)
Now, I am trying to train random forests with the 'ranger' package using the following code.
cv_specs <- trainControl(method = "repeatedcv", number = 5, repeats = 5)
param_grid_rf <- expand.grid(mtry = seq(1, 35, 1),
splitrule = "variance",
min.node.size = 2)
rf_cv <- train(blueprint,
data = ames_train,
method = "ranger",
trControl = cv_specs,
tuneGrid = param_grid_rf,
metric = "RMSE")
I have set the grid of 'mtry' values based on the number of features in the 'baked_train' data. It is my understanding that 'caret' will apply the blueprint within each resample of 'ames_train' creating a baked version at each CV step.
The text Hands-On Machine Learning with R by Boehmke & Greenwell says on section 3.8.3,
Consequently, the goal is to develop our blueprint, then within each resample iteration we want to apply prep() and bake() to our resample training and validation data. Luckily, the caret package simplifies this process. We only need to specify the blueprint and caret will automatically prepare and bake within each resample.
However, when I run the code above I get an error,
mtry can not be larger than number of variables in data. Ranger will EXIT now.
I get the same error when I specify 'tuneLength = 20' instead of the 'tuneGrid'. Although the code works fine when the grid of 'mtry' values is specified to be from 1 to 17 (the number of features in the original training data 'ames_train').
When I specify a grid of 'mtry' values from 1 to 17, info about the final model after CV is shown below. Notice that it mentions Number of independent variables: 35 which corresponds to the 'baked_train' data, although specifying a grid from 1 to 35 throws an error.
Type: Regression
Number of trees: 500
Sample size: 618
Number of independent variables: 35
Mtry: 15
Target node size: 2
Variable importance mode: impurity
Splitrule: variance
OOB prediction error (MSE): 995351989
R squared (OOB): 0.8412147
What am I missing here? Specifically, why do I have to specify the number of features in 'ames_train' instead of 'baked_train' when essentially 'caret' is supposed to create a baked version before fitting and evaluating the model for each resample?
Thanks.

Getting a specific random forest variable importance measure from mlr package's resample function

I am using mlr package's resample() function to subsample a random forest model 4000 times (the code snippet below).
As you can see, to create random forest models within resample() I'm using randomForest package.
I want to get random forest model's importance results (mean decrease in accuracy over all classes) for each of the subsample iterations. What I can get right now as the importance measure is the mean decrease in Gini index.
What I can see from the source code of mlr, getFeatureImportanceLearner.classif.randomForest() function (line 69) in makeRLearner.classif.randomForest uses randomForest::importance() function (line 83) to get importance value from the resulting object of randomForest class. But as you can see from the source code (line 73) it uses 2L as the default value. I want it to use 1L (line 75) as the value (mean decrease in accuracy).
How can I pass the value of 2L to resample() function, ("extract = getFeatureImportance" line in the code below) so that getFeatureImportanceLearner.classif.randomForest() function gets that value and sets ctrl$type = 2L (line 73)?
rf_task <- makeClassifTask(id = 'task',
data = data[, -1], target = 'target_var',
positive = 'positive_var')
rf_learner <- makeLearner('classif.randomForest', id = 'random forest',
par.vals = list(ntree = 1000, importance = TRUE),
predict.type = 'prob')
base_subsample_instance <- makeResampleInstance(rf_boot_desc, rf_task)
rf_subsample_result <- resample(rf_learner, rf_task,
base_subsample_instance,
extract = getFeatureImportance,
measures = list(acc, auc, tpr, tnr,
ppv, npv, f1, brier))
My solution: Downloaded source code of the mlr package. Changed the source file line 73 to 1L (https://github.com/mlr-org/mlr/blob/v2.15.0/R/RLearner_classif_randomForest.R). Installed the package from command line and used it. Not an optimal solution but a solution.
You provide a lot of specifics that do not actually relate to your question, at least how I understood it.
So I wrote a simple MWE that includes the answer.
The idea is that you have to write a short wrapper for getFeatureImportance so that you can pass your own arguments. Fans of purrr can do that with purrr::partial(getFeatureImportance, type = 2) but here I wrote myExtractor manually.
library(mlr)
rf_learner <- makeLearner('classif.randomForest', id = 'random forest',
par.vals = list(ntree = 100, importance = TRUE),
predict.type = 'prob')
measures = list(acc, auc, tpr, tnr,
ppv, npv, f1, brier)
myExtractor = function(.model, ...) {
getFeatureImportance(.model, type = 2, ...)
}
res = resample(rf_learner, sonar.task, cv10,
measures = measures, extract = myExtractor)
# first feature importance result:
res$extract[[1]]
# all values in a matrix:
sapply(res$extract, function(x) x$res)
If you want to do a bootstraped learenr maybe you should also have a look at makeBaggingWrapper instead of solving this problem through resample.

Intermittent error message with ROC metric for rfe in caret

I am using rfe in caret to perform feature selection based on the ROC metric from twoClassSummary for a logistic regression model created from an imbalanced dataset (approx 25:1). More often than not I get an error message. Sometimes however I do not.
On the two occasions that the code has run without error (giving a believable result), I have run the exact same rfe line again immediately, and it has failed with this error message:
Error in { : task 1 failed - "undefined columns selected"
(Note that the task number can also vary up to 4.)
myLRFuncs <- lrFuncs
myLRFuncs$summary <- twoClassSummary
rfe.ctrl <- rfeControl(functions = myLRFuncs,
method = "cv",
number = 5,
verbose = TRUE)
train.ctrl <- trainControl(method="none",
classProbs=TRUE,
summaryFunction=twoClassSummary,
verbose=TRUE)
glm_rfe_ROC <- rfe(x=train[,-c("outcome")],y=train$outcome,
sizes=c(1:5, 10, 15, 20, 25),
rfeControl=rfe.ctrl,
method="glm",
metric="ROC",
trControl=train.ctrl)
I am aware that I could use lasso or gradient boosted regression, and so avoid rfe, but I plan to use this approach for a wide range of additional algorithms, so would really like to have this working reliably.
The error seems to be related to how you are subsetting your predictors:
> train <- data.frame(outcome = 1:10, x1 = 1:10, x2 = 1:10)
> train[,-c("outcome")]
Error in -c("outcome") : invalid argument to unary operator
> train(x = train[,-c("outcome")], y = train$outcome)
Error in -c("outcome") : invalid argument to unary operator
Max

What is the meaning of the GridSearchCV best_score_ attribute? (the value is different from the mean of the cross validation array)

I'm confused with the results, probably I'm not getting the concept of cross validation and GridSearch right. I had followed the logic behind this post:
https://randomforests.wordpress.com/2014/02/02/basics-of-k-fold-cross-validation-and-gridsearchcv-in-scikit-learn/
argd = CommandLineParser(argv)
folder,fname=argd['dir'],argd['fname']
df = pd.read_csv('../../'+folder+'/Results/'+fname, sep=";")
explanatory_variable_columns = set(df.columns.values)
response_variable_column = df['A']
explanatory_variable_columns.remove('A')
y = np.array([1 if e else 0 for e in response_variable_column])
X =df[list(explanatory_variable_columns)].as_matrix()
kf_total = KFold(len(X), n_folds=5, indices=True, shuffle=True, random_state=4)
dt=DecisionTreeClassifier(criterion='entropy')
min_samples_split_range=[x for x in range(1,20)]
dtgs=GridSearchCV(estimator=dt, param_grid=dict(min_samples_split=min_samples_split_range), n_jobs=1)
scores=[dtgs.fit(X[train],y[train]).score(X[test],y[test]) for train, test in kf_total]
# SAME AS DOING: cross_validation.cross_val_score(dtgs, X, y, cv=kf_total, n_jobs = 1)
print scores
print np.mean(scores)
print dtgs.best_score_
RESULTS OBTAINED:
# score [0.81818181818181823, 0.78181818181818186, 0.7592592592592593, 0.7592592592592593, 0.72222222222222221]
# mean score 0.768
# .best_score_ 0.683486238532
ADDITIONAL NOTE:
I ran it using another combination of the explanatory variables (using only some of them) and I got the inverse problem. Now the .best_score_ is higher than all the values in the cross validation array.
# score [0.74545454545454548, 0.70909090909090911, 0.79629629629629628, 0.7407407407407407, 0.64814814814814814]
# mean score 0.728
# .best_score_ 0.802752293578
The code is confusing several things.
dtgs.fit(X[train_],y[train_]) does internal 3-fold cross-validation for every parameter combination from param_grid, producing a grid of 20 results, which you can open by calling dtgs.grid_scores_.
[dtgs.fit(X[train_],y[train_]).score(X[test],y[test]) for train_, test in kf_total] Therefore this line fits grid search five times and then takes its score using 5-Fold cross validation. The result is the array of scores of 5-Fold validation.
And when you call dtgs.best_score_ you get the best score in the grid of the results of 3-fold validation of hyperparameters for the last fit (of 5).

Resources