Intermittent error message with ROC metric for rfe in caret

I am using rfe in caret to perform feature selection, based on the ROC metric from twoClassSummary, for a logistic regression model built on an imbalanced dataset (approx 25:1). More often than not I get an error message, but sometimes I do not.
On the two occasions that the code ran without error (giving a believable result), I immediately reran the exact same rfe line, and it failed with this error message:
Error in { : task 1 failed - "undefined columns selected"
(Note that the task number varies; I have seen values up to 4.)
myLRFuncs <- lrFuncs
myLRFuncs$summary <- twoClassSummary

rfe.ctrl <- rfeControl(functions = myLRFuncs,
                       method = "cv",
                       number = 5,
                       verbose = TRUE)

train.ctrl <- trainControl(method = "none",
                           classProbs = TRUE,
                           summaryFunction = twoClassSummary,
                           verbose = TRUE)

glm_rfe_ROC <- rfe(x = train[, -c("outcome")], y = train$outcome,
                   sizes = c(1:5, 10, 15, 20, 25),
                   rfeControl = rfe.ctrl,
                   method = "glm",
                   metric = "ROC",
                   trControl = train.ctrl)
I am aware that I could use the lasso or gradient boosted regression and so avoid rfe altogether, but I plan to use this approach with a wide range of additional algorithms, so I would really like to have it working reliably.

The error seems to be related to how you are subsetting your predictors:
> train <- data.frame(outcome = 1:10, x1 = 1:10, x2 = 1:10)
> train[,-c("outcome")]
Error in -c("outcome") : invalid argument to unary operator
> train(x = train[,-c("outcome")], y = train$outcome)
Error in -c("outcome") : invalid argument to unary operator
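A minimal sketch of the usual fixes (using the toy data frame above; base R cannot negate a character vector, so the column has to be dropped by name or position):

# drop the outcome column by name
train[, setdiff(names(train), "outcome")]

# or by position
train[, -which(names(train) == "outcome")]

# or with a logical index
train[, names(train) != "outcome"]

Any of these can be passed as the x argument to rfe().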
Max

Related

Leave-one-out cross-validation error with the randomForest package in RStudio

I am creating an LOOCV loop using the randomForest package. I have adapted the following code from this link (https://stats.stackexchange.com/questions/459293/loocv-in-caret-package-randomforest-example-not-unique-results), but I am unable to get it to run successfully.
Here is the code that I am running but on the iris dataset.
irisdata <- iris[1:150, ]
predictionsiris <- 1:150
for (k in 1:150) {
  set.seed(123)
  predictioniris[k] <- predict(randomForest(Petal.Width ~ Sepal.Length,
                                            data = irisdata[-k],
                                            ntree = 10),
                               newdata = irisdata[k, , drop = FALSE])[2]
}
What I would expect to happen is for it to run the random forest model on all but one row and then use that one row to test the model.
However, when I run this code, I get the following error:
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'object' in selecting a method for function 'predict': object 'Sepal.Length' not found
Any suggestions? I have been messing around with LOOCV code for the past two days, including adapting the code on this page (Compute Random Forest with a leave one ID out cross validation in R) and running the following:
iris %>%
  mutate(ID = 1:516)
loocv <- NULL
for (i in iris$ID) {
  test[[i]] <- slice(iris, i)
  train[[i]] <- slice(iris, i + 1:516)
  rf <- randomForest(Sepal.Length ~ ., data = train, ntree = 10, importance = TRUE)
  loocv[[i]] <- predict(rf, newdata = test)
}
but I have had no success. Any help would be appreciated.
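For reference, a minimal working LOOCV loop along the lines the question describes (a sketch, assuming the goal is to hold out one row per iteration). The original error comes from irisdata[-k], which drops the k-th column rather than the k-th row, so Sepal.Length disappears from the training data on the first iteration:

library(randomForest)

irisdata <- iris
n <- nrow(irisdata)
predictions <- numeric(n)

for (k in seq_len(n)) {
  set.seed(123)
  # note the comma: irisdata[-k, ] drops row k; irisdata[-k] drops column k
  fit <- randomForest(Petal.Width ~ Sepal.Length,
                      data = irisdata[-k, ],
                      ntree = 10)
  # predict on the single held-out row
  predictions[k] <- predict(fit, newdata = irisdata[k, , drop = FALSE])
}

# LOOCV mean squared error
mean((predictions - irisdata$Petal.Width)^2)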

grid of mtry values while training random forests with ranger

I am working with a subset of the 'Ames Housing' dataset, which originally has 17 features. Using the 'recipes' package, I have engineered the original set of features and created dummy variables for nominal predictors with the following code. That has resulted in 35 features in the 'baked_train' dataset below.
blueprint <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_nzv(Street, Utilities, Pool_Area, Screen_Porch, Misc_Val) %>%
  step_impute_knn(Gr_Liv_Area) %>%
  step_integer(Overall_Qual) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_other(Neighborhood, threshold = 0.01, other = "other") %>%
  step_dummy(all_nominal_predictors(), one_hot = FALSE)

prepare <- prep(blueprint, training = ames_train)
baked_train <- bake(prepare, new_data = ames_train)
baked_test <- bake(prepare, new_data = ames_test)
Now, I am trying to train random forests with the 'ranger' package using the following code.
cv_specs <- trainControl(method = "repeatedcv", number = 5, repeats = 5)

param_grid_rf <- expand.grid(mtry = seq(1, 35, 1),
                             splitrule = "variance",
                             min.node.size = 2)

rf_cv <- train(blueprint,
               data = ames_train,
               method = "ranger",
               trControl = cv_specs,
               tuneGrid = param_grid_rf,
               metric = "RMSE")
I have set the grid of 'mtry' values based on the number of features in the 'baked_train' data. It is my understanding that 'caret' will apply the blueprint within each resample of 'ames_train', creating a baked version at each CV step.
The text Hands-On Machine Learning with R by Boehmke & Greenwell says in section 3.8.3:
Consequently, the goal is to develop our blueprint, then within each resample iteration we want to apply prep() and bake() to our resample training and validation data. Luckily, the caret package simplifies this process. We only need to specify the blueprint and caret will automatically prepare and bake within each resample.
However, when I run the code above, I get an error:
mtry can not be larger than number of variables in data. Ranger will EXIT now.
I get the same error when I specify 'tuneLength = 20' instead of the 'tuneGrid', although the code works fine when the grid of 'mtry' values is specified to be from 1 to 17 (the number of features in the original training data 'ames_train').
When I specify a grid of 'mtry' values from 1 to 17, info about the final model after CV is shown below. Notice that it mentions Number of independent variables: 35, which corresponds to the 'baked_train' data, even though specifying a grid from 1 to 35 throws an error.
Type: Regression
Number of trees: 500
Sample size: 618
Number of independent variables: 35
Mtry: 15
Target node size: 2
Variable importance mode: impurity
Splitrule: variance
OOB prediction error (MSE): 995351989
R squared (OOB): 0.8412147
What am I missing here? Specifically, why do I have to specify the number of features in 'ames_train' instead of 'baked_train' when essentially 'caret' is supposed to create a baked version before fitting and evaluating the model for each resample?
Thanks.
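For what it's worth, one workaround sketch (an assumption about a usable fallback, not an explanation of caret's behavior) is to train on the pre-baked data directly, so the mtry grid can legitimately range over all 35 dummy-encoded predictors. The trade-off is that the preprocessing is then applied once outside the resampling loop, which is exactly what the prep/bake-within-each-resample advice quoted above is meant to avoid:

rf_cv_baked <- train(Sale_Price ~ .,
                     data = baked_train,  # already processed, 35 predictors
                     method = "ranger",
                     trControl = cv_specs,
                     tuneGrid = param_grid_rf,
                     metric = "RMSE")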

How to fix this HoltWinters Prediction code?

Hi, I get this error message when I run this code:
model <- forecast:::forecast.HoltWinters(mod, h=(length(data.ts)-length(dataTrain)))
Error in copy_msts(object$x, fitted) :
x and y should have the same number of observations
In addition: There were 50 or more warnings (use warnings() to see the first 50)
And this error from another attempt:
model <- hw(dataTrain, initial = "optimal", h=(length(data.ts)-length(dataTrain)))
Error in hw(dataTrain, initial = "optimal", h = (length(data.ts) - length(dataTrain))) :
I need at least 15 observations to estimate seasonality.
This is the code I used to predict the organic traffic:
# Training and Test Split
[the train/test split code was posted as a screenshot and is not reproduced here]
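For context, here is a minimal sketch of a Holt-Winters train/test split on a toy monthly series (the object names data.ts and dataTrain follow the question; the window() call is an assumption about how the split should be done). A common pitfall is splitting a ts with [ ] indexing, which strips the frequency attribute and can leave hw() with too little seasonal information:

library(forecast)

set.seed(42)
# toy monthly series: 4 years with a clear seasonal pattern
data.ts <- ts(100 + 10 * sin(2 * pi * (1:48) / 12) + rnorm(48),
              frequency = 12, start = c(2019, 1))

# window() keeps the ts attributes; data.ts[1:36] would return a plain vector
dataTrain <- window(data.ts, end = c(2021, 12))

mod <- HoltWinters(dataTrain)
fc  <- forecast(mod, h = length(data.ts) - length(dataTrain))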

Bootstrapping a pooled regression

I am trying to bootstrap panel data, run a pooled panel regression on each bootstrap sample, and collect the coefficients from each estimation. Here is the code I have so far:
df <- data.frame(id = c(rep('1', 57), rep('2', 57), rep('3', 57),
                        rep('4', 57), rep('5', 57)),
                 YEAR = c(1961:2017), data = Project2)
N <- length(Project2)  # counts the number of observations
B <- 10                # number of times to recompute the estimate
stor.r2 <- matrix(rep(0, B), nrow = B, ncol = 20)  # storage matrix for the coefficients
for (i in 1:B) {
  newdata.df <- df[sample(nrow(df), 10, replace = TRUE), ]
  wols.pop <- plm(pop ~ Il + In + Io + Oh + Mis + TDum
                  + lag(cmt.) + lag(pop.) + lag(lv.) + lag(cr.)
                  + lag(cmt..1) + lag(pop..1) + lag(lv..1) + lag(cr..1)
                  + lag(cpiplus) + lag(vlyeg) + 0,
                  data = newdata.df, weights = wls.pop / 1000, model = "pooling")
  stor.r2[i] <- summary(wols.pop)$r.squared
}
I keep getting the following error message:
Error in `row.names<-.data.frame`(`*tmp*`, value = orig_rownames[as.numeric(row.names(data))]) :
  duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names':
I am not sure how to fix my code.
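For what it's worth, a sketch of one way around the duplicate-row-names problem (on a toy panel, since the variables inside the question's plm() call are not defined here): sampling rows with replacement duplicates the (id, YEAR) pairs that plm uses as its index, so a block bootstrap that resamples whole ids and relabels them keeps both the row names and the panel index unique:

library(plm)

set.seed(1)
# toy panel: 5 ids x 57 years, stand-ins for the question's data
df <- data.frame(id   = rep(as.character(1:5), each = 57),
                 YEAR = rep(1961:2017, times = 5),
                 y    = rnorm(285),
                 x1   = rnorm(285))

B <- 10
stor.r2 <- numeric(B)

for (i in 1:B) {
  # resample whole individuals (block bootstrap over ids)
  ids <- sample(unique(df$id), replace = TRUE)
  boot.df <- do.call(rbind, lapply(seq_along(ids), function(j) {
    block <- df[df$id == ids[j], ]
    block$id <- paste0("boot", j)  # relabel so repeated ids do not collide
    block
  }))
  rownames(boot.df) <- NULL        # avoid duplicate 'row.names'

  fit <- plm(y ~ x1, data = boot.df, index = c("id", "YEAR"), model = "pooling")
  stor.r2[i] <- summary(fit)$r.squared["rsq"]
}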

Glm with caret package producing "missing values in resampled performance measures"

I obtained the following code from the Stack Overflow question "caret train() predicts very different then predict.glm()".
The following code is producing an error.
I am using caret 6.0-52.
library(car); library(caret); library(e1071)

# data import and preparation
data(Chile)
chile <- na.omit(Chile)  # remove NAs
chile <- chile[chile$vote == "Y" | chile$vote == "N", ]  # only "Y" and "N" required
chile$vote <- factor(chile$vote)  # required to remove unwanted levels
chile$income <- factor(chile$income)  # treat income as a factor

tc <- trainControl("cv", 2, savePredictions = TRUE, classProbs = TRUE,
                   summaryFunction = twoClassSummary)  # "cv" = cross-validation, 2-fold

fit <- train(chile$vote ~ chile$sex +
               chile$education +
               chile$statusquo,
             data = chile,
             method = "glm",
             family = binomial,
             metric = "ROC",
             trControl = tc)
Running this code produces the following error.
Something is wrong; all the ROC metric values are missing:
ROC Sens Spec
Min. : NA Min. :0.9354 Min. :0.9187
1st Qu.: NA 1st Qu.:0.9354 1st Qu.:0.9187
Median : NA Median :0.9354 Median :0.9187
Mean :NaN Mean :0.9354 Mean :0.9187
3rd Qu.: NA 3rd Qu.:0.9354 3rd Qu.:0.9187
Max. : NA Max. :0.9354 Max. :0.9187
NA's :1
Error in train.default(x, y, weights = w, ...) : Stopping
In addition: Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
Would anyone know what the issue is, or can anyone reproduce (or not reproduce) this error? I've seen other answers to this error message saying it has to do with classes not being represented in each cross-validation fold, but that isn't the issue here, as the number of folds is set to 2.
Looks like I needed to install and load the pROC package.
install.packages("pROC")
library(pROC)
You should install using
install.packages("caret", dependencies = c("Imports", "Depends", "Suggests"))
That gets most of the default packages. If there are specific modeling packages that are missing, the code usually prompts you to install them.
I know I'm late to the party, but I think you need to set classProbs = TRUE in train control.
You are using logistic regression when using the parameters method = "glm", family = binomial.
In this case, you must make sure that the target variable (chile$vote) has only 2 factor levels, because logistic regression only performs binary classification.
If the target has more than two labels, then you must set family = "multinomial".
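For completeness, a small sketch of the same train() call written with bare column names in the formula (the accepted fix above was installing pROC, but using chile$... inside a formula together with data = chile is a known source of subtle problems in caret):

fit <- train(vote ~ sex + education + statusquo,
             data = chile,
             method = "glm",
             family = binomial,
             metric = "ROC",
             trControl = tc)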
