Leave one out cross validation RStudio randomForest package error - random-forest

I am creating a LOOCV loop using the randomForest package. I adapted the following code from this link (https://stats.stackexchange.com/questions/459293/loocv-in-caret-package-randomforest-example-not-unique-results), but I have not been able to get it to run successfully.
Here is the code that I am running, but on the iris dataset.
irisdata <- iris[1:150,]
predictionsiris <- 1:150
for (k in 1:150){
  set.seed(123)
  predictioniris[k] <- predict(randomForest(Petal.Width ~ Sepal.Length, data = irisdata[-k], ntree = 10), newdata = irisdata[k,,drop=F])[2]
}
What I would expect to happen is for it to run the random forest model on all but one row and then use that one row to test the model.
However, when I run this code, I get the following error:
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'object' in selecting a method for function 'predict': object 'Sepal.Length' not found
Any suggestions? I have been experimenting with LOOCV code for the past two days, including adapting the code on this page (Compute Random Forest with a leave one ID out cross validation in R) and running the following:
iris %>%
  mutate(ID = 1:516)
loocv <- NULL
for(i in iris$ID){
  test[[i]] <- slice(iris, i)
  train[[i]] <- slice(iris, i+1:516)
  rf <- randomForest(Sepal.Length ~., data = train, ntree = 10, importance = TRUE)
  loocv[[i]] <- predict(rf, newdata = test)
}
but I have had no success. Any help would be appreciated.
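A minimal LOOCV sketch along these lines (assuming the intent is to predict Petal.Width from Sepal.Length, and that rows should be dropped with irisdata[-k, ] rather than irisdata[-k], which removes a column instead of a row) might look like:
library(randomForest)

# LOOCV sketch: leave out row k, fit on the rest, predict the held-out row
irisdata <- iris
predictionsiris <- numeric(nrow(irisdata))

for (k in seq_len(nrow(irisdata))) {
  set.seed(123)
  fit <- randomForest(Petal.Width ~ Sepal.Length,
                      data = irisdata[-k, ],            # drop row k, not column k
                      ntree = 10)
  predictionsiris[k] <- predict(fit, newdata = irisdata[k, , drop = FALSE])
}

# leave-one-out mean squared error
mean((predictionsiris - irisdata$Petal.Width)^2)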

Related

How to fix this HoltWinters Prediction code?

Hi, I get this error message when I run this code:
model <- forecast:::forecast.HoltWinters(mod, h=(length(data.ts)-length(dataTrain)))
Error in copy_msts(object$x, fitted) :
x and y should have the same number of observations
In addition: There were 50 or more warnings (use warnings() to see the first 50)
And I get another error with this code:
model <- hw(dataTrain, initial = "optimal", h=(length(data.ts)-length(dataTrain)))
Error in hw(dataTrain, initial = "optimal", h = (length(data.ts) - length(dataTrain))) :
I need at least 15 observations to estimate seasonality.
This is the code I used to predict the organic traffic
#Training and Test Split
[training and test split code posted as a screenshot]
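Since the split code is only available as a screenshot, here is a minimal stand-in sketch of a train/test split and Holt-Winters forecast on a built-in monthly series (not the organic-traffic data); hw() needs at least a couple of full seasonal cycles of observations to estimate seasonality:
library(forecast)

# stand-in monthly series (frequency = 12)
data.ts   <- AirPassengers
dataTrain <- window(data.ts, end = c(1958, 12))    # training window
dataTest  <- window(data.ts, start = c(1959, 1))   # held-out window

# Holt-Winters forecast over the length of the test window
fc <- hw(dataTrain, initial = "optimal", h = length(dataTest))
accuracy(fc, dataTest)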

Overcoming compatibility issues with using iml from h2o models

I am unable to reproduce the only example I can find of using h2o with iml (https://www.r-bloggers.com/2018/08/iml-and-h2o-machine-learning-model-interpretability-and-feature-explanation/) as detailed here (Error when extracting variable importance with FeatureImp$new and H2O). Can anyone point to a workaround or other examples of using iml with h2o?
Reproducible example:
library(rsample) # data splitting
library(ggplot2) # allows extension of visualizations
library(dplyr) # basic data transformation
library(h2o) # machine learning modeling
library(iml) # ML interpretation
library(modeldata) #attrition data
# initialize h2o session
h2o.no_progress()
h2o.init()
# classification data
data("attrition", package = "modeldata")
df <- rsample::attrition %>%
mutate_if(is.ordered, factor, ordered = FALSE) %>%
mutate(Attrition = recode(Attrition, "Yes" = "1", "No" = "0") %>% factor(levels = c("1", "0")))
# convert to h2o object
df.h2o <- as.h2o(df)
# create train, validation, and test splits
set.seed(123)
splits <- h2o.splitFrame(df.h2o, ratios = c(.7, .15), destination_frames =
c("train","valid","test"))
names(splits) <- c("train","valid","test")
# variable names for response & features
y <- "Attrition"
x <- setdiff(names(df), y)
# elastic net model
glm <- h2o.glm(
x = x,
y = y,
training_frame = splits$train,
validation_frame = splits$valid,
family = "binomial",
seed = 123
)
# 1. create a data frame with just the features
features <- as.data.frame(splits$valid) %>% select(-Attrition)
# 2. Create a vector with the actual responses
response <- as.numeric(as.vector(splits$valid$Attrition))
# 3. Create custom predict function that returns the predicted values as a
# vector (probability of attrition in our example)
pred <- function(model, newdata) {
results <- as.data.frame(h2o.predict(model, as.h2o(newdata)))
return(results[[3L]])
}
# create predictor object to pass to explainer functions
predictor.glm <- Predictor$new(
model = glm,
data = features,
y = response,
predict.fun = pred,
class = "classification"
)
imp.glm <- FeatureImp$new(predictor.glm, loss = "mse")
Error obtained:
Error in `[.data.frame`(prediction, , self$class, drop = FALSE): undefined columns
selected
traceback()
1. FeatureImp$new(predictor.glm, loss = "mse")
2. .subset2(public_bind_env, "initialize")(...)
3. private$run.prediction(private$sampler$X)
4. self$predictor$predict(data.frame(dataDesign))
5. prediction[, self$class, drop = FALSE]
6. `[.data.frame`(prediction, , self$class, drop = FALSE)
7. stop("undefined columns selected")
The iml package documentation says that the class argument is "The class column to be returned". When you set class = "classification", it looks for a column called "classification", which is not found. Judging by GitHub, the iml package has gone through a fair amount of development since that blog post, so some of the old functionality may no longer be backwards compatible.
After reading through the package documentation, I think you might want to try something like:
predictor.glm <- Predictor$new(
model = glm,
data = features,
y = "Attrition",
predict.function = pred,
type = "prob"
)
# check ability to predict first
check <- predictor.glm$predict(features)
print(check)
Even better might be to leverage H2O's extensive functionality around machine learning interpretability.
h2o.varimp(glm) will give the user the variable importance for each feature
h2o.varimp_plot(glm, 10) will render a graphic showing the relative importance of each feature.
h2o.explain(glm, as.h2o(features)) is a wrapper for the explainability interface and will by default provide the confusion matrix (in this case) as well as variable importance and partial dependence plots for each feature.
For certain algorithms (e.g., tree-based methods), h2o.shap_explain_row_plot() and h2o.shap_summary_plot() will provide the shap contributions.
The h2o-3 docs might be useful for exploring these further.
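For illustration, with the glm model and features data frame from the example above, those calls might look like this (the SHAP functions apply to tree-based models such as a GBM, so they are shown commented out):
h2o.varimp(glm)                        # variable importance table
h2o.varimp_plot(glm, 10)               # plot the 10 most important features
h2o.explain(glm, as.h2o(features))     # explainability wrapper (plots + tables)

# for a tree-based model, e.g. gbm <- h2o.gbm(x, y, training_frame = splits$train):
# h2o.shap_summary_plot(gbm, as.h2o(features))
# h2o.shap_explain_row_plot(gbm, as.h2o(features), row_index = 1)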

ValueError: Found input variables with inconsistent numbers of samples : [1, 14048]

I am trying to run Multinomial Naive Bayes and am receiving the error below. Sample training data is given; the test data has exactly the same format.
from sklearn.naive_bayes import MultinomialNB

def main():
    text_train, targets_train = read_data('train')
    text_test, targets_test = read_data('test')
    classifier1 = MultinomialNB()
    classifier1.fit(text_train, targets_train)
    prediction1 = classifier1.predict(text_test)
Sample Data:
Train:
category, text
Family, I love you Mom
University, I hate this course
I sometimes run into this error, and most of the time the reason is that the input data should be a 2-D array, for example when building a regression model. If you write the following code, you will get this error:
import numpy as np
from sklearn import linear_model

a = np.array([1,2,3]).T   # still 1-D: .T has no effect on a 1-D array
b = np.array([4,5,6]).T
regr = linear_model.LinearRegression()
regr.fit(a, b)            # raises a ValueError because a is not 2-D
To fix it, add an extra pair of brackets so that the arrays are 2-D:
a = np.array([[1,2,3]]).T   # shape (3, 1): a 2-D column vector
b = np.array([[4,5,6]]).T
Then it will run normally.
This is just from my own experience, a reference rather than a definitive answer.
I am a Chinese student who is still learning English and Python.

Intermittent error message with ROC metric for rfe in caret

I am using rfe in caret to perform feature selection based on the ROC metric from twoClassSummary, for a logistic regression model built on an imbalanced dataset (approx 25:1). More often than not I get an error message; sometimes, however, I do not.
On the two occasions that the code has run without error (giving a believable result), I have run the exact same rfe line again immediately, and it has failed with this error message:
Error in { : task 1 failed - "undefined columns selected"
(Note that the task number can also vary up to 4.)
myLRFuncs <- lrFuncs
myLRFuncs$summary <- twoClassSummary
rfe.ctrl <- rfeControl(functions = myLRFuncs,
method = "cv",
number = 5,
verbose = TRUE)
train.ctrl <- trainControl(method="none",
classProbs=TRUE,
summaryFunction=twoClassSummary,
verbose=TRUE)
glm_rfe_ROC <- rfe(x=train[,-c("outcome")],y=train$outcome,
sizes=c(1:5, 10, 15, 20, 25),
rfeControl=rfe.ctrl,
method="glm",
metric="ROC",
trControl=train.ctrl)
I am aware that I could use lasso or gradient boosted regression, and so avoid rfe, but I plan to use this approach for a wide range of additional algorithms, so would really like to have this working reliably.
The error seems to be related to how you are subsetting your predictors:
> train <- data.frame(outcome = 1:10, x1 = 1:10, x2 = 1:10)
> train[,-c("outcome")]
Error in -c("outcome") : invalid argument to unary operator
> train(x = train[,-c("outcome")], y = train$outcome)
Error in -c("outcome") : invalid argument to unary operator
Max
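As an illustrative sketch (not from the original answer), the response column could be dropped by name instead of negating a character vector, for example:
# drop the outcome column by name rather than negating a character vector
x_train <- train[, setdiff(names(train), "outcome")]
# or: x_train <- subset(train, select = -outcome)

glm_rfe_ROC <- rfe(x = x_train, y = train$outcome,
                   sizes = c(1:5, 10, 15, 20, 25),
                   rfeControl = rfe.ctrl,
                   method = "glm",
                   metric = "ROC",
                   trControl = train.ctrl)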

Identifying outlying datapoints from residuals (GeoLight package)

I am analysing some data collected from a geolocator placed on a migratory bird. In a nutshell, my data are sunrise and sunset times, which are then used to determine position on the globe.
I am using the GeoLight package (http://cran.r-project.org/web/packages/GeoLight/GeoLight.pdf) to identify outlying data - specifically, I am using the loessFilter function, which fits a polynomial regression and identifies residuals that are greater than 3 interquartile ranges (specified by k when applying the function).
My problem is: the function returns graphs in which outlying datapoints are identified in red. However, there seems to be an issue with the TRUE/FALSE vector the code returns to indicate which points are outliers - every point comes back TRUE, even when outliers are identified on the plot.
I have therefore modified the function code to state which residuals are outliers.
However, when I then remove those rows from the original dataset and re-run the function, the points have not been removed. There is therefore some discrepancy between the residual indices and the rows of the original data: i.e. if the output states that residual 78 is an outlying point, removing row 78 from the original data does not remove that outlying datapoint.
I would very much appreciate some help with removing the outlying datapoints identified using the function. It seems like a very easy fix but I can't seem to figure it out.
Code for full function and data below
Thanks
Emma
log2$tFirst<-as.POSIXlt(log2$tFirst)
log2$tSecond<-as.POSIXlt(log2$tSecond)
CODE TO GET OUTLYING RESIDUALS
i.get.outliers <- function(residuals, k=3) {
  x <- residuals
  # x is a vector of residuals
  # k is a measure of how many interquartile ranges to take before saying that point is an outlier
  # it looks like 3 is a good preset for k
  QR <- quantile(x, probs = c(0.25, 0.75))
  IQR <- QR[2]-QR[1]
  Lower.band <- QR[1]-(k*IQR)
  Upper.Band <- QR[2]+(k*IQR)
  delete <- which(x<Lower.band | x>Upper.Band)
  return(as.vector(delete))
}
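For reference, a quick illustration of what this helper returns, using a built-in dataset rather than the geolocator data:
# indices of residuals lying more than k interquartile ranges outside the quartiles
res <- residuals(loess(dist ~ speed, data = cars))
i.get.outliers(res, k = 3)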
LOESS FILTER FUNCTION CODE
loessFilter <- function(tFirst, tSecond, type, k=3, plot=TRUE){
  tw <- data.frame(datetime=as.POSIXct(c(tFirst,tSecond),"UTC"), type=c(type,ifelse(type==1,2,1)))
  tw <- tw[!duplicated(tw$datetime),]
  tw <- tw[order(tw[,1]),]
  hours <- as.numeric(format(tw[,1],"%H"))+as.numeric(format(tw[,1],"%M"))/60
  for(t in 1:2){
    cor <- rep(NA, 24)
    for(i in 0:23){
      cor[i+1] <- max(abs((c(hours[tw$type==t][1],hours[tw$type==t])+i)%%24 -
                          (c(hours[tw$type==t],hours[tw$type==t][length(hours)])+i)%%24),na.rm=T)
    }
    hours[tw$type==t] <- (hours[tw$type==t] + (which.min(round(cor,2)))-1)%%24
  }
  dawn <- data.frame(id=1:sum(tw$type==1),
                     datetime=tw$datetime[tw$type==1],
                     type=tw$type[tw$type==1],
                     hours = hours[tw$type==1], filter=FALSE)
  dusk <- data.frame(id=1:sum(tw$type==2),
                     datetime=tw$datetime[tw$type==2],
                     type=tw$type[tw$type==2],
                     hours = hours[tw$type==2], filter=FALSE)
  for(d in seq(30,k,length=5)){
    predict.dawn <- predict(loess(dawn$hours[!dawn$filter]~as.numeric(dawn$datetime[!dawn$filter]),span=0.1))
    predict.dusk <- predict(loess(dusk$hours[!dusk$filter]~as.numeric(dusk$datetime[!dusk$filter]),span=0.1))
    del.dawn <- i.get.outliers(as.vector(residuals(loess(dawn$hours[!dawn$filter]~
                               as.numeric(dawn$datetime[!dawn$filter]),span=0.1))),k=d)
    del.dusk <- i.get.outliers(as.vector(residuals(loess(dusk$hours[!dusk$filter]~
                               as.numeric(dusk$datetime[!dusk$filter]),span=0.1))),k=d)
    if(length(del.dawn)>0) dawn$filter[!dawn$filter][del.dawn] <- TRUE
    if(length(del.dusk)>0) dusk$filter[!dusk$filter][del.dusk] <- TRUE
  }
  if(plot){
    par(mfrow=c(2,1),mar=c(3,3,0.5,3),oma=c(2,2,0,0))
    plot(dawn$datetime[dawn$type==1],dawn$hours[dawn$type==1],pch="+",cex=0.6,xlab="",ylab="",yaxt="n")
    lines(dawn$datetime[!dawn$filter], predict(loess(dawn$hours[!dawn$filter]~as.numeric(dawn$datetime[!dawn$filter]),span=0.1)), type="l")
    points(dawn$datetime[dawn$filter],dawn$hours[dawn$filter],col="red",pch="+",cex=1)
    axis(2,labels=F)
    mtext("Sunrise",4,line=1.2)
    plot(dusk$datetime[dusk$type==2],dusk$hours[dusk$type==2],pch="+",cex=0.6,xlab="",ylab="",yaxt="n")
    lines(dusk$datetime[!dusk$filter], predict(loess(dusk$hours[!dusk$filter]~as.numeric(dusk$datetime[!dusk$filter]),span=0.1)), type="l")
    points(dusk$datetime[dusk$filter],dusk$hours[dusk$filter],col="red",pch="+",cex=1)
    axis(2,labels=F)
    legend("bottomleft",c("Outside filter","Inside filter"),pch=c("+","+"),col=c("black","red"),
           bty="n",cex=0.8)
    mtext("Sunset",4,line=1.2)
    mtext("Time",1,outer=T)
    mtext("Sunrise/Sunset hours (rescaled)",2,outer=T)
  }
  all <- rbind(subset(dusk,filter),subset(dawn,filter))
  filter <- rep(FALSE,length(tFirst))
  filter[tFirst%in%all$datetime | tSecond%in%all$datetime] <- TRUE
  # original code:
  # return(!filter)
  # altered code to return outliers: return(del.dusk)
  # replaced with the line below to print the outlying points
  return(c("delete dawn",del.dawn,"delete dusk",del.dusk))
}
APPLY FUNCTION
loessFilter(log2$tFirst, log2$tSecond, type=1, k=4, plot=TRUE)
remove the values - need to remove both sunrise and sunset curves
log2b<-log2[-c(77,78,124,125),]
length(log2$tFirst)
length(log2b$tFirst)
repeat function to see if the values have gone
loessFilter(log2b$tFirst, log2b$tSecond, type=1, k=4, plot=TRUE)
outliers still there!!
HERE ARE THE DATA:
http://www.4shared.com/file/jxVuTsVHce/002_geolight.html
A bit too long to post the full data here and the example won't work with a dummy dataset :)
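One possible way to line the flagged points up with rows of the original data (a sketch, assuming the function's original return(!filter) line is restored so that it returns one logical value per row of log2, TRUE meaning "keep"):
# keep is TRUE for rows inside the filter, FALSE for flagged sunrise/sunset times
keep  <- loessFilter(log2$tFirst, log2$tSecond, type = 1, k = 4, plot = TRUE)
log2b <- log2[keep, ]

# re-run on the reduced data to check that the flagged points are gone
loessFilter(log2b$tFirst, log2b$tSecond, type = 1, k = 4, plot = TRUE)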
