Error with bagImpute and predict from caret package - r-caret

I have the following when error when trying to use the preProcess function from the caret package. The predict function works for knn and median imputation, but gives an error for bagging. How should I edit my call to the predict function.
Reproducible example:
data = iris
set.seed(1)
data = as.data.frame(lapply(data, function(cc) cc[ sample(c(TRUE, NA), prob = c(0.8, 0.2), size = length(cc), replace = TRUE) ]))
preprocess_values = preProcess(data, method = c("bagImpute"), verbose = TRUE)
data_new = predict(preprocess_values, data)
This gives the following error:
> data_new = predict(preprocess_values, data)
Error in UseMethod("predict") :
no applicable method for 'predict' applied to an object of class "NULL"

The preprocessing/imputation functions in caret work only for numerical variables.
As stated in the help of preProcess
x a matrix or data frame. Non-numeric predictors are allowed but will be ignored.
You most likely found a bug in the part that should ignore the non numerical variables which throws an uninformative error instead of ignoring them.
If you remove the factor variable the imputation works
library(caret)
df <- iris
set.seed(1)
df <- as.data.frame(lapply(data, function(cc) cc[ sample(c(TRUE, NA), prob = c(0.8, 0.2), size = length(cc), replace = TRUE) ]))
df <- df[,-5] #remove factor variable
preprocess_values <- preProcess(df, method = c("bagImpute"), verbose = TRUE)
data_new <- predict(preprocess_values, df)
The last line of code works but results in a bunch of warnings:
In cprob[tindx] + pred :
longer object length is not a multiple of shorter object length
These warnings are not from caret but from the internal call to ipred::bagging which is called internally by caret::preProcess. The cause for these errors are instances in the data where there are 3 NA values in a row, when they are removed
df <- df[rowSums(sapply(df, is.na)) != 3,]
preprocess_values <- preProcess(df, method = c("bagImpute"), verbose = TRUE)
data_new <- predict(preprocess_values, df)
the warnings disappear.
You should check out recipes, and specifically step_bagimpute, to overcome the above mentioned limitations.

Related

R: Error in predict.xgboost: Feature names stored in `object` and `newdata` are different

I wrote a script using xgboost to predict soil class for a certain area using data from field and satellite images. The script as below:
`
rm(list=ls())
library(xgboost)
library(caret)
library(raster)
library(sp)
library(rgeos)
library(ggplot2)
setwd("G:/DATA")
data <- read.csv('96PointsClay02finalone.csv')
head(data)
summary(data)
dim(data)
ras <- stack("Allindices04TIFF.tif")
names(ras) <- c("b1", "b2", "b3", "b4", "b5", "b6", "b7", "b10", "b11","DEM",
"R1011", "SCI", "SAVI", "NDVI", "NDSI", "NDSandI", "MBSI",
"GSI", "GSAVI", "EVI", "DryBSI", "BIL", "BI","SRCI")
set.seed(27) # set seed for generating random data.
# createDataPartition() function from the caret package to split the original dataset into a training and testing set and split data into training (80%) and testing set (20%)
parts = createDataPartition(data$Clay, p = .8, list = F)
train = data[parts, ]
test = data[-parts, ]
#define predictor and response variables in training set
train_x = data.matrix(train[, -1])
train_y = train[,1]
#define predictor and response variables in testing set
test_x = data.matrix(test[, -1])
test_y = test[, 1]
#define final training and testing sets
xgb_train = xgb.DMatrix(data = train_x, label = train_y)
xgb_test = xgb.DMatrix(data = test_x, label = test_y)
#defining a watchlist
watchlist = list(train=xgb_train, test=xgb_test)
#fit XGBoost model and display training and testing data at each iteartion
model = xgb.train(data = xgb_train, max.depth = 3, watchlist=watchlist, nrounds = 100)
#define final model
model_xgboost = xgboost(data = xgb_train, max.depth = 3, nrounds = 86, verbose = 0)
summary(model_xgboost)
#use model to make predictions on test data
pred_y = predict(model_xgboost, xgb_test)
# performance metrics on the test data
mean((test_y - pred_y)^2) #mse - Mean Squared Error
caret::RMSE(test_y, pred_y) #rmse - Root Mean Squared Error
y_test_mean = mean(test_y)
rmseE<- function(error)
{
sqrt(mean(error^2))
}
y = test_y
yhat = pred_y
rmseresult=rmseE(y-yhat)
(r2 = R2(yhat , y, form = "traditional"))
cat('The R-square of the test data is ', round(r2,4), ' and the RMSE is ', round(rmseresult,4), '\n')
#use model to make predictions on satellite image
result <- predict(model_xgboost, ras[1:(nrow(ras)*ncol(ras))])
#create a result raster
res <- raster(ras)
#fill in results and add a "1" to them (to get back to initial class numbering! - see above "Prepare data" for more information)
res <- setValues(res,result+1)
#Save the output .tif file into saved directory
writeRaster(res, "xgbmodel_output", format = "GTiff", overwrite=T)
`
The script works well till it reachs
result <- predict(model_xgboost, ras[1:(nrow(ras)*ncol(ras))])
it takes some time then gives this error:
Error in predict.xgb.Booster(model_xgboost, ras[1:(nrow(ras) * ncol(ras))]) :
Feature names stored in `object` and `newdata` are different!
I realize that I am doing something wrong in that line. However, I do not know how to apply the xgboost model to a raster image that represents my study area.
It would be highly appreciated if someone give a hand, enlightened me, and helped me solve this problem....
My data as csv and raster image can be found here.
Finally, I got the reason for this error.
It was my mistake as the number of columns in the traning data was not the same as in the number of layers in the satellite image.

Training random forest (ranger) using caret with custom F4 metric in R yields but after running full ,error showing undefined columns selected

library(MLmetrics)
library(caret)
library(doSNOW)
library(ranger)
data is called as the "bank additional" full from this enter link description here and then following code to generate data1
library(VIM)
data1<-hotdeck(data,variable=c('job','marital','education','default','housing','loan'),domain_var = "y",imp_var=FALSE)
#converting the categorical variables to factors as they should be
library(magrittr)
data1%<>%
mutate_at(colnames(data1)[grepl('factor|logical|character',sapply(data1,class))],factor)
Now, splitting
library(caret)
#spliting data into train test 70/30
set.seed(1234)
trainIndex<-createDataPartition(data1$y,p=0.7,times = 1,list = F)
train<-data1[trainIndex,-11]
test<-data1[-trainIndex,-11]
levels(train$y)
train$y = as.factor(train$y)
# train$y = factor(train$y,levels = c("yes","no"))
# train$y = relevel(train$y,ref="yes")
Here, i got an idea of how to create F1 metric in Training Model in Caret Using F1 Metric
and using fbeta score formula i created f1_val; now i can't understand what lev,obs and pred are indicating . in my train dataset only column y showing data$obs , but no data$pred . So, is following error is due to this? and how to rectify this?
f1 <- function (data, lev = NULL, model = NULL) {
precision <- precision(data$obs,data$pred)
recall <- sensitivity(data$obs,data$pred)
f1_val <- (17*precision*recall)/(16*precision+recall)
names(f1_val) <- c("F1")
f1_val
}
tgrid <- expand.grid(
.mtry = 1:5,
.splitrule = "gini",
.min.node.size = seq(1,500,75)
)
model_caret <- train(train$y~., data = train,
method = "ranger",
trControl = trainControl(method="cv",
number = 2,
verboseIter = T,
classProbs = T,
summaryFunction = f1),
tuneGrid = tgrid,
num.trees = 500,
importance = "impurity",
metric = "F1")
After running for 3/4 minutes we get following :
Aggregating results
Selecting tuning parameters
Fitting mtry = 5, splitrule = gini, min.node.size = 1 on full training set
but error:
Error in `[.data.frame`(data, , all.vars(Terms), drop = FALSE) :
undefined columns selected
Also when running model_caret we get,
Error: object 'model_caret' not found
Kindly help. Thanks in advance

question for dask output when using dask.array.map_overlap

I would like to use dask.array.map_overlap to deal with the scipy interpolation function. However, I keep meeting errors that I cannot understand and hoping someone can answer this to me.
Here is the error message I have received if I want to run .compute().
ValueError: could not broadcast input array from shape (1070,0) into shape (1045,0)
To resolve the issue, I started to use .to_delayed() to check each partition outputs, and this is what I found.
Following is my python code.
Step 1. Load netCDF file through Xarray, and then output to dask.array with chunk size (400,400)
df = xr.open_dataset('./Brazil Sentinal2 Tile/' + data_file +'.nc')
lon, lat = df['lon'].data, df['lat'].data
slon = da.from_array(df['lon'], chunks=(400,400))
slat = da.from_array(df['lat'], chunks=(400,400))
data = da.from_array(df.isel(band=0).__xarray_dataarray_variable__.data, chunks=(400,400))
Step 2. declare a function for da.map_overlap use
def sumsum2(lon,lat,data, hex_res=10):
hex_col = 'hex' + str(hex_res)
lon_max, lon_min = lon.max(), lon.min()
lat_max, lat_min = lat.max(), lat.min()
b = box(lon_min, lat_min, lon_max, lat_max, ccw=True)
b = transform(lambda x, y: (y, x), b)
b = mapping(b)
target_df = pd.DataFrame(h3.polyfill( b, hex_res), columns=[hex_col])
target_df['lat'] = target_df[hex_col].apply(lambda x: h3.h3_to_geo(x)[0])
target_df['lon'] = target_df[hex_col].apply(lambda x: h3.h3_to_geo(x)[1])
tlon, tlat = target_df[['lon','lat']].values.T
abc = lNDI(points=(lon.ravel(), lat.ravel()),
values= data.ravel())(tlon,tlat)
target_df['out'] = abc
print(np.stack([tlon, tlat, abc],axis=1).shape)
return np.stack([tlon, tlat, abc],axis=1)
Step 3. Apply the da.map_overlap
b = da.map_overlap(sumsum2, slon[:1200,:1200], slat[:1200,:1200], data[:1200,:1200], depth=10, trim=True, boundary=None, align_arrays=False, dtype='float64',
)
Step 4. Using to_delayed() to test output shape
print(b.to_delayed().flatten()[0].compute().shape, )
print(b.to_delayed().flatten()[1].compute().shape)
(1065, 3)
(1045, 0)
(1090, 3)
(1070, 0)
which is saying that the output from da.map_overlap is only outputting 1-D dimension ( which is (1045,0) and (1070,0) ), while in the da.map_overlap, the output I am preparing is 2-D dimension ( which is (1065,3) and (1090,3) ).
In addition, if I turn off the trim argument, which is
c = da.map_overlap(sumsum2,
slon[:1200,:1200],
slat[:1200,:1200],
data[:1200,:1200],
depth=10,
trim=False,
boundary=None,
align_arrays=False,
dtype='float64',
)
print(c.to_delayed().flatten()[0].compute().shape, )
print(c.to_delayed().flatten()[1].compute().shape)
The output becomes
(1065, 3)
(1065, 3)
(1090, 3)
(1090, 3)
This is saying that when trim=True, I cut out everything?
because...
#-- print out the values
b.to_delayed().flatten()[0].compute()[:10,:]
(1065, 3)
array([], shape=(1045, 0), dtype=float64)
while...
#-- print out the values
c.to_delayed().flatten()[0].compute()[:10,:]
array([[ -47.83683837, -18.98359832, 1395.01848583],
[ -47.8482856 , -18.99038681, 2663.68391094],
[ -47.82800624, -18.99207069, 1465.56517187],
[ -47.81897323, -18.97919009, 2769.91556363],
[ -47.82066663, -19.00712956, 1607.85927095],
[ -47.82696896, -18.97167714, 2110.7516765 ],
[ -47.81562653, -18.98302933, 2662.72112163],
[ -47.82176881, -18.98594465, 2201.83205114],
[ -47.84567 , -18.97512514, 1283.20631652],
[ -47.84343568, -18.97270783, 1282.92117225]])
Any thoughts for this?
Thank You.
I guess I got the answer. Please let me if I am wrong.
I am not allowing to use trim=True is because I change the shape of output array (after surfing the internet, I notice that the shape of output array should be the same with the shape of input array). Since I change the shape, the dask has no idea how to deal with it so it returns the empty array to me (weird).
Instead of using trim=False, since I didn't ask cutting-out the buffer zone, it is now okay to output the return values. (although I still don't know why the dask cannot concat the chunked array, but believe is also related to shape)
The solution is using delayed function on da.concatenate, which is
delayed(da.concatenate)([e.to_delayed().flatten()[idx] for idx in range(len(e.to_delayed().flatten()))])
In this case, we are not relying on the concat function in map_overlap but use our own concat to combine the outputs we want.

Tensorflow: Variable rnn/basic_lstm_cell/kernel already exists, disallowed

Here is a part of my tensorflow RNN network code written in jupyter. The whole code runs perfect for the first time, however, running it furthermore produces an error. The code:
x = tf.placeholder(tf.float32)
y = tf.placeholder(tf.float32)
def recurrent_nn_model(x):
x = tf.transpose(x, [1,0,2])
x = tf.reshape(x, [-1, chunk_size])
x = tf.split(x, n_chunks, 0)
lstm_layer = {'hidden_state': tf.zeros([n_batches, lstm_size]),
'current_state': tf.zeros([n_batches, lstm_size])}
layer = {'weights': tf.Variable(tf.random_normal([lstm_size, n_classes])),
'biases': tf.Variable(tf.random_normal([n_classes]))}
lstm = rnn_cell.BasicLSTMCell(lstm_size)
rnn_outputs, rnn_states = rnn.static_rnn(lstm, x, dtype=tf.float32)
output = tf.add(tf.matmul(rnn_outputs[-1], layer['weights']),
layer['biases'])
return output
The error is:
Variable rnn/basic_lstm_cell/kernel already exists, disallowed. Did
you mean to set reuse=True in VarScope? Originally defined at:
If recurrent_nn_model is the whole network, just add this line to reset previously defined graph:
tf.reset_default_graph()
If you're intentionally calling recurrent_nn_model several times and combine these RNNs into one graph, you should use different variable scopes for each one:
with tf.variable_scope('lstm1'):
recurrent_nn_model(x1)
with tf.variable_scope('lstm2'):
recurrent_nn_model(x2)

How to join DecisionTreeRegressor predict output to the original data

I am developing a model that uses DecisionTreeRegressor. I have built and fit the tree using training data, and predicted the results from more recent data to confirm the model's accuracy.
To build and fit the tree:
X = np.matrix ( pre_x )
y = np.matrix( pre_y )
regr_b = DecisionTreeRegressor(max_depth = 4 )
regr_b.fit(X, y)
To predict new data:
X = np.matrix ( pre_test_x )
trial_pred = regr_b.predict(X, check_input=True)
trial_pred is an array of the predicted values. I need to join it back to pre_test_x so I can see how well the prediction matches what actually happened.
I have tried merges:
all_pred = pre_pre_test_x.merge(predictions, left_index = True, right_index = True)
and
all_pred = pd.merge (pre_pre_test_x, predictions, how='left', left_index=True, right_index=True )
and either get no results or a second copy of the columns appended to the bottom of the DataFrame with NaN in all the existing columns.
Turns out it was simple. Leave the predict output as an array, then run:
w_pred = pre_pre_test_x.copy(deep=True)
w_pred['pred_val']=trial_pred

Resources