Follow Up Question About Whether Preprocessing Test Set Is Needed - mlr3

Please refer to the previous question here (https://stackoverflow.com/a/71389007/17537724)
With the pipeline below, will imputation, scaling, and dummy encoding be applied automatically to the test set when predicting?
rsf = as_learner(po("imputemedian") %>>%
  po("imputemode") %>>%
  po("scale") %>>%
  po("encode") %>>%
  lrn("surv.rfsrc"))
Another question: suppose I create a learner with specific hyperparameters, for example based on a published model, and I want to use it for prediction only, without training. What would happen if I use two different data sets? Do I need to deselect non-influential variables from the data set? I assume so, since all variables would be used when the model has not been trained.
rsf = as_learner(po("imputemedian") %>>%
  po("imputemode") %>>%
  po("scale") %>>%
  po("encode") %>>%
  lrn("surv.rfsrc",
    ntree = 1200,
    mtry = 2,
    nodesize = 10,
    nsplit = 1))

Related

How to append a tensor with model output?

I have a simple Neural network made in Keras.
I do data prep in Pandas and then convert my test / train splits to tensors;
train_features = tf.convert_to_tensor(train_features)
test_features = tf.convert_to_tensor(test_features)
train_labels = tf.convert_to_tensor(train_labels)
test_labels = tf.convert_to_tensor(test_labels)
Then I use these to fit and validate the model.
If the output is like below
model.fit(train_features, train_labels, epochs=epochs, batch_size=batch_size)
z = model.predict(test_features)
How would I write back each prediction to the input tensor (in this case test_features)?

PCA within cross validation; however, only with a subset of variables

This question is very similar to "preprocess within cross-validation in caret"; however, in the project I'm working on I would like to apply PCA to only three of my 19 predictors. Below is the example from that question, using the PimaIndiansDiabetes data for ease (this is not my project data, but the concept should be the same). I would like preProcess to act only on a subset of variables, i.e. PimaIndiansDiabetes[, c(4,5,6)]. Is there a way to do this?
library(caret)
library(mlbench)
data(PimaIndiansDiabetes)
control <- trainControl(method="cv",
number=5)
p <- preProcess(PimaIndiansDiabetes[, c(4,5,6)], #only do these columns!
method = c("center", "scale", "pca"))
p
grid=expand.grid(mtry=c(1,2,3))
model <- train(diabetes~., data=PimaIndiansDiabetes, method="rf",
preProcess= p,
trControl=control,
tuneGrid=grid)
But I get this error:
Error: pre-processing methods are limited to: BoxCox, YeoJohnson, expoTrans, invHyperbolicSine, center, scale, range, knnImpute, bagImpute, medianImpute, pca, ica, spatialSign, ignore, keep, remove, zv, nzv, conditionalX, corr
The reason I'm trying to do this is so that I can reduce three variables to a single component (PC1) and use it for prediction. In my project all three variables are correlated above 90%, but I would like to include them because other studies have used them as well. Thanks. I am trying to avoid data leakage!
As far as I know this is not possible with caret.
This might be possible using recipes. However I do not use recipes but I do use mlr3 so I will show how to do it with this package:
library(mlr3)
library(mlr3pipelines)
library(mlr3learners)
library(paradox)
library(mlr3tuning)
library(mlbench)
create a task from the data:
data("PimaIndiansDiabetes")
pima_tsk <- TaskClassif$new(id = "Pima",
backend = PimaIndiansDiabetes,
target = "diabetes")
define a pre process selector named "slct1":
pos1 <- po("select", id = "slct1")
and define the selector function within it:
pos1$param_set$values$selector <- selector_name(colnames(PimaIndiansDiabetes[, 4:6]))
now define what should happen to the selected features: scaling -> pca with 1st PC selected (param_vals = list(rank. = 1))
pos1 %>>%
po("scale", id = "scale1") %>>%
po("pca", id = "pca1", param_vals = list(rank. = 1)) -> pr1
now define an invert selector:
pos2 <- po("select", id = "slct2")
pos2$param_set$values$selector <- selector_invert(pos1$param_set$values$selector)
define the learner:
rf_lrn <- po("learner", lrn("classif.ranger")) #ranger is a faster version of rf
combine them:
gunion(list(pr1, pos2)) %>>%
po("featureunion") %>>%
rf_lrn -> graph
check if it looks ok:
graph$plot(html = TRUE)
convert graph to a learner:
glrn <- GraphLearner$new(graph)
define parameters you want tuned:
ps <- ParamSet$new(list(
ParamInt$new("classif.ranger.mtry", lower = 1, upper = 6),
ParamInt$new("classif.ranger.num.trees", lower = 100, upper = 1000)))
define resampling:
cv10 <- rsmp("cv", folds = 10)
define tuning:
instance <- TuningInstance$new(
task = pima_tsk,
learner = glrn,
resampling = cv10,
measures = msr("classif.ce"),
param_set = ps,
terminator = term("evals", n_evals = 20)
)
set.seed(1)
tuner <- TunerRandomSearch$new()
tuner$tune(instance)
instance$result
For additional details on how to tune the number of PC components to keep check this answer: R caret: How do I apply separate pca to different dataframes before training?
If you find this interesting check out the mlr3book
Also
cor(PimaIndiansDiabetes[, 4:6])
          triceps   insulin      mass
triceps 1.0000000 0.4367826 0.3925732
insulin 0.4367826 1.0000000 0.1978591
mass    0.3925732 0.1978591 1.0000000
does not produce what you mention in the question.
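The leakage concern behind fitting the preprocessing inside cross-validation can be sketched in plain Python, independent of caret or mlr3. This is only an illustration under stated assumptions: the helper names (fit_scaler, apply_scaler, k_fold_indices) and the toy values are hypothetical, and scaling stands in for the scale-then-PCA step. The point is that the statistics are learned from the training part of each fold and then reused unchanged on the held-out part.

```python
import statistics


def fit_scaler(column):
    """Learn centering and scaling parameters from the given values only."""
    return statistics.fmean(column), statistics.stdev(column)


def apply_scaler(column, params):
    """Apply previously learned parameters to any column (train or test)."""
    mean, sd = params
    return [(v - mean) / sd for v in column]


def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


# Hypothetical toy values standing in for one predictor.
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0]

for train_idx, test_idx in k_fold_indices(len(values), k=5):
    train = [values[i] for i in train_idx]
    test = [values[i] for i in test_idx]
    params = fit_scaler(train)                 # learned from the training fold only
    train_scaled = apply_scaler(train, params)
    test_scaled = apply_scaler(test, params)   # held-out fold reuses the same parameters
```

Computing the mean and standard deviation on all rows before splitting would let the held-out values influence the preprocessing, which is the leakage the question wants to avoid.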

How to split data into train and test sets using torchvision.datasets.ImageFolder?

In my custom dataset, each kind of image is in its own folder, which torchvision.datasets.ImageFolder can handle, but how do I split the dataset into train and test sets?
You can use torch.utils.data.Subset to split your ImageFolder dataset into train and test based on indices of the examples.
For example:
orig_set = torchvision.datasets.ImageFolder(...) # your dataset
n = len(orig_set) # total number of examples
n_test = int(0.1 * n) # take ~10% for test
test_set = torch.utils.data.Subset(orig_set, range(n_test)) # take first 10%
train_set = torch.utils.data.Subset(orig_set, range(n_test, n)) # take the rest
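One caveat with contiguous ranges: as far as I know, ImageFolder lists its samples grouped by class folder, so the first 10% may contain only one or two classes. A minimal sketch of building shuffled index lists instead, using only the standard library; split_indices and its parameters are hypothetical names chosen for illustration.

```python
import random


def split_indices(n, test_fraction=0.1, seed=0):
    """Return (train_idx, test_idx) drawn from a shuffled range(n)."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # local RNG, so the split is reproducible
    n_test = int(test_fraction * n)
    return indices[n_test:], indices[:n_test]


# n would normally be len(orig_set); 1000 is a stand-in value.
train_idx, test_idx = split_indices(n=1000, test_fraction=0.1, seed=42)
```

The resulting lists can be passed to torch.utils.data.Subset in place of the ranges above.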

Using a stateful Keras model in pure TensorFlow

I have a stateful RNN model with several GRU layers that was created in Keras.
I have to run this model now from Java, so I dumped the model as protobuf, and I'm loading it from Java TensorFlow.
This model must be stateful because features will be fed one timestep at-a-time.
As far as I understand, in order to achieve statefulness in a TensorFlow model, I must somehow feed in the last state every time I execute the session runner, and also that the run would return the state after the execution.
Is there a way to output the state in the Keras model?
Is there a simpler way altogether to get a stateful Keras model to work as such using TensorFlow?
Many thanks
An alternative solution is to use the model.state_updates property of the keras model, and add it to the session.run call.
Here is a full example that illustrates this solution with two LSTMs:
import tensorflow as tf  # TensorFlow 1.x API (placeholders and sessions)


class SimpleLstmModel(tf.keras.Model):
    """ Simple lstm model with two lstm """

    def __init__(self, units=10, stateful=True):
        super(SimpleLstmModel, self).__init__()
        self.lstm_0 = tf.keras.layers.LSTM(units=units, stateful=stateful, return_sequences=True)
        self.lstm_1 = tf.keras.layers.LSTM(units=units, stateful=stateful, return_sequences=True)

    def call(self, inputs):
        """
        :param inputs: [batch_size, seq_len, 1]
        :return: output tensor
        """
        x = self.lstm_0(inputs)
        x = self.lstm_1(x)
        return x


def main():
    model = SimpleLstmModel(units=1, stateful=True)
    x = tf.placeholder(shape=[1, 1, 1], dtype=tf.float32)
    output = model(x)
    sess = tf.Session()
    # global_variables_initializer replaces the deprecated initialize_all_variables
    sess.run(tf.global_variables_initializer())
    # Running model.state_updates alongside the output carries the LSTM state forward
    res_at_step_1, _ = sess.run([output, model.state_updates], feed_dict={x: [[[0.1]]]})
    print(res_at_step_1)
    res_at_step_2, _ = sess.run([output, model.state_updates], feed_dict={x: [[[0.1]]]})
    print(res_at_step_2)


if __name__ == "__main__":
    main()
Which produces the following output:
[[[0.00168626]]]
[[[0.00434444]]]
and shows that the lstm state is preserved between batches.
If we set stateful to False, the output becomes:
[[[0.00033928]]]
[[[0.00033928]]]
Showing that the state is not reused.
ok, so I managed to solve this problem!
What worked for me was creating tf.identity tensors for not only the outputs, as is standard, but also for the state tensors.
In the Keras models, the state tensors can be found by doing:
model.updates
Which gives something like this:
[(<tf.Variable 'gru_1_1/Variable:0' shape=(1, 70) dtype=float32_ref>,
<tf.Tensor 'gru_1_1/while/Exit_2:0' shape=(1, 70) dtype=float32>),
(<tf.Variable 'gru_2_1/Variable:0' shape=(1, 70) dtype=float32_ref>,
<tf.Tensor 'gru_2_1/while/Exit_2:0' shape=(1, 70) dtype=float32>),
(<tf.Variable 'gru_3_1/Variable:0' shape=(1, 4) dtype=float32_ref>,
<tf.Tensor 'gru_3_1/while/Exit_2:0' shape=(1, 4) dtype=float32>)]
The 'Variable' is used for inputting the states, and the 'Exit' for outputs of the new states.
So I created tf.identity out of the 'Exit' tensors. I gave them meaningful names, e.g.:
tf.identity(state_variables[j], name='state'+str(j))
Where state_variables contained only the 'Exit' tensors
Then I used the input variables (e.g. gru_1_1/Variable:0) to feed the model state from TensorFlow. The identity tensors I created out of the 'Exit' tensors were used to extract the new states after feeding the model at each timestep.
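The feed-state-in, read-state-out loop described here can be illustrated with a toy recurrent cell in plain Python. This is a sketch of the pattern only: step is a hypothetical stand-in with made-up arithmetic, not a GRU, and nothing in it calls TensorFlow.

```python
def step(x, state, decay=0.5):
    """Toy recurrent cell: returns (output, new_state). Not a real GRU."""
    new_state = decay * state + x
    return new_state, new_state


def run_stateful(inputs, initial_state=0.0):
    """Feed one timestep at a time, passing the returned state back in."""
    state = initial_state
    outputs = []
    for x in inputs:
        out, state = step(x, state)  # the new state becomes the next input state
        outputs.append(out)
    return outputs


def run_stateless(inputs):
    """Reset the state on every call, so earlier timesteps have no effect."""
    return [step(x, 0.0)[0] for x in inputs]


stateful_outputs = run_stateful([0.1, 0.1])    # second output depends on the first step
stateless_outputs = run_stateless([0.1, 0.1])  # both outputs are identical
```

In the exported graph, the role of the state argument is played by the 'Variable' tensors and the role of the returned state by the identity tensors built from the 'Exit' tensors.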

Using partialPlot after fitting a Random Forest model in caret

After I fit a randomForest using the train() function, I'm having problems invoking partialPlot() and plotmo(). Here's some reproducible code:
library(AER)
library(caret)
data(Mortgage)
fitControl <- trainControl(method = "repeatedcv"
,number = 5
,repeats = 10
,allowParallel = TRUE)
library(doMC)
registerDoMC(cores=10)
Final.rfModel <- train(form=networth ~ ., data=Mortgage, method = "rf", metric='RMSE', trControl = fitControl, tuneLength=10, importance = TRUE)
#### partial plots fail
partialPlot(Final.rfModel$finalModel, Mortgage, "liquid")
library(plotmo)
plotmo(Final.rfModel$finalModel)
There is some inconsistency between how some functions (including randomForest and train) handle dummy variables. Most functions in R that use the formula method will convert factor predictors to dummy variables because their models require numerical representations of the data. The exceptions to this are tree- and rule-based models (that can split on categorical predictors), naive Bayes, and a few others.
So randomForest will not create dummy variables when you use randomForest(y ~ ., data = dat) but train (and most others) will using a call like train(y ~ ., data = dat).
The error occurs because rate, married and a few other predictors are factors. The dummy variables created by train don't have the same names so partialPlot can't find them.
Using the non-formula method with train will pass the factor predictors to randomForest and everything will work.
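The naming mismatch can be illustrated with a small sketch in plain Python. It assumes a simplified full one-hot encoding (R's formula interface would typically drop a reference level), and dummy_encode and the toy rows are hypothetical. It shows only that the original factor name disappears from the column names once dummy variables are created.

```python
def dummy_encode(rows, factor_columns):
    """One-hot encode the named columns; other columns pass through unchanged."""
    levels = {c: sorted({r[c] for r in rows}) for c in factor_columns}
    encoded = []
    for r in rows:
        new_row = {}
        for name, value in r.items():
            if name in levels:
                # One indicator column per level, named <column><level>
                for level in levels[name]:
                    new_row[f"{name}{level}"] = 1 if value == level else 0
            else:
                new_row[name] = value
        encoded.append(new_row)
    return encoded


# Hypothetical toy rows with one factor and one numeric predictor.
rows = [{"married": "yes", "liquid": 1.2}, {"married": "no", "liquid": 0.4}]
encoded = dummy_encode(rows, ["married"])
columns = list(encoded[0])

# The fitted model only knows the encoded names, not the original factor name.
has_original_name = "married" in columns
```

A function that looks up predictors by the names in the original data frame will therefore fail to match what the model was trained on.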
TL;DR
Use the non-formula method with train in this case:
Final.rfModel <- train(x = Mortgage[, names(Mortgage) != "networth"],
                       y = Mortgage$networth,
                       method = "rf",
                       metric = 'RMSE',
                       trControl = fitControl,
                       tuneLength = 10,
                       importance = TRUE)
Max
