Is it possible to obtain predictions on the training data from resample results? - mlr3

After executing the code below, predictions on the test partition can be obtained with rr$predictions()[[1]]. But is it also possible to obtain the predictions on the training partition?
task = tsk("penguins")
learner = lrn("classif.rpart")
resampling = rsmp("holdout")
rr = resample(task, learner, resampling)
Thanks!

You need to set the predict_sets field of the learner to both "train" and "test", like this:
learner$predict_sets = c("train", "test")
Keep everything else the same and retrieve the training-set predictions with
rr$prediction("train")

Related

Get coefficients and features from a glmnet learner #mlr3

Thanks for providing the mlr3 package in R; I am trying it for the first time.
Simple code in mlr3:
learner = lrn("classif.cv_glmnet")
lrn_glmnet <- learner$train(task, row_ids = train_set)
My question is: how do I get the features and coefficients from the lrn_glmnet object?
This also applies to other learners such as log_reg.
In general, how can I get the feature importance and the scores from an mlr3 object after simple training or even resampling?
Thanks,
Haneme
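A sketch of one way to do this, assuming a trained classif.cv_glmnet learner (the sonar task below is just a placeholder for your own task and train/test split): the fitted cv.glmnet object is stored in the learner's model field, so the usual coef() accessor applies, and the glmnet learners also provide selected_features().
library(mlr3)
library(mlr3learners)

task = tsk("sonar")                       # placeholder binary classification task
learner = lrn("classif.cv_glmnet")
learner$train(task)

coef(learner$model, s = "lambda.min")     # coefficients of the underlying cv.glmnet fit
learner$selected_features()               # features with non-zero coefficients

# learners with the "importance" property (e.g. classif.rpart) expose
# importance scores directly:
# lrn("classif.rpart")$train(task)$importance()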

Quantile prediction for mlr3 graph learner

I have a stacked learner where the output layer is a regr.ranger with params list(rf.quantreg = TRUE, rf.keep.inbag = TRUE). Is it possible to predict quantiles with GraphLearners like this?
I know that for a pure lrn('regr.ranger'), once trained, I can simply reference the ranger model directly and use that for quantile prediction:
predict(my_learner$model, data = my_test_data, type = "quantiles", quantiles = c(0.025, 0.975))
But for the stacked learner, I have other learners mediating between the features and regr.ranger, so it seems to me that I have to go via mlr3.
My GraphLearner consists of some feature encoding, a regr.kknn learner, and a regr.glm learner, plus some extras; rf (a regr.ranger) sits at the output level.
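One possible workaround, sketched under assumptions (gl for the trained GraphLearner and the pipeop id "rf" are taken from the description; my_preprocessed_test_data is hypothetical): a trained GraphLearner keeps one state per PipeOp in its model field, keyed by pipeop id, so the raw ranger fit can be pulled out and used with predict() just like for a plain regr.ranger. The caveat is that predict() must then receive the data as it looks after the graph's preprocessing steps, not the raw features.
# assumed: gl is the trained GraphLearner and the final pipeop's id is "rf"
rf_fit = gl$model$rf$model   # the underlying ranger object

# quantile prediction as for a plain regr.ranger, but on *preprocessed* test data
predict(rf_fit, data = my_preprocessed_test_data,
        type = "quantiles", quantiles = c(0.025, 0.975))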

Neural Network Trains Fine and Test Predictions are Horrible Bordering on Ridiculous

I am having a lot of trouble with a neural network model using the R neuralnet() function. When I train a network on all of the data, the predictions are, as expected, very accurate. However, when I split the data into training and test sets, the test predictions are terrible. I cannot figure out what I am doing wrong, and I would appreciate any advice or help troubleshooting, as I don't think I'll be able to figure this out on my own. Thanks in advance.
I have included the R code, some plots, and an example of the data below; the full data set is 3600 observations.
Best Regards, Pat
UPDATE 05/12/18: Based on feedback that this looks like overfitting, I tried stopping the training earlier and found that the MSE of the test prediction never gets very low; it is lowest approaching 0 training epochs and rises from there (plot included and code appended).
###########
#ANN Models
###########
#Load libraries
library(plyr)
library(ggplot2)
library(gridExtra)
library(neuralnet)
#Retain only numerically coded data from data1 in (data2) for ANN fitting
data2 = data1[,c(3:7)]
#Calculate Min and Max for Scaling
max_data = apply(data2,2,max)
min_data = apply(data2,2,min)
#Scale data 0-1
data2_scaled = scale(data2,center=min_data,scale=max_data-min_data)
#Check data structure
data2_scaled
#Fit neural net model
model_nn1 = neuralnet(formula=time~instructions+nodes+machine_num+app_num,data=data2_scaled,hidden=c(8,8),stepmax=1000000,threshold=0.01)
#Calculate Min and Max Response for rescaling
max_time = max(data2$time)
min_time = min(data2$time)
#Rescale neural net response predictions
pred_nn1 = model_nn1$net.result[[1]][,1]*(max_time-min_time)+min_time
#Compare model predictions to actual values
a03 = cbind.data.frame(data1$time,pred_nn1,data1$machine,data1$app)
colnames(a03) = c("actual","prediction","machine","app")
attach(a03)
p01 = ggplot(a03,aes(x=actual,y=prediction))+
geom_point(aes(color=machine),size=1)+
scale_y_continuous("Predicted Execution Time [s]",breaks=seq(0,1000,100),limits=c(0,1000))+
scale_x_continuous("Actual Execution Time [s]",breaks=seq(0,1000,100),limits=c(0,1000))+
ggtitle("Neural Net Fit (ALL DATA):\nActual vs. Predicted Execution Time")+
geom_abline(intercept=0,slope=1)+
theme_light()
p02 = ggplot(a03,aes(x=actual,y=prediction))+
geom_point(aes(color=app),size=1)+
scale_y_continuous("Predicted Execution Time [s]",breaks=seq(0,1000,100),limits=c(0,1000))+
scale_x_continuous("Actual Execution Time [s]",breaks=seq(0,1000,100),limits=c(0,1000))+
ggtitle("Neural Net Fit (ALL DATA):\nActual vs. Predicted Execution Time")+
geom_abline(intercept=0,slope=1)+
theme_light()
grid.arrange(p01,p02,nrow=1)
#Visualize ANN
plot(model_nn1)
#Epochs taken to train "steps"
model_nn1$result.matrix[3,]
#########################
#Testing and Training ANN
#########################
#Split the data into a test and training set
index = sample(1:nrow(data2_scaled),round(0.80*nrow(data2_scaled)))
train_data = as.data.frame(data2_scaled[index,])
test_data = as.data.frame(data2_scaled[-index,])
model_nn2 = neuralnet(formula=time~instructions+nodes+machine_num+app_num,data=train_data,hidden=c(3,2),stepmax=1000000,threshold=0.01)
pred_nn2_scaled = compute(model_nn2,test_data[,c(1,2,4,5)])
pred_nn2 = pred_nn2_scaled$net.result*(max_time-min_time)+min_time
test_data_time = test_data$time*(max_time-min_time)+min_time
a04 = cbind.data.frame(test_data_time,pred_nn2,data1[-index,2],data1[-index,1])
colnames(a04) = c("actual","prediction","machine","app")
attach(a04)
p01 = ggplot(a04,aes(x=actual,y=prediction))+
geom_point(aes(color=machine),size=1)+
scale_y_continuous("Predicted Execution Time [s]",breaks=seq(0,1000,100),limits=c(0,1000))+
scale_x_continuous("Actual Execution Time [s]",breaks=seq(0,1000,100),limits=c(0,1000))+
ggtitle("Neural Net Fit (TEST DATA):\nActual vs. Predicted Execution Time")+
geom_abline(intercept=0,slope=1)+
theme_light()
p02 = ggplot(a04,aes(x=actual,y=prediction))+
geom_point(aes(color=app),size=1)+
scale_y_continuous("Predicted Execution Time [s]",breaks=seq(0,1000,100),limits=c(0,1000))+
scale_x_continuous("Actual Execution Time [s]",breaks=seq(0,1000,100),limits=c(0,1000))+
ggtitle("Neural Net Fit (TEST DATA):\nActual vs. Predicted Execution Time")+
geom_abline(intercept=0,slope=1)+
theme_light()
grid.arrange(p01,p02,nrow=1)
#EARLY STOPPING TEST
i = 1000
summary_data = data.frame(matrix(rep(0,4*i),ncol=4))
colnames(summary_data) = c("threshold","epochs","train_mse","test_mse")
for (j in 1:i){
a = runif(1,min=0.01,max=10)
#Train the model
model_nn2 = neuralnet(formula=time~instructions+nodes+machine_num+app_num,data=train_data,hidden=3,stepmax=1000000,threshold=a)
#Calculate Min and Max Response for rescaling
max_time = max(data2$time)
min_time = min(data2$time)
#Predict test data from trained nn
pred_nn2_scaled = compute(model_nn2,test_data[,c(1,2,4,5)])
#Rescale test prediction
pred_test_data_time = pred_nn2_scaled$net.result*(max_time-min_time)+min_time
#Rescale test actual
test_data_time = test_data$time*(max_time-min_time)+min_time
#Rescale train prediction
pred_train_data_time = model_nn2$net.result[[1]][,1]*(max_time-min_time)+min_time
#Rescale train actual
train_data_time = train_data$time*(max_time-min_time)+min_time
#Calculate mse
test_mse = mean((pred_test_data_time-test_data_time)^2)
train_mse = mean((pred_train_data_time-train_data_time)^2)
#Summarize
summary_data[j,1] = a
summary_data[j,2] = model_nn2$result.matrix[3,]
summary_data[j,3] = round(train_mse,3)
summary_data[j,4] = round(test_mse,3)
print(summary_data[j,])
}
plot(summary_data$epochs,summary_data$test_mse,pch=19,xlim=c(0,2000),ylim=c(0,300000),cex=0.5,xlab="Training Steps",ylab="MSE",main="Early Stopping Test: Comparing MSE : TEST=BLACK TRAIN=RED")
points(summary_data$epochs,summary_data$train_mse,pch=19,col=2,cex=0.5)
I would guess that it is overfitting. The network is learning to reproduce the data like a dictionary instead of learning the underlying function in the data. There are various things which can cause this and ways to address them.
Things which cause overfitting:
The network could be training for too long.
The network could have far more weights than training examples.
Ways to reduce overfitting:
Create a validation dataset and stop training the network as soon as the validation set's loss starts increasing. This is a necessity; a sketch of this approach follows at the end of this answer.
Reduce the network size (fewer weights).
Use a regularization technique such as weight decay or dropout.
Also, it may be that the problem is too difficult for a neural network to solve based on the data it is given. Reproducing training data does not prove that the network can solve the problem; it only proves that the network can remember things like a dictionary.
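A minimal sketch of the validation-set idea, reusing the variable names and scaled columns from the question as assumptions; since neuralnet() has no built-in early-stopping callback, the stopping point is varied through the threshold argument and the fit with the lowest validation MSE is kept:
library(neuralnet)

# carve a validation set out of the existing (scaled) training data
val_index = sample(1:nrow(train_data), round(0.2 * nrow(train_data)))
fit_data  = train_data[-val_index, ]
val_data  = train_data[val_index, ]

thresholds = c(5, 2, 1, 0.5, 0.1, 0.05, 0.01)   # larger threshold = earlier stop
val_mse    = numeric(length(thresholds))

for (k in seq_along(thresholds)) {
  nn = neuralnet(time ~ instructions + nodes + machine_num + app_num,
                 data = fit_data, hidden = 3, stepmax = 1e6,
                 threshold = thresholds[k])
  val_pred   = compute(nn, val_data[, c(1, 2, 4, 5)])$net.result
  val_mse[k] = mean((val_pred - val_data$time)^2)   # MSE on the scaled response
}

thresholds[which.min(val_mse)]   # stopping threshold to reuse for the final model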

The proper way of using IsolationForest to detect outliers of high-dim dataset

I use the following simple IsolationForest setup to detect the outliers of a given dataset X of 20K samples and 16 features. I run the following:
train_X, test_X, train_y, test_y = train_test_split(X, y, train_size=.8)
clf = IsolationForest()
clf.fit(X)  # Notice I am using the entire dataset X when fitting!!
print(clf.predict(X))
I get the result:
[ 1 1 1 -1 ... 1 1 1 -1 1]
My question is: is it logically correct to use the entire dataset X when fitting the IsolationForest, or only train_X?
Yes, it is logically correct to ultimately train on the entire dataset.
With that in mind, you could measure the test set's performance against the training set's performance. This could tell you whether the test set comes from a similar distribution as your training set.
If the test set scores as anomalous compared to the training set, then you can expect future data to be similar; in that case, I would want more data to get a more complete view of what is 'normal'.
If the test set scores similarly to the training set, I would be more comfortable with the final Isolation Forest trained on all data.
Perhaps you could use sklearn's TimeSeriesSplit CV in this fashion to get a sense of how much data is enough for your problem.
Since this is unlabeled data to the anomaly detector, the more data the better when defining 'normal'.

Tensorflow low train/test accuracy

I restored a pre-trained model in TensorFlow 1.2 to do some testing work. I assumed the model was well trained, since the loss decreased to a very low value (0.0001). However, with either the testing samples or the training samples, the accuracy op gives me a value that is almost 0. Is this because I'm using the wrong accuracy function, or because the model itself is the problem?
Here is the accuracy function, the test_image below is a batch with a single test sample, test_image_label is a single label:
correct_prediction = tf.equal(tf.argmax(GoogleNet(test_image), 1), tf.argmax(test_image_label, 0))
accuracy = tf.cast(correct_prediction, tf.float32)
with tf.Session() as sess:
    accuracy_vector = []
    for num in range(len(testnames)):
        accuracy_vector.append(sess.run(accuracy, feed_dict={keep_prob: 1.0}))
    print(accuracy_vector)
    mean_accuracy = sess.run(tf.divide(tf.add_n(accuracy_vector), len(testnames)))
    print("test accuracy %g" % mean_accuracy)
The model is defined as GoogleNet(data) above; it is a function that returns the logits of the input batch. The training was done like this:
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=train_label_batch, logits=GoogleNet(train_batch)))
train_step = tf.train.MomentumOptimizer(learning_rate, 0.9).minimize(cost, global_step=global_step)
The train_step is run in every iteration. I think it is worth noting that after restoring the model, I cannot run print(GoogleNet(test_image).eval(feed_dict={keep_prob: 1.0})) in the session, which I intended to use to take a look at the output of the model. It returns the error:
FailedPreconditionError (see above for traceback): Attempting to use uninitialized value Variable_213
[[Node: Variable_213/read = Identity[T=DT_FLOAT, _class=["loc:@Variable_213"], _device="/job:localhost/replica:0/task:0/cpu:0"](Variable_213)]]
